Researchers from NIST, The University of Texas at Dallas, the University of Maryland, and the University of New South Wales displayed “highly challenging” pairs of face images to individuals with and without training in matching images, and to “deep convolutional neural networks” (DCNNs) trained to classify pairs of images as coming from the same source or from different sources.
The Experiment
Twenty pairs of pictures (12 same-source and 8 different-source pairs) were presented to the following groups:
- 57 forensic facial examiners (“professionals trained to identify faces in images and videos [for use in court] using a set of tools and procedures that vary across forensic laboratories”);
- 30 forensic facial reviewers (“trained to perform faster and less rigorous identifications [for] generating leads in criminal cases”);
- 13 super-recognizers (“untrained people with strong skills in face recognition”);
- fingerprint examiners (a comparison group of forensic professionals trained to compare fingerprints rather than faces, whose results appear below);
- 31 undergraduate students; and
- 4 DCNNs (“deep convolutional neural networks” developed between 2015 and 2017).
Comparisons of the Groups
To compare the performance of the groups, the researchers relied on a statistic known as AUC (or, more precisely, AUROC, for “Area Under the Receiver Operating Characteristic” curve). AUROC combines two more familiar statistics—the true-positive (TP) proportion and the false-positive (FP) proportion—into one number. In doing so, it pays no heed to the fact that a false positive may be more costly than a false negative. A simple way to think about the number is this: The AUROC of a classifier is equal to the probability that the classifier will rank a randomly chosen pair of images higher when they originate from the same source than when the pair come from two different sources. That is, AUROC = P(score given to a randomly chosen same-source pair > score given to a randomly chosen different-source pair).
Because making up scores at random would be expected to be correct in this sense about half the time, an AUROC of 0.5 means that, overall, the classifier’s scores are useless for distinguishing between same-source and different-source pairs. AUROCs greater than 0.5 indicate better overall classifications, but the value for the area generally does not translate into the more familiar (and more easily comprehended) measures of accuracy such as the sensitivity (the true-positive probability) and specificity (the true-negative probability) of a classification test. See Box 1. Basically, the larger the AUROC, the better the scores are, in some overall sense, at discriminating between same-source and different-source pairs.
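To make that probabilistic reading concrete, here is a minimal Python sketch (mine, not the authors’ code) that computes an empirical AUROC directly from the definition above, using made-up −3 to +3 ratings for the 12 same-source and 8 different-source pairs:

```python
import itertools

def empirical_auc(same_scores, diff_scores):
    """Empirical AUROC: the fraction of (same-source, different-source) score
    pairs in which the same-source pair receives the higher score
    (ties count as one half)."""
    wins = 0.0
    for s, d in itertools.product(same_scores, diff_scores):
        if s > d:
            wins += 1.0
        elif s == d:
            wins += 0.5
    return wins / (len(same_scores) * len(diff_scores))

# Hypothetical -3..+3 ratings for 12 same-source and 8 different-source pairs.
same_scores = [3, 2, 3, 1, 2, 3, 0, 2, 1, 3, 2, -1]
diff_scores = [-3, -2, -1, 0, -3, 1, -2, -3]

print(empirical_auc(same_scores, diff_scores))  # about 0.95 for these made-up ratings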
Now that we have some idea of what the AUROC signifies (corrections are welcome—I do not purport to be an expert on signal detection theory), let’s see how the different groups of classifiers did. The median AUROC of each group was

A2017b: ░░░░░░░░░░ 0.96
facial examiners: ░░░░░░░░░ 0.93
facial reviewers: ░░░░░░░░░ 0.87
A2017a: ░░░░░░░░░ 0.85
super-recognizers: ░░░░░░░░ 0.83
A2016: ░░░░░░░░ 0.76
fingerprint examiners: ░░░░░░░░ 0.76
A2015: ░░░░░░░ 0.68
students: ░░░░░░░ 0.68

Again, these are medians. Roughly half the classifiers in each group had higher AUROCs, and half had lower ones. (The automated systems A2015, A2016, A2017a, and A2017b had only one ROC, and hence only one AUROC.) “Individual accuracy varied widely in all [human] groups. All face specialist groups (facial examiners, reviewers, and super-recognizers) had at least one participant with an AUC below the median of the students. At the top of the distribution, all but the student group had at least one participant with no errors.”
Using the distribution of student AUROCs (fitted to a normal distribution), the authors reported the fraction of participants in each group who scored above the student 95th percentile as follows:

facial examiners: ░░░░░░░░░░░ 53%
super-recognizers: ░░░░░░░░░ 46%
facial reviewers: ░░░░░░░ 36%
fingerprint examiners: ░░░ 17%

The best computerized system, A2017b, had a higher AUROC than 73% of the face specialists. To put it another way, “35% of examiners, 13% of reviewers, and 23% of superrecognizers were more accurate than A2017b,” which “was equivalent to a student at the 98th percentile.”
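As a rough sketch of that cutoff calculation (with entirely hypothetical per-participant AUROCs, not the study’s data), one could fit a normal distribution to the student AUROCs and count how many members of another group clear its 95th percentile:

```python
from statistics import NormalDist, mean, stdev

def fraction_above_student_95th(student_aucs, group_aucs):
    """Fit a normal distribution to the student AUROCs, take its 95th percentile,
    and report the fraction of the other group scoring above that cutoff."""
    cutoff = NormalDist(mu=mean(student_aucs), sigma=stdev(student_aucs)).inv_cdf(0.95)
    return sum(a > cutoff for a in group_aucs) / len(group_aucs)

# Hypothetical per-participant AUROCs (not the study's numbers).
students = [0.55, 0.60, 0.68, 0.70, 0.72, 0.66, 0.75, 0.64]
examiners = [0.99, 0.95, 0.88, 0.72, 1.00, 0.93]
print(fraction_above_student_95th(students, examiners))  # about 0.83 for these made-up AUROCs
```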
But none of the preceding reveals how often the classifications based on the scores would be right or wrong. Buried in an appendix to the article (and reproduced below in Box 2) are estimates of “the error rates associated with judgments of +3 and −3 [obtained by computing] the fraction of high-confidence same-person (+3) ratings made to different identity face pairs” and estimates of “the probability of same identity pairs being assigned a −3.” The table indicates that facial examiners who were very confident usually were correct: they expressed maximum confidence that a different-source pair showed the same person (a false positive) less than 1% of the time, and maximum confidence that a same-source pair showed different people (a false negative) less than 2% of the time. Students made these errors a little more than 7% and 14% of the time, respectively.
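The appendix’s recipe for these high-confidence error rates is simple enough to restate in code. The sketch below uses invented ratings, not the study’s data; it just counts +3 ratings given to different-source pairs and −3 ratings given to same-source pairs:

```python
def high_confidence_error_rates(ratings):
    """ratings: list of (rating, same_source) tuples, where rating is an integer
    from -3 (sure different) to +3 (sure same) and same_source is True/False.
    Returns (false_positive_rate, false_negative_rate) for the +3/-3 judgments."""
    diff_pairs = [r for r, same in ratings if not same]
    same_pairs = [r for r, same in ratings if same]
    fp = sum(r == 3 for r in diff_pairs) / len(diff_pairs)   # +3 on different faces
    fn = sum(r == -3 for r in same_pairs) / len(same_pairs)  # -3 on same faces
    return fp, fn

# Hypothetical ratings for 20 pairs (12 same-source, 8 different-source).
ratings = [(3, True)] * 10 + [(-3, True), (1, True)] + [(-3, False)] * 7 + [(3, False)]
fp, fn = high_confidence_error_rates(ratings)
print(fp, fn)  # 0.125 and about 0.083 for these made-up ratings
```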
Fusion
The article promises to “show the benefits of a collaborative effort that combines the judgments of humans and machines.” It describes the method for ascertaining whether “a collaborative effort” improves performance as follows:
We examined the effectiveness of combining examiners, reviewers, and superrecognizers with algorithms. Human judgments were fused with each of the four algorithms as follows. For each face image pair, an algorithm returned a similarity score that is an estimate of how likely it is that the images show the same person. Because the similarity score scales differ across algorithms, we rescaled the scores to the range of human ratings (SI Appendix, SI Text). For each face pair, the human rating and scaled algorithm score were averaged, and the AUC was computed for each participant–algorithm fusion.

Unless I am missing something, there was no collaboration between human and machine. Each did their own thing. A number midway between the separate similarity scores on each pair produced a larger area under the ROC than either set of separate scores. To the extent that “Fusing Humans and Machines” conjures images of cyborgs, it seems a bit much. The more modest point is that a very simple combination of scores of a human and a machine classifier works better (with respect to AUROC as a measure of success) than either one alone.
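For concreteness, here is a minimal sketch of that kind of score-level fusion, assuming a −3 to +3 rating scale and simple min-max rescaling (the paper’s exact rescaling is described in its SI Appendix; the numbers below are invented):

```python
def rescale_to_rating_range(scores, lo=-3.0, hi=3.0):
    """Linearly rescale an algorithm's similarity scores onto the human rating
    range (assumed here to be -3..+3), since raw score scales differ across
    algorithms.  Min-max scaling is an assumption, not necessarily the paper's
    exact procedure."""
    s_min, s_max = min(scores), max(scores)
    return [lo + (hi - lo) * (s - s_min) / (s_max - s_min) for s in scores]

def fuse(human_ratings, algorithm_scores):
    """Average each human rating with the rescaled algorithm score for the same
    image pair; the fused scores can then be scored by AUROC like any others."""
    scaled = rescale_to_rating_range(algorithm_scores)
    return [(h + a) / 2.0 for h, a in zip(human_ratings, scaled)]

# Hypothetical ratings and raw similarity scores for five image pairs.
human = [3, -2, 1, -3, 2]
algorithm = [0.91, 0.12, 0.55, 0.05, 0.88]
print(fuse(human, algorithm))
```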
BOX 1. THE ROC CURVE AND ITS AREA
Suppose that we were to take a score of +1 or more as sufficient to classify a pair of images as originating from the same source. Some of these classifications would be incorrect (contributing to the false-positive (FP) proportion for this decision threshold), and some would be correct (contributing to the true-positive (TP) proportion). Of course, the threshold for the classification could be set at other scores. The ROC curve is simply a plot of the points (FPP[score], TPP[score]) for the person or machine scoring the pairs of images for the many possible decision thresholds.
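To make the construction concrete, here is a short Python sketch (mine, not the researchers’) that sweeps every possible threshold over some made-up scores, records the (FPP, TPP) point at each, and then computes the area under the resulting curve by the trapezoid rule:

```python
def roc_points(same_scores, diff_scores):
    """Sweep every possible decision threshold and return the ROC as a list of
    (FPP, TPP) points: the proportions of different-source and same-source pairs
    scoring at or above each threshold."""
    thresholds = sorted(set(same_scores) | set(diff_scores), reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: nothing declared positive
    for t in thresholds:
        tpp = sum(s >= t for s in same_scores) / len(same_scores)
        fpp = sum(d >= t for d in diff_scores) / len(diff_scores)
        points.append((fpp, tpp))
    return points  # ends at (1.0, 1.0): threshold at the bottom of the scale

def auc_from_points(points):
    """Trapezoidal area under the ROC curve traced by the points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Made-up scores, just to show the mechanics.
pts = roc_points([3, 2, 1, 0], [-1, 0, 2])
print(pts, auc_from_points(pts))  # area is about 0.75 for these made-up scores
```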
For example, if the threshold score for a positive classification were set higher than all the reported scores, there would be no declared positives. Both the false-positive and the true-positive proportions would be zero. At the other extreme, if the threshold score were placed at the bottom of the scale, all the classifications would be positive. Hence, every same-source pair would be classified positively, as would every different-source pair. Both the TPP and the FPP would be 1. A so-called random classifier, in which the scores have no correlation to the actual source of images, would be expected to produce a straight line connecting these points (0,0) and (1,1). A more useful classifier would have a curve with mostly higher points, as shown in the sketch below.
[Sketch: an ROC plot of TPP (sensitivity) on the vertical axis against FPP (1 – specificity) on the horizontal axis, each running from 0 to 1. Circles (o) trace the diagonal of a random (worthless) classifier; asterisks (*) trace a better classifier (AUC > 0.5) lying above the diagonal, with small tick marks at FPP = 0.2 and TPP = 0.5 picking out one of its operating points.]

An AUROC of, say, 0.75, does not mean that 75% of the classifications (using a particular score as the threshold for declaring a positive association) are correct. Neither does it mean that 75% is the sensitivity or specificity when using a given score as a decision threshold. Nor does it mean that 25% is the false-positive or the false-negative proportion. Instead, how many classifications are correct at a given score threshold depends on: (1) the sensitivity at that score threshold, (2) the specificity at that score threshold, and (3) the proportion of same-source and different-source pairs in the sample or population of pairs.
Look at the better classifier in the graph (the one whose operating characteristics are indicated by the asterisks). Consider the score implicit in the asterisk above the little tick-mark on the horizontal axis and across from the mark on the vertical axis. The FPP there is 0.2, so the specificity is 0.8. The sensitivity is the height of the better ROC curve at that implicit score threshold. The height of that asterisk is 0.5. The better classifier with that threshold makes correct associations only half the time when confronted with same-source pairs and 80% of the time when presented with different-source pairs. When shown 20 pairs, 12 of which are from the same face, as in the experiment discussed here, the better classifier is expected to make 50% × 12 = 6 correct positive classifications and 80% × 8 = 6.4 correct negative classifications. The overall expected percentage of correct classifications is therefore 12.4/20 = 62% rather than 75%.
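That arithmetic can be packaged into a one-line calculation; the sensitivity, specificity, and pair counts below are the ones from the example just given:

```python
def expected_accuracy(sensitivity, specificity, n_same, n_diff):
    """Expected fraction of correct classifications at a fixed threshold,
    given the mix of same-source and different-source pairs shown."""
    correct = sensitivity * n_same + specificity * n_diff
    return correct / (n_same + n_diff)

# The worked example from the text: sensitivity 0.5, specificity 0.8, 12 same / 8 different.
print(expected_accuracy(0.5, 0.8, 12, 8))  # 0.62, not the 0.75 suggested by the AUROC
```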
The moral of the arithmetic: The area under the ROC is not so readily related to the accuracy of the classifier for particular similarity scores. (It is more helpful in describing how well the classifier generally ranks a same-source pair relative to a different-source pair.) 2/
BOX 2. "[T]he estimate q̂ for the error rate and the upper and lower limits of the 95% confidence interval." (From Table S2)
| Group | Estimate | 95% CI (as a proportion) |
| --- | --- | --- |
| Type of Error: False Positive (+3 on different faces) | | |
| Facial examiners | 0.9% | 0.002 to 0.022 |
| Facial reviewers | 1.2% | 0.003 to 0.036 |
| Super-recognizers | 1.0% | 0.0002 to 0.052 |
| Fingerprint examiners | 3.8% | 0.022 to 0.061 |
| Students | 7.3% | 0.044 to 0.112 |
| Type of Error: False Negative (−3 on same faces) | | |
| Facial examiners | 1.8% | 0.009 to 0.030 |
| Facial reviewers | 1.4% | 0.005 to 0.032 |
| Super-recognizers | 5.1% | 0.022 to 0.099 |
| Fingerprint examiners | 3.3% | 0.021 to 0.050 |
| Students | 14.5% | 0.111 to 0.185 |
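The quoted material does not spell out how these intervals were constructed. As a generic illustration only (not necessarily the authors’ method), here is a Wilson score interval for a binomial proportion, computed from made-up counts with nothing more than the standard library:

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(successes, n, confidence=0.95):
    """Wilson score confidence interval for a binomial proportion -- a generic
    illustration of interval estimates like those in Box 2, not necessarily
    the procedure the authors used."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# e.g., 3 high-confidence false positives out of 320 different-face ratings (made-up counts)
print(wilson_interval(3, 320))  # roughly (0.003, 0.027)
```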
UPDATES
June 9, 2018: Corrections and additions made in response to comments from Hari Iyer.

NOTES
- P.J. Phillips, A.N. Yates, Y. Hu, C.A. Hahn, E. Noyes, K. Jackson, J.G. Cavazos, G. Jeckeln, R. Ranjan, S. Sankaranarayanan, J.-C. Chen, C.D. Castillo, R. Chellappa, D. White and A.J. O’Toole. Face Recognition Accuracy of Forensic Examiners, Superrecognizers, and Algorithms. Proceedings of the National Academy of Sciences, Published online May 28, 2018. DOI: 10.1073/pnas.1721355115
- As Hari Iyer put it in response to the explanation in Box 1, "given a randomly chosen observation x1 belonging to class 1, and a randomly chosen observation x0 belonging to class 0, the (empirical) AUC is the estimated probability that the evaluated classification algorithm will assign a higher score to x1 than to x0." For a proof, see Alexej Gossman, Probabilistic Interpretation of AUC, Jan. 25, 2018, http://www.alexejgossmann.com/auc/. A geometric proof can be found in Matthew Drury, The Probabilistic Interpretation of AUC, in Scatterplot Smoothers, Jun 21, 2017, http://madrury.github.io/jekyll/update/statistics/2017/06/21/auc-proof.html.