Researchers from NIST, The University of Texas at Dallas, the University of Maryland, and the University of New South Wales displayed “highly challenging” pairs of face images to individuals with and without training in matching images, and to “deep convolutional neural networks” (DCNNs) that trained themselves to classify images as being from the same source or from different sources.
The Experiment
Twenty pairs of pictures (12 same-source and 8 different-source pairs) were presented to the following groups:
- 57 forensic facial examiners (“professionals trained to identify faces in images and videos [for use in court] using a set of tools and procedures that vary across forensic laboratories”);
- 30 forensic facial reviewers (“trained to perform faster and less rigorous identifications [for] generating leads in criminal cases”);
- 13 super-recognizers (“untrained people with strong skills in face recognition”);
- 53 fingerprint examiners (forensic examiners trained in comparing fingerprints rather than faces, included as a comparison group of forensic specialists);
- 31 undergraduate students; and
- 4 DCNNs (“deep convolutional neural networks” developed between 2015 and 2017).
Comparisons of the Groups
To compare the performance of the groups, the researchers relied on a statistic known as AUC (or, more precisely, AUROC, for “Area Under the Receiver Operating Characteristic” curve). AUROC combines two more familiar statistics—the true-positive (TP) proportion and the false-positive (FP) proportion—into one number. In doing so, it pays no heed to the fact that a false positive may be more costly than a false negative. A simple way to think about the number is this: the AUROC of a classifier is equal to the probability that the classifier will give a randomly chosen same-source pair of images a higher score than a randomly chosen different-source pair. That is, AUROC = P(score of a random same-source pair > score of a random different-source pair).
Because making up scores at random would be expected to be correct in this sense about half the time, an AUROC of 0.5 means that, overall, the classifier’s scores are useless for distinguishing between same-source and different-source pairs. AUROCs greater than 0.5 indicate better overall classification, but the value for the area generally does not translate into the more familiar (and more easily comprehended) measures of accuracy such as the sensitivity (the true-positive probability) and specificity (the true-negative probability) of a classification test. See Box 1. Basically, the larger the AUROC, the better the scores are, in some overall sense, at discriminating between same-source and different-source pairs.
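To make the probabilistic reading concrete, here is a minimal sketch that computes an empirical AUC directly from that definition. The scores are made up for illustration; they are not data from the study.

```python
# A minimal sketch (not from the paper; all scores are made up) of the probabilistic
# reading of AUROC: the empirical AUC is the fraction of (same-source, different-source)
# pairings in which the same-source pair received the higher score, counting ties as 1/2.

def empirical_auc(same_scores, diff_scores):
    """P(random same-source score > random different-source score), ties counted as 1/2."""
    wins = 0.0
    for s in same_scores:
        for d in diff_scores:
            if s > d:
                wins += 1.0
            elif s == d:
                wins += 0.5
    return wins / (len(same_scores) * len(diff_scores))

# Hypothetical ratings on the -3 to +3 scale for 12 same-source and 8 different-source pairs.
same_source = [3, 2, 2, 1, 0, 3, 1, 2, -1, 2, 3, 1]
different_source = [-3, -2, 0, -1, -2, 1, -3, -2]

print(round(empirical_auc(same_source, different_source), 3))  # 0.943 for these made-up scores
```

Rank-based shortcuts (for example, the Mann-Whitney U statistic) give the same number without the double loop.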
Now that we have some idea of what the AUROC signifies (corrections are welcome—I do not purport to be an expert on signal detection theory), let’s see how the different groups of classifiers did. The median performance of each group was:

- A2017b: 0.96
- facial examiners: 0.93
- facial reviewers: 0.87
- A2017a: 0.85
- super-recognizers: 0.83
- A2016: 0.76
- fingerprint examiners: 0.76
- A2015: 0.68
- students: 0.68

Again, these are medians. Roughly half the classifiers in each group had higher AUROCs, and half had lower ones. (The automated systems A2015, A2016, A2017a, and A2017b each had only one ROC, and hence only one AUROC.) “Individual accuracy varied widely in all [human] groups. All face specialist groups (facial examiners, reviewers, and super-recognizers) had at least one participant with an AUC below the median of the students. At the top of the distribution, all but the student group had at least one participant with no errors.”
Using the distribution of student AUROCs (fitted to a normal distribution), the authors reported the fraction of participants in each group who scored above the student 95th percentile as follows:

- facial examiners: 53%
- super-recognizers: 46%
- facial reviewers: 36%
- fingerprint examiners: 17%

The best computerized system, A2017b, had a higher AUROC than 73% of the face specialists. To put it another way, “35% of examiners, 13% of reviewers, and 23% of superrecognizers were more accurate than A2017b,” which “was equivalent to a student at the 98th percentile.”
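For readers who want to see the mechanics of that comparison, here is a minimal sketch: fit a normal distribution to the student AUROCs, take its 95th percentile as the cutoff, and count the share of another group scoring above it. All of the AUROC values below are hypothetical; the study’s individual scores are not reproduced here.

```python
# Minimal sketch of the "above the student 95th percentile" comparison.
# The AUROC values are invented for illustration only.
from statistics import NormalDist, mean, stdev

student_aucs = [0.55, 0.60, 0.64, 0.66, 0.68, 0.70, 0.74, 0.79, 0.83]   # hypothetical
examiner_aucs = [0.72, 0.85, 0.89, 0.93, 0.96, 0.98, 1.00]              # hypothetical

# Fit a normal distribution to the student scores and locate its 95th percentile.
student_fit = NormalDist(mu=mean(student_aucs), sigma=stdev(student_aucs))
cutoff = student_fit.inv_cdf(0.95)

share_above = sum(a > cutoff for a in examiner_aucs) / len(examiner_aucs)
print(f"student 95th percentile = {cutoff:.3f}; fraction of examiners above it = {share_above:.0%}")
```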
But none of the preceding reveals how often the classifications based on the scores would be right or wrong. Buried in an appendix to the article (and reproduced below in Box 2) are estimates of “the error rates associated with judgments of +3 and −3 [obtained by computing] the fraction of high-confidence same-person (+3) ratings made to different identity face pairs” and estimates of “the probability of same identity pairs being assigned a −3.” The table indicates that facial examiners who were very confident usually were correct: they gave a different-source pair a highly confident +3 (a false positive) less than 1% of the time and gave a same-source pair a highly confident −3 (a false negative) less than 2% of the time. Students made these errors a little more than 7% and 14% of the time, respectively.
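As a rough illustration of how such point estimates and intervals can be computed, here is a minimal sketch. The counts are hypothetical (chosen only so the output lands near the examiners’ row in Box 2), and the Clopper-Pearson interval used below is one common choice; the paper’s supplementary materials may use a different method.

```python
# Minimal sketch: estimate a high-confidence error rate as a simple proportion and
# attach an exact (Clopper-Pearson) 95% confidence interval. The counts are
# hypothetical, and the paper's own interval method may differ.
from scipy.stats import beta

def rate_with_ci(errors, trials, level=0.95):
    """Point estimate and Clopper-Pearson confidence interval for a proportion."""
    alpha = 1 - level
    lo = 0.0 if errors == 0 else beta.ppf(alpha / 2, errors, trials - errors + 1)
    hi = 1.0 if errors == trials else beta.ppf(1 - alpha / 2, errors + 1, trials - errors)
    return errors / trials, lo, hi

# Example: 4 highly confident "+3" ratings among 456 different-source judgments
# (57 examiners x 8 different-source pairs) -- invented counts for illustration.
est, lo, hi = rate_with_ci(4, 456)
print(f"false-positive rate = {est:.3f} (95% CI {lo:.3f} to {hi:.3f})")
```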
Fusion
The article promises to “show the benefits of a collaborative effort that combines the judgments of humans and machines.” It describes the method for ascertaining whether “a collaborative effort” improves performance as follows:
We examined the effectiveness of combining examiners, reviewers, and superrecognizers with algorithms. Human judgments were fused with each of the four algorithms as follows. For each face image pair, an algorithm returned a similarity score that is an estimate of how likely it is that the images show the same person. Because the similarity score scales differ across algorithms, we rescaled the scores to the range of human ratings (SI Appendix, SI Text). For each face pair, the human rating and scaled algorithm score were averaged, and the AUC was computed for each participant–algorithm fusion.

Unless I am missing something, there was no collaboration between human and machine. Each did their own thing. A number midway between the separate similarity scores on each pair produced a larger area under the ROC than either set of separate scores. To the extent that “Fusing Humans and Machines” conjures images of cyborgs, it seems a bit much. The more modest point is that a very simple combination of scores of a human and a machine classifier works better (with respect to AUROC as a measure of success) than either one alone.
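For concreteness, here is a minimal sketch of the fusion as I read it: rescale the algorithm’s similarity scores to the −3 to +3 rating range, average each rescaled score with the human rating for the same pair, and compute the AUC of the averages. The linear min-max rescaling and all of the scores are my own assumptions for illustration; the paper’s SI Appendix describes the actual rescaling.

```python
# Minimal sketch of human-machine score fusion (illustrative scores and an assumed
# linear rescaling; see the paper's SI Appendix for the actual procedure).
import numpy as np
from sklearn.metrics import roc_auc_score

labels = np.array([1, 1, 1, 0, 0, 1, 0, 1])        # 1 = same-source pair, 0 = different-source pair
human = np.array([3.0, 2.0, -1.0, -2.0, 1.0, 2.0, -3.0, 0.0])           # ratings on the -3..+3 scale
algorithm = np.array([0.91, 0.80, 0.35, 0.20, 0.55, 0.76, 0.05, 0.42])  # raw similarity scores

# Rescale the algorithm scores linearly onto the human rating range [-3, +3].
scaled = (algorithm - algorithm.min()) / (algorithm.max() - algorithm.min()) * 6.0 - 3.0

# Fuse by simple averaging, then compare AUCs.
fused = (human + scaled) / 2.0
for name, scores in [("human", human), ("algorithm", scaled), ("fused", fused)]:
    print(name, round(roc_auc_score(labels, scores), 3))
```

The AUC comparison at the end is the whole test: it shows whether the averaged scores rank same-source pairs above different-source pairs better than either source of scores alone.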
BOX 1. THE ROC CURVE AND ITS AREA
Suppose that we were to take a score of +1 or more as sufficient to classify a pair of images as originating from the same source. Some of these classifications would be incorrect (contributing to the false-positive (FP) proportion for this decision threshold), and some would be correct (contributing to the true-positive (TP) proportion). Of course, the threshold for the classification could be set at other scores. The ROC curve is simply a plot of the points (FPP[score], TPP[score]) for the person or machine scoring the pairs of images for the many possible decision thresholds.
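A minimal sketch of that construction, with made-up scores: sweep the decision threshold across the rating scale and record the false-positive and true-positive proportions at each stop.

```python
# Minimal sketch: trace an ROC curve by sweeping the decision threshold over the
# -3..+3 rating scale. A score at or above the threshold counts as a "same source" call.
# The scores are invented for illustration.

def roc_points(same_scores, diff_scores, thresholds):
    """Return (FPP, TPP) at each threshold."""
    points = []
    for t in thresholds:
        tpp = sum(s >= t for s in same_scores) / len(same_scores)   # sensitivity
        fpp = sum(d >= t for d in diff_scores) / len(diff_scores)   # 1 - specificity
        points.append((fpp, tpp))
    return points

same_source = [3, 2, 2, 1, 0, 3, 1, 2, -1, 2, 3, 1]
different_source = [-3, -2, 0, -1, -2, 1, -3, -2]

# A threshold above every score yields (0, 0); one at or below the lowest score yields (1, 1).
for fpp, tpp in roc_points(same_source, different_source, thresholds=range(4, -4, -1)):
    print(f"FPP = {fpp:.2f}, TPP = {tpp:.2f}")
```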
For example, if the threshold score for a positive classification were set higher than all the reported scores, there would be no declared positives. Both the false-positive and the true-positive proportions would be zero. At the other extreme, if the threshold score were placed at the bottom of the scale, all the classifications would be positive. Hence, every same-source pair would be classified positively, as would every different-source pair. Both the TPP and the FPP would be 1. A so-called random classifier, in which the scores have no correlation to the actual source of images, would be expected to produce a straight line connecting these points (0,0) and (1,1). A more useful classifier would have a curve with mostly higher points, as shown in the sketch below.
[Sketch: an ROC plot with TPP (sensitivity) on the vertical axis and FPP (1 – specificity) on the horizontal axis, each running from 0 to 1. Circles (o) trace the diagonal line of a random (worthless) classifier; asterisks (*) trace a better classifier (AUC > 0.5) lying above the diagonal, with tick marks at FPP = 0.2 on the horizontal axis and at 0.5 on the vertical axis.]

An AUROC of, say, 0.75, does not mean that 75% of the classifications (using a particular score as the threshold for declaring a positive association) are correct. Neither does it mean that 75% is the sensitivity or specificity when using a given score as a decision threshold. Nor does it mean that 25% is the false-positive or the false-negative proportion. Instead, how many classifications are correct at a given score threshold depends on (1) the sensitivity at that score threshold, (2) the specificity at that score threshold, and (3) the proportion of same-source and different-source pairs in the sample or population of pairs.
Look at the better classifier in the graph (the one whose operating characteristics are indicated by the asterisks). Consider the score implicit in the asterisk above the little tick-mark on the horizontal axis and across from the mark on the vertical axis. The FPP there is 0.2, so the specificity is 0.8. The sensitivity is the height of the better ROC curve at that implicit score threshold. The height of that asterisk is 0.5. The better classifier with that threshold makes correct associations only half the time when confronted with same-source pairs and 80% of the time when presented with different-source pairs. When shown 20 pairs, 12 of which are from the same face, as in the experiment discussed here, the better classifier is expected to make 50% × 12 = 6 correct positive classifications and 80% × 8 = 6.4 correct negative classifications. The overall expected percentage of correct classifications is therefore 12.4/20 = 62% rather than 75%.
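The same arithmetic in a few lines, using the numbers from the example above:

```python
# The worked example above: overall accuracy at a threshold depends on sensitivity,
# specificity, and the mix of same-source and different-source pairs, not on AUROC.
sensitivity = 0.5          # TPP at the illustrated threshold
specificity = 0.8          # 1 - FPP, with FPP = 0.2
same_source_pairs = 12
different_source_pairs = 8

correct = sensitivity * same_source_pairs + specificity * different_source_pairs
accuracy = correct / (same_source_pairs + different_source_pairs)
print(correct, f"{accuracy:.0%}")   # 12.4 and 62%
```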
The moral of the arithmetic: The area under the ROC is not so readily related to the accuracy of the classifier for particular similarity scores. (It is more helpful in describing how well the classifier generally ranks a same-source pair relative to a different-source pair.) 2/
BOX 2. "[T]he estimate q̂ for the error rate and the upper and lower limits of the 95% confidence interval." (From Table S2)

| Group | Estimate | 95% CI (as proportions) |
|---|---|---|
| Type of Error: False Positive (+3 on different faces) | | |
| Facial examiners | 0.9% | 0.002 to 0.022 |
| Facial reviewers | 1.2% | 0.003 to 0.036 |
| Super-recognizers | 1.0% | 0.0002 to 0.052 |
| Fingerprint examiners | 3.8% | 0.022 to 0.061 |
| Students | 7.3% | 0.044 to 0.112 |
| Type of Error: False Negative (−3 on same faces) | | |
| Facial examiners | 1.8% | 0.009 to 0.030 |
| Facial reviewers | 1.4% | 0.005 to 0.032 |
| Super-recognizers | 5.1% | 0.022 to 0.099 |
| Fingerprint examiners | 3.3% | 0.021 to 0.050 |
| Students | 14.5% | 0.111 to 0.185 |
UPDATES
June 9, 2018: Corrections and additions made in response to comments from Hari Iyer.

NOTES
- P.J. Phillips, A.N. Yates, Y. Hu, C.A. Hahn, E. Noyes, K. Jackson, J.G. Cavazos, G. Jeckeln, R. Ranjan, S. Sankaranarayanan, J.-C. Chen, C.D. Castillo, R. Chellappa, D. White and A.J. O’Toole. Face Recognition Accuracy of Forensic Examiners, Superrecognizers, and Algorithms. Proceedings of the National Academy of Sciences, Published online May 28, 2018. DOI: 10.1073/pnas.1721355115
- As Hari Iyer put it in response to the explanation in Box 1, "given a randomly chosen observation x1 belonging to class 1, and a randomly chosen observation x0 belonging to class 0, the (empirical) AUC is the estimated probability that the evaluated classification algorithm will assign a higher score to x1 than to x0." For a proof, see Alexej Gossman, Probabilistic Interpretation of AUC, Jan. 25, 2018, http://www.alexejgossmann.com/auc/. A geometric proof can be found in Matthew Drury, The Probabilistic Interpretation of AUC, in Scatterplot Smoothers, Jun 21, 2017, http://madrury.github.io/jekyll/update/statistics/2017/06/21/auc-proof.html.