Wednesday, May 30, 2018

Fusing Humans and Machines to Recognize Faces

A new article on the accuracy of facial recognition by humans and machines represents “the most comprehensive examination to date of face identification performance across groups of humans with variable levels of training, experience, talent, and motivation.” 1/ It concludes that the optimal performance comes from a “fusion” of man (or woman) and machine. But the meanings of “accuracy” and “fusion” are not necessarily what one might think.

Researchers from NIST, The University of Texas at Dallas, the University of Maryland, and the University of New South Wales displayed “highly challenging” pairs of face images to individuals with and without training in matching images, and to “deep convolutional neural networks” (DCNNs) that trained themselves to classify images as being from the same source or from different sources.

The Experiment
Twenty pairs of pictures (12 same-source and 8 different-source pairs) were presented to the following groups:
  • 57 forensic facial examiners (“professionals trained to identify faces in images and videos [for use in court] using a set of tools and procedures that vary across forensic laboratories”);
  • 30 forensic facial reviewers (“trained to perform faster and less rigorous identifications [for] generating leads in criminal cases”);
  • 13 super-recognizers (“untrained people with strong skills in face recognition”);
  • 53 fingerprint examiners;
  • 31 undergraduate students; and
  • 4 DCNNs (“deep convolutional neural networks” developed between 2015 and 2017).
Students took the test in a single session, while the facial examiners, reviewers, super-recognizers, and fingerprint examiners had three months to complete the test. They all expressed degrees of confidence that each pair showed the same person as opposed to two different people. (+3 meant that “the observations strongly support that it is the same person”; –3 meant that “the observations strongly support that it is not the same person”). The computer programs generated “similarity scores” that were transformed to the same seven-point scale.

Comparisons of the Groups
To compare the performance of the groups, the researchers relied on a statistic known as AUC (or, more precisely, AUROC, for “Area Under the Receiver Operating Characteristic” curve). AUROC combines two more familiar statistics—the true-positive (TP) proportion and the false-positive (FP) proportion—into one number. In doing so, it pays no heed to the fact that a false-positive may be more costly than a false negative. A simple way to think about the number is this: The AUROC of a classifier is equal to the probability that the classifier will rank a randomly chosen pair of images higher when they originate from the same source than when the pair come from two different sources. That is,

AUROC = P(score|same > score|different)

Because making up scores at random would be expected to be correct in this sense about half the time, an AUROC of 0.5 means that, overall, the classifier’s scores are useless for distinguishing between same-source and different-source pairs. AUROCs greater than 0.5 indicate better overall classifications, but the value for the area generally does not translate into the more familiar (and more easily comprehended) measures of accuracy such as the sensitivity (the true-positive probability) and specificity (the true-negative probability) of a classification test. See Box 1. Basically, the larger the AUROC, the better the scores are, in some overall sense, in discriminating between same-source and different-source pairs.
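To make the ranking interpretation concrete, here is a minimal Python sketch that estimates the AUROC by comparing every same-source score with every different-source score. The ratings are invented for illustration (they are not data from the study), the function name pairwise_auroc is hypothetical, and counting a tie as half a success is a common convention that I am assuming rather than taking from the article.

```python
from itertools import product


def pairwise_auroc(same_scores, diff_scores):
    """Estimate P(score of a same-source pair > score of a different-source pair).

    Ties are given half credit (an assumed convention, needed because a
    seven-point scale produces many equal scores).
    """
    credit = 0.0
    for s, d in product(same_scores, diff_scores):
        if s > d:
            credit += 1.0
        elif s == d:
            credit += 0.5
    return credit / (len(same_scores) * len(diff_scores))


# Hypothetical ratings on the -3..+3 scale; NOT data from the study.
same = [3, 2, 2, 1, 0, 3]     # pairs that truly show the same person
diff = [-3, -1, 0, 1, -2]     # pairs that truly show different people

auroc = pairwise_auroc(same, diff)
print(f"AUROC = {auroc:.3f}")
```

With these made-up ratings, 28 of the 30 possible comparisons (counting ties as halves) favor the same-source pair. Scores that carry no information, such as identical score lists for both kinds of pairs, come out at 0.5.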

Now that we have some idea of what the AUROC signifies (corrections are welcome—I do not purport to be an expert on signal detection theory), let’s see how the different groups of classifiers did. The median performance of each group was
A2017b:░░░░░░░░░░ 0.96
facial examiners:░░░░░░░░░ 0.93
facial reviewers:░░░░░░░░░ 0.87
A2017a:░░░░░░░░░ 0.85
super-recognizers:░░░░░░░░ 0.83
A2016:░░░░░░░░ 0.76
fingerprint examiners:░░░░░░░░ 0.76
A2015:░░░░░░░ 0.68
students:░░░░░░░ 0.68
Again, these are medians. Roughly half the classifiers in each group had higher AUROCs, and half had lower ones. (The automated systems A2015, A2016, A2017a, and A2017b had only one ROC, and hence only one AUROC.) “Individual accuracy varied widely in all [human] groups. All face specialist groups (facial examiners, reviewers, and super-recognizers) had at least one participant with an AUC below the median of the students. At the top of the distribution, all but the student group had at least one participant with no errors.”

Using the distribution of student AUROCs (fitted to a normal distribution), the authors reported the fraction of participants in each group who scored above the student 95th percentile as follows:
facial examiners:░░░░░░░░░░░ 53%
super-recognizers:░░░░░░░░░ 46%
facial reviewers:░░░░░░░ 36%
fingerprint examiners:░░░ 17%
The best computerized system, A2017b, had a higher AUROC than 73% of the face specialists. To put it another way, “35% of examiners, 13% of reviewers, and 23% of superrecognizers were more accurate than A2017b,” which “was equivalent to a student at the 98th percentile.”
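The percentile comparison can be sketched in a few lines of Python. This is only an illustration of the general idea (fit a normal curve to the student AUROCs, find its 95th percentile, and count how many members of another group exceed it). The AUROC values below are invented, and I am assuming a plain mean-and-standard-deviation fit, which may differ from the authors' procedure.

```python
from statistics import NormalDist

# Hypothetical AUROCs; NOT data from the study.
student_aucs = [0.55, 0.60, 0.64, 0.68, 0.70, 0.73, 0.78, 0.82]
examiner_aucs = [0.80, 0.88, 0.93, 0.95, 0.99, 1.00]

# Fit a normal distribution to the student scores (sample mean and stdev).
fitted = NormalDist.from_samples(student_aucs)

# AUROC that a student at the 95th percentile of the fitted curve would have.
cutoff = fitted.inv_cdf(0.95)

# Share of the other group scoring above that cutoff.
share_above = sum(a > cutoff for a in examiner_aucs) / len(examiner_aucs)

# Percentile of the fitted student curve that matches a given AUROC,
# e.g. the 0.96 reported for algorithm A2017b.
percentile_of_algorithm = fitted.cdf(0.96)

print(f"student 95th-percentile AUROC: {cutoff:.3f}")
print(f"share of examiners above it: {share_above:.0%}")
```

Because the student figures here are made up, the resulting percentile for an AUROC of 0.96 will not reproduce the 98th percentile reported in the article.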

But none of the preceding reveals how often the classifications based on the scores would be right or wrong. Buried in an appendix to the article (and reproduced below in Box 2) are estimates of “the error rates associated with judgments of +3 and −3 [obtained by computing] the fraction of high-confidence same-person (+3) ratings made to different identity face pairs” and estimates of “the probability of same identity pairs being assigned a −3.” The table indicates that facial examiners who were very confident usually were correct: they assigned a +3 to different-source pairs (false positives) less than 1% of the time and a −3 to same-source pairs (false negatives) less than 2% of the time. Students made these errors a little more than 7% and 14% of the time, respectively.

The article promises to “show the benefits of a collaborative effort that combines the judgments of humans and machines.” It describes the method for ascertaining whether “a collaborative effort” improves performance as follows:
We examined the effectiveness of combining examiners, reviewers, and superrecognizers with algorithms. Human judgments were fused with each of the four algorithms as follows. For each face image pair, an algorithm returned a similarity score that is an estimate of how likely it is that the images show the same person. Because the similarity score scales differ across algorithms, we rescaled the scores to the range of human ratings (SI Appendix, SI Text). For each face pair, the human rating and scaled algorithm score were averaged, and the AUC was computed for each participant–algorithm fusion.
Unless I am missing something, there was no collaboration between human and machine. Each did their own thing. A number midway between the separate similarity scores on each pair produced a larger area under the ROC than either set of separate scores. To the extent that “Fusing Humans and Machines” conjures images of cyborgs, it seems a bit much. The more modest point is that a very simple combination of scores of a human and a machine classifier works better (with respect to AUROC as a measure of success) than either one alone.
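A small Python sketch shows how little machinery this kind of "fusion" requires. Everything in it is an assumption made for illustration: the ratings and similarity scores are invented (and deliberately contrived so that the human and the algorithm err on different pairs), the function names are hypothetical, and the linear min-max rescaling merely stands in for whatever rescaling the SI Appendix actually describes.

```python
def rescale(scores, lo=-3.0, hi=3.0):
    """Linearly map scores onto [lo, hi] (an assumed min-max rescaling)."""
    s_min, s_max = min(scores), max(scores)
    return [lo + (s - s_min) * (hi - lo) / (s_max - s_min) for s in scores]


def auroc(scores, labels):
    """Pairwise estimate of the AUROC; ties get half credit."""
    same = [s for s, y in zip(scores, labels) if y == 1]
    diff = [s for s, y in zip(scores, labels) if y == 0]
    credit = sum(1.0 if s > d else 0.5 if s == d else 0.0
                 for s in same for d in diff)
    return credit / (len(same) * len(diff))


# Hypothetical data for eight face pairs; NOT from the study.
labels = [1, 1, 1, 1, 0, 0, 0, 0]                       # 1 = same person
human = [3, 2, -1, 1, -3, 0, -2, 2]                     # ratings on -3..+3
machine = [0.6, 0.5, 0.9, 0.4, 0.3, 0.45, 0.2, 0.1]     # raw similarity scores

# "Fusion": put the machine scores on the human scale, then average pair by pair.
fused = [(h + m) / 2 for h, m in zip(human, rescale(machine))]

auc_human = auroc(human, labels)
auc_machine = auroc(machine, labels)
auc_fused = auroc(fused, labels)
print(auc_human, auc_machine, auc_fused)
```

In this contrived example the averaged scores separate the two kinds of pairs better than either the human or the machine alone, because the two make their mistakes on different pairs. Real data need not behave so neatly.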


BOX 1.
Suppose that we were to take a score of +1 or more as sufficient to classify a pair of images as originating from the same source. Some of these classifications would be incorrect (contributing to the false-positive (FP) proportion for this decision threshold), and some would be correct (contributing to the true-positive (TP) proportion). Of course, the threshold for the classification could be set at other scores. The ROC curve is simply a plot of the points (FPP[score], TPP[score]) for the person or machine scoring the pairs of images for the many possible decision thresholds.

For example, if the threshold score for a positive classification were set higher than all the reported scores, there would be no declared positives. Both the false-positive and the true-positive proportions would be zero. At the other extreme, if the threshold score were placed at the bottom of the scale, all the classifications would be positive. Hence, every same-source pair would be classified positively, as would every different-source pair. Both the TPP and the FPP would be 1. A so-called random classifier, in which the scores have no correlation to the actual source of images, would be expected to produce a straight line connecting these points (0,0) and (1,1). A more useful classifier would have a curve with mostly higher points, as shown in the sketch below.

      TPP (sensitivity)
     1 |           *   o
       |       *
       |           o
       +   *   o
       |                   o Random (worthless) classifier
       |   o               * Better classifier (AUC > 0.5)
       o---+------------ FPP (1 – specificity)
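The threshold sweep described above can be written out as a short Python sketch. The ratings are invented, the function names are hypothetical, and the trapezoid rule for the area is my own choice of illustration rather than anything specified in the article.

```python
def roc_points(same_scores, diff_scores):
    """Return (FPP, TPP) points, one per decision threshold, from strict to lax."""
    thresholds = sorted(set(same_scores) | set(diff_scores), reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: nothing declared positive
    for t in thresholds:
        # A pair is declared "same source" when its score is at or above t.
        tpp = sum(s >= t for s in same_scores) / len(same_scores)
        fpp = sum(d >= t for d in diff_scores) / len(diff_scores)
        points.append((fpp, tpp))
    return points


def area_under(points):
    """Trapezoid-rule area under the curve through the (FPP, TPP) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical ratings on the -3..+3 scale; NOT data from the study.
same = [3, 2, 2, 1, 0, 3]
diff = [-3, -1, 0, 1, -2]

curve = roc_points(same, diff)
area = area_under(curve)
for fpp, tpp in curve:
    print(f"FPP = {fpp:.2f}, TPP = {tpp:.2f}")
print(f"area under the curve = {area:.3f}")
```

The curve starts at (0, 0), where the threshold exceeds every score, and ends at (1, 1), where every pair is declared a match, just as the text describes.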
An AUROC of, say, 0.75, does not mean that 75% of the classifications (using a particular score as the threshold for declaring a positive association) are correct. Neither does it mean that 75% is the sensitivity or specificity when using a given score as a decision threshold. Nor does it mean that 25% is the false-positive or the false-negative proportion. Instead, how many classifications are correct at a given score threshold depends on: (1) the sensitivity at that score threshold, (2) the specificity at that score threshold, and (3) the proportion of same-source and different-source pairs in the sample or population of pairs.

Look at the better classifier in the graph (the one whose operating characteristics are indicated by the asterisks). Consider the score implicit in the asterisk above the little tick-mark on the horizontal axis and across from the mark on the vertical axis. The FPP there is 0.2, so the specificity is 0.8. The sensitivity is the height of the better ROC curve at that implicit score threshold. The height of that asterisk is 0.5. The better classifier with that threshold makes correct associations only half the time when confronted with same-source pairs and 80% of the time when presented with different-source pairs. When shown 20 pairs, 12 of which are from the same face, as in the experiment discussed here, the better classifier is expected to make 50% × 12 = 6 correct positive classifications and 80% × 8 = 6.4 correct negative classifications. The overall expected percentage of correct classifications is therefore 12.4/20 = 62% rather than 75%.
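The arithmetic in this paragraph can be checked with a few lines of Python. The function name is hypothetical, and the second call (an even split of ten same-source and ten different-source pairs) is an added illustration of how the overall percentage shifts with the mix of pairs.

```python
def expected_accuracy(sensitivity, specificity, n_same, n_diff):
    """Expected fraction of correct classifications at one decision threshold."""
    correct_positives = sensitivity * n_same   # same-source pairs called "same"
    correct_negatives = specificity * n_diff   # different-source pairs called "different"
    return (correct_positives + correct_negatives) / (n_same + n_diff)


# Operating point read off the sketch: sensitivity 0.5, specificity 0.8.
study_mix = expected_accuracy(0.5, 0.8, n_same=12, n_diff=8)
even_mix = expected_accuracy(0.5, 0.8, n_same=10, n_diff=10)

print(f"12 same / 8 different:  {study_mix:.0%}")
print(f"10 same / 10 different: {even_mix:.0%}")
```

The same operating point yields about 62% correct with the study's 12-to-8 mix and about 65% with an even mix, which is the sense in which the proportion of same-source pairs matters.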

The moral of the arithmetic: The area under the ROC is not so readily related to the accuracy of the classifier for particular similarity scores. (It is more helpful in describing how well the classifier generally ranks a same-source pair relative to a different-source pair.) 2/

BOX 2. "[T]he estimate q̂ for the error rate and the upper and lower limits of the 95% confidence interval." (From Table S2)

Group                    Estimate   0.95 CI

Type of Error: False Positive (+3 on different faces)
Facial Examiners         0.9%       0.002 to 0.022
Facial Reviewers         1.2%       0.003 to 0.036
Super-recognizers        1.0%       0.0002 to 0.052
Fingerprint Examiners    3.8%       0.022 to 0.061
Students                 7.3%       0.044 to 0.112

Type of Error: False Negative (-3 on same faces)
Facial Examiners         1.8%       0.009 to 0.030
Facial Reviewers         1.4%       0.005 to 0.032
Super-recognizers        5.1%       0.022 to 0.099
Fingerprint Examiners    3.3%       0.021 to 0.050
Students                 14.5%      0.111 to 0.185

June 9, 2018: Corrections and additions made in response to comments from Hari Iyer.
  1. P.J. Phillips, A.N. Yates, Y. Hu, C.A. Hahn, E. Noyes, K. Jackson, J.G. Cavazos, G. Jeckeln, R. Ranjan, S. Sankaranarayanan, J.-C. Chen, C.D. Castillo, R. Chellappa, D. White and A.J. O’Toole. Face Recognition Accuracy of Forensic Examiners, Superrecognizers, and Algorithms. Proceedings of the National Academy of Sciences, Published online May 28, 2018. DOI: 10.1073/pnas.1721355115
  2. As Hari Iyer put it in response to the explanation in Box 1, "given a randomly chosen observation x1 belonging to class 1, and a randomly chosen observation x0 belonging to class 0, the (empirical) AUC is the estimated probability that the evaluated classification algorithm will assign a higher score to x1 than to x0." For a proof, see Alexej Gossman, Probabilistic Interpretation of AUC, Jan. 25, 2018. A geometric proof can be found in Matthew Drury, The Probabilistic Interpretation of AUC, in Scatterplot Smoothers, Jun. 21, 2017.

Saturday, May 26, 2018

Against Method: ACE-V, Reproducibility, and Now Preproducibility

Forensic-science practitioners like to describe their activities as scientific. Indeed, if the work they did were not scientific, how could one say that they were practicing forensic science?

Thus, one finds textbooks with impressive titles like "Forensic Comparative Science" devoted to “[t]he comparative science disciplines of finger prints, firearm/tool marks, shoe prints/tire prints, documents and handwriting,” and more. 1/ The practitioners of these "science disciplines" describe their work as "analogous to scientific method of critically observing details in images, determining similarities or differences in the data, performing comparative measurements to experiment whether the details in the images actually agree or disagree," and so on. 2/ They insist that they follow a multistep process that "is a scientific methodology" for "hypothesis testing"3/ — even if the process lacks any defined threshold for deciding when perceived (or even objectively measured) features are sufficiently similar or different to reach a conclusion.

An entire article in the Journal of Forensic Identification — “a scientific journal that provides more than 100 pages of articles related to forensics ... written by forensic authorities from around the world who are practitioners or academics in forensic science fields” 4/ — is devoted to demonstrating that “[a]nalysis, comparison, evaluation, and verification (ACE-V) is a scientific methodology that is part of the scientific method.” 5/ The abstract observes that
Several publications have attempted to explain ACE-V as a scientific method or its role within the scientific method, but these attempts are either not comprehensive or not explicit. This article ... outlines the scientific method as a seven-step process. The scientific method is discussed using the premises of uniqueness, persistence, and classifiability. Each step of the scientific method is addressed specifically as it applies to friction ridge impression examination in casework. It is important for examiners to understand and apply the scientific method, including ACE-V, and be able to articulate this method. 6/
The Scientific Working Group on Friction Ridge Analysis, Study, and Technology (SWGFAST) agreed, urging examiners to write in their reports that ACE-V is nothing less than “[t]he acronym for a scientific method: Analysis, Comparison, Evaluation, and Verification.” 7/

It is revealing to contrast such assertions with comments on the meaning of reproducibility in science that appear in an essay published this week in Nature. There, Philip Stark, the Associate Dean of the Division of Mathematical and Physical Sciences and Professor of Statistics at the University of California (Berkeley), noted that reproducibility means different things in different fields, but pointed to “preproducibility” as a prerequisite to reproducibility:
An experiment or analysis is preproducible if it has been described in adequate detail for others to undertake it. ... The distinction between a preproducible scientific report and current common practice is like the difference between a partial list of ingredients and a recipe. To bake a good loaf of bread, it isn’t enough to know that it contains flour. It isn’t even enough to know that it contains flour, water, salt and yeast. The brand of flour might be omitted from the recipe with advantage, as might the day of the week on which the loaf was baked. But the ratio of ingredients, the operations, their timing and the temperature of the oven cannot.

Given preproducibility — a ‘scientific recipe’ — we can attempt to make a similar loaf of scientific bread. If we follow the recipe but do not get the same result, either the result is sensitive to small details that cannot be controlled, the result is incorrect or the recipe was not precise enough ... . 8/
Descriptions of procedures for subjective pattern-matching in traditional forensic science are much like the list of ingredients. There are no quantitative instructions for how many and which of the potentially distinguishing features to use and how long to process or cook these ingredients at each step of the process. Indeed, it is even worse than that. Even though trained examiners know what ingredients to choose from, they can pick any subset of them that they think could be effective for the case at hand. Thus, although ACE-V can be described as a series of steps within "a broadly stated framework," 9/ that does not make it a “scientific recipe.” Imprecision at every step deprives it of "preproducibility." It might be called a "process" rather than a "method," 10/ but in the end, “ACE-V is an acronym, not a methodology." 11/

It does not follow, however, that the comparisons and conclusions are of no epistemic value or that they cannot be studied scientifically. Quite the contrary. Unlike in the natural sciences, the absence of preproducibility here does not preclude reproducibility. Another laboratory expert can start with the same materials to be compared, and we can see if the outcome is the same. Moreover, we can even conduct blind tests of examiner performance to determine how accurately criminalists are able to classify traces originating from the same source and traces coming from different sources.

Such validation studies show that latent print examiners, for example, have real expertise, but these findings do not mean that expert examiners are following a particularly “scientific method” in making their judgments. Psychologists report that some individuals are phenomenally accurate in recognizing faces, 12/ but that does not mean that the “super recognizers” are using a well-defined, or indeed, any kind of scientific procedure to accomplish these feats. The training and experience that criminalists receive may include instruction in facts and principles of biology and physics, and their performance may be generally accurate and reliable, but that does not mean that they are applying a scientific method. A flow chart is not a scientific test.

Until criminalists can articulate and follow a preproducible procedure, they should not present their work as deeply scientific. In court, they can explain that scientists have studied the nature of the patterns they analyze. They can refer to any well-designed studies proving that subjective pattern-matching by trained analysts can be valid and reliable. They can vouch for the fact that criminalists have been making side-by-side comparisons for a long time. If courts are persuaded that the resulting individual opinions are helpful, then skilled witnesses can give those opinions. But such opinions should not be gussied up as a scientific method of hypothesis testing. 13/ As one critic of such rhetoric explained, "forensic science could show that it does have validation, certification, accreditation, oversight, and basic research without showing that it uses the 'scientific method.'" 14/

  1. John R. Vanderkolk, Forensic Comparative Science xii (2009).
  2. Id. at 90.
  3. M. Reznicek, R.M. Ruth & D.M. Schilens, ACE-V and the Scientific Method, 60 J. Forensic Identification 87, 87 (2010). See also Michele Triplett & Lauren Cooney, The Etiology of ACE-V and its Proper Use: An Exploration of the Relationship Between ACE-V and the Scientific Method of Hypothesis Testing, 56 J. Forensic Identification 345, 353 (2006) (“ACE-V is synonymous with hypothesis testing. A more in-depth understanding of scientific methodology can be found by reading the works of well-known scientists and philosophers such as Aristotle, Isaac Newton, Francis Bacon, Galileo Galilei, and Karl Popper, to name just a few.”).
  4. Abstract of Journal of Forensic Identification (JFI), accessed May 25, 2018.
  5. Reznicek et al., supra note 3, at 87.
  6. Id. The Department of Justice has retreated from this phrasing, preferring to call “an examiner’s belief” an “inductive inference . . . made in a logical and scientifically defensible manner.” Department of Justice, Approved Uniform Language for Testimony and Reports for the Forensic Latent Print Discipline, Feb. 22, 2018, at 2 & 2 n.2.
  7. Scientific Working Group on Friction Ridge Analysis, Study, and Technology, Standard for Reporting Friction Ridge Examinations (Latent/Tenprint), Appendix, at 4 n.2, 2012 (emphasis added).
  8. Philip B. Stark, Before Reproducibility must Come Preproducibility, 557 Nature 613 (2018), doi: 10.1038/d41586-018-05256-0.
  9. Comm. on Identifying the Needs of the Forensic Sci. Cmty., Nat'l Research Council, Strengthening Forensic Science in the United States: A Path Forward 142 (2009).
  10. Michele Triplett, Is ACE-V a Process or a Method?, IDentification News, June/July 2012, at 5–6.
  11. Sandy L. Zabell, Fingerprint Evidence, 13 J. L. & Pol'y 143, 178 (2005).
  12. Richard Russell, Brad Duchaine, and Ken Nakayama, Super-recognizers: People with Extraordinary Face Recognition Ability, 16 Psychonomic Bull. and Rev. 252 (2009), doi: 10.3758/PBR.16.2.252.
  13. David H. Kaye, How Daubert and Its Progeny Have Failed Criminalistics Evidence, and a Few Things the Judiciary Could Do About It, 86 Fordham L. Rev. 1639 (2018).
  14. Simon A. Cole, Acculturating Forensic Science: What Is ‘Scientific Culture’, and How Can Forensic Science Adopt It?, 38 Fordham Urb. L.J. 436, 451 (2010).

Monday, May 21, 2018

Firearms Toolmark Testimony: Looking Back and Forward

By inspecting toolmarks on bullets or spent cartridge cases, firearms examiners can supply valuable information on whether a particular gun fired the ammunition in question. But the limits on this information have not always been respected in court, and a growing number of opinions have tried to address this fact. A forthcoming article in a festschrift for Professor Paul Giannelli surveys the developing law on this type of feature-matching evidence.

The article explains how the courts have moved from a position of skepticism of the ability of examiners to link bullets and other ammunition components to a particular gun to full-blown acceptance of identification “to the exclusion of all other firearms.” From that apogee, challenges to firearm-mark evidence over the past decade or so have generated occasional restrictions on the degree of confidence that firearms experts can express in court, but they have not altered the paradigm of making source attributions and exclusions instead of statements about the degree to which the evidence supports these conclusions. After reviewing the stages in the judicial reception of firearm-mark evidence, including the reactions to reports from the National Academy of Sciences and the President's Council of Advisors on Science and Technology, the article concludes by describing a more scientific, quantitative, evidence-based form of testimony that should supplant or augment the current experience-based decisions of skilled witnesses. A few excerpts follow:
From: David H. Kaye, Firearm-Mark Evidence: Looking Back and Looking Ahead, Case Western Reserve Law Review (forthcoming Vol. 68, Issue 3, 2018) (most footnotes omitted)

* * *
I. Rejection of Expert Source Attributions
For a time, courts did not admit testimony that items originated from a particular firearm. Some courts reasoned that jurors could make the comparisons and draw their own conclusions. In People v. Weber, for example, the trial court struck from the record an examiner’s testimony “that in his opinion the two bullets taken from the bodies were fired from this pistol, leaving that as a question for the jury to determine by an inspection of the bullets themselves.” In this 1904 trial, the court did not question the expert’s ability to discover toolmarks that could be probative of identity, but it saw no reason to believe that the expert would be better than lay jurors at drawing inferences from that information. Other courts allowed such opinions, but not if they were stated as “facts.” * * *

IV. Heightened Scrutiny Following the 2009 NAS Report
* * *
Neither the 2008 nor the 2009 NAS report made recommendations on admissibility of evidence, for that was not part of their charge. Practitioners and prosecutors proposed that this meant that the reports should not or could not be taken as undermining the admissibility of traditional highly judgmental pattern-matching identifications. However, the committees’ reviews of the literature clearly lent credence to the questions about the routine admission of categorical source attributions based on firearm-marks. 50/ In five prominent published opinions, courts cited the NAS reports and the opinions cited in Part III of this Article to limit such testimony. * * *

50. For example, in describing the scientific basis of “forensic science fields like firearms examination,” the 2008 report quoted with approval an article by two forensic scientists stating that “[f]orensic individualization sciences that lack actual data, which is most of them, . . . simply . . . assume the conclusion of a miniscule probability of a coincidental match . . . .” [Nat'l Research Council Comm. To Assess the Feasibility, Accuracy, and Tech. Capability of a Nat'l Ballistics Database, Ballistic Imaging 1, 54-55 (Daniel L. Cork et al., eds. 2008)] (quoting John I. Thornton & Joseph L. Peterson, The General Assumptions and Rationale of Forensic Identification, in 3 David L. Faigman, David H. Kaye, Michael J. Saks, & Joseph Sanders, Modern Scientific Evidence: the Law and Science of Expert Testimony § 24-7.2, at 169 (2002)). Apparently recognizing the threat of such assessments, AFTE complained that the committees’ literature reviews were shallow. In response to the 2008 Report, it wrote that “the committee lacked the expertise and information necessary for the in-depth study that would be required to offer substantive statements with regard to these fundamental issues of firearm and toolmark identification.” [AFTE Comm. for the Advancement of the Sci. of Firearm & Toolmark Identification, The Response of the Association of Firearm and Tool Mark Examiners to the National Academy of Sciences 2008 Report Assessing the Feasibility, Accuracy, and Technical Capability of a National Ballistics Database, AFTE J., Summer 2008, at 243, available at 243]. Likewise, it wrote that “the [2009] NAS committee in effect chose to ignore extensive research supporting the scientific underpinnings of the identification of firearm and toolmark evidence.” AFTE Comm. for the Advancement of the Sci. of Firearm & Toolmark Identification, The Response of the Association of Firearms and Tool Mark Examiners to the February 2009 National Academy of Science Report “Strengthening Forensic Science in the United States: A Path Forward,” AFTE J., Summer 2009, at 204, 206. According to AFTE, “years of empirical research . . . conclusively show[] that sufficient individuality is often present on tool (firearm tools or non-firearm tools) working surfaces to permit a trained examiner to conclude that a toolmark was made by a certain tool and that there is no credible possibility that it was made by any other tool working surface.” AFTE Comm. Response, supra * * * , at 242. After all, “[t]he principles and techniques utilized in forensic firearms identification have been used internationally for nearly a century by the relevant forensic science community to both identify and exclude specific firearms as the source of fired bullets and cartridge cases.” Id. at 237 (emphasis added). Prosecutors too sought to blunt the implications of the skeptical statements about the limited validation of the premises of the traditional theory of firearm-mark identification with an affidavit from the chairman of the NAS committee that wrote the 2008 Report. Affidavit of John E. Rolph at 1-3, United States v. Edwards, No. F-516-01 (D.C. Super. Ct., May 23, 2008). Yet, the affidavit merely collects excerpts from the report itself and ends with one that could be read as supporting admissibility under certain conditions. For another affidavit from a committee member contending that NAS “has questioned the validity of these fundamental assumptions of uniqueness and reproducibility,” see Declaration of Alicia Carriquiry, PhD. In Support of Motion in Limine to Exclude Firearms Examiner’s Opinion at 5, People v. Knight, No. LA067366 (Cal. Super. Ct. Apr. 2012).
The use of affidavits of one or two committee members to give their personal views on what the words that the committee as a whole agreed upon mean is ill-advised. It resembles asking individual members of Congress to provide their post hoc thoughts on what a committee report on legislation, or the statute itself, really meant.