Court challenges to the validity of forensic identification of the gun that fired a bullet based on toolmark comparisons have increased since the President's Council of Advisors on Science and Technology (PCAST) issued a report in late 2016 stressing the limitations in the scientific research on the subject. A study from the Netherlands preprinted in 2019 adds to the research literature. The abstract reads (in part):
Forensic firearm examiners compare the features in cartridge cases to provide a judgment addressing the question about their source: do they originate from one and the same or from two different firearms? In this article, the validity and reliability of these judgments is studied and compared to the outcomes of a computer-based method. The ... true positive rates (sensitivity) and the true negative rates (specificity) of firearm examiners are quite high. ... The examiners are overconfident, giving judgments of evidential strength that are too high. The judgments of the examiners and the outcomes of the computer-based method are only moderately correlated. We suggest to implement performance feedback to reduce overconfidence, to improve the calibration of degree of support judgments, and to study the possibility of combining the judgments of examiners and the outcomes of computer-based methods to increase the overall validity.
Erwin J.A.T. Mattijssen, Cilia L.M. Witteman, Charles E.H. Berger, Nicolaas W. Brand & Reinoud D. Stoel, Validity and Reliability of Forensic Firearm Examiners. Forensic Sci. Int’l 2020, 307:110112.
Despite the characterization of examiner sensitivity and specificity as "quite high," the observed specificity was only 0.89, which corresponds to a false-positive rate of 11%—much higher than the <2% estimate quoted in recent judicial opinions. But the false-positive proportions from different experiments are not as discordant as they might appear to be when naively juxtaposed. To appreciate the sensitivity and specificity reported in this experiment, we need to understand the way that the validity test was constructed.
Design of the Study
The researchers fired two cartridges from each of two hundred 9 mm Luger Glock pistols seized in the Netherlands. These 400 test firings gave rise to true (same-source) and false (different-source) pairings of two-dimensional comparison images of the striation patterns on the cartridge cases. Specifically, there were 400 cartridge cases from which the researchers made "measurements of the striations of the firing pin aperture shear marks" and prepared "digital images [of magnifications of] the striation patterns using oblique lighting, optimized to show as many of the striations as possible while avoiding overexposure." (They also produced three-dimensional data, but I won't discuss those here.)
They invited forensic firearm examiners from Europe, North America, South America, Asia, and Oceania by e-mail to examine the images. Of the recipients, 112 participated, but only 77 completed the online questionnaire, which presented 60 comparisons of striation patterns as pairs of images aligned side by side. (The 400 images gave rise to (400×399)/2 = 79,800 distinct pairs, of which only 200 were same-source pairs. The researchers could hardly ask the volunteers to study all 79,800 pairs, so they used a computer program for matching such patterns to select 60 pairs that seemed to cover "the full range of comparison difficulty" but that overrepresented "difficult" pairs, an important choice that we'll come back to shortly. Of the 60, 38 were same-source pairs, and 22 were different-source pairs.)
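A quick way to verify these pair counts is to compute them directly. Here is a minimal Python sketch (mine, not the researchers'; the study involved no such script):

    from math import comb

    # 400 cartridge cases: 2 test fires from each of 200 Glock pistols
    n_cases, n_pistols, shots_per_pistol = 400, 200, 2

    total_pairs = comb(n_cases, 2)                             # 400*399/2 = 79,800 distinct pairs
    same_source_pairs = n_pistols * comb(shots_per_pistol, 2)  # 200 pairs from the same pistol
    different_source_pairs = total_pairs - same_source_pairs   # 79,600 pairs from different pistols

    print(total_pairs, same_source_pairs, different_source_pairs)  # 79800 200 79600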
The examiners first evaluated the degree of similarity on a five-point scale. Then they were shown the 60 pairs again and asked (1) whether the comparison supports the proposition that the striations on the cartridge cases resulted from firing with one (same-source) or with two (different-source) Glock pistols; (2) for their sense of the degree of support for this conclusion; and (3) whether they would have reported an inconclusive result in casework.
The degree of support was reported on a six-point scale of "weak support" (likelihood ratio L = 2 to 10), "moderate support" (L = 10 to 100), "moderately strong support" (L = 100 to 1,000), "strong support" (L = 1,000 to 10,000), "very strong support" (L = 10,000 to 1,000,000), and "extremely strong support" (L > 1,000,000). The computerized system mentioned above also generated numerical likelihood ratios. (The proximity of a pair's ratio to 1 was taken as a measure of the comparison's difficulty.)
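For readers who prefer to see the verbal scale as a lookup, here is an illustrative Python rendering of the categories and their likelihood-ratio ranges. The ranges are the ones reported in the article; how a boundary value such as L = 10 is assigned is my assumption, since the article does not say:

    # Six-point verbal scale with its likelihood-ratio (L) ranges, as quoted above.
    # Assigning boundary values (e.g., L = 10) to the higher category is an assumption.
    SUPPORT_SCALE = [
        ("weak support",              2,         10),
        ("moderate support",          10,        100),
        ("moderately strong support", 100,       1_000),
        ("strong support",            1_000,     10_000),
        ("very strong support",       10_000,    1_000_000),
        ("extremely strong support",  1_000_000, float("inf")),
    ]

    def verbal_category(likelihood_ratio):
        """Return the verbal label for a numerical likelihood ratio, or None if L < 2."""
        for label, low, high in SUPPORT_SCALE:
            if low <= likelihood_ratio < high:
                return label
        return None

    print(verbal_category(500))        # moderately strong support
    print(verbal_category(2_000_000))  # extremely strong support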
A Few of the Findings
For the 60 two-dimensional comparisons, the computer program and the human examiners performed as follows:
Table 1. Results, reported as (computer | examiners), excluding pairs the examiners deemed inconclusive.

                    SS pair                   DS pair
    SS outcome      (36 | 2365)               (10 | 95)
    DS outcome      (2 | 74)                  (12 | 784)
    validity        sens = (.95 | .97)        spec = (.55 | .89)
                    FNP  = (.05 | .03)        FPP  = (.45 | .11)

Abbreviations: SS = same source; DS = different source; sens = sensitivity; spec = specificity; FNP = false-negative proportion; FPP = false-positive proportion.
Table 1 combines two of the tables in the article. The entry "36 | 2365," for example, denotes that the computer program correctly classified as same-source 36 of the 38 same-source pairs (95%), while the examinations of the 77 examiners correctly classified 2,365 pairs out of the 2,439 same-source comparisons (97%) that they did not consider inconclusive. The computer program did not have the option to avoid a conclusion (or rather a likelihood ratio) in borderline cases. When examiners' conclusions on the cases they would have called inconclusive in practice were added in, the sensitivity and specificity dropped to 0.93 and 0.81, respectively.
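To make the arithmetic behind Table 1 explicit, here is a small Python sketch (not part of the study) that re-derives the sensitivity, specificity, and error proportions from the cell counts:

    # Cell counts from Table 1: examiners (excluding would-be inconclusives) and computer.
    examiners = {"SS_called_SS": 2365, "SS_called_DS": 74,
                 "DS_called_SS": 95,   "DS_called_DS": 784}
    computer  = {"SS_called_SS": 36,   "SS_called_DS": 2,
                 "DS_called_SS": 10,   "DS_called_DS": 12}

    def rates(c):
        sens = c["SS_called_SS"] / (c["SS_called_SS"] + c["SS_called_DS"])  # true-positive rate
        spec = c["DS_called_DS"] / (c["DS_called_DS"] + c["DS_called_SS"])  # true-negative rate
        return sens, spec, 1 - sens, 1 - spec                               # plus FNP and FPP

    print("examiners:", [round(x, 2) for x in rates(examiners)])  # [0.97, 0.89, 0.03, 0.11]
    print("computer: ", [round(x, 2) for x in rates(computer)])   # [0.95, 0.55, 0.05, 0.45]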
Making Sense of These Results
The reason there are more comparisons for the examiners is that there were 77 of them and only one computer program. The 77 examiners made 77 × 60 = 4,620 comparisons, while the computer program made only 60 comparisons on the same pairs. Those pairs, as I noted earlier, were not a representative sample. On all 79,800 possible pairings of the test-fired cartridge cases, the tireless computer program's sensitivity and specificity were both 0.99. If we can assume that the human examiners would have done at least as well as the computer program that they outperformed (on the 60 more or less "difficult" pairs), their performance across all possible pairs would have been excellent.
An "Error Rate" for Court?
The experiment yields a false-positive "error rate" of 11% (the proportion of misclassifications among the examiners' conclusive judgments on the 22 different-source pairs). If we conceive of the examiners' judgments as a random sample from some hypothetical population of identically conducted experiments, then the true false-positive error probability could be somewhat higher (as emphasized in the PCAST report) or lower. How should such numbers be used in admissibility rulings under Daubert v. Merrell Dow Pharmaceuticals, Inc.? At trial, to give the jury a sense of the chance of a false-positive error (as PCAST also urged)?
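To illustrate the sampling-variability point (without endorsing any particular interval), here is a rough Python sketch of a 95% Wilson score interval around the examiners' false-positive proportion of 95/879. Treating the comparisons as independent Bernoulli trials is itself a strong assumption, since only 22 distinct pairs and 77 examiners were involved:

    from math import sqrt

    def wilson_interval(successes, n, z=1.96):
        """Approximate 95% Wilson score interval for a binomial proportion."""
        p = successes / n
        denom = 1 + z**2 / n
        center = (p + z**2 / (2 * n)) / denom
        half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
        return center - half, center + half

    # 95 false positives among the examiners' 879 conclusive different-source comparisons
    low, high = wilson_interval(95, 879)
    print(round(95 / 879, 3), round(low, 3), round(high, 3))  # 0.108 0.089 0.13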
For admissibility, Daubert referred (indirectly) to false-positive proportions in particular studies of "voice prints," although the more apt statistic would be a likelihood ratio for a positive classification. For the Netherlands study, that would be L+ = Pr(+|SS) / Pr(+|DS) ≈ 0.97/0.11 = 8.8. In words, it is almost nine times more probable that an examiner (like the ones in the study) will report a match when confronted with a randomly selected same-source pair of images than with a randomly selected different-source pair (from the set of 60 constructed for the experiment). That validates the examiners' general ability to distinguish between same- and different-source pairs, though only at a modest level of accuracy in that sample.
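As a check on that arithmetic, the same likelihood ratio can be computed either from the rounded rates quoted above or directly from the Table 1 counts (my calculation, not the article's):

    # Likelihood ratio for a reported same-source classification: L+ = Pr(+|SS) / Pr(+|DS)
    print(round(0.97 / 0.11, 1))     # 8.8, using the rounded rates quoted in the text

    sens = 2365 / (2365 + 74)        # examiners' sensitivity from the Table 1 counts
    fpp = 95 / (95 + 784)            # examiners' false-positive proportion
    print(round(sens / fpp, 1))      # 9.0, essentially the same figure without rounding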
But to what extent can or should this figure (or just the false-positive probability estimate of 0.11) be presented to a factfinder as a measure of the probative value of a declared match? In this regard, arguments arise over the appropriateness of presenting an average figure in a specific case (although that seems like a common enough practice in statistics) and over the realism of the experiment. The researchers warn that
For several reasons it is not possible to directly relate the true positive and true negative rates, and the false positive and false negative rates of the examiners in this study to actual casework. One of these reasons is that the 60 comparisons we used were selected to over-represent ‘difficult’ comparisons. In addition, the use of the online questionnaire did not enable the examiners to manually compare the features of the cartridge cases as they would normally do in casework. They could not include in their considerations the features of other firearm components, and their results and conclusions were not peer reviewed. Enabling examiners to follow their standard operating procedures could result in better performance.
There Is More
Other facets of the paper also make it recommended reading. Data on the reliability of conclusions (both within and across examiners) are presented, and an analysis of the calibration of examiners' judgments of how strongly the images supported their source attributions led the authors to remark that
When we look at the actual proportion of misleading choices, the examiners judged lower relative frequencies of occurrence (and thus more extreme LRs) than expected if their judgments would have been well-calibrated. This can be seen as overconfidence, where examiners provide unwarranted support for either same-source or different-source propositions, resulting in LRs that are too high or too low, respectively. ... Simply warning examiners about overconfidence or asking them to explain their judgments does not necessarily decrease overconfidence of judgments.