Saturday, July 9, 2022

Preliminary Results from a Blind Quality Control Program

The Houston Forensic Science Center recently reported the results of realistic, blind tests of its firearms examiners. Realism comes from disguising materials to look like actual casework and injecting these "mock evidence items" into the regular flow of business. The judgments of the examiners for the mock cases can be evaluated with respect to the true state of affairs (ammunition components from the same firearm as opposed to components from different firearms). Eagerly, I looked for a report of how often the examiners declared an association for pairs of items that were not associated with one another (false "identifications") and how often they declared that there was no association for pairs that were in fact associated (false "eliminations").

These kinds of conditional "error rates" are by no means all there is to quality control and to improving examiner performance, which is the salutary objective of the Houston lab, but they are prominent in judicial opinions on the admissibility of firearms-toolmark evidence. So too, they (along with the cognate statistics of specificity and sensitivity) are established measures of the validity of tests for the presence or absence of a condition. Yet, I searched in vain for clear statements of these standard measures of examiner performance in the article by Maddisen Neuman, Callan Hundl, Aimee Grimaldi, Donna Eudaley, Darrell Stein and Peter Stout on "Blind Testing in Firearms: Preliminary Results from a Blind Quality Control Program," 67(3) J. Forensic Sci. 964-974 (2022).

Instead, tables use a definition of "ground truth" that includes materials being intentionally "insufficient" or "unsuitable" for analysis, and they focus on whether "[t]he reported results either matched the ground truth or resulted in an inconclusive decision." (Here, "inconclusive" is different from "insufficient" and "unsuitable." For the sake of readers who are unfamiliar with firearms argot, Table 1 defines--or tries to--the terminology for describing the outcomes of the mock cases.)

TABLE 1. Statements for the Outcome of an Examination
(adapted from p. 966 tbl. 1)

Binary (Yes/No) Source Conclusions

Identification: A sufficient correspondence of individual characteristics will lead the examiner to the conclusion that both items (evidence and tests) originated from the same source.
Elimination: A disagreement of class characteristics will lead the examiner to the conclusion that the items did not originate from the same source. In some instances, it may be possible to support a finding of elimination even though the class characteristics are similar when there is marked disagreement of individual characteristics.
Statements of No Source Conclusion

Unsuitable: A lack of suitable microscopic characteristics will lead the examiner to the conclusion that the items are unsuitable for identification.
Insufficient: Examiners may render an opinion that markings on an item are insufficient when:
• an item has discernible class characteristics but no individual characteristics;
• an item does not exhibit class characteristics and has only a few individual characteristics of such poor quality that they preclude the examiner from rendering an opinion;
• the examiner cannot determine if markings on an item were made by a firearm during the firing process; or
• the examiner cannot determine if markings are individual or subclass.
Inconclusive: An insufficient correspondence of individual and/or class characteristics will lead the examiner to the conclusion that no identification or elimination could be made with respect to the items examined.
Note on "identification": The identification of cartridge case/bullet toolmarks is made to the practical, not absolute, exclusion of all other firearms. This is because it is not possible to examine all firearms in the world, a prerequisite for absolute certainty. The conclusion that sufficient agreement for identification exists between toolmarks means that the likelihood that another firearm could have made the questioned toolmarks is so remote as to be considered a practical impossibility.

There were 51 mock cases containing anywhere from 2 to 41 items (median = 9). In the course of the five-and-a-half-year study, 460 items were examined, for a total of 570 judgments by only 11 firearms examiners, whose experience ranged from 5.5 to 23 years. The mock evidence varied greatly in its informativeness, and the article suggests that the lab sought to use a greater proportion of challenging cases than might be typical.

Whether or not the study is generalizable to other examiners, laboratories, and cases, the authors write that "no hard errors were observed; that is, no identifications were declared for true nonmatching pairs, and no eliminations were declared for true matching pairs." This sounds great, but how probative is the observation of "no hard errors"?

Table 3 of the article states that there were 143 false pairs, of which 106 were designated inconclusive. It looks like the examiners were hesitant to make an elimination, even for a false pair. They made only 37 eliminations. Since there were no "hard errors," none of the false pairs were misclassified as identifications. Ignoring inconclusives, which are not presented as evidence for or against an association, the observed false-identification rate therefore was 0/37. Using the rule of three for a quick approximation, we can estimate the 95% confidence interval as going from 0 to 3/37. To use phrasing like that in the 2016 PCAST Report, the false-positive rate could be as large as about 1 in 12.
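The rule of three comes from asking what error probability p would still leave a 5% chance of producing zero errors in n independent trials: solving (1 - p)^n = 0.05 gives p = 1 - 0.05^(1/n), and because -ln(0.05) is very nearly 3, this is approximately 3/n. A minimal Python sketch (my own illustration, not anything from the article) checks the approximation for n = 37:

    def rule_of_three(n):
        """Approximate one-sided 95% upper confidence limit for a
        proportion when 0 events are observed in n trials."""
        return 3 / n

    def exact_upper_limit(n, alpha=0.05):
        """Exact one-sided upper limit: solve (1 - p)**n = alpha for p."""
        return 1 - alpha ** (1 / n)

    print(rule_of_three(37))       # 0.0811..., about 1 in 12
    print(exact_upper_limit(37))   # 0.0778..., about 1 in 13

The exact limit is a bit smaller, so the rule of three errs slightly on the side of caution.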

Applying the same reasoning to the 386 true pairs, of which 119 were designated inconclusive, the observed false-elimination rate must have been 0/267. The 95% confidence interval for the false-elimination rate thus extends to about 3/267, or 1/89.
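Running the same sketch with n = 267 reproduces these figures:

    print(rule_of_three(267))      # 0.0112..., about 1 in 89
    print(exact_upper_limit(267))  # 0.0112..., about 1 in 90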

These confidence intervals should not be taken too seriously. The simple binomial probability model implicit in the calculations does not hold for dependent comparisons. To quote the authors (p. 968), "Because the data were examined at the comparison level, an item of evidence can appear in the data set in multiple comparisons and be represented by multiple comparison conclusions. For example, Item 1 may have been compared to Item 2 and Item 3 with comparison conclusions of elimination and identification, respectively." Moreover, I could be misconstruing the tables. Finally, even if the numbers are all on target, they should not be taken as proof that error rates are as high as the upper confidence limits. The intervals are merely indications of the uncertainty in using particular numbers as estimates of long-term error rates.

In short, the "blind quality control" program is a valuable supplement to minimal-competency proficiency testing. The absence of false identifications and false eliminations is encouraging, but the power of this study to pin down the probability of errors at the Houston laboratory is limited.
