Sunday, June 23, 2019

The Miami Dade Bullet-matching Study Surfaces in United States v. Romero-Lobato

Last month, the US District Court for the District of Nevada rejected another challenge to firearms toolmark comparisons. The opinion in United States v. Romero-Lobato, 1/ written by Judge Larry R. Hicks, relies in part on a six-year-old study that has yet to appear in any scientific journal. 2/ The National Institute of Justice (the research-and-development arm of the Department of Justice) funded the Miami-Dade Police Department Crime Laboratory "to evaluate the repeatability and uniqueness of striations imparted by consecutively manufactured EBIS barrels with the same EBIS pattern to spent bullets as well as to determine the error rate for the identification of same gun evidence." 3/ Judge Hicks describes the 2013 study as follows:
The Miami-Dade Study was conducted in direct response to the NAS Report and was designed as a blind study to test the potential error rate for matching fired bullets to specific guns. It examined ten consecutively manufactured barrels from the same manufacturer (Glock) and bullets fired from them to determine if firearm examiners (165 in total) could accurately match the bullets to the barrel. 150 blind test examination kits were sent to forensics laboratories across the United States. The Miami-Dade Study found a potential error rate of less than 1.2% and an error rate by the participants of approximately 0.007%. The Study concluded that “a trained firearm and tool mark examiner with two years of training, regardless of experience, will correctly identify same gun evidence.”
The "NAS Report" was the work of a large committee of scientists, forensic-science practitioners, lawyers, and others assembled by the National Academy of Sciences to recommend improvements in forensic science. A federal judge and a biostatistician co-chaired the committee. In 2009, four years after Congress funded the project, the report arrived. It emphasized the need to measure the error probabilities in pattern-matching tasks and discussed what statisticians call two-by-two contingency tables for estimating the sensitivity (true-positive probability) and specificity (true-negative probability) of the classifications. However, the Miami-Dade study was not designed to measure these quantities. To understand what it did measure, let's look at some of the details in the report to NIJ as well as what the court gleaned from the report (directly or indirectly).

A Blind Study?

The study was not blind in the sense of the subjects not realizing that they were being tested. They surely knew that they were not performing normal casework when they received the unusual samples and the special questionnaire with the heading "Answer Sheet: Consecutively Rifled EBIS-2 Test Set" asking such questions as "Is your Laboratory ASCLD/Lab Accredited?" That is not a fatal flaw, but it has some bearing -- not recognized in the report's sections on "external validity" -- on generalizing from the experimental findings to casework. 4/

Volunteer Subjects?

The "150 blind examination kits" somehow went to 201 examiners, not just in the United States, but also in "4 international countries." 5/ The researchers did not consider or reveal the performance of 36 "participants [who] did not meet the two year training requirement for this study." (P. 26). How well they did in comparison to their more experienced colleagues would have been worth knowng, although it would have been hard to draw a clear concolusions since there so few errors on the test. In any event, ignoring the responses from the trainees "resulted in a data-producing sample of 165 participants." (P. 26).

These research subjects came from emails sent to "the membership list for the Association of Firearm and Tool Mark Examiners (AFTE)." (Pp. 15-16). AFTE members all "derive[] a substantial portion of [their] livelihood from the examination, identification, and evaluation of firearms and related materials and/or tool marks." (P. 15). Only 35 of the 165 volunteers were certified by AFTE (p. 30), and 20 worked at unaccredited laboratories (P. 31).

What Error Rates?

Nine of the 165 fully trained subjects (5%) made errors (treating "inconclusive" as a correct response). The usual error rates (false positives and false negatives) are not reported because of the design of the "blind examination kits." The obvious way to obtain those error rates is to ask each subject to evaluate pairs of items -- some from the same source and some from different sources (with the examiners blinded to the true source information known to the researchers). Despite the desire to respond to the NAS report, the Miami-Dade Police Department Crime Laboratory did not make "kits" consisting of such a mixture of pairs of same-source and different-source bullets.

Instead, the researchers gave each subject a single collection of ten bullets produced by firing one manufacturer's ammunition in eight of the ten barrels. (Two of these "questioned bullets," as I will call them, came from barrel 3 and two from barrel 9; none came from barrel 4.) Along with the ten questioned bullets, they gave the subjects eight pairs of what we can call "exemplar bullets." Each pair of exemplar bullets consisted of two test fires from a single barrel, and the exemplar barrels were eight of the ten consecutively manufactured barrels (barrels 1-3 and 5-9). The task was to associate each questioned bullet with an exemplar pair or to decide that it could not be associated with any of the eight pairs. Or, the research subjects could circle "inconclusive" on the questionnaire. Notice that almost all the questioned bullets came from the barrels that produced the exemplar bullets -- only two such barrels were not a source of an unknown -- and only one barrel that produced a questioned bullet was not among the exemplar barrels.

This complicated and unbalanced design raises several questions. After associating an unknown bullet with an exemplar pair, will an examiner seriously consider the other exemplar pairs? After eliminating a questioned bullet as originating from, say, seven of the exemplar-pair barrels, would he be inclined to attribute it to one of the three barrels not yet ruled out? Because of the extreme overlap in the sets, such strategies would pay off on average. Such interactions could make false eliminations less probable, and true associations more probable, than with the simpler design of a series of single questioned-to-source comparisons.
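To see how a nearly closed set can reward a process-of-elimination strategy, consider a toy Monte Carlo sketch. Everything in it is my own assumption for illustration -- a 95% chance of correctly eliminating each non-matching exemplar pair, nine of the ten questioned bullets having their source pair in the kit, a rule of never eliminating the true source, and a willingness to call the lone surviving pair an "identification." None of these figures comes from the study.

```python
import random

random.seed(1)

N_TRIALS = 100_000
N_PAIRS = 8            # exemplar pairs in each kit
P_SOURCE_IN_KIT = 0.9  # assumed: 9 of the 10 questioned bullets have a source pair in the kit
P_ELIMINATE = 0.95     # assumed chance of correctly eliminating any one non-matching pair

correct_ids = false_ids = 0
for _ in range(N_TRIALS):
    source_in_kit = random.random() < P_SOURCE_IN_KIT
    n_nonmatching = N_PAIRS - 1 if source_in_kit else N_PAIRS
    # Each non-matching pair survives elimination with probability 1 - P_ELIMINATE.
    surviving_nonmatching = sum(random.random() > P_ELIMINATE for _ in range(n_nonmatching))
    # Toy assumption: the true source pair is never eliminated, but the examiner has
    # no ability to identify it positively -- elimination is the only skill modeled.
    survivors = surviving_nonmatching + (1 if source_in_kit else 0)
    if survivors == 1:  # strategy: call the lone surviving pair an identification
        if source_in_kit and surviving_nonmatching == 0:
            correct_ids += 1
        else:
            false_ids += 1

print(f"correct identifications from elimination alone: {correct_ids / N_TRIALS:.1%}")
print(f"false identifications from elimination alone:   {false_ids / N_TRIALS:.1%}")
```

In this toy model, an examiner with no ability to make positive associations at all still "identifies" the correct pair for roughly three out of five questioned bullets, simply because the set is nearly closed. In a design in which few questioned bullets had a source in the kit, the same strategy would yield mostly false identifications.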

The report to NIJ does not indicate that the subjects received any instructions to prevent them from having an expectation that most of the questioned bullets would match some pair of exemplar bullets. The only instructions it mentions are on a questionnaire that reads:
Please microscopically compare the known test shots from each of the 8 barrels with the 10 questioned bullets submitted. Indicate your conclusion(s) by circling the appropriate known test fired set number designator on the same line as the alpha unknown bullet. You also have the option of Inconclusive and Elimination. ...
Yet, the report confidently asserts that "[t]he researchers utilized an 'open set' design where the participants had no expectation that all unknown tool marks should match one or more of the unknowns." (P. 28).

To be sure, the study has some value in demonstrating that these subjects could perform a presumably difficult task: associating unknown bullets with exemplar ones. Moreover, whatever one thinks of this alleged proof of "uniqueness," the results imply that there are microscopic (or other) features of marks on bullets that vary with the barrel through which they traveled. But the study does not supply a good measure of examiner skill at making associations in fully "open" situations.

A 0.007% Error Rate?

As noted above, but not in the court's opinion, 5% of the examiners made some kind of error. That said, there were only 12 false-positive associations or false-negative ones (outright eliminations) out of 165 x 10 = 1,650 answers. (I am assuming that every subject completed the questionnaire for every unknown bullet.) That is an overall error proportion of 12/1650 = 0.007 = 0.7%.

The researchers computed the error rate slightly differently. They only reported the average error rate for the 165 experienced examiners. The vast majority (156) made no errors. Six made 1 error, and 3 made 2. So the average examiner's proportion of errors was [156(0) + 6(0.1) + 3(0.2)]/165 = 0.007. No difference at all.
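For anyone who wants to check the arithmetic, a few lines of Python will do. The only assumption is the one stated above -- that every examiner evaluated all ten unknowns.

```python
# Figures quoted above from the report to NIJ.
errors = 12               # false associations plus outright eliminations
examiners = 165
answers = examiners * 10  # assuming every examiner evaluated all 10 unknowns

overall_proportion = errors / answers

# Per-examiner error rates: 156 examiners with none, 6 with one error, 3 with two.
rates = [0.0] * 156 + [0.1] * 6 + [0.2] * 3
average_examiner_rate = sum(rates) / len(rates)

print(f"overall error proportion:    {overall_proportion:.4f}")    # 0.0073
print(f"average examiner error rate: {average_examiner_rate:.4f}") # 0.0073
```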

This 0.007 figure is 100 times the number the court gave. Perhaps the opinion had a typographical error -- an adscititious percentage sign that the court missed when it reissued its opinion (to correct other typographical errors). The error rate is still small and would not affect the court's reasoning.

But the overall proportion of errors and the average-examiner error rate could diverge. The report gives the error proportions for the 9 examiners who made errors as 0.1 (6 of the examiners) and 0.2 (another 3 examiners). Apparently, all 9 of the examiners who erred evaluated all 10 unknowns. What about the other 156 examiners? Did all of them evaluate all 10? The worst-case scenario is that every one of the 156 error-free examiners answered only one question. That supplies only 156 correct answers. Add them to the 90 answers (78 correct, 12 incorrect) from the examiners who erred, and we have an error proportion of 12/246 = 0.05 = 5% -- roughly seven times the overall proportion computed above.

However, this worst-case scenario did not occur. The funding report states that "[t]here were 1,496 correct answers, 12 incorrect answers and 142 inconclusive answers." (P. 15). The sum of these numbers is 1,650. Did every examiner answer every question? Apparently so. With this 100% completion rate, the examiner average coincides with the overall error proportion, and the report's emphasis on the former is a distinction without a difference.
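For concreteness, the sketch below contrasts the actual situation (every examiner answered all ten questions) with the hypothetical worst case described above, in which the 156 error-free examiners answered only one question apiece. The examiner average does not budge, while the overall proportion jumps to about 5%.

```python
def overall_and_average(records):
    """records: one (errors, answers) tuple per examiner."""
    overall = sum(e for e, _ in records) / sum(a for _, a in records)
    average = sum(e / a for e, a in records) / len(records)
    return overall, average

# Actual scenario: all 165 examiners answered all 10 questions.
actual = [(0, 10)] * 156 + [(1, 10)] * 6 + [(2, 10)] * 3
# Hypothetical worst case: the 156 error-free examiners answered only 1 question each.
worst_case = [(0, 1)] * 156 + [(1, 10)] * 6 + [(2, 10)] * 3

for label, records in [("actual", actual), ("worst case", worst_case)]:
    overall, average = overall_and_average(records)
    print(f"{label:>10}: overall = {overall:.3f}, examiner average = {average:.3f}")
```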

There is a further issue with the number itself. "Inconclusives" are not correct associations. If every examiner had come back with "inconclusive" for every questioned bullet, the researchers could hardly have reported the resulting zero error rate as validating bullet-matching. 6/ From the trial court's viewpoint, inconclusives just do not count. They do not produce testimony of false associations or of false eliminations. The sensible thing to do, in ascertaining error rates for Daubert purposes, is to toss out all "inconclusives."

Doing so here makes little difference. There were 142 inconclusive answers. (P. 15). If these were merely "not used to calculate the overall average error rates," as the report claims (p. 32), the overall error proportion was 12/(1,650 - 142) = 12/1,508 = 0.008 -- still very small (but still difficult to interpret in terms of the parameters of accuracy for two-by-two tables).

The report to NIJ discussed another finding that, at first blush, could be relevant to the evidence in this case: "Three of these 35 AFTE certified participants reported a total of four errors, resulting in an error rate of 0.011 for AFTE Certified participants." (P. 30). Counter-intuitively, this 1% average is larger than the reported average error rate of 0.007 for all the examiners.

That the certified examiners did worse than the uncertified ones may be a fluke. The standard deviation of the individual examiners' error rates was 0.032 (p. 29), which indicates that, despite the observed difference in the sample data, the study does not reveal whether certified examiners generally do better or worse than uncertified ones. 7/
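A quick way to see why is a permutation check -- my own back-of-the-envelope sketch, not an analysis from the report. Under the hypothesis that certification has nothing to do with error, shuffle the 165 examiners' error counts and ask how often a randomly drawn group of 35 racks up four or more errors.

```python
import random

random.seed(1)

# Error counts per examiner, as reported: 156 with none, 6 with one, 3 with two.
error_counts = [0] * 156 + [1] * 6 + [2] * 3
n_certified = 35      # AFTE-certified participants
observed_errors = 4   # total errors reported for the certified group

n_permutations = 100_000
at_least_as_bad = sum(
    sum(random.sample(error_counts, n_certified)) >= observed_errors
    for _ in range(n_permutations)
)

print(f"random groups of 35 with >= 4 errors: {at_least_as_bad / n_permutations:.0%}")
```

In runs of this sketch, a sizable share of randomly composed groups do at least as badly as the certified examiners did, so the observed gap is well within what chance alone could produce.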

A Potential Error Rate?

Finally, the court's reference to "a potential error rate of less than 1.2%" deserves mention. The "potential error rate" is tricky. Potentially, the error rate of individual practitioners like the ones who volunteered for the study, with no verification step by another examiner, could be larger (or smaller) than the rate observed in the experiment. There is no sharp and certain line that can be drawn for the maximum possible error rate (short of the trivial bound of 100%).

In this case, 1.2% is the upper limit of a two-sided confidence interval. The Miami-Dade authors wrote:
A 95% confidence interval for the average error rate, based on the large sample distribution of the sample average error rate, is between 0.002 and 0.012. Using a confidence interval of 95%, the error rate is no more than 0.012, or 1.2%.
A 95% confidence interval means that if there had been a large number of volunteer studies just like this one, making random draws from an unchanging population of volunteer-examiners and having these examiners perform the same task in the same way, about 95% of the many resulting confidence intervals would encompass the true value for the entire population. But the hypothetical confidence intervals would vary from one experiment to the next. We have a statistical process -- a sort of butterfly net -- that is broad enough to capture the unknown butterfly in about 95% of our swipes. The weird thing is that with each swipe, the size and center of the net change. On the Miami-Dade swipe, one end of the net stretched out to the average error rate of 1.2%.
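The reported interval can be reconstructed, at least approximately, from the per-examiner error rates -- assuming (and this is my assumption, not a statement of the report's method) that each examiner's error rate was treated as one observation and a normal approximation was used:

```python
import statistics

# Per-examiner error rates quoted above: 156 with none, 6 at 0.1, 3 at 0.2.
rates = [0.0] * 156 + [0.1] * 6 + [0.2] * 3

mean = statistics.mean(rates)   # about 0.0073
sd = statistics.stdev(rates)    # about 0.032
se = sd / len(rates) ** 0.5     # about 0.0025
lower, upper = mean - 1.96 * se, mean + 1.96 * se

print(f"average error rate {mean:.4f}, 95% CI ({lower:.3f}, {upper:.3f})")  # about (0.002, 0.012)
```

The arithmetic lands on the reported endpoints, which is reassuring, but it does not change what the interval means.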

So the court was literally correct. There is "a potential error rate" of 1.2%. There is also a higher potential error rate that could be formulated -- just ask for 99% "confidence." Or lower -- try 90% confidence. And for every confidence interval that could be constructed by varying the confidence coefficient, there is the potential for the average error rate to exceed the upper limit. Such is the nature of a random variable. Randomness does not make the upper end of the estimate implausible. It just means that it is not "the potential error rate," but rather a clue to how large the actual rate of error for repeated experiments could be.

Contrary to the suggestion in Romero-Lobato, that statistic is not the "potential rate of error" mentioned in the Supreme Court's opinion in Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579 (1993). The opinion advises judges to "ordinarily ... consider the known or potential rate of error, see, e.g., United States v. Smith, 869 F. 2d 348, 353-354 (CA7 1989) (surveying studies of the error rate of spectrographic voice identification technique)." The idea is that along with the validity of an underlying theory, how well "a particular scientific technique" works in practice affects the admissibility of evidence generated with that technique. When the technique consists of comparing things like voice spectrograms, the rates at which the process yields correct and incorrect results in experiments like the ones noted in Smith are known error rates. That is, they are known for the sample of comparisons in the experiment. (The value for all possible examiners' comparisons is never known.)

These experimentally determined error rates are also a "potential rate of error" for the technique as practiced in case work. The sentence in Daubert that speaks to "rate of error" continues by adding, as part of the error-rate issue, "the existence and maintenance of standards controlling the technique's operation, see United States v. Williams, 583 F. 2d 1194, 1198 (CA2 1978) (noting professional organization's standard governing spectrographic analysis)." The experimental testing of the technique shows that it can work -- potentially; controlling standards ensure that it will be applied consistently and appropriately to achieve this known potential. Thus, Daubert's reference to "potential" rates does not translate into a command to regard the upper confidence limit (which merely accounts for sampling error in the experiment) as a potential error rate for practical use.

NOTES
  1. No. 3:18-cr-00049-LRH-CBC, 2019 WL 2150938 (D. Nev. May 16, 2019).
  2. That is my impression anyway. The court cites the study as Thomas G. Fadul, Jr., et al., An Empirical Study to Improve the Scientific Foundation of Forensic Firearm and Tool Mark Identification Utilizing Consecutively Manufactured Glock EBIS Barrels with the Same EBIS Pattern (2013), available at https://www.ncjrs.gov/pdffiles1/nij/grants/244232.pdf. The references in Ronald Nichols, Firearm and Toolmark Identification: The Scientific Reliability of the Forensic Science Discipline 133 (2018) (London: Academic Press), also do not indicate a subsequent publication.
  3. P. 3. The first of the two "research hypotheses" was that "[t]rained firearm and tool mark examiners will be able to correctly identify unknown bullets to the firearms that fired them when examining bullets fired through consecutively manufactured barrels with the same EBIS pattern utilizing individual, unique and repeatable striations." (P. 13). The phrase "individual, unique and repeatable striations" begs a question or two.
  4. The researchers were comforted by the thought that "[t]he external validity strength of this research project was that all testing was conducted in a crime laboratory setting." (P. 25). As secondary sources of external validity, they noted that "[p]articipants utilized a comparison microscope," "[t]he participants were trained firearm and tool mark examiners," "[t]he training and experience of the participants strengthened the external validity," and "[t]he number of participants exceeded the minimum sample size needed to be statistically significant." Id. Of course, it is not the "sample size" that is statistically significant, but only a statistic that summarizes an aspect of the data (other than the number of observations).
  5. P. 26 ("A total of 201 examiners representing 125 crime laboratories in 41 states, the District of Columbia, and 4 international countries completed the Consecutively Rifled EBIS-2 Test Set questionnaire/answer sheet.").
  6. Indeed, some observers might argue that an "inconclusive" when there is ample information to reach a conclusion is just wrong. In this context, however, that argument is not persuasive. Certainly, "inconclusives" can be missed opportunities that should be of concern to criminalists, but they are not outright false positives or false negatives.
  7. The opinion does not state whether the examiner in the case -- "Steven Johnson, a supervising criminalist in the Forensic Science Division of the Washoe County Sheriff's Office" -- is certified or not, but it holds that he is "competent to testify" as an expert.
