Saturday, June 24, 2023

Maryland Supreme Court Resists "Unqualified" Firearms-toolmark Testimony

This week, the Maryland Supreme Court became the first state supreme court to hold, unequivocally, that a firearms-toolmark examiner may not testify that a bullet was fired from a particular gun without a disclaimer indicating that source attribution is not a scientific or practical certainty. \1/ The opinion followed two trials, \2/ two evidentiary hearings (one on general scientific acceptance \3/ and one on the scientific validity of firearms-toolmark identifications \4/), but no affidavits or testimony from experts in research methods or statistics. The Maryland court did not discuss the content of the required disclaimer. It merely demanded that the qualified expert's opinion not be "unqualified." In addition, the opinions are limited to source attributions via the traditional procedure of judging presumed "individual" microscopic features with no standardized rules for concluding that the markings match.

The state contended that Kobina Ebo Abruquah murdered a roommate by shooting him five times, including once in the back of the head. A significant part of the state's case came from a firearms examiner for the Prince George’s County Police Department. The examiner "opined that four bullets and one bullet fragment ... 'at some point had been fired from [a Taurus .38 Special revolver belonging to Mr. Abruquah].'" A bare majority of four justices agreed that admission of the opinion was an abuse of the trial court's discretion. Three justices strongly disputed this conclusion. Two of the three opinions in the case included tables displaying counts or percentages from experiments in which analysts compared sets of either bullets or cartridge casings fired from a few types of handguns to ascertain how frequently their source attributions and exclusions were correct and how often they were wrong.

There is a lot one might say about these opinions, but here I attend only to the statistical parts. \5/ As noted below (endnotes 3 and 4), neither party produced any statisticians or research scientists with training or extensive experience in applying statistical methods. The court did not refer to the recent, burgeoning literature on "error rates" in examiner-performance studies. Instead, the opinions drew on (or questioned) the analysis in the 2016 report of the President's Council of Advisors on Science and Technology (PCAST). The report essentially dismissed the vast majority of the research on which one expert for the state (James Hamby, a towering figure in the firearms-toolmark examiner community) relied. These studies, PCAST explained, usually asked examiners to match a set of questioned bullets to a set of guns that fired them.

A dissenting opinion of Justice Steven Gould argued—with the aid of a couple of probability calculations—that the extremely small number of false matches in these "closed set" studies demonstrated that examiners were able to perform much better than would be expected if they were just guessing when they lined up the bullets with the guns. \6/

That is a fair point. "Closed set" studies can show that examiners are extracting some discriminating information. But they do not lend themselves to good estimates of the probability of false identifications and false exclusions. For answering the "error rate" question in Daubert, they indicate lower bounds—the conditional error probabilities for examiners under the test conditions could be close to zero, but they could be considerably larger.

More useful experiments simulate what examiners do in cases like Abruquah—where they decide whether a specific gun fired a questioned bullet that could instead have come from a gun outside any enumerated set of knowns. To accomplish this, the experiment can pair test bullets fired from a known gun with a "questioned" bullet and have the examiner report whether the questioned bullet did or did not travel through the barrel of the tested gun.

The majority opinion, written by Chief Justice Matthew Fader, discussed two such experiments known as "Ames I" and "Ames II" (because they were done at the Ames National Laboratory, "a government-owned, contractor-operated national laboratory of the U.S. Department of Energy, operated by and located on the campus of Iowa State University in Ames, Iowa"). The first experiment, funded by the Department of Defense and completed in 2014, "was designed to provide a better understanding of the error rates associated with the forensic comparison of fired cartridge cases." The experiment did not investigate performance with regard to toolmarks on the projectiles (the bullets themselves) propelled from the cases, through the barrel of a gun, and beyond. Apparently referring to the closed-set kind of studies, the researchers observed that "[five] previous studies have been carried out to examine this and related issues of individualization and durability of marks ... , but the design of these previous studies, whether intended to measure error rates or not, did not include truly independent sample sets that would allow the unbiased determination of false-positive or false-negative error rates from the data in those studies." \7/

However, their self-published technical report does not present the results in the kind of classification table that statisticians would expect. Part of such a table, adapted from an earlier posting on this blog, follows: \8/

The researchers enrolled 284 volunteer examiners in the study, and 218 submitted answers (raising an issue of selection bias). The 218 subjects (who obviously knew they were being tested) “made ... 15 comparisons of 3 knowns to 1 questioned cartridge case. For all participants, 5 of the sets were from known same-source firearms [known to the researchers but not the firearms examiners], and 10 of the sets were from known different-source firearms.” \7/ Ignoring “inconclusive” comparisons, the performance of the examiners is shown in Table 1.

Table 1. Outcomes of comparisons
(derived from pp. 15-16 of Baldwin et al.)

          ~S       S
 –E     1421       4     1425
 +E       22    1075     1097
        1443    1079
–E is a negative finding (the examiner decided there was no association).
+E is a positive finding (the examiner decided there was an association).
S indicates that the cartridge cases were fired from the same gun.
~S indicates that the cartridge cases were fired from different guns.

False negatives. Of the 4 + 1075 = 1079 judgments in which the gun was the same, 4 were negative. This false negative rate is Prop(–E|S) = 4/1079 = 0.37%. ("Prop" is short for "proportion," and "|" can be read as "given" or "out of all.") Treating the examiners tested as a random sample of all examiners of interest, and viewing their performance in the experiment as representative of their behavior in casework with materials comparable to those in the experiment, we can estimate the proportion of false negatives for all examiners. The point estimate is 0.37%. A 95% confidence interval is 0.10% to 0.95%. These numbers provide an estimate of how frequently all examiners would declare a negative association in all similar cases in which the association actually is positive. Instead of false negatives, we also can describe true positives, or sensitivity. The observed sensitivity is Prop(+E|S) = 1075/1079 = 99.63%. The 95% confidence interval around this estimate is 99.05% to 99.90%.

False positives. The observed false-positive rate is Prop(+E|~S) = 22/1443 = 1.52%, and the 95% confidence interval is 0.96% to 2.30%. The observed true-negative rate, or specificity, is Prop(–E|~S) = 1421/1443 = 98.48%, and its 95% confidence interval is 97.70% to 99.04%.
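
For readers who want to check the arithmetic, here is a minimal Python sketch (mine, not part of the Ames report or the court record). It recomputes the error proportions and intervals from the Table 1 counts, assuming exact (Clopper-Pearson) binomial confidence intervals, an assumption that is consistent with the figures quoted above.

    from scipy.stats import beta

    def clopper_pearson(errors, total, level=0.95):
        # Exact (Clopper-Pearson) confidence interval for a binomial proportion.
        alpha = 1 - level
        lo = 0.0 if errors == 0 else beta.ppf(alpha / 2, errors, total - errors + 1)
        hi = 1.0 if errors == total else beta.ppf(1 - alpha / 2, errors + 1, total - errors)
        return lo, hi

    # Counts from Table 1 (inconclusives excluded).
    for label, errors, total in [("false negative", 4, 1079),    # -E given S
                                 ("false positive", 22, 1443)]:  # +E given ~S
        lo, hi = clopper_pearson(errors, total)
        print(f"{label}: {errors}/{total} = {errors/total:.2%}, 95% CI {lo:.2%} to {hi:.2%}")

    # Prints approximately 0.37% (0.10% to 0.95%) and 1.52% (0.96% to 2.30%),
    # matching the figures in the text; the sensitivity and specificity
    # intervals are just the complements.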

Taken at face value, these results seem rather encouraging. On average, examiners displayed high levels of accuracy, both for cartridge cases fired from the same gun (better than 99% sensitivity) and for cartridge cases fired from different guns (better than 98% specificity).

I did not comment on the implications of the fact that analysts often opted out of the binary classifications by declaring that an examination was inconclusive. This reporting category has generated a small explosion of literature and argumentation. There are two extreme views. Some organizations and individuals maintain that, because the specimens either did or did not come from the same source, the failure to discern which of these two states of nature applies is an error. A more apt term would be "missed signal"; \9/ it is hardly obvious that Daubert's fleeting reference to "error rates" was meant to encompass not only false positives and negatives but also test results that are neither positive nor negative. At the other pole are claims that all inconclusive outcomes should be counted as correct in computing the false-positive and false-negative error proportions seen in an experiment.

Incredibly, the latter is the only way in which the Ames laboratory computed the error proportions. I would like to think that, had the report been subject to editorial review at a respected journal, a correction would have been made. Unchastened, the Ames laboratory again counted inconclusive responses only as if they were correct when it wrote up the results of its second study.

This lopsided treatment of inconclusives was an issue in Abruquah. The majority opinion described the two studies as follows (citations and footnotes omitted):

Of the 1,090 comparisons where the “known” and “unknown” cartridge cases were fired from the same source firearm, the examiners [in the Ames I study] incorrectly excluded only four cartridge cases, yielding a false-negative rate of 0.367%. Of the 2,180 comparisons where the “known” and “unknown” cartridge cases were fired from different firearms, the examiners incorrectly matched 22 cartridge cases, yielding a false-positive rate of 1.01%. However, of the non-matching comparison sets, 735, or 33.7%, were classified as inconclusive, a significantly higher percentage than in any closed-set study.

The Ames Laboratory later conducted a second open-set, black-box study that was completed in 2020 ... The Ames II Study ... enrolled 173 examiners for a three-phase study to test for ... foundational validity: accuracy (in Phase I), repeatability (in Phase II), and reproducibility (in Phase III). In each of three phases, each participating examiner received 15 comparison sets of known and unknown cartridge cases and 15 comparison sets of known and unknown bullets. The firearms used for the bullet comparisons were either Beretta or Ruger handguns and the firearms used for the cartridge case comparisons were either Beretta or Jimenez handguns. ... As with the Ames I Study, although there was a “ground truth” correct answer for each sample set, examiners were permitted to pick from among the full array of the AFTE Range of Conclusions—identification, elimination, or one of the three levels of “inconclusive.”

The first phase of testing was designed to assess accuracy of identification, “defined as the ability of an examiner to correctly identify a known match or eliminate a known nonmatch.” In the second phase, each examiner was given the same test set examined in phase one, without being told it was the same, to test repeatability, “defined as the ability of an examiner, when confronted with the exact same comparison once again, to reach the same conclusion as when first examined.” In the third phase, each examiner was given a test set that had previously been examined by one of the other examiners, to test reproducibility, “defined as the ability of a second examiner to evaluate a comparison set previously viewed by a different examiner and reach the same conclusion.”

In the first phase, ... [t]reating inconclusive results as appropriate answers, the authors identified a false negative rate for bullets and cartridge cases of 2.92% and 1.76%, respectively, and a false positive rate for each of 0.7% and 0.92%, respectively. Examiners selected one of the three categories of inconclusive for 20.5% of matching bullet sets and 65.3% of nonmatching bullet sets. [T]he results overall varied based on the type of handgun that produced the bullet/cartridge, with examiners’ results reflecting much greater certainty and correctness in classifying bullets and cartridge cases fired from the Beretta handguns than from the Ruger (for bullets) and Jimenez (for cartridge cases) handguns.

The opinion continues with a description of some statistics for the level of intra- and inter-examiner reliability observed in the Ames II study, but I won't pursue those here. The question of accuracy is enough for today. \10/ To some extent, the majority's confidence in the reported low error proportions (all under 3%) was shaken by the presence of inconclusives: "if at least some inconclusives should be treated as incorrect responses, then the rates of error in open-set studies performed to date are unreliable. Notably, if just the 'Inconclusive-A' responses—those for which the examiner thought there was almost enough agreement to identify a match—for non-matching bullets in the Ames II Study were counted as incorrect matches, the 'false positive' rate would balloon from 0.7% to 10.13%."
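
The opinion does not give the number of Inconclusive-A responses behind the 10.13% figure, but the implied count can be backed out from the percentages, if one assumes (as I do here; the opinions do not say so expressly) that the 0.7% baseline uses the 2,842 non-matching recorded results cited in Justice Gould's dissent. A rough, hypothetical sketch of that arithmetic:

    # Back-of-the-envelope arithmetic (mine, not the court's or the study's).
    # Hypothetical assumption: the 0.7% and 10.13% figures share a denominator
    # of 2,842 non-matching recorded results, the number quoted in the dissent.
    total_nonmatching = 2842
    false_positives = 20                           # conclusive false identifications
    baseline = false_positives / total_nonmatching # about 0.70%
    ballooned = 0.1013                             # rate quoted by the majority
    implied_inconclusive_a = ballooned * total_nonmatching - false_positives
    print(f"baseline rate: {baseline:.2%}")
    print(f"implied Inconclusive-A responses: about {implied_inconclusive_a:.0f}")
    # Roughly 268 Inconclusive-A calls on non-matching bullet sets would
    # produce the 10.13% figure.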

But should any of the inconclusives "be treated as incorrect," and if so, how many? Doesn't it depend on the purpose of the studies and the computation? If the purpose is to probe what the PCAST Report neoterically called "foundational validity"—whether a procedure is at least capable of giving accurate source conclusions when properly employed by a skilled examiner—then inconclusives are not such a problem. They represent lost opportunities to extract useful information from the specimens, but they do not change the finding that, within the experiment itself, in those instances in which the examiner is willing to come down on one side or the other, the conclusion is usually correct.

One justice stressed this fact. Justice Gould insisted that

[T]he focus of our inquiry should not be the reliability of the AFTE Theory in general, but rather the reliability of conclusive determinations produced when the AFTE Theory is applied. Of course, an examiner applying the AFTE Theory might be unable to declare a match (“identification”) or a non-match (“elimination”), resulting in an inconclusive determination. But that's not our concern. Rather, our concern is this: when the examiner does declare an identification or elimination, we want to know how reliable that determination is.

He was unimpressed with the extreme view that every failure to recognize "ground truth" is an "error" for the purpose of evaluating an identification method under Daubert. \11/ He argued for error proportions like those used by PCAST: \12/

This brings us to a different way of looking at error rates, one that received no consideration by the Majority ... I am referring to calculating error by excluding inconclusives from both the numerator and the denominator. .... [C]ontrary to Mr. Faigman's unsupported criticism, excluding inconclusives from the numerator and denominator accords with both common sense and accepted statistical methodologies. ... PCAST ... contended that ... false positive rates should be based only on conclusive examinations “because evidence used against a defendant will typically be based on conclusive, rather than inconclusive, determinations.” ... So, far from being "crazy" ... , excluding inconclusives from error rate calculations when assessing the reliability of a positive identification is not only an acceptable approach, but the preferred one, at least according to PCAST. Moreover, from a mathematical standpoint, excluding inconclusives from the denominator actually penalizes the examiner because errors accounted for in the numerator are measured against a smaller denominator, i.e., a smaller sample size.

So what happens when the error proportions for the subset of positive and negative conclusions are computed with the Ames data? The report's denominators are too large, but the resulting bias is not so great in this particular case. For Ames I, Justice Gould's opinion tracks Table 1 above:

With respect to matching [cartridge] sets, the number of inconclusives was so low that whether inconclusives are included in the denominator makes little difference to error rates. Of the 1,090 matching sets, only 11, or 1.01 percent, were inconclusives. Of the conclusive determinations, 1,075 were correctly identified as a match (“identifications”) and four were incorrectly eliminated (“eliminations”). ... Measured against the total number of matching sets (1,090), the false elimination rate was 0.36 percent. Against only the conclusive determinations (1,079), the false elimination rate was 0.37 percent. ...

Of 2,178 non-matching sets, examiners reported 735 inconclusives for an inconclusive rate of 33.7 percent, 1,421 sets as correct eliminations, and 22 sets as incorrect identifications (false positives). ... As a percentage of the total 2,178 non-matching sets, the false positive rate was 1.01 percent. As a percentage of the 1,443 conclusive determinations, however, the false positive rate was 1.52 percent. Either way, the results show that the risk of a false positive is very low.

For Ames II,

There were 41 false eliminations. As a percentage of the 1,405 recorded results, the false elimination rate was 2.9 percent. As a percentage of only the conclusive results, the false elimination rate increased to 3.7 percent ... .

... There were 20 false positives. Measured against the total number of recorded results (2,842), the false positive rate was 0.7 percent. Measured against only the conclusive determinations, however, the false positive rate increases to 2.04 percent.
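
To make the contrast concrete, here is a short Python sketch (my own arithmetic, using only the Ames I counts quoted in the dissent) that computes each error proportion both ways, with inconclusives in the denominator and with conclusive calls only:

    # Ames I cartridge-case counts as quoted in Justice Gould's opinion.
    def error_rates(errors, total_sets, inconclusives):
        # Returns (rate over all sets, rate over conclusive calls only).
        conclusive = total_sets - inconclusives
        return errors / total_sets, errors / conclusive

    # Matching sets: 1,090 total, 11 inconclusive, 4 false eliminations.
    fe_all, fe_concl = error_rates(4, 1090, 11)
    # Non-matching sets: 2,178 total, 735 inconclusive, 22 false identifications.
    fp_all, fp_concl = error_rates(22, 2178, 735)

    print(f"false elimination: {fe_all:.2%} (all sets) vs. {fe_concl:.2%} (conclusives only)")
    print(f"false positive:    {fp_all:.2%} (all sets) vs. {fp_concl:.2%} (conclusives only)")
    # Roughly 0.37% vs. 0.37% and 1.01% vs. 1.52%, as in the opinion. The dissent
    # applies the same two-denominator comparison to Ames II (2.9% vs. 3.7% for
    # false eliminations; 0.7% vs. 2.04% for false positives).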

In sum, on the issue of whether a substantial number of firearms-toolmark examiners can generally avoid erroneous source attributions and exclusions when tested as in Ames I and Ames II, the answer seems to be that, yes, they can. Perhaps this helps explain the Chief Justice's concession that "[t]he relatively low rate of 'false positive' responses in studies conducted to date is by far the most persuasive piece of evidence in favor of admissibility of firearms identification evidence." But the court was quick to add that "[o]n balance, however, the record does not demonstrate that that rate is reliable, especially when it comes to actual casework."

Extrapolating from the error proportions in experiments to those in casework is difficult indeed. Is the largely self-selected sample of examiners who enroll in and complete the study representative of the general population of examiners doing casework? Does the fact that the enrolled examiners know they are being tested make them more (or less) careful or cautious? Do examiners have expectations about the prevalence of true sources in the experiment that differ from those they have in casework? Are the specimens in the experiment comparable to those in casework? \13/ Do error probabilities for comparing marks on cartridge cases apply to the marks on the bullets they house? Does it matter if the type of gun used in the experiment is different from the type in the case?

Most of the questions are matters of external validity. Some of them are the subject of explicit discussion in the opinions in Abruquah. For example, Justice Gould rejects, as a conjecture unsupported by the record, the concern that examiners might be more prone to avoid a classification by announcing an "inconclusive" outcome in an experiment than in practice.

To different degrees, the generalizability questions interact with the legal question being posed. As I have indicated, whether the scientific literature reveals that a method practiced by skilled analysts can produce conclusions that are generally correct for evidence like that in a given case is one important issue under Daubert. Whether the same studies permit accurate estimates of error probabilities in general casework is a distinct, albeit related, scientific question. How to count or adjust for inconclusives in experiments is but a subpart of the latter question.

And, how to present source attributions in the absence of reasonable error-probability estimates for casework is a question that Abruquah barely begins to answer. No opinion embraced the defendant's argument that only a limp statement like "unable to exclude as a possible source" is allowed. But neither does the case follow other courts that allow statements such as the awkward and probably ineffectual "reasonable degree of ballistic certainty" for expressing the difficult-to-quantify uncertainty in toolmark source attributions. After Abruquah, if an expert makes a source attribution in Maryland, some kind of qualification or caveat is necessary. \14/ But what will that be?

Toolmark examiners are trained to believe that their job is to provide source conclusions for investigators and courts to use, but neither law nor science compels this job description. Perhaps it would be better to replace conclusion-centered testimony about the (probable) truth of source conclusions with evidence-centered statements about the degree to which the evidence supports a source conclusion. The Abruquah court wrote that

The reports, studies, and testimony presented to the circuit court demonstrate that the firearms identification methodology employed in this case can support reliable conclusions that patterns and markings on bullets are consistent or inconsistent with those on bullets fired from a particular firearm. Those reports, studies, and testimony do not, however, demonstrate that that methodology can reliably support an unqualified conclusion that such bullets were fired from a particular firearm.

The expert witness's methodology provides "support" for a conclusion, and the witness could simply testify about the direction and magnitude of the support without opining on the truth of the conclusion itself. \15/ "Consistent with" testimony is a statement about the evidence, but it is a minimal, if not opaque, description of the data. Is it all that the record in Abruquah—not to mention the record in the next case—should allow? Only one thing is clear—fights over the legally permissible modes for presenting the outcomes of toolmark examinations will continue.

Notes

  1. In Commonwealth v. Pytou Heang, 942 N.E.2d 927 (Mass. 2011), the Massachusetts Supreme Judicial Court upheld source-attribution testimony "to a reasonable degree of scientific certainty" but added "that [the examiner] could not exclude the possibility that the projectiles were fired by another nine millimeter firearm." The Court proposed "guidelines" to allow source attribution to no more than "a reasonable degree of ballistic certainty."
  2. The defendant was convicted at a first trial in 2013, then retried in 2018. In 2020, the Maryland Supreme Court changed its standard for admitting scientific evidence from a requirement of general acceptance of the method in the relevant scientific communities (denominated the "Frye-Reed standard" in Maryland) to a more direct showing of scientific validity described in Daubert v. Merrell Dow Pharmaceuticals, 509 U.S. 579 (1993), and an advisory committee note accompanying amendments in 2000 to Federal Rule of Evidence 702 (called the "Daubert-Rochkind standard" in Abruquah).
  3. The experts at the "Frye-Reed hearing" were William Tobin (a "Principal of Forensic Engineering International," https://forensicengineersintl.com/about/william-tobin/, and "former head of forensic metallurgy operations for the FBI Laboratory" (Unsurpassed Experience, https://forensicengineersintl.com/)); James Hamby ("a laboratory director who has specialized in firearm and tool mark identification for the past 49 years" and who is "a past president of AFTE [the Association of Firearm and Tool Mark Examiners] ... and has trained firearms examiners from over 15 countries worldwide," Speakers, International Symposium on Forensic Science, Lahore, Pakistan, Mar. 17-19, 2020, https://isfs2020.pfsa.punjab.gov.pk/james-edward); Torin Suber ("a forensic scientist manager with the Maryland State Police"); and Scott McVeigh (the firearms examiner in the case).
  4. The experts at the supplemental "Daubert-Rochkind hearing" were James Hamby (a repeat performance), and David Faigman, "Chancellor & Dean, William B. Lockhart Professor of Law and the John F. Digardi Distinguished Professor of Law" at the University of California College of the Law, San Francisco.
  5. Remarks on the legal analysis will appear in the 2024 cumulative supplement to The New Wigmore, A Treatise on Evidence: Expert Evidence.
  6. The opinion gives a simplified example:
    The test administrator fires two bullets from each of 10 consecutively manufactured handguns. The administrator then gives you two sets of 10 bullets each. One set consists of 10 “unknown” bullets—where the source of the bullet is unknown to the examiner—and the other set consists of 10 “known” bullets—where the source of the bullet is known. You are given unfettered access to a sophisticated crime lab, with the tools, supplies, and equipment necessary to conduct a forensic examination. And, like the vocabulary tests from grade school requiring you to match words with pictures, you must match each of the 10 unknown bullets to the 10 known bullets.

    Even though you know that each of the unknowns can be matched with exactly one of the knowns, you probably wouldn't know where to begin. If you had to resort to guessing, your odds of correctly matching the 10 unknown bullets to the 10 knowns would be one out of 3,628,800. [An accompanying note 11 explains that: "[w]ith 10 unknown bullets and 10 known bullets, the odds of guessing the first pair correctly are one out of 10. And if you get the first right, the odds of getting the second right are one out of nine. If you get the first two right, the odds of getting the third right are one out of eight, and so on. Thus, the odds of matching each unknown bullet to the correct known is represented by the following calculation: (1/10) x (1/9) x (1/8) x (1/7) x (1/6) x (1/5) x (1/4) x (1/3) x (1/2) x (1/1)."] Even if you correctly matched five unknown bullets to five known bullets and guessed on the remaining five unknowns, your odds of matching the remaining unknowns correctly would be one out of 120. [Note 12: "(1/5) x (1/4) x (1/3) x (1/2) x (1/1)."] Not very promising.

    The closed-set and semi-closed-set studies before the trial court—the studies which PCAST discounted—show that if you were to properly apply the AFTE Theory, you would be very likely to match correctly each of the 10 unknowns to the corresponding knowns. See Validation Study; Worldwide Study; Bullet Validation Study. ... Your odds would thus improve from virtually zero (one in 3,628,800) to 100 percent. Yet according to PCAST, those studies provide no support for the scientific validity of the AFTE Theory. ...
  7. David P. Baldwin, Stanley J. Bajic, Max Morris & Daniel Zamzow, A Study of False-positive and False-negative Error Rates in Cartridge Case Comparisons, Ames Laboratory, USDOE, Tech. Rep. #IS-5207 (2014), at https://afte.org/uploads/documents/swggun-false-postive-false-negative-usdoe.pdf [https://perma.cc/4VWZ-CPHK].
  8. David H. Kaye, PCAST and the Ames Bullet Cartridge Study: Will the Real Error Rates Please Stand Up?, Forensic Sci., Stat. & L., Nov. 1, 2016, http://for-sci-law.blogspot.com/2016/11/pcast-and-ames-study-will-real-error.html
  9. David H. Kaye et al., Toolmark-comparison Testimony: A Report to the Texas Forensic Science Commission, May 2, 2022, available at http://ssrn.com/abstract=4108012.
  10. I will note, however, that the report apparently strains to make the attained levels of reliability seem high. Alan H. Dorfman & Richard Valliant, A Re-Analysis of Repeatability and Reproducibility in the Ames-USDOE-FBI Study, 9 Stat. & Pub. Pol'y 175 (2022).
  11. The opinion attributes this view to
    Mr. Abruquah's expert, Professor David Faigman, [who declared] that "in the annals of scientific research or of proficiency testing, it would be difficult to find a more risible manner of measuring error." To Mr. Faigman, the issue was simple: in Ames I and II, the ground truth was known, thus "there are really only two answers to the test, like a true or false exam[ple]." Mr. Faigman explained that "the common sense of it is if you know the answer is either A or B and the person says I don't know, in any testing that I've ever seen that's a wrong answer." He argued, therefore, that inconclusives should be counted as errors.
  12. See also NIST Expert Working Group on Human Factors in Latent Print Analysis, Latent Print Examination and Human Factors: Improving the Practice Through a Systems Approach (David H. Kaye ed. 2012), available at ssrn.com/abstract=2050067 (arguing against counting inconclusives in error proportions that are supposed to indicate the probative value of actual conclusions).
  13. Testing examiner performance in the actual flow of cases would help address the last three questions. A somewhat confusing analysis of results in such an experiment is described in a posting last year. David H. Kaye, Preliminary Results from a Blind Quality Control Program, Forensic Sci., Stat. & L., July 9, 2022, http://for-sci-law.blogspot.com/2022/07/preliminary-results-from-blind-quality.html.
  14. The court wrote that:
    It is also possible that experts who are asked the right questions or have the benefit of additional studies and data may be able to offer opinions that drill down further on the level of consistency exhibited by samples or the likelihood that two bullets or cartridges fired from different firearms might exhibit such consistency. However, based on the record here, and particularly the lack of evidence that study results are reflective of actual casework, firearms identification has not been shown to reach reliable results linking a particular unknown bullet to a particular known firearm.
  15. See, e.g., David H. Kaye, The Nikumaroro Bones: How Can Forensic Scientists Assist Factfinders?, 6 Va. J. Crim. L. 101 (2018).

LAST UPDATED 29 June 2023
