Wednesday, September 27, 2023

How Accurate Is Mass Spectrometry in Forensic Toxicology?

Mass spectrometry (MS) is the "[s]tudy of matter through the formation of gas-phase ions that are characterized using mass spectrometers by their mass, charge, structure, and/or physicochemical properties." ANSI-ASB Standard 098 for Mass Spectral Analysis in Forensic Toxicology § 3.11 (2023). MS has become "the preferred technique for the confirmation of drugs, drug metabolites, relevant xenobiotics, and endogenous analytes in forensic toxicology." Id. at Foreword.

But no "criteria for the acceptance of mass spectrometry data have been ... universally applied by practicing forensic toxicologists." Id. Therefore, the American Academy of Forensic Sciences' Academy Standards Board (ASB) promulgated a "consensus based forensic standard[] within a framework accredited by the American National Standards Institute (ANSI)," id., that provides "minimum requirements." Id. § 1.

To a nonexpert reader (like me), the minimum criteria for the accuracy of MS "confirmation" are not apparent. Consider Section 4.2.1 on "Full-Scan Acquisition using a Single-Stage Low-Resolution Mass Analyzer." It begins with the formal requirement that

[T]he following shall be met when using a single-stage low-resolution mass analyzer in full-scan mode.
a) A minimum of a single diagnostic ion shall be monitored.

It is hard to imagine an MS test method that would not meet the single-ion minimum. Perhaps what makes this requirement meaningful is that the one or more ions must be "diagnostic." However, this adjective begs the question of what the minimum requirement for diagnosticity should be. A "diagnostic ion" is a "molecular ion or fragment ion whose presence and relative abundance are characteristic of the targeted analyte." Id. § 3.4. So what makes an ion "characteristic"? Must it always be present (in some relative abundance) when the "targeted analyte" is in the specimen (at or above some limit of detection)? That would make the ion a marker for the analyte with perfect sensitivity: Pr(ion|analyte) = 1. Even so, it would not be characteristic of the analyte unless its presence is highly specific, that is, unless Pr(no-such-ion|something-else) ≅ 1. But the standard contains no minimum values for sensitivity, specificity, or the likelihood ratio Pr(ion|analyte) / Pr(ion|something-else), which quantifies the positive diagnostic value of a binary test. \1/

This is not to say that there are no minimum requirements in the standard. There certainly are. For example, Section 4.2.1 continues:

b) When monitoring more than one diagnostic ion:
1. ratios of diagnostic ions shall agree with those calculated from a concurrently analyzed reference material given the tolerances shown in Table 1; OR
2. the spectrum shall be compared using an appropriate library search and be above a pre-defined match factor as demonstrated through method validation.

But the standard does not explain how the tolerances in Table 1 were determined. What are the conditional error probabilities that they produce?

Likewise, establishing a critical value for the "match factor" \2/ before using it is essential to a frequentist decision rule, but what are the operating characteristics of the rule? "Method validation" is governed (to the extent that voluntary standards govern anything) by ANSI-ASB 036, Standard Practices for Method Validation in Forensic Toxicology (2019). This standard requires testing to establish that a method is "fit for purpose," but it gives no accuracy rates that would fulfill this vague directive.

Firms that sell antibody test kits for detecting Covid-19 infections no longer can sell whatever they deem is fit for purpose. In May 2020, the FDA stopped issuing emergency use permits for these diagnostic tests without validation showing that they "are 90% 'sensitive,' or able to detect coronavirus antibodies, and 95% 'specific,' or able to avoid false positive results." \3/ Forensic toxicologists do not seem to have proposed such minimum requirements for MS tests.
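
To see what thresholds of this kind imply for the likelihood ratio discussed earlier, here is a minimal sketch. It assumes a binary test and treats the FDA's 90% and 95% figures as exact sensitivity and specificity; the function name is hypothetical, and nothing in it comes from ASB 098 or ASB 036.

```python
from fractions import Fraction

def positive_likelihood_ratio(sensitivity, specificity):
    """Pr(positive | condition) / Pr(positive | no condition) for a binary test."""
    false_positive_rate = 1 - specificity
    if false_positive_rate == 0:
        raise ValueError("a specificity of 1 gives an unbounded likelihood ratio")
    return sensitivity / false_positive_rate

# FDA minimums quoted above, written as exact fractions to avoid rounding noise.
fda_lr = positive_likelihood_ratio(Fraction(90, 100), Fraction(95, 100))
print(fda_lr)  # 18
```

On these assumptions, a positive result from a test that just meets the FDA minimums is 18 times more probable if antibodies are present than if they are absent. A comparable floor for an MS "diagnostic ion" would require stated values for both conditional probabilities.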


  1. Other toxicology standards refer to ASB 098 as if it indicates what is required to apply the label "diagnostic." ANSI/ASB 113, Standard for Identification Criteria in Forensic Toxicology, § 4.5.2 (2023) ("All precursor and product ions are required to be diagnostic per ASB Standard 098, Standard for Mass Spectral Data Acceptance in Forensic Toxicology (2022).").
  2. Section 3.13 defines "match factor" as a "mathematical value [a scalar?] that indicates the degree of similarity between an unknown spectrum and a reference spectrum."
  3. See How Do Forensic-science Tests Compare to Emergency COVID-19 Tests?, Forensic Sci., Stat. & L., May 5, 2020 (quoting Thomas M. Burton, FDA Sets Standards for Coronavirus Antibody Tests in Crackdown on Fraud, Wall Street J., Updated May 4, 2020 8:24 pm ET,

Monday, September 18, 2023

Use with Caution: NIJ's Training Course in Population Genetics and Statistics for Forensic Analysts

The National Institute of Justice (NIJ) "is the research, development and evaluation agency of the U.S. Department of Justice . . . dedicated to improving knowledge and understanding of crime and justice issues through science." It offers a series of webpages and video recordings (a "training course") on Population Genetics and Statistics for Forensic Analysts. The course should be approached with caution. I have not worked through all the pages and videos, but here are a few things that rang alarm bells:

NIJ's Training: Many statisticians have employed what is known as Bayesian probability ... which is based on probability as a measure of one's degree of belief. This type of probability is conditional in that the outcome is based on knowing information about other circumstances and is derived from Bayes Theorem.
Comment: Bayes' rule applies to both objective and subjective probabilities. Both types of probability include conditional probabilities. The "type of probability" is not derived from Bayes' Theorem.

NIJ's Training: Conditional probability, by definition, is the probability P of an event A given that an event B has occurred. ... Take the example of a die with six sides. If one was to throw the die, the probability of it landing on any one side would be 1/6. This probability, however, assumes that the die is not weighted or rigged in any way, and that all of the sides contain a different number. If this were not true, then the probability would be conditional and dependent on these other factors.
Comment: The "other factors" are nothing more than part of the description of the experiment whose outcomes are the events that are observed. They are not conditioning events in a sample space.

NIJ's Training: The following equation can be used to determine the probability of the evidence given that a presumed individual is the contributor rather than a random individual in the population: LR = P(E/H1) / P(E/H0) ... . In the case of a single source sample, the hypothesis for the numerator (the suspect is the source of the DNA) is a given, and thus reduces to 1. This reduces to: LR = 1/ P(E/H0) which is simply 1/P, where P is the genotype frequency.
Comment: The hypothesis for the numerator of a likelihood ratio is always "a given"--that is, it goes on the right-hand-side of the expression for a conditional probability. So is the hypothesis in the denominator. Neither probability "reduces to 1" for that reason. Only if the "evidence" is the true genotype in both the recovered sample and the sample from the defendant can it be said that P(E|H1) = 1. In other words, to say that the probability of a reported match is 1 if the defendant is the source treats the probability of laboratory error as zero. That may be acceptable as a simplifying assumption, but the assumption should be made visible in a training course.

NIJ's Training: Although likelihood ratios can be used for determining the significance of single source crime stains, they are more commonly used in mixture interpretation. ... The use of any formula for mixture interpretation should only be applied to cases in which the analyst can reasonably assume "that all contributors to the mixed profile are unrelated to each other, and that allelic dropout has no practical impact."
Comment: This limitation does not apply to modern probabilistic genotyping software!

Is "Match Form" Testimony Poor Form?

The likelihood ratio (LR) is essentially a number that expresses how many times more probable the data from an experiment are if one hypothesis is true than if another hypothesis is true. For example, suppose we make a single measurement of the height of a known individual. Then we do the same for an individual who is covered from head to foot by a sheet. We want to know if we have measured the same individual twice or two different individuals once. The closer the two measured heights are to one another, the more the measurements support the same-source hypothesis as opposed to the different-source hypothesis.

Why? Because closer measurements are more probable for same-source pairs than for different-source pairs. This implies that in repeated experiments with some proportion of same-source and different-source pairs, the closer measurements will tend to filter out the different-source pairs (which tend to have more distance between the two measurements) and to include more same-source pairs (which tend to be marked by the more similar measurements).

By quantifying the relative probability for the data given each hypothesis, the LR indicates how well a given degree of similarity discriminates between the hypotheses. Its value is

LR = Probability(data | H1) / Probability(data | H2),

where H1 is the same-source hypothesis and H2 is the different-source hypothesis.
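
The height example can be turned into numbers with a rough sketch. Everything in it is an assumption made for illustration: measurement errors are taken to be normal with a standard deviation of 0.5 cm, true heights in the relevant population are taken to be normal with a standard deviation of 7 cm, and the "data" are reduced to the difference between the two measurements. The function names are hypothetical.

```python
import math

def normal_pdf(x, variance):
    """Density at x of a normal distribution with mean 0 and the given variance."""
    return math.exp(-x * x / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def height_lr(difference_cm, measurement_sd=0.5, population_sd=7.0):
    """LR for H1 (same person measured twice) versus H2 (two different people)."""
    # Under H1 the difference reflects only two independent measurement errors.
    same_source_var = 2 * measurement_sd ** 2
    # Under H2 it also reflects the difference between two people's true heights.
    different_source_var = same_source_var + 2 * population_sd ** 2
    return (normal_pdf(difference_cm, same_source_var)
            / normal_pdf(difference_cm, different_source_var))

for d in (0.0, 1.0, 3.0):
    print(f"difference {d:.1f} cm -> LR = {height_lr(d):.4g}")
```

Under these assumed values the LR exceeds 1 for small differences and falls below 1 as the difference grows, which is the filtering behavior described above.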

Likelihood ratios are routinely reported in cases with samples from crime scenes or victims that contain DNA from several individuals. A DNA analyst might testify that the electropherograms are ten thousand times more probable if the defendant's DNA is present than if an unrelated person's DNA is there. \1/ We may call such statements "relative-probability-of-the-data" testimony.

But some DNA experts prefer what they call a "match form" for the presentation. \2/ An example of a "match form" statement is that “[a] match between the shoes … and [the defendant] is 9.67 thousand times more probable than a coincidental match to an unrelated African-American person.” \3/ More generally, a match-form presentation states that “a match between the evidence and reference [samples] is (some number) times more probable than coincidence.” \4/

This formulation has been criticized as highly misleading. According to William Thompson, it is

likely to mislead lay people and foster misunderstandings that are detrimental to people accused of a crime. I recommend that Cybergenetics immediately cease using this misleading language and find a better way to explain its findings. Standards development organizations such as OSAC should consider developing standards that address the appropriateness, or inappropriateness, of such presentations. Courts should refuse to admit PG [probabilistic genotyping] evidence when it is mischaracterized in this manner. Lawyers involved in cases in which defendants were convicted based on this misleading language should consider the appropriateness of appellate remedies. \5/

The main concern is that juxtaposing “match” and “coincidence” will lead judges and jurors to think that the "match statistic" pertains to the probabilities of hypotheses (H1 and H2) about the source of the DNA rather than probabilities about the laboratory’s data. In simpler terms, the concern is that most people will understand "coincidence" and "coincidental match" as an assertion that the observed match is the result of coincidence; moreover, they will think that "match" is an assertion that the defendant is the matcher. If that happens, then the assertion that a match is 10,000 times more likely than coincidence would be (mis)understood as a statement that the odds against a coincidence having occurred are 10,000 to 1.

Instead, LR = 10,000 should be understood (according to Bayes' rule) as a statement about the change in the odds that defendant, as opposed to some unknown, unrelated person, is the matcher. For example, if defendant has a strong alibi—strong enough, in conjunction with other evidence, to establish that the prior odds of H1 as opposed to H2 are only 1 to 5,000—then this LR raises the odds to 10,000 x 1:5,000 = 2:1. Such final odds are far from overwhelming.
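
The arithmetic in the alibi example can be checked with a short sketch of the odds form of Bayes' rule. The function names are hypothetical; the prior odds and the LR are the ones used above.

```python
from fractions import Fraction

def posterior_odds(prior_odds, likelihood_ratio):
    """Odds form of Bayes' rule: posterior odds = prior odds x likelihood ratio."""
    return prior_odds * likelihood_ratio

def odds_to_probability(odds):
    """Convert odds in favor of H1 into the probability of H1."""
    return odds / (1 + odds)

prior = Fraction(1, 5000)            # prior odds of H1 versus H2 from the alibi example
post = posterior_odds(prior, 10_000)
print(post, odds_to_probability(post))  # 2 2/3
```

Posterior odds of 2:1 correspond to a probability of 2/3 for H1 when H1 and H2 are the only possibilities considered, which is a long way from the "10,000 to 1" that a transposed reading would suggest.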

Cybergenetics does not seem disposed to abandon "match form" testimony. Dr. Thompson claims that for fingerprint comparisons, "'[m]atch' is shorthand for source identification, [s]o, it is predictable that many lay people will interpret the term 'match,' when used to describe DNA evidence, to mean that the person of interest has been identified either definitively or with a high degree of certainty as a contributor." Pointing to a dictionary, Cybergenetics angrily responds that this is just "Thompson’s private language." \6/ But a tradition in forensic science is to equate a "match" with an identification, as shown by the title of articles such as "Is a Match Really a Match? A Primer on the Procedures and Validity of Firearm and Toolmark Identification." \7/ In popular culture, the term may have a similar connotation. Perhaps YouTube trumps Merriam-Webster. \8/

As far as I know, no studies compare the comprehensibility of relative-probability-of-the-data testimony to match-form testimony. Therefore, the law and the practice have to be guided by intuition. My sense is that avoiding the transposition of the probabilities in a likelihood ratio requires special care if the match-versus-coincidence approach is used. The witness must explain not only that a "DNA match" is merely a degree of similarity between the electropherograms being compared, but also that "coincidence" or "coincidental match" is shorthand for the proposition that the "match" is a match to an unrelated person (or other specified source), and that it is not a conclusion that a coincidence has occurred. The phrase "coincidental match" is too ambiguous to be left undefined.

In short, I am not sure that an absolute rule against match-form testimony is necessary, but I see no clear benefit to the phraseology. Relative-probability-of-the-data testimony seems to be a more straightforward description of a DNA likelihood ratio. However, it too needs explanation to reduce the risk of blindly transposing the conditional probabilities for the data into conditional probabilities for the hypotheses. Cases announcing that a likelihood ratio is a ratio of source-hypothesis probabilities are legion. \9/


  1. Cf. Commonwealth v. McClellan, 178 A.3d 874 (Pa. Super. Ct. 2018) ("[I]t was determined that the DNA sample taken from the gun's grip was at least 384 times more probable if the sample originated from Appellant and two unknown, unrelated individuals than if it originated from a relative to Appellant and two unknown, unrelated individuals").
  2. Mark Perlin, Explaining the Likelihood Ratio in DNA Mixture Interpretation, in Proceedings of Promega's Twenty First International Symposium on Human Identification at 7 (Dec. 29, 2010); cf. Mark W. Perlin, Joseph B. Kadane & Robin W. Cotton, Match Likelihood Ratio for Uncertain Genotypes, 8 Law, Probability & Risk 289 (2009),
  3. United States v. Anderson, No. 4:21-CR-00204, 2023 WL 3510823, at *3 (M.D. Pa. Apr. 26, 2023). For additional instances of “match form” testimony or reporting, see Howell v. Schweitzer, No. 1:20-cv-2853, 2023 WL 1785530 (N.D. Ohio Jan. 11, 2023); Sanford v. Russell, No. 17-13062, 2021 WL 1186495 (E.D. Mich. Mar. 30, 2021); State v. Anthony, 266 So.3d 415 (La. Ct. App. 2019).
  4. Mark W. Perlin et al., TrueAllele Casework on Virginia DNA Mixture Evidence: Computer and Manual Interpretation in 72 Reported Criminal Cases, 9 PLOS ONE e92837, at 8 (2014).
  5. William C. Thompson, Uncertainty in Probabilistic Genotyping of Low Template DNA: A Case Study Comparing STRMix™ and TrueAllele™, 68 J. Forensic Sci. 1049, 1059 (2023), doi:10.1111/1556-4029.15225.
  6. Mark W. Perlin et al., Reporting Exclusionary Results on Complex DNA Evidence, A Case Report Response to 'Uncertainty in Probabilistic Genotyping of Low Template DNA: A Case Study Comparing Strmix™ and Trueallele®' Software 31 (May 18, 2023), available at SSRN.
  7. Stephen G. Bunch et al., Is a Match Really a Match? A Primer on the Procedures and Validity of Firearm and Toolmark Identification, 11 Forensic Science Communications, No. 3 (2009),
  8. In addition, a dictionary definition of "match" is "a pair suitably associated." Suitable association suggests that a hypothesis about the nature of the association is true.
  9. E.g., State v. Pickett, 246 A.3d 279 (N.J. App. 2021) (The "likelihood ratio [is] a statistic measuring the probability that a given individual was a contributor to the sample against the probability that another, unrelated individual was the contributor.") (citing Justice Ming W. Chin et al., Forensic DNA Evidence § 5.5 (2020)).

Friday, July 7, 2023

No "Daubert Hearing" on Latent Fingerprint Matching in US v. Ware

Last month, in United States v. Ware, 69 F.4th 830 (11th Cir. 2023), the U.S. Court of Appeals for the Eleventh Circuit "carefully review[ed]" the convictions of Dravion Sanchez Ware arising out of a month-long crime spree near Atlanta in 2017. He was found to have participated "in robbing ... three spas, four massage parlors, a nail salon, and a restaurant." The opinion recounts the nine brutal robberies in luxuriant detail. It also discusses Mr. Ware's argument that the district court erred "by not holding a formal Daubert hearing before admitting expert fingerprint evidence."

In a word, the Eleventh Circuit rejected the argument as "unpersuasive." No surprise there. More surprising is the opinion's incoherent discussion of the 2009 NRC report on forensic science and the 2016 PCAST follow-up report. \1/ On the one hand, we are told that "[t]he science could not possibly have been so unreliable as to be inadmissible." On the other hand, "[t]he District Court here could have held a Daubert hearing to assess the relatively new reports Ware presented." So which is it? If a type of evidence cannot possibly be excluded as scientifically invalid under Daubert, how can it be proper to hold a pretrial testimonial hearing on admissibility under Daubert? And, was the court of appeals correct in concluding that the two reports do not impeach, to the point of requiring a hearing, the traditional practice of admitting latent fingerprint comparisons?

During Ware's trial, an unnamed "crime lab scientist with the Georgia Bureau of Investigation Division of Forensic Sciences" "outlined the science behind fingerprints themselves, including their uniqueness" and "explained the four-step process the lab follows ... : 'Analysis, Comparison, Evaluation, and Verification,' or ACEV.” The last step "involves another examiner completing the whole process a second time." The opinion does not indicate whether the verifying analyst is blinded to the main examiner's finding. Interestingly as well (think Confrontation Clause), the opinion implies that the testifying expert in Ware was not the main examiner. "[S]he was the verifying examiner," and "she testified that the lab concluded the latent print ... led to an identification conclusion matched to Ware's left middle finger." After that,

Defense counsel specifically asked about the PCAST report [and] vigorously cross-examined ... discussing the possibility of a latent fingerprint not being usable ... , the subjectiveness of every step ... , and the bias that may creep into the verification process ... . The expert and defense counsel discussed ... the potential for false positives and negatives. On cross, the defense also attacked the expert's claim that she did not know of the Georgia Bureau of Investigation ever misidentifying someone with a fingerprint comparison, and that she did not know the rate at which a verifier disagrees with the original assessment.

To preclude such testimony about his unique fingerprint on an item stolen in one of the robberies, Ware had moved before the trial for an order excluding fingerprint-comparison evidence. Of course, such a ruling would have been extraordinary, but the defense contended that the 2009 and the 2016 reports required nothing less \2/ and asked for a full-fledged pretrial hearing on the matter. In response, "[t]he District Court conditionally denied the motion ... unless Ware's counsel could produce before trial a case from this Court or a district court in this Circuit that favors excluding fingerprint expert evidence under Daubert." \3/

The court of appeals correctly observed that "[f]ingerprint comparison has long been accepted as a field worthy of expert opinions in this Circuit, as well as in almost every one of our sister circuits." The only problem is that all the opinions cited to show this solid wall of precedent predate the NRC or the PCAST reports. A more complete analysis has to establish that the scientists' reviews of friction-ridge pattern matching do not raise enough of a doubt to expect that a hearing would let the defense breach the wall. 

Along these lines, the court of appeals wrote that

The [District] Court considered the reports and arguments presented and found that fingerprint evidence was reliable enough as a general matter to be presented to the jury. Many of the critiques of fingerprint evidence found in the PCAST report go to the weight that ought to be given fingerprint analysis, not to the legitimacy of the practice as a whole. Appellant Br. at 25 (“The studies collectively demonstrate that many examiners can, under some circumstances, produce correct answers at some level of accuracy.” (emphasis in original)).

This quotation from the PCAST report is faint praise. Although the court of appeals was sure that "Ware's contrary authority even says that fingerprint evidence can be reliable," the depth of its knowledge about the PCAST (and the earlier NRC committee) reports is open to question. The circuit court had trouble keeping track of the names of the groups. It transformed the National Research Council (the operating arm of the National Academies of Sciences, Engineering, and Medicine) into a "United States National Resource Council" (69 F.4th at 840) and then imagined an "NCAST report[]" (id. at 848). \4/

Deeper inspection of "Ware's contrary authority" is in order. The 2009 NRC committee report quoted with approval the searing conclusion of Haber & Haber that “[w]e have reviewed available scientific evidence of the validity of the ACE-V method and found none.” It reiterated the Habers' extreme claim that because "the standards upon which the method’s conclusions rest have not been specified quantitatively ... the validity of the ACE-V method cannot be tested." To be sure, the committee agreed that fingerprint examiners had something going for them. It wrote that "more research is needed regarding the discriminating value of the various ridge formations [to] provide examiners with a solid basis for the intuitive knowledge they have gained through experience." But does "intuitive knowledge" qualify as "scientific knowledge" under Daubert? Is a suggestion that friction-ridge comparisons need a more solid basis equal to a statement that the comparisons are "reliable" within the meaning of that opinion? The response to "NCAST" was underwhelming.

But research has progressed since 2009. The second "contrary authority," the PCAST report, reviewed this research. At first glance, this report supports the court's conclusion that no hearing was necessary. It assures courts that "latent fingerprint analysis is a foundationally valid subjective methodology." In doing so, it rejects the NRC committee's notion that the absence of quantitative match rules precludes testing whether examiners can reach valid conclusions. It discusses two so-called black-box studies of the work of examiners operating in the "intuitive" mode. Yet, the Ware court does not cite or quote the boxed and highlighted finding (Number 5).

Perhaps the omission reflects the fact that the PCAST finding is so guarded. PCAST added that "additional black-box studies are needed to clarify the reliability of the method," undercutting the initial assurance, which was "[b]ased largely on two ... studies." Furthermore, according to PCAST, to be "scientifically valid," latent-print identifications must be accompanied by admissions that "false positive rates" could be very high (as high as 1 in 18). \5/

The Ware court transforms all of this into a blanket and bland assertion that the report establishes reliability even though it "may cast doubt on the error rate of fingerprint analysis and comparison." The latter concern, it says, goes not to admissibility, but only to "weight" or "credibility." 

Can it really be this simple? Are not "error rates" an explicit factor affecting admissibility (as well as weight) under Daubert? Certainly, the Eleventh Circuit's view that the problems with fingerprint comparisons articulated in the two scientific reports are not profound enough to force a wave of pretrial hearings is defensible, but the court's explanation of its position in Ware is sketchy.

At bottom, the problem with the fingerprint evidence introduced against Ware (as best as one can tell from the opinion) is not that it is speculative or valueless. The difficulty is that the judgments are presented as if they were scientific truths. The Ware court is satisfied because "Defense counsel put the Government's expert through his paces during cross-examination, and counsel specifically asked the expert about the findings in the PCAST report." But would it be better to moderate the presentations to avoid overclaiming in the first place? 

The impending amendment to Rule 702 of the Federal Rules of Evidence is supposed to encourage this kind of "gatekeeping." Defense counsel might be more successful in constraining overreaching experts than in excluding them altogether. That too should be part of the "considerable leeway" granted to district courts seeking to reconcile expert testimony with modern scientific knowledge.


  1. President's Council of Advisors on Sci. & Tech., Exec. Office of the President, Forensic Science in Criminal Courts: Ensuring Scientific Validity of Feature-Comparison Methods (2016).
  2. In Ware
    The pretrial motion to exclude the fingerprint identification "relied on the 2009 United States National Resource Counsel (“NRC”) report and subsequent 2016 President's Counsel of Advisors on Science and Technology (“PCAST”) report, which supposedly revealed a dearth of "proper scientific studies of fingerprint comparison evidence" and claimed that "there is no scientific basis for concluding a fingerprint was left by a specific person," positing that "because fingerprint analysis involves individual human judgement, the resulting [fingerprint comparison] conclusion can be influenced by cognitive bias."
  3. Why insist on a pre-existing determination in one particular geographic region that scientific validity is lacking in order to grant a hearing on whether scientific validity is present? Is the "science" underlying fingerprint comparisons in Georgia and the other southeastern states comprising the 11th Circuit different from that in the rest of the country?
  4. OK, these peccadillos are not substantive, but one would have thought that three circuit court judges, after "carefully reviewing the record," could have gotten the names and acronyms straight. Senior Judge Gerald Tjoflat wrote the panel opinion. At one point, he was a serious contender for the Supreme Court seat filled by Justice Anthony Kennedy. After Judge Tjoflat announced that he would retire to senior status on the bench in 2019, President Donald Trump nominated Robert J. Luck to the court. In addition to Judge Luck, Judge Kevin C. Newsom, a 2017 appointee of President Trump, was on the panel. Judicial politics being what it is, over 30 senators voted against the confirmation of Judges Newsom and Luck.
  5. PCAST suggested that if a court agreed that what it called "foundational validity" were present, then to achieve "validity as applied" some very specific statements about "error rates" would be required:
    Overall, it would be appropriate to inform jurors that (1) only two properly designed studies of the accuracy of latent fingerprint analysis have been conducted and (2) these studies found false positive rates that could be as high as 1 in 306 in one study and 1 in 18 in the other study. This would appropriately inform jurors that errors occur at detectable frequencies, allowing them to weigh the probative value of the evidence.
    The studies actually found conditional false-positive proportions of 6/3628 (0.17%) and 42/995 (4.2%, or 7/960 = 0.7% if one discards "clerical errors") (P. 98, tbl. 1). Earlier postings discuss these FBI-Noblis and Miami Dade police department numbers.
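
The gap between these observed proportions and the "as high as" figures that PCAST wanted jurors to hear can be explored with a rough sketch. It assumes that the "as high as" figures are one-sided 95% upper confidence bounds of the exact (Clopper-Pearson) kind; that reading, and the function names, are assumptions made here rather than statements taken from the report.

```python
import math

def binom_cdf(x, n, p):
    """Pr(X <= x) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

def upper_bound(x, n, confidence=0.95):
    """One-sided exact upper bound: the p at which Pr(X <= x) equals 1 - confidence."""
    lo, hi = x / n, 1.0
    for _ in range(100):  # bisection; the CDF decreases as p increases
        mid = (lo + hi) / 2
        if binom_cdf(x, n, mid) > 1 - confidence:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# False-positive counts and numbers of comparisons as reported in the footnote above.
studies = {
    "FBI-Noblis": (6, 3628),
    "Miami-Dade": (42, 995),
    "Miami-Dade without clerical errors": (7, 960),
}
for name, (x, n) in studies.items():
    ub = upper_bound(x, n)
    print(f"{name}: observed {x / n:.2%}, upper bound {ub:.2%} (about 1 in {1 / ub:.0f})")
```

The point of the exercise is only that an observed proportion and an upper confidence bound answer different questions, so a statement to jurors should say which one is being reported.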

Saturday, June 24, 2023

Maryland Supreme Court Resists "Unqualified" Firearms-toolmark Testimony

This week, the Maryland Supreme Court became the first state supreme court to hold, unequivocally, that a firearms-toolmark examiner may not testify that a bullet was fired from a particular gun without a disclaimer indicating that source attribution is not a scientific or practical certainty. \1/ The opinion followed two trials, \2/ two evidentiary hearings (one on general scientific acceptance \3/ and one on the scientific validity of firearms-toolmark identifications \4/) and affidavits from experts in research methods or statistics. The Maryland court did not discuss the content of the required disclaimer. It merely demanded that the qualified expert's opinion not be "unqualified." In addition, the opinions are limited to source attributions via the traditional procedure of judging presumed "individual" microscopic features with no standardized rules for concluding that the markings match.

The state contended that Kobina Ebo Abruquah murdered a roommate by shooting him five times, including once in the back of the head. A significant part of the state's case came from a firearms examiner for the Prince George’s County Police Department. The examiner "opined that four bullets and one bullet fragment ... 'at some point had been fired from [a Taurus .38 Special revolver belonging to Mr. Abruquah].'" A bare majority of four justices agreed that admission of the opinion was an abuse of the trial court's discretion. Three justices strongly disputed this conclusion. Two of the three opinions in the case included tables displaying counts or percentages from experiments in which analysts compared sets of either bullets or cartridge casings fired from a few types of handguns to ascertain how frequently their source attributions and exclusions were correct and how often they were wrong.

There is a lot one might say about these opinions, but here I attend only to the statistical parts. \5/ As noted below (endnotes 3 and 4), neither party produced any statisticians or research scientists with training or extensive experience in applying statistical methods. The court did not refer to the recent, burgeoning literature on "error rates" in examiner-performance studies. Instead, the opinions drew on (or questioned) the analysis in the 2016 report of the President's Council of Advisors on Science and Technology (PCAST). The report essentially dismissed the vast majority of the research on which one expert for the state (James Hamby, a towering figure in the firearms-toolmark examiner community) relied. These studies, PCAST explained, usually asked examiners to match a set of questioned bullets to a set of guns that fired them.

A dissenting opinion of Justice Steven Gould argued—with the aid of a couple of probability calculations—that the extremely small number of false matches in these "closed set" studies demonstrated that examiners were able to perform much better than would be expected if they were just guessing when they lined up the bullets with the guns. \6/

That is a fair point. "Closed set" studies can show that examiners are extracting some discriminating information. But they do not lend themselves to good estimates of the probability of false identifications and false exclusions. For answering the "error rate" question in Daubert, they indicate lower bounds—the conditional error probabilities for examiners under the test conditions could be close to zero, but they could be considerably larger.
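
The guessing baseline behind Justice Gould's argument (quoted in endnote 6) can be checked with a few lines of Python. This is only a sketch of the arithmetic in the opinion's simplified example, in which each unknown bullet matches exactly one known bullet; the function name is mine, not the court's.

```python
import math
from fractions import Fraction

def chance_of_perfect_matching(n_pairs: int) -> Fraction:
    """Probability of pairing every unknown with its known by pure guessing.

    With n unknowns and n knowns there are n! possible one-to-one pairings,
    and only one of them is entirely correct.
    """
    return Fraction(1, math.factorial(n_pairs))

all_ten = chance_of_perfect_matching(10)   # guessing all ten pairs
last_five = chance_of_perfect_matching(5)  # five already matched, five guessed

print(f"All 10 by guessing:      1 in {all_ten.denominator:,}")
print(f"Remaining 5 by guessing: 1 in {last_five.denominator:,}")
```

The output reproduces the opinion's figures of one in 3,628,800 and one in 120. It shows how far examiner performance departs from chance in a closed set, but it says nothing about the conditional error probabilities discussed next.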

More useful experiments simulate what examiners do in cases like Abruquah—where they decide whether a specific gun fired the bullet that could have come from guns beyond those in an enumerated set of known guns. To accomplish this, the experiment can pair test bullets fired from a known gun with a "questioned" bullet and have the examiner report whether the questioned bullet did or did not travel through the barrel of the tested gun.

The majority opinion, written by Chief Justice Matthew Fader, discussed two such experiments known as "Ames I" and "Ames II" (because they were done at the Ames National Laboratory, "a government-owned, contractor-operated national laboratory of the U.S. Department of Energy, operated by and located on the campus of Iowa State University in Ames, Iowa"). The first experiment, funded by the Department of Defense and completed in 2014, "was designed to provide a better understanding of the error rates associated with the forensic comparison of fired cartridge cases." The experiment did not investigate performance with regard to toolmarks on the projectiles (the bullets themselves) propelled from the cases, through the barrel of a gun, and beyond. Apparently referring to the closed-set kind of studies, the researchers observed that "[five] previous studies have been carried out to examine this and related issues of individualization and durability of marks ... , but the design of these previous studies, whether intended to measure error rates or not, did not include truly independent sample sets that would allow the unbiased determination of false-positive or false-negative error rates from the data in those studies." \7/

However, their self-published technical report does not present the results in the kind of classification table that statisticians would expect. Part of such a table is on this blog: \8/ 

The researchers enrolled 284 volunteer examiners in the study, and 218 submitted answers (raising an issue of selection bias). The 218 subjects (who obviously knew they were being tested) “made ... 15 comparisons of 3 knowns to 1 questioned cartridge case. For all participants, 5 of the sets were from known same-source firearms [known to the researchers but not the firearms examiners], and 10 of the sets were from known different-source firearms.” Ignoring “inconclusive” comparisons, the performance of the examiners is shown in Table 1.

Table 1. Outcomes of comparisons
(derived from pp. 15-16 of Baldwin et al.)

            ~S       S    Total
–E        1421       4     1425
+E          22    1075     1097
Total     1443    1079     2522

–E is a negative finding (the examiner decided there was no association).
+E is a positive finding (the examiner decided there was an association).
S indicates that the cartridge cases came from cartridges fired in the same gun.
~S indicates that the cartridge cases came from cartridges fired in different guns.

False negatives. Of the 4 + 1075 = 1079 judgments in which the gun was the same, 4 were negative. This false negative rate is Prop(–E|S) = 4/1079 = 0.37%. ("Prop" is short for "proportion," and "|" can be read as "given" or "out of all.") Treating the examiners tested as a random sample of all examiners of interest, and viewing the performance in the experiment as representative of the examiners' behavior in casework with materials comparable to those in the experiment, we can estimate the proportion of false negatives for all examiners. The point estimate is 0.37%. A 95% confidence interval is 0.10% to 0.95%. These numbers provide an estimate of how frequently all examiners would declare a negative association in all similar cases in which the association actually is positive. Instead of false negatives, we also can describe true positives, or sensitivity. The observed sensitivity is Prop(+E|S) = 1075/1079 = 99.63%. The 95% confidence interval around this estimate is 99.05% to 99.90%.

False positives. The observed false-positive rate is Prop(+E|~S) = 22/1443 = 1.52%, and the 95% confidence interval is 0.96% to 2.30%. The observed true-negative rate, or specificity, is Prop(–E|~S) = 1421/1443 = 98.48%, and its 95% confidence interval is 97.70% to 99.04%.

Taken at face value, these results seem rather encouraging. On average, examiners displayed high levels of accuracy, both for cartridge cases from the same gun (better than 99% sensitivity) and from different guns (better than 98% specificity).
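
For readers who want to check the proportions and intervals quoted from Table 1, here is a minimal sketch using only the Python standard library. The excerpt does not say which interval method was used; I assume an exact (Clopper-Pearson) binomial interval, which treats the comparisons as independent trials. That assumption is questionable when each examiner contributes many comparisons. The helper names are illustrative.

```python
import math

def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), assuming 0 < p < 1.

    Each term is computed in log space to avoid overflow for large n.
    """
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    log_n_fact = math.lgamma(n + 1)
    total = 0.0
    for i in range(k + 1):
        log_term = (log_n_fact - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_term)
    return min(total, 1.0)

def clopper_pearson(x: int, n: int, level: float = 0.95):
    """Exact two-sided binomial confidence interval for x errors in n trials."""
    alpha = 1.0 - level

    def solve(f, target):
        # f is decreasing in p; bisect for the p at which f(p) == target.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if f(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower limit: the p at which P(X >= x) equals alpha/2.
    lower = 0.0 if x == 0 else solve(lambda p: binom_cdf(x - 1, n, p), 1 - alpha / 2)
    # Upper limit: the p at which P(X <= x) equals alpha/2.
    upper = 1.0 if x == n else solve(lambda p: binom_cdf(x, n, p), alpha / 2)
    return lower, upper

# Counts from Table 1 (inconclusive responses excluded).
for label, x, n in [("false negatives", 4, 1079), ("false positives", 22, 1443)]:
    lo, hi = clopper_pearson(x, n)
    print(f"{label}: {x}/{n} = {x / n:.2%}, 95% CI {lo:.2%} to {hi:.2%}")
```

Under these assumptions the intervals should land close to the ones reported above. A method that accounts for the clustering of comparisons within examiners could give wider intervals.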

I did not comment on the implications of the fact that analysts often opted out of the binary classifications by declaring that an examination was inconclusive. This reporting category has generated a small explosion of literature and argumentation. There are two extreme views. Some organizations and individuals maintain that just because specimens did or did not come from the same source, the failure to discern which of these two states of nature applies is an error. A more apt term would be "missed signal"; \9/ it is hardly obvious that Daubert's fleeting reference to "error rates" was meant to encompass not only false positives and negatives but also test results that are neither positive nor negative. At the other pole are claims that all inconclusive outcomes should be counted as correct in computing the false-positive and false-negative error proportions seen in an experiment.

Incredibly, the latter is the only way in which the Ames laboratory computed the error proportions. I would like to think that had the report been subject to editorial review at a respected journal, a correction would have been made. Unchastened, the Ames laboratory again only counted inconclusive responses as if they were correct when it wrote up the results of its second study.

This lopsided treatment of inconclusives was an issue in Abruquah. The majority opinion described the two studies as follows (citations and footnotes omitted):

Of the 1,090 comparisons where the “known” and “unknown” cartridge cases were fired from the same source firearm, the examiners [in the Ames I study] incorrectly excluded only four cartridge cases, yielding a false-negative rate of 0.367%. Of the 2,180 comparisons where the “known” and “unknown” cartridge cases were fired from different firearms, the examiners incorrectly matched 22 cartridge cases, yielding a false-positive rate of 1.01%. However, of the non-matching comparison sets, 735, or 33.7%, were classified as inconclusive, id., a significantly higher percentage than in any closed-set study.

The Ames Laboratory later conducted a second open-set, black-box study that was completed in 2020 ... The Ames II Study ... enrolled 173 examiners for a three-phase study to test for ... foundational validity: accuracy (in Phase I), repeatability (in Phase II), and reproducibility (in Phase III). In each of three phases, each participating examiner received 15 comparison sets of known and unknown cartridge cases and 15 comparison sets of known and unknown bullets. The firearms used for the bullet comparisons were either Beretta or Ruger handguns and the firearms used for the cartridge case comparisons were either Beretta or Jimenez handguns. ... As with the Ames I Study, although there was a “ground truth” correct answer for each sample set, examiners were permitted to pick from among the full array of the AFTE Range of Conclusions—identification, elimination, or one of the three levels of “inconclusive.”

The first phase of testing was designed to assess accuracy of identification, “defined as the ability of an examiner to correctly identify a known match or eliminate a known nonmatch.” In the second phase, each examiner was given the same test set examined in phase one, without being told it was the same, to test repeatability, “defined as the ability of an examiner, when confronted with the exact same comparison once again, to reach the same conclusion as when first examined.” In the third phase, each examiner was given a test set that had previously been examined by one of the other examiners, to test reproducibility, “defined as the ability of a second examiner to evaluate a comparison set previously viewed by a different examiner and reach the same conclusion.”

In the first phase, ... [t]reating inconclusive results as appropriate answers, the authors identified a false negative rate for bullets and cartridge cases of 2.92% and 1.76%, respectively, and a false positive rate for each of 0.7% and 0.92%, respectively. Examiners selected one of the three categories of inconclusive for 20.5% of matching bullet sets and 65.3% of nonmatching bullet sets. [T]he results overall varied based on the type of handgun that produced the bullet/cartridge, with examiners’ results reflecting much greater certainty and correctness in classifying bullets and cartridge cases fired from the Beretta handguns than from the Ruger (for bullets) and Jimenez (for cartridge cases) handguns.

The opinion continues with a description of some statistics for the level of intra- and inter-examiner reliability observed in the Ames II study, but I won't pursue those here. The question of accuracy is enough for today. \10/ To some extent, the majority's confidence in the reported low error proportions (all under 3%) was shaken by the presence of inconclusives: "if at least some inconclusives should be treated as incorrect responses, then the rates of error in open-set studies performed to date are unreliable. Notably, if just the 'Inconclusive-A' responses—those for which the examiner thought there was almost enough agreement to identify a match—for non-matching bullets in the Ames II Study were counted as incorrect matches, the 'false positive' rate would balloon from 0.7% to 10.13%."

But should any of the inconclusives "be treated as incorrect," and if so, how many? Doesn't it depend on the purpose of the studies and the computation? If the purpose is to probe what the PCAST Report neoterically called "foundational validity"—whether a procedure is at least capable of giving accurate source conclusions when properly employed by a skilled examiner—then inconclusives are not such a problem. They represent lost opportunities to extract useful information from the specimens, but they do not change the finding that, within the experiment itself, in those instances in which the examiner is willing to come down on one side or the other, the conclusion is usually correct.

One justice stressed this fact. Justice Gould insisted that

[T]he focus of our inquiry should not be the reliability of the AFTE Theory in general, but rather the reliability of conclusive determinations produced when the AFTE Theory is applied. Of course, an examiner applying the AFTE Theory might be unable to declare a match (“identification”) or a non-match (“elimination”), resulting in an inconclusive determination. But that's not our concern. Rather, our concern is this: when the examiner does declare an identification or elimination, we want to know how reliable that determination is.

He was unimpressed with the extreme view that every failure to recognize "ground truth" is an "error" for the purpose of evaluating an identification method under Daubert. \11/ He argued for error proportions like those used by PCAST: \12/

This brings us to a different way of looking at error rates, one that received no consideration by the Majority ... I am referring to calculating error by excluding inconclusives from both the numerator and the denominator. .... [C]ontrary to Mr. Faigman's unsupported criticism, excluding inconclusives from the numerator and denominator accords with both common sense and accepted statistical methodologies. ... PCAST ... contended that ... false positive rates should be based only on conclusive examinations “because evidence used against a defendant will typically be based on conclusive, rather than inconclusive, determinations.” ... So, far from being "crazy" ... , excluding inconclusives from error rate calculations when assessing the reliability of a positive identification is not only an acceptable approach, but the preferred one, at least according to PCAST. Moreover, from a mathematical standpoint, excluding inconclusives from the denominator actually penalizes the examiner because errors accounted for in the numerator are measured against a smaller denominator, i.e., a smaller sample size.

So what happens when the error proportions for the subset of positive and negative conclusions are computed with the Ames data? The report's denominator is too large, but the resulting bias is not so great in this particular case. For Ames I, Justice Gould's opinion tracks Table 1 above:

With respect to matching [cartridge] sets, the number of inconclusives was so low that whether inconclusives are included in the denominator makes little difference to error rates. Of the 1,090 matching sets, only 11, or 1.01 percent, were inconclusives. Of the conclusive determinations, 1,075 were correctly identified as a match (“identifications”) and four were incorrectly eliminated (“eliminations”). ... Measured against the total number of matching sets (1,090), the false elimination rate was 0.36 percent. Against only the conclusive determinations (1,079), the false elimination rate was 0.37 percent. ...

Of 2,178 non-matching sets, examiners reported 735 inconclusives for an inconclusive rate of 33.7 percent, 1,421 sets as correct eliminations, and 22 sets as incorrect identifications (false positives). ... As a percentage of the total 2,178 non-matching sets, the false positive rate was 1.01 percent. As a percentage of the 1,443 conclusive determinations, however, the false positive rate was 1.52 percent. Either way, the results show that the risk of a false positive is very low

For Ames II,

There were 41 false eliminations. As a percentage of the 1,405 recorded results, the false elimination rate was 2.9 percent. As a percentage of only the conclusive results, the false elimination rate increased to 3.7 percent ... .

... There were 20 false positives. Measured against the total number of recorded results (2,842), the false positive rate was 0.7 percent. Measured against only the conclusive determinations, however, the false positive rate increases to 2.04 percent.
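
The effect of the three competing treatments of inconclusives can be seen by recomputing the Ames I proportions from the counts Justice Gould quotes. The sketch below is illustrative: the function and the treatment labels are my own shorthand for the positions described earlier (counting inconclusives as correct, as the Ames reports did; excluding them, as PCAST and Justice Gould prefer; and counting them as errors, as Professor Faigman urged).

```python
def error_proportion(errors: int, correct: int, inconclusive: int, treatment: str) -> float:
    """Error proportion under one of three conventions for inconclusive responses."""
    total = errors + correct + inconclusive
    if treatment == "as_correct":   # inconclusives stay in the denominator only
        return errors / total
    if treatment == "excluded":     # conclusive determinations only
        return errors / (errors + correct)
    if treatment == "as_error":     # every inconclusive counts as a mistake
        return (errors + inconclusive) / total
    raise ValueError(f"unknown treatment: {treatment}")

# Ames I cartridge-case counts as quoted from Justice Gould's opinion.
AMES_I = {
    "same source (false eliminations)": dict(errors=4, correct=1075, inconclusive=11),
    "different source (false identifications)": dict(errors=22, correct=1421, inconclusive=735),
}

for label, counts in AMES_I.items():
    for treatment in ("as_correct", "excluded", "as_error"):
        rate = error_proportion(**counts, treatment=treatment)
        print(f"{label:42s} {treatment:11s} {rate:.2%}")
```

The first two conventions give the small percentages discussed in the opinions. The third is dominated by the 735 inconclusives among the different-source comparisons, which is why the choice of convention matters far more for false positives than for false negatives in this study.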

In sum, on the issue of whether a substantial number of firearms-toolmark examiners can generally avoid erroneous source attributions and exclusions when tested as in Ames I and Ames II, the answer seems to be that, yes, they can. Perhaps this helps explain the Chief Justice's concession that "[t]he relatively low rate of 'false positive' responses in studies conducted to date is by far the most persuasive piece of evidence in favor of admissibility of firearms identification evidence." But the court was quick to add that "[o]n balance, however, the record does not demonstrate that that rate is reliable, especially when it comes to actual casework."

Extrapolating from the error proportions in experiments to those in casework is difficult indeed. Is the largely self-selected sample of examiners who enroll in and complete the study representative of the general population of examiners doing casework? Does the fact that the enrolled examiners know they are being tested make them more (or less) careful or cautious? Do examiners have expectations about the prevalence of true sources in the experiment that differ from those they have in casework? Are the specimens in the experiment comparable to those in casework? \13/ Do error probabilities for comparing marks on cartridge cases apply to the marks on the bullets they house? Does it matter if the type of gun used in the experiment is different from the type in the case?

Most of the questions are matters of external validity. Some of them are the subject of explicit discussion in the opinions in Abruquah. For example, Justice Gould rejects, as a conjecture unsupported by the record, the concern that examiners might be more prone to avoid a classification by announcing an "inconclusive" outcome in an experiment than in practice.

To different degrees, the generalizability questions interact with the legal question being posed. As I have indicated, whether the scientific literature reveals that a method practiced by skilled analysts can produce conclusions that are generally correct for evidence like that in a given case is one important issue under Daubert. Whether the same studies permit accurate estimates of error probabilities in general casework is a distinct, albeit related, scientific question. How to count or adjust for inconclusives in experiments is but a subpart of the latter question.

And, how to present source attributions in the absence of reasonable error-probability estimates for casework is a question that Abruquah barely begins to answer. No opinion embraced the defendant's argument that only a limp statement like "unable to exclude as a possible source" is allowed. But neither does the case follow other courts that allow statements such as the awkward and probably ineffectual "reasonable degree of ballistic certainty" for expressing the difficult-to-quantify uncertainty in toolmark source attributions. After Abruquah, if an expert makes a source attribution in Maryland, some kind of qualification or caveat is necessary. \14/ But what will that be?

Toolmark examiners are trained to believe that their job is to provide source conclusions for investigators and courts to use, but neither law nor science compels this job description. Perhaps it would be better to replace conclusion-centered testimony about the (probable) truth of source conclusions with evidence-centered statements about the degree to which the evidence supports a source conclusion. The Abruquah court wrote that

The reports, studies, and testimony presented to the circuit court demonstrate that the firearms identification methodology employed in this case can support reliable conclusions that patterns and markings on bullets are consistent or inconsistent with those on bullets fired from a particular firearm. Those reports, studies, and testimony do not, however, demonstrate that that methodology can reliably support an unqualified conclusion that such bullets were fired from a particular firearm.

The expert witness's methodology provides "support" for a conclusion, and the witness could simply testify about the direction and magnitude of the support without opining on the truth of the conclusion itself. \15/ "Consistent with" testimony is a statement about the evidence, but it is a minimal, if not opaque, description of the data. Is it all that the record in Abruquah—not to mention the record in the next case—should allow? Only one thing is clear—fights over the legally permissible modes for presenting the outcomes of toolmark examinations will continue.
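
One conventional way to express the "direction and magnitude" of support is a likelihood ratio: the proportion of same-source comparisons that produce a given reported outcome, divided by the proportion of different-source comparisons that produce it. None of the opinions in Abruquah computes such a figure. What follows is only a sketch that applies the idea to the Ames I cartridge-case counts quoted above. It describes performance under the test conditions, not casework, and the names in the code are mine.

```python
# Ames I cartridge-case outcomes as quoted from Justice Gould's opinion.
SAME_SOURCE = {"identification": 1075, "inconclusive": 11, "elimination": 4}
DIFFERENT_SOURCE = {"identification": 22, "inconclusive": 735, "elimination": 1421}

def likelihood_ratio(outcome: str) -> float:
    """Prop(outcome | same source) / Prop(outcome | different source).

    Values above 1 support the same-source hypothesis; values below 1
    support the different-source hypothesis.
    """
    p_same = SAME_SOURCE[outcome] / sum(SAME_SOURCE.values())
    p_diff = DIFFERENT_SOURCE[outcome] / sum(DIFFERENT_SOURCE.values())
    return p_same / p_diff

for outcome in ("identification", "inconclusive", "elimination"):
    print(f"{outcome:15s} LR = {likelihood_ratio(outcome):.3g}")
```

Inconclusives need no special convention here, because each reported outcome gets its own ratio. In these data an inconclusive is itself far more common among different-source comparisons, so it is not evidentially neutral. Whether ratios from such experiments can be carried over to casework raises the same external-validity questions discussed above.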


  1. In Commonwealth v. Pytou Heang, 942 N.E.2d 927 (Mass. 2011), the Massachusetts Supreme Judicial Court upheld source-attribution testimony "to a reasonable degree of scientific certainty" but added "that [the examiner] could not exclude the possibility that the projectiles were fired by another nine millimeter firearm." The Court proposed "guidelines" to allow source attribution to no more than "a reasonable degree of ballistic certainty."
  2. The defendant was convicted at a first trial in 2013, then retried in 2018. In 2020, the Maryland Supreme Court changed its standard for admitting scientific evidence from a requirement of general acceptance of the method in the relevant scientific communities (denominated the "Frye-Reed standard" in Maryland) to a more direct showing of scientific validity described in Daubert v. Merrell Dow Pharmaceuticals, 509 U.S. 579 (1993), and an advisory committee note accompanying amendments in 2000 to Federal Rule of Evidence 702 (called the "Daubert-Rochkind standard" in Abruquah).
  3. The experts at the "Frye-Reed hearing" were William Tobin (a "Principal of Forensic Engineering International" and "former head of forensic metallurgy operations for the FBI Laboratory," Unsurpassed Experience); James Hamby ("a laboratory director who has specialized in firearm and tool mark identification for the past 49 years" and who is "a past president of AFTE [the Association of Firearm and Tool Mark Examiners] ... and has trained firearms examiners from over 15 countries worldwide," Speakers, International Symposium on Forensic Science, Lahore, Pakistan, Mar. 17-19, 2020); Torin Suber ("a forensic scientist manager with the Maryland State Police"); and Scott McVeigh (the firearms examiner in the case).
  4. The experts at the supplemental "Daubert-Rochkind hearing" were James Hamby (a repeat performance), and David Faigman, "Chancellor & Dean, William B. Lockhart Professor of Law and the John F. Digardi Distinguished Professor of Law" at the University of California College of the Law, San Francisco.
  5. Remarks on the legal analysis will appear in the 2024 cumulative supplement to The New Wigmore, A Treatise on Evidence: Expert Evidence.
  6. The opinion gives a simplified example:
    The test administrator fires two bullets from each of 10 consecutively manufactured handguns. The administrator then gives you two sets of 10 bullets each. One set consists of 10 “unknown” bullets—where the source of the bullet is unknown to the examiner—and the other set consists of 10 “known” bullets—where the source of the bullet is known. You are given unfettered access to a sophisticated crime lab, with the tools, supplies, and equipment necessary to conduct a forensic examination. And, like the vocabulary tests from grade school requiring you to match words with pictures, you must match each of the 10 unknown bullets to the 10 known bullets.

    Even though you know that each of the unknowns can be matched with exactly one of the knowns, you probably wouldn't know where to begin. If you had to resort to guessing, your odds of correctly matching the 10 unknown bullets to the 10 knowns would be one out of 3,628,800. [An accompanying note 11 explains that: "[w]ith 10 unknown bullets and 10 known bullets, the odds of guessing the first pair correctly are one out of 10. And if you get the first right, the odds of getting the second right are one out of nine. If you get the first two right, the odds of getting the third right are one out of eight, and so on. Thus, the odds of matching each unknown bullet to the correct known is represented by the following calculation: (1/10) x (1/9) x (1/8) x (1/7) x (1/6) x (1/5) x (1/4) x (1/3) x (1/2) x (1/1)."] Even if you correctly matched five unknown bullets to five known bullets and guessed on the remaining five unknowns, your odds of matching the remaining unknowns correctly would be one out of 120. [Note 12: "(1/5) x (1/4) x (1/3) x (1/2) x (1/1)."] Not very promising.

    The closed-set and semi-closed-set studies before the trial court—the studies which PCAST discounted—show that if you were to properly apply the AFTE Theory, you would be very likely to match correctly each of the 10 unknowns to the corresponding knowns. See Validation Study; Worldwide Study; Bullet Validation Study. ... Your odds would thus improve from virtually zero (one in 3,628,800) to 100 percent. Yet according to PCAST, those studies provide no support for the scientific validity of the AFTE Theory. ...
  7. David P. Baldwin, Stanley J. Bajic, Max Morris & Daniel Zamzow, A Study of False-positive and False-negative Error Rates in Cartridge Case Comparisons, Ames Laboratory, USDOE, Tech. Rep. #IS-5207 (2014).
  8. David H. Kaye, PCAST and the Ames Bullet Cartridge Study: Will the Real Error Rates Please Stand Up?, Forensic Sci., Stat. & L., Nov. 1, 2016.
  9. David H. Kaye et al., Toolmark-comparison Testimony: A Report to the Texas Forensic Science Commission, May 2, 2022.
  10. I will note, however, that the report apparently strains to make the attained levels for reliability seem high. Alan H. Dorfman & Richard Valliant, A Re-Analysis of Repeatability and Reproducibility in the Ames-USDOE-FBI Study, 9 Stat. & Pub. Pol'y 175 (2022).
  11. The opinion attributes this view to
    Mr. Abruquah's expert, Professor David Faigman, [who declared] that "in the annals of scientific research or of proficiency testing, it would be difficult to find a more risible manner of measuring error." To Mr. Faigman, the issue was simple: in Ames I and II, the ground truth was known, thus "there are really only two answers to the test, like a true or false exam[ple]." Mr. Faigman explained that "the common sense of it is if you know the answer is either A or B and the person says I don't know, in any testing that I've ever seen that's a wrong answer." He argued, therefore, that inconclusives should be counted as errors.
  12. See also NIST Expert Working Group on Human Factors in Latent Print Analysis, Latent Print Examination and Human Factors: Improving the Practice Through a Systems Approach (David H. Kaye ed. 2012) (arguing against counting inconclusives in error proportions that are supposed to indicate the probative value of actual conclusions).
  13. Testing examiner performance in the actual flow of cases would help address the last three questions. A somewhat confusing analysis of results in such an experiment is described in a posting last year. David H. Kaye, Preliminary Results from a Blind Quality Control Program, Forensic Sci., Stat. & L., July 9, 2022.
  14. The court wrote that:
    It is also possible that experts who are asked the right questions or have the benefit of additional studies and data may be able to offer opinions that drill down further on the level of consistency exhibited by samples or the likelihood that two bullets or cartridges fired from different firearms might exhibit such consistency. However, based on the record here, and particularly the lack of evidence that study results are reflective of actual casework, firearms identification has not been shown to reach reliable results linking a particular unknown bullet to a particular known firearm.
  15. See, e.g., David H. Kaye, The Nikumaroro Bones: How Can Forensic Scientists Assist Factfinders?, 6 Va. J. Crim. L. 101 (2018).

LAST UPDATED 29 June 2023