Forensic Science, Statistics & the Law

Evidence-centric and Conclusion-centric: What's the Difference?

2024-09-24T00:14:00.006-04:00

In recent months, various forensic practitioners have asked for a short and simple explanation of the difference between the "evidence-centric" and "conclusion-centric" forms of reporting and testifying. I figure that's what I get for introducing the phrases in a panel asked to talk about “Bringing Statistics into the Courtroom” at the Conference on Forensics, Statistics, and Law, at the University of Virginia in March 2018.^\1/ The particular phrases may have been original (at least I think they were), but the underlying ideas, which are what matter, have been around for a long time.

For concreteness, I will focus on a specific example--something that is the target of criticism du jour in US courts. I am thinking of firearms-toolmark examinations culminating in conclusions about the source of a bullet or shell casing. However, the dichotomy between statements of the evidence and its probative value, on the one hand, and statements of the inferences or conclusions that follow from the evidentiatary statements, on the other hand, is far more general. It applies to techniques for shedding light on the possible source of fingerprints, treadmarks, handwriting, images, glass fragments, paint chips, fibers, bones, bitemarks, or any other traces acquired at or around crime scenes or victims. Indeed, the distinction applies to all manner of empirical inquiry, as suggested by the subtitle of the late Frederick Schauer's book, The Proof: Uses of Evidence in Law, Politics, and Everything Else.^\2/ Moreover, nothing in the terminology reveals whether the evidence in question comes from a scientifically validated process, from (typically) less reliable modes of acquiring and assessing data, or from superstitions masquerading as knowledge.

A Recent Case

With these broad obervations in mind, let's turn to a recent appellate opinion comparing the state's arguments for introducing the conclusions of a firearms examiner to arguments that could be marshalled for believing in palmistry. In People v. Tidd, No. A167548, 2024 WL 3982134 (Cal. Ct. App., Aug. 29, 2024), someone in a white sport utility vehicle shot an enebriated pedestrian in the leg. Investigators located a nine-millimeter Luger cartridge case, manufactured by Speer, in the vicinity. Six days later, police arrrested Raymond Tidd as he was about to enter a white SUV carrying a loaded nine-millimeter Sig Sauer pistol. He pled guilty to some firearms possession and carrying charges but went to trial for attempted murder and other felonies for the shooting. Before trial, he objected to testimony from a criminalist with the San Francisco Police Department named Jacobus Swanepoel. The trial court held a hearing and ruled that the expert could testify that the cartridge case came from the defendant's gun as long as he refrained from asserting his opinion “to a scientific certainty.”

At trial, Swanepoel "flatly asserted the recovered cartridge and test cartridges fired from Tidd's gun 'were fired in the same firearm.'" The jury found Tibbs guilty of two firearms felonies for the shooting (but not of attempted murder). However, the court of appeals reversed because the prosecution never showed that the highly subjective judgmental procedure the expert described was "reasonably reliable."

The court's analysis of what it would take to achieve such reliability is open to debate, but that is a subject for another day. My target today is the statements that were made or could have been made about the cartridge cases and the gun or guns that fired them. What statements are "conclusion-centric"? Which are "evidence-centric"?

A Preliminary Point

"Evidence" exists in the context of some hypothesis about a matter of fact (a state of nature). Nothing can sensibly be deemed "evidence" without asking, "evidence of what?" That the sky is dark is evidence that it is nighttime. It is not conclusive evidence (nothing is conclusive evidence of an empirical fact) because darkness in the daytime can come from a total solar eclipse (or even less likely events). But darkness is pretty good evidence (E) for the hypothesis (H) that night has arrived. The hypotheses of most interest in cases like Tidd are source hypotheses like Swanepoel's assertion that the cartridges were "fired in the same gun" or "the cartridges were fired in different guns." These assertions are conclusions that the analyst reaches (or does not reach) on the basis of the evidence about the marks on the objects examined.

Conclusion-centric Statements

With respect to the source hypothesis, a scan of the opinion in Tibbs reveals the following conclusion-centered statements from the firearms examiner:

The cartridges "were fired in the same gun." In other words, the same-gun hypothesis H is true (given the markings seen in the examination). It may not be "scientifically certain," but it is true nonetheless. In symbols, P(H|E) is practically (or exactly?) 1, where the expression "P(H|E)" stands for "the probability of the hypothesis H given the evidence E." Obviously, this is conclusion-centric, as it tells the jury what they should believe about H (or at least what the examiner believes about this conclusion).
"[I]t was 'more likely than not that this [Sig Sauer pistol] fired the cartridge casing' he was analyzing." This sounds like a statement that, given the marks (E), H is probably true: P(H|E) > 1/2. This statement is also conclusion-centered. It is a statement about the (probable) truth of H. However, in this case,
"He explained that by agreeing to 'more likely' he was 'saying it is this firearm,' and by 'not' he was expressing that the chance of finding another firearm 'with the same signature or the same fingerprint is remote or really small.'" Without getting hung up on the dissonance in this compound sentence, this is a statement that given the markings (E), the probability of not-H is "really small." Mathematically, P(not-H|E) implies that H is not merely probable--it is really probable: P(H|E) is almost 1. Another conclusion-centric statement.
"[T]he Sig Sauer pistol could not be excluded from the class of firearms that could have fired the cartridge case submitted for analysis." Also conclusion-centered. "Could not be excluded" means "is included as" (or "is consistent with") a possible source, it asserts that P(H|E) > 0. That focuses on the (possible) truth of the conclusion. But classifying this phrasing as conclusion-centric is a close case. If "could not be excluded" were a more neutral statement of the pair of observations themselves (such as "has the same characteristics" or "matches") then it is a statement about the evidence E rather than what E proves.

Evidence-centric Statements

"Investigators also located a nine-millimeter Luger cartridge case ... on the street near where the shooting took place." With respect to the source conclusion, the statement of the caliber is evidence-centric in the sense that it specifies a feature of ammunition that might have used in the shooting without drawing a conclusion (be it categorical or probabilistic) about the gun taken from Tidd as having had that ammunition fired in it. That Tidd was found with a gun of the same caliber makes the characteristic incriminating evidence--it raises the probability of H. In symbols, P(H|caliber match) > P(H), where P(H) is the probability of H prior to considering the caliber data.
“[I]t would be rare to find another firearm that exhibits that same ... pattern ... .” Evidence-centric. We have an assertion that the frequency of the incriminating marks is unusual in the population of guns that might have been used in the shooting. Finding a rare pattern supports H because it would be surprising to see the same (or more persuasive) evidence E if a different gun had been used. This is an assertion about the value of the evidence as proof of H rather than assertion about the truth of H itself.^\3/

So far we have encountered (1) statements of conclusions (or their probabilities) about a source hypothesis and (2) statements of the evidence ("it's a 9-mm caliber bullet," "the bullets have matching patterns," and so on) with regard to the source hypothesis. But there is a further category of evidence-centric statements that are more interesting. These evidence-centric statements go beyond descriptions of the data and comparisons using the data, but they do not go so far as to express conclusions. Rather, they are statements of the degree to which the evidence supports one hypothesis relative to alternative hypotheses.

I cannot find these in Tidd, for the examiner there was following the conclusion-centric paradigm espoused by the Association of Firearm and Tool Mark Examiners (AFTE). But they not hard to envision. Their defining feature is that they concern the probability of the evidence under the different hypotheses rather than the conclusion-centric probability of the hypothesis given the evidence. For example:

The patterns (considering all the similarities and differences) are much more probable for bullets fired from defendant's gun than for bullets fired from different guns.
The patterns are much more compatible with the hypothesis that bullets were fired from the defendant's gun than the hypothesis that they were fired from different guns.
The patterns provide strong support for the hypothesis that bullets were fired from the defendant's gun rather than the hypothesis that they were fired from different guns.

These statements all flow from assessing how probable the patterns are for pairs of hypotheses. They look to P(E|H) rather than to the examiner's judgments of P(H|E). The underlying principle is that patterns that are more probable under one hypothesis than another are stronger evidence for the former hypothesis and thus support it instead of supporting the competing hypothesis. The hope is that experts will better assist the judge or jury by telling them about the strength of the evidence in this way instead of telling them which hypothesis to believe (to some degree of certainty).

Notes

David H. Kaye, The Nikumaroro Bones: How Can Forensic Scientists Assist Factfinders?, 6 Va. J. Crim. L. 101 (2018), https://www.ssrn.com/abstract=3177752. See also David H. Kaye, Forensic Statistics in the Courtroom, in Handbook of Forensic Statistics 225 (David Banks et al. eds., 2021), https://www.ssrn.com/abstract=3561914.
Frederick Schauer, The Proof: Uses of Evidence in Law, Politics, and Everything Else (2022).
Thus, "really small" is a qualitative expression of p-value. In statistics, a p-value is the the probability of such extreme evidence arising when the "alternative hypothesis" (such as "different gun") is true. Everything else being equal, the smaller the p-value, the harder it is to believe the alternative hypothesis is true.

Who Published Bayes' Theorem?

2024-09-22T19:20:00.002-04:00

For the legal profession, "American Law Reports or ALR is a longstanding, highly trusted series of in-depth articles, called annotations, on specific legal issues."^\1/ But the report on "Use of Bayes' Theorem in Criminal and Civil Cases"^\2/ may be less worthy of trust than the average annotation. Historians and statisticians, at least, would be surprised at its opening sentences:

Bayes' Theorem, named for the English clergyman and scientist who published it in 1763, is a scientific principle of the likelihood ratio used to calculate conditional probabilities leading to a variety of statistical "Bayesian" methodologies. It speculates on the probability that a particular fact is true or that a particular event will occur, given our knowledge of one or more related facts or events.

There is no doubt (asymptotically speaking) that in 1763, Thomas Bayes did not publish the theorem that bears his name. He was dead. A close friend, the Reverend Richard Price, read the paper to the Royal Society of London after retrieving it from Bayes's papers and adding to it.^\3/ This expanded and modified version appeared in the Society's Philosophical Transactions.

The ALR's rendering of 18th Century history won't cause any problems for lawyers employing or confronting Bayesian analyses in court. Posthumous or otherwise, discovered first by Bayes or by someone else,^\4/ the theorem is what it is. But what will lawyers make of the idea that the theorem is "a scientific principle of the likelihood ratio" that "speculates" about conditional probability? Is the theorem mere speculation? Is it only about a likelihood ratio? No, and no.

The usual understanding is that a likelihood ratio is the data-based, objective part of the theorem and that speculation becomes an issue if and when the prior probability is not objective as well. As Judea Pearl and Dana MacKenzie explain,

Bayesian statistics gives us an objective way of combining the observed evidence with our prior knowledge (or subjective belief) to obtain a revised belief ... . Still, what frequentists could not abide was that Bayesians were allowing opinion, in the form of subjective probabilities, to intrude into the pristine kingdom of statistics. Mainstream statisticians were won over only grudgingly, when Bayesian analysis proved a superior tool for a variety of applications ... .^\5/

Fitting expert prior probabilities and likelihoods into the legal process poses some special problems for mainstream statisticians. That accommodation is the subject of continuing, if often repetitive, multidisciplinary dialog that does not fit neatly and quickly into a blog.

Notes

University of South Carolina School of Law, Legal Research, Analysis & Writing, https://guides.law.sc.edu/LRAWSpring/LRAW/alr.
Deborah F. Buckman, Annot., Use of Bayes' Theorem in Criminal and Civil Cases, 47 A.L.R.7th Art. 4 (2019).
Stephen M. Stigler, Richard Price, The First Bayesian, 33 Statistical Science 117 (2018), https://doi.org/10.1214/17-STS635; Martyn Hooper, Richard Price, Bayes’ Theorem, and God, Significance 36-39 (Feb. 2013), https://www.york.ac.uk/depts/maths/histstat/price.pdf.
Stephen M. Stigler, Who Discovered Bayes’s Theorem?, 37 Am. Stat. 290-296 (1983).
Judea Pearl & Dana MacKenzie, The Book of Why: The New Science of Cause and Effect 90 (2018).

NIST's Forensic Science Environmental Scan 2023

2024-09-05T15:22:00.001-04:00

Yesterday, I received an email that included the following notice:

NIST released a new report today from an intensive 18-month, multi-pronged effort by the NIST Forensic Science team to assess the strategic opportunities for forensic science research and standards that are needed to drive significant advancements in the practice of forensic science over the coming decade.

The report is based on an extensive literature review, input from external stakeholders, and feedback from NIST subject matter experts. Throughout 2023, NIST conducted an assessment of the forensic science environment to inform strategic planning efforts. The resulting Forensic Science Environmental Scan 2023 report captured salient issues and trends across five different landscapes: governance, economic, societal, scientific and technological, and legal and regulatory. In addition, NIST held a roundtable in September 2023 with forensic science thought leaders from across forensic disciplines to discuss the long-term vision and strategic priorities for forensic science in the United States. NIST used these inputs to identify the grand challenges facing forensic science today, the strategies for addressing them through advances in research and standards, and the subsequent implementation of the advances into forensic science practice.

I can't say I have read the report yet, but I am posting the notice for anyone who wants to scan the "environmental scan."

Smith v. Arizona: The Facts and the Outcome

2024-08-26T20:04:00.007-04:00

In A.E. Van Vogt’s murky but classic science fiction novel, The World of Null-A, the main character repeatedly dies, only to reappear in a new body with no memory of his former life. With help from extra brain matter, he moves on anyway. That, very roughly, is what the state of Arizona tried to accomplish in Smith v. Arizona, 144 S.Ct. 1785 (2024). It wrote off one expert witness but allowed her to live on through another expert. This plot twist left the defendant complaining that his Sixth Amendment “right … to be confronted with the witnesses against him” had been violated. This right, the Supreme Court repeatedly has held, prevents the prosecution from introducing “testimonial hearsay” without producing the author of the hearsay for cross-examination.

The Court’s struggle to define “testimonial” is becoming epic, and the definition of hearsay is one of those terrors of law school that leaves many students wondering whether they should have gone to some other professional or graduate school. One essential feature of hearsay is that the assertion is offered to prove “the truth of the matter asserted.” This phrase was at the center of the dispute in Smith, and the Court usefully clears up some of the confusion created by the opinions in Williams v. Illinois, 567 U.S. 50 (2012). Unfortunately, the Court did not stop there but also expressed some hasty thoughts about what might make a forensic-science expert’s hearsay statements nontestimonial.

This post describes the facts of Smith. They are drawn primarily from Justice Kagan's opinion for the Court. Ruminations on both parts of the majority opinion—and the three concurring opinions—may come later.

Jason Smith was arrested in a shed in Yuma county that contained “a large quantity of what appeared to be drugs and drug-related items.” He pleaded not guilty to charges of possession. The State asked “a crime lab run by the Arizona Department of Public Safety (DPS) for a ‘full scientific analysis.’” The State informed the lab of who the defendant was, what he was charged with, and the fact that a trial was pending. An analyst named Elizabeth Rast “ran the requested tests.” She typed out notes and submitted a signed report. The notes described each item, its weight, the tests performed, and a conclusion about its identity. The report of “results/interpretations” stated that four items “[c]ontained a usable quantity of methamphetamine,” three “[c]ontained a usable quantity of marijuana,” and one “[c]ontained a usable quantity of cannabis.”

After Rast “stopped working at the lab, for unexplained reasons,” prosecutors replaced her name on the “final pre-trial conference statement” with thast of a current employee. They promised that “Greggory Longoni, [a] forensic scientist (substitute expert),” who had no previous connection to the case, would “provide an independent opinion on the drug testing performed by Elizabeth Rast.” At trial, Longoni purported to give an independent opinion on the nature of the eight items. Yet, he arrived at his opinion by consulting nothing more than Rast's report and notes. He did no re-examination or retesting of his own. He merely

referred to those materials and related what was in them, item by item by item. As to each, he described the specific ‘scientific method[s]’ Rast had used … (e.g., a microscopic examination, a chemical color test, a gas chromatograph/mass spectrometer test). … [H]e stated that the testing had adhered to ‘general principles of chemistry,’ as well as to the lab's ‘policies and practices …. [H]e noted, for example, that Rast had run a “blank” to confirm that testing equipment was not contaminated. \1/

Readers of the opinion might think that "those materials" to which Longoni referred "[w]hen [he] took the stand" included Rast's report, but this is not quite correct. The prosecution was careful not to ask Longoni to recite the conclusions in the report. For example, the prosecutor posed these questions:

Q Let me be clear. You’re not testifying as to her report, you’re testifying as to review of lab notes?
A Correct.
Q In reviewing what was done, your knowledge and training as a forensic scientist, your knowledge and experience with DPS’s policies, practices, procedures, your knowledge of chemistry, the lab notes, the intake records, the chemicals used, the tests done, can you form an independent opinion on the identity of Item 26?
A Yes.

In this manner, Longoni told the jury that his “independent opinion” was that “Item 26 was 'a usable quantity of marijuana,' … Items 20A and 20B were 'usable quantit[ies] of methamphetamine,' and … Item 28 was '[a] usable quantity of cannabis.'”

The jury convicted, and Smith appealed on the ground that he was convicted via Rast’s written statements with no opportunity to cross-examine her. The State insisted that “Longoni testified about ‘his own independent opinions,’ even though making use of Rast's records.” The Arizona Court of Appeals agreed that Longoni was “present[ing] his independent expert opinions” as “based on his review of Rast's work.” It affirmed, relying on a 2014 case in which it had stated that an expert may testify to “the substance of a non-testifying expert's analysis, if such evidence forms the basis of the [testifying] expert's opinion.” State ex rel. Montgomery v. Karp, 236 Ariz. 120 (Ct. App. 2014). According to the Court of Appeals in Karp, the “underlying facts” are then “used only to show the basis of [the in-court witness's] opinion and not to prove their truth.” All this seemed so obvious to the court that it did not think its opinion was even worth publishing. The Arizona Supreme Court declined review without comment.

The U.S. Supreme Court was much more interested. In Williams v. Illinois, 567 U.S. 50 (2012), five Justices—a majority of the Court—had rejected this very reasoning. These Justices had dismissed the basis-only rationale as “legal fiction” (Thomas, J., concurring), “very weak,” “factually implausible,” “nonsense,” and “sheer fiction.” (Kagan, Scalia, Ginsburg & Sotomayor, JJ., dissenting and quoting D. Kaye, D. Bernstein, & J. Mnookin, The New Wigmore: Expert Evidence §4.10.1, pp. 196-197 (2d ed. 2011); id., §4.11.6, at 24 (Supp. 2012)). However, the remaining four Justices, in a plurality opinion written by Justice Alito, had vigorously advanced the not-for-its-truth theory for basis evidence.

The absence of a unifying rationale from a majority of the Court in Williams enabled the Arizona court to rely on one part of the Williams plurality opinion in one breath and to say that the “plurality decision … has limited if any precedential value” in the next. Meanwhile, other states had found the only-to-show-the-basis argument entirely unpersuasive.

And, so, for the fourth time, the Supreme Court granted a writ of certiorari to review either the introduction of forensic-science test findings made by analysts who were not presented for cross-examination or references to another analyst's findings by the expert presenting the scientific evidence. The petition propounded the question:

Whether the Confrontation Clause of the Sixth Amendment permits the prosecution in a criminal trial to present testimony by a substitute expert conveying the testimonial statements of a nontestifying forensic analyst, on the grounds … that the testifying expert offers some independent opinion and the analyst's statements are offered not for their truth but to explain the expert's opinion … .

On this narrow issue, the Court was unanimous: At least when the surrogate expert endorses the basis for his opinion as reliable, introducing the missing witness’s out-of-court but “testimonial” statements of and about her findings violates the Confrontation Clause. Justice Kagan wrote an opinion, joined in its discussion of this issue by every other Justice (except for Justice Alito and the Chief Justice). Justices Thomas and Gorsuch also filed concurring opinions that no one else joined. Justice Alito, joined only by the Chief Justice, filed an concurring opinion accusing the rest of the Court of “inflict[ing] a needless, unwarranted, and crippling wound on modern evidence law” and of “blow[ing] up the Federal Rules” with “a radical change” to the Rule 703 framework for admitting expert testimony. This is an obvious exaggeration, but explaining why merits a later post to unpack the logical foundation of Rule 703.

Beyond the hearsay issue, based on little more than sketchy suggestions from the Deputy Solicitor General at oral argument, Justice Kagan proposed that laboratory documentation could be introduced with no opportunity for confrontation if its “primary purpose” was for accreditation, quality control, or unofficial “notes to self.” Two of the seven Justices signing the majority opinion defected from this part of it. Justice Gorsuch expressed broader skepticism of the primary-purpose tests the Court had introduced in previous Confrontation Clause cases. Already, crime laboratories are talking about how to restructure or rewrite their documentation to fit into these new, nontestimonial categories. \2/ As I see it (so far), the nature of forensic science makes these alleged purposes too intertwined with the knowing production of evidence for a “primary purpose” test to disentangle them, but that conclusion also begs for later explanation.

In any event, the Supreme Court unanimously vacated the state court of appeals order affirming Smith's conviction and remanding for the Arizona Court of Appeals “[t]o address the additional issue of whether Rast's records were testimonial (including whether that issue was forfeited).” The state then conceded that it had forfeited the opportunity to argue that parts of the testimony were not “testimonial” hearsay under Crawford because it had not raised the argument in its appeal. The Arizona Court of Appeals remanded to the Yuma County Superior Court to vacate the judgment of conviction.

But this did not guarantee Jason Smith his freedom. The state could have tried him a second time. Perhaps Rast could be located to testify. Or, the lab could test the drugs again. (At trial, Longoni had testified that retesting would have taken only two to three hours.) After that, an analyst who did such testing could have testified to truly independent findings. On August 20, however, Jason Smith entered into a plea bargain, sparing the state the burden of retrying him. What concession he received in return, I do not know.

Note

144 S.Ct. at 1799. But see id. at 1802 (declining to resolve a dispute between the parties as to whether Longoni's references to the report indicated that he partially relied on it as opposed to only the notes).
Consortium of Forensic Science Organizations, Untitled Memorandum, Aug. 20, 2024, https://thecfso.org/wp-content/uploads/2024/08/Smith-v-Arizona-Final.pdf

A Draft Standard on "Terminology for a Suspected Pattern of Dental Origin"

2024-08-22T22:06:00.002-04:00

The Academy Standards Board (ASB) of the American Academy of Forensic Sciences is seeking comments by September 9 on a Technical Report 194, First Edition, 2024, titled "Terminology for a Suspected Pattern of Dental Origin." Although only a "template" for organizing comments is listed on the ASB website, the draft standard can be found at https://www.aafs.org/sites/default/files/media/documents/194_TR_Ballot01.pdf.

The OSAC page on “Standards Open for Comment” (which you won't see by perusing the OSAC website unless you click on “How To Work With Us” in the navigation pane) says “NOTE: This is OSAC 2021-N-0030, Terminology for a Suspected Pattern of Dental Origin, currently on the OSAC Registry.” Does that mean the ASB committee believed that nothing in the OSAC product, which emerged with no review from an advisory scientific and technical panel of experts, needed improvement? It is good to go public as an SDO-approved standard for terminology unless someone objects and proposes something better?

Alas, it is not that good. Although a discussion of the choice of various terms to define and the definitions themselves could occupy pages, it is too tedious an undertaking for me to write or for many readers to plough through. Suffice it to say that some of the standard has a stream-of-consciousness feeling to it. Like "spurious observation anomaly not intrinsically present feature not related to the source."

The larger question is why try to promulgate a free-floating standard terminology rather than articulate standard procedures with appropriate terms? What might these standard practices be? Presumably, the proposed terminology is a precursor to performing "suspected pattern of dental origin analysis," which is defined as "forensic examination, analysis, and determination of the pattern for potential links to dental origins." This "potential links" study seems to be subdivided into (1) "bitemark assessment analysis," (2) "bitemark analysis," (3) "bitemark comparison analysis," and (4) "bitemark individualization analysis." Can any of these analyses produce results of "evidentiary value" (defined as "information of sufficient usefulness to serve as the basis for making an empirically significant scientific determination")?

The dentists are not prepared to say so. But neither are they willing to list in their bibliography any of the well-known articles and reports concluding that demonstrations of the scientific validity of these analyses are little more than wishful thinking. Rather than claim that these terms refer to procedures that have "evidentiary value" or opine that they lack such value, they merely note that their definitions are "not an endorsement of [the] scientific validity" of the processes they are supposed to describe. Apparently, the drafters from OSAC are agnostics rather than atheists. Or maybe they have their doubts about "bitemark individualization analysis." That phrase is marked "deprecated." Yet, the process of "visual comparison" is not deprecated as unvalidated or invalid, and a note suggests that other "individualization method[s]" for bitemarks could be just dandy.

So the best comment might be a recommendation to jettison this standard. Researchers can use their own clearly defined terms in devising and validating procedures that can be used in criminal investigations that involve what might be toothmarks, bitemarks, or wounds from other sources and mechanisms. When demonstrably valid procedures become available, the time will be ripe for a standard with uniform terminology.

Meanwhile, promulgating these terms and definitions, even with the agnostic disclaimers, risks encouraging the acceptance of dubious forensic "science." The very existence of an expert standard with these terms and definitions might suggest that the words describe something meaningful and encourage testimony that the field has a standardized system of some kind. I can imagine testimony that

The terms I am using in the analysis of what I have determined to be bitemarks are generally accepted in forensic odontology, medicine, and forensic science. I am following the ASB technical report on the subject. The report was produced with funding from the National Institute of Standards and Technology and is included and recommended for adoption by the government-supported Organization of Scientific Area Committees for Forensic Science.

Is this what forensic science and the law needs?

“Predictive or Profiling Evidence” and Diaz v. United States

2024-07-01T21:55:00.001-04:00

Today the Association of American Law School’s (AALS) Section on Evidence distributed the following announcement to its members:

In a divided decision, the Supreme Court recently concluded that expert testimony about the likely mental state of individuals arrested with drugs in their position is admissible in criminal trials under the Federal Rules of Evidence in Diaz v. United States, 602 U. S. ____ (2024). This decision was sharply contested in its own right, but also drew attention to an area of broader controversy in the law of evidence: the increasing use of "predictive" or "profiling" evidence, by which expert witnesses present testimony suggesting that an individual is more or less likely to have had a particular mental state or behaved in a particular way based on their personal circumstances or characteristics. Scholars who have written about this phenomenon have expressed disquiet about the use of such evidence (and some courts have limited the use of such evidence in edge cases involving particular prejudice or overbroad characterizations), but no consensus has emerged as to the reasons for objecting to predictive evidence or as to how to systematically distinguish between such evidence and other forms of indirect and circumstantial evidence that is routinely admitted. This panel brings together scholars to discuss predictive and profiling evidence from a variety of perspectives. Was the Supreme Court right to find such evidence consistent with the federal rules governing the admissibility of expert evidence? Is such evidence generally consistent with due process and equal protection concerns? Is it generally empirically sound? Do we need new federal rules or common law doctrines to limit the admissibility of some forms of predictive or profiling evidence?

Who can object to a scholarly program on the subject, even if it is hardly new, having been the subject of a multitude of opinions, a number of statutes, and even a previous AALS program decades ago? Hopefully, the courts and the professoriate have made some progress in understanding what has come to be called "framework evidence." Neither would it be fair to criticize a necessarily brief announcement for not defining "predictive or profiling evidence." Presumably, the evidence teachers do not need a definition.

But I was surprised to read my fellow law professors’ sweeping characterization of Diaz v. United States, 144 S.Ct. 1727 (2024), as establishing that "such evidence [is] consistent with the federal rules governing the admissibility of expert evidence" and "that expert testimony about the likely mental state of individuals arrested with drugs in their position [sic?] is admissible in criminal trials under the Federal Rules of Evidence." Although I doubt that the announcement will cause its narrow range of readers to believe that the case stands for more than it does, I do worry that the same kind of language will crop up in undiscerning judicial opinions and commentary on Diaz. Therefore, it may be be worth listing some problematic aspects of the statements.

First, the case plainly held that certain "expert testimony about the likely mental state of individuals arrested with drugs in their position [possession?] is" not "admissible in criminal trials under the Federal Rules of Evidence." Every Justice agreed that no expert can testify that a defendant charged with importing proscribed drugs knew that they were transporting drugs. That would be explicit ultimate-opinion testimony on a criminal defendant's state of mind. (There are cases limiting the Rule 704(b) ban to mental health professionals, but the Court did not consider that possible way to interpret the rule. It stuck with a more literal reading of the text.)

Second, no “predictive or profiling evidence” was introduced in the case, and only one Justice thought it worth discussing. Certainly, Diaz is not a case of a criminal profiler predicting (“inferring” would be more precise) the characteristics of a criminal from the type or manner of the crimes under investigation (or any other such “predictive or profiling evidence”). As the excerpts from the trial transcript reproduced below show, the witness did not claim such expertise; furthermore, the trial judge barred him from stating a belief about the defendant’s knowledge (although one could well think that his testimony made it plain enough what his belief was).

Third, the issue before the Court was not admissibility under the rules of evidence writ large. It was the scope of a single part of a solitary rule. Federal Rule 704(b), which has no counterpart in the rules of most states, declares that no expert witness may "state an opinion about whether the defendant did or did not have a mental state or condition that constitutes an element of the crime charged or of a defense" because "[t]hose matters are for the trier of fact alone." All that the Diaz Court held was that when such an opinion is not stated and is an inference that does not necessarily follow (as a matter of deductive logic) from the witness’s statements, then Rule 704(b) does not preclude its admission. It did not—and could not—have held that the Rule makes it admissible. Cf. David H. Kaye, The Ultimate Opinion Rule and Forensic Science Identification, 60 Jurimetrics J. 75 (2020).

Thus, Diaz should not be read as supporting—or opposing—the use of “predictive or profiling evidence” generally, or even in the subcategory of testimony offered to prove a defendant’s state-of-mind.

More thoughts on the three opinions in the case and how the result fits into the range of possible interpretations of Rule 704(b) will appear in the upcoming supplement to The New Wigmore on Evidence: Expert Evidence § 2.2.3(b) (available online in VitalLaw) and, time permitting, in a further positing about the case here.

Excerpts from Trial Transcript
Mar. 18, 2021

The most pertinent portion of the testimony at issue in Diaz is as follows (with italics added):

BY MR. OLAH [Assistant US Attorney]:
Q. Where do you work?
A. I’m a special agent with Homeland Security Investigations.
Q. And how long have you been with HSI?
A. I’ve been a special agent since 1996. So going on 20 — I believe 28 years.
Q. Were you in law enforcement before joining HSI?
A. Prior to becoming a special agent, I was a U.S. Border Patrol agent. And prior to that, I was a sheriff’s corrections deputy.
***
Q. Have you been involved in drug trafficking investigations as a special agent with HSI?
A. Yes, I have.
Q. Approximately how many such investigations?
A. I’ve been involved in over 500 investigations dealing with distribution of drugs and also the – which would include the importation of drugs.
Q. And can you summarize for the jury the various investigation techniques you’ve used?
A. The techniques I’ve used, I’ve utilized wiretaps, where you actually listen to a drug trafficker talk on the telephone and how they conduct business. I’ve done controlled purchases where I utilized an undercover agent or a cooperating source. And we actually go out on the street and buy the drugs. I’ve spoken with cooperating defendants that have been arrested for drug trafficking related offenses. I’ve talked to cooperating sources that have information related to the distribution of drugs and drug trafficking organizations. I have spoken with other agents that work drug trafficking organizations and have worked on task forces with other agencies such as the Federal Bureau of Investigation, the Drug Enforcement Administration, and local police departments dealing with drug trafficking related crimes.
***
Q. Agent Flood, why are drugs imported into the United States?
MS. IREDALE: Objection, 401.
THE COURT: Overruled.
THE WITNESS: Based upon drugs that – some drugs are manufactured in Mexico and outside the United States. Therefore, they’re brought across the border, into the United States to be sold.
BY MR. OLAH:
***
Q. With respect to vehicles, can you describe the general process of movement from Mexico to wherever it goes?
A. From Mexico, they are packaged. They are put into *** a vehicle. *** I have seen drugs hidden in every area of a vehicle. *** And then they are transported from point A to point B across the border.
Q. And based on your training and experience, are the transporters compensated for their efforts?
A. Yes. It’s a job. It’s to take it from point A to point B.
***
Q. Agent Flood, based on your training and experience, are large quantities of drugs entrusted to drivers that are unaware of those drugs?
MS. IREDALE: Objection. 401, 403.
THE COURT: Overruled.
THE WITNESS: No. In extreme circumstances – actually, in most circumstances, the driver knows they are hired. It’s a business. They are hired to take the drugs from point A to point B.
BY MR. OLAH:
Q. And why aren’t – why don’t they use unknowing couriers, generally?
MS. IREDALE: Objection. 401, 403.
THE COURT: Overruled. You may answer.
THE WITNESS: Generally, it’s a risk of your – your cargo not making it to the new market; not knowing where it’s going; not being able to retrieve it at the ending point, at your point B. So there’s a risk of not delivering your product and, therefore, you’re not going to make any money.
***
Cross Examination
***
Q. So you said that unknowing couriers are very rare.
A. Yes.
***
Q. You work for HSI. Right?
A. Correct.
Q. And you’re aware that your own agency has identified many schemes where drug trafficking organizations use unknowing couriers. Right?
A. I – I know of three schemes that were primarily identified as being possible for an unknowing courier. It doesn’t necessarily mean that they are unknowing couriers. ***

Volunteer Bias in Interlaboratory Studies

2024-06-15T17:06:00.001-04:00

The National Institute for Standards and Technology (NIST) is soliciting laboratories to join an "interlaboratory study" of CG-MS (gas chromatography mass spectrometry) for seized drug analysis. The announcement that I received, in abbreviated form, reads:

Forensic Science Quality Assurance Program
Seized Drugs General Method for GC-MS Reporting Limits Study
. . .
Study Design, Purpose, and Rationale
The goals of the study are 1) to capture the range of methods, instrumentation, and analytical approaches used in the community, 2) investigate mass spectral variability across methods, and 3) investigate how different reporting practices effect the limit of seized drug reporting.
Timeline and Commitment
Registration is currently open and will close on July 5, 2024. To participate in this study, laboratories must be accredited forensic laboratories based in the United States and have a valid Schedule I & II DEA license, a validated seized drug screening method using GC-MS, and a documented reporting practice. To be considered for the study, participants will be required to complete a pre-study questionnaire pertaining to the method that will be used for sample analysis. After acceptance into the study, participants will be provided a kit of 10 solutions containing mixtures of controlled substances and asked to analyze the solutions and report whether the analytes are present above their established reporting thresholds. Participants will also be required to report chromatographic peak height/area and retention time of each peak and provide the raw datafile from each run. Standards used for comparison will also be reported. [D]ue to a limited number of available kits, completion of the pre-study questionnaire does not guarantee acceptance into the study. . . .
Publication of Results
Upon closure of data entry, laboratories will receive a preliminary report containing a summary of reported data, consensus results, and a summary of analytes present in each mixture. [A] final report . . . will be made publicly available by Spring 2025. . . . NIST will not knowingly reveal laboratory identities associated with study results.
For questions, contact andrea.yarberry@nist.gov
To signup, go to: https://forms.gle/gPDU9aENHguPkw1D7

The effort is laudable, but one might ask why NIST is not beginning with a sampling frame of laboratories created to represent "the community" and then drawing a probability sample from this list. Will the laboratories that notice the announcement and ask to participate present the full "range of methods, instrumentation, and analytical approaches used in the community"? Will the volunteer sample be skewed toward higher quality labs? Will it include all the "different reporting practices [that] effect [sic] the limit of seized drug reporting [whatever this "limit" denotes in the population of laboratories]"?

Rigor in sampling may not be required to answer certain questions, but it seems relevant to determining what the sign-up form refers to as "the current landscape of GC-MS methods and associated reporting practices and how those factors effect [sic] the concentration of drug that is/is not ultimately reported." Certainly, it should be a consideration for the legal community if and when the results of the study are presented as an indication of "the known or potential rate of error" for GC-MS analysis as practiced in forensic-science laboratories (Daubert v. Merrell Dow Pharm., 509 U.S. 579, 594 (1993)).

ISO Standards on Forensic Science: Pay to Play?

2024-05-26T11:44:00.003-04:00

"ISO, the International Organization for Standardization, brings global experts together to agree on the best way of doing things – for anything from making a product to managing a process." ^1/ For the last few years, it has been devising the following overarching set of standards for all of forensic science:

Forensic sciences (TC 272) ISO/DIS 21043-1, Forensic sciences - Part 1: Terms and definitions - 5/27/2024, $58.00
ISO/DIS 21043-3, Forensic Sciences - Part 3: Analysis - 5/26/2024, $62.00
ISO/DIS 21043-4, Forensic Sciences - Part 4: Interpretation - 5/26/2024, $67.00
ISO/DIS 21043-5, Forensic Sciences - Part 5: Reporting - 5/26/2024, $53.00 ^2/

These are

part of a series which, when completed, will include the different components of the forensic process from scene to courtroom ... . The series describes primarily “what” is standardized, not the “how” or “who”. Best practice manuals and standard operating procedures should describe “how” the requirements of this document would be met. ^3/

It sounds like the standards in progress will not specify "the best way of doing things." Will they merely list the things that are in need of "standardization"? Will they be too open-ended to constitute what the U.S. Supreme Court refers to as "standards controlling the technique's operation"^4/?

I cannot answer these questions because I have not seen the drafts that were open for public comment. Members of the public cannot read the drafts without paying IS0 the $240 listed above. If anyone who has paid to play has thoughts on these documents that they would like to share beyond the TC (Technical Committee) that drafted the standards, I'll post them--at no charge.

Notes

Int'l Org. for Standardization, About ISO.
ANSI Standards Action, Mar. 15, 2024, at 48.
ISO 21043-1:2018(en) Forensic sciences — Part 1: Terms and definitions.
Daubert v. Merrell Dow Pharm., 509 U.S. 579, 594 (1993).

What's Uniqueness Got to Do with It?

2024-01-12T13:56:00.004-05:00

Columbia University has announced that "AI Discovers That Not Every Fingerprint Is Unique"! The subtitle of the press release of January 10, 2024, boldly claims that

Columbia engineers have built a new AI that shatters a long-held belief in forensics–that fingerprints from different fingers of the same person are unique. It turns out they are similar, only we’ve been comparing fingerprints the wrong way!

Forensic Magazine immediately and uncritically rebroadcast (quoting verbatim without acknowledgment from the press release) the confused statements about uniqueness. According to the Columbia release and Forensic Magazine, "It’s a well-accepted fact in the forensics community that fingerprints of different fingers of the same person—or intra-person fingerprints—are unique and therefore unmatchable." Forensics Magazine adds that "Now, a new study shows an AI-based system has learned to correlate a person’s unique fingerprints with a high degree of accuracy."

Does this mean that the "well-accepted fact" and "long-held belief" in uniqueness been shattered or not? Clearly, not. The study is about similarity, not uniqueness. In fact, uniqueness has essentially nothing to do with it. I can classify equilateral triangles drawn on a flat surface as triangles rather than as other regular polygons whether or not the triangles are each different enough from one another (uniqueness within the set of triangles) that I notice these differences. To say that objects "are unique and therefore unmatchable" is a nonsequitur. A human genome is probably unique to that individual, but forensic geneticists know that six-locus STR profiles are "matchable" to those of other individuals in the population. A cold hit to a person who could not have been the source of the six-locus profile in the U.K. database occurred long ago (as was to be expected for the random-match probabilities of the genotypes).

Perhaps the myth that the study shatters is that it is impossible to distinguish fingerprints left by different fingers of the same individual X from fingerprints left by fingers of different individuals (not-X). But there is no obvious reason why this would be impossible even if every print is distinguishable from every other print (uniqueness).

The Columbia press release describes the study design this way:

[U]ndergraduate senior Gabe Guo ... who had no prior knowledge of forensics, found a public U.S. government database of some 60,000 fingerprints and fed them in pairs into an artificial intelligence-based system known as a deep contrastive network. Sometimes the pairs belonged to the same person (but different fingers), and sometimes they belonged to different people.

Over time, the AI system, which the team designed by modifying a state-of-the-art framework, got better at telling when seemingly unique fingerprints belonged to the same person and when they didn’t. The accuracy for a single pair reached 77%. When multiple pairs were presented, the accuracy shot significantly higher, potentially increasing current forensic efficiency by more than tenfold.

The press release reported the following odd facts about the authors' attempts to publish their study in a scientific journal:

Once the team verified their results, they quickly sent the findings to a well-established forensics journal, only to receive a rejection a few months later. The anonymous expert reviewer and editor concluded that “It is well known that every fingerprint is unique,” and therefore it would not be possible to detect similarities even if the fingerprints came from the same person.

The team ... fed their AI system even more data, and the system kept improving. Aware of the forensics community's skepticism, the team opted to submit their manuscript to a more general audience. The paper was rejected again, but [Professor Hod] Lipson ... appealed. “I don’t normally argue editorial decisions, but this finding was too important to ignore,” he said. “If this information tips the balance, then I imagine that cold cases could be revived, and even that innocent people could be acquitted.” ...

After more back and forth, the paper was finally accepted for publication by Science Advances. ... One of the sticking points was the following question: What alternative information was the AI actually using that has evaded decades of forensic analysis? ... “The AI was not using ... the patterns used in traditional fingerprint comparison,” said Guo ... . “Instead, it was using something else, related to the angles and curvatures of the swirls and loops in the center of the fingerprint.”

Proprietary fingerprint matching algorithms also do not arrive at matches the way human examiners do. They "see" different features in the patterns and tend to rank the top candidates for true matches in a database trawl differently than the human experts. Again, however, these facts about automated systems neither prove nor disprove claims of uniqueness. And, theoretical uniqueness has little or nothing to do with the actual probative value of assertions of matches by humans, automated systems, or both.

Although not directly applicable, the day after the publicity on the Guo et al. paper, I came across the following report on "Limitations of AI-based predictive models" in a weekly survey of papers in Science:

A central promise of artificial intelligence (AI) in health care is that large datasets can be mined to predict and identify the best course of care for future patients. Unfortunately, we do not know how these models would perform on new patients because they are rarely tested prospectively on truly independent patient samples. Chekroud et al. showed that machine learning models routinely achieve perfect performance in one dataset even when that dataset is a large international multisite clinical trial (see the Perspective by Petzschner). However, when that exact model was tested in truly independent clinical trials, performance fell to chance levels. Even when building what should be a more robust model by aggregating across a group of similar multisite trials, subsequent predictive performance remained poor. -- Science p. 164, 10.1126/science.adg8538; see also p. 149, 10.1126/science.adm9218

Note: This posting was last modified on 1/12/24 2:45 PM

SWGDE's Best Practices for Remote Collection of Digital Evidence from a Networked Computing Environment

2023-11-18T21:19:00.003-05:00

SWGDE 22-F-003-1.0, Best Practices for Remote Collection of Digital Evidence from a Networked Computing Environment, is a forensic-science standard proposed for inclusion on the Organization of Scientific Area Committees for Forensic Science (OSAC) Registry—"a repository of selected published and proposed standards … to promote valid, reliable, and reproducible forensic results.”

The best practices “may not be applicable in all circumstances.” In fact, “[w]hen warranted, an examiner may deviate from these best practices and still obtain reliable, defensible results.” I guess that is why they are called best practices rather than required practices. But what circumstances would justify using anything but the best practices? On this question, the standard is silent. It merely says that “[i]f examiners encounter situations warranting deviation from best practices, they should thoroughly document the specifics of the situation and actions taken.”

Likewise, the best practices for “preparation” seem rather rudimentary. “Examiners should ascertain the appropriate means of acquiring data from identified networked sources.” No doubt, but how could they ever prepare to collect digital information without ascertaining how to acquire data? What makes a means “appropriate”? All that a digital evidence expert can glean from this document is that he or she “should be aware of the limitations of each acquisition method and consider actions to mitigate these limitations if appropriate” and should consider “methods and limitation variables as they relate to various operating systems.” How does such advice regularize or improve anything?

Same thing with a recommendation that “[p]rior to the acquisition process, examiners should prepare their destination media”? What steps for preparing the destination media are best? Well, [s]terilization of destination media [whatever the process of “sterilization” is in this context] is not generally required.” But it is required “when needed to satisfy administrative or organizational requirements or when a specific analysis process makes it a prudent practice.” When would sterilization be prudent? The drafters do not seem to be very sure. “[E]xaminers may need to sanitize destination media provided to an external recipient to ensure extraneous data is not disclosed.” Or maybe they don’t? “Examiners may also be required to destroy copies of existing data to comply with legal or regulatory requirements.” Few people would dispute that the best practice is to follow the law, but examiners hardly need best practices documents from standards developing organizations to know that.

The standard is indeterminate when it comes to what it calls “triage”—“preview[ing] the contents of potential data sources prior to acquisition.” We learn that “[e]xaminers may need to preview the contents of potential data sources prior to acquisition” to “reduce the amount of data acquired, avoid acquiring irrelevant information, or comply with restrictions on search authority.” What amount of data makes "triage" a best practice? How does the examiner know that irrelevant information may be present? Why can "triage" sometimes be skipped? When it is desirable and how should it be done? The standard merely observes that “[t]here may be multiple iterations of triage … .” When are multiple iterations advisable? Well, it “depend[s] on the complexity of the investigation.” Equally vague is the truism that “[e]xaminers should use forensically sound processes to conduct triage to the extent possible.”

Finally, designating steps like “perform acquisition” and “validate collected data” as “best practices” does little to inform examiners of how to collect digital evidence from a network. To be fair, a few parts of the standard are more concrete, and, possibly, other SWGDE standards fill in the blanks. But, on its face, much of this remote acquisition standard simply gestures toward possible best practices. It does not expound them. In this respect, it resembles other forensic-science standards that emerge from forensic-science standards developing organizations only to be criticized as vague at critical points.

"Conditions Regarding the Use of SWGDE Documents"

2023-11-18T18:33:00.002-05:00

SWGDE is the Scientific Working Group on Digital Evidence. Its website describes it as a meta-organization—a group that “brings together organizations actively engaged in the field of digital and multimedia evidence to foster communication and cooperation as well as to ensure quality and consistency within the forensic community.” Structured as a non-profit corporation, it solicits "your donations or sponsorship." \1/ Its 70 “member organizations” consist of (by a quick and possibly error-prone categorization and count):

16 local, state, and federal police agencies; \2/
4 digital forensics software companies; \3/
18 training and consulting organizations; \4/
6 prosecutors' offices; \5/
8 crime laboratories and coroners' or medical examiners' offices; \6/
3 major corporations; \7/
3 universities; \8/
A swath of federal executive agencies (or parts of them), including NASA, NIST, and the Departments of Defense, Homeland Security, Interior, Justice, Labor, and Treasury. \9/

SWGDE has produced “countless academic papers,” although none are listed on its website. SWGDE "encourages the use and redistribution of our documents," but it regards them as private property. It states that "The Disclaimer and Redistribution policies (also included in the cover pages to each document) also establish what is considered SWGDE's Intellectual Property."

These policies are unusual, if not unique, among among standards developing organizations. An IP lawyer would find it odd, I think, to read that admonitions such as the following are part of an author's copyright:

Individuals may not misstate and/or over represent [sic] duties and responsibilities of SWGDE work. This includes claiming oneself as a contributing member without actively participating in SWGDE meetings; claiming oneself as an officer of SWGDE without serving as such ... .

With respect to actual IP rights, SWGDE purports to control not only the specific expression of ideas—as allowed by copyright law—but all "information" contained in its documents—a claim that far exceeds the scope of copyright. It imposes the following "condition to the use of this document (and the information contained herein) in any judicial, administrative, legislative, or other adjudicatory proceeding in the United States or elsewhere":

notification by e-mail before or contemporaneous to the introduction of this document, or any portion thereof, as a marked exhibit offered for or moved into evidence in such proceeding. The notification should include: 1) The formal name of the proceeding, including docket number or similar identifier; 2) the name and location of the body conducting the hearing or proceeding; and 3) the name, mailing address (if available) and contact information of the party offering or moving the document into evidence. Subsequent to the use of this document in the proceeding please notify SWGDE as to the outcome of the matter.

As author (or otherwise), an SDO certainly can ask readers to do anything it would like them to do with its publications—and the SWGDE "conditions regarding use" do contain the phrase "the SWGDE requests." Even reformulating the paragraph as a polite request rather than a demand supposedly supported by copyright law, however, one might ask what legislative proceeding with a "formal name" would have a forensic-science standard "offered or moved into evidence." Impeachment and subsequent trial, I guess.

Notes

Neither its full name nor its acronym turned up in a search of the IRS list of tax-exempt 501(c)(3) organizations, so donors seeking a charitable deduction on their taxes might need to inquire further.
As listed on the website, they are the Columbus, Ohio Police Department; Eugene Police Department; Florida Department of Law Enforcement (FDLE); Lawrence, KS Police Department; Johnson County, KS Sheriff's Office; Los Angeles County, CA Sheriff's Department; Louisville, KY Metro Police Department; Massachusetts State Police; Oklahoma State Bureau of Investigation; New York State Police; New York City Police Department (NYPD); Plano, TX Police Department; Seattle Police Department; Weld County, CO Sheriff's Office; US Department of Justice - Federal Bureau of Investigation (FBI); US Department of Homeland Security - US Secret Service (USSS); and the US Postal Inspection Service (USPIS).
Amped Software USA Inc.; AVPreserve; BlackRainbow; SecurCube.
National White Collar Crime Center (NW3C); Digital Forensics.US LLC / Veritek Cyber Solutions; MetrTech Consultancy; Midwest Forensic Consultants LLC; Hexordia; Forensic Data Corp; Forensic Video & Audio Associates, Inc; Laggui And Associates, Inc.; Loehrs Forensics; N1 Discovery; Precision Digital Forensics, Inc. (PDFI); Premier Cellular Mapping & Analytics; Primeau Forensics, Recorded Evidence Solutions, LLC; AVPreserve; LTD; BEK TEK; TransPerfect Legal Solutions; VTO Labs; Unique Wire, Inc
Adams County, CO District Attorney's Office; Burlington County, NJ Prosecutor's Office; Dallas County, TX District Attorneys Office; Middlesex County, NJ Prosecutor's Office; State of Wisconsin Department of Justice; US Department of Justice - Executive Office United States Attorney Generals Office.
City of Phoenix, AZ Crime Lab; Houston Forensic Science Center; Boulder County Coroner's Office; Miami-Dade County, FL; Medical Examiner Department; Virginia Department of Forensic Science; Westchester County, NY Forensic Lab; North Carolina State Crime Laboratory; and the US Department of Defense - Army Criminal Investigation Laboratory (Army CID).
Carrier Corporation; Target Corporation; and Walmart Stores Inc.
San Jose State University; University of Colorado Denver - National Center for Media Forensics (NCMF); University of Wisconsin Stevens Point.
NASA Office of Inspector General - Computer Crimes Division; National Institute of Standards and Technology; Treasury Inspector General for Tax Administration; US Department of Defense - Defense Cyber Crimes Center (DC3); US Department of Homeland Security - Homeland Security Investigations (HSI); US Department of Justice - Office of the Inspector General (DOJ OIG); US Department of Labor - Office of Inspector General (DOL OIG); US Department of the Interior - Office of the Inspector General (DOI OIG); US Department of Treasury - Internal Revenue Service (IRS); US Postal Service - Office of Inspector General (Postal OIG). Yet another organizational member is the Puerto Rico Office of the Comptroller, Division of Database Analysis, Digital Forensic and Technological Development.

How Accurate Is Mass Spectrometry in Forensic Toxicology?

2023-09-27T09:21:00.003-04:00

Mass spectrometry (MS) is the "[s]tudy of matter through the formation of gas-phase ions that are characterized using mass spectrometers by their mass, charge, structure, and/or physicochemical properties." ANSI-ASB Standard 098 for Mass Spectral Analysis in Forensic Toxicology § 3.11 (2023). MS has become "the preferred technique for the confirmation of drugs, drug metabolites, relevant xenobiotics, and endogenous analytes in forensic toxicology." Id. at Foreword.

But no "criteria for the acceptance of mass spectrometry data have been ... universally applied by practicing forensic toxicologists." Id. Therefore, the American Academy of Forensic Sciences' Academy Standards Board (ASB) promulgated a "consensus based forensic standard[] within a framework accredited by the American National Standards Institute (ANSI)," id., that provides "minimum requirements." Id. § 1.

To a nonexpert reader (like me), the minimum criteria for the accuracy of MS "confirmation" are not apparent. Consider Section 4.2.1 on "Full-Scan Acquisition using a Single-Stage Low-Resolution Mass Analyzer." It begins with the formal requirement that

[T]he following shall be met when using a single-stage low-resolution mass analyzer in full-scan mode.
a) A minimum of a single diagnostic ion shall be monitored.

It is hard to imagine an MS test method that would not meet the single-ion minimum. Perhaps what makes this requirement meaningful is that the one or more ions must be "diagnostic." However, this adjective begs the question of what the minimum requirement for diagnositicity should be. A "diagnostic ion" is a "molecular ion or fragment ion whose presence and relative abundance are characteristic of the targeted analyte." Id. § 3.4. So what makes an ion "characteristic"? Must it always be present (in some relative abundance) when the "targeted analyte" is in the specimen (at or above some limit of detection)? That would make the ion a marker for the analyte with perfect sensitivity: Pr(ion|analyte) = 1. Even so, it would not be characteristic of the analyte unless its presence is highly specific, that is, unless Pr(no-such-ion|something-else) ≅ 1. But the standard contains no minimum values for sensitivity, specificity, or the likelihood ratio Pr(ion|analyte) / Pr(ion|something-else), which quantifies the positive diagnostic value of a binary test. \1/

This is not to say that there are no minimum requirements in the standard. There certainly are. For example, Section 4.2.1 continues:

b) When monitoring more than one diagnostic ion:
1. ratios of diagnostic ions shall agree with those calculated from a concurrently analyzed reference material given the tolerances shown in Table 1; OR
2. the spectrum shall be compared using an appropriate library search and be above a pre-defined match factor as demonstrated through method validation.

But the standard does not explain how the tolerances in Table 1 were determined. What are the conditional error probabilities that they produce?

Likewise, establishing a critical value for the "match factor" \2/ before using it is essential to a frequentist decision rule, but what are the operating characteristics of the rule? "Method validation" is governed (to the extent that voluntary standards govern anything) by ANSI-ASB 036, Standard Practices for Method Validation in Forensic Toxicology (2019). This standard requires testing to establish that a method is "fit for purpose," but it gives no accuracy rates that would fulfill this vague directive.

Firms that sell antibody test kits for detecting Covid-19 infections no longer can sell whatever they deem is fit for purpose. In May 2020, the FDA stopped issuing emergency use permits for these diagnostic tests without validation showing that they "are 90% 'sensitive,' or able to detect coronavirus antibodies, and 95% 'specific,' or able to avoid false positive results." \3/ Forensic toxicologists do not seem to have proposed such minimum requirements for MS tests.

NOTES

Other toxicology standards refer to ASB 098 as if it indicates what it required to apply the label "diagnostic." ANSI/ASB 113, Standard for Identification Criteria in Forensic Toxicology, § 4.5.2 (2023) ("All precursor and product ions are required to be diagnostic per ASB Standard 098, Standard for Mass Spectral Data Acceptance in Forensic Toxicology (2022).").
Section 3.13 defines "match factor" as a "mathematical value [a scalar?] that indicates the degree of similarity between an unknown spectrum and a reference spectrum."
See How Do Forensic-science Tests Compare to Emergency COVID-19 Tests?, Forensic Sci., Stat. & L., May 5, 2020 (quoting Thomas M. Burton, FDA Sets Standards for Coronavirus Antibody Tests in Crackdown on Fraud, Wall Street J., Updated May 4, 2020 8:24 pm ET, https://www.wsj.com/articles/fda-sets-standards-for-coronavirus-antibody-tests-in-crackdown-on-fraud-11588605373).

Use with Caution: NIJ's Training Course in Population Genetics and Statistics for Forensic Analysts

2023-09-18T18:13:00.002-04:00

The National Institute of Justice (NIJ) "is the research, development and evaluation agency of the U.S. Department of Justice . . . dedicated to improving knowledge and understanding of crime and justice issues through science." It offers a series of webpages and video recordings (a "training course") on Population Genetics and Statistics for Forensic Analysts. The course should be approached with caution. I have not worked through all the pages and videos, but here are a few things that rang alarm bells:

NIJ's Training	Comment


Many statisticians have employed what is known as Bayesian probability ... which is based on probability as a measure of one's degree of belief. This type of probability is conditional in that the outcome is based on knowing information about other circumstances and is derived from Bayes Theorem.	Bayes' rule applies to both objective and subjective probabilities. Both types of probability include conditional probabilities. The "type of probability" is not derived from Bayes' Theorem.

Conditional probability, by definition, is the probability P of an event A given that an event B has occurred. ... Take the example of a die with six sides. If one was to throw the die, the probability of it landing on any one side would be 1/6. This probability, however, assumes that the die is not weighted or rigged in any way, and that all of the sides contain a different number. If this were not true, then the probability would be conditional and dependent on these other factors.	The "other factors" are nothing more than part of the description of the experiment whose outcomes are the events that are observed. They are not conditioning events in a sample space.

The following equation can be used to determine the probability of the evidence given that a presumed individual is the contributor rather than a random individual in the population: LR = P(E/H₁) / P(E/H₀) ... . In the case of a single source sample, the hypothesis for the numerator (the suspect is the source of the DNA) is a given, and thus reduces to 1. This reduces to: LR = 1/ P(E/H₀) which is simply 1/P, where P is the genotype frequency.	The hypothesis for the numerator of a likelihood ratio is always "a given"--that is, it goes on the right-hand-side of the expression for a conditional probability. So is the hypothesis in the denominator. Neither probability "reduces to 1" for that reason. Only if the "evidence" is the true genotype in both the recovered sample and the sample from the defendant can it be said that P(E\|H₁) = 1. In other words, to say that the probability of a reported match is 1 if the defendant is the source treats the probability of laboratory error as zero. That may be acceptable as a simplifying assumption, but the assumption should be made visible in a training course.

Although likelihood ratios can be used for determining the significance of single source crime stains, they are more commonly used in mixture interpretation. ... The use of any formula for mixture interpretation should only be applied to cases in which the analyst can reasonably assume "that all contributors to the mixed profile are unrelated to each other, and that allelic dropout has no practical impact."	This limitation does not apply to modern probabilistic genotyping software!

Is "Match Form" Testimony Poor Form?

2023-09-18T10:34:00.001-04:00

The likelihood ratio (LR) is essentially a number that expresses how many times more probable the data from an experiment are if one hypothesis is true than if another hypothesis is true. For example, suppose we make a single measurement of the height of a known individual. Then we do the same for an individual who is covered from head to foot by a sheet. We want know if we have measured the same individual twice or two different individuals once. The closer the two measured heights are to one another, the more the measurements support the same-source hypothesis as opposed to the different-source hypothesis.

Why? Because closer measurements are more probable for same-source pairs than for different-source pairs. This implies that in repeated experiments with some proportion of same-source and different-source pairs, the closer measurements will tend to filter out the different-source pairs (which tend to have more distance between the two measurements) and to include more same-source pairs (which tend to be marked by the more similar measurements).

By quantifying the relative probability for the data given each hypothesis, the LR indicates how well a given degree of similarity discriminates between the hypotheses. Its value is

LR = Probability(data | H₁) / Probability(data | H₂),

where H₁ is the same-source hypothesis and H₂ is the different-source hypothesis.

Likelihood ratios are routinely reported in cases with samples from crime scenes or victims that contain DNA from several individuals. A DNA analyst might testify that the electropherograms are ten thousand times more probable if the defendant's DNA is present than if an unrelated person's DNA is there. \1/ We may call such statements "relative-probability-of-the-data" testimony.

But some DNA experts prefer what they call a "match form" for the presentation. \2/ An example of a "match form" statement is that “[a] match between the shoes … and [the defendant] is 9.67 thousand times more probable than a coincidental match to an unrelated African-American person.” \3/ More generally, a match-form presentation states that “a match between the evidence and reference [samples] is (some number) times more probable than coincidence.” \4/

This formulation has been criticized as highly misleading. According to William Thompson, it is

likely to mislead lay people and foster misunderstandings that are detrimental to people accused of a crime. I recommend that Cybergenetics immediately cease using this misleading language and find a better way to explain its findings. Standards development organizations such as OSAC should consider developing standards that address the appropriateness, or inappropriateness, of such presentations. Courts should refuse to admit PG [probabilistic genotyping] evidence when it is mischaracterized in this manner. Lawyers involved in cases in which defendants were convicted based on this misleading language should consider the appropriateness of appellate remedies. \5/

The main concern is that juxtaposing “match” and “coincidence” will lead judges and jurors to think that the "match statistic" pertains to the probabilities of hypotheses (H₁ and H₂) about the source of the DNA rather than probabilities about the laboratory’s data. In simpler terms, the concern is that most people will understand "coincidence" and "coincidental match" as an assertion that the observed match is the result of coincidence; moreover, they will think that "match" is an assertion that the defendant is the matcher. If that happens, then the assertion that a match is 10,000 times more likely than coincidence would be (mis)understood as a statement that the odds against a coincidence having occurred are 10,000 to 1.

Instead, LR = 10,000 should be understood (according to Bayes' rule) as a statement about the change in the odds that defendant, as opposed to some unknown, unrelated person, is the matcher. For example, if defendant has a strong alibi—strong enough, in conjunction with other evidence, to establish that the prior odds of H₁ as opposed to H₂ are only 1 to 5,000—then this LR raises the odds to 10,000 x 1:5,000 = 2:1. Such final odds are far from overwhelming.

Cybergenetics does not seems disposed to abandon "match form" testimony. Dr. Thompson claims that for fingerprint comparisons, "'[m]atch' is shorthand for source identification, [s]o, it is predictable that many lay people will interpret the term 'match,' when used to describe DNA evidence, to mean that the person of interest has been identified either definitively or with a high degree of certainty as a contributor." Pointing to a dictionary, Cybergenetics angrily responds that this is just "Thompson’s private language." \6/ But a tradition in forensic science is to equate a "match" with an identification, as shown by the title of articles such as "Is a Match Really a Match? A Primer on the Procedures and Validity of Firearm and Toolmark Identification." \7/ In popular culture, the term may have a similar connotation. Perhaps Youtube trumps Merriam-Webster. \8/

As far as I know, no studies compare the comprehensibility of relative-probability-of-the-data testimony to match-form testimony. Therefore, the law and the practice has to be guided by intuition. My sense is that avoiding the transposition of the probabilities in a likelihood ratio requires special care if the match-versus-coincidence approach is used. The witness must explain not only that a "DNA match" is merely a degree of similarity between the electropherograms being compared, but also that "coincidence" or "coincidental match" is shorthand for the proposition that the "match" is a match to an unrelated person (or other specified source)—and that it is not a conclusion that a coincidence has occurred. The phrase "coincidental match" is too ambiguous to be left undefined.

In short, I am not sure that an absolute rule against match-form testimony is necessary, but I see no clear benefit to the phraseology. Relative-probability-of-the-data testimony seems to be a more straightforward description of a DNA likelihood ratio. However, it too needs explanation to reduce the risk of blindly transposing the conditional probabilities for the data into conditional probabilities for the hypotheses. Cases announcing that a likelihood ratio is a ratio of source-hypothesis probabilities are legion. \9/

Notes

Cf. Commonwealth v. McClellan, 178 A.3d 874 (Pa. Super. Ct. 2018) ("[I]t was determined that the DNA sample taken from the gun's grip was at least 384 times more probable if the sample originated from Appellant and two unknown, unrelated individuals than if it originated from a relative to Appellant and two unknown, unrelated individuals").
Mark Perlin, Explaining the Likelihood Ratio in DNA Mixture Interpretation, in Proceedings of Promega's Twenty First International Symposium on Human Identification at 7 (Dec. 29, 2010); cf. Mark W. Perlin, Joseph B. Kadane & Robin W. Cotton, Match Likelihood Ratio for Uncertain Genotypes, 8 Law, Probability & Risk 289 (2009), https://doi.org/10.1093.
United States v. Anderson, No. 4:21-CR-00204, 2023 WL 3510823, at *3 (M.D. Pa. Apr. 26, 2023). For additional instances of “match form” testimony or reporting, see Howell v. Schweitzer, No. 1:20-cv-2853, 2023 WL 1785530 (N.D. Ohio Jan. 11, 2023); Sanford v. Russell, No. 17-13062, 2021 WL 1186495 (E.D. Mich. Mar. 30, 2021); State v. Anthony, 266 So.3d 415 (La. Ct. App. 2019).
Mark W. Perlin et al., TrueAllele Casework on Virginia DNA Mixture Evidence: Computer and Manual Interpretation in 72 Reported Criminal Cases, 9 PLOS ONE e92837, at 8 (2014).
William C. Thompson, Uncertainty in Probabilistic Genotyping of Low Template DNA: A Case Study Comparing STRMix™ and TrueAllele™, 68 J. Forensic Sci. 1049, 1059 (2023), doi:10.1111/1556-4029.15225.
Mark W. Perlin et al., Reporting Exclusionary Results on Complex DNA Evidence, A Case Report Response to 'Uncertainty in Probabilistic Genotyping of Low Template DNA: A Case Study Comparing Strmix™ and Trueallele®' Software 31 (May 18, 2023), available at SSRN: https://ssrn.com/abstract=4449313 or http://dx.doi.org/10.2139/ssrn.4449313.
Stephen G. Bunch et al., Is a Match Really a Match? A Primer on the Procedures and Validity of Firearm and Toolmark Identification, 11 Forensic Science Communications, No. 3 (2009), https://archives.fbi.gov/archives/about-us/lab/forensic-science-communications/fsc/july2009/review/2009_07_review01.htm.
In addition, a dictionary definition of "match" (https://www.merriam-webster.com/dictionary/match) is "a pair suitably associated." Suitable association suggests that a hypothesis about the nature of the association is true.
E.g., State v. Pickett, 246 A.3d 279 (N.J. App. 2021) (The "likelihood ratio [is] a statistic measuring the probability that a given individual was a contributor to the sample against the probability that another, unrelated individual was the contributor.") (citing Justice Ming W. Chin et al., Forensic DNA Evidence § 5.5 (2020)).

No "Daubert Hearing" on Latent Fingerprint Matching in US v. Ware

2023-07-07T15:45:00.005-04:00

Last month, in United States v. Ware, 69 F.4th 830 (11th Cir. 2023), the U.S. Court of Appeals for the Eleventh Circuit "carefully review[ed]" the convictions of Dravion Sanchez Ware arising out of a month-long crime spree near Atlanta, in 2017. He was found to have participated "in robbing ... three spas, four massage parlors, a nail salon, and a restaurant." The opinion recounts the nine brutal robberies in luxuriant detail. It also discusses Mr. Ware's argument that the district court erred "by not holding a formal Daubert hearing before admitting expert fingerprint evidence."

In a word, the Eleventh Circuit rejected the argument as "unpersuasive." No surprise there. More surprising is the opinion's incoherent discussion of the 2009 NRC report on forensic science and the 2016 PCAST follow-up report. \1/ On the one hand, we are told that "[t]he science could not possibly have been so unreliable as to be inadmissible." On the other hand, "[t]he District Court here could have held a Daubert hearing to assess the relatively new reports Ware presented." So which is it? If a type of evidence cannot possibly be excluded as scientifically invalid under Daubert, how can it be proper to hold a pretrial testimonial hearing on admissibility under Daubert? And, was the court of appeals correct in concluding that the two reports do not impeach, to the point of requiring a hearing, the traditional practice of admitting latent fingerprint comparisons?

During Ware's trial, an unnamed "crime lab scientist with the Georgia Bureau of Investigation Division of Forensic Sciences" "outlined the science behind fingerprints themselves, including their uniqueness" and explained the four-step process the lab follows ... : 'Analysis, Comparison, Evaluation, and Verification,' or ACEV.” The last step "involves another examiner completing the whole process a second time." The opinion does not indicate whether the verifying analyst is blinded to the knowledge of the main examiner's finding. Interestingly as well (think Confrontation Clause), the opinion implies that the testifying expert in Ware was not the main examiner. "[S]he was the verifying examiner," and "she testified that the lab concluded the latent print ... led to an identification conclusion matched to Ware's left middle finger." After that,

Defense counsel specifically asked about the PCAST report [and] vigorously cross-examined ... discussing the possibility of a latent fingerprint not being usable ... , the subjectiveness of every step ... , and the bias that may creep into the verification process ... . The expert and defense counsel discussed ... the potential for false positives and negatives. On cross, the defense also attacked the expert's claim that she did not know of the Georgia Bureau of Investigation ever misidentifying someone with a fingerprint comparison, and that she did not know the rate at which a verifier disagrees with the original assessment.

To preclude such testimony about his unique fingerprint on an item stolen in one of the robberies, Ware had moved before the trial for an order excluding fingerprint-comparison evidence. Of course, such a ruling would have been extraordinary, but the defense contended that the 2009 and the 2016 reports required nothing less \2/ and asked for a full-fledged pretrial hearing on the matter. In response, "[t]he District Court conditionally denied the motion ... unless Ware's counsel could produce before trial a case from this Court or a district court in this Circuit that favors excluding fingerprint expert evidence under Daubert." \3/

The court of appeals correctly observed that "[f]ingerprint comparison has long been accepted as a field worthy of expert opinions in this Circuit, as well as in almost every one of our sister circuits." The only problem is that all the opinions cited to show this solid wall of precedent predate the NRC or the PCAST reports. A more complete analysis has to establish that the scientists' reviews of friction-ridge pattern matching do not raise enough of a doubt to expect that a hearing would let the defense breach the wall.

Along these lines, the court of appeals wrote that

The [District] Court considered the reports and arguments presented and found that fingerprint evidence was reliable enough as a general matter to be presented to the jury. Many of the critiques of fingerprint evidence found in the PCAST report go to the weight that ought to be given fingerprint analysis, not to the legitimacy of the practice as a whole. Appellant Br. at 25 (“The studies collectively demonstrate that many examiners can, under some circumstances, produce correct answers at some level of accuracy.” (emphasis in original)).

This quotation from the PCAST report is faint praise. Although the court of appeals was sure that "Ware's contrary authority even says that fingerprint evidence can be reliable," the depth of its knowledge about the PCAST (and the earlier NRC committee) reports is open to question. The circuit court had trouble keeping track of the names of the groups. It transformed the National Research Council (the operating arm of the National Academies of Science, Engineering, and Medicine) into a "United States National Resource Council" (69 F.4th at 840) and then imagined an "NCAST report[]" (id. at 848). \4/

Deeper inspection of "Ware's contrary authority" is in order. The 2009 NRC committee report quoted with approval the searing conclusion of Haber & Haber that “[w]e have reviewed available scientific evidence of the validity of the ACE-V method and found none.” It reiterated the Habers' extreme claim that because "the standards upon which the method’s conclusions rest have not been specified quantitatively ... the validity of the ACE-V method cannot be tested." To be sure, the committee agreed that fingerprint examiners had something going for them. It wrote that "more research is needed regarding the discriminating value of the various ridge formations [to] provide examiners with a solid basis for the intuitive knowledge they have gained through experience." But does "intuitive knowledge" qualify as "scientific knowledge" under Daubert? Is a suggestion that friction-ridge comparisons need a more solid basis equal to a statement that the comparisons are "reliable" within the meaning of that opinion? The response to "NCAST" was underwhelming.

But research has progressed since 2009. The second "contrary authority," the PCAST report, reviewed this research. At first glance, this report supports the court's conclusion that no hearing was necessary. It assures courts that "latent fingerprint analysis is a foundationally valid subjective methodology." In doing so, it rejects the NRC committee's notion that the absence of quantitative match rules precludes testing whether examiners can reach valid conclusions. It discusses two so-called black-box studies of the work of examiners operating in the "intuitive" mode. Yet, the Ware court does not cite or quote the boxed and highlighted finding (Number 5).

Perhaps the omission reflects the fact that the PCAST finding is so guarded. PCAST added that "additional black-box studies are needed to clarify the reliability of the method," undercutting the initial assurance, which was "[b]ased largely on two ... studies." Furthermore, according to PCAST, to be "scientifically valid," latent-print identifications must be accompanied by admissions that "false positive rates" could be very high (greater than 1 in 18). \5/

The Ware court transforms all of this into a blanket and bland assertion that the report establishes reliability even though it "may cast doubt on the error rate of fingerprint analysis and comparison." The latter concern, it says, goes not to admissibility, but only to "weight" or "credibility."

Can it really be this simple? Are not "error rates" an explicit factor affecting admissibility (as well as weight) under Daubert? Certainly, the Eleventh Cicuit's view that the problems with fingerprint comparisons articulated in the two scientific reports are not profound enough to force a wave of pretrial hearings is defensible, but the court's explanation of its position in Ware is sketchy.

At bottom, the problem with the fingerprint evidence introduced against Ware (as best as one can tell from the opinion) is not that it is speculative or valueless. The difficulty is that the judgments are presented as if they were scientific truths. The Ware court is satisfied because "Defense counsel put the Government's expert through his paces during cross-examination, and counsel specifically asked the expert about the findings in the PCAST report." But would it be better to moderate the presentations to avoid overclaiming in the first place?

The impending amendment to Rule 702 of the Federal Rules of Evidence is supposed to encourage this kind of "gatekeeping." Defense counsel might be more successful in constraining overreaching experts than in excluding them altogether. That too should be part of the "considerable leeway" granted to district courts seeking to reconcile expert testimony.with modern scientific knowledge.

Notes

President's Council of Advisors on Sci. & Tech., Exec. Office of the President, Forensic Science in Criminal Courts: Ensuring Scientific Validity of Feature-Comparison Methods (2016), [https://perma.cc/R76Y-7VU]
In Ware
The pretrial motion to exclude the fingerprint identification "relied on the 2009 United States National Resource Counsel (“NRC”) report and subsequent 2016 President's Counsel of Advisors on Science and Technology (“PCAST”) report, which supposedly revealed a dearth of "proper scientific studies of fingerprint comparison evidence" and claimed that "there is no scientific basis for concluding a fingerprint was left by a specific person," positing that "because fingerprint analysis involves individual human judgement, the resulting [fingerprint comparison] conclusion can be influenced by cognitive bias."
Why insist on a pre-existing determination in one particular geographic region that scientific validity is lacking in order to grant a hearing on whether scientific validity is present? Is the "science" underlying fingerprint comparisons different in Georgia and the other southeastern states comprising the 11th Circuit different from that in the rest of the country?
OK, these peccadillos are not substantive, but one would have thought that three circuit court judges, after "carefully reviewing the record," could have gotten the names and acronyms straight. Senior Judge Gerald Tjoflat wrote the panel opinion. At one point, he was a serious contender for the Supreme Court seat filled by Justice Anthony Kennedy. After Judge Tjoflat announced that he would retire to senior status on the bench in 2019, President Donald Trump nomined Robert J. Luck to the court. In addition to Judge Luck, Judge Kevin C. Newsom, a 2017 appointee of President Trump was on the panel. Judicial politics being what it is, over 30 senators voted against the confirmation of Judges Newsom and Luck.
PCAST suggested that if a court agreed that what it called "foundational validity" were present, then to achieve "validity as applied" some very specific statements about "error rates" would be required:
Overall, it would be appropriate to inform jurors that (1) only two properly designed studies of the accuracy of latent fingerprint analysis have been conducted and (2) these studies found false positive rates that could be as high as 1 in 306 in one study and 1 in 18 in the other study. This would appropriately inform jurors that errors occur at detectable frequencies, allowing them to weigh the probative value of the evidence.
The studies actually found conditional false-positive proportions of 6/3628 (0.17%) and 42/995 (4.2%, or 7/960 = 1.4% if one discards "clerical errors.") (P. 98, tbl. 1). Earlier postings discuss these FBI-Noblis and Miami Dade police department numbers.

Maryland Supreme Court Resists "Unqualified" Firearms-toolmark Testimony

2023-06-24T21:40:00.007-04:00

This week, the Maryland Supreme Court became the first state supreme court to hold, unequivocally, that a firearms-toolmark examiner may not testify that a bullet was fired from a particular gun without a disclaimer indicating that source attribution is not a scientific or practical certainty. \1/ The opinion followed two trials, \2/ two evidentiary hearings (one on general scientific acceptance \3/ and one on the scientific validity of firearms-toolmark identifications \4/) and affidavits from experts in research methods or statistics. The Maryland court did not discuss the content of the required disclaimer. It merely demanded that the qualified expert's opinion not be "unqualified." In addition, the opinions are limited to source attributions via the traditional procedure of judging presumed "individual" microscopic features with no standardized rules for concluding that the markings match.

The state contended that Kobina Ebo Abruquah murdered a roommate by shooting him five times, including once in the back of the head. A significant part of the state's case came from a firearms examiner for the Prince George’s County Police Department. The examiner "opined that four bullets and one bullet fragment ... 'at some point had been fired from [a Taurus .38 Special revolver belonging to Mr. Abruquah].'" A bare majority of four justices agreed that admission of the opinion was an abuse of the trial court's discretion. Three justices strongly disputed this conclusion. Two of the three opinions in the case included tables displaying counts or percentages from experiments in which analysts compared sets of either bullets or cartridge casings fired from a few types of handguns to ascertain how frequently their source attributions and exclusions were correct and how often they were wrong.

There is a lot one might say about these opinions, but here I attend only to the statistical parts. \5/ As noted below (endnotes 3 an 4), neither party produced any statisticians or research scientists with training or extensive experience in applying statistical methods. The court did not refer to the recent, burgeoning literature on "error rates" in examiner-performance studies. Instead, the opinions drew on (or questioned) the analysis in the 2016 report of the President's Council of Advisors on Science and Technology (PCAST). The report essentially dismissed the vast majority of the research on which one expert for the state (James Hamby, a towering figure in the firearms-toolmark examiner community) relied. These studies, PCAST explained, usually asked examiners to match a set of questioned bullets to a set of guns that fired them.

A dissenting opinion of Justice Steven Gould argued—with the aid of a couple of probability calculations—that the extremely small number of false matches in these "closed set" studies demonstrated that examiners were able to perform much better than would be expected if they were just guessing when they lined up the bullets with the guns. \6/

That is a fair point. "Closed set" studies can show that examiners are extracting some discriminating information. But they do not lend themselves to good estimates of the probability of false identifications and false exclusions. For answering the "error rate" question in Daubert, they indicate lower bounds—the conditional error probabilities for examiners under the test conditions could be close to zero, but they could be considerably larger.

More useful experiments simulate what examiners do in cases like Abruquah—where they decide whether a specific gun fired the bullet that could have come from guns beyond those in an enumerated set of known guns. To accomplish this, the experiment can pair test bullets fired from a known gun with a "questioned" bullet and have the examiner report whether the questioned bullet did or did not travel through the barrel of the tested gun.

The majority opinion, written by Chief Justice Matthew Fader, discussed two such experiments known as "Ames I" and "Ames II" (because they were done at the Ames National Laboratory, "a government-owned, contractor-operated national laboratory of the U.S. Department of Energy, operated by and located on the campus of Iowa State University in Ames, Iowa"). The first experiment, funded by the Department of Defense and completed in 2014, "was designed to provide a better understanding of the error rates associated with the forensic comparison of fired cartridge cases." The experiment did not investigate performance with regard to toolmarks on the projectiles (the bullets themselves) propelled from the cases, through the barrel of a gun, and beyond. Apparently referring to the closed-set kind of studies, the researchers observed that "[five] previous studies have been carried out to examine this and related issues of individualization and durability of marks ... , but the design of these previous studies, whether intended to measure error rates or not, did not include truly independent sample sets that would allow the unbiased determination of false-positive or false-negative error rates from the data in those studies." \7/

However, their self-published technical report does not present the results in the kind of classification table that statisticians would expect. Part of such a table is on this blog: \8/

The researchers enrolled 284 volunteer examiners in the study, and 218 submitted answers (raising an issue of selection bias). The 218 subjects (who obviously knew they were being tested) “made ... l5 comparisons of 3 knowns to 1 questioned cartridge case. For all participants, 5 of the sets were from known same-source firearms [known to the researchers but not the firearms examiners], and 10 of the sets were from known different-source firearms.” 3/ Ignoring “inconclusive” comparisons, the performance of the examiners is shown in Table 1.

Table 1. Outcomes of comparisons
(derived from pp. 15-16 of Baldwin et al.)

~S S

–E 1421 4 1425

+E 22 1075 1097

1443 1079

–E is a negative finding (the examiner decided there was no association).
+E is a positive finding (the examiner decided there was an association).
S indicates that the cartridges came from bullets fired by the same gun.
~S indicates that the cartridges came from bullets fired by a different gun.

False negatives. Of the 4 + 1075 = 1079 judgments in which the gun was the same, 4 were negative. This false negative rate is Prop(–E |S) = 4/1079 = 0.37%. ("Prop" is short for "proportion," and "|" can be read as "given" or "out of all.") Treating the examiners tested as random samples of all examiners of interest, and viewing the performance in the experiment as representative of the examiners' behavior in casework with materials comparable to those in the experiment, we can estimate the portion of false negatives for all examiners. The point estimate is 0.37%. A 95% confidence interval is 0.10% to 0.95%. These numbers provide an estimate of how frequently all examiners would declare a negative association in all similar cases in which the association actually is positive.Instead of false negatives, we also can describe true negatives, or specificity. The observed specificity is Prop(E|~S) = 99.63%. The 95% confidence interval around this estimate is 99.05% to 99.90%.

False positives. The observed false-positive rate is Prop(+E |~S) = 22/1443 = 1.52%, and the 95% confidence interval is 0.96% to 2.30%. The observed true-positive rate, or sensitivity, is 98.48%, and its 95% confidence interval is 97.7% to 99.04%.

Taken at face value, these results seem rather encouraging. On average, examiners displayed high levels of accuracy, both for cartridge cases from the same gun (better than 99% specificity) and from different guns (better than 98% sensitivity).

Table 1. Outcomes of comparisons (derived from pp. 15-16 of Baldwin et al.)
	~S	S
–E	1421	4	1425
+E	22	1075	1097
	1443	1079
–E is a negative finding (the examiner decided there was no association). +E is a positive finding (the examiner decided there was an association). S indicates that the cartridges came from bullets fired by the same gun. ~S indicates that the cartridges came from bullets fired by a different gun.

I did not comment on the implications of the fact that analysts often opted out of the binary classifications by declaring that an examination was inconclusive. This reporting category has generated a small explosion of literature and argumentation. There are two extreme views. Some organizations and individuals maintain that just because specimens did or not come from the same source, the failure to discern which of these two states of nature applies is an error. A more apt term would be "missed signal"; \9/ it is hardly obvious that Daubert's fleeting reference to "error rates" was meant to encompass not only false positives and negatives but also test results that are neither positive nor negative. At the other pole are claims that all inconclusive outcomes should be counted as correct in computing the false-positive and false-negative error proportions seen in an experiment.

Incredibly, the latter is the only way in which the Ames laboratory computed the error proportions. I would like to think that had the report been subject to editorial review at a respected journal, a correction would have been made. Unchastened, the Ames laboratory again only counted inconclusive responses as if they were correct when it wrote up the results of its second study.

This lopsided treatment of inconclusives was an issue in Abruquah. The majority opinion described the two studies as follows (citations and footnotes omitted):

Of the 1,090 comparisons where the “known” and “unknown” cartridge cases were fired from the same source firearm, the examiners [in the Ames I study] incorrectly excluded only four cartridge cases, yielding a false-negative rate of 0.367%. Of the 2,180 comparisons where the “known” and “unknown” cartridge cases were fired from different firearms, the examiners incorrectly matched 22 cartridge cases, yielding a false-positive rate of 1.01%. However, of the non-matching comparison sets, 735, or 33.7%, were classified as inconclusive, id., a significantly higher percentage than in any closed-set study.

The Ames Laboratory later conducted a second open-set, black-box study that was completed in 2020 ... The Ames II Study ... enrolled 173 examiners for a three-phase study to test for ... foundational validity: accuracy (in Phase I), repeatability (in Phase II), and reproducibility (in Phase III). In each of three phases, each participating examiner received 15 comparison sets of known and unknown cartridge cases and 15 comparison sets of known and unknown bullets. The firearms used for the bullet comparisons were either Beretta or Ruger handguns and the firearms used for the cartridge case comparisons were either Beretta or Jimenez handguns. ... As with the Ames I Study, although there was a “ground truth” correct answer for each sample set, examiners were permitted to pick from among the full array of the AFTE Range of Conclusions—identification, elimination, or one of the three levels of “inconclusive.”

The first phase of testing was designed to assess accuracy of identification, “defined as the ability of an examiner to correctly identify a known match or eliminate a known nonmatch.” In the second phase, each examiner was given the same test set examined in phase one, without being told it was the same, to test repeatability, “defined as the ability of an examiner, when confronted with the exact same comparison once again, to reach the same conclusion as when first examined.” In the third phase, each examiner was given a test set that had previously been examined by one of the other examiners, to test reproducibility, “defined as the ability of a second examiner to evaluate a comparison set previously viewed by a different examiner and reach the same conclusion.”

In the first phase, ... [t]reating inconclusive results as appropriate answers, the authors identified a false negative rate for bullets and cartridge cases of 2.92% and 1.76%, respectively, and a false positive rate for each of 0.7% and 0.92%, respectively. Examiners selected one of the three categories of inconclusive for 20.5% of matching bullet sets and 65.3% of nonmatching bullet sets. [T]he results overall varied based on the type of handgun that produced the bullet/cartridge, with examiners’ results reflecting much greater certainty and correctness in classifying bullets and cartridge cases fired from the Beretta handguns than from the Ruger (for bullets) and Jimenez (for cartridge cases) handguns.

The opinion continues with a description of some statistics for the level of intra- and inter-examiner reliability observed in the Ames II study, but I won't pursue those here. The question of accuracy is enough for today. \10/ To some extent, the majority's confidence in the reported low error proportions (all under 3%) was shaken by the presence of inconclusives: "if at least some inconclusives should be treated as incorrect responses, then the rates of error in open-set studies performed to date are unreliable. Notably, if just the 'Inconclusive-A' responses—those for which the examiner thought there was almost enough agreement to identify a match—for non-matching bullets in the Ames II Study were counted as incorrect matches, the 'false positive' rate would balloon from 0.7% to 10.13%."

But should any of the inconclusives "be treated as incorrect," and if so, how many? Doesn't it depend on the purpose of the studies and the computation? If the purpose is to probe what the PCAST Report neoterically called "foundational validity"—whether a procedure is at least capable of giving accurate source conclusions when properly employed by a skilled examiner—then inconclusives are not such a problem. They represent lost opportunities to extract useful information from the specimens, but they do not change the finding that, within the experiment itself, in those instances in which the examiner is willing to come down on one side or the other, the conclusion is usually correct.

One justice stressed this fact. Justice Gould insisted that

[T]he focus of our inquiry should not be the reliability of the AFTE Theory in general, but rather the reliability of conclusive determinations produced when the AFTE Theory is applied. Of course, an examiner applying the AFTE Theory might be unable to declare a match (“identification”) or a non-match (“elimination”), resulting in an inconclusive determination. But that's not our concern. Rather, our concern is this: when the examiner does declare an identification or elimination, we want to know how reliable that determination is.

He was unimpressed with the extreme view that every failure to recognize "ground truth" is an "error" for the purpose of evaluating an identification method under Daubert. \11/ He argued for error proportions like those used by PCAST: \12/

This brings us to a different way of looking at error rates, one that received no consideration by the Majority ... I am referring to calculating error by excluding inconclusives from both the numerator and the denominator. .... [C]ontrary to Mr. Faigman's unsupported criticism, excluding inconclusives from the numerator and denominator accords with both common sense and accepted statistical methodologies. ... PCAST ... contended that ... false positive rates should be based only on conclusive examinations “because evidence used against a defendant will typically be based on conclusive, rather than inconclusive, determinations.” ... So, far from being "crazy" ... , excluding inconclusives from error rate calculations when assessing the reliability of a positive identification is not only an acceptable approach, but the preferred one, at least according to PCAST. Moreover, from a mathematical standpoint, excluding inconclusives from the denominator actually penalizes the examiner because errors accounted for in the numerator are measured against a smaller denominator, i.e., a smaller sample size.

So what happens when the error proportions for the subset of positive and negative conclusions are computed with the Ames data? The report's denominator is too large, but the resulting bias is not so great in this particular case. For Ames I, Justice Gould's opinion tracks Table 1 above:

With respect to matching [cartridge] sets, the number of inconclusives was so low that whether inconclusives are included in the denominator makes little difference to error rates. Of the 1,090 matching sets, only 11, or 1.01 percent, were inconclusives. Of the conclusive determinations, 1,075 were correctly identified as a match (“identifications”) and four were incorrectly eliminated (“eliminations”). ... Measured against the total number of matching sets (1,090), the false elimination rate was 0.36 percent. Against only the conclusive determinations (1,079), the false elimination rate was 0.37 percent. ...

Of 2,178 non-matching sets, examiners reported 735 inconclusives for an inconclusive rate of 33.7 percent, 1,421 sets as correct eliminations, and 22 sets as incorrect identifications (false positives). ... As a percentage of the total 2,178 non-matching sets, the false positive rate was 1.01 percent. As a percentage of the 1,443 conclusive determinations, however, the false positive rate was 1.52 percent. Either way, the results show that the risk of a false positive is very low

For Ames II,

There were 41 false eliminations. As a percentage of the 1,405 recorded results, the false elimination rate was 2.9 percent. As a percentage of only the conclusive results, the false elimination rate increased to 3.7 percent ... .

... There were 20 false positives. Measured against the total number of recorded results (2,842), the false positive rate was 0.7 percent. Measured against only the conclusive determinations, however, the false positive rate increases to 2.04 percent.

In sum, on the issue of whether a substantial number of firearms-toolmark examiners can generally avoid erroneous source attributions and exclusions when tested as in Ames I and Ames II, the answer seems to be that, yes, they can. Perhaps this helps explain the Chief Justice's concession that "[t]he relatively low rate of 'false positive' responses in studies conducted to date is by far the most persuasive piece of evidence in favor of admissibility of firearms identification evidence." But the court was quick to add that "[o]n balance, however, the record does not demonstrate that that rate is reliable, especially when it comes to actual casework."

Extrapolating from the error proportions in experiments to those in casework is difficult indeed. Is the largely self-selected sample of examiners who enroll in and complete the study representative of the general population of examiners doing casework? Does the fact that the enrolled examiners know they are being tested make them more (or less) careful or cautious? Do examiners have expectations about the prevalence of true sources in the experiment that differ from those they have in casework? Are the specimens in the experiment comparable to those in casework? \13/ Do error probabilities for comparing marks on cartridge cases apply to the marks on the bullets they house? Does it matter if the type of gun used in the experiment is different from the type in the case?

Most of the questions are matters of external validity. Some of them are the subject of explicit discussion in the opinions in Abruquah. For example, Justice Gould rejects, as a conjecture unsupported by the record, the concern that examiners might be more prone to avoid a classification by announcing an "inconclusive" outcome in an experiment than in practice.

To different degrees, the generalizability questions interact with the legal question being posed. As I have indicated, whether the scientific literature reveals that a method practiced by skilled analysts can produce conclusions that are generally correct for evidence like that in a given case is one important issue under Daubert. Whether the same studies permit accurate estimates of error probabilities in general casework is a distinct, albeit related, scientific question. How to count or adjust for inconclusives in experiments is but a subpart of the latter question.

And, how to present source attributions in the absence of reasonable error-probability estimates for casework is a question that Abruquah barely begins to answer. No opinion embraced the defendant's argument that only a limp statement like "unable to exclude as a possible source" is allowed. But neither does the case follow other courts that allow statements such as the awkward and probably ineffectual "reasonable degree of ballistic certainty" for expressing the difficult-to-quantify uncertainty in toolmark source attributions. After Abruquah, if an expert makes a source attribution in Maryland, some kind of qualification or caveat is necessary. \14/ But what will that be?

Toolmark examiners are trained to believe that their job is to provide source conclusions for investigators and courts to use, but neither law nor science compels this job description. Perhaps it would be better to replace conclusion-centered testimony about the (probable) truth of sources conclusions with evidence-centered statements about the degree to which the evidence supports a source conclusion. The Abruquah court wrote that

The reports, studies, and testimony presented to the circuit court demonstrate that the firearms identification methodology employed in this case can support reliable conclusions that patterns and markings on bullets are consistent or inconsistent with those on bullets fired from a particular firearm. Those reports, studies, and testimony do not, however, demonstrate that that methodology can reliably support an unqualified conclusion that such bullets were fired from a particular firearm.

The expert witness's methodology provides "support" for a conclusion, and the witness could simply testify about the direction and magnitude of the support without opining on the truth of the conclusion itself. \15/ "Consistent with" testimony is a statement about the evidence, but it is a minimal, if not opaque, description of the data. Is it all that the record in Abruquah—not to mention the record in the next case—should allow? Only one thing is clear—fights over the legally permissible modes for presenting the outcomes of toolmark examinations will continue.

Notes

In Commonwealth v. Pytou Heang, 942 N.E.2d 927 (Mass. 2011), the Massachusetts Supreme Judicial Court upheld source-attribution testimony "to a reasonable degree of scientific certainty" but added "that [the examiner] could not exclude the possibility that the projectiles were fired by another nine millimeter firearm." The Court proposed "guidelines" to allow source attribution to no more than "a reasonable degree of ballistic certainty."
The defendant was convicted at a first trial in 2013, then retried in 2018. In 2020, the Maryland Supreme Court changed its standard for admitting scientific evidence from a requirement of general acceptance of the method in the relevant scientific communities (denominated the "Frye-Reed standard" in Maryland) to a more direct showing of scientific validity described in Daubert v. Merrell Dow Pharmaceuticals, 509 U.S. 579 (1993), and an advisory committee note accompanying amendments in 2000 to Federal Rule of Evidence 702 (called the "Daubert-Rochkind standard" in Abruquah).
The experts at the "Frye-Reed hearing" were William Tobin (a "Principal of Forensic Engineering International," https://forensicengineersintl.com/about/william-tobin/, and "former head of forensic metallurgy operations for the FBI Laboratory" (Unsurpassed Experience, https://forensicengineersintl.com/); James Hamby ("a laboratory director who has specialized in firearm and tool mark identification for the past 49 years" and who is "a past president of AFTE [the Association of Firearm and Tool Mark Examiners] ... and has trained firearms examiners from over 15 countries worldwide," Speakers, International Symposium on Forensic Science, Lahore, Pakistan, Mar. 17-19, 2020, https://isfs2020.pfsa.punjab.gov.pk/james-edward); Torin Suber ("a forensic scientist manager with the Maryland State Police"); and Scott McVeigh (the firearms examiner in the case).
The experts at the supplemental "Daubert-Rochkind hearing" were James Hamby (a repeat performance), and David Faigman, "Chancellor & Dean, William B. Lockhart Professor of Law and the John F. Digardi Distinguished Professor of Law" at the University of California College of the Law, San Francisco.
Remarks on the legal analysis will appear in the 2024 cumulative supplement to The New Wigmore, A Treatise on Evidence: Expert Evidence.
The opinion gives a simplified example:
The test administrator fires two bullets from each of 10 consecutively manufactured handguns. The administrator then gives you two sets of 10 bullets each. One set consists of 10 “unknown” bullets—where the source of the bullet is unknown to the examiner—and the other set consists of 10 “known” bullets—where the source of the bullet is known. You are given unfettered access to a sophisticated crime lab, with the tools, supplies, and equipment necessary to conduct a forensic examination. And, like the vocabulary tests from grade school requiring you to match words with pictures, you must match each of the 10 unknown bullets to the 10 known bullets.

Even though you know that each of the unknowns can be matched with exactly one of the knowns, you probably wouldn't know where to begin. If you had to resort to guessing, your odds of correctly matching the 10 unknown bullets to the 10 knowns would be one out of 3,628,800. [An accompanying note 11 explains that: "[w]ith 10 unknown bullets and 10 known bullets, the odds of guessing the first pair correctly are one out of 10. And if you get the first right, the odds of getting the second right are one out of nine. If you get the first two right, the odds of getting the third right are one out of eight, and so on. Thus, the odds of matching each unknown bullet to the correct known is represented by the following calculation: (1/10) x (1/9) x (1/8) x (1/7) x (1/6) x (1/5) x (1/4) x (1/3) x (1/2) x (1/1)."] Even if you correctly matched five unknown bullets to five known bullets and guessed on the remaining five unknowns, your odds of matching the remaining unknowns correctly would be one out of 120. [Note 12: "(1/5) x (1/4) x (1/3) x (1/2) x (1/1)."] Not very promising.

The closed-set and semi-closed-set studies before the trial court—the studies which PCAST discounted—show that if you were to properly apply the AFTE Theory, you would be very likely to match correctly each of the 10 unknowns to the corresponding knowns. See Validation Study; Worldwide Study; Bullet Validation Study. ... Your odds would thus improve from virtually zero (one in 3,628,800) to 100 percent. Yet according to PCAST, those studies provide no support for the scientific validity of the AFTE Theory. ...
David P. Baldwin, Stanley J. Bajic, Max Morris & Daniel Zamzow, A Study of False-positive and False-negative Error Rates in Cartridge Case Comparisons, Ames Laboratory, USDOE, Tech. Rep. #IS-5207 (2014), at https://afte.org/uploads/documents/swggun-false-postive-false-negative-usdoe.pdf [https://perma.cc/4VWZ-CPHK].
David H. Kaye, PCAST and the Ames Bullet Cartridge Study: Will the Real Error Rates Please Stand Up?, Forensic Sci., Stat. & L., Nov. 1, 2016, http://for-sci-law.blogspot.com/2016/11/pcast-and-ames-study-will-real-error.html
David H. Kaye et al., Toolmark-comparison Testimony: A Report to the Texas Forensic Science Commission, May 2, 2022, available at http://ssrn.com/abstract=4108012.
I will note, however, that the report apparently strains to make the attained levels for reliability seem high. Alan H. Dorfman & Richard Valliant, A Re-Analysis of Repeatability and Reproducibility in the Ames-USDOE-FBI Study, 9 Stat. & Pub. Pol'y 175 (2022).
The opinion attributes this view to
Mr. Abruquah's expert, Professor David Faigman, [who declared] that "in the annals of scientific research or of proficiency testing, it would be difficult to find a more risible manner of measuring error." To Mr. Faigman, the issue was simple: in Ames I and II, the ground truth was known, thus "there are really only two answers to the test, like a true or false exam[ple]." Mr. Faigman explained that "the common sense of it is if you know the answer is either A or B and the person says I don't know, in any testing that I've ever seen that's a wrong answer." He argued, therefore, that inconclusives should be counted as errors.
See also NIST Expert Working Group on Human Factors in Latent Print Analysis, Latent Print Examination and Human Factors: Improving the Practice Through a Systems Approach (David H. Kaye ed. 2012), available at ssrn.com/abstract=2050067 (arguing against counting inconclusives in error proportions that are supposed to indicate the probative value of actual conclusions).
Testing examiner performance in the actual flow of cases would help address the last three questions. A somewhat confusing analysis of results in such an experiment is described in a posting last year. David H. Kaye, Preliminary Results from a Blind Quality Control Program, Forensic Sci., Stat. & L., July 9, 2022, http://for-sci-law.blogspot.com/2022/07/preliminary-results-from-blind-quality.html.
The court wrote that:
It is also possible that experts who are asked the right questions or have the benefit of additional studies and data may be able to offer opinions that drill down further on the level of consistency exhibited by samples or the likelihood that two bullets or cartridges fired from different firearms might exhibit such consistency. However, based on the record here, and particularly the lack of evidence that study results are reflective of actual casework, firearms identification has not been shown to reach reliable results linking a particular unknown bullet to a particular known firearm.
See, e.g., David H. Kaye, The Nikumaroro Bones: How Can Forensic Scientists Assist Factfinders?, 6 Va. J. Crim. L. 101 (2018).

LAST UPDATED 29 June 2023

Is "Bitemark Analysis" Better than "Bitemark Comparisons"?

2023-06-11T15:58:00.002-04:00

In October 2022, NIST released a draft report entitled "Bitemark Analysis: A NIST Scientific Foundation Review." A press release announced "Forensic Bitemark Analysis Not Supported by Sufficient Data, NIST Draft Review Finds." In March 2023, the final version reaching the same conclusions was released. Soon afterward, the NIST-supported Organization of Scientific Area Committees for Forensic Science (OSAC) revised the scope of the work that its forensic odontology subcommittee can undertake. The description now specifies, in italics no less, that "The Forensic Odontology Subcommittee does not develop standards on bitemark recognition, comparison, and identification."

Yet, some medical examiners believe that "analysis" of marks on the skin "frequently yields valuable information that forensic odontologists testify to in courts of law, just as forensic pathologists do with respect to their objective findings and their interpretations of those findings based on experience, training and the circumstances of the event." Richard Souviron & Leslie Haller, Bite Mark Evidence: Bite Mark Analysis Is Not the Same as Bite Mark Comparison or Matching or Identification, 4 J. L. & Biosci. 617, 618 (2017). They distinguish between "analysis" and "comparison," recognizing that the latter is not scientifically well founded, and seeking to preserve the former as a legitimate expert endeavor. They propose that

The analysis process involves answering basic, crucial, questions such as whether or not the pattern injury is a human bite mark. This question can be the most difficult part of the entire process. After establishing whether a patterned injury is, indeed, a bite mark, other questions must be asked. Is it a human bite mark? Was it made by an adult or a child? Was it swabbed for DNA? Was it made through clothing? If so, was the clothing swabbed for DNA? Where is it located on the victim and in what position was the victim when it happened? Could it have been self-inflicted? What was the position of the biter? Was it offensive or defensive? Was it affectionate or does it demonstrate violence? Will it produce a permanent injury? If so, simple battery may become aggravated battery. When was the bite inflicted in relation to the time of death? Is it fresh, a scar or somewhere in between? Was the person bitten alive or dead at the time? Are there any unique dental characteristics that could be used to exclude possible suspects? In cases of multiple bites, did the same biter make them all? Were they all made at the same time or do they establish a pattern of long-term abuse?

These questions, and more, are the essential core of the analysis of every bite mark, and produce a large amount of information that can be of considerable value to an investigation before any suspects are identified or charged.

Id. So where are the experiments or other studies to show that most of these "essential" parts of bitemark analysis can be done validly and reliably? Can medical examiners correctly classify "pattern injuries" as bitemarks? As human bitemarks? As the mark of a child or an adult? As affectionate? As unique? As coming from the same biter?

Bite (and other) marks will be encountered in autopsies. They need to be photographed and examined along with other injuries or characteristics. But odontologists and medical examiners should think hard before they claim the ability to do all these things "and more" as part of "analysis."

"Double Blind Peer Review" of Research on Hair Fibers

2023-01-19T17:17:00.000-05:00

Yesterday I heard about some new publications on forensic hair microscopy published in, of all places, a journal on pharmaceuticals. My first thought was that the journal might be a predatory one with deceptive advertising designed to con scholars into paying for publication in what appears to be a reputable scientific journal. But that was too cynical.

The papers are

S. Sneha Harshini, Vishnu Priya Veeraraghavan, Abirami Arthanari1, R. Gayathri, S. Kavitha, J. Selvaraj, P. K. Reshma & Y. Dinesh, Comparative Study of Male and Female Human Hair: A Microscopic Analysis, 13 J. Advanced Pharm. Tech. & Rsch. S297-S301 (2022);
S. Nehal Safiya, Vishnu Priya Veeraraghavan, Abirami Arthanari, R. Gayathri, J. Selvaraj, S. Kavitha & Y. Dinesh, J., Comparison of Human and Animal Hair – A Microscopical Analysis, 13 Advanced Pharm. Tech. & Rsch. S112–S116 (2022); and
S. Nehal Safiya, Vishnu Priya Veeraraghavan, Abirami Arthanari, R. Gayathri, J. Selvaraj, S. Kavitha & Y. Dinesh, A Comparative Study of Different Animal Hairs: A Microscopic Analysis, 13 Advanced Pharm. Tech. & Rsch. S117–S120 (2022)

The Journal of Advanced Pharmaceutical Technology & Research has a scientific society and a respectable publisher behind it. The former is the “Society of Pharmaceutical Education & Research (SPER),” which is one of the leading pharmaceutical association in the country [of India] ... with a member base of around 3,500, it spread [sic] across the country and have [sic] 13 state branches.”

The latter is Wolters Kluwer’s Medknow. Located in Mumbai, Medknow “provides publishing services for peer-reviewed, online and print-plus-online journals in medicine on behalf of learned societies and associations with a focus on emerging markets.” Wolters Kluwer insists that Medknow “journals employ a double-blind review process, in which the author identities are concealed from the reviewers, and vice versa, throughout the review process.”

Although the journal is not indexed in Medline, the research comes from the Saveetha Dental College, Chennai, Tamil Nadu, India—”one of the finest institutions in the world with a unique curriculum that is a spectacular fusion of the best practices of the east and west.”

So I read the papers. Unbelievable.

The abstract and the conclusion of "A Comparative Study of Male and Female Human Hair" announce that “[t]his study can be concluded that the structural comparison between male and female hair specimens can be used as evidence for forensic analysis at crime scenes.” How so? Well, for one thing, “[i]n this study, it is observed that the color of human male hair is completely black, while it is black on the proximal end and brown at the distal end of human female hair.”

Astonishingly, the sample of hairs is never described. How many men and women provided hair? Where did they come from? How many hairs were taken from each subject and compared? Were the examinations blind? Without this elementary information, no one can understand or assess the reported results.

The “Comparison of Human and Animal Hair – A Microscopical Analysis” is similarly devoid of any meaningful description of the research.

The “Comparative Study of Different Animal Hairs: A Microscopic Analysis” appears to be a description of four hairs – one each from a dog, a cat, a horse, and a rat. The researchers found some differences among them. This they found encouraging: "The present study might be used in forensic investigations."

So much for "double blind peer review."

Grand Jury Subpoenas for Newborn Screening Blood Spots

2022-08-02T15:22:00.000-04:00

On July 10, the New Jersey Office of the Public Defender and the New Jersey Monitor sued the state department of health "to obtain redacted copies of [grand jury] subpoenas ... so that they can learn more about how the State Newborn Screening Laboratory has effectively turned into a warrantless DNA collection facility for State criminal prosecutions." \1/

New Jersey's neonatal screening program, like that in other states, uses a few drops of blood from the newborn’s heel to test "for certain genetic, endocrine, and metabolic disorders ... prior to discharge from a hospital or birthing center." \2/ The Department of Health explains that "[e]arly detection and treatment of the disorders on the newborn screening panel can prevent lifelong disabilities, including intellectual and developmental disabilities, and life threatening infections." \3/ Like many other states, New Jersey health officials retain a "Guthrie card" (named after Dr. Robert Guthrie, who in the 1960s, successfully championed mandatory screening laws for a metabolic disease that causes preventable intellectual disability). \4/

The complaint alleges that the Office of the Public Defender (OPD) "became alarmed" that State Police "are utilizing the residual blood spot samples" and that the health department rebuffed requests to provide information on subpoenas the department may have received from grand juries. The cause of the alarm is described as follows:

The State Police had re-opened an investigation into a “cold case” of sexual assault that had occurred in 1996 and had genetically narrowed the suspects to one of three brothers and their male offspring. ... [They] served a subpoena upon the Newborn Screening Laboratory in or about August 2021 to obtain residual dried blood spot samples that had been collected from a male newborn in or about June 2012.

To ascertain which family member was the suspect, the State Police sought the blood spot sample that was taken from an approximately nine-year-old child when he was a newborn to compare it to the DNA it had collected at the crime scene in 1996. The State Police successfully obtained the child’s blood spot sample, sequenced the DNA, and then ran further analysis utilizing a technique known as investigative genetic genealogy. The State Police alleges those results showed the newborn blood spot sample belonged to the genetic child of the suspect. From there, the State Police used those results to form the basis of an affidavit of probable cause to acquire a warrant to obtain a buccal swab from OPD’s client, who is the child’s father. OPD’s client was then criminally charged.

OPD further asserted "a significant interest in knowing how expansive this law enforcement practice is so that it may better represent its clients who may be subject to such warrantless searches." It did not explain how learning the number of subpoenas would improve its ability to defend any particular client.

The other plaintiff, the New Jersey Monitor, described itself as "the eyes and ears of the public [with] an interest in reporting to the public about this practice that violates basic concepts of genetic privacy."

The pleading claims that "law enforcement agencies are flouting search warrant requirements" and that "[b]ecause the Supreme Court of the United States and the New Jersey Supreme Court recognize that people have a right of privacy in their DNA and that the collection and analysis of that DNA is a search, a search warrant is generally required for such invasive actions."

I have not researched New Jersey jurisprudence, but I strongly doubt that the U.S. Supreme Court's opinions constitutionalize any free-floating "basic concepts of genetic privacy." \5/ The allegation of "subversion of the warrant requirement" of the Fourth Amendment presupposes that a warrant is required. That could be, but this question is not directly covered by Supreme Court precedent. It is the conclusion of what has to be a more complex legal argument. How might that argument go?

The Fourth Amendment declares that "[t]he right of the people to be secure in their persons, houses, papers, and effects, against unreasonable searches and seizures, shall not be violated, and no Warrants shall issue, but upon probable cause ... ." How do subpoenas for Guthrie cards come within this proscription? They are not quite seizures of any person or any person's papers or effects.

Are they searches of the person? Certainly, a physical intrusion into the body to extract blood would be, and the state has done that with a warrantless heel prick. But that search is constitutional because of an exception to the warrant-preference rule. The "special needs" exception allows the government to conduct searches and seizures to advance important government interests other than collecting information for criminal cases. Compulsory neonatal screening is an important public health program for providing early treatment or prevention of suffering and impairment. It predates DNA testing for identification (and DNA testing for disease, for that matter). New Jersey's legislation dates back to 1964. That grand jury subpoenas can be issued today to investigate a crime does not make the original search or seizure does not transform the original interference with bodily integrity into one that required probable cause. \6/

There is, however, a second search. The subpoena itself triggers Fourth Amendment protections -- but not to the extent of a physical entry to acquire information. The privacy and security interests are quite different, and the Supreme Court has held that the government may use an administrative subpoena to acquire documents so long as “the documents sought are relevant to the [investigation]” and the document request is “adequate, but not excessive,” for those purposes. \7/ Unlike the warrant process, a subpoena does not require probable cause.

At least, not normally. A Guthrie-card subpoena might be different. In Carpenter v. United States, \8/ the Supreme Court held that probable cause was required for the government to compel wireless carriers to produce time-stamped records of cell-site location information (CSLI) on a robbery suspect that had 12,898 location points cataloging his cell phone's movements over 127 days. Courts had issued orders for these business records in an FBI investigation into a series of robberies, under the Stored Communications Act, which merely requires "specific and articulable facts showing that there are reasonable grounds to believe that ... the records ... [sought] are relevant and material to an ongoing criminal investigation." \9/ Cause to believe that a record is relevant to an investigation is not probable cause to believe that the record is evidence of a suspect's criminal conduct. The majority opinion in Carpenter emphasized that CSLI records added up to (or will, in the near future, amount to) "a detailed chronicle of a person's physical presence compiled every day, every moment, over several years." \10/ As such, it held the relevance-based orders in question were unreasonable searches.

One can argue that the information that can be extracted from a DNA sample "implicates privacy concerns" at least as much as CSLI data. \11/ But the analogy requires attention to the kind of DNA information the government obtains (and the precautions it takes against other personal information being acquired from the DNA).

Until the blood is analyzed, no informational privacy is compromised. \12/ In the case mentioned in the complaint, the police "had genetically narrowed the suspects to one of three brothers and their male offspring." I would guess that they accomplished this by means of Y-STR typing combined with other leads. The police then obtained the Guthrie card for "an approximately nine-year-old child," "sequenced the DNA, and then ran further analysis utilizing a technique known as investigative genetic genealogy" to conclude that the child's "blood spot sample belonged to the genetic child of the suspect."

It is difficult to discern what DNA testing was done. "Investigative genetic genealogy" normally involves comparisons of haploblocks from crime-scene DNA and DNA in genetic genealogy databases that are open to the public in order to pick possible relatives to the unknown person whose DNA was at the crime-scene. With those findings, ordinary genealogical research may produce a list of suspects. In the case mention in the complaint, police already had the list of suspects. Why perform the extensive haploblock analysis of "investigative genetic genealogy" if the three siblings and the child of one of them already are known? Would not comparing a number of autosomal STR loci not known to be medically informative have been able to show whether the child had a substantial probability of being the child of the man whose DNA was associated with the 1996 sexual assault that the police were investigating? That might be enough for probable cause for a court order compelling the implicated brother to provide a DNA sample for comparison to the one from the 1996 sexual assault. \13/

Of course, it can be argued that the particular loci the police actually used for the investigation hardly matter -- that the very fact that the sample contains medically relevant information that the police could acquire from the Guthrie card makes the case similar enough to the location tracking in Carpenter to require probable cause. In Carpenter, the FBI was only interested in associating the defendant's cell phone with towers near the robberies that were under investigation. Did they assemble detailed itineraries of Carpenter's movements at all other locations that he (or, more precisely, his phone) visited? Perhaps the mere fact that the many cell-site records were in their possession was enough.

Yet, this argument resembles the one rejected in most cases on the constitutionality of forcing convicted offenders (or even arrestees) to surrender DNA for law-enforcement databases. Most judges, and the Supreme Court, rejected the argument that the potential to type all kinds of loci in itself required probable cause for collecting and profiling the DNA for identification only. \14/

None of this means that New Jersey's Guthrie-card subpoenas are clearly or even probably constitutional. I merely suggest that there could be more to the issue than the complaint alleges. Also, it seems worth noting that the exact connection between the the public records request and the constitutional issue is not entirely apparent. \15/

NOTES

Thanks to Fred Bieber for news of the complaint.

N.J. Office of the Public Defender v. N.J. Dep't of Health, Civ. No. ___ (Complaint, July 10, 2022), available at https://www.theverge.com/2022/7/29/23283837/nj-police-baby-dna-crimes-lawsuit-public-defender.

Centers for Disease Control and Prevention, Newborn Screening Portal, Nov. 29, 2021, https://www.cdc.gov/newbornscreening/index.html.

N.J. Dep't of Health, Newborn Screening and Genetic Services, Feb. 10, 2022, https://www.nj.gov/health/fhs/nbs/.

Harvey L. Levy, Robert Guthrie and the Trials and Tribulations of Newborn Screening, 7(1) Int’l J. Neonatal Screening 5 (2021), available at https://doi.org/10.3390/ijns7010005.
Cf. Dobbs v. Jackson Women's Health Organization, No. 19–1392 (U.S. June 24, 2022), available at https://www.supremecourt.gov/opinions/21pdf/19-1392_6j37.pdf.

Cf. Ferguson v. Charleston, 532 U.S. 67 (2001), available at https://scholar.google.com/scholar_case?case=12447804856380641716. Another exception is consent. Although consent for Fourth Amendment purposes is far less onerous than medical informed consent, the only grounds for refusal in New Jersey are religious. 26 N.J. Stat. Ann. § 26:2-111. So the consent exception does not apply.

Okla. Press Publ’g Co. v. Walling, 327 U.S. 186, 209 (1946) (upholding an FTC order for the production of a newspaper publishing corporation’s books and records as request was made pursuant to statute and was reasonably relevant). The Fifth Amendment privilege against self-incrimination offers protection when the act of production itself would be incriminating as an admission. E.g., United States v. Hubbell, 530 U.S. 27 (2000).

138 S.Ct. 2206 (2018), available at https://scholar.google.com/scholar_case?case=14655974745807704559.

18 U.S.C. § 2703(d).

Id. at 2220.

Cf. id. at 2266-67 (Gorsuch, J., dissenting and asking "Why is the relevant fact the seven days of information the government asked for instead of the two days of information the government actually saw? ... And in what possible sense did the government 'search' five days' worth of location information it was never even sent?").

See Maryland v. Pringle, 540 U.S. 366, 371-72 (2003) (finding probable cause for arresting three men in a car after finding $763 of rolled-up cash in the glove compartment and five plastic glassine baggies of cocaine were behind the back-seat armrest).
See David H. Kaye, Why So Contrived? DNA Databases After Maryland v. King, 104 J. Crim. L. & Criminology 535 (2014); David H. Kaye, A Fourth Amendment Theory for Arrestee DNA and Other Biometric Databases, 15 U. Pa. J. Const. L. 1095 (2013).

Whether accessing the Guthrie cards for criminal investigations is common or rare in New Jersey would not seem to affect the legality of the subpoenas. Of course, the extent of the access should be a matter of public concern, and widespread law enforcement use of the cards could prompt legislation to curtail the practice. But that is so whether or not the alleged invasions of "genetic privacy" are constitutional. Still, uncovering a widespread practice that is not only of general public interest, but also illegal, might add weight to the case for public disclosure under a balancing test for such disclosure. In that event, the allegations of unconstitutionality would not be superfluous to the complaint. Nonetheless, if the opinions on the state and federal law of search and seizure are overly rhetorical, one might wonder whether they go beyond a simple "statement of the facts on which the claim is based." Rules Governing the Courts of the State of New Jersey, Rule 4:5-2, available at https://www.njcourts.gov/attorneys/assets/rules/r4-5.pdf.

Preliminary Results from a Blind Quality Control Program

2022-07-09T18:54:00.003-04:00

The Houston Forensic Science Center recently reported the results of realistic, blind tests of its firearms examiners. Realism comes from disguising materials to look like actual casework and injecting these "mock evidence items" into the regular flow of business. The judgments of the examiners for the mock cases can be evaluated with respect to the true state of affairs (ammunition components from the same firearm as opposed to components from different firearms). Eagerly, I looked for a report of how often the examiners declared an association for pairs of items that were not associated with one another (false "identifications") and how often they declared that there was no association for pairs that were in fact associated (false "eliminations").

These kinds of conditional "error rates" are by no means all there is to quality control and to improving examiner performance, which is the salutary objective of the Houston lab, but they are prominent in judicial opinions on the admissibility of firearms-toolmark evidence. So too, they (along with the cognate statistics of specificity and sensitivity) are established measures of the validity of tests for the presence or absence of a condition. Yet, I searched in vain for clear statements of these standard measures of examiner performance in the article by Maddisen Neuman, Callan Hundl, Aimee Grimaldi, Donna Eudaley, Darrell Stein and Peter Stout on "Blind Testing in Firearms: Preliminary Results from a Blind Quality Control Program," 67(3) J. Forensic Sci. 964-974 (2022).

Instead, tables use a definition of "ground truth" that includes materials being intentionally "insufficient" or "unsuitable" for analysis, and they focus on whether "[t]he reported results either matched the ground truth or resulted in an inconclusive decision." (Here, "inconclusive" is different from insufficient" and "unsuitable." For the sake of readers who are unfamiliar with firearms argot, Table 1 defines--or tries to--the terminology for describing the outcomes of the mock cases.)

TABLE 1. Statements for the Outcome of an Examination
(adapted from p. 966 tbl. 1)
Binary (Yes/No) Source Conclusions
Identification: A sufficient correspondence of individual characteristics will lead the examiner to the conclusion that both items (evidence and tests) originated from the same source.
Elimination: A disagreement of class characteristics will lead the examiner to the conclusion that the items did not originate from the same source. In some instances, it may be possible to support a finding of elimination even though the class characteristics are similar when there is marked disagreement of individual characteristics.
Statements of No Source Conclusion
Unsuitable: A lack of suitable microscopic characteristics will lead the examiner to the conclusion that the items are unsuitable for identification.
Insufficient: Examiners may render an opinion that markings on an item are insufficient when:
• an item has discernible class characteristics but no individual characteristics
• an item does not exhibit class characteristics and has few individual characteristics of such poor quality that precludes an examiner from rendering an opinion;
• the examiner cannot determine if markings on an item were made by a firearm during the firing process; or
• the examiner cannot determine if markings are individual or subclass.
Inconclusive: An insufficient correspondence of individual and/or class characteristics will lead the examiner to the conclusion that no identification or elimination could be made with respect to the items examined.
Note on "identification": The identification of cartridge case/bullet toolmarks is made to the practical, not absolute, exclusion of all other firearms. This is because it is not possible to examine all firearms in the world, a prerequisite for absolute certainty. The conclusion that sufficient agreement for identification exists between toolmarks means that the likelihood that another firearm could have made the questioned toolmarks is so remote as to be considered a practical impossibility.

There were 51 mock cases containing anywhere from 2 to 41 items (median = 9). In the course of the five-and-a-half year study, 460 items were examined for a total of 570 judgments by only 11 firearms examiners, with experience ranging from 5.5 to 23 years. The mock evidence varied greatly in its informativeness, and the article suggests that the lab sought to use a greater proportion of challenging cases than might be typical.

Whether or not the study is generalizable to other examiners, laboratories, and cases, the authors write that "no hard errors were observed; that is, no identifications were declared for true nonmatching pairs, and no eliminations were declared for true matching pairs." This sounds great, but how probative is the observation of "no hard errors"

Table 3 of the article states that there were 143 false pairs, of which 106 were designated inconclusive. It looks like the examiners were hesitant to make an elimination, even for a false pair. They made only 37 eliminations. Since there were no "hard errors," none of the false pairs were misclassified as identifications. Ignoring inconclusives, which are not presented as evidence for or against an association, the observed false-identification rate therefore was 0/37. Using the rule of three for a quick approximation, we can estimate the 95% confidence interval as going from 0 to 3/37. To use phrasing like that in the 2016 PCAST Report, the false-positive rate could be as large as 1 in 9.

Applying the same reasoning to the 386 true pairs, of which 119 were designated inconclusive, the observed false-elimination rate must have been 0/267. The 95% confidence interval for the false-elimination rate thus extends to about 3/267, or 1/89.

These confidence intervals should not be taken too seriously. The simple binomial probability model implicit in the calculations does not hold for dependent comparisons. To quote the authors (p. 968), "Because the data were examined at the comparison level, an item of evidence can appear in the data set in multiple comparisons and be represented by multiple comparison conclusions. For example, Item 1 may have been compared to Item 2 and Item 3 with comparison conclusions of elimination and identification, respectively." Moreover, I could be misconstruing the tables. Finally, even if the numbers are all on target, they should not taken as proof that error rates are as high as the upper confidence limits. The intervals are merely indications of the uncertainty in using particular numbers as estimates of long-term error rates.

In short, the "blind quality control" program is a valuable supplement to minimal-competency proficiency testing. The absence of false identifications and false eliminations is encouraging, but the power of this study to pin down the probability of errors at the Houston laboratory is limited.

Why Did the Proposed Amendment to Rule 702 Scuttle the "Preponderance of the Evidence"?

2022-07-06T17:05:00.001-04:00

After posting a description of the changes to the proposed amendment to Federal Rule of Evidence 702, I received the following inquiry:

Which one is actually the proposal? "More likely than not" or "by a preponderance of the evidence"? The former seems to be a weakening, the latter (even if it is redundant for lawyers) puts forensic scientists on notice. Use of the word "evidence" in the latter is, however, potentially confusing. "Evidential reliability" is about the "reliability" [sic] of the "evidence", i.e., the "scientific validity" of the methods applied to arrive at the "opinion". The proposed change (if it is the proposed change) seems to refer to "evidence" about the "reliability" of the "evidence" (in which the first and second instance of the word "evidence" do not refer to the same thing).

The first iteration of the amendment used "preponderance." It read, "[a]n [expert] witness ... may testify ... if the proponent has demonstrated by a preponderance of the evidence that" the proposed evidence satisfies various requirements regarding what the Supreme Court in Daubert v. Merrell Dow Pharmaceuticals, 509 U.S. 579 (1993), called "evidentiary reliability." Now the proposed text is, "An [expert] witness ... may testify ... if the proponent demonstrates to the court that it is more likely than not that" the proposed evidence satisfies these requirements.

Why the change? Partly because of the elliptical nature of the original formulation and partly because of the awkwardness of the construction "evidence that the evidence." As the rest of this posting explains, the new (green) version is better drafted, but the idea was never in doubt.

The governing principle comes from Federal Rule of Evidence 104(a) as interpreted in Bourjaily v. United States, 483 U.S. 171 (1987). The rule begins with a general observation that

The court must decide any preliminary question about whether a witness is qualified, a privilege exists, or evidence is admissible. In so deciding, the court is not bound by evidence rules, except those on privilege.

Fed. R. Evid. 104(a). So to decide whether proffered evidence is admissible at trial, the court can consider all pertinent, non-privileged information presented to it, whether or not the information about admissibility would be admissible in a trial.

But Rule 104 is silent on how confident the judge should be that the proposed evidence satisfies the requirements for admissibility. That is where Bourjaily comes in. In that case, the government wanted to introduce out-of-court statements of a coconspirator as evidence against the defendant. To avoid the rule against hearsay, it sought to persuade the court to apply the rule that certain statements of conspirators are admissible against everyone in the conspiracy. Defendant's membership in the conspiracy was thus a preliminary question for the court, and the Bourjaily Court explained that

We are ... guided by our prior decisions regarding admissibility determinations that hinge on preliminary factual questions. We have traditionally required that these matters be established by a preponderance of proof. Evidence is placed before the jury when it satisfies the technical requirements of the evidentiary Rules, which embody certain legal and policy determinations. The inquiry made by a court concerned with these matters is not whether the proponent of the evidence wins or loses his case on the merits, but whether the evidentiary Rules have been satisfied. Thus, the evidentiary standard is unrelated to the burden of proof on the substantive issues, be it a criminal case ... or a civil case. ... The preponderance standard ensures that, before admitting evidence, the court will have found it more likely than not that the technical issues and policy concerns addressed by the Federal Rules of Evidence have been afforded due consideration. ... Therefore, we hold that, when the preliminary facts relevant to Rule 801(d)(2)(E) are disputed, the offering party must prove them by a preponderance of the evidence.

483 U.S. at 175-76 (note omitted).

Applying Bourjaily to the preliminary questions in Rule 702, it is quite clear that the trial court has to find that "evidentiary reliability" under Rule 702 is more probable than not. To foreclose any debate about it, in Daubert itself, the Court pointed to the preponderance standard, writing that "[f]aced with a proffer of expert scientific testimony, then, the trial judge must determine at the outset, pursuant to Rule 104(a), whether the expert is proposing to testify to (1) scientific knowledge that (2) will assist the trier of fact to understand or determine a fact in issue." 509 U.S. at 592.

Yet, many public commenters did not see this. Some comments claimed that the word "evidence" in "preponderance of the evidence" would constrain the court to considering only such evidence as would be admissible at trial in deciding whether the proposed expert testimony is admissible. Other comments claimed that the phrase would keep previously admissible evidence from juries. Indeed, "almost all of the fire was directed toward the term 'preponderance of the evidence.'” Advisory Comm. on Evid. Rules, Report to the Standing Committee, May 15, 2022, at 7.

The Advisory Committee unabashedly rejected both these claims. In its report to the Standing Committee, it wrote that:

The Committee does not agree that the preponderance of the evidence standard would limit the court to considering only admissible evidence; the plain language of Rule 104(a) allows the court deciding admissibility to consider inadmissible evidence. Nor did the Committee believe that the use of the term preponderance of the evidence would shift the factfinding role from the jury to the judge, for the simple reason that, when it comes to making preliminary determinations about admissibility, the judge is and always has been a factfinder.

Id. Nevertheless,

[T]he Committee recognized that it would be possible to replace the term “preponderance of the evidence” with a term that would achieve the same purpose while not raising the concerns (valid or not) mentioned by many commentators. The Committee unanimously agreed to change the proposal as issued for public comment to provide that the proponent must establish that it is “more likely than not” that the reliability requirements are met. This standard is substantively identical to “preponderance of the evidence” but it avoids any reference to “evidence” and thus addresses the concern that the term “evidence” means only admissible evidence.

Id. Finally,

The Committee was also convinced by the suggestion in the public comment that the rule should clarify that it is the court and not the jury that must decide whether it is more likely than not that the reliability requirements of the rule have been met. Therefore, the Committee unanimously agreed with a change requiring that the proponent establish “to the court” that it is more likely than not that the reliability requirements have been met. The proposed Committee Note was amended to clarify that nothing in amended Rule 702 requires a court to make any findings about reliability in the absence of a proper objection.

Id. Overlooked in this debate over the niceties of the phrase "preponderance of the evidence" is a different drafting point. The proposed amendment makes it explicit that the standard pertains to the court's role in considering scientific validity, but it does not do the same for the other requirements of Rule 702--namely, that the witness be "qualified as an expert by knowledge, skill, experience, training, or education." That a witness is qualified to testify also must be established as more probable than not. For a rare case excluding testimony from a latent fingerprint examiner because she ran into problems in demonstrating proficiency, see United States v. Cloud, No. 1:19-cr-02032-SMJ-1, 2021 WL 7184484 (E.D. Wash. Dec. 17, 2021) (false exclusion in casework, a false exclusion on a proficiency test, and receiving help from her supervisor on a follow-up proficiency test).

Proposed Amendment to Federal Rule of Evidence 702 Clears More Hurdles

2022-07-01T15:20:00.000-04:00

The following report appeared in the OSAC newsletter OSAC In Brief, June 2022, at 4-6 with the title "Proposed Amendment to Federal Rule of Evidence 702 Clears More Hurdles." It updates a report in the June 2022 issue (posted earlier today on this blog). Both reports are meant to be boringly factual. More opinionated remarks may appear later.

After five years of discussion, a proposed amendment to Federal Rule of Evidence 702 on testimony by expert witnesses has progressed to the Judicial Conference of the United States—the policy-making arm of the federal judiciary. If the Judicial Conference accepts the unanimous recommendations of both its Advisory Committee on Evidence Rules, which drafted the amendment, and its standing Committee on Rules of Practice and Procedure, which endorsed it this month, the amendment will be delivered to the Supreme Court for transmittal to Congress. Then, unless Congress intervenes, it will become effective by the end of next year.

But what effect would it have? According to the Advisory Committee chair, U.S. District Court Judge Patrick Schiltz, the amendment does not alter the meaning of the rule in the slightest. “It simply makes it clearer, makes it easier for people to understand, so that fewer mistakes will be made” (as reported June 7, in Bloomberg Law). Box 1 shows the proposed changes, which differ slightly from those discussed in the OSAC In Brief article of July 2021.

BOX 1. Proposed Changes to Federal Rule of Evidence 702

A witness who is qualified as an expert by knowledge, skill, experience, training, or education may testify in the form of an opinion or otherwise if the proponent demonstrates to the court that it is more likely than not that:
(a) the expert's scientific, technical, or other specialized knowledge will help the trier of fact to understand the evidence or to determine a fact in issue;
(b) the testimony is based on sufficient facts or data;
(c) the testimony is the product of reliable principles and methods; and
(d) the ~~expert has reliably applied~~ expert's opinion reflects a reliable application of the principles and methods to the facts of the case.

On the face of it, the amendment does little, if anything, to alter the substance of the existing rule. It adds the words “if the proponent demonstrates to the court that it is more likely than not” in front of the criteria for admitting expert testimony, but the Supreme Court had already noted that in exercising a longstanding “gatekeeping” role, the district court needs to determine whether the conditions for admitting expert testimony are “established by a preponderance of proof.” (Daubert v. Merrell Dow Pharmaceuticals, 509 U.S. 579, 592 n.10 (1993) (citing Fed. Evid. 104(a); as a result of public comments, the Advisory Committee substituted “more likely than not” for the “preponderance of evidence” to describe the proponent’s burden of persuasion on the issue of admissibility).

The other wording change concerns the well entrenched reliability-as-applied requirement (“the expert has reliably applied” in part (d)). The amendment uses an alternative phrase—“the expert's opinion reflects a reliable application.” Although one could argue that the specific reference to “opinion” limits the requirement to personal opinions, that is not the intent. An explanatory note that will accompany the revised rule (if and when it is adopted) makes it plain that it still must appear that the expert has applied a valid and reliable method proficiently and appropriately in making any and all findings and inferences. The only purpose of the change is “to emphasize that each expert opinion must stay within the bounds of what can be concluded from a reliable application” of a reliable method to the facts of the case. And, this Advisory Committee Note (ACN) adds that this directive is “is especially pertinent to the testimony of forensic experts,” for which “the judge should (where possible) receive an estimate of the known or potential rate of error of the methodology employed, based (where appropriate) on studies that reflect how often the method produces accurate results” rather than “assertions of absolute or one hundred percent certainty—or to a reasonable degree of scientific certainty ... .”

During the six-month comment period that ended in February, the draft received well over 500 comments. The Reporter to the Advisory Committee found the public reaction “somewhat surprising, because the proposed amendment essentially seeks only to clarify the application of Rule 702 as it was amended in 2000—and that amendment received [only] 179 comments.” Lawyers from the plaintiffs’ side of the civil bar opposed the latest amendment, while defendants’ lawyers supported it.

There were relatively few comments about the implications of the additional words and the accompanying note for the areas of forensic science covered by OSAC. These too were (predictably) divided. The National District Attorneys Association (NDAA) objected to the ACN’s singling out forensic-science testimony as a problem and saw the amendments as “a solution in search of a problem.” But the New York City Bar Association expressed “particular concern [with] criminal prosecutions” and “the scientific validity of many types of ‘feature-comparison’ methods of identification, such as those involving fingerprints, footwear and hair.” The New York State Crime Laboratory Advisory Committee (NYSCLAC) objected to “changes limiting forensic science testimony” but then maintained that its laboratories already complied with the guidance in the ACN. The Union of Concerned Scientists questioned parts of the NDAA and NYSCLAC statements and insisted that “forensic evidence should be required to present courts with estimates of error rates relevant to their methodologies.” The Innocence Project and other organizations and individuals submitted a joint statement praising the changes and pressing for more. They wanted the text of the rule to contain a requirement that testimony is not only “the product of reliable principles and methods” (the current wording), but also to specify that it “includes the limitations and uncertainty of those principles and methods.”

The conflicting comments regarding forensic science produced no modifications. If the amendment is adopted, it will implement, to some extent, the 2016 recommendation of the President’s Council of Advisors on Science and Technology that “the Judicial Conference of the United States ... should prepare ... an Advisory Committee note, providing guidance to Federal judges concerning the admissibility under Rule 702 of expert testimony based on forensic feature-comparison methods.”

Author’s disclaimer: This report presents the views of the author. Their publication in In Brief is not an endorsement by NIST or OSAC, and they are not intended to represent the views of any OSAC unit. No estimate of the known or potential rate of error is available.

Proposed Changes to Federal Rule of Evidence 702

2022-07-01T11:36:00.001-04:00

The following report appeared in the OSAC newsletter OSAC In Brief, July 2021, at 3-7 with the uninspired title "Proposed Changes to Federal Rule of Evidence 702." It was followed by an update in the June 2022 issue (about to be reproduced on this blog). Both are meant to be boringly factual. More opinionated remarks may appear later.

On April 30, the federal Advisory Committee on Evidence Rules unanimously proposed two changes to the wording of Federal Rule of Evidence 702. The rule, which many states have adopted in one form or another, provides for testimony by expert witnesses. The changes do not alter the meaning of the rule, but they can be seen as a course-correction signal telling courts to be more vigorous in ensuring that “forensic expert testimony is valid, reliable, and not overstated in court.”

The quoted words come from a report of the Advisory Committee. Facilitating such testimony also is part of OSAC’s raison d’être. This article for In Brief therefore describes the proposed amendment, a little bit of its history, the steps required for it to be enacted into law, and its significance for OSAC’s work.

The Proposer: An Advisory Committee to the Standing Committee of the Judicial Conference

The Judicial Conference of the United States is the policymaking organ of the judicial branch of the federal government. Composed of the Chief Justice of the U.S. Supreme Court, the chief judges of the 13 federal judicial circuits, and select federal district judges, it also is required by statute “to carry on a continuous study of the operation and effect of the general rules of practice and procedure" that apply in the federal courts (and, with some variations, in many state court systems as well). The Conference relies on a “Committee on Rules of Practice and Procedure, commonly referred to as the ‘Standing Committee.’" The Standing Committee, in turn, relies on advisory committees on appellate, bankruptcy, civil, criminal, and evidence rules. These advisory committees are comprised of “federal judges, practicing lawyers, law professors, state chief justices, and representatives of the Department of Justice.” (Quotations are from the Administrative Office of the U.S. Courts.) The Advisory Committee on Evidence Rules (which we can abbreviate as ACER) is one of these committees.

The Proposed Text: Two Wording Changes

Rule 702 went into effect in federal courts in 1975. It was one sentence long. The Supreme Court famously interpreted it in Daubert v. Merrell Dow Pharmaceuticals, 509 U.S. 579 (1993), a somewhat ambivalent and abstract opinion. The Court expounded further in cases in 1997 and 1999. The rule was rewritten to incorporate the teachings in these cases in 2000, leading to the version with the longer sentence in the right-hand side of Box 1.

BOX 1. FEDERAL RULE OF EVIDENCE 702 THEN AND NOW
The Rule in 1975	The Rule in 2021
If scientific, technical, or other specialized knowledge will assist the trier of fact to understand the evidence or to determine a fact in issue, a witness qualified as an expert by knowledge, skill, experience, training, or education, may testify thereto in the form of an opinion or otherwise.	A witness who is qualified as an expert by knowledge, skill, experience, training, or education may testify in the form of an opinion or otherwise if: (a) the expert’s scientific, technical, or other specialized knowledge will help the trier of fact to understand the evidence or to determine a fact in issue; (b) the testimony is based on sufficient facts or data; (c) the testimony is the product of reliable principles and methods; and (d) the expert has reliably applied the principles and methods to the facts of the case.

The proposed amendment makes two seemingly minor changes, shown in Box 2:

BOX 2. THE ADVISORY COMMITTEE’S PROPOSED AMENDMENT TO RULE 702

A witness who is qualified as an expert by knowledge, skill, experience, training, or education may testify in the form of an opinion or otherwise if the proponent has demonstrated by a preponderance of the evidence that:
(a) the expert’s scientific, technical, or other specialized knowledge will help the trier of fact to understand the evidence or to determine a fact in issue;
(b) the testimony is based on sufficient facts or data;
(c) the testimony is the product of reliable principles and methods; and
(d) the ~~expert has reliably applied~~ expert’s opinion reflects a reliable application of the principles and methods to the facts of the case.

Reading these words, one might well ask what is going on. The first change seems to state the obvious (to lawyers, anyway). A footnote in Daubert already indicates that in the “preliminary assessment of whether the reasoning or methodology” possesses “evidentiary reliability,” the trial court must be satisfied by “a preponderance of proof” because that is the threshold for all “[p]reliminary questions concerning the qualification of a person to be a witness, the existence of a privilege, or the admissibility of evidence.” It may not hurt to state this standard in the text of the rule (although including it after the opening clause about qualifications awkwardly fails to modify the qualifications part of the rule). But why bother?

Similarly, the change to Part (d) is potentially confusing because it limits the “reliable application” prong of the rule to expert “opinion” even though, as the Advisory Committee that drafted the original rule noted, it is “logically unfounded” to “assume[] that experts testify only in the form of opinions.” Instead, “[t]he rule … recognizes that an expert on the stand may give a dissertation or exposition of scientific or other principles relevant to the case, leaving the trier of fact to apply them to the facts.” But aside from the probably unintended limitation of the as-applied prong to opinions, why bother? What is the difference between testimony when an expert has “reasonably applied the principles and methods” and testimony that “reflects a reasonable application of the principles and methods”?

The answers lie in ACER’s official note prepared to accompany the rule, the minutes of its meetings, and its periodic reports to the Standing Committee on its progress in revising the rule.

The Purpose of the New Text

For OSAC, the most salient parts of the note of the Advisory Committee are in Boxes 3 and 4. As to the first change, regarding “preponderance,” ACER believed that

BOX 3. Part of ACER’s Proposed Note Explaining Its First Proposed Change

[M]any courts have held that the critical questions of the sufficiency of an expert’s basis, and the application of the expert’s methodology, are questions of weight and not admissibility. These rulings are an incorrect application of Rules 702 and 104(a). … The Committee concluded that emphasizing the preponderance standard in Rule 702 specifically was made necessary by the courts that have failed to apply correctly the reliability requirements of that rule. … [Explicitly incorporating the standard] means that once the court has found the admissibility requirement to be met by a preponderance of the evidence, any attack by the opponent will go only to the weight of the evidence.

A major push for this change came from individuals and organizations concerned with civil litigation in which, they believed, courts have admitted expert opinions that a drug or chemical is harmful without adequately verifying that there is a body of scientific literature sufficient to let a reasonable expert conclude that the substance can cause the kind of harm claimed to have occurred under the conditions of the case. However, it also will remind judges in criminal cases that they must have proof that the scientific literature is sufficient to support the findings of forensic-science experts.

As Box 4 shows, the second part of the “amendment is especially pertinent to the testimony of forensic [science] experts in both criminal and civil cases”:

BOX 4. Part of ACER’s Proposed Note Explaining Its Second Proposed Change

Rule 702(d) has also been amended to emphasize that a trial judge must exercise gatekeeping authority with respect to the opinion ultimately expressed by a testifying expert. … The amendment is especially pertinent to the testimony of forensic experts in both criminal and civil cases. Forensic experts should avoid assertions of absolute or one hundred percent certainty—or to a reasonable degree of scientific certainty—if the methodology is subjective and thus potentially subject to error. In deciding whether to admit forensic expert testimony, the judge should (where possible) receive an estimate of the known or potential rate of error of the methodology employed, based (where appropriate) on studies that reflect how often the method produces accurate results. Expert opinion testimony regarding the weight of feature comparison evidence (i.e., evidence that a set of features corresponds between two examined items) must be limited to those inferences that can reasonably be drawn from a reliable application of the principles and methods. This amendment does not, however, bar testimony that comports with substantive law requiring opinions to a particular degree of certainty. … [N]othing in the amendment requires the court to nitpick an expert’s opinion in order to reach a perfect expression of what the basis and methodology can support. The … standard does not require perfection. On the other hand, it does not permit the expert to make extravagant claims that are unsupported by the expert’s basis and methodology.

It is the ACER note, much more than the revisions to the text of the rule, that has implications for forensic-science evidence. As the note indicates, the committee was especially concerned with forensic-science testimony. Its briefing materials included summaries of federal cases from across the spectrum of forensic sciences that raised the issue of “overstatement.” Furthermore, the idea of a new Advisory Committee Note came from the 2016 report of the President’s Council of Advisors on Science and Technology. PCAST called on “the Judicial Conference [to] prepare, with advice from the scientific community, a best practices manual and an Advisory Committee note, providing guidance to Federal judges concerning the admissibility under Rule 702 of expert testimony based on forensic feature-comparison methods.”

Apparently, PCAST did not realize that ACER is not empowered to write new notes to old rules. At a symposium convened by ACER in 2017, PCAST co-chair and newly appointed Presidential science advisor, Eric Lander, advised the committee as follows: “If an advisory note is a possibility, I’d favor it. If it’s not, change a comma in the rule and then write a new advisory note. Change one word, any word and write an advisory note.” Advisory Comm. on Evid. Rules Symposium on Forensic Expert Testimony, Daubert, and Rule 702, 86 Ford. L. Rev. 1463, 1523 (2018). This change-a-word artifice is more or less what is happening.

What Is Next in the Rulemaking Process?

The proposed amendment is just that—proposed. To become law, the ACER amendment and accompanying note must be approved by the Standing Committee after a six-month period for public comment and testimony (after which ACER reviews and can revise the proposed amendment and seek more comment). The Standing Committee then reviews the final drafts. It can revise and return the draft to ACER, or it can submit the amendment and note to the full Judicial Conference for its review. If the Judicial Conference approves, the drafts go to the Supreme Court, which normally transmits them to Congress with no substantive review. Congress then can adopt, reject, modify, or defer the rule change, but if Congress is silent for seven months, the amendment becomes effective at the end of the year.

Plainly, the proposal, which was four years in the making, still has a long way to go, but the very fact that ACER deliberated at length and expressed concern about forensic-science testimony, overstatement, and error probabilities could have more immediate impact in litigation.

Implications for OSAC

To help satisfy the proof requirements of Rule 702 (both as it stands and as it might be amended), subcommittees drafting standards for making findings and for reporting or testifying should specifically cite the scientific literature that supports each part of the standard. Valid estimates of potential error rates (or related statistics on the accuracy of results), or procedures to arrive at these estimates, should be part of such standards. Scientific and Technical Review Panels (STRPs) already are instructed to look for this content or for an explanation in the standard of why methods for ascertaining and expressing uncertainty in measurements, observations, or inferences are not present in the standards they review.

The repeated references to “overstatement” in ACER’s deliberations and materials should reinforce the desire of OSAC units to address the admittedly difficult problem of prescribing standards for testimony—and to use phrases in all standards that involve results that will satisfy the insistence on “those inferences that can reasonably be drawn from a reliable application of the principles and methods.” Cases on firearms-toolmark identifications (called “ballistics” cases in the ACER materials) suggest that judicial efforts are unlikely to produce the best solution. The Department of Justice has attempted to confront this issue with its Uniform Language for Testimony and Reports standards (ULTRs). It argued to ACER that these ULTRs help solve the problem of overclaiming, but one response was that because there are no such standards in laboratories generally, a new Advisory Committee Note is necessary. OSAC units still can help fill this gap if they act quickly.

Disclaimer: This report presents the views of the author. Their publication in In Brief is not an endorsement by NIST or OSAC, and they are not intended to represent the views of any OSAC unit. The error rate associated with them is not known.

State v. Ghigliotti, Computer-assisted Bullet Matching, and the ASB Standards

2022-06-11T15:05:00.002-04:00

In State v. Ghigliotti, 232 A.3d 468, 471 (N.J. App. Div. 2020), a firearms examiner concluded that a particular gun did not fire the bullet (or, more precisely, a bullet jacket) removed from the body of a man found shot to death by the side of a road in Union County, New Jersey. That was 2005, and the case went nowhere.

Ten years later, a detective prevailed on a second firearms examiner to see what he thought of the toolmark evidence. After considerable effort, this examiner reported that the microscopic comparisons with many test bullets from the gun in question were inconclusive.

However, at a training seminar in New Orleans he learned of two tools developed and marketed by Ultra Electronics Forensic Technology, the creator of the Integrated Ballistics Identification System (IBIS) that "can find the 'needle in the haystack', suggesting possible matches between pairs of spent bullets and cartridge cases, at speeds well beyond human capacity. The Bullettrax system “digitally captures the surface of a bullet in 2D and 3D, providing a topographic model of the marks around its circumference.” As “[t]he world’s most advanced bullet acquisition station” it uses “intelligent surface tracking that automatically adapts to deformations of damaged and fragmented bullets.”

The complementary Matchpoint is an “analysis station” with “[p]owerful visualization tools [that] go beyond conventional comparison microscopes to ease the recognition of high-confidence matches. Indeed, Matchpoint increases identification success rates while reducing efforts required for ultimate confirmations.” It features multiple side-by-side view of images from the Bullettrax data and score analysis. The court explained that “the Matchpoint software ... included tools for flattening and manipulating the images, adjusting the brightness, zooming in, and ‘different overlays of ... color scaling.’”

But the examiner did not make the comparisons based on the digitally generated and enhanced images, and he did not rely on any similarity-score analysis. Rather, he “looked at the images side-by-side on a computer screen using Matchpoint [only] ‘to try and target areas of interest to determine ... if (he) was going to go back and continue with further [conventional] microscopic comparisons or not.’” He found four such areas of agreement. Conducting a new microscopic analysis of these and other areas a few weeks later, he “‘came to an opinion of an identification or a positive identification’ ... grounded in his ‘training and experience and education as a practitioner in firearms identification’ and his handling of over 2300 cases.” 232 A.3d at 478–49.

The trial court “determined that a Frye hearing was necessary to demonstrate the reliability of the computer images of the bullets produced by BULLETTRAX before the expert would be permitted to testify at trial.” Id. at 471, The state filed an interlocutory appeal, arguing that the positive identification did not depend on Ultra’s products. The Appellate Division affirmed, holding that the hearing should proceed.

I do not know where the case stands, but its facts provide the basis for a thought experiment. At about the same time as the Ghigliotto court affirmed the order for a hearing, the American Academy of Forensic Sciences Standards Board (ASB) published a package of standards on toolmark comparisons. Created in 2015, ASB describes itself as “an ANSI [American National Standards Institute]-accredited Standards Developing Organization with the purpose of providing accessible, high quality science-based consensus forensic standards.” Academy Standards Board, Who We Are, 2022. Two of its standards concern three-dimensional (3D) data and inferences in toolmark comparisons, while the third is specific to software for comparing 2D or 3D data.

We can put the third to the side, for it is limited to software that "seeks to assess both the level of geometric similarity (similarity of toolmarks) and the degree of certainty that the observed similarity results from a common origin." ANSI/ASB Standard 062, Standard for Topography Comparison Software for Toolmark Analysis § 3.1 (2021). The data collection and visualization software here does neither, and the scoring feature of Matchpoint was not used.

ANSI/ASB Standard 061, Firearms and Toolmarks 3D Measurement Systems and Measurement Quality Control (2021), is more apposite although it is only intended “to ensure the instrument’s accuracy, to conduct instrument calibration, and to estimate measurement uncertainty for each axis (X, Y, and Z).” It promises “procedures for validation of 3D system hardware” but not software. It “does not apply to legacy 2D type systems,” leaving one to wonder whether there are any standards for validating them.

Even for "3D system hardware," the procedure for “developmental validity” (§ 4.1) is nonexistent. There are no criteria in this standard for recognizing when a measurement system is valid and no steps that a researcher must follow to study validity. Instead, the section on “Developmental Validation (Mandatory)” states that an “organization with appropriate knowledge and/or [sic] expertise” shall complete “a developmental validation”; that this validation “typically” (but not necessarily) consists of library research (“identifying and citing previously published scientific literature”); and that “ample”—but entirely uncited— literature exists “to establish the underlying imaging technology” for seven enumerated technologies. In full, the three sentences on “developmental validation” are

As per ANSI/ASB Standard 063, Implementation of 3D Technologies in Forensic Firearm and Toolmark Comparison Laboratories, a developmental validation shall be completed by at least one organization with appropriate knowledge and/or expertise. The developmental validation of imaging hardware typically consists of identifying and citing previously published scientific literature establishing the underlying imaging technology. The methods defined above of coherence scanning interferometry, confocal microscopy, confocal chromatic microscopy, focus variation microscopy, phase-shifting interferometric microscopy, photometric stereo, and structured light projection all have ample published scientific literature which can be cited to establish an underlying imaging technology.

Perhaps the section is merely there to point the reader to the different standard, ASB 063, on implementation of 3D technologies. \1/ But that standard seems to conceive of “developmental validation” as a process that occurs in a forensic laboratory or other organization by a predefined process with a “technical reviewer” to sign off on the resulting document that becomes the object of further review through “[p]eer-reviewed publication (or other means of dissemination to the scientific community, such as a peer-reviewed presentation at a scientific meeting).” § 4.1.3.4. The data and the statistics needed to assess measurement validity are left to the readers' imaginations (or statistical acumen). \2/

ASB 061 devotes more attention to what it calls “deployment validation” on the part of every laboratory that chooses to use a 3D measuring instrument. This part of the standard describes some procedures for checking whether X, Y, and Z “scales” that should reveal whether measurements of the coordinates of points on the surface of the material are close to what they should be. For example, § 4.2.5.4.1 specifies that

Using calibrated geometric standards (e.g., sine wave, pitch, step heights), measurements shall be conducted to check the X and Y lateral scales as well as the vertical Z scale. Ten measurements shall be performed consecutively ... . The measurement uncertainty of the repeatability measurements shall overlap with the certified value and uncertainty of the geometric standard used.

The phrasing is confusing (to me, at least). I assume that a “geometric standard” is the equivalent of a ruler of known length (a “certified value” of, say, 1 ± 0.01 microns). But what does the edict that “[t]he measurement uncertainty of the repeatability measurements shall overlap with the certified value and uncertainty of the geometric standard used” mean operationally?

The best answer I can think of is that the standard contemplates comparing two intervals. One is the scale value (along, say, the X-axis). Imagine that the “geometric standard” that is taken to be the truth is certified as having a length of 1± 0.01 microns. Let’s call this the “certified standard interval.”

Now the laboratory makes ten measurements for its “deployment validation” to produce what we can call a “sample interval” from the ten measurements. The ASB standard does not contain any directions on how this is to be done. One approach would be to compute a confidence interval on the assumption that the sample measurements are normally distributed. Suppose the observed sample mean for them is 0.80, and the standard error computed from the ten sample measurements is s = 0.10 microns. The confidence interval is then 0.80 ± k(0.10), where is k is some constant. If the confidence interval includes any part of the certified interval, this part of the deployment-validation requirement is met.

What values of k would be suitable for the instrument to be regarded as “deploymentally valid”? The standard is devoid of any insight into this critical value and its relationship to confidence. It does not explain what the interval-overlap requirement is supposed to accomplish, but if the confidence interval is part of it, it is an ad hoc form of hypothesis testing with an unstated significance level.

Is the question of whether the hypothesis that there is no difference between the standard reference value of 1 and the true sample mean can be rejected at some preset significance level all that important here? Should not the question be how much the disparities between the sample of ten measured values and the geometric-standard value would affect the efficacy of the measurements? An observed sample mean that is 20% too low does not lead to the rejection of the hypothesis that the instrument’s measurements are, in the long run, exactly correct, but with only ten measurements in the sample, that may tell us more about the lack of the statistical power of the test than about the ability of the instrumentation to measure what it seeks to measure with suitable accuracy for the applications to which it is put.

In sum, the standard’s section on “Developmental Validation (Mandatory)” mandates nothing that is not trivially obvious—the court already knows that it should look for support for the 3D scanning and image-manipulation methods in the scientific literature, and the standard does not reveal what the substance of this validation should be. “Deployment Validation (Mandatory)” is supposed to ensure that the laboratory is properly prepared to use a previously validated system for casework. It is of little use in a hearing on the general acceptance of the scanning system and the theories behind it. (One could argue that scientists would accept a system that a laboratory has been rigorously pretested and shown to be perform accurately, even with no other validation, but it is not clear that the standard describes an appropriate, rigorous pretesting procedure.)

Moreover, the standard explicitly excludes software from its reach, making it inapplicable to the Matchpoint image-manipulation tools the helped the examiner in Ghigliotti zero in on the regions that altered his opinion. The companion standard on software does not fill this gap, for it deals only with software that produces similarity scores or random-match probabilities. Finally, ASB 063's substantive requirements for "deployment validation" prior to laboratory implementation might well prohibit an examiner from going to the developer of hardware and software not yet adopted by his or her employer for help with locating features for further visual analysis, as occurred in Ghigliotti. But that is not responsive to the legal question of whether the developer's system is generally accepted as valid in the scientific community.

NOTES

ANSI/ASB 063 is even more devoid of references. The entire bibliography consists of a webpage entitled “control chart.” There, attorneys, courts, or experts seeking to use the standard will discover that a “control chart is a graph used to study how a process changes over time.” That is great for quality control of instrumentation, but it is irrelevant to validation.
Under § 4.1.2.4, "The plan for developmental validation study shall include the following:
"a) the limitations of the procedure;
"b) the conditions under which reliable results can be obtained;
"c) critical aspects of the procedure that shall be controlled and monitored;
"d) the ability of the resulting procedure to meet the needs of the given application."

Last updated: 12 June 2022

The New York Court of Appeals Returns to Probabilistic Genotyping Software (Part III—Six Empirical Studies)

2022-05-24T15:23:00.003-04:00

New York’s Court of Appeals returned to the contentious issue of “probabilistic genotyping software” (PGS) in People v. Wakefield, 2022 N.Y. Slip Op. 02771, 2022 WL 1217463 (N.Y. Apr. 26, 2022). As previously discussed, in People v. Williams, 147 N.E.3d 1131 (N.Y. 2020), a slim majority of the court had reasoned that the output of a computer program should not have been admitted without a full evidentiary hearing on the program's general acceptance within the scientific community.

In Wakefield, the Court of Appeals faced a different question for a more complex computer program. This time, the question was whether, after holding such a hearing, the trial court erred in finding that the more sophisticated program was generally accepted as a scientifically valid and reliable means of estimating “likelihood ratios” for DNA mixtures like the ones recovered in the case. The program, known as TrueAllele, is marketed by Cybergenetics, “a Pittsburgh-based bioinformation company [whose] computers translate DNA data into useful information.”

As discussed separately, the Wakefield court held that, in the circumstances of the case, the output of TrueAllele was admissible to associate the defendant with a murder. It emphasized “multiple validation studies ... demonstrat[ing] TrueAllele's reliability, by deriving reproducible and accurate results from the interpretation of known DNA samples.” 2022 WL 1217463 at *7. But the court did not describe the level of accuracy attained in any of the validation studies. That is surely something lawyers would want to know about, so I decided to read the “peer-reviewed publications in scientific journals” (id.) to which the court must have been referring.

The state introduced 31 exhibits at the evidentiary hearing in 2015. Nine were journal publications of some kind. Six of those described data collected to establish (or indirectly suggesting) that TrueAllele was accurate. Only three of them relied on “known DNA samples” as opposed to samples from casework. The synopses that follow do not describe all the parts of them, let alone all the findings from them. I merely pick out the parts that I found most interesting and most pertinent to the question of accuracy or error (two sides of the same coin).

The 2009 Cybergenetics Known-samples Study

The first study is M.W. Perlin & A. Sinelnikov, An Information Gap in DNA Evidence Interpretation. 4 PLoS ONE e8327. This experiment used 40 laboratory-constructed two-contributor mixture samples (from two pairs of unrelated individuals) with varying mixture proportions and total DNA amounts (0.125 ng to 1 ng) to show that TrueAllele was much better at classifying a sample as containing a contributor’s DNA than was the cumulative probability of inclusion method (CPI) that employed peak-height thresholds for binary determinations of the presence of alleles. TrueAllele’s likelihood ratios (LRs) supported the hypothesis of inclusion in nearly every instance (LR>1).

However, the data could not reveal whether the level of positive support (log-LR) was accurate. Does a computed LR of 1,000,000 “really” indicate evidence that is five orders of magnitude more probative than a computed LR of 10? The “empirical evidence” from the study cannot answer this question. The best we can do is to verify that the computed LR increases as the quantity of DNA does. The uncertainty inherent in the PCR process is smaller for larger starting quantities, and this should be reflected in the magnitude of the LR.

The 2011 Cybergenetics–New York State Police Casework Study

The second study also used two-contributor mixtures, but these came from casework in which the alleles, as ascertained by conventional methods, did not exclude the defendant as a possible contributor. In Mark W. Perlin et al., Validating TrueAllele DNA Mixture Interpretation, 56 J. Forensic Sci. 1430 (2011), researchers from Cybergenetics and the New York State Police laboratory selected “16 two-person mixture samples” that met certain criteria “from 40 adjudicated cases and one proficiency test conducted in” the New York laboratory. TrueAllele generated larger LRs than those from the manual analyses. That TrueAllele did not produce LRs < 1 (indicative of exclusions) for any defendant included by conventional analysis is evidence of a low false-exclusion probability. The computed LRs are greater than 1 when they should be. But this empirical evidence does not directly address the question of whether the magnitude of the LRs themselves are as close or as far from 1 as they should be if they are to be understood as a Bayes' factor.

The 2013 Cybergenetics–New York State Police Casework Study

The third study is more extensive. In Mark W. Perlin et al., New York State TrueAllele ® Casework Validation Study, 58 J. Forensic Sci. 1458 (2013), Cybergenetics worked with the New York laboratory to reanalyze DNA mixtures with up to three contributors from 39 adjudicated cases and two proficiency tests. “Whenever there was a human result, the computer’s genotype was concordant,” and TrueAllele “produced a match statistic on 81 mixture items ... , while human review reported a statistic on [only] 25 of these items.”

This time Cybergenetics also tried to answer the question of how often TrueAllele produces false “matches” (LR>1) when it compares a known noncontributor’s sample to a mixed sample. It accomplished this by simulating false pairs of samples for TrueAllele to process. As the authors explained,

We compared each of the 87 matched mixture evidence genotypes with the (<87) reference genotypes from the other 40 cases. Each of these 7298 comparisons should generate a mismatch between the unrelated genotypes from different cases and hence a negative log(LR) value. A genotype inference method having good specificity should exhibit mismatch information values [log-LRs] that are negative in the same way that true matches are positive.

Id. at 1461. Thus, they derived two empirical distributions for likelihood ratios—one for the nonexcluded defendants in the cases (who we would expect to be actual sources)—and one for the unrelated individuals (who we would expect to be non-sources). The empirical distributions were well separated, and the log(LR) was always less than zero for the presumed non-sources.

So TrueAllele seems to work well as a classifier (for distinguishing true-source pairs from false-source pairs) in these small-scale studies. But again, the question of whether the magnitudes of its LRs are highly accurate remains. With astronomically large LRs, it is hard to know the answer. Cf. David H. Kaye, Theona M. Vyvial & Dennis L. Young, Validating the Probability of Paternity, 31 Transfusion 823 (1991). \1/

The 2013 UCF–Cybergenetics Known-samples Study

The fourth study is J. Ballantyne, E.K. Hanson & M.W. Perlin, DNA Mixture Genotyping by Probabilistic Computer Interpretation of Binomially-sampled Laser Captured Cell Populations: Combining Quantitative Data for Greater Identification Information, 53 Sci. & Justice 103–114 (2013). It is not a validation study, but researchers from the University of Central Florida and Cybergenetics made two different two-person mixtures with equal quantities of DNA from each person. In such 50:50 mixtures, peak heights are expected to be similar, making it harder to fit the pattern of alleles into the pairs (single-locus genotypes) from each contributor than if there had been a major and minor contributor. So the team created ten small (20 cell) subsamples of each of the two mixed DNA samples by selecting cells at random. They analyzed these subsamples separately. They used TrueAllele to estimate the relative contributions (“mixture weights”) in the 20-cell samples, and found that when TrueAllele combined data from multiple subsamples, it assigned a 99% probability to the two contributor genotypes. The point of the study was to demonstrate the possibility of subdividing even small balanced samples to take advantage of peak height differences arising from imbalances in the even smaller subsamples.

The 2013 Cybergenetics–Virginia Department of Forensic Services Casework Study

The fifth study is more on point. In Mark W. Perlin et al., TrueAllele Casework on Virginia DNA Mixture Evidence: Computer and Manual Interpretation in 72 Reported Criminal Cases, 9 PLOS ONE e92837 (2014), researchers from Cybergenetics and the Virginia Department of Forensic Services compared TrueAllele with manual analysis on 111 selected casework samples. The set of criminal case mixtures paired with a nonexcluded defendant’s profile should produce large LRs. For ten pairs, TrueAllele failed to return “a reproducible positive match statistic.” Among the 101 remaining, presumably same-source pairs, the smallest LR was 18. Since the LR must be less than 1 to be deemed indicative of a noncontributor, in no instance did TrueAllele generate a falsely exonerating result.

But what about falsely incriminating LRs? This time, the researchers did not reassign the defendant’s profiles to other cases to produce false pairs. Rather, they generated 10,000 random STR genotypes (from population databases of alleles in Virginia) to simulate the STR profiles of non-sources of the mixtures from the criminal cases. They paired each of these non-source profiles with 101 genotypes that emerged from the unknown mixtures and calculated LR values. There were fewer than 1 in 20,000 LRs suggesting an association (LR > 1) among these mixture/non-source pairs; less than 1in 1,000,000 for LR > 1,000; and no false positives at all for LR > 6,054. In other words, TrueAllele produced an empirical distribution for false pairs that consisted almost entirely of LRs < 1 and that never had very large LRs. Again, it seems to be an excellent classifier.

The 2015 Cybergenetics–Kern Regional Crime Laboratory Known-samples Study

Finally, in M.W. Perlin et al., TrueAllele Genotype Identification on DNA Mixtures Containing up to Five Unknown Contributors, 60 J. Forensic Sci. 857 (2015), researchers from Cybergenetics and the Kern Regional Crime Laboratory in California obtained DNA samples from five known individuals. They constructed ten two-person mixtures by randomly selecting two of the five contributors and mixing their DNA in proportions picked at random. The researchers constructed ten 3-, 4-, and 5-person mixtures in the same manner. From each of these 4 × 10 mixtures, they created a 1 nanogram and a 200 picogram sample for STR analysis. TrueAlelle computed an LR for each of the genotypes that went into each analyzed sample (the alternative hypothesis being a random genotype).

Defining an exclusion as a LR < 1, TrueAllele rarely excluded true contributors to the 1 ng 2- or 3-contributor mixtures (no exclusions in 20 comparisons and 1 in 30, respectively), but with 4 and 5 contributors involved, the false-exclusion rates were 9/40 and 9/50, respectively. The false exclusions came from the more extreme mixtures. As long as at least 10% of the nanogram mixtures came from the lesser contributor, there were no false exclusions. The false-exclusion rates for the 200 pg samples were larger: 2/20, 4/30, 13/40, and 19/50. For these low-template mixtures, a greater proportion of the lesser contributor’s DNA (25%) had to be present to avoid false exclusions.

To assess false inclusions, 10,000 genotypes were randomly generated from each of three ethnic population allele databases. These noncontributor profiles were compared with the eight mixtures. For ethnic group and DNA mixture sample, the LRs fell well below LR=1, meaning that there were few false inclusions. For the high DNA levels (1 ng), the proportion of comparisons with misleading LRs (LR > 1 for the simulated noncontributors) were 0/600,000, 25/900,000, 186/1,200,000, and 1,301/1,500,000 for the 2-, 3-, 4-, and 5-person mixtures, respectively. The worst case (the most misleadingly high LR) occurred for the five-person mixture, where one LR was 1,592. For the low-template DNA mixtures, the corresponding false-inclusion proportions were 2/600,000, 53/900,000, 177/1,200,000, and 145/1,500,000. The worst outcome was an LR of 101 for a four-person mixture.

Apparently using “reliable” in its legal or nonstatistical sense (as in Daubert and Federal Rule of Evidence 702), the researchers concluded that “[t]his in-depth experimental study and statistical analysis establish the reliability of TrueAllele for the interpretation of DNA mixture evidence over a broad range of forensic casework conditions.” \2/ My sense of the studies as of the time of the hearing in Wakefield is that they show that within certain ranges (with regard to the quantity of DNA, the number of contributors, and the fractions from the multiple contributors), TrueAlelle’s likelihood ratios discriminate quite well between samples paired with true contributors and the same samples paired with unrelated noncontributors. \3/ Moreover, the program’s output behaves qualitatively as it should, generally producing smaller likelihood ratios for electrophoretic data that are more complex or more deviled by stochastic effects on peak heights and locations.

NOTES

In this early study, we compared the empirical LR distribution for parentage using presumably true and false mother-child-father trios derived from a set of civil paternity cases to the “paternity index,” a likelihood ratio computed with software applying simple genetic principles to the inheritance of HLA types. We found that the theoretical PI diverged from the empirical LR for PI > 80 or so.

Cf. David W. Bauer, Nasir Butt, Jennifer M. Hornyak & Mark W. Perlin, Validating TrueAllele Interpretation of DNA Mixtures Containing up to Ten Unknown Contributors, 65 J. Forensic Sci. 380, 380 (2020), doi: 10.1111/1556-4029.14204 (abstract concluding that “[t]he study found that TrueAllele is a reliable method for analyzing DNA mixtures containing up to ten unknown contributors

One might argue that the number of mixed samples collectively studied is too small. PCAST indicated that “there is relatively little published evidence” because “[i]n human molecular genetics, an experimental validation of an important diagnostic method would typically involve hundreds of distinct samples.” President's Council of Advisors on Sci. & Tech., Exec. Office of the President, Forensic Science in Criminal Courts: Ensuring Scientific Validity of Feature-Comparison Methods 81 (2016) 81 (notes omitted), https://obamawhitehouse.archives.gov/sites/default/files/microsites/ostp/PCAST/pcast_forensic_science_report_final.pdf [https://perma.cc/R76Y-7VU]. The number of distinct samples (mixtures from different contributors) combining all the studies listed here seems closer to 100.