Friday, February 26, 2016

Is "Reasonable Scientific Certainty" Unreasonable?

Next month, the National Commission on Forensic Science is expected to vote on a proposal to make three recommendations about the testimonial use of phrases such as "to a reasonable degree of scientific certainty" and "to a reasonable degree of [discipline] certainty":
Recommendation #1: The Attorney General should direct all attorneys appearing on behalf of the Department of Justice (a) to forego use of these phrases when presenting forensic discipline testimony unless directly required by judicial authority as a condition of admissibility for the witness’ opinion or conclusion, and (b) to assert the legal position that such terminology is not required and is indeed misleading.

Recommendation #2: The Attorney General should direct all forensic science service providers and forensic science medical providers employed by Department of Justice not to use such language in reports or couch their testimony in such terms unless directed to do so by judicial authority.

Recommendation #3: The Attorney General should, in collaboration with NIST, direct the OSACs to develop appropriate language that may be used by experts when reporting or testifying about results or findings based on observations of evidence and data derived from evidence.
Most of the public comments have been supportive, 1/ but three days ago, one commissioner submitted a comment arguing that Recommendation #1 would require the Department of Justice to argue for overturning existing law that “seem[s] to require” these phrases in some forensic-science identification fields and that Recommendation #3 asks the Attorney General to take action that exceeds her authority.

[Added 3/1/16: At least, this is what I thought the comment was driving at, but, as explained in a follow-up posting, I was mistaken. Nevertheless, I think the analysis of this point is worth leaving up for general viewing, since it addresses a question that might be raised about the proposal.]

The second point is well taken—the Attorney General has no power to “direct” NIST or the OSAC to act, and NIST supports but does not direct the OSAC structure. However, the notion that any federal district court is legally compelled to condition the admission of expert testimony on an obscure phrase like “reasonable scientific certainty” seems farfetched. Below are excerpts from a comment that I filed with the Commission today explaining my thinking (with minor alterations):

Previous drafts of the final document before the Commission included references to the case law and literature supporting the subcommittee’s view that these recommendations are compatible with the existing law of evidence — that the law does not require experts to use these particular (and problematic) phrases, even though some judges and lawyers expect and even prefer to hear them. 2/ The comments that follow do not try to restate the previous legal analysis or to summarize the legal literature. They respond to the analysis in the Feb. 23 Comment. ...

Nothing in the Comment establishes that, when presented with the relevant legal authority and analysis, any court would find it difficult to accept the position the Department is being asked to take. The cases cited in the Comment do not contradict the proposed position. 3/ Not one of these cases considered whether the testifying expert must testify to “a reasonable degree of [discipline] certainty” as opposed to offering an opinion that the markings on a gun or fingerprint offer strong support for the source conclusion (or some similar less-than-absolutely-certain testimony). In most of them, the defense sought to exclude the source-attribution testimony entirely, on the Daubert ground that science and statistics did not support source attributions to one and only one possible source. The trial judges in these cases agreed that absolute, categorical claims of identity were too extreme. Those assertions are the kind of overclaiming that, Deputy Attorney General Yates announced two days ago, the Department of Justice is seeking to avoid.

As an alternative to scientifically indefensible or overstated claims, the trial judges in the cited cases set an upper bound on the certainty that the expert may express — “reasonable certainty” of one kind or another. Other federal trial judges have set other upper bounds. E.g., United States v. Glynn, 578 F.Supp.2d 567 (S.D.N.Y. 2008) (“the ballistics opinions ... may be stated in terms of ‘more likely than not,’ but nothing more”). No court has dictated one formulaic expression to the exclusion of all other ways to solve the problem of expert and prosecutorial exaggeration. 4/ In every one of the cases cutting back on overclaiming, less categorically certain phrasing from the government’s experts would not have violated the pretrial orders, and the government easily could have requested somewhat different phrasing as long as it did not amount to the kind of overclaiming that the orders were issued to protect against.

United States v. Cazares, 788 F.3d 956 (9th Cir. 2015), the only appellate case that the Comment perceives as demonstrating that “it is an overstatement to categorically claim that the phrase ‘to a reasonable degree of [discipline] certainty’ ‘is not required,’” clearly does not demand the use of this phrase instead of more transparent alternatives. No such alternatives were before the Ninth Circuit. The firearms examiner did not use the phrase “reasonable ballistic certainty,” but instead claimed total “scientific certainty.” Id. at 988. The Assistant U.S. Attorney did the same. Id. The panel excused this testimony and prosecutorial exaggeration as harmless error. 5/ It cited the cases noted in the Comment only to show that less-than-absolute testimony of firearms identification had been held to satisfy the requirements of Daubert. In an obvious dictum, the court of appeals referred to “reasonable ballistic certainty” as “the proper expert characterization of toolmark identification”—not to prescribe these words as the only permissible mode of expressing conclusions across the realm of forensic identification, but only to make the point that, given the expert’s acknowledgment of subjectivity in her analysis and her concession that “[t]here is no absolute certainty in science,” id. at 988, “[a]ny error in this case from the ‘scientific certainty’ characterization was harmless.” Id. at 990.

Moreover, the nature of the disagreement with the observation that “use of the [reasonable degree of scientific or discipline-specific certainty] phrase is not required by law and is primarily a relic of custom and practice” is difficult to fathom. The Comment agrees that “the use of this phrase is not required by the Federal Rules of Evidence.” This is every bit as true in the Ninth Circuit as in the other judicial circuits. What, then, is the basis of the claim that a court is “perhaps” required to insist that an expert use the phrase? The Constitution can override the rules of evidence, but no one can seriously claim that the Constitution conditions expert scientific testimony on a particular form of words — and a potentially misleading mixture of words at that.

In sum, there are courts that find comfort in phrases like "reasonable scientific certainty," and a few courts have fallen back on variants such as "reasonable ballistic certainty" as a response to arguments that identification methods cannot ensure that an association between an object or person and a trace is 100% certain. But it seems fair to say that "such terminology is not required" — at least not by any existing rule of law.

  1. E.g., Erin Murphy & Andrea Roth, Public Comment on NCFS Recommendation Re: Reasonable Degree of Scientific Certainty, Feb. 23, 2016,!documentDetail;D=DOJ-LA-2016-0001-0011

  2. These have been moved to a separate "views" document. The recommended position is supported not only by the opinions of appellate courts across the country, but also by the writings of federal judges, the drafters of the Federal Rules of Evidence, and the authors of the three leading legal treatises on scientific evidence.

  3. If they did, that would be a reason for the Department to advance a position to harmonize a conflict among the U.S. courts of appeals.

  4. For example, in one case cited in the Comment, United States v. Monteiro, 470 F. Supp. 2d 351 (D. Mass. 2006), the trial judge actually granted the defendant’s motion to exclude firearms testimony (unless the government supplemented the record with information establishing compliance with professional standards). The court then presented “reasonable degree of ballistic certainty” testimony as an acceptable way for the expert to testify, but the court’s concern was plainly that “the expert may not testify that there is a match to an exact statistical certainty.” Id. at 375.

    Similarly, in United States v. Ashburn, 88 F. Supp. 3d 239 (E.D.N.Y. 2015), the court’s concern was testimony “that he is ‘certain’ or ‘100%’ sure of his conclusions that two items match, that a match is to ‘the exclusion of all other firearms in the world,’ or that there is a ‘practical impossibility’ that any other gun could have fired the recovered materials.” Id. at 250. The trial judge settled on “reasonable ballistic certainty” as an acceptable alternative, but not necessarily an exclusive one.

    So too, in United States v. Taylor, 663 F.Supp.2d 1170 (D.N.M. 2009), the district judge wrote that:
    Mr. Nichols will be permitted to give to the jury his expert opinion that there is a match between the .30–.30 caliber rifle recovered from the abandoned house and the bullet believed to have killed Mr. Chunn. However, because of the limitations on the reliability of firearms identification evidence discussed above, Mr. Nichols will not be permitted to testify that his methodology allows him to reach this conclusion as a matter of scientific certainty. Mr. Nichols also will not be allowed to testify that he can conclude that there is a match to the exclusion, either practical or absolute, of all other guns. He may only testify that, in his opinion, the bullet came from the suspect rifle to within a reasonable degree of certainty in the firearms examination field.
    Id. at 1180.

  5. The court of appeals reasoned that “the ‘scientific certainty’ characterization was subject to cross examination which resulted in acknowledgment of subjectivity in the expert's work, [and] the district court properly instructed as to the role of expert testimony and there was substantial evidence otherwise linking the defendants to the . . . murders.” Id. at 990.

Thursday, February 25, 2016

"Stress Tests" by the Department of Justice and the FBI's "Approved Scientific Standards for Testimony and Reports"

Yesterday, Deputy Attorney General Sally Yates addressed the assembled members of the American Academy of Forensic Sciences at their annual scientific meeting in Las Vegas. 1/ Some excerpts and remarks follow:
In the near future, we expect the FBI to solicit bids for an independent review—or “root cause analysis”—to determine what went wrong and why in the hair analysis field. We hope that this review will help us identify potential blind spots in our own practices and develop effective corrective measures.

But it does not take a root cause analysis to draw some initial conclusions about errors arising in the FBI’s pre-2000 hair cases. It’s clear that, in at least some of the cases reviewed, lab examiners and attorneys either overstated the strength of the forensic evidence or failed to properly qualify the limitations of the forensic analysis. This doesn’t necessarily mean that there were problems with the underlying science—it means that the probative value of the scientific evidence wasn’t always properly communicated to juries. And as you all know, it’s crucial we put this type of evidence in its proper context, given that laypeople can misunderstand the science.
I'll resist the obvious puns about a root cause analysis for hair testimony, but I wonder whether it pays to fund a major study of the culture that produced the overstated hair testimony. Aren't the solutions to overstated testimony — ultracrepidarianism, as I have called it 2/ — fairly obvious? They are (1) better education and training of criminalists about the limits of their knowledge; (2) clear standards specifying what testimony is permissible; (3) better education and training of prosecutors and defenders about the statements that they as well as the criminalists can make; and (4) comprehensive and reasonably frequent review, not just of expert testimony, but also of the questions and opening and closing arguments of prosecutors, to ensure compliance with testimonial standards.
To address this problem, the FBI is close to finalizing new internal standards for testimony and reporting—which they’re calling “Approved Scientific Standards for Testimony and Reports,” or ASSTR. These documents, designed for almost all forensic disciplines currently practiced by the FBI, will clearly define what statements are supported by existing science. This will guide our lab examiners when they draft reports and take the witness stand, thereby reducing the risk of testimonial overstatement.
That is welcome news about the FBI. More broadly, OSAC needs to develop similar standards, and the Department of Justice should provide better training for its prosecutors. Experts are not the only sources of misstatements at trials. Both prosecutors and defense lawyers can misstate the content or implications of expert testimony. It happens all the time with DNA random-match probabilities, and in the infamous Santae Tribble case, it was the Assistant U.S. Attorney, and not the expert witness, who told the jury that "[t]here is one chance, perhaps for all we know, in 10 million that it could [be] someone else’s hair." 3/
While the FBI is preparing an ASSTR for each discipline, it’s fair to say that the risk of overstatement can vary depending on the discipline. The risk is arguably the lowest in certain types of disciplines, such as those involving chemical analysis. In drug testing, for example, the current technology makes it possible for experts to determine the chemical composition of a controlled substance with a high degree of certainty and with very little human interpretation.
"Arguably" is a key word here. "[D]isciplines ... involving chemical analysis" have the potential for suppressing uncertainty. Take a look at the ASTM E2937-13 Standard Guide for Using Infrared Spectroscopy in Forensic Paint Examinations. This document presupposes that a major task of the chemical analyst is "to determine whether any significant differences exist between the known and questioned samples." It defines a "significant difference" as "a difference between two samples that indicates that the two samples do not have a common origin." The forensic chemist is expected to declare whether "[s]pectra are dissimilar," "indistinguishable," or "inconclusive." The standard offers no guidance on how to explain the significance of spectra that are "indistinguishable." This is precisely the problem that hair analysts faced. Moreover, just as there is subjectivity in deciding whether hairs are indistinguishable, the forensic chemistry standard offers only a self-described "rule of thumb." This rule proposes "that the positions of corresponding peaks in two or more spectra be within ± 5 cm-1," but "[f]or sharp absorption peaks one should use tighter constraints. One should critically scrutinize the spectra being compared if corresponding peaks vary by more than 5 cm-1. Replicate collected spectra may be necessary to determine reproducibility of absorption position."

What is the nature of the "critical scrutiny" that permits an examiner to classify peaks that differ by more than 5 cm-1 as "similar"? When are replicates necessary? How many are necessary? What is the accuracy of examiners who follow this open-ended "rule of thumb"? Perusing such standards suggests that forensic chemistry may not be so radically different from the other disciplines in which the risk of ultracrepidarianism is seen as more acute.
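To make the open-endedness of the rule of thumb concrete, here is a sketch of what a mechanical application of the ±5 cm-1 criterion might look like. The peak positions are invented, and the nearest-peak pairing step is my own assumption; the ASTM guide does not say how "corresponding" peaks are to be identified:

```python
# Sketch of the ASTM E2937-13 "rule of thumb": corresponding peaks in two
# infrared spectra should agree to within +/- 5 cm^-1. Pairing each peak
# with its nearest counterpart is an assumption for illustration only.

def peaks_indistinguishable(known_peaks, questioned_peaks, tol=5.0):
    """Return True if every peak in each list lies within `tol` (cm^-1)
    of some peak in the other list."""
    def each_within(a, b):
        return all(min(abs(x - y) for y in b) <= tol for x in a)
    return each_within(known_peaks, questioned_peaks) and \
           each_within(questioned_peaks, known_peaks)

known = [1730.0, 1450.2, 1160.5]       # hypothetical peak positions, cm^-1
questioned = [1733.5, 1452.0, 1158.0]  # each within 5 cm^-1 of a known peak
print(peaks_indistinguishable(known, questioned))                # True
print(peaks_indistinguishable(known, [1740.0, 1452.0, 1158.0]))  # False: 1740 is 10 cm^-1 off
```

Even this toy version makes the judgment calls visible: the tolerance, the pairing rule, and the treatment of unmatched peaks all have to come from somewhere, and the standard leaves them to the examiner.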
But, as you all know, the degree of certainty may be more difficult to quantify in other forensic disciplines. For example, a relatively small number of disciplines call on forensic professionals to compare two items—such as shoe prints or tire treads—and make judgments about their similarities and differences. These so-called “pattern” or “impression” disciplines present unique challenges, especially when an examiner attempts to assess the likelihood that the two items came from the same source.
As the paint standard exemplifies, forensic professionals compare two items in many fields. The spectra used in forensic chemistry are patterns. DNA profiles are patterns. Even some of the software that is supposed to give objectively established probabilities of the components of a DNA mixture has parameters that analysts can adjust as they see fit. Perhaps the line between the quantified-degree-of-certainty fields and the difficult-to-quantify ones is not entirely congruent with a simple divide between pattern-and-impression evidence and other forensic fields.
In any business, whether it’s medicine or manufacturing, it is standard practice to regularly review your internal procedures to make sure you’re performing at the highest level possible. Our DOJ labs do this all the time, and we plan to do it here, too. The department intends to conduct a quality assurance review of other forensic science disciplines practiced at the FBI—to determine whether the same kind of “testimonial overstatement” we found during our review of microscopic hair evidence could have crept into other disciplines that rely heavily on human interpretation and where the degree of certainty can be difficult to quantify. We’re thinking of it as a forensics “stress test.”
This sounds great, but how is "quality assurance review" a "stress test"? In cardiology, a stress test "determines the amount of stress that your heart can manage before developing either an abnormal rhythm or evidence of ischemia (not enough blood flow to the heart muscle)." 4/ In the banking system, stress testing examines whether banks have "sufficient capital to continue operations throughout times of economic and financial stress and that they have robust, forward-looking capital-planning processes that account for their unique risks." 5/ Checking whether "you're performing at the highest level possible" in providing testimony in the ordinary course of affairs is not a stress test for the FBI or anybody else (although I suppose it could prove to be stressful).

Of course, whether one should call the planned review a "stress test" is purely a matter of terminology. A much more important question is which disciplines will be reviewed. It would be unfortunate if the only recipients of review are the criminalists who examine patterns and impressions in the form of toolmarks, shoeprints, handwriting, and so on. As we have just seen, it is not so easy to specify all the "disciplines that rely heavily on human interpretation and where the degree of certainty can be difficult to quantify."
This is an important moment in forensic science. The rise of new technologies presents both tremendous opportunities and potential challenges. At the same time, we must grapple with some of the most basic questions that lie at the intersection of science and the law: how do we make complex scientific principles understandable for judges, attorneys and jurors? How do we accurately communicate to laypeople the many things that forensic science can teach us—ensuring that we neither overstate the strength of our evidence nor understate the value of this information?

There are no easy answers ...
Amen to that!

  1. Office of the Deputy Attorney General, Justice News: Deputy Attorney General Sally Q. Yates Delivers Remarks During the 68th Annual Scientific Meeting Hosted by the American Academy of Forensic Science, Feb. 24, 2016,
  2. David H. Kaye, Ultracrepidarianism in Forensic Science: The Hair Evidence Debacle, 72 Wash. & Lee L. Rev. Online 227 (2015).
  3. David H. Kaye, The FBI's Worst Hair Days, Forensic Science, Statistics and the Law, July 31, 2014.
  4. WebMD, Heart Disease and Stress Tests,
  5. Board of Governors of the Federal Reserve System, Stress Tests and Capital Planning,

Wednesday, February 24, 2016

Is OSAC Painting Itself Out of the Picture? Time to Comment on ASTM E1610-14

The Organization of Scientific Area Committees (OSAC) started on the wrong foot with the first standard it chose to place on its blue-ribbon Registry of Approved Standards. Two more standards from the Chemistry and Instrumental Analysis Committee are currently up for public comment. Here, I discuss the first of the two, known in the field as ASTM E1610-14.

ASTM E1610-14 is a "Standard Guide for Forensic Paint Analysis and Comparison." It seeks "to assist individuals who conduct forensic paint analyses in their evaluation, selection, and application of tests that may be of value to their investigations." To a large extent, it accomplishes this goal.

At the same time, however, this Standard Guide fails to provide much useful guidance on a matter of critical concern to the legal system — reporting results in a way that fairly conveys their probative value and the inevitable uncertainty of any scientifically validated test. In fact, the Guide fails to reflect or acknowledge modern thinking about the interpretation of forensic-science test results.

The premise of this document seems to be that the main task of the criminalist is to make "physical matches between known and questioned samples" (sec. 7.1) based on "significance assessments" (sec. 8.3) in which a "significant difference" is "a difference between two samples that indicates that the two samples do not have a common origin." (Sec. 3.2.10). Although this definition of “significant” offers little or no guidance, these matches are expected to be "conclusive." (Sec. 8.6.1). This categorical approach does not represent the modern view of the evaluative statements that criminalists should make. Most literature on forensic inference now maintains that analysts should present statements about the weight of the evidence rather than categorical conclusions. 1/

If analysts studying traces of paint are to make "conclusive" statements, however, and if the NIST-supported OSAC organization is to follow through on the National Academy of Sciences' recommendation to incorporate estimates of uncertainty into forensic science, the conditional error probabilities for these conclusions must be provided.

The concept of validity in measurement is closely connected to estimating uncertainty. Section 1.2 announces that “[t]he need for validated methods and quality assurance guidelines is also addressed,” but this sentence is the only place in the document where the words "validity," "validated," or "validation" appear. Moreover, classifications based on a process with unknown sensitivity and specificity cannot be considered validated. Indeed, this ASTM Standard Guide indirectly endorses a minority notion (from the legal perspective) of what it takes to establish validity. It approvingly refers (in sec. 4.1) to quality assurance guidelines from a 1999 SWGMAT document (actually published in 2000). These guidelines include the observation that "[t]echniques and procedures ... currently accepted by the scientific community should be considered valid." But it is widely appreciated that the general-acceptance criterion for scientific validity is not necessarily sufficient under Federal Rule of Evidence 702 and the rules of many states.

ASTM E1610-14 also contemplates testimony about a "physical match" (sec. 8.6) based on postulated "individualizing characteristics" rather than more appropriate probabilistic assessments. Section 8.6.1 advises that "[t]he most conclusive type of examination that can be performed on paint samples is physical matching. ... The corresponding features must possess individualizing characteristics."

The requirement of "individualizing characteristics" is too strict. Although intuition indicates that a combination of enough characteristics can constitute very strong evidence of a common source, it is not clear that any single characteristic is "individualizing." And, even if the existence of strictly individualizing characteristics has been demonstrated in the scientific literature, a criminalist should be able to use other characteristics in forming an expert opinion. After all, a combination of other characteristics that are known to be less than perfectly discriminating when considered one at a time can be highly discriminating when evaluated in toto.
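A toy calculation illustrates why a combination of common characteristics can be highly discriminating. The frequencies below are invented, and the multiplication assumes the characteristics occur independently, which real paint features need not:

```python
# Hypothetical frequencies: the fraction of candidate sources sharing each
# of five non-individualizing characteristics. Assuming independence (a
# simplification), the joint frequency is the product of the marginals.
freqs = [0.5, 0.3, 0.2, 0.1, 0.05]

joint = 1.0
for f in freqs:
    joint *= f

# joint is ~0.00015: roughly 1 candidate source in 6,700 would share all
# five features, even though the commonest feature is shared by half.
print(joint)
```

The independence assumption is doing real work here; correlated features discriminate less than the product suggests, which is one reason the strength of a "common source" interpretation needs empirical study rather than intuition.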

Thus, the approach to "physical matches" in section 8.6.1 seems inconsistent with the logic of section 5.2 of the same ASTM document. This section explains that
Searching for differences between questioned and known samples is the basic thrust of forensic paint analysis and comparison. However, differences in appearance, layer sequence, size, shape, thickness, or some other physical or chemical feature can exist even in samples that are known to be from the same source. A forensic paint examiner’s goal is to assess the significance of any observed differences. The absence of significant differences at the conclusion of an analysis suggests that the paint samples could have a common origin. The strength of such an interpretation is a function of the type or number of corresponding features, or both.
This language is important for legal purposes because it affects the presentation of negative findings (“absence of significant differences”) that would incriminate a suspect or defendant. Surprisingly, there is nothing in the standard to guide or inform the analyst about how to report these results. Is the analyst expected to report only that “I could find no significant differences”? That seems insufficient. That “I am suggesting that the paint samples could have a common origin”? That even though “differences can exist even in samples that are known to be from the same source,” in this case there is “an absence of significant differences”? That may connote more than is appropriate. That, given “the type or number of corresponding features,” the interpretation of “common source” is very strong? What studies establish that a criminalist accurately can quantify (even verbally) the strength of the common-source interpretation? What is the uncertainty associated with such judgments? To be sure, the Standard Guide lists 69 references, but it is not clear which, if any, of them answer these questions.

Perhaps I am being too critical. Readers are invited to examine the ASTM Standard Guide for themselves and tell OSAC what they think.

  1. E.g., John S. Buckleton et al., An Extended Likelihood Ratio Framework for Interpreting Evidence, 46 Sci. & Just. 69, 70 (2006) ("The idea of assessing the weight of evidence using a relative measure (known as the likelihood ratio) ... dominates the literature as the method of choice for interpreting forensic evidence across evidence types."); Angel Carracedo, Forensic Genetics: History, in Forensic Biology 19, 22 (Max M. Houck ed. 2015) (“the single most important advance in forensic genetic thinking is the realization that the scientist should address the probability of the evidence.”).

Monday, February 15, 2016

Approximating Individualization: The ASTM's Standard Terminology for Digital Evidence

Forensic scientists portraying existing standards for evidence testing and evaluation have been known to praise the “rigorous standard development process of ASTM,” 1/ an internationally recognized standards development organization. Having looked over the organization's "Standard Terminology for Digital and Multimedia Evidence Examination" 2/ (not to mention some of its other standards), 3/ I wonder if the results are as rigorous (or as comprehensible) as they should be.

Consider the definition of "individualization":
Individualization, n—theoretically, a determination that two samples derive from the same source; practically, a determination that two samples derive from sources that cannot be distinguished within the sensitivity of the comparison process. (Compare identification.) DISCUSSION—Theoretical individualization is the asymptotic upper bound of the sensitivity of a source identification process.
The definition presents individualization as a theoretical construct that cannot be fully attained. But lots of things can be sorted down to the individual level—telephone, passport, and social security numbers are obvious examples. Surnames and given names are not individualizing at the national level, but they are among the students in almost every class that I have taught. As these examples suggest, individualization can only be defined for the elements of a set. 4/ If the set is enumerated and all its elements available for inspection, then it is possible to "individualize"—not just theoretically or "asymptotically," but practically and precisely.

Thus, the domain of the ASTM definition must be cases in which no exhaustive list of the elements is available. Even with this modification, however, the definition of "individualization" as "a determination that two samples derive from sources that cannot be distinguished within the sensitivity of the comparison process" is flawed for two reasons.

First, "specificity," not "sensitivity," must be what is intended. Sensitivity is the probability that the process will declare that an item comes from a source when it really does come from the source. The least upper bound on sensitivity (or any other probability) is 1. A process that always declares a positive association will have a sensitivity of 1 because it always will declare a positive association when there is one. The degree of source discrimination within a set of potential sources is the specificity. Only when the specificity equals 1 is exact individualization possible.

Second, the ASTM definition of "individualization" fails to state a crucial presupposition. If the specificity of the test verges on 1, then (by definition) the probability that a claim of individualization will be correct when the target item is the individual so identified also verges on 1. This is the approximate individualization that the ASTM is trying to define. But the definition as written does not require that the specificity be close to 1. An analyst following the words of the definition could claim to have "individualized" even when neither ideal nor approximate individualization exists. As long as the specificity is not 1, "a determination that two samples derive from sources that cannot be distinguished" only shows that the item is an element of a class of indistinguishable items.
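A small numerical sketch (my own, with invented figures) shows why a high sensitivity is worthless without near-perfect specificity. Among n candidate sources with one true source, count the declared matches that actually involve the true source:

```python
# Expected fraction of declared "individualizations" that are correct, given
# a pool of n candidate sources containing one true source. The process
# declares a match with probability `sens` for the true source and with
# probability (1 - spec) for each of the n - 1 non-sources.

def prob_declared_match_is_true_source(sens, spec, n):
    true_hits = sens                    # expected matches to the true source
    false_hits = (n - 1) * (1 - spec)   # expected coincidental matches
    return true_hits / (true_hits + false_hits)

# A process that "always declares a positive association" has sens = 1,
# yet with spec = 0 it individualizes essentially nothing:
print(prob_declared_match_is_true_source(1.0, 0.0, 1000))   # 0.001
# Only as specificity approaches 1 does a declared match approach certainty:
print(prob_declared_match_is_true_source(0.9, 0.9999, 1000))  # ~0.9
```

The asymmetry is the point of the text above: sensitivity can be driven to 1 trivially, but specificity is what separates the target from the rest of the candidate pool.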

  1. Jay Siegel, Forensic Chemistry: Fundamentals and Applications 230 (2015).
  2. ASTM E2916-13, Standard Terminology for Digital and Multimedia Evidence Examination (2013), available for $44 at
  3. E.g., Broken Glass, Mangled Statistics, Forensic Science, Statistics & the Law, Feb. 3, 2016.
  4. David H. Kaye, Identification, Individuality, and Uniqueness: What's the Difference?, 8 Law, Probability & Risk 85 (2009); David H. Kaye, Probability, Individualization, and Uniqueness in Forensic Science Evidence: Listening to the Academies, 75 Brooklyn L. Rev. 1163 (2010).

Saturday, February 13, 2016

Broken Glass: What Do the Data Show?

In Broken Glass, Mangled Statistics, I noted "a plethora of statistical issues" in ASTM E2926-13, a Standard Test Method for Forensic Comparison of Glass Using Micro X-ray Fluorescence (μ-XRF) Spectrometry, which is working its way through the process for inclusion on the OSAC Registry of Approved Standards. The questions I raised about the Standard's procedures and criteria for declaring matches between glass specimens were based on elementary statistical theory but not data. Even if the ASTM's hypothesis testing procedures are idiosyncratic or conceptually flawed, they could have desirable properties.

There are some collections of glass that have been used to test the performance of the matching rules for some of the variables used in forensic testing. An FBI publication from 2009 offers the following summary:
Databases of refractive indices and/or chemical compositions of glass received in casework have been established by a number of crime laboratories (Koons et al. 1991). Although these glass databases are undeniably valuable, it should be noted that they may not be representative of the actual population of glass, and the distribution of glass properties may not be normal. Although these are not direct indicators of the rarity in any specific case, they can be used to show that the probability of a coincidental match is rare.

Koons and Buscaglia (1999) used the data from a chemical composition database and refractive index database to calculate the probability of a coincidental match. They estimated that ... the chance of finding a coincidental match in forensic glass casework using refractive index and chemical composition alone is 1 in 100,000 to 1 in 10 trillion, which strongly supports the supposition that glass fragments recovered from an item of evidence and a broken object with indistinguishable [refractive index] and chemical composition are unlikely to be from another source and can be used reliably to assist in reconstructing the events of a crime.

Range overlap on glass analytical data that include chemical composition data is considered a conservative standard. In one study, on a data set consisting of three replicate measurements each for 209 specimens, the range-overlap test discriminated all specimens, and all other statistical analysis-based tests performed worse (Koons and Buscaglia 2002).

Range-overlap tests, however, may achieve their high discrimination by indicating that two specimens from the same source are differentiable. Another study showed that when using a range-overlap test, the number of specimens differentiated that were actually from the same source may have been as high as seven percent (Bottrell et al. 2007).

The range-overlap approach, however, seems prudent given that other tests with higher thresholds for differentiation, such as t-tests with Welch modification (Curran et al. 2000) or Bayesian analysis (Walsh 1996), lower the number of specimens differentiated that were actually from the same source by worsening the ability to differentiate specimens that are genuinely different, a result that is unacceptable.
If I understand the argument, the author contends that high sensitivity is more important than high specificity. That makes sense for a screening test that will be followed by a more specific one, but in general, is it better to avoid falsely associating a defendant with the crime-scene glass or to avoid falsely failing to associate a defendant whose clothing actually carries glass from the scene? Any decision rule as to what is "indistinguishable" will generate a mix of false positives and false negatives. Should not the ASTM standards provide estimates from data (that might be representative of some relevant population) of these risks for each decision rule that the standards endorse or mandate?
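One way to generate such estimates is simulation under an explicit model. The sketch below (my own illustration, not anything in the ASTM standards) assumes that replicate measurements of a single elemental ratio are independent draws from a normal distribution, and it estimates how often a range-overlap rule "differentiates" two sets of replicates that actually come from the same source:

```python
import random

def ranges_overlap(a, b):
    """True if the ranges [min, max] of two replicate sets overlap."""
    return max(a) >= min(b) and max(b) >= min(a)

def false_differentiation_rate(n_reps=3, trials=100_000, seed=1):
    """Estimate how often same-source replicate sets fail to overlap,
    assuming i.i.d. normal measurements (a modeling assumption of mine)."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(n_reps)]  # questioned replicates
        k = [rng.gauss(0.0, 1.0) for _ in range(n_reps)]  # known replicates, same source
        if not ranges_overlap(q, k):
            misses += 1
    return misses / trials

# For 3 vs. 3 i.i.d. continuous replicates, the exact rate is 2/C(6,3) = 0.10:
# one same-source pair in ten is "differentiated" by range overlap alone.
print(false_differentiation_rate())
```

The simulated rate, 10% under this toy model, is of the same order of magnitude as the seven percent figure quoted above. The point is not the particular number but that the standards could report such risk estimates for every decision rule they endorse.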


Wednesday, February 3, 2016

Broken Glass, Mangled Statistics

The motto of ASTM International is “Helping Our World Work Better.” This internationally recognized standards development organization contributes to the world of forensic science by promulgating standards of various kinds for performing and interpreting chemical and other tests.

By mid-August 2015, five ASTM Standards were up for public comment to the Organization of Scientific Area Committees. OSAC “is part of an initiative by NIST and the Department of Justice to strengthen forensic science in the United States.” [1] Operating as “[a] collaborative body of more than 500 forensic science practitioners and other experts,” [1] OSAC is reviewing and developing documents for possible inclusion on a Registry of Approved Standards and a Registry of Approved Guidelines. 1/ NIST promises that “[a] standard or guideline that is posted on either Registry demonstrates that the methods it contains have been assessed to be valid by forensic practitioners, academic researchers, measurement scientists, and statisticians ... .” [2]

Last month, OSAC approved its first Registry entry (notwithstanding some puzzling language), ASTM E2329-14, a Standard Practice for Identification of Seized Drugs. Another standard on the list for OSAC’s quasi-governmental seal of approval is ASTM E2926-13, a Standard Test Method for Forensic Comparison of Glass Using Micro X-ray Fluorescence (μ-XRF) Spectrometry (available for a fee). It will be interesting to see whether this standard survives the scrutiny of measurement scientists and statisticians, for it raises a plethora of statistical issues.

What It is All About

Suppose that someone stole some bottles of beer and money from a bar, breaking a window to gain entry. A suspect’s clothing is found to contain four small glass fragments. Various tests are available to help determine whether the four fragments (the “questioned” specimens) came from the broken window (the “known”). The hypothesis that they did can be denoted H1, and the “null hypothesis” that they did not can be designated H0.

Micro X-ray Fluorescence (μ-XRF) Spectrometry involves bombarding a specimen with X-rays. The material then emits other X-rays at frequencies that are characteristic of the elements that compose it. In the words of the ASTM Standard, “[t]he characteristic X-rays emitted by the specimen are detected using an energy dispersive X-ray detector and displayed as a spectrum of energy versus intensity. Spectral and elemental ratio comparisons of the glass specimens are conducted for source discrimination or association.” Such “source discrimination” would be a conclusion that H0 is true; “association” would be a conclusion that H1 is true. The former finding would mean that the suspect's glass fragments did not come from the crime scene; the latter would mean that they came either from the broken window at the bar or from some other piece of glass with a similar elemental composition.

Unspecified "Sampling Techniques" for Assessing Variability Within the Pane of Window Glass

One statistical issue arises from the fact that the known glass is not perfectly homogeneous. Even if measurements of the ratios of the concentrations of different elements in a specimen are perfectly precise (the error of measurement is zero), a fragment from one location could have a different ratio than a fragment from another place in the known specimen. This natural variability must be accounted for in deciding between the two hypotheses. The Standard wisely cautions that “[a]ppropriate sampling techniques should be used to account for natural heterogeneity of the material, varying surface geometries, and potential critical depth effects.” But it gives no guidance at all as to what sampling techniques can accomplish this and how measurements that indicate spatial variation should be treated.

The Statistics of "Peak Identification"

The section of ASTM E2926-13 on “Calculation and Interpretation of Results” advises analysts to “[c]ompare the spectra using peak identification, spectral comparisons, and peak intensity ratio comparisons.” First, “peak identification” means comparing “detected elements of the questioned and known glass spectra.” The Standard indicates that when “[r]eproducible differences” in the elements detected in the specimens are found, the analysis can cease and the null hypothesis H0 can be presented as the outcome of the test. No further analysis is required. The criterion for when an element “may be” detected is that “the area of a characteristic energy of an element has a signal-to-noise ratio of three or more.” Where did this statistical criterion come from? What are the sensitivity and specificity of a test for the presence of an element based on this criterion?
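One partial answer can be derived under a simplifying assumption that the Standard itself does not state. If the net signal for an element that is absent fluctuates as zero-mean Gaussian noise, the chance that noise alone clears a signal-to-noise threshold of three is the upper tail of the standard normal distribution:

```python
import math

def upper_tail(z):
    """P(Z > z) for a standard normal variable, via the complementary
    error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Probability that pure Gaussian noise exceeds 3 standard deviations,
# i.e., a false "detection" of an absent element under this assumption.
p_false_detect = upper_tail(3.0)   # ~0.00135
print(p_false_detect)
```

That is a false-detection probability of roughly 1 in 740 per element per spectrum, under the Gaussian assumption. The other operating characteristic, the probability of missing an element that is truly present, depends on the element's actual concentration and cannot be read off the threshold at all.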

The Statistics of Spectral Comparisons

Second, “spectral comparisons should be conducted,” but apparently, only “[w]hen peak identification does not discriminate between the specimens.” This procedure amounts to eyeballing (or otherwise comparing?) “the spectral shapes and relative peak heights of the questioned and known glass specimen spectra.” But what is known about the performance of criminalists who undertake this pattern-matching task? Have their sensitivity and specificity been determined in controlled experiments, or are judgments accepted on the basis of self-described but incompletely validated “knowledge, skill, ability, experience, education, or training ... used in conjunction with professional judgment,” to use a stock phrase found in many an ASTM Standard?

The Statistics of Peak Intensity Ratios

Third, only “[w]hen evaluation of spectral shapes and relative peak heights do not discriminate between the specimens” does the Standard recommend that “peak intensity ratios should be calculated.” These “peak intensity ratio comparisons” for elements such as “Ca/Mg, Ca/Ti, Ca/Fe, Sr/Zr, Fe/Zr, and Ca/K” “may be used” “[w]hen the area of a characteristic energy peak of an element has a signal-to-noise ratio of ten or more.” To choose between “association” and “discrimination of the samples based on elemental ratios,” the Standard recommends, “when practical,” analyzing “a minimum of three replicates on each questioned specimen examined and nine replicates on known glass sources.” Inasmuch as the Standard emphasizes that “μ-XRF is a nondestructive elemental analysis technique” and “fragments usually do not require sample preparation,” it is not clear just when the analyst should be content with fewer than three replicate measurements—or why three and nine measurements provide a sufficient sampling to assess measurement variability in two sets of specimens, respectively.

Nevertheless, let’s assume that we have three measurements on each of the four questioned specimens and nine on the known specimen. What should be done with these two sets of numbers? The Standard first proposes a “range overlap” test. I’ll quote it in full:
For each elemental ratio, compare the range of the questioned specimen replicates to the range for the known specimen replicates. Because standard deviations are not calculated, this statistical measure does not directly address the confidence level of an association. If the ranges of one or more elements in the questioned and known specimens do not overlap, it may be concluded that the specimens are not from the same source.
Two problems are glaringly apparent. First, statisticians appreciate that the range is not a robust statistic. It is heavily influenced by any outliers. Second, if the statistical properties of the "ratio ranges" are unknown, how can one know what to conclude—and what to tell a judge, jury, or investigator about the strength of the conclusion? Would a careful criminalist who finds no range overlap have to quote or paraphrase the introduction to the Standard, reporting that "the specimens are indistinguishable in all of these observed and measured properties," so that "the possibility that they originated from the same source of glass cannot be eliminated"? Would the criminalist have to add that there is no scientific basis for attaching any statistical significance to this inability to tell the specimens apart? Or could an expert rely on the Standard to say that, by not eliminating the same-source possibility, the tests "conducted for source discrimination or association" came out in favor of association?
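For clarity, the quoted rule reduces to a few lines of code (the data and variable names below are mine, purely for illustration):

```python
# A literal reading of the quoted range-overlap rule (names are hypothetical).
def range_overlap_decision(questioned, known):
    """questioned, known: dicts mapping ratio name -> list of replicate values.
    Returns 'differentiated' if any ratio's ranges fail to overlap;
    otherwise 'not differentiated' (the rule never affirmatively associates)."""
    for ratio in questioned:
        q, k = questioned[ratio], known[ratio]
        if max(q) < min(k) or max(k) < min(q):
            return "differentiated"
    return "not differentiated"

# Hypothetical replicate measurements of two elemental ratios:
q = {"Ca/Mg": [4.10, 4.12, 4.11], "Sr/Zr": [1.50, 1.52, 1.49]}
k = {"Ca/Mg": [4.11, 4.13, 4.12], "Sr/Zr": [1.48, 1.51, 1.50]}
print(range_overlap_decision(q, k))  # ranges overlap for both ratios
```

Writing the rule out makes the asymmetry plain: it can affirmatively differentiate specimens, but a failure to differentiate says nothing quantitative in favor of association.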

The Standard offers a cryptic alternative to the simplistic range method (without favoring one over the other and without mentioning any other statistical procedures):
±3s—For each elemental ratio, compare the average ratio for the questioned specimen to the average ratio for the known specimens ±3s. This range corresponds to 99.7 % of a normally distributed population. If, for one or more elements, the average ratio in the questioned specimen does not fall within the average ratio for the known specimens ±3s, it may be concluded that the samples are not from the same source.
The problems with this poorly written formulation of a frequentist hypothesis test are legion:

1. What "population" is "normally distributed"? Apparently, it is the measurements of the elemental ratios in the questioned specimen. What supports the assumption of normality?

2. What is "s"? The standard deviation of what variable? It appears to be the sample standard deviation of the nine measurements on the known specimen.

3. The Standard seems to contemplate a 99.7% confidence interval (CI) for the mean μ of the ratios in the known specimen. If the measurement error is normally distributed about μ, then the CI for μ is approximately the known specimen's sample mean ±4.3s. This margin of error is larger than ±3s because the population standard deviation σ is unknown and the studentized sample mean therefore follows a t-distribution with eight degrees of freedom. The desired 99.7% is the coverage probability for a ±3σ CI. Using ±3 with the estimator s rather than the true value σ results in a confidence coefficient below 99%. One would have to use a multiplier greater than 4 rather than 3 to achieve 99.7% confidence.
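A quick simulation (a sketch under the normality assumption, with nine replicates) confirms these coverage figures for the studentized statistic:

```python
import math, random, statistics

def coverage(cutoff, n=9, trials=200_000, seed=7):
    """Monte Carlo estimate of P(|T| <= cutoff), where
    T = (xbar - mu) / (s / sqrt(n)) for n i.i.d. normal measurements.
    With n = 9, T has a t-distribution with 8 degrees of freedom."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]  # true mean mu = 0
        t = statistics.mean(xs) / (statistics.stdev(xs) / math.sqrt(n))
        if abs(t) <= cutoff:
            inside += 1
    return inside / trials

print(coverage(3.0))   # roughly 0.983, not the desired 0.997
print(coverage(4.3))   # close to 0.997
```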

4. The use of any confidence interval for the sample mean of the measurements in the known specimen is misguided. Why ignore the variance in the measured ratios in the questioned specimens? That is, the recommendation tells the analyst to ask whether, for each ratio in each questioned specimen, the miscomputed 99.7% CI covers “the average ratio in the questioned specimen.” But this “average ratio” is not the true ratio. The usual procedure (assuming normality) would be a two-sample t-test of the difference between the mean ratio for the questioned sample and the mean for the known specimen.
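For readers who want to see what point 4 contemplates, here is a minimal sketch of Welch's two-sample t statistic, with hypothetical replicate values; a real analysis would also need the appropriate reference distribution and a defensible significance level:

```python
import math, statistics

def welch_t(q, k):
    """Welch's two-sample t statistic and Welch-Satterthwaite degrees of
    freedom for the difference between the questioned-sample mean and the
    known-sample mean (normality assumed)."""
    mq, mk = statistics.mean(q), statistics.mean(k)
    vq, vk = statistics.variance(q), statistics.variance(k)
    se2 = vq / len(q) + vk / len(k)          # squared standard error of the difference
    t = (mq - mk) / math.sqrt(se2)
    df = se2 ** 2 / ((vq / len(q)) ** 2 / (len(q) - 1)
                     + (vk / len(k)) ** 2 / (len(k) - 1))
    return t, df

# Hypothetical replicate measurements of one elemental ratio:
questioned = [4.16, 4.18, 4.17]
known = [4.11, 4.13, 4.12, 4.10, 4.14, 4.12, 4.11, 4.13, 4.12]
t, df = welch_t(questioned, known)
print(round(t, 2), round(df, 1))
```

Unlike the ±3s recipe, this statistic accounts for the sampling variability in both the questioned and the known measurements.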

5. Even with the correct test statistic and distribution, the many separate tests (one for each ratio Ca/Mg, Ca/Ti, Fe/Zr, etc.) cloud the interpretation of the significance of the difference in a pair of sample means. Moreover, with multiple questioned specimens, the probability of finding a significant difference in at least one ratio for at least one questioned fragment is greater than the significance probability in a single comparison. The risk of a false exclusion for, say, ten independent comparisons could be nearly ten times the nominal value of 0.003.
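The arithmetic is simple enough to display. With m independent tests each run at significance level α, the familywise false-rejection probability is 1 - (1 - α)^m, which the Bonferroni inequality bounds above by mα:

```python
# Familywise error for m independent tests at level alpha.
alpha, m = 0.003, 10
familywise = 1 - (1 - alpha) ** m
print(round(familywise, 4))   # ~0.0296, nearly ten times the nominal 0.003
```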

6. Why ±3 as opposed to, say, ±4? I mention ±4 not because it is clearly better, but because it is the standard for making associations using a different test method (ASTM E2330). What explains the same standards development organization promulgating facially inconsistent statistical standards?

7. Why strive for a 0.003 false-rejection probability as opposed to, say, 0.01, 0.03, or anything else? This type of question can be asked about any sharp cutoff. Why is a difference of 2.99σ dismissed as not useful when 3σ is definitive? Within the classical hypothesis-testing framework, an acceptable answer would be that the line has to be drawn somewhere, and the 0.003 significance level is needed to protect against the risk of a false rejection of the null hypothesis in situations in which a false rejection would be very troublesome. Some statistics textbooks even motivate the choice of the less demanding but more conventional significance level of 0.05 by analogizing to a trial in which a false conviction is much more serious than a false acquittal.

Here, however, that logic cuts in the opposite direction. The null hypothesis H0 that should not be falsely rejected is that the two sets of measurements come from fragments that do not have a common source. But 0.003 is computed for the hypothesis H1 that the fragments all come from the same known source. The significance test in ASTM E2926-13 addresses (in its own way) the difference in the means when sampling from the known specimen. Using a very demanding standard for rejecting H1 in favor of the suspect’s claim H0 privileges the prosecution claim that the fragments come from the same source. 2/ And it does so without mentioning the power of the test: What is the probability of reporting that fragments are indistinguishable — that there is an association — when the fragments do come from different sources? Twenty years ago, when a National Academy of Sciences panel examined and approved the FBI's categorical rule of "match windows" for DNA testing, it discussed both operating characteristics of the procedure—the ability to declare a match for DNA samples from the same source (sensitivity) and the ability to declare a nonmatch for DNA samples from different sources (specificity). [3] By looking only to sensitivity, ASTM E2926-13 takes a huge step backwards.

8. Whatever significance level is desired, to be fair and balanced in its interpretation of the data, a laboratory that undertakes hypothesis tests should report the probability of differences in the test statistic as large or larger than those observed under the two hypotheses: (1) when the sets of measurements come from the same broken window (H1); and (2) when the sets of measurements come from different sources of glass in the area in which the suspect lives and travels (H0). The ASTM Standard completely ignores H0. Data on the distribution of the elemental composition of glass in the geographic area would be required to address it, and the Standard should at least gesture to how such data should be used. If such data are missing, the best the analyst can do is to report candidly that the questioned fragment might have come from the known glass or from any other glass with a similar set of elemental concentrations and, for completeness, to add that how often other glass like this is present is unknown.

9. Would a likelihood ratio be a better way to express the probative value of the data? Certainly, there is an argument to that effect in the legal and forensic science literature. [4-8] Quantifying and aggregating the spectral data that the ASTM Standard now divides into three lexically ordered procedures and combining them with other tests on glass would be a challenge, but it merits thought. Should not the Standard explicitly acknowledge that reporting on the strength of the evidence rather than making categorical judgments is a respectable approach?
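As a bare-bones illustration of the likelihood-ratio approach (all numbers are hypothetical, and a real analysis would need validated measurement-error and population distributions), one can compare the density of a measured ratio under the same-source hypothesis to its density under an assumed population distribution:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

x = 2.02            # measured ratio in the questioned fragment (hypothetical)
known_mean = 2.00   # mean ratio of the known specimen (hypothetical)
meas_sd = 0.05      # assumed measurement-error standard deviation
pop_mean, pop_sd = 2.10, 0.30   # assumed population distribution of the ratio

# LR = density under H1 (same source) / density under H0 (different source)
lr = normal_pdf(x, known_mean, meas_sd) / normal_pdf(x, pop_mean, pop_sd)
print(round(lr, 1))   # LR > 1 favors H1; the value depends heavily on the population model
```

The resulting ratio, about 6 with these made-up numbers, says the measurement is several times more probable if the fragment came from the known source than if it came from glass drawn at random from the assumed population. The output is only as good as the population model, which is precisely the data gap noted in point 8.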

* * *

In sum, even within the framework of frequentist hypothesis testing, ASTM E2926 is plagued with problems — from the wrong test statistic and procedure for the specified level of “confidence,” to the reversal of the null and alternative hypotheses, to the failure to consider the power of the test. Can such a Standard be considered “valid by forensic practitioners, academic researchers, measurement scientists, and statisticians”?

  1. The difference between the two is not pellucid, since OSAC-approved standards can be a list of “shoulds” and guidelines can include “shalls.”
  2. The best defense I can think of for it is a quasi-Bayesian argument that by the time H1 gets to this hypothesis test, it has survived the qualitative "peak identification" and "spectral comparison" tests. Given this prior knowledge, it should require unusually surprising evidence from the peak intensity ratios to reject H1 in favor of the defense claim H0.
  1. OSAC Registry of Approved Standards and OSAC Registry of Approved Guidelines, last visited Feb. 2, 2016
  2. NIST, Organization of Scientific Area Committees, last visited Feb. 2, 2016
  3. National Research Council Committee on Forensic DNA Science: An Update, The Evaluation of Forensic DNA Evidence (1996)
  4. Colin Aitken & Franco Taroni, Statistics and the Evaluation of Evidence for Forensic Science (2d ed. 2004)
  5. James M. Curran et al., Forensic Interpretation of Glass Evidence (2000)
  6. ENFSI Guideline for Evaluative Reporting in Forensic Science (2015)
  7. David H. Kaye et al., The New Wigmore: Expert Evidence (2d ed. 2011)
  8. Royal Statistical Soc'y Working Group on Statistics and the Law, Fundamentals of Probability and Statistical Evidence in Criminal Proceedings: Guidance for Judges, Lawyers, Forensic Scientists and Expert Witnesses (2010)
Postscript: See Broken Glass: What Do the Data Show?, Forensic Sci., Stat. & L., Feb. 13, 2016,

Disclosure and disclaimer: Although I am a member of the Legal Resource Committee of OSAC, the views expressed here are mine alone. They are not those of any organization. They are not necessarily shared by anyone inside (or outside) of NIST, OSAC, any SAC, any OSAC Task Force, or any OSAC Resource Committee.