Wednesday, February 3, 2016

Broken Glass, Mangled Statistics

The motto of ASTM International is “Helping Our World Work Better.” This internationally recognized standards development organization contributes to the world of forensic science by promulgating standards of various kinds for performing and interpreting chemical and other tests.

By mid-August 2015, five ASTM Standards were up for public comment to the Organization of Scientific Area Committees. OSAC “is part of an initiative by NIST and the Department of Justice to strengthen forensic science in the United States.” [1] Operating as “[a] collaborative body of more than 500 forensic science practitioners and other experts,” [1] OSAC is reviewing and developing documents for possible inclusion on a Registry of Approved Standards and a Registry of Approved Guidelines. 1/ NIST promises that “[a] standard or guideline that is posted on either Registry demonstrates that the methods it contains have been assessed to be valid by forensic practitioners, academic researchers, measurement scientists, and statisticians ... .” [2]

Last month, OSAC approved its first Registry entry (notwithstanding some puzzling language), ASTM E2329-14, a Standard Practice for Identification of Seized Drugs. Another standard on the list for OSAC’s quasi-governmental seal of approval is ASTM E2926-13, a Standard Test Method for Forensic Comparison of Glass Using Micro X-ray Fluorescence (μ-XRF) Spectrometry (available for a fee). It will be interesting to see whether this standard survives the scrutiny of measurement scientists and statisticians, for it raises a plethora of statistical issues.

What It Is All About

Suppose that someone stole some bottles of beer and money from a bar, breaking a window to gain entry. A suspect’s clothing is found to contain four small glass fragments. Various tests are available to help determine whether the four fragments (the “questioned” specimens) came from the broken window (the “known”). The hypothesis that they did can be denoted H1, and the “null hypothesis” that they did not can be designated H0.

Micro X-ray Fluorescence (μ-XRF) Spectrometry involves bombarding a specimen with X-rays. The material then emits other X-rays at frequencies that are characteristic of the elements that compose it. In the words of the ASTM Standard, “[t]he characteristic X-rays emitted by the specimen are detected using an energy dispersive X-ray detector and displayed as a spectrum of energy versus intensity. Spectral and elemental ratio comparisons of the glass specimens are conducted for source discrimination or association.” Such “source discrimination” would be a conclusion that H0 is true; “association” would be a conclusion that H1 is true. The former finding would mean that the suspect's glass fragments did not come from the crime scene; the latter would mean that they came either from the broken window at the bar or from some other piece of glass with a similar elemental composition.

Unspecified "Sampling Techniques"
for Assessing Variability Within the Pane of Window Glass

One statistical issue arises from the fact that the known glass is not perfectly homogeneous. Even if measurements of the ratios of the concentrations of different elements in a specimen are perfectly precise (the error of measurement is zero), a fragment from one location could have a different ratio than a fragment from another place in the known specimen. This natural variability must be accounted for in deciding between the two hypotheses. The Standard wisely cautions that “[a]ppropriate sampling techniques should be used to account for natural heterogeneity of the material, varying surface geometries, and potential critical depth effects.” But it gives no guidance at all as to what sampling techniques can accomplish this and how measurements that indicate spatial variation should be treated.

The Statistics of "Peak Identification"

The section of ASTM E2926 on “Calculation and Interpretation of Results” advises analysts to “[c]ompare the spectra using peak identification, spectral comparisons, and peak intensity ratio comparisons.” First, “peak identification” means comparing “detected elements of the questioned and known glass spectra.” The Standard indicates that when “[r]eproducible differences” in the elements detected in the specimens are found, the analysis can cease and the null hypothesis H0 can be presented as the outcome of the test. No further analysis is required. The criterion for when an element “may be” detected is that “the area of a characteristic energy of an element has a signal-to-noise ratio of three or more.” Where did this statistical criterion come from? What are the sensitivity and specificity of a test for the presence of an element based on this criterion?
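The detection rule itself is easy enough to state in code; justifying it is the hard part. In this short rendering (mine, not the Standard's), how the noise under the peak is to be estimated is left as an explicit assumption, because the Standard does not say:

```python
def element_detected(net_peak_area: float, noise_estimate: float) -> bool:
    """Apply the stated criterion: treat an element as detected when the
    characteristic peak's signal-to-noise ratio is three or more.
    How noise_estimate is obtained is an assumption left to the analyst."""
    return net_peak_area / noise_estimate >= 3.0
```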

The Statistics of Spectral Comparisons

Second, “spectral comparisons should be conducted,” but apparently, only “[w]hen peak identification does not discriminate between the specimens.” This procedure amounts to eyeballing (or otherwise comparing?) “the spectral shapes and relative peak heights of the questioned and known glass specimen spectra.” But what is known about the performance of criminalists who undertake this pattern-matching task? Have their sensitivity and specificity been determined in controlled experiments, or are judgments accepted on the basis of self-described but incompletely validated “knowledge, skill, ability, experience, education, or training ... used in conjunction with professional judgment,” to use a stock phrase found in many an ASTM Standard?

The Statistics of Peak Intensity Ratios

Third, only “[w]hen evaluation of spectral shapes and relative peak heights do not discriminate between the specimens” does the Standard recommend that “peak intensity ratios should be calculated.” These “peak intensity ratio comparisons” for elements such as “Ca/Mg, Ca/Ti, Ca/Fe, Sr/Zr, Fe/Zr, and Ca/K” “may be used” “[w]hen the area of a characteristic energy peak of an element has a signal-to-noise ratio of ten or more.” To choose between “association” and “discrimination of the samples based on elemental ratios,” the Standard recommends, “when practical,” analyzing “a minimum of three replicates on each questioned specimen examined and nine replicates on known glass sources.” Inasmuch as the Standard emphasizes that “μ-XRF is a nondestructive elemental analysis technique” and “fragments usually do not require sample preparation,” it is not clear just when the analyst should be content with fewer than three replicate measurements—or why three and nine measurements provide a sufficient sampling to assess measurement variability in two sets of specimens, respectively.

Nevertheless, let’s assume that we have three measurements on each of the four questioned specimens and nine on the known specimen. What should be done with these two sets of numbers? The Standard first proposes a “range overlap” test. I’ll quote it in full:
For each elemental ratio, compare the range of the questioned specimen replicates to the range for the known specimen replicates. Because standard deviations are not calculated, this statistical measure does not directly address the confidence level of an association. If the ranges of one or more elements in the questioned and known specimens do not overlap, it may be concluded that the specimens are not from the same source.
Two problems are glaringly apparent. First, statisticians appreciate that the range is not a robust statistic. It is heavily influenced by any outliers. Second, if the statistical properties of the "ratio ranges" are unknown, how can one know what to conclude—and what to tell a judge, jury, or investigator about the strength of the conclusion? Would a careful criminalist who finds that the ranges overlap have to quote or paraphrase the introduction to the Standard, and report that "the specimens are indistinguishable in all of these observed and measured properties," so that "the possibility that they originated from the same source of glass cannot be eliminated"? Would the criminalist have to add that there is no scientific basis for stating the statistical significance of this inability to tell them apart? Or could an expert rely on the Standard to say that by not eliminating the same-source possibility, the tests "conducted for source discrimination or association" came out in favor of association?
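How often the range-overlap rule would falsely exclude same-source fragments is at least calculable once a model is written down. Here is a toy simulation, mine rather than the Standard's, which supposes that all replicate measurements of a single elemental ratio are independent draws from one normal distribution (that is, the specimens really do share a source):

```python
# Toy simulation: chance that the ranges of 3 questioned and 9 known replicates
# fail to overlap even though every measurement comes from the same population.
import numpy as np

rng = np.random.default_rng(0)
trials = 200_000
false_exclusions = 0
for _ in range(trials):
    questioned = rng.normal(0.0, 1.0, size=3)   # three replicates on a questioned fragment
    known = rng.normal(0.0, 1.0, size=9)        # nine replicates on the known specimen
    # Ranges fail to overlap only when one set lies entirely above or below the other.
    if questioned.min() > known.max() or questioned.max() < known.min():
        false_exclusions += 1

print(false_exclusions / trials)                # about 0.009 per ratio
```

For a single ratio, the answer does not even depend on the normality assumption: with twelve exchangeable measurements, the probability that the three questioned values all rank above, or all rank below, the nine known values is 2/220, a bit under 1%. But that figure grows with six ratios and several fragments, and it says nothing about the other operating characteristic, the chance that the ranges overlap when the sources really are different.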

The Standard offers a cryptic alternative to the simplistic range method (without favoring one over the other and without mentioning any other statistical procedures):
±3s—For each elemental ratio, compare the average ratio for the questioned specimen to the average ratio for the known specimens ±3s. This range corresponds to 99.7 % of a normally distributed population. If, for one or more elements, the average ratio in the questioned specimen does not fall within the average ratio for the known specimens ±3s, it may be concluded that the samples are not from the same source.
The problems with this poorly written formulation of a frequentist hypothesis test are legion:

1. What "population" is "normally distributed"? Apparently, it is the population of replicate measurements of the elemental ratios on the known glass, from which s is computed. What supports the assumption of normality?

2. What is "s"? The standard deviation of what variable? It appears to be the sample standard deviation of the nine measurements on the known specimen.

3. The Standard seems to contemplate an interval, centered on the sample mean of the nine known-specimen measurements, that should capture a same-source questioned specimen's average ratio with 99.7% probability. If the measurement error is normally distributed, that interval is approximately the known specimen's sample mean ±4.3s, not ±3s. The margin of error must exceed ±3s because the population standard deviation σ is unknown; when σ is replaced by the estimate s from only nine measurements, the standardized comparison follows a t-distribution with eight degrees of freedom rather than a normal distribution. The quoted 99.7% is the probability that a normal random variable falls within ±3σ of its mean when σ is known. Using ±3 with the estimator s rather than the true value σ drops the confidence coefficient below 99%. One would have to use a multiplier greater than 4, not 3, to achieve 99.7% confidence. (A numerical sketch illustrating this point and the next two follows the numbered points.)

4. In any event, an interval built solely from the measurements on the known specimen is misguided. Why ignore the variance in the measured ratios in the questioned specimens? That is, the recommendation tells the analyst to ask whether, for each ratio in each questioned specimen, the miscalibrated interval covers “the average ratio in the questioned specimen.” But this “average ratio” is itself only an estimate, not the true ratio. The usual procedure (assuming normality) would be a two-sample t-test of the difference between the mean ratio for the questioned specimen and the mean ratio for the known specimen.

5. Even with the correct test statistic and distribution, the many separate tests (one for each ratio Ca/Mg, Ca/Ti, Fe/Zr, etc.) cloud the interpretation of the significance of the difference in a pair of sample means. Moreover, with multiple questioned specimens, the probability of finding a significant difference in at least one ratio for at least one questioned fragment is greater than the significance probability in a single comparison. The risk of a false exclusion for, say, ten independent comparisons could be nearly ten times the nominal value of 0.003 (1 − 0.997^10 ≈ 0.03).

6. Why ±3 as opposed to, say, ±4? I mention ±4 not because it is clearly better, but because it is the standard for making associations using a different test method (ASTM E2330). What explains the same standards development organization promulgating facially inconsistent statistical standards?

7. Why strive for a 0.003 false-rejection probability as opposed to, say, 0.01, 0.03, or anything else? This type of question can be asked about any sharp cutoff. Why is a difference of 2.99σ dismissed as not useful when 3σ is definitive? Within the classical hypothesis-testing framework, an acceptable answer would be that the line has to be drawn somewhere, and the 0.003 significance level is needed to protect against the risk of a false rejection of the null hypothesis in situations in which a false rejection would be very troublesome. Some statistics textbooks even motivate the choice of the less demanding but more conventional significance level of 0.05 by analogizing to a trial in which a false conviction is much more serious than a false acquittal.

Here, however, that logic cuts in the opposite direction. The null hypothesis H0 that should not be falsely rejected is that the two sets of measurements come from fragments that do not have a common source. But 0.003 is computed for the hypothesis H1 that the fragments all come from the same, known source. The significance test in ASTM E2926-13 addresses (in its own way) the difference in means using only the variability in sampling from the known specimen. Using a very demanding standard for rejecting H1 in favor of the suspect's claim H0 privileges the prosecution's claim that the fragments come from a common source. 2/ And it does so without mentioning the power of the test: What is the probability of reporting that fragments are indistinguishable — that there is an association — when the fragments do come from different sources? (The sketch following the numbered points includes a toy calculation of this probability under simple assumptions.) Twenty years ago, when a National Academy of Sciences panel examined and approved the FBI's categorical rule of "match windows" for DNA testing, it discussed both operating characteristics of the procedure—the ability to declare a match for DNA samples from the same source (sensitivity) and the ability to declare a nonmatch for DNA samples from different sources (specificity). [3] By looking only to sensitivity, ASTM E2926-13 takes a huge step backwards.

8. Whatever significance level is desired, to be fair and balanced in its interpretation of the data, a laboratory that undertakes hypothesis tests should report the probability of differences in the test statistic as large as or larger than those observed under each of the two hypotheses: (1) when the sets of measurements come from the same broken window (H1); and (2) when the sets of measurements come from different sources of glass in the area in which the suspect lives and travels (H0). The ASTM Standard completely ignores H0. Data on the distribution of the elemental composition of glass in the geographic area would be required to address it, and the Standard should at least gesture toward how such data should be used. If such data are missing, the best the analyst can do is to report candidly that the questioned fragment might have come from the known glass or from any other glass with a similar set of elemental concentrations and, for completeness, to add that it is not known how often other glass like this is encountered.

9. Would a likelihood ratio be a better way to express the probative value of the data? Certainly, there is an argument to that effect in the legal and forensic science literature. [4-8] Quantifying and aggregating the spectral data that the ASTM Standard now divides into three lexically ordered procedures, and combining them with other tests on glass, would be a challenge, but it merits thought. Should not the Standard explicitly acknowledge that reporting on the strength of the evidence, rather than making categorical judgments, is a respectable approach? (A toy illustration of the likelihood-ratio idea follows these points.)
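To make points 3 through 5, and the power question raised after point 7, more concrete, here is a minimal numerical sketch in Python. It is mine, not the Standard's, and the replicate measurements in it are invented for illustration. Part (a) treats the comparison statistic, standardized by the s computed from the nine known-specimen replicates, as following a t-distribution with eight degrees of freedom (a simplification that ignores the additional variance contributed by the questioned-specimen average); part (b) shows a conventional Welch two-sample t-test for a single elemental ratio; part (c) shows how the chance of at least one false exclusion grows with the number of comparisons; and part (d) is a toy simulation of how often the ±3s rule fails to exclude fragments whose true mean ratios differ by two within-source standard deviations.

```python
import numpy as np
from scipy import stats

# (a) The +/-3 multiplier versus what a t-distribution with 8 degrees of freedom requires.
df = 9 - 1                             # nine replicates on the known specimen
print(stats.norm.ppf(1 - 0.003 / 2))   # ~2.97: +/-3 is right only when sigma is known exactly
print(stats.t.ppf(1 - 0.003 / 2, df))  # ~4.2: multiplier needed once s replaces sigma
print(2 * stats.t.cdf(3, df) - 1)      # ~0.98: actual coverage of +/-3s in this simplified model

# (b) A conventional two-sample comparison of one elemental ratio (hypothetical numbers;
# Welch's t-test does not assume equal variances in the two sets of replicates).
questioned = np.array([1.31, 1.29, 1.33])
known = np.array([1.30, 1.32, 1.31, 1.29, 1.33, 1.30, 1.31, 1.32, 1.30])
t_stat, p_value = stats.ttest_ind(questioned, known, equal_var=False)
print(t_stat, p_value)

# (c) Familywise error: with k independent comparisons, each run at alpha = 0.003,
# the chance of at least one false exclusion is 1 - (1 - 0.003)**k.
for k in (1, 6, 10, 24):               # e.g., 6 ratios x 4 questioned fragments = 24 tests
    print(k, 1 - (1 - 0.003) ** k)

# (d) The neglected operating characteristic: how often the +/-3s rule fails to exclude
# fragments from a source whose true mean ratio differs by two within-source SDs.
rng = np.random.default_rng(1)
trials = 50_000
misses = 0
for _ in range(trials):
    k_reps = rng.normal(0.0, 1.0, size=9)   # known specimen, true mean 0, SD 1
    q_reps = rng.normal(2.0, 1.0, size=3)   # different source, true mean shifted by 2 SDs
    if abs(q_reps.mean() - k_reps.mean()) <= 3 * k_reps.std(ddof=1):
        misses += 1
print(misses / trials)                      # roughly 0.9 in this toy model
```

None of this rescues the Standard; it only shows that the relevant operating characteristics can be computed or simulated once a statistical model is stated, which is precisely what the Standard declines to do.

As for point 9, the likelihood-ratio idea can be sketched just as briefly. Everything below is hypothetical, including the "survey" of other glass; a defensible version would require real data on the distribution of elemental compositions in glass that could plausibly turn up on a suspect's clothing, which is exactly the kind of information discussed under point 8.

```python
import numpy as np
from scipy import stats

y = 1.32                 # questioned-fragment measurement of one ratio (hypothetical)
mu_known = 1.31          # mean of the known-window replicates (hypothetical)
sigma_within = 0.015     # within-pane and measurement SD, assumed known (hypothetical)

# Hypothetical survey of the same ratio in other glass objects in the relevant area.
background = np.array([1.10, 1.25, 1.31, 1.40, 1.18, 1.35, 1.22, 1.28, 1.45, 1.30])

numerator = stats.norm.pdf(y, mu_known, sigma_within)   # density of y if H1 (same source) is true
denominator = stats.gaussian_kde(background)(y)[0]      # density of y if H0 (other glass) is true
print(numerator / denominator)                          # likelihood ratio; >1 favors H1, <1 favors H0
```

A real evaluative framework of this kind would have to handle all the ratios jointly, the correlations among them, and the uncertainty in the within-source standard deviation, but even this caricature reports the strength of the evidence rather than a categorical "association."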

* * *

In sum, even within the framework of frequentist hypothesis testing, ASTM E2926 is plagued with problems — from the wrong test statistic and procedure for the specified level of “confidence,” to the reversal of the null and alternative hypotheses, to the failure to consider the power of the test. Can such a Standard be considered “valid by forensic practitioners, academic researchers, measurement scientists, and statisticians”?

Notes
  1. The difference between the two is not pellucid, since OSAC-approved standards can be a list of “shoulds” and guidelines can include “shalls.”
  2. The best defense I can think of for it is a quasi-Bayesian argument that by the time H1 gets to this hypothesis test, it has survived the qualitative "peak identification" and "spectral comparison" tests. Given this prior knowledge, it should require unusually surprising evidence from the peak intensity ratios to reject H1 in favor of the defense claim H0.
References
  1. OSAC Registry of Approved Standards and OSAC Registry of Approved Guidelines http://www.nist.gov/forensics/osac/osac-registries.cfm, last visited Feb. 2, 2016
  2. NIST, Organization of Scientific Area Committees, http://www.nist.gov/forensics/osac/index.cfm, last visited Feb. 2, 2016
  3. National Research Council Committee on Forensic DNA Science: An Update, The Evaluation of Forensic DNA Evidence (1996)
  4. Colin Aitken & Franco Taroni, Statistics and the Evaluation of Evidence for Forensic Science (2d ed. 2004)
  5. James M. Curran et al., Forensic Interpretation of Glass Evidence (2000)
  6. ENFSI Guideline for Evaluative Reporting in Forensic Science (2015)
  7. David H. Kaye et al., The New Wigmore: Expert Evidence (2d ed. 2011)
  8. Royal Statistical Soc'y Working Group on Statistics and the Law, Fundamentals of Probability and Statistical Evidence in Criminal Proceedings: Guidance for Judges, Lawyers, Forensic Scientists and Expert Witnesses (2010)
Postscript: See Broken Glass: What Do the Data Show?, Forensic Sci., Stat. & L., Feb. 13, 2016.

Disclosure and disclaimer: Although I am a member of the Legal Resource Committee of OSAC, the views expressed here are mine alone. They are not those of any organization. They are not necessarily shared by anyone inside (or outside) of NIST, OSAC, any SAC, any OSAC Task Force, or any OSAC Resource Committee.
