Monday, October 24, 2016

PCAST’s Sampling Errors (Part I)

The President’s Council of Advisors on Science and Technology (PCAST) has reported to the President that important steps must be taken to improve forensic science. Among other valuable recommendations, its report on Forensic Science in Criminal Courts: Ensuring Scientific Validity of Feature-Comparison Methods insists that the uncertainty associated with forensic-science findings of a positive association between a criminal defendant and some form of incriminating trace, impression, or pattern evidence should be estimated with reasonable accuracy, and that this estimate must be presented in court. These broad propositions are ones to which everyone concerned with doing justice should subscribe.

Yet, the PCAST report also argues in favor of presenting, through a statistical device known as a one-sided confidence interval, only part of the picture concerning the statistical error in studies of the performance of forensic examiners. In expressing this more detailed position, the PCAST report misstates the meaning of confidence intervals and offers a dubious justification.

The report states that “[b]y convention, a confidence level of 95 percent is most widely used—meaning that there is a 5 percent chance the true value exceeds the bound.” (P. 153). This explanation reiterates a similar statement in a much earlier report of a committee of the National Research Council of the National Academies. In its 1992 report on DNA Technology in Forensic Science, the NRC committee discussed “the traditional 95% confidence limit, whose use implies that the true value has only a 5% chance of exceeding the upper bound.” (P. 76).

This loose language did not escape the attention of statisticians. Bruce Weir, for example, promptly responded that “[a] court may be excused for such nonstatistical language, but a report issued by the NRC ... must lose some credibility with statisticians.” B. S. Weir, Population Genetics in the Forensic DNA Debate, 89 Proc. Nat’l Acad. Sci. USA 11654, 11654 (1992). As most statistics textbooks recognize, “a confidence level of 95%” does not mean “that there is a 5 percent chance the true value” lies outside the particular interval. In the theory that motivates confidence intervals, the true value is an unknown constant, not a variable that has a probability associated with it. The 95% figure is the expected proportion of instances in which many intervals (most of them somewhat different in their central location and width) will cover the unknown, true value; the 5% figure is the expected proportion that will miss it. Only a Bayesian analysis can supply a probability that the true value falls outside a given interval.
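The repeated-sampling reading can be illustrated with a small simulation. The sketch below is only an illustration under assumed values: the true rate (0.30), the study size (100), the number of simulated studies (2,000), and the choice of the two-sided Wilson interval are all assumptions made here, not figures from the PCAST report.

```python
import random
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Two-sided Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = successes / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * ((p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) ** 0.5) / denom
    return centre - half, centre + half


def coverage(true_p, n, studies, seed=0):
    """Fraction of simulated studies whose interval contains the fixed true_p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(studies):
        # One simulated study: n independent trials with success chance true_p.
        successes = sum(rng.random() < true_p for _ in range(n))
        low, high = wilson_interval(successes, n)
        hits += low <= true_p <= high
    return hits / studies


if __name__ == "__main__":
    # true_p is a fixed constant here; only the intervals vary between studies.
    print(coverage(true_p=0.30, n=100, studies=2000))
```

Each simulated interval either contains 0.30 or it does not. The figure near 95% describes how the procedure performs over many repetitions, not the chance that any one reported interval is right.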

This criticism may sound rather technical, and it is, but considering the scientific horsepower on PCAST, one would have expected its description of statistical concepts to be unobjectionable. More disturbing is the report’s dismissal of the presentation of standard, two-sided confidence intervals as “obfuscation”:
Because one should be primarily concerned about overestimating SEN [sensitivity] or underestimating FPR [false positive rate], it is appropriate to use a one-sided confidence bound. By convention, a confidence level of 95 percent is most widely used—meaning that there is a 5 percent chance the true value exceeds the bound. Upper 95 percent one-sided confidence bounds should thus be used for assessing the error rates and the associated quantities that characterize forensic feature matching methods. (The use of lower values may rightly be viewed with suspicion as an attempt at obfuscation.)
P. 153. Without unearthing the technical details of one-sided and two-sided confidence intervals, and in full agreement with the notion that good science requires acknowledging the possibility that false-positive errors could occur more often than was seen in a single study, I have to say that this paragraph seems to contradict the ideal of a forensic scientist who does not take sides.

Certainly, the law (not science) treats false convictions as more serious than false acquittals. But does this asymmetry imply that expert witnesses should not discuss the fact that sampling error (which is the only thing that confidence intervals address) can work in both directions? The legal and social policy judgment that it is better to risk a false acquittal than a false conviction requires the state to prove its case decisively, by a body of evidence that leaves no reasonable doubt about the defendant’s guilt. At the same time, it presumes that evidence should be presented and assessed for what it is worth, neither more nor less. Consequently, the uncertainties in scientific evidence should be made clear, whichever way they cut. If possible, they should be expressed fairly, without favoring the prosecution's theory or the defense's.

As such, we should amend PCAST’s talk of “obfuscation.” It is fair to say that the exclusive use of lower values, or even point estimates, instead of both upper and lower values may rightly be viewed with suspicion as an attempt at obfuscation. It is equally fair to say that the exclusive use of upper values also may rightly be viewed with suspicion as an attempt at obfuscation. Finally, presentation of the full range of uncertainty can be viewed with approbation as an attempt at transparency. In sum, it is far from clear that the one-sided 95% confidence interval best achieves the objectives of either law or science.

Technical postscript of 12/8/16 (for people who want to use PCAST's estimates of sampling error)

A forensic scientist wrote me that he was unable to replicate the numbers PCAST provided for one-sided 95% confidence intervals. The report intimidatingly states that
For technical reasons, there is no single, universally agreed method for calculating these confidence intervals (a problem known as the “binomial proportion confidence interval”). However, the several widely used methods give very similar results, and should all be considered acceptable: the Clopper-Pearson/Exact Binomial method, the Wilson Score interval, the Agresti-Coull (adjusted Wald) interval, and the Jeffreys interval. 396/ Web-based calculators are available for all of these methods. 397/ For example, if a study finds zero false positives in 100 tries, the four methods mentioned give, respectively, the values 0.030, 0.026, 0.032, and 0.019 for the upper 95 percent confidence bound. From a scientific standpoint, any of these might appropriately be reported to a jury in the context “the false positive rate might be as high as.” (In this report, we used the Clopper-Pearson/Exact Binomial method.)
P. 153. Relying on the PCAST approach to testify to a range for error rates could be dangerous. The article that the PCAST report cites does not conclude that the four methods perform equally well. Neither does it recommend the Clopper-Pearson (CP) method. The article is Lawrence D. Brown, T. Tony Cai & Anirban DasGupta, Interval Estimation for a Binomial Proportion, 16 Stat. Sci. 101 (2001). The abstract "recommend[s] the Wilson interval or the equal-tailed Jeffreys prior interval for small n and the interval suggested in Agresti and Coull for larger n." The authors explain that "for small n (40 or less), we recommend that either the Wilson or the Jeffreys prior interval should be used. They are very similar, and either may be used depending on taste." Id. at 102. "For larger n (n > 40), the Wilson, the Jeffreys and the Agresti–Coull interval are all very similar, and so for such n, due to its simplest form, we come to the conclusion that the Agresti–Coull interval should be recommended." Id. at 103. Brown et al. were especially critical of the procedure that PCAST used. They described the CP interval as "rather inaccurate" and concluded that
The Clopper–Pearson interval is wastefully conservative and is not a good choice for practical use, unless strict adherence to the prescription C(p, n) ≥ 1 − α is demanded. Even then, better exact methods are available ... .
The statistical calculator (EpiTools) that PCAST recommended likewise warns that "the Clopper-Pearson Exact method is very conservative and tends to produce wider intervals than necessary."

How much of a difference does this really make? I have not done the necessary computations, and I will be surprised if they produce big swings in the estimated upper bounds. After I get the chance to grind out the numbers, I will supply them in a later posting.
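Readers who want to check PCAST's worked example (zero false positives in 100 tries) can do so with a short script. What follows is a sketch, not PCAST's or EpiTools' implementation: the function names are made up for this post, the incomplete beta function is evaluated with a textbook continued fraction so that only the Python standard library is needed, and the one-sided forms of the Wilson, Agresti-Coull, and Jeffreys bounds are assumed to be the ordinary formulas with the normal or beta quantile taken at 0.95 rather than 0.975. Under those assumptions the results should agree, after rounding, with the four values the report quotes.

```python
import math
from statistics import NormalDist


def betainc(a, b, x):
    """Regularized incomplete beta function I_x(a, b), by continued fraction."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    if x > (a + 1.0) / (a + b + 2.0):
        # Symmetry relation; the continued fraction converges faster there.
        return 1.0 - betainc(b, a, 1.0 - x)
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log1p(-x))
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, 1000):
        # Even and odd terms of the continued fraction (modified Lentz method).
        for num in (m * (b - m) * x / ((a + 2 * m - 1) * (a + 2 * m)),
                    -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1))):
            d = 1.0 + num * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + num / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < 1e-14:
            break
    return math.exp(log_front) * h / a


def beta_quantile(q, a, b):
    """Value p with I_p(a, b) = q, found by bisection on [0, 1]."""
    low, high = 0.0, 1.0
    for _ in range(100):
        mid = (low + high) / 2
        if betainc(a, b, mid) < q:
            low = mid
        else:
            high = mid
    return (low + high) / 2


def upper_bounds(x, n, confidence=0.95):
    """One-sided upper confidence bounds for x events in n trials."""
    z = NormalDist().inv_cdf(confidence)  # one-sided normal quantile
    p_hat = x / n
    # Clopper-Pearson ("exact"): a beta quantile; the bound is 1 when x == n.
    cp = 1.0 if x == n else beta_quantile(confidence, x + 1, n - x)
    # Wilson score bound.
    wilson = ((p_hat + z * z / (2 * n)
               + z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)))
              / (1 + z * z / n))
    # Agresti-Coull: Wald formula applied to adjusted counts.
    n_adj = n + z * z
    p_adj = (x + z * z / 2) / n_adj
    ac = min(1.0, p_adj + z * math.sqrt(p_adj * (1 - p_adj) / n_adj))
    # Jeffreys: quantile of the Beta(x + 1/2, n - x + 1/2) posterior.
    jeffreys = 1.0 if x == n else beta_quantile(confidence, x + 0.5, n - x + 0.5)
    return {"clopper_pearson": cp, "wilson": wilson,
            "agresti_coull": ac, "jeffreys": jeffreys}


if __name__ == "__main__":
    for name, bound in upper_bounds(0, 100).items():
        print(f"{name}: {bound:.3f}")
```

For zero events the Clopper-Pearson bound has the closed form 1 − 0.05^(1/n), which offers a quick check on the numerical routine.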

More on the PCAST Report

Sunday, October 23, 2016

PCAST on "Foundational Validity," Evidentiary Reliability, and the Admissibility of "Firearms Analysis"

Finding 6 of the President's Council of Advisors on Science and Technology (PCAST), on "firearms analysis," is blunt:
PCAST finds that firearms analysis currently falls short of the criteria for foundational validity, because there is only a single appropriately designed study to measure validity and estimate reliability. The scientific criteria for foundational validity require more than one such study, to demonstrate reproducibility.
(P. 112). In the next breath, the report adds, "Whether firearms analysis should be deemed admissible based on current evidence is a decision that belongs to the courts." Id.

The juxtaposition of these observations is puzzling, for the report also seems to equate "foundational validity" to the legal standard for admissibility in Federal Rule of Evidence 702(c). This part of Rule 702 requires expert testimony to be "the product of reliable principles and methods." Seeking "complete clarity about our intent," PCAST (p. 43) states that
we have adopted specific terms to refer to the scientific standards for two key types of scientific validity, which we mean to correspond, as scientific standards, to the legal standards in Rule 702 (c,d)):
(1) by “foundational validity,” we mean the scientific standard corresponding to the legal standard of evidence being based on “reliable principles and methods,” and
(2) by “validity as applied,” we mean the scientific standard corresponding to the legal standard of an expert having “reliably applied the principles and methods.”
In other words, the PCAST report apparently advances these claims:
  1. Evidence E ("firearms analysis") lacks quality Q ("foundational validity," which is a "key type of scientific validity").
  2. Q is a sine qua non for the admissibility of all expert evidence under Rule 702.
  3. Courts can admit E if they want to.
Can the third claim be reconciled with the first two? I can think of three possibilities:
  1. A court might not be bound by Rule 702(c) because the jurisdiction does not follow this rule. (The rule was amended to include the words "reliable principles and methods" to codify the Daubert v. Merrell Dow Pharmaceuticals, 509 U.S. 579 (1993), line of cases (particularly General Electric Co. v. Joiner, 522 U.S. 136 (1997)), but a substantial minority of states apply a different version of the rule.)
  2. A court can reject PCAST's claim that "foundational validity" is "the scientific standard corresponding to the legal standard" (at least in the manner that PCAST defines "foundational validity").
  3. A court can reject PCAST's claim that the scientific literature does not establish "foundational validity" as the term is defined and applied in the report to "firearms analysis."
By introducing definitions of the terms "validity" and "reliability" that depart from the established meanings in statistics and the social sciences (p. 47 n.107), the PCAST scientists may not have achieved "complete clarity." On the one hand, we are told that established scientific criteria for "validity" correspond to evidentiary "reliability" under Rule 702. On the other, we are assured that the absence of "validity" does not dictate a legal conclusion. For me, this nomenclature falls short of complete clarity.


Thursday, October 20, 2016

The First Opinion To Discuss the PCAST Report

In United States v. Chester, No. 13 CR 00774 (N.D. Ill. Oct. 7, 2016), a federal district court denied “defendants’ second joint renewed motion to exclude expert testimony regarding firearm toolmark analysis.” Normally, yet another federal district court opinion admitting testimony of a positive association in toolmarks would not be newsworthy. But this trial court decision came in response to a recent report from the President’s Council of Advisors on Science and Technology (PCAST). This report, Forensic Science in Criminal Courts: Ensuring Scientific Validity of Feature-Comparison Methods, argues that source-attribution testimony for bullet markings has not been demonstrated to possess scientific validity. 1/ Thus, Forensic Magazine ran this headline: “Firearms Evidence Allowed in Chicago Hobos Gang Trial—Despite PCAST Argument.” 2/ So what is PCAST's analysis, and how did the court overcome it?

PCAST expressed concern that the
“Theory of Identification as it Relates to Toolmarks”—which defines the criteria for making an identification—is circular. The “theory” states that an examiner may conclude that two items have a common origin if their marks are in “sufficient agreement,” where “sufficient agreement” is defined as the examiner being convinced that the items are extremely unlikely to have a different origin. In addition, the “theory” explicitly states that conclusions are subjective (p. 104).
The court did not dispute the absence of a standardized procedure with well-defined judgmental criteria for source attribution. Rather, it maintained that the very scientific literature cited by PCAST does not contradict a previous ruling in the case that the judgments of firearms examiners amount to the kind of “scientific knowledge” necessary to admit scientific evidence under the Supreme Court's opinion in Daubert v. Merrell Dow Pharmaceuticals, 509 U.S. 579 (1993). As discussed at length in many publications (and occasionally in depth), Justice Blackmun's opinion in Daubert articulates a loose, multifactor standard for ascertaining the scientific soundness of proposed testimony, 3/ and the district court had examined the “Daubert factors” in its first ruling. This time, it limited its analysis to “error rates.” Judge John J. Tharp, Jr., wrote that:
As such, the report does not dispute the accuracy or acceptance of firearm toolmark analysis within the courts. Rather, the report laments the lack of scientifically rigorous “blackbox” studies needed to demonstrate the reproducibility of results, which is critical to cementing the accuracy of the method. Id. at 11. The report gives detailed explanations of how such studies should be conducted in the future, and the Court hopes researchers will in fact conduct such studies. See id. at 106. However, PCAST did find one scientific study that met its requirements (in addition to a number of other studies with less predictive power as a result of their designs). That study, the “Ames Laboratory study,” found that toolmark analysis has a false positive rate between 1 in 66 and 1 in 46. Id. at 110. The next most reliable study, the “Miami-Dade Study” found a false positive rate between 1 in 49 and 1 in 21. Thus, the defendants’ submission places the error rate at roughly 2%. 3 The Court finds that this is a sufficiently low error rate to weigh in favor of allowing expert testimony. See Daubert v. Merrell Dow Pharms., 509 U.S. 579, 594 (1993) (“the court ordinarily should consider the known or potential rate of error”); United States v. Ashburn, 88 F. Supp. 3d 239, 246 (E.D.N.Y. 2015) (finding error rates between 0.9 and 1.5% to favor admission of expert testimony); United States v. Otero, 849 F. Supp. 2d 425, 434 (D.N.J. 2012) (error rate that “hovered around 1 to 2%” was “low” and supported admitting expert testimony). The other factors remain unchanged from this Court’s earlier ruling on toolmark analysis. See ECF No. 781.

3. Because the experts will testify as to the likelihood that rounds were fired from the same firearm, the relevant error rate in this case is the false positive rate (that is, the likelihood that an expert’s testimony that two bullets were fired by the same source is in fact incorrect).
But is it true that “the report does not dispute the accuracy ... of firearm toolmark analysis within the courts”? The report claims that the accuracy of the statements that appear in court is not known with adequate precision. According to the authors, no convincing range for the risks of errors can be derived from the existing scientific literature. PCAST insists that “[b]ecause firearms analysis is at present a subjective feature-comparison method, its foundational validity can only be established through multiple independent black box studies ... .” 4/ PCAST is emphatic (some would say dogmatic) in insisting that “the sole way to establish foundational validity is through multiple independent ‘black-box’ studies that measure how often examiners reach accurate conclusions across many feature-comparison problems involving samples representative of the intended use. In the absence of such studies, a feature-comparison method cannot be considered scientifically valid” (pp. 66, 68).

Under this criterion for scientific validity, it is hard to see how the PCAST report can be characterized as not disputing accuracy. The Miami-Dade study that the opinion relies on barely counts for PCAST. The report lists it under the heading of “Non-black-box studies of firearms analysis” (p. 106). That leaves a single study, and a single study cannot satisfy the report’s demand for multiple studies. For better or worse, PCAST's bottom line is clear:
At present, there is only a single study that was appropriately designed to test foundational validity and estimate reliability (Ames Laboratory study). Importantly, the study was conducted by an independent group, unaffiliated with a crime laboratory. Although the report is available on the web, it has not yet been subjected to peer review and publication.

The scientific criteria for foundational validity require appropriately designed studies by more than one group to ensure reproducibility. Because there has been only a single appropriately designed study, the current evidence falls short of the scientific criteria for foundational validity. There is thus a need for additional, appropriately designed black-box studies to provide estimates of reliability.
The district court in Chester read this passage as meaning that the science is there, but that it would be nice to have a few more studies to show that other researchers can replicate the small error rates in the unpublished study. But is not PCAST really saying that there is a paucity of acceptable experiments from which to ascertain applicable error probabilities? In this regard, much more than "reliability" (in the statistical sense) and "reproducibility" of a number is at issue. One should ask not just whether a second research group has replicated a given study, but more broadly, whether a solid body of studies with varied designs and different samples of examiners establishes that the findings as a whole are robust and generalizable.

In contrast, the Chester court is satisfied with two studies that it understands to reveal false positive error probabilities in the neighborhood of 2%. Where PCAST is unable to perceive evidence of “validity” in the sense of reasonably well known error probabilities, the court finds the probability of error to be small enough to allow testimony as to the origin of the bullets.

But which probability does the court conclude is comfortingly small? The PCAST report defines a false-positive error probability one way. The Chester court expresses a different understanding of the meaning of this probability. Stay tuned for later discussion of this "false-positive error fallacy."
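The gap between the two readings can be made concrete with Bayes' rule. One reading conditions on the truth (how often different-source pairs are declared matches); the other, which is how the court's footnote 3 reads, conditions on the testimony (how often a declared match is wrong). In the sketch below, the 2% false-positive rate echoes the court's figure, but the sensitivity (0.95) and the proportions of same-source comparisons are hypothetical values chosen only for illustration; neither the opinion nor the PCAST report supplies them.

```python
def prob_wrong_given_match(false_positive_rate, sensitivity, prior_same_source):
    """P(different source | examiner declares a match), by Bayes' rule."""
    # Declared matches that come from different-source pairs.
    false_matches = false_positive_rate * (1 - prior_same_source)
    # Declared matches that come from same-source pairs.
    true_matches = sensitivity * prior_same_source
    return false_matches / (false_matches + true_matches)


if __name__ == "__main__":
    # The priors below are hypothetical; they show how the answer moves.
    for prior in (0.5, 0.1, 0.01):
        p = prob_wrong_given_match(0.02, 0.95, prior)
        print(f"prior same-source = {prior:.2f}: P(wrong | match) = {p:.3f}")
```

The same 2% false-positive rate yields very different chances that a declared match is mistaken, depending on how often the compared items truly share a source.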

Notes
  1. See Eric Lander, William Press, S. James Gates, Jr., Susan L. Graham, J. Michael McQuade, and Daniel Schrag, PCAST Releases Report on Forensic Science in Criminal Courts, Sept. 20, 2016, https://www.whitehouse.gov/blog/2016/09/20/pcast-releases-report-forensic-science-criminal-courts 
  2. Seth Augerstein, Firearms Evidence Allowed in Chicago Hobos Gang Trial—Despite PCAST Argument, Forensic Mag., Oct. 13, 2016. http://www.forensicmag.com/news/2016/10/firearms-evidence-allowed-chicago-hobos-gang-trial-despite-pcast-argument 
  3. See, e.g., David H. Kaye, David E. Bernstein & Jennifer L. Mnookin, The New Wigmore: A Treatise on Evidence: Expert Evidence (2d ed. 2011) (updated annually).
  4. P. 106. The definition of "black-box study" at page 48, Box 2, is seriously incomplete. There, the report explains that "[b]y a 'black-box study,' we mean an empirical study that assesses a subjective method by having examiners analyze samples and render opinions about the origin or similarity of samples." But "empirical" studies come in a multitude of designs. The PCAST authors have specific ideas about the design of empirical studies that they call "black-box studies."

Thursday, October 6, 2016

Does log-LR separate Hillary Clinton from Donald Trump?

Many forensic statisticians regard likelihood ratios (or their logarithms) as providing an ideal measure of the probative value of comparisons of patterns and other test results. Moving into another context, on medium.com "Data Scientist" Maixent Chenebaux describes differences between the acceptance speeches of Hillary Clinton and Donald Trump at their parties' nominating conventions:
we would like to find which are the candidates’ favorite words. To get the “Trumpian” and the “Clintonian” vocabularies, we have to find the words that occur the most in one candidate’s talk and, at the same time, the least in the opponent’s. For example, the word “really” is found 15 times in Trump’s speech but only once in Clinton’s. One way to determine this is to calculate the odds ratio for each word. The odds ratio (here named OR) was, for each word, computed using the following formula:

OR(word_i) = log { p_trump(word_i) / p_clinton(word_i) }

The first term of the ratio is the probability of a word being in Trump’s vocabulary, and the other one is the probability of the same word being in Clinton’s. The log function allows us to efficiently sort each word in one category or the other: when the probabilities are equal, the log function is null. In any other cases, it is either negative (a word is Clintonian) or positive (a word is Trumpian).
There is a simpler way to say this. If a word is used proportionally more often in Trump's speech than in Clinton's, it is Trumpian; if it occurs proportionally more in Clinton's speech, it is Clintonian.

The "odds ratio" in the box is not really an odds ratio. It is the logarithm of a Bayes factor (the ratio of posterior to prior odds) or the log of the likelihood ratio. But that seems like extra baggage. We are not trying to deduce the author of the speech from the frequencies of the distinctive words. We are only trying to pick out the words that are distinctive. Their usefulness in classifying additional speech samples by author would require further research. If we had a transcript of a speech that we knew was either Clinton's or Trump's--and had to guess the speaker without knowing anything about politics and semantics--then the discriminators obtained from this study might be useful. We could claim scientific validity in using them to discriminate between the two possible authors if these terms had been shown to be sensitive and specific (and, hence, to have high likelihood ratios) when cross-validated on other speeches.
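For readers who want to see the arithmetic, here is a minimal sketch of the log-ratio computation. The only figures taken from the article are the counts for "really" (15 in Trump's speech, 1 in Clinton's); the speech lengths, the function name, and the add-half smoothing that keeps the logarithm finite for words one speaker never used are assumptions introduced here, not Chenebaux's method.

```python
import math


def log_ratio(count_a, total_a, count_b, total_b, smoothing=0.5):
    """Log of the ratio of a word's relative frequency in speech A to speech B.

    `smoothing` is an assumed add-half correction so that a zero count
    does not produce log(0) or a division by zero.
    """
    p_a = (count_a + smoothing) / (total_a + 2 * smoothing)
    p_b = (count_b + smoothing) / (total_b + 2 * smoothing)
    return math.log(p_a / p_b)


if __name__ == "__main__":
    # Counts for "really" come from the quoted article; the totals of
    # 5,000 words per speech are hypothetical placeholders.
    score = log_ratio(15, 5000, 1, 5000)
    print(f"log ratio for 'really': {score:.2f}")
```

A positive score marks a word used proportionally more often in the first speech, a negative score marks the second, and a score near zero marks a word that does not separate the two.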

In any event, the words that emerged as discriminators for the nomination speeches and that were used "almost exclusively" by one candidate (p_other candidate(word) ≈ 0?) were

Clinton     Trump
America     great
working     got
hard        China
together    really
Donald      Mexico
millions    nice
campaign    problem(s)
enough      Iran
wants       disaster

One proposed interpretation is that Trump uses shorter words, but that is not obvious from this list. Another is:
Fine observers will note that “Trump” does not appear in the Clintonian wordset above, the reason being that Trump himself made numerous mentions of his last name in his speech (10 times), bringing the ratio way down. As a point of comparison, Clinton’s name is only used twice: once in Hillary’s speech (about her husband Bill Clinton), and once in Trump’s. Moreover, the Clintonian word “Wants” that shows up in the list is mostly used to criticize her opponent (“He wants to divide us […]”, “He wants us to fear the future and fear each other.”). It clearly shows that Clinton talked about Trump, and Trump talked about … himself!
A word whose LR was about 1 is "thanks." It does not work well as a classifier. But here semantics, not statistics, indicates a difference between the two candidates:
They both use “thank(s)” numerous times, but in a different manner: while Clinton specifically thanked a group of people or an individual [e.g. "I want to thank Bernie Sanders"] Trump’s “thanks” were mostly employed when the crowd was applauding him [e.g., "That's really nice, thanks"].
Returning to statistics alone,
Most of Trump’s sentences are short: more than 21% of Trump’s speech is made of sentences that contain 5 or 6 words. Clinton’s sentence lengths are more evenly distributed, 12-word sentences being the most frequent. ... Obama, during his first nomination speech, employed an average of 25.7 words per sentence, which is almost equal to Clinton and Trump combined. Obama also repeated himself 24% less than Clinton, and 42% less than Trump.

That's it for now! I would not want to repeat myself.

Reference

Maixent Chenebaux, Semantics — What Does Data Science Reveal about Clinton and Trump?, Oct. 5, 2016, https://medium.com/reputation-squad/semantics-what-does-data-science-reveal-about-clinton-nd-trump-afdf427e833b#.ctkkoyu2o