Monday, October 24, 2016

PCAST’s Sampling Errors (Part I)

The President’s Council of Advisors on Science and Technology (PCAST) has reported to the President that important steps must be taken to improve forensic science. Among other valuable recommendations, its report on Forensic Science in Criminal Courts: Ensuring Scientific Validity of Feature-Comparison Methods insists that the uncertainty associated with forensic-science findings of a positive association between a criminal defendant and some form of incriminating trace, impression, or pattern evidence should be estimated with reasonable accuracy, and that this estimate must be presented in court. These broad propositions are ones to which everyone concerned with doing justice should subscribe.

Yet, the PCAST report also argues in favor of presenting, through a statistical device known as a one-sided confidence interval, only part of the picture concerning the statistical error in studies of the performance of forensic examiners. In expressing this more detailed position, the PCAST report misstates the meaning of confidence intervals and offers a dubious justification.

The report states that “[b]y convention, a confidence level of 95 percent is most widely used—meaning that there is a 5 percent chance the true value exceeds the bound.” (P. 153). This explanation reiterates a similar statement in a much earlier report of a committee of the National Research Council of the National Academies. In its 1992 report on DNA Technology in Forensic Science, the NRC committee discussed “the traditional 95% confidence limit, whose use implies that the true value has only a 5% chance of exceeding the upper bound.” (P. 76).

This loose language did not escape the attention of statisticians. Bruce Weir, for example, promptly responded that “[a] court may be excused for such nonstatistical language, but a report issued by the NRC ... must lose some credibility with statisticians.” B. S. Weir, Population Genetics in the Forensic DNA Debate, 89 Proc. Nat’l Acad. Sci. USA 11654, 11654 (1992). As most statistics textbooks recognize, “a confidence level of 95%” does not mean “that there is a 5 percent chance the true value” lies outside the particular interval. In the theory that motivates confidence intervals, the true value is an unknown constant, not a variable that has a probability associated with it. The 5% figure is the expected proportion of instances in which intervals constructed in this way (most of them somewhat different in their central location and width) will fail to cover the unknown, true value. Only a Bayesian analysis can supply a probability that the true value falls outside a given interval.
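The frequentist interpretation can be made concrete with a short simulation. The sketch below (Python; the true proportion, sample size, and choice of the Wilson interval are illustrative assumptions, not anything taken from the PCAST report) repeatedly draws samples from a population whose proportion is known, builds a 95% interval from each sample, and counts how often the intervals cover the fixed true value. The 95% describes the procedure's long-run coverage, not the probability that any single interval contains the truth.

```python
# Illustrative simulation: "95% confidence" is a property of the procedure.
import math
import random

random.seed(1)
TRUE_P = 0.10   # the fixed true proportion (unknown in practice)
N = 400         # sample size per simulated study
TRIALS = 5000   # number of simulated studies
Z = 1.959964    # two-sided 95% normal quantile

covered = 0
for _ in range(TRIALS):
    x = sum(random.random() < TRUE_P for _ in range(N))
    phat = x / N
    # Wilson score interval, a standard choice for a binomial proportion
    denom = 1 + Z**2 / N
    center = (phat + Z**2 / (2 * N)) / denom
    half = Z * math.sqrt(phat * (1 - phat) / N + Z**2 / (4 * N**2)) / denom
    if center - half <= TRUE_P <= center + half:
        covered += 1

print(f"Empirical coverage: {covered / TRIALS:.3f}")  # close to 0.95
```

Each simulated interval either covers 0.10 or it does not; what is (approximately) 95% is the fraction of intervals, over many repetitions, that do.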

This criticism may sound rather technical, and it is, but considering the scientific horsepower on PCAST, one would have expected its description of statistical concepts to be unobjectionable. More disturbing is the report’s dismissal of the presentation of standard, two-sided confidence intervals as “obfuscation”:
Because one should be primarily concerned about overestimating SEN [sensitivity] or underestimating FPR [false positive rate], it is appropriate to use a one-sided confidence bound. By convention, a confidence level of 95 percent is most widely used—meaning that there is a 5 percent chance the true value exceeds the bound. Upper 95 percent one-sided confidence bounds should thus be used for assessing the error rates and the associated quantities that characterize forensic feature matching methods. (The use of lower values may rightly be viewed with suspicion as an attempt at obfuscation.)
P. 153. Without unearthing the technical details of one-sided and two-sided confidence intervals, and in full agreement with the notion that good science requires acknowledging the possibility that false-positive error probabilities could occur more often than seen in a single study, I have to say that this paragraph seems to contradict the ideal of a forensic scientist who does not take sides.

Certainly, the law (not science) treats false convictions as more serious than false acquittals. But does this asymmetry imply that expert witnesses should not discuss the fact that sampling error (which is the only thing that confidence intervals address) can work in both directions? The legal and social policy judgment that it is better to risk a false acquittal than a false conviction requires the state to prove its case decisively — by a body of evidence that leaves no reasonable doubt about the defendant’s guilt. At the same time, it presumes that evidence should be presented and assessed for what it is worth — neither more nor less. Consequently, the uncertainties in scientific evidence should be made clear — whichever way they cut. If possible, they should be expressed fairly, without favoring the prosecution's theory or the defense's.

Accordingly, we should amend PCAST’s talk of “obfuscation.” It is fair to say that the exclusive use of lower values, or even point estimates — instead of both upper and lower values — may rightly be viewed with suspicion as an attempt at obfuscation. It is equally fair to say that the exclusive use of upper values may rightly be viewed with suspicion as an attempt at obfuscation. Finally, presentation of the full range of uncertainty can be viewed with approbation as an attempt at transparency. In sum, it is far from clear that the one-sided 95% confidence interval best achieves the objectives of either law or science.
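One way to see what is at stake is to compute both presentations for the same data. The sketch below (Python, assuming scipy is installed) uses made-up study results — 4 false positives in 100 tries, not a figure from the report — and the Clopper-Pearson method, which the report says it used. The one-sided bound puts all 5% of the error allowance in the upper tail; the two-sided interval splits it and so also reports how low the rate might plausibly be.

```python
# Hypothetical data: 4 false positives in 100 tries (illustrative only).
from scipy.stats import beta

x, n = 4, 100

# Two-sided 95% Clopper-Pearson interval (2.5% in each tail)
lower = beta.ppf(0.025, x, n - x + 1)
upper_two_sided = beta.ppf(0.975, x + 1, n - x)

# One-sided 95% upper bound (all 5% in the upper tail)
upper_one_sided = beta.ppf(0.95, x + 1, n - x)

print(f"point estimate:        {x / n:.3f}")
print(f"two-sided 95% CI:      ({lower:.3f}, {upper_two_sided:.3f})")
print(f"one-sided upper bound: {upper_one_sided:.3f}")
```

Reporting only the one-sided upper bound suppresses the lower limit entirely; reporting the full two-sided interval shows the sampling uncertainty in both directions.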

Technical postscript of 12/8/16 (for people who want to use PCAST's estimates of sampling error)

A forensic scientist wrote me that he was unable to replicate the numbers PCAST provided for one-sided 95% confidence intervals. The report intimidatingly states that
For technical reasons, there is no single, universally agreed method for calculating these confidence intervals (a problem known as the “binomial proportion confidence interval”). However, the several widely used methods give very similar results, and should all be considered acceptable: the Clopper-Pearson/Exact Binomial method, the Wilson Score interval, the Agresti-Coull (adjusted Wald) interval, and the Jeffreys interval. 396/ Web-based calculators are available for all of these methods. 397/ For example, if a study finds zero false positives in 100 tries, the four methods mentioned give, respectively, the values 0.030, 0.026, 0.032, and 0.019 for the upper 95 percent confidence bound. From a scientific standpoint, any of these might appropriately be reported to a jury in the context “the false positive rate might be as high as.” (In this report, we used the Clopper-Pearson/Exact Binomial method.)
P. 153. Relying on the PCAST approach to testify to a range for error rates could be dangerous. The article that the PCAST report cites does not conclude that the four methods perform equally well. Neither does it recommend the Clopper-Pearson (CP) method. The article is Lawrence D. Brown, T. Tony Cai & Anirban DasGupta, Interval Estimation for a Binomial Proportion, 16 Stat. Sci. 101 (2001). The abstract "recommend[s] the Wilson interval or the equal-tailed Jeffreys prior interval for small n and the interval suggested in Agresti and Coull for larger n." The authors explain that "for small n (40 or less), we recommend that either the Wilson or the Jeffreys prior interval should be used. They are very similar, and either may be used depending on taste." Id. at 102. "For larger n (n > 40), the Wilson, the Jeffreys and the Agresti–Coull interval are all very similar, and so for such n, due to its simplest form, we come to the conclusion that the Agresti–Coull interval should be recommended." Id. at 103. Brown et al. were especially critical of the procedure that PCAST used. They described the CP interval as "rather inaccurate" and concluded that
The Clopper–Pearson interval is wastefully conservative and is not a good choice for practical use, unless strict adherence to the prescription C (p, n ) ≥ 1−α is demanded. Even then, better exact methods are available ... .
The statistical calculator (EpiTools) that PCAST recommended likewise warns that "the Clopper-Pearson Exact method is very conservative and tends to produce wider intervals than necessary."
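For readers who, like my correspondent, want to check the report's arithmetic, the four upper one-sided 95% bounds for zero false positives in 100 tries can be reproduced directly from the standard formulas (as given in Brown, Cai & DasGupta). A minimal sketch in Python, assuming scipy is available:

```python
# Reproducing the report's example: 0 false positives in 100 tries,
# upper one-sided 95% bounds for the four named methods.
import math
from scipy.stats import beta, norm

x, n = 0, 100
z = norm.ppf(0.95)            # one-sided 95% normal quantile, about 1.645
phat = x / n

# Clopper-Pearson ("exact"): 95th percentile of Beta(x+1, n-x)
cp = beta.ppf(0.95, x + 1, n - x)

# Wilson score upper bound
denom = 1 + z**2 / n
wilson = ((phat + z**2 / (2 * n))
          + z * math.sqrt(phat * (1 - phat) / n + z**2 / (4 * n**2))) / denom

# Agresti-Coull (adjusted Wald) upper bound
nt = n + z**2
pt = (x + z**2 / 2) / nt
ac = pt + z * math.sqrt(pt * (1 - pt) / nt)

# Jeffreys: 95th percentile of Beta(x+1/2, n-x+1/2)
jeff = beta.ppf(0.95, x + 0.5, n - x + 0.5)

for name, bound in [("Clopper-Pearson", cp), ("Wilson", wilson),
                    ("Agresti-Coull", ac), ("Jeffreys", jeff)]:
    print(f"{name:16s} {bound:.3f}")
# prints 0.030, 0.026, 0.032, 0.019 -- the values given in the report
```

These match the report's figures (0.030, 0.026, 0.032, 0.019), so the numbers themselves are replicable; the dispute is over which method to prefer.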

How much of a difference does this really make? I have not done the necessary computations, and I will be surprised if they produce big swings in the estimated upper bounds. After I get the chance to grind out the numbers, I will supply them in a later posting.
