Sunday, December 11, 2016

PCAST’s Sampling Errors (Part II: Getting More Technical)

The report to the President, Forensic Science in Criminal Courts: Ensuring Scientific Validity of Feature-Comparison Methods, from the President’s Council of Advisors on Science and Technology emphasizes the need for research to assess the accuracy of the conclusions of criminalists who compare the features of identification evidence — things like fingerprints, toolmarks, hairs, DNA, and bitemarks. To this extent, it treads on firm (and well-worn) ground.

It also stresses the need to inform legal factfinders — the judges and jurors who try cases with the assistance of such evidence — of the false-positive error rates found in comprehensive and well-designed studies with large sample sizes. This too is a laudable objective (although the sensitivity, which is the complement of the false-negative probability, also affects the probative value of the evidence and therefore should be incorporated into any presentation of probative value).

When sample sizes are small, different studies could well generate very different false-positive rates if only because accuracy varies, both within and across examiners. Testing different samples of examiners at different times therefore will show different levels of performance. A common response to this sampling variability is a confidence interval (CI). A CI is intended to demarcate a range of possible values that might plausibly include the value that an ideal study of all examiners at all times would find.

The report advocates the use of an upper limit of a one-sided 95% CI for the false-positive rates instead of, or perhaps in addition to, the observed rates themselves. Some of the report’s statements about CIs and what to report are collected in an appendix to this posting. They have led to claims of a very large false-positive error rate for latent fingerprint identification. (See On a “Ridiculous” Estimate of an “Error Rate for Fingerprint Comparisons,” Dec. 10, 2016.)

A previous posting on "PCAST’s Sampling Errors" identified problems with the specifics of PCAST’s understanding of the meaning of a CI, with the technique it used to compute CIs for the performance studies it reviewed, and with the idea of substituting a worst-case scenario (the upper part of a CI) for the full range of the interval. Informing the factfinder of plausible variation both above and below the point estimate is fairer to all concerned, and that interval should be computed with techniques that will not create distracting arguments about the statistical acumen of the expert witness.

This posting elaborates on these concerns. Forensic scientists and criminalists who are asked to testify to error probabilities (as they should) need to be aware of the nuances lest they be accused of omitting important information or using inferior statistical methods. I provide a few tables that display PCAST’s calculations of the upper limits of one-sided 95% CIs; the corresponding two-sided CIs computed with the same interval-estimation method; and CIs computed with methods that PCAST listed but did not use.

The discussion explains how to use the statistical tool that PCAST recommended. It also shows that there is potential for confusion and manipulation in the reporting of confidence intervals. This does not mean that such intervals should not be used — quite the contrary, they can give a sense of the fuzziness that sampling error creates for point estimates. Contrary to the recommendation of the PCAST report, however, they should not be presented as if they give the probability that an error probability is as high as a particular value.

At best, being faithful to the logic behind confidence intervals, one can report that if the error probability were that large, then the probability of the observed error rate or a smaller one would have some designated value. To present the upper limit of a 95% CI as if it states that the probability is “at least 5%” that the false-positive error probability could be “as high as” the upper end of that CI — the phrasing used in the report (p. 153) — would be to succumb to the dreaded “transposition fallacy” that textbooks on statistics abjure. (See also David H. Kaye, The Interpretation of DNA Evidence: A Case Study in Probabilities, National Academies of Sciences, Engineering, and Medicine, Science Policy Decision-making Educational Modules, June 16, 2016.)

I. PCAST’s Hypothetical Cases

The PCAST Report gives examples of CIs for two hypothetical validation studies. In one, there are 25 tests of examiners’ ability to use features to classify pairs according to their source. In this hypothetical experiment, the examiners made x = 0 false-positive classifications in n = 25 trials. The report states that “if an empirical study found no false positives in 25 individual tests, there is still a reasonable chance (at least 5 percent) that the true error rate might be as high as roughly 1 in 9.” (P. 153.)

I think I know what PCAST is trying to say, but this manner of expressing it is puzzling. Let the true error rate be some unknown, fixed value θ. The sentence in the report might amount to an assertion that the probability that θ is roughly 1 in 9 is at least 5%. In symbols,
Pr(θ ≈ 1/9 | x = 0, n = 25) ≥ 0.05. 
Or does it mean that the probability that θ is roughly 1/9 or less is 5% or more, that is,
Pr(θ ≤ 1/9 | x = 0, n = 25) ≥ 0.05? 
Or does it state that the probability that θ is roughly 1/9 or more is 5% or more, that is,
Pr(θ ≥ 1/9 | x = 0, n = 25) ≥ 0.05?

None of these interpretations can be correct. Theta is a parameter rather than a random variable. From the frequentist perspective of CIs, it does not have probabilities attached to it. What one can say is that, if an enormous number of identical experiments were conducted, each would generate some point estimate x/n for the false-positive probability θ in the population. Some of these estimates would be a little higher than θ; some would be a little lower (if θ > 0); some would be a lot higher; some a lot lower (if θ ≫ 0); some would be spot on. If we constructed a 95% CI around each estimate, about 95% of them would cover the unknown θ, and about 5% would miss it. (I am not sure where the "at least 5%" in the PCAST report comes from.) Likewise, if we constructed a 90% CI for each experiment, about 90% would cover the unknown θ, and about 10% would miss it.
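To make the coverage idea concrete, here is a minimal Monte Carlo sketch in Python. It is my illustration, not PCAST’s: the true rate θ = 0.05, the study size n = 100, and the use of the Wilson interval from the statsmodels library are assumptions chosen only for the example. The sketch repeatedly simulates a validation study, computes a two-sided 95% CI from each simulated false-positive count, and tallies how often the interval covers the true θ.

```python
# Minimal sketch (not from the report): empirical coverage of two-sided 95% CIs.
# Assumed values for illustration: true false-positive probability theta = 0.05,
# n = 100 comparisons per simulated study, 10,000 simulated studies.
import numpy as np
from statsmodels.stats.proportion import proportion_confint

rng = np.random.default_rng(seed=1)
theta, n, n_studies = 0.05, 100, 10_000

covered = 0
for _ in range(n_studies):
    x = rng.binomial(n, theta)                      # false positives in one simulated study
    lo, hi = proportion_confint(x, n, alpha=0.05,   # two-sided 95% CI
                                method="wilson")
    covered += (lo <= theta <= hi)

print(f"Fraction of intervals covering theta: {covered / n_studies:.3f}")
# Expect a number near 0.95; for a discrete outcome the realized coverage
# can depart somewhat from the nominal level.
```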

In assessing a point estimate, the width of the interval is what matters. At a given confidence level, wider intervals signal less precision in the estimate. The 90% CI would be narrower than the 95% one — it would have the appearance of greater precision, but that greater precision is illusory. It just means that we can narrow the interval by being less “confident” about the claim that it captures the true value θ. PCAST uses 95% confidence because 95% is a conventional number in many (but not all) fields of scientific inquiry, a fact that greatly impressed Justice Breyer in oral argument in Hall v. Florida.

In sum, 95% two-sided CIs are useful for manifesting the fuzziness that surrounds any point estimate because of sampling error. But they do not answer the question of what probability we should attach to the claim that “the true error rate might be as high as roughly 1 in 9.” Even with a properly computed CI that goes from 0 to 1/9 (which we can write as [0, 1/9]), the most we can say is that the process used to generate such CIs will err about 5% of the time. (It is misleading to claim that it will err “at least 5%” of the time.) We have a single CI, and we can plausibly propose that it is one of those that does not err. But we cannot say that there is a 95% probability that this particular interval [0, 1/9] captures θ, or a 5% probability that it does not. Hence, we cannot say that “there is ... at least [a] 5 percent [chance] that the true error rate [θ] might be as high as roughly 1 in 9.”

To reconstruct how PCAST computed “roughly 1 in 9,” we can use EpiTools, the web-based calculator that the report recommended. This tool does not have an explicit option for one-sided intervals, but a 95% one-sided CI leaves 5% of the probability mass in a single tail. Hence, the upper limit is equal to that for a two-sided 90% CI. This is so because a 90% CI places one-half of the 10% of the mass that it does not cover in each tail. Figure 1 shows the logic.

Figure 1. Graphical Representation of PCAST's
Upper Limit of a One-sided 95% CI (not drawn to scale)
 95% CI (two-sided)
       **
      *****
     ********
    ************
   ****************
*************************
--[--------------------]------------> x/n
Lower 2.5% region       Upper 2.5% region

 90% CI (two-sided)
       **
      *****
     ********
    ************
  *****************
*************************
---[---------------]--------------> x/n
Lower 5% region     Upper 5% region

PCAST: Report the interval from
the observed x/n to the start of the upper 5% region? 
Or just report where the upper 5% region starts?

---[---------------]--------------> x/n
Lower 5% region     Upper 5% region

Inputting n = 25, x = 0, and a confidence level of 90% (for the one-sided upper limits) or 95% (for the two-sided limits) yields the results in Table 1.

Table 1. Checking and Supplementing PCAST’s CIs
in Its First Hypothetical Case (x/n = 0/25)

Method             One-sided 95% upper limit   Two-sided 95% upper limit
Clopper-Pearson    0.1129 (≈ 1/9)              0.1372 (≈ 1/7)
Wilson             0.0977 (≈ 1/10)             0.1332 (≈ 1/8)
Jeffreys           0.0732 (≈ 1/14)             0.0947 (≈ 1/11)
Agresti-Coull      0.1162 (≈ 1/9)              0.1576 (≈ 1/6)
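The figures in Table 1 do not depend on EpiTools in particular. The Python sketch below reproduces them with the statsmodels library — my choice of tool, not one named in the report; "beta" is simply statsmodels’ label for the Clopper-Pearson (exact) interval. Asking for a two-sided interval with alpha = 0.10 yields the upper end of PCAST’s one-sided 95% CI, and alpha = 0.05 yields the two-sided 95% limits.

```python
# Sketch: binomial confidence limits for x false positives in n trials.
# "beta" is statsmodels' name for the Clopper-Pearson (exact) method.
from statsmodels.stats.proportion import proportion_confint

METHODS = {"Clopper-Pearson": "beta", "Wilson": "wilson",
           "Jeffreys": "jeffreys", "Agresti-Coull": "agresti_coull"}

def ci_limits(x, n):
    """Print the one-sided 95% upper limit (via a two-sided 90% CI)
    and the two-sided 95% limits for each method."""
    for label, method in METHODS.items():
        _, upper_one_sided = proportion_confint(x, n, alpha=0.10, method=method)
        lower, upper_two_sided = proportion_confint(x, n, alpha=0.05, method=method)
        print(f"{label:15s}  one-sided 95% upper: {upper_one_sided:.4f}   "
              f"two-sided 95%: [{lower:.4f}, {upper_two_sided:.4f}]")

# Table 1.  (For x = 0, some methods report a lower limit of, or just below, zero.)
ci_limits(0, 25)
```

The same function, called as ci_limits(0, 100), ci_limits(17, 711), and ci_limits(6, 3628), reproduces the figures in Tables 2, 3, and 4 below, up to rounding.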

The Wilson and Jeffreys methods, which are recommended in the statistics literature cited by PCAST, give smaller upper bounds on the false-positive error rate than the Clopper-Pearson method that PCAST uses. The recommended numbers are 1 in 10 and 1 in 14 compared to PCAST’s 1 in 9.

On the other hand, the upper limits of the two-sided 95% CIs for the Wilson and Jeffreys methods are 1 in 8 and 1 in 11. In this example (and in general), using the one-sided interval has the opposite effect from what PCAST may have intended. PCAST wants to keep juries from hearing false-positive error probabilities that understate θ, even at the cost of giving them estimates that overstate it. But as Figure 1 illustrates, using the upper limit of a 90% two-sided interval to arrive at a one-sided 95% CI lowers the upper limit compared to the two-sided 95% CI. PCAST’s one-sided intervals do less, not more, to protect defendants from estimates of the false-positive error probability that are too low.

Of course, I am being picky. What is the difference if the expert reports that the false-positive probability could be 1 in 14 instead of PCAST’s upper limit of 1 in 9? The jury will receive a similar message — with perfect scores on only 25 relevant tests of examiners, one cannot plausibly claim that examiners rarely err. Another way to make this point is to compute the probability of the study’s finding so few false positives (x/n = 0/25) when the probability of an error on each independent test is 1 in 9 (or any other number that one likes). If 1/9 is the binomial probability of a false-positive error, then the chance of x = 0 errors in 25 tests of (unbeknownst to the examiners) different-source specimens is (1 − 1/9)^25 = 0.053.

At this point, you may say, wait a minute, this is essentially the 5% figure given by PCAST. Indeed, it is, but it has a very different meaning. It indicates how often one would see studies with a zero false-positive error rate when the unknown, true rate is actually 1/9. In other words, it is the p-value for the data from the experiment when the hypothesis θ = 1/9 is true. (This is a one-tailed p-value. The expected number of false positives is 25 × (1/9) ≈ 2.8. The probability of 6 or more false positives is 5.2%. So the two-tailed p-value is 10.5%. Only about 1 time in ten would studies with 25 trials produce outcomes that diverged at least as much as this one did from what is statistically expected if the probability of error in each trial with different-source specimens really is 1/9.) Thus, the suggestion that the true probability is 1/9 is not grossly inconsistent with the data. But the data are not highly probable under the hypothesis that the false-positive error probability is 1/9 (or any larger number).
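These binomial calculations are easy to verify. Here is a brief sketch using scipy; the value 1/9 and the two-tailed construction simply follow the text above.

```python
# Sketch: how probable is the observed result (0 false positives in 25 trials)
# if the per-trial false-positive probability were really 1/9?
from scipy.stats import binom

n, theta = 25, 1 / 9

p_zero = binom.pmf(0, n, theta)   # P(X = 0) = (1 - 1/9)**25, about 0.053 (one-tailed p-value)
expected = n * theta              # expected number of false positives, about 2.8
p_upper = binom.sf(5, n, theta)   # P(X >= 6), a deviation of equal size above the mean, about 0.052

print(f"P(X = 0)           = {p_zero:.3f}")
print(f"Expected count     = {expected:.1f}")
print(f"P(X >= 6)          = {p_upper:.3f}")
print(f"Two-tailed p-value = {p_zero + p_upper:.3f}")   # about 0.105
```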

PCAST also gives some numbers for another hypothetical case. The report states that
[I]f a study finds zero false positives in 100 tries, the four methods mentioned give, respectively, the values 0.030, 0.026, 0.032, and 0.019 for the upper 95 percent confidence bound. From a scientific standpoint, any of these might appropriately be reported to a jury in the context "the false positive rate might be as high as."
The output of EpiTools is included in Table 2.

Table 2. Supplementing PCAST’s CIs
in Its Second Hypothetical Case (x/n = 0/100)

Method             One-sided 95% upper limit   Two-sided 95% upper limit
Clopper-Pearson    0.0295 (≈ 1/34)             0.0362 (≈ 1/28)
Wilson             0.0263 (≈ 1/38)             0.0370 (≈ 1/27)
Jeffreys           0.0190 (≈ 1/53)             0.0247 (≈ 1/40)
Agresti-Coull      0.0317 (≈ 1/32)             0.0444 (≈ 1/23)

Again, to the detriment of defendants, the one-sided intervals give smaller upper bounds than the more appropriate two-sided 95% intervals. For example, PCAST’s 90% Clopper-Pearson interval is [0, 3%], while the 95% interval is [0, 4%]. A devotee of the Jeffreys method (or a less scrupulous expert witness seeking to minimize the apparent false-positive error risk) could report the smallest interval of [0, 2%].

II. Two Real Studies with Larger Sample Sizes

The two hypotheticals are extreme cases. Sample sizes are small, and the outcomes are extreme — no false-positive errors at all. Let’s look at examples from the report that involve larger samples and higher observed error rates.

The report notes that in Langenburg, Champod, and Genessay (2012),
For the non-mated pairs, there were 17 false positive matches among 711 conclusive examinations by the experts. The false positive rate was 2.4 percent (upper 95 percent confidence bound of 3.5 percent). The estimated error rate corresponds to 1 error in 42 cases, with an upper bound corresponding to 1 error in 28 cases.
P. 93 (notes omitted). Invoking EpiTools, one finds the results shown in Table 3.

Table 3. Supplementing PCAST’s CIs
for Langenburg et al. (2012) (x/n = 17/711)

Method             One-sided 95% upper limit   Two-sided 95% lower limit   Two-sided 95% upper limit
Clopper-Pearson    0.0356 (≈ 1/28)             0.0140 (≈ 1/71)             0.0380 (≈ 1/26)
Wilson             0.0353 (≈ 1/28)             0.0150 (≈ 1/67)             0.0380 (≈ 1/26)
Jeffreys           0.0348 (≈ 1/29)             0.0145 (≈ 1/69)             0.0371 (≈ 1/27)
Agresti-Coull      0.0355 (≈ 1/28)             0.0147 (≈ 1/68)             0.0382 (≈ 1/26)

As one would expect for a larger sample, all the usual methods for producing a binomial CI generate similar results. The PCAST one-sided interval is [2.4%, 3.6%] compared to a two-sided interval of [1.4%, 3.8%].

For a final example, we can consider PCAST's approach to sampling error for the FBI-Noblis study of latent fingerprint identification. The report describes two related studies of latent fingerprints, both with large sample sizes and small error rates, but I will only provide additional calculations for the 2011 one. The report describes it as follows:
The authors assembled a collection of 744 latent-known pairs, consisting of 520 mated pairs and 224 non-mated pairs. To attempt to ensure that the non-mated pairs were representative of the type of matches that might arise when police identify a suspect by searching fingerprint databases, the known prints were selected by searching the latent prints against the 58 million fingerprints in the AFIS database and selecting one of the closest matching hits. Each of 169 fingerprint examiners was shown 100 pairs and asked to classify them as an identification, an exclusion, or inconclusive. The study reported 6 false positive identifications among 3628 nonmated pairs that examiners judged to have “value for identification.” The false positive rate was thus 0.17 percent (upper 95 percent confidence bound of 0.33 percent). The estimated rate corresponds to 1 error in 604 cases, with the upper bound indicating that the rate could be as high as 1 error in 306 cases.
The experiment is described more fully in Fingerprinting Under the Microscope: A Controlled Experiment on the Accuracy and Reliability of Latent Print Examinations (Part I), Apr. 26, 2011, et seq. More detail on the one statistic selected for discussion in the PCAST report is in Table 4.

Table 4. Supplementing PCAST’s CIs
for Ulery et al. (2011) (x/n = 6/3628)

Method             One-sided 95% upper limit   Two-sided 95% lower limit   Two-sided 95% upper limit
Clopper-Pearson    0.0033 (≈ 1/303)            0.0006 (≈ 1/1667)           0.0036 (≈ 1/278)
Wilson             0.0032 (≈ 1/313)            0.0008 (≈ 1/1250)           0.0036 (≈ 1/278)
Jeffreys           0.0031 (≈ 1/323)            0.0007 (≈ 1/1423)           0.0034 (≈ 1/294)
Agresti-Coull      0.0033 (≈ 1/303)            0.0007 (≈ 1/1423)           0.0037 (≈ 1/270)

The interval that PCAST seems to recommend is [0.17%, 0.33%]. The two-sided interval is [0.06%, 0.36%]. Whereas PCAST would have the expert testify that the error rate is "as high as 1 error in 306 cases," or perhaps between "1 error in 604 cases [and] 1 error in 306 cases," the two-sided 95% CI extends from 1 in 1667 on up to 1 in 278.

Although PCAST's focus on the high end of estimated error rates strikes me as tendentious, a possible argument for it is psychological and legal. If jurors anchor on the first number they hear, such as 1/1667, they will underestimate the false-positive error probability. By excluding lower limits from the presentation, we avoid that possibility. Of course, we also increase the risk that jurors will overestimate the effect of sampling error. A better solution to this perceived problem might be to present the interval from highest to lowest rather than ignoring the low end entirely.

APPENDIX

PCAST is less than pellucid on whether a judge or jury should learn of (1) the point estimate together with an upper limit or (2) only the upper limit. The Council seems to have rejected the more conventional approach of giving both end points of a two-sided CI. The following assertions suggest that the Council favored the second, and most extreme, approach:
  • “[T]o inform jurors that [the] only two properly designed studies of the accuracy of latent fingerprint analysis ... found false positive rates that could be as high as 1 in 306 in one study and 1 in 18 in the other study ... would appropriately inform jurors that errors occur at detectable frequencies, allowing them to weigh the probative value of the evidence.” (P. 96.)
  • The report states that "the other (Miami-Dade 2014 study) yielded a considerably higher false positive rate of 1 in 18." (P. 96.) Yet, 1 in 18 was not the false-positive rate observed in the study, but, at best, the upper end of a one-sided 95% CI above this rate.
Other parts of the report point in the other direction or are simply ambiguous. Examples follow:
  • If firearms analysis is allowed in court, the scientific criteria for validity as applied should be understood to require clearly reporting the error rates seen in appropriately designed black-box studies (estimated at 1 in 66, with a 95 percent confidence limit of 1 in 46, in the one such study to date). (P. 112)
  • Studies designed to estimate a method’s false positive rate and sensitivity are necessarily conducted using only a finite number of samples. As a consequence, they cannot provide “exact” values for these quantities (and should not claim to do so), but only “confidence intervals,” whose bounds reflect, respectively, the range of values that are reasonably compatible with the results. When reporting a false positive rate to a jury, it is scientifically important to state the “upper 95 percent one-sided confidence bound” to reflect the fact that the actual false positive rate could reasonably be as high as this value. 116/ (P. 53)
  • The upper confidence bound properly incorporates the precision of the estimate based on the sample size. For example, if a study found no errors in 100 tests, it would be misleading to tell a jury that the error rate was 0 percent. In fact, if the tests are independent, the upper 95 percent confidence bound for the true error rate is 3.0 percent. Accordingly a jury should be told that the error rate could be as high as 3.0 percent (that is, 1 in 33). The true error rate could be higher, but with rather small probability (less than 5 percent). If the study were much smaller, the upper 95 percent confidence limit would be higher. For a study that found no errors in 10 tests, the upper 95 percent confidence bound is 26 percent—that is, the actual false positive rate could be roughly 1 in 4 (see Appendix A). (P. 53 n. 116.)
  • In summarizing these studies, we apply the guidelines described earlier in this report (see Chapter 4 and Appendix A). First, while we note (1) both the estimated false positive rates and (2) the upper 95 percent confidence bound on the false positive rate, we focus on the latter as, from a scientific perspective, the appropriate rate to report to a jury—because the primary concern should be about underestimating the false positive rate and the true rate could reasonably be as high as this value. 262/ (P. 92.)
  • Since empirical measurements are based on a limited number of samples, SEN and FPR cannot be measured exactly, but only estimated. Because of the finite sample sizes, the maximum likelihood estimates thus do not tell the whole story. Rather, it is necessary and appropriate to quote confidence bounds within which SEN, and FPR, are highly likely to lie. (P. 152.)
  • For example, if a study finds zero false positives in 100 tries, the [Clopper-Pearson/Exact Binomial method, the Wilson Score interval, the Agresti-Coull (adjusted Wald) interval, and the Jeffreys interval] give, respectively, the values 0.030, 0.026, 0.032, and 0.019 for the upper 95 percent confidence bound. From a scientific standpoint, any of these might appropriately be reported to a jury in the context “the false positive rate might be as high as.” (In this report, we used the Clopper-Pearson/Exact Binomial method.) (P. 153.)
