Sunday, July 23, 2017

The Source and Soundness of PCAST's 5% Rule

The President’s Council of Advisors on Science and Technology (PCAST) Report on comparative pattern matching in forensic science has a deceptively simple rule for the admissibility of evidence of a match between a questioned and a known sample: if examiners would declare that the two samples have the same source as often as one time in 20 when analyzing pairs of samples that actually come from different sources, then the comparisons are “scientifically unreliable.” The report gives no explanation of how it arrived at this rule beyond the following enigmatic paragraph: 1/
False positive rate (abbreviated FPR) is defined as the probability that the method declares a match between two samples that are from different sources (again in an appropriate population), that is, FPR = P(M|H0). For example, a value FPR = 0.01 would indicate that two samples from different sources will be (mistakenly) called as a match 1 percent of the time. Methods with a high FPR are scientifically unreliable for making important judgments in court about the source of a sample. To be considered reliable, the FPR should certainly be less than 5 percent and it may be appropriate that it be considerably lower, depending on the intended application. 2/
Five percent has a crisp, authoritative ring to it, but why is 5% “certainly” the maximum tolerable FPR for courtroom use of the test? And what “intended applications” would demand a lower FPR? Is the underlying thought that greater “scientific reliability” is required as the gravity of the case increases—from a civil case, to a misdemeanor, to a major crime, on up to a capital case?

Statistical Practice as the Basis for the 5% Rule

Inasmuch as the paragraph is found in an appendix entitled "statistical issues," we should expect statistical concepts and practice to help answer such questions. And in fact, 5% is a common number in statistics. In many applications, statistical hypothesis tests try to keep the risk of a false rejection of the “null hypothesis” H0—a false-positive conclusion—below 5%. Researchers and journal editors in many fields prize results that can be said to be “statistically significant,” usually at the 0.05 level or better. The expression p < 0.05 is therefore a common accoutrement of experimental or observational results indicating an association between variables. Likewise, the Food and Drug Administration requires clinical trials to show that a new drug is effective for its intended use (“validity,” if you will), with “the typical ‘cap’ on the type I [false positive] error rate ... set at 5%.” 3/ In the forensic pattern-matching context, the null hypothesis H0 in the PCAST paragraph would be that a questioned and a known sample are not associated with the same source.

Thus, to the extent PCAST was thinking of the 5% FPR as the significance level required to reject H0, its emphasis on 5% is well grounded in statistical practice. Using certain standard levels of significance, particularly 5%, can be traced to the 1920s. The eminent British statistician Sir R. A. Fisher wrote:
It is convenient to draw the line at about the level at which we can say: ‘Either there is something in the treatment, or a coincidence has occurred such as does not occur more than once in twenty trials.’ ... If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty (the 2 per cent point), or one in a hundred (the 1 per cent point). Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach that level. A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance. 4/
For FPRs larger than 5%, the reports of criminalists do not meet (Fisher’s) criterion for establishing a “scientific fact.” Their conclusions of positive association for such error-prone procedures are not, in PCAST’s words, “scientifically reliable.”

Having equated PCAST’s unexplained choice of 5% with a common implementation of statistical hypothesis testing, we also can see why the report suggested that a “considerably lower” number might be required for scientific “reliability.” A 5% FPR lets in examiner conclusions that might be wrong about one time in twenty when defendants are innocent and there is no true association between the questioned item and the known one. False positives tend to increase the rate of false convictions, whereas false negatives tend to increase the rate of false acquittals. The norm that false convictions are worse than false acquittals counsels caution in relying on an examiner’s conclusion to convict a defendant. And if false convictions in the most serious of cases are worse still, we can see why the PCAST report stated that “the FPR should certainly be less than 5 percent and it may be appropriate that it be considerably lower, depending on the intended application.” Five percent may be good enough for an editor to publish an interesting paper purporting to have discovered something new in social psychology, but this scientific convention does not mean that 5% is good enough for a criminal conviction, let alone one that would lead to an execution.

So we can see that PCAST’s 5% figure did not come from thin air. Indeed, some statisticians and psychologists think that it is too weak a standard—that the general rule in science ought to be p < 0.005. 5/ Nevertheless, the general use of the arguably lenient 5% significance level does not establish that the 5% rule is legally compelled. The law incorporates the intensified concern for false positives into the burden of persuasion for the evidence as a whole. The jury is instructed to acquit unless it has no reasonable doubt that a defendant in a criminal case is guilty; in contrast, in a civil case, the plaintiff can prevail on a mere preponderance of the evidence. But these burdens do not apply to individual items of evidence. The standard for admitting scientific—and other—evidence does not change with how much is at stake in the particular case. After all, the probative value of scientific evidence is no different in a criminal case than in a civil one. Although the PCAST report insists that its statements about science are merely designed to inform courts about scientific standards, if “scientific reliability” depends on the “importance” of the “judgments in court” and varies according to “the intended application,” then PCAST's "scientific reliability" turns out to be based on what is considered socially or legally “appropriate.”

Beyond the FPR

In sum, it is (or would have been) fair for PCAST to point out that it is uncommon for results at higher significance levels than 0.05 to be credited in the scientific literature. But a more deeply analytical report would have noted that there is uneasiness in the statistical community with the hypothesis testing framework and particularly with over-reliance on the p < 0.05 rule. (Today's mail includes an invitation to attend a "Symposium on Statistical Inference: A World Beyond p < 0.05" sponsored by the American Statistical Association.)

Only part of the world beyond p < 0.05 comes from the fact that the FPR is not the only quantity that determines “scientific reliability.” Superficially, the false-positive error probability might look like the appropriate statistic for considering the probative value of a positive finding, but that cannot be right. Scientific evidence, like all circumstantial evidence, has probative value to the extent it changes the probability of a material fact. That there is much more to probative value than the FPR therefore is easily seen through the lens of Bayes’ rule. As the PCAST report notes, in this context, Bayes' theorem prescribes how probability or odds change with the introduction of evidence. The odds after learning of the examiner’s finding are the odds without that information multiplied by the Bayes factor: posterior odds = prior odds × BF.

The Bayes factor thus indicates the strength of the evidence. Stronger evidence has a larger BF and hence a greater impact on the prior odds than weaker evidence. The Bayes factor is a simple ratio. The FPR appears as the denominator, and the sensitivity—or true positive rate—forms the numerator. In symbols, BF = sensitivity / FPR.
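To make the point concrete, here is a minimal Python sketch of the two relationships just stated, BF = sensitivity / FPR and posterior odds = prior odds × BF. The function names and the two "methods" are hypothetical, invented only for illustration; the numbers do not come from the PCAST report or from any study.

```python
def bayes_factor(sensitivity: float, fpr: float) -> float:
    """Return P(match | same source) / P(match | different sources)."""
    if not 0.0 <= sensitivity <= 1.0:
        raise ValueError("sensitivity must lie in [0, 1]")
    if not 0.0 < fpr <= 1.0:
        raise ValueError("fpr must lie in (0, 1]")
    return sensitivity / fpr


def posterior_odds(prior_odds: float, bf: float) -> float:
    """Apply Bayes' rule in odds form: posterior odds = prior odds x BF."""
    return prior_odds * bf


# Hypothetical methods A and B (illustrative values only):
# both have the same FPR, but they differ in sensitivity.
SHARED_FPR = 0.04
bf_a = bayes_factor(sensitivity=0.80, fpr=SHARED_FPR)
bf_b = bayes_factor(sensitivity=0.40, fpr=SHARED_FPR)

print(f"Method A: BF = {bf_a:.1f}")
print(f"Method B: BF = {bf_b:.1f}")

# Starting from even prior odds (1 to 1), the same FPR leads to
# different posterior odds because the sensitivities differ.
print(f"Posterior odds, A: {posterior_odds(1.0, bf_a):.1f} to 1")
print(f"Posterior odds, B: {posterior_odds(1.0, bf_b):.1f} to 1")
```

Two methods with identical false-positive rates can thus differ by a factor of two in the strength of the evidence they supply, which is why the FPR cannot serve as the sole measure.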

The report acknowledges that sensitivity matters (for some purposes at least). Earlier, the report states that “[i]t is necessary to have appropriate empirical measurements of a method’s false positive rate and the method’s sensitivity. [I]t is necessary to know these two measures to assess the probative value of a method.” 6/ Because it takes both operating characteristics to express the probative value of the test, PCAST cannot sensibly dismiss a test as having so little probative value as to be considered “scientifically unreliable” on the basis of only one number. Realizing this prompts the next question for devising a rule in the spirit of PCAST's—namely, what is the sensitivity that, together with an FPR of 5%, would define the threshold for “scientific reliability”?

One might imagine that PCAST would consider any false-negative rate in excess of 5% as too high. 7/ If so, it follows that the scientists are saying that, in their view of what is important or what is the dominant convention in various domains, subjective pattern matching must shift the prior odds by a factor of at least .95/.05 = 19 to be considered “scientifically reliable.” On the other hand, if the scientists on PCAST think it is appropriate for a false-negative probability to be ten times the maximum acceptable false-positive probability, then their minimum for “reliability” would become an FNR of 50% and an FPR of 5%, for a Bayes factor of only ten.
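The arithmetic behind these two candidate thresholds can be checked with a short Python sketch. It assumes only that sensitivity = 1 − FNR, so that BF = (1 − FNR) / FPR; the function name is hypothetical, and the two scenarios are the ones described in the paragraph above, not thresholds that PCAST itself announced.

```python
def bayes_factor_from_error_rates(fnr: float, fpr: float) -> float:
    """Bayes factor for a positive finding, given the two error rates.

    Sensitivity is the complement of the false-negative rate,
    so BF = (1 - FNR) / FPR.
    """
    if not 0.0 <= fnr <= 1.0:
        raise ValueError("fnr must lie in [0, 1]")
    if not 0.0 < fpr <= 1.0:
        raise ValueError("fpr must lie in (0, 1]")
    return (1.0 - fnr) / fpr


MAX_FPR = 0.05  # the 5% ceiling stated in the PCAST paragraph

# The two scenarios discussed in the text.
scenarios = {
    "FNR capped at the same 5%": 0.05,
    "FNR allowed to be ten times the FPR": 10 * MAX_FPR,
}

for label, fnr in scenarios.items():
    bf = bayes_factor_from_error_rates(fnr, MAX_FPR)
    print(f"{label}: FNR = {fnr:.2f}, minimum BF = {bf:.0f}")
```

The same 5% ceiling on false positives is therefore compatible with quite different minimum strengths of evidence, depending on what is assumed about false negatives.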

What Does the Law Require?

Whether the cutoff comes from the FPR alone or the more complete Bayes factor, the very notion of a sharp cutoff is questionable. The purpose of a forensic-science test for identity is to provide evidence that will assist judges or jurors. Forensic scientists who present results and reasonable estimates of the likelihoods or conditional error probabilities associated with their conclusions are staying within the bounds of what is scientifically known.

Consider a hypothetical pattern-matching test for identity for which FPR = 10% and sensitivity = 70% as shown by extensive experiments, each of which demonstrates an ability to distinguish sources from nonsources with accuracy above what would be expected by chance (p < 0.05). According to the PCAST report, this test would be inadmissible for want of “scientific reliability” or “foundational validity” because the FPR of 10% is too high. But if this were a test for a disease, would we really want a diagnosing physician to ignore the positive test result just because the FPR is greater than 5%? The positive finding from the lab would raise the prior odds from, say, 1 to 2, to 7 to 2 (corresponding to an increase in probability from 33% to 78%). Like the physician trying to reach the best possible diagnosis, the judge or jury trying to reach the best possible reconstruction of the events could benefit from knowing that an examiner, who can perform at the empirically established level of accuracy, has found a positive association.
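The odds and probabilities in this hypothetical can be verified with exact fractions. The sketch below uses the figures given in the paragraph above (FPR = 10%, sensitivity = 70%, prior odds of 1 to 2); the helper name `odds_to_probability` is simply an illustrative choice.

```python
from fractions import Fraction


def odds_to_probability(odds: Fraction) -> Fraction:
    """Convert odds in favor (a to b, written a/b) to a probability."""
    return odds / (1 + odds)


# Figures from the hypothetical test in the text.
sensitivity = Fraction(70, 100)
fpr = Fraction(10, 100)
prior_odds = Fraction(1, 2)  # 1 to 2

bf = sensitivity / fpr            # Bayes factor for a positive finding
post_odds = prior_odds * bf       # posterior odds = prior odds x BF

prior_prob = odds_to_probability(prior_odds)
post_prob = odds_to_probability(post_odds)

print(f"Bayes factor: {bf}")
print(f"Posterior odds: {post_odds.numerator} to {post_odds.denominator}")
print(f"Prior probability: {float(prior_prob):.0%}")
print(f"Posterior probability: {float(post_prob):.0%}")
```

Working in `Fraction` rather than floating point keeps the odds in the same "a to b" form used in the text and avoids rounding until the final display.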

The logic behind a high hurdle for scientific evidence is that “it is likely to be shrouded with an aura of near infallibility, akin to the ancient oracle of Delphi.” 8/ As one federal judge (an advisor to PCAST) wrote in excluding the testimony of a handwriting expert:
[I]t is the Court's role to ensure that a given discipline does not falsely lay claim to the mantle of science, cloaking itself with the aura of unassailability that the imprimatur of ‘science’ confers and thereby distorting the truth-finding process. There have been too many pseudo-scientific disciplines that have since been exposed as profoundly flawed, unreliable, or baseless for any Court to take this role lightly. 9/
Under this rationale, a court should be able to admit the positive test result if the jury is informed of and can appreciate the limitations of the finding. A result that is ten times more probable when the samples have the reported source than when they have different sources is not unreliable “junk science.” Of course, it may not be the product of a particularly scientific (or even a very standardized) procedure, and that must be made clear to the factfinder. When the criminalists employing the highly subjective procedure truly have specialized knowledge—as evidenced by rigorous and repeated tests of their ability to arrive at correct answers—their findings can be presented along with their known error rates without creating “an aura of near infallibility.” If this view of what judges and juries can understand is correct, then a blanket rule against all expert evidence that has known error rates in excess of 5% is unsound.

This criticism of PCAST's 5% rule does not reject the main theme of the report—that when a forensic identification procedure relies on a vaguely defined judgmental process (such as "sufficient similarities and explicable dissimilarities in the light of the examiner's training and experience"), well-founded estimates of the ability of examiners to make the correct judgments are vital to admitting source attributions in court. Of course, Daubert v. Merrell Dow Pharmaceuticals, Inc. 10/ did not make any single factor, including a "known or potential rate of error," absolutely necessary for admitting all types of scientific evidence. But the Daubert Court painted with an amazingly broad brush. The considerations that will be most important can vary from one type of evidence to another. When it comes to source attributions from entirely subjective assessments of the similarities and differences in feature sets, there is a cogent argument that the only acceptable way to validate the psychological process is to study how often examiners reach the right conclusions when confronted with same-source and different-source samples.

Notes
  1. Thanks to Ken Melson for calling my attention to this paragraph.
  2. PCAST Report at 151-52.
  3. Russell Katz, FDA: Evidentiary Standards for Drug Development and Approval, 1(3) NeuroRx 307–316 (2004), doi: 10.1602/neurorx.1.3.307.
  4. R.A. Fisher, The Arrangement of Field Experiments, 33 J. Ministry Agric. Gr. Brit. 504 (1926), as quoted in L. Savage, On Rereading R.A. Fisher, 4 Annals of Statistics 471 (1976).
  5. Kelly Servick, It Will Be Much Harder To Call New Findings ‘Significant’ If This Team Gets Its Way, Jul. 25, 2017, 2:30 PM, Science, DOI: 10.1126/science.aan7154.
  6. PCAST Report at 50 (emphasis added).
  7. However, the report made no mention of the fact that the false-negative rate was higher than that in at least one of the two experiments on latent print identification of which it approved.
  8. United States v. Alexander, 526 F.2d 161, 168 (8th Cir. 1975).
  9. Almeciga v. Center for Investigative Reporting, Inc., 185 F. Supp. 3d 401, 415 (S.D.N.Y. 2016) (Rakoff, J.).
  10. 509 U.S. 579 (1993).
