Tuesday, December 27, 2016

NCFS Draft Views on "Statistical Statements in Forensic Testimony"


The period for comments on the second public draft of a proposed National Commission on Forensic Science (NCFS) views document on Statistical Statements in Forensic Testimony opened today and will close on January 25, 2017. Comments can be submitted at regulations.gov, Docket No. DOJ-LA-2016-0025.

The full document can be downloaded from https://www.regulations.gov/document?D=DOJ-LA-2016-0025-0001. It defines "statistical statements" broadly, to encompass "quantitative or qualitative statements [that] indicate the accuracy of measurements or observations and the significance of these findings." "These statistical statements," it explains, "may describe measurement accuracy (or conversely, measurement uncertainty), weight of evidence (the extent to which measurements or observations support particular conclusions), or the probability or certainty of the conclusions themselves."

The draft summarizes the views as follows (footnote omitted):
1. Forensic experts, both in their reports and in testimony, should present and describe the features of the questioned and known samples (the data), and similarities and differences in those features as well as the process used to arrive at determining them. The presentation should include statements of the limitations and uncertainties in the measurements or observations.

2. No one form of statistical calculation or statement is most appropriate to all forensic evidence comparisons or other inference tasks. Thus, the expert needs to be able to support, as part of a report and in testimony, the choice used in the specific analysis carried out and the assumptions on which it was based. When the statistical calculation relies on a specific database, the report should make clear which one and its relevance for the case at hand.

3. The expert should report the limitations and uncertainty associated with measurements and the inferences that could be drawn from them. This report might take the form of an interval for an estimated value, or of separate statements regarding errors and uncertainties associated with the analysis of the evidence. If the expert has no information on sources of error in measurements and inferences, the expert must state this fact.

4. Forensic science experts should not state that a specific individual or object is the source of the forensic science evidence and should make it clear that, even in circumstances involving extremely strong statistical evidence, it is possible that other individuals or objects could possess or have left a similar set of observed features. Forensic science experts should confine their evaluative statements to the support that the findings provide for the claim linked to the forensic evidence.

5. To explain the value of the data in addressing claims as to the source of a questioned sample, forensic examiners may:
A. Refer to relative frequencies of individual features in a sample of individuals or objects in a relevant population (as sampled and then represented in a reference database). The examiner should note the uncertainties in these frequencies as estimates of the frequencies of particular features in the population.

B. Present estimates of the relative frequency of an observed combination of features in a relevant population based on a probabilistic model that is well grounded in theory and data. The model may relate the probability of the combination to the probabilities of individual features.

C. Present probabilities (or ratios of probabilities) of the observed features under different claims as to the origin of the questioned sample. The examiner should note the uncertainties in any such values.

D. When the statistical statement is derived from an automated computer-based system for making classifications, present not only the classification but also the operating characteristics of the system (the sensitivity and specificity of the system as established in relevant experiments using data from a relevant population). If the expert has no information or limited information about such operating characteristics, the expert must state this fact.
6. Not all forensic subdisciplines currently can support a probabilistic or statistical statement. There may still be value to the factfinder in learning whatever comparisons the expert in those subdisciplines has carried out. But the absence of models and empirical evidence needs to be expressed both in testimony and written reports.
The document will be discussed at the January 2017 NCFS meeting. A final version should be up for a vote at the (final?) Commission meeting, on April 10-11, 2017.

Thursday, December 22, 2016

Realistically Testing Forensic Laboratory Performance in Houston

The Houston Forensic Science Center announced on November 17, 2016, that
HFSC Begins Blind Testing in DNA, Latent Prints, National First
This innovation -- said to be unique among forensic laboratories and to exceed the demands of accreditation -- does not refer to blind testing of samples from crime scenes. It is generally recognized that analysts should be blinded to information that they do not need to reach conclusions about the similarities and differences in crime-scene samples and samples from suspects or other persons of interest. One would hope that many laboratories already employ this strategy for managing unwanted sources of possible cognitive bias.

Perhaps confusingly, the Houston lab's announcement refers to "'blindly' test[ing] its analysts and systems, assisting with the elimination of bias while also helping to catch issues that might exist in the processes." More clearly stated, "[u]nder HFSC’s blind testing program analysts in five sections do not know whether they are performing real casework or simply taking a test. The test materials are introduced into the workflow and arrive at the laboratory in the same manner as all other evidence and casework."

A month earlier, the National Commission on Forensic Science unanimously recommended, as a research strategy, "introducing known-source samples into the routine flow of casework in a blinded manner, so that examiners do not know their performance is being studied." Of course, whether the purpose is research or instead what the Houston lab calls a "blind quality control program," the Commission noted that "highly challenging samples will be particularly valuable for helping examiners improve their skills." It is often said that existing proficiency testing programs not only fail to blind examiners to the fact that they are being tested, but also are only designed to test minimum levels of performance.

The Commission bent over backward to imply that the outcomes of the studies it proposed would not necessarily be admissible in litigation. It wrote that
To avoid unfairly impugning examiners and laboratories who participate in research on laboratory performance, judges should consider carefully whether to admit evidence regarding the occurrence or rate of error in research studies. If such evidence is admitted, it should only be under narrow circumstances and with careful explanation of the limitations of such data for establishing the probability of error in a given case.
The Commission's concern was that applying statistics from work with unusually difficult cases to more typical casework might overstate the probability of error in the less difficult cases. At the same time, its statement of views included a footnote implying that the defense should have access to the outcomes of performance tests:
[T]he results of performance testing may fall within the government’s disclosure obligations under Brady v Maryland, 373 U.S. 83 (1963). But the right of defendants to examine such evidence does not entail a right to present it in the courtroom in a misleading manner. The Commission is urging that courts give careful consideration to when and how the results of performance testing are admitted in evidence, not that courts deny defendants access to evidence that they have a constitutional right to review.
Using traditional proficiency test results, or the newer performance tests in which examiners do not know that they are being tested in a given case (a better way to test proficiency), to impeach a laboratory's reported results raises interesting questions of relevance under Federal Rules of Evidence 403 and 404. See, e.g., Edward J. Imwinkelried & David H. Kaye, DNA Typing: Emerging or Neglected Issues, 76 Wash. L. Rev. 413 (2001).

Sunday, December 11, 2016

PCAST’s Sampling Errors (Part II: Getting More Technical)

The report to the President on Forensic Science in Criminal Courts: Ensuring Scientific Validity of Feature-Comparison Methods from the President’s Council of Advisors on Science and Technology emphasizes the need for research to assess the accuracy of the conclusions of criminalists who compare the features of identification evidence — things like fingerprints, toolmarks, hairs, DNA, and bitemarks. To this extent, it treads on firm (and well-worn) ground.

It also stresses the need to inform legal factfinders — the judges and jurors who try cases with the assistance of such evidence — of the false-positive error rates discovered in comprehensive and well designed studies with large sample sizes. This too is a laudable objective (although the sensitivity, which is the complement of the false-negative probability, also affects the probative value of the evidence and therefore should be incorporated into any presentation of that value).

When sample sizes are small, different studies could well generate very different false-positive rates if only because accuracy varies, both within and across examiners. Testing different samples of examiners at different times therefore will show different levels of performance. A common response to this sampling variability is a confidence interval (CI). A CI is intended to demarcate a range of possible values that might plausibly include the value that an ideal study of all examiners at all times would find.

The report advocates the use of an upper limit of a one-sided 95% CI for the false-positive rates instead of, or perhaps in addition to, the observed rates themselves. Some of the report’s statements about CIs and what to report are collected in an appendix to this posting. They have led to claims of a very large false-positive error rate for latent fingerprint identification. (See On a “Ridiculous” Estimate of an “Error Rate for Fingerprint Comparisons,” Dec. 10, 2016.)

A previous posting on "PCAST’s Sampling Errors" identified problems with the specifics of PCAST’s understanding of the meaning of a CI, with the technique it used to compute CIs for the performance studies it reviewed, and with the idea of substituting a worst-case scenario (the upper part of a CI) for the full range of the interval. Informing the factfinder of plausible variation both above and below the point estimate is fairer to all concerned, and that interval should be computed with techniques that will not create distracting arguments about the statistical acumen of the expert witness.

This posting elaborates on these concerns. Forensic scientists and criminalists who are asked to testify to error probabilities (as they should) need to be aware of the nuances lest they be accused of omitting important information or using inferior statistical methods. I provide a few tables that display PCAST’s calculations of the upper limits of one-sided 95% CIs; two-sided CIs computed with the same method for ascertaining the width of the intervals; and CIs with methods that PCAST listed but did not use.

The discussion explains how to use the statistical tool that PCAST recommended. It also shows that there is potential for confusion and manipulation in the reporting of confidence intervals. This does not mean that such intervals should not be used — quite the contrary, they can give a sense of the fuzziness that sampling error creates for point estimates. Contrary to the recommendation of the PCAST report, however, they should not be presented as if they give the probability that an error probability is as high as a particular value.

At best, being faithful to the logic behind confidence intervals, one can report that if the error probability were that large, then the probability of the observed error rate or a smaller one would have some designated value. To present the upper limit of a 95% CI as if it states that the probability is “at least 5%” that the false-positive error probability could be “as high as” the upper end of that CI — the phrasing used in the report (p. 153) — would be to succumb to the dreaded “transposition fallacy” that textbooks on statistics abjure. (See also David H. Kaye, The Interpretation of DNA Evidence: A Case Study in Probabilities, National Academies of Sciences, Engineering, and Medicine, Science Policy Decision-making Educational Modules, June 16, 2016.)

I. PCAST’s Hypothetical Cases

The PCAST Report gives examples of CIs for two hypothetical validation studies. In one, there are 25 tests of examiners’ ability to use features to classify pairs according to their source. This hypothetical experiment establishes that the examiners made x = 0 false positive classifications in n = 25 trials. The report states that “if an empirical study found no false positives in 25 individual tests, there is still a reasonable chance (at least 5 percent) that the true error rate might be as high as roughly 1 in 9.” (P. 153.)

I think I know what PCAST is trying to say, but this manner of expressing it is puzzling. Let the true error rate be some unknown, fixed value θ. The sentence in the report might amount to an assertion that the probability that θ is roughly 1 in 9 is at least 5%. In symbols,
Pr(θ ≈ 1/9 | x = 0, n = 25) ≥ 0.05. 
Or does it mean that the probability that θ is roughly 1/9 or less is 5% or more, that is,
Pr(θ ≤ 1/9 | x = 0, n = 25) ≥ 0.05? 
Or does it state that the probability that θ is roughly 1/9 or more is 5% or more, that is,
Pr(θ ≥ 1/9 | x = 0, n = 25) ≥ 0.05?

None of these interpretations can be correct. Theta is a parameter rather than a random variable. From the frequentist perspective of CIs, it does not have probabilities attached to it. What one can say is that, if an enormous number of identical experiments were conducted, each would generate some point estimate for the false positive probability θ in the population. Some of these estimates x/n would be a little higher than θ; some would be a little lower (if θ > 0); some would be a lot higher; some a lot lower (if θ ≫ 0); some would be spot on. If we constructed a 95% CI around each estimate, about 95% of them would cover the unknown θ, and about 5% would miss it. (I am not sure where the "at least 5%" in the PCAST report comes from.) Likewise, if we constructed a 90% CI for each experiment, about 90% would cover the unknown θ, and about 10% would miss it.
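To make the coverage idea concrete, here is a minimal simulation sketch (my own illustration, not anything in the PCAST report; it assumes Python with numpy and scipy). It pretends that the true false-positive probability θ is known, simulates many 25-trial studies, builds a two-sided 95% Clopper-Pearson interval for each, and counts how often the intervals cover θ. The "95%" describes that long-run coverage, not the probability that any single realized interval contains θ.

import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(12345)
theta, n, n_studies = 1 / 9, 25, 100_000

# Number of false positives in each simulated study of n comparisons
x = rng.binomial(n, theta, size=n_studies)

# Two-sided 95% Clopper-Pearson limits. The clipping avoids invalid beta
# parameters when x = 0 (lower limit is 0) or x = n (upper limit is 1).
x_lo = np.clip(x, 1, None)
x_hi = np.clip(x, None, n - 1)
lower = np.where(x == 0, 0.0, beta.ppf(0.025, x_lo, n - x_lo + 1))
upper = np.where(x == n, 1.0, beta.ppf(0.975, x_hi + 1, n - x_hi))

# Proportion of simulated intervals that cover the true theta:
# about 0.95 or a bit more, because Clopper-Pearson is conservative.
print(np.mean((lower <= theta) & (theta <= upper)))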

In assessing a point estimate, the width of the interval is what matters. At a given confidence level, wider intervals signal less precision in the estimate. The 90% CI would be narrower than the 95% one — it would have the appearance of greater precision, but that greater precision is illusory.  It just means that we can narrow the interval by being less “confident” about the claim that it captures the true value θ. PCAST uses 95% confidence because 95% is a conventional number in many (but not all) fields of scientific inquiry -- a fact that greatly impressed Justice Breyer in oral argument in Hall v. Florida.

In sum, 95% two-sided CIs are useful for manifesting the fuzziness that surrounds any point estimate because of sampling error. But they do not answer the question of what probability we should attach to the claim that “the true error rate might be as high as roughly 1 in 9.” Even with a properly computed CI that goes from 0 to 1/9 (which we can write as [0, 1/9]), the most we can say is that the process used to generate such CIs will err about 5% of the time. (It is misleading to claim that it will err “at least 5%” of the time.) We have a single CI, and we can plausibly propose that it is one of those that does not err. But we cannot say that 95% or fewer of the intervals [0, 1/9] will capture θ. Hence, we cannot say that “there is ... at least [a] 5 percent [chance] that the true error rate [θ] might be as high as roughly 1 in 9.”

To reconstruct how PCAST computed “roughly 1 in 9,” we can use EpiTools, the web-based calculator that the report recommended. This tool does not have an explicit option for one-sided intervals, but a 95% one-sided CI leaves 5% of the probability mass in a single tail. Hence, the upper limit is equal to that for a two-sided 90% CI. This is so because a 90% CI places one-half of the 10% of the mass that it does not cover in each tail. Figure 1 shows the logic.

Figure 1. Graphical Representation of PCAST's
Upper Limit of a One-sided 95% CI (not drawn to scale)
 95% CI (two-sided)
       **
      *****
     ********
    ************
   ****************
*************************
--[--------------------]------------> x/n
Lower 2.5% region       Upper 2.5% region

 90% CI (two-sided)
       **
      *****
     ********
    ************
  *****************
*************************
---[---------------]--------------> x/n
Lower 5% region     Upper 5% region

PCAST: Report the interval from
the observed x/n to the start of the upper 5% region? 
Or just report where the upper 5% region starts?

---[---------------]--------------> x/n
Lower 5% region     Upper 5% region

Inputting n = 25, x = 0, and confidence = 90% yields the results in Table 1.

Table 1. Checking and Supplementing PCAST’s CIs
in Its First Hypothetical Case (x/n = 0/25)

Method | Upper One-sided 95% CI Limit | Upper Two-sided 95% CI Limit
Clopper-Pearson | 0.1129 = 1/9 | 0.1372 = 1/7
Wilson | 0.0977 = 1/10 | 0.1332 = 1/8
Jeffreys | 0.0732 = 1/14 | 0.0947 = 1/11
Agresti-Coull | 0.1162 = 1/9 | 0.1576 = 1/6
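For readers who want to check these numbers without the web calculator, the following sketch (my own code, not EpiTools) reproduces the Clopper-Pearson and Wilson entries in Table 1 using scipy. It exploits the point made above: the upper end of a one-sided 95% CI is the upper end of a two-sided 90% CI, so only the tail probability changes.

from scipy.stats import beta, norm

def clopper_pearson_upper(x, n, tail=0.05):
    """Clopper-Pearson upper limit leaving probability `tail` above it."""
    return 1.0 if x == n else beta.ppf(1 - tail, x + 1, n - x)

def wilson_upper(x, n, tail=0.05):
    """Wilson score upper limit leaving probability `tail` above it."""
    z = norm.ppf(1 - tail)
    p = x / n
    center = (p + z**2 / (2 * n)) / (1 + z**2 / n)
    half = z * (p * (1 - p) / n + z**2 / (4 * n**2)) ** 0.5 / (1 + z**2 / n)
    return center + half

x, n = 0, 25
print(clopper_pearson_upper(x, n, tail=0.05))    # one-sided 95%: 0.1129 (roughly 1 in 9)
print(clopper_pearson_upper(x, n, tail=0.025))   # two-sided 95% upper limit: 0.1372
print(wilson_upper(x, n, tail=0.05))             # one-sided 95%: 0.0977 (roughly 1 in 10)
print(wilson_upper(x, n, tail=0.025))            # two-sided 95% upper limit: 0.1332

Running the same functions with x = 0 and n = 100 reproduces the corresponding Clopper-Pearson and Wilson entries in Table 2 below.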

The Wilson and Jeffreys methods, which are recommended in the statistics literature cited by PCAST, give smaller upper bounds on the false-positive error rate than the Clopper-Pearson method that PCAST uses. The recommended numbers are 1 in 10 and 1 in 14 compared to PCAST’s 1 in 9.

On the other hand, the upper limits of the two-sided 95% CIs for the Wilson and Jeffreys methods are 1 in 8 and 1 in 11. In this example (and in general), using the one-sided interval has the opposite effect from what PCAST may have intended. PCAST wants to keep juries from hearing many false-positive error probabilities that are below θ at the cost of giving them estimates that are above θ. But as Figure 1 illustrates, using the upper limit of a 90% two-sided interval to arrive at a one-sided 95% CI lowers the upper limit compared to the two-sided 95% CI. PCAST’s one-sided intervals do less, not more, to protect defendants from estimates of the false-positive error probability that are too low.

Of course, I am being picky. What is the difference if the expert reports that the false-positive probability could be 1 in 14 instead of PCAST’s upper limit of 1 in 9? The jury will receive a similar message — with perfect scores on only 25 relevant tests of examiners, one cannot plausibly claim that examiners rarely err. Another way to make this point is to compute the probability of the study’s finding so few false positives (x/n = 0/25) when the probability of an error on each independent test is 1 in 9 (or any other number that one likes). If 1/9 is the binomial probability of a false-positive error, then the chance of x = 0 errors in 25 tests of (unbeknownst to the examiners) different-source specimens is (1 – 1/9)^25 = 0.053.

At this point, you may say, wait a minute, this is essentially the 5% figure given by PCAST. Indeed, it is, but it has a very different meaning. It indicates how often one would see studies with a zero false-positive error rate when the unknown, true rate is actually 1/9. In other words, it is the p-value for the data from the experiment when the hypothesis θ = 1/9 is true. (This is a one-tailed p-value. The expected number of false positives is 25(1/9) ≈ 2.8. The probability of 6 or more false positives is about 5.2%. So the two-tailed p-value is about 10.5%. Only about 1 time in ten would studies with 25 trials produce outcomes that diverged at least as much as this one did from what is statistically expected if the probability of error in each trial with different-source specimens really is 1/9.) Thus, the suggestion that the true probability is 1/9 is not grossly inconsistent with the data. But the data are not highly probable under the hypothesis that the false-positive error probability is 1/9 (or any larger number).
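A quick way to verify this arithmetic (my own check, not PCAST's) is to treat each of the 25 different-source comparisons as an independent Bernoulli trial with false-positive probability θ = 1/9 and let scipy's binomial distribution do the work:

from scipy.stats import binom

theta, n = 1 / 9, 25

lower_tail = binom.cdf(0, n, theta)   # P(X <= 0) = (1 - 1/9)**25, about 0.053
upper_tail = binom.sf(5, n, theta)    # P(X >= 6), about 0.05
print(lower_tail, upper_tail, lower_tail + upper_tail)   # two-tailed p-value, about 0.105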

PCAST also gives some numbers for another hypothetical case. The report states that
[I]f a study finds zero false positives in 100 tries, the four methods mentioned give, respectively, the values 0.030, 0.026, 0.032, and 0.019 for the upper 95 percent confidence bound. From a scientific standpoint, any of these might appropriately be reported to a jury in the context "the false positive rate might be as high as."
The output of EpiTools is included in Table 2.

Table 2. Supplementing PCAST’s CIs
in Its Second Hypothetical Case (x/n = 0/100)

Method | Upper One-sided 95% CI Limit | Upper Two-sided 95% CI Limit
Clopper-Pearson | 0.0295 = 1/34 | 0.0362 = 1/28
Wilson | 0.0263 = 1/38 | 0.0370 = 1/27
Jeffreys | 0.0190 = 1/53 | 0.0247 = 1/40
Agresti-Coull | 0.0317 = 1/32 | 0.0444 = 1/23

Again, to the detriment of defendants, the one-sided limits that PCAST recommends are smaller than the upper bounds of the more appropriate two-sided 95% intervals. For example, PCAST’s 90% Clopper-Pearson interval is [0, 3%], while the 95% interval is [0, 4%]. A devotee of the Jeffreys method (or a less scrupulous expert witness seeking to minimize the apparent false-positive error risk) could report the smallest interval of [0, 2%].

II. Two Real Studies with Larger Sample Sizes

The two hypotheticals are extreme cases. Sample sizes are small, and the outcomes are extreme — no false-positive errors at all. Let’s look at examples from the report that involve larger samples and higher observed error rates.

The report notes that in Langenburg, Champod, and Genessay (2012),
For the non-mated pairs, there were 17 false positive matches among 711 conclusive examinations by the experts. The false positive rate was 2.4 percent (upper 95 percent confidence bound of 3.5 percent). The estimated error rate corresponds to 1 error in 42 cases, with an upper bound corresponding to 1 error in 28 cases.
P. 93 (notes omitted). Invoking EpiTools, one finds the values in Table 3.

Table 3. Supplementing PCAST’s CIs
for Langenburg et al. (2012) (x/n = 17/711)

Method | Upper One-sided 95% CI Limit | Lower Two-sided 95% CI Limit | Upper Two-sided 95% CI Limit
Clopper-Pearson | 0.0356 = 1/28 | 0.0140 = 1/71 | 0.0380 = 1/26
Wilson | 0.0353 = 1/28 | 0.0150 = 1/67 | 0.0380 = 1/26
Jeffreys | 0.0348 = 1/29 | 0.0145 = 1/69 | 0.0371 = 1/27
Agresti-Coull | 0.0355 = 1/28 | 0.0147 = 1/68 | 0.0382 = 1/26

As one would expect for a larger sample, all the usual methods for producing a binomial CI generate similar results. The PCAST one-sided interval is [2.4%, 3.6%] compared to a two-sided interval of [1.4%, 3.8%].
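The entries in Table 3 can be reproduced without the web calculator. A sketch follows, assuming the statsmodels package is available; its proportion_confint function returns a (lower, upper) pair, and passing alpha = 0.10 while keeping only the upper end gives the PCAST-style one-sided 95% upper limit.

from statsmodels.stats.proportion import proportion_confint

x, n = 17, 711
for method in ("beta", "wilson", "jeffreys", "agresti_coull"):   # "beta" is Clopper-Pearson
    lo95, up95 = proportion_confint(x, n, alpha=0.05, method=method)       # two-sided 95%
    _, up_one_sided = proportion_confint(x, n, alpha=0.10, method=method)  # one-sided 95% upper
    print(f"{method:14s} one-sided upper {up_one_sided:.4f}   two-sided [{lo95:.4f}, {up95:.4f}]")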

For a final example, we can consider PCAST's approach to sampling error for the FBI-Noblis study of latent fingerprint identification. The report describes two related studies of latent fingerprints, both with large sample sizes and small error rates, but I will provide additional calculations only for the 2011 study, which the report describes as follows:
The authors assembled a collection of 744 latent-known pairs, consisting of 520 mated pairs and 224 non-mated pairs. To attempt to ensure that the non-mated pairs were representative of the type of matches that might arise when police identify a suspect by searching fingerprint databases, the known prints were selected by searching the latent prints against the 58 million fingerprints in the AFIS database and selecting one of the closest matching hits. Each of 169 fingerprint examiners was shown 100 pairs and asked to classify them as an identification, an exclusion, or inconclusive. The study reported 6 false positive identifications among 3628 nonmated pairs that examiners judged to have “value for identification.” The false positive rate was thus 0.17 percent (upper 95 percent confidence bound of 0.33 percent). The estimated rate corresponds to 1 error in 604 cases, with the upper bound indicating that the rate could be as high as 1 error in 306 cases.
The experiment is described more fully in Fingerprinting Under the Microscope: A Controlled Experiment on the Accuracy and Reliability of Latent Print Examinations (Part I), Apr. 26, 2011, et seq. More detail on the one statistic selected for discussion in the PCAST report is in Table 4.

Table 4. Supplementing PCAST’s CIs
for Ulery et al. (2011) (x/n = 6/3628)

Method | Upper One-sided 95% CI Limit | Lower Two-sided 95% CI Limit | Upper Two-sided 95% CI Limit
Clopper-Pearson | 0.0033 = 1/303 | 0.0006 = 1/1667 | 0.0036 = 1/278
Wilson | 0.0032 = 1/313 | 0.0008 = 1/1250 | 0.0036 = 1/278
Jeffreys | 0.0031 = 1/323 | 0.0007 = 1/1423 | 0.0034 = 1/294
Agresti-Coull | 0.0033 = 1/303 | 0.0007 = 1/1423 | 0.0037 = 1/270

The interval that PCAST seems to recommend is [0.17%, 0.33%]. The two-sided interval is [0.06%, 0.36%]. Whereas PCAST would have the expert testify that the error rate is "as high as 1 error in 306 cases," or perhaps between "1 error in 604 cases [and] 1 error in 306 cases," the two-sided 95% CI extends from 1 in 1667 on up to 1 in 278.

Although PCAST's focus on the high end of estimated error rates strikes me as tendentious, a possible argument for it is psychological and legal. If jurors were to anchor on the first number they hear, such as 1/1667, they would underestimate the false-positive error probability. By excluding lower limits from the presentation, we avoid that possibility. Of course, we also increase the risk that jurors will overestimate the effect of sampling error. A better solution to this perceived problem might be to present the interval from highest to lowest rather than ignoring the low end entirely.

APPENDIX

PCAST is less than pellucid on whether a judge or jury should learn of (1) the point estimate together with an upper limit or (2) only the upper limit. The Council seems to have rejected the more conventional approach of giving both end points of a two-sided CI. The following assertions suggest that the Council favored the second and more extreme approach:
  • “[T]o inform jurors that [the] only two properly designed studies of the accuracy of latent fingerprint analysis ... found false positive rates that could be as high as 1 in 306 in one study and 1 in 18 in the other study ... would appropriately inform jurors that errors occur at detectable frequencies, allowing them to weigh the probative value of the evidence.” (P. 96.)
  • The report states that "the other (Miami-Dade 2014 study) yielded a considerably higher false positive rate of 1 in 18." (P. 96.) Yet, 1 in 18 was not the false-positive rate observed in the study, but, at best, the upper end of a one-sided 95% CI above this rate.
Other parts of the report point in the other direction or are simply ambiguous. Examples follow:
  • If firearms analysis is allowed in court, the scientific criteria for validity as applied should be understood to require clearly reporting the error rates seen in appropriately designed black-box studies (estimated at 1 in 66, with a 95 percent confidence limit of 1 in 46, in the one such study to date). (P. 112)
  • Studies designed to estimate a method’s false positive rate and sensitivity are necessarily conducted using only a finite number of samples. As a consequence, they cannot provide “exact” values for these quantities (and should not claim to do so), but only “confidence intervals,” whose bounds reflect, respectively, the range of values that are reasonably compatible with the results. When reporting a false positive rate to a jury, it is scientifically important to state the “upper 95 percent one-sided confidence bound” to reflect the fact that the actual false positive rate could reasonably be as high as this value. 116/ (P. 53)
  • The upper confidence bound properly incorporates the precision of the estimate based on the sample size. For example, if a study found no errors in 100 tests, it would be misleading to tell a jury that the error rate was 0 percent. In fact, if the tests are independent, the upper 95 percent confidence bound for the true error rate is 3.0 percent. Accordingly a jury should be told that the error rate could be as high as 3.0 percent (that is, 1 in 33). The true error rate could be higher, but with rather small probability (less than 5 percent). If the study were much smaller, the upper 95 percent confidence limit would be higher. For a study that found no errors in 10 tests, the upper 95 percent confidence bound is 26 percent—that is, the actual false positive rate could be roughly 1 in 4 (see Appendix A). (P. 53 n. 116.)
  • In summarizing these studies, we apply the guidelines described earlier in this report (see Chapter 4 and Appendix A). First, while we note (1) both the estimated false positive rates and (2) the upper 95 percent confidence bound on the false positive rate, we focus on the latter as, from a scientific perspective, the appropriate rate to report to a jury—because the primary concern should be about underestimating the false positive rate and the true rate could reasonably be as high as this value. 262/ (P. 92.)
  • Since empirical measurements are based on a limited number of samples, SEN and FPR cannot be measured exactly, but only estimated. Because of the finite sample sizes, the maximum likelihood estimates thus do not tell the whole story. Rather, it is necessary and appropriate to quote confidence bounds within which SEN, and FPR, are highly likely to lie. (P. 152.)
  • For example, if a study finds zero false positives in 100 tries, the [Clopper-Pearson/Exact Binomial method, the Wilson Score interval, the Agresti-Coull (adjusted Wald) interval, and the Jeffreys interval] give, respectively, the values 0.030, 0.026, 0.032, and 0.019 for the upper 95 percent confidence bound. From a scientific standpoint, any of these might appropriately be reported to a jury in the context “the false positive rate might be as high as.” (In this report, we used the Clopper-Pearson/Exact Binomial method.) (P. 153.)

Saturday, December 10, 2016

On a “Ridiculous” Estimate of an “Error Rate for Fingerprint Comparisons”

On November 1, Professor Christophe Champod tweeted
“The White House report noted a study showing a 1 in 18 error rate for fingerprint comparison.” This is ridiculous.
He was referring to an interview conducted for the University of Virginia School of Law 1/ and later published on the campus news site as Fallible Fingerprints. 2/ In the interview, Professor Brandon Garrett observed that
For years, courts have grandfathered in techniques like fingerprinting that provide valuable information, but with error rates and probative value that have simply not been adequately studied. For example, the White House report noted a study showing a one-in-18 error rate for fingerprint comparison and another showing a one-in-six error rate for bite-mark comparison. A 2009 report by the National Academy of Sciences carefully detailed how much of the forensic evidence used was without “any meaningful scientific validation.” Little changed.
In a guest editorial in the Washington Post, 3/ he disseminated the same statistic on the accuracy of positive associations from latent prints:
Any human technique has an error rate, and a crucial quality control is to do testing to find out how good experts really are. It is not enough for fingerprint or bite-mark examiners to vouch for their own reliability. We must put their experience to the test. The few tests that have been done show disturbing error rates. For example, the White House report highlights a study showing a 1 in 18 error rate for fingerprint comparison and another showing a shocking 1 in 6 error rate for bite marks.
The general complaint is understandable. Courts have used a low bar in allowing criminalists to testify to unjustified — and exaggerated — claims of certainty for various forms of identification evidence. But is it true that little has changed since 2009 in the field of research into the probative value of latent fingerprint identification? The “White House report” is the study of the President’s Council of Advisors on Science and Technology (PCAST) released in September 2016 (and the subject of other postings, pro and con, on this blog). The report lists four “early studies” and two “black box” studies. Every study is dated 2009 or later. Despite this growing body of evidence on the accuracy of traditional latent fingerprint examinations, the courts — and the members of the bar — have not generally considered the need to use the experiments in explaining the uncertainty in latent print identifications to jurors. In other words, we have better science but not necessarily better law.

How to use the data from the studies is therefore a matter of emerging importance. To employ the research correctly, one needs to consider it as a whole — not to cherry pick results to support the positions of prosecutors or defense counsel. This makes it important to consider whether Professor Champod is right when he dismisses the 1 in 18 figure as ridiculous. There are a priori reasons to trust his judgment. As a forensic scientist at the University of Lausanne, he has been a progressive force in making latent print identification more scientific and in developing ways to incorporate a recognition of uncertainty into the reports of examiners. 4/

At the same time, Professor Garrett “is a principal investigator of UVA’s year-old Center for Statistics and Applications in Forensics Evidence, which is generating new research about forensic analysis and sharing best practices in order to facilitate justice.” 5/ Surely, the principal investigators for the government-funded Forensic Science Center of Excellence should be supplying balanced assessments as they perform their mission to “evaluate and solidify the statistical foundation for fingerprint, firearm, toolmark, and other pattern evidence analyses” so as to “allow forensic scientists to quantify the level of confidence they have in statistical computations made with these methods and the conclusions reached from those analyses.” 6/ Furthermore, Professor Garrett is a leading scholar of criminal procedure and an astute analyst of the factors that can produce wrongful convictions. 7/

Professor Garrett is right to insist that “[a]ny human technique has an error rate, and [it is] crucial ... to do testing ... . It is not enough for fingerprint or bite-mark examiners to vouch for their own reliability [and validity].” Still, I have to side with Professor Champod. The figure of 1 in 18 for false identifications, as a summary of the studies into the validity of latent fingerprint identification, is incredibly inflated.

To give a fair picture of the studies, one cannot just pick out one extreme statistic from one study as representative. Table 1 in the report presents the false positives for “black-box studies” as follows:


Study | Raw Data | Freq. (Confidence bound) | Estimated Rate | Bound on Rate
Ulery et al. 2011 (FBI) | 6/3628 | 0.17% (0.33%) | 1 in 604 | 1 in 306
Pacheco et al. 2014 (Miami-Dade) | 42/995 | 4.2% (5.4%) | 1 in 24 | 1 in 18
Pacheco et al. 2014 (Miami-Dade) (excluding clerical errors) | 7/960 | 0.7% (1.4%) | 1 in 137 | 1 in 73
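The frequency and confidence-bound columns can be checked with a few lines of code (my own sketch, not PCAST's; it reproduces, up to rounding, the observed rate x/n and the Clopper-Pearson one-sided 95% upper bound the report uses):

from scipy.stats import beta

studies = {
    "Ulery et al. 2011 (FBI)": (6, 3628),
    "Pacheco et al. 2014 (Miami-Dade)": (42, 995),
    "Pacheco et al. 2014 (excl. clerical errors)": (7, 960),
}

for name, (x, n) in studies.items():
    rate = x / n
    upper = beta.ppf(0.95, x + 1, n - x)   # one-sided 95% Clopper-Pearson upper bound
    print(f"{name}: {rate:.2%} (bound {upper:.2%})")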


An expert witness who presented only the 1 in 18 figure would be the target of withering cross-examination. In its summary of the studies, PCAST treated 1 in 18 as one of two equally important figures. It recommended (on page 96) that
Overall, it would be appropriate to inform jurors that (1) only two properly designed studies of the accuracy of latent fingerprint analysis have been conducted and (2) these studies found false positive rates that could be as high as 1 in 306 in one study and 1 in 18 in the other study. This would appropriately inform jurors that errors occur at detectable frequencies, allowing them to weigh the probative value of the evidence.
Although slightly more complete, this presentation also would expose an expert to major problems on cross-examination. First, the two studies are not of equal quality. The FBI-Noblis study, which produced the smaller error rate, is plainly better designed and entitled to more credence. Second, the 1 in 18 figure counts a large number of what were said to be “clerical errors.” Third, the upper bounds are not unbiased estimates. The point estimates are 1 in 604, 1 in 24 (with the alleged clerical mistakes), and 1 in 137 (without them). Fourth, verification by a second examiner (who should be blinded to the outcome of the first examination) would drastically reduce these rates.

There also are countervailing considerations. For example, the PCAST upper bound only considers sampling error. Extrapolating from the experiments to practice is a larger source of uncertainty. Nonetheless, informing jurors as PCAST proposes hardly seems calculated to provide adequate and accurate information about the probative value of positive fingerprint identifications.

In the end, and regardless of what one thinks of the PCAST wording for the statistical information to give to a jury, it seems clear that 1/18 is not an acceptable summary of what research to date suggests for the false-positive probability of latent fingerprint identifications.

References
  1. Eric Williamson, Ushering in the Death of Junk Science, Oct. 31, 2016.
  2. Eric Williamson, Fallible Fingerprints: Law Professor Seeks to Shore Up the Science Used in Courts, UVAToday,  Nov. 11, 2016.
  3. Brandon Garrett, Calls for Limits on ‘Flawed Science’ in Court Are Well-founded: A Guest Post, Wash. Post, Sept. 20, 2016.
  4. See, e.g., Christophe Champod et al., Fingerprints and Other Ridge Skin Impressions (2d ed. 2016).
  5. Williamson, supra notes 1 & 2.
  6. NIST, Forensic Science Center of Excellence, June 12, 2014 (updated Aug. 26, 2016).
  7. See, e.g., Brandon L. Garrett, Convicting the Innocent: Where Criminal Prosecutions Go Wrong (2012).
  • Previous postings on friction skin ridge validation studies can be found by clicking on the labels "error" or "fingerprint."

Tuesday, December 6, 2016

The Scientific Basis for the 21-foot Rule in Police Shootings

Yesterday, the murder trial of South Carolina police officer Michael Slager for shooting Walter Scott, an African-American man fleeing from him, ended in a hung jury. 1/ Police (and their lawyers in cases in which officers face charges for unjustified shootings) sometimes refer to a “21-foot rule.” It is a rule of thumb that holds that “a suspect armed with an edged weapon [can] fatally engage an officer armed with a handgun within a distance of 21 feet.” 2/

According to one law review article, one Court of Appeals, in Sigman v. Town of Chapel Hill, held that a per se rule allowing police to shoot armed suspects within a 21-foot radius is constitutionally reasonable. 3/ Another law professor claims that police are generally trained “that they are permitted to shoot anyone who appears threatening or challenges them within that danger zone.” 4/

The legal or training “rule,” if there is one, has little bearing in a case in which a suspect is running away from the officer, but what of the underlying factual premise? Was the expert for the defendants in Sigman correct in testifying that “studies ... have shown that an armed individual within twenty-one feet of an officer still has time to get to the officer and stab and fatally wound the officer even if the officer has his weapon brandished and is prepared to or has fired a shot”? What studies have been conducted under what conditions?

The answer seems to be that no systematic body of scientific research exists and that no study at all addresses the danger zone for an officer with a weapon out and prepared to fire. At least, this is the conclusion of one police trainer as of 2014. Ron Martinelli explains that
The 21-foot rule was developed by Lt. John Tueller, a firearms instructor with the Salt Lake City Police Department. Back in 1983, Tueller set up a drill where he placed a "suspect" armed with an edged weapon 20 or so feet away from an officer with a holstered sidearm. He then directed the armed suspect to run toward the officer in attack mode. The training objective was to determine whether the officer could draw and accurately fire upon the assailant before the suspect stabbed him.

After repeating the drill numerous times, Tueller—who is now retired—wrote an article saying it was entirely possible for a suspect armed with an edged weapon to fatally engage an officer armed with a handgun within a distance of 21 feet. The so-called "21-Foot Rule" was born and soon spread throughout the law enforcement community.
Martinelli adds that although “Lt. John Tueller did us all a tremendous service in at least starting a discussion and educating us about action vs. reaction and perception-reaction lag, ... it is certainly time to move forward with a far more scientific analysis that actually seeks to support or reject this hypothesis.” Moreover, a scientifically informed rule of thumb would have to attend to a few variables known to the officer. As Martinelli observes,
Whether the "21-Foot Rule" is an applicable defense in an officer-involved shooting actually depends upon the facts and evidence of each case. The shooting of a knife-wielding suspect at less than 21 feet by an experienced, competent, and well-equipped officer who has the tactical advantage of an obstruction such as a police vehicle between herself and her attacker might be inappropriate. But the shooting of a knife-wielding assailant at more than 21 feet by an inexperienced officer, wearing a difficult holster system, with no obstructions between herself and the attacker might be justified.
Notes
  1. Mark Berman, Mistrial Declared in Case of South Carolina Officer Who Shot Walter Scott after Traffic Stop, Wash. Post, Dec. 5, 2016.
  2. Ron Martinelli, Revisiting the "21-Foot Rule", Police, Sept. 18, 2014.
  3. Nancy C. Marcus, Out of Breath and Down to the Wire: A Call for Constitution-Focused Police Reform, 59 Howard L.J. 5, 50 (2016) (citing Sigman v. Town of Chapel Hill, 161 F.3d 782 (4th Cir. 1998)). With respect to the 21-foot rule, however, the court of appeals held only that a municipality cannot be held liable for “deliberate indifference to or reckless disregard for the constitutional rights of persons” in teaching its officers “that an officer may use deadly force to stop a threatening individual armed with an edged weapon when that individual comes within 21 feet.” 161 F.3d at 23.
  4. Eric Miller, Rendering the Community, and the Constitution, Incomprehensible Through Police Training, Jotwell, Nov. 10, 2016.