Tuesday, December 27, 2016

NCFS Draft Views on "Statistical Statements in Forensic Testimony"


The period for comments on the second public draft of a proposed National Commission on Forensic Science (NCFS) views document on Statistical Statements in Forensic Testimony opened today and will close on January 25, 2017. Comments can be submitted at regulations.gov, Docket No. DOJ-LA-2016-0025.

The full document can be downloaded from https://www.regulations.gov/document?D=DOJ-LA-2016-0025-0001. It defines "statistical statements" broadly, to encompass "quantitative or qualitative statements [that] indicate the accuracy of measurements or observations and the significance of these findings." "These statistical statements," it explains, "may describe measurement accuracy (or conversely, measurement uncertainty), weight of evidence (the extent to which measurements or observations support particular conclusions), or the probability or certainty of the conclusions themselves."

The draft summarizes the views as follows (footnote omitted):
1. Forensic experts, both in their reports and in testimony, should present and describe the features of the questioned and known samples (the data), and similarities and differences in those features as well as the process used to arrive at determining them. The presentation should include statements of the limitations and uncertainties in the measurements or observations.

2. No one form of statistical calculation or statement is most appropriate to all forensic evidence comparisons or other inference tasks. Thus, the expert needs to be able to support, as part of a report and in testimony, the choice used in the specific analysis carried out and the assumptions on which it was based. When the statistical calculation relies on a specific database, the report should make clear which one and its relevance for the case at hand.

3. The expert should report the limitations and uncertainty associated with measurements and the inferences that could be drawn from them. This report might take the form of an interval for an estimated value, or of separate statements regarding errors and uncertainties associated with the analysis of the evidence. If the expert has no information on sources of error in measurements and inferences, the expert must state this fact.

4. Forensic science experts should not state that a specific individual or object is the source of the forensic science evidence and should make it clear that, even in circumstances involving extremely strong statistical evidence, it is possible that other individuals or objects could possess or have left a similar set of observed features. Forensic science experts should confine their evaluative statements to the support that the findings provide for the claim linked to the forensic evidence.

5. To explain the value of the data in addressing claims as to the source of a questioned sample, forensic examiners may:
A. Refer to relative frequencies of individual features in a sample of individuals or objects in a relevant population (as sampled and then represented in a reference database). The examiner should note the uncertainties in these frequencies as estimates of the frequencies of particular features in the population.

B. Present estimates of the relative frequency of an observed combination of features in a relevant population based on a probabilistic model that is well grounded in theory and data. The model may relate the probability of the combination to the probabilities of individual features.

C. Present probabilities (or ratios of probabilities) of the observed features under different claims as to the origin of the questioned sample. The examiner should note the uncertainties in any such values.

D. When the statistical statement is derived from an automated computer-based system for making classifications, present not only the classification but also the operating characteristics of the system (the sensitivity and specificity of the system as established in relevant experiments using data from a relevant population). If the expert has no information or limited information about such operating characteristics, the expert must state this fact.
6. Not all forensic subdisciplines currently can support a probabilistic or statistical statement. There may still be value to the factfinder in learning whatever comparisons the expert in those subdisciplines has carried out. But the absence of models and empirical evidence needs to be expressed both in testimony and written reports.
The document will be discussed at the January 2017 NCFS meeting. A final version should be up for a vote at the (final?) Commission meeting, on April 10-11, 2017.

Thursday, December 22, 2016

Realistically Testing Forensic Laboratory Performance in Houston

The Houston Forensic Science Center announced on November 17, 2016, that
HFSC Begins Blind Testing in DNA, Latent Prints, National First
This innovation -- said to be unique among forensic laboratories and to exceed the demands of accreditation -- does not refer to blind testing of samples from crime scenes. It is generally recognized that analysts should be blinded to information that they do not need in order to reach conclusions about the similarities and differences in crime-scene samples and samples from suspects or other persons of interest. One would hope that many laboratories already employ this strategy for managing unwanted sources of possible cognitive bias.

Perhaps confusingly, the Houston lab's announcement refers to "'blindly' test[ing] its analysts and systems, assisting with the elimination of bias while also helping to catch issues that might exist in the processes." More clearly stated, "[u]nder HFSC’s blind testing program analysts in five sections do not know whether they are performing real casework or simply taking a test. The test materials are introduced into the workflow and arrive at the laboratory in the same manner as all other evidence and casework."

A month earlier, the National Commission on Forensic Science unanimously recommended, as a research strategy, "introducing known-source samples into the routine flow of casework in a blinded manner, so that examiners do not know their performance is being studied." Of course, whether the purpose is research or instead what the Houston lab calls a "blind quality control program," the Commission noted that "highly challenging samples will be particularly valuable for helping examiners improve their skills." It is often said that existing proficiency testing programs not only fail to blind examiners to the fact that they are being tested, but also are only designed to test minimum levels of performance.

The Commission bent over backward to imply that the outcomes of the studies it proposed would not necessarily be admissible in litigation. It wrote that
To avoid unfairly impugning examiners and laboratories who participate in research on laboratory performance, judges should consider carefully whether to admit evidence regarding the occurrence or rate of error in research studies. If such evidence is admitted, it should only be under narrow circumstances and with careful explanation of the limitations of such data for establishing the probability of error in a given case.
The Commission's concern was that applying statistics from work with unusually difficult cases to more typical casework might overstate the probability of error in the less difficult cases. At the same time, its statement of views included a footnote implying that the defense should have access to the outcomes of performance tests:
[T]he results of performance testing may fall within the government’s disclosure obligations under Brady v Maryland, 373 U.S. 83 (1963). But the right of defendants to examine such evidence does not entail a right to present it in the courtroom in a misleading manner. The Commission is urging that courts give careful consideration to when and how the results of performance testing are admitted in evidence, not that courts deny defendants access to evidence that they have a constitutional right to review.
Using traditional proficiency test results, or the newer performance tests in which examiners are blinded to the fact that they are being tested in a given case (a better way to test proficiency), to impeach a laboratory's reported results raises interesting questions of relevance under Federal Rules of Evidence 403 and 404. See, e.g., Edward J. Imwinkelried & David H. Kaye, DNA Typing: Emerging or Neglected Issues, 76 Wash. L. Rev. 413 (2001).

Sunday, December 11, 2016

PCAST’s Sampling Errors (Part II: Getting More Technical)

The report to the President on Forensic Science in Criminal Courts: Ensuring Scientific Validity of Feature-Comparison Methods from the President’s Council of Advisors on Science and Technology emphasizes the need for research to assess the accuracy of the conclusions of criminalists who compare the features of identification evidence — things like fingerprints, toolmarks, hairs, DNA, and bitemarks. To this extent, it treads on firm (and well-worn) ground.

It also stresses the need to inform legal factfinders — the judges and jurors who try cases with the assistance of such evidence — of the false-positive error rates discovered in comprehensive and well designed studies with large sample sizes. This too is a laudable objective (although the sensitivity, which is the complement of the false-negative probability, also affects the probative value of the evidence and therefore should be incorporated into a presentation that indicates probative value).

When sample sizes are small, different studies could well generate very different false-positive rates if only because accuracy varies, both within and across examiners. Testing different samples of examiners at different times therefore will show different levels of performance. A common response to this sampling variability is a confidence interval (CI). A CI is intended to demarcate a range of possible values that might plausibly include the value that an ideal study of all examiners at all times would find.

The report advocates the use of an upper limit of a one-sided 95% CI for the false-positive rates instead of, or perhaps in addition to, the observed rates themselves. Some of the report’s statements about CIs and what to report are collected in an appendix to this posting. They have led to claims of a very large false-positive error rate for latent fingerprint identification. (See On a “Ridiculous” Estimate of an “Error Rate for Fingerprint Comparisons,” Dec. 10, 2016.)

A previous posting on "PCAST’s Sampling Errors" identified problems with the specifics of PCAST’s understanding of the meaning of a CI, with the technique it used to compute CIs for the performance studies it reviewed, and with the idea of substituting a worst-case scenario (the upper part of a CI) for the full range of the interval. Informing the factfinder of plausible variation both above and below the point estimate is fairer to all concerned, and that interval should be computed with techniques that will not create distracting arguments about the statistical acumen of the expert witness.

This posting elaborates on these concerns. Forensic scientists and criminalists who are asked to testify to error probabilities (as they should) need to be aware of the nuances lest they be accused of omitting important information or using inferior statistical methods. I provide a few tables that display PCAST’s calculations of the upper limits of one-sided 95% CIs; two-sided CIs computed with the same method for ascertaining the width of the intervals; and CIs with methods that PCAST listed but did not use.

The discussion explains how to use the statistical tool that PCAST recommended. It also shows that there is potential for confusion and manipulation in the reporting of confidence intervals. This does not mean that such intervals should not be used — quite the contrary, they can give a sense of the fuzziness that sampling error creates for point estimates. Contrary to the recommendation of the PCAST report, however, they should not be presented as if they give the probability that an error probability is as high as a particular value.

At best, being faithful to the logic behind confidence intervals, one can report that if the error probability were that large, then the probability of the observed error rate or a smaller one would have some designated value. To present the upper limit of a 95% CI as if it states that the probability is “at least 5%” that the false-positive error probability could be “as high as” the upper end of that CI — the phrasing used in the report (p. 153) — would be to succumb to the dreaded “transposition fallacy” that textbooks on statistics abjure. (See also David H. Kaye, The Interpretation of DNA Evidence: A Case Study in Probabilities, National Academies of Sciences, Engineering, and Medicine, Science Policy Decision-making Educational Modules, June 16, 2016.)

I. PCAST’s Hypothetical Cases

The PCAST Report gives examples of CIs for two hypothetical validation studies. In one, there are 25 tests of examiners’ ability to use features to classify pairs according to their source. This hypothetical experiment establishes that the examiners made x = 0 false positive classifications in n = 25 trials. The report states that “if an empirical study found no false positives in 25 individual tests, there is still a reasonable chance (at least 5 percent) that the true error rate might be as high as roughly 1 in 9.” (P. 153.)

I think I know what PCAST is trying to say, but this manner of expressing it is puzzling. Let the true error rate be some unknown, fixed value θ. The sentence in the report might amount to an assertion that the probability that θ is roughly 1 in 9 is at least 5%. In symbols,
Pr(θ ≈ 1/9 | x = 0, n = 25) ≥ 0.05. 
Or does it mean that the probability that θ is roughly 1/9 or less is 5% or more, that is,
Pr(θ ≤ 1/9 | x = 0, n = 25) ≥ 0.05? 
Or does it state that the probability that θ is roughly 1/9 or more is 5% or more, that is,
Pr(θ ≥ 1/9 | x = 0, n = 25) ≥ 0.05?

None of these interpretations can be correct. Theta is a parameter rather than a random variable. From the frequentist perspective of CIs, it does not have probabilities attached to it. What one can say is that, if an enormous number of identical experiments were conducted, each would generate some point estimate for the false positive probability θ in the population. Some of these estimates x/n would be a little higher than θ; some would be a little lower (if θ > 0); some would be a lot higher; some a lot lower (if θ ≫ 0); some would be spot on. If we constructed a 95% CI around each estimate, about 95% of them would cover the unknown θ, and about 5% would miss it. (I am not sure where the "at least 5%" in the PCAST report comes from.) Likewise, if we constructed a 90% CI for each experiment, about 90% would cover the unknown θ, and about 10% would miss it.
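This coverage property is easy to illustrate by simulation. The sketch below (my illustration, not part of the PCAST analysis) assumes a hypothetical true error rate θ = 0.10 and repeats the 25-trial experiment many times, computing an exact Clopper-Pearson interval by bisection on the binomial CDF each time and counting how often the interval covers θ:

```python
import math
import random

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def clopper_pearson(x, n, alpha=0.05):
    """Exact two-sided (1 - alpha) CI for a binomial proportion,
    found by bisection on the binomial CDF."""
    def bisect(f):
        lo, hi = 0.0, 1.0
        for _ in range(50):
            mid = (lo + hi) / 2
            if f(mid):
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    lower = 0.0 if x == 0 else bisect(lambda p: binom_cdf(x - 1, n, p) > 1 - alpha / 2)
    upper = 1.0 if x == n else bisect(lambda p: binom_cdf(x, n, p) > alpha / 2)
    return lower, upper

# Repeat the hypothetical experiment many times with a known theta and
# count how often the 95% CI covers it.
random.seed(1)
theta, n, trials = 0.10, 25, 500
covered = 0
for _ in range(trials):
    x = sum(random.random() < theta for _ in range(n))
    lo, hi = clopper_pearson(x, n)
    covered += lo <= theta <= hi
print(covered / trials)  # at least about 0.95; Clopper-Pearson is conservative
```

No single interval from one run has a 95% probability of containing θ; it is the long-run proportion of intervals, across repeated experiments, that behaves this way.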

In assessing a point estimate, the width of the interval is what matters. At a given confidence level, wider intervals signal less precision in the estimate. The 90% CI would be narrower than the 95% one — it would have the appearance of greater precision, but that greater precision is illusory.  It just means that we can narrow the interval by being less “confident” about the claim that it captures the true value θ. PCAST uses 95% confidence because 95% is a conventional number in many (but not all) fields of scientific inquiry -- a fact that greatly impressed Justice Breyer in oral argument in Hall v. Florida.

In sum, 95% two-sided CIs are useful for manifesting the fuzziness that surrounds any point estimate because of sampling error. But they do not answer the question of what probability we should attach to the claim that “the true error rate might be as high as roughly 1 in 9.” Even with a properly computed CI that goes from 0 to 1/9 (which we can write as [0, 1/9]), the most we can say is that the process used to generate such CIs will err about 5% of the time. (It is misleading to claim that it will err “at least 5%” of the time.) We have a single CI, and we can plausibly propose that it is one of those that does not err. But we cannot say that 95% or fewer of the intervals [0, 1/9] will capture θ. Hence, we cannot say that “there is ... at least [a] 5 percent [chance] that the true error rate [θ] might be as high as roughly 1 in 9.”

To reconstruct how PCAST computed “roughly 1 in 9,” we can use EpiTools, the web-based calculator that the report recommended. This tool does not have an explicit option for one-sided intervals, but a 95% one-sided CI leaves 5% of the probability mass in a single tail. Hence, the upper limit is equal to that for a two-sided 90% CI. This is so because a 90% CI places one-half of the 10% of the mass that it does not cover in each tail. Figure 1 shows the logic.
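When no errors at all are observed, the Clopper-Pearson bound has a simple closed form, so "roughly 1 in 9" can be checked without EpiTools. The upper limit is the largest θ for which observing zero errors in n trials still has probability at least α, i.e., the solution of (1 − θ)^n = α. A minimal sketch:

```python
def cp_upper_zero(n, alpha):
    """Clopper-Pearson upper bound when x = 0 errors are observed in n trials:
    solve (1 - theta)**n = alpha for theta."""
    return 1 - alpha ** (1 / n)

# One-sided 95% bound (alpha = 0.05); equals the upper limit of a two-sided 90% CI.
print(round(cp_upper_zero(25, 0.05), 4))   # 0.1129, roughly 1 in 9
# Upper limit of the conventional two-sided 95% CI (alpha = 0.025 in the upper tail).
print(round(cp_upper_zero(25, 0.025), 4))  # 0.1372, roughly 1 in 7
```

The first call reproduces PCAST's figure; the second shows how the one-sided convention shrinks the upper limit relative to the two-sided 95% interval.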

Figure 1. Graphical Representation of PCAST's
Upper Limit of a One-sided 95% CI (not drawn to scale)
 95% CI (two-sided)
       **
      *****
     ********
    ************
   ****************
*************************
--[--------------------]------------> x/n
Lower 2.5% region       Upper 2.5% region

 90% CI (two-sided)
       **
      *****
     ********
    ************
  *****************
*************************
---[---------------]--------------> x/n
Lower 5% region     Upper 5% region

PCAST: Report the interval from
the observed x/n to the start of the upper 5% region? 
Or just report where the upper 5% region starts?

---[---------------]--------------> x/n
Lower 5% region     Upper 5% region

Inputting n = 25, x = 0, and confidence = 90% yields the results in Table 1.

Table 1. Checking and Supplementing PCAST’s CIs
in Its First Hypothetical Case (x/n = 0/25)

Method            Upper one-sided 95% CI limit    Upper two-sided 95% CI limit
Clopper-Pearson   0.1129 ≈ 1/9                    0.1372 ≈ 1/7
Wilson            0.0977 ≈ 1/10                   0.1332 ≈ 1/8
Jeffreys          0.0732 ≈ 1/14                   0.0947 ≈ 1/11
Agresti-Coull     0.1162 ≈ 1/9                    0.1576 ≈ 1/6

The Wilson and Jeffreys methods, which are recommended in the statistics literature cited by PCAST, give smaller upper bounds on the false-positive error rate than the Clopper-Pearson method that PCAST uses. The recommended numbers are 1 in 10 and 1 in 14 compared to PCAST’s 1 in 9.

On the other hand, the upper limits of the two-sided 95% CIs for the Wilson and Jeffreys methods are 1 in 8 and 1 in 11. In this example (and in general), using the one-sided interval has the opposite effect from what PCAST may have intended. PCAST wants to keep juries from hearing false-positive error probabilities that are below θ, at the cost of giving them estimates that are above θ. But as Figure 1 illustrates, using the upper limit of a 90% two-sided interval to arrive at a one-sided 95% CI lowers the upper limit compared to the two-sided 95% CI. PCAST’s one-sided intervals do less, not more, to protect defendants from estimates of the false-positive error probability that are too low.

Of course, I am being picky. What is the difference if the expert reports that the false-positive probability could be 1 in 14 instead of PCAST’s upper limit of 1 in 9? The jury will receive a similar message — with perfect scores on only 25 relevant tests of examiners, one cannot plausibly claim that examiners rarely err. Another way to make this point is to compute the probability of the study’s finding so few false positives (x/n = 0/25) when the probability of an error on each independent test is 1 in 9 (or any other number that one likes). If 1/9 is the binomial probability of a false-positive error, then the chance of x = 0 errors in 25 tests of (unbeknownst to the examiners) different-source specimens is (1 − 1/9)^25 ≈ 0.053.
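The closing arithmetic takes one line to verify. Under a binomial model with a hypothesized per-comparison false-positive probability of 1/9, the chance of a clean sweep in 25 trials is:

```python
p_fp = 1 / 9   # hypothesized false-positive probability per comparison
n = 25         # independent comparisons of different-source specimens
p_no_errors = (1 - p_fp) ** n
print(round(p_no_errors, 3))  # 0.053
```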

At this point, you may say, wait a minute, this is essentially the 5% figure given by PCAST. Indeed, it is, but it has a very different meaning. It indicates how often one would see studies with a zero false-positive error rate when the unknown, true rate is actually 1/9. In other words, it is the p-value for the data from the experiment when the hypothesis θ = 1/9 is true. (This is a one-tailed p-value. The expected number of false positives is 25(1/9) = 2.8. The probability of 6 or more false positives is 5.3%. So the two-tailed p-value is 10.5%. Only about 1 time in 10 would studies with 25 trials produce outcomes that diverged at least as much as this one did from what is statistically expected if the probability of error in each trial with different-source specimens really is 1/9.) Thus, the suggestion that the true probability is 1/9 is not grossly inconsistent with the data. But the data are not highly probable under the hypothesis that the false-positive error probability is 1/9 (or any larger number).
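The tail probabilities in the parenthetical can be reproduced from the binomial distribution; a sketch using only the standard library:

```python
import math

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 25, 1 / 9
p_zero = binom_pmf(0, n, p)                                 # the observed outcome, x = 0
p_six_plus = 1 - sum(binom_pmf(k, n, p) for k in range(6))  # the opposite tail, x >= 6
print(round(p_zero, 3))               # 0.053 (one-tailed p-value)
print(round(p_zero + p_six_plus, 3))  # 0.105 (two-tailed p-value)
```

By coincidence, the two tails are nearly equal here, which is why doubling the one-tailed value gives almost exactly the 10.5% figure.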

PCAST also gives some numbers for another hypothetical case. The report states that
[I]f a study finds zero false positives in 100 tries, the four methods mentioned give, respectively, the values 0.030, 0.026, 0.032, and 0.019 for the upper 95 percent confidence bound. From a scientific standpoint, any of these might appropriately be reported to a jury in the context "the false positive rate might be as high as."
The output of EpiTools is included in Table 2.

Table 2. Supplementing PCAST’s CIs
in Its Second Hypothetical Case (x/n = 0/100)

Method            Upper one-sided 95% CI limit    Upper two-sided 95% CI limit
Clopper-Pearson   0.0295 ≈ 1/34                   0.0362 ≈ 1/28
Wilson            0.0263 ≈ 1/38                   0.0370 ≈ 1/27
Jeffreys          0.0190 ≈ 1/53                   0.0247 ≈ 1/40
Agresti-Coull     0.0317 ≈ 1/32                   0.0444 ≈ 1/23

Again, to the detriment of defendants, the recommended methods give smaller upper bounds than a more appropriate two-sided interval. For example, PCAST’s 90% Clopper-Pearson interval is [0, 3%], while the 95% interval is [0, 4%]. A devotee of the Jeffreys method (or a less scrupulous expert witness seeking to minimize the false-positive error risk) could report the smallest interval of [0, 2%].
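Three of the four one-sided bounds PCAST quotes for zero false positives in 100 tries have closed forms (the Jeffreys bound requires a beta quantile, so I omit it here). A sketch, using z = 1.6449 as the standard normal 95th percentile:

```python
import math

Z = 1.6449  # standard normal 95th percentile, for a one-sided 95% bound

def cp_upper_zero(n, alpha=0.05):
    # Clopper-Pearson upper bound when x = 0: solve (1 - p)**n = alpha
    return 1 - alpha ** (1 / n)

def wilson_upper(x, n, z=Z):
    # Upper limit of the Wilson score interval
    p = x / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center + half

def agresti_coull_upper(x, n, z=Z):
    # Upper limit of the Agresti-Coull (adjusted Wald) interval
    n_t = n + z * z
    p_t = (x + z * z / 2) / n_t
    return p_t + z * math.sqrt(p_t * (1 - p_t) / n_t)

print(round(cp_upper_zero(100), 3),         # 0.030
      round(wilson_upper(0, 100), 3),       # 0.026
      round(agresti_coull_upper(0, 100), 3))  # 0.032
```

These match the 0.030, 0.026, and 0.032 that the report attributes to the Clopper-Pearson, Wilson, and Agresti-Coull methods.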

II. Two Real Studies with Larger Sample Sizes

The two hypotheticals are extreme cases. Sample sizes are small, and the outcomes are extreme — no false-positive errors at all. Let’s look at examples from the report that involve larger samples and higher observed error rates.

The report notes that in Langenburg, Champod, and Genessay (2012),
For the non-mated pairs, there were 17 false positive matches among 711 conclusive examinations by the experts. The false positive rate was 2.4 percent (upper 95 percent confidence bound of 3.5 percent). The estimated error rate corresponds to 1 error in 42 cases, with an upper bound corresponding to 1 error in 28 cases.
P. 93 (notes omitted). Invoking EpiTools, one finds the values shown in Table 3.

Table 3. Supplementing PCAST’s CIs
for Langenburg et al. (2012) (x/n = 17/711)

Method            Upper one-sided 95% CI limit    Two-sided 95% CI lower limit    Two-sided 95% CI upper limit
Clopper-Pearson   0.0356 ≈ 1/28                   0.0140 ≈ 1/71                   0.0380 ≈ 1/26
Wilson            0.0353 ≈ 1/28                   0.0150 ≈ 1/67                   0.0380 ≈ 1/26
Jeffreys          0.0348 ≈ 1/29                   0.0145 ≈ 1/69                   0.0371 ≈ 1/27
Agresti-Coull     0.0355 ≈ 1/28                   0.0147 ≈ 1/68                   0.0382 ≈ 1/26

As one would expect for a larger sample, all the usual methods for producing a binomial CI generate similar results. The PCAST one-sided interval is [2.4%, 3.6%] compared to a two-sided interval of [1.4%, 3.8%].
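The Wilson row of Table 3 can be reproduced by hand. A sketch of the score-interval formula for the two-sided 95% CI (z = 1.96, the standard normal 97.5th percentile):

```python
import math

def wilson_interval(x, n, z=1.96):
    """Two-sided Wilson score interval for a binomial proportion;
    z = 1.96 gives 95% confidence."""
    p = x / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

lo, hi = wilson_interval(17, 711)
print(round(lo, 4), round(hi, 4))  # 0.0150 0.0380
```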

For a final example, we can consider PCAST's approach to sampling error for the FBI-Noblis study of latent fingerprint identification. The report describes two related studies of latent fingerprints, both with large sample sizes and small error rates, but I will only provide additional calculations for the 2011 one. The report describes it as follows:
The authors assembled a collection of 744 latent-known pairs, consisting of 520 mated pairs and 224 non-mated pairs. To attempt to ensure that the non-mated pairs were representative of the type of matches that might arise when police identify a suspect by searching fingerprint databases, the known prints were selected by searching the latent prints against the 58 million fingerprints in the AFIS database and selecting one of the closest matching hits. Each of 169 fingerprint examiners was shown 100 pairs and asked to classify them as an identification, an exclusion, or inconclusive. The study reported 6 false positive identifications among 3628 nonmated pairs that examiners judged to have “value for identification.” The false positive rate was thus 0.17 percent (upper 95 percent confidence bound of 0.33 percent). The estimated rate corresponds to 1 error in 604 cases, with the upper bound indicating that the rate could be as high as 1 error in 306 cases.
The experiment is described more fully in Fingerprinting Under the Microscope: A Controlled Experiment on the Accuracy and Reliability of Latent Print Examinations (Part I), Apr. 26, 2011, et seq. More detail on the one statistic selected for discussion in the PCAST report is in Table 4.

Table 4. Supplementing PCAST’s CIs
for Ulery et al. (2011) (x/n = 6/3628)

Method            Upper one-sided 95% CI limit    Two-sided 95% CI lower limit    Two-sided 95% CI upper limit
Clopper-Pearson   0.0033 ≈ 1/303                  0.0006 ≈ 1/1667                 0.0036 ≈ 1/278
Wilson            0.0032 ≈ 1/313                  0.0008 ≈ 1/1250                 0.0036 ≈ 1/278
Jeffreys          0.0031 ≈ 1/323                  0.0007 ≈ 1/1423                 0.0034 ≈ 1/294
Agresti-Coull     0.0033 ≈ 1/303                  0.0007 ≈ 1/1423                 0.0037 ≈ 1/270

The interval that PCAST seems to recommend is [0.17%, 0.33%]. The two-sided interval is [0.06%, 0.36%]. Whereas PCAST would have the expert testify that the error rate is "as high as 1 error in 306 cases," or perhaps between "1 error in 604 cases [and] 1 error in 306 cases," the two-sided 95% CI extends from 1 in 1667 on up to 1 in 278.
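When some errors are observed (here x = 6), the Clopper-Pearson bound no longer has a simple closed form, but it is easy to find by bisection on the binomial CDF, since the probability of seeing x or fewer errors falls as θ rises. A sketch reproducing the 0.33% upper bound:

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def cp_upper(x, n, alpha=0.05, iters=60):
    """One-sided upper (1 - alpha) Clopper-Pearson bound: the theta at which
    observing x or fewer errors has probability exactly alpha."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if binom_cdf(x, n, mid) > alpha:
            lo = mid   # tail probability still too large; theta must be higher
        else:
            hi = mid
    return (lo + hi) / 2

print(round(cp_upper(6, 3628), 4))  # 0.0033, about 1 in 300
```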

Although PCAST's focus on the high end of estimated error rates strikes me as tendentious, a possible argument for it is psychological and legal. If jurors anchor on the first number they hear, such as 1/1667, they will underestimate the false-positive error probability. By excluding lower limits from the presentation, we avoid that possibility. Of course, we also increase the risk that jurors will overestimate the effect of sampling error. A better solution to this perceived problem might be to present the interval from highest to lowest rather than ignoring the low end entirely.

APPENDIX

PCAST is less than pellucid on whether a judge or jury should learn of (1) the point estimate together with an upper limit or (2) only the upper limit. The Council seems to have rejected the more conventional approach of giving both end points of a two-sided CI. The following assertions suggest that the Council favored the second, and more extreme, approach:
  • "[T]o inform jurors that [the] only two properly designed studies of the accuracy of latent fingerprint analysis ... found false positive rates that could be as high as 1 in 306 in one study and 1 in 18 in the other study ... would appropriately inform jurors that errors occur at detectable frequencies, allowing them to weigh the probative value of the evidence.” (P. 96.)
  • The report states that "the other (Miami-Dade 2014 study) yielded a considerably higher false positive rate of 1 in 18." (P. 96.) Yet, 1 in 18 was not the false-positive rate observed in the study, but, at best, the upper end of a one-sided 95% CI above this rate.
Other parts of the report point in the other direction or are simply ambiguous. Examples follow:
  • If firearms analysis is allowed in court, the scientific criteria for validity as applied should be understood to require clearly reporting the error rates seen in appropriately designed black-box studies (estimated at 1 in 66, with a 95 percent confidence limit of 1 in 46, in the one such study to date). (P. 112)
  • Studies designed to estimate a method’s false positive rate and sensitivity are necessarily conducted using only a finite number of samples. As a consequence, they cannot provide “exact” values for these quantities (and should not claim to do so), but only “confidence intervals,” whose bounds reflect, respectively, the range of values that are reasonably compatible with the results. When reporting a false positive rate to a jury, it is scientifically important to state the “upper 95 percent one-sided confidence bound” to reflect the fact that the actual false positive rate could reasonably be as high as this value. 116/ (P. 53)
  • The upper confidence bound properly incorporates the precision of the estimate based on the sample size. For example, if a study found no errors in 100 tests, it would be misleading to tell a jury that the error rate was 0 percent. In fact, if the tests are independent, the upper 95 percent confidence bound for the true error rate is 3.0 percent. Accordingly a jury should be told that the error rate could be as high as 3.0 percent (that is, 1 in 33). The true error rate could be higher, but with rather small probability (less than 5 percent). If the study were much smaller, the upper 95 percent confidence limit would be higher. For a study that found no errors in 10 tests, the upper 95 percent confidence bound is 26 percent—that is, the actual false positive rate could be roughly 1 in 4 (see Appendix A). (P. 53 n. 116.)
  • In summarizing these studies, we apply the guidelines described earlier in this report (see Chapter 4 and Appendix A). First, while we note (1) both the estimated false positive rates and (2) the upper 95 percent confidence bound on the false positive rate, we focus on the latter as, from a scientific perspective, the appropriate rate to report to a jury—because the primary concern should be about underestimating the false positive rate and the true rate could reasonably be as high as this value. 262/ (P. 92.)
  • Since empirical measurements are based on a limited number of samples, SEN and FPR cannot be measured exactly, but only estimated. Because of the finite sample sizes, the maximum likelihood estimates thus do not tell the whole story. Rather, it is necessary and appropriate to quote confidence bounds within which SEN, and FPR, are highly likely to lie. (P. 152.)
  • For example, if a study finds zero false positives in 100 tries, the [Clopper-Pearson/Exact Binomial method, the Wilson Score interval, the Agresti-Coull (adjusted Wald) interval, and the Jeffreys interval] give, respectively, the values 0.030, 0.026, 0.032, and 0.019 for the upper 95 percent confidence bound. From a scientific standpoint, any of these might appropriately be reported to a jury in the context “the false positive rate might be as high as.” (In this report, we used the Clopper-Pearson/Exact Binomial method.) (P. 153.)

Saturday, December 10, 2016

On a “Ridiculous” Estimate of an “Error Rate for Fingerprint Comparisons”

On November 1, Professor Christophe Champod tweeted
“The White House report noted a study showing a 1 in 18 error rate for fingerprint comparison.” This is ridiculous.
He was referring to an interview conducted for the University of Virginia School of Law 1/ and later published on the campus news site as Fallible Fingerprints. 2/ In the interview, Professor Brandon Garrett observed that
For years, courts have grandfathered in techniques like fingerprinting that provide valuable information, but with error rates and probative value that have simply not been adequately studied. For example, the White House report noted a study showing a one-in-18 error rate for fingerprint comparison and another showing a one-in-six error rate for bite-mark comparison. A 2009 report by the National Academy of Sciences carefully detailed how much of the forensic evidence used was without “any meaningful scientific validation.” Little changed.
In a guest editorial in the Washington Post, 3/ he disseminated the same statistic on the accuracy of positive associations from latent prints:
Any human technique has an error rate, and a crucial quality control is to do testing to find out how good experts really are. It is not enough for fingerprint or bite-mark examiners to vouch for their own reliability. We must put their experience to the test. The few tests that have been done show disturbing error rates. For example, the White House report highlights a study showing a 1 in 18 error rate for fingerprint comparison and another showing a shocking 1 in 6 error rate for bite marks.
The general complaint is understandable. Courts have used a low bar in allowing criminalists to testify to unjustified — and exaggerated — claims of certainty for various forms of identification evidence. But is it true that little has changed since 2009 in the field of research into the probative value of latent fingerprint identification? The “White House report” is the study of the President’s Council of Advisers on Science and Technology (PCAST) released in September 2016 (that is the subject of other postings, pro and con, on this blog). The report lists four “early studies,” and two “black box” studies. Every study is dated 2009 or later. Despite this growing body of evidence on the accuracy of traditional latent fingerprint examinations, the courts — and the members of the bar — have not generally considered the need to use the experiments in explaining the uncertainty in latent print identifications to jurors. In other words, we have better science but not necessarily better law.

How to use the data from the studies is therefore a matter of emerging importance. To employ the research correctly, one needs to consider it as a whole — not to cherry pick results to support the positions of prosecutors or defense counsel. This makes it important to consider whether Professor Champod is right when he dismisses the 1 in 18 figure as ridiculous. There are a priori reasons to trust his judgment. As a forensic scientist at the University of Lausanne, he has been a progressive force in making latent print identification more scientific and in developing ways to incorporate a recognition of uncertainty into the reports of examiners. 4/

At the same time, Professor Garrett “is a principal investigator of UVA’s year-old Center for Statistics and Applications in Forensics Evidence, which is generating new research about forensic analysis and sharing best practices in order to facilitate justice.” 5/ Surely, the principal investigators for the government-funded Forensic Science Center of Excellence should be supplying balanced assessments as they perform their mission to “evaluate and solidify the statistical foundation for fingerprint, firearm, toolmark, and other pattern evidence analyses” so as to “allow forensic scientists to quantify the level of confidence they have in statistical computations made with these methods and the conclusions reached from those analyses.” 6/ Furthermore, Professor Garrett is a leading scholar of criminal procedure and an astute analyst of the factors that can produce wrongful convictions. 7/

Professor Garrett is right to insist that “[a]ny human technique has an error rate, and [it is] crucial ... to do testing ... . It is not enough for fingerprint or bite-mark examiners to vouch for their own reliability [and validity].” Still, I have to side with Professor Champod. The figure of 1 in 18 for false identifications, as a summary of the studies into the validity of latent fingerprint identification, is incredibly inflated.

To give a fair picture of the studies, one cannot just pick out one extreme statistic from one study as representative. Table 1 in the report presents the false positives for “black-box studies” as follows:


Study | Raw Data | Frequency (Upper Confidence Bound) | Estimated Rate | Upper Bound on Rate
Ulery et al. 2011 (FBI) | 6/3628 | 0.17% (0.33%) | 1 in 604 | 1 in 306
Pacheco et al. 2014 (Miami-Dade) | 42/995 | 4.2% (5.4%) | 1 in 24 | 1 in 18
Pacheco et al. 2014 (Miami-Dade, excluding clerical errors) | 7/960 | 0.7% (1.4%) | 1 in 137 | 1 in 73
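The rates and bounds in the table above can be approximately reproduced from the raw counts alone. The sketch below assumes the one-sided 95% Clopper-Pearson bound (the method PCAST says it used for such calculations); small rounding differences from the table are possible:

```python
# Sketch: recomputing the black-box studies' false-positive rates and upper bounds
# from the raw counts, assuming a one-sided 95% Clopper-Pearson bound.
from scipy.stats import beta

# Raw counts (false positives / comparisons) as given in the table
studies = {
    "Ulery et al. 2011 (FBI)": (6, 3628),
    "Pacheco et al. 2014 (Miami-Dade)": (42, 995),
    "Pacheco et al. 2014 (excluding clerical errors)": (7, 960),
}

bounds = {}
for name, (x, n) in studies.items():
    point = x / n                          # observed false-positive frequency
    upper = beta.ppf(0.95, x + 1, n - x)   # one-sided 95% Clopper-Pearson bound
    bounds[name] = (point, upper)
    print(f"{name}: {point:.2%} observed; bound {upper:.2%} (about 1 in {1/upper:.0f})")
```

Running this recovers upper bounds of roughly 1 in 306, 1 in 18, and 1 in 73, matching the table's last column.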


An expert witness who presented only the 1 in 18 figure would be the target of withering cross-examination. In its summary of the studies, PCAST treated 1 in 18 as one of two equally important figures. It recommended (on page 96) that
Overall, it would be appropriate to inform jurors that (1) only two properly designed studies of the accuracy of latent fingerprint analysis have been conducted and (2) these studies found false positive rates that could be as high as 1 in 306 in one study and 1 in 18 in the other study. This would appropriately inform jurors that errors occur at detectable frequencies, allowing them to weigh the probative value of the evidence.
Although slightly more complete, this presentation also would expose an expert to major problems on cross-examination. First, the two studies are not of equal quality. The FBI-Noblis study, which produced the smaller error rate, is plainly better designed and entitled to more credence. Second, the 1 in 18 figure counts a large number of what were said to be “clerical errors.” Third, the upper bounds are not unbiased estimates. The point estimates are 1 in 604, 1 in 24 (with the alleged clerical mistakes), and 1 in 137 (without them). Fourth, verification by a second examiner (who should be blinded to the outcome of the first examination) would drastically reduce these rates.

There also are countervailing considerations. For example, the PCAST upper bound only considers sampling error. Extrapolating from the experiments to practice is a larger source of uncertainty. Nonetheless, informing jurors as PCAST proposes hardly seems calculated to provide adequate and accurate information about the probative value of positive fingerprint identifications.

In the end, and regardless of what one thinks of the PCAST wording for the statistical information to give to a jury, it seems clear that 1/18 is not an acceptable summary of what research to date suggests for the false-positive probability of latent fingerprint identifications.

References
  1. Eric Williamson, Ushering in the Death of Junk Science, Oct. 31, 2016.
  2. Eric Williamson, Fallible Fingerprints: Law Professor Seeks to Shore Up the Science Used in Courts, UVAToday,  Nov. 11, 2016.
  3. Brandon Garrett, Calls for Limits on ‘Flawed Science’ in Court Are Well-founded: A Guest Post, Wash. Post, Sept. 20, 2016.
  4. See, e.g., Christophe Champod et al., Fingerprints and Other Ridge Skin Impressions (2d ed. 2016).
  5. Williamson, supra notes 1 & 2.
  6. NIST, Forensic Science Center of Excellence, June 12, 2014 (updated Aug. 26, 2016).
  7. See, e.g., Brandon L. Garrett, Convicting the Innocent: Where Criminal Prosecutions Go Wrong (2012).
  • Index to this blog's comments on the PCAST report and (cases discussing it)
  • Previous postings on the FBI-Noblis, Miami Dade, and Tangen et al. studies can be found by clicking on the labels "error" or "fingerprint."

Tuesday, December 6, 2016

The Scientific Basis for the 21-foot Rule in Police Shootings

Yesterday, the murder trial of North Carolina police officer Michael Slager for shooting Walter Scott, an African-American man fleeing from him, ended in a hung jury. 1/ Police (and their lawyers in cases in which officers face charges for unjustified shootings) sometimes refer to a “21-foot rule.” It is a rule of thumb that holds that “a suspect armed with an edged weapon [can] fatally engage an officer armed with a handgun within a distance of 21 feet.” 2/

According to one law review article, one Court of Appeals, in Sigman v. Town of Chapel Hill, held that a per se rule allowing police to shoot armed suspects within a 21-foot radius is constitutionally reasonable. 3/ Another law professor claims that police are generally trained “that they are permitted to shoot anyone who appears threatening or challenges them within that danger zone.” 4/

The legal or training “rule,” if there is one, has little bearing in a case in which a suspect is running away from the officer, but what of the underlying factual premise? Was the expert for the defendants in Sigman correct in testifying that “studies ... have shown that an armed individual within twenty-one feet of an officer still has time to get to the officer and stab and fatally wound the officer even if the officer has his weapon brandished and is prepared to or has fired a shot"? What studies have been conducted under what conditions?

The answer seems to be that no systematic body of scientific research exists and that no study at all addresses the danger zone for an officer with a weapon out and prepared to fire. At least, this is the conclusion of one police trainer as of 2014. Ron Martinelli explains that
The 21-foot rule was developed by Lt. John Tueller, a firearms instructor with the Salt Lake City Police Department. Back in 1983, Tueller set up a drill where he placed a "suspect" armed with an edged weapon 20 or so feet away from an officer with a holstered sidearm. He then directed the armed suspect to run toward the officer in attack mode. The training objective was to determine whether the officer could draw and accurately fire upon the assailant before the suspect stabbed him.

After repeating the drill numerous times, Tueller—who is now retired—wrote an article saying it was entirely possible for a suspect armed with an edged weapon to fatally engage an officer armed with a handgun within a distance of 21 feet. The so-called "21-Foot Rule" was born and soon spread throughout the law enforcement community.
Martinelli adds that although “Lt. John Tueller did us all a tremendous service in at least starting a discussion and educating us about action vs. reaction and perception-reaction lag, ... it is certainly time to move forward with a far more scientific analysis that actually seeks to support or reject this hypothesis.” Moreover, a scientifically informed rule of thumb would have to attend to a few variables known to the officer. As Martinelli observes,
Whether the "21-Foot Rule" is an applicable defense in an officer-involved shooting actually depends upon the facts and evidence of each case. The shooting of a knife-wielding suspect at less than 21 feet by an experienced, competent, and well-equipped officer who has the tactical advantage of an obstruction such as a police vehicle between herself and her attacker might be inappropriate. But the shooting of a knife-wielding assailant at more than 21 feet by an inexperienced officer, wearing a difficult holster system, with no obstructions between herself and the attacker might be justified.
Notes
  1. Mark Berman, Mistrial Declared in Case of South Carolina Officer Who Shot Walter Scott after Traffic Stop, Wash. Post, Dec. 5, 2016.
  2. Ron Martinelli, Revisiting the "21-Foot Rule", Police, Sept. 18, 2014.
  3. Nancy C. Marcus, Out of Breath and Down to the Wire: A Call for Constitution-Focused Police Reform, 59 Howard L.J. 5, 50 (2016) (citing Sigman v. Town of Chapel Hill, 161 F.3d 782 (4th Cir. 1998)). With respect to the 21-foot rule, however, the court of appeals held only that a municipality cannot be held liable for “deliberate indifference to or reckless disregard for the constitutional rights of persons” in teaching its officers “that an officer may use deadly force to stop a threatening individual armed with an edged weapon when that individual comes within 21 feet.” 161 F.3d at 23.
  4. Eric Miller, Rendering the Community, and the Constitution, Incomprehensible Through Police Training, Jotwell, Nov. 10, 2016.

Monday, November 28, 2016

"We Can Predict Your Face" and Put It on a Billboard

An article entitled Craig Venter’s Latest Production (Arlene Weintraub, MIT Technology Review, Sept.-Oct. 2016, at 94) reports that
At Human Longevity Inc. (HLI) in La Jolla, California, more than two dozen machines work around the clock, sequencing one human genome every 15 minutes at a cost of under $2,000 per genome. The whole operation fits comfortably in three rooms. Back in 2000, when its founder, J. Craig Venter, first sequenced a human genome, it cost $100 million and took a building-size, $50 million computer nine months to complete. [¶] Venter’s goal is to sequence at least one million genomes, something that seems likely to take the better part of a decade ...

Seated behind his desk in his office two floors above the sequencing lab, his red poodle Darwin sleeping quietly at his feet, Venter has pulled up images on his computer that show how in one early experiment HLI scientists were able to sequence 1,000 people’s genomes and then reconstruct their faces solely on the basis of the genetic data. “We can predict your face, your height, your body mass index, your eye color, your hair color and texture,” he says, marveling at how closely one of the reconstructed faces matches the photo of the actual study participant.
It would be nice to have a decently designed study of how well all the 1,000 (adult?) faces match the photos. As Richard Feynman once told his students,
The first principle is that you must not fool yourself — and you are the easiest person to fool.
But why wait for whole genome sequencing to "predict" faces? Scores of police agencies already use a different company's product. Parabon Snapshot provides pictures as part of a "scientifically objective description" so "you can conduct your investigation more efficiently and close cases more quickly." Ellen Greytak, director of bioinformatics for Parabon, says that "So far, we've done more than sixty different cases, and we've also done evaluations at the local, state, federal and international levels." In fact, "we've had one conviction and a few other arrests." Michael Roberts, Could DNA Imaging Used in Bennett Family Murder Break JonBenet Case?, Westword, Sept. 16, 2016. Although "none of the police agencies in question has gone public with the technology's role in the cases thus far, she teases that an announcement about a success is pending." Id.

"The composite isn't intended to be like a driver's license photo, but it will bear a resemblance. And if you have a list of 1,000 people who were nearby that day, you can put the ones that match the most at the top, and the ones that match the least at the bottom." Id. (quoting Greytak). Parabon's website explains that this achievement comes from "using deep data mining and advanced machine learning algorithms in a specialized bioinformatics pipeline."

Maybe I missed it, but I saw no references to cross-validation studies of whatever oozed out of the pipeline. Nevertheless, "Snapshot trait predictions are presented with a corresponding measure of confidence, which reflects the degree to which such factors influence each particular trait. Traits, such as eye color, that are highly heritable (i.e., are not greatly affected by environmental factors) are predicted with higher accuracy and confidence than those that have lower heritability; these differences are shown in the confidence metrics that accompany each Snapshot trait prediction."

A "confidence metric" seems to be missing from an unusual advertising campaign called "The Face of Litter" in Hong Kong:
Using the “Snapshot” DNA phenotyping services of a company called Parabon Nanolabs, Ogilvy [a marketing firm] collected litter from the streets and using DNA obtained from the litter, created profiles of the offenders ... . These profiles are now posted on outdoor ads at bus stops, subway and train stations, and even on highway billboards.
Nanalyze, Parabon Nanolabs and DNA Phenotyping, Apr. 30, 2015. Parabon calls this a "social experiment," calling to mind the dismissive remark, "That's not an experiment, it's an experience."

(Last updated 11/29/16)

Saturday, November 26, 2016

People v. Ramsaran: How Not to Evaluate a DNA Mixture

Did Ganesh Ramsaran kill his wife, Jennifer, after dropping their children off at school on a chilly morning in December 2012? At around 7:54 that night, he called the police and told them that Jennifer was missing. He said she had left their home in New Berlin, New York, at 10 that morning to go shopping in Syracuse, about 60 miles away. He expected her home at around 5:00 p.m.

When she did not show up, he wasted little time. At 5:30, he called his father-in-law to say that he was going to call the police. When he called the police a few hours later, he was “adamant something terrible had happened.” He said they had a perfect marriage.

A program manager for IBM, Ganesh later told police that, after returning from the school, he worked from home on his computers. However, a forensic investigator determined that one of Ganesh's work computers was not used at all that day and the other was not used between 8:08 a.m. and 6:31 p.m. (except for the automatic installation of new software before 8:25 a.m.).  Jennifer had been playing an online game until about 8:15 a.m., when she abruptly ceased playing. She had told her online game partner that she was not going to Syracuse until later in the week because her van was making strange noises. Video footage undermined Ganesh’s account of having gone for a run to the Y that morning. As for the perfect marriage, there was evidence of a divorce in the air, an extra-marital affair, and even an insatiable sex drive.

Five days after the disappearance, the van was found in an apartment parking lot. Two months later, Jennifer's naked and decomposed body was found at the bottom of an embankment, with bruises and lacerations, bleeding underneath the scalp, and internal hemorrhages across the back.

Ganesh was tried for murder. In addition to the facts given above, DNA evidence supported the verdict of guilty. “Large blood stains in the back of [the van] were a conclusive DNA match with the victim,” and she “could not be excluded [as] a contributor” to a blood stain on the sweatshirt that Ganesh had worn the day she disappeared. A “forensic expert testified that it was 1.661 quadrillion times more likely that the blood sample from the sweatshirt contained a combination of defendant's and the victim's blood than if two randomly selected individuals were the donors.”

The appellate division reversed the conviction because of “defense counsel's failure to object to the prosecutor's inappropriate characterization of the DNA testimony and evidence during summation.” The expert, Daniel Myers, had testified that although “the STR/DNA mixture profile ... is 1.661 quadrillion times more likely to be observed if donors are defendant and the victim than if two random unrelated people were selected, ... there were not enough alleles or DNA data to say conclusively that the victim's DNA was present.”

The prosecutor went further “during summation, ... by stating ... ‘on that sweatshirt is [defendant's] wife's DNA’”; that Jennifer’s “DNA was on that area where the bloody spot is”; and that “the forensic people [say that Jennifer's] DNA is on that sweatshirt, to some degree.”

How should the significance of the DNA typing results have been presented? The likelihood ratio for “two random unrelated people” is relevant, since it constitutes one conceivable explanation for the origin of the blood stain on the sweatshirt. If it were the sole “defense hypothesis,” and if the likelihood ratio with that hypothesis in the denominator were anything like 1.661 quadrillion, then the prosecution could maintain that the only reasonable conclusion was that the mixed stain came from Jennifer and Ganesh. Indeed, if the only comparison were between a Jennifer-Ganesh mixture and a nonJennifer-nonGanesh mixture, it would not have been egregiously wrong to suggest that Ganesh’s DNA in the mixture is what the forensic analyst reported “to some degree.”

However, other hypotheses consistent with innocence cannot be ignored. The most obvious is that the contributors were Ganesh and an individual unrelated to Jennifer. After all, Ganesh was wearing the sweatshirt. The hypothesis that the mixed stain was from him and someone besides the victim surely merited consideration along with the less probable hypothesis that the DNA came from neither Ganesh nor his wife. Without revealing how the likelihood for that defense-compatible hypothesis compares with the likelihood for the prosecution's claim of a Jennifer-Ganesh mixture, the state withheld information necessary to a fair assessment of the DNA evidence.

Prosecutors have license to present their evidence for all that it is worth, and the prosecutor in this case may have believed that the likelihood ratio of 1.661 quadrillion was the appropriate measure of probative value. But to know what the bloodstain on the sweatshirt really proves, one must consider all the relevant alternatives to the state’s claim about the evidence. There is no indication in the opinion that the jury was given the information it needed to assess the bloodstain evidence.

Now it could be that the likelihood ratio with the Ganesh-nonJennifer hypothesis in the denominator also would have been outrageously large. But without some indication of the magnitude of that likelihood ratio, the jury was not given a fair picture of the meaning of the mixture.
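To make the point concrete, here is a small numerical sketch. The likelihoods are entirely hypothetical (none of these numbers come from the case record); they only illustrate how the likelihood ratio against a two-random-strangers hypothesis can be astronomical while the ratio against the Ganesh-plus-unknown hypothesis is far more modest:

```python
# Hypothetical numbers only -- not from the case record -- illustrating why the
# choice of the denominator hypothesis matters for a likelihood ratio.
# L(H) = probability of observing the mixture profile if hypothesis H is true.
L_jennifer_ganesh = 1e-2   # H1: mixture of Jennifer and Ganesh (hypothetical)
L_ganesh_unknown  = 1e-4   # H2: Ganesh plus an unrelated woman (hypothetical)
L_two_unknowns    = 1e-17  # H3: two random unrelated people (hypothetical)

lr_vs_unknowns = L_jennifer_ganesh / L_two_unknowns    # denominator = H3
lr_vs_ganesh   = L_jennifer_ganesh / L_ganesh_unknown  # denominator = H2

print(f"LR against two unknowns: {lr_vs_unknowns:.0e}")    # quadrillion-scale
print(f"LR against Ganesh + unknown: {lr_vs_ganesh:.0f}")  # far smaller
```

With these made-up inputs, the first ratio is on the order of a quadrillion while the second is only 100. A jury told only the first number would have no way to assess the hypothesis the defense actually needed evaluated.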

References
People v. Ramsaran, 141 A.D.3d 865, 35 N.Y.S.3d 549 (2016)
Joel Stashenko, Panel Orders New Trial for Man Charged in Wife's Death, N.Y.L.J., July 19, 2016

Acknowledgment
Thanks to Ted Hunt for calling this case to my attention.

Wednesday, November 23, 2016

A Flaky Forensic Genetics and Medicine Journal

I keep quietly adding emails to a posting on Flaky Academic Journals, but today's specimen from the Journal of Forensic Genetics and Medicine deserves special billing. The journal belongs to the North Carolina publisher and conference organizer, Allied Academies, which is allied with OMICS. Today's "confidential" email shows that it cannot even keep the names of its journals straight. It thinks its email advertisement is privileged with "work product immunity," and it denies that its emailed promises and claims are "given or endorsed by the company." Here is the solicitation:
Dear David H. Kaye,
Greetings from Journal of Forensic Genetics and Medicine.
It gives us great pleasure to invite you and your research allies to submit a manuscript for the Journal of Forensic Genetics and Medicine. We are delighted to announce that we are planning to release Inaugural Issue for our newly launched Journal. Your contribution adds more value to our inaugural issue. ...
Your contribution will help Journal of Sinusitis and Migraine establish its high standard and facilitate the journal to be indexed by prestigious ISI soon. ...
We Look forward for our long lasting scientific relationship.
With Regards,
Solomon Ebe, Editorial Assistant, Journal of Forensic Genetics and Medicine, Allied Academies, P.O.Box670, Candler, NC28715, USA

This message is confidential. It may also be privileged or otherwise protected by work product immunity or other legal rules. ... [Y]ou may not copy this message or disclose its contents to anyone. The views, opinions, conclusions and other information’s expressed in this electronic mail are not given or endorsed by the company unless otherwise indicated by an authorized representative independent of this message.
A second email arrived 12/2/16 with an additional "not given or endorsed" promise "that COMPLETE WAIVER will be provided on the articles submitted on or before December30th, 2016 for the inaugural issue."

The editorial board, if the website is to be believed, consists of the following individuals:
  • J. Thomas McClintock, Department of Biology and Chemistry, Forensic Science Program, Liberty University, Lynchburg, Virginia, USA.
  • James P Landers, Commonwealth Chaired Professor Dept. of Chemistry, Mechanical Engineering and Pathology Jefferson Scholar Faculty Fellow Co-Director of the Center for Nano-BioSystem Integration University of Virginia, Charlottesville, VA, United States
  • Robert W. Allen, Professor of Forensic Science and Chairman, School of Forensic Sciences, Center for Health Sciences, Oklahoma State University, 1111 West 17th St., Tulsa, OK 74107, USA.
  • Susan A. Greenspoon, Forensic Molecular Biologist, Virginia Department of Forensic Science, 700 North Fifth Street Richmond, VA 23219, USA.
  • James Jabbour, Program Director, Assistant Professor, Applied Forensic Science, School of Applied Sciences, Forensic Sciences at Mount Ida College, 777 Dedham Street, Newton, MA 02459, USA.

Thursday, November 10, 2016

The Defense Attorney's Fallacy in United States v. Natson

Searching for cases that illustrate the range of statistical statements that courts encounter, I stumbled (in two senses) on United States v. Natson. On December 16, 2003, a hunter discovered the remains of Ardena Carter and a fetus in a remote area of the Fort Benning Military Reservation in Columbus, Georgia. A student at Georgia Southern University, Carter had been shot in the back of the head with a pistol.

A grand jury indicted her boyfriend, Michael Antonio Natson, for homicide, feticide, and carrying and using a firearm during the murder. Calling for the death penalty, the government contended that Natson, a military police officer, killed Carter to keep her from seeking child support. To establish paternity, the government turned to DNA testing. However, the fetal bones yielded only a partial (five-locus) DNA profile, and the government’s expert described the findings as “inconclusive.” If this were all that the expert had to say, the court’s exclusion of the tests as irrelevant would have been unremarkable.

But the DNA expert, Shaun Weiss, did have more to say. As the court explained the proffered testimony, Weiss would have added that Natson was “26 times more likely to be the father of the fetus than a random person” and that “there is a 96.30% probability that Defendant is the father.” Weiss downplayed these numbers, maintaining that “the statistical probability of paternity must be at 99.99% for the DNA scientific community to consider a DNA test to show a paternity match.” In other words, Weiss believed that unless an inference of paternity is all but certain (99.99%), the test results are not scientifically acceptable. In light of these expert asseverations, the federal district court concluded that
It would be sheer speculation for a jury to determine from Weiss's testimony that Defendant is the father. Therefore, the Court finds that the testimony is not relevant and would not assist the trier of fact. Accordingly, it is not admissible under Federal Rules of Evidence 702, 401, and 402.
Furthermore, the court dismissed the 26:1 odds and the 96% probability as "significantly low." The testing was "not probative." Weiss could only "testify with certainty that Defendant was 'possibly' the father, along with thousands of other random persons."

This is a very strange application of the legal concept of relevance. Suppose the government had found an unused 9mm bullet near the body (that might have been dropped there by the killer). Would the court have dismissed as irrelevant evidence that Natson also owned a 9mm pistol because it only shows that defendant's pistol might possibly have been the murder weapon, along with thousands of random pistols?

The notion that identification evidence is not relevant unless it limits the class of possible perpetrators to a very small number has been called the "defense attorney's fallacy." The fallacy lies in equating weakly or modestly probative identification evidence to the complete absence of probative value. Such reasoning is inconsistent with the common-law definition of relevance expressed in Federal Rule of Evidence 401. Rule 401 defines relevance in probabilistic terms:
“Relevant evidence" means evidence having any tendency to make the existence of any fact that is of consequence to the determination of the action more probable or less probable than it would be without the evidence.
Whether Natson was the father of the fetus (and thereby might have had the motive the government ascribed to him) is surely of some consequence, and if offspring of Natson and Carter are 26 times more probable to possess the genotypes of the fetal bones than are the offspring of Carter and a randomly selected man, then the genotypes make it “more ... probable” that Natson is indeed the father. Thus, the DNA test results, although far less definitive than in the usual paternity case with copious, fresh samples, were unequivocally relevant.

This explanation of why the genetic data is relevant does not depend on the court’s claim that a likelihood ratio of 26 means that Natson is “26 times more likely to be the father of the fetus than a random person” or that “there is a 96.30% probability that Defendant is the father.” These expressions are statements of the odds or probability of the “fact that is of consequence.” Geneticists cannot deduce such values without a probability for the fact "without the evidence.” To arrive at probabilities for a claim of paternity, parentage testers typically assume that the “fact ... without the evidence” is equiprobable, then adjust it in light of the genetic evidence. This customary choice of 50% for the probability without the evidence, a figure not usually grounded in scientific data, has been sharply criticized. With respect to the question of relevance, however, that issue is a red herring. In Natson, 26 is the ratio of (1) the probability of the genetic evidence when Natson is the father to (2) the probability of the same evidence when he is not. It is a likelihood ratio. As such, the evidence alters the probability of paternity, no matter what the starting probability might be. That makes the evidence relevant. It does not matter whether the probability without the evidence is 50%, 5%, or any other number (except for unrealizable prior probabilities of exactly 1 or 0).
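The arithmetic connecting the likelihood ratio of 26 to the reported 96.30% can be sketched with Bayes' rule in odds form. With a 50% prior, the prior odds are 1, the posterior odds are 26, and the posterior probability is 26/27:

```python
# Sketch: how the 96.30% paternity probability follows from the likelihood
# ratio of 26 combined with the conventional (and criticized) 50% prior,
# and how the evidence remains relevant under any other prior.
lr = 26  # likelihood ratio reported in Natson

def posterior(prior, lr):
    """Posterior probability via Bayes' rule in odds form:
    posterior odds = likelihood ratio * prior odds."""
    prior_odds = prior / (1 - prior)
    post_odds = lr * prior_odds
    return post_odds / (1 + post_odds)

print(f"{posterior(0.5, lr):.4f}")   # 0.9630 -- the figure Weiss reported
print(f"{posterior(0.05, lr):.4f}")  # even a skeptical 5% prior rises sharply
```

Whatever one thinks of the 50% prior, the second line shows the point in the text: the evidence raises the probability of paternity from any nondegenerate starting value, which is all Rule 401 requires.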

Although the court excluded the statistical statements about the DNA evidence, the jury convicted Natson. Family members testified that Carter was pregnant and that she had told them that Natson was the father, but the most crucial testimony may have come from FBI firearm and toolmark examiner Paul Tangren. Tangren opined that a discharged ammunition cartridge recovered "from the scene of the alleged crime" exhibited toolmarks that "were sufficiently similar" to those on cartridges test-fired from a pistol owned by Natson "to identify Defendant's gun ... to a 100% degree of certainty."

The judge imposed a sentence of imprisonment for life without parole. Natson appealed the conviction on the ground that “the case investigators, intentionally and calculatingly, refused to develop information ... which might implicate” other suspects. In an unreported opinion, the U.S. Court of Appeals for the Eleventh Circuit denied this appeal.


Thursday, November 3, 2016

The False-Positive Fallacy in the First Opinion to Discuss the PCAST Report

Last month, I quoted the following discussion of the PCAST report on forensic science that appeared in United States v. Chester, No. 13 CR 00774 (N.D. Ill. Oct. 7, 2016):
As such, the report does not dispute the accuracy or acceptance of firearm toolmark analysis within the courts. Rather, the report laments the lack of scientifically rigorous “blackbox” studies needed to demonstrate the reproducibility of results, which is critical to cementing the accuracy of the method. Id. at 11. The report gives detailed explanations of how such studies should be conducted in the future, and the Court hopes researchers will in fact conduct such studies. See id. at 106. However, PCAST did find one scientific study that met its requirements (in addition to a number of other studies with less predictive power as a result of their designs). That study, the “Ames Laboratory study,” found that toolmark analysis has a false positive rate between 1 in 66 and 1 in 46. Id. at 110. The next most reliable study, the “Miami-Dade Study” found a false positive rate between 1 in 49 and 1 in 21. Thus, the defendants’ submission places the error rate at roughly 2%.3 The Court finds that this is a sufficiently low error rate to weigh in favor of allowing expert testimony. See Daubert v. Merrell Dow Pharms., 509 U.S. 579, 594 (1993) (“the court ordinarily should consider the known or potential rate of error”); United States v. Ashburn, 88 F. Supp. 3d 239, 246 (E.D.N.Y. 2015) (finding error rates between 0.9 and 1.5% to favor admission of expert testimony); United States v. Otero, 849 F. Supp. 2d 425, 434 (D.N.J. 2012) (error rate that “hovered around 1 to 2%” was “low” and supported admitting expert testimony). The other factors remain unchanged from this Court’s earlier ruling on toolmark analysis. See ECF No. 781.

3. Because the experts will testify as to the likelihood that rounds were fired from the same firearm, the relevant error rate in this case is the false positive rate (that is, the likelihood that an expert’s testimony that two bullets were fired by the same source is in fact incorrect).
I suggested that the court missed (or summarily dismissed) the main point the President's Council of Advisors on Science and Technology was making -- that there is an insufficient basis in the literature for concluding that "the error rate [is] roughly 2%" -- but the court's understanding of "the error rate" also merits comment. The description of the meaning of "the false positive rate" in note 3 (quoted above) is plainly wrong. Or, rather, it is subtly wrong. If the experts will testify that two bullets came from the same gun, they will be testifying that their tests were positive. If the tests are in error, the test results will be false positives. And if the false-positive error probability is only 2%, it sounds as if there is only a 2% probability "that [the] expert's testimony ... is in fact incorrect."

But that is not how these probabilities work. The court's impression reflects what we can call a "false-positive fallacy." It is a variant on the well-known transposition fallacy (also loosely called the prosecutor's fallacy). Examiner-performance studies are incapable of producing what the court would like to know (and what it thought it was getting) -- "the likelihood that an expert’s testimony that two bullets were fired by the same source is in fact incorrect." The last phrase denotes the probability that a source hypothesis is false. It can be called a source probability. The "false positive rate" is the probability that certain evidence will arise if the source hypothesis is false. It can be called an evidence probability. As explained below, this evidence probability is but one of three probabilities that determine the source probability.

I. Likelihoods: The Need to Consider Two Error Rates

The so-called black-box studies can generate estimates of the evidence probabilities, but they cannot reveal the source probabilities. Think about how the performance study is designed. Examiners decide whether pairs of bullets or cartridges were discharged from the same source (S) or from different guns (~S). They are blinded to whether S or ~S is true, but the researchers control and know the true state of affairs (what forensic scientists like to call "ground truth"). The proportion of cases in which the examiners report a positive association (+E) out of all the cases of S can be written Prop(+E in cases of S), or more compactly, Prop(+E | S). This proportion leads to an estimate of the probability that, in practice, the examiners and others like them will report a positive association (+E) when confronted with same-source bullets. This conditional probability for +E given that S is true can be abbreviated Prob(+E | S). I won't be fastidious about the difference between a proportion and a probability and will just write P(+E | S) for either, as the context dictates. Likewise, using the court's 2% figure (which is higher than the one observed false-positive proportion in the Ames study), we expect examiners who reach a conclusion to respond positively (+E) when S is not true only P(+E | ~S) = 2% of the time in the long run.
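To make the arithmetic concrete, here is a minimal Python sketch (with made-up counts, not data from any actual study) of how a black-box study yields an estimate of the evidence probability P(+E | ~S):

```python
# Hypothetical study counts (illustrative only): examiners who reached a
# conclusion judged 1000 different-source pairs (~S) and reported a
# positive association (+E) in 20 of them.
false_positives = 20
diff_source_trials = 1000

# The observed proportion estimates the evidence probability P(+E | ~S).
false_positive_rate = false_positives / diff_source_trials
print(false_positive_rate)  # 0.02 -- the court's 2% figure
```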

Surprisingly, a small number like 2% for the "false-positive error rate" P(+E | ~S) does not necessarily mean that the positive finding +E has any probative value! Suppose that positive findings +E occur just as often when S is false as when S is true. (Examiners who are averse to false-positive judgments might be prone to err on the side of false negatives.) If the false-negative error probability is P(–E | S) = 98%, then examiners will tend to report –E 98% of the time for same-source bullets (S), just as they report –E 98% of the time for different-source bullets (~S). Learning that such examiners found a positive association is of zero value in separating same-source cases from different-source cases. We may as well have flipped a coin. The outcome (the side of the coin, or the positive judgment of the examiner) bears no relationship to whether S is true or not.

Although a false negative probability of 98% is absurdly high, it illustrates the unavoidable fact that only when the ratio of the two likelihoods, P(+E | S) and P(+E | ~S), exceeds 1 is a positive association positive evidence of a true association. Consequently, the court's thought that "the relevant error rate in this case is the false positive rate" is potentially misleading. This likelihood is but one of the two relevant likelihoods. (And there would be still more relevant likelihoods if there were more than two hypotheses to consider.)
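The point that both likelihoods matter can be checked with a short Python sketch (the probabilities are the hypothetical ones from the text):

```python
def likelihood_ratio(p_pos_given_S, p_pos_given_notS):
    """Ratio of the two likelihoods for a positive finding +E."""
    return p_pos_given_S / p_pos_given_notS

# Hypothetical examiners with a 98% false-negative rate: P(+E | S) = 2%,
# exactly matching the 2% false-positive rate P(+E | ~S).
print(likelihood_ratio(0.02, 0.02))         # 1.0 -- the finding has no probative value

# Examiners with P(+E | S) = 98% against the same 2% false-positive rate.
print(round(likelihood_ratio(0.98, 0.02)))  # 49 -- a positive finding is strong evidence
```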

II. Prior Probabilities: The Need to Consider the Base Rate

Furthermore, yet another quantity -- the mix of same-source and different-source pairs of bullets in the cases being examined -- is necessary to arrive at the court's understanding of "the false positive rate" as "the likelihood that an expert’s testimony that two bullets were fired by the same source is in fact incorrect." 1/ In technical jargon, the probability as described is the complement of the posterior probability (or positive predictive value in this context), and the posterior probability depends not only on the two likelihoods, or evidence probabilities, but also on the "prior probability" of the hypothesis S.
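The dependence of the posterior on the prior and the two likelihoods (Bayes' rule) can be sketched in Python; the numbers plugged in are the rates implicit in Table 1 (P(+E | S) = 30%, P(+E | ~S) = 2%, prior 50%):

```python
def posterior_S(prior_S, p_pos_given_S, p_pos_given_notS):
    """Bayes' rule: P(S | +E) from the prior P(S) and the two likelihoods."""
    numerator = prior_S * p_pos_given_S
    denominator = numerator + (1 - prior_S) * p_pos_given_notS
    return numerator / denominator

# Complement of the posterior: how often a positive finding is wrong.
p_notS_given_pos = 1 - posterior_S(0.5, 0.30, 0.02)
print(round(p_notS_given_pos, 4))  # 0.0625 -- not the 2% false-positive rate
```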

A few numerical examples illustrate the effect of the prior probability. Imagine that a performance study with 500 same-source pairs and 500 different-source pairs (that led to conclusions) found the outcomes given in Table 1.

Table 1. Outcomes of comparisons

         ~S      S
  –E    490    350     840
  +E     10    150     160
        500    500

–E is a negative finding (the examiner decided there was no association).
+E is a positive finding (the examiner decided there was an association).
S indicates that the cartridges came from bullets fired by the same gun.
~S indicates that the cartridges came from bullets fired by different guns.

The first column of the table states that in the different-source cases, examiners reported a positive association +E in only 10 cases. Thus, their false-positive error rate was P(+E | ~S) = 10/500 = 2%. This is the figure used in Chester. (The second column states that in the same-source cases, examiners reported a negative association 350 times. Thus, their false-negative rate was P(–E | S) = 350/500 = 70%.)

But the +E row of the table states that the examiners reported a positive association for 10 different-source cases and 150 same-source cases. Of the 10 + 150 = 160 cases of positive evidence, 150 are correct, and 10 are incorrect. The rate of incorrect positive findings was therefore P(~S | +E) = 10/160 = 6.25%. Within the four corners of the study, one might say, as the court did, that "the likelihood that an expert’s testimony that two bullets were fired by the same source is in fact incorrect" is only 2%. Yet, the rate of incorrect positive findings in the study exceeded 6%. The difference is not huge, but it illustrates the fact that the false-negative probability as well as the false-positive probability affects P(~S | +E), which indicates how often an examiner who declares a positive association is wrong. 2/
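The same 6.25% figure falls out of the raw counts in Table 1:

```python
# Counts from Table 1: +E findings among 500 different-source (~S) and
# 500 same-source (S) pairs.
pos_notS = 10    # false positives
pos_S = 150      # true positives

# Of all positive findings, the share that arose from different-source pairs.
p_notS_given_pos = pos_notS / (pos_notS + pos_S)
print(p_notS_given_pos)  # 0.0625, i.e., P(~S | +E) = 6.25%
```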

Now let's change the mix of same- and different-source pairs of bullets from 50:50 to 10:90. We will keep the conditional-error probabilities the same, at P(+E | ~S) = 2% and P(–E | S) = 70%. Table 2 meets these constraints:

Table 2. Outcomes of comparisons

         ~S      S
  –E    980     70    1050
  +E     20     30      50
       1000    100

Row 2 shows that there are 20 false positives out of the 50 positively reported associations. The proportion of false positives in the modified study is P(~S | +E) = 40%. But the false-positive rate P(+E | ~S) is still 2% (20/1000).
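Repeating the computation for Table 2 shows the effect of the changed base rate:

```python
# Counts from Table 2: the conditional error rates are unchanged
# (P(+E | ~S) = 2%, P(–E | S) = 70%), but the mix is now 1000 ~S to 100 S.
pos_notS = 20    # false positives: 2% of 1000 different-source pairs
pos_S = 30       # true positives: 30% of 100 same-source pairs

p_notS_given_pos = pos_notS / (pos_notS + pos_S)
print(p_notS_given_pos)  # 0.4 -- 40% of the positive findings are wrong
```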

III. "When I'm 64": A Likelihood Ratio from the Ames Study

The Chester court may not have had a correct understanding of the 2% error rate it quoted, but the Ames study does establish that examiners are capable of distinguishing between same-source and different-source items on which they were tested. Their performance was far better than the outcomes in the hypothetical Tables 1 and 2. The Ames study found that across all the examiners studied, P(+E | S) = 1075/1097 = 98.0%, and P(+E | ~S) = 22/1443 = 1.52%. 3/ In other words, on average, examiners made correct positive associations 98.0/1.52 = 64 times more often when presented with same-source cartridges than they made incorrect positive associations when presented with different-source cartridges. This likelihood ratio, as it is called, means that when confronted with cases involving an even mix of same- and different-source items, over time and over all examiners, the pile of correct positive associations would be some 64 times higher than the pile of incorrect positive associations. Thus, in Chester, Judge Tharp was correct in suggesting that the one study that satisfied PCAST's criteria offers an empirical demonstration of expertise at associating bullet cartridges with the gun that fired them.
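The 64:1 figure can be reproduced from the counts quoted above:

```python
# Ames study counts as cited in the text (conclusive trials only).
true_pos, same_source_trials = 1075, 1097   # +E when S was true
false_pos, diff_source_trials = 22, 1443    # +E when ~S was true

p_pos_given_S = true_pos / same_source_trials       # about 0.980
p_pos_given_notS = false_pos / diff_source_trials   # about 0.0152
print(round(p_pos_given_S / p_pos_given_notS))      # 64 -- the likelihood ratio
```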

Likewise, an examiner presenting a source attribution can point to a study deemed to be well designed by PCAST that found that a self-selected group of 218 examiners given cartridge cases from bullets fired by one type of handgun correctly identified more than 99 out of 100 same-gun cartridges and correctly excluded more than 98 out of 100 different-gun cartridges. For completeness, however, the examiner should add that he or she has no database with which to estimate the frequency of distinctive marks -- unless, of course, there is one that is applicable to the case at bar.

 * * *

Whether the Ames study, together with other literature in the field, suffices to validate the expertise under Daubert is a further question that I will not pursue here. My objective has been to clarify the meaning of and some of the limitations on the 2% false-positive error rate cited in Chester. Courts concerned with the scientific validity of a forensic method of identification must attend to "error rates." In doing so, they need to appreciate that it takes two to tango. Both false-positive and false-negative conditional-error probabilities need to be small to validate the claim that examiners have the skill to distinguish accurately between positively and negatively associated items of evidence.

Notes
  1. Not wishing to be too harsh on the court, I might speculate that its belief that the only "relevant error rate" for positive associations is the false-positive rate was encouraged by the PCAST report's failure to present any data on negative error rates in its discussion of the performance of firearms examiners. A technical appendix to the report indicates that the related likelihood is pertinent to the weight of the evidence, but this fact might be lost on the average reader -- even one who looks at the appendix.
  2. The PCAST report alluded to this effect in its appendix on statistics. That Judge Tharp did not pick up on this is hardly surprising.
  3. See David H. Kaye, PCAST and the Ames Bullet Cartridge Study: Will the Real Error Rates Please Stand Up?, Forensic Sci., Stat. & L., Nov. 1, 2016, http://for-sci-law.blogspot.com/2016/11/pcast-and-ames-study-will-real-error.html.
More on the PCAST Report

Tuesday, November 1, 2016

Index to Comments and Cases Discussing the PCAST Report on Forensic Science

This page lists discussions of the PCAST Report on this blog as well as some academic literature on it and court opinions that discuss the report. I expect to update the list periodically.

Forensic Science, Statistics & the Law
Academic Literature
  • I.W. Evett, C.E.H. Berger, J.S. Buckleton, C. Champod, G. Jackson, Finding the Way Forward for Forensic Science in the US—A Commentary on the PCAST Report, 278 Forensic Sci. Int'l 16-23 (2017), https://t.co/A7y7Qy6dRn
  • Eric S. Lander, Response to the ANZFSS Council Statement on the President’s Council of Advisors on Science and Technology Report, Australian J. Forensic Sci. (2017), DOI: 10.1080/00450618.2017.1304992
  • Geoffrey Stewart Morrison, David H. Kaye, David J. Balding, et al., A Comment on the PCAST Report: Skip the 'Match'/'Non-Match' Stage, 272 Forensic Sci. Int'l e7-e9 (2017), http://dx.doi.org/doi:10.1016/j.forsciint.2016.10.018. Accepted manuscript available at SSRN: https://ssrn.com/abstract=2860440
  • Adam B. Shniderman, Prosecutors Respond to Calls for Forensic Science Reform: More Sharks in Dirty Water, 126 Yale L.J. F. 348 (2017), http://www.yalelawjournal.org/forum/prosecutors-respond-to-calls-for-forensic-science-reform
Federal Cases
  • United States v. Chester, No. 13 CR 00774 (N.D. Ill. Oct. 7, 2016), noted, The First Opinion To Discuss the PCAST Report, Forensic Sci., Stat. & L., Oct. 20, 2016; Forensic Sci., Stat. & L., Feb. 3, 2017, http://for-sci-law.blogspot.com/2017/02/connecticut-trial-court-deems-pcast.html
State and Washington DC Cases