Wednesday, July 5, 2017

Multiple Hypothesis Testing in Karlo v. Pittsburgh Glass Works

The following posting is adapted from a draft of an annual update to the legal treatise The New Wigmore on Evidence: Expert Evidence. I am not sure of the implications of the calculations in note 23 and the fact that the age-based groups are overlapping. Advice is welcome.

The Age Discrimination in Employment Act of 1967 (ADEA) 1/ covers individuals who are at least forty years old. The federal circuit courts are split as to whether a disparate-impact claim is viable when it is limited to a subgroup of employees such as those aged fifty and older. In Karlo v. Pittsburgh Glass Works, 2/ the Third Circuit held that statistical proof of disparate impact on such a subgroup can support a claim for recovery. The court countered the employer’s argument that “plaintiffs will be able to ‘gerrymander’ arbitrary age groups in order to manufacture a statistically significant effect” 3/ by promising that “the Federal Rules of Evidence and Daubert jurisprudence [are] a sufficient safeguard against the menace of unscientific methods and manipulative statistics.” 4/ In Daubert v. Merrell Dow Pharmaceuticals, the Supreme Court famously reminded trial judges applying the Federal Rules of Evidence that they are gatekeepers responsible for ensuring that scientific evidence presented at trials is based on sound science. By the end of the Karlo opinion, however, the court of appeals held that Senior District Judge Terrence F. McVerry had been too vigorous a gatekeeper when he found inadmissible a statistical analysis of reductions in force offered by laid-off older workers.

The basic problem was that plaintiffs claimed to have observed statistically significant disparities in various overlapping age groups without correcting for the fact that by performing a series of hypothesis tests, they had more than one opportunity to discover something "significant." By way of analogy, if you flip a coin five times and observe five heads, you might begin to suspect that the coin is not fair. The probability of five heads in a row with a fair coin is p = (1/2)⁵ = 1/32 ≈ 0.03. We can say that the five heads in the sample are "statistically significant" proof (at the conventional 0.05 level) that the coin is unfair.

But suppose you get to repeat the experiment five times. Now the probability of at least one sample of 5 flips with 5 heads is about five times larger. It is 1 - (1 - 1/32)⁵ = 0.146785, to be exact. This outcome is not so far out of line with what is expected of a fair coin. It would be seen about 15% of the time for a fair coin. This is weak evidence that the coin is unfair; certainly, it is not as compelling as the 3% p-value. So the extra testing, with the opportunity to select any one or more of the five samples as proof of unfairness, has reduced the weight of the statistical evidence of unfairness. The effect of the opportunity to search for significance is sometimes known as "selection bias" or, of late, "p-hacking."
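
For readers who want to check the arithmetic, here is a minimal Python sketch (purely illustrative; it is not part of either expert's analysis) that reproduces both probabilities:

    # Probability of five heads in a single run of five fair-coin flips
    p_single = (1/2) ** 5                      # 1/32 = 0.03125
    # Probability of at least one all-heads run when the five-flip experiment is repeated five times
    p_at_least_one = 1 - (1 - p_single) ** 5   # about 0.1468
    print(round(p_single, 5), round(p_at_least_one, 5))

Run as written, the script prints 0.03125 and 0.14678, the figures used above.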

In Karlo, Dr. Michael Campion—a distinguished professor of management at Purdue University with degrees in industrial and organizational psychology—compared proportions of Pittsburgh Glass workers older than 40, 45, 50, 55, and 60 who were laid off to the proportion of younger workers who were laid off. He found that the disparities in three of the five categories were statistically significant at the 0.05 level. 5/ The disparity for the 40-and-older range, he said, fell “just short,” being “significant at the 13% level.” Dr. Campion maintained that “[t]hese results suggest that there is evidence of disparate impact.” 6/ He also misconstrued the 0.05 level as “a 95% probability that the difference in termination rates of the subgroups is [] due to chance alone.” 7/ The district court expressed doubt as to whether Dr. Campion was a qualified statistical expert 8/ and excluded the testimony under Daubert as inadequate “data snooping.” 9/

Apparently, Judge McVerry was more impressed with the report of Defendant’s expert, James Rosenberger — a statistics professor at Pennsylvania State University and a fellow of the American Statistical Association and the American Association for the Advancement of Science. The report advocated adjusting the significance level to account for the five groupings of over-40 workers. The Chief Judge of the Third Circuit, D. Brooks Smith (also an adjunct professor at Penn State), described the recommended correction as follows:
The Bonferroni procedure adjusts for that risk [of a false positive] by dividing the “critical” significance level by the number of comparisons tested. In this case, PGW's rebuttal expert, Dr. James L. Rosenberger, argues that the critical significance level should be p < 0.01, rather than the typical p < 0.05, because Dr. Campion tested five age groups (0.05 / 5 = 0.01). Once the Bonferroni adjustment is applied, Dr. Campion's results are not statistically significant. Thus, Dr. Rosenberger argues that Dr. Campion cannot reject the null hypothesis and report evidence of disparate impact. 10/
Another way to apply the Bonferroni correction is to change the p-value. That is, when M independent comparisons have been conducted, the Bonferroni correction is either to set “the critical significance level . . . at 0.05/M” (as Professor Rosenberger recommended) or “to inflate all the calculated P values by a factor of M before considering against the conventional critical P value (for example, 0.05).” 11/
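
The two formulations are interchangeable: a nominal p-value falls below 0.05/M exactly when M times that p-value falls below 0.05. A short Python illustration (the five p-values are placeholders of my own choosing, not Dr. Campion's actual figures) makes the equivalence concrete:

    # Hypothetical nominal p-values for five overlapping age groups (placeholders only)
    nominal_p = [0.02, 0.03, 0.04, 0.13, 0.30]
    M = len(nominal_p)

    # Form 1: shrink the critical significance level to 0.05 / M
    sig_form1 = [p < 0.05 / M for p in nominal_p]

    # Form 2: inflate each p-value by M (capped at 1) and compare to 0.05
    sig_form2 = [min(1.0, p * M) < 0.05 for p in nominal_p]

    assert sig_form1 == sig_form2   # both forms flag exactly the same comparisons
    print(sig_form1)                # [False, False, False, False, False]

With these placeholder values, three comparisons are nominally significant at the 0.05 level, but none survives the Bonferroni adjustment, which mirrors the situation described in the opinion.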

The Court of Appeals was not so sure that this conservative adjustment was essential to the admissibility of the p-values or assertions of statistical significance. It held that the district court erred in excluding the subgroup analysis and granting summary judgment. It remanded “for further Daubert proceedings regarding plaintiffs' statistical evidence.” 12/ Further proceedings were said to be necessary partly because the district court had applied “an incorrectly rigorous standard for reliability.” 13/ The lower court had set “a higher bar than what Rule 702 demands” 14/ because “it applied a bright-line exclusionary rule” for all studies with multiple comparisons that have no Bonferroni correction. 15/

But the district court did not clearly articulate such a rule. It wrote that “Dr. Campion does not apply any of the generally accepted statistical procedures (i.e., the Bonferroni procedure) to correct his results for the likelihood of a false indication of significance.” 16/ The sentence is grammatically defective (and hence confusing). On the one hand, it refers to "generally accepted statistical procedures." On the other hand, the parenthetical phrase suggests that only one "procedure" exists. Had the district court written “e.g.” instead of “i.e.,” it would have been clear that it was not promulgating a dubious rule that only the Bonferroni adjustment to p-values or significance levels would satisfy Daubert. To borrow from Mark Twain, "the difference between the almost right word and the right word is really a large matter—'tis the difference between the lightning-bug and the lightning." 17/

Understanding the district court to be demanding a Bonferroni correction in all cases of multiple testing, the court of appeals essentially directed it to reconsider its exclusionary ruling in light of the fact that other procedures could be superior. Indeed, there are many adjustment methods in common use, of which Bonferroni’s is merely the simplest. 18/ However, plaintiffs’ expert apparently had no other method to offer, which makes it hard to see why the possibility of some alternative adjustment, suggested by neither expert in the case, made the district court's decision to exclude Dr. Campion's proposed testimony an abuse of discretion.
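
One such alternative is the Holm step-down procedure, which controls the familywise error rate under the same conditions as Bonferroni but never rejects fewer hypotheses. A minimal sketch (again with placeholder p-values; neither expert in Karlo proposed this method):

    def holm_reject(pvals, alpha=0.05):
        # Holm step-down: compare the smallest p-value to alpha/M, the next to alpha/(M-1), and so on
        order = sorted(range(len(pvals)), key=lambda i: pvals[i])
        reject = [False] * len(pvals)
        for rank, i in enumerate(order):
            if pvals[i] <= alpha / (len(pvals) - rank):
                reject[i] = True
            else:
                break            # once one comparison fails, all larger p-values fail too
        return reject

    print(holm_reject([0.02, 0.03, 0.04, 0.13, 0.30]))   # [False, False, False, False, False]

With the placeholder values used earlier, Holm reaches the same conclusion as Bonferroni, although for data sets containing one or more very small p-values it can declare significance where Bonferroni would not.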

A rule insisting on a suitable response to the multiple-comparison problem does not seem “incorrectly rigorous.” To the contrary, statisticians usually agree that “the proper use of P values requires that they be ... appropriately adjusted for multiple testing when present.” 19/ It is widely understood that when multiple comparisons are made, reported p-values will exaggerate the significance of the test statistic. 20/ The court of appeals’ statement that “[i]n certain cases, failure to perform a statistical adjustment may simply diminish the weight of an expert's finding” 21/ is therefore slightly misleading. In virtually all cases, multiple comparisons degrade the meaning of a p-value. Unless the statistical tests are all perfectly correlated, multiple comparisons always make the true probability of the disparity (or a larger one) under the model of pure chance greater than the nominal value. 22/

Even so, whether the fact that an unadjusted p-value exaggerates the weight of evidence invariably makes unadjusted p-values or reports of significance inadmissible under Daubert is a more delicate question. If no reasonable adjustment can be devised for the type of analysis used and no better analysis can be done, then the nominal p-values might be presented along with a cautionary statement about selection bias. In addition, when the nominal p-value is extremely small, the effect of the adjustment will be modest, and the degree of exaggeration will not be so formidable as to render the unadjusted p-value inadmissible. For instance, if the nominal p-value were 0.001, the fact that the corrected figure is 0.005 would not be a fatal flaw. The disparity would be highly statistically significant even with the correction. But that was not the situation in Karlo, where statistical significance was not apparent. It was undisputed that as soon as one considered the number of tests performed, not a single subgroup difference was significant at the 0.05 level. 23/
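
Note 23 below adds that obtaining three nominally significant results in five tests would itself be surprising under the null hypothesis. The binomial tail probabilities reported there are easy to verify with a few lines of Python (illustrative only; the calculation assumes independent tests, which the overlapping age groups do not strictly satisfy -- see note 22):

    from math import comb

    def prob_at_least(k, n=5, p=0.05):
        # P(X >= k) for X ~ Binomial(n, p): k or more nominally significant results in n independent tests
        return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

    for k in (1, 2, 3):
        print(k, round(prob_at_least(k), 5))   # 0.22622, 0.02259, 0.00116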

Consequently, the rejection of the district court’s conclusion that the particular statistical analysis in the expert’s report was unsound seems harsh. It should be within the trial court’s discretion to prevent an expert from testifying to the statistical significance of disparities (or their p-values) unless the expert avoids multiple comparisons that would seriously degrade the claims of significance or modifies those claims to reflect the negative impact of the repeated tests on the strength of the statistical evidence. 24/ The logic of Daubert does not allow an expert to dismiss the problem of selection bias on the theory -- advanced by plaintiffs in Karlo -- that “adjusting the required significance level [is only] required [when the analyst performs] ‘a huge number of analyses of all possibilities to try to find something significant.’” 25/ The threat to the correct interpretation of a significance probability does not necessarily disappear when the number of comparisons is moderate rather than “huge.” Given the lack of highly significant results here (even nominally), it is not statistically acceptable to ignore the threat. 26/ Although the Third Circuit was correct to observe that not all statistical imperfections render studies invalid within the meaning of Daubert, the reasoning offered in support of the claim of significant disparities in Karlo was not statistically acceptable. 27/

Notes
1. 29 U.S.C. §§ 621–634.
2. 849 F.3d 61 (3d Cir. 2017).
3. Id. at 76.
4. Id.
5. He testified that he did not compute a z-score (a way to analyze the difference between two proportions when the sample sizes are large) for the 60-and-over group “because ‘[t]here are only 14 terminations, which means the statistical power to detect a significant effect is very low.’” Karlo, 849 F.3d at 82 n.15.
6. Karlo v. Pittsburgh Glass Works, LLC, No. 2:10–cv–1283, 2015 WL 4232600, at *11 (W.D. Pa. July 13, 2015), vacated, 849 F.3d 61 (3d Cir. 2017).
7. Id. at *11 n.13. "A P value measures a sample's compatibility with a hypothesis, not the truth of the hypothesis." Naomi Altman & Martin Krzywinski, Points of Significance: Interpreting P values, 14 Nature Methods 213, 213 (2017).
8. Id. at *12.
9. Id. at *13.
10. 849 F.3d at 82 (notes omitted).
11. Pak C. Sham & Shaun M. Purcell, Statistical Power and Significance Testing in Large-scale Genetic Studies, 15 Nature Reviews Genetics 335 (2014) (Box 3).
12. Karlo, 849 F.3d at 80 (note omitted).
13. Id. at 82.
14. Id. at 83.
15. Id. (internal quotation marks and ellipsis deleted).
16. Karlo, 2015 WL 4232600, at *1.
17. George Bainton, The Art of Authorship 87–88 (1890).
18. Martin Krzywinski & Naomi Altman, Points of Significance: Comparing Samples — Part II, 11 Nature Methods 355, 355 (2014).
19. Naomi Altman & Martin Krzywinski, Points of Significance: Interpreting P values, 14 Nature Methods 213, 214 (2017).
20. Krzywinski & Altman, supra note 18.
21. Karlo, 849 F.3d at 83 (emphasis added).
22. Because each age group included some of the same older workers, the tests here were not completely independent. But neither were they completely dependent.
23. However, that three out of five groups exhibited significant associations between age and terminations is surprising under the null hypothesis that those variables are uncorrelated. Under that hypothesis, the probability of a nominally significant result in each group is 0.05. If the five tests were independent, the probability of one or more significant results would be 0.226; that of two or more, 0.0226; and of three or more, 0.00116.
24. Joseph Gastwirth, Case Comment: An Expert's Report Criticizing Plaintiff's Failure to Account for Multiple Comparisons Is Deemed Admissible in EEOC v. Autozone, 7 Law, Probability & Risk 61, 62 (2008).
25. Karlo, 849 F.3d at 82.
26. Dr. Campion also believed that “his method [was] analogous to ‘cross-validating the relationship between age and termination at different cut-offs,’ or ‘replication with different samples.’” Id. at 83. Although the court of appeals seemed to take these assertions at face value, cross-validation involves applying the same statistical model to different data sets (or distinct subsets of one larger data set). For instance, an equation that predicts law school grades as a function of such variables as undergraduate grades and LSAT scores might be derived from one data set, then checked to ensure that it performs well in an independent data set. Findings in one large data set of statistically significant associations between particular genetic loci and a disease could be checked to see if the associations were present in an independent data set. No such validation or replication was performed in this case.
27. The Karlo opinion suggested that the state of statistical knowledge or practice might be different in social science than in the broader statistical community. The court pointed to a statement (in a footnote on regression coefficients) in a treatise on statistical evidence in discrimination cases that “the Bonferroni adjustment [is] ‘good statistical practice,’ but ‘not widely or consistently adopted’ in the behavioral and social sciences.” Id. (quoting Ramona L. Paetzold & Steve L. Willborn, The Statistics of Discrimination: Using Statistical Evidence in Discrimination Cases § 6:7, at 308 n.2 (2016 Update)). The treatise writers were referring to an unreported case in which the district court found itself unable to resolve the apparent conflict between the generally recognized problem of multiple comparisons and an EEOC expert’s insistence that labor economists do not make such corrections and courts do not require them. E.E.O.C. v. Autozone, Inc., No. 00-2923, 2006 WL 2524093, at *4 (W.D. Tenn. Aug. 29, 2006). In the face of these divergent perceptions, the district judge decided not to grant summary judgment just because of this problem. Id. (“[T]he Court does not have a sufficient basis to find that ... the non-utilization [of the Bonferroni adjustment] makes [the expert's] results unreliable.”). The notion that multiple comparisons generally can be ignored in labor economics or employment discrimination cases is false, Gastwirth, supra note 24, at 62 (“In fact, combination methods and other procedures that reduce the number of individual tests used to analyse data in equal employment cases are basic statistical procedures that have been used to analyse data in discrimination cases.”), and any tendency to overlook multiple comparisons in “behavioral and social science” more generally is statistically indefensible.
That said, the outcome on appeal in Karlo might be defended as a pragmatic response to the lower court's misunderstanding of the meaning of the ADEA. The court excluded the unadjusted findings of significance for several reasons. In addition to criticizing Professor Campion's refusal to make any adjustment for his series of hypothesis tests across age groups, Judge McVerry noted that "the subgrouping analysis would only be helpful to the factfinder if this Court held that Plaintiffs could maintain an over-fifty disparate impact claim." Karlo, 2015 WL 4232600, at *13 n.16. He sided with "the majority view amongst the circuits that have considered this issue ... that a disparate impact analysis must compare employees aged 40 and over with those 39 and younger ... ." Id. (Petruska v. Reckitt Benckiser, LLC, No. CIV.A. 14–03663 CCC, 2015 WL 1421908, at *6 (D.N.J. Mar. 26, 2015)). The Third Circuit decisively rejected this construction of the ADEA, pulling the rug out from under the district court. Having held that the district court erred in interpreting the ADEA, the court of appeals could reasonably require the district court to re-examine the statistical showing under the statute, correctly understood.
Of course, ordinarily an evidentiary ruling that can be supported on several independent grounds will be upheld on appeal as long as at least one of the independent grounds is valid. Here, the ADEA argument was literally a footnote to the independent ground that the failure to adjust for multiple comparisons invalidated the expert's claim of significant disparities. Nevertheless, the independent-grounds rule normally applies after a trial. It avoids retrials when the trial judge would or could rule the same way on retrial. Because Karlo is a summary judgment case, there is less reason to sustain the evidentiary ruling. But even so, the court of appeals did not have to vacate the judgment. Instead, it could have followed the usual independent-grounds rule to affirm the summary judgment while noting that the district court could reconsider its Daubert ruling in light of the court of appeals' explanation of the proper reach of the ADEA and the range of statistically valid responses to the problem of multiple hypothesis tests. As a practical matter, however, there may be little difference between having counsel address the issue on a motion to reconsider and on a renewed motion for summary judgment.

Friday, June 30, 2017

Judge Spotlights PCAST Report

When the District of Columbia Court of Appeals (the District's "supreme court") overruled Frye v. United States and replaced the general acceptance standard for scientific evidence with one based on the Daubert line of cases, 1/ the court admonished trial judges to use "a delicate touch" in regulating the flow of expert testimony. 2/ One judge offered more guidance. Judge Catharine Friend Easterly penned a concurring opinion proposing that
trial courts will be called upon to scrutinize an array of forensic expert testimony under new, more scientifically demanding standards. As the opinion of the court states, “[t]here is no ‘grandfathering’ provision in Rule 702,” and, under the new rule we adopt, courts may not “reflexively admit expert testimony because it has become accustomed to doing so under the Dyas/Frye test.” 3/
Daubert does not necessarily erect a more demanding standard than Frye. It leaves plenty of wiggle room for undiscriminating or lenient rulings. Moreover, under Frye, counsel can challenge scientific evidence that is generally accepted in the forensic-science community (predominantly forensic-science practitioners) but whose scientific foundations are seen as weak in the broader scientific community. Both Frye and Daubert enable -- indeed, both require -- courts to depart from reflexively admitting expert testimony just because they are accustomed to it. The legal difference between the two approaches is that Daubert creates the theoretical possibility of rejecting a method that is still clearly generally accepted but that a small minority of scientists has come to regard -- on the basis of sound (but not yet generally accepted) scientific arguments -- as unfounded. This is merely the flip side of evidence that is not yet generally accepted but that is scientifically sound. Frye keeps such evidence out; Daubert does not. In sum, the standards are formally different, but, as written, one is not more demanding than the other.

But regardless of whether Daubert is more demanding than what the Supreme Court called the "austere" standard of Frye, the remainder of Judge Easterly's opinion is worthy of general notice. The opinion urges the judiciary to heed the findings of the 2009 NRC Report on forensic science and the 2016 PCAST report on particular methods. It observes that
Fortunately, in assessing the admissibility of forensic expert testimony, courts will have the aid of landmark reports that examine the scientific underpinnings of certain forensic disciplines routinely admitted under Dyas/Frye, most prominently, the National Research Council's congressionally-mandated 2009 report Strengthening Forensic Science in the United States: A Path Forward, and the President's Council of Advisors on Science and Technology's (PCAST) 2016 report Forensic Science in the Criminal Courts: Ensuring Scientific Validity of Feature-Comparison Methods [hereinafter PCAST Report]. These reports provide information about best practices for scientific testing, an objective yardstick against which proffered forensic evidence can be measured, as well as critiques of particular types of forensic evidence. In addition, the PCAST Report contains recommendations for trial judges performing their gatekeeping role under Rule 702:
(A) When deciding the admissibility of [forensic] expert testimony, ... judges should take into account the appropriate scientific criteria for assessing scientific validity including: (i) foundational validity,  with respect to the requirement under Rule 702(c) that testimony is the product of reliable principles and methods; and (ii) validity as applied, with respect to [the] requirement under Rule 702(d) that an expert has reliably applied the principles and methods to the facts of the case.
(B) ... [J]udges, when permitting an expert to testify about a foundationally valid feature-comparison method, should ensure that testimony about the accuracy of the method and the probative value of proposed identifications is scientifically valid in that it is limited to what the empirical evidence supports. Statements suggesting or implying greater certainty are not scientifically valid and should not be permitted. In particular, courts should never permit scientifically indefensible claims such as: “zero,” “vanishingly small,” “essentially zero,” “negligible,” “minimal,” or “microscopic” error rates; “100 percent certainty” or proof “to a reasonable degree of scientific certainty;” identification “to the exclusion of all other sources;” or a chance of error so remote as to be a “practical impossibility.”
PCAST Report, supra, at 19; see also id. at 142–45; Gardner v. United States, 140 A.3d 1172, 1184 (D.C. 2016) (imposing limits on experts' statements of certainty). 4/
Notes
  1. Motorola v. Murray, 147 A.3d 751 (D.C. 2016) (en banc); Frye Dies at Home at 93, June 30, 2017, http://for-sci-law.blogspot.com/2017/06/frye-dies-at-home-at-93.html.
  2. 147 A.3d at 757.
  3. Id. at 759 (emphasis added).
  4. Id. at 759-60 (notes omitted).

Frye Dies at Home at 93

The general-scientific-acceptance standard for scientific evidence originated in the District of Columbia, when the federal circuit court for the District upheld the exclusion of a blood-pressure test for deception in Frye v. United States, 293 F. 1013 (D.C. Cir. 1923). In October of 2016, the District of Columbia's highest court ended the standard's 93-year life there. The D.C. Court of Appeals unanimously overruled Frye and replaced it with the more open-ended Federal Rule of Evidence 702.

It did so in Motorola v. Murray, 147 A.3d 751 (D.C. 2016) (en banc), at the request of the trial court,  which felt that Frye required the admission of expert testimony that cell phones cause brain tumors. That view was mistaken. It is quite possible to exclude, as not based on a generally accepted method, opinions of general causation from expert witnesses when the scientific consensus is that the pertinent scientific studies do not support those opinions. See Cell Phones, Brain Cancer, and Scientific Outliers Are Not the Best Reasons to Abandon Frye v. United States, Nov. 26, 2015.

Elsewhere, I have argued that the choice between the Daubert line of cases codified in Rule 702 and the earlier Frye standard is less important than is the rigor with which the courts apply either standard. The Court of Appeals in Murray remarked that "[p]roperly performing the gatekeeping function will require a delicate touch." Id. at 757. It noted that trial courts have "discretion (informed by careful inquiry) to exclude some expert testimony." Id. In the end, "[t]he trial court still will need to determine whether the opinion 'is the product of reliable principles and methods[,] ... reliably applied.'" Id. at 758 (quoting Fed. R. Evid. 702 (c), (d)).

Wednesday, May 31, 2017

A Few Statistical and Legal Ideas About the Weight of Evidence

The expression “weight of evidence” has become popular among theorists of forensic science, where it is used to indicate the extent to which findings support the claim that two similar traces originated from the same source as opposed to the claim that they originated from different sources. Speaking more broadly, the idea is that the degree of corroboration a body of evidence provides for a theory or hypothesis depends on the probability of the evidence given that hypothesis compared to the probability of the evidence given other hypotheses. This notion has a rich intellectual history in philosophy, law, and statistics. A recent book review* discusses ways to quantify this measure of corroboration and the motivations for them. Some excerpts follow:

The “likelihood ratio” is a concept that pervades statistics. 31/ As [Richard] Lempert argued, it can be used to define whether an item of evidence is [logically] relevant. For example, in the 1990s researchers developed a prostate cancer test based on the level of prostate-specific antigen (“PSA”). The test, they said, was far from definitive but still had diagnostic value. Should anyone have believed them? A straightforward method for validation is to run the test on subjects known to have the disease and on other subjects known to be disease-free. The PSA test was shown to give a positive result (to indicate that the cancer was present) about 70% of the time when the cancer was, in fact, present, and about 10% of the time when the cancer was not actually present. Thus, the test has diagnostic value. The doctor and patient can understand that positive results arise more often among patients with the disease than among those without it.

But why should we say that the greater probability of the evidence (a positive test result) among cancer patients than among cancer-free patients makes the test diagnostic of prostate cancer? There are three answers. One is that if we use it to sort patients into the two categories, we will (in the long run) do a better job than if we use some totally bogus procedure (such as flipping a coin). This is a “frequentist” interpretation of diagnostic value.

A second justification takes the notion of “support” for a hypothesis as fundamental. 37/ Results that are more probable under a hypothesis H1 about the true state of affairs are stronger evidence for H1 than for any alternative (H2) under which they are less probable. If the evidence were to occur with equal probability under both states, however, the evidence would lend equal support to both possibilities. In this example, such evidence would provide no basis for distinguishing between cancer-free and cancer-afflicted patients. It would have no diagnostic value, 38/ and the test should be kept off the market. The coin-flipping test is like this. A head is no more or less probable when the cancer is present than when it is absent.

A difference between the “frequentist,” long-run justification and the “likelihoodist,” support-based understanding is that the latter applies even when we do not perform or imagine a long series of tests. If it really is more probable to observe the data under one state of affairs than another, it would seem perverse to conclude that the data somehow support the latter over the former. The data are “more consistent” with the state of affairs that makes their appearance on a single occasion more probable (even without the possibility of replication).

The same thing is true of circumstantial evidence in law. Circumstantial evidence E that is just as probable when one party’s account is true as it is when that account is false has no value as proof that the account is true or false. It supports both states of nature equally and is logically irrelevant. To condense these observations into a formula, we can write:
E is irrelevant (to choosing between H1 and H2) if P(E|H1) = P(E|H2),
where P(E|H1) and P(E|H2) are the probabilities of the evidence conditional on (“given the truth of,” or just “given”) the hypotheses. The conditional probabilities (or quantities that are directly proportional to them) have a special name: likelihoods. So a mathematically equivalent statement is that
E is irrelevant if the likelihood ratio L = P(E|H1) / P(E|H2) = 1.
A fancier way to express it is that E is irrelevant if the logarithm of L is 0. Such evidence E has zero “weight” when placed on a metaphorical balance scale that aggregates the weight of the evidence in favor of one hypothesis or the other. 39/ In this [prostate cancer] case, the likelihood ratio for a positive test result is 70% ÷ 10% = 7, which is greater than 1. Thus, the test is relevant evidence in deciding whether the patient has cancer. ...
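
As a purely illustrative aside (mine, not the review's), the relevance criterion just described takes only a couple of lines of Python:

    def likelihood_ratio(p_e_given_h1, p_e_given_h2):
        # L = P(E|H1) / P(E|H2); L equal to 1 means E is irrelevant to choosing between H1 and H2
        return p_e_given_h1 / p_e_given_h2

    # PSA example from the text: positive result in 70% of cancer patients, 10% of cancer-free patients
    print(likelihood_ratio(0.70, 0.10))   # 7.0 -> relevant evidence
    print(likelihood_ratio(0.50, 0.50))   # 1.0 -> a coin-flip "test" is irrelevant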

Nothing that I have said so far involves Bayes’s rule. “Likelihood” and “support” are the primitive concepts. Lempert argued for a likelihood ratio of 1 as the defining characteristic of relevance by relying on a third justification—the Bayesian model of learning. How does this work? Think of probability as a pile of poker chips. Being 100% certain that a particular hypothesis about the world is correct means that all of the chips sit on top of that hypothesis. Twenty-five percent certainty means that 25% of the chips sit on the same hypothesis, and the remaining 75% are allocated to the other hypotheses. 42/ To keep things as simple as possible, let’s assume there are only two hypotheses that could be true. To be concrete, let’s say that H1 asserts that the individual has cancer and that H2 asserts that he does not. Assume that doctors know that men with this patient’s symptoms have a 25% probability of having prostate cancer. We start with 25% of the chips on hypothesis 1 (H1: cancer) and 75% on the alternative (H2: some other cause of the symptoms). Learning that the PSA test is positive for cancer requires us to take some of the chips from H2 and put them on H1. Bayes’s rule dictates just how many chips we transfer. The exact amount generally depends on two things: the percentage of chips that were on H1 (the prior probability) and the likelihood ratio L in this simple situation. ... [T]he very simple structure of Bayes’s rule in this case [is]
Odds(H1) · L = Odds(H1|E).
The rule requires updating the “prior odds” (on the left-hand side) by multiplying by the Bayes factor (which also is the likelihood ratio L) to arrive at the “posterior odds” (on the right-hand side). ...

The crucial point is that multiplication by L = 1 never changes the prior odds. Evidence that is equally probable under each hypothesis produces no change in the allocation of the chips—no matter what the initial distribution. Prior odds of 1:3 become posterior odds of 1:3. Prior odds of 10,000:1 become posterior odds of 10,000:1. The evidence is never worth considering. Again, we can get fancy and place the odds and the likelihood ratio on a logarithmic scale. Then the posterior log odds are the prior log odds plus the weight of the evidence (WOE = log-L):
New LO = Prior LO + WOE. 44/
Evidence that has zero weight (L = 1, log-L = 0) leaves us where we started. Evidence E that does not change the odds (and, hence, the corresponding probability) is uninformative—it is irrelevant. Inversely, evidence that does change the probability is relevant—as [Federal] Rule [of Evidence] 401 states in near-identical terms. This, in a nutshell, is the Bayesian explanation of the rule as it applies to circumstantial evidence. It tracks the text of the rule better than the likelihoodist, support-based analysis, but both lead to the conclusion that relevance vel non turns on whether the likelihood ratio departs from 1. ...
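
To make the poker-chip metaphor concrete, here is a small worked example (mine, not the review's) using the numbers already introduced: a 25% prior probability of cancer and a likelihood ratio of 7 for a positive PSA result:

    from math import log10

    prior_prob = 0.25
    L = 7.0                                          # likelihood ratio for a positive PSA test

    prior_odds = prior_prob / (1 - prior_prob)       # 1:3
    posterior_odds = prior_odds * L                  # Bayes's rule in odds form: Odds(H1) x L = Odds(H1|E)
    posterior_prob = posterior_odds / (1 + posterior_odds)
    print(round(posterior_prob, 3))                  # 0.7

    # The same update on the logarithmic scale: new LO = prior LO + WOE, where WOE = log L
    new_log_odds = log10(prior_odds) + log10(L)
    assert abs(10 ** new_log_odds - posterior_odds) < 1e-9

A likelihood ratio of 1 would leave the log odds (and hence the probability) unchanged, whatever the prior.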

... The simple likelihood ratio is the basic measure that dominates the forensic science literature on evaluative conclusions. However, most writers in this area construe the likelihood ratio as the ratio of posterior odds to prior odds and base its use on that purely Bayesian interpretation. Greater clarity would come from using the related term “Bayes factor” when this is the motivation for the ratio. 51/ [Note 51: The choice of words is not merely a labeling issue. In simple situations, the Bayes factor and the likelihood ratio are numerically equivalent, but more generally, there are conceptual and operational differences. For instance, simple likelihood ratios can be used to gauge relative support within any pair of hypotheses, even when the pair is not exhaustive. But when there are many hypotheses, the Bayes factor is not so simple. See [Peter M. Lee, Bayesian Statistics 140 (4th ed. 2012)], at 141–42. It becomes the usual numerator divided by a weighted sum of the likelihoods for each hypothesis. The weights are the probabilities (conditional on the falsity of the hypothesis in the numerator). For an example, see Tacha Hicks et al., A Framework for Interpreting Evidence, in Forensic DNA Evidence Interpretation 37, 63 (John S. Buckleton et al. eds., 2d ed. 2016). Furthermore, there is disagreement over the use of a likelihood ratio for highly multidimensional data (such as fingerprint patterns and bullet striations) and whether and how to express uncertainty with respect to the likelihood ratio itself. Compare Franco Taroni et al., Dismissal of the Illusion of Uncertainty in the Assessment of a Likelihood Ratio, 15 Law, Probability & Risk 1, 2 (2016), with M.J. Sjerps et al., Uncertainty and LR: To Integrate or Not to Integrate, That’s the Question, 15 Law, Probability & Risk 23, 23–26 (2016). ... ] ...

The obvious Bayesian measure of probative value is the Bayes factor (B). In the examples used here, B is equal to the likelihood ratio L, and therefore the statisticians’ “weight of evidence” is WOE = log-B = log-L. 58/ The value of L in these cases tells us just how much more the evidence supports one theory than another and hence—this is the Bayesian part—just how much we should adjust our belief (expressed as odds) for any starting point. For the PSA test for cancer, L = 7 is “the change in odds favoring disease.” A test with greater diagnostic value would have a larger likelihood ratio and induce a stronger shift toward that conclusion. ... [T]he likelihood-ratio measure (or variations on it), which keeps prior probabilities out of the picture, is more typically used to describe the value of test results as evidence of disease or other conditions in medicine and psychology. Using the same measure in law has significant advantages. ...

NOTES
* David H. Kaye, Digging into the Foundations of Evidence Law, 115 Mich. L. Rev. 915 (2017) (reviewing Michael J. Saks & Barbara A. Spellman, The Psychological Foundations of Evidence Law (2016)).

31. Vic Barnett, Comparative Statistical Inference 306 (3d ed. 1999) (“The principles of maximum likelihood and of likelihood ratio tests occupy a central place in statistical methodology.”); see, e.g., id. at 178–80 (describing likelihood ratio tests in frequentist hypothesis testing); N. Reid, Likelihood, in Statistics in the 21st Century 419 (Adrian E. Raftery et al. eds., 2002).

37. A “support function” can be required to have several appealing, formal properties, such as transitivity and additivity. E.g., A.W.F. Edwards, Likelihood 28–32 (Johns Hopkins Univ. Press, expanded ed. 1992) (1972). It also can be derived, in simple cases, from other, arguably more fundamental, principles. E.g., Barnett, supra note 31, at 310–11.

39. See generally I. J. Good, Weight of Evidence and the Bayesian Likelihood Ratio, in The Use of Statistics in Forensic Science 85 (C.G.G. Aitken & D.A. Stoney eds., 1991); I. J. Good, Weight of Evidence: A Brief Survey, in 2 Bayesian Statistics 249 (J.M. Bernardo et al. eds., 1985) (providing background information regarding the use of Bayesian statistics in evaluating weight of evidence). ...


42. If the individual were to keep some of the chips in reserve, the analogy between the fraction of them on a hypothesis and the kind of probability that pertains to random events such as games of chance would break down.

44. A deeper motivation for using logarithms may lie in information theory, but, if so, it is not important here. See Solomon Kullback, Information Theory and Statistics (1959).

58. The logarithm of B has been called “weight of evidence” since 1878. I.J. Good, A. M. Turing’s Statistical Work in World War II, 66 Biometrika 393, 393 (1979) ... . While working in the town of Banbury to decipher German codes, Alan Turing famously (in cryptanalysis and statistics, at least) coined the term “ban” to designate a power of 10 for this metaphorical weight. Good, supra, at 394. Thus, a B of 10 is 1 ban, 100 is 2 ban, and so on.

Saturday, May 20, 2017

Science Friday and Contrived Statistics for Hair Comparisons

On May 19th, Public Radio International's Science Friday show had a segment entitled "There’s Less Science In Forensic Science Than You Think." The general theme — that some practices have not been validated by rigorous scientific testing — is a fair (and disturbing) indictment. But listeners may have come away with the impression that the FBI has determined that hair examiners make up statistics from personal experience 95% of the time to help out prosecutors.

Ira Flatow, the show's host, opened with the observation that "The FBI even admitted in 2015, after decades, investigators had overstated the accuracy of hair sample matches over 95% of the time in ways that benefited the prosecution." He returned to this statistic when he asked Betty Layne DesPortes, a lawyer and the current President of the American Academy of Forensic Sciences, the following question:
Dr. DesPortes, I want to go back to that FBI admission in 2015 that for decades investigators had overstated the accuracy of their hair samples, and I mean 95% of the time in a way that benefited the prosecution. Is this a form of cognitive bias coming into the picture?
Ms. DesPortes replied that
It is, and ... you would have overstatement along the lines of, "Well, I’ve never seen in my X years of experience that two hairs would be this similar, so it must be a match," and then they would just start making statistics up based on, "Well, I’ve had a hundred cases in my practice, and there have been a thousand cases in my lab, and nobody else has ever reported similar hairs like this," so let’s just start throwing in one in a hundred thousand as a statistic — "one in a hundred thousand" — and that’s where the misstatement came in.
But neither Ms. DesPortes nor anyone else knows how often FBI examiners cited statistics like "one in a hundred thousand" based on either their recollections of their own casework or their impression of the collective experience of all hair examiners. 1/

To be sure, such testimony would have been flagged as erroneous in the FBI-DOJ Microscopy Hair Comparison Review. But so would a much more scientifically defensible statement such as
The hair removed from the towel exhibited the same microscopic characteristics as the known hair sample, and I concluded it was consistent with having originated from him. However, hair comparison is not like fingerprints, for example. It’s not a positive identification. I can’t make that statement. 2/
The Hair Comparison Review was not designed to produce a meaningful estimate of an error rate for hair comparisons. It produced no statistics on the different categories of problematic testimony. The data and the results have not been recorded (at least, not publicly) so as to allow independent researchers to ascertain the extent to which FBI examiners overstated their findings in various ways. See David H. Kaye, Ultracrepidarianism in Forensic Science: The Hair Evidence Debacle, 72 Wash. & Lee L. Rev. Online 227 (2015).

The interim results from the Hair Comparison Review prompted the Department of Justice to plan a retrospective study of FBI testimony involving other identification methods as well. In July 2016, it asked a group of statisticians how best to conduct the new "Forensic Science Disciplines Review." The informal recommendations that emerged in this "Statisticians' Roundtable" included creating a database of testimony that would permit more rigorous, social science research. But this may never happen. A new President appointed a new Attorney General, who promptly suspended the expanded study.

NOTES
  1. Ms. DesPortes may not have meant to imply that all the instances of exaggerated testimony were of the type she identified.
  2. That statements like these may be scientifically defensible does not render them admissible or optimal.
(For related postings, click on the label "hair.")

Tuesday, May 16, 2017

The Reappearing Rapid DNA Act

With bipartisan sponsorship, the Rapid DNA Act of 2017 (H.R. 510 and S. 139) is sailing through Congress. The Senate bill made it to the legislative calendar on May 11, 2017, without amendment and without a written report from the Judiciary Committee. The Committee Chairman, Senator Grassley, wrote this about the bill:
Turning to legislation, the first bill is S.139, the Rapid DNA Act of 2017. It is sponsored by Senator Hatch. The Committee reported this bill and the Senate passed it in the last Congress. The bill would establish standards for a new category of DNA samples that can be taken more quickly and then uploaded to our national DNA index. 1/
This characterization is misleading. The bill itself contains no standards for producing profiles to upload to the national database. It orders the FBI to “issue standards.” Specifically, the part of the bill entitled “standards” adds to the DNA Identification Act of 1994, 42 U.S.C. § 14131(a), a new Section 5, which reads as follows:
(A) ... the Director of the Federal Bureau of Investigation shall issue standards and procedures for the use of Rapid DNA instruments and resulting DNA analyses.
(B) In this Act, the term ‘Rapid DNA instruments’ means instrumentation that carries out a fully automated process to derive a DNA analysis from a DNA sample. 2/
But the FBI does not need new authorization to devise standards for “Rapid DNA instruments.” The “resulting DNA analyses” are not a new category of “samples,” and some such profiles already may be in the National DNA Index System (NDIS). In fact, the FBI issued standards for “rapid” profiles years ago. One need only peek at the FBI's forthright answers to “Frequently Asked Questions on Rapid DNA Analysis.” There, the FBI explained that
Based upon recommendations from the Scientific Working Group on DNA Analysis Methods (SWGDAM), the FBI Director approved and issued The Addendum to the Quality Assurance Standards for DNA Databasing Laboratories performing Rapid DNA Analysis and Modified Rapid DNA Analysis Using a Rapid DNA Instrument (or “Rapid QAS Addendum”). The Addendum contains the quality assurance standards specific to the use of a Rapid DNA instrument by an accredited laboratory; it took effect December 1, 2014.
The FBI added that “[a]n accredited laboratory participating in NDIS may use CODIS to upload authorized known reference DNA profiles developed with a Rapid DNA instrument performing Modified Rapid DNA Analysis to NDIS if [certain] requirements are satisfied” and that “DNA records generated by an NDIS-approved Rapid DNA system performing Rapid DNA analysis in an NDIS participating laboratory are eligible for NDIS.” 3/

But if the FBI does not need the bill to develop standards or to incorporate rapid-DNA results into NDIS, what is the real purpose of the bill? The answer is simple. The bill clears the way for these results to come, not from accredited laboratories, 4/ but from police stations, jails, or prisons. The House Judiciary Committee was explicit in its brief report on the bill:
Currently, booking stations have to send their DNA samples off to state labs and wait weeks for the results. This has created a backlog that impacts all criminal investigations using forensics, not just forensics used for identification purposes. H.R. 510 would modify the current law regarding DNA testing and access to CODIS. The short turnaround time resulting from increased use of Rapid DNA technology would help to quickly eliminate potential suspects, capture those who have committed a previous crime and left DNA evidence, as well as free up current DNA profilers to do advanced forensic DNA analysis, such as crime scene analysis and rape-kits. 5/
The FBI was more succinct when it referred to “the goal of using Rapid DNA systems in the booking environment” and reported that “legislation will be needed in order for DNA records that are generated by Rapid DNA systems outside an accredited laboratory to be uploaded to NDIS.” 6/

Is the migration of DNA profiling from the laboratory to the police station — and potentially to the officer on the street — a good idea? The efficiency argument from the House Committee has some force. We do not demand that only accredited laboratories conduct breath alcohol testing of drivers who seem to be intoxicated. Police using properly maintained portable instruments can do the job. 7/

How is DNA different? In one respect, it is less problematic than roadside alcohol testing. Rapid DNA analysis is not for crime-scene samples. (At least, not yet.) It is for samples from arrestees or convicted offenders whose profiles can be uploaded to a database. The police have an incentive to avoid uploading inaccurate profiles. Such profiles will degrade the effectiveness of the database. Any cold hits that they might produce will be shown to be false when a later DNA test from the suspect fails to replicate the incorrect profile. In contrast, incriminating output of a faulty alcohol test usually enables a conviction and will not be shown to be in error.

But there is more to the matter than efficiently generating and uploading profiles. It could be argued that DNA information is more private than a breath alcohol measurement and that having CODIS profiles known to local police is more dangerous than having them known only to laboratory personnel. Considering the limited kind of information that is present in a CODIS profile, however, this argument does not strike me as compelling.

POSTSCRIPT

The Rapid DNA Act of 2017 met no opposition as the Senate and House passed the bills. S. 139 generated unanimous consent (and no discussion) on May 16. 8/ Its counterpart, H.R. 510, passed after receiving praise from two of its sponsors and the observation from Representative Goodlatte (R-VA) that "this is a good bill. It is a bipartisan bill. I thank Members on both sides of the aisle for their contributions to this effort." 9/

NOTES
  1. Prepared Statement by Senator Chuck Grassley of Iowa, Chairman, Senate Judiciary Committee Executive Business Meeting, May 11, 2017, https://www.judiciary.senate.gov/imo/media/doc/05-11-17%20Grassley%20Statement.pdf, viewed May 16, 2017.
  2. Rapid DNA Act of 2017, S. 139 § 2(a).
  3. The difference between “Rapid DNA Analysis” and “Modified Rapid DNA Analysis” is that the former is “a ‘swab in – profile out’ process ... of automated extraction, amplification, separation, detection, and allele calling without human intervention,” whereas the latter uses “human interpretation and technical review” for ascertaining the alleles in a profile. FBI, Frequently Asked Questions on Rapid DNA Analysis, https://www.fbi.gov/services/laboratory/biometric-analysis/codis/rapid-dna-analysis, Nos. 1 & 2, viewed May 17, 2017.
  4. The DNA Identification Act of 1994, 42 U.S.C. § 14131, which the Rapid DNA Act amends, requires the FBI to create and consider the recommendations of "an advisory board on DNA quality assurance methods." § 14131(a)(1)(A).  The members of the board must come from "nominations proposed by the head of the National Academy of Sciences and professional societies of crime laboratory officials." Id. They "shall develop, and if appropriate, periodically revise, recommended standards for quality assurance, including standards for testing the proficiency of forensic laboratories, and forensic analysts, in conducting analyses of DNA." § 14131(a)(1)(C). As the name indicates, the board is purely advisory. The Act only demands that
    The Director of the Federal Bureau of Investigation, after taking into consideration such recommended standards, shall issue (and revise from time to time) standards for quality assurance, including standards for testing the proficiency of forensic laboratories, and forensic analysts, in conducting analyses of DNA.
    § 14131(a)(2).
    The advisory board was a half-a-loaf response to the recommendation of a National  Academy of Sciences committee for "a National Committee on Forensic DNA Typing (NCFDT) under the auspices of an appropriate government agency, such as NIH or NIST, to provide expert advice primarily on scientific and technical issues concerning forensic DNA typing." NRC Committee on DNA Technology in Forensic Science, DNA Technology in Forensic Science 72-73 (1992). Now that NIST has established an Organization of Scientific Area Committees for Forensic Science to develop science-based standards for DNA testing and other forensic science methods, Congress should reconsider the need for the overlapping FBI board.
  5. On May 11, 2017, the House Committee on the Judiciary recommended adoption of H.R. 510 without holding hearings. The Judiciary Committee saw no need to consult independent scientists. It was satisfied with the fact that
    the Judiciary Committee’s Subcommittee on Crime, Terrorism, Homeland Security and Investigations held a hearing on a virtually identical bill, H.R. 320, on June 18, 2015, [at which] testimony was received from: Ms. Amy Hess, Executive Assistant Director of Science and Technology, Federal Bureau of Investigation; Ms. Jody Wolf, Assistant Crime Laboratory Administrator, Phoenix Police Department Crime Laboratory, President, American Society of Criminal Laboratory Directors; and Ms. Natasha Alexenko, Founder, Natasha’s Justice Project.
    Report to accompany H.R. 510, May 11, 2017, https://www.congress.gov/115/crpt/hrpt117/CRPT-115hrpt117.pdf
  6. FBI Answers, No. 13, https://www.fbi.gov/services/laboratory/biometric-analysis/codis/rapid-dna-analysis, viewed May 17, 2017 (emphasis added).
  7. “As of January 1, 2017, there is no Rapid DNA system that is approved for use by an accredited forensic laboratory for performing Rapid DNA Analysis.” Several systems had been approved but they do “not contain the 20 CODIS Core Loci required as of January 1, 2017.” FBI Answers, No. 6, https://www.fbi.gov/services/laboratory/biometric-analysis/codis/rapid-dna-analysis, viewed May 16, 2017. 
  8. 163 Cong. Rec. S2954-2955, 115th Cong., 1st Sess., May 16, 2017.
  9. Id. at H4205.

Saturday, May 6, 2017

Who Copy Edits ASTM Standards?

This posting is not about science or law. It is about English writing. I recently had occasion to read the “Standard Guide for Analysis of Clandestine Drug Laboratory Evidence” issued by ASTM International, a private standards development organization. The standard exemplifies a common problem with the ASTM standards for forensic science — an apparent absence of copy and line editing to achieve clear and efficient expression of the ideas of the committees that write the standards. 1/

This particular standard, known as E2882-12, opens with an observation about the “scope” of the document — namely, that
This guide does not replace knowledge, skill, ability, experience, education, or training and should be used in conjunction with professional judgment.
The word “replace” has caused a couple of readers to complain that this admonition implies that unstructured “knowledge, skill, ability, experience, education, or training” suffices for the analysis of the evidence. That is not  a fair reading of the sentence, but joining the two clauses with “and” makes it seem like they are separate points. Why not make it as easy as possible for the reader to get the intended message? I think the sentence amounts to nothing more than the following simple idea:
This standard is intended to help professionals use their knowledge and skill to analyze clandestine drug laboratory evidence.
Why not just say this? Why all the extra verbiage?

Unfortunately, this text is not an isolated example of the need for detailed editing. Another infelicity is
capacity—the amount of finished product that could be produced, either in one batch or over a defined period of time, and given a set list of variables.
The words “and given a set list of variables” are a sentence fragment. They dangle aimlessly after the comma. The copy edit is obvious:
Capacity is the amount of finished product that could be produced, for a specified set of variables, either in one batch or over a stated period of time.
It still may not be clear what a “set of variables” means here, but at least the words about unnamed variables occur where they belong.

The wording in a section on reporting is especially obscure:
Laboratories should have documented policies establishing protocols for reviewing verbal information and conclusions should be subject to technical review whenever possible. It is acknowledged that responding to queries in court or investigative needs may present an exception.
One clear statement of what the sentences seem to assert is that
Laboratories should have written protocols to ensure that oral communications from laboratory personnel are reviewed for technical correctness. However, a protocol can dispense with (1) review of some courtroom testimony and (2) review that would impede an investigation.
Whether this edited version expresses what the authors wanted to say or presents a satisfactory policy is unclear, but at least the version is more easily understood.

Other phrases that should raise red flags for editing abound. I’ll end with three examples.
  • This guide does not purport to address all of the safety concerns, if any, associated with its use. The editor would say: Make up your mind. If there are no safety concerns, then the sentence is worthless. If there are safety concerns, then the standard should address them. If there is a reason not to address all of them, then the standard can say, “There are additional safety concerns for a laboratory to consider.” If there is a desire to be very cautious, it could read, “There could be additional safety concerns for a laboratory to consider.”
  • ... calculations can be achieved from ... . Copy editor: It sounds odd to speak of "achieving" calculations. The phrase "calculations can be made by" would be more apt.
  • Quantitative measurements of clandestine laboratory samples have an accuracy which is dependent on sampling and, if a liquid, on volume calculations. This sentence is both circumlocutious ("which is dependent") and disjointed ("if a liquid" is in the wrong place to modify "samples"). It also seems to conflate measurements on subsamples of the material submitted for analysis ("clandestine laboratory samples") with inference from the subsamples to the sample of the seized items. If this reading of the dense sentence is correct, editing would expand it along the following lines: "The accuracy of quantitative measurements of a liquid sample depends on the calculated volume of the sample. When the material analyzed is not the entire sample, then the accuracy of any inferences to the entire sample also depends on the homogeneity of the sample and the procedure by which the subsample was chosen."
Good writing requires the right words in the correct order. Good editing makes the writing more readable. Many existing technical standards in forensic science still need good editing to make them fully fit for purpose.

NOTE
  1. Although some publishers distinguish between line editing and copy editing, this posting uses the phrase "copy editing" broadly, to refer to the process of reviewing and correcting written material to ensure "that whatever appears in public is accurate, easy to follow, and fit for purpose." Society for Editors and Proofreaders, FAQs: What Is Copy-editing?, https://www.sfep.org.uk/about/faqs/what-is-copy-editing/.

Friday, April 28, 2017

Are "Exclusions" Deductive and "Identifications" Merely Probabilistic?

Lately I have heard people say that “source exclusions” are the product of deductively valid reasoning, whereas “source identifications” are less certain. But the difference between such conclusions does not arise from the fact that one is deductive and the other is not. On reflection, "exclusions" are no less probabilistic in nature than "identifications." 

Presumably, the reason people may think that exclusions are deductions is that “exclusion” can be part of a deductive argument. For example,
(1) No human being with Type O blood will leave Type A blood at a crime scene.
(2) Defendant has Type O blood.
(3) The crime-scene bloodstain is Type A.
Therefore,
(4) Defendant is excluded as the source of the stain.
This argument is a formally valid deduction. If the premises (1)–(3) are true, the conclusion (4) must be true. There is, however, no guarantee that any or all the premises are true. Perhaps something very strange (but not logically impossible) happened to convert a Type O stain into a Type A one. Or perhaps the defendant or the stain was mistyped. These are not very likely events, but deductive logic cannot rule them out. So although (4) is certain to be true conditional on (1)–(3), we cannot be absolutely certain that (4) is in fact true. In the symbolism of probability statements, the fact that Pr[(4) | (1)&(2)&(3)] = 1 does not ensure that Pr(4) = 1 unless Pr(1) = Pr(2) = Pr(3) = 1.
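To make the point concrete, here is a minimal numerical sketch in Python. The premise probabilities are hypothetical and chosen only for illustration; nothing in the example supplies them.
    # Hypothetical illustration: treat the three premises as independent,
    # each with probability 0.99 of being true. The deduction guarantees the
    # conclusion only when all three premises hold, so Pr(4) is at least
    # Pr(1 & 2 & 3), but it need not equal 1.
    p1 = p2 = p3 = 0.99
    p_all_premises = p1 * p2 * p3  # about 0.9703 under independence
    print(f"Pr(all three premises true) = {p_all_premises:.4f}")
    print(f"Pr(conclusion) >= {p_all_premises:.4f}, not necessarily 1")
The deductive form supplies the conditional certainty Pr[(4) | (1)&(2)&(3)] = 1; whatever doubt attaches to the premises carries over to the conclusion.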

One might think that I have misstated the argument. Indeed, the argument ending in exclusion might be reframed as follows:
(1) Everything we know tells us that no human being with Type O blood will leave Type A blood at a crime scene.
(2) A blood test shows that defendant has Type O blood.
(3) A blood test shows that the crime-scene bloodstain is Type A.
Therefore,
(4) The blood test excludes defendant as the source of the blood stain.
This too is a deductively valid argument. The premises entail the conclusion, and the premises are even harder to dispute than the ones in the previous example. But if this is all that a criminalist means by an exclusion, then an exclusion is not actually a statement that the defendant is not the source of the trace. The conclusion in our second argument only asserts that the test has excluded the defendant. Unless the test never errs, it does not follow (deductively) that the defendant was not the source of the bloodstain. To appreciate the force of the "deduction," we need to study how often criminalists report exclusions when examining items from different sources as opposed to items from the same source.
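One way to summarize that kind of performance data is a likelihood ratio comparing the two reporting rates. The rates in the following sketch are made up solely to show the form of the computation; they do not come from any published study.
    # Hypothetical reporting rates for "exclusion," not from any actual study.
    p_exclude_given_different_sources = 0.95  # rate when the items truly have different sources
    p_exclude_given_same_source = 0.02        # rate when the items truly share a source
    # The ratio tells us how much more probable a reported exclusion is
    # when the sources differ than when they are the same.
    likelihood_ratio = p_exclude_given_different_sources / p_exclude_given_same_source
    print(f"Likelihood ratio for a reported exclusion = {likelihood_ratio:.1f}")
With these invented numbers, a reported exclusion would be 47.5 times more probable for different-source items than for same-source items. The force of the "deduction" rests on empirical rates of this kind, not on the logical form of the argument.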

The situation is the same for “identification.” This conclusion also can come at the end of a deductive argument. For example,
(1) Fingerprints from the same finger always match.
(2) Fingerprints from different fingers never match.
(3) The questioned and known fingerprints being compared match.
Therefore,
(4) The fingerprints being compared are from the same finger — an “identification.”
As with an exclusion, the argument from (1)–(3) to (4) is logically impeccable. If (1)–(3) are true, then so is (4). And, once more, because propositions (1)–(3) might not all be true, the truth of the “identification” is not absolutely certain.

Again, we can rephrase the argument in a (vain) effort to make it appear that the desired conclusion is purely deductive:
(1) Everything we know tells us that fingerprints from the same finger always match.
(2) Everything we know tells us that fingerprints from different fingers never match.
(3) My examination shows that the questioned and known fingerprints being compared match.
Therefore,
(4) I have identified the questioned print as coming from the finger that left the known print.
But again, the deductive argument does not get us to the conclusion that the known finger is the source of the questioned print. To ascertain the probative value of a positive source classification (an “identification”), we need to study the performance of criminalists making these source attributions. We need to study how often criminalists report “identification” when examining items from the same source as opposed to items from different sources. In the end, if there is a difference in the certainty we can attach to an exclusion as opposed to an identification, it does not emanate from the difference between inductive and deductive forms of argument. It results from the fact that the premises of some inductive arguments are more probably true than the premises of other inductive arguments.
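The same computation, with the conditioning reversed, expresses the probative value of a reported identification. Again, the rates are invented solely for illustration.
    # Hypothetical reporting rates for "identification," for illustration only.
    p_identify_given_same_source = 0.90         # rate when the prints truly come from the same finger
    p_identify_given_different_sources = 0.001  # rate when the prints come from different fingers
    likelihood_ratio = p_identify_given_same_source / p_identify_given_different_sources
    print(f"Likelihood ratio for a reported identification = {likelihood_ratio:.0f}")
A ratio of this kind, estimated from studies of examiner performance, is what would let a factfinder weigh an "identification," just as the corresponding ratio weighs an "exclusion."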

Reference
  1. Brian Skyrms, Choice and Chance: An Introduction to Inductive Logic (4th ed. 2000).

Wednesday, April 26, 2017

A Superficial Opinion on Fingerprints in Missouri Has a Few Interesting Wrinkles

In State v. Hightower, 1/ the Missouri Court of Appeals added to the list of superficial opinions on the admissibility of latent fingerprint matches. In this case, a man pointed a gun at the driver of a car, snatched a purse and an iPad, fired into the air, and escaped. The driver did not see the robber’s face, but a detective lifted two “relatively new and undisturbed” prints from the driver’s window. A latent fingerprint examiner with the St. Louis County Police Department used a state AFIS (automated fingerprint identification system) to arrive at the conclusion that the latent prints “were left by Defendant's left middle and ring fingers.” On the basis of this identification, a jury convicted David Hightower of armed robbery, and the trial court sentenced him to serve 18 years in prison.

At a pretrial hearing on general scientific acceptance, Dr. Ralph Haber, a research psychologist and “forensic scientist and expert witness” (resume, at 1), testified for the defendant that (as the state court of appeals put it) “the National Academy [of Science] and the National Institute [of Standards and Technology] have both decried the reliability and accuracy of fingerprint evidence adduced using the ACE-V method.” Apparently, he was referring to the well-known 2009 report of the NAS Committee on Identifying the Needs of the Forensic Science Community and the report of the NIST Expert Working Group on Human Factors in Latent Print Analysis (D.H. Kaye ed., 2012). 2/ He also “testified he has been asked to serve on the National Commission [on Forensic Science] committee responsible for developing standards for fingerprint analysis.” 3/ The trial court was more impressed by his admission that “in every hearing he had been involved in to exclude fingerprint evidence the evidence had been deemed admissible, save one case from Maryland.” It denied the defendant’s motion to exclude the evidence.

At trial, Dr. Haber “concluded that a person could not be identified with 100% certainty based on a fingerprint,” but acknowledged on cross-examination that “he had not looked at the actual fingerprints in the present case.” When the prosecutor argued in a closing statement that “other ... experts in latent fingerprint examinations ... could have examined this but were not asked to,” the judge instructed the jury to disregard the statement.

The court of appeals assumed that the comment was improper. In a typical display of judicial unrealism, the appellate court blithely “presume[d] the court's curative instructions to the jury removed any prejudice from the prosecutor's statements.” But was the prosecutor in the wrong in the first place? Although the defense certainly has no obligation to examine fingerprints, I do not think the answer is entirely obvious — particularly when the defense produces an expert who testifies that the identification is wrong or uncertain. Here, Dr. Haber apparently chose not to examine the prints, although his resume prominently advertises 120 hours of “fingerprint comparison training.” Of course, the prosecutor's comment referred not just to the testifying defense expert’s work, but to the defendant’s decision not to call on still other examiners. Moreover, the defense theory was not that other examiners would disagree with the state’s expert, but only that the meaning of an agreed-upon match is unclear.

If the meaning of a match is indeed unclear — because the entire process has not been fully validated — then it is hard to see how the evidence is generally accepted in a relevant scientific community. The opinion in Hightower does not respond to that argument. Instead, the court maintained that judges have always found subjective comparisons of fingerprints sufficient to demonstrate singular identity. But very few of these opinions have asked whether the scientific literature evinces general acceptance of the proposition that latent print examiners as a group can reliably and accurately match prints of the quality of the ones in this case. Without that showing, how can general scientific acceptance be said to exist?

Fortunately, there are empirical studies of the process that help address this question. These studies appear in reputable scientific journals. 4/ They should inform rulings on admissibility, and the scientific findings should be used to help convey the degree of certainty in any source conclusions.

Finally, the court made short work of Hightower's argument that the conviction could not rest on the fingerprint identification alone, at least not without "additional evidence indicating the fingerprints could have only been impressed at the time the crime was committed." The court wrote that
This argument is without merit. “[A] fingerprint at the scene of the crime may in and of itself be sufficient to convict.” State v. Bell, 62 S.W.3d 84, 96 (Mo. App. W.D. 2001). The defendant in Bell claimed that a partial palm print found in a place accessible to the public without credible evidence establishing it was left near the time of crime was insufficient evidence to sustain a conviction. Id. The Western District denied his point, stating there “was sufficient evidence to establish that the palm print on the counter was recent and occurred near the time of the crime” because the hotel clerk testified she had cleaned the counter twenty minutes prior to the robbery and no one other than herself and the robber had touched it in between the time of the robbery and the time the police lifted the print. Id. In the present case, Ms. Gillespie testified and her mother echoed that Defendant hit the car window open-palmed and the detective who collected the fingerprints stated they appeared “fresh” and it was his belief that they had been left recently.
At least one law review article has proposed that a single item of circumstantial evidence tying a defendant to a crime should not be sufficient for a conviction. Nevertheless, when the value of the evidence is great enough, a rigid, two-pieces-of-evidence rule seems too strict.

A further problem in Hightower is that it is not clear that a lay witness can discern the age of the fingerprints or that the detective possessed the necessary expertise to do so. What skill or experience did he have in dating prints? Are there any studies to establish that anyone can discern the age of prints just by shining a flashlight on them? Ascertaining how old prints might be from their physical or chemical properties always has eluded forensic science, although a promising technique has been reported. In Hightower, though, it does not appear that the defendant objected to this part of the detective's testimony.

NOTES
  1. 511 S.W.3d 454 (Mo. App. 2017).
  2. It would be fairer to say that these reports called for research to establish the probabilities of false positive and negative errors in latent print examinations and for the results of comparisons to be presented in ways that recognize the degree of uncertainty in fingerprint identifications.
  3. The Commission never established a subcommittee on fingerprint analysis, and Dr. Haber is not listed as a member of any of the Commission’s seven subcommittees.
  4. Some of them are described in this blog.

The Justice Department’s Explanation for the End of the National Commission on Forensic Science

The decision of the Department of Justice to let the NCFS expire — a decision that was as predictable as the date of the next solar eclipse — was presented to the Commission at its final meeting on Monday, April 10. Everyone present knew that the NCFS was not intended to be an indefinite fixture. New administration or not, a decision to continue operating the Commission and its subcommittees had to come by April 23, 2017. 1/ After all, the NCFS charter of April 23, 2013, asked it “to provide recommendations and advice to the Department of Justice” for two years. Former Attorney General Eric Holder renewed the charter once, on April 23, 2015. 2/ The new Attorney General, Jeff Sessions, elected not to renew the charter a second time.

No reasons for this discretionary action were provided. An Associate Deputy Attorney General graciously thanked the Commission for its work, stated that the new Attorney General had decided to use alternative mechanisms for developing departmental policy, described some aspects of what those would be, indicated that a press release was in the works, and thanked the Commission again.

Below I describe the written documents (an Executive Order and press releases) related to the sunsetting of the Commission and reproduce an abridged version of the (slightly garbled) computer-generated transcript. I made corrections to the extent I was confident about what actually was said, and I edited out some material that was not important to seeing where the Department of Justice (DOJ) may be headed.

I. The DOJ Press Release and the Task Force on Crime Reduction and Public Safety

A press release of April 10, 2017, announced “a series of actions the Department will take to advance forensic science and help combat the rise in violent crime.” A new “Task Force on Crime Reduction and Public Safety” within the DOJ “will spearhead the development of [a] strategic plan” to “increase the capacity of forensic science providers, improve the reliability of forensic analysis, and permit reporting of forensic results with greater specificity.”

Nothing in the creation of the Task Force specifically suggested forensic science was of any concern. It emanated from an Executive Order signed on February 9, 2017. Harking back to the rhetoric of the Nixon Administration but placing illegal immigration at the top of the list of crimes to combat, this order declared that
It shall be the policy of the executive branch to reduce crime in America. ... A focus on law and order and the safety and security of the American people requires a commitment to enforcing the law and developing policies that comprehensively address illegal immigration, drug trafficking, and violent crime.
Within three weeks, Attorney General Sessions outlined the membership of the Task Force. The President’s order did not specify who or what types of people should make up the Task Force. The Attorney General designated his Deputy Attorney General as its chair and “relevant Department components” to supply its members. He named “the Director of the Bureau of Alcohol, Tobacco, Firearms and Explosives (ATF), the Administrator of the Drug Enforcement Administration (DEA), the Director of the FBI and the Director of the U.S. Marshals Service (USMS)” as key members.

This group might not possess the immediate knowledge needed to devise a plan to “increase the capacity of forensic science providers, improve the reliability of forensic analysis, and permit reporting of forensic results with greater specificity,” although the membership could be supplemented. Like the membership, the stated objectives seem limited to investigative matters rather than the wide-ranging interests pursued by the NCFS and reflected in its recommendations.

In fact, the list of objectives itself is a little puzzling. No one would argue with the ambition of improving reliability (in the sense of trustworthy results) and laboratory capacity, but what does it mean to be able to report “forensic results with greater specificity”? Reporting that one finger is the source of a latent print, that one gun is the source of a given bullet, and that one set of teeth is the source of a bitemark is already as specific as one can possibly get. One might question the scientific status of such claims, as the last President’s Council of Advisors on Science and Technology did, but the Obama DOJ rejected much of the PCAST critique. Surely, the new President and Attorney General are not expressing newfound doubts about the ability of forensic-science practitioners to make specific source attributions.

II. The Associate Deputy Attorney General’s Remarks to the Commission
Good morning, everyone. Let me say it's an honor to be here on behalf of the acting Deputy Attorney General and be able to address this group. My name is Andrew Goldsmith, I'm an Associate Deputy Attorney General and the Department’s National Criminal Discovery Coordinator. ... As some of you know, about two years ago the then Deputy Attorney General asked me to work with you on the criminal discovery recommendation, and in that role I had the pleasure of working with a number of you.
...
Before I go further I’d like to talk about the Attorney General's firm commitment to forensic science. On Friday, Attorney General Sessions and I spoke in his office, and he made clear to me in his view good forensics is not only important because it enables us to convict the guilty, but also to clear the innocent. He stressed to me we need to focus on the integrity of the process where we have prompt access to high-quality forensics technology. He found troubling the backlog in forensics analysis, and as I will discuss in more depth later, as part of the Task Force on Crime Reduction and Public Safety, he established a forensic science subcommittee to that task force. Moreover, and I also plan to address this as well later on, he has authorized me to announce here today a series of forward-looking actions that will inform the forensic science subcommittee's development of a strategic plan on forensics.
...
The Department and NIST created the Commission as a commitment to strengthening forensic science. Our justice system depends on reliable, scientifically valid evidence to solve crimes, identify wrongdoers, and ensure innocent people are not wrongly convicted. This Department, like every other Department that has come before it, remains committed to these principles. Over the past three years, the Commission has played a role in this effort, and we are grateful for your contributions. I'd like to highlight two contributions I am certain will have long-lasting effect. As you know, we announced new department-wide guidance on criminal discovery in cases with forensic evidence at the last Commission meeting. From my vantage point as the national criminal discovery coordinator, the recommendation on pretrial discovery will have long-lasting and important effects. ... 3/

... [A]s part of my training efforts including my discussion and training of forensic examiners, I have learned there is no single Commission recommendation more important for the practice of forensic science than the recommendation regarding universal accreditation. I have been told the Department's decision to publicly announce the policy on accreditation and to mandate our prosecutors to rely on accredited labs when practicable has made a difference in laboratories moving to accreditation. These recommendations and the Department's review and implementation are a demonstration of the measurable impact of the work of this Commission over the past three years, and for that as well as many other products of this Commission, the Department thanks you.

To identify the elephant in the room, everyone knows the Commission’s charter is expiring this month, and it probably won't be a surprise to learn the charter will not be renewed. As part of any transition, it is critical to re-evaluate and realign resources to achieve a new administration’s priority. Attorney General Sessions has announced his commitment to reducing violent crime in America particularly in our cities, and he has identified the troubling rise in crime as a focus of the Department when he formed the Task Force on Crime Reduction and Public Safety and established a Forensic Science Subcommittee to the Task Force to fight against this increase in crime.

The Task Force and its various subcommittees including the subcommittees on hate crime and on forensic science advise our internal Department working groups with representation from relevant components, including laboratories and prosecuting entities. Although these are internal in nature, they are each seeking relevant external stakeholder input. The forensic science subcommittee in particular has been tasked with considering how we will continue to advance the purposes of this Commission in a manner consistent with the Department’s forensics priorities and its policy to reduce crime in America and develop a strategic plan. We plan to consider all options and closely review the Commission’s summary report and secure feedback from the Commissioners and other stakeholders. We will consider all the information before we decide how to move forward.

Today I'm announcing three actions that will inform the Forensic Science Subcommittee's development of a strategic plan on forensics. First, in the coming week the Department will appoint a senior forensic advisor to interface with forensic science stakeholders, advise Department leadership and work with the Subcommittee to develop a strategic plan. The strategic plan will consider questions critical to increasing capacity and ensuring access to high-quality forensic analysis. Some of the questions that will be considered include the following: What are the biggest needs in forensic science inside the Department and outside the Department? Is there more for a body like the Commission to accomplish, or would next steps be better undertaken by some other body? What specific support do Department laboratories and prosecutors need? What does the partner community need? What is required to improve practices? What are the barriers, legal, practical, or otherwise, and what resources do we need to overcome those barriers? Is the structure sufficient to set standards, or is some other body needed? What is needed to improve capacity so every prosecutor can be assured he or she will receive prompt results when he or she submits evidence for testing? What resources and relationships can the Department best draw on to get thoughtful advice? What is the Department currently doing to advance this issue? Are there better ways to support state and local practitioners?

The second major part of this initiative I announce is that we are publishing a notice for comment in the Federal Register seeking broad stakeholder input on just those questions I went through and what the Department should consider after the expiration of the Commission. That notice will be open until June 9th. We invite you to submit comments and encourage you to share this notice broadly.

Third, the Department is conducting a needs assessment of forensic laboratories. As you know, in December 2016, Congress passed the Justice for All Authorization Act which has several mandates to improve and advance forensic science. The needs assessment will examine serious issues of capacity and backlog at public crime labs and in the medical-legal investigation community. It will consider other topics such as research and coordination necessary when developing a strategic plan to address the needs of the forensic science community.

At the same time, the Department is considering the previously announced forensic science projects (the review and the uniform language for testimony and reports) and identifying where they may fit in the subcommittees’ work. We expect this process of developing a strategic plan to be deliberate and thorough, but by no means endless. We have every expectation of announcing how we will continue to meet these goals in the coming months.

I know the expiration of the Commission’s charter does not impact the — and the Department supports the work — and is coordinating with NIST in whether the MOU [Memorandum of Understanding] needs to be amended. I want to emphasize at the Department we recognize our responsibility to work tirelessly, improve the work we do, and enhance the administration of justice. Part of that responsibility is to ensure we are regularly coordinating with the right people on these issues and acting in a manner that demonstrates our commitment to fair play and honest dealings in every matter we handle. We will work to understand lessons of this Commission and continue to advance our goals.

Again, the Department thanks you for your contributions and emphasizes we are not finished relying on you yet. Please expect to work with us in the coming months and review, share and respond to any public inquiries. The commitment of people in this room, the time and participation over the last three years was exemplary and represents what we are capable of doing when we work together towards a single unified goal. There is no question forensic science is one of the most critical tools we have to reduce crime and increase public safety, and it will remain a priority in the Department. In order to turn back rising crime, we need to rely on you working together. The federal government intends to use its money, research and expertise to help us figure out what your needs are and determine the best ways to ensure forensic science is accurate, reliable and available to law enforcement and prosecutors to fight crime, and the Department of Justice intends to do that.

The new challenge of violent crime in our nation is real and the task in front of us is clear. We need to resist temptation to ignore this or downplay it. We need to tackle it head-on to ensure justice and safety for all Americans. The Department's pledge to identify strategic plans going forward reflects this commitment to justice and the rule of law. In maintaining the public’s confidence in the accurate and reliable forensic science analyses, we need to clear the innocent and convict the guilty. On behalf of the Attorney General, the acting Deputy Attorney General, and the men and women of the Department of Justice, I thank you once again for your efforts.
NOTES
  1. The Federal Advisory Committee Act of 1972 limits such commissions to two years of operation unless renewed.
  2. The renewed charter described the duration as “indefinite” but added that "[t]he Commission's termination date is two years from the date this Charter is filed with Congress, and is subject to renewal in accordance with Section 14 of FACA [the Federal Advisory Committee Act]."
  3. The reference is to the Supplemental Guidance for Prosecutors Regarding Criminal Discovery Involving Forensic Evidence and Experts, Jan. 5, 2017.  The encomium judiciously pretermits the friction with DOJ that NCFS’s work on pretrial discovery initially generated. The Commission produced nothing of much substance until 2015. That year began on a low note when DOJ tried to block NCFS from voting on a draft recommendation to have DOJ laboratories open their files in criminal cases to defense lawyers even when the Federal Rules of Criminal Procedure did not demand such access. The one federal judge on the Commission resigned in protest; newly appointed Deputy Attorney General Sally Yates rescinded the ruling that the NCFS was exceeding its mandate; Judge Jed Rakoff rejoined the group; NCFS approved the draft recommendations; and DOJ responded by issuing the Supplemental Guidance. See http://for-sci-law.blogspot.com/2015/01/justice-department-reverses-decision-on.html.

Thursday, April 13, 2017

Fact Check: The National Commission on Forensic Science Vote That Wasn't

Forensic Magazine continues to report that a majority of the National Commission on Forensic Science voted in favor of its own dissolution. In a mostly recycled paragraph from an earlier article, 1/ its senior science writer, Seth Augenstein, wrote today that “the commission itself had voted against its own renewal at its January meeting, by a 16-15 vote.” 2/

The Commission never took any vote on whether it would be a good idea to extend the Commission's life. The question put to a vote was whether to include a statement to this effect in an historical document summarizing the activities of the Commission. 3/ The subject of the vote could not have been much clearer. 4/ The meeting synopsis states
[A] vote was taken to determine whether this summary report should include a statement that the Commission should continue in its current form. As a business document a simple majority of 50% “yes” votes was required to approve inclusion of this statement. A total of 42% “yes” votes were received, and therefore no statement would be included regarding the continuation of the Commission. 5/
The precise question posed and the complete vote on it were as follows: 6/
Question asked: Does the NCFS Summary Report include a sentence that NCFS continues in its current form?
Total votes: 38 (Yes: 16; No: 15; Abstain: 7)
NOTES
  1. Seth Augenstein, Final Meeting of National Commission on Forensic Science ‘Reflects Back,’ Apr. 10, 2017, 11:59am, http://www.forensicmag.com/news/2017/04/final-meeting-national-commission-forensic-science-reflects-back. The paragraph stated that
    The NCFS produced 45 documents and recommendations in three years of work, which encompassed 600 public comments. But the commission itself had voted against its own renewal at its January meeting, by a 16-15 vote.
  2. Seth Augenstein, Even Without Forensic Commission, Forensic Science Overhaul Proceeds at OSAC, Apr. 13, 2017, 12:12pm, http://www.forensicmag.com/news/2017/04/even-without-forensic-commission-forensic-science-overhaul-proceeds-osac. The latest paragraph states that
    The NCFS, by the end of its last meeting on Tuesday, produced 45 documents and recommendations in three years of work—many of which directed OSAC’s explorations into forensic disciplines. But the commission itself had voted against its own renewal at its January meeting, by a 16-15 vote. Sessions announced that it would not be renewed on Monday.
    The additions are also inaccurate. Very few of the NCFS Views documents and Recommendations documents seem to have "directed OSAC's explorations."
  3. Reflecting Back—Looking Toward the Future, Dec. 16, 2016 (draft), https://www.justice.gov/ncfs/page/file/921431/download.
  4. The discussion as recorded on the meeting webcast includes the following (with intervening speaker statements omitted without ellipses):
    HON. PAM KING: This is a business record ... of this particular Commission. ... This is a document that does not take any real position as to whether something should or should not be done. ...I did get some comments from Commissioners before this meetings ... One of the ones that I really would like to get some discussion on is [the] strong feelings among some Commissioners that maybe we do want to make a statement about whether or not this Commission should continue. ...
    JULIA LEIGHTON: I would not shy away from a recommendation ... I think to scrap it altogether ... is to give up on the work we’ve done.
    GERALD LAPORTE: So I don’t agree — disagree — with anything Julia has said. ... but I don’t know if we really are in a position to make a recommendation ... .
    ARTURO CASADEVALL: I want to support what Julia said. Commissions like this develop an institutional memory. ... I strongly think we should make a recommendation that something like this continue.
    S. JAMES GATES: [A]bsent a committee like this, I don’t see a consistent driver for making progress. ...
    MATTHEW REDLE: Whether it is this form or not, ... there ought to be more work done to continue the progress that we have made ...
    JULIA LEIGHTON: [W]e need a national body [with] the gravitas of being a nonpartisan federal advisory commission. ...
    HON. JED RAKOFF: ... I do think it is important that we say some something [to] indicate that we believe the Commission should continue. ...[J]udges do pay some attention to what this Commission says and does. So I think it plays a role there that is not played by other very wonderful groups ... and some very wonderful reports. I would very much strongly encourage that we have something in there ...
    WILLIAM THOMPSON: This Commission is uniquely well situated to address those [human factors] issues ... so I hope the Commission continues to address those kinds of questions ... .
    JULES EPSTEIN: So ... for this concluding portion ... yeah, we should keep going in some shape or form. ... [M]ore needs to be done. More constituencies will look to us than to other segregated constituencies. [T]he federal advisory commission should continue.
    WILLIE MAY: Certainly, I think that the Commission’s work is not completed. [I]t would serve the country very well to continue this ... .
  5. National Commission on Forensic Science Meeting #12, Jan. 9-10, 2017, at 6, https://www.justice.gov/ncfs/page/file/953106/download
  6. Id. at 10.