Monday, September 11, 2017

The New York City Medical Examiner's Office "Under Fire" for Low Template DNA Testing

According to an Associated Press story, “DNA lab techniques” are “now under fire.” 1/ The article refers to the procedures used by the New York City Office of the Chief Medical Examiner (OCME) to analyze and interpret low template DNA mixtures—samples with minuscule quantities of DNA. Selected segments of the DNA molecules are copied repeatedly (by means of a chemical process known as PCR) to produce enough of them to detect certain highly variable DNA sequences (called STRs). Every round of PCR amplification essentially doubles the number of replicated segments. Following the lead of the U.K.’s former Forensic Science Service, which pioneered a protocol with extra cycles of amplification, the OCME used 31 cycles instead of the standard 28.
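Under the idealized assumption that each cycle exactly doubles the amplified product, three extra cycles multiply the yield roughly eightfold:

    2^31 / 2^28 = 2^3 = 8

Real amplification is less than perfectly efficient, but the added cycles serve the same purpose: coaxing a detectable signal out of very little starting material.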

But if the PCR primer for an STR does not latch on to enough of the small number of starting DNA molecules, that STR will not appear in the PCR-amplified product. At the same time, if stray human DNA molecules are present in the samples, their STRs can be amplified along with the ones that are of real interest. The first phenomenon is called “drop-out”; the second is “drop-in.”

Initially, OCME analysts interpreted the results by hand. Some years later, the office created a computer program, the Forensic Statistical Tool (FST), that used empirically determined drop-in and drop-out probabilities and generated a measure of the extent to which the DNA results supported the conclusion that the mixture contains a suspect’s DNA as opposed to an unrelated contributor’s. The OCME published a validation study of the software. 2/
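To see how drop-out and drop-in enter such a calculation, here is a deliberately stripped-down sketch in Python of a semi-continuous ("drop") model for a single locus and a single contributor. It is not the FST itself; the allele labels, frequencies, and rates are invented for illustration:

    from itertools import combinations_with_replacement

    # Hypothetical allele frequencies at one STR locus (illustrative only).
    FREQ = {"11": 0.25, "12": 0.35, "13": 0.25, "14": 0.15}

    def genotype_prob(genotype, freq=FREQ):
        """Hardy-Weinberg probability of an unordered two-allele genotype."""
        a, b = genotype
        return freq[a] ** 2 if a == b else 2 * freq[a] * freq[b]

    def prob_observed(observed, genotype, drop_out, drop_in, freq=FREQ):
        """P(observed allele set | contributor genotype), treating each allele's
        drop-out and drop-in as independent events."""
        p = 1.0
        geno = set(genotype)
        for allele in freq:
            if allele in geno:
                p *= (1 - drop_out) if allele in observed else drop_out
            else:
                p *= drop_in if allele in observed else (1 - drop_in)
        return p

    def likelihood_ratio(observed, suspect, drop_out, drop_in, freq=FREQ):
        """P(observations | trace is from the suspect) /
        P(observations | trace is from an unknown, unrelated person)."""
        numerator = prob_observed(observed, suspect, drop_out, drop_in, freq)
        denominator = sum(
            genotype_prob(g, freq) * prob_observed(observed, g, drop_out, drop_in, freq)
            for g in combinations_with_replacement(sorted(freq), 2))
        return numerator / denominator

    # Example: one of the suspect's alleles ("12") dropped out of a low-template sample.
    print(likelihood_ratio(observed={"11"}, suspect=("11", "12"), drop_out=0.3, drop_in=0.01))

The FST itself handles mixtures and uses the empirically determined drop-out and drop-in probabilities mentioned above, but the core comparison is the same: the probability of the observed alleles if the suspect contributed versus the probability if an unrelated person did.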

Both these “lab techniques” have been “under fire,” as the AP put it, for years. The article suggests that only two courts have decided serious challenges to LT-DNA evidence, that they reached opposite conclusions, and that the more recent view is that the evidence is unreliable. 3/ In fact, a larger number of trial courts have considered challenges to extra cycles of amplification and to the FST program. Almost all of them found the OCME’s approach to have gained scientific acceptance.

The published opinions from New York trial courts are noteworthy. (There are more unpublished ones.) In the first reported case, People v. Megnath, 898 N.Y.S.2d 408 (N.Y. Sup. Ct. Queens Co. 2010), a court admitted manually interpreted LT-DNA evidence, finding that the procedures are not novel and that the modifications are generally accepted. 4/ In United States v. Morgan, 53 F.Supp.3d 732 (S.D.N.Y. 2014), a federal district court reached the same conclusion. In People v. Garcia, 963 N.Y.S.2d 517 (N.Y. Sup. Ct. 2013), a local New York court found general acceptance of both extra cycles and the FST program.

The first setback for the OCME came in People v. Collins, 49 Misc.3d 595, 15 N.Y.S.3d 564 (N.Y. Sup. Ct., Kings Co. 2015), when a well-regarded trial judge conducted an extensive hearing and issued a detailed opinion finding that the extra cycles and the FST program had not achieved general acceptance in the scientific community. However, other New York judges have not followed Collins in excluding OCME LT-DNA testimony. People v. Lopez, 50 Misc.3d 632, 23 N.Y.S.3d 820 (N.Y. Sup. Ct., Bronx Co. 2015); People v. Debraux, 50 Misc.3d 247, 21 N.Y.S.3d 535 (N.Y. Co. Sup. Ct. 2015). In the absence of any binding precedent (trial court opinions lack precedential value) and given the elaborate Collins opinion, it is fair to say that “case law on the merits of the science” is not so “clear,” but, quantitatively, it leans toward admissibility.

This is not to say that the opinions are equally persuasive or that they are uniformly well informed. A specious argument that several courts have relied on is that because Bayes’ theorem was discovered centuries ago and likelihood ratios are used in other contexts, the FST necessarily rests on generally accepted methods. E.g., People v. Rodriguez, Ind. No. 5471/2009, Decision and Order (Sup.Ct. N.Y. Co. Oct. 24, 2013). That is comparable to reasoning that because the method of least squares was developed over two centuries ago, every application of linear regression is valid. The same algebra can cover a multitude of sins.

Likewise, the Associated Press (and courts) seem to think that the FST (or more advanced software for computing likelihood ratios) supplies “the likelihood that a suspect’s DNA is present in a mixture of substances found at a crime scene.” 5/ A much longer article in the Atlantic presents a likelihood ratio as "the probability, weighed against coincidence, that sample X is a match with sample Y." 6/ That description is jumbled. The likelihood ratio does not weigh the probability that two samples match "against coincidence."

Rather, the ratio addresses whether the pattern of alleles in a mixed sample is more probable if the suspect's DNA is part of the mixture than if an unrelated individual's DNA is there instead. The ratio is the probability of the complex and possibly incomplete pattern arising under the former hypothesis divided by the probability of the pattern under the latter. Obviously, the ratio of two probabilities is not a probability or a likelihood of anything.
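Spelled out, the reported number is a ratio of two conditional probabilities:

    LR = P(observed alleles | the suspect is a contributor to the mixture) /
         P(observed alleles | an unrelated person, rather than the suspect, contributed)

The conditioning runs from hypothesis to evidence, not from evidence to hypothesis.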

Putting aside all other explanations for the overlap between the mixture and the suspect's alleles--explanations like relatives or some laboratory errors--this likelihood ratio indicates how much the evidence changes the odds in favor of the suspect’s DNA being in the mixture. It quantifies the probative value of the evidence, not the probability that one or another explanation of the evidence is true. Although likelihood-ratio testimony has conceptual advantages, explaining the meaning of the figure in the courtroom so as to avoid the misinterpretations exemplified above can be challenging.
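A purely hypothetical set of numbers (not drawn from any case) shows how the updating works:

    posterior odds = prior odds × LR
    prior odds of 1 to 100 × LR of 1,000 = posterior odds of 10 to 1 (a probability of about 91%)
    prior odds of 1 to 1,000,000 × the same LR of 1,000 = posterior odds of only 1 to 1,000

The LR is the same in both rows; what the evidence ultimately proves depends on everything else in the case.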

Notes
  1. Colleen Long, DNA Lab Techniques, Pioneered in New York, Now Under Fire, AP News, Sept. 10, 2017, https://www.apnews.com/ca4e3fdfe28d419c83a50acaf4c24521
  2. Adelle A. Mitchell et al., Validation of a DNA Mixture Statistics Tool Incorporating Allelic Drop-Out and Drop-In, 6 Forensic Sci. Int’l: Genetics 749-761 (2012); Adelle A. Mitchell et al., Likelihood Ratio Statistics for DNA Mixtures Allowing for Drop-out and Drop-in, 3 Forensic Sci. Int'l: Genetics Supp. Series e240-e241 (2011).
  3. Long, supra note 1 ("There is no clear case law on the merits of the science. In 2015, Brooklyn state Supreme Court judge Mark Dwyer tossed a sample collected through the low copy number method. ... But earlier, a judge in Queens found the method scientifically sound.").
  4. For criticism of the “nothing-new” reasoning in the opinion, see David H. Kaye et al., The New Wigmore on Evidence: Expert Evidence (Cum. Supp. 2017).
  5. These are the reporter’s words. Long, supra note 1. For a judicial equivalent, see, for example, People v. Debraux, 50 Misc.3d 247, 256, 21 N.Y.S.3d 535, 543 (N.Y. Co. Sup. Ct. 2015) (referring to FST as “showing that the likelihood that DNA found on a gun was that of the defendant”).
  6. Matthew Shaer, The False Promise of DNA Testing: The Forensic Technique Is Becoming Ever More Common—and Ever Less Reliable, Atlantic, June 2016, https://www.theatlantic.com/magazine/archive/2016/06/a-reasonable-doubt/480747/

Friday, September 1, 2017

Flaky Academic Journals and Forestry

The legal community may be catching on to the proliferation of predatory, bogus, or just plain flaky journals of medicine, forensic science, statistics, and every other subject that might attract authors willing to pay "open access" fees. As indicated in the Flaky Academic Journals blog, these businesses advertise rigorous peer review, but they operate like vanity presses. A powerful article (noted here) in Bloomberg's BNA Expert Evidence Report and Bloomberg Businessweek alerts litigators to the problem by discussing the most notorious megapublisher of biomedical journals, OMICS International, and its value to drug companies willing to cut corners in presenting their research findings.

The most recent forensic-science article to go this route is Ralph Norman Haber & Lyn Haber, A Forensic Case Study with Only a Single Piece of Evidence, Journal of Forensic Studies, Vol. 2017, issue 1, unpaginated. In fact, it is the only article that the aspiring journal has published (despite spamming for potential authors at least 11 times). The website offers an intriguing description of this "Journal of Forensic Studies." It explains that "Forensic studies is a scientific journal which covers high quality manuscripts which are both relevant and applicable to the broad field of Forestry. This journal encompasses the study related to the majority of forensically related cases."

Monday, August 14, 2017

PCAST's Review of Firearms Identification as Reported in the Press

According to the Washington Post,
The President’s Council of Advisors on Science and Technology [PCAST] said that only one valid study, funded by the Pentagon in 2014, established a likely error rate in firearms testing at no more than 1 in 46. Two less rigorous recent studies found a 1 in 20 error rate, the White House panel said. 1/
The impression that one might receive from such reporting is that errors (false positives? false negatives?) occur in about one case in every 20, or maybe one in 46.

Previous postings have discussed the fact that a false-positive probability is not generally the probability that an examiner who reports an association is wrong. Here, I will indicate how well the numbers in the Washington Post correspond to statements from PCAST. Not all of them can be found in the section on "Firearms Analysis" (§ 5.5) in the September 2016 PCAST report, and there are other numbers provided in that section.
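The reason is that the probability that a reported association is wrong depends not only on the false-positive rate but also on the examiner's sensitivity and on how often the compared pairs really do share a source. A short sketch in Python makes the point; every number in it is invented for illustration:

    # All inputs are hypothetical, chosen only to illustrate the distinction.
    fpr = 1 / 46          # P(examiner reports an association | different sources)
    sensitivity = 0.95    # P(examiner reports an association | same source)
    base_rate = 0.5       # fraction of examined pairs that truly share a source

    p_report = sensitivity * base_rate + fpr * (1 - base_rate)
    p_wrong_given_report = fpr * (1 - base_rate) / p_report
    print(f"P(different sources | reported association) = {p_wrong_given_report:.3f}")

With these inputs the two quantities happen to be close, but lower the base rate and the gap widens; the false-positive rate alone cannot tell a factfinder how often reported associations are mistaken.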

But First, Some Background

By way of background, the 2016 report observes that
AFTE’s “Theory of Identification as it Relates to Toolmarks”—which defines the criteria for making an identification—is circular. The “theory” states that an examiner may conclude that two items have a common origin if their marks are in “sufficient agreement,” where “sufficient agreement” is defined as the examiner being convinced that the items are extremely unlikely to have a different origin. In addition, the “theory” explicitly states that conclusions are subjective. 2/
A number of thoughtful forensic scientists agree that such criteria are opaque or circular. 3/ Despite its skepticism of the Association of Firearm and Tool Mark Examiners' criteria for deciding that components of ammunition come from a particular, known gun, PCAST acknowledged that
relatively recently ... its validity [has] been subjected to meaningful empirical testing. Over the past 15 years, the field has undertaken a number of studies that have sought to estimate the accuracy of examiners’ conclusions.
Unfortunately, PCAST finds almost all these studies inadequate. "While the results demonstrate that examiners can under some circumstances identify the source of fired ammunition, many of the studies were not appropriate for assessing scientific validity and estimating the reliability because they employed artificial designs that differ in important ways from the problems faced in casework." 4/ "Specifically, many of the studies employ 'set-based' analyses, in which examiners are asked to perform all pairwise comparisons within or between small sample sets." Some of these studies -- namely, those with "closed-set" designs -- "may substantially underestimate the false positive rate." The only valid way to study validity and reliability, the report insists, is with experiments that require examiners to compare pairs of items in which the existence of a true association in any one pair is independent of the associations in every other pair.

The False-positive Error Rate in the One Valid Study

According to the Post, the "one valid study ... established a likely error rate in firearms testing at no more than 1 in 46." This sentence is correct. PCAST reported a "bound on rate" of "1 in 46." 5/ This figure is the upper bound of a one-sided 95% confidence interval. Of course, the "true" error rate -- the one that would exist if there were no random sampling error in the selection of examiners -- could be much larger than this upper bound. Or, it could be much smaller. 6/ The Post omits the statistically unbiased "estimated rate" of "1 in 66" given in the PCAST report.
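For readers who want the mechanics: if a study observes x false positives in n conclusive comparisons of different-source items, the estimated rate is x/n, and the exact (Clopper-Pearson) one-sided 95% upper bound is

    the rate at which observing x or fewer false positives in n comparisons would have only a 5% chance of occurring.

Whether PCAST used precisely this construction is an assumption here; the general idea of any such bound is to guard against understating the rate merely because the study involved a limited number of examiners and comparisons.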

The 1 in 20 False-positive Error Rate for "Less Rigorous Recent Studies"

The statement that "[t]wo less rigorous recent studies found a 1 in 20 error rate" seems even less complete. The report mentioned five other studies. Four "set-to-set/closed" studies suggested error rates of 1 in 5103 (1 in 1612 for the 95% upper bound). Presumably, the Post did not see fit to mention all the "less rigorous" studies because these closed-set studies were methodologically hopeless -- at least, that is the view of them expressed in the PCAST report.

The Post's "1 in 20" figure apparently came from PCAST's follow-up report of 2017. 7/ The addendum refers to a re-analysis of a 14-year-old study of eight FBI examiners co-authored by Stephen Bunch, who "offered an estimate of the number of truly independent comparisons in the study and concluded that the 95% upper confidence bound on the false-positive rate in his study was 4.3%." 8/ This must be one of the Post's "two less rigorous recent studies." In the 2016 report, PCAST identified it as a "set-to-set/partly open" study with an "estimated rate" of 1 in 49 (1 in 21 for the 95% upper bound). 9/

The second "less rigorous" study is indeed more recent (2014). The 2016 report summarizes its findings as follows:
The study found 42 false positives among 995 conclusive examinations. The false positive rate was 4.2 percent (upper 95 percent confidence bound of 5.4 percent). The estimated rate corresponds to 1 error in 24 cases, with the upper bound indicating that the rate could be as high as 1 error in 18 cases. (Note: The paper observes that “in 35 of the erroneous identifications the participants appeared to have made a clerical error, but the authors could not determine this with certainty.” In validation studies, it is inappropriate to exclude errors in a post hoc manner (see Box 4). However, if these 35 errors were to be excluded, the false positive rate would be 0.7 percent (confidence interval 1.4 percent), with the upper bound corresponding to 1 error in 73 cases.) 10/
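As a check, the exact binomial (Clopper-Pearson) calculation reproduces the quoted figures to within rounding. A short sketch in Python, using only the counts quoted above:

    from scipy.stats import beta

    x, n = 42, 995  # false positives among conclusive different-source examinations (quoted above)
    estimate = x / n
    upper95 = beta.ppf(0.95, x + 1, n - x)  # one-sided 95% Clopper-Pearson upper bound
    print(f"estimated rate:  {estimate:.1%} (about 1 in {1 / estimate:.0f})")
    print(f"95% upper bound: {upper95:.1%} (about 1 in {1 / upper95:.1f})")

The output is roughly 4.2 percent (about 1 in 24) and 5.4 percent (between 1 in 18 and 1 in 19), in line with the figures in the report.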
Another Summary

Questions of which studies count, how much they count, and what to make of their limitations are intrinsic to scientific literature reviews. Journalists limited to a few sentences hardly can be expected to capture all the nuances. Even so, a slightly more complete summary of the PCAST review might read as follows:
The President’s Council of Advisors on Science and Technology said that an adequate body of scientific studies does not yet exist to show that toolmark examiners can associate discharged ammunition to a specific firearm with very high accuracy. Only one rigorous study with one type of gun, funded by the Defense Department, has been conducted. It found that examiners who reached firm conclusions made positive associations about 1 time in 66 when examining cartridge cases from different guns. Less rigorous studies have found both higher and lower false-positive error rates for conclusions of individual examiners, the White House panel said.
NOTES
  1. Spencer S. Hsu & Keith L. Alexander, Forensic Errors Trigger Reviews of D.C. Crime Lab Ballistics Unit, Prosecutors Say, Wash. Post, Mar. 24, 2017.
  2. PCAST, at 104 (footnote omitted).
  3. See, e.g., Christophe Champod, Chris Lennard, Pierre Margot & Milutin Stoilovic, Fingerprints and Other Ridge Skin Impressions 71 (2016) (quoted in David H. Kaye, "The Mask Is Down": Fingerprints and Other Ridge Skin Impressions, Forensic Sci., Stat. & L., Aug. 11, 2017, http://for-sci-law.blogspot.com/2017/08/the-mask-is-down-fingerprints-and-other.html)
  4. PCAST, at 105.
  5. Id. at 111, tbl. 2.
  6. The authors of the study had this to say about the false-positive errors:
    [F]or the pool of participants used in this study the fraction of false positives was approximately 1%. The study was specifically designed to allow us to measure not simply a single number from a large number of comparisons, but also to provide statistical insight into the distribution and variability in false-positive error rates. The result is that we can tell that the overall fraction is not necessarily representative of a rate for each examiner in the pool. Instead, examination of the data shows that the rate is a highly heterogeneous mixture of a few examiners with higher rates and most examiners with much lower error rates. This finding does not mean that 1% of the time each examiner will make a false-positive error. Nor does it mean that 1% of the time laboratories or agencies would report false positives, since this study did not include standard or existing quality assurance procedures, such as peer review or blind reanalysis. What this result does suggest is that quality assurance is extremely important in firearms analysis and that an effective QA system must include the means to identify and correct issues with sufficient monitoring, proficiency testing, and checking in order to find false-positive errors that may be occurring at or below the rates observed in this study.
    David P. Baldwin, Stanley J. Bajic, Max Morris, and Daniel Zamzow, A Study of False-Positive and False-Negative Error Rates in Cartridge Case Comparisons, May 2016, at 18, available at https://www.ncjrs.gov/pdffiles1/nij/249874.pdf.
  7. PCAST, An Addendum to the PCAST Report on Forensic Science in Criminal Courts, Jan. 6, 2017.
  8. Id. at 7.
  9. PCAST, at 111, tbl. 2.
  10. Id. at 95 (footnote omitted).

Friday, August 11, 2017

"The Mask Is Down": Fingerprints and Other Ridge Skin Impressions

The mask is down, and this should lead to heated debates in the near future as many practitioners have not yet realized the earth-shattering nature of the changes. (Preface, at xi).
If you thought that fingerprint identification is a moribund and musty field, you should read the second edition of Fingerprints and Other Ridge Skin Impressions (FORSI for short), by Christophe Champod, Chris Lennard, Pierre Margot, and Milutin Stoilovic.

The first edition "observed a field that is in rapid progress on both detection and identification issues." (Preface 2003). In the ensuing 13 years, "the scientific literature in this area has exploded (over 1,000 publications) and the related professions have been shaken by errors, challenges by courts and other scientists, and changes of a fundamental nature related to previous claims of infallibility and absolute individualization." (Preface 2016, at xi).

The Scientific Method

From the outset, the authors -- all leading researchers in forensic science -- express dissatisfaction with "standard, shallow statements such as 'nature never repeats itself'" and "the tautological argument that every entity in nature is unique." (P. 1). They also dispute the claim, popular among latent print examiners, that the "ACE-V protocol" is a deeply "scientific method":
ACE-V is a useful mnemonic acronym that stands for analysis, comparison, evaluation, and verification ... . Although [ACE-V was] not originally named that way, pioneers in forensic science were already applying such a protocol (Heindl 1927; Locard 1931). ... It is a protocol that does not, in itself, give details as to how the inference is conducted. Most authors stay at this descriptive stage and leave the inferential or decision component of the process to "training and experience" without giving any more guidance as to how examiners arrive at their decisions. As rightly highlighted in the NRC report (National Research Council 2009, pp. 5-12): "ACE-V provides a broadly stated framework for conducting friction ridge analyses. However, this framework is not specific enough to qualify as a validated method for this type of analysis." Some have compared the steps of ACE-V to the steps of standard hypothesis testing, described generally as the "scientific method" (Wertheim 2000; Triplett and Cooney 2006; Reznicek et al. 2010; Brewer 2014). We agree that ACE-V reflects good forensic practice and that there is an element of peer review in the verification stage ... ; however, draping ACE-V with the term "scientific method" runs the risk of giving this acronym more weight than it deserves. (Pp. 34-35).
Indeed, it is hard to know what to make of claims that "standard hypothesis testing" is the "scientific method." Scientific thinking takes many forms, and the source of its spectacular successes is a set of norms and practices for inquiry and acceptance of theories that go beyond some general steps for qualitatively assessing how similar two objects are and what the degree of similarity implies about a possible association between the objects.

Exclusions as Probabilities

Many criminalists think of exclusions as logical deductions. They think, for example, that deductively valid reasoning shows that the same finger could not possibly be the source of two prints that are so radically different in some feature or features. I have always thought that exclusions are part of an inductive logical argument -- not, strictly speaking, a deductive one. 1/ However, FORSI points out that if the probability is zero that "the features in the mark and in the submitted print [are] in correspondence, meaning within tolerances, if these have come from the same source," then "an exclusion of common source is the obvious deductive conclusion ... ." (P. 71). This is correct. Within a Boolean logic (one in which the truth values of all propositions are 1 or 0), exclusions are deductions, and deductive arguments are either valid or invalid.

But the usual articulation of what constitutes an exclusion (with probability 1) does not withstand analysis. Every pair of images has some difference in every feature (even when the images come from the same source). How does the examiner know (with probability 1) that a difference "cannot be explained other than by the hypothesis of different sources"? (P. 70). In some forensic identification fields, the answer is that the difference must be "significant." 2/ But this is an evasion. As FORSI explains,
In practice, the difficulty lies in defining what a "significant difference" actually is (Thornton 1977). We could define "significant" as being a clear difference that cannot be readily explained other than by a conclusion that the print and mark are from different sources. But it is a circular definition: Is it "significant" if one cannot resolve it by another explanation than a different source, or do we conclude to an exclusion because of the "significant" difference? (P. 71).
Fingerprint examiners have their own specialized vocabulary for characterizing differences in a pair of prints. FORSI defines the terms "exclusion" and "significant" by invoking a concept familiar (albeit unnecessary) in forensic DNA analysis -- the match window within which two measurements of what might be the same allele are said to match. In the fingerprint world, the analog seems to be "tolerance":
The terms used to discuss differences have varied over the years and can cause confusion (Leo 1998). The terminology is now more or less settled (SWGFAST 2013b). Dissimilarities are differences in appearance between two compared friction ridge areas from the same source, whereas discrepancy is the observation of friction ridge detail in one impression that does not exist in the corresponding area of another impression. In the United Kingdom, the term disagreement is also used for discrepancy and the term explainable difference for dissimilarity (Forensic Science Regulator 2015a).

A discrepancy is then a "significant" difference and arises when the compared features are declared to be "out of tolerance" for the examiner, tolerances as defined during the analysis. This ability to distinguish between dissimilarity (compatible to some degree with a common source) and discrepancy (meaning almost de facto different sources) is essential and relies mainly on the examiner's experience. ... The first key question ... then becomes ... :
Q1. How probable is it to observe the features in the mark and in the submitted print in correspondence, meaning within tolerances, if these have come from the same source? (P. 71).
The phrase "almost de facto different sources" is puzzling. "De facto" means in fact as opposed to in law. Whether a print that is just barely out of tolerance originated from the same finger always is a question of fact. I presume "almost de facto different sources" means the smallest point at which probability of being out of tolerance is so close to zero that we may as well round it off to exactly zero. An exclusion is thus a claim that it is practically impossible for the compared features to be out of tolerance when they are in an image from the same source.

But to insist that this probability is zero is to violate "Cromwell's Rule," as the late Dennis Lindley called the admonition to avoid probabilities of 0 or 1 for empirical claims. As long as there is a non-zero probability that the perceived "discrepancy" could somehow arise -- as there always is if only because every rule of biology could have a hitherto unknown exception -- deductive logic does not make an exclusion a logical certainty. Exclusions are probabilistic. So are "identifications" or "individualizations."

Inclusions as Probabilities

At the opposite pole from an exclusion is a categorical "identification" or "source attribution." Categorical exclusions are statements of probability -- the examiner is reporting "I don't see how these differences could exist for a common source" -- from which it follows that the hypothesis of a different source has a high probability (not that it is deductively certain to be true). Likewise, categorical "identifications" are statements of probability -- now the examiner is reporting "I don't see how all these features could be as similar as they are for different sources" -- from which it follows that the hypothesis of a common source has a high probability (not that it is certain to be true). This leaves a middle zone of inclusions in which the examiner is not confident enough to declare an identification or an exclusion and the examiner makes no effort to describe its probative value -- beyond saying "It is not conclusive proof of anything."

The idea that examiners report all-but-certain exclusions and all-but-certain inclusions ("identifications") has three problems. First, how should examiners get to these states of subjective near-certainty? Second, each report seems to involve the probability of the observed features under only a single hypothesis -- different source for exclusions and same source for inclusions. Third, everything between the zones of near-certainty gets tossed in the dust bin.

I won't get into the first issue here, but I will note FORSI's treatment of the latter two. FORSI seems to accept exclusions (in the sense of near-zero probabilities for the observations given the same-source hypothesis) as satisfactory; nevertheless, for inclusions, it urges examiners to consider the probability of the observations under both hypotheses. In doing so, it adopts a mixed perspective, using a match-window p-value for the exclusion step and a likelihood ratio for an inclusion. Some relevant excerpts follow:
The above discussion has considered the main factors driving toward an exclusion (associated with question Q1); we should now move to the critical factor that will drive toward an identification, with this being the specificity of the corresponding features. ...

Considerable confusion exists among laymen, indeed also among fingerprint examiners, on the use of words such as match, unique, identical, same, and identity. Although the phrase "all fingerprints are unique" has been used to justify fingerprint identification opinions, it is no more than a statement of the obvious. Every entity is unique, because an entity can only be identical to itself. Thus, to say that "this mark and this print are identical to each other" is to invoke a profound misconception; the two might be indistinguishable, but they cannot be identical. In turn, the notion of "indistinguishability" is intimately related to the quantity and quality of detail that has been observed. This leads to distinguishing between the source variability derived from good-quality prints and the expressed variability in the mark, which can be partial, distorted, or blurred (Stoney 1989). Hence, once the examiner is confident that they cannot exclude, the only question that needs to be addressed is simply:
Q2. What is the probability of observing the features in the mark (given their tolerances) if the mark originates from an unknown individual?
If the ratio is calculated between the two probabilities associated with Q1 and Q2, we obtain what is called a likelihood ratio (LR). Q1 becomes the numerator question and Q2 becomes the denominator question. ...

In a nutshell, the numerator is the probability of the observed features if the mark is from the POI, while the denominator is the probability of the observed features if the mark is from a different source. When viewed as a ratio, the strength of the observations is conveyed not only by the response to one or the other of the key questions, but by a balanced assessment of both. ... The LR ... applies regardless of the type of forensic evidence considered and has been put at the core of evaluative reporting in forensic science (Willis 2015). The range of values for the LR is between 0 and infinity. A value of 1 indicates that the forensic findings are equally likely under either proposition and they do not help the case in one direction or the other. A value of 10,000, as an example, means that the forensic finding provides very strong support for the prosecution proposition (same source) as opposed to its alternative (the defense proposition—different sources). A value below 1 will strengthen the case in favor of the view that the mark is from a different source than the POI. The special case of exclusion is when the numerator of the LR is equal to 0, making the LR also equal to 0. Hence, the value of forensic findings is essentially a relative and conditional measure that helps move a case in one direction or the other depending on the magnitude of the LR. The explicit formalization of the problem in the form of a LR is not new in the area of fingerprinting and can be traced back to Stoney (1985). (P. 75)
In advocating a likelihood ratio (albeit one for an initial "exclusion" with poorly defined statistical properties), FORSI is at odds with the historical practice. This practice, as we saw, demands near certainty if a comparison is to be labelled an "identification" or an "exclusion." In the middle range, examiners "report 'inconclusive' without any other qualifiers of the weight to be assigned to the comparison." (P. 98). FORSI disapproves of this "peculiar state of affairs." (P. 99). It notes that
Examiners could, at times, resort to terms such as "consistent with," "points consistent with," or "the investigated person cannot be excluded as the donor of the mark," but without offering any guidance as to the weight of evidence [see, for example, Maceo (2011a)]. In our view, these expressions are misleading. We object to information formulated in such broad terms that may be given more weight than is justified. These terms have been recently discouraged in the NRC report (National Research Council 2009) and by some courts (e.g., in England and Wales R v. Puacca [2005] EWCA Crim 3001). And this is not a new debate. As early as 1987, Brown and Cropp (1987) suggested to avoid using the expressions "match," "identical" and "consistent with."

There is a need to find appropriate ways to express the value of findings. The assignment of a likelihood ratio is appropriate. Resorting to the term "inconclusive" deprives the court of information that may be essential. (P. 99).
The Death of "Individualization" and the Sickness of "Individual Characteristics"

The leaders of the latent print community have all but abandoned the notion of "individualization" as a claim that one and only one finger that ever existed could have left the particular print. (Judging from public comments to the National Commission on Forensic Science, however, individual examiners are still comfortable with such testimony.) FORSI explains:
In the fingerprint field, the term identification is often used synonymously with individualization. It represents a statement akin to certainty that a particular mark was made by the friction ridge skin of a particular person. ... Technically identification refers to the assignment of an entity to a specific group or label, whereas individualization represents the special case of identification when the group is of size 1. ... [Individualization] has been called the Earth population paradigm (Champod 2009b). ... Kaye (2009) refers to "universal individualization" relative to the entire world. But identification could also be made without referring to the Earth's population, referring instead to a smaller subset, for example, the members of a country, a city, or a community. In that context, Kaye talks about "local individualization" (relative to a proper subset). This distinction between "local" and "global" was used in two cases ... [W]e would recommend avoiding using the term "individualization." (P. 78).
The whole-earth definition of "individualization" also underlies the hoary distinction in forensic science between "class" and "individual" characteristics. But a concatenation of class characteristics can be extremely rare and hence have probative value similar to that of putatively individual characteristics, and one cannot know a priori that "individual" characteristics are limited to a class of size 1. In the fingerprinting context, FORSI explains that
In the literature, specificity was often treated by distinguishing "class" characteristics from "individual" characteristics. Level 1 features would normally be referred to as class characteristics, whereas levels 2 and 3 deal with "individual" characteristics. That classification had a direct correlation with the subsequent decisions: only comparisons involving "individual" characteristics could lead to an identification conclusion. Unfortunately, the problem of specificity is more complex than this simple dichotomy. This distinction between "class" and "individual" characteristics is just a convenient, oversimplified way of describing specificity. Specificity is a measure on a continuum (probabilities range from 0 to 1, without steps) that can hardly be reduced to two categories without more nuances. The term individual characteristic is particularly misleading, as a concordance of one minutia (leaving aside any consideration of level 3 features) would hardly be considered as enough to identify. The problem with this binary categorization is that it encourages the examiner to disregard the complete spectrum of feature specificity that ranges from low to high. It is proposed that specificity at each feature level be studied without any preconceived classification of its identification capability by itself. Indeed, nothing should prevent a specific general pattern—such as, for example, an arch with continuous ridges from one side to the other (without any minutiae)—from being considered as extremely selective, since no such pattern has been observed to date. (P. 74)
FORSI addresses many other topics -- errors, fraud, automated matching systems, probabilistic systems, chemical methods for detection of prints, and much more. Anyone concerned with latent-fingerprint evidence should read it. Those who do will see why the authors express views like these:
Over the years, the fingerprint community has fostered a state of laissez-faire that left most of the debate to the personal informed decisions of the examiner. This state manifests itself in the dubious terminology and semantics that are used by the profession at large ... . (P. 344).
We would recommend, however, a much more humble way of reporting this type of evidence to the decision maker. Fingerprint examiners should be encouraged to report all their associations by indicating the degree of support the mark provides in favor of an association. In that situation, the terms "identification" or "individualization" may disappear from reporting practices as we have suggested in this book. (P. 345).

Notes
  1. David H. Kaye, Are "Exclusions" Deductive and "Identifications" Merely Probabilistic?, Forensic Sci., Stat. & L., Apr. 28, 2017, http://for-sci-law.blogspot.com/2017/04/
  2. E.g., SWGMAT, Forensic Paint Analysis and Comparison Guidelines 3.2.9 (2000), available at https://drive.google.com/file/d/0B1RLIs_mYm7eaE5zOV8zQ2x5YmM/view

Saturday, August 5, 2017

Questions on a Bell krater and Certainty in Forensic Archaeology

IN THE NAME OF THE PEOPLE OF THE STATE OF NEW YORK
TO ANY LAW ENFORCEMENT OFFICER OR POLICE OFFICER OF NEW YORK

YOU ARE THEREFORE COMMANDED, between 6:00 a.m. and 9:00 p.m., to enter and to search the Metropolitan Museum of Art, 1000 Fifth Avenue, New York, NY 10028 (“the target premises”), for the above described property, and if you find such property or any part thereof, to bring it before the Court without unnecessary delay.
So reads a search warrant for
A Paestan Red-Figure Bell-Krater (a wide, round, open container used for holding wine at social events), attributed to Python from the 360 to 350 B.C., approximately 14 1/2 inches in diameter, and depicting the Greek god Dionysos in his youth with a woman on a cart being drawn by Papposilenos on one side and two youths standing between palmettes on the reverse side.
A New York court issued the warrant on July 24 to the District Attorney for New York County. The warrant seems to have been based on “photos and other evidence sent to them in May by a forensic archaeologist in Europe who has been tracking looted artifacts for more than a decade. The museum said that it hand-delivered the object to prosecutors the next day and anticipates that the vase, used in antiquity for mixing water and wine, will ultimately return to Italy.” 1/

The archaeologist, Christos Tsirogiannis, lists himself on LinkedIn as a research assistant at the Scottish Centre for Crime and Justice Research, University of Glasgow, and a forensic archaeologist and illicit antiquities researcher at the University of Cambridge. He contacted the New York district attorney’s office after the museum previously had notified Italian authorities, with no apparent effect, of the evidence that the Bell krater, as this type of container is called, had been looted from a grave in Southern Italy. Dr. Tsirogiannis compared photos on the museum’s website to “Polaroid photos shot between 1972 and 1995 that he said were seized ... in 1995” from storehouses of an Italian antiquities dealer convicted of conspiring to traffic in ancient treasures to conclude “that the item was disinterred from a grave site in southern Italy by looters.” 2/

Dr. Tsirogiannis was asked about how he could be certain of his photographic identification in an interview on NPR's Morning Edition. His answer was “that’s my job” -- I've done it over a thousand times.
Transcript (excerpt), Morning Edition, Aug. 4, 2017, 5:07 AM ET

AILSA CHANG, HOST: So how did you first discover that this vase in the Met was an artifact looted from a grave in Italy in the 1970s?
CHRISTOS TSIROGIANNIS: I have granted official access to a confiscated archive of a convicted Italian dealer convicted for antiquities trafficking. And the archive is full of photographs, among which I discovered five depicting this particular object. And by comparing these images with the image that was at the Metropolitan Museum of Art website, I identified that it is the same object.
CHANG: How can you know for certain?
TSIROGIANNIS: That's my job. ... I'm a forensic archaeologist, and I am doing this for more than 10 years now, identifying 1,100 of antiquities in the same actual way.
CHANG: Eleven hundred stolen antiquities you have identified?
TSIROGIANNIS: So far.
The response is reminiscent of testimony heard over the years from forensic analysts of many types of trace evidence -- things like fingerprints, toolmarks, hair, shoeprints, and bitemarks. In those fields (which should not be regarded as equivalent), such assurances are much less acceptable today. The identification here could well be correct (although the previously convicted antiquities dealer staunchly denies it), but would it be objectionable because the procedure for comparing the photographs is subject to cognitive bias, lacks well-defined standards, and is not validated in studies of the accuracy with which forensic archaeologists match photographs of similar vases, and so on?

The vase surrendered by the museum certainly "vividly ... depicts Dionysus, god of the grape harvest, riding in a cart pulled by a satyr" and is attributed "to the Greek artist Python, considered one of the two greatest vase painters of his day." 3/ Are there statistics on the distinctiveness of the designs on the various Bell kraters in use over 2,000 years ago, or is each assumed to be visibly unique? How should the photographic evidence in such a case be presented in court?

Notes
  1. Tom Mashberg, Ancient Vase Seized From Met Museum on Suspicion It Was Looted, N.Y. Times, July 31, 2017 (printed as Vase, Thought to Be Looted, Is Seized From Met., N.Y. Times, Aug. 1, 2017, at A1).
  2. Id.
  3. Id.

Thursday, July 27, 2017

Overvaluing P-values

Seventy-two "big names in statistics want to shake up [the] much maligned P-value." 1/ These academic researchers give the following one-sentence summary of their proposal to change the way scientific articles are written:
We propose to change the default P-value threshold for statistical significance for claims of new discoveries from 0.05 to 0.005. 2/
The President’s Council of Advisors on Science and Technology (PCAST) effectively invoked the current "default P-value" of 0.05 as a rule for admitting scientific evidence in court. 3/ In light of this new (but not novel) call for reducing the conventional p-value, one might think that PCAST was being too generous toward forensic-science identifications. After all, if 5% is ten times the value that should be used to declare differences “statistically significant” in scientific research, then it seems way too large as a limit for what PCAST called “scientific reliability” for courtroom use. But that conclusion would rest on a misunderstanding of the objective and nature of the proposal to change the nomenclature for p-values.

The motivation for moving from 0.05 to 0.005 is “growing concern over the credibility of claims of new discoveries based on ‘statistically significant’ findings.” The authors argue that regarding 0.05 as proof of a real difference (a true positive) is “a leading cause of non-reproducibility” of published discoveries in scientific fields that could easily be corrected by referring to findings with p-values between 0.005 and 0.05 as “suggestive” rather than “significant.” The problem with the verbal tag of “statistically significant” (p < 0.05) is that, in comparison to the state of scientific research ninety years ago when Sir Ronald Fisher floated the 0.05 level, “[a] much larger pool of scientists are now asking a much larger number of questions, possibly with much lower prior odds of success,” resulting in too many apparent discoveries that cannot be replicated in later experiments.
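A rough calculation shows why low prior odds of success matter. Suppose, purely hypothetically, that only 10% of the hypotheses being tested are true, that studies of real effects have 80% power, and that the threshold is 0.05; then more than a third of the "statistically significant" findings will be false positives:

    # Hypothetical inputs, for illustration only.
    alpha = 0.05        # significance threshold
    power = 0.80        # P(p < alpha | the effect is real)
    prior_true = 0.10   # fraction of tested hypotheses that are true

    p_sig_and_true = power * prior_true
    p_sig_and_false = alpha * (1 - prior_true)
    false_share = p_sig_and_false / (p_sig_and_true + p_sig_and_false)
    print(f"share of significant findings that are false positives: {false_share:.0%}")  # about 36%

Tightening the threshold to 0.005 in the same calculation (holding power fixed) drops that share to about 5%.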

Not only is the group of 72 addressing the perils of the p-value in a different context, but their proposal is not intended as a bright-line rule for deciding what to publish. They explain:
We emphasize that this proposal is about standards of evidence, not standards for policy action nor standards for publication. Results that do not reach the threshold for statistical significance (whatever it is) can still be important and merit publication in leading journals if they address important research questions with rigorous methods. This proposal should not be used to reject publications of novel findings with 0.005 < P < 0.05 properly labeled as suggestive evidence.
So too, “[r]esults that do not reach the threshold for statistical significance (whatever it is) can still be important” in litigation, and the desire to shake things up in the research community does not reveal much about appropriate standards for admissibility in court.

However, PCAST is on firm ground in emphasizing the need to present forensic-science findings without overstating their probative value. The 72 researchers focus on probative value when they discuss a “more direct measure of the strength of evidence.” They suggest that a “two-sided P-value of 0.05 [often] corresponds to Bayes factors ... that range from about 2.5 to 3.4.” Such evidence, they note, is weak. In contrast, they defend the "two-sided P-value of 0.005" in part on the ground that it "corresponds to Bayes factors between approximately 14 and 26." As such, it "represents ‘substantial’ to ‘strong’ evidence according to conventional Bayes factor classifications."
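Where do numbers like 2.5 and 14 come from? One well-known calibration, the Sellke-Bayarri-Berger bound, caps the Bayes factor that a two-sided p-value can represent at 1/(-e·p·ln p) for p < 1/e. Whether this is the precise calculation the 72 authors relied on is an assumption of the sketch below, but it reproduces the lower ends of the ranges they report:

    from math import e, log

    def max_bayes_factor(p):
        """Sellke-Bayarri-Berger upper bound on the Bayes factor in favor of a
        real effect that a p-value of p can represent (valid for p < 1/e)."""
        return 1 / (-e * p * log(p))

    for p in (0.05, 0.005):
        print(f"p = {p}: Bayes factor of at most about {max_bayes_factor(p):.1f}")
    # prints roughly 2.5 for p = 0.05 and 13.9 for p = 0.005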

Forensic scientists who advocate describing the strength of evidence rather than only false-positive rates are more demanding. They usually consider Bayes factors between 10 and 100 to constitute "moderate" rather than “strong” evidence. 4/

Notes
  1. Dalmeet Singh Chawla, Big Names in Statistics Want To Shake Up Much-maligned P value, Nature News, July 27, 2017, http://www.nature.com/news/big-names-in-statistics-want-to-shake-up-much-maligned-p-value-1.22375
  2. Daniel J. Benjamin et al., Redefine Statistical Significance, PsyArXiv, July 22, 2017, https://osf.io/preprints/psyarxiv/mky9j (forthcoming in Nature Human Behavior).
  3. For discussion, see The Source and Soundness of PCAST's 5% Rule, Forensic Sci., Stat. & L., July 23, 2017, http://for-sci-law.blogspot.com/2017/07/the-source-and-soundness-of-pcasts-5.html.
  4. E.g., R. Marquis et al., Discussion on How to Implement a Verbal Scale in a Forensic Laboratory: Benefits, Pitfalls and Suggestions to Avoid Misunderstandings, 56(5) Sci. & Just. 364-70 (2016), doi: 10.1016/j.scijus.2016.05.009, preprint available at http://wp.unil.ch/forensicdecision/files/2015/01/1-s2.0-S1355030616300338-main.pdf (Appendix A).

Sunday, July 23, 2017

The Source and Soundness of PCAST's 5% Rule

The President’s Council of Advisors on Science and Technology (PCAST) Report on comparative pattern matching in forensic science has a deceptively simple rule for the admissibility of evidence of a match between a questioned and a known sample: if examiners would declare that the two samples have the same source as often as one time in 20 when analyzing pairs of samples that actually come from different sources, then the comparisons are “scientifically unreliable.” The report gives no explanation of how it arrived at this rule beyond the following enigmatic paragraph: 1/
False positive rate (abbreviated FPR) is defined as the probability that the method declares a match between two samples that are from different sources (again in an appropriate population), that is, FPR = P(M|H0). For example, a value FPR = 0.01 would indicate that two samples from different sources will be (mistakenly) called as a match 1 percent of the time. Methods with a high FPR are scientifically unreliable for making important judgments in court about the source of a sample. To be considered reliable, the FPR should certainly be less than 5 percent and it may be appropriate that it be considerably lower, depending on the intended application. 2/
Five percent has a crisp, authoritative ring to it, but why is 5% “certainly” the maximum tolerable FPR for courtroom use of the test? And what “intended applications” would demand a lower FPR? Is the underlying thought that greater “scientific reliability” is required as the gravity of the case increases—from a civil case, to a misdemeanor, to a major crime, on up to a capital case?

Statistical Practice as the Basis for the 5% Rule

Inasmuch as the paragraph is found in an appendix entitled "statistical issues," we should expect statistical concepts and practice to help answer such questions. And in fact, 5% is a common number in statistics. In many applications, statistical hypothesis tests try to keep the risk of a false rejection of the “null hypothesis” H0—a false-positive conclusion—below 5%. Researchers and journal editors in many fields prize results that can be said to be “statistically significant,” usually at the 0.05 level or better. The expression p < 0.05 is therefore a common accoutrement of experimental or observational results indicating an association between variables. Likewise, the Food and Drug Administration demands that clinical trials show that a new drug is effective for its intended use (“validity,” if you will), with “the typical ‘cap’ on the type I [false positive] error rate ... set at 5%.” 3/ In the forensic pattern-matching context, the null hypothesis H0 in the PCAST paragraph would be that a questioned and a known sample are not associated with the same source.

Thus, to the extent PCAST was thinking of the 5% FPR as the significance level required to reject H0, its emphasis on 5% is well grounded in statistical practice. The use of certain standard levels of significance, particularly 5%, can be traced to the 1920s. The eminent British statistician Sir R. A. Fisher wrote:
It is convenient to draw the line at about the level at which we can say: ‘Either there is something in the treatment, or a coincidence has occurred such as does not occur more than once in twenty trials.’ ... If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty (the 2 per cent point), or one in a hundred (the 1 per cent point). Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach that level. A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance. 4/
For FPRs larger than 5%, the reports of criminalists do not meet (Fisher’s) criterion for establishing a “scientific fact.” Their conclusions of positive association for such error-prone procedures are not, in PCAST’s words, “scientifically reliable.”

Having equated PCAST’s unexplained choice of 5% with a common implementation of statistical hypothesis testing, we also can see why the report suggested that a “considerably lower” number might be required for scientific “reliability.” A 5% FPR lets in examiner conclusions that might be wrong about one time in twenty when defendants are innocent and there is no true association between the questioned item and the known one. False positives tend to increase the rate of false convictions, whereas false negatives tend to increase the rate of false acquittals. The norm that false convictions are worse than false acquittals counsels caution in relying on an examiner’s conclusion to convict a defendant. And if false convictions in the most serious of cases are worse still, we can see why the PCAST report stated that “the FPR should certainly be less than 5 percent and it may be appropriate that it be considerably lower, depending on the intended application.” Five percent may be good enough for an editor to publish an interesting paper purporting to have discovered something new in social psychology, but this scientific convention does not mean that 5% is good enough for a criminal conviction, let alone one that would lead to an execution.

So we can see that PCAST’s 5% figure did not come from thin air. Indeed, some statisticians and psychologists think that it is too weak a standard—that the general rule in science ought to be p < 0.005. 5/ Nevertheless, the general use of the arguably lenient 5% significance level does not establish that the 5% rule is legally compelled. The law incorporates the intensified concern for false positives into the burden of persuasion for the evidence as a whole. The jury in a criminal case is instructed to acquit unless it is persuaded beyond a reasonable doubt that the defendant is guilty; in contrast, in a civil case, the plaintiff can prevail on a mere preponderance of the evidence. But these burdens do not apply to individual items of evidence. The standard for admitting scientific—and other—evidence does not change with how much is at stake in the particular case. After all, the probative value of scientific evidence is no different in a criminal case than in a civil one. Although the PCAST report insists that its statements about science are merely designed to inform courts about scientific standards, if “scientific reliability” depends on the “importance” of the “judgments in court” and varies according to “the intended application,” then PCAST's "scientific reliability" turns out to be based on what is considered socially or legally “appropriate.”

Beyond the FPR

In sum, it is (or would have been) fair for PCAST to point out that it is uncommon for results at higher significance levels than 0.05 to be credited in the scientific literature. But a more deeply analytical report would have noted that there is uneasiness in the statistical community with the hypothesis testing framework and particularly with over-reliance on the p < 0.05 rule. (Today's mail includes an invitation to attend a "Symposium on Statistical Inference: A World Beyond p < 0.05" sponsored by the American Statistical Association.)

Only part of the world beyond p < 0.05 comes from the fact that the FPR is not the only quantity that determines “scientific reliability.” Superficially, the false-positive error probability might look like the appropriate statistic for considering the probative value of a positive finding, but that cannot be right. Scientific evidence, like all circumstantial evidence, has probative value to the extent it changes the probability of a material fact. That there is much more to probative value than the FPR therefore is easily seen through the lens of Bayes’ rule. As the PCAST report notes, in this context, Bayes' theorem prescribes how probability or odds change with the introduction of evidence. The odds after learning of the examiner’s finding are the odds without that information multiplied by the Bayes factor: posterior odds = prior odds × BF.

The Bayes factor thus indicates the strength of the evidence. Stronger evidence has a larger BF and hence a greater impact on the prior odds than weaker evidence. The Bayes factor is a simple ratio. The FPR appears as the denominator, and the sensitivity—or true positive rate—forms the numerator. In symbols, BF = sensitivity / FPR.

The report acknowledges that sensitivity matters (for some purposes at least). Earlier, the report states that “[i]t is necessary to have appropriate empirical measurements of a method’s false positive rate and the method’s sensitivity. [I]t is necessary to know these two measures to assess the probative value of a method.” 6/ Because it takes both operating characteristics to express the probative value of the test, PCAST cannot sensibly dismiss a test as having too little probative value to be considered “scientifically reliable” on the basis of only one number. Realizing this prompts the next question for devising a rule in the spirit of PCAST's—namely, what is the sensitivity that, together with an FPR of 5%, would define the threshold for “scientific reliability”?

One might imagine that PCAST would consider any false-negative rate in excess of 5% as too high. 7/ If so, it follows that the scientists are saying that, in their view of what is important or what is the dominant convention in various domains, subjective pattern matching must shift the prior odds by a factor of at least .95/.05 = 19 to be considered “scientifically reliable.” On the other hand, if the scientists on PCAST think it is appropriate for a false-negative probability to be ten times the maximum acceptable false-positive probability, then their minimum for “reliability” would become an FNR of 50% and an FPR of 5%, for a Bayes factor of only ten.

What Does the Law Require?

Whether the cutoff comes from the FPR alone or the more complete Bayes factor, the very notion of a sharp cutoff is questionable. The purpose of a forensic-science test for identity is to provide evidence that will assist judges or jurors. Forensic scientists who present results and reasonable estimates of the likelihoods or conditional error probabilities associated with their conclusions are staying within the bounds of what is scientifically known.

Consider a hypothetical pattern-matching test for identity for which FPR = 10% and sensitivity = 70% as shown by extensive experiments, each of which demonstrates an ability to distinguish sources from nonsources with accuracy above what would be expected by chance (p < 0.05). According to the PCAST report, this test would be inadmissible for want of “scientific reliability” or “foundational validity” because the FPR of 10% is too high. But if this were a test for a disease, would we really want a diagnosing physician to ignore the positive test result just because the FPR is greater than 5%? The positive finding from the lab would raise the prior odds from, say, 1 to 2, to 7 to 2 (corresponding to an increase in probability from 33% to 78%). Like the physician trying to reach the best possible diagnosis, the judge or jury trying to reach the best possible reconstruction of the events could benefit from knowing that an examiner, who can perform at the empirically established level of accuracy, has found a positive association.
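The same back-of-the-envelope check works for the hypothetical test in the preceding paragraph (again, a sketch with invented names, using only the numbers given above):

    def odds_to_probability(odds):
        # Convert odds in favor of a proposition to a probability.
        return odds / (1 + odds)

    sensitivity, fpr = 0.70, 0.10
    bf = sensitivity / fpr             # Bayes factor of about 7
    prior_odds = 1 / 2                 # prior odds of 1 to 2
    post_odds = prior_odds * bf        # about 3.5, i.e., 7 to 2

    print(round(odds_to_probability(prior_odds), 2))   # 0.33
    print(round(odds_to_probability(post_odds), 2))    # 0.78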

The logic behind a high hurdle for scientific evidence is that “it is likely to be shrouded with an aura of near infallibility, akin to the ancient oracle of Delphi.” 8/ As one federal judge (an advisor to PCAST) wrote in excluding the testimony of a handwriting expert:
[I]t is the Court's role to ensure that a given discipline does not falsely lay claim to the mantle of science, cloaking itself with the aura of unassailability that the imprimatur of ‘science’ confers and thereby distorting the truth-finding process. There have been too many pseudo-scientific disciplines that have since been exposed as profoundly flawed, unreliable, or baseless for any Court to take this role lightly. 9/
Under this rationale, a court should be able to admit the positive test result if the jury is informed of and can appreciate the limitations of the finding. A result that is ten times more probable when the samples have the reported source than when they have different sources is not unreliable “junk science.” Of course, it may not be the product of a particularly scientific (or even a very standardized) procedure, and that must be made clear to the factfinder. When the criminalists employing the highly subjective procedure truly have specialized knowledge—as evidenced by rigorous and repeated tests of their ability to arrive at correct answers—their findings can be presented along with their known error rates without creating “an aura of near infallibility.” If this view of what judges and juries can understand is correct, then a blanket rule against all expert evidence that has known error rates in excess of 5% is unsound.

This criticism of PCAST's 5% rule does not reject the main theme of the report—that when a forensic identification procedure relies on a vaguely defined judgmental process (such as "sufficient similarities and explicable dissimilarities in the light of the examiner's training and experience"), well-founded estimates of the ability of examiners to make the correct judgments are vital to admitting source attributions in court. Of course, Daubert v. Merrell Dow Pharmaceuticals 10/ did not make any single factor, including a "known or potential rate of error," absolutely necessary for admitting all types of scientific evidence. But the Daubert Court painted with an amazingly broad brush. The considerations that will be most important can vary from one type of evidence to another. When it comes to source attributions from entirely subjective assessments of the similarities and differences in feature sets, there is a cogent argument that the only acceptable way to validate the psychological process is to study how often examiners reach the right conclusions when confronted with same-source and different-source samples.

Notes
  1. Thanks to Ken Melson for calling this paragraph to my attention.
  2. PCAST Report at 161-52.
  3. Russell Katz, FDA: Evidentiary Standards for Drug Development and Approval, 1(3) NeuroRx 307–316 (2004), doi: 10.1602/neurorx.1.3.307.
  4. R.A. Fisher, The Arrangement of Field Experiments, 33 J. Ministry Agric. Gr. Brit. 504 (1926), as quoted in L. Savage, On Rereading R.A. Fisher, 4 Annals of Statistics 471 (1976).
  5. Kelly Servick, It Will Be Much Harder To Call New Findings ‘Significant’ If This Team Gets Its Way, Jul. 25, 2017, 2:30 PM, Science, DOI: 10.1126/science.aan7154.
  6. PCAST Report at 50 (emphasis added).
  7. However, the report made no mention of the fact that the false-negative rate was higher than 5% in at least one of the two experiments on latent print identification of which it approved.
  8. United States v. Alexander, 526 F.2d 161, 168 (8th Cir. 1975).
  9. Almeciga v. Center for Investigative Reporting, Inc., 185 F. Supp. 3d 401, 415 (S.D.N.Y. 2016) (Rakoff, J.).
  10. 509 U.S. 579 (1993).

Wednesday, July 5, 2017

Multiple Hypothesis Testing in Karlo v. Pittsburgh Glass Works

The following posting is adapted from a draft of an annual update to the legal treatise The New Wigmore on Evidence: Expert Evidence. I am not sure of the implications of the calculations in note 23 and the fact that the age-based groups are overlapping. Advice is welcome.

The Age Discrimination in Employment Act of 1967 (ADEA) 1/ covers individuals who are at least forty years old. The federal circuit courts are split as to whether a disparate-impact claim is viable when it is limited to a subgroup of employees such as those aged fifty and older. In Karlo v. Pittsburgh Glass Works, 2/ the Third Circuit held that statistical proof of disparate impact on such a subgroup can support a claim for recovery. The court countered the employer’s argument that “plaintiffs will be able to ‘gerrymander’ arbitrary age groups in order to manufacture a statistically significant effect” 3/ by promising that “the Federal Rules of Evidence and Daubert jurisprudence [are] a sufficient safeguard against the menace of unscientific methods and manipulative statistics.” 4/ In Daubert v. Merrell Dow Pharmaceuticals, the Supreme Court famously reminded trial judges applying the Federal Rules of Evidence that they are gatekeepers responsible for ensuring that scientific evidence presented at trials is based on sound science. By the end of the Karlo opinion, however, the court of appeals held that Senior District Judge Terrence F. McVerry had been too vigorous a gatekeeper when he found inadmissible a statistical analysis of reductions in force offered by laid-off older workers.

The basic problem was that plaintiffs claimed to have observed statistically significant disparities in various overlapping age groups without correcting for the fact that by performing a series of hypothesis tests, they had more than one opportunity to discover something "significant." By way of analogy, if you flip a coin five times and observe five heads, you might begin to suspect that the coin is not fair. The probability of five heads in a row with a fair coin is p = (1/2)^5 = 1/32 ≈ 0.03. We can say that the five heads in the sample are "statistically significant" proof (at the conventional 0.05 level) that the coin is unfair.

But suppose you get to repeat the experiment five times. Now the probability of at least one sample of 5 flips with 5 heads is about five times larger. It is 1 - (1 - 1/32)^5 = 0.146785, to be exact. This outcome is not so far out of line with what is expected of a fair coin. It would be seen about 15% of the time for a fair coin. This is weak evidence that the coin is unfair; certainly, it is not as compelling as the 3% p-value. So the extra testing, with the opportunity to select any one or more of the five samples as proof of unfairness, has reduced the weight of the statistical evidence of unfairness. The effect of the opportunity to search for significance is sometimes known as "selection bias" or, of late, "p-hacking."
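For readers who want to reproduce these figures, a two-step check in Python (mine, not from the opinion or the expert reports):

    # Probability of five heads in five flips of a fair coin.
    p_five_heads = (1 / 2) ** 5                      # 1/32 = 0.03125

    # Probability of at least one all-heads sample when the five-flip
    # experiment is repeated five times.
    p_at_least_one = 1 - (1 - p_five_heads) ** 5

    print(p_five_heads)                 # 0.03125
    print(round(p_at_least_one, 6))     # 0.146785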

In Karlo, Dr. Michael Campion—a distinguished professor of management at Purdue University with degrees in industrial and organizational psychology—compared proportions of Pittsburgh Glass workers older than 40, 45, 50, 55, and 60 who were laid off to the proportion of younger workers who were laid off. He found that the disparities in three of the five categories were statistically significant at the 0.05 level. 5/ The disparity for the 40-and-older range, he said, fell “just short,” being “significant at the 13% level.” Dr. Campion maintained that “[t]hese results suggest that there is evidence of disparate impact.” 6/ He also misconstrued the 0.05 level as “a 95% probability that the difference in termination rates of the subgroups is [] due to chance alone.” 7/ The district court expressed doubt as to whether Dr. Campion was a qualified statistical expert 8/ and excluded the testimony under Daubert as inadequate “data snooping.” 9/

Apparently, Judge McVerry was more impressed with the report of Defendant’s expert, James Rosenberger — a statistics professor at Pennsylvania State University and a fellow of the American Statistical Association and the American Association for the Advancement of Science. The report advocated adjusting the significance level to account for the five groupings of over-40 workers. The Chief Judge of the Third Circuit, D. Brooks Smith (also an adjunct professor at Penn State), described the recommended correction as follows:
The Bonferroni procedure adjusts for that risk [of a false positive] by dividing the “critical” significance level by the number of comparisons tested. In this case, PGW's rebuttal expert, Dr. James L. Rosenberger, argues that the critical significance level should be p < 0.01, rather than the typical p < 0.05, because Dr. Campion tested five age groups (0.05 / 5 = 0.01). Once the Bonferroni adjustment is applied, Dr. Campion's results are not statistically significant. Thus, Dr. Rosenberger argues that Dr. Campion cannot reject the null hypothesis and report evidence of disparate impact. 10/
Another way to apply the Bonferroni correction is to change the p-value. That is, when M independent comparisons have been conducted, the Bonferroni correction is either to set “the critical significance level . . . at 0.05/M” (as Professor Rosenberger recommended) or “to inflate all the calculated P values by a factor of M before considering against the conventional critical P value (for example, 0.05).” 11/
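In code, the two equivalent forms of the correction look like this (a sketch; the unadjusted p-values are made up for illustration and are not taken from the expert reports):

    M = 5                       # number of age groupings tested
    alpha = 0.05                # conventional significance level

    # Form 1: shrink the critical significance level (the approach Dr. Rosenberger described).
    critical_level = alpha / M                    # 0.01

    # Form 2: inflate each nominal p-value and compare it to the usual 0.05.
    nominal_p_values = [0.03, 0.02, 0.001]        # illustrative only
    adjusted_p_values = [min(1.0, p * M) for p in nominal_p_values]

    print(critical_level)       # 0.01
    print(adjusted_p_values)    # [0.15, 0.1, 0.005] -- only the very small p-value survives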

The Court of Appeals was not so sure that this conservative adjustment was essential to the admissibility of the p-values or assertions of statistical significance. It held that the district court erred in excluding the subgroup analysis and granting summary judgment. It remanded “for further Daubert proceedings regarding plaintiffs' statistical evidence.” 12/ Further proceedings were said to be necessary partly because the district court had applied “an incorrectly rigorous standard for reliability.” 13/ The lower court had set “a higher bar than what Rule 702 demands” 14/ because “it applied a bright-line exclusionary rule” for all studies with multiple comparisons that have no Bonferroni correction. 15/

But the district court did not clearly articulate such a rule. It wrote that “Dr. Campion does not apply any of the generally accepted statistical procedures (i.e., the Bonferroni procedure) to correct his results for the likelihood of a false indication of significance.” 16/ The sentence is internally inconsistent (and hence confusing). On the one hand, it refers to "generally accepted statistical procedures." On the other hand, the parenthetical phrase suggests that only one "procedure" exists. Had the district court written “e.g.” instead of “i.e.,” it would have been clear that it was not promulgating a dubious rule that only the Bonferroni adjustment to p-values or significance levels would satisfy Daubert. To borrow from Mark Twain, "the difference between the almost right word and the right word is really a large matter—'tis the difference between the lightning-bug and the lightning." 17/

Understanding the district court to be demanding a Bonferroni correction in all cases of multiple testing, the court of appeals essentially directed it to reconsider its exclusionary ruling in light of the fact that other procedures could be superior. Indeed, there are many adjustment methods in common use, of which Bonferroni’s is merely the simplest. 18/ However, plaintiff’s expert apparently had no other method to offer, which makes it hard to see why the possibility of some alternative adjustment, suggested by neither expert in the case, made the district court's decision to exclude Dr. Campion's proposed testimony an abuse of discretion.

A rule insisting on a suitable response to the multiple-comparison problem does not seem “incorrectly rigorous.” To the contrary, statisticians usually agree that “the proper use of P values requires that they be ... appropriately adjusted for multiple testing when present.” 19/ It is widely understood that when multiple comparisons are made, reported p-values will exaggerate the significance of the test statistic. 20/ The court of appeals’ statement that “[i]n certain cases, failure to perform a statistical adjustment may simply diminish the weight of an expert's finding” 21/ is therefore slightly misleading. In virtually all cases, multiple comparisons degrade the meaning of a p-value. Unless the statistical tests are all perfectly correlated, multiple comparisons always make the true probability of the disparity (or a larger one) under the model of pure chance greater than the nominal value. 22/

Even so, whether the fact that an unadjusted p-value exaggerates the weight of evidence invariably makes unadjusted p-values or reports of significance inadmissible under Daubert is a more delicate question. If no reasonable adjustment can be devised for the type of analysis used and no better analysis can be done, then the nominal p-values might be presented along with a cautionary statement about selection bias. In addition, when the nominal p-value is extremely small, the practical effect of the adjustment will be small, and the degree of exaggeration will not be so formidable as to render the unadjusted p-value inadmissible. For instance, if the nominal p-value were 0.001, the fact that the corrected figure (for five comparisons) is 0.005 would not be a fatal flaw. The disparity would be highly statistically significant even with the correction. But that was not the situation in Karlo. In this case, statistical significance was not apparent. It was undisputed that as soon as one considered the number of tests performed, not a single subgroup difference was significant at the 0.05 level. 23/

Consequently, the rejection of the district court’s conclusion that the particular statistical analysis in the expert’s report was unsound seems harsh. It should be within the trial court’s discretion to prevent an expert from testifying to the statistical significance of disparities (or their p-values) unless the expert avoids multiple comparisons that would seriously degrade the claims of significance or modifies those claims to reflect the negative impact of the repeated tests on the strength of the statistical evidence. 24/ The logic of Daubert does not allow an expert to dismiss the problem of selection bias on the theory -- advanced by plaintiffs in Karlo -- that “adjusting the required significance level [is only] required [when the analyst performs] ‘a huge number of analyses of all possibilities to try to find something significant.'’’ 25/ The threat to the correct interpretation of a significance probability does not necessarily disappear when the number of comparisons is moderate rather than “huge.” Given the lack of highly significant results here (even nominally), it is not statistically acceptable to ignore the threat. 26/ Although the Third Circuit was correct to observe that not all statistical imperfections render studies invalid within the meaning of Daubert, the reasoning offered in support of the claim of significant disparities in Karlo was not statistically acceptable. 27/

Notes
1. 29 U.S.C. §§ 621–634.
2. 849 F.3d 61 (3d Cir. 2017).
3. Id. at 76.
4. Id.
5. He testified that he did not compute a z-score (a way to analyze the difference between two proportions when the sample sizes are large) for the 60-and-over group “because ‘[t]here are only 14 terminations, which means the statistical power to detect a significant effect is very low.’” Karlo, 849 F.3d at 82 n.15.
6. Karlo v. Pittsburgh Glass Works, LLC, 2015 WL 4232600, at *11, No. 2:10–cv–1283 (W.D. Penn. July 13, 2015), vacated, 849 F.3d 61 (3d Cir. 2017).
7. Id. at *11 n.13. "A P value measures a sample's compatibility with a hypothesis, not the truth of the hypothesis." Naomi Altman & Martin Krzywinski, Points of Significance: Interpreting P values, 14 Nature Methods 213, 213 (2017).
8. Id. at *12.
9. Id. at *13.
10. 849 F.3d at 82 (notes omitted).
11. Pak C. Sham & Shaun M. Purcell, Statistical Power and Significance Testing in Large-scale Genetic Studies, 15 Nature Reviews Genetics 335 (2014) (Box 3).
12. Karlo, 849 F.3d at 80 (note omitted).
13. Id. at 82.
14. Id at 83.
15. Id. (internal quotation marks and ellipsis deleted).
16. Karlo, 2015 WL 4232600, at *1.
17. George Bainton, The Art of Authorship 87–88 (1890).
18. Martin Krzywinski & Naomi Altman, Points of Significance: Comparing Samples — Part II, 11 Nature Methods 355, 355 (2014).
19. Naomi Altman & Martin Krzywinski, Points of Significance: Interpreting P values, 14 Nature Methods 213, 214 (2017).
20. Krzywinski & Altman, supra note 18.
21. Karlo, 849 F.3d at 83 (emphasis added).
22. Because each age group included some of the same older workers, the tests here were not completely independent. But neither were they completely dependent.
23. However, that three out of five groups exhibited significant associations between age and terminations is surprising under the null hypothesis that those variables are uncorrelated. If each test were independent, then the probability of a significant result in each group would be 0.05. The probability of one or more significant results in five tests would be 0.226; that of two or more would be 0.0226; of three or more, 0.00116. (A short computational check of these figures appears after these notes.)
24. Joseph Gastwirth, Case Comment: An Expert's Report Criticizing Plaintiff's Failure to Account for Multiple Comparisons Is Deemed Admissible in EEOC v. Autozone, 7 Law, Probability & Risk 61, 62 (2008).
25. Karlo, 849 F.3d at 82.
26. Dr. Campion also believed that “his method [was] analogous to ‘cross-validating the relationship between age and termination at different cut-offs,’ or ‘replication with different samples.’” Id. at 83. Although the court of appeals seemed to take these assertions at face value, cross-validation involves applying the same statistical model to different data sets (or distinct subsets of one larger data set). For instance, an equation that predicts law school grades as a function of such variables as undergraduate grades and LSAT test scores might be derived from one data set, then checked to ensure that it performs well in an independent data set. Findings in one large data set of statistically significant associations between particular genetic loci and a disease could be checked to see if the associations were present in an independent data set. No such validation or replication was performed in this case.
27. The Karlo opinion suggested that the state of statistical knowledge or practice might be different in social science than in the broader statistical community. The court pointed to a statement (in a footnote on regression coefficients) in a treatise on statistical evidence in discrimination cases that “the Bonferroni adjustment [is] ‘good statistical practice,’ but ‘not widely or consistently adopted’ in the behavioral and social sciences.” Id. (quoting Ramona L. Paetzold & Steve L. Willborn, The Statistics of Discrimination: Using Statistical Evidence in Discrimination Cases § 6:7, at 308 n.2 (2016 Update)). The treatise writers were referring to an unreported case in which the district court found itself unable to resolve the apparent conflict between the generally recognized problem of multiple comparisons and an EEOC expert’s insistence that labor economists do not make such corrections and courts do not require them. E.E.O.C. v. Autozone, Inc., No. 00-2923, 2006 WL 2524093, at *4 (W.D. Tenn. Aug. 29, 2006). In the face of these divergent perceptions, the district judge decided not to grant summary judgment just because of this problem. Id. (“[T]he Court does not have a sufficient basis to find that ... the non-utilization [of the Bonferroni adjustment] makes [the expert's] results unreliable.”). The notion that multiple comparisons generally can be ignored in labor economics or employment discrimination cases is false, Gastwirth, supra note 24, at 62 (“In fact, combination methods and other procedures that reduce the number of individual tests used to analyse data in equal employment cases are basic statistical procedures that have been used to analyse data in discrimination cases.”), and any tendency to overlook multiple comparisons in “behavioral and social science” more generally is statistically indefensible.
That said, the outcome on appeal in Karlo might be defended as a pragmatic response to the lower court's misunderstanding of the meaning of the ADEA. The court excluded the unadjusted findings of significance for several reasons. In addition to criticizing Professor Campion's refusal to make any adjustment for his series of hypothesis tests across age groups, Judge McVerry noted that "the subgrouping analysis would only be helpful to the factfinder if this Court held that Plaintiffs could maintain an over-fifty disparate impact claim." Karlo, 2015 WL 4232600, at *13 n.16. He sided with "the majority view amongst the circuits that have considered this issue ... that a disparate impact analysis must compare employees aged 40 and over with those 39 and younger ... ." Id. (Petruska v. Reckitt Benckiser, LLC, No. CIV.A. 14–03663 CCC, 2015 WL 1421908, at *6 (D.N.J. Mar. 26, 2015)). The Third Circuit decisively rejected this construction of the ADEA, pulling this rug out from under the district court. Once it held that the district court had misinterpreted the ADEA, requiring the district court to re-examine the statistical showing under the statute, correctly understood, might seem appropriate.
Of course, ordinarily an evidentiary ruling that can be supported on several independent grounds will be upheld on appeal as long as at least one of the independent grounds is valid. Here, the ADEA argument was literally a footnote to the independent ground that the failure to adjust for multiple comparisons invalidated the expert's claim of significant disparities. Nevertheless, the independent-grounds rule normally applies after a trial. It avoids retrials when the trial judge would or could rule the same way on retrial. Because Karlo is a summary judgment case, there is less reason to sustain the evidentiary ruling. But even so, the court of appeals did not have to vacate the judgment. Instead, it could have followed the usual independent-grounds rule to affirm the summary judgment while noting that the district court could reconsider its Daubert ruling in light of the court of appeals' explanation of the proper reach of the ADEA and the range of statistically valid responses to the problem of multiple hypothesis tests. As a practical matter, however, there may be little difference between having counsel address the issue in the context of a motion to reconsider and having them do so on a renewed motion for summary judgment.
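A postscript on note 23: the binomial figures there can be verified with a few lines of Python (my own sketch; like the note, it assumes independence, which the overlapping age groups do not strictly satisfy):

    from math import comb

    def prob_at_least(k, n=5, p=0.05):
        # P(at least k of n independent tests are nominally significant at level p).
        return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

    print(round(prob_at_least(1), 3))    # 0.226
    print(round(prob_at_least(2), 4))    # 0.0226
    print(round(prob_at_least(3), 5))    # 0.00116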

Friday, June 30, 2017

Judge Spotlights PCAST Report

When the District of Columbia Court of Appeals (the District's "supreme court") overruled Frye v. United States and replaced the general acceptance standard for scientific evidence with one based on the Daubert line of cases, 1/ the court admonished trial judges to use "a delicate touch" in regulating the flow of expert testimony. 2/ One judge offered more guidance. Judge Catharine Friend Easterly penned a concurring opinion proposing that
trial courts will be called upon to scrutinize an array of forensic expert testimony under new, more scientifically demanding standards. As the opinion of the court states, “[t]here is no ‘grandfathering’ provision in Rule 702,” and, under the new rule we adopt, courts may not “reflexively admit expert testimony because it has become accustomed to doing so under the Dyas/Frye test.” 3/
Daubert does not necessarily erect a more demanding standard than Frye. It leaves plenty of wiggle room for undiscriminating or lenient rulings. Moreover, under Frye, counsel can challenge scientific evidence that is generally accepted in the forensic-science community (predominantly forensic-science practitioners) but whose scientific foundations are seen as weak in the broader scientific community. Both Frye and Daubert enable -- indeed, both require -- courts to depart from reflexively admitting expert testimony just because they are accustomed to it. The legal difference between the two approaches is that Daubert creates the theoretical possibility of rejecting a method that is still clearly generally accepted but that a small minority of scientists has come to regard -- on the basis of sound (but not yet generally accepted) scientific arguments -- as unfounded. This is merely the flip side of evidence that is not yet generally accepted but that is scientifically sound. Frye keeps such evidence out; Daubert does not. In sum, the standards are formally different, but, as written, one is not more demanding than the other.

But regardless of whether Daubert is more demanding than what the Supreme Court called the "austere" standard of Frye, the remainder of Judge Easterly's opinion is worthy of general notice. The opinion urges the judiciary to heed the findings of the 2009 NRC Report on forensic science and the 2016 PCAST report on particular methods. It observes that
Fortunately, in assessing the admissibility of forensic expert testimony, courts will have the aid of landmark reports that examine the scientific underpinnings of certain forensic disciplines routinely admitted under Dyas/Frye, most prominently, the National Research Council's congressionally-mandated 2009 report Strengthening Forensic Science in the United States: A Path Forward, and the President's Council of Advisors on Science and Technology's (PCAST) 2016 report Forensic Science in the Criminal Courts: Ensuring Scientific Validity of Feature-Comparison Methods [hereinafter PCAST Report]. These reports provide information about best practices for scientific testing, an objective yardstick against which proffered forensic evidence can be measured, as well as critiques of particular types of forensic evidence. In addition, the PCAST Report contains recommendations for trial judges performing their gatekeeping role under Rule 702:
(A) When deciding the admissibility of [forensic] expert testimony, ... judges should take into account the appropriate scientific criteria for assessing scientific validity including: (i) foundational validity,  with respect to the requirement under Rule 702(c) that testimony is the product of reliable principles and methods; and (ii) validity as applied, with respect to [the] requirement under Rule 702(d) that an expert has reliably applied the principles and methods to the facts of the case.
(B) ... [J]udges, when permitting an expert to testify about a foundationally valid feature-comparison method, should ensure that testimony about the accuracy of the method and the probative value of proposed identifications is scientifically valid in that it is limited to what the empirical evidence supports. Statements suggesting or implying greater certainty are not scientifically valid and should not be permitted. In particular, courts should never permit scientifically indefensible claims such as: “zero,” “vanishingly small,” “essentially zero,” “negligible,” “minimal,” or “microscopic” error rates; “100 percent certainty” or proof “to a reasonable degree of scientific certainty;” identification “to the exclusion of all other sources;” or a chance of error so remote as to be a “practical impossibility.”
PCAST Report, supra, at 19; see also id. at 142–45; Gardner v. United States, 140 A.3d 1172, 1184 (D.C. 2016) (imposing limits on experts' statements of certainty). 4/
Notes
  1. Motorola v. Murray, 147 A.3d 751 (D.C. 2016) (en banc); Frye Dies at Home at 93, June 30, 2017, http://for-sci-law.blogspot.com/2017/06/frye-dies-at-home-at-93.html.
  2. 147 A.3d at 757.
  3. Id. at 759 (emphasis added).
  4. Id. at 759-60 (notes omitted).

Frye Dies at Home at 93

The general-scientific-acceptance standard for scientific evidence originated in the District of Columbia, when the federal circuit court for the District upheld the exclusion of a blood-pressure test for deception in Frye v. United States, 293 F. 1013 (D.C. Cir. 1923). In October of 2016, the District of Columbia's highest court ended the standard's 93-year life there. The D.C. Court of Appeals unanimously overruled Frye and replaced it with the more open-ended Federal Rule of Evidence 702.

It did so in Motorola v. Murray, 147 A.3d 751 (D.C. 2016) (en banc), at the request of the trial court,  which felt that Frye required the admission of expert testimony that cell phones cause brain tumors. That view was mistaken. It is quite possible to exclude, as not based on a generally accepted method, opinions of general causation from expert witnesses when the scientific consensus is that the pertinent scientific studies do not support those opinions. See Cell Phones, Brain Cancer, and Scientific Outliers Are Not the Best Reasons to Abandon Frye v. United States, Nov. 26, 2015.

Elsewhere, I have argued that the choice between the Daubert line of cases codified in Rule 702 and the earlier Frye standard is less important than is the rigor with which the courts apply either standard. The Court of Appeals in Murray remarked that "[p]roperly performing the gatekeeping function will require a delicate touch." Id. at 757. It noted that trial courts have "discretion (informed by careful inquiry) to exclude some expert testimony." Id. In the end, "[t]he trial court still will need to determine whether the opinion 'is the product of reliable principles and methods[,] ... reliably applied.'" Id. at 758 (quoting Fed. R. Evid. 702 (c), (d)).