Friday, September 30, 2011

Prometheus Unbound: Releasing the New Edition of the FJC Reference Manual on Scientific Evidence

Two days ago, the National Academy of Sciences released a third edition of the Federal Judicial Center’s Reference Manual on Scientific Evidence. I listened to the webcast of the unveiling as an insider and an outsider. An insider in that I co-authored two of the chapters. An outsider in that until the Manual eventually emerged, I seemed poised on the exterior side of the event horizon of a singularity into which drafts disappeared and time slowed.

Wednesday's unveiling included remarks from the two co-chairs of the NAS committee assembled to commission and supervise the writing of the manual—a group of five judges and five science professionals (a physician, a toxicologist, an engineer, a statistician, and an epidemiologist). Judge Gladys Kessler explained that in 1993, Daubert v. Merrell Dow Pharmaceuticals established “the gatekeeping role” of federal judges.

Certainly, the majority opinion, penned by Justice Blackmun in Daubert, is famous for this metaphor, but if federal judges were not gatekeepers before 1993, what were they? Surely not sheep. The Daubert opinion draws heavily on prior case law regarding the Federal Rules of Evidence, substituting for the previously dominant “austere” requirement of general acceptance in the scientific community a multifaceted inquiry into “evidential reliability.” Under either legal standard, however, judges are gatekeepers.

Although Judge Kessler correctly suggested that Daubert’s reliability standard (borrowed from earlier court of appeals cases) goes beyond mere relevancy, the same can be said for the general-acceptance standard (announced in a court of appeals case in 1923) that it displaced. Thus, the notion that judges were not gatekeepers for scientific evidence until 1993 always has struck me as odd. (For more on this legal history and the meaning of Daubert, see Kaye et al. (2011).)

Dr. Jerome Kassirer fielded a number of questions from the virtual and physical audiences. One came from a forensic scientist or analyst in Florida who wanted to know if there were any “practicing forensic scientists” on the editorial committee. The response, that the Manual relied on the very detailed 2009 NRC report on forensic science in its treatment of the forensic sciences, missed the subtext of the question. The practicing forensic science community has been bashing the 2009 report for not accurately depicting the knowledge base of forensic identification techniques (other than DNA evidence). The criticism often takes the form of complaints that the committee lacked enough forensic scientists. (For a rejoinder from the co-chair of that committee, see Edwards (2010).)

Another question was why there was no chapter on digital forensics. The answer referred to the failure of the designated author to produce a manuscript that the committee thought would be useful or intelligible to judges. This probably was not the only chapter to fall by the wayside. Indeed, the problems encountered with such chapters may be part of a more complete answer than the one given to the interlocutor who asked whether an 11-year gap between editions was not a bit much.

A final question to which I was alerted included a little speech about the importance of Bayesian inference. The questioner wanted to know why the Manual did not mention Bayes’ rule. Evidently, the questioner was not updating his prior beliefs with any data, for the chapters on statistics, DNA evidence, and medicine have substantial discussions of Bayes' rule. A better question would have been why there is not more discussion in the epidemiology chapter or why the presentation in the medicine chapter is so garbled. But that is a subject for another day.

References

Harry T. Edwards, 2010, The National Academy of Sciences Report on Forensic Sciences: What it Means for the Bench and Bar, Presentation at the Superior Court of the District of Columbia Conference on The Role of the Court in an Age of Developing Science & Technology, Washington, D.C., May 6, available at http://www.fd.org/pdf_lib/The NAS Report on Forensic Science.pdf.

David H. Kaye et al., 2011, The New Wigmore, A Treatise on Evidence: Expert Evidence, 2d ed., New York, NY: Aspen Pub.

National Research Council Committee on the Development of the Third Edition of the Reference Manual on Scientific Evidence, ed., 2011, Reference Manual on Scientific Evidence, Washington, DC: National Academies Press.

Friday, September 23, 2011

"The first experimental study exploring DNA interpretation"

A recent study, entitled “Subjectivity and Bias in Forensic DNA Mixture Interpretation,” proudly presents itself as “the first experimental study exploring DNA interpretation.” The researchers are to be commended for seeking to examine, on at least a limited basis, the variations in the judgments of analysts about a complex mixture of DNA.

In addition to documenting such variation, they suggest that their experiment shows that motivational or contextual bias caused analysts in an unnamed case in Georgia to include one suspect in a rape case as a possible contributor to the DNA mixture. This claim merits scrutiny. The experiment is not properly designed to investigate causation, and the investigators' causal inference lacks the foundation of a controlled experiment. To put it unkindly, if an experiment is an intervention that at least attempts to control for potentially confounding variables so as to permit a secure inference of causation, then this is no experiment.

In the study, Itiel Dror, a cognitive neuroscientist and Honorary Research Associate at University College London, and Greg Hampikian, a Professor of Biology and Criminal Justice at Boise State University, presented electropherograms to 17 “expert DNA analysts ... in an accredited government laboratory in North America.” The electropherograms came from a complex mixture of DNA from at least four or five people recovered in a gang rape in Georgia. The article does not state how many analysts worked on the case, whether they worked together or separately, the exact information that they received, or whether they peeked at the suspects’ profiles before determining the alleles that were present in the mixture. The authors imply that the analysts in the case were told that unless they could corroborate the accusations, no prosecution could succeed. Reasonably enough, Dror and Hampikian postulate that such information could bias an individual performing a highly subjective task.

In the actual case, one man pled guilty and accused three others of participating. The three men denied the accusation. The Georgia laboratory found that one of the three could not be excluded. Contrary to the expectations or desires of the police, the analysts either excluded the other two suspects or were unable to reach a conclusion as to them.

The 17 independent analysts shown the electropherograms from the case split on the interpretation of the complex mixture data. The study does not state the analysts’ conclusions for suspects 1 and 2. Presumably, they were consistent with one another and with the Georgia laboratory’s findings. With regard to suspect 3, however, “One examiner concluded that the suspect ‘cannot be excluded’, 4 examiners concluded ‘inconclusive’, and 12 examiners concluded ‘exclude.’”

From these outcomes, the researchers draw two main conclusions. The first is that “even using the ‘gold standard’ DNA, different examiners reach conflicting conclusions based on identical evidentiary data.”

That complex mixture analysis is unreliable (in the technical sense of being subject to considerable inter-examiner variation) is not news to forensic scientists and lawyers. Although the article implies that the NRC report on forensic science presents all DNA analysis as highly objective, the report refers to “interpretational ambiguities,” “the chance of misinterpretation,” and “inexperience in interpreting mixtures” as potential problems (NRC Report 2009, 132). The Federal Judicial Center’s Reference Manual on Scientific Evidence (Kaye & Sensabaugh 2011) explains that “A good deal of judgment can go into the determination of which peaks are real, which are artifacts, which are ‘masked,’ and which are absent for some other reason.” In The Double Helix and the Law of Evidence (2010, 208), I wrote that “As currently conducted ... , most mixture analyses involving partial or ambiguous profiles entail considerable subjectivity.” In 2003, Bill Thompson and his colleagues emphasized the risk of misinterpretation in an article for the defense bar.

These concerns about ambiguity and subjectivity have not escaped the attention of the courts. Supreme Court Justice Samuel Alito, quoting a law review article and a book for litigators, wrote that
[F]orensic samples often constitute a mixture of multiple persons, such that it is not clear whose profile is whose, or even how many profiles are in the sample at all. All of these factors make DNA testing in the forensic context far more subjective than simply reporting test results … .
and that
STR analyses are plagued by issues of suboptimal samples, equipment malfunctions and human error, just as any other type of forensic DNA test.
District Attorney’s Office for Third Judicial Dist. v. Osborne, 557 U.S. __ (2009) (Alito, J., concurring). Dror and Hampikian even quote DNA expert Peter Gill as saying that “If you show 10 colleagues a mixture, you will probably end up with 10 different answers.” Learning that 17 examiners were unanimous as to the presence of two profiles in a complex mixture and that they disagreed as to a third supports the widespread recognition that complex mixtures are open to interpretation, and it adds some more information about just how frequently analysts might differ in evaluating one set of electropherograms.

The second conclusion that the authors draw is that in the Georgia case “the extraneous context appears to have influenced the interpretation of the DNA mixture.” This conclusion may well be true; however, it is all but impossible to draw on the basis of this “first experimental study exploring DNA interpretation.” As noted at the outset, the “experimental study” has no treatment group. The study resembles collaborative exercises in DNA interpretation that have been done over the years. A true experiment, or at least a controlled one, would have included some analysts exposed to potentially biasing extraneous information. Their decisions could then have been compared to those of the unexposed analysts.
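To make the design point concrete, here is a minimal sketch, in Python, of how the two groups' calls on suspect 3 might be compared in such a controlled study. The group sizes, the counts, and the choice of Fisher's exact test are my own assumptions for purposes of illustration; nothing like this analysis appears in the paper.

# Hypothetical analysis of a controlled version of the study: one group of
# analysts receives the potentially biasing case information, the other does
# not, and each analyst either fails to exclude suspect 3 or does not.
# All counts below are invented for illustration.
from scipy.stats import fisher_exact

#                  fails to exclude   excludes / inconclusive
exposed_group   = [6,                 11]   # hypothetical
unexposed_group = [1,                 16]   # hypothetical

odds_ratio, p_value = fisher_exact([exposed_group, unexposed_group],
                                   alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")

Whatever the numbers might be in a real study, a comparison of this kind is what would license an inference about the effect of the extraneous information.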

Instead of controlling for confounding variables, the researchers compare the outcomes in their survey of analysts’ performance on an abstract exercise to the outcomes for one or two analysts in the original case. This approach does not permit them to exclude even the most obvious rival hypotheses. Perhaps it was not information about the police theory of the case and the prosecution's needs, but a difference between the laboratories' protocols, that produced the disparity. Perhaps the examiners outside of Georgia, who knew they were being studied, were more cautious in their judgments. Or, perhaps the police pressure, desires, or expectations really did have the hypothesized effect in Georgia. The study cannot distinguish among these and other possibilities.

In addition, the difference in outcomes between the Georgia group and the subjects in the study seems to be within the range of unbiased inter-examiner variability. How can one conclude that the Georgia analysts would not have included suspect 3 if they had not received the extraneous information and had followed the same protocol as the other 17? If the variability due to ordinary subjectivity in the process is such that 1 time out of 17 an analyst will include the reference profile in question, then the probability that a Georgia analyst would do so is 0.059. I am not a firm believer in hypothesis testing at the 0.05 level, but I cannot help thinking that even under the hypothesis that the Georgia group was not affected to the slightest degree by the extraneous information, the chance that the result would have been the same is not negligible.
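The arithmetic behind that figure is simple, but a short sketch makes the underlying assumption explicit: the 1-in-17 baseline treats the single "cannot be excluded" call among the study's analysts as an estimate of how often an unbiased examiner would make that call.

# If 1 analyst in 17 will fail to exclude suspect 3 even without any biasing
# information (an assumption drawn from the study's own results, not a
# finding about the Georgia laboratory), then a single unbiased Georgia
# analyst would have about a 6% chance of making the same call.
baseline_rate = 1 / 17
print(f"P(non-exclusion by one unbiased analyst) = {baseline_rate:.3f}")  # 0.059

# If, say, two analysts reviewed the mixture independently, the chance that
# at least one of them would fail to exclude is higher still.
p_at_least_one_of_two = 1 - (1 - baseline_rate) ** 2
print(f"P(at least one of two) = {p_at_least_one_of_two:.3f}")            # 0.114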

In raising these concerns, I certainly am not claiming that an expectation or motivation effect arising from information about the nature of the crime and the need for incriminating evidence played no role in the Georgia case. But the research reported in the paper does not go very far to establish that it was a significant factor and that it was the cause of the disparity between the Georgia analysts and the 17 others.

The authors’ caveat that “it is always hard to draw scientific conclusions when dealing with methodologies involving real casework” is not responsive to these criticisms. The problem lies with conclusions that outstrip the data when it would have been straightforward to collect better data. The sample size here does not give a reliable estimate of variability in judgments of examiners working in the same laboratory. The sampling plan ignores the possibility of greater variability across laboratories. Perhaps the confined scope of the study reflects a lack of funding or an unwillingness of forensic analysts to cooperate in research because of the pressure of their caseloads or for other reasons, a complaint aired in Mnookin et al. (2011). Inasmuch as the researchers do not explain how they chose their small sample, it is hard to know.
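A back-of-the-envelope interval estimate shows just how little 17 analysts from one laboratory reveal about that variability. The sketch below, which again treats the lone "cannot be excluded" call as the outcome of interest purely for illustration, computes an approximate 95% Wilson interval for the underlying rate; the sampling assumptions are mine, not the authors'.

# How precisely does a single "cannot be excluded" call in 17 trials pin down
# the underlying rate of such calls? An approximate 95% Wilson score interval
# (illustrative only; the 17 analysts were not a random sample).
from math import sqrt

successes, n, z = 1, 17, 1.96
p_hat = successes / n
denom = 1 + z**2 / n
center = (p_hat + z**2 / (2 * n)) / denom
half_width = (z / denom) * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
print(f"point estimate {p_hat:.3f}; 95% interval roughly "
      f"({center - half_width:.2f}, {center + half_width:.2f})")
# Roughly (0.01, 0.27): the data are consistent with anywhere from about one
# percent to more than a quarter of analysts declining to exclude.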

Beyond the ubiquitous issue of sample size, the subjects in the "experiment" were not assigned to treatment and control groups. No analysts were given the same extraneous information that the Georgia ones had. Of course, extraneous information presented in an experiment could be less influential than it would be in practice. External validity is always a problem with experiments. But there is reason to believe that a controlled experiment could detect an effect in a case like this. Bill Thompson (2009) reported anecdotes suggesting that even outside of actual casework, different DNA examiners presented with electropherograms of mixtures can be induced to reach different conclusions when given extraneous and unnecessary information about the case. That the effect might be less in simulated conditions does not mean that it is undetectable in a controlled experiment.

Convincing scientific knowledge flows from a combination of well designed experimental and observational studies. Dr. Dror's work on fingerprint comparisons (e.g., Dror 2006) has contributed to a better understanding of the effect of examiner expectations in that task. Experiments designed to detect the impact of potentially biasing information on interpretation of DNA profiles in a controlled setting also would be worth undertaking.

References

Itiel E. Dror & Greg Hampikian (2011), Subjectivity and Bias in Forensic DNA Mixture Interpretation, Sci. & Justice, doi:10.1016/j.scijus.2011.08.004, http://www.scienceandjusticejournal.com/article/S1355-0306%2811%2900096-7/abstract

Itiel E. Dror, David Charlton & Ailsa E. Peron (2006), Contextual Information Renders Experts Vulnerable to Making Erroneous Identifications, Forensic Science International, 156(1): 74-78

David H. Kaye (2010), The Double Helix and the Law of Evidence

David H. Kaye & George Sensabaugh (2011), Reference Guide on DNA Identification Evidence, in Reference Manual on Scientific Evidence, 3d ed.

Jennifer L. Mnookin et al. (2011), The Need for a Research Culture in the Forensic Sciences, UCLA Law Review, 58(3): 725-779

National Research Council Committee on Identifying the Needs of the Forensic Sciences Community (2009), Strengthening Forensic Science in the United States: A Path Forward, Washington, DC: National Academies Press

William C. Thompson, Simon Ford, Travis Doom, Michael Raymer & Dan E. Krane (2003), Evaluating Forensic DNA Evidence: Essential Elements of a Competent Defense Review, The Champion, Apr. 2003, at 16

William C. Thompson (2009), Painting the Target Around the Matching Profile: the Texas Sharpshooter Fallacy in Forensic DNA Interpretation, Law, Probability & Risk 8: 257-276

Cross-posted at the Double Helix Law blog