Monday, November 28, 2016

"We Can Predict Your Face" and Put It on a Billboard

An article entitled Craig Venter’s Latest Production (Arlene Weintraub, MIT Technology Review, Sept.-Oct. 2016, at 94) reports that
At Human Longevity Inc. (HLI) in La Jolla, California, more than two dozen machines work around the clock, sequencing one human genome every 15 minutes at a cost of under $2,000 per genome. The whole operation fits comfortably in three rooms. Back in 2000, when its founder, J. Craig Venter, first sequenced a human genome, it cost $100 million and took a building-size, $50 million computer nine months to complete. [¶] Venter’s goal is to sequence at least one million genomes, something that seems likely to take the better part of a decade ...

Seated behind his desk in his office two floors above the sequencing lab, his red poodle Darwin sleeping quietly at his feet, Venter has pulled up images on his computer that show how in one early experiment HLI scientists were able to sequence 1,000 people’s genomes and then reconstruct their faces solely on the basis of the genetic data. “We can predict your face, your height, your body mass index, your eye color, you hair color and texture,” he says, marveling at how closely one of the reconstructed faces matches the photo of the actual study participant.
It would be nice to have a decently designed study of how well all the 1,000 (adult?) faces match the photos. As Richard Feynman once told his students,
The first principle is that you must not fool yourself — and you are the easiest person to fool.
But why wait for whole genome sequencing to "predict" faces? Scores of police agencies already use a different company's product. Parabon Snapshot provides pictures as part of a "scientifically objective description" so "you can conduct your investigation more efficiently and close cases more quickly." Ellen Greytak, director of bioinformatics for Parabon, says that "So far, we've done more than sixty different cases, and we've also done evaluations at the local, state, federal and international levels." In fact, "we've had one conviction and a few other arrests." Michael Roberts, Could DNA Imaging Used in Bennett Family Murder Break JonBenet Case?, Westword, Sept. 16, 2016. Although "none of the police agencies in question has gone public with the technology's role in the cases thus far, she teases that an announcement about a success is pending." Id.

"The composite isn't intended to be like a driver's license photo, but it will bear a resemblance. And if you have a list of 1,000 people who were nearby that day, you can put the ones that match the most at the top, and the ones that match the least at the bottom." Id. (quoting Greytak). Parabon's website explains that this achievement comes from "using deep data mining and advanced machine learning algorithms in a specialized bioinformatics pipeline."

Maybe I missed it, but I saw no references to cross-validation studies of whatever oozed out of the pipeline. Nevertheless, "Snapshot trait predictions are presented with a corresponding measure of confidence, which reflects the degree to which such factors influence each particular trait. Traits, such as eye color, that are highly heritable (i.e., are not greatly affected by environmental factors) are predicted with higher accuracy and confidence than those that have lower heritability; these differences are shown in the confidence metrics that accompany each Snapshot trait prediction."

A "confidence metric" seems to be missing from an unusual advertising campaign called "The Face of Litter" in Hong Kong:
Using the “Snapshot” DNA phenotyping services of a company called Parabon Nanolabs, Ogilvy [a marketing firm] collected litter from the streets and using DNA obtained from the litter, created profiles of the offenders ... . These profiles are now posted on outdoor ads at bus stops, subway and train stations, and even on highway billboards.
Nanalyze, Parabon Nanolabs and DNA Phenotyping, Apr. 30, 2015. Parabon calls this a "social experiment," calling to mind the dismissive remark, "That's not an experiment, it's an experience."

(Last updated 11/29/16)

Saturday, November 26, 2016

People v. Ramsaran: How Not to Evaluate a DNA Mixture

Did Ganesh Ramsaran kill his wife, Jennifer, after dropping their children off at school on a chilly morning in December 2012? At around 7:54 that night, he called the police and told them that Jennifer was missing. He said she had left their home in New Berlin, New York, at 10 that morning to go shopping in Syracuse, about 60 miles away. He expected her home at around 5:00 p.m.

When she did not show up, he wasted little time. At 5:30, he called his father-in-law to say that he was going to call the police. When he called the police a few hours later, he was “adamant something terrible had happened.” He said they had a perfect marriage.

A program manager for IBM, Ganesh later told police that, after returning from the school, he worked from home on his computers. However, a forensic investigator determined that one of Ganesh's work computers was not used at all that day and the other was not used between 8:08 a.m. and 6:31 p.m. (except for the automatic installation of new software before 8:25 a.m.).  Jennifer had been playing an online game until about 8:15 a.m., when she abruptly ceased playing. She had told her online game partner that she was not going to Syracuse until later in the week because her van was making strange noises. Video footage undermined Ganesh’s account of having gone for a run to the Y that morning. As for the perfect marriage, there was evidence of a divorce in the air, an extra-marital affair, and even an insatiable sex drive.

Five days after the disappearance, the van was found in an apartment parking lot. Two months later, Jennifer's naked and decomposed body was found at the bottom of an embankment, with bruises and lacerations, bleeding underneath the scalp, and internal hemorrhages across the back.

Ganesh was tried for murder. In addition to the facts given above, DNA evidence supported the verdict of guilty. “Large blood stains in the back of [the van] were a conclusive DNA match with the victim,” and she “could not be excluded a contributor” to a blood stain on the sweatshirt that Ganesh had worn the day she disappeared. A “forensic expert testified that it was 1.661 quadrillion times more likely that the blood sample from the sweatshirt contained a combination of defendant's and the victim's blood than if two randomly selected individuals were the donors.”

The appellate division reversed the conviction because of “defense counsel's failure to object to the prosecutor's inappropriate characterization of the DNA testimony and evidence during summation.” The expert, Daniel Myers, had testified that although “the STR/DNA mixture profile ... is 1.661 quadrillion times more likely to be observed if donors are defendant and the victim than if two random unrelated people were selected, ... there were not enough alleles or DNA data to say conclusively that the victim's DNA was present.”

The prosecutor went further “during summation, ... by stating ... ‘on that sweatshirt is [defendant's] wife's DNA’”; that Jennifer’s “DNA was on that area where the bloody spot is”; and that “the forensic people [say that Jennifer's] DNA is on that sweatshirt, to some degree.”

How should the significance of the DNA typing results have been presented? The likelihood ratio for “two random unrelated people” is relevant, since that hypothesis is one conceivable explanation for the origin of the blood stain on the sweatshirt. If it were the sole “defense hypothesis,” and if the likelihood ratio with that hypothesis in the denominator were anything like 1.661 quadrillion, then the prosecution could maintain that the only reasonable conclusion was that the mixed stain came from Jennifer and Ganesh. Indeed, if the only comparison were between a Jennifer-Ganesh mixture and a nonJennifer-nonGanesh mixture, it would not have been egregiously wrong to suggest that Jennifer’s DNA in the mixture is what the forensic analyst reported “to some degree.”

However, other hypotheses consistent with innocence cannot be ignored. The most obvious is that the contributors were Ganesh and an individual unrelated to Jennifer. After all, Ganesh was wearing the sweatshirt. The hypothesis that the mixed stain was from him and someone besides the victim surely merited consideration along with the less probable hypothesis that the DNA came from neither Ganesh nor his wife. By not revealing how the likelihood for that defense-compatible hypothesis compares with the likelihood for the prosecution's claim of a Jennifer-Ganesh mixture, the state withheld information necessary to a fair assessment of the DNA evidence.

Prosecutors have license to present their evidence for all that it is worth, and the prosecutor in this case may have believed that the likelihood ratio of 1.661 quadrillion was the appropriate measure of probative value. But to know what the bloodstain on the sweatshirt really proves, one must consider all the relevant alternatives to the state’s claim about the evidence. There is no indication in the opinion that the jury was given the information it needed to assess the bloodstain evidence.

Now it could be that the likelihood ratio with the Ganesh-nonJennifer hypothesis in the denominator also would have been outrageously large. But without some indication of the magnitude of that likelihood ratio, the jury was not given a fair picture of the meaning of the mixture.

People v. Ramsaran, 141 App. Div. 3d 865, 35 N.Y.S.3d 549 (2016)
Joel Stashenko, Panel Orders New Trial for Man Charged in Wife's Death, N.Y.L.J., July 19, 2016

Thanks to Ted Hunt for calling this case to my attention.

Wednesday, November 23, 2016

A Flaky Forensic Genetics and Medicine Journal

I keep quietly adding emails to a posting on Flaky Academic Journals, but today's specimen from the Journal of Forensic Genetics and Medicine deserves special billing. The journal belongs to the North Carolina publisher and conference organizer, Allied Academies, which is allied with OMICS. Today's "confidential" email shows that it cannot even keep the names of its journals straight. It thinks its email advertisement is privileged with "work product immunity," and it denies that its emailed promises and claims are "given or endorsed by the company." Here is the solicitation:
Dear David H. Kaye,
Greetings from Journal of Forensic Genetics and Medicine.
It gives us great pleasure to invite you and your research allies to submit a manuscript for the Journal of Forensic Genetics and Medicine. We are delighted to announce that we are planning to release Inaugural Issue for our newly launched Journal. Your contribution adds more value to our inaugural issue. ...
Your contribution will help Journal of Sinusitis and Migraine establish its high standard and facilitate the journal to be indexed by prestigious ISI soon. ...
We Look forward for our long lasting scientific relationship.
With Regards,
Solomon Ebe, Editorial Assistant, Journal of Forensic Genetics and Medicine, Allied Academies, P.O.Box670, Candler, NC28715, USA

This message is confidential. It may also be privileged or otherwise protected by work product immunity or other legal rules. ... [Y]ou may not copy this message or disclose its contents to anyone. The views, opinions, conclusions and other information’s expressed in this electronic mail are not given or endorsed by the company unless otherwise indicated by an authorized representative independent of this message.
A second email arrived 12/2/16 with an additional "not given or endorsed" promise "that COMPLETE WAIVER will be provided on the articles submitted on or before December30th, 2016 for the inaugural issue."

The editorial board, if the website is to be believed, consists of the following individuals:
  • J. Thomas McClintock, Department of Biology and Chemistry, Forensic Science Program, Liberty University, Lynchburg, Virginia, USA.
  • James P Landers, Commonwealth Chaired Professor Dept. of Chemistry, Mechanical Engineering and Pathology Jefferson Scholar Faculty Fellow Co-Director of the Center for Nano-BioSystem Integration University of Virginia, Charlottesville, VA, United States
  • Robert W. Allen, Professor of Forensic Science and Chairman, School of Forensic Sciences, Center for Health Sciences, Oklahoma State University,1111 West 17th St.Tulsa, OK 74107, USA.
  • Susan A. Greenspoon, Forensic Molecular Biologist, Virginia Department of Forensic Science, 700 North Fifth Street Richmond, VA 23219, USA.
  • James Jabbour, Program Director, Assistant Professor, Applied Forensic Science,School of Applied Sciences, Forensic Sciences at Mount Ida College,777 Dedham Street, Newton, MA 02459, USA.
UPDATE (2 Apr. 2019): This journal no longer appears on Allied's website.

Thursday, November 10, 2016

The Defense Attorney's Fallacy in United States v. Natson

Searching for cases that illustrate the range of statistical statements that courts encounter, I stumbled (in two senses) on United States v. Natson. On December 16, 2003, a hunter discovered the remains of Ardena Carter and a fetus in a remote area of the Fort Benning Military Reservation in Columbus, Georgia. A student at Georgia Southern University, Carter had been shot in the back of the head with a pistol.

A grand jury indicted her boyfriend, Michael Antonio Natson, for homicide, feticide, and carrying and using a firearm during the murder. Calling for the death penalty, the government contended that Natson, a military police officer, killed Carter to keep her from seeking child support. To establish paternity, the government turned to DNA testing. However, the fetal bones yielded only a partial (five-locus) DNA profile, and the government’s expert described the findings as “inconclusive.” If this were all that the expert had to say, the court’s exclusion of the tests as irrelevant would have been unremarkable.

But the DNA expert, Shaun Weiss, did have more to say. As the court explained the proffered testimony, Weiss would have added that Natson was “26 times more likely to be the father of the fetus than a random person” and that “there is a 96.30% probability that Defendant is the father.” Weiss downplayed these numbers, maintaining that “the statistical probability of paternity must be at 99.99% for the DNA scientific community to consider a DNA test to show a paternity match.” In other words, Weiss believed that unless an inference of paternity is all but certain (99.99%), the test results are not scientifically acceptable. In light of these expert asseverations, the federal district court concluded that
It would be sheer speculation for a jury to determine from Weiss's testimony that Defendant is the father. Therefore, the Court finds that the testimony is not relevant and would not assist the trier of fact. Accordingly, it is not admissible under Federal Rules of Evidence 702, 401, and 402.
Furthermore, the court dismissed the 26:1 odds and the 96% probability as "significantly low." The testing was "not probative." Weiss could only "testify with certainty that Defendant was 'possibly' the father, along with thousands of other random persons."

This is a very strange application of the legal concept of relevance. Suppose the government had found an unused 9 mm bullet near the body (one that might have been dropped there by the killer). Would the court have dismissed as irrelevant evidence that Natson also owned a 9 mm pistol because it only shows that defendant's pistol might possibly have been the murder weapon, along with thousands of random pistols?

The notion that identification evidence is not relevant unless it limits the class of possible perpetrators to a very small number has been called the "defense attorney's fallacy." The fallacy lies in equating weakly or modestly probative identification evidence to the complete absence of probative value. Such reasoning is inconsistent with the common-law definition of relevance expressed in Federal Rule of Evidence 401. Rule 401 defines relevance in probabilistic terms:
“Relevant evidence” means evidence having any tendency to make the existence of any fact that is of consequence to the determination of the action more probable or less probable than it would be without the evidence.
Whether Natson was the father of the fetus (and thereby might have had the motive the government ascribed to him) is surely of some consequence, and if offspring of Natson and Carter are 26 times more probable to possess the genotypes of the fetal bones than are the offspring of Carter and a randomly selected man, then the genotypes make it “more ... probable” that Natson is indeed the father. Thus, the DNA test results, although far less definitive than in the usual paternity case with copious, fresh samples, were unequivocally relevant.

This explanation of why the genetic data is relevant does not depend on the court’s claim that a likelihood ratio of 26 means that Natson is “26 times more likely to be the father of the fetus than a random person” or that “there is a 96.30% probability that Defendant is the father.” These expressions are statements of the odds or probability of the “fact that is of consequence.” Geneticists cannot deduce such values without a probability for the fact "without the evidence.” To arrive at probabilities for a claim of paternity, parentage testers typically assume that the “fact ... without the evidence” is equiprobable, then adjust it in light of the genetic evidence. The expert's choice, which is not usually based on scientific data, of 50% for the probability without the evidence, has been sharply criticized. With respect to the question of relevance, however, that issue is a red herring. In Natson, 26 is the ratio of (1) the probability of the genetic evidence when Natson is the father to (2) the probability of the same evidence when he is not. It is a likelihood ratio. As such, the evidence alters the probability of paternity, no matter what the starting probability might be. That makes the evidence relevant. It does not matter whether the probability without the evidence is 50%, 5%, or any other number (except for unrealizable prior probabilities of exactly 1 or 0).
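The point that a likelihood ratio of 26 shifts the probability of paternity whatever the starting point can be checked with a few lines of arithmetic. The following is only an illustrative sketch of Bayes' rule in odds form. The function name and the prior probabilities of 5% and 25% are assumptions chosen here for illustration; only the likelihood ratio of 26 and the 50% prior come from the discussion above.

```python
def posterior_probability(prior: float, likelihood_ratio: float) -> float:
    """Update a prior probability with a likelihood ratio (Bayes' rule, odds form)."""
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    prior_odds = prior / (1.0 - prior)
    posterior_odds = likelihood_ratio * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


LR = 26.0  # likelihood ratio reported in Natson

# 0.50 is the conventional parentage-testing prior; 0.05 and 0.25 are
# hypothetical values used only to show that the evidence moves any prior.
for prior in (0.05, 0.25, 0.50):
    post = posterior_probability(prior, LR)
    print(f"prior {prior:.0%} -> posterior {post:.2%}")
```

With the 50% prior, the result is 26/27, or 96.30%, the figure quoted by the court. With the smaller priors the posterior is lower, but in every case it exceeds the prior, which is all that relevance requires.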

Although the court excluded the statistical statements about the DNA evidence, the jury convicted Natson. Family members testified that Carter was pregnant and that she had told them that Natson was the father, but the most crucial testimony may have come from FBI firearm and toolmark examiner Paul Tangren. Tangren opined that a discharged ammunition cartridge recovered "from the scene of the alleged crime" exhibited toolmarks that "were sufficiently similar" to those on cartridges test-fired from a pistol owned by Natson "to identify Defendant's gun ... to a 100% degree of certainty."

The judge imposed a sentence of imprisonment for life without parole. Natson appealed the conviction on the ground that “the case investigators, intentionally and calculatingly, refused to develop information ... which might implicate” other suspects. In an unreported opinion, the U.S. Court of Appeals for the Eleventh Circuit denied this appeal.


Thursday, November 3, 2016

The False-Positive Fallacy in the First Opinion to Discuss the PCAST Report

Last month, I quoted the following discussion of the PCAST report on forensic science that appeared in United States v. Chester, No. 13 CR 00774 (N.D. Ill. Oct. 7, 2016):
As such, the report does not dispute the accuracy or acceptance of firearm toolmark analysis within the courts. Rather, the report laments the lack of scientifically rigorous “blackbox” studies needed to demonstrate the reproducibility of results, which is critical to cementing the accuracy of the method. Id. at 11. The report gives detailed explanations of how such studies should be conducted in the future, and the Court hopes researchers will in fact conduct such studies. See id. at 106. However, PCAST did find one scientific study that met its requirements (in addition to a number of other studies with less predictive power as a result of their designs). That study, the “Ames Laboratory study,” found that toolmark analysis has a false positive rate between 1 in 66 and 1 in 46. Id. at 110. The next most reliable study, the “Miami-Dade Study” found a false positive rate between 1 in 49 and 1 in 21. Thus, the defendants’ submission places the error rate at roughly 2%.3 The Court finds that this is a sufficiently low error rate to weigh in favor of allowing expert testimony. See Daubert v. Merrell Dow Pharms., 509 U.S. 579, 594 (1993) (“the court ordinarily should consider the known or potential rate of error”); United States v. Ashburn, 88 F. Supp. 3d 239, 246 (E.D.N.Y. 2015) (finding error rates between 0.9 and 1.5% to favor admission of expert testimony); United States v. Otero, 849 F. Supp. 2d 425, 434 (D.N.J. 2012) (error rate that “hovered around 1 to 2%” was “low” and supported admitting expert testimony). The other factors remain unchanged from this Court’s earlier ruling on toolmark analysis. See ECF No. 781.

3. Because the experts will testify as to the likelihood that rounds were fired from the same firearm, the relevant error rate in this case is the false positive rate (that is, the likelihood that an expert’s testimony that two bullets were fired by the same source is in fact incorrect).
I suggested that the court missed (or summarily dismissed) the main point the President's Council of Advisors on Science and Technology was making -- that there is an insufficient basis in the literature for concluding that "the error rate [is] roughly 2%," but the court's understanding of "the error rate" also merits comment. The description of the meaning of "the false positive rate" in note 3 (quoted above) is plainly wrong. Or, rather, it is subtly wrong. If the experts will testify that two bullets came from the same gun, they will be testifying that their tests were positive. If the tests are in error, the test results will be false positives. And if the false-positive error probability is only 2%, it sounds as if there is only a 2% probability "that [the] expert's testimony ... is in fact incorrect."

But that is not how these probabilities work. The court's impression reflects what we can call a "false-positive fallacy." It is a variant on the well-known transposition fallacy (also loosely called the prosecutor's fallacy). Examiner-performance studies are incapable of producing what the court would like to know (and what it thought it was getting) -- "the likelihood that an expert’s testimony that two bullets were fired by the same source is in fact incorrect." The last phrase denotes the probability that a source hypothesis is false. It can be called a source probability. The "false positive rate" is the probability that certain evidence (a reported positive association) will arise if the source hypothesis is false. It can be called an evidence probability. As explained below, this evidence probability is but one of three probabilities that determine the source probability.

I. Likelihoods: The Need to Consider Two Error Rates

The so-called black-box studies can generate estimates of the evidence probabilities, but they cannot reveal the source probabilities. Think about how the performance study is designed. Examiners decide whether pairs of bullets or cartridges were discharged from the same source (S) or from different guns (~S). They are blinded to whether S or ~S is true, but the researchers control and know the true state of affairs (what forensic scientists like to call "ground truth"). The proportion of cases in which the examiners report a positive association (+E) out of all the cases of S can be written Prop(+E in cases of S), or more compactly, Prop(+E | S).  This proportion leads to an estimate of the probability that, in practice, the examiners and others like them will report a positive association (+E) when confronted with same-source bullets. This conditional probability for +E given that S is true can be abbreviated Prob(+E | S). I won't be fastidious about the difference between a proportion and a probability and will just write P(+E | S) for either, as the context dictates. In the long run, for the court's 2% figure (which is higher than the one observed false-positive proportion in the Ames study), we expect examiners to respond positively (+E) when S is not true (and they do reach a conclusion) only P(+E | ~S) = 2% of the time. 

Surprisingly, a small number like 2% for the "false-positive error rate" P(+E | ~S) does not necessarily mean that the positive finding +E has any probative value! Suppose that positive findings +E occur just as often when S is false as when S is true. (Examiners who are averse to false-positive judgments might be prone to err on the side of false negatives.) If the false-negative error probability is P(–E | S) = 98%, then examiners will tend to report –E 98% of the time for same-source bullets (S), just as they report –E 98% of the time for different-source bullets (~S). Equivalently, they report +E only 2% of the time whether or not S is true. Learning that such examiners found a positive association is of zero value in separating same-source cases from different-source cases. We may as well have flipped a coin. The outcome (the side of the coin, or the positive judgment of the examiner) bears no relationship to whether S is true or not.

Although a false negative probability of 98% is absurdly high, it illustrates the unavoidable fact that only when the ratio of the two likelihoods, P(+E | S) and P(+E | ~S), exceeds 1 is a positive association positive evidence of a true association. Consequently, the court's thought that "the relevant error rate in this case is the false positive rate" is potentially misleading. This likelihood is but one of the two relevant likelihoods. (And there would be still more relevant likelihoods if there were more than two hypotheses to consider.)
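The arithmetic behind this observation is short enough to verify directly. The sketch below is only an illustration; the function name is hypothetical, and the second call uses a 70% false-negative probability purely as a contrasting assumption.

```python
def positive_likelihood_ratio(false_positive_rate: float,
                              false_negative_rate: float) -> float:
    """Return P(+E | S) / P(+E | ~S) for a reported positive association."""
    true_positive_rate = 1.0 - false_negative_rate  # P(+E | S)
    return true_positive_rate / false_positive_rate


# The extreme hypothetical in the text: 2% false positives, 98% false negatives.
print(f"LR = {positive_likelihood_ratio(0.02, 0.98):.1f}")  # 1.0: no probative value

# A contrasting assumption: same 2% false positives, 70% false negatives.
print(f"LR = {positive_likelihood_ratio(0.02, 0.70):.1f}")  # 15.0: positive evidence of S
```

The false-positive rate is identical in both calls. Only the false-negative rate changes, and that alone moves the likelihood ratio from 1 to 15.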

II. Prior Probabilities: The Need to Consider the Base Rate

Furthermore, yet another quantity -- the mix of same-source and different-source pairs of bullets in the cases being examined -- is necessary to arrive at the court's understanding of "the false positive rate" as "the likelihood that an expert’s testimony that two bullets were fired by the same source is in fact incorrect." 1/ In technical jargon, the probability as described is the complement of the posterior probability (or positive predictive value in this context), and the posterior probability depends not only on the two likelihoods, or evidence probabilities, but also on the "prior probability" for the hypothesis S.
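For readers who prefer symbols, the relationship can be written out with Bayes' rule. This is the standard textbook formula expressed in the notation used in this post, not something taken from the Chester opinion or the PCAST report:

```latex
\[
P({\sim}S \mid +E) \;=\;
\frac{P(+E \mid {\sim}S)\,P({\sim}S)}
     {P(+E \mid {\sim}S)\,P({\sim}S) \;+\; P(+E \mid S)\,P(S)},
\qquad
P(+E \mid S) = 1 - P(-E \mid S).
\]
```

The three inputs are the false-positive probability P(+E | ~S), the false-negative probability P(–E | S), and the prior probability P(S), with P(~S) = 1 – P(S).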

A few numerical examples illustrate the effect of the prior probability. Imagine that a performance study with 500 same-source pairs and 500 different-source pairs (that led to conclusions) found the outcomes given in Table 1.

Table 1. Outcomes of comparisons

          ~S      S    Total
–E       490    350      840
+E        10    150      160
Total    500    500     1000

–E is a negative finding (the examiner decided there was no association).
+E is a positive finding (the examiner decided there was an association).
S indicates that the cartridges came from bullets fired by the same gun.
~S indicates that the cartridges came from bullets fired by different guns.

The first column of the table states that in the different-source cases, examiners reported a positive association +E in only 10 cases. Thus, their false-positive error rate was P(+E | ~S) = 10/500 = 2%. This is the figure used in Chester. (The second column states that in the same-source cases, examiners reported a negative association 350 times. Thus, their false-negative rate was P(–E | S) = 350/500 = 70%.)

But the second row of the table (the +E row) states that the examiners reported a positive association +E for 10 different-source cases and 150 same-source cases. Of the 10 + 150 = 160 cases of positive evidence, 150 are correct, and 10 are incorrect. The rate of incorrect positive findings was therefore P(~S | +E) = 10/160 = 6.25%. Within the four corners of the study, one might say, as the court did, that "the likelihood that an expert’s testimony that two bullets were fired by the same source is in fact incorrect" is only 2%. Yet, the rate of incorrect positive findings in the study exceeded 6%. The difference is not huge, but it illustrates the fact that the false-negative probability as well as the false-positive probability affects P(~S | +E), which indicates how often an examiner who declares a positive association is wrong. 2/
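The three proportions drawn from Table 1 can be recomputed from the four cell counts. The sketch below merely restates the hypothetical table; the dictionary layout and variable names are illustrative choices, not part of any study.

```python
# Cell counts from the hypothetical Table 1, keyed by ground truth, then finding.
counts = {
    "different_source": {"negative": 490, "positive": 10},
    "same_source": {"negative": 350, "positive": 150},
}
diff, same = counts["different_source"], counts["same_source"]

# Column-wise (conditional on ground truth): the two error rates.
false_positive_rate = diff["positive"] / (diff["positive"] + diff["negative"])
false_negative_rate = same["negative"] / (same["positive"] + same["negative"])

# Row-wise (conditional on the finding): how often a positive report is wrong.
all_positives = diff["positive"] + same["positive"]
wrong_among_positives = diff["positive"] / all_positives

print(f"P(+E | ~S) = {false_positive_rate:.2%}")    # 2.00%
print(f"P(-E | S)  = {false_negative_rate:.2%}")    # 70.00%
print(f"P(~S | +E) = {wrong_among_positives:.2%}")  # 6.25%
```

The first two figures divide by column totals and the third by a row total, which is why 2% and 6.25% answer different questions.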

Now let's change the mix of same- and different-source pairs of bullets from 50:50 to 1:10 (100 same-source pairs and 1,000 different-source pairs). We will keep the conditional-error probabilities the same, at P(+E | ~S) = 2% and P(–E | S) = 70%. Table 2 meets these constraints:

Table 2. Outcomes of comparisons

          ~S      S    Total
–E       980     70     1050
+E        20     30       50
Total   1000    100     1100

Row 2 shows that there are 20 false positives out of the 50 positively reported associations. The proportion of positive findings that are false in the modified study is P(~S | +E) = 20/50 = 40%. But the false-positive rate P(+E | ~S) is still 2% (20/1000).
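The dependence on the mix of cases can be made explicit by treating the proportion of same-source pairs as an input. The sketch below applies Bayes' rule with the two conditional-error probabilities held fixed at the values used in Tables 1 and 2. The function name is hypothetical, and the last prior (1 same-source pair in 100) is an extra assumption added only to extend the pattern.

```python
def prob_positive_is_wrong(prior_same: float,
                           false_positive_rate: float,
                           false_negative_rate: float) -> float:
    """Return P(~S | +E): the share of reported positive associations that are false."""
    p_pos_and_diff = false_positive_rate * (1.0 - prior_same)   # P(+E and ~S)
    p_pos_and_same = (1.0 - false_negative_rate) * prior_same   # P(+E and S)
    return p_pos_and_diff / (p_pos_and_diff + p_pos_and_same)


FP, FN = 0.02, 0.70  # P(+E | ~S) and P(-E | S), as in Tables 1 and 2

mixes = {
    "Table 1 (500 S, 500 ~S)": 500 / 1000,
    "Table 2 (100 S, 1000 ~S)": 100 / 1100,
    "assumed (1 S per 100 pairs)": 1 / 100,
}
for label, prior in mixes.items():
    print(f"{label}: P(~S | +E) = {prob_positive_is_wrong(prior, FP, FN):.2%}")
```

The first two lines reproduce the 6.25% and 40% figures from the tables. The false-positive rate never changes, yet the chance that a positive report is wrong climbs as same-source pairs become rarer.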

III. "When I'm 64": A Likelihood Ratio from the Ames Study

The Chester court may not have had a correct understanding of the 2% error rate it quoted, but the Ames study does establish that examiners are capable of distinguishing between same-source and different-source items on which they were tested. Their performance was far better than the outcomes in the hypothetical Tables 1 and 2. The Ames study found that across all the examiners studied, P(+E | S) = 1075/1097 = 98.0%, and P(+E | ~S) = 22/1443 = 1.52%. 3/ In other words, on average, examiners made correct positive associations 98.0/1.52 = 64 times more often when presented with same-source cartridges than they made incorrect positive associations when presented with different-source cartridges. This likelihood ratio, as it is called, means that when confronted with cases involving an even mix of same- and different-source items, over time and over all examiners, the pile of correct positive associations would be some 64 times higher than the pile of incorrect positive associations. Thus, in Chester, Judge Tharp was correct in suggesting that the one study that satisfied PCAST's criteria offers an empirical demonstration of expertise at associating bullet cartridges with the gun that fired them.
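The likelihood ratio can be recomputed from the four counts quoted above. This is only a sketch of the arithmetic; the variable names are illustrative, and the even-mix calculation at the end is the same simplifying assumption made in the text, not a claim about real casework.

```python
# Counts quoted above from the Ames study.
true_positive_rate = 1075 / 1097   # P(+E | S)
false_positive_rate = 22 / 1443    # P(+E | ~S)

likelihood_ratio = true_positive_rate / false_positive_rate
print(f"P(+E | S)  = {true_positive_rate:.1%}")
print(f"P(+E | ~S) = {false_positive_rate:.2%}")
print(f"likelihood ratio = {likelihood_ratio:.0f}")

# With an even mix of same- and different-source items (prior odds of 1),
# the posterior odds equal the likelihood ratio.
wrong_given_positive = 1.0 / (1.0 + likelihood_ratio)
print(f"P(~S | +E) with an even mix = {wrong_given_positive:.2%}")
```

Using the unrounded proportions gives a ratio slightly above 64, in line with the rounded figure in the text.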

Likewise, an examiner presenting a source attribution can point to a study deemed to be well designed by PCAST that found that a self-selected group of 218 examiners given cartridge cases from bullets fired by one type of handgun correctly identified more than 99 out of 100 same-gun cartridges and correctly excluded more than 98 out of 100 different-gun cartridges. For completeness, however, the examiner should add that he or she has no database with which to estimate the frequency of distinctive marks -- unless, of course, there is one that is applicable to the case at bar.

 * * *

Whether the Ames study, together with other literature in the field, suffices to validate the expertise under Daubert is a further question that I will not pursue here. My objective has been to clarify the meaning of and some of the limitations on the 2% false-positive error rate cited in Chester. Courts concerned with the scientific validity of a forensic method of identification must attend to "error rates." In doing so, they need to appreciate that it takes two to tango. Both false-positive and false-negative conditional-error probabilities need to be small to validate the claim that examiners have the skill to distinguish accurately between positively and negatively associated items of evidence.

  1. Not wishing to be too harsh on the court, I might speculate that its thought that the only "relevant error rate" for positive associations is the false-positive rate might have been encouraged by the PCAST report's failure to present any data on negative error rates in its discussion of the performance of firearms examiners. A technical appendix to the report indicates that the related likelihood is pertinent to the weight of the evidence, but this fact might be lost on the average reader -- even one who looks at the appendix.
  2. The PCAST report alluded to this effect in its appendix on statistics. That Judge Tharp did not pick up on this is hardly surprising.
  3. See David H. Kaye, PCAST and the Ames Bullet Cartridge Study: Will the Real Error Rates Please Stand Up?, Forensic Sci., Stat. & L., Nov. 1, 2016,

Tuesday, November 1, 2016

Index to Comments and Cases Discussing the PCAST Report on Forensic Science

This page lists the discussions of the PCAST report and its addendum appearing on this blog. It also lists some academic literature and court opinions that discuss the report. I expect to update the list periodically.

Forensic Science, Statistics & the Law
Academic Journals and Books
  • ANZFSS Council, Letter to the Editor, 50(5) Australian J. Forensic Sci. 451–452 (2018), originally published as ANZFSS Council Response to President’s Council of Advisors on Science and Technology Report, available at
  • John Buckleton, Jo-Anne Bright & Duncan Taylor, Letter, Response to Lander’s Response to the ANZFSS Council Statement on the President’s Council of Advisors on Science and Technology Report, 50(5) Australian J. Forensic Sci. 453–454 (2018) (arguing that STRmix has been validated)
  • Gary Edmond & Kristy A. Martire, Antipodean Forensics: A Comment on ANZFSS’s Response to PCAST, 50(2) Australian J. Forensic Sci. 140-151 (2017)
  • I.W. Evett, C.E.H. Berger, J.S. Buckleton, C. Champod, G. Jackson, Finding the Way Forward for Forensic Science in the US—A Commentary on the PCAST Report, 278 Forensic Sci. Int'l 16-23 (2017),
  • David L. Faigman et al., 1 Modern Scientific Evidence: The Law and Science of Expert Testimony x (2016-2017) ("[C]ourts have largely ignored the virtually consensus opinion of mainstream academic scientists that much of the forensic expertise routinely admitted in courts today is unsound. The latest statement of this consensus view came in September, 2016, in a lengthy and carefully reasoned report by The President's Council of Advisors on Science and Technology (PCAST).")
  • Ted Robert Hunt, Scientific Validity and Error Rates: A Short Response to the PCAST Report, 86 Fordham L. Rev. Online 24-39 (2018) ("To clarify the DOJ’s position, this Article is a short response to the Report’s discussion of scientific validity. The focus is on PCAST’s use of the term foundational validity, its views on error rates, and the proposed application of these concepts to forensic feature-comparison methods.")
  • Aliza B. Kaplan & Janis C. Puracal, It's Not a Match: Why the Law Can't Let Go of Junk Science, 81 Alb. L. Rev. 895 (2017-2018)
  • David H. Kaye, David E. Bernstein & Jennifer L. Mnookin, The New Wigmore on Evidence: Expert Evidence § 15.7.5 (2d ed. Cum. Supp. 2019)
  • David H. Kaye, Firearm-Mark Evidence: Looking Back and Looking Ahead, 68 Case W. Res. L. Rev. 723-45 (2018),
  • Eric S. Lander, Response to the ANZFSS Council Statement on the President’s Council of Advisors on Science and Technology Report, 49(4) Australian J. Forensic Sci. 366-368 (2017)
  • Geoffrey Stewart Morrison, David H. Kaye, David J. Balding, et al., A Comment on the PCAST Report: Skip the 'Match'/'Non-Match' Stage, 272 Forensic Sci. Int'l e7-e9 (2017), Accepted manuscript available at SSRN:
  • Adam B. Shniderman, Prosecutors Respond to Calls for Forensic Science Reform: More Sharks in Dirty Water, 126 Yale L.J. F. 348 (2017),
  • Transcript, Symposium on Forensic Science Testimony, Daubert, and Rule 702, 86 Fordham L. Rev. 1463-1550 (2018)
Professional Periodicals
  • Judge Herbert B. Dixon Jr., Another Harsh Spotlight on Forensic Sciences, Judges' J., Winter 2017, at 36 ("The report's conclusion is clear that the accuracy of many forensic feature-comparison methods has been assumed rather than scientifically established on empirical evidence. ... PCAST expects, partly based on the strength of its evaluations of scientific validity in this report, that some forensic feature-comparison methods may be determined inadmissible because they lack adequate evidence of scientific validity.")
  • Donna Lee Elm, Continued Challenge for Forensics: The PCAST Report, Crim. Just., Summ. 2017, at 4-8.
  • Jennifer Friedman, Another Opportunity for Forensic Reform: A Call to the Courts, Champion, July 2017, at 40
  • Jonathan J. Koehler, How Trial Judges Should Think About Forensic Science Evidence, Judicature, Spr. 2018, at 28–38 (critiques organized criticisms of the report)
  • Norman L. Reimer, Two New Tools to Include in a Cutting-Edge Defense Toolkit, NACDL's Champion, Nov. 2016, at 9-10 ("[T]he PCAST report was not greeted with great glee by the Department of Justice or the Federal Bureau of Investigation. ... So this report will by no means change practices overnight. But that is all the more reason why the defense bar should up its game when confronting questionable forensic evidence. The PCAST report will be a big help in that effort.").
  • Jack D. Roady, The PCAST Report: A Review and Moving Forward—A Prosecutor's Perspective, Crim. Just., Summ. 2017, at 8-14, 39.
  • J. H. Pate Skene, Up to the Courts: Managing Forensic Testimony with Limited Scientific Validity, Judicature, Spr. 2018, at 39-50 (“With the exception of DNA analysis of single-source samples, none of the forensic methods reviewed by PCAST has yet met rigorous criteria for both foundational validity (Rule 702(c)) and validity as applied (Rule 702(d)).”)
  • Eric Alexander Vos, Using the PCAST Report to Exclude, Limit, or Minimize Experts, Crim. Just., Summ. 2017, at 15-19.
Federal Cases
State and Washington DC Cases

PCAST and the Ames Bullet Cartridge Study: Will the Real Error Rates Please Stand Up?

An article in yesterday’s Boston Globe reports that “the [PCAST] report’s findings have also been widely criticized, especially by those in the forensics field, who argued that the council lacked any representation from ballistics experts. They argued that the council’s findings do not undermine the accuracy of firearms examinations.” 1/

The criticism that “ballistics experts” did not participate in writing the report is unpersuasive. These experts are great at their jobs, but reviewing the scientific literature on the validity and reliability of their toolmark comparisons is not a quotidian task. Would one criticize a meta-analysis of studies on the efficacy of a surgical procedure on the ground that the authors were epidemiologists rather than surgeons?

On the other hand, the argument that the “findings do not undermine the accuracy of firearms examinations” is correct (but inconclusive). True, the President’s Council of Advisors on Science and Technology (PCAST) did not find that toolmark comparisons as currently practiced are inaccurate. Rather, it concluded (on page 112) that
[F]irearms analysis currently falls short of the criteria for foundational validity, because there is only a single appropriately designed study to measure validity and estimate reliability. The scientific criteria for foundational validity require more than one such study, to demonstrate reproducibility.
In other words, PCAST found that existing literature (including that called to its attention by “ballistics experts”) does not adequately answer the question of how accurate firearms examiners are when comparing markings on cartridges—because only a single study that was designed as desired by PCAST provides estimates of accuracy.

Although PCAST’s view is that more performance studies are necessary to satisfy Federal Rule of Evidence 702, PCAST uses the single study to derive a false-positive error rate for courtroom use (just in case a court disagrees with its understanding of the rule of evidence, or the science, or in case the jurisdiction follows a different rule).

To evaluate PCAST's proposal, it will be helpful first to describe what the study itself found. Although “it has not yet been subjected to peer review and publication” (p. 111), the “Ames study,” as PCAST calls it, is available online. 2/ The researchers enrolled 284 volunteer examiners in the study, and 218 submitted answers (raising an issue of selection bias). The 218 subjects (who obviously knew they were being tested) “made ... 15 comparisons of 3 knowns to 1 questioned cartridge case. For all participants, 5 of the sets were from known same-source firearms [known to the researchers but not the firearms examiners], and 10 of the sets were from known different-source firearms.” 3/ Ignoring “inconclusive” comparisons, the performance of the examiners is shown in Table 1.

Table 1. Outcomes of comparisons
(derived from pp. 15-16 of Baldwin et al.)

          ~S       S     Total
–E      1421       4      1425
+E        22    1075      1097
Total   1443    1079

–E is a negative finding (the examiner decided there was no association).
+E is a positive finding (the examiner decided there was an association).
S indicates that the cartridges came from bullets fired by the same gun.
~S indicates that the cartridges came from bullets fired by a different gun.

False negatives. Of the 4 + 1075 = 1079 judgments in which the gun was the same, 4 were negative. This false-negative rate is Prop(–E|S) = 4/1079 = 0.37%. ("Prop" is short for "proportion," and "|" can be read as "given" or "out of all.") Treating the examiners tested as a random sample of all examiners of interest, and viewing the performance in the experiment as representative of the examiners' behavior in casework with materials comparable to those in the experiment, we can estimate the proportion of false negatives for all examiners. The point estimate is 0.37%. A 95% confidence interval is 0.10% to 0.95%. These numbers provide an estimate of how frequently all examiners would declare a negative association in all similar cases in which the association actually is positive. Instead of false negatives, we also can describe true positives, or sensitivity. The observed sensitivity is Prop(+E|S) = 1075/1079 = 99.63%. The 95% confidence interval around this estimate is 99.05% to 99.90%.

False positives. The observed false-positive rate is Prop(+E|~S) = 22/1443 = 1.52%, and the 95% confidence interval is 0.96% to 2.30%. The observed true-negative rate, or specificity, is Prop(–E|~S) = 1421/1443 = 98.48%, and its 95% confidence interval is 97.70% to 99.04%.

Taken at face value, these results seem rather encouraging. On average, examiners displayed high levels of accuracy, both for cartridge cases from the same gun (better than 99% sensitivity) and from different guns (better than 98% specificity).
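For readers who want to check the arithmetic, here is a minimal Python sketch that recomputes the observed error rates and exact (Clopper-Pearson) 95% intervals from the Table 1 counts. The choice of the Clopper-Pearson method is an assumption, since the intervals quoted above do not say how they were computed, but the output should land close to them. All function names are illustrative.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p).

    Intended for small k (error counts); very large binomial
    coefficients would overflow when converted to float.
    """
    if k < 0:
        return 0.0
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))


def solve(f, target, lo=0.0, hi=1.0, iters=100):
    """Bisection for p such that f(p) = target, with f decreasing in p."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid   # f is still too large, so p must increase
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, conf=0.95):
    """Two-sided exact interval for the proportion k/n."""
    alpha = 1 - conf
    # Lower limit: P(X >= k | p) = alpha/2, i.e. P(X <= k-1 | p) = 1 - alpha/2.
    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper limit: P(X <= k | p) = alpha/2.
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


# Table 1 counts (inconclusive comparisons excluded).
false_neg, n_same = 4, 1079    # -E given S
false_pos, n_diff = 22, 1443   # +E given ~S

for label, k, n in [("false-negative", false_neg, n_same),
                    ("false-positive", false_pos, n_diff)]:
    lo, hi = clopper_pearson(k, n)
    print(f"{label} rate: {k / n:.2%} (95% CI {lo:.2%} to {hi:.2%})")
    # The complementary rate (true positives or true negatives) is one
    # minus the error rate, so its interval is the mirror image.
    print(f"  complement: {1 - k / n:.2%} (95% CI {1 - hi:.2%} to {1 - lo:.2%})")
```

Working from the small error counts and mirroring the interval keeps the sums short and avoids the large coefficients that a direct calculation on 1075 out of 1079 would involve.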

Applying such numbers to individual examiners and particular cases obviously is challenging. The PCAST report largely elides the difficulties. (See Box 1.) It notes (on page 112) that "20 of the 22 false positives were made by just 5 of the 218 examiners — strongly suggesting that the false positive rate is highly heterogeneous across the examiners"; however, it does not discuss the implications of this fact for testimony about "the error rates" that it wants "clearly presented." It calls for "rigorous proficiency testing" of the examiner and disclosure of those test results, but it does not consider how the examiner’s level of proficiency maps onto the distribution of error rates seen in the Ames study. Neither does it consider how testimony should address the impact of verification by a second examiner. If the errors occur independently across examiners (as might be the case if the verification is truly blind), then the relevant false-positive error rate drops to (1.52%)² = 0.0231%. Is omitting some correction for verification an appropriate way to present the results of a rigorously verified examination? Indeed, is a false-positive error rate enough to convey the probative value of a positive finding? I will discuss the last question later.
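The verification arithmetic can be sketched in a few lines of Python. The sketch assumes, as the paragraph above does, that every examiner shares the pooled 1.52% rate and that errors are independent across examiners. Given the heterogeneity PCAST noted, both assumptions are questionable. The function name is illustrative.

```python
def verified_false_positive_rate(rate, n_examiners=2):
    """False-positive rate when all n examiners must independently err.

    Assumes every examiner has the same rate and that errors are
    independent, which blind verification might approximate.
    """
    return rate ** n_examiners


pooled_rate = 0.0152  # pooled Ames false-positive rate, rounded as in the text
verified_rate = verified_false_positive_rate(pooled_rate)

print(f"single examiner: {pooled_rate:.2%}")
print(f"with one independent verifier: {verified_rate:.4%}")
```

If difficult comparisons tend to mislead both examiners, errors are positively correlated and the verified rate would sit somewhere between the squared figure and the single-examiner rate.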


Foundational validity. PCAST finds that firearms analysis currently falls short of the criteria for foundational validity, ... . If firearms analysis is allowed in court, the scientific criteria for validity as applied should be understood to require clearly reporting the error rates seen in appropriately designed black-box studies (estimated at 1 in 66, with a 95 percent confidence limit of 1 in 46, in the one such study to date). [P. 112.]

Validity as applied. If firearms analysis is allowed in court, validity as applied would, from a scientific standpoint, require that the expert: (1) has undergone rigorous proficiency testing on a large number of test problems to evaluate his or her capability and performance, and discloses the results of the proficiency testing ... . [P. 113.]

[The] false-positive rate for examiner cartridge case comparisons ... was measured and for the pool of participants used in this study the fraction of false positives was approximately 1%. The study was specifically designed to allow us to measure not simply a single number from a large number of comparisons, but also to provide statistical insight into the distribution and variability in false-positive error rates. The ... overall fraction is not necessarily representative of a rate for each examiner in the pool. Instead, ... the rate is a highly heterogeneous mixture of a few examiners with higher rates and most examiners with much lower error rates. This finding does not mean that 1% of the time each examiner will make a false-positive error. Nor does it mean that 1% of the time laboratories or agencies would report false positives, since this study did not include standard or existing quality assurance procedures, such as peer review or blind reanalysis. [P. 18.]
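The "1 in 66" and "1 in 46" figures in the PCAST excerpt above can be approximately reproduced from the Table 1 counts. The Python sketch below treats the "95 percent confidence limit" as a one-sided exact (Clopper-Pearson) upper bound on 22 false positives in 1443 conclusive different-source comparisons. That reading is an assumption about PCAST's method, and the function names are illustrative.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); intended for small k."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))


def upper_bound(k, n, conf=0.95, iters=100):
    """One-sided exact upper bound: the p solving P(X <= k | p) = 1 - conf."""
    lo, hi = k / n, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > 1 - conf:
            lo = mid   # observing k or fewer is still too likely; raise p
        else:
            hi = mid
    return (lo + hi) / 2


false_pos, n_diff = 22, 1443   # +E given ~S, inconclusives excluded
point = false_pos / n_diff
bound = upper_bound(false_pos, n_diff)

print(f"point estimate: 1 in {1 / point:.0f}")
print(f"one-sided 95% upper bound: 1 in {1 / bound:.0f}")
```

A one-sided bound is tighter than the upper end of a two-sided 95% interval, which is why "1 in 46" (about 2.2%) is smaller than the 2.30% upper limit given earlier.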

  1. Milton J. Valencia, Scrutiny over Forensics Expands to Ballistics, Boston Globe, Oct. 31, 2016,
  2. David P. Baldwin, Stanley J. Bajic, Max Morris & Daniel Zamzow, A Study of False-positive and False-negative Error Rates in Cartridge Case Comparisons, Ames Laboratory, USDOE, Technical Report #IS-5207 (2014), at 
  3. Id. at 10.