Wednesday, July 17, 2019

No Tension Between Rule 704 and Best Principles for Interpreting Forensic-science Test Results

At a webinar on probabilistic genotyping organized by the FBI, the Department of Justice’s Senior Advisor on Forensic Science, Ted Hunt, summarized the rules of evidence that are most pertinent to scientific and expert testimony. In the course of a masterful survey, he suggested that Federal Rule of Evidence 704 somehow conflicts with the evidence-centric approach to evaluating laboratory results recommended by a subcommittee of the National Commission on Forensic Science, by the American Statistical Association, and by European forensic-science service providers. 1/ In this approach, the expert stops short of opining on whether the defendant is the source of the trace. Instead, the expert merely reports that the data are L times more probable when the source hypothesis is true than when some alternative source hypothesis is true. (Or, the expert gives some qualitative expression such as "strong support" when this likelihood ratio is large.)
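The evidence-centric report reduces to a bit of arithmetic. The sketch below computes a likelihood ratio and maps it to a qualitative expression; the probabilities and the verbal cut-offs are invented for illustration and are not any organization's official scale.

```python
# Sketch of evidence-centric reporting: the examiner reports how much more
# probable the observed data are under one source hypothesis (H1) than under
# an alternative (H2), without opining on whether H1 is true.

def likelihood_ratio(p_data_given_h1: float, p_data_given_h2: float) -> float:
    """L = P(data | H1) / P(data | H2)."""
    return p_data_given_h1 / p_data_given_h2

def verbal_equivalent(lr: float) -> str:
    """Map a likelihood ratio to a qualitative expression (illustrative bands only)."""
    if lr >= 10_000:
        return "very strong support for H1 over H2"
    if lr >= 100:
        return "strong support for H1 over H2"
    if lr >= 1:
        return "limited support for H1 over H2"
    return "support for H2 over H1"

lr = likelihood_ratio(0.9, 0.0001)  # hypothetical probabilities
print(round(lr))                    # 9000
print(verbal_equivalent(lr))        # strong support for H1 over H2
```

The examiner who testifies this way reports only the strength of the evidence (the ratio or its verbal equivalent), leaving the posterior question -- is the defendant the source? -- to the factfinder.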

Whatever the merits of these proposals, Rule 704 does not stand in the way of implementing the recommended approach to reporting and testifying. First, the identity of the source of a trace is not necessarily an ultimate issue. To use the example of latent-print identification given in the webinar, the traditional opinion that a named individual is the source of a print is not an opinion on an ultimate issue. Courts have long allowed examiners to testify that the print lifted from a gun comes from a specific finger. But this conclusion is not an opinion on whether the murder defendant is the one who pulled the trigger. The examiner’s source attribution bears on the ultimate issue of causing the death of a human being, but the examiner who reports that the prints were defendant's is not opining that the defendant not only touched the gun (or had prints planted on it) but also pulled the trigger. Indeed, the latent print examiner would have no scientific basis for such an opinion on an element of the crime of murder.

Furthermore, even when an expert does want to express an opinion on an ultimate issue, Rule 704 does not counsel in favor of admitting it into evidence. Rule 704(a) consists of a single sentence: “An opinion is not objectionable just because it embraces an ultimate issue.” The sole function of these words is to repeal an outmoded, common-law rule categorically excluding these opinions. The advisory committee that drafted this repealing rule explained that “to allay any doubt on the subject, the so-called ‘ultimate issue’ rule is specifically abolished by the instant rule.” The committee expressed no positive preference for such opinions over evidence-centric expert testimony. It emphasized that Rules 701, 702, and 403 protect against unsuitable opinions on ultimate issues. Modern courts continue to exclude ultimate-opinion testimony when it is not sufficiently helpful to jurors. For example, conclusions of law remain highly objectionable.

Consequently, any suggestion that Rule 704 is an affirmative reason to admit one kind of testimony over another is misguided. “The effect of Rule 704 is merely to remove the proscription against opinions on ‘ultimate issues’ and to shift the focus to whether the testimony is ‘otherwise admissible.’” 2/ If conclusion-centric testimony is admissible, then so is the evidence-centric evaluation that lies behind it--with or without the conclusion.

In sum, there is no tension between Rule 704(a) and the recommendation to follow the evidence-centric approach. Repealing a speed limit on a road does not imply that drivers should put the pedal to the floor.

  1. This is the impression I received. The recording of the webinar should be available at the website of the Forensic Technology Center of Excellence in a week or two.
  2. Torres v. County of Oakland, 758 F.2d 147, 150 (6th Cir. 1985).
UPDATED: 18 July 2019 6:22 AM

Saturday, July 6, 2019

Distorting Daubert and Parting Ways with PCAST in Romero-Lobato

United States v. Romero-Lobato 1/ is another opinion applying the criteria for admissibility of scientific evidence articulated in Daubert v. Merrell Dow Pharmaceuticals 2/ to uphold the admissibility of a firearms examiner's conclusion that the microscopic marks on recovered bullets prove that they came from a particular gun. To do so, the U.S. District Court for the District of Nevada rejects the conclusions of the President's Council of Advisors on Science and Technology (PCAST) on validating a scientific procedure.

This is not to say that the result in the case is wrong. There is a principled argument for admitting suitably confined testimony about matching bullet or ammunition marks. But the opinion from U.S. District Court Judge Larry R. Hicks does not contain such an argument. The court does not reach the difficult question of how far a toolmark expert may go in forging a link between ammunition and a particular gun. It did not have to. In what seems to be a poorly developed challenge to firearms-toolmark expertise, the defense sought to exclude all testimony about such an association.

This posting describes the facts of the case, the court's description of the law on the admissibility of source attributions by firearms-toolmark examiners, and its review of the practice under the criteria for admitting scientific evidence set forth by the Supreme Court in Daubert.


A grand jury indicted Eric Romero-Lobato for seven felonies. On March 4, 2018, he allegedly tried to rob the Aguitas Bar and Grill and discharged a firearm (a Taurus PT111 G2) into the ceiling. On May 14, he allegedly stole a woman's car at gunpoint while she was cleaning it at a carwash. Later that night, he crashed the car in a high-speed chase. On the front passenger's seat was a Taurus PT111 G2 handgun.

Steven Johnson, a supervising criminalist in the Forensic Science Division of the Washoe County Sheriff's Office, 3/ was prepared to testify that the handgun had fired a round into the ceiling of the bar. Romero-Lobato moved "to preclude the testimony." The district court held a pretrial hearing at which Johnson testified to his background, training, and experience. He explained that he matched the bullet to the gun using the "AFTE method" advocated by the Association of Firearm and Tool Mark Examiners.

Defendant's challenge rested "on the critical NAS and PCAST Reports as evidence that 'firearms analysis' is not scientifically valid and fails to meet the requisite threshold for admission under Daubert and Federal Rule of Evidence 702." Apparently, the only expert at the hearing was the Sheriff Department's criminalist. Judge Hicks denied the motion to exclude Johnson's expert opinion testimony and issued a relatively detailed opinion.


Skipping over the early judicial resistance to "this is the gun" testimony, 4/ the court noted that despite misgivings about such testimony on the part of several federal district courts, only one reported case has barred all source opinion testimony, 5/ and the trend among the more critical courts is to search for ways to admit the conclusion with qualifications on its certainty.

Judge Hicks did not pursue the possibility of admitting but constraining the testimony, apparently because the defendant did not ask for that. Instead, the court reasoned that to overcome the inertia of the current caselaw, a defendant must have extraordinarily strong evidence (although it also recognized that the burden is on the government to prove scientific validity under Daubert). The judge wrote:
[T]he defense has not cited to a single case where a federal court has completely prohibited firearms identification testimony on the basis that it fails the Daubert reliability analysis. The lack of such authority indicates to the Court that defendant's request to exclude Johnson's testimony wholesale is unprecedented, and when such a request is made, a defendant must make a remarkable argument supported by remarkable evidence. Defendant has not done so here.
Defendant's less-than-remarkable evidence was primarily two consensus reports of scientific and other experts who reviewed the literature on firearms-mark comparisons. 6/ Both are remarkable. The first document was the highly publicized National Academy of Sciences committee report on improving forensic science. The committee expressed concerns about the largely subjective comparison process and the absence of studies to adequately measure the uncertainty in the evaluations. The court deemed these concerns to be satisfied by a single research report submitted to the Department of Justice, which funded the study:
The NAS Report, released in 2009, concluded that “[s]ufficient studies have not been done to understand the reliability and repeatability” of firearm and toolmark examination methods. ... The Report's main issue with the AFTE method was that it did not provide a specific protocol for determining a match between a shell casing or bullet and a specific firearm. ... Instead, examiners were to rely on their training and experience to determine if there was a “sufficient agreement” (i.e. match) between the mark patterns on the casing or bullet and the firearm's barrel. ... During the Daubert hearing, Johnson testified about his field's response to the NAS Report, pointing to a 2013 study from Miami-Dade County (“Miami-Dade Study”). The Miami-Dade Study was conducted in direct response to the NAS Report and was designed as a blind study to test the potential error rate for matching fired bullets to specific guns. It examined ten consecutively manufactured barrels from the same manufacturer (Glock) and bullets fired from them to determine if firearm examiners (165 in total) could accurately match the bullets to the barrel. 150 blind test examination kits were sent to forensics laboratories across the United States. The Miami-Dade Study found a potential error rate of less than 1.2% and an error rate by the participants of approximately 0.007%. The Study concluded that “a trained firearm and tool mark examiner with two years of training, regardless of experience, will correctly identify same gun evidence.”
A more complete (and accurate) reading of the Miami-Dade Police Department's study shows that it was not designed to measure error rates as they are defined in the NAS report and that the "error rate" was much closer to 1%. That's still small, and, with truly independent verification of an examiner's conclusions, the error rate should be smaller than that for examiners whose findings are not duplicated. Nonetheless, as an earlier posting shows, the data are not as easily interpreted and applied to case work as the report from the crime laboratory suggests. The research study, which has yet to appear in any scientific journal, has severe limitations.

The second report, released late in 2016 by the President's Council of Advisors on Science and Technology (PCAST), flatly maintained that microscopic firearms-marks comparisons had not been scientifically validated. Essentially dismissing the Miami-Dade Police and earlier research as not properly designed to measure the ability of examiners to infer whether the same gun fired test bullets and ones recovered from a crime scene, PCAST reasoned that (1) AFTE-type identification had yet to be shown to be "reliable" within the meaning of Rule 702 (as PCAST interpreted the rule); and (2) if courts disagreed with PCAST's legal analysis of the rule's requirements, they should at least require examiners associating ammunition with a particular firearm to give an upper bound, as ascertained from controlled experiments, on false-positive associations. (These matters are discussed in previous postings.)

The court did not address the second conclusion and gave little or no weight to the first one. It wrote that the 2016 report
concluded that there was only one study done that “was appropriately designed to test foundational validity and estimate reliability,” the Ames Laboratory Study (“Ames Study”). The Ames Study ... reported a false-positive rate of 1.52%. ... The PCAST Report did not reach a conclusion as to whether the AFTE method was reliable or not because there was only one study available that met its criteria.
All true. PCAST certainly did not write that there is a large body of high quality research that proves toolmark examiners cannot associate expended ammunition with specific guns. PCAST's position is that a single study is not a body of evidence that establishes a scientific theory--replication is crucial. If the court believed that there is such a body of literature, it should have explained the basis for its disagreement with the Council's assessment of the literature. If it agreed with PCAST that the research base is thin, then it should have explained why forensic scientists should be able to testify--as scientists--that they know which gun fired which bullet. This opinion does neither. (I'll come to the court's discussion of Daubert below.)

Instead, the court repeats the old news that
the PCAST Report was criticized by a number of entities, including the DOJ, FBI, ATF, and AFTE. Some of their issues with the Report were its lack of transparency and consistency in determining which studies met its strict criteria and which did not and its failure to consult with any experts in the firearm and tool mark examination field.
Again, all true. And all so superficial. That prosecutors and criminal investigators did not like the presidential science advisors' criticism of their evidence is no surprise. But exactly what was unclear about PCAST's criteria for replicated, controlled, experimental proof? In fact, the DOJ later criticized PCAST for being too clear--for having a "nine-part" "litmus test" rather than more obscure "trade-offs" with which to judge what research is acceptable. 7/

And what was the inconsistency in PCAST's assessment of firearms-marks comparisons? Judge Hicks maintained that
The PCAST Report refused to consider any study that did not meet its strict criteria; to be considered, a study must be a “black box” study, meaning that it must be completely blind for the participants. The committee behind the report rejected studies that it did not consider to be blind, such as where the examiners knew that a bullet or spent casing matched one of the barrels included with the test kit. This is in contrast to studies where it is not possible for an examiner to correctly match a bullet to a barrel through process of elimination.
This explanation enucleates no inconsistency. The complaint seems to be that PCAST's criteria for validating a predominantly subjective feature-comparison procedure are too demanding or restrictive, not that these criteria were applied inconsistently. Indeed, no inconsistency in applying the "litmus test" for an acceptable research design to firearms-mark examinations is apparent.

Moreover, the court's definition of "a 'black box' study" is wrong. All that PCAST meant by "black box" is that the researchers are not trying to unpack the process that examiners use and inspect its components. Instead, they say to the examiner, "Go ahead, do your thing. Just tell us your answer, and we'll see if you are right." The term is used by software engineers who test complex programs to verify that the outputs are what they should be for the inputs. The Turing test for the proposition that "machines can think" is a kind of black box test.

Nonetheless, this correction is academic. The court is right about the fact that PCAST gave no credence to "closed tests" like those in which an examiner sorts bullets into pairs knowing in advance that every bullet has a mate. Such black-box experiments are not worthless. They show a nonzero level of skill, but they are easier than "open tests" in which an examiner is presented with a single pair of bullets to decide whether they have a common source, then another pair, and another, and so on. In Romero-Lobato, the examiner had one bullet from the ceiling to compare to a test bullet he fired from one suspect gun. There is no "trade-off" that would make the closed-test design appropriate for establishing the examiner's skill at the task he performed.

All that remains of the court's initial efforts to avoid the PCAST report is the tired complaint about a "failure to consult with any experts in the firearm and tool mark examination field." But what consultation does the judge think was missing? The scientists and technologists who constitute the Council asked the forensic science community for statements and literature to support their practices. The Council shared a draft of its report with the Department of Justice before finalizing it. After releasing the report, it asked for more responses and issued an addendum. Forensic-services providers may complain that the Council did not use the correct criteria, that its members were closed-minded or biased, or that the repeated opportunities to affect the outcome were insufficient or even a sham. But a court needs more than a throw-away sentence about "failure to consult" to justify treating the PCAST report as suspect.


Having cited a single, partly probative police laboratory study as if it were a satisfactory response to the National Academy's concerns and having colored the President's Council report as controversial without addressing the limited substance of the prosecutors' and investigators' complaints, the court offered a "Daubert analysis." It marched through the five indicia that the Supreme Court enumerated as factors that courts might consider in assessing scientific validity and reliability.

A. It Has Been Tested

The Romero-Lobato opinion made much of the fact that "[t]he AFTE methodology has been repeatedly tested" 8/ through "numerous journals [sic] articles and studies exploring the AFTE method" 9/ and via Johnson's perfect record on proficiency tests as proved by his (hearsay and character evidence) testimony. Einstein once expressed impatience "with scientists who take a board of wood, look for its thinnest part and drill a great number of holes where drilling is easy." 10/ Going through the drill of proficiency testing does not prove much if the tests are simple and unrealistic. A score of trivial or poorly designed experiments should not engender great confidence. The relevant question under Daubert is not simply "how many tests so far?" It is how many challenging tests have been passed. The opinion makes no effort to answer that question. It evinces no awareness of the "10 percent error rate in ballistic evidence" noted in the NAS Report that prompted corrective action in the Detroit Police crime laboratory.

Instead of responding to PCAST's criticisms of the design of the AFTE Journal studies, the court wrote that "[a]lthough both the NAS and PCAST Reports were critical of the AFTE method because of its inherent subjectivity, their criticisms do not affect whether the technique they criticize has been repeatedly tested. The fact that numerous studies have been conducted testing the validity and accuracy of the AFTE method weighs in favor of admitting Johnson's testimony."

But surely the question under Daubert is not whether there have been "numerous studies." It is what these studies have shown about the accuracy of trained examiners to match a single unknown bullet with control bullets from a single gun. The court may have been correct in concluding that the testing prong of Daubert favors admissibility here, but its opinion fails to demonstrate that "[t]here is little doubt that the AFTE method of identifying firearms satisfies this Daubert element."

B. Publication and Peer Review

Daubert recognizes that, to facilitate the dissemination, criticism, and modification of theories, modern science relies on publication in refereed journals that members of the scientific community read. Romero-Lobato deems this factor to favor admission for two reasons. First, the AFTE Journal, in which virtually all the studies dismissed by PCAST appear, uses referees. That it is not generally regarded as a significant scientific journal -- it is not available through most academic libraries, for example -- went unnoticed.

Second, the court contended that "of course, the NAS and PCAST Reports themselves constitute peer review despite the unfavorable view the two reports have of the AFTE method. The peer review and publication factor therefore weighs in favor of admissibility." The idea that the rejection in consensus reports of a series of studies as truly validating a theory "weighs in favor of admissibility" is difficult to fathom. Some readers might find it preposterous.

C. Error Rates

Just as the court was content to rely on the absolute number of studies as establishing that the AFTE method has been adequately tested, it takes the error rates reported in the questioned studies at face value. Finding the numbers to be "very low," and implying (without explanation) that PCAST's criteria are too "strict," it concludes that Daubert's "error rate" factor too "weighs in favor of admissibility."

A more plausible conclusion is that a large body of studies that fail to measure the error rates (false positive and negative associations) appropriately but do not indicate very high error rates is no more than weakly favorable to admission. (For further discussion, see the previous postings on the court's discussion of the Miami Dade and Ames Laboratory technical reports.)

D. Controlling Standards

The court cited no controlling standards for the judgment of "'sufficient agreement' between the 'unique surface contours' of two toolmarks." After reciting the AFTE's definition of "sufficient agreement," Judge Hicks decided that "matching two tool marks essentially comes down to the examiner's subjective judgment based on his training, experience, and knowledge of firearms. This factor weighs against admissibility."

However, the opinion adds that "the consecutive matching striae ('CMS') method," which Johnson used after finding "sufficient agreement," is "an objective standard under Daubert." It is "objective" because an examiner cannot conclude that there is a match unless he "observes two or more sets of three or more consecutive matching markings on a bullet or shell casing." The opinion did not consider the possibility that this numerical rule does little to confine discretion if no standard guides the decision whether a marking matches. Instead, the opinion debated whether the CMS method should be considered objective and confused that question with how widely the method is used.

The relevant inquiry is not whether a method is subjective or objective. For a predominantly subjective method, the question is whether standards for making subjective judgments will produce more accurate and more reliable (repeatable and reproducible) decisions and how much more accurate and reliable they will be.

E. General Acceptance

Finally, the court found "widespread acceptance in the scientific community." But the basis for this conclusion was flimsy. It consisted of statements from other courts like "the AFTE method ... is 'widely accepted among examiners as reliable'" and "[t]his Daubert factor is designed to prohibit techniques that have 'only minimal support' within the relevant community." Apparently, the court regarded the relevant community as confined to examiners. Judge Hicks wrote that
it is unclear if the PCAST Report would even constitute criticism from the “relevant community” because the committee behind the report did not include any members of the forensic ballistics community ... . The acceptance factor therefore weighs in favor of admitting Johnson's testimony.
If courts insulate forensic-science service providers from the critical scrutiny of outside scientists, how can they legitimately use the general-acceptance criterion to help ascertain whether examiners are presenting "scientific knowledge" à la Daubert or something else?

  1. No. 3:18-cr-00049-LRH-CBC, 2019 WL 2150938 (D. Nev. May 16, 2019).
  2. 509 U.S. 579 (1993).
  3. For a discussion of a case involving inaccurate testimony from the same laboratory that caught the attention of the Supreme Court, see David H. Kaye, The Interpretation of DNA Evidence: A Case Study in Probabilities, National Academies of Science, Engineering and Medicine, Science Policy Decision-making Educational Modules, 2016, available at; McDaniel v. Brown: Prosecutorial and Expert Misstatements of Probabilities Do Not Justify Postconviction Relief — At Least Not Here and Not Now, Forensic Sci., Stat. & L., July 7, 2014,
  4. See David H. Kaye, Firearm-Mark Evidence: Looking Back and Looking Ahead, 68 Case W. Res. L. Rev. 723, 724-25 (2018), available at The court relied on the article's explication of more modern case law.
  5. The U.S. District Court for the District of Colorado excluded toolmark conclusions in the prosecutions for the bombing of the federal office building in Oklahoma City. The toolmarks there came from a screwdriver. David H. Kaye et al., The New Wigmore, A Treatise on Evidence: Expert Evidence 686-87 (2d ed. 2011).
  6. The court was aware of an earlier report from a third national panel of experts raising doubts about the AFTE method, but it did not cite or discuss that report's remarks. Although the 2008 National Academies report on the feasibility of establishing a ballistic imaging database only considered the forensic toolmark analysis of firearms in passing, it gave the practice no compliments. Kaye, supra note 4, at 729-32.
  7. Ted Robert Hunt, Scientific Validity and Error Rates: A Short Response to the PCAST Report, 86 Fordham L. Rev. Online Art. 14 (2017),
  8. Quoting United States v. Ashburn, 88 F.Supp.3d 239, 245 (E.D.N.Y. 2015).
  9. Citing United States v. Otero, 849 F.Supp.2d 425, 432–33 (D.N.J. 2012), for "numerous journals [sic] articles and studies exploring the AFTE method."
  10. Philipp Frank, Einstein's Philosophy of Science, Reviews of Modern Physics (1949).
MODIFIED: 7 July 2019 9:10 EST

Sunday, June 23, 2019

The Miami Dade Bullet-matching Study Surfaces in United States v. Romero-Lobato

Last month, the US District Court for the District of Nevada rejected another challenge to firearms toolmark comparisons. The opinion in United States v. Romero-Lobato, 1/ written by Judge Larry R. Hicks, relies in part on a six-year-old study that has yet to appear in any scientific journal. 2/ The National Institute of Justice (the research-and-development arm of the Department of Justice) funded the Miami-Dade Police Department Crime Laboratory "to evaluate the repeatability and uniqueness of striations imparted by consecutively manufactured EBIS barrels with the same EBIS pattern to spent bullets as well as to determine the error rate for the identification of same gun evidence." 3/ Judge Hicks describes the 2013 study as follows:
The Miami-Dade Study was conducted in direct response to the NAS Report and was designed as a blind study to test the potential error rate for matching fired bullets to specific guns. It examined ten consecutively manufactured barrels from the same manufacturer (Glock) and bullets fired from them to determine if firearm examiners (165 in total) could accurately match the bullets to the barrel. 150 blind test examination kits were sent to forensics laboratories across the United States. The Miami-Dade Study found a potential error rate of less than 1.2% and an error rate by the participants of approximately 0.007%. The Study concluded that “a trained firearm and tool mark examiner with two years of training, regardless of experience, will correctly identify same gun evidence.”
The "NAS Report" was the work of a large committee of scientists, forensic-science practitioners, lawyers, and others assembled by the National Academy of Sciences to recommend improvements in forensic science. A federal judge and a biostatistician co-chaired the committee. In 2009, four years after Congress funded the project, the report arrived. It emphasized the need to measure the error probabilities in pattern-matching tasks and discussed what statisticians call two-by-two contingency tables for estimating the sensitivity (true-positive probability) and specificity (true-negative probability) of the classifications. However, the Miami-Dade study was not designed to measure these quantities. To understand what it did measure, let's look at some of the details in the report to NIJ as well as what the court gleaned from the report (directly or indirectly).
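The two-by-two layout the committee had in mind is simple to set out. In the sketch below, the counts are entirely hypothetical; they only show how sensitivity and specificity would be estimated from a study that presented examiners with known same-source and known different-source pairs.

```python
# A hypothetical 2x2 contingency table of the kind the NAS report discusses:
# rows are ground truth (same source / different source); columns are the
# examiner's call. All counts are invented for illustration.
same_source      = {"identification": 190, "elimination": 10}   # 200 same-source pairs
different_source = {"identification": 5,   "elimination": 195}  # 200 different-source pairs

# Sensitivity (true-positive probability): P(identification | same source)
sensitivity = same_source["identification"] / sum(same_source.values())

# Specificity (true-negative probability): P(elimination | different source)
specificity = different_source["elimination"] / sum(different_source.values())

print(f"sensitivity = {sensitivity:.3f}")  # sensitivity = 0.950
print(f"specificity = {specificity:.3f}")  # specificity = 0.975
```

A design that does not pair each questioned item with known same-source and different-source comparisons cannot fill in this table, which is why the Miami-Dade design, described next, does not yield these quantities.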

A Blind Study?

The study was not blind in the sense of the subjects not realizing that they were being tested. They surely knew that they were not performing normal casework when they received the unusual samples and the special questionnaire with the heading "Answer Sheet: Consecutively Rifled EBIS-2 Test Set" asking such questions as "Is your Laboratory ASCLD/Lab Accredited?" That is not a fatal flaw, but it has some bearing -- not recognized in the report's sections on "external validity" -- on generalizing from the experimental findings to case work. 4/

Volunteer Subjects?

The "150 blind examination kits" somehow went to 201 examiners, not just in the United States, but also in "4 international countries." 5/ The researchers did not consider or reveal the performance of 36 "participants [who] did not meet the two year training requirement for this study." (P. 26). How well they did in comparison to their more experienced colleagues would have been worth knowing, although it would have been hard to draw clear conclusions since there were so few errors on the test. In any event, ignoring the responses from the trainees "resulted in a data-producing sample of 165 participants." (P. 26).

These research subjects came from emails sent to "the membership list for the Association of Firearm and Tool Mark Examiners (AFTE)." (Pp. 15-16). AFTE members all "derive[] a substantial portion of [their] livelihood from the examination, identification, and evaluation of firearms and related materials and/or tool marks." (P. 15). Only 35 of the 165 volunteers were certified by AFTE (P. 30), and 20 worked at unaccredited laboratories (P. 31).

What Error Rates?

Nine of the 165 fully trained subjects (5%) made errors (treating "inconclusive" as a correct response). The usual error rates (false positives and false negatives) are not reported because of the design of the "blind examination kits." The obvious way to obtain those error rates is to ask each subject to evaluate pairs of items -- some from the same source and some from different sources (with the examiners blinded to the true source information known to the researchers). Despite the desire to respond to the NAS report, the Miami Dade Police Department Laboratory did not make "kits" consisting of such a mixture of pairs of same-source and different-source bullets.

Instead, the researchers gave each subject a single collection of ten bullets produced by firing one manufacturer's ammunition in eight of the ten barrels. (Two of these "questioned bullets," as I will call them, came from barrel 3 and two from barrel 9; none came from barrel 4.) Along with the ten questioned bullets, they gave the subjects eight pairs of what we can call "exemplar bullets." Each pair of exemplar bullets came from two test fires of the same eight of the ten consecutively manufactured barrels (barrels 1-3 and 5-9). The task was to associate each questioned bullet with an exemplar pair or to decide that it could not be associated with any of the eight pairs. Or, the research subjects could circle "inconclusive" on the questionnaire. Notice that almost all the questioned bullets came from the barrels that produced the exemplar bullets -- only two such barrels were not a source of an unknown -- and only one barrel that produced a questioned bullet was not represented in the exemplar set.

This complicated and unbalanced design raises several questions. After associating an unknown bullet with an exemplar pair, will an examiner seriously consider the other exemplar pairs? After eliminating a questioned bullet as originating from, say, seven exemplar-pair barrels, would he be inclined to pick one of the remaining three? Because of the extreme overlap in the sets, such strategies would pay off on average. Such interactions could make false eliminations less probable, and true associations more probable, than with the simpler design of a series of single questioned-to-source comparisons.

The report to NIJ does not indicate that the subjects received any instructions to prevent them from having an expectation that most of the questioned bullets would match some pair of exemplar bullets. The only instructions it mentions are on a questionnaire that reads:
Please microscopically compare the known test shots from each of the 8 barrels with the 10 questioned bullets submitted. Indicate your conclusion(s) by circling the appropriate known test fired set number designator on the same line as the alpha unknown bullet. You also have the option of Inconclusive and Elimination. ...
Yet, the report confidently asserts that "[t]he researchers utilized an 'open set' design where the participants had no expectation that all unknown tool marks should match one or more of the unknowns." (P. 28).

To be sure, the study has some value in demonstrating that a subset of the subjects could perform a presumably difficult task of associating unknown bullets with exemplar ones. Moreover, whatever one thinks of this alleged proof of "uniqueness," the results imply that there are microscopic (or other) features of marks on bullets that vary with the barrel through which they traveled. But the study does not supply a good measure of examiner skill at making associations in fully "open" situations.

A 0.007% Error Rate?

As noted above, but not in the court's opinion, 5% of the examiners made some kind of error. That said, there were only 12 false-positive associations or false-negative ones (outright eliminations) out of 165 x 10 = 1,650 answers. (I am assuming that every subject completed the questionnaire for every unknown bullet.) That is an overall error proportion of 12/1650 = 0.007 = 0.7%.

The researchers computed the error rate slightly differently. They reported only the average error rate for the 165 experienced examiners. The vast majority (156) made no errors. Six made one error, and three made two. So the average examiner's proportion of errors was [156(0) + 6(0.1) + 3(0.2)]/165 = 0.007. No difference at all.
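The agreement between the overall proportion and the examiner average can be checked directly. A minimal sketch, using only the counts reported above (156 examiners with no errors, six with one, and three with two):

```python
# Reproducing the two error-rate calculations from the reported counts.
n_examiners = 165
n_questions = 10

# Per-examiner error counts: 156 with 0 errors, 6 with 1, 3 with 2.
error_counts = [0] * 156 + [1] * 6 + [2] * 3

total_errors = sum(error_counts)                      # 12 errors in all
overall = total_errors / (n_examiners * n_questions)  # 12/1650

# Average of the per-examiner error proportions.
average = sum(e / n_questions for e in error_counts) / n_examiners

print(round(overall, 4), round(average, 4))  # 0.0073 0.0073
```

Because every examiner faced the same ten questions, the average of the per-examiner proportions necessarily equals the overall proportion.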

This 0.007 figure is 100 times the number the court gave. Perhaps the opinion had a typographical error -- an adscititious percentage sign that the court missed when it reissued its opinion (to correct other typographical errors). The error rate is still small and would not affect the court's reasoning.

But the overall proportion of errors and the average-examiner error rate could diverge. The report gives the error proportions for the 9 examiners who made errors as 0.1 (6 of the examiners) and 0.2 (another 3 examiners). Apparently, all of the 9 erring examiners evaluated all 10 unknowns. What about the other 156 examiners? Did all of them evaluate all 10? The worst-case scenario is that every one of the 156 error-free examiners answered only one question. That supplies only 156 correct answers. Add this number to the 12 incorrect answers, and we have an error proportion of 12/168 ≈ 0.07 = 7% -- a thousand times larger than the court's number.

However, this worst-case scenario did not occur. The funding report states that "[t]here were 1,496 correct answers, 12 incorrect answers and 142 inconclusive answers." (P. 15). The sum of these numbers of answers is 1,650. Did every examiner answer every question? Apparently so. For this 100% completion rate, the report's emphasis on the examiner average (which is never larger and often smaller than the overall error proportion) is a distinction without a difference.
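The completion-rate inference is simple arithmetic on the funding report's answer counts:

```python
# Did every examiner answer every question? The report's three answer
# categories should sum to 165 examiners x 10 questioned bullets.
correct, incorrect, inconclusive = 1496, 12, 142
total_answers = correct + incorrect + inconclusive
print(total_answers, total_answers == 165 * 10)  # 1650 True
```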

There is a further issue with the number itself. "Inconclusives" are not correct associations. If every examiner had come back with "inconclusive" for every questioned bullet, the researchers could hardly have reported the resulting zero error rate as validating bullet-matching. 6/ From the trial court's viewpoint, inconclusives just do not count. They do not produce testimony of false associations or of false eliminations. The sensible thing to do, in ascertaining error rates for Daubert purposes, is to toss out all "inconclusives."

Doing so here makes little difference. There were 142 inconclusive answers. (P. 15). If these were merely "not used to calculate the overall average error rates," as the report claims (p. 32), the overall error proportion was 12/(1650 - 142) = 12/1508 = 0.008 -- still very small (but still difficult to interpret in terms of the parameters of accuracy for two-by-two tables).

The report to NIJ discussed another finding that, at first blush, could be relevant to the evidence in this case: "Three of these 35 AFTE certified participants reported a total of four errors, resulting in an error rate of 0.011 for AFTE Certified participants." (P. 30). Counter-intuitively, this 1% average is larger than the reported average error rate of 0.007 for all the examiners.
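The report's figure for the certified subgroup can be reproduced without knowing how the four errors were split among the three erring examiners, since the average of the per-examiner proportions reduces to total errors over total answers:

```python
# AFTE-certified subgroup: 35 examiners, 4 errors in total.
# sum(errors_i / 10) / 35 == (total errors) / (10 * 35), regardless of
# how the 4 errors were distributed among the 3 erring examiners.
certified_avg = 4 / (10 * 35)
all_avg = 12 / (10 * 165)
print(round(certified_avg, 3), round(all_avg, 3))  # 0.011 0.007
```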

That the certified examiners did worse than the uncertified ones may be a fluke. The standard error in the estimate of the average-examiner error rate was 0.32 (p. 29), which indicates that, despite the observed difference in the sample data, the study does not reveal whether certified examiners generally do better or worse than uncertified ones. 7/

A Potential Error Rate?

Finally, the court's reference to "a potential error rate of less than 1.2%" deserves mention. The "potential error rate" is tricky. Potentially, the error rate of individual practitioners like the ones who volunteered for the study, with no verification step by another examiner, could be larger (or smaller). There is no sharp and certain line that can be drawn for the maximum possible error rate. (Except that it cannot exceed 100%.)

In this case, 1.2% is the upper limit of a two-sided confidence interval. The Miami Dade authors wrote:
A 95% confidence interval for the average error rate, based on the large sample distribution of the sample average error rate, is between 0.002 and 0.012. Using a confidence interval of 95%, the error rate is no more than 0.012, or 1.2%.
A 95% confidence interval means that if there had been a large number of volunteer studies just like this one, making random draws from an unchanging population of volunteer-examiners and having these examiners perform the same task in the same way, about 95% of the many resulting confidence intervals would encompass the true value for the entire population. But the hypothetical confidence intervals would vary from one experiment to the next. We have a statistical process -- a sort of butterfly net -- that is broad enough to capture the unknown butterfly in about 95% of our swipes. The weird thing is that with each swipe, the size and center of the net change. On the Miami Dade swipe, one end of the net stretched out to the average error rate of 1.2%.
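The report's interval can be reconstructed from the per-examiner proportions given earlier, assuming (as the report indicates) a normal approximation to the distribution of the sample average. This sketch also shows how the upper limit moves as the confidence coefficient changes:

```python
import math

# Per-examiner error proportions: 156 zeros, 6 at 0.1, 3 at 0.2.
rates = [0.0] * 156 + [0.1] * 6 + [0.2] * 3
n = len(rates)
mean = sum(rates) / n
sd = math.sqrt(sum((r - mean) ** 2 for r in rates) / (n - 1))
se = sd / math.sqrt(n)

for conf, z in [(0.90, 1.645), (0.95, 1.960), (0.99, 2.576)]:
    lo, hi = mean - z * se, mean + z * se
    print(f"{conf:.0%}: ({lo:.3f}, {hi:.3f})")
# The 95% line reproduces the report's interval: (0.002, 0.012).
```

The 90% interval's upper limit is lower and the 99% interval's is higher, which is the point made below about asking for different levels of "confidence."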

So the court was literally correct. There is "a potential error rate" of 1.2%. There is also a higher potential error rate that could be formulated -- just ask for 99% "confidence." Or lower -- try 90% confidence. And for every confidence interval that could be constructed by varying the confidence coefficient, there is the potential for the average error rate to exceed the upper limit. Such is the nature of a random variable. Randomness does not make the upper end of the estimate implausible. It just means that it is not "the potential error rate," but rather a clue to how large the actual rate of error for repeated experiments could be.

Contrary to the suggestion in Romero-Lobato, that statistic is not the "potential rate of error" mentioned in the Supreme Court's opinion in Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579 (1993). The opinion advises judges to "ordinarily ... consider the known or potential rate of error, see, e.g., United States v. Smith, 869 F. 2d 348, 353-354 (CA7 1989) (surveying studies of the error rate of spectrographic voice identification technique)." The idea is that along with the validity of an underlying theory, how well "a particular scientific technique" works in practice affects the admissibility of evidence generated with that technique. When the technique consists of comparing things like voice spectrograms, the rates at which the process yields correct results in experiments like the ones noted in Smith are known error rates. That is, they are known for the sample of comparisons in the experiment. (The value for all possible examiners' comparisons is never known.)

These experimentally determined error rates are also a "potential rate of error" for the technique as practiced in case work. The sentence in Daubert that speaks to "rate of error" continues by adding, as part of the error-rate issue, "the existence and maintenance of standards controlling the technique's operation, see United States v. Williams, 583 F. 2d 1194, 1198 (CA2 1978) (noting professional organization's standard governing spectrographic analysis)." The experimental testing of the technique shows that it can work -- potentially; controlling standards ensure that it will be applied consistently and appropriately to achieve this known potential. Thus, Daubert's reference to "potential" rates does not translate into a command to regard the upper confidence limit (which merely accounts for sampling error in the experiment) as a potential error rate for practical use.

  1. No. 3:18-cr-00049-LRH-CBC, 2019 WL 2150938 (D. Nev. May 16, 2019).
  2. That is my impression anyway. The court cites the study as Thomas G. Fadul, Jr., et al., An Empirical Study to Improve the Scientific Foundation of Forensic Firearm and Tool Mark Identification Utilizing Consecutively Manufactured Glock EBIS Barrels with the Same EBIS Pattern (2013). The references in Ronald Nichols, Firearm and Toolmark Identification: The Scientific Reliability of the Forensic Science Discipline 133 (2018) (London: Academic Press), also do not indicate a subsequent publication.
  3. P. 3. The first of the two "research hypotheses" was that "[t]rained firearm and tool mark examiners will be able to correctly identify unknown bullets to the firearms that fired them when examining bullets fired through consecutively manufactured barrels with the same EBIS pattern utilizing individual, unique and repeatable striations." (P. 13). The phrase "individual, unique and repeatable striations" begs a question or two.
  4. The researchers were comforted by the thought that "[t]he external validity strength of this research project was that all testing was conducted in a crime laboratory setting." (P. 25). As secondary sources of external validity, they noted that "[p]articipants utilized a comparison microscope," "[t]he participants were trained firearm and tool mark examiners," "[t]he training and experience of the participants strengthened the external validity," and "[t]he number of participants exceeded the minimum sample size needed to be statistically significant." Id. Of course, it is not the "sample size" that is statistically significant, but only a statistic that summarizes an aspect of the data (other than the number of observations).
  5. P. 26 ("A total of 201 examiners representing 125 crime laboratories in 41 states, the District of Columbia, and 4 international countries completed the Consecutively Rifled EBIS-2 Test Set questionnaire/answer sheet.").
  6. Indeed, some observers might argue that an "inconclusive" when there is ample information to reach a conclusion is just wrong. In this context, however, that argument is not persuasive. Certainly, "inconclusives" can be missed opportunities that should be of concern to criminalists, but they are not outright false positives or false negatives.
  7. The opinion does not state whether the examiner in the case -- "Steven Johnson, a supervising criminalist in the Forensic Science Division of the Washoe County Sheriff's Office" -- is certified or not, but it holds that he is "competent to testify" as an expert.

Tuesday, June 11, 2019

Junk DNA (Literally) in Virginia

The Washington Post reported yesterday on a motion in Alexandria Circuit Court to suppress "all evidence flowing from the warrantless search of [Jesse Bjerke's] genetic profile." 1/ Mr. Bjerke is accused of raping a 24-year-old lifeguard at gunpoint at her home after following her from the Alexandria, Va., pool where she worked. She "could describe her attacker only as a thin man she believed was 35 to 40 years old and a little over 6 feet tall." 2/ Swabs taken by a nurse contained sperm from which the Virginia Department of Forensic Sciences obtained a standard STR profile.

Apparently, the STR profile was in neither the Virginia DNA database nor the national one (NDIS). So the police turned to the Virginia bioinformatics company Parabon Labs, which has had success with genetic genealogy searches of the publicly available genealogy database GEDmatch. Parabon reported that
[T]he subject DNA file shares DNA with cousins related to both sides of Jesse's family tree, and the ancestral origins of the subject are equivalent to those of Jesse. These genetic connections are very compelling evidence that the subject is Jesse. The fact that Jesse was residing in Alexandria, VA at the time of the crime in 2016 fits the eyewitness description and his traits are consistent with phenotype predictions, further strengthens the confidence of this conclusion.
Recognizing the inherent limitations in genetic genealogy, Parabon added that
Unfortunately, it is always possible that the subject is another male that is not identifiable through vital records or other research means and is potentially unknown to his biological family. This could be the result if an out-of-wedlock birth, a misattributed paternity, an adoption, or an anonymous abandonment.
The motion suggests that the latter paragraph, together with the firm's boiler-plate disclaimer of warranties and the fact that the report contains hearsay, means that police lacked even probable cause to believe that the sperm came from the defendant. This view of the information that the police received is implausible, but regardless of whether "the facts contained in the Parabon report do not support probable cause," 3/ the police did not use the information either to arrest Mr. Bjerke immediately or to seek a warrant to compel him to submit to DNA sampling. Instead,
Police began following Bjerke at his home and the hospital where he worked as a nurse. They took beer bottles, soda cans and an apple core from his trash. They tracked him to a Spanish restaurant ... and, after he left, bagged the straws he had used.

The DNA could not be eliminated as a match for the sperm from the rape scene, a forensic analysis found, leading to Bjerke’s indictment and arrest in February. With [a] warrant, law enforcement again compared his DNA with the semen at the crime scene. The result: a one in 7.2 billion chance it was not his. 4/
A more precise description of the "one in 7.2 billion chance" is that if Mr. Bjerke is not the source, then an arbitrarily selected unrelated man would have that tiny a chance of having the matching STR profile. The probability of the STR match given the hypothesis that another man is the source is not necessarily the same as the probability that another man is the source given the match. But given a prior probability reflecting the other evidence so far revealed about Mr. Bjerke, there would not be much difference between the conditional probability the laboratory supplied and the article's transposed one.
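Why the transposition makes little practical difference here can be sketched with Bayes' rule: posterior odds equal prior odds times the likelihood ratio. For illustration only, treat the reported 1-in-7.2-billion figure as a random-match probability, so the likelihood ratio is roughly 7.2 billion (a simplification that ignores relatives and laboratory error):

```python
# Posterior source probability from a prior probability and a
# likelihood ratio (posterior odds = prior odds x LR).
def posterior_prob(prior, lr):
    odds = (prior / (1 - prior)) * lr
    return odds / (1 + odds)

lr = 7.2e9  # illustrative LR from the reported random-match probability
for prior in (0.5, 0.01, 1e-6):
    print(prior, posterior_prob(prior, lr))
```

Even a highly skeptical prior of one in a million yields a posterior probability above 0.999, which is why the transposed statement, though technically incorrect, is not far off in this case.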

Faced with such compelling evidence, Mr. Bjerke wants it excluded at trial. The motion states that
For the purposes of this motion, there are three categories of DNA testing. (1) DNA testing conducted before Jesse Bjerke was a suspect in the case; (2) DNA testing conducted without a warrant after Jesse Bjerke became a suspect in the case; and (3) DNA testing conducted with a warrant after Jesse Bjerke's arrest. This motion seeks to suppress all DNA evidence in categories two and three that relate to Jesse Bjerke.
An obstacle is the many cases -- not mentioned in the motion -- holding that shed or "abandoned" DNA is subject to warrantless collection and analysis for identifying features on the theory that the procedure is not a "search" under the Fourth Amendment. The laboratory analysis is not an invasion of Mr. Bjerke's reasonable expectation of privacy -- at least, not if we focus solely on categories (2) and (3), as the motion urges. This standard STR typing was done after the genetic genealogy investigation was completed. The STR profile (which the motion calls a "genetic profile" even though it does not characterize any genes) provides limited information about an individual. For that reason, the conclusion of the majority of courts that testing shed DNA is not a search is supportable, though not ineluctable. ("Limited" does not mean "zero.")

Indeed, most laboratory tests on or for traces from crimes are not treated as searches covered by the warrant and probable-cause protections. Is it a search to have the forensic lab analyze a fingerprint from a glass left at a restaurant? Suppose a defendant tosses a coat in a garbage bin on the street, and the police retrieve it, remove glass particles, and analyze their chemical composition to see whether they match the glass from a broken window in a burglary. Did they need a warrant to study the glass particles?

The underlying issue is how much the Constitution constrains the police in using trace evidence that might associate a known suspect with a crime scene or victim. When the analysis reveals little or nothing more than the fact of the association, I do not see much of an argument for requiring a warrant. That said, there is a little additional information in the usual STR profile, so there is some room for debate here.

However, this case might be even more debatable (although the defense motion does not seem to recognize it) because of category (1) -- the genetic genealogy phase of the case. The police, or rather the firm they hired to derive a genome-wide scan for the genetic genealogy, have much more information about Mr. Bjerke at their disposal. They have on the order of a million SNPs. In theory, Parabon or the police could inspect the SNP data for medical or other sensitive information on Mr. Bjerke now that he has been identified as the probable source of those sperm.

Nevertheless, I do not know why the police or the lab would want to do this, and it has always been true that once a physical DNA sample is in the possession of the police, the possibility exists for medical genetic testing using completely different loci. Testing shed DNA in that way should be considered a search. Bjerke is a step in that direction, but are we there yet?

The Post's online story has 21 comments on it. Not one supported the idea that there was a significant invasion of privacy in the investigation. These comments are a decidedly small sample that does not represent any clear population, but the complete lack of support for the argument that genetic genealogy implicates important personal privacy was striking.

  1. Defendant's Motion to Suppress, Commonwealth v. Bjerke, No. CF19000031 (Cir. Ct., Alexandria, Va. May 20, 2019).
  2. Rachel Weiner, Alexandria Rape Suspect Challenging DNA Search Used to Crack Case, Wash. Post, June 10, 2019, at 1:16 PM.
  3. Defendant's Motion, supra note 1.
  4. Weiner, supra note 2.
  • Thanks to Rachel Weiner for alerting me to the case and providing a copy of the defendant's motion.

Friday, June 7, 2019

Aleatory and Epistemic Uncertainty

An article in the Royal Society's Open Science journal on "communicating uncertainty about facts, numbers and science" is noteworthy for the sheer breadth of the fields it surveys and its effort to devise a taxonomy of uncertainty for the purpose of communicating its nature or degree. The article distinguishes between "aleatory" and "epistemic" uncertainty:

[A] large literature has focused on what is frequently termed 'aleatory uncertainty' due to the fundamental indeterminacy or randomness in the world, often couched in terms of luck or chance. This generally relates to future events, which we can't know for certain. This form of uncertainty is an essential part of the assessment, communication and management of both quantifiable and unquantifiable future risks, and prominent examples include uncertain economic forecasts, climate change models and actuarial survival curves.

By contrast, our focus in this paper is uncertainties about facts, numbers and science due to limited knowledge or ignorance—so-called epistemic uncertainty. Epistemic uncertainty generally, but not always, concerns past or present phenomena that we currently don't know but could, at least in theory, know or establish.

The distinction is of interest to philosophers, psychologists, economists, and statisticians. But it is a little hard to pin down with the definition in the article. Aleatory uncertainty applies on the quantum mechanical level, but is it true that "in theory" predictions like weather and life span cannot be certain? Chaos theory shows that the lack of perfect knowledge about initial conditions of nonlinear systems makes long-term predictions very uncertain, but is it theoretically impossible to have perfect knowledge? The card drawn from a well-shuffled deck is a matter of luck, but if we knew enough about the shuffle, couldn't we know the card that is drawn? Thus, I am not so sure that the distinction is between (1) "fundamental ... randomness in the world" and (2) ignorance that could be remedied "in theory."

Could the distinction be between (1) instances of a phenomenon that has variable outcomes at the level of our existing knowledge of the world and (2) a single instance of a phenomenon that we do not regard as the outcome of a random process or that already has occurred, so that the randomness is gone? The next outcome of rolling a die (an alea in Latin) is always uncertain (unless I change the experimental setup to precisely fix the conditions of the roll), 1/ but whether the last roll produced a 1 is only uncertain to the extent that I cannot trust my vision or memory. I could reduce the latter, epistemic uncertainty by improving my system of making observations. For example, I could have several keen and truthful observers watch the toss, or I could film it and study the recording thoroughly. From this perspective, the frequency and propensity conceptions of probability concern aleatory uncertainty, and the subjective and logical conceptions traffic in both aleatory and epistemic uncertainty.

When it comes to the courtroom, epistemic uncertainty is usually in the forefront, and I may get to that example at a later date. For now, I'll just note that, regardless of whether the distinction offered above between aleatory and epistemic uncertainty is philosophically rigorous, people's attitudes toward aleatory and epistemic risk defined in this way do seem to be somewhat different. 2/

  1. Cf. P. Diaconis, S. Holmes & R. Montgomery, Dynamical Bias in the Coin Toss, 49(2) SIAM Rev. 211-235 (2007).
  2. Gülden Ülkümen, Craig R. Fox & B.F. Malle, Two Dimensions of Subjective Uncertainty: Clues from Natural Language, 145(10) Journal of Experimental Psychology: General 1280-1297; Craig R. Fox & Gülden Ülkümen, Distinguishing Two Dimensions of Uncertainty, in Perspectives on Thinking, Judging, and Decision Making (W. Brun, G. Keren, G. Kirkebøen & H. Montgomery eds. 2011).