Wednesday, August 26, 2020

Terminology Department: Defining Bias for Nonstatisticians

The Organization of Scientific Area Committees for Forensic Science (OSAC) is trying to develop definitions of common technical terms that can be used across most forensic-science subject areas. "Bias" is one of these ubiquitous terms, but its statistical meaning does not conform to the usual dictionary definitions, such as  "an inclination of temperament or outlook, especially: a personal and sometimes unreasoned judgment" \1/ or "the action of supporting or opposing a particular person or thing in an unfair way, because of allowing personal opinions to influence your judgment." \2/ 

I thought the following definition might be useful for forensic-science practitioners:

A systematic tendency for estimates or measurements to be above or below their true values. A study is said to be biased if its design is such that it systematically favors certain outcomes. An estimator of a population parameter is biased when the average value of the estimates (from an infinite number of samples) would not equal the value of the parameter. Bias arises from systematic as opposed to random error in the collection of units to be measured, the measurement of the units, or the process for estimating quantities based on the measurements.
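
To make the "average value of the estimates" idea concrete, here is a minimal Python simulation (my own illustration, not part of any quoted definition): the divide-by-n variance estimator is biased downward, while the divide-by-(n - 1) version is unbiased in the sense that its errors average out over repeated samples.

    # Minimal simulation: compare a biased and an unbiased estimator of the
    # population variance by averaging each over many repeated samples.
    import random
    import statistics

    random.seed(1)
    TRUE_SD = 2.0
    TRUE_VAR = TRUE_SD ** 2            # true parameter value = 4.0
    N, REPS = 5, 100_000               # small samples, many repetitions

    biased, unbiased = [], []
    for _ in range(REPS):
        sample = [random.gauss(10.0, TRUE_SD) for _ in range(N)]
        m = sum(sample) / N
        ss = sum((x - m) ** 2 for x in sample)
        biased.append(ss / N)          # divide by n: systematically too low
        unbiased.append(ss / (N - 1))  # divide by n - 1: errors average to zero

    print(statistics.mean(biased))     # about 3.2, below the true value of 4.0
    print(statistics.mean(unbiased))   # about 4.0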

It ties together some of the simplest definitions I have seen in textbooks and reference works on statistics -- namely:

Yadolah Dodge, The Concise Encyclopedia of Statistics 41 (2008): From a statistical point of view, the bias is defined as the difference between the expected value of a statistic and the true value of the corresponding parameter. Therefore, the bias is a measure of the systematic error of an estimator. If we calculate the mean of a large number of unbiased estimations, we will find the correct value. The bias indicates the distance of the estimator from the true value of the parameter. Comment: This is the definition for mathematical statistics.
B. S. Everitt & A. Skrondal, The Cambridge Dictionary of Statistics 45 (4th ed. 2010) (citing Altman, D.G. (1991) Practical Statistics for Medical Research, Chapman and Hall, London): In general terms, deviation of results or inferences from the truth, or processes leading to such deviation. More specifically, the extent to which the statistical method used in a study does not estimate the quantity thought to be estimated, or does not test the hypothesis to be tested. In estimation usually measured by the difference between the expected value of an estimator and the true value of the parameter. An estimator for which E(θ-hat) = θ is said to be unbiased. See also ascertainment bias, recall bias, selection bias and biased estimator. Comment: The general definition (first sentence) fails to differentiate between random and systematic deviations. The “more specific” definition in the next sentence is limited to the definition in mathematical statistics.
David H. Kaye & David A. Freedman, Reference Guide on Statistics, in Reference Manual on Scientific Evidence 283 (Federal Judicial Center & Nat’l Research Council eds., 3d ed. 2011): Also called systematic error. A systematic tendency for an estimate to be too high or too low. An estimate is unbiased if the bias is zero. (Bias does not mean prejudice, partiality, or discriminatory intent.) See nonsampling error. Comment: This one is intended to convey the essential idea to judges.
David H. Kaye, Frequentist Methods for Statistical Inference, in Handbook of Forensic Statistics 39, 44 (D. Banks, K. Kafadar, D. Kaye & M. Tackett eds. 2020): [A]n unbiased estimator t of [a parameter] θ will give estimates whose errors eventually should average out to zero. Error is simply the difference between the estimate and the true value. For an unbiased estimator, the expected value of the errors is E(tθ) = 0. Comment: Yet another version of the definition of an unbiased estimator of a population or model parameter.
JCGM, International Vocabulary of Metrology – Basic and General Concepts and Associated Terms (VIM) (3d ed. 2012): measurement bias, bias -- estimate of a systematic measurement error Comment: The VIM misdefines bias as an estimate of bias.
David S. Moore & George P. McCabe, Introduction to the Practice of Statistics 232 (2d ed. 1993): Bias. The design of a study is biased if it systematically favors certain outcomes. In a causal study, bias can result from confounding. Or can it?

NOTES

  1. Merriam-Webster Dictionary (online).
  2. Cambridge Dictionary (online).

Saturday, August 22, 2020

Phrases for the Value or Weight of Evidence

A few statisticians asked me (independently) about the usage of the terms evidential value, evidentiary value, and probative value. For years, I thought the phrases all meant the same thing, but that is not true in some fields.

Evidential Value

Black’s Law Dictionary (which tends to have aged definitions) has this definition of evidential value: “Value of records given as or in support of evidence, based on the certainty of the records origins. The value here is not in the record content. This certainty is essential for authentic and adequate evidence of an entity’s actions, functioning, policies, and/or structure.”

Under this definition, "evidential value" pertains to a document's value merely as the container of information. The definition distinguishes between the provenance and authenticity of a document -- where did it come from and has it been altered? -- and the content of the document -- what statements or information does it contain? Likewise, archivists distinguish between "evidential value" and "informational value." The former, according to the Society of American Archivists, "relates to the process of creation rather than the content (informational value) of the records."

Evidentiary Value

Lawyers use the phrases "evidentiary value" and "probative value" (or "probative force") as synonyms. For example, a 1932 note in the University of Pennsylvania Law Review on "Evidentiary Value of Finger-Prints" predicted that "the time is not far distant when courts must scrutinize and properly evaluate the probative force to be given to evidence that finger-prints found on the scene correspond with those of the accused." \1/

Forensic scientists use "evidentiary value" to denote the utility of examining objects for information on whether the objects have a common origin. A 2009 report of a committee of the National Academies complained that there was no standard threshold for deciding when bitemarks have "reached a threshold of evidentiary value." \2/ More generally, the phrase can denote the value of any expert analysis of effects as proof of the possible cause of those effects. \3/

Evidential Value

Unlike archivists, forensic scientists use the phrase “evidential value” interchangeably with "evidentiary value." It appears routinely in titles of articles and books such as "Evidential Value of Multivariate Physicochemical Data," \4/ "Enhancing the Evidential Value of Fingermarks Through Successful DNA Typing," \5/ and "Establishing the Evidential Value of a Small Quantity of Material Found at a Crime Scene." \6/

Probative Value

Lawyers use "probative value" to denote the degree to which an item of evidence proves the proposition it is offered to prove. Credible evidence that a defendant threatened to kill the deceased, whose death was caused by a poison, is probative of whether the death was accidental and whether defendant was the killer. With circumstantial evidence like this, various probability-based formulations have been proposed to express probative value quantitatively. \7/ One of the simplest is the likelihood ratio or Bayes factor (BF) favored by most forensic statisticians. \8/ Its logarithm has qualities that argue for using log(BF) to express the "weight" of an item of evidence. \9/

The rules of evidence require judges to exclude evidence when unfair prejudice, distraction, and undue consumption of time in presenting the evidence substantially outweigh the probative value of the evidence. \10/ In theory, judges do not exclude evidence just because they do not believe that the witness is telling the truth. The jury will take credibility into account in deciding the case. However, in ensuring that there is sufficient probative value to bother with the evidence, judges can hardly avoid being influenced by the trustworthiness of the source of the information. Moreover, the importance of the fact that the proposed testimony addresses and the availability of alternative, less prejudicial proof also can influence the decision to exclude evidence that is probative of a material fact. \11/

NOTES

  1. Note, Evidentiary Value of Finger-Prints, 80 U. Penn. L. Rev. 887 (1932).
  2. Comm. on Identifying the Needs of the Forensic Sci. Cmty., Nat'l Research Council, Strengthening Forensic Science in the United States: A Path Forward 176 (2009).
  3. Nicholas Dempsey & Soren Blau, Evaluating the Evidentiary Value of the Analysis of Skeletal Trauma in Forensic Research: A Review of Research and Practice, 307 Forensic Sci. Int'l (2020), https://doi.org/10.1016/j.forsciint.2020.110140. Still another usage of the term occurs in epistemology. See P. Gärdenfors, B. Hansson, N-E. Sahlin, Evidentiary Value: Philosophical, Judicial and Psychological Aspects of a Theory (1983); Dennis V. Lindley, Review, 35(3) Brit. J. Phil. Sci. 293-296 (1984) (criticizing this theory).
  4. Grzegorz Zadora, Agnieszka Martyna, Daniel Ramos & Colin Aitken, Statistical Analysis in Forensic Science: Evidential Value of Multivariate Physicochemical Data (2014).
  5. Zuhaib Subhani, Barbara Daniel & Nunzianda Frascione, DNA Profiles from Fingerprint Lifts—Enhancing the Evidential Value of Fingermarks Through Successful DNA Typing, 64(1) J. Forensic Sci. 201–06 (2019), https://doi.org/10.1111/1556-4029.13830.
  6. I.W. Evett, Establishing the Evidential Value of a Small Quantity of Material Found at a Crime Scene, 33(2) J. Forensic Sci. Soc’y 83-86 (1993).
  7. 1 McCormick on Evidence § 185 (R. Mosteller ed., 8th ed. 2020); David H. Kaye, Review-essay, Digging into the Foundations of Evidence Law, 116 Mich. L. Rev. 915-34 (2017), http://ssrn.com/abstract=2903618.
  8. See Anuradha Akmeemana, Peter Weis, Ruthmara Corzo, Daniel Ramos, Peter Zoon, Tatiana Trejos, Troy Ernst, Chip Pollock, Ela Bakowska, Cedric Neumann & Jose Almirall, Interpretation of Chemical Data from Glass Analysis for Forensic Purposes, J. Chemometrics (2020), DOI:10.1002/cem.3267.
  9. I. J. Good, Weight of Evidence: A Brief Survey, in 2 Bayesian Statistics 249–270 (J. M. Bernardo, M. H. DeGroot, D. V. Lindley & A. F. M. Smith eds., 1985); Irving John Good, Weight of Evidence and the Bayesian Likelihood Ratio, in The Use of Statistics in Forensic Science 85–106 (C. G. G. Aitken & David A. Stoney eds., 1991).
  10. Fed. R. Evid. 403; Unif. R. Evid. 403; 1 McCormick, supra note 7, § 185.
  11. 1 McCormick, supra note 7, § 185.

Tuesday, August 18, 2020

"Quite High" Accuracy for Firearms-mark Comparisons

Court challenges to the validity of forensic identification of the gun that fired a bullet based on toolmark comparisons have increased since the President's Council of Advisors on Science and Technology (PCAST) issued a report in late 2016 stressing the limitations in the scientific research on the subject. A study from the Netherlands preprinted in 2019 adds to the research literature. The abstract reads (in part):

Forensic firearm examiners compare the features in cartridge cases to provide a judgment addressing the question about their source: do they originate from one and the same or from two different firearms? In this article, the validity and reliability of these judgments is studied and compared to the outcomes of a computer-based method. The ... true positive rates (sensitivity) and the true negative rates (specificity) of firearm examiners are quite high. ... The examiners are overconfident, giving judgments of evidential strength that are too high. The judgments of the examiners and the outcomes of the computer-based method are only moderately correlated. We suggest to implement performance feedback to reduce overconfidence, to improve the calibration of degree of support judgments, and to study the possibility of combining the judgments of examiners and the outcomes of computer-based methods to increase the overall validity.

Erwin J.A.T. Mattijssen, Cilia L.M. Witteman, Charles E.H. Berger, Nicolaas W. Brand & Reinoud D. Stoel, Validity and Reliability of Forensic Firearm Examiners, 307 Forensic Sci. Int’l 110112 (2020).

Despite the characterization of examiner sensitivity and specificity as "quite high," the observed specificity was only 0.89, which corresponds to a false-positive rate of 11% -- much higher than the <2% estimate quoted in recent judicial opinions. But the false-positive proportions from different experiments are not as discordant as they might appear to be when naively juxtaposed. To appreciate the sensitivity and specificity reported in this experiment, we need to understand the way that the validity test was constructed.

Design of the Study

The researchers fired two cartridges from each of two hundred 9 mm Luger Glock pistols seized in the Netherlands. These 400 test firings gave rise to true (same-source) and false (different-source) pairings of two-dimensional comparison images of the striation patterns on cartridge cases. Specifically, there were 400 cartridge cases from which the researchers made "measurements of the striations of the firing pin aperture shear marks" and prepared "digital images [of magnifications of] the striation patterns using oblique lighting, optimized to show as many of the striations as possible while avoiding overexposure." (They also produced three-dimensional data, but I won't discuss those here.)

They invited forensic firearm examiners from Europe, North America, South America, Asia, and Oceania by e-mail to examine the images. Of the recipients, 112 participated, but only 77 completed the online questionnaire, which presented 60 pairs of striation-pattern images aligned side by side. (The 400 images gave rise to (400×399)/2 distinct pairs of images, of which only 200 were same-source pairs. The researchers could hardly ask the volunteers to study all 79,800 of these pairs, so they used a computer program for matching such patterns to select 60 pairs that seemed to cover "the full range of comparison difficulty" but that overrepresented "difficult" pairs -- an important choice that we'll return to soon. Of the 60, 38 were same-source pairs, and 22 were different-source pairs.)
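
As a quick check on the pair counts in the preceding paragraph (my arithmetic, following the description above):

    from math import comb

    total_pairs = comb(400, 2)      # all distinct pairs of the 400 cartridge cases
    same_source = 200               # one same-source pair per pistol (2 cases each)
    diff_source = total_pairs - same_source

    print(total_pairs, same_source, diff_source)   # 79800 200 79600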

The examiners first evaluated the degree of similarity on a five-point scale. Then they were shown the 60 pairs again and asked (1) whether the comparison provides support for the proposition that the cartridge cases were fired with one (same-source) or with two (different-source) Glock pistols; (2) for their sense of the degree of support for that conclusion; and (3) whether they would have reported the comparison as inconclusive in casework.

The degree of support was reported or placed on a six-point scale of "weak support" (likelihood ratio L = 2 to 10), "moderate support" (L = 10 to 100), "moderately strong support" (L = 100 to 1,000), "strong support" (L = 1,000 to 10,000), "very strong support" (L = 10,000 to 1,000,000), and "extremely strong support" (L > 1,000,000). The computerized system mentioned above also generated numerical likelihood ratios. (The proximity of the ratio to 1 was taken as a measure of difficulty.)
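
A short Python sketch of how a numerical likelihood ratio maps onto that verbal scale (the cut-points are those quoted above; the handling of ratios at or below the scale's lower boundary is my own simplification):

    def verbal_scale(L: float) -> str:
        """Map a likelihood ratio to the article's six-point verbal scale."""
        bins = [(10, "weak support"),
                (100, "moderate support"),
                (1_000, "moderately strong support"),
                (10_000, "strong support"),
                (1_000_000, "very strong support")]
        for upper, label in bins:
            if L <= upper:
                return label
        return "extremely strong support"

    print(verbal_scale(250))          # moderately strong support
    print(verbal_scale(5_000_000))    # extremely strong support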

A Few of the Findings

For the 60 two-dimensional comparisons, the computer program and the human examiners performed as follows:

Table 1. Results for (computer | examiner) excluding pairs deemed inconclusive by examiners.

            SS pair        DS pair
SS outcome  (36 | 2365)    (10 | 95)
DS outcome  (2 | 74)       (12 | 784)

Validity:
sens = (.95 | .97)    FNP = (.05 | .03)
spec = (.55 | .89)    FPP = (.45 | .11)

Abbreviations: SS = same source; DS = different source;
sens = sensitivity; spec = specificity;
FNP = false negative proportion; FPP = false positive proportion

Table 1 combines two of the tables in the article. The entry "36 | 2365," for example, denotes that the computer program correctly classified as same-source 36 of the 38 same-source pairs (95%), while the examinations of the 77 examiners correctly classified 2,365 pairs out of the 2,439 same-source comparisons (97%) that they did not consider inconclusive. The computer program did not have the option to avoid a conclusion (or rather a likelihood ratio) in borderline cases. When examiners' conclusions on the cases they would have called inconclusive in practice were added in, the sensitivity and specificity dropped to 0.93 and 0.81, respectively.
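
The examiner-side figures in Table 1 can be reproduced from the raw counts as follows (a sketch using the counts as I read them from the article, conclusive judgments only):

    # Counts of conclusive examiner judgments, pooled over the 77 examiners.
    ss_called_ss, ss_called_ds = 2365, 74    # same-source pairs
    ds_called_ss, ds_called_ds = 95, 784     # different-source pairs

    sens = ss_called_ss / (ss_called_ss + ss_called_ds)   # 0.97
    spec = ds_called_ds / (ds_called_ds + ds_called_ss)   # 0.89

    print(round(sens, 2), round(1 - sens, 2))   # sensitivity, false negative proportion
    print(round(spec, 2), round(1 - spec, 2))   # specificity, false positive proportion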

Making Sense of These Results

The reason there are more comparisons for the examiners is that there were 77 of them and only one computer program. The 77 examiners made 77 × 60 comparisons, while the computer program made only 1 × 60 comparisons on the 60 pairs. Those pairs, as I noted earlier, were not a representative sample. On all 79,800 possible pairings of the test-fired cartridge cases, the tireless computer program's sensitivity and specificity were both 0.99. If we can assume that the human examiners would have done at least as well as the computer program that they outperformed (on the 60 more or less "difficult" pairs), their performance for all possible pairs would have been excellent.

An "Error Rate" for Court?

The experiment gives a number for the false-positive "error rate" (misclassifications across all the 22 different-source pairs) of 11%. If we conceive of the examiners' judgments as a random sample from some hypothetical population of identically conducted experiments, then the true false-positive error probability could be somewhat higher (as emphasized in the PCAST report) or lower. How should such numbers be used in admissibility rulings under Daubert v. Merrell Dow Pharmaceuticals, Inc.? At trial, to give the jury a sense of the chance of a false-positive error (as PCAST also urged)?
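
To illustrate how much that sampling variability might amount to, here is a rough Python sketch that treats the pooled conclusive different-source judgments (95 false positives out of 879) as independent binomial trials. That independence assumption is questionable, since the same 77 examiners and the same 22 pairs are reused, so this is an illustration of the idea rather than an analysis of the study.

    from math import sqrt

    x, n = 95, 879                 # false positives among conclusive DS judgments
    p = x / n
    se = sqrt(p * (1 - p) / n)     # normal-approximation standard error
    lo, hi = p - 1.96 * se, p + 1.96 * se

    print(round(p, 3), round(lo, 3), round(hi, 3))   # 0.108 0.088 0.129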

For admissibility, Daubert referred (indirectly) to false-positive proportions in particular studies of "voice prints," although the more apt statistic would be a likelihood ratio for a positive classification. For the Netherlands study, that would be L+ = Pr(+|SS) / Pr(+|DS) ≈ 0.97/0.11 = 8.8. In words, it is almost nine times more probable that an examiner (like the ones in the study) will report a match when confronted with a randomly selected same-source pair of images than with a randomly selected different-source pair (from the set of 60 constructed for the experiment). That validates the examiners' general ability to distinguish between same- and different-source pairs at a modest level of accuracy in that sample.
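
The same two operating characteristics give a worked version of that calculation, along with the corresponding likelihood ratio for a reported exclusion (the exclusion figure is my addition, not one quoted in the study):

    sens, spec = 0.97, 0.89        # examiner sensitivity and specificity from Table 1

    L_pos = sens / (1 - spec)      # Pr(+ | SS) / Pr(+ | DS), about 8.8
    L_neg = (1 - sens) / spec      # Pr(- | SS) / Pr(- | DS), about 0.034

    print(round(L_pos, 1), round(L_neg, 3))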

But to what extent can or should this figure (or just the false-positive probability estimate of 0.11) be presented to a factfinder as a measure of the probative value of a declared match? In this regard, arguments arise over presenting an average figure in a specific case (although that seems like a common enough practice in statistics) and the realism of the experiment. The researchers warn that

For several reasons it is not possible to directly relate the true positive and true negative rates, and the false positive and false negative rates of the examiners in this study to actual casework. One of these reasons is that the 60 comparisons we used were selected to over-represent ‘difficult’ comparisons. In addition, the use of the online questionnaire did not enable the examiners to manually compare the features of the cartridge cases as they would normally do in casework. They could not include in their considerations the features of other firearm components, and their results and conclusions were not peer reviewed. Enabling examiners to follow their standard operating procedures could result in better performance.

There Is More

Other facets of the paper also make it recommended reading. Data on the reliability of conclusions (both within and across examiners) are presented, and an analysis of the calibration of the examiners' judgments of how strongly the images supported their source attributions led the authors to remark that

When we look at the actual proportion of misleading choices, the examiners judged lower relative frequencies of occurrence (and thus more extreme LRs) than expected if their judgments would have been well-calibrated. This can be seen as overconfidence, where examiners provide unwarranted support for either same-source or different-source propositions, resulting in LRs that are too high or too low, respectively. ... Simply warning examiners about overconfidence or asking them to explain their judgments does not necessarily decrease overconfidence of judgments.

Monday, August 10, 2020

Applying the Justice Department's Policy on a Reasonable Degree of Certainty in United States v. Hunt

In United States v. Hunt, \1/ Senior U.S. District Court Judge David Russell disposed of a challenge to proposed firearms-toolmark testimony. The first part of the unpublished opinion dealt with the scientific validity (as described in Daubert v. Merrell Dow Pharmaceuticals, Inc.) of the Association of Firearm and Toolmark Examiners (AFTE) "Theory of Identification As It Relates to Toolmarks." Mostly, this portion of the opinion is of the form: "Other courts have accepted the government's arguments. We agree." This kind of opinion is common for forensic-science methods that have a long history of judicial acceptance--whether or not such acceptance is deserved.

The unusual part of the opinion comes at the end. There, the court misconstrues the Department of Justice's internal policy on the use of the phrase "reasonable certainty" to characterize an expert conclusion for associating spent ammunition with a gun that might have fired it. This posting describes some of the history of that policy and suggests that (1) the court may have unwittingly rejected it; (2) the court's order prevents the experts from expressing the same AFTE theory that the court deemed scientifically valid; and (3) the government can adhere to its written policy on avoiding various expressions of "reasonable certainty" and still try the case consistently with the judge's order.

I. The Proposed Testimony

Dominic Hunt was charged with being a felon in possession of ammunition recovered from two shootings. The government proposed to use two firearm and toolmark examiners--Ronald Jones of the Oklahoma City Police Department and Howard Kong of the Bureau of Alcohol, Tobacco, Firearms and Explosives' (ATF) Forensic Science Laboratory--to establish that the ammunition was not fired from the defendant's brother's pistol--or his cousin's pistol. To eliminate those hypotheses, "the Government intend[ed] its experts to testify" that "the unknown firearm was likely a Smith & Wesson 9mm Luger caliber pistol," and that "the probability that the ammunition ... were fired in different firearms is so small it is negligible."

This testimony differs from the usual opinion testimony that ammunition components recovered from the scene of a shooting came from a specific, known gun associated with a defendant. It appears that the "unknown" Luger pistol was never discovered and thus that the examiners could not use it to fire test bullets for comparison purposes. Their opinion was that several of the shell casings had marks and other features that were so similar that they must have come from the same gun of the type they specified.

But the reasoning process the examiners used to arrive at this conclusion--which postulates "class," "subclass," and conveniently designated "individual" characteristics--is the same as the one used in the more typical case of an association to a known gun. Perhaps heartened by several recent trial court opinions severely limiting testimony about the desired individualizing characteristics, Hunt moved "to Exclude Ballistic Evidence, or Alternatively, for a Daubert Hearing."

II. The District Court's Order

Hunt lost. After rejecting the pretrial objection to the scientific foundation of the examiners' opinions and the proper application of accepted methods by the two examiners, Judge Russell turned to the defendant's "penultimate argument [seeking] limitations on the Government's firearm toolmark experts." He embraced the government's response "that no limitation is necessary because Department of Justice guidance sufficiently limits a firearm examiner's testimony."

The odd thing is that he turned the Department's written policy on its head by embracing a form of testimony that the policy sought to eliminate. And the court did this immediately after it purported to implement DoJ's "reasonable" policy. The relevant portion of the opinion begins:

In accordance with recent guidance from the Department of Justice, the Government's firearm experts have already agreed to refrain from expressing their findings in terms of absolute certainty, and they will not state or imply that a particular bullet or shell casing could only have been discharged from a particular firearm to the exclusion of all other firearms in the world. The Government has also made clear that it will not elicit a statement that its experts' conclusions are held to a reasonable degree of scientific certainty.
The Court finds that the limitations mentioned above and prescribed by the Department of Justice are reasonable, and that the Government's experts should abide by those limitations. To that end, the Governments experts:
[S]hall not [1] assert that two toolmarks originated from the same source to the exclusion of all other sources.... [2] assert that examinations conducted in the forensic firearms/toolmarks discipline are infallible or have a zero error rate.... [3] provide a conclusion that includes a statistic or numerical degree of probability except when based on relevant and appropriate data.... [4] cite the number of examinations conducted in the forensic firearms/toolmarks discipline performed in his or her career as a direct measure for the accuracy of a proffered conclusion..... [5] use the expressions ‘reasonable degree of scientific certainty,’ ‘reasonable scientific certainty,’ or similar assertions of reasonable certainty in either reports or testimony unless required to do so by [the Court] or applicable law. \2/

So far it seems that the court simply told the government's experts (including the city police officer) to toe the federal line. But here comes the zinger. The court abruptly turned around and decided to ignore the Attorney General's mandate that DoJ personnel should strive to avoid expressions of "reasonable scientific certainty" and the like. The court wrote:

As to the fifth limitation described above, the Court will permit the Government's experts to testify that their conclusions were reached to a reasonable degree of ballistic certainty, a reasonable degree of certainty in the field of firearm toolmark identification, or any other version of that standard. See, e.g., U.S. v. Ashburn, 88 F. Supp. 3d 239, 249 (E.D.N.Y. 2015) (limiting testimony to a “reasonable degree of ballistics certainty” or a “reasonable degree of certainty in the ballistics field.”); U.S. v. Taylor, 663 F. Supp. 2d 1170, 1180 (D.N.M. 2009) (limiting testimony to a “reasonable degree of certainty in the firearms examination field.”). Accordingly, the Government's experts should not testify, for example, that “the probability the ammunition charged in Counts Eight and Nine were fired in different firearms is so small it is negligible” ***.

So the experts can testify that they have "reasonable certainty" that the ammunition was fired from the same gun, but they cannot say that the probability that it was fired from different firearms is so small as to be negligible? Even though that is how experts in the field achieve "reasonable certainty" (according to the AFTE description that the court held was scientifically valid)? This part of the opinion hardly seems coherent. \3/

III. The Tension Between the Order and the ULTR

The two cases that the court cited for its "reasonable ballistic certainty" ruling were decided years before the ULTR that it called reasonable, and such talk of "ballistic certainty" and "any other version of that standard" is precisely what the Department had resolved to avoid if at all possible. The history of the "fifth limitation" has an easily followed paper trail that compels the conclusion that this limitation was intended to avoid precisely the kind of testimony that Judge Russell's order permits.

Let's start with the ULTR quoted (in part) by the court. It has a footnote to the "fifth limitation" that instructs readers to "See Memorandum from the Attorney General to Heads of Department Components (Sept. 9. 2016), https://www.justice.gov/opa/file/891366/download." The memorandum's subject is "Recommendations of the National Commission on Forensic Science; Announcement for NCFS Meeting Eleven." In it, Attorney General Loretta Lynch wrote:

As part of the Department's ongoing coordination with the National Commission on Forensic Science (NCFS), I am responding today to several NCFS recommendations to advance and strengthen forensic science. *** I am directing Department components to *** work with the Deputy Attorney General to implement these policies *** .

1. Department forensic laboratories will review their policies and procedures to ensure that forensic examiners are not using the expressions "reasonable scientific certainty" or "reasonable [forensic discipline] certainty" in their reports or testimony. Department prosecutors will abstain from use of these expressions when presenting forensic reports or questioning forensic experts in court unless required by a judge or applicable law.

The NCFS was adamant that judges should not require "reasonable [forensic discipline] certainty." Its recommendation to the Attorney General explained that

Forensic discipline conclusions are often testified to as being held “to a reasonable degree of scientific certainty” or “to a reasonable degree of [discipline] certainty.” These terms have no scientific meaning and may mislead factfinders about the level of objectivity involved in the analysis, its scientific reliability and limitations, and the ability of the analysis to reach a conclusion. Forensic scientists, medical professionals and other scientists do not routinely express opinions or conclusions “to a reasonable scientific certainty” outside of the courts. Neither the Daubert nor Frye test of scientific admissibility requires its use, and consideration of caselaw from around the country confirms that use of the phrase is not required by law and is primarily a relic of custom and practice. There are additional problems with this phrase, including:
• There is no common definition within science disciplines as to what threshold establishes “reasonable” certainty. Therefore, whether couched as “scientific certainty” or “[discipline] certainty,” the term is idiosyncratic to the witness.
• The term invites confusion when presented with testimony expressed in probabilistic terms. How is a lay person, without either scientific or legal training, to understand an expert’s “reasonable scientific certainty” that evidence is “probably” or possibly linked to a particular source?

Accordingly, the NCFS recommended that the Attorney General "direct all attorneys appearing on behalf of the Department of Justice (a) to forego use of these phrases ... unless directly required by judicial authority as a condition of admissibility for the witness’ opinion or conclusion ... ." As we have seen, the Attorney General adopted this recommendation. \4/

IV. How the Prosecutors and the ATF Expert Can Follow Departmental Policy

Interestingly, Judge Russell's opinion does not require the lawyers and the witnesses to use the expressions of certainty. It "permits" them to do so (seemingly on the theory that this practice is just what the Department contemplated). But not all that is permitted is required. To be faithful to Department policy, the prosecution cannot accept the invitation. The experts can give their conclusion that the ammunition came from a single gun. However, they should not add, and the prosecutor may not ask them to swear to, some expression of "reasonable [discipline] certainty" because: (1) the Department's written policy requires them to avoid it "unless required by a judge or applicable law"; (2) the judge has not required it; and (3) applicable law does not require it. \5/

The situation could change if at the trial, Judge Russell were to intervene and to ask the experts about "reasonable certainty." In that event, the government should remind the court that its policy, for the reasons stated by the National Commission and accepted by the Attorney General, is to avoid these expressions. If the court then rejects the government's position, the experts must answer. But even then, unless the defense "opens the door" by cross-examining on the meaning of "reasonable [discipline] certainty," there is no reason for the prosecution to use the phrase in its examination of witnesses or closing arguments.

NOTES

  1. No. CR-19-073-R, 2020 WL 2842844 (W.D. Okla. June 1, 2020).
  2. The ellipses in the quoted part of the opinion are the court's. I have left out only the citations in the opinion to the Department's Uniform Language on Testimony and Reporting (ULTR) for firearms-toolmark identifications. That document is a jumble that is a subject for another day.
  3. Was Judge Russell thinking that the "negligible probability" judgment is valid (and hence admissible as far as the validity requirement of Daubert goes) but that it would be unfairly prejudicial or confusing to give the jury this valid judgment? Is the court's view that "negligible" is too strong a claim in light of what is scientifically known? If such judgments are valid, as AFTE maintains, they are not generally prejudicial. Prejudice does not mean damage to the opponent's case that arises from the very fact that evidence is powerful.
  4. At the NCFS meeting at which the Department informed the Commission that it was adopting its recommendation, "Commission member, theoretical physicist James Gates, complimented the Department for dealing with these words that 'make scientists cringe.'" US Dep't of Justice to Discourage Expressions of "Reasonable Scientific Certainty," Forensic Sci., Stat. & L., Sept. 12, 2016, http://for-sci-law.blogspot.com/2016/09/us-dept-of-justice-to-discourage.html.
  5. In a public comment to the NCFS, then commissioner Ted Hunt (now the Department's senior advisor on forensic science) cited the "ballistic certainty" line of cases as indicative of a problem with the NCFS recommendation as then drafted but agreed that applicable law did not require judges to follow the practice of insisting on or entertaining expressions of certitude. See "Reasonable Scientific Certainty," the NCFS, "the Law of the Courtroom," and That Pesky Passive Voice, Forensic Sci., Stat. & L., Mar. 1, 2016, http://for-sci-law.blogspot.com/2016/03/reasonable-scientific-certainty-ncfs.html; Is "Reasonable Scientific Certainty" Unreasonable?, Forensic Sci., Stat. & L., Feb. 26, 2016, http://for-sci-law.blogspot.com/2016/02/is-reasonable-scientific-certainty.html (concluding that
    In sum, there are courts that find comfort in phrases like "reasonable scientific certainty," and a few courts have fallen back on variants such as "reasonable ballistic certainty" as a response to arguments that identification methods cannot ensure that an association between an object or person and a trace is 100% certain. But it seems fair to say that "such terminology is not required" -- at least not by any existing rule of law.)