Friday, November 27, 2020

Mysteries of the Department of Justice's ULTR for Firearm-toolmark Pattern Examinations

The Department of Justice's "Uniform Language for Testimony and Reports (ULTR) for the Forensic Firearms/Toolmarks Discipline – Pattern Examination" offers a ready response to motions to limit overclaiming or, to use the pedantic term, ultracrepidarianism, in expert testimony. Citing the DoJ policy, several federal district courts have indicated that they expect the government's expert witnesses to follow this directive (or something like it). \1/

Parts of the current version are reproduced in Box 1; wording that the 2020 revision deleted from the original appears in [brackets]. \2/ This posting poses three questions about this guidance. Although the ULTR is a step in the right direction, it has a ways to go in articulating a clear and optimal policy.

Box 1. The ULTR

DEPARTMENT OF JUSTICE
UNIFORM LANGUAGE FOR TESTIMONY AND REPORTS
FOR THE FORENSIC FIREARMS/TOOLMARKS DISCIPLINE –
PATTERN MATCH EXAMINATION
...
III. Conclusions Regarding Forensic Pattern Examination of Firearms/Toolmarks Evidence for a Pattern Match

[The] An examiner may [offer] provide any of the following conclusions:
1. Source identification (i.e., identified)
2. Source exclusion (i.e., excluded)
3. Inconclusive
Source identification
‘Source identification’ is an examiner’s conclusion that two toolmarks originated from the same source. This conclusion is an examiner’s [decision] opinion that all observed class characteristics are in agreement and the quality and quantity of corresponding individual characteristics is such that the examiner would not expect to find that same combination of individual characteristics repeated in another source and has found insufficient disagreement of individual characteristics to conclude they originated from different sources.

The basis for a ‘source identification’ conclusion is an examiner’s [decision] opinion that the observed class characteristics and corresponding individual characteristics provide extremely strong support for the proposition that the two toolmarks [came] originated from the same source and extremely weak support for the proposition that the two toolmarks [came] originated from different sources.

A ‘source identification’ is the statement of an examiner’s opinion (an inductive inference²) that the probability that the two toolmarks were made by different sources is so small that it is negligible. A ‘source identification’ is not based upon a statistically-derived or verified measurement or an actual comparison to all firearms or toolmarks in the world.

Source exclusion
‘Source exclusion’ is an examiner’s conclusion that two toolmarks did not originate from the same source.

The basis for a ‘source exclusion’ conclusion is an examiner’s [decision] opinion that the observed class and/or individual characteristics provide extremely strong support for the proposition that the two toolmarks came from different sources and extremely weak or no support for the proposition that the two toolmarks came from the same source [two toolmarks can be differentiated by their class characteristics and/or individual characteristics].

Inconclusive
‘Inconclusive’ is an examiner’s conclusion that all observed class characteristics are in agreement but there is insufficient quality and/or quantity of corresponding individual characteristics such that the examiner is unable to identify or exclude the two toolmarks as having originated from the same source.

The basis for an ‘inconclusive’ conclusion is an examiner’s [decision] opinion that there is an insufficient quality and/or quantity of individual characteristics to identify or exclude. Reasons for an ‘inconclusive’ conclusion include the presence of microscopic similarity that is insufficient to form the conclusion of ‘source identification;’ a lack of any observed microscopic similarity; or microscopic dissimilarity that is insufficient to form the conclusion of ‘source exclusion.’

IV. Qualifications and Limitations of Forensic Firearms/Toolmarks Discipline Examinations
A conclusion provided during testimony or in a report is ultimately an examiner’s decision and is not based on a statistically-derived or verified measurement or comparison to all other firearms or toolmarks. Therefore, an [An] examiner shall not [assert that two toolmarks originated from the same source to the exclusion of all other sources. This may wrongly imply that a ‘source identification’ conclusion is based upon a statistically-derived or verified measurement or an actual comparison to all other toolmarks in the world, rather than an examiner’s expert opinion.]
○ assert that a ‘source identification’ or a ‘source exclusion’ conclusion is based on the ‘uniqueness’³ of an item of evidence.

○ use the terms ‘individualize’ or ‘individualization’ when describing a source conclusion.

○ assert that two toolmarks originated from the same source to the exclusion of all other sources.
• An examiner shall not assert that examinations conducted in the forensic firearms/toolmarks discipline are infallible or have a zero error rate.

• An examiner shall not provide a conclusion that includes a statistic or numerical degree of probability except when based on relevant and appropriate data.

• An examiner shall not cite the number of examinations [conducted in the forensic firearms/toolmarks discipline] performed in his or her career as a direct measure for the accuracy of a [proffered] conclusion provided. An examiner may cite the number of examinations [conducted in the forensic firearms/toolmarks discipline] performed in his or her career for the purpose of establishing, defending, or describing his or her qualifications or experience.

• An examiner shall not assert that two toolmarks originated from the same source with absolute or 100% certainty, or use the expressions ‘reasonable degree of scientific certainty,’ ‘reasonable scientific certainty,’ or similar assertions of reasonable certainty in either reports or testimony unless required to do so by a judge or applicable law.⁴


² Inductive reasoning (inferential reasoning):
A mode or process of thinking that is part of the scientific method and complements deductive reasoning and logic. Inductive reasoning starts with a large body of evidence or data obtained by experiment or observation and extrapolates it to new situations. By the process of induction or inference, predictions about new situations are inferred or induced from the existing body of knowledge. In other words, an inference is a generalization, but one that is made in a logical and scientifically defensible manner. Oxford Dictionary of Forensic Science 130 (Oxford Univ. Press 2012).
³ As used in this document, the term ‘uniqueness’ means ‘having the quality of being the only one of its kind.’ Oxford English Dictionary 804 (Oxford Univ. Press 2012).
⁴ See Memorandum from the Attorney General to Heads of Department Components (Sept. 9. 2016), https://www.justice.gov/opa/file/891366/download.

1

Are the two or three conclusions -- identification, exclusion, and inconclusive -- the only ways in which examiners are allowed to report their results?

In much of the world, examiners are discouraged from reporting one of only two categorical conclusions--included vs. excluded (with the additional option of denominating the data as too limited to permit either classification). They are urged to articulate how strongly the data support one classification over the other. Instead of pigeonholing, they might say, for example, that the data strongly support the same-source classification (because those data are far more probable for ammunition fired from the same gun than for ammunition discharged from different guns).

The ULTR studiously avoids mentioning this mode of reporting. It states that examiners "may provide any of the following ... ." It does not state whether they also may choose not to -- and instead report only the degree of support for the same-source (or the different-source) hypothesis. Does the maxim of expressio unius est exclusio alterius apply? Department of Justice personnel are well aware of this widely favored alternative. They have attended meetings of statisticians at which straw polls overwhelmingly endorsed it over the Department's permitted conclusions. Yet, the ULTR does not list statements of support (essentially, likelihood ratios) as permissible. But neither are they found in the list of thou-shalt-nots in Part IV. \3/ Is the idea that if the examiners have a conclusion to offer, they must state it as one of the two or three categorical ones -- and that they may give a qualitative likelihood ratio if they want to?

2

Is the stated logic of a "source identification" internally coherent and intellectually defensible?

The ULTR explains that "[t]he basis for a 'source identification' is

an examiner’s opinion that the observed class characteristics and corresponding individual characteristics provide extremely strong support for the proposition that the two toolmarks originated from the same source and extremely weak support for the proposition that the two toolmarks originated from different sources.

Translated into likelihood language, the DoJ's "basis for a source identification" is the belief that the likelihood ratio is very large -- the numerator of L is close to one, and the denominator is close to zero (see Box 2).

On this understanding, a "source identification" is a statement about the strength of the evidence rather than a conclusion (in the sense of a decision about the source hypothesis). However, the next paragraph of the ULTR states that "[a] ‘source identification’ is the statement of an examiner’s opinion (an inductive inference²) that the probability that the two toolmarks were made by different sources is so small that it is negligible."

Box 2. A Technical Definition of Support
The questioned toolmarks and the known ones have some degree of observed similarity X with respect to relevant characteristics. Let Lik(S) be the examiner's judgment of the likelihood of the same-source hypothesis S. This likelihood is proportional to Prob(X | S), the probability of the observed degree of similarity X given the hypothesis (S). For simplicity, we may as well let the constant be 1. Let Lik(D) be the examiner's judgment of the likelihood of the different-source hypothesis (D). This likelihood is Prob(X | D). The support for S is the logarithm of the likelihood ratio L = Lik(S) / Lik(D) = Prob(X | S) / Prob(X | D). \4/

In this way, the ULTR jumps from a likelihood to a posterior probability. To assert that "the probability that the two toolmarks were made by different sources ... is negligible" is to say that Prob(D|X) is close to 0, and hence that Prob(S|X) is nearly 1. However, the likelihood ratio L = Lik(S) / Lik(D) is only one factor that affects Prob(D|X). Bayes' theorem establishes that

Odds(D|X) = Odds(D) / L.

Consequently, a very large L (great support for S) shrinks the odds on D, but whether we end up with a "negligible" probability for D depends on the prior odds on D -- the odds without considering the strength of the toolmark evidence. Because the expertise of toolmark analysts extends only to evaluating the toolmark evidence, it seems that the ULTR is asking them to step outside their legitimate sphere of expertise by assessing, either explicitly or implicitly, the strength of the particular non-scientific evidence in the case.
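To see how much the prior odds matter, consider a small numerical sketch (the numbers are invented for illustration; nothing in the ULTR or in any case supplies them):

    # Illustration only: invented numbers showing how the prior odds on the
    # different-source hypothesis D combine with the likelihood ratio
    # L = Prob(X|S) / Prob(X|D) under Bayes' rule: Odds(D|X) = Odds(D) / L.
    def posterior_prob_D(prior_odds_D, L):
        posterior_odds_D = prior_odds_D / L
        return posterior_odds_D / (1 + posterior_odds_D)

    L = 10_000  # a very large likelihood ratio favoring the same-source hypothesis S
    for prior_odds_D in (1, 100, 1_000_000):
        print(prior_odds_D, round(posterior_prob_D(prior_odds_D, L), 4))
    # prior odds on D of 1         -> Prob(D|X) of about 0.0001 (negligible)
    # prior odds on D of 100       -> Prob(D|X) of about 0.01
    # prior odds on D of 1,000,000 -> Prob(D|X) of about 0.99 (anything but negligible)

The same likelihood ratio, in other words, can leave the different-source hypothesis anywhere from negligible to nearly certain, depending on the rest of the evidence.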

There is a way to circumvent this objection. To defend a "source identification" as a judgment that Prob(D|X) is negligible, the examiner could contend that the likelihood ratio L is not just very large, as the ULTR's first definition required, but that it is so large that it swamps every probability that a judge or juror reasonably might entertain in any possible case before learning about the toolmarks. A nearly infinite L would permit an analyst to dismiss the posterior odds on D as negligible without attempting to estimate the odds on the basis of other evidence in the particular case (see Box 3).

Box 3. How large must L be to swamp all plausible prior odds?

Suppose that the smallest prior same-source probability in any conceivable case were p = 1/1,000,000. The prior odds on the different-source hypothesis would be approximately 1/p = 1,000,000. According to Bayes' rule, the posterior odds on D then would be about (1/p)/L = 1,000,000/L.

How large would the support L for S have to be to make D a "negligible" possibility? If "negligible" means a probability below, say, 1/100,000, then L would have to be large enough that 1,000,000 / L < 1/100,000 (approximately); that is, L would have to exceed a threshold of L* = 10^11. Are examiners able to reliably tell whether the toolmarks are such that L > 100 billion?
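The arithmetic in Box 3 generalizes readily. Here is a minimal sketch (the inputs are assumptions for illustration, not findings about any case) that computes the threshold L* for any assumed floor on the prior same-source probability and any definition of "negligible":

    # Box 3, generalized. p = smallest prior same-source probability assumed to
    # arise in any case; eps = largest posterior probability for D that still
    # counts as "negligible."
    def swamping_threshold(p, eps):
        prior_odds_D = (1 - p) / p                 # roughly 1/p when p is small
        # Need Odds(D|X) = prior_odds_D / L < eps / (1 - eps),
        # so L must exceed prior_odds_D * (1 - eps) / eps.
        return prior_odds_D * (1 - eps) / eps

    print(f"{swamping_threshold(1e-6, 1e-5):.1e}")  # about 1.0e+11, as in Box 3
    print(f"{swamping_threshold(1e-3, 1e-3):.1e}")  # about 1.0e+06 under gentler assumptions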

One can use different numbers, of course, but it is none too clear that this swamping defense really works to justify actual testimony of a "source identification" as the ULTR defines it.

The ULTR seems slightly embarrassed by the characterization of a "source identification" as an "opinion" on the small size of a probability. Parenthetically, it calls the opinion "an inductive inference," which sounds more impressive. But the footnote that is supposed to explain the more elegant phrase only muddies the waters. It reads as follows:

Inductive reasoning (inferential reasoning): A mode or process of thinking that is part of the scientific method and complements deductive reasoning and logic. Inductive reasoning starts with a large body of evidence or data obtained by experiment or observation and extrapolates it to new situations. By the process of induction or inference, predictions about new situations are inferred or induced from the existing body of knowledge. In other words, an inference is a generalization, but one that is made in a logical and scientifically defensible manner. Oxford Dictionary of Forensic Science 130 (Oxford Univ. Press 2012) [sic]. \5/

The flaws in this definition are many. First, "inferential reasoning" is not equivalent to "inductive reasoning." Inference is reaching a conclusion from stated premises. The argument from the premises to the conclusion can be deductive or inductive. Deductive arguments are valid when the conclusion must be true if the premises are true; inductive arguments are sound when the conclusion is sufficiently probable given that the premises are true. Deduction yields logical certainty, whereas induction can yield no more than probable truth. Second, inductive reasoning can be based on a small body of evidence as well as on a large one. Third, an induction -- that is, the conclusion of an inductive argument -- need not be particularly scientific or "scientifically defensible." Fourth, an inductive conclusion is not necessarily "a generalization." An inductive argument, no less than a deductive one, can go from the general to the specific -- as is the case for an inference that two toolmarks were made by the same source. Presenting an experience-based opinion as the product of "the scientific method" by the fiat of a flawed definition of "inductive reasoning" is puffery.

3

If the examiner has correctly discerned matching "individual characteristics" (as the ULTR calls them), why cannot the examiner "assert that a ‘source identification’ ... is based on ... ‘uniqueness’" or that there has been an "individualization"?

The ULTR states that a "source identification" is based on an examination of "class characteristics" and "individual characteristics." Presumably, "individual characteristics" are ones that differ in every source and thus permit "individualization." The dictionary on which the ULTR relies defines "individualization" as "assigning a unique source for a given piece of physical evidence" (which it distinguishes from "identification"). But the ULTR enjoins an examiner from using "the terms ‘individualize’ or ‘individualization’ when describing a source conclusion," from asserting "that a ‘source identification’ or a ‘source exclusion’ conclusion is based on the ‘uniqueness’ of an item of evidence," and from stating "that two toolmarks originated from the same source to the exclusion of all other sources."

The stated reason to avoid these terms is that a source attribution "is not based on a statistically-derived or verified measurement or comparison to all other firearms or toolmarks." But who would think that an examiner who "assert[s] that two '[t]oolmarks originated from the same source to the exclusion of all other sources'" is announcing "an actual comparison to all other toolmarks in the world"? The examiner apparently is allowed to report a plethora of matching "individual characteristics" and to opine (or "inductively infer") that there is virtually no chance that the marks came from a different source. Allowing such testimony cuts the heart out of the rules against asserting "uniqueness" and claiming "individualization."

NOTES

  1. E.g., United States v. Hunt, 464 F.Supp.3d 1252 (W.D. Okla. 2020) (discussed on this blog Aug. 10, 2020).
  2. The original version was adopted on 7/24/2018. It was revised on 6/8/2020.
  3. Are numerical versions of subjective likelihood ratios prohibited by the injunction in Part IV that "[a]n examiner shall not provide a conclusion that includes a statistic or numerical degree of probability except when based on relevant and appropriate data"? Technically, a likelihood ratio is not a "degree of probability" or (arguably) a statistic, but it seems doubtful that the drafters of the ULTR chose their terminology with the niceties of statistical terminology in mind.
  4. A.W.F. Edwards, Likelihood 31 (rev. ed. 1992) (citing H. Jeffreys, Further Significance Tests, 32 Proc. Cambridge Phil. Soc'y 416 (1936)).
  5. The correct name of the dictionary is A Dictionary of Forensic Science, and its author is Suzanne Bell. The quotation in the ULTR omits the following part of the definition of "inductive inference": "A forensic example is fingerprints. Every person's fingerprints are unique, but this is an inference based on existing knowledge since the only way to prove it would be to take and study the fingerprints of every human being ever born."

Tuesday, November 24, 2020

Wikimedia v. NSA: It's Classified!

The National Security Agency (NSA) engages in systematic, warrantless "upstream" surveillance of Internet communications that travel in and out of the United States along a "backbone" of fiber optic cables. The ACLU and other organizations maintain that Upstream surveillance is manifestly unconstitutional. Whether or not that is correct, the government has stymied one Fourth Amendment challenge after another on the ground that the plaintiffs lacked standing because they could not prove that the surveillance entails intercepting, copying, and reviewing any of their communications. Of course, the reason plaintiffs have no direct evidence is that the government won't admit or deny it. Instead, the government has asserted that the surveillance program is a privileged state secret, classified its details, and resisted even in camera hearings in ordinary courts.

In Wikimedia Foundation v. National Security Agency, 857 F.3d 193 (4th Cir. 2017), however, the Court of Appeals for the Fourth Circuit held that the Wikimedia Foundation, which operates Wikipedia, made "allegations sufficient to survive a facial challenge to standing." Id. at 193. The court concluded that Wikimedia's allegations were plausible enough to defeat a motion to dismiss the complaint because

Wikimedia alleges three key facts that are entitled to the presumption of truth. First, “[g]iven the relatively small number of international chokepoints,” the volume of Wikimedia's communications, and the geographical diversity of the people with whom it communicates, Wikimedia's “communications almost certainly traverse every international backbone link connecting the United States with the rest of the world.”

Second, “in order for the NSA to reliably obtain communications to, from, or about its targets in the way it has described, the government,” for technical reasons that Wikimedia goes into at length, “must be copying and reviewing all the international text-based communications that travel across a given link” upon which it has installed surveillance equipment. Because details about the collection process remain classified, Wikimedia can't precisely describe the technical means that the NSA employs. Instead, it spells out the technical rules of how the Internet works and concludes that, given that the NSA is conducting Upstream surveillance on a backbone link, the rules require that the NSA do so in a certain way. ...

Third, per the PCLOB [Privacy and Civil Liberties Oversight Board] Report and a purported NSA slide, “the NSA has confirmed that it conducts Upstream surveillance at more than one point along the [I]nternet backbone.” Together, these allegations are sufficient to make plausible the conclusion that the NSA is intercepting, copying, and reviewing at least some of Wikimedia's communications. To put it simply, Wikimedia has plausibly alleged that its communications travel all of the roads that a communication can take, and that the NSA seizes all of the communications along at least one of those roads. 

Id. at 210-11 (citations omitted).

The Fourth Circuit therefore vacated an order dismissing Wikimedia's complaint issued by Senior District Judge Thomas Selby Ellis III, the self-described "impatient" jurist who achieved later notoriety and collected ethics complaints (that were rejected last year) for his management of the trial of former Trump campaign manager Paul Manafort.

On remand, the government moved for summary judgment. Wikimedia Found. v. Nat'l Sec. Agency/Cent. Sec. Serv., 427 F.Supp.3d 582 (D. Md. 2019). Once more, the government argued that Wikimedia lacked standing to complain that the Upstream surveillance violated its Fourth Amendment rights. It suggested that the "plausible" inference that the NSA must be "intercepting, copying, and reviewing at least some of Wikimedia's communications” recognized by the Fourth Circuit was not so plausible after all. To support this conclusion, it submitted a declaration of Henning Schulzrinne, a Professor of Computer Science and Electrical Engineering at Columbia University. Dr. Schulzrinne described how companies carrying Internet traffic might filter transmissions before copying them by “mirroring” with “routers” or “switches” that could perform “blacklisting” or “whitelisting” if the NSA chose to give the companies information on its targets with which to create “access control lists.”

But Dr. Schulzrinne supplied no information and formed no opinion on whether it was at all likely that the NSA used the mirroring methods that he envisioned. Wikimedia, for its part, produced a series of expert reports from Scott Bradner, who had served as Harvard University’s Technology Security Officer and taught at that university. Bradner contended that the NSA could hardly be expected to give away the information on its targets and concluded that it is all but certain that the agency intercepted and opened at least one of Wikimedia's trillions of Internet communications.

The district court refused to conduct an evidentiary hearing on the factual issue. Instead, it disregarded the expert's opinion as inadmissible scientific evidence under Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579 (1993), because no one without access to classified information could "know what the NSA prioritizes in the Upstream surveillance program ... and therefore Mr. Bradner has no knowledge or information about it." Wikimedia, 427 F. Supp. 3d at 604–05 (footnotes omitted).

This reasoning resembles that from Judge Ellis's first opinion in this long-running case. In Wikimedia Found. v. Nat'l Sec. Agency, 143 F. Supp. 3d 344, 356 (D. Md. 2015), the judge characterized Wikimedia’s allegations as mere “suppositions and speculation, with no basis in fact, about how the NSA” operates and maintained that it was impossible for Wikimedia to prove its allegations “because the scope and scale of Upstream surveillance remain classified . . . .” Id. Rather than allow full consideration of the strength of the evidence that makes Wikimedia’s claim plausible, the district court restated its position that “Mr. Bradner has no [direct] knowledge or information” because that information is classified. Wikimedia, 427 F. Supp. 3d at 604–605.

In a pending appeal to the Fourth Circuit, Edward Imwinkelried, Michael Risinger, Rebecca Wexler, and I prepared a brief as amici curiae in support of Wikimedia. The brief expresses surprise at “the district court’s highly abbreviated analysis of Rule 702 and Daubert, as well as the court’s consequent decision to rule inadmissible opinions of the type that Wikimedia’s expert offered in this case.” It describes the applicable standard for excluding expert testimony. It then argues that the expert’s method of reasoning was sound and that its factual bases regarding the nature of Internet communications and surveillance technology, together with public information on the goals and needs of the NSA program, were sufficient to justify the receipt of the proposed testimony.

UPDATE (9/27/21): On 9/15/21, the Fourth Circuit affirmed the summary judgment order -- but not on the basis of Judge Ellis's theories about expert testimony. A divided panel reasoned that the suit had to be dismissed because the government had properly invoked the state secrets privilege and that because the government would have to disclose those secrets to defend itself, “further litigation would present an unjustifiable risk of disclosure.” Wikimedia Found. v. Nat'l Sec. Agency/Cent. Sec. Serv., 14 F.4th 276 (4th Cir. 2021).

Monday, September 28, 2020

Terminology Department: Significance

Inns of Court College of Advocacy, Guidance on the Preparation, Admission and Examination of Expert Evidence § 5.2 (3d ed. 2020)
Statisticians, for example, use what appear to be everyday words in specific technical senses. 'Significance' is an example. In everyday language it carries associations of importance, something with considerable meaning. In statistics it is a measure of the likelihood that a relationship between two or more variables is caused by something other than random chance.
Welcome to the ICCA

The Inns of Court College of Advocacy ... is the educational arm of the Council of the Inns of Court. The ICCA strives for ‘Academic and Professional Excellence for the Bar’. Led by the Dean, the ICCA has a team of highly experienced legal academics, educators and instructional designers. It also draws on the expertise of the profession across the Inns, Circuits, Specialist Bar Associations and the Judiciary to design and deliver bespoke training for student barristers and practitioners at all levels of seniority, both nationally, pan-profession and on an international scale.

How good is the barristers' definition of statistical significance? In statistics, an apparent association between variables is said to be significant when it lies outside the range that one would expect to see in some large fraction of repeated, identically conducted studies in which the variables are in fact uncorrelated. Sir Ronald Fisher articulated the idea as follows:

[I]t is convenient to draw the line at about the level at which we can say: ‘Either there is something in the treatment, or a coincidence has occurred such as does not occur more than once in twenty trials.’ This level ... we may call the 5 per cent. point .... If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty (the 2 per cent. point), or one in a hundred (the 1 per cent point). Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach that level. A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance. [1]

For Fisher, a "significant" result would occur by sheer coincidence no "more than once in twenty trials" (on average). 

Is such statistical significance the same as the barristers' "likelihood" that an observed "relationship ... is caused by something other than random chance"? One might object to the appearance of the term "likelihood" in the definition because it too is a technical term with a specialized meaning in statistics, but that is not the main problem. The vernacular likelihood that X is the cause of extreme data (where X is anything other than random chance) is not a "level of significance" such as 5%, 2%, or 1%. These levels are conditional error probabilities: If the variables are uncorrelated and we use a given level to call the observed results significant, then, in the (very) long run, we will label coincidental results as significant no more often than that level specifies. For example, if we always use a 0.01 level, we will call coincidences "significant" no more than 1% of the time (in the limit).

The probability (the vernacular likelihood) "that a relationship between two or more variables is caused by something other than random chance" is quite different. [2, p.53] Everything else being equal, significant results are more likely to signal a true relationship than are nonsignificant results, but the significance level itself refers to the probability of data that are uncommon when there is no true relationship, and not to the probability that the apparent relationship is real. In symbols, Pr(relationship | extreme data) is not Pr(extreme data | relationship). Naively swapping the terms in the expressions for the conditional probabilities is known as the transposition fallacy. In regard to criminal cases involving statistical evidence, it often is called the "prosecutor's fallacy." Perhaps "barristers' fallacy" can be added to the list.
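A back-of-the-envelope calculation (with a base rate and power invented purely for illustration) shows how far apart the two conditional probabilities can be:

    # Transposition illustrated with invented numbers: suppose only 10% of the
    # hypotheses tested involve a real relationship, a real relationship is
    # detected 80% of the time (power), and the significance level is 0.05.
    prior_real = 0.10   # Pr(relationship) -- an assumption, not a measurement
    power = 0.80        # Pr(significant | relationship)
    alpha = 0.05        # Pr(significant | no relationship)

    p_significant = power * prior_real + alpha * (1 - prior_real)
    p_real_given_significant = power * prior_real / p_significant
    print(round(p_real_given_significant, 2))
    # 0.64 -- not 0.95, even though Pr(significant | no relationship) is only 0.05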

REFERENCES

  1. Ronald Fisher, The Arrangement of Field Experiments, 33 J. Ministry Agric. Gr. Brit 503-515, 504 (1926).
  2. David H. Kaye, Frequentist Methods for Statistical Inference, in Handbook of Forensic Statistics 39-72 (David Banks et al. eds. 2021).

ACKNOWLEDGMENT: Thanks to Geoff Morrison for alerting me to the ICCA definition.

Wednesday, August 26, 2020

Terminology Department: Defining Bias for Nonstatisticians

The Organization of Scientific Area Committees for Forensic Science (OSAC) is trying to develop definitions of common technical terms that can be used across most forensic-science subject areas. "Bias" is one of these ubiquitous terms, but its statistical meaning does not conform to the usual dictionary definitions, such as  "an inclination of temperament or outlook, especially: a personal and sometimes unreasoned judgment" \1/ or "the action of supporting or opposing a particular person or thing in an unfair way, because of allowing personal opinions to influence your judgment." \2/ 

I thought the following definition might be useful for forensic-science practitioners:

A systematic tendency for estimates or measurements to be above or below their true values. A study is said to be biased if its design is such that it systematically favors certain outcomes. An estimator of a population parameter is biased when the average value of the estimates (from an infinite number of samples) would not equal the value of the parameter. Bias arises from systematic as opposed to random error in the collection of units to be measured, the measurement of the units, or the process for estimating quantities based on the measurements.

It ties together some of the simplest definitions I have seen in textbooks and reference works on statistics -- namely:

Yadolah Dodge, The Concise Encyclopedia of Statistics 41 (2008): From a statistical point of view, the bias is defined as the difference between the expected value of a statistic and the true value of the corresponding parameter. Therefore, the bias is a measure of the systematic error of an estimator. If we calculate the mean of a large number of unbiased estimations, we will find the correct value. The bias indicates the distance of the estimator from the true value of the parameter. Comment: This is the definition for mathematical statistics.
B. S. Everitt & A. Skrondal, The Cambridge Dictionary of Statistics 45 (4th ed. 2010) (citing Altman, D.G. (1991) Practical Statistics for Medical Research, Chapman and Hall, London): In general terms, deviation of results or inferences from the truth, or processes leading to such deviation. More specifically, the extent to which the statistical method used in a study does not estimate the quantity thought to be estimated, or does not test the hypothesis to be tested. In estimation usually measured by the difference between the expected value of an estimator and the true value of the parameter. An estimator for which E(θ-hat) = θ is said to be unbiased. See also ascertainment bias, recall bias, selection bias and biased estimator. Comment: The general definition (first sentence) fails to differentiate between random and systematic deviations. The “more specific” definition in the next sentence is limited to the definition in mathematical statistics.
David H. Kaye & David A. Freedman, Reference Guide on Statistics, in Reference Manual on Scientific Evidence 283 (Federal Judicial Center & Nat’l Research Council eds., 3d ed. 2011): Also called systematic error. A systematic tendency for an estimate to be too high or too low. An estimate is unbiased if the bias is zero. (Bias does not mean prejudice, partiality, or discriminatory intent.) See nonsampling error. Comment: This one is intended to convey the essential idea to judges.
David H. Kaye, Frequentist Methods for Statistical Inference, in Handbook of Forensic Statistics 39, 44 (D. Banks, K. Kafadar, D. Kaye & M. Tackett eds. 2020): [A]n unbiased estimator t of [a parameter] θ will give estimates whose errors eventually should average out to zero. Error is simply the difference between the estimate and the true value. For an unbiased estimator, the expected value of the errors is E(t − θ) = 0. Comment: Yet another version of the definition of an unbiased estimator of a population or model parameter.
JCGM, International Vocabulary of Metrology – Basic and General Concepts and Associated Terms (VIM) (3d ed. 2012): measurement bias, bias -- estimate of a systematic measurement error Comment: The VIM misdefines bias as an estimate of bias.
David S. Moore & George P. McCabe, Introduction to the Practice of Statistics 232 (2d ed. 1993): Bias. The design of a study is biased if it systematically favors certain outcomes. In a causal study, bias can result from confounding. Comment: Or can it?
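A toy simulation (with arbitrary numbers) may make the statistical sense of the term concrete for nonstatisticians: a biased estimator's errors do not average out to zero, no matter how many estimates accumulate.

    # Toy simulation: the sample mean as an unbiased estimator of a population
    # mean, versus a "miscalibrated" estimator with a constant systematic error.
    import random

    random.seed(1)
    true_mean = 10.0
    errors_unbiased, errors_biased = [], []
    for _ in range(10_000):
        sample = [random.gauss(true_mean, 2.0) for _ in range(25)]
        sample_mean = sum(sample) / len(sample)
        errors_unbiased.append(sample_mean - true_mean)      # random error only
        errors_biased.append(sample_mean + 0.5 - true_mean)  # adds a systematic error
    print(round(sum(errors_unbiased) / len(errors_unbiased), 3))  # close to 0: unbiased
    print(round(sum(errors_biased) / len(errors_biased), 3))      # close to 0.5: bias of 0.5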

NOTES

  1. Merriam-Webster Dictionary (online).
  2. Cambridge Dictionary (online)

Saturday, August 22, 2020

Phrases for the Value or Weight of Evidence

A few statisticians asked me (independently) about the usage of the terms evidential value, evidentiary value, and probative value. For years, I thought the phrases all meant the same thing, but that is not true in some fields.

Evidential Value

Black’s Law Dictionary (which tends to have aged definitions) has this definition of evidential value: “Value of records given as or in support of evidence, based on the certainty of the records origins. The value here is not in the record content. This certainty is essential for authentic and adequate evidence of an entity’s actions, functioning, policies, and/or structure.”

Under this definition, "evidential value" pertains to a document's value merely as the container of information. The definition distinguishes between the provenance and authenticity of a document -- where did it come from and has it been altered? -- and the content of the document -- what statements or information does it contain? Likewise, archivists distinguish between "evidential value" and "informational value." The former, according to the Society of American Archivists "relates to the process of creation rather than the content (informational value) of the records."

Evidentiary Value

Lawyers use the phrases "evidentiary value" and "probative value" (or "probative force") as synonyms. For example, a 1932 note in the University of Pennsylvania Law Review on "Evidentiary Value of Finger-Prints" predicted that "the time is not far distant when courts must scrutinize and properly evaluate the probative force to be given to evidence that finger-prints found on the scene correspond with those of the accused." \1/

Forensic scientists use "evidentiary value" to denote the utility of examining objects for information on whether the objects have a common origin. A 2009 report of a committee of the National Academies complained that there was no standard threshold for deciding when bitemarks have "reached a threshold of evidentiary value." \2/ More generally, the phrase can denote the value of any expert analysis of effects as proof of the possible cause of those effects. \3/

Evidential Value

Unlike archivists, forensic scientists use the phrase “evidential value” interchangeably with "evidentiary value." It appears routinely in titles of articles and books such as "Evidential Value of Multivariate Physicochemical Data," \4/ "Enhancing the Evidential Value of Fingermarks Through Successful DNA Typing," \5/ and "Establishing the Evidential Value of a Small Quantity of Material Found at a Crime Scene." \6/

Probative Value

Lawyers use "probative value" to denote the degree to which an item of evidence proves the proposition it is offered to prove. Credible evidence that a defendant threatened to kill the deceased, whose death was caused by a poison, is probative of whether the death was accidental and whether defendant was the killer. With circumstantial evidence like this, various probability-based formulations have been proposed to express probative value quantitatively. \7/ One of the simplest is the likelihood ratio or Bayes factor (BF) favored by most forensic statisticians. \8/ Its logarithm has qualities that argue for using log(BF) to express the "weight" of an item of evidence. \9/

The rules of evidence require judges to exclude evidence when unfair prejudice, distraction, and undue consumption of time in presenting the evidence substantially outweigh the probative value of the evidence. \10/ In theory, judges do not exclude evidence just because they do not believe that the witness is telling the truth. The jury will take credibility into account in deciding the case. However, in ensuring that there is sufficient probative value to bother with the evidence, judges can hardly avoid being influenced by the trustworthiness of the source of the information. Moreover, the importance of the fact that the proposed testimony addresses and the availability of alternative, less prejudicial proof also can influence the decision to exclude evidence that is probative of a material fact. \11/

NOTES

  1. Note, Evidentiary Value of Finger-Prints, 80 U. Penn. L. Rev. 887 (1932).
  2. Comm. on Identifying the Needs of the Forensic Sci. Cmty. Nat'l Research Council, Strengthening Forensic Science in the United States: A Path Forward 176 (2009).
  3. Nicholas Dempsey & Soren Blau, Evaluating the Evidentiary Value of the Analysis of Skeletal Trauma in Forensic Research: A Review of Research and Practice, 307 Forensic Sci. Int'l (2020), https://doi.org/10.1016/j.forsciint.2020.110140. Still another usage of the term occurs in epistemology. See P. Gärdenfors, B. Hansson, N-E. Sahlin, Evidentiary Value: Philosophical, Judicial and Psychological Aspects of a Theory (1983); Dennis V. Lindley, Review, 35(3) Brit. J. Phil. Sci. 293-296 (1984) (criticizing this theory).
  4. Grzegorz Zadora, Agnieszka Martyna, Daniel Ramos & Colin Aitken, Statistical Analysis in Forensic Science: Evidential Value of Multivariate Physicochemical Data (2014).
  5. Zuhaib Subhani, Barbara Daniel & Nunzianda Frascione, DNA Profiles from Fingerprint Lifts—Enhancing the Evidential Value of Fingermarks Through Successful DNA Typing, 64(1) J. Forensic Sci. 201–06 (2019), https://doi.org/10.1111/1556-4029.13830.
  6. I.W. Evett, Establishing the Evidential Value of a Small Quantity of Material Found at a Crime Scene, 33(2) J. Forensic Sci. Soc’y 83-86 (1993).
  7. 1 McCormick on Evidence § 185 (R. Mosteller ed., 8th ed. 2020); David H. Kaye, Review-essay, Digging into the Foundations of Evidence Law, 116 Mich. L. Rev. 915-34 (2017), http://ssrn.com/abstract=2903618.
  8. See Anuradha Akmeemana, Peter Weis, Ruthmara Corzo, Daniel Ramos, Peter Zoon, Tatiana Trejos, Troy Ernst, Chip Pollock, Ela Bakowska, Cedric Neumann & Jose Almirall, Interpretation of Chemical Data from Glass Analysis for Forensic Purposes, J. Chemometrics (2020), DOI:10.1002/cem.3267.
  9. I. J. Good, Weight of Evidence: A Brief Survey, in 2 Bayesian Statistics 249–270 (Bernardo, J. M., M. H. DeGroot, D. V. Lindley & A. F. M. Smith eds., 1985); Irving John Good, Weight of Evidence and the Bayesian Likelihood Ratio, in The Use of Statistics in Forensic Science 85–106 (C. G. G. Aitken & David A. Stoney eds., 1991).
  10. Fed. R. Evid. 403; Unif. R. Evid. 403; 1 McCormick, supra note 7, § 185.
  11. 1 McCormick, supra note 7, § 185.

Tuesday, August 18, 2020

"Quite High" Accuracy for Firearms-mark Comparisons

Court challenges to the validity of forensic identification of the gun that fired a bullet based on toolmark comparisons have increased since the President's Council of Advisors on Science and Technology (PCAST) issued a report in late 2016 stressing the limitations in the scientific research on the subject. A study from the Netherlands preprinted in 2019 adds to the research literature. The abstract reads (in part):

Forensic firearm examiners compare the features in cartridge cases to provide a judgment addressing the question about their source: do they originate from one and the same or from two different firearms? In this article, the validity and reliability of these judgments is studied and compared to the outcomes of a computer-based method. The ... true positive rates (sensitivity) and the true negative rates (specificity) of firearm examiners are quite high. ... The examiners are overconfident, giving judgments of evidential strength that are too high. The judgments of the examiners and the outcomes of the computer-based method are only moderately correlated. We suggest to implement performance feedback to reduce overconfidence, to improve the calibration of degree of support judgments, and to study the possibility of combining the judgments of examiners and the outcomes of computer-based methods to increase the overall validity.

Erwin J.A.T. Mattijssen, Cilia L.M. Witteman, Charles E.H. Berger, Nicolaas W. Brand & Reinoud D. Stoel, Validity and Reliability of Forensic Firearm Examiners. Forensic Sci. Int’l 2020, 307:110112. 

 

Despite the characterization of examiner sensitivity and specificity as "quite high," the observed specificity was only 0.89, which corresponds to a false-positive rate of 11%—much higher than the <2% estimate quoted in recent judicial opinions. But the false-positive proportions from different experiments are not as discordant as they might appear to be when naively juxtaposed. To appreciate the sensitivity and specificity reported in this experiment, we need to understand the way that the validity test was constructed.

Design of the Study

The researchers fired two bullets from each of two hundred 9 mm Luger Glock pistols seized in the Netherlands. These 400 test firings gave rise to true (same-source) and false (different-source) pairings of two-dimensional comparison images of the striation patterns on cartridge cases. Specifically, there were 400 cartridge cases from which the researchers made "measurements of the striations of the firing pin aperture shear marks" and prepared "digital images [of magnifications of] the striation patterns using oblique lighting, optimized to show as many of the striations as possible while avoiding overexposure." (They also produced three-dimensional data, but I won't discuss those here.)

They invited forensic firearm examiners from Europe, North America, South America, Asia and Oceania by e-mail to examine the images. Of the recipients, 112 participated, but only 77 completed the online questionnaire, which presented 60 side-by-side comparisons of the striation-pattern images. (The 400 images gave rise to (400×399)/2 distinct pairs of images, of which 200 were same-source pairs. They could hardly ask the volunteers to study all these 79,800 pairs, so they used a computer program for matching such patterns to obtain 60 pairs that seemed to cover "the full range of comparison difficulty" but that overrepresented "difficult" pairs — an important choice that we'll talk about soon. Of the 60, 38 were same-source pairs, and 22 were different-source pairs.)

The examiners first evaluated the degree of similarity on a five-point scale. Then they were shown the 60 pairs again and asked (1) whether the comparison provides support that the striations in the cartridge cases are the result of firing the cartridge cases with one (same-source) or with two (different-source) Glock pistols; (2) for their sense of the degree of support for this conclusion; and (3) whether they would have provided an inconclusive conclusion in casework.

The degree of support was reported or placed on a six-point scale of "weak support" (likelihood ratio L = 2 to 10), "moderate support" (L = 10 to 100), "moderately strong support" (L = 100 to 1,000), "strong support" (L = 1,000 to 10,000), "very strong support" (L = 10,000 to 1,000,000), and "extremely strong support" (L > 1,000,000). The computerized system mentioned above also generated numerical likelihood ratios. (The proximity of the ratio to 1 was taken as a measure of difficulty.)
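For readers who want to experiment with the scale, a short helper function (the code and the label for ratios below 2 are mine, not the study's; the cut-points come from the scale just described) converts a numeric likelihood ratio into its verbal category:

    # Verbal equivalents for likelihood ratios on the six-point scale described
    # above. The function and the "no appreciable support" label for L < 2 are
    # illustrative additions.
    def verbal_support(L):
        scale = [
            (2, "no appreciable support"),
            (10, "weak support"),
            (100, "moderate support"),
            (1_000, "moderately strong support"),
            (10_000, "strong support"),
            (1_000_000, "very strong support"),
        ]
        for upper_bound, label in scale:
            if L < upper_bound:
                return label
        return "extremely strong support"

    print(verbal_support(250))        # moderately strong support
    print(verbal_support(5_000_000))  # extremely strong support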

A Few of the Findings

For the 60 two-dimensional comparisons, the computer program and the human examiners performed as follows:

Table 1. Results for (computer | examiners), excluding pairs deemed inconclusive by the examiners.

                    SS pair          DS pair
SS outcome       (36 | 2365)       (10 | 95)
DS outcome        (2 | 74)         (12 | 784)

Validity: sens = (.95 | .97); FNP = (.05 | .03); spec = (.55 | .89); FPP = (.45 | .11)

Abbreviations: SS = same source; DS = different source; sens = sensitivity; spec = specificity; FNP = false negative proportion; FPP = false positive proportion.

Table 1 combines two of the tables in the article. The entry "36 | 2365," for example, denotes that the computer program correctly classified as same-source 36 of the 38 same-source pairs (95%), while the 77 examiners collectively classified correctly 2,365 of the 2,439 same-source comparisons (97%) that they did not consider inconclusive. The computer program did not have the option to avoid a conclusion (or rather a likelihood ratio) in borderline cases. When examiners' conclusions on the cases they would have called inconclusive in practice were added in, the sensitivity and specificity dropped to 0.93 and 0.81, respectively.

Making Sense of These Results

The reason there are more comparisons for the examiners is that there were 77 of them and only one computer program. The 77 examiners made 77 × 60 comparisons, while the computer program made only 1 × 60 comparisons on the 60 pairs. Those pairs, as I noted earlier, were not a representative sample. On all 79,800 possible pairings of the test-fired bullets, the tireless computer program's sensitivity and specificity were both 0.99. If we can assume that the human examiners would have done at least as well as the computer program that they outperformed (on the 60 more or less "difficult" cases), their performance for all possible pairs would have been excellent.

An "Error Rate" for Court?

The experiment gives a number for the false-positive "error rate" (misclassifications across all the 22 different-source pairs) of 11%. If we conceive of the examiners' judgments as a random sample from some hypothetical population of identically conducted experiments, then the true false-positive error probability could be somewhat higher (as emphasized in the PCAST report) or lower. How should such numbers be used in admissibility rulings under Daubert v. Merrell Dow Pharmaceuticals, Inc.? At trial, to give the jury a sense of the chance of a false-positive error (as PCAST also urged)?

For admissibility, Daubert referred (indirectly) to false-positive proportions in particular studies of "voice prints," although the more apt statistic would be a likelihood ratio for a positive classification. For the Netherlands study, that would be L+ = Pr(+|SS) / Pr(+|DS) ≈ 0.97/0.11 = 8.8. In words, it is almost nine times more probable that an examiner (like the ones in the study) will report a match when confronted with a randomly selected same-source pair of images than a randomly selected different-source pair (from the set of 60 constructed for the experiment). That validates the examiners' general ability to distinguish between same- and different-source pairs at a modest level of accuracy in that sample.
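The figures in Table 1 and the likelihood ratio just given can be checked directly from the examiner counts (this is only a recomputation of the published numbers, not new data):

    # Examiner counts from Table 1, excluding the would-be inconclusives.
    ss_called_ss, ss_called_ds = 2365, 74   # same-source pairs
    ds_called_ss, ds_called_ds = 95, 784    # different-source pairs

    sensitivity = ss_called_ss / (ss_called_ss + ss_called_ds)  # about 0.97
    specificity = ds_called_ds / (ds_called_ss + ds_called_ds)  # about 0.89
    false_positive_prop = 1 - specificity                       # about 0.11

    L_plus = sensitivity / false_positive_prop
    print(round(sensitivity, 2), round(specificity, 2), round(L_plus, 1))
    # 0.97 0.89 9.0 -- about 8.8 when computed from the rounded rates in the text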

But to what extent can or should this figure (or just the false-positive probability estimate of 0.11) be presented to a factfinder as a measure of the probative value of a declared match? In this regard, arguments arise over presenting an average figure in a specific case (although that seems like a common enough practice in statistics) and the realism of the experiment. The researchers warn that

For several reasons it is not possible to directly relate the true positive and true negative rates, and the false positive and false negative rates of the examiners in this study to actual casework. One of these reasons is that the 60 comparisons we used were selected to over-represent ‘difficult’ comparisons. In addition, the use of the online questionnaire did not enable the examiners to manually compare the features of the cartridge cases as they would normally do in casework. They could not include in their considerations the features of other firearm components, and their results and conclusions were not peer reviewed. Enabling examiners to follow their standard operating procedures could result in better performance.

There Is More

Other facets of the paper also make it recommended reading. Data on the reliability of conclusions (both within and across examiners) are presented, and an analysis of the calibration of the examiners' judgments of how strongly the images supported their source attributions led the authors to remark that

When we look at the actual proportion of misleading choices, the examiners judged lower relative frequencies of occurrence (and thus more extreme LRs) than expected if their judgments would have been well-calibrated. This can be seen as overconfidence, where examiners provide unwarranted support for either same-source or different-source propositions, resulting in LRs that are too high or too low, respectively. ... Simply warning examiners about overconfidence or asking them to explain their judgments does not necessarily decrease overconfidence of judgments.

Monday, August 10, 2020

Applying the Justice Department's Policy on a Reasonable Degree of Certainty in United States v. Hunt

In United States v. Hunt, \1/ Senior U.S. District Court Judge David Russell disposed of a challenge to proposed firearms-toolmark testimony. The first part of the unpublished opinion dealt with the scientific validity (as described in Daubert v. Merrell Dow Pharmaceuticals, Inc.) of the Association of Firearms and Toolmark Examiners (AFTE) "Theory of Identification As It Relates to Toolmarks." Mostly, this portion of the opinion is of the form: "Other courts have accepted the government's arguments. We agree." This kind of opinion is common for forensic-science methods that have a long history of judicial acceptance--whether or not such acceptance is deserved.

The unusual part of the opinion comes at the end. There, the court misconstrues the Department of Justice's internal policy on the use of the phrase "reasonable certainty" to characterize an expert conclusion for associating spent ammunition with a gun that might have fired it. This posting describes some of the history of that policy and suggests that (1) the court may have unwittingly rejected it; (2) the court's order prevents the experts from expressing the same AFTE theory that the court deemed scientifically valid; and (3) the government can adhere to its written policy on avoiding various expressions of "reasonable certainty" and still try the case consistently with the judge's order.

I. The Proposed Testimony

Dominic Hunt was charged with being a felon in possession of ammunition recovered from two shootings. The government proposed to use two firearm and toolmark examiners--Ronald Jones of the Oklahoma City Police Department and Howard Kong of the Bureau of Alcohol, Tobacco, Firearms and Explosives' (ATF) Forensic Science Laboratory--to establish that the ammunition was not fired from the defendant's brother's pistol--or his cousin's pistol. To eliminate those hypotheses, "the Government intend[ed] its experts to testify" that "the unknown firearm was likely a Smith & Wesson 9mm Luger caliber pistol," and that "the probability that the ammunition ... were fired in different firearms is so small it is negligible."

This testimony differs from the usual opinion testimony that ammunition components recovered from the scene of a shooting came from a specific, known gun associated with a defendant. It appears that the "unknown" Luger pistol was never discovered and thus that the examiners could not use it to fire test bullets for comparison purposes. Their opinion was that several of the shell casings had marks and other features that were so similar that they must have come from the same gun of the type they specified.

But the reasoning process the examiners used to arrive at this conclusion--which postulates "class," "subclass," and conveniently designated "individual" characteristics--is the same as the one used in the more typical case of an association to a known gun. Perhaps heartened by several recent trial court opinions severely limiting testimony about the desired individualizing characteristics, Hunt moved "to Exclude Ballistic Evidence, or Alternatively, for a Daubert Hearing."

II. The District Court's Order

Hunt lost. After rejecting the pretrial objections to the scientific foundation of the examiners' opinions and to the two examiners' application of accepted methods, Judge Russell turned to the defendant's "penultimate argument [seeking] limitations on the Government's firearm toolmark experts." He embraced the government's response "that no limitation is necessary because Department of Justice guidance sufficiently limits a firearm examiner's testimony."

The odd thing is that he turned the Department's written policy on its head by embracing a form of testimony that the policy sought to eliminate. And the court did this immediately after it purported to implement DoJ's "reasonable" policy. The relevant portion of the opinion begins:

In accordance with recent guidance from the Department of Justice, the Government's firearm experts have already agreed to refrain from expressing their findings in terms of absolute certainty, and they will not state or imply that a particular bullet or shell casing could only have been discharged from a particular firearm to the exclusion of all other firearms in the world. The Government has also made clear that it will not elicit a statement that its experts' conclusions are held to a reasonable degree of scientific certainty.
The Court finds that the limitations mentioned above and prescribed by the Department of Justice are reasonable, and that the Government's experts should abide by those limitations. To that end, the Governments experts:
[S]hall not [1] assert that two toolmarks originated from the same source to the exclusion of all other sources.... [2] assert that examinations conducted in the forensic firearms/toolmarks discipline are infallible or have a zero error rate.... [3] provide a conclusion that includes a statistic or numerical degree of probability except when based on relevant and appropriate data.... [4] cite the number of examinations conducted in the forensic firearms/toolmarks discipline performed in his or her career as a direct measure for the accuracy of a proffered conclusion..... [5] use the expressions ‘reasonable degree of scientific certainty,’ ‘reasonable scientific certainty,’ or similar assertions of reasonable certainty in either reports or testimony unless required to do so by [the Court] or applicable law. \2/

So far it seems that the court simply told the government's experts (including the city police officer) to toe the federal line. But here comes the zinger. The court abruptly turned around and decided to ignore the Attorney General's mandate that DoJ personnel should strive to avoid expressions of "reasonable scientific certainty" and the like. The court wrote:

As to the fifth limitation described above, the Court will permit the Government's experts to testify that their conclusions were reached to a reasonable degree of ballistic certainty, a reasonable degree of certainty in the field of firearm toolmark identification, or any other version of that standard. See, e.g., U.S. v. Ashburn, 88 F. Supp. 3d 239, 249 (E.D.N.Y. 2015) (limiting testimony to a “reasonable degree of ballistics certainty” or a “reasonable degree of certainty in the ballistics field.”); U.S. v. Taylor, 663 F. Supp. 2d 1170, 1180 (D.N.M. 2009) (limiting testimony to a “reasonable degree of certainty in the firearms examination field.”). Accordingly, the Government's experts should not testify, for example, that “the probability the ammunition charged in Counts Eight and Nine were fired in different firearms is so small it is negligible” ***.

So the experts can testify that they have "reasonable certainty" that the ammunition was fired from the same gun, but they cannot say the probability that it was fired from a different gun is small enough that the alternative hypothesis has a negligible probability? Even though that is how experts in the field achieve "reasonable certainty" (according to the AFTE description that the court held was scientifically valid)? This part of the opinion hardly seems coherent. \3/

III. The Tension Between the Order and the ULTR

The two cases that the court cited for its "reasonable ballistic certainty" ruling were decided years before the ULTR that it called reasonable, and such talk of "ballistic certainty" and "any other version of that standard" is precisely what the Department had resolved to avoid if at all possible. The "fifth limitation" has an easily followed paper trail, and that trail compels the conclusion that the limitation was intended to prevent precisely the kind of testimony that Judge Russell's order permits.

Let's start with the ULTR quoted (in part) by the court. It has a footnote to the "fifth limitation" that instructs readers to "See Memorandum from the Attorney General to Heads of Department Components (Sept. 9, 2016), https://www.justice.gov/opa/file/891366/download." The memorandum's subject is "Recommendations of the National Commission on Forensic Science; Announcement for NCFS Meeting Eleven." In it, Attorney General Loretta Lynch wrote:

As part of the Department's ongoing coordination with the National Commission on Forensic Science (NCFS), I am responding today to several NCFS recommendations to advance and strengthen forensic science. *** I am directing Department components to *** work with the Deputy Attorney General to implement these policies *** .

1. Department forensic laboratories will review their policies and procedures to ensure that forensic examiners are not using the expressions "reasonable scientific certainty" or "reasonable [forensic discipline] certainty" in their reports or testimony. Department prosecutors will abstain from use of these expressions when presenting forensic reports or questioning forensic experts in court unless required by a judge or applicable law.

The NCFS was adamant that judges should not require "reasonable [forensic discipline] certainty." Its recommendation to the Attorney General explained that

Forensic discipline conclusions are often testified to as being held “to a reasonable degree of scientific certainty” or “to a reasonable degree of [discipline] certainty.” These terms have no scientific meaning and may mislead factfinders about the level of objectivity involved in the analysis, its scientific reliability and limitations, and the ability of the analysis to reach a conclusion. Forensic scientists, medical professionals and other scientists do not routinely express opinions or conclusions “to a reasonable scientific certainty” outside of the courts. Neither the Daubert nor Frye test of scientific admissibility requires its use, and consideration of caselaw from around the country confirms that use of the phrase is not required by law and is primarily a relic of custom and practice. There are additional problems with this phrase, including:
• There is no common definition within science disciplines as to what threshold establishes “reasonable” certainty. Therefore, whether couched as “scientific certainty” or “[discipline] certainty,” the term is idiosyncratic to the witness.
• The term invites confusion when presented with testimony expressed in probabilistic terms. How is a lay person, without either scientific or legal training, to understand an expert’s “reasonable scientific certainty” that evidence is “probably” or possibly linked to a particular source?

Accordingly, the NCFS recommended that the Attorney General "direct all attorneys appearing on behalf of the Department of Justice (a) to forego use of these phrases ... unless directly required by judicial authority as a condition of admissibility for the witness’ opinion or conclusion ... ." As we have seen, the Attorney General adopted this recommendation. \4/

IV. How the Prosecutors and the ATF Expert Can Follow Departmental Policy

Interestingly, Judge Russell's opinion does not require the lawyers and the witnesses to use the expressions of certainty. It "permits" them to do so (seemingly on the theory that this practice is just what the Department contemplated). But not all that is permitted is required. To be faithful to Department policy, the prosecution cannot accept the invitation. The experts can give their conclusion that the ammunition came from a single gun. However, they should not add, and the prosecutor may not ask them to swear to, some expression of "reasonable [discipline] certainty" because: (1) the Department's written policy requires them to avoid it "unless required by a judge or applicable law"; (2) the judge has not required it; and (3) applicable law does not require it. \5/

The situation could change if at the trial, Judge Russell were to intervene and to ask the experts about "reasonable certainty." In that event, the government should remind the court that its policy, for the reasons stated by the National Commission and accepted by the Attorney General, is to avoid these expressions. If the court then rejects the government's position, the experts must answer. But even then, unless the defense "opens the door" by cross-examining on the meaning of "reasonable [discipline] certainty," there is no reason for the prosecution to use the phrase in its examination of witnesses or closing arguments.

NOTES

  1. No. CR-19-073-R, 2020 WL 2842844 (W.D. Okla. June 1, 2020).
  2. The ellipses in the quoted part of the opinion are the court's. I have left out only the citations in the opinion to the Department's Uniform Language for Testimony and Reports (ULTR) for firearms-toolmark identifications. That document is a jumble that is a subject for another day.
  3. Was Judge Russell thinking that the "negligible probability" judgment is valid (and hence admissible as far as the validity requirement of Daubert goes) but that it would be unfairly prejudicial or confusing to give the jury this valid judgment? Is the court's view that "negligible" is too strong a claim in light of what is scientifically known? If such judgments are valid, as AFTE maintains, they are not generally prejudicial. Prejudice does not mean damage to the opponent's case that arises from the very fact that evidence is powerful.
  4. At the NCFS meeting at which the Department informed the Commission that it was adopting its recommendation, "Commission member, theoretical physicist James Gates, complimented the Department for dealing with these words that 'make scientists cringe.'" US Dep't of Justice to Discourage Expressions of "Reasonable Scientific Certainty," Forensic Sci., Stat. & L., Sept. 12, 2016, http://for-sci-law.blogspot.com/2016/09/us-dept-of-justice-to-discourage.html.
  5. In a public comment to the NCFS, then commissioner Ted Hunt (now the Department's senior advisor on forensic science) cited the "ballistic certainty" line of cases as indicative of a problem with the NCFS recommendation as then drafted but agreed that applicable law did not require judges to follow the practice of insisting on or entertaining expressions of certitude. See "Reasonable Scientific Certainty," the NCFS, the "Law of the Courtroom," and That Pesky Passive Voice, Forensic Sci., Stat. & L., Mar. 1, 2016, http://for-sci-law.blogspot.com/2016/03/reasonable-scientific-certainty-ncfs.html; Is "Reasonable Scientific Certainty" Unreasonable?, Forensic Sci., Stat. & L., Feb. 26, 2016, http://for-sci-law.blogspot.com/2016/02/is-reasonable-scientific-certainty.html (concluding that
    In sum, there are courts that find comfort in phrases like "reasonable scientific certainty," and a few courts have fallen back on variants such as "reasonable ballistic certainty" as a response to arguments that identification methods cannot ensure that an association between an object or person and a trace is 100% certain. But it seems fair to say that "such terminology is not required" -- at least not by any existing rule of law.)

Tuesday, May 5, 2020

How Do Forensic-science Tests Compare to Emergency COVID-19 Tests?

The Wall Street Journal recently reported that
At least 160 antibody tests for Covid-19 entered the U.S. market without previous FDA scrutiny on March 16, because the agency felt then that it was most important to get them to the public quickly. Accurate antibody testing is a potentially important tool for public-health officials assessing how extensively the coronavirus has swept through a region or state.
Now, the FDA will require test companies to submit an application for emergency-use authorization and require them to meet standards for accuracy. Tests will need to be found 90% “sensitive,” or able to detect coronavirus antibodies, and 95% “specific,” or able to avoid false positive results. \1/
How many test methods in forensic science have been shown to perform at or above these emergency levels? It is hard to say. For FDA-authorized tests, one can find the manufacturers' figures on the FDA's website, but there is no comparable repository for forensic-science tests. The test-method standards approved by voluntary consensus bodies such as the Academy Standards Board and ASTM International rarely state the performance characteristics of the tests they describe.

For the FDA's minimum operating characteristics of a yes-no test, the likelihood ratio for a positive result is Pr(+ | antibodies) / Pr(+ | no antibodies) = 0.90/(1 − 0.95) = 18. The likelihood ratio for a negative result is Pr(− | no antibodies) / Pr(− | antibodies) = 0.95/(1 − 0.90) = 9.5. In other words, a clean bill of health from a serological test with minimally acceptable performance would occur a bit less than ten times as frequently for people without detectable antibodies as for people with them.
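
For readers who want to check the arithmetic, a few lines of Python will do it. (This is my own sketch; the function name and structure are not taken from any cited source.)

    def likelihood_ratios(sensitivity, specificity):
        """Return (LR for a positive result, LR for a negative result).

        LR+ = Pr(+ | antibodies) / Pr(+ | no antibodies)
        LR- = Pr(- | no antibodies) / Pr(- | antibodies)
        """
        lr_positive = sensitivity / (1 - specificity)
        lr_negative = specificity / (1 - sensitivity)
        return lr_positive, lr_negative

    # FDA's minimum operating characteristics for emergency-use antibody tests
    lr_pos, lr_neg = likelihood_ratios(0.90, 0.95)
    print(round(lr_pos, 1), round(lr_neg, 1))   # 18.0 9.5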

According to an Ad Hoc Working Group of the forensic Scientific Working Group on DNA Analysis Methods (SWGDAM), such a likelihood ratio may be described as providing "limited support." This description is near the lower end of a scale for likelihood ratios. The "verbal qualifiers" run from "uninformative" (L = 1) through "limited" (2 to 99), "moderate" (100 to 999), and "strong" (1,000 to 999,999) to "very strong" (1,000,000 or more). \2/
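
Translating the quoted thresholds into code is straightforward. (Again a sketch of my own; the quoted scale does not say how to label ratios strictly between 1 and 2, so the first cutoff below is an assumption.)

    def swgdam_qualifier(lr):
        """Map a likelihood ratio to the SWGDAM verbal scale quoted above."""
        if lr < 2:
            return "uninformative"       # assumes values under 2 are lumped with L = 1
        elif lr < 100:
            return "limited support"
        elif lr < 1000:
            return "moderate support"
        elif lr < 1000000:
            return "strong support"
        else:
            return "very strong support"

    print(swgdam_qualifier(18))    # limited support
    print(swgdam_qualifier(9.5))   # limited support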

A more finely graded table appears "for illustration purposes" in an ENFSI [European Network of Forensic Science Institutes] Guideline for Evaluative Reporting in Forensic Science. The table classifies L = 9.5 as "weak support." \3/

NOTES
  1. Thomas M. Burton, FDA Sets Standards for Coronavirus Antibody Tests in Crackdown on Fraud, Wall Street J., Updated May 4, 2020 8:24 pm ET, https://www.wsj.com/articles/fda-sets-standards-for-coronavirus-antibody-tests-in-crackdown-on-fraud-11588605373; see also Mark McCarty, FDA’s Stenzel Highlights Sensitivity, Specificity for COVID-19 Antibody Testing Antibodies Fighting Coronavirus, BioWorld, May 6, 2020, https://www.bioworld.com/articles/434912-fdas-stenzel-highlights-sensitivity-specificity-for-covid-19-antibody-testing ("the agency’s performance expectations for serological tests are that overall sensitivity is 90%, and overall specificity is at least 95%. The specificity level of 95% is applicable to each antibody isotype, assuming results for each isotype are broken out, and sensitivity should be 90% for IgG if reported. If IgM is reported separately, sensitivity should be at least 70%, but the list of tests suggests that not all the tests currently operating under the EAU meet those benchmarks.").
  2. Recommendations of the SWGDAM Ad Hoc Working Group on Genotyping Results Reported as Likelihood Ratios, 2018, available via https://www.swgdam.org/publications.
  3. ENFSI Guideline for Evaluative Reporting in Forensic Science, 2016, p. 17, http://enfsi.eu/wp-content/uploads/2016/09/m1_guideline.pdf.

Saturday, April 25, 2020

Estimating Prevalence from Serological Tests for COVID-19 Infections: What We Don't Know Can Hurt Us

A statistical debate has erupted over the proportion of the population that has been infected with SARS-CoV-2. It is a crucial number in arguments about "herd immunity" and public health measures to control the COVID-19 pandemic. A news article in yesterday's issue of Science reports that

[S]urvey results, from Germany, the Netherlands, and several locations in the United States, find that anywhere from 2% to 30% of certain populations have already been infected with the virus. The numbers imply that confirmed COVID-19 cases are an even smaller fraction of the true number of people infected than many had estimated and that the vast majority of infections are mild. But many scientists question the accuracy of the antibody tests ... .\1/

The first sentence reflects a common assumption -- that the reported proportion of positive test results directly indicates the prevalence of infections where the tested people live. The last sentence gives one reason this might not be the case. But the fact that tests for antibodies are inaccurate does not necessarily preclude good estimates of prevalence. It may still be possible to adjust the observed proportion up or down to arrive at the percentage "already ... infected with the virus." There is a clever and simple procedure for doing that -- under certain conditions. Before describing it, let's look at another, more easily grasped threat to estimating prevalence -- "sampling bias."

Sampling Design: Who Gets Tested?

Inasmuch as the people tested in the recent studies were not selected by random sampling from any well-defined population, the samples of test results may not be representative of what the outcome would be if the entire population of interest were tested. Several sources of sampling bias have been noted.

A study of a German town "found antibodies to the virus in 14% of the 500 people tested. By comparing that number with the recorded deaths in the town, the study suggested the virus kills only 0.37% of the people infected. (The rate for seasonal influenza is about 0.1%.)" But the researchers "sampled entire households. That can lead to overestimating infections, because people living together often infect each other." \2/ Of course, one can count just one individual per household, so this clumping does not sound like a fundamental problem.

"A California serology study of 3300 people released last week in a preprint [found 50] antibody tests were positive—about 1.5%. [The number in the draft paper by Eran Bendavid, Bianca Mulaney, Neeraj Sood, et al. is 3330 \3/] But after adjusting the statistics to better reflect the county's demographics, the researchers concluded that between 2.49% and 4.16% of the county's residents had likely been infected." However, the Stanford researchers "recruit[ed] the residents of Santa Clara county through ads on Facebook," which could have "attracted people with COVID-19–like symptoms who wanted to be tested, boosting the apparent positive rate." \4/ This "unhealthy volunteer" bias is harder to correct with this study design.

"A small study in the Boston suburb of Chelsea has found the highest prevalence of antibodies so far. Prompted by the striking number of COVID-19 patients from Chelsea colleagues had seen, Massachusetts General Hospital pathologists ... collected blood samples from 200 passersby on a street corner. ... Sixty-three were positive—31.5%." As the pathologists acknowledged, pedestrians on a single corner "aren't a representative sample." \5/

Even efforts to find subjects at random will fall short of the mark because of self-selection on the part of subjects. "Unhealthy volunteer" bias is a threat even in studies like one planned for Miami-Dade County that will use random-digit dialing to utility customers to recruit subjects. \6/

In sum, sampling bias could be a significant problem in many of these studies. But it is something epidemiologists always face, and enough quick and dirty surveys (with different possible sources of sampling bias) could give a usable indication of what better designed studies would reveal.

Measurement Error: No Gold Standard

A second criticism holds that because the "specificity" of the serological tests could be low, the estimates of prevalence are exaggerated. "Specificity" refers to the extent to which the test (correctly) does not signal an infection when applied to an uninfected individual. If it (incorrectly) signals an infection for these individuals, it produces false positives. Low specificity means lots of false positives. Worries over specificity recur throughout the Science article's summary of the controversy:

  • "The result carries several large caveats. The team used a test whose maker, BioMedomics, says it has a specificity of only about 90%, though Iafrate says MGH's own validation tests found a specificity of higher than 99.5%."
  • "Because the absolute numbers of positive tests were so small, false positives may have been nearly as common as real infections."
  • "Streeck and his colleagues claimed the commercial antibody test they used has 'more than 99% specificity,' but a Danish group found the test produced three false positives in a sample of 82 controls, for a specificity of only 96%. That means that in the Heinsberg sample of 500, the test could have produced more than a dozen false positives out of roughly 70 the team found." \7/

Likewise, political scientist and statistician Andrew Gelman blogged that no screening test that lacks a very high specificity can produce a usable estimate of population prevalence -- at least when the proportion of tests that are positive is small. This limitation, he insisted, is "the big one." \8/ He presented the following as a devastating criticism of the Santa Clara study (with my emphasis added):

Bendavid et al. estimate that the sensitivity of the test is somewhere between 84% and 97% and that the specificity is somewhere between 90% and 100%. I can never remember which is sensitivity and which is specificity, so I looked it up on wikipedia ... OK, here are [sic] concern is actual negatives who are misclassified, so what’s relevant is the specificity. That’s the number between 90% and 100%.
If the specificity is 90%, we’re sunk.
With a 90% specificity, you’d expect to see 333 positive tests out of 3330, even if nobody had the antibodies at all. Indeed, they only saw 50 positives, that is, 1.5%, so we can be pretty sure that the specificity is at least 98.5%. If the specificity were 98.5%, the observed data would be consistent with zero ... . On the other hand, if the specificity were 100%, then we could take the result at face value.
So how do they get their estimates? Again, the key number here is the specificity. Here’s exactly what they say regarding specificity:
A sample of 30 pre-COVID samples from hip surgery patients were also tested, and all 30 were negative. . . . The manufacturer’s test characteristics relied on . . . pre-COVID sera for negative gold standard . . . Among 371 pre-COVID samples, 369 were negative.
This gives two estimates of specificity: 30/30 = 100% and 369/371 = 99.46%. Or you can combine them together to get 399/401 = 99.50%. If you really trust these numbers, you’re cool: with y=399 and n=401, we can do the standard Agresti-Coull 95% interval based on y+2 and n+4, which comes to [98.0%, 100%]. If you go to the lower bound of that interval, you start to get in trouble: remember that if the specificity is less than 98.5%, you’ll expect to see more than 1.5% positive tests in the data no matter what!
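
The Agresti-Coull arithmetic in that passage is easy to reproduce. (A sketch using the simplified "add 2 successes and 4 trials" form that Gelman describes; the function is mine.)

    from math import sqrt

    def agresti_coull_95(successes, trials, z=1.96):
        """Approximate 95% Agresti-Coull interval: add 2 successes and 4 trials."""
        n = trials + 4
        p = (successes + 2) / n
        half = z * sqrt(p * (1 - p) / n)
        return p - half, min(p + half, 1.0)

    lo, hi = agresti_coull_95(399, 401)
    print(round(lo, 3), round(hi, 3))   # about 0.980 and 1.0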

To be sure, the fact that the serological tests are not perfectly accurate in detecting an immune response makes it dangerous to rely on the proportion of people tested who test positive as the measure of the proportion of the population who have been infected. Unless the test is perfectly sensitive (is certain to be positive for an infected person) and specific (certain to be negative for an uninfected person), the observed proportion will not be the true proportion of past infections -- even in the sample. As we will see shortly, however, there is a simple way to correct for imperfect sensitivity and specificity in estimating the population prevalence, and there is a voluminous literature on using imperfect screening tests to estimate population prevalence. \9/ Recognizing what one wants to estimate leads quickly to the conclusion that the raw proportion of positives in the tested group that the media usually report (even with a margin of error to account for sampling variability) is not generally the right statistic to focus on.

Moreover, the notion that because false positives inflate an estimate of the number who have been infected, only the specificity is relevant is misconceived. Sure, false positives (imperfect specificity) inflate the estimate. But false negatives (imperfect sensitivity) simultaneously deflate it. Both types of misclassifications should be considered.

How, then, do epidemiologists doing surveillance studies normally handle the fact that the tests for a disease are not perfectly accurate? Let's use p to denote the positive proportion in the sample of people tested -- for example, the 1.5% in the Santa Clara sample or the 21% figure for New York City that Governor Andrew Cuomo announced in a tweet. The performance of the serological test depends on its true sensitivity SEN and true specificity SPE. For the moment, let's assume that these are known parameters of the test. In reality, they are estimated from separate studies that themselves have sampling errors, but we'll just try out some values for them. First, let's derive a general result that contains ideas presented in 1954 in the legal context of serological tests for parentage. \10/

Let PRE designate the true prevalence in the population (such as everyone in Santa Clara county or New York City) from which a sample of people to be tested is drawn. We pick a person totally at random. That person either has harbored the virus (inf) or not (uninf). The former probability we abbreviate as Pr(inf); the latter is Pr(uninf). The probability that the individual tests positive is

  Pr(test+) = Pr[test+ & (inf or uninf)]
     = Pr[(test+ & inf) or (test+ & uninf)]
     = Pr(test+ & inf) + Pr(test+ & uninf)
     = Pr(test+ | inf)Pr(inf) + Pr(test+ | uninf)Pr(uninf)     (1)*

In words, the probability of the positive result is (a) the probability the test is positive if the person has been infected, weighted by the probability he or she has been infected, plus (b) the probability it is positive if the person has not been infected, weighted by the probability of no infection.

We can rewrite (1) in terms of the sensitivity and specificity. SEN is Pr(test+|inf) -- the probability of a positive result if the person has been infected. SPE is Pr(test–|uninf) -- the probability of a negative result if the person has not been infected. For the random person, the probability of infection is just the true prevalence in the population, PRE. So the first product in (1) is simply SEN × PRE.

To put SPE into the second term, we note that the probability that an event happens is 1 minus the probability that it does not happen. Consequently, we can write the second term as (1 – SPE) × (1 – PRE). Thus, we have

     Pr(test+) = SEN PRE + (1 – SPE)(1 – PRE)           (2)

Suppose, for example, that SEN = 70%, SPE = 80%, and PRE = 10%. Then Pr(test+) = (0.70)(0.10) + (0.20)(0.90) = 0.25. The expected proportion of observed positives in a random sample would be 0.25 -- a substantial overestimate of the true prevalence PRE = 0.10.
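
In code, Equation (2) and this worked example take only a few lines. (A sketch; the argument names follow the SEN, SPE, and PRE notation above.)

    def prob_test_positive(sen, spe, pre):
        """Equation (2): Pr(test+) = SEN*PRE + (1 - SPE)*(1 - PRE)."""
        return sen * pre + (1 - spe) * (1 - pre)

    print(round(prob_test_positive(0.70, 0.80, 0.10), 2))   # 0.25, versus a true prevalence of 0.10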

In this example, with rather poor sensitivity and specificity, using the observed proportion p of positives in a large random sample to estimate the prevalence PRE would be foolish. So we should not blithely substitute p for PRE. Indeed, doing so can give a bad estimate even when the test has perfect specificity. When SPE = 1, Equation (2) reduces to Pr(test+) = SEN PRE. In this situation, the sample proportion does not estimate the prevalence -- it estimates only a fraction of it.

Clearly, good specificity is not a sufficient condition for using the sample proportion p to estimate the true prevalence PRE, even in huge samples. Imperfect values of both SEN and SPE cause misclassifications, and the two work in opposite directions. Poor specificity produces false positives, while poor sensitivity causes infected people to be counted as negatives. The net effect of these opposing forces is mediated by the prevalence.

To correct for the expected misclassifications in a large random sample, we can use the observed proportion of positives, not as an estimator of the prevalence, but as an estimator of Pr(test+). Setting p = Pr(test+), we solve for PRE to obtain an estimated prevalence of

      pre = (p + SPE – 1)/(SPE + SEN – 1)         (3) \11/

For the Santa Clara study, Bendavid et al. found p = 50/3330 = 1.5%, and suggested that SEN = 80.3% and SPE = 99.5%. \12/ For these values, the estimated prevalence is pre = 1.25%. If we change SPE to 98.5%, where Gelman wrote that "you get into trouble," the estimate is pre = 0, which is clearly too small. Instead, the researchers used equation (3) only after they transformed their stratified sample data to fit the demographics of the county. That adjustment produced an inferred proportion p' = 2.81%.  Using that adjusted value for p, Equation (3) becomes

      pre = (p' + SPE – 1)/(SEN + SPE – 1)         (4)

For the SPE of 98.5%, equation (4) gives an estimated prevalence of pre = 1.66%. For 99.5% it is 2.9%. Although some critics have complained about using Equation (3) with the demographically adjusted proportion p' shown in (4), if the adjustment provides a better picture of the full population, it seems like the right proportion to use for arriving at the point estimate pre.

Nevertheless, there remains a sense in which the specificity is key. Given SEN = 80.3% (and the adjusted proportion p' = 2.81%), dropping SPE to 97.2% gives pre = 0. Ouch! When SPE drops below 97.2%, pre turns negative, which is ridiculous. In fact, this result holds for many other values of SEN. So one does need a high specificity for Equation (3) to be plausible -- at least when the true prevalence (and hence p') is small. But as PRE (and thus p') grows larger, Equations (3) and (4) look better. For example, if p = 20%, then pre is 22% even with SPE = 97.2% and SEN = 80.3%. Indeed, with this large a p, even with a specificity of only SPE = 90%, we still get a substantial pre = 14.2%.
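
All of the numbers in the last few paragraphs can be reproduced with Equation (3). (A sketch in Python; small differences from the figures above are due to rounding.)

    def rogan_gladen(p, sen, spe):
        """Equation (3): adjusted prevalence estimate from the observed positive proportion p."""
        return (p + spe - 1) / (sen + spe - 1)

    # Santa Clara, raw proportion p = 1.5%, SEN = 80.3%
    print(round(rogan_gladen(0.015, 0.803, 0.995), 4))    # 0.0125 (1.25%)
    print(round(rogan_gladen(0.015, 0.803, 0.985), 4))    # 0.0 -- the estimate collapses to zero

    # Demographically adjusted proportion p' = 2.81%
    print(round(rogan_gladen(0.0281, 0.803, 0.995), 4))   # 0.0289 (about 2.9%)
    print(round(rogan_gladen(0.0281, 0.803, 0.985), 4))   # 0.0166 (1.66%)
    print(round(rogan_gladen(0.0281, 0.803, 0.972), 4))   # 0.0001 -- effectively zero

    # Larger observed proportions are far less sensitive to the specificity
    print(round(rogan_gladen(0.20, 0.803, 0.972), 3))     # 0.222 (22%)
    print(round(rogan_gladen(0.20, 0.803, 0.90), 3))      # 0.142 (14.2%)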

Random Sampling Error

I have pretended the sensitivity and specificity are known with certainty.  Equation (3) only gives a point estimate for true prevalence. It does not account for sampling variability -- either in p (and hence p') or in the estimates (sen and spe) of SEN and SPE, respectively, that have to be plugged into (3). To be clear that we are using estimates from the separate validity studies rather than the unknown true values for SEN and SPE, we can write the relevant equation as follows:

      pre = (p + spe – 1)/(sen + spe – 1)         (5)

Dealing with the variance of p (or p') with sample sizes like 3300 is not hard. Free programs on the web give confidence intervals for pre, based on various methods of arriving at its standard error, that account for the size of the random sample that produced the estimate p. (Try it out.)
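
To illustrate this first step (still treating sen and spe as if they were known exactly), one simple approach -- a sketch, not the method used in any of the studies discussed here -- is to compute a Wilson score interval for p and push its endpoints through Equation (5), which is an increasing function of p:

    from math import sqrt

    def wilson_interval_95(successes, trials, z=1.96):
        """95% Wilson score interval for a binomial proportion."""
        p_hat = successes / trials
        denom = 1 + z**2 / trials
        center = (p_hat + z**2 / (2 * trials)) / denom
        half = z * sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
        return center - half, center + half

    def rogan_gladen(p, sen, spe):
        return (p + spe - 1) / (sen + spe - 1)

    lo, hi = wilson_interval_95(50, 3330)     # the raw Santa Clara counts
    sen, spe = 0.803, 0.995                   # treated here as exact
    print(round(max(rogan_gladen(lo, sen, spe), 0), 4),   # roughly 0.008
          round(rogan_gladen(hi, sen, spe), 4))           # roughly 0.0185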

Our uncertainty about SEN and SPE is greater (at this point, because the tests rushed into use have not been well validated, as discussed in previous postings). Bendavid et al. report a confidence interval for PRE that is said to account for the variances in all three estimators -- p, sen, and spe. \13/ However, a savage report in Ars Technica \14/ collects tweets, including a series complaining that "[t]he confidence interval calculation in their preprint made demonstrable math errors." \15/ Nonetheless, it should be feasible to estimate how much sampling error in the validity studies for the serological tests contributes to the uncertainty in pre as an estimator of the population prevalence PRE. The researchers, at any rate, are convinced that "[t]he argument that the test is not specific enough to detect real positives is deeply flawed." \16/ Although they are working with a relatively low estimated prevalence, they could be right. \17/ If sensitivity is in the range they claim, their estimates of prevalence should not be dismissed out of hand.

* * *

The take-away message is that a gold-standard serological test is not always necessary for effective disease surveillance. It is true that unless the test is highly accurate, the positive test proportion p (or a proportion p' adjusted for a stratified sample) is not a good estimator of the true prevalence PRE. That has been known for quite some time and is not in dispute. At the same time, pre sometimes can be a useful estimator of true prevalence. That too is not in dispute. Of course, as always, good data are better than post hoc corrections, but for larger prevalences, serological tests may not require 99.5% specificity to produce useful estimates of how many people have been infected by SARS-CoV-2.


UPDATE (5/9/20): An Oregon State University team in Corvallis is going door to door in an effort to test a representative sample of the college town's population. \1/ A preliminary report released to the media gives a simple incidence of 2 per 1,000. Inasmuch as the sketchy accounts indicate that the samples collected are nasal swabs, the proportion cannot be directly compared to the proportions positive for the serological tests mentioned above. The nasal swabbing is done by the respondents in the survey rather than by medical personnel, \2/ and the results pertain to the presence of the virus at the time of the swabbing rather than to an immune response that may be the result of exposure in the past.

UPDATE (7/9/20): Writing on “SARS-CoV-2 seroprevalence in COVID-19 hotspots” in The Lancet on July 6, Isabella Eckerle and Benjamin Meyer report that

Antibody cross-reactivity with other human coronaviruses has been largely overcome by using selected viral antigens, and several commercial assays are now available for SARS-CoV-2 serology. ... The first SARS-CoV-2 seroprevalence studies from cohorts representing the general population have become available from COVID-19 hotspots such as China, the USA, Switzerland, and Spain. In The Lancet, Marina Pollán and colleagues and Silvia Stringhini and colleagues separately report representative population-based seroprevalence data from Spain and Switzerland collected from April to early May this year. Studies were done in both the severely affected urban area of Geneva, Switzerland, and the whole of Spain, capturing both strongly and less affected provinces. Both studies recruited randomly selected participants but excluded institutionalised populations ... . They relied on IgG as a marker for previous exposure, which was detected by two assays for confirmation of positive results.

The Spanish study, which included more than 60,000 participants, showed a nationwide seroprevalence of 5·0% (95% CI 4·7–5·4; specificity–sensitivity range of 3·7% [both tests positive] to 6·2% [at least one test positive]), with urban areas around Madrid exceeding 10% (eg, seroprevalence by immunoassay in Cuenca of 13·6% [95% CI 10·2–17·8]). ... Similar numbers were obtained across the 2766 participants in the Swiss study, with seroprevalence data from Geneva reaching 10·8% (8·2–13·9) in early May. The rather low seroprevalence in COVID-19 hotspots in both studies is in line with data from Wuhan, the epicentre and presumed origin of the SARS-CoV-2 pandemic. Surprisingly, the study done in Wuhan approximately 4–8 weeks after the peak of infection reported a low seroprevalence of 3·8% (2·6–5·4) even in highly exposed health-care workers, despite an overwhelmed health-care system.

The key finding from these representative cohorts is that most of the population appears to have remained unexposed to SARS-CoV-2, even in areas with widespread virus circulation. [E]ven countries without strict lockdown measures have reported similarly low seroprevalence—eg, Sweden, which reported a prevalence of 7·3% at the end of April—leaving them far from reaching natural herd immunity in the population.

UPDATE (10/5/20): In Seroprevalence of SARS-CoV-2–Specific Antibodies Among Adults in Los Angeles County, California, on April 10-11, 2020, JAMA 2020, 323(23):2425-2427, doi: 10.1001/jama.2020.8279, Neeraj Sood, Paul Simon, Peggy Ebner, Daniel Eichner, Jeffrey Reynolds, Eran Bendavid, and Jay Bhattacharya used the methods applied in the Santa Clara study on a "random sample ... with quotas for enrollment for subgroups based on age, sex, race, and ethnicity distribution of Los Angeles County residents" invited for tests "to estimate the population prevalence of SARS-CoV-2 antibodies." The tests have an estimated sensitivity of 82.7% (95% CI of 76.0%-88.4%) and specificity of 99.5% (95% CI of 99.2%-99.7%).

The weighted proportion of participants who tested positive was 4.31% (bootstrap CI, 2.59%-6.24%). After adjusting for test sensitivity and specificity, the unweighted and weighted prevalence of SARS-CoV-2 antibodies was 4.34% (bootstrap CI, 2.76%-6.07%) and 4.65% (bootstrap CI, 2.52%-7.07%), respectively.

The estimate of 4.65% suggests that some "367,000 adults had SARS-CoV-2 antibodies, which is substantially greater than the 8,430 cumulative number of confirmed infections in the county on April 10." As such, "fatality rates based on confirmed cases may be higher than rates based on number of infections." Indeed, the reported fatality rate based on the number of confirmed cases (about 3% in the US) would be too high by a factor of 44! But "[s]election bias is likely. The estimated prevalence may be biased due to nonresponse or that symptomatic persons may have been more likely to participate. Prevalence estimates could change with new information on the accuracy of test kits used. Also, the study was limited to 1 county."
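
As a rough cross-check, plugging the reported figures into the Rogan-Gladen formula discussed above comes close to the adjusted estimate in the paper. (A sketch; the small difference from 4.65% presumably reflects the authors' bootstrap weighting.)

    def rogan_gladen(p, sen, spe):
        """Equation (3): adjusted prevalence from the observed (here, weighted) positive proportion."""
        return (p + spe - 1) / (sen + spe - 1)

    # Weighted positive proportion 4.31%, sensitivity 82.7%, specificity 99.5%
    print(round(rogan_gladen(0.0431, 0.827, 0.995), 4))   # about 0.046
    print(round(367000 / 8430))                           # about 44 -- estimated infections per confirmed case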

NOTES

  1. Gretchen Vogel, Antibody Surveys Suggesting Vast Undercount of Coronavirus Infections May Be Unreliable, Science, 368:350-351, Apr. 24, 2020, DOI:10.1126/science.368.6489.350, doi:10.1126/science.abc3831
  2. Id.
  3. Eran Bendavid, Bianca Mulaney, Neeraj Sood et al.,  COVID-19 Antibody Seroprevalence in Santa Clara County, California. medRxiv preprint dated Apr. 11, 2020.
  4. Id.
  5. Id.
  6. University of Miami Health System, Sylvester Researchers Collaborate with County to Provide Important COVID-19 Answers, Apr. 25, 2020, http://med.miami.edu/news/sylvester-researchers-collaborate-with-county-to-provide-important-covid-19
  7. Vogel, supra note 1.
  8. Andrew Gelman, Concerns with that Stanford Study of Coronavirus Prevalence, posted 19 April 2020, 9:14 am, on Statistical Modeling, Causal Inference, and Social Science, https://statmodeling.stat.columbia.edu/2020/04/19/fatal-flaws-in-stanford-study-of-coronavirus-prevalence/
  9. E.g., Joseph Gastwirth, The Statistical Precision of Medical Screening Procedures: Application to Polygraph and AIDS Antibodies Test Data, Stat. Sci. 1987, 2:213-222; D. J. Hand, Screening vs. Prevalence Estimation, Appl. Stat., 1987, 38:1-7; Fraser I. Lewis & Paul R. Torgerson, 2012, A Tutorial in Estimating the Prevalence of Disease in Humans and Animals in the Absence of a Gold Standard Diagnostic Emerging Themes in Epidemiology, 9:9, https://ete-online.biomedcentral.com/articles/10.1186/1742-7622-9-9; Walter J. Rogan & Beth Gladen, Estimating Prevalence from Results of a Screening-test. Am J Epidemiol. 1978, 107: 71-76; Niko Speybroeck, Brecht Devleesschauwer, Lawrence Joseph & Dirk Berkvens, Misclassification Errors in Prevalence Estimation: Bayesian Handling with Care, Int J Public Health, 2012, DOI:10.1007/s00038-012-0439-9
  10. H. Steinhaus, 1954, The Establishment of Paternity, Pr. Wroclawskiego Tow. Naukowego ser. A, no. 32. (discussed in Michael O. Finkelstein and William B. Fairley, A Bayesian Approach to Identification Evidence. Harvard Law Rev., 1970, 83:490-517). For a related discussion, see David H. Kaye, The Prevalence of Paternity in "One-Man" Cases of Disputed Parentage, Am. J. Human Genetics, 1988, 42:898-900 (letter).
  11. This expression is known as "the Rogan–Gladen adjusted estimator of 'true' prevalence" (Speybroeck et al., supra note 9) or "the classic Rogan-Gladen estimator of true prevalence in the presence of an imperfect diagnostic test." Lewis & Torgerson, supra note 9. The reference is to Rogan & Gladen, supra note 9.
  12. They call the proportion p = 1.5% the "unadjusted" estimate of prevalence.
  13. Some older discussions of the standard error in this situation can be found in Gastwirth, supra note 9; Hand, supra note 9. See also J. Reiczigel, J. Földi, & L. Ózsvári, Exact Confidence Limits for Prevalence of a Disease with an Imperfect Diagnostic Test, Epidemiology and Infection, 2010, 138:1674-1678.
  14. Beth Mole, Bloody math — Experts Demolish Studies Suggesting COVID-19 Is No Worse than Flu: Authors of widely publicized antibody studies “owe us all an apology,” one expert says, Ars Technica, Apr. 24, 2020, 1:33 PM, https://arstechnica.com/science/2020/04/experts-demolish-studies-suggesting-covid-19-is-no-worse-than-flu/
  15. https://twitter.com/wfithian/status/1252692357788479488 
  16. Vogel, supra note 1.
  17. A Bayesian analysis might help. See, e.g., Speybroeck et al., supra note 9.

UPDATED Apr. 27, 2020, to correct a typo in line (2) of the derivation of Equation (1), as pointed out by Geoff Morrison.

NOTES to update of 5/9/20

  1. OSU Newsroom, TRACE First Week’s Results Suggest Two People per 1,000 in Corvallis Were Infected with SARS-CoV-2, May 7, 2020, https://today.oregonstate.edu/news/trace-first-week%E2%80%99s-results-suggest-two-people-1000-corvallis-were-infected-sars-cov-2
  2. But "[t]he tests used in TRACE-COVID-19 collect material from the entrance of the nose and are more comfortable and less invasive than the tests that collect secretions from the throat and the back of the nose." Id.