Monday, August 10, 2020

Applying the Justice Department's Policy on a Reasonable Degree of Certainty in United States v. Hunt

In United States v. Hunt, \1/ Senior U.S. District Court Judge David Russell disposed of a challenge to proposed firearms-toolmark testimony. The first part of the unpublished opinion dealt with the scientific validity (as described in Daubert v. Merrell Dow Pharmaceuticals, Inc.) of the Association of Firearms and Toolmark Examiners (AFTE) "Theory of Identification As It Relates to Toolmarks." Mostly, this portion of the opinion is of the form: "Other courts have accepted the government's arguments. We agree." This kind of opinion is common for forensic-science methods that have a long history of judicial acceptance--whether or not such acceptance is deserved.

The unusual part of the opinion comes at the end. There, the court misconstrues the Department of Justice's internal policy on the use of the phrase "reasonable certainty" to characterize an expert conclusion for associating spent ammunition with a gun that might have fired it. This posting describes some of the history of that policy and suggests that (1) the court may have unwittingly rejected it; (2) the court's order prevents the experts from expressing the same AFTE theory that the court deemed scientifically valid; and (3) the government can adhere to its written policy on avoiding various expressions of "reasonable certainty" and still try the case consistently with the judge's order.

I. The Proposed Testimony

Dominic Hunt was charged with being a felon in possession of ammunition recovered from two shootings. The government proposed to use two firearm and toolmark examiners--Ronald Jones of the Oklahoma City Police Department and Howard Kong of the Bureau of Alcohol, Tobacco, Firearms and Explosives' (ATF) Forensic Science Laboratory--to establish that the ammunition was not fired from the defendant's brother's pistol--or his cousin's pistol. To eliminate those hypotheses, "the Government intend[ed] its experts to testify" that "the unknown firearm was likely a Smith & Wesson 9mm Luger caliber pistol," and that "the probability that the ammunition ... were fired in different firearms is so small it is negligible."

This testimony differs from the usual opinion testimony that ammunition components recovered from the scene of a shooting came from a specific, known gun associated with a defendant. It appears that the "unknown" Luger pistol was never discovered and thus that the examiners could not use it to fire test bullets for comparison purposes. Their opinion was that several of the shell casings had marks and other features that were so similar that they must have come from the same gun of the type they specified.

But the reasoning process the examiners used to arrive at this conclusion--which postulates "class," "subclass," and conveniently designated "individual" characteristics--is the same as the one used in the more typical case of an association to a known gun. Perhaps heartened by several recent trial court opinions severely limiting testimony about the desired individualizing characteristics, Hunt moved "to Exclude Ballistic Evidence, or Alternatively, for a Daubert Hearing."

II. The District Court's Order

Hunt lost. After rejecting the pretrial objections to the scientific foundation of the examiners' opinions and to the two examiners' application of the accepted methods, Judge Russell turned to the defendant's "penultimate argument [seeking] limitations on the Government's firearm toolmark experts." He embraced the government's response "that no limitation is necessary because Department of Justice guidance sufficiently limits a firearm examiner's testimony."

The odd thing is that he turned the Department's written policy on its head by embracing a form of testimony that the policy sought to eliminate. And the court did this immediately after it purported to implement DoJ's "reasonable" policy. The relevant portion of the opinion begins:

In accordance with recent guidance from the Department of Justice, the Government's firearm experts have already agreed to refrain from expressing their findings in terms of absolute certainty, and they will not state or imply that a particular bullet or shell casing could only have been discharged from a particular firearm to the exclusion of all other firearms in the world. The Government has also made clear that it will not elicit a statement that its experts' conclusions are held to a reasonable degree of scientific certainty.
The Court finds that the limitations mentioned above and prescribed by the Department of Justice are reasonable, and that the Government's experts should abide by those limitations. To that end, the Governments experts:
[S]hall not [1] assert that two toolmarks originated from the same source to the exclusion of all other sources.... [2] assert that examinations conducted in the forensic firearms/toolmarks discipline are infallible or have a zero error rate.... [3] provide a conclusion that includes a statistic or numerical degree of probability except when based on relevant and appropriate data.... [4] cite the number of examinations conducted in the forensic firearms/toolmarks discipline performed in his or her career as a direct measure for the accuracy of a proffered conclusion..... [5] use the expressions ‘reasonable degree of scientific certainty,’ ‘reasonable scientific certainty,’ or similar assertions of reasonable certainty in either reports or testimony unless required to do so by [the Court] or applicable law. \2/

So far it seems that the court simply told the government's experts (including the city police officer) to toe the federal line. But here comes the zinger. The court abruptly turned around and decided to ignore the Attorney General's mandate that DoJ personnel should strive to avoid expressions of "reasonable scientific certainty" and the like. The court wrote:

As to the fifth limitation described above, the Court will permit the Government's experts to testify that their conclusions were reached to a reasonable degree of ballistic certainty, a reasonable degree of certainty in the field of firearm toolmark identification, or any other version of that standard. See, e.g., U.S. v. Ashburn, 88 F. Supp. 3d 239, 249 (E.D.N.Y. 2015) (limiting testimony to a “reasonable degree of ballistics certainty” or a “reasonable degree of certainty in the ballistics field.”); U.S. v. Taylor, 663 F. Supp. 2d 1170, 1180 (D.N.M. 2009) (limiting testimony to a “reasonable degree of certainty in the firearms examination field.”). Accordingly, the Government's experts should not testify, for example, that “the probability the ammunition charged in Counts Eight and Nine were fired in different firearms is so small it is negligible” ***.

So the experts can testify that they have "reasonable certainty" that the ammunition was fired from the same gun, but they cannot say the probability that it was fired from a different gun is small enough that the alternative hypothesis has a negligible probability? Even though that is how experts in the field achieve "reasonable certainty" (according to the AFTE description that the court held was scientifically valid)? This part of the opinion hardly seems coherent. \3/

III. The Tension Between the Order and the ULTR

The two cases that the court cited for its "reasonable ballistic certainty" ruling were decided years before the Department adopted the Uniform Language on Testimony and Reporting (ULTR) that the court called reasonable, and such talk of "ballistic certainty" and "any other version of that standard" is precisely what the Department had resolved to avoid if at all possible. The history of the "fifth limitation" has an easily followed paper trail that compels the conclusion that this limitation was intended to avoid precisely the kind of testimony that Judge Russell's order permits.

Let's start with the ULTR quoted (in part) by the court. It has a footnote to the "fifth limitation" that instructs readers to "See Memorandum from the Attorney General to Heads of Department Components (Sept. 9, 2016), https://www.justice.gov/opa/file/891366/download." The memorandum's subject is "Recommendations of the National Commission on Forensic Science; Announcement for NCFS Meeting Eleven." In it, Attorney General Loretta Lynch wrote:

As part of the Department's ongoing coordination with the National Commission on Forensic Science (NCFS), I am responding today to several NCFS recommendations to advance and strengthen forensic science. *** I am directing Department components to *** work with the Deputy Attorney General to implement these policies *** .

1. Department forensic laboratories will review their policies and procedures to ensure that forensic examiners are not using the expressions "reasonable scientific certainty" or "reasonable [forensic discipline] certainty" in their reports or testimony. Department prosecutors will abstain from use of these expressions when presenting forensic reports or questioning forensic experts in court unless required by a judge or applicable law.

The NCFS was adamant that judges should not require "reasonable [forensic discipline] certainty." Its recommendation to the Attorney General explained that

Forensic discipline conclusions are often testified to as being held “to a reasonable degree of scientific certainty” or “to a reasonable degree of [discipline] certainty.” These terms have no scientific meaning and may mislead factfinders about the level of objectivity involved in the analysis, its scientific reliability and limitations, and the ability of the analysis to reach a conclusion. Forensic scientists, medical professionals and other scientists do not routinely express opinions or conclusions “to a reasonable scientific certainty” outside of the courts. Neither the Daubert nor Frye test of scientific admissibility requires its use, and consideration of caselaw from around the country confirms that use of the phrase is not required by law and is primarily a relic of custom and practice. There are additional problems with this phrase, including:
• There is no common definition within science disciplines as to what threshold establishes “reasonable” certainty. Therefore, whether couched as “scientific certainty” or “[discipline] certainty,” the term is idiosyncratic to the witness.
• The term invites confusion when presented with testimony expressed in probabilistic terms. How is a lay person, without either scientific or legal training, to understand an expert’s “reasonable scientific certainty” that evidence is “probably” or possibly linked to a particular source?

Accordingly, the NCFS recommended that the Attorney General "direct all attorneys appearing on behalf of the Department of Justice (a) to forego use of these phrases ... unless directly required by judicial authority as a condition of admissibility for the witness’ opinion or conclusion ... ." As we have seen, the Attorney General adopted this recommendation. \4/

IV. How the Prosecutors and the ATF Expert Can Follow Departmental Policy

Interestingly, Judge Russell's opinion does not require the lawyers and the witnesses to use the expressions of certainty. It "permits" them to do so (seemingly on the theory that this practice is just what the Department contemplated). But not all that is permitted is required. To be faithful to Department policy, the prosecution cannot accept the invitation. The experts can give their conclusion that the ammunition came from a single gun. However, they should not add, and the prosecutor may not ask them to swear to, some expression of "reasonable [discipline] certainty" because: (1) the Department's written policy requires them to avoid it "unless required by a judge or applicable law"; (2) the judge has not required it; and (3) applicable law does not require it. \5/

The situation could change if at the trial, Judge Russell were to intervene and to ask the experts about "reasonable certainty." In that event, the government should remind the court that its policy, for the reasons stated by the National Commission and accepted by the Attorney General, is to avoid these expressions. If the court then rejects the government's position, the experts must answer. But even then, unless the defense "opens the door" by cross-examining on the meaning of "reasonable [discipline] certainty," there is no reason for the prosecution to use the phrase in its examination of witnesses or closing arguments.

NOTES

  1. No. CR-19-073-R, 2020 WL 2842844 (W.D. Okla. June 1, 2020).
  2. The ellipses in the quoted part of the opinion are the court's. I have left out only the citations in the opinion to the Department's Uniform Language on Testimony and Reporting (ULTR) for firearms-toolmark identifications. That document is a jumble that is a subject for another day.
  3. Was Judge Russell thinking that the "negligible probability" judgment is valid (and hence admissible as far as the validity requirement of Daubert goes) but that it would be unfairly prejudicial or confusing to give the jury this valid judgment? Is the court's view that "negligible" is too strong a claim in light of what is scientifically known? If such judgments are valid, as AFTE maintains, they are not generally prejudicial. Prejudice does not mean damage to the opponent's case that arises from the very fact that evidence is powerful.
  4. At the NCFS meeting at which the Department informed the Commission that it was adopting its recommendation, "Commission member, theoretical physicist James Gates, complimented the Department for dealing with these words that 'make scientists cringe.'" US Dep't of Justice to Discourage Expressions of "Reasonable Scientific Certainty," Forensic Sci., Stat. & L., Sept. 12, 2016, http://for-sci-law.blogspot.com/2016/09/us-dept-of-justice-to-discourage.html.
  5. In a public comment to the NCFS, then commissioner Ted Hunt (now the Department's senior advisor on forensic science) cited the "ballistic certainty" line of cases as indicative of a problem with the NCFS recommendation as then drafted but agreed that applicable law did not require judges to follow the practice of insisting on or entertaining expressions of certitude. See "Reasonable Scientific Certainty," the NCFS, "the Law of the Courtroom," and That Pesky Passive Voice, Forensic Sci., Stat. & L., Mar. 1, 2016, http://for-sci-law.blogspot.com/2016/03/reasonable-scientific-certainty-ncfs.html; Is "Reasonable Scientific Certainty" Unreasonable?, Forensic Sci., Stat. & L., Feb. 26, 2016, http://for-sci-law.blogspot.com/2016/02/is-reasonable-scientific-certainty.html (concluding that
    In sum, there are courts that find comfort in phrases like "reasonable scientific certainty," and a few courts have fallen back on variants such as "reasonable ballistic certainty" as a response to arguments that identification methods cannot ensure that an association between an object or person and a trace is 100% certain. But it seems fair to say that "such terminology is not required" -- at least not by any existing rule of law.)

Tuesday, May 5, 2020

How Do Forensic-science Tests Compare to Emergency COVID-19 Tests?

The Wall Street Journal recently reported that
At least 160 antibody tests for Covid-19 entered the U.S. market without previous FDA scrutiny on March 16, because the agency felt then that it was most important to get them to the public quickly. Accurate antibody testing is a potentially important tool for public-health officials assessing how extensively the coronavirus has swept through a region or state.
Now, the FDA will require test companies to submit an application for emergency-use authorization and require them to meet standards for accuracy. Tests will need to be found 90% “sensitive,” or able to detect coronavirus antibodies, and 95% “specific,” or able to avoid false positive results. \1/
How many test methods in forensic science have been shown to perform at or above these emergency levels? It is hard to say. For FDA-authorized tests, one can find the manufacturers' figures on the FDA's website, but for forensic-science tests, there is no comparable repository of performance data. The test-method standards approved by voluntary consensus bodies such as the Academy Standards Board and ASTM Inc. rarely state the performance characteristics of the methods they prescribe.

For the FDA's minimum operating characteristics of a yes-no test, the likelihood ratio for a positive result is Pr(+ | antibodies) / Pr(+ | no antibodies) = 0.90/(1 − 0.95) = 18. The likelihood ratio for a negative result is Pr(− | no antibodies) / Pr(− | antibodies) = 0.95/(1 − 0.90) = 9.5. In other words, a clean bill of health from a serological test with minimally acceptable performance would occur almost ten times as frequently for people without detectable antibodies as for people with them.
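To make the arithmetic explicit, here is a minimal Python sketch of the likelihood-ratio calculation for a binary test at the FDA's emergency minimums. The function names are mine, chosen for illustration; nothing in the code comes from the FDA or the cited article.

```python
# Likelihood ratios for a yes/no antibody test at the FDA's emergency minimums
# (90% sensitivity, 95% specificity). Illustrative sketch only.

def lr_positive(sen: float, spe: float) -> float:
    """Pr(+ | antibodies) / Pr(+ | no antibodies)."""
    return sen / (1 - spe)

def lr_negative(sen: float, spe: float) -> float:
    """Pr(- | no antibodies) / Pr(- | antibodies)."""
    return spe / (1 - sen)

sen, spe = 0.90, 0.95
print(f"LR for a positive result: {lr_positive(sen, spe):.1f}")  # 18.0
print(f"LR for a negative result: {lr_negative(sen, spe):.1f}")  # 9.5
```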

According to an Ad Hoc Working Group of the forensic Scientific Working Group on DNA Analysis Methods (SWGDAM), such a likelihood ratio may be described as providing "limited support." This description is near the lower end of a scale for likelihood ratios. These "verbal qualifiers" go from "uninformative" (L=1), to "limited" (2 to 99), "moderate" (100 to 999), "strong" (1,000 to 999,999), and, finally, "very strong" (1,000,000 or more). \2/
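A rough translation of that verbal scale into code, with the cutoffs taken from the ranges quoted above (this is only an illustration of the published table, not an official SWGDAM tool; how values between 1 and 2 should be labeled is not spelled out in the quoted scale):

```python
def swgdam_qualifier(lr: float) -> str:
    """Map a likelihood ratio to the SWGDAM working group's verbal qualifiers."""
    if lr < 2:
        return "uninformative"       # the scale anchors this label at L = 1
    if lr < 100:
        return "limited support"     # 2 to 99
    if lr < 1_000:
        return "moderate support"    # 100 to 999
    if lr < 1_000_000:
        return "strong support"      # 1,000 to 999,999
    return "very strong support"     # 1,000,000 or more

print(swgdam_qualifier(18))    # limited support
print(swgdam_qualifier(9.5))   # limited support
```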

A more finely graded table appears "for illustration purposes" in an ENFSI [European Network of Forensic Science Institutes] Guideline for Evaluative Reporting in Forensic Science. The table classifies L = 9.5 as "weak support." \3/

NOTES
  1. Thomas M. Burton, FDA Sets Standards for Coronavirus Antibody Tests in Crackdown on Fraud, Wall Street J., Updated May 4, 2020 8:24 pm ET, https://www.wsj.com/articles/fda-sets-standards-for-coronavirus-antibody-tests-in-crackdown-on-fraud-11588605373
  2. Recommendations of the SWGDAM Ad Hoc Working Group on Genotyping Results Reported as Likelihood Ratios, 2018, available via https://www.swgdam.org/publications.
  3. ENFSI Guideline for Evaluative Reporting in Forensic Science, 2016, p. 17, http://enfsi.eu/wp-content/uploads/2016/09/m1_guideline.pdf.

Saturday, April 25, 2020

Estimating Prevalence from Serological Tests for COVID-19 Infections: What We Don't Know Can Hurt Us

A statistical debate has emerged over the proportion of the population that has been infected with SARS-CoV-2. It is a crucial number in arguments about "herd immunity" and public health measures to control the COVID-19 pandemic. A news article in yesterday's issue of Science reports that
[S]urvey results, from Germany, the Netherlands, and several locations in the United States, find that anywhere from 2% to 30% of certain populations have already been infected with the virus. The numbers imply that confirmed COVID-19 cases are an even smaller fraction of the true number of people infected than many had estimated and that the vast majority of infections are mild. But many scientists question the accuracy of the antibody tests ... .\1/
The first sentence reflects a common assumption -- that the reported proportion of positive test results directly indicates the prevalence of infections where the tested people live. The last sentence gives one reason this might not be the case. But the fact that tests for antibodies are inaccurate does not necessarily preclude good estimates of the prevalence. It may still be possible to adjust the proportion up or down to arrive at the percentage "already ... infected with the virus." There is a clever and simple procedure for doing that -- under certain conditions. Before describing it, let's look at another, more easily grasped threat to estimating prevalence -- "sampling bias."

Sampling Design: Who Gets Tested?

Inasmuch as the people tested in the recent studies were not selected through random sampling from any well defined population, the samples of test results may not be representative of what the outcome would be if the entire population of interest were tested. Several sources of bias in sampling have been noted.

A study of a German town "found antibodies to the virus in 14% of the 500 people tested. By comparing that number with the recorded deaths in the town, the study suggested the virus kills only 0.37% of the people infected. (The rate for seasonal influenza is about 0.1%.)" But the researchers "sampled entire households. That can lead to overestimating infections, because people living together often infect each other." \2/ Of course, one can count just one individual per household, so this clumping does not sound like a fundamental problem.

"A California serology study of 3300 people released last week in a preprint [found 50] antibody tests were positive—about 1.5%. [The number in the draft paper by Eran Bendavid, Bianca Mulaney, Neeraj Sood, et al. is 3330 \3/] But after adjusting the statistics to better reflect the county's demographics, the researchers concluded that between 2.49% and 4.16% of the county's residents had likely been infected." However, the Stanford researchers "recruit[ed] the residents of Santa Clara county through ads on Facebook," which could have "attracted people with COVID-19–like symptoms who wanted to be tested, boosting the apparent positive rate." \4/ This "unhealthy volunteer" bias is harder to correct with this study design.

"A small study in the Boston suburb of Chelsea has found the highest prevalence of antibodies so far. Prompted by the striking number of COVID-19 patients from Chelsea colleagues had seen, Massachusetts General Hospital pathologists ... collected blood samples from 200 passersby on a street corner. ... Sixty-three were positive—31.5%." As the pathologists acknowledged, pedestrians on a single corner "aren't a representative sample." \5/

Even efforts to find subjects at random will fall short of the mark because of self-selection on the part of subjects. "Unhealthy volunteer" bias is a threat even in studies like one planned for Miami-Dade County that will use random-digit dialing to utility customers to recruit subjects. \6/

In sum, sampling bias could be a significant problem in many of these studies. But it is something epidemiologists always face, and enough quick and dirty surveys (with different possible sources of sampling bias) could give a usable indication of what better designed studies would reveal.

Measurement Error: No Gold Standard

A second criticism holds that because the "specificity" of the serological tests could be low, the estimates of prevalence are exaggerated. "Specificity" refers to the extent to which the test (correctly) does not signal an infection when applied to an uninfected individual. If it (incorrectly) signals an infection for these individuals, it causes false positives. Low specificity means lots of false positives. Worries over specificity recur throughout the Science article's summary of the controversy:
  • "The result carries several large caveats. The team used a test whose maker, BioMedomics, says it has a specificity of only about 90%, though Iafrate says MGH's own validation tests found a specificity of higher than 99.5%."
  • "Because the absolute numbers of positive tests were so small, false positives may have been nearly as common as real infections."
  • "Streeck and his colleagues claimed the commercial antibody test they used has 'more than 99% specificity,' but a Danish group found the test produced three false positives in a sample of 82 controls, for a specificity of only 96%. That means that in the Heinsberg sample of 500, the test could have produced more than a dozen false positives out of roughly 70 the team found." \7/
Likewise, political scientist and statistician Andrew Gelman blogged that no screening test that lacks a very high specificity can produce a usable estimate of population prevalence -- at least when the proportion of tests that are positive is small. This limitation, he insisted, is "the big one." \8/ He presented the following as a devastating criticism of the Santa Clara study (with my emphasis added):
Bendavid et al. estimate that the sensitivity of the test is somewhere between 84% and 97% and that the specificity is somewhere between 90% and 100%. I can never remember which is sensitivity and which is specificity, so I looked it up on wikipedia ... OK, here are [sic] concern is actual negatives who are misclassified, so what’s relevant is the specificity. That’s the number between 90% and 100%.
If the specificity is 90%, we’re sunk.
With a 90% specificity, you’d expect to see 333 positive tests out of 3330, even if nobody had the antibodies at all. Indeed, they only saw 50 positives, that is, 1.5%, so we can be pretty sure that the specificity is at least 98.5%. If the specificity were 98.5%, the observed data would be consistent with zero ... . On the other hand, if the specificity were 100%, then we could take the result at face value.
So how do they get their estimates? Again, the key number here is the specificity. Here’s exactly what they say regarding specificity:
A sample of 30 pre-COVID samples from hip surgery patients were also tested, and all 30 were negative. . . . The manufacturer’s test characteristics relied on . . . pre-COVID sera for negative gold standard . . . Among 371 pre-COVID samples, 369 were negative.
This gives two estimates of specificity: 30/30 = 100% and 369/371 = 99.46%. Or you can combine them together to get 399/401 = 99.50%. If you really trust these numbers, you’re cool: with y=399 and n=401, we can do the standard Agresti-Coull 95% interval based on y+2 and n+4, which comes to [98.0%, 100%]. If you go to the lower bound of that interval, you start to get in trouble: remember that if the specificity is less than 98.5%, you’ll expect to see more than 1.5% positive tests in the data no matter what!
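As a check on the interval quoted above, here is a short Python sketch of the standard Agresti-Coull calculation for 399 negatives out of 401 pre-COVID samples. The formula is the textbook one Gelman describes (add 2 successes and 4 trials); the code is mine, not taken from the study or his post.

```python
import math

def agresti_coull(successes: int, trials: int, z: float = 1.96):
    """Agresti-Coull 95% interval: add 2 successes and 4 trials, then apply the Wald formula."""
    y, n = successes + 2, trials + 4
    p = y / n
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

lo, hi = agresti_coull(399, 401)
print(f"specificity interval: [{lo:.1%}, {hi:.1%}]")  # roughly [98.0%, 100.0%]
```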
To be sure, the fact that the serological tests are not perfectly accurate in detecting an immune response makes it dangerous to rely on the proportion of people tested who test positive as the measure of the proportion of the population who have been infected. Unless the test is perfectly sensitive (is certain to be positive for an infected person) and specific (certain to be negative for an uninfected person), the observed proportion will not be the true proportion of past infections -- even in the sample. As we will see shortly, however, there is a simple way to correct for imperfect sensitivity and specificity in estimating the population prevalence, and there is a voluminous literature on using imperfect screening tests to estimate population prevalence. \9/ Recognizing what one wants to estimate leads quickly to the conclusion that the raw proportion of positives among the tested group, which is what the media usually report (even with a margin of error to account for sampling variability), is not generally the right statistic to focus on.

Moreover, the notion that because false positives inflate an estimate of the number who have been infected, only the specificity is relevant is misconceived. Sure, false positives (imperfect specificity) inflate the estimate. But false negatives (imperfect sensitivity) simultaneously deflate it. Both types of misclassifications should be considered.

How, then, do epidemiologists doing surveillance studies normally handle the fact that the tests for a disease are not perfectly accurate? Let's use p to denote the positive proportion in the sample of people tested -- for example, the 1.5% in the Santa Clara sample or the 21% figure for New York City that Governor Andrew Cuomo announced in a tweet. The performance of the serological test depends on its true sensitivity SEN and true specificity SPE. For the moment, let's assume that these are known parameters of the test. In reality, they are estimated from separate studies that themselves have sampling errors, but we'll just try out some values for them. First, let's derive a general result that contains ideas presented in 1954 in the legal context of serological tests for parentage. \10/

Let PRE designate the true prevalence in the population (such as everyone in Santa Clara county or New York City) from which a sample of people to be tested is drawn. We pick a person totally at random. That person either has harbored the virus (inf) or not (uninf). The former probability we abbreviate as Pr(inf); the latter is Pr(uninf). The probability that the individual tests positive is
  Pr(test+) = Pr[test+ & (inf or uninf)]
     = Pr[(test+ & inf) or (test+ & uninf)]
     = Pr(test+ & inf) + Pr(test+ & uninf)
     = Pr(test+ | inf)Pr(inf) + Pr(test+ | uninf)Pr(uninf)     (1)*
In words, the probability of the positive result is (a) the probability the test is positive if the person has been infected, weighted by the probability he or she has been infected, plus (b) the probability it is positive if the person has not been infected, weighted by the probability of no infection.

We can rewrite (1) in terms of the sensitivity and specificity. SEN is Pr(test+|inf) -- the probability of a positive result if the person has been infected. SPE is Pr(test–|uninf) -- the probability of a negative result if the person has not been infected. For the random person, the probability of infection is just the true prevalence in the population, PRE. So the first product in (1) is simply SEN × PRE.

To put SPE into the second term, we note that the probability that an event happens is 1 minus the probability that it does not happen. Consequently, we can write the second term as (1 – SPE) × (1 – PRE). Thus, we have
     Pr(test+) = SEN PRE + (1 – SPE)(1 – PRE)           (2)
Suppose, for example, that SEN = 70%, SPE = 80%, and PRE = 10%. Then Pr(test+) = 1/5 + PRE/2 = 0.25. The expected proportion of observed positives in a random sample would be 0.25 -- a substantial overestimate of the true prevalence PRE = 0.10.
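A minimal Python sketch of Equation (2), using the example values in the text (the variable names track the text; the code is illustrative only):

```python
def prob_test_positive(sen: float, spe: float, pre: float) -> float:
    """Equation (2): Pr(test+) = SEN*PRE + (1 - SPE)*(1 - PRE)."""
    return sen * pre + (1 - spe) * (1 - pre)

# SEN = 70%, SPE = 80%, true prevalence PRE = 10%.
print(f"{prob_test_positive(0.70, 0.80, 0.10):.2f}")  # 0.25 -- well above the true prevalence of 0.10
```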

In this example, with rather poor sensitivity, using the observed proportion p of positives in a large random sample to estimate the prevalence PRE would be foolish. So we should not blithely substitute p for PRE. Indeed, doing so can give us a bad estimate even when the test has perfect specificity. When SPE = 1, Equation (2) reduces to Pr(test+) = SEN PRE. In this situation, the sample proportion does not estimate the prevalence -- it estimates only a fraction of it.

Clearly, good sensitivity is not a sufficient condition for using the sample proportion p to estimate the true prevalence PRE, even in huge samples. Imperfections in both SEN and SPE cause misclassifications, and they work in opposite directions. Poor specificity leads to false positives, but poor sensitivity leads to true positives being counted as negatives. The net effect of these opposing forces is mediated by the prevalence.

To correct for the expected misclassifications in a large random sample, we can use the observed proportion of positives, not as an estimator of the prevalence, but as an estimator of Pr(test+). Setting p = Pr(test+), we solve for PRE to obtain an estimated prevalence of
      pre = (p + SPE – 1)/(SPE + SEN – 1)         (3) \11/
For the Santa Clara study, Bendavid et al. found p = 50/3330 = 1.5%, and suggested that SEN = 80.3% and SPE = 99.5%. \12/ For these values, the estimated prevalence is pre = 1.25%. If we change SPE to 98.5%, where Gelman wrote that "you get into trouble," the estimate is pre = 0, which is clearly too small. Instead, the researchers used equation (3) only after they transformed their stratified sample data to fit the demographics of the county. That adjustment produced an inferred proportion p' = 2.81%.  Using that adjusted value for p, Equation (3) becomes
      pre = (p' + SPE – 1)/(SEN + SPE – 1)         (4)
For the SPE of 98.5%, equation (4) gives an estimated prevalence of pre = 1.66%. For 99.5% it is 2.9%. Although some critics have complained about using Equation (3) with the demographically adjusted proportion p' shown in (4), if the adjustment provides a better picture of the full population, it seems like the right proportion to use for arriving at the point estimate pre.
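Here is a hedged Python sketch of the adjustment in Equations (3) and (4), run with the Santa Clara figures quoted above (the inputs come from the text; the function name is mine, and the code is only an illustration):

```python
def rogan_gladen(p: float, sen: float, spe: float) -> float:
    """Equations (3)/(4): adjusted prevalence = (p + SPE - 1) / (SEN + SPE - 1)."""
    return (p + spe - 1) / (sen + spe - 1)

sen = 0.803
print(f"{rogan_gladen(0.0150, sen, 0.995):.2%}")  # about 1.25%: the raw 1.5% adjusted downward
print(f"{rogan_gladen(0.0150, sen, 0.985):.2%}")  # essentially 0%: the estimate collapses at 98.5% specificity
print(f"{rogan_gladen(0.0281, sen, 0.985):.2%}")  # about 1.66%: demographically adjusted p'
print(f"{rogan_gladen(0.0281, sen, 0.995):.2%}")  # about 2.9%: p' with 99.5% specificity
```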

Nevertheless, there remains a sense in which the specificity is key. Given SEN = 80.3%, dropping SPE to 97.2% gives pre = 0. Ouch! When SPE drops below 97.2%, pre turns negative, which is ridiculous. In fact, this result holds for many other values of SEN. So one does need a high specificity for Equation (3) to be plausible -- at least when the true prevalence (and hence p') is small. But as PRE (and thus p') grows larger, Equations (3) and (4) look better. For example, if p = 20%, then pre is 22% even with SPE = 97.2% and SEN = 80.3%. Indeed, with this large a p, even a specificity of only SPE = 90% still gives a substantial pre = 14.2%.
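The same arithmetic, repeated here as a self-contained sketch, shows both the breakdown at low specificity and the greater robustness when the positive proportion is larger (all inputs are the ones discussed in the text; the code is only an illustration):

```python
def rogan_gladen(p: float, sen: float, spe: float) -> float:
    # Same adjustment as in the previous sketch: (p + SPE - 1) / (SEN + SPE - 1).
    return (p + spe - 1) / (sen + spe - 1)

sen = 0.803
# Small positive proportion: the numerator hits zero at SPE = 1 - p' and goes negative below that.
print(rogan_gladen(0.0281, sen, 0.972))   # about 0.0001 -- essentially zero
print(rogan_gladen(0.0281, sen, 0.960))   # negative, which is nonsensical as a prevalence
# Larger positive proportion: the adjustment stays usable even with mediocre specificity.
print(rogan_gladen(0.20, sen, 0.972))     # about 0.222
print(rogan_gladen(0.20, sen, 0.90))      # about 0.142
```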

Random Sampling Error

I have pretended the sensitivity and specificity are known with certainty.  Equation (3) only gives a point estimate for true prevalence. It does not account for sampling variability -- either in p (and hence p') or in the estimates (sen and spe) of SEN and SPE, respectively, that have to be plugged into (3). To be clear that we are using estimates from the separate validity studies rather than the unknown true values for SEN and SPE, we can write the relevant equation as follows:
      pre = (p + spe – 1)/(sen + spe – 1)         (5)
Dealing with the variance of p (or p') with sample sizes like 3300 is not hard. Free programs on the web give confidence intervals based on various methods for arriving at the standard error for pre considering the size of the random sample that produced the estimate p. (Try it out.)
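One rough way to do this, sketched below under strong simplifying assumptions: compute a Wilson score interval for p and push its endpoints through Equation (5) using the point estimates for sen and spe. This ignores the uncertainty in sen and spe and is not the method Bendavid et al. used; it only illustrates the sampling-variability step for p.

```python
import math

def wilson_interval(k: int, n: int, z: float = 1.96):
    """Wilson score 95% interval for a binomial proportion k/n."""
    p_hat = k / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def rogan_gladen(p: float, sen: float, spe: float) -> float:
    return (p + spe - 1) / (sen + spe - 1)

lo_p, hi_p = wilson_interval(50, 3330)   # raw positives in the Santa Clara sample
sen, spe = 0.803, 0.995                  # point estimates quoted in the text
lo_pre = max(0.0, rogan_gladen(lo_p, sen, spe))
hi_pre = rogan_gladen(hi_p, sen, spe)
print(f"rough interval for pre: [{lo_pre:.1%}, {hi_pre:.1%}]")  # about [0.8%, 1.8%]
```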

Our uncertainty about SEN and SPE is greater (at this point, because the tests rushed into use have not been well validated, as discussed in previous postings). Bendavid et al. report a confidence interval for PRE that is said to account for the variances in all three estimators -- p, sen, and spe. \13/ However, a savage report in Ars Technica \14/ collects tweets such as a series complaining that "[t]he confidence interval calculation in their preprint made demonstrable math errors." \15/ Nonetheless, it should be feasible to estimate how much the sampling error in the validity studies for the serological tests contributes to the uncertainty in pre as an estimator of the population prevalence PRE. The researchers, at any rate, are convinced that "[t]he argument that the test is not specific enough to detect real positives is deeply flawed." \16/ Although they are working with a relatively low estimated prevalence, they could be right. \17/ If the specificity is in the range they claim, their estimates of prevalence should not be dismissed out of hand.

* * *

The take-away message is that a gold standard serological test is not always necessary for effective disease surveillance. It is true that unless the test is highly accurate, the positive test proportion p (or a proportion p' adjusted for a stratified sample) is not a good estimator of the true prevalence PRE. That has been known for quite some time and is not in dispute. At the same time, pre sometimes can be a useful estimator of true prevalence. That too is not in dispute. Of course, as always, good data are better than post hoc corrections, but for larger prevalences, serological tests may not require 99.5% specificity to produce useful estimates of how many people have been infected by SARS-CoV-2.

UPDATE: 5/9/20: An Oregon State University team in Corvallis is going door to door in an effort to test a representative sample of the college town's population. \1/ A preliminary report released to the media gives a raw positive proportion of 2 per 1,000. Inasmuch as the sketchy accounts indicate that the samples collected are nasal swabs, the proportion cannot be directly compared to the proportions positive for the serological tests mentioned above. The nasal swabbing is done by the respondents in the survey rather than by medical personnel, \2/ and the results pertain to the presence of the virus at the time of the swabbing rather than to an immune response that may be the result of exposure in the past.

UPDATE: 7/9/20: Writing on “SARS-CoV-2 seroprevalence in COVID-19 hotspots” in The Lancet on July 6, Isabella Eckerle and Benjamin Meyer report that
Antibody cross-reactivity with other human coronaviruses has been largely overcome by using selected viral antigens, and several commercial assays are now available for SARS-CoV-2 serology. ... The first SARS-CoV-2 seroprevalence studies from cohorts representing the general population have become available from COVID-19 hotspots such as China, the USA, Switzerland, and Spain. In The Lancet, Marina Pollán and colleagues and Silvia Stringhini and colleagues separately report representative population-based seroprevalence data from Spain and Switzerland collected from April to early May this year. Studies were done in both the severely affected urban area of Geneva, Switzerland, and the whole of Spain, capturing both strongly and less affected provinces. Both studies recruited randomly selected participants but excluded institutionalised populations ... . They relied on IgG as a marker for previous exposure, which was detected by two assays for confirmation of positive results.

The Spanish study, which included more than 60,000 participants, showed a nationwide seroprevalence of 5·0% (95% CI 4·7–5·4; specificity–sensitivity range of 3·7% [both tests positive] to 6·2% [at least one test positive]), with urban areas around Madrid exceeding 10% (eg, seroprevalence by immunoassay in Cuenca of 13·6% [95% CI 10·2–17·8]). ... Similar numbers were obtained across the 2766 participants in the Swiss study, with seroprevalence data from Geneva reaching 10·8% (8·2–13·9) in early May. The rather low seroprevalence in COVID-19 hotspots in both studies is in line with data from Wuhan, the epicentre and presumed origin of the SARS-CoV-2 pandemic. Surprisingly, the study done in Wuhan approximately 4–8 weeks after the peak of infection reported a low seroprevalence of 3·8% (2·6–5·4) even in highly exposed health-care workers, despite an overwhelmed health-care system.

The key finding from these representative cohorts is that most of the population appears to have remained unexposed to SARS-CoV-2, even in areas with widespread virus circulation. [E]ven countries without strict lockdown measures have reported similarly low seroprevalence—eg, Sweden, which reported a prevalence of 7·3% at the end of April—leaving them far from reaching natural herd immunity in the population.

NOTES
  1. Gretchen Vogel, Antibody Surveys Suggesting Vast Undercount of Coronavirus Infections May Be Unreliable, Science, 368:350-351, Apr. 24, 2020, DOI:10.1126/science.368.6489.350, doi:10.1126/science.abc3831
  2. Id.
  3. Eran Bendavid, Bianca Mulaney, Neeraj Sood et al.,  COVID-19 Antibody Seroprevalence in Santa Clara County, California. medRxiv preprint dated Apr. 11, 2020,
  4. Id.
  5. Id.
  6. University of Miami Health System, Sylvester Researchers Collaborate with County to Provide Important COVID-19 Answers, Apr. 25, 2020, http://med.miami.edu/news/sylvester-researchers-collaborate-with-county-to-provide-important-covid-19
  7. Vogel, supra note 1.
  8. Andrew Gelman, Concerns with that Stanford Study of Coronavirus Prevalence, posted 19 April 2020, 9:14 am, on Statistical Modeling, Causal Inference, and Social Science, https://statmodeling.stat.columbia.edu/2020/04/19/fatal-flaws-in-stanford-study-of-coronavirus-prevalence/
  9. E.g., Joseph Gastwirth, The Statistical Precision of Medical Screening Procedures: Application to Polygraph and AIDS Antibodies Test Data, Stat. Sci. 1987, 2:213-222; D. J. Hand, Screening vs. Prevalence Estimation, Appl. Stat., 1987, 38:1-7; Fraser I. Lewis & Paul R. Torgerson, 2012, A Tutorial in Estimating the Prevalence of Disease in Humans and Animals in the Absence of a Gold Standard Diagnostic Emerging Themes in Epidemiology, 9:9, https://ete-online.biomedcentral.com/articles/10.1186/1742-7622-9-9; Walter J. Rogan & Beth Gladen, Estimating Prevalence from Results of a Screening-test. Am J Epidemiol. 1978, 107: 71-76; Niko Speybroeck, Brecht Devleesschauwer, Lawrence Joseph & Dirk Berkvens, Misclassification Errors in Prevalence Estimation: Bayesian Handling with Care, Int J Public Health, 2012, DOI:10.1007/s00038-012-0439-9
  10. H. Steinhaus, 1954, The Establishment of Paternity, Pr. Wroclawskiego Tow. Naukowego ser. A, no. 32. (discussed in Michael O. Finkelstein and William B. Fairley, A Bayesian Approach to Identification Evidence. Harvard Law Rev., 1970, 83:490-517). For a related discussion, see David H. Kaye, The Prevalence of Paternity in "One-Man" Cases of Disputed Parentage, Am. J. Human Genetics, 1988, 42:898-900 (letter).
  11. This expression is known as "the Rogan–Gladen adjusted estimator of 'true' prevalence" (Speybroeck et al., supra note 9) or "the classic Rogan-Gladen estimator of true prevalence in the presence of an imperfect diagnostic test." Lewis & Torgerson, supra note 9. The reference is to Rogan & Gladen, supra note 9.
  12. They call the proportion p = 1.5% the "unadjusted" estimate of prevalence.
  13. Some older discussions of the standard error in this situation can be found in Gastwirth, supra note 9; Hand, supra note 9. See also J. Reiczigel, J. Földi, & L. Ózsvári, Exact Confidence Limits for Prevalence of a Disease with an Imperfect Diagnostic Test, Epidemiology and Infection, 2010, 138:1674-1678.
  14. Beth Mole, Bloody math — Experts Demolish Studies Suggesting COVID-19 Is No Worse than Flu: Authors of widely publicized antibody studies “owe us all an apology,” one expert says, Ars Technica, Apr. 24, 2020, 1:33 PM, https://arstechnica.com/science/2020/04/experts-demolish-studies-suggesting-covid-19-is-no-worse-than-flu/
  15. https://twitter.com/wfithian/status/1252692357788479488 
  16. Vogel, supra note 1.
  17. A Bayesian analysis might help. See, e.g., Speybroeck et al., supra note 9.
UPDATED Apr. 27, 2020, to correct a typo in line (2) of the derivation of Equation (1), as pointed out by Geoff Morrison.

NOTES to later updates
  1. OSU Newsroom, TRACE First Week’s Results Suggest Two People per 1,000 in Corvallis Were Infected with SARS-CoV-2, May 7, 2020, https://today.oregonstate.edu/news/trace-first-week%E2%80%99s-results-suggest-two-people-1000-corvallis-were-infected-sars-cov-2
  2. But "[t]he tests used in TRACE-COVID-19 collect material from the entrance of the nose and are more comfortable and less invasive than the tests that collect secretions from the throat and the back of the nose." Id.

Thursday, April 23, 2020

More on False Positive and False Negative Serological Tests for COVID-19

An earlier posting looked at sensitivity and specificity of the first FDA-allowed emergency serological test for antibodies to SARS-CoV-2. It then identified some implications for getting people back to work through what a recent article in Nature called an "immunity passport." \1/

The news article cautions that "[k]its have flooded the market, but most aren’t accurate enough to confirm whether an individual has been exposed to the virus." The kits use components of the virus that the antibodies latch onto (the antigens) to detect the antibodies in the blood. Blood samples can be sent to a qualified laboratory for testing. In addition, "[s]everal companies ... offer point-of-care kits, which are designed to be used by health professionals to check if an individual has had the virus." In fact, "some companies market them for people to use at home." But
most kits have not undergone rigorous testing to ensure they’re reliable, says [Michael Busch, director of the Vitalant Research Institute in San Francisco]. During a meeting at the UK Parliament’s House of Commons Science and Technology Select Committee on 8 April, Kathy Hall, the director of the testing strategy for COVID-19, said that no country appeared to have a validated antibody test that can accurately determine whether an individual has had COVID-19. ... [S]o far, most test assessments have involved only some tens of individuals because they have been developed quickly. ... [S]ome commercial antibody tests have recorded specificities as low as 40% early in the infection. In an analysis of 9 commercial tests available in Denmark, 3 lab-based tests had sensitivities ranging 67–93% and specificities of 93–100%. In the same study, five out of six point-of-care tests had sensitivities ranging 80–93%, and 80–100% specificity, but some kits were tested on fewer than 30 people. Testing was suspended for one kit.

Point-of-care tests are even less reliable than tests being used in labs, adds [David Smith, a clinical virologist at the University of Western Australia in Perth]. This is because they use a smaller sample of blood — typically from a finger prick — and are conducted in a less controlled environment than a lab .... The WHO recommends that point-of-care tests only be used for research.
False positives arise when a test uses an antigen that reacts with antibodies for pathogens other than SARS-CoV-2. In other words, the test is not 100% specific to the one type of virus. "An analysis of EUROIMMUN’s antibody test found that although it detected SARS-CoV-2 antibodies in three people with COVID-19, it returned a positive result for two people with another coronavirus." It is notable that "[i]t took several years to develop antibody tests for HIV with more than 99% specificity."

A further problem with issuing an "immunity passport" on the basis of a serological test is that the test may not detect the kind of antibodies that confer immunity to subsequent infection. It is not clear whether all people who have had COVID-19 develop the necessary "neutralizing" antibodies. An unpublished analysis of 175 people in China who had recovered from COVID-19 and had mild symptoms reported that 10 individuals produced no detectable neutralizing antibodies — even though some had high levels of binding antibodies. These people may lack protective immunity. Moreover, one study showed that viral RNA load declines slowly after antibodies are detected in the blood. Consequently, there could be a period in which a recovered patient is still shedding infectious virus.

A news article in this week's Science magazine also contains information on using serologic test data to estimate the proportion of people who have been infected (prevalence). \2/ It described a German study in which "Streeck and his colleagues claimed the commercial antibody test they used has “more than 99% specificity,” but a Danish group found the test produced three false positives in a sample of 82 controls, for a specificity of only 96%."

The article also mentions a survey in which "Massachusetts General Hospital pathologists John Iafrate and Vivek Naranbhai ... collected blood samples from 200 passersby on a street corner [and] used a test whose maker, BioMedomics, says it has a specificity of only about 90%, though Iafrate says MGH’s own validation tests found a specificity of higher than 99.5%."

NOTES
  1. Smriti Mallapaty, Will Antibody Tests for the Coronavirus Really Change Everything?, Nature, Apr. 18, 2020, doi:10.1038/d41586-020-01115-z
  2. Gretchen Vogel, Antibody Surveys Suggesting Vast Undercount of Coronavirus Infections May Be Unreliable, Science, 368:350-351, Apr. 24, 2020, DOI:10.1126/science.368.6489.350, doi:10.1126/science.abc3831

Wednesday, April 22, 2020

Forensic Magazine Branches Out

Forensic Magazine is "powered by Labcompare, the Buyer's Guide for Laboratory Professionals." Its slogan is "On the Scene and in the Lab." Today's newsletter includes the following item, sandwiched between an article on DNA cold cases in Florida and one on domestic abuse in Nicaragua:
Texas State Forensic Association Names Educator of the Year
Wednesday, April 22, 2020

Julie Welker, chair of Howard Payne University’s Department of Communication and coach of HPU’s speech and debate team, was recently named the Texas Intercollegiate Forensics Association (TIFA) Educator of the Year. ... Welker, in her twenty-second year on the faculty at HPU, has been coaching the speech and debate team since 2005. ... [read the full story]
As a former high school and college debater myself, I applaud Professor Welker's coaching, but the newsletter brings to mind a discussion of the terms "forensic evidence" and "forensics" at a meeting of the National Commission on Forensic Science. A commission member, herself a university chemist, urged the commission to eschew these terms because of the speech and debate connection. At the time, I thought she was being picky. Now I am not so sure. By the way, the adjective "forensic" comes from the Latin word forensis, meaning "of the forum" or "public."