Saturday, April 25, 2020

Estimating Prevalence from Serological Tests for COVID-19 Infections: What We Don't Know Can Hurt Us

A statistical debate has erupted over the proportion of the population that has been infected with SARS-CoV-2. It is a crucial number in arguments about "herd immunity" and public health measures to control the COVID-19 pandemic. A news article in yesterday's issue of Science reports that

[S]urvey results, from Germany, the Netherlands, and several locations in the United States, find that anywhere from 2% to 30% of certain populations have already been infected with the virus. The numbers imply that confirmed COVID-19 cases are an even smaller fraction of the true number of people infected than many had estimated and that the vast majority of infections are mild. But many scientists question the accuracy of the antibody tests ... .\1/

The first sentence reflects a common assumption -- that the reported proportion of test results that are positive directly indicates the prevalence of infections where the tested people live. The last sentence gives one reason this might not be the case. But the fact that tests for antibodies are inaccurate does not necessarily preclude good estimates of the prevalence. It may still be possible to adjust the proportion up or down to arrive at the percentage "already ... infected with the virus." There is a clever and simple procedure for doing that -- under certain conditions. Before describing it, let's look at another, more easily grasped threat to estimating prevalence -- "sampling bias."

Sampling Design: Who Gets Tested?

Because the people tested in the recent studies were not selected as random samples from any well-defined population, the test results may not be representative of what the outcome would be if the entire population of interest were tested. Several sources of bias in sampling have been noted.

A study of a German town "found antibodies to the virus in 14% of the 500 people tested. By comparing that number with the recorded deaths in the town, the study suggested the virus kills only 0.37% of the people infected. (The rate for seasonal influenza is about 0.1%.)" But the researchers "sampled entire households. That can lead to overestimating infections, because people living together often infect each other." \2/ Of course, one can count just one individual per household, so this clumping does not sound like a fundamental problem.

"A California serology study of 3300 people released last week in a preprint [found 50] antibody tests were positive—about 1.5%. [The number in the draft paper by Eran Bendavid, Bianca Mulaney, Neeraj Sood, et al. is 3330 \3/] But after adjusting the statistics to better reflect the county's demographics, the researchers concluded that between 2.49% and 4.16% of the county's residents had likely been infected." However, the Stanford researchers "recruit[ed] the residents of Santa Clara county through ads on Facebook," which could have "attracted people with COVID-19–like symptoms who wanted to be tested, boosting the apparent positive rate." \4/ This "unhealthy volunteer" bias is harder to correct with this study design.

"A small study in the Boston suburb of Chelsea has found the highest prevalence of antibodies so far. Prompted by the striking number of COVID-19 patients from Chelsea colleagues had seen, Massachusetts General Hospital pathologists ... collected blood samples from 200 passersby on a street corner. ... Sixty-three were positive—31.5%." As the pathologists acknowledged, pedestrians on a single corner "aren't a representative sample." \5/

Even efforts to find subjects at random will fall short of the mark because of self-selection on the part of subjects. "Unhealthy volunteer" bias is a threat even in studies like one planned for Miami-Dade County that will use random-digit dialing to utility customers to recruit subjects. \6/

In sum, sampling bias could be a significant problem in many of these studies. But it is something epidemiologists always face, and enough quick and dirty surveys (with different possible sources of sampling bias) could give a usable indication of what better designed studies would reveal.

Measurement Error: No Gold Standard

A second criticism holds that because the "specificity" of the serological tests could be low, the estimates of prevalence are exaggerated. "Specificity" refers to the extent to which the test (correctly) does not signal an infection when applied to an uninfected individual. If it (incorrectly) signals an infection for these individuals, it produces false positives. Low specificity means lots of false positives. Worries over specificity recur throughout the Science article's summary of the controversy:

  • "The result carries several large caveats. The team used a test whose maker, BioMedomics, says it has a specificity of only about 90%, though Iafrate says MGH's own validation tests found a specificity of higher than 99.5%."
  • "Because the absolute numbers of positive tests were so small, false positives may have been nearly as common as real infections."
  • "Streeck and his colleagues claimed the commercial antibody test they used has 'more than 99% specificity,' but a Danish group found the test produced three false positives in a sample of 82 controls, for a specificity of only 96%. That means that in the Heinsberg sample of 500, the test could have produced more than a dozen false positives out of roughly 70 the team found." \7/

Likewise, political scientist and statistician Andrew Gelman blogged that a screening test lacking very high specificity cannot produce a usable estimate of population prevalence -- at least when the proportion of tests that are positive is small. This limitation, he insisted, is "the big one." \8/ He presented the following as a devastating criticism of the Santa Clara study (with my emphasis added):

Bendavid et al. estimate that the sensitivity of the test is somewhere between 84% and 97% and that the specificity is somewhere between 90% and 100%. I can never remember which is sensitivity and which is specificity, so I looked it up on wikipedia ... OK, here are [sic] concern is actual negatives who are misclassified, so what’s relevant is the specificity. That’s the number between 90% and 100%.
If the specificity is 90%, we’re sunk.
With a 90% specificity, you’d expect to see 333 positive tests out of 3330, even if nobody had the antibodies at all. Indeed, they only saw 50 positives, that is, 1.5%, so we can be pretty sure that the specificity is at least 98.5%. If the specificity were 98.5%, the observed data would be consistent with zero ... . On the other hand, if the specificity were 100%, then we could take the result at face value.
So how do they get their estimates? Again, the key number here is the specificity. Here’s exactly what they say regarding specificity:
A sample of 30 pre-COVID samples from hip surgery patients were also tested, and all 30 were negative. . . . The manufacturer’s test characteristics relied on . . . pre-COVID sera for negative gold standard . . . Among 371 pre-COVID samples, 369 were negative.
This gives two estimates of specificity: 30/30 = 100% and 369/371 = 99.46%. Or you can combine them together to get 399/401 = 99.50%. If you really trust these numbers, you’re cool: with y=399 and n=401, we can do the standard Agresti-Coull 95% interval based on y+2 and n+4, which comes to [98.0%, 100%]. If you go to the lower bound of that interval, you start to get in trouble: remember that if the specificity is less than 98.5%, you’ll expect to see more than 1.5% positive tests in the data no matter what!

To be sure, the fact that the serological tests are not perfectly accurate in detecting an immune response makes it dangerous to rely on the proportion of people tested who test positive as the measure of the proportion of the population who have been infected. Unless the test is perfectly sensitive (is certain to be positive for an infected person) and specific (certain to be negative for an uninfected person), the observed proportion will not be the true proportion of past infections -- even in the sample. As we will see shortly, however, there is a simple way to correct for imperfect sensitivity and specificity in estimating the population prevalence, and there is a voluminous literature on using imperfect screening tests to estimate population prevalence. \9/ Recognizing what one wants to estimate leads quickly to the conclusion that the raw proportion of positives among the tested group that media reports typically feature (even with a margin of error to account for sampling variability) is not generally the right statistic to focus on.

Moreover, the notion that only the specificity is relevant -- because false positives inflate an estimate of the number who have been infected -- is misconceived. Sure, false positives (imperfect specificity) inflate the estimate. But false negatives (imperfect sensitivity) simultaneously deflate it. Both types of misclassification should be considered.

How, then, do epidemiologists doing surveillance studies normally handle the fact that the tests for a disease are not perfectly accurate? Let's use p to denote the positive proportion in the sample of people tested -- for example, the 1.5% in the Santa Clara sample or the 21% figure for New York City that Governor Andrew Cuomo announced in a tweet. The performance of the serological test depends on its true sensitivity SEN and true specificity SPE. For the moment, let's assume that these are known parameters of the test. In reality, they are estimated from separate studies that themselves have sampling errors, but we'll just try out some values for them. First, let's derive a general result that contains ideas presented in 1954 in the legal context of serological tests for parentage. \10/

Let PRE designate the true prevalence in the population (such as everyone in Santa Clara county or New York City) from which a sample of people to be tested is drawn. We pick a person totally at random. That person either has harbored the virus (inf) or not (uninf). The former probability we abbreviate as Pr(inf); the latter is Pr(uninf). The probability that the individual tests positive is

  Pr(test+) = Pr[test+ & (inf or uninf)]
     = Pr[(test+ & inf) or (test+ & uninf)]
     = Pr(test+ & inf) + Pr(test+ & uninf)
     = Pr(test+ | inf)Pr(inf) + Pr(test+ | uninf)Pr(uninf)     (1)*

In words, the probability of the positive result is (a) the probability the test is positive if the person has been infected, weighted by the probability he or she has been infected, plus (b) the probability it is positive if the person has not been infected, weighted by the probability of no infection.

We can rewrite (1) in terms of the sensitivity and specificity. SEN is Pr(test+|inf) -- the probability of a positive result if the person has been infected. SPE is Pr(test–|uninf) -- the probability of a negative result if the person has not been infected. For the random person, the probability of infection is just the true prevalence in the population, PRE. So the first product in (1) is simply SEN × PRE.

To put SPE into the second term, we note that the probability that an event happens is 1 minus the probability that it does not happen. Consequently, we can write the second term as (1 – SPE) × (1 – PRE). Thus, we have

     Pr(test+) = SEN PRE + (1 – SPE)(1 – PRE)           (2)

Suppose, for example, that SEN = 70%, SPE = 80%, and PRE = 10%. Then Pr(test+) = 0.7 PRE + 0.2(1 – PRE) = 1/5 + PRE/2 = 0.25. The expected proportion of observed positives in a random sample would be 0.25 -- a substantial overestimate of the true prevalence PRE = 0.10.
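To make the arithmetic easy to replay, here is a minimal Python sketch of Equation (2); the function name is mine, not anything from the studies:

    def prob_test_positive(sen, spe, pre):
        # Equation (2): Pr(test+) = SEN*PRE + (1 - SPE)*(1 - PRE)
        return sen * pre + (1 - spe) * (1 - pre)

    print(prob_test_positive(sen=0.70, spe=0.80, pre=0.10))  # 0.25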

In this example, with rather poor sensitivity and specificity, using the observed proportion p of positives in a large random sample to estimate the prevalence PRE would be foolish. So we should not blithely substitute p for PRE. Indeed, doing so can give us a bad estimate even when the test has perfect specificity. When SPE = 1, Equation (2) reduces to Pr(test+) = SEN PRE. In this situation, the sample proportion does not estimate the prevalence -- it estimates only a fraction of it.

Clearly, good specificity is not a sufficient condition for using the sample proportion p to estimate the true prevalence PRE, even in huge samples. Imperfect SEN and SPE both cause misclassifications, and they work in opposite directions: poor specificity produces false positives, while poor sensitivity turns true infections into false negatives. The net effect of these opposing forces is mediated by the prevalence.

To correct for the expected misclassifications in a large random sample, we can use the observed proportion of positives, not as an estimator of the prevalence, but as an estimator of Pr(test+). Setting p = Pr(test+) in Equation (2), we solve for PRE to obtain an estimated prevalence of

      pre = (p + SPE – 1)/(SPE + SEN – 1)         (3) \11/

For the Santa Clara study, Bendavid et al. found p = 50/3330 = 1.5%, and suggested that SEN = 80.3% and SPE = 99.5%. \12/ For these values, the estimated prevalence is pre = 1.25%. If we change SPE to 98.5%, where Gelman wrote that "you get into trouble," the estimate is pre = 0, which is clearly too small. Instead, the researchers used Equation (3) only after they transformed their stratified sample data to fit the demographics of the county. That adjustment produced an inferred proportion p' = 2.81%. Using that adjusted value for p, Equation (3) becomes

      pre = (p' + SPE – 1)/(SEN + SPE – 1)         (4)

For the SPE of 98.5%, Equation (4) gives an estimated prevalence of pre = 1.66%. For 99.5%, it is 2.9%. Although some critics have complained about using Equation (3) with the demographically adjusted proportion p' shown in (4), if the adjustment provides a better picture of the full population, it seems like the right proportion to use for arriving at the point estimate pre.
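For readers who want to check these numbers, here is a small Python sketch of the correction in Equations (3) and (4); the function name is mine:

    def rogan_gladen(p, sen, spe):
        # Equations (3)/(4): corrected prevalence from the observed
        # (or demographically adjusted) positive proportion p
        return (p + spe - 1) / (sen + spe - 1)

    print(rogan_gladen(p=50/3330, sen=0.803, spe=0.995))  # 0.0125 -> 1.25%
    print(rogan_gladen(p=0.0281, sen=0.803, spe=0.985))   # 0.0166 -> 1.66%
    print(rogan_gladen(p=0.0281, sen=0.803, spe=0.995))   # 0.0290 -> 2.9%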

Nevertheless, there remains a sense in which the specificity is key. Given SEN = 80.3%, dropping SPE to 97.2% gives pre = 0. Ouch! When SPE drops below 97.2%, pre turns negative, which is ridiculous. In fact, this result holds for many other values of SEN. So one does need a high specificity for Equation (3) to be plausible -- at least when the true prevalence (and hence p') is small. But as PRE (and thus p') grows larger, Equations (3) and (4) look better. For example, if p = 20%, then pre is 22% even with SPE = 97.2% and SEN = 80.3%. Indeed, with this large a p, even with a specificity of only SPE = 90%, we still get a substantial pre = 14.2%.
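A quick sweep over candidate specificities (a sketch using the same SEN = 80.3% and p' = 2.81%) shows where Equation (3) breaks down: the estimate crosses zero at SPE = 1 – p' = 97.19% and is negative below that.

    sen, p_adj = 0.803, 0.0281
    for spe in (0.995, 0.985, 0.972, 0.960, 0.900):
        pre = (p_adj + spe - 1) / (sen + spe - 1)   # Equation (3)
        print(f"SPE = {spe:.1%}: pre = {pre:+.2%}")
    # SPE = 99.5%: +2.90%; 98.5%: +1.66%; 97.2%: +0.01%;
    # 96.0%: -1.56%; 90.0%: -10.23%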

Random Sampling Error

So far, I have pretended that the sensitivity and specificity are known with certainty. Equation (3) gives only a point estimate for the true prevalence. It does not account for sampling variability -- either in p (and hence p') or in the estimates (sen and spe) of SEN and SPE, respectively, that have to be plugged into (3). To be clear that we are using estimates from the separate validity studies rather than the unknown true values for SEN and SPE, we can write the relevant equation as follows:

      pre = (p + spe – 1)/(sen + spe – 1)         (5)

Dealing with the variance of p (or p') with sample sizes like 3,330 is not hard. Free programs on the web give confidence intervals, based on various methods for arriving at the standard error of pre, that account for the size of the random sample that produced the estimate p. (Try it out.)
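Because Equation (5) is linear in p, the standard error of pre (with sen and spe treated as fixed) is just the standard error of p divided by (sen + spe – 1). A minimal Wald-type sketch (my own construction, not any particular web calculator):

    import math

    def pre_wald_ci(x, n, sen, spe, z=1.96):
        # 95% CI for pre, propagating only the binomial sampling error in p
        p = x / n
        denom = sen + spe - 1
        pre = (p + spe - 1) / denom
        se = math.sqrt(p * (1 - p) / n) / denom
        return pre - z * se, pre + z * se

    print(pre_wald_ci(x=50, n=3330, sen=0.803, spe=0.995))
    # about (0.007, 0.018)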

Our uncertainty about SEN and SPE is greater (at this point, because the tests rushed into use have not been well validated, as discussed in previous postings). Bendavid et al. report a confidence interval for PRE that is said to account for the variances in all three estimators -- p, sen, and spe. \13/ However, a savage report in Ars Technica \14/ collects tweets, including a series complaining that "[t]he confidence interval calculation in their preprint made demonstrable math errors." \15/ Nonetheless, it should be feasible to estimate how much sampling error in the validity studies for the serological tests contributes to the uncertainty in pre as an estimator of the population prevalence PRE. The researchers, at any rate, are convinced that "[t]he argument that the test is not specific enough to detect real positives is deeply flawed." \16/ Although they are working with a relatively low estimated prevalence, they could be right. \17/ If the specificity is in the range they claim, their estimates of prevalence should not be dismissed out of hand.
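One standard way to fold in that additional uncertainty (a sketch of a generic parametric bootstrap, not a reconstruction of Bendavid et al.'s calculation) is to redraw all three counts from binomial distributions and recompute Equation (5) each time. The specificity counts (399/401) come from the passage quoted above; the sensitivity counts below are hypothetical, chosen only to match the 80.3% figure:

    import numpy as np

    rng = np.random.default_rng(0)
    n, x = 3330, 50           # survey: positives out of tested
    n_spe, x_spe = 401, 399   # pooled specificity validation (399/401)
    n_sen, x_sen = 122, 98    # hypothetical counts giving sen = 80.3%

    draws = []
    for _ in range(100_000):
        p = rng.binomial(n, x / n) / n
        sen = rng.binomial(n_sen, x_sen / n_sen) / n_sen
        spe = rng.binomial(n_spe, x_spe / n_spe) / n_spe
        draws.append((p + spe - 1) / (sen + spe - 1))

    print(np.percentile(draws, [2.5, 97.5]))  # interval for pre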

* * *

The take-away message is that a gold-standard serological test is not always necessary for effective disease surveillance. It is true that unless the test is highly accurate, the positive test proportion p (or a proportion p' adjusted for a stratified sample) is not a good estimator of the true prevalence PRE. That has been known for quite some time and is not in dispute. At the same time, pre sometimes can be a useful estimator of true prevalence. That too is not in dispute. Of course, as always, good data are better than post hoc corrections, but for larger prevalences, serological tests may not require 99.5% specificity to produce useful estimates of how many people have been infected by SARS-CoV-2.


UPDATE (5/9/20): An Oregon State University team in Corvallis is going door to door in an effort to test a representative sample of the college town's population. \1/ A preliminary report released to the media gives a simple incidence of 2 per 1,000. Inasmuch as the sketchy accounts indicate that the samples collected are nasal swabs, the proportion cannot be directly compared to the proportions positive for the serological tests mentioned above. The nasal swabbing is done by the respondents in the survey rather than by medical personnel, \2/ and the results pertain to the presence of the virus at the time of swabbing rather than to an immune response that may be the result of exposure in the past.

UPDATE (7/9/20): Writing on “SARS-CoV-2 seroprevalence in COVID-19 hotspots” in The Lancet on July 6, Isabella Eckerle and Benjamin Meyer report that

Antibody cross-reactivity with other human coronaviruses has been largely overcome by using selected viral antigens, and several commercial assays are now available for SARS-CoV-2 serology. ... The first SARS-CoV-2 seroprevalence studies from cohorts representing the general population have become available from COVID-19 hotspots such as China, the USA, Switzerland, and Spain. In The Lancet, Marina Pollán and colleagues and Silvia Stringhini and colleagues separately report representative population-based seroprevalence data from Spain and Switzerland collected from April to early May this year. Studies were done in both the severely affected urban area of Geneva, Switzerland, and the whole of Spain, capturing both strongly and less affected provinces. Both studies recruited randomly selected participants but excluded institutionalised populations ... . They relied on IgG as a marker for previous exposure, which was detected by two assays for confirmation of positive results.

The Spanish study, which included more than 60,000 participants, showed a nationwide seroprevalence of 5·0% (95% CI 4·7–5·4; specificity–sensitivity range of 3·7% [both tests positive] to 6·2% [at least one test positive]), with urban areas around Madrid exceeding 10% (eg, seroprevalence by immunoassay in Cuenca of 13·6% [95% CI 10·2–17·8]). ... Similar numbers were obtained across the 2766 participants in the Swiss study, with seroprevalence data from Geneva reaching 10·8% (8·2–13·9) in early May. The rather low seroprevalence in COVID-19 hotspots in both studies is in line with data from Wuhan, the epicentre and presumed origin of the SARS-CoV-2 pandemic. Surprisingly, the study done in Wuhan approximately 4–8 weeks after the peak of infection reported a low seroprevalence of 3·8% (2·6–5·4) even in highly exposed health-care workers, despite an overwhelmed health-care system.

The key finding from these representative cohorts is that most of the population appears to have remained unexposed to SARS-CoV-2, even in areas with widespread virus circulation. [E]ven countries without strict lockdown measures have reported similarly low seroprevalence—eg, Sweden, which reported a prevalence of 7·3% at the end of April—leaving them far from reaching natural herd immunity in the population.

UPDATE (10/5/20): In Seroprevalence of SARS-CoV-2–Specific Antibodies Among Adults in Los Angeles County, California, on April 10-11, 2020, JAMA 2020, 323(23):2425-2427, doi: 10.1001/jama.2020.8279, Neeraj Sood, Paul Simon, Peggy Ebner, Daniel Eichner, Jeffrey Reynolds, Eran Bendavid, and Jay Bhattacharya used the methods applied in the Santa Clara study on a "random sample ... with quotas for enrollment for subgroups based on age, sex, race, and ethnicity distribution of Los Angeles County residents" invited for tests "to estimate the population prevalence of SARS-CoV-2 antibodies." The tests have an estimated sensitivity of 82.7% (95% CI of 76.0%-88.4%) and specificity of 99.5% (95% CI of 99.2%-99.7%).

The weighted proportion of participants who tested positive was 4.31% (bootstrap CI, 2.59%-6.24%). After adjusting for test sensitivity and specificity, the unweighted and weighted prevalence of SARS-CoV-2 antibodies was 4.34% (bootstrap CI, 2.76%-6.07%) and 4.65% (bootstrap CI, 2.52%-7.07%), respectively.

The estimate of 4.65% suggests that some "367,000 adults had SARS-CoV-2 antibodies, which is substantially greater than the 8,430 cumulative number of confirmed infections in the county on April 10." As such, "fatality rates based on confirmed cases may be higher than rates based on number of infections." Indeed, the reported fatality rate based on the number of confirmed cases (about 3% in the US) would be too high by a factor of 44! But "[s]election bias is likely. The estimated prevalence may be biased due to nonresponse or that symptomatic persons may have been more likely to participate. Prevalence estimates could change with new information on the accuracy of test kits used. Also, the study was limited to 1 county."

NOTES

  1. Gretchen Vogel, Antibody Surveys Suggesting Vast Undercount of Coronavirus Infections May Be Unreliable, Science, 368:350-351, Apr. 24, 2020, DOI:10.1126/science.368.6489.350, doi:10.1126/science.abc3831
  2. Id.
  3. Eran Bendavid, Bianca Mulaney, Neeraj Sood et al.,  COVID-19 Antibody Seroprevalence in Santa Clara County, California. medRxiv preprint dated Apr. 11, 2020.
  4. Id.
  5. Id.
  6. University of Miami Health System, Sylvester Researchers Collaborate with County to Provide Important COVID-19 Answers, Apr. 25, 2020, http://med.miami.edu/news/sylvester-researchers-collaborate-with-county-to-provide-important-covid-19
  7. Vogel, supra note 1.
  8. Andrew Gelman, Concerns with that Stanford Study of Coronavirus Prevalence, posted 19 April 2020, 9:14 am, on Statistical Modeling, Causal Inference, and Social Science, https://statmodeling.stat.columbia.edu/2020/04/19/fatal-flaws-in-stanford-study-of-coronavirus-prevalence/
  9. E.g., Joseph Gastwirth, The Statistical Precision of Medical Screening Procedures: Application to Polygraph and AIDS Antibodies Test Data, Stat. Sci. 1987, 2:213-222; D. J. Hand, Screening vs. Prevalence Estimation, Appl. Stat., 1987, 38:1-7; Fraser I. Lewis & Paul R. Torgerson, 2012, A Tutorial in Estimating the Prevalence of Disease in Humans and Animals in the Absence of a Gold Standard Diagnostic Emerging Themes in Epidemiology, 9:9, https://ete-online.biomedcentral.com/articles/10.1186/1742-7622-9-9; Walter J. Rogan & Beth Gladen, Estimating Prevalence from Results of a Screening-test. Am J Epidemiol. 1978, 107: 71-76; Niko Speybroeck, Brecht Devleesschauwer, Lawrence Joseph & Dirk Berkvens, Misclassification Errors in Prevalence Estimation: Bayesian Handling with Care, Int J Public Health, 2012, DOI:10.1007/s00038-012-0439-9
  10. H. Steinhaus, 1954, The Establishment of Paternity, Pr. Wroclawskiego Tow. Naukowego ser. A, no. 32. (discussed in Michael O. Finkelstein and William B. Fairley, A Bayesian Approach to Identification Evidence. Harvard Law Rev., 1970, 83:490-517). For a related discussion, see David H. Kaye, The Prevalence of Paternity in "One-Man" Cases of Disputed Parentage, Am. J. Human Genetics, 1988, 42:898-900 (letter).
  11. This expression is known as "the Rogan–Gladen adjusted estimator of 'true' prevalence" (Speybroeck et al., supra note 9) or "the classic Rogan-Gladen estimator of true prevalence in the presence of an imperfect diagnostic test." Lewis & Torgerson, supra note 9. The reference is to Rogan & Gladen, supra note 9.
  12. They call the proportion p = 1.5% the "unadjusted" estimate of prevalence.
  13. Some older discussions of the standard error in this situation can be found in Gastwirth, supra note 9; Hand, supra note 9. See also J. Reiczigel, J. Földi, & L. Ózsvári, Exact Confidence Limits for Prevalence of a Disease with an Imperfect Diagnostic Test, Epidemiology and Infection, 2010, 138:1674-1678.
  14. Beth Mole, Bloody math — Experts Demolish Studies Suggesting COVID-19 Is No Worse than Flu: Authors of widely publicized antibody studies “owe us all an apology,” one expert says, Ars Technica, Apr. 24, 2020, 1:33 PM, https://arstechnica.com/science/2020/04/experts-demolish-studies-suggesting-covid-19-is-no-worse-than-flu/
  15. https://twitter.com/wfithian/status/1252692357788479488 
  16. Vogel, supra note 1.
  17. A Bayesian analysis might help. See, e.g., Speybroeck et al., supra note 9.

UPDATED Apr. 27, 2020, to correct a typo in line (2) of the derivation of Equation (1), as pointed out by Geoff Morrison.

NOTES to update of 5/9/20

  1. OSU Newsroom, TRACE First Week’s Results Suggest Two People per 1,000 in Corvallis Were Infected with SARS-CoV-2, May 7, 2020, https://today.oregonstate.edu/news/trace-first-week%E2%80%99s-results-suggest-two-people-1000-corvallis-were-infected-sars-cov-2
  2. But "[t]he tests used in TRACE-COVID-19 collect material from the entrance of the nose and are more comfortable and less invasive than the tests that collect secretions from the throat and the back of the nose." Id.

Thursday, April 23, 2020

More on False Positive and False Negative Serological Tests for COVID-19

An earlier posting looked at sensitivity and specificity of the first FDA-allowed emergency serological test for antibodies to SARS-CoV-2. It then identified some implications for getting people back to work through what a recent article in Nature called an "immunity passport." \1/

The news article cautions that "[k]its have flooded the market, but most aren’t accurate enough to confirm whether an individual has been exposed to the virus." The kits use components of the virus that the antibodies latch onto (the antigens) to detect the antibodies in the blood. Blood samples can be sent to a qualified laboratory for testing. In addition, "[s]everal companies ... offer point-of-care kits, which are designed to be used by health professionals to check if an individual has had the virus." In fact, "some companies market them for people to use at home." But
most kits have not undergone rigorous testing to ensure they’re reliable, says Michael Busch[, director of the Vitalant Research Institute in San Francisco]. During a meeting at the UK Parliament’s House of Commons Science and Technology Select Committee on 8 April, Kathy Hall, the director of the testing strategy for COVID-19, said that no country appeared to have a validated antibody test that can accurately determine whether an individual has had COVID-19. ... [S]o far, most test assessments have involved only some tens of individuals because they have been developed quickly. ... [S]ome commercial antibody tests have recorded specificities as low as 40% early in the infection. In an analysis of 9 commercial tests available in Denmark, 3 lab-based tests had sensitivities ranging 67–93% and specificities of 93–100%. In the same study, five out of six point-of-care tests had sensitivities ranging 80–93%, and 80–100% specificity, but some kits were tested on fewer than 30 people. Testing was suspended for one kit.

Point-of-care tests are even less reliable than tests being used in labs, adds [David Smith, a clinical virologist at the University of Western Australia in Perth]. This is because they use a smaller sample of blood — typically from a finger prick — and are conducted in a less controlled environment than a lab .... The WHO recommends that point-of-care tests only be used for research.
False positives arise when a test uses an antigen that reacts with antibodies for pathogens other than SARS-CoV-2. In other words, the test is not 100% specific to the one type of virus. "An analysis of EUROIMMUN’s antibody test found that although it detected SARS-CoV-2 antibodies in three people with COVID-19, it returned a positive result for two people with another coronavirus." It is notable that "[i]t took several years to develop antibody tests for HIV with more than 99% specificity."

A further problem with issuing an "immunity passport" on the basis of a serological test is that the test may not detect the kind of antibodies that confer immunity to subsequent infection. It is not clear whether all people who have had COVID-19 develop the necessary "neutralizing" antibodies. An unpublished analysis of 175 people in China who had recovered from COVID-19 and had mild symptoms reported that 10 individuals produced no detectable neutralizing antibodies — even though some had high levels of binding antibodies. These people may lack protective immunity. Moreover, one study showed that viral RNA load declines slowly after antibodies are detected in the blood. Consequently, there could be a period in which a recovered patient is still shedding infectious virus.

A news article in this week's Science magazine also contains information on using serologic test data to estimate the proportion of people who have been infected (prevalence). \2/ It described a German study in which "Streeck and his colleagues claimed the commercial antibody test they used has “more than 99% specificity,” but a Danish group found the test produced three false positives in a sample of 82 controls, for a specificity of only 96%."

The article also mentions a survey in which "Massachusetts General Hospital pathologists John Iafrate and Vivek Naranbhai ... collected blood samples from 200 passersby on a street corner [and] used a test whose maker, BioMedomics, says it has a specificity of only about 90%, though Iafrate says MGH’s own validation tests found a specificity of higher than 99.5%."

NOTES
  1. Smriti Mallapaty, Will Antibody Tests for the Coronavirus Really Change Everything?, Nature, Apr. 18, 2020, doi:10.1038/d41586-020-01115-z
  2. Gretchen Vogel, Antibody Surveys Suggesting Vast Undercount of Coronavirus Infections May Be Unreliable, Science, 368:350-351, Apr. 24, 2020, DOI:10.1126/science.368.6489.350, doi:10.1126/science.abc3831

Wednesday, April 22, 2020

Forensic Magazine Branches Out

Forensic Magazine is "powered by Labcompare, the Buyer's Guide for Laboratory Professionals." Its slogan is "On the Scene and in the Lab." Today's newsletter includes the following item, sandwiched between an article on DNA cold cases in Florida and domestic abuse in Nicaragua:
Texas State Forensic Association Names Educator of the Year
Wednesday, April 22, 2020

Julie Welker, chair of Howard Payne University’s Department of Communication and coach of HPU’s speech and debate team, was recently named the Texas Intercollegiate Forensics Association (TIFA) Educator of the Year. ... Welker, in her twenty-second year on the faculty at HPU, has been coaching the speech and debate team since 2005. ... [read the full story]
As a former high school and college debater myself, I applaud Professor Welker's coaching, but the newsletter brings to mind a discussion of the terms "forensic evidence" and "forensics" at a meeting of the National Commission on Forensic Science. A commission member, herself a university chemist, urged the commission to eschew these terms because of the speech and debate connection. At the time, I thought she was being picky. Now I am not so sure. By the way, the adjective "forensic" comes from the Latin word forensis, meaning "of the forum" or "public."

Tuesday, April 21, 2020

Comparing Countries by Cases of COVID-19: America First?

The following map appeared in the Los Angeles Times newsletter, Coronavirus Today, on April 20, with the caption, Where Is the Coronavirus Spreading?

Confirmed COVID-19 cases by country as of 5:00 p.m. Monday, April 20, 2020.
It does not take a Ph.D. in statistics to see that the total number of cases since the outbreak of the disease is not a measure of where the virus known as SARS-CoV-2 is currently spreading. It is an indirect measure of where it has been up to a given time. To know where the virus is spreading, we should look at new cases of the disease (incidence rates).

Even as a measure of cumulative cases, shading entire countries is misleading. If we want to see where cases have clustered since the recordkeeping started, we could look at the heights of bars placed on top of small, similarly sized geographic regions, where the heights are proportional to the number of cases in each region. Alaska would not stand out in such a graph. Indeed, the Times newsletter has a link to far better infographics from Johns Hopkins University, one of which clearly shows this fact.

Media reporting on only the total number of cases by country promotes false impressions of the incidence and prevalence of the disease across countries. For example, the table below gives approximate numbers for the US and Spain, which have the greatest cumulative numbers of cases as reported today on the Johns Hopkins website. It also includes China, which is ranked 9th in reported cases, and Switzerland (15th).

Country           Cumulative cases     Population        Relative frequency (cases per 100,000)
United States     788,000              328,000,000       240
Spain             204,000              47,000,000        430
China             84,000               1,400,000,000     6
Switzerland       28,000               8,600,000         330
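The last column is just cases divided by population, rescaled; a one-liner to reproduce it (the figures match the table to rounding):

    data = {"United States": (788_000, 328_000_000),
            "Spain": (204_000, 47_000_000),
            "China": (84_000, 1_400_000_000),
            "Switzerland": (28_000, 8_600_000)}
    for country, (cases, pop) in data.items():
        print(f"{country}: {100_000 * cases / pop:,.0f} cases per 100,000")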

Spain has only about a quarter of the number of cases reported in the US, but it has almost twice the prevalence of the disease. On the basis of reported cases and population size, China's population has emerged relatively unscathed, and on this scale, the US is by no means the most ravaged -- so far.

Monday, April 6, 2020

Using Serological Tests for COVID-19

The next wave of tests developed in response to the COVID-19 pandemic are designed to detect antibodies in the blood of infected individuals. Some of the antibodies (known as IgM) are short-term. Others (IgG) hang around longer. Just how long -- and just how much immunity the antibodies confer -- remains to be determined, \1/ but one hope is that these immunological tests will enable public health authorities to identify individuals with immunity who could be freed from distancing or isolation rules.

Will there be a database of the names of the immune among us? "If I were emperor and had unlimited resources, I’d do serological testing on every person in this country as fast as I could,” said one ER physician interviewed by the Los Angeles Times. \2/ The article reports that "Germany may issue 'immunity certificates' allowing release from quarantine for those who show they have already been exposed and fought off the coronavirus."

According to Times reporters Anita Chabria and Emily Baumgaertner, the FDA has been "lax" in allowing "the sale of more than 40 serological tests, not requiring them to undergo a formal emergency-use approval process," creating "potential problems because there is no centralized collection of results, no uniformity of methods and slim evidence that the tests work with acceptable accuracy."

Of course, what accuracy is acceptable depends on the use to which the tests are put, and a formal but abbreviated EUA process is required for tests to be marketed to diagnose the disease. The FDA granted its first EUA for a serological test on April 1, 2020, to the North Carolina pharmaceutical company Cellex. \3/ Other countries have been using other tests (that probably are as well validated) for some time. \4/

Cellex's fact sheet for physicians, \5/ like the ones that accompany molecular diagnostics tests for the virus itself, contains no estimates of the sensitivity and specificity of the test -- either for the particular antibodies it is intended to identify, for the virus (SARS-CoV-2), or for the disease (COVID-19). With respect to disease diagnosis, the fact sheet states that a "positive test result with the qSARS-CoV-2 IgG/IgM Rapid Test indicates that antibodies to SARS-CoV-2 were detected, and the patient has potentially been exposed to COVID-19." \6/ Well, yes, a positive test result for the antibodies is a positive test result for the antibodies. But what is known about the sensitivity for antibody detection--how probable is it to have a positive test result when the antibodies actually are circulating in the bloodstream? As for false positives (which arise from imperfect specificity), the fact sheet merely announces that the "qSARS-CoV-2 lgG/lgM Rapid Test has been designed to minimize the likelihood of false positive test results" -- whatever that means. \7/

Moreover, Cellex stays away from saying much about the implications of correctly detected antibodies. To be sure, "[w]hen IgM antibodies are present, they can indicate that a patient has an active or recent infection with SARS-CoV-2," but once again,  the fact sheet does not state how strong this possible indication is. Instead, it only warns that
A positive result for IgM or IgG may not mean that a patient’s current symptoms are due to COVID-19 infection. Laboratory test results should always be considered in the context of clinical observations and epidemiological data in making a final diagnosis and patient management decisions.
The fact sheet also steers clear of quantitative statements for negative test results. It gives the definition of a negative test result -- "A negative test result with this test means that SARS-CoV-2 specific antibodies were not present in the specimen above the limit of detection." Then it adds:
However, a negative result does not rule out COVID-19 and should not be used as the sole basis for treatment, patient management decisions, or to rule out active infection. Patients tested early after infection may not have detectable IgM antibody despite active infection; in addition, not all patients will develop a detectable IgM and/or IgG response to SARS-CoV-2 infection. The absolute sensitivity of the qSARS-CoV-2 IgG/IgM Rapid test is unknown.

When diagnostic testing is negative, the possibility of a false negative result should be considered in the context of a patient's recent exposures and the presence of clinical signs and symptoms consistent with COVID-19. ... Direct testing for virus (e.g., PCR testing) should always be performed in any patient suspected of COVID-19, regardless of the qSARS-CoV-2 IgG/IgM Rapid test.
Despite the absence of quantitative information in the fact sheet, Cellex did perform a validity study, which it summarized in the package insert. \8/ In one experiment, fifty normal whole blood samples were spiked with positive serum (diluted to 1:100), and fifty others were spiked with negative serum at the same dilution. The result was that "[a]ll spiked samples were correctly identified by the test except for one of the negative samples, which was tested positive with the test. Thus, there was a 99% concordance rate with expected results when venous whole blood specimens are used." The two-by-two table for these results is
                 No antibodies in serum     Antibodies in serum
    Test –       49                         0
    Test +       1                          50

The estimated sensitivity is therefore 50/50 = 1.00 (95% CI of 0.93 to 1), and the estimated specificity is 49/50 = 0.98 (95% CI of 0.89 to 1). \9/

A test of "clinical performance" compared (1) antibody-test outcomes on 98 serum or plasma samples from individuals who tested positive with a RT-PCR method for SARS-CoV-2 infection and who had mild or no symptoms with (2) antibody-test outcomes on 180 negative serum or plasma samples that had been collected before September 2019. The classifications were not as accurate.
                 No Virus     Virus
    Test –       174          7
    Test +       6            91

    Sensitivity: 91/98 = 0.93 [0.86, 0.97]
    Specificity: 174/180 = 0.97 [0.93, 0.99]

An additional 30 samples were collected from hospitalized individuals who were clinically confirmed positive for SARS-CoV-2 infection and exhibited severe symptoms. These samples, along with 70 negative serum or plasma samples collected prior to September 2019, generated the following results: \10/

                 No Virus     Severely Ill
    Test –       65           1
    Test +       5            29

    Sensitivity: 29/30 = 0.97 [0.83, 0.999]
    Specificity: 65/70 = 0.93 [0.84, 0.98]
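For anyone who wants to verify the bracketed intervals, here is a sketch that recomputes each sensitivity and specificity above with an exact (Clopper-Pearson) 95% interval, using SciPy's beta distribution:

    from scipy.stats import beta

    def clopper_pearson(x, n, alpha=0.05):
        # Exact binomial CI; the edge cases x = 0 and x = n pin the bound
        lo = beta.ppf(alpha / 2, x, n - x + 1) if x > 0 else 0.0
        hi = beta.ppf(1 - alpha / 2, x + 1, n - x) if x < n else 1.0
        return lo, hi

    tables = [("spiked sensitivity", 50, 50),
              ("spiked specificity", 49, 50),
              ("clinical sensitivity", 91, 98),
              ("clinical specificity", 174, 180),
              ("severe-case sensitivity", 29, 30),
              ("severe-case specificity", 65, 70)]
    for label, x, n in tables:
        lo, hi = clopper_pearson(x, n)
        print(f"{label}: {x}/{n} = {x/n:.3f} [{lo:.3f}, {hi:.3f}]")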

Although the population from which the samples came is not stated, let's assume that they are fully applicable to the U.S. population. How well would the serological test work if extended, as suggested earlier, to "every person in this country," or perhaps, to every person in the labor force? The Census Bureau estimates the July 2019 population at a bit under 330 million, and it estimates the labor force percentage from 2014-2018 at almost 64%, for a total of about 211 million workers. Let's round this off to 200 million. In widespread testing, the error probabilities would be higher than those obtained under laboratory conditions, but I'll use the ballpark sensitivity and specificity estimates of 0.95 from the tables above. \11/

Let's assume that 10% of the 200 million workers have been infected and are immune. This percentage is far higher than the percentage of infections based on the number of reported cases to date in the country (about 300,000), but the ratio of unreported to reported cases could well be on the order of 10 to 1. \12/ For this guess as to the prevalence of past infection, and under the optimistic assumption that every infection confers immunity, we have some 20 million immune workers and 180 million non-immune workers. Now we apply the sensitivity and specificity estimates of 0.95 = 19/20:
  • Of the 20 million immune workers, 19/20, will test positive (antibodies detected). That is 19 million true positives.
  • Of the 20 million immune workers, 1/20 will test negative. That is 1 million false negatives.
  • Of the 180 million non-immune workers, 19/20 will test negative. That is 171 million true negatives.
  • Of the 180 million non-immune workers, 1/20 will test positive. That is 9 million false positives.
The negatives (apparently non-immune workers) will continue to be excluded from the workplace. There will only be 1/(1 + 171) false exclusions -- about half a percent. That is pretty good.

The positives (apparently immune workers) will return to work. There will be 9/(9 + 19) = 9/28 -- about a third -- false inclusions. If these workers mistakenly believe that the blood test indicates a 95% probability of immunity (as they might if they were told that the false positive rate is only 5%), they will be sorely mistaken about how likely it is that they are immune. \13/

This arithmetic is a "mind-bending" instance of Bayes' rule in operation. \14/ Before testing, the probability of a randomly selected worker being immune was (assumed to be) 1/10, or 10%. A positive test result raises the probability of immunity to 19/28, or 68%. The odds have improved from 1:9 to about 2:1. More precisely, the posterior-to-prior-odds ratio is [19/(28 – 19)] / (1/9) = 19. This Bayes factor is just the positive likelihood ratio LR+ = (sensitivity) / (1 – specificity) = 0.95 / 0.05 = 19. For any such test, the posterior odds are this same likelihood ratio times the stipulated prior odds:

Posterior Odds = [(sensitivity) /  (1 – specificity)] × Prior Odds.
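The same numbers in a short Python sketch, which also confirms the posterior odds:

    workers = 200_000_000
    prevalence, sen, spe = 0.10, 0.95, 0.95

    immune = prevalence * workers
    tp = sen * immune                           # 19 million true positives
    fp = (1 - spe) * (workers - immune)         #  9 million false positives

    ppv = tp / (tp + fp)                        # P(immune | +) = 19/28 = 0.68
    lr_pos = sen / (1 - spe)                    # positive likelihood ratio = 19
    prior_odds = prevalence / (1 - prevalence)  # 1:9
    print(ppv, lr_pos, lr_pos * prior_odds)     # posterior odds about 2.1:1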

Proposals from economists and public policy analysts to relax employment limitations according to serological test results are emerging. \15/ To think through their merits, we would also need to consider the cost of the testing and the consequences of the false positives and negatives for individuals and the rest of society. However, in the Washington world of volatile decisionmaking, policies are likely to change before comprehensive and convincing analyses of their probable costs and benefits are available.

UPDATED: Note 14 added 6/20/20 9:15 AM ET

NOTES
  1. The period of immunity could be short. An article in Scientific American reported that
    [I]mmunity functions on a continuum. ... Immunity to seasonal coronaviruses (such as those that cause common colds), for example, starts declining a couple of weeks after infection. And within a year, some people are vulnerable to reinfection. ... But studies of SARS-CoV—the virus that causes severe acute respiratory syndrome, or SARS, which shares a considerable amount of its genetic material with SARS-CoV-2—are more promising. Antibody testing shows SARS-CoV immunity peaks at around four months and offers protection for roughly two to three years. ...

    Even if the antibodies stick around in the body, however, it is not yet certain that they will prevent future infection. What we want, [Dawn Bowdish, a professor of pathology and molecular medicine and Canada Research Chair in Aging and Immunity at McMaster University in Ontario] says, are neutralizing antibodies. These are the proteins that reduce and prevent infection by binding to the part of a virus that connects to and “unlocks” host cells. ... In contrast, nonneutralizing antibodies still recognize parts of the pathogen, but they do not bind effectively and so do not prevent it from invading cells. ...

    Nevertheless, a few small studies of cells in laboratory dishes suggest that SARS-CoV-2 infection triggers the production of neutralizing antibodies. And animal studies indicate such antibodies do prevent reinfection, at least for a couple of weeks. ...
    Stacey McKenna, What Immunity to COVID-19 Really Means, Sci. Am. Apr. 10, 2020, https://www.scientificamerican.com/article/what-immunity-to-covid-19-really-means/. See also Marc Lipsitch, Who Is Immune to the Coronavirus?, N.Y. Times, Apr. 13, 2020, https://www.nytimes.com/2020/04/13/opinion/coronavirus-immunity.html (sketching existing information and the major gaps in what is known).
  2. Anita Chabria & Emily Baumgaertner, A Coronavirus Immunity Test Is Essential for the U.S. But Will It Work?, L.A. Times, Apr. 2, 2020, 12:06 PM, https://www.latimes.com/science/story/2020-04-02/coronavirus-test-immunity-detection-accuracy
  3. Letter of Apr. 1, 2020, from RADM Denise M. Hinton, Chief Scientist, Food and Drug Administration, to James X. Li, Chief Executive Officer, Cellex Inc., available at https://www.fda.gov/media/136622/download.
  4. Apoorva Mandavilli, F.D.A. Approves First Coronavirus Antibody Test in U.S., N.Y. Times, Apr. 2, 2020, https://www.nytimes.com/2020/04/02/health/coronavirus-antibody-test.html.
  5. Cellex, Fact Sheet for Healthcare Providers, qSARS-CoV-2 lgG/lgM Rapid Test– Cellex Inc., Apr. 1, 2020, available at https://www.fda.gov/media/136623/download
  6. Presumably, "exposed to COVID-19" really means "exposed to the virus (SARS-CoV-2)."
  7. For an effort to unpack the meaning of minimize, see  https://for-sci-law.blogspot.com/2020/04/he-or-she-tested-negative-or-positive.html
  8. Cellex qSARS-CoV-2 IgG/IgM Rapid Test, Package Insert, available at https://www.fda.gov/media/136625/download
  9. In an unstated number of tests and specimens, no reactivity to a number of other viruses was observed, and spiking specimens with a variety of substances that might be found in blood plasma did not generate false negatives or positives. Id.
    UPDATE (June 11, 2020): The FDA has posted a summary of sensitivity, specificity, and predictive values for serologic tests that it has authorized for emergency use. It reports a sensitivity of 93.8% (120/128) and a specificity of 96.0% (240/250). U.S. Food & Drug Administration, EUA Authorized Serology Test Performance, June 9, 2020, https://www.fda.gov/medical-devices/emergency-situations-medical-devices/eua-authorized-serology-test-performance
  10. The day of collection relative to the onset of illness was unknown. The package insert reports "a Positive Percent Agreement and Negative Percent Agreement of 93.75% (95% CI: 88.06-97.26%) and 96.40% (95% CI: 92.26-97.78%), respectively." Id.
  11. To the extent that many business and government activities are considered essential and continue to employ their workers, I am also overstating the number of workers who have been idled by precautionary measures, but we'll be looking at ratios that do not depend on the absolute numbers.
  12. See Kathleen M. Jagodnik, Forest Ray, Federico M. Giorgi & Alexander Lachmann, Correcting Under-reported COVID-19 Case Numbers: Estimating the True Scale of the Pandemic.
  13. This overstates the impact of false negative test results because workers who already recovered from a clear case of COVID-19 will know that their probability of immunity is high notwithstanding the more general probability of a false inclusion.
  14. A Nature weekly briefing of 6/19/20 noted that "A mathematical pitfall plagues antibody tests that are less than 100% accurate: the lower the infection rate, the more likely it is that a positive result is wrong. Scientific American explains, with a very handy graphic, how this mind-bending fact arises. (Scientific American | 3 min read)."
  15. E.g., Aaron Edlin & Bryce Nesbitt, The ‘Certified Recovered’ from Covid-19 Could Lead the Economic Recovery, STAT, Apr. 6, 2020, https://www.statnews.com/2020/04/06/the-certified-recovered-from-covid-19-could-lead-the-economic-recovery/; David M. Studdert and Mark A. Hall, Disease Control, Civil Liberties, and Mass Testing — Calibrating Restrictions during the Covid-19 Pandemic, New Eng. J. Med., Apr. 9, 2020, DOI: 10.1056/NEJMp2007637.

Thursday, April 2, 2020

"He (or She) Tested Negative (or Positive) for Coronavirus"

The news is full of stories of celebrities and public officials who tested positive or negative "for coronavirus" or "for COVID-19."
  • Prime Minister Boris Johnson has tested positive for coronavirus -- BBC News 3/27/20
  • Rapper Scarface revealed he has tested positive for coronavirus -- The Daily Beast 3/26/20
  • Rep. Mike Kelly (R-Pa.) announced Friday he has tested positive for the coronavirus -- The Hill, 3/27/20
  • An Arizona State University professor said he tested positive for COVID-19 -- KTAR 3/27/20
  • Trump tested negative for coronavirus -- CNN, 3/14/20
  • Charles Barkley announced he tested negative for the coronavirus -- USA Today, 3/23/20
  • Romney says he tested negative for coronavirus -- The Hill, 3/24/20
  • Lindsey Graham says he tested negative for coronavirus -- CNN, 3/15/20
  • Ayanna Pressley tests negative for COVID-19 -- CNN, 3/27/20
What can anyone really conclude from a negative or positive finding? How well do these findings answer the question of whether someone is infectious, or ill because of an infection? This posting seeks to explain why convincing estimates of test sensitivity and specificity are hard to come by. It also sketches the kind of additional reasoning that would be necessary to supply estimates of the probability a person is infected with the virus or ill from COVID-19 in light of the test results. (I am outside my comfort zone in parts of this posting -- corrections are welcome.)

Tests for What?

To begin with, we need to distinguish between the disease -- Coronavirus Disease 2019 (COVID-19) -- and the virus itself -- Sudden Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2). The virus hijacks the machinery of human cells to replicate itself. Initially, it tends to reside in the mucous membranes of the upper nose and throat, but in more serious cases, it moves from the upper respiratory tract to the lungs. The disease spreads primarily through respiratory droplets from an infected individual that end up in the mouth, nose, or eyes of another person.

In a sense, the tests in the news are not tests for the disease -- even though many of their creators call them tests for COVID-19. \1/ They are molecular diagnostics tests for the presence of certain sequences (of the nucleotide base-pairs) that are characteristic of SARS-CoV-2. \2/ If these sequences are detected in a swab from the person under investigation (a PUI), the test is said to be positive. If these sequences are not detected, the result is negative.

Operating Characteristics: Sensitivity and Specificity

In an ideal test for being infected with the virus, positives would only arise when SARS-CoV-2 is present in the PUI, and negatives would only occur when it is not. The probability of a positive result (+) given that the PUI harbors the specific SARS virus then would be 1. As we will see, this probability is not precisely known, but it surely is less than 1. If we let S2 stand for the event that the PUI has the virus SARS-CoV-2, we can write this conditional "true positive" probability, or test sensitivity, as Pr(+ | S2). The "|" in the expression is read "given" or "conditional on."

One other probability is needed to characterize the accuracy of the test. The specificity indicates how reliably the test shows that a PUI does not harbor the virus. Ideally, the specificity, Pr(− | not-S2), also is 1. That is, whenever the PUI is not infected, the test is negative. But, once again, no real-world test for infection performs this well.

To see why, the following path diagram for test results may be helpful:
Figure 1. What might produce positive and negative test results

The diagram shows that a positive test result (TEST +) could be explained either by viruses from the PUI or by contamination on the swab. A negative test result (TEST −) could be explained by the absence of any infection in the PUI, by an infection that has not generated enough viruses to signal a positive result, or by problems with the chemistry of the test. If any of these paths has a nonzero probability, the sensitivity and specificity are less than 1.

Even this list of explanations assumes that "virus" in the diagram refers strictly to the SARS-CoV-2 strain that the test is designed to detect. If another type of coronavirus, a rhinovirus, parainfluenza virus, adenovirus, etc., has sufficient sequence similarity in the few regions tested to be mistaken for SARS-CoV-2, then the similar viruses on the swab could produce a signal. That would make the test even less specific to the infectious agent for COVID-19. Conversely, if other strains of SARS-CoV-2 exist and have sufficiently different sequences in the regions of the virus's genome that the test covers, the test will miss them, reducing its sensitivity. The FDA calls this aspect of sensitivity "inclusivity." \3/

So what are the sensitivity and specificity of the tests that have been released under emergency use authorizations from the FDA? The laboratories that rushed to develop the tests based on the viral genome performed limited experiments to assess (1) how much virus on the swab would be detectable; (2) whether other types of viruses would be detected instead of the real target; and (3) the probabilities implicit in the pathways in the blue boxes in the diagram.

Reported Laboratory Validation Data

To distribute or perform tests, manufacturers or laboratories must file validity studies with the Food and Drug Administration, which insists that "[a]ll clinical tests should be validated prior to use. In the context of a public health emergency, it is especially important that tests are validated as false results can have broad public health impact beyond that to the individual patient." \4/

For example, the Laboratory Corporation of America's Accelerated Emergency Use Authorization (EUA) Summary for its COVID-19 RT-PCR Test explains that "[t]he COVID-19 RT-PCR test is a real-time reverse transcription polymerase chain reaction (rRT-PCR) test for the qualitative detection of nucleic acid from SARS-CoV-2 in upper and lower respiratory specimens (such as nasopharyngeal or oropharyngeal swabs, sputum, lower respiratory tract aspirates, bronchoalveolar lavage, and nasopharyngeal wash/aspirate or nasal aspirate) collected from individuals suspected of COVID-19 by their healthcare provider."

The summary asserts that "SARS-CoV-2 RNA is generally detectable in respiratory specimens during the acute phase of infection" -- in other words, the test is somewhat sensitive to the disease. But that covers a lot of territory. Without data on the probabilities in the path from an infected PUI to viruses at the specimen-collection site to a detectable quantity on the swab (and other possible paths), we are left with vague statements in the summary, such as "[p]ositive results are indicative of the presence of SARS-CoV-2 RNA" on the swab or other sample, and "[n]egative results do not preclude SARS-CoV-2 infection."

Of course, the test is designed only to signal the presence of viruses in the sample (the paths in the blue boxes in Figure 1). As I noted at the outset, it does not purport to be a test for the disease itself. How well does the test accomplish this more limited task? Naturally, detection of the virus depends on the quantity of viruses actually present. LabCorp and other test developers follow the simplistic approach of defining a fixed "Limit of Detection (LoD)." What, then, is the probability of detection at and above that limit?

According to the summary, "[t]he LoD study established the lowest concentration of SARS-CoV-2 (genome copies (cp)/μL) that can be detected by the COVID-19 RT-PCR test at least 95% of the time." This limit came from creating mock specimens with known quantities of the virus ("spiking the quantified live SARS-CoV-2 into negative respiratory clinical matrices") and reducing the quantity to the point at which 19 out of 20 specimens tested positive. With only 20 mock specimens at each concentration, \5/ "[t]he study results showed that the LoD of the COVID-19 RT-PCR test is 6.25 cp/μL (19/20 positive)."

Although 19/20 describes the sample data, one cannot be entirely confident that the test really has a 95% sensitivity at the selected concentration for the LoD. Even if the true sensitivity at the 6.25 concentration were, say, 18 out of 20, we would find exactly 19 out of 20 replicates to be positive (as occurred in the LoD study) more than a quarter of the time. \6/ Likewise, the 0.95 sensitivity criterion for the limit of detection could have led to twice the reported LoD in a study with the same sample size. If the detection probability (sensitivity) were 95% at the next level up (12.5 cp/μL in this study), it could well be that all 20 replicates would be positive. The probability of that datum is 36% (0.95^20 ≈ 0.358).
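
These binomial calculations are easy to reproduce. Here is a minimal sketch in Python (assuming SciPy is available; only the counts and probabilities come from the summary):

    # How often would the 20-replicate LoD study produce the observed data
    # under alternative true detection probabilities?
    from scipy.stats import binom

    # If the true detection probability at 6.25 cp/μL were only 18/20 = 0.90,
    # exactly 19 of 20 positives would still appear about 27% of the time:
    print(binom.pmf(19, 20, 0.90))  # ≈ 0.270

    # If the true detection probability at 12.5 cp/μL were 0.95, all 20
    # replicates would come back positive about 36% of the time:
    print(binom.pmf(20, 20, 0.95))  # ≈ 0.358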

Having chosen 6.25 for the LoD concentration, LabCorp proceeded to a "Clinical Evaluation." More precisely, "[a] contrived clinical study was performed." For brevity, I will just describe the results for NP swabs. (The data on BALs were the same.)

  No. samples   Concentration   Test −   Test +
      50              0           50        0
      10            1×LoD          0       10
      10            2×LoD          0       10
      10            4×LoD          0       10
      10            8×LoD          0       10

For these outcomes, the summary derives the following statistics:
  • "Positive Percent Agreement 40/40 = 100% (95% CI: 91.24% - 100%)"
  • "Negative Percent Agreement 50/50 = 100% (95% CI: 92.87% -100%)"
The first confidence interval is an estimate for the sensitivity, based on the 40 positive test swabs pooled over the four concentrations (1×, 2×, 4×, and 8× the LoD). This interval, and the second one, for the specificity, suggest that the test is good at distinguishing between swabs spiked with between 6.25 and 50 cp/μL of the virus, on the one hand, and swabs with no SARS-CoV-2 at all, on the other.

But it is not clear what this sample of 90 tests is representative of. The efficacy of the test in discriminating between a virus-free swab and a virusy one depends on how many viruses are on the swab. If we contrast the 10 swabs constructed to have the reported limit of detection (6.25 cp/μL) with the 50 swabs with no viruses, the observed sensitivity in the experimental sample is still 1, but because 10 is a small sample size, the 95% Clopper-Pearson CI extends as low as 0.69. The lower end of the interval for the specificity is still 0.93. To discern some sort of average sensitivity and specificity for swabs from patients, one would need to know the distribution of viral concentrations in the patient population.
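
Readers who want to check these interval estimates can compute the exact (Clopper-Pearson) limits from the beta distribution. A minimal sketch in Python, again assuming SciPy, with the counts taken from the table above:

    from scipy.stats import beta

    def clopper_pearson(k, n, alpha=0.05):
        # Two-sided (1 - alpha) exact interval for k successes in n trials
        lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
        hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
        return lo, hi

    print(clopper_pearson(40, 40))  # pooled sensitivity: lower limit ≈ 0.91
    print(clopper_pearson(50, 50))  # specificity: lower limit ≈ 0.93
    print(clopper_pearson(10, 10))  # 1×LoD swabs alone: lower limit ≈ 0.69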

LabCorp's validity study contains further data on "Analytical Specificity." This is not the specificity for the virus-present-versus-virus-absent classification seen in the "Clinical Evaluation." It concerns the possibility that a different virus could generate (false) positive results -- something that, as previously noted, would make the test even less specific. The summary lists bacteria and viruses that did not produce positive test results (in an unspecified number of tests). This is consistent with the fact that a number of them have "no homology with primers and probes of the COVID-19 RT-PCR test." In other words, their nucleic acid sequences are substantially different from the sequences of SARS-CoV-2 used in the test. As such, the SARS-CoV-2 amplification and detection process should not react to the sequences from at least this set of other organisms.

Reporting Test Results Without Quantitative Information

The advice from testing companies and laboratories does not even try to supply estimates of sensitivity and specificity -- either for the diagnosis of COVID-19 or for the presence of SARS-CoV-2 on specimens. For example, the Fact Sheet for Healthcare Providers: LabCorp's COVID-19 RT-PCR Test (Mar. 16, 2020) contains the following questions and answers:
What does it mean if the specimen tests positive for the virus that causes COVID-19?
A positive test result for COVID-19 indicates that RNA from SARS-CoV-2 was detected, and the patient is infected with the virus and presumed to be contagious. Laboratory test results should always be considered in the context of clinical observations and epidemiological data ....
LabCorp's COVID-19 RT-PCR Test has been designed to minimize the likelihood of false positive test results. ...
What does it mean if the specimen tests negative for the virus that causes COVID-19?
A negative test result for this test means that SARS-CoV-2 RNA was not present in the specimen above the limit of detection. However, a negative result does not rule out COVID-19 and should not be used as the sole basis for treatment or patient management decisions. A negative result does not exclude the possibility of COVID-19.
When diagnostic testing is negative, the possibility of a false negative result should be considered in the context of a patient’s recent exposures and the presence of clinical signs and symptoms consistent with COVID-19. ...
This advice is oddly phrased and not terribly helpful. Among other things, \7/ does the statement that the test is designed to "minimize" the false-positive probability mean that the test maximizes the specificity to the point that its complement, the false-positive probability, is 0? Or that the test design makes the FPP lower than some alternative designs that were considered? Of course, the positive test "indicates" that the viral RNA is present, but how strong is the indication? \8/ And how should the test result -- whether positive or negative -- be evaluated "in the context of clinical observations and epidemiological data"? The Fact Sheet leaves the healthcare providers for whom it is written at sea.

It is all but impossible to answer the last two questions without understanding Bayes' rule -- a formula for updating a previously established probability in the light of new information such as a symptom or a test result. Suffice it to say that the probability of COVID-19 in the patient is a function of (1) the prevalence of the disease among persons who are like the patient in their demographic and geographic characteristics and medical histories; (2) the sensitivity and specificity of the symptoms (things like a fever and a cough) in this population; and (3) the sensitivity and specificity of the test for SARS-CoV-2 in this population.
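
To make the updating concrete, here is a minimal sketch of Bayes' rule in Python. The inputs are purely illustrative placeholders, not estimates from any validation study:

    def post_test_probability(prior, sensitivity, specificity, positive=True):
        # Probability of disease given the test result, by Bayes' rule
        if positive:
            p_given_d, p_given_not_d = sensitivity, 1 - specificity
        else:
            p_given_d, p_given_not_d = 1 - sensitivity, specificity
        joint = prior * p_given_d
        return joint / (joint + (1 - prior) * p_given_not_d)

    # Illustration: 10% prior probability, 80% sensitivity, 80% specificity
    print(post_test_probability(0.10, 0.80, 0.80, positive=True))   # ≈ 0.31
    print(post_test_probability(0.10, 0.80, 0.80, positive=False))  # ≈ 0.03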

Today's Bottom Line

On the basis of the kind of information collected here, it is safe to say that a positive test result raises the odds of COVID-19 and a negative result lowers them. But by how much? To make better use of the tests in diagnosing COVID-19, their operating characteristics should be measured by validating the tests against specimens from patients who are known to be suffering from COVID-19.

The Wall Street Journal has an alarming statistic for the false negative rate. Its answer to the question "Are tests accurate?" is
  • Health experts say they now believe nearly one in three patients who are infected are nevertheless getting a negative test result. They caution that only limited data are available, and their estimates are based on their own experience in the absence of hard science.
  • That picture is troubling, many doctors say, as it casts doubt on the reliability of a wave of new tests developed by manufacturers, lab companies and the CDC. Most of these are operating with minimal regulatory oversight and little time to do robust studies amid a desperate call for wider testing. \9/
A false-negative rate of 1/3 is the same as a sensitivity of 2/3. (Proof: Let C stand for "has COVID-19." Then Pr(−|C) + Pr(+|C) = false-negative rate + sensitivity = 1/3 + 2/3 = 1. This notation makes it clear that we are now speaking of conditional probabilities for the disease rather than for the presence of the virus in the specimen at or above the LoD.)

A discussion in the Internet Book of Critical Care \10/ refers to one or two studies along these lines. It suggests that in practice the sensitivity is below 80% (although the specificity seems to be high):
There are several major limitations, which make it hard to precisely quantify how RT-PCR performs.
  1. RT-PCR performed on nasal swabs depends on obtaining a sufficiently deep specimen. Poor technique will cause the PCR assay to under-perform.
  2. COVID-19 isn't a binary disease, but rather there is a spectrum of illness. Sicker patients with higher viral burden may be more likely to have a positive assay. Likewise, sampling early in the disease course may reveal a lower sensitivity than sampling later on.
  3. Most current studies lack a “gold standard” for COVID-19 diagnosis. For example, in patients with positive CT scan and negative RT-PCR, it's murky whether these patients truly have COVID-19 (is this a false-positive CT scan, or a false-negative RT-PCR?). ...
Specificity seems to be high (although contamination can cause false-positive results), [but] sensitivity may not be terrific. ... In a case series diagnosed on the basis of clinical criteria and CT scans, the sensitivity of RT-PCR was only ~70% (Kanne 2/28). Sensitivity varies depending on assumptions made about patients with conflicting data (e.g. between 66-80%) (Ai et al.). ... Among patients with suspected COVID-19 and a negative initial PCR, repeat PCR was positive in 15/64 patients (23%). This suggests a PCR sensitivity of <80%. Conversion from negative to positive PCR seemed to take a period of days, with CT scan often showing evidence of disease well before PCR positivity (Ai et al.).

Bottom line?
PCR seems to have a sensitivity somewhere on the order of ~75%. A single negative RT-PCR doesn't exclude COVID-19 (especially if obtained from a nasopharyngeal source or if taken relatively early in the disease course). If the RT-PCR is negative but suspicion for COVID-19 remains, then ongoing isolation and re-sampling several days later should be considered.

An 80% sensitivity and specificity implies that the test changes the odds of the disease by a factor of only 80/20 = 4. For such a test, if the physician's prior odds (those formed before receiving the test result) were, say, 6:1 in favor of COVID-19, a positive test result would change them to 24:1. The posterior probability is thus 24/25 = 96%. A negative test result would shift the odds from 1:6 for not-COVID-19 to 4:6. These latter odds are equivalent to 6:4 on COVID-19 (a probability of disease of 6/10 = 60%). In short, the starting probability of 6/7 ≈ 86% went up to 96% or down to 60%, depending on whether the test came back positive or negative. If the starting odds were reversed -- 1:6 on COVID-19 prior to the test -- the posterior probabilities of the disease would be lower -- 40% with a positive test result and only 4% with a negative one.
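
The same arithmetic in odds form, as a quick sketch (assuming the illustrative 80/80 figures used above):

    # Updating odds with the likelihood ratio of a test result
    sens, spec = 0.80, 0.80
    lr_pos = sens / (1 - spec)      # likelihood ratio of a positive result = 4
    lr_neg = (1 - sens) / spec      # likelihood ratio of a negative result = 1/4

    prior_odds = 6.0                # 6:1 on COVID-19
    post_pos = prior_odds * lr_pos  # 24:1  -> 24/25 = 96%
    post_neg = prior_odds * lr_neg  # 1.5:1 -> 6/10 = 60%
    print(post_pos / (1 + post_pos), post_neg / (1 + post_neg))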

By way of comparison, one study of the much-maligned technique of microscopic hair comparisons for identity used mitochondrial DNA tests as the gold standard for accuracy. It gave rise to a likelihood ratio for a positive association between crime-scene hairs and the suspects' head hairs of a little under 3. \11/ That is not impressive, but if the estimates proposed by the critical-care doctors are correct about the tests for SARS-CoV-2, the probative value of a hair association is not all that different from the diagnostic value of a positive molecular diagnostic test.

UPDATE (8/28/20)
A clear discussion of test sensitivity, specificity, and positive predictive value can be found in the International Statistical Institute's blog posting by John Bailar, My COVID-19 Test Is Positive … Do I Really Have It?, Statisticians React to the News, Aug. 25, 2020. It focuses on interpreting rapid antigen screening test results in combination with confirmatory PCR tests of the kind discussed here but does not delve deeply into the estimated sensitivity and specificity of any of the tests. It proposes further reading on this topic in Lauren Kucirka & Justin Lessler, COVID-19 Story Tip: Beware of False Negatives in Diagnostic Testing of COVID-19, Johns Hopkins Medicine Newsroom, May 26, 2020 ("describing work suggesting false negative rates > 20% for RT-PCR tests and that test accuracy changes over time course of disease"), and Rob Stein, Study Raises Questions About False Negatives From Quick Covid-19 Test, NPR Morning Edition, Apr. 21, 2020 (reporting that "[r]esearchers at the Cleveland Clinic tested 239 specimens known to contain the coronavirus using five of the most commonly used coronavirus tests, including the Abbott ID NOW [which] only detected the virus in 85.2% of the samples, meaning it had a false-negative rate of 14.8 percent.").

NOTES

  1. The names (and other information) on the tests that have received Emergency Use Authorization (EUA) from the FDA are listed at https://www.fda.gov/emergency-preparedness-and-response/mcm-legal-regulatory-and-policy-framework/emergency-use-authorization#2019-ncov.
  2. There also are serological tests. These look for antibodies in the blood. For a description of various types of tests, see Cormac Sheridan, Fast, Portable Tests Come Online to Curb Coronavirus Pandemic, Nature Biotechnology, Mar. 23, 2020.
  3. The FDA's Policy for Diagnostic Tests for Coronavirus Disease-2019 during the Public Health Emergency, Mar. 16, 2020, contains the following "recommendations regarding the minimum testing that should be performed to ensure analytical and clinical validity" for "tests that detect SARS-CoV-2 nucleic acids from human specimens" (pp. 9-10):
    (1) Limit of Detection

    FDA recommends that laboratories document the limit of detection (LoD) of their SARS-CoV-2 assay. FDA generally does not have concerns with spiking RNA or inactivated virus into artificial or real clinical matrix (e.g., Bronchoalveolar lavage [BAL] fluid, sputum, etc.) for LoD determination. FDA recommends that laboratories test a dilution series of three replicates per concentration, and then confirm the final concentration with 20 replicates. For this guidance, FDA defines LoD as the lowest concentration at which 19/20 replicates are positive. If multiple clinical matrices are intended for clinical testing, FDA recommends that laboratories submit in their EUA requests the results from the most challenging clinical matrix to FDA. For example, if testing respiratory specimens (e.g., sputum, BAL, nasopharyngeal (NP) swabs, etc.), laboratories should include only results from sputum in their EUA request.

    (2) Clinical Evaluation

    In the absence of known positive samples for testing, FDA recommends that laboratories confirm performance of their assay with a series of contrived clinical specimens by testing a minimum of 30 contrived reactive specimens and 30 non-reactive specimens. Contrived reactive specimens can be created by spiking RNA or inactivated virus into leftover clinical specimens, of which the majority can be leftover upper respiratory specimens such as NP swabs, or lower respiratory tract specimens such as sputum, etc. We recommend that twenty of the contrived clinical specimens be spiked at a concentration of 1x-2x LoD, with the remainder of specimens spanning the assay testing range. For this guidance, FDA defines the acceptance criteria for the performance as 95% agreement at 1x-2x LoD, and 100% agreement at all other concentrations and for negative specimens.

    (3) Inclusivity

    Laboratories should document the results of an in silico analysis indicating the percent identity matches against publicly available SARS-CoV-2 sequences that can be detected by the proposed molecular assay. FDA anticipates that 100% of published SARS-CoV-2 sequences will be detectable with the selected primers and probes.

    (4) Cross-reactivity

    At a minimum, FDA believes an in silico analysis of the assay primer and probes compared to common respiratory flora and other viral pathogens is sufficient for initial clinical use. For this guidance, FDA defines in silico cross-reactivity as greater than 80% homology between one of the primers/probes and any sequence present in the targeted microorganism. In addition, FDA recommends that laboratories follow recognized laboratory procedures in the context of the sample types intended for testing for any additional cross-reactivity testing.
  4. Id. at 4.
  5. Id. at 9 ("FDA recommends that laboratories test a dilution series of three replicates per concentration, and then confirm the final concentration with 20 replicates.").
  6. In twenty independent tests of swabs with a probability of 18/20 = 0.9 of a detection on each test, the binomial probability of exactly 19 detections is 20 × 0.9^19 × 0.1 ≈ 0.27.
  7. I think the first sentence is intended to say that a positive test result indicates (with some probability) that the specific RNA is present in the specimen, which implies (with some probability) that the patient is infected with SARS-CoV-2. And, I suspect that the first sentence in the second answer should read, a "negative test result indicates that SARS-CoV-2 RNA was not present in the specimen above the limit of detection" or a "negative test result means that SARS-CoV-2 RNA was not detected in the specimen above the limit of detection."
  8. Despite the warning in the first box about clinical and epidemiologic information, LabCorp apparently regards a positive result on its test as "definitive" in and of itself. A Q&A page on the LabCorp's website does not even include a question about the meaning of a positive result. It only asks whether "a negative result from LabCorp’s testing for COVID-19 mean[s] that a patient is definitely not infected." Its answer is
    Not necessarily. LabCorp’s testing for COVID-19 detects the virus directly, within the established limits of detection for which it was validated. A positive result is considered definitive evidence of infection. However, a negative result does not definitively rule out infection. As with any test, the accuracy relies on many factors:
    • The test may not detect virus in an infected patient if the virus is not being actively shed at the time or site of sample collection.
    • The amount of time that an individual was exposed prior to the collection of the specimen can also influence whether the test will detect the virus.
    • Individual response to the virus can differ.
    • Whether the specimen we receive was collected properly, sent promptly, and packaged correctly. Test results are a critical part of any diagnosis, but must be used by the clinician along with other information to form a diagnosis.
    Q&A, LabCorp's Testing for COVID-19, Mar. 25, 2020, https://www.labcorp.com/assets-media/2330. At least the last bulleted item pertains to positive as well as negative results.
  9. WSJ Staff, Who Has Covid-19? What We Know About Tests for the New Coronavirus, updated Apr. 2, 2020, 7:46 pm ET, https://www.wsj.com/articles/who-has-covid-19-what-we-know-about-tests-for-the-new-coronavirus-11585868185.
  10. Josh Farkas, COVID-19, Internet Book of Critical Care, Mar. 2, 2020 (updated Mar. 29, 2020), https://emcrit.org/ibcc/COVID19/.
  11. David H. Kaye, Ultracrepidarianism in Forensic Science: The Hair Evidence Debacle, 72 Wash. & Lee L. Rev. Online, 227–254 (2015) (discussing the study), available at ssrn.com/abstract=2647430.