A statistical debate has erupted over the proportion of the population that has been infected with SARS-CoV-2. It is a crucial number in arguments about "herd immunity" and public health measures to control the COVID-19 pandemic. A news article in yesterday's issue of Science reports that
[S]urvey results, from Germany, the Netherlands, and several locations in the United States, find that anywhere from 2% to 30% of certain populations have already been infected with the virus. The numbers imply that confirmed COVID-19 cases are an even smaller fraction of the true number of people infected than many had estimated and that the vast majority of infections are mild. But many scientists question the accuracy of the antibody tests ... .\1/
The first sentence reflects a common assumption -- that the reported proportion of test results that are positive directly indicates the prevalence of infections where the tested people live. The last sentence gives one reason this might not be the case. But the fact that tests for antibodies are inaccurate does not necessarily preclude good estimates of the prevalence. It may still be possible to adjust the proportion up or down to arrive at the percentage "already ... infected with the virus." There is a clever and simple procedure for doing that -- under certain conditions. Before describing it, let's look at another, more easily grasped threat to estimating prevalence -- "sampling bias."
Inasmuch as the people tested in the recent studies were not random samples from any well-defined population, the test results may not be representative of what the outcome would be if the entire population of interest were tested. Several sources of bias in sampling have been noted.
A study of a German town "found antibodies to the virus in 14% of the 500 people tested. By comparing that number with the recorded deaths in the town, the study suggested the virus kills only 0.37% of the people infected. (The rate for seasonal influenza is about 0.1%.)" But the researchers "sampled entire households. That can lead to overestimating infections, because people living together often infect each other." \2/ Of course, one can count just one individual per household, so this clumping does not sound like a fundamental problem.
"A California serology study of 3300 people released last week in a preprint [found 50] antibody tests were positive—about 1.5%. [The number in the draft paper by Eran Bendavid, Bianca Mulaney, Neeraj Sood, et al. is 3330 \3/] But after adjusting the statistics to better reflect the county's demographics, the researchers concluded that between 2.49% and 4.16% of the county's residents had likely been infected." However, the Stanford researchers "recruit[ed] the residents of Santa Clara county through ads on Facebook," which could have "attracted people with COVID-19–like symptoms who wanted to be tested, boosting the apparent positive rate." \4/ This "unhealthy volunteer" bias is harder to correct with this study design.
"A small study in the Boston suburb of Chelsea has found the highest prevalence of antibodies so far. Prompted by the striking number of COVID-19 patients from Chelsea colleagues had seen, Massachusetts General Hospital pathologists ... collected blood samples from 200 passersby on a street corner. ... Sixty-three were positive—31.5%." As the pathologists acknowledged, pedestrians on a single corner "aren't a representative sample." \5/
Even efforts to find subjects at random will fall short of the mark because of self-selection on the part of subjects. "Unhealthy volunteer" bias is a threat even in studies like one planned for Miami-Dade County that will use random-digit dialing to utility customers to recruit subjects. \6/
In sum, sampling bias could be a significant problem in many of these studies. But it is something epidemiologists always face, and enough quick and dirty surveys (with different possible sources of sampling bias) could give a usable indication of what better designed studies would reveal.
A second criticism holds that because the "specificity" of the serological tests could be low, the estimates of prevalence are exaggerated. "Specificity" refers to the extent to which the test (correctly) does not signal an infection when applied to an uninfected individual. If it (incorrectly) signals an infection for these individuals, it causes false positives. Low specificity means lots of false positives. Worries over specificity recur throughout the Science article's summary of the controversy:
- "The result carries several large caveats. The team used a test whose maker, BioMedomics, says it has a specificity of only about 90%, though Iafrate says MGH's own validation tests found a specificity of higher than 99.5%."
- "Because the absolute numbers of positive tests were so small, false positives may have been nearly as common as real infections."
- "Streeck and his colleagues claimed the commercial antibody test they used has 'more than 99% specificity,' but a Danish group found the test produced three false positives in a sample of 82 controls, for a specificity of only 96%. That means that in the Heinsberg sample of 500, the test could have produced more than a dozen false positives out of roughly 70 the team found." \7/
Likewise, political scientist and statistician Andrew Gelman blogged that no screening test that lacks a very high specificity can produce a usable estimate of population prevalence -- at least when the proportion of tests that are positive is small. This limitation, he insisted, is "the big one." \8/ He presented the following as a devastating criticism of the Santa Clara study (with my emphasis added):
Bendavid et al. estimate that the sensitivity of the test is somewhere between 84% and 97% and that the specificity is somewhere between 90% and 100%. I can never remember which is sensitivity and which is specificity, so I looked it up on wikipedia ... OK, here are [sic] concern is actual negatives who are misclassified, so what’s relevant is the specificity. That’s the number between 90% and 100%.
If the specificity is 90%, we’re sunk. With a 90% specificity, you’d expect to see 333 positive tests out of 3330, even if nobody had the antibodies at all. Indeed, they only saw 50 positives, that is, 1.5%, so we can be pretty sure that the specificity is at least 98.5%. If the specificity were 98.5%, the observed data would be consistent with zero ... . On the other hand, if the specificity were 100%, then we could take the result at face value.
So how do they get their estimates? Again, the key number here is the specificity. Here’s exactly what they say regarding specificity:
A sample of 30 pre-COVID samples from hip surgery patients were also tested, and all 30 were negative. . . . The manufacturer’s test characteristics relied on . . . pre-COVID sera for negative gold standard . . . Among 371 pre-COVID samples, 369 were negative.
This gives two estimates of specificity: 30/30 = 100% and 369/371 = 99.46%. Or you can combine them together to get 399/401 = 99.50%. If you really trust these numbers, you’re cool: with y=399 and n=401, we can do the standard Agresti-Coull 95% interval based on y+2 and n+4, which comes to [98.0%, 100%]. If you go to the lower bound of that interval, you start to get in trouble: remember that if the specificity is less than 98.5%, you’ll expect to see more than 1.5% positive tests in the data no matter what!
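Gelman's interval for the specificity is easy to reproduce. The following is a minimal sketch, assuming the "add two successes and two failures" shortcut that he describes (y+2 and n+4) with a normal critical value of 1.96; the function name is my own, and clipping the limits to the range from 0 to 1 is an added convenience rather than part of his description.

```python
import math


def agresti_coull_interval(successes, trials, z=1.96):
    """Approximate 95% interval using the 'y + 2, n + 4' shortcut."""
    n_adj = trials + 4
    p_adj = (successes + 2) / n_adj
    half_width = z * math.sqrt(p_adj * (1.0 - p_adj) / n_adj)
    # A proportion cannot fall outside [0, 1], so clip the limits.
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)


# Pooled specificity data quoted above: 399 negatives among 401 pre-COVID samples.
low, high = agresti_coull_interval(399, 401)
print(f"specificity interval: [{low:.1%}, {high:.1%}]")
```

The lower limit comes out at about 98.0%, and the upper limit rounds to 100%, in line with the interval quoted above.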
To be sure, the fact that the serological tests are not perfectly accurate in detecting an immune response makes it dangerous to rely on the proportion of people tested who test positive as the measure of the proportion of the population who have been infected. Unless the test is perfectly sensitive (is certain to be positive for an infected person) and specific (certain to be negative for an uninfected person), the observed proportion will not be the true proportion of past infections -- even in the sample. As we will see shortly, however, there is a simple way to correct for imperfect sensitivity and specificity in estimating the population prevalence, and there is a voluminous literature on using imperfect screening tests to estimate population prevalence. \9/ Recognizing what one wants to estimate leads quickly to the conclusion that the raw proportion of positives among the tested group that the media usually report (even with a margin of error to account for sampling variability) is not generally the right statistic to focus on.
Moreover, the notion that because false positives inflate an estimate of the number who have been infected, only the specificity is relevant is misconceived. Sure, false positives (imperfect specificity) inflate the estimate. But false negatives (imperfect sensitivity) simultaneously deflate it. Both types of misclassifications should be considered.
How, then, do epidemiologists doing surveillance studies normally handle the fact that the tests for a disease are not perfectly accurate? Let's use p to denote the positive proportion in the sample of people tested -- for example, the 1.5% in the Santa Clara sample or the 21% figure for New York City that Governor Andrew Cuomo announced in a tweet. The performance of the serological test depends on its true sensitivity SEN and true specificity SPE. For the moment, let's assume that these are known parameters of the test. In reality, they are estimated from separate studies that themselves have sampling errors, but we'll just try out some values for them. First, let's derive a general result that contains ideas presented in 1954 in the legal context of serological tests for parentage. \10/
Let PRE designate the true prevalence in the population (such as everyone in Santa Clara county or New York City) from which a sample of people to be tested is drawn. We pick a person totally at random. That person either has harbored the virus (inf) or not (uninf). The former probability we abbreviate as Pr(inf); the latter is Pr(uninf). The probability that the individual tests positive is
Pr(test+) = Pr[test+ & (inf or uninf)]
= Pr[(test+ & inf) or (test+ & uninf)]
= Pr(test+ & inf) + Pr(test+ & uninf)
= Pr(test+ | inf)Pr(inf) + Pr(test+ | uninf)Pr(uninf) (1)*
In words, the probability of the positive result is (a) the probability the test is positive if the person has been infected, weighted by the probability he or she has been infected, plus (b) the probability it is positive if the person has not been infected, weighted by the probability of no infection.
We can rewrite (1) in terms of the sensitivity and specificity. SEN is Pr(test+|inf) -- the probability of a positive result if the person has been infected. SPE is Pr(test–|uninf) -- the probability of a negative result if the person has not been infected. For the random person, the probability of infection is just the true prevalence in the population, PRE. So the first product in (1) is simply SEN × PRE.
To put SPE into the second term, we note that the probability that an event happens is 1 minus the probability that it does not happen. Consequently, we can write the second term as (1 – SPE) × (1 – PRE). Thus, we have
Pr(test+) = SEN PRE + (1 – SPE)(1 – PRE) (2)
Suppose, for example, that SEN = 70%, SPE = 80%, and PRE = 10%. Then Pr(test+) = 1/5 + PRE/2 = 0.25. The expected proportion of observed positives in a random sample would be 0.25 -- a substantial overestimate of the true prevalence PRE = 0.10.
In this example, with rather poor sensitivity and specificity, using the observed proportion p of positives in a large random sample to estimate the prevalence PRE would be foolish. So we should not blithely substitute p for PRE. Indeed, doing so can give us a bad estimate even when the test has perfect specificity. When SPE = 1, Equation (2) reduces to Pr(test+) = SEN PRE. In this situation, the sample proportion does not estimate the prevalence -- it estimates only a fraction of it.
Clearly, good specificity is not a sufficient condition for using the sample proportion p to estimate the true prevalence PRE, even in huge samples. Imperfect SEN and imperfect SPE both cause misclassifications, and they work in opposite directions. Poor specificity leads to false positives, but poor sensitivity leads to infected people being counted as negatives. The net effect of these opposing forces is mediated by the prevalence.
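Equation (2) and the two cases just discussed can be checked numerically. This is only a sketch: the function name is hypothetical, and SEN and SPE are treated as known constants, as they are in the text at this point.

```python
def prob_test_positive(sen, spe, pre):
    """Equation (2): Pr(test+) = SEN*PRE + (1 - SPE)*(1 - PRE)."""
    true_positives = sen * pre                    # infected and flagged
    false_positives = (1.0 - spe) * (1.0 - pre)   # uninfected but flagged
    return true_positives + false_positives


# The worked example: SEN = 70%, SPE = 80%, PRE = 10%.
p_expected = prob_test_positive(sen=0.70, spe=0.80, pre=0.10)
print(f"Pr(test+) with SPE = 80%:  {p_expected:.2f}")

# A perfectly specific test still distorts: Pr(test+) shrinks to SEN * PRE.
p_perfect_spe = prob_test_positive(sen=0.70, spe=1.00, pre=0.10)
print(f"Pr(test+) with SPE = 100%: {p_perfect_spe:.2f}")
```

With a true prevalence of 0.10, the first case gives an expected positive proportion of 0.25 (too high) and the second gives 0.07 (too low).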
To correct for the expected misclassifications in a large random sample, we can use the observed proportion of positives, not as an estimator of the prevalence, but as an estimator of Pr(test+). Setting p = Pr(test+) in Equation (2), we solve for PRE to obtain an estimated prevalence of
pre = (p + SPE – 1)/(SPE + SEN – 1) (3) \11/
For the Santa Clara study, Bendavid et al. found p = 50/3330 = 1.5%, and suggested that SEN = 80.3% and SPE = 99.5%. \12/ For these values, the estimated prevalence is pre = 1.25%. If we change SPE to 98.5%, where Gelman wrote that "you start to get in trouble," the estimate is pre = 0, which is clearly too small. The researchers, however, used equation (3) only after they transformed their stratified sample data to fit the demographics of the county. That adjustment produced an inferred proportion p' = 2.81%. Using that adjusted value for p, Equation (3) becomes
pre = (p' + SPE – 1)/(SEN + SPE – 1) (4)
For the SPE of 98.5%, equation (4) gives an estimated prevalence of pre = 1.66%. For 99.5% it is 2.9%. Although some critics have complained about using Equation (3) with the demographically adjusted proportion p' shown in (4), if the adjustment provides a better picture of the full population, it seems like the right proportion to use for arriving at the point estimate pre.
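The point estimates quoted for the Santa Clara data can be reproduced with a few lines of code. This is a minimal sketch of Equations (3) and (4), not the researchers' own software; the function name is hypothetical, and the raw proportion is rounded to 1.5% as in the text.

```python
def rogan_gladen(p, sen, spe):
    """Equation (3): pre = (p + SPE - 1) / (SEN + SPE - 1)."""
    denominator = sen + spe - 1.0
    if denominator <= 0:
        raise ValueError("SEN + SPE must exceed 1 for the correction to make sense")
    return (p + spe - 1.0) / denominator


SEN = 0.803  # sensitivity suggested by Bendavid et al.

for label, p in [("raw p = 1.5%", 0.015), ("adjusted p' = 2.81%", 0.0281)]:
    for spe in (0.995, 0.985):
        pre = rogan_gladen(p, SEN, spe)
        print(f"{label}, SPE = {spe:.1%}: pre = {pre:.4f}")
```

The four results are roughly 0.0125, 0, 0.0289, and 0.0166, matching the figures of 1.25%, 0, 2.9%, and 1.66% given above.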
Nevertheless, there remains a sense in which the specificity is key. Given SEN = 80.3%, dropping SPE to 97.2% gives pre = 0. Ouch! When SPE drops below 97.2%, pre turns negative, which is ridiculous. In fact, this result holds for many other values of SEN, because the numerator of Equation (4), p' + SPE – 1, does not involve SEN. So one does need a high specificity for Equation (3) to be plausible -- at least when the true prevalence (and hence p') is small. But as PRE (and thus p') grows larger, Equations (3) and (4) look better. For example, if p = 20%, then pre is 22% even with SPE = 97.2% and SEN = 80.3%. Indeed, with this large a p even with a specificity of only SPE = 90% we still get a substantial pre = 14.2%.
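The zero crossing just described, and the more forgiving behavior at larger p, can be seen in a short sketch. Both function names are hypothetical; the first simply sets the numerator p + SPE – 1 of the corrected estimate to zero.

```python
def zero_crossing_specificity(p):
    """Specificity at which the numerator p + SPE - 1 of the estimate is zero."""
    return 1.0 - p


def corrected_prevalence(p, sen, spe):
    """Corrected estimate (p + SPE - 1) / (SEN + SPE - 1)."""
    return (p + spe - 1.0) / (sen + spe - 1.0)


# For the adjusted Santa Clara proportion, the estimate vanishes near 97.2%.
threshold = zero_crossing_specificity(0.0281)
print(f"pre = 0 when SPE = {threshold:.2%}")

# The crossing point is the same whatever sensitivity is assumed.
for sen in (0.60, 0.803, 0.95):
    print(f"SEN = {sen:.1%}: pre = {corrected_prevalence(0.0281, sen, threshold):.4f}")

# With p = 20%, a mediocre specificity no longer wipes out the estimate.
for spe in (0.972, 0.90):
    print(f"p = 20%, SPE = {spe:.1%}: pre = {corrected_prevalence(0.20, 0.803, spe):.3f}")
```

The last two lines come out at about 0.222 and 0.142, the 22% and 14.2% mentioned above.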
I have pretended the sensitivity and specificity are known with certainty. Equation (3) only gives a point estimate for true prevalence. It does not account for sampling variability -- either in p (and hence p') or in the estimates (sen and spe) of SEN and SPE, respectively, that have to be plugged into (3). To be clear that we are using estimates from the separate validity studies rather than the unknown true values for SEN and SPE, we can write the relevant equation as follows:
pre = (p + spe – 1)/(sen + spe – 1) (5)
Dealing with the variance of p (or p') with sample sizes like 3300 is not hard. Free programs on the web give confidence intervals based on various methods for arriving at the standard error for pre considering the size of the random sample that produced the estimate p. (Try it out.)
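For readers who prefer to compute rather than click, here is a rough sketch of that calculation. It makes two assumptions that the Santa Clara study does not actually satisfy: the 3330 people are treated as a simple random sample, and SEN and SPE are treated as known exactly. It uses the ordinary normal-approximation (Wald) interval for p and then pushes the endpoints through Equation (3), which is legitimate because pre is a straight-line, increasing function of p when SEN + SPE exceeds 1. The function names are hypothetical.

```python
import math


def wald_interval(positives, n, z=1.96):
    """Normal-approximation 95% interval for a sample proportion."""
    p = positives / n
    se = math.sqrt(p * (1.0 - p) / n)
    return p - z * se, p + z * se


def rogan_gladen(p, sen, spe):
    """Equation (3): pre = (p + SPE - 1) / (SEN + SPE - 1)."""
    return (p + spe - 1.0) / (sen + spe - 1.0)


SEN, SPE = 0.803, 0.995  # treated as known constants in this sketch

low_p, high_p = wald_interval(50, 3330)
low_pre = max(0.0, rogan_gladen(low_p, SEN, SPE))  # a prevalence cannot be negative
high_pre = rogan_gladen(high_p, SEN, SPE)

print(f"interval for p:   [{low_p:.2%}, {high_p:.2%}]")
print(f"interval for pre: [{low_pre:.2%}, {high_pre:.2%}]")
```

The interval for pre runs from roughly 0.7% to 1.8%. It reflects only the sampling error in p.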
Our uncertainty about SEN and SPE is greater (at this point, because the tests rushed into use have not been well validated, as discussed in previous postings). Bendavid et al. report a confidence interval for PRE that is said to account for the variances in all three estimators -- p, sen, and spe. \13/ However, a savage report in Ars Technica \14/ collects tweets such as a series complaining that "[t]he confidence interval calculation in their preprint made demonstrable math errors." \15/ Nonetheless, it should be feasible to estimate how much sampling error in the validity studies for the serological tests contributes to the uncertainty in pre as an estimator of the population prevalence PRE. The researchers, at any rate, are convinced that "[t]he argument that the test is not specific enough to detect real positives is deeply flawed." \16/ Although they are working with a relatively low estimated prevalence, they could be right. \17/ If specificity is in the range they claim, their estimates of prevalence should not be dismissed out of hand.
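One textbook way to fold all three sources of sampling error into a standard error for pre is a first-order (delta-method) approximation applied to Equation (5). The sketch below is not the calculation in the preprint, and it is only an illustration. It assumes that p, sen, and spe come from independent simple random samples and that normal approximations are tolerable, which is doubtful when spe is close to 1. The specificity counts (399 of 401) are the pooled figures quoted earlier; the size of the sensitivity validation sample (100) is a placeholder that I made up, not a number from the study.

```python
import math


def rogan_gladen_with_se(p, n, sen, n_sen, spe, n_spe):
    """Point estimate from Equation (5) with a delta-method standard error."""
    j = sen + spe - 1.0
    pre = (p + spe - 1.0) / j
    var_p = p * (1.0 - p) / n
    var_sen = sen * (1.0 - sen) / n_sen
    var_spe = spe * (1.0 - spe) / n_spe
    # Squared partial derivatives of pre: 1/j^2, pre^2/j^2, (1 - pre)^2/j^2.
    var_pre = (var_p + pre**2 * var_sen + (1.0 - pre) ** 2 * var_spe) / j**2
    return pre, math.sqrt(var_pre)


p_hat, n = 50 / 3330, 3330
pre, se_all = rogan_gladen_with_se(
    p=p_hat, n=n,
    sen=0.803, n_sen=100,      # n_sen is a hypothetical placeholder
    spe=399 / 401, n_spe=401,  # pooled pre-COVID samples quoted earlier
)

# For comparison: the standard error if only p were subject to sampling error.
se_p_only = math.sqrt(p_hat * (1.0 - p_hat) / n) / (0.803 + 399 / 401 - 1.0)

print(f"pre = {pre:.2%}")
print(f"standard error, p only:       {se_p_only:.2%}")
print(f"standard error, p, sen, spe:  {se_all:.2%}")
```

Under these assumptions the standard error roughly doubles once the validation studies are taken into account, and nearly all of the increase comes from the specificity term. That is consistent with the worry that, at low prevalence, uncertainty about specificity dominates.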
The takeaway message is that a gold standard serological test is not always necessary for effective disease surveillance. It is true that unless the test is highly accurate, the positive test proportion p (or a proportion p' adjusted for a stratified sample) is not a good estimator of the true prevalence PRE. That has been known for quite some time and is not in dispute. At the same time, pre sometimes can be a useful estimator of true prevalence. That too is not in dispute. Of course, as always, good data are better than post hoc corrections, but for larger prevalences, serological tests may not require 99.5% specificity to produce useful estimates of how many people have been infected by SARS-CoV-2.
UPDATE (5/9/20): An Oregon State University team in Corvallis is going door to door in an effort to test a representative sample of the college town's population. \1/ A preliminary report released to the media gives a simple incidence of 2/1,000. Inasmuch as the sketchy accounts indicate that the samples collected are nasal swabs, the proportion cannot be directly compared to the proportion positive for serological tests mentioned above. The nasal swabbing is done by the respondents in the survey rather than by medical personnel, \2/ and the results pertain to the presence of the virus at the time of the swabbing rather than to an immune response that may be the result of exposure in the past.
UPDATE (7/9/20): Writing on “SARS-CoV-2 seroprevalence in COVID-19 hotspots” in The Lancet on July 6, Isabella Eckerle and Benjamin Meyer report that
Antibody cross-reactivity with other human coronaviruses has been largely overcome by using selected viral antigens, and several commercial assays are now available for SARS-CoV-2 serology. ... The first SARS-CoV-2 seroprevalence studies from cohorts representing the general population have become available from COVID-19 hotspots such as China, the USA, Switzerland, and Spain. In The Lancet, Marina Pollán and colleagues and Silvia Stringhini and colleagues separately report representative population-based seroprevalence data from Spain and Switzerland collected from April to early May this year. Studies were done in both the severely affected urban area of Geneva, Switzerland, and the whole of Spain, capturing both strongly and less affected provinces. Both studies recruited randomly selected participants but excluded institutionalised populations ... . They relied on IgG as a marker for previous exposure, which was detected by two assays for confirmation of positive results.
The Spanish study, which included more than 60,000 participants, showed a nationwide seroprevalence of 5·0% (95% CI 4·7–5·4; specificity–sensitivity range of 3·7% [both tests positive] to 6·2% [at least one test positive]), with urban areas around Madrid exceeding 10% (eg, seroprevalence by immunoassay in Cuenca of 13·6% [95% CI 10·2–17·8]). ... Similar numbers were obtained across the 2766 participants in the Swiss study, with seroprevalence data from Geneva reaching 10·8% (8·2–13·9) in early May. The rather low seroprevalence in COVID-19 hotspots in both studies is in line with data from Wuhan, the epicentre and presumed origin of the SARS-CoV-2 pandemic. Surprisingly, the study done in Wuhan approximately 4–8 weeks after the peak of infection reported a low seroprevalence of 3·8% (2·6–5·4) even in highly exposed health-care workers, despite an overwhelmed health-care system.
The key finding from these representative cohorts is that most of the population appears to have remained unexposed to SARS-CoV-2, even in areas with widespread virus circulation. [E]ven countries without strict lockdown measures have reported similarly low seroprevalence—eg, Sweden, which reported a prevalence of 7·3% at the end of April—leaving them far from reaching natural herd immunity in the population.
UPDATE (10/5/20): In Seroprevalence of SARS-CoV-2–Specific Antibodies Among Adults in Los Angeles County, California, on April 10-11, 2020, JAMA 2020, 323(23):2425-2427, doi: 10.1001/jama.2020.8279, Neeraj Sood, Paul Simon, Peggy Ebner, Daniel Eichner, Jeffrey Reynolds, Eran Bendavid, and Jay Bhattacharya used the methods applied in the Santa Clara study on a "random sample ... with quotas for enrollment for subgroups based on age, sex, race, and ethnicity distribution of Los Angeles County residents" invited for tests "to estimate the population prevalence of SARS-CoV-2 antibodies." The tests have an estimated sensitivity of 82.7% (95% CI of 76.0%-88.4%) and specificity of 99.5% (95% CI of 99.2%-99.7%).
The weighted proportion of participants who tested positive was 4.31% (bootstrap CI, 2.59%-6.24%). After adjusting for test sensitivity and specificity, the unweighted and weighted prevalence of SARS-CoV-2 antibodies was 4.34% (bootstrap CI, 2.76%-6.07%) and 4.65% (bootstrap CI, 2.52%-7.07%), respectively.
The estimate of 4.65% suggests that some "367,000 adults had SARS-CoV-2 antibodies, which is substantially greater than the 8,430 cumulative number of confirmed infections in the county on April 10." As such, "fatality rates based on confirmed cases may be higher than rates based on number of infections." Indeed, the reported fatality rate based on the number of confirmed cases (about 3% in the US) would be too high by a factor of 44! But "[s]election bias is likely. The estimated prevalence may be biased due to nonresponse or that symptomatic persons may have been more likely to participate. Prevalence estimates could change with new information on the accuracy of test kits used. Also, the study was limited to 1 county."
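The factor of 44 is just the ratio of the two counts in the quotation. The sketch below redoes that arithmetic and then rescales the 3% figure. The rescaling assumes that deaths are counted the same way in both rates and that the Los Angeles ratio can be applied to a national figure, neither of which the study establishes; the function name is hypothetical.

```python
estimated_infections = 367_000  # adults with antibodies, as quoted above
confirmed_cases = 8_430         # cumulative confirmed infections on April 10

undercount_factor = estimated_infections / confirmed_cases
print(f"estimated infections per confirmed case: {undercount_factor:.1f}")


def per_infection_fatality(case_fatality_rate, factor):
    """Rescale a per-confirmed-case fatality rate to a per-infection rate."""
    return case_fatality_rate / factor


rescaled = per_infection_fatality(0.03, undercount_factor)
print(f"3% per confirmed case becomes about {rescaled:.2%} per infection")
```

The ratio is about 43.5, which rounds to the factor of 44 mentioned above.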
NOTES
- Gretchen Vogel, Antibody Surveys Suggesting Vast Undercount of Coronavirus Infections May Be Unreliable, Science, 368:350-351, Apr. 24, 2020, DOI:10.1126/science.368.6489.350, doi:10.1126/science.abc3831
- Id.
- Eran Bendavid, Bianca Mulaney, Neeraj Sood et al., COVID-19 Antibody Seroprevalence in Santa Clara County, California. medRxiv preprint dated Apr. 11, 2020.
- Id.
- Id.
- University of Miami Health System, Sylvester Researchers Collaborate with County to Provide Important COVID-19 Answers, Apr. 25, 2020, http://med.miami.edu/news/sylvester-researchers-collaborate-with-county-to-provide-important-covid-19
- Vogel, supra note 1.
- Andrew Gelman, Concerns with that Stanford Study of Coronavirus Prevalence, posted 19 April 2020, 9:14 am, on Statistical Modeling, Causal Inference, and Social Science, https://statmodeling.stat.columbia.edu/2020/04/19/fatal-flaws-in-stanford-study-of-coronavirus-prevalence/
- E.g., Joseph Gastwirth, The Statistical Precision of Medical Screening Procedures: Application to Polygraph and AIDS Antibodies Test Data, Stat. Sci. 1987, 2:213-222; D. J. Hand, Screening vs. Prevalence Estimation, Appl. Stat., 1987, 38:1-7; Fraser I. Lewis & Paul R. Torgerson, A Tutorial in Estimating the Prevalence of Disease in Humans and Animals in the Absence of a Gold Standard Diagnostic, Emerging Themes in Epidemiology, 2012, 9:9, https://ete-online.biomedcentral.com/articles/10.1186/1742-7622-9-9; Walter J. Rogan & Beth Gladen, Estimating Prevalence from Results of a Screening-test. Am J Epidemiol. 1978, 107: 71-76; Niko Speybroeck, Brecht Devleesschauwer, Lawrence Joseph & Dirk Berkvens, Misclassification Errors in Prevalence Estimation: Bayesian Handling with Care, Int J Public Health, 2012, DOI:10.1007/s00038-012-0439-9
- H. Steinhaus, 1954, The Establishment of Paternity, Pr. Wroclawskiego Tow. Naukowego ser. A, no. 32. (discussed in Michael O. Finkelstein and William B. Fairley, A Bayesian Approach to Identification Evidence. Harvard Law Rev., 1970, 83:490-517). For a related discussion, see David H. Kaye, The Prevalence of Paternity in "One-Man" Cases of Disputed Parentage, Am. J. Human Genetics, 1988, 42:898-900 (letter).
- This expression is known as "the Rogan–Gladen adjusted estimator of 'true' prevalence" (Speybroeck et al., supra note 9) or "the classic Rogan-Gladen estimator of true prevalence in the presence of an imperfect diagnostic test." Lewis & Torgerson, supra note 9. The reference is to Rogan & Gladen, supra note 9.
- They call the proportion p = 1.5% the "unadjusted" estimate of prevalence.
- Some older discussions of the standard error in this situation can be found in Gastwirth, supra note 9; Hand, supra note 9. See also J. Reiczigel, J. Földi, & L. Ózsvári, Exact Confidence Limits for Prevalence of a Disease with an Imperfect Diagnostic Test, Epidemiology and Infection, 2010, 138:1674-1678.
- Beth Mole, Bloody math — Experts Demolish Studies Suggesting COVID-19 Is No Worse than Flu: Authors of widely publicized antibody studies “owe us all an apology,” one expert says, Ars Technica, Apr. 24, 2020, 1:33 PM, https://arstechnica.com/science/2020/04/experts-demolish-studies-suggesting-covid-19-is-no-worse-than-flu/
- https://twitter.com/wfithian/status/1252692357788479488
- Vogel, supra note 1.
- A Bayesian analysis might help. See, e.g., Speybroeck et al., supra note 9.
UPDATED Apr. 27, 2020, to correct a typo in line (2) of the derivation of Equation (1), as pointed out by Geoff Morrison.
NOTES to update of 5/9/20
- OSU Newsroom, TRACE First Week’s Results Suggest Two People per 1,000 in Corvallis Were Infected with SARS-CoV-2, May 7, 2020, https://today.oregonstate.edu/news/trace-first-week%E2%80%99s-results-suggest-two-people-1000-corvallis-were-infected-sars-cov-2
- But "[t]he tests used in TRACE-COVID-19 collect material from the entrance of the nose and are more comfortable and less invasive than the tests that collect secretions from the throat and the back of the nose." Id.