ABSTRACT
Accurately estimating the seroprevalence of antibodies to SARS-CoV-2 requires the use of appropriate methods. Bayesian statistics provides a natural framework for considering the variabilities of specificity and sensitivity of the antibody tests, as well as for incorporating prior knowledge of viral infection prevalence. We present a full Bayesian approach for this purpose, and we demonstrate the utility of our approach using a recently published large-scale dataset from the U.S. CDC.
INTRODUCTION
Antibody tests for COVID-19 have been increasingly deployed to estimate the seroprevalence of antibodies to SARS-CoV-21. Although antibody tests can provide important estimations on the prevalence of the viral infection in populations, the test results must be interpreted with caution due to the presence of false positives and false negatives2. Therefore, a critical statistical challenge is how to accurately estimate the prevalence of the viral infection in populations while accounting for the false positive and false negative rates of the antibody tests.
Recently, the U.S. Centers for Disease Control and Prevention (CDC) published a large-scale study on antibody tests from 10 sites in the U.S. administered between March 23 and May 12, 20203. The CDC antibody tests employed an enzyme-linked immunosorbent assay with a specificity (i.e., 1 – false positive rate) of 99.3% (95% CI, 98.3%-99.9%) and sensitivity (i.e., true positive rate) of 96.0% (95% CI, 90.0%-98.9%)3. In order to take the test accuracy into the consideration, the CDC study applied the following simple correction: Robs = P×Sensitivity + (1-P) × (1-Specificity), where Robs is the observed seroprevalence in the study samples and P is the unknown seroprevalence in populations. Using the point estimates of the sensitivity (96.0%) and specificity (99.3%) of the antibody tests, they obtained the point estimate of the population prevalence P = (Robs – 0.007)/0.953.
There are two main limitations with such an approach. First, only the point estimate of population prevalence P was obtained. Although the CDC study also generated confidence intervals for the point estimate based on a non-parametric bootstrap procedure, the confidence interval does not provide a probabilistic measurement of the uncertainty associated with all possible values of the unknown prevalence. Second, the above CDC approach could not account for any prior knowledge of the population prevalence P, which can lead to inaccurate estimation especially when the true rate of viral infection is low, even with high specificity and sensitivity of the tests4,5.
To overcome the above limitations, we have developed a Bayesian approach. Our approach is not a simple application of Bayes’ theorem by plugging in the point estimates of sensitivity and specificity into the formula and computing a posterior probability. Instead, our approach is a full Bayesian procedure that models the known variability in the sensitivity (95% CI, 90.0%-98.9%) and specificity (95% CI, 98.3%-99.9%) of the antibody test, and we can incorporate any prior knowledge of the viral infection rate to estimate the entire posterior probability distribution of the unknown population prevalence.
MATERIALS AND METHODS
Bayesian modeling
Let Nt and Np denote the number of people tested in total and the number of people tested as positive, respectively. Let p denote the unknown seroprevalence of antibodies to SARS-CoV-2. Let θ denote the true positive rate of the antibody test (i.e., sensitivity). Let κ denote the false positive rate of the test (i.e., 1 – specificity). Then, we can define the following likelihood function:
In Eq. (1), the term (pθ + (1-p)κ)Np corresponds to the probability of observing Np people that have tested positive, since a person with a positive test result can either be infected (with the probability of p) and correctly test positive (with the probability of θ), or not infected (with the probability of 1 -p) and falsely test positive (with the probability of κ). Similarly, the term (p(1-θ) + (1-p)κ)(Nt-Np) corresponds to the probability of observing (Nt -Np) people whose test results were negative.
To estimate the posterior probability of p,, we need to sample from the following posterior distribution:
To specify the prior distribution for p, κ, and θ, we chose beta distributions as they are commonly used to model probabilities6. where αp, βp, ακ, βκ, αθ, and βθ denote shape parameters of the corresponding beta distributions. For the unknown parameter p, we chose to use a non-informative flat prior probability distribution for this study (i.e., αp = βp = 1), although it can be adjusted if prior knowledge of the proportion of infected people for a particular region is known (see more in the Discussion section). For κ and θ, we chose informative priors to reflect the known specificity and sensitivity of a particular antibody test. Specifically, the shape parameters of ακ, βκ, αθ, and βθ can be estimated using the method of moments5 as follows: where μκ and , and μθ and represent the mean and variance of the test specificity and sensitivity, respectively. For this study, the mean of specificity and sensitivity is 99.3% and 96.0%, respectively. The variances of specificity and sensitivity were approximated7 as s(1-s)/n, where s is the mean value of specificity or sensitivity, and n = 618 according to the CDC validation study on the antibody test accuracy8.
We used WinBUGS9 (version 1.4.3) to implement the above models. In particular, the likelihood function was implemented using the “ones trick”10 of WinBUGS (see the GitHub repository https://github.com/qunfengdong/AntibodyTest for the implementation details). The posterior distributions were estimated with the Markov Chain Monte Carlo (MCMC) sampling in WinBUGS using the following parameters: the number of chains of 4, the number of total iterations of 100,000, burn-in of 10,000, and thinning of 4. Convergence and autocorrelations were evaluated with trace/history/autocorrelation plots and the Gelman-Rubin diagnostic11. Multiple initial values were applied for MCMC sampling. The above Bayesian procedure was validated with simulated datasets generated by our customized R12 script (available in the above GitHub repository).
Seroprevalence data
The seroprevalence data was taken from the aforementioned CDC publication3. Our approach requires two inputs: (i) the total number of tested samples and (ii) the number of positive samples. For this project, we only focused on gender-specific data in the CDC study. We extracted the total number of male and female samples from the original Table 1 in the CDC publication. However, the number of positive samples was not reported in the CDC publication. To infer those numbers for both genders, we extracted the CDC estimated seroprevalence, P, for both genders from the original Table 2 in the CDC publication. Using the equation P = (Robs – 0.007)/0.953 mentioned above, we obtained the observed seroprevalence Robs for both genders, which were used for calculating the number of observed positive male and female samples by multiplying Robs to the total number of samples in each respective gender. Table 1 lists the calculated number of test positive samples, rounded to the nearest integer in each site.
RESULTS
We applied our Bayesian approach to the data listed in Table 1. It is important to emphasize that Bayesian approaches produce entire probability distributions instead point estimates6. Figure 1 depicts the posterior distributions of the seroprevalence of antibodies to SARS-CoV-2 virus in 10 U.S. sites. Table 2 lists both the original CDC point estimates with the accompanying 95% confidence intervals, and our Bayesian estimates, which were presented as the medians and 95% credible intervals of the posterior distributions. It is worth noting that confidence intervals and Bayesian credible intervals are two different concepts13, thus they are not technically comparable despite being listed together in Table 2 for convenience. Although the posterior medians are similar to the original CDC point estimates overall, the entire posterior distributions (fig. 1) inferred by our Bayesian approach accurately capture the uncertainties associated with seroprevalence (i.e., the posterior distribution provides a precise probability associated with every possible value of seroprevalence), which cannot be achieved through confidence intervals.
DISCUSSION
Antibody tests have been increasingly applied to estimate the prevalence of people who have been infected by the SARS-CoV-2 virus. For example, New York City recently released data of more than 1.46 million coronavirus antibody test results on August 18, 2020. Accurately analyzing such data is critical for developing important public health policies14. Our Bayesian approach can account for the variabilities in antibody tests (i.e., uncertainties in the sensitivity and specificity of the tests). In addition, the Bayesian approach can easily incorporate prior knowledge of the proportion of infected people for a particular region. This is particularly important for accurate estimation if the true prevalence is low5. Moreover, the Bayesian approach also provides a natural framework for updating the estimation based on new data, which is particularly relevant to the continuous monitoring of the seroprevalence of coronavirus antibodies. For example, New York City is still releasing coronavirus antibody test results on a weekly basis15. By turning the estimated posterior distribution from previous weeks into a prior distribution for the next week, the seroprevalence of coronavirus antibody can be quickly updated within a solid Bayesian probabilistic inference framework.
Data Availability
We only used published dataset for analysis in this study.
AUTHOR CONTRIBUTION
Q.D. and X.G. both contributed project conception. Q.D. contributed WinBUGS modeling and drafting the manuscript. X.G. contributed R programming and data analysis.
CONFLICT OF INTEREST
None declared
Funding
None