Summary
Sampling for prevalence estimation of infection is subject to bias by both over-sampling of symptomatic individuals and error-prone tests. This results in naïve estimators that can be very far from the truth. In this work, we present a method of prevalence estimation that removes the effect of testing errors and reduces the effect of oversampling symptomatic individuals. Moreover, this procedure considers stratified errors in which tests have different error rate profiles for symptomatic and asymptomatic individuals. The result is an easily implementable algorithm (for which code is provided) that produces better prevalence estimates than other methods, as demonstrated by simulation and on Covid-19 data from the Israeli Ministry of Health.
1 INTRODUCTION
Estimation of disease prevalence is challenging. First, imperfect testing distorts actual proportions. Second, estimates must often be derived from samples that under-represent, or fail to capture, the subpopulations that are at greatest risk or of greatest interest. One example is the general population prevalence of chronic hepatitis C (HCV), which is hard to estimate because of the challenges of sampling from subpopulations of former and current injecting drug users, the homeless, or the incarcerated. 1 Another example is the over-representation of symptomatic individuals in a sample: these individuals are more likely to get tested than asymptomatic ones and are also more likely to be truly infected, so the final estimate of prevalence is inflated.
This situation became clear during the recent Covid-19 pandemic: besides the usual discussions of the error rates of PCR and rapid tests, surveillance mechanisms usually relied on convenience sampling or contact tracing, so sampling bias was also present. Convenience sampling is biased because it passively waits for symptomatic individuals to get tested, whereas asymptomatic individuals have few reasons to do so. Contact tracing is biased because it actively pursues infected individuals and ignores the non-infected almost altogether. Contact tracing has also raised questions about privacy and individual liberties. 2,3,4 Though this example corresponds to a non-probability Covid-19 sampling setting, the problem is more general: it applies to every form of prevalence estimation performed through testing, whether probabilistic or not.
Recently, Díaz-Pachón and Rao introduced a correction for oversampling of the symptomatic group. 5 It was a three-step procedure based on the assumption that all symptomatic individuals in the population were sampled and infected, but it did not address the issue of imperfect testing (i.e., the presence of false positives and false negatives). Under that assumption, the symptomatic and infected individuals in the sample corresponded to the total number of symptomatic individuals in the population. Thus the asymptomatic group in the population was the complement of the total of symptomatic individuals in the sample. The prevalence among the asymptomatic group was then obtained as a uniform random variable among the asymptomatic individuals in the population, with no recourse to the sample.
In this paper a method that is stronger in all aspects is presented. First, it does not assume that all symptomatic individuals are sampled, only that symptomatic individuals are overrepresented in the sample. Second, sample values among the asymptomatic are used to produce an estimator of prevalence that is informed by evidence. Third, testing errors are considered. And fourth, the proposed correction is extended to stratified errors by symptom status.
2 SETTING
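Consider a population 𝒫 of size N that is divided into four categories: asymptomatic and non-infected individuals, $\mathcal{I}_0^{(0)}$, with size $N_0^{(0)}$; asymptomatic and infected individuals, $\mathcal{I}_0^{(1)}$, with size $N_0^{(1)}$; symptomatic and infected individuals, $\mathcal{I}_1^{(1)}$, with size $N_1^{(1)}$; and symptomatic and non-infected individuals, $\mathcal{I}_1^{(0)}$, with size $N_1^{(0)}$. The population total N is known, whereas $N_0^{(0)}$, $N_0^{(1)}$, $N_1^{(1)}$, and $N_1^{(0)}$ are unknown, though their sum is N.
The group of individuals with symptoms s in the population will be denoted by $\mathcal{I}_s$, and its total by $N_s$, for s = 0, 1. Analogously, the group of individuals with infection status i in the population will be denoted by $\mathcal{I}^{(i)}$ and its total by $N^{(i)}$, for i = 0, 1.
Now, $p_s^{(i)}$ will be the proportion of individuals in the population with symptoms s and infection status i. More formally, define a random element S∗ taking values in the set $\{\mathcal{I}_0^{(0)}, \mathcal{I}_0^{(1)}, \mathcal{I}_1^{(0)}, \mathcal{I}_1^{(1)}\}$, with density given by
$$P\big(S^* = \mathcal{I}_s^{(i)}\big) = p_s^{(i)} = \frac{N_s^{(i)}}{N} \qquad (1)$$
and $\sum_{s,i} p_s^{(i)} = 1$.
The proportion of individuals in the group $\mathcal{I}_s$ is then given by $p_s = p_s^{(0)} + p_s^{(1)} = N_s/N$, for s = 0, 1. And the proportion of individuals in the group $\mathcal{I}^{(i)}$ is given by $p^{(i)} = p_0^{(i)} + p_1^{(i)} = N^{(i)}/N$.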
2.1 Sampling probabilities
For the j-th individual in the population (0 < j ≤ N), define a Bernoulli random variable as follows:
That is, an individual in the category will be tested with probability , for s, i = 0, 1.
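The sampling probability of individuals with symptoms s and infection i is defined as
$$p\big(\mathcal{I}_s^{(i)}\big) = \frac{N_{T,s}^{(i)}}{N_s^{(i)}}, \qquad (3)$$
where $N_{T,s}^{(i)}$ is the number of tested individuals from group $\mathcal{I}_s^{(i)}$. Analogously to (3), $p(\mathcal{I}_s)$, the sampling probability among individuals with symptoms s, is defined as
$$p(\mathcal{I}_s) = \frac{N_{T,s}}{N_s}, \qquad (4)$$
where $N_{T,s} = N_{T,s}^{(0)} + N_{T,s}^{(1)}$. And $p(\mathcal{I}^{(i)})$, the sampling probability among individuals with infection status i, is defined as
$$p\big(\mathcal{I}^{(i)}\big) = \frac{N_T^{(i)}}{N^{(i)}}, \qquad (5)$$
where $N_T^{(i)} = N_{T,0}^{(i)} + N_{T,1}^{(i)}$. Finally, the total of sampled individuals is defined as
$$N_T = \sum_{s,i} N_{T,s}^{(i)}. \qquad (6)$$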
3 ESTIMATORS
In case there is no error in testing, the naïve estimator of $p_s^{(i)}$ can be naturally defined as the conditional probability of an individual belonging to the group $\mathcal{I}_s^{(i)}$, given that s/he was sampled. A Bayesian approach, inspired by ideas in publication bias, 6 leads to
Then $N_s^{(i)}$, the population size of $\mathcal{I}_s^{(i)}$, disappears from the sample estimator, and (1) in the Appendix shows that all information in the sample about the group $\mathcal{I}_s^{(i)}$ comes from the sampling mechanism $p(\mathcal{I}_s^{(i)})$. In fact, the true proportion $p_s^{(i)}$ can be seen as the message sent, its naïve estimator as the message received, and the sampling mechanism as the channel between them distorting the message. 7,8
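The Bayesian form of the naïve estimator referred to above can be sketched as follows. This is a reconstruction from the surrounding definitions rather than a quotation of the paper's display: the symbol $\hat{p}_s^{(i)}$ for the naïve estimator and the notation $N_{T,s}^{(i)}$ for the number of tested individuals in group $\mathcal{I}_s^{(i)}$ are assumptions made here for illustration. The Appendix describes the middle step as an approximation.

```latex
\hat{p}_s^{(i)}
  = \frac{N_{T,s}^{(i)}}{N_T}
  \approx \frac{p\big(\mathcal{I}_s^{(i)}\big)\, N_s^{(i)}}
               {\sum_{s',i'} p\big(\mathcal{I}_{s'}^{(i')}\big)\, N_{s'}^{(i')}}
  = \frac{p\big(\mathcal{I}_s^{(i)}\big)\, p_s^{(i)}}
         {\sum_{s',i'} p\big(\mathcal{I}_{s'}^{(i')}\big)\, p_{s'}^{(i')}}
```

Dividing numerator and denominator by N is what removes the group sizes: only the true proportions and the sampling probabilities remain, which is the sense in which the sampling mechanism acts as a channel.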
Analogously to (7) with Proposition 1, the naive estimator of individuals with symptoms s, , and the naive estimator of individuals with infection status i, , are defined as
Equation (1) in the Appendix also says that some information about the sampling mechanism is needed if any meaningful observation is going to be obtained. For the scenario considered in this article, this corresponds to the intuition that, since symptomatic individuals are more prone to get tested than asymptomatic individuals, the probability of sampling from the symptomatic group is larger than the probability of sampling from the asymptomatic one:
$$p(\mathcal{I}_0) < p(\mathcal{I}_1). \qquad (11)$$
Also, corresponding to the intuition that infected and non-infected individuals inside each category of symptoms are randomly sampled,
$$p\big(\mathcal{I}_s^{(0)}\big) = p\big(\mathcal{I}_s^{(1)}\big) = p(\mathcal{I}_s), \qquad (12)$$
for s = 0, 1.
3.1 Naïve estimators with testing errors
Up to this point the analysis has not considered testing errors. The following result is obtained when the errors are introduced and stratified by symptoms:
Let α0 and β0 be the false positive and false negative rate for asymptomatic individuals, respectively; and let α1 and β1 be the false positive and false negative rate for symptomatic individuals, respectively. The naïve estimators thus become:
Analogously to previous definitions, let and .
Remark 1. The right-hand side of (13) contains the contribution to the naïve estimator by each group in the sample weighted by the probability of their errors. The “tilde terms” are observed by the practitioner, but the error-free terms are unknown to him/her.
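One plausible written-out form of the mixing described in Remark 1 is given below. It is a sketch under stated assumptions, not a quotation of (13): $\tilde{p}_s^{(i)}$ is assumed to denote the observed sample proportion with apparent infection status i, and $\hat{p}_s^{(i)}$ the corresponding error-free sample proportion.

```latex
\begin{aligned}
\tilde{p}_s^{(1)} &= (1-\beta_s)\,\hat{p}_s^{(1)} + \alpha_s\,\hat{p}_s^{(0)},\\
\tilde{p}_s^{(0)} &= \beta_s\,\hat{p}_s^{(1)} + (1-\alpha_s)\,\hat{p}_s^{(0)},
\qquad s = 0, 1.
\end{aligned}
```

Adding the two equations for a fixed s gives $\tilde{p}_s^{(0)} + \tilde{p}_s^{(1)} = \hat{p}_s^{(0)} + \hat{p}_s^{(1)}$: under this form, testing errors move individuals between the positive and negative cells but leave the sample proportion of each symptom class unchanged.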
4 CORRECTION
This section introduces an estimator that corrects bias induced by the testing errors and oversampling of symptomatic individuals. Section 4.1 uses the maximum entropy principle, together with (11) and (12) to correct the latter. Section 4.2 proposes an estimator that eliminates the bias induced by the former. Section 4.3 puts the two together in a simple algorithm that summarizes the findings.
4.1 Correction of sampling bias
This section ignores the presence of testing errors. The approach will be to use the maximum entropy principle, which is “the least biased estimate possible on the given information.” 9 Such given information corresponds to the overrepresentation of symptomatic individuals (11) and the random sampling of infected and non-infected individuals inside each category of symptoms (12). The next theorem shows that (11) provides an upper bound to p1 and .
For if and only if .
Theorem 1 shows that, given the basic assumption (11), p1 is bounded above by . On the other hand, says that there are at least infected symptomatic individuals in the population. Therefore,
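The reason overrepresentation of the symptomatic group yields an upper bound can be sketched in a few lines. The derivation below assumes $0 < p_1 < 1$, writes $\hat{p}_1$ for the naïve estimator of the symptomatic proportion (an assumed symbol), and uses $p_0 = 1 - p_1$; it is an illustration of the idea, not a reproduction of the proof in the Appendix.

```latex
\begin{aligned}
\hat{p}_1 &= \frac{p_1\, p(\mathcal{I}_1)}{p_0\, p(\mathcal{I}_0) + p_1\, p(\mathcal{I}_1)},\\[4pt]
\hat{p}_1 \ge p_1
  &\iff p(\mathcal{I}_1) \ge p_0\, p(\mathcal{I}_0) + p_1\, p(\mathcal{I}_1)\\
  &\iff p_0\, p(\mathcal{I}_1) \ge p_0\, p(\mathcal{I}_0)\\
  &\iff p(\mathcal{I}_1) \ge p(\mathcal{I}_0).
\end{aligned}
```

So the naïve symptomatic proportion overshoots $p_1$ exactly when the symptomatic group is sampled at least as intensively as the asymptomatic one.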
By the maximum entropy principle, 10 the corrected estimator of p1 is taken to be the expectation of a uniform distribution over . Formally, let U be a uniform distribution over the interval . The corrected estimator of p1 is defined as
Since, by (12), the sample is assumed to be random among symptomatic individuals,
Again, using (12), but now on the asymptomatic group, the prevalence for this group is obtained as
Taking (16) and (17), the final sampling-bias corrected prevalence is then taken to be
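The steps of this subsection can be sketched in code as follows. This is a minimal illustration under assumptions, not the authors' implementation (their code is in the repository cited in Section 4.3): the function name and arguments are hypothetical, testing errors are ignored, the upper endpoint of the uniform interval is taken to be the naïve symptomatic proportion, and the lower endpoint is assumed to be the number of sampled symptomatic individuals divided by N.

```python
def correct_sampling_bias(n_sym_pos, n_sym_neg, n_asym_pos, n_asym_neg,
                          population_size):
    """Sketch of the sampling-bias correction, ignoring testing errors.

    Arguments are sample counts by symptom class and infection status;
    population_size is the known total N.
    """
    n_sym = n_sym_pos + n_sym_neg
    n_asym = n_asym_pos + n_asym_neg
    if n_sym == 0 or n_asym == 0:
        raise ValueError("both symptom classes must appear in the sample")
    n_total = n_sym + n_asym

    # Upper endpoint: naive estimate of the symptomatic proportion.
    upper = n_sym / n_total
    # Lower endpoint (ASSUMPTION): sampled symptomatic individuals over N.
    lower = n_sym / population_size
    # Maximum entropy: mean of a uniform distribution on [lower, upper].
    p1_corrected = 0.5 * (lower + upper)

    # Random sampling within each symptom class: within-class prevalence
    # is estimated directly from the sample.
    prevalence_sym = n_sym_pos / n_sym
    prevalence_asym = n_asym_pos / n_asym

    prevalence = (p1_corrected * prevalence_sym
                  + (1.0 - p1_corrected) * prevalence_asym)
    return p1_corrected, prevalence


# Hypothetical sample: 150 symptomatic (120 positive), 50 asymptomatic
# (5 positive), drawn from a population of 1000.
p1_corrected, prevalence = correct_sampling_bias(120, 30, 5, 45, 1000)
print(p1_corrected, prevalence)
```

For this hypothetical sample the naïve prevalence is 125/200 = 0.625, while the sketch returns a corrected symptomatic proportion of 0.45 and a corrected prevalence of 0.415.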
Hössjer et al proved that converges asymptotically to : 11
(Hössjer et al, Theorem 1). Suppose N → ∞ in such a way that, for s = 0, 1, ps = Ns/N is fixed and the sampling probabilities satisfy that . Assume also that there exists such that converges in probability to for all s as N → ∞. Then where “→ℒ” implies convergence in distribution, and
Remark 2. Theorem 2 does not say that the corrected estimator converges to the true prevalence. It says that the corrected estimator of prevalence behaves as well as the corrected estimator of the symptomatic proportion estimates N1/N. This corresponds well to the maximum entropy assumption: the more useful information is at hand, the higher the reduction of entropy, and therefore the better the correction.
4.2 Error-free estimator
According to Remark 1, when testing errors are considered, estimators that correct them are necessary before applying the correction to sampling bias. This section presents such estimators.
For s = 0, 1, assume αs and βs are known, and let βs ≤ 1 − αs. The estimators of and , respectively, are unbiased for errors, where and . Thus is an estimator of .
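A minimal sketch of how known error rates can be inverted within one symptom class is shown below. The function names are hypothetical, and the forward model (observed positives are true positives that test positive plus non-infected individuals that test positive) is an assumption consistent with Remark 1, not a quotation of the paper's estimators. The inversion needs 1 − αs − βs to be strictly positive.

```python
def add_testing_errors(true_pos, true_neg, alpha, beta):
    """Forward model: error-free proportions -> observed proportions."""
    obs_pos = (1.0 - beta) * true_pos + alpha * true_neg
    obs_neg = beta * true_pos + (1.0 - alpha) * true_neg
    return obs_pos, obs_neg


def remove_testing_errors(obs_pos, obs_neg, alpha, beta):
    """Invert the forward model for one symptom class.

    alpha is the false positive rate and beta the false negative rate
    for that class. In small samples the result can fall outside
    [0, obs_pos + obs_neg]; no clipping is applied here.
    """
    denom = 1.0 - alpha - beta
    if denom <= 0:
        raise ValueError("requires alpha + beta < 1")
    total = obs_pos + obs_neg  # unchanged by testing errors
    true_pos = (obs_pos - alpha * total) / denom
    return true_pos, total - true_pos


# Round trip with hypothetical proportions for one symptom class.
observed = add_testing_errors(0.5, 0.2, alpha=0.05, beta=0.05)
recovered = remove_testing_errors(*observed, alpha=0.05, beta=0.05)
print(observed, recovered)
```

Applied to a whole sample with a single pair of error rates (total equal to 1), the inversion reduces to the familiar form (observed positive proportion − α)/(1 − α − β).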
4.3 Algorithm
The correction procedure can be motivated as follows. The researcher observes the total of positive and negative individuals among the symptomatic and asymptomatic groups after testing, and knows neither the real prevalence nor the sampling scheme, except for the fact that the symptomatic group is oversampled (11). Thus the process to correct the estimator runs backwards: starting with the naïve estimator (that includes bias from errors and sampling), the sampling proportions are recovered (getting rid of the errors), to finally produce a corrected estimator (reducing the sampling bias). Algorithm 1, for which code is available at https://github.com/kalilizhou/BiasCorrection.git, summarizes the procedure to obtain a corrected estimator of prevalence.
Remark 3. If stratification is ignored, just take α = α0 = α1 and β = β0 = β1 in Step 1 of Algorithm 1.
Remark 4. If error in testing is not of interest, and only sampling bias is being considered, Algorithm 1 can still be used, starting from step 2.
Remark 5. If only testing errors are under consideration, and sampling bias is ignored, Step 1 of Algorithm 1 provides a correction.
Remark 6. As specified in Remark 2, under the assumption of maximum entropy a natural way to increase the precision of the estimator is to add knowledge whenever it is available. The scenario considered by Díaz-Pachón and Rao in which all the symptomatic individuals are tested (as it is required in most universities and companies in the U.S.) is one example. 5 In this case, the only modification of Algorithm 1 is that in (22) becomes
No other changes are required. Notice however that, under this modification, Algorithm 1 does a better job than the procedure considered by Díaz-Pachón and Rao, 5 since Algorithm 1 neither assumes the absence of symptomatic individuals without the disease nor ignores the evidence from the sample to estimate prevalence between the classes of symptoms, as becomes clear from steps 3 and 4. This superiority, as well as comparisons with other mechanisms, will be analyzed in Section 6.
5 SIMULATION
This section uses simulation to analyze the asymptotic behavior of the corrected estimator. The population has the following features:
The proportion of positive cases with symptoms is 0.15,
the proportion of negative cases with symptoms is 0.05,
the proportion of positive cases without symptoms is 0.05,
and the proportion of negative cases without symptoms is 0.75.
Thus, the prevalence is p(1) = 0.15 + 0.05 = 0.20, the proportion of symptomatic individuals in the population is p1 = 0.15 + 0.05 = 0.20, so the proportion of asymptomatic individuals in the population is p0 = 0.80. The asymptomatic false positive rate is taken to be α0 = 0.01, the symptomatic false positive rate is α1 = 0.05, the asymptomatic false negative rate is β0 = 0.1, and the symptomatic false negative rate is β1 = 0.05.
The proportion of asymptomatic patients being tested is p(I0) = 0.1, while the proportion of symptomatic patients being tested is p(I1) = 0.9. (Notice that, in spite of the selection of the sampling probabilities for this simulation, there is no requirement that p(I0) + p(I1) = 1.)
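A simulation of this setup can be sketched with the Python standard library. The sketch is illustrative only and is not the authors' simulation code: all names are hypothetical, individuals are drawn independently from the stated population proportions (an assumption), and sampling is assumed to depend only on symptom status.

```python
import random

# (symptomatic, infected) -> population proportion, as stated above.
GROUPS = {(1, 1): 0.15, (1, 0): 0.05, (0, 1): 0.05, (0, 0): 0.75}
SAMPLING_PROB = {0: 0.1, 1: 0.9}   # p(I0), p(I1)
ALPHA = {0: 0.01, 1: 0.05}         # false positive rates by symptom status
BETA = {0: 0.10, 1: 0.05}          # false negative rates by symptom status


def simulate_naive_estimate(population_size, seed=0):
    """Return the naive prevalence: observed positives over all tested."""
    rng = random.Random(seed)
    keys = list(GROUPS)
    weights = [GROUPS[k] for k in keys]
    tested = 0
    positives = 0
    for symptomatic, infected in rng.choices(keys, weights=weights,
                                             k=population_size):
        # Biased sampling: the chance of being tested depends on symptoms.
        if rng.random() >= SAMPLING_PROB[symptomatic]:
            continue
        tested += 1
        # Imperfect test: stratified false negative / false positive rates.
        if infected:
            positive = rng.random() >= BETA[symptomatic]
        else:
            positive = rng.random() < ALPHA[symptomatic]
        positives += int(positive)
    return positives / tested


for size in (1_000, 10_000, 100_000):
    print(size, round(simulate_naive_estimate(size, seed=1), 3))
```

Under these assumptions the naïve estimate settles near 0.13575/0.26 ≈ 0.52 as the population grows, far above the true prevalence of 0.20.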
The naïve estimator from the sample, , and the corrected estimator, , are listed in Table 1 for increasing population values, which suggests that the estimated values are gradually converging; i.e., and .
5.1 Active information: the index
Active information (actinfo) was introduced in search problems to quantify the amount of Shannon information introduced by the programmer in a search problem. 12,13,14 In machine learning, it has been used to show that no algorithm performs well for a large class of problems, in agreement with the so-called No Free Lunch Theorems. 15,16,17 It has also been used for mode hunting, 18,19 and to compare neutral to non-neutral models in population genetics. 20 Following the recommendation of Hössjer et al, here active information is used as a measure of bias. 11 The idea is as follows: active information is defined as where the logarithm is taken to be in base e, so that information is measured in nats. Thus defined, active information measures the amount of Shannon information of the estimator to the true proportion p(1), and it is the quantity that is averaged in the Kullback-Leibler divergence 21. That is, if the true proportion is overestimated, the active information will be positive and large; if the true proportion is underestimated, the active information will be negative; and if the true proportion is accurately estimated, the active information will be around zero. 22,23
Moreover, active information can be decomposed into two parts: the first measures the difference in information from the biased estimate to the real prevalence, and the second measures the difference in information from the correction to the naïve estimator. 11 Their empirical versions are listed in Table 2.
The active information for the biased estimator is . The active information for the correction reduces the bias, producing I+ ≈ 0.59. 22
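The large-population limits of these quantities can be checked with a short calculation. The sketch below rests on several assumptions that are not confirmed by the text: active information is taken to be the natural logarithm of the ratio of an estimate to the true proportion (consistent with the sign behaviour described above), testing errors are assumed to be removed exactly before the sampling correction, and the lower endpoint of the uniform interval is assumed to be the sampled symptomatic count over N. All variable names are hypothetical.

```python
import math


def active_information(estimate, reference):
    """Assumed form: log-ratio of estimate to reference, in nats."""
    return math.log(estimate / reference)


# Population proportions p[(symptomatic, infected)] and rates from Section 5.
p = {(1, 1): 0.15, (1, 0): 0.05, (0, 1): 0.05, (0, 0): 0.75}
sampling = {0: 0.1, 1: 0.9}   # p(I0), p(I1)
alpha = {0: 0.01, 1: 0.05}    # false positive rates
beta = {0: 0.10, 1: 0.05}     # false negative rates

true_prevalence = p[(1, 1)] + p[(0, 1)]

# Limits of the sampled fractions N_T,s / N for each symptom class.
sampled = {s: sampling[s] * (p[(s, 0)] + p[(s, 1)]) for s in (0, 1)}
sampled_total = sampled[0] + sampled[1]

# Naive estimate: observed positives (with testing errors) over all tested.
observed_positive = sum(
    sampling[s] * ((1 - beta[s]) * p[(s, 1)] + alpha[s] * p[(s, 0)])
    for s in (0, 1)
)
naive = observed_positive / sampled_total

# Corrected estimate, with errors assumed removed exactly in the limit.
upper = sampled[1] / sampled_total   # naive symptomatic proportion
lower = sampled[1]                   # ASSUMED lower endpoint N_T,1 / N
p1_corrected = 0.5 * (lower + upper)
within = {s: p[(s, 1)] / (p[(s, 0)] + p[(s, 1)]) for s in (0, 1)}
corrected = p1_corrected * within[1] + (1 - p1_corrected) * within[0]

i_naive = active_information(naive, true_prevalence)
i_correction = active_information(corrected, naive)
i_plus = active_information(corrected, true_prevalence)
print(round(naive, 3), round(corrected, 3), round(i_naive, 2), round(i_plus, 2))
```

With these assumptions the naïve estimate carries about 0.96 nats of active information and the corrected estimate about 0.59 nats, in line with the I+ ≈ 0.59 reported above; the two log-ratios for the naïve estimate and for the correction add up to I+ by construction.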
6 DATA FROM THE ISRAELI MINISTRY OF HEALTH
In what follows, Covid-19 data from the Israeli Ministry of Health is considered. 24 The Ministry of Health publicly released data for individuals tested for Covid-19 via a PCR assay from a nasal swab sample collected between March 22, 2020 and April 7, 2020. The dataset contains information on the test date, test result, clinical symptoms, gender of the individual, known contact with an infected individual and a binary indicator of whether the individual was 60 years of age or older. Symptoms include cough, fever, sore throat, shortness of breath and headache. For the purposes of illustrating the methodology, we will consider this the population consisting of 99 232 tested individuals of whom 1862 were symptomatic (have shortness of breath or have at least three of four symptoms: cough, fever, sore throat, and headache) and 97 370 were asymptomatic. Among the total tested individuals, it was possible to identify 8393 infections through PCR testing. Among the individuals who tested positive, 1754 were symptomatic. The characteristics of the data set are presented in Table 3.
Error rates will be stratified by symptoms. Thus, let α0 and α1 be the false positive rate for asymptomatic and symptomatic individuals, respectively, and β0 and β1, the false negative rate for asymptomatic and symptomatic individuals, respectively. For the purpose of this example, α0 = 0.1%, α1 = 0.5%, β0 = 10%, and β1 = 5%. The actual number of individuals inside each group can be found in Table 4, after correcting for these errors.
The real prevalence is then and prevalence among the asymptomatic is 15 705/97 370 = 0.161.
Finally, active information (24) is used to compare how well Algorithm 1 and other estimators proposed in the literature are doing with respect to the real prevalence. The best estimator will be the one with active information I+ closest to 0. The competitors will be the method proposed by Díaz-Pachón and Rao, which assumes all symptomatic individuals are sampled, correcting only for sampling bias and ignoring testing errors; 5 Diggle’s Bayesian approach, which corrects for imperfect testing but ignores sampling bias; 25 and the Rogan-Gladen estimate, a frequentist method that also corrects only for testing errors. 26 None of the competitors corrects for sampling bias and testing errors at the same time. Despite an extensive search, we could not find a methodology that simultaneously corrects for both; this will be reflected in the analysis.
Sampling Protocol 1
In the first scenario, all symptomatic individuals are sampled, as considered by Díaz-Pachón and Rao. 5 The sample consists of 2483 individuals. Among these, 1862 (75%) are symptomatic and 621 (25%) are asymptomatic. Since all the symptomatic group was sampled and tested, the observations for this group coincide with those of Table 3. As for the asymptomatic group, the sampling proportions are taken according to Table 4. The observations of this setting are summarized in Table 5.
According to Table 5, the naïve estimator is . Using Algorithm 1 with the modification (23), the corrected estimator is . Table 6 presents these results as well as those of the other methods.
In this case, Diggle’s correction was not implemented because it involves combinations in its logarithm that are difficult to approximate when the sample is moderately large. Under the assumption of sampling all symptomatic individuals, the Díaz-Rao algorithm works better than all others, and the Rogan-Gladen estimate (RGE) performs as poorly as the naïve estimator. However, Algorithm 1 also corrects the naïve estimate very well, producing small active information. Both Díaz-Rao and Algorithm 1 are close to the real prevalence, and there is no statistical difference between them for this scenario. 22
For the next protocols, the assumption that all symptomatic individuals were sampled is removed, which implies that the Díaz-Rao correction cannot be assessed and Algorithm 1 is followed without modifications.
Sampling Protocol 2
The sample consists of 200 individuals. Among these, 150 (75%) are symptomatic and 50 (25%) are asymptomatic. For both the symptomatic and asymptomatic groups, the sampling proportions are taken according to Table 4. The resulting observations are shown in Table 7. The summary of results under different methods is shown in Table 8.
Table 8 shows that Algorithm 1 has the best performance, without being optimal. In fact, Diggle’s and Rogan-Gladen’s estimates do as poorly as the naïve estimate. Algorithm 1 beats its competitors because it is the only one that corrects for sampling bias, whereas the other two only correct for testing errors. Notice that the additional information of Protocol 1 (knowing that all symptomatic individuals were sampled), in comparison to Protocol 2, greatly improves the performance of Algorithm 1, as reflected by the active information.
Sampling Protocol 3
In this scenario there are 100 symptomatic and 100 asymptomatic individuals. Again Table 4 reflects the proportions inside each group for this protocol. Table 9 is obtained. With Table 9 as base, the summary of results under different methods for this sampling protocol is presented in Table 10.
Thus, compared to Sampling Protocol 2, with less sampling bias, all the methods perform better. Rogan-Gladen’s estimate performs better than Diggle’s, reducing the naïve bias. Algorithm 1 still works better than its competitors, correcting half of the naïve overestimation.
Sampling Protocol 4
This sample is truly random, with NT = 200. Table 4 is used to determine the proportions of all groups. With these values Table 11 is obtained. The results of the different methods for this scenario are presented in Table 12.
Of course, in this scenario the naïve estimate is optimal. Rogan-Gladen’s frequentist estimate grossly overcorrects to the point of removing almost 2 nats of information with respect to the real prevalence. On the other hand, Diggle’s Bayesian approach and Algorithm 1 work pretty well and no statistical difference is observed between them and the naïve estimate.
7 DISCUSSION
Timely and accurate estimation of disease prevalence is fundamental in epidemiology because prevalence provides a measure of disease burden in a population at a particular point in time. It can also be part of a compendium of measures used to inform public health prevention policies to help slow the spread of disease through the population. To provide prevalence estimates that are reliable and generalizable, the sample must be comprehensive enough to capture all relevant subpopulations in the general population and, as mentioned, for a number of diseases this can be challenging because many of these subpopulations are hard to reach. Thus, sampling bias corrections are needed. This paper has presented new methodology for biased samples that result from over-sampling of symptomatic individuals. In addition, Algorithm 1 goes further and presents a correction for both sampling bias and testing errors. However, the methodology generalizes easily regardless of how the biased samples arose.
A limitation of our study is that error rates for tests are assumed to be known a priori. If this is not the case, then at least under the random sampling situation, prevalence can still be estimated using a Bayesian approach described by Diggle. 25 This naturally results in increased variability of the prevalence estimate and relies on a reasonable prior distribution being elicited for the prevalence. This approach has not been extended to the setting in this paper, in which sampling bias is also an issue.
Sample pooling has also been proposed as an efficient way to estimate population prevalence because if the disease prevalence is low, then little information is accrued from individual tests. 27 This is sometimes called group testing. However, this implicitly assumes random sampling of pools which is clearly not the case considered here.
Another approach is to use population seroprevalence complex surveys. 28,29 While inherently much more difficult to conduct and analyze, these can also suffer from non-ignorable non-response which can lead to biased estimates of prevalence. Indeed, biased sampling can be more generally cast within a missing data framework and the impact of different missing data mechanisms has been studied. 11
For some diseases it is becoming more common to use administrative data to estimate disease prevalence since for many countries these data cover large proportions of the population. Examples include Canada, Denmark and Italy among others. This requires some effort to properly assemble these data sources, 30 but they have to date not proven as useful for emerging diseases like Covid-19 where surveillance studies dominated the earlier days of the pandemic.
Data Availability
Code to implement Algorithm 1 is available at https://github.com/kalilizhou/BiasCorrection.git. The data used in Section 6 is publicly available at https://github.com/nshomron/covidpred.
Author contributions
D. A. D. P. and J. S. R. conceptualized the methodology framework and the paper. L. Z. and D. A. D. P. developed the methodology details, L. Z. ran the examples and produced the R code, and C. Z. ran the simulations.
Financial disclosure
None reported.
Conflict of interest
The authors declare no potential conflict of interests.
How to cite this article: Zhou L., Díaz-Pachón D. A., Zhao C., and Rao J. S. (2022), Correcting prevalence estimation for biased sampling with testing errors,, 2022;00:1–13.
APPENDIX
Proof of Proposition 1. where the approximation step uses (1), (3), and (6).
Proof of Proposition 2. The result follows from Proposition 1 once testing errors are taken into account.
Proof of Theorem 1. where the fourth step used that .
Proof of Theorem 2. The result follows from Theorem 1 in Hössjer et al. 11.
Proof of Theorem 3. Since the left-hand side of (13) is obtained from the sample, and the errors are known, the first two equations of (13) have two unknowns: and . Analogously, the last two equations of (13) have two unknowns: and . Now, for s = 0, 1, and
Footnotes
The algorithm for the correction has been significantly strengthened and simplified, with code that easily allows to implement it. Also, a simulation and a comparison against other competitors with a real data set have been added showing the superiority of our algorithm. The comparison is made using active information.