## Abstract

Surveillance studies for COVID-19 prevalence estimation are subject to sampling bias due to oversampling of symptomatic individuals and to error-prone tests, particularly rapid antigen tests, which are known to have high false negative rates for asymptomatic individuals. This results in naïve estimators that can be very far from the truth. In this work, we present a method that removes these two sources of error directly. Moreover, our procedure can be easily extended to the stratified error situation in which a test has very different error rate profiles for symptomatic and asymptomatic individuals, as is the case for rapid antigen testing. The result is an easily understandable four-step algorithm that produces much more reliable prevalence estimates, as demonstrated on data from the Israeli Ministry of Health. Thus it may re-open the debate about whether we are under-valuing rapid testing as a surveillance tool, and it may have policy implications in Third-World countries or disadvantaged communities where PCR testing may be less accessible.

## 1 Introduction

Surveillance testing for COVID-19 remains an effective strategy for understanding viral spread in a population even as the vaccines have changed the focus of most media coverage. Since the virus will likely eventually enter a state of endemicity, the need for testing will not vanish. In fact, it will be a necessary tool in order to understand when spikes arise and more vigilance to contain spread is needed.

COVID-19 testing usually comes in one of three forms: serological, rapid antigen, or PCR (so-called molecular) testing. Serological tests, which measure antibodies, are better suited to detecting past COVID-19 infection, whereas rapid antigen tests and PCR tests can detect current viral infection. However, since rapid antigen tests are known to have very high false negative rates amongst asymptomatic individuals,[1] they have fallen out of favor as a surveillance tool and are now preferentially used to test symptomatic individuals and to determine whether an individual is infectious. This is a shame because rapid tests can be much more easily deployed, can be administered at home, do not require lab-based assays, and are much more cost effective. This could have made them ideal tools, particularly in Third-World countries, which remain vulnerable to the virus due to persistently low vaccination rates.

As for sampling strategies, contact tracing ensures that individuals who were in close contact with others who tested positive are also tested. But contact tracing has important drawbacks, whether implemented manually or digitally,[2, 3] and it has raised questions about privacy and individual liberties.[4] The world has thus relied mostly upon convenience or surveillance sampling (i.e. not random sampling) to estimate the prevalence of the disease. Convenience here implies that there is typically over-sampling of symptomatic individuals. Since the probability of testing positive is higher for these individuals than for asymptomatic ones, the naïve prevalence estimate, namely the proportion of the biased sample that tested positive, over-estimates the population prevalence.

In our previous work we derived a correction to remove the bias described above.[5] However, we did not address the issue of imperfect testing (i.e. false positives and negatives). In this work, we provide a solution for both issues. Importantly, when we examine the special case of errors stratified by symptom status, we obtain a correction framework that dramatically improves on the naïve sample prevalence estimates, even in situations where the error rates are as high as for rapid testing. This naturally raises the question of whether we have been under-valuing the role of rapid testing for surveillance of COVID-19, and it re-opens the debate on whether such a tool could be deployed effectively to track the spread of the virus, particularly in Third-World countries.

Our setting can be summarized as follows. First, there is a population prevalence. Second, a sample is taken from the population. Third, individuals in the sample are tested. With these testing totals, prevalence is estimated. The population prevalence is in general unknown, which is why we sample. However, convenience sampling is biased towards symptomatic individuals. Moreover, tests are imperfect, so they have false positive and false negative rates. The naïve estimator computed after testing therefore incorporates both the bias from the sampling strategy and the errors from testing. The goal of this article is to correct these two sources of error.

To add some formality, consider a population 𝒫 of size *N* that is divided into three categories: asymptomatic and non-infected individuals, , with size ; asymptomatic and infected individuals, , with size ; and symptomatic and infected individuals, , with size . It is possible to have a fourth group of symptomatic and non-infected individuals, , but we reasonably set it to have zero size , since developing a constellation of COVID-19 specific symptoms and not having the disease is not common. Notice that the sum of all the individuals in these groups is *N*.

As for the sample, if an individual belongs to the category , she will be tested with probability , for *s, i* = 0, 1. We also set to be 0, since it was assumed that . Therefore, for the three non-empty categories, calling the number of individuals tested from the group , we obtain
where .

With this setting, inspired by previous work on detection and correction of publication bias,[6] it is possible to obtain that the naïve estimators of totals for each group are (see the Methods):
with *α* being the probability of a false positive, and *β*, the probability of a false negative. This set of equations is very intuitive. For instance, the naïve group of asymptomatic and non-infected is formed by the real group of asymptomatic and non-infected who were not false positives in the test and the group of asymptomatic and infected individuals in the population who were false negatives. Interestingly, the third group, corresponding to the naïve estimator of symptomatic and non-infected individuals is nonzero, since false negatives will make the naïve estimate positive.
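The verbal description above can be sketched in code. This is a minimal illustration rather than the authors' implementation: the function name `naive_counts`, the argument names, and the numbers in the usage lines are all assumptions made for this example. It maps true sampled group sizes to the expected observed totals under a false positive rate `alpha` and a false negative rate `beta`, with the symptomatic, non-infected group taken to be empty as in the text.

```python
def naive_counts(t00, t01, t11, alpha, beta):
    """Expected observed (naive) totals given true sampled group sizes.

    t00: sampled asymptomatic, non-infected individuals
    t01: sampled asymptomatic, infected individuals
    t11: sampled symptomatic, infected individuals
    alpha: false positive rate; beta: false negative rate
    The symptomatic, non-infected group is assumed empty.
    """
    return {
        # true negatives plus false negatives among the asymptomatic
        "asym_neg": (1 - alpha) * t00 + beta * t01,
        # false positives plus true positives among the asymptomatic
        "asym_pos": alpha * t00 + (1 - beta) * t01,
        # nonzero only because of false negatives among the symptomatic
        "sym_neg": beta * t11,
        "sym_pos": (1 - beta) * t11,
    }


# Illustrative (made-up) sample: 900, 100 and 200 individuals per group.
counts = naive_counts(900, 100, 200, alpha=0.01, beta=0.2)
for name, value in counts.items():
    print(name, round(value, 1))
```

Testing errors only move individuals between the observed positive and negative cells, so the four observed totals still add up to the sample size.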

On the other hand, notice that the naïve estimators in the left-hand side are determined by the testing errors and the sampled individuals of each group. Therefore, if the sample is not random, there will be bias. This is easy to see when we assume that there is no error in testing, in whose case the previous naïve estimators are the total sampled from each group. In fact, to reflect our initial hypothesis of overrepresentation in the sample of the symptomatic group, we assume that . Thus, *q* is a parametric value saying that at least half the symptomatic individuals were sampled.

Notice that the assumption *q ≥* 1*/*2 is not unrealistic in convenience testing, particularly in developed societies. In fact, most universities and large companies in the US now require that all symptomatic individuals get tested, with which . However, underdeveloped populations (countries mainly, but also possibly subpopulations in developed countries, like illegal immigrants) might not achieve this goal. The fact that the correction is so effective, as we will see below, even in the presence of large false negative rates, should motivate sampling at least half of the symptomatic individuals to allow the estimation of population prevalence.

Now, provided *β* ≤ 1 *− α*, we propose the corrected estimators of prevalence in Algorithm 1. Some comments are in order:

With respect to the first step, its estimator is very easy to see from equations (2).

With respect to the second step, had we known that all the symptomatic group was sampled, as Diaz and Rao did, then .[5] The methodology here is more realistic. Given our imperfect knowledge, the only unbiased assumption we can make is that is uniformly distributed in the set **U**.[7, 8, 9] Any other distribution, unless additional knowledge is at hand, will introduce bias.

With respect to the third step, it is important to notice that and do not depend on . However, they do depend on the restriction on *β* and *α*, which is in general satisfied, even with rapid antigen testing.

As for the fourth step, since we do not have a reason to think otherwise, we assume that the sample is random for the asymptomatic individuals. is thus estimated accordingly.

What the researcher observes are the totals of positive and negative individuals among symptomatic and asymptomatic ones after testing. The real prevalence is unknown to her, as is the sampling scheme. Thus the process to correct the estimator runs backwards: starting with the naïve estimator (which includes errors and bias), we first recover the sampling proportions (getting rid of the errors) and finally produce the corrected estimator (getting rid of the bias).

### 1.1 Stratified errors

Errors can also be stratified by symptom status. In this case the naïve estimators become:

In this scenario, the correction is analogous to Algorithm 1, with the sole difference that *α* and *β* now become *α*_{0} and *β*_{0} for the observed values in the asymptomatic group, as made explicit in Algorithm 2. Notice how this framework allows for the unique error profiles of rapid testing where accuracies vary greatly between asymptomatic and symptomatic individuals. For rapid testing, *β*_{0} can be as high as 50% and *β*_{1} on the order of 10%.[1] Typically, *α*_{1} and *α*_{0} remain small.
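To make the effect of stratified errors concrete, the sketch below computes the expected observed totals when the asymptomatic and symptomatic groups have different error rates. It is an illustration under stated assumptions, not the authors' code: the function names, the sample sizes, and the value `alpha0 = 0.001` are hypothetical, while `beta0 = 0.5` and `beta1 = 0.1` follow the orders of magnitude quoted above. Since the symptomatic, non-infected group is assumed empty, *α*_{1} does not enter the expected totals.

```python
def stratified_naive_counts(t00, t01, t11, alpha0, beta0, beta1):
    """Expected observed totals with error rates stratified by symptoms.

    t00, t01: sampled asymptomatic non-infected / infected individuals
    t11: sampled symptomatic infected individuals
    alpha0, beta0: false positive / negative rates for the asymptomatic
    beta1: false negative rate for the symptomatic
    """
    asym_neg = (1 - alpha0) * t00 + beta0 * t01
    asym_pos = alpha0 * t00 + (1 - beta0) * t01
    sym_neg = beta1 * t11
    sym_pos = (1 - beta1) * t11
    return asym_neg, asym_pos, sym_neg, sym_pos


def naive_prevalence(asym_neg, asym_pos, sym_neg, sym_pos):
    """Proportion of the sample that tested positive."""
    total = asym_neg + asym_pos + sym_neg + sym_pos
    return (asym_pos + sym_pos) / total


# Made-up sample in which 300 of 1200 individuals are truly infected.
observed = stratified_naive_counts(900, 100, 200,
                                   alpha0=0.001, beta0=0.5, beta1=0.1)
print(round(naive_prevalence(*observed), 4))
```

In this made-up sample, half of the asymptomatic infections go undetected, so the proportion testing positive falls below the true infected share of the sample (0.25).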

## 2 Data from the Israeli Ministry of Health

In what follows, we consider data from the Israeli Ministry of Health.[10] Accordingly, we start with a sample of 99,232 tested individuals, of whom 1,862 were symptomatic (having shortness of breath or at least three of four symptoms: cough, fever, sore throat, and headache) and 97,370 were asymptomatic. Among the total tested individuals, it was possible to identify 8,393 infections through PCR testing. Among the individuals who tested positive, 1,754 were symptomatic. The characteristics of the data set are presented in Table 1.

### Estimation of total number of infected with errors by stratum

We also know that PCR testing has a false negative rate *β* = 0.26 and a false positive rate *α* = 0.003.[11] Therefore we can obtain the real number of individuals for each group, as in Table 2.

If a whole population is tested with PCR (as in this data set) and the error rates for PCR are known, then correcting for the errors alone yields the real values in the population. We will therefore assume that Table 2 represents a population total and we will take samples from it. Accordingly, the real prevalence is
and prevalence among the asymptomatic is 30209*/*97370 = 0.31. Finally, calling the naïve estimator, and the corrected estimator, we define the ratio of absolute errors,
for . This ratio will be larger than 1 when the correction works better than the naïve estimator, will be less than 1 when the naïve estimator does a better job than the correction, and will be 1 when the two estimators behave similarly.
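As a small sketch of this comparison, the function below assumes the ratio is the absolute error of the naïve estimator divided by the absolute error of the corrected estimator, which is the reading consistent with the interpretation just given. The function name, the handling of a zero denominator, and the numbers in the usage line are assumptions for illustration only.

```python
def abs_error_ratio(naive, corrected, truth):
    """Ratio of absolute errors: |naive - truth| / |corrected - truth|.

    Values above 1 favor the corrected estimator, values below 1 favor
    the naive estimator.
    """
    naive_err = abs(naive - truth)
    corrected_err = abs(corrected - truth)
    if corrected_err == 0:
        # Convention chosen for this sketch: a perfect correction gives
        # an infinite ratio, unless the naive estimate is perfect too.
        return float("inf") if naive_err > 0 else 1.0
    return naive_err / corrected_err


# Hypothetical estimates, not taken from the paper's tables.
print(abs_error_ratio(naive=0.9, corrected=0.4, truth=0.3) > 1)
```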

In what follows we will assume *q* = 1*/*2. The uniform random variable in step 2 of Algorithms 1 and 2 will be replaced by its expected value: .

### 2.1 Without stratified errors

#### Sampling Protocol 1

Here the sample consists entirely of symptomatic individuals. We also assume that the whole symptomatic group was sampled. Thus, the sample consists of 1862 individuals. Table 3 summarizes this information.

In this extreme case, the naïve estimate of prevalence is . The total corrected prevalence is . Then the ratio of absolute errors is which shows that even for so bad a sample the correction behaves better than the naïve estimator.

#### Sampling Protocol 2

This sample consists of 75% symptomatic individuals and 25% asymptomatic ones. We also assume that all 1862 symptomatic individuals were sampled, so the sample contains 621 asymptomatic individuals. Since we do not have more information about the asymptomatic group, we assume that they are sampled at random from the population (Table 2); then 199 are asymptomatic disease positive and 422 are asymptomatic disease negative. However, the *observed values* (based on testing, not on true disease status) are given in Table 4.
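The sample sizes used in these protocols follow from simple arithmetic, sketched below. The function name `sample_composition` is hypothetical. It assumes, as in the text, that all symptomatic individuals are sampled and that they make up a given share of the sample, and it rounds the total to the nearest integer.

```python
def sample_composition(n_symptomatic, symptomatic_share):
    """Return (total sample size, number of asymptomatic individuals)
    when all symptomatic individuals are sampled and form the given
    share of the sample."""
    if not 0 < symptomatic_share <= 1:
        raise ValueError("symptomatic_share must lie in (0, 1]")
    total = round(n_symptomatic / symptomatic_share)
    return total, total - n_symptomatic


print(sample_composition(1862, 0.75))  # (2483, 621)
print(sample_composition(1862, 0.50))  # (3724, 1862)
```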

The values of Table 4 were obtained using Equation (2). Notice however for the symptomatic group that, since all its individuals were tested, the first row is identical to that of Table 1.

In this case, the naïve estimate of prevalence, obtained from Table 4, is . As for the correction, following Algorithm 1, we obtain that , where the first term is the estimated prevalence among the symptomatic, and the second, prevalence among the asymptomatic.

The ratio of absolute errors now becomes: with which the superiority of the correction is clearly seen.

#### Sampling Protocol 3

This sample contains 50% symptomatic and 50% asymptomatic individuals. Since we again assume that all the symptomatic were sampled, we have a sample of size 3724. Under a similar analysis to that of the previous scenario, the observed values are presented in Table 5.

The naïve estimate is thus . And using Algorithm 1, we obtain that the corrected estimate is again . In this case, the ratio of absolute errors becomes

Again, the correction yields a large improvement.

#### Sampling Protocol 4

This sample is truly random, with *N*_{T} = 50,000. Using the proportions from Table 2 to obtain, and , and then Equations (2), after testing we obtain the observations in Table 6.

Equations (2) actually yield 244 symptomatic observed positives; however, we restrict this number to 108 because that is the number observed in Table 1. The naïve estimate out of Table 6 is

In this case, , and . Therefore, the total corrected estimator is . The ratio of absolute errors will be:

### 2.2 With stratified errors

In more realistic scenarios, such as rapid antigen testing, errors are stratified by symptom status. We denote by *α*_{0} and *α*_{1} the false positive rates for asymptomatic and symptomatic individuals, respectively, and by *β*_{0} and *β*_{1} the corresponding false negative rates. In our sampling protocols, the values are *α*_{0} = 0.1%, *α*_{1} = 0.5%, *β*_{0} = 50%, and *β*_{1} = 10%. Therefore, analogous to what we did with Table 2 previously, we can obtain a new table showing the totals for this new set of errors. Table 7 summarizes this information.

According to Table 7, the true prevalence with stratified errors is *p* = 0.54. In the symptomatic group, 94% of individuals are infected, and in the asymptomatic group, 53% are infected.

#### Sampling Protocol 1

The sample consists of the 1862 symptomatic individuals. In this extreme case, the naïve estimate of prevalence is . Thus, the observed totals are presented in Table 8.

According to Table 8, the biased estimator is . The correction is . With this information, the ratio of absolute errors is

We see that the naïve estimator does slightly better than the corrected estimate, because the high false negative rate partially offsets the severe bias of the sample.

#### Sampling Protocol 2

The sample consists of 2483 individuals. Among these, 1862 (75%) are symptomatic and 621 (25%) are asymptomatic. Since all the symptomatic group was sampled and tested, the observations for this group coincide with those of Table 1. As for the asymptomatic group, the sampling proportions are taken according to Table 7, with which, according to Equations (4), we observe the naïve totals in Table 9.

Then the naïve estimate of prevalence is .

Using is 0.0265 and is 0.53. Therefore, . The ratio of absolute errors is thus which again shows the huge improvements of the correction with respect to the biased estimator.

#### Sampling Protocol 3

In this scenario we have 1862 symptomatic and 1862 asymptomatic individuals. Again, Table 1 gives the behavior of the symptomatic individuals when tested. As for the asymptomatic, in contrast to the proportions in Table 7 and to the previous example, we now assume that 53% asymptomatic non-infected and 47% asymptomatic infected individuals were tested. Remember, however, that this is unknown to the observer, since all she can see is Table 10, obtained from Equations (4).

The naïve estimate is . The corrected estimate, using Algorithm 2, is . In this case, the ratio of absolute errors becomes

Therefore, in this scenario both estimators behave almost identically.

#### Sampling Protocol 4

This sample is also truly random. Say *N*_{T} = 30,000. Among these, 30,000 × (1862*/*99,232) ≈ 563 are symptomatic. Then, 29,437 are asymptomatic. We use Table 7 to determine the proportions sampled per group for asymptomatic, obtaining infected and . With these values we map back to the observations in Table 11.

The naïve estimate is .

As for the correction, and . Therefore, . With this, the ratio of absolute errors is

## 3 Methods

For *s, i* = 0, 1, define as , that is, the proportion of individuals with symptoms *s* and infection status *i*. More formally, we can define a random element *S*^{*} taking values in the set , with density given by
and .

For the *j*-th individual in the population (0 *< j* ≤ *N*), we assume a Bernoulli random variable:

Calling *N*_{T} the sample size, it is easily seen that , and we can define an unconditional binary random variable

Now, if there is no error in sampling, so that the bias is induced by the number of individuals from in the sample. Seen from another perspective, we have the following proposition:

### Proposition 1

*Proof*. Notice that, by definition,

Therefore, after applying Bayes' rule to the right-hand side, we obtain

□

From (10), it is clear that , the population size of , disappeared from the sample. Therefore, the importance of Proposition 1 is to show that all information we have in the sample about comes from the sampling mechanism . In fact, we can think of as the message sent, as the received message, and as the channel between them distorting the message.

Up to this point the analysis has been done for testing without errors. However, notice that (4) is obtained directly from (9), once we insert testing errors.

### 3.1 Correction

Since the group of symptomatic and non-infected individuals is empty, it is clear that is exclusively made of false negatives coming from the symptomatic and infected individuals. Therefore, the real number of symptomatic and infected individuals is , the total of symptomatic individuals sampled. Notice that here *β* disappears from the analysis, so we can safely ignore it in our estimation of the number of individuals in .

From Proposition 1, it is clear that if we do not know anything about it will be impossible to correct the bias. Thus we need a reasonable assumption, like . With this assumption we use the principle of maximum entropy to obtain a correct estimator .

and in the third step of Algorithm 1 can be obtained from the first two equations in (4), since and are known. Therefore, we have two equations with two unknowns. The requirement of *α* ≤ 1 − *β* ensures that the values of and are non-negative.
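The two-equation system just described can be solved in closed form. The sketch below is a minimal illustration with hypothetical names, not the authors' implementation. It assumes that the two equations are the expected observed negatives and positives among the asymptomatic, written in terms of a false positive rate `alpha` and a false negative rate `beta`. It requires the strict inequality `alpha + beta < 1`, slightly stronger than the condition stated in the text, to avoid division by zero.

```python
def invert_asymptomatic(obs_neg, obs_pos, alpha, beta):
    """Recover the sampled asymptomatic (non-infected, infected) totals.

    Solves
        obs_neg = (1 - alpha) * t00 + beta * t01
        obs_pos = alpha * t00 + (1 - beta) * t01
    for t00 and t01.
    """
    denom = 1 - alpha - beta
    if denom <= 0:
        raise ValueError("requires alpha + beta < 1")
    n = obs_neg + obs_pos  # testing errors preserve the group total
    t01 = (obs_pos - alpha * n) / denom
    t00 = (obs_neg - beta * n) / denom
    return t00, t01


# Round trip on made-up numbers: 900 non-infected and 100 infected
# observed with alpha = 0.01 and beta = 0.2 give 911 negatives, 89 positives.
t00, t01 = invert_asymptomatic(911, 89, alpha=0.01, beta=0.2)
print(round(t00), round(t01))
```

With real data the observed counts are noisy, so a solution can come out slightly negative even when the condition on the error rates holds. The sketch leaves that case unhandled. In the stratified setting the same inversion would be applied with *α*_{0} and *β*_{0}.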

The estimator of the number of infected individuals among the asymptomatic is thus obtained after assuming that they were randomly sampled in what remains of the sample once the symptomatic individuals have been removed.

## 4 Discussion

There are a couple of limitations to our study. First, we have assumed throughout that error rates for tests are known a priori. If this is not the case, then at least under the random sampling situation, prevalence can still be estimated using a Bayesian approach described by Diggle.[12] This naturally results in increased variability of the prevalence estimate and relies on a reasonable prior distribution being elicited for the prevalence. This approach has not been extended to our situation here, where sampling bias is also an issue. Second, we assumed that the group of symptomatic individuals without the disease is negligible and that the number of sampled symptomatic individuals is at least half the population value. As for the former, we do not expect this assumption to be violated for COVID-19.

Our correction proves to be very effective in many situations that would be encountered in practice. As we argued in the Introduction, this re-opens the debate about the utility of widespread rapid testing as a surveillance tool, particularly in Third-World countries where PCR testing may be too expensive to implement widely. The scenarios where the correction does not improve upon the naïve estimate are those where the error rates are so large relative to the information in the sample that the correction is blurred and appears negligible. Sample pooling, sometimes called group testing, has also been proposed as an efficient way to estimate population prevalence, because if the disease prevalence is low, then little information is accrued from individual tests.[13] However, this implicitly assumes random sampling of pools, which is clearly not the case in what we are considering here.

Another approach that has been taken is to use population seroprevalence complex surveys.[14, 15] While inherently much more difficult to conduct and analyze, these can also suffer from non-ignorable non-response, which can lead to biased estimates of prevalence. Indeed, biased sampling can be more generally cast within a missing data framework and the impact of different missing data mechanisms studied.

## Data Availability

The data are publicly available and can be retrieved from https://github.com/nshomron/covidpred.