Extending the Susceptible-Exposed-Infected-Removed (SEIR) model to handle the high false negative rate and symptom-based administration of COVID-19 diagnostic tests: SEIR-fansy

The false negative rate of the diagnostic RT-PCR test for SARS-CoV-2 has been reported to be substantially high. Due to limited availability of testing, only a non-random subset of the population can get tested. Hence, the reported test counts are subject to a large degree of selection bias. We consider an extension of the Susceptible-Exposed-Infected-Removed (SEIR) model under both selection bias and misclassification. We derive closed form expression for the basic reproduction number under such data anomalies using the next generation matrix method. We conduct extensive simulation studies to quantify the effect of misclassification and selection on the resultant estimation and prediction of future case counts. Finally we apply the methods to reported case-death-recovery count data from India, a nation with more than 5 million cases reported over the last seven months. We show that correcting for misclassification and selection can lead to more accurate prediction of case-counts (and death counts) using the observed data as a beta tester. The model also provides an estimate of undetected infections and thus an undereporting factor. For India, the estimated underreporting factor for cases is around 21 and for deaths is around 6. We develop an R-package SEIR-fansy for broader dissemination of the methods.


Calculation of R 0 for Misclassification Model
We calculate the basic reproduction number R 0 using the The Next Generation Matrix Method as described by van den Driessche [3]. Suppose the whole population is divided into n compartments in which there are m < n infected compartments. Let x i , i = 1, 2, .., m be the number of infected individuals in the i th infected compartment at time t. Now, the epidemic model is: represents the rate of transfer of individuals into compartment i from all other components containing individuals infected with the disease (here E, U , P and F ) and where V i (x) represents the rate of transfer of individuals out of compartment i. Here, F i (x) represents the rate of appearance of new infections in compartment i. Let x 0 denote the disease free equilibrium. Now F and V are m⇥m matrices such that : Now, FV 1 is called the Next Generation Matrix. The basic reproduction number R 0 is calculated by the spectral radius or the largest eigenvalue of FV 1 . For our case, CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted September 25, 2020. . https://doi.org/10.1101/2020.09.24.20200238 doi: medRxiv preprint Now, we calculate the jacobian of F and V at the Disease Free Equilibrium (DFE).

Motivation Behind the 3-parameter Multinomial Model
We can observe from figure (S1) that mCFR varies widely across countries and also across time. Now note that while countries like Belgium, USA, Italy, Spain have very high mCFR, India and Russia have comparatively much lower mCFR. Also, we observe that initially most countries experience high mCFR and it gradually settles to a comparatively lower value in most countries as the case counts and recoveries rise. This supports modeling mCFR as a time-varying quantity.
3 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted September 25, 2020. . https://doi.org/10.1101/2020.09.24.20200238 doi: medRxiv preprint  The only difference in this model from the Multinomial 2-parameter model is that, from the Exposed Node, an infected person can enter into one of three possible nodes: Severe Symptomatic Infectious(Se), Mild Symptomatic Infectious(Mi) and Asymptomatic Infectious(As) with probabilities p 1 , p 2 , and p 3 respectively. Now, we assume people with severe symptoms (people in Se) are tested with probability 1. While the Mi people and the As people are tested with probabilities t 1 and t 2 .

Misclassification model -complete distributional assumptions
In the main paper, we have given the distribution of observed nodes given the other nodes and parameters. Here, we describe the distribution of the latent nodes also. After getting the estimates of the parameters using MCMC, we want to obtain model-based forecasts. In order to predict the future counts, we use the following multinomial random sampling 4 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted September 25, 2020. . https://doi.org/10.1101/2020.09.24.20200238 doi: medRxiv preprint strategy: where ⇣ X!Y denotes the number of individuals moving from compartment X to compartment Y at time t. ⇣ X!0 denotes the number of individuals in compartment X that die at time t. The counts in each compartment at time t are given by, Given the parameters and the counts at time (t 1), we obtain the predicted counts for time t. Using this approach, we obtain the posterior means of the future predicted counts at each of the 9 compartments using the MCMC estimated parameters. For the purpose of future prediction beyond the training period, we use the parameter estimates from the last time period.

Differential Equations for the Selection Model
5 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted September 25, 2020. . https://doi.org/10.1101/2020.09.24.20200238 doi: medRxiv preprint

Selection Model : Complete Distributional Assumptions
To generate data using the test model, we perform the following steps.
Now, we assume the probability of an individual being severely symptomatic, mildly symptomatic or asymptomatic given he/she is susceptible is given by the probability vector p 0 = (p 01 , p 02 , p 03 ). The probability for an infected individual is given by p 1 = (p 11 , p 12 , p 13 ). To obtain the number of individuals in the groups Se 0 , Mi 0 , andAs 0 , we assume that the outgoing individuals from the susceptible group follow the distribution given by p 0 .
6 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted September 25, 2020. . https://doi.org/10.1101/2020.09.24.20200238 doi: medRxiv preprint Now, from our assumption that the individuals in E follow the distribution given by p 1 , we can write, Recall, we assume that all individuals with severe symptoms are tested provided adequate tests are available. This implies In the case when number of test T (t) is less than that of severe individuals, we assume that the number of tested Se 1 and Se 0 individuals is proportional to their respective counts. , If the total number of remaining tests is greater than or equal to the number of mild and asymptomatic individuals, then all of them are tested i.e : As 0 tested = As 0 (t), As 1 tested = As 1 (t) If number of tests are not adequate for all the mild symptomatic and asymptomatic people to be tested, then the remaining tests (after testing the severe symptomatic people) are distributed among the mildly symptomatic and asymptomatic individuals in the ratio t 1 : t 2 .
Mi tested , As tested ⇠ Binomial (T Se tested , (t 1 , t 2 )) As we did in the case of severely symptomatic, we allocate the tests among infected and uninfected mildly symptomatic (and also asymptomatic) individuals randomly.

◆◆
As 0 tested , As 1 tested ⇠ Binomial ✓ As tested , ✓ As 0 (t) As 0 (t) + As 1 (t) , As 1 (t) As 0 (t) + As 1 (t) We also assume the false negative probability = f . The numbers of new individuals to P and F states are given by : Finally we write the number of people in each state at time t as follows : CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The transmission dynamics of this model are very similar to the Selection model. We provide the differential equations describing the dynamics of the nodes S, E, U I, P and F . The rest of the nodes (RU, RR, DU and DR) have differential equations exactly same as in Selection Model.
Now that we have all the differential equations governing the dynamics of this model, we calculate the basic reproduction number using Next Generation Matrix method [3]. Using calculations similar to what we did for the Misclassification model, we arrive at the following expression of R 0 for Uniform testing model.
8 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

Additional Plots for India
We have done our estimation using the MCMC Metropolis Method and predicted the counts for the different compartments by using the posterior means conditional on the estimated parameters. So the large number of iterations of MCMC provide a 95% credible interval for the parameters as well as for the predictions of the compartments. So the following figure shows the credible regions for the Reported Active, False negative active and Untested Active cases. . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review) preprint
The copyright holder for this this version posted September 25, 2020. . https://doi.org/10.1101/2020.09.24.20200238 doi: medRxiv preprint earlier. The estimates of and r for the last period (1 st June to 30 th June) was used to predict the cases from 1 st July to 31 th August. As it is easily visible from Figure (S4), the CI's for estimates of untested cases is the highest. This is expected due to the much higher estimated number of untested cases than tested positives or false negatives.

Results for Simulations -Effect of Misclassification
In the main paper we have shown the effect of misclassification on number of total active cases. We concluded that the effect of misclassification on total active cases was substantial, but it was negligible on reported active cases. Here, we provide the mean estimates of R 0 obtained by the 3 different models with 3 different false negative rates f = 0, 0.15 and 0.3. 10 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint

Basic Reproduction Number MRE
The copyright holder for this this version posted September 25, 2020. . https://doi.org/10.1101/2020.09.24.20200238 doi: medRxiv preprint Since we have not estimated the values of quite a few parameters, a sensitivity analysis is necessary. Now, as doing sensitivity analysis of all the parameters and initial values is impractical, we will do sensitivity analysis for the following parameters only.
1. E 0 : The initial value of Exposed had been chosen as 3 times the sum of initial values of Untested, Confirmed and False Negative cases. Such a choice might seem arbitrary. Hence, we try 4 different values of E 0 and check how the estimates of R 0 and Current Active cases vary across different values of E 0 .
2. ↵ U : The value of ↵ U had been taken as 0.5 in the main analysis. We also assumed ↵ P = 0.5. So, we effectively assumed that the rate of transmission of disease by untested and tested positive individuals was same. Some things to consider when choosing the value of ↵ U and ↵ P were that individuals who were tested positive are quarantined and/or hospitalized reducing their rate of transmitting the disease. And untested cases are predominantly asymptomatic cases whose rate of spreading the virus is much less than symptomatic cases. So, we have ↵ U < 1 and ↵ P < 1. However, we do not know if ↵ U > ↵ P or ↵ U < ↵ P . So we try 4 different values of ↵ U here which are ↵ U = 0.3, 0.5, 0.7 and 1.
3. D e : We stated in the beginning of this paper that we have assumed the Incubation period equals the Latency period (= D e ). We have taken D e = 5.2 days following the results by Lauer et al. [2]. However research by other groups suggest different values of incubation period like 6.4 days by Becker et al. [1] etc. So we consider 3 values of D e for sensitivity analysis. They are D e = 6.4, 5.2 and 4.1 (lower limit of 95% CI of estimates of incubation period by Lauer et al. [2]) 4. k : For Multinomial Symptoms model, one important parameters is k which is the ratio of probability of a mildly symptomatic person being tested to that of an asymptomatic person being tested. Since the probability of testing is higher for a mildly symptomatic person than an asymptomatic person, so k > 1. In our main analysis, we assumed k = 4. The choice of k was not supported by any data. So, we try 4 different values of k : k = 3, 4, 5 and 6 and look at the different estimates.

Effect of initial value of Exposed
We start with the initial value of Exposed individuals. Throughout our analysis we have assumed that the number of exposed individuals on the starting day i.e. 1 st April was thrice the number of total expected infected up to that day. So we check how much our estimates vary if we vary the starting value of Exposed (E 0 ). So, we use 4 starting values for E 0 : 4. E 0 = 4(U 0 + P 0 + F 0 ) Figure S6: Variation of estimates of R 0 with different values of E 0 11 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted September 25, 2020. . https://doi.org/10.1101/2020.09.24.20200238 doi: medRxiv preprint We can observe from subfigure B of Figure (S6) that our estimates of R 0 are relatively robust with respect to choice of initial values of exposed. The only substantial variation is observed in the first time period -1 st -14 th April. Now, let us look at how the estimates of number of active cases change with different initial values.
We can observe from subfigure A of Figure (S6) that all the estimates for total active cases increases with increasing values of E 0 . The estimate of total active cases on 30 th June for E 0 = 4(U 0 + P 0 + F 0 ) was more than 2 times that for E 0 = (U 0 + P 0 + F 0 ). Hence we observe that though the estimates of total active cases vary substantially with different initial number of Exposed people, the estimates of Basic Reproduction Number are much more robust to such variations. Now we look at the effect of ↵ U on our estimates.

Effect of ↵ u
In our main analysis we assumed ↵ U = 0.5. Here, we try 4 different values of ↵ U , ↵ = 0.3, 0.5, 0.7 and 1. First, we Figure S7: Variation of estimates of R 0 with different values of ↵ U look at the estimates of R 0 . Similar to the previous section, from subfigure B of (S7), we observe that the estimates of R 0 are more or less similar for different values of ↵ U . Once again, the only R 0 that substantially varies with different values of ↵ U is the first one i.e. R 01 . Now, we look at the estimates of total active cases.
Subfigure A of (S7) shows that the estimated value of total active cases decreases with increasing value of ↵ U . The reason behind this is if the value of ↵ U is higher, then a smaller number of untested cases will spread the same amount of infection as a larger number of cases would have if the value of ↵ U had been lower.
So, once again, we observe that while the estimates of the number of active cases are influenced heavily by ↵ U , the estimates of R 0 remain relatively unaffected by the change.

Effect of D e
For our main analysis, we had assumed D e = 5.2. Here, we try 2 more values of D e and check how our estimates vary with different values of D e . Again, we observe from subfigure B of (S8), that estimates of R 0 are very robust with respect to different values of D e . From subfigure A of Figure (S8), we note that the predicted number of active cases vary with the different values of D e . However, unlike the previous cases, we do not substantial variation with different variation of D e . Now, we move on to our last sensitivity analysis which is for the value of k in multinomial symptoms.

Effect of k
In Multinomial symptoms model, we defined k as the ratio of the probability of a mildly symptomatic individual getting tested to the same for an asymptomatic individual. We chose the value k = 4 for our main analysis. We had argued why the value of k should be greater than 1 but could not provide any justification for choosing that particular value. So, we try 4 different values of k : k = 3, 4, 5 and 6. We will start with the estimates of R 0 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. To summarize, we observe that the estimates of the Basic Reproduction number are not substantially influenced by these parameters with an exception of the first Reproduction number. We also note that the estimated number of active cases varies with different values of parameters in most of the cases. It is clearly visible that the sensitivity of the total active case predictions vary across parameters. While it does not vary much with D e , there is substantial variation with different values of E 0 .
13 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted September 25, 2020. . https://doi.org/10.1101/2020.09.24.20200238 doi: medRxiv preprint