COVID-19 mild cases determination from correlating COVID-line calls to reported cases

Background: One of the main challenges in understanding the evolution of COVID-19 is measuring the mild cases that are never tested because their symptoms are few, soft and/or fade away quickly. These cases are not only difficult to identify and test; they are also believed to constitute the bulk of the cases and could be crucial in the pandemic equation. Methods: We present a novel algorithm to extract the number of these mild cases by correlating COVID-line calls to reported cases in given districts. The key assumption is that, the disease being highly contagious, the number of calls by mild cases should be proportional to the number of reported cases, whereas a background of calls not related to infected people should be proportional to the district population. Results: We find that for Buenos Aires Province, in addition to the background, there is a signal of 6.6 ± 0.4 calls per reported COVID-19 case. Using this, we estimate 20 ± 2 symptomatic COVID-19 cases in Buenos Aires Province for each one reported. Conclusions: A very simple algorithm that models the COVID-line calls as the sum of signal plus background allows one to estimate the crucial ratio of symptomatic to reported COVID-19 cases in a given district. The result of this method is an early and inexpensive estimate and should be contrasted with other methods such as serology and/or massive testing.

The COVID-19 pandemic is impacting the world's health and economy with unprecedented strength [1][2][3]. Among the major challenges in mitigating the effect of the pandemic is assessing the real number of infected people at any time [3][4][5]. This key information is useful not only for determining health policies, but also for estimating the level of immunity in society, which provides a reference framework for deciding on the re-opening of economic and other activities. Although the natural method for obtaining this information, including mild-symptom cases, would be to test people at the slightest symptom, COVID-19 is a disease with a large fraction of mild cases, and therefore the cost and logistics of such testing are usually unaffordable. In this work we present a novel method to estimate the total number of symptomatic infected people, including mild cases, by correlating COVID-line phone calls to lab-confirmed reported cases. The main idea is that, since this is a highly contagious disease, the calls coming from infected people are proportional to the number of lab-confirmed cases in that area and time, whereas the other calls correspond to a background proportional to the population in the area. By measuring the number of calls in different scenarios, we can fit the proportionality coefficient and estimate the total number of infected people, even though a fraction of them will not reach the threshold to be referred for a laboratory diagnosis. The idea of distinguishing a signal in a large dataset of queries is present in many schemes such as Google Flu [6] and others [7]. However, the present method is not only considerably simpler and more straightforward to implement in any country or region, but is also specifically designed for a contagious disease with a large share of mild cases, such as COVID-19.
The algorithm is based on a few reasonable hypotheses, is useful in any under-testing scenario (as is the case in most countries), and the general framework presented here can be used to tackle other diseases and/or catastrophes beyond the COVID-19 pandemic. Being statistical, the method does not identify individual mild cases; it only estimates their number. In the following paragraphs we present the details of the algorithm and apply it to a real-case scenario in Buenos Aires Province (PBA, for its acronym in Spanish).
Throughout this work we model the number of calls to a COVID-line in a given district and during a given period of time using a simple assumption of signal plus background. We define as signal those calls due to real COVID-19 cases, and as background those calls due to similar symptoms and/or other causes that do not correspond to real infections. It is key to clarify that the real COVID-19 cases that constitute the signal calls do not necessarily correspond to people whose symptoms will lead them to a laboratory confirmation of their condition. People whose symptoms are enough to place a call to the COVID-line, but whose symptoms and evolution do not reach the threshold for a laboratory test, are what we call infected with mild symptoms. We adopt the compelling hypothesis that the number of signal calls is proportional to the number of lab-confirmed cases. On the other hand, we assume that the number of background calls is proportional solely to the district population. This assumption is reasonable as long as the studied populations have similar social behavior and there are no major changes in the conditions (for instance temperature and weather) during the analyzed period of time.
Within the above hypotheses, we model the number of calls received in district $j$ in a given period of time as

$$ n_C^{(j)} = \theta_P \, N_P^{(j)} + \theta_I \, N_I^{(j)} , \tag{1} $$

where $N_P^{(j)}$ is the population and $N_I^{(j)}$ is the number of lab-confirmed infections in district $j$ during the given period of time. $n_C^{(j)}$ is the number of calls for district $j$ predicted by the fit, which should be as close as possible to the real number of calls $N_C^{(j)}$. The number $N_I^{(j)}$ corresponds to those cases whose record was opened in the studied period of time, regardless of whether the laboratory result was confirmed at some other time. Observe that the proportionality coefficients $\theta = (\theta_P, \theta_I)$ are independent of the index $j$ and should be fitted from the data.
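As a minimal sketch of this signal-plus-background model, reading it as predicted calls = θ_P × population + θ_I × lab-confirmed cases, the prediction for a few districts could be computed as follows. The function name, the district labels and all numerical values are hypothetical placeholders chosen only for illustration; they are not taken from the data analyzed in this work.

```python
def predicted_calls(theta_p, theta_i, population_millions, confirmed_cases):
    """Predicted COVID-line calls for one district and one period.

    theta_p: background calls per 1M inhabitants (placeholder value below)
    theta_i: signal calls per lab-confirmed case (placeholder value below)
    """
    background = theta_p * population_millions  # calls unrelated to infections
    signal = theta_i * confirmed_cases          # calls from infected people
    return background + signal


if __name__ == "__main__":
    # Hypothetical districts: (population in millions, lab-confirmed cases).
    districts = {"district A": (2.0, 150), "district B": (0.5, 20)}
    for name, (population, confirmed) in districts.items():
        calls = predicted_calls(100.0, 5.0, population, confirmed)
        print(f"{name}: {calls:.1f} predicted calls")
```

The same coefficients are applied to every district, which is what allows them to be fitted from a set of measurements with different populations and case counts.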
Important information can be extracted once the number of calls per lab-confirmed COVID-19 case, $\theta_I$, is fitted. Let $f_c$ be the fraction of people with symptoms who contact the Health Care System through the COVID-line, and let $\kappa$ be the average number of times each of these persons calls the COVID-line. With these two variables we can estimate the total number of lab-confirmed plus mild cases. The number of calls from different persons for each lab-confirmed case is $\theta_I/\kappa$. Moreover, since only a fraction $f_c$ of symptomatic people call, each person calling the COVID-line stands for $1/f_c$ symptomatic cases, counting those who do not call. Hence, we obtain

$$ \text{Total number of lab-confirmed plus mild cases} = \frac{\theta_I}{\kappa \, f_c} \, N_I , \tag{2} $$

where $N_I$ is the number of lab-confirmed cases in the studied district and period of time.
The value of $\kappa$ can be estimated from telephone records or by surveying the cases reported through the COVID-line, whereas $f_c$ is usually part of the information available from the Health Care Administration. In any case, the outcome of Eq. 2 has many sources of intractable systematic uncertainty, and its value should therefore be interpreted with the corresponding caution.
Finally, it is worth discussing a few details concerning the above ideas. First, observe that mild is not the same as asymptomatic. A mild case is defined as having enough symptoms to place a call to the Health Care System COVID-line, but fewer than the threshold required to be tested. Second, notice that, within this framework, the line dividing lab-confirmed and mild cases depends on the local Health Care System policies, since the division comes from the definition of the threshold needed to be tested. Therefore, the value of the factor accompanying $N_I$ in Eq. 2 depends on the local Health Care System policies of each region. Last, observe that the division between mild and asymptomatic cases, although independent of policies, is a rather smooth one, since it depends on each person's perception of their symptoms. The number that comes out of Eq. 2 does not include purely asymptomatic cases.
As a real application of the above proposal we study Buenos Aires Province in Argentina (PBA) during June 2020, a period in which there were no major changes in the weather or in the threshold and methodology for laboratory testing. During this period the number of lab-confirmed cases was of the order of 1000 new cases per day. The PBA Health Care System runs a COVID-line for symptoms whose local phone number is 148. Callers to this number first reach an automatic menu, which routes to an operator those calls that pass the menu. We take as $N_C$ the number of calls that choose the COVID symptoms option from the automatic menu. We count calls even if the caller hangs up before being attended by an operator. Given the structure of this COVID-line, the call counting at this level does not differentiate the district from which the call was placed; therefore we can only take the whole of PBA as one district for this specific analysis. To have many measurements with different $N_I$, which is key for the fit, we take each full day of June as the period of time of one data point for Eq. 1, yielding a total of 30 data points.
To fit the parameters $\theta$ from the data we use the method of maximum likelihood. We maximize the likelihood $L(\theta)$ or, equivalently, minimize the $\chi^2(\theta)$ defined through

$$ \chi^2(\theta) = -2 \ln L(\theta) + \text{constant} = \sum_j \frac{\left( N_C^{(j)} - n_C^{(j)}(\theta) \right)^2}{\sigma_j^2} . \tag{3} $$

Observe that the method of least squares could also have been used, with similar results. Statistically, the variance $\sigma_j^2$ should correspond to the Poissonian variance of the number of calls; however, there are sources of systematic error that dominate over the statistical one. In particular, the correspondence between the number of new records opened on a given day and the number of calls on that day yields a systematic uncertainty, since the two may be shifted by one or two days due to intractable causes. Further systematic uncertainties may come from other uncontrollable behaviors. We find that assigning an 11% systematic uncertainty to the number of calls is self-consistent with the outcome of the fit. The best fit of Eq. 3 is obtained at

$$ \hat\theta_P = 118.2 \ \text{calls per 1M people}, \qquad \hat\theta_I = 6.6 \ \text{calls per confirmed case}, \tag{4} $$

and at this point we obtain $\chi^2(\hat\theta) = \chi^2_{\min} = 36.0$. We plot in Fig. 1a the result of this fit by comparing the real number of placed calls against the number of calls predicted by inserting the fitted values $\hat\theta$ into Eq. 1. The fit yields a coefficient of determination $R^2 = 0.86$, indicating that the fit is very good but that there are still extra sources of uncertainty, as can be seen in the plot. To find the uncertainty in $\hat\theta$ we use that the contour in parameter space defined by $\chi^2(\theta) \le \chi^2_{\min} + 2.3$ has a 68% probability of covering the true value [8]. We plot the resulting 68% and 90% confidence level contour regions in Fig. 1b. If we disregard the value of $\theta_P$, we find $\theta_I = 6.6 \pm 0.4$ calls per confirmed case.
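The fitting step can be sketched in a few lines. The following is an illustration under stated assumptions, not the analysis code used for the results above: it takes the per-day uncertainty to be a fixed fraction (11%) of the observed calls, minimizes the resulting χ² in closed form through the 2×2 normal equations of a weighted linear least-squares problem, and runs on synthetic data. All function names, the placeholder coefficients, the population value and the range of daily cases are hypothetical.

```python
import random


def chi2(theta_p, theta_i, population, confirmed, calls, rel_sigma=0.11):
    """Chi-square between observed calls and the signal-plus-background model."""
    total = 0.0
    for n_p, n_i, n_c in zip(population, confirmed, calls):
        model = theta_p * n_p + theta_i * n_i
        sigma = rel_sigma * n_c  # assumed: uncertainty is a fixed fraction of calls
        total += ((n_c - model) / sigma) ** 2
    return total


def fit_theta(population, confirmed, calls, rel_sigma=0.11):
    """Minimize chi2 analytically; returns (theta_p, theta_i, sigma_theta_i)."""
    s_pp = s_pi = s_ii = s_pc = s_ic = 0.0
    for n_p, n_i, n_c in zip(population, confirmed, calls):
        w = 1.0 / (rel_sigma * n_c) ** 2  # weight = 1 / sigma^2
        s_pp += w * n_p * n_p
        s_pi += w * n_p * n_i
        s_ii += w * n_i * n_i
        s_pc += w * n_p * n_c
        s_ic += w * n_i * n_c
    det = s_pp * s_ii - s_pi ** 2  # non-zero only if the case counts vary
    theta_p = (s_ii * s_pc - s_pi * s_ic) / det
    theta_i = (s_pp * s_ic - s_pi * s_pc) / det
    # Marginal one-parameter uncertainty (delta chi2 = 1) from the inverse
    # normal matrix; the joint two-parameter 68% region uses delta chi2 = 2.3.
    sigma_theta_i = (s_pp / det) ** 0.5
    return theta_p, theta_i, sigma_theta_i


def synthetic_month(theta_p, theta_i, population_millions, days=30,
                    noise=0.11, seed=0):
    """Synthetic daily data for one district (illustration only, not real data)."""
    rng = random.Random(seed)
    population = [population_millions] * days
    confirmed = [rng.randint(600, 1400) for _ in range(days)]
    calls = [(theta_p * n_p + theta_i * n_i) * rng.gauss(1.0, noise)
             for n_p, n_i in zip(population, confirmed)]
    return population, confirmed, calls


if __name__ == "__main__":
    pop, conf, calls = synthetic_month(100.0, 5.0, 10.0)
    theta_p, theta_i, err_i = fit_theta(pop, conf, calls)
    print(f"theta_P = {theta_p:.1f} calls per 1M people")
    print(f"theta_I = {theta_i:.2f} +/- {err_i:.2f} calls per confirmed case")
    print(f"chi2_min = {chi2(theta_p, theta_i, pop, conf, calls):.1f}")
```

With a single district the population is the same for every data point, so the background term acts as a constant offset and the fit can only separate it from the signal because the number of confirmed cases changes from day to day.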
For the PBA data analyzed above, we know from the records that 22% of the total lab-confirmed cases entered the Health Care System through the COVID-line. We use this to estimate that the fraction of infected people who call corresponds to $f_c = 0.22$. On the other hand, we have determined through a simple survey that each confirmed person calling the COVID-line makes on average 1.5 calls ($\kappa = 1.5 \pm 0.1$). Inserting these values into Eq. 2 we obtain

$$ \text{Total number of lab-confirmed plus mild @ PBA} = (20 \pm 2) \, N_I . \tag{5} $$
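The arithmetic behind this factor can be checked with a short sketch. It reads the factor as θ_I / (κ f_c), which reproduces the quoted central value from the numbers given in the text. The uncertainty propagation shown here (relative uncertainties of θ_I and κ added in quadrature, with f_c treated as exact) is an assumption made for illustration; the text does not state how the ± 2 was obtained. The function name is hypothetical.

```python
import math


def symptomatic_per_reported(theta_i, kappa, f_c,
                             sigma_theta_i=0.0, sigma_kappa=0.0):
    """Ratio of symptomatic (lab-confirmed plus mild) to reported cases.

    Assumes uncorrelated uncertainties on theta_i and kappa and no
    uncertainty on f_c (an assumption of this sketch).
    """
    factor = theta_i / (kappa * f_c)
    relative = math.hypot(sigma_theta_i / theta_i, sigma_kappa / kappa)
    return factor, factor * relative


if __name__ == "__main__":
    # Values quoted in the text: theta_I = 6.6 +/- 0.4, kappa = 1.5 +/- 0.1,
    # f_c = 0.22.
    factor, sigma = symptomatic_per_reported(6.6, 1.5, 0.22, 0.4, 0.1)
    print(f"{factor:.0f} +/- {sigma:.0f}")  # prints "20 +/- 2"
```

Any uncertainty on f_c, which is not quoted in the text, would enlarge the interval.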
Here $N_I$ is the total number of reported cases in PBA. Observe that deriving this result may have introduced additional systematic uncertainties through the variables $f_c$ and $\kappa$, which have not been taken into account and, in contrast to $\theta_I$, are not controlled by the goodness of fit. These systematic effects may consist, for instance, of infected people who called the COVID-line but finally entered the system through another path, or of a different proportion of mild and severe cases calling the COVID-line. The outcome of this algorithm should be understood as an early and inexpensive estimate of the ratio of symptomatic to reported COVID-19 cases. This result should be complemented with serology and/or massive testing results.
The factor 20 ± 2 in Eq. 5 is compatible with results from other parts of the world on the ratio of total infected cases found through serology to reported cases. For instance, Germany has estimated 10 times more infected people than those reported by lab confirmation [9], Spain 15 times [10] and London 45 times [11]. Observe, however, that these last numbers may include asymptomatic cases, which are beyond the scope of this work and whose role in the COVID-19 pandemic is still controversial [12][13][14].
In summary, we have designed a novel method to estimate the number of mild cases in the COVID-19 pandemic. This method, as well as variations and/or digital adaptations based on the idea in Eq. 1, could be useful as an alternative or complement to massive testing while being considerably less expensive. We use the correlation between the COVID-line phone calls and the number of reported cases and population in each district to estimate how many calls are made for each reported case. Using in addition the fraction of cases entering the system through the COVID-line and the average number of calls placed by the same person, we provide an estimate of the total number of reported plus mild cases. We apply the technique to Buenos Aires Province in Argentina and find numbers compatible with those of other countries in which serology antibody tests have been performed.