## Abstract

Estimating the unreported cases of Covid-19 in a region/country is a complicated problem. We propose a new mathematical model that, combined with a deterministic model of the total growth of cases, describes the time evolution of the unreported cases for each reported Covid-19 case. The new model considers the growth of unreported cases in plateau periods and the decrease towards the end of an epidemic wave. We combined the new model with a Gompertz-growth model, a generalized logistic model, and a susceptible-infectious-removed (SIR) model; and fitted them via Bayesian methods to data from Cuba and Spain. The combined-model fits yielded better Bayesian-Information-Criterion values than the Gompertz, logistic, and SIR models alone. This suggests the new model can achieve improved descriptions of the evolution of a Covid-19 epidemic wave. The new model is also able to provide reliable predictions of the epidemic evolution in a short period of time. We include in the paper the steps that researchers should take to use the new model for predictions with other data.

## 1 Introduction

Estimating the unreported novel coronavirus (SARS-CoV-2) infections is a major challenge for researchers around the world, and it is important in order to give advice to decision-makers. Several studies show that the reporting rate of SARS-CoV-2 infections is variable over time, for reasons that cannot be attributed solely to the transmission dynamics of the Covid-19 (the disease caused by SARS-CoV-2) epidemic [1–5]. This variability can be influenced by the availability of medical supplies, changes in the policies for conducting and reporting tests, hospital capacity, eligibility criteria for people who are tested, among other factors. The data of the Covid-19 epidemic in Wuhan, China, showed an initial peak of 41 infected in the month of December, 2019. Subsequently, for unknown reasons, Wuhan remained without reporting new cases from January 1 to January 15, 2020. This was followed by a huge explosion of cases that has been widely studied and reported [2, 6, 7]. In the province of Santiago de Cuba, Cuba, there was a first wave of cases that had certain similarities with Wuhan, but with a much lower transmissibility [8]. The first infected case was reported on March 27, 2020, and the second case in the next six days. This was followed by several days in which reported infections increased with some regularity. Santiago de Cuba had another period without reporting new cases from April 7 to April 10, 2020, followed by the days of greatest transmissibility in the province until reaching a total of 49 cases in the first wave. This period of several days without registering new cases reported both in Wuhan and in Santiago de Cuba, is not difficult to find in other regions of the world, and they cause plateau areas in the curves of cumulative cases.

From mathematical point of view, uncertainty in the number of undetected infections can lead to misleading inferences and predictions. To be of practical use, models must be able to deal in some way with the difference between reported and actual cases, and with the possible growth of undetected cases during the plateau periods of those detected. An illustrative example of the difficulty of addressing this problem mathematically is that it is impossible with an SI (susceptible-infectious) model to estimate the ratio between reported and unreported cases due to the low identifiability of the model, which provides the same fit to the data for different ratio values [9].

Approaches based on directly fitting a compartmentalized model to the reported data [10–12], or considering the unreported infected one compartment of the model [13,14], cannot capture the recording-no-new-cases periods described in Wuhan and in Santiago de Cuba data, because in these models the derivatives are zeros only in the endemic and disease-free equilibria. An interesting idea to estimate the unreported cases is fitting a susceptible-exposed-infectious-recovered-dead (SEIRD) model directly to numbers of deceased patients, as these are data that provide greater certainty [15]. However, this approach is not useful for predictions in early stages, when there have been very few or no deaths. The exponential-growth fitting used in [2] to evaluate the under-reporting is not feasible when a few plateau days have elapsed while an epidemic wave is fully transmissible.

The aim of this paper is to present a new mathematical model of unreported COVID-19 cases that captures the irregularity of the data when the number of infections is in full growth. This model considers that the total number of cases, both reported and unreported, follows a deterministic law, and it can be combined with known epidemic-growth models. The model also gives reasonable predictions of the epidemic evolution in short periods of time. The paper is structured as follows: the methods section describes the data, the new model, the tested deterministic models, the fitting method, and the criteria for evaluating the results. It is followed by the results section where the fits and quantitative values of the evaluation are shown. The discussion delves into the findings, advantages and limitations of the model. We finish with our conclusions.

## 2 Methods

### 2.1 Data

We used five data sets to examine the performance of the models: the first wave of Covid-19 in Santiago de Cuba (March 20^{th} to May 4^{th}, 2020), the first wave of Covid-19 in Cuba (March 11^{th} to July 8^{th}, 2020), the second wave of Covid-19 in Santiago de Cuba (October 27^{th}, 2020 to January 1^{st}, 2021), the second wave of Covid-19 in Cuba (August 7^{th} to October 6^{th}, 2020), and the first wave of Covid-19 in Spain (February 20^{th} to May 17^{th}, 2020). Cuban data were supplied by the Ministry of Public Health (https://covid19cubadata.github.io), and data from Spain were issued by the Ministry of Health, Consumption and Social Welfare (https://cnecovid.isciii.es/covid19).

### 2.2 Models

We argue that the number of unreported Covid-19 cases for each reported case (*U* (*n*)) depends on the difference between the accumulated cases reported until the current day (*CR*(*n*)) and those reported the previous day (*CR*(*n* − 1)), as well as the increase in active cases (*I*(*n*) − *I*(*n* − 1)). We propose the model
where *C*_{R}(*n*) is the reported number of cumulative cases at day *n, I*(*n*) is the number of active cases at day *n, a* ≥ 1 is the unreported-growth rate (this parameter captures the rate of the increment of cases when no new case is detected), *b* ≥ 0 is a coefficient that is related to the effectiveness in detecting new cases, and *c*, if positive, captures a possible decrease in the fraction of unreported cases towards the end of an epidemic wave. If the decrease in active cases is due more to under-detection than to the influence of the end of the wave, *c* takes a negative value.

This model accounts for the under-reporting that is likely in the early stages of the epidemic [2]. In those early stages, if a plateau occurs, the difference *C*_{R}(*n*) −*C*_{R}(*n* −1) is zero or close to zero, and the value of *U* (*n*) grows proportional to *a*. When an epidemic wave nears its end, the number of active cases decreases, but it is to be hoped that the authorities have improved the mechanisms for detecting new infections. For this reason the model contains the term with the coefficient *c*, which exerts influence in decreasing sections of the active cases that occur at the end of the wave. Equation (1) can complement a deterministic model of infections, including detected and undetected cases. To describe the dynamics of the total cumulative-case number, we use three phenomenological models that have been previously applied in the forecast of infectious disease outbreaks.

The first phenomenological model is an extension of the Richards growth model [16], given by
where *t* is the time, *r* ≥ 0 is the growth rate, *p* (0 ≤ *p* ≤ 1) represents the deceleration of growth, *K >* 0 is the size of the epidemic, and *α*(0 ≤ *α* ≤ 1) is a parameter that captures the extent of deviation from the S-shaped dynamics of the classical logistic growth model [17]. This model is known as generalized Richard model (GRM), and it has been successfully used to make post-peak predictions of several infectious disease epidemics [18–20].

The second phenomenological model is the Gompertz curve, which has been used to describe several biological growths [21], including the Covid-19 epidemic [22, 23]. The parametrisation of Gompertz growth that we use is
where *λ >* 0 is the growth rate. Compared with GRM, the Gompertz curve changes faster early but approaches the asymptote more slowly. The third model is a conventional SIR [24] given by
where *S*(*t*) is the number of susceptible individuals at time *t*; *I*(*t*) is the number of infected individuals at time *t*; *R*(*t*) is the number of removed (both recovered and dead) individuals at time *t*; and *β* and *γ* are the transmission rate and rate of removal, respectively.

If the numerical solution of the phenomenological model (obtained from equation 2, 3 or 4) at day *n* is denoted as *C*(*n*), the reported number of cumulative cases *C*_{R}(*n*) and the unreported ratio *U* (*n*) are related by

In the following we refer to the combination of equations 1, 2 and 5 as GRM+U; the combination of equations 1, 3 and 5 as Gompertz+U; and the combination of equations 1, 4 and 5 as SIR+U. For comparison, the GRM (equation 2), Gompertz (equation 3) and SIR (equation 4) models were fitted to the data.

### 2.3 Model fitting and assessment

We explored several fits using a Student’s t distribution of the error in case observation [25] and the best results were obtained with high values of the degree-of-freedom parameter (*v*). For high values of *v*, the Student’s t distribution resembles the Gaussian distribution. For this reason we prefer to use a normal error, which in turn has other advantages [26–28], and is given by
where *C*_{obs}(*n*) is the observed value of cumulative cases at day *n*. Although *C*_{R}(*n*) is a discrete variable, *ϵ* (*n*) is continuous, and the assumption that it has a normal distribution facilitates the likelihood estimation. On the other hand, we assume a constant variance instead of a variance proportional to the mean as in [25]. We consider that the observation noise through random subsampling is not necessarily greater in the final days of an epidemic wave than at the beginning of it, but rather has a more complicated variability that depends on the quality of the epidemiological investigation. This variability is already considered in the rationale of the new proposed model (equation 1).

We infer the parameters with the Bayesian approach. To represent the little information available on the parameters, we introduced improper uniform priors in all models (Tables 1-15). In the SIR models, it was necessary to introduce proper informative priors for parameters *β* and *γ* (Tables 3, 6, 9, 12 and 15), because it was practically impossible to obtain reasonable posteriors using uninformative priors. This is consistent with the SIR model being practically non-identifiable when only cumulative-case data are used [29]. The means of the prior distributions of parameters *β* and *γ* were determined empirically.

As of May 4, 2020, no more cases were registered in Santiago de Cuba for more than 30 days. For this reason, we consider the first wave completely finished on May 4, and we set the value of *K* = 49, both for the Gompertz model and for the GRM model. Also, we set 49 the lower limit of the improper uniform prior of *K* for the Gompertz+U and GRM+U models. Based on the expertise of epidemiologists and on estimates made in previous work [8], we set relatively small upper limits for the priors of C and U (20 and 19, respectively). This upper limit made it possible to reduce the search region of the posterior distributions of parameters and to reduce the risk of practical non-identifiability [29, 30]. Tables 1-3 show details of the priors and the fixed values of the parameters for the fitting of the models to the data of the first wave in Santiago de Cuba.

In the other waves analyzed, a clear ending is not seen, so we empirically cut the data on certain dates to fit the models, and we used the values of cumulative cases on those dates as lower limits of the uniform priors of *K* in all models (1445 for the second wave in Santiago de Cuba, 2403 for the first wave in Cuba, 5898 for the second wave in Cuba, and 2.3 × 10^{5} for the first wave in Spain). The numbers of cumulative cases reported in the initial days of each data set were used either as lower limits of the uniform priors of *C*(0) if the new model (+U) is included, or as fixed values of *C*(0), if the new model is not included. The cumulative cases reported on those initial dates are 81 for the second wave in Santiago de Cuba, 3 for the first wave in Cuba, 2953 for the second wave in Cuba, and 3 for the first wave in Spain. A previous work [31] estimated a value of *U* (*n*) = 6.23 for Cuba for *n* corresponding to the date December 22, 2020. Taking into account that the two waves analyzed in Cuba and the beginning of the second wave in Santiago de Cuba do not exceed the date mentioned above, we decided to establish an empirical upper limit of 20 to the uniform prior of *U* (0) (tables 2, 3 and 4). Due to the large number of cases reported in Spain (the highest value of accumulated cases in the data studied from Spain is more than 30 times higher than the highest number reported in Cuba), we decided to establish an upper limit of 150 to the uniform prior of *U* (0) for models fitted to data from Spain.

In the Bayesian inference carried out, the samples of the posterior distributions of the parameters were obtained using the Delayed-Rejection-Adaptive-Metropolis (DRAM) algorithm, implemented via the Python Package **pymcmcstat** [32], with 5 × 10^{5} samples and burn-in number of 2.5 × 10^{5}. We generated 2 × 10^{4} trajectories of the different models by sampling from the inferred posterior distributions of the parameters. From these trajectories we determine a mean of posterior for forecast, and a 95% credible interval for each day. Models were compared using the Bayesian information criterion (BIC) [33], using the maximum a posteriori (MAP) point. The differential-evolution algorithm [34] was used to find the MAP. We chose the BIC because it rewards a good description of the experimental data and penalizes the complexity of the model.

With the data from the second wave in Santiago de Cuba, we extrapolated using the fit of the models to predict the cases reported in the following days. We extrapolated using the function
where *n* is the day, *F* is the ratio between the total and reported cumulative cases (*F* (*n*) = *C*(*n*)*/C*_{R}(*n*) = *U* (*n*) + 1), and (*p*_{1}, …, *p*_{4}) are tuning parameters. The DRAM algorithm was used to get samples from the posterior distribution of equation-7 parameters, using data from the 15 days prior to the date from which it is desired to extrapolate. For a day *x* after the last one for which the data are available, the reported number of cumulative cases can be estimated as *C*_{R}(*x*) = *C*(*x*)*/F* (*x*).

## 3 Results

Prior distributions and results of the model fits to the five data sets are summarized in Tables 1-15. The lowest Bayesian-information-criterion (BIC) value for the first wave in Santiago de Cuba was obtained with the model GRM+U (Table 2). The other lowest BIC values were obtained with: the model GRM+U for the second wave in Santiago de Cuba (Table 5), the model Gompertz+U for the first wave in Cuba (Table 7), the model SIR+U for the second wave in Cuba (Table 12), and the model Gompertz +U for the first wave in Spain (Table 13). In all the trials carried out, the BIC values obtained including the new model (+U) were lower than those calculated without including the new model.

We selected six results to provide visual evidence of what can be achieved with the new model. The Gompertz+U model visually follows the variability of the first wave data in Santiago de Cuba better than the Gompertz model (Figure 1). In figure 1B, the green-triangle line is remarkably close to the line of asterisks that represents the data. This green-triangle line corresponds to the maximum a posteriori (MAP) of the posterior density obtained by fitting the Gompertz+U model to the data of Santiago de Cuba (first wave). Figure 1 includes mean trajectory (blue) and the 95% credible intervals of drawn-from-posterior trajectories (light blue). The 95% credible intervals of the total estimated cases are represented in light red (Figure 1A). Fitting the Gompertz model to the same data shows a clearer difference between the fit and the data (Figure 1C).

The first 75 of the 85 days analyzed in the second wave of Santiago de Cuba were taken to fit the models, and the remaining ten days were used to test model predictive ability. This division was carried out taking into account a change in the trend with a greater growth of new cases in the last ten days. The extrapolated Gompertz+U model was able to roughly follow this variation (Figure 2A). The extrapolation was carried out with equation 7 fitted to the estimated ratio of total to reported cumulative cases from 60 to 75 days (Figure 2B). The extrapolation of the Gompertz model was not able to follow the new trend (Figure 2C). Equation 7 was also fitted directly to the data from 60 to 75 days, without using another extra model, to extrapolate (Figure 2D).

## 4 Discussion

For all data sets studied, the best BIC values were obtained with some combination that includes the new model (Tables 1-15). This implies that the inclusion of the three new parameters *a, b* and *c*, does not overfit the resulting model, but is justified by the variability of the data. This result is promising if one takes into account that, for example, compared to the Akaike infomation criterion (AIC), the BIC tends to favor simpler models [33]. In most cases, the evidence in favor of the new model (+U) with respect to the reference model is classified as strong or very strong.

The proposed model is able to provide reasonable estimation of the number of unreported infected cases. The total number of accumulated cases on the 50th day of the first wave in Santiago de Cuba was previously estimated at 73 [8]. The Gompertz+U model estimates it at 60.3758 (95%*CI* : 51.5345 − 70.2727) (the estimate with all the daily credible intervals of the Gompert+U model with the data of the first wave in Santiago de Cuba can be visualized in Figure 1), the GRM+U model estimates it at 58.6924 (95%*CI* : 50.4297 − 63.5315) and the SIR+U model estimates it at 60.3758 (95%*CI* : 51.5345 − 70.2727). The approach of the present work has the advantage that it was not necessary to divide the wave into three parts for its analysis as in [8].

A previous report that considers the under-reporting [35] estimated for March 23, 2020, a total of 1 421 505 cases in Spain. For the same date, the Gompertz+U model estimates, for total cases, a 95% credible interval of 82 840 − 2 328 569 (it includes the estimated value in [35]), the GRM+U model estimates a 95% credible interval of 1 335 661 − 18 701 982 (it includes the estimated value in [35]), and the SIR+U model gives a 95% credible interval of 133 320 − 1 343 510 (5% below the estimated value in [35]). It is inferred that the choice of the deterministic model influences the quality of the estimation of the undetected cases.

The data for Spain are those with the highest number of reported cases among the five studied. Noise due to variability in reports is expected to have less influence on model-based results, compared to small-number data. In most cases, the absolute values of the parameters *b* and *c* were small compared to those of the other data (compare Tables 13-15 with Tables 1-12). For example, the absolute value of *b* for the Gompertz+U model in the first wave of Santiago de Cuba is more than 3000 times greater than the absolute value of *c* for the same model fitted to the data from Spain. This shows that the new model works properly with small variability as well as with large variability in reports.

The combinations obtained with the new model can predict the evolution of an ongoing epidemic wave, if a suitable extrapolation method is used. The example shown of successful extrapolation (Figure 2A) corroborates that the good fit of the new model to the data does not mean that it has a bad generalizability. Figure 2 illustrates a problem that is possible in practice. If a decision-maker asks the researcher to predict as accurately as possible the new cases in the next ten days, starting on the 75^{th} day of the second wave in Santiago de Cuba, in order to ensure the hospital capacities and all the necessary resources, it would be an underestimation if the Gompertz model (Figure 2C) or the extrapolation method applied directly to the data (Figure 2D) is used. In this case, only the extrapolation carried out from the Gompertz+U model offers an acceptable estimate. Nevertheless, the extrapolation function (equation 7) did not perform with the same quality in all trials. This suggests that it is necessary to observe the time-dependent dynamics of the actual-to-reported-cases ratio in order to choose the appropriate extrapolation function.

Based on the experience that the evolution of the first wave in Santiago de Cuba was favourable compared to the second one, we argue that the parameter *b* may be a quality criterion of the epidemiological research carried out in the region/country (the values of *b* in tables 1-3 are, at least, more than 25 times greater than the values of *b* in tables 4-6).

The new model can be combined with any deterministic model, making it possible to use it with SIRD (susceptible-infectious-recovered-dead), SEIR (susceptible-exposed-infectious-removed) and other compartmentalized variants. The forecast limitations are directly related to the limitations of the chosen deterministic model. In our case, the deterministic models Gompertz and GRM allow predicting the end of a wave, but they do not predict peaks of active infections or epidemic rebounds. Nevertheless, the Gompertz+U model shows good results in making short-term predictions.

In some trajectories drawn from posterior distributions of the new-model parameters, a growth of reported cases was observed towards the final days when the deterministic model already showed a plateau shape. This was not a reason to discard these regions in the analysis because we consider this phenomenon to be plausible, because sometimes samples are accumulated without being tested by real-time polymerase chain reaction, due to laboratory availability or priority of the authorities of one region over another.

Some current limitations of the new model are:

Active cases numbers are needed. For this reason it cannot be applied to the data provided by the World Health Organization in https://covid19.who.int without additional information.

It is assumed that the deterministic model captures the dynamics of the total number of cases in the region/country.

Several posterior peaks (multimodal posteriors) can be found in the estimation of the parameters.

The influence of the third limitation is reduced in this work by imposing limits on initial conditions of the model. Despite the above, we consider that the new model is a powerful tool to complement the information that decision-makers need to establish the appropriate policies to contain Covid-19.

We propose the following steps to generalize short-term predictions with the new model to data from an ongoing epidemic wave:

Collect the data of active and cumulative cases of the region/country where the epidemic wave is ongoing.

Fit the Gompert+U, GRM+U and SIR+U models to the collected data. Although in this article the Bayesian approach has been used for its advantages, it is possible to perform the fitting also with the maximum likelihood estimation or least-squares approaches.

Obtain the model-estimated ratio between the total and reported cumulative cases (

*F*(*n*) =*C*(*n*)=*C*(_{R}*n*)).Choose a number of days (

*D*(−) = 15 in this paper) to take the last reported data and extrapolate.Extrapolate

*F*(*n*) a number of days (*D*(+) = 15 in this paper) by fitting the equation 7 to the data of the days*D*(−).Repeat steps 4 and 5, changing values until a visually adequate fit is obtained (see figure 2B).

Predict the number of cumulative cases with

*C*(_{R}*x*) =*C*(*x*)=*F*(*x*) where*x*is the day in*D*(+) for which the prediction is desired.

## 5 Conclusions

We propose a new mathematical model to describe the dynamics of the number of undetected cases for each reported case of SARS-CoV-2 infection in a region or country. The new model needs to be combined with a deterministic model that describes the evolution of the total number of cases in the selected region. In this study, we combined the new model with Gompertz growth, generalized logistic growth and SIR model, we fitted them to data from Cuba and Spain, and the results were better than the Gompertz, logistic and SIR models directly fitted to data. The proposed model deals well with variability in reports, estimates well the number of unreported infected and allows reasonable predictions. Although additional studies are needed to fully understand its parameters, the new model is already a useful tool for decision-makers to face waves of Covid-19.