## Abstract

Knowing COVID-19 epidemiological distributions, such as the time from patient admission to death, is directly relevant to effective primary and secondary care planning, and moreover, the mathematical modelling of the pandemic generally. We determine epidemiological distributions for patients hospitalised with COVID-19 using a large dataset (*N* = 21,000 – 157,000) from the Brazilian Sistema de Informação de Vigilancia Epidemiológica da Gripe database. A joint Bayesian subnational model with partial pooling is used to simultaneously describe the 26 states and one federal district of Brazil, and shows significant variation in the mean of the symptom-onset-to-death time, with ranges between 11.2-17.8 days across the different states, and a mean of 15.2 days for Brazil. We find strong evidence in favour of specific probability density function choices: for example, the gamma distribution gives the best fit for onset-to-death and the generalised lognormal for onset-to-hospital-admission. Our results show that epidemiological distributions have considerable geographical variation, and provide the first estimates of these distributions in a low and middle-income setting. At the subnational level, variation in COVID-19 outcome timings are found to be correlated with poverty, deprivation and segregation levels, and weaker correlation is observed for mean age, wealth and urbanicity.

## 1. INTRODUCTION

Surveillance of COVID-19 has progressed from initial reports on 31st-Dec-2019 of pneumonia with unknown etiology in Wuhan, China,[1] to the confirmation of 9, 826 cases of SARS-CoV-2 across 20 countries one month later,[2] to the current pandemic of greater than 12 million confirmed cases and 500,000 deaths globally to date.[3] Early estimates of epidemiological distributions provided critical input that enabled modelling to identify the severity and infectiousness of the disease. The onset-to-death distribution,[4, 5] characterising the range of times observed between the onset of first symptoms in a patient and their death, has for example proved crucial in early estimates of the Infection Fatality Ratio (IFR) [6], and was similarly integral to recent approaches to modelling the transmission dynamics of SARS-CoV-2.[7-12]

Initial estimates of COVID-19 epidemiological distributions necessarily relied on relatively few data points, with the events comprising these distributions occurring a period of time that was short compared to the temporal pathologies of the disease progression, resulting in wide confidence or credible intervals and a sensitivity to time-series censoring effects.[6] Global surveillance of the disease over the past 197 days has provided more data to re-evaluate the time-delay distributions of the disease. In particular, public availability of a large number of patient-level hospital records – currently over 390, 000 in total – from the SIVEP-Gripe (*Sistema de Informação de Vigilância Epidemiológica da Gripe*) database published by Brazil’s Ministry of Health (MoH),[13] provides an opportunity to make robust statistical estimates of the onset-to-death and other time-delay distributions such as onset-to-diagnosis, length of ICU stay, onset-to-hospital-admission, onset-to-hospital-discharge, onset-to-ICU-admission, and hospital-admission-to-death. In this work we fit and present an analysis of these epidemiological distributions, with the paper set out as follows. Section 2 describes the data used from the SIVEP-Gripe database,[13] and the methodological approach applied to fit distributions using a hierarchical Bayesian model with partial pooling. Section 3 provides a description of the results from this study from fitting epidemiological distributions at national and subnational level to a range of probability density functions (PDFs). The results are discussed in Section 4, including associations with socioeconomic factors, such as education, segregation, and poverty, and conclusions are given in Section 5.

## 2. METHODS

### 2.1. Data

The SIVEP-Gripe database provides detailed patient-level records for all individuals hospitalised with severe acute respiratory syndrome, including all suspected or confirmed cases of severe COVID-19.[13] The records include the date of admission, date of onset of symptoms, state where the patient lives, state where they are being treated, and date of outcome (death or discharge), among other diagnosis related variables. We extracted the data for confirmed COVID-19 records starting on 25th February and considered records in our analysis ending on 7th July. The dataset was filtered to obtain rows for onset-to-death, hospital-admission-to-death, length of ICU stay, onset-to-hospital-admission, onset-to-hospital-discharge, onset-to-ICU-admission and onset-to-diagnosis. Onset-to-diagnosis data were split into the diagnosis confirmed by PCR and those confirmed by other methods, such as rapid antibody and antigen tests, called non-PCR throughout this manuscript. Entries resulting in distribution times greater than 133 days were considered a typing error and removed, as the first recorded COVID-19 case in Brazil was on 25th February.[14]

Additional filtering of the data was applied for onset-to-ICU-admission, onset-to-hospital-admission and onset-to-death in order to eliminate bias introduced by potentially erroneous entries identified in the data for these distributions. We removed the rows where admission to the hospital or ICU or death happened on the same day as onset of symptoms, assuming that these were actually incorrectly inputted entries. The decision to test removing the first day is motivated firstly by the observation of a number of conspicuous data entry errors in the database, and secondly by anomalous spikes corresponding to same-day events observed in these distributions. An example of the anomalous spikes in the onset-to-death distribution is shown in Appendix Figure 4 for selected states.

Sensitivity analyses on data inclusion, regarding the removal of anomalous spikes in first-day data indicative of reporting errors (e.g. in onset to hospital admission), and regarding the sensitivity of the dataset to time-series censoring effects, are set out in the Results Section 3.3.

A summary of the data, including number and a range of samples per variable from the SIVEP-Gripe dataset is given in Table I. A breakdown of the number of data samples per state is provided in Appendix Table IX.

Basic exploratory analysis to explain geographic variation observed in time-delay distributions adopts *GeoSES (Índice Socioeconômico do Contexto Geográfico para Es-tudos em Saúde*) [15], which measures Brazilian socioeconomic characteristics through an index composed of education, mobility, poverty, wealth, deprivation, and segregation. We investigate correlations between the *GeoSES* indicators and the time-delay means that we estimate at the state level. Additionally, we consider correlations with the mean age of the population of the state and the percentage of people living in urban areas, data we obtained from *Instituto Brasileiro de Geografia e Estatística* (*IBGE*).[16]

### 2.2. Model fitting

Gamma, Weibull, log-normal, generalised log-normal,[17] and generalised gamma[18] PDFs are fitted to several epidemiological distributions, with the specific parameterisations provided in Appendix Section 10.1. The parameters of each distribution are fitted in a joint Bayesian hierarchical model with partial pooling, using data from the 26 states and one federal district of Brazil, extracted and filtered to identify specific epidemiological distributions such as onset-to-death, ICU-stay, and so on.

As an example consider fitting a gamma PDF for the onset-to-death distribution. The gamma distribution for the *i*^{th} state is given by
where shape and scale parameters are assumed to be positively constrained, normally distributed random variables
and

The parameters *α*_{Brazil} and *β*_{Brazil} denote the national level estimates, and
where *N*^{+} (·) is a truncated normal distribution. In this case, parameters *α*_{Brazil} and *β*_{Brazil} are estimated by fitting a gamma PDF to the fully pooled data, that is including the observations for all states. Prior probabilities for the national level parameters for each of the considered PDFs are chosen to be *N*^{+}(0,1), except for the generalised gamma distribution where we used: *μ*_{Brazil} ~ *N*^{+}(2, 0.5), *σ*_{Brazil} ~ *N*^{+}(0.5,0.5) and *s*_{Brazil} ~ *N*^{+}(1.5, 0.5).

Posterior samples of the parameters in the model are generated using Hamiltonian Monte Carlo (HMC) with Stan.[19, 20] For each fit we use 4 chains and 2, 000 iterations, with half of the iterations dedicated to warm-up.

The preference for one fitted model over another is characterised in terms of the Bayesian support, with the model evidence calculated to see how well a given model fits the data, and comparison between two models using Bayes Factors. The details of how to estimate the model evidence and calculate the Bayes Factors for each pair of models are given in Appendix Section 10.1.

## 3. RESULTS

### 3.1. Brazil epidemiological distributions

Five trial PDFs – gamma, Weibull, log-normal, generalised log-normal and generalised gamma – were fitted to the epidemiological data shown in Figure 1.

All of the models’ fits were tested by using the Bayes Factors based on the Laplace approximation and corrected using thermodynamic integration,[21-23] as described in Appendix Section 10.1. The thermodynamic integration contribution was negligible suggesting the posterior distributions are satisfactorily approximated as multivariate normal. The conclusions on the preferred PDF were not sensitive to the choice of prior distributions, that is the preferred model was still the favoured one even when more informative prior distributions were applied for all PDFs. The Bayes Factors used for model selection are shown in Appendix Table V.

The gamma PDF provided the best fit to the onset-to-death, hospital-admission-to-death and ICU-stay data. For the remaining distributions – onset-to-diagnosis (non-PCR), onset-to-diagnosis (PCR), onset-to-hospital-discharge, onset-to-hospital-admission and onset-to-ICU-admission – the generalised log-normal distribution was the preferred model. The list of preferred PDFs for each distribution, together with the estimated mean, variance and PDFs’ parameter values for the national fits are given in Table II. The 95% credible intervals (CrI) for parameters of each of the preferred PDFs was less than 0.1 wide, therefore in Table II we show only point estimates.

Additionally, in Figure 1, in each instance the cumulative probability distribution is given for the best model fit, revealing that out of patients for whom COVID-19 is terminal, almost 70% die within 20 days of symptom onset. Out of patients who die in the hospital, almost 60% die within the first 10 days since admission.

The estimated mean number of days for each distribution for Brazil is compared in Table III with values found in the literature for China, US and France. The majority of the data obtained through searching the literature pertained to the early stages of the epidemic in China, and no data was found for low- and middle-income countries. The mean onset-to-death time of 15.2 (95% CrI 15.1 — 15.3) days, from a best-fitting gamma PDF, is shorter than the 17.8 (95% CrI 16.9-19.2) days estimate from Verity et al.,[6] and 20.2 (95% CrI 15.1 — 29.5) days estimate (14. 5 days without truncation) from Linton et al.[12] In both cases, estimates were based on a small sample size from the beginning of the epidemic in China. The mean number of days for hospital-admission-to-death of 10.8 (95% CrI 10.7 — 10.9) for Brazil matches closely the 10 days estimated by Salje et al.[24]

### 3.2. Subnational Brazilian epidemiological distributions

The onset-to-death distribution, and other time-delay distributions such as onset-to-diagnosis, length of ICU stay, onset-to-hospital-admission, onset-to-hospital-discharge, onset-to-ICU-admission, and hospital-admission-to-death, have been fitted in a joint model across the 26 states and one federal district of Brazil using partial pooling. The mean number of days, plotted in Figure 2, shows substantial subnational variability – e.g. the mean onset-to-hospital-admission for Amazonas state was estimated to be 9.9 days (95% CrI 9.7-10.1), whereas for Mato Grosso do Sul the estimate was 6.7 (95% CrI 6.4-7.1) days and Rio de Janeiro – 7.2 days (95% CrI 7.1-7.3). Amazonas state had the longest average time from onset-to-hospital- and ICU-admission. The state with the shortest average onset-to-death time was Acre. Santa Catarina state on the other hand had a longest average onset-to-death and hospital-admission-to-death time, as well as longest average ICU-stay. For a visualisation of the uncertainty in our mean estimates for each state, see the posterior density plots in Appendix Figures 5 and 6. Additional national and state-level results for the onset-to-death gamma PDF, including the posterior plots for mean and variance, are shown in Figure 7 in the Appendix.

We also observe discrepancies between the five geographical regions of Brazil, for example states belonging to the southern part of the country (Parana, Rio Grande do Sul and Santa Catarina) had a longer average ICU-stay and hospital-admission-to-death time as compared to the states in the North region. Full results, including detailed estimates of mean, variance, and estimates for each of the distributions’ parameters for Brazil and Brazilian states can be accessed at https://github.com/mrc-ide/Brazil_COVID19_distributions/blob/master/results/results_full_table.csv.

### 3.3. Sensitivity analyses

In order to remove the potential bias towards shorter outcomes from left- and right-censoring, we tested the scenario in which the data to fit the models was truncated. For example, based on a 95% quartile of 35 days for the hospital-admission-to-death distribution, entries with the starting date (hospital admission) after 2nd June 2020 and those with an end-date (death) before 1st April were truncated, and the models were refitted. With censored parts of the data removed, the mean time from start to outcome increased for every distribution, e.g. for hospital-admission-to-death it increased from 10.0 (95% CrI 9.9-10.0) to 10.8 (95 % CrI 10.7-10.9), and for onset-to-death it changed from 15.2 days (95% CrI 15.1-15.3) to 16.0 days (95% CrI 15.9-16.1). The effect truncation on censored data is given in Appendix Figure 8.

To test the impact of keeping or removing entries identified as potentially resulting from erroneous data transcription (see the Methods Section 2), we fitted the PDFs to some of the distributions on a national level with and without those entries. For onset-to-hospital-admission, onset-to-ICU and onset-to-death we find that generalised gamma PDF was preferred when the first day of the distribution was included, and gamma (for onset-to-death) and generalised log-normal PDFs if the first day was removed. For hospital-admission-to-death, a gamma distribution fitted most accurately when the first day was included, and Weibull when it was excluded. The effect of removing the first day results in means shifting to the right by approximately 1 day for both onset-to-hospital- and ICU-admission, and by 0.5 days for hospital-admission-to-death (see Appendix Figure 8).

Sensitivity analysis regarding the model selection approach is detailed in Appendix Section 10.1.

## 4. DISCUSSION

We fitted multiple probability density functions to a number of epidemiological datasets, such as onset-to-death or onset-to-diagnosis, from the Brazilian SIVEP Gripe database,[13] using Bayesian hierarchical models. Our findings provide the first reliable estimates of the various epidemiological distributions for the COVID-19 epidemic in Brazil and highlight a need to consider a wider set of specific parametric distributions. Instead of relying on the ubiquitous gamma or log-normal distributions, we show that often these PDFs do not best capture the behaviour of the data. For instance, the generalised log-normal is preferable for several of the epidemiological distributions in Table II. These results can inform modelling of the epidemic in Brazil [30], and other low- and middle-income countries,[31] but we expect they also have some relevance more generally.

In terms of modelling the epidemic in Brazil, the variation observed at subnational level – see Figure 2 – can be shown to be important to accurately estimating disease progression. Making use of the state-level custom-fitted onset-to-death distributions reported here, we have estimated the number of active infections on 23rd June 2020 across ten states spanning the five regions of Brazil, using a Bayesian hierarchical renewal-type model.[7, 30, 32] The relative change in the number of active infections from modelling the cases using heterogeneous state-specific onset-to-death distributions, compared to using a single common Brazil one is shown in Figure 3 to be quite substantial. The relative changes observed, up to 18% more active infections, suggest assumptions of onset-to-death homogeneity are unreliable and closer attention needs to be paid when fitting models of SARS-CoV-2 transmission dynamics in large countries.

On the origin of the geographic variation displayed in Figure 2 for the average distribution times across states, there are multiple potential factors that could generate the observed variability and in this work we present an elementary exploratory analysis. We examine the correlation between socioeconomic factors, such as education, poverty, income, etc., using a number of socioeconomic state-level indicators obtained from Barrozo et al. (2020) [15] and additional datasets containing the mean age per state and percentage of people living in the urban areas (urbanicity).[16] The Pearson correlation coefficients, shown in the Appendix Table VII, suggest that segregation, poverty and deprivation elements were most strongly correlated with the analysed onset-time datasets. E.g. poverty was strongly negatively correlated with hospital-admission-to-death (−0.68), whereas income and segregation had a high positive correlation coefficient for the same distribution (+0.60, +0.62 respectively). The strongest correlation was observed for hospital-admission-to-death and deprivation indicator, which measures the access to sanitation, electricity and other material and non-material goods.[15] Interestingly, the indicators measuring economical situation were more correlated with average hospitalisation times than mean age per state, which suggests that although the low- and middle-income countries typically have younger populations, their healthcare systems are more likely to struggle in response to the COVID-19 epidemic. More detailed analysis is necessary to fully appreciate the impact of the economic components on the COVID-19 epidemic response.

In the work presented we acknowledge numerous limitations. The database from which distributions have been extracted, though extensive, contains transcription errors, and the degree to which these bias our estimates is largely unknown. Secondly, the PDFs fitted are based on observational hospital data, and therefore should be cautiously interpreted for other settings. Thirdly, though we have fitted PDFs at subnational as well as national level, this partition is largely arbitrary and further work is required to understand the likely substantial effect of age, sex, ethnic variation,[33] co-morbidities, and other factors.

## 5. CONCLUSIONS

We provide the first estimates of common epidemiological distributions for the COVID-19 epidemic in Brazil, based on the SIVEP-Gripe hospitalisation data.[13] Extensive heterogeneity in the distributions between different states is reported. Quantifying the time-delay for COVID-19 onset and hospitalisation data provides useful input parameters for many COVID-19 epidemiological models, especially those modelling the healthcare response in low- and middle-income countries.

## Data Availability

Data is available from links to github repository within manuscript.

## 7. FUNDING

This work was supported by Centre funding from the UK Medical Research Council under a concordat with the UK Department for International Development, the NIHR Health Protection Research Unit in Modelling Methodology and Community Jameel. This research was also partly funded by the Imperial College COVID-19 Research Fund. IH was supported by Imperial College London MRC Centre.

## 8. CODE AND DATA AVAILABILITY

Python, R and Stan code used to analyse the data and fit the distribution is available at https://github.com/mrc-ide/Brazil_COVID19_distributions, along with estimated parameters for each state and PDFs considered at https://github.com/mrc-ide/Brazil_COVID19_distributions/blob/master/results/results_full_table.csv. The SIVEP-Gripe database,[13] is available to download from Brazil Ministry of Health website https://opendatasus.saude.gov.br/dataset/bd-srag-2020.

## 6. ACKNOWLEDGEMENTS

We thank Microsoft for providing Azure credits which were used to run the analysis.

## 10. APPENDIX

### 10.1. Model selection

To characterise which model (gamma, log-normal, etc.) best fits the data, the Bayesian model evidence *z* = *z*(*y*|*M _{i}*) is evaluated. Here and throughout this section y denotes the data and

*M*denotes the

_{i}*i*

^{th}model from the analysed model set. As determining the model evidence requires calculating an integral over the model parameters (

*θ*) which is generally intractable, we approximate it with

*z*

_{0}=

*z*

_{0}(

*y*|

*M*), which is based on a second-order Laplace approximation,[34]

_{i}*q*

_{0}=

*q*

_{0}(

*θ*|

*M*,

_{i}*y*), to the true un-normalised posterior density

*q*

_{0}=

*q*

_{0}(

*θ*|

*M*,

_{i}*y*). The second-order approximated density is estimated as:

Here denotes the value of the un-normalised posterior evaluated using the mean estimates of the model’s parameters , and Σ the covariance matrix built from Markov Chain Monte Carlo (MCMC) samples of the posterior distribution. From this expression, a second-order approximation to the model evidence, *z*_{0}, is given by , where det(·) denotes the deter-minant of the matrix.

For each model pair, Bayes factors were computed from the marginal likelihoods. Considering two models *M _{j}* and

*M*, the Bayes Factor (BF) is where

_{j}*z*(

*y*|

*M*) is the evidence of model

_{i}*M*given

_{i}*y*. If

*B*> 1, the evidence is in favour of model

_{ij}*M*. Here, for readability we will report the Bayes Factors as 2 log(

_{i}*B*) following Kass and Raftery notation.[35]

_{ij}The sensitivity of our model evidence is tested with respect to the choice of hyperprior distribution, and secondly with respect to the use of the approximate second-order density *q*_{0}. In the latter instance this is done by performing thermodynamic integration[21-23] between *q*_{0} and the true density *q* in order to obtain an asymptotically exact estimate of the marginal model evidence,

The right hand term corrects the *z*_{0} approximation to the exact Bayesian evidence by a path integral evaluated with respect to a sampling distribution that interpolates between the two densities as in terms of the auxiliary coordinate *λ*.

## Footnotes

↵† s.flaxman{at}imperial.ac.uk

Minor amendments to notation, tables, and figures, and correction to authorship list.