ANALYSIS OF CLINICAL RECOVERY-PERIOD AND RECOVERY RATE ESTIMATION OF THE FIRST 1000 COVID-19 PATIENTS IN SINGAPORE

COVID-19 was declared a global pandemic by the World Health Organization (WHO) on March 11, 2020. In this paper, we investigate various aspects of the clinical recovery of the first 1000 COVID-19 patients in Singapore, spanning January 23 to April 01, 2020. This dataset includes 245 clinically recovered patients. The first part of the paper presents descriptive statistics and studies the influence of the demographic parameters, namely age and gender, on the clinical recovery-period of COVID-19 patients. The second part identifies the distribution of the length of the recovery-period. We perform a piecewise analysis over three periods, identified from the trends in both positive confirmations and clinical recoveries of COVID-19. As expected, the overall recovery rate dropped drastically during the exponential increase in incidence. However, our in-depth analysis shows that the incidence has shifted towards the younger population, whose recovery-period is considerably shorter. We estimate the recovery rate to be 0.125. Overall, the prognosis of COVID-19 indicates an improvement in recovery rate owing to the government-mandated practices of restricted mobility of the older population and aggressive contact tracing.


Introduction
The viral contagion named COVID-19 was declared a global pandemic by the World Health Organization on March 11, 2020 2 . The pandemic, characterized by atypical pneumonia, is caused by a virus from the coronavirus family, namely SARS-CoV2 (Severe Acute Respiratory Syndrome Coronavirus-2), which is a positive-sense single-stranded RNA virus. As of April 2, 2020, there were 827,419 positively confirmed cases and 40,777 deaths, spread across 206 countries 3 . An unofficial count puts the total number of recovered patients at 193,989 out of 935,197 positively confirmed patients, which implies that the ratio of the recovered to the infected patients, $r_{ri}$, is ∼0.2, as of April 1, 2020 4 .
In this paper, we analyze the statistics of hospital recovery of patients who tested positive for COVID-19 infection in Singapore [1]. Singapore has been chosen as the case study owing to the reliable and accessible data available from the official press releases of the Ministry of Health (MoH), Government of Singapore. The healthcare system of Singapore has been unique in its handling of the widespread contagion in terms of imposing strict lockdown, quarantine, and isolation, and aggressive, large-scale contact tracing and testing. 1000 individuals were confirmed positive for COVID-19 during January 23-April 01, 2020, of whom 245 were discharged after clinical recovery. This gives an overall $r_{ri} = 0.245$, which is higher than that of the world. Positive confirmation has been made using real-time reverse transcription polymerase chain reaction (RT-PCR) tests on respiratory samples (sputum or nasal/throat/nasopharyngeal swabs), based on experiential learning from the outbreak of SARS in 2003 5 . Similarly, the protocol for clinical recovery or hospital discharge has been based on the RT-PCR results of two consecutive samples being negative over two days [2].
Since the SARS outbreak in 2003, Singapore has systematically strengthened the system of managing the spread of infectious diseases [2]. The measures include opening dedicated facilities (the National Center for Infectious Diseases (NCID), National Public Health Laboratory, and more biosafety level-3 laboratories), increasing capacity in the public healthcare system (e.g., negative pressure isolation beds, personal protective equipment, trained health professionals), and deploying formal (digital) platforms for inter-governmental agency cooperation. For containing the spread of contagions, systems have been in place for upscaled, quick-responsive, and aggressive contact tracing at entry points of the country (airports) and through local healthcare providers. There has been a holistic improvement, supported by increased economic investment, in building expertise in infectious disease management. This organized system has thus facilitated a controlled management of the pandemic COVID-19 in Singapore with patient-wise reporting to the public. Hence, a case study in Singapore pertaining to the demographic analysis of clinically recovered patients enables a systematic understanding of the recovery rate (γ) of the pandemic.
One of the significant benefits of aggressive contact tracing has been hospital isolation within 5 days from the onset of symptoms [3]. However, 12.6% of transmission has been found to be presymptomatic [4], as analysed for seven clusters found in Singapore. While there is an innate uncertainty in the interval from the onset of symptoms to positive confirmation in symptomatic cases, the hospital stay from positive confirmation to discharge upon negative results exhibits more cohesive statistics, as observed in similar studies of hospital stay [5]. Hence, in this work, we perform a statistical analysis of the recovery-period, i.e., the length of hospital stay, of COVID-19 patients. A further motivation for studying the clinical recovery of patients is to assess the overall load on the healthcare system in terms of patient occupancy, as the length of hospital stay determines the load. The clinical recovery studied in this paper corresponds to the hospitalization period of each patient. The demographic analysis of the recovered patients gives insight into shifts in gender and age-groups in the recovery of patients. This analysis further complements the observation that the transmission rate is higher in older males with comorbidities [6]. Fitting the hospitalization-period data to a statistical distribution is essential for estimating γ.
Epidemiological models are generally used to simulate the progression of a disease. These models use the proportions of the population that are "susceptible," "infected," and "removed," where "removed" implies both "recovered" and "deceased." The first two deaths in Singapore owing to the COVID-19 contagion occurred on March 21, 2020. The number increased to 3 deaths by April 01. Owing to the relatively low number of deaths in Singapore due to the COVID-19 contagion during January 23-April 01, 2020, we have assumed the death/mortality/fatality rate to be 0 in our work. Thus, here, "recovery" implies the state of "clinically recovered and discharged from hospital." Since the contagion has a time-varying reproduction number ($R_t$) with characteristic trends in specific time-periods [7], we split the time-period of January 23-April 01 to perform a piecewise analysis of the timeline [7,8]. We perform two analyses on the periodized timeline. Firstly, we study the age-gender distribution of the patients who have been confirmed positive for COVID-19 and of those who have clinically recovered. Secondly, we extract the distribution of clinical recovery-periods and fit regression models. In both analyses, we discuss the observable period-wise shifts in trends and their influencing factors, thus estimating γ. The novel contribution of our work is an in-depth analysis of the clinical recovery of COVID-19 patients to estimate the recovery rate γ, which is a key parameter in the SIR (susceptible-infected-recovered) model for the disease [9].

Methods
The data for our work has been collated from the public press releases made by the MoH, Government of Singapore 6 . This dataset includes the case IDs, age, gender, positive confirmation date, discharge date, and date of onset of symptoms 7 . The data has been cross-verified against a dashboard 8 for case details. We have analyzed this patient-wise data pertaining to age, gender, and timeline of the disease progression.
We define the recovery-period $\Delta t_r$ as the time elapsed between positive COVID-19 confirmation using an RT-PCR test and the date of discharge from hospital after two consecutive negative RT-PCR results. Owing to the strict protocols followed in the Singapore healthcare system, the recovery-period can be considered equivalent to the virus shedding period. Earlier studies estimate $\Delta t_r$ to be 15 days [10] or 20 days [11]. We consider $\Delta t_r$ as an observed count variable.
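As a minimal sketch of this definition, the recovery-period of a single patient can be computed as the difference in days between the two dates. The function name and the example dates are illustrative assumptions and do not correspond to a specific record in the MoH data.

```python
from datetime import date


def recovery_period_days(confirmed: date, discharged: date) -> int:
    """Return the recovery-period: days from positive confirmation to discharge."""
    if discharged < confirmed:
        raise ValueError("discharge date precedes confirmation date")
    return (discharged - confirmed).days


# Hypothetical patient: confirmed on January 23, 2020 and discharged on February 4, 2020.
example_period = recovery_period_days(date(2020, 1, 23), date(2020, 2, 4))
print(example_period)
```

Treating the result as an integer number of days is what allows the recovery-period to be handled as a count variable in the later regression models.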
Age-gender distribution: There has been early evidence of the influence of both age and gender in susceptibility of COVID-19 infection [6,12]. Hence, we look at the influence of age and gender in clinical recovery of patients, with respect to the recovery-period.
Periodization: As the pandemic progresses, the evolution needs to be studied piecewise in different time periods [7,8]. From the timeline of disease progression in Singapore, we have identified the following significant dates:
• On January 23, the first patient was confirmed COVID-19 positive.
• On February 4, the first clinically recovered patient was discharged from the hospital.
These trends can be observed in the daily profile of patient counts ( §Figure 1 (ii),(iii)). Assuming a zero death rate owing to the low number of deaths in Singapore from COVID-19, we consider the following three periods:
1. Period $P_1$ during January 23-February 3, which is the period with no clinically recovered cases.
2. Period $P_2$ during February 4-March 16, which is the period of slow growth in the total/cumulative number of positively COVID-19 confirmed cases, $N_i$, with an increase in the total/cumulative number of clinically recovered cases, $N_r$, and zero deaths.
3. Period $P_3$ during March 17-April 1, which is the period of exponential growth in $N_i$, reaching $N_i = 1000$, slow growth in $N_r$, and having the first 3 deaths.
Recovery rate γ: The governing differential equations in the simplest SIR model, also known as the Kermack-McKendrick model [9], are given as follows:

$$\frac{dN_s}{dt} = -\frac{\beta N_s N_i}{N_p}, \qquad \frac{dN_i}{dt} = \frac{\beta N_s N_i}{N_p} - \gamma N_i, \qquad \frac{dN_r}{dt} = \gamma N_i,$$

where β is the rate at which an infected individual infects others, γ the transition rate in the SIR model 9 , $N_p$ the size of the population, $N_i$ the number of infected persons (i.e., with positive COVID-19 confirmation), $N_r$ the number of recovered persons (i.e., clinically recovered), and $N_s$ the number of susceptible people. The basic reproduction number or reproduction rate, $R_0 = \beta/\gamma$, characterizes an infection. $R_0 > 1$ implies that the infection will continue to spread, and $R_0 < 1$ implies that the spread is limited and under control. Currently, $R_0$ for COVID-19 is estimated to lie in the range 0.8-5.0 [8,13]. γ is estimated as the reciprocal of the recovery-period $\Delta t_r$, which implies that γ ∼ 0.05-0.067, based on the estimates of $\Delta t_r$ of 20 and 15 days, respectively [10,11].
9 The transition rate must include both recovery and death. However, since we assume a zero death rate, the transition rate is equivalent to the recovery rate in our work.
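To illustrate how γ enters the SIR model, the following is a minimal sketch that integrates the standard Kermack-McKendrick equations with a forward-Euler scheme. The value of β, the population size, the initial number of infected persons, and the step size are hypothetical and chosen only for illustration; γ = 0.125 is the recovery rate stated in the abstract.

```python
def simulate_sir(beta, gamma, n_pop, n_infected0, days, dt=0.1):
    """Forward-Euler integration of the basic SIR model.

    Returns a list of (time, N_s, N_i, N_r) tuples.
    """
    n_s = float(n_pop - n_infected0)
    n_i = float(n_infected0)
    n_r = 0.0
    trajectory = [(0.0, n_s, n_i, n_r)]
    steps = int(round(days / dt))
    for step in range(1, steps + 1):
        new_infections = beta * n_s * n_i / n_pop * dt  # flow from N_s to N_i
        new_recoveries = gamma * n_i * dt               # flow from N_i to N_r
        n_s -= new_infections
        n_i += new_infections - new_recoveries
        n_r += new_recoveries
        trajectory.append((step * dt, n_s, n_i, n_r))
    return trajectory


# Hypothetical inputs: beta = 0.25 per day, 1,000,000 people, 10 initial cases.
BETA, GAMMA = 0.25, 0.125
basic_reproduction_number = BETA / GAMMA  # R_0 = beta / gamma
trajectory = simulate_sir(BETA, GAMMA, n_pop=1_000_000, n_infected0=10, days=30)
_, final_s, final_i, final_r = trajectory[-1]
print(f"R_0 = {basic_reproduction_number:.1f}, infected after 30 days = {final_i:.0f}")
```

Because every person leaving one compartment enters another, the three compartments always sum to the population size, which is a quick sanity check on any implementation.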
3 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted April 22, 2020.

Figure 1: (i) Overlapped population pyramid of the age-gender distribution of the 245 discharged and 754 active patients (i.e., COVID-19 positive but not discharged, including the deceased) during the same period (1 active patient, a 102-year-old female confirmed positive on April 01, 2020, has been excluded from this population pyramid). Daily profile of the count of patients in Singapore during January 23-April 01, 2020, (ii) who were confirmed positive for COVID-19, with a total of 1000, and (iii) who clinically recovered and were discharged from the hospitals, with a total of 245. The red dotted lines indicate the three periods we have introduced in this work.

medRxiv preprint doi: https://doi.org/10.1101/2020.04.17.20069724

The absolute numbers indicate that $r_{ri}$, i.e., $N_r/N_i$, increased from 0.00 at the end of $P_1$ to $0.45 = \frac{109}{18+225}$ at the end of $P_2$, and then dipped to $0.245 = \frac{245}{1000}$ at the end of $P_3$. The dip is unfavorable, given that the number of positive COVID-19 confirmations has increased exponentially from the beginning of $P_3$. Since $r_{ri}$ and γ are positively correlated, this implies a further decrease in γ.
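The arithmetic behind these ratios can be checked with a short sketch. The cumulative counts are the ones quoted in the text; the dictionary layout and function name are our own illustrative choices.

```python
# Cumulative (N_r, N_i) at the end of each period, as quoted in the text.
cumulative_counts = {
    "P1": (0, 18),
    "P2": (109, 18 + 225),
    "P3": (245, 1000),
}


def recovered_to_infected_ratio(n_recovered, n_infected):
    """Return r_ri = N_r / N_i for cumulative counts."""
    if n_infected <= 0:
        raise ValueError("the number of infected persons must be positive")
    return n_recovered / n_infected


ratios = {
    period: recovered_to_infected_ratio(n_r, n_i)
    for period, (n_r, n_i) in cumulative_counts.items()
}
for period, ratio in ratios.items():
    print(f"{period}: r_ri = {ratio:.3f}")
```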
However, the absolute counts $N_i$ and $N_r$ do not explicitly show the shifts in the age-gender distribution of the infected population ( §Figure 2(i)). Thus, it is difficult to demonstrate the influence of this shift on the recovery rate γ. Social distancing reduces β, thus decreasing $R_0$. At the same time, increasing γ also favours a decrease in $R_0$. In this work, we hypothesize that restricted mobility and aggressive contact tracing would have indirectly increased γ. Thus, we propose computing the time-varying $\Delta t_r$ using the shifts in the age-gender distribution in the periodized timeline.
Recovery-period $\Delta t_r$ analysis: Our goal is to study the period-wise changes in $\Delta t_r$ owing to the shift in the demographic structure, in order to determine the period-wise change in γ. We performed a descriptive statistical analysis using the median and interquartile range (IQR), followed by fitting appropriate regression models. We consider two types of regression models. Firstly, we use the time-series of $\Delta t_r$ and fit a curve using the loess model. Loess is a non-parametric local regression model for smoothing empirical time-series data [14] and scatterplots [15]. Secondly, we use the number of patients recovering for a specific $\Delta t_r$ as a count variable and fit multivariate linear regression models with age and gender as the independent variables and $\Delta t_r$ as the dependent variable. Since we are using a combination of a categorical variable (gender) and a numerical variable (age), we use generalized linear models (GLM) for regression, which are semi-parametric. Length of hospital stay (LoS) has a naturally skewed distribution, for which GLMs such as the Poisson regression model (PRM) and the negative binomial regression model (NBM) have been used [5]. Hence, we propose the use of PRM and NBM for modeling $\Delta t_r$.
For the period-wise analysis of $\Delta t_r$, we group the clinically recovered patients using two strategies.
1. Grouping based on recovery date, $G_{-\mathrm{cfrmDt}}$: The two groups of clinically recovered patients can be obtained as per the period in which their discharge/recovery date falls, namely, $P_2$ and $P_3$ ( §Figure 2 (ii),(a)). $G_{-\mathrm{cfrmDt}}$ corresponds to a group of 109 patients who were discharged in $P_2$, and 136 patients in $P_3$.
2. Grouping based on positive confirmation date, $G_{+\mathrm{cfrmDt}}$: However, we can perform a finer-grain analysis of the groups of patients based on the period in which their date of positive confirmation/hospital admission falls ( §Figure 2 (ii),(b)). This gives us groups of patients who got tested positive in a period and got clinically recovered during the entire period of our study here. This gives us 18 patients in $P_1$, 161 in $P_2$, and 66 in $P_3$.
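The difference between the two grouping strategies can be sketched as follows. The period boundaries are those defined in this paper, while the four patient records are hypothetical and serve only to show that the same patient can fall into different periods under the two groupings.

```python
from collections import Counter
from datetime import date

# Period start dates as defined in the text; the study window ends on April 1, 2020.
PERIOD_STARTS = [
    ("P1", date(2020, 1, 23)),
    ("P2", date(2020, 2, 4)),
    ("P3", date(2020, 3, 17)),
]
STUDY_END = date(2020, 4, 1)


def period_of(day):
    """Return the label of the period (P1, P2, or P3) containing the given day."""
    if not PERIOD_STARTS[0][1] <= day <= STUDY_END:
        raise ValueError(f"{day} lies outside the study window")
    label = PERIOD_STARTS[0][0]
    for name, start in PERIOD_STARTS:
        if day >= start:
            label = name
    return label


# Hypothetical (confirmation date, discharge date) pairs for four recovered patients.
patients = [
    (date(2020, 1, 25), date(2020, 2, 10)),
    (date(2020, 2, 20), date(2020, 3, 5)),
    (date(2020, 3, 1), date(2020, 3, 20)),
    (date(2020, 3, 18), date(2020, 3, 28)),
]

# Grouping by discharge date versus grouping by positive confirmation date.
by_discharge = Counter(period_of(discharged) for _, discharged in patients)
by_confirmation = Counter(period_of(confirmed) for confirmed, _ in patients)
print("by discharge date:", dict(by_discharge))
print("by confirmation date:", dict(by_confirmation))
```

Grouping by discharge date can never place a patient in the first period, since no discharge occurred before February 4, whereas grouping by confirmation date uses all three periods.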
Symptom-onset period $\Delta t_{so}$ analysis: We additionally have data of the date of onset of symptoms for 227 of the 245 clinically recovered patients, who were confirmed positive during $P_1$ and $P_2$. We define the symptom-onset period, $\Delta t_{so}$, as the number of days between the onset of symptoms and positive confirmation of COVID-19. We use the time-series of $\Delta t_{so}$ to fit a loess model, similar to $\Delta t_r$. Table 1 gives the percentage values of the data presented in Figure 2.

Table 1: Percentage values of the age-gender structure of population confirmed positive with COVID-19 during January 23-April 01, 2020 in Singapore ( §Figures 1(i), and 2(i)).

Results
Descriptive statistical analysis: We present the descriptive statistics of the observed count variable, $\Delta t_r$, either as a five-number summary 10 or as "median [IQR]".
10 A five-number summary is the (minimum, first quartile, median, third quartile, maximum) values of a (count) variable.
$\Delta t_r$ observed during the entire period, January 23-April 01, has the following median and IQR values:

• Overall: 11 [7] days.
We observe that $\Delta t_r$ in this dataset is lower overall than that reported in early analyses of COVID-19 patients, i.e., 15 days [10] and 20 days [11]. The overall median of $\Delta t_r$ is 11 days, which is the same as the gender-weighted median, and the age-group-weighted median is 10.7 days. Thus, the overall descriptive statistical analysis gives a conservative estimate of γ ∼ $\frac{1}{11}$. The remaining work is to estimate γ more precisely based on the influence of gender and age. Age has a stronger influence than gender on $\Delta t_r$, as the median values are similar for both genders. In contrast, the median is relatively lower for the age groups of (0-9), (20-29), and (80-89) years, specifically. These age groups comprise 1.6%, 14.6%, and 0.8% of the clinically recovered patients ( §Table 1). This result is significant, as the age group of (20-29) years contributes the highest share (27.3%) of the infected population. 88.6% of the patients in this age group were confirmed positive in $P_3$. Thus, our key conclusion is that since the most susceptible group of people has a lower $\Delta t_r$, the recovery rate γ is bound to increase further in the period after April 01 compared to the value estimated in our work.
• For females: (10, 11, 13, 22, 26) for $P_1$, (0, 7, 12, 15, 26) for $P_2$, and (3, 6.25, 9, 11, 14) for $P_3$.
We observe similar trends when considering the $G_{-\mathrm{cfrmDt}}$ grouping. The range, the IQR, and the median of $\Delta t_r$ decrease from $P_1$ to $P_3$, both for each gender and for the data without the gender information. This indicates that, irrespective of gender, the spread of $\Delta t_r$ decreases with time, as does the value of $\Delta t_r$ itself. The age-wise minima reduce sharply from $P_1$ to $P_2$ and increase slightly from $P_2$ to $P_3$, indicating an overall decreasing trend in the minima. This supports the overall decrease in $\Delta t_r$ from $P_1$ to $P_3$. Table 2 shows the fine-grained five-number summaries of the age-gender based box and whisker plots in Figure 4. We observe that there is a strong influence of age on $\Delta t_r$. While the overall median values for females and males are similar, we observe that there are variations across different age groups. In particular, for the age group of (20-29) years, the median [IQR] for males, 8.5 [8.25] days, is lower than that for females, 10 [5.75] days. This also shows that there is a higher spread (IQR) of $\Delta t_r$ values in males.

Table 2: Five-number summaries of age-gender based box and whisker plots in Figures 3 and 4.

We further observe that the gender difference in IQR arises in $P_3$ in the $G_{+\mathrm{cfrmDt}}$ grouping, whereas it arises in $P_2$ in the $G_{-\mathrm{cfrmDt}}$ grouping. The minima and median values in $P_3$ are lower overall in the $G_{+\mathrm{cfrmDt}}$ grouping than in the $G_{-\mathrm{cfrmDt}}$ grouping. Since the protocols followed in the hospital can be perceived to be similar for patients with closer hospital admission dates, $\Delta t_r$ has more cohesive descriptive statistics in the $G_{+\mathrm{cfrmDt}}$ grouping than in the $G_{-\mathrm{cfrmDt}}$ one. Hence, we use the $G_{+\mathrm{cfrmDt}}$ grouping exclusively in the regression analysis.
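The five-number summary and the "median [IQR]" notation used in this section can be computed as in the following minimal sketch. The recovery-period values are hypothetical, and the choice of the "inclusive" quantile method, which interpolates between order statistics and can therefore yield fractional quartiles, is an assumption rather than a confirmed detail of the analysis.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return (data[0], q1, median, q3, data[-1])


def median_iqr(values):
    """Return the pair (median, IQR) used in the 'median [IQR]' notation."""
    _, q1, median, q3, _ = five_number_summary(values)
    return median, q3 - q1


# Hypothetical recovery-periods (in days) for nine patients.
sample_periods = [3, 5, 6, 7, 9, 10, 11, 12, 14]
summary = five_number_summary(sample_periods)
median, iqr = median_iqr(sample_periods)
print(summary)
print(f"{median:g} [{iqr:g}] days")
```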
Since the symptom-onset dates have been studied [7,8], we report the five-number summaries of $\Delta t_{so}$, for which data is available for $P_1$ and $P_2$ only.
We observe that $\Delta t_{so}$ has a similar IQR and median across $P_1$ and $P_2$, irrespective of gender. We attribute this to the continuous monitoring of susceptible cases in Singapore, which reduces the delay in positive confirmation and thus results in a low spread and low values of $\Delta t_{so}$. Overall, we do not emphasize the analysis of $\Delta t_{so}$ owing to the clinical uncertainties involved [4], which are not reflected in the timeline data that we are using here.
Loess model: We now use the loess model to confirm the trend in the change in $\Delta t_r$. The loess model has been estimated on the time-series of the $\Delta t_r$ and $\Delta t_{so}$ values, represented as scatter plots ( §Figure 5). The loess model has been implemented using the stats.loess 11 in R [16]. We have considered the scatter plots based on the positive
11 The loess model is a default local regression model used for a sample with fewer than 1000 observations in the stats package in R.

The degrees of freedom roughly correspond to the degree of the polynomial used to generate the fitting curve. Thus, both $\Delta t_r$ and $\Delta t_{so}$ can be modeled using a 5th-degree polynomial. A higher-degree polynomial implies less bias but larger variance. $\Delta t_r$ has a slope of −25° in $P_3$ when the loess model is fitted over the entire time period ( §Figure 5(i)) and a slope of −6° when the loess model is fitted over $P_3$ alone ( §Figure 5(ii)). Thus, the key conclusion from the local regression model on the time-series is the negative slope, i.e., a downward trend in $\Delta t_r$ in $P_3$, which is favorable for improving the recovery rate γ.
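To make the idea of a local regression slope concrete, the following is a simplified sketch of a loess-style fit at a single point. It is not the stats.loess implementation: it uses a local linear fit (degree 1) with tricube weights, whereas loess in R fits local quadratics by default, and the function name, span, and toy data are our own assumptions.

```python
def local_linear_fit(xs, ys, x0, span=0.75):
    """Tricube-weighted linear fit around x0; returns (fitted value, local slope)."""
    n = len(xs)
    k = max(2, int(round(span * n)))        # number of points in the local window
    distances = sorted(abs(x - x0) for x in xs)
    bandwidth = distances[k - 1]            # distance to the k-th nearest point
    if bandwidth == 0:
        raise ValueError("all points in the window coincide with x0")
    weights = [(1 - min(abs(x - x0) / bandwidth, 1.0) ** 3) ** 3 for x in xs]
    total = sum(weights)
    if total == 0:
        raise ValueError("no point receives a positive weight; increase the span")
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return mean_y + slope * (x0 - mean_x), slope


# Toy series: recovery-period falling by half a day per day of confirmation.
days = list(range(20))
periods = [20 - 0.5 * d for d in days]
fitted, slope = local_linear_fit(days, periods, x0=10)
print(f"fitted value = {fitted:.2f}, local slope = {slope:.2f} days per day")
```

A negative local slope at dates within the third period corresponds to the downward trend described in the text. Expressing a slope as an angle depends on how the plot axes are scaled, so the sketch reports it in days per day instead.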
Multivariate (linear) regression model: Now that we have observed and concluded from both the descriptive statistical analysis and the loess model that $\Delta t_r$ is decreasing during the period of January 23-April 01, our next step is to predict the value of $\Delta t_r$. We experiment with the generalized linear model (GLM) for a multivariate linear regression model for $\Delta t_r$ using Poisson (PRM) and negative binomial (NBM) distributions. Our chosen model and distributions are commonly used for count data [5], and hospital length of stay (LoS) data is commonly over-dispersed [17]. For each model, we use four scenarios, namely, the entire period and each of the three periods. We use the Akaike Information Criterion (AIC) and its corrected version for a small sample size (AICc) for determining the goodness of fit of our proposed models. We have implemented these models using the stats.glm in R [16].
For GLM with Poisson and binomial families, the dispersion is fixed at 1.0, and the number of parameters ($k$) is the same as the number of coefficients in the regression model [16]. The negative binomial distribution has an additional parameter to model over-dispersion in the data. For the number of samples ($n$) in the data, AIC is used if $n/k > 40$, and AICc is used otherwise [18]. Thus, we use AIC for the scenarios of the entire period and $P_2$, and AICc for $P_1$ and $P_3$, owing to the relatively fewer samples ( §Table 3).

Table 3: Results of the generalized linear models using Poisson distribution and negative binomial distribution of recovery-period $\Delta t_r$ using $G_{+\mathrm{cfrmDt}}$ grouping.

Table 3 gives the results of our models. We infer the following:
• The "age" variable is significant only in the PRM, and only for the scenarios of the entire period and $P_2$, with a p-value for the coefficients corresponding to the variable being less than 5%.
• The NBM shows lower values for the median and the variance (observable from range and IQR) of deviance residuals, and for AIC/AICc, than the PRM. Thus, we conclude that NBM is a better fit than PRM. Also, for both PRM and NBM, the models for the scenarios of $P_1$ and $P_3$ are a better fit than those of the entire period and $P_2$.
These observations may be attributed to the relatively small sample size for P 1 and P 3 . Since we do not have a large number of variables to discard, we retain the "gender" variable in the model despite its insignificance.
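The choice between AIC and AICc described above can be sketched with the standard definitions of the two criteria. The sample sizes are those of the $G_{+\mathrm{cfrmDt}}$ grouping; the parameter counts (three coefficients for the Poisson model, plus one dispersion parameter for the negative binomial model) and the log-likelihood value are assumptions made for illustration only.

```python
def aic(log_likelihood, k):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * log_likelihood


def aicc(log_likelihood, k, n):
    """Small-sample corrected AIC: AIC + 2k(k + 1) / (n - k - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined unless n > k + 1")
    return aic(log_likelihood, k) + (2 * k * (k + 1)) / (n - k - 1)


def criterion_for(n, k):
    """Apply the n/k > 40 rule of thumb to choose the criterion."""
    return "AIC" if n / k > 40 else "AICc"


# Sample sizes per scenario, from the grouping by positive confirmation date.
sample_sizes = {"entire period": 245, "P1": 18, "P2": 161, "P3": 66}
K_POISSON = 3            # assumed: intercept, age, gender
K_NEGATIVE_BINOMIAL = 4  # assumed: the above plus a dispersion parameter

choices = {name: criterion_for(n, K_POISSON) for name, n in sample_sizes.items()}
for name, choice in choices.items():
    print(f"{name}: n = {sample_sizes[name]}, criterion = {choice}")

# Hypothetical log-likelihood, only to show the size of the correction for P1.
correction = aicc(-100.0, K_POISSON, 18) - aic(-100.0, K_POISSON)
print(f"AICc - AIC for n = 18, k = 3: {correction:.3f}")
```

Under these assumed parameter counts, the rule selects AIC for the entire period and the second period, and AICc for the first and third periods, for both model families.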
The key conclusion from the multivariate regression analysis is that a GLM with NBM for $P_3$ is the best model for estimating $\Delta t_r$. Under this model, the most likely value of $\Delta t_r$ is 8 days, with a probability of 11.8%, and the expected value of $\Delta t_r$ is 8.05 days. Hence, overall, we conclude that the estimated value of γ is $\frac{1}{8}$.
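Since γ is taken as the reciprocal of the recovery-period, the estimates discussed in this paper translate into recovery rates as in the following short sketch. The recovery-periods are those quoted in the text; the function name and labels are illustrative.

```python
def recovery_rate(recovery_period_days):
    """Return gamma as the reciprocal of the recovery-period (per day)."""
    if recovery_period_days <= 0:
        raise ValueError("the recovery-period must be positive")
    return 1.0 / recovery_period_days


# Recovery-periods (in days) quoted in the text.
recovery_periods = {
    "reported in [11]": 20,
    "reported in [10]": 15,
    "overall median in this study": 11,
    "NBM estimate for P3": 8,
}
rates = {label: recovery_rate(days) for label, days in recovery_periods.items()}
for label, gamma in rates.items():
    print(f"{label}: gamma = {gamma:.3f} per day")
```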

Discussion
The improved γ, as per our estimate, is an outcome of the existing protocols in Singapore. The approach to containing the contagion in Singapore has been government-mandated, which has delayed the spread of the disease. While the spread got contained, there has been a shift in the age-group of the population getting infected. This shift has brought about the decrease in the recovery-period, $\Delta t_r$.
Our work has two specific limitations. Firstly, our study does not use a non-zero death/mortality/fatality rate for the disease. The number of deaths will continue to increase, warranting its consideration in the SIR model. Secondly, we have modeled recovery in isolation from infection. Since the number of infected persons in Singapore has increased exponentially since March 17, 2020, the infection rate, β, and consequently the basic reproduction number, $R_0$, have to be re-estimated. Nevertheless, our estimated γ is applicable for improving the estimation of $R_0$ until April 01, and for simulating/predicting disease progression using the SIR model beyond April 01.
In summary, we have looked at the demographic data and timeline of the first 1000 COVID-19 patients in Singapore during January 23-April 01, 2020. We have closely investigated the data on the positive confirmation and discharge/clinical recovery dates of the 245 patients who recovered during this time period. We have used regression analysis, subsequent to a descriptive statistical analysis, to get an estimate of the recovery-period $\Delta t_r$ (i.e., hospital length of stay (LoS)). We have found that $\Delta t_r$ is time-varying, after performing periodization to find three significant periods, namely $P_1$ (January 23-February 03), $P_2$ (February 04-March 16), and $P_3$ (March 17-April 01). The estimates of $\Delta t_r$ varied from ∼17 days in $P_1$ to ∼10 days in $P_2$ to ∼8 days in $P_3$. We have used the loess model for time-series data to demonstrate the negative slope of the regression curve of $\Delta t_r$ in $P_3$, in particular. We then estimated the period-wise $\Delta t_r$ using generalized linear models for multivariate (linear) regression with Poisson and negative binomial distributions for count data. This shows an improvement over the published values of $\Delta t_r$, i.e., 20 days [11] and 15 days [10]. This has led us to estimate the current recovery rate γ in the SIR model to be