A new, simple method of describing COVID-19 trajectory and dynamics in any country based on Johnson Cumulative Distribution Function fitting.

This paper presents a simple method to study and compare infection dynamics between countries, based on curve fitting to the publicly shared data on confirmed COVID-19 infections that they report. The presented method was tested using data from 80 countries from 6 regions. We found that Johnson Cumulative Distribution Functions (CDFs) fit the data extremely well ($R^2$ > 0.99) and that the Johnson CDF fits the data at its tails much better than either of the commonly used Normal and Lognormal CDFs. Fitted Johnson CDFs can be used to obtain basic parameters of the infection wave, such as the percentage of the population infected during the infection wave, the days of the start, peak and end of the infection wave, as well as the duration of the infection wave and the durations of the wave increase and decrease. These parameters can be easily interpreted biologically and used both in describing the infection wave dynamics and in further statistical analysis. The usefulness of the obtained parameters was demonstrated on two examples: the analysis of the relation of Gross Domestic Product (GDP) per capita, and of population density, to the percentage of the population infected during the infection wave, the day of the start, and the duration of the infection wave in the analyzed countries. We found that all of the abovementioned parameters were significantly dependent on GDP per capita, while only the percentage of the population infected was significantly dependent on population density in the analyzed countries. Also, if used with caution, the presented method has some limited ability to predict the future trajectory and parameters of the ongoing infection wave.

This preprint (which was not certified by peer review) is made available under a CC-BY 4.0 International license. The copyright holder is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. This version was posted December 7, 2020; https://doi.org/10.1101/2020

COVID-19 was declared a pandemic by the World Health Organization on March 11, 2020 (Ducharme, 2020). To date, over 64 million infections and almost 1.5 million deaths have been reported globally (WHO, 2020).

Since the very beginning of the pandemic, many models have been proposed to understand the outbreak dynamics of COVID-19 (e.g. IHME, 2020; UGSDSC, 2020; LANL, 2020; Ferguson et al., 2020; Kissler et al., 2020; Aleta et al.; Hellewell et al., 2020) and have been used by policymakers (e.g. the US Government) to allocate resources or plan interventions. Some of them, such as the early IHME model, received a fair amount of criticism (Jewell et al., 2020). COVID-19 modelling studies generally follow one of two approaches, forecasting models or mechanistic models, although hybrid approaches exist (Holmdahl and Buckee, 2020). Forecasting models are often statistical in nature, fitting a line or curve to data and extrapolating from there without incorporating the process that produces the pattern (Holmdahl and Buckee, 2020), while mechanistic models simulate the outbreak through [...]

Fitting Johnson CDF by moments

Johnson (1949) described a system of frequency curves that represents transformations of the standard normal curve (detailed description in Hahn and Shapiro, 1967). Applying these transformations to a standard normal variable allows a unique distribution to be derived for whatever combination of mean, standard deviation, skewness, and kurtosis occurs for a given set of observed data. The standard method of fitting Johnson curves uses the four coefficients that define a Johnson distribution: two shape coefficients (γ, δ), a location coefficient (ξ), and a scale coefficient (λ), which enter the distribution through the cumulative distribution function of the standard normal distribution. However, this method is not intuitive (i.e. it is difficult to set starting points from the data to perform numerical fitting). Thus, an alternative method of fitting Johnson curves, using the first four moments (mean, variance, skewness and kurtosis) of an empirical distribution, was selected (detailed description in Hahn and Shapiro, 1967 and Hill et al., 1976). All statistical fits in the paper were performed using the Levenberg-Marquardt algorithm (Moré, 1978) to solve the corresponding non-linear least squares optimization problem. The convergence criterion was set to 1.0E-10.

Fitting Johnson CDF to the epidemic waves

There is no strict definition of what is or is not an epidemic wave or phase. Intuitively, a pandemic wave traces the development of an epidemic over time and/or space. During an epidemic, the number of newly infected cases increases (often rapidly) to a peak and then falls (usually more gradually) until the epidemic wave is over.
The epidemic dynamics may differ greatly between countries. Since the beginning of the pandemic, some countries have observed only one epidemic wave (e.g. Afghanistan, Argentina), some have observed two (e.g. Australia), and others have observed even more, which may also overlap and interfere with each other (e.g. Croatia, where four overlapping and interfering waves were observed). Also, because many countries applied lockdowns of various levels to slow down or "flatten" the infection curve, the epidemic waves may not follow Farr's law (which states that epidemics tend to rise and fall in a roughly symmetrical pattern or bell-shaped curve) and may be asymmetrical.
The basic assumption is that each epidemic wave W in a given country may be described by a five-parameter scaled Johnson CDF, with a scale parameter (s) and the abovementioned moments: expected value (mean; E), variance (V), skewness (S) and kurtosis (K):

$$W(t) = s \cdot F_{E,V,S,K}(t),$$

where t is the time measured since the day of the beginning of the pandemic and the function $F_{E,V,S,K}$ is the Johnson CDF with parameters γ, δ, ξ, λ assuring mean, variance, skewness and kurtosis equal to E, V, S, K respectively (see Hahn and Shapiro, 1967; Hill et al., 1976). The S and K parameters were expected to improve the curve fit at the tails of the epidemic wave in case it was not symmetrical or was heavy-tailed.

Once the Johnson CDFs were fitted to each pandemic wave in a given country, basic parameters describing the wave dynamics were calculated: (1) the 2.5% quantile ($Q_{2.5\%}$), (2) the 50% quantile (median; $Q_{50\%}$), and (3) the 97.5% quantile ($Q_{97.5\%}$):

$$Q_{2.5\%} = F^{-1}_{E,V,S,K}(0.025), \qquad Q_{50\%} = F^{-1}_{E,V,S,K}(0.5), \qquad Q_{97.5\%} = F^{-1}_{E,V,S,K}(0.975).$$
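The scaled-CDF fit and the quantile step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the unbounded (SU) member of the Johnson system, fits the coefficients (γ, δ, ξ, λ) directly rather than through the moments (E, V, S, K), maps the paper's convergence criterion onto SciPy's `xtol`, `ftol` and `gtol` tolerances, and uses synthetic data. The function names `scaled_johnson_su_cdf`, `johnson_su_quantile` and `fit_wave` are hypothetical.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.stats import norm


def scaled_johnson_su_cdf(t, s, gamma, delta, xi, lam):
    """W(t) = s * Phi(gamma + delta * asinh((t - xi) / lam)) (Johnson SU assumed)."""
    return s * norm.cdf(gamma + delta * np.arcsinh((t - xi) / lam))


def johnson_su_quantile(p, gamma, delta, xi, lam):
    """Inverse of the unscaled Johnson SU CDF at probability p."""
    return xi + lam * np.sinh((norm.ppf(p) - gamma) / delta)


def fit_wave(t, cumulative_pct):
    """Fit one wave by Levenberg-Marquardt and return (parameters, R^2)."""
    # Crude starting point read off the data: plateau height and half-way day
    half_day = t[np.searchsorted(cumulative_pct, 0.5 * cumulative_pct[-1])]
    start = np.array([cumulative_pct[-1], 0.0, 0.0, half_day,
                      np.log(0.1 * (t[-1] - t[0]))])

    def unpack(p):
        # delta and lam are optimised on a log scale so they stay positive
        return p[0], p[1], np.exp(p[2]), p[3], np.exp(p[4])

    def residuals(p):
        return scaled_johnson_su_cdf(t, *unpack(p)) - cumulative_pct

    result = least_squares(residuals, start, method="lm",
                           xtol=1e-10, ftol=1e-10, gtol=1e-10)
    ss_res = np.sum(result.fun ** 2)
    ss_tot = np.sum((cumulative_pct - cumulative_pct.mean()) ** 2)
    return unpack(result.x), 1.0 - ss_res / ss_tot


# Synthetic, noise-free wave (hypothetical values, not data from the paper)
t = np.arange(200.0)
data = scaled_johnson_su_cdf(t, 2.0, -0.5, 1.5, 80.0, 20.0)

(s, gamma, delta, xi, lam), r2 = fit_wave(t, data)
q_start, q_median, q_end = johnson_su_quantile(
    np.array([0.025, 0.5, 0.975]), gamma, delta, xi, lam)
print(f"s={s:.3f}, R^2={r2:.5f}, start={q_start:.1f}, "
      f"median={q_median:.1f}, end={q_end:.1f}")
```

Optimising δ and λ on a log scale is a design choice of this sketch: the Levenberg-Marquardt method is unconstrained, and the transformation keeps both coefficients positive without adding bounds.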

The disadvantage of fitting a Johnson curve by its moments is that its mode cannot be determined analytically. Thus, the mode of each Johnson CDF was determined numerically:

$$M = \arg\max_t f_{E,V,S,K}(t),$$

where $f_{E,V,S,K}$ is the Johnson Probability Density Function (PDF).

The obtained parameters have an intuitive biological interpretation (Fig. 1): the scale parameter (s) indicates the total percentage of infections during a given epidemic wave ($P_{inf}$), $Q_{2.5\%}$ indicates the day when the infection wave starts, and $Q_{97.5\%}$ indicates its end. The median ($Q_{50\%}$) indicates the day when half of the total percentage of infected during a given wave was reached. Finally, the mode (M) indicates the day of the peak occurrence. Additionally, one can easily obtain the wave duration (T), the duration of the wave increase ($t_i$) and the duration of the wave decrease ($t_d$):

$$T = Q_{97.5\%} - Q_{2.5\%}, \qquad t_i = M - Q_{2.5\%}, \qquad t_d = Q_{97.5\%} - M.$$

Also, the parameter measuring the asymmetry of the infection wave (A) can be easily obtained as the ratio

$$A = t_i / t_d \qquad (8)$$

All of the abovementioned parameters may be easily used in further statistical analysis, which was shown on two examples: 1) the relationship between Gross Domestic Product (GDP) per capita and the basic parameters describing the dynamics of the first wave of infections (M, T, and $P_{inf}$), and 2) the relationship between population density and the same parameters (M, T, and $P_{inf}$). Only the first wave of infections in each country was taken into account, because in some countries second (and consecutive) waves were not observed, and those countries would have been excluded from the analysis.

Comparing curves: Johnson vs Normal and Lognormal CDF

The differences between the Johnson, Normal and Lognormal CDFs were presented on the data from Afghanistan, where only one epidemic wave was observed. The differences were shown by comparing the $R^2$, $P_{inf}$, $Q_{2.5\%}$, M, and $Q_{97.5\%}$ parameters.
The 2.5% and 97.5% quantiles for the Normal and Lognormal distributions were obtained using the inverse Normal and inverse Lognormal CDF, respectively.
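The wave parameters used in this comparison (start, median, end, peak, durations and asymmetry) can be computed in the same way for any of the three distributions. The sketch below is illustrative only: the helper name `wave_parameters` is hypothetical, the Johnson curve is assumed to be of the unbounded (SU) type as implemented in `scipy.stats.johnsonsu`, and the distribution parameters are made-up values rather than fits from the paper.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import johnsonsu, lognorm, norm


def wave_parameters(dist):
    """Wave descriptors for a frozen SciPy distribution over time (in days)."""
    # Quantiles come from the inverse CDF (ppf in SciPy)
    q_start, q_median, q_end = dist.ppf([0.025, 0.5, 0.975])
    # Numerical mode: maximise the PDF between the start and end quantiles
    mode = minimize_scalar(lambda x: -dist.pdf(x),
                           bounds=(q_start, q_end), method="bounded").x
    t_i = mode - q_start          # duration of the wave increase
    t_d = q_end - mode            # duration of the wave decrease
    return {"Q2.5": q_start, "Q50": q_median, "Q97.5": q_end, "M": mode,
            "T": q_end - q_start, "t_i": t_i, "t_d": t_d, "A": t_i / t_d}


# Hypothetical curves, chosen only to show the mechanics
normal_wave = wave_parameters(norm(loc=90.0, scale=15.0))
lognormal_wave = wave_parameters(lognorm(s=0.3, scale=90.0))
johnson_wave = wave_parameters(johnsonsu(-0.5, 1.5, loc=80.0, scale=20.0))

for name, wave in [("Normal", normal_wave), ("Lognormal", lognormal_wave),
                   ("Johnson SU", johnson_wave)]:
    print(name, {key: round(float(value), 2) for key, value in wave.items()})
```

A symmetric curve such as the Normal gives an asymmetry A close to 1, while a right-skewed curve, which rises faster than it falls, gives A below 1.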

Fitting a Johnson curve to an ongoing wave yields parameters that can also be interpreted as a prognosis of the future shape and dynamics of the infection wave. In such a case, $P_{inf}$, M and $Q_{97.5\%}$ indicate the predicted percentage of infections, the predicted day of the peak and the predicted day of the end of the ongoing wave respectively, which can also be used to calculate the predicted time of increase, decrease and duration of the ongoing infection wave. Because the presented method is intended to describe infection dynamics rather than to predict its future outcome, the accuracy of the prognosis was presented only on the data on the first wave of infection observed in the United Kingdom, in the Supplementary Materials.

Examples of application

The relation of Gross Domestic Product (GDP) per capita and of population density to the dynamics of the first wave of COVID-19 infections

The data on GDP per capita and population density in the 80 analyzed countries were obtained from the Our World in Data COVID-19 dataset (Hasell et al., 2020).

The relationships of GDP per capita and of population density with the basic parameters describing the dynamics of the first wave of infections (M, T, and $P_{inf}$), obtained using the presented method of Johnson CDF fitting, were tested using the quantile dependence function method, which was described in detail in Ćmiel and Ledwina (2020). This method was designed for measuring and visualizing the dependence structure, and for testing the independence, of two random variables.
It exploits a recently introduced local dependence measure (the quantile dependence function q), which gives a detailed picture of the underlying dependence structure and provides a means to carefully examine the local association structure at different quantile levels (Ćmiel and Ledwina, 2020).

Examples of Johnson curves fitted to data from countries where one ongoing infection wave (Argentina), one infection wave (Afghanistan), two infection waves (Australia) and four overlapping and interfering infection waves (Croatia) were observed are presented in Fig. 2. The four Johnson CDFs fitted to the four waves of infections observed in Croatia, with the areas where the waves overlap and interfere, are presented in detail in Fig. 3A.
Johnson CDF fitting tested on data from 80 different countries showed that all curves were extremely well fitted: the lowest $R^2$ obtained was 0.995 (Fiji) and the highest was 0.99997 (Iraq), while the mean and median $R^2$ were 0.9995 and 0.9997 respectively. Fitted functions with $R^2$ and COVID-19 trajectory plots with fitted functions for each country are presented in the Supplementary Materials (Table S1; Figures S1-S6).

Fitting Johnson, Normal and Lognormal distribution curves to the single wave of infection observed in Afghanistan showed that the Johnson CDF was the best fitted ($R^2$=0.9998), while both the Normal ($R^2$=0.9980) and Lognormal ($R^2$=0.9989) distributions were worse fitted, mainly at the tails of [...]

The results of the analysis of the associations between GDP per capita and the M, T and $P_{inf}$ parameters showed that the percentage of confirmed infections during the first epidemic wave in the analyzed countries was dependent on GDP per capita (p=0.0147; Fig. 4A), as were the time of the peak occurrence (M; p=0.0002; Fig. 4B) and the duration of the first epidemic wave (T; p=0.0087; Fig. 4C). The relation between the percentage of infections and GDP per capita showed a rather global positive dependence (Fig. 4A), which means that the higher the GDP per capita, the higher the percentage of infections during the first epidemic wave. The relation between the time of peak occurrence and GDP per capita showed local negative dependence for countries where the peak occurs late (above the median; Fig. 4B), which means that a very early peak is largely uncorrelated with GDP per capita, but when the peak does not occur early, the higher the GDP per capita, the earlier the peak occurs. A similar relation was observed between the duration of the infection wave and GDP per capita (Fig. 4C), i.e. a very short first epidemic wave is largely uncorrelated with GDP per capita, but when the first epidemic wave is not short, the higher the GDP per capita, the shorter the first epidemic wave.

The results of the analysis of the associations between population density and the M, T and $P_{inf}$ parameters showed that the percentage of infections during the first epidemic wave in the analyzed countries was dependent on population density (p=0.0079; Fig. 4D), while the day of the peak occurrence and the duration of the first epidemic wave were not dependent on population density (T: p=0.4243; Fig. 4E; M: p=0.5924; Fig. 4F). The relation between the percentage of infections and population density showed local negative dependence (Fig. 4D), e.g. in case when population density is [...]

[...] parameters during the Johnson curve fitting procedure, whereas the shape of other commonly used curves (Normal, Lognormal, Weibull) is more or less imposed. This result also suggests that the Johnson distribution should be preferred in the curve-fitting approach for COVID-19 data.

The presented curve fitting method was designed primarily to obtain easily interpretable parameters describing the past trajectory of COVID-19 infection, but parameters describing a currently ongoing wave of infection, especially in its early stage (before the peak), may be interpreted as a forecast of the future course of the pandemic. However, in such a case, extreme caution is advised (see Jewell et al., 2020). The presented method is a purely statistical model: it does not incorporate the process that produces the pattern of infection numbers, and it does not account for any parameters governing transmission, disease, and immunity. Also, curve fitting techniques cannot predict the occurrence of future peaks. Thus, for long-term prognosis and modelling of future scenarios of the pandemic, it is recommended to use more reliable methods based on SEIR models. Nevertheless, some short-term prognosis can be obtained using the presented method, which may be useful for policymakers in rapid, short-term intervention planning; however, one must keep in mind the abovementioned limitations of the presented method, as well as the limitations resulting from data collecting and reporting, which are discussed later in this section.

The results obtained in the presented example of the application of parameters describing COVID-19 dynamics showed that the higher the GDP per capita, the higher the percentage of the population infected.
This is a quite unexpected result, however consistent with the result obtained [...] results showed that, excluding countries where the peak of infections occurred very early and the wave was short, the higher the GDP per capita, the earlier the peak occurs and the shorter the first epidemic wave. This result, in turn, is similar to that of another very recent paper, which reported that the date of the first COVID-19 cases co-varies positively with GDP across countries, most probably due to their more intensive participation in the global tourism and traffic industries (Jankowiak et al., 2020). The other example showed that the higher the population density, the lower the percentage of the population infected during the first wave of infections. This also seems unexpected; however, the negative dependence may result from the fact that infections are presented as a percentage, which does not scale proportionally with population density. Another possible explanation is that in countries with high population density (e.g. China, Singapore), very strict (full) lockdowns were immediately applied (China: Kretschmer and Yang, 2020; Singapore: Cheong, 2020), which could result in lower percentages of infected population than in countries with lower population density, where partial lockdowns or no lockdowns at all were applied. Moreover, some studies report a positive correlation between population density and the number of infections and related mortality (e.g. in India; Bhadra et al., 2020), while others report no evidence that population density is linked with COVID-19 cases and deaths (e.g. [...] presented method, but also, the very recent papers of Liu et al. (2020) and Jankowiak et al.
(2020) showed that the field of research on COVID-19, other than purely epidemiological modelling of future pandemic scenarios, is growing, which indicates that simple methods of obtaining parameters describing the infection waves, such as the one presented in this paper, may be very useful and can help to deepen our understanding of the COVID-19 pandemic.

Last but not least, a key limitation in understanding the COVID-19 pandemic has to be addressed: the true number of infections is not known, and the only known infections are those confirmed by tests. Moreover, testing strategies differ between countries, i.e. in some countries only symptomatic cases are tested, while in others mass testing is performed. Also, most COVID-19 cases are asymptomatic and remain unreported (Peirlinck et al., 2020). Because of that, mortality data are generally considered more reliable than testing-dependent confirmed case counts and are used in COVID-19 epidemic modelling (e.g. Chikobvu and Sigauke, 2020 [...] into account, it is very likely that the real number of deaths is also higher than the reported number of deaths, which was noticed in some countries (e.g. Italy: Foresti, 2020; Stancati and Sylvers, 2020; China: Long et al., 2020). It seems that both confirmed new cases and confirmed deaths may not be reliable, but on the other hand, no other data are available. Some models (e.g. IHME, 2020) are able to estimate the true number of infections, but this relies on a number of additional assumptions and is partly based on the reported testing-dependent data. Also, the relation between the true number of infections and the number of deaths is not well studied to date and requires a number of assumptions.
Using the number of infections seems to be the easiest way of obtaining basic data on COVID-19 infection dynamics in a given country, as long as one is aware that publicly shared data show the number of confirmed cases instead of the number of real infections, and takes this into account when interpreting the results.

In conclusion, the presented method, based on fitting a Johnson CDF to the cumulative number of confirmed cases, is straightforward, well known and easy to use. It provides curves that are extremely well fitted to the data, and the obtained basic parameters of COVID-19 infection dynamics are easy to interpret and to use in further statistical analysis by researchers from fields other than epidemiology (e.g. sociology, biology, ecology), and can deepen our understanding of the COVID-19 pandemic. It may also be useful in short-term prognosis; however, in such a case caution is advised.

Acknowledgements

[...] Distribution Functions to the raw data from Afghanistan (black dots).