## ABSTRACT

A pipeline involving data acquisition, curation, carefully chosen graphs and mathematical models, allows analysis of COVID-19 outbreaks at 3,546 locations world-wide (all countries plus smaller administrative divisions with data available). Comparison of locations with over 50 deaths shows all outbreaks have a common feature: *H(t)* defined as log_{e}*(X(t)/X(t-1))* decreases linearly on a log scale, where *X(t)* is the total number of Cases or Deaths on day, *t* (we use ln for log_{e}). The downward slopes vary by about a factor of three with time constants (1/slope) of between 1 and 3 weeks; this suggests it may be possible to predict when an outbreak will end. Is it possible to go beyond this and perform early prediction of the outcome in terms of the eventual plateau number of total confirmed cases or deaths?

We test this hypothesis by showing that the trajectory of cases or deaths in any outbreak can be converted into a straight line. Specifically *Y*(*t*) ≡ −ln(ln(*N* / *X* (*t*)), is a straight line for the correct plateau value *N*, which is determined by a new method, Best-Line Fitting (BLF). BLF involves a straight-line facilitation extrapolation needed for prediction; it is blindingly fast and amenable to optimization. We find that in some locations that entire trajectory can be predicted early, whereas others take longer to follow this simple functional form. Fortunately, BLF distinguishes predictions that are likely to be correct in that they show a stable plateau of total cases or death (*N* value). We apply BLF to locations that seem close to a stable predicted *N* value and then forecast the outcome at some locations that are still growing wildly. Our accompanying web-site will be updated frequently and provide all graphs and data described here.

## INTRODUCTION

In December 2019 a coronavirus, known as SARS-CoV-2, was discovered in Wuhan China (Wang, 2020). The virus, perhaps from horseshoe bats (Zhou, 2020), spread between humans during January 2020, leading to the COVID-19 pandemic. Early prediction of the number of cases and deaths in an epidemic or pandemic is of vital importance as it helps policy makers make informed decisions on the best allocation of resources and containment of the pathogen. For this reason, many different groups have attempted to make reliable predictions of Sars-Cov-2 diffusion (Levitt 2020a, Wang 2020, Dimeglio 2020, Wu 2020, Pinotti 2020). These forecasts are based on a variety of mathematical and statistical models, which use different types of data (COVID-19 data, mobility data, demographic data) and take into account the impact of interventions, such as social distancing, proper hand hygiene and the use of masks. Such variables differ from country to country, and moreover, the criteria to detect COVID19 cases and consider COVID19 as the cause of deaths also vary sometimes even for states/provinces in the same country. These factors combine to complicate finding a universal method to fit and predict COVID-19 trajectories.

We began working on COVID-19 in the last week of January 2020 using data released by Sudalai Rajkumar (Rajkumar), Johns Hopkins Coronavirus Resource Center (JHCS) and Chinese internet (JOBTUBE). On January 28^{th}, there were numbers of cases and deaths for 6 days starting on January 22^{nd}. The daily death rate of COVID-19 (ratio of total deaths to total cases on a given day) was ten times higher inside Hubei, the province surrounding Wuhan, than everywhere else in China (non-Hubei). Concerned and encouraged by this data, we started an Excel spreadsheet to follow the daily progression of COVID-19. Each day, we made graphs of four simple measures. Three were obvious: the total number of cases; the total number of deaths, and their ratio, the death rate. The fourth was trivial but less obvious: the ratio of the total cases (or deaths) denoted as *X(t)* for today divided by that of yesterday. This ‘fractional change function’ *f* (*t*) measures exponential growth of *X(t)* with *f* (*t*) = *X* (*t*) / *X* (*t* − 1).

If the total today is always 10% more than yesterday the value today will be 1.1 times the value yesterday with *f(t)=1*.*1*. In fact, on January 29^{th}, the number of deaths today divided by that of yesterday was 1.3. Were such exponential growth of 30% a day to continue, everyone on earth would die within 90 days. Analyzing the data more completely over the next few days, we noticed on February 2^{nd} that the fractional change for deaths in Hubei showed a steady decrease from 30% on January 29^{th} to 18% four days later. If this linear decrease of fractional change in deaths continued then deaths in Hubei would stop on day 67, when the fractional change became equal 1 can *X(t)* was the same as *X(t-1)*. We reported this finding widely (Levitt, 2020b), although in retrospect, it was naïve to expect the linear decrease to continue.

Nevertheless, our early interest in the fractional change function remained for two reasons. Firstly, because of the mathematical simplicity of *X* (*t*) / *X* (*t* − 1) as compared to more accepted measures like *R*_{t}(Wallinga 2007; Ferguson 2013). Secondly, because, by analyzing the data of a small number of early epidemics (before mid-March 2020), we realized that the factional change function appears to have the same shape for multiple locations: it converges to 1 as fast as a decaying exponential (Levitt, 2020c). Furthermore, because the fractional change function is a ratio, it is not affected by different systematic counts of cases/deaths due to different criteria: two countries that apply different criteria for deciding when a person is infected but have the same day to day growth will have the same fractional change function, provided that the counting method is kept consistent within the country.

Elaborating further from this initial intuition we found a minimal mathematical model that allows us to consistently describe the spreading of the virus in different countries. We also were able to reduce the very complicated task of fitting inconsistent data sets to the fitting of a straight line for which extrapolations and quality controls are trivial. This allowed us to completely automate data fitting, extrapolation and assessment of the quality of fitting, all done simultaneously and at blinding speed (less than an hour of CPU for all the outbreaks in the world).

## METHODS

### Data Processing

Data is synced daily to two different sources for world data, US and Italy data. World data and US data including county and states levels is taken from (JHU), available from (Starschema). Italy data at provinces level is taken from (Ita-regioni). (We thank Levitt-group members Dr. João Rodrigues and Dr. Frederic Poitevin for integrating these data sources into a master file).

For some location the data contains inconsistencies, which we call ‘data glitches’ and these are corrected as we did in our earliest analysis of the epidemic in Hubei, China (Levitt, 2020d). We were well-aware that any alteration of the raw data must be justified and carefully recorded as we do here. Such correction turns out to be important as the curve-fitting of the raw data is insensitive to random counting errors, allowing us to use the raw data without any smoothing, but is sensitive to systematic errors like these. There were three type of ‘data glitches’: ‘mis-glitches’, ‘rise-glitches’ and ‘drop-glitches’. (1) *mis-glitches* occur when the data on a given day is not updated. Specifically, whenever two consecutive *X(t)* values (at times *t-1* & t) are identical, we alter the value at *t* to be the average of the values at times *t-1* & *t+1*. (2) *rise-glitches* occur when new cases or deaths not previously reported are discovered and released on a particular day. This first occurred in China Hubei on February 13^{th}, when 13,000 cases detected clinically were added to the total. These cases did not occur on the day reported but rather over the preceding days, so we corrected for by rescaling the number of confirmed cases on days prior to 13^{th} February by a constant factor greater than 1 (Levitt 2020d). The same correction was applied on a small number of instances when additional deaths or confirmed cases were reported on a specific day as having been unreported on previous days. Again, we added the deaths or confirmed cases to the previous days a fixed fractional increment (the complete list of with both types of correction is provided in the Supplementary Material). (3) *drop-glitches* occur when the total numbers at a given location are decreased on a particular day. This can never happen normally as totals always increase and is due to the realization that numbers reported previously include misidentified cases or deaths. This glitch is less common than the other two. It is corrected in the same way as the *rise-glitch* except that the factor multiplying total values on all previous days is less than 1.

### Mathematical background

We consider *X* (*t*), the discrete temporal series of cases (or deaths) in a given country, region or province. In the most general scenario, we assume that *X* (*t*) obeys the following ordinary differential equation (ODE):

In the discretized form the first derivative of *X(t)* is *X* ‘(*t*) = *X* (*t* − 1) − *X* (*t*), which is the number of new cases on day *t*. Equation (1) simply states that the number of new cases on a certain day is proportional to the number of cases on the previous day.

The coefficient of proportionality *s(t)* is not constant. It changes with time so as to take into account the dynamics of virus spreading, which may be affected by social distancing or the structure of social network interactions.

We are interested in a solution of Equation 1 that reaches a plateau value of *N* for a large *t* (often called a growth function). A general form for many different kinds of growth functions can be written as follows (Koya 2013)

Equation 2 describes a rich family of curves which comprises Richards functions (Richards 1959), generalized logistic functions, Weibull functions (Frechet 1927) etc. While the overall shape of these curves depends on the various parameters, the asymptotic behavior has the same analytical form for all the curves in the family. It is this behavior that allows us to introduce an important simplification that reduces the fitting of Equation 2 to fitting a straight line. It is easy to show that the following relationship holds in the limit of large *t*:
Equation (3) is true asymptotically for every function in the Koya Goshu family, and exactly true for the Gompertz function, *G(t)*, (Gompertz 1825). This function has been also used by other groups to fit data of COVID-19 trajectories (Castorina 2020, Catala 2020) and is shown in **Fig 1** and **Fig. S1**):

We also consider another function, which is the logarithm of the fractional change function *f* (*t*) defined above:

For a Gompertz distribution *H(t)* is a decaying exponential function with the same time constant *U*, associated with *Y(t):*

A similar relationship is valid asymptotically for other growing functions with the same time constant.so analysis of the behavior of *H(t)*, provides a second method to derive the time constant *U*.

### Data fitting and validation

The simple linear relationship in Equation 3 provides a remarkable tool allowing us to fit the trajectory of virus spreading and predict the end points (*N*) in different locations. Given a single data series *X(t)*, the best estimate for *ln(N)* is determined as the value that maximizes the correlation coefficient of *Y(t)* and *t* (Figure 1). The calculation of the correlation coefficient is very fast and can be completely automated for a large number of data, and implicitly it also provides a measure of the validity of the assumptions that lead to Equation 4.

This calculation can be updated day by day, and eventually, the extrapolation for *ln(N)* will converge to the correct number. As we will show in the results section, in many cases the end point can be predicted accurately at a very early stage.

The pseudo-code for data fitting is the following: Read in csv date, Total Cases, Total Deaths for all the world location Correct errors, in the date Main loop for each location for line_end to End { for line_start 10 to line_end-10{ step lnN from lnN1 to lnN2 by dlN{ x=day; y = ln(N)-ln(X(t)) CC = correlation_coef(x,y) Find maximum CC } if best CC > threshold Keep line Y coordinates and the lnN values } }

For each line_end, collect the predicted *N* values and histogram them to find the most common value that is then taken as the prediction for that particular line-end value. We are well aware that this method can be improved in many ways some of which we are currently exploring.

### Data Smoothing

All data is smoothed using the LOWESS method (locally weighted scatter-plot smoothing) developed by W. S. Cleveland at Bell Labs in 1988 (Cleveland 1988). We use the original FORTRAN code written in Ratfor (Ratfor 1976) (https://www.netlib.org/go/lowess) and converted to Mortran (Mortran 1975**)**. The parameter *F* (the fraction of points used to compute each fitted value) is set to 0.05, 0.07, 0.1, 0.12 and 0.14 for SMO1 to SMO5, respectively. In addition, the smoothed output Y-axis values for SMO4 and SMO5 are smoothed a second time using *F*=0.1. Smoothing is only applied to the total counts of cases and deaths. Well-aware of the distortions that smoothing can cause, we made sure that the smoothing did not introduce false features at the start or end of the time series. We also made sure that the smoothing did not move the location of the peaks as shown in **Fig. S3**. We also test the root-mean-square value of the change in total values caused by the five different levels of smoothing. When we do this for locations with more than 60 deaths and for locations with more than 1000 cases we find that the % RMS error average values are between 0.4% and 1.2% for *F* ranging from 0.05 to 0.14.

One problem when using smoothed data to test prediction, is that smoothing uses future data points that would not have been available on the day the prediction would have been made. We allow for this in estimating when new cases and deaths peak by taking the effective peak date for completed situations as half way between the actual peak date found in the smoothed data and the date at which the level has dropped past the peak to half peak height. We also generally avoid using smoothed data.

## RESULTS

### What To Expect From Simple Mathematical Functions

The most important result of this study is that the Gompertz function can be transformed into a straight line provided one knows the final plateau value of total counts of either cases or deaths, denoted here as *N*. This is shown in **Fig. 1** and provides the basic method we use to fit the observed data. Namely, vary the value of *N* to make the transformed Gompertz function *Y(t)* into a straight line and then derive parameters from the fit. Although this result is asymptotically true for a broad class of growth functions, we find that the simple three parameter Gompertz growth function fits the trajectory of actual COVID-19 outbreaks very well (Fig. 2). Specifically, the logarithm of the slope of ln*(X(t))* (called *H(t*)) decreases linearly with time meaning that the exponential growth rate (the slope of ln*(X(t))*) is never constant so that growth is never exponential. This linear decrease of *H(t)* is not true for all growth functions: specifically, the sigmoid function starts with pure exponential growth (**Fig. 2, c-d**). We find this linear decrease of ln*H(t)* is in fact a universal property of all outbreaks (Fig. 3) justifying the broad use of the Gompertz function here.

### Classification of World COVID-19 Outbreaks

Table 1 lists those countries (89 in all) or regions (Italy, US & Canada, 147 in all) with more than 50 deaths or 1000 cases. The outbreaks have been classified by our completeness code that is based on the peaking of the number of new daily cases or new daily deaths. (See **Table 1** for explanation for explanation of the completeness code).

### Fitting With a Straight Line

**Fig 4 (a)** shows the function *Y(t)* for deaths in the many different locations (countries or regions of countries) which have reached a plateau, and for which the prediction of the final *N* is stable. It is evident that for all these locations the data generally follows a linear relationship thereby justifying *a posteriori* our working hypothesis. This observation is confirmed by the fact that the correlation coefficient with time of *Y(t)* is close to 1 for the vast majority of the locations we examined. We also note that the time constants *U* (i.e. the inverse slope of the lines or the time-constant of decay) are very similar to each other, indicating the existence of universal properties in virus diffusion that are largely independent of the country.

When considering confirmed cases (**Fig. 4**), we observe more diverse behavior in the time course of *Y(t*). While for some countries the linear relationship still holds (**Fig 4 (b)**), in other countries we notice deviation from linearity (**Fig. 4 (c)**), which could indicate the existence of multiple outbreaks, or could reflect a change in the method of counting cases.

By fitting *Y(t)* to a single straight line, we can average multiple outbreaks into a single major outbreak which will follow a Gompertz distribution, where the parameters *U* and *T* are the slope and the x-axis intercept at *Y(t)*=0, of *Y(t)*. This approach allows us to obtain a uniform description for every time series *X(t)* of cases and deaths in different parts of the world, but with loss of details for locations that do not follow a simple linear relationship.

While *a posteriori* fits describe the raw data well, extrapolations of the final plateau before a given day are still subject to large fluctuations, due to the (double) exponential nature of Gompertz law. In other words, when *X(t)* is small compared to *N*, the fitting line varies approximately like ln(ln*(N*));.even large variation of *N* barely affects the quality of the fit. Vice versa, when *X(t)* approaches *N, Y(t)* becomes more and more sensitive to the correctness of the predicted value of *N* (**Fig. S2**) The consensus predicted value of *N* converges to a plateau value with time, and then it is followed by real data with some delay. This allows us to discriminate between locations in which confirmed cases (or deaths) have reached or are near to reach the plateau, and locations for which it is still impossible to predict the plateau.

### A closer Look at Specific Locations

While COVID-19 trajectories share many properties, each outbreak has its own features, which affect our ability to forecast the outcome in terms of the plateau value *N* for both cases and deaths. These features are best appreciated using two types of graphs, the Four-Panel graph and the Best Line Prediction graph described carefully in **Fig. 5**, which shows these two graphs for Germany a large but well-behaved outbreak. The top panel of **Fig. 5(a)** shows that for smoothed data there are single peaks of new cases and new deaths, with the new deaths peaking 11 days after the new cases. This is almost exactly what we observed for the smaller and much earlier outbreak in China, non_Hubei, where deaths were most likely to occur 10 days after a case was confirmed (**Levitt, 2020f**); this suggests that this interval may be connected to the natural progression of the disease in well-managed scenarios. The same delay between cases and deaths is also seen (as it should be) in the second panel. The third panel shows the characteristic curvature recognized since our 14-Mar-20 analysis (Levitt 2020b). Together with the forth panel, it also reveals a small initial outbreak that started on 24-Jan-20, was contained and then followed by a much larger outbreak that started two weeks later and became clearly seen after another two weeks. The Best Line Prediction (BLP) graphs for Germany show in **Fig. 5 (b)** that from 1-Apr-20, the plateau value of total cases would have been well-predicted. For deaths, **Fig. 5 (c)** shows the eventual plateau value could have been predicted accurately on 10-Apr-20. The blue dots on these two graphs show that the predicted plateau values vary wildly and a prediction can only be made because many straight-line fits give a similar consensus *N* value.

In **Fig. 6** we show four other locations which have reached a plateau and for which the extrapolation has not changed significantly in the last few weeks. Although none of these locations are as clean as Germany (**Fig. 5**), one see that early predictions are unstable but converge to more realistic figures with time. **Fig. 6(a)** shows New York City to be well-behaved in terms the smoothed peaks of new cases and new deaths although deaths and cases seem to occur at the same time. This suggests a situation less under control than either China, non_Hubei or Germany. Nevertheless, the BLP graph shows that the final plateau value of *N* for deaths in New York City could have been predicted correctly on 10-Apr-20. The decay of *H(t)* shown in the bottom panel of the four-panel graph is very clean suggesting a single outbreak. **Fig. 6(b)** shows Sweden to have very badly formed peaks of new cases and new deaths and deaths seem to occur before cases, an impossibility likely due to decreased counting of cases as the epidemic proceeds. The BLP graph predicts a plateau value that is not constant although it does look as if total deaths will plateau at about 6,000. **Fig. 6(c)** shows Russia to also have a very extended peaks of new cases and deaths. The number of new cases peaked 7 weeks ago but new deaths remain high. Nevertheless, the BLP graph shows that the predicted plateau value of *N* for deaths is increasing more and more slowly and may well converge to a value of about 16,000, almost double the current number of deaths. **Fig. 6(c)** shows Mexico deaths to be increasing even more rapidly than Russia and at present it is impossible to predict the plateau.

In **Table 2**, we compare the predictive power on the Best Line Prediction (BLP) with that of the Peak Detection Method (PDM). Checking all the converged locations where the current value is expected to close to the expected plateau value of *N* shows that the BLP is significantly better than the PDM. Both methods seem to be able to make their predictions at about the same time (on average, the PDM predicts two days earlier than the BLP based on our assumed value for the peak confirmation date.

In **Table 3** we look at the most active locations to identify cases where prediction of outcome could have significant impact. For this we use two criteria: First, that the forecast be reliable in that the plateau is stable in terms of its slope, its percent standard deviation and at least seven days at this plateau value. Second, that the forecast plateau is a significant increase over the current level.

At the moment of writing this manuscript, many countries or regions are still in the fast growing phase, and it is still impossible to predict the outcome of the epidemic in them (see **Table 3**). For others we can make predictions as shown in **Fig. 7**. Panels (a) & (b) show clearly that the BLP graph for Peru predicts a clear plateau for cases (*N*=478,000), but the predicted plateau for deaths is still rising rapidly. The plateau value for cases is almost double the current level of 257,000 making this a very meaningful forecast. Panel (c) shows that for Brazil the BLP predicts a stable plateau of 98,000, another very meaningful forecast, again almost double the current level of 47,000. Panel (d) shows that cases in Belarus are predicted to plateau at 82,000, although there is a less clear leveling. Panel (e) shows a split prediction for cases in the United Arab Emirates where there are two plateaus, at 49,000 and about 60,000, respectively; such splitting is very rare. Panel (f) shows that deaths in Kuwait are perhaps going to plateau at 400. Panels (a) to (c) are important forecasts with a meaningful impact, whereas those in panel (d) to (f) show the diversity of behavior making automatic forecasting a challenging problem.

### Open Availability of all Data

All graphs and tables are available online using apps written by Dr. Scaiewicz. The app at http://levitt1.herokuapp.com/ shows the classification for different countries and updated numbers and graphs. The app at http://levitt.herokuapp.com/ shows the predictions in the Best Line graphs. We thank Dr. João Rodrigues for advising us on this matter.

### Availability of the Computer Codes

We would like to make the computer codes we use available to all but these are currently written in a variety of languages that few would want to use. While Dr. Scaiewicz uses clean self-documenting Jupyter Python notebook code, Dr. Levitt still develops in a FORTRAN dialect call Mortran (Mortran 1975) that he has used since 1980. The Mortran preprocessor produces Fortran that is then converted to C-code using f2c. This code is at least a hundred-fold faster than Python code. His other favorite language is more modern, but involves the use of the now deprecated language Perl and Unix shell scripts.

Nevertheless, the methods proposed here are simple; they are easily and quickly implemented by a skilled programmer. Should there be interest, we would be happy to help others develop the code and test them against ours. We also realize that there is ample room for code optimization. Some of the things that we have considered are pre-calculating sums of terms to convert computation of the correlation coefficient from a sum over *N* terms to the difference of two sums. Another way to speed the code would be to use hierarchical step sizes in a binary search to find the value of ln*N* that gives the best straight line.

Our study involving as it did a small group working in different time zones and under extreme time pressure revealed that scientific computation nowadays faces a Babel of computer languages. In some ways this is good as we generally re-coded things rather than struggle with the favorite language of others. Still, we worry about the future of science when so many different tools are used. In this work we used Python for data wrangling and some plotting, Perl and Unix shell tools for data manipulation, Mortran (effectively C++) for the main calculations, xmgrace and gnuplot for other plotting, Excel (and Openoffice) for playing with data. And this diversity is for a group of three!

## DISCUSSION

### Non-Exponential Growth

It is evident from our data analysis that the growth of a COVID19 epidemic does not follow an exponential growth law even in the very first days, but instead its growth is slowing down exponentially with time. While all growth functions decelerate exponentially when approaching the plateau, the Gompertz function is unique in that it is decelerating from the first day, and thus can fit the first part of the COVID-19 outbreak. Moreover, its relatively simple functional form, allowed us to produce an efficient computer code to fit data in all different locations in a consistent way.

As would be expected, we find several examples in which this simple law is not followed, especially when looking at confirmed cases (deaths appear to follow the Gompertz Function more consistently). For some of these countries (e.g. Iran) it is evident that a second outbreak occurred well separated in time from the first. In other countries, (e.g. South Korea) we observe a change in the dynamics of the virus spread, which could be related to the adopted containment strategy or a difference in the level of testing. Even though such unusual dynamics cannot be predicted from the beginning, our fitting method is able to identify abrupt changes and will identify the slowest characteristic time and will, therefore, be able to produce a prediction for the new plateau.

We believe that the analysis in our study shows conclusively that COVID-19 epidemics grow according to the Gompertz Function and not the Sigmoid Function (Fig. 2). The main difference between these functional forms is that the Sigmoid Function starts off growing exponentially (it has a constant exponential growth factor) and then slows down (blue line in Fig. 2(c)). The Gompertz Function is never exponential but rather has a growth rate that decreases exponentially from the very first confirmed case. This does not make sense as when there are very few cases, it should be easy for each infectious individual to find people to infect, which would lead to exponential growth at the early stages of the outbreak. The Gompertz Function normally applies to conditions when the growth is constrained by some global resource. For example, bacteria growing with a limited food supply or a fire in an enclosure where oxygen is limited.

What is limited for coronavirus? First clues came from the large number of invisible cases indicated by the early serological studies by our Stanford colleagues (Bendavid 2020). More recently, a paper in Science (Silverman 2020) showed that millions of people were infected in the USA before there were known cases. The existence of invisible cases of individuals who are mildly symptomatic and, therefore, not counted as confirmed cases may explain the non-exponential behavior of COVID-19: the known cases cannot easily find people to infect as the hidden invisible cases have already infected them. We realize that other factors may limit growth. For example, the structure of the human interaction network can lead to sub-exponential growth (Moreno 2002). Still, we believe that as SARS-CoV-2 is so infectious, it does not have a problem finding people to infect early on due to the local network structure.

Initial sub-exponential growth is not a unique feature of COVID-19, but has been observed in previous viral outbreaks and needs to be taken into account to produce accurate predictions (Chowell 2016). Our method provides a quick way to analyze early epidemic data and identify and also quantify sub-exponential growth in terms of the time constant *U*.

### Clean and Curate Data Carefully

An essential step for our study has been to clean and curate the data made available from so many different countries. Had we not filled in missing value or spread large changes back in time, the sensitive methods we use would fail. Of course, we need to document every step we take so as not to manipulate data in some arbitrary way. In taking this approach we were aided by the fact that we started the project very early on when there were just 24 data points: six days of cases and deaths in two regions of China (Levitt 2020c).

Another consequence of being so intimately connected with the data is that we had to collect data manually until the various repositories became established. We are now quite certain that the quality of data is more or less the same from all sources. The question of data reliability is often raised and we believe that the data has to obey so many rules of self-consistency that cheating would be almost impossible. For example, in Fig. 2, we see that the raw data from China, non- Hubei. which was available in late January is essentially indistinguishable from the data released for New Zealand two months later.

### Sanity Tests To Prove We Do Not Inadvertently Cheat

In a study like this involving a huge body of data, computer programs written quickly and the intense pressure to get results out while they can still be useful, one needs to be very self-critical at every stage looking for computer bugs that could explain any good results that one finds. Specifically, we are trying to test our forecasting method by going back in time and trying to predict something that was not known then but is know now. Such a process, often called ‘postdiction’ in contrast to ‘prediction’, is extremely dangerous. We guard against it by running calculations with data sets that have been specially prepared to eliminate all data after a certain previous date. This is tricky in that one cannot use smoothed data as smoothing looks into the future to smooth the present. In this work we made a series of data sets going back into the past and showed that the results from a past date would have been obtained with a data set that did not include data after that date.

### Work in progress

We have been studying COVID for five months and worked on all aspects of the analysis.

Some of the related projects that we are working on include:

Predict the future time-course of the epidemic and not just the plateau value

*N*. This will involve better understanding of the two other parameters of the best line fit,*U*&*T*.In what ways are the detailed trajectories from various locations different? What affects the trajectory in terms of

*N*and*U*: population size, population age/health, physical size of location, social distancing or lockdown measures?What is the burnout saturation value of

*N*? What is the population fatality ratio if the infection runs its course?

## CONCLUSIONS

This manuscript is being submitted as a preprint, which is something that we have never done before. We do this for two reasons. One is to make our discoveries available to all at a stage where they will still be useful. Another is to solicit broad criticism and comments that are essential to the scientific process.

## ONLINE RESOURCES USED OR CITED IN THE TEXT

(Rajkumar) https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge

(JHCS) https://gisanddata.maps.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6

(JOBTUBE) https://jobtube.cn/wv/?from=groupmessage&isappinstalled=0

(JHU) https://s3-us-west-1.amazonaws.com/starschema.covid/JHU_COVID-19.csv

(Starschema) https://starschema.com/covid-19-data-set

(Ita-regioni)https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-regioni/dpc-covid19-ita-regioni.csv

(Levitt-Twitter) https://twitter.com/MLevitt_NP2013/status/1256511516863586304?s=20

## LEGENDS FOR SUPPLEMENTARY FIGURES

## Supplementary Methods

Proof that any growth function has a *Y(t)* function that tend to be linear for large *t*.

## ACKNOWLEDGMENTS

This analysis started as a team effort based on our 14-Mar-20 analysis (Levitt 2020b); besides the current authors, it involved Patrick Tam (Hong Kong, see www.covibes.org), Frédéric Poitevin, João Rodrigues and Fatima Pardo-Avila (all at Stanford). We also had discussions and shared ideas with Eran Bendavid (Stanford), Cathrine Bergh (Royal Institute of Technology, Sweden) and Siri Camee van Keulen (Utrecht University, The Netherlands). We are most grateful to them. We offer special thanks to João Rodrigues, for his automatic daily data updates and invaluable advice on writing Python apps.

This work was supported by a US National Institutes of Health award R35GM122543 to M.L. and by a National Natural Science Foundation of China Grant No. 31770776 to F.Z. Michael Levitt is the Robert W. and Vivian K. Cahill Professor of Cancer Research.