Epidemiological characteristics of novel coronavirus infection: A statistical analysis of publicly available case data

Following the first report of coronavirus disease 2019 (COVID-19) in Sapporo City, Hokkaido Prefecture, Japan on 14 February 2020, a surge of cases was observed in Hokkaido during February and March. As of 6 March, 90 cases were diagnosed in Hokkaido. Unfortunately, many infected persons may not have been recognized as cases due to having mild or no symptoms. We therefore estimated the actual number of COVID-19 cases in (i) Hokkaido Prefecture and (ii) Sapporo City using data on cases diagnosed outside these areas. The estimated cumulative incidence in Hokkaido as of 27 February was 2297 cases (95% confidence interval [CI]: 382, 7091) based on data on travelers outbound from Hokkaido. The cumulative incidence in Sapporo City as of 28 February was estimated at 2233 cases (95% CI: 0, 4893) based on the count of confirmed cases within Hokkaido. Both approaches resulted in similar estimates, indicating higher incidence of infections in Hokkaido than were detected by the surveillance system. This quantification of the gap between detected and estimated cases can help inform public health response as it provides insight into the possible scope of undetected transmission.


Introduction
In December 2019 a cluster of 41 patients with atypical pneumonia of unknown etiology was reported in Wuhan City, China [1,2]. The number of atypical pneumonia cases then rapidly increased in Wuhan in early January and cases began appearing across China and in other countries [3,4]. The cause of the pneumonia was recognized as a newly emerged coronavirus, called severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), and the disease it causes was named coronavirus disease 2019 . Cases of COVID-19 have now been reported in more than 120 countries worldwide [5], and the WHO declared it a pandemic on 11 March 2020. In the present paper, we examine the recent COVID-19 hotspot of Hokkaido Prefecture, Japan, and offer two approaches for estimation of actual incidence in the prefecture.
The first imported case of COVID-19 infection in Hokkaido was confirmed on 28 January 2020. The second case was reported two weeks later on 14 February 2020 in a male Sapporo (the capital of Hokkaido) resident in his 50s. At present, there is no evidence that the first imported case was linked to any subsequent cases, indicating either a possible hidden chain of transmissions that led 2 Methods

Epidemiological data
Three sources of data were leveraged to estimate the actual incidence of COVID-19 in Hokkaido. First, cases exported domestically and internationally from Hokkaido were identified from government reports. The domestically exported cases were Kumamoto (1 case), Nagano (1 case) and Chiba (1 case) prefecture residents [26][27][28]. The internationally exported cases were reported from Malaysia (1 case) and Thailand (2 cases) [29,30]. Considering that the two Thailand cases are husband and wife and one case is asymptomatic, one infection may have been the result of household transmission as similarly described in [31] rather than exported from Japan. Second, the total volume of air passengers from all airports in Hokkaido to Thailand and Malaysia over the two-month period of January and February 2019 were retrieved from the International Air Transport Association (IATA), and includes 9,349 passengers to Malaysia and 45,137 to Thailand. Lastly, the population size of Hokkaido was obtained from the Hokkaido Prefectural Government [32].
Calculation of COVID-19 incidence in Sapporo was based on cases of non-Sapporo residents who (i) became infected in Sapporo and were diagnosed outside of Sapporo or (ii) had no information on their source of infection (i.e. unknown link). Cases who were infected outside of Sapporo by Sapporo residents were excluded from the analysis, as they did not represent true exportation of infection from Sapporo by a local case. For example, in one instance a Sapporo resident travelled to Kitami City in Okhotsk subprefecture and caused secondary infections among Kitami residents. These Kitami residents were excluded from our analysis as they were infected outside of Sapporo. However, if a case from Kitami was to travel to Sapporo, have likely been infected there, and then be subsequently diagnosed in Kitami, they would eligible for our analysis. Data on cases reported through the end of February 2020, including dates of illness onset, subprefecture of diagnosis, travel history to Sapporo, and epidemiologic linkage to other confirmed cases were retrieved from the websites of government entities in Hokkaido.
The estimate of the fraction of commuters between subprefectures was obtained by exploring survey data of daytime and a nighttime population sizes in Sapporo [33]. By subtracting the daytime population size from the nighttime population size and dividing the difference by the nighttime population, we obtained a value of 3.7% [33], reflecting the change in population size of Sapporo due to commuting. However, this estimate accounts for movement of individuals both within Ishikari subprefecture and between subprefectures. In contrast, our model requires the average estimate between all subprefectures of Hokkaido. Therefore, we set this parameter at 1% and performed a sensitivity analysis wherein we varied the fraction of commuters between 0.5% and 4%.

Reporting delay between illness onset and confirmation
The time interval from illness onset S i to report of case confirmation R i was extracted for each confirmed case i. The likelihood function was doubly interval-censored [16,34]: with h(·) as the probability distribution function (PDF) of illness onset time following a uniform distribution, and f (·) as the PDF of the reporting delay independent of h(·). We fit the distribution f (·) with . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted April 30, 2020. . https://doi.org/10.1101/2020.04.24.20077800 doi: medRxiv preprint parameters θ to gamma, lognormal, and Weibull distributions, and we selected the best-fit distribution by comparing widely acceptable information criterion (WAIC) values.
An alternative way is to implement the right truncated likelihood: where t * is the cut-off time at noon of 29 February 2020, F (·) is the cumulative distribution function of the distribution f (·). However, our analysis has shown that the mean delay was longer than two weeks and this was inconsistent with real observations. The reason is that the right truncation does not account for the effect of control measures implemented in the late February (see Discussion).

Estimating incidence in Hokkaido by using imported cases among travelers outbound from Hokkaido
We implemented a balance equation to predict incidence across all Hokkaido using the case count of air travelers from Hokkaido with international or domestic destinations [3,[35][36][37]. Given the observed cumulative count of exported cases C we write our balance equation as follows: where m is total volume of passengers to the corresponding destination, T is the infectious period, approximated from the observed virus shedding period, 5.0 days [38]. The estimated incidence in Hokkaido is given bypn, where n is the catchment population size for the international airport in Hokkaido. The cumulative incidence was estimated using maximum likelihood and the 95% confidence intervals were calculated using profile likelihood using data generated following the binomial sampling process among travelers from Hokkaido.

Estimating the incidence in Sapporo cases imported by other areas of Hokkaido
Let C j be the cumulative number of cases who had no link to Sapporo or any other local cluster (see case classification in the Method section) and were detected in subprefecture j. Following a linear regression model with negative binomial link [39] we obtain: where β is the regression coefficient, k is the overdispersion parameter, and M •j is the predicted commuter flow between Sapporo and the various subprefectures of Hokkaido. Because Sapporo is a part of Ishikari subprefecture, we define Ishikari subprefecture as the area that excludes Sapporo ( Figure 1B). The flow M •j was predicted using the uniform selection and radiation models of human movement [40,41], as described below. The final estimate for the cumulative incidence in Sapporo was obtained by extending (4) to the entire population of Sapporo as follows: where N • is the population size of Sapporo.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

Modeling commuter movement between subprefectures of Hokkaido
Two single-parameter models of human movement were employed to predict the rate of commuting between Sapporo and the various subprefectures of Hokkaido-a radiation model and a model of uniform selection. Both emphasize the attractiveness of large population centers [23,24,42]. The radiation model incorporates the distance metric between the origin and destination while accounting for their population sizes. The uniform selection model [41] predicts the commuting volume using the formula: where P i is the population size at the origin i, Q j is the population size at the destination j, N is the total population of Hokkaido. The only parameter M i defines the scaling parameter of the ongoing number of commuters from the source subprefecture and equals to the proportion of commuters multiplied on the population size P i [42]. The radiation model [40] defines a commuting volume similarly to (6): where R ij denotes the total population within a radius around two centers P i and Q j . Both matrices M ij (6)- (7) were forced to be symmetric in our simulations. An aggregated dataset of population sizes and central point coordinates for each subprefecture was used to estimate the commuting rates between Sapporo and the subprefectures of Hokkaido while leveraging the R package "movement" [25]. The package required a fraction of the commuting individuals to be a single input parameter for the function. Due uncertainty of this value, we performed a sensitivity analysis with reference to the daytime and nighttime populations of Sapporo [33].
The derived estimates were then used to validate our assumption of linearity between the number of reported COVID-19 infections in each subprefecture and the daily commuting rate [39]. The number of infections C i in subprefecture i is described by a negative binomial distribution with mean λ i linearly proportional to the travel volume x i : λ i = β ·x i , and a dispersion parameter k. In this case, the variance of the distribution takes the form: λ i + λ 2 i /k. The regression parameter β is regarded as independent of the region. Using the fit of the predicted travel volume and observed case counts, we selected the best-fit model between (6)-(7) based on the difference in WAIC values and with lower value regarded as better.
The data were processed using R version 3.6.2 and Python version 3.6.10. The Markov chain Monte Carlo (MCMC) simulations were performed in Stan (cmdStan version 2.22.1 [43]) for estimation of the delay distribution, and in PyMC3 version 3.8 [44] for all other estimates. All code is freely available at the github repository: http://github.com/aakhmetz/Covid19IncidenceHokkaidoFeb2020.

Epidemiological situation
The first case was reported in Hokkaido on 28 January 2020. A month later on 28 February 2020, the total count of confirmed cases reached 65, with 13 cases reported by Sapporo, and another 51 case reported by 11 of the 14 subprefectures of Hokkaido. The place of diagnosis of the first case was unspecified.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted April 30, 2020. Although the initial cluster of cases was linked to the Sapporo Snow Festival, the overall geographic distribution of COVID-19 cases appeared to be widespread. By 28 February Japan had also been notified of three cases diagnosed in Thailand and Malaysia who were believed to have been infected in Hokkaido, as well as three cases domestic cases with exposures in Hokkaido that were reported by other prefectures (Table 1, Figure 1A). A surge of cases was observed at the end of February (Figure 2A). In the first half of February, peaks of reported cases by date of illness onset were separated by 3-4 days on average, consistent with estimates of the serial interval [45]. The epidemic curve peaked around 18 February 2020, but the decline in cases which was seen later can be explained by the delay in reporting. The delay distribution fitted with the gamma distribution had the mean of 7.9 days (95% CI: 6.9, 9.0) and standard deviation of 4.2 days (95% CI: 3.3, 5.2). The 95th percentile was 15.6 days (95% CI: 13.4, 18.8) implying that cases with illness onset in the last two weeks of February were likely to be underascertained.

Estimated incidence in Hokkaido using confirmed cases diagnosed outside Hokkaido
The first three cases diagnosed in early February outside of Hokkaido indicated a low incidence of COVID-19 in Hokkaido with an estimated upper bound (95th percentile) of fewer than 100 cases. Our estimate of the cumulative incidence in Hokkaido as of 25 February 2020 is dependent on whether the transmission within the infected couple who travelled to Thailand was acquired in the community or within the household, and we estimated incidence in Hokkaido to be higher in the former scenario, at 3446 cases (95% CI: 857, 8931), compared to the latter, with 2297 cases (95% CI: 382, 7091), see Figure 3A.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted April 30, 2020. . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted April 30, 2020. . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

Estimated incidence in Sapporo using confirmed cases diagnosed within Hokkaido
A total of 34 Hokkaido cases were included in our model ( Figure 1B). We fit the radiation and uniform selection models to the population data for each subprefecture based on their centroids (Supplementary figure 1). The resulting fit to the observed incidence of COVID-19 cases showed that the uniform selection model performed better than the radiation model (WAIC values were 60.2 vs 68.3, respectively). Figure 3B shows the fit using the model with uniform selection. If we assume the fraction of commuters to be at 1%, the mean estimated incidence in Sapporo was 2233 cases (95% CI: 0, 4893) as of 28 February 2020. Variation in the fraction of commuters between 0.5% and 4% results in estimated incidence between 4440 (95% CI: 0, 9687) and 563 (95% CI: 0, 1221) for the 0.5% and 4%, respectively ( Figure 3C).

Discussion
In our analysis of the reporting delay, we did not implement right truncation of the likelihood because the low case count (by date of illness onset) seen in late February through the beginning of March was likely due to the effect of interventions in Hokkaido rather than delays in reporting. In support of this assumption, there were fewer cases reported between 28 February and 6 March compared to the week of 21-27 February (cf. Figure 2 and Supplementary figure 2). Right truncation should therefore be used with caution when it is unclear whether recent incidence is increasing or decreasing. Although we tracked the spread of the disease in Hokkaido and we were able to distinguish between locally acquired and travel associated infections, the surveillance systems in the subprefectures of Hokkaido may not be sensitive enough to detect cases with milder symptoms. This could explain why we obtained a lower estimate of the incidence in Sapporo using cases diagnosed within Hokkaido compared to analyzing the outbound flow of travelers from Hokkaido. Use of data on travel volume of commuters and tourists within Hokkaido could be beneficial for fitting the travel volume to the observed incidence rather than adopting a model of human movement ( Figure 3B). However, publicly available data on train, bus, and private car travel in Hokkaido were inconsistent and scarce, and we were unable to use them for this study. Nevertheless, our results demonstrate that the use of mathematical models of human movement at the scale of estimating the travel volume between administrative subunits of a given region is a promising alternative to use of observed transportation data. Previously, this has been also demonstrated by other researchers in assessing the risk of spread of yellow fever in Angola [23] or during epidemics in resource poor settings [24].
In addition, surveillance among travelers at international borders may detect infections not found through standard surveillance systems due to implementation of additional detection methods [46,47]. As shown by other researchers [39,48] the linear regression fit of the travel volume among international travelers to observed disease incidence represents an elegant way to distinguish areas that are able to well detect new infections from those that are likely to miss some fraction of cases. We applied the same approach to compare the performance of the various subprefectures of Hokkaido, as shown in Figure 3B. Subprefectures that fall within the 95% CI of the linear regression line are considered to have surveillance systems that succeed in detecting COVID-19 infections within their jurisdiction. Subprefectures that fall below the 95% CI of the linear regression line may not have surveillance systems that are sensitive enough to detect all imported cases [39]. The four subprefectures with no reported cases are Rumoi, . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted April 30, 2020. . https://doi.org/10.1101/2020.04. 24.20077800 doi: medRxiv preprint Soya, Nemuro, and Shirebeshi. The first three are distant and less connected to Sapporo compared to, for example, Kamikawa or Oshima, which contain the second and third largest cities of Hokkaido, Asahikawa and Hakodate ( Figure 1B). However, Shirebeshi is located next to Ishikari/Sapporo and includes the relatively large city of Otaru, though the lack of reports in Shiribeshi may be explained by our use of place of diagnosis over place of residence for determining the subprefecture assigned to cases. This was done because the specific place of residence was not always reported. We suspect that some cases from Shirebeshi subprefecture were diagnosed in Ishikari subprefecture.
Our estimate of the incidence in Sapporo is in the range of 1,000 to 10,000 cases, which is similar to early estimates of the incidence in Wuhan City, China that used data on the first cases among international travelers [3,49]. Assuming our estimates are correct, a substantial proportion of these infected persons likely had only a mild or asymptomatic course of disease [9,17] and therefore did not seek medical care and were not detected by the surveillance system.
Preliminary analysis of the offspring distribution among infected individuals in the detected clusters around Japan has shown that approximately 80% of cases do not produce secondary infections (i.e., their reproduction number is equal to zero) [50]. Because most of the clusters have been linked to the known cases and could be traced back in time, we argue that there is still a window of opportunity for containment of the disease and only some small fraction of the infections may drive the epidemic [14]. We agree with other studies, e.g. [51,52], that a successful strategy for control of COVID-19 lays in strict movement restriction, avoiding social gatherings, and intense investigation effort in contact tracing.

Conflicts of Interest:
The authors declare no conflicts of interest.

Supplementary figure 1. Reconstructed connectivity matrix between subprefectures of
Hokkaido. Ishikari subprefecture was separated into two subregions: within Sapporo and outside of Sapporo. The grey-shaded heatmap indicates the relative intensity of travel volume. The resulting connectivity matrix was forced to be symmetric.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted April 30, 2020. . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted April 30, 2020. . https://doi.org/10.1101/2020.04.24.20077800 doi: medRxiv preprint