Potential dissemination of epidemics based on Brazilian mobile geolocation data. Part I: Population dynamics and future spreading of infection in the states of São Paulo and Rio de Janeiro during the pandemic of COVID-19
=============================================================================

* Pedro S. Peixoto
* Diego Marcondes
* Cláudia Peixoto
* Lucas Queiroz
* Rafael Gouveia
* Afonso Delgado
* Sérgio M. Oliva

## Abstract

Mobile geolocation data is a valuable asset in the assessment of the movement patterns of a population. Once a highly contagious disease takes hold in a location, the movement patterns aid in predicting the potential spatial spreading of the disease, hence mobile data becomes a crucial tool for epidemic models. In this work, based on millions of anonymized mobile visits in Brazil, we investigate the most probable spreading patterns of COVID-19 within states of Brazil. The study is intended to help public administrators with action plans and resource allocation, whilst studying how mobile geolocation data may be employed as a measure of population mobility during an epidemic. The first part of the study focuses on the states of São Paulo and Rio de Janeiro during March 2020, when the disease first started to spread in these states. Metapopulation models for the disease spread were simulated in order to evaluate the risk of infection of each city within the states, by ranking them according to the time the disease will take to infect each city. We observed that, although the high risk regions are those closer to the capital cities, where the outbreak started, there are also cities in the countryside with great risk.

Keywords

* covid19
* SARS-CoV2
* epidemics
* pandemic
* mobile geolocation data
* population dynamics
* metapopulation models

## 1. Introduction

COVID-19, caused by the coronavirus SARS-CoV-2, has spread quickly after its first reported cases in Wuhan, China, in December 2019, posing a serious threat to health systems and the world economy [9]. Since March 2020, when the disease was classified by WHO as a pandemic [12], countries around the world have followed protocols implemented months before in Asia, enforcing a variety of interventions, from mild to radical ones, based on social distancing, isolation and quarantine, to slow the disease spread, as recommended by WHO [10]. It is widely accepted that the pandemic should be fought on two fronts: by saving lives while avoiding the collapse of health systems, and by protecting the population, especially its most vulnerable part, from the economic impacts of the pandemic [12]. For either goal to be achieved, health officials and government authorities should have reliable information about the disease spreading and its economic and social impacts. Hence the modelling of such spreading is not only a scientific achievement, but also a source of crucial strategic information. Indeed, a way of reducing the damage caused by the pandemic is to model how the disease will spread, in order to properly assign the available resources to the locations where they will be needed the most.

Another piece of strategic information governments need is the efficacy of the interventions enforced to slow the disease spread. Initial reports have shown the efficacy of these interventions, but we still lack reliable data, especially in Brazil, and, even when the data is available, we need to sort out misleading information [3].

Among the several challenges in addressing this pandemic, detecting the spatial spread of the disease within a region is one of the top priorities. An early warning can give government authorities time to prepare the health system in a location to endure the increase in the number of people in need of medical care.
One way to overcome this challenge is to monitor human mobility in order to detect patterns from which to predict future foci of infection, either to assess the efficacy of implemented policies to avoid transmission, or to drive policies with the goal of avoiding transmission to certain locations. This monitoring, especially using mobile phone data, has been noted to be an efficient way to follow public mobility. In recent work, [3] has indicated the efficacy of the intervention in China, correlating mobile data with reported cases. In another report, mobile data has shown the effect of government-enforced measures in São Paulo, Brazil, in reducing social contact [5]. It is worth noting that, for large scale movements, other measures beyond mobile phone data have been successfully used to foresee the spread of the disease in Brazil [4].

As mentioned by Brockman [6,8], while the time evolution of epidemics is frequently modeled in the literature by differential equations or time series [1,7], the modelling depends mostly on the scale used. For large scales, such as big countries, continents and the whole world, available airport data is enough to give reliable predictions. As mentioned, [4] has some interesting results for the dissemination of COVID-19 in Brazil based on the airport network. But once the epidemic reaches a primary local region, it is relevant to anticipate how the dissemination will take place locally, so local transit and regional road movement play an important role in the modelling, and mobile data provide a reliable characterization of such movements.

In the first part of this study we rely on mobile data to assess the movement pattern between cities within the states of São Paulo and Rio de Janeiro in Brazil, before and during the COVID-19 pandemic, in order to identify future foci of infection within the states.
We concentrate on these states as they were the first ones in Brazil to have a significant number of confirmed cases and local transmission.

To model the mobility via mobile data we have established a fruitful collaboration with the Brazilian company In Loco ([https://inloco.com.br/](https://inloco.com.br/)). In Loco provides software engineering services to mobile phone applications and has a database with more than 60 million devices. The anonymized data provided by them contain the physical location where billions of visits to selected apps have occurred. Although no civil information, such as name or social security number, is collected, in deference to users' privacy concerns, In Loco can detect, through anonymous tracking, the most likely locations of the devices across the country and the movement between them.

In this report, we measure the mobility on each day of March 2020 between the cities within each state, seeking to identify the most common mobility patterns in order to predict possible future foci of infection. We consider the movement in March 2020 as these were the days which followed the first infections in Brazil, when isolation measures were implemented. To predict these foci, we analyse the raw mobility data and simulate spatial models of disease spread to predict the locations where the disease is more likely to spread first.

This study seeks not only to support public discussions about the allocation of resources and the enforcement of isolation measures, but also to be the basis of a follow-up study, addressing population dynamics together with available public health data, providing risk assessments and forecasts.

## 2. Methods

### 2.1 Dataset

The In Loco company provided anonymized data containing the geolocation of millions of users of their software development kit (SDK), which is present in many popular mobile apps.
For this part of the study, we only analyse data referring to the states of São Paulo and Rio de Janeiro, although data from other states are also collected by the company.

The available dataset contains, from the 1st to the 30th of March 2019 and March 2020, recordings of pairs of positions, referring to the locations of an initial and a second app use by the same device. Each position is calculated based on the location where an app with In Loco's system was used and on information collected in the background while the app was not running, which aids in the collection of data when the app is in use. Positions are measured in geographical coordinates with a precision of 0.01 degrees in each coordinate. The first position refers to a use of an app by a device on a given day, while the second position refers to where a subsequent use occurred, when this location is different from the first one. Hence, only movements between different locations are represented, since users who used an app multiple times in a day within the same location are not present in the dataset. Furthermore, we excluded all pairs in which the second use occurred more than 24 hours after the first one, so all movements occurred within the period of a day. Observe that each device may appear more than once if the apps are used multiple times in the same day at different locations, although, because of anonymization, we do not know how many times a device appears, and hence cannot follow it for more than two consecutive uses. Therefore, we have two-point movement data in space-time for millions of devices on each day, representing a rich sample of daily population dynamics.

We will focus on mobility data from March 2019, as a reference, and March 2020, as a measure of mobility patterns during the pandemic. For São Paulo we have on average 3.6 million daily position recordings in March 2019 and 4.3 million in March 2020.
For Rio de Janeiro we have on average 1 million daily recordings in March 2019 and 0.8 million in March 2020. Just as a reference, São Paulo state has a population of approximately 45 million people and Rio de Janeiro state has approximately 17 million people. In Table 1 we present descriptive statistics of the daily number of recordings for weekdays and weekends for both years. We note that the daily uses decrease on weekends, which is evidence of less mobility between locations.

[Table 1:](http://medrxiv.org/content/early/2020/04/11/2020.04.07.20056739/T1) Descriptive statistics of the daily number of recordings in March 2019 and 2020 for each state on weekends and weekdays. SD = Standard Deviation.

Figure 1 shows the daily number of recordings in both states in 2019 and 2020. On the one hand, in 2019 we see a steady pattern of recordings in both states, which approximately repeats itself every week. On the other hand, in 2020, there is a clear decline in the number of recordings starting on the 15th, especially in Rio de Janeiro. This decline coincides with the implementation of stronger isolation measures enforced in the second half of March. Indeed, in Figure 2 we see a large decrease in the number of recordings in the second half of March (starting on the 15th), on both weekends and weekdays, as the boxes, which illustrate the statistics in Table 1, are below the respective boxes for the first half of March. As the control group (March 2019) behaves approximately the same in the first and second halves, we have evidence that the isolation measures implemented decreased the number of recordings. Now, since the dataset contains only recordings of movement, the number of recordings is, by itself, an intrinsic measure of population isolation/quarantine. Hence its decline is evidence of the efficacy of isolation measures, although this efficacy needs to be studied in more detail using more suitable data.
![Figure 1:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/04/11/2020.04.07.20056739/F1.medium.gif)

[Figure 1:](http://medrxiv.org/content/early/2020/04/11/2020.04.07.20056739/F1) Total number of recordings for each day of March in São Paulo (SP) and Rio de Janeiro (RJ), in 2019 and 2020.

![Figure 2:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/04/11/2020.04.07.20056739/F2.medium.gif)

[Figure 2:](http://medrxiv.org/content/early/2020/04/11/2020.04.07.20056739/F2) Box-plot of the total number of recordings in the first (from the 1st to the 14th) and second (from the 15th to the 30th) half of March, on weekdays and weekends, in 2019 and 2020 for both states.

Figure 3 shows the number of recordings in each location on a usual day of March. There is a clear pattern in the distribution of these locations, which are concentrated within cities and along roadways, in both states. Furthermore, the majority of uses occurred in the surroundings of the states' capital cities, in their metropolitan regions. The pattern of these locations shows that this data is a good proxy for population mobility, as it represents movement either within cities or between them, via roadways. This distribution of locations is good evidence in support of using mobile data to assess regional mobility.

![Figure 3:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/04/11/2020.04.07.20056739/F3.medium.gif)

[Figure 3:](http://medrxiv.org/content/early/2020/04/11/2020.04.07.20056739/F3) Typical distribution of the location of app usage in one day for the states of São Paulo (left) and Rio de Janeiro (right), considering a resolution of 0.01 degree on each geographical coordinate. This data refers to March 1st 2020 and the color represents the number of uses, first or subsequent, in each location.
### 2.2 Movement dynamics

In order to study mobility patterns between cities we group the recordings by city, i.e., each position is mapped from its geographical coordinates to the city containing it, generating a sample with pairs of initial city and subsequent city, according to the movement given by the geolocation. If the two positions are within the same city, we consider that there has been no movement, as *movement* here is taken as *movement between cities*. Proceeding in this manner, we divided São Paulo into 645 regions and Rio de Janeiro into 92 regions, given by their cities. Although we chose to divide the states by cities, we could have chosen another division, with more or less resolution, considering for example microregions (formed by cities) or subdistricts (which form cities), in order to study the dynamics at larger or smaller scales.

From the generated sample of movements between regions, we can compute the proportion of movements from a region A to each region, in a given period of time. In this study, we always consider the period of time to be a day. The proportion of movement from region A to a region B is given simply by the number of recordings which departed from A on the given day and were in B within 24 hours, divided by the total number of recordings which started in A on the given day. Likewise, the proportion of *no-movement* of A in a day is given by the number of recordings which started in A on this day and were still in A at the second use, less than 24 hours later, divided by the same total. These proportions are organized in a transition matrix, in which the entry at column A and row B is the proportion of movement from A to B. For each considered day in March 2019 and 2020, we have a transition matrix containing the movement between regions on this day.
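The construction of a daily transition matrix can be illustrated with a minimal sketch. The function name `transition_matrix` and the input format (one pair of integer region indices per recording) are assumptions made for this example and are not part of the original data pipeline.

```python
import numpy as np

def transition_matrix(pairs, n_regions):
    """Build a daily transition matrix from (origin, destination) region pairs.

    Entry [b, a] is the share of recordings that started in region a
    and had their second use in region b (column a, row b).
    """
    counts = np.zeros((n_regions, n_regions))
    for origin, dest in pairs:
        counts[dest, origin] += 1
    departures = counts.sum(axis=0)  # recordings starting in each region
    # Regions without any recording keep an all-zero column (no division by zero).
    return np.divide(counts, departures,
                     out=np.zeros_like(counts), where=departures > 0)

# Hypothetical toy day with three regions (indices 0, 1, 2).
pairs = [(0, 0), (0, 1), (0, 1), (1, 0)]
P = transition_matrix(pairs, n_regions=3)
```

Each column with at least one recording sums to one, so a column can be read as the distribution of destinations for recordings that started in that region.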
As we lack information about consecutive uses which occurred in the same location, the proportion of no-movement of a region is underestimated: it only accounts for movement within the city, disregarding devices which have not moved at all within a day. This causes the proportion of movement from A to other regions to be overestimated. We observe that as many as 45% of the movements starting in a region end outside it, which is an unrealistic estimate of the proportion of people who leave a region. A more realistic number is no more than 5%, which we believe would be obtained if we had the number of recordings in which the uses occurred in the same location. This overestimation will be corrected in the models simulating the disease spread, but when analysing raw data we apply no correction, as we are only interested in determining common movements, and are not interested in *how* common they are.

This proportion is not a consistent estimator, in a statistical sense, of the proportion of a population which travels from one region to another within 24 hours, as the same device may be recorded twice in the period of a day. Even so, it is a good proxy for the mobility between two regions, as each recording represents a real person who traveled from one region to another within 24 hours, or stayed in the same region. As the data is anonymized and each device is followed for only two uses of the app, we do not actually know whether the movement is that of a person who is returning to a location or going there for the first time, for example.
However, this proportion gives a good idea of possible patterns followed by the population in general: if a pattern is recurrent in the population, it is likely to appear in our dataset too, although the proportion of movements in the population may differ from the one we calculated. That is, we may be able to identify common patterns of mobility, even though we cannot properly estimate, in a statistical sense, the proportion of the population which leaves one region and goes to another in the period of a day.

In order to assess the mobility patterns in the weeks following the first cases of COVID-19 in Brazil in March 2020, we always take the mobility in March 2019 as a control group. Indeed, we need a measure of the usual mobility between the regions to compare with the observed mobility, to know if it is within the usual pattern. For this purpose, we disregard the first days of March 2019, as the mobility was influenced by a major Brazilian holiday, the carnival week, so we observe the pattern of mobility in March 2019 starting on the 11th.

On the one hand, the mobility in March 2020 is measured daily, by the proportion of movement from one region to another, i.e., by the daily transition matrices. On the other hand, the mobility in March 2019 is measured per day of the week: for each day of the week we calculate the mean of the proportions over all considered days of March 2019 which fell on it. Proceeding in this way, we have one transition matrix for each day of March 2020, and seven transition matrices related to the mean pattern of movement of each day of the week in March 2019. Each day of March 2020 is compared with the pattern of the day of the week it fell on.

The analysis of this study concentrates on an important feature of the pandemic spread, namely, possible foci of future infections. We now discuss how they can be evaluated from the available mobile data.
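The weekday-based control pattern described in Section 2.2 can be sketched as follows. This is only an illustration: the names `weekday_means` and `reference_for`, and the choice of a dictionary keyed by `datetime.date`, are assumptions of the example.

```python
import datetime as dt
import numpy as np

def weekday_means(daily_2019, first_day=11):
    """Average the March 2019 transition matrices by day of the week.

    daily_2019 maps datetime.date -> transition matrix. Days before
    `first_day` are skipped (carnival week). Keys of the result follow
    date.weekday(): Monday is 0 and Sunday is 6.
    """
    by_weekday = {}
    for day, matrix in daily_2019.items():
        if day.day >= first_day:
            by_weekday.setdefault(day.weekday(), []).append(matrix)
    return {wd: np.mean(ms, axis=0) for wd, ms in by_weekday.items()}

def reference_for(day_2020, means):
    """Control matrix for a day of March 2020: the mean for its weekday."""
    return means[day_2020.weekday()]

# Toy input: 1x1 "matrices" whose single entry equals the day of the month.
daily_2019 = {dt.date(2019, 3, d): np.full((1, 1), float(d)) for d in range(1, 31)}
means = weekday_means(daily_2019)
monday_reference = reference_for(dt.date(2020, 3, 2), means)  # a Monday
```

In the toy input the Mondays kept are the 11th, 18th and 25th, so the Monday reference is their mean.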
### 2.3 Possible focus of infection

The COVID-19 outbreak in the states of São Paulo and Rio de Janeiro started in their capital cities at the end of February 2020, and spread to other cities in the metropolitan regions and the countryside. However, many regions are yet to suffer from the pandemic, so pointing out possible foci of future infections provides strategic information for public authorities to act on in order to avoid the spread of the disease. These foci may be identified by studying the pattern of movement from the infected regions (capital cities) to the countryside, looking for common movement patterns. Observe that the geographical distance between cities is not enough to determine these foci, as there are other factors which drive mobility within the states, especially of an economic nature, which make movement to more developed cities far away more likely.

The analysis is focused on the movement patterns starting in the capital cities and is performed with the aid of maps, in which each region is painted according to the proportion of the movements from the capital city which ended in that region. The daily patterns of these movements in March 2020 provide insights about possible paths infected people may have taken, spreading the disease to other regions. Also, we study the most frequent movements from the capitals in the days of March 2020 and March 2019, seeking to find common movements, and any difference in the patterns, from one year to the next.

### 2.4 A model for the spatial spreading of the disease

The main focus of this study is to explore the mobility dynamics within a state in order to give authorities advance notice of the evolution of the disease, so they can be a step ahead and prepare the local health care systems for the upcoming events. Since we do not have reliable data on the recovery time, we decided to use in this first approach an infection model suitable for the initial exponential spread of the disease.
Once we have more reliable data, we can incorporate other nuances of the disease spread and infection to get more adequate models for the next stages of the spread.

So, in order to model the spatial spread of the disease in this early stage, we consider a metapopulation model which describes the evolution of a disease inside a population with two terms, one referring to the spread within the location and another to the spread to and from other locations. The spread within each location is modelled as an SI model, while the spread between locations is based on the mobile data, more specifically on the transition matrices.

In the proposed model, the evolution of *I_i*(*t*), the number of infected in region *i* at time *t*, is modelled as

![Formula][1]

in which *r* is the transmission rate within each region, *s* is a free parameter used to correct the overestimation or underestimation of movement between the locations, *N_i* is the population of region *i* and *w_ji*(*t*) is a measure of the movement from region *j* to region *i* at time *t*, calculated from the transition matrices in the following way. Let ![Graphic][2] be the entry at row *i* and column *j* of the transition matrix of day *n*, indicating the proportion of registered movements from region *j* to *i* on day *n*, and let ![Graphic][3], where ![Graphic][4] is the total number of recordings which departed from *j* on day *n*, resulting in the actual number of recorded movements from region *j* to *i* on this day. We take ![Graphic][5] as an estimate of the number of people who moved from region *j* to region *i* on that day. We consider the measure of mobility from *j* to *i* as

![Formula][6]

in which *n* is such that 24(*n* − 1) ≤ *t* < 24*n*. The time scale we consider is that of an hour, and *t* = 0 is midnight on March 1st 2020.
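As a hedged illustration of the two steps above, the sketch below maps an hour *t* to its day *n* and recovers the mobility measure from a transition matrix. It assumes, following the description given later in this section, that the measure is the number of recorded movements from *j* to *i* divided by the population of the departure region *j*. The function names and the array layout (`weights[i, j]` for movement from *j* to *i*) are choices made for this example only.

```python
import numpy as np

def day_index(t_hours):
    """Day n (1-based) such that 24 * (n - 1) <= t < 24 * n."""
    return int(t_hours // 24) + 1

def mobility_weights(P, departures, population):
    """Mobility measure from region j (column) to region i (row).

    P          : transition matrix of the day (columns are departure regions)
    departures : total recordings departing from each region on that day
    population : population of each region
    Assumption: weight = recorded movements j -> i / population of j.
    """
    moves = P * departures[np.newaxis, :]      # undo the column normalisation
    return moves / population[np.newaxis, :]   # scale by departure population

# Hypothetical two-region example.
P = np.array([[0.5, 1.0],
              [0.5, 0.0]])
weights = mobility_weights(P,
                           departures=np.array([10.0, 4.0]),
                           population=np.array([100.0, 40.0]))
```

Dividing by the population instead of by the number of departures is what keeps devices that never moved from inflating the mobility measure.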
Also, we consider *r* = 0.4, which is approximately *R*/6, in which *R* = 2.68 is the Basic Reproduction Number estimated by [11] from data about the disease spread in Wuhan, China, and 6 days is the mean incubation time of the disease.

Although the SI model is not suitable for long forecasts, since it does not take into account the Recovered and Exposed individuals, we wish to explore it to simulate the spread until the end of April 2020. But, as there is no mobility data available beyond March 30th 2020, we will use the mean transition matrices of the corresponding weekday in March 2020 when simulating the spread in April. At *t* = 0 we start with one single infected case in the state's capital and zero in the other cities, and simulate how the disease spreads spatially within the states.

We use the number of recordings from one region to another divided by the population of the departure region as the measure of mobility between regions because the proportion of transitions may overestimate the mobility, as we do not have data about devices which have not moved within a day. When dividing by the population, we may assume that the number of recordings is actually the number of people moving from one region to another. However, this estimator is biased because, on the one hand, each device may be counted more than once, and, on the other hand, there are people moving between regions without using any app. Therefore, we need to correct the estimate, and that is done by the parameter *s*, which multiplies the proportions. If *s* < 1 then we are correcting a possible overestimation of the movement proportions, while if *s* > 1 then we are correcting a possible underestimation of the proportions. Hence, we will simulate the model for various values of *s* to assess its robustness.

The main interest of the simulations is in determining *t_i*, the least time such that the number of infected in a region *i* attains a threshold *c*, i.e., *I_i*(*t_i*) ≥ *c*.
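A simulation of this kind, and the extraction of the arrival times, can be sketched as below. The exact model equation is not reproduced in the text here, so the sketch assumes one plausible form consistent with the description: exponential growth at rate *r* within each region plus a net mobility exchange scaled by *s*, i.e. dI_i/dt = r I_i + s Σ_{j≠i} (w_ji I_j − w_ij I_i). Time is measured in days with 24 forward-Euler steps per day, *r* is treated as a daily rate, and `weights_for_day` is a hypothetical callable returning the mobility weights of day *n*. All of these are assumptions of the example, not the authors' implementation.

```python
import numpy as np

def simulate_arrival(weights_for_day, n_regions, seed_region, r=0.4, s=1.0,
                     c=1.0, n_days=61, steps_per_day=24):
    """Forward-Euler sketch of an SI-type metapopulation model.

    weights_for_day(n)[i, j] is the mobility measure from region j to i
    on day n (1-based). Returns (arrival, infected): the first time in days
    at which each region reaches c infected (inf if never), and the final
    number of infected per region.
    """
    infected = np.zeros(n_regions)
    infected[seed_region] = 1.0           # one case in the capital at t = 0
    arrival = np.full(n_regions, np.inf)
    arrival[infected >= c] = 0.0
    dt = 1.0 / steps_per_day
    for step in range(n_days * steps_per_day):
        day = step // steps_per_day + 1
        w = np.array(weights_for_day(day), dtype=float)  # copy, then edit
        np.fill_diagonal(w, 0.0)                         # no self-exchange
        inflow = w @ infected                  # sum_j w[i, j] * I_j
        outflow = w.sum(axis=0) * infected     # I_i * sum_j w[j, i]
        infected = infected + dt * (r * infected + s * (inflow - outflow))
        newly = (infected >= c) & np.isinf(arrival)
        arrival[newly] = (step + 1) * dt
    return arrival, infected

# Toy state: region 0 is the capital, strongly linked to 1 and weakly to 2.
w_toy = np.array([[0.0,    0.0, 0.0],
                  [0.01,   0.0, 0.0],
                  [0.0001, 0.0, 0.0]])
arrival_s1, _ = simulate_arrival(lambda n: w_toy, 3, seed_region=0, s=1.0)
arrival_s2, _ = simulate_arrival(lambda n: w_toy, 3, seed_region=0, s=2.0)
```

In this toy setting the strongly connected region is reached before the weakly connected one, and doubling *s* brings the arrival at the weakly connected region forward. The default of 61 days covers March 1st to April 30th.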
From this value, we may rank the regions from the smallest to the greatest time of arrival of the disease, producing evidence about possible foci of future infection. In the simulations we adopt *c* = 1, that is, we assume that the region is at risk when the model predicts at least 1 infected individual in the region. The models are simulated until April 30th 2020.

## 3 Results

### 3.1 Possible focus of infection

Figures 4 and 5 show the proportion of movement from the capital cities on March 1st, 10th, 20th and 30th 2020, and the mean proportions of the respective day of the week in March 2019. We observe that the mobility pattern is similar in both years, although the value of the proportions may differ. As we have seen, the number of recordings decreased in the second half of March 2020 under the influence of isolation measures, but according to Figures 4 and 5 the movement patterns did not change significantly. This means that, among people still moving between cities, the pattern is the same as before isolation. Hence isolation measures seem not to have changed the pattern of movement, at least at the city scale, but only its intensity, as shown by the decrease in the number of recordings.

![Figure 4:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/04/11/2020.04.07.20056739/F4.medium.gif)

[Figure 4:](http://medrxiv.org/content/early/2020/04/11/2020.04.07.20056739/F4) Proportion of movement from São Paulo capital city to each city within the state on March 1st, 10th, 20th and 30th of 2020, alongside the mean proportion of movement of the respective weekday in March 2019.

In Tables 2 and 3 we see the descriptive statistics of the rank of the top 15 cities concentrating the proportion of movement out of the capitals, calculated for all days of March 2019 and 2020.
The rank is the ordering, from lowest to greatest, of the proportion of movement from the capital, so the greater the rank, the more movement there was from the capital to the city. We see that the rank does not vary much among the days of March (small standard deviation), and that the rank in 2019 is close to the rank in 2020, showing again that, even though movement has decreased, the pattern of movement has not changed. These top cities are mainly in the metropolitan region of the capitals, which is evidence that these may be future foci of infection, as indeed they have become, since the disease has spread to the metropolitan region of each state.

[Table 2:](http://medrxiv.org/content/early/2020/04/11/2020.04.07.20056739/T2) Descriptive statistics of the rank of the proportion of movement out of São Paulo capital city in the days of March 2019 and March 2020.

[Table 3:](http://medrxiv.org/content/early/2020/04/11/2020.04.07.20056739/T3) Descriptive statistics of the rank of the proportion of movement out of Rio de Janeiro capital city in the days of March 2019 and March 2020.

### 3.2 Model for spatial disease spreading

Figures 6 to 9 display the simulated number of infected individuals across the states of São Paulo and Rio de Janeiro as time evolves, for selected values of *s*. We see that the effect of *s* is on the time, in number of days, that the disease takes to reach a location, rather than on the evolution of the spread itself. For *s* = 1, we observe in Figures 6 and 8 that the infection spreads from the capital cities to their metropolitan regions and then to selected cities in the countryside, which are geographically far from the capital, especially in the state of São Paulo.
We see in Figures 7 and 9 that, for different values of *s*, the evolution of the disease is the same, but the cities with foci of infection at the end of the simulation, i.e., April 30th, depend on *s*: the greater the value of *s*, the more cities are infected at the end. Also, we can clearly see a non-local diffusion process, as described by [8].

![Figure 5:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/04/11/2020.04.07.20056739/F5.medium.gif)

[Figure 5:](http://medrxiv.org/content/early/2020/04/11/2020.04.07.20056739/F5) Proportion of movement from Rio de Janeiro capital city to each city within the state on March 1st, 10th, 20th and 30th of 2020, alongside the mean proportion of movement of the respective weekday in March 2019.

![Figure 6:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/04/11/2020.04.07.20056739/F6.medium.gif)

[Figure 6:](http://medrxiv.org/content/early/2020/04/11/2020.04.07.20056739/F6) Simulation results for the number of infected individuals in the state of São Paulo for selected days assuming *s* = 1. The maps refer, respectively, to March 1st, 10th, 20th and 30th, and April 10th and 20th.

![Figure 7:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/04/11/2020.04.07.20056739/F7.medium.gif)

[Figure 7:](http://medrxiv.org/content/early/2020/04/11/2020.04.07.20056739/F7) Simulation results on the 10th of April for the number of infected individuals in the state of São Paulo considering different values of *s*, namely, 0.0001 (top-left), 0.001 (top-right), 0.1 (mid-left), 1.0 (mid-right), 2.0 (bottom-left) and 3.0 (bottom-right).
![Figure 8:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/04/11/2020.04.07.20056739/F8.medium.gif)

[Figure 8:](http://medrxiv.org/content/early/2020/04/11/2020.04.07.20056739/F8) Simulation results for the number of infected individuals in the state of Rio de Janeiro for selected days assuming *s* = 1. The maps refer, respectively, to March 1st, 10th, 20th and 30th, and April 10th and 20th.

![Figure 9:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/04/11/2020.04.07.20056739/F9.medium.gif)

[Figure 9:](http://medrxiv.org/content/early/2020/04/11/2020.04.07.20056739/F9) Simulation results on the 10th of April for the number of infected individuals in the state of Rio de Janeiro considering different values of *s*, namely, 0.0001 (top-left), 0.001 (top-right), 0.1 (mid-left), 1.0 (mid-right), 2.0 (bottom-left) and 3.0 (bottom-right).

In order to evaluate the risk of infection of each city we consider the *rank of infection* obtained from the simulated models, as follows. For each value of *s* we number the cities by the order of disease arrival. The first city at which it arrives is ranked one, the second is ranked two, and so forth. If the disease arrives at more than one city on the same day they receive the same rank, and the next city at which the disease arrives receives the following rank, independently of how many cities got the disease before it. We then have, for each value of *s* ∈ {0.001, 0.005, 0.1, 0.2, 0.4, 0.5, 0.6, 0.8, 1, 1.2, 1.4, 1.6, 1.8, 2, 2.5, 3}, a rank for each city. The risk of infection is then calculated via a cluster analysis, in the following way.
We apply k-means clustering [13] to divide the cities into three groups (low risk, medium risk and high risk) according to the ranks attributed to them by the models. We first clustered the cities according to the ranks attributed by the models simulated with the values of *s* less than one, and with the values greater than or equal to one, separately. As the two clusterings were very similar, classifying differently only 49 cities in São Paulo and 29 in Rio de Janeiro, we decided to consider the ranks attributed by all values of *s* together to cluster the cities. The risk class of each city in the states of São Paulo and Rio de Janeiro is represented in the maps of Figures 10 and 11, respectively. We see that, besides some cities in the countryside of the state of São Paulo, the high risk locations are indeed in the metropolitan region of the capitals.

![Figure 10:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/04/11/2020.04.07.20056739/F10.medium.gif)

[Figure 10:](http://medrxiv.org/content/early/2020/04/11/2020.04.07.20056739/F10) Risk of each city in the state of São Paulo evaluated by k-means clustering of the ranks attributed by the simulated models with distinct values of *s*.

![Figure 11:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/04/11/2020.04.07.20056739/F11.medium.gif)

[Figure 11:](http://medrxiv.org/content/early/2020/04/11/2020.04.07.20056739/F11) Risk of each city in the state of Rio de Janeiro evaluated by k-means clustering of the ranks attributed by the simulated models with distinct values of *s*.

In Figures 12 and 13 we present the rank attributed by the simulated models, and the distance to the capital city, for each city with more than 100,000 inhabitants in São Paulo and more than 75,000 inhabitants in Rio de Janeiro.
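The rank of infection and the three risk classes described in this section can be sketched as follows. The ranking follows the stated rule (cities reached on the same day share a rank, and the next day receives the following rank). The clustering is a plain Lloyd-style k-means with a deterministic start, written here only for illustration; it is not necessarily the variant of [13] used in the study, and the function names and toy data are assumptions.

```python
import numpy as np

def dense_rank(arrival_days):
    """Rank cities by arrival day; ties share a rank, no gaps afterwards.

    Cities never reached (arrival = inf) share the last rank.
    """
    days = np.floor(np.asarray(arrival_days, dtype=float))
    _, inverse = np.unique(days, return_inverse=True)
    return inverse + 1

def kmeans(features, k=3, n_iter=100):
    """Minimal Lloyd iteration; rows of `features` are cities."""
    x = np.asarray(features, dtype=float)
    # Deterministic start: pick cities spread along the mean-rank ordering.
    order = np.argsort(x.mean(axis=1))
    centres = x[order[np.linspace(0, len(x) - 1, k).astype(int)]]
    for _ in range(n_iter):
        dist = ((x[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new = np.array([x[labels == j].mean(axis=0) if np.any(labels == j)
                        else centres[j] for j in range(k)])
        if np.allclose(new, centres):
            break
        centres = new
    return labels, centres

def risk_classes(rank_matrix):
    """Label each city high/medium/low; the lowest mean rank is high risk."""
    labels, centres = kmeans(rank_matrix, k=3)
    order = np.argsort(centres.mean(axis=1))
    names = np.empty(3, dtype=object)
    names[order] = ["high", "medium", "low"]
    return names[labels]

# Toy example: ranks of six cities (rows) under two values of s (columns).
ranks = dense_rank([0.0, 3.2, 3.9, 7.5, np.inf])
risk = risk_classes([[1, 1], [2, 2], [10, 11], [11, 10], [30, 29], [29, 30]])
```

Clustering on the whole vector of ranks, one entry per value of *s*, means a city is classed as high risk only if it is reached early under most values of *s*.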
We observe that the rank does not change significantly with the value of *s*, and that there is a correlation between the rank of a city and its distance from the capital: the lower the rank, the shorter the distance tends to be. These figures show that the model, when used to predict where the disease will arrive first, is robust with respect to *s*, as distinct values of *s* generated similar ranks.

![Figure 12:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/04/11/2020.04.07.20056739/F12.medium.gif)

[Figure 12:](http://medrxiv.org/content/early/2020/04/11/2020.04.07.20056739/F12) Rank of infection and distance to capital city for each city with more than 100,000 inhabitants in the state of São Paulo. The points refer to ranks estimated for different values of *s*, the triangles to the distance to the capital city, and the line is a smooth approximation of the distance triangles. The colors refer to the risk evaluated by k-means clustering of the ranks attributed by the simulated models.

![Figure 13:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/04/11/2020.04.07.20056739/F13.medium.gif)

[Figure 13:](http://medrxiv.org/content/early/2020/04/11/2020.04.07.20056739/F13) Rank of infection and distance to capital city for each city with more than 75,000 inhabitants in the state of Rio de Janeiro. The points refer to ranks estimated for different values of *s*, the triangles to the distance to the capital city, and the line is a smooth approximation of the distance triangles. The colors refer to the risk evaluated by k-means clustering of the ranks attributed by the simulated models.

## 4 Discussion

In this work, we used anonymized mobile phone data to detect population movement between cities. This framework is very useful for a variety of applications.
Here we focused on establishing a risk map for the evolution of COVID-19 within the states of São Paulo and Rio de Janeiro, and noted that the high risk regions are mainly in the metropolitan regions of the states’ capital cities, although there are some high risk cities in the countryside, especially in São Paulo. This was done by coupling the predicted mobility patterns with a standard SI model via a metapopulation model. The SI model is not suited for predicting the incidence of the infection over long periods of time, but it is an adequate linear approximation for the early exponential spread. The model chosen was adequate for use with the available disease information, namely, the Basic Reproduction Number R0 estimated from the initial spread in China. We also introduced a free parameter, *s*, used to correct the overestimation or underestimation of movement between locations. As expected, the parameter *s* is related to the intensity of mobility, which in turn implies a longer or shorter time until infection for each city. This indicates that the decrease in mobility enforced by isolation and quarantine measures may slow the spread of the disease. We also proposed a risk index based on ranks of the estimated time for an infected individual to be identified in a specific city. The risk index was shown to be robust and consistent with the spreading patterns, independently of the mobility intensity parameter *s*.

The next steps of this work are two-fold. First, we will extend the analysis to other states of the country and relate the infection risk to geolocated health and economic variables, to help in planning the allocation of local financial and hospital resources and in designing strategies to mitigate economic losses.
Additionally, we will address later phases of the disease, considering a more complex model, such as an SEIR (Susceptible - Exposed - Infectious - Recovered) model coupled with mobility, allowing long term projections and better development of control measures.

## Data Availability

Geolocation data is proprietary, but all methods are available through open-source development. Additional data information, figures, and tables may be requested via email.

[https://github.com/pedrospeixoto/mdyn](https://github.com/pedrospeixoto/mdyn)

## Code and Data Availability

All methods discussed in this work are implemented in Python or R code under an open-source license, available at [https://github.com/pedrospeixoto/mdyn](https://github.com/pedrospeixoto/mdyn). The mobile geolocation data is proprietary to the In Loco company and therefore not publicly available. However, additional details about the results, figures and tables presented in this work may be obtained upon request. Supporting information is provided at [www.ime.usp.br/∼pedrosp/covid-19](http://www.ime.usp.br/%E2%88%BCpedrosp/covid-19).

## Acknowledgments

FAPESP financial support and computer resources under grant 16/18445-7 are acknowledged. D. Marcondes has received financial support from CNPq during the development of this article.

## Footnotes

* E-mail: ppeixoto{at}usp.br
* Received April 7, 2020.
* Revision received April 7, 2020.
* Accepted April 11, 2020.
* © 2020, Posted by Cold Spring Harbor Laboratory

This pre-print is available under a Creative Commons License (Attribution-NoDerivs 4.0 International), CC BY-ND 4.0, as described at [http://creativecommons.org/licenses/by-nd/4.0/](http://creativecommons.org/licenses/by-nd/4.0/)

## References

1. [1]. Keeling, M. J., Bjørnstad, O. N., & Grenfell, B. T. (2004). Metapopulation Dynamics of Infectious Diseases. In Ecology, Genetics and Evolution of Metapopulations (pp. 415-445). Elsevier Inc.
DOI: 10.1016/B978-012323448-3/50019-2
2. [2]. Kraemer, et al. (2020). The effect of human mobility and control measures on the COVID-19 epidemic in China. Science 25 Mar 2020:eabb4218. DOI: 10.1126/science.abb4218
3. [3]. Ioannidis, J.P. (2020). Coronavirus disease 2019: the harms of exaggerated information and non-evidence-based measures. Eur J Clin Invest. Accepted Author Manuscript. DOI: 10.1111/eci.13222
4. [4]. Codeco Coelho, F. et al. Assessing the potential impact of COVID-19 in Brazil: Mobility, Morbidity and the burden on the Health Care System. medRxiv: 2020.03.19.20039131. DOI: 10.1101/2020.03.19.20039131
5. [5]. Queiroz, L., Queiroz, L., Melo, J. L., Barboza, G., Urbanski, A. H., Nicolau, A., … Nakaya, H. (2020, March 26). Large-scale assessment of human mobility during COVID-19 outbreak. DOI: 10.31219/osf.io/nqxrd
6. [6]. Brockmann, D., Hufnagel, L. & Geisel, T. The scaling laws of human travel. Nature 439, 462–465 (2006). DOI: 10.1038/nature04292
7. [7]. Gautreau A, Barrat A, Barthélemy M. Global disease spread: Statistics and estimation of arrival times. Journal of Theoretical Biology 2008;251(3):509–22. DOI: 10.1016/j.jtbi.2007.12.001
8. [8]. Brockmann D, Helbing D. The Hidden Geometry of Complex, Network-Driven Contagion Phenomena. Science 2013;342(6164):1337–42. DOI: 10.1126/science.1245200
9. [9]. S. Chen, J. Yang, W. Yang, C. Wang, T. Bärnighausen. COVID-19 control in China during mass population movements at New Year. Lancet 395, 764–766 (2020). DOI: 10.1016/S0140-6736(20)30421-9. PMID: 32105609
10. [10]. Wilder-Smith, A., C.J. Chiew, and V.J. Lee. Can we contain the COVID-19 outbreak with the same measures as for SARS? Lancet Infect Dis, 2020.
11. [11]. Joseph T Wu, Kathy Leung, Gabriel M Leung (2020, January 31). Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study. The Lancet 395, 689–697 (2020). DOI: 10.1016/S0140-6736(20)30260-9
12. [12]. Situation Report 52 - Coronavirus disease 2019 (Covid 19), World Health Organization, 12 March 2020. Available at [https://www.who.int/dg/speeches/detail/who-director-general-s-opening-remarks-at-the-mission-briefing-on-covid-19\---|12-march-2020](https://www.who.int/dg/speeches/detail/who-director-general-s-opening-remarks-at-the-mission-briefing-on-covid-19\---|12-march-2020). Accessed on 5 April 2020.
13. [13]. Hartigan, J. A.; Wong, M. A. (1979). Algorithm AS 136: A k-Means Clustering Algorithm. Journal of the Royal Statistical Society, Series C. 28 (1): 100–108. JSTOR 2346830