High resolution proximity statistics as early warning for US universities reopening during COVID-19 =================================================================================================== * Zakaria Mehrab * Akhilandeshwari Goud Ranga * Debarati Sarkar * Srinivasan Venkatramanan * Youngyun Chung Baek * Samarth Swarup * Madhav V. Marathe ## Abstract Reopening of colleges and universities for the Fall semester of 2020 across the United States has caused significant COVID-19 case spikes, requiring reactive responses such as temporary closures and switching to online learning. Until sufficient levels of immunity are reached through vaccination, Institutions of Higher Education will need to balance academic operations with COVID-19 spread risk within and outside the student community. In this work, we study the impact of proximity statistics obtained from high resolution mobility traces in predicting case rate surges in university counties. We focus on 50 land-grant university counties (LGUCs) across the country and show high correlation (PCC *>* 0.6) between proximity statistics and COVID-19 case rates for several LGUCs during the period around Fall 2020 reopenings. These observations provide a lead time of up to ∼3 weeks in preparing resources and planning containment efforts. We also show how features such as total population, population affiliated with university, median income and case rate intensity could explain some of the observed high correlation. We believe these easily explainable mobility metrics along with other disease surveillance indicators can help universities be better prepared for the Spring 2021 semester. ## 1 Introduction Institutions of Higher Education (IHEs) are an integral part of a society’s normal operations. Like many other sectors, these have been adversely impacted by the COVID-19 pandemic, with many of the US universities moving academic operations entirely online [8], and/or requiring adaptations in pedagogy, thereby affecting student enrollment, access and engagement [14]. Many universities which opened up partly or completely in-person have subsequently seen case spikes leading to temporary closures. As of November 19, 2020, there have been up to 321,000 cases across 1700 colleges [16] spread across all 50 U.S. states. Most of these outbreaks subsequently led to case increases in the local community. In several cases, especially in rural counties, these contributed to a significant fraction of the overall cases to date. While comprehensive testing, tracing and isolation policies can and have brought some of these outbreaks under control, in many cases they have been reactive and ad hoc. Given the case surges seen across the country and worldwide as we head into the winter, questions of continued operations extending into Spring 2021 loom large in terms of COVID-19 control until the availability and mass roll-out of vaccines. While standard surveillance measures such as pre-arrival testing [13], random prevalence sampling and innovative strategies such as wastewater surveillance can provide indicators of COVID-19 incidence in the student community, often these are lagging indicators of infection activity, hence have limited utility in resource management and outbreak containment. For effective response, identifying potential infection events will be valuable for public health officials in preparing for subsequent case rates and could also guide targeted surveillance and control measures. However, contact tracing based on mobile apps has been less effective so far, potentially due to low participation rates despite high acceptability [4]. As the next best alternative, one could use aggregate statistics derived passively from mobility traces obtained with consent to identify hotspots of interaction and potential subsequent case spikes. High resolution mobility data have become increasingly available and have been shown to be relevant for infectious disease modeling and forecasting across different phases [11, 6]. Anderson et al. [5] have also used such GPS traces to show the relationship to case spikes in some universities. While their focus was on understanding the impact of in-person instruction and inflow of students from across the country, the demographic characteristics of the region were not studied in detail. These will play a central role for local public health officials in understanding the potential impact on local community, and identify any leading indicators for triggering response efforts. In our work, we use anonymized mobility traces to study their relationship to COVID-19 case rates at land-grant university counties (LGUCs) across the United States. Specifically, we focus on the level of proximity maintained by the users, as a proxy for social contact and disease spread risk. We show that simple, easily explainable proximity statistics computed from such mobility data serve as early indicators with high correlation (Pearson Correlation Coefficient, PCC *>* 0.6) with a lead time of up to ∼3 weeks, across multiple LGUCs. These results are especially evident in rural college towns and counties where the case spikes due to the student population are prominently observable at the county level. Further, demographic characteristics such as median income, university population partially explain the observed high correlation. Given the near real-time nature of such datasets (within 24 hours of interaction) and the privacy-preserving nature of aggregate statistics, they will serve as valuable early warning triggers for case investigation and resource planning by local health officials. In addition to providing a retrospective view of the Fall 2020 case spikes, they will be valuable for decision makers moving forward into the Spring 2021 semester. ## 2 Results Our analyses show strong correlation between smoothed daily mean number of interactions and case rates in land-grant university counties (LGUCs) during the period from July to September 2020. Figure 1 (left panels) show the two curves for the top five LGUCs in terms of the largest maximum Pearson Correlation Coefficient (maxPCC) values observed across different lags (all 50 LGUCs are shown in Supplementary Figure S4). As evident from the visual comparison, strong spikes in interaction statistics are soon followed by case rate spikes in several of the LGUCs. The panels on the right side of Figure 1 show how the PCC varies with lead time along with the maximum PCC observed (vertical dashed line). They also show the results of Granger causality test at different lead times along with the level of significance achieved greater lead times (where the curve falls below the *p* = 0.05 level on the secondary y-axis). ![Figure S3:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/11/23/2020.11.21.20236042/F6.medium.gif) [Figure S3:](http://medrxiv.org/content/early/2020/11/23/2020.11.21.20236042/F6) Figure S3: Residual vs fitted value plot for each linear regression model. ![](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/11/23/2020.11.21.20236042/F7/graphic-9.medium.gif) [](http://medrxiv.org/content/early/2020/11/23/2020.11.21.20236042/F7/graphic-9) ![](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/11/23/2020.11.21.20236042/F7/graphic-10.medium.gif) [](http://medrxiv.org/content/early/2020/11/23/2020.11.21.20236042/F7/graphic-10) ![](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/11/23/2020.11.21.20236042/F7/graphic-11.medium.gif) [](http://medrxiv.org/content/early/2020/11/23/2020.11.21.20236042/F7/graphic-11) ![](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/11/23/2020.11.21.20236042/F7/graphic-12.medium.gif) [](http://medrxiv.org/content/early/2020/11/23/2020.11.21.20236042/F7/graphic-12) ![](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/11/23/2020.11.21.20236042/F7/graphic-13.medium.gif) [](http://medrxiv.org/content/early/2020/11/23/2020.11.21.20236042/F7/graphic-13) Figure S4: Results for all LGUCs. ![Figure 1:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/11/23/2020.11.21.20236042/F1.medium.gif) [Figure 1:](http://medrxiv.org/content/early/2020/11/23/2020.11.21.20236042/F1) Figure 1: LGUCs with strong correlation. The name of the LGUC is given in the center-top position of each row. The left panel of each row shows the measured mean interaction (blue curve) and the measure case rate (red curve) for each day. We can see that a spike in the case rate is preceded by a spike in the mean interaction. For Monongalia County (WV), however, there is no preceding spike in interaction for the first peak in case rate. We believe this is due to boundary effect. The right panel show the PCC (green curve) and p-values (purple curve) for different lead times. PCC is high for lead times of 2-3 weeks and corresponding p-value is also significant across all the figures. The maxPCC observed across different lead times is greater than 0.6 for more than half the counties under study, with the top three counties at maxPCC of 0.95. For all but three LGUCs, the maxPCC occurs at positive lead times strongly indicating causality (Figure 2). The median lead time for maxPCC is 14.5 days, and more than 80% of counties have maxPCC at a lead time of above 10 days. Further investigation using Granger causality reveals that for 68% of counties the measure is significant (*p <* 0.05, SSR chi-square test) for lead times above 10 days. The observed maxPCC values were also robust to hyperparameters of interaction statistics such as dwell time and minimum distance when computing proximity statistics (Supplementary Figure S2). ![Figure 2:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/11/23/2020.11.21.20236042/F2.medium.gif) [Figure 2:](http://medrxiv.org/content/early/2020/11/23/2020.11.21.20236042/F2) Figure 2: The maximum PCC and the corresponding lead time for each pair of interaction and case rate time series. We see that about half the university counties have maxPCC *>* 0.5 and lead time *>* 11. Upon further investigation of the spatial and demographic characteristics, we notice that high correlations are mostly observed for LGUCs with smaller population sizes (blue circles in Figure 3). We also note that though these patterns are spread across the United States, especially in the Midwestern states and along the Eastern Seaboard, consistent with those regions experiencing higher COVID-19 incidence during July-September 2020. These patterns however are not universal, as seen in LGUCs with low maxPCC (Supplementary Table S2) or the corresponding lead time being negative (shown as 0 in Figure 2). ![Figure 3:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/11/23/2020.11.21.20236042/F3.medium.gif) [Figure 3:](http://medrxiv.org/content/early/2020/11/23/2020.11.21.20236042/F3) Figure 3: Population sizes and maxPCC values for LGUCs across the United States (insets contain Alaska and Hawaii). Circle size represents maxPCC. Larger circle represents high correlation in the corresponding LGUC. Also LGUCs with negative maxPCC are denoted with a cross (x) mark instead of circles. The color of the circle represents the population of the LGUC. We can see most of the large circles correspond to LGUC with low population (shown in cyan, yellow and magenta). Regression models developed using various socioeconomic and epidemiological factors as independent vari-ables to explain the maxPCC (either continuous or categorical) reveal that factors such as total LGUC population, number of people associated with university, rurality (captured by RUCC code of the county), median income level and epidemic intensity of case rate (measured as defined in [9]) were significant across multiple model variants (*R*2 ranging from 0.37 to 0.50). The chosen variables were mostly similar across different model selection procedures (forward, backward and stepwise regression) and dependent variable (continuous or categorical variant). A complete list of features considered and results for other model variants are provided in the Supplementary material. ## 3 Discussion Human mobility datasets obtained from smartphones have been shown in the past to be useful for infectious disease modeling and forecasting. However, these suffer from demographic bias and may not be completely representative of the different population segments in the interaction counts, or in how they might be affecting case rates. In our analyses, we find that the correlation is highest in small college towns and weaker in larger, more urban, areas. Given the disproportionate representation of college-age individuals in the mobility data, their return to university campuses, especially rural ones leads to strong spikes in interactions. As shown in [5], their arrivals from higher incidence areas may have seeded outbreaks which subsequently caused case spikes observable at the county level. While a similar phenomenon could have played out in urban areas, given the background number of interactions these may be difficult to observe and thus fail to provide strong early warning signals for case spikes. Based on the regression analyses, a positive coefficient associated with the university population depicts that maxPCC is higher in LGUCs where the interactions are attributable to a largely younger population. On the other hand, the overall LGUC population has a negative coefficient. This, coupled with the fact that university affiliated population is positively correlated with maxPCC, follows that maxPCC is high in small university towns; A negative coefficient associated with the Rural-Urban Continuum Codes (where higher values indicate lower urbanization) reflects the fact that many LGUCs are classified as metro areas. This is the case for all the counties in our data with the highest maxPCC values. The *median income* also has a negative coefficient, indicating that poorer counties show a higher maxPCC value. On a similar note, if there is a sufficient background case rate, then the change in case rate due to the change in interactions may not stand out sufficiently. In the regression analysis, the *entropy of the case rate* (computed as Shannon entropy of the normalized case time series, proposed by [9]) attempts to capture the strength of the signal in the case data, and is higher for counties that do not have noticeable surges in COVID-19 case rates. The coefficient for this feature is negative, which indicates that the stronger the spike the better the correlation with the interaction curve. We note that other factors such as mask use prevalence, testing rates, and university operations (online vs in-person) may have an impact on whether high numbers of interactions lead to case spikes. However such datasets were not either universally available or did not match the timeframe of our study, so were not included. The goal of this study is to establish the utility of aggregate mobility statistics in creating early warning systems for universities. Thus it may not be exhaustive in explanatory variables and is not intended as a forecasting exercise. Further modeling of contact durations and disease transmissions using mechanistic models will definitely help improve the utility of such datasets. Operating on such anonymized datasets and computing aggregate yet useful statistics with privacy guarantees remains an active area of investigation. While most universities have tided over the case surges due to Fall 2020 reopening, the threat of subsequent outbreaks and impact on communities will persist until the rollout of a vaccine to the general population. Beyond retrospectively explaining case incidence in the past few months, this study can be seen as highlighting the value of proximity statistics as early warning indicators to case investigators and resource planners. As universities close early and plan their reopening strategies for Spring 2021, integrating such real-time indicators in their planning process will help with their response. We strongly believe that along with comprehensive testing and isolation plans, learning from the collective lessons of Fall experience, especially success stories, we could avoid similar spikes in the future despite surging case rates across the country. ## 4 Materials and Methods ### Datasets We use data obtained via a third-party SDK operated by X-Mode [3] which serves multiple mobile applications and provides anonymized mobility traces with persistent identifiers (see Table S3 for statistics). We use TIGER/LINE shapefiles (for data extraction) [7] and medial income statistics [1] at county level from the US Census Bureau. County populations and COVID-19 case counts are obtained from USAFacts [2], while the RUCC codes were obtained from USDA [10]. Finally, we use a manually curated list of land-grant universities and their corresponding counties (Table S4). Additional details on all features considered are provided in the Supplementary material. ### Trajectory extraction From the raw mobility traces obtained from X-Mode, we use Census shapefiles to first extract traces corresponding to the US and separate records per LGUC. We use a stop-point detection algorithm from scikitmobility [12] to post-process the raw traces with a specified level of sensitivity (time interval 5 minutes, distance threshold 10m). Note that this step is prior to interaction computation, i.e., the time interval and distance threshold used in this step are distinct from the parameters used in interaction count computation below. ### Interaction statistics We define an interaction as the co-occurrence of mobile device pings within a spatial distance threshold *d* and within temporal interval threshold *t*. Although more accurate interaction counts could hypothetically be obtained by interpolating travel to produce more continuous trajectories, the current method provides a lower bound on such interactions. Further, the use of stop points and subsequent deduplication allows us to obtain the number of unique interactions for each distinct user (i.e., device). We then compute the mean number of interactions per county. In our analysis, we set *d* = 2*m* and *t* = 1 *minute*. We also performed sensitivity analyses to these parameters (shown using Story County, Iowa as an example) and found that the results are robust to slight variations in *d* and *t* (Figure S2). ### Correlation analyses We used two metrics, the Pearson Correlation Coefficient and Granger Causality, to measure the correlation between the obtained daily interaction statistics and case rates for the LGUCs of interest. Before analysis, county level case counts are converted to case rates per 100,000 and both case rates and mean interactions are smoothed by computing 7-day moving averages. The former is done to normalize across different population sizes, while the latter eliminates any within-week reporting artifacts and noise in trajectory or case data. PCC between the case rates and the mean interactions time series is obtained across different lead times and the maximum PCC and corresponding lead time are reported. We test Granger causality using python statsmodels [15] and report value for SSR-Chi square test. ### Regression analyses We perform regression analyses for explaining the observed maxPCC values with various exogenous variables related to population characteristics and COVID-19 epidemiology. The features were normalized before the regression to account for the varying ranges of magnitude. We used three feature selection methods (forward, backward and stepwise) applied to continuous and quantized values of maxPCC to identify the statistically significant features. Among these models, we report the model obtained using the features obtained by backward selection. We find that the university population, total LGUC population, median household income, the RUCC type of the county and the entropy of case rate are strongly correlated with the maximum PCC (*R*2=0.504) (Table 1). A full list of the features, description of the feature selection methods and summary of all six linear regression models (Table S1) can be found in the Supplementary material. View this table: [Table 1:](http://medrxiv.org/content/early/2020/11/23/2020.11.21.20236042/T1) Table 1: Linear regression analysis result for backward selection method and continuous target variable. ## Data Availability Aggregated statistics visualized in the manuscript are available upon request. ## Privacy and Ethics Statement Anonymized datasets were obtained from X-Mode which collects location information from smartphone applications that use its software development kit (XDK). Users have the ability to limit/disable tracking and/or redistribution of their data. We refer the readers to [https://xmode.io/xmode-privacy-policy-2/](https://xmode.io/xmode-privacy-policy-2/) for details on their Privacy Policy. The study at University of Virginia was approved by the Institutional Review Board for Social and Behavioral Sciences as part of the Human Research Protection Program. The paper only publishes aggregated derived statistics at the level of counties and on a daily basis guaranteeing individual privacy. ## Trajectory extraction and Stop point detection ![Figure S1:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/11/23/2020.11.21.20236042/F4.medium.gif) [Figure S1:](http://medrxiv.org/content/early/2020/11/23/2020.11.21.20236042/F4) Figure S1: Effect of applying stop points. For the purposes of our study the dataset spans 1 July, 2020 to 15 September, 2020 and focuses on mobility traces in the LGUCs of interest. From the mobility traces obtained for USA for a given day, we separate the pings by each LGUC. Then we use a stop point detection algorithm and generate a set of stop points from those raw trajectory points (Figure S1). The purpose of this pre-processing step is to filter out the records that do not contribute to significant interaction. For instance, this eliminates the spurious pings generated in transit and provide a ‘cleaner’ trajectory with fewer records for subsequent calculations. This also helps remove users who did not spend significant time within the county, but were transiting through the region. We use the stop-point detection method implemented in the scikit-mobility [5] library. The stop point detection method requires one temporal constraint and two spatial constraints. The temporal constraint specifies the minimum number of minutes an individual has to spend at a location for that record to be considered a stop point. The two spatial constraints are stop radius factor and spatial radius. The product of these two parameters gives us a distance threshold. Any point from the candidate trajectory point within this distance threshold will be considered as a single stop point and the final stop point coordinates are the median latitude and longitude values of the points found within the specified distance threshold. We use 5 minutes as our temporal constraint. For the spatial constraints, we set spatial radius to 0.2 km and the stop radius factor as 0.05. Thus the distance threshold is 10 meters. ![Figure S2:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/11/23/2020.11.21.20236042/F5.medium.gif) [Figure S2:](http://medrxiv.org/content/early/2020/11/23/2020.11.21.20236042/F5) Figure S2: Sensitivity of Iowa State University LGUC PCC analysis with respect to the distance and time interval parameters. ## Measuring Interaction rate After performing pre-processing on the previous step, we estimate the number of interactions occurring on each day for each county. Human interaction is associated with both temporal and spatial proximity. Intuitively, an interaction occurs when two persons are present close to each other (spatial) simultaneously (temporal). Following this, we define that an interaction happens when two people are present at a location within a spatial threshold *d*, and the difference of their time of pings is within a temporal threshold *t*. The data for a county for each day consists of a set of pings *P*. The set of pings are the stop points obtained from the previous step. Each ping *p**i* of *P* is defined by a tuple as follows: *p**i* = (*u**i*, *lat**i*, *lon**i*, *ts**i*). Here, *u**i* denotes an anonymous user. Although one user may have multiple *u**i* if he/she possesses multiple devices, for our experiment we consider each distinct *u**i* to be a separate user. *lat**i* and *lon**i* represent the latitude and longitude of the ping, respectively, and *ts**i* denotes the timestamp when the ping was captured. From the pings collected for each day, we calculate the number of unique interactions for each distinct *u**i*. After calculating the number of unique interactions for each distinct user *u**i*, we obtain the mean interaction count for that day for the particular county. ## Correlation Measurement After obtaining the time-series data of interaction for a county, we aim at measuring whether any correlation exists between these data and the case rate data. As mentioned in the main text, correlation analyses are performed on 7-day smoothed versions of the mean number of interactions per county and the case rates per 100,000. ## Pearson Correlation This is a statistic that measures the correlation between an exogenous variable *X* and an endogenous variable *Y*. It takes *X* and *Y* as input and generates a value between +1 and -1. A value close to 1 represents a strong positive correlation between *X* and *Y* (*Y* increases as *X* increases). A value close to -1 represents a strong negative correlation (*Y* decreases as *X* increases). A value of 0 indicates that there is no correlation between *X* and *Y* . For our purposes, *X* is the time series of interactions and *Y* is the time series of case rates. We shift *X* by some lead time and compute the Pearson correlation coefficient for each such lead time. This not only tells us whether *X* and *Y* are correlated, but also gives us the optimal lead time that lets us predict *Y* from *X* if they are strongly correlated. ## Granger Causality Analysis The Granger causality test is a statistical hypothesis test for determining whether one time series is more useful in forecasting another time series than the past values of the other time series itself. A time series *X* is said to “granger-cause” *Y* if it can be shown that the past values of *X* along with past values of *Y* provide more statistically significant prediction of the future value of *Y* than the future value of *Y* obtained from past values of *Y* alone. This is done through a series of chi square-tests and F-tests. ## Linear regression The following features were considered in linear regression for predicting maxPCC of each county. For each feature, we first define what the feature represents and also explain how the feature was pre-processed for regression analysis, wherever applicable. 1. Standard deviation scale of interactions: The ratio of the standard deviation to the mean for the interaction rate time-series data. (Feature was normalized before usage) 2. Standard deviation scale of cases: The ratio of the standard deviation to the mean for the case rate time-series data. (Feature was normalized before usage) 3. Maximum standard deviation of interactions. (Feature was normalized before usage) 4. Maximum standard deviation of cases: (Feature was normalized before usage) 5. Income: Median household income for the year 2018 of the county as per the data provided by the Economic Research Service, USDA [1]. (Median Income values were in 100K range. So, the values were used on a scale of 100K in the regression analysis.) 6. Dynamic range of case rates: The difference between the maximum and minimum value in the case rate time series data. divided by the minimum value of the case rate time series data. (Feature was normalized before usage) 7. Dynamic range of interaction rates: The difference between the maximum and minimum value in the interaction rate time series data divided by the minimum value of the interaction rate time series data. (Feature was normalized before usage) 8. Entropy of case rates: The amount of uncertainty in the time series data for the case rate. This was computed by quantizing the values of the case rate time series data, converting it into a probability distribution and computing the entropy of the probability distribution. 9. Entropy of interaction rates: The amount of uncertainty in the time series data for the interaction rate. This was computed by quantizing the values of the interaction rate time series data, converting it into a probability distribution and computing the entropy of the probability distribution. 10. Expected age group: Weighted average of the age group of people in the county as per the data of 2019 provided by the US Census Bureau [2]. (Feature was normalized before usage) 11. University population: The number of people affiliated with the corresponding land grant university of the county as per IPEDS [4]. (On a scale of 100K) 12. Population: Population of the county (On a scale of 100K) 13. University percentage: Ratio of 11 to 12. 14. User coverage: The number of unique users in our dataset. (On a scale of 100K) 15. User coverage percent: Ratio of 14 to 12. 16. Social distancing index: Average social distancing index of a county as provided by University of Maryland [6]. (Feature was normalized before usage) 17. Type: RUCC Type of a county, stating whether a county is urban or rural. This is obtained from United States Department of Agriculture [3]. (RUCC type ranges from 1 to 9. The values were divided by 10 before using for regression analysis.) 18. Out trips: Average number of out-of-county trips of a county as provided by University of Maryland [6]. (Feature was normalized before usage) ## Feature selection methods ### Forward selection The steps for the forward selection method are as follows: 1. Fix a significance level (0.05, in our case). Let *C* be set of the candidate features and *S* be the set of currently selected features. 2. Fit all possible regression models considering one feature at a time from *C* and the features from *S*. 3. Select the feature *c* ∈ *C* that has lowest *p*-value over the constructed models. 4. If the *p*-value of *c* is less than the fixed significance level, then add *c* to *S*, delete *c* from *C*. Repeat step 2 to 4. Otherwise finish the process and return *S*. ### Backward elimination The steps for the forward selection method are as follows: 1. Fix a significance level (0.05, in our case). Let *C* be set of the candidate features. 2. Fit a model considering all features in *C*. Select the feature *c* ∈ *C* that has the highest *p*-value. 3. If the *p*-value of *c* is more than the fixed significance level, then delete *c* from *C*. Repeat steps 2 to 3. Otherwise finish the process and return *C*. ### Stepwise selection The steps for stepwise selection method are as follows: 1. Fix a significance level (0.05, in our case). Let *C* be set of the candidate features and *S* be the set of currently selected features. 2. Fit all possible regression models considering one feature at a time from *C* and the features from *S*. 3. Select the feature *c* ∈ *C* that has lowest *p*-value over the constructed models. 4. If the *p*-value of *c* is less than the fixed significance level, then add *c* to *S*. Otherwise, finish the process and return *S*. 5. Fit a model considering all features in *S*. Select the feature *s* ∈ *S* that has the highest *p*-value. 6. If the *p*-value of *s* is more than the fixed significance level, then delete *s* from *S*. Repeat steps 5 to 6. Otherwise, go to next step 7. Set *C* = *C* − *S* 8. If *C* = ∅, finish the process and return *S*. Otherwise, go back to step 2 and repeat. View this table: [Table S1:](http://medrxiv.org/content/early/2020/11/23/2020.11.21.20236042/T2) Table S1: Regression analyses results for all models View this table: [Table S2:](http://medrxiv.org/content/early/2020/11/23/2020.11.21.20236042/T3) Table S2: List of LGUCs and corresponding Universities with maxPCC and lead times. View this table: [Table S3:](http://medrxiv.org/content/early/2020/11/23/2020.11.21.20236042/T4) Table S3: Demographics and app coverage statistics for X-Mode data. The second column is the average number of unique users in our Dataset for each day along with the penetration percentage, which is the percentage ratio of unique users to the population of the LGUC. The third and fourth column represent the number of traces in our dataset after and before stop point detection, respectively. View this table: [Table S4:](http://medrxiv.org/content/early/2020/11/23/2020.11.21.20236042/T5) Table S4: List of land grant universities with corresponding US counties. ## Acknowledgement The authors would like to thank members of the Biocomplexity COVID-19 Response Team and Network Systems Science and Advanced Computing (NSSAC) Division for their thoughtful comments and suggestions related to epidemic modeling and response support. We thank members of the Biocomplexity Institute and Initiative, University of Virginia for useful discussion and suggestions. This work was partially supported by National Institutes of Health (NIH) Grant R01GM109718, NSF BIG DATA Grant IIS-1633028, NSF DIBBS Grant OAC-1443054, NSF Grant No.: OAC-1916805, NSF Expeditions in Computing Grant CCF-1918656, CCF-1917819, NSF RAPID CNS-2028004, NSF RAPID OAC-2027541, US Centers for Disease Control and Prevention 75D30119C05935, DTRA subcontract/ARA S-D00189-15-TO-01-UVA. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding agencies. * Received November 21, 2020. * Revision received November 21, 2020. * Accepted November 23, 2020. * © 2020, Posted by Cold Spring Harbor Laboratory The copyright holder for this pre-print is the author. All rights reserved. The material may not be redistributed, re-used or adapted without the author's permission. ## References 1. [1].Economic Research Service, United States Department of Agriculture. [https://www.ers.usda.gov/data-products/county-level-data-sets/download-data/](https://www.ers.usda.gov/data-products/county-level-data-sets/download-data/), visited 2020-11-10. 2. [2].U.S. coronavirus cases and deaths. [https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/](https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/), visited 2020-11-10. 3. [3].X-Mode. [https://xmode.io/](https://xmode.io/), visited 2020-11-10. 4. [4]. Samuel Altmann, Luke Milsom, Hannah Zillessen, Raffaele Blasone, Frederic Gerdon, Ruben Bach, Frauke Kreuter, Daniele Nosenzo, Severine Toussaert, and Johannes Abeler. Acceptability of app-based contact tracing for COVID-19: Cross-country survey evidence. Available at SSRN 3590505, 2020. 5. [5]. Martin S. Andersen, Ana I. Bento, Anirban Basu, Chris Marsicano, and Kosali Simon. College openings, mobility, and the incidence of covid-19 cases. medRxiv, 2020. 6. [6]. Hamada S. Badr, Hongru Du, Maximilian Marshall, Ensheng Dong, Marietta M. Squire, and Lauren M. Gardner. Association between mobility patterns and COVID-19 transmission in the USA: a mathematical modelling study. The Lancet Infectious Diseases, 20(11):1247–1254, 2020. 7. [7].U.S. Census Bureau. TIGER/Line shapefiles. [https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html](https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html), visited 2020-11-10. 8. [8].Davidson College. College Crisis Initiative. [https://collegecrisis.shinyapps.io/dashboard](https://collegecrisis.shinyapps.io/dashboard). 9. [9]. Benjamin D. Dalziel, Stephen Kissler, Julia R. Gog, Cecile Viboud, Ottar N. Bjørnstad, C. Jessica E. Metcalf, and Bryan T. Grenfell. Urbanization and humidity shape the intensity of influenza epidemics in US cities. Science, 362(6410):75–79, 2018. [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6Mzoic2NpIjtzOjU6InJlc2lkIjtzOjExOiIzNjIvNjQxMC83NSI7czo0OiJhdG9tIjtzOjUwOiIvbWVkcnhpdi9lYXJseS8yMDIwLzExLzIzLzIwMjAuMTEuMjEuMjAyMzYwNDIuYXRvbSI7fXM6ODoiZnJhZ21lbnQiO3M6MDoiIjt9) 10. [10].Economic Research Service, United States Department of Agriculture. Rural Urban Continuum Codes. [https://www.ers.usda.gov/data-products/rural-urban-continuum-codes.aspx](https://www.ers.usda.gov/data-products/rural-urban-continuum-codes.aspx), visited 2020-11-10. 11. [11]. Nuria Oliver, Bruno Lepri, Harald Sterly, Renaud Lambiotte, Sébastien Deletaille, Marco De Nadai, Emmanuel Letouzé, Albert Ali Salah, Richard Benjamins, Ciro Cattuto, Vittoria Colizza, Nicolas de Cordes, Samuel P. Fraiberger, Till Koebe, Sune Lehmann, Juan Murillo, Alex Pentland, Phuong N Pham, Frédéric Pivetta, Jari Saramäki, Samuel V. Scarpino, Michele Tizzoni, Stefaan Verhulst, and Patrick Vinck. Mobile phone data for informing public health actions across the COVID-19 pandemic life cycle. Science Advances, 6(23), 2020. 12. [12]. Luca Pappalardo, F. Simini, G. Barlacchi, and R. Pellungrini. scikit-mobility: A Python library for the analysis, generation and risk assessment of mobility data. arxiv:1907.07062, 2019. 13. [13]. Lior Rennert, Corey Andrew Kalbaugh, Lu Shi, and Christopher McMahan. Reopening universities during the COVID-19 pandemic: A testing strategy to minimize active cases and delay outbreaks. medRxiv:2020.07.06.20147272, 2020. 14. [14]. Pradeep Sahu. Closure of universities due to Coronavirus Disease 2019 (COVID-19): Impact on education and mental health of students and academic staff. Cureus, 12(4), 2020. 15. [15]. Skipper Seabold and Josef Perktold. statsmodels: Econometric and statistical modeling with Python. In 9th Python in Science Conference, 2010. 16. [16].The New York Times. Tracking the Coronavirus at U.S. Colleges and Universities. [https://www.nytimes.com/interactive/2020/us/covid-college-cases-tracker.html](https://www.nytimes.com/interactive/2020/us/covid-college-cases-tracker.html). ## References 1. [1].Economic Research Service, United States Department of Agriculture. [https://www.ers.usda.gov/data-products/county-level-data-sets/download-data/](https://www.ers.usda.gov/data-products/county-level-data-sets/download-data/), visited 2020-11-10. 2. [2].U.S. Census Bureau. Population and housing unit estimates. [https://www.census.gov/programs-surveys/popest/technical-documentation/file-layouts.html](https://www.census.gov/programs-surveys/popest/technical-documentation/file-layouts.html), visited 2020-11-10. 3. [3].Economic Research Service, United States Department of Agriculture. Rural Urban Continuum Codes. [https://www.ers.usda.gov/data-products/rural-urban-continuum-codes.aspx](https://www.ers.usda.gov/data-products/rural-urban-continuum-codes.aspx), visited 2020-11-10. 4. [4].National Center for Education Statistics. IPEDS. [https://nces.ed.gov/ipeds/use-the-data](https://nces.ed.gov/ipeds/use-the-data), visited 2020-11-10. 5. [5]. Luca Pappalardo, F. Simini, G. Barlacchi, and R. Pellungrini. scikit-mobility: A Python library for the analysis, generation and risk assessment of mobility data. arxiv:1907.07062, 2019. 6. [6]. Lei Zhang, Sepehr Ghader, Michael L Pack, Chenfeng Xiong, Aref Darzi, Mofeng Yang, Qianqian Sun, AliAkbar Kabiri, and Songhua Hu. An interactive covid-19 mobility impact and social distancing analysis platform. medRxiv:2020.04.29.20085472, 2020.