## Abstract

Reopening of colleges and universities for the Fall semester of 2020 across the United States has caused significant COVID-19 case spikes, requiring reactive responses such as temporary closures and switching to online learning. Until sufficient levels of immunity are reached through vaccination, Institutions of Higher Education will need to balance academic operations with COVID-19 spread risk within and outside the student community. In this work, we study the impact of proximity statistics obtained from high resolution mobility traces in predicting case rate surges in university counties. We focus on 50 land-grant university counties (LGUCs) across the country and show high correlation (PCC *>* 0.6) between proximity statistics and COVID-19 case rates for several LGUCs during the period around Fall 2020 reopenings. These observations provide a lead time of up to ∼3 weeks in preparing resources and planning containment efforts. We also show how features such as total population, population affiliated with university, median income and case rate intensity could explain some of the observed high correlation. We believe these easily explainable mobility metrics along with other disease surveillance indicators can help universities be better prepared for the Spring 2021 semester.

## 1 Introduction

Institutions of Higher Education (IHEs) are an integral part of a society’s normal operations. Like many other sectors, these have been adversely impacted by the COVID-19 pandemic, with many of the US universities moving academic operations entirely online [8], and/or requiring adaptations in pedagogy, thereby affecting student enrollment, access and engagement [14]. Many universities which opened up partly or completely in-person have subsequently seen case spikes leading to temporary closures. As of November 19, 2020, there have been up to 321,000 cases across 1700 colleges [16] spread across all 50 U.S. states. Most of these outbreaks subsequently led to case increases in the local community. In several cases, especially in rural counties, these contributed to a significant fraction of the overall cases to date. While comprehensive testing, tracing and isolation policies can and have brought some of these outbreaks under control, in many cases they have been reactive and ad hoc. Given the case surges seen across the country and worldwide as we head into the winter, questions of continued operations extending into Spring 2021 loom large in terms of COVID-19 control until the availability and mass roll-out of vaccines.

While standard surveillance measures such as pre-arrival testing [13], random prevalence sampling and innovative strategies such as wastewater surveillance can provide indicators of COVID-19 incidence in the student community, often these are lagging indicators of infection activity, hence have limited utility in resource management and outbreak containment. For effective response, identifying potential infection events will be valuable for public health officials in preparing for subsequent case rates and could also guide targeted surveillance and control measures. However, contact tracing based on mobile apps has been less effective so far, potentially due to low participation rates despite high acceptability [4]. As the next best alternative, one could use aggregate statistics derived passively from mobility traces obtained with consent to identify hotspots of interaction and potential subsequent case spikes. High resolution mobility data have become increasingly available and have been shown to be relevant for infectious disease modeling and forecasting across different phases [11, 6]. Anderson et al. [5] have also used such GPS traces to show the relationship to case spikes in some universities. While their focus was on understanding the impact of in-person instruction and inflow of students from across the country, the demographic characteristics of the region were not studied in detail. These will play a central role for local public health officials in understanding the potential impact on local community, and identify any leading indicators for triggering response efforts.

In our work, we use anonymized mobility traces to study their relationship to COVID-19 case rates at land-grant university counties (LGUCs) across the United States. Specifically, we focus on the level of proximity maintained by the users, as a proxy for social contact and disease spread risk. We show that simple, easily explainable proximity statistics computed from such mobility data serve as early indicators with high correlation (Pearson Correlation Coefficient, PCC *>* 0.6) with a lead time of up to ∼3 weeks, across multiple LGUCs. These results are especially evident in rural college towns and counties where the case spikes due to the student population are prominently observable at the county level. Further, demographic characteristics such as median income, university population partially explain the observed high correlation. Given the near real-time nature of such datasets (within 24 hours of interaction) and the privacy-preserving nature of aggregate statistics, they will serve as valuable early warning triggers for case investigation and resource planning by local health officials. In addition to providing a retrospective view of the Fall 2020 case spikes, they will be valuable for decision makers moving forward into the Spring 2021 semester.

## 2 Results

Our analyses show strong correlation between smoothed daily mean number of interactions and case rates in land-grant university counties (LGUCs) during the period from July to September 2020. Figure 1 (left panels) show the two curves for the top five LGUCs in terms of the largest maximum Pearson Correlation Coefficient (maxPCC) values observed across different lags (all 50 LGUCs are shown in Supplementary Figure S4). As evident from the visual comparison, strong spikes in interaction statistics are soon followed by case rate spikes in several of the LGUCs. The panels on the right side of Figure 1 show how the PCC varies with lead time along with the maximum PCC observed (vertical dashed line). They also show the results of Granger causality test at different lead times along with the level of significance achieved greater lead times (where the curve falls below the *p* = 0.05 level on the secondary y-axis).

The maxPCC observed across different lead times is greater than 0.6 for more than half the counties under study, with the top three counties at maxPCC of 0.95. For all but three LGUCs, the maxPCC occurs at positive lead times strongly indicating causality (Figure 2). The median lead time for maxPCC is 14.5 days, and more than 80% of counties have maxPCC at a lead time of above 10 days. Further investigation using Granger causality reveals that for 68% of counties the measure is significant (*p <* 0.05, SSR chi-square test) for lead times above 10 days. The observed maxPCC values were also robust to hyperparameters of interaction statistics such as dwell time and minimum distance when computing proximity statistics (Supplementary Figure S2).

Upon further investigation of the spatial and demographic characteristics, we notice that high correlations are mostly observed for LGUCs with smaller population sizes (blue circles in Figure 3). We also note that though these patterns are spread across the United States, especially in the Midwestern states and along the Eastern Seaboard, consistent with those regions experiencing higher COVID-19 incidence during July-September 2020. These patterns however are not universal, as seen in LGUCs with low maxPCC (Supplementary Table S2) or the corresponding lead time being negative (shown as 0 in Figure 2).

Regression models developed using various socioeconomic and epidemiological factors as independent vari-ables to explain the maxPCC (either continuous or categorical) reveal that factors such as total LGUC population, number of people associated with university, rurality (captured by RUCC code of the county), median income level and epidemic intensity of case rate (measured as defined in [9]) were significant across multiple model variants (*R*^{2} ranging from 0.37 to 0.50). The chosen variables were mostly similar across different model selection procedures (forward, backward and stepwise regression) and dependent variable (continuous or categorical variant). A complete list of features considered and results for other model variants are provided in the Supplementary material.

## 3 Discussion

Human mobility datasets obtained from smartphones have been shown in the past to be useful for infectious disease modeling and forecasting. However, these suffer from demographic bias and may not be completely representative of the different population segments in the interaction counts, or in how they might be affecting case rates. In our analyses, we find that the correlation is highest in small college towns and weaker in larger, more urban, areas. Given the disproportionate representation of college-age individuals in the mobility data, their return to university campuses, especially rural ones leads to strong spikes in interactions. As shown in [5], their arrivals from higher incidence areas may have seeded outbreaks which subsequently caused case spikes observable at the county level. While a similar phenomenon could have played out in urban areas, given the background number of interactions these may be difficult to observe and thus fail to provide strong early warning signals for case spikes.

Based on the regression analyses, a positive coefficient associated with the university population depicts that maxPCC is higher in LGUCs where the interactions are attributable to a largely younger population. On the other hand, the overall LGUC population has a negative coefficient. This, coupled with the fact that university affiliated population is positively correlated with maxPCC, follows that maxPCC is high in small university towns; A negative coefficient associated with the Rural-Urban Continuum Codes (where higher values indicate lower urbanization) reflects the fact that many LGUCs are classified as metro areas. This is the case for all the counties in our data with the highest maxPCC values. The *median income* also has a negative coefficient, indicating that poorer counties show a higher maxPCC value. On a similar note, if there is a sufficient background case rate, then the change in case rate due to the change in interactions may not stand out sufficiently. In the regression analysis, the *entropy of the case rate* (computed as Shannon entropy of the normalized case time series, proposed by [9]) attempts to capture the strength of the signal in the case data, and is higher for counties that do not have noticeable surges in COVID-19 case rates. The coefficient for this feature is negative, which indicates that the stronger the spike the better the correlation with the interaction curve.

We note that other factors such as mask use prevalence, testing rates, and university operations (online vs in-person) may have an impact on whether high numbers of interactions lead to case spikes. However such datasets were not either universally available or did not match the timeframe of our study, so were not included. The goal of this study is to establish the utility of aggregate mobility statistics in creating early warning systems for universities. Thus it may not be exhaustive in explanatory variables and is not intended as a forecasting exercise. Further modeling of contact durations and disease transmissions using mechanistic models will definitely help improve the utility of such datasets. Operating on such anonymized datasets and computing aggregate yet useful statistics with privacy guarantees remains an active area of investigation. While most universities have tided over the case surges due to Fall 2020 reopening, the threat of subsequent outbreaks and impact on communities will persist until the rollout of a vaccine to the general population. Beyond retrospectively explaining case incidence in the past few months, this study can be seen as highlighting the value of proximity statistics as early warning indicators to case investigators and resource planners. As universities close early and plan their reopening strategies for Spring 2021, integrating such real-time indicators in their planning process will help with their response. We strongly believe that along with comprehensive testing and isolation plans, learning from the collective lessons of Fall experience, especially success stories, we could avoid similar spikes in the future despite surging case rates across the country.

## 4 Materials and Methods

### Datasets

We use data obtained via a third-party SDK operated by X-Mode [3] which serves multiple mobile applications and provides anonymized mobility traces with persistent identifiers (see Table S3 for statistics). We use TIGER/LINE shapefiles (for data extraction) [7] and medial income statistics [1] at county level from the US Census Bureau. County populations and COVID-19 case counts are obtained from USAFacts [2], while the RUCC codes were obtained from USDA [10]. Finally, we use a manually curated list of land-grant universities and their corresponding counties (Table S4). Additional details on all features considered are provided in the Supplementary material.

### Trajectory extraction

From the raw mobility traces obtained from X-Mode, we use Census shapefiles to first extract traces corresponding to the US and separate records per LGUC. We use a stop-point detection algorithm from scikitmobility [12] to post-process the raw traces with a specified level of sensitivity (time interval 5 minutes, distance threshold 10m). Note that this step is prior to interaction computation, i.e., the time interval and distance threshold used in this step are distinct from the parameters used in interaction count computation below.

### Interaction statistics

We define an interaction as the co-occurrence of mobile device pings within a spatial distance threshold *d* and within temporal interval threshold *t*. Although more accurate interaction counts could hypothetically be obtained by interpolating travel to produce more continuous trajectories, the current method provides a lower bound on such interactions. Further, the use of stop points and subsequent deduplication allows us to obtain the number of unique interactions for each distinct user (i.e., device). We then compute the mean number of interactions per county. In our analysis, we set *d* = 2*m* and *t* = 1 *minute*. We also performed sensitivity analyses to these parameters (shown using Story County, Iowa as an example) and found that the results are robust to slight variations in *d* and *t* (Figure S2).

### Correlation analyses

We used two metrics, the Pearson Correlation Coefficient and Granger Causality, to measure the correlation between the obtained daily interaction statistics and case rates for the LGUCs of interest. Before analysis, county level case counts are converted to case rates per 100,000 and both case rates and mean interactions are smoothed by computing 7-day moving averages. The former is done to normalize across different population sizes, while the latter eliminates any within-week reporting artifacts and noise in trajectory or case data. PCC between the case rates and the mean interactions time series is obtained across different lead times and the maximum PCC and corresponding lead time are reported. We test Granger causality using python statsmodels [15] and report value for SSR-Chi square test.

### Regression analyses

We perform regression analyses for explaining the observed maxPCC values with various exogenous variables related to population characteristics and COVID-19 epidemiology. The features were normalized before the regression to account for the varying ranges of magnitude. We used three feature selection methods (forward, backward and stepwise) applied to continuous and quantized values of maxPCC to identify the statistically significant features. Among these models, we report the model obtained using the features obtained by backward selection. We find that the university population, total LGUC population, median household income, the RUCC type of the county and the entropy of case rate are strongly correlated with the maximum PCC (*R*^{2}=0.504) (Table 1). A full list of the features, description of the feature selection methods and summary of all six linear regression models (Table S1) can be found in the Supplementary material.

## Data Availability

Aggregated statistics visualized in the manuscript are available upon request.

## Privacy and Ethics Statement

Anonymized datasets were obtained from X-Mode which collects location information from smartphone applications that use its software development kit (XDK). Users have the ability to limit/disable tracking and/or redistribution of their data. We refer the readers to https://xmode.io/xmode-privacy-policy-2/ for details on their Privacy Policy. The study at University of Virginia was approved by the Institutional Review Board for Social and Behavioral Sciences as part of the Human Research Protection Program. The paper only publishes aggregated derived statistics at the level of counties and on a daily basis guaranteeing individual privacy.

## Trajectory extraction and Stop point detection

For the purposes of our study the dataset spans 1 July, 2020 to 15 September, 2020 and focuses on mobility traces in the LGUCs of interest. From the mobility traces obtained for USA for a given day, we separate the pings by each LGUC. Then we use a stop point detection algorithm and generate a set of stop points from those raw trajectory points (Figure S1). The purpose of this pre-processing step is to filter out the records that do not contribute to significant interaction. For instance, this eliminates the spurious pings generated in transit and provide a ‘cleaner’ trajectory with fewer records for subsequent calculations. This also helps remove users who did not spend significant time within the county, but were transiting through the region.

We use the stop-point detection method implemented in the scikit-mobility [5] library. The stop point detection method requires one temporal constraint and two spatial constraints. The temporal constraint specifies the minimum number of minutes an individual has to spend at a location for that record to be considered a stop point. The two spatial constraints are stop radius factor and spatial radius. The product of these two parameters gives us a distance threshold. Any point from the candidate trajectory point within this distance threshold will be considered as a single stop point and the final stop point coordinates are the median latitude and longitude values of the points found within the specified distance threshold. We use 5 minutes as our temporal constraint. For the spatial constraints, we set spatial radius to 0.2 km and the stop radius factor as 0.05. Thus the distance threshold is 10 meters.

## Measuring Interaction rate

After performing pre-processing on the previous step, we estimate the number of interactions occurring on each day for each county. Human interaction is associated with both temporal and spatial proximity. Intuitively, an interaction occurs when two persons are present close to each other (spatial) simultaneously (temporal). Following this, we define that an interaction happens when two people are present at a location within a spatial threshold *d*, and the difference of their time of pings is within a temporal threshold *t*. The data for a county for each day consists of a set of pings *P*. The set of pings are the stop points obtained from the previous step. Each ping *p*_{i} of *P* is defined by a tuple as follows: *p*_{i} = (*u*_{i}, *lat*_{i}, *lon*_{i}, *ts*_{i}). Here, *u*_{i} denotes an anonymous user. Although one user may have multiple *u*_{i} if he/she possesses multiple devices, for our experiment we consider each distinct *u*_{i} to be a separate user. *lat*_{i} and *lon*_{i} represent the latitude and longitude of the ping, respectively, and *ts*_{i} denotes the timestamp when the ping was captured. From the pings collected for each day, we calculate the number of unique interactions for each distinct *u*_{i}. After calculating the number of unique interactions for each distinct user *u*_{i}, we obtain the mean interaction count for that day for the particular county.

## Correlation Measurement

After obtaining the time-series data of interaction for a county, we aim at measuring whether any correlation exists between these data and the case rate data. As mentioned in the main text, correlation analyses are performed on 7-day smoothed versions of the mean number of interactions per county and the case rates per 100,000.

## Pearson Correlation

This is a statistic that measures the correlation between an exogenous variable *X* and an endogenous variable *Y*. It takes *X* and *Y* as input and generates a value between +1 and -1. A value close to 1 represents a strong positive correlation between *X* and *Y* (*Y* increases as *X* increases). A value close to -1 represents a strong negative correlation (*Y* decreases as *X* increases). A value of 0 indicates that there is no correlation between *X* and *Y* .

For our purposes, *X* is the time series of interactions and *Y* is the time series of case rates. We shift *X* by some lead time and compute the Pearson correlation coefficient for each such lead time. This not only tells us whether *X* and *Y* are correlated, but also gives us the optimal lead time that lets us predict *Y* from *X* if they are strongly correlated.

## Granger Causality Analysis

The Granger causality test is a statistical hypothesis test for determining whether one time series is more useful in forecasting another time series than the past values of the other time series itself. A time series *X* is said to “granger-cause” *Y* if it can be shown that the past values of *X* along with past values of *Y* provide more statistically significant prediction of the future value of *Y* than the future value of *Y* obtained from past values of *Y* alone. This is done through a series of chi square-tests and F-tests.

## Linear regression

The following features were considered in linear regression for predicting maxPCC of each county. For each feature, we first define what the feature represents and also explain how the feature was pre-processed for regression analysis, wherever applicable.

Standard deviation scale of interactions: The ratio of the standard deviation to the mean for the interaction rate time-series data. (Feature was normalized before usage)

Standard deviation scale of cases: The ratio of the standard deviation to the mean for the case rate time-series data. (Feature was normalized before usage)

Maximum standard deviation of interactions. (Feature was normalized before usage)

Maximum standard deviation of cases: (Feature was normalized before usage)

Income: Median household income for the year 2018 of the county as per the data provided by the Economic Research Service, USDA [1]. (Median Income values were in 100K range. So, the values were used on a scale of 100K in the regression analysis.)

Dynamic range of case rates: The difference between the maximum and minimum value in the case rate time series data. divided by the minimum value of the case rate time series data. (Feature was normalized before usage)

Dynamic range of interaction rates: The difference between the maximum and minimum value in the interaction rate time series data divided by the minimum value of the interaction rate time series data. (Feature was normalized before usage)

Entropy of case rates: The amount of uncertainty in the time series data for the case rate. This was computed by quantizing the values of the case rate time series data, converting it into a probability distribution and computing the entropy of the probability distribution.

Entropy of interaction rates: The amount of uncertainty in the time series data for the interaction rate. This was computed by quantizing the values of the interaction rate time series data, converting it into a probability distribution and computing the entropy of the probability distribution.

Expected age group: Weighted average of the age group of people in the county as per the data of 2019 provided by the US Census Bureau [2]. (Feature was normalized before usage)

University population: The number of people affiliated with the corresponding land grant university of the county as per IPEDS [4]. (On a scale of 100K)

Population: Population of the county (On a scale of 100K)

University percentage: Ratio of 11 to 12.

User coverage: The number of unique users in our dataset. (On a scale of 100K)

User coverage percent: Ratio of 14 to 12.

Social distancing index: Average social distancing index of a county as provided by University of Maryland [6]. (Feature was normalized before usage)

Type: RUCC Type of a county, stating whether a county is urban or rural. This is obtained from United States Department of Agriculture [3]. (RUCC type ranges from 1 to 9. The values were divided by 10 before using for regression analysis.)

Out trips: Average number of out-of-county trips of a county as provided by University of Maryland [6]. (Feature was normalized before usage)

## Feature selection methods

### Forward selection

The steps for the forward selection method are as follows:

Fix a significance level (0.05, in our case). Let

*C*be set of the candidate features and*S*be the set of currently selected features.Fit all possible regression models considering one feature at a time from

*C*and the features from*S*.Select the feature

*c*∈*C*that has lowest*p*-value over the constructed models.If the

*p*-value of*c*is less than the fixed significance level, then add*c*to*S*, delete*c*from*C*. Repeat step 2 to 4. Otherwise finish the process and return*S*.

### Backward elimination

The steps for the forward selection method are as follows:

Fix a significance level (0.05, in our case). Let

*C*be set of the candidate features.Fit a model considering all features in

*C*. Select the feature*c*∈*C*that has the highest*p*-value.If the

*p*-value of*c*is more than the fixed significance level, then delete*c*from*C*. Repeat steps 2 to 3. Otherwise finish the process and return*C*.

### Stepwise selection

The steps for stepwise selection method are as follows:

Fix a significance level (0.05, in our case). Let

*C*be set of the candidate features and*S*be the set of currently selected features.Fit all possible regression models considering one feature at a time from

*C*and the features from*S*.Select the feature

*c*∈*C*that has lowest*p*-value over the constructed models.If the

*p*-value of*c*is less than the fixed significance level, then add*c*to*S*. Otherwise, finish the process and return*S*.Fit a model considering all features in

*S*. Select the feature*s*∈*S*that has the highest*p*-value.If the

*p*-value of*s*is more than the fixed significance level, then delete*s*from*S*. Repeat steps 5 to 6. Otherwise, go to next stepSet

*C*=*C*−*S*If

*C*= ∅, finish the process and return*S*. Otherwise, go back to step 2 and repeat.

## Acknowledgement

The authors would like to thank members of the Biocomplexity COVID-19 Response Team and Network Systems Science and Advanced Computing (NSSAC) Division for their thoughtful comments and suggestions related to epidemic modeling and response support. We thank members of the Biocomplexity Institute and Initiative, University of Virginia for useful discussion and suggestions. This work was partially supported by National Institutes of Health (NIH) Grant R01GM109718, NSF BIG DATA Grant IIS-1633028, NSF DIBBS Grant OAC-1443054, NSF Grant No.: OAC-1916805, NSF Expeditions in Computing Grant CCF-1918656, CCF-1917819, NSF RAPID CNS-2028004, NSF RAPID OAC-2027541, US Centers for Disease Control and Prevention 75D30119C05935, DTRA subcontract/ARA S-D00189-15-TO-01-UVA. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding agencies.

## References

## References

- [1].
- [2].
- [3].
- [4].
- [5].
- [6].