Human mobility data from the BBC Pandemic project

We present human mobility data for the United Kingdom collected from the "BBC Pandemic", a public science project linked to the BBC Four television documentary of the same name. Mobile phone GPS trajectories submitted by users and collected over a 24 hour period were aggregated to construct anonymised origin-destination flux matrices at the local administrative district (LAD). We use these data to explore how mobility patterns change with age and employment status - unique stratifications that are not available from other publicly and privately held mobility data sets. We validate the consistency of the aggregated BBC mobility data set against census workflow data and illustrate how the systematic differences in mobility rates with age affect the risk and pattern of transmission between regions with 18-30 year old's contributing the greatest risk of transmission to adjacent regions, but older 60-100 years playing the most important role in more remote low-density locations.


Introduction
To address these limitations, the BBC Pandemic project recruited over 86,000 participants in the United Kingdom between September 2017 and December 2018 as part of a public science project linked to a BBC Four documentary [31]. Here we present and analyse mobility data from 43,291 participants aggregated to the level of the observed flux between local administrative districts and stratified by age and employment status. The only other open source mobility data currently available for the United Kingdom is the 2011 census workflow data [1]. Published by the Office of National Statistics under version 3.0 of the open government licence this data can be freely shared, adapted and republished for both commercial and non-commercial purposes [4]. We compare the BBC mobility data set to this census workflow data which only captures commuting patterns of adults to and from their usual place of work. We estimate -and systematically compare -human mobility models for the United Kingdom and impute national level origin-destination matrices for the total BBC data set and four coarse grained stratifications by age. We use these imputed commuting matrices to demonstrate how the risk of transmission between regions changes for each age-group compared to using aggregated data.

Human Mobility Models
So-called gravity models, where the rate of migration between spatially segregated populations is modelled as a function of the distance between locations and their relative population size, have proven to be exceptionally popular for epidemiological modelling [24,45,19,42,22,12,17,25,30,33,6]. Taking its name and form from the Newtonian law of gravitation, the theory actually has its origins in the social sciences and in particular transportation theory [20]. However, beyond the dependence on population size and distance there has been very little consistency in the definition of gravity models by different authors.
The most basic formulation of the gravity model has the convenient -but unrealistic -characteristic that the population flux between two locations only depends on the local characteristics of the two interacting populations. These foundational models take no account of the distribution of population -or connectivity -of the population between two points. The Radiation model [40] was developed to account for these effects explicitly by construction and in some human mobility data sets achieves similar -or better -predictive performance than gravity models despite having no free parameters to estimate. This parsimony comes at the expense of a lack of flexibility with the Radiation model being outperformed by classical gravity formulations when factors other than population density -such as the types and opportunities of travel -are more important. The extended radiation model [47] addressed this limitation by introducing a single scaling parameter α which can be interpreted as defining a characteristic length scale (l) for trips with a region.
There have been several attempts within the transport theory field to address these same issues within the gravity modelling framework -in particular the intervening opportunities model [41] and the competing destinations model [21]. We consider two variants of the intervening opportunities model -the Schneider formulation examined by [34] and Stoufer's Rank Model as recently revisited in the context of modelling historical measles epidemics in England and Wales [2]. Finally, we consider the Impedance model [39]. Taking inspiration from the Ohmic law, the Impedance model was proposed as parameter free mobility model and compared to gravity and radiation laws in a model of the 2010 Haiti cholera epidemic [39].
Human mobility data sets typically capture a snapshot of individuals movement for which the total flux of individuals moving (or commuting) from a region is a fixed margin. For such data it is convenient to model the probability of moving given a particular home location separately from the choice of destination [40]. We therefore first compare the mobility rates (proportion of users that have different origin-destination locations) for the BBC data to census workflow data from the United Kingdom, then estimate constrained (singly or work constrained) variants of each mobility model [34].
All of these models have previously only been estimated from and tested against either aggregate mobility data or indirectly with respect to infectious disease case reports. The BBC mobility data set offers the first opportunity to compare and assess the predictive ability of these models for different groups of society.  Table 1: Number of BBC users by member nation of the UK and number of records from census workflow data Users are the total individuals with a paired origin-destination. Movers are the number of individuals whose destination location is different from their home location. We estimate the average per capita probability of moving for each data set by the ratio of these two numbers (presented to 3 s.f.).  Table 2: Number of BBC users by age and employment categories Users are the total individuals with a paired home and work location. Movers are the number of individuals whose destination location is different from their home location. We estimate the average per capita probability of moving for each data set by the ratio of these two numbers (presented to 3 s.f.). Note that Under 18 is a category in both the age-stratified and employment data sets but is presented twice to emphasise that these are two distinct (not nested) stratifications of the total BBC mobility data set.

Data Set Users Movers P(move) per capita per day
Age ranges are open on the lower limit and closed on the upper: (0, 18], (18,30], (30,60], (60, 100].  is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

Data
There were two components to the BBC Pandemic study, one focused on the town of Haslemere [29], and another focused on the wider UK population [31]. Here we present mobility data from the UK national study. Upon starting the BBC Pandemic app users first entered their basic demographic information, including age, household size, gender and occupation. The app then recorded their location at a 1km grid scale across the UK at hourly intervals for a 24 hour period. At the end of this period, users provided key meta data including their gender, age, occupation, health today (self-assessed), number of people in household, maximum distance travelled (selfestimated) and mode of transport. Users were additionally asked to complete a detailed survey on the social contacts they made over the study period which are reported elsewhere [3].
Over 86,000 participants started the survey and filled out their profile. Participants with no encounter or location data were excluded, as were users whose location recordings were all outside the UK leaving 47,741 user with GPS trajectories. A small subset of users repeated the survey and provided multiple observation periods. To protect the anonymity of these users we only consider the location data collected during the first 24 hour period of observation for this study. As a further step to ensure anonymity of users we aggregated individual user trajectories to origin and destination locations at the mid-layer super output area (MSOA). As the mid-layer super output areas nest within the local area districts (LAD), we map the MSOA code to the corresponding LAD to define origin and destination locations at this higher spatial scale and calculate the final origin-destination flux matrices Ω ij [8] that record the number of users with origin i and destination j.
We use the modal location to define users origin (home location) and considered two alternative definitions of destination based on the furthest extent and second (next) most frequent location from home. For the purposes of this study we focus on analysis on the 'next' (most frequent) origin-destination matrices as these are the most consistent conceptually with the census work flow data. (Full details of these definitions are given in supplemental information).
To provide some context on changing patterns of mobility over the working week we tabulate the start time of each individuals observation period aggregated by day of week (Table 3). The majority of users started the app on a weekday (37,322). The highest single day of sign-up was on the Thursday that the associated BBC Pandemic documentary was originally broadcast (22nd March 2018). 2,981 users signed up during the hour of broadcast alone, with a total of 7,796 users starting the app on that day. The next highest day of sign-up (with 4,967 new users) was also a Thursday (28th September 2017) corresponding to the second day of a social media campaign run by the production company to promote the app. Monday  3,340  Tuesday  3,476  Wednesday  6,945  Thursday  16,094  Friday  7,467  Saturday  3,168  Sunday  2,801   Table 3: Number of users by weekday of start date.

Weekday Users
Given the relative sparseness of the data set at the MSOA level, we focus on analysing the patterns of mobility at the LAD level. Raw data, estimated models and imputed flux matrices for both the next and furthest extent definitions are available in a public repository along with code developed for this manuscript (https://github.com/BBCPandemic/BBCMobility).
Of the 43,291 users in the BBC mobility data set (henceforth the Total BBC data), 25,114 had inferred home and next most frequent locations within the same LAD. We define the complementary set of 18,177 users with origin and destination locations in different LADs as 'movers' (Tables 1,2, Figure 1) rather than commuters to acknowledge that the displacements in the BBC mobility set capture a wider range of human mobility than the strictly commuting flows measured by 5 . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint census work flow data.

Modeling the origin-destination flux matrix
Flux patterns can be decoupled into two model components: the probability that someone moves between locations, and the location that someone goes to given that they do move. The origindestination flux matrix Ω ij can be correspondingly decomposed. The diagonal elements represent users who stayed in the same location and the off-diagonal elements correspond to the 'movers'. Thus, the flux matrix of movers alone is given by the origin-destination flux matrix with the diagonal elements set to zero: The total eflux from location i is given by the sum over all potential destinations j∆ i = jΩ ij while the total number of users with origin i (i.e. including the non-movers) is ∆ i = j Ω ij . The proportion of movers for location i is given by p i =∆ i /∆ i . We define the conditional movement matrix: σ ij =Ω ij /∆ i as the proportion of movers from i who choose destination j. It can then be seen that σ satisfies the normalisation j σ ij = 1 and by definition σ ii = 0. The origin-destination flux matrix can thus be decomposed in terms of the conditional movement matrix σ ij and probability of movement p i : where δ ij is the alternating Kronecker delta symbol.

Probability of moving
The data are now summarised in two separate components: the proportion of the population that move (p i ) and where the movers go (σ ij ). We first we consider the first part of this: the probability of movement, comparing the BBC and census datasets, and also explore how probability of movement varies by age and employment status.
Census outputs are prepared and published separately for Scotland, Northern Ireland and England & Wales at different levels of spatial aggregation. The Northern Ireland Statistics and Research Agency excludes all responses with work locations outside of Northern Ireland. The Office of National Statistics and Scotland's Census publish aggregate numbers for the total work flows to each member nation, but not the sub-national location. For consistency, and to make comparisons between the member nations of the United Kingdom, we separate the combined England and Wales data set and treat the four resulting public census outputs as separate data sets for the purposes of model estimation and comparison (Table 1). Sub-national movement rates are comparable between the census workflow and BBC mobility data sets (to 1 significant figure).
The meta-data collected by the BBC pandemic app allows us to further stratify the Total BBC data set by age and employment categories. For age we consider four coarse grained categories -under the age of 18 (BBC Under 18), over [18][19][20][21][22][23][24][25][26][27][28][29][30], over 30 and under 60 (BBC 30-60) and over the age of 60 (BBC 60+). With respect to employment status we reuse the under 18 category (BBC Under 18), and define three alternative subsets for analysis corresponding to the group of users over the age of 18 and in education (BBC Education), over the age of 18 and in employment (BBC Employment) and over the age of 18 and not in employment, education or training (BBC NEET). The per-capita mobility rate for these strata of the BBC mobility data set vary from a minimum of 0.245 for BBC Under 18 to a maximum of 0.460 for BBC 30-60 ( Table 2).

Geographic variation
The proportion of movement p i also varies considerably between regions ( Figure 2). There is no clear relationship with the size of the resident population and only a weak association between the area of LADS and p i . However, there is a clear linear relationship between p i from the BBC data 6 . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted February 23, 2021. ; is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted February 23, 2021. ; and the corresponding probability of movement from the census data, with an R 2 = 0.71, suggesting that both instruments are measuring a common source of variability in mobility between regions.
With a median of 81 users per LAD (range 2-948) the raw BBC data is too sparse to estimate the geographic variation in population flux from each LAD for each stratification of interest (i.e. age or employment status). To address this limitation and be able to impute movement rates for each age and employment group we model the per location probability of moving using a generalised linear model (with logit link) and random intercept term for each LAD (i): where group is a categorical variable that adjusts the background rate of movement for each level of the age or employment strata. We therefore assume that the difference in mobility between groups is constant across the UK, proportionally adjusting the regional mobility captured by the random intercept term. Random effects models were estimated using the lme4 package [11] in the R statistical language [38].
Model fit was assessed graphically by plotting the predicted probability of movement against the observed data (Supplemental is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

Mobility Models
Here we now focus on the model component of where people go, conditional that they do move, in essence modelling the conditional movement matrix σ ij as defined above.

Gravity Model
The classical gravity formulation assumes that the flux between two locations depends on the product of the size of the donor (N i ) and destination (N j ) populations (with scaling parameters τ 2 , τ 1 ) divided by a function of the relative distance (r ij ) between them: However, this general form is arguably too flexible, and in particular is unbounded with no upper limit on the predicted number of commuters from the donor patch. We can normalise the gravity model by defining a vector of normalisation constants n G i for each donor patch i by summing over the set of recipient patches j (for all (j = i): forming a so called singly constrained gravity model. Recasting the model in terms of the conditional movement matrix σ ij , the donor term N τ2 i cancels out and is redundant in this formulation. We therefore define our basic gravity law model for i = j as: with σ ii = 0 by definition and: We consider gravity-type models with three different distance scaling functions, power-law f (r) = r ρ ij , exponential f (r) = e rij /ρ and offset f (r) = 1 + rij ρα α .

Competing Destinations Model
Following the same reasoning, the competing destinations model [21] can be defined for our purposes for i = j as: The parameter δ adjusts for the effect that other locations -the competing destinationshave on the flux. For a negative value of δ the effect of competing destinations is to reduce the flux to a particular location, whereas for a positive δ the flux between the two locations is enhanced by the presence of alternative destinations. When δ = 1 the competing destinations term vanishes and we recover the classical gravity model. As the gravity model is nested within competing destinations we do not fit these simpler gravity model formulations separately. We therefore compare three gravity type models: competing destinations with power law distance function (CDP), competing destinations with exponential distance function (CDE) and competing destinations with offset distance function (CDO).

9
. CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

Extended Radiation Model
The radiation model has no free parameters to estimate and is normalised by construction: where s ij is the total population in a circle of radius r ij centred at i and excluding N i and N j themselves [40]. The extended radiation model (ERad) introduces a single scaling parameter α [47].
Yang et al. [47] proposed that α should vary between between regions with different characteristic length scales (l defined as mean length between regions being modelled) such that: For the UK local authorities, McNeill et al. [36] calculated l = 19km and thus we would expect α = 0.43. Here we estimate α as a free parameter to compare with this predicted scaling law.

Intervening Opportunities Model
Schneider's intervening opportunities (IO) model [34] depends on the same matrix s ij as the radiation model and is defined for i = j as:

Stoufer's Rank Model
Stoufer's (Sto) Rank model [2] also depends on the same matrix s ij as the radiation model and can be defined as:

Impedance Model
The impedance (Imp) model [39] is defined as: In common with the radiation model this mobility model has no free parameters to estimate.

10
. CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

Inferential framework and model comparison
Each row (Ω i ) of the mover flux matrixΩ ij could be considered as a multinomial sample [42] witĥ ∆ i trials and probability vector σ i equal to the corresponding row of the mobility matrix σ ij : However, for a multinomial likelihood the variance of observations scales linearly with the number of trials. As the efflux∆ i scales with local population size, a multinomial likelihood for these data will be dominated by contributions from large urban centres potentially introducing systematic biases and not allowing for the possibility of over-dispersion in the sampled flux. The expected rate of flux for a given mobility matrix σ i is: From which we can construct a negative binomial likelihood and explicitly allow the variance to scale with (origin) population size:Ω i ∼ negbin(ω ij , φ) Note that we use the ecological parameterisation of the negative binomial specified by the mean and shape parameter φ) and the variance of the flux leaving patch i is thusω ij +ω 2 ij φ .. We estimate posterior distributions for each model using Hamiltonian MCMC (as implemented by Stan [16] http://mc-stan.org/). To assess model fit and provide a basis for model selection we use approximate leave-one-out cross-validation [43,44]. For numerical stability we restrict the range of parameters such that 0 < τ, φ < 5, 0 < α < 1, −5 < δ < 5 and 0 < γ < 10 −4 . We restrict ρ > 0 for the offset and exponential competing destinations models (CDO, CDE), with the further restriction that 0 < ρ < 5 for the power law scaling (CDP). We choose Cauchy (0,5) prior distributions for all parameters. All further analyses were carried out in R [38].

Model checking and cross-validation
For each combination of model and data set, 4 Hamiltonian MCMC chains were run for the default 2,000 iterations, unless a greater number were required to pass diagnostic checks. Chains were well mixed for all combinations of models and data sets and passed standard convergence diagnostics. As a further predictive check of the model -and to provide a basis for model comparison and selection -we carried out approximate leave-one-out cross validation (LOO) [43].
This method uses Pareto smoothed importance sampling (PSIS-LOO) [44] to estimate the expected log pointwise predictive density:êlpd which measures the predictive accuracy of the model when a single observation is dropped out. The difference (∆êlpd) betweenêlpd for alternative models fitted to the same data provides a measure of their relative predictive accuracy. The standard error on the difference gives a measure of uncertainty with standard errors comparable to the magnitude of the difference suggesting the relative predictive accuracy of the two models is indistinguishable.
The estimated (Pareto) shape parameters (k) for the predicted distribution ofêlpd can be used to judge the reliability of the estimate ofêlpd for each data point. The estimate ofêlpd is considered reliable fork < 0.5, performance may still be reliable for values ofk up to 0.7. Values ofk > 0.7 suggest that the data points are highly influential to the estimated posterior and potentially introducing bias.
Models estimated from the full BBC mobility and Census data sets have five locations withk > 0.7 and two withk > 1.0 suggesting these locations are highly influential to the estimated posterior distribution and potentially introducing bias. Most of these issues can be traced to the uniquely large number of commuters to two specific districts within London: namely the City of London and Westminster. The City of London and Westminster (and to a lesser extent other boroughs of London) attract anomalously large numbers of commuters for their size. This can be visualised by plotting the numbers of commuters in (influx) against leaving (efflux), (Supplementary figure  11). For the majority of LADs this relationship is symmetric with both the influx and efflux of commuters approximately proportional to the resident population size. However the City of London and Westminster have orders of magnitude higher commuters in than residents who commute out. This extreme lack of fit of gravity type models results in a systematic bias to parameter estimates 11 . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted February 23, 2021. ; as illustrated by comparing the posterior estimates between the full set of 326 English LADs and a data set dropping the City of London and Westminster (324 English Lads, Supplementary figure  12).
The Highland LAD in Scotland has a similar impact on parameter estimates from the BBC mobility data set (due in this case to the small flux of users to this location). Removing these three locations, then all of the models estimated from the BBC mobility data havek < 0.7. However, even after these steps between 1 to 9 LADS havek > 0.7. for the gravity type CDO, CDE and CDP models estimated from census data. Given the greater overall predictive performance of these models on the BBC mobility data set we do not wish to simply set aside these models.
Although the specific problematic LADS vary between models, the value ofk increases directly with the size of the resident population. Gravity models are notoriously sensitive to contributions from high population density locations. This, and the disparity in sample sizes between the BBC mobility and census data sets, was our motivation for using a negative binomial likelihood where the shape parameter φ controls the variance to mean scaling relationship. For a mean flux y, the variance of the negative binomial likelihood is inversely related to the shape parameter y + y 2 φ . Reducing the value of φ reduces the influence of larger populations to the likelihood, hence we can carry out a sensitivity analysis to the influence of the outlier values by fixing the value of φ and reducing it until the pareto shape parameter for all of the observations < 0.7. This was achieved for a fixed shape parameter of φ = 0.1. The models estimated from census data with a fixed value of φ reduced the absolute values of the likelihood but did not change the rank ordering of models (described below). Although the point estimates of the gravity parameters do change systematically with φ, they still have overlapping credible intervals in the range between the estimated value and fixed value (0.1) where the census models pass all diagnostic checks (13). For the purposes of comparison to the BBC mobility data set we present the models with estimated value of φ, which have a greater overall predictive performance (as measured independently via the common part of commuters index described in the next section).

Model Selection and Predictive Performance
The rank order of mobility models based on (∆êlpd) is identical for the UK level stratifications of the BBC mobility data set (Tables 5, 6, 7) -with the competing destinations (CDO) model favoured above the other models for all data sets. The ranking of the second and third ranked models varies between member countries of the UK, but the differences are small compared to the standard deviation implying there is little to choose between the predictive ability of the three gravity formulations for these data. The English census data is the only data set where an alternative model -the Extended Radiation model is preferred.
As a final posterior predictive check of the estimated models we consider the Common Part of Commuters (CPC) index introduced by [47] which can be defined as: i jΩ ij +ω ij whereω ij is a posterior predictive flux matrix from a mobility model fitted to the empirical flux matrixΩ ij and the sum is over all donor (j) and recipient (i) patches within the meta-population. A value of 1 is calculated when there is perfect agreement between the two matrices, with 0 when there is no agreement. Figure 3 presents the posterior predictive distributions for the CPC for each combination of mobility model and data set. This measure is in broad agreement with the ranking achieved through LOO cross-validation -with the competing destinations model favoured in the majority of cases, but once again with relatively little difference in predictive performance between the top three models.
12 . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint  Table 4: Mobility model comparison for the sub-national Census workflow data sets Mobility models are ranked by their predictive accuracy as measured by the difference (∆êlpd) in the expected log pointwise predictive density (êlpd). The competing destinations model, with offset distance kernel, is favoured for the census workflow data from Wales (W), Scotland (S) and Northern Ireland (NI). For the English census data, the Extended radiation model is ranked first. The difference in predictive accuracy between the top ranked models is greater than the standard error for the English census data, but of a comparable magnitude for the relatively smaller Scottish, Welsh and Northern Irish data sets. The differences between the next ranked models are much smaller than with respect to the top ranked model.  Table 5: Mobility model comparison for the sub-national BBC Total mobility data sets Mobility models are ranked by their predictive accuracy as measured by the difference (∆êlpd) in the expected log pointwise predictive density (êlpd). The competing destinations model, with offset distance kernel, is favoured for all four subsets of the BBC Total mobility data from England (E), Wales (W), Scotland (S) and Northern Ireland (NI). The difference in predictive accuracy between the top ranked models is greater than the standard error for the English census data, but of a comparable magnitude for the relatively smaller Scottish, Welsh and Northern Irish data sets. As with the census data sets the differences between the next ranked models are much smaller than with respect to the top ranked model.  . The competing destinations model, with offset distance kernel, is favoured for all data sets. In contrast to the sub-national data sets the ranking of models estimated from the BBC data sets are consistent across age groups and the difference between the predictive accuracy of the top ranked model is greater than the standard deviation of the difference. The differences between the second ranked models, in this case the exponential distance kernel and Extended Radiation models, are once again much smaller than the difference with respect to the favoured model (CDO).

13
. CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted February 23, 2021. ;

Figure 3: Posterior predictive distributions for the Common Part of Commuters (CPC) index
The Common Part of Commuters (CPC) index measures the agreement between the posterior predicted flux between each location and the empirical flux used to estimate the model with 1 indicating perfect agreement and 0 no agreement. Performance of all seven candidate mobility models (Competing Destinations -CDO, CDP, CDE, Extended Radiation -ERad, Stoufer's Rank Model -Sto, Impedance -Imp) is compared for (top left panel through to bottom right) the (subsampled) census data sets (England, Wales, Northern Ireland -NI and Scotland, sub-national subsets of the BBC Total data set. Note that the predicted flux for the Impedance models depends on no free parameters, so for these models the average CPC is a function only of the topological distribution of local administrative districts for the United Kingdom and the member countries (census data). The variance of the predictive distribution for CPC for the Impedance models does vary between data sets through the estimated shape parameter. The rank ordering is consistent with the results of the LOO analysis, as are the relatively small predictive differences between the three gravity type models and the Extended radiation model. 14 . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint  In contrast to the sub-national data sets the ranking of models estimated from the BBC data sets are consistent across employment groups and the difference between the predictive accuracy of the top ranked model is greater than the standard deviation of the difference. The differences between the second ranked models, in this case the exponential distance kernel and Extended Radiation models, are once again much smaller than the difference with respect to the favoured model (CDO).

15
. CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted February 23, 2021. ; Figure 4: Sub-national posterior estimates for the competing destinations (CDO) model Posterior distributions for the population density (τ ), distance (ρ, α) and competing destinations (δ) scaling parameters from the (sub-sampled) Census data from England, Wales, Scotland and Northern Ireland and the corresponding subsets of the BBC mobility data. At the aggregate level parameter estimates are consistent between the BBC Total and Census data sets reflecting systematic differences in mobility patterns between England, Scotland and Wales. Northern Ireland estimates presented for completeness -note the considerably larger uncertainty resulting from the small size of this data set.

Impact of age and employment status on mobility patterns
We first compare the gravity parameters for the BBC Total data set estimated from each of the member nations of the UK and compare with estimates from the census workflow data ( Figure  4). The pooled UK estimates are most consistent with Census estimates from England (which has the largest population and number of LADs) but there are systematic differences in estimates from England and the smaller member nations. Containing only 10 LADs the Northern Ireland data set is clearly to small to support inference of this model, with huge uncertainty in the results should not be interpreted and are presented only for completeness. The Scottish and Welsh data sets demonstrate an increased importance of population size (larger τ value) and a longer spatial scale (ρ, α parameters) compared to estimates from England and the pooled UK data.
Given this sub-national variation it would be ideal to compare the mobility patterns of by age and employment categories of the BBC mobility data set at this level. However, given the sample size we must restrict comparison to models estimated at the national (UK) level. There is close correspondence between the age and employment groups suggesting that the 18-30, 30-60 and 60+ age groups behave largely like the education, employed and NEET categories.
Once again given that the vast majority of users were are in the 30-60 or employed categories, the parameters estimated from the BBC Total data set are statistically indistinguishable from those from the subset that were employed with overlapping posterior distributions ( Figure 5). There are systematic differences in the estimated gravity scaling parameters for Under 18s, those in Education and NEETS. Under 18s have a faster decay in mobility with distance than other groups 16 . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted February 23, 2021. ; https://doi.org/10.1101/2021.02.19.21252079 doi: medRxiv preprint Figure 5: Posterior estimates for the competing destinations (CDO) model by age and employment groups Posterior distributions for the population density (τ ), distance (ρ, α) and competing destinations (δ) scaling parameters estimated from the full UK BBC mobility data set, with comparison between models estimated to the stratified data sets with respect to age and employment groups.

17
. CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted February 23, 2021. ; https://doi.org/10.1101/2021.02.19.21252079 doi: medRxiv preprint (larger α) and experience the strongest impact of competing destinations. Population size is a more important predictor for those over the age of 18 and in full time education (higher estimated value of τ ) -which may reflect the location of universities, colleges and other higher education institutions within urban centres. By comparison NEET's range further (smaller α, but population density is less important in modulating their decisions (smaller τ ).
To facilitate the use of the BBC mobility data in simulation studies, we use our estimated models for the probability of movement p g i and choice of destination σ g ij to impute UK population flux matrices for each level of the BBC mobility data set (Figure 6): where N g i is the total population in group g resident in patch i. These imputed matrices are calculated using the point (median) posterior estimates and provided as supplementary information.

Force of infection for a multi-group commuter model
To explore the extent to which variation in commuting rates and patterns within a population translates to epidemiological risk we derive a commuter approximation for the force of infection for a generic SIR (susceptible S, infected I, recovered R) model within a meta-population with multiple movement groups. We assume that the force of infection is well mixed within each patch. The effective local force of infection within a patch can be conceptually thought of as being constituted by two parts: the local force of infection due to resident infectives and the extrinsic force of infection generated by infected movers resident within the local population and susceptible movers to other spatial locations.
Elaborating an earlier result from [27], [7] demonstrated that the magnitude of these two components can be explicitly derived from the mechanistic movement rates. The per capita movement rate κ g ij from patch i to patch j for members of group g can be calculated in terms of the inferred probability of movement p g i and conditional movement matrix σ g ij : where D is the average trip duration. The total movement rate from patch i to patch j will then be: As long as the return rate 1 D is small with respect to the generation time of the pathogen then the force of infection acting on susceptibles (S ig ) in patch i and movement group g can be approximated as: where κ g i = k κ g ik . We further assume that individuals differ only with respect to their mobility and have equal susceptibility and infectiousness and mix homogeneously with other individuals within the same patch. Under this assumption the force of infection only depends on the total number of infected individuals in each patch: I i = g I ig . The first term corresponds to the contribution to the force of infection acting on susceptibles within group g resident in patch i: This has also two terms, the first corresponds to the contribution from local infectives (who are not commuting) from all four mobility groups. The second term corresponds to the contribution of susceptible individuals from patch j meeting infectious individuals (from all mobility groups g) while commuting. Similarly, the contribution of local susceptibles encountering infectious individuals (of all groups) while commuting to another patch is: 18 . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted February 23, 2021. ; Figure 6: Imputed flux matrices for the Competing Destinations model Imputed flux matrices from the Competing Destinations model (median parameter estimates) calculated for each age group of the BBC mobility data set (Total, Under 18, 18-30, 30-60 and 60-100) and compared to the imputed flux from the England census data (predicted to the UK demography).
The effective population size within each patch is: where the population size N i = g S ig + I ig + R ig and we neglect higher order terms of the movement matrix O(σ 2 ) and higher.

Geographic risk of transmission
We use our multi-group commuter model to explore how the predicted geographic risk of transmission differs between an aggregate and age-stratified commuter model. Previous theoretical 19 . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint work has suggested that difference in mobility rates (or equivalently average trip duration D) are most important in the early stages of invasion when incidence is low [7]. As an illustration we consider an invasion seeded, as was the BBC pandemic, in Haslemere situated within the Waverley district in southern England. We consider the scenario where a novel pathogen has been introduced into a single location and identify the age-group that contributes the largest value to the net force of infection assuming a single infectious individual within each age group (Figure 7 A). We note first that although modulated by population density (7 C), distance from the seed location is the primary determinant of the net contribution to the force of infection in distant LADs ((7 B). While young people -in this case the 18-30 group -provide the largest net risk of transmission to another LAD -the relative contribution of older age-groups becomes more important for distant, low density areas (7 B) which also tend to have older populations (7 D). In supplemental information we explore a range of different seeder locations in comparable commuter districts across the UK (Falkirk, Belfast, Newport) and see the same basic pattern with one small variation. For low density districts proximal to the seed location the youngest Under 18 year old group, whose movements are closely constrained to their home location, are occasionally the dominant groupperhaps reflecting the size of school catchment areas.

Discussion
In this paper we have introduced mobility data from the BBC Pandemic project to perform an analysis of how commuting patterns differ between groups of individuals with respect to employment status. We estimate and compare seven alternative human mobility models and find that the Competing Destinations model (with offset distance kernel, CDO) provides the best fit and predictive ability as assessed by leave-one-out (LOO) cross validation and posterior predictive checks of the common part of commuters (CPC) index. The competing destinations model extends the classical gravity formulation by adding a term that adjusts the flux between two locations according to the network of alternative locations within the meta-population. Although it lacks the elegant theoretical origins and mechanistic interpretation of the Radiation model -the additional flexibility makes it more appropriate for exploring the differences in mobility patterns between different groups.
However we note that, despite requiring two less parameters than the favoured CDO model, the Extended radiation model has similar absolute predictive performance in terms of the common part of commuters (CPC) index. The estimated value of α = 0.49 (0.48 − 0.5) for the full UK BBC Total data set is in line, but slightly higher than, the expected value of α = 0.43 independently estimated by McNeill et al. [36] based on a length scale for average trips of l = 19km.
Users in the BBC mobility data set demonstrate a higher rate of mobility across employment categories. There are important differences in the rates of mobility between urban and rural areas for users in different employment and age groups which could potentially be important for modelling the invasion of novel pandemic pathogens as the increased mobility and range of older individuals and those outside employment could enhance the rate of spread in the early stages of an outbreak.
For this paper we constructed origin-destination flows from users GPS trajectories taking the most frequent location as the inferred home (origin) and considered two alternative measures for the destination. The results we present in this paper use the second most frequent location ("next") as the users destination. Repeating the analysis using the furthest extent and all recorded user locations (other than the origin) give the same qualitative results in terms of the relative performance of mobility models and differences between employment categories. We chose to focus on the "next" definition given it's logical consistency with the question asked in the UK census. To validate this assumption by comparing the common part of commuters (CPC) between the predicted flux from the England & Wales census data to the flux predicted by the best fit models to the BBC Total data (Figure 8). Both definitions lead to estimated models with predicted flux that has a high (and indistinguishable) degree of similarity to the flux predicted from models estimated from Census data. For our purposes in this paper, to explore how human mobility between different employment groups in society varies from patterns inferred from census data, it is clearly the most appropriate choice. However, it is not necessarily the most appropriate choice for predicting disease transmission, hence we include both alternative definitions and estimated models within the the on-line data respository.
An unavoidable consequence of the crowd-sourced nature of our data is that users knew 20 . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint  is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted February 23, 2021. ; Figure 8: Comparison of posterior predicted flux from Census model to BBC mobility for different measures of destination. We use the common part of commuters (CPC) measure to assess which destination measure is most consistent with Census flux data. Flux was predicted for the whole of the UK using samples from the competing destinations model (CDO) estimated from the English census data and compared to the predicted flux for the same model estimated from the BBC Total data set constructed using the next most frequent and furthest extent definitions for destination location. The two measures are indistinguishable in terms of the difference in predictive accuracy compared to the equivalent model estimated from census data from the four member countries of the UK.

22
. CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted February 23, 2021. ; when their movements were being recorded and chose when to start tracking on the app which could potentially have changed their behaviour. We expect that this effect will be mitigated to an extent by the vast majority of users that signed up on (or immediately after) the day of broadcast. Although recruitment through the app was open over the course of a year, the largest group of users signed up following the broadcast of the documentary. There was also a smaller wave of recruitment following a social media campaign before filming was carried out in the town of Haslemere. This temporal pattern highlights both the effectiveness of the documentary and public engagement activities on recruitment, but also the potential for bias to be introduced into our sample of user trajectories. The close correspondence between the predicted flux and gravity parameters from the Total BBC data and subnational census workflows provides another layer of reassurance about the representativeness of the BBC data.
Census workflow data (and privately held mobile data sets) have the advantage of being dense enough that the raw data can be used directly in commuter models [5] without the need to estimate the human mobility models necessary to interpret the BBC mobility data. The consistency of the total BBC data at the aggregate level to census data, belies the differences in both the probability of movement (from home) and the choice of destination for movers not in full time employment. The unique meta-data collected along with GPS traces from the BBC Pandemic app has allowed us to quantify these differences for the first time. Under 18s and the over 60s are both less likely to move and have destinations closer to home than those in employment. We used a multi-group commuter model to illustrate how the predicted spatial risk of transmission varies according to seeder cases in different age groups. These differences may be small in aggregate but could be critically important in assessing the risk of spillover between regions in the early stages of a pandemic or in the period immediately following the easing of lockdown restrictions.

23
. CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted February 23, 2021. ; https://doi.org/10.1101/2021.02.19.21252079 doi: medRxiv preprint

A Definition of Origin-Destination Matrices
User locations were first snapped to the nearest MSOA based on the generalised (20m resolution) shape file provided by the Open Geography portal from the Office of National Statistics (ONS). Locations outside of the boundary of any MSOA were either excluded or snapped to the MSOA with the greatest area of overlap within a 1km buffer centered around the user location. A home (origin) location was defined for each user as their modal MSOA. In defining the duration of time spent in a location we needed to account for missing location logs for some users who moved into an area with poor service, or switched off their phone, during the observation period. For such gaps we make the assumption the user remained in the last seen location until a new location was logged and use the duration of time within each location to calculate the modal location. In the event of a tie we chose the location with the least amount of time spent in the 12 hours between 7am and 6pm (inclusive). Users to which we could not assign a home location were removed from the data set for analysis.
For destination locations, two alternative definitions were considered -the furthest extent and the second most frequent recorded location after home (which we will refer to as the 'next' location for convenience). For the 'next' location we again ranked locations based on the total duration of time spent including the inferred location between gaps as described above. In the event of ties the location with the greater proportion of time between 7am and 6pm was chosen. Once again, users to whom we could not assign a unique destination location were removed from the data set. In total there were 4,450 users we could not assign a unique origin and destination location according these definitions leaving a total of 43,291 users within the final BBC mobility data set.
For users with all location records in the same MSOA, their home and destination locations are both set to this unique value. As the LAD origin and destinations are mapped from these MSOA locations, at the LAD level, users can therefore have the same inferred origin and destination locations (even though they have moved between different MSOAs over the course of the reporting period). is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

B Supplemental Figures
The copyright holder for this this version posted February 23, 2021. ; Figure 9: Predicted probability of movement from (age) random effects model Predicted per capita probability of moving (i.e. having a different origin and destination location) from estimated random effects model plotted against observed proportion from BBC mobility data set (move/N). Error bars indicate 95% bootstrapped prediction intervals, size the number of observations (N) and the 1:1 line is added for reference.

29
. CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint  is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint  is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted February 23, 2021. ; is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint  is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted February 23, 2021. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint  Panel D Fraction of local population in 60-100 year old age group. While the fall-off in the net force of infection is driven primarily by distance -the relative contribution of different age groups is shaped by population density and demography. The 18-30 year group dominates for cities and high density LADs, which also tend to have younger populations, while individuals in the 60-100 age group play the dominant role in transmission to low density areas. Under 18s play a more important role for low-density areas when they are proximal to the seed location -in this case Clackmannanshire and West Lothian.

35
. CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted February 23, 2021. ; https://doi.org/10.1101/2021.02.19.21252079 doi: medRxiv preprint Figure 15: Dominant age group contributing to local force of infection Set of choropleth maps exploring the predicted force-of-infection across the UK from a single infectious individual within each age-group located in Belfast, Northern Ireland (filled black on maps A,B) compared with demographic correlates (population density and fraction of population aged 60-100). Panel A The age-group with the largest contribution to the local force of infection with each LAD. Panel B The net force of infection within each LAD (sum over contribution from all age groups). Panel C Local population density within each LAD (Total population size per km 2 ). Panel D Fraction of local population in 60-100 year old age group. While the fall-off in the net force of infection is driven primarily by distance -the relative contribution of different age groups is shaped by population density and demography. The 18-30 year group dominates for cities and high density LADs, which also tend to have younger populations, while individuals in the 60-100 age group play the dominant role in transmission to low density areas. Under 18s play a more important role for low-density areas when they are proximal to the seed location -in this case Lisburn and Castlereagh.

36
. CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted February 23, 2021. ; https://doi.org/10.1101/2021.02.19.21252079 doi: medRxiv preprint Figure 16: Dominant age group contributing to local force of infection Set of choropleth maps exploring the predicted force-of-infection across the UK from a single infectious individual within each age-group located in Newport, Wales (filled black on maps A,B) compared with demographic correlates (population density and fraction of population aged 60-100). Panel A The age-group with the largest contribution to the local force of infection with each LAD. Panel B The net force of infection within each LAD (sum over contribution from all age groups). Panel C Local population density within each LAD (Total population size per km 2 ). Panel D Fraction of local population in 60-100 year old age group. While the fall-off in the net force of infection is driven primarily by distance -the relative contribution of different age groups is shaped by population density and demography. The 18-30 year group dominates for cities and high density LADs, which also tend to have younger populations, while individuals in the 60-100 age group play the dominant role in transmission to low density areas. Under 18s play a more important role for low-density areas when they are proximal to the seed location -in this case Rhondda Cynon Taf.

37
. CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted February 23, 2021. ; https://doi.org/10.1101/2021.02.19.21252079 doi: medRxiv preprint Figure 17: Dominant age group contributing to local force of infection Set of choropleth maps exploring the predicted force-of-infection across the UK from a single infectious individual within each age-group located in Cambridge (filled black on maps A,B) compared with demographic correlates (population density and fraction of population aged 60-100). Panel A The age-group with the largest contribution to the local force of infection with each LAD. Panel B The net force of infection within each LAD (sum over contribution from all age groups). Panel C Local population density within each LAD (Total population size per km 2 ). Panel D Fraction of local population in 60-100 year old age group. While the fall-off in the net force of infection is driven primarily by distance -the relative contribution of different age groups is shaped by population density and demography. The 18-30 year group dominates for cities and high density LADs, which also tend to have younger populations, while individuals in the 60-100 age group play the dominant role in transmission to low density areas. Under 18s play a more important role for low-density areas when they are proximal to the seed location -in this case the surrounding rural district of South Cambridgeshire.

38
. CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted February 23, 2021. ; https://doi.org/10.1101/2021.02.19.21252079 doi: medRxiv preprint Figure 18: Dominant age group contributing to local force of infection Set of choropleth maps exploring the predicted force-of-infection across the UK from a single infectious individual within each age-group located in Norwich (filled black on maps A,B) compared with demographic correlates (population density and fraction of population aged 60-100). Panel A The age-group with the largest contribution to the local force of infection with each LAD. Panel B The net force of infection within each LAD (sum over contribution from all age groups). Panel C Local population density within each LAD (Total population size per km 2 ). Panel D Fraction of local population in 60-100 year old age group. While the fall-off in the net force of infection is driven primarily by distance -the relative contribution of different age groups is shaped by population density and demography. The 18-30 year group dominates for cities and high density LADs, which also tend to have younger populations, while individuals in the 60-100 age group play the dominant role in transmission to low density areas. Under 18s play a more important role for low-density areas when they are proximal to the seed location -in this case the adjacent district of Broadland.

39
. CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint   is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint    is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted February 23, 2021.  is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted February 23, 2021. ; https://doi.org/10.1101/2021.02.19.21252079 doi: medRxiv preprint