## Abstract

Variation in individual susceptibility or exposure to infection accelerates the rate at which populations acquire immunity by natural infection. Individuals that are more susceptible or more exposed tend to be infected earlier and hence selectively removed from the susceptible pool, decelerating the incidence of new infections. Eventually, susceptible numbers become low enough to prevent epidemic growth or, in other words, the herd immunity threshold is reached. Here we fit epidemiological models, with inbuilt distributions of susceptibility or exposure, to SARS-CoV-2 epidemics in Spain and Portugal, to estimate basic reproduction numbers (*R*_{0}) alongside coefficients of individual variation (CV) and the effects of control measures. Herd immunity thresholds are then calculated as or , depending on whether variation is in susceptibility or exposure. Our inferences result in lower herd immunity thresholds than what would be expected if population imvmunity was to be induced by random infection or vaccination, 1 − 1/*R*_{0}.

## Introduction

Selection acting on unmeasured individual variation is a well-known source of bias in the analysis of populations. It has been shown to affect measured rates of mortality (Keyfitz and Littman; Vaupel et al 1979; Vaupel and Yashin 1985), the survival of endangered species (Kendall and Fox 2002; Jenouvrier et al 2018), the scope of neutral theories of biodiversity and molecular evolution (Steiner and Tuljapurkar 2012, Gomes et al 2019), the risk of diseases whether non-communicable (Aalen et al 2015; Stensrud and Valberg 2017) or infections (Anderson et al 1986; Dwyer et al 1997; Smith et al 2005; Bellan et al 2015; Gomes et al 2019; Corder et al 2020), and the efficacy of interventions such as vaccines (Halloran et al 1996; O’Hagan et al; Gomes et al 2014; Gomes et al 2016; Langwig et al 2017) or symbionts (Pessoa et al 2016; King et al 2018). Building on this knowledge, we addressed how selection on individual variation might affect the course of the coronavirus disease (COVID-19) pandemic (Gomes et al 2020).

COVID-19 is an infectious respiratory disease caused by a virus (severe acute respiratory syndrome coronavirus 2 [SARS-CoV-2]), which was first identified in China in late 2019 and has since spread worldwide leading to considerable human suffering and social disruption. European and American continents have been the most affected, with 0.16% and 0.20% of the respective total populations having died as of the 15 July 2021 (WHO 2021). Here we analyse series of daily deaths attributed to COVID-19 in Spain and Portugal (Iberian Peninsula) to study how individual variation in susceptibility and exposure to a respiratory virus affects its epidemic trajectory. Besides adding to the compendium of neglected effects of selection in population dynamics we hope to stimulate a new approach to study epidemic dynamics.

The main idea developed here is that individual variation in susceptibility or exposure (connectivity) accelerates the acquisition of immunity in populations. More susceptible and more connected individuals have a higher propensity to be infected and thus are likely to become immune earlier. Due to this selective immunization by natural infection, heterogeneous populations acquire herd immunity by natural infection more efficiently than suggested by models that do not fully account for these types of variation. We integrate continuous distributions of susceptibility or connectivity in otherwise basic epidemiological models for COVID-19, which necessarily account for non-pharmaceutical intervention effects, and generate three types of results. First, at national levels the herd immunity threshold by natural infection declines from around 70% to 20-30%. This is newly reported here for Spain and in agreement with recent estimates for England and Scotland (Gomes et al 2021). Second, these inferences can be made relatively early in the pandemic, such as between first and second waves, provided the first wave is sufficiently large and spatially synchronous. Third, we include a selection of results for Portugal to illustrate how the inferential procedure degenerates when national data do not meet certain conditions.

### Individual variation in SARS-CoV-2 transmission

SARS-CoV-2 is transmitted primarily by respiratory droplets and modelled as a susceptible-exposed-infectious-recovered (SEIR) process.

#### Variation in susceptibility to infection

Individual variation in susceptibility is integrated as a continuously distributed factor that multiplies the force of infection upon individuals (Diekmann et al 1990) in the form of an infinite system of ordinary differential equations (ODEs):
where *S*(*x*) is the density of individuals with susceptibility *x, E*(*x*) and *I*(*x*) are the densities of individuals who originally had susceptibility *x* and became exposed and infectious, while *R*(*x*) represents those who have recovered and have their susceptibility reduced to a reinfection factor *σ* due to acquired immunity. Parameter *δ* is the rate of progression from exposed to a period of maximal infectiousness (= 1/5.5 per day [McAloon et al. 2020]), *γ* is the rate of recovery from maximal infectiousness (= 1/4 per day [Nishiura et al. 2020; Lauer et al. 2020; Li et al. 2020]), *ϕ* is the proportion of individuals who die as a result of infection (= 0.009 [Ward et al. 2021]), and:
is the average force of infection upon susceptible individuals in a population of approximately constant size *N* and transmission coefficient *β*. Standardizing so that susceptibility distributions have mean ∫ *xg*(*x*) *dx* = 1, given a probability density function *g*(*x*), the basic reproduction number, defined as the expected number of secondary infections generated by an infected individual in a population that has no specific immunity to the virus (Diekmann et al 1990), is:
where *ρ* is a factor measuring the infectiousness of individuals in compartment *E* in relation to those in *I* (= 0.5 [Wei et al. 2020; To et al. 2020; Arons et al. 2020; He et al. 2020]). The coefficient of variation in individual susceptibility is also treated as a parameter.

The basic reproduction number ℛ_{0} is a theoretical framework. It cannot be exactly measured directly but it can be approximated from the initial growth in case numbers. However, as the virus spreads through the population, infected and immune individuals accumulate, reducing the availability of susceptible hosts. As a result, growth in case numbers deviates from being a direct indication of ℛ_{0} but rather of a so-called effective reproduction number ℛ_{eff}.

When susceptibility is given by a gamma distribution and acquired immunity is totally protective, the effective reproduction number is:
where *S*(*t*) = ∫ *S*(*x, t*) *dx* is the total number of susceptible individuals at time *t*, and (Equations 1-4) reduce to a finite system of ODEs:
where *S, E, I* and *R* are the total numbers of susceptible, exposed, infectious and recovered individuals, respectively (Novozhilov 2008; Montalbán et al. 2020).

#### Variation in connectivity

In a directly transmitted infectious disease, such as COVID-19, variation in exposure to infection is primarily governed by patterns of connectivity among individuals. Here we incorporate this in the system (Equations 1-4) under the assumption that individuals mix at random (Pastor-Satorras and Vespignani 2001; Miller et al. 2012), while in Supplementary Information we conduct some sensitivity analyses to this assumption. Under random mixing and heterogeneous connectivity, the force of infection is written as:
and the basic reproduction number is:
In this setup, when connectivity is given by a gamma distribution and acquired immunity is totally protective, the effective reproduction number is:
and (Equations 1-4) reduce to the finite system of ODEs:
where *S, E, I* and *R* are the total numbers of susceptible, exposed, infectious and recovered individuals, respectively (Novozhilov 2008; Montalbán et al. 2020).

We have used the reduced systems (Equations 8-11) and (Equations 15-18) in combination with non-pharmaceutical interventions (NPIs) to analyse series of daily deaths in the study countries.

### Non-pharmaceutical interventions and other transmissibility modifiers

NPIs designed to control transmission typically reduce *β* and hence ℛ_{0}. Denoting the time-dependent reproduction number when control measures are in place by ℛ_{c}(*t*), the modified effective reproduction number is obtained by replacing ℛ_{0} with ℛ_{c}(*t*) in (Equation 7) and (Equation 14) as appropriate. For the estimation of ℛ_{c}(*t*) we introduce flexible transmissibility profiles *c*(*t*) as illustrated in Figure 1.

#### One-wave transmissibility profile

When the model is applied to the first pandemic wave only (until 1 July 2020) we use the top profile in Figure 1. *T*_{0} is the time when ℛ_{0} begins to decrease due to behavioural change or seasonality; *T*_{1} is the time interval over which transmission reduces gradually (assumed linearly, until the first lockdown; *c*_{1} ≤ 1 is the average *c*(*t*) from the beginning of the first lockdown onwards (14 March in Spain, 19 March in Portugal). Mathematically this is constructed as:

#### Two-wave transmissibility profile

Applying the model over longer periods which capture multiple waves and multiple lockdowns requires additional features on the transmissibility profile. Denoting by *L*_{1} the duration of the first lockdown (29 days in Spain, 44 days in Portugal), we allow restrictions to be progressively relaxed at the end of this period by letting transmission begin a linear increase such that *c*(*t*) reaches 1 in *T*_{2} days, which may or may not be within the range of the study. Changes in other factors that affect transmission (such as seasonality or viral evolution) are inseparable from contact changes in this framework and are also accounted for by *c*(*t*). Mathematically this is constructed as:
Second and third lockdowns in the autumn and winter season are implemented as a further reduction in transmission (by factors *c*_{2} and *c*_{3}, respectively) over the stipulated time periods (*L*_{2} and *L*_{3} in the bottom panel of Figure 1):
In Spain, second and third lockdowns were effectively a single intervention mildly interrupted by a short relaxation over Christmas, and hence we assume *c*_{2} = *c*_{3} in this country. This is contrasted by Portugal where the third lockdown was much stricter than the second.

### Herd immunity thresholds

Individual variation in risk of acquiring infection is under selection by the force of infection, whether individual differences are due to biological susceptibility, exposure, or both. The most susceptible or exposed individuals are selectively removed from the susceptible pool as they become infected and eventually recover with immunity (some die), resulting in decelerated epidemic growth and accelerated acquisition of immunity in the population. The herd immunity threshold (HIT) defines the percentage of the population that needs to be immune to reverse epidemic growth and prevent future waves. In the absence of NPIs or other transmissibility modifiers, if individual susceptibility or connectivity is gamma-distributed and mixing is random, basic HIT curves (ℋ_{0}) can be derived analytically (Montalbán et al 2020) from the model systems (Equations 1-4, with the respective forces of infections). In the case of variation in susceptibility to infection we obtain
while variable connectivity results in
In less straightforward cases, such as when the characteristics follow a distribution other than gamma (Gomes et al. 2020), when mixing is not random (Supplementary Information), when both distributions in susceptibility and connectivity are considered, or when contact networks are rewired as result of NPIs or otherwise, ℋ_{0} can be obtained numerically.

Both (Equation 23) and (Equation 24) convey substantial declines in HIT as individual variation increases (Figure 2, Gomes et al. 2020), most strikingly over relatively low CV (from 0 up to 1 or 2). For concreteness, when ℛ_{0} = 3, ℋ_{0} = 67% for *CV* = 0, while *CV* = 1 brings ℋ_{0} down to 42% for heterogeneous susceptibility and 30% for heterogeneous connectivity, and *CV* = 2 brings ℋ_{0} further down to 20% and 11%, respectively. Accounting for reinfection (i.e. *σ* > 0 in the formulas) might moderate these reductions. In any case, such differences in HIT are an indication of strong sensitivity of the epidemic dynamics to the parameter CV. Since early in the COVID-19 pandemic we and others (Gomes et al. 2020, Britton et al 2020) have advocated for attention to this issue. Given that the steeper decline in ℋ_{0} appears to occur over a range of realistic CVs (Gomes et al. 2020, Gomes et al 2021) it is plausible that some aspects of the pandemic dynamics need to be reinterpreted in a broader framework.

We emphasise, however, that ℋ_{0} is a theoretical framework almost to the extent that ℛ_{0} is a theoretical framework. It cannot be measured directly when epidemic trajectories are affected by interventions, but it can be inferred indirectly from epidemiological data. By construction, ℋ_{0} changes if the parameters that determine its value change. Most notably, natural changes in ℛ_{0} through time, which can happen due to seasonal forces or viral evolution, transfer to ℋ_{0} according to (Equation 23), (Equation 24), or even their homogeneity equivalent 1 **−** 1/ℛ_{0}. As a result, the percentage of the population immune required to prevent sustained epidemic growth may deviate from the initial ℋ_{0}. Notwithstanding, a model with lower ℋ_{0} outputs smaller epidemics than a model with higher ℋ_{0}, all non-basic processes being the same.

The phenomenon of variation and selection which accounts for lower HIT was widely explained around mid-2020 (Hartnett 2020) and generated broad public interest in the context of COVID-19. By early 2021, vaccines had become available, and the narrative changed to imply that the HIT might be unachievable for COVID-19 (Aschwanden 2021). This apparent contradiction stems from the adoption of less fundamental meaning of HIT by the latter. While Hartnett (2020) writes about using the basic ℋ_{0} to assess pandemic potential, stressing how that is weakened by variation and selection by natural infection, Aschwanden (2021) endorses a more common sensical view that reinfection may become frequent enough to make herd immunity unachievable. In the light of the theory presented here – specifically (Equation 23) and (Equation 24) – these views are transversal and not contradictory.

### Data

We fit compartmental models to publicly available epidemiological data from the coronavirus dashboards for Spain [https://cnecovid.isciii.es/covid19] and Portugal [https://covid19.min-saude.pt/ponto-de-situacao-atual-em-portugal] to fit the models and estimate parameters of interest. Namely, we fit model reconstructed mortality timeseries assuming a fixed infection fatality ratio to datasets containing daily deaths, , where *k* = 1 is the day when the cumulative moving average of death numbers exceed 5 · 10^{−7} of the population (7 March in Spain, 19 March 2020 in Portugal).

Model fits were carried out to the raw series of daily deaths until the 1 July 2020 in the first instance (to cover the first wave of the epidemic in the study countries), and until the 1 March 2021 in an extended analysis (as a compromise between having a series sufficiently long to capture much of the second wave and not so long that it would be affected by vaccination and require the vaccine to be modelled). We defined the initial conditions as:
and the population sizes *N* were obtained from the most recent respective census (approx. 46.94 million in Spain, 10.28 million in Portugal, 3.57 million in the North Region of Portugal, 3.66 million in Lisbon and Tagus Valley Region of Portugal).

### Model fitting and parameter estimating

We assumed that reinfection was negligible throughout the study period. A study conducted in England (Hall et al. 2021), between June 2020 and January 2021, concluded that previous SARS-CoV-2 infection induced 84% effective immunity to future infections. In (Gomes et al. 2121) we fit models to daily COVID-19 deaths in England and Scotland, assuming no reinfection (i.e. 100% effective immunity) or 90% effective immunity, and found this alteration to have no significant effect on projected model trajectories. Basically, the fit readjusts the parameters when reinfection is added to the model in such a way that the output solutions remain similar.

Parameter estimation was performed with the software MATLAB using the PESTO (Parameter EStimation Toolbox) package (Stapor et al. 2018). We assumed the number of SARS-CoV-2 infections to be Poisson distributed.

We approximate the dynamics of COVID-19 deaths by estimating the set of parameters *θ* that maximises the log-likelihood (LL) of observing the daily numbers of reported deaths *Y*:
where *y*(*k, θ*) are the simulated model output numbers of COVID-19 deaths at day *k* for the set of parameters are the numbers of daily reported deaths, and *n* is the total number of days included in the analysis.

### Fitting models to one pandemic wave

The models exploring heterogeneity in susceptibility (Equations 8-11) and connectivity (Equations 15-18) both with transmissibility profile as in (Equation 19) were fit to COVID-19 deaths recorded daily until 1 July 2020, in Spain and Portugal. A homogeneous version obtained by setting *CV* = 0 in either model was also fitted. Results for Spain are shown in Table 1 and Figure 3, and for Portugal in Table 2 and Figure 4.

We estimate the basic reproduction number ℛ_{0} with 95% credible intervals (CI) around 3.5 **−** 3.8 in Spain and 2.3 **−** 3.4 in Portugal. For the minimal transmissibility factor *c*_{1} we estimate 0.21 **−** 0.27 when individual variation is allowed and 0.18 **−** 0.19 with the homogeneity constraint in Spain, while in Portugal we estimate the wider intervals 0.26 **−** 0.39. For coefficients of variation in Spain, we estimate CV in the range 1.3 **−** 2.4 under heterogeneous susceptibility and 1.0 **−** 1.7 under heterogeneous connectivity. In Portugal, we obtain the much wider and uninformative ranges 0.0 **−** 3.2.

Left plots in Figures 3 and 4 show the best fitting model solutions generated from the median posterior estimates of each parameter in the respective countries as well as the 95% CI generated from 100,000 posterior samples. The basic herd immunity thresholds ℋ_{0}, calculated from ℛ_{0} and CV estimates, are ℋ_{0} = 19% (95% CI, 13-32%) under heterogeneous susceptibility, ℋ_{0} = 19% (95% CI, 13-36%) under heterogeneous connectivity, and ℋ_{0} = 72% (95% CI, 71-73%) when homogeneity is imposed, in Spain. In Portugal, we obtain ℋ_{0} = 19% (95% CI, 5-69%) under heterogeneous susceptibility, ℋ_{0} = 13% (95% CI, 3-69%) under heterogeneous connectivity, and ℋ_{0} = 67% (95% CI, 66-69%) when homogeneity is imposed. Credible intervals for ℋ_{0} in Portugal are wide and uninformative when individual variation is allowed which is expected given the wide ranges obtained for CV. NPIs in Portugal were initiated very early in the epidemic which resulted in transmissibility reductions blending with ℛ_{0}, making parameter identification a major challenge from the data accessible to us.

In Spain, where the three models provide good fits to the data, model selection criteria such as AIC (Akaike information criterion) support the heterogeneous implementations. To better distinguish the various models, we run the respective systems of equations forward, under a set of conventions, and compare the respective projected outcomes. Right plots in Figures 3 were generated by taking the end conditions of the left plots, moving all exposed and infectious individuals to recovered (except a residual proportion to seed a new outbreak) and running each model with the estimated ℛ_{0} and CV until the susceptible pool has been effectively depleted in all implementations. In this manner we can visualise how much more infection appears to be ahead when models are constrained to be homogeneous (a manifestation of their relatively higher ℋ_{0}. Roughly, epidemics peak one order of magnitude higher when models are homogeneous. This must have broad implications, which we believe remain largely unappreciated, for how a population will experience an epidemic, irrespective of how this basic scenario is adapted to specific factors such behavioural patterns, seasonality, viral evolution, or vaccination.

The same analysis applied to Portugal, selects in favour of the homogeneity assumption and larger projected waves. We recall, however, the great uncertainty associated with these specific results. We investigate this further through fittings to longer data series, in the first instance.

### Fitting models to two serial pandemic waves

Here we take the models with heterogeneity in susceptibility (Equations 8-11) and connectivity (Equations 15-18) and apply the transmissibility profile in (Equations 20-22) to both before fitting the model outputs to COVID-19 deaths recorded daily until 1 March 2021, in Spain and Portugal. As before a homogeneous version obtained by setting *CV* = 0 was also fitted. Results for Spain are shown in Table 3 and Figure 5, and for Portugal in Table 4 and Figure 6.

We estimate ℛ_{0} with 95% CI around 3.9 **−** 4.4 in Spain and 2.7 **−** 4.6 in Portugal. For the minimal transmissibility factor *c*_{1} we estimate 0.20 **−** 0.21 when individual variation is allowed and 0.17 with the homogeneity constraint in Spain, while in Portugal we obtain 0.23 **−** 032 with individual variation and 0.16 **−** 018 without. For coefficients of variation in Spain, we estimate CV around 2.0 under heterogeneous susceptibility and 1.3 under heterogeneous connectivity. In Portugal, the estimates are around 0.5.

Left plots in Figures 5 and 6 show best fitting model solutions as well as the respective 95% CI. In Spain, basic herd immunity thresholds calculated from best-fitting ℛ_{0} and CV are ℋ_{0} = 25% under heterogeneous susceptibility, ℋ_{0} = 27% under heterogeneous connectivity, and ℋ_{*} = 76% when homogeneity is imposed. In Portugal, we obtain again wider and higher ranges: ℋ_{0} = 56% (95% CI, 52-57%) under heterogeneous susceptibility, ℋ_{0} = 56% (95% CI, 54-57%) under heterogeneous connectivity, and ℋ_{0} = 77% (95% CI, 76-78%) when homogeneity is imposed.

Puzzling, as in the case of shorter data series, model selection supports the homogeneous model for Portugal while still favouring the incorporation of individual variation in Spain. Moreover, for each country, results are consistent whether we base our estimates on one or two waves of the national epidemic. This consistency was also verified in England and Scotland (Gomes et al 2021). This suggests that the earlier initiation of NPI in Portugal may not in itself explain why results for Portugal contrast with those for other countries studied.

### Fitting models to regional data in Portugal

Intrigued by the puzzling results for Portugal at country level, we gathered regional data. We found that the epidemic was considerably different between the two largest regions: North, home to roughly one third of the Portuguese population; and Lisbon and Tagus Valley, home to another third. Asynchrony of epidemic dynamics between regions of similar sizes may require disaggregated analyses. We then decided to fit the mortality data for the two regions simultaneously, estimating common parameters to describe country-wide lockdowns, and regions-specific parameters to describe the basic transmission dynamics (ℛ_{0} and CV). The results are provided in Tables 5, 6 and Figures 7-9. According to these analyses, the best-fitting models include heterogeneity.

Basic herd immunity thresholds inferred by fits to the longer series (until 1 March 2021) are around 25 **−** 34% in the North (similar as Spain, England, and Scotland) and 47 **−** 52% in Lisbon and Tagus Valley. The higher ℋ_{0} in the capital region results from the estimation of a lower CV. This may be real and due to the more urban character of Lisbon and Tagus Valley or, in contrast, it may be a spurious result of the lack of a first wave in the region to inform the model. It would be interesting to replicate the regional analysis in other countries to investigate to what extent more urban regions have higher HIT.

Fits to the shorter series (until 1 July 2020) are again inconclusive. Not only the uncertainty around parameter estimates is large but some estimates are implausible. First, ℛ_{0} around 2 or less is on the low end of consensus estimates (Flaxman et al. 2020; Keeling et al. 2020; Viana et al. 2021; Wood 2021). More strikingly, the algorithm is incapable of estimating CV from these regional series, resulting in convergence to the upper and lower limits of the prior distribution (uniform between ) in the North and Lisbon regions, respectively.

In summary, the results reported in this section support two notions. First, asynchronous dynamics may compromise parameter estimation based on model fittings to aggregated data. This should depend on whether the asynchrony in question is between similar-sized regions. Second, the early estimation of parameters for heterogeneous models (especially CV) may require a larger first wave than that required by models that are either homogeneous (Flaxman et al. 2020; Wood 2021) or have their heterogeneity informed directly by specific data (Keeling et al. 2020). The downside of these other approaches, however, is that heterogeneity is either absent or possibly incomplete, resulting in reduced selection and biased estimates. Among the studies we have completed, England, Scotland and Spain had sufficiently sized first waves to inform the inference of CV while Portugal did not.

## Discussion

We fit SEIR models, with inbuilt distributions of individual susceptibility or exposure to infection, to daily series of COVID-19 deaths in Spain and Portugal. In similarity with other studies (Flaxman et al. 2020, Keeling et al. 2020, Bertuzzo et al. 2021, López & Rodó 2021, Viana et al. 2021), we estimated relevant transmission parameters, such as the basic reproduction number ℛ_{0}, and a time dependent transmissibility profile *c*(*t*) which multiplies ℛ_{0} to account for effects of NPIs, seasonality, viral evolution, or any unexplicit factor that modifies the ability of the virus to infect new hosts. In contrast, our sets of estimated parameters include coefficients of variation that characterise distributions of unmeasured individual susceptibility or connectivity while more common practices are to impose measured variation into the model and ignore the unmeasured.

The two approaches have different advantages and disadvantages, and which to adopt depends on study objectives. Using directly measured variation to inform the model allows phenomena to be represented on a micro scale as desired for designing targeted interventions, for instance, but ignoring unmeasured variation may create biases on the macro scale if relevant selection is occurring on omitted characteristics. This may result in the model overpredicting the impact of uniform interventions, for example. The approach of inferring the amount of variation that is under selection by fitting the model on a macro scale is conceived to eliminate those biases, but not suitable for informing targeted actions. In addition, the latter approach appears to require larger infected numbers than the former. In a global crisis such as the current COVID-19 pandemic the two approaches can have a role, helping to overcome each other’s limitations.

The inference of selectable variation presented here is relatively new to infectious disease modelling. Prior to attempting this in the context of the COVID-19 pandemic, we and others have been accelerating progress in systems that were either experimentally controlled (Dwyer et al. 1997; Ben-Ami et al. 2008; Zwart et al. 2011; Pessoa et al. 2014; Langwig et al. 2017; King et al. 2018) or already endemic (Smith et al. 2005; Bellan et al. 2015; Corder et al. 2020). Several aspects of the pandemic made it more challenging. There was an urgency for early results from inherently scarce data, a challenge amplified by non-pharmaceutical interventions designed to suppress the epidemic. In countries where interventions began earlier, such as Portugal, it has been impossible to reach conclusive results without refining data and methods further. In Spain (this study) and England and Scotland (Gomes et al 2021), on the other hand, interventions started later, and results were consistent internally and with each other. Moreover, comparing our coefficients of variation for individual connectivity to those measured directly by contact surveys, we found them to be less than 40% higher. This suggests that there was more selectable variation than the surveys had captured, as expected, but not enough to seem implausible.

In conclusion, we believe that the results presented here constitute a major step in the establishment of a novel approach to improve the accuracy and predictability of infectious disease modelling. It holds good promise for use in conjunction with other tools to help devise future pandemic policies.

## Data Availability

Datasets are publicly available at the respective national ministry of health websites.

## Author contributions

M.G.M.G. conceived the study. R.A. and M.G.M.G. performed the analyses. All authors interpreted the data and wrote the paper.

## Competing interests

The authors declare no competing interests.

## Acknowledgements

We thank Jan Hasenauer, Rodrigo Corder and Antonio Montalbán for technical advice and contributions to related research. M.U.F. received funding from Conselho Nacional de Desenvolvimento Científico e Tecnológio (CNPq), Brazil.