## Abstract

For the last decade, evolutionary forecasting models have influenced seasonal influenza vaccine design. These models attempt to predict which genetic variants circulating at the time of vaccine strain selection will be dominant 12 months later in the influenza season targeted by vaccination campaign. Forecasting models depend on hemagglutinin (HA) sequences from the WHO’s Global Influenza Surveillance and Response System to identify currently circulating groups of related strains (clades) and estimate clade fitness for forecasts. However, the average lag between collection of a clinical sample and the submission of its sequence to the Global Initiative on Sharing All Influenza Data (GISAID) EpiFlu database is ∼3 months. Submission lags complicate the already difficult 12-month forecasting problem by reducing understanding of current clade frequencies at the time of forecasting. These constraints of a 12-month forecast horizon and 3-month average submission lags create an upper bound on the accuracy of any long-term forecasting model. The global response to the SARS-CoV-2 pandemic revealed that modern vaccine technology like mRNA vaccines can reduce how far we need to forecast into the future to 6 months or less and that expanded support for sequencing can reduce submission lags to GISAID to 1 month on average. To determine whether these recent advances could also improve long-term forecasts for seasonal influenza, we quantified the effects of reducing forecast horizons and submission lags on the accuracy of forecasts for A/H3N2 populations. We found that reducing forecast horizons from 12 months to 6 or 3 months reduced average absolute forecasting errors to 25% and 50% of the 12-month average, respectively. Reducing submission lags provided little improvement to forecasting accuracy but decreased the uncertainty in current clade frequencies by 50%. These results show the potential to substantially improve the accuracy of existing influenza forecasting models by modernizing influenza vaccine development and increasing global sequencing capacity.

## Introduction

Seasonal influenza virus infections cause approximately half a million deaths per year (** World Health Organization, 2014)**. Vaccination provides the best protection against hospitalization and death, but the rapid evolution of the influenza surface protein hemagglutinin (HA) allows viruses to escape existing immunity and requires regular updates to influenza vaccines

**). The World Health Organization (WHO) meets twice a year to decide on vaccine updates for the Northern and Southern Hemispheres**

*(Petrova and Russell, 2018***). The dominant influenza vaccine platform is an inactivated whole virus vaccine grown in chicken eggs**

*(Morris et al., 2018***) which takes 6 to 8 months to develop and contains a single representative vaccine virus per seasonal influenza subtype including A/H1N1pdm, A/H3N2, and B/Victoria**

*(Wong and Webby, 2013***). As a result, the WHO must select a single immunologically-representative virus per subtype approximately 12 months before the peak of the next influenza season. These selections depend on the diversity of currently circulating phylogenetic clades, groups of influenza viruses that all share a recent common ancestor. The WHO’s understanding of that genetic diversity comes from HA sequences collected by the WHO’s Global Influenza and Surveillance and Response System (GISRS)**

*(Morris et al., 2018***) and submitted to the Global Initiative on Sharing All Influenza Data (GISAID) EpiFlu database**

*(Hay and McCauley, 2018***). The fastest evolving influenza subtype A/H3N2 accumulates 3–4 HA amino acid substitutions per year**

*(Shu and McCauley, 2017***;**

*(Smith et al., 2004***) such that the clades circulating 12 months after the vaccine decision can be antigenically distinct from clades that were circulating at the time of the decision.**

*Kistler and Bedford, 2023*Given the 12-month lag between the decision to update an influenza vaccine and the peak of the following influenza season, the vaccine composition decision is commonly framed as a long-term forecasting problem ** (Lässig et al., 2017**). For this reason, the decision process is partially informed by computational models that attempt to predict the genetic composition of seasonal influenza populations 12 months in the future based on current genetic and phenotypic data

**). The earliest of these models predicted future influenza populations from HA sequences alone**

*(Morris et al., 2018***;**

*(Łuksza and Lässig, 2014***;**

*Neher et al., 2014***). Recent models include phenotypic data from serological experiments**

*Steinbrück et al., 2014***;**

*(Morris et al., 2018***;**

*Huddleston et al., 2020***,**

*Meijers et al., 2023***), but these models still heavily rely on HA sequences to determine the viruses circulating at the time of a forecast. Unfortunately, the average lag between collection of a seasonal influenza A/H3N2 HA sample and submission of its sequence had been ∼3 months in the era prior to the SARS-CoV-2 pandemic**

*2024***). While long-term forecasting models continue to improve technically, the constraints of a 12-month forecast horizon and the availability of enough recent, representative HA sequences impose an upper bound on the accuracy of long-term forecasts.**

*(Figure 1A*The global response to the SARS-CoV-2 pandemic in 2020 showed the speed with which we can develop new vaccines and capture real-time viral genetic diversity. Decades of research on mRNA vaccines enabled the development of multiple effective vaccines a year after the emergence of SARS-CoV-2 ** (Mulligan et al., 2020**;

**). This mRNA-based vaccine platform also enabled the approval of booster vaccines targeting Omicron only 3 months after the recommendation of an Omicron-based vaccine candidate**

*Baden et al., 2021***). In parallel to vaccine development, expanded funding and capacity building for viral genome sequencing enabled unprecedented dense sampling of a pathogen’s genetic diversity over a short period of time**

*(Grant et al., 2023***). By 2021, the average time between collection of a SARS-CoV-2 sample and submission of the sample’s genome sequence to GISAID EpiCoV database had decreased to approximately 1 month**

*(Chen et al., 2022***). This reduction in submission lags reflects both increased emergency funding and the sustained efforts by more public health organizations to adopt best practices for genomic epidemiology**

*(Brito et al., 2022***;**

*(Kalia et al., 2021***). Assessments of SARS-CoV-2 short-term forecasts have shown how such reductions in forecast horizon and submission lags can improve the accuracy of short-term forecasts and real-time estimates of clade frequencies**

*Black et al., 2020***).**

*(Abousamra et al., 2024*These technological and societal changes in response to SARS-CoV-2 suggest that we could realistically expect the same outcomes for seasonal influenza. Work on mRNA vaccines for influenza viruses dates back over a decade ** (Petsch et al., 2012**;

**;**

*Brazzoli et al., 2016***;**

*Pardi et al., 2018***). A switch from the current egg-based inactivated virus vaccines to mRNA vaccines could reduce the time between vaccine design decisions and the peak influenza season from 12 months to 6 months. Similarly, the expanded global capacity for sequencing SARS-CoV-2 genomes could reasonably extend to broader and more rapid genomic surveillance for seasonal influenza, reducing submission lags from 3 months to 1 month on average. Even in the years immediately after the onset of the SARS-CoV-2 pandemic, we have observed a trend toward a reduced average submission lag of 2.5 months that we would expect from increased global capacity for genome sequencing**

*Feldman et al., 2019***).**

*(Figure 1—figure Supplement 1*In this work, we tested the effects of similar reductions in forecast horizons and submission lags on the accuracy of long-term forecasts for seasonal influenza. Building on our previously published forecasting framework ** (Huddleston et al., 2020**), we performed a retrospective analysis of HA sequences from simulated and natural A/H3N2 populations. For each population type, we produced forecasts from 12, 9, 6, and 3 months prior to a given influenza season

**). We made each forecast under three different submission lag scenarios including a realistic lag (3 months on average), an ideal lag (1 month on average), and no lag**

*(Figure 1A***). First, we measured the accuracy and precision of forecasts under these different scenarios by calculating the genetic distance between predicted and observed future populations using the same earth mover’s distance metric that we originally used to train our forecasting models**

*(Figure 1B***). Next, we calculated the effect of forecast horizon and submission lags on clade frequencies which are the values we use to communicate predictions to WHO decision-makers**

*(Rubner et al., 1998***). We quantified the effect of reduced submission lags on initial clade frequencies, and we calculated forecast accuracy as the difference between predicted and observed clade frequencies of future populations. Finally, we calculated the relative improvement in forecast accuracy produced by different realistic interventions including reduced vaccine development time, reduced submission lags, and the combination of both. In this way, we show the potential to improve the accuracy of existing long-term forecasting models and, thereby, the quality of vaccine design decisions by simplifying the forecasting problem through realistic societal changes.**

*(Huddleston et al., 2024*## Results

### Reducing forecast horizons and submission lags decreases distances between predicted and observed future populations

Previously, we trained long-term forecasting models that minimized the genetic distance between predicted and observed future populations of HA sequences ** (Huddleston et al., 2020**). We predicted each population 12 months in the future based on the frequencies and fitness estimates of HA sequences in the current population. We calculated the distance between predicted and observed future populations with the earth mover’s distance metric

**). This metric provided an average genetic distance between amino acid sequences of the two populations weighted by the frequencies of sequences in each population. This approach allowed us to measure forecasting accuracy without first defining phylogenetic clades, a process that can borrow information from the future or change clade definitions between initial and future timepoints. We identified the best forecasting models as those that minimized this distance between populations. The most accurate sequence-only model for the 12-month forecast horizon estimated fitness with local branching index (LBI)**

*(Rubner et al., 1998***) and mutational load**

*(Neher et al., 2014***). As a positive control, we calculated the posthoc empirical fitness of each initial population based on the composition of the corresponding future population. These empirical fitnesses provided the lower bound on the earth mover’s distance which represented the number of amino acid substitutions accumulated between populations.**

*(Łuksza and Lässig, 2014*To understand the effects of reducing forecast horizons and submission lags on long-term forecast accuracy, we produced forecasts 3, 6, 9, and 12 months into the future using HA sequences available at each initial timepoint under each submission lag scenario including no lag, ideal lag (∼1-month average), and realistic lag (∼3-month average) ** (Figure 1**,

**,**

*Figure 1—figure Supplement 2***). For both natural and simulated populations, we assigned ideal and realistic lags to each sequence from the modeled distributions in**

*Figure 1—figure Supplement 3***. This approach allowed us to assign uncorrelated lag values to both population types while avoiding the biases associated with historical submission patterns for natural A/H3N2 HA sequences. For natural A/H3N2 populations, we used the best sequence-only forecasting model, LBI and mutational load, which we previously trained on 12-month forecasts without any submission lag. For simulated A/H3N2-like populations, we used the observed fitness per sample provided by the simulator. For each forecast horizon and submission lag type, we calculated the earth mover’s distance between the predicted future populations under the given lag scenario and the observed future populations without any lag in sequence availability. As a control, we also calculated the optimal distance between initial and future populations based on posthoc empirical fitness of the initial population. We anticipated that reducing either the forecast horizon or the submission lag would reduce the distance to the future in amino acids (AAs), representing increased accuracy of the forecasting models.**

*Figure 1B*We found that reducing the forecast horizon from the current standard of 12 months linearly reduced the distance to the future population predicted by the LBI and mutational load model ** (Figure 2**). Under the all three submission lag scenarios, the distance to the future reduced by approximately 1 AA on average for each 3-month reduction in forecast horizon

**). We observed the greatest average reduction in distance to the future (∼1.4 AAs) between the 6- and 3-month forecast horizons. Reducing the forecast horizon also noticeably reduced the variance per time-point in predicted future populations across all lag scenarios**

*(Table 1***). For example, the standard deviation of distances to the future reduced from ∼2.6 AAs at the 12-month horizon to ∼1 AA at the 3-month horizon**

*(Figure 2***). We observed the same patterns for forecasts of simulated A/H3N2-like populations**

*(Table 1***) and optimal distances to the future for natural and simulated populations**

*(Figure 2—figure Supplement 1***and**

*(Figure 2—figure Supplement 2***). Thus, reducing how far we have to predict into the future increased both forecast accuracy and precision.**

*Figure 2—figure Supplement 3*In contrast, we found that reducing submission lags from a ∼3-month average lag in the realistic scenario to a ∼1-month average lag in the ideal scenario had a weaker effect on distance to the future. At the 12-month forecast horizon, the ideal and realistic lag scenarios produced similar predictions, with the only noticeable improvement observed under the scenario without any submission lags ** (Figure 2**). As the forecast horizon decreased, the effect of submission lags appeared more prominent, with the greatest effect of reduced lags observed at the 3-month forecast horizon. However, the average improvement from the realistic to the ideal submission lag scenario at the 3-month horizon was still only ∼0.3 AAs

**). Reducing submission lags also had little effect on the variance per timepoint in predicted future populations. Interestingly, we observed a stronger effect of reducing submission lags in simulated A/H3N2-like populations, with the best average improvement between realistic and ideal lags of ∼0.7 AAs at the 3-month horizon**

*(Table 1***). As with natural A/H3N2 populations, the effect of reducing submission lags appeared to increase as the forecast horizon decreased. These results indicate that reducing submission lags may have little effect under the current 12-month forecast approach used for influenza vaccine composition, but reducing submission lags should become increasingly important as we forecast from closer to future influenza populations.**

*(Figure 2—figure Supplement 1*### Reducing submission lags improves estimates of current clade frequencies

Although the distance between predicted and observed future populations in amino acids provides an unbiased metric to optimize forecasting models, in practice, we use these models to forecast clade frequencies. We predict each clade’s future frequency as the sum of predicted future frequencies for each HA sequence in the clade. We calculate these sequence-specific future frequencies as the initial sequence frequency times the estimated sequence fitness ** (Łuksza and Lässig, 2014**;

**). Given the importance of initial clade frequencies in these forecasts, we tested the effect of submission lags on current clade frequency estimates. For each timepoint and clade with a frequency greater than zero under the scenario without lags, we calculated the clade frequency error as the difference between clade frequency without submission lags and the frequency with either an ideal or realistic lag. Positive error values represented underestimation of current clades, while negative values represented overestimation.**

*Huddleston et al., 2020*Across all clade frequencies, we found that errors in current clade frequencies for A/H3N2 appeared normally distributed with lower variance in the ideal lag scenario than under realistic lags ** (Figure 3A** and

**). Of the 822 clades under the scenario without lags, 613 (75%) had a frequency less than 10%, representing small, emerging clades. The remaining 209 (25%) had a frequency of 10% or greater, representing larger clades that could be more likely to succeed. To understand whether lags had different effects on these small and large clades, respectively, we inspected clades from these latter two groups separately. For small clades, errors under ideal lags ranged from −4% to 4% with a standard deviation of 1%, while realistic lags produced errors ranging from −8% to 7% with a standard deviation of 2%**

*B***). We did not observe a bias toward underestimation or overestimation of initial small clade frequencies under either lag scenario. For large clades, errors under ideal lag ranged from −9% to 14% with a standard deviation of 3%**

*(Figure 3C***). Errors under realistic lags ranged from −16% to 29% with a standard deviation of 6%. We observed a slight bias toward underestimation of large clades under the realistic lag scenario, with a median error of 1%. These results show that reducing submission lags for natural A/H3N2 populations from a 3-month average to a 1-month average could reduce the bias toward underestimated large clade frequencies and reduce the standard deviation of all current clade frequency errors by 50%.**

*(Figure 3D*Lagged submissions similarly affected clade frequencies for simulated A/H3N2-like populations ** (Figure 3—figure Supplement 1**). Small clade errors under ideal lags ranged from −4% to 6% (standard deviation of 1%) and under realistic lags ranged from −9% to 8% (standard deviation of 2%)

**). For large clades, errors under ideal lags ranged from −8% to 18% (standard deviation of 3%) and under realistic lags from −14% to 40% (standard deviation of 7%)**

*(Figure 3—figure Supplement 1C***). As with natural A/H3N2 populations, we observed a slight bias in simulated populations under realistic lags toward underestimation of large clade frequencies with a median error of 2%. We also observed a similar reduction in standard deviation of current frequency errors for these simulated A/H3N2-like populations when switching from realistic to ideal submission lags.**

*(Figure 3—figure Supplement 1D*### Reducing forecast horizons increases the accuracy and precision of clade frequency forecasts

Next, we estimated the effects of different forecast horizons and submission lags on the accuracy of clade frequency forecasts. As with the current clade frequency analysis, we analyzed small clades (<10% initial frequency) and large clades (≥10% initial frequency) separately. For each combination of initial timepoint, future timepoint, and lag scenario ** (Figure 1A**), we calculated initial and predicted future frequencies for all clades present under the given lag and then calculated the corresponding observed future frequencies without lag for clades that descended from the clades present at the initial timepoint. We calculated the error in forecast frequencies as the difference between predicted future frequencies under the given lag scenario and observed future frequencies without any lag. We used absolute forecast errors to evaluate forecast accuracy and overall forecast errors to evaluate forecast bias.

Absolute forecast errors trended strongly toward values less than 30% with long tails reaching 80% for both small and large clades ** (Figure 4**). Each 3-month reduction of the forecast horizon linearly reduced the variance in forecast errors, but mean and median absolute errors only improved after reducing the forecast horizon below 9 months

**and**

*(Figure 4***). For small clades, reducing the forecast horizon most noticeably reduced the range of errors, while reducing submission lags had little effect**

*Table 2***). For large clades, almost all decreases in forecast horizon and submission lag (except lags at the 12-month horizon) reduced the standard deviation of absolute forecast errors**

*(Figure 4A***). Overall, reducing the forecast horizon had a greater effect on the mean, median, and standard deviation of absolute forecast errors than reducing submission lags. For example, the standard deviation of absolute errors at the 12-month horizon under realistic submission lags was 23%, while the standard deviation for the 6-month horizon under realistic lags was 14%**

*(Figure 4B***). In contrast, the standard deviation at the 12-month horizon under ideal submission lags did not change from the realistic lags at 23%, and the average absolute error increased by 1% from 20%. For all other forecast horizons, reducing the submission lags from realistic to ideal only reduced the mean and standard deviation of absolute errors by 1–2%. We observed the same general patterns in simulated populations**

*(Table 2***).**

*(Figure 4—figure Supplement 1*The majority of forecast frequency errors appeared to be normally distributed, indicating little bias toward over- or underestimating future clade frequencies ** (Figure 4—figure Supplement 2** and

**). This pattern matched our expectation that at any given initial timepoint the overestimation of one clade’s future frequency must cause an underestimation of another current clade’s future frequency. However, we observed a long tail of small clades with underestimated future frequencies at all forecast horizons, indicating that correctly predicting the growth of small clades remains more difficult than predicting their decline**

*Figure 4—figure Supplement 3***). The strongest effect of reducing submission lags was the reduction in maximum error, corresponding to reduction in underestimation of large clades. The switch from realistic to ideal lags at 12-, 9-, 6-, and 3-month horizons reduced the maximum forecast error by 4%, 21%, 22%, and 14%, respectively**

*(Figure 4—figure Supplement 2A***). These results show that reducing submission lags can substantially lower the upper bound for forecasting errors.**

*(Table 2*### Reduced vaccine development time provides the best improvement in forecast accuracy of available realistic interventions

Although we have investigated the effects of a range of forecast horizons and submission lags, not all of these scenarios are currently realistic. The most we can hope to reduce the forecast horizon with current mRNA vaccine technology is from 12 months to 6 months and the most we could reduce submission lags would be from an average of 3 months to 1 month ** (Grant et al., 2023**). In practice, we wanted to know how much a reduction in forecast horizon or submission lag could improve the accuracy of forecasts to each future timepoint. To determine the effects of realistic interventions on forecast accuracy, we inspected the reduction in total absolute forecast error per future timepoint associated with improved vaccine development (reducing forecast horizon from 12 months to 6 months), improved genomic surveillance (reducing lags from a 3-month average to 1 month), and the combination of both improvements. We selected all forecasts with a 12-month horizon and a realistic lag, to represent current forecast conditions or “the status quo”. For the same future timepoints present in the status quo conditions, we selected the corresponding forecasts for a 6-month horizon and a realistic lag, a 12-month horizon and an ideal lag, and 6-month horizon and an ideal lag. Since forecasts between different initial and future timepoints could be represented by different clades, we could not compare forecasts for specific clades between interventions. Instead, we calculated the total absolute clade frequency error per future timepoint under each intervention and calculated the improvement in forecast accuracy as the difference in total error between the status quo and each intervention. In addition to this clade-based analysis, we also estimated effects of interventions on the difference in distance to the future between different scenarios for both estimated and empirical fitnesses. For all analyses, positive values represented improved forecast accuracy under a given intervention scenario and negative values represented a reduction in accuracy.

Both interventions with improved vaccine development increased forecast accuracy for the majority of future timepoints ** (Figure 5**,

**, and**

*Table 3***). Improving vaccine development alone increased total forecast accuracy by 53% on average, while the addition of improved genomic surveillance under that 6-month forecast horizon increased total forecast accuracy by 54% on average. In contrast, the intervention that only improved genomic surveillance decreased forecast accuracy by an average of 11%. Based on the distributions of total absolute forecast error per future timepoint, we would expect improved genomic surveillance to improve forecast accuracy at a forecast horizon of 3 months**

*Figure 5—figure Supplement 1***). We observed similar effects of interventions in simulated A/H3N2-like populations except that the average effect of reducing submission lags alone was positive for these populations**

*(Figure 5—figure Supplement 1***and**

*(Figure 5—figure Supplement 2***). When we calculated the effects of interventions on distances to the future instead of total absolute clade frequency errors, we observed the same patterns for natural and simulated populations**

*Figure 5—figure Supplement 3***and**

*(Figure 5—figure Supplement 4***). Based on these results, the single most valuable intervention we could make to improve forecast accuracy would be to reduce the forecast horizon to 6 months or less through more rapid vaccine development. However, as we reduce the forecast horizon, reducing submission lags should have a greater effect on improving forecast accuracy.**

*Figure 5—figure Supplement 5*We hypothesized that the decrease in average accuracy of natural A/H3N2 forecasts under the improved genomic surveillance intervention could reflect the bias of the LBI and mutational load fitness metrics. For example, we previously showed how LBI fitness estimates can overestimate the future growth of large clades ** (Huddleston et al., 2020**). Adding more sequences at initial time-points where LBI already overestimates clade success could increase the LBI of those clades and exacerbate the overestimation. To test this hypothesis, we calculated the effects of the same interventions on the optimal distances to the future for both natural and simulated populations. Since optimal distances reflected the empirical fitnesses of the initial populations, the effects of interventions should be independent of biases from fitness metrics. We expected all interventions to maintain or improve the optimal distance to the future without any cases where an intervention decreased accuracy. As expected, all interventions improved on the optimal distance to the future for both populations

**and**

*(Figure 6***). For natural A/H3N2 populations, the average improvement of the vaccine intervention was 1.1 AAs and the improvement of the surveillance intervention was 0.27 AAs or approximately 25% of the vaccine intervention. The average improvement of both interventions was only slightly less than additive at 1.28 AAs. These results confirmed the relatively stronger effect of reducing forecast horizons compared to submission lags. They also confirmed that reducing submission lags can improve forecasts under optimal forecasting conditions. For this reason, we expect that simultaneous improvements to forecasting models and genomic surveillance will have a mutually beneficial effect on forecast accuracy.**

*Figure 6—figure Supplement 1*## Discussion

In this work, we showed that decreasing the time to develop new vaccines for seasonal influenza A/H3N2 and decreasing submission lags of HA sequences to public databases improves our estimates of future and current populations, respectively. We confirmed that forecasts became more accurate and more precise with each 3-month reduction in forecast horizon from the status quo of 12 months. Although decreasing submission lags only marginally improved long-term forecast accuracy, shorter lags increased the accuracy of current clade frequency estimates, reduced the bias toward underestimating current and future frequencies of larger clades, and improved forecasts 3 months into the future. Under a realistic scenario where a faster vaccine development timeline allowed us to forecast from 6 months before the next season, we found a 53% average improvement in forecasts of total absolute clade frequency and a 25% reduction in average absolute forecast frequency errors for large clades from 20% to 15%. We confirmed these effects with a previously validated forecasting model using both simulated and natural populations and two different metrics of forecast accuracy including earth mover’s distances between populations and clade frequencies. We expect that decreasing forecast horizons and submission lags will have similar relative effect sizes in other forecasting models, too.

Even without these recommended improvements to vaccine development and sequence submissions, these results inform important next steps to improve forecasting models. Current and future frequency estimates should be presented with corresponding uncertainty intervals. From this work, we know that our current frequency estimates for large clades (≥10% frequency) under realistic submission lags have a wide range of errors (−16% to 29%). Similarly, the range of 12-month forecast frequency errors under realistic lags include overestimates by up to 78% and underestimates up to 78%. Long-term forecasts with incomplete current data are highly uncertain by their nature. To support informed decisions about vaccine updates, we must communicate that uncertainty of the present and future to decision-makers. One simple immediate strategy to provide these uncertainty estimates is to estimate current and future clade frequencies from count data with multinomial probability distributions. Another immediate improvement would be to develop models that can use all available data in a way that properly accounts for geographic and temporal biases. Current models based on phylogenetic trees need to evenly sample the diversity of currently circulating viruses to produce unbiased trees in a reasonable amount of time. Models that could estimate sample fitness and compare predicted and future populations without trees could use more available sequence data and reduce the uncertainty in current and future clade frequencies. Finally, we could improve existing models by changing the start and end times of our long-term forecasts. We could change our forecasting target from the middle of the next season to the beginning of the season, reducing the forecast horizon from 12 to 9 months. We could also start forecasting from one month prior to the current date to minimize the effect of submission lags on our estimates of the current global influenza population.

Despite the small effect that reducing sequence submission lags had on long-term forecasting accuracy, we still see a need to continue funding global genomic surveillance at higher levels than the pre-pandemic period. Compared to estimates of current viral diversity, forecasts of future influenza populations only represent one component of the overall decision-making process for vaccine development. For example, virologists must choose potential vaccine candidates from the diversity of circulating clades well in advance of vaccine composition meetings to have time to grow virus in cells and eggs and measure antigenic drift with serological assays ** (Morris et al., 2018**;

**). Similarly, prospective measurements of antigenic escape from human sera allow researchers to predict substitutions that could escape global immunity**

*Loes et al., 2024***;**

*(Lee et al., 2019***;**

*Greaney et al., 2022***). The finding of even a few sequences with a potentially important antigenic substitution could be enough to inform choices of vaccine candidate viruses. Finally, our results here reflect uncorrelated submission lags for each sequence, but actual lags can strongly correlate between sequences from the same originating and submitting labs. These correlated lags could further decrease the accuracy of frequency estimates beyond our more conservative estimates. More rapid sequence submission will improve our understanding of the present and give decision-makers more choices for new vaccines. Such reductions in submission lags depend on substantial, sustained funding and capacity building globally.**

*Welsh et al., 2023*## Methods and Materials

### Selection of natural influenza A/H3N2 HA sequences

We downloaded all A/H3N2 HA sequences and metadata from GISAID’s EpiFlu database ** (Shu and McCauley, 2017**) as of November 2023. We evenly sampled sequences geographically and temporally as previously described

**). Briefly, we selected 90 sequences per month, evenly sampling from major continential regions (Africa, Europe, North America, China, South Asia, Japan and Korea, Oceania, South America, Southeast Asia, and West Asia) and excluding sequences labeled as egg-passaged or missing complete date annotations. For our forecasting analyses, we selected sequences collected between April 1, 2005 and October 1, 2019.**

*(Huddleston et al., 2020*### Simulation of influenza A/H3N2-like HA sequences

We simulated A/H3N2-like populations as previously described ** (Huddleston et al., 2020**). Briefly, we simulated A/H3N2 HA sequences with SANTA-SIM

**) for 10,000 generations or 50 years at 200 generations per year. We discarded the first 10 years of simulated data as a burn-in period and used the next 30 years of the remaining data for our analyses. We sampled 90 viruses per month to match the sampling density of natural populations.**

*(Jariani et al., 2019*### Estimating and assigning submission lags

We estimated the lag between sample collection and submission of A/H3N2 hemagglutinin (HA) sequences to the GISAID EpiFlu database ** (Shu and McCauley, 2017**) by calculating the difference in GISAID-annotated submission date and collection date in days for samples collected between January 1, 2019 and January 1, 2020 and with a submission date prior to October 1, 2020. We selected this period of time as representative of modern genomic surveillance efforts prior to changes in circulation patterns of influenza caused by the SARS-CoV-2 pandemic. Of the 104,392 HA sequences in GISAID EpiFlu, 11,222 (11%) were collected during this period with a mean submission lag of 98 days (∼3 months) and a median lag of 74 days. Only 11% of sequences (N=1,210) were submitted within 4 weeks of collection, and only 36% (N=4,057) were submitted within 8 weeks

**, purple).**

*(Figure 1A*We modeled the shape of the observed lag distribution as a gamma distribution using a maximum likelihood fit from SciPy 1.10.1 ** (Virtanen et al., 2020**). With this approach, we estimated a shape parameter of 1.76, a scale parameter of 53.18, and location parameter of 3.98. The product of these shape and scale values corresponded to a mean lag of 93.76 days

**, green). To assign realistic submission lags to each sample in our analysis, we randomly sampled from this gamma distribution and calculated a “realistic submission date” by adding the sampled lag in days to the observed collection date. This approach allowed us to assign realistic lags to natural and simulated populations without the biases and autocorrelations associated with historical submission patterns across different submitting labs.**

*(Figure 1A*Based on the observed rapid submission of SARS-CoV-2 genomes during the first years of the pandemic, we expected that an achievable “ideal” submission lag for seasonal influenza sequences would have a 1-month average lag instead of the observed ∼3-month lag from the pre-pandemic period. We modeled this ideal submission lag distribution by dividing the gamma shape parameter by 3 to get a value of 0.59 and a corresponding mean lag of 31.25 days ** (Figure 1A**, orange). This approach effectively shifted the realistic gamma toward zero, while maintaining the relatively longer upper tail of the distribution. To assign ideal submission lags to each sample in our analysis, we randomly sampled from this modified gamma distribution and added the sampled lag in days to the observed collection date. Additionally, we required that each sample’s “ideal” lag be less than or equal to its “realistic” lag.

To estimate the effect of increased global sequencing capacity associated with the response to the SARS-CoV-2 pandemic, we summarized the lag distribution for sequences submitted to GISAID EpiFlu between January 1, 2022 and January 1, 2023. During this period, global influenza circulation had rebounded to its prepandemic level and 26,394 HA sequences were collected. The mean and median submission lags during this period were 76 and 62 days, respectively, representing a trend toward reduced lags compared to the prepandemic era ** (Figure 1—figure Supplement 1**).

### Phylogenetic inference

We inferred time-scaled phylogenetic trees for HA sequences as previously described ** (Huddleston et al., 2020**). Briefly, we aligned sequences with MAFFT v7.520

**;**

*(Katoh et al., 2002***) using the augur align command in Augur v22.3.0**

*Katoh and Standley, 2013***). We inferred phylogenies with IQ-TREE v2.2.3**

*(Huddleston et al., 2021***) using the augur tree command with IQ-TREE parameters of -ninit 2 -n 2 -me 0.05 and a general time reversible (GTR) model. We inferred time-resolved phylogenies with TreeTime v0.10.1**

*(Nguyen et al., 2014***) with the augur refine command.**

*(Sagulenko et al., 2018*### Forecasting with different forecast horizons

We tested the effect of forecasting future influenza populations at forecast horizons of 3, 6, 9, and 12 months ** (Figure 1B**). Previously, we produced forecasts every 6 months starting from October 1 and April 1 and predicting 12 months into the future

**). To support forecasts in 3-month intervals, we produced annotated time trees for 6 years of HA sequences every 3 months with data available up to the first day of January, April, July, and October. We produced these trees for each timepoint with three different lag scenarios: no lag, ideal lag, and realistic lag. For each scenario, we selected sequences for analysis at a given timepoint based on their collection date, ideal submission date, or realistic submission date, respectively. This experimental design produced forecasts for three lag types at each of the four forecast horizons (e.g.,**

*(Huddleston et al., 2020***, blue, green, and orange initial timepoints for the 3-month forecast horizon).**

*Figure 1B*Since reliable submission dates were not available prior to April 2005, our analysis of natural A/H3N2 sequences spanned from April 1, 2005 to October 1, 2019. To simplify the data required for these analyses, we produced forecasts of natural A/H3N2 populations with our best sequence-only model from our prior work ** (Huddleston et al., 2020**), a composite model based on local branching index (LBI)

**) and mutational load**

*(Neher et al., 2014***). For simulated A/H3N2-like populations, we produced forecasts with the “true fitness” model that relies on the normalized fitness value of each simulated sample.**

*(Łuksza and Lässig, 2014*Each forecast generated a predicted future frequency per sequence in the initial timepoint’s tree. As in our prior work, we calculated the earth mover’s distance ** (Rubner et al., 1998**) between the predicted and observed future populations using HA amino acid sequences from initial and future timepoints, predicted future frequencies from the initial timepoint, and observed future frequencies from future timepoint. For the future timepoint, we used data from the “no lag” scenario as our truth set, regardless of the lag scenario for the initial timepoint. This design allowed us to measure the effect of ideal and realistic submission lags on forecast accuracy relative to a scenario with no lags.

### Defining clades

Official clade definitions do not exist for all time periods of our analysis of A/H3N2 populations and do not exist at all for simulated A/H3N2-like populations. Therefore, we defined clades *de novo* for both population types with the same clade assignment algorithm used to produce “subclades” for recent seasonal influenza vaccine composition meeting reports ** (Huddleston et al., 2024**). The complete algorithm description and implementation is available at https://github.com/neherlab/flu_clades. Briefly, the algorithm scores each node in a phylogenetic tree based on three criteria including the number of child nodes descending from the current node, the number of epitope substitutions on the branch leading to the current node, and the number of amino acid mutations since the last clade assigned to an ancestor in the tree. After assigning and normalizing scores, the algorithm traverses the tree in preorder, assigning clade labels to each internal node whose score exceeds a predefined threshold of 1.0. Clade labels follow a hierarchical nomenclature inspired by Pangolin

**) such that the first clade in the tree is named “A” and its first immediate descendant is named “A.1”. For each population type, we applied this algorithm to a single phylogeny representing all HA sequences present in our analysis. This approach allowed us to produce a single clade assignment per sequence and easily identify related sequences between initial and future timepoints using the hierarchical clade nomenclature.**

*(O’Toole et al., 2021*### Estimating current and future clade frequencies

We estimated clade frequencies with a kernel density estimation (KDE) approach as previously described ** (Huddleston et al., 2020**) with the augur frequencies command

**). Briefly, we represented each sequence in a given phylogeny by a Gaussian kernel with a mean at the sequence’s collection date and a variance of two months. We estimated the frequency of each sequence at each timepoint by calculating the probability density function of each KDE at that timepoint and normalizing the resulting values to sum to one.**

*(Huddleston et al., 2021*We calculated clade frequencies for each initial timepoint in our analysis by first summing the frequencies of individual sequences in a given timepoint’s tree by the clade assigned to each sequence and then summing the frequencies for each clade and its descendants to obtain nested clade frequencies. To inspect the effects of submission lags on clade frequency estimates, we calculated the clade frequency error per timepoint and clade by subtracting the clade frequency estimated with ideal or realistic lagged sequence submission from the corresponding clade frequency without lags. We compared the effects of submission lags for clades of different sizes by filtering clades by their frequency estimated without lags to small clades (>0% and <10%) and large clades (≥10%).

To estimate the accuracy of clade frequency forecasts, we needed to calculate the predicted and observed future clade frequencies for each combination of lag type, initial timepoint, and future timepoint in the analysis. We calculated predicted future frequencies for all clades that existed at given initial timepoint and lag type by first summing the predicted future frequency per sequence by the clade assigned to each sequence and then summing the predicted frequencies for each clade and its descendants. Clades that existed at any given future timepoint were not always represented at a corresponding initial timepoint either because the clades had not emerged yet or sequences for those clades had a lagged submission. For this reason, we calculated observed future clade frequencies in a multi-step process. First, we calculated the frequencies of clades observed at the future timepoint without submission lag by summing the individual frequencies of all sequences in each clade. Then, we mapped each future clade to its most derived ancestral clade that circulated at the initial timepoint by progressively removing suffixes from the future clade’s label until we found a match in the initial timepoint. For example, if the future timepoint had a clade named A.1.1.3 and the initial timepoint had the ancestral clade A.1, we would test for the presence of A.1.1.3, A.1.1, and A.1 at the initial timepoint until we found a match. The hierarchical nature of the clade assignment algorithm guaranteed that each future clade mapped directly to a clade at each initial timepoint and lag type. Finally, we summed the frequencies of future clades by their corresponding initial clades to get the observed future frequencies of clades circulating at the initial timepoint. We calculated the accuracy of clade frequency forecasts as the difference between the predicted and observed future clade frequencies.

### Data and software availability

Sequence data are available from the GISAID EpiFlu Database using accessions provided in Supplemental File S1. Source code for the analysis workflow and manuscript are available in the project’s GitHub repository (https://github.com/blab/flu-forecasting-delays). Supplemental data are available on Zenodo at DOI 10.5281/zenodo.13742375.

## Data Availability

Sequence data are available from the GISAID EpiFlu Database using accessions provided in the supplemental data. Source code for the analysis workflow and manuscript are available in the project's GitHub repository (https://github.com/blab/flu-forecasting-delays). Supplemental data are available on Zenodo at DOI 10.5281/zenodo.13742375 (https://zenodo.org/records/13742375).

## Author contributions

JH designed and implemented experiments, analyzed results, and wrote the manuscript. TB edited the manuscript.

## Competing interests

The authors declare that no competing interests exist.

## Supplemental Files

Supplemental File S1. GISAID accessions and metadata including originating and submitting labs for natural strains used across all timepoints.

## Acknowledgments

We gratefully acknowledge the authors, originating and submitting laboratories of the sequences from the GISAID EpiFlu Database ** (Shu and McCauley, 2017**) on which this research is based. A list of sequence accessions, authors, and labs appear in the Supplemental Material. We thank Katie Kistler and Marlin Figgins for their comments on early versions of this manuscript and Richard A. Neher for the development of tools for hierarchical clade nomenclature. This work was funded by NIAID R01 AI165821-01. TB is a Howard Hughes Medical Institute Investigator.