## Abstract

Estimating an epidemic’s trajectory is crucial for developing public health responses to infectious diseases, but incidence data used for such estimation are confounded by variable testing practices. We show instead that the population distribution of viral loads observed under random or symptom-based surveillance, in the form of cycle threshold (Ct) values, changes during an epidemic and that Ct values from even limited numbers of random samples can provide improved estimates of an epidemic’s trajectory. Combining multiple such samples and the fraction positive improves the precision and robustness of such estimation. We apply our methods to Ct values from surveillance conducted during the SARS-CoV-2 pandemic in a variety of settings and demonstrate new approaches for real-time estimates of epidemic trajectories for outbreak management and response.

## Main Text

Real-time tracking of infection incidence during an epidemic is fundamental for public health planning and intervention (*1, 2*). In the severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) pandemic, key epidemiological parameters such as the time-varying effective reproductive number, *R*_{t}, have typically been estimated using the time-series of observed case counts, percent of positive tests, or deaths, usually based on reverse-transcription quantitative polymerase chain reaction (RT-qPCR) testing. However, reporting delays (*3*), limited testing capacities, and changes in test availability over time all impact the ability of routine testing to reliably and promptly detect underlying changes in infection incidence (*4, 5*). In particular, whether changes in case counts at different times reflect epidemic dynamics or simply changes in testing have been major topics of debate with important economic, health and political ramifications. Here, we describe a new method to overcome these biases and obtain accurate estimates of the epidemic trajectory, one that does not require repeat measurements and uses routinely generated but currently discarded quantitative data from RT-qPCR testing from single or successive cross-sectional samples.

RT-qPCR tests provide quantitative results in the form of cycle threshold (Ct) values, which are inversely correlated with log_{10} viral loads, but they are often reported only as binary “positives” or “negatives” (*6, 7*). It is common when testing for other infectious diseases to use this quantification of sample viral load, for example, to identify individuals with higher clinical severity or transmissibility (*8–11*). For SARS-CoV-2, Ct values may be useful in clinical determinations about the need for isolation and quarantine (*7, 12*), identifying the phase of an individual’s infection (*13, 14*) and predicting disease severity (*14, 15*). However, individual-level decision making based on Ct values has not yet become a widespread reality due to the variability in measurements across testing platforms and samples, and limited data to understand SARS-CoV-2 viral kinetics in asymptomatic and presymptomatic infections. These concerns do not necessarily hold at the population level: whereas a single high Ct value may not necessarily guarantee a low viral load in one sample, high Ct values in many samples will indicate a population with predominantly low viral loads. Indeed, the population-level distribution of Ct values does appear to change over time. For example, a systematic incline in the distribution of quantified Ct values has been noted alongside epidemic decline (*12, 14, 16*).

We demonstrate that population-level changes in the distribution of observed Ct values can arise as an epidemiological phenomenon, and propose methods to use these quantitative values to estimate epidemic trajectories from one or more cross-sectional samples.

### Relationship Between Observed Ct Values and Epidemic Dynamics

First, we show that the interaction of within-host viral kinetics and epidemic dynamics can drive changes in the distribution of Ct values over time without a change in the underlying pathogen kinetics. To demonstrate the epidemiological link between transmission rate and measured viral loads or Ct values, we first simulated infections arising under a deterministic susceptible-exposed-infectious-recovered (SEIR) model (Fig. 1A, *Materials and Methods: Simulated Epidemic Transmission Models)*. Parameters used are in Table S1. At selected testing days during the outbreak, simulated Ct values are observed from a random sample of the population using the Ct distribution model described in *Materials and Methods: Ct Value Model* and shown in Figs. S1 and S2. By drawing simulated samples for testing from the population at specific time points, these simulations recreate realistic cross-sectional distributions of detectable viral loads across the course of an epidemic. Throughout, we assume each individual is infected at most once, ignoring re-infections as these appear to be a negligible portion of infections in the epidemic so far (*17*).

Early in the epidemic, infection incidence grows rapidly and the typical infection is thus recent; as the epidemic wanes, however, the average time since exposure increases as the rate of new infections decreases (Fig. 1B,E) (*18*); this is analogous to the average age being lower in a growing vs. declining population (*19*). Infections are often unobserved events, but we can rely on an observable quantity, such as viral load, as a proxy for the time since infection. Since Ct values change over time within infected hosts (Fig. 1C), random sampling of individuals during epidemic growth is more likely to measure individuals who were recently infected and therefore in the acute phase of their infection with higher quantities of viral RNA. Conversely, sampling infected individuals during epidemic decline is more likely to capture individuals in the convalescent phase, typically sampling lower quantities of viral RNA (Fig. 1D). The distribution of observed Ct values therefore changes over time, as measured by the median, quartiles, and skewness (Fig. 1G). While estimates for an individual’s time since infection based on a single Ct value will be highly uncertain, the population-level distribution of observed Ct values will vary with the growth rate, and therefore *R*_{t}, of new infections (Fig. 1F,H). Similar principles have been applied to serologic data to infer unobserved individual-level infection events (*16, 20–22*) and population-level parameters of infectious disease spread (*20, 23–27*).

This phenomenon is also present, though less pronounced, among viral loads measured under symptom-based surveillance (Fig. S3). One might imagine that the typical time since infection would not depend on the epidemic trajectory in individuals systematically sampled soon after symptom onset. However, the distribution of delays between infection date and test date is a convolution of the infection incidence curve and the confirmation delay distribution (time from infection to testing of symptomatic infections) (*28*). Individuals tested due to recent symptom onset are more likely to have been recently infected with a short incubation period during epidemic growth than during epidemic decline, where more onsets are from older infections with longer incubation periods. The time-since-infection distribution of individuals tested based on symptom onset, and therefore their measured viral loads, is influenced by the stage of the epidemic.

By modeling the variation in observed Ct values arising from individual-level viral growth/clearance kinetics and sampling errors, the distribution of observed Ct values becomes an estimable function of the times since infection, and the expected median and skewness of Ct values at a given point in time are then predictable from the growth rate. This function can then be used to estimate the epidemic growth rate conditional on a set of observed Ct values. The relationship between observed Ct value and epidemic growth rate holds for any testing approach, though calibration is needed to define the precise mapping (i.e., using a different RT-qPCR instrument, a different Ct value threshold, or in a different lab; see Fig. S4).

### Inferring Epidemic Trajectory Using a Single Cross-Section

From these relationships, we derived a method to formally infer the epidemic growth rate given a single cross-section of RT-qPCR test results. The method combines two models: (1) the likelihood of observing a Ct value or negative result conditional on having been infected on a given day; and (2) the likelihood of being infected on a given day prior to the sample date. For (1), we used a Bayesian model and defined priors for the mode and range of Ct values following infection based on the existing literature (*Materials and Methods: Ct Value Model* and *Single Cross-Section Model*). For (2), we initially developed two models to describe the probability of infection over time: (a) constant exponential growth of infection incidence; or (b) infections arising under an SEIR model. Both models provide estimates for the epidemic growth rate, but make different assumptions regarding the possible shape of the outbreak trajectory: the exponential growth model assumes a constant growth rate, whereas the SEIR model assumes that the growth rate changes daily depending on the remaining number of susceptible individuals.

We first investigated how the distribution of Ct values and prevalence of PCR positivity changed over time in four well-observed Massachusetts long-term care facilities that underwent SARS-CoV-2 outbreaks in March and April 2020 (*29*). These facilities were relatively closed after outbreaks began, so we model the outbreak within each facility using an extended SEIR (SEEIRR) model, with additional exposed and recovered compartments to account for the duration of PCR positivity (*Materials and Methods: Simulated Epidemic Transmission Models*). In each facility, we have the results of near-universal PCR testing, including both residents and staff, from three time points after the outbreak began, including the number of positive samples, the Ct values of positive samples, and the number of negative samples (*Materials and Methods: Nursing Home Data*). Fig. 2 shows results for one of these facilities, while Fig. S5 shows results for the other three.

In Fig. 2A, we fit the SEEIRR compartmental model to the three observed point prevalence values from the facility as a benchmark. The distribution of observed Ct values at each time point (Fig. 2B) shifts higher and becomes more left-skewed at later time points. We then fit the exponential growth and simple SEIR models using the Ct likelihood to each individual cross-section to get posterior distributions for the epidemic trajectory up to that time point (Fig. 2C). Note that these fits do not use any longitudinal data; each is fit to the positive and negative Ct values from only one time point. To assess the fit, we compare the predicted Ct distribution and point prevalence from each fit to the data (Fig. 2B,D) and compare the growth rates from these fits to those derived from the fits to the point prevalences. Posterior distributions of all Ct value model parameters are shown in Fig. S6.

While both sets of results are fitted models and so neither can be considered the truth, we find that the Ct method fit to one cross-section of data provides a similar posterior median trajectory to the compartmental model fit to three point prevalences. In particular, the Ct-based models appear to accurately discern whether the samples were taken soon or long after peak infection incidence. Both methods were in agreement over the direction of the past average and recent daily growth rates (i.e., whether the epidemic is currently growing or declining, and whether the growth rate has dropped relative to the historic average). The average growth rate estimates were very similar at most time points, though the daily growth rate appeared to decline earlier in the compartmental model. Overall, these results demonstrate that a single cross-section of Ct values can provide similar information to point prevalence estimates from three distinct sampling rounds.

To ensure that our method provides accurate estimates of the epidemic trajectory, we performed extensive simulation-recovery experiments using a synthetic nursing home population undergoing a stochastic SEIR epidemic. We assess performance using various models, including a version that uses only positive Ct values, and varying parameters of the simulation; details are in *Materials and Methods: Simulated Nursing Home Outbreaks* and results in Figs. S7–S9.

### Inferring Epidemic Trajectory Using Multiple Cross-Sections

Next, we extended our method to combine data from multiple cross-sections, allowing us to more reliably estimate the epidemic trajectory (*Materials and Methods: Multiple Cross-Sections Model* and *Markov Chain Monte Carlo Framework*). In many settings, the epidemic trajectory is monitored using reported case counts, the definition of which can change during the epidemic (*30*). Limiting reported cases to positive test results, the number of new positives among the tests conducted each day can be used to calculate *R*_{t} (*3*). However, these data represent the growth rate of positive tests and not the incidence of infection, requiring adjustments to account for changes in testing capacity, the delay between infection and test report date, and the conversion from prevalence to incidence. When, instead, Ct values from surveillance sampling is available, our methods can overcome these limitations by providing a direct mapping between the distribution of Ct values and infection incidence. Crucially, the Ct-based methods are agnostic to changing testing rates, providing unbiased growth rate estimates where case count-based methods exhibit bias (*5*).

To demonstrate the performance of these methods, we use them to recover parameters from SEIR-based simulations under a variety of testing schemes (*Materials and Methods: Simulated Testing Schemes*). We compare the performance of *R*_{t} estimation using reported case counts via the R package *EpiNow2* (*28, 31*), where reporting depends on testing capacity and the symptom status of infected individuals, to the performance of our methods when one, two, or three surveillance samples are available with observed Ct values, with a total of about 0.3% of the population sampled (3000 tests spread among the samples).

Figure 3 plots the posterior median *R*_{t} from each of the 100 simulations of each method when the epidemic is growing (day 60) and declining (day 88). Except when only one sample is used, the Ct-based methods fitting to an SEIR model exhibit minimal bias, even when testing capacity changes. Methods based on reported case counts, on the other hand, exhibit noticeable upward bias when testing rates increase over the period observed and substantial downward bias when testing rates decrease. The Ct-based methods do exhibit higher variability, however. This is captured by the Bayesian inference model, as all of the Ct-based methods achieve at least nominal coverage of the 95% credible intervals among these 100 simulations. The methods based on reported case counts have coverage below 70% when testing is falling at either time point and when cases are rising while the epidemic is declining.

### Reconstructing Complex Incidence Curves Using Ct Values

Simple epidemic models are useful to understand recent incidence trends when data are sparse or in relatively closed populations where the epidemic start time is approximately known (*Materials and Methods: Epidemic Seed Time Priors*). In reality, however, the epidemic usually follows a more complex trajectory which is difficult to model parametrically. For example, the SEIR model does not account for the implementation/relaxation of non-pharmaceutical interventions unless explicitly specified in the model. For a more flexible approach to estimating the epidemic trajectory from multiple cross-sections, we developed a third model for infection incidence, using a Gaussian Process (GP) prior for the underlying daily probabilities of infection (*32*). The GP method provides estimated daily infection probabilities without making strong assumptions about the epidemic trajectory, assuming only that infection probabilities on contemporaneous days are correlated, with decreasing correlation at increasing temporal distances (*Materials and Methods: Gaussian Process Model*). Movie S1 demonstrates how estimates of the full epidemic trajectory can be sequentially updated using this model as new samples become available over time.

With the objective of reconstructing the entire incidence curve using routinely collected RT-qPCR data, we used anonymized, Ct values from positive samples measured from nearly all hospital admissions into Brigham & Women’s Hospital (BWH) in Boston, MA, between April 3 and November 10, 2020 (*Materials and Methods: Brigham & Women’s Hospital Data*). We aligned these with estimates for *R*_{t} based on case counts in Massachusetts (Fig. 4A–C). The median and skewness of the detectable Ct distribution was correlated with *R*_{t} (Fig. 4B), in line with our theoretical predictions. Tests taken prior to April 3 were restricted to symptomatic patients only, while those after April 15 represented near-universal testing of all hospital admissions and non-admitted ER patients. The median Ct value rose (corresponding to a decline in median viral load) and skewness of the Ct distribution fell in the late spring and early summer, as shelter-in-place orders and other non-pharmaceutical interventions were rolled out (Fig. 4C), but the median declined and skewness rose in late summer and early fall as these measures were relaxed, coinciding with an increase in observed case counts for the state (Fig. 4A).

Using the observed Ct values we estimated the daily growth rate of infections using the SEIR model on single cross-sections (Fig. 4D,E, Fig. S10, Fig. S11) and the daily relative probability of infection over time using the GP model (Fig. 4F, Fig. S12). Similar temporal trends were inferred under both models, and the GP model provided growth rate estimates that followed those estimated using observed case counts (Fig. 4G). While these data are not strictly a random sample of the community, and the observed case counts do not necessarily provide a ground truth for the *R*_{t} value, this demonstrates the ability of this method to re-create epidemic trajectories and estimate growth or decline of cases using only positive Ct values collected through routine testing. Interestingly, our estimated epidemic trajectory using only routinely generated Ct values from a single hospital was remarkably similar to changes in viral loads obtained from wastewater data (Fig. S13) (*33*).

## Discussion

The usefulness of Ct values for public health decision making is currently the subject of much discussion and debate. One unexplained observation which has been consistently observed in many locations is that the distribution of observed Ct values has varied over the course of the current SARS-CoV-2 pandemic, which has led to questions over whether the fitness of the virus has changed (*12, 14, 16*). Our results demonstrate instead that this can be explained as a purely epidemiologic phenomenon, without any change in individual-level viral dynamics or testing practices. We find that properties of the population-level Ct distribution strongly correlate with estimates for the effective reproductive number or growth rate in real-world settings, in line with our theoretical predictions.

Using quantitative diagnostic test results from multiple different tests conducted in a single cross-sectional survey, Rydevik et al. (*18*) demonstrated that epidemic trends could be inferred from virologic data. The methods we describe here use the phenomenon observed in the present pandemic and the relationship between incidence rate, time since infection, and virologic test results to estimate a community’s position in the epidemic curve, under various models of epidemic trajectories, based on data from one or more cross-sectional surveys using a single virologic test. Despite the challenges of sampling variability, individual-level differences in viral kinetics, and limitations in comparing results from different laboratories or instruments, our results demonstrate that RT-qPCR Ct values, with all of their quantitative variability for an individual, can be highly informative of population-level dynamics. This information is lost when measurements are reduced to binary classifications.

Our results demonstrate that this method can be used to estimate epidemic growth rates based on data collected at a single time point, and independent of assumptions about the intensity of testing. Comparisons of simulated Ct values and observed Ct values with growth rates and *R*_{t} estimates validate this general approach. Results should be interpreted with caution in cases where the observed Ct values are not from a population census or a largely random sample, or when there are very few samples with detectable viral load. When testing is based primarily on the presence of symptoms or follow-up of contacts of infected individuals, people may be more likely to be sampled at specific times since infection and thus the distribution of observed Cts would not be representative of the population as a whole. This method may be most useful in settings where representative surveillance samples can be obtained independent of COVID-19 symptoms, such as the REACT study (*34*). These methods allow municipalities to evaluate and monitor, in real-time, the role of various epidemic mitigation interventions, for example by conducting even a single or a small number of random virologic testing samples as part of surveillance.

These results are sensitive to the true distribution of observed viral loads each day after infection. Different swab types, sample types, instruments, or Ct thresholds may alter the variability in the Ct distribution (*15, 16, 35, 36*), leading to different relationships between the specific Ct distribution and the epidemic trajectory. Setting-specific calibrations, for example based on a reference range of Ct values, will be useful to ensure accuracy. Here, we generated a viral kinetics model based on observed properties of measured viral loads (proportion detectable over time following symptom onset, distribution of Ct values from true specimens), and used these results to inform priors on key parameters when estimating growth rates. The growth rate estimates can therefore be improved by choosing more precise, accurate priors relevant to the observations used during model fitting. In cases where results come from multiple testing platforms, the model should either be adjusted to account for this by specifying a different distribution for each platform based on its properties or, if possible, the Ct values should be transformed to a common scale such as log viral copies. Results could also be improved if individual-level features that may affect viral load, such as symptom status, age, and antiviral treatment, are available with the data and incorporated into the Ct value model (*14–16, 37, 38*). A similar approach may also be possible using serologic surveys, as an extension of work that has related time since infection to antibody titers for other infectious diseases (*26, 27*). If multiple types of tests (e.g., antigen and PCR) are conducted at the same time, combining information can substantially reduce uncertainty in these estimates as well (*18*). If variant strains are associated with different viral load kinetics and become common (*39*), this should be incorporated into the model as well.

This method has a number of limitations. While the Bayesian framework incorporates the uncertainty in viral load distributions into inference on the growth rate, parametric assumptions and reasonably strong priors on these distributions aid in identifiability. If these parametric assumptions are violated, inference may not be reliable. In addition, the methods described here and the relationship between incidence and skewness of Ct distributions become unreliable when there are very few positive cases, so results should be interpreted with caution and sample sizes increased in periods with low incidence. In some cases with one or a small number of cross-sections, the observed Ct distribution could plausibly result from all individuals very early in their infection at the start of fast epidemic growth, all during the recovery phase of their infection during epidemic decline, or a mixture of both (Fig. 4E, Fig. S11). We therefore used a parallel tempering MCMC algorithm which is able to accurately estimate these multimodal posterior distributions. Interpretation of the estimated median growth rate and credible intervals should be done with proper epidemiological context: estimated growth rates that are grossly incompatible with other data can be safely excluded.

This method may also overstate uncertainty in the viral load distributions if results from different machines or protocols are used to inform the prior. A more precise understanding of the viral load kinetics, and modeling those kinetics in a way that accounts for the epidemiologic and technical setting of the measurements, will help improve this approach and determine whether Ct distribution parameters from different settings are comparable. Because of this, quantitative measures from RT-qPCR should be reported regularly for SARS-CoV-2 cases and early assessment of pathogen load kinetics should be a priority for future emerging infections. The use of a control procedure in the measurements, like using the ratio of detected viral RNA to detected human RNA, could also improve the reliability and comparability of Ct measures.

The Ct value is a measurement with magnitude, which provides information on underlying viral dynamics. Although there are challenges to relying on single Ct values for individual-level decision making, the aggregation of many such measurements from a population contains substantial information. These results demonstrate how population-level distributions of Ct values can provide information on important epidemiologic questions of interest, even from a single cross-sectional survey. Better epidemic planning and more targeted epidemiological measures can then be implemented based on this survey, or use of Ct values can be combined with repeated sampling to maximize the use of available evidence.

## Data Availability

All code to perform the analyses and generate the figures presented in this article is available at https://github.com/jameshay218/virosolver_paper and https://github.com/jameshay218/virosolver. Simulated data and real data used in the analyses are also available at https://github.com/jameshay218/virosolver_paper. For the model fitting, code for the Markov chain Monte Carlo framework is available at https://github.com/jameshay218/lazymcmc and https://github.com/jameshay218/lazymcmc/tree/parallel_tempering. The authors used code developed by Abbott et al. to estimate Rt from reported case counts; this is available at https://github.com/epiforecasts/EpiNow2.

## Funding

U.S. National Institutes of Health Director’s Early Independence Award DP5-OD028145 (MJM, JAH)

Morris-Singer Fund (LKS, ML)

U.S. Centers for Disease Control and Prevention Award U01IP001121 (LKS, ML)

U.S. National Institute of General Medical Sciences award U54GM088558 (ML, JAH, LKS)

## Author contributions

Conceptualization: JAH, LKS, ML, MJM

Methodology: JAH, LKS, ML, MJM Visualization: JAH, LKS, ML, MJM

Investigation: JAH, LKS, SK, ML, MJM

Resources: SK, NJL, SBG, MJM

Data curation: JAH, SK, NJL, SBG, MJM

Software: JAH, LKS, SK, NJL, SBG

Funding acquisition: ML, MJM

Supervision: ML, MJM

Writing – original draft: JAH, LKS, SK, ML, MJM

Writing – review & editing: JAH, LKS, SK, NJL, SBG, ML, MJM

## Competing interests

ML discloses honoraria/consulting from Merck, Affinivax, Sanofi-Pasteur, and Antigen Discovery; research funding (institutional) from Pfizer, and unpaid scientific advice to Janssen, Astra-Zeneca, and Covaxx (United Biomedical). MJM is a medical advisor for Detect. All other authors declare no competing interests.

## Data and materials availability

All code to perform the analyses and generate the figures presented in this article is available at https://github.com/jameshay218/virosolver_paper and https://github.com/jameshay218/virosolver. Simulated data and real data used in the analyses are also available at https://github.com/jameshay218/virosolver_paper. For the model fitting, code for the Markov chain Monte Carlo framework is available at https://github.com/jameshay218/lazymcmc and https://github.com/jameshay218/lazymcmc/tree/parallel_tempering. The authors used code developed by Abbott et al. to estimate *R*_{t} from reported case counts; this is available at https://github.com/epiforecasts/EpiNow2.

## Acknowledgments

We thank Steven Riley for helpful discussions.

## Footnotes

Significant additions to the methodology, figures and analyses.