Abstract
The growing toll of the COVID-19 pandemic has heightened the urgency of identifying individuals most at risk of infection and severe outcomes, underscoring the need to assess susceptibility and severity patterns in large datasets.1 The AncestryDNA COVID-19 Study collected self-reported survey data on symptoms, outcomes, risk factors, and exposures for over 563,000 adult individuals in the U.S., including over 4,700 COVID-19 cases as measured by a self-reported positive nasal swab test. We observed significant associations between several risk factors and COVID-19 susceptibility and severity outcomes. Many of the susceptibility associations were accounted for by differences in known exposures; a notable exception was elevated susceptibility odds for males after adjusting for known exposures and age. We also leveraged the dataset to build risk models to robustly predict individualized COVID-19 susceptibility (area under the curve [AUC]=0.84) and severity outcomes including hospitalization and life-threatening critical illness amongst COVID-19 cases (AUC=0.87 and 0.90, respectively). The results highlight the value of self-reported epidemiological data at scale to provide public health insights into the evolving COVID-19 pandemic.
Main
The COVID-19 pandemic has resulted in nearly 35 million COVID-19 cases and more than 1,000,000 deaths worldwide,2 including nearly 7.5 million cases and more than 210,000 deaths in the United States as of early October 2020.3 The growing impact of the pandemic intensifies the need for enhanced understanding of COVID-19 susceptibility and severity risk factors, not only for public health experts, but also for individuals seeking to assess their own personalized risk. Prior research has hypothesized that differences in COVID-19 susceptibility are related to age,4 sex-dependent immune responses,5 and genetics,6,7 while heightened severity of COVID-19 illness is associated with risk factors such as age,1,4,8,9 sex,5,10–12 underlying health conditions,1,9,10,13,14 and genetic factors including the ABO blood group.15 Self-reported survey data, which can easily be collected in the home, afford the opportunity to dynamically monitor the continually evolving pandemic and allow for real-time estimation of individual-level COVID-19 risk.16–19 Furthermore, self-reported surveys allow for collection of information about known exposures, which few epidemiological COVID-19 studies have explicitly accounted for in association analyses to date.20
In this paper, we aim to provide insight into factors associated with susceptibility and severity of COVID-19 using a large survey cohort of 563,141 AncestryDNA customers who have consented to participate in the AncestryDNA COVID-19 Study (Supplementary Tables 1-6).6 We found that known exposure differences account for many associations for a positive COVID-19 test result.18,21 Further analysis yielded evidence that males may be more susceptible to COVID-19 than females after adjusting for known exposures and age. While several previous reports have documented increased severity risk in males,5,11,13 the finding of sex-based differences in susceptibility here warrants further investigation. This study also replicates previous reports of strong associations between certain health conditions, age, and COVID-19 severity,1,8–14 many of which remain significant after adjusting for other risk factors. We additionally investigated the symptomatology of COVID-19, and found that the symptoms most strongly associated with a positive test result are distinct from the symptoms most strongly associated with severity.17,18,22,23
Finally, we built predictive risk models for COVID-19 susceptibility and severity outcomes. For susceptibility, we designed two models and additionally applied two literature-based models18 to predict COVID-19 cases among respondents reporting a test result. We also designed models to predict two different COVID-19 severity outcomes based on minimal information about demographics, health conditions, and symptoms: hospitalization due to COVID-19 infection (referred to throughout as “hospitalization”) and progression of an infection to a life-threatening critical case among those reporting a positive COVID-19 result (referred to throughout as “critical case”; Methods).13 Both the susceptibility and severity models are robust and generalizable across a large internal holdout set, and could potentially serve as tools for understanding individualized COVID-19 susceptibility and severity risk.1,16–19,24
Survey description
Survey responses were collected from AncestryDNA customers (see Methods and Supplementary Table 3 for demographic information). The survey collected self-reported responses to questions about COVID-19 test results, 15 symptoms among those who tested positive or who tested negative and had flu-like symptoms, disease progression for positive testers, age, height, weight, known exposures to biological relatives, household members, patients or any other contacts with COVID-19, and 11 underlying health conditions (Supplementary Tables 1 and 2). Data were collected between 22 April 2020 and 6 July 2020, and the survey completion rate was approximately 95%. In general, the COVID-19 positive test rate and self-reported clinical outcomes are consistent with those reported by the U.S. Centers for Disease Control and Prevention (CDC) over a similar period (Supplementary Note 1).21
Association Analyses
Susceptibility
We investigated associations between COVID-19 testers reporting a positive or negative result and risk factors within the dataset (Figure 1, Methods, Supplementary Tables 7-12). Unadjusted odds ratios (ORs) were calculated using simple logistic regression, and adjusted ORs (aORs) were calculated using multiple logistic regression including known exposures, age, and sex as risk factors. Unadjusted ORs provide insight into which individual variables are correlated with testing positive for COVID-19, while adjusted ORs provide insight into which of these associations are not completely explained by differences in age, sex, and known exposures.25
Known COVID-19 exposures, either through a household case (OR=26.03; 95% confidence interval [CI]=(22.26, 30.43)), biological relative (OR=5.77; 95% CI=(4.99, 6.68)), or other source of “direct” exposure (OR=6.94; 95% CI=(6.02, 7.99)) were the strongest predictors of a positive COVID-19 test result (Figure 1, Supplementary Table 7). In general, adjusting for known exposures, age, and sex resulted in attenuation of the ORs, with many associations becoming non-significant after adjustment (Figure 1, Supplementary Tables 8-9). Intriguingly, the OR for males was not attenuated after adjustment, and males remained at elevated odds after adjusting for known exposures and age (aOR=1.36; 95% CI=(1.19, 1.55); Figure 1, Supplementary Table 9). We also note that males and females reported comparable exposure burden, with males slightly more likely to report a household case of COVID-19 but less likely to report a case of COVID-19 among biological relatives (Supplementary Tables 11-12).
Younger individuals (ages 18-29; OR=1.51; 95% CI=(1.26, 1.81)), as well as individuals of admixed African-European (OR=1.48; 95% CI=(1.18, 1.85)) or admixed Amerindian ancestry (OR=1.49; 95% CI=(1.26, 1.77)) were significantly more likely to test positive compared to older individuals (ages 50-64, the largest age group in this cohort) or those of European ancestry, respectively (Supplementary Table 7). These individuals reported higher levels of COVID-19 cases within the household, cases among biological relatives, and/or other known “direct” COVID-19 exposures (Supplementary Tables 10-12).26–30 Adjusting for age, sex, and known exposures attenuated the OR for all three of these groups (younger aOR=1.28; 95% CI=(1.03, 1.59), African-European aOR=1.23; 95% CI=(0.94, 1.62), and Amerindian aOR=1.27; 95% CI=(1.04, 1.57); Figure 1, Supplementary Table 9).
Individuals reporting pre-existing medical conditions (e.g., cancer, cardiovascular disease, chronic kidney disease [CKD], diabetes, hypertension) were less likely to test positive for COVID-19 (Figure 1, Supplementary Table 7). We observed significantly decreased odds of a known “direct” exposure to COVID-19, as well as significantly decreased odds of a household case of COVID-19, among such individuals relative to those without any health conditions (OR=0.71; 95% CI=(0.65, 0.78) and OR=0.74; 95% CI=(0.65, 0.84), respectively; Supplementary Tables 10-11). Notably, individuals with asthma and those with other lung conditions had significantly decreased odds of testing positive after adjusting for known exposures, age, and sex (OR=0.82; 95% CI=(0.69, 0.97) and OR=0.67; 95% CI=(0.44, 1.00), respectively; Supplementary Table 9).
Severity
We investigated associations between demographics, exposures, symptoms, and underlying health conditions for hospitalization and critical illness progression among COVID-19 cases (Figure 2, Supplementary Figure 1). Consistent with previous reports,1,9,12–14 we observed positive associations between certain health conditions and COVID-19 severity outcomes; many of these associations remained significant after adjustment for age, sex, and obesity (BMI >= 30) (Figure 2, Supplementary Tables 13-16). COVID-19 cases reporting at least one underlying health condition were significantly more likely to progress to a critical case (OR=2.85; 95% CI=(1.78, 4.57); Figure 2, Supplementary Figure 1, Supplementary Table 15). Specific underlying health conditions that were associated with hospitalization and/or critical case progression included CKD, chronic obstructive pulmonary disease (COPD), diabetes, cardiovascular disease, and hypertension (Figure 2, Supplementary Figure 1, Supplementary Tables 13 and 15). Among individuals testing positive for COVID-19, the oldest (≥65 years) were significantly more likely to be hospitalized compared to those aged 50-64 (OR=1.70; 95% CI=(1.13, 2.56); Figure 2, Supplementary Table 13). Individuals of admixed African-European ancestry who tested positive were significantly more likely to report progression to a critical case, compared to those with European ancestry (OR=2.07; 95% CI=(1.03, 4.17); Supplementary Figure 1, Supplementary Table 15). Among COVID-19 cases, males were significantly more likely than females to report progression to a critical case (OR=1.54; 95% CI=(1.00, 2.37); Supplementary Figure 1, Supplementary Table 15); these findings are consistent with CDC reports of increased ICU admittance rates in males (3% vs. 2%).21
Differential symptomology
Among symptomatic people who reported a COVID-19 test result, those reporting moderate to severe change in taste or smell (OR=7.26; 95% CI=(5.54, 9.50)), fever (OR=1.60; 95% CI=(1.28, 2.01)), or feeling tired or fatigue (OR=1.41; 95% CI=(1.05, 1.89)) were more likely to test positive (Figure 3, Supplementary Table 7). Those reporting moderate to severe runny nose (OR=0.59; 95% CI=(0.47, 0.75)) or sore throat (OR=0.49; 95% CI=(0.39, 0.62)) were more likely to test negative, consistent with previous reports that these symptoms are more indicative of influenza or the common cold (Figure 3, Supplementary Table 7).17,18,22 Change in taste or smell, a hallmark symptom of COVID-19 infection, was not associated with hospitalization (OR=0.77; 95% CI=(0.55, 1.07); Figure 3, Supplementary Table 13). By contrast, dyspnea (shortness of breath) was the most predictive of hospitalization and critical case progression (OR=7.52; 95% CI=(4.92, 11.49) and OR=11.55; 95% CI=(5.91, 22.59), respectively),23 but was not associated with a positive test result (OR=1.14; 95% CI=(0.91, 1.44); Figure 3, Supplementary Tables 7, 13, and 15).
Predictive risk models
We developed models that predict an individual’s COVID-19 risk (positive test result or severity). Many predictive models for COVID-19 infection and severity outcomes have been reported in the literature.1,16–18,24,31 To date, however, few large-scale studies have investigated both susceptibility and severity risk within the same dataset, an approach that offers a consistent and comprehensive view of the similarities and differences between the two outcomes.
For each of the risk models we developed, the survey data were divided into independent training and test cohorts, and additional association analyses were performed on the training data to guide the selection of risk factors for the models (Methods, Supplementary Tables 17-20). In contrast to association analyses, these risk models are based on penalized logistic regression with cross-validation in order to allow for transferability to independent cohorts.
The susceptibility models were designed to predict a COVID-19 result (positive or negative) from risk factors among testers. In addition to our own two models, we also replicated two models based on self-reported data from the literature in order to assess how well our models perform relative to a respected benchmark.18 In all, we compared four models: our model based on demographics and exposures (referred to throughout as “Dem + Exp”); our model based on demographics, exposures, and symptoms (referred to throughout as “Dem + Exp + Symp”); and the two literature-based models, designed with risk factors nearly identical to those reported previously in another large, self-reported study (“How We Feel” (HWF); models referred to as “HWF Exp + Symp” and “HWF Symp” throughout; Supplementary Note 2, Supplementary Table 20).18
All four susceptibility models performed robustly, with the Dem + Exp + Symp model achieving the highest overall performance (Figure 4, Supplementary Tables 21-24). The three models that included one or more symptoms outperformed the model without symptoms (Dem + Exp), underscoring the value of self-reported symptoms for discriminating between cases and controls. The Dem + Exp model had an area under the curve (AUC) of 0.84 +/- 0.02, and the most predictive risk factor was having a household case of COVID-19. The Dem + Exp + Symp model had an AUC of 0.94 +/- 0.02, and the most predictive symptom was change in taste or smell. The HWF Exp + Symp model had an AUC of 0.90 +/- 0.03, and the HWF Symp model had an AUC of 0.87 +/- 0.03 (Supplementary Note 3). Each of the models performed comparably across different age, sex, and genetic ancestry groups (Supplementary Tables 21-24). We observed no significant overfitting in any of the models as evidenced by comparable train-test performances (Supplementary Table 25).
We also trained severity models to predict hospitalization and critical illness progression among COVID-19 cases. We included a number of risk factors and symptoms most associated with severe COVID-19 outcomes from the literature and/or our training dataset (Figure 3, Supplementary Tables 18 and 19); these included age,1,8,9,13 sex,1,5,11–13 morbid obesity (BMI >= 40),1,32 and health conditions,1,9,12–14 as well as shortness of breath,23 fever, feeling tired or fatigue, dry cough, and diarrhea for symptoms. Both models performed robustly on an independent holdout dataset (Figure 4). The hospitalization model had an AUC of 0.87 +/- 0.03, and the critical case model had an AUC of 0.90 +/- 0.03. For both severity models, shortness of breath was the most predictive risk factor. Because shortness of breath may be an unsurprising predictor of severe illness, we also designed models excluding this symptom; these still achieved moderately high discriminative performance (AUC > 0.80; Supplementary Figure 4). The severity models performed comparably when stratifying by age, sex, and genetic ancestry (Supplementary Tables 26-27), and there was no significant overfitting bias as evidenced by comparable train-test performances (Supplementary Table 25).
In contrast with susceptibility models, we did not evaluate the severity models against literature-based models, as the vast majority of these models include clinical factors (e.g., bloodwork) not measured in this self-reported dataset.1,16,31 However, the models presented here perform on par with many previously reported models despite the absence of clinical factors, suggesting the potential value of self-reported data to identify the most at-risk COVID-19 individuals.16,31
Discussion
The AncestryDNA COVID-19 Study provides a highly complete, self-reported dataset that contains information about a plethora of risk factors in the context of COVID-19 susceptibility and severity outcomes. The self-report framework provides fast, low-cost, population-scale data that are particularly valuable in a pandemic, where knowledge is both limited and evolving rapidly based on changing circumstances. Additionally, the broad collection mechanism enables data-gathering from many more and potentially more diverse participants than typically seen in a medical setting, and participants can safely provide data from their homes. Moreover, self-reported data have been shown to be useful for estimating population prevalence,19 and models built from these data may help to contextualize individual risk prior to infection with COVID-19.1,16–18,24,31
The study highlights exposure burden as the primary risk factor for COVID-19 susceptibility, and the importance of accounting for known exposures when assessing differences in susceptibility to COVID-19. Few studies have measured and explicitly adjusted for known COVID-19 exposures at this scale.20 We found elevated exposure levels for younger individuals, as well as individuals of admixed African-European or admixed Amerindian ancestry, which account for at least some of the elevated susceptibility risk for these individuals as evidenced by attenuation of ORs after adjustment for known exposures and age. This finding is consistent with previous reports about elevated exposure levels for younger and minority groups within the U.S.26–30 By contrast, we found reduced exposure levels among those reporting one or more pre-existing health conditions. The lower exposure burden observed for these individuals may reflect pre-pandemic differences in public interactions or increased precautions to mitigate exposure, given published severity risks associated with these conditions.1,9,13,14,33,34
Importantly, we found elevated susceptibility risk in males after adjusting for age and known exposures, and the adjusted odds were not attenuated compared to the unadjusted odds. This finding is distinct from previous findings on elevated severity risk in males.1,5,11 This result could be due to differences between men and women in behaviors, unknown exposures, biology, genetics,5–7 or other risk factors not measured within this dataset.
Another major contribution of this study is the development of novel risk models for predicting an individual’s COVID-19 susceptibility and severity risk. The risk models presented here perform comparably to or better than similar and more complex models reported previously.16–18,24,31 This study is also one of few to date to assess and demonstrate relatively strong performance of risk models across different age, sex, and genetic ancestry groups,17 highlighting the potential utility and generalizability of these models to broader populations. Such models may be useful to clinicians to estimate an individual’s risk of infection and/or severity risk, or as a potential tool to triage testing given limited resources.17,18,22 Just as importantly, the models could also be used by public health experts to understand population-level risk at large, given minimal self-reported risk factor data.
The COVID-19 pandemic has exacted a historic toll on healthcare systems and global economies, and continues to evolve based on changes in human behavior, public health guidelines, and societal factors. The large AncestryDNA network, well-established data collection mechanisms, and willingness of AncestryDNA customers to participate in COVID-19 research have rapidly come together in this study to elucidate more details about susceptibility and severe disease risk factors and help point the way to minimizing disease burden.
Methods
Ethics statement
All data for this research project were from subjects who have provided informed consent to participate in AncestryDNA’s Human Diversity Project, as reviewed and approved by our external institutional review board, Advarra (formerly Quorum). All data were de-identified prior to use.
Study population
Collection of self-reported COVID-19 outcomes from AncestryDNA customers who consented to research, participation criteria for the study, and the survey design are described in a genome-wide association study (GWAS) on a very similar AncestryDNA dataset, which identified three novel, genome-wide significant loci.6 Here, participants reporting a negative test result were also assessed for symptoms and clinical outcomes. Analyses presented here were performed with data collected between 22 April and 6 July 2020.
Outcome definitions
The study assessed three outcomes: one for susceptibility and two for severity of COVID-19 infection. Cases for COVID-19 susceptibility were individuals who responded, “Yes, and was positive” to the question, “Have you been swab tested for COVID-19, commonly referred to as coronavirus?” Responders who answered, “Yes, and was negative” were used as controls for the susceptibility analysis.
The hospitalization outcome was defined among COVID-19-positive cases if a participant responded “Yes” to a binary question about experiencing symptoms due to COVID-19 illness and “Yes” to the hospitalization question (“Were you hospitalized due to these symptoms?”). Controls were defined by a response of “No” to the hospitalization question in addition to reporting a self-reported positive COVID-19 test result.6
Critical cases of COVID-19 were defined via a response of “Yes” to one or more questions about ICU admittance (“Were you hospitalized in the Intensive Care Unit [ICU] with a ventilator?” or “Were you hospitalized in the Intensive Care Unit [ICU] with oxygen?”) or, alternatively, self-reported septic shock, organ failure, or respiratory failure resulting from a COVID-19 infection (“Have you had any of the following complications due to your illness? Select all that apply.”).13 Controls were defined by a response of “No” across all of these questions in addition to self-reporting a positive COVID-19 test result.
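The three outcome definitions above can be sketched as labeling functions. This is a minimal illustration only: the record keys (`swab`, `symptoms`, `hospitalized`, `icu_ventilator`, `icu_oxygen`, `complications`) and the complication codes are hypothetical names, not the survey's actual fields, and the sketch assumes that unanswered dependent questions have already been filled in as described under data preparation.

```python
from typing import Optional

# Hypothetical complication codes; the survey's real field names are not given.
SEVERE_COMPLICATIONS = {"septic_shock", "organ_failure", "respiratory_failure"}


def susceptibility_label(swab_answer: str) -> Optional[int]:
    """1 = case, 0 = control, None = not a tester with a known result."""
    if swab_answer == "Yes, and was positive":
        return 1
    if swab_answer == "Yes, and was negative":
        return 0
    return None


def hospitalization_label(record: dict) -> Optional[int]:
    """Defined only among self-reported positive testers."""
    if susceptibility_label(record.get("swab", "")) != 1:
        return None
    if record.get("symptoms") == "Yes" and record.get("hospitalized") == "Yes":
        return 1
    if record.get("hospitalized") == "No":
        return 0
    return None


def critical_label(record: dict) -> Optional[int]:
    """Case: ICU with ventilator or oxygen, or a severe complication."""
    if susceptibility_label(record.get("swab", "")) != 1:
        return None
    in_icu = (record.get("icu_ventilator") == "Yes"
              or record.get("icu_oxygen") == "Yes")
    severe = bool(SEVERE_COMPLICATIONS & set(record.get("complications", [])))
    if in_icu or severe:
        return 1
    if record.get("icu_ventilator") == "No" and record.get("icu_oxygen") == "No":
        return 0
    return None
```

Returning `None` keeps respondents who fall outside a definition out of both the case and control groups rather than silently counting them as controls.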
Genetic sex and ancestry definitions
All individuals were genotyped, using previously described general genotyping and quality control procedures.35 Both sex and genetic ancestry were defined for individuals based on their genotypes. Genetic ancestry was estimated using a proprietary algorithm to estimate continental admixture proportions.36 Briefly, this algorithm uses a hidden Markov model to estimate unphased diploid ancestry across the genome by comparing haplotype structure to a reference panel.
Data preparation for analysis
Multiple-choice categorical questions were one-hot (“dummy”) encoded as binary risk factors with k-1 degrees of freedom, where k corresponds to the number of fields for a given question. In this framework, a value of 0 across all k-1 fields corresponds to a response of “None” or “None of the above” for a given question.
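As an illustration of this encoding, the sketch below maps the set of options a respondent selected to k-1 binary fields, with the "None of the above" option as the all-zeros baseline. The function name and the example option labels are hypothetical.

```python
def encode_k_minus_1(selected, options, baseline="None of the above"):
    """Return k-1 binary indicators; selecting only the baseline yields all zeros."""
    fields = [option for option in options if option != baseline]
    return {field: int(field in selected) for field in fields}


# Hypothetical exposure question with k = 3 options.
options = ["Household member", "Biological relative", "None of the above"]
encoded = encode_k_minus_1({"Household member"}, options)
```

Dropping the baseline column avoids a redundant field: its value is implied whenever the other k-1 fields are all 0.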
We considered several risk factors and outcomes questions in our association analyses and risk modeling efforts, some of which are summarized in Supplementary Tables 1 and 2. Based on the dependency structure within our survey, not every question was surfaced to every participant included within our study. As such, we made the following inferences:
Participants reporting “No” to a binary question about symptoms arising from COVID-19 infection (“Did you experience symptoms as a result of your condition?”) were designated as negatives for dependent questions about individual symptoms (“Between the beginning of February 2020 and now, have you had any of the following symptoms?”), hospitalization due to symptoms (“Were you hospitalized due to these symptoms?”), ICU admittance due to symptoms (“Were you hospitalized in the Intensive Care Unit [ICU] with a ventilator?” or “Were you hospitalized in the Intensive Care Unit [ICU] with oxygen?”), and medications prescribed due to symptoms (“Did a doctor treat you with medication for your illness?”).
Participants reporting “No” to a binary question about hospitalization (“Were you hospitalized due to these symptoms?”) were assigned to a hospital duration of 0 days and designated as negative for ICU admittance due to symptoms.
Participants reporting “No” to a binary question about medications prescribed due to symptoms (“Did a doctor treat you with medication for your illness?”) were designated as negative for individual medication fields from a dependent question (“What medication were you treated with?”).
Responses to a question about individual symptoms (“Between the beginning of February 2020 and now, have you had any of the following symptoms?”) were converted to a binary variable based on the following mapping: 0 = None, Very Mild, Mild; 1 = Moderate, Severe, Very Severe.
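The inference rules and the symptom mapping above can be sketched as follows. The record keys and the short `SYMPTOMS` tuple are hypothetical stand-ins for the survey's fields; only the severity-to-binary mapping is taken directly from the text.

```python
# Mapping stated in the text: 0 = None, Very Mild, Mild; 1 = Moderate, Severe, Very Severe.
SEVERITY_TO_BINARY = {
    "None": 0, "Very Mild": 0, "Mild": 0,
    "Moderate": 1, "Severe": 1, "Very Severe": 1,
}

# Hypothetical subset of the 15 surveyed symptoms.
SYMPTOMS = ("fever", "dry_cough", "shortness_of_breath")


def binarize_symptom(level: str) -> int:
    return SEVERITY_TO_BINARY[level]


def fill_dependent_negatives(record: dict) -> dict:
    """Apply the survey-dependency inferences to one respondent's record."""
    out = dict(record)
    if out.get("had_symptoms") == "No":
        # No symptoms: every dependent question is designated negative.
        for symptom in SYMPTOMS:
            out[symptom] = "None"
        out["hospitalized"] = "No"
        out["treated_with_medication"] = "No"
    if out.get("hospitalized") == "No":
        # Not hospitalized: zero hospital days and no ICU admittance.
        out["hospital_days"] = 0
        out["icu_ventilator"] = "No"
        out["icu_oxygen"] = "No"
    if out.get("treated_with_medication") == "No":
        # No medication: no individual medication fields selected.
        out["medications"] = []
    return out
```

Ordering the checks from the top-level symptom question downward lets one "No" cascade through the hospitalization, ICU, and medication fields in a single pass.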
Association analysis
Analyses were performed either with the statsmodels package in Python 3 or in base R with the glm function. For each susceptibility and severity outcome and risk factor of interest, a simple logistic regression (LR) model was fit using unpenalized maximum likelihood.37 LR models were fit on the entire cohort, except for age, sex, genetic ancestry, and obesity, where the models were fit only on the appropriate reference population drawn from the entire cohort (Supplementary Tables 7-16). Multiple logistic regression was used to adjust the ORs for known COVID-19 exposures and potentially confounding risk factors. The adjusted model includes age, sex, and four known exposures for susceptibility outcomes (Y/N if any); and age, sex, obesity (binarized if BMI >= 30), and health conditions (binarized if any) for severity outcomes (Figures 1 and 2).
For each risk factor, 95% confidence intervals (CIs) for the log odds ratio were estimated under the normal approximation. The significance threshold was Bonferroni-corrected for the 42 different risk factors examined, leading to an adjusted threshold of 0.05/42=0.0012.37
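For a single binary risk factor, the maximum-likelihood OR from simple logistic regression equals the cross-product ratio of the 2x2 table, and the normal-approximation (Wald) interval is computed on the log odds ratio. The sketch below illustrates that calculation and the Bonferroni threshold; the counts are invented for illustration and are not study data.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and 95% CI from a 2x2 table.

    a: exposed cases, b: exposed controls,
    c: unexposed cases, d: unexposed controls.
    """
    log_or = math.log((a * d) / (b * c))
    # Standard error of the log odds ratio under the normal approximation.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return (math.exp(log_or),
            math.exp(log_or - z * se),
            math.exp(log_or + z * se))


# Illustrative counts only.
or_hat, ci_low, ci_high = odds_ratio_ci(20, 80, 10, 90)

# Bonferroni-corrected significance threshold for 42 risk factors.
bonferroni_threshold = 0.05 / 42
```

The interval is symmetric on the log scale, so after exponentiation it is asymmetric around the OR itself.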
Risk model training and evaluation
Data splitting
All data were used in the OR analyses; data splitting was performed for risk modeling only. Prior to model training, the data were split with a fixed random seed. For susceptibility models without symptoms, 75% of the data were used for model training and 25% of the data were used for holdout evaluation. For severity models and susceptibility models with symptoms, 50% of the data were used for model training and 50% of the data were used for holdout evaluation. The larger holdout set allows for more robust stratified analyses of model performances, given the smaller overall cohort sizes for models with symptoms.
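A seeded split of this kind can be sketched with the standard library. The seed value, the cohort size, and the use of a simple unstratified shuffle are assumptions; the text states only that a fixed random seed was used.

```python
import random


def split_indices(n, train_fraction, seed=0):
    """Shuffle row indices reproducibly and split into (train, holdout)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = int(round(n * train_fraction))
    return indices[:cut], indices[cut:]


# 75/25 split for susceptibility models without symptoms;
# 50/50 split for severity models and susceptibility models with symptoms.
train_a, holdout_a = split_indices(1000, 0.75, seed=42)
train_b, holdout_b = split_indices(1000, 0.50, seed=42)
```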
Predictive risk models
Logistic regression models were trained and evaluated using the scikit-learn package in Python. Here, we used lasso-penalized multiple logistic regression with cross-validation on the training dataset in order to select an optimal hyperparameterization for generalizability to independent test sets.37 The penalized objective function for regression was weighted by inverse prevalence to address class imbalance and was framed as follows:

$$\hat{\beta} = \arg\min_{\beta} \left\{ -\sum_{j=1}^{m} \frac{1}{m_{y=y_j}} \Big[ y_j \, \mathrm{Ln}\big(\mathrm{Logistic}(\beta \cdot x_j)\big) + (1 - y_j) \, \mathrm{Ln}\big(1 - \mathrm{Logistic}(\beta \cdot x_j)\big) \Big] + \lambda \sum_{i=1}^{n} |\beta_i| \right\}$$

where $m$ is the number of training examples, $y_j$ is the true value of the outcome variable for the $j$th observation within the training data, $m_{y=y_j}$ is the number of training examples with $y=y_j$, $\beta$ corresponds to the vector of estimated coefficients for the risk factors, $x_j$ is the vector of risk factor values for the $j$th observation within the training data, Logistic is the logistic (sigmoid) function, $\lambda$ is the regularization parameter which favors simpler models via shrinkage of the beta coefficients, $n$ is the total number of risk factors in the model, $\beta_i$ is the regression coefficient for the $i$th risk factor in the model, and Ln is the natural logarithm.
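The symbol definitions above admit a direct reading as code. The sketch below evaluates an inverse-prevalence-weighted, lasso-penalized logistic loss for a given coefficient vector. It is an illustration rather than the fitted pipeline: the exact weight normalization (1 / m_{y=y_j}) and the absence of a separate intercept term are assumptions, and the tiny dataset is invented.

```python
import math


def logistic(z):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-z))


def weighted_lasso_objective(beta, X, y, lam):
    """Class-weighted negative log-likelihood plus an L1 penalty.

    beta: coefficients, X: rows of risk factor values,
    y: 0/1 outcomes (as a list), lam: regularization parameter lambda.
    """
    class_counts = {0: y.count(0), 1: y.count(1)}
    loss = 0.0
    for x_j, y_j in zip(X, y):
        p = logistic(sum(b * v for b, v in zip(beta, x_j)))
        log_lik = y_j * math.log(p) + (1 - y_j) * math.log(1 - p)
        # Assumed inverse-prevalence weight: 1 / (number of examples in y_j's class).
        loss -= log_lik / class_counts[y_j]
    return loss + lam * sum(abs(b) for b in beta)


# Invented toy data: one case, two controls, two risk factors.
X_toy = [[1, 0], [0, 1], [1, 1]]
y_toy = [1, 0, 0]
value_at_zero = weighted_lasso_objective([0.0, 0.0], X_toy, y_toy, lam=1.0)
```

With all coefficients at zero, every prediction is 0.5 and each class contributes exactly Ln 2 to the loss regardless of its size, which is the balancing effect that inverse-prevalence weighting is meant to achieve.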
Risk factor selection and model training
We chose risk factors based on a minimal subset of nominally significant ORs within our training data as well as literature guidance.1,4,5,9,11–14 For the susceptibility models without symptoms, we included a subset of exposure-related questions, based on the training OR analyses, as well as two demographic variables (age and sex). For susceptibility models with symptoms, we additionally included the five symptoms most differentiated between symptomatic negative and positive testers from our training ORs. For the severity models, we included pre-existing conditions, based on the training OR analyses, predictive symptoms within our training dataset,36 morbid obesity (BMI >= 40), age, and sex. The full list of risk factors for each model is included in Supplementary Table 20.
Once final risk factors were selected, we performed 5-fold cross-validation with grid search on our training dataset to select an optimal lasso regularization parameter lambda.37 For the grid search, we scanned 8 different values for lambda, equally partitioned geometrically across a 4-log space. We then re-trained on the entire training dataset with the optimal lambda, and evaluated the final model on the holdout dataset.
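The search procedure can be sketched as a geometric grid plus a fold generator. The grid endpoints (0.01 to 100) are an assumption, since the text gives only the number of values and the 4-log span; `cv_score` is a hypothetical callable standing in for the mean validation score of a model trained with a given lambda.

```python
def geometric_grid(low, high, num):
    """num values from low to high with a constant ratio between neighbors."""
    ratio = (high / low) ** (1.0 / (num - 1))
    return [low * ratio ** i for i in range(num)]


def k_fold_indices(n, k=5):
    """Yield (train, validation) index lists for k interleaved folds."""
    for fold in range(k):
        validation = list(range(fold, n, k))
        train = [j for j in range(n) if j % k != fold]
        yield train, validation


def select_lambda(lambdas, cv_score):
    """Pick the lambda whose mean cross-validated score is highest."""
    return max(lambdas, key=cv_score)


# 8 values spanning 4 orders of magnitude (assumed endpoints).
lambdas = geometric_grid(1e-2, 1e2, 8)
```

When mapping this onto scikit-learn's `LogisticRegression`, note that its `C` parameter is the inverse of the regularization strength, so a grid over lambda corresponds to a grid over `C = 1 / lambda`.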
Model thresholding
Phenotypes were predicted from the output of trained models based on a 50% probability threshold (i.e., logistic model output > 0.5). Sensitivity and specificity were then calculated based on the true vs. predicted binary outcomes.
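The thresholding step and the two metrics can be sketched as below; the probabilities and labels are invented for illustration.

```python
def classify(probabilities, threshold=0.5):
    """Predict 1 when the model output exceeds the threshold."""
    return [int(p > threshold) for p in probabilities]


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Invented model outputs and true outcomes.
y_true = [1, 1, 1, 0, 0]
y_pred = classify([0.9, 0.6, 0.4, 0.2, 0.7])
sensitivity, specificity = sensitivity_specificity(y_true, y_pred)
```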
Estimation of performance error
To estimate the error in our model performances, we bootstrapped our holdout dataset 1,000 times to generate a sampling distribution for each evaluation metric. We estimated the mean and 95% CIs for each metric based on the mean and standard deviation of this sampling distribution.37
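A bootstrap of this kind can be sketched with the standard library. Two details are assumptions: the interval is taken as mean ± 1.96 standard deviations of the bootstrap distribution, and resamples containing a single class are redrawn because AUC is undefined for them. The AUC here is the pairwise (Mann-Whitney) estimate.

```python
import random
import statistics


def auc(y_true, scores):
    """Probability that a random case scores above a random control (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_metric(y_true, scores, metric, n_boot=1000, seed=0):
    """Resample the holdout set with replacement; return mean and 95% CI."""
    rng = random.Random(seed)
    n = len(y_true)
    values = []
    while len(values) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # redraw: the metric needs both classes present
        values.append(metric(ys, [scores[i] for i in idx]))
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return mean, (mean - 1.96 * sd, mean + 1.96 * sd)
```

For example, `bootstrap_metric(y_holdout, model_scores, auc)` would return the bootstrap mean AUC and its interval, where `y_holdout` and `model_scores` are hypothetical names for the holdout labels and model outputs.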
Data Availability
We plan to make a dataset available to qualified scientists through the European Genome-phenome Archive (EGA). The EGA dataset includes the risk factors and outcomes studied here, for a different set of individuals.
Author contributions
SCK and SRM contributed equally to the manuscript and wrote the first draft of the paper. ARG provided direct project guidance and led the COVID-19 research teams. SCK, SRM, and BR performed the association analyses. SCK developed and assessed the risk models. MVC and KAR designed the COVID-19 survey questionnaire. SCK and DP supported the dataset creation. GHLR supported the phenotype definitions. MVC and NDB built the demographic tables. SCK, SRM, BR, MVC, GHLR, ARG, and KAR helped with additional analyses and interpretation. MZ, DP, DT, KD, MP, HG, and AKHB helped with the EGA dataset. The AncestryDNA Science Team contributed to additional work, allowing for the completion of the COVID-19 research and manuscript. KAR, ELH, and CAB provided additional project guidance. All authors contributed to the final manuscript.
Competing interests
The authors declare competing financial interests: authors affiliated with AncestryDNA may have equity in Ancestry.
AncestryDNA Science Team
Yambazi Banda, Ke Bi, Robert Burton, Marjan Champine, Ross Curtis, Abby Drokhlyansky, Ashley Elrick, Cat Foo, Michael Gaddis, Jialiang Gu, Shannon Hateley, Heather Harris, Shea King, Christine Maldonado, Evan McCartney-Melstad, Alexandra McFarland, Patty Miller, Luong Nguyen, Keith Noto, Jingwen Pei, Jenna Petersen, Scott Pew, Chodon Sass, Josh Schraiber, Alisa Sedghifar, Andrey Smelter, Sarah South, Barry Starr, Cecily Vaughn, Yong Wang
Supplementary Materials
Supplementary Notes
Supplementary Note 1. Comparison of AncestryDNA Data to CDC Data. In general, the COVID-19 testing results are consistent with those reported by the U.S. Centers for Disease Control and Prevention (CDC) over a similar period.21 For example, the CDC reported a 12% cumulative overall positive test rate from 1 March to 30 May 2020, while 14.1% of AncestryDNA survey participants reported testing positive (the denominator includes tests with “results pending”). For hospitalization, the CDC reported that 14% of individuals testing positive for COVID-19 were hospitalized, and 2% were admitted to an intensive care unit (ICU), between 22 January and 30 May 2020. Among the AncestryDNA survey respondents, 11% of those reporting a positive COVID-19 test were hospitalized, and 4.9% were admitted to ICU between 22 April and 6 July 2020.
Supplementary Note 2. Risk factors for predictive susceptibility models. The HWF Exp + Symp model includes three risk factors, including two exposures and one symptom (change in taste or smell, Supplementary Table 20). The HWF Symp model includes seven symptoms most commonly associated with COVID-19 according to the CDC (Supplementary Table 20).18,38 For the Dem + Exp and Dem + Exp + Symp models, we incorporated responses to several exposure questions that were primary risk factors in the OR analysis of the training set (Methods and Supplementary Table 20).18,39,40 We also considered age and sex, given their nominal association with COVID-19 in our ORs in the training dataset, as well as reports about these variables as risk factors (Supplementary Table 17).4,5,30 For the Dem + Exp + Symp model, we selected the most differentiated symptoms between cases and controls amongst symptomatic COVID-19 testers within the training dataset (Supplementary Table 17), including fever, change in taste or smell, feeling tired/fatigued, runny nose, and sore throat.17,18,22
Supplementary Note 3. Comparison of Ancestry and HWF Cohorts and Performances. The HWF models performed better when trained and evaluated on the AncestryDNA dataset as compared with the HWF dataset, particularly for the symptoms-only model (HWF Symp: AncestryDNA-trained AUC=0.87 and HWF-trained AUC=0.76 and HWF Exp + Symp: AncestryDNA-trained AUC=0.90 and HWF-trained AUC=0.87; Figure 4).18 This result could be due to differences in survey flow design, demographics, or the relative prevalences of self-reported symptoms between the two datasets. We re-trained and evaluated our models on a subset of symptomatic COVID-19 testers (positives and negatives) reporting at least one symptom of moderate or greater intensity. The performances of all models that included symptoms were slightly attenuated in this cohort, with AUCs for the HWF models approaching previous reports (HWF Symp: AncestryDNA-trained AUC=0.80 and HWF Exp + Symp: AncestryDNA-trained AUC=0.87); the model without symptoms performed the same (Supplementary Figure 3). This suggests that predicting COVID-19 cases on the basis of symptoms alone is more challenging when symptoms are less differentiated between positive and negative testers.
Supplementary Tables
Supplementary Figures
Acknowledgements
We thank our AncestryDNA customers who made this study possible by contributing information about their experience with COVID-19 through our survey. Without them, this work would not be possible. We would like to thank Zach Bass, Robert Dowling, Disha Akarte, Swapnil Sneham, Sean Enright and the entire Cyborg team for their tireless work in the release and continued support of the COVID-19 survey.