medRxiv

Reproducibility of Genetic Risk Factors Identified for Long COVID using Combinatorial Analysis Across US and UK Patient Cohorts with Diverse Ancestries

J Sardell, M Pearson, K Chocian, S Das, K Taylor, M Strivens, R Gupta, A Rochlin, S Gardner
doi: https://doi.org/10.1101/2025.02.04.25320937
J Sardell
1PrecisionLife Ltd., Unit 8b Bankside, Hanborough Business Park, Long Hanborough OX29 8LJ, UK
M Pearson
1PrecisionLife Ltd., Unit 8b Bankside, Hanborough Business Park, Long Hanborough OX29 8LJ, UK
K Chocian
1PrecisionLife Ltd., Unit 8b Bankside, Hanborough Business Park, Long Hanborough OX29 8LJ, UK
S Das
1PrecisionLife Ltd., Unit 8b Bankside, Hanborough Business Park, Long Hanborough OX29 8LJ, UK
K Taylor
1PrecisionLife Ltd., Unit 8b Bankside, Hanborough Business Park, Long Hanborough OX29 8LJ, UK
M Strivens
1PrecisionLife Ltd., Unit 8b Bankside, Hanborough Business Park, Long Hanborough OX29 8LJ, UK
R Gupta
2Metrodora Foundation, 3535 South Market Street, West Valley City, UT 84119, USA
A Rochlin
3Metrodora Institute, 3535 South Market Street, St 900, West Valley City, UT 84119, USA
S Gardner
1PrecisionLife Ltd., Unit 8b Bankside, Hanborough Business Park, Long Hanborough OX29 8LJ, UK
3Metrodora Institute, 3535 South Market Street, St 900, West Valley City, UT 84119, USA
  • For correspondence: steve{at}precisionlife.com

Abstract

Background Long COVID is a major public health burden causing a diverse array of debilitating symptoms in tens of millions of patients globally. In spite of this overwhelming disease prevalence and staggering cost, its severe impact on patients’ lives and intense global research efforts, study of the disease has proved challenging due to its complexity. Genome-wide association studies (GWAS) have identified only four loci potentially associated with the disease, although these results did not statistically replicate between studies. A previous combinatorial analysis study identified a total of 73 genes that were highly associated with two long COVID cohorts in the predominantly (>91%) white European ancestry Sano GOLD population, and we sought to reproduce these findings in the independent and ancestrally more diverse All of Us (AoU) population.

Methods We assessed the reproducibility of the 5,343 long COVID disease signatures from the original study in the AoU population. Because the very small population sizes provide very limited power to replicate findings, we initially tested whether we observed a statistically significant enrichment of the Sano GOLD disease signatures that are also positively correlated with long COVID in the AoU cohort after controlling for population substructure.

Results For the Sano GOLD disease signatures that have a case frequency greater than 5% in AoU, we consistently observed a significant enrichment (77% - 83%, p < 0.01) of signatures that are also positively associated with long COVID in the AoU cohort. These encompassed 92% of the genes identified in the original study. At least five of the disease signatures found in Sano GOLD were also shown to be individually significantly associated with increased long COVID prevalence in the AoU population. Rates of signature reproducibility are strongest among self-identified white patients, but we also observe significant enrichment of reproducing disease associations in self-identified black/African-American and Hispanic/Latino cohorts. Signatures associated with 11 out of the 13 drug repurposing candidates identified in the original Sano GOLD study were reproduced in this study.

Conclusion These results demonstrate the reproducibility of long COVID disease signal found by combinatorial analysis, broadly validating the results of the original analysis. They provide compelling evidence for a much broader array of genetic associations with long COVID than previously identified through traditional GWAS studies. This strongly supports the hypothesis that genetic factors play a critical role in determining an individual’s susceptibility to long COVID following recovery from acute SARS-CoV-2 infection. It also lends weight to the drug repurposing candidates identified in the original analysis. Together these results may help to stimulate much needed new precision medicine approaches to more effectively diagnose and treat the disease.

This is also the first reproduction of long COVID genetic associations across multiple populations with substantially different ancestry distributions. Given the high reproducibility rate across diverse populations, these findings may have broader clinical application and promote better health equity. We hope that this will provide confidence to explore some of these mechanisms and drug targets and help advance research into novel ways to diagnose the disease and accelerate the discovery and selection of better therapeutic options, both in the form of newly discovered drugs and/or the immediate prioritization of coordinated investigations into the efficacy of repurposed drug candidates.

Introduction

Post COVID condition, commonly known as long COVID or PASC (post-acute sequelae of SARS-CoV-2 infection), is a debilitating chronic condition that develops following a SARS-CoV-2 infection in around 5-15% of patients1. The global prevalence of long COVID is estimated at a minimum of 65 million people2 and is increasing annually. Its cumulative global incidence is now estimated at over 400 million individuals, at a cost of over $1 trillion (or 1% of global GDP) annually1. The condition has a long-lasting and profound impact on patients’ lives and healthcare systems and has created a major public health issue3.

‘Long COVID’ is a term originally defined by patients to describe the post-acute and long-term health effects of COVID-194,5 and the highly variable symptoms associated with the condition. The frequency and severity of SARS-CoV-2 infections appear to be correlated with increased risk of developing long COVID6.

Long COVID patients have reported a diverse array of symptoms across multiple organ systems7 with the most common being post-exertional malaise8, dysautonomia9, cognitive dysfunction10, mood disturbances11 and respiratory problems12. Many of these symptoms and signs are also observed in other complex neuroimmune disorders such as myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS)13,14,15, postural orthostatic tachycardia syndrome (POTS)16,17 and fibromyalgia18,19, all of which, like long COVID, disproportionately affect women20. To advance our understanding of the pathophysiological mechanisms underlying these shared clinical manifestations, it is important to have a deeper understanding of the genetic similarities between long COVID and other neuroimmune conditions. This effort is hampered, as most of these diseases, like long COVID, are highly complex and have been intractable for existing genomic analysis approaches.

More than four years following the global COVID-19 outbreak, patients still often struggle to obtain a long COVID diagnosis as agreement on the definition of the disease remains elusive beyond self-reported persistence of a wide range of symptoms. Governments also find evaluating its prevalence and setting public health policy difficult due to the absence of a clear and consistent definition of the disease21,22. There are currently no recognized laboratory diagnostic tests or disease-modifying therapies for long COVID. Research into the biological mechanisms of the disease is hindered by the variability in study designs, lack of reproducible findings across patient populations, and challenges in accurately capturing the heterogeneous clinical phenotypes of patient cohorts23. A definitive biological explanation of some of the factors causing and defining the disease, and a test encompassing these, are urgently required to overcome this.

Only a few preliminary GWAS for long COVID have been published to date24,25,26, likely due to the challenges of assembling a sufficiently powered patient cohort and the studies’ consequently limited findings. A study by the COVID-19 Host Genetics Initiative (HGI) identified only a single significant locus (FOXP4) from an analysis of 6,450 long COVID cases and over 1 million population controls aggregated from multiple cohorts25. Another recent meta-analysis of over 53,000 cases and 120,000 controls from 23andMe identified three significant loci (HLA-DQA1–HLA-DQB, ABO and BPTF–KPNA2–C17orf58)26. The effect sizes of the latter three loci reproduced in the HGI cohort but the associations were not significant, likely due to limited statistical power even in such a large cohort. The reported association between FOXP4 and long COVID did not reproduce in the 23andMe data.

Combinatorial Analytics for Complex Diseases

Combinatorial analytics has been more successful than GWAS in identifying key genetic risk factors that capture the complex biology of similarly multifactorial and heterogeneous diseases like ME/CFS, generating more mechanistic insights and reproducible findings across cohorts27.

The combinations of genetic variants (‘disease signatures’) identified by combinatorial analyses capture both the linear and non-linear effects of interactions between multiple genes. They can be used to identify individual patients who have specific disease signatures, which makes it possible to link the signatures associated with a mechanism to the symptoms presented by the patients who carry them. These signatures can improve our understanding of complex diseases beyond the single SNP associations identified by GWAS28,29 and create opportunities for clinically actionable diagnostic tests and targeted trials of multiple drug repurposing candidates to provide clinical benefit to specific patient cohorts.

Aims of Study

The PrecisionLife combinatorial analytics platform was previously used to identify disease signatures for Severe and Fatigue Dominant long COVID cohorts derived from the Sano Genetics’ long COVID GOLD (Sano GOLD) study, and to highlight the biological similarities and differences between these two patient populations30. At the same time, combinatorial analysis was also undertaken on a General long COVID cohort encompassing all patients with a broader (and potentially less reliable) definition of the disease. The General cohort’s results were not described in the original publication, which instead focused on the most well-phenotyped cases.

The Severe cohort in this study was comprised of cases who self-reported the greatest variety and severity of symptoms, while the Fatigue Dominant cohort was comprised of cases who self-reported predominantly fatigue-associated long COVID symptoms. The study identified a total of 73 genes that were highly associated with at least one of these long COVID populations. Of these genes at the time of publication, 9 genes were linked to acute COVID-19, 14 genes were differentially expressed in a previous transcriptomic analysis of long COVID patients31 and 9 genes were found that had been associated with ME/CFS in the previous combinatorial analysis of this disease27.

In this study, we assessed the findings of all three of the original long COVID combinatorial analyses of the Sano GOLD cohorts in an independent, more ancestrally diverse patient population. We used genomic, clinical, and questionnaire data from the All of Us (AoU) population32 to generate a long COVID cohort (using ICD-10 code U09.9) and evaluated the reproducibility of the findings from the original Sano GOLD study. We investigated the genes and mechanisms underlying the reproducible disease signatures, and evaluated the clinical phenotypes associated with each.

Materials and Methods

Generation of long COVID cohorts

For this study, we identified a cohort of long COVID patients and matching controls from the AoU dataset (accessed on December 10th 2024). AoU provides data33 for nearly 850,000 American participants, including genomic data derived from the Illumina Global Diversity Array (GDA)34 (n=312,925), electronic health records (EHR, n=254,700), health questionnaires (n=412,220), and COPE COVID-19 survey (n=100,220)35. The AoU dataset was designed to capture data for a diverse group of individuals, including non-European ancestry groups often underrepresented in genomic datasets, and the cohort selected for this study reflects this diversity.

The baseline long COVID cohort was created by selecting all 458 individuals with GDA genotyping data who have a diagnosis of long COVID, using ICD-10 code U09.9 (post-acute COVID-19). We note that this criterion, which implies a prevalence of long COVID less than 0.2%, almost certainly excludes many patients with long COVID based on published estimates of long COVID prevalence of between 6.9% and 14%36,37,38.

The control cohort was generated by selecting individuals with GDA genotyping data who have evidence of SARS-CoV-2 infection, either based on a reported positive COVID-19 test in the COPE COVID-19 survey (n=3,615) or presence of ICD-10 codes B97.21 or U07.1 (n=17,024). We excluded individuals with long COVID based on ICD-10 code U09.9 as well as any individual with a history of symptomatic phenotypes consistent with long COVID or other post-viral fatigue syndromes (see Supplemental Table 1). Applying these criteria, our maximum control population included 9,774 individuals.

We used the sex-imputation functionality of PLINK39 to identify the genetic sex of each of the individuals in the full GDA dataset. 2.9% of total samples could not be reliably identified as male or female and were excluded from the study. 57.6% were identified as female and 39.5% were identified as male, which broadly agrees with the self-reported distribution of sex at birth from the AoU demographics questionnaire (59% female, 39% male, 2% skip/unknown).

Case and control matching using stratified sampling

To create a balanced dataset and reduce potential confounding effects of population substructure, we created a subset of controls that match the demographic distribution (i.e., sex and self-reported race/ethnicity) of the long COVID cases. We used genetic sex inferred by PLINK (using the command --check-sex) as well as the answers to the demographics survey on self-reported race and ethnicity to split the cohort into subgroups, and we compared the percentage of the baseline cases and potential controls that fall in each category. The results showed that some subgroups were over- or under-represented in cases vs. controls, e.g., white, female, non-Hispanics accounted for 38.5% of cases but only 29.7% of controls.

We adjusted our long COVID cohort by removing all individuals whose sex was undetermined during PLINK sex-imputation. We also removed all individuals whose self-reported race and/or ethnicity was coded as “None of these/I prefer not to answer/PMI:Skip” as these demographics do not allow accurate ‘matching’ with the control population. This created a final long COVID case population of 413 individuals (see Table 1 for demographic distribution).

Table 1. The distribution by genetic sex, self-reported race, and self-reported ethnicity of cases and controls in the final All of Us long COVID cohort.

The set of potential controls allowed us to create a final study cohort with a 1:10 case:control ratio and similar demographic splits in the case and control sub-cohorts. We used a probabilistic function to apply a stratified sampling technique using granular subgroups based on three demographic values (as illustrated in Table 2) to the baseline potential controls and match the distribution of the demographic subgroups in the long COVID cases as closely as possible.

Table 2. Two examples of how stratified sampling balances the frequency of granular subgroups (full breakdown is not available due to reporting restrictions imposed by AoU for rare subgroups).
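The matching step can be illustrated with a short sketch. This is a simplified illustration under stated assumptions, not the authors' probabilistic function: it draws a fixed quota of controls per demographic stratum (sex, race, ethnicity) at the 1:10 ratio, and caps the draw when a stratum has too few candidates. The function name `sample_matched_controls` and the `(participant_id, stratum)` input layout are hypothetical.

```python
import random
from collections import Counter, defaultdict


def sample_matched_controls(cases, controls, ratio=10, seed=0):
    """Draw controls so each stratum holds about `ratio` controls per case.

    `cases` and `controls` are lists of (participant_id, stratum) pairs,
    where a stratum is any hashable label such as
    ("female", "white", "non-Hispanic"). This layout is an assumption
    made for the example.
    """
    rng = random.Random(seed)

    # Number of cases observed in each demographic stratum.
    cases_per_stratum = Counter(stratum for _, stratum in cases)

    # Candidate controls grouped by the same strata.
    pool = defaultdict(list)
    for participant_id, stratum in controls:
        pool[stratum].append(participant_id)

    selected = []
    for stratum, n_cases in cases_per_stratum.items():
        candidates = pool.get(stratum, [])
        # Cap the quota when too few controls exist in this stratum.
        quota = min(ratio * n_cases, len(candidates))
        selected.extend(rng.sample(candidates, quota))
    return selected
```

Controls whose stratum contains no cases are never drawn, which is what keeps the demographic distribution of the sampled controls close to that of the cases.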

Prior to sampling, we also removed any age-based outliers from the control cohort (i.e., any individuals whose age was outside the range of ages in the long COVID case cohort). Additional information on the demographic breakdown of cases and controls, including prevalence of comorbidities is included in Supplemental Table 2. The concordance between the self-reported demographic data and the AoU genetic ancestry predictions33 was very high (88.0-99.4% for matched groups) in the study cohort (Supplemental Table 3).

We used principal component analysis (PCA) to model any remaining population substructure within the AoU study cohort. We first removed all SNPs that are associated with the sex chromosomes or the MHC region on chromosome 6 or that have minor allele frequency less than 0.05. We then conducted LD-pruning in PLINK 1.9 (--indep-pairwise 50 5 0.2) before generating genetic PCs using the PLINK --pca command, as recommended elsewhere40,41,42.

We selected the top 5 PCs for use in our analyses based on the associated eigenvalues (Supplemental Table 3).

Long COVID Disease Signatures

We previously identified long COVID associated disease signatures in two patient cohorts derived from the Sano GOLD study cohort, as described in the original Taylor et al. (2023) paper30, and a third unreported patient cohort using a broader definition of the disease (Supplemental Table 4). This resulted in:

  1. 1,188 signatures mapped to 43 genes, from a ‘Severe’ cohort of patients who reported the greatest variety and severity of long COVID symptoms.

  2. 1,435 signatures mapped to 35 genes, from a ‘Fatigue Dominant’ cohort of patients who reported predominantly fatigue-associated long COVID symptoms.

  3. 6,445 signatures mapped to 165 genes, from a ‘General’ cohort of patients who reported they were still suffering continuation or development of new symptoms 12 weeks after the initial SARS-CoV-2 infection, with these symptoms lasting for at least 2 months with no other explanation.

In contrast to the diverse AoU American study cohort, the Sano GOLD study cohort was comprised of British patients of predominantly white European ancestry (>91% of the cohort), with Asian ancestry (∼4%) as the largest non-white European demographic.

Evaluating Enrichment of Reproducing Long COVID Disease Signature in AoU Cohort

We used a logistic regression approach to evaluate the disease association of each of the previously identified disease signatures in the AoU study population. Individuals were coded as 1 if they possessed the exact combination of SNP genotypes comprising a signature and 0 if they did not. This term was included in the regression as an independent variable alongside covariates representing the top 5 genetic PCs (see Supplementary Table 5), with case-control status of the patients in the population (1 = case, 0 = control) as the dependent variable.
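The regression set-up described above can be sketched in plain Python. This is a minimal illustration and not the PrecisionLife implementation: the names `carries_signature` and `fit_logistic`, the dictionary-based genotype layout, and the hand-written Newton-Raphson solver are all assumptions made for this example, and a statistics package would normally be used instead. Each design-matrix row is assumed to hold an intercept, the 0/1 signature term, and then the genetic PCs.

```python
import math


def carries_signature(genotypes, signature):
    """Return 1 if the person has every SNP genotype in the signature, else 0.

    Both arguments map a SNP identifier to a genotype string
    (a hypothetical layout chosen for this sketch).
    """
    return int(all(genotypes.get(snp) == gt for snp, gt in signature.items()))


def _solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_logistic(rows, status, iterations=25):
    """Fit logistic regression by Newton-Raphson and return the coefficients.

    `rows` is the design matrix, e.g. [1.0, carrier, pc1, ..., pc5] per
    person; `status` is 1 for cases and 0 for controls.
    """
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(iterations):
        gradient = [0.0] * k
        information = [[0.0] * k for _ in range(k)]
        for row, y in zip(rows, status):
            eta = sum(b * x for b, x in zip(beta, row))
            p = 1.0 / (1.0 + math.exp(-eta))
            weight = p * (1.0 - p)
            for a in range(k):
                gradient[a] += (y - p) * row[a]
                for c in range(k):
                    information[a][c] += weight * row[a] * row[c]
        step = _solve(information, gradient)
        beta = [b + s for b, s in zip(beta, step)]
    return beta
```

With this layout, `math.exp(beta[1])` is the odds ratio for the signature term; a value above 1 corresponds to a positive association with case status after adjusting for the other columns.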

The limited number of patients with ICD-10 codes for long COVID provides very limited power to replicate (i.e., statistically validate) the disease associations of individual signatures in AoU, especially given the need for false discovery rate correction when testing the many signatures identified in the Sano GOLD dataset. Instead, we began by testing whether we observed a statistically significant enrichment of disease signatures that are also positively correlated with long COVID in the AoU cohort after controlling for population substructure.

For each of the three sets of disease signatures identified in the original Sano GOLD study (Severe, Fatigue Dominant, and General), we first counted the number of signatures where the logistic regression returns a positive coefficient (i.e., odds ratio > 1) for the independent ‘genetic signature’ variable. Below we use the term ‘reproducing’ to denote signatures with odds ratio > 1 in the AoU test cohort, and ‘reproducibility rate’ to denote the percentage of tested signatures that have odds ratio > 1.

Some of the original signatures could not be evaluated in AoU because one or more of their component SNP genotypes are not included in the dataset. These were excluded from the analysis (see Supplemental Table 6). Most of these missing SNPs are represented on the Illumina GDA but we believe these data were likely filtered out during the AoU dataset’s QC processes.

Many of the long COVID disease signatures are non-independent due to shared component SNP genotypes and linkage disequilibrium, preventing us from using standard statistical tests to evaluate the significance of our observed reproducibility rates. We therefore used a permutation-based approach to generate the expected distribution of observed reproducibility rates under the null hypothesis (i.e., no association between signatures and disease). We randomly shuffled the case-control vector 100 times, reran the logistic regression analysis for every signature and counted the number of disease signatures that have odds ratio greater than 1 for each random permutation. The p-value of the observed results is equal to the proportion of permutations in which the number of signatures with odds ratios above 1 is greater than or equal to the number of signatures with odds ratios above 1 in the original analysis.
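The permutation procedure can be sketched as follows. To keep the example short and dependency-free it makes a deliberate simplification: each signature's odds ratio is taken from an unadjusted 2x2 table (with a 0.5 continuity correction to avoid division by zero) rather than from the PC-adjusted logistic regression used in the study. The function names and the list-of-lists carrier layout are hypothetical.

```python
import random


def odds_ratio(carrier, status):
    """Unadjusted odds ratio from a 2x2 table with a 0.5 continuity correction.

    This replaces the PC-adjusted regression purely for illustration.
    """
    a = b = c = d = 0.5
    for has_signature, is_case in zip(carrier, status):
        if has_signature and is_case:
            a += 1          # carrier, case
        elif has_signature:
            b += 1          # carrier, control
        elif is_case:
            c += 1          # non-carrier, case
        else:
            d += 1          # non-carrier, control
    return (a * d) / (b * c)


def count_reproducing(carrier_vectors, status):
    """Number of signatures with odds ratio > 1 for the given labels."""
    return sum(odds_ratio(vec, status) > 1 for vec in carrier_vectors)


def permutation_p_value(carrier_vectors, status, n_permutations=100, seed=0):
    """Return (observed count, p-value) for enrichment of odds ratios > 1.

    The case-control vector is shuffled while the carrier vectors are
    left untouched, so correlations between signatures are preserved.
    """
    rng = random.Random(seed)
    observed = count_reproducing(carrier_vectors, status)
    shuffled = list(status)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if count_reproducing(carrier_vectors, shuffled) >= observed:
            at_least_as_extreme += 1
    return observed, at_least_as_extreme / n_permutations
```

With 100 permutations the smallest non-zero value this can return is 0.01, so a result in which no permutation reaches the observed count is reported as p < 0.01.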

From previous experience with other diseases, we have found that reproducibility rates for disease signatures are positively correlated with the frequency of the signature in the population. To test whether this is true for long COVID we filtered each of the three sets of disease signatures from the original analysis to the subsets that occur in at least 4% or 5% of total cases in the AoU cohort (based on observations of reproducibility rates from prior unpublished studies in other diseases). We then reran the reproducibility analysis for each of these ‘high frequency’ subsets of signatures.
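A case-frequency filter of this kind might look like the sketch below. The function names and the dictionary of 0/1 carrier vectors are hypothetical, and the comparison uses "at least" as worded in this paragraph; whether the boundary value itself is included is an assumption.

```python
def case_frequency(carrier, status):
    """Fraction of cases (status == 1) that carry the signature."""
    n_cases = sum(1 for is_case in status if is_case)
    if n_cases == 0:
        raise ValueError("no cases in the cohort")
    carriers_among_cases = sum(
        1 for has_signature, is_case in zip(carrier, status)
        if has_signature and is_case
    )
    return carriers_among_cases / n_cases


def filter_by_case_frequency(signatures, status, min_frequency=0.05):
    """Keep names of signatures carried by at least `min_frequency` of cases.

    `signatures` maps a signature name to its 0/1 carrier vector
    (an assumed layout for this example).
    """
    return [
        name for name, carrier in signatures.items()
        if case_frequency(carrier, status) >= min_frequency
    ]
```

Controls do not enter the calculation: the threshold is applied to the frequency among cases only, so a signature that is common in controls but rare in cases is still filtered out.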

Finally, we evaluated the impact of signature complexity (i.e., number of SNP genotypes in the signature) on reproducibility rates. We split the set of signatures into sets comprised of 2, 3, 4, and 5 SNP genotypes and reran the analysis of reproducibility on each separately, with and without applying filtering by case-frequency.

Ancestry-Specific Analyses

To test whether the observed rates of disease signature reproducibility broadly apply across traditionally under-served patient cohorts, we created three ancestry-specific sub-cohorts consistent with the demographic categories used to match cases with controls:

  1. White – patients who self-identify as ‘white’

  2. Black / African-American – patients who self-identify as ‘black’ and/or ‘African-American’

  3. Hispanic / Latino – patients who self-identify as ‘Hispanic’, ‘Latino’, and/or ‘Latina’

Note that these cohorts are not all mutually exclusive, as AoU includes separate questionnaire questions for self-reported race and Hispanic/Latino identification.

We again used genetic principal components to control for any indirect relationships between signature frequency and disease prevalence resulting from population substructure (including relatedness or broader shared ancestry between patients). We conducted separate PCAs for each sub-cohort dataset using the approach described above for the whole cohort. We then selected the first five PCs as covariates for each ancestry-specific analysis after confirming that they explained sufficient variance in the dataset (see Supplementary Table 5).

We restricted the ancestry-specific analyses to the sets of signatures that occur in more than 5% of cases. Given the small sample size and low statistical power for ancestry-specific sub-cohorts, it is most appropriate to assess differences in reproducibility rate for the sets of signatures that exhibit the strongest reproducibility statistics in the full cohort.

Replication of Individual Long COVID Disease Signatures in AoU Cohort

Finally, we tested whether any of the original disease signatures replicate, i.e., are individually significantly associated with long COVID in AoU. To minimize the FDR correction required for multiple tests, we restricted the analysis to the sets of high-frequency signatures that occur in more than 5% of cases. Output for the three original combinatorial analyses was assessed separately. Uncorrected p-values were obtained from the logistic regression with genetic PC covariates. ‘Replicating’ signatures have p-values < 0.05 after FDR correction via the Benjamini-Hochberg procedure43. We also assessed significance using the more conservative Bonferroni adjustment44.
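Both multiple-testing corrections are short enough to write out. The sketch below is a generic implementation of the standard Benjamini-Hochberg and Bonferroni adjustments, not code from the study; the function names are chosen for this example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def bonferroni(p_values):
    """Return Bonferroni adjusted p-values, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

A signature would then count as individually significant under either scheme when its adjusted p-value is below 0.05. Because the Bonferroni adjustment multiplies every p-value by the full number of tests, it can never flag more signatures than Benjamini-Hochberg on the same input.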

The SNPs in the disease signatures associated with long COVID in AoU were mapped to genes using an annotation cascade process against the human reference genome (GRCh38), as detailed in Das et al. (2022)27. SNPs located within the coding region of a gene (or genes) were mapped directly to the gene(s) and any remaining SNPs within 2kb upstream or 0.5kb downstream were mapped to the closest gene(s).
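The two-step mapping rule can be sketched as below. This is an illustration of the rule as stated in this paragraph, not the annotation cascade of Das et al.: it treats the whole gene span as a stand-in for the coding region, assumes that "upstream" and "downstream" are interpreted relative to the gene's strand, and uses a hypothetical `Gene` record and function name.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int   # lowest coordinate of the gene span
    end: int     # highest coordinate of the gene span
    strand: str  # "+" or "-"


def map_snp_to_genes(chrom, pos, genes, upstream=2000, downstream=500):
    """Map a SNP to the gene(s) it falls in, else to the closest nearby gene(s)."""
    on_chrom = [g for g in genes if g.chrom == chrom]

    # Step 1: SNPs inside a gene span map directly to that gene (or genes).
    inside = [g.name for g in on_chrom if g.start <= pos <= g.end]
    if inside:
        return inside

    # Step 2: otherwise look within 2 kb upstream or 0.5 kb downstream.
    nearby = []
    for g in on_chrom:
        if g.strand == "+":
            low, high = g.start - upstream, g.end + downstream
        else:
            # On the reverse strand, upstream lies beyond the higher coordinate.
            low, high = g.start - downstream, g.end + upstream
        if low <= pos <= high:
            distance = g.start - pos if pos < g.start else pos - g.end
            nearby.append((distance, g.name))
    if not nearby:
        return []

    # Keep every gene tied for the smallest distance.
    closest = min(distance for distance, _ in nearby)
    return [name for distance, name in nearby if distance == closest]
```

SNPs that fall outside both windows return an empty list, which corresponds to a signature component that cannot be assigned to a gene under this rule.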

Results

Reproducibility of Overall Long COVID Disease Associations in AoU Cohort

We were able to test 5,343 of the 9,068 long COVID disease signatures originally identified in the three Sano GOLD sub-cohorts. The untested signatures all contained at least one SNP genotype that was not present in the post-QC AoU genotype dataset. Of the tested signatures, 1,766 occur in greater than 5% of cases and 3,100 occur in greater than 4% of cases in AoU.

When we restricted the analysis to signatures with case frequency greater than 5%, we consistently observed a significant enrichment of signatures (77% - 83%, p < 0.01) that are positively associated with long COVID in the AoU cohort, across all three sets of disease signatures (Table 3). Notably, the percentage of signatures with odds ratios greater than 1 in AoU is much larger than observed in any of the permutations where cases and controls were randomly assigned to patients (e.g., 82% vs. a maximum of 57% in the random permutations for the Severe cohort). This result confirms that many disease signatures are non-randomly associated with increased long COVID prevalence in AoU.

Table 3. Reproducibility statistics in AoU for long COVID disease signatures derived from three Sano GOLD sub-cohorts.

Overall reproducibility rates are lower when we apply a less stringent 4% frequency cutoff, but the enrichment is still highly significant (60% - 71%, p < 0.01). That is, the observed number of signatures with odds ratios greater than 1 in AoU exceeds the maximum number of signatures with odds ratio greater than 1 in the random permutations. We did not observe any significant enrichment of reproducing signatures when we included low-frequency signatures in our analysis.

The distributions of odds ratios for the reproducing high-frequency (>5%) signatures are shown in Figure 1. 89% and 90% of the reproducing signatures from the Severe and Fatigue Dominant studies respectively have odds ratios greater than 1.1 in AoU, while 17% and 48% have odds ratios greater than 1.5. The mean odds ratio for the Severe signatures is 1.35 and the maximum is 2.09, while the mean odds ratio for the Fatigue Dominant signatures is 1.49 and the maximum is 2.22. Thus, the reproducing disease signal from these studies largely represents signatures that are individually strongly associated with increased disease prevalence.

Figure 1. Distribution of observed odds ratios in AoU for reproducing signatures with high case frequencies (>5%).

The reproducing signatures from the General study tend to have lower odds ratios than the other studies (Figure 1). 75% of reproducing signatures have odds ratios greater than 1.1 in AoU and 5% have odds ratio greater than 1.5. The mean odds ratio is 1.21 and the maximum is 2.10. Thus, although these signatures included many with relatively weak disease associations, they also include signatures with strong effect sizes.

The relative enrichment of low odds ratios for signatures from the General study likely reflects the greater number of cases (Supplemental Table 4) and greater statistical power associated with the Sano GOLD study cohort. That is, the General study was better suited to detect signatures with lower effect sizes relative to the smaller Severe and Fatigue Dominant cohorts. The weaker disease associations may also reflect the relative reliability of the criteria used to define the Sano GOLD cohorts, as we believe that the case definition criteria for the General cohort are less accurate than the criteria used to identify patients with Severe and Fatigue Dominant long COVID subtypes.

Reproducibility statistics are strongest for high case frequency (>5%) signatures comprised of 4 or 5 SNP genotypes, as measured both by percent reproducing (i.e., odds ratio >1) and p-value (Table 4). Notably, across all three analyses, roughly twice as many 4- and 5-SNP signatures have odds ratios greater than 1 in AoU than would be expected due to random chance based on the mean reproducibility rates for the random permutations.

Table 4. Reproducibility statistics by signature complexity (i.e., number of SNP genotypes comprising disease signatures) in AoU for high case frequency (>5%) long COVID disease signatures derived from three Sano GOLD sub-cohorts.

We similarly observed that reproducibility statistics are strongest for higher complexity signatures when applying a frequency cut-off of 4% (Supplemental Table 7). We observed no clear association between signature complexity and reproducibility rates for low frequency signatures (Supplemental Table 8).

To ensure that the sets of long COVID disease signatures are broadly reproducible across patients, we conducted separate analyses for self-reported white, black/African-American, and Hispanic/Latino sub-cohorts (Table 5).

Table 5. Reproducibility statistics in AoU for high case frequency long COVID disease signatures (>5% of cases) derived from three Sano GOLD sub-cohorts, broken down by self-reported ancestry.

We observed a highly significant enrichment of positively correlated disease signatures among self-reported white patients. This result confirms that the observed enrichment of reproducible disease associations in the all-ancestry cohort does not simply reflect population substructure in the dataset (i.e., indirect correlations between disease prevalence and signature frequency that arise due to shared correlations with ancestry).

Reproducibility rates were lower in the self-reported black/African-American and Hispanic/Latino sub-cohorts relative to the self-reported white sub-cohort, but consistently above the mean values observed in the random permutations. Two of the observed enrichment values were statistically significant (p < 0.05) despite the very small number of cases (71 and 77) and consequent weak statistical power in these sub-cohorts.

More than 85% of the long COVID genes identified across the three Sano GOLD long COVID cohorts mapped to one or more disease signatures that have >4% case frequency and were also positively associated with long COVID in the AoU cohort (see Table 6). Of the 73 genes published in Taylor et al. (2023)30, 15 could not be tested due to missing SNPs in the AoU dataset. 76% (44/58) of the remaining genes also map to disease signatures that reproduced in AoU. These genes are linked to a wide range of biological processes and mechanisms including dysregulated immune response and metabolic pathways, development of chronic inflammation, and cognitive dysfunction.

Table 6. Reproducibility statistics in AoU for genes associated with high case frequency (> 4% and >5%) long COVID disease signatures identified in three Sano GOLD sub-cohorts.

Of the 13 repurposing gene candidates identified in Taylor et al. (2023), 11 (85%) map to at least one disease signature that reproduces in AoU (see Supplemental Table 9). These genes include TLR4, whose inhibition Taylor et al. (2023) noted has been shown to protect against long-term cognitive impairment pathology caused by SARS-CoV-2 infection45. Inhibition of TLR4 in a mouse model was shown to prevent long-term cognitive pathology, including synapse elimination and memory deficits, caused by the SARS-CoV-2 Spike protein. Previous clinical studies have shown that antagonizing TLR4 signaling dampens the pathological cytokine storm observed in patients with severe acute COVID-19 and reduces mortality rates in hospitalized COVID-19 patients46,47.

Replication of Individual Disease Signatures in All of Us

The above analyses focused on demonstrating an overall enrichment of disease signatures and genes that are positively correlated with long COVID in AoU, recognizing that the small size of the AoU cohort severely limits wide-scale replication. To achieve sufficient power to statistically validate individual signatures, we limited our replication analysis to the subset of signatures with case frequencies above 5%.

Four high-frequency disease signatures from the Severe Sano GOLD analysis were significantly associated with increased prevalence of long COVID in AoU, one of which was still significant after applying the more conservative Bonferroni correction (Table 7). All four signatures are comprised of five SNP genotypes, each of which contributes to the overall association with disease in AoU (i.e., removing any of the SNP genotypes from the signature results in a lower odds ratio). This observation highlights the utility of the combinatorial analysis approach for identifying genetic disease associations.
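The leave-one-out check described above (removing any one SNP genotype lowers the odds ratio) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the toy genotype data, and the 0.5 continuity correction added to each cell are not taken from the study, and this is not the PrecisionLife implementation.

```python
def odds_ratio(carrier_flags, case_flags):
    """Odds ratio for carriers vs non-carriers.

    Adds 0.5 to every cell (an assumed continuity correction) so that
    empty cells do not cause division by zero.
    """
    a = b = c = d = 0
    for carrier, case in zip(carrier_flags, case_flags):
        if carrier and case:
            a += 1          # carrier, case
        elif carrier:
            b += 1          # carrier, control
        elif case:
            c += 1          # non-carrier, case
        else:
            d += 1          # non-carrier, control
    return ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))


def has_signature(person, signature):
    """True if the person carries every SNP genotype in the signature."""
    return all(person.get(snp) == genotype for snp, genotype in signature.items())


def leave_one_out(people, case_flags, signature):
    """Return the full-signature odds ratio and the odds ratio after dropping each SNP."""
    full = odds_ratio([has_signature(p, signature) for p in people], case_flags)
    reduced = {}
    for snp in signature:
        sub = {s: g for s, g in signature.items() if s != snp}
        reduced[snp] = odds_ratio([has_signature(p, sub) for p in people], case_flags)
    return full, reduced


def make_people(n, a, b):
    """Hypothetical helper: n people with genotypes a and b at two toy SNPs."""
    return [{"snpA": a, "snpB": b} for _ in range(n)]


# Toy data (hypothetical): 6 cases and 14 controls.
cases = make_people(4, 1, 1) + make_people(1, 1, 0) + make_people(1, 0, 1)
controls = (make_people(1, 1, 1) + make_people(4, 1, 0)
            + make_people(4, 0, 1) + make_people(5, 0, 0))
people = cases + controls
is_case = [True] * len(cases) + [False] * len(controls)

full_or, reduced_ors = leave_one_out(people, is_case, {"snpA": 1, "snpB": 1})
```

In this toy data the two-SNP signature has a higher odds ratio than either single-SNP remainder, which is the pattern the text reports for the replicating five-SNP signatures.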

Two of the replicating disease signatures from the Severe analysis mapped to the gene CCDC146 and one mapped to D2HGDH. These genes have different functions and suggest different mechanism-of-action hypotheses for their role in the development of long COVID. CCDC146 is a ubiquitous centriole- and microtubule-associated protein linked to cognitive functioning and type 2 diabetes48. D2HGDH is an enzyme involved in mitochondrial functioning that also exhibits anti-inflammatory effects49.

Two disease signatures from the Fatigue Dominant Sano GOLD analysis were significantly associated with increased prevalence of long COVID in AoU, one of which was still significant after applying the more conservative Bonferroni correction (Table 7). The latter is comprised of two SNP genotypes, while the other is comprised of five SNP genotypes. Each of the individual SNP genotypes contributes to its signature’s association with disease in AoU.

None of the signatures from the General cohort in the Sano GOLD analysis were significantly associated with increased prevalence of long COVID in AoU. Although this output includes the signature most strongly associated with long COVID (by uncorrected p-value), it does not survive the stringent FDR correction for the large number of signatures from this analysis.

Finally, if we pool the signatures from the Severe and Fatigue Dominant cohorts into a single analysis (excluding the large number of signatures from the General cohort to avoid the need for stringent FDR correction), then five of the signatures in Table 7 remain significant under the combined Benjamini-Hochberg FDR correction. These include all four significant signatures from the Severe analysis and the top signature from the Fatigue Dominant analysis.

Table 7. Replicating disease signatures that exhibit statistically significant associations with long COVID in AoU using the Benjamini-Hochberg FDR procedure. p-values in bold are also significant under Bonferroni correction. Signatures from each cohort analysis were evaluated separately. Odds ratios reflect the total number of cases and controls with genotype data for all component SNPs, which differs between signatures.
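The difference between the two multiple-testing corrections used for Table 7 can be sketched as follows. This is a generic textbook implementation, not the authors' code; the function names and the toy p-values are hypothetical and are not values from the study.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Bonferroni: compare every p-value with alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_significant(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the false discovery rate.

    Finds the largest rank k such that the k-th smallest p-value is at most
    (k / m) * alpha, then rejects the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[idx] = True
    return significant


# Hypothetical p-values for six signatures (illustrative only).
toy_p = [0.001, 0.012, 0.02, 0.03, 0.2, 0.6]
bonf = bonferroni_significant(toy_p)
bh = benjamini_hochberg_significant(toy_p)
```

With these toy values only the smallest p-value passes Bonferroni, while the four smallest pass Benjamini-Hochberg, illustrating why Bonferroni is described as the more conservative of the two.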

Discussion

Studies using traditional GWAS and meta-GWAS approaches on large patient populations (6,450 cases and 53,764 cases) respectively identified a single locus and three loci associated with long COVID25,26, although there was no statistical replication of the findings between these studies.

The original combinatorial analysis of Sano’s GOLD cohort identified 9,068 genetic disease signatures and 73 genes that were significantly enriched in two small UK-based long COVID patient cohorts (Severe n_cases=459 and Fatigue Dominant n_cases=477)30. In this original analysis, 28/43 genes found in the Severe cohort were also significantly associated with disease in the Fatigue Dominant cohort, and 25/35 genes from the original Fatigue Dominant analysis were also associated in the Severe cohort. 25 genes (15 from Severe and 10 from Fatigue Dominant) were found to be unique to those cohorts.

92% of the genes and 60-83% of the medium/high-frequency disease signatures from the Sano GOLD results that are also represented in the AoU dataset were positively correlated with long COVID in this independent US-based population. For disease signatures that occur in at least 5% of patients, between 77%-83% were positively correlated with long COVID prevalence in both the Sano GOLD and AoU cohorts, far more than we would expect to randomly observe if the signatures were uncorrelated with disease biology. Although we defined a ‘reproducing’ signature as one that has any odds ratio greater than 1, most reproducing signatures have relatively large odds ratios in AoU, indicating a strong association with increased disease prevalence.

At least five of the disease signatures found in Sano GOLD were individually significantly associated with increased prevalence of long COVID in the AoU population. The significant enrichment of positively-associated disease signatures further confirms that many additional signatures are non-randomly associated with disease but cannot be individually validated due to the very low statistical power provided by the small number of long COVID patients in the dataset (n=413). Together these results demonstrate a significant enrichment and reproduction of disease signal, broadly validating the results of the original analysis.

Importantly, the results of this paper provide strong supporting evidence for a much broader range of genetic associations with long COVID than has been uncovered by GWAS studies to date. This provides evidence highly consistent with a strong biological basis of the disease and the hypothesis that patients’ genetics influence their susceptibility to developing long COVID (and their predominant symptoms) following recovery from acute SARS-CoV-2 infection.

The AoU ancestry distribution differs significantly from the mainly (>91%) white British patient cohort used in the original combinatorial analysis. Disease signature reproducibility rates are very strong in the sub-cohort of self-identified white patients, as expected given the similarity in ancestry between that cohort and the original Sano GOLD dataset. Signature reproducibility rates are lower in sub-cohorts of self-identified black/African-Americans and Hispanic/Latinos, but we still observe significant enrichment of disease signatures despite very small sample sizes.

This therefore represents the first reproduction of long COVID genetic associations across multiple populations with substantially different ancestry distributions. Given the degree of reproducibility of results across diverse populations, these findings may have broad clinical application which could promote better health equity.

The lower signature reproducibility rates among the self-identified black/African American and Hispanic/Latino sub-cohorts relative to the self-identified white sub-cohort highlight a pressing need to identify large, diverse, well-phenotyped cohorts of long COVID patients. Many long COVID specific datasets such as Sano GOLD are comprised predominantly of patients with white European ancestry. In contrast, All of Us includes a highly diverse patient cohort, but lack of reliable data identifying which participants have a history of long COVID prevents us from reliably obtaining sufficient sample sizes to conduct a combinatorial analysis aimed at identifying novel disease signatures.

Having access to larger and more diverse populations with a confirmed diagnosis is essential to enabling primary analysis within these ancestry cohorts and adding to our understanding of the factors underpinning disease in those populations. In turn this would also help us build more inclusive and transferrable disease risk models.

Combinatorial analysis of diverse long COVID patient cohorts could potentially identify disease signatures that were not detected in predominantly white European cohorts due to low relative case frequencies, but which have greater frequency and importance for disease biology in other patient cohorts. Such signatures could be used to better estimate patients’ relative susceptibility to developing long COVID.

Alternatively, the disease signatures identified in the Sano GOLD cohort may have reduced effect sizes in non-white European cohorts due to an increased frequency of ‘actively protective’ signatures in those populations, i.e., one or more SNP genotypes that wholly or partially mitigate the disease associations of a set of ‘causative’ disease signatures50. The combinatorial analysis published in Taylor et al. (2023) focused only on causative disease signatures and did not include any analysis of protective signatures30. Incorporating actively protective features into the set of disease signatures should increase their predictivity for identifying ‘high-risk’ patients and improve reproducibility statistics.

Evaluating the Output of the PrecisionLife Combinatorial Analysis Platform

We observed high rates of reproducibility among disease signatures derived from all the Sano GOLD cohorts and showed that these rates of disease signature reproducibility were strongly correlated with the frequency of signatures in the original study cohort. We observed slightly higher overall rates of reproducibility in the Severe and Fatigue Dominant cohorts which have fewer high case frequency disease signatures relative to the broader ‘General’ long COVID cohort.

Rates of reproducibility were highest for disease signatures comprised of four or five SNP genotypes, suggesting that combinatorial genetic interactions play an important role in the biology of long COVID. This also provides supporting evidence for the combinatorial analytic approach’s ability to detect a broad and clinically informative set of genetic disease associations in otherwise intractable complex diseases.

The Predictive Value of Common vs Rare Signatures

In contrast to these mid/high case frequency signatures, when analyzing the entire set of disease signatures from the original analyses including low frequency signatures, only 34%-44% were consistently correlated with long COVID prevalence. This implies that rarer signatures may replicate between populations more poorly, an observation that is consistent with similar findings in GWAS and polygenic risk score studies51,52,53,54,55. There are several explanations for this observed correlation between signature frequency and reproducibility rates.

First, statistical power increases with sample size, which is already limited in the reproducibility analysis due to the very small number of confirmed long COVID patients in AoU. Signatures with frequencies below 5% are expected to occur in 21 or fewer AoU cases. This small sample size results in large variance in expected rates of reproducibility under the null model and a high probability of observing odds ratios less than one due to random sampling even when signatures are biologically relevant to disease.
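The "21 or fewer" figure follows from the AoU case count reported elsewhere in this paper (n=413). A minimal sketch of the arithmetic is given below; the binomial standard deviation is added as an assumption to illustrate how variable such small carrier counts are, treating each case as an independent draw.

```python
import math

AOU_CASES = 413  # number of long COVID cases in AoU, as stated in the Discussion


def expected_case_carriers(case_frequency, n_cases=AOU_CASES):
    """Expected number of cases carrying a signature with the given case frequency."""
    return case_frequency * n_cases


def carrier_count_sd(case_frequency, n_cases=AOU_CASES):
    """Binomial standard deviation of the carrier count (independence assumed)."""
    return math.sqrt(n_cases * case_frequency * (1 - case_frequency))


expected_at_5pct = expected_case_carriers(0.05)   # 0.05 * 413 = 20.65
sd_at_5pct = carrier_count_sd(0.05)
```

Rounding 20.65 up gives the 21 cases quoted in the text; under the binomial assumption the standard deviation is several carriers, a substantial fraction of the expected count.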

Second, due to the high case:control skew (1:10) in our dataset, rare signatures were often more likely to be negatively correlated with disease under the null model. In the most extreme scenario, a signature that occurs in one person in the dataset is 10 times more likely to randomly occur in a control (resulting in an odds ratio below 1) than in a case (resulting in an odds ratio above 1). This bias caused the mean percentage of signatures that randomly exhibit odds ratios above 1 in the null model permutations to range from 41% to 46%, below the 50% expectation for a balanced dataset.
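The effect of the case:control skew on rare signatures can be computed exactly for a simplified null model. The sketch below assumes that a signature's carriers are a uniformly random subset of the cohort (a hypergeometric model, which is not the permutation procedure used in the study) and assumes 4,130 controls, inferred from the 1:10 ratio and the 413 AoU cases. It uses the fact that the odds ratio exceeds 1 exactly when the fraction of carriers who are cases exceeds the overall case fraction.

```python
from math import comb


def prob_or_above_one(n_carriers, n_cases, n_controls):
    """Probability that a signature with n_carriers random carriers has odds ratio > 1.

    For a 2x2 table with a carrier cases, OR > 1 is equivalent to
    a * total > n_carriers * n_cases (carriers enriched among cases).
    """
    total = n_cases + n_controls
    denominator = comb(total, n_carriers)
    probability = 0.0
    for a in range(min(n_carriers, n_cases) + 1):
        if a * total > n_carriers * n_cases:
            # Hypergeometric probability of exactly a carriers being cases.
            probability += comb(n_cases, a) * comb(n_controls, n_carriers - a) / denominator
    return probability


# Assumed cohort sizes: 413 cases and ten times as many controls.
p_single_carrier = prob_or_above_one(1, 413, 4130)
```

For a single carrier the function returns 1/11 (about 0.09), which matches the statement that such a signature is ten times more likely to fall in a control than in a case.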

Third, rare signatures appeared to be more reflective of subpopulation structure in the original Sano GOLD dataset. Including genetic principal components as covariates resulted in 4% fewer high-frequency signatures (i.e., those that occur in >5% of total cases) that are positively correlated with long COVID, relative to a logistic regression that did not include covariates for population substructure. In contrast, including genetic principal components in the analysis resulted in 52% fewer replicating low-frequency signatures (i.e., those that occur in <4% of total cases).

Finally, more complex disease signatures (i.e., those comprised of 4 or 5 SNP genotypes) generally occur at lower frequencies in the population simply because there are more possible genotype combinations for a larger set of SNPs. The risk of overfitting to a dataset is known to increase with tree depth when applying tree-based machine learning algorithms56 and the same potentially holds true for higher layer disease signatures derived from a layer-based mining approach. Applying a frequency filter therefore potentially mitigates the impacts of false positive SNPs by removing higher-order signatures.

We found no evidence, however, that increased signature complexity was associated with reduced reproducibility among high-frequency signatures. Rather, overall reproducibility rates were highest for 4-SNP and 5-SNP signatures relative to the small number of 2-SNP signatures. We also did not observe a correlation between signature complexity and reproducibility rate among low-frequency signatures. These results suggest that outputs of the combinatorial analyses of the Sano GOLD cohorts were not excessively overfitted to the original datasets and that presence of any false positive component SNP genotypes does not significantly affect the overall association with disease.

Although the results of this analysis suggest that false positive component SNP genotypes do not have a major effect on signature reproducibility, we could potentially improve the effect sizes and predictivity of these signatures by using AoU to further refine the set of signatures.

This step entails testing each signature individually and removing any component SNP genotype that does not enhance the signature’s association with disease in AoU. We have not included any refinement analysis in this study because it can potentially overfit the new set of signatures to the training dataset (AoU). A third cohort of long COVID patients would be required to properly evaluate the improvement in disease signature reliability that results from this refinement process.

Limitations of Analysis

Reliably identifying which patients in AoU have a history of long COVID is currently challenging because we had to rely on ICD-10 codes, which are known to be inconsistently and inaccurately applied, to identify known cases. As noted above, published estimates of long COVID prevalence in the United States range from 6.9% to 14%, yet fewer than 0.2% of individuals in AoU have ICD-10 codes associated with long COVID.

This suggests that many long COVID patients have not been assigned the appropriate ICD-10 code. As a result, more than 10% of the controls in our AoU study cohort potentially could represent misclassified cases with unreported long COVID.

This type of phenotypic misclassification in datasets will generally weaken the observed effect sizes by artificially inflating the similarity between cases and controls57. This behavior is potentially problematic for reproducibility analyses, as the dilution of signal decreases the statistical power of the analysis58. For example, phenotypic misclassification increases the probability that a signature that is biologically correlated with increased disease risk will nonetheless exhibit an odds ratio less than 1 due to random sampling effects.

We therefore expect that the high degree of phenotypic misclassification in our dataset will have worked to reduce the overall rates of observed signature reproducibility. As such, the reproducibility statistics presented in this paper probably represent a low-end estimate of the true reproducibility rate.
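The attenuation described above can be illustrated with a small worked example. The sketch assumes that unreported cases hidden among the labelled controls carry a signature at the same frequency as recorded cases; the carrier frequencies (10% in cases, 5% in true controls) are hypothetical, and the 10% misclassification level is the illustrative figure mentioned in the text.

```python
def odds(p):
    """Convert a frequency into odds."""
    return p / (1 - p)


def observed_odds_ratio(freq_cases, freq_true_controls, misclassified_fraction):
    """Odds ratio seen when a fraction of labelled controls are really cases.

    The labelled-control carrier frequency is a mixture of the true-control
    frequency and the case frequency (assumption stated in the lead-in).
    """
    freq_labelled_controls = ((1 - misclassified_fraction) * freq_true_controls
                              + misclassified_fraction * freq_cases)
    return odds(freq_cases) / odds(freq_labelled_controls)


true_or = observed_odds_ratio(0.10, 0.05, 0.0)      # no misclassification
diluted_or = observed_odds_ratio(0.10, 0.05, 0.10)  # 10% of controls are hidden cases
```

In this toy setting the odds ratio falls from about 2.1 to about 1.9, moving toward 1 as the text describes, which makes it more likely that sampling noise pushes a genuinely associated signature below the reproducibility threshold.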

Applications for healthcare

The identification of a set of genetic signatures that are consistently associated with increased prevalence of long COVID offers many opportunities for improving treatment of this poorly understood but highly prevalent and debilitating disease.

Firstly, although we have insufficient power to validate the full set of individual signatures in AoU, demonstrating that reproducing signatures are significantly enriched in a second dataset provides important confirmatory evidence of the original findings of the combinatorial analytics approach. To provide insights into potential drug therapies for long COVID, we further tested whether the signatures associated with novel drug targets and their related drug repurposing candidates are significantly correlated with increased long COVID prevalence in AoU. 27/30 (90%) of the genes represented in the >5% disease signatures and 11 out of the 13 drug repurposing candidates identified in the original study were reproduced in this study. This lends weight to their prioritization in clinical efficacy trials especially for those with generic drugs.

Controlled open-label studies of similar design to the RECOVERY trial in COVID-19, which rapidly identified dexamethasone as an effective frontline therapy59, can be undertaken on this set of generic drugs, benefiting from the additional evidence that one or more selected therapies is more likely to help the subset of patients who have those mechanisms’ disease signatures in their genetic makeup.

Secondly, we can use the insights into disease biology that are reflected by the reproducing disease signatures to construct a combinatorial risk score that evaluates an individual patient’s relative genetic susceptibility towards developing long COVID. Although genetic risk scores are not strictly diagnostic, especially in a pathogen triggered disease, they have substantial potential to be used by physicians for differential triage, i.e., to rapidly gauge the relative likelihood of different diagnoses when presented with ambiguous or indistinct symptoms and refer patients and/or select treatment options accordingly. As the utilization of large-scale COVID-19 testing fades, alternative tests that can help differentiate patients with long COVID from patients with other illnesses with similar symptoms will become increasingly useful in healthcare settings.

Constructing a combinatorial risk score from disease signatures is a more complex challenge than a conventional polygenic risk model – the latter assumes that all features (SNPs) act independently, whereas combinatorial disease signatures are often inherently correlated due to the sharing of SNP genotypes. Machine learning approaches can disentangle this complexity and non-independence and combine features such as disease signatures and their component SNP genotypes into a single predictive model. Although the small sample size of the AoU dataset is sufficient to train a combinatorial risk score using machine learning, a third (currently unavailable) independent dataset would be required to properly evaluate the relative increase in long COVID prevalence between subsets of patients flagged as having high and low genetic susceptibility.

Finally, the set of replicating disease signatures can be used to mechanistically stratify patients based on the causative etiologies most likely associated with their form of long COVID. This first entails assigning disease signatures to one or more mechanism-of-action (MoA) cluster/s based on the gene(s) associated with the component SNP genotypes. We can then assess whether a patient has a significant excess or lack of disease signatures associated with a given MoA relative to the distribution of signature counts in the larger long COVID community. In essence, this mechanistic stratification tool is comprised of multiple combinatorial risk scores, each for a different set of mechanism related disease signatures. This can provide insight in the clinic into the selection of therapies that are matched to a patient’s personal genetic makeup.
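The stratification steps above can be sketched in a few lines. This is only an illustration of the counting-and-comparison idea: the signature identifiers, mechanism labels, reference counts, and the use of an empirical percentile as the comparison statistic are all assumptions, not details of the authors' tool.

```python
def moa_counts(patient_signatures, signature_to_moas):
    """Count how many of a patient's signatures fall in each mechanism-of-action cluster."""
    counts = {}
    for signature in patient_signatures:
        for moa in signature_to_moas.get(signature, ()):
            counts[moa] = counts.get(moa, 0) + 1
    return counts


def empirical_percentile(value, reference_counts):
    """Fraction of reference patients whose count is at most the given value."""
    return sum(1 for c in reference_counts if c <= value) / len(reference_counts)


# Hypothetical mapping of signatures to mechanism clusters (a signature may map to several).
signature_to_moas = {
    "sig1": ["immune"],
    "sig2": ["immune", "metabolic"],
    "sig3": ["metabolic"],
}

patient_counts = moa_counts(["sig1", "sig2"], signature_to_moas)

# Hypothetical per-patient "immune" signature counts in a reference long COVID cohort.
reference_immune_counts = [0, 0, 1, 1, 1, 2, 3, 0, 1, 0]
immune_percentile = empirical_percentile(patient_counts["immune"], reference_immune_counts)
```

A patient whose count sits in the upper tail of the reference distribution for one mechanism and not for others would be a candidate for therapies matched to that mechanism. A real tool would need a principled significance test and a much larger reference cohort.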

Unlike standard risk scores, which can be used to inform public health applications but provide more limited utility for personalized precision medicine60, a mechanistic stratification tool could ultimately enable healthcare practitioners to identify individualized treatment regimens, including single- or multi-drug therapies, that are most likely to generate a positive outcome for a given patient. In the case of long COVID these mechanistic insights also have other potential applications, as the Taylor et al. (2023) combinatorial analysis also found evidence for substantial overlap in disease biology between long COVID and myalgic encephalomyelitis / chronic fatigue syndrome (ME/CFS)30.

Conclusion

The level of reproducibility of results from the original Sano GOLD long COVID study in the All of Us population to the extent demonstrated is highly encouraging for the study of long COVID and other similarly complex diseases. These findings redefine our understanding of long COVID by uncovering a broad spectrum of reproducible genetic signatures, laying the foundation for new diagnostic innovations and targeted therapies that have the potential to revolutionize care for millions suffering from this debilitating condition worldwide.

The study demonstrates the level of reproducibility of results achievable using combinatorial analysis, even across very small populations with diverse ancestries in highly heterogeneous diseases. Increasing reproducibility across patients with different ancestries is critically important for improving equitable representation and access to healthcare solutions. All of these studies would nonetheless benefit from larger datasets with wider population diversity, more secure diagnosis, more harmonized health/symptom surveys and deeper genomic, longitudinal clinical, immunological and metabolic data.

The results provide further compelling evidence for the detailed description of the genetic components of long COVID’s complex disease biology that was presented in the original combinatorial analysis study30. We hope that this will provide confidence to explore some of these mechanisms and drug targets and help advance research into novel ways to diagnose the disease and accelerate the discovery and selection of better therapeutic options, both in the form of newly discovered drugs and/or the immediate prioritization of coordinated investigations into the efficacy of repurposed drug candidates.

We also hope that these findings will better establish a stronger appreciation of the role of genetic contributions to the etiology and lived experience of disease in long COVID patients and prove its underlying biological basis to the clinical community.

For the first time, a definitive test for the disease would enable clinicians to rapidly and accurately identify and triage patients, ensuring they receive timely and equitable access to care, and reducing the potential for misdiagnosis. It would also establish a definitive framework for measuring the public health impact of the disease, informing health policy and helping strategically prioritize research initiatives to make more rapid progress in addressing this massive global challenge and improve patients’ lives.

Data Availability

Only data from existing All of Us and Sano GOLD study cohorts were analyzed and no new source data were collected for this study. Aggregate-level data for the All of Us cohort is publicly available at https://databrowser.researchallofus.org/ (Public Tier dataset). Individual-level data for the All of Us cohort, available in the Controlled Tier dataset, can be analyzed by approved researchers on the Researcher Workbench.

Author contributions

SG, AR, MS, JS, SD, RG and KT contributed to the design of the study. JS designed the reproducibility analyses. MP, KC, JS, and SD performed the analyses described in this manuscript. MP and KC conducted the analyses on the Researcher Workbench. All authors contributed to writing the manuscript and consent to publication.

Funding

The project was funded entirely by Metrodora Foundation and PrecisionLife Ltd.

Ethics approval and consent to participate

The Sano GOLD study has approval from the Wales Research Ethics Committee (REC) (IRAS 291221). Consent to participate has been received from all participants. Institutional Reviewing Board (IRB) approval was obtained prior to enrollment of patients in the All of Us Research Program. Informed consent for all participants is conducted in person or through an eConsent platform that includes primary consent, HIPAA Authorization for Research use of EHRs and other external health data, and Consent for Return of Genomic Results. The protocol was reviewed by the Institutional Review Board (IRB) of the All of Us Research Program (IRB Approval Date: Dec 03, 2021). The All of Us IRB follows the regulations and guidance of the NIH Office for Human Research Protections for all studies, ensuring that the rights and welfare of research participants are overseen and protected uniformly. The All of Us Research Program is supported by the National Institutes of Health, Office of the Director: Regional Medical Centers (OT2 OD026549; OT2 OD026554; OT2 OD026557; OT2 OD026556; OT2 OD026550; OT2 OD 026552; OT2 OD026553; OT2 OD026548; OT2 OD026551; OT2 OD026555); Inter agency agreement AOD 16037; Federally Qualified Health Centers HHSN 263201600085U; Data and Research Center: U2C OD023196; Genome Centers (OT2 OD002748; OT2 OD002750; OT2 OD002751); Biobank: U24 OD023121; The Participant Center: U24 OD023176; Participant Technology Systems Center: U24 OD023163; Communications and Engagement: OT2 OD023205; OT2 OD023206; and Community Partners (OT2 OD025277; OT2 OD025315; OT2 OD025337; OT2 OD025276). Results reported are in compliance with the All of Us Data and Statistics Dissemination Policy disallowing disclosure of group counts under 20 to protect participant privacy.

Declaration of Competing Interest

AR is an employee of Metrodora Foundation, SG and RG are co-chairs of Metrodora Foundation’s Scientific Advisory Board. JS, SD, KT, KC, MP and SG are employees of PrecisionLife Ltd. S.G. is a shareholder of PrecisionLife, Ltd.

Acknowledgements

Research described in this article has been conducted using data from the All of Us Research Program and Sano Genetics’ Long COVID GOLD study. We gratefully acknowledge All of Us and Sano Genetics’ Long COVID GOLD participants for their contributions, without whom this research would not have been possible. We also thank the National Institutes of Health’s All of Us Research Program and Sano Genetics for making available the participant data examined in this study. Special thanks to Gert Møller and Claus Erik Jensen, who initially developed the combinatorial analytics methodology, and the rest of the PrecisionLife and Metrodora teams.

Footnotes

  • # Joint last authors

References

  1. 1.↵
    Al-Aly, Z., Davis, H., McCorkell, L. et al. Long COVID science, research and policy. Nat Med 30, 2148–2164 (2024). doi:10.1038/s41591-024-03173-6
    OpenUrlCrossRefPubMed
  2. 2.↵
    Davis, H.E., McCorkell, L., Vogel, J.M. and Topol, E.J., 2023. Long COVID: major findings, mechanisms and recommendations. Nature Reviews Microbiology, 21(3), pp.133–146.
    OpenUrlCrossRefPubMed
  3. 3.↵
    Blitshteyn, S. and Verduzco-Gutierrez, M., 2024. Long COVID: A major public health issue. American Journal of Physical Medicine & Rehabilitation, pp.10–1097.
  4. 4.↵
    Callard, F. and Perego, E., 2021. How and why patients made Long Covid. Social Science & Medicine, 268, p.113426.
    OpenUrlPubMed
  5. 5.↵
    Davis, H.E., Assaf, G.S., McCorkell, L., Wei, H., Low, R.J., Re’em, Y., Redfield, S., Austin, J.P. and Akrami, A., 2021. Characterizing long COVID in an international cohort: 7 months of symptoms and their impact. EClinicalMedicine, 38.
  6. 6.↵
    Al-Aly Z, Topol E. Solving the puzzle of Long Covid. Science. 2024 Feb 23;383(6685):830-832. doi: 10.1126/science.adl0867.
    OpenUrlCrossRefPubMed
  7. 7.↵
    Altmann, D.M., Whettlock, E.M., Liu, S., Arachchillage, D.J. and Boyton, R.J., 2023. The immunology of long COVID. Nature Reviews Immunology, 23(10), pp.618–634.
    OpenUrlPubMed
  8. 8.↵
    Greenhalgh, T., Sivan, M., Perlowski, A. and Nikolich, J.Ž., 2024. Long COVID: a clinical update. The Lancet, 404(10453), pp.707–724.
    OpenUrl
  9. 9.↵
    Su, S., Zhao, Y., Zeng, N., Liu, X., Zheng, Y., Sun, J., Zhong, Y., Wu, S., Ni, S., Gong, Y. and Zhang, Z., 2023. Epidemiology, clinical presentation, pathophysiology, and management of long COVID: an update. Molecular Psychiatry, 28(10), pp.4056–4069.
    OpenUrlCrossRefPubMed
  10. 10.↵
    Goldstein, D.S., 2024. Post-COVID dysautonomias: What we know and (mainly) what we don’t know. Nature Reviews Neurology, 20(2), pp.99–113.
    OpenUrlPubMed
  11. 11.↵
    Harrison, P.J. and Taquet, M., 2023. Neuropsychiatric disorders following SARS-CoV-2 infection. Brain, 146(6), pp.2241–2247.
    OpenUrlCrossRefPubMed
  12. 12.↵
    Kubota, T., Kuroda, N. and Sone, D., 2023. Neuropsychiatric aspects of long COVID: a comprehensive review. Psychiatry and Clinical Neurosciences, 77(2), pp.84–93.
    OpenUrl
  13. 13.↵
    Nugent, K. and Berdine, G., 2024. Dyspnea and long COVID patients. The American Journal of the Medical Sciences.
  14. 14.↵
    Komaroff, A.L. and Lipkin, W.I., 2023. ME/CFS and Long COVID share similar symptoms and biological abnormalities: road map to the literature. Frontiers in Medicine, 10, p.1187163.
    OpenUrlPubMed
  15. 15.↵
    Eaton-Fitch, N., Rudd, P., Er, T., Hool, L., Herrero, L. and Marshall-Gradisnik, S., 2024. Immune exhaustion in ME/CFS and long COVID. JCI insight, 9(20), p.e183810.
  16.
    Spicer, C.M., Chu, B.X., Volberding, P.A. and National Academies of Sciences, Engineering, and Medicine, 2024. Chronic Conditions Similar to Long COVID. In Long-Term Health Effects of COVID-19: Disability and Function Following SARS-CoV-2 Infection. National Academies Press (US).
  17.
    Cantrell, C., Reid, C., Walker, C.S., Stallkamp Tidd, S.J., Zhang, R. and Wilson, R., 2024. Post-COVID postural orthostatic tachycardia syndrome (POTS): a new phenomenon. Frontiers in Neurology, 15, p.1297964.
  18.
    El-Rhermoul, F.Z., Fedorowski, A., Eardley, P., Taraborrelli, P., Panagopoulos, D., Sutton, R., Lim, P.B. and Dani, M., 2023. Autoimmunity in long COVID and POTS. Oxford Open Immunology, 4(1), p.iqad002.
  19.
    Goldenberg, D.L., 2024, May. How to Understand the Overlap of Long COVID, Chronic Fatigue Syndrome/Myalgic Encephalomyelitis, Fibromyalgia and Irritable Bowel Syndromes. In Seminars in Arthritis and Rheumatism (p. 152455). WB Saunders.
  20.
    Perlis, R.H., Santillana, M., Ognyanova, K., Safarpour, A., Trujillo, K.L., Simonson, M.D., Green, J., Quintana, A., Druckman, J., Baum, M.A. and Lazer, D., 2022. Prevalence and correlates of long COVID symptoms among US adults. JAMA Network Open, 5(10), pp.e2238804–e2238804.
  21.
    Turk, F., Sweetman, J., Chew-Graham, C.A., Gabbay, M., Shepherd, J., van der Feltz-Cornelis, C. and STIMULATE-ICP Consortium, 2024. Accessing care for Long Covid from the perspectives of patients and healthcare practitioners: A qualitative study. Health Expectations, 27(2), p.e14008.
  22.
    O’Hare, A.M., Vig, E.K., Iwashyna, T.J., Fox, A., Taylor, J.S., Viglianti, E.M., Butler, C.R., Vranas, K.C., Helfand, M., Tuepker, A. and Nugent, S.M., 2022. Complexity and challenges of the clinical diagnosis and management of long COVID. JAMA Network Open, 5(11), pp.e2240332–e2240332.
  23.
    Hamlin, R.E. and Blish, C.A., 2024. Challenges and opportunities in long COVID research. Immunity, 57(6), pp.1195–1214.
  24.
    Ruß, A.K., Schreiber, S., Lieb, W., Vehreschild, J.J., Heuschmann, P.U., Illig, T., Appel, K.S., Vehreschild, M.J., Krefting, D., Reinke, L. and Viebke, A., 2024. Genome-wide Association Study of Post COVID-19 Syndrome in a Population-based Study in Germany. Research Square, 09 December 2024, PREPRINT (Version 1).
  25.
    Lammi, V., Nakanishi, T., Jones, S.E., Andrews, S.J., Karjalainen, J., Cortés, B., O’Brien, H.E., Fulton-Howard, B.E., Haapaniemi, H.H., Schmidt, A. and Mitchell, R.E., 2023. Genome-wide association study of long COVID. medRxiv, pp.2023–06.
  26.
    Chaudhary, N.S., Weldon, C.H., Nandakumar, P., 23andMe Research Team, Holmes, M.V. and Aslibekyan, S., 2024. Multi-ancestry GWAS of Long COVID identifies immune-related loci and etiological links to chronic fatigue syndrome, fibromyalgia and depression. medRxiv, pp.2024-10.
  27.
    Das, S., Taylor, K., Kozubek, J., Sardell, J. and Gardner, S., 2022. Genetic risk factors for ME/CFS identified using combinatorial analysis. Journal of Translational Medicine, 20(1), p.598.
  28.
    Gardner, S., 2021. Combinatorial analytics: an essential tool for the delivery of precision medicine and precision agriculture. Artificial Intelligence in the Life Sciences, 1, p.100003.
  29.
    Das, S., Taylor, K., Beaulah, S. and Gardner, S., 2022. Systematic indication extension for drugs using patient stratification insights generated by combinatorial analytics. Patterns, 3(6).
  30.
    Taylor, K., Pearson, M., Das, S., Sardell, J., Chocian, K. and Gardner, S., 2023. Genetic risk factors for severe and fatigue dominant long COVID and commonalities with ME/CFS identified by combinatorial analysis. Journal of Translational Medicine, 21(1), p.775.
  31.
    Thompson, R.C., Simons, N.W., Wilkins, L., Cheng, E., Del Valle, D.M., Hoffman, G.E., Cervia, C., Fennessy, B., Mouskas, K., Francoeur, N.J. and Johnson, J.S., 2023. Molecular states during acute COVID-19 reveal distinct etiologies of long-term sequelae. Nature Medicine, 29(1), pp.236–246.
  32.
    All of Us Research Program Investigators; Denny JC, Rutter JL, Goldstein DB, Philippakis A, Smoller JW, Jenkins G, Dishman E. The “All of Us” Research Program. N Engl J Med. 2019 Aug 15;381(7):668-676. doi: 10.1056/NEJMsr1809937.
  33.
    The All of Us Research Program Genomics Investigators. Genomic data in the All of Us Research Program. Nature 627, 340–346 (2024). doi:10.1038/s41586-023-06957-x
  34.
    Infinium Global Diversity Array-8 Kit Specifications available from https://emea.illumina.com/products/by-type/microarray-kits/infinium-global-diversity.html last accessed 14 January 2025
  35.
    Phillips R, Taiyari K, Torrens-Burton A, Cannings-John R, Williams D, Peddle S, Campbell S, Hughes K, Gillespie D, Sellars P, Pell B, Ashfield-Watt P, Akbari A, Seage CH, Perham N, Joseph-Williams N, Harrop E, Blaxland J, Wood F, Poortinga W, Wahl-Jorgensen K, James DH, Crone D, Thomas-Jones E, Hallingberg B. Cohort profile: The UK COVID-19 Public Experiences (COPE) prospective longitudinal mixed-methods study of health and well-being during the SARSCoV2 coronavirus pandemic. PLoS One. 2021 Oct 13;16(10):e0258484. doi: 10.1371/journal.pone.0258484.
  36.
    Blanchflower DG, Bryson A. Long COVID in the United States. PLoS One. 2023 Nov 2;18(11):e0292672. doi: 10.1371/journal.pone.0292672.
  37.
    Robertson MM, Qasmieh SA, Kulkarni SG, Teasdale CA, Jones HE, McNairy M, Borrell LN, Nash D. The Epidemiology of Long Coronavirus Disease in US Adults. Clin Infect Dis. 2023 May 3;76(9):1636–1645. doi: 10.1093/cid/ciac961.
  38.
    National Center for Health Statistics. U.S. Census Bureau, Household Pulse Survey, 2022–2024. Long COVID. Generated interactively: January 11, 2025 from https://www.cdc.gov/nchs/covid19/pulse/long-covid.htm
  39.
    Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007 Sep;81(3):559–75. doi: 10.1086/519795.
  40.
    Anderson, C., Pettersson, F., Clarke, G. et al. Data quality control in genetic case-control association studies. Nat Protoc 5, 1564–1573 (2010). doi:10.1038/nprot.2010.116
  41.
    Grinde KE, Browning BL, Reiner AP, Thornton TA, Browning SR. Adjusting for principal components can induce collider bias in genome-wide association studies. PLoS Genet. 2024 Dec 16;20(12):e1011242. doi: 10.1371/journal.pgen.1011242.
  42.
    Reed E, Nunez S, Kulp D, Qian J, Reilly MP, Foulkes AS. A guide to genome-wide association analysis and post-analytic interrogation. Statistics in Medicine. 2015;34(28):3769–3792. doi: 10.1002/sim.6605
  43.
    Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple hypothesis testing. J R Stat Soc B. 1995;57:289–300.
  44.
    Neyman J & Pearson ES. On the use and interpretation of certain test criteria for purposes of statistical inference. Biometrika 1928; 20A: 175–240.
  45.
    Fontes-Dantas FL, Fernandes GG, Gutman EG, De Lima EV, Antonio LS, Hammerle MB, Mota-Araujo HP, Colodeti LC, Araújo SMB, Froz GM, da Silva TN, Duarte LA, Salvio AL, Pires KL, Leon LAA, Vasconcelos CCF, Romão L, Savio LEB, Silva JL, da Costa R, Clarke JR, Da Poian AT, Alves-Leon SV, Passos GF, Figueiredo CP. SARS-CoV-2 Spike protein induces TLR4-mediated long-term cognitive dysfunction recapitulating post-COVID-19 syndrome in mice. Cell Rep. 2023;42(3):112189. doi:10.1016/j.celrep.2023.112189.
  46.
    Mukherjee S. Toll-like receptor 4 in COVID-19: friend or foe? Future Virol. 2022. doi:10.2217/fvl-2021-0249.
  47.
    Liu ZM, Yang MH, Yu K, Lian ZX, Deng SL. Toll-like receptor (TLRs) agonists and antagonists for COVID-19 treatments. Front Pharmacol. 2022;7(13):989664. doi:10.3389/fphar.2022.989664.
  48.
    Ustinova M, Peculis R, Rescenko R, Rovite V, Zaharenko L, Elbere I, Silamikele L, Konrade I, Sokolovska J, Pirags V, Klovins J. Novel susceptibility loci identified in a genome-wide association study of type 2 diabetes complications in population of Latvia. BMC Med Genomics. 2021;14(1):18. doi: 10.1186/s12920-020-00860-4
  49.
    de Goede KE, Harber KJ, Gorki FS, Verberk SGS, Groh LA, Keuning ED, Struys EA, van Weeghel M, Haschemi A, de Winther MPJ, van Dierendonck XAMH, Van den Bossche J. d-2-Hydroxyglutarate is an anti-inflammatory immunometabolite that accumulates in macrophages after TLR4 activation. Biochim Biophys Acta Mol Basis Dis. 2022;1868(9):166427. doi: 10.1016/j.bbadis.2022.166427.
  50.
    J Sardell, S Das, K Taylor, C Stubberfield, A Malinowski, M Strivens, & S Gardner. Actively protective combinatorial analysis: a scalable novel method for detecting variants that contribute to reduced disease prevalence in high-risk individuals. medRxiv 2024; 12.19.24319349. doi: 10.1101/2024.12.19.24319349.
  51.
    Tam V, Patel N, Turcotte M, Bossé Y, Paré G, Meyre D. Benefits and limitations of genome-wide association studies. Nat Rev Genet. 2019 Aug;20(8):467–484. doi: 10.1038/s41576-019-0127-1.
  52.
    Moreno-Grau S, Vernekar M, Lopez-Pineda A, Mas-Montserrat D, Barrabés M, Quinto-Cortés CD, Moatamed B, Lee MTM, Yu Z, Numakura K, Matsuda Y, Wall JD, Ioannidis AG, Katsanis N, Takano T, Bustamante CD. Polygenic risk score portability for common diseases across genetically diverse populations. Hum Genomics. 2024 Sep 2;18(1):93. doi: 10.1186/s40246-024-00664-y.
  53.
    Clifton, L., Collister, J.A., Liu, X. et al. Assessing agreement between different polygenic risk scores in the UK Biobank. Sci Rep 12, 12812 (2022). doi:10.1038/s41598-022-17012-6
  54.
    Curtis D. Clinical relevance of genome-wide polygenic score may be less than claimed. Ann Hum Genet. 2019 Jul;83(4):274–277. doi: 10.1111/ahg.12302.
  55.
    Auer, P.L., Lettre, G. Rare variant association studies: considerations, challenges and opportunities. Genome Med 7, 16 (2015). doi:10.1186/s13073-015-0138-2
  56.
    Aliferis, C., Simon, G. (2024). Overfitting, Underfitting and General Model Overconfidence and Under-Performance Pitfalls and Best Practices in Machine Learning and AI. In: Simon, G.J., Aliferis, C. (eds) Artificial Intelligence and Machine Learning in Health Care and Medical Sciences. Health Informatics. Springer, Cham. doi:10.1007/978-3-031-39355-6_10
  57.
    Beesley LJ, Mukherjee B. Statistical inference for association studies using electronic health records: handling both selection bias and outcome misclassification. Biometrics. 2022 Mar;78(1):214–226. doi: 10.1111/biom.13400.
  58.
    Burstein D, Hoffman G, Mathur D, Venkatesh S, Therrien K, Fanous AH, Bigdeli TB, Harvey PD, Roussos P, Voloudakis G. Detecting and Adjusting for Hidden Biases due to Phenotype Misclassification in Genome-Wide Association Studies. medRxiv [Preprint]. 2023 Jan 18:2023.01.17.23284670. doi: 10.1101/2023.01.17.23284670
  59.
    RECOVERY Collaborative Group; Horby P, Lim WS, Emberson JR, Mafham M, Bell JL, Linsell L, Staplin N, Brightling C, Ustianowski A, Elmahi E, Prudon B, Green C, Felton T, Chadwick D, Rege K, Fegan C, Chappell LC, Faust SN, Jaki T, Jeffery K, Montgomery A, Rowan K, Juszczak E, Baillie JK, Haynes R, Landray MJ. Dexamethasone in Hospitalized Patients with Covid-19. N Engl J Med. 2021 Feb 25;384(8):693-704. doi: 10.1056/NEJMoa2021436.
  60.
    Lewis, C.M., Vassos, E. Polygenic risk scores: from research tools to clinical instruments. Genome Med 12, 44 (2020). doi:10.1186/s13073-020-00742-5
Posted February 06, 2025.

Subject Area

  • Genetic and Genomic Medicine