Abstract
Host genetic variants influence the susceptibility and severity of several infectious diseases, and the discovery of novel genetic associations with Covid-19 phenotypes could help developing new therapeutic strategies to reduce its burden.
Between May 2020 and June 2021, we used Covid-19 data released periodically by UK Biobank and performed 65 Genome-Wide Association Studies (GWAS) in up to 18 releases of Covid-19 susceptibility (N=18,481 cases in June 2021), hospitalization (N=3,260), severe outcomes (N=1,244) and death (N=1,104), stratified by sex and ancestry.
In coherence with previous studies, we observed 2 independent signals at the chr3p21.31 locus (rs73062389-A, OR=1.21, P=4.26×10−15 and rs71325088-C, OR=1.62, P=2.25×10−9) modulating susceptibility and severity, respectively, and a signal influencing susceptibility at the ABO locus (rs9411378-A, OR=1.10, P=3.30×10−12), suggesting an increased risk of infection in non-O blood groups carriers. Additional signals at the APOE (associated with severity and death) LRMDA (susceptibility in non-European) and chr2q32.3 (susceptibility in women) loci were also identified but did not replicate in independent datasets. We then devised an original approach to extract variants exhibiting an increase in significance over time. When applied to the susceptibility, hospitalization and severity analyses, this approach revealed the known DPP9, RPL24 and MAPT loci, amongst thousands of other signals. Finally, this significance trajectory analysis was applied to the larger Covid-19hgi meta-analyses, where additional loci of interest, related to the immune system, were identified.
These results, freely available on the GRASP portal, provide new insights on the genetic mechanisms involved in Covid-19 phenotypes.
Introduction
The severe acute respiratory syndrome – coronavirus 2 (SARS-CoV-2) is responsible for the coronavirus disease 2019 (Covid-19) which affects individuals with variable severity, ranging from asymptomatic patients to mild respiratory symptoms, hypercytokinemia, pneumonia, thrombosis and even death.1,2 Understanding the mechanisms leading to heterogeneous symptoms and susceptibility is essential in order to develop efficient treatments and improve patient care. Host genetic diversity has been shown to influence the effects of infection to several viruses,3 such as variations in CCR5 [MIM: 601373] leading to HIV resistance4 [MIM: 609423] or IRF7 [MIM: 605047] deficiency affecting Influenza susceptibility [MIM: 614680].5
In order to discover human genetic determinants to Covid-19 susceptibility and severity, several biobanks and research groups worldwide collaborated to perform Genome-Wide Association Studies (GWAS) and meta-analyses of the GWAS. In June 2020, a study involving 1,980 Covid-19 infected patients with respiratory failure was the first to reveal genome-wide significant (P < 5 × 10−8) associations at the 3p21.31 locus, encompassing SLC6A20 [MIM: 605616] and several chemokine receptors, and at the ABO [MIM: 110300] locus on chromosome 9.6 These 2 signals were later validated in independent analyses for both Covid-19 susceptibility and severity7,8 while additional significant associations were observed at loci involved in immune response or inflammation, such as IFNAR2 [MIM: 602376], DPP9 [MIM: 608258], TYK2 [MIM: 176941], CCHCR1 [MIM: 605310] and OAS1 [MIM: 164350]. Notably, these findings implicate common and uncommon variants, while studies trying to identify associations of rare variants have been unsuccessful so far.9 The largest effort is currently led by the Covid-19 host genetics initiative (Covid-19hgi),10 which completed meta-analyses of results shared by 61 studies as of June 15th 2021, and plan to release new results as additional data is made available. A major contributor to this group is the UK Biobank (UKB)11 which periodically releases the results of Covid-19 tests and related deaths, as well as health care data for its nearly 500,000 consented participants, to approved researchers.
We created a public Covid-19 GWAS results portal (https://grasp.nhlbi.nih.gov/Covid19GWASResults.aspx) in order to provide rapid deep annotation for emerging genetics results and facilitate open access to the scientific community. We contribute to this resource by performing GWAS on each Covid-19 data release from the UK Biobank, including sex-specific, ancestry-specific, and trans-ethnic Covid-19 related GWAS, along with a deep set of annotations for top variants (with P < 1 × 10−5). For each release, up to 65 GWAS have been generated including Covid-19 susceptibility, Covid-19 hospitalization, severe Covid-19 with respiratory failure, and Covid-19 death. Here we describe the results of these GWAS, 612 in total as of June 18th 2021, and report the evolution of signals associated with these Covid-19 phenotypes over the consecutive datasets released by UKB since May 2020. The latter approach, tracking the evolution of genetic signatures iteratively in UKB, suggests a valuable new analytic approach in genetic biobank studies where emerging true signals may be identified before they reach genome-wide significance based on their trajectories of significance.
Materials and Methods
UKB data
Analyses are based on v.3 of the UKB imputed dataset,12 which provide genomic data for 487,320 participants from multiple ethnicities, including 459,250 of European ancestry (EUR), 7,644 of African ancestry (AFR), 9,417 of South Asian ancestry (SAS) and 11,009 of other ancestries (OTHERS). Participants were enrolled at ages ranging from 37 to 73 and are 51.16% female. UKB started to release Covid-19 test results of its participants on March 15th 2020, and periodically update this resource as new cases are reported. Furthermore, information about Covid-19 related death was made available from June 2020, while inpatient data and primary care data was first added during the summer of 2020 and are periodically updated. Details regarding the definition and selection of cases with Covid-19 susceptibility, Covid-19 hospitalization, Covid-19 severity, and Covid-19 death are available in Table S1.
Depending on the Covid-19 phenotype analyzed (susceptibility, hospitalization, severity, or death), up to 3 different subsets of participants were used as controls. For Covid-19 susceptibility, cases with positive test results were analyzed against either participants tested with negative results (labelled Tested), or against all participants without a positive test (labelled Population). For analyses of Covid-19 hospitalization, patients hospitalized due to Covid-19 were tested against non-hospitalized participants with a positive test (Positive), or non-hospitalized participants with a test (Tested), or against all non-hospitalized participants (Population). For analyses of Covid-19 severity, patients requiring invasive respiratory support or patients who died from complications were tested against non-severe participants with a positive test (Positive), or non-severe participants with a test (Tested) or all non-severe participants (Population). For analyses of Covid-19 death, patients with Covid-19 death were tested against participants with a positive test (Positive), or participants with a test (Tested) or against all participants (Population).
Analyses
Each GWAS was conducted with SAIGE v0.38,13 which controls for population stratification, relatedness and case-control imbalance, and adjusted for baseline age (at enrollment), sex and 10 genetic principal components. For the results uploaded to the GRASP portal, variants were filtered on imputation quality (r2 > 0.3), minor allele count (MAC > 2), and minor allele frequency (MAF > 0.0001). However, for the results presented in this manuscript, we applied a more stringent filter, and considered only well-imputed variants (r2 > 0.8) and common variants (MAF > 0.01). After applying this filter, the lambda (genomic control factor) ranged from 0.988 to 1.027 in all 65 analyses of the last data release analyzed (Table S2), indicating no systematic inflation. For analyses prior to the June 18th 2020 release, we conducted analyses on participants of European ancestry only, and started adding new analyses stratified by sex and ancestry from June 18th 2020 onward. GWAS were stratified for EUR, AFR, SAS, and OTHERS ancestries, and an additional trans-ancestry GWAS combining all participants (labelled ALL) as well as GWAS combining non-European (nEUR) participants were performed.
Associations are either reported as odd ratios (OR) and 95% confidence intervals or as beta coefficients (β) and associated standard errors (SE). Linkage disequilibrium (LD) was estimated by squared correlation (r2) using UKB EUR imputed data. Given the large number of variants that may be significantly (or suggestively) associated at a locus, we assigned significantly (or suggestively) associated variants to a common locus if they were separated by less than 1Mb, and only reported the lead variant at that locus. To test 2 observed effects are equal, we used the Z statistic: , with b1 and b2 corresponding to the observed effects and SEb1 and SEb2 the associated standard errors. In addition, haplotype analyses were performed with the LDlink LDhap tool,14 while regional association plots were generated with locuszoom.15
Annotation
For each analysis hosted on the portal, we provide comprehensive annotation for top results (P < 1 × 10−5) using ANNOVAR16 and the RESTful API service provided by CADD v1.6.17 We also retrieve known phenotype associations extracted from the GRASP18 and EBI GWAS catalogs,19 and known eQTLs extracted from GTeX v820 and other eQTL resources compiled from nearly 150 datasets (built upon the work of Zhang et al,21 detailed in Table S3).
Significance Trajectory Analyses
All 65 analyses were updated as soon as new data was released from UKB. As a result, we obtained results for these analyses at different time points, which differed by the addition of new cases, thus increasing power in most recent analyses. With an increase in power, bona fide signals associated with Covid-19 phenotypes should increase in significance in each consecutive data release analyzed. We designed a workflow to extract these signals with positive significance trajectory in each analysis: for each variant (i) in the most recent data release analyzed, P < 10−4; (ii) in the first data release analyzed, P must be greater than in the most recent data release analyzed; (iii) P cannot decrease in significance by more than one order of magnitude between 2 consecutive releases; (iv) P must have increased in significance by more than 1 order of magnitude at least once between 2 consecutive releases; (v) the latest P observed must have decreased by more than 1 order of magnitude since the minimal P observed. This set of rules should ensure to extract variants which increased in significance since we started performing these analyses, while allowing some stagnation, which might happen due to random sampling and/or low case increase between 2 consecutive releases.
The Covid-19hgi datasets
For replication purposes and significance trajectory analyses, we used the publicly available Covid-19hgi meta-analyses summary statistics (freeze 6) for Covid-19 susceptibility (labelled C2, N = 112,612 cases), hospitalization (B2, N = 24,274) and severity (A2, N = 8,779 cases). These datasets are currently the largest analyses of Covid-19 phenotypes available. Publicly available summary statistics do not include the 23andMe dataset, and a version of the summary statistics without UKB was also made available for the B2 and C2 analyses. As there was no overlap of samples with our analyses, we used these summary statistics to replicate our novel findings from susceptibility and hospitalization analyses, as well as our suggestive findings from the significance trajectory analyses. To replicate findings from our severe analyses, we used the A2 summary statistics without UKB from the freeze 5 (the A2 results without UKB from freeze 6 were not available). In addition, we retrieved Covid-19hgi summary statistics (with UKB) from all publicly available freezes to perform significance trajectory analyses with the Covid-19hgi datasets (freezes 2 to 6 for B2 and C2; freezes 3 to 6 for A2).
Data Availability
Permission was obtained to post UKB summary statistics under an approved application (ID 28525). The association results are available on the portal, as well as annotated top results. In addition to UKB summary statistics, results from other efforts are also hosted on the portal. Authors of Covid-19 GWAS publications have been contacted to seek approval before hosting the results of their analyses on the GRASP Covid-19 portal. Summary statistics at this time include multiple releases of the Covid-19hgi group, severe hospitalization results from Ellinghaus et al6 and the GenOMICC study,7 as well as hospitalization status and time to end Covid-19 symptoms from the COLCORONA study,22 with all top results being re-annotated in the common framework mentioned above.
Results
UKB Covid-19 demographics
Using the latest UKB data releases available at this point, with susceptibility and hospitalization phenotypes analyzed on June 18th 2021, and severe Covid-19 and death analyzed on May 9th 2021, we retrieved 86,435 participants with a Covid-19 diagnostic, of which 18,481 tested positive. According to inpatient care data, 3,260 positive cases were hospitalized, while 1,244 patients with severe Covid-19 diagnostic received respiratory support and/or died from complications (Table 1). Since May 7th 2020, we analyzed up to 18 UKB data releases regarding Covid-19 susceptibility, 8 data releases concerning Covid-19 related deaths, 6 for Covid-19 hospitalization and 5 for Covid-19 severity. Amongst Covid-19 positive participants, we observed a global increase in the percentage of female cases, starting at 45.3% at the first release analyzed, and reaching 52.8% in the last, while men were more likely to be infected (P = 8.06 × 10−6), hospitalized (P = 1.59 × 10−45), or develop severe complications (P = 1.51 × 10−34) and die from Covid-19 (P = 2.12 × 10−27) (Table S4). There was also a decrease in the mean age of positive cases, ranging from 57.02 to 53.57, with a significant drop after the 2020 summer (Table 2), with younger individuals more likely to be infected (P < 10−300) while increase in age was associated with hospitalization (P = 4.54 × 10−90), severity (P = 5.63 × 10−108) and death (P = 1.11 < 10−122) (Table S4). Positive cases were mainly of European ancestry, representing 85.5% of all Covid-19 positive participants in the first analysis and growing to 89.6% in the last. South Asian ancestry, African ancestry, and other ancestry represent 4.4%, 3% and 3% of positive cases, respectively, in this UKB data release. However, compared to European participants, non-European participants were enriched amongst cases (P = 3.09 × 10−33 for AFR, P = 1.10 × 10−91 for SAS, P = 1.96 × 10−6 for OTHERS), as well as hospitalized (P = 2.05 × 10−53 for AFR, P = 1.89 × 10−24 for SAS, P = 6.69 × 10−9 for OTHERS), severe cases (P = 1.02 × 10−17 for AFR, P = 5.93 × 10−11 for SAS, P = 0.004 for OTHERS) and deceased participants (P = 2.17 × 10−14 for AFR, P = 6.14 × 10−9 for SAS, P = 0.009 for OTHERS) (Table S4).
Sample sizes for each analysis, as of 06.18.21
Composition of the consecutive UK Biobank Covid-19 data releases
Genome-wide significant results
Between May 2020 and June 2021, we observed several genome-wide significant (P < 5 × 10−8) signals across the 65 analyses. However, many signals were only punctually significant, and only a handful of signals remained significant in subsequent data releases analyzed. For instance, the analysis of Covid-19 susceptibility in participants of European ancestry, with untested or negatively tested participants as controls (noted Population controls), revealed 8 signals reaching genome-wide significance at some point (Figure 1), of which only 2 remained significant in the last data release analyzed (on 06.18.21): the chr3p21.31 locus encompassing SLC6A20 and several chemokine receptors (rs73062389-A, MAF = 0.06, OR = 1.22 [1.16; 1.28], P = 7.60 × 10−16), and the ABO locus on chromosome 9 (rs9411378-A, MAF = 0.22, OR = 1.10 [1.07; 1.14], P = 8.78 × 10−12). These two loci were previously reported to modulate Covid-19 susceptibility in several studies.6–8,10
The top panel represents the evolution in significance of signals reaching genome wide significance at least once across all data releases analyzed in Europeans for Covid-19 susceptibility. The middle panel represent the mean age of cases in each data release. The bottom panel represent the number of cases in each data release. A representation of these signals across the Covid-19hgi releases is available as Figure S50.
Across all 65 analyses, 5 loci reached genome-wide significance in the last data releases analyzed (on 06.18.21 for susceptibility and hospitalization, and 05.09.21 for severity and death): the chr3p21.31 locus, ABO, APOE [MIM: 107741], LRMDA [MIM: 614537] and an intergenic signal at the chr2q32.3 locus (Table 3). All associations with P < 10−5, from all 65 analyses, are available in Table S5. In addition, all signals reaching genome-wide significance in any data release analyzed are presented in Figures S1-S46. The signals reported in the following correspond to results using Population controls, but after systematically applying the Z-test for the equality of regression coefficients we did not observe a significant difference of effect when using a different set of control, such as Tested controls (P > 0.05, Figure S47).
Lead variants associated with Covid-19 phenotypes
The 3p21.31 locus
This locus was associated with Covid-19 susceptibility, hospitalization and severity, in EUR and ALL ancestries, regardless of the set of controls employed. However, as mentioned in previous works,8,10 the lead variant associated with susceptibility was not in LD with the lead variants for hospitalization nor severity (r2 < 0.01) suggesting 2 distinct signals modulating Covid-19 susceptibility and severity (Figure S48).
ABO
The ABO locus was associated with Covid-19 susceptibility for ALL and EUR participants, using either Population or Tested controls. Using haplotype analyses with blood group tagging variants, we determined the blood groups tagged by the lead variant associated with susceptibility (rs9411378). For the 5 main blood groups, we used the following tagging variants23: rs41302905 (tagging for O2), rs8176743 (tagging for B), rs1053878 (tagging for A2), rs8176719 (tagging for O1) and rs2519093 (tagging for A1). As a result, we observed that the Covid-19 susceptibility variant was fully tagging A1 and A2 haplotypes, and partially tagging B haplotypes (20% of B haplotypes) (Figure S49), suggesting an increased risk of infection for A1 and A2 carriers.
APOE
The APOE variant tagging for the APOE-ε4 haplotype was initially significantly associated with Covid-19 susceptibility in EUR (rs429358-C, MAF = 0.15, OR = 1.38 [1.24; 1.53], P = 1.80 × 10−9, on 07.14.20). This signal was notably the first report of a genetic determinant for Covid-19 susceptibility.24 This previous report was based on UKB data but this signal was not replicated in an independent dataset, and this association was greatly attenuated after the summer, when the number of Covid-19 cases started to rise significantly and the mean age of infected participants decreased (Figure 1). The interaction between the age of participants and the APOE variant was significant (P = 1.78 × 10−9), which was further confirmed using a 2df test25 with age and genotype (P = 2.65 × 10−9), suggesting that a subset of older participant carriers of this variant was more at risk of Covid-19 infection. Remarkably, this signal became genome-wide significant in the 05.09.21 data release analyses of death in ALL and EUR, and in the analysis of Severe Covid-19 in EUR, which may suggest a mechanism leading to severe and lethal complications after infection. However, there is no evidence of association in the larger Covid-19hgi meta-analysis of Covid-19 severity (A2, freeze 5 without UKB, P = 0.27).
Results from ancestry- and sex-stratified analyses
In other ancestry-stratified GWAS (AFR, SAS and OTHERS), no signal was found genome-wide significant in the last data release analyzed. In the GWAS combining all non-European participants (nEUR), with Population controls, one signal was found significant at the LRMDA locus on chromosome 10 (rs114026383-C, MAF = 0.02, OR = 2.39 [1.79; 3.20], P = 4.10 × 10−9). According to gnomAD (v2.1.1),26 this variant is mostly carried by individuals of African ancestry (MAF = 0.04) and mainly absent in other ancestries. In the GWAS of African ancestry participants, this signal is close to genome wide significance (P = 1.55 × 10−7). Interestingly, LRMDA variants have been found associated to lung function27 [MIM: 608852] and HIV viral load in an unadjusted GWAS,28 but there was no evidence of LD between these variants and rs114026383 (r2 < 0.01). Furthermore, this association did not replicate in the Covid-19hgi susceptibility analysis (C2) restricted to African ancestry participants (P=0.27).
In addition, the chr3p21.31 susceptibility variant (rs73062389) was less frequent in non Europeans, and we did not observe an association with this variant in any of the non European ancestry-stratified analyses (MAF = 0.007 and P = 0.22 in AFR; MAF = 0.021 and P = 0.47 in SAS; MAF = 0.038 and P = 0.79 in OTHERS). Similarly, there was no association at the ABO lead variant (rs9411378) in non European ancestry-stratified analyses (P = 0.18 in AFR; P = 0.22 in SAS; P = 0.34 in OTHERS).
In sex-stratified analyses, using Population controls, the chr3p21.31 susceptibility signal was significant in women (rs73062389, P = 1.06 × 10−8 in ALL) and moderatly associated in men (P = 2.10 × 10−6 in ALL), whereas the ABO signal was significant in men (rs9411378, P = 5.10 × 10−10 in ALL) and moderatly associated in women (P = 3.30 × 10−5 in ALL). The chr3p21.31 lead variant in the hospitalization analysis (rs72893671) was more significant in men (P = 6.80 × 10−11 in ALL) than in women (P = 3.68 × 10−4 in ALL). Despite these differences in significance between men and women for these 3 main signals, we did not observe a significant difference of effect when using the Z-test for the equality of regression coefficients (P > 0.05 for all 3 signals). Additionally, a variant reached genome-wide significance in the analysis of Covid-19 susceptibility of women of ALL ancestry at the chr2q32.3 locus, while no association was observed for this variant in men (P = 0.47). However this association was not supported by the Covid-19hgi C2 analysis (P = 0.58), even though the Covid-19hgi meta-analyses were not sex-stratified.
Signals with a positive significance trajectory in UKB
The significance trajectory of the most robust signals, at the chr3p21.31 and ABO loci, mostly increased after the surge of new cases following the 09.08.20 data release (Figure 1). After this date, the chr3p21.31 signal increased at each new data release and reached genome-wide significance in the 11.03.20 release, while the ABO signal also increased and reached significance in the 01.04.21 release. In order to identify signals that may become significant in future releases, we extracted variants displaying a similar positive trajectory in significance over time, and reaching at least P < 10−4 in the last data release analyzed.
After applying our filters to each of the 65 analyses, the number of variants with a positive significance trajectory summed up to 61,430. For each analysis, we extracted the lead variant at each locus, which resulted into 11,821 lead variants (Table S6), with some duplicates, as lead variants from a same locus can appear in several analyses (i.e. the chr3p21.31 locus, which appears in 49 out of the 65 analyses). Notably, some of these loci previously reached genome-wide significance in the Covid-19hgi meta-analyses10: the DPP9 locus [MIM: 608258] associated with hospitalization, the NXPE3 / RPL24 locus [MIM: 604180] associated with susceptibility, and the MAPT locus [MIM: 157140] associated with severe Covid-19.
In order to identify which variants with a positive trajectory reached at least nominal significance in the Covid-19hgi analyses, we further focused on loci with positive significance trajectory from the susceptibility, hospitalization and severe analyses, using ALL ancestry and Population controls, which have the closest phenotype definition to the susceptibility (C2), hospitalization (B2) and severity (A2) meta-analyses from Covid-19hgi, respectively. Signals with a positive trajectory from Covid-19 susceptibility and hospitalization GWAS were sought for replication in the corresponding C2 and B2 meta-analyses from Freeze 6 without UKB, while signals from the severe Covid-19 GWAS were sought for replication in the A2 meta-analysis from Freeze 5 without UKB. We identified 205 loci with a positive trajectory in the susceptibility analysis, 193 with hospitalization, and 168 with severity. Out of these 566 signals, 430 were present in the corresponding Covid-19hgi meta-analyses, of which 30 (7%) had a concordant effect direction and P < 0.05 (Table S7). Notably, we observed a stringent gap in P-values from the Covid-19hgi meta-analyses results between the top 7 signals, previously identified and reaching P < 10−16 (chr3p21.31, ABO, DPP9, and RPL24), and the remaining 23 signals which did not reach P < 10−3 (Table 4). Furthermore, amongst the variants missing in the Covid-19hgi meta-analyses figured the lead variant at the MAPT locus (rs532052263), which had a positive trajectory in the UKB severe analysis. However, we observed significant LD between this variant and the known signal associated with severe Covid-1910 at this locus (rs8080583, r2 = 0.75), suggesting they both belong to the same signal.
Signals with a positive significance trajectory in UKB, sorted by their P-value in their equivalent Covid-19hgi meta-analyses (top 10)
Signals with a positive significance trajectory in Covid-19hgi meta-analyses
Finally, given the larger sample size and statistical power of the Covid-19hgi studies, we applied the significance trajectory analyses to these datasets, where this approach might be more effective to identify signals of interest. For the B2 and C2 analyses, we analyzed freezes 2 to 6, and we analyzed freezes 3 to 6 for the A2 analysis (all including the UKB dataset, but not the 23andMe dataset). For this significance trajectory analysis, the same set of rules used for the UKB analyses was applied, except that we only extracted variants reaching P < 10−5 in the last freeze (instead of P < 10−4 in UKB), since these meta-analyses are better powered than the UKB analyses. Over all 3 analyses, we identified a total of 9,697 variants with a positive trajectory (Table S8). After clusterizing these variants into loci and extracting lead variants at each locus, we obtained 51 lead variants for the A2 analysis, 89 for B2 and 85 for C2 (Table S9).
Out of these 225 loci, 34 reached genome-wide significance, and only 3 known genome wide significant signals were not detected by the significance trajectory analysis: at the CCHCR1 locus [MIM: 605310] (in both B2 and A2 analyses) and at the MUC5B locus [MIM: 600770] (in the B2 analysis), as both these loci decreased in significance by more than one order of magnitude at some point. Furthermore, 2 loci that reached genome-wide significance only when the 23andMe datasets were included in the Covid-19hgi meta-analyses were also detected by our approach: at the SFTPD locus [MIM: 178635] and at the SLC22A31 locus (both in the B2 analysis). Additionaly, we detected signals at the ACE2 [MIM: 300335] and TMPRSS2 loci [MIM: 602060], 2 proteins involved in SARS-CoV-2 infection. A rare ACE2 variant was previously observed as significantly associated in the C2 Covid-19hgi meta-analysis (rs190509934, MAF = 0.003, P = 3.30 × 10−17), but the significance trajectory analysis also revealed a common variant (rs4646120, MAF = 0.49, P = 2.30 × 10−6) associated with Covid-19 susceptibility. At the TMPRSS2 locus, a common exonic variant was detected in the A2 analysis (rs12329760, MAF = 0.27, P = 8.18 × 10−6), and other intronic variants in LD with the exonic variant were detected in the B2 analysis (rs915823, P = 1.67 × 10−6, LD r2= 0.83). Finally, several signals detected in this approach might be of interest as they are located in, or near, genes playing a role in immune response. For instance, intronic variants in the mucin genes MUC4 [MIM: 158372] (rs2260685, MAF = 0.46, P = 1.76 × 10−6) and MUC16 [MIM: 606154] (rs73005873, MAF = 0.36, P = 7.81 × 10−8) were identified in the C2 analysis, and 2 intergenic variants close to interferon regulatory factors IRF1 [MIM: 147575] (rs4143335, MAF = 0.14, P = 5.96 × 10−6) and IRF2 [MIM: 147576] (rs6854508, MAF = 0.17, P = 9.26 × 10−6) were identified in the B2 analysis.
Discussion
This project was initiated with the major aim to share results of Covid-19 host genetics analyses freely and rapidly on the GRASP portal, during a pandemic where new insights to improve patient care and develop better treatments were greatly needed. A few months after the first Covid-19 case was identified in the UK, we started to perform GWAS on each dataset released by UKB between May 2020 and June 2021, and thus far examined genetic signals associated with Covid-19 phenotypes across 18 data releases. This unique context allowed us to track the evolution of genetic associations over time, an approach rarely applied. While some works have examined the statistical properties of phased and nested case-control studies,29,30 few studies have proposed using statistical significance trajectories over time in genetic analyses.
As a first observation, the majority of signals observed in the first stages of the project did not sustain over time. For instance, the first genome-wide significant association observed in the GWAS of Covid-19 susceptibility in Europeans with rs34338189 as lead variant is not even nominally significant in the last release (P = 0.25). Another genome-wide significant signal was observed at the APOE locus, with a significance that increased in the first 4 data releases before decreasing continually in the subsequent releases. The lead variant at the APOE locus is coding in part for the APOE-ε4 haplotype, known to increase the risk of Alzheimer’s disease [MIM: 607822], dementia, dyslipidemia and cardiovascular diseases [MIM: 617347] and is speculated to cause inflammation and cytokine storms.31 Notably, the significant association of the APOE signal with severe Covid-19 and death in most recent analyses could support this proposition, but this association has not been replicated in an independent dataset. The evolution of the APOE signal over time could also be due to an initial higher prevalence of Covid-19 in nursing homes,32 where dementia patients were at higher risk of being infected and spreading the virus due to living arrangements, and a poor understanding of transmission dynamics and appropriate safety guidelines early in the pandemic. Overall, the evolution of these signals suggests a change in the composition of cases over time, such as the diminution of age and increase of positively tested women in the later data releases, as well as the introduction of variant SARS-CoV2 strains. This change was most significant after the 2020 summer when a surge in new Covid-19 infections occurred.
The most robust findings from our study are the association of the chr3p21.31 and ABO loci with Covid-19 susceptibility and a distinct signal at the chr3p21.31 locus associated with Hospitalization and severe Covid-19. These observations corroborate several previous reports,6– 8,10. In order to detect additional suggestive signals, that may reach genome-wide significance in future analyses, we developed an original strategy to identify associations displaying increased significance over time, and identified thousands of variants with a positive significance trajectory. Amongst these, 3 signals at the DPP9, NXPE3, and MAPT loci were previously established variants associated with Covid-19 phenotypes, notably by the Covid-19hgi analyses,10 thus demonstrating the potential of this approach. However, determining which variants actually modulate Covid-19 phenotypes amongst thousands of candidates, without an independent and well-powered dataset such as Covid-19hgi to confirm the validity of a signal, is still challenging. Still, this approach might be valuable to reduce the scope of search. To further reduce the scope of candidates, we applied this significance trajectory analysis to the larger meta-analyses provided by the Covid-19hgi. Given the higher statistical power of these analyses, we also applied a more stringent threshold, which allowed us to extract only variants reaching P < 10−5 in the last freeze. As a result, a lower and thus more workable number of loci were extracted, among which several have a role in immune function, such as mucins, interferon regulators, interleukin receptors, or HLA system related genes, which might be of interest for the study of genetic mechanisms involved in Covid-19 phenotypes. However, these associations are still only suggestive, and will need to be further validated either in future releases or in independent datasets.
The associations we observed changing through the pandemic could reflect random effects or changes in statistical power, but some of the results suggest changes due to potential gene-environment interactions such as age, underlying health conditions (APOE) or sex makeup of cases exposed to or engaging in risk behavior. This indicates the general approach of iterative analysis and significance trajectory analyses for genetics during pandemics may have benefits in uncovering pathophysiologic clues. Additionally, other factors like predominant virus strains and changing treatment strategies through a pandemic might interact with host genetics, and be better understood by iterative analyses. To enable such approaches in pandemics and other critical research domains, rapid and extensive results sharing are important catalysts, as demonstrated by the UKBB, COLCORONA, GenOMICC and Covid-19hgi.
In summary, our host genomic analyses of Covid-19 may contribute to improve the comprehension of mechanisms involved in the infection and complications due to Covid-19. Our study also has some limitations. Most importantly, our data and the work of others support large health disparities between EUR and non-EUR individuals related to COVID-19 throughout the ongoing pandemic. Despite an over-representation proportionally among cases and those with severe and fatal outcomes, the non-EUR component of UKB and Covid-19hgi is a proportionally small sample limiting our statistical power to address population-specific genetic variants contributing to health outcomes. However, moving forward we feel that having a diverse set of results with different phenotype definitions, sex-specific, ancestry-specific, and including external group summary statistics, all in a common genome reference and annotation framework may maximize the chance for new studies to cross-replicate or meta-analyze results as Covid-19 genetic studies continue to grow.
Data Availability
The datasets generated during this study are available at the GRASP Covid-19 portal
Description of Supplemental Data
Supplemental Data include 50 figures and 9 tables.
Declaration of Interests
The authors declare no competing interests.
Web Resources
GRASP Covid-19 resource: https://grasp.nhlbi.nih.gov/Covid19GWASResults.aspx
GTeX: https://gtexportal.org/home/
CADD: https://cadd.gs.washington.edu/
EBI GWAS catalog: https://www.ebi.ac.uk/gwas/
GRASP catalog: https://grasp.nhlbi.nih.gov/Overview.aspx
Covid-19hgi meta-analyses results: https://www.covid19hg.org/results/r5/
GnomAD: https://gnomad.broadinstitute.org/
LocusZoom: https://locuszoom.org/
LDlink: https://ldlink.nci.nih.gov
Data and Code Availability
The datasets generated during this study are available at the GRASP Covid-19 portal: https://grasp.nhlbi.nih.gov/Covid19GWASResults.aspx
Acknowledgments
All authors were supported by NIH Intramural Research Program funds. The views expressed in this manuscript are those of the authors and do not necessarily represent the views of the National Heart, Lung, and Blood Institute; the National Institutes of Health; or the U.S. Department of Health and Human Services. This research has been conducted using the UK Biobank Resource under Application Number 28525. UK Biobank was established by the Wellcome Trust, Medical Research Council, Department of Health, Scottish government, and Northwest Regional Development Agency. It has also had funding from the Welsh assembly government and the British Heart Foundation. All UKB analyses for this manuscript were conducted on the NIH Biowulf high performance computing cluster (https://hpc.nih.gov/). The Genotype-Tissue Expression (GTEx) Project was supported by the Common Fund of the Office of the Director of the National Institutes of Health, and by NCI, NHGRI, NHLBI, NIDA, NIMH, and NINDS. We also thank the NHLBI IT team for their help in keeping the GRASP portal up to date, David-Alexandre Tregouet for his helpful comments regarding the ABO haplotype analyses, and the Covid-19 Host Genetics Initiative for sharing the results of their analyses.
Footnotes
The analyses presented were updated with new results from UKB (up to June 6th 2021). In addition, new analyses of significance trajectory in Covid-19hgi meta-analyses were performed.