Abstract
Background There are certain pre-existing conditions which leads in increased risk of coronavirus disease 2019 (COVID-19) severity and mortality. The objective of this study is to determine shared genetic architecture between COVID-19 severity and other medical conditions using electronic health record (EHR) data from diverse patient populations.
Methods and Findings We conducted Phenome-wide association study (PheWAS) of genetic variants associated with severe COVID-19 in two biobanks with EHR and genomic data: 1) Veteran Affairs (VA) Million Veteran Program (MVP), 2) United Kingdom Biobank (UKBB). Genetic variants associated with critical illness (n=48) or hospitalization (n=39) due to COVID-19 from COVID-19 Host Genetics Initiative genome-wide association studies. Phenotypes defined by International Classification of Diseases (ICD) codes mapped to clinically relevant groups using published PheWAS methods; pre-COVID-19 data used to avoid potential confounding. Among 455,683 US Veterans from MVP, variants associated with severe COVID-19 tested for association across 1,559 phenotypes; 353,365 UK Biobank participants, and 1,064 phenotypes tested. Genetic variants at ABO locus (rs550057, rs505922) associated with the largest number of phenotypes (nrs55057= 53 and nrs505922=61); strongest association with venous embolism, odds ratio (OR)rs550057 1.27 (p=5.28 × 10−116), and thrombosis ORrs505922 1.31, p=3.5 x10−183. Among 67 respiratory conditions tested, only idiopathic pulmonary fibrosis, OR rs2277732 1.17, p=1.3410−05, and asthma ORrs143334143 0.94, p=2.31 x10−04, shared variants with severe COVID-19. The RAVER1 locus (rs74956615) associated with reduced risk for autoimmune conditions, e.g. psoriasis OR 0.71, p= 1.53 x10−22, rheumatoid arthritis, OR 0.78, p=1.04 × 10−09; findings replicated in UKBB. A known functional missense variant (rs34536443, TYK2) in the region had the highest linkage disequilibrium with rs74956615, suggesting signal was likely from TYK2. In MVP, PheWAS results stratified by genetic ancestry did not demonstrate significant difference in associations across ancestry.
Conclusions Shared genetic architecture between COVID-19 severity and conditions related to underlying risk factors for severe and poor COVID-19 outcomes; associations similar across genetic ancestries. Divergent association between inflammatory conditions and severe COVID may be explained by known pathway impairing signaling of inflammatory cytokines, reducing risk for autoimmunity; same pathway reduces type 1 interferon signaling, critical for viral host defense. Caution needed when targeting pathways that may balance immune tolerance and immunodeficiency to treat COVID-19.
Introduction
A respiratory disease coronavirus disease-2019 (COVID-19) caused by a novel coronavirus was identified in December 2019,1 now known as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). As of March 2021, the COVID-19 pandemic has resulted in the loss of over 2.6 million lives worldwide.2 Identifying host genetic variants associated with severe disease from COVID-19 can identify key pathways important in the pathogenesis of SARS-CoV2. International efforts such as the COVID-19 Host Genetics Initiative (HGI)3 have meta-analyzed genome-wide association study (GWAS) summary statistics at regular intervals to identify novel genetic associations with COVID-19 severity. Thus far, ten independent variants associated with COVID-19 severity at genome-wide significance have been identified, most notably at the ABO locus.4 These GWASs have also identified variations in genes involving inflammatory cytokines and interferon signaling pathways such as IFNAR2, TYK2, and DPP9.4
The unprecedented availability of genome-wide data for COVID-19 provides an opportunity to study clinical conditions that share genetic risk factors for COVID-19 severity. Examining known conditions, each with a body of knowledge regarding important pathways and targets, may in turn improve our understanding of pathways relevant for COVID-19 severity and inform the development of novel treatments against this pathogen. The Phenome-Wide Association Study (PheWAS) is an approach for simultaneously testing genetic variants’ association with a wide spectrum of conditions and phenotypes.5 Mega-biobanks such as the Veteran’s Affairs (VA) Million Veterans Program (MVP) and the UK Biobank (UKBB), both with genotypic data on over 500,000 participants linked with Electronic Health Record (EHR) data containing rich phenotypic data, enables large-scale PheWAS. Moreover, MVP has the highest racial and ethnic diversity of the major biobanks affording an opportunity to compare whether associations are similar across genetic ancestries.6
The objective of this study was to use existing clinical EHR data to identify conditions that share genetic variants with COVID-19 severity using the disease-agnostic PheWAS approach. Since COVID-19 is a new condition, identifying existing conditions which share genetic susceptibility may allow us to leverage existing knowledge from these known conditions to provide context regarding important pathways for COVID-19 severity.
Methods
Data sources
The VA MVP is a national cohort launched in 2011 designed to study the contributions of genetics, lifestyle, and military exposures to health and disease among US Veterans.6 Blood biospecimens were collected for DNA isolation and genotyping, and the biorepository was linked with the VA EHR, which includes diagnosis codes (International Classification of Diseases ninth revision [ICD-9] and tenth revision [ICD-10]) for all Veterans followed in the healthcare system up to September 2019. The single nucleotide polymorphism (SNP) data in the MVP cohort was generated using a custom Thermo Fisher Axiom genotyping platform called MVP 1.0. The quality control steps and genotyping imputation using 1000 Genomes cosmopolitan reference panel on the MVP cohort has been described previously.7
UKBB is a population-based cohort of approximately 500,000 subjects from the UK between 49 and 69 years of age and served as the replication cohort. Participants were enrolled between 2006 and 2010 and consented for the prospective collection of health records and lifestyle, imaging, and genetic data. The phenotype data included ICD-10 diagnosis codes up to March 2019.
Genetic variant selection
An overview of the analytic workflow is outlined in Fig 1. Variants were derived from the COVID-19 HGI GWAS meta-analysis release v43. In this study, we analyzed the following HGI GWAS summary statistics: 1) hospitalized and critically ill COVID-19 vs. population controls denoted as “A2” in HGI, and referred to as “critical COVID” in this study, and 2) hospitalized because of COVID-19 vs. population controls, denoted as “B2” in HGI, referred to as “hospitalized COVID” in this study3. For each GWAS, variants with a Benjamini-Hochberg false discover rate (FDR) corrected p-value < 0.1 were selected as candidate lead SNPs (853 associated with critical COVID, and 902 associated with hospitalized COVID). Variants with r2<0.1 were clustered within a 250 kb region according to 1000 Genomes phase 3 transethnic reference panel8, resulting in 48 independent variants associated with critical COVID and 39 variants associated with hospitalized COVID summary statistics. The lead variants from each set of GWAS summary statistics are available in eTable 1.
Outcomes
For both MVP and UKBB, clinical data prior to the onset of the COVID-19 pandemic were used to reduce potential confounding bias from SARS-CoV-2 infection on existing conditions.
Briefly, phenotypes were defined by phecodes, used in prior PheWAS studies5,9. Each phecode represents ICD codes grouped into clinically relevant phenotypes for clinical studies. For example, the phecode “deep venous thrombosis” includes “venous embolism of deep vessels of the distal lower extremities,” and “deep venous thrombosis of the proximal lower extremity,” both of which have distinct ICD codes. Using this approach, all ICD codes for all Veterans in MVP were extracted and each assigned a phenotype defined by a phecode. ICD-9 and ICD-10 codes were mapped to 1876 phecodes, as previously described.5,9
For each phecode, participants with ≥2 phecode-mapped ICD-9 or ICD-10 codes were defined as cases, whereas those with no instance of a phecode-mapped ICD-9 or ICD-10 code were defined as controls. Based on our previous simulation studies of ICD codes from EHRs, we excluded phecodes with < 200 cases from the analysis10 and we applied this threshold in each ancestry group. Thus, 1,520 (EUR), 1206 (AFR), 858 (HIS), 176 (ASN) phecodes in the MVP cohort were analyzed. The same process was applied to UKBB, resulting in 1,0,64 phecodes from 1,615 phecodes.
Phenome-wide association studies
The primary PheWAS analysis used SNPs identified from the HGI GWAS of critical and hospitalized COVID, and tested association of these SNPs with phenotypes extracted from the EHR data prior to the COVID-19 pandemic. Logistic regression using PLINK2 to examine SNPs’ association with phecodes and firth regression was applied when logistic regression model failed to converge. Regression models were adjusted for sex, age (at enrollment), age squared, and the first 20 principal components. Genetic ancestry was determined using the HARE method for four major groups: African (AFR), Asian (ASN), Hispanic (HIS), and European (EUR) ancestry11. Ancestry-specific PheWAS was first performed in these four groups, and then summary data were meta-analyzed using an inverse-variance weighted fixed-effects model implemented in the PheWAS R package9. We assessed heterogeneity using I2 and excluded any results with excess heterogeneity (I2 > 40%).
To address multiple testing, FDR p < 0.1 was considered significant. In the PheWAS analysis, p < 2.44 × 10−4 (critical COVID lead variants) and p < 3.18 × 10−4 (hospitalized COVID lead variants) were considered significant. In the main manuscript we highlight PheWAS significant associations using FDR < 0.1 and an effect size associated with increased or reduced risk for a condition by 10%, with complete PheWAS results in eTable 2 and eTable 3. Replication of MVP PheWAS results were performed in UKBB among individuals of European descent due to small sample size of other ancestries in UKBB.
Results
We studied 453,699 MVP participants, with mean age 68 years (SD), 89% male, with 32% participants from non-European ancestry (Table 1). The PheWAS was performed on 48 genetic variants associated with critical COVID-19, and 39 genetic variants associated with hospitalized COVID, across 1,559 phenotypes.
From the trans-ethnic meta-analysis, we identified 109 phenotypes significantly associated with critical COVID GWAS-identified variants, and 118 associations with hospitalized COVID GWAS-identified lead variants (FDR, p<0.1). Five variants were common to both severe and hospitalized COVID and had significant PheWAS associations, namely, variations nearest to the genes ABO (rs550057 and rs505922), DPP9 (rs2277732), HLA-DPB1 (rs143334143), and CCHCR1 (rs9501257) (eTable 2 and eTable 3).
Association of ABO loci with known risk factors and outcomes related to COVID-19 severity
In the transethnic meta-analysis, the phenotypes with the strongest association with variants near ABO locus (rs550057 and rs505922) was “venous embolism and thrombosis” (ORcritical_PheWAS = 1.27 [1.24 - 1.29], Pcritical_PheWAS = 5.28 × 10−116; ORhospitalized_PheWAS = 1.31 [1.29 - 1.34], Phospitalized_PheWAS = 3.53 × 10−183, Fig 2). The ABO loci had the largest number of significant PheWAS association findings, accounting for 48% (53/109) of significant phenotype associations in the critical COVID PheWAS, and 51% (61/118) in the hospitalized COVID PheWAS. As expected, conditions associated with the ABO locus, such as venous embolism and thrombosis, hypercoagulable state, type 2 diabetes, coronary atherosclerosis, have been reported as risk factors for or are complications associated with COVID-19 severity and mortality (Fig 2, eTable 2 and eTable 3).
Associations between variants associated with COVID-19 severity and respiratory conditions and infections
Among respiratory conditions, only two diseases had significant associations (FDR < 0.1) shared with genetic variants associated with severe COVID-19. The most significant association was observed between rs2277732 (DPP9) and idiopathic fibrosing alveolitis (OR = 1.17 [1.08 – 1.26]; P = 5.34 × 10−5), also known as idiopathic pulmonary fibrosis (IPF). The association between DPP9 variants and IPF has been reported in previous studies.12 However, the association of genetic variants with other respiratory conditions may represent novel findings: the association of intronic variant rs143334143 in CCHCR1 with asthma (OR = 0.94 [0.92 - 0.97], P = 2.31 × 10−4). We did not detect associations between any of the variants and either other respiratory conditions or infectious diseases. (eTable 3).
Associations between variants associated with COVID-19 severity and reduced risk for certain phenotypes
A subset of variants associated with severe COVID were also associated with a reduced risk for certain conditions, many of which were autoimmune conditions (Table 2). The rs74956615-A allele of RAVER1, a lead variant from the hospitalized COVID GWAS was associated with a reduced risk for psoriasis (OR = 0.71 [0.66-0.76], P = 1.5 × 10−22), rheumatoid arthritis (RA) (OR = 0.78 [0.72 - 0.84], P = 4.6 × 10−09), and psoriatic arthropathy (OR = 0.60 [0.50 - 0.71], P = 3.4 × 10−09). Similarly, rs11085727-T at the TYK2 locus identified as a lead variant in the HGI critical COVID GWAS was associated with a reduced risk of psoriasis (OR = 0.88 [0.85 - 0.90], P =1.35 × 10−17) and psoriatic arthropathy (OR = 0.81 [0.75 -0.86], P = 9.54 × 10−10).
The association of the RAVER1 variant with psoriasis and RA replicated in UKBB. Among the 353,365 EUR UKBB participants, the RAVER1 variant was associated with a decreased risk for psoriasis and RA (ORpsoriasis = 0.69, P= 2.88 × 10−05; ORRA = 0.67, P = 9.15 × 10−07). RAVER1 was mentioned in prior RA GWAS’ with the signal attributed to TYK2.13 The TYK2 signal has been mapped to a protein-coding variant (rs34536443) with predicted partial loss of function (LOF) leading to a reduced risk of psoriasis, psoriatic arthropathy, and RA as well as other autoimmune inflammatory conditions14,15. The RAVER1 variant (rs74956615) in this study is in high linkage disequilibrium (LD) (r2 = 0.7) with this TYK2 protein-coding variant (S1 Fig), with the same associations of reduced risk of psoriasis, psoriatic arthropathy, and RA. This suggests that the RAVER1 signal from our analysis is coupled with the protein-coding variant in TYK2 (rs34536443). The TYK2 variant (rs11085727) associated with critical COVID was not in LD (r2 =0.07) with the previously identified protein-coding variant in TYK2 (rs34536443). To further support these findings, we conducted PheWAS of rs34536443 in MVP and observed similar association of reduced risk with aforementioned autoimmune inflammatory conditions (S2 Fig).
The PheWAS analyses performed across genetic ancestries observed similar findings as the overall meta-analysis with fewer associations within each ancestry group due to smaller population sizes (S3 Fig). Of note, SNP rs1396186 associated with hospitalized COVID didn’t had associations with obesity in meta-analysis (P=0.06; I2 = 84%). However, among Veterans of African ancestry, this association was significant (OR = 1.13 [1.07 – 1.19]; P = 8.79 × 10−06) but was not observed in the larger population of European descent (S3 Fig, panel B, S3 Table).
Another SNP, that passed rs1754680 was significantly associated with obesity among EUR ancestry participants but not in other ancestry populations (S3 Table).
Discussion
In this large-scale PheWAS, we identified shared genetic architecture between variants associated with severe COVID-19 and other complex conditions using data from MVP, one of the largest and most diverse biobanks in the world, with replication in UKBB. Broadly, these risk alleles identified conditions associated with risk factors for severe COVID-19 manifestations such as T2D, ischemic heart disease across all ancestries, with a potentially stronger association with obesity among Veterans of African ancestry. Phenotypes associated with poor outcomes related to COVID-19 infection, including deep venous thrombosis and other thrombotic complications, also shared variants associated with severe COVID-19. In contrast, among respiratory conditions, only idiopathic pulmonary fibrosis and asthma shared genetic risk factors, with the notable absence of an association with COPD and other respiratory infections. Finally, we observed that variants associated with severe COVID-19 were associated with a reduced risk of autoimmune inflammatory conditions, such as psoriasis, psoriatic arthritis, and RA.
A classic GWAS tests the association between millions of genetic variants with the presence or absence of one phenotype, e.g., GWAS of deep venous thrombosis. In the COVID-19 HGI GWAS, the “phenotype” was patients hospitalized for or critically ill from COVID-19. Clinically, this population includes a mixture of patients with a complex list of medical conditions at high risk for severe COVID complications and those who had actual complications from COVID-19. Thus, we would anticipate that many of the significant phenotypes would be associated with risk factors such as obesity and deep venous thrombosis. Notably, the clinical data used in this study pre-dates the emergence of COVID-19 to reduce potential confounding bias that can occur in a population infected with SARS-CoV-2, e.g. interaction between COVID-19 and type 2 diabetes. Additionally, our findings suggest that the PheWAS approach can be a useful tool to identify clinical factors related to emerging infectious diseases regarding severity or complications when genomic data are available.
The PheWAS results of SNPs in the ABO locus served as a positive control for this study. Genetic variations in ABO are an established risk factor for COVID-19 severity. Patients with blood group A have a higher risk of requiring mechanical ventilation and extended ICU stay compared with patients with blood group O.16 These same variations at ABO had known associations with a spectrum of blood coagulation disorders identified in studies pre-dating COVID-19.17–19 The PheWAS of ABO variants identified associations with increased risk of deep vein thrombosis, pulmonary embolism, and other circulatory disorders, in line with prior studies, and recent studies among patients hospitalized with COVID-19.20–24
Autoimmune conditions shared genetic variants with severe COVID-19, the direction of effect was for a reduced odd of these conditions. Based on existing data, we believe known protein-coding variants in TYK2 (rs34536443, 1140A) drove the majority of the signal from RAVER1 with reduced risk of psoriasis, psoriatic arthritis, and RA; this signal is consistent with prior reported associations of a protective effect of TYK2 with these same conditions.13,14,15 Moreover, recent studies exploring potentially important pathways for COVID-19 progression have highlighted TYK2 and its downstream signaling pathway through type 1 interferons as a potential target for treatment.25
The existing literature can help explain the dual association between reduced risk of autoimmune conditions such as psoriasis and RA and increased risk of severe COVID via TYK2. TYK2, a member of the Janus Kinase (JAK) family of genes, plays a key role in cytokine signal transduction and the inflammatory response, particularly via IL-12, IL-23, and is also important for IL-6 and IL-10 signaling (Fig 3).26 Importantly, TYK2 has a central role in type 1 interferon signaling, part of the innate immune response blocking the spread of a virus from infected to uninfected cells. Of interest is the TYK2 variant represented by the SNP rs34536443 (P1104A) resulting in partial LOF and reduced risk for several autoimmune disorders such as RA, type 1 diabetes, psoriasis, and psoriatic arthritis without evidence of significant immunodeficiency.14,27– 30 However, humans with complete TYK2 LOF have clinically significant immunodeficiency with increased susceptibility to mycobacterial and viral infections.26,31
The balance between immune tolerance and ability to mount a robust early inflammatory response, and viral host defense may in part explain the mixed results of clinical trials using immunosuppressive therapy in COVID-19 to date25. Corticosteroid therapy may be beneficial in later stages of COVID-19 infection but detrimental if used early in the disease presumably when an early host defense response is important.32 Blocking the IL-6 pathway with tocilizumab did not reduce mortality among hospitalized COVID-19 patients, but may be beneficial among later-stage pulmonary damage33,34. The evidence to date suggests that the ability to mount an early inflammatory response is an important aspect in the control of COVID-19 infection and that dampening inflammation early in the disease course may be detrimental.
Limitations
We note several limitations. First, the PheWAS was designed as a broad screen to test for potentially clinically relevant associations between genes and phenotypes, with limited power to detect associations among uncommon conditions, and when further stratified by genetic ancestry. MVP is not a general US population-based cohort; however, results were replicated in UKBB, an independent general population cohort. Findings from this study suggest that variants associated with severe COVID-19 are also associated with reduced risk of having an autoimmune inflammatory condition. However, the results cannot provide information on the impact of actual SARS-CoV-2 infection in these individuals after diagnosis of an autoimmune disease.
Finally, the data strongly suggest but cannot confirm that the signal observed from RAVER1 arises from the previously identified and studied TYK2 variant.
Conclusions
This PheWAS of genetic variants reported to associate with severe COVID-19 demonstrated shared genetic architecture between COVID-19 severity and known underlying risk factors for both severe COVID-19 and poor COVID-19 outcomes, rather than susceptibility to other viral infections. Overall, the associations observed were generally consistent across genetic ancestries, with a potential stronger association with obesity among Veterans of African ancestry compared to European ancestry. The divergent association between severe COVID-19 and autoimmune inflammatory conditions sheds light on the concept of a balance between immune tolerance and immunodeficiency. This balance will be important when considering therapeutic targets for COVID-19 therapies where pathways may control both inflammation and the viral host response.
Data Availability
Full summary-level association data from the PheWAS from this study is made available as supplementary files.
Funding
This research is based on data from the Million Veteran Program, Office of Research and Development, Veterans Health Administration, and was supported by award MVP035. This publication does not represent the views of the Department of Veteran Affairs or the United States Government. R.M.C. is supported by NIH grants R01 AA026302 and P30 DK0503060. K.P.L. is supported by NIH P30 AR072577, and the Harold and Duval Bowen Fund.
Conflict of Interest
RMC has received research support from Intercept Pharmaceuticals, Inc and Merck & Co. MDR is on the scientific advisory board for Goldfinch Bio and Cipherome. CJO is an employee of Novartis Institute for Biomedical Research. PN reports grant support from Amgen, Apple, AstraZeneca, Boston Scientific, and Novartis, personal fees from Apple, AstraZeneca, Blackstone Life Sciences, Genentech, and Novartis, and spousal employment at Vertex, all unrelated to the present work.
Supplementary Figures and Tables
S1 Fig. LDproxy plot showing RAVER1 SNP rs74956615 has highest LD with missense variant rs34536443 (TYK2).
S2 Fig. The PheWAS results of rs34536133 in MVP.
S3 Fig. The PheWAS results of 48 SNPs from critical ill COVID GWAS by each ancestry a) European ancestry, b) African ancestry, c) Hispanic ancestry, and d) Asian ancestry.
S4 Fig. The PheWAS results of 39 SNPs from hospitalized COVID GWAS by each ancestry a) European ancestry, b) African ancestry, c) Hispanic ancestry, and d) Asian ancestry.
S1 Table. List of lead variants from critical ill and hospitalized COVID GWAS included in the study.
S2 Table. PheWAS results summary statistics by each ancestry and meta-analysis across four ancestry for 48 lead SNPs identified from critical ill COVID GWAS.
S3 Table. PheWAS results summary statistics by each ancestry and meta-analysis across four ancestry for 39 lead SNPs identified from hospitalized COVID GWAS.
Acknowledgements
We are grateful to our Veterans for their contributions to MVP. Full acknowledgements for the VA Million Veteran Program COVID-19 Science Initiative can be found in the supplementary methods. We would like to thank the Host Genetic Initiative for making their data publicly available (https://www.covid19hg.org/acknowledgements/). We would also like to acknowledge Dr. Madhuvika Murgan for her insightful suggestion on the Fig 3. We used genotype and phenotype data from the UK Biobank under project 32133.
Footnotes
Joint Authorship
↵* These authors jointly supervised this work