A Phenome-Wide Association Study of genes associated with COVID-19 severity reveals shared genetics with complex diseases in the Million Veteran Program

The study aims to determine the shared genetic architecture between COVID-19 severity with existing medical conditions using electronic health record (EHR) data. We conducted a Phenome-Wide Association Study (PheWAS) of genetic variants associated with critical illness (n=35) or hospitalization (n=42) due to severe COVID-19 using genome-wide association summary from the Host Genetics Initiative. PheWAS analysis was performed using genotype-phenotype data from the Veterans Affairs Million Veteran Program (MVP). Phenotypes were defined by International Classification of Diseases (ICD) codes mapped to clinically relevant groups using published PheWAS methods. Among 658,582 Veterans, variants associated with severe COVID-19 were tested for association across 1,559 phenotypes. Variants at the ABO locus (rs495828, rs505922) associated with the largest number of phenotypes (nrs495828=53 and nrs505922=59); strongest association with venous embolism, odds ratio (ORrs495828 1.33 (p=1.32 × 10−199), and thrombosis ORrs505922 1.33, p=2.2 × 10−265. Among 67 respiratory conditions tested, 11 had significant associations including MUC5B locus (rs35705950) with increased risk of idiopathic fibrosing alveolitis OR 2.83, p=4.12 × 10−191; CRHR1 (rs61667602) associated with reduced risk of pulmonary fibrosis, OR 0.84, p=2.26 × 10−12. The TYK2 locus (rs11085727) associated with reduced risk for autoimmune conditions, e.g., psoriasis OR 0.88, p=6.48 × 10−23, lupus OR 0.84, p=3.97 × 10−06. PheWAS stratified by genetic ancestry demonstrated differences in genotype-phenotype associations across ancestry. LMNA (rs581342) associated with neutropenia OR 1.29 p=4.1 × 10−13 among Veterans of African ancestry but not European. Overall, we observed a shared genetic architecture between COVID-19 severity and conditions related to underlying risk factors for severe and poor COVID-19 outcomes. Differing associations between genotype-phenotype across ancestries may inform heterogenous outcomes observed with COVID-19. Divergent associations between risk for severe COVID-19 with autoimmune inflammatory conditions both respiratory and non-respiratory highlights the shared pathways and fine balance of immune host response and autoimmunity and caution required when considering treatment targets.

Million Veterans Program (MVP) has generated genotypic data on over 650,000 participants 157 linked with electronic health record (EHR) data containing rich phenotypic data, enables large-158 scale PheWAS. Moreover, MVP has the highest racial and ethnic diversity of the major biobanks 159 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted October 15, 2021. ; https://doi.org/10.1101/2021.05.18.21257396 doi: medRxiv preprint worldwide affording an opportunity to compare whether associations are similar across genetic 160 ancestries. 6

161
The objective of this study was to use existing clinical EHR data to identify conditions 162 that share genetic variants with COVID-19 severity using the disease-agnostic PheWAS 163 approach. Since COVID-19 is a new condition, identifying existing conditions which share 164 genetic susceptibility may allow us to leverage existing knowledge from these known conditions 165 to provide context regarding important pathways for COVID-19 severity, as well as how 166 pathways may differ across subpopulations. 167 Methods 168

Data sources 169
The VA MVP is a national cohort launched in 2011 designed to study the contributions of 170 genetics, lifestyle, and military exposures to health and disease among US Veterans. 6 171 Blood biospecimens were collected for DNA isolation and genotyping, and the biorepository was 172 linked with the VA EHR, which includes diagnosis codes (International Classification of 173 Diseases ninth revision [ICD-9] and tenth revision [ICD-10]) for all Veterans followed in the 174 healthcare system up to September 2019. The single nucleotide polymorphism (SNP) data in the 175 MVP cohort was generated using a custom Thermo Fisher Axiom genotyping platform called 176 MVP 1.0. The quality control steps and genotyping imputation using 1000 Genomes 177 cosmopolitan reference panel on the MVP cohort has been described previously. 7 All individuals 178 in the study provided written informed consent as part of the MVP. This study was approved 179 through the Veterans Affairs central institutional review board as part of the MVP. 180 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

Genetic variant selection 181
An overview of the analytic workflow is outlined in Fig 1. Variants were derived from the 182 COVID-19 HGI GWAS meta-analysis release v6 3 . In this study, we analyzed the following HGI 183 GWAS summary statistics: 1) hospitalized and critically ill COVID-19 vs. population controls 184 denoted as "A2" in HGI, and referred to as "critical COVID" in this study, and 2) hospitalized 185 because of COVID-19 vs. population controls, denoted as "B2" in HGI, referred to as 186 "hospitalized COVID" in this study 3 . For each GWAS, variants with a Benjamini-Hochberg false 187 discovery rate (FDR) corrected p-value < 0.01 were selected as candidate lead SNPs (3,502 188 associated with critical COVID, and 4,336 associated with hospitalized COVID). Variants with 189 r 2 <0.1 were clustered within a 250 kb region according to 1000 Genomes phase 3 transethnic 190 reference panel 8 , resulting in 45 independent variants associated with critical COVID and 42 191 variants associated with hospitalized COVID summary statistics. The lead variants from each set 192 of GWAS summary statistics are available in eTable 1. 193 Outcomes 194 For both MVP, clinical data prior to the onset of the COVID-19 pandemic were used to reduce 195 potential confounding bias from SARS-CoV-2 infection on existing conditions. Phenotypes 196 were defined by phecodes from prior studies 5,9 . Each phecode represents ICD codes grouped into 197 clinically relevant phenotypes for clinical studies. For example, the phecode "deep venous 198 thrombosis" includes "venous embolism of deep vessels of the distal lower extremities," and 199 "deep venous thrombosis of the proximal lower extremity," both of which have distinct ICD 200 codes. Using this approach, all ICD codes for all Veterans in MVP were extracted and each 201 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted October 15, 2021. ; https://doi.org/10.1101/2021.05.18.21257396 doi: medRxiv preprint assigned a phenotype defined by a phecode. ICD-9 and ICD-10 codes were mapped to 1876 202 phecodes, as previously described. 5,9 203 For each phecode, participants with ≥ 2 phecode-mapped ICD-9 or ICD-10 codes were defined as 204 cases, whereas those with no instance of a phecode-mapped ICD-9 or ICD-10 code were defined 205 as controls. Based on our previous simulation studies of ICD EHR data, populations where the 206 phecode comprises < 200 cases were more likely to result in spurious results 10 , and we thus 207 applied this threshold in each ancestry group. In total, we analyzed 1,617 (EUR), 1304 (AFR), 208 993 (HIS), 294 (ASN) phecodes from the MVP cohort. 209 Phenome-wide association studies 210 The primary PheWAS analysis used SNPs identified from the HGI GWAS of critical and 211 hospitalized COVID, and tested association of these SNPs with phenotypes extracted from the 212 EHR using data prior to the COVID-19 pandemic. Logistic regression using PLINK2 to 213 examine the SNP association with phecodes and firth regression was applied when logistic 214 regression model failed to converge. Regression models were adjusted for sex, age (at 215 enrollment), age squared, and the first 20 principal components. Genetic ancestry was determined 216 using the HARE method for four major groups: African (AFR), Asian (ASN), Hispanic (HIS), 217 and European (EUR) ancestry 11 . Ancestry-specific PheWAS was first performed in these four 218 groups, and summary data were meta-analyzed using an inverse-variance weighted fixed-effects 219 model implemented in the PheWAS R package 9 . We assessed heterogeneity using I 2 and 220 excluded any results with excess heterogeneity (I 2 > 40%). 221 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) To address multiple testing, an association between SNP and phecode with FDR p < 0.01 was 222 considered significant. Thus, the threshold for significance was p < 6.07 × 10 -05 for critical 223 COVID lead variants, and p < 4.13 × 10 -05 for hospitalized COVID lead variants. In the main 224 manuscript we highlight PheWAS significant associations using FDR < 0.01 and an effect size 225 associated with increased or reduced risk for a condition by 10%, with complete PheWAS results 226 provided in S2 Table and S3 Table. 227

228
We studied 658,582 MVP participants, with mean age 68 years (SD), 90% male, with 30% 229 participants from non-European ancestry ( Table 1). The PheWAS was performed on 35 genetic 230 variants associated with critical COVID-19, and 42 genetic variants (S1 Table) associated with 231 hospitalized COVID, across 1,559 phenotypes. 232 From the trans-ethnic meta-analysis, we identified 151 phenotypes significantly associated with 233 critical COVID GWAS-identified variants, and 156 associations with hospitalized COVID 234 GWAS-identified lead variants (FDR, p<0.01). Among these lead variants with significant 235 PheWAS associations, 10 SNPs were associated with reduced risk of critical and hospitalized 236 COVID-19 in HGI. Six variants were common to both severe and hospitalized COVID and had 237 significant PheWAS associations, namely, variations nearest to the genes ABO (rs495828 and 238 rs505922), DPP9 (rs2277732), MUC5B (rs35705950), TYK2 (rs11085727), and CCHCR1 239 (rs9501257) (S2 Table and S3 Table). 240 241 242 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) However, the association of genetic variants with other respiratory conditions may represent 263 novel findings: the association of intronic variant rs61667602 in CRHR1 with reduced risk of 264 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.  Table, S3 Table).  (Table 2). 277

Ancestry specific PheWAS provide insights into disease risks across ancestries 278
The PheWAS analyses performed across four major genetic ancestry group in MVP observed 279 similar findings as the overall meta-analysis with few associations unique to each ancestry. (Fig  280   3, S8 Table). SNP rs581342 (LMNA), associated with severe COVID-19, was a highly prevalent 281 variant among subjects with AFR ancestry (MAF=0.53) and was associated with neutropenia 282 (OR AFR = 0.82 [0.76 -0.87], P AFR = 4.09 × 10 -13 ); this association was not observed in larger 283 population of EUR descent (S8 Table). Following up on this finding, we extracted data on 284 laboratory values and observed a strong association between LMNA with lower white blood cell 285 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted October 15, 2021. ; https://doi.org/10.1101/2021.05.18.21257396 doi: medRxiv preprint count (beta = -0.34 [-0.35, -0.32], P AFR = 1 x 10 -300 ) and lower median neutrophil fraction (beta = 286 -1.84 [-1.94, -1.75], P AFR = 1 x 10 -300 ) compared to those without this variant. This association in 287 laboratory values was again more significant with a stronger effect size among subjects with 288 AFR ancestry in comparison to EUR (P=0.005). Among AFR individuals, each allele was 289 associated with a 1.84% lower neutrophil fraction, where among EUR individuals, each allele 290 was associated with only a 0.04% reduction (S9 Table). 291 Similarly, associations between rs9268576 (HL-DRA) and thyrotoxicosis was only observed in 292 AFR ancestry participants. The EUR ancestry specific PheWAS identified 39 significant 293 associations which were not observed in other ancestry groups. One such association was 294 between MUC5B variant and phecode for "dependence on respirator [Ventilator] or 295 supplemental oxygen" (OR EUR = 1.16 [1.11 -1.12], P EUR = 1.72× 10 -10 ) among EUR ancestry 296 participants was not significant in other ancestry population (S8 Table). It is important to note 297 that the conditions with significant association among EUR participants had similar prevalence 298 among other ancestries. However, since there were overall fewer subjects in non-EUR ancestry 299 groups, this likely resulted in lower statistical power to detect associations. All ancestry specific 300 PheWAS results are available in supplementary tables (S4 Table, S5 Table, S6 Table, S7 Table). 301 302 Association with variation at sex chromosome 303 In the hospitalized COVID-19 GWAS, we identified rs4830964 as the only lead variant on 304 chromosome X. The SNP is located near ACE2 and was associated "non-healing surgical 305 wound" (OR = 0.92 [0.89 -0.96], P = 2.23× 10 -05 ). Notably, the SNP had nominal association 306 (p<0.05) with type 2 diabetes and diabetes related complications that are previously reported 307 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted October 15, 2021. ; https://doi.org/10.1101/2021.05.18.21257396 doi: medRxiv preprint association with variation in ACE2 (S3 Table). We did not observe any association with this 308 variant in the ancestry specific PheWAS analysis. 309

Discussion 310
In this large-scale PheWAS, we identified the shared genetic architecture between variants 311 associated with severe COVID-19 and other complex conditions using data from MVP, one of 312 the largest and most diverse biobanks in the world. Broadly, these risk alleles identified 313 conditions associated with risk factors for severe COVID-19 manifestations such as T2D, 314 ischemic heart disease across all ancestries. Notably, the strongest associations with the highest 315 effect size were related to coagulopathies, specifically, hypercoagulable state including deep 316 venous thrombosis and other thrombotic complications, also shared variants associated with 317 severe COVID-19. In contrast, among respiratory conditions, only idiopathic pulmonary fibrosis 318 and chronic alveolar lung disease shared genetic risk factors, with the notable absence of an 319 association with COPD, pulmonary hypertension, and other respiratory infections. When 320 comparing findings from the two largest ancestry groups in MVP, AFR and EUR, we observed 321 that a risk allele associated with severe COVID-19 that shares an association with neutropenia 322 on among Veterans of AFR ancestry. Finally, we observed that variants associated with severe 323 COVID-19 had an opposite association, or reduced odds with autoimmune inflammatory 324 conditions, such as psoriasis, psoriatic arthritis, RA, and inflammatory lung conditions. 325 A classic GWAS tests the association between millions of genetic variants with the presence or 326 absence of one phenotype, e.g., GWAS of deep venous thrombosis. In the COVID-19 HGI 327 GWAS, the "phenotype" was patients hospitalized for or critically ill from  Clinically, this population includes a mixture of patients with a complex list of medical 329 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted October 15, 2021. ; https://doi.org/10.1101/2021.05.18.21257396 doi: medRxiv preprint conditions at high risk for severe COVID complications and those who had actual complications 330 from COVID-19. Thus, we would anticipate that many of the significant phenotypes would be 331 associated with risk factors such as obesity and deep venous thrombosis. The clinical data used 332 in this study pre-dates the emergence of COVID-19 to reduce potential confounding bias that can 333 occur in a population infected with SARS-CoV-2, e.g., interaction between COVID-19 and type 334 2 diabetes. Additionally, our findings suggest that the PheWAS approach can be a useful tool to 335 identify clinical factors related to emerging infectious diseases regarding severity or 336 complications when genomic data are available. and is also important for IL-6 and IL-10 signaling (Fig 3). 28 TYK2 serves a central role in type 1 372 interferon signaling, part of the innate immune response blocking the spread of a virus from 373 infected to uninfected cells. Partial loss of TYK2 function is associated with reduced risk for 374 several autoimmune disorders such as RA and psoriatic disease, conditions treated with 375 immunosuppressive therapy. 13,29-32 Humans with complete TYK2 loss of function have clinically 376 significant immunodeficiency with increased susceptibility to mycobacterial and viral 377 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The PheWAS of genetic variants reported to associate with severe COVID-19 demonstrated 417 shared genetic architecture between COVID-19 severity and known underlying risk factors for 418 both severe COVID-19 and poor COVID-19 outcomes, rather than susceptibility to other viral 419 infections. Overall, the associations observed were generally consistent across genetic ancestries, 420 with the exception of a stronger association with neutropenia among Veterans of African 421 ancestry than European ancestry. Notably, only few respiratory conditions had a shared genetic 422 association with severe COVID-19. Among these, variants associated with a reduced risk for 423 severe COVID-19 had an opposite association, with reduced risk for inflammatory and fibrotic 424 pulmonary conditions. Similarly, other divergent associations were observed between severe 425 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted October 15, 2021. is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted October 15, 2021. . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.