A combined polygenic score of 21,293 rare and 22 common variants significantly improves diabetes diagnosis based on hemoglobin A1C levels

Polygenic scores (PS), constructed from the combined effects of many genetic variants, have been shown to predict risk or treatment strategies for certain common diseases. As most PS to date are based on common variants, the benefit of adding rare variation to PS remains largely unknown and methodically challenging. We developed and validated a novel method for constructing a rare variant PS and applied it to a previously identified clinical scenario, in which genetic variants modify the hemoglobin A1C (HbA1C) threshold recommended for type 2 diabetes (T2D) diagnosis. The resultant rare variant PS is highly polygenic (21,293 variants across 144 genes), depends on ultra-rare variants (72.7% of variants observed in <3 people), and identifies significantly more undiagnosed T2D cases than expected by chance (OR=2.71, p=1.51x10-6). A model combining the rare variant PS with a previously published common variant PS is expected to identify 4.9M misdiagnosed T2D cases in the USA, nearly 1.5-fold more than the common variant PS alone. These results provide a method for constructing complex phenotype PS from rare variants and suggest that rare variants will augment common variants in precision medicine approaches for common disease.


Main 45
HbA1C, which measures average blood glucose levels over 2-3 months 12 , is a 46 widely used T2D diagnostic biomarker. Meeting the HbA1C diagnostic threshold of 47 6.5% 13 -as opposed to a milder clinical diagnosis of pre-diabetes for individuals near 48 the threshold 14 -has both insurance and treatment implications, as the therapeutic 49 armamentarium is much larger once a T2D diagnosis is made 15,16 . An individual's 50 HbA1C is influenced by common genetic variants that affect both pathways central to 51 glycemic control 6,9,10 and pathways that influence erythrocytic properties such as cell 52 lifespan 17 . Erythrocytic variants do not contribute to risk of T2D and can in fact cause 53 diabetes misdiagnoses by altering the relationship between measured HbA1C and true 54 blood glucose levels 6,9,10,18,19 . Accounting for the effects of common erythrocytic variants 55 through a PS 6 identifies a substantial number of undiagnosed T2D cases in the 56 population -most notably, roughly 11% of African-Americans carry a G6PD variant 57 (rs1050828) 6,20 that shortens the mean erythrocyte lifespan and may cause ~0.65M 58 African-American individuals with T2D to be undiagnosed 6,10 . 59 60 The identification of common variants that affect HbA1C through erythrocytic 61 pathways advances the public health goal of ensuring that all people across all genetic 62 backgrounds are diagnosed accurately for diabetes. Most variants in a population, 63 however, are rare 21 and have unknown impacts on HbA1C or the T2D diagnostic 64 threshold. Furthermore, while rare variants have been used to predict 22 and diagnose 23 toward decreased HbA1C (16/23, binomial p=0.04; Figure 2b), consistent with 137 expectations from the gene set annotation and previous observations in humans 17 . 138 These results suggest that a substantial number of genes that do not reach exome-wide 139 significance in our analysis harbor rare variants that not only affect HbA1C levels but do 140 so independently of glycemia. 141

142
We next sought to augment the existing HbA1C common variant PS with this 143 broader collection of rare variants. Existing PS methods [1][2][3][4]35,36 , however, are designed 144 for common variants and do not extend easily to include many rare variants across 145 many genes. Challenges include (a) selecting genes to include in the PS; (b) assigning 146 weights to variants within the selected genes; and (c) assigning weights to rare variants 147 not seen in the discovery sample. Rare coding variant PS also, however, present new 148 opportunities: numerous annotations of gene function are publicly available to help 149 select genes [37][38][39][40][41] , and estimates of functional effects are easier to incorporate into 150 weights for coding variants than they are for non-coding variants 24,42,43 . We therefore 151 developed a novel framework for a rare coding variant HbA1C PS that (a) includes 152 genes based on their aggregate p-values and experimental annotations (e.g. from 153 knockout mice; Figure 3a) and (b) assigns variant weights based on aggregate effect 154 sizes for the masks that contain the variant (Methods; Figure 3b). We explored 12 155 potential models within this framework by varying the precise criteria used for gene 156 selection and weight assignment (Figure 3c).
Six (50%) of the tested PS models reclassified a significantly greater (OR>1,159 p≤4.1×10 -3 [0.05/12 models]) proportion of true cases than expected by chance ( Figure  160 3c). The most significant excess of reclassified true cases (OR=2.71, p=1.5×10 -6 ; 161 Figure 3c) was achieved by a model with a "loose" criterion for gene-level HbA1C 162 association and a "nested" method for variant weights (Methods; Figure 3a, 3b). The 163 "loose" criterion requires genes to achieve HbA1C rare variant p<0.05 and then, to filter 164 for evidence of acting through erythrocytic pathways, includes only genes that either variant absent from the discovery study (Methods). As a negative control, we replaced 175 the erythrocyte gene filters in the model with filters for genes involved in glucose 176 homeostasis (Methods) and found that the resultant model did not reclassify 177 significantly more true cases than expected by chance (OR=1.38,p=0.16). 178 Rare variant PS also complement common variant PS in other aspects. First, 228 while both types of PS benefit by including suggestive (p<0.05) but not exome-wide 229 significant associations, we show that incorporating gene annotations from model 230 organisms (e.g. knockout mice) increases rare variant PS accuracy. Other gene 231 annotations 47 , which are currently more readily available than noncoding variant 232 annotations 48 , could naturally be used to further improve the model. As noncoding 233 variant annotations do improve, we anticipate that extending rare variant PS to include 234 noncoding variants could likewise improve PS accuracy 49 . A second property of the rare 235 variant PS is that it does not require individual variant effect size estimates. It can thus 236 incorporate variants private to an individual, and our data provide evidence that its 237 variant weights can be held roughly constant across ancestries. While limited in a strict 238 sense to HbA1C adjustments for diabetes diagnosis, in a broader sense -given the 239 broad similarities in genetic architectures of many complex traits 50,51 -the current study 240 is an initial step towards rare variant PS for complex traits in general. Our results 241 suggest that rare variants will be a valuable addition to complex trait PS, albeit through 242 polygenic effects that look quite distinct from those that might have been expected from 243 early predictions of rare variants with a "high impact" on common disease 24,52 . 244 245 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 9, 2021. ; https://doi. org/10.1101org/10. /2021 The AMP-T2D-GENES T2D case definition was based on previously described 291 diagnostic thresholds 28 . Individuals were categorized as a T2D case if: (a) a 2-hour 292 glucose measure was ≥ 11.1 mmol/L or (b) any two of the following suggestive 293 measures were met: 294 (1) fasting glucose levels ≥ 7 mmol/L 295 (2) HbA1C levels ≥ 6.5% 296 (3) there is a record of antihyperglycemic medication use. 297 298 When conducting our reclassification analysis for the HbA1C PS, we considered 299 as true cases only those individuals who had criteria (1) or (3) satisfied -that is, we 300 ignored the HbA1C based criterion. All individuals not categorized as T2D cases were 301 considered controls. 302 303 AMP-T2D-GENES exome sequence genotype processing 304 Genotyping and quality control (QC) for AMP-T2D-GENES samples were 305 performed as previously described 28 . Briefly, we obtained BAM files (containing 306 unaligned sequence reads) from each contributing consortium (see 'AMP-T2D-GENES 307 cohort and phenotype information' methods) and processed them in a uniform 308 fashion using the Picard (broadinstitute.github.io/picard/), BWA 57 , and GATK 58 software 309 packages. This pipeline produced single nucleotide and indel variant calls within 50bp of 310 any region targeted for capture. For each sample at each variant site, we calculated 311 genotype dosages and hard genotype calls based on a genotype quality (GQ) threshold 312 of ≥ 20. We called X chromosome dosages treating males as haploid. 313 'Gene-level grouping' methods). We ran burden tests on a set of unrelated samples 383 (all pairs IBD <0.25) with 10 transethnic PCs, sample subgroup (see 'AMP-T2D-GENES 384 single variant association analysis' methods), and sequencing technology as 385 covariates. For reasons previously described 28 , we performed the burden testing as a 386 single "mega-analysis" across all samples and not a meta-analysis of sample subgroups 387 (as done for single variant association analysis). Nonetheless, we still performed variant 388 filtering at the sample subgroup level by setting genotypes for an entire subgroup as 389 missing during association testing. Unlike single-variant association analysis, for which 390 we collapsed alternate alleles for multi-allelic variants, for gene-level testing we included 391 in each mask only the allele(s) that passed the filters defined for the mask. For genes 392 on the X chromosome, we analyzed males and females separately before combining 393 results via a fixed-effect inverse-variance weighted meta-analysis (implemented via 394 METAL 60 ). All genes with p≤2.5×10 -6 associations (i.e. threshold not adjusted for 395 multiple hypothesis testing) for any phenotype and mask are listed in Supplementary 396 To account for the effects of population substructure on these metrics, we 484 adjusted each metric according to the top 3 genetic PCs. To identify outlier samples, we 485 clustered each sample metric into Gaussian distributed subsets using the Klustakwik 486 software package (http://klustakwik.sourceforge.net/); we then removed outlier samples 487 that did not fit into one of the Gaussian distributed subsets. For the metrics n_called and 488 call_rate, we removed only outlier samples below the mean of the distribution. 489

490
In addition to excluding samples who were outliers according to individual sample 491 metrics, we also excluded samples that were outliers according to combinations of 492 sample metrics. To identify these outliers, we first calculated PCs across the metrics 493 that explained 95% of their variation (we did not include the 'call_rate' and 'n_called' 494 metrics in this analysis as they were characterized by long distributional tails 495 inconsistent with the other metrics). We then scanned for samples that exhibited 496 deviation from any of the metric PCs: as we did for the individual metrics, we used 497 Klustakwik (http://klustakwik.sourceforge.net/) to identify samples that lay outside of the 498 distribution of one of more metric PCs. In total, we removed 76 samples by this analysis 499 (Supplementary Table 16). 500

501
We also assessed potential discordance between each sample's reported sex 502 and its genetically determined sex using the HAIL software package (hail.is). We did not 503 remove any additional samples (N=0) due to discordance between reported and genetic 504 sex, although we did remove N=16 samples whose imputation for sex failed (i.e. flagged 505 as a 'PROBLEM' by Hail)(Supplementary Table 16). 506

UK Biobank variant-level quality control 508
We assessed variant quality using call rate and HWE calculated within EUR 509 samples. We excluded variants with HWE p<1x10 -6 or call rate <0.3 from further 510 analysis. In total, 62,420 variants were flagged for removal resulting in a total of 8.89M 511 variants remaining for downstream analysis. 512

UK Biobank kinship analysis 514
To produce a set of variants for inferring genetic relationships among samples, 515 we first used the Plink software package 68 to remove non-autosomal (--chr flag), non-516 A/C/G/T (--snps-only flag), low call rate (CR (http://genome.sph.umich.edu/wiki/Regions_of_high_linkage_disequilibrium_(LD)). We 520 then LD-pruned the remaining variants with Plink to produce a set of 57,297 variants. 521

522
We used the KING relationship inference software package 69 to determine 523 sample pair kinship coefficients. We considered samples with kinship coefficients >0.4 524 as potential duplicates. If the clinical data for the duplicate pair were identical, then we 525 removed the sample with the lower genotype call rate from analysis. If the clinical data 526 were not identical, we removed both samples. In addition to removing duplicate 527 samples, we also checked for any individuals that exhibited kinship values indicating a 528 2nd degree relative or higher relationship with 10 or more others and found none to be 529 removed (N=0). 530

UK Biobank single variant association analysis 532
Single variant association analysis in the UKB was performed using the HAIL 533 linear regression function (i.e. the "hail.methods.linear_regression_rows()" function) to 534 assess association between each variant and each inverse-normal transformed 535 phenotype. The single variant regression model included covariates for the first nine 536 PCs of ancestry but not age and sex (as these were included when transforming the 537 phenotypes). The analysis included only variants with termed "variant masks." The seven masks were ordered (with 1 being most strict) by the 566 likelihood of the deleteriousness of variants within them: 567 (1) "LoFTee" consists of variants predicted to be loss of function with high 568 (6) "1/5 1%" includes variants in the previous mask (5/5 LoFTee LC 1%) and 588

MAF<1% variants with either Polyphen2_HDIV_pred=D or 589
Polyphen2_HVAR_pred=D or SIFT_pred=D or LRT_pred=D or 590 MutationTaster_pred=D or A 591 (7) "0/5 1%" includes variants in the previous mask (1/5 1%) and all missense 592 variants with MAF < 1%. 593 594 analysis); mask-level meta-analysis results are presented in Supplementary Table 19 For the gene-level meta-analysis, we considered genes with p≤1.0×10 -7 619 (p≤2.5×10 -6 with Bonferroni correction for 24 phenotypes) to be exome-wide significant. 620 All gene-level associations that reach this threshold are listed in Supplementary proportion of variance than a previously identified nearby common variant association 655 (rs837763) 6 , we evaluated whether LD between the common variant and rare variants 656 included in the gene-level analysis impacted the estimated variance explained. To do 657 so, we re-conducted the gene-level analysis for the mask that produced the smallest p-658 value (5/5 mask; see 'Gene-level grouping' methods) conditional on the common 659 variant rs837763. For AMP-T2D-GENES, we included only individuals with SNP array 660 data (N=16,701) in this analysis. We then evaluated, within each ancestry, the 661 proportion of variance explained by PIEZO1 with and without inclusion of rs837763 as a 662 covariate. Results suggested that including rs837763 as a covariate had little impact on 663 the estimated proportion of variance explained by the PIEZO1 rare variant gene-level 664 association (Supplementary Table 6). 665 666 Defining GWAS loci for rare variant association analysis 667 To assess whether variants or genes identified during rare variant association 668 analysis were located within known GWAS loci, we calculated the genomic distance 669 between a variant (or gene) and the nearest genome-wide significant association 670 (p≤5×10 -8 ) for the phenotype of interest. We extracted genome-wide significant 671 associations for each phenotype from the Type 2 Diabetes Knowledge Portal 77 . We 672 considered variants or genes within 125kb of a genome-wide significant association to 673 be within a known GWAS locus. 674

Directional concordance of genes within enriched gene sets 727
To test whether the direction of effect on HbA1C was consistent across gene-728 level associations within a gene set, we first filtered to genes with HbA1C association 729 (p≤0.05). We then assigned a direction of effect to each gene according to the sign of 730 the effect size from the most significant (i.e. smallest p-value) mask. 731

732
We then tested for directional concordance among these filtered genes by 733 evaluating, via a binomial test, whether there were more negative effect sizes than the 734 50% expected by chance. We conducted this analysis for significantly enriched 735 erythrocytic and glycemic gene sets. We also evaluated concordance both (a) without 736 removing overlapping genes found in multiple gene sets and (b) after removing 737 overlapping genes found in multiple gene sets. In both cases, we considered p≤0.05 as 738 significant. 739 740

Polygenic adjustment scores for HbA1C 741
We calculated polygenic scores (PS) similarly for both common and rare 742 variants. For the common variant PS, we used a set of variants and effect sizes from a 743 previous GWAS publication 6 . The PS was calculated as 744 745 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 9, 2021.  . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 9, 2021. ; https://doi.org/10.1101/2021.11.04.21265868 doi: medRxiv preprint PS: we used individual-level data from the AMP-T2D-GENES study to assess how 922 many T2D cases are "reclassified" to have an adjusted HbA1C >6.5%, and whether the 923 ratio of reclassified cases to controls is greater than the expectation from a set of 924 individuals matched on HbA1C, cohort, and ancestry (see 'Assessing accuracy of the 925 rare variant HbA1C polygenic score' methods). 926 927

Polygenic score sensitivity analyses 928
To evaluate the extent to which the rare variant PS and common variant PS 929 depend on individual genes and/or variants, we used a leave-one-out approach. Survey (NHANES) data to estimate, based on the results from the AMP-T2D-GENES 944 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 9, 2021. ; https://doi.org/10.1101/2021.11.04.21265868 doi: medRxiv preprint  . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 9, 2021. ; https://doi.org/10.1101/2021.11.04.21265868 doi: medRxiv preprint who had an unadjusted HbA1C value below 6.5% but whose T2D status was 1194 "reclassified" based on an adjusted HbA1C value above 6.5%. We then calculated the 1195 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 9, 2021. metrics for each sample in the UKB (see 'UK Biobank sample-level quality control' 1262 methods) and excluded samples from analysis that were outliers according to one or 1263 more metrics. Shown are the number of samples removed by each metric, as well as 1264 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 9, 2021. is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 9, 2021. is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 9, 2021. is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 9, 2021. is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 9, 2021. is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 9, 2021. is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 9, 2021.  (2017). 1499 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 9, 2021. is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 9, 2021. ; https://doi.org/10.1101/2021.11.04.21265868 doi: medRxiv preprint 1532 1533 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 9, 2021. ; https://doi.org/10.1101/2021.11.04.21265868 doi: medRxiv preprint    Fisher's Odds Ratio − Erythocytic PRS V ar ia n t 1 V ar ia n t 2 V ar ia n t 3 V ar ia n t 4 V ar ia n t 5 V ar ia n t 6 V ar ia n t 7