## ABSTRACT

The estimate of an individual’s genetic susceptibility to a disease can provide critical information when setting screening schedules, prescribing medication and making lifestyle change recommendations. The polygenic risk score is the predominant susceptibility metric, with many methods available to describe its construction. However, these methods have never been comprehensively compared or the predictive value of their outputs systematically assessed, leaving the clinical utility of polygenic risk scores uncertain. This study aims to resolve this uncertainty by deeply comparing the maximum possible, currently available, 15 polygenic risk scoring methods to 25 well-powered, UK Biobank derived, disease phenotypes. Our results show that simpler methods, which employ heuristics, bested complex, methods, which predominately model linkage disequilibrium. Accuracy was assessed with AUC improvement, the difference in area under the receiver operating curve generated by two logistic regression models, both of which share the covariates of age, sex, and principal components, while the second model also contains the polygenic risk score. To better determine the maximal utility of polygenic risk scores, straightforward score ensembles, which bested all methods across all traits in the training data-set, were evaluated in the withheld data-set. The score ensembles revealed that the accuracy gained by considering a polygenic risk score varied greatly, with AUC improvement greater than 0.05 for 9 traits. Many additional analyses revealed widespread pleiotropy across scores, large variations between assessment statistics, peculiar patterns amongst phenotype definitions, and wide ranges in the optimal number of variants used for scoring. If these many variable aspects of score creation can be well controlled and documented, simple methods can easily generate polygenic risk score that well predict an individual’s future liability of certain diseases.

## Introduction

Public health may be largely improved by accurately stratifying a population’s disease risk, as the proper intensity of care can then be distributed accordingly^{1}. Many clinical models incorporate age, sex and pre-existing conditions into predictions of such risk, generating useful preventative recommendations^{2}. Genetic information is nearly universally excluded from these models, despite evidence that it may significantly increase prediction accuracy^{3}. The few situations where genetics is considered involve specific gene mutations. For example, BRCA mutations are considered in relation to breast cancer as they significantly increase risk, and therefore those affected start mammogram screening earlier in life^{4}. However, the heritable component for the vast majority of traits is not captured by a few mutations, but rather is determined by variants scattered across the entire genome^{5}. The polygenic risk score condenses this diffuse information into one metric, by summing over multiple genetic locations the product between a genetic variant’s externally determined effect size and the number of alleles at a variant. Polygenic risk scores have been proven to be as accurate as clinically relevant monogenic mutations^{6}. Beyond direct risk prediction, polygenic risk scores have identified sub-populations that respond well to statins^{7} and lifestyle changes^{8}, revealed underlying cross-disease correlations of schizophrenia^{9} and breast cancer^{10}, and exposed hidden structures within electrocardiograms^{11} and magnetic resonance imaging^{12}. Despite the deep and diverse upside of polygenic risk scores, various reporting inconsistencies may be preventing their maximal utility from currently being well appreciated.

The expansive array of cohorts, traits, covariates and metrics polygenic risk score may be applied within leads to a difficult assessment of their true worth. For example, one of the first polygenic risk scores was developed on a schizophrenia cohort with 6,090 individuals, generating an R^{2} of 0.032^{13}. A later analysis of coronary artery disease (CAD) in the 409,258 person UK Biobank, which included covariates of age and sex, led to an AUC value of 0.81^{6}. From these two publications it is not immediately clear whether the CAD polygenic risk score is actually superior to the schizophrenia score, or if instead its performance reflects the large sample size or unrelated covariates. This pattern of difficult comparisons extends to numerous other scores, leaving each of their maximal, true utility unclear. In addition to the confusion produced from inconsistent reporting, various methods exist which select the component genetic variants within the polygenic risk score. The choice of which method to employ is a critical but currently difficult choice to make when developing an optimal, genetically-informed clinical model.

The selection of genetic variants which best predict a phenotype is a fundamental problem in the field of genetics. In non-human studies this problem is often resolved with mixed-model type approaches applied directly to genetic information^{14}. In human studies, privacy concerns have motivated the safeguarding of genetic information. Therefore, only a small group of investigators with access to the full genetic information perform a genome wide association study. The resulting univariate associations are published in the form of summary statistics, which describe each variant’s effect size and significance^{15}. A naive method for creating a polygenic risk score would include every variant tested within a GWAS, denoted as the basic method in this study. However, a large proportion of these variants likely reflect, via linkage disequilibrium, a sparse true causal signal^{16}. Without any actual relation to a causal mechanism, these false positive variants will increase the polygenic risk score without a proportional increase in disease risk. Alternatively, only genome-wide significant variants can be included. While vastly increasing the chance of only incorporating true positive signals, assessments have shown this approach is less accurate than using more liberal p-value thresholds^{17}, that is, including variants above classical statistical relevance thresholds. Further gradations between these two extremes are possible by modifying the effect sizes of variants, or including all variants within a certain statistical significance. To find the optimal set of variants and their effect sizes, multiple methods have been introduced that span from simple heuristics to complex models (Table 1).

The first variety of polygenic risk score methods developed were simple and heuristic-based. The clumping^{18} method selects variants below a p-value limit and within a linkage disequilibrium range. The variations on this approach, WC-2d^{19} and stackCT^{20}, use multiple limits simultaneously. Alternatively an empirical likelihood can be calculated and used to select variants, an approach utilized by WC-Likelihood^{19}, WC-Lasso^{19}, and Tweedie^{21} methods. An entirely different branch of complex methods selects variants by attempting to approximate the results of a mixed model approach. LDpred^{22} was the first such method, utilizing a MCMC algorithm for posterior effect estimation. It inspired the related methods LDpred-funct^{23}, PRScs^{24} and AnnoPred^{25}. The sBLUP^{26} and SBayesR^{27} specified the methodology even further by recreating a multiple regression framework. Lastly, the GraBLD^{28} and lassosum^{29} methods both utilized generally sophisticated algorithms, although did not attempt the same approximation as the other complex methods. Most of these methods perform a small comparison of methods within their respective publication, each concluding their own process is the most accurate. However, bias towards reporting positive results, along with unique computational environments that are difficult to replicate, may partially impact these claims.

A few studies have attempted to compare methods or assess wide-scale polygenic risk score accuracy, but all have been limited in the scope of their analyses. The majority of analyses focus on correlations between scores^{30,31} rather than predictive performance. Studies which do assess accuracy either utilize only a limited number of variants^{32}, compare just a couple of methods^{6} (often LDpred and clumping), or analyze a single trait^{7,8}. Additionally, the majority of polygenic risk score studies do not consider or provide supplementary analyses that enable easy comparisons among results within the current literature. For example, a disagreement on whether self-reported traits should be counted toward the total number of cases creates two types of analyses that cannot be reconciled^{33}. Without a comprehensive polygenic risk score study these disparate analyses will likely continue, minimizing the contribution scores can make in improving clinical care.

To remedy the current lack of knowledge in the field this study unbiasedly ranks the performance of polygenic risk score methods and reports the best predictive accuracy for a maximal number of well-powered traits. A depth of context is further provided to cover all assessment metrics, unaccounted covariates, possible correlations and other potential caveats that facilitate reproducible or innovative analyses. Future polygenic risk score practitioners may reference these results to simply determine which method is best and if it can produce the accuracy needed in their work. In certain situations the interpretation of these results may warrant the consideration or even inclusion of polygenic risk scores into clinical decision making.

## Results

### Comparison of Methods

Polygenic risk scores were generated from the maximum available 25 sets of well-powered GWAS summary statistics and 15 scoring methods, applied to UK Biobank data^{34}. The comparison of all polygenic risk scores revealed that simpler, heuristic-based methods, generated the most accurate polygenic risk scores, as they produced the highest area under the curve (AUC) improvement in 21 out 25 traits (Fig. 1). AUC improvement is defined as the difference in area under the receiver operating curve generated by two logistic regression models. Both models share the covariates of age, sex, and principal components, while the second model also contains the polygenic risk score. The methods which produced the highest AUC improvement in the largest number of traits was a tie between clumping and WC-2d, each chosen as the best method in 10 traits (Fig. 1b). However, the WC-2d method produced the highest average AUC improvement (mean = 0.046, 95% CI 0.029-0.063) when looking across best parameter performance for each trait (Fig. 1).

Various other statistics were also used to assess predictive performance, including: a p-Value extracted from the polygenic risk score term in the full logistic regression model; Nagelgerke’s R^{2} calculated from a logistic regression model with the polygenic risk score as the independent variable; and odds ratios calculated by setting the exposed group as individuals with a polygenic risk score above a specified threshold, non-exposed group as individuals with a polygenic risk score below the 50^{th} percentile and cases/controls as determined by the phenotype definition (Supp. Table 1-4). When examining the best respective score for each trait (Supp. Fig. 1-8), the various statistics used to assess disease prediction showed a wide range of relationships (Supp. Fig. 9-11). When examining the average score for each trait the AUC improvement was well correlated to both odds ratios (rho = 0.82, p < 0.01), and Nagelgerke’s R^{2} (rho = 0.80, p < 0.01), whereas correlation of AUC improvement to p-value was insignificant (rho = 0.26, p = 0.20), (Supp. Fig. 10). Alternatively, when examining the average score for each method, the various types of statistics all showed a similar relationship - uniform high correlations between all pairwise combinations of statistics (p < 0.01 for all metrics) (Supp. Fig. 11).

For each trait an ensemble score, created by taking an average of the top 5 polygenic risk scores weighted by their coefficients in a cross-validated logistic regression, generated a higher AUC improvement than any single method. On average the AUC improvement was 22% (95% CI 10%-35%), larger than that of the next best method (Fig. 1c, Supp. Fig. 12-14). However, when this ensemble technique was expanded to include far more than five scores, as implemented in the stackCT method, overfitting was observed. Therefore, the stackCT method was removed from all further analyses (Supp. Fig. 15).

### Evaluation of Polygenic Risk Score Accuracy

For each trait, the method that corresponds to the best polygenic risk score, defined by producing the largest AUC improvement within the training data-set, was used to make predictions within the witheld data-set. Subsequent statistics revealed substantial predictive ability in several traits, and widespread variability in analyses of the consistency across statistics, phenotyping method, and number of variants in the component score. The AUC improvement produced among all 25 traits ranged from 0.002 to 0.15 (quantiles: 0.017, 0.029, 0.090) (Table 2). The traits at the top of this range often had poor total AUC, generated from a model including age and sex covariates, as the overall correlation between total and improved AUC is 0.077 (p = 0.72). An example of this trend is coronary artery disease, which can be well predicted through age and sex, generating a covariate model AUC of 0.758, leaving the polygenic risk score only able to improve the AUC by 0.017. On the other hand, Crohn’s Disease is less affected by age and sex, thereby generating a covariate AUC of 0.505, leaving the polygenic risk score to improve the AUC by 0.152. Out of 25 traits, genetic information improved AUC values by 0.05 or more in 9 traits.

Other statistics measuring the accuracy of polygenic risk scores demonstrated a similar variability among traits (Supp. Figs. 16-25). For example, when calculating odds ratios with a cut-off at the 1^{st} percentile, the median value across all traits was 4.40, with three traits having an odds ratio above 10 (Supp. Fig. 18). The mean improvement in Nagelgerke’s R^{2} value was 0.055 (quantiles: 0.23, 0.044, 0.061) (Supp. Figs. 19, 20). The highest polygenic risk score p-value, generated from the logistic regression including age and sex covariates, was less than 5×10^{−8}, indicating that all polygenic risk scores were significantly associated with their respective trait, even if they did not form a (Supp. Fig. 22). While these values were generated in a withheld testing group to prevent overfitting, a comparison between testing and training performance showed a slight drop in the testing group (linear slope = 0.64)(Supp. Fig. 26). This relative decrease in accuracy could be caused by slight overfitting in the training set, as 81 scores were compared to each trait.

### Context of Prediction Analyses

Correlation analyses of the 25 polygenic risk scores evaluated within the withheld data-set indicated widespread pleiotropy. The first analysis correlated the 25 scores to each other in a pairwise fashion. There were potentially expected correlations between Crohn’s Disease and ulcerative colitis, psoriaris and vitiligo, and amongst many autoimmune traits (Fig. 2a). Secondly, pair-wise correlations between the 25 previously analyzed scores to 23 other polygenic risk scores, including height, weight and blood protein levels, were calculated. Correlation p-values less than the Bonferroni Correction threshold were generated in 33% of pairs, and 23 of the 25 previously evaluated scores (Supp. Fig. 27). Thirdly, the 25 previously analyzed scores were used to predict in a pair-wise fashion 206 other traits (Supp. Fig. 28-31). A total of 141 predictions generated significant associations at the Bonferroni level (Supp. Fig. 32), according to the p-value of the score term in a logistic regression model which includes age and sex covariates. Most of these significant associations capture related disease etiology, such as the coronary artery disease polygenic risk score predicting angina, myocardial infarction and hypertension, all with a p-value less than 10^{×93}. Other associations have literature evidence, but their underlying mechanisms are less well understood, including how breast cancer is associated to meningitis^{35} and schizophrenia is associated to irritable bowel syndrome^{36}. The majority of comparisons however show insignificant or otherwise currently unexplainable relationships.

Beyond the assessment of trait prediction accuracy, additional meta-analyses revealed further variation and complexities within a typical polygenic risk score analysis. The first meta-analysis examined the sets of variants that generated the 25 polygenic risk scores evaluated within the withheld data-set (Supp. Fig. 33-37). The number of variants in each set ranged widely from 9 to 1,095,393 (mean 161,353; sd 322,134)(Table 2, Supp. Fig. 33). The predominant pattern among variants is to have many with low effect size and a few with high effect size (Supp. Fig. 34). The cumulative effect of the variants in the bottom three quartiles is larger than the cumulative effect of the those in the fourth quartile in all traits (Supp. Fig. 37). The second meta-analysis investigated associations between polygenic risk score accuracy and meta-statistics, including the number of variants and sample size in the GWAS. In these analyses, the only meta-statistic that consistently saw an association to the withheld data-set AUC improvement was the trait’s heritability, producing a correlation of 0.53 (p = 0.007) (Fig. 2c, Supp. Fig. 38). A similar trend emerged when analyzing each method separately (Supp. Fig. 39). The third meta-analysis examined polygenic risk score performance across 4 different phenotype definitions (Supp. Table. 1-4, Supp. Fig. 40-46). In terms of the number of cases determined, the Dual-agreement and ICD methods, and the Self-report and Medication methods were similar (Supp. Fig. 40, 41). The definition used did not change the overall performance when comparing the AUC improvement for each trait in the withheld data-set (p = 0.95) (Fig. 2b). However, several individual traits did see large changes in prediction accuracy, including vitiligo and eczema whose AUC improvement values across the phenotyping methods ranged by more than This clear lack of relationship between sample selectivity and accuracy highlights the difficulty in drawing simplistic conclusions from these meta-analyses.

## Discussion

Polygenic risk scores are easy to produce, simple to understand, and in the literature they are currently the most prominent metric for reporting disease risk attributable to genetics^{37}. Unfortunately, their utility is often either convoluted by additional covariates^{6}, overstated due to overfitting^{38,39}, poorly constructed due to simplistic selection of underlying SNPs^{40,41}, or used in sensitive social issues^{42,43}. This work aims to resolve these issues by providing a comprehensive yet straightforward analysis of polygenic risk score methods on a multitude of traits while providing extensive supplementary analyses. The associated findings lead to the following conclusions.

Out of the current polygenic risk scoring methods available, the simpler methods are superior. The low performance of sophisticated methods may not be due to incorrect genetic theory but rather caused by poor documentation, leading to improper implementation, and inferior results. One of the individual methods that was selected most often as the best polygenic risk scores is clumping. The combination of high accuracy and ease in application makes clumping a preferred first-line method. Only when maximal accuracy is absolutely demanded might complex methods warrant the effort of implementation. Alternatively, high accuracies can be obtained by creating an ensemble from the outputs of multiple simple methods. In the ensemble construction not only are labeled samples required, but also care must be taken to limit overfitting. In our work overfitting was limited by only including the best five individual scores in an ensemble. However, there was likely still some minor overfitting due to the large number of scores compared in the training data-set.

Polygenic risk score performance in the withheld data-set indicated that several could be considered clinically relevant, and many others could serve as valuable measurements in related analyses. However, this utility should be framed not only by many possible confounders^{44}, but also by the large variability found across all supplementary analyses. The sets of SNPs which produced the highest accuracy in the training set ranged in size from single digits to over a million. The withheld data-set AUC improvement ranged from values just above zero, indicating genetics provide negligible aid, to above 0.1, indicating genetics could have a beneficial place within the clinic^{45}. The multiple statistics used to evaluate polygenic risk performance also often ranged widely in ranking the genetic predictability of a trait. Interestingly, this variability was also proven to be unrelated to many meta-statistics, including sample size and number of components SNPs. This consistency of variability in polygenic risk score analysis demonstrates that widespread assessments or declarations on polygenic risk scores require context on the trait, assessment statistic, SNPs analyzed and underlying GWAS utilized. Polygenic risk scores should not be assumed to be infallible representations of a trait’s genetic components, but should also not be assumed to be inconsequential metrics.

When analyzing polygenic risk scores either through correlations to other scores or predicting additional traits, significant associations are often uncovered. These associations are usually within logical contexts, such as atrial fibrillation and coronary artery disease. Although, there are many other less obvious associations that are also significant, indicating widespread pleiotropy throughout the genome. This pleiotropy may be utilized to improve predictions by incorporating the genetic risk of a secondary trait into an analysis of the primary, related disease.

These conclusions are limited in their certainty because only a snapshot of a highly uniform prospective cohort was analyzed. The snapshot nature of the data could have been theoretically better modeled using a survival analysis, but a logistic regression model was chosen to follow precedent^{6}. The uniformity is caused by the selection of only white, British individuals in this study. While narrowing the population follows literature precedence, and limits possible confounders, this practice creates many additional problems. One recently highlighted problem is participation bias^{46}, which may have caused starkly different results in certain traits, such as Schizophrenia, as compared to other studies. An additional large problem is that the portability of these results into other populations is unknown. On a wider scale, this problem will exacerbate health disparities^{47}. Follow-up work should be performed to both test the polygenic risk score portability into multi-ethnic cohorts and to develop methods that can maximize the multi-ethnic utility of a single ancestry derived polygenic risk score.

This study produced polygenic risk scores across 25 traits and 15 methods. Simple polygenic risk score methods were proven superior, as they were able to provide a sizeable accuracy boost for several traits when also considering basic covariates. These specific traits may be well suited for the consideration of genetics in the clinic - not as a solitary test, but rather as a supplementary measure to tailor screening schedules, prescription guidelines or lifestyle change recommendations. These changes may ultimately improve the health of an entire population.

## Methods

### Creation of the polygenic risk scores

All polygenic risk scores were derived from two sources of data: summary statistics provided in the GWAS Catalog^{48} (Table 2), and genetic/phenotypic information within the UK Biobank^{34}. The summary statistics, which were derived in genome wide association studies that did not contain the UK Biobank, were formatted to remove multi-allelic, multi-nucleotide, ambiguous, low minor allele frequency (< 0.01) or low imputation quality (INFO < 0.3) variants. The variants were further munged following Linkage Disequilibrium Score Regression (LDSC) methods^{45}. The allele symbols were flipped when necessary to match the UK Biobank. Any reported odds ratios were logged to create a consistent effect size across traits.

Only individuals whose quality control reported white-British ancestry, not heterozygosity outliers, not holding sex chromosome aneuploidy, and matching reported and inferred sex were included. These individuals were split, based on precedence^{6}, into a training (N = 129,568) and testing (N = 278,685), following the phase 1 (batches UK BiLEVE 1-11 and batch 1-22) and phase 2 (batches 23-95) of release.

Once all summary statistics and genetic data were prepared, methods were applied to the training group following documentation outlined in the respective publication, code repository, or other online guide (Table 1). Some methods required external reference data, subset of the UK Biobank data, or computational output from external methods. Each method produced adjusted summary statistics, which were then used to generate the polygenic risk scores for both the training and testing groups. Adjustments for population structure, or additional covariates, were not made directly to the polygenic risk score but rather were included explicitly in the downstream analysis models. Traits in which only a portion of the population are most commonly susceptible were subset, namely breast and ovarian cancer were scored on females only and prostate cancer was scored on males only. Details on how each method was implemented can be found in the supplementary appendix II and Figure 3.

### Comparison of Polygenic Risk Score Methods

Once all polygenic risk scores were generated, the best score for each trait was determined via a cross validation framework. In total, 81 scores were created for each trait, as multiple scores for each method were required to assess an array of possible parameters. Threefold cross validation produced logistic regression models within two thirds of the training group, which were then applied to the remaining one third of the training group for assessment. The logistic regression models were either termed covariate (including age, sex, first four principal components of the genetic variant matrix), score (including the polygenic risk score), or complete (including age, sex, first four principal components of the genetic variant matrix and polygenic risk score). The independent variable for all regressions was the phenotype, determined through ICD codes, OPCS codes and self-reported disease status (Supp. Tables 1-4). The primary statistic used to compare each score was AUC improvement, the difference between the AUCs generated from the complete and covariate models. Additional statistics such as odds ratio, p-value and Nagelkere’s R^{2} were computed. Odds ratios were calculated by considering individuals whose polygenic risk score was above the denoted cut off percentile to be in the exposure group and individuals with a polygenic risk score below the 50th percentile to be in the non-exposure group. p-values were calculated from the polygenic risk score term in the complete regression model. Nagelkerke’s R^{2} values were assessed upon either the score, covariate or complete logistic regression model. The rankings of the top method-parameter combinations according to these various statistics were compared. Finally, an ensemble score was created by assembling the best five polygenic risk scores and conducting threefold cross validation with each fold generating a logistic regression model. An average of each score’s three logistic regression coefficients, weighted by the overall model’s performance, became the final coefficients to define the simple linear combination ensemble. The motivation behind creating this ensemble score was to determine if a simple manipulation of scores, easily implementable with any group of scores at hand, could improve prediction performance. Further, principal component analysis, statistic comparisons, phenotype comparison, SNP set analysis, ensemble weight analysis and method to meta-data associations were conducted to provide greater insight and verification of the comparison results. The principal components analysis is particularly noteworthy as the clusters which formed helped define the simple and complex methods (Supp. Fig. 47).

### Generating Statistics from the Validation Data Set

For each trait, logistic regression models were fit with the best parameter-method score on the entire training set of data and assessed on the entire withheld testing set. This process was only carried out on one score per trait to best approximate the true extent of predictive performance without overfitting. Following the same analysis process as described in the methods comparison, a covariate, score, and complete model were fit to the entire training data-set and then used to predict the outcome of the withheld data-set. Producing three respective AUC measurements, these models led to the AUC improvement statistic. In addition, p-values, odds ratios, PR curves, prevalences, case/control distributions, significance of ROC curve difference and Nagelkerke’s R^{2} values were also all computed following the same procedures described in the previous section. The association between predictive accuracy and study statistics, such as sample size, were also computed, along with a diagnostic comparison of training and testing error.

### Correlations of Scores

Several correlation analyses were completed, the first of which computed spearman correlation between pairwise combinations of the 25 evaluated polygenic risk scores. Pairs of traits which did not share any individuals in common, such as prostate and breast cancer, were left as NA. Corresponding empirical correlations between the numbers of cases were also computed using the Jaccard Similarity measure for comparison (Supp. Fig. 48). The next correlation analysis computed spearman correlation and the corresponding p-values between the 25 polygenic risk scores analyzed in the previous section and 23 additional polygenic risk scores generated using the clumping method. Lastly, unrelated traits were predicted by creating the three logistic regression models and calculating the corresponding AUC improvements, as described in the methods comparison section. This process was carried out over a grid of independent variables selected from the 25 best polygenic risk scores and the dependent variable determined by one of the unrelated traits. The exact codings of the unrelated traits were either determined by prevalent ICD-10 codes (n = 136) or self-reported status (n = 70). A final principal components analysis was applied to further visualize the relationship between multiple traits (Supp. Fig. 49).

All computational work was completed in R 3.5, or shell scripts. All code has been uploaded to GitHub (https://github.com/kulmsc). P-values derived from comparisons of groups were generated from ANOVA tests, P-values for individual terms were generated from coefficient terms in regression models, and P-values in relation to correlation analyses were derived from Pearson’s product-moment correlation. Boxplots were produced with the default ggplot settings, the middle bar is the median, upper and lower bars are the 25^{the} and 75^{th} percentiles respectively, the whiskers extend 1.5 times the interquartile range, and the notches extend 1.58 times the interquartile range divided by the square root of the sample. Additional notes on how methods were in implemented is in supplementary appendix II.

## Data Availability

All data can originates in the UK Biobank or the GWAS Catalog. All scripts are located on GitHub.