## Abstract

Polygenic scores (PGSs), which assess the genetic risk of individuals for a disease, are calculated as a weighted count of risk alleles identified in genome-wide association studies (GWASs). PGS methods differ in terms of which DNA variants are included in the score and the weights assigned to them. PGSs are evaluated in independent target samples of individuals with known disease status. Evaluation of new PGS methods are made using simulated data or single target cohort, however, in real data sets there can be heterogeneity between target sample cohorts, which could reflect a number of real or artefactual factors. The Psychiatric Genomics Consortium working groups for schizophrenia (SCZ) and major depressive disorder (MDD) bring together many independently collected case-control cohorts for GWAS meta-analysis. These resources are used here in repeated application of leave-one-cohort-out GWAS analyses, generating robust conclusions for PGS prediction applied across multiple target (left-out) cohorts. Eight PGS methods (P+T, SBLUP, LDpred-Inf, LDpred-funct, LDpred, PRS-CS, PRS-CS-auto, SBayesR) are compared. We found that SBayesR had the highest prediction evaluation statistics in most comparisons. For SCZ across 30 target cohorts, the SBayesR PGS achieved a mean area under the receiver operator characteristic curve (AUC) of 0.733, and explained 9.9% of variance on the liability scale. For MDD across 26 target cohorts, the AUC and variance explained were 0.601 and 4.0%, respectively. The variance explained by the SBayesR PGS was 46% and 43% higher for SCZ and MDD, respectively, compared to the basic p-value thresholding P+T method.

## Introduction

Polygenic scores (PGSs), which assess the genetic risk of individuals for a disease^{1; 2}, are calculated as a weighted count of genetic risk alleles in the genome of an individual, with the risk alleles and their weights typically derived from the results of genome-wide association studies (GWAS).^{3} PGS can be calculated for any trait or disease with sufficiently powered GWAS (‘discovery samples’). For many common complex genetic disorders, such as cancers^{4; 5} and heart disease^{6; 7}, there is increasing interest in trialling PGS for early disease detection, prevention and intervention^{8; 9}. In the context of psychiatric disorders, it has been argued^{10} that PGS may have utility in the context of youth mental health clinics, where young people present with symptoms that have not yet crystallised to portray a clear treatment pathway. A high PGS could nudge clinic decision making for those presenting in this prodromal state.

There are now many methods to calculate PGSs, and the methods differ in terms of two key criteria: which DNA variants to include (DNA variants here are limited to single nucleotide polymorphisms, SNPs, but can include other DNA variants tested for association with a trait) and what weights to allocate to them. While stringent thresholds are set to declare significance for association of individual SNPs in GWAS, PGSs are robust to inclusion of some false positives, and the maximum prediction from PGSs tested in target samples (i.e., GWAS samples independent of the GWAS discovery sample) may include nominally associated SNPs. The optimum method to decide which SNPs to select and what weights to allocate them, may differ between traits depending on the sample size of the discovery GWAS and on the genetic architecture of the trait (the number, frequencies and effect sizes of causal variants), particularly given the linkage disequilibrium (LD) structure between SNPs. Often, when new PGS methods are introduced, comparisons are made between a limited set of methods using simulated data, together with application to some real data examples. However, it can be difficult to compare across the new methods, particularly because in real data there can be heterogeneity in PGS evaluation statistics between target samples, not encountered in idealised simulations. The reasons for this heterogeneity are usually unknown but could reflect a number of factors such as phenotype definition, ascertainment strategies of cases and controls, cohort-specific ancestry within the broad classification of ancestry defined by the GWAS discovery samples (e.g., European), or technical artefacts in genotype generation.

Here, we compare eight PGS methods (P+T^{3; 11}, SBLUP^{12}, LDpred-Inf^{13}, LDpred^{13}, LDpred-funct^{14}, PRS-CS^{15}, PRS-CS-auto^{15} and SBayesR^{16} in **Table 1**). Some of these methods (P+T,LDpred and PRS-CS) require a tuning sample, a GWAS cohort with known trait status that is independent of both discovery and target samples, used to select parameters needed to generate the PGSs in the target sample. Briefly, P+T (pruning with a p-value threshold) uses the GWAS effect size estimates as SNP weights and includes independent SNPs (defined by an LD *r*^{2} filter for a given chromosomal window distance) with association p-values lower than a threshold (chosen after application in a tuning sample). P+T is the most commonly used and basic method, and so is the bench-mark method here. The other methods (referred to here as recent methods) assume either that all SNPs have an effect size drawn from a normal distribution (SBLUP and LDpred-Inf) or that SNP effects are drawn from mixtures of distributions with the key parameters defining these architectures estimated through Bayesian frameworks (LDpred, PRS-CS, SBayesR). LDpred-funct includes functional annotation to SNPs to up/down weight their contributions to the PGSs, which could improve prediction accuracy if this functional information helps to better separate true and false positive associations^{17}. We apply these methods to data from the Psychiatric Genomics Consortium(PGC) working groups for schizophrenia (SCZ)^{18; 19} and major depressive disorder (MDD)^{20–22}(**Tables S1 and S2**). The PGC provides a useful resource for undertaking this study because it brings together many independently collected cohorts for GWAS meta-analysis. This allows the application of repeated leave-one-cohort-out GWAS analyses generating robust conclusions from evaluation of PGS applied across multiple left-out target cohorts.

## Materials and Methods

### Data

Schizophrenia GWAS summary statistics, denoted as PGC-SCZ2+, were available from PGC Schizophrenia (SCZ) Working group (34 European ancestry cohorts, denoted as SCZ34)^{18} and another three cohorts from Pardiñas et al^{19}. PGC-SCZ2+ comprises more than 8M imputed SNPs in 31K SCZ cases and 41K controls. Individual level genotype data were available from 25K cases and 30K controls of SCZ34. Detailed information about the cohorts is provided elsewhere^{23} but is summarised in **Table S1**. Since some methods require a tuning sample (defined below), we arbitrarily chose the lie2 cohort (137 cases and 269 controls) as the tuning cohort. The GWAS discovery sample was a meta-analysis of 35 cohorts; lie2 was always excluded and then each of the remaining 33 cohorts was left-out in turn and used as the target sample. In sensitivity analyses, investigating the impact of the tuning sample, the msaf, gras and swe6 cohort were exchanged with lie2 in turn, in which msaf has a similar sample size as lie2, 327 cases and 139 controls, while gras and swe6 are larger with more than 2000 individuals each.

Major depression GWAS summary statistics were available from seven studies including UK Biobank^{21; 24}, 23andMe^{25}, GERA^{26}, iPSYCH^{27}, deCODE^{28}, GenScotland^{29; 30}, and the PGC Major Depressive Disorder (MDD) Working group (with the data previously denoted as PGC29, but here MDD29)^{20}. All are European ancestry studies and comprise almost 13M imputed SNPs from 248K cases and 563K controls. MDD29 includes the GWAS results from 29 research study cohorts. Detailed information of the MDD29 cohorts is described elsewhere^{20; 21; 25–30} but is summarised in **Table S2**. Individual level genotype data were available for 15K cases and 24K controls from 26 cohorts. We left one cohort out of those 26 cohorts in turn as the target sample. A cohort from Muenster^{20}, not included in the MDD29 was used as the tuning sample (845 clinical defined MDD cases and 834 controls). We then meta-analysed with the other GWAS summary statistics results to make the discovery samples. We note that the discovery sample meta-analyses include samples where the depression phenotype is self-reported rather than following a structured clinical interview, nonetheless we refer to the prediction as MDD since the PGC target cohorts are of MDD cases and controls. 959 overlapped individuals between UK Biobank and MDD29 were excluded from the target cohorts.

The datasets stored in the PGC central server follow strict guidelines with local ethics committee approval.

### Baseline SNP selection

For baseline analyses, only SNPs with minor allele frequency (MAF) > 0.1 and imputation INFO score > 0.9 (converted to best-guess genotype values of 0, 1 or 2) were selected. Sensitivity analyses relaxed the MAF threshold to MAF > 0.05 or 0.01 and INFO score threshold to 0.3. All methods were conducted using HapMap3 SNPs, except the method P+T, which was conducted based on all imputed SNPs (8M in SCZ, and 13M in MDD).

### Prediction methods

We define a PGS of an individual, *j*, as a weighted sum of SNP allele counts: ,where *m* is the number of SNPs included in the predictor, is the per allele weight for the SNP,*x _{ij}* is a count of the number (0, 1, or 2) of trait-associated alleles of SNP

*i*in individual

*j*. The cohort (target sample) for which PGSs are calculated is excluded from the meta-analysis that generates the GWAS summary statistics (discovery sample), so that discovery and target samples are independent. We compared eight risk prediction methods (detailed below): The methods differ in terms of the SNPs selected for inclusion in the predictor and the values assigned to the SNPs. All methods use the GWAS summary statistics as the starting point, but each makes choices differently for which SNPs to include and for the values to assign. Briefly, the key differences between the methods are the assumptions made about the underlying genetic architecture and the distributions of true effect sizes, with Bayesian methods setting some priors for these distributions. Several methods employ an LD reference sample to determine LD between SNPs. Here, we use EUR of the 1000 Genomes Project as the LD reference, unless the method software provides an LD reference. In some methods the PGS calculated in a target cohort requires estimates of parameter values, which need to be estimated by application of the PGS method to a tuning cohort (also not included in the discovery GWAS sample) using a range of parameter estimates, then selecting the parameter estimates that maximizes prediction in that tuning cohort. In all methods, once the SNPs and have been decided, PLINK

*--score*is used to calculate the PGS in the target sample.

### LD pruning and thresholding (P+T)^{3}

In the P+T method GWAS summary statistics are pruned to be approximately independent using a LD threshold, *r*^{2}. From this quasi-independent genome-wide SNP list, SNPs are selected by thresholding on a pre-specified association p-value, P_{t}. We evaluated P+T as implemented in Ricopili^{31} which uses PLINK^{32} to prune the SNP set using *r*^{2} = 0.1 within 500 kb windows, and P_{t}∈ (5e-08, 1e-06, 1e-04, 1e-03, 0.01, 0.05, 0.1, 0.2, 0.5, 1), where P_{t} = 1 means that all SNPs from the LD-pruned list are included. In applications of P+T it is common for results from the most associated P_{t} to be reported (including the application in the software PRSice^{33} which uses a continuous P_{t} range), but this approach utilises information from the target cohort and hence introduces a form of winner’s curse. Here, the P_{t} threshold applied in target cohorts is the P_{t} threshold that maximised prediction in the tuning cohort.

### SBLUP^{12}

SBLUP is a method that re-scales the GWAS SNP effect estimates using an external LD reference panel to transform the ordinary least-squares estimates to approximate the best linear unbiased prediction (BLUP) solutions. This method assumes an infinitesimal model where SNP effects are drawn from a normal distribution. All genome-wide SNPs are used to build the PGS. Hence, for example, consider a genomic region with a single causal variant but with many SNPs in the region correlated with the causal variant and correlated with each other. In this case the effect size estimate is “smeared” across the correlated SNPs, but with the total contribution to risk expected to represent the best estimate of the signal from the underlying causal variant. This method is implemented within the software package GCTA^{34}.

### LDpred and LDpred-inf^{13}

While P+T uses arbitrary LD and p-value thresholds for selection of SNPs, LDpred tries to optimise this step in a Bayesian framework. The method uses the GWAS summary statistics and LD information from the external LD reference sample to infer the posterior mean effect size of each SNP, conditioning on the SNP effect estimates of other correlated SNPs. This method assumes a point-normal prior on the distribution of SNP effects such that only a fraction of SNPs with non-zero estimated effects are selected for inclusion in the PGS. The default parameter setting for the fractions of causal SNPs (π, but denoted p in the original paper) were used in the tuning cohort: π ∈ {1 (i.e. all SNPs), 0.3, 0.1, 0.03, 0.01, 0.003, and 0.001}, with an LD radius of M/3000 (M is the number of SNPs) to obtain local LD information, as suggested by the authors^{13}. The π value that maximised the prediction in the tuning sample was applied in the target sample; the p value can differ between target cohorts even though the same tuning cohort is used, reflecting the properties of the discovery sample which may change with each left-out target sample. When π=1 the method is called LDpred-Inf and is equivalent to SBLUP (the concordance of results was checked, **Table S7**).

### LDpred-funct^{14}

LDpred-funct is an extension of the LDpred-inf (SBLUP equivalent) model but leverages trait-specific functional enrichments relative to the baseline-LD model^{35} to up/down-weight

SNP effects. The functional annotations include coding, conserved, regulatory and LD-related annotation. In the baseline-LD model, the enrichment of each category is jointly calculated via stratified LD score regression^{36}. LDpred-funct has a non-infinitesimal model version, but in pilot analyses we found LDpred-Inf performed better than LDpred and hence only considered the LDpred-funct infinitesimal model. Thus, we continued only with the infinitesimal model version.

### PRS-CS and PRS-CS-auto^{15}

PRS-CS is also built under a Bayesian regression framework. Unlike LDpred which assumes a point-normal distribution as a prior, which is discrete, PRS-CS assumes a continuous shrinkage prior on the SNP effects. PRS-CS was implemented using the default setting and with the LD reference panel provided with the PRS-CS software, which is computed using the 1000 Genomes samples and HapMap3 SNPs. In PRS-CS, for the global scaling parameter which is applied to all SNP effects ϕ, the search grid is ϕ^{1/2}∈ {0.0001, 0.001, 0.01, 0.1, 1}, The ϕ that produces the best predictive performance in a tuning data set is selected for use in the target sample. In PRS-CS-auto, ϕ is automatically learnt from GWAS summary statistics and no tunning sample is needed.

### SBayesR

SBayesR is a method that re-scales the GWAS SNP effect estimates based on Bayesian multiple regression. SBayesR assumes that the standardised SNP effects are drawn from a mixture of four zero-mean normal distributions with different variances (one of the variances is zero, with a probability of π_{1}), indicating that only a fraction of SNPs (1-π_{1}) have non-zero estimated effects which contribute to the phenotype. Moreover, the contributions of SNPs in different distributions differ because of different variances. Here, we evaluated SBayesR in the default setting. For the LD reference, we used the same sparse LD matrix as the one used in Lloyd-Jones et al.^{16}, where the LD matrix was built based on the HapMap3 SNPs of randomly selected and unrelated 50K UK Biobank individuals. Whereas LDpred estimates p from a tuning sample, SBayesR estimates p from the GWAS discovery sample, so no tuning sample is needed.

### Evaluation of out-of-sample prediction

The accuracy of prediction in each target cohort was quantified by 1) Area under the receiver operator characteristic curve (AUC; R library pROC). AUC can be interpreted as a probability that a case ranks higher than a control. 2) The proportion of variance on the liability scale explained by PGS^{37}. We used the population lifetime risk of SCZ and MDD as

1% and 15% respectively to convert the variance explained in a linear regression to the liability scale^{20; 23; 38}. 3) Odds ratio (OR) of tenth PGS decile relative to the first decile. 4) Odds ratio of tenth PGS decile relative to those ranked in the middle of the PGS distribution, which is calculated as the average of OR of tenth decile relative to fifth and sixth decile. 5) Standard deviation unit increase in cases. The PGS in each target cohort were scaled by standardising the PGS of controls and applying the standardisation to cases: , where SD is standard deviation. This does not impact PGS evaluation statistics but simply means that PGS are in SD units for all cohorts. We compare the median value for evaluation statistics 3 and 4, because they are significantly different from a normal distribution based on a Shapiro-Wilk Normality Test. The regression analyses for evaluation statistics 2-4 include 6 ancestry principal components as covariates. These covariates are not included in the AUC model and the standard deviation unit increase in cases model. To illustrate the impact on results, for SCZ given the SBayesR mean variance in liability of 9.9% and lifetime risk of 0.01 the AUC expected from normal distribution theory^{39} is 0.730, compared to the mean reported of 0.733. For MDD given the variance in liability of 4.0% and lifetime risk of 0.15 the expected AUC is 0.603 compared to the mean reported of 0.601.

## Results

Prediction evaluation statistics based on recent PGS methods applied to SCZ across 30 study cohorts (**Figure 1**, **Table S3** and **S4**), and to MDD across 26 cohorts (**Figure 2**, **Table S5** and **S6**), show higher values for all methods over the benchmark method, P+T. The evaluation statistics include i) area under the receiver operator characteristic curve (AUC) which can be interpreted as the probability that a case ranks higher than a control, when the case and control are randomly drawn; ii) mean difference between cases and controls expressed in PGS standard deviation (SD) units of controls, after standardization of the PGS so that controls in each target cohort have a mean of 0 and a SD of 1; iii) variance in liability explained by the PGS; iv) Odds ratio of the top 10% ranked on PGS relative to the bottom 10%; v) Odds ratio of the top 10% ranked on PGS relative to those ranking in the middle of the PGS distribution; vi) difference between mean of PGS in the top 10% of cases and mean PGS in top 10% of controls.

There is variability in prediction statistics across target cohorts which is not a reflection of sample size (**Figure S1** and **Table S4** for SCZ, **Figure S2** and **Table S6** for MDD). To provide a benchmark in terms of power, we note that for SCZ, the mean difference in PGS between cases and controls for the P+T method is 0.73 standard deviation units of the control sample (SDU). A sample size of only 42 cases and 42 controls has 95% power to detect this difference at a nominal significance threshold of 0.05; all SCZ cohorts are bigger than this. For MDD, the mean difference in PGS SDU between cases and controls for the P+T method is 0.30, and the power calculation requires a sample size of 290 cases and 290 controls to detect this difference; 20 (77%) of the MDD cohorts achieve this effective size. Hence, the SCZ and MDD cohorts are well-powered for PGS evaluation.

The correlations of PGS between different methods are high (**Table S7**), but are lowest between P+T and other methods (minimum 0.67). In contrast, the correlations between the recent methods are always > 0.83. In theory, LDpred-Inf and SBLUP are the same method. In practice, there are differences in implementation (e.g., different input parameters associated with definition of LD window), generating a correlation 0.977. The differences in prediction evaluation statistics between methods are small. For SCZ the AUC for all recent methods other than PRS-CS-auto are significantly higher than the P+T method after Bonferroni correction (p-value<0.0018=0.05/28 (28 pair-wise comparisons between 8 methods), two-tailed Student’s t-test). For MDD none of the differences between methods were significant. For both SCZ and MDD, regardless of tuning cohorts SBayesR showed relatively better performance (on average across target cohorts) than other methods on all statistics, where other recent methods performed similarly (**Figures 1 and 2**). For variance explained on the liability scale, the P+T PGS explained a mean of 6.8% across cohorts for SCZ. For SBayesR, the mean was 9.9% for variance explained in liability, an increase of 46%. For MDD although the variance explained is lower in absolute terms, 2.8% for P+T vs 4.0% for SBayesR; the latter represents a 43% increase.

We provide several evaluation statistics that focus on those in the top 10% of PGS, because clinical utility of PGS for psychiatric disorders is likely to focus on individuals that are in the top tail of the distribution of predicted genetic risk. The odds ratio for top vs bottom decile are large, ranging from 13.8 for P+T to 22.5 for SBayesR for SCZ and 3 to 4 for MDD.

While these top vs bottom decile odds ratios (**Figure 1c** and **2c**) are much larger than the odds ratio obtained by using PGS to screen a general population (**Figure 1d** and **2d**) or patients in a healthcare system to identify people at high risk^{40; 41}, these comparisons are useful for research purposes, which could for example make cost-effective experimental designs focussing on individuals with high vs low PGS.^{42} The odds ratio of top 10% vs middle 10% are much less impressive, up to median of 5.5 for SCZ and 2 for MDD, but more fairly represents the value of PGS in population settings. These values can be benchmarked against risk in 1^{st} degree relatives of those affected, which are of the order of 8 for SCZ and 2 for MDD; low values are always expected for MDD because it is more common (lifetime risk ∼15% compared to ∼1% for SCZ). The odds ratio values are particularly high for some cohorts (**Table S4**), because in some SCZ cohorts the bottom 10% include very few or no cases, especially in cohorts with relatively small sample sizes. Since the PGS are normally distributed, as expected the mean PGS for controls in the top 10% PGS is ∼1.75 SD units(K = 0.10, t = qnorm(1-K), z = dnorm(t), mean value of top 10% of a normal distribution = z/K), whereas the top 10% of cases have mean value of 2.63 control sample SD units for SCZ cases and 2.09 control sample SD units for MDD cases, using SBayesR. These mean values of the top 10% in cases equate to expectations from the population of the top 1.1% and top 4.7% for SCZ and MDD, respectively.

### The impact of tuning cohort

Three methods (i.e., P+T, LDpred and PRC-CS) use tuning cohorts to determine key parameters for application of the method into the target cohorts. Tuning parameters impact results in two ways. First, the parameters may be dependent on the choice of tuning cohort. Second, the discovery GWAS sample may be reduced in size (and hence power) if a tuning cohort needs to be excluded from the discovery GWAS. In all our analyses the tuning cohort is excluded from all GWAS discovery samples so that GWAS discovery sample is not variable across methods for each target cohort. A sensitivity analyses that used the SCZ cohorts of msaf (Ncase = 327, Ncontrol = 139), gras (Ncase = 1086, Ncontrol = 1232) or swe6 (Ncase = 1094, Ncontrol = 1219) as the tuning sample instead of cohort lie2 (Ncase = 137, Ncontrol = 269) show that the tuning cohort can have considerable impact (**Figure 3** and **Figures S3-5**). In our results, the tuning cohort that generates higher PGS is method dependent and differs between cohorts. Although methods SBLUP, LDpred-Inf, LDpred-funct, PRS-CS-auto and SBayesR require no tuning cohort, they serve as a benchmark, since the differences in their results reflect differences in the changed discovery samples (e.g., msaf is in the discovery sample, when lie2 is the tuning cohort, and *vice versa*), as well as the stochasticity inherent in the Gibbs sampling of Bayesian methods.

### The impact of MAF/INFO threshold

A MAF threshold of 0.1 and a INFO threshold of 0.9 are used to be consistent with applications in the PGC SCZ^{23} and PGC MDD^{20} studies, which had been imposed recognising that these thresholds generated more robust PGS results than using lower threshold values. In the second sensitivity analysis applied to the SCZ data, the MAF threshold was relaxed to 0.05 (**Figure 4**) and to 0.01 (**Figure S6**). The prediction evaluation statistics increase for some cohorts and decrease for others. SBLUP, PRS-CS, PRS-CS-auto and SBayesR are less affected than P+T, LDpred-Inf, LDpred-funct and LDpred. For QC threshold of MAF < 0.01, the differences in AUC have a similar trend compared to using MAF< 0.05, but with greater variability (**Figure S6**). The effects of MAF thresholds vary between cohorts, although the use of lower MAF threshold tends to generate higher AUC for the larger target samples. Across target cohorts, different evaluation statistics were almost identical when including less common SNPs (**Table S3**). Relaxing the INFO score to 0.3 has a negligible effect (**Figure S7**).

## Discussion

Comparison of PGS risk prediction methods showed that all recent methods had higher prediction evaluation statistics over the benchmark P+T method for SCZ and MDD. While the differences between the recent methods were small, we found that SBayesR consistently ranked highest. Given that the PGS is a sum of many small effects, a normal distribution of PGS in a population is expected (and observed **Figures S8-S11**). In idealised data, such as the relatively simple simulation scenarios usually considered in method development, all evaluation statistics should rank the same, but with real data sets this is not guaranteed. This is the motivation for considering a range of evaluation statistics. Our focus on statistics for those in the top 10% of PGS is partly motivated by potential clinical utility. In the context of psychiatry, it is likely that this will focus on people presenting in a prodromal state with clinical symptoms that have not yet crystallised to a specific diagnosis^{10; 43}. High PGS in those presenting to clinics could help tilt the clinical decision-making towards closer monitoring or earlier intervention. Since a genetic-based predictor only predicts part of the risk of disease, and since a PGS only predicts part of the genetic contribution to disease it is acknowledged that PGS cannot be fully accurate predictors. Nonetheless, PGS, in combination with clinical risk factors, could make a useful contribution to risk prediction.

In sensitivity analyses that used different quality criteria for SNPs e.g. MAF of 0.01 vs 0.05, INFO of 0.3 vs 0.9, we concluded that, currently, there is little to be gained in PGS from including SNPs with MAF < 0.10 and INFO < 0.9 for the diseases/dataset studied (**Table S8** and **S9**). This result may seem counter-intuitive since variants with low MAF are expected to play an important role in common disease, and some may be expected to have larger effect sizes than more common variants^{44; 45}. However, sampling variance is a function of allele frequency (var(y)/(2*MAF(1-MAF)*n), where n is sample size), such that a variant of MAF = 0.01 has sampling variance 25 times greater than a variant of MAF = 0.5. Moreover, in real data sets cohort sample size and technical artefacts can accumulate to increase error in effect size estimates particularly of low frequency variants. Our conclusion that little is gained from including variants of MAF < 0.1 and reducing INFO threshold needs to be revisited as larger discovery samples and larger target cohorts accumulate.

For both SCZ and MDD, while all recent methods had similar performance, SBayesR saw the highest prediction accuracy in most of the comparisons. Although SCZ and MDD both have a highly polygenic genetic architecture, we have recently shown that SBayesR outperforms other methods for two less polygenic diseases, Alzheimer’s^{46} (which includes the *APOE* locus which has a very large effect size) and ALS^{47} (for which there is evidence of greater importance of low MAF variants compared to SCZ^{48}). The original SBayesR publication showed that in both simulations and applications to real data, the method performed well across a range of traits with different underlying genetic architectures, which is because SBayesR can fit essentially any underlying architecture and other methods are special cases of the SBayesR model, except PRS-CS which uses different distributional approaches (**Table 1**). We note that we did not consider a version of P+T that has been shown to have higher out of sample prediction compared to the standard implementation^{11}. This method conducts a grid search in a tuning cohort to determine LD *r*^{2} and INFO score thresholds for SNPs as well as the p-value threshold. We chose to implement only the basic, commonly used P+T method, and specifically as implemented in published PGC studies. Moreover, many of the methods implemented here address optimum SNP selection from a methodological approach rather than grid search approach. We note that here we only considered the infinitesimal model version of LDpred-funct, because we have already found no advantage of LDpred over LDpred-inf in the preliminary analyses. For traits and diseases of other genetic architecture parameters of LDpref-funct should be investigated, although in the updated LDpred-funct preprint^{49}, SBayesR was found to perform well across a range of quantitative and binary traits. We do note that SBayesR expects effect size estimates and their standard errors to have properties consistent with the sample size and with the LD patterns imposed from an external reference panel. If GWAS summary statistics have non-ideal properties (perhaps resulting from meta-analysis errors or approximations) then SBayesR may not achieve converged solutions. Last, we note that the comparison of methods uses only study samples of European ancestry. More research is needed to understand the properties of prediction methods within other ancestries and across ancestries, given potential differences in genetic architectures (in terms of number, frequencies and effect sizes of causal variants) and LD between measured variants and causal variants^{50; 51}. Such research is dependent on the availability of large GWAS data sets from non-European ancestries; currently there is considerable effort to increase GWAS sample collection across world-wide population groups to address this concern^{50–52}.

All recent methods are compared using their default parameters settings. An optimum setting of each method could potentially increase the prediction accuracy. For example, in sensitivity analyses we found that LDpred sees higher prediction accuracy when increasing the length of MCMC chain, while PRS-CS-auto and SBayesR results are not impacted by increasing the MCMC chain length beyond the default settings (**Table S10**). This result agrees with the recent revision of LDpred, LDpred2^{53}. The underlying model and assumptions about the SNP effect distribution are unchanged, but higher prediction accuracy is reported for longer MCMC chain and larger LD windows. Most likely the optimum parameter settings are trait (genetic architecture) dependent^{11}. Hence, we conclude that a key advantage of SBayesR is that there is no need for the user to tune or select model or software parameters. Moreover, it does not need a tuning cohort to derive SNP effect weights but learns the genetic architecture from the properties of the GWAS results. A third key advantage of SBayesR is computational speed. Using one CPU, it takes approximately 2 hours to generate SNP weights based on each discovery sample and predict into the left-out-cohort, compared to PRS-CS which needs 40 hours using 5 CPUs (the CPU number is fixed in the PRS-CS software). Last, given that SBayesR uses only HapMap3 SNPs that are mostly well-imputed it should be possible to provide these SBayesR SNP weights as part of a GWAS pipeline to apply in external target samples.

## Data Availability

The datasets stored in the Psychiatric Genomics Consortium central server follow strict guidelines with local ethics committee approval.

## Declaration of Interests

The authors declare no competing interests.

## Acknowledgements

We acknowledge funding from the National health and Medical Research Council(1173790,1078901,108788 (NRW),1113400 (NRW, PMV)) and the Australian Research Council (FL180100072 (PMV)).

This work would not have been possible without the contributions of the investigators who comprise the PGC-SCZ and PGC-MDD working group. For a full list of acknowledgments of all individual cohorts included in PGC-SCZ and PGC-MDD, please see the original publications. The PGC has received major funding from the US National Institute of Mental Health and the US National Institute of Drug Abuse (U01 MH109528 and U01 MH1095320). We thank the customers, research participants and employees of 23andMe for making this work possible. The study protocol used by 23andMe was approved by an external AAHRPP- accredited institutional review board.

The Münster cohort was funded by the German Research Foundation (DFG, grant FOR2107 DA1151/5–1 and DA1151/5–2 to U.D.; SFB-TRR58, Projects C09 and Z02 to U.D.) and the Interdisciplinary Center for Clinical Research (IZKF) of the medical faculty of Münster (grant Dan3/012/17 to U.D.).

Some data used in this study were obtained from dbGaP. dbGaP accession phs000021: funding support for the Genome-Wide Association of Schizophrenia Study was provided by the National Institute of Mental Health (R01 MH67257, R01 MH59588, R01 MH59571, R01 MH59565, R01 MH59587, R01 MH60870, R01 MH59566, R01 MH59586, R01 MH61675, R01 MH60879, R01 MH81800, U01 MH46276, U01 MH46289, U01 MH46318, U01 MH79469, and U01 MH79470) , and the genotyping of samples was provided through the Genetic Association Information Network (GAIN). Samples and associated phenotype data for the Genome-Wide Association of Schizophrenia Study were provided by the Molecular Genetics of Schizophrenia Collaboration (principal investigator P. V. Gejman, Evanston Northwestern Healthcare (ENH) and Northwestern University, Evanston, IL, USA). dbGaP accession phs000196: this work used in part data from the NINDS dbGaP database from the CIDR: NGRC PARKINSON’S DISEASE STUDY. dbGaP accession phs000187: High-Density SNP Association Analysis of Melanoma: Case–Control and Outcomes Investigation.Research support to collect data and develop an application to support this project was provided by P50 CA093459, P50 CA097007, R01 ES011740, and R01 CA133996 from the NIH.

Statistical analyses were carried out on the Genetic Cluster Computer (http://www.geneticcluster.org) hosted by SURFsara and financially supported by the Netherlands Scientific Organization (NWO 480-05-003) along with a supplement from the Dutch Brain Foundation and the VU University Amsterdam.