Leveraging fine-mapping and non-European training data to improve trans-ethnic polygenic risk scores

Polygenic risk scores (PRS) based on European training data suffer reduced accuracy in non-European target populations, exacerbating health disparities. This loss of accuracy predominantly stems from LD differences, MAF differences (including population-specific SNPs), and/or causal effect size differences. Here, we propose PolyPred, a method that improves trans-ethnic polygenic prediction by combining two complementary predictors: a new predictor that leverages functionally informed fine-mapping to estimate causal effects (instead of tagging effects), addressing LD differences; and BOLT-LMM, a published predictor. In the special case where a large training sample is available in the non-European target population (or a closely related population), we propose PolyPred+, which further incorporates the non-European training data, addressing MAF differences and causal effect size differences. We applied PolyPred to 49 diseases and complex traits in 4 UK Biobank populations using UK Biobank British training data (average N=325K), and observed statistically significant average relative improvements in prediction accuracy vs. BOLT-LMM ranging from +7% in South Asians to +32% in Africans (and vs. LD-pruning + P-value thresholding (P+T) ranging from +77% to +164%), consistent with simulations. We applied PolyPred+ to 23 diseases and complex traits in UK Biobank East Asians using both UK Biobank British (average N=325K) and Biobank Japan (average N=124K) training data, and observed statistically significant average relative improvements in prediction accuracy of +24% vs. BOLT-LMM and +12% vs. PolyPred. In conclusion, PolyPred and PolyPred+ improve trans-ethnic polygenic prediction accuracy, ameliorating health disparities.

Here, we propose PolyPred, a polygenic prediction method that linearly combines two complementary predictors derived from European training data: (1) PolyFun-pred, a new predictor that circumvents LD differences by applying genome-wide functionally informed fine-mapping 28,29 to precisely estimate causal effects (instead of tagging effects); and (2) BOLT-LMM 30,31 , a published predictor that analyzes all loci jointly and can capture all signals in extremely polygenic loci.
In the special case where there exists a large (e.g. N≥50K) non-European training sample from the target population (or a closely related population), we propose PolyPred+, a polygenic prediction method that leverages both European and non-European training data. PolyPred+ linearly combines (1) PolyFun-pred; (2) BOLT-LMM; and (3) BOLT-LMM-pop, a predictor obtained by applying BOLT-LMM to the non-European training data, addressing MAF differences and causal effect size differences.
We compared PolyPred and PolyPred+ to state-of-the-art polygenic prediction methods via simulations and analyses of 49 diseases and complex traits in 4 populations from the UK Biobank 32 , additionally incorporating Biobank Japan 7 and Uganda-APCDR 33,34 to increase non-European training sample size and avoid cohort effects. We conclude that PolyPred substantially increases trans-ethnic polygenic prediction accuracy, and that PolyPred+ further increases trans-ethnic prediction accuracy in the special case where non-European training data is available in large sample size.

Overview of Methods
PolyPred combines two complementary predictors: PolyFun-pred and BOLT-LMM. PolyFun-pred is a new predictor that leverages genome-wide functionally informed fine-mapping 28,29 to estimate posterior mean causal effects (instead of tagging effects; see Supplementary Note) for all SNPs with European MAF≥0.1% (accounting for MAF-dependent architectures 35-37 ; 18 million SNPs in this study) by applying PolyFun + SuSiE 29 to European training data across 2,763 overlapping 3Mb loci. Leveraging fine-mapped posterior mean causal effects for trans-ethnic polygenic prediction aims to address LD differences between populations; to our knowledge, the application of PolyFun + SuSiE (or any other fine-mapping method) to polygenic prediction has not previously been explored. BOLT-LMM is a published predictor 30,31 that estimates posterior mean tagging effects of common SNPs (1.2 million HapMap 3 SNPs 38 in this study) using European training data. PolyFun-pred and BOLT-LMM have complementary strengths: PolyFun-pred estimates causal effects rather than tagging effects, whereas BOLT-LMM estimates tagging effects but analyzes all loci jointly and can potentially capture all signals in extremely polygenic loci (i.e., loci harboring >10 causal variants within 1.5Mb of the locus center; see Methods).
In the special case where a large training sample is available in the target population (or a closely related population), we propose PolyPred+, which combines three complementary predictors: PolyFun-pred, BOLT-LMM and BOLT-LMM-pop; BOLT-LMM-pop refers to application of BOLT-LMM to common SNPs (1.2 million HapMap 3 SNPs in this study) using training data from the non-European target population, addressing MAF differences and causal effect size differences.
PolyPred and PolyPred+ compute linear combinations of the estimated effect sizes of their constituent predictors:

$$\hat{\beta}_i = \sum_m w_m \hat{\beta}_i^{(m)} \tag{1}$$

where $i$ indexes SNPs, $m$ indexes the constituent predictors (PolyFun-pred and BOLT-LMM for PolyPred; PolyFun-pred, BOLT-LMM and BOLT-LMM-pop for PolyPred+), $\hat{\beta}_i$ is the PolyPred (or PolyPred+) per-allele effect size of SNP $i$, $w_m$ are method-specific weights, and $\hat{\beta}_i^{(m)}$ is the per-allele effect size of SNP $i$ for method $m$ (or 0 if SNP $i$ was not considered by method $m$). Predicted phenotypes are computed by applying effect sizes to target genotypes:

$$\hat{y} = \sum_i \hat{\beta}_i x_i \tag{2}$$

where $\hat{y}$ is the predicted phenotype of an individual from the target population and $x_i$ is the number of minor alleles of SNP $i$ carried by the individual. The mixing weights $w_m$ in Equation 1 are estimated via non-negative least squares regression using a small number of training individuals from the target population (500 in this study), regressing true phenotypes on a linear combination of the constituent predictors (which are computed as in Equation 2). Further details of PolyPred and PolyPred+ are provided in the Methods section; we have publicly released open-source software implementing PolyPred and PolyPred+ (see URLs).
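The combination of effect sizes, the prediction step, and the non-negative least squares estimation of mixing weights described above can be illustrated with a minimal sketch. This is not the released PolyPred software: the function names (`nnls_small`, `estimate_mixing_weights`, `combine_effects`, `predict`), the toy data, and the choice to mean-center phenotypes and predictors before the regression are all assumptions made for illustration.

```python
import itertools
import numpy as np


def nnls_small(X, y):
    """Exact non-negative least squares for a handful of columns.

    Fits ordinary least squares on every subset of columns and keeps the
    best fit whose coefficients are all non-negative (practical only for
    2-3 constituent predictors, as here).
    """
    n_cols = X.shape[1]
    best_w = np.zeros(n_cols)
    best_rss = float(y @ y)  # residual sum of squares of the all-zero fit
    for k in range(1, n_cols + 1):
        for cols in itertools.combinations(range(n_cols), k):
            idx = list(cols)
            coef, *_ = np.linalg.lstsq(X[:, idx], y, rcond=None)
            if np.any(coef < 0):
                continue
            resid = y - X[:, idx] @ coef
            rss = float(resid @ resid)
            if rss < best_rss:
                best_rss = rss
                best_w = np.zeros(n_cols)
                best_w[idx] = coef
    return best_w


def predict(genotypes, beta):
    """Apply per-allele effect sizes to minor-allele counts."""
    return genotypes @ beta


def combine_effects(betas, weights):
    """Weighted sum over methods; betas is (n_snps, n_methods), with 0
    wherever a method did not consider a SNP."""
    return betas @ weights


def estimate_mixing_weights(genotypes, betas, phenotype):
    """Regress phenotypes on the constituent predictors under a
    non-negativity constraint (centering is an assumption of this sketch)."""
    per_method = genotypes @ betas  # one predictor column per method
    per_method = per_method - per_method.mean(axis=0)
    return nnls_small(per_method, phenotype - phenotype.mean())


# Toy demonstration with simulated data (all values are illustrative).
rng = np.random.default_rng(0)
n_snps, n_mix = 200, 500
true_beta = rng.normal(scale=0.1, size=n_snps)
betas = np.column_stack([
    true_beta + rng.normal(scale=0.05, size=n_snps),  # stand-in predictor 1
    true_beta + rng.normal(scale=0.10, size=n_snps),  # stand-in predictor 2
])
genotypes = rng.binomial(2, 0.3, size=(n_mix, n_snps)).astype(float)
phenotype = genotypes @ true_beta + rng.normal(size=n_mix)
weights = estimate_mixing_weights(genotypes, betas, phenotype)
combined = combine_effects(betas, weights)
```

The subset-enumeration solver is used here only to keep the sketch dependency-light; a dedicated NNLS routine would be the natural choice for larger problems.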
We evaluate prediction accuracy for each method and target population using relative-R 2 , defined as the R 2 obtained in the target non-European population (after correcting for covariates and potential confounders; see Methods) divided by the R 2 obtained by BOLT-LMM in UK Biobank non-British Europeans (employing the same correction), using the same training data for the numerator and the denominator. This quotient transforms the prediction accuracies from an absolute scale to a scale of relative improvement (vs. the BOLT-LMM predictor in the UK Biobank non-British European target population), which is invariant to factors such as training sample size and trait heritability. We compute standard errors via a genomic block-jackknife, which is conservative compared to a jackknife over individuals (see Methods). We meta-analyze relative-R 2 across traits in a given target population via an inverse variance-weighted average, weighting traits according to the sampling variance of the BOLT-LMM predictor in the target population (estimated via genomic block-jackknife; see Methods). We compare PolyPred and PolyPred+ to 3 published methods: LD-pruning + P-value thresholding (P+T) 39,40 , SBayesR 41 , and BOLT-LMM 30,31 (Table 1).
We computed predictions in four target populations: non-British Europeans, South Asians, East Asians, and Africans. We estimated PolyPred mixing weights using 500 individuals from the target population. We evaluated prediction accuracy using held-out individuals from each target population that were not included in the training sets: 42K non-British Europeans, 7.7K South Asians, 0.9K East Asians, and 6.2K Africans. We computed PRS using 250,963 MAF≥0.1% SNPs with INFO score≥0.6 on chromosome 22 (including short indels) (we restricted the analysis to chromosome 22 to alleviate the computational burden of running hundreds of simulations). Generative trait architectures were specified as follows.
We simulated traits with polygenicity (genome-wide proportion of causal SNPs) equal to either 0.1% (less polygenic) or 0.3% (more polygenic) and heritability equal to 5% (we specified a heritability that is larger than typical chromosome 22 heritability to increase our power to detect differences between methods using a limited number of simulations; see below). We specified prior causal probabilities for each SNP in proportion to per-SNP heritabilities, which we generated for each SNP based on its British LD, MAF, and functional annotations, using the baseline-LF model 36 with meta-analyzed functional enrichments from real traits as described in our previous work 29 , and sampled causal SNPs. For each causal SNP, we sampled ancestry-specific causal effect sizes (for European, South Asian, East Asian, and African ancestries) from a multivariate normal distribution assuming trans-ethnic genetic correlations of 0.8, consistent with recent findings 13,25 ; functional annotations impacted prior causal probabilities but not causal effect sizes for causal SNPs, consistent with our recent work 42 . Other parameter settings were explored in secondary analyses (see below). Further details of the simulation framework are provided in the Methods section.
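The sampling of causal SNPs and ancestry-specific causal effect sizes described above can be sketched as follows. This is an illustrative sketch rather than the simulation code used in the study: the function name `sample_causal_effects`, the uniform prior in the usage example, and the per-SNP variance of h2 / n_causal (which implicitly assumes standardized genotypes) are assumptions.

```python
import numpy as np


def sample_causal_effects(prior_causal_prob, n_causal, h2, r_g=0.8,
                          ancestries=("EUR", "SAS", "EAS", "AFR"), seed=0):
    """Sample causal SNPs and correlated ancestry-specific effect sizes.

    prior_causal_prob: per-SNP prior weights (e.g. proportional to per-SNP
        heritabilities derived from functional annotations).
    n_causal: number of causal SNPs to draw.
    h2: total heritability to spread across causal SNPs (assumed scaling).
    r_g: genetic correlation between every pair of ancestries.
    """
    rng = np.random.default_rng(seed)
    p = np.asarray(prior_causal_prob, dtype=float)
    p = p / p.sum()
    # Annotations influence which SNPs are causal, not their effect sizes.
    causal_idx = rng.choice(p.size, size=n_causal, replace=False, p=p)

    k = len(ancestries)
    corr = np.full((k, k), r_g)
    np.fill_diagonal(corr, 1.0)
    cov = corr * (h2 / n_causal)  # same variance in every ancestry
    effects = rng.multivariate_normal(np.zeros(k), cov, size=n_causal)
    return causal_idx, effects  # effects has shape (n_causal, n_ancestries)


# Example: 0.3% polygenicity over 250,963 SNPs with 5% heritability,
# using a flat prior as a placeholder for annotation-derived priors.
n_snps = 250_963
causal_idx, effects = sample_causal_effects(
    np.ones(n_snps), n_causal=round(0.003 * n_snps), h2=0.05)
```

Setting `r_g=1.0` corresponds to the secondary analysis with perfectly correlated effects; in that case the covariance matrix is singular, which the default sampler tolerates.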
We computed summary association statistics (used by P+T, SBayesR, and PolyPred) via linear regression. For P+T, we used summary LD estimated from a random subset of 10,000 British-ancestry UK Biobank individuals. For SBayesR we used summary LD for 18,040 HapMap 3 SNPs on chromosome 22 estimated from 50,000 British-ancestry UK Biobank individuals, that was made publicly available by the authors of SBayesR 43 . For PolyPred we used summary LD estimated from 337,548 British-ancestry UK Biobank individuals that we previously made publicly available 29 , effectively using in-sample LD. For BOLT-LMM, we used individual-level genotypes at 18,040 HapMap 3 SNPs on chromosome 22, using hard-called values for imputed alleles. We applied all methods using default or recommended parameter settings (Methods). We computed relative-R 2 for each method, target population, and trait architecture (less polygenic, more polygenic), averaged across 100 simulations. We did not evaluate PolyPred+ in these experiments because of the small size of the UK Biobank non-European populations.
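The relative-R 2 metric and its inverse variance-weighted meta-analysis across traits can be sketched as follows. This is a simplified illustration: it omits the correction for covariates and confounders described in the Methods, and the function names are hypothetical. Following the definition given earlier, the weights come from the sampling variance of the BOLT-LMM predictor in the target population and are reused for every method.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Squared Pearson correlation between phenotype and prediction."""
    r = np.corrcoef(y_true, y_pred)[0, 1]
    return float(r ** 2)


def relative_r2(r2_target, r2_bolt_european):
    """R2 in the target population divided by the R2 of BOLT-LMM in
    non-British Europeans (same training data for both)."""
    return r2_target / r2_bolt_european


def inverse_variance_meta(values, variances):
    """Inverse variance-weighted average of per-trait values."""
    values = np.asarray(values, dtype=float)
    weights = 1.0 / np.asarray(variances, dtype=float)
    return float(np.sum(weights * values) / np.sum(weights))


# Example with made-up numbers for three traits.
per_trait_relative_r2 = [0.30, 0.40, 0.35]
bolt_variances = [0.010, 0.020, 0.015]  # hypothetical jackknife variances
meta = inverse_variance_meta(per_trait_relative_r2, bolt_variances)
```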
We performed 4 secondary analyses to investigate how PolyPred performs when individual-level data, in-sample LD, or functional annotations are not available. First, we evaluated a method (PolyPred-S) that linearly combines PolyFun-pred and SBayesR (instead of PolyFun-pred and BOLT-LMM), precluding the need for individual-level training data. PolyPred-S slightly underperformed PolyPred, as expected since SBayesR slightly underperformed BOLT-LMM (Supplementary Table 1). Second, we evaluated a modified version of PolyFun-pred that uses summary LD estimated from either UK10K 49 (N=3,567) or 1000 Genomes Europeans 50 (N=489), precluding the need for in-sample LD. We focused on PolyFun-pred rather than PolyPred because we wanted to consider the scenario where no individual-level training data is available. Using UK10K was almost as accurate as using in-sample LD, but using 1000 Genomes Europeans was hugely inaccurate, leading to prediction accuracy even lower than that of P+T (Supplementary Table 1), confirming the importance of using a large (population-matched) LD reference panel to compute PRS 41 (on the other hand, using UK10K LD is not recommended when using PolyFun for fine-mapping 29 due to concerns about false-positive fine-mapped SNPs, which are not a primary concern when computing PRS). Third, we evaluated a modified version of PolyFun-pred (PolyFun-pred1) that assumes a single causal variant per locus, again precluding the need for in-sample LD (since fine-mapping under a single causal variant assumption does not require any LD information 29 ). PolyFun-pred1 was less accurate than all other methods (including P+T) and is thus not recommended for polygenic prediction (Supplementary Table 1). Fourth, we evaluated a non-functionally informed method (PolyPred-NoFun) that linearly combines PolyNoFun-pred (a modification of PolyFun-pred that is not functionally informed; see Methods) and BOLT-LMM, precluding the need for functional annotations. PolyPred-NoFun was slightly less accurate than PolyPred, but still more accurate than BOLT-LMM (Supplementary Table 1).
We performed 5 secondary analyses to investigate the sensitivity of the results to the simulation parameters. First, we performed simulations for much less polygenic (0.05%) and much more polygenic (0.5%) architectures. PolyPred remained the most accurate method, attaining the largest relative improvements vs. BOLT-LMM for the much less polygenic architecture (Supplementary Table 1); we conservatively restricted the remaining secondary analyses to the more polygenic (0.3%) architecture (for which PolyPred attains smaller relative improvements among the two main architectures simulated), unless otherwise indicated. Second, we performed simulations with lower (3%) or higher (7%) chromosome 22 heritability. PolyPred remained the most accurate method, with relative improvements vs. BOLT-LMM increasing with heritability (Supplementary Table 1). Third, we performed simulations with trans-ethnic genetic correlations increased from 0.8 to 1.0. PolyPred remained the most accurate method, with relative improvements vs. BOLT-LMM remaining broadly similar (Supplementary Table 1). Fourth, we modified the number of training samples from the target population used to estimate mixing weights (Nmix) from 500 to various values from 100 to 1000. PolyPred remained the most accurate method in all of these experiments, with relative improvements vs. BOLT-LMM increasing with Nmix but limited improvement above Nmix=500 (Supplementary Table 1). Fifth, we decreased the number of British-ancestry training samples (N) from N=337K to N=100K or N=10K. Prediction accuracies decreased with decreasing training sample size for all methods, and the relative improvements of PolyPred vs. BOLT-LMM (and other methods) were substantially decreased for N=10K, though they remained statistically significant in Africans under 0.1% polygenicity (Supplementary Table 1).
We performed two secondary analyses to evaluate the computational cost and memory cost of each method. First, we evaluated the computational cost of each method (for PolyPred, we included the time cost of each constituent method); we focused on the time cost to compute SNP effect sizes used for prediction, as the time cost to compute predictions in target samples using these SNP effect sizes is approximately the same for each method. SBayesR was the fastest method (2.8 minutes), P+T was the second fastest method (7.4 minutes), BOLT-LMM was the third fastest method (224 minutes), and PolyPred was the slowest method (668 minutes) (Supplementary Table 2). Second, we evaluated the memory cost of each method (for PolyPred, we computed the maximum memory cost of each constituent method). We performed this analysis using chromosome 1 instead of chromosome 22 because memory cost can increase with the number of SNPs in the analysis (but the memory cost of PolyFun-pred is fixed because it analyzes each 3Mb-locus separately). P+T used the least memory (1.5GB), SBayesR used the second smallest amount of memory (2.6GB), BOLT-LMM used the third smallest amount of memory (11GB), and PolyPred used the most memory (57GB).
The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license. This version posted January 20, 2021.
We conclude that PolyPred is more accurate than P+T, SBayesR and BOLT-LMM, with small but significant improvements vs. BOLT-LMM in Europeans and substantial improvements in Africans.

Analysis of 4 UK Biobank populations using UK Biobank British training data
We applied PolyPred to 49 diseases and complex traits from the UK Biobank, analyzing 4 target populations (Methods, Supplementary Table 3). As in our simulations, we used UK Biobank British training data (average N=325K) to estimate SNP effect sizes; used 500 additional individuals from the target population to estimate mixing weights (we note that PolyPred is relatively insensitive to the choice of mixing weights; see below); evaluated prediction accuracy using individuals from each of the 4 target populations that were not included in the training data, and were unrelated to the training individuals and to each other: 42K non-British Europeans, 7.7K South Asians, 0.9K East Asians, and 6.2K Africans; and compared PolyPred to P+T, SBayesR and BOLT-LMM. We meta-analyzed relative-R 2 across traits by restricting to 7 well-powered, independent complex traits from the UK Biobank 32 (|rg|<0.3; see Methods and Supplementary Table 3) that were also available in Biobank Japan and in Uganda-APCDR (see below). We excluded the HLA region and two other long-range LD regions from the analysis (Methods). We have publicly released SNP effect sizes used for prediction for each of the 4 methods (see URLs).
We computed relative-R 2 for each method and target population. Results meta-analyzed across traits are reported in Figure 2 and Supplementary Table 4, and results for each trait are reported in Supplementary Table 4. Among the 3 published methods, BOLT-LMM attained the highest prediction accuracy in all target populations (but differences between BOLT-LMM and SBayesR were small and not statistically significant). P+T was much less accurate than the other methods (despite its widespread recent use 12-18,26,44-48 ), suffering relative losses of 37-50% vs. BOLT-LMM. We thus used BOLT-LMM as a benchmark, conservatively assessing the statistical significance of improvements vs. BOLT-LMM via genomic block-jackknife across 200 genomic regions (Methods).
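A genomic block-jackknife of the kind referred to above can be sketched as follows. This is a generic leave-one-block-out jackknife, not the exact procedure from the Methods: the statistic (a simple difference in R 2 between two predictors), the per-block score layout, and the function names are assumptions for illustration.

```python
import numpy as np


def jackknife_se(leave_one_out_estimates):
    """Delete-one jackknife standard error from leave-one-out estimates."""
    est = np.asarray(leave_one_out_estimates, dtype=float)
    n = est.size
    return float(np.sqrt((n - 1) / n * np.sum((est - est.mean()) ** 2)))


def _r2(y, pred):
    """Squared Pearson correlation between phenotype and prediction."""
    return float(np.corrcoef(y, pred)[0, 1] ** 2)


def block_jackknife_r2_diff(y, block_scores_a, block_scores_b):
    """Difference in R2 between predictors A and B, with a block-jackknife SE.

    block_scores_*: arrays of shape (n_individuals, n_blocks) holding the
    contribution of each genomic block to each individual's score, so that
    summing across blocks gives the full polygenic score.
    """
    total_a = block_scores_a.sum(axis=1)
    total_b = block_scores_b.sum(axis=1)
    n_blocks = block_scores_a.shape[1]
    loo = np.empty(n_blocks)
    for j in range(n_blocks):
        # Recompute both scores with block j's SNPs left out.
        r2_a = _r2(y, total_a - block_scores_a[:, j])
        r2_b = _r2(y, total_b - block_scores_b[:, j])
        loo[j] = r2_a - r2_b
    full = _r2(y, total_a) - _r2(y, total_b)
    return full, jackknife_se(loo)
```

Because each leave-one-out replicate drops SNPs rather than individuals, the resulting standard error reflects variability across the genome, which is why power under this scheme is bounded by genome size rather than by the number of target individuals.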
Among all 4 methods, PolyPred attained the highest prediction accuracy in each target population. Improvements in average relative-R 2 of PolyPred vs. BOLT-LMM were equal to +7.5% in non-British Europeans (P=0.05), +6.8% in South Asians (P=0.02), +11% in East Asians (P=0.12) and +32% in Africans (P=0.02). The larger improvement in Africans reflects the larger LD differences vs. British training data, due to earlier divergence times 13,14,51 . The lack of statistical significance in East Asians reflects the low power to detect significant differences in very small target samples (though statistical power is primarily limited by genome size due to our conservative use of genomic block-jackknife). The relative mixing weight contributions of PolyFun-pred/BOLT-LMM to PolyPred were equal to 38%/72% in non-British Europeans, 40%/60% in South Asians, 63%/37% in East Asians, and 48%/52% in Africans (Methods, Supplementary Table 4). Despite the improvements attained by PolyPred, the reductions in prediction accuracy in non-European populations remained substantial, with meta-analyzed absolute R 2 equal to 0.17 in non-British Europeans, 0.11 in South Asians, 0.093 in East Asians, and 0.053 in Africans.
We assessed the calibration of each prediction method. A predictor is correctly calibrated if a regression of the true phenotype vs. the predictor yields a slope of 1, and is miscalibrated otherwise 52 . Regression slopes are reported in Supplementary Table 4. In non-British Europeans, PolyPred was well-calibrated (regression slope = 1.01), BOLT-LMM and SBayesR were approximately well-calibrated (0.96-1.08), and P+T was poorly calibrated (0.08). In non-European populations, BOLT-LMM and SBayesR suffered reduced regression slopes (0.57-0.90), consistent with reduced prediction accuracy, but PolyPred remained well-calibrated (0.96-1.14), as expected due to its extra training step to estimate mixing weights in the target population.
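The calibration check described above amounts to the slope of a simple regression of the true phenotype on the predictor. A minimal sketch, with a hypothetical function name and toy numbers, is:

```python
import numpy as np


def calibration_slope(y_true, y_pred):
    """Slope from regressing the true phenotype on the predictor.

    A slope of 1 indicates a calibrated predictor; a slope below 1 means
    the predictor is overdispersed (its effect sizes are too large).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    cov = np.cov(y_pred, y_true, ddof=1)  # 2x2 covariance matrix
    return float(cov[0, 1] / cov[0, 0])


# Toy example: a predictor whose scale is twice too large.
pred = np.array([0.0, 1.0, 2.0, 3.0])
truth = 0.5 * pred + 2.0
slope = calibration_slope(truth, pred)
```

In this toy case the slope is 0.5, the same qualitative pattern as the reduced slopes reported for predictors applied outside their training population.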
We performed 11 secondary analyses. First, we evaluated a method (PolyPred-S) that linearly combines PolyFun-pred and SBayesR (instead of PolyFun-pred and BOLT-LMM), precluding the need for individual-level training data. PolyPred-S attained very similar accuracy to PolyPred in all target populations, with no statistically significant differences (Supplementary Table 5). Second, we evaluated a modified version of SBayesR (SBayesR-2.8M) that uses 2.8M common SNPs specified by the authors of SBayesR 41 instead of 1.2M HapMap 3 SNPs. SBayesR-2.8M was significantly less accurate than SBayesR (Supplementary Table 4), validating our primary focus on SBayesR using 1.2M HapMap 3 SNPs. Third, we evaluated three additional methods: LDpred 52 using 1000 Genomes Europeans 50 or UK10K 49 as the LD reference panel, and PRS-CS 53 using 1000 Genomes Europeans as the LD reference panel (as recommended by the authors of PRS-CS) (Methods). All three methods were consistently less accurate than BOLT-LMM (Supplementary Table 4), validating our primary focus on BOLT-LMM and SBayesR. Fourth, we evaluated modified versions of PolyPred that specify fixed mixing weights instead of estimating mixing weights in the target populations. We considered mixing weights for PolyFun-pred/BOLT-LMM equal to 0%/100%, 25%/75%, 50%/50%, 75%/25%, and 100%/0%. The 25%/75% and 50%/50% methods performed very similarly to PolyPred, with no statistically significant differences (Supplementary Table 5). Fifth, we restricted the PolyFun-pred component of PolyPred to only include SNPs with posterior causal probability greater than a fixed threshold (0.05, 0.50 or 0.95). This restriction decreased prediction accuracy (Supplementary Table 5), implying that estimating causal effect sizes is beneficial for prediction even at loci that cannot be confidently fine-mapped.
Sixth, we evaluated a modified version of BOLT-LMM (BOLT-LMM-727K) that estimates effect sizes using only 727K genotyped SNPs (instead of 1.2M imputed HapMap 3 SNPs). BOLT-LMM-727K was substantially and significantly less accurate than BOLT-LMM (Supplementary Table 4). Seventh, we evaluated a non-functionally informed method (PolyPred-NoFun) that linearly combines PolyNoFun-pred (a modification of PolyFun-pred that is not functionally informed; see Methods) and BOLT-LMM. PolyPred-NoFun was slightly less accurate than PolyPred, but still more accurate than BOLT-LMM (Supplementary Table 5). The difference between PolyPred-NoFun vs. PolyPred was not statistically significant, in contrast to previous studies reporting a large and statistically significant increase in prediction accuracy from incorporating functional annotations 54-56 . Eighth, we reduced the number of training samples from the target population used to estimate mixing weights (Nmix) from 500 to 100. PolyPred suffered slightly reduced accuracy but remained the most accurate method, although relative improvements vs. BOLT-LMM were no longer statistically significant due to larger standard errors (Supplementary Table 4). Ninth, we computed standard errors of relative-R 2 using a jackknife over individuals 54 (instead of a genomic block-jackknife over SNPs; see Methods). Standard errors computed using a jackknife over individuals were generally smaller, increasing the statistical significance of relative improvements of PolyPred vs. BOLT-LMM (Supplementary Table 4). Tenth, we meta-analyzed the results of each method across three independent diseases: type 2 diabetes, asthma, and all autoimmune disease (Methods); these diseases were not included in our primary meta-analyses due to low (observed-scale) heritabilities. PolyPred attained the highest prediction accuracy for each target population and each disease, except for East Asians (where standard errors were very large in relative terms due to the small sample size) and for type 2 diabetes in non-British Europeans (where BOLT-LMM performed slightly but non-significantly better) (Supplementary Table 4). However, relative improvements were not statistically significant due to lower power (Supplementary Table 4). Finally, we assessed the potential contribution of ancestry-specific heritability to reductions in trans-ethnic prediction accuracy 14 , by applying GCTA 57 to estimate the SNP-heritability explained by HapMap 3 SNPs 58,59 in each target population. SNP-heritabilities were largest in non-British Europeans and smallest in Africans (Supplementary Table 6) (these differences could be due to SNP ascertainment 60 , sample ascertainment, and/or ancestry-specific architectures 25 ), likely contributing to reductions in trans-ethnic prediction accuracy.
We conclude that PolyPred substantially increases trans-ethnic polygenic prediction accuracy vs. published methods (with a particularly large improvement in Africans), consistent with simulations. However, there remains a large gap in trans-ethnic polygenic prediction accuracy as compared to Europeans.

Analysis of Biobank Japan and Uganda-APCDR cohorts
We applied PolyPred to predict 23 diseases and complex traits in Biobank Japan 61 and 7 complex traits in Uganda-APCDR, an African-ancestry cohort 33,34 (Methods, Supplementary Table 3). We performed these experiments to avoid training effect sizes and testing predictions in the same cohort, which may produce inflated prediction accuracies 52,62-64 . We again used UK Biobank British training data (average N=325K) to estimate SNP effect sizes, and used 500 individuals from the target population to estimate mixing weights. We evaluated prediction accuracy using individuals from each of the 2 target cohorts that were not included in the training data, and were unrelated to the training individuals and to each other: 5K Biobank Japan individuals and 1.3K Uganda-APCDR individuals. We again compared PolyPred to P+T, SBayesR and BOLT-LMM. We meta-analyzed relative-R 2 across the same 7 well-powered, independent complex traits used in the UK Biobank analyses (Supplementary Table 3).
Results meta-analyzed across traits are reported in Figure 3 and Supplementary Table 7, and results for each trait are reported in Supplementary Table 7. Among the 3 published methods, we again observed that BOLT-LMM attained the highest prediction accuracy in each target population (although differences between BOLT-LMM and SBayesR were not statistically significant), and that P+T was substantially less accurate than the other methods, suffering relative losses of 42-61% vs. BOLT-LMM.
Among all 4 methods, PolyPred attained the highest prediction accuracy in each target population. Improvements in average relative-R 2 vs. BOLT-LMM were equal to +13% in Biobank Japan (P=2×10 -6 ) and +22% in Uganda-APCDR (P=0.26), similar to our UK Biobank results above. However, prediction accuracy (and hence relative-R 2 ) for each method was much smaller in Biobank Japan and Uganda-APCDR (e.g. 0.32 and 0.11 for PolyPred; Figure 3) than in UK Biobank East Asians and UK Biobank Africans (0.62 and 0.34; Figure 2), likely due to higher SNP-heritabilities in the UK Biobank (see below). We also applied PolyPred+ to Biobank Japan, incorporating Biobank Japan training data (in addition to UK Biobank British training data) to estimate effect sizes (average N=124K, distinct from and unrelated to the 5K target individuals), with the caveat that this analysis involved training and testing in the same cohort. PolyPred+ attained increased prediction accuracy, with a further +23% improvement vs. PolyPred (P=0.0004) (Supplementary Table 7).
medRxiv preprint doi: https://doi.org/10.1101/2021.01.19.21249483
We performed additional experiments to investigate the above result of decreased prediction accuracy in Biobank Japan vs. UK Biobank East Asians (of predictors trained using UK Biobank British training data). We compared BOLT-LMM trained using a reduced set of N=124K UK Biobank British training samples and applied to UK Biobank non-British Europeans vs. BOLT-LMM trained using the N=124K Biobank Japan training samples and applied to the Biobank Japan test samples. The prediction R 2 of BOLT-LMM in UK Biobank non-British Europeans was +108% larger than in Biobank Japan, consistent with the +104% increase expected from theory 65,66 based on the +67% higher SNP-heritabilities in UK Biobank (Supplementary Table 8, Supplementary Note). This suggests that differences in SNP-heritability due to ancestry differences (e.g. SNP ascertainment 60 , sample ascertainment, and/or ancestry-specific architectures 25 ) or due to cohort differences (e.g. differences in phenotype definitions 13 , different recruiting strategies 13 , or assay artifacts) may explain most of the differences in prediction accuracies observed between the UK Biobank and Biobank Japan. Further experiments and interpretation are provided in the Supplementary Note.
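The comparison with theory above can be made concrete with a worked approximation. The following is a sketch under stated assumptions, not the derivation from the Supplementary Note: it uses one widely cited approximation for expected out-of-sample prediction accuracy, in which $N$ is the training sample size, $h^2$ is the SNP-heritability, and $M$ is an effective number of independent markers. Both $M$ and the exact functional form are assumptions here.

```latex
% Assumed approximation for expected prediction accuracy
E[R^2] \approx h^2 \cdot \frac{N h^2}{N h^2 + M}

% Ratio at equal N when h^2_{UKB} = c \, h^2_{BBJ}  (c = 1.67 for +67%)
\frac{E[R^2_{UKB}]}{E[R^2_{BBJ}]}
  = c^2 \cdot \frac{N h^2_{BBJ} + M}{c \, N h^2_{BBJ} + M}

% Limiting cases:
%   M \ll N h^2  gives a ratio of c = 1.67            (+67%)
%   M \gg N h^2  gives a ratio of c^2 \approx 2.79    (+179%)
```

Under this approximation the ratio must lie between 1.67 and 2.79 for a +67% difference in SNP-heritability, so an expected increase of +104% (a ratio of 2.04) falls within the attainable range, with the exact value depending on $M$ relative to $N h^2$.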
We performed 6 secondary analyses. First, we assessed the calibration of each method by computing regression slopes (see above), which are reported in Supplementary Table 7. Similar to our above analyses of non-European UK Biobank target populations, PolyPred was the only approximately well-calibrated method, as expected due to its extra training step to estimate mixing weights in the target population. Second, we evaluated a modification of PolyPred that estimates mixing weights using 500 UK Biobank individuals from the genetically closest target population (UK Biobank East Asians for Biobank Japan, UK Biobank Africans for Uganda-APCDR) instead of 500 individuals from the target cohort. The differences between the original and modified versions of PolyPred were small and not statistically significant (Supplementary Table 7), indicating that PolyPred mixing weights can be estimated using 500 individuals from any cohort with the same continental ancestry as the target population. Third, we evaluated modified versions of PolyPred that specify fixed mixing weights instead of estimating mixing weights in the target populations. We considered mixing weights for PolyFun-pred/BOLT-LMM equal to 0%/100%, 25%/75%, 50%/50%, 75%/25%, and 100%/0%. The 25%/75% and 50%/50% methods performed very similarly to PolyPred, with no statistically significant differences (Supplementary Table 7). Fourth, we reduced the number of training samples from the target population used to estimate mixing weights (Nmix) from 500 to 100. PolyPred suffered slightly reduced accuracy but remained the most accurate method, with the improvement vs. BOLT-LMM in Biobank Japan remaining statistically significant (Supplementary  Table 7). Fifth, we computed standard errors of relative-R 2 using a jackknife over individuals 54 (instead of a genomic block-jackknife over SNPs). 
We obtained standard errors that were almost identical to those obtained using a genomic block-jackknife (unlike the above results for UK Biobank), suggesting that Biobank Japan may be more heterogeneous across samples, possibly due to its hospital-based recruitment (Supplementary Table 7). Finally, we meta-analyzed the results of each method across three independent diseases in Biobank Japan: type 2 diabetes, asthma, and all autoimmune disease. Similar to our UK Biobank analyses above, PolyPred attained the highest prediction accuracy in each disease, though relative improvements were not statistically significant due to lower power (Supplementary Table 7).
medRxiv preprint doi: https://doi.org/10.1101/2021.01.19.21249483; this version posted January 20, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

We conclude that PolyPred substantially increases trans-ethnic polygenic prediction accuracy vs. published methods when applied to target cohorts different from the training cohort.

Analysis of UK Biobank East Asians using UK Biobank British and Biobank Japan training data
We applied PolyPred+ to predict 23 diseases and complex traits in UK Biobank East Asians using UK Biobank British and Biobank Japan training data (Supplementary Table 3). We performed this experiment to explore the special case where non-European training data is available in large sample size from a population that is genetically similar to the target population, in a cohort that is distinct from the target cohort; as such, this experiment is a particular strength of this study, relative to previous studies that considered only European training data or analyzed non-European training data from the same cohort as the target cohort 12-17 . We note that this experiment is still imperfect in that the European training data and non-European target data are from the same cohort (UK Biobank); however, we believe that cohort effects (if present) would deflate rather than inflate the relative improvement of PolyPred+ vs. other methods, since they would confer an advantage to the European training data but not the non-European training data. We used UK Biobank British training data (average N=325K) and Biobank Japan training data (average N=124K) to estimate SNP effect sizes. We again used 500 individuals from the target population to estimate mixing weights, and evaluated prediction accuracy using 900 UK Biobank East Asians that were not included in the training data, and were unrelated to the training individuals and to each other. We compared PolyPred+ to P+T, SBayesR, BOLT-LMM and PolyPred. We meta-analyzed relative-R 2 across the same 7 well-powered, independent complex traits used in the previous analyses (Supplementary Table  3).
Results meta-analyzed across traits are reported in Figure 4 and Supplementary Table 4, and results for each trait are reported in Supplementary Table 4. PolyPred+ attained the highest prediction accuracy, with a +24% improvement vs. BOLT-LMM (P=0.0009) and a +12% improvement vs. PolyPred (P=0.0014). This implies that incorporating non-European training data can provide a substantial advantage, if it is available in large sample size. We emphasize that the +12% improvement for PolyPred+ vs. PolyPred should be viewed as a lower bound on the improvement that would be obtained in settings without cohort effects that may confer an advantage to the European training data.
We performed 6 secondary analyses. First, we verified that PolyPred+ using European and East Asian training data does not outperform PolyPred in UK Biobank populations other than East Asians; differences between PolyPred+ and PolyPred were very small and not statistically significant (Supplementary Table 5). Second, we verified that PolyPred+ was well-calibrated (Supplementary Table 4; results for other methods are described above), as expected due to its extra training step to estimate mixing weights in the target population. Third, we evaluated a modified version of PolyPred+ that estimates mixing weights using 500 Biobank Japan individuals instead of 500 UK Biobank East Asians. The modified version of PolyPred+ was far less accurate than the original version (52% lower relative-R²; Supplementary Table 5). The mixing weights estimated in Biobank Japan assign much higher weight to the Biobank Japan training data (Supplementary Table 5), perhaps due to cohort effects; thus, it may be important to estimate PolyPred+ mixing weights using the target cohort (as opposed to the training cohort) if cohort effects are present. Fourth, we reduced the number of training samples from the target population used to estimate mixing weights (Nmix) from 500 to 100. PolyPred+ suffered slightly reduced accuracy, though the difference was not statistically significant (Supplementary Table 5). Fifth, we evaluated a prediction method using only the N=124K Biobank Japan individuals to train effect sizes (BOLT-LMM-BBJ). BOLT-LMM-BBJ substantially underperformed methods that use UK Biobank British training data (−27% vs. BOLT-LMM, −34% vs. PolyPred, −41% vs. PolyPred+; Supplementary Table 4). Finally, we computed standard errors of relative-R² using a jackknife over individuals 54 (instead of a genomic block-jackknife over SNPs). Standard errors computed using a jackknife over individuals were smaller, increasing the statistical significance of relative improvements of PolyPred+ vs. other methods (Supplementary Table 5).
We conclude that PolyPred+ further increases trans-ethnic prediction accuracy in the special case where non-European training data from the target population (or a closely related population) is available in large sample size. We emphasize that efforts to assess the benefit of incorporating non-European training data should analyze non-European training data from a cohort that is distinct from the target cohort, otherwise results may be inflated due to cohort effects.

Discussion
We have introduced PolyPred, a method for improving trans-ethnic polygenic risk prediction by incorporating causal effects in addition to tagging effects, addressing trans-ethnic LD differences. Across seven well-powered independent traits, PolyPred significantly increased prediction accuracy over BOLT-LMM by 32% in UK Biobank Africans and by 13% in Biobank Japan (with similar results vs. SBayesR). In the special case where a large training sample is available in the non-European target population (or a closely related population), we have introduced PolyPred+, which further incorporates the non-European training data, addressing MAF differences and causal effect size differences. PolyPred+ significantly increased prediction accuracy in UK Biobank East Asians over BOLT-LMM by 24% (and over PolyPred by 12%). We previously demonstrated that linearly combining PRS from European and non-European training samples improves trans-ethnic prediction accuracy 7 . However, these previous results did not incorporate causal effects and used P+T, which is highly inaccurate despite its widespread use 12-18,26,44-48 , as PolyPred obtained up to 164% greater accuracy than P+T. In conclusion, PolyPred and PolyPred+ substantially improve trans-ethnic polygenic prediction accuracy, ameliorating health disparities 13 . We have publicly released the PRS coefficients for all SNPs and traits analyzed under all evaluated methods (see URLs).
Although we substantially improved trans-ethnic PRS accuracy over the state of the art, prediction accuracy in non-Europeans is still substantially lower compared to Europeans, even within the UK Biobank. There are two reasons for the remaining accuracy gap. First, European sample sizes are still limited, which limits the ability of PolyFun-pred to estimate causal rather than tagging effects (mathematical theory guarantees perfect estimation of causal effect sizes in European cohorts under an infinite sample size if model assumptions hold 67 ). Second, non-European sample sizes are limited, which limits the ability of BOLT-LMM applied to non-European samples to estimate tagging effects. Even with an infinite European training sample, which allows estimating causal effects perfectly (thus addressing LD differences), prediction accuracy could still be higher for Europeans vs. non-Europeans due to trans-ethnic genetic correlations less than 1 25,68,69 and different allele frequencies (including population-specific SNPs) (Methods). Hence our theory and results confirm that larger non-European GWAS are the best way to further improve PRS accuracy in non-European populations 9-11,13,21 .
PolyPred requires individual-level genotypes, but researchers often cannot obtain them, necessitating methods that use summary statistics 70 . When only summary statistics are available, we recommend (1) using PolyPred-S (which linearly combines PolyFun-pred and SBayesR) instead of PolyPred (which linearly combines PolyFun-pred and BOLT-LMM); and (2) training both PolyFun-pred and SBayesR using either in-sample LD or a large (e.g. N>3K) LD reference panel from a population that is as close as possible to the GWAS population. In the absence of a large LD reference panel from the GWAS population (or a closely related population) we recommend using P+T together with the closest ancestral population from the 1000 Genomes project 50 . We have facilitated the use of in-sample LD for UK Biobank researchers by publicly releasing summary LD information for N=337K British-ancestry UK Biobank samples 29 . We emphasize that these recommendations for PRS differ from our recommendations for fine-mapping in our previous work 29 , where we suggested only using in-sample LD due to concerns about false-positive fine-mapped SNPs, which are not a primary concern when computing PRS.
Our results corroborate previous results that predictions within the UK Biobank are often more accurate than off-cohort predictions to the same target ancestry 62-64 . This raises the question of whether the higher within-UK Biobank prediction accuracy is inflated by cohort effects. Our analysis suggests that within-UK Biobank prediction accuracy is not inflated, because most of the off-cohort loss of accuracy is driven by heritability differences. These heritability differences could be driven by between-cohort factors such as differences in phenotype definitions 13 , different recruiting strategies 13 , or assay artifacts. Our results are consistent with recent results showing almost no loss of accuracy when applying PRS based on UK Biobank training data to other European-ancestry cohorts 41 . Importantly, our results suggest that factors that inflate within-cohort PRS accuracy 71 (such as cohort-specific GxE, cohort-specific indirect effects 72 , cohort-specific population structure, or cohort-specific assortative mating) are unlikely explanations for the observed accuracy differences between the UK Biobank and Biobank Japan.
Our work has several limitations, providing opportunities for future work. First, we did not evaluate a setting where the British training data, the non-British training data, and the target population are sampled from three different cohorts. However, we hypothesize that the relative improvement of PolyPred+ over PolyPred when applied to UK Biobank East Asians reflects a lower bound on the improvement in relative-R² that would have been obtained in such an experiment. Second, PolyPred is slower than alternative PRS methods, requiring over 1,000 hours of computation time to train (approximately 30 minutes for each of 2,763 3Mb loci), vs. less than 100 hours for BOLT-LMM. However, training can be parallelized, and also provides genome-wide fine-mapping results of direct interest 29 . Third, PolyPred may assign non-zero coefficients to SNPs that have not been imputed in the target sample (whereas BOLT-LMM uses only HapMap 3 SNPs, which are typically well-imputed 73 ), motivating the need for large trans-ethnic imputation panels. Fourth, our block-jackknife standard error estimates may be conservative, though they may be better suited for evaluating the sampling variance introduced by the training set (vs. individual-level jackknife, which assumes a fixed training set; see Methods). Fifth, our PRS do not capture effects from the HLA region, which explains a large proportion of the variance of several diseases and traits, owing to the very complex and long-range LD patterns in this region. Sixth, PolyPred+ requires a large training sample that is closely related to the target population. However, it is not clear exactly how large this sample should be (we currently recommend N>50K), or how to quantify genetic similarity between the training and target populations (as LD differences between populations are driven by divergence rather than genetic drift 51 ). Seventh, PolyPred ideally requires a small training sample from the target cohort to estimate mixing weights. Our results suggest that it is possible to improve trans-ethnic PRS accuracy even without such a training sample, by linearly combining PolyFun-pred and BOLT-LMM using mixing weights of either 25%/75% or 50%/50%, respectively. However, we caution that PRS linearly combined using fixed mixing weights may not always be well-calibrated. Eighth, it may be preferable to construct a European and a non-European PRS jointly 74 , rather than linearly combining a European and a non-European PRS as performed in PolyPred+. Ninth, it may be possible to improve PRS accuracy for admixed individuals by using European effect sizes for European alleles and non-European effect sizes for non-European alleles 16,17 . Tenth, trans-ethnic prediction accuracy may be improved by identifying SNP sets other than HapMap 3 that yield better prediction accuracy across cohorts. Eleventh, PolyPred may be able to estimate causal effect sizes more accurately by using a multi-ethnic fine-mapping method (instead of PolyFun-pred, which uses only European training data). Finally, PRS may implicitly capture GxE interactions, which may not be transferable across cohorts or ancestries 27,75 . Despite all these limitations, PolyPred and PolyPred+ provide a clear improvement for trans-ethnic polygenic risk prediction.

PolyPred and PolyPred+ methods
All methods in this paper use a linear PRS, i.e., $\hat{y} = \sum_i x_i \hat{\beta}_i$, where $\hat{y}$ is the PRS of an individual, $x_i$ is the number of minor alleles of SNP $i$ carried by that individual, and $\hat{\beta}_i$ is the estimated per-allele causal effect size of SNP $i$. The methods differ in the way they estimate $\hat{\beta}_i$.
PolyPred and PolyPred+ both combine the methods PolyFun-pred and BOLT-LMM. PolyFun-pred estimates $\hat{\beta}_i$ as the (approximate) posterior mean causal effect size of SNP $i$, as estimated by PolyFun + SuSiE 29 based on European training data, using 187 functional annotations to specify prior causal probabilities. BOLT-LMM estimates tagging effects (Supplementary Note) of HapMap 3 SNPs by applying BOLT-LMM 30,31 to European training data. BOLT-LMM treats the effect of each SNP as a random effect sampled from a mixture of two zero-mean normal distributions, whose variances and mixture weights are determined in a data-driven manner.
PolyPred computes the effect size of each SNP that is either in HapMap 3 or has a European MAF≥0.1% and INFO score≥0.6 as a weighted combination of (1) its PolyFun-pred effect size based on European training data; and (2) its BOLT-LMM effect size based on European training data:

$$\hat{\beta}_i^{\text{PolyPred}} = w_1 \hat{\beta}_i^{\text{PolyFun-EUR}} + w_2 \hat{\beta}_i^{\text{BOLT-EUR}},$$

where $\hat{\beta}_i^{\text{PolyFun-EUR}}$ is the PolyFun-pred approximate posterior mean causal effect size of SNP $i$ based on European training data, $\hat{\beta}_i^{\text{BOLT-EUR}}$ is the BOLT-LMM approximate posterior mean tagging effect size of SNP $i$ based on European training data (setting the effects of SNPs not in HapMap 3 to zero), and $w_1$, $w_2$ are mixing weights. PolyPred estimates the mixing weights via non-negative least squares estimation (i.e., least squares estimation constrained to produce non-negative estimates) based on training individuals from the target cohort. Specifically, PolyPred estimates the mixing weights by computing the PRS corresponding to the PolyFun-pred effect sizes (given by $\hat{y}^{\text{PolyFun-EUR}} = \sum_i x_i \hat{\beta}_i^{\text{PolyFun-EUR}}$) and the PRS corresponding to the BOLT-LMM effect sizes (given by $\hat{y}^{\text{BOLT-EUR}} = \sum_i x_i \hat{\beta}_i^{\text{BOLT-EUR}}$), and then fitting the mixing weights by regressing the true phenotypes of the training individuals in the target cohort on the PolyFun-pred and the BOLT-LMM PRSs. The use of non-negative least squares estimation guarantees that the correlation of the predicted phenotype with the true phenotype is at least as large as the smallest correlation obtained by the constituent predictors.
PolyPred+ computes the effect size of each SNP that is either in HapMap 3 or has a European MAF≥0.1% as a weighted combination of (1) its PolyFun-pred effect size based on European training data; (2) its BOLT-LMM effect size based on European training data; and (3) its effect size as estimated by applying BOLT-LMM to training data from the target population (or a closely related population):

$$\hat{\beta}_i^{\text{PolyPred+}} = w_1 \hat{\beta}_i^{\text{PolyFun-EUR}} + w_2 \hat{\beta}_i^{\text{BOLT-EUR}} + w_3 \hat{\beta}_i^{\text{BOLT-nonEUR}},$$

where $\hat{\beta}_i^{\text{BOLT-nonEUR}}$ is the BOLT-LMM approximate posterior mean tagging effect of SNP $i$ based on training data from the non-European population (and set to zero for SNPs that are not in HapMap 3), and $w_3$ is the mixing weight of $\hat{\beta}_i^{\text{BOLT-nonEUR}}$. The mixing weights are estimated as in PolyPred.
In practice, we apply PolyPred and PolyPred+ by linearly combining the PolyFun-pred PRS and the BOLT-LMM PRS (rather than linearly combining the SNP effect sizes). The two procedures are almost mathematically identical, with the only difference being that a linear combination of PRSs can also accommodate an intercept, which explicitly bias-corrects the PRS to the target population.
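The mixing-weight step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the released PolyPred implementation: the function name `fit_mixing_weights` and the toy data are hypothetical. Because only two or three PRSs are combined, the non-negative least squares problem is solved here by fitting ordinary least squares on every subset of predictors and keeping the best fit whose weights are all non-negative. The intercept is left unconstrained.

```python
import itertools
import numpy as np

def fit_mixing_weights(prs, y):
    """Non-negative least squares of phenotype y on the columns of prs.

    prs : (n_individuals, n_predictors) array, one column per constituent PRS
    y   : (n_individuals,) phenotypes of the target-cohort training individuals
    Returns (intercept, weights) with all weights >= 0.
    Hypothetical helper for illustration only.
    """
    prs = np.asarray(prs, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = prs.shape
    # Start from the intercept-only model (all mixing weights equal to zero).
    best_intercept = float(y.mean())
    best_weights = np.zeros(k)
    best_rss = float(np.sum((y - y.mean()) ** 2))
    # The constrained optimum equals the unconstrained fit on its own support,
    # so scanning all subsets of a few predictors finds it exactly.
    for size in range(1, k + 1):
        for subset in itertools.combinations(range(k), size):
            cols = list(subset)
            design = np.column_stack([np.ones(n), prs[:, cols]])
            coef, *_ = np.linalg.lstsq(design, y, rcond=None)
            if np.any(coef[1:] < 0):
                continue  # violates the non-negativity constraint
            rss = float(np.sum((y - design @ coef) ** 2))
            if rss < best_rss:
                best_rss, best_intercept = rss, float(coef[0])
                best_weights = np.zeros(k)
                best_weights[cols] = coef[1:]
    return best_intercept, best_weights

# Toy example with simulated (not real) data: 500 individuals, two PRSs.
rng = np.random.default_rng(0)
prs_polyfun = rng.normal(size=500)
prs_bolt = 0.6 * prs_polyfun + 0.8 * rng.normal(size=500)
pheno = 0.3 * prs_polyfun + 0.5 * prs_bolt + rng.normal(size=500)
both = np.column_stack([prs_polyfun, prs_bolt])
intercept, weights = fit_mixing_weights(both, pheno)
combined_prs = intercept + both @ weights
```

For PolyPred+ the same function would take a third column holding the PRS trained on non-European data. With many predictors a dedicated solver such as `scipy.optimize.nnls` would be the more usual choice.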
We applied PolyFun-pred in the same way that we applied PolyFun + SuSiE in our previous work 29 . Briefly, we applied PolyFun-pred across 2,763 overlapping 3Mb loci spanning 18,212,157 European MAF>0.1% imputed SNPs with INFO score>0.6 (excluding the HLA and two other long-range LD regions 29 ), assuming 10 causal SNPs per locus. We used summary statistics computed by BOLT-LMM, based on up to N=337,491 unrelated British-ancestry UK Biobank individuals, and using summary LD information estimated directly from the target samples. Full details are provided in ref. 29 . We note that the use of BOLT-LMM summary statistics is mathematically equivalent to regressing the target phenotypes on BOLT-LMM off-chromosome PRS prior to applying PolyFun + SuSiE 31 . In secondary analyses, we evaluated alternative versions of PolyFun-pred that assume a single causal SNP per locus (and hence do not require an LD reference panel 29 ) or a non-functionally-informed version that specifies the same prior causal probability to all SNPs in each locus.
Estimating relative-R² and its standard error
We measured prediction accuracy for each trait via a measure that we call relative-R², defined via the following computations:
1. Compute R²-PRS: the R² obtained via a linear predictor that includes PRS, age, sex, age*sex (if the correlation with age was <0.95), UK Biobank assessment center (defined as a set of dummy binary variables), genotyping array, 10 principal components (computed separately for each ancestry; see below), and dilution factor (for biochemical traits only).
2. Compute R²-noPRS, defined like R²-PRS but omitting the PRS.
3. Compute R²-PRS-BOLT-EUR, computed by applying BOLT-LMM to UK Biobank non-British Europeans as in step 1.
4. Compute R²-noPRS-EUR, computed by applying step 2 to non-British Europeans.
5. Compute relative-R² as (R²-PRS − R²-noPRS) / (R²-PRS-BOLT-EUR − R²-noPRS-EUR).
We note that relative improvement in relative-R² is the same as relative improvement in absolute difference in R² (i.e., in R²-PRS − R²-noPRS), because the denominator (R²-PRS-BOLT-EUR − R²-noPRS-EUR) can be regarded as a constant scaling factor.
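The five steps above can be sketched in a few lines. This is an illustration under stated assumptions rather than the code used in the study: the function names are hypothetical, covariates are assumed to be already encoded as numeric columns, and R² is taken to be the in-sample R² of an ordinary least squares fit.

```python
import numpy as np

def r_squared(y, predictors):
    """In-sample R^2 of an ordinary least squares fit of y on predictors plus an intercept."""
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), predictors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 1.0 - float(resid @ resid) / float(np.sum((y - y.mean()) ** 2))

def relative_r2(y_target, prs_target, cov_target, y_eur, prs_bolt_eur, cov_eur):
    """Incremental R^2 of a PRS in the target ancestry, divided by the
    incremental R^2 of the BOLT-LMM PRS in non-British Europeans."""
    r2_prs = r_squared(y_target, np.column_stack([prs_target, cov_target]))  # step 1
    r2_noprs = r_squared(y_target, cov_target)                               # step 2
    r2_prs_eur = r_squared(y_eur, np.column_stack([prs_bolt_eur, cov_eur]))  # step 3
    r2_noprs_eur = r_squared(y_eur, cov_eur)                                 # step 4
    return (r2_prs - r2_noprs) / (r2_prs_eur - r2_noprs_eur)                 # step 5
```

A value of 1 means that the PRS explains as much incremental variance in the target ancestry as BOLT-LMM does in non-British Europeans.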
We computed standard errors via genomic block-jackknife, partitioning the genome into 200 equally-sized consecutive loci and omitting each one in turn. We similarly computed standard errors of differences in relative-R² (e.g. vs. BOLT-LMM) via genomic block-jackknife, computing the difference after omitting each block in turn. In secondary analyses, we computed standard errors by applying jackknife over individuals from the test set. These analyses yielded much smaller standard errors in the UK Biobank, suggesting that genomic block-jackknife standard errors may be conservative, whereas individual-based jackknife estimates may be anti-conservative. We emphasize that individual-based jackknife explicitly assumes a fixed training set.
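A genomic block-jackknife of this kind can be sketched as follows. The sketch rests on assumptions that are not stated in the text: it uses the standard delete-one jackknife variance formula, it assumes the PRS contribution of each genomic block is available as one column of a matrix, and it uses the squared correlation between PRS and phenotype as a stand-in for relative-R². All function names are hypothetical.

```python
import numpy as np

def jackknife_se(leave_one_out_estimates):
    """Standard delete-one jackknife standard error from B leave-one-out estimates."""
    est = np.asarray(leave_one_out_estimates, dtype=float)
    b = len(est)
    return float(np.sqrt((b - 1) / b * np.sum((est - est.mean()) ** 2)))

def block_jackknife_r2_se(block_prs, y):
    """Standard error of the squared PRS-phenotype correlation.

    block_prs : (n_individuals, n_blocks) array; column b holds the PRS
                contribution of the SNPs in genomic block b
    y         : (n_individuals,) phenotypes
    """
    block_prs = np.asarray(block_prs, dtype=float)
    total = block_prs.sum(axis=1)
    loo_r2 = []
    for b in range(block_prs.shape[1]):
        prs_without_b = total - block_prs[:, b]  # PRS recomputed without block b
        loo_r2.append(np.corrcoef(prs_without_b, y)[0, 1] ** 2)
    return jackknife_se(loo_r2)
```

The same `jackknife_se` helper applies to a jackknife over individuals, with one leave-one-out estimate per test-set individual instead of one per genomic block.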
We meta-analyzed relative-R² via an inverse-variance weighted average, using the standard error of the relative-R² of BOLT-LMM in the target ancestry (as estimated via genomic block-jackknife). We meta-analyzed the standard error of relative-R² as the square root of the average of the trait-specific sampling variances divided by the number of traits (equivalently, the square root of the average sampling variance, divided by the square root of the number of traits). We meta-analyzed the difference in relative-R² vs. an alternative method (typically BOLT-LMM) in the same way. We computed p-values of differences in relative-R² via a Wald test, based on the block-jackknife standard error estimates (for single traits) and based on the meta-analyzed standard errors (for the meta-analyzed results).
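The meta-analysis and the Wald test can be sketched as below. The function names are hypothetical, and the two-sided normal approximation for the Wald p-value is an assumption of this sketch rather than a detail given in the text.

```python
import math
import numpy as np

def meta_analyze(estimates, ses, weight_ses):
    """Inverse-variance weighted average across traits.

    estimates  : per-trait relative-R^2 (or difference in relative-R^2)
    ses        : per-trait standard errors of those estimates
    weight_ses : per-trait standard errors used to form the weights
                 (in the text, those of BOLT-LMM in the target ancestry)
    """
    estimates = np.asarray(estimates, dtype=float)
    ses = np.asarray(ses, dtype=float)
    weights = 1.0 / np.asarray(weight_ses, dtype=float) ** 2
    meta_estimate = float(np.sum(weights * estimates) / np.sum(weights))
    # Square root of (average sampling variance / number of traits).
    meta_se = float(np.sqrt(np.mean(ses ** 2) / len(ses)))
    return meta_estimate, meta_se

def wald_p_value(estimate, se):
    """Two-sided Wald test p-value under a normal approximation."""
    z = estimate / se
    return math.erfc(abs(z) / math.sqrt(2.0))
```

For example, `meta_analyze([1.0, 3.0], [0.2, 0.2], [1.0, 2.0])` gives the first trait four times the weight of the second.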
We computed ancestry-specific regression slopes by regressing true phenotypes on the PRS (including an intercept) in each respective population. We computed the standard errors of regression slopes via genomic block-jackknife, using 200 jackknife blocks.

Cohorts Analyzed

UK Biobank
The UK Biobank is a UK-based population cohort 32 . We used version 3 of the imputed genotypes, as described in our previous work 29 . We computed ancestry-specific PCs for UK Biobank Africans, UK Biobank East Asians, and UK Biobank South Asians via plink 1.9 79 , restricting to SNPs with ancestry-specific MAF>5%, missingness<10%, and HWE p-value>10⁻¹⁰, LD-pruning with the command --indep-pairwise 1000 50 0.05, and restricting to unrelated individuals (kinship coefficient <0.05) from the target ancestry with missingness <10%. We used the UK Biobank provided PCs for UK Biobank Europeans.

Biobank Japan
BioBank Japan (BBJ) is a multi-institutional hospital-based biobank with DNA and serum samples from approximately 200,000 participants from 12 medical institutions in Japan 61 . The participants are mainly of Japanese ancestry and had been diagnosed with at least one of 47 diseases by physicians at the cooperating hospitals. Written informed consent was obtained from all the participants, as approved by the ethics committees of RIKEN Center for Integrative Medical Sciences and the Institute of Medical Sciences at the University of Tokyo.
We genotyped samples with either (i) the Illumina HumanOmniExpressExome BeadChip or (ii) a combination of the Illumina HumanOmniExpress and HumanExome BeadChips. We applied standard quality control criteria for both samples and variants as detailed elsewhere 80 . We then prephased genotypes with Eagle2 81 and imputed dosages with Minimac3 82 using 1000 Genomes Project Phase 3 (version 5) data (N=2,504) and Japanese whole-genome sequencing (WGS) data (N=1,037) as a reference 80 . We computed PCs using EIGENSOFT's smartpca 83 .
For phenotypes, we retrieved clinical medical records from the participating hospitals through interviews and a standardized questionnaire. We used 23 diseases and complex traits in Biobank Japan which are also analyzed in UK Biobank (Supplementary Table 3). We normalized quantitative phenotypes via inverse-rank normal transformation as described elsewhere 84 . We defined the 'autoimmune disease' trait in Biobank Japan as a union of Graves' disease and rheumatoid arthritis patients.
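An inverse-rank normal transformation of a quantitative phenotype can be sketched as follows. The exact variant used for Biobank Japan is described in ref. 84 and is not reproduced here: the rank offset of 3/8 (Blom's formula), the averaging of tied ranks, and the function name are assumptions of this sketch, and missing values are assumed to have been removed beforehand.

```python
from statistics import NormalDist
import numpy as np

def inverse_rank_normal(values, offset=3.0 / 8.0):
    """Map phenotype values to standard normal quantiles of their ranks."""
    x = np.asarray(values, dtype=float)
    n = len(x)
    # Average 1-based ranks, so that tied values receive the same rank.
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    last = np.cumsum(counts)      # rank of the last member of each tie group
    first = last - counts + 1     # rank of the first member of each tie group
    ranks = ((first + last) / 2.0)[inverse]
    # Rank-based quantiles strictly inside (0, 1).
    quantiles = (ranks - offset) / (n - 2.0 * offset + 1.0)
    std_normal = NormalDist()
    return np.array([std_normal.inv_cdf(float(q)) for q in quantiles])
```

The transformed values keep the ordering of the input and are symmetric around zero when there are no ties.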

Uganda-APCDR
Uganda-APCDR is a population-based cohort from the General Population Cohort (GPC), Uganda. We retrieved genotype and phenotype data through the African Partnership for Chronic Disease Research (APCDR) initiative via the European Genome-Phenome Archive (EGA), using EGAD00010000965 to access genotype data. Phenotype data were accessed via sftp from EGA (reference: DD_PK_050716 gwas_phenotypes_28Oct14.txt). The participants are from nine ethno-linguistic groups in sub-Saharan Africa and had been recruited from the study area located in southwestern Uganda in Kyamulibwa subcounty of Kalungu district, approximately 120 km from Entebbe town. These ethno-linguistic groups have diverse population structure with varying degrees of admixture between Eurasian and East African Nilo-Saharan ancestries, which has been extensively characterized elsewhere 85 . The detailed cohort demographics, sample collection, and processing were described previously 33,34 .
Briefly, the samples were genotyped using the Illumina HumanOmni 2.5M BeadChip at the Wellcome Trust Sanger Institute. We used the Ricopili pipeline to conduct pre-imputation QC and perform phasing and imputation 86 . Briefly, we phased the data using Eagle 2.3.5 81 and imputed variants using minimac3 82 in chunks ≥3Mb. The 1000 Genomes phase 3 haplotypes 50 were used as the reference panel for phasing and imputation.
As described previously, phenotypes were collected using a standard individual questionnaire, blood samples (laboratory tests), and biophysical measurements (height, weight, waist and hip circumferences and blood pressure) 33 . We normalized quantitative phenotypes via inverse-rank normal transformation.

UK Biobank Simulations
We simulated data based on real genotypes of UK Biobank individuals, using 250,963 MAF≥0.1% SNPs with INFO score≥0.6 on chromosome 22 (including short indels). To simulate data, we first computed the variance of the per-standardized-genotype effect for every SNP $i$ with annotations $a_i$ using the baseline-LF (version 2.2.UKB) model, $\mathrm{var}[\beta_i \mid a_i] = \sum_c \tau_c a_{ic}$, where $a_{ic}$ are annotations and the $\tau_c$ estimates are taken from a fixed-effects meta-analysis of 16 well-powered genetically uncorrelated (|rg|<0.2) UK Biobank traits (age of menarche, BMI, balding, bone mineral density, eosinophil count, FEV1/FVC ratio, forced vital capacity, hair color, height, platelet count, red blood cell distribution width, red blood cell count, systolic blood pressure, tanning, waist-hip ratio adjusted for BMI, white blood cell count), scaled such that $\sum_i \mathrm{var}[\beta_i \mid a_i]$ is the same across all traits (as detailed in ref. 29 ). Each SNP was set to be causal with probability proportional to $\mathrm{var}[\beta_i \mid a_i]$, such that the average causal probability was equal to the desired proportion of causal SNPs (0.1% and 0.3% by default).
We generated ancestry-specific effect sizes as follows. First, we generated a British per-allele causal effect size for each SNP $i$ via $\beta_i^{\text{British}} = b_i / \sqrt{2 f_i (1 - f_i)}$, where $b_i \sim N(0, h^2/M)$, $M$ is the number of causal SNPs, and $f_i$ is the maximal MAF of SNP $i$ among British, non-British European, South Asian, East Asian, or African UK Biobank individuals. Afterwards, for each of the main UK Biobank non-European ancestries (South Asian, East Asian, and African) we generated an ancestry-specific per-allele effect size via $\beta_i^{\text{ancestry}} = r \cdot \beta_i^{\text{British}} + \sqrt{1 - r^2}\,\epsilon_i$, where $r$ is the trans-ethnic genetic correlation (set to 0.8 by default, following previous works 25,68,69 ), and $\epsilon_i \sim N(0,1)$. The use of $f_i$ bounds the per-allele causal effect sizes by the MAF of the ancestry in which the SNP is most common, which guarantees that SNPs that are infrequent in Europeans but are common in other ancestries do not explain a very large proportion of heritability.
After generating ancestry-specific per-allele causal effect sizes, we generated a phenotype for every UK Biobank individual in each ancestry via $y = \sum_i x_i \beta_i + e$, where $x_i$ is the number of minor alleles of SNP $i$ carried by that individual, $\beta_i$ is the ancestry-specific per-allele causal effect size of SNP $i$, and $e \sim N(0, 1 - h^2)$ is the environmental component of the generated trait. We generated phenotypes based on dosage data from imputed genotypes, using Plink 2.0 87,88 . We used self-reported ancestry based on UK Biobank data field 21000 (Ethnic background). We considered Irish-ancestry as a non-British European ancestry.
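The simulation procedure can be sketched as follows. This is not the simulation code used in the study: the function names are hypothetical, causal probabilities are clipped to [0, 1] (so their average can fall slightly below the requested proportion), and the ancestry-specific component is drawn on the same standardized scale as the British effect so that both ancestries have effects of equal variance. That last scaling choice is an assumption of this sketch.

```python
import numpy as np

def simulate_effects(prior_var, max_maf, h2, prop_causal, r_g, rng):
    """Draw British and ancestry-specific per-allele causal effect sizes.

    prior_var   : per-SNP effect variance from the annotation model
    max_maf     : per-SNP maximal MAF across ancestries
    h2          : heritability of the simulated trait
    prop_causal : desired proportion of causal SNPs
    r_g         : trans-ethnic genetic correlation
    rng         : numpy random Generator
    """
    prior_var = np.asarray(prior_var, dtype=float)
    max_maf = np.asarray(max_maf, dtype=float)
    n_snps = len(prior_var)
    # Causal probability proportional to the prior variance.
    p_causal = np.clip(prior_var / prior_var.mean() * prop_causal, 0.0, 1.0)
    causal = rng.random(n_snps) < p_causal
    m = max(int(causal.sum()), 1)
    sd = np.sqrt(h2 / m)
    b_british = np.where(causal, rng.normal(0.0, sd, n_snps), 0.0)
    # Assumption: independent component on the same scale as b_british.
    b_indep = np.where(causal, rng.normal(0.0, sd, n_snps), 0.0)
    b_ancestry = r_g * b_british + np.sqrt(1.0 - r_g ** 2) * b_indep
    # Convert per-standardized-genotype effects to per-allele effects.
    scale = np.sqrt(2.0 * max_maf * (1.0 - max_maf))
    return b_british / scale, b_ancestry / scale

def simulate_phenotype(dosages, beta, h2, rng):
    """y = sum_i x_i * beta_i + e, with e ~ N(0, 1 - h2)."""
    genetic = np.asarray(dosages, dtype=float) @ beta
    return genetic + rng.normal(0.0, np.sqrt(1.0 - h2), len(genetic))
```

Using the maximal MAF across ancestries in `scale` keeps per-allele effects of SNPs that are rare in one ancestry but common in another from becoming very large.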
We trained all methods using 337,491 unrelated British-ancestry individuals 32 , and we estimated the mixing weights of PolyPred using up to 500 additional individuals from each of the four non-British ancestries (Nmix≤500). We computed summary statistics by applying linear regression via Plink 2.0. We did not evaluate PolyPred+ in the simulations because of the relatively small sample sizes of the UK Biobank non-European populations.
We evaluated prediction accuracy via R², using held-out individuals that were not included in the training sets, using 42K non-British Europeans, 7.7K South Asians, 0.9K East Asians, and 6.2K Africans. We computed PRSs by applying plink 2.0 with the --score command, using imputed dosage data (rather than hard-called SNP values). We computed standard errors via a jackknife over simulations.
We trained BOLT-LMM by applying BOLT-LMM v2.3.4 to plink files of HapMap 3 SNPs (hard-called from imputed dosages), using the same covariates specified in the "Estimating relative-R² and its standard error" Methods subsection, and specifying the flag --predBetasFile to report PRS coefficients.
We trained SBayesR using 10,000 iterations, 4,000 burn-in iterations, using values from 10% of the iterations to compute posterior means, using the HapMap 3 LD files published by the SBayesR authors 43 . By default we ran SBayesR using a mixture of four distributions (using π = [0.95,0.02,0.02,0.01] and γ = [0,0.01,0.1,1]). In case SBayesR failed with these parameters, we iteratively shrank the last entry in the vector by 50% until it was smaller than 10 , at which point we removed the last mixture component and redefined π such that the first entry was equal to 0.95 and all other entries had the same value such that all values sum to 1.0. We did not evaluate other summary statistic based methods such as LDpred 52 and PRS-CS 53 in the simulations, because we determined that SBayesR outperforms these methods in real trait analysis (as described in the Results).
We trained P+T by applying plink with the command --clump-r2 0.5 --clump-kb 250 with various values of --clump-p1 (following ref. 13 ), and using 10,000 randomly selected unrelated UK Biobank British individuals to compute LD. We estimated LD using 10,000 individuals to balance between runtime and accuracy (noting that P+T is relatively insensitive to the LD reference panel size compared to the other methods evaluated in this manuscript). We used summary statistics based on BOLT-LMM, using marginal effect sizes derived from reported χ² values (i.e., the square root of the χ² statistic divided by the square root of the BOLT-LMM effective sample size 29 and multiplied by the sign of the effect size estimated by the infinitesimal version of BOLT-LMM), because the non-infinitesimal version of BOLT-LMM does not estimate effect sizes. We used the best value of --clump-p1 (out of the evaluated values 10⁻², 10⁻³, 10⁻⁴, 10⁻⁶, 5×10⁻⁸) based on the test set phenotypes, which leads to anti-conservative prediction accuracy estimates for P+T.
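The conversion from association statistics to signed marginal effect sizes can be sketched as below. The sketch assumes that the reported values are χ² association statistics and that the resulting effects are on the standardized-genotype scale; the function name and argument names are hypothetical.

```python
import numpy as np

def marginal_effects_from_chisq(chisq, n_eff, sign_source):
    """Signed marginal effect sizes from chi-square association statistics.

    chisq       : per-SNP chi-square statistics
    n_eff       : effective sample size
    sign_source : per-SNP effect estimates used only for their sign
                  (in the text, those of the infinitesimal model)
    """
    chisq = np.asarray(chisq, dtype=float)
    return np.sign(sign_source) * np.sqrt(chisq) / np.sqrt(n_eff)
```

For instance, a SNP with a χ² statistic of 4 and an effective sample size of 100 receives an absolute effect of 0.2, with the sign taken from `sign_source`.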
When changing the training sample size, we kept the LD reference panel sample size fixed to alleviate computational costs. We selected traits by ranking them according to their heritability in UK Biobank non-British Europeans (estimated as in ref. 29 ), such that no selected trait had $|r_g|>0.3$ with a previously selected trait.
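The trait selection described above resembles a greedy filter, sketched below. This is a minimal sketch under stated assumptions rather than the authors' procedure: it assumes that traits are visited in order of decreasing heritability and that a candidate is skipped when its absolute genetic correlation with any already-selected trait exceeds 0.3, so that the selected traits are approximately independent. The function name, the input layout, and the toy values are hypothetical.

```python
def select_independent_traits(h2, rg, max_abs_rg=0.3):
    """Greedily select traits by decreasing heritability.

    h2 -- dict mapping trait name to heritability estimate
    rg -- dict mapping frozenset({trait1, trait2}) to genetic correlation
    A trait is kept only if |rg| with every already-selected trait is at most
    max_abs_rg.
    """
    selected = []
    for trait in sorted(h2, key=h2.get, reverse=True):
        if all(abs(rg[frozenset((trait, kept))]) <= max_abs_rg for kept in selected):
            selected.append(trait)
    return selected

# Toy inputs for illustration only (not values from the study).
toy_h2 = {"trait_A": 0.50, "trait_B": 0.40, "trait_C": 0.30}
toy_rg = {
    frozenset(("trait_A", "trait_B")): 0.60,
    frozenset(("trait_A", "trait_C")): 0.10,
    frozenset(("trait_B", "trait_C")): 0.80,
}
print(select_independent_traits(toy_h2, toy_rg))  # ['trait_A', 'trait_C']
```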
We computed ancestry-specific SNP heritabilities in each UK Biobank ancestry by applying GCTA 57 to sets of unrelated individuals using hard-called HapMap 3 SNPs (using a random set of 10,000 individuals for non-British Europeans to facilitate the computations). We did not use more advanced methods 89 because of the relatively small sample sizes. We meta-analyzed ancestry-specific SNP heritabilities by averaging the estimated heritabilities, and we meta-analyzed their standard errors via the square root of the average sampling variance, divided by the square root of the number of traits.
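The meta-analysis rule above can be written out in a few lines. This is a sketch with hypothetical names, assuming one heritability estimate and one standard error per trait.

```python
import math

def meta_analyze(estimates, std_errors):
    """Return (mean estimate, meta-analyzed standard error).

    The standard error is sqrt(mean of squared standard errors) / sqrt(n),
    where n is the number of traits.
    """
    n = len(estimates)
    if n == 0 or n != len(std_errors):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_estimate = sum(estimates) / n
    mean_variance = sum(se ** 2 for se in std_errors) / n
    return mean_estimate, math.sqrt(mean_variance) / math.sqrt(n)
```

For two traits with standard errors 0.03 and 0.04, the average sampling variance is 0.00125 and the meta-analyzed standard error is 0.025.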
We trained all PRS methods on UK Biobank unrelated British-ancestry individuals (average N=325K) as described in the Methods subsection "UK Biobank simulations", but using summary statistics generated by BOLT-LMM when applied to UK Biobank British-ancestry individuals, as described in our previous work 29 . For SBayesR we used the summary statistics of the infinitesimal version of BOLT-LMM, which yielded far superior accuracy (results not shown), possibly indicating that the non-infinitesimal version of BOLT-LMM violates some of the assumptions underlying the SBayesR model. We trained P+T separately for each non-UK Biobank cohort by restricting the set of SNPs considered to those available in the target cohort. We computed the contribution of PolyFun-pred (resp. BOLT-LMM) towards PolyPred via the ratio of the mixing weight of PolyFun-pred (resp. BOLT-LMM) to the sum of the mixing weights of PolyFun-pred and of BOLT-LMM.
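The contribution of each component predictor can be illustrated as a simple normalization of the two mixing weights. This sketch uses hypothetical names and assumes that the denominator is the sum of the mixing weights of the two component predictors (PolyFun-pred and BOLT-LMM).

```python
def relative_contributions(w_polyfun_pred, w_bolt_lmm):
    """Return each predictor's share of the summed mixing weights."""
    total = w_polyfun_pred + w_bolt_lmm
    if total == 0:
        raise ValueError("mixing weights sum to zero")
    return w_polyfun_pred / total, w_bolt_lmm / total
```

With weights 3.0 and 1.0, for example, the shares are 0.75 and 0.25.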
In analysis sets (i) and (iii), we computed a PRS for each UK Biobank individual using imputed dosage data as described in the "UK Biobank simulations" Methods subsection. In analysis set (ii), we computed a PRS for each individual in Biobank Japan and in Uganda-APCDR using imputed dosage data and PRS coefficients from UK Biobank Europeans using Plink 2.0 87,88 .
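Conceptually, applying PRS coefficients to imputed dosages amounts to a weighted sum over SNPs. The sketch below illustrates this with hypothetical names and data layout; it is not the Plink 2.0 implementation, and it assumes that SNPs absent from the target cohort are simply skipped.

```python
def polygenic_score(dosages, betas):
    """Weighted sum of effect-allele dosages for one individual.

    dosages -- dict mapping SNP id to imputed effect-allele dosage (0 to 2)
    betas   -- dict mapping SNP id to PRS coefficient
    SNPs with a coefficient but no dosage in the target data are skipped.
    """
    return sum(beta * dosages[snp] for snp, beta in betas.items() if snp in dosages)
```

In practice the dosages and coefficients must refer to the same effect allele for each SNP; allele matching is omitted here for brevity.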
In secondary analyses of analysis set (i) we also evaluated PRS-CS 53 and LDpred 52 . We trained PRS-CS using summary statistics from the infinitesimal version of BOLT-LMM (as in SBayesR) with the parameters a=1, b=0.5, thin=5, n_iter=10000, n_burnin=500, and using three different values of phi: 0.01, 0.0001, or none (corresponding to PRS-CS-auto). We selected the best value of phi using the test set data, yielding anti-conservative accuracy estimates as in P+T. We trained LDpred using HapMap 3 SNPs and two different LD reference panels: 1000 Genomes 50 and UK10K 49 . We ran LDpred with summary statistics from the infinitesimal version of BOLT-LMM (as in SBayesR) and default parameters, specifying --ldr 400. We used the value of "--F" (corresponding to the assumed proportion of causal SNPs, using all of the default evaluated values) that yielded the best prediction accuracy in the test set, yielding anti-conservative accuracy estimates as in P+T.
In analysis sets (ii) and (iii), we trained BOLT-LMM-BBJ (BOLT-LMM trained using Biobank Japan training data) (average N=124K). We selected individuals for training BOLT-LMM-BBJ as described in our previous work 13 , but excluding a random subset of 5,000 individuals that were used for evaluating prediction accuracy.

Loss of accuracy under an infinite European training sample
Under an infinite European training sample, the ratio between the prediction accuracy ($R^2$) in a European sample and in a non-European sample can be approximated as a function of the following quantities: the trans-ethnic genetic correlation; the heritabilities ($h^2$) in the non-European and the European populations; the minor allele frequencies of each causal SNP in the non-European and the European populations; and the variances of the polygenic risk scores, var(PGS), in the non-European and the European populations. This approximation is directly derived from Equation 1 in ref. 14 after assuming that causal SNPs are approximately not in LD with each other, and that the predictor SNPs are the causal SNPs under an infinite sample size.

Tables

Table 1: Summary of main methods evaluated. For each method we report the set of SNPs analyzed in model training (and its size when restricted to imputed UK Biobank SNPs with European MAF≥0.1% and INFO score≥0.6), the training data analyzed, whether it incorporates fine-mapped effect sizes (as opposed to tagging effect sizes), and the corresponding reference.

The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

medRxiv preprint doi: https://doi.org/10.1101/2021.01.19.21249483; this version posted January 20, 2021.

Figure 2: Trans-ethnic PRS results for real UK Biobank traits. We report average prediction accuracy (relative-$R^2$; see main text), meta-analyzed across 7 well-powered, independent traits, for PRS trained in UK Biobank British samples (average N=325K) and applied to 4 UK Biobank target populations. Target population sample sizes are indicated in parentheses; PolyPred used 500 additional training samples from each target population to estimate mixing weights. Asterisks above each bar denote statistical significance of the difference vs. BOLT-LMM, with black asterisks denoting an advantage and red asterisks denoting a disadvantage (*P<0.05; **P<0.001). Error bars denote standard errors. Numerical results, results for all 49 traits analyzed, absolute prediction accuracies ($R^2$), and P-values of relative improvements vs. BOLT-LMM are reported in Supplementary Tables 4-5.

Figure 3: Trans-ethnic PRS results for Biobank Japan and Uganda-APCDR traits. We report average prediction accuracy (relative-$R^2$; see main text), meta-analyzed across 7 well-powered, independent traits, for PRS trained in UK Biobank British samples (average N=325K) and applied to Biobank Japan and Uganda-APCDR target populations. Target population sample sizes are indicated in parentheses; PolyPred used 500 additional training samples from each target population to estimate mixing weights. Asterisks above each bar denote statistical significance of the difference vs. BOLT-LMM, with black asterisks denoting an advantage and red asterisks denoting a disadvantage (*P<0.05; **P<0.001). Error bars denote standard errors. Numerical results, results for all 23 traits analyzed, absolute prediction accuracies ($R^2$), and P-values of relative improvements vs. BOLT-LMM are reported in Supplementary Table 7.

Figure 4: Trans-ethnic PRS results for UK Biobank East Asians using PolyPred+. We report average prediction accuracy (relative-$R^2$; see main text), meta-analyzed across 7 well-powered, independent traits, for PRS trained in UK Biobank British (average N=325K) and Biobank Japan samples (average N=124K; used by PolyPred+ only) and applied to UK Biobank East Asians. The target population sample size is indicated in parentheses; PolyPred and PolyPred+ used 500 additional training samples from the target population to estimate mixing weights. Asterisks above each bar denote statistical significance of the difference vs. BOLT-LMM, with black asterisks denoting an advantage and red asterisks denoting a disadvantage (*P<0.05; **P<0.001). Error bars denote standard errors. Numerical results, results for all 23 traits analyzed, absolute prediction accuracies ($R^2$), and P-values of relative improvements vs. BOLT-LMM are reported in Supplementary Table 4.