Assessment of the evidence yield for the calibrated PP3/BP4 computational recommendations

Purpose: To investigate the number of rare missense variants observed in human genome sequences by ACMG/AMP PP3/BP4 evidence strength, following the calibrated PP3/BP4 computational recommendations. Methods: Missense variants from the genome sequences of 300 probands from the Rare Genomes Project with suspected rare disease were analyzed using computational prediction tools able to reach PP3_Strong and BP4_Moderate evidence strengths (BayesDel, MutPred2, REVEL, and VEST4). The numbers of variants at each evidence strength were analyzed across disease-associated genes and genome-wide. Results: From a median of 75.5 rare (≤1% allele frequency) missense variants in disease-associated genes per proband, a median of one reached PP3_Strong, 3–5 PP3_Moderate, and 3–5 PP3_Supporting. Most were allocated BP4 evidence (median 41–49 per proband) or were indeterminate (median 17.5–19 per proband). Extending the analysis to all protein-coding genes genome-wide, the number of PP3_Strong variants increased approximately 2.6-fold compared to disease-associated genes, with a median per proband of 1–3 PP3_Strong, 8–16 PP3_Moderate, and 10–17 PP3_Supporting. Conclusion: A small number of variants per proband reached PP3_Strong and PP3_Moderate in 3,424 disease-associated genes, and though not the intended use of the recommendations, also genome-wide. Use of PP3/BP4 evidence as recommended from calibrated computational prediction tools in the clinical diagnostic laboratory is unlikely to inappropriately contribute to the classification of an excessive number of variants as Pathogenic or Likely Pathogenic by ACMG/AMP rules.


INTRODUCTION
Genetic testing identifies many variants of uncertain significance (VUS), and most coding VUS are missense (non-synonymous) variants. In the 2015 recommendations, in silico evidence (PP3 and BP4) was capped at "Supporting" strength for or against pathogenicity. 2 Furthermore, the recommendations did not specify which prediction tools or thresholds to use, which allowed non-standardized application of the criteria and led to inconsistencies in variant classification between clinical diagnostic laboratories. 3
Recently, Pejaver et al. (2022) refined the use of computational prediction tools to provide evidence of pathogenicity using the Bayesian adaptation of the ACMG/AMP framework. 4,5 For 13 computational prediction tools frequently used in clinical workflows, they introduced evidence-based calibrated thresholds corresponding to "Supporting," "Moderate," "Strong," and "Very Strong" PP3/BP4 evidence strengths and defined an indeterminate range. These thresholds demonstrated that the initial framework underweighted evidence from computational prediction tools, as many were able to provide evidence beyond "Supporting" strength.
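As an illustration of how calibrated score thresholds translate into evidence strengths, the following minimal Python sketch bins a prediction score into a PP3 strength, a BP4 strength, or the indeterminate range. The threshold values and the function name `evidence_strength` are hypothetical placeholders for an imaginary tool scoring between 0 and 1. They are not the published calibrated thresholds, which must be taken from the calibration itself for each tool.

```python
# HYPOTHETICAL thresholds for an imaginary tool scoring in [0, 1].
# These are placeholders for illustration only, NOT published calibrated values.
# Benign side: a score at or below the bound earns the label (strongest first).
BP4_UPPER_BOUNDS = [
    (0.05, "BP4_Strong"),
    (0.15, "BP4_Moderate"),
    (0.30, "BP4_Supporting"),
]
# Pathogenic side: a score at or above the bound earns the label (strongest first).
PP3_LOWER_BOUNDS = [
    (0.95, "PP3_Strong"),
    (0.80, "PP3_Moderate"),
    (0.65, "PP3_Supporting"),
]


def evidence_strength(score):
    """Return the evidence label for one score, or 'Indeterminate'."""
    for bound, label in BP4_UPPER_BOUNDS:
        if score <= bound:
            return label
    for bound, label in PP3_LOWER_BOUNDS:
        if score >= bound:
            return label
    # Scores between the weakest BP4 and weakest PP3 bounds carry no evidence.
    return "Indeterminate"
```

Checking the strongest interval first on each side ensures that a variant receives only the single highest strength it qualifies for.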
Since the release of the PP3/BP4 recommendations, we have received questions from users regarding the key steps to implementation, calling for practical guidance on the intended use of the PP3/BP4 recommendations for variant curation in disease-associated genes (see Box 1). In particular, concerns have arisen due to the impression that an excessive number of variants are reaching PP3_Strong. Here, by demonstrating the level of PP3/BP4 evidence allocated to rare missense variants in the genome sequences of patients with rare disease, we specifically aimed to address these concerns.

Can the calibration of these methods be trusted?
- PP3/BP4 are empirically calibrated evidence codes
- Confounders that could be addressed directly were eliminated
- As with any approach, it is expected that the evidence strength provided will be too high or too low for some variants when applying the calibrated PP3/BP4 codes
- The calibrated codes have been extensively validated, including in this study
What are some of the limitations to the calibration?
- Variants used for the calibration may not be representative of novel variants to be classified
- Computational prediction tools were assumed not to have had a major role in the classification of variants used in the calibration
- The calibration provides the evidence strength, on average, across the thousands of genes assessed; however, it is a probability that will vary across genes
Will more calibrations need to be performed in the future?
- New and revised methods will require independent calibration
More detailed answers to these questions are provided in the Supplemental Material.

METHODS
Study participants and data.
and genome-wide. These methods were also repeated for missense variants according to the VEP "most severe consequence" across all transcripts (see Supplemental Material).
Statistical analyses. Proportions between two groups were compared with two-tailed binomial tests with Bonferroni correction for multiple testing. Bootstrap resampling with replacement (1,000 iterations) was performed to provide a 95% confidence interval (CI) for the mean.
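The statistical procedures named above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the code used in the study: the percentile method for the bootstrap CI, the exact two-tailed binomial test that sums the probabilities of all outcomes no more likely than the observed one, and all function names are assumptions, because the text does not specify these details.

```python
import math
import random


def bootstrap_mean_ci(values, n_iter=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean (the percentile method is an assumption)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement n_iter times and record each resample's mean.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_iter))
    lower = means[int((alpha / 2) * n_iter)]
    upper = means[int((1 - alpha / 2) * n_iter) - 1]
    return lower, upper


def binom_two_tailed_p(k, n, p):
    """Exact two-tailed binomial p-value for k successes in n trials under null p."""
    pmf = [math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = pmf[k]
    # Sum outcomes that are no more likely than the observed one;
    # the small tolerance guards against floating-point ties.
    p_value = sum(x for x in pmf if x <= observed * (1 + 1e-7))
    return min(1.0, p_value)


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

For example, `bootstrap_mean_ci(counts)` would take a list of per-proband variant counts, and `bonferroni` would be applied to the list of p-values from all pairwise comparisons.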

RESULTS
Detection of missense variants in disease-associated genes. 1B and Figure S1). ClinVar provides classifications for 54-72% of the unique variants with PP3_Strong evidence per prediction tool, of which 12-29% are currently reported as P/LP, 63-.
Using a less stringent AF threshold of ≤5% resulted in a slight increase in the number of variants with PP3 evidence in disease-associated genes (median = 1 PP3_Strong, median = 4-6 PP3_Moderate, median = 5-6 PP3_Supporting) (Table S3). We also detected variants using the VEP "most severe consequence" across all transcripts, a rare disease analysis approach that is sometimes used instead of MANE Select transcripts alone to increase the detection of potentially deleterious missense variants in alternative transcripts. With this approach, we again did not see many more variants reaching PP3_Supporting to PP3_Strong in disease-associated genes (median = 1 PP3_Strong, median = 3-5 PP3_Moderate, median = 4-5 PP3_Supporting) (Table S4).
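The comparison of per-proband medians at different AF thresholds can be sketched as follows. The sketch applies the genotype-level filters stated in the Methods (genotype quality ≥40, depth ≥10, allele balance ≥0.2) and requires both the global and popmax AF to be at or below the chosen threshold. The dictionary keys (`proband`, `gq`, `dp`, `ab`, `af_global`, `af_popmax`) and the function names are hypothetical and chosen for illustration only.

```python
from collections import Counter
from statistics import median


def passes_qc(variant, min_gq=40, min_dp=10, min_ab=0.2):
    """Genotype-level filters: genotype quality, depth, and allele balance."""
    return (
        variant["gq"] >= min_gq
        and variant["dp"] >= min_dp
        and variant["ab"] >= min_ab
    )


def is_rare(variant, max_af=0.01):
    """Both global and popmax allele frequency must be at or below max_af."""
    return variant["af_global"] <= max_af and variant["af_popmax"] <= max_af


def median_variants_per_proband(variants, proband_ids, max_af=0.01):
    """Median count of retained variants per proband (zero counts included)."""
    counts = Counter(
        v["proband"] for v in variants if passes_qc(v) and is_rare(v, max_af)
    )
    return median(counts.get(p, 0) for p in proband_ids)
```

Passing `max_af=0.05` instead of the default `0.01` reproduces the less stringent threshold. Probands with no retained variants are counted as zero so that they still contribute to the median.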
"-" indicates that the given prediction tool is not able to provide BP4 evidence of this strength. .

DISCUSSION
The use of computational prediction tools to provide evidence of pathogenicity and benignity within the ACMG/AMP framework was recently refined by Pejaver et al., 4 and certain prediction tools were found capable of reaching "Strong" and "Very Strong" evidence for the PP3 and BP4 codes, respectively. These changes were expected to have important implications for the final classification of missense variants in the clinical diagnostic setting, given that the codes were previously capped at "Supporting" and could only be applied if "Multiple" lines of computational evidence support a deleterious effect on the gene or gene product. 2 Through various scientific meetings and interactions following release of the recommendations, concerns were raised due to the impression that an excessive number of PP3_Strong variants are generated. To explore these concerns, we assessed the observed number of rare missense variants by PP3/BP4 evidence strength in the genome sequences of 300 research participants with rare disease.
In our analyses, at ≤1% AF, a standard threshold in rare disease analysis, we found a median of one PP3_Strong variant per individual (range 0-4) across ~950 of over .
To better understand why users reported an excess of PP3_Strong variants, we also extended our analyses to more frequent variants up to 5% AF, the threshold for stand-alone evidence of benignity in the ACMG/AMP guidance, and to variants that are missense on alternative transcripts (VEP "most severe consequence"). These analyses did not result in a considerable increase in the number of PP3_Strong variants. Furthermore, though Pejaver et al. made no recommendation about running computational prediction tools genome-wide, given that the thresholds are calibrated for disease-associated genes only, we applied the same thresholds to variants genome-wide. We found an approximately 2.6-fold increase in the number of PP3_Strong variants genome-wide compared to within disease-associated genes only, consistent with the genome having ~5-fold as many genes as are covered by ACMG/AMP classification rules and the prior for pathogenicity genome-wide being ~5-fold lower (~1%) 15,18 than for disease-associated genes (~4.5%).
Importantly, deleterious in silico prediction does not equate to pathogenicity and, in the absence of additional evidence, one line of "Strong" evidence from the PP3 code classifies a variant as a VUS in the ACMG/AMP framework. In the case that a variant does reach P or LP classification in combination with other codes, there is a 99% or 90% posterior probability of pathogenicity, respectively, which implies that 1-10% of variants may not actually be causative. Moreover, computational prediction tools were assumed not to have played a major role in the classification of the variants used for the calibration. Given these limitations, we appeal to .
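The statement that a single line of "Strong" evidence leaves a variant as a VUS can be illustrated with a short worked calculation. The sketch below assumes the commonly cited parameters of the Bayesian adaptation of the ACMG/AMP framework: a prior probability of 0.10, odds of pathogenicity of 350 for "Very Strong" evidence, and a halving of the exponent for each lower strength. These values, the classification cut-offs, and all names in the code are assumptions that are not stated in this text, and the sketch handles pathogenic evidence only.

```python
# ASSUMED parameters (not given in this text): prior of 0.10 and
# odds of pathogenicity of 350 for "Very Strong" evidence.
PRIOR = 0.10
ODDS_VERY_STRONG = 350.0
# Each step down in strength halves the exponent applied to the odds.
EXPONENT = {"Supporting": 1 / 8, "Moderate": 1 / 4, "Strong": 1 / 2, "Very Strong": 1.0}


def combined_odds(pathogenic_evidence):
    """Combine pathogenic evidence strengths into one odds of pathogenicity."""
    return ODDS_VERY_STRONG ** sum(EXPONENT[s] for s in pathogenic_evidence)


def posterior(odds, prior=PRIOR):
    """Posterior probability of pathogenicity from combined odds and the prior."""
    return odds * prior / ((odds - 1) * prior + 1)


def classify(probability):
    """Map a posterior probability to a five-tier classification (assumed cut-offs)."""
    if probability >= 0.99:
        return "P"
    if probability >= 0.90:
        return "LP"
    if probability >= 0.10:
        return "VUS"
    if probability >= 0.001:
        return "LB"
    return "B"
```

Under these assumptions, `classify(posterior(combined_odds(["Strong"])))` gives a posterior of about 0.68, which falls in the VUS range, whereas adding two lines of "Moderate" evidence raises the posterior above 0.90.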

1
Limited availability of functional data means that it is often necessary to turn to computational in silico prediction tools for evidence of deleteriousness. The American College of Medical Genetics and Genomics and the Association for Molecular Pathology (ACMG/AMP) provide a sequence variant classification (SVC) framework to combine distinct lines of evidence of pathogenicity or benignity of varying strengths to reach a final variant classification (Benign [B], Likely Benign [LB], VUS, Likely Pathogenic [LP], or Pathogenic [P]).

Figure 1. A. Rare (≤1% AF) missense variants in disease-associated genes per proband by PP3 evidence strength for analyzed computational prediction tools. B. Rare (≤1% AF) missense variants in disease-associated genes with PP3 evidence per proband by evidence strength and reported mode of inheritance (AD-only and AR-only) for analyzed computational tools. Boxplots correspond to the first, second, and third quartile of data, with whiskers denoting 1.5 × IQR. Outliers are displayed as individual points.
of disease. The PP3/BP4 codes should be used within the framework of the ACMG/AMP recommendations, including the updates that have been made by ClinGen, to determine the pathogenicity of a variant. Code combination does, however, require great care, and there are a number of important caveats. In particular, (meta)predictors may use data partially captured by other codes, notably key domains, critical residues, and population AF, increasing the risk of double-counting evidence (see Supplemental Material for further recommendations on code combination). The PP3/BP4 calibration by Pejaver et al. does have limitations. It was performed on variants classified in the past several years that were not used in the training sets of the analyzed prediction tools and may be non-representative of novel variants to be classified.

Box 1. Key steps in implementing the PP3/BP4 missense variant recommendations
How should PP3/BP4 evidence be used for missense variants?
Genome sequencing (GS) data were obtained from the Rare Genomes Project (RGP) at the Broad Institute of MIT and Harvard. Participant demographics are displayed in Table S1. Sequencing was performed on DNA purified from blood by the Broad Institute Genomics Platform on an Illumina sequencer to 30x average depth. Raw sequence reads were aligned to the GRCh38 reference genome. Variants were called with GATK version 4.1.8.0 7 in the form of single nucleotide variants (SNVs) and small insertions/deletions (indels). Variants were filtered at the site level with GATK Variant Quality Score Recalibration (VQSR).
Missense variant extraction and annotation. Missense variants identified by the Ensembl Variant Effect Predictor (VEP) 8 using MANE Select transcripts 9 were extracted from the GS data. Only variants with genotype quality ≥40, depth ≥10, and allele balance ≥0.2 were retained for analyses. Allele frequency (AF) thresholds of ≤5% and ≤1% were applied to both the global AF and the population-max ("popmax") AF (the highest allele frequency for non-bottlenecked populations) in gnomAD v3.1.2 genomes. 10 Precomputed scores from four in silico (meta)predictors that were able to reach PP3_Strong and BP4_Moderate in the Pejaver et al. calibration were included in the analysis. BayesDel (without minor allele frequency), Database (last accessed Jul 21, 2023) (3,424 genes: 1,004 autosomal dominant only [AD-only], 1,903 autosomal recessive only [AR-only], 517 other [includes genes that are both AD and AR]) 16
The GS dataset included 300 probands with rare disease. Across protein-coding genes genome-wide, a median of 8,781.5 variants per proband (range 8,383-10,616) passing QC thresholds were detected in MANE Select transcripts. Applying a ≤1% AF threshold in the gnomAD v3 genomes dataset, we found 75,384 unique missense variants across 15,566 genes (median 321 per proband, range 244-847). Within GenCC Moderate, Strong, and Definitive disease-associated genes, the number of unique variants dropped to 17,789 across 2,899 genes, with a median of 75.5 variants per proband (range 53-186). Variant counts following each step in QC and AF filtering are displayed in Table S2.
PP3/BP4 evidence strength of missense variants in disease-associated genes. A median of one variant (mean 0.8-1) per proband reached PP3_Strong per analyzed prediction tool, 3-5 variants (mean 3.4-4.9) reached PP3_Moderate, and 3-5 (mean 3.6-5.2) reached PP3_Supporting (Table

Table 1. Number of rare (≤1% AF) missense variants in disease-associated genes per proband by ACMG PP3/BP4 evidence strength within MANE Select transcripts