Abstract
Background/Objectives Canonical splice site variants (CSSVs) are often presumed to cause loss-of-function (LoF) and are assigned very strong evidence of pathogenicity (according to ACMG criterion PVS1). However, the exact nature and predictability of splicing effects of unselected rare CSSVs in blood-expressed genes is poorly understood.
Methods A total of 184 rare CSSVs in unselected blood-expressed genes were identified by genome sequencing in 121 individuals, and their impact on splicing was interrogated manually in RNA sequencing (RNA-seq) data. Blind to these RNA-seq data, we attempted to predict the precise impact of CSSVs by applying in silico tools and the ClinGen Sequence Variant Interpretation Working Group 2018 guidelines for applying PVS1 criterion.
Results There was no evidence of a frameshift nor of reduced expression consistent with nonsense-mediated decay (NMD) for 24% of CSSVs: 17% had wildtype splicing only and normal junction depths, 3.25% resulted in cryptic splice site usage and in-frame indels, 3.25% resulted in full exon skipping (in-frame), and 0.5% resulted in full intron inclusion (in-frame). Misclassification rates for splicing outcome (frameshift/NMD vs. no frameshift/no NMD) using (i) SpliceAI, (ii) MaxEntScan, and (iii) AutoPVS1 ranged from 30-41%, with none outperforming a simple “zero rule” classifier.
Conclusion Nearly 1 in 4 CSSVs may not cause LoF based on analysis of RNA-seq data. Predictions from in silico methods were often discordant with findings from RNA-seq. More caution may be warranted in applying PVS1-level evidence to CSSVs in the absence of functional data.
Introduction
Canonical splice site variants (CSSVs) are DNA variants affecting splicing donor (+1 and +2) and acceptor (–1 and –2) sites defining exon-intron boundaries1, 2. The consensus nucleotide sequences at splicing donor and acceptor sites are GT and AG, respectively, and are essential in interacting with the U2 spliceosome to result in normal splicing and generation of wildtype (WT) transcripts2–5. CSSVs may modify the interactions between the precursor messenger RNA (pre-mRNA) and spliceosome complex5–7. The resulting splice disruption events can include exon skipping, full intron inclusion, and alternative use of nearby cryptic splice sites resulting in insertions or deletions (indels) of nucleotides6–8. These effects may or may not induce a frameshift and premature termination codon (PTC), which can then trigger nonsense-mediated RNA decay (NMD) and result in a loss-of-function (LoF) of the gene9, 10.
Accurate variant interpretation is foundational to both genome diagnostics and screening11–13. Rare CSSVs are typically considered under the “null variant” code and assigned very strong evidence for pathogenicity (PVS1)12. The PVS1 guidelines were refined by the ClinGen Sequence Variant Interpretation Working Group (ClinGen SVI) in 2018 and again recently in 202311, 14. ClinGen SVI initially recommended assigning PVS1 at varying evidence strengths (i.e., supporting, moderate, strong, very strong, or not at all) after directly inspecting the genomic region to predict the impact of the CSSV on splicing and the overall reading frame11. SpliceAI subsequently emerged as a widely used and powerful method for annotating genetic variants with their predicted effect on splicing15, 16. However, in silico tools continue to be recognized as a valuable but imperfect adjunct method for predicting the impact of CSSVs14.
To our knowledge, there have been no systematic attempts to catalogue the precise consequences of rare CSSVs on splicing using functional data. The degree to which the impact(s) of CSSVs are predictable via inspection of the genomic region and application of in silico tools is also unclear. Here, we analyzed all rare CSSVs identified by genome sequencing (GS) from children and parents, in blood-expressed but otherwise unselected genes, using genome-wide RNA sequencing (RNA-seq). We hypothesized that a significant minority of CSSVs would show no evidence of a frameshift nor of reduced expression consistent with nonsense-mediated decay, recognizing that ∼1/3 of inserted or deleted DNA segments would be expected to have a size divisible by 317. We determined the proportions of various splicing outcomes, and we compared the results to outcomes predicted in a blinded fashion by three in silico tools: SpliceAI, MaxEntScan, and AutoPVS112, 13, 15. Our results revealed a previously underappreciated complexity in CSSV impact prediction, and underscore the value of functional data in the interpretation of rare CSSVs.
Materials and Methods
Identification of rare CSSVs expressed in blood
In this study, we analyzed blood-derived GS and RNA-seq data from a convenience sample of 121 total individuals18, 19. The GS and RNA-seq studies were approved by the Research Ethics Board at The Hospital for Sick Children. Parents and/or guardians provided written consent on their child’s behalf. Where appropriate, children provided written and/or oral assent.
Demographic details for the cohort are described in Table S1. Detailed GS methods, and a subset of the GS data, were published previously18–22. DNA variants from GS files were filtered according to the following criteria: (I) single nucleotide substitution in a canonical splice site identified in a MANE or Ensembl Canonical transcript, (II) allele frequency of <0.05 (per 1000 Genomes, NHLBI-ESP, and ExAC), (III) >=99% Genotype Quality Score, and (IV) gene possibly detected in whole blood (according to GTEx, V8, transcripts per million (TPM) >0.05) and detected in our internal cohorts20, 23. Variants in untranslated regions (UTRs) were excluded23. All 184 CSSVs included in this study are listed in Table S2. Two of the identified CSSVs were considered diagnostic for the phenotype(s) that prompted GS, and all CSSVs were heterozygous18–22, 25.
Analysis of splicing impact of CSSVs using RNA-seq data
We analyzed the impact of CSSVs on splicing in the canonical transcript using the accompanying short-read blood-derived RNA-seq data. RNA extraction, sequencing, and data processing methods were previously described in full20, 26. Each splicing junction was manually inspected using the Integrated Genome Visualizer (IGV) by two independent evaluators (R.Y.O. and A.A.). For every CSSV, an average of five random, age-range matched “controls” (i.e., with normal DNA sequence at the affected CSS) from this cohort were used to identify the WT splicing event(s) and provide a reference on junction read depths to account for any possible fluctuations. Sex-matched “controls” were specifically used for one CSSV located on the X chromosome. Only junctions with >5 uniquely mapped reads, which is a low cut-off, were considered in the analysis. The junction with the highest read depth was considered the predominant splicing outcome. Splicing outcome categories are as follows:
(A) presumed NMD (if the raw WT splicing junction coverage was >=20% less than control individuals but no aberrant splicing events were captured)26,
(B) NMD not detected [if there was comparable (<=20% difference) junction depth between individuals and no aberrant splicing events were captured],
(C) exon skipping leading to frameshift deletion,
(D) exon skipping leading to in-frame deletion,
(E) full intron inclusion leading to frameshift insertion,
(F) full intron inclusion leading to in-frame insertion,
(G) activation of a cryptic splice site leading to frameshift indel,
(H) activation of a cryptic splice site leading to in-frame indel.
Prediction of splicing outcomes using in silico tools
Blind to the RNA-seq data, two independent assessors (D.C. and G.G.) attempted to predict the precise impact of CSSVs using in silico tools and ClinGen SVI recommendations11. Standard score thresholds were used for MaxEntScan (>3) and for SpliceAI (delta score >0.2). For each CSSV, the splice junction was manually inspected using Alamut Visual Plus (version v1.7, ©2022 SOPHiA GENETICS) with a +/-20 bp window for a cryptic splice site. In the case of two or more cryptic splice sites in the same region with comparable in silico scores, if the use of any one site was predicted to result in an in-frame indel then NMD was not predicted as the outcome. In the absence of a cryptic splice site in the neighboring region, for acceptor losses we predicted exon skipping and for donor losses we predicted full intron inclusion. The length of the exon or intron, respectively, was then used to predict whether the impact would be in-frame or out-of-frame. While the ClinGen SVI recommendations specify use of a +/-20 bp window, we conducted additional exploratory analyses using an expanded SpliceAI window of +/-5000 bp (“SpliceAI_expanded”). Concordance of outcomes was then calculated if the frameshift/non-frameshift outcome predicted by the in silico tools matched the results from RNA-seq.
AutoPVS1, an automatic classification tool for PVS1 interpretation of null variants, was also used to predict the impact of CSSVs on splicing27. This algorithm uses Variant Effect Predictor for annotation of variants and MaxEntScan to predict the use of cryptic splice sites and resulting impact on splicing27. The three possible outcomes of AutoPVS1 are (I) Exon skipping or cryptic splice site usage that leads to a frameshift and NMD, (II) Exon skipping or cryptic splice site usage that leads to a frameshift without NMD, and (III) Exon skipping or cryptic splice site usage that preserves the reading frame.
Results
Comparison of frameshift outcomes of CSSVs using RNA-seq vs. in silico predictions
We assessed a total of 184 rare, blood expressed CSSVs in 180 otherwise unselected genes. By RNA-seq, 24% of these CSSVs did not result in a frameshift nor in reduced expression consistent with NMD (Figure 1a). There was no apparent difference in the patterns of variant location and specific nucleotide substitution of CSSVs by frameshift/non-frameshift outcome (Figures S1 and S2). Considering the n=18 CSSVs that showed only WT splicing and with comparable read depth to controls (outcome category B), 14 CSSVs were in the donor splice site including only 1 GT>GC variant28 (Figure S3).
a) A comparison of the proportion of CSSVs resulting in a frameshift according to RNA-seq analysis vs. in silico predictions. b) Concordance between RNA-seq and in silico predictions of the impact of CSSVs on splicing. Concordant outcomes are defined as the RNA-seq and respective in silico tool identifying the same outcome (frameshift or non-frameshift). c) Misclassification rates of in silico tools compared to a zero-rule classifier.
Selected variants (n=23) with specific splicing outcomes in RNA-seq (exon skipping, cryptic splice site usage, and intron inclusion) compared with splicing outcome predictions from in silico tools. Total pairwise concordance in specific splicing outcomes was 70% (16/23) for SpliceAI_expanded, 35% (8/23) for SpliceAI and 26% (6/23) for MaxENTScan.b) Concordance of specific splicing outcomes in RNA-seq vs. in silico predictions.
For the 184 CSSVs, blinded predictions from in silico methods for non-frameshift outcomes are 28% (SpliceAI), 29% (SpliceAI_expanded), to 43% (AutoPVS1) of all CSSVs (Figure 1a). For CSSVs resulting in a frameshift per RNA-seq, SpliceAI_expanded had the highest pairwise concordance (77%), followed by SpliceAI (74%), MaxENTScan (69%), and AutoPVS1 (61%; Figure 1b). For CSSVs not resulting in a frameshift per RNA-seq, SpliceAI was concordant in 36%, MaxENTScan in 41%, SpliceAI_expanded in 48%, and AutoPVS1 in 55% (Figure 1b).
Across all 184 variants, SpliceAI_expanded had the highest pairwise concordance to RNA-seq with respect to the frameshift vs. no-frameshift outcome, at 71%. In order to assess the performance of each in silico method, we calculated misclassification rates from each technique, which ranged from 30-41% (Figure 1c). In comparison, a zero-rule classifier (which would predict that every CSSV causes frameshift/NMD, the predominant outcome category in this study) had a misclassification rate of only 24% (Figure 1c).
Comparison of specific splicing outcomes of CSSVs using RNA-seq vs. in silico predictions
Next, we compared the specific splicing outcome of CSSVs (cryptic splice site usage, exon skipping, or intron inclusion; categories C-H above) between RNA-seq and in silico predictions, for the n=23 variants where this could be determined from RNA-seq (Figure 2a). Total pairwise concordance was 70% (16/23) for SpliceAI_expanded (improved from 35% (8/23) for SpliceAI) and 26% (6/23) for MaxENTScan; AutoPVS1 does not provide such predictions. The performance of both in silico methods appeared to vary by outcome category, e.g., with SpliceAI_expanded correctly predicting all usage of cyptic splice sites (10/10), some of the exon skipping events (5/10) and none of the intron inclusion events (0/3) (Figure 2b). Two selected donor CSSVs are used to illustrate that in silico were often correct in predicting frameshift vs. no frameshift outcomes, however, for incorrect and/or incomplete mechanisms of abnormal splicing (Figure 3a-d).
Two examples of donor CSSVs showing discordant splicing events in blood RNA-seq vs. in silico predictions but overall correct outcome (frameshift vs. no frameshift). Splicing junctions with reads >5 are shown. First transcripts on the list are the canonical transcripts indicated by Ensembl. a) CSSV in MICAL1 results in exon skipping leading to a frameshift in RNA-seq. b) SpliceAI, SpliceAI_expanded and MaxEntScan all predicted activation of a cryptic splice site, resulting in frameshift. c) CSSV in TARBP1 results in exon skipping leading to a frameshift in RNA-seq. d) SpliceAI, SpliceAI_expanded and MaxEntScan all predicted intron inclusion, resulting in frameshift.
Discussion
Challenging common assumptions and interpretation heuristics for CSSVs
As demostrated in this study cohort, most human genomes contain one or more rare CSSVs in blood-expressed genes. Although there is widespread recognition that a rare CSSV need not necessarily result in LoF, experiences from decades of targeted diagnostics-focused clinical genetic testing (with a resulting ascertainment bias) may have contributed to a misconception that CSSVs are comparable to nonsense and frameshift variants. We present the first systematic assessment of the impact of rare “unselected” CSSVs in blood-expressed genes, using RNA-seq. We found that nearly 1 in 4 CSSVs may not cause LoF, and that in silico predictions using established tools and published guidelines were often discordant with RNA-seq data.
In recent years, there have been numerous computational tools developed to predict the location of novel splice sites and thus the impact of DNA variants on splicing12, 15. These tools were validated using few if any CSSVs (e.g., near intronic variants at least 3 nucleotides from a canonical exon boundary in SpliceAI-10k; n=55 in 300K-RNA Top-4) and were not created to predict specific CSSV outcomes like exon skipping or intron inclusion15, 16, 29, 30. A prior report found that intron inclusion was poorly predicted using SpliceAI when compared with transcriptome sequencing data31. AutoPVS1 does not list intron inclusion as a specific outcome, nor does it distinguish which variants result in exon skipping versus cryptic splice site usage27. For now, no in silico methods appear to predict the precise impact of CSSVs with the sensitivity and specificity needed for clinical diagnostics. A recent study has proposed using the ACMG/AMP PM4 (moderate evidence of pathogenicity) criterion for CSSVs that are predicted to result in intron inclusion and 3 or more in-frame events as predicted by 300K-RNA Top-416.
Relative weighting of biological function or evolutionary conservation of the affected gene region(s) may need to be considered in addition to the length of the in-frame disruption16. We propose, in addition, that future updates to published guidelines on the use of PVS1 should consider the use of SpliceAI with an expanded window of +/-5000 bp.
The impact of CSSVs on splicing can be complex. Multiple and/or partial effects on splicing have been observed in the past (e.g., with some aberrantly spliced transcripts resulting in LoF and others showing no apparent impact or producing a functional transcript)24. Surprisingly, some CSSVs in our data showed no direct (aberrant/non-WT splice junctions) or indirect (reduced read depth, compatible with NMD) impact on splicing. The ACMG/AMP rule, BS3, may be applied to CSSVs with evidence of normal splicing patterns demonstrated by RNA-seq fulfilling specific criteria24, 32. A recent study using cell culture-based full-length gene splicing assays demonstrated that specific nucleotide substitutions (GT>GC in the donor splice site) can generate WT transcript levels in an estimated 15-18% of cases; no other nucleotide substitutions in the +2 donor splice site were able to generate WT transcripts28. Moreover, WT splicing as a result of the 5’ splice site GT>GC substitutions was not accurately predicted by in silico tools28. Our results showed that WT transcripts can be generated with diverse nucleotide substitutions, in no consistent ranking order in the 5’ donor splice site as recently described (though this study performed RNA-seq in fibroblast samples and also included common variants), affecting +/– 1 and 2 canonical splice sites, highlighting in vitro assays’ inability to capture the full complexity of splicing in human cells (Figure S3)33. Of note, sensitivity of our RNA-seq methods for detecting evidence of NMD will be <100%. Long-read RNA-seq might have facilitated more robust quantitation of transcript isoforms and revealed additional splicing outcomes. Testing additional tissues beyond blood was also outside the scope of this initial study.
Our study has several additional limitations. First, we recognize that in-frame insertions and deletions can still result in non-functional proteins (e.g., through disruption of an essential protein domain) and that protein function cannot be inferred completely from RNA-seq. In the absence of any evidence-based guidance, we assumed for all in silico predictions that exon skipping would be the typical impact of an acceptor site loss and that intron retention/inclusion would be the typical impact of a donor site loss, whenever an alternative/cryptic splice site was not present. There is growing appreciation that this reasoning is overly simplistic31, 34. The landscape of impacts of rare CSSVs may change based on tissue and/or age. Last, we were underpowered in this study to identify substitution– and location/motif-specific predictors of splicing outcomes.
Implications for diagnostics, predictive testing, and screening
These data reinforce prior expert consensus recommendations that cautioned against applying PVS1 to CSSVs in the absence of additional supportive evidence24. Recent ClinGen SVI recommendations emphasize the importance of annotating physiologically occurring alternative splicing events (or “leaky” splicing events) as candidate rescue transcripts that may not result in LoF, consideration of gene-specific details (e.g., the impact on critical regions), and generating more data to guide the integration of various lines of evidence (i.e., computational and in vitro assays to assess impact on splicing), with the end goal of improved navigation of equivocal clinical scenarios requiring interpretation of rare CSSVs34. The advantages and limitations of RNA-seq, which can be done high-throughput, should be weighed against a targeted approach like RT-PCR; the latter may have greater sensitivity in detecting mis-spliced reads of low read depth as a result of low gene expression in whole blood, and being able to distinguish partial from complete abnormal splicing33. Although a majority of CSSVs result in LoF, this assumption should be questioned when genome-wide sequencing identifies novel rare CSSVs in genes associated with ultra-rare, poorly characterized conditions with non-specific phenotypes (e.g., developmental delays)35. The nuances of CSSV interpretation take on added importance when the pre-test probability is low and phenotypes are absent, as is the case with most secondary findings and in newborn genomic screening programs. Last, confirmation of genetic diagnosis and pathomechanism will be foundational to the coming wave of clinical trials of genetic therapies36.
Declaration of interest
S.W. is currently an employee of Genomics England Limited. The other authors declare no competing interests.
Data Availability
Genome sequencing, RNA-seq analysis, and in silico prediction data will be shared upon request to the corresponding author.
Web resources
Data availability
GS and RNA-seq data were published and shared previously. The in silico predictions can be shared upon request to the corresponding author.
Author contributions
Conceptualization: R.Y.O., A.A., S.W., G.C.; Data curation: R.Y.O., A.A., D.C., G.G., B.H., G.C.; Formal analysis: R.Y.O., A.A., D.C., G.G., G.C.; Funding acquisition: C.R.M., R.M-L., A.S., L.G.K., J.J.D., M.D.W., G.C.; Investigation: R.Y.O., A.A., D.C., G.C.; Methodology: R.Y.O., A.A., D.C., G.G., H.H., K.Y., S.W., G.C.; Project administration: R.Y.O., A.A., T.K., B.T., G.C.; Resources: C.R.M., R.M-L., A.S., L.G.K., J.J.D., M.D.W., G.C.; Software: R.Y.O., A.A., D.C., G.G., H.H., K.Y.; Visualization: R.Y.O., A.A., D.C., H.H.; Writing-original draft: R.Y.O., A.A., G.C.; Writing-review & editing: all other authors
Acknowledgements
We gratefully acknowledge all the individuals and their families who participated in this study. We thank the many healthcare providers involved in the diagnosis and care of the study participants. Special thanks to all staff affiliated with the Complex Care Program and The Centre for Applied Genomics. M.D.W. was supported by the Canada Research Chairs Program.
Funding was provided by Genome Canada (OGI-158; M.D.W., A.S., and J.J.D.), the SickKids Centre for Genetic Medicine and Translational Genomics Node, the Sickkids Research Institute, the Canadian Institutes of Health Research (Funding Reference Number: PJT186240), and the University of Toronto McLaughlin Centre.