MRSD: a novel quantitative approach for assessing suitability of RNA-seq in the clinical investigation of mis-splicing in Mendelian disease

Background: RNA-sequencing of patient biosamples is a promising approach to delineate the impact of genomic variants on splicing, but variable gene expression between tissues complicates selection of appropriate tissues. Relative expression level is often used as a metric to predict RNA-sequencing utility. Here, we describe a gene- and tissue-specific metric to inform the feasibility of RNA-sequencing, overcoming some issues with using expression values alone. Results: We derive a novel metric, Minimum Required Sequencing Depth (MRSD), for all genes across three human biosamples (whole blood, lymphoblastoid cell lines (LCLs) and skeletal muscle). MRSD estimates the depth of sequencing required from RNA-sequencing to achieve user-specified sequencing coverage of a gene, transcript or group of genes of interest. MRSD predicts levels of splice junction coverage with high precision (90.1-98.2%) and overcomes transcript region-specific sequencing biases. Applying MRSD scoring to established disease gene panels shows that LCLs are the optimum source of RNA, of the three investigated biosamples, for 69.3% of gene panels. Our approach demonstrates that up to 59.4% of variants of uncertain significance in ClinVar predicted to impact splicing could be functionally assayed by RNA-sequencing in at least one of the investigated biosamples. Conclusions: We demonstrate the power of MRSD as a metric to inform choice of appropriate biosamples for the functional assessment of splicing aberrations. We apply MRSD in the context of Mendelian genetic disorders and illustrate its benefits over expression-based approaches. We anticipate that the integration of MRSD into clinical pipelines will improve variant interpretation and, ultimately, diagnostic yield.


Introduction 50
Pinpointing disease-causing genomic variation informs diagnosis, treatment and 51 management for a wide range of rare disorders. An underappreciated group of 52 pathogenic variants is those that lie outside of canonical splice sites but act through 53 disruption of pre-mRNA splicing, the process whereby introns are removed from 54 nascent pre-mRNA to produce mature and functional transcripts (Supplementary 55 Figure 1a). The ways through which genomic variants can disrupt pre-mRNA splicing 56 are diverse ( Supplementary Figures 1b-g), including both protein-coding and intronic 57 variants that are well described as causes of rare disorders (1-3). However, the 58 omission of intronic regions in targeted sequencing approaches (4, 5), discordance 59 between in silico variant prioritization tools (6) and the lack of availability of the 60 appropriate tissue from which to survey RNA for splicing disruption (7, 8) limit 61 effective identification of pathogenic splice-impacting variants. 62 impacting splicing, or leading to impairment of transcript expression or stability (16). 75 However, there remain several hurdles to the effective and routine integration of 76 RNA-seq into diagnostic pipelines. For example, surveying a whole transcriptome 77 identifies a large number of aberrant splicing events -in the order of hundreds of 78 thousands -and there is little consensus regarding the best approach to filter for true 79 positive and pathogenic events. Furthermore, diagnostic analysis using RNA-seq is 80 only effective when sufficient levels of sequence coverage of a relevant gene 81 transcript are present in the sampled tissue. 82

83
In this study, we develop an informatics approach to quantify the likelihood that a 84 gene/transcript, or a defined set of genes or transcripts, can be appropriately 85 surveyed using RNA-seq. We name our framework the Minimum Required 86 Sequencing Depth (MRSD), which can be utilized in a flexible and customized 87 manner to assess the suitability of RNA-seq derived from different tissues to identify 88 pathogenic splicing aberrations in specific genes of interest. MRSD scores (available 89 at: https://mcgm-mrsd.github.io/) can be utilized to select the most appropriate 90 biosample to detect splicing aberrations for a candidate set of genes/transcripts or to 91 guide the amount of sequencing reads from a specific biosample required to 92 generate appropriate transcriptomic datasets for a gene of interest. We apply these 93 techniques to the study of monogenic disease genes, and assess three clinically 94 accessible biosamples for their appropriateness to survey all known monogenic 95 disease genes. 96 97 98 99 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted March 20, 2021. ;https://doi.org/10.1101https://doi.org/10. /2021 respectively, as higher levels of required coverage were imposed (Figure 2b-c). 150 Across all parameter combinations, PPV values ranged from 90.1-98.2%, while NPV 151 ranged from 56.4-94.7%, suggesting MRSD is a fairly conservative model that 152 primarily returns positive results with high certainty. 153 154 Interestingly, although MRSD scores were derived from 75 bp paired-end RNA-seq 155 data, evaluating the ability of the model to predict transcript coverage in 150 bp 156 paired-end data (LCLs, n=20) shows higher PPV than with 75 bp data for half of the 157 four parameter combinations tested, while NPV was only slightly lower for all 158 combinations (Supplementary Figure 3). This suggests that, while care must be 159 taken applying this approach to datasets derived using alternative experimental 160 approaches, the MRSD model described here may provide a suitable approximation 161 in the case of alternative sequencing read lengths. 162 163

Comparison of MRSD and TPM as a guide for appropriate surveillance 164
We compared MRSD to the use of relative expression level (in transcripts per million, 165 TPM) as a possible indicator of RNA-seq suitability for the detection of aberrant 166 splicing events. We compared the expression levels, in TPM, of PanelApp genes 167 against tissue-specific MRSD predictions, finding a negative correlation between the 168 level of gene expression and its predicted MRSD across all three tissues (r 2 = 0.539-169 0.669; Figure 3a-c). This comparison confirms that more highly-expressed genes are 170 associated with lower MRSD scores. However, we noted significant overlap between 171 genes grouped into low-MRSD (< 100 M reads) and high-MRSD (≥ 100 M reads) 172 brackets. For example, among genes considered low-MRSD, TPM values ranged 173 from 1.25-1390, while genes with high-MRSD values had TPM values between 0-174 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted March 20, 2021. ;https://doi.org/10.1101https://doi.org/10. /2021 4880 ( Figure 3d). We quantified the overlap between these distributions, 175 demonstrating that 98.6% of high-MRSD genes had higher TPM values than at least 176 one low-MRSD gene. We calculated the tissue-specific median and the lowest TPM 177 values within the low-MRSD bracket for the top 95% and 70% percentiles, and 178 observed higher TPM values in 52.2%, 13.3% and 5.3% of high-MRSD genes, 179 respectively ( Figure 3d). The substantial overlap in the TPM values for low and high 180 MRSD genes suggests that relative expression does not provide a wholly accurate 181 representation of transcript coverage in RNA-seq data. Such inconsistencies may 182 arise from bias in the regions of genes that are sequenced, for example, genes with 183 high degrees of 3' bias in RNA-seq datasets (Supplementary Figure 4). 184 185

Traits of pathogenic splicing variation vary widely between genes and events 186
We identified pathogenic aberrations to splicing in 20 of the 88 samples utilizing a 187 previously described analysis pipeline (13) with a wide variety of mis-splicing effects 188  Table 1 is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.  is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

MRSD predictions 220
To further define the most informative parameters for use in the MRSD model, we 221 investigated the impact of a variety of metrics on the capability to identify pathogenic 222 splicing events, including number of samples within the healthy reference set, the 223 extent of read support for splicing junctions, and the relative expression of genes of 224 interest. Overall, our analyses suggested that two supporting reads for an aberrant 225 splicing event that is novel or has an NRC > 0.25 would reliably highlight pathogenic 226 aberrations amongst transcriptome-wide splicing variation. These parameters are 227 conservative and could be relaxed for the targeted investigation of variants of 228

interest. 229 230
We first identified how the number of control samples used as a reference set for 231 "healthy splicing" impacted our ability to identify aberrant splicing events. For all 232 samples within our healthy splicing set, we iteratively selected groups of control 233 samples at sizes of 30, 60 or 90. We observed that moving from 30 to 60 controls is 234 associated with a mean reduction in event count of 19.3% (28.1% of non-singleton 235 events, 17.1% of singleton events) across the three tissues, while increasing the 236 control size to 90 results in a further reduction of 10.2% of events (16.5% of non-237 singleton events, 9.5% of singleton events; Figure

Implications for investigation of variants in known disease-causing genes 264
We applied our MRSD model to all established disease genes included in the 265 Genomics England PanelApp repository, encompassing 275 distinct gene panels 266 and 3199 unique genes. 87 single-exon genes were excluded from analysis, leaving 267 3112 unique disease genes. Based on our investigations of MRSD, we applied the 268 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. Quantifying the resolving power of RNA-seq for variants of uncertain significance 293 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted March 20, 2021. ; https://doi.org/10. 1101 To analyze the possible impact of diagnostic RNA-seq integration on variant 294 interpretation, we curated variants of uncertain significance (VUSs) from the ClinVar 295 variant database (17) that were predicted by SpliceAI (18)  The recent development of machine learning approaches has underpinned 311 improvements to the prioritization of variants that impact splicing and cause rare 312 disease (19). Despite these advances, corroboration of the effect of such variants 313 remains a major obstacle to improving diagnostic yield for Mendelian disorders. This 314 obstacle is amplified by the unexpected functional impact of some variants on 315 splicing, which may change the way the variant is classified in accordance with 316 current guidelines (6). The MRSD-based approach described here allows the 317 informed selection of biosample(s) for bulk RNA-seq, based on the required number 318 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted March 20, 2021. ;https://doi.org/10.1101https://doi.org/10. /2021 of sequencing reads that need to be generated for appropriate surveillance of genes 319 of interest. This approach enables the effective identification of patients, disease 320 groups and genomic variants that are amenable for functional assessment of mis-321 splicing through RNA-seq, and may help to improve the efficiency and accuracy of 322 genomic diagnostic approaches. 323

324
The primary purpose of MRSD is to predict the likelihood of observing pathogenic 325 splicing defects in a given gene and tissue, and we quantify the utility of three distinct 326 biosamples in this manner for known monogenic disease genes ( Figure 6). Through 327 this analysis, we are able to highlight biosamples that may be most informative for 328 RNA-seq based analysis datasets for specific disease subsets. Although our model 329 is conservative (Figure 2), we demonstrate through MRSD-guided re-inspection of 330 VUSs in ClinVar that it may be possible to use RNA-seq to clarify the effect of up to 331 2.4% of variants of uncertain significance (Figure 7a). selecting genes that could be surveyed through RNA-seq when considering TPM 342 alone. Additionally, the normalization against sequencing depth that occurs during 343 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. to predict the suitability of datasets to detect allele-specific expression biases and 362 differential gene expression, which have been demonstrated to be evidence of 363 pathogenic mechanisms in known disease-causing genes (10,11,14,24). Although 364 further investigations are required to quantify and prove this suitability, it is likely that 365 genes with low MRSD scores (Figure 3d) are also amenable to investigations of 366 differential gene expression and isoform imbalance. 367 368 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. Thirdly, our approach is built for a specific set of RNA-seq-based analyses; namely, 388 the analysis of a selection of tissues by bulk short-read poly-A enrichment RNA-seq, 389 followed by a specific bioinformatics analysis pipeline (13). This experimental RNA-390 seq approach currently remains widespread (3, 10, 13-15); however, our model may 391 be readily applicable to RNA-seq generated using alternative methodologies, such 392 as increased read length, with only minor variations in model performance 393 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted March 20, 2021. ;https://doi.org/10.1101https://doi.org/10. /2021 (Supplementary Figure 3). As other technologies, such as long-read (36-38), single-394 cell (39, 40) and spatially resolved RNA-seq (41-44), become more prevalent in a 395 clinical setting, appropriate control datasets must be generated to develop 396 corresponding MRSD models. Similarly, recent research has shown noticeable 397 improvements to diagnostic yield for neuromuscular disorders by conducting RNA-398 seq on in vitro myofibrils generated by a fibroblast-to-myofibril transdifferentiation 399 protocol (45). Such patient-derived cell line approaches represent a promising 400 avenue to scrutinize transcripts not otherwise observable in proxy tissues (31,46). 401 As these protocols gain wider use, generation of control RNA-seq data from healthy 402 individuals using these approaches will be vital both to allow the generation of MRSD 403 scores and to accurately assess pathogenicity of any identified mis-splicing events. 404 405

Conclusions 406
In summary, the novel MRSD model presented here offers a gene-specific readout 407 to predict the most suitable biosample for interrogation of splicing disruption at the 408 transcript level. This may uncover previously unintuitive choices of biosample, as 409 discussed above in the case of familial rhabdomyosarcoma (Figure 6c). The use of 410 different biosamples is associated with different costs: while whole blood is routinely 411 taken in the clinic, cell-based RNA-seq requires harvesting and culturing of patient 412 cells, and muscle biopsy is an invasive procedure that is generally only undertaken if 413 deemed necessary. Our tool may allow clinical staff to make informed decisions 414 about the likely cost-benefit balance of RNA-seq analysis to ensure such costs are 415 not incurred unnecessarily. We expect that the use of MRSD will allow effective and 416 appropriate integration of RNA-seq into diagnostic genomic services, and ultimately 417 improve variant interpretation and diagnostic yield. 418 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

Minimum required sequencing depth (MRSD) score 421
We generated a collated map of splice junction coverage for GTEx samples from 422 three tissues (peripheral blood: n = 151; LCLs: n = 91; skeletal muscle: n = 184; see 423 RNA-seq data acquisition, below), using established methods (Cummings et al., 424 2017). These samples were designated as reference sets. Our model considers the GTEx control individuals and neuromuscular disease patients, respectively. GTEx 466 controls were selected for LCLs (n = 91), skeletal muscle (n = 184) and whole blood 467 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All FASTQs were aligned and processed as previously described (Cummings et al., 500 2017). Briefly, this analysis consisted of two-pass alignment using the STAR v2.4.2 501 aligner, marking of suspected PCR duplicates, and processing of the resultant 502 alignments to generate tissue-by-tissue lists of splice junctions present within the 503 cohort. Metrics for each splicing event were collected (Box 1), and splicing junctions 504 were filtered to retain only those events that were unique to single samples 505 (singletons) or that were present in multiple samples (non-singletons) but with an 506 increased usage in the sample of interest, that is, with a higher normalized read 507 count (NRC), than any control. The resulting list was ranked according to NRC fold 508 change, with singletons with high read counts considered the most significant events. 509 The resulting junctions were considered "events of interest". 510 511

Factors influencing the likelihood of aberrant splicing identification 512
To calculate how the level of background splicing aberrations was altered by sample 513 size, each individual in the three control splicing datasets was processed using the 514 above pipeline (13) and compared against 2000 bootstraps of 30, 60 and 90 controls 515 each from their respective control tissue dataset with replacement. Events were then 516 filtered to retain only those events for which the NRC was higher in the given 517 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted March 20, 2021. ; https://doi.org/10.1101/2021.03.19.21253973 doi: medRxiv preprint 1 individual than in any controls, and then counted for each bootstrap. Median counts 518 for singleton and non-singleton events were then collated for each control group size. 519 We selected 32 aberrant splicing events identified in neuromuscular patient RNA-seq 520 data. From the genes in which we identified these variants, samtools was used to 521 remove random subsets of reads in 10% intervals from each of these events to 522 simulate variability in the number of reads generated for the gene of interest. The 523 resulting datasets, exhibiting variable expression of a single gene, were then rerun 524 through the splice analysis pipeline and the above metrics gathered for these 525 simulated datasets. 526 527

Genomics England PanelApp data collection 528
Tabulated versions of 284 gene panels were downloaded from the Genomics 529 England PanelApp repository. Each panel was filtered to retain only genes assigned 530 a "green" classification for that panel, representing the highest level of confidence of 531 a real genotype-phenotype association. 532 533

Curation of ClinVar variants of uncertain significance 534
A tabulated version of the comprehensive ClinVar variant listing (17) for January 535 2021 was downloaded and filtered to retain only those variants that were annotated 536 as either "Uncertain significance" or "Conflicting interpretations of pathogenicity". 537 SpliceAI scores (v1.2.1; (18)) were generated for these variants and those with a 538 score of 0.5 or greater retained for downstream analysis. 539 540

Declarations 541
Ethics approval and consent to participate 542 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)

Availability of data and materials 553
The control datasets used to generate the MRSD model are available through the 554 dbGaP repository as part of the GTEx v8 release (accession phs000424.v8.p2). 555 Publicly available muscle-derived RNA-seq datasets to test the model are available 556 at dbGaP (accession phs000655.v3.p1.c1). Source code will be made available 557 upon publication. All MRSD scores are available at http://mcgm-mrsd.github.io/. 558 559

Competing interests 560
The authors declare no competing interests. Australia. We also wish to thank members of the Wessex Investigational Sciences 590 Hub (WISH) Laboratory, Southampton, UK, for their help in facilitating RNA-seq of 591 kConFab LCL samples (particularly Christopher Mattocks, Daniel Ward and Jade 592 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted March 20, 2021. ;https://doi.org/10.1101https://doi.org/10. /2021 Forster), as well as the work of the University of Manchester Genomics Core 593 Technology and Bioinformatics Facilities for their assistance in sample processing. 594 CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted March 20, 2021. ;https://doi.org/10.1101https://doi.org/10. /2021