Introduction

Whole-exome sequencing (WES) is becoming increasingly important in enabling tailored interventions in “personalized” or “precision” medicine. Recent studies suggest that in patients with suspected genetic disorders, clinical WES (CES) should be applied as a first-tier molecular test, in parallel with standard diagnostic care, to improve the cost-effectiveness of the diagnostic process [1, 2]. However, limitations of the current analytical methods [3], genetic and phenotypic heterogeneity [4], knowledge gaps in the gene–disease association [5], and interpretation/reporting differences [6] often hinder the identification of disease-causing variants. Ultimately, only 25–30% of families receive a diagnosis after CES [7, 8], leaving the genetic basis of the remaining families’ diseases unknown.

Recent studies indicate that periodic reanalysis of negative CES data using improved bioinformatic tools [9] and up-to-date gene–disease databases [5] can identify additional candidate variants, further enhanced by complementary methods including copy number variation (CNV) analysis [10], whole genome sequencing (WGS), and other targeted methods [11]. Despite the use of these approaches, many cases remain negative after reanalysis. Thus, there is a need to develop a comprehensive strategy for reanalyzing negative CES that will improve the analytical process, thereby providing better options for those who do not initially receive a molecular diagnosis.

In this study, using a CES cohort enrolled in the Manton Center for Orphan Disease Research at Boston Children’s Hospital, we present our strategy for retrospective reanalysis of negative CES studies. We describe a unique approach featuring a custom-built pipeline (“Variant Explorer Pipeline,” VExP), in conjunction with updated phenotypic information, literature, and databases, and in collaboration with an interdisciplinary team to enhance diagnostic yield while minimizing cost. As no methodological consensus guidelines for CES reanalysis currently exist, we have drawn from methods described in the literature [10,11,12], as well as our own reanalysis experience, to develop a standard strategy for negative CES reanalysis. This proposed strategy addresses each possible outcome after negative CES reanalysis, including cases that remain negative. Supported by the current literature, we have incorporated a novel bioinformatic strategy into an interdisciplinary research setting to introduce an innovative approach that addresses the need for comprehensive reanalysis of exome data.

Materials and methods

Recruitment of CES cohort

The Gene Discovery Core of the Manton Center for Orphan Disease Research (Manton GDC) utilizes an Institutional Review Board-approved research protocol at our institution established to enroll and study patients and families with a wide range of rare and undiagnosed disorders. The Manton GDC often obtains raw CES data from enrolled participants who have undergone prior clinical testing and information about the CES cohort can be found in Supplementary Table S1. These data can be negative, having demonstrated no candidate variant, or contain a candidate variant and are submitted by a referring physician to confirm the existing candidate or identify additional variants. The CES datasets utilized in this study were all performed by external Clinical Laboratory Improvement Amendments-certified genetic testing laboratories. Two hundred and seventy-five sequencing files (162 fastq, 63 bam, and 50 cram files) were included in our study. The clinical labs performed exome capture using different exome capture kits (e.g., Agilent SureSelectXT Human All Exon Kit V4 and V5) and sequenced captured libraries using Illumina HiSeq at varying levels of lab-specific coverage (>50 ×).

Exome data analysis/reanalysis of negative CES

Our exome-sequencing reanalysis approach uses an innovative, validated, custom-built pipeline VExP [13, 14] to narrow down potential candidate variants. VExP is a comprehensive system that integrates existing methods, genetic information, and probabilistic models into an automated pipeline for the identification of disease genes (Supplementary Fig. S1). VExP features include enhanced calling methodologies, novel variant detection algorithms, updated variant annotations, and validated filtering strategies. Independent families (102 total: 5 quads, 79 trios, 18 singletons) were processed using the VExP (v2018May), using the following four steps:

Step 1 (fastq to vcf): VExP starts with fastq raw data to avoid processing errors or reference mismatch data. When only bam files were available, PICARD (v2.9.2) was used to revert to fastq files. Raw data were processed to obtain vcf files using the human reference assembly 19, BWA (alignment, v0.7.15), PICARD (mark/delete duplicates), SAMTOOLS (variant calling, v1.4.1), and GATK (multi-sample variant calling, v3.7). Further, ANNOVAR (2018Apr18) and custom VExP scripts were used to add annotations from 53 relevant genetic databases (Supplementary Table S2) into the output of the system.

Step 2 (Variant filtering): We observed, on average, a total of 201,929 variants per individual (31,837 in coding/splicing sites). These uniform vcf files were next subjected to pedigree/relationship tests of each family to eliminate or correct discordant samples. Variant analysis was performed in each family using different inheritance models (assuming full penetrance) based on three filtering criteria: first, include variants predicted by ANNOVAR to have a potential functional coding consequence, including stop gain or loss, splice site disruption, indel, and nonsynonymous. Second, variants are filtered based on allele frequency in control populations (gnomAD, ExAC, EVS, 1000GP, and internal data from 2114 unaffected individuals from BCH). Heterozygous/hemizygous variants for dominant, de novo, and X-linked models were included if minor allele frequency (MAF) was <0.0005 (0.05%) in any database. In comparison, homozygous variants were included only if MAF was ≤0.00005 (0.005%) and for compound heterozygous models the MAF cutoff was ≤0.01 (1%) with no homozygous variant reported in any database. Phasing was performed if parents were available. The variants were further prioritized to include those with read depth ≥10 ×, alternative depth ≥5 ×, allele balance ≥0.20, and deleterious prediction (2 or more of 23 softwares, including PolyPhen, SIFT, FATHMM, and CADD; Supplementary Table S2). Third, an inheritance filter was also applied to identify any variants that may be incompletely penetrant or mosaic in the parent and these were analyzed and evaluated outside this workflow. These filters were calibrated and validated using data from 150 independent families with genetic diagnoses from Boston Children’s Hospital and the Broad Institute (Supplementary Data and Table S3).

Step 3 (Candidate genes): On average, 14 variants per quad, 16 per trio, and 131 per singleton family (Supplementary Fig. S2) were identified following this common filtering strategy (Steps 1 and 2). To target the analysis toward making a primary diagnosis, we subdivided our candidate genes into three categories: (a) likely pathogenic (LP) and pathogenic (P) variants in known disease-associated genes, (b) variants of uncertain significance (VUS) in known disease-associated genes, and (c) variants in novel genes. For (a) and (b), VExP uses the Human Phenotype Ontology (HPO) terms that define the phenotype of the patient to filter for variants that lie within an appropriate candidate gene list, built from 12 different medical databases (Online Mendelian Inheritance in Man, ClinVar, Mouse Genome Informatics, Human Gene Mutation Database (HGMD), Genetic Association Database, and HPO; see Table S2) to identify variants potentially correlated to the phenotype. For c) we selected variants predicted to be deleterious in candidate genes that were not confirmed previously to be implicated in human disease. Our database is integrated with 1,579 independent families (3,776 independent samples) with 178 different phenotypes that were processed using the same pipeline (VExP). We utilized the database to identify novel genes using a unique approach wherein we were able to exclude 40% of additional candidate genes (Supplementary Material and Supplementary Fig. S3).

Step 4 (Diagnostic): Candidate gene variants were further evaluated by our clinical review team that consists of genetic counselors and both internal and referring clinicians. The genomic and phenotypic data were combined with additional information including gene expression, protein domains, functional studies, animal models, interactomes, gene families, and literature [10, 12]. Potentially diagnostic candidate variants were deposited into data sharing platforms such as “Matchmaker Exchange” [15] and “GeneMatcher” [16] to corroborate findings and/or to find future collaborators for functional analysis. Variants in disease-associated genes were interpreted using the American College of Medical Genetics and Genomics (ACMG) and the Association for Molecular Pathology (AMP) guidelines for variant classification [17]. Likely diagnostic variants selected by our clinical review team were subsequently confirmed using Sanger sequencing in our laboratory at BCH and these confirmed variants were submitted to ClinVar (#SUB5116581). When available, other family members were used for segregation testing.

Results

These 102 CES cases were sequenced over a period of 5 years from 2013 to 2018 (2013: 5 cases; 2014: 7 cases; 2015: 14 cases; 2016: 27 cases; 2017: 37 cases; and 2018: 12 cases). The average and median times between original analysis and reanalysis in this study were 1.9 ± 1.4 years and 2 years, respectively. Data from these cases were reanalyzed and classified into one of the five following groups: (a) variants in known genes (that match phenotype) identified by CES and confirmed by reanalysis; (b) variants in candidate (novel and phenotypic expansion) genes identified by CES and confirmed by reanalysis; (c) variants in known genes missed by CES and identified by reanalysis; (d) variants in candidate genes missed by CES and identified by reanalysis; and (e) no likely diagnostic variant identified by CES or reanalysis (Fig. 1).

Fig. 1
figure 1

Flowchart of CES reanalysis and outcomes. All 102 CES cases with fastq data were reanalyzed by VExP and classified into one of the five following groups: (a) confirmed variants in known disease-causing genes previously identified by CES; (b) confirmed variants in candidate genes previously identified by CES; (c) variants in known disease-causing genes identified on reanalysis; (d) variants in candidate genes identified on reanalysis; and (e) no candidate variant identified on reanalysis. CES, clinical exome sequencing; VExP, Variant Explorer Pipeline

Of the 102 reanalyzed cases, 28 had previously identified variants in known (n = 12) and candidate genes (n = 15), and one needed to be reclassified. Reanalysis of those 28 cases was requested by the referring physicians in order to confirm existing findings and/or potentially identify additional disease-causing variants. Meanwhile, of the 74 negative and 1 reclassified case (n = 75), our reanalysis determined six instances of missed diagnoses. In addition, we identified 18 probands with variants in candidate genes. The remaining 51 cases stayed unsolved despite our reanalysis.

The known disease-causing genes and candidate genes in which variants were identified or confirmed by reanalysis are listed in Table 1. The majority of the variants were de novo (29/51), followed by recessive (22/51), which included compound heterozygous (9/51), homozygous (7/51), and hemizygous (6/51).

Table 1 List of known and candidate disease-causing genes in which variants were identified or confirmed by reanalysis

Phenotypic details associated with known disease-causing genes identified by reanalysis are listed in Table 2. These variants were either recently published, such as CAMK2B (NM_001220.4:c.416C > T, p.(Pro139Leu)) [18], CLTC (NM_004859.3:c.2669C > T, p.(Pro890Leu)) [19], MAP1B (NM_005909.4:c.5368C > T, p.(Arg1790*)) [20], or high impact variants (nonsense, frameshift, hotspot mutation or mutation in functional domains with high conservation) found in disease-causing genes with a clinical presentation that fit the reported patient phenotype, including PPM1D (NM_003620.3:c.1258dupA, p.(Lys420fs)), FBXO11 (NM_025133.4:c.2145G > C, p.(Lys715Asn)), and PRKAR1A (NM_001278433.1: c.620A > G, p.(Tyr207Cys)).

Table 2 Phenotypic details associated with known disease-causing genes identified by reanalysis

Overall, the diagnostic yield for reanalyzed negative plus one reclassified CES cases was 32.0% (24/75). The diagnostic rate was slightly higher (no significant difference) in trios than in singletons (32.8% and 28.6%, respectively; Supplementary Table S4). Reanalysis of older CES cases had a higher chance to be diagnostic, although not statistically significant (38.1% for 2013–15 vs. 29.6% for 2016–18, P-value > 0.05). Diagnosis rate for each year is depicted in Supplementary Fig. S4. Variant assessments, literature searches, and collaboration with other laboratories contributed to the validation of candidate genes in these patients. Using the reanalysis workflow described, we have achieved a confirmed or likely genetic diagnosis for 24 probands in the 75 negative or reclassified rare disease families after reanalysis, improving the diagnostic yield.

Discussion

Here we describe a comprehensive and collaborative approach for reanalyzing negative CES cases with heterogeneous phenotypes that maximizes diagnostic yield. Our computational approach employs an innovative and validated custom-built pipeline (VExP) [13, 14] that integrates enhanced calling, variant detection and annotations, filtering strategies, and probabilistic models using 53 different databases. Out of 75 negative/reclassified cases, the confirmed and proposed genetic diagnosis account for 8.0% and 24%, respectively, resulting in an overall diagnostic yield of 32.0%. This is an improvement from prior studies that report diagnostic yields ranging from 7.7% to 27.2% [5, 9, 12, 21, 22], supporting our comprehensive approach. Further, compared with our study, higher diagnostic yields have been reported, in studies of consanguineous families [23, 24], or with utilization of additional methods such as CNV analysis [10], WGS, and others [11]. Overall, the yield of all reported studies including ours is 21% (472/2250), 14.9% for known and 6.1% for candidate genes (Table 3). It should be noted that the comparison in Table 3 is limited in precision by variable confounding factors including family type (trios, singletons, etc.), ethnicity, phenotypic variability, and methodology.

Table 3 Comparison of this study with other publications related to negative CES reanalysis

The continued accumulation of genetic knowledge during the time that elapsed between the CES and reanalysis may have affected our diagnostic yield [5, 21]. Every year, hundreds of new genes and thousands of new variants are linked to various phenotypes, making it necessary to reanalyze the genomic data [10, 11, 25]. Indeed, of the six cases where reanalysis identified diagnostic variants in established genes, five of these were in genes whose disease associations were published after the initial testing, the rest one (absent in the original VCF file) was detected by variant calling improvement. In addition, for 18 families we have a candidate gene (missing in the original analysis) and additional workup is ongoing. The diagnostic rate for early years (2013–15) was higher (although not statistically significant) compared with later years (2016–18); therefore, it is important to reanalyze the negative CES data periodically, preferably annually, with new methodologies, updated genomic datasets, improved pipelines/variants callings, and clinical experts. In addition, the families that do not receive a genetic diagnosis should have access to resources including research studies and patient registries to benefit from new disease discoveries over time.

Proposed strategy for negative CES reanalysis

Negative CES reanalysis is governed by an intricate process with many collaborative components that is constantly subject to change based on the latest developments [11, 26]. Lack of standardization of this process leads to high variability in diagnostic yield, as well as improper care for patients who do not receive a diagnosis after reanalysis. In order to improve the outcomes for these patients, we proposed a comprehensive strategy for reanalyzing negative exomes as illustrated in Fig. 2. This strategy encompasses the range of methods described in the current literature and integrates novel technological advancements and information from our own database. In addition to updating initial reanalysis, our proposed strategy addresses each possible outcome after negative CES reanalysis. These outcomes can be divided into three classes: (1) variant(s) in known disease-causing gene; (2) variant(s) in novel candidate gene; and (3) no candidate variant identified.

Fig. 2
figure 2

Proposed strategy for reanalyzing negative exomes. Collaborations between multidisciplinary clinical and research teams facilitate reanalysis of negative CES to reach a diagnosis. These negative CES cases can be divided into three classes after reanalysis: (1) candidate variants linked to the phenotype but of limited pathogenic evidence; (2) candidate variants in new genes not linked to the phenotype; (3) no candidate variants detected by CES after reanalysis. CES, clinical exome sequencing; CNV, copy number variation; VExP, Variant Explorer Pipeline

Outcomes categorized as class 1 in Fig. 2 may include one or more VUS that can only partially explain patients’ phenotypes at the time of diagnosis. In addition, multiple variants in two or more genes can be responsible for rare diseases, leading to a mixed phenotype without a clear diagnosis [27]. Once class 1 variants are categorized as P or LP variants according to the ACMG/AMP guidelines and clinically confirmed by the referring physicians, they should be deposited into databases including ClinVar, HGMD, and LOVD (Leiden Open Variation Database).

Outcomes categorized as class 2 should first be evaluated for damaging effects using in-silico predictions tools and variants predicted to be deleterious should be subjected to further functional studies in vitro and in vivo, in order to assess the pathogenicity of the candidate variants. Novel potential disease genes of interest or undergoing functional evaluation should be uploaded onto online data sharing platforms (e.g., GeneMatcher [16], Matchmaker Exchange) [15], in order to facilitate the collection of pathogenic evidence through collaborative efforts. Confirmed or published novel gene discoveries can then be added to the gene–disease databases to improve the power of the analytical pipeline. Class 2 outcomes should also be reanalyzed regularly to account for updates in analytical methods and database information.

Class 3 consists of outcomes in which no potentially diagnostic variant was identified by reanalysis. Possible factors contributing to this result include disease-causing variants with allele balance <20%, which are prone to misclassification as false-negative due to sequencing and mapping errors from CES [28]. These variants are often missed in cases of somatic mosaicism (e.g., GNAQ in Sturge–Weber syndrome: allele balance = 1–18%) [29]. Trinucleotide repeat diseases such as Fragile X syndrome, Huntington’s disease, and myotonic dystrophy are also frequently missed by CES [30]. Additional target testing can be used to improve the detection rate of trinucleotide repeat diseases [31] and low-allele balance variants, as has been demonstrated in cases of mosaicism [32]. Negative outcomes may also result from variants that lie in non-coding regions, unmappable reads arising from repeated regions, or structural variants that CES cannot detect. In these cases, WGS and long read sequencing can be applied to improve the detection rate, as these methods allow additional detection of both structural and non-exonic variants, and insertion of mobile elements such as transposons [33].

RNA-sequencing and mitochondrial genome sequencing can aid in the diagnosis when CES fails. By detecting both coding and non-coding variants at gene expression levels and alternative splicing, transcriptome sequencing holds considerable diagnostic potential for rare disorders [34]. Further, although CES data are increasingly being employed for CNV analysis, technical limitations due to inherent heterogeneity in exome capture may result in false negatives [35]. In such cases, CNV analysis techniques such as microarray-based CNV profiling or reanalyzing data from WGS for CNVs can be employed [36].

In conclusion, we report our cumulative experience from reanalysis of a CES cohort enrolled in our institution using a custom-built comprehensive variant detection and analysis pipeline in conjunction with updated phenotypic information, literature, and databases in collaboration with an interdisciplinary team. These advances allowed us to reach a confirmed or potential diagnosis for up to a third of previously negative CES cases, demonstrating a meaningful improvement in diagnostic yield. We also compare our findings with other studies, review literature, and propose an overall strategy for reanalyzing negative exome data that will standardize the analytical process and provide appropriate medical care for patients with no diagnosis by CES.