TY - JOUR
T1 - CNVscore calculates pathogenicity scores for copy number variants together with uncertainty estimates accounting for learning biases in reference Mendelian disorder datasets
JF - medRxiv
DO - 10.1101/2022.06.23.22276396
SP - 2022.06.23.22276396
AU - Francisco Requena
AU - David Salgado
AU - Valérie Malan
AU - Damien Sanlaville
AU - Frédéric Bilan
AU - Christophe Béroud
AU - Antonio Rausell
Y1 - 2022/01/01
UR - http://medrxiv.org/content/early/2022/06/27/2022.06.23.22276396.abstract
N2 - Copy number variants (CNVs) are a major cause of rare pediatric diseases with a broad spectrum of phenotypes. Genetic diagnosis based on comparative genomic hybridization tests typically identifies ∼8-10% of patients as having CNVs of unknown significance, revealing the current limits of clinical interpretation. The adoption of whole-genome sequencing (WGS) as a first-line genetic test has significantly increased the load of CNVs identified in single genomes. Alongside short- and long-read sequencing technologies, a number of pathogenicity scores have been developed for filtering and prioritizing large sets of candidate CNVs in clinical settings. However, current approaches are often based, either explicitly or implicitly, on clinically annotated reference sets, which are likely to bias their predictions. In this study we developed CNVscore, a supervised-learning approach combining tree ensembles and a Bayesian classifier trained on pathogenic and non-pathogenic CNVs from reference databases. Unlike previous approaches, CNVscore couples pathogenicity estimates with uncertainty scores, making it possible to evaluate the suitability of a model for the query CNVs. Comprehensive comparative benchmark tests across independent sets and against alternative methods showed that CNVscore effectively distinguishes between pathogenic and benign CNVs.
We also found that CNVs associated with CNVscores of low uncertainty were predicted with significantly higher accuracy than those of high uncertainty. However, the performance of current scoring approaches, including CNVscore, was compromised on CNV sets enriched in highly uncertain variants and presenting unconventional features, such as functionally relevant non-coding elements or the presence of disease genes irrelevant for the clinical phenotypes investigated. Finally, we used the CNVscore framework to guide CNV scoring model selection for the French National Database of Constitutional CNVs (BANCCO), which includes clinical diagnosis annotations. The CNVscore framework provides an objective strategy for leveraging the uncertainty on bioinformatic predictions to enhance the assessment of CNV pathogenicity in rare-disease cohorts. CNVscore is available as open-source software from https://github.com/RausellLab/CNVscore and is integrated into the CNVxplorer webserver http://cnvxplorer.com.
Competing Interest Statement: The authors have declared no competing interest.
Funding Statement: The Laboratory of Clinical Bioinformatics of the Imagine Institute, headed by A.R., was partly supported by the French National Research Agency (ANR) "Investissements d'Avenir" Program [ANR-10-IAHU-01, ANR-17-RHUS-0002 - CIL-LICO project]; MSD Avenir fund (Devo-Decode project); Aviesan - ITMO Genetique-Genomique-Bioinformatique [ResDiCard: Resolving diagnostic deadlock in Cardiomyopathies project, AAP 2020: Maladies Rares - Resoudre les impasses diagnostiques] and by Christian Dior Couture, Dior; F.R.
is supported by a PhD fellowship from the Fondation Bettencourt-Schueller.
Author Declarations: I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below: The study used ONLY available human data that were originally located at: DECIPHER: https://www.deciphergenomics.org/; ClinVar: https://www.ncbi.nlm.nih.gov/clinvar/; gnomAD: https://gnomad.broadinstitute.org/; DGV: http://dgv.tcag.ca/; IGSR: https://www.internationalgenome.org/data-portal/data-collection/structural-variation; dbVar: https://www.ncbi.nlm.nih.gov/dbvar; Beyter et al., 2021: https://github.com/DecodeGenetics/LRS_SV_sets; BANCCO: http://bancco.fr. The BANCCO database requires registration for access. The BANCCO database has received appropriate approval through the French National Committee for Informatics and Liberty (CNIL): CNIL authorization #2071658.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals. Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov.
I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable. Yes
All data produced in the present work are contained in the manuscript https://github.com/RausellLab/CNVscore
ER -