Abstract
Rare diseases collectively affect 5% of the population. However, fewer than 50% of rare disease patients receive a molecular diagnosis after whole genome sequencing. Supervised machine learning is a valuable approach for the pathogenicity scoring of human genetic variants. However, existing methods are often trained on curated but limited central repositories, resulting in poor accuracy when tested on external cohorts. Meanwhile, large collections of variants generated at hospitals and research institutions remain inaccessible for machine-learning purposes because of privacy and legal constraints. Federated learning (FL) algorithms have recently been developed that enable institutions to collaboratively train models without sharing their local datasets. Here, we present a proof-of-concept study evaluating the effectiveness of federated learning for the clinical classification of genetic variants. A broad range of FL strategies was assessed for coding and non-coding single-nucleotide variants as well as copy number variants. Our results showed that federated models generally achieved performance comparable or superior to that of traditional centralized learning. In addition, federated models generalized robustly to independent sets while using smaller data fractions than their centralized counterparts. Our findings support the adoption of FL to establish secure multi-institutional collaborations in human variant interpretation.
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
The Laboratory of Clinical Bioinformatics of the Imagine Institute, headed by A.R., was partly supported by the French National Research Agency (ANR) Investissements d'Avenir Program [ANR-10-IAHU-01 and ANR-21-PMRB-0004, FACE.S-4-KIDS project, FACE and SKULL for Key Innovative Data Science]; by the European Rare Diseases Alliance (ERDERA) programme funded by the European Union's Horizon Europe research and innovation programme under grant agreement number 101156595; and by the French government as part of the Important Project of Common European Interest (IPCEI) Cloud call of the France 2030 programme (E2CC - AI4RDP - AI for Rare Diseases Pathogenicity project). N.M. was partly supported by the French National Research Agency (ANR) Investissements d'Avenir Program [ANR-10-IAHU-01], the JANSSEN HORIZON Fonds de dotation, and by the French government as part of the Important Project of Common European Interest (IPCEI) Cloud call of the France 2030 programme (E2CC - AI4RDP - AI for Rare Diseases Pathogenicity project).
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
The study used ONLY available human data that were originally located at:
- ClinVar: https://www.ncbi.nlm.nih.gov/clinvar/
- gnomAD: https://gnomad.broadinstitute.org/
- DGV: http://dgv.tcag.ca/
- IGSR: https://www.internationalgenome.org/data-portal/data-collection/structural-variation
- dbVar: https://www.ncbi.nlm.nih.gov/dbvar
- Beyter et al., 2021: https://github.com/DecodeGenetics/LRS_SV_sets
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability
All data and code required to reproduce the results presented in this manuscript, including figures and tables, will be made available under the GNU General Public License v3 upon publication. The data will be accessible via a Figshare repository. The code will be accessible via a GitHub repository at https://github.com/RausellLab/FedLearnVar
List of abbreviations
- AUC ROC: Area Under the Receiver Operating Characteristic curve
- CDS: Collaborative Data Sharing
- CCR: Constrained Coding Region
- CNN: Convolutional Neural Network
- CNV: Copy Number Variant
- FL: Federated Learning
- GDPR: General Data Protection Regulation
- GWAS: Genome-Wide Association Studies
- HARs: Human Accelerated Regions
- HPC: High-Performance Computing
- IID: Independent and Identically Distributed
- LADs: Lamina-Associated Domains
- MLP: Multilayer Perceptron
- ML: Machine Learning
- sNDF: Shallow Neural Decision Forest
- SNV: Single-Nucleotide Variant
- SV: Structural Variant
- TSS: Transcription Start Site
- TPM: Transcripts Per Million
- UCNEs: Ultra-Conserved Non-Coding Elements
- UMAP: Uniform Manifold Approximation and Projection
- WGS: Whole Genome Sequencing





