medRxiv

Federated Learning for the pathogenicity annotation of genetic variants in multi-site clinical settings

Nigreisy Montalvo, Francisco Requena, Emidio Capriotti, Antonio Rausell
doi: https://doi.org/10.1101/2025.04.03.25325184
Nigreisy Montalvo
1 Université Paris Cité, INSERM UMR1163, Imagine Institute, Clinical Bioinformatics Laboratory, Paris, F-75006, France
Francisco Requena
2 Department of Physiology and Biophysics, Weill Cornell Medicine, Institute for Computational Biomedicine, Englander Institute for Precision Medicine, New York, NY, 10021, USA
Emidio Capriotti
3 Department of Pharmacy and Biotechnology (FaBiT), University of Bologna, Bologna 40126, Italy
Antonio Rausell
1 Université Paris Cité, INSERM UMR1163, Imagine Institute, Clinical Bioinformatics Laboratory, Paris, F-75006, France
4 AP-HP, Necker Hospital for Sick Children, Fédération de Génétique et Médecine Génomique, Service de Médecine Génomique des Maladies Rares, Paris, F-75015, France
  • For correspondence: antonio.rausell{at}institutimagine.org

Abstract

Rare diseases collectively affect 5% of the population. However, fewer than 50% of rare disease patients receive a molecular diagnosis after whole genome sequencing. Supervised machine learning is a valuable approach for the pathogenicity scoring of human genetic variants. However, existing methods are often trained on curated but limited central repositories, resulting in poor accuracy when tested on external cohorts. Meanwhile, large collections of variants generated at hospitals and research institutions remain inaccessible for machine-learning purposes because of privacy and legal constraints. Federated learning (FL) algorithms have recently been developed that enable institutions to collaboratively train models without sharing their local datasets. Here, we present a proof-of-concept study evaluating the effectiveness of federated learning for the clinical classification of genetic variants. A diverse array of FL strategies was assessed for coding and non-coding Single Nucleotide Variants as well as Copy Number Variants. Our results showed that federated models generally achieved comparable or superior performance to traditional centralized learning. In addition, federated models generalized robustly to independent sets using smaller data fractions than their centralized counterparts. Our findings support the adoption of FL to establish secure multi-institutional collaborations in human variant interpretation.
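To make the idea of training without sharing local datasets concrete, the sketch below shows one common FL scheme, federated averaging (FedAvg), applied to a toy logistic-regression classifier. It is an illustrative assumption only: the abstract does not state which FL strategies or model architectures were evaluated, and every name here (`local_update`, `federated_average`, `train_federated`, the toy sites) is hypothetical and not taken from the authors' code.

```python
import math
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (feature vector, label: 1 = pathogenic, 0 = benign)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def local_update(weights: Sequence[float], data: Sequence[Sample],
                 lr: float = 0.1, epochs: int = 5) -> List[float]:
    """Run a few epochs of gradient descent on one site's private data.

    Only the updated weights leave the site; the samples never do.
    """
    w = list(weights)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for j, xj in enumerate(x):
                grad[j] += (p - y) * xj
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad)]
    return w


def federated_average(site_weights: Sequence[Sequence[float]],
                      site_sizes: Sequence[int]) -> List[float]:
    """Average site models, weighting each by its number of samples."""
    total = sum(site_sizes)
    dim = len(site_weights[0])
    return [
        sum(w[j] * n for w, n in zip(site_weights, site_sizes)) / total
        for j in range(dim)
    ]


def train_federated(sites: Sequence[Sequence[Sample]], dim: int,
                    rounds: int = 20) -> List[float]:
    """Server loop: broadcast the global model, collect updates, average."""
    global_w = [0.0] * dim
    for _ in range(rounds):
        updates = [local_update(global_w, data) for data in sites]
        sizes = [len(data) for data in sites]
        global_w = federated_average(updates, sizes)
    return global_w


if __name__ == "__main__":
    # Two hypothetical sites; features are [bias term, toy variant score].
    site_a = [([1.0, 2.0], 1), ([1.0, -2.0], 0)]
    site_b = [([1.0, 1.5], 1), ([1.0, -1.0], 0), ([1.0, -1.5], 0)]
    print(train_federated([site_a, site_b], dim=2))
```

In this sketch the server only ever sees weight vectors, which is the property that motivates FL in multi-site clinical settings. Weight sharing alone is not a formal privacy guarantee, and real deployments typically add further safeguards.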

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

The Laboratory of Clinical Bioinformatics of the Imagine Institute, headed by A.R., was partly supported by the French National Research Agency (ANR) Investissements d'Avenir Program [ANR-10-IAHU-01 and ANR-21-PMRB-0004, FACE.S-4-KIDS project, FACE and SKULL for Key Innovative Data Science]; by the European Rare Diseases Alliance (ERDERA) programme funded by the European Union's Horizon Europe research and innovation programme under grant agreement number 101156595; and by the French government as part of the Important Project of Common European Interest (IPCEI) Cloud call of the France 2030 programme (E2CC - AI4RDP - AI for Rare Diseases Pathogenicity project). N.M. was partly supported by the French National Research Agency (ANR) Investissements d'Avenir Program [ANR-10-IAHU-01], the JANSSEN HORIZON Fonds de dotation, and by the French government as part of the Important Project of Common European Interest (IPCEI) Cloud call of the France 2030 programme (E2CC - AI4RDP - AI for Rare Diseases Pathogenicity project).

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

The study used ONLY available human data that were originally located at: ClinVar (https://www.ncbi.nlm.nih.gov/clinvar/); gnomAD (https://gnomad.broadinstitute.org/); DGV (http://dgv.tcag.ca/); IGSR (https://www.internationalgenome.org/data-portal/data-collection/structural-variation); dbVar (https://www.ncbi.nlm.nih.gov/dbvar); and Beyter et al., 2021 (https://github.com/DecodeGenetics/LRS_SV_sets).

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data Availability

All data required to reproduce the results presented in this manuscript, including figures and tables, will be made available under the GNU General Public License v3 upon publication. The data will be accessible via a Figshare repository. All code required to reproduce the results presented in this manuscript, including figures and tables, will be made available under the same license upon publication. The code will be accessible via a GitHub repository at https://github.com/RausellLab/FedLearnVar.

List of abbreviations

    AUC ROC: Area Under the Receiver Operating Characteristic curve
    CDS: Collaborative Data Sharing
    CCR: Constrained Coding Region
    CNN: Convolutional Neural Network
    CNV: Copy Number Variant
    FL: Federated Learning
    GDPR: General Data Protection Regulation
    GWAS: Genome-Wide Association Studies
    HARs: Human Accelerated Regions
    HPC: High-Performance Computing
    IID: Independent and Identically Distributed
    LADs: Lamina-Associated Domains
    MLP: Multilayer Perceptron
    ML: Machine Learning
    sNDF: Shallow Neural Decision Forest
    SNV: Single-Nucleotide Variant
    SV: Structural Variant
    TSS: Transcription Start Site
    TPM: Transcripts Per Million
    UCNEs: Ultra-Conserved Non-Coding Elements
    UMAP: Uniform Manifold Approximation and Projection
    WGS: Whole Genome Sequencing
  • Copyright 
    The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
    Posted April 04, 2025.

    Subject Area

    • Genetic and Genomic Medicine