Training machine learning models on patient level data segregation is crucial in practical clinical applications

Mustafa Umit Oner, Yi-Chih Cheng, Hwee Kuan Lee, Wing-Kin Sung
doi: https://doi.org/10.1101/2020.04.23.20076406
Mustafa Umit Oner
1School of Computing, National University of Singapore, Singapore 117417
2A*STAR Bioinformatics Institute, Singapore 138671
  • For correspondence: onermustafaumit{at}gmail.com
Yi-Chih Cheng
2A*STAR Bioinformatics Institute, Singapore 138671
Hwee Kuan Lee
2A*STAR Bioinformatics Institute, Singapore 138671
1School of Computing, National University of Singapore, Singapore 117417
3Image and Pervasive Access Lab (IPAL), CNRS UMI 2955, Singapore 138632
4Singapore Eye Research Institute, Singapore 169856
Wing-Kin Sung
1School of Computing, National University of Singapore, Singapore 117417
5A*STAR Genome Institute of Singapore, Singapore 138672

Abstract

This article examines how histopathology image data are segregated into three sets: a training set for training the machine learning model, a validation set for model selection, and a test set for evaluating model performance. We found that one must be cautious when segregating histological image data (slides) into training, validation and test sets, because subtle mishandling of the data can introduce data leakage and give illusorily good results on the test set. We performed this study on gene mutation prediction using the deep neural network from the paper of Coudray et al. [1]. Using the provided code and the same data, we discovered that the paper's data segregation method suffers from a data leakage problem [2]. The paper pools all the slides from all patients and then segregates them exclusively into training, validation and test sets, so that no slide is used in more than one set. This appears to be a clean separation of the data. However, the paper did not consider that some slides are strongly correlated. For example, if a patient's tumor is cut and stained to produce multiple slides, these slides are strongly correlated. If one such slide is used for training and another for testing, the deep neural network can essentially memorize the pattern on the training slide and apply that memory to the test slide. Hence, by memorization, the network can predict very well on the slide in the test set. This mechanism of prediction is not useful in a practical clinical setting, since no two tumors are the same in the real world; there, we need the deep neural network to generalize across patients and tumors. Hereafter, we call this way of data segregation slide-level segregation.

There is a better way to perform data segregation, one that is compatible with deploying a deep learning model in practical clinical settings. First, the patients are segregated exclusively into training, validation and test sets. All the slides belonging to the patients in the training set are used solely for training. Similarly, all the slides belonging to the patients in the test set are used for testing only. Segregating the data in this way forces the deep neural network to generalize across patients. We call this way of data segregation patient-level segregation.
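
The patient-level segregation described above can be illustrated with a minimal sketch. This is not the code used in the study; the function name `split_by_patient`, the split fractions, and the toy slide and patient identifiers are all assumptions made for illustration. The only point it demonstrates is that patients, not slides, are shuffled and divided, so every slide of a patient lands in the same set.

```python
import random
from collections import defaultdict


def split_by_patient(slide_to_patient, fractions=(0.7, 0.15, 0.15), seed=0):
    """Assign slides to train/valid/test so that no patient spans two sets.

    `fractions` are illustrative defaults, not values from the paper.
    """
    # Group slide IDs under the patient they came from.
    slides_by_patient = defaultdict(list)
    for slide_id, patient_id in slide_to_patient.items():
        slides_by_patient[patient_id].append(slide_id)

    # Shuffle the patients (not the slides) reproducibly.
    patients = sorted(slides_by_patient)
    random.Random(seed).shuffle(patients)

    n_train = round(fractions[0] * len(patients))
    n_valid = round(fractions[1] * len(patients))
    patient_sets = {
        "train": patients[:n_train],
        "valid": patients[n_train:n_train + n_valid],
        "test": patients[n_train + n_valid:],
    }

    # Expand each set of patients back into the slides they own.
    return {
        name: sorted(s for p in members for s in slides_by_patient[p])
        for name, members in patient_sets.items()
    }


# Hypothetical toy data: 6 patients with 1 to 3 slides each.
slide_to_patient = {
    "P1_s1": "P1", "P1_s2": "P1",
    "P2_s1": "P2",
    "P3_s1": "P3", "P3_s2": "P3", "P3_s3": "P3",
    "P4_s1": "P4",
    "P5_s1": "P5", "P5_s2": "P5",
    "P6_s1": "P6",
}

splits = split_by_patient(slide_to_patient)
for name, slides in splits.items():
    print(name, slides)
```

Slide-level segregation would instead shuffle the keys of `slide_to_patient` directly, which is what allows two slides of the same tumor to end up on opposite sides of the train/test boundary.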

In the slide-level segregation analysis, we obtained results similar to those presented by Coudray et al. [1]: overall performance on the test set was good. However, this performance was illusory due to data leakage. The model gave very good test results on slides from patients who also had slides in the training set, whereas the test results were quite poor on slides from patients who had no slides in the training set. Hereafter, we call a slide in the test set seen-patient data if the corresponding patient also has slides in the training set, and unseen-patient data otherwise. Furthermore, we analyzed the performance of the model on data segregated by the patient-level approach. Note that, in this approach, every patient in the test set is unseen during training, which mimics the real-world clinical workflow. We observed a significant drop in the performance of the model on the test set of the patient-level segregation approach compared to its performance on the test set of the slide-level segregation approach. Moreover, the performance of the model on the test set of the patient-level segregation approach was very similar to its performance on the unseen-patient data in the test set of the slide-level segregation approach. Hence, we conclude that patient-level segregation is crucial and appropriate for simulating the real-world scenario, where each patient in the test set can be thought of as a patient walking into the clinic tomorrow.
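
The seen-patient versus unseen-patient breakdown can be sketched as follows. This is an illustrative reconstruction, not the authors' evaluation code: the helper names are hypothetical, plain accuracy stands in for whatever metric the study reports, and the labels and predictions are invented solely to exercise the code.

```python
def partition_test_slides(train_slides, test_slides, slide_to_patient):
    """Split test slides into seen-patient and unseen-patient subsets."""
    train_patients = {slide_to_patient[s] for s in train_slides}
    seen = [s for s in test_slides if slide_to_patient[s] in train_patients]
    unseen = [s for s in test_slides if slide_to_patient[s] not in train_patients]
    return seen, unseen


def accuracy(slides, labels, predictions):
    """Fraction of slides whose prediction matches the label (NaN if empty)."""
    if not slides:
        return float("nan")
    return sum(labels[s] == predictions[s] for s in slides) / len(slides)


# Hypothetical slide-level split: patients A and B appear in both sets.
slide_to_patient = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C", "d1": "D"}
train_slides = ["a1", "b1"]
test_slides = ["a2", "b2", "c1", "d1"]

# Made-up labels and predictions, for demonstration only.
labels = {"a2": 1, "b2": 0, "c1": 1, "d1": 0}
predictions = {"a2": 1, "b2": 0, "c1": 0, "d1": 0}

seen, unseen = partition_test_slides(train_slides, test_slides, slide_to_patient)
print("seen-patient slides:", seen, "accuracy:", accuracy(seen, labels, predictions))
print("unseen-patient slides:", unseen, "accuracy:", accuracy(unseen, labels, predictions))
```

Reporting the two subsets separately is what exposes the leakage: a single pooled test score averages over both and hides the gap between them.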

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

This work is partly supported by the Biomedical Research Council of the Agency for Science, Technology, and Research, Singapore and the National University of Singapore, Singapore.

Author Declarations

All relevant ethical guidelines have been followed; any necessary IRB and/or ethics committee approvals have been obtained and details of the IRB/oversight body are included in the manuscript.

Yes

All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.

Yes

Footnotes

  • {umitoner{at}comp.nus.edu.sg, ksung{at}comp.nus.edu.sg}

  • {chengyc{at}bii.a-star.edu.sg, leehk{at}bii.a-star.edu.sg}

Data Availability

All data is available online at TCGA Data Portal.

https://www.cancer.gov/tcga

Copyright 
The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
Posted April 25, 2020.
Subject Area

  • Health Informatics