Abstract
PURPOSE Cancer registries are important sources of real-world data (RWD) that reveal insights into practice patterns and cancer patient outcomes, but the prevalence of missing data can be high. Machine learning (ML) imputation methods can be applied to large RWD sets, but the performance of these approaches within cancer registries is unclear.
METHODS We identified non-small cell lung cancer (NSCLC) patients in the National Cancer Database who were diagnosed in 2014 and had complete data for 19 variables of known clinical and prognostic significance. We generated synthetic missing data for each variable, then performed imputation using substitution (control) and five different ML approaches. Imputation efficacy was measured by normalized root-mean-square error (RMSE) for continuous variables and proportion of falsely classified entries (PFC) for categorical variables. We also measured algorithm runtimes and the impact of incorporating imputed values on survival modeling.
RESULTS A total of 50,790 NSCLC patients were included in this study, with 81 features for each patient after data preprocessing. Among the tested ML methods, SoftImpute had the lowest RMSE (best performance) for continuous variables, ranging from 0.071 to 0.080 for 10% to 50% missing data, and MissForest had the lowest PFC (best performance) for categorical variables, ranging from 0.251 to 0.311 for 10% to 50% missing data. SoftImpute had a runtime of 3.28 × 10⁻⁴ seconds per patient record, and MissForest averaged 2.96 × 10⁻³ seconds per patient record. Deep learning imputation using a denoising autoencoder did not achieve improved performance despite higher algorithm runtimes. Cox models incorporating ML-imputed data achieved similar C-indices, ranging from 0.787 to 0.801 across all ML methods tested.
CONCLUSION ML imputation achieved promising performance for NSCLC patients within a large national cancer registry.
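The evaluation design described in the Methods (hide known values, impute them, then score the imputed entries against the truth) can be illustrated with a minimal sketch. Everything here is an assumption for illustration and not the authors' code: the toy `age` and `stage` columns are randomly generated rather than drawn from the National Cancer Database, the helper names are hypothetical, values are hidden completely at random, and RMSE is normalized by the observed range because the abstract does not state which normalization was used. Mean and mode substitution stand in for the control arm only.

```python
import numpy as np

def make_mask(n, frac, rng):
    """Hide exactly round(frac * n) positions, chosen completely at random."""
    mask = np.zeros(n, dtype=bool)
    hidden = rng.choice(n, size=int(round(frac * n)), replace=False)
    mask[hidden] = True
    return mask

def impute_mean(values, mask):
    """Substitution baseline for a continuous column: fill with the observed mean."""
    out = values.astype(float).copy()
    out[mask] = values[~mask].mean()
    return out

def impute_mode(values, mask):
    """Substitution baseline for a categorical column: fill with the observed mode."""
    out = values.copy()
    cats, counts = np.unique(values[~mask], return_counts=True)
    out[mask] = cats[np.argmax(counts)]
    return out

def normalized_rmse(truth, imputed, mask):
    """RMSE over hidden entries, divided by the column range (assumed normalization)."""
    err = truth[mask] - imputed[mask]
    return float(np.sqrt(np.mean(err ** 2)) / (truth.max() - truth.min()))

def pfc(truth, imputed, mask):
    """Proportion of hidden categorical entries that were imputed incorrectly."""
    return float(np.mean(truth[mask] != imputed[mask]))

# Toy data only: these columns are synthetic stand-ins, not registry records.
rng = np.random.default_rng(0)
age = rng.normal(68, 10, size=1000)
stage = rng.choice(np.array(["I", "II", "III", "IV"]), size=1000,
                   p=[0.3, 0.1, 0.2, 0.4])

mask = make_mask(1000, 0.10, rng)  # 10% synthetic missingness
age_rmse = normalized_rmse(age, impute_mean(age, mask), mask)
stage_pfc = pfc(stage, impute_mode(stage, mask), mask)
print(f"normalized RMSE (age): {age_rmse:.3f}, PFC (stage): {stage_pfc:.3f}")
```

Lower values are better for both metrics. Swapping `impute_mean` and `impute_mode` for an ML imputer while keeping the same mask gives a like-for-like comparison against the substitution control.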
Competing Interest Statement
Daniel X. Yang
Research Funding: ASCO Conquer Cancer, RefleXion Medical
Yongfeng Hui
Employment: Amazon (current work completed prior to employment at Amazon)
Henry S. Park
Consulting or Advisory Role: AstraZeneca, Galera Medical, Bristol Myers Squibb
Research Funding: RefleXion Medical
Travel, Accommodations, Expenses: Bristol Myers Squibb
Sanjay Aneja
Sanjay Aneja is an Associate Editor for JCO Clinical Cancer Informatics. Journal policy recused the author from having any role in the peer review of this manuscript.
Consulting or Advisory Role: Prophet Consulting (I)
Research Funding: The MedNet, Inc, American Cancer Society, National Science Foundation, Agency for Healthcare Research and Quality, National Cancer Institute, ASCO, Patterson Foundation, Amazon Web Services, RefleXion Medical
Patents, Royalties, Other Intellectual Property: Provisional patent of deep learning optimization algorithm
Travel, Accommodations, Expenses: Prophet Consulting (I), Hope Foundation
Other Relationship: NRG Oncology Digital Health Working Group, SWOG Digital Engagement Committee, ASCO mCODE Technical Review Group
Funding Statement
This work was funded in part by a Conquer Cancer Young Investigator Award. Any opinions, findings, and conclusions expressed in this material are those of the author(s) and do not necessarily reflect those of the American Society of Clinical Oncology, Conquer Cancer, or the Funder.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
This study was granted an institutional review board exemption by the Yale Human Investigations Committee because only Health Insurance Portability and Accountability Act (HIPAA)-compliant, de-identified patient information was used.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.
Yes
Data Availability
All data produced in the present study are available through an application process open to investigators associated with Commission on Cancer (CoC)-accredited cancer programs.