TY  - JOUR
T1  - Identification of non-small cell lung cancer subgroups with distinct immuno-therapy outcomes from integrating genomics and electronic health records on a graph convolutional network
JF  - medRxiv
DO  - 10.1101/19011437
SP  - 19011437
AU  - Chao Fang
AU  - Dong Xu
AU  - Jing Su
AU  - Jonathan Dry
AU  - Bolan Linghu
Y1  - 2019/01/01
UR  - http://medrxiv.org/content/early/2019/11/12/19011437.abstract
N2  - Recently, immuno-oncology (IO) therapies, especially checkpoint inhibitor therapies, have transformed the therapeutic landscape of non-small cell lung cancer (NSCLC). However, responses to IO in NSCLC vary widely because patients are heterogeneous in both genomic and clinical-phenotype characteristics. Thus, there is a pressing need to discover and characterize NSCLC subgroups to advance precision immuno-oncology. This task is challenging largely because 1) study cohorts are too small to investigate this heterogeneous disease; 2) the datasets used in subtyping studies are not comprehensive enough to incorporate both genomic data and diverse clinical-phenotype data with long-term follow-up; and 3) the subtyping algorithms and models are ineffective at integrating high-dimensional data from both genomic and clinical domains. To address these challenges, we developed a graph convolutional neural network (GCN) method to characterize the heterogeneity of IO treatment responses in NSCLC, based on high-dimensional electronic health records (EHR) and genomic data from 1,937 IO-treated NSCLC patients. First, using Flatiron Health’s database, we identified an IO-treated NSCLC cohort (n = 1,937), with genomic data from Foundation Medicine’s targeted DNA deep-sequencing, clinical data from harmonized real-world EHR from 275 US oncology practices, and survival data after IO treatment with a median follow-up time of 6.61 months (average follow-up time 9.11 months).
We then developed a GCN-based artificial intelligence (AI) model that builds a patient-patient similarity network by integrating genomic and EHR data, in order to discover novel NSCLC subgroups with dramatically different responses to IO therapies. We demonstrated that the GCN outperforms commonly used machine learning methods such as autoencoders, UMAP, and t-SNE, and outperforms the use of genomic or clinical data alone. Importantly, we discovered an IO-responsive subgroup (20.27% of the cohort) and an IO non-responsive subgroup (45.46%) that show a significant difference in overall survival after IO treatment (9.42 vs. 20.35 months, p < 0.0001). These two subgroups are enriched for novel clinical phenotypes and genomic traits beyond the well-known IO biomarkers of tumor mutation burden and PDL1 status: abnormal blood basophils and KRAS mutations in the responsive subgroup, and low hemoglobin, low lymphocytes, PIK3CA amplifications, etc. in the non-responsive subgroup, suggesting distinct clinical and molecular underpinnings. To the best of our knowledge, this is the first study to employ a graph-based AI approach to integrate both high-dimensional clinical and genomic features to investigate IO treatment responses in NSCLC. The new subtypes discovered in this work shed new light on the heterogeneity of IO treatment responses and pave the way to informing clinical decision making for precision oncology of NSCLC. Competing Interest Statement: The project was funded by AstraZeneca. Clinical Trial: This is a retrospective observational analysis of data from a secondary source and hence does not need to be registered at ClinicalTrials.gov. The study cohort was generated from the Flatiron Health electronic health records (EHR) database with de-identified patient data.
Cohort Reference: Singal et al., JAMA 2019, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6459115/ Funding Statement: Fully sponsored by AstraZeneca. Author Declarations: All relevant ethical guidelines have been followed; any necessary IRB and/or ethics committee approvals have been obtained and details of the IRB/oversight body are included in the manuscript. Yes. All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived. Yes. I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes. I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable. Yes. The data are patients' electronic health record data and we are not releasing the data to the public.
ER  -