%0 Journal Article
%A Alexander Gusev
%A Stefan Groha
%A Kodi Taraszka
%A Yevgeniy R. Semenov
%A Noah Zaitlen
%T Constructing germline research cohorts from the discarded reads of clinical tumor sequences
%D 2021
%R 10.1101/2021.04.09.21255197
%J medRxiv
%P 2021.04.09.21255197
%X Background: Hundreds of thousands of cancer patients have had targeted (panel) tumor sequencing to identify clinically meaningful mutations. In addition to improving patient outcomes, this activity has led to significant discoveries in basic and translational domains. However, the targeted nature of clinical tumor sequencing has a limited scope, especially for germline genetics. In this work, we assess the utility of discarded, off-target reads from tumor-only panel sequencing for recovery of genome-wide germline genotypes through imputation. Methods: We develop a framework for inference of germline variants from tumor panel sequencing, including imputation, quality control, inference of genetic ancestry, germline polygenic risk scores, and HLA alleles. We benchmark our framework on 833 individuals with tumor sequencing and matched germline SNP array data. We then apply our approach to a prospectively collected panel sequencing cohort of 25,889 tumors. Results: We demonstrate high to moderate accuracy of each inferred feature relative to direct germline SNP array genotyping: individual common variants were imputed with a mean accuracy (correlation) of 0.86; genetic ancestry was inferred with a correlation of >0.98; polygenic risk scores were inferred with a correlation of >0.90; and individual HLA alleles were inferred with correlation of >0.89. We demonstrate a minimal influence on accuracy of somatic copy number alterations and other tumor features.
We showcase the feasibility and utility of our framework by analyzing 25,889 tumors and identifying relationships between genetic ancestry, polygenic risk, and tumor characteristics that could not be studied with conventional data. Conclusions: We conclude that targeted tumor sequencing can be leveraged to build rich germline research cohorts from existing data, and make our analysis pipeline publicly available to facilitate this effort. Competing Interest Statement: The authors have declared no competing interest. Funding Statement: N.Z. and K.T. were supported by NIH grants K25HL121295, U01HG009080, R01HG006399, R01CA227237, R01ES029929, R01HG011345, the DoD grant W81XWH-16-2-0018, and the Chan Zuckerberg Science Initiative. A.G. and S.G. were supported by R01CA227237, R01CA244569, and the Doris Duke Charitable Foundation. A.G. was supported by the Louis B. Mayer Foundation and the Claudia Adams Barr Foundation. Author Declarations: I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes. The details of the IRB/oversight body that provided approval or exemption for the research described are given below: PROFILE samples were selected and sequenced from patients who were consented under institutional review board (IRB) approved protocol 11-104 and 17-000 from the Dana-Farber/Partners Cancer Care Office for the Protection of Research Subjects. Written informed consent was obtained from participants prior to inclusion in this study.
Secondary analyses of previously collected data were performed with approval from the Dana-Farber IRB (DFCI IRB protocol 19-033 and 19-025; waiver of HIPAA authorization approved for both protocols). All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived. Yes. I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes. I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable. Yes. The raw sequencing data are not publicly available because the research participant consent, privacy policy, and terms of service do not include authorization to share identifiable data. The full analysis workflow is available at: https://github.com/gusevlab/panel-imp A containerized version of the imputation pipeline is available at: https://hub.docker.com/r/stefangroha/stitch_gcs
%U https://www.medrxiv.org/content/medrxiv/early/2021/04/13/2021.04.09.21255197.full.pdf