Prediction Model for Detection of Sporadic Pancreatic Cancer (PRO-TECT) in a Population-Based Cohort Using Machine Learning and Further Validation in a Prospective Study ========================================================================================================================================================================= * Wansu Chen * Yichen Zhou * Fagen Xie * Rebecca K. Butler * Christie Y. Jeon * Tiffany Q. Luong * Yu-Chen Lin * Eva Lustigova * Joseph R. Pisegna * Sungjin Kim * Bechien U. Wu ## ABSTRACT **OBJECTIVES** There is currently no widely accepted approach to screening for pancreatic cancer (PC). We aimed to develop and validate a risk prediction model for PC across two health systems using electronic health records (EHR). **METHODS** This retrospective cohort study consisted of patients 50-84 years of age meeting utilization criteria in 2008-2017 at Kaiser Permanente Southern California (KPSC, model training, internal validation) and the Veterans Affairs (VA, external validation). ‘Random survival forests’ models were built to identify the most relevant predictors from >500 variables and to predict PC within 18 months of cohort entry. A prospective study was then conducted in KPSC to assess feasibility of the model for real-time implementation. **RESULTS** The KPSC cohort consisted of 1.8 million patients (mean age 61.6) with 1,792 PC cases. The estimated 18-month incidence rate of PC was 0.77 (95% CI 0.73-0.80)/1,000 person-years. The three models containing age, abdominal pain, weight change and two laboratory biomarkers (ALT change/HgA1c, rate of ALT change/HgA1c, or rate of ALT change/rate of HgA1c change) had comparable discrimination and calibration measures (c-index: mean=0.77, SD=0.01-0.02; calibration test: p-value 0.2-0.4, SD 0.2-0.3). The VA validation cohort consisted of 2.6 million patients (mean age 66.1) with an 18-month incidence rate of 1.27 (1.23-1.30). A total of 606 patients were screened in the prospective pilot study at KPSC with 9 patients (1.5%) diagnosed with a pancreatic or biliary cancer. **CONCLUSIONS** Using widely available parameters in EHR, we developed a population-based parsimonious model for early detection of sporadic PC suitable for real-time application. **What Is Known** * Patients with pancreatic cancer are often diagnosed at late stages. * Early detection is needed to impact the natural history of disease progression and improve patient survival. **What Is New Here** * Machine-learning was used to develop a population-based model for early detection of pancreatic cancer. The model was internally and externally validated in cohorts of 1.8 million and 2.6 million individuals, respectively. * Calibration was excellent in prospective pilot testing for detection of pancreatic malignancy. Keywords * risk prediction * pancreatic cancer * machine learning * general population * glycated hemoglobin * alanine transaminase * weight loss ## INTRODUCTION Pancreatic cancer is the third leading cause of cancer deaths with 48,220 estimated deaths in 2021 in the US.1 Because of the lack of an early detection strategy, majority of patients (50-55%) have metastases at distant sites at the time of diagnosis.2,3 Once diagnosed, the average 5-year survival is only 10.8%.1 Accounting for 90% of all pancreatic cancer cases, pancreatic ductal adenocarcinoma (PDAC) is by far the most common form of pancreatic cancer, and also the most lethal. Due to the low incidence of pancreatic cancer in the general population (13.2 per 100,000 person-years),1 widespread population-based screening is not currently recommended by the United States Preventative Services Task Force.4 Therefore, alternative approaches to early detection are needed in order to substantially impact the natural history of this disease and improve survival for patients. The emergence of comprehensive EHR and maturation of machine learning offers an opportunity to enhance efforts in early detection in pancreatic cancer. To date, efforts to develop clinical prediction models in pancreatic cancer have focused on specific populations such as those with new-onset diabetes5-7or within the confines of a case-control study.8,9 Efforts to target the high-risk patients in the general population are sparse.10 There is a critical need for novel risk stratification tools which are both sensitive and specific for rapid identification of patients at increased risk of developing pancreatic cancer. The aim of the present study was to develop and validate a clinical prediction model for risk of PDAC across several large health systems. Specifically, we sought to apply machine-learning combined with a comprehensive approach to data in EHR to predict the risk of sporadic PDAC. ## METHODS ### Study Design and Setting We conducted a retrospective cohort study utilizing multi-ethnic health plan enrollees of Kaiser Permanente Southern California (KPSC), a large integrated healthcare system that provides comprehensive healthcare services for >4.7 million enrollees across 15 medical centers and 235 medical offices. Model training and internal validation were conducted based on EHR data. The demographics and socioeconomic status of KPSC health plan enrollees are comparable to those of residents in the Southern California region.11 The internally validated models were externally tested using EHR of Veterans Affairs (VA).12 The study protocol was approved by the KPSC’s institutional Review Board. ### Study Population #### Model training and internal validation Patients 50-84 years of age and had ≥1 clinic-based visit (index visit) within a KPSC facility in 2008-2017 were identified. Patients who had history of pancreatic cancer, or not continuously enrolled in the KPSC health plan in the past 12 months (gaps 45 days or less were allowed) were excluded. The requirement of continuous enrollment allowed adequate data to define study variables. For patients with multiple qualifying index visits, we selected one randomly as the index visit. The corresponding visit date was referred to as the index date (t). Follow-up started on t and ended with the earliest of the following events: disenrollment from the health plan, end of the study (December 31, 2018), reached the maximum length of follow-up (18 months), non-PDAC related death, or PDAC diagnosis or death (outcome). A minimum of 30 days of follow-up is required. #### Model testing Veterans 50-84 years of age who had >1 outpatient visit (index visit) within a VA facility in 2008-2017 and another clinic-based visit within the 12 months prior to the index date were identified. Patients who had history of pancreatic cancer were excluded. The same follow-up rules mentioned above were applied to the VA cohort except for “disenrollment from the health plan”. ### Outcome Identification The study outcome was PDAC diagnosis or death with pancreatic cancer in the 18 months after the index date. For the KPSC cohort, PDAC was identified from the Cancer Registry by using the Tenth Revision of International Classification of Diseases, Clinical Modification (ICD-10-CM) code C25.x and histology codes (eTable 1). Pancreatic cancer deaths were derived from the linkage with the California State Death Master files and identified using ICD-10-CM codes C25.x.13 For the VA cohort, cases of PDAC were similarly identified through an internal VA Central Cancer Registry, and PDAC deaths identified through the VA Mortality Data Repository, which integrates vital status data from the National Death Index (NDI), VA, and DoD administrative records. ### Patient Demographic and Clinical Features at Baseline A complete list of extracted and derived features for the KPSC cohort is shown in eTable 2. Except for demographic variables, values within each time interval (0-6 months, 7-12 months, 1-2 years and >2 years) were generated. Definitions of the derived variables were described in eTable3. Since the VA dataset was solely used for testing purposes, only limited number of features were extracted (Table 1). View this table: [Table 1.](http://medrxiv.org/content/early/2022/02/15/2022.02.14.22270946/T1) Table 1. Characteristics of study subjects at baseline, n (%) unless otherwise stated. Missing values were imputed14 if the frequency of missing was <60%. We used predictive mean matching method15 with k=5. Laboratory measures with ≥60% missingness or change/change rate measures with ≥80% missingness were not included in the model development process. Ten imputed datasets were generated. ### Model Training, Validation and Testing To overcome the limitations of regression-based models that are traditionally used for analysis of time-to-event data, we applied ‘random survival forests’ (RSF), a nonparametric machine learning method,16-18 to pre-select features and train/validate models. First, we iteratively preselected features based on the average minimum depth (eTable 4) and used those features to develop and validate risk prediction models based on 5-fold cross validation.19 Age was forced into the model. Preselected features were added incrementally to identify the feature that yielded the maximum improvement of c-index. This process continued until the c-index increased <0.005. Of the 50 models derived from the 50 training datasets (10 imputation datasets x 5-fold cross validation), the three that appeared the most often were selected as the winning models. Algorithms of the winning models were applied to the corresponding KPSC validation datasets. By design, the KPSC validation datasets did not include any observations of the KPSC training datasets from which the winning models were developed. One winning model was first directly applied to VA imputed datasets, and subsequently recalibrated to achieve better performance. ### Performance Measures The discriminative power for each of the winning models was evaluated by c-index, a concordance measure, averaged across all the relevant validation datasets for cohort members. Calibration was assessed by calibration plots with five risk groups (<50th, 50–74th, 75–89th, 90–94th, 95–100th percentiles).20 Greenwood-Nam-D’Agostino (GND) calibration test was also performed to assess goodness-of-fit. We estimated sensitivity, specificity, positive predictive value (PPV), and relative increase in risk in comparison to that of the entire cohort at various levels of risk thresholds. For this analysis we restricted the patients to those with complete follow up or developed PDAC in 18 months. The results were averaged across the validation datasets for each winning model. ### Early Detection Model To facilitate earlier detection of PDAC by ≥90 days, we also established a cohort which included patients identified in the main cohort who had ≥90 days of cancer-free follow up. The same model training and validation methods mentioned above were applied. ### Prospective pilot study We conducted a subsequent prospective study from February to December 2021 to evaluate feasibility of real-time implementation of the final prediction model. We aimed to assess the calibration of the model (frequency of observed vs. expected cancer). This study was approved as a separate protocol by the KPSC Institutional Review Board. We prospectively ran a final algorithm on a bi-weekly basis to identify patients aged 50-84 without a prior history of pancreatic cancer whose predicted risk of pancreatic cancer is ≥1%. All algorithm-identified patients were manually reviewed. Patients with findings suspicious for a pancreatic cancer on cross-sectional imaging obtained through routine clinical care up to 3 months prior to the risk identification date (the date when the algorithm was applied to EHR) were monitored. The diagnosis was based on either histology where available or presumptive clinical diagnosis as documented in patients’ EHR. ### Statistical Analysis Analyses were performed using SAS (Version 9.4 for Unix; SAS Institute, Cary, NC) or R Version 3.6.0 (R Foundation, Vienna, Austria). ## RESULTS ### Characteristics of the study cohorts 1.8 million KPSC patients were eligible (eFigure 1), of which 53.3% were females, 45.3% were white, 29.8% were Hispanic, 9.5% were African American and 10.7% were Asian and Pacific Islanders (Table 1). The majority (60.7%) used commercial insurance, and slightly under one-third (31.8%) were on Medicare. On average, the KPSC patients were 61.6 years of age, with average membership length of 18.9 years. 35.7% of the patients were obese and additional 35.6% were overweight. Diabetes was common (about 20%), while acute and chronic pancreatitis were rare (<1%). The 2.6 million eligible veterans were predominantly male (94.3%), white (67.9%) and African American (16.8%) and were older (66.1 years of age) than the KPSC cohort. Smoking, diabetes, acute and chronic pancreatitis were more prevalent in the VA cohort compared to those of the KPSC cohort. ALT and HbA1c at baseline appeared comparable between the two cohorts. ### Incidence of PDAC Table 2 displays the follow-up time in years, number, incidence rate (IR) of PDAC, and time to PDAC for all patients and for subgroups of patients defined by important features. 1,792 KPSC patients developed PDAC within 18 months of follow-up (IR=0.77, 95% CI 0.73-0.80/1,000 person-years (PY) (Table 2). A total of 4,582 patients in the VA cohort developed PDAC (IR 1.27 (1.23-1.30)). In the VA cohort, abdominal pain in the 6 months prior to t increased the IR to 4.35 (4.01-4.70). Time to PDAC appeared to be longer for the VA cohort (median 233 days, IQR 116-370 days) compared to that of the KPSC cohort (median 205 days, IQR 91-358 days). The distributions of cancer stage (I-IV) were comparable between the two cohorts (eTable 5), if the higher frequency of missingness in the VA cohort is ignored. View this table: [Table 2.](http://medrxiv.org/content/early/2022/02/15/2022.02.14.22270946/T2) Table 2. Total follow-up (f/u) time, number, and incidence rate of PDAC per 1,000 person-years (PY) and 95% CI. ### Model Training, Validation and Testing The number and size of the KPSC training, KPSC validation and VA testing datasets are shown in eTable 6. For the main cohort, the preselection process identified 29 potential predictors (eTable 4). Of the 50 training samples, the three winning models containing age, abdominal pain, weight change and two biomarkers (alanine transaminase (ALT) change/HbA1c, rate of ALT change/HbA1c, and rate of ALT change/rate of HbA1c change) appeared most often (Table 3). Internal validation based on KPSC validation datasets revealed comparable results among the top three models (c-index: mean 0.77 for all three models (M1-M3) and SD 0.01-0.02; calibration test: p-value 0.2-0.4 and SD 0.2-0.3). When M1 was directly applied to VA testing datasets, the mean c-index was 0.69 (SD 0.003) (data not shown); however, after the algorithm was recalibrated based on VA’s datasets, the mean c-index based on 10 testing datasets was 0.71 (SD 0.002) (Table 3). View this table: [Table 3.](http://medrxiv.org/content/early/2022/02/15/2022.02.14.22270946/T3) Table 3. Predictors selected in top 3 modelsa developed based on the main cohort and restricted cohort, and model performance during internal validation and external testing. For the early detection cohort, the preselection process identified 32 potential predictors (eTable 4). The three best models (E1-E3, c-index: 0.74-0.77) contained the same features as those selected by M1-M3 except that abdominal pain was not chosen (Table 3). The calibration test for model E3 was significant (p=0.04), indicating a lack of model fit. The recalibrated E1 model based on the VA testing datasets achieved a mean c-index of 0.68 (SD 0.003) (Table 3). The hyperparameters used and the features selected by ≥5 out of 50 models for each cohort can be found in eTables 7 and 8, respectively. Figure 1 displays the calibration plots for all six models (M1-M3, E1-E3). It appears that all fit well for the four out of five lower risk groups (i.e., risk<95th percentile). However, for the highest risk group (risk ≥95th percentile), M1-M3 properly estimated the risks at KPSC, while E1-E3 slightly overestimated the risks for KPSC patients, and M1, E1 slightly underestimated the risks for VA patients. ![Figure 1](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2022/02/15/2022.02.14.22270946/F1.medium.gif) [Figure 1](http://medrxiv.org/content/early/2022/02/15/2022.02.14.22270946/F1) Figure 1 Calibration plots of winning models. x-axis: predicted; y-axis: observed. The five clusters represent the five risk groups defined by the ranges of predicted risks: <50th, 50–74th, 75–89th, 90–94th, 95–100th percentiles. Within each cluster, there are multiple dots representing the pairs of predicted and observed risks, calculated based on the corresponding validation datasets. Sensitivity, specificity, PPV and fold increase in risk were comparable among the three winning models for both main cohort (M1-M3) and early detection cohort (E1-E3) (Table 4). The top 2.5% of the testing sample based on M1 experienced 1% risk of PDAC over 18 months, which was 7- to 8-fold higher than the baseline risk of PDAC in the KPSC cohort. Twenty % of the total PDAC cases occurring within 18 months were identified in this top 2.5% model-predicted high-risk group. Patients within the top 20% predicted risk of PDAC experienced 0.6% risk of PDAC over 18 months. This identified more than 50% of PDAC occurring within 18 months with a specificity of 80%. While sensitivities and fold increases in PDAC incidence rate were lower in the VA population, PPVs were higher for VA validation data compared to those of KPSC (Table 4). View this table: [Table 4.](http://medrxiv.org/content/early/2022/02/15/2022.02.14.22270946/T4) Table 4. Percent of patientsa whose risk was among the top 20%, 15%, 10%, 5%, and 2.5%, sensitivity, specificity, positive predictive value (PPV), and risk fold increase for each of the winning models based on KPSC validation datasets and one of the winning models based on VA validation dataset. Compared to the models developed based on the main cohort (M1-M3), the models developed based on early detection cohort (E1-E3) had slightly compromised sensitivity, PPV and fold increase in risk. ### Model application/implementation A total of 606 patients were identified by the model with predicted risk of ≥1%. Nine (1.5%) patients had abnormalities identified in the region of the pancreas with the following diagnoses/stages: PDAC stages 1b, II, III, IV; pancreatic lymphoma; pancreatic neuroendocrine tumor; ampullary carcinoma; cholangiocarcinoma (n=2). To facilitate external application of the RSF-based prediction model (M1), we have developed a publicly available web-based tool ([https://pcriskdev.kp-scalresearch.org/](https://pcriskdev.kp-scalresearch.org/)). A hypothetical 70-year-old male patient with hemoglobin A1c value of 7.5%, weight loss of 4 lbs. and ALT increase of 4 IU/L in one year has an estimated 18-month risk of PDAC 0.30%. For demonstration, decision rules based on one of the trees built for M1 is displayed in eFigures 2 and 3 for the left and the right side of the decision tree, respectively. ## DISCUSSION We applied machine learning methods to EHR data to derive and validate clinical prediction models for sporadic pancreatic cancer across two large integrated healthcare systems. Despite inclusion of >500 potential features in the candidate pool, the machine learning models incorporated traditional parameters including age, glycated hemoglobin, alanine aminotransferase, weight, and abdominal pain. The final models were both parsimonious (with only 4-5 predictors) and reasonably accurate in both internal and external validation. In addition, the model proved well-calibrated when applied prospectively for real-time identification of not only pancreatic but biliary cancers. While there has been some progress in studying approaches to early detection in high-risk patients based on either family history or genetic susceptibility21 as well as those with specific conditions such as late-onset diabetes,6 limited data exist on identification of patients at risk for sporadic pancreatic cancer. This study presents a novel approach to risk stratification at the population-level based on dynamic parameters contained within structured data from EHR. Although parameters included in the model are well-established parameters for PDAC,5,6,10 their selection using an unbiased, comprehensive data-driven approach helped ensure inclusion of the most relevant combination of parameters. A recently developed model to predict risk of pancreatic cancer among patients with late or new onset diabetes at age 50 or later similarly identified increasing age, weight loss and change in blood glucose as key parameters for determining risk of pancreatic cancer in this patient population.5 The current model extends the concept of model-based risk prediction to a much broader population while maintaining reasonably high levels of discriminative accuracy for prediction of pancreatic cancer. External validation is key to assessing model performance. In a review of 127 prediction models, Siontis et. al found 32 (25%) had at least one external validation.22 AUC estimates significantly decreased during external validation vs. the derivation study with a median AUC reduction 0.05.22 In the current study, the c-index declined by 0.08 (or 10%) when model M1 was directly transported and about 0.06 (or 8%) and 0.09 (or 12%) after models M1 and E1 were recalibrated, respectively. Although the comparison should be interpreted cautiously, the larger reduction observed in the current study could be attributable to multiple factors. First, a higher frequency of PDAC cases in the VA cohort were pancreatic cancer deaths identified through mortality records compared to the KPSC cohort. Second, given the differences in age and sex between KPSC and VA populations, a higher incidence rate of PDAC in the VA dataset compared to that of KP’s was observed as expected. This could have impacted model accuracy especially for the models without recalibration. Strengths of the current study included a comprehensive, data-driven approach to model development, use of structured data elements and external validation in a separate healthcare system with distinct patient population. The present study also extended model development to assess feasibility of implementing the model through application in a prospective pilot study that demonstrated ability to identify pancreaticobiliary cancers in real-time. The present study also had several limitations. First, several parameters identified in the prediction models (abdominal pain, abnormal ALT) are often associated with advanced stage pancreatic cancer. However, several early-stage pancreatic cancer cases were detected by the algorithm in the prospective feasibility study. Nevertheless, an ongoing concern relates to the timing with respect to clinical diagnosis. The 30-day cancer-free period used in the present model is likely insufficient to provide a reasonable window of opportunity for intervention to impact the disease course. To address this concern, we also developed an early detection model that restricted the study population to patients with ≥90 days cancer-free follow-up from the index date. Further testing of this restricted model for early detection of pancreatic cancer is the subject of an ongoing single-arm prospective interventional study ([NCT04883450](http://medrxiv.org/lookup/external-ref?link_type=CLINTRIALGOV&access_num=NCT04883450&atom=%2Fmedrxiv%2Fearly%2F2022%2F02%2F15%2F2022.02.14.22270946.atom)). Despite the reasonably high performance in terms of discriminative ability, the absolute risk in the highest risk category (top 2.5%) approached 1% over 18-months. This level of risk is likely below the threshold for cost-effective screening based on currently available testing.23 Second, of the 1479 and 4,582 events identified in the KPSC and VA cohorts, respectively, 300 and 2,564 events were captured by data sources other than Cancer Registry. An evaluation based on the KPSC Cancer Registry of the same time window showed that about 90% of pancreatic cancer cases were PDAC. Third, to estimate sensitivity, specificity, PPV and fold of risk increase, we relied on a subset of patients (∼70% and ∼80% of the total patients in the KPSC and VA cohorts, respectively) with complete follow up unless they died of pancreatic cancer. This restriction over-estimated the risk of PDAC, because the patients who were excluded from this analysis were at-risk for some periods of time. Finally, several cancers identified during prospective pilot testing of the algorithm were biliary in origin given similarities in anatomic location and clinical presentation. In conclusion, we developed a parsimonious clinical risk prediction model for sporadic pancreatic cancer in a large, diverse integrated health system and subsequently applied the model in a separate health system. We also evaluated the model prospectively for real-time application. The model identified five key factors in determining risk of pancreatic cancer. Findings from the present study provide a potential framework for a systematic approach to targeted screening for pancreatic cancer based on automated analysis of data in EHR. ## Supporting information Online Supplements [[supplements/270946_file04.docx]](pending:yes) eFigure 1 [[supplements/270946_file05.tif]](pending:yes) eFigure 2 [[supplements/270946_file06.tif]](pending:yes) eFigure 3 [[supplements/270946_file07.tif]](pending:yes) ## Data Availability Anonymized data that support the findings of this study may be made available from the investigative team in the following conditions: 1) agreement to collaborate with the study team on all publications, 2) provision of external funding for administrative and investigator time necessary for this collaboration, 3) demonstration that the external investigative team is qualified and has documented evidence of training for human subjects protections, and 4) agreement to abide by the terms outlined in data use agreements between institutions. ## Figure Legends **eFigure 1** – Consort Diagram (KPSC and VA main cohorts) **eFigure 2** – One of the decision trees for KPSC M1 (left side) **eFigure 3** – One of the decision trees for KPSC M1 (right side) ## Footnotes * **Address where the work was conducted:** 100 S Los Robles, 2nd Floor, Pasadena, CA 91101 * **Author Conflict of Interest / Study Support** * **Guarantor of the article:** Dr. Wansu Chen accepts full responsibility for the conduct of the study. She has had access to the data and has control of the decision of the decision to publish. * **Financial support:** Research reported in this publication was supported by the National Cancer Institute of the National Institutes of Health under Award Number R01CA230442. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. * **Potential competing interests:** The authors declare they have no conflict of interest for this study. ## Abbreviations ALT : Alanine transaminase AUC : area under the curve CI : confidence interval DoD : Department of Defense EHR : electronic health record GND : Greenwood-Nam-D’Agostino HbA1C : glycated hemoglobin ICD-9-CM : Ninth Revision of International Classification of Diseases, Clinical Modification ICD-10-CM : Tenth Revision of International Classification of Diseases, Clinical Modification IR : Incidence rate KPSC : Kaiser Permanente Southern California NDI : National Death Index NOD : new onset diabetes PC : pancreatic cancer PDAC : pancreatic ductal adenocarcinoma PPV : positive predictive value RSF : Random Survival Forest SEER : Surveillance, Epidemiology, and End Results VA : Veterans Affairs * Received February 14, 2022. * Revision received February 14, 2022. * Accepted February 15, 2022. * © 2022, Posted by Cold Spring Harbor Laboratory The copyright holder for this pre-print is the author. All rights reserved. The material may not be redistributed, re-used or adapted without the author's permission. ## References 1. 1.Cancer Stat Facts: Pancreatic Cancer. SEER [https://seer.cancer.gov/statfacts/html/pancreas.html](https://seer.cancer.gov/statfacts/html/pancreas.html). Accessed July 13, 2021. 2. 2.Stathis A, Moore MJ. Advanced pancreatic carcinoma: current treatment and future challenges. Nat Rev Clin Oncol. 2010;7(3):163–172. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/nrclinonc.2009.236&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=20101258&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2022%2F02%2F15%2F2022.02.14.22270946.atom) 3. 3.Stokes JB, Nolan NJ, Stelow EB, et al. Preoperative capecitabine and concurrent radiation for borderline resectable pancreatic cancer. Ann Surg Oncol. 2011;18(3):619–627. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1245/s10434-010-1456-7&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=21213060&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2022%2F02%2F15%2F2022.02.14.22270946.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000287703500004&link_type=ISI) 4. 4.US Preventive Services Task Force, Owens DK, Davidson KW, et al. Screening for Pancreatic Cancer: US Preventive Services Task Force Reaffirmation Recommendation Statement. JAMA. 2019;322(5):438–444. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1001/jama.2019.10232&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2022%2F02%2F15%2F2022.02.14.22270946.atom) 5. 5.Sharma A, Kandlakunta H, Nagpal SJS, et al. Model to Determine Risk of Pancreatic Cancer in Patients With New-Onset Diabetes. Gastroenterology. 2018;155(3):730–739 e733. [PubMed](http://medrxiv.org/lookup/external-ref?access_num=29775599&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2022%2F02%2F15%2F2022.02.14.22270946.atom) 6. 6.Boursi B, Finkelman B, Giantonio BJ, et al. A Clinical Prediction Model to Assess Risk for Pancreatic Cancer Among Patients With New-Onset Diabetes. Gastroenterology. 2017;152(4):840–850 e843. 7. 7.Chen W, Butler RK, Lustigova E, Chari ST, Wu BU. Validation of the Enriching New-Onset Diabetes for Pancreatic Cancer Model in a Diverse and Integrated Healthcare Setting. Dig Dis Sci. 2021;66(1):78–87. 8. 8.Kim J, Yuan C, Babic A, et al. Genetic and Circulating Biomarker Data Improve Risk Prediction for Pancreatic Cancer in the General Population. Cancer Epidemiol Biomarkers Prev. 2020;29(5):999–1008. [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6NDoiY2VicCI7czo1OiJyZXNpZCI7czo4OiIyOS81Lzk5OSI7czo0OiJhdG9tIjtzOjUwOiIvbWVkcnhpdi9lYXJseS8yMDIyLzAyLzE1LzIwMjIuMDIuMTQuMjIyNzA5NDYuYXRvbSI7fXM6ODoiZnJhZ21lbnQiO3M6MDoiIjt9) 9. 9.Klein AP, Lindstrom S, Mendelsohn JB, et al. An absolute risk model to identify individuals at elevated risk for pancreatic cancer in the general population. PLoS One. 2013;8(9):e72311. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1371/journal.pone.0072311&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=24058443&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2022%2F02%2F15%2F2022.02.14.22270946.atom) 10. 10.Yu A, Woo SM, Joo J, et al. Development and Validation of a Prediction Model to Estimate Individual Risk of Pancreatic Cancer. PLoS One. 2016;11(1):e0146473. 11. 11.Koebnick C, Langer-Gould AM, Gould MK, et al. Sociodemographic characteristics of members of a large, integrated health care system: comparison with US Census Bureau data. Perm J. 2012;16(3):37–41. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.7812/TPP/12-031&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=23012597&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2022%2F02%2F15%2F2022.02.14.22270946.atom) 12. 12.Fihn SD, Francis J, Clancy C, et al. Insights from advanced analytics at the Veterans Health Administration. Health Aff (Millwood). 2014;33(7):1203–1211. [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6OToiaGVhbHRoYWZmIjtzOjU6InJlc2lkIjtzOjk6IjMzLzcvMTIwMyI7czo0OiJhdG9tIjtzOjUwOiIvbWVkcnhpdi9lYXJseS8yMDIyLzAyLzE1LzIwMjIuMDIuMTQuMjIyNzA5NDYuYXRvbSI7fXM6ODoiZnJhZ21lbnQiO3M6MDoiIjt9) 13. 13.Chen W, Yao J, Liang Z, et al. Temporal Trends in Mortality Rates among Kaiser Permanente Southern California Health Plan Enrollees, 2001-2016. Perm J. 2019;23. 14. 14.Wright MN, Ziegler A. ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. J Stat Softw. 2017;77(1). 15. 15.Little RJA. Missing-Data Adjustments in Large Surveys. J Bus Econ Stat. 1988;6(3):287–296. 16. 16.Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival forests. Ann Appl Stat. 2008(3):841–860. 17. 17.Dietrich S, Floegel A, Troll M, et al. Random Survival Forest in practice: a method for modelling complex metabolomics data in time to event analysis. Int J Epidemiol. 2016;45(5):1406–1420. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/ije/dyw145&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=27591264&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2022%2F02%2F15%2F2022.02.14.22270946.atom) 18. 18.Ishwaran H KU. randomForestSRC: Fast Unified Random Forests for Survival, Regression, and Classification (RF-SRC). [http://web.ccs.miami.edu/~hishwaran/](http://web.ccs.miami.edu/~hishwaran/). Accessed July 8, 2021. 19. 19.Stone M. Cross-Validatory Choice and Assessment of Statistical Predictions. J R Stat Soc Series B Stat Methodol. 1974;36(2):111–147. 20. 20.Demler OV, Paynter NP, Cook NR. Tests of calibration and goodness-of-fit in the survival setting. Stat Med. 2015;34(10):1659–1680. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1002/sim.6428&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=25684707&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2022%2F02%2F15%2F2022.02.14.22270946.atom) 21. 21.Goggins M, Overbeek KA, Brand R, et al. Management of patients with increased risk for familial pancreatic cancer: updated recommendations from the International Cancer of the Pancreas Screening (CAPS) Consortium. Gut. 2020;69(1):7–17. [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6NjoiZ3V0am5sIjtzOjU6InJlc2lkIjtzOjY6IjY5LzEvNyI7czo0OiJhdG9tIjtzOjUwOiIvbWVkcnhpdi9lYXJseS8yMDIyLzAyLzE1LzIwMjIuMDIuMTQuMjIyNzA5NDYuYXRvbSI7fXM6ODoiZnJhZ21lbnQiO3M6MDoiIjt9) 22. 22.Siontis GC, Tzoulaki I, Castaldi PJ, Ioannidis JP. External validation of new risk prediction models is infrequent and reveals worse prognostic discrimination. J Clin Epidemiol. 2015;68(1):25–34. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.jclinepi.2014.09.007&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=25441703&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2022%2F02%2F15%2F2022.02.14.22270946.atom) 23. 23.Schwartz NRM, Matrisian LM, Shrader EE, Feng Z, Chari S, Roth JA. Potential Cost-Effectiveness of Risk-Based Pancreatic Cancer Screening in Patients With New-Onset Diabetes. J Natl Compr Canc Netw. 2021:1–9.