Comparison of Population Characteristics in Real-World Clinical Oncology Databases in the US: Flatiron Health, SEER, and NPCR

The Surveillance, Epidemiology, and End Results (SEER) program and the National Program of Cancer Registries (NPCR) are authoritative sources for population cancer surveillance and research in the US. An increasing number of recent oncology studies are based on the electronic health record (EHR)-derived de-identified databases created and maintained by Flatiron Health. This report describes the differences in the originating sources and data development processes, and compares baseline demographic characteristics in the cancer-specific databases from Flatiron Health, SEER, and NPCR, to facilitate interpretation of research findings based on these sources.
Patients with documented care from January 1, 2011 through May 31, 2019 in a series of EHR-derived Flatiron Health de-identified databases covering multiple tumor types were included. SEER incidence data (obtained from the SEER 18 database) and NPCR incidence data (obtained from the US Cancer Statistics public use database) for malignant cases diagnosed from January 1, 2011 to December 31, 2016 were included. Comparisons of demographic variables were performed across all disease-specific databases, for all patients and for the subset diagnosed with advanced-stage disease.
As of May 2019, a total of 201,570 patients with 19 different cancer types were included in Flatiron Health datasets. In an overall comparison to national cancer registries, patients in the Flatiron Health databases had similar sex and geographic distributions, but appeared to be diagnosed with later stages of disease and had a different age distribution. For variables such as stage and race, Flatiron Health databases had a greater degree of incompleteness. These trends varied by cancer type.
These three databases present general similarities in demographic and geographic distribution, but there are overarching differences across the populations they cover.
Differences in data sourcing (medical oncology EHRs vs cancer registries), and disparities in sampling approaches and rules of data acquisition may explain some of these divergences. Furthermore, unlike the steady information flow entered into registries, the availability of medical oncology EHR-derived information reflects the extent of involvement of medical oncology clinics at different points in the specialty management of individual diseases, resulting in inter-disease variability. These differences should be considered when interpreting study results obtained with these databases.

The NPCR cancer registries routinely capture data elements including the type, extent, and location of the cancer, the type of initial treatment, and outcomes of newly diagnosed cancers.
Medical facilities such as hospitals, physician offices, and pathology laboratories send information about cancer cases to their respective central cancer registry, and each central cancer registry electronically submits de-identified demographic and clinical information to the NPCR on a yearly basis (4). Mortality information in the NPCR is obtained from the CDC's National Center for Health Statistics' National Vital Statistics System (20). As of May 31, 2019, the most recent information available from NPCR included new incident malignancies diagnosed through December 31, 2016. NPCR data are made available through the US Cancer Statistics dataset, which combines NPCR data and data from 4 SEER-funded states (Connecticut, Hawaii, Iowa, and New Mexico). Together, these data provide information on 100% of the US population. In this paper, "NPCR data" were obtained through the US Cancer Statistics public use research dataset and were restricted to the 46 NPCR-funded states and D.C.

Variables
For each cancer type, demographic and clinical characteristics including race, age, region, year and stage at diagnosis were compared between the Flatiron Health and the SEER and NPCR databases. To overcome coding discrepancies across databases, cancer types were matched using ICD-9, ICD-10, and histology codes (e.g., ICD-O-3). All comparisons were unadjusted.
Variables not available across two or more data sources were not included in the comparison (e.g., smoking status, treatment detail, real-world progression [rwP] information).

medRxiv preprint doi: https://doi.org/10.1101/2020.03.16.20037143; this version posted May 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

In the Flatiron Health databases, cancer staging information was collected as entered into the EHR by the treating physician or otherwise as assessed by Flatiron Health abstractors; during the study time period, the applicable staging criteria for solid tumors were those of the American Joint Committee on Cancer (AJCC).
To compare the data segments common to all three databases (the most recent SEER and NPCR data releases extend through 2016 as the initial diagnosis year), descriptive analyses were performed not only for all patients available for analysis across the entire time frame of January 2011 to May 2019 in the Flatiron Health databases, but also for the subset available for analysis from January 2011 to December 2016.
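The two analysis time frames described above can be expressed as a simple inclusive date filter. The following is a minimal sketch under stated assumptions, not the authors' code: the record layout, the `diagnosis_date` field name, and the choice to apply the window to the diagnosis date are all hypothetical and used only for illustration.

```python
from datetime import date

# Window bounds are taken from the text; everything else here is hypothetical.
FULL_WINDOW = (date(2011, 1, 1), date(2019, 5, 31))
REGISTRY_ALIGNED_WINDOW = (date(2011, 1, 1), date(2016, 12, 31))


def in_window(record, window):
    """Return True if the record's diagnosis date lies inside the window (inclusive)."""
    start, end = window
    return start <= record["diagnosis_date"] <= end


# Toy records invented for illustration; they are not study data.
records = [
    {"patient_id": "p1", "diagnosis_date": date(2012, 3, 14)},
    {"patient_id": "p2", "diagnosis_date": date(2018, 7, 2)},
    {"patient_id": "p3", "diagnosis_date": date(2016, 12, 31)},
]

full_cohort = [r for r in records if in_window(r, FULL_WINDOW)]
aligned_subset = [r for r in records if in_window(r, REGISTRY_ALIGNED_WINDOW)]
```

In this toy example, all three records fall in the full time frame, while only `p1` and `p3` fall in the subset that overlaps the registry releases.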
As sensitivity analyses to address potential biases related to temporal drifts, we performed separate comparisons for the patient subgroups who had stage IV disease at diagnosis in each cancer type, for whom survival times would be expected to be shorter and the date of diagnosis would be expected to be closer to the database entry point. The potential biases to address were twofold: (i) as noted above, SEER and NPCR only collect specific incident disease data points, whereas Flatiron Health databases include both incident and prevalent cases. Therefore, Flatiron Health databases may receive patients at the time of diagnosis but also patients with initial diagnosis dates in the past; these cases may have long intervening periods between the initial diagnosis date and the date of entry into the Flatiron Health database, introducing a potential bias toward patient characteristics associated with longer survival times (when compared with strictly incident cases in cancer registries); (ii) in addition, temporal trends may exist whereby certain patient characteristics (e.g., sex, age) are associated with cancer diagnoses during discrete time periods, which can affect distributions depending on diagnosis year.
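Bias (i) turns on the interval between the initial diagnosis date and the date of entry into the EHR-derived database. As a rough illustration only, the sketch below computes that interval and flags cases that entered soon after diagnosis. The 90-day cutoff and the function names are hypothetical choices made for this example; the text does not state any such rule.

```python
from datetime import date

# Hypothetical cutoff: cases entering within this many days of diagnosis
# are treated as "incident-like" for the purpose of this illustration.
INCIDENT_MAX_LAG_DAYS = 90


def entry_lag_days(diagnosis_date, entry_date):
    """Days elapsed between initial diagnosis and database entry."""
    return (entry_date - diagnosis_date).days


def is_incident_like(diagnosis_date, entry_date, max_lag=INCIDENT_MAX_LAG_DAYS):
    """True when entry occurs on or after diagnosis and within max_lag days of it."""
    lag = entry_lag_days(diagnosis_date, entry_date)
    return 0 <= lag <= max_lag
```

A case diagnosed years before its first recorded visit would not be flagged, which is the kind of prevalent case that could shift the distribution toward characteristics associated with longer survival.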

Analyses
Case-level data for patients in SEER were extracted from the SEER 18 dataset (November 2018 data submission) using the Case Listing Session feature in SEER*Stat software (Version 8.3.6, Information Management Services, Inc., Silver Spring, MD) and processed using R 3.6.1. For patients in NPCR, case listing is not publicly available in the US Cancer Statistics public use SEER*Stat dataset, and case-level data cannot be accessed or downloaded. Frequencies by demographic and clinical characteristics for all malignant cases were therefore calculated in SEER*Stat software using the November 2018 data submission.
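For readers who want to reproduce this kind of descriptive comparison on their own case-level extract, an unadjusted percentage distribution (including the share of incomplete records) might be computed as sketched below. This is an illustrative Python sketch, not the R or SEER*Stat workflow used in the study; the function name and the use of "Unknown" to mark incomplete records are assumptions.

```python
from collections import Counter


def percent_distribution(values):
    """Unadjusted percentage of records per category, rounded to one decimal."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: round(100 * n / total, 1) for category, n in counts.items()}


# Toy stage values invented for illustration; "Unknown" marks incomplete records.
stages = ["I", "IV", "IV", "Unknown", "II", "IV", "Unknown", "III"]
distribution = percent_distribution(stages)
```

Keeping "Unknown" as its own category, rather than dropping it before computing percentages, makes differences in completeness visible alongside differences in the distribution itself.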

RESULTS
Among the 2.2 million patients with cancer in the Flatiron Health database as of May 2019, 201,570 were included in this analysis, as well as 1,719,277 and 6,308,342 cases from SEER and NPCR, respectively. The disease-specific databases vary in size.

DISCUSSION
Results reported in this descriptive study provide an overview of the originating sources, data collection methods, and comparative population characteristics for three oncology-specific RWD sources in the US: the Flatiron Health, SEER, and NPCR data. We focused our comparisons on baseline demographic and clinical variables at the time of initial cancer diagnosis that describe the populations included in these data sources, based on the data elements commonly found in all three of them. Each of these three data sources relies on different collection approaches (Table 1), and our findings reveal population differences likely stemming from those distinct collection strategies. These differences in data collection methods and resulting populations should be considered when determining whether a dataset is fit-for-use for a particular research question and can help to contextualize research results obtained when using each data source.
To further assist in that contextualization, this discussion highlights some of the potential underlying explanations for the differences observed.
The distribution of patients according to sex/gender in the three data sources was comparable, but there were noticeable differences for other variables. Regarding regional distribution, Flatiron Health and the NPCR data were most closely aligned to the regional population distribution in the most recent US census (24). Due to its design, SEER data diverges the most from the census, particularly overrepresenting the West, while the Flatiron Health database provides a convenience sample that is slightly weighted towards the South and underrepresents the Western region.
The three data sources differed in age distribution, most notably in the lower proportion of patients over 80 years at diagnosis in the Flatiron Health database. This discrepancy probably stems from the different information sources that feed each database. State registries collect information regardless of patients' site of care and from death certificates and autopsy reports (25), while Flatiron Health databases accrue information only via oncology clinics. By focusing on specialized care, Flatiron Health has limited reach into general hospice or other geriatric care settings, where a substantial number of elderly patients may be referred before they complete two visits to an oncology clinic (which excludes them from eligibility for Flatiron Health databases). There was one exception to this finding: prostate cancer, which may be a second cancer diagnosis for some elderly patients who could have already been entered into the Flatiron Health database at the time of a prior cancer diagnosis.
For information on race/ethnicity, the different data collection approaches result in expected differences in completeness and in population distribution across the three databases. The proportion of incomplete records for race/ethnicity in Flatiron Health data is greater than in the other databases. During routine oncology care in the US (i.e., in the source clinics for Flatiron Health), collection of race/ethnicity data is not mandatory or incentivized; on the other hand, the registries feeding both SEER and NPCR have a mandate to reach certain levels of completeness for this variable and thus, this information is collected both directly from self-reports and indirectly using algorithms (26). In addition, Flatiron Health relies on self-reported information by patients, which adds complexity and variability to an information category that is in constant evolution within a broader social context. These challenges probably contribute to the apparently lower completeness of this variable in the Flatiron Health databases. Furthermore, due to the purposeful design of the SEER program, there is an overrepresentation of certain groups compared to the US census (24, 27), a finding consistent with prior representativeness studies (28-30).
Lastly, a combination of differences in sources and in data collection approaches across the three data sources leads to noticeable differences in their information about AJCC disease stage at diagnosis. In some diseases, particularly in HCC, malignant pleural mesothelioma, RCC, SCLC, prostate cancer, and DLBCL, detailed AJCC disease stage information is missing from Flatiron Health databases to a substantially larger extent than in SEER, although that incompleteness is mitigated in the simplified category metastatic/non-metastatic disease. To understand that finding, it is important to note that Flatiron Health data are generated from a pipeline of medical oncology EHR-derived data, where stage information is mostly as documented in unstructured notes by the treating oncology team. In contrast, registries rely on multiple sites of care as sources, and disease stage is intentionally entered into their databases via mandated calculation and coding by trained tumor registrars (31, 32). Ultimately, these fundamental differences lead to idiosyncratic fluctuations in information completeness in EHR-derived vs systematically-collected data. For instance, clinical scenarios where medical oncologists tend to be involved in initial diagnosis (when staging takes place) are more likely to have a complete capture of initial staging in the medical oncology EHR. To wit, compare cancers commonly diagnosed at an advanced stage (e.g., SCLC) or that are eligible for systemic adjuvant therapy from early stages (e.g., breast cancer) with diseases where the medical oncologist tends to be less involved in initial diagnosis (e.g., RCC, which is often managed surgically upon initial diagnosis).
Completeness/incompleteness of EHR-derived data is also affected by practical patterns of clinical documentation; as staging algorithms become increasingly complex, practitioners may tend to be more attentive to staging specifics in settings where the link between staging and treatment is more crucial (i.e., local or locally-advanced settings, where patients are candidates for multi-modality therapy), and less stringent in other settings (i.e., advanced disease) where staging information is less critical to clinical decision making. For example, AJCC staging is not a key consideration for initial HCC treatment decisions; clinical and laboratory data to assess underlying liver functional status and inform potential transplant eligibility are far more clinically relevant. Treating clinicians may prefer to document and rely on clinically actionable, non-AJCC staging systems for the routine management of some diseases, like HCC and SCLC, resulting in less AJCC-based information available in those databases.
In conclusion, the disease-specific Flatiron Health databases provide deep demographic, clinical, and treatment data models derived from EHR information. Several of the data elements in the Flatiron Health databases cannot be found in SEER or NPCR, such as date of metastatic diagnosis and sites of metastatic disease, comprehensive standard biomarker status, longitudinal treatment sequences, and disease progression dates. Within the portfolio of data elements commonly found across the three databases, comparing Flatiron Health to SEER and NPCR shows that Flatiron Health has a regional distribution closer to the general US census than SEER, an overall lower proportion of patients older than 80 years at diagnosis, less complete racial information, and disease-dependent variability in the capture of staging data.
These differences stem from the originating EHR-source and from the rules for data capture and processing. Investigators should consider these inter-database demographic differences.

Geographic distribution of participating states is based on maintaining a high-quality population-based cancer reporting system, and pre-specified and adjusted to reach specific regional and racial representation.
The biomarker portfolio offered by each database is different, with SEER providing information about a limited set of specific biomarkers that might predict outcomes or response to specific therapies, the US Cancer Statistics database providing ER, PR and HER2 status, and Flatiron Health offering a more extensive portfolio of biomarkers for which testing is considered standard of care.
b Flatiron Health databases provide information on specific sites of metastatic disease for a selected set of tumor types.
c Flatiron Health has assembled early-stage disease datasets (for which information about early-stage treatment is collected) for certain selected tumor types.

Race/ethnicity
Defined as:
• Asian = "Asian" or "Pacific Islander"
• Black or African American = "Black"
• Other Race = "American Indian" or "Alaska Native"
• Unknown = "Unknown"
• White = "White"
As self-reported by patients, and captured into the EHR.
Defined as:
• White = "White"
• Black or African American = "Black or African American"
• Asian = "Asian"
• Other Race = "Other"
• Unknown = "Unknown", "Hispanic or Latino"
Categories had to be combined to harmonize comparisons across databases.
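The category harmonization listed above amounts to a lookup from each source's raw race/ethnicity value to a shared category. A minimal sketch of that lookup follows. The two dictionaries transcribe the two sets of definitions in the order they appear; the names `MAP_A`, `MAP_B`, and `harmonize` are hypothetical, and sending unlisted raw values to "Unknown" is an assumption of this sketch rather than a rule stated in the table.

```python
# First set of definitions, in the order listed above (raw value -> shared category).
MAP_A = {
    "Asian": "Asian",
    "Pacific Islander": "Asian",
    "Black": "Black or African American",
    "American Indian": "Other Race",
    "Alaska Native": "Other Race",
    "Unknown": "Unknown",
    "White": "White",
}

# Second set of definitions, in the order listed above.
MAP_B = {
    "White": "White",
    "Black or African American": "Black or African American",
    "Asian": "Asian",
    "Other": "Other Race",
    "Unknown": "Unknown",
    "Hispanic or Latino": "Unknown",
}


def harmonize(raw_value, mapping):
    """Map a raw value to its shared category.

    Assumption of this sketch: values absent from the mapping become "Unknown".
    """
    return mapping.get(raw_value, "Unknown")


shared_categories = sorted(set(MAP_A.values()) | set(MAP_B.values()))
```

Both mappings resolve to the same five shared categories, which is what allows the race/ethnicity distributions to be compared side by side.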