Abstract
Using evidence derived from previously collected medical records to guide patient care has been a long-standing vision of clinicians and informaticians, and one with the potential to transform medical practice. As a result of advances in technical infrastructure, statistical analysis methods, and the availability of patient data at scale, an implementation of this vision is now possible. Motivated by these advances, and the information needs of clinicians in our academic medical center, we offered an on-demand consultation service to derive evidence from patient data to answer clinician questions and support their bedside decision making. We describe the design and implementation of the service as well as a summary of our experience in responding to the first 100 requests. Consultation results informed individual patient care, resulted in changes to institutional practices, and motivated further clinical research. We make the tools and methods developed to implement the service publicly available to facilitate the broad adoption of such services by health systems and academic medical centers.
The need for on-demand evidence
Evidence-based medicine emphasizes the “conscientious, explicit and judicious use of current best evidence”1 when making treatment decisions2,3. Randomized controlled trials (RCTs) are considered the highest quality source of evidence about treatment efficacy and safety. Evidence derived from RCTs, however, often does not generalize to the vast majority of patients, who tend to have multiple comorbidities, take many medications, and differ from individuals enrolled in RCTs on many characteristics4, resulting in an inferential gap between the evidence that is available and that which is needed5,6. Therefore, it is necessary to transform the evidence generation process7 and to incorporate the use of aggregate patient data at the point of care8 in order to create a successful learning health system9.
Electronic medical records (EMRs) are a source of rich longitudinal data about millions of real world patients. Since the 1970s, clinicians and scientists have envisioned using the medical records of previously treated patients to inform the care of current and future patients10,11. As a recent example, in 2011 the New England Journal of Medicine published an article by Frankovich et al.12 describing the use of EMR data to support management of an adolescent female with systemic lupus erythematosus. At the time, incorporating data from EMRs into clinical decision making required significant manual effort, rendering it infeasible for use in routine patient care.
A decade later, the adoption of EMRs across the United States and internationally, the increasing ease of use of advanced statistical methods, and the ability to compute with large patient cohorts have enabled a core tenet of the learning health system: deriving on-demand evidence for diverse clinical scenarios from the EMR7,13.
Using these advances as a foundation, we designed, developed, and offered a consultation service that used EMR and medical insurance claims data at Stanford Medicine to provide on-demand evidence for questions arising during clinical care14. Here, we report our findings from responding to the first 100 requests to the service: we summarize requests by medical specialty, the types of analyses required to fulfill them, and clinicians’ responses to the evidence returned.
The setup of the consultation service
Beginning in 2017, with approval from the Stanford Institutional Review Board, we offered a consultation service to provide on-demand evidence to clinicians at Stanford Medicine, staffed by a team of four (described below). As part of offering the service, we collected data on the motivations for consultation requests, and the subsequent actions taken in light of the evidence returned. At the conclusion of the study in August 2019, we analyzed the consultation request motivations and resulting actions, and assessed the concordance of consultation results across clinical data sources as a measure of reliability of consultation analysis methods.
In designing the service, we leveraged best practices15, methods16, and tools17,18 to derive evidence from EMRs. Callahan et al.15 summarizes recommendations for conducting and reporting observational studies done using EMRs, derived from a large body of our team’s prior work. For example, we have used EMR data for vigilance, such as monitoring adverse drug events19,20 and surveilling implantable devices21; for answering clinical questions, such as whether there is an association between androgen deprivation therapy and dementia22,23; and for elucidating quality of care, by profiling unplanned ED visits24, surfacing patient reported outcomes25 and quantifying treatment variability in metastatic breast cancer26. We have also learned from leading collaborative studies27, from developing methods for electronic phenotyping28–31, and from participating in multiple OHDSI network studies32–38.
Gombar et al.14 describes the consultation service setup to receive questions from clinicians, retrieve the appropriate patient data using a specialized search engine18, perform the analyses required for the question, and return a report summarizing the results. Schuler et al.16 describes the methods for data extraction, processing, and analysis used in the consultation service. Datta et al.17 describes the platform for clinical data science at Stanford Medicine that supported the operation of the service.
The workflow for fulfilling a consultation request
A consultation began with an email from a requestor, detailing a clinical question. Upon receiving the request, the team’s informatics clinician scheduled an intake discussion with the requesting clinician to specify the population, intervention, comparator, outcome and timeframe (PICOT) for their particular question14.
Based on the PICOT formulation of the question, the EMR data specialist constructed patient cohorts using the Advanced Cohort Engine (ACE)18 to search one of three data sources: EMRs of 3.1 million individuals from Stanford Medicine; IBM MarketScan® insurance claims for 124 million individuals; or Optum Clinformatics Data Mart® insurance claims for 53 million individuals. The data scientist then conducted the necessary statistical analyses and worked with the informatics clinician to write a report summarizing the analyses and their results. The report was then shared with the requestor and explained during an in-person debrief session. Each report consisted of the original question as posed, the PICOT re-formulation, and sections summarizing the cohort demographics, the interpretation of the analyses, and a detailed walkthrough of the analyses. Three example reports are provided in the Supplement. The interaction was designed to be similar to obtaining a second opinion from a colleague.
Our workflow evolved to incorporate real-time searches of the EMR as the informatics clinician collected PICOT details. For example, if a given cohort criterion returned very few patients, then the informatics clinician could relay this information during the intake interview in order to elicit modifications to the cohort definition from the requestor. Clarifications needed during debrief interviews were also incorporated into subsequent reports and debriefs to better contextualize analysis results for requestors. The majority of this evolution occurred during the first 3 months of offering the service.
Based on the time required to respond to the first 100 consultations received (see Findings from the first 100 consultations), we believe a team composed of one full-time clinical informatics fellow, two full-time EMR data specialists, and a 20% part-time data scientist would be able to complete up to 20 such consultations in one week. The personnel costs for our geographic area (San Francisco Bay) for this team are estimated at $505,000/year. Yearly data access infrastructure, cloud compute, licensing, and professional service expenses come to an additional $70,000/year. With these assumptions, the cost of running such a service would be approximately $550 per consultation.
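The per-consultation figure follows from the stated yearly costs and throughput. The short sketch below reproduces the arithmetic; the 52-week operating year is our assumption, since the text does not state how many weeks per year the service would run.

```python
# Inputs are the figures stated in the text; the 52-week year is an assumption.
PERSONNEL_COST_PER_YEAR = 505_000       # USD, team personnel
INFRASTRUCTURE_COST_PER_YEAR = 70_000   # USD, data access, compute, licensing, services
CONSULTATIONS_PER_WEEK = 20             # stated upper bound on throughput
WEEKS_PER_YEAR = 52                     # assumption: the service runs every week


def cost_per_consultation(personnel, infrastructure, per_week, weeks=WEEKS_PER_YEAR):
    """Return the total yearly cost divided by the yearly number of consultations."""
    return (personnel + infrastructure) / (per_week * weeks)


estimate = cost_per_consultation(
    PERSONNEL_COST_PER_YEAR, INFRASTRUCTURE_COST_PER_YEAR, CONSULTATIONS_PER_WEEK
)
print(f"${estimate:,.0f} per consultation")
```

Under these assumptions the estimate is about $553, consistent with the approximate $550 quoted above. A lower weekly throughput raises the per-consultation cost proportionally.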
Figure 1 illustrates the process of fulfilling a consultation request. The datasets, cohort building and analysis methods used in completing consultations, and our assessments of their performance, are further described in the following subsections.
Figure 1. The workflow for fulfilling a consultation request, illustrating the order of each step, the time required, and the personnel responsible.
Datasets and Cohort Building
The service used demographics, diagnoses, procedures, medications, laboratory values, clinical notes, length of stay, and mortality information for millions of patients from three data sources: EMRs from 3.1 million Stanford Medicine (Stanford) patients (54% female, spanning 1995-2019)39 consisting of diagnosis, procedure, medication, and laboratory test records, as well as clinical notes processed using a previously developed and evaluated text-processing pipeline40,41; IBM MarketScan® (MarketScan) which contains employer and Medicare insurance claims for 124 million lives (53% female, spanning 2007-2015); and Optum Clinformatics Data Mart® (Optum) which contains insurance claims for 53 million lives from employer sponsored health plans (53% female, spanning 2003-2016).
The choice of dataset for a given consultation was informed by the question and primarily based on meeting the criteria specified in the PICOT. For example, if a patient cohort definition relied on a specific range of laboratory test result values, then this necessitated using the Stanford EMR dataset, because claims data do not include laboratory test results. The EMR data specialist constructed patient cohorts using the Advanced Cohort Engine (ACE)18 to define necessary and sufficient conditions to determine if an exposure or outcome of interest occurred in a patient’s timeline. A patient timeline view of patient records provided by ACE enabled anonymized chart review for quality checks of the exposure and outcome definitions and resulting cohorts.
Supported Analyses
The service supported treatment comparisons for discrete, continuous and time-to-event outcomes as well as custom descriptive analyses16. For discrete, continuous and time-to-event outcomes we used a standardized process which attempted to emulate a “target trial”42 based on the criteria specified in the PICOT. For consultations requesting treatment comparisons, we created cohorts of similar patients using two approaches: Mahalanobis distance with a fixed caliper based on age, gender, length of record, and year of entry into the cohort (we call this “simple matching”) and high dimensional propensity score matching (hd-PSM)43,44. Matching is a way to identify subsets of patients that are similar in most respects other than the treatment they received, in order to reduce the chance that observed differences in outcomes are due to variation in properties, other than treatment, that also affect the outcome (commonly referred to as confounding)43. For propensity score estimation, we used an L2 regularized logistic regression model with a time-binned count based featurization of pre-treatment clinical data elements (diagnoses, procedures, medication records), fit using GLMnet45. Regularization strength was determined using 10-fold cross validation with a final refit on the entire data before estimating propensity scores for all patients46. Results from both matching strategies were included with each report.
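To make the idea of matching concrete, the following is a minimal sketch of 1:1 matching on a precomputed propensity score. It is not the service's implementation: the greedy nearest-neighbour strategy, the caliper value, and the function and variable names are all assumptions made for illustration, and the scores are invented.

```python
def greedy_caliper_match(treated, control, caliper):
    """Greedy 1:1 nearest-neighbour matching without replacement.

    treated, control: dicts mapping patient id -> propensity score.
    Returns (treated_id, control_id) pairs whose scores differ by at
    most `caliper`. Treated patients with no close control are dropped.
    """
    available = dict(control)  # controls not yet used in a pair
    pairs = []
    # Process treated patients in score order so the result is deterministic.
    for t_id, t_score in sorted(treated.items(), key=lambda item: item[1]):
        if not available:
            break
        # Closest remaining control by absolute score difference.
        c_id = min(available, key=lambda cid: abs(available[cid] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]
    return pairs


# Hypothetical propensity scores, for illustration only.
treated = {"t1": 0.30, "t2": 0.62, "t3": 0.95}
control = {"c1": 0.28, "c2": 0.60, "c3": 0.33, "c4": 0.10}
print(greedy_caliper_match(treated, control, caliper=0.05))
```

In this toy example t1 and t2 each find a control within the caliper, while t3 is left unmatched because no control has a comparable score. Dropping such patients is how matching restricts the comparison to regions where both treatment groups are represented.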
The subsequent analysis performed on matched cohorts was selected based on the outcome specified in the PICOT formulation of the question. For treatment comparisons with binary outcomes we calculated odds ratios and associated confidence intervals. For treatment comparisons with continuous outcomes, we fit regression models and reported mean change in response estimates and associated confidence intervals. For treatment comparisons with time-to-event (survival) outcomes, we computed Kaplan-Meier plots and performed log rank tests for differences in survival curves between compared treatments, and reported hazard ratios (HRs) and associated confidence intervals.
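For the binary-outcome case, an odds ratio and its confidence interval can be computed from a 2x2 table of the matched cohorts. The sketch below uses the standard Wald interval on the log odds ratio; the source does not state which interval method was used, so this choice, the function name, and the example counts are assumptions.

```python
import math


def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a: treated with outcome,    b: treated without outcome,
    c: comparator with outcome, d: comparator without outcome.
    z=1.96 gives an approximate 95% interval.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for the Wald interval")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical matched-cohort counts, for illustration only.
or_, lo, hi = odds_ratio_with_ci(a=30, b=70, c=20, d=80)
print(f"OR = {or_:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

An interval that includes 1, as in this example, would be read as no significant difference between treatments at the 0.05 level.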
Custom descriptive analyses required bespoke code for each request, primarily written in R, with data aggregation using Python as necessary. All analyses were conducted in R. Analysis code is publicly available47.
Quality checks for supported treatment comparison analyses
Because there is no known ground truth for the questions received by the consultation service, we established code correctness using synthetic datasets and derived an estimate of the false positive rate for treatment comparison analyses using publicly available datasets of drug-effect pairs as ground truth.
Establishing code correctness
We generated eight synthetic datasets, each with 10,000 patients, using all combinatorial variations of three properties: whether a binary treatment had an effect on a single survival outcome, whether treatment assignment had a dependence on a single binary covariate, and whether the covariate had an effect on the survival outcome. We confirmed the correctness of the analysis code by verifying that the analyses returned a treatment effect if and only if the underlying data were constructed with a treatment effect, and that the direction of the derived treatment effect was concordant with the treatment effect specified when creating the synthetic dataset. On performing treatment comparisons using cohorts matched with hd-PSM, the analysis code correctly identified protective effects for the four synthetic datasets constructed to have such intervention effects and no effects for the four synthetic datasets constructed to have no effect. For the two synthetic datasets where there was both a biased treatment and a covariate effect, resulting in confounding, propensity matching correctly recovered the true effect for the treatment.
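The grid of synthetic datasets can be sketched as follows. This is an illustration of the design only, not the code used for the service: the exponential survival model, the probabilities, the hazard multipliers, and all names are assumptions, since the source does not report the values it used.

```python
import itertools
import random


def simulate_dataset(treatment_effect, biased_assignment, covariate_effect,
                     n_patients=10_000, seed=0):
    """Simulate one synthetic cohort with exponential survival times.

    All numeric constants below are illustrative assumptions.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n_patients):
        covariate = rng.random() < 0.5
        # Biased assignment: the covariate shifts the probability of treatment.
        p_treat = (0.7 if covariate else 0.3) if biased_assignment else 0.5
        treated = rng.random() < p_treat
        hazard = 0.1
        if treatment_effect and treated:
            hazard *= 0.5  # protective treatment halves the hazard
        if covariate_effect and covariate:
            hazard *= 2.0  # covariate doubles the hazard
        rows.append({"covariate": covariate, "treated": treated,
                     "time": rng.expovariate(hazard)})
    return rows


# All 2 x 2 x 2 = 8 combinations of the three binary properties.
configs = list(itertools.product([False, True], repeat=3))
datasets = {cfg: simulate_dataset(*cfg, seed=i) for i, cfg in enumerate(configs)}

# Confounding arises only when assignment is biased AND the covariate affects survival.
confounded = [cfg for cfg in configs if cfg[1] and cfg[2]]
print(len(datasets), len(confounded))
```

The script reports eight datasets, two of which are confounded (one with and one without a true treatment effect). These are the two cases in which a crude comparison would be misleading and matching is needed to recover the specified effect.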
Quantifying the expected false positive rate
We selected treatment-outcome pairs known to be either associated or non-associated as compiled by the OMOP community (Ryan et al.48, 399 pairs) and the EU-ADR project (Coloma et al.20, 93 pairs) because they are publicly available and have been used as ground truth sets in other studies49–51. While both reference sets contain known associations as well as asserted non-associations, only the asserted non-associations were informative in quantifying the false positive rate. Of the 399 drug-outcome pairs from the OMOP community reference set, 234 are non-associations. Of the 93 pairs from the EU-ADR project, 50 are non-associations.
We constructed cohorts corresponding to each of the asserted non-associated treatment-outcome pairs in the reference sets and estimated a treatment effect, using Stanford EMR data. Cohorts were constructed by transforming each treatment and outcome definition into corresponding ACE queries. Outcomes were defined using ICD9 codes and drug treatments were defined using RxNorm codes. We used a new patient cohort design where patients entered a cohort immediately after the first time they were prescribed a drug. Outcomes were measured as events after the first prescription, with patients being marked as censored when their medical records ended. A result was counted as false positive if our analysis found that a given treatment was associated with an increased or decreased hazard ratio relative to the comparator (with an effect estimate greater than or less than 1, and a p-value ≤ 0.05), and the reference set marked it as not associated.
Of the 234 non-associated pairs from the OMOP community, there were 137 drug-outcome pairs for which a minimum 100 patients exposed to the drug were present in Stanford data. Of these, 27 associations were false positives and the remaining 110 were correctly identified as non-associations, providing an estimated false positive rate of 20%. From the 50 non-associated treatment-outcome pairs from the EU-ADR project, there were 42 pairs for which there was enough data. Of these, 7 associations were false positives and the remaining 35 correctly identified as non-associations, providing an estimated false positive rate of 17%.
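The two rates follow directly from the reported counts. The sketch below encodes the decision rule described above and recomputes the rates; the function and variable names are ours, introduced for illustration.

```python
def is_positive_finding(hazard_ratio, p_value, alpha=0.05):
    """A finding is positive when the estimate differs from 1 and p <= alpha."""
    return hazard_ratio != 1 and p_value <= alpha


def false_positive_rate(false_positives, evaluated_pairs):
    """Share of evaluated negative-control pairs flagged as associated."""
    return false_positives / evaluated_pairs


# Counts reported in the text for pairs with enough Stanford data.
reference_sets = {
    "OMOP": {"false_positives": 27, "evaluated_pairs": 137},
    "EU-ADR": {"false_positives": 7, "evaluated_pairs": 42},
}
rates = {name: false_positive_rate(**counts) for name, counts in reference_sets.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.0%}")
```

This prints 20% for OMOP (27 of 137) and 17% for EU-ADR (7 of 42), matching the figures above.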
Because the OMOP and EU-ADR reference sets were constructed to evaluate methods for treatment comparisons, the 17-20% expected false positive rate is applicable to consultations requesting a comparison of the hazard ratio of an outcome between treatments.
Summarizing the first 100 consultation requests
Categorizing motivations for requests and subsequent actions
We categorized the scenarios motivating consultation requests, and subsequent actions by requestors, based on the intake and debrief meetings, respectively. Each consultation request was assigned a single motivation category, and one or more of three subsequent action categories. If, during debrief, the requestor stated that they would use the knowledge gained from the consultation to change the treatment of a current or future patient with similar presentation, the consultation was categorized as having changed patient care. Debriefs where the requestor planned to obtain approval to further study their question or use the findings from the consultation to generate hypotheses for an ongoing research project were categorized as guiding further research. Debriefs where the requestor used the results from the consultation report directly as the basis of a publication, poster, abstract, grant submission, or presentation were categorized as follow-up analyses. Because the motivating scenarios were not known in advance, the eight categories of motivation (Table 2) were developed after the 100 consultations were completed.
Concordance of consultation results across data sources
We compared results obtained using different data sources for the same consultation request. To do so, we first identified consultations requesting treatment effect comparisons which could be re-executed using another dataset. For example, if a consultation was originally completed using data from Stanford, we re-executed it using MarketScan and Optum claims data. Some two-way comparisons across datasets failed because a given dataset contained too few patients (our threshold was 100 patients), while for others the matching procedure resulted in groups with no overlap in their propensity score distributions, which were thus unsuitable for comparison52.
Because a consultation to provide a treatment comparison could involve more than one outcome, we summarized concordance in terms of the number of outcomes, rather than the number of consultations. We evaluated the concordance of results for 59 outcomes from 33 consultations across Stanford and Optum; and 53 outcomes from 22 consultations across Stanford and MarketScan.
Using the notion of ‘regulatory agreement’53, a result was counted as concordant across two datasets only if both datasets provided an effect estimate in the same direction (e.g. both greater than 1 or both less than 1) with a p-value ≤ 0.05, or if the effect estimates derived from both datasets did not indicate a significant effect on the outcome(s) of interest, regardless of direction.
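The concordance rule can be written as a small predicate. The sketch below is our reading of the definition above, with hypothetical function names and example values; it treats 1 as the null value of the effect estimate and 0.05 as the significance threshold.

```python
def is_concordant(estimate_a, p_a, estimate_b, p_b, alpha=0.05):
    """Return True when two datasets agree under the rule described above.

    Agreement means either both results are significant with estimates on
    the same side of 1, or neither result is significant (direction ignored).
    """
    significant_a = p_a <= alpha
    significant_b = p_b <= alpha
    if significant_a and significant_b:
        both_above = estimate_a > 1 and estimate_b > 1
        both_below = estimate_a < 1 and estimate_b < 1
        return both_above or both_below
    # One significant and one not is discordant; neither significant is concordant.
    return not significant_a and not significant_b


# Hypothetical (estimate, p-value) results for one outcome in two datasets.
print(is_concordant(1.4, 0.01, 1.2, 0.03))  # both significant, same direction
print(is_concordant(1.4, 0.01, 0.8, 0.02))  # both significant, opposite directions
print(is_concordant(1.4, 0.30, 0.8, 0.60))  # neither significant
print(is_concordant(1.4, 0.01, 1.2, 0.40))  # significant in only one dataset
```

The four calls print True, False, True and False respectively. Note that a result significant in only one dataset counts as discordant even when the two estimates point in the same direction.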
Findings from the first 100 consultations
Consultation requests came from multiple specialties
Of the first 100 requests by 53 users from multiple specialties, 83 consultations were completed. 17 consultations could not be completed due to missing data elements, available data sources having too few patients meeting the specified cohort criteria, inability to define a cohort, or requiring an unsupported study design.
Of the 83 completed consultations, 48 were descriptive analyses. 35 were treatment comparison analyses, of which 18 had discrete or continuous outcomes and 17 had time-to-event outcomes. 78 out of 83 (94%) consultations used Stanford EMR data, and 4 out of 83 (5%) used national claims data to obtain adequate sample size. One consultation used both EMR and claims data.
Internal medicine was the most common requesting specialty, in terms of both requests received and number of requestors, followed by dermatology, oncology and cardiology (Table 1). Among 53 users, 24 requested a consultation more than once, for a total of 71 consultations. Internal medicine also had the highest number of repeat users.
Table 1. Summary of completed consultations by specialty, showing the number of consultations of each analysis type.
Median consultation turnaround time was 5 days, with 71 (86%) of consultations completed in 10 days or less. Longer turnaround times occurred when additional data elements were needed, when there were delays in scheduling conversations with the requestor, or when matching required substantial time for large cohorts. As the service workflow matured, by the end of the study 19 consultation reports had been returned in 48 hours or less, owing to reuse of cohort definitions, accumulated experience in PICOT formulation of requests, and analysis code optimization.
Consultation requests had diverse motivations
Consultation requests were driven by a variety of motivations, including evaluating patient management strategies for a given disease or patient presentation, identifying comparatively effective treatments for patients with typically understudied characteristics, and quantifying associations between diseases. The categorization of consultation motivations is summarized in Table 2, and cross-tabulated with the subsequent actions taken by requestors. 10 consultations led to changes to patient care, 52 guided further research, and 17 led to follow-up analyses, including four that were presented at medical conferences or published in peer-reviewed journals54–57. Not all subsequent actions could be categorized into the three groups: 27 consultations lacked clear subsequent actions, suggesting that the consultation may have been sought primarily to contribute to the personal knowledge of the requestor, or that the findings were not sufficiently compelling to warrant action on their basis.
Table 2. Clinical motivations and subsequent actions taken by requestors for the 83 completed consultations. Each request was assigned a single motivation category and one or more follow-up action categories.
We highlight three consultations that demonstrate the diversity of situations motivating a consultation: a request to characterize a rare disease presentation (a pediatric patient with mononeuritis multiplex); a request to compare treatment outcomes (for a recently approved class of melanoma drugs, PD-1 inhibitors); and a request to summarize the institutional use of procalcitonin tests (to inform antibiotic discontinuation). The reports for these three consultations are provided in the Supplement. In each of these consultations, the service addressed a different need.
In the case of the pediatric mononeuritis multiplex patient, the consultation required a custom descriptive analysis of a rare disease presentation that resulted in changes to patient care. We provided the requestor with summaries of the most frequent diagnoses preceding and following mononeuritis multiplex diagnosis in 118 similarly aged patients, which included bacterial and viral infections as well as psychosomatic disorders. A variety of treatments were prescribed for those patients, including steroids, antibiotics, anti-inflammatory medications, painkillers, and hormone supplements. These findings, alongside further clinical workup, suggested managing the patient’s symptoms as a post-viral syndrome. The patient improved with a trial of steroids and was discharged.
In the case of the use of PD-1 inhibitors, the consultation required a treatment comparison analysis for an understudied population that guided further research. The consultation was motivated by a melanoma patient who had a herpes simplex reactivation following treatment with nivolumab. We identified 587 similar patients and found no difference in viral reactivation rates in patients treated with PD-1 inhibitors compared to those treated with other antineoplastic agents. Published evidence on the relationship between PD-1 therapies and herpetic reactivations was not available, perhaps because nivolumab was only recently approved (in 2014). Here, the consultation findings filled an important clinical evidence gap.
In the case of procalcitonin testing, the request entailed a custom descriptive analysis to evaluate institutional patient management that changed patient care. The consultation was requested because, while procalcitonin is a serum biomarker capable of discriminating between bacterial and non-bacterial causes of infection58–61, the exact cut-off value at which to discontinue antibiotics is not universally agreed upon. Procalcitonin’s utility for deciding whether to order a blood culture remains unclear62,63. By analyzing approximately 16,000 procalcitonin test results and 29,000 blood culture results, we calculated how often a positive blood culture was obtained within 48 hours of a procalcitonin result at a given cut-off value, how frequently antibiotic therapy was discontinued at different cut-offs of procalcitonin values, and how often antibiotics were restarted within 72 hours of discontinuation. The analysis found that at the cut-offs in use (procalcitonin > 0.5) a positive test was not associated with a positive blood culture. This finding, combined with further analyses, informed an institutional protocol change: procalcitonin values are no longer used to inform ordering a blood culture when deciding whether to discontinue antibiotic therapy.
Consultation results were concordant across datasets
When comparing results obtained using different data sources for the same consultation request, 68% to 74% of results were concordant across datasets. In the Stanford and Optum comparison, results for 68% (40 out of 59) of the evaluated outcomes were concordant. For 28 outcomes, both datasets reported a significant treatment effect with the same direction of the effect. For 12 outcomes, results from both datasets did not have a significant effect. In the Stanford and MarketScan comparison, 74% (39 out of 53) of the evaluated outcomes were concordant. For 30 outcomes, results from both datasets had a significant treatment effect with the same direction of the effect. For 9 outcomes, results from both datasets did not have a significant effect.
A vision realized: strengths, caveats and next steps
Using data generated during routine care to guide the care of future patients is a core tenet of a learning health system64–67 and, as a distillation of clinical expertise, of evidence-based medicine1,68. Our work is a first-of-its-kind implementation of this vision, demonstrating that an on-demand consultation service to summarize the experiences of previously seen patients is feasible from both an engineering and operational standpoint. The variety of consultation requests we received (in terms of clinical motivations, analyses needed and subsequent actions) also empirically illustrates the potential to inform a broad range of clinical decisions (Figure 2). As large patient data repositories become commonplace75,76, the ability to learn from the experience of similar patients is one of the nobler opportunities such repositories enable77.
Figure 2. Sankey plot illustrating the flow (horizontal colored lines) of completed consultations from the clinical motivation (left) to the analysis type (center) to the subsequent action (right). The thickness of each flow is proportional to the number of consultations. For example, consultations motivated by evaluating institutional patient management required mostly descriptive analyses, and resulted in all three categories of subsequent action.
Our work has several unique strengths. First, the service’s underlying search engine, ACE—which is essential for rapidly constructing cohorts and defining electronic phenotypes corresponding to exposures and outcomes of interest—is freely available for non-commercial use, allowing implementation of such a service at other sites without extensive monetary or technical resources18. Second, we found 68-74% concordance of consultation results across multiple datasets, a rate of agreement comparable both to the rate at which results from randomized controlled trials (RCTs) agree with each other (67-87%)69 and to the rate at which results derived from observational claims data agree with RCTs (60-80%)70. Finally, the number of repeat requests demonstrates the need for, and viability of, such a consultation service.
Our study has several limitations. First, users of the service were self-selecting; consultation requestors may thus have been predisposed to finding value in the service, and self-reported utility for advancing research or patient care may have been affected by subjective expectations of the service. Second, the cost to deploy such a service will vary at institutions where the necessary data access and analysis infrastructure does not yet exist. Our implementation was at an academic medical center with ready access to EMR and claims data, resulting in an estimated cost per consult of $550 USD14,16; costs may be higher elsewhere. Moreover, while the current turnaround time is analogous to a send-out laboratory test, providing an ongoing service would require additional engineering effort incurring additional costs71. Third, the choice and evaluation of patient matching and causal inference methods remains an active area of research72–74. Future work may find methods beyond hd-PSM that offer improved concordance across data sources. Lastly, the net benefit of providing on-demand evidence needs to be studied prospectively at multiple sites by measuring impact on patient outcomes, cost of care, and health system operations. We hope that our experience and the tooling we share will enable such studies.
Takeaways
On-demand evidence generation to inform clinical decision making is an achievable goal, given the confluence of scalable technology for data analysis, a growing data science workforce, the training of increasingly data savvy clinicians, and the availability of large amounts of patient data from EMRs and claims8,14. The consultation service we created provides proof-of-feasibility for realizing this goal. Such a service is capable of informing patient care at the bedside for specific patients, informing the medical literature and supporting institutional guideline creation. As large patient data repositories are created75,76, the potential to benefit from such a service is immense77. Given the feasibility, and the documented need, studies establishing the utility of having such a consultation service are logical next steps.
Data Availability
Data are available upon reasonable request.