Jeffrey F Scherrer, Wilson D Pace, Will electronic health record data become the standard resource for clinical research?, Family Practice, Volume 34, Issue 5, October 2017, Pages 505–507, https://doi.org/10.1093/fampra/cmx055
Uses and sources of large medical record databases
Electronic health record (EHR) systems are increasingly common in most of the developed world (1). Clinic encounters, in the age of EHRs, generate hundreds of data elements that can be used in high impact, innovative research. With many health care systems having 10–20 years of EHR utilization, longitudinal analyses can be conducted using standardized measures of medical diagnoses, laboratory results, medications, vital signs, provider type and demographic information (2,3). The size of existing data and the promise of growth and linkage of databases across the United States offer an ‘unprecedented’ (4) opportunity for retrospective cohort designs and for facilitating identification of eligible subjects and recruitment to randomized trials (5). These databases can substantially increase the rate of discovery when contrasted with chart abstraction or primary data collection.
In the United States, the Patient-Centered Outcomes Research Institute’s PCORNet (6) links millions of patient records from health care systems across the country, and the Health Care Systems Research Network (http://www.hcsrn.org/en/About/) has a long history of using EHR data from large health care organizations to conduct research with standard variable definitions. More specific to primary care, large databases capturing numerous ambulatory family medicine and general internal medicine clinics are available through the DARTNet Institute (http://www.dartnet.info/). These networks allow researchers to collaborate and test hypotheses in multiple cohorts or, as appropriate, in large merged samples drawn from sources ranging from entire health care systems to primary care practices.
Electronic health record databases can advance genomic medicine and be used to create genetically informative cohorts without additional data collection. The Marshfield Twin Registry is an excellent example of using existing medical record data to compute the familial contributions to the entire range of ICD-9 codes (7,8).
Can observational cohort data support evidence of causality?
The randomized controlled trial (RCT) has long been considered the highest level of evidence in medical literature and the gold standard for developing evidence of causality. However, quasi-RCTs in medical record cohorts could overcome common limitations of prospective RCTs, including exclusion criteria that eliminate vulnerable subjects and more complex patients and thereby reduce generalizability to real world delivery of care. RCT follow-up time typically does not exceed 1 or 2 years, and attrition increases with each wave of follow-up. Big data medical record cohorts do not have to exclude patients because of their ability to participate, can ‘randomize’ patients to exposure groups that are harmful, such as chronic, high dose opioid use, and can provide follow-up of 10 years or more. These long observation periods and enormous sample sizes (>100 000) make EHR databases particularly useful for research on factors associated with rare outcomes.
Retrospective cohort studies are now able to build valid evidence for causal relationships with advanced statistical approaches. Standard propensity scores, high dimensional propensity scores, persistence correction, instrumental variables, and weighting or matching on the probability of treatment exposure are analytic approaches that permit robust control of confounding. Advanced analytic methods have been employed to replicate RCT findings in retrospective cohort data (9). Examples of successful replications using medical record data have been reported for RCTs of diabetes medications (9), statins (10) and cardiovascular drugs (11). Replicating results from the controlled artificial environment of an RCT using real world clinic data increases the clinical relevance of RCT findings and demonstrates that, with appropriate analytic methods, existing data can be used to demonstrate causality (9). This conclusion is consistent with that of the International Society for Pharmacoeconomics and Outcomes Research (ISPOR) Task Force on Good Research Practices for Designing and Analyzing Retrospective Databases, which stated that ‘valid findings of therapeutic benefits can be produced from nonrandomized studies using an array of state-of-the-art analytic techniques’ (12).
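As a rough illustration of weighting on the probability of treatment exposure, the sketch below applies inverse probability of treatment weighting to a small invented cohort. The cohort, the single binary confounder `severity`, and all function names are assumptions made for this example; they do not come from any of the cited studies, and a real analysis would model many covariates rather than one stratum variable.

```python
from collections import defaultdict


def make_toy_cohort():
    """Build an invented cohort of (severity, treated, outcome) tuples.

    Sicker patients (severity=1) are treated more often and have worse
    outcomes, so severity confounds the crude comparison.
    """
    cells = [
        # severity, treated, n_patients, n_events
        (0, 1, 20, 2),
        (0, 0, 60, 12),
        (1, 1, 60, 24),
        (1, 0, 20, 10),
    ]
    cohort = []
    for severity, treated, n_patients, n_events in cells:
        cohort += [(severity, treated, 1)] * n_events
        cohort += [(severity, treated, 0)] * (n_patients - n_events)
    return cohort


def stratum_propensity(cohort):
    """Estimate P(treated | severity) as the treated share in each stratum."""
    counts = defaultdict(lambda: [0, 0])  # severity -> [n_treated, n_total]
    for severity, treated, _ in cohort:
        counts[severity][0] += treated
        counts[severity][1] += 1
    return {s: n_treated / n_total for s, (n_treated, n_total) in counts.items()}


def crude_risk_difference(cohort):
    """Unadjusted risk in the treated minus risk in the untreated."""
    risk = {}
    for arm in (0, 1):
        outcomes = [y for _, t, y in cohort if t == arm]
        risk[arm] = sum(outcomes) / len(outcomes)
    return risk[1] - risk[0]


def iptw_risk_difference(cohort):
    """Risk difference after inverse probability of treatment weighting.

    Assumes every stratum contains both treated and untreated patients,
    so each propensity lies strictly between 0 and 1.
    """
    propensity = stratum_propensity(cohort)
    weighted_events = {0: 0.0, 1: 0.0}
    weight_total = {0: 0.0, 1: 0.0}
    for severity, treated, outcome in cohort:
        p = propensity[severity]
        weight = 1.0 / p if treated else 1.0 / (1.0 - p)
        weighted_events[treated] += weight * outcome
        weight_total[treated] += weight
    return (weighted_events[1] / weight_total[1]
            - weighted_events[0] / weight_total[0])


if __name__ == "__main__":
    cohort = make_toy_cohort()
    print("crude risk difference:", round(crude_risk_difference(cohort), 3))
    print("IPTW risk difference: ", round(iptw_risk_difference(cohort), 3))
```

In this invented cohort the treatment lowers risk by 10 percentage points within each severity stratum, yet the crude comparison suggests harm because sicker patients are treated more often. Weighting balances severity across the two arms and recovers the within-stratum difference. The weighting only adjusts for confounders that are measured and included in the propensity estimate.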
Limitations of research using large medical record databases
A common critique of EHR-based research is the risk of misclassification and missing data. Having spent half my (JFS) research career in primary data collection in prospective cohort studies and half in modeling EMR data, I believe it is inaccurate to consider primary data superior to EMR data given that both contain random error. Primary data, depending on the subject matter, can suffer more recall bias and elicit answers that respondents believe are socially acceptable or what the investigator wants to hear. EMR data does have limitations, and understanding how data is generated can improve the validity of studies. EHR data, relative to primary data, suffers more false negatives. Patients who go to physicians outside the health care system’s EHR network and patients who do not utilize care, or avoid preventive medical care, are more likely to have undetected disease. Yet unlike a finite prospective cohort, most EHR databases are large enough to restrict samples to regular users, and patients in ‘closed’ systems such as the Veterans Health Affairs (VA) are less likely to use multiple healthcare sources.
Although the risk of misclassification exists in EHR databases, methods are available to improve the accuracy of case identification. Numerous studies have demonstrated how to build algorithms to increase the accuracy of diagnoses. Defining a case of type 2 diabetes does not have to be limited to an ICD-9 code; a case can be defined by American Diabetes Association diagnostic guidelines (13), which involve diagnosis using the presence of multiple variables including ICD-9 code, A1c ≥6.5%, 2-h plasma glucose ≥200 mg/dl (11.1 mmol/l) during an OGTT, a random plasma glucose ≥200 mg/dl (11.1 mmol/l) in patients with hyperglycemia or hyperglycemic crisis, or initiation of anti-diabetic medications. A single diagnosis of depression has very poor agreement with gold standard comparisons, but case definitions requiring two diagnoses within the same 12-month period or one inpatient discharge diagnosis have excellent agreement (95% positive predictive value) with clinician chart review (14) and 88% positive predictive value for survey derived diagnosis (15). Open and frequent communication with your EHR system programmers and users (i.e. clinical staff) is extremely valuable in determining the validity of the data.
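The case definitions described above can be sketched as simple rules over an EHR extract. The record layout, the function and field names, the 365-day approximation of "the same 12-month period", and the requirement that the two diagnoses fall on different dates are all assumptions made for illustration; the validated algorithms in the cited studies may differ in these details. The diabetes helper only reports which of the listed criteria a patient meets, because how many must be present is a study design decision.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Diagnosis:
    """One coded depression diagnosis from a hypothetical EHR extract."""
    visit_date: date
    inpatient_discharge: bool = False


def meets_depression_definition(diagnoses: Iterable[Diagnosis],
                                window_days: int = 365) -> bool:
    """True with one inpatient discharge diagnosis, or with two diagnoses
    on different dates no more than `window_days` apart."""
    dx = list(diagnoses)
    if any(d.inpatient_discharge for d in dx):
        return True
    # Among sorted distinct dates, the closest pair is always adjacent.
    dates = sorted({d.visit_date for d in dx})
    return any((later - earlier).days <= window_days
               for earlier, later in zip(dates, dates[1:]))


def diabetes_criteria_met(has_icd_code: bool,
                          a1c_percent: Optional[float] = None,
                          ogtt_2h_mg_dl: Optional[float] = None,
                          random_glucose_mg_dl: Optional[float] = None,
                          hyperglycemia_documented: bool = False,
                          antidiabetic_medication_started: bool = False
                          ) -> List[str]:
    """List which of the criteria named in the text are satisfied."""
    met = []
    if has_icd_code:
        met.append("icd_code")
    if a1c_percent is not None and a1c_percent >= 6.5:
        met.append("a1c")
    if ogtt_2h_mg_dl is not None and ogtt_2h_mg_dl >= 200:
        met.append("ogtt_2h")
    # Random glucose counts only alongside documented hyperglycemia.
    if (random_glucose_mg_dl is not None and random_glucose_mg_dl >= 200
            and hyperglycemia_documented):
        met.append("random_glucose")
    if antidiabetic_medication_started:
        met.append("medication")
    return met


if __name__ == "__main__":
    history = [Diagnosis(date(2015, 1, 10)), Diagnosis(date(2015, 9, 1))]
    print(meets_depression_definition(history))
    print(diabetes_criteria_met(True, a1c_percent=7.1))
```

A study team would typically validate a rule like this against chart review in its own data before relying on it, since coding practices differ between health care systems.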
Detailed patient characteristics should be collected in EMRs
For some conditions, symptom level data is routinely measured in provision of clinical care. Repeated A1c values, blood pressures and body mass index are common variables in EHR data. However, for mental health conditions there is a great need to increase symptom level data by administering brief instruments like the Patient Health Questionnaire-9 (PHQ-9) (16) and the Alcohol Use Disorders Identification Test (AUDIT-C) (17). Measures of impairment and health related quality of life should be implemented to improve care. The PHQ-9 and AUDIT-C would help detect depression and alcohol abuse, while helping to clarify the impact these conditions have on co-morbidities and overall health. The routine measurement of impairment and health related quality of life offers detailed information on patients’ well-being and burden of disease, informs day to day care and allows a greater understanding of the impact and value of different treatment approaches (18). Symptom level data can then be used to evaluate the effectiveness of different treatment approaches.
Additional funds are required to support EMR adoption in countries with fewer resources and enable extraction of data for research (1). While privacy concerns are reduced in de-identified or limited data sets, the ability to link biobanks with patient medical records raises ethical concerns when biological data, such as DNA, are collected for specific studies. To fully realize the opportunity of linking these rich databases, substantial funding will be necessary to obtain informed consent from patients whose biological material was obtained for medical care or specific research studies.
The learning health care system and predictive analytics
Predictive modeling is on the horizon and will allow research models, developed from millions of continuously updated clinical measures, to be applied to treatment decisions for an individual patient (19). Soler et al. (20) demonstrated that analytic methods applied to family medicine medical records can be used to make diagnostic decisions for patients presenting with new symptoms of urinary tract infection. The next step is to develop capacity to utilize predictive analytics in delivery of care. Teasing apart noise from evidence in forecasting patient outcomes and treatment response will require educating physicians and researchers about the strengths and limitations of big data. Because EHR-based research is able to detect many findings that are statistically significant yet clinically insignificant, we should proceed with caution and continuously improve methodological and analytic designs to strengthen the validity of new discoveries.
Using inexpensive, existing EHR data is an attractive alternative to primary data collection in a tight funding environment. The application of new methods and advanced analytic designs, increasing depth and duration of the health record, data networks and biobank data linked to patient records may lead to an increasing number of investigators choosing EHR data to test hypotheses. Eventually research studies using EHR data may equal or surpass prospective cohorts and RCTs as the primary resource for advancing evidence based medicine.
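The gap between statistical and clinical significance can be illustrated with a small calculation. The sketch below runs a two-sided z-test on the same hypothetical difference in mean systolic blood pressure (0.5 mmHg, with an assumed standard deviation of 15 mmHg in each group) at two sample sizes. The numbers and the function name are invented for illustration and are not taken from any study cited here.

```python
import math


def two_sample_z_test(mean_difference, sd, n_per_group):
    """Two-sided z-test for a difference in means between two equal-sized
    groups that share a known standard deviation."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = mean_difference / standard_error
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


if __name__ == "__main__":
    for n in (100, 100_000):
        z, p = two_sample_z_test(mean_difference=0.5, sd=15.0, n_per_group=n)
        print(f"n per group = {n}: z = {z:.2f}, p = {p:.3g}")
```

With 100 patients per group the difference is far from significant, while with 100 000 per group the same half-millimetre difference yields a very small p-value. The effect size has not changed, only the precision, which is why effect sizes and clinical thresholds matter more than p-values in very large cohorts.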
Declaration
Funding: this work was completed as part of the authors’ faculty responsibilities and was not supported by grant funding.
Ethical approval(s): this report does not involve human subjects research.
Conflict of interest: JFS is an Associate Editor for Family Practice.
References