ABSTRACT
Objectives Missing data is the most common data quality issue in electronic health records (EHRs). Checks are typically limited to counting the number of missing values in individual fields, but researchers and organisations need to understand multi-field missing data patterns, and counts or numerical summaries are poorly suited to that. This study shows how set-based visualization enables multi-field missing data patterns to be discovered and investigated.
Design Development and evaluation of interactive set visualization techniques to find patterns of missing data and generate actionable insights.
Setting and participants Anonymised Admitted Patient Care health records for NHS hospitals and independent sector providers in England. The visualization and data mining software was run over 16 million records and 86 fields in the dataset.
Results The dataset contained 960 million missing values. Set visualization bar charts showed how those values were distributed across the fields, including several fields that, unexpectedly, were not complete. Set intersection heatmaps revealed unexpected gaps in diagnosis, operation and date fields. Information gain ratio and entropy calculations allowed us to identify the origin of each unexpected pattern, in terms of the values of other fields.
Conclusions Our findings show how set visualization reveals important insights about multi-field missing data patterns in large EHR datasets. The study revealed both rare and widespread data quality issues that were previously unknown to an epidemiologist, and allowed a particular part of a specific hospital to be pinpointed as the origin of rare issues that NHS Digital did not know existed.
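As an illustration of the two calculations named in the Results, the following minimal Python sketch groups records by the set of fields that are missing (the quantity that set visualization bar charts and intersection heatmaps display) and then computes an information gain ratio for how well another field explains the missingness of a target field. It is not the software used in the study: the toy records, the field names (provider, diag_01, oper_01, disdate) and the function names are hypothetical, and the gain ratio follows the standard decision-tree definition (information gain divided by split information).

```python
from collections import Counter
from math import log2

# Hypothetical toy records; None marks a missing value.
records = [
    {"provider": "A", "diag_01": "I21", "oper_01": "K40", "disdate": "2015-01-03"},
    {"provider": "A", "diag_01": "I20", "oper_01": None, "disdate": "2015-01-09"},
    {"provider": "B", "diag_01": None, "oper_01": None, "disdate": "2015-02-01"},
    {"provider": "B", "diag_01": None, "oper_01": None, "disdate": None},
    {"provider": "A", "diag_01": "I21", "oper_01": "K49", "disdate": "2015-03-11"},
    {"provider": "B", "diag_01": None, "oper_01": "K40", "disdate": "2015-03-15"},
]


def missing_set(record):
    """Return the set of field names whose value is missing in one record."""
    return frozenset(name for name, value in record.items() if value is None)


def pattern_counts(rows):
    """Count how many records share each exact combination of missing fields."""
    return Counter(missing_set(row) for row in rows)


def entropy(labels):
    """Shannon entropy (in bits) of a list of hashable labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain_ratio(rows, target_field, explanatory_field):
    """Gain ratio of explanatory_field for 'target_field is missing'."""
    labels = [row[target_field] is None for row in rows]
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[explanatory_field], []).append(label)
    total = len(rows)
    # Entropy of missingness that remains after splitting on the explanatory field.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy([row[explanatory_field] for row in rows])
    return info_gain / split_info if split_info > 0 else 0.0


if __name__ == "__main__":
    for pattern, count in pattern_counts(records).most_common():
        print(sorted(pattern), count)
    print(gain_ratio(records, "diag_01", "provider"))
    print(gain_ratio(records, "oper_01", "provider"))
```

In this toy data the diagnosis field is missing only for provider B, so the provider field fully explains that pattern and receives the highest possible gain ratio, whereas it explains the missing operation values only weakly. Ranking candidate fields by this score is one way to trace a missing data pattern back to its origin.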
Strengths and limitations of this study
This study demonstrates the utility of interactive set visualization techniques for finding and explaining patterns of missing values in electronic health records, irrespective of whether those patterns are common or rare.
The techniques were evaluated in a case study with a large (16 million records; 86 fields) Admitted Patient Care dataset from NHS hospitals.
There was only one data table in the dataset. However, ways to adapt the techniques for longitudinal data and relational databases are described.
The evaluation only involved one dataset, but that was from a national organisation that provides many similar datasets each year to researchers and organisations.
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
This research was supported by the Engineering and Physical Sciences Research Council grant numbers EP/N013980/1 and EP/K503836/1, the British Heart Foundation grant number PG/13/81/30474, and the Alan Turing Institute.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
The MaPS and Engineering joint Faculty Research Ethics Committee of the University of Leeds gave ethical approval for this work.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.
Yes
Footnotes
madnan1{at}hct.ac.ae, m.s.hall{at}leeds.ac.uk
I, the Submitting Author has the right to grant and does grant on behalf of all authors of the Work (as defined in the below author licence), an exclusive licence and/or a non-exclusive licence for contributions from authors who are: i) UK Crown employees; ii) where BMJ has agreed a CC-BY licence shall apply, and/or iii) in accordance with the terms applicable for US Federal Government officers or employees acting as part of their official duties; on a worldwide, perpetual, irrevocable, royalty-free basis to BMJ Publishing Group Ltd (“BMJ”) its licensees and where the relevant Journal is co-owned by BMJ to the co-owners of the Journal, to publish the Work in BMJ Open and any other BMJ products and to exploit all rights, as set out in our licence.
The Submitting Author accepts and understands that any supply made under these terms is made by BMJ to the Submitting Author unless you are acting as an employee on behalf of your employer or a postgraduate student of an affiliated institution which is paying any applicable article publishing charge (“APC”) for Open Access articles. Where the Submitting Author wishes to make the Work available on an Open Access basis (and intends to pay the relevant APC), the terms of reuse of such Open Access shall be governed by a Creative Commons licence – details of these licences and which Creative Commons licence will apply to this Work are set out in our licence referred to above.
Data Availability
The dataset was provided by NHS Digital (request number DARS-NIC-17649-G0X4B-v0.6) and, due to data governance restrictions, cannot be made openly available.