Using primary care data for health research in England – an overview
ABSTRACT
In contrast to secondary care, where handwritten records remain widespread, electronic patient records have long been a key feature of UK general practice. By 1996, 96% of general practices were computerised and now almost every primary care consultation in the UK is recorded on a computerised clinical system. Consequently, we now have a vast repository of patient health data that spans decades, which could be used to address a range of important research questions. Unfortunately, accessing primary care data for health researchers can be a burdensome, confusing and time-consuming process. Understanding the way in which primary care data are recorded and ‘coded’ is not intuitive to those unfamiliar with general practice. The requirements of information governance mean that some data, or data presented in particular ways, are not available at all. This review provides a practical overview of the types of data recorded in primary care, the bodies responsible for them and how they can be accessed.
Introduction
Electronic patient records (EPRs) have long been a key feature of English general practice. By 1996, 96% of general practices were computerised1 and now almost every primary care consultation in the UK is recorded on a computerised clinical system. This has resulted in a vast repository of patient health data that spans decades, which can be used to address a range of important research questions.
The process of accessing primary care data for health researchers can be burdensome, confusing and time-consuming. Understanding the way in which primary care data are recorded and ‘coded’ is not intuitive to those unfamiliar with general practice. The Data Protection Act, and the information governance requirements that follow from it, mean that some data are effectively unavailable to researchers. Although significant barriers to accessing primary care data for research remain, many important research findings have been made using these data; some significant examples are given in Box 1. Here, we present a practical overview of the types of data recorded in primary care (simplified representation in Fig 1), the bodies responsible for controlling them and the methods available to access them (Fig 2 and Box 2).
Data sources in primary care
EPRs
Most GPs use one of four commercially provided software platforms to manage their EPRs. These are EMISWeb (EMIS), SystmOne (TPP/The Phoenix Partnership), Vision (In Practice Systems) and Evolution (Microtest).12
GPs generally enter narrative information regarding the consultation, in a free-text format. In some systems, this might be entered in a defined format, for example with fields for history, examination, diagnosis and plan. Data entered as ‘free text’, for example a medical diagnosis entered as text with no corresponding code, can often only be retrieved by inspection of the individual patient record, although free-text search functionality is a feature of some systems.
Free text is navigable for direct patient care, but unsuitable for large-scale data analysis. Clinical coding is the mapping of discrete items of information within the narrative to a standardised thesaurus of clinical terms. Until recently, ‘Read’ was the coding system used, but it is being succeeded by SNOMED CT, which has been introduced across general practice in phases since April 2018. Clinicians or administrative staff can use this system to improve the patient record and are incentivised to do so through payments linked to coded entries. SNOMED CT is more comprehensive than diagnostic coding systems, incorporating items such as symptoms, procedures and body structures. However, diagnoses in SNOMED CT map directly to those found in the International Classification of Diseases (ICD). The aim in the NHS is for all coding across the entire health system in England to use SNOMED CT by 2020.13
It is important to recognise that the practice of coding is highly variable among clinicians.14 It is not possible to assume that all diagnoses are coded accurately, or at all: patients frequently attend with multiple presentations; a clinician might not be able to find a sufficiently appropriate code for the presentation; codes might be entered in error; or the clinician might rely on free-text data entry only.
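The practical consequence of free-text-only recording can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Entry` structure, the patient identifiers and the placeholder code `EXAMPLE-ASTHMA` are invented and do not correspond to any clinical system or to real Read or SNOMED CT codes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    patient_id: str
    free_text: str
    code: Optional[str] = None  # None when only free text was recorded


# Placeholder value, NOT a real Read or SNOMED CT code.
ASTHMA_CODE = "EXAMPLE-ASTHMA"

entries = [
    Entry("p1", "Asthma review, symptoms well controlled", ASTHMA_CODE),
    Entry("p2", "Known asthma, wheeze at night"),        # diagnosis in free text only
    Entry("p3", "Sore throat", "EXAMPLE-SORE-THROAT"),
    Entry("p4", "No history of asthma"),                 # negated mention
]


def patients_by_code(entries, code):
    """Patients with at least one entry carrying the given code."""
    return {e.patient_id for e in entries if e.code == code}


def patients_by_text(entries, term):
    """Patients whose free text mentions the term (naive substring match)."""
    return {e.patient_id for e in entries if term.lower() in e.free_text.lower()}


coded = patients_by_code(entries, ASTHMA_CODE)
mentioned = patients_by_text(entries, "asthma")

print(sorted(coded))              # found by a code-based search
print(sorted(mentioned - coded))  # found only by reading the free text
```

In this toy example the code-based search misses patient `p2`, whose diagnosis exists only as free text, while the naive text search also returns `p4`, whose record explicitly excludes the diagnosis. Neither approach is reliable on its own, which is why variation in coding practice matters for study design.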
Quality of data recording
The quality of coding is related to the presence of incentives for accurate recording. These can be financial (eg the Quality and Outcomes Framework [QOF]) or safety and quality driven (eg a diagnosis of type 1 diabetes mellitus). Payment regimens, such as QOF, have been criticised for creating ‘perverse incentives’ for the ways in which data are coded and the effect this has on clinician behaviour.15 This can distort the numbers of patients with recorded diagnoses, as manifested by the fluctuating numbers of patients diagnosed with depression, which have been shown to be influenced by the existence of a financial incentive.16
Many primary care computer systems facilitate the entry of discrete values where appropriate, such as gender, weight, height, blood pressure and peak expiratory flow. The ability to perform searches based on these values depends on them having been entered in the appropriate format.
Data sharing
Sharing of anonymised data for research, for example with Clinical Practice Research Datalink (CPRD) or with the research divisions of clinical systems, is determined via a voluntary ‘opt in’ for each practice. The decision to opt in or out of data sharing is usually the responsibility of the Caldicott guardian or information governance-responsible officer of each GP practice. GP practices that have opted in to data sharing can still exercise judgement to exclude individual patient records. However, many organisations lack the capacity to decide whether to include individual patient records.
Patients maintain the right to opt out of their data being shared for purposes other than direct care, regardless of whether their practice has opted in. Opt-outs currently take two forms, and prevent the use of data beyond direct patient care. A ‘type 1 opt-out’ prevents information that identifies a patient from leaving the GP practice (except when required by law for public health reasons). A ‘type 2 opt-out’ prevents information that identifies a patient from leaving the central repository in NHS Digital (formerly the Health and Social Care Information Centre; HSCIC).17 Following Dame Fiona Caldicott’s review of data security, consent and opt-outs,18 an online national system for patients to opt out of sharing of their data was launched in May 2018, with the aim of giving patients a more convenient way of expressing their preferences.19
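The two opt-out types described above can be expressed as a pair of simple checks. The following Python sketch is illustrative only: the `Patient` structure and function names are hypothetical, the legal exception for public health is not modelled, and it assumes that GP data can leave NHS Digital only if they were first allowed to leave the practice.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    patient_id: str
    type1_opt_out: bool = False  # identifiable data must not leave the GP practice
    type2_opt_out: bool = False  # identifiable data must not leave NHS Digital


def may_leave_practice(p: Patient) -> bool:
    """Identifiable data may flow out of the practice for non-direct-care use."""
    return not p.type1_opt_out


def may_leave_nhs_digital(p: Patient) -> bool:
    """Identifiable GP data may be released onwards from NHS Digital.

    Assumption: GP data must first have left the practice, so a type 1
    opt-out also blocks onward release here.
    """
    return may_leave_practice(p) and not p.type2_opt_out


patients = [
    Patient("p1"),
    Patient("p2", type1_opt_out=True),
    Patient("p3", type2_opt_out=True),
]

for p in patients:
    print(p.patient_id, may_leave_practice(p), may_leave_nhs_digital(p))
```

A researcher estimating the size of an available cohort would need to apply checks of this kind on top of the practice-level decision to opt in.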
Legislation
All data processing must be conducted in line with the Data Protection Act 1998, the incoming General Data Protection Regulation (GDPR) and all other relevant legislation. However, section 251 of the NHS Act 2006 allows ‘important’ medical research to make use of identifiable patient information without explicit patient consent,20 while adhering to data protection principles. Application for Section 251 approval for research purposes is made through the Confidentiality Advisory Group (CAG), which is administered by the Health Research Authority (HRA). Section 251 approval grants access to data for valued research where obtaining patient consent would be impractical. The HRA provides comprehensive guidelines, and maintains a register of approved CAG applications, which is accessible online.21
Prescribing data
In England, primary care drugs are prescribed almost entirely through practice IT systems. Handwritten prescriptions account for a small minority, often only used in the case of system failure and home visits. This produces high-quality prescribing data from primary care. Information about the volumes and types of drug prescribed at the level of individual GP practices has been made available through NHS Digital,22 while a search interface for these data has been created by the OpenPrescribing project of the EBM DataLab (University of Oxford).23 More individualised prescribing information is recorded and can be retrieved through the individual primary care clinical systems. Prescribing information is also available through the NHS Business Services Authority (NHS BSA).22
Processing of data
Primary care clinical systems providers
The data entered through the four dominant primary care clinical systems are processed under instruction by each corresponding contractor. Procurement of the systems is managed by NHS Digital, with each practice choosing which to use through ‘GP Systems of Choice’.24 Some contractors have developed research divisions, including QResearch (the University of Nottingham and EMIS) and ResearchOne (SystmOne), which engage with clinical researchers.
The distribution of practices using each system varies across England, with more than one system commonly in use within each region. Therefore, if the method of access is to be through a single contractor, it is important for researchers to determine whether the pattern of usage of that system corresponds well with the population under study. In West Yorkshire, for example, where the market is dominated by SystmOne and EMISWeb, researchers wishing to achieve comprehensive access to patient data through clinical systems would need to engage with both QResearch and ResearchOne.
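A rough check of this kind can be done before approaching a contractor. The Python sketch below estimates the share of practices in a study area that would be reachable through a given set of clinical systems. The practice counts are invented for illustration and do not describe West Yorkshire or any real region; the function name `coverage` is likewise hypothetical.

```python
# Hypothetical number of practices using each system in a study area.
# These figures are invented for illustration only.
practice_counts = {
    "SystmOne": 180,
    "EMISWeb": 110,
    "Vision": 7,
    "Evolution": 3,
}


def coverage(counts, systems):
    """Fraction of practices that use any of the named systems."""
    total = sum(counts.values())
    return sum(counts.get(s, 0) for s in systems) / total


single = coverage(practice_counts, ["SystmOne"])
both = coverage(practice_counts, ["SystmOne", "EMISWeb"])

print(f"One contractor: {single:.0%} of practices")
print(f"Two contractors: {both:.0%} of practices")
```

Practice counts are only a proxy: list sizes and patient demographics can differ between systems, so a fuller assessment would weight by registered population where such figures are available.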
Linking data sets
Linking different data sets from multiple sources that relate to an individual patient offers great opportunities for clinical research. For instance, linking primary care records documenting different diagnoses to secondary care pathology reports could reveal possible associations and insights into how certain diseases could be diagnosed earlier. For linking to take place, data cannot be anonymised, but they can be pseudonymised to protect patient-identifiable data (PID). In this process, all identifying fields within a patient record are replaced with anonymous identifiers. Access to the reidentification key, which reverses pseudonymisation and so maintains a pathway to patient-level data, can then be restricted to designated persons in defined organisations. This allows the reidentification of a patient in special circumstances: if there is a safety issue or if direct care of the patient would benefit from the insight gained from the research. The NHS Business Services Authority’s pseudonymisation and anonymisation of data policy defines these terms and sets out a series of actions that must be taken to pseudonymise data.25
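The general idea of pseudonymised linkage can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure set out in the NHS BSA policy: it uses a keyed hash (HMAC-SHA256) as one possible way of generating consistent pseudonyms, and the secret key, field names, identifiers and records are all invented.

```python
import hashlib
import hmac

# Hypothetical secret; in practice this would be generated and stored securely.
SECRET_KEY = b"example-secret-held-by-the-data-controller"


def pseudonymise(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Derive a stable pseudonym from an identifier using a keyed hash."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def strip_identifiers(records, id_field="patient_number"):
    """Replace the identifying field with a pseudonym.

    Returns the pseudonymised records and a reidentification table
    (pseudonym -> original identifier) that would be held separately
    by designated persons.
    """
    pseudonymised, reid_table = [], {}
    for rec in records:
        pid = pseudonymise(rec[id_field])
        reid_table[pid] = rec[id_field]
        rest = {k: v for k, v in rec.items() if k != id_field}
        pseudonymised.append({"pid": pid, **rest})
    return pseudonymised, reid_table


# Invented records with obviously fake identifiers.
primary_care = [
    {"patient_number": "FAKE-0001", "diagnosis": "iron deficiency anaemia"},
    {"patient_number": "FAKE-0002", "diagnosis": "type 2 diabetes"},
]
pathology = [
    {"patient_number": "FAKE-0001", "report": "example pathology finding"},
]

pc, pc_table = strip_identifiers(primary_care)
path, _ = strip_identifiers(pathology)

# Link the two sources on the pseudonym; no identifier is present in the result.
path_by_pid = {r["pid"]: r for r in path}
linked = [
    {**r, "report": path_by_pid[r["pid"]]["report"]}
    for r in pc
    if r["pid"] in path_by_pid
]
print(linked)
```

Because both sources are pseudonymised with the same key, the same patient receives the same pseudonym in each and the records can be joined without exposing the identifier. The reidentification table is what would be restricted to designated persons.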
Despite adequate control of reidentification keys, linking of data still has the potential to undermine the pseudonymisation of data and could lead to individuals becoming identifiable. Reidentification is particularly problematic when dealing with small data sets or rare diseases. For this reason, interrogation that generates numeric data relating to fewer than five patients cannot be released, although it can be expressed as percentages. This can have important consequences for the interpretation of data that are aggregated from several sources, each representing small numbers.
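The effect of small-number suppression on aggregated data can be shown with a short Python sketch. The threshold of five comes from the text above; the function names, the disease labels and the counts are invented, and treating a count of zero as suppressed is an assumption made here for simplicity.

```python
THRESHOLD = 5  # counts relating to fewer than five patients are withheld


def suppress_small_counts(counts, threshold=THRESHOLD):
    """Replace any count below the threshold with None (withheld)."""
    return {k: (v if v >= threshold else None) for k, v in counts.items()}


def aggregate(released_sources, key):
    """Sum a released count across sources, or None if any source withheld it."""
    values = [source[key] for source in released_sources]
    if any(v is None for v in values):
        return None
    return sum(values)


# Invented counts from two hypothetical data sources.
source_a = {"disease X": 3, "disease Y": 40}
source_b = {"disease X": 4, "disease Y": 25}

released = [suppress_small_counts(s) for s in (source_a, source_b)]

print(aggregate(released, "disease Y"))
print(aggregate(released, "disease X"))
```

In this example the combined count for the rarer condition would be seven, which is above the threshold, yet it cannot be recovered from the released figures because each source withheld its own count. This is the interpretive problem that arises when aggregating several small sources.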
The failure of the NHS England Care.Data programme was a significant setback to ambitious large-scale data sharing.26,27 This has prompted organisations to exercise significant caution in consenting to the use of data in the creation of linked data sets.
Research databases
The CPRD is a large anonymised primary care database comprising data on 6.9% of the UK population and is broadly representative in terms of age, sex and ethnicity.28 Developed from the General Practice Research Database (GPRD), which collated patient information from over 500 GP practices, the CPRD now links pseudonymised general practice data to national data sets and data from other health providers.29 The CPRD is based at the Medicines and Healthcare products Regulatory Agency (MHRA), is run jointly with the National Institute for Health Research (NIHR), and has been used in over 1,700 published studies.30
Researchers can apply for access to CPRD data that include demographics, symptoms and signs, tests, diagnoses, prescriptions, lifestyle information, and referrals to secondary care.10 The cost to researchers for data extracts from CPRD can be considerable but can be reduced if accessed through organisations that have institutional membership.
The Health Improvement Network (THIN), based at University College London, contains data from over 550 GP practices across the UK.31 The Royal College of General Practitioners’ Research and Surveillance Centre (RSC) collects data from around 200 practices.32 Data are extracted twice weekly and have proved particularly fruitful for monitoring the spread of diseases such as influenza. Historic data from the past 10 years are stored and managed by the University of Surrey.
As with other sources, the utility of data from research databases can be limited by the original quality and consistency of data entry by clinicians and administrators in individual practices. The size of data sets will also be affected by data-sharing opt-outs as exercised by GP practices on behalf of their patients or by individual patients themselves.
Accessing primary care data for research
Individual general practices and GPs
The HRA has determined that each individual general practice is considered a distinct entity with the ability to enter into data-sharing agreements.33 Practices have nominated Caldicott guardians or information governance-responsible officers whose role involves safeguarding patient data. The data controller can be either an individual or the GP practice as a legal entity. Collaboration can then occur by researchers acting as data processors and agreeing a data-sharing agreement directly with the data controller. Once there is an instruction from the data controller to the data processor, data can flow in compliance with the Data Protection Act.34 GP practices can also access research and development support from their Clinical Commissioning Group (CCG) or Commissioning Support Unit (CSU), but a similar direction must always be in place for data to flow legally.
Historically, GPs have been pioneers in establishing approaches to information management because they are under moral, professional and legal obligations to protect their patients’ privacy. Revelations around the use of NHS data by the Home Office have highlighted ways in which information sharing could undermine patients’ access to healthcare.35
CCGs and CSUs
Following the Health and Social Care Act 2012, administration of primary care was divided into CCGs with regional-level support functions offered by CSUs. This has had important consequences for information governance, because it has subsequently been considered that CCGs should not be allowed to access PID, unless the CCG attains ‘Accredited Safe Haven’ (ASH) status.36 ASH status is obtained from NHS Digital through an application to the Data Access Request Service (DARS), which must stipulate the data requested, why they are required and how they will be processed. Applications must then be approved by the Independent Group Advising on the Release of Data (IGARD), a process that, in practice, takes several months. Maintaining ASH status is subject to audit and entails maintaining the necessary information governance. These governance procedures create significant barriers to accessing data directly from CCGs.
Receipt and processing of PID are managed by CSUs in collaboration with NHS Digital. By setting up ‘Data Services for Commissioners Regional Offices’ inside NHS Digital, into which staff from CSUs are seconded, fair processing is maintained by only staff within NHS Digital having access to PID.37
Following the NHS England lead provider framework (2015),38 private-sector organisations have been allowed to tender to provide CSU services for CCGs, provided they have ASH status. CSUs can charge a service fee for each episode of access to data. Fees for access to data could become more prevalent as CSUs are tendered to private-sector providers. To maintain public trust, it is vital that such transactions are not perceived as ‘selling’ personal data and that charges are understood to be for the service of transacting data rather than for the data themselves.
NHS Digital
NHS Digital, which succeeded the HSCIC, is a non-departmental body of the Department of Health, and acts as the provider and overseer of data and IT systems for the NHS in England. Specific information for defined purposes is extracted automatically from GP practices to the systems of NHS Digital over a specific period of time, through the General Practice Extraction Service (GPES).39 Data are in turn processed and presented by the Calculating Quality Reporting Service (CQRS). However, some mandated data submissions are not automated through GPES, and require manual entry by GP practice staff onto the CQRS. Data within the CQRS are then used to calculate payments to GP practices.
To access data held centrally at NHS Digital, one must submit a DARS application and have this approved by IGARD. The charges for such requests can be found on the NHS Digital website.11 In seeking central approval to link data sets, it is crucial to demonstrate that linkage is necessary to address the research question. Before formally applying for data linkage through DARS or via CAG, it is good practice to have received advice from information governance and research and development leads within one’s own organisation, and to have constructed a privacy impact assessment document.
Access to data through NHS Digital has been perceived as slow and costly,40 with detailed attention to legal basis, information governance and fair processing. In our experience, many researchers opt to develop separate data-sharing agreements with providers outside of NHS Digital to avoid the DARS process. The DARS team at NHS Digital offers support through email content and webinars for researchers, and is working to streamline the data-access process.11 Several other sources of data derived from or pertinent to primary care are available in addition to those described in this article. Some examples of these are listed in Box 3.
Conclusions
To earn the trust of the public and the confidence of the doctors to whom they have entrusted their information, research using patient data must proceed in an open and transparent manner, demonstrating clear benefits for patients.
For researchers to utilise primary care data effectively, it is vital to understand the ways in which data are entered in practice and the limitations imposed by variation in the consistency and quality of data entry. Similarly, an understanding of the strengths and limitations of the different repositories of primary care data and what can be expected from each will benefit study design.
Currently, researchers seeking access to primary care data frequently have to navigate burdensome and complex approval processes. Linked data sets could generate a rich resource for researchers to address important questions in clinical research, particularly between primary and secondary care. Approval to achieve this through separate data-sharing agreements between each body responsible for recording or processing data is challenging, and highlights the value of the centrally held data repositories within NHS Digital.
Primary care data represent a rich resource for researchers that can help to create a health system that learns from everyone who is treated. Such a ‘learning health system’ can help to address a range of clinical and research challenges.41–43 Navigating the complex infrastructure and information governance arrangements represents a significant challenge and can consume substantial additional time and resources. We recommend that researchers make contact early with relevant bodies and set aside sufficient time to navigate these processes. These resource costs should be included in grant applications. Researchers should also understand the pattern of usage of the different primary care clinical systems within the areas included in the study.
The difficulty in accessing primary care data for research is not only a result of legitimate concerns around data security, but also a consequence of the fragmentation of information governance oversight following the Health and Social Care Act 2012. To unleash the potential of primary care data, and particularly linked data sets, it might be necessary to rationalise and harmonise information governance procedures while adhering to the principles of data security and patient consent.
Author contributions
SB prepared the manuscript based on an oral presentation prepared by PC. NL added further content, including the two figures and text boxes, and made other corrections to the text.
Acknowledgments
The authors would like to thank Dr Peter Short, Dr Tom Foley, Ms Rosemary Dewey and Professor Richard Neal for their information and advice when writing this paper.
© Royal College of Physicians 2018. All rights reserved.
References
- Benson T
- NHS Redbridge
- Bhaskaran K
- Barker I
- NHS Digital
- How to make a freedom of information (FOI) request. www.gov.uk/make-a-freedom-of-information-request [Accessed 23 July 2018].
- Information Commissioner’s Office
- Clinical Practice Research Datalink
- NHS Digital
- NHS Digital
- HM Government
- McCartney M
- McLintock K
- NHS Digital
- National Data Guardian for Health and Care
- NHS Digital
- Health Research Authority
- Confidentiality Advisory Group registers. www.hra.nhs.uk/about-the-hra/our-committees/section-251/cag-advice-and-approval-decisions/ [Accessed 23 July 2018] 2017.
- NHS Digital
- NHS Business Services Authority
- Godlee F.
- van Staa T-P
- Lawson DH
- Mills P.
- Information Commissioner’s Office
- Gulland A
- Leeds CCGs seek ‘safe haven’ data status, 2013. www.hsj.co.uk/5060627.article [Accessed 26 August 2018]
- NHS Digital
- NHS Digital
- Filippon J
- Foley T
- Interoperability and Population Health Summit: Emerging Target Architecture. www.interopen.org/2016/12/29/interoperability-and-population-health-summit-emerging-target-architecture/ [Accessed 26 August 2018].
- Friedman CP