CPRD linked data

Anonymised primary care patient data can be individually linked to secondary care and other health and area-based datasets. This linkage enables CPRD to provide a fuller picture of the patient care record to support vital public health research, informing advances in patient safety and delivery of care. CPRD is expanding its healthcare data and research services to increase both the coverage of primary care data and the number of datasets that are linked and made available on a routine basis to the research community.

Data linkage in England is carried out by the Trusted Third Party NHS England. For further information please contact CPRD Enquiries at enquiries@cprd.com.

Linked datasets currently available include:

Source data

Publication: Padmanabhan S, Carty L, Cameron E, Ghosh RE, Williams R, Strongman H. Approach to record linkage of primary care data from Clinical Practice Research Datalink to other health-related patient data: overview and implications. Eur J Epidemiol, 2018.

 

Availability of linked data 

Linkage of CPRD primary care data with other patient level datasets is available for English practices that have consented to participate in the linkage scheme. Each individual GP practice participating in CPRD's collection of their primary care data can choose to revoke their consent for data collection at any point.

CPRD respects all patient opt-outs. Patients who have registered an opt-out will not be extracted for CPRD research or for data linkage.

The latest update to the priority linkages (specifically the NHS England (formerly Public Health England) Second Generation Surveillance System (SGSS) COVID-19 virology test data, COVID-19 Hospitalisation in England Surveillance System (CHESS), Intensive Care National Audit and Research Centre (ICNARC) data on COVID-19 intensive care admissions, Hospital Episode Statistics Admitted Patient Care, Office for National Statistics mortality data, and small area deprivation data) was released in February 2022, with coverage as detailed below.

Timelines for the next update to the linked data are not currently known, but there will likely be a delay as NHS England, CPRD's Trusted Third Party for linkage, is changing the way it processes and links data (see: https://digital.nhs.uk/data-and-information/data-insights-and-statistics/improving-our-data-processing-services). This will also require changes to CPRD's processing of these data, and will affect timelines.

The latest linked data comprise ONS deaths data (to 29/03/2021), HES APC (to 31/03/2021), SGSS and CHESS (to 23/02/2021), ICNARC data (to 17/03/2021), HES OP/DID (to 31/10/2020), HES A&E (to 31/03/2020), NCRAS cancer registrations/SACT/RTDS (to 31/12/2018) and small area data. In total, 9,315,232 acceptable patients in the CPRD GOLD July 2021 build and 38,416,860 acceptable patients in the CPRD Aurum June 2021 build are eligible for at least one linkage.
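Because the linked datasets end on different dates, a study that combines several of them can only follow patients up to the earliest of those end dates. The sketch below illustrates that reasoning using the coverage end dates listed above. The dictionary keys and the function name are hypothetical labels chosen for this example, not CPRD identifiers.

```python
from datetime import date

# Coverage end dates taken from the release described above.
# The keys are hypothetical labels, not official CPRD dataset names.
COVERAGE_END = {
    "ons_death": date(2021, 3, 29),
    "hes_apc": date(2021, 3, 31),
    "sgss": date(2021, 2, 23),
    "chess": date(2021, 2, 23),
    "icnarc": date(2021, 3, 17),
    "hes_op": date(2020, 10, 31),
    "hes_did": date(2020, 10, 31),
    "hes_ae": date(2020, 3, 31),
    "ncras": date(2018, 12, 31),
}


def common_follow_up_end(datasets):
    """Return the latest date that is covered by every requested dataset."""
    return min(COVERAGE_END[name] for name in datasets)


# A study using HES APC, ONS deaths and HES OP is limited by HES OP.
print(common_follow_up_end(["hes_apc", "ons_death", "hes_op"]))  # 2020-10-31
```

The dates will change with each new linkage release, so in practice they would be read from the current release documentation rather than hard-coded.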

As standard, to ensure we honour patient opt-outs, we will supply the latest available linked data for each dataset. If you require data from a specific earlier linkage set, or are unsure about which source file should be used in your study, please contact us on enquiries@cprd.com.

Access to linked data 

Access to patient level data is dependent on approval of a study protocol via the Research Data Governance (RDG) process. All required linked data sources must be requested on the application form. Additionally, researchers who are first-time users of a linked dataset must contact the CPRD Observational Research Team to discuss their requirements before submitting their application. Linked data are only provided by CPRD as part of a data extract linked to CPRD primary care data.

Guidance: Requesting linked data from CPRD
 

COVID-19 data

CPRD-linked COVID-19 datasets comprise:

1. NHS England (formerly Public Health England (PHE)) Second Generation Surveillance System (SGSS) COVID-19 virology test data

2. PHE COVID-19 Hospitalisation in England Surveillance System (CHESS)

3. Intensive Care National Audit and Research Centre (ICNARC) data on COVID-19 intensive care admissions. 

Second Generation Surveillance System (SGSS)

SGSS is the national laboratory reporting system used in England to capture routine laboratory data on infectious diseases and antimicrobial resistance. SARS-CoV-2 testing started in UK laboratories on 24/02/2020, with the SGSS data reflecting testing (swab samples, PCR test method) offered to those in hospital and NHS key workers (i.e. Pillar 1). The CPRD-SGSS linked data currently contain positive test results only.

Access to linked SGSS data is subject to prior approval. This dataset is not covered by existing licences, and data can only be released to organisations within the UK/EU/EEA.

The latest release of CPRD-SGSS data covers the period 01/03/2020 – 23/02/2021.

Note, these SGSS data will not be further updated as COVID-19 test data now reliably flow into the GP primary care record.

Please click on the link below to download the documentation relating to CPRD-SGSS data.

COVID-19 Hospitalisation in England Surveillance System (CHESS)

The former PHE established CHESS across all NHS Trusts in England on 15/03/2020 to collect epidemiological data on COVID-19 infection in persons requiring hospitalisation and ICU/HDU admission. Trends in hospital and critical care admission rates need to be interpreted in the context of testing recommendations, which changed over time.

Access to linked CHESS data is subject to prior approval. This dataset is not covered by existing licences, and data can only be released to organisations within the UK/EU/EEA.

The latest release of CPRD-CHESS data covers admissions to 23/02/2021.

Please click on the link below to download the documentation relating to CPRD-CHESS data.

Intensive Care National Audit and Research Centre (ICNARC) data on COVID-19 intensive care admissions

ICNARC is a national clinical audit covering all NHS adult, general intensive care and combined intensive care/high dependency units, and some additional specialist and non-NHS critical care units.

Data on patients critically ill with confirmed COVID-19 admitted to critical care units will be linked to the CPRD data.

The CPRD-ICNARC linked data comprise information on demographic variables (age, sex), admission/discharge, height/weight/BMI, clinical parameters (BP, hypertension, blood gas measurements, haemoglobin, platelet count, lactate, heart rate, respiratory rate etc.), coma, mortality prediction and physiology scores.

The latest release of CPRD-ICNARC data covers admissions to 17/03/2021.

Please click on the link below to download the documentation relating to CPRD-ICNARC data.

Small area level data

Classifications based on the population characteristics of small areas or neighbourhoods (and the individuals who live there) are available for linkage to CPRD primary care data. CPRD has linked GP practice postcodes and eligible patient residence postcodes for both CPRD GOLD and CPRD Aurum to some of the most commonly requested area level data. These include several measures of area level deprivation, a rural-urban classification, and a Clinical Commissioning Group (CCG) pseudonym (practice level, England-only). These measures can be used as a proxy for socio-demographic and socio-economic data, which are generally poorly recorded in the primary care data given they do not directly relate to a patient's care.

For each measure the postcode of the practice or patient residence is mapped to lower layer Super Output Area (LSOA), SOA in Northern Ireland or datazone (DZ) in Scotland using a postcode lookup file.
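The mapping step described above is essentially a table lookup from postcode to small-area code. The following is a minimal sketch of that idea only. The postcodes, area codes and function names are made up for illustration and do not reflect the actual lookup file or the process used by CPRD or its Trusted Third Party.

```python
def normalise_postcode(postcode):
    """Upper-case a postcode and remove all whitespace so lookups are consistent."""
    return "".join(postcode.split()).upper()


# Hypothetical lookup rows: (postcode, small-area code). All values are made up.
LOOKUP_ROWS = [
    ("AB1 2CD", "AREA0001"),
    ("EF3 4GH", "AREA0002"),
]
POSTCODE_TO_AREA = {normalise_postcode(pc): area for pc, area in LOOKUP_ROWS}


def area_for(postcode):
    """Return the small-area code for a postcode, or None if it is not in the lookup."""
    return POSTCODE_TO_AREA.get(normalise_postcode(postcode))


print(area_for("ab12cd"))   # AREA0001
print(area_for("ZZ9 9ZZ"))  # None
```

Normalising before the lookup means that differences in spacing or letter case do not cause a postcode to be missed; postcodes absent from the lookup simply remain unlinked.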

More information about small area level data can be found in the paper below.

Mahadevan P, Harley M, Fordyce S, et al. Completeness and representativeness of small area socioeconomic data linked with the UK Clinical Practice Research Datalink (CPRD). J Epidemiol Community Health 2022;76:880-886. https://jech.bmj.com/content/76/10/880

Patient postcode linked deprivation measures

Patient postcode linked measures are available for patients in English practices that have consented to participate in the linkage scheme. The latest available patient postcode of residence is mapped to an LSOA boundary. The LSOA of residence then allows linkage to the following LSOA-level deprivation measures:

  • 2019 English Index of Multiple Deprivation (composite and individual domains)
  • Townsend Deprivation Index: calculated using unadjusted 2011 census data
  • Carstairs Index using 2011 census data

Data are provided as quintiles, deciles or twentiles of the deprivation score to prevent disclosure of patient location. In order to prevent the possibility of deductive disclosure of a patient's area of residence, researchers will only be provided with one of the above linked datasets for any one study. Access is provided by CPRD subject to approval.
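To illustrate what banding into quintiles, deciles or twentiles means, the sketch below ranks areas by score and splits them into equal-sized groups, so that only the group number (not the score or the area) needs to be released. This is an illustrative rank-based approach with made-up scores; it is not a description of how the published indices or the CPRD linked files are actually derived, and which end of the scale counts as most deprived is left unspecified.

```python
def quantile_groups(scores, n_groups):
    """Assign each area to one of n_groups equal-sized groups by rank of score.

    scores: dict mapping a (hypothetical) area code to a deprivation score.
    Returns a dict mapping area code to group number (1 = lowest scores).
    """
    ranked = sorted(scores, key=lambda area: scores[area])
    n = len(ranked)
    return {area: (rank * n_groups) // n + 1 for rank, area in enumerate(ranked)}


# Made-up scores for ten hypothetical areas.
values = [5.1, 40.2, 12.7, 33.3, 8.9, 21.0, 27.5, 3.2, 46.8, 16.4]
scores = {f"AREA{i:02d}": s for i, s in enumerate(values)}

quintiles = quantile_groups(scores, 5)
print(quintiles["AREA07"])  # 1 (lowest score, 3.2)
print(quintiles["AREA08"])  # 5 (highest score, 46.8)
```

With ten areas, each quintile contains two areas, so a group number alone cannot identify a single area.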

Practice postcode linked deprivation measures

The general practice postcode linkages are available for all practices in CPRD GOLD and CPRD Aurum and use the general practice postcode which is linked via LSOA, SOA in Northern Ireland and datazone (DZ) in Scotland.

The general practice postcode linkage includes a Clinical Commissioning Group (CCG) pseudonym (England-only) and several well-known area-based measures of deprivation:

  • 2019 English Index of Multiple Deprivation (composite and individual domains)
  • 2020 Scottish Index of Multiple Deprivation (composite and individual domains)
  • 2017 Northern Ireland Index of Multiple Deprivation (composite and individual domains)
  • 2019 Welsh Index of Multiple Deprivation (composite and individual domains)
  • Townsend Deprivation Index: calculated using unadjusted 2011 census data
  • Carstairs Index: England, Wales and Scotland calculated using 2011 census data

As standard, the most recent national Indices of Deprivation are provided for each country. It is important to note that the IMD indices are not comparable between countries in the UK. Data are provided as quintiles or deciles of the deprivation score to prevent disclosure of patient location. In order to prevent the possibility of deductive disclosure of the location of a practice, researchers will only be provided with one practice level linkage for any one study. Access is provided by CPRD subject to approval.

Rural-Urban classification

It may be important to distinguish between rural and urban areas when investigating differences in social and economic characteristics of small areas. Populations can vary in their composition between urban and rural areas, as can access to services, employment and educational opportunities, and quality of life. The measures available for patient (England only) and practice postcode are:

  • 2011 England and Wales Rural-Urban classification
  • 2015 Northern Ireland Rural-Urban classification
  • 2016 Scottish Rural-Urban classification

Access is provided by CPRD subject to approval.

For more information about data linkage and prices please contact CPRD Enquiries on enquiries@cprd.com.
 

Data from NHS England

NHS England has responsibility for standardising, collecting and publishing data and information from across the health and social care system in England.

CPRD linked data from NHS England includes Hospital Episode Statistics (HES) - a database containing details of all admissions, Accident and Emergency attendances and outpatient appointments at NHS hospitals in England, and ONS mortality data. 

HES Admitted Patient Care data

HES Admitted Patient Care (HES APC) data contains details of all admissions to, or attendances at English NHS healthcare providers. It includes private patients treated in NHS hospitals, patients resident outside of England and care delivered by treatment centres (including those in the independent sector) funded by the NHS. All NHS healthcare providers in England, including acute hospital trusts, primary care trusts and mental health trusts provide data.

HES APC data includes the complete set of hospital episode information (admission and discharge dates, diagnoses (identifying the primary diagnosis), the specialists under whose care the patient was seen, and procedures undertaken) for each linked patient with a hospitalisation record. In addition, augmented care data (intensive and/or high dependency levels of care) and maternity data are available.

Diagnostic data recorded in HES are coded using the International Classification of Diseases version 10 (ICD-10) coding frame; procedure information is coded using the UK Office of Population Censuses and Surveys (OPCS) classification, version 4.6.

Requests for HES APC data access are subject to prior approval.

The latest release of HES APC data covers the period April 1997 to March 2021. 

Please click on the link below to download the documentation which provides an overview of the HES APC data linked to CPRD primary care patients.

More information about HES APC data can be found in the data resource profile below, and from a number of recent concordance and validation studies.

Publication: Herbert A, Wijlaars L, Zylbersztejn A, Cromwell D, Hardelid P. Data Resource Profile: Hospital Episode Statistics Admitted Patient Care (HES APC). International Journal of Epidemiology, Volume 46, Issue 4, August 2017, Pages 1093–1093i.

Publication: Hagberg KW, Vasilakis-Scaramozza C, Persson R, Yelland E, Williams T, Myles P, Jick SS. Quality and completeness of malignant cancer recording in United Kingdom Clinical Practice Research Datalink Aurum compared to Hospital Episode Statistics. Ann Cancer Epidemiol 2022;6:6. doi: 10.21037/ace-22-4

Publication: Thorn JC, Turner EL, Hounsome L, the CAP trial group, et al. Validating the use of Hospital Episode Statistics data and comparison of costing methodologies for economic evaluation: an end-of-life case study from the Cluster randomised triAl of PSA testing for Prostate cancer (CAP). BMJ Open 2016;6:e011063.

Publication: Saine, ME et al. (2019). Concordance of hospitalizations between Clinical Practice Research Datalink and linked Hospital Episode Statistics among patients treated with oral antidiabetic therapies. Pharmacoepidemiol Drug Saf. issn: 1053-8569. doi: 10.1002/pds.4853

Publication: McDonald, L, CJ Sammon, et al. (2018). Under-recording of hospital bleeding events in UK primary care: a linked Clinical Practice Research Datalink and Hospital Episode Statistics study. Clin Epidemiol 10, pp. 1155–1168. issn: 1179-1349. doi: 10.2147/clep.s170304.

Publication: Williams, R et al. (2018). Cancer recording in patients with and without type 2 diabetes in the Clinical Practice Research Datalink primary care data and linked hospital admission data: a cohort study. BMJ Open 8.5, e020827. issn: 2044-6055. doi: 10.1136/bmjopen-2017-020827.

HES Outpatient data

HES Outpatient (HES OP) data are a collection of individual records of outpatient appointments occurring in England only. The data include information on the type of outpatient consultation, appointment dates, the main specialty and treatment specialty under which the patient was treated, referral source, waiting times, clinical diagnosis and procedures performed. HES OP data can be used to support health resource utilisation studies, clarify clinical health care pathways and enable variations in the uptake of services to be evaluated, for example by gender and age.

Access to linked HES OP data is subject to prior approval.

The latest release of HES OP data covers the period April 2003 to October 2020. 

Please click on the link below to download the documentation relating to HES Outpatient data.

Useful information can be found in the following validation study on the coverage of HES OP resource-use data in comparison to medical records from a cluster randomised trial:

Publication: Thorn JC, Turner E, Hounsome L, Walsh E, Donovan JL, Verne J, Neal DE, Hamdy FC, Martin RM, Noble SM. Validation of the Hospital Episode Statistics Outpatient Dataset in England. Pharmacoeconomics, 34 (2), 161-8, Feb 2016.

HES Accident and Emergency data

HES Accident and Emergency (HES A&E) data consists of individual records of patient care administered in the accident and emergency setting in England. These data are a subset of national A&E data collected by NHS England to monitor the national standard that 95% of patients attending A&E should wait no longer than 4 hours from arrival to admission, transfer or discharge. A&E data are submitted by A&E providers of all types in England. Data collected includes details about patients’ attendance, outcomes of attendance, waiting times, referral source, A&E diagnosis, A&E treatment (drugs prescribed are not recorded), A&E investigations and Health Resource Group. HES A&E may be used to clarify the health care pathway, to quantify health resource use and costs in the emergency setting, and to assess variations in the uptake of emergency services over time.

Access to HES A&E data is subject to prior approval.

The latest release of HES A&E data covers the period April 2007 to March 2020. 

Note: The Emergency Care Data Set (ECDS) is a new national dataset for urgent and emergency care and replaced the HES A&E dataset across England from the 2019-20 financial year. ECDS will enable more detailed analysis and enhanced understanding of emergency services, and linkage to CPRD primary care data is in progress.

Please click on the link below to download the documentation relating to HES Accident & Emergency data.

HES Diagnostic Imaging Dataset

The Diagnostic Imaging Dataset (DID) is a collection of detailed information about diagnostic imaging tests, such as x-rays and MRI scans, taken from NHS providers' radiological information systems. The DID includes information on imaging tests carried out from 1 April 2012 on NHS patients in England. It does not include the images that are produced as a result of these tests. The DID captures information about referral source and patient type, details of the test (type of test and body site), plus items about waiting times for each diagnostic imaging event, from time of test request through to time of reporting. The DID enables analysis of demographic and geographic variation in access to different test types and different providers.

The DID is routinely linked to Hospital Episode Statistics (HES) through NHS England. This existing HES DID dataset has now been linked to CPRD primary care data enabling users to analyse patient care pathways. Access to HES DID data is subject to prior approval.

The latest release of HES DID data covers the period April 2012 to October 2020.  

Please click on the link below to download the documentation relating to the HES Diagnostic Imaging Dataset.

Death Registration data

Death Registration data contains data from the Office for National Statistics (ONS) and includes information on the official date and causes of death (using ICD codes).

Access to ONS Death Registration data is subject to prior approval.

The latest release of ONS Death Registration Data covers the period 2 January 1998 to 29 March 2021. 

Please note that late registration of some deaths means that the proportion of deaths captured is lower for the last year of the coverage period, and this proportion is likely to differ by age at death and cause of death. The effect is especially pronounced for the last 1-2 weeks of available death data, which show an undercount of the total number of deaths because deaths whose registration has been delayed are not captured. For example, deaths referred to coroners in England, Wales and Northern Ireland cannot be registered until investigations have concluded, which can result in delays of months or years.
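One way an analyst might allow for this undercount is to end follow-up some days before the stated end of ONS coverage. The sketch below shows the date arithmetic only. The 14-day buffer is an assumption chosen to match the "last 1-2 weeks" mentioned above; it is not a CPRD recommendation, and the appropriate buffer is a study-specific decision.

```python
from datetime import date, timedelta

# End of ONS death registration coverage in the latest release described on this page.
ONS_COVERAGE_END = date(2021, 3, 29)


def censor_date(coverage_end, buffer_days):
    """Return an administrative censoring date set buffer_days before coverage_end."""
    if buffer_days < 0:
        raise ValueError("buffer_days must not be negative")
    return coverage_end - timedelta(days=buffer_days)


# Hypothetical 14-day buffer to exclude the period most affected by late registration.
print(censor_date(ONS_COVERAGE_END, 14))  # 2021-03-15
```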

Please click on the link below to download the documentation relating to ONS death registration data.

For more information please refer to the ONS User guide to mortality statistics, the ONS analysis exploring the impact of registration delays on mortality statistics and the associated dataset used for this report.

Further details can be found in three studies investigating the impact of the choice of data source in estimating mortality.

Publication: Gallagher, AM et al. (2019). The accuracy of date of death recording in the Clinical Practice Research Datalink GOLD database in England compared with the Office for National Statistics death registrations. Pharmacoepidemiol Drug Saf. issn: 1053-8569. doi: 10.1002/pds.4747.

Publication: Harshfield, A et al. (2018). Do GPs accurately record date of death? A UK observational analysis. BMJ Support Palliat Care. issn: 2045-435x. doi: 10.1136/bmjspcare-2018-001514.

Publication: Gallagher, AM et al. (2016). The Impact of the Choice of Data Source in Record Linkage Studies Estimating Mortality in Venous Thromboembolism. PLoS One 11.2, e0148349. issn: 1932-6203. doi: 10.1371/journal.pone.0148349.

Cancer data from NHS England National Disease Registration Service (NDRS) (formerly Public Health England (PHE))

NHS England National Disease Registration Service (NDRS) (formerly Public Health England (PHE)) provide cancer data via the National Cancer Registration and Analysis Service (NCRAS). There are three NCRAS datasets that can be linked to CPRD:

  1. Cancer Registration Tumour and Treatment data
  2. The Systemic Anti-Cancer Treatment (SACT) Dataset
  3. The National Radiotherapy Dataset (RTDS)

Access to all NCRAS datasets is subject to prior RDG approval.

As of January 2024 CPRD has changed the way requests are handled for studies requiring linkage to NCRAS datasets.

For studies requesting linkage to the NCRAS Tumour and Treatment data, the NCRAS Data Selection Form is no longer required. Requests can be submitted directly via eRAP in the same way as other linkage requests (e.g. HES and ONS death registration data). This will significantly reduce data delivery timelines for NCRAS Tumour and Treatment datasets. 

For studies requesting linkage to NCRAS SACT and/or RTDS data, additional steps remain to gain access to these datasets prior to submission of the research protocol on eRAP. These include the completion of a SACT / RTDS Data Selection Form (DSF) and submission of a draft study protocol to the Observational Research team. Please contact CPRD Enquiries at enquiries@cprd.com to request this form. Please be aware that it currently takes around 12-18 months from protocol approval to delivery of SACT and/or RTDS data, and that timelines for accessing these two datasets are largely outside CPRD's control.

NCRAS Cancer Registration Tumour and Treatment data

The data contains a record for each registrable tumour diagnosed or treated in England, of which the NCRAS has been notified. Cancers are coded using the International Classification of Diseases for Oncology, revision 3, 2011. They are also back-mapped to the tenth revision of the International Classification of Diseases (ICD-10).

The latest release of NDRS cancer registration data covers the period January 1990 – December 2018. 

More information about the cancer registration data can be found in the following data resource profile:

Publication: Henson KE, Elliss-Brookes L, Coupland VH, Payne E, Vernon S, Rous B, Rashbass J. Data Resource Profile: National Cancer Registration Dataset in England. International Journal of Epidemiology, dyz076.

Further details can be found in three studies comparing recording of cancer across data sources.

Publication: Strongman H, Williams R, Bhaskaran K. What are the implications of using individual and combined sources of routinely collected data to identify and characterise incident site-specific cancers? a concordance and validation study using linked English electronic health records data. BMJ Open 2020; 10:e037719. doi: 10.1136/bmjopen-2020-037719

Publication: Arhi, CS, A Bottle, et al. (2018). Comparison of cancer diagnosis recording between the Clinical Practice Research Datalink, Cancer Registry and Hospital Episodes Statistics. Cancer Epidemiol 57, pp. 148–157. issn: 1877-7821. doi: 10.1016/j.canep.2018.08.009.

Publication: Margulis, AV, J Fortuny, et al. (2018a). Validation of Cancer Cases Using Primary Care, Cancer Registry, and Hospitalization Data in the United Kingdom. Epidemiology 29.2, pp. 308–313. issn: 1044-3983. doi: 10.1097/ede.0000000000000786.

NCRAS Systemic Anti-Cancer Treatment (SACT) data

The SACT dataset covers chemotherapy treatment for all solid tumour and haematological malignancies, including those in clinical trials. Information is included about the programme and regimen of treatment, and the outcome for each treatment. In the latest linkage release (set 19) SACT data is available for patients with tumours recorded in the cancer registration data from January 2014 to December 2018. Data prior to January 2014 is also available but should be used with caution due to incomplete ascertainment during this period.

More information about the SACT data can be found in the following data resource profile:

Publication: Bright CJ, Lawton S, Benson S, Bomb M, Dodwell D, Henson KE, McPhail S, Miller L, Rashbass J, Turnbull A, Smittenaar R. Data Resource Profile: The Systemic Anti-Cancer Therapy (SACT) Dataset. International Journal of Epidemiology, dyz137.

NCRAS National Radiotherapy Dataset (RTDS)

The RTDS dataset contains records of radiotherapy services provided since April 2009, including teletherapy and brachytherapy. All radiotherapy delivered in England to patients in NHS facilities, or in private facilities where delivery was funded by the NHS, is included. Brachytherapy delivered for the treatment of non-malignant disease, radiotherapy delivered using unsealed sources, and non-therapeutic exposures delivered using radiotherapy machines (e.g. imaging) are not included. In the latest linkage release (set 19) RTDS data is available for patients with tumours recorded in the cancer registration data from April 2012 to December 2018. 

Source data 

The source data are provided to organisations that hold CPRD multi-study licences to enable researchers to ascertain which patients are eligible for linkage to each dataset and to clarify the coverage periods for each data source. The linkage eligibility file only includes patients from practices that have consented to take part in the linkage process. The file contains flags to indicate whether the patient is eligible for each individual linked data source. Some patients will not be eligible for any of the linked data sources, whereas others may be eligible for some/all of them. These data are provided so that multi-study licence users can determine the appropriate population to include in their study. The linkage coverage file indicates the start and end of coverage for each individual linked data source.
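As an illustration of how eligibility flags of this kind could be used to define a study population, the sketch below keeps only patients flagged as eligible for every linked source a study requires. The tab-delimited layout, the column names (`patid`, `hes_apc_e`, `ons_death_e`) and the values are hypothetical and chosen for this example; the actual file layout should be taken from the CPRD documentation.

```python
import csv
import io

# Hypothetical tab-delimited eligibility file; column names and values are illustrative.
ELIGIBILITY_TXT = (
    "patid\thes_apc_e\tons_death_e\n"
    "1001\t1\t1\n"
    "1002\t1\t0\n"
    "1003\t0\t0\n"
)


def eligible_patients(file_obj, required_flags):
    """Return patient IDs flagged as eligible ('1') for every required linked source."""
    reader = csv.DictReader(file_obj, delimiter="\t")
    return [
        row["patid"]
        for row in reader
        if all(row[flag] == "1" for flag in required_flags)
    ]


# Only patient 1001 is eligible for both sources in this made-up example.
print(eligible_patients(io.StringIO(ELIGIBILITY_TXT), ["hes_apc_e", "ons_death_e"]))
```

Requiring all flags reflects a study that needs every listed source; a study needing only one source would pass a single flag name.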

Access to source data for CPRD GOLD and/or CPRD Aurum is available to nominated users only; for access, please contact us at enquiries@cprd.com.

 
