A comprehensive high cost drugs dataset from the NHS in England - An OpenSAFELY-TPP Short Data Report

Background: At the outset of the COVID-19 pandemic, there was no routine comprehensive hospital medicines data from the UK available to researchers. These records can be important for many analyses including the effect of certain medicines on the risk of severe COVID-19 outcomes. With the approval of NHS England, we set out to obtain data on one specific group of medicines, “high-cost drugs” (HCD) which are typically specialist medicines for the management of long-term conditions, prescribed by hospitals to patients. Additionally, we aimed to make these data available to all approved researchers in OpenSAFELY-TPP. This report is intended to support all studies carried out in OpenSAFELY-TPP, and those elsewhere, working with this dataset or similar data.

Methods: Working with the North East Commissioning Support Unit and NHS Digital, we arranged for collation of a single national HCD dataset to help inform responses to the COVID-19 pandemic. The dataset was developed from payment submissions from hospitals to commissioners.

Results: In the financial year (FY) 2018/19 there were 2.8 million submissions for 1.1 million unique patient IDs recorded in the HCD. The average number of submissions per patient over the year was 2.6. In FY 2019/20 there were 4.0 million submissions for 1.3 million unique patient IDs. The average number of submissions per patient over the year was 3.1. Of the 21 variables in the dataset, three are now available for analysis in OpenSAFELY-TPP: Financial year and month of drug being dispensed; drug name; and a description of the drug dispensed.

Conclusions: We have described the process for sourcing a national HCD dataset, making these data available for COVID-19-related analysis through OpenSAFELY-TPP and provided information on the variables included in the dataset, data coverage and an initial descriptive analysis.


Introduction
Medicines data can help answer important questions about the patterns of use of medications, associated costs and possible risks and benefits of pharmaceutical treatments on patient outcomes, such as death from COVID-19. In 2019-20 the NHS spent £20.9bn on medicines in England 1. Of this total, medicines issued in hospitals accounted for 55.9% (£11.7bn) and medicines issued in primary care accounted for 43.5% (£9.1bn). Detailed records and data exist for medicines used in hospitals; however, at the outset of the pandemic this information was not made routinely available by the NHS 2,3.
Although the NHS is a single-payer healthcare system, it includes different internal payment mechanisms, and consequently different sources of medicines data. Briefly, medicines in the NHS in England are funded either centrally, by NHS England (NHSE) specialised commissioning, or locally by Clinical Commissioning Groups (CCGs). There are approximately 135 CCGs who fund all medicines prescribed in primary care, and the majority of medicines in secondary care such as those funded through overall hospital contracts or "tariffs". However, a specific group of "high-cost" medicines, typically specialist medicines 4, are excluded from tariffs and funding is provided directly either from NHSE specialised commissioning or a CCG, depending on the medicine, condition or service it is used for. As a consequence, detailed payment information, including patient details, is passed between a hospital and the responsible commissioner, resulting in detailed data that is held locally by each commissioner. To our knowledge this has never been collated in a single place and made available for routine analysis at national level.
OpenSAFELY-TPP is a new secure analytics platform for electronic patient records built by our group on behalf of NHSE to deliver urgent academic and operational research during the pandemic 5,6 . Analyses run across all patients' full raw pseudonymised primary care records in 40% of English general practices where TPP electronic health record (EHR) software is deployed, with patient-level linkage to various sources of secondary care data. Code and analysis are shared openly for inspection and re-use.
With the approval of NHS England, we set out to: obtain a source of hospital high-cost drug (HCD) data; make these data available in OpenSAFELY-TPP to support analysis of important questions related to COVID-19; better understand the information collected and available for analysis and generate descriptive outputs. This report is intended to support all researchers and studies carried out in OpenSAFELY-TPP, and those elsewhere working with the present dataset or similar data, to help inform the response to the COVID-19 pandemic.

Methods
Data source - obtaining a national high-cost drugs dataset
Hospitals in England supply medicines to patients either directly or through "homecare" providers who deliver medicines to a patient's home. The majority of medicines are funded through overall hospital contracts, included in tariffs; however, for certain HCDs, hospitals are required to provide a submission for each patient to the relevant commissioner, either NHSE or one of 135 local CCGs 1, in order to receive payment. The majority of submissions relate to a prescription of a HCD, although some submissions relate to associated services (e.g. home care delivery charges). There is a national list of the medicines that are funded by NHSE 7 and locally agreed lists for each CCG. These patient-level submissions are processed by intermediate organisations, Commissioning Support Units (CSUs), to support financial payments and associated activities like summary reporting. To our knowledge there has been no single collation of the submissions data across NHSE and all 135 CCGs for these HCDs. To address this gap, we arranged for a single CSU, the North East Commissioning Support Unit (NECS), to collate all the data flows from their partner CSUs into a single comprehensive dataset. The data were collated by NECS in May 2020. To use the dataset in OpenSAFELY-TPP, approval from NHS Digital was required via the NHS Digital Data Access Request Service (DARS); this was granted in September 2020 and the dataset was made available shortly thereafter. In line with OpenSAFELY-TPP standards on privacy and security, the HCD dataset was linked at individual patient level to primary care records in the secure data warehouse of TPP.

Variable overview
The HCD dataset is a patient-level dataset and includes variables on patient characteristics, clinical indications and medicine prescribed (Table 1). NHSE have a standard data collection specification for each individual submission, the Drugs Patient Level Contract Monitoring Data Set; however each CCG is independent and has local versions of the collection.
The national specification for submissions is published on the NHS Data Model and Dictionary website 8. A full list of the variables collected via submissions and the specification for each variable can be found on the website. Further information on the NECS data collation and standardisation processes can be found in the documentation on GitHub 9.
A description of each variable in the OpenSAFELY-TPP HCD dataset is provided below. Table 1 provides a brief overview of each variable and Table 2 provides information on the completeness of the data collection.
Currently, three variables from the HCD dataset can be queried in an OpenSAFELY-TPP study: FinancialMonth, FinancialYear and DrugName. These variables are the ones most relevant to current research questions and add new information not available in other OpenSAFELY-TPP datasets. These are also some of the most complete variables in the dataset.
Table 1 (excerpt). Overview of variables in the HCD dataset.
Variable name | Variable type | Specification details
… | Optional | Should be a SNOMED CT code, but input appears to vary by the organisation collecting the data. This should be the dm+d description of the medicine administered to the patient. Only populated when the provider has a dm+d enabled system.
DrugName | Max an255; mandatory where relevant | Currently able to query in an OpenSAFELY-TPP study definition. Input standardisation is at a provider level rather than a national level (non-standardised text input). The name of the prescribed item. Where possible this should be the SNOMED CT name. For drugs not listed in dm+d, this must be the valid name in UPPER CASE.
… | Mandatory where relevant | Should be a SNOMED CT code, but input appears to vary by the organisation collecting the data. To be populated by providers with an e-prescribing system.
DrugStrength | Max an100; mandatory where relevant | The amount of ingredient substance in the prescribed item.
DrugVolume | Max an100; mandatory where relevant | The volume of the drug administered to a patient when given in liquid form.
… | Max an100; optional | The amount of product in a pack or container.
… | Mandatory where relevant | The quantity prescribed in terms of either the pack size or number of doses.
* To note, the variable name is misspelled.

Analysing the high-cost drugs dataset within OpenSAFELY-TPP
The HCD dataset covering submissions from April 2018 to March 2020 was made available to researchers within the OpenSAFELY-TPP software framework, to inform responses to the COVID-19 pandemic.
The guidance to analyse the HCD dataset via OpenSAFELY-TPP is published online and available to all 11. Box 1 below provides an example of the code used to include information on HCD prescriptions within an OpenSAFELY "study definition"; this code is used to define a cohort. As part of this process, users also need to create specific codelists, which cover the medications of interest. Due to the nature of the "DrugName" variable in the HCD dataset (discussed in more detail in the Results section) the codelists used to query HCD data do not follow an existing naming convention such as the British National Formulary (BNF) or the mandated NHS standard dictionary of medicines and devices (dm+d).
· A list of the existing HCD codelists can be found on the OpenSAFELY codelists website 12 .
· If a codelist does not already exist, then the user will need to create one. These codelists will need to be based on the unique values from the DrugName variable. A list of all the unique values for the drug name variable can be found in the analysis code under Extended data 9 .

Box 1. Example of code used in an OpenSAFELY-TPP study definition to query the HCD dataset
The example code below flags all patients who were prescribed adalimumab between October 2019 and March 2020 in the HCD dataset. The adalimumab filter is based on the adalimumab codelist, found on OpenSAFELY codelists 13.

    # Flag patients with at least one adalimumab record in the HCD dataset
    prescribed_adalimumab=patients.with_high_cost_drugs(
        drug_name_matches=adalimumab_codes,  # bespoke codelist of DrugName values
        between=["2019-10-01", "2020-03-31"],
        find_first_match_in_period=True,
        returning="binary_flag",
        return_expectations={"incidence": 0.05},
    )

Further guidance on querying the HCD dataset via an OpenSAFELY-TPP study definition can be found online 11.

Full variable list. The national HCD dataset in OpenSAFELY-TPP covers 21 variables, three of which are currently available to query via OpenSAFELY-TPP study definitions. Table 1 provides a complete list of the variables in the dataset with a brief description of the variable type and specification.

Variable completeness. Variable completeness is shown in Table 2. The completeness (the percentage of records with non-missing values) differs across the variables: some variables have very few or no missing values (DrugName, PersonAge, TotalCost) whilst others are much less complete (DrugStrength, DrugVolume, TherapeuticIndicationCode). Variables that have high levels of missing data may not be suitable for inclusion in analysis and could be a target for improving the coverage of the data collection.
The number of unique values recorded in each variable is an indication of whether the variable uses nationally standardised inputs, following a codelist with restricted input at the data collection stage, or uses locally compiled lists, which will vary across providers. The DrugName variable is an example of a variable which uses locally compiled lists rather than national standardised input on collection and therefore has many unique values at a national level (>20,000). By contrast the ActivityTreatmentFunctionCode variable only has 143 unique values in the latest year, suggesting use of a nationally standardised list on collection.
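As an illustration of the two summary measures discussed above (completeness and the number of unique values per variable), the following is a minimal sketch using only the Python standard library. It is not the analysis code used for this report; the function name `summarise_variables`, the rule for what counts as missing, and the toy records are all assumptions made for the example.

```python
def summarise_variables(records, variables):
    """Return {variable: (completeness_percent, n_unique)} for a list of dict records."""
    summary = {}
    total = len(records)
    for var in variables:
        values = [r.get(var) for r in records]
        # Assumption: None and empty/whitespace-only strings count as missing.
        present = [v for v in values if v is not None and str(v).strip() != ""]
        completeness = 100.0 * len(present) / total if total else 0.0
        summary[var] = (round(completeness, 1), len(set(present)))
    return summary


# Toy records, invented purely for illustration.
toy_records = [
    {"DrugName": "ADALIMUMAB 40MG", "DrugStrength": "40mg"},
    {"DrugName": "Humira 40mg pen", "DrugStrength": None},
    {"DrugName": "adalimumab", "DrugStrength": ""},
    {"DrugName": "ADALIMUMAB 40MG", "DrugStrength": None},
]
summary = summarise_variables(toy_records, ["DrugName", "DrugStrength"])
```

In this toy example DrugName is fully populated but has three distinct spellings, while DrugStrength is mostly missing, mirroring the pattern described for the real dataset.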

Descriptive analysis
Using OpenSAFELY-TPP, descriptive analysis of the characteristics of the patients who receive HCDs can be carried out for the first time on a large scale, to inform related analysis on COVID-19. In this Data Note, we have provided some summary analysis of the demographic characteristics of patients in the HCD dataset (including age, sex, ethnicity and geographic location) and compared this patient group to other patients registered at TPP practices.
Using OpenSAFELY-TPP, we produced a descriptive analysis to better understand the demographic characteristics of patients that appear in the HCD dataset and how these patients compare to others registered at TPP practices. This analysis was restricted to patients who were registered at a TPP practice between 1st January 2020 and 31st March 2020 inclusive. Any patient who appeared in the HCD dataset between 1st October 2019 and 31st March 2020 was counted as in the HCD population; all other patients were included in the comparator population (not in HCD).

Table 1 (continued).
Variable name | Variable type | Specification details
… | Mandatory | The total cost of the activity that includes any agreed adjustments.
DerivedVTM | Max an255; derived by NECS | dm+d code - virtual therapeutic moiety. Around one third of values are NULL (see Table 2).
DerivedVTM Name | Max an255; derived by NECS | dm+d name - virtual therapeutic moiety. Around one third of values are NULL (see Table 2).

… require organisations to process confidential patient information for the purposes of protecting public health, providing healthcare services to the public and monitoring and managing the COVID-19 outbreak and incidents of exposure; this sets aside the requirement for patient consent 17. Taken together, these provide the legal bases to link patient datasets on the OpenSAFELY platform. GP practices, from which the primary care data are obtained, are required to share relevant health information to support the public health response to the pandemic, and have been informed of the OpenSAFELY analytics platform.

Data access and verification
Access to the underlying identifiable and potentially re-identifiable pseudonymised electronic health record data is tightly governed by various legislative and regulatory frameworks, and restricted by best practice. The data in OpenSAFELY-TPP is drawn from General Practice data across England where TPP is the Data Processor. TPP developers (CB, JC, JP, FH, and SH) initiate an automated process to create pseudonymised records in the core OpenSAFELY-TPP database, which are copies of key structured data tables in the identifiable records. These are linked onto key external data resources that have also been pseudonymised via SHA-512 one-way hashing of NHS numbers using a shared salt. DataLab developers and primary investigators (BG, LS, CEM, SB, AJW, KW, WJH, HJC, DE, PI, SD, GH, BBC, RMS, ID, KB, EJW and CTR) holding contracts with NHS England have access to the OpenSAFELY pseudonymised data tables as needed to develop the OpenSAFELY tools. These tools in turn enable researchers with OpenSAFELY Data Access Agreements to write and execute code for data management and data analysis without direct access to the underlying raw pseudonymised patient data, and to review the outputs of this code. All code for the full data management pipeline, from raw data to completed results for this analysis, and for the OpenSAFELY-TPP platform as a whole is available for review on GitHub.
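To make the idea of salted one-way hashing concrete, the sketch below shows how two datasets pseudonymised with the same salt end up with matching keys that can be linked without exposing the NHS number. It is illustrative only: the exact scheme used in OpenSAFELY-TPP (how the salt is combined with the identifier, normalisation, encoding) is not described in this report, so the concatenation order, the whitespace stripping, the function name `pseudonymise` and the example values are all assumptions.

```python
import hashlib


def pseudonymise(nhs_number: str, salt: str) -> str:
    """Return a hex SHA-512 digest of a salted identifier (assumed scheme)."""
    # Assumption: spaces are removed and the salt is prepended before hashing.
    normalised = nhs_number.replace(" ", "")
    return hashlib.sha512((salt + normalised).encode("utf-8")).hexdigest()


# Placeholder salt and made-up identifier, for illustration only.
SHARED_SALT = "example-salt-not-a-real-secret"
key_in_primary_care = pseudonymise("943 476 5919", SHARED_SALT)
key_in_hcd_dataset = pseudonymise("9434765919", SHARED_SALT)
key_with_other_salt = pseudonymise("9434765919", "a-different-salt")
```

Because both sources use the same salt, the two keys are identical and records can be joined on them; a party using a different salt obtains an unrelated digest.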
The data management and analysis code for this paper was led by AR with contributions from WH, BMK, SD, PI and DE.

Software and reproducibility
Data management was performed using Python, with analyses carried out using R. All of the code used for data management and analyses is openly available for inspection and re-use from the OpenSAFELY-TPP high cost drugs -research GitHub repository (Extended data) 9. More information on data access and verification is available in the supplementary material.

… and these are prospective submissions submitted before the patient had received the medicine. We recommend that these records are ignored and not used in any analysis.

The patient ID in the HCD dataset is used to match the information from this dataset to other patient-level data included in the OpenSAFELY-TPP environment. This ID allows OpenSAFELY-TPP users to include information from other data sources on the platform (e.g. hospital episodes or COVID-19 testing) in any analysis of HCD use.

Variable overview
The financial year and financial month variables in the HCD dataset are stored separately, which makes analysis over a specific time interval a little more complex than if it were combined as a single variable. The OpenSAFELY-TPP cohort extractor has been developed so that users can query dates easily and the translation from conventional date format to separate FY and financial month filters is done in the background of the OpenSAFELY-TPP cohort extractor.
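The kind of translation described above can be sketched as follows. This is not the cohort extractor's implementation; it assumes the usual NHS convention that the financial year runs from 1 April to 31 March with April as financial month 1, and the "2019/20" label format and the function name `to_financial_period` are chosen for illustration (the stored format of FinancialYear in the dataset may differ).

```python
from datetime import date
from typing import Tuple


def to_financial_period(d: date) -> Tuple[str, int]:
    """Map a calendar date to (financial year label, financial month number)."""
    # Assumption: financial year starts on 1 April; April is month 1, March is month 12.
    start_year = d.year if d.month >= 4 else d.year - 1
    financial_year = f"{start_year}/{str(start_year + 1)[-2:]}"
    financial_month = (d.month - 4) % 12 + 1
    return financial_year, financial_month


# The window used in Box 1 (October 2019 to March 2020) falls in one financial year.
window_start = to_financial_period(date(2019, 10, 1))
window_end = to_financial_period(date(2020, 3, 31))
```

Under these assumptions the Box 1 window corresponds to financial months 7 to 12 of FY 2019/20, which is the sort of pair of filters a user would otherwise have to construct by hand.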
Drug name. The drug name variable is a mandatory part of the submission (where relevant) and can be used in OpenSAFELY-TPP study definitions to provide information on the HCD a patient has been prescribed in a given time period. This variable can be queried to produce: a flag to indicate a patient was ever prescribed a medicine between two dates; the first date a patient was prescribed a medicine between two dates; and the last date a patient was prescribed a medicine between two dates. This variable is populated for 99% of records.
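The three outputs listed above can be illustrated with a small stand-alone sketch for a single patient. This is not how the OpenSAFELY-TPP study definition is implemented; the function name `summarise_prescriptions` and the toy dates are hypothetical, and each record is represented by the first day of its month on the assumption that only financial year and month are recorded.

```python
from datetime import date


def summarise_prescriptions(record_dates, start, end):
    """Return a binary flag plus first and last record dates within [start, end]."""
    in_window = sorted(d for d in record_dates if start <= d <= end)
    return {
        "binary_flag": bool(in_window),
        "first_date": in_window[0] if in_window else None,
        "last_date": in_window[-1] if in_window else None,
    }


# Toy month-level records for one patient (invented for illustration).
toy_dates = [date(2019, 9, 1), date(2019, 11, 1), date(2020, 2, 1), date(2020, 4, 1)]
result = summarise_prescriptions(toy_dates, date(2019, 10, 1), date(2020, 3, 31))
no_match = summarise_prescriptions(toy_dates, date(2018, 4, 1), date(2018, 12, 31))
```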
There are almost 21,000 unique values for the drug name variable in FY 2018/19 and almost 20,000 in FY 2019/20. The majority of these are not in the NHS-mandated dm+d format.
This variation in the recording of drug names in submissions means that codelists cannot be created using existing data definitions (e.g., dm+d or BNF) and bespoke codelists need to be created to try and pick up all possible variants of a drug name. These bespoke codelists are created by carrying out keyword searches on the list of unique values in the DrugName variable. The range of values in the DrugName variable is dealt with by building bespoke codelists rather than via the OpenSAFELY-TPP study definition.
Not only is there variation in how the same medicine is referenced by different providers, but we also found occurrences of misspelled drug names. For example, when constructing a bespoke code list for the medicine dupilumab, we included the misspelling dipilumab as that appeared in the DrugName variable values.
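One possible way to surface candidate misspellings such as this is approximate string matching on the individual words of each DrugName value. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; it is a swapped-in technique for illustration rather than the method used to build the published codelists, and the function name `near_miss_tokens`, the 0.8 similarity threshold and the toy values are assumptions. Any candidates would still need manual review before being added to a codelist.

```python
import re
from difflib import SequenceMatcher


def near_miss_tokens(drug_names, keyword, threshold=0.8):
    """Return words that closely resemble, but do not equal, the keyword."""
    keyword = keyword.lower()
    candidates = set()
    for name in drug_names:
        # Split each free-text value into alphabetic words, ignoring case.
        for token in re.findall(r"[a-z]+", name.lower()):
            if token == keyword:
                continue  # exact matches are already found by a keyword search
            if SequenceMatcher(None, token, keyword).ratio() >= threshold:
                candidates.add(token)
    return sorted(candidates)


# Toy DrugName values, invented for illustration.
toy_names = ["DUPILUMAB 300MG INJ", "Dipilumab 300mg pre-filled pen", "ADALIMUMAB 40MG"]
candidates = near_miss_tokens(toy_names, "dupilumab")
```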
Taking the medicine adalimumab as an example, based on a keyword search, there were around 460 different ways that adalimumab was described in the DrugName variable, including various brand names 8 . The keywords used were adalimumab, amgevita, hyrimoz, humira, idacio and imraldi. The search ignored whether letters were uppercase or lowercase. The twenty most common names that appear through this search are shown in Table 3.
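The case-insensitive keyword search described above can be sketched in a few lines. The keywords are those listed in the text; the function name `build_codelist` and the toy list of unique DrugName values are hypothetical, and the real search was run over the full list of unique values published with the analysis code.

```python
def build_codelist(unique_drug_names, keywords):
    """Return the DrugName values containing any keyword, ignoring case."""
    lowered = [k.lower() for k in keywords]
    matches = {
        name for name in unique_drug_names
        if any(k in name.lower() for k in lowered)
    }
    return sorted(matches)


ADALIMUMAB_KEYWORDS = ["adalimumab", "amgevita", "hyrimoz", "humira", "idacio", "imraldi"]

# Toy unique values, invented for illustration.
toy_unique_names = [
    "ADALIMUMAB 40MG",
    "Humira 40mg pen",
    "Imraldi (adalimumab) 40mg",
    "ETANERCEPT 50MG",
]
codelist = build_codelist(toy_unique_names, ADALIMUMAB_KEYWORDS)
```

Each matched string is kept verbatim, because the codelist has to contain the exact values that appear in the DrugName variable.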

Descriptive analysis
Age. A higher proportion of patients in the HCD dataset belonged to older age bands (50+) than the other patients registered at TPP practices (Figure 1, Table 4). For females, 67.8% of the patients in the HCD dataset were in older age bands compared to 39.2% of patients not in the HCD dataset. For males, 70.0% of the patients in the HCD dataset were in older age bands compared to 36.7% of patients not in the HCD dataset.

Ethnicity. A higher proportion of patients in the HCD dataset were in the White ethnicity group compared to the other patients registered at TPP practices (Figure 2, Table 5). For females, 67.5% of the patients in the HCD dataset were in the White ethnicity group compared to 63.7% of patients not in the HCD dataset. For males, 65.9% of the patients in the HCD dataset were in the White ethnicity group compared to 59.7% of patients not in the HCD dataset.
Geographical variation - STP. We looked at geographical variation by grouping patients by sustainability and transformation partnerships (STPs). The proportion of an STP population that appears in the HCD dataset ranged from around 0.5% to 1.5% (Table 6).

Figure 2. Distribution of patients across ethnicity groups, split by sex and whether patient appears in high-cost drugs (HCD) dataset.

Data validation

Summary
The national HCD dataset provides information on prescriptions and spend on HCD at patient level for the FY 2018/19 and 2019/20. In FY 2019/20 there were 4.0 million submissions for 1.3 million unique patient IDs, with an average number of submissions per patient over the year of 3.1. This data is now available within the OpenSAFELY-TPP framework, linked to other NHS records at patient level, alongside reusable code to undertake analyses related to COVID-19. We observed substantial variation in missing data between specific fields in the data (from 0% for gender to 77.7% for unit of measurement) and consequently have made three fields available: FinancialMonth, FinancialYear and DrugName. The first output using this data in OpenSAFELY-TPP has already been published: a research paper on the association between the use of immune modifying medicines to treat immune-mediated inflammatory diseases and severe COVID-19 outcomes 18. This report can support those undertaking further analysis on COVID-19 using the HCD in OpenSAFELY-TPP.

Strengths and weaknesses
The national HCD dataset includes information on all HCD, rather than being limited to a specific class of medicines or disease as some other data collections are, such as national disease registries. At the outset of the pandemic, there was no data available on medicines supplied by hospitals; the provision of HCD allows researchers and the NHS to capitalise on new information to inform analysis. We utilised an existing data collection, efficiently re-using information already collected from NHS providers, and did not add any further burden to hospitals. Access to the national HCD dataset via OpenSAFELY-TPP means that this dataset can be analysed alongside a range of other patient-level information, so that analysis of drugs prescribed and clinical outcomes is straightforward to run and does not involve any additional time to source data. However, there are several caveats that need to be considered when using this resource. The national HCD dataset in OpenSAFELY-TPP is comprehensive; however, due to the scale and speed at which it was assembled, it is possible that unknown inconsistencies or omissions may have occurred. The inputs to the DrugName and other variables are not standardised at a national level, which means there is a wide range of values (many thousands over a financial year). This is a feature of the data collection process. It means that the creation of bespoke codelists is required each time the HCD dataset is used for new analysis, and there can be misspellings of drug names. Alongside this non-standardised input, some variables have a high proportion of missing records. Finally, the dataset in OpenSAFELY-TPP is currently limited to a one-off collection covering submissions from FY 2018/19 and FY 2019/20.

Findings in context
To date there has been limited research conducted using patient-level HCD data in the UK. The national HCD dataset covers all HCDs, which means that, for the first time, researchers can produce analyses covering large numbers of patients, over one million unique patients in each year. As an example, the first analysis using the HCD dataset within OpenSAFELY-TPP was a nationwide cohort study in 17 million individuals, conducted to ascertain the risk of severe COVID-19 outcomes associated with immune-mediated inflammatory diseases and immune modifying therapies 18. We are unaware of any other use of comprehensive and routinely collected data on medicines supplied by hospitals to individual patients in England. There are several large clinical registry studies in England focused on specific diseases or medicines; although detailed and comprehensive, they are limited by underreporting, loss to follow-up and absence of information from elsewhere in the NHS 19-21. Combining detailed clinical registry data with the data available in OpenSAFELY may enhance the quality and robustness of analysis that can be achieved.

Policy implications
In March 2020, at the outset of the COVID-19 pandemic, there was no routine comprehensive hospital medicines data from the UK available to researchers and organisations. Since the onset of the pandemic, the NHS has improved access to information on the usage of medicines in hospitals: the NHS Business Services Authority now publish a monthly summary of the volume of medicines issued in hospitals 22, which is publicly available, and NHS Digital are developing the electronic prescribing and medicines administration (EPMA) data collection, with a subset of hospital data now available 23. The availability of the national HCD dataset through OpenSAFELY-TPP adds to this collection of knowledge, and researchers can access this resource, along with all other OpenSAFELY-TPP data sources, by following the OpenSAFELY access process 24.
However, the current dataset in OpenSAFELY-TPP is a one-off collection covering submissions from FY 2018/19 and FY 2019/20, and there is no process in place at the time of writing to routinely update the information available in the HCD dataset. Whilst this is very useful for assessing events and outcomes early in the COVID-19 pandemic, a routine update of the data is needed to assess current high-priority questions and future important questions. For example, a routine update to this data will allow assessment of COVID-19 vaccine effectiveness in people using high-cost medicines or indeed people with a recorded diagnosis likely to be treated with a HCD. Our work demonstrates that it is possible for the NHS to collate the data at a national level and we strongly recommend that a routinely updated version of the HCD dataset is produced and made available to all interested users, including via the OpenSAFELY-TPP platform.

Data availability
OpenSAFELY: https://opensafely.org/
The project contains the following underlying data:
· All data were linked, stored and analysed securely within the OpenSAFELY platform. Data include pseudonymised data such as coded diagnoses, medications and physiological parameters. No free text data are included. All code is shared openly for review and re-use under MIT open license. Detailed pseudonymised patient data is potentially re-identifiable and therefore not shared.
· For security and privacy reasons, OpenSAFELY is very different to other approaches for EHR data analysis. The platform does not give researchers unconstrained access to view large volumes of pseudonymised and disclosive patient data, either via download or via a remote desktop. Instead we have produced a series of open source tools that enable researchers to use flexible, pragmatic, but standardised approaches to process raw electronic health records data into "research ready" datasets, and to check that this has been done correctly, without needing to access the patient data directly. Using this data management framework we also generate bespoke dummy datasets. These dummy datasets are used by researchers to develop analysis code in the open, using GitHub. When their data management and data analysis scripts are capable of running to completion, and passing all tests in the OpenSAFELY framework, they are finally sent through to be executed against the real data inside the secure environment, using the OpenSAFELY jobs runner, inside a container using Docker, without the researcher needing access to that raw, potentially disclosive pseudonymised data themselves. The non-disclosive summary results output tables, logs, and graphs are then manually reviewed, as in other systems, before release.
· As part of building that resource for the community, over the next six months we are working with NHS England to cautiously on-board a small number of external pilot users to develop their analyses on OpenSAFELY. This process is described in further detail on our webpage, here: https://opensafely.org/onboardingnew-users/.

Open Peer Review
This data report outlines and describes the collation of a national open-source database for high-cost drugs in NHS England. Due to complex funding arrangements, much of this information was stored locally. The aim was to produce a comprehensive large-scale database on the use of high-cost medicines to better understand the impacts of these medicines on COVID-19 outcomes. It is an ambitious but worthwhile project achieved in an impressive timeframe, and it is well described. It clearly identifies a gap in researchers' access to detailed and reliable medicines information on high-cost drugs on a widespread scale. The authors also provide access to all code used for data management and analysis. Further efforts in this space are needed to enable researchers to conduct worthwhile studies on medicine use at a population level.
The dataset is comprehensive and covers the time from 2018 to 2020 (2 financial years) with over 1 million unique patients per year from 5.4 million submissions for payment for high-cost drugs.
There are a total of 21 variables but only 3 are currently available for analysis in the open access data analysis platform (OpenSAFELY). Descriptive analysis shows that older (70+ years), white individuals were overrepresented in the database. A strength of the database is that the high-cost drug records can be linked to other patient records in primary and secondary care in OpenSAFELY, allowing a range of linkage studies not previously possible.
Overall, the paper is well written and the processes for data collation are clear. There are a few things we noted: We would have liked a more detailed description of the range and mean costs of such drugs. What is the minimum cost of drugs to be subsidized outside of hospital contracts or tariffs? There is a brief description in the abstract but some further elaboration in the introduction would be beneficial. Further information about the type and class of drugs in the HCD list would be useful because such details would allow international researchers a better understanding and comparison with their own countries' databases.
The open access element is a real strength of the study, but we are curious as to why only three variables can be accessed in such a way and there is no cost variable so far. Are the authors proposing to add to these variables over time? The amount of missing data for total cost is low so we wondered why that was not included with the other three variables? Is this a limitation of OpenSAFELY? As health economists, including cost variables alongside clinical information, is of paramount importance.
As noted by the authors the drug name variable, which is mandatory for reporting and has only 1% of missing data, is not in a standardized format. The researchers have provided links to access codelists that have been created for medicines in the research to date. For medicines outside of this list, researchers will need to create and use their own bespoke code or codelists to access all the data they might need. This could be burdensome or act as a barrier to use for some researchers. Nevertheless, it is a comprehensive list and with the right search tools relevant information can be extracted. Suggestions by the research team, or open source code on the most useful way to use and extract this information would be valuable.
The dataset is just a one-off compilation and covers only 2 years' worth of data which can be useful in the early stages of the pandemic. Updates will be essential for tracking the longer-term outcomes linked to COVID-19 including long COVID and vaccine impacts. Codelists will need to be updated on an ongoing basis to ensure all high-cost drug data can continue to be accessed.
Overall, it is very promising that these substantial logistical and data compilation efforts have resulted in such a useful and easy to access resource. It will be interesting to see how these data are used and we look forward to seeing the outcomes of these future studies.

Is the rationale for creating the dataset(s) clearly described? Yes
Are the protocols appropriate and is the work technically sound? Yes

Are sufficient details of methods and materials provided to allow replication by others? Yes
Are the datasets clearly presented in a useable and accessible format? Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Dr Katie Page is a senior health economist and academic psychologist with expertise in the use and evaluation of administrative health data and economic evaluations. Dr Lutfun Hossain is a pharmacist and researcher with expertise in quality use of medicines.
We confirm that we have read this submission and believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
This is an incredibly useful piece of work, demonstrating the feasibility of collating such information but highlighting the problems when non-standard values are used. The example of drug names given, with around 460 ways of describing the drug adalimumab, shows how improvements in standardised data collection are needed, as well as open access collaboration on code lists already created. Due to the incompleteness of many of the variables, only a small proportion are of sufficient quality to be made available for research. This dataset is currently a one-off, made possible by the pressing need to understand the COVID-19 pandemic. A routinely updated dataset with more information in a useable format would be a valuable resource for understanding the fuller picture of healthcare within England.
The processes used are clearly described, with references to publicly available codes, analysis and other resources.
1 -what was the extra value?
On page 8 there are some sentences that aren't clear: "The OpenSAFELY research platform adheres, the obligations[…]" and "GP practices, from which the primary care data are obtained, are required, share relevant health information[…]".

5. It's not clear where the "supplementary material" referred to in the 'Software and reproducibility' section is.
6. The column headings in Table 5 refer to age, when the table shows ethnic group data.
7.

Is the rationale for creating the dataset(s) clearly described? Yes
Are the protocols appropriate and is the work technically sound? Yes

Are sufficient details of methods and materials provided to allow replication by others? Yes
Are the datasets clearly presented in a useable and accessible format? Yes
responses. In the UK, a national, patient-level data set of all medications used, including those prescribed in NHS hospitals, was not available for interrogation before the COVID pandemic.
Rowan et al.'s report describes the construction of a national 'high cost drugs' (HCD) data set from complete, patient-level UK NHS hospital data. The 'pseudonymised' patient data comprising the HCD data set were managed with the OpenSAFELY-TPP electronic health record platform (OSP), established across a large percentage (>40%) of GP practices in England. OSP analyses in primary care access each patient's complete, pseudonymised electronic primary care health record, allowing patient-level data linkages to be made to secondary databases and services.
The current exercise is designed to demonstrate to researchers the HCD data set, the availability of the 'extended' OSP resource and how this resource might be used. The case to build the extended OSP prevailed because of the COVID pandemic and the acute need for health system-wide data on the impacts of multiple, new, potentially critical, medication-related interventions and their consequences, often with very limited background information to guide decisions. To progress from no availability of a national data set for medicines used in NHS hospitals to an HCD data set since the start of the pandemic is a remarkable achievement, especially given the overwhelming complexity of supply and funding arrangements for medications in NHS hospitals. All steps and methods in building this resource are made available for researchers to evaluate via 'open-source' access. Considerable skills and resources are likely required.
The example discusses the characteristics of the HCD data set as built and now ready for limited use. There were 2.8 million 'submissions' (think purchases of a high cost drug) in NHS hospitals for 1.1 million unique patients for FY 2018/19, rising to 4.0 million for 1.3 million individuals in FY 2019/20, a substantial experience to interrogate. Why did the 'purchases' increase? The premise behind the HCD is that it likely reflects the significant, long-term, inflammatory disease group whose outcomes, experiences and connections to secondary services and primary care could be significantly impacted by COVID. In the UK, high cost drugs for these conditions are prescribed largely by specialist services in hospitals, not in the community. Three of the 21 variables in the HCD were made available for the study at this preliminary stage of development; the analysis shows that, in comparison to the remainder of the OSP cohort, the HCD patients were older and more likely to be White. The 21 variables describing the HCD cover the information needed to understand 'high cost drug' use and associated outcomes in individual patients across the NHS hospital and primary care system in the UK: for example, what conditions are being treated, what 'high cost drugs' have been chosen, what dose was selected, etc. Not a lot that is new or of special interest emerged in this example; however, that was not the point. The goal was to describe the resource, its features and how to access it and the HCD, and this has been achieved.
In summary, an important resource, produced with impressive speed, is now available for interrogation. It promises deeper understanding and insights into the impacts of medicine use, and of external events such as the COVID epidemic, on individual patients and the NHS. Results from these studies are eagerly anticipated.

Is the rationale for creating the dataset(s) clearly described? Yes
Are the protocols appropriate and is the work technically sound?