The timing of use of risk stratification tools affects their ability to predict mortality from sepsis. A meta-regression analysis. [version 1; peer review: awaiting peer review]

Background: Risk stratification tools (RSTs) are used in healthcare settings to identify patients at risk of sepsis and subsequent adverse outcomes. In practice, RSTs are used on admission and thereafter as ‘trigger’ tools prompting sepsis management. However, studies investigating their performance report scores at a single timepoint, which varies in relation to admission. The aim of this meta-analysis was to determine if the predictive performance of RSTs is altered by the timing of their use. Methods: We conducted a systematic review and meta-regression analysis of studies published from inception to 31 October 2018, using EMBASE and PubMed databases. Any cohort studies investigating the ability of an RST to predict mortality in adult sepsis patients admitted to hospital, from which a 2x2 table was available or could be constructed, were included. The diagnostic performance of RSTs in predicting mortality was the primary outcome. Sensitivity, specificity, positive predictive value, negative predictive value and area under the receiver-operating curve (AUROC) were the primary measures, enabling further meta-regression analysis. Results: 47 studies were included, comprising 430,427 patients. Bivariate meta-regression analysis found that tools using a first-recorded score were less sensitive than those using a worst-recorded score (REML regression coefficient 0.57, 95% CI 0.07-1.08). Using the worst-recorded score led to a large increase in sensitivity (summary sensitivity 0.76, 95% CI 0.67-0.83, for worst-recorded scores vs. 0.64, 95% CI 0.57-0.71, for first-recorded scores).
Wellcome Open Research 2021, 6:328. Last updated: 01 DEC 2021


Introduction
Background
Sepsis is a major global health issue, with 48.9 million cases per year worldwide resulting in 11 million deaths [1-3]. In 2016, the SEPSIS-3 task force identified the need for "earlier recognition and more timely management" [4]. Early detection, coupled with rapid management, improves patient outcomes [4-8] by reducing progression to severe sepsis, the need for intensive care unit (ICU) care, and mortality [9]. Early detection also enables identification of those in whom a good outcome is likely, allowing more conservative management strategies [10]. Early identification is therefore a priority, but this is challenging due to variation in clinical presentations of sepsis and the lack of a single, gold-standard diagnostic test [11,12]. A variety of risk stratification tools (RSTs) are used across multiple healthcare settings to identify patients at risk of sepsis and subsequent adverse outcomes. Using physiological and biochemical parameters, RSTs aim to predict patients at risk of adverse outcomes from sepsis, a deteriorating condition [13], or a specific infective diagnosis [14]. Examples include: the Early Warning Score (EWS) [15], with numerous variants; the Sequential Organ Failure Assessment score (SOFA) [14]; Quick-SOFA (qSOFA) [4]; the Systemic Inflammatory Response Syndrome criteria (SIRS) [16]; and organ-specific scores, such as CURB-65 [17].

Importance
Inconsistencies in the research evidencing RSTs and their use in clinical practice potentially lead to exaggeration of their predictive ability [18,19]. A major variation is in the timing of the use of RSTs. In practice, RSTs are often used on admission to hospital and in the emergency department (ED) to guide management, and thereafter as trigger tools to monitor patient deterioration. However, studies investigating RSTs vary in when the tool was applied in relation to admission. Indeed, a large number of studies use a worst-recorded score within 24-48 hours rather than reflecting the way RSTs are used in practice, which is likely to affect their reported performance.

Goals of this investigation
The primary objective of this meta-analytical study was to determine if the predictive performance of RSTs is altered by the timing of their use.

Methods
The protocol for this study was registered with PROSPERO (CRD42019146321).

Study selection and inclusion
A systematic search of EMBASE and PubMed databases from inception until October 2018 was undertaken. We aimed to include any prognostic tool identified by the search that was mainly based on physiological markers. RSTs to be included were not pre-defined but were instead guided by the search strategy findings. The search identified primary research which investigated any RST used to predict mortality in adult patients (>18 years old) admitted to hospital with sepsis, suspected sepsis or pneumonia. The search strategy combined terms for RSTs with terms for infection, sepsis and mortality.
Studies must have reported an RST score calculated on patients' hospital admission or during the initial period of observation following admission for suspected infection or sepsis. Mortality was a required outcome measure of included studies and was primarily defined as 28-day or 30-day mortality; however, other mortality measures (e.g. in-hospital mortality) were accepted. Mortality was selected because the use of RSTs has been driven by campaigns such as Surviving Sepsis [8], advocating for early identification of septic patients to reduce mortality as far as possible. Studies in a non-ICU setting were included as several RSTs are validated for use in patients at risk of deterioration or sepsis in this setting. Studies conducted exclusively in ICU, and those investigating only trauma, paediatric, obstetric and gynaecological patients, or those with alternative diagnoses to sepsis, were excluded. Language of articles was limited to English unless a translation was available. Any article that was not an observational cohort study was excluded.
Two authors (MF, DH) independently screened identified titles and abstracts using a validated web-based application, Rayyan [20], to produce a list of relevant articles for full-text review. Disagreements were resolved through consensus between further authors (FH, DA).

Data extraction and outcomes
The data extraction recorded: (i) RST used, (ii) timing of RST use, (iii) study design, (iv) sample size, (v) healthcare setting, (vi) geographical location, (vii) diagnosis, (viii) outcome measure for mortality, (ix) sensitivity, (x) specificity, (xi) positive predictive value, (xii) negative predictive value and (xiii) area under the receiver-operator curve (AUROC). If a study did not publish 2×2 data, these data were calculated using specificity, sensitivity, number of participants and mortality. Where studies published more than one set of 2×2 data for a single score, e.g. for multiple cut-offs, the best-performing set was taken. If identified studies investigated the use of more than one RST, all 2×2 datasets were collected and included in the analysis. Timing data were classified as 'first recorded in ED', 'worst in ED', 'within 24 hours', 'within 48 hours', or 'other'. We then used a pragmatic approach, categorising timing further into 'first recorded scores', 'worst recorded within a time point scores' (hereafter 'worst-recorded') and 'other' for analysis. Mortality measures were coded as '28- or 30-day', 'in-hospital' or 'other'. As there are many Early Warning Score (EWS) variants, all of which are physiological scoring tools with broadly similar scoring parameters, we pragmatically grouped these scores for analysis, allowing more powerful analysis. One author (MF) extracted data from eligible studies, and the extracted data were confirmed by a second author (DH).
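The back-calculation of 2×2 data from sensitivity, specificity, number of participants and mortality can be illustrated with a minimal Python sketch. It assumes that mortality is expressed as a proportion of the study population and that counts are rounded to whole patients; the names `TwoByTwo`, `reconstruct_2x2` and `accuracy_measures`, and the example values, are hypothetical and are not taken from the study's own code or data.

```python
from dataclasses import dataclass


@dataclass
class TwoByTwo:
    tp: int  # died, RST positive
    fn: int  # died, RST negative
    fp: int  # survived, RST positive
    tn: int  # survived, RST negative


def reconstruct_2x2(sensitivity, specificity, n_patients, mortality):
    """Rebuild 2x2 counts from summary statistics (assumed proportions)."""
    deaths = round(n_patients * mortality)
    survivors = n_patients - deaths
    tp = round(sensitivity * deaths)
    tn = round(specificity * survivors)
    return TwoByTwo(tp=tp, fn=deaths - tp, fp=survivors - tn, tn=tn)


def accuracy_measures(table):
    """Sensitivity, specificity, PPV and NPV from a 2x2 table."""
    return {
        "sensitivity": table.tp / (table.tp + table.fn),
        "specificity": table.tn / (table.tn + table.fp),
        "ppv": table.tp / (table.tp + table.fp),
        "npv": table.tn / (table.tn + table.fn),
    }


# Hypothetical study: 1,000 patients, 10% mortality.
example = reconstruct_2x2(0.8, 0.6, 1000, 0.1)
measures = accuracy_measures(example)
```

Because the counts are rounded, the reconstructed table reproduces the published sensitivity and specificity only approximately in small studies.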

Risk of bias assessment
This study did not aim to include formal risk of bias assessment, as the primary objective was to quantify whether timing impacted RST performance, which itself is a major bias. All studies that used later time points would be at significant risk of bias due to the later nature of timing used (measurements occurring after diagnosis and subsequent management), and we did not feel that formally evaluating this would be helpful.

Statistical analysis
The fundamental approach taken was a meta-regression of study performance, with timing of score as the variable of interest, for all RSTs, and then for individual RSTs. All statistical analysis was performed in R 3.6.0 and 4.0.0, using the package mada (v 0.5.1) [21].
Initial analysis generated hierarchical summary receiver-operator characteristic (HSROC) plots, allowing for comparison of the effects of all relevant covariates across studies. Using the HSROC plots, we were able to visually assess heterogeneity. We then conducted bivariate meta-regression analysis to assess the impact of the covariates on the result of the meta-analysis, namely the predictive ability of RSTs.
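To give a feel for what a regression coefficient for timing represents, the following Python sketch fits a deliberately simplified model. It is not the bivariate REML model fitted with mada: it considers sensitivity only, uses fixed-effect inverse-variance weights, and exploits the fact that, for a single binary covariate, the weighted least-squares slope equals the difference between the two weighted group means. The function names and all study counts are hypothetical.

```python
import math


def logit_sensitivity(tp, fn):
    """Logit of sensitivity and its approximate variance.

    A 0.5 continuity correction is added to both cells so that a zero
    cell does not produce an infinite logit.
    """
    tp_c, fn_c = tp + 0.5, fn + 0.5
    return math.log(tp_c / fn_c), 1.0 / tp_c + 1.0 / fn_c


def weighted_mean_logit(studies):
    """Inverse-variance weighted mean of logit sensitivity."""
    estimates = [logit_sensitivity(tp, fn) for tp, fn in studies]
    weights = [1.0 / variance for _, variance in estimates]
    total = sum(w * y for w, (y, _) in zip(weights, estimates))
    return total / sum(weights)


def timing_coefficient(first_recorded, worst_recorded):
    """Slope for a binary timing covariate (worst- vs first-recorded)."""
    return weighted_mean_logit(worst_recorded) - weighted_mean_logit(first_recorded)


# Hypothetical (true positive, false negative) counts, for illustration only.
first_recorded = [(32, 18), (60, 40), (13, 7)]
worst_recorded = [(38, 12), (77, 23), (16, 4)]
coefficient = timing_coefficient(first_recorded, worst_recorded)
```

A positive coefficient indicates higher sensitivity on the logit scale for worst-recorded scores. A full analysis would additionally model specificity jointly with sensitivity and allow for between-study variance.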
Thereafter, for any RST with adequate data for analysis, bootstrapped AUC and summary sensitivity and specificity were generated for studies investigating the effects of different timings of RST use, to enable clinical comparisons. Bootstrapping was performed 1,000 times.
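The idea of bootstrapping a summary estimate 1,000 times can be sketched in Python as follows. This is an assumption-laden illustration rather than the procedure used in the analysis: it resamples whole studies with replacement, uses a simple pooled sensitivity instead of the bivariate model, and reports a percentile interval. The names `pooled_sensitivity` and `bootstrap_ci`, the fixed seed, and the study counts are all hypothetical.

```python
import random


def pooled_sensitivity(studies):
    """Pooled sensitivity from (true positive, false negative) counts."""
    tp = sum(study[0] for study in studies)
    fn = sum(study[1] for study in studies)
    return tp / (tp + fn)


def bootstrap_ci(studies, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling studies with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    replicates = sorted(
        pooled_sensitivity(rng.choices(studies, k=len(studies)))
        for _ in range(n_boot)
    )
    lower = replicates[int((alpha / 2) * n_boot)]
    upper = replicates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper, replicates


# Hypothetical (true positive, false negative) counts, for illustration only.
studies = [(30, 20), (45, 15), (12, 8), (80, 20), (26, 14)]
point_estimate = pooled_sensitivity(studies)
lower, upper, replicates = bootstrap_ci(studies)
```

Resampling at the study level keeps each study's 2×2 data intact, so the spread of the replicates reflects between-study variation rather than within-study sampling error alone.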

Study characteristics
The study selection process is illustrated in Figure 1. Of the 12,853 studies identified through searches, 47 were deemed appropriate for final inclusion, including 430,427 patients. Identified studies included a total of 113 uses of RSTs, which represented 17 different tools. Table 1 outlines the study characteristics. The earliest study was from 2007, with the most recent published in 2018. Fourteen studies focussed on a single RST, with 33 assessing multiple tools. The distribution of the RSTs analysed is also demonstrated in Figure 1. In terms of timing, 23 studies calculated the RST score based on the first recorded score, five studies reported the score within 24 hours of admission, four studies reported the worst score whilst in ED, eight studies did not state when they completed an RST and seven studies used other timing scales.
Overall, 24 studies used the primarily defined outcome measure of 28- or 30-day mortality, with a further 22 using in-hospital mortality and one coded as 'other', using a definition of in-hospital mortality within 72 hours of admission. Average mortality of studies investigating the primary outcome of 28- or 30-day mortality was 8.57%. Extending this to all included studies with any mortality end point, mortality was 7.59%.

Study performance
There was heterogeneity in overall performance between scoring systems and across all studies, with sensitivities and specificities ranging from nearly 0% to 100%, as demonstrated in Figure 2.

Meta-regression
In a bivariate analysis using timing data alone, scores using first-recorded timing were less sensitive than worst-recorded scores (REML regression coefficient 0.57, 95% CI 0.07-1.08), with a trend towards increased specificity. This corresponds to a large increase in sensitivity with later timing, with a summary sensitivity of 0.64 (0.57-0.71) for first-recorded RSTs compared to 0.76 (0.67-0.83) for worst-recorded RSTs. Figure 3 demonstrates this graphically, comparing studies using first-recorded scores with those using worst-recorded scores.
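The relationship between the regression coefficient and the two summary sensitivities can be checked with a short Python sketch. It assumes that the coefficient is expressed on the logit scale, which is the usual scale for bivariate diagnostic accuracy models but is not stated explicitly in the text; the helper names are hypothetical.

```python
import math


def logit(p):
    """Log-odds of a proportion."""
    return math.log(p / (1.0 - p))


def expit(x):
    """Inverse of the logit transformation."""
    return 1.0 / (1.0 + math.exp(-x))


def shifted_sensitivity(baseline_sensitivity, coefficient):
    """Sensitivity implied by adding a logit-scale coefficient to a baseline."""
    return expit(logit(baseline_sensitivity) + coefficient)


first_recorded = 0.64  # summary sensitivity for first-recorded scores
coefficient = 0.57     # reported coefficient, assumed to be on the logit scale
implied_worst = shifted_sensitivity(first_recorded, coefficient)
print(round(implied_worst, 2))  # 0.76
```

Under this assumption, a shift of 0.57 from a baseline sensitivity of 0.64 gives approximately 0.76, which agrees with the summary sensitivity reported for worst-recorded scores.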
In the subsequent analysis including all relevant study-level covariates, only three study-level factors significantly impacted study performance: mortality, LMIC setting, and use of a worst-recorded timing approach (eTable 2 in the supplement). Increasing mortality in a study was associated with increased sensitivity and reduced specificity, and this was also true for studies reporting worst-recorded timings. Studies in LMIC settings also had higher specificity. No other factor (including scoring system type) was found to have a significant relationship with studies' predictive ability.

Individual scores
For the two scores with sufficient data individually to generate meaningful outputs (qSOFA and EWS), further analysis was performed. For both these scores, a similar relationship was found between timing and study performance, although neither met statistical significance for either sensitivity or specificity, due to low study numbers.

Summary of findings
This meta-analysis of observational cohort studies aimed to investigate whether the predictive performance of RSTs is altered by the timing of their use. Our search strategy identified 47 studies including over 430,000 patients, with qSOFA the most analysed RST. It demonstrates that the timing of performing an RST affects its predictive performance: when predicting infection or sepsis-related mortality, studies utilising any RST based on first-recorded patient observations and laboratory results have a lower sensitivity than those using RSTs based on worst-recorded values. From sub-group analysis, the effect of timing was particularly evident in studies investigating the various EWS derivatives. Alongside timing, only study mortality and LMIC setting significantly impacted study performance, and no significant differences were identified between RSTs.

Clinical context
Given the global burden of sepsis, ensuring rigorous validation of the research underpinning the tools used to allow the early identification and diagnosis of patients with suspected sepsis, as advocated by SEPSIS-3 [4], is essential [68]. Failure to do so may result in unnecessary sepsis-related morbidity and mortality. However, several criticisms have been made of RSTs' use in sepsis, concerning the research evidencing their use and their application in clinical practice [10,18,36,69]. These largely focus on a lack of either specificity or sensitivity in different settings and cohorts, and it remains clear that no one tool fully achieves a gold standard for diagnosis [12,36,68,70,71]. In particular, Sepsis-3 and the qSOFA score have been criticised for their largely retrospective evidence base [72], with variable predictive performance when investigated prospectively [73-75]. Other RSTs, such as EWS, were developed to monitor patients for signs of deterioration [76] and are increasingly used as risk-stratifying tools for deterioration of any cause in hospital inpatients, endorsed by the Royal College of Physicians [77] and NICE [14] despite concerns about their prognostic value [10].
Song et al. [12] note that there is lower predictive performance if tools such as qSOFA are completed at the initial suspicion for infection. Further, while Hamilton et al. [10] have investigated the role of EWS in predicting sepsis-related mortality, we are not aware of other meta-analyses clearly demonstrating the critical role that timing plays in RSTs' predictive role.
Given these challenges in the role of RSTs and in diagnosing sepsis accurately, Franchini et al. [68] advocate for more prospective studies and outcome measures. Our study supports this, recognising that the methods of investigating RSTs as indicators for management of sepsis must be prospective and consider predictive outcome measures, in contrast to the current body of evidence, which is largely retrospective. Should we continue to rely on tools whose evidence is based on timings other than those at which they are used in clinical practice, that is, the worst-recorded approach to researching RSTs, we risk overstating their predictive ability in identifying a condition which has potentially fatal consequences.

Limitations and strengths
There are some limitations to our work. There is wide heterogeneity between case definitions used in studies investigating RSTs' use, as well as in the mortality outcome measure reported across studies, a reflection of the likely clinical heterogeneity in our sample. This may impact the consistency of our meta-analysis, although it is not unusual to identify large amounts of heterogeneity in systematic analysis of studies investigating diagnostic test accuracy [78]. No clear, consistent source was identified that can singly explain the heterogeneity across studies. As a meta-analysis, our results may be affected by publication bias or selective reporting of datasets within included studies. However, the PRISMA statement on meta-analysis of diagnostic test accuracy studies [79] concluded that there is less risk of these biases than in reviews of primary interventional studies, and that there is no adequately powered statistical test for a comprehensive assessment of bias in studies such as ours. Risk of bias assessment is challenging in this setting, as a major bias in all studies is that physiological responses change with treatment. As such, all studies that used later time points are at significant risk of bias, which is the focus of this study. Having said that, the strengths of our study lie primarily in the use of a comprehensive search strategy across multiple databases, resulting in a large cohort of 430,427 patients being included in the final analysis. Using this large cohort and bivariate random effects analysis, we hope to have minimised biases often present in observational cohorts, allowing the construction of HSROC curves. The large number of identified studies also allowed sub-group analysis of qSOFA and EWS in further detail, delivering greater insight into their use as RSTs and the impact of timing.
Although there is concern regarding the lack of high-quality prospective studies investigating RSTs' use, the fact that this analysis includes a significant number of prospective studies suggests that these findings are likely generalisable to either mode of study design.

Implications for Clinical Practice and Research
Our findings indicate that clinicians must recognise that the performance of any RST is largely related to when it is used, not to the individual tool. Given the lower sensitivity of first-recorded RSTs and the impact that research driven by worst-recorded scores has had on policy design for potentially infected or septic patients, leading to the wide use of RSTs worldwide, there must be more prospective studies investigating their role and predictive value aligned with their use in practice: at the point of admission, or first suspicion of infection. In most settings, where infection and sepsis are priorities and the threat of antibiotic resistance looms large, failure to investigate the RSTs evidencing early management and antibiotic administration could lead to inaccurate identification of patients at risk of sepsis and inappropriate use of antimicrobials.

Conclusion
In summary, we must be rigorous in ensuring that the tools and scores used to predict sepsis-related mortality, and to enable management and treatment decisions, are used and evidenced appropriately. It remains challenging to determine how effective RSTs are in this role, as the timing of RSTs' use in the evidence base is varied [80], often reflecting a worst-recorded-within-a-time-point approach, unlike their use in clinical practice. This meta-analysis has shown that the timing of RSTs is paramount to their predictive performance. This has important implications for their use in practice and stresses the importance of prospective studies in the future.

Data availability
Underlying data
Underlying Data for "The timing of use of clinical screening tools affects their ability to predict sepsis mortality. A metaregression analysis." DOI: https://doi.org/10.5281/zenodo.5519552 [81]
This project contains the following underlying data:
• Author link.csv

Authors' contributions
DH, FH and DA conceived the idea. MF, DH and FH undertook abstract screening and data extraction. FH, DA and EC acted as content experts in the field of sepsis and RSTs. MF and DH drafted the manuscript and all authors contributed to its editing and revision. All authors interpreted data and approved the final version of the manuscript.