Hospital performance and payment: impact of integrating pay-for-performance on healthcare effectiveness in Lebanon

Background: In 2014 the Lebanese Ministry of Public Health integrated pay-for-performance into setting hospital reimbursement tiers, to provide hospitalization service coverage for the majority of the Lebanese population. This policy was intended to improve effectiveness by decreasing unnecessary hospitalizations, and improve fairness by including risk-adjustment in setting hospital performance scores.

Methods: We applied a systematic approach to assess the impact of the new policy on hospital performance. The main impact measure was a national casemix index, calculated across 2011-2016 using medical discharge and surgical procedure codes. A single-group interrupted time series analysis model with Newey ordinary least squares regression was estimated, including adjustment for seasonality, and stratified by case type. Code-level analysis was used to attribute and explain changes in casemix index due to specific diagnoses and procedures.

Results: Our final model included 1,353,025 cases across 146 hospitals with a post-intervention lag-time of two months and seasonality adjustment. Among medical cases the intervention resulted in a positive casemix index trend of 0.11% per month (coefficient 0.002, CI 0.001-0.003), and a level increase of 2.25% (coefficient 0.022, CI 0.005-0.039). Trend changes were attributed to decreased cases of diarrhea and gastroenteritis, abdominal and pelvic pain, essential hypertension and fever of unknown origin. A shift from medium to short-stay cases for specific diagnoses was also detected. Level changes were attributed to improved coding practices, particularly for breast cancer, leukemia and chemotherapy. No impact on surgical casemix index was found.

Conclusions: The 2014 policy resulted in increased healthcare effectiveness, by increasing the casemix index of hospitals contracted by the Ministry. This increase was mainly attributed to decreased unnecessary hospitalizations and was accompanied by improved medical discharge coding practices. Integration of pay-for-performance within a healthcare system may contribute to improving effectiveness. Effective hospital regulation can be achieved through systematic collection and analysis of routine data.


Introduction
The linkage of performance and payment has been increasingly used in healthcare during the past two decades. In recent years this has extended towards hospital performance designs, despite mixed evidence regarding its effects.
In 2014 the Lebanese Ministry of Public Health (MoPH) changed the basis by which it determined the payment reimbursement tiers of about 140 public and private hospitals, which it contracts for providing hospitalization service coverage for the majority of the Lebanese population. A pay-for-performance (P4P) framework was integrated within this system, the most prominent component of which was the hospital casemix index (CMI), which reflects the average risk or illness severity of all patients within a hospital (Khalife et al., 2017). This intervention was aimed at improving effectiveness by incentivizing hospitals to decrease unnecessary hospitalizations, as well as to improve fairness in determining hospital reimbursement (by including risk-adjustment), within an integrated evaluation framework.
Hospitalizations that are not compliant with any medical or social criteria may be considered as unnecessary hospitalizations. These arguably differ from potentially preventable hospitalizations, in that the latter may have an indication for admission but would have otherwise been avoided with appropriate outpatient or primary care. Unnecessary hospitalizations are common in various countries and challenge the functioning of healthcare systems (Caminiti et al., 2013; Macinko et al., 2010; Stranges & Friedman, 2006; To et al., 1996). Such cases may be primarily regarded as a question of appropriateness of care. Using the Kruk and Freedman performance framework, unnecessary hospitalizations may be categorized as a quality of care output measure, under the effectiveness dimension (Kruk & Freedman, 2008). Downstream association with outcomes on patient health status (effectiveness) and on maximizing value of resources (efficiency) would also be expected.
The current evidence on the impact of P4P in healthcare is weak, particularly in low- and middle-income countries (LMICs) (Witter et al., 2012). In this study we assess the impact of the integration of the P4P policy by the MoPH, and contribute to the evidence base on P4P effectiveness, using routine data and a specially developed CMI.
Pay for performance
A health reform that links measures to payment creates a financial incentive for service providers to improve their performance vis-à-vis these measures. Seen through the lens of principal-agent theory, such P4P seeks to address the recognized information problems within healthcare, particularly information asymmetry (Smith & Hanson, 2012). This alignment of interests provides the principal (i.e. the payer) a tool to incentivize the provider to improve healthcare outputs and outcomes (Grace et al., 2015). Such tools may function at a system level within a complex environment and should be adjustable by the principal (Roberts et al., 2004). Aspects of a P4P intervention that have a large role in determining impact include the actual measures used, context and incentive size.
Most performance frameworks have either been developed for use in high-income countries (HICs) or are heavily influenced by such contexts, and likely require adjustment for use in LMICs (Tashobya et al., 2014). Based on a review of commonly used performance indicators, Kruk and Freedman provide a framework for LMICs (Kruk & Freedman, 2008). Adapting Donabedian's system evaluation of structure, process and outcome measures, this framework categorized indicators as outputs/processes and outcomes/impact along with the dimensions of effectiveness, equity and efficiency (Donabedian, 1966; Donabedian, 1988).
In LMICs, healthcare P4P initiatives commonly include structural measures of quality. More widely, however, a transition is ongoing towards outcome measures, which are the ultimate target for performance improvement (Chee et al., 2016; Gergen et al., 2017). Establishing broad outcome measurements may enhance P4P impact and integration into routine systems may make such initiatives more cost-effective (Borghi et al., 2015; Chee et al., 2016).
The effects of healthcare P4P in HIC contexts have been mixed, largely confirming the scarcity of the evidence base (Eijkenaar et al., 2013; Emmert et al., 2012; Gillam et al., 2012; Van Herck et al., 2010). A recent review on P4P schemes in the United Kingdom found some positive findings, but cautioned that overall effects on care quality were unclear, underscoring the need for long-term monitoring and evaluation (Mandavia et al., 2017).
The potential for P4P impact may be larger in LMICs, considering the relatively lower resources of providers and more dynamic health reform context (Witter et al., 2012). However, the scarcity of the evidence from LMICs is particularly pronounced; the evidence base is too weak to make general conclusions (Witter et al., 2012). A recent review of P4P in maternal and child care in LMICs found positive impact on process quality, but weak evidence on health outcomes and out-of-pocket expenses (Das et al., 2016).

Amendments from Version 1
Based on the feedback received from the two reviewers, we have updated the manuscript with some additional information. We have clarified that the policy intervention was directed at the categorization of hospitals into the three reimbursement tiers, but not the reimbursement structure itself. We have clarified the weight-setting process for diagnosis and procedure codes. We have defined the systematic approach used in this investigation. We have acknowledged that inclusion of age and comorbidity variables would allow more accurate calculation of hospitalization complexity.

Hospital-based P4P initiatives in England and the United States targeting 30-day readmissions and mortality resulted in improved short-term performance that was not sustained in the long-term, further highlighting challenges due to contextual changes, spill-over, measurement limitations and overall design (Jha et al., 2012; Kristensen et al., 2014; Lindenauer et al., 2007; Sutton et al., 2012; Werner et al., 2011). Long-term investigations of previously favorable initiatives have found improvement in readmissions, but no improvement or worsening of mortality, as well as undesirable practices having misled some earlier findings (Gupta & Fonarow, 2018; Wadhera et al., 2018; Wasfy et al., 2017). Anticipatory or short-term behaviors in response to P4P scheme engagement and implementation may differ from long-term behavior, likely contributing to the findings of short-term but not long-term effectiveness of P4P (Ryan et al., 2015).
It has been proposed that the debate on P4P should move from distinct projects towards integration within the health system, with broad system objectives (Soucat et al., 2017). Such integration may be seen as an extension of strategic purchasing of health services, working towards achieving universal health coverage. An approach that considers the overall health system rather than more narrow objectives would avoid 'not seeing the forest for the trees' (Soucat et al., 2017).

Aim and objectives
We apply a systematic approach to assess whether the new pay-for-performance policy had an impact on the healthcare system's effectiveness. We analyze whether this policy affected the complexity of the average hospitalization case (i.e. CMI), stratifying by hospitalization case types, length of stay, and hospital ownership. We further quantify any changes, with plausible explanations, at the level of diagnoses and procedures.
The MoPH has engaged in different healthcare supply-side reforms. Contracting private hospitals to provide hospitalization services took place in the 1960s due to public coverage limitations and political reasons, preceding neoliberal influences that supported New Public Management agendas in other systems (Smith & Hanson, 2012). In the late 1990s public hospitals were granted semi-autonomous status, similar to contemporary experiences of public hospitals in HICs and LMICs.
Hospital accreditation was linked to payment in 2001 and incentivized hospitals towards quality improvement, with accreditation results used to categorize hospitals across three reimbursement tiers (Ammar et al., 2007). Hospital budgets set on an annual basis were not targeted by this change, as they continued to be historically and politically determined. Therefore, a moral hazard existed for hospitals interested in admitting more predictable low-risk patients to efficiently reach their pre-allocated budgets. Unnecessary hospitalizations became increasingly recognized as a major problem, particularly for medical (non-surgical) cases where lower barriers to treatment existed (Kronfol et al., 2014). These were often one or two-day stays with variable costs but generally less complex diagnoses or symptoms, such as nausea, vomiting, diarrhea and gastroenteritis (Kronfol et al., 2014).

Policy reform
To address unnecessary hospitalizations and increase fairness in hospital performance assessment, the MoPH created a new policy intervention in late 2014, linking reimbursement tier to a composite hospital total performance score (TPS) (Ammar et al., 2013). The first public announcement of the intervention was through an engagement event held in August 2014 for hospital executives and managers. The event highlighted the new model and its components, as well as the importance of accurate coding for appropriate casemix assessment. The TPS included components on CMI, patient satisfaction and other minor policy indicators, as well as accreditation (Khalife et al., 2017). Specifically, this policy had a regulatory aspect in the form of accreditation, and a more distinct payment aspect for the other components. The CMI reflects the average risk or illness severity of all patients within a hospital, and was effectively the greatest determinant of a hospital's TPS (Hornbrook, 1982). Therefore, a decrease in unnecessary hospitalizations would be expected to increase a hospital's CMI and consequently its TPS. Additional information regarding this process has been described elsewhere (Khalife et al., 2017).
The incentive size for tier classification remained unchanged from the pre- to post-intervention periods. Broadly, the difference between reimbursement tiers is an additional 10% per tier for surgical procedures, and about 15% per tier for medical cases; e.g. a bottom-tier 'T3' hospital charges 1,000,000 LBP (USD 670) for a surgical procedure, while a top-tier 'T1' hospital charges 1,200,000 LBP (USD 800) for the same procedure.
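As a worked illustration of the tier differentials, the sketch below applies an additive uplift per tier to a bottom-tier charge. The additive reading (rather than compounding) is inferred from the T3 and T1 figures in the example above; the function name `tier_charge` and the intermediate T2 value are illustrative assumptions, not MoPH tariff rules.

```python
def tier_charge(base_t3_lbp: int, tier: int, uplift_per_tier: float) -> int:
    """Charge for tier 1, 2 or 3 given the bottom-tier (T3) charge.

    Assumes an additive uplift for each tier step above T3, which
    reproduces the T3 -> T1 example in the text (1,000,000 -> 1,200,000 LBP).
    """
    if tier not in (1, 2, 3):
        raise ValueError("tier must be 1, 2 or 3")
    steps_above_t3 = 3 - tier
    return round(base_t3_lbp * (1 + uplift_per_tier * steps_above_t3))


# Surgical procedures: 10% per tier, starting from a 1,000,000 LBP T3 charge
surgical = {f"T{t}": tier_charge(1_000_000, t, 0.10) for t in (3, 2, 1)}
print(surgical)
```

Under this reading the incentive is a property of the tier a hospital is placed in, so a change in how hospitals are assigned to tiers alters revenue per case without altering the tariff schedule itself.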
Consequently, the policy intervention was directed at the mechanism whereby hospitals were categorized into the different reimbursement tiers. However, the intervention did not change the reimbursement structure of hospitalization cases. Reimbursement of medical cases remained fee-for-service, while surgical cases used a pre-defined flat-fee.

Ethical statement
Research protocol approval was granted by the Institutional Review Board (IRB) at the American University of Beirut (ID: FHS. FE.21). The requirement for patient consent was waived by the IRB.

Data sources
Hospitalization data including all cases under MoPH coverage from January 2011 to December 2016 was extracted by the MoPH Information Technologies Department and shared with the research team in a format with patient identifiers anonymized. The fields included record number (unique per admission), case identifier, hospital code, admission date, discharge date, length of stay, total charge, medical code on each admission and discharge, and surgical procedure code. STATA software package version 11 was used for all calculations and analyses.

CMI calculation
In developing a hospital CMI for its 146 contracted public and private hospitals, the MoPH had to overcome the lack of local Diagnosis Related Groups (DRGs), on which most casemix systems rely. The MoPH CMI calculation approach used average costs for weight-setting among medical cases based on International Classification of Diseases, 10th Revision (ICD-10) discharge code, and Current Procedural Terminology (CPT) procedure code among surgical cases (Ammar et al., 2013; World Health Organization, 2006; Yang & Reinke, 2006). Among medical cases the weight-setting process was separated across short-stay (<2 days), medium-stay (2-15 days) and long-stay cases (>15 days). This approach has been detailed elsewhere (Ammar et al., 2013; Khalife et al., 2017).
For greater comparability across case types, we standardized the weight-setting previously used by the MoPH: we used five-year cost averages rather than the MoPH two-year averages, and assigned the average weight among medical cases to low-volume medical conditions (fewer than 20 cases). Surgical CMI did not require average cost figures as surgical services have fixed flat-rate charges. However, 11 procedures with weights ten times above the standard reference (1 million LBP) were capped at a weight of 10.00 to limit excessive impact of outliers.
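The weight-setting rules described above can be sketched as follows. This is a minimal illustration rather than the MoPH or study algorithm: dividing average cost by the 1 million LBP reference for medical codes, and using a case-weighted mean as the fallback weight for low-volume codes, are assumptions of this sketch, and all function names are hypothetical.

```python
from collections import defaultdict

REFERENCE_COST_LBP = 1_000_000   # standard reference named in the text
LOW_VOLUME_THRESHOLD = 20        # codes with fewer cases get the average weight
WEIGHT_CAP = 10.0                # cap applied in the text to surgical outliers


def stay_category(length_of_stay_days: int) -> str:
    """Short (<2 days), medium (2-15 days) or long (>15 days) stay."""
    if length_of_stay_days < 2:
        return "short"
    if length_of_stay_days <= 15:
        return "medium"
    return "long"


def medical_weights(cases):
    """cases: iterable of (icd_code, total_charge_lbp) for one stay category.

    Returns {icd_code: weight}. Scaling by the reference cost and the
    case-weighted fallback are assumptions of this sketch.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for code, charge in cases:
        totals[code] += charge
        counts[code] += 1
    weights = {c: totals[c] / counts[c] / REFERENCE_COST_LBP
               for c in counts if counts[c] >= LOW_VOLUME_THRESHOLD}
    n_kept = sum(counts[c] for c in weights)
    if n_kept == 0:
        raise ValueError("no code reaches the low-volume threshold")
    fallback = sum(weights[c] * counts[c] for c in weights) / n_kept
    for code in counts:
        weights.setdefault(code, fallback)  # low-volume codes
    return weights


def cap_weight(weight: float) -> float:
    """Limit the influence of very costly procedures."""
    return min(weight, WEIGHT_CAP)


# Hypothetical charges: two common codes and one rare, costly code
demo = [("A09", 500_000)] * 30 + [("J18", 1_500_000)] * 20 + [("X99", 9_000_000)] * 3
demo_weights = medical_weights(demo)
print(demo_weights)
```

In this toy input the rare code inherits the average weight of the common codes instead of its own cost-based weight, which is the stabilizing effect the low-volume rule is meant to have.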
The weight for each ICD and procedure code was the same throughout the period investigated. We note that the MoPH updated procedure costs in March 2013, which remained in use until the 2018 update. This increased the base-rate of all procedures, to account for inflation. We used weights based on the March 2013 update, also retrospectively up to 2011. The MoPH undertook a hospital-bed update (base-rate), which is only one component of the bill charged by hospitals to the MoPH, but nevertheless represents an internal inflation adjustment.
For medical cases, we used the five-year code average without further adjustment.
We developed algorithms to calculate monthly CMI (rather than yearly) and developed 'mixed' cases algorithms (cases concurrently including medical and surgical care). We incorporated secondary procedures into surgical CMI, but this was not done for medical CMI as additional diagnoses or comorbidities are not currently utilized.
CMI was calculated using the formula below, which excludes the denominator correction component used in some versions, as this was calculated at system rather than hospital level (Lichtig, 1986). This is also the standard generic CMI formula used by the US Centers for Medicare and Medicaid Services (Services, 2011).
where $W_g$ is the weight calculated for each ICD or CPT, and $N_{gn}$ is the number of cases within each ICD in the total population.
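For reference, the generic casemix index described here is a case-weighted average of code weights. The LaTeX form below is a sketch of that standard expression using the symbols defined in the text; the exact typesetting and indexing of the authors' own formula are assumptions.

```latex
\[
\mathrm{CMI} \;=\; \frac{\sum_{g} W_g \, N_{gn}}{\sum_{g} N_{gn}}
\]
```

With this form, removing cases whose weight lies below the current CMI raises the index, which is why fewer low-complexity admissions would be expected to increase it.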
CMI was calculated for medical, surgical and mixed cases separately, and repeated using cases only at public hospitals and only at private hospitals separately. Among the medical cases, short-stay, medium-stay and long-stay casemix indices were also calculated separately at each stage and combined using a case-weighted approach to also obtain an all-stay medical CMI. Such a combination was not conducted for all cases combined as it would likely conceal meaningful findings.
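The calculation and the case-weighted combination across stay categories can be sketched in a few lines. The function names and all numbers below are hypothetical and chosen only to make the arithmetic easy to follow; they are not study results.

```python
def casemix_index(weights, counts):
    """CMI as the case-weighted mean of code weights.

    weights: {code: weight}, counts: {code: number of cases}.
    """
    total_cases = sum(counts.values())
    if total_cases == 0:
        raise ValueError("no cases")
    return sum(weights[g] * n for g, n in counts.items()) / total_cases


def combine_case_weighted(strata):
    """strata: {name: (cmi, n_cases)} -> combined CMI weighted by case counts."""
    total = sum(n for _, n in strata.values())
    if total == 0:
        raise ValueError("no cases")
    return sum(cmi * n for cmi, n in strata.values()) / total


# Hypothetical short-stay example with two codes
short_cmi = casemix_index({"A09": 0.3, "R10": 0.5}, {"A09": 150, "R10": 50})

# Hypothetical stay-level indices and case counts
all_stay = combine_case_weighted({
    "short": (0.35, 200),
    "medium": (1.10, 780),
    "long": (3.30, 20),
})
print(round(short_cmi, 3), round(all_stay, 3))
```

Because each stratum enters in proportion to its case count, the combined medical index is dominated by medium-stay cases, which form the bulk of medical admissions.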
We excluded all chemotherapy coded cases (ICD Z51.1 and Z51.2). Chemotherapy is generally a low-cost short-stay hospitalization, and the MoPH has since 2014 communicated to hospitals the need for correct coding specifically for cancer patients. As a result, the MoPH has documented a decrease in chemotherapy miscoding under general cancer codes (e.g. C50 code being corrected as Z51.1). Retaining chemotherapy codes would have had the effect of artificially decreasing CMI. Typically, chemotherapy cases form around 7.9% of medical cases or 4.4% of total cases.
CMI algorithms were first run on an annual level for developing descriptive statistics. Similar algorithms were used to develop a monthly CMI with discharge date used to categorize records into calendar months.

Impact on CMI level and trend, across public and private hospitals (Objective 1).
To detect a change in CMI attributable to our intervention, it is necessary to make use of a control, whether outside of the group (using randomization) or within it (using historical control), while accounting for potential confounders. Randomization was not possible in this situation due to legal regulations that necessitate the MoPH to use the same performance assessment and reimbursement framework for all acute-care hospitals. Considering the availability of multi-year data, we chose to use a single-group interrupted time series (ITS), with Newey ordinary least squares regression. ITS analysis is considered the most appropriate quasi-experimental design, given the research context and aims (Penfold & Zhang, 2013). With such an approach we seek to identify an 'interruption' of a continuous sequence of observations in a population (a time series) by a specific intervention. ITS analysis is particularly useful when randomization is not possible, and may have greater external validity than randomized designs when occurring in a real-world setting (Bernal et al., 2016). This approach uses a historical control group, and more broadly the Bradford Hill criteria for causality to offer plausible causal explanations (Habicht et al., 1999; World Health Organization, 2009; World Health Organization, 2012). Such an approach has also been noted as the most suitable for the evaluation of P4P interventions (World Health Organization, 2009).
ITS analysis has been increasingly used in population-level evaluations of health interventions, with advantages including the control for secular trends, ease of population-level analysis and ability to evaluate intended and unintended outcomes (Bernal et al., 2016; Lopez Bernal et al., 2018). It also allows us to evaluate both the short- and long-term impact of an intervention, by analysis of level and trend changes, respectively, and when appropriately used allows us to limit the effects of history bias in what is a complex real-world setting (Naci & Soumerai, 2016).
We applied ITS analysis on a dataset including all cases from January 2011 to December 2016, with the exception of the mixed cases dataset, which began in January 2012 (see Table 1). Statistical significance was set at p < 0.05. It is relevant to note that discharge codes were recorded in the hospitalization database as of January 2011 for most hospitals, with a few hospitals reaching full compliance within five months.
The intervention was given a two-month lag for expected effect (i.e. as of October 2014), based on the MoPH experience of the hospital response time (personal communication; Dr. Jihad Makouk, MoPH). We adjusted for seasonality using calendar months. Using 72 monthly data points, and CMI as the dependent variable, we ran ITS separately for each of medical all-stay, short-stay, medium-stay and long-stay cases; surgical cases and mixed cases. We subsequently stratified into public and private hospital cases.
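The structure of such a single-group ITS model can be sketched as a segmented regression with a pre-intervention trend, a level change, a trend change and calendar-month dummies. The sketch below is not the study's Stata code: it fits plain ordinary least squares in pure Python and omits the Newey-West standard errors (which, as we understand it, affect the standard errors and confidence intervals but not the point estimates), and the series is synthetic and noise-free, not study data.

```python
def its_design_row(t, start, with_season=True):
    """Regressors for month index t (0 = first month of the series)."""
    post = 1.0 if t >= start else 0.0
    # intercept, pre-intervention trend, level change, trend change
    row = [1.0, float(t), post, post * (t - start)]
    if with_season:
        # 11 calendar-month dummies; the first calendar month is the reference
        row += [1.0 if t % 12 == m else 0.0 for m in range(1, 12)]
    return row


def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


START = 45  # October 2014 when January 2011 is month 0
months = range(72)

# Synthetic series: declining trend, then a level jump and a trend change,
# plus a December seasonal bump. Values are invented for illustration.
y = [1.0 - 0.001 * t
     + (0.02 + 0.002 * (t - START) if t >= START else 0.0)
     + (0.01 if t % 12 == 11 else 0.0)
     for t in months]

X = [its_design_row(t, START) for t in months]
beta = ols(X, y)
pre_trend, level_change, trend_change = beta[1], beta[2], beta[3]
print(round(pre_trend, 4), round(level_change, 4), round(trend_change, 4))
```

Changing `START` over a small range is one way to mimic a lag sensitivity analysis, since each value shifts the month at which the level and trend terms switch on.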
As part of the sensitivity analysis we also varied lag period between intervention and impact by zero to four months, with the overall results being unchanged except in magnitude.
To ensure the adequacy of our ITS model, we assessed autocorrelation using the Cumby-Huizinga test; in most cases autocorrelation was not present, and where detected was otherwise accounted for by the lag period.

Changes in diagnoses and procedures (Objective 2).
We used a before-and-after approach to quantify the change in individual diagnoses and procedures in terms of CMI change (contribution to changes detected in Objective 1), and in terms of absolute and relative change to all hospitalizations. We used algorithms on a dataset including all medical and surgical cases, with a pre-intervention period combining 2013 and 2014 cases and a post-intervention period combining 2015 and 2016 cases. In contrast to the monthly series used for the first objective, this objective used annual cut-offs. This was to allow for the development of policy-relevant information and to allow CMI algorithm looping and comparison across around 10,000 ICD and CPT codes.
For each case type and stay, we selected the top twenty codes with the greatest CMI change from the pre- to post-intervention periods, for code-level analysis. We therefore restricted our analysis to examine the codes with the greatest impact on CMI in any direction (i.e. positive or negative).
In our approach, the codes with the greatest impact on overall CMI are identified as a function of code weight in relation to the overall CMI 'average', the code weight itself, and frequency, whereas the codes with the greatest change in their share of CMI are identified as a function of code weight and frequency only.
CMI change formulae: where $W_g$ is the weight calculated for each ICD or CPT, and $N_{gn}$ is the number of cases within each ICD in the total population.
CMI share change formula:
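One way a code-level attribution of this kind could be computed is sketched below. The definitions are assumptions of this sketch, not the authors' formulae: a code's contribution to the CMI change is taken as its weight relative to the pre-intervention CMI multiplied by the change in its share of cases, and its CMI share is taken as its weight times frequency divided by the sum over all codes. The weights and counts are hypothetical.

```python
def code_level_changes(weights, pre_counts, post_counts):
    """Return ({code: (cmi_contribution, share_change)}, total_cmi_change).

    cmi_contribution = (W_g - CMI_pre) * (p_post - p_pre), where p is the
    code's proportion of cases; these contributions sum to the CMI change.
    share_change is the change in W_g * N_g / sum(W * N).
    """
    n_pre, n_post = sum(pre_counts.values()), sum(post_counts.values())
    cmi_pre = sum(weights[g] * n for g, n in pre_counts.items()) / n_pre
    cmi_post = sum(weights[g] * n for g, n in post_counts.items()) / n_post
    result = {}
    for g in set(pre_counts) | set(post_counts):
        p_pre = pre_counts.get(g, 0) / n_pre
        p_post = post_counts.get(g, 0) / n_post
        contribution = (weights[g] - cmi_pre) * (p_post - p_pre)
        share_change = weights[g] * p_post / cmi_post - weights[g] * p_pre / cmi_pre
        result[g] = (contribution, share_change)
    return result, cmi_post - cmi_pre


# Hypothetical weights and case counts for two codes
w = {"A09": 0.5, "J44": 1.5}
changes, total_change = code_level_changes(
    w, {"A09": 60, "J44": 40}, {"A09": 40, "J44": 60})
top = sorted(changes, key=lambda g: abs(changes[g][0]), reverse=True)
print(top, round(total_change, 3))
```

In this toy case the drop in the low-weight code contributes positively to the index even though its own CMI share falls, which mirrors the distinction drawn in the text between impact on overall CMI and change in CMI share.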

Descriptive statistics
The total study population across 2011-2016 included 1,353,025 inpatient hospitalizations, or about 230,000 per year, which were composed of 55% medical cases, 43% surgical cases and 2% mixed cases (see Table 1). Medium-stay medical cases were about four times more frequent than short-stay cases (43% to 11%), with long-stay cases being a small minority (1%). Among surgical cases the ratio of cases to procedures was 1 to 1.1, and among mixed cases 1 to 1.5.
Approximately one in every three hospitalizations occurs in public hospitals. The proportion of hospitalizations at public hospitals increased from 31% to 37% between 2012 and 2016. Most of this change is due to increased medical cases (32% to 41%) and, to a lesser extent, increasing surgical cases in public hospitals (30% to 33%).
Impact on CMI level and trend, across public and private hospitals (Objective 1)

Pre-intervention
The pre-intervention CMI monthly coefficients were 0.975, 1.284 and 1.783 for each of medical, surgical and mixed cases, respectively. Among medical cases, CMI was lowest for short-stays (0.352) and highest for long-stay cases (3.326).

Post-intervention
Overall the intervention had variable impacts depending on case type and hospital ownership, resulting in either increased or unchanged CMI (see Table 2, Figure 1 and Figure 2).

a. Medical cases
A decreasing pre-intervention trend was reversed, resulting in an increasing trend across all hospitals. Large level changes in CMI were also found at two months post-intervention, among public and private hospitals. No level changes were found among medium-stay cases. Overall, the main impact was on short and medium-stay cases, specifically a level change in the former and a trend change in the latter.

b. Surgical cases
An increasing pre-intervention trend continued unchanged in the post-intervention period across all hospitals. However, among public hospitals the absence of a pre-intervention trend was replaced with an increasing CMI trend.

c. Mixed cases
The only change found among mixed cases was an increasing trend in the medical component of private hospital cases. This had minimal impact on the overall CMI, as mixed cases comprised a very modest proportion of all hospitalizations.
We observed seasonality in medical CMI, and in particular medium-stay cases, with a March minor peak, a July-August trough, and a November-December major peak. Surgical casemix seasonality was limited to private hospitals only, with an April-June major peak, and a November-December minor peak. Mixed casemix seasonality had March and October peaks.

Changes in diagnoses and procedures (Objective 2)
The number of medical case discharge ICD codes used across 2013-2016 was 3,164 for short-stay, 5,828 for medium-stay and 1,566 for long-stay cases, and 3,911 CPT codes for surgical cases. The 20 selected codes in each case type accounted for a majority of the change in CMI (see Table 3).

CMI change
Codes with the greatest changes from the pre- to post-intervention periods are displayed in Table 4 (for detailed changes see Table 5). These included 2,970 fewer cases of abdominal and pelvic pain, 698 fewer cases of intestinal infectious diseases, 1,001 fewer cases of fever of unknown origin, and 783 fewer cases of essential hypertension; altogether this represents about 4.3% of all annual medical cases. The greatest change on medium-stay CMI was due to a decrease of 2,237 cases of diarrhea and gastroenteritis. Two forms of shifting were also noted among several codes: from medium to short-stay cases (e.g. A09, I10, I20-I25.9), and from three-digit to four-digit ICD-10 codes (e.g. J18, J44, P22). The greatest change on short-stay CMI was due to the improved coding of chemotherapy cases under the relevant code (Z51.1) rather than under neoplasms (C00-D49).

CMI share change
The range of change of CMI share per code from the pre- to post-intervention periods is shown in Table 3. The conditions with the greatest change in their share of CMI included: percutaneous transluminal coronary angioplasty (PTCA) and vaginal delivery (surgical); malignant neoplasm of breast and acute lymphoblastic leukemia (short-stay); bronchopneumonia, diarrhea and gastroenteritis (medium-stay) (see Table 5).

Discussion
This study investigated the impact of integrating a P4P policy on the complexity of hospitalizations at hospitals contracted by the Lebanese Ministry of Public Health. We found that the intervention resulted in increased case complexity, specifically among medical cases. We used a systematic approach, first examining all hospitalizations by case type, length of stay and hospital ownership for changes, and subsequently identifying and quantifying which diagnoses and procedures explain this change.
Case complexity was greatest for mixed cases, followed by surgical and medical cases. This was unsurprising considering that mixed cases are typically more complex and include ICU cases. Furthermore, a large proportion of medical cases are for short-term evaluation, basic treatment, or chemotherapy. Similar reasoning explains medical casemix increasing with longer stays.

Medical casemix: decreased unnecessary hospitalizations and improved coding
During the pre-intervention period, medical casemix had a decreasing trend, possibly as a result of increasing unnecessary hospitalizations, for which an incentive exists. The intervention resulted in a reversal to an increasing casemix trend, as well as a short-term level change, at both public and private hospitals. The trend and level changes are attributable to medium and short-stay cases, respectively.
The decrease in diarrhea and gastroenteritis cases had a greater effect on medium-stay casemix than any other condition. Diarrhea and gastroenteritis cases are likely to be responsible for more unnecessary hospitalizations than any other diagnosis (Kronfol et al., 2014; To et al., 1996). Similar decreases were also found in abdominal and pelvic pain, essential hypertension cases and fever of unknown origin, all being major sources of unnecessary hospitalizations (Kronfol et al., 2014). It is relevant to highlight that some of these conditions are more precisely symptoms (e.g. pain), which are not generally appropriate as discharge diagnoses and are likely to be unnecessary hospitalizations. We find that the intervention decreased unnecessary hospitalizations by changing admission practices for these four conditions. The change in absolute cases for these conditions was at least one order of magnitude greater than any such changes since 2011 and is unlikely to be due to decreased disease burden.
We also found large shifts from medium to short-stay cases among each of diarrhea and gastroenteritis, abdominal and pelvic pain and essential hypertension cases. This suggests that hospitals not only decreased unnecessary hospitalizations, but also decreased unnecessary hospital stays. Although changes in length of stay are often difficult to associate with quality of care, among such conditions it is likely a result of improved hospital practices during hospitalization, in addition to the aforementioned improved pre-hospitalization practices. Such changes were seen among both public and private hospitals.
The large increase in chronic obstructive pulmonary disease (COPD) cases may be influenced by both practice changes and increasing burden. A steady increase in global COPD has been noted, with smoking prevalence and air pollution being the major risk factors (Lopez et al., 2006). These two factors are also prevalent in Lebanon, and are likely to have contributed to the increased COPD burden (IHME, 2018). It is also possible that improved hospital admission practices influenced this increase. The increases in pneumonia and acute bronchitis cases are likely due to increased disease burden, with yearly and seasonal variations typical of infectious disease. We note the possibility of acute exacerbation of COPD being misdiagnosed as acute bronchitis, as well as acute bronchitis also having smoking as a risk factor.
Given that this intervention had no direct impact on the separately pre-determined hospital budgets, it may be argued that the decreases in unnecessary hospitalizations provided hospitals with greater resources to admit more necessary cases, such as COPD, pneumonia and acute bronchitis. Confirmation of such an impact would, however, require further investigation.
The decrease in ischemic heart disease cases, including a length of stay shift, may be associated with the concurrent increase in percutaneous transluminal coronary angioplasty (PTCA), which is indicated for certain ischemic heart disease cases. PTCA over-utilization has been documented in other healthcare systems and has had decreasing use in recent years (Mariotto et al., 1999;Weiss & Elixhauser, 2014). It is not possible to evaluate medically indicated PTCA cases in this study, and further investigation would be required to ascertain the level of benefit or abuse.
The code-level findings indicate improved coding practices, which have been documented elsewhere to occur alongside real case changes following the introduction of CMI (Ginsburg & Carter, 1986; Goldfarb & Coffey, 1992). The improved coding for chemotherapy, particularly in breast cancer and leukemia, explained most of the short-stay casemix increase. Among medium and short-stays we find a shift to full coding digits and more specific codes, specifically among neoplasms, pneumonia, COPD, respiratory distress of newborn, and diarrhea and gastroenteritis.
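The shift toward full coding digits can be tracked with a simple indicator. The following is a minimal sketch, assuming ICD-10 codes are stored as strings in which a four-character code carries a subcode after a decimal point (e.g. "J44.1") and a three-character code does not (e.g. "J44"). The function name and the sample codes are hypothetical and are not taken from the study data.

```python
def share_four_character(codes):
    """Return the fraction of codes that carry a subcode after the decimal point."""
    if not codes:
        raise ValueError("codes must not be empty")
    specific = sum(1 for c in codes if "." in c and c.split(".", 1)[1] != "")
    return specific / len(codes)


# Hypothetical discharge codes for two periods (illustrative only).
before = ["J44", "J44", "P22", "J44.1"]
after = ["J44.1", "J44.1", "P22.0", "J44"]

share_before = share_four_character(before)
share_after = share_four_character(after)
```

A rising share between periods would be consistent with more specific coding, although it does not by itself distinguish improved coding from a change in the underlying cases.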

Surgical casemix
Changes in surgical casemix were limited to public hospitals only, with an increasing trend emerging in the post-intervention period. This was explained by increased cesarean deliveries in public hospitals, which compensated for a concurrent decrease in private hospitals. The increase was identified as a trend preceding the intervention by several months, and was therefore unlikely to have been caused by the intervention itself.
Private hospitals continued their increasing casemix trend without any change following the intervention. This trend was explained by two factors: increasing PTCA cases throughout 2011-2016 and decreasing vaginal deliveries since early 2014. The former may also be an example of private hospitals being faster adopters of new technologies than public hospitals or having greater potential for over-utilization. We had expected to find new (and costlier) procedures such as PTCA to explain a large proportion of an increasing surgical casemix at all hospitals. However, with the exception of PTCA, adoption of new procedures had a negligible effect on surgical casemix.
The large decrease in vaginal deliveries at private hospitals coincides with the sharp increase in refugees from Syria entering Lebanon in early 2014. This change was not compensated for by public hospitals, whose vaginal deliveries remained generally unchanged. Deliveries had been documented in 2013 as the leading cause for hospitalization among Syrian refugees in Lebanon (Huster et al., 2014). This suggests that, under MoPH coverage, Lebanese women having vaginal deliveries had decreased access to private hospitals, or otherwise opted to deliver outside of this coverage (out-of-pocket payment). The hospitalization access of refugee and citizen communities requires further investigation, including the role of private hospitals.

Improvement potential, hospital ownership, and seasonality
Hospitals had a greater improvement space in medical cases than in surgical cases. Prior to the intervention, medical ICD10 coding had no link to reimbursement (fee-for-service), unlike surgical codes (flat-fee). The intervention incentivized improved medical coding (for accurate CMI assessment), but the same was not relevant for surgical coding.
Private hospitals had a greater overall casemix than public hospitals, suggesting that more complex cases were admitted to private hospitals, as had been observed in previous MoPH investigations (Ammar et al., 2013). However, private and public hospitals had similar short-stay casemix, while public hospitals had a higher casemix for mixed cases. Two factors at private hospitals that are relevant to note are their possibly greater capacity for accurate coding and for healthcare technology. The former has been found elsewhere to explain some of the casemix gap between public and private hospitals, while the latter is associated with increased casemix (Mendez et al., 2014; Park et al., 2017).
We also note the increasing proportion of hospitalizations at public hospitals throughout 2011-2016, which continues the trend observed by the MoPH for preceding years. Since the proportion of contracted public and private hospitals had remained relatively unchanged, this is attributable to the MoPH's broad policy of increasing the service delivery carried out at public rather than private hospitals.
The seasonality in the casemix within different case types is likely explained by a combination of disease burden variation throughout the year, and pre-planned hospitalizations that avoid vacation periods. The November-December peak for medical medium-stay may be related to influenza seasonality in Lebanon (World Health Organization, 2019). Further investigations would be required to associate casemix seasonality with specific diseases or conditions. It is noteworthy that surgical casemix seasonality was found only in private hospitals, peaking during the three months preceding the summer period, which suggests that pre-planned hospitalizations are more common in private than in public hospitals.

P4P design
The importance of detailing P4P designs has been highlighted, particularly considering the heterogeneity of such interventions (Chee et al., 2016). A notable design feature of the MoPH P4P is the linkage between performance and reimbursement tier, rather than a bonus/penalty. To our knowledge this has not been undertaken elsewhere. In Lebanon, such a feature was a result of the political and financial non-feasibility of having other financial mechanisms to reward hospitals for improved performance. However, this has the benefit of being more sustainable in avoiding bonus financing and maintaining a system aspect (tiers) familiar to hospitals. In effect, the design integrated P4P into the system of determining hospital reimbursement tier (Soucat et al., 2017).
Another feature is the inclusion of CMI directly within the performance scoring, rather than as a risk adjustor for other outcomes (e.g. readmissions). This focus on casemix was intended to address unnecessary hospitalization, as well as the absence of risk-adjustment in hospital assessment. It also formed an objective that may be influenced by a wide range of hospitalizations, rather than a narrow set.
The type and magnitude of incentives are also important factors in determining P4P impact. Having casemix incentivized outside of a prospective payment system or similar approach likely limited the potential of providers to engage in up-coding or otherwise game the system, as has been observed in other casemix evaluations (Radu et al., 2010; Sukul et al., 2019). Such behavior is further limited as providers do not have access to the weights used in casemix evaluation.

Strengths and limitations
The use of ITS analysis with a large number of pre and post-intervention data points allowed us to account for background effects in what is a complex environment. Such effects may include other interventions or events that may have impacted hospitalization practice or burden of disease.
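The single-group ITS model can be illustrated with a segmented regression that separates the baseline trend, the level change and the trend change. The following is a minimal sketch under stated assumptions, not the study's implementation: it uses plain ordinary least squares on a noise-free hypothetical series, and omits the autocorrelation-robust (Newey) standard errors, the seasonality adjustment and the two-month lag applied in the study's model. The function names `solve_linear` and `fit_its` and all sample values are illustrative.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_its(y, start):
    """OLS fit of y_t = b0 + b1*t + b2*post_t + b3*(t - start)*post_t.

    `start` is the index of the first post-intervention month.
    Returns [b0, b1 (baseline trend), b2 (level change), b3 (trend change)].
    """
    rows = []
    for t in range(len(y)):
        post = 1.0 if t >= start else 0.0
        rows.append([1.0, float(t), post, (t - start) * post])
    k = 4
    # Normal equations: (X'X) b = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yt for r, yt in zip(rows, y)) for i in range(k)]
    return solve_linear(xtx, xty)


# Hypothetical monthly CMI: flat at 1.00 for 12 months, then a level
# jump of 0.02 and a slope of 0.002 per month from month 12 onward.
cmi = [1.0] * 12 + [1.02 + 0.002 * m for m in range(12)]
b0, b1, b2, b3 = fit_its(cmi, start=12)
```

Because the synthetic series was constructed from a level change of 0.02 and a trend change of 0.002 per month, the fit recovers those values; real data would additionally require the robust standard errors and seasonal terms noted above.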
No relevant system interventions were carried out by the MoPH throughout 2011-2016, besides the new intervention evaluated in this work. Other quality-focused activities undertaken by hospitals individually have not been assessed, though these are less likely to have system-level impact.
We relied on the data that are collected and input at hospital level, and subsequently used to calculate CMI. As such, the validity of the casemix results relies on the quality of data input, specifically ICD10 and CPT codes. The intervention of August 2014 included advice to hospitals regarding improved coding accuracy, whose short-term results have been detected with CMI level changes. However, we do not account for coding quality initiatives that may have occurred within hospitals.
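The dependence of CMI on code-level input can be made concrete with a small example. This is a minimal sketch, assuming that a code's weight is its mean cost divided by the overall mean cost per case, and that CMI is the mean weight across cases; the study's actual weighting procedure may differ. The function names, codes and costs are hypothetical.

```python
from collections import defaultdict


def code_weights(cases):
    """cases: list of (code, cost). Weight = mean cost of code / overall mean cost."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for code, cost in cases:
        totals[code] += cost
        counts[code] += 1
    overall = sum(totals.values()) / sum(counts.values())
    return {code: (totals[code] / counts[code]) / overall for code in totals}


def casemix_index(codes, weights):
    """Mean weight of the given discharge codes."""
    return sum(weights[c] for c in codes) / len(codes)


# Hypothetical reference costs used to derive the weights.
reference = [("A09", 100.0), ("A09", 100.0), ("J44.1", 300.0), ("J44.1", 300.0)]
weights = code_weights(reference)

# Hypothetical monthly discharges: fewer low-weight cases in month_b.
month_a = ["A09", "A09", "A09", "J44.1"]
month_b = ["A09", "J44.1", "J44.1", "J44.1"]
cmi_a = casemix_index(month_a, weights)
cmi_b = casemix_index(month_b, weights)
```

In this example, replacing low-weight admissions with higher-weight ones raises the index from 0.75 to 1.25 without any change in the weights themselves, which is also why miscoded discharges would distort the result.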
We recognize that the inclusion of age and comorbidities in the calculation of casemix index would allow a more accurate measure of hospitalization case complexity. Within the Lebanese healthcare setting there currently is insufficient information for selecting diagnoses and procedures for adjustment by age. The absence of comorbidity is due to the lack of routine recording of this variable across most hospitals, and policymaker concerns regarding potential misuse. Age and comorbidities remain important areas for future development of casemix calculation and pay-for-performance, but their absence does not negate the findings of the current investigation.

Conclusions
This research suggests that the integration of a hospital performance-payment policy in 2014 increased the effectiveness of the healthcare system, primarily due to decreased unnecessary hospitalizations, as well as a decreased length of stay of such cases. An improvement in quality of care may subsequently benefit patient health and resource use (efficiency). The new policy also led to improved discharge coding quality. Although unrelated to effectiveness, the latter finding increases the validity of evidence and policymaking that makes use of such information, including but not limited to the P4P design. Changes in unnecessary hospitalizations took place at a gradual pace compared to the more immediate coding practice changes. We also confirm that CMI can be an appropriate tool to detect changes in hospitalizations or performance improvement.
By using a systems perspective, we were able to investigate the impact across different hospitalization case types, length of stay and hospital ownership, and went further to quantify and attribute changes to specific diagnoses and procedures. We also identified issues relevant for further investigation and policymaking (e.g. vaginal deliveries, COPD, ischemic heart disease treatment).
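One way to attribute a change in CMI to individual codes is to weight each code's change in case share by how far its weight lies from the baseline CMI. This is a sketch of one possible decomposition, offered as an illustration rather than the code-level method used in this study. The function names, weights and discharge codes are hypothetical.

```python
from collections import Counter


def shares(codes):
    """Share of cases per code."""
    n = len(codes)
    return {c: k / n for c, k in Counter(codes).items()}


def attribute_cmi_change(before, after, weights):
    """Per-code contribution to the CMI change between two periods.

    contribution_c = (weight_c - CMI_before) * (share_after_c - share_before_c)
    The contributions sum to CMI_after - CMI_before because shares sum to 1.
    """
    sb, sa = shares(before), shares(after)
    cmi_before = sum(weights[c] * s for c, s in sb.items())
    return {
        c: (weights[c] - cmi_before) * (sa.get(c, 0.0) - sb.get(c, 0.0))
        for c in set(sb) | set(sa)
    }


# Hypothetical weights and discharges (illustrative only).
weights = {"A09": 0.5, "J44.1": 1.5}
before = ["A09", "A09", "A09", "J44.1"]
after = ["A09", "J44.1", "J44.1", "J44.1"]
contrib = attribute_cmi_change(before, after, weights)
total_change = sum(contrib.values())
```

Under this decomposition a decline in low-weight cases appears as a positive contribution, which matches the interpretation that fewer unnecessary hospitalizations raise the index.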
A similar analytical approach using interrupted time series may be used in the evaluation of other interventions on the hospitalization system. The algorithms developed for this research may also be adapted to investigate specific issues, as well as by a regulator or payer to actively monitor hospitalization trends across code and hospital attributes. This would support keeping providers accountable and increase the responsiveness capacity to address non-desirable or harmful hospitalization practices, such as unnecessary hospitalizations.
Despite the generally unfavorable evidence regarding the effectiveness of P4P in hospital settings, our findings suggest that certain P4P designs may be effective in specific contexts, such as that of the MoPH and Lebanese hospitals. This may be in part due to its integration within the system, rather than as a stand-alone external intervention, in addition to the existing space for improvement.
Future investigations would be required to evaluate longer-term impact, which would inform whether the impact was a one-time result of introducing a new policy, or if it retains sustainable benefits. This would necessarily require continued commitment and capacity-building by the MoPH towards this process.
Our findings suggest that effective hospital regulation can be achieved through the systematic collection and analysis of readily available routine data. Our analytical approach to such data reveals relevant patterns of change to performance measures. LMICs that lack casemix adjustment and incentives for improving hospital performance may choose to adopt similar approaches and monitoring systems to measure and improve hospital performance over time.

Underlying data
The source data are owned by the Lebanese Ministry of Public Health and, in line with the IRB approval granted for this study, the authors are not permitted to share the source patient-level data. In compliance with the MoPH's obligation on data privacy, the underlying data are accessible in a de-identified form upon request to the Ministry of Public Health (directorgeneral@moph.gov.lb), including a justification for the request.

Kristiina Kahur
Nordic Casemix Centre, Helsinki, Finland
Thank you for this very interesting research which is somehow unique on that field. Below are some comments and reflections which raised while reading the article:
○ In conclusions it was said that "This increase was mainly attributed to decreased unnecessary hospitalizations and was accompanied by improved medical discharge coding practices". I am wondering how was the improvement of coding assessed and is improved coding necessarily referring to increase of effectiveness?
○ It was said in the article that the reimbursement was linked to a composite hospital total performance score (TPS) which means that the higher the CMI the higher the TPS. CMI is explicitly related to coding quality. It is known from different researches that coding activity is often incentivized by payments, thus this can incentivize the hospitals to increase the coding activity and perhaps up-coding in some cases which probably happened also in Lebanon. How much this could have been a case?
○ Related to previous comment - in the article it was mentioned that "providers do not have access to the weights used in casemix evaluation". This might limit the gaming and upcoding indeed. But on other hand it also makes the system non-transparent and weights used for calculation the CMI will be as a black box within the system which might have negative impact on improvement of coding quality.
○ Given that there is no DRG system in place in Lebanon it somehow limited the development of CMI by using DRG cost-weights which would have been a conventional way when DRG system would have been used. The chosen approach was compensating the missing DRG cost-weights. However, it did not use the information/variables which would potentially have had impact on weights and CMI calculation. E.g. variable like age was not available for this study (as indicated in Data sources chapter) which would have had impact on weights in some cases. Also information about co-morbidities and complications was not taken into account which might be cost driver and thus, may have had impact on weights and CMI. It was also interesting that the medical cases where divided based on LOS whereas the surgical ones were not. In conventional cost-weight calculation also the cost/LOS outliers are taken into account which was not systematically the case in this study. Having DRG system in place, the calculations of CMI would be perhaps more consistent and internationally comparable by taking into account all possible variables a casemix system is using and which were not taken into account in this study.
○ Given that five-year cost averages was used for calculation of CMI, how (if at all) was the inflation and other factors influencing the cost throughout the years taken into account? In other words, was the weight of one particular ICD or procedure code the same throughout the five-year period?
○ It was mentioned in the article that the weight was calculated for each ICD (and procedure code). In table 4 it remains unclear if the separate weights were calculated e.g. for diagnosis code J44.1 and J44. Or was the separate weight calculated for codes J20-J20.9 and J20. However, table 5 which refers to diagnosis codes with the greatest change in terms of CMI, shows different values e.g. for code P22 and P22.0. This is somehow confusing because in general, subcodes of three-character diagnosis code should be clinically and economically meaningful and should have one weight, i.e. codes P22 and P22.0 should be similar in terms of clinical severity and cost. In general (and this really depends on national coding guidelines), if a three-character diagnosis code has subcodes, then the subcode as more specific one should be used instead of three-character code.

Is the work clearly and accurately presented and does it cite the current literature? Yes
Is the study design appropriate and is the work technically sound?
It was said in the article that the reimbursement was linked to a composite hospital total performance score (TPS) which means that the higher the CMI the higher the TPS. CMI is explicitly related to coding quality. It is known from different researches that coding activity is often incentivized by payments, thus this can incentivize the hospitals to increase the coding activity and perhaps up-coding in some cases which probably happened also in Lebanon. How much this could have been a case?
○ Up-coding is generally a recognized issue in the Lebanese healthcare setting. However, given that providers were unaware of the TPS prior to the intervention in August 2014, we would not expect an increase in up-coding above background levels. This similarly applies to increased coding activity, but only within surgical and mixed cases, where multiple-coding is used. Nonetheless, we acknowledge this factor will likely have to be more closely considered in future evaluations, with providers having become more aware of the incentive potential of up-coding, and other potential gaps to game the system.
Related to previous comment - in the article it was mentioned that "providers do not have access to the weights used in casemix evaluation". This might limit the gaming and up-coding indeed. But on other hand it also makes the system non-transparent and weights used for calculation the CMI will be as a black box within the system which might have negative impact on improvement of coding quality.
○ The lack of access to the weights used does limit transparency. However, based on past experience and under the existing conditions, this was considered by the Ministry of Public Health as an important feature to limit the abuse/misuse of hospitalization among some providers. This has been possible to maintain thus far due to the joint commitment of the Syndicate of Private Hospitals towards the P4P initiative. Nevertheless, increasing transparency regarding weights and the TPS process would be important goals for future system development.
Given that there is no DRG system in place in Lebanon it somehow limited the development of CMI by using DRG cost-weights which would have been a conventional way when DRG system would have been used. The chosen approach was compensating the missing DRG cost-weights. However, it did not use the information/variables which would potentially have had impact on weights and CMI calculation. E.g. variable like age was not available for this study (as indicated in Data sources chapter) which would have had impact on weights in some cases. Also information about co-morbidities and complications was not taken into account which might be cost driver and thus, may have had impact on weights and CMI. It was also interesting that the medical cases where divided based on LOS whereas the surgical ones were not. In conventional cost-weight calculation also the cost/LOS outliers are taken into account which was not systematically the case in this study. Having DRG system in place, the calculations of CMI would be perhaps more consistent and internationally comparable by taking into account all possible variables a casemix system is using and which were not taken into account in this study.
○ This is largely correct, although it is also pertinent to recognize the existing institutional context within which the CMI-based reimbursement system is developing. We have added a paragraph under 'strengths and limitations' to address age and comorbidities. We acknowledge that three-character code. Separate weights were calculated for each code, whether three or four-character. Therefore, J44.1 and J44 had different weights (differing by 14%), as did P22 and P22.0. The decision to have separate weights was based on the review of various codes, which in some cases revealed considerable differences between three and four-character codes. Acknowledging that this is likely due to a mix of factors, including imprecise coding practices and actual diagnosis, but lacking information to resolve these differences, we chose to use weights for each three and four-character code. This approach is similar to that used in the calculation of casemix index by the MoPH. The use of three-character codes is also due for phasing out by the MoPH, which will require four-character codes for all hospitalizations.