The consequences of adjustment, correction and selection in genome-wide association studies used for two-sample Mendelian randomization

Introduction: Genome-wide association studies (GWASs) often adjust for covariates, correct for medication use, or select on medication users. If these summary statistics are used in two-sample Mendelian randomization analyses, estimates may be biased. We used simulations to investigate how GWAS adjustment, correction and selection affects these estimates and performed an analysis in UK Biobank to provide an empirical example. Methods: We simulated six GWASs: no adjustment for a covariate, correction for medication use, or selection on medication users; adjustment only; selection only; correction only; both adjustment and selection; and both adjustment and correction. We then ran two-sample Mendelian randomization analyses using these GWASs to evaluate bias. We also performed equivalent GWASs using empirical data from 318,147 participants in UK Biobank with systolic blood pressure as the exposure and body mass index as the covariate and ran two-sample Mendelian randomization with coronary heart disease as the outcome. Results: The simulation showed that estimates from GWASs with selection can produce biased two-sample Mendelian randomization estimates. Yet, we observed relatively little difference between empirical estimates of the effect of systolic blood pressure on coronary artery disease across the six scenarios. Conclusions: Given the potential for bias from using GWASs with selection on Mendelian randomization estimates demonstrated in our simulation, and the reduced sample size of these GWAS, this approach should be deprioritized. However, based on our empirical results, using adjusted, corrected or selected GWASs is unlikely to make a large difference to two-sample Mendelian randomization estimates in practice.


INTRODUCTION
Concurrent with the advent of large-scale biobank studies collecting a vast array of genetic and phenotypic data, there has been a rapid increase in genome-wide association studies (GWASs) aiming to identify the genetic causes of common diseases or risk factors. Twosample Mendelian randomization is a popular causal inference method that uses summary statistics from two GWASs to assess the effect of an exposure, obtained from one GWAS, on an outcome, obtained from the second GWAS. [1][2][3] It is a form of instrumental variable analysis and therefore requires four assumptions in order to obtain a point estimate for the causal effect, as summarised in Figure 1. Key advantages of two-sample Mendelian randomization include that it removes the need for individual level data and sample sizes can be maximized, particularly for rare outcomes, by using two separate studies instead of a single study that captures both the exposure and outcome. associated with a phenotype. Consequently, GWASs are often adjusted for covariates, corrected for medication use, or selected on medication users to maximise power and avoid potential biases from factors that are not the phenotype of interest. However, while these practices are often beneficial for GWASs, they may affect the SNP-phenotype associations and consequently may cause bias in two-sample Mendelian randomization analyses. [4] The implications of adjustment for covariates in GWASs has been previously discussed, yet the consequences of correcting and selecting for treatment have not been fully described. [3,5] The aim of this study was therefore to assess the consequences of three common GWAS alterations and, where appropriate, their combination on two-sample Mendelian randomization estimates: • Adjusting for a covariate -for example, including body mass index as a covariate in a GWAS of systolic blood pressure • Correcting for medication use -for example, adding 10 mmHg to systolic blood pressure measures of individuals exposed to antihypertensive medicine prior to their blood pressure measurement [6] • Selecting on medication users -for example, removing individuals receiving antihypertensive medicine prior to their blood pressure measurement

METHODS
To demonstrate the possible consequences of using adjusted, corrected or selected GWASs in two-sample Mendelian randomization, we simulated genetic associations for six scenarios.
These were: 1) no adjustment, correction or selection; 2) adjustment for a covariate; 3) selection on medication users; 4) correction for medication use; 5) adjustment for a covariate and selection on medication users; and 6) adjustment for a covariate and correction for medication use. These scenarios are illustrated using directed acyclic graphs in Figure 2. We then performed two-sample Mendelian randomization using the genetic associations from each of the scenarios and compared the results with the simulated effect of the exposure on the outcome. To demonstrate these consequences in a more realistic setting, we also conducted a comparable analysis using empirical data from UK Biobank with systolic blood pressure as the exposure, body mass index as the covariate and coronary heart disease as the outcome. [7] Figure 2 Directed acyclic graphs illustrating the six models considered in this study

Data generation
We conducted a simulation based on a sample of 1,000,000 individuals with 200 SNPs that instrument our exposure. The simulation included specification of six parameters: !" , "# , "$ , #$ , !% and "% . We simulated the genetic variants as a binomial function to get a SNP dose multiplied by the random effect of the SNP on (unmeasured) phenotype A.
The formula for this, and the other main components of the simulation, are detailed below: where !" = 1, and " = Normal(0, 2.5) is a random error term and # = Normal(0, 1) is a random error term where "% = various, !% = 1, and % = Normal(0, 1) is a random error term Genetic associations for phenotype A To obtain the SNP-measured phenotype A associations for each scenario, we did the following for a random sample of half the simulated individuals: Mendelian randomization analysis and comparison with true effect The true effect of the unmeasured phenotype A on phenotype B per unit increase in the exposure in the simulation is equal to "% . We simulated data for values of "% between -1 and 1 in increments of 0.1, excluding zero. We performed Mendelian randomization for each dataset using the 'mrrobust' package in Stata. [9] We then calculated the difference between the Mendelian randomization estimate and the value of "% . To ensure the effects were comparable, despite the different true effects, we scaled the differences by dividing by the value of "% . Finally, we meta-analysed these scaled differences across all 20 simulated datasets using the 'metan' package in Stata with a fixed effect model and Mantel-Haenszel method. [10] Empirical analysis

Data source
We conducted the systolic blood pressure GWAS in UK Biobank. [7,11]  Genetic associations for systolic blood pressure We used the second of two systolic blood pressure readings, which were automatically recorded two minutes apart for all participants at the baseline UK Biobank assessment centre, to determine systolic blood pressure. Body mass index was calculated at the baseline assessment centre from baseline measures of height and weight. A list of antihypertensive medication was defined based on twelve headings from the British National Formulary

(adrenergic neuron blocking drugs; alpha-adrenoceptor blockers, angiotensin-converting enzyme inhibitors, angiotensin-II receptor blockers, beta-adrenoceptor blockers, calcium channel blockers, centrally acting antihypertensive drugs, loop diuretics, potassium-sparing diuretics and aldosterone antagonists, renin inhibitors, thiazides and related diuretics,
and vasodilator antihypertensives). [13] Antihypertensive medication use was then determined based on this list from medication records made by clinic nurses at baseline.
To obtain the SNP-systolic blood pressure associations for each scenario, we did the following: The value of 10 mmHg was chosen as the correction for antihypertensive medication use based on the fact that antihypertensives are estimated to reduce systolic blood pressure by 9 mmHg on average. [6] All GWASs were conducted using the MRC Integrative Epidemiology Unit Pipeline with a BOLT-LMM model to account for population stratification. [14] All six GWASs were adjusted for age, sex and 40 principal components.

Genetic associations for coronary artery disease
We obtained genetic associations for coronary artery disease from a multi-ethnic metaanalysis study by Nikpay et al of 60,801 cases and 123,504 controls to avoid sample overlap with UK Biobank. [15] This study used an inclusive definition for disease cases including clinically documented cardiac angina, coronary artery stenosis greater than 50%, and coronary revascularization.

Mendelian randomization analysis
Genome-wide significant hits from each GWAS were clumped using the 'clump_data' command in the 'TwoSampleMR' R package with a R squared threshold of 0.001 and a clumping window of 10,000kb. [16] These SNPs were then used as the instrument for the two-sample Mendelian randomization analysis, also performed using the 'TwoSampleMR' R package. All estimates were scaled to the effect per standard deviation increase in systolic blood pressure for presentation.

Code availability
All analyses were conducted in Stata version 15.1 and R version 3.4 or higher. The code is available from GitHub: https://github.com/venexia/MR-GWAS-consequences.

Simulation
We found selection on medication users led to a point estimate of the beta that was 61% (95% CI: 59 to 62) greater than the simulated true effect of unmeasured phenotype A on phenotype B across effect sizes. The combination of adjustment for a covariate and selection on medication users also led to an overestimate of the beta of 53% (95% CI: 52 to 54). There was little difference between the estimates and the simulated true effect of unmeasured phenotype A on phenotype B obtained in the other scenarios ( Figure 3).

Empirical analysis
The estimates of the odds ratio for a standard deviation increase in systolic blood pressure on coronary artery disease varied between 1.82 (95% CI: 1.62 to 2.05; #SNPs = 218) for the model corrected for antihypertensive use only to 2.05 (95% CI: 1.75 to 2.41; #SNPs = 160) for the model adjusted for body mass index only. The confidence intervals for all estimates were overlapping suggesting little difference between estimates based on the different exposure GWAS used for these analyses (Figure 4).

DISCUSSION
The aim of this study was to assess the consequences of using adjusted, corrected or selected GWASs on two-sample Mendelian randomization estimates. We demonstrated through simulations that selection of medication users can lead to substantial bias in Mendelian randomization estimates. On the other hand, bias from adjustment for a covariate and correction for medication use was minimal in the simulated scenarios. We have also presented results from an empirical analysis, where we used Mendelian randomization to estimate the effect of systolic blood pressure on coronary artery disease with body mass index as the covariate. We found little difference in the estimates for the effect of a standard deviation increase in systolic blood pressure on coronary artery disease across the six scenarios, even those selecting on medication users. Furthermore, in this example, the observed differences would not have affected the inference drawn from the analysis. This suggests that the bias observed in the simulations may be less pronounced in practice.
The potential for bias in two-sample Mendelian randomization estimates using GWASs with adjustment, correction or selection has been highlighted as a limitation of this method in the literature. [17] Hartwig et al have previously investigated bias resulting from the use of an adjusted GWAS using simulated and empirical data. [5] However, to our knowledge, there are no previous studies that have investigated the consequences of using GWASs corrected for medication use, or with selection on medications users, on two-sample Mendelian randomization. Future work in this area should look to consider the potential for bias from adjustment, correction and selection in other implementations of Mendelian randomization, such as multivariable Mendelian randomization. [18] While knowing the data generation mechanism for a simulation is an advantage, generated data is a simplification of real data generation mechanisms. Reality is likely much more complicated than the models we have studied. Our findings may therefore be considered indicative of potential bias in the scenarios we have outlined, rather than a quantification of the potential bias in all possible scenarios. For example, correction for medication use may have had little effect here because treatment occurs between the instrument and measured phenotype A, without having a direct effect on phenotype B, and so should be captured within the GWAS of measured phenotype A. A further consideration is that our empirical analyses may also be subject to the common limitations of Mendelian randomization, such as horizontal pleiotropy. This may serve to amplify or disguise differences between estimates due to adjustment, correction and/or selection in the exposure GWAS.
Using simulations, we have shown that bias is possible in two-sample Mendelian randomization when using GWASs with selection on medication users. Given that selection also reduces GWAS sample size, this approach should be deprioritized as a way of accounting for treatment. Despite this, little difference was observed between the twosample Mendelian randomization estimates in our empirical example and the inference drawn from these analyses remained the same for all six scenarios studied. This suggests that, in practice, using adjusted, corrected or selected GWASs is unlikely to make a large difference to estimates obtained from two-sample Mendelian randomization.