Meta-analysis of exome array data identifies six novel genetic loci for lung function

Background: Over 90 regions of the genome have been associated with lung function to date, many of which have also been implicated in chronic obstructive pulmonary disease. Methods: We carried out meta-analyses of exome array data and three lung function measures: forced expiratory volume in one second (FEV 1), forced vital capacity (FVC) and the ratio of FEV 1 to FVC (FEV 1/FVC). These analyses by the SpiroMeta and CHARGE consortia included 60,749 individuals of European ancestry from 23 studies, and 7,721 individuals of African Ancestry from 5 studies in the discovery stage, with follow-up in up to 111,556 independent individuals. Results: We identified significant (P<2·8x10 -7) associations with six SNPs: a nonsynonymous variant in RPAP1, which is predicted to be damaging, three intronic SNPs ( SEC24C, CASC17 and UQCC1) and two intergenic SNPs near to LY86 and FGF10. Expression quantitative trait loci analyses found evidence for regulation of gene expression at three signals and implicated several genes, including TYRO3 and PLAU. Conclusions: Further interrogation of these loci could provide greater understanding of the determinants of lung function and pulmonary disease.

FGF10. for regulation of gene expression at three signals and implicated several genes, including and .

TYRO3
PLAU Further interrogation of these loci could provide greater Conclusions: understanding of the determinants of lung function and pulmonary disease. . Genotyping of the GS:SFHS samples was carried out by the Genetics Core Laboratory at the Edinburgh Clinical Research Facility, University of Edinburgh, Scotland and was funded by the Medical Research Council UK.. The Croatia KORCULA study was supported by the Ministry of Science, Education and Sport in the Republic of Croatia (108-1080315-0302). JD, JCL, WG and GTOC are supported by NIH/NHLBI Contract HHSN268201500001I. Genotyping, quality control and calling of the Illumina HumanExome BeadChip in the Framingham Heart Study was supported by funding from the National Heart, Lung and Blood Institute Division of Intramural Research (Daniel Levy and Christopher J. O'Donnell, Principle Investigators). The AGES study is supported by the NIH (N01-AG012100), the Iceland Parliament (Alþingi) and the Icelandic Heart Association. HABC was supported by NIA contracts N01AG62101, N01AG62103, and N01AG62106; NIA grant R01-AG028050, and NINR grant R01-NR012459 and was supported in part by the Intramural Research Program of the NIH, National Institute on Aging. The HABC genome-wide association study was funded by NIA grant 1R01AG032098-01A1 and genotyping services were provided by the Center for Inherited Disease Research (CIDR). CIDR is fully funded through a federal contract from the National Institutes of Health to The Johns Hopkins University, contract number HHSN268200782096C. We thank the Jackson Heart Study (JHS) participants and staff for their contributions to this work. The JHS is supported by contracts HHSN268201300046C, HHSN268201300047C, HHSN268201300048C, HHSN268201300049C, HHSN268201300050C from the National Heart, Lung, and Blood Institute and the National Institute on Minority Health and Health Disparities. JGW is supported by U54GM115428 from the National Institute of General Medical Sciences. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Introduction
Measures of lung function act as predictors of mortality and morbidity and form the basis for the diagnosis of several diseases, most notably chronic obstructive pulmonary disease (COPD), one of the leading causes of death globally 1 . Environmental factors, including smoking and exposure to air pollution play a significant role in lung function; however there has also been shown to be a genetic component, with estimates of the narrow sense heritability ranging between 39-66% 2-5 . Genomewide association studies (GWAS) of lung function have identified associations between single nucleotide polymorphisms (SNPs) and lung function at over 150 independent loci to date 6-14 .
Associations have also been identified in GWAS of COPD 15-19 ; however, the identification of disease associated SNPs has been restricted by limited sample sizes. Many signals first identified in powerful studies of quantitative lung function traits, have been found to be associated with risk of COPD, highlighting the potential clinical usefulness of comprehensive identification of lung function associated SNPs 13 .
Low frequency (minor allele frequency (MAF) 1-5%) and rare (MAF<1%) variants have been largely underexplored by GWAS to date. Exome arrays have been designed to facilitate the investigation of these low frequency and rare variants, predominately within coding regions, in large sample sizes. Alongside a core content of rare coding SNPs, the exome array additionally includes common variation, including tags for previously identified GWAS hits, ancestry informative SNPs, a grid of markers for estimating identity by descent and a random selection of synonymous SNPs 20 .
An earlier version of this article can be found on bioRxiv (https://doi.org/10.1101/164426)

Results
We carried out a meta-analysis of exome array data and three lung function measures: forced expiratory volume in one second (FEV 1 ), forced vital capacity (FVC) and the ratio of FEV 1 to FVC (FEV 1 /FVC). These analyses included 68,470 individuals from the SpiroMeta and CHARGE consortia in a discovery analysis, with follow-up in an independent sample of up to 111,556 individuals. All studies are listed with their study-specific sample characteristics in Table 1, with full study descriptions, including details of spirometry and other measurements described in the Supplementary Note. The genotype calling procedures implemented by each study (Supplementary Table 1) and quality control of genotype data are described in the Supplementary Methods. We have undertaken both single variant analyses, and gene-based associations, which test for the joint effect of several rare variants in a gene (see Methods for details).

Meta-analyses of single variant associations
We first evaluated single variant associations between FEV 1 , FVC and FEV 1 /FVC and the 179,215 SNPs that passed study level quality control and were polymorphic in both consortia. These analyses identified 34 SNPs in regions not previously associated with lung function, showing association with at least one trait at overall P<10 -5 , and showing association with consistent direction and P<0·05 in both consortia (full results in Supplementary  Table 2, quantile-quantile and Manhattan plots shown in Supplementary Figure 1). We followed up these SNP associations in a replication analysis comprising 3 studies with 111,556 individuals. Combining the results from the discovery and replication stages in a meta-analysis identified six SNPs in total that were independent to known signals and met the pre-defined significance threshold (P<2·8×10 -7 ) overall in, or near to FGF10, LY86, SEC24C, RPAP1, CASC17 and UQCC1 (Table 2, Supplementary Figure 2). A SNP near to the CASC17 signal (rs11654749, r 2 =0·3 with rs1859962) has previously been associated with FEV 1 in a genome-wide analysis of gene-smoking interactions, although this association was not replicated at the time 21 ; the present analysis provides the first evidence for independent replication of this signal. A seventh signal was also identified in LCT (Table 2, Supplementary Figure 2); whilst this locus has not previously been implicated in lung function, this SNP is known to vary in frequency across European populations 22 , and we cannot rule out that this association is not an artefact of population structure. Our discovery analysis furthermore identified associations (P<10 -5 ) in 25 regions previously associated with one or more of FEV 1 , FVC and FEV 1 /FVC (Supplementary Table 3).
Generally, the observed effect of the SNPs at the novel signals were similar in ever and never smokers; the exception was rs1448044 near FGF10, which showed a significant association with FVC only in ever smokers in our discovery analysis (ever smokers P=1·49×10 -6 ; never smokers P=0·695, Supplementary  Table 4 and Supplementary Figure 3). In the replication analysis, however, this association was observed in both ever and never smokers (ever smokers P=3·14×10 -5 ; never smokers P=1·40×10 -4 , Supplementary Table 5). For rs1200345 (RPAP1) and rs1859962 (CASC17), associations were most statistically significant in the analyses restricted to individuals of European Ancestry (Supplementary Table 4 and Supplementary Figure 3), as was the association with rs2322659 (LCT), giving further support that this association may be due to population stratification.

Meta-analyses of gene-based associations
We undertook Weighted Sum Tests (WST) 23 and Sequence Kernel Association tests (SKAT) 24 to assess the joint effects of multiple low frequency variants within genes on lung function traits. In our discovery analyses of all 68,470 individuals, we tested up to 14,380 genes that had at least two variants with MAF<5% and met the inclusion criteria (exonic or loss of function [LOF], see Methods for definitions) in both consortia. The SKAT analyses identified 16 genes associated (P<0·05 in both consortia and overall P<10 -4 ) with FEV 1 , FVC or FEV 1 /FVC (Supplementary Table 6), whilst the WST analyses identified 12 genes

Amendments from Version 2
We have added a further limitation to the discussion of the paper outlining a recently highlighted issue regarding the trait transformation undertaken in our replication analyses. We show through sensitivity analyses that our results are not affected by this issue (Supplementary Figure 4), but note that future studies should avoid such a transformation.

See referee reports
Swiss study on Air Pollution and Lung Disease in adults (SAPALDIA) The Cardiovascular Risk in Young Finns Study (YFS)  Table 7). There was one gene (LY6G6D) that was identified in both analyses. These genes were followed up in UK Biobank, with two genes, GPR126 and LTBP4, showing evidence of replication in the exonic SKAT analysis (P<3·5×10 -6 ); however conditional analyses in UK Biobank showed that both these associations were driven by single SNPs, that were identified in the single variant association analyses and have been previously reported in GWAS of these traits (Supplementary Table 6 and Supplementary Table 7).

Functional characterization of novel loci
In order to gain further insight into the six loci identified in our analyses of single variant associations (excluding LCT), we employed functional annotation and assessed whether identified SNPs in these regions were associated with gene expression levels. One of the identified novel SNPs was nonsynonymous, three intronic and two were intergenic. We found evidence that three of the SNPs may be involved in cis-acting regulation of the expression of several genes in multiple tissues (Supplementary Table 8).
SNP rs1200345 in RPAP1 is a nonysynomous variant, predicted to be deleterious by both SIFT (deleterious) and Polyphen (possibly damaging) (Supplementary Table 9); RPAP1 is ubiquitously expressed, with high levels of protein detected in the lung (Supplementary Table 10). SNP rs1200345 or proxies (r 2 >0·8) were also found to be amongst the most strongly associated SNPs with expression levels of RPAP1 in several tissues, including lung, and with a further six genes in lung tissue (Supplementary Table 8), including TYRO3, one of the TAM family of receptor tyrosine kinases. TYRO3 regulates several processes including cell survival, migration and differentiation and is highly expressed in lung macrophages (Supplementary Table 10). Evidence of association with gene expression was found at two more of the novel signals (sentinel SNPs rs3849969 and rs6088813), implicating a further 16 genes. Of note, in blood expression quantitative trait loci (eQTL) databases, a proxy of a SNP in complete linkage disequilibrium (r 2 =1) with rs3849969 (rs3812637) was an eQTL for plasminogen activator, urokinase (PLAU).

Discussion
We undertook an analysis of 68,470 individuals from 23 studies with data from the exome array and three lung function traits, following up the most significant single SNP and gene-based associations in an independent sample of up to 111,556 individuals. There were six SNPs which reached P<10 -5 in the discovery stage meta-analysis of single variant associations, and subsequently met the Bonferroni corrected significance threshold for independent replication (P<1·47×10 -3 , corrected for 34 SNPs being tested). In the combined analyses of our discovery and replication analyses, these six SNPs met the exome chip-wide significance threshold (P<2·8×10 -7 ). One of the SNPs is in a region that has previously been implicated in lung function (near KCJN2/SOX9) 21 , whilst the remaining five SNPs, although all common, have not previously been identified in other GWAS of lung function. In a recent 1000 Genomes imputed analysis of lung function (which includes some of the studies contributing to sizes (N), Z-statistics (Z) and two-sided P-values (P) are given for the combined discovery analysis and the replication analysis. Two-sided P-values are also given for the full two-stage combined analyses (discovery + replication). the present discovery analysis), all of these SNPs showed at least suggestive association (2·97×10 -3 >P>1·28×10 -5 ) with one or more lung function trait, but none reached the required threshold (P<5×10 -6 ) to be taken forward for replication in that analysis 12 .
We further identified a seventh association with rs2322659 in LCT (MAF=23·5%; combined discovery + replication P=1·70×10 -9 ). Given SNPs in this region are known to vary in frequency across European populations, we cannot dismiss the possibility that this association may be confounded by population stratification; hence we do not report this signal as a novel lung function locus. For SNPs at 7 loci that have been shown to have differences in allele frequency between individuals from different regions of the UK 25 , and subsequently European populations (including the LCT locus), we undertook a look-up of associations with lung function in our discovery analyses. and subsequently across European populations 26 . Aside from the association between the LCT locus and FVC, no significant associations were observed between SNPs at these loci and any lung function trait, in either the analyses restricted to European Ancestry (EA) individuals, or in the analysis of EA and African Ancestry (AA) individuals combined (Supplementary Table 11); this suggests population structure was generally accounted for adequately in our analyses.
One of the novel signals was with a nonsynonymous SNP, rs1200345 in RPAP1, (MAF=48·8%; P=2·33×10 -13 ), which is predicted to be deleterious. This SNP and proxies with r 2 >0·8 were also associated with expression in lung tissue of seven genes, including RPAP1 and the TAM receptor TYRO3. TAM receptors play a role in the inhibition of Toll-like receptors (TLRs)mediated innate immune response by initiating the transcription of cytokine signalling genes (SOCS-1 and 3), which limit cytokine overproduction and inflammation 27,28 . It has been shown that influenza viruses H5N1 and H7N9 can cause downregulation of Tyro3, resulting in an increased inflammatory cytokine response 28 .
Three further signals were with intronic SNPs in SEC24C (MAF=29·4%; P=4·99×10 -12 ), CASC17 (MAF=48·2%; P=4·10×10 -11 ), and UQCC1 (MAF=36·7%; P=4·90×10 -19 ). Two of these intronic SNPs have previously been implicated in GWAS of other traits: rs1859962 in CASC17 with prostate cancer 29 and rs6088813 in UQCC1 with height 30 . The CASC17 locus, near KCNJ2/SOX9 has also previously been implicated in lung function, showing significant association with FEV 1 in a genome-wide analysis of gene-smoking interactions; however, this association was not formally replicated 21 . Whilst the individuals utilised in the discovery stage of this analysis overlap with those included in this previous interaction analysis, the replication stage of the present study provides the first evidence of replication for this signal in independent cohorts. In the present analysis, there was no evidence that the results differed by smoking status.
SNPs rs6088813 in UQCC1 and rs3849969 in SEC24C were identified as eQTLs for multiple genes. Whilst our eQTL analysis did not include formal tests of colocalisation, a SNP in complete linkage disequilibrium with rs3849969 (rs3812637, r 2 =1) was associated with expression of PLAU in blood. The plasminogen activator, urokinase (PLAU) plays a role in fibrinolysis and immunity, and with its receptor (PLAUR) is involved in degradation of the extra cellular matrix, cell migration, cell adhesion and cell proliferation 31 . A study of preterm infants with respiratory distress syndrome, a condition characterised by intra-alveolar fibrin deposition, found PLAU and its inhibitor SERPINE1 to be expressed in the alveolar epithelium, and an increased ratio of SERPINE1 to PLAU was associated with severity of disease 32 . Studies in mice have also shown that increased expression of Plau may be protective against lung injury, by reducing fibrosis 33 . PLAU has also been found to be upregulated in lung epithelial cells subjected to cyclic strain 34 and in patients with COPD and lung cancer, PLAU was found to be expressed in alveolar macrophages and epithelial cells 31 .
The final two signals were with common intergenic SNPs close to LY86 (MAF=36·8%; P=9·74×10 -23 ) and FGF10 (MAF=35·6%; P=2·22×10 -11 ). LY86 (lymphocyte antigen 86) interacts with the Toll-like receptor signalling pathway, to form a heterodimer, when bound with RP105 35 . The sentinel SNP in the present analysis (rs1294421) has previously shown association with waist-hip ratio 36 , whilst an intronic SNP within LY86 (rs7440529, r 2 =0·005 with rs1294421) has been implicated in asthma in two studies of individuals of Han Chinese ancestry 37,38 . FGF10 is a member of the fibroblast growth factor family of proteins, and is involved in a range of biological processes, including embryonic development and morphogenesis, cell growth and repair, tumor growth and invasion. Specifically, the FGF10 signalling pathway is thought to play an criticial role in the development of the lung and in lung epithelial renewal 39 . A deficiency in Fgf10 has been demonstrated to lead to a fatal disruption of branching morphogenesis during lung development in mice 40 .
Our discovery analyses included individuals of both EA and AA. Two of the identified six novel signals showed inconsistent effects in the AA and EA individuals. For these SNPs, the associations in AA individuals were not statistically significant, and we report associations from the analysis restricted to EA individuals only. For the remaining four SNPs similar effects were observed in both the EA and AA individuals (Supplementary Figure 3). We also examined the effects of the novel SNPs in ever smokers and never smokers separately and found these to be broadly similar, with the exception of rs1448044 in FGF10, which in the discovery analysis showed significant association with FVC in ever smokers, whilst showing no association in never smokers (P=0·695). However, in our replication stage analyses, similar effects were seen in both ever and never smokers for this SNP, and the combined analysis of discovery and replication stages for this SNP, including both ever and never smokers, met the exome chip-wide significance level overall (P=4·22×10 -9 ). We also considered whether this signal could be driven by smoking behaviour in our discovery stage as our primary analyses in SpiroMeta did not adjust for smoking quantity. We undertook a look-up of this SNP in the publicly available results of a GWAS of several smoking behaviour traits 41 ; there was only weak evidence that this SNP was associated with ever versus never smoking (P=0·039), and no evidence for association with amount smoked (cigarettes per day, P=0·10).
Through the use of the exome array, we aimed to identify associations with low frequency and rare functional variants, thereby potentially uncovering some of the missing heritability of lung function. However, whilst our discovery analyses identified single SNP associations with 23 low frequency variants (Supplementary  Table 2), we did not replicate any of these findings. Eleven of these 23 SNPs we were unable to follow-up in our replication studies, due to them either being not genotyped, or monomorphic. Overall, limited statistical power is likely to explain our lack of convincing single variant associations with rare variants, in particular if those variants exhibit only modest effects 42 . We additionally investigated the joint effects of low frequency and rare variants within genes, on lung function trait, by employing SKAT and WST gene-based tests. These analyses identified associations with a number of genes that could not attributed to the effect of a single SNP. Replication of these gene-based signals proved difficult however, as again a number of SNPs included in the discovery stage of these analyses were monomorphic, or had not been not genotyped in the replication studies. This lead to a disparity in the gene unit being tested in our discovery and replication samples; hence interpretation of these results was not clear-cut. In the end, we were able to replicate only findings with common SNPs. This finding is in line with several other studies of complex traits and exome array data that have been unable to report robust associations with low frequency variants [43][44][45] and it is clear that future studies will require increasingly larger sample sizes in order to fully evaluate the effect of variants across the allele frequency spectrum. The identification of common SNPs remains important, however, as such findings have the potential to highlight drug targets 46 , and these variants collectively could have utility in risk prediction.
In our replication analyses using UK Biobank, we applied adjustment for covariates including ancestry principal components, before undertaking inverse-normal transformations of the lung function phenotypes. Association analyses were then performed using these transformed phenotypes. It has recently been shown that such transformation has the potential to introduce correlations between principal components and phenotypes 47 ; we undertook sensitivity analyses for the six reported SNPs by repeating the association analyses with phenotypes that had been transformed without prior adjustment, with covariate adjustment made as part of the SNP-trait association test. We found there to be some difference in P-values for some SNPtrait combinations; however, the six novel SNP associations we report all met the replication P-value threshold (P<1·47×10 -3 ) in the sensitivity analyses (Supplementary Figure 4). This issue may also be relevant to the gene-based tests; however no replicated novel gene-based associations were identified in this study. Future studies should avoid undertaking adjustment for principal components of ancestry prior to trait transformation, in order to avoid this potential bias.
This study has identified six common SNPs, independent to signals previously implicated in lung function. Additional interrogation of these loci could lead to greater understanding of lung function and lung disease, and could provide novel targets for therapeutic interventions.

Study design, cohorts and genotyping
The SpiroMeta analysis included 23,751 individuals of EA from 11 studies, and the CHARGE analysis comprised 36,998 EA individuals and a further 7,721 individuals of AA from 12 studies. Follow-up analyses were conducted in an independent sample of up to 111,556 individuals from UK Biobank (2015 interim release), the UK Household Longitudinal Study (UKHLS) and the Netherlands Epidemiology of Obesity (NEO) Study ( Figure 1). All studies (excluding UK Biobank) were genotyped using either the Illumina Human Exome BeadChip v1 or the Illumina Infinium HumanCoreExome-12 v1·0 BeadChip. UK Biobank samples were genotyped using the Affymetrix Axiom UK BiLEVE or UK Biobank arrays.

Statistical analyses
Consortium level analyses: Within the SpiroMeta Consortium, each study contributing to the discovery analyses calculated single-variant score statistics, along with covariance matrices describing correlations between variants, using RAREMETAL-WORKER 48 or rvtests 49 . For each trait, these summary statistics were generated separately in ever and never smokers. Traits were adjusted for sex, age, age 2 and height, and inverse normally transformed prior to association testing. For studies with unrelated individuals, SNP-trait associations were tested using linear models, with adjustments made for the first 10 ancestry principal components, whilst studies with related individuals utilised linear mixed models to account for familial relationships and underlying population structure.
Within the CHARGE Consortium, each study generated equivalent summary statistics using the R package SeqMeta 50 . For each trait, summary statistics were generated in ever and never smokers separately, and in all individuals combined. The untransformed traits were used for all analyses, adjusted for smoking status and pack-years, age, age 2 , sex, height, height 2 , centre/cohort. Models for FVC were additionally adjusted for weight. Linear regression models, with adjustment for principal components of ancestry were used for studies with unrelated individuals, and linear mixed models were used for family-based studies.
Within each consortium we used the score statistics and variancecovariance matrices generated by each study to construct both single variant and gene-based tests using either RAREMETAL 48 (SpiroMeta) or SeqMeta 50 (CHARGE). For single variant associations, score statistics were combined in fixed effects meta-analyses. Two gene-based tests were constructed: first, the Weighted Sum Test (WST) using Madsen Browning weightings 23 , and secondly, the Sequence Kernel Association Test (SKAT) 24 . We performed the SKAT and WST tests using two subsets of SNPs: 1) including all SNPs with an overall consortium-wide MAF<5% that were annotated as splicing, stopgain, stoploss, or frameshift (loss of function [LOF] analysis), and 2) including all SNPs meeting the LOF analysis criteria in addition to all other nonsynonymous variants with consortium wide MAF<5% (exonic analysis). Variants were annotated to genes using dbNSFP v2·6 51 on the basis of the GRCh37/hg19 database.
For both single variant and gene-based associations, consortium-level results were generated for ever smokers and never smokers separately, and in all individuals combined. Within the CHARGE Consortium, results were combined separately for the EA and AA studies and also in a trans-ethnic analysis of both ancestries.
Combined meta-analysis: The single variant association results from the SpiroMeta and CHARGE consortia were combined as follows: The genomic inflation statistic (λ) was calculated for SNPs with consortium-wide MAF>1%; where λ had a value greater than one, genomic control adjustment was applied to the consortium level P-values. The consortium-level results were then combined using sample size weighted z-score meta-analysis. The λ was again calculated for the meta-analysis results and genomic control applied, as appropriate. λ values at the consortium and meta-analysis level are shown in Supplementary  Table 13. Since we were interested in identifying low frequency and rare variants, we applied no MAF or minor allele count (MAC) filter. We identified SNPs of interest as those with an overall P<10 -5 and a consistent direction of effect and P<0·05 observed in both consortia. Rather than using a strict Bonferroni correction for defining the significance threshold, we adopted the more lenient P<10 -5 threshold in order to increase the power to detect variants with modest effect in our discovery analyses, whilst the requirement for consistency in results from the two consortia aimed to limit false positives. All SNPs meeting these thresholds were followed up in independent replication cohorts. Where we identified a SNP within 1Mb of a previously identified Overall Meta-analysis of FEV1, FVC and FEV1/FVC in 60,749 EA and 7721 AA individuals.

Identification of novel associations
34 SNPs selected with P<10 -5 with at least one trait in combined meta-analysis and P<0 . 05 in both consortiumlevel analysis.
SNPs followed up in an independent sample of up to 111,556 individuals from UK Biobank, UKHLS and NEO.
Genes followed up in an independent sample of up to 98,657 individuals from UK Biobank.
27 Genes selected with P<10 -4 with at least one trait in combined meta-analysis and P<0 . 05 in both consortiumlevel analysis.
lung function SNP, we deemed the SNP to represent an independent signal if it had r 2 <0·2 with the known SNP, and if it retained a P <10 -5 , when conditional analyses were carried out with the known SNP, or a genotyped proxy, using data from the SpiroMeta Consortium, or UK Biobank. Our primary meta-analysis included all individuals; we additionally carried out analyses in smoking subgroups (ever and never smokers), and in the subgroup of individuals of European ancestry only.
For genes which contained at least 2 polymorphic SNPs in both consortia, we combined the results of the consortium level gene based tests using either z-score meta-analysis (for the WST analysis) or Fisher's Method for combining P-values (in the case of SKAT). We identified genes of interest as those with P<0·05 observed in both consortia and an overall P<10 -4 , thresholds again chosen to limit both false positive and false negative findings. As in the analyses of single variant associations, our primary meta-analyses included all individuals, with secondary analyses undertaken in smoking and ancestry specific subgroups.

Replication analyses:
All SNP and gene-based associations were followed up for the trait with which they showed the most statistically significant association only. For associations identified through the smoking subgroup analyses, we followed up associations in the appropriate smoking strata; however, no ancestry stratified follow-up was undertaken as replication studies included only a sufficient number of individuals of European Ancestry.
Single variant associations in UK Biobank were tested in ever smokers and never smokers separately, and stratified by genotyping array (UK BiLEVE array or UK Biobank array) using the score test as implemented in SNPTEST v2·5b4 52 . Traits were adjusted for age, age 2 , height, sex, ten principal components and pack-years (ever smokers only), and the adjusted traits were inverse normally transformed. Correlations between principal components and transformed phenotypes may be introduced where adjustment is made prior to transformation. In this analysis, we found any introduced correlations to have no impact on the conclusion of our replication analyses; however future studies should apply transformation of phenotypes prior to covariate adjustment, to avoid this issue. For UKHLS, analyses were undertaken analogously to the SpiroMeta discovery studies using RAREMETALWORKER, while for NEO, analyses were undertaken in the same way as was done in the CHARGE discovery studies using SeqMeta. The single variant results from all replication studies were combined using sample size weighted Z-score meta-analysis. Subsequently, we combined the results from the discovery and replication stage analyses and we report SNPs with overall exome-wide significance of P<2·8×10 -7 (Bonferroni corrected for the original 179,215 SNPs tested).
We followed up genes of interest (P<10 -4 ) using data from UK Biobank only. Summary statistics for UK Biobank were generated using RAREMETALWORKER, with gene-based tests then constructed using RAREMETAL. Finally, we combined the results from the discovery analysis with the replication results in an overall combined meta-analysis using either z-score metaanalysis (WST) or Fisher's Method (SKAT). We declared genes with overall P<3·5×10 -6 (Bonferroni corrected for 14,380 genes tested) in our combined meta-analysis to be statistically significant. For these statistically significant genes, we carried out additional analyses using the UK Biobank data in which we conditioned on the most significantly associated individual SNP within that gene, to determine whether this was a true gene-based signal, or whether the association could be ascribed to the single SNP (if the conditional P<0·01, then association was deemed to not be driven by the single SNP).

Characterization of findings
In order to gain further insight into the loci identified in our analyses of single variant associations, we assessed whether these regions were associated with gene expression levels in various tissues (FDR of 5%, or q-value<0·05), by querying a publically available blood eQTL database 53 and the GTEx project 54 for the sentinel SNPs, or any proxy (r 2 >0·8). We further assessed SNPs of interest (and proxies) within a lung eQTL resource based on non-tumour lung tissues of 1,111 individuals 55-57 . Descriptions of these resources and further details of the look-ups are provided in the Supplementary Methods. Moreover, all sentinel SNPs and proxies with r 2 >0.8 were annotated using ENSEMBL's Variant Effect Predictor (VEP) 58 ; potentially deleterious coding variants were identified as those annotated as 'deleterious' by SIFT 59 or 'probably damaging' or 'possibly damaging' by PolyPhen-2 60 . For all genes implicated through the expression data or functional annotation, we searched for evidence of protein expression in the respiratory system by querying the Human Protein Atlas 61 . This research has been conducted using the UK Biobank Resource. The genetic and phenotypic UK Biobank data are available upon application to the UK Biobank (https://www. ukbiobank.ac.uk/) to all registered health researchers. These data are from Understanding Society: The UK Household Longitudinal Study (UKHLS), which is led by the Institute for Social and Economic Research at the University of Essex and funded by the Economic and Social Research Council. The data were collected by NatCen and the genome wide scan data were analysed by the Wellcome Trust Sanger Institute. Information on how to access the data can be found on the Understanding Society website https://www.understandingsociety.ac.uk/. Click here to access the data.

Supplementary Note includes individual study descriptions.
Supplementary Methods includes details of study level quality control procedures and eQTL analyses.  Thank you to the authors for responding to and addressing our comments. I have one further comment on the replication analysis using UK Biobank data. The sensitivity analyses which the authors carried out showed that adjusting for covariates prior to inverse-normalization does affect the results. While this does not affect the main conclusions drawn, it may affect the results of the gene-based tests, and in addition, other investigators using the methods as a guide may draw inappropriate conclusions if adjusting for principal components prior to inverse-normalizing their phenotype. Ideally, the UK Biobank analysis should be redone with the appropriate phenotype transformation, and the methods and results sections updated accordingly. However, if the authors consider that such a revision would be too extensive, given that the conclusions do not change, it would at least be helpful to note the issue as a limitation in the discussion and make it clear in the methods that adjusting for covariates (such as principal components) should be done after inverse-normalising the phenotype -so it can be used appropriately by others.
No competing interests were disclosed.

Competing Interests:
We have read this submission. We believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
Author Response 17 Jul 2018 , University of Leicester, UK Victoria Jackson Thank you for you approval of our article, and your additional comment. As suggested, we have added a further limitation to the discussion of the paper outlining the issue regarding the trait transformation. This has also been noted in the methods. We have also included the results of the sensitivity analyses in the supplement (Supplementary Figure 4) The authors have performed a large genome-wide association study in subjects of European (36,998 in the discovery set and 111,556 in the replication set) and African (7,721 in the discovery set) ancestries for various lung function measures: FEV1, FVC and FEV1/FVC ratio. Both common and rare variant analyses are performed, and the effect of smoking on the associations is also assessed. The discovery set consisted of CHARGE and SpiroMeta consortia meta analysis using the Human Exome array, while the replication set consisted of genotypes on the HumanCoreExome array and the UK Biobank's custom arrays. A total of 7 novel regions were identified by the authors that met the overall (discovery+replication) Bonferroni-adjusted P-value of 2.8x10^-7 after adjustment for various covariates such as age, sex, height, and ancestry using principal components. All identified novel SNPs are of common frequency, and two of the SNPs are in high LD with missense variants predicted to be damaging.

Some areas for improvement:
Two rare variant tests were chosen and applied to the data as opposed to choosing a combined test (e.g. Derkach et al 2013 Genetic Epidemiology). A combined test would be more powerful.
The authors should explain why there was an inverse normalization of the traits in SpiroMeta but not in CHARGE, and provide some sensitivity analysis.
There appear to be very large differences in Effect Allele Frequencies between the discovery and replication samples. Do the authors have an explanation for this? This might point to local ancestry differences that could be relevant, and should be further investigated.
The eQTL analysis could formally investigate colocalization as opposed to cross-referencing individual associated SNPs with public repositories, and there are several different methods that achieve this goal: e.g. COLOC, eCAVIAR, Sherlock, RTC or EnLoc.
In the replication analyses section, it is stated that "Traits were adjusted for age, age^2, height, sex, ten principal components and pack-years (ever smokers only), and inverse normally " For clarity, the authors should be specific about whether the trait (FEV1, FVC, or transformed. FEV1/FVC) was inverse normalized first and age, age^2, sex, 10 PCs were then added as covariates in the genetic association model In the methods section for the rare variant testing Skat appears to be incorrectly referred to as a Fisher's combined method.

There appear to be very large differences in Effect Allele Frequencies between the discovery and replication samples. Do the authors have an explanation for this? This might point to local ancestry differences that could be relevant, and should be further investigated.
Thank you for highlighting this. There was an error with the effect allele frequencies for the replication samples in Supplementary Table 2; these have now been amended, and the allele frequencies are more consistent in the discovery and replication samples. Where there are still some differences between the discovery and replication allele frequencies, these are where the discovery meta-analysis included individuals of both European and African ancestry, whereas the replication dataset included individuals of European ancestry only.
4. The eQTL analysis could formally investigate colocalization as opposed to cross-referencing individual associated SNPs with public repositories, and there are several different methods that achieve this goal: e.g. COLOC, eCAVIAR, Sherlock, RTC or EnLoc. Tests of colocalisation are more usually undertaken in dense genome-wide data, whereas the (often rare) putative causal variants included on the exome array in our study were relatively sparsely distributed. Furthermore, we did not have access to the lung eQTL data required to undertake a tests of colocalisation. We now acknowledge that the eQTL analysis did not include formal tests of colocalisation in the discussion, and in the example we highlight the variants are in complete LD.

5.
In the replication analyses section, it is stated that "Traits were adjusted for age, age , height, sex, ten principal components and pack-years (ever smokers only), and inverse normally transformed." For clarity, the authors should be specific about whether the trait (FEV , FVC, or FEV /FVC) was inverse normalized first and age, age , sex, 10 PCs were then added as covariates in the genetic association model.
We have clarified in the methods for the replication analysis that "Traits were adjusted for age, age , height, sex, ten principal components and pack-years (ever smokers only), and the adjusted traits were inverse normally transformed." 6. In the methods section for the rare variant testing Skat appears to be incorrectly referred to as a Fisher's combined method.
Within each consortium we generated results for SKAT. Subsequently, we combined the SKAT results from the two consortia using Fisher's Method for combing P-values. We have clarified this in the text as "For genes which contained at least 2 polymorphic SNPs in both consortia, we combined the results of the consortium level gene based tests using either z-score meta-analysis (for the WST analysis) or Fisher's Method for combining P-values (in the case of SKAT).". 1.

2.
the significance threshold, we adopted the more lenient P<10 threshold in order to increase the power to detect variants with modest effect in our discovery analyses, whilst the requirement for consistency in results from the two consortia aimed to limit false positives. All SNPs meeting these thresholds were followed up in independent replicatizon cohorts." "We identified genes of interest as those with P<0·05 observed in both consortia and an overall P<10 , thresholds again chosen to limit both false positive and false negative findings." The overall thresholds for the combined discovery and replication analyses were based on Bonferroni corrected thresholds, as already stated in the text. The authors performed GWAS of FEV1, FVC and FEV1/FVC ratio at 179,215 SNPs from exome arrays. They identified 6 common frequency SNPs associated with at least one of these traits. They also identified 1 SNP in a region with known frequency differences across European populations suggesting that population structure may not have been fully accounted for in their analyses. Strengths of the study include the large sample size and comprehensive approach to assessing associations with low frequency and rare variants. We have the following concerns.

Main concerns:
The phenotypes seem to have been adjusted for covariates and ancestry specific principal components prior to being inverse normally transformed. This transformation has the potential to introduce correlations between principal components and the inverse normally transformed phenotype ( ). Since one of the SNPs https://www.biorxiv.org/content/early/2017/05/15/137232 identified as being associated with the phenotype is known to vary in frequency across European populations, and the authors note that they cannot rule out the effects of population structure on the identified associations this raises concerns that some of the other associations could also be artefacts driven by failure to properly account for population stratification. It should explicitly be mentioned in the methods whether adjustments were made for ancestry specific principal components prior to inverse normal transforming the phenotype in the SpiroMeta Consortium component of the meta analysis or was included as a covariate in the phenotype -SNP association analysis.
Indeed, in the replication analysis in UK Biobank principal components were adjusted for prior to inverse normally transforming the data. Was genotyping chip adjusted for in this cohort (which should be done in the phenotype -SNP analysis)? The UKBiLEVE chip was enriched for smokers, which could affect association analyses unless chip is included as a covariate. In addition the interim data release (which seems to be what is used here -please clarify in the methods whether -5 -4 components prior to being inverse normally transformed. This transformation has the potential to introduce correlations between principal components and the inverse normally transformed phenotype (https://www.biorxiv.org/content/early/2017/05/15/137232). Since one of the SNPs identified as being associated with the phenotype is known to vary in frequency across European populations, and the authors note that they cannot rule out the effects of population structure on the identified associations this raises concerns that some of the other associations could also be artefacts driven by failure to properly account for population stratification. It should explicitly be mentioned in the methods whether adjustments were made for ancestry specific principal components prior to inverse normal transforming the phenotype in the SpiroMeta Consortium component of the meta analysis or was included as a covariate in the phenotype -SNP association analysis.
In the SpiroMeta Consortium component of the analyses, adjustment for ancestry principal components (PCs) was not undertaken prior to transformation, rather PCs were adjusted for when fitting the SNP-trait associations. This is ambiguous in the text, and so we have amended the methods accordingly (Statistical analyses section, new wording below). Given that the adjustment for ancestry PCs was undertaken after phenotype transformation, we don't expect there to have been an introduction of correlation between the transformed trait and population structure.
"Traits were adjusted for sex, age, age and height, and inverse normally transformed prior to association testing. For studies with unrelated individuals, SNP-trait associations were tested using linear models, with adjustments made for the first 10 ancestry principal components, whilst studies with related individuals utilised linear mixed models to account for familial relationships and underlying population structure." 2. Indeed, in the replication analysis in UK Biobank principal components were adjusted for prior to inverse normally transforming the data. Was genotyping chip adjusted for in this cohort (which should be done in the phenotype -SNP analysis)? The UKBiLEVE chip was enriched for smokers, which could affect association analyses unless chip is included as a covariate. In addition the interim data release (which seems to be what is used here -please clarify in the methods whether the data comes from the interim (2015) or full (2017) data release) featured some discrepancies between the two chips, which can introduce spurious associations especially if adjustment is not made for genotyping chip.
In the UK Biobank data, principal components (PCs) were adjusted for prior to transformation. As a sensitivity analysis, we have repeated the analysis for the six reported SNPs (the LCT SNP was not available in UK Biobank), transforming the phenotypes, and then adjusting for all covariates (including PCs) during the SNP-trait association test. For comparison, we have done this for all six SNPs with all three traits. Comparisons of these two analyses (not adjusted prior to transformation vs with adjustment prior to transformation) are shown here: https://doi.org/10.6084/m9.figshare.5959906. For each SNP, the P-value comparison is highlighted for the trait we report the association with, and the dashed lines indicate the Bonferroni corrected significance threshold for independent replication (P<1·47×10 ). Whilst there is a difference in the P-values for some SNP-trait combinations, (more significant P-values in the analysis with covariate adjustment prior to transformation for 5 of the 6 SNPs), the SNPs all meet the replication P-value threshold in both analyses.
We have clarified in the methods (Study design, cohorts and genotyping section) that the UK Biobank data used was from the 2015 interim release. The UK Biobank analysis was stratified by smoking status (ever and never) and also chip (UK BiLEVE array and UK Biobank array). It was not 2 -3 smoking status (ever and never) and also chip (UK BiLEVE array and UK Biobank array). It was not clear from the methods previously that the analysis was stratified for chip, so we have now made this clear in the methods.
We have also tested whether any of the six reported SNPs available in UK Biobank had different MAFs in the UK BiLEVE and UK Biobank samples (suggestive of a chip effect); however none showed evidence of this: https://doi.org/10.6084/m9.figshare.5959927.

Why was raw trait used in CHARGE but inverse normalised in SpiroMeta Consortium? This seems an odd choice
We agree that using the raw trait in CHARGE and the transformed trait in SpiroMeta was not ideal; however it was not planned to combine the results of these consortia from the outset. By the time we had made the decision to combine the results from the two consortia, all studies had already completed analyses and it was not feasible for contributing studies to repeat the analyses with/out the transformation, as this would have involved a substantial amount of reanalysis from contributing studies. Since the effect estimates were not on the same scale we could not do an inverse variance weighted meta-analysis; therefore we did a P-value based meta-analysis. This analysis should be valid given that appropriate analyses were done within each consortium.
Minor concerns: 1. In the discussion, the authors mention that the 6 identified SNPs not attributed to population structure passed the Bonferroni significance threshold. They then mention that the SNPs ALSO pass Bonferroni corrected significance thresholds in the replication analysis. This could be misleading, since not all SNPs passed the Bonferroni threshold in the discovery only dataset.
We have reworded this section of the discussion as follows: "There were six SNPs which reached P<10 in the discovery stage meta-analysis of single variant associations, and subsequently met the Bonferroni corrected significance threshold for independent replication (P<1·47×10 , corrected for 34 SNPs being tested). In the combined analyses of our discovery and replication analyses, these six SNPs met the exome chip-wide significance threshold (P<2·8×10 )." 2. The authors mention that correction was made for genomic inflation statistic (λ), but we could not find the statistics relating to this. The figures should be given in the manuscript.
We have added Supplementary table 13 to the supplement.
No competing interests were disclosed.