Meta-analysis of exome array data identifies six novel genetic loci for lung function

Over 90 regions of the genome have been associated with lung function to date, many of which have also been implicated in chronic obstructive pulmonary disease (COPD). We carried out meta-analyses of exome array data and three lung function measures: forced expiratory volume in one second (FEV1), forced vital capacity (FVC) and the ratio of FEV1 to FVC (FEV1/FVC). These analyses by the SpiroMeta and CHARGE consortia included 60,749 individuals of European ancestry from 23 studies, and 7,721 individuals of African Ancestry from 5 studies in the discovery stage, with follow-up in up to 111,556 independent individuals. We identified significant (P<2·8x10-7) associations with six SNPs: a nonsynonymous variant in RPAP1, which is predicted to be damaging, three intronic SNPs (SEC24C, CASC17 and UQCC1) and two intergenic SNPs near to LY86 and FGF10. eQTL analyses found evidence for regulation of gene expression at three signals and implicated several genes including TYRO3 and PLAU. Further interrogation of these loci could provide greater understanding of the determinants of lung function and pulmonary disease.


Introduction
Lung function measures are important predictors of mortality and morbidity and form the basis for the diagnosis of chronic obstructive pulmonary disease (COPD), a leading cause of death globally. 1 Lung function is largely affected by environmental factors such as smoking and exposure to air pollution; however there is also a genetic component, with heritability estimates ranging between 39-66%. [2][3][4][5] A number of large-scale genome-wide association studies (GWAS) of lung function have successfully identified single nucleotide polymorphisms (SNPs) influencing lung function at over 90 independent loci. [6][7][8][9][10][11][12][13] Associations have also been identified in GWAS of COPD; [14][15][16][17][18] however the identification of disease associated SNPs has been restricted by limited sample sizes. Many signals first identified in powerful studies of quantitative lung function traits, have been found to be associated with risk of COPD, highlighting the potential clinical usefulness of comprehensive identification of lung function associated SNPs. 13 Low frequency (minor allele frequency (MAF) 1-5%) and rare (MAF<1%) variants, have been largely underexplored by GWAS to date. Exome arrays have been designed to facilitate the investigation of these low frequency and rare variants, predominately within coding regions, in large sample sizes.
Alongside a core content of rare coding SNPs, the exome array additionally includes common variation including tags for previously identified GWAS hits, ancestry informative SNPs, a grid of markers for estimating identity by descent and a random selection of synonymous SNPs. 19

Results
We carried out a meta-analysis of exome array data and three lung function measures: forced expiratory volume in one second (FEV1), forced vital capacity (FVC) and the ratio of FEV1 to FVC (FEV1/FVC). These analyses included 68,470 individuals from the SpiroMeta and CHARGE consortia in a discovery analysis, with follow-up in an independent sample of up to 111,556 individuals. All studies are listed with their study-specific sample characteristics in Table 1, with full study descriptions, including details of spirometry and other measurements described in the Supplementary Note. The genotype calling procedures implemented by each study (Supplementary Table 1) and quality control of genotype data are described in the Supplementary Methods. We have undertaken both single variant analyses, and gene-based associations, which test for the joint effect of several rare variants in a gene (see methods for details).

Meta-analyses of single variant associations
We first evaluated single variant associations between FEV1, FVC and FEV1/FVC and the 179,215 SNPs which passed study level quality control and were polymorphic in both consortia. These analyses identified 34 SNPs in regions not previously associated with lung function, showing association with at least one trait at overall P<10 -5 , and showing association with consistent direction and P<0·05 in both consortia (full results in Supplementary Table 2, quantile-quantile and Manhattan plots shown in Supplementary Figure 1). We followed up these SNP associations in a replication analysis comprising 3 studies with 111,556 individuals. Combining the results from the discovery and replication stages in a meta-analysis identified six SNPs in total that were independent to known signals and met the pre-defined significance threshold (P<2·8x10 -7 ) overall in, or near to FGF10, LY86, SEC24C, RPAP1, CASC17 and UQCC1 (Table 2, Supplementary Figure 2). A SNP near to the CASC17 signal (rs11654749, r 2 =0·3 with rs1859962) has previously been associated with FEV1 in a genomewide analysis of gene-smoking interactions, although this association was not replicated at the time; 20 the present analysis provides the first evidence for independent replication of this signal. A seventh signal was also identified in LCT (Table 2, Supplementary Figure 2); whilst this locus has not previously been implicated in lung function, this SNP is known to vary in frequency across European populations, 21 and we cannot rule out that this association is not an artefact of population structure. Our discovery analysis furthermore identified associations (P<10 -5 ) in 25 regions previously associated with one or more of FEV1, FVC and FEV1/FVC (Supplementary Table 3).
Generally, the observed effect of the SNPs at the novel signals were similar in ever and never smokers; the exception was rs1448044 near FGF10, which showed a significant association with FVC only in ever smokers in our discovery analysis (ever smokers P=1·49x10 -6 ; never smokers P=0·695, Supplementary Table 4 and Supplementary Figure 3). In the replication analysis, however, this association was observed in both ever and never smokers (ever smokers P=3·14x10 -5 ; never smokers P=1·40x10 -4 , Supplementary Table 5). For rs1200345 (RPAP1) and rs1859962 (CASC17)), associations were most statistically significant in the analyses restricted to individuals of European Ancestry (Supplementary Table 4 and Supplementary Figure 3), as was the association with rs2322659 (LCT), giving further support that this association may be due to population stratification.

Meta-Analyses of gene-based associations
We undertook Weighted Sum Tests (WST) 22 and Sequence Kernel Association tests (SKAT) 23 to assess the joint effects of multiple low frequency variants within genes on lung function traits. In our discovery analyses of all 68,470 individuals, we tested up to 14,380 genes that had at least two variants with MAF<5% and met the inclusion criteria (exonic or loss of function [LOF], see methods for definitions) in both consortia. The SKAT analyses identified 16 genes associated (P<0·05 in both consortia and overall P<10 -4 ) with FEV1, FVC or FEV1/FVC (Supplementary Table 6), whilst the WST analyses identified 12 genes (Supplementary Table 7). There was one gene (LY6G6D) that was identified in both analyses. These genes were followed up in UK Biobank, with two genes, GPR126 and LTBP4, showing evidence of replication in the exonic SKAT analysis (P<3·5x10 -6 ); however conditional analyses in UK Biobank showed that both these associations were driven by single SNPs, that were identified in the single variant association analyses and have been previously reported in GWAS of these traits (Tables E6 & E7).

Functional characterization of novel loci
In order to gain further insight into the six loci identified in our analyses of single variant associations (excluding LCT), we employed functional annotation and assessed whether identified SNPs in these regions were associated with gene expression levels. One of the identified novel SNPs was nonsynonymous, three intronic and two were intergenic. We found evidence that three of the SNPs may be involved in cis-acting regulation of the expression of several genes in multiple tissues (Supplementary Table 8).
SNP rs1200345 in RPAP1 is a nonysynomous variant, predicted to be deleterious by both SIFT (deleterious) and Polyphen (possibly damaging) (Supplementary Table 9); RPAP1 is ubiquitously expressed, with high levels of protein detected in the lung (Supplementary Table 10). SNP rs1200345 or proxies (r 2 >0·8) were also found to be amongst the most strongly associated SNPs with expression levels of RPAP1 in several tissues including lung, and with a further six genes in lung tissue (Supplementary Table 8) including TYRO3, one of the TAM family of receptor tyrosine kinases. TYRO3 regulates several processes including cell survival, migration and differentiation and is highly expressed in lung macrophages (Supplementary Table 10). Evidence for associations with gene expression was found at two more of the novel signals (sentinel SNPs rs3849969 and rs6088813), implicating a further 16 genes. Of note, in blood eQTL databases, a proxy of a SNP in complete linkage disequilibrium (r 2 =1) with rs3849969 (rs3812637) was an eQTL for plasminogen activator, urokinase (PLAU).

Discussion
We undertook an analysis of 68,470 individuals from 23 studies with data from the exome array and three lung function traits, following up the most significant single SNP and gene-based associations in an independent sample of up to 111,556 individuals. The combined analyses of our discovery and replication single variant associations identified six SNPs meeting the pre-defined significance threshold (P<2·8x10 -7 ). The replication stage results for these six SNPs also meet Bonferroni corrected significance for independent replication (P<1·47x10 -3 , corrected for 34 SNPs being tested).
One of these SNPs is in a region that has previously been implicated in lung function (near KCJN2/SOX9), 20 whilst the remaining five SNPs, although all common, have not previously been identified in other GWAS of lung function. In a recent 1000 Genomes imputed analysis of lung function (which includes some of the studies contributing to the present discovery analysis), all of these SNPs showed at least suggestive association (2·97x10 -3 >P>1·28x10 -5 ) with one or more lung function trait, but none reached the required threshold (P<5x10 -6 ) to be taken forward for replication in that analysis. 12 We further identified a seventh association (P<2·8x10 -7 ) with rs2322659 in LCT. Given SNPs in this region are known to vary in frequency across European populations, we cannot dismiss the possibility that this association may be confounded by population stratification; hence we do not report this signal as a novel lung function locus. We undertook a look-up of associations in our discovery meta-analyses of 7 loci (including LCT) that were identified as showing differences in allele frequency between individuals from different regions in the UK, 24 and subsequently across European populations. 25 Aside from the association between the LCT locus and FVC, no significant associations were observed between SNPs at these loci and any lung function trait, in either the analyses restricted to EA individuals, or in the analysis of EA and AA individuals combined (Supplementary   Table 11); this suggests population structure was generally accounted for adequately in our analyses.
One of the novel signals was with a nonsynonymous SNPs: rs1200345 in RPAP1, which is predicted to be deleterious. This SNP and proxies with r 2 >0·8 were also associated with expression in lung tissue of seven genes, including RPAP1 and the TAM receptor TYRO3. TAM receptors play a role in the inhibition of Toll-like receptors (TLRs)-mediated innate immune response by initiating the transcription of cytokine signalling genes (SOCS-1 and 3) which limit cytokine overproduction and inflammation. 26,27 It has been shown that influenza viruses H5N1 and H7N9 can cause downregulation of Tyro3, resulting in an increased inflammatory cytokine response. 27 Three further signals were with intronic SNPs in SEC24C, CASC17, and UQCC1. Two of these intronic SNPs have previously been implicated in GWAS of other traits: rs1859962 in CASC17 with prostate cancer 28 and rs6088813 in UQCC1 with height. 29 The CASC17 locus, near KCNJ2/SOX9 has also previously been implicated in lung function, showing significant association with FEV1 in a genomewide analysis of gene-smoking interactions, however this association was not formally replicated. 20 Whilst the individuals utilised in the discovery stage of this analysis overlap with those included in this previous interaction analysis, the replication stage of the present study provides the first evidence of replication for this signal in independent cohorts. In the present analysis, there was no evidence that the results differed by smoking status.
SNPs rs6088813 in UQCC1 and rs3849969 in SEC24C were identified as eQTLs for multiple genes.
Notably, a SNP in complete linkage disequilibrium with rs3849969 (rs3812637, r 2 =1) is associated with expression of PLAU in blood. The plasminogen activator, urokinase (PLAU) plays a role in fibrinolysis and immunity, and with its receptor (PLAUR) is involved in degradation of the extra cellular matrix, cell migration, cell adhesion and cell proliferation. 30 A study of preterm infants with respiratory distress syndrome, a condition characterised by intra-alveolar fibrin deposition, found PLAU and its inhibitor SERPINE1 to be expressed in the alveolar epithelium, and an increased ratio of SERPINE1 to PLAU was associated with severity of disease. 31 Studies in mice have also shown that increased expression of PLAU may be protective against lung injury, by reducing fibrosis. 32 PLAU has also been found to be upregulated in lung epithelial cells subjected to cyclic strain 33 and in patients with COPD and lung cancer, PLAU was found to be expressed in alveolar macrophages and epithelial cells. 30 The final two signals were with common intergenic SNPs close to LY86 and FGF10. LY86 (lymphocyte antigen 86) interacts with the Toll-like receptor signalling pathway, when bound with RP105 to form a heterodimer. 34 The sentinel SNP rs1294421 has previously been associated with waist-hip ratio, 35 and an intronic SNP within LY86 (rs7440529, LD with rs1294421: r 2 =0·005) has previously been implicated in asthma in two studies of individuals of Han Chinese ancestry. 36,37 FGF10 is a member of the fibroblast growth factor family of proteins, involved in a number of biological processes, including embryonic development, cell growth, morphogenesis, tissue repair, tumor growth and invasion. Specifically, the FGF10 signalling pathway plays an essential role in lung development and lung epithelial renewal. 38 A study in mice demonstrated that a deficiency in FGF10 resulted in a fatal disruption of branching morphogenesis during lung development. 39 Our discovery analyses included individuals of both European and African ancestry. Two of the identified six novel signals showed inconsistent effects in the African and European ancestry individuals. For these SNPs, the associations in African Ancestry individuals were not statistically significant, and we report associations from the analysis restricted to European ancestry individuals only. For the remaining four SNPs similar effects were observed in both the European and African ancestry individuals (Supplementary Figure 3). We also examined the effects of the novel SNPs in ever smokers and never smokers separately and found these to be broadly similar, with the exception of rs1448044 in FGF10, which in the discovery analysis showed significant association with FVC in ever smokers, whilst showing no association in never smokers (P=0·695). In our replication stage analyses, similar effects were seen in both ever and never smokers for this SNP however, and the combined analysis of discovery and replication stages for this SNP, including both ever and never smokers, met the exome chip-wide significance level overall (P=4·22x10 -9 ). We also considered whether this signal could be driven by smoking behaviour in our discovery stage as our primary analyses in SpiroMeta did not adjust for smoking quantity. We undertook a look-up of this SNP in the publicly available results of a GWAS of several smoking behaviour traits; 40 there was only weak evidence that this SNP was associated with ever versus never smoking (P=0·039), and no evidence for association with amount smoked (cigarettes per day, P=0·10).
Through the use of the exome array, we aimed to identify associations with low frequency and rare functional variants, thereby potentially uncovering some of the missing heritability of lung function.
However, whilst our discovery analyses identified single SNP associations with 23 low frequency variants (Supplementary Table 2), we did not replicate any of these findings. Eleven of these 23 SNPs we were unable to follow-up in our replication studies, due to them either being not genotyped, or monomorphic. Overall, our lack of convincing associations with rare variants is likely due to limited statistical power for identifying single variant associations, particularly if those variants exhibit only modest effects. 41 We additionally employed SKAT and WST gene-based tests to investigate the joint effects of low frequency and rare variants within genes, on lung function traits. These analyses identified associations with a number of genes that did not appear to be driven by single SNPs.
Replication of these signals proved difficult however, as again many SNPs included within the discovery stage of these analyses were not genotyped, or were monomorphic in the replication studies. This often meant a disparity in the gene unit being tested in our discovery and replication samples; hence the interpretation of these results was not straightforward. In the end, we were able to replicate only findings with common SNPs. This finding is in line with several other studies of complex traits and exome array data that have been unable to report robust associations with low frequency variants [42][43][44] and it is clear that future studies, will require increasingly larger sample sizes in order to fully evaluate the effect of variants across the allele frequency spectrum. The identification of common SNPs remains important however, as such findings have the potential to highlight drug targets, 45 and these variants collectively could have utility in risk prediction.
This study has identified six common SNPs, independent to signals previously implicated in lung function. Further interrogation of these loci could lead to greater understanding of lung function and lung disease, and could provide novel targets for therapeutic interventions.

Study Design, cohorts and genotyping
The SpiroMeta analysis included 23

Statistical analyses
Consortium level analyses: Within the SpiroMeta Consortium, each study contributing to the discovery analyses calculated single-variant score statistics, along with covariance matrices describing correlations between variants, using RAREMETALWORKER 46 or rvtests. 47 For each trait, these summary statistics were generated separately in ever and never smokers, with adjustment for sex, age, age 2 and height, and with each trait being inverse normally transformed prior to association testing. For studies with unrelated individuals, further adjustments were made for the first 10 ancestry principal components, whilst studies with related individuals utilised linear mixed models to account for familial relationships and underlying population structure.
Within the CHARGE Consortium, each study generated equivalent summary statistics using the R package SeqMeta. 48 For each trait, summary statistics were generated in ever and never smokers separately, and in all individuals combined. The untransformed traits were used for all analyses, adjusted for smoking status and pack-years, age, age 2 , sex, height, height 2 , centre/cohort. Models for FVC were additionally adjusted for weight. Linear regression models, with adjustment for principal components of ancestry were used for studies with unrelated individuals, and linear mixed models were used for family-based studies.
Within each consortium we used the score statistics and variance-covariance matrices generated by each study to construct both single variant and gene-based tests using either RAREMETAL 46 (SpiroMeta) or SeqMeta 48 (CHARGE). For single variant associations, score statistics were combined in fixed effects meta-analyses. Two gene-based tests were constructed: first, the Weighted Sum Test (WST) using Madsen Browning weightings, 22 and secondly, the Sequence Kernel Association Test (SKAT). 23 We performed the SKAT and WST tests using two subsets of SNPs: 1) including all SNPs with an overall consortium-wide MAF<5% that were annotated as splicing, stopgain, stoploss, or frameshift (loss of function [LOF] analysis), and 2) including all SNPs meeting the LOF analysis criteria in addition to all other nonsynonymous variants with consortium wide MAF<5% (exonic analysis).
Variants were annotated to genes using dbNSFP v2·6 49 on the basis of the GRCh37/hg19 database.
For both single variant and gene-based associations, consortium-level results were generated for ever smokers and never smokers separately, and in all individuals combined. Within the CHARGE Consortium, results were combined separately for the EA and AA studies and also in a trans-ethnic analysis of both ancestries.
Combined Meta-Analysis: The single variant association results from the SpiroMeta and CHARGE consortia were combined as follows: The genomic inflation statistic (λ) was calculated for SNPs with consortium-wide MAF>1%; where λ had a value greater than one, genomic control adjustment was applied to the consortium level P-values. The consortium-level results were then combined using sample size weighted z-score meta-analysis. The λ was again calculated for the meta-analysis results and genomic control applied, as appropriate. Since we were interested in identifying low frequency and rare variants, we applied no MAF or minor allele count (MAC) filter. We identified SNPs of interest as those with an overall P<10 -5 and a consistent direction of effect and P<0·05 observed in both consortia. Where we identified a SNP within 1Mb of a previously identified lung function SNP, we deemed the SNP to represent an independent signal if it had r 2 <0·2 with the known SNP, and if it retained a P <10 -5 , when conditional analyses were carried out with the known SNP, or a genotyped proxy, using data from the SpiroMeta Consortium, or UK Biobank. Our primary meta-analysis included all individuals; we additionally carried out analyses in smoking subgroups (ever and never smokers), and in the subgroup of individuals of European ancestry only.
For genes which contained at least 2 polymorphic SNPs in both consortia, we combined the results of the consortium level gene based tests using either z-score meta-analysis (WST) or Fisher's Method for combining P-values (SKAT). We identified genes of interest as those with P<0·05 observed in both consortia and an overall P<10 -4 . As in the analyses of single variant associations, our primary metaanalyses included all individuals, with secondary analyses undertaken in smoking and ancestry specific subgroups.
Replication Analyses: All SNP and gene-based associations were followed up for the trait with which they showed the most statistically significant association only. For associations identified through the smoking subgroup analyses, we followed up associations in the appropriate smoking strata; however no ancestry stratified follow-up was undertaken as replication studies included only a sufficient number of individuals of European Ancestry.
Single variant associations in UK Biobank were tested in ever smokers and never smokers separately using the score test as implemented in SNPTEST v2·5b4. 50 Traits were adjusted for age, age 2 , height, sex, ten principal components and pack-years (ever smokers only), and inverse normally transformed. For UKHLS, analyses were undertaken analogously to the SpiroMeta discovery studies using RAREMETALWORKER, while for NEO, analyses were undertaken in the same way as was done in the CHARGE discovery studies using SeqMeta. The single variant results from all replication studies were combined using sample size weighted Z-score meta-analysis. Subsequently we combined the results from the discovery and replication stage analyses and we report SNPs with overall exome-wide significance of P<2·8x10 -7 (Bonferroni corrected for the original 179,215 SNPs tested).
We followed up genes of interest (P<10 -4 ) using data from UK Biobank only. Summary statistics for UK Biobank were generated using RAREMETALWORKER, with gene-based tests then constructed using RAREMETAL. Finally, we combined the results from the discovery analysis with the replication results in an overall combined meta-analysis using either z-score meta-analysis (WST) or Fisher's Method (SKAT). We declared genes with overall P<3·5x10 -6 (Bonferroni corrected for 14,380 genes tested) in our combined meta-analysis to be statistically significant. For these statistically significant genes, we carried out additional analyses using the UK Biobank data in which we conditioned on the most significantly associated individual SNP within that gene, to determine whether this was a true gene-based signal, or whether the association could be ascribed to the single SNP (if the conditional P<0·01, then association was deemed to not be driven by the single SNP).

Characterization of findings
In order to gain further insight into the loci identified in our analyses of single variant associations, we assessed whether these regions were associated with gene expression levels in various tissues (FDR of 5%, or q-value<0·05), by querying a publically available blood eQTL database 51 and the GTEx project 52 for the sentinel SNPs, or any proxy (r 2 >0·8). We further assessed SNPs of interest (and proxies) within a lung eQTL resource based on non-tumour lung tissues of 1,111 individuals. [53][54][55] Descriptions of these resources and further details of the look-ups are provided in the Supplementary Methods. Moreover, all sentinel SNPs and proxies with r 2 >0.8 were annotated using ENSEMBL's Variant Effect Predictor (VEP); 56 potentially deleterious coding variants were identified as those annotated as 'deleterious' by SIFT 57 or 'probably damaging' or 'possibly damaging' by PolyPhen-2. 58 For all genes implicated through the expression data or functional annotation, we searched for evidence of protein expression in the respiratory system by querying the Human Protein Atlas. 59    Results are shown for variant in novel loci associated (P<2·7×10 -7 ) with lung function traits in a two stage meta-analysis consisting of up to 68,470 individuals from the SpiroMeta and CHARGE Consortia in the discovery analyses, with follow-up in up to 111,556 individuals from UK Biobank, UKHLS and NEO. For each SNP, the result for the trait-smoking-ancestry combination which resulted in the most statistically significant association is given. The results for these SNPs and all three traits are shown in Supplementary Table 12. Beta values from SpiroMeta (βSp) reflect effect-size estimates on an inverse-normal transformed scale after adjustments for age, age 2 , sex, height and ancestry principal components, and stratified by ever smoking status (Analysis of All individuals only). Beta values from CHARGE (βCH) reflect effect-size estimates on an untransformed scale (litres for FEV1 and FVC; ratio for FEV1/FVC). Samples sizes (N), Z-statistics (Z) and two-sided P-values (P) are given for the combined discovery analysis and the replication analysis. Two-sided P-values are also given for the full two-stage combined analyses (discovery + replication).