Accurate clinical detection of exon copy number variants in a targeted NGS panel using DECoN

Background: Targeted next generation sequencing (NGS) panels are increasingly being used in clinical genomics to increase capacity, throughput and affordability of gene testing. Identifying whole exon deletions or duplications (termed exon copy number variants, ‘exon CNVs’) in exon-targeted NGS panels has proved challenging, particularly for single exon CNVs. Methods: We developed a tool for the Detection of Exon Copy Number variants (DECoN), which is optimised for analysis of exon-targeted NGS panels in the clinical setting. We evaluated DECoN performance using 96 samples with independently validated exon CNV data. We performed simulations to evaluate DECoN detection performance of single exon CNVs and to evaluate performance using different coverage levels and sample numbers. Finally, we implemented DECoN in a clinical laboratory that tests BRCA1 and BRCA2 with the TruSight Cancer Panel (TSCP). We used DECoN to analyse 1,919 samples, validating exon CNV detections by multiplex ligation-dependent probe amplification (MLPA). Results: In the evaluation set, DECoN achieved 100% sensitivity and 99% specificity for BRCA exon CNVs, including identification of 8 single exon CNVs. DECoN also identified 14/15 exon CNVs in 8 other genes. Simulations of all possible BRCA single exon CNVs gave a mean sensitivity of 98% for deletions and 95% for duplications. DECoN performance remained excellent with different levels of coverage and sample numbers; sensitivity and specificity were >98% with the typical NGS run parameters. In the clinical pipeline, DECoN automatically analyses pools of 48 samples at a time, taking 24 minutes per pool, on average. DECoN detected 24 BRCA exon CNVs, of which 23 were confirmed by MLPA, giving a false discovery rate of 4%. Specificity was 99.7%. Conclusions: DECoN is a fast, accurate, exon CNV detection tool readily implementable in research and clinical NGS pipelines. It has high sensitivity and specificity and an acceptable false discovery rate.
DECoN is freely available at www.icr.ac.uk/decon.


Introduction
Targeted next generation sequencing (NGS) panels are increasingly being used in clinical genomics to increase capacity, throughput and affordability of gene testing [1][2][3]. For NGS panels to be effective in the clinical setting, all variant classes need to be robustly detected. Base substitutions are accurately detected by most pipelines and detection of small insertions and deletions is improving [4-6]. However, accurate detection of deletions or duplications of whole exons, also known as exon copy number variants (exon CNVs), has proved problematic in targeted NGS data, particularly detection of single exon CNVs [7,8]. In large part this is because the breakpoints usually lie outside the region targeted by the panel, and therefore detection methods are typically based on changes in the number of reads covering each target, commonly referred to as read depth or coverage. However, coverage can vary for several reasons, such as differences in GC content or individual probe efficiencies, and careful normalisation of data is therefore required [7,8]. These challenges have led many research and clinical laboratories to either ignore exon CNVs or to use alternative detection methods [9]. The latter can lead to substantial increases in the time and cost of tests.
The Mainstreaming Cancer Genetics (MCG) programme (www.mcgprogramme.com) is working to increase access to cancer predisposition gene (CPG) testing [10]. To implement this we have developed, in collaboration with Illumina, an NGS panel targeting cancer predisposition genes called the TruSight Cancer Panel (TSCP) (http://www.illumina.com/products/trusight_cancer.html). Many CPGs are tumour suppressor genes that predispose to cancer when their functions are inactivated by loss-of-function mutations [11]. Exon CNVs are an important class of such pathogenic mutations, accounting for appreciable proportions of mutations in many genes, including BRCA1 and BRCA2 [12]. Several methods have been used for their detection in the pre-NGS era, including multiplex ligation-dependent probe amplification (MLPA), multiplex amplifiable probe hybridisation (MAPH) and array-based comparative genome hybridisation (aCGH) [9,13,14]. For NGS analysis to replace these tools, a method for exon CNV detection with high sensitivity, specificity and acceptable false discovery rate is required. For use in clinical laboratories it is also essential that the required quality control checks are fully integrated into the pipeline, so that reporting of positive and of negative tests is robustly achievable.
Several tools to detect exon CNVs in NGS sequence data have been published, including ExomeDepth, XHMM, and CONTRA [15][16][17]. Generally, these were developed for the research setting and for whole exome rather than targeted exon panels. The tools typically use coverage data from a set of samples as input, but may use different approaches for calling variants. For example, ExomeDepth selects samples from the input set that are well correlated with the sample of interest, and then fits a beta-binomial model to the sample of interest and the selected samples [16]. By contrast, XHMM performs principal component analysis normalisation on the matrix of coverage values and fits a standard normal model to the results [17]. CONTRA creates a baseline from the input set of samples and models the log ratio of the sample of interest and the baseline with a normal distribution [15].
Here we have modified ExomeDepth to develop a tool, Detection of Exon Copy Number (DECoN), which is easy to implement and integrate in clinical laboratory pipelines and can display results in an interactive GUI for user-friendly data visualisation. With extensive real and simulated data we show that DECoN has high sensitivity and specificity and can be used as the first-line exon CNV detection tool in exon-targeted NGS panel analysis.

Samples and consent
We included data from 2,016 samples, 96 samples in the evaluation set and 1,920 samples in the clinical implementation set. Data were generated on lymphocyte DNA extracted from peripheral blood or saliva. Samples in the evaluation set were from individuals recruited to our studies into discovery and characterisation of disease predisposition genes, which have been approved by the  Table 1). The remaining 65 samples were negative for BRCA exon CNVs on MLPA, and it is assumed they are also negative for exon CNVs in the other eight genes because they either have small intragenic pathogenic mutations that fully account for their phenotype, and/or their phenotype is not consistent with an exon CNV in any of the genes.

Clinical implementation set
The implementation set included 1,920 samples. One sample had suboptimal DNA quality and the data was excluded. Data from 1,919 samples were therefore included in the described analyses. Pre-existing negative BRCA MLPA data was available for 307 samples and was used to evaluate the specificity of DECoN. In the interest of patient confidentiality, individual-level sequencing data is not made available. We anticipate that users will have their own datasets against which to test DECoN. Please contact the authors if test data would be helpful.
TruSight Cancer Panel (TSCP) sequencing
TSCP data was generated on all samples. We prepared targeted DNA libraries from 50 ng genomic DNA using TSCP and TruSight Rapid Capture kit (Illumina). We followed the manufacturer's protocol with the exception of library enrichment pool complexity, which we performed in 48-plex. We sequenced a final 10 pM pooled library on a HiSeq2500 platform set in Rapid-run mode following standard protocols: 96-plex pool per flow cell, TruSeq Rapid SBS Kit, 101 bp paired-end dual index run and onboard clustering.

Detection of Exon Copy Number (DECoN)
We performed a review of the available methods and elected to build a tool through modification and optimisation of ExomeDepth v.1.0.0 [16]. This tool was chosen because of its performance and because it was open source and easy to modify. We have called the tool DECoN for Detection of Exon Copy Number.
To create DECoN, we introduced code and implementation optimisations of ExomeDepth. DECoN includes two important code modifications of ExomeDepth v.1.0.0. First it enables detection of variants affecting the first exon on a chromosome, as defined in the BED file, which was not included in previous versions. Second, the HMM transition probabilities were altered to depend upon the distance between exons, so that exons adjacent in the list of targeted regions are treated independently if they are located so far apart on the chromosome that the probability of a germline variant spanning both exons is negligible. These two modifications have also been incorporated into ExomeDepth from versions v.1.1.0 onwards.
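The idea behind the distance-dependent transition probabilities can be illustrated with a minimal Python sketch. The exponential decay, the function name and all parameter values (stay_adjacent, start_new, expected_cnv_length_bp) are assumptions chosen for illustration; the text does not specify the functional form used in DECoN or ExomeDepth.

```python
import math


def cnv_stay_probability(distance_bp, stay_adjacent=0.9, start_new=1e-4,
                         expected_cnv_length_bp=50_000):
    """Hypothetical probability that a CNV state continues into the next target.

    For touching targets the value equals stay_adjacent. As the gap grows it
    decays towards start_new, the chance of starting a fresh CNV, so distant
    targets are effectively treated as independent.
    """
    if distance_bp < 0:
        raise ValueError("distance_bp must be non-negative")
    decay = math.exp(-distance_bp / expected_cnv_length_bp)
    return start_new + (stay_adjacent - start_new) * decay


# Neighbouring exons of one gene versus targets on distant parts of a chromosome.
near = cnv_stay_probability(1_000)
far = cnv_stay_probability(10_000_000)
```

With these illustrative values, `near` stays close to stay_adjacent while `far` is practically equal to start_new, which mirrors the behaviour described above for exons that are adjacent in the BED file but far apart on the chromosome.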
DECoN also includes several features to enhance and broaden the usability of ExomeDepth. ExomeDepth is an R package and thus requires a knowledgeable R user to select, specify, and run the appropriate functions in the correct order to generate easily interpretable output. It also requires a number of dependencies, which themselves may have different versions depending on the user's local R installation, potentially impacting the final output. DECoN optimises, standardises and automates the exon CNV calling and visualisation of ExomeDepth, implementing full version control using packrat [19]. This careful approach ensures DECoN implementation is suitable for clinical laboratories, is consistent across user installations and is not affected by future changes of incorporated packages or their dependencies.
To provide a simple interface for users, DECoN requires only a set of BAM files, a BED file and a reference FASTA file. The user can supply a custom annotation file to suit their needs, for example to provide the relevant exon numbering for their genes of interest. DECoN relies on a high level of correlation between samples, comparing the sample of interest only to those with which it is well correlated. The DECoN output reports the correlation between samples and the number of samples selected for comparison for every call. This is very useful as robust variant calling in the clinical setting requires information on potential suboptimal performance in order to report positive and negative results. DECoN also allows the user to set thresholds to flag samples and/or exons which may have suboptimal performance.
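As a rough illustration of correlation-based reference selection and flagging, the following Python sketch computes the Pearson correlation between the per-exon coverage of a sample of interest and each candidate reference sample. It is a simplification and not DECoN's or ExomeDepth's actual selection procedure; the function names, the toy coverage values and the min_corr threshold are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length coverage vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant vector carries no correlation information
    return cov / (sd_x * sd_y)


def select_references(target, candidates, min_corr=0.98):
    """Return (selected reference names, best correlation, flagged).

    min_corr is an illustrative threshold. The sample is flagged when no
    candidate reaches it, i.e. when calling may be suboptimal.
    """
    scores = {name: pearson(target, cov) for name, cov in candidates.items()}
    selected = sorted((name for name, r in scores.items() if r >= min_corr),
                      key=lambda name: -scores[name])
    best = max(scores.values())
    return selected, best, len(selected) == 0


# Toy per-exon coverage values (invented for the example).
target = [100, 200, 150, 300]
candidates = {"ref_a": [110, 210, 160, 310], "ref_b": [300, 100, 250, 120]}
selected, best, flagged = select_references(target, candidates)
```

Reporting both the best correlation and the number of selected samples, as the DECoN output does, lets a laboratory distinguish a confident negative result from a sample that could not be analysed reliably.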
The DECoN output contains information on all exon CNVs called, their clinical annotations, and a list of regions and/or samples where calling may be suboptimal. An automatic visualisation of each result is generated as a PDF file; typical examples are shown in Figure 1. Furthermore, interactive visualisation of results is implemented using shiny: Web application framework for R (v 0.12.0) (available from shiny archive link on https://cran.r-project.org/web/packages/shiny/index.html) and can be launched in a modern browser such as Firefox, Chrome or Internet Explorer (v.10 or later) using a simple interface for Windows, Mac OSX

Multiplex ligation-dependent probe amplification (MLPA)
MLPA was used to evaluate all calls detected by DECoN using the appropriate probe kits and protocols from MRC Holland, as previously described [14].

Simulations
Detection of single exon CNVs is known to be particularly challenging [7,21,22]. To better evaluate DECoN performance in single exon CNV detection we simulated single exon deletions and duplications in BRCA1 and BRCA2 in a single pool, using 48 samples from the evaluation set that were known to be negative for BRCA exon CNVs. This simulated data was based on real data, using the variation and fluctuations observed in the true coverage to model the simulated coverage. To simulate a duplication or deletion of a single exon, the observed coverage of that exon in a randomly selected sample was increased or decreased by 50%, respectively. This was repeated 1,000 times for each possible variant. Sensitivity was calculated as the percentage of the 1,000 repeats which were successfully detected.
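The simulation loop described above can be sketched in Python as follows. The coverage layout (a dict of per-exon lists), the function names and in particular toy_detector are hypothetical stand-ins; the detector is a simple ratio test used only to make the sketch runnable and is not DECoN's calling model.

```python
import random


def simulate_single_exon_cnv(coverage, sample, exon, kind):
    """Copy the coverage table and scale one exon of one sample by +/-50%."""
    factor = {"deletion": 0.5, "duplication": 1.5}[kind]
    simulated = {name: list(values) for name, values in coverage.items()}
    simulated[sample][exon] = coverage[sample][exon] * factor
    return simulated


def toy_detector(coverage, sample, exon, tolerance=0.25):
    """Hypothetical caller: compare the exon to the mean of the other samples."""
    others = [values[exon] for name, values in coverage.items() if name != sample]
    ratio = coverage[sample][exon] / (sum(others) / len(others))
    return abs(ratio - 1.0) > tolerance


def estimate_sensitivity(coverage, exon, kind, detector, repeats=1000, seed=0):
    """Percentage of simulated variants in a random sample that are detected."""
    rng = random.Random(seed)
    samples = sorted(coverage)
    detected = 0
    for _ in range(repeats):
        sample = rng.choice(samples)
        simulated = simulate_single_exon_cnv(coverage, sample, exon, kind)
        if detector(simulated, sample, exon):
            detected += 1
    return 100.0 * detected / repeats


# Toy coverage for four CNV-negative samples across three exons (invented values).
pool = {"s1": [100, 200, 300], "s2": [105, 195, 310],
        "s3": [98, 205, 290], "s4": [102, 198, 305]}
deletion_sensitivity = estimate_sensitivity(pool, 1, "deletion", toy_detector)
```

Because the simulated variant is injected into real CNV-negative coverage, the exon-to-exon and sample-to-sample variation of the original data is preserved, which is the property the text relies on.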
We also performed simulations to evaluate the effect of varying the coverage and/or the number of samples in an enrichment pool. Simulated data was generated based on the evaluation set by first selecting an enrichment pool, then selecting samples from that enrichment pool for up or down sampling. When selecting 96 samples, all samples from the evaluation set were used. For any read, r, it was assumed to contribute N_r times to the simulated data set, where N_r was drawn from a Binomial(n, p) distribution. This was assumed for all reads from the selected samples. The values of n and p were chosen to provide the correct level of up or down sampling and the closest approximation to a Poisson distribution, which the original coverage values are assumed to follow. Sensitivity and specificity were determined by assessing detection of variants known to be present in the original data. Simulated datasets which did not contain any samples with an exon CNV were excluded from the sensitivity calculations and simulated datasets entirely comprised of samples with exon CNVs were excluded from the specificity calculations.
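A minimal Python sketch of this per-read binomial resampling is given below. The rule used to pick n and p is an assumption: if the original coverage is Poisson and each read is replicated Binomial(n, p) times, the variance-to-mean ratio of the resampled coverage is 1 + p(n - 1), so for a fixed scaling factor n * p the smallest admissible n stays closest to Poisson. The text does not state the values the authors used, and the function names are hypothetical.

```python
import math
import random


def binomial_parameters(factor):
    """Pick (n, p) with n * p == factor using the smallest possible n.

    For down-sampling (factor <= 1) this gives n = 1, i.e. plain thinning.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n = max(1, math.ceil(factor))
    return n, factor / n


def resample_read_count(num_reads, factor, rng):
    """Total simulated reads when each original read contributes Binomial(n, p) copies."""
    n, p = binomial_parameters(factor)
    total = 0
    for _ in range(num_reads):
        # One Binomial(n, p) draw, written as n independent Bernoulli(p) trials.
        total += sum(1 for _ in range(n) if rng.random() < p)
    return total


rng = random.Random(1)
down_sampled = resample_read_count(1000, 0.5, rng)  # about half of the reads kept
up_sampled = resample_read_count(1000, 1.6, rng)    # about 1.6 times the reads
```

Resampling at the level of individual reads, rather than rescaling the coverage totals, keeps the count noise of the simulated data realistic at each target.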

Simulation analyses
To further explore the performance of DECoN in the detection of single exon CNVs we simulated a single exon deletion and duplication for each of the exons in BRCA1 and BRCA2 using the observed TSCP data from the evaluation set. The sensitivity for each exon deletion or duplication is shown in Figure 2. Sensitivity for single exon deletions was excellent, >94% in every exon with a mean of 98%. The sensitivity for single exon duplications was somewhat lower, with a mean of 95%, but these are known to be the most challenging exon CNV to detect.
In targeted sequencing, there are several experimental parameters that can impact exon CNV detection. For example, enrichment and/or sequencing can be performed in pools of different sample sizes. This will affect the coverage per sample and the number of samples available to use as reference samples. In turn this could affect DECoN performance. To evaluate this we extended our simulation framework to test in silico the effect of different combinations of sample size and coverage on DECoN's sensitivity and specificity.
The 96 samples in the evaluation set were sequenced in a single HiSeq2500 run using Rapid-run mode and a single flow cell which outputs a maximum of 300 Million (M) read pairs per run, and thus have a maximum of 3.125M read pairs per sample. By comparison, a MiSeq platform using the reagent kit v2 can produce a maximum of 15M read pairs per run. The evaluation set was thus up-sampled or down-sampled to obtain varying numbers of reads per sample. DECoN was then run on each simulated set and the sensitivity and specificity were calculated using the known exon CNVs in the evaluation set.
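The per-sample read budget quoted above follows from simple division, sketched here in Python. The run outputs are the figures given in the text; the function names and the pool sizes passed to them other than 96 are illustrative choices.

```python
def read_pairs_per_sample(run_output_read_pairs, samples_per_run):
    """Maximum read pairs available to each sample when a run is shared equally."""
    if samples_per_run <= 0:
        raise ValueError("samples_per_run must be positive")
    return run_output_read_pairs / samples_per_run


def max_samples_per_run(run_output_read_pairs, min_read_pairs_per_sample):
    """Largest pool size that still gives every sample the requested read pairs."""
    return run_output_read_pairs // min_read_pairs_per_sample


HISEQ_RAPID_RUN = 300_000_000  # read pairs per run, single flow cell (from the text)
MISEQ_V2_RUN = 15_000_000      # read pairs per run, reagent kit v2 (from the text)

hiseq_per_sample = read_pairs_per_sample(HISEQ_RAPID_RUN, 96)   # 3,125,000
miseq_per_sample = read_pairs_per_sample(MISEQ_V2_RUN, 12)      # 1,250,000
miseq_pool_size = max_samples_per_run(MISEQ_V2_RUN, 2_500_000)  # 6
```

These are upper bounds that assume reads are distributed evenly across samples; real runs will show some imbalance between samples.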
The simulation results are shown in Table 2. If the sample size was reduced to six samples sequenced at 1.25M reads per sample, the sensitivity was compromised (87%), but if the six samples were sequenced with coverage ≥ 2.5M reads per sample, sensitivity of ≥ 95% was still achievable. Sequencing 12 samples at 1.25M reads per sample, compatible with a typical MiSeq run, produced good sensitivity (92%) and excellent specificity (99%). In general, sequencing higher sample sizes per pool increased sensitivity, but if at least 2.5M reads per sample are generated, the estimated sensitivity was ≥ 95% for all sample sizes evaluated (Table 2).

DECoN clinical implementation
DECoN was incorporated into a clinical pipeline for BRCA analysis using TSCP and applied to 1,919 samples. DECoN was run for each of the 40 pools of 48 samples and took, on average, 24 minutes for each pool (range 19-29 minutes). In total, 23 exon CNVs were detected by DECoN and confirmed by MLPA (Supplementary File 3). One exon CNV (a single exon  gene panels, such as CoNVaDING.
P3. The TSCP contains 94 genes. An explanation is needed why only 10 of those genes are used as evaluation set.
P4. 'DECoN relies on a high level of correlation between samples'. It is unclear if this correlation should already be present in the samples analyzed (by analyzing only samples belonging to the same pool) or if DECoN performs a correlation calculation and based on that calculation selects a subset of samples for further calculations. Can DECoN also analyze samples that are not sequenced in the same pool?
P4. Is the correlation score in the report used as a quality metric, and if yes, how? And how were the correlation scores for the samples analyzed?
P4. What are the suggested thresholds to flag samples/exons which may have suboptimal performance?
P6. Were such thresholds used in the evaluation set? Is the Bayes factor used as part of the quality metric? And if yes, how many samples and exons were flagged? In my opinion this is a critical step in analysis. For clinical detection it is just as important to know which exons can't be reliably analyzed in the data as to know which can. This creates the difference between a false negative result and a failed sample/failed exon analysis. What was the flag status of the exon 8-11 FP duplication?
P6. Why were simulated samples entirely comprised of samples with exon CNVs excluded from specificity calculations? There can still be a false positive result in other (two copy) exons.
P6. Please state explicitly the specificity of the 8 non-BRCA genes in the evaluation set, even though this is 100%.
P6. It would also be interesting to know the specificity and sensitivity for the other 84 genes in the TSCP.
P6. How is the quality of the evaluated exons compared to the other exons in the TSCP? Can the sensitivity and specificity and exons passing quality control be extrapolated to genes outside the evaluation panel?
P6 and table 2. Instead of the number of reads per sample I would mention the average coverage / read depth of the evaluated exons. This is a more informative value, since it is independent of panel size.
P7, figure 2: What is the flag status of exon 20 in the duplication set? In other words, why does this exon perform worse than the others, especially since the sensitivity for deletions is so high?
P7, figure 2: Can dots for specificity be added to the graph?
P8. How does DECoN perform compared to the other named tools?
P8. Is DECoN tailor made for these 10 TSCP genes analyzed in pools, or is the performance generalizable to other targeted capturing panels?
Competing Interests: No competing interests were disclosed.
We have read this submission. We believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.