The ICR96 exon CNV validation series: a resource for orthogonal assessment of exon CNV calling in NGS data

Detection of deletions and duplications of whole exons (exon CNVs) is a key requirement of genetic testing. Accurate detection of this variant type has proved very challenging in targeted next-generation sequencing (NGS) data, particularly if only a single exon is involved. Many different NGS exon CNV calling methods have been developed over the last five years. Such methods are usually evaluated using simulated and/or in-house data due to a lack of publicly available datasets with orthogonally generated results. This hinders tool comparisons, transparency and reproducibility. To provide a community resource for assessment of exon CNV calling methods in targeted NGS data, we present the ICR96 exon CNV validation series. The dataset includes high-quality sequencing data from a targeted NGS assay (the TruSight Cancer Panel) together with Multiplex Ligation-dependent Probe Amplification (MLPA) results for 96 independent samples. 66 samples contain at least one validated exon CNV and 30 samples have validated negative results for exon CNVs in 26 genes. The dataset includes 46 exon CNVs in BRCA1, BRCA2, TP53, MLH1, MSH2, MSH6, PMS2, EPCAM or PTEN, giving excellent representation of the cancer predisposition genes most frequently tested in clinical practice. Moreover, the validated exon CNVs include 25 single exon CNVs, the most difficult type of exon CNV to detect. The FASTQ files for the ICR96 exon CNV validation series can be accessed through the European Genome-phenome Archive (EGA) under the accession number EGAS00001002428.

The use of targeted next-generation sequencing (NGS) in clinical genomics has increased the capacity, throughput and affordability of gene testing [1][2][3]. Use of NGS data in the clinical setting requires comprehensive validation of methods. Ideally, this should include evaluation of the NGS test performance in samples with pre-determined positive and negative results to provide information on sensitivity, specificity and false detection rate.
Deletions and duplications of whole exons, termed 'exon copy number variants' or 'exon CNVs', are an important class of clinically relevant gene mutations [4]. Accurate exon CNV detection has proved difficult in targeted NGS data, particularly if only a single exon is affected [5]. This has led many research and clinical laboratories to either exclude exon CNV detection or to use separate methods for their detection, which can substantially increase the time and cost of tests.
Datasets are available for base substitutions and small insertions and deletions [6,7] and for copy number variants [6], but datasets with experimentally validated exon CNV data are not widely available. As a result, methods for detecting exon CNVs in NGS data are usually evaluated using simulated and/or in-house data. This hinders tool comparisons, transparency and reproducibility.
We recently released DECoN (www.icr.ac.uk/DECoN), a tool optimised to detect exon CNVs in targeted NGS panels in the clinical setting [8]. During our validation of DECoN performance, we utilised samples with orthogonally generated exon CNV data. This proved extremely valuable in our evaluations, and we believe such data will also be highly useful to others. We have therefore put together the ICR96 exon CNV validation series, which we present here. This was undertaken as part of the Transforming Genetic Medicine Initiative (TGMI, www.thetgmi.org), a Wellcome-funded initiative which is developing frameworks and resources to facilitate genetic medicine. The series has been extremely helpful in our assessment of exon CNV detection tools, and the comprehensive orthogonal data allow evaluation of sensitivity, specificity and false detection rate. We believe the ICR96 exon CNV validation series could serve as a benchmarking set, particularly for the many clinical and research laboratories now undertaking cancer predisposition gene testing.

Materials and methods
The data included in this resource were generated from two sources. The first comprises the BOCS, FACT and COG studies, which aimed to discover and characterise disease predisposition genes. All patients gave informed consent for use of their DNA in genetic research. The studies have been approved by the London Multicentre Research Ethics Committee (MREC/01/2/18, MREC/01/2/044 and 05/MRE02/17, respectively). The second is clinical testing by the TGLclinical laboratory, an ISO 15189 accredited genetic testing laboratory that we run. The consent given by patients tested through TGLclinical includes, as standard, consent for the use of samples for quality control. It also provides the option of consenting to the use of samples/data in research; all patients whose data were included in the ICR96 series approved this option.
We generated high-quality targeted NGS data for the ICR96 exon CNV validation series using the TruSight Cancer Panel (TSCP) v2 which targets exons from 100 cancer predisposition genes (Supplementary File 1). We prepared targeted DNA libraries from 50 ng genomic DNA using the TSCP and TruSight Rapid Capture kit (Illumina). We followed the manufacturer's protocol, with the exception of library enrichment pool complexity, which we performed in 48-plex. We sequenced a final 10 pM pooled library on a HiSeq 2500 platform set in Rapid-run mode following standard protocols:  Table 1). The EZH2 exon 1-20 deletion was identified by comparative genomic hybridisation (CGH) array and was also confirmed by fluorescent in situ hybridisation (FISH). For simplicity, we included this one CGH result with the MLPA results.
We provide genomic coordinates in both build 37 and build 38 for all results (Supplementary Table 1). The genomic coordinates are the most 5' and most 3' coordinates of the exons involved in the exon CNV, as determined by MLPA, according to the specified transcript. Of note, these are not the actual breakpoints; neither MLPA nor targeted NGS data can provide breakpoint sequence information for exon CNVs. We provide the MLPA results for all exon CNVs using the following notation: "Exon X deletion/duplication" for single exon CNVs and "Exon X-Y deletion/duplication" for exon CNVs involving more than one exon. Here X is the number of the first exon involved in the exon CNV with respect to the transcript, Y is the number of the last exon involved, and deletion or duplication is specified as appropriate. For all genes except BRCA1, the numbering is consecutive from the first non-coding exon in the transcript. For BRCA1, we use the conventional clinical numbering system, which does not include exon 4.
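The notation described above can be read mechanically. The following is a minimal sketch, assuming that the result strings follow the pattern exactly as written in the text (for example "Exon 13 duplication" or "Exon 1-20 deletion"); the names `ExonCNV`, `parse_mlpa_result` and `is_single_exon` are illustrative and are not part of the dataset.

```python
import re
from typing import NamedTuple


class ExonCNV(NamedTuple):
    first_exon: int  # X in the notation
    last_exon: int   # Y in the notation (equal to X for single exon CNVs)
    kind: str        # "deletion" or "duplication"


# Assumed pattern: "Exon X deletion/duplication" or "Exon X-Y deletion/duplication".
_PATTERN = re.compile(r"^Exon (\d+)(?:-(\d+))? (deletion|duplication)$")


def parse_mlpa_result(text: str) -> ExonCNV:
    """Parse one MLPA result string into first exon, last exon and type."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised MLPA result: {text!r}")
    first = int(match.group(1))
    last = int(match.group(2)) if match.group(2) else first
    if last < first:
        raise ValueError(f"last exon precedes first exon: {text!r}")
    return ExonCNV(first, last, match.group(3))


def is_single_exon(cnv: ExonCNV) -> bool:
    """True when the CNV involves exactly one exon."""
    return cnv.first_exon == cnv.last_exon


# Note: last_exon - first_exon + 1 is not always the number of exons involved,
# because the BRCA1 clinical numbering does not include exon 4.
```

For instance, `parse_mlpa_result("Exon 1-20 deletion")` yields a record with first exon 1, last exon 20 and type "deletion", and `is_single_exon` returns `False` for it.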

Dataset
The ICR96 exon CNV validation series includes samples from 96 individuals. 66 samples contain at least one validated exon CNV and 30 samples have validated negative results for exon CNVs in 26 genes (Supplementary Table 1). Two of the 66 individuals had an exon CNV in two different genes, such that the dataset includes a total of 68 exon CNVs. This includes 25 single exon CNVs, the most difficult type of exon CNV to detect.
The dataset can be used to evaluate the performance of any tool that aims to detect exon CNVs in NGS data. It has particular utility in validating cancer predisposition gene exon CNV detection. The dataset has excellent representation of the cancer predisposition genes most frequently tested in clinical practice. BRCA1 and BRCA2 are particularly well represented, with 15 BRCA1 exon CNVs and 10 BRCA2 exon CNVs, of which 11 and four, respectively, are single exon CNVs. The 25 BRCA1 and BRCA2 exon CNVs include 22 different mutations. We deliberately included, in the same pool, two separate samples with a BRCA1 exon 13 duplication. This small exon duplication is one of the most common BRCA1 mutations in the UK [13], and hence we wanted to cover the clinical scenario of having two different individuals with this mutation in the same sequencing run. To provide further representation of the cancer predisposition genes most frequently tested in clinical practice, the dataset includes 21 exon CNVs in MLH1, MSH2, MSH6, PMS2, EPCAM, PTEN or TP53. Between the two pools, we ensured there was no difference in the representation of exon CNVs in any particular gene or in the proportion of samples without an exon CNV, to minimise potential batch effects (Table 1).
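As an illustration of how a tool's calls might be scored against the MLPA results, the sketch below compares calls and validated results at the level of (sample, gene) pairs. It is a simplified example under stated assumptions, not the evaluation procedure used for any published tool: it assumes false detection rate means FP / (FP + TP), it ignores exon extent and deletion/duplication type when matching, and the sample IDs and the function name `evaluate_calls` are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple

Key = Tuple[str, str]  # (sample ID, gene)


def evaluate_calls(
    tested: Iterable[Key],
    truth: Iterable[Key],
    called: Iterable[Key],
) -> Dict[str, Optional[float]]:
    """Score calls against validated results over the tested (sample, gene) pairs."""
    tested_set = set(tested)
    truth_set = set(truth) & tested_set    # validated exon CNVs
    called_set = set(called) & tested_set  # CNVs reported by the tool

    tp = len(truth_set & called_set)
    fn = len(truth_set - called_set)
    fp = len(called_set - truth_set)
    tn = len(tested_set) - tp - fn - fp

    def ratio(numerator: int, denominator: int) -> Optional[float]:
        # Undefined when the denominator is zero.
        return numerator / denominator if denominator else None

    return {
        "TP": tp, "FN": fn, "FP": fp, "TN": tn,
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        # Assumption: false detection rate = FP / (FP + TP).
        "false_detection_rate": ratio(fp, fp + tp),
    }


# Toy data with hypothetical sample IDs (not taken from the ICR96 series).
toy_tested = [(s, g) for s in ("S1", "S2", "S3", "S4") for g in ("BRCA1", "BRCA2")]
toy_truth = [("S1", "BRCA1"), ("S2", "BRCA2")]
toy_called = [("S1", "BRCA1"), ("S3", "BRCA1")]
toy_result = evaluate_calls(toy_tested, toy_truth, toy_called)
```

A stricter comparison could additionally require that the called exons and the deletion/duplication type agree with the MLPA result before counting a true positive.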

Competing interests
No competing interests were disclosed.

Grant information
The provision of this dataset was supported by funding from Wellcome Trust (200990/Z/16/Z) to the TGMI programme.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Acknowledgements
We are grateful to the TGLclinical team and the Rahman recruitment team, without whom the data would not have been available. We are grateful to MRC-Holland for providing the NGS-based MLPA data. We acknowledge support from the NIHR RM/ICR Specialist Biomedical Research Centre for Cancer.

Supplementary File 1. Targets of the Illumina TruSight Cancer Panel (TSCP) in BED file format.
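A target file in BED format can be read with a few lines of code. The sketch below assumes only the three mandatory BED columns (chromosome, 0-based start, exclusive end) and skips header lines; the exact layout of the supplementary file, including any additional columns, is not described in the text, and the coordinates in the example are invented for illustration.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class Target(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED convention)
    end: int    # exclusive (BED convention)


def read_bed(handle: TextIO) -> Iterator[Target]:
    """Yield target intervals from a BED-formatted text stream."""
    for line in handle:
        line = line.strip()
        # Skip blank lines, comments and optional browser/track header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        yield Target(fields[0], int(fields[1]), int(fields[2]))


# Invented example intervals; not taken from the TSCP target file.
example = io.StringIO("chr1\t100\t250\ttarget_A\nchr1\t300\t420\ttarget_B\n")
targets = list(read_bed(example))
total_bases = sum(t.end - t.start for t in targets)  # 150 + 120 = 270
```

Because BED start coordinates are 0-based, they differ by one from the 1-based coordinates commonly used when reporting exon positions, which is worth checking when comparing the target file with tabulated coordinates.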

Supplementary Table 1. MLPA results for the ICR96 exon CNV validation series.
Column headings:

SampleID - sample ID in the ICR96 exon CNV validation series
ICR96Pool - pool in the ICR96 exon CNV validation series

ENST65 - the ENST ID from Ensembl v65 used for annotation and genomic coordinates

Open Peer Review
We evaluated the usability of this dataset with our recently published method panelcn.MOPS. Our data access request was processed quickly without problems. Analysis of all 96 samples with panelcn.MOPS revealed that the dataset is less homogeneous than the data used in our study comparing panelcn.MOPS to five different CNV detection tools. Compared to the dataset used in our study, more samples and regions of the ICR96 samples were classified as low-quality, and low correlation of read counts between test and control samples was observed in many cases. Additional optimization of panelcn.MOPS to the provided dataset was required, showing that any method needs to be adapted to the data to be analyzed.
Overall, the manuscript clearly describes a very useful dataset for the evaluation of CNV detection in targeted NGS data.

Minor comments:
Although it is mentioned briefly in Materials and Methods, the origin of the two different pools should be described again in the Dataset section.
Since both GRCh37 and GRCh38 coordinates are provided in Supplementary Table 1, it should be specified which coordinates are given in the BED file that is provided as Supplementary File 1.
Wimmer K: panelcn.MOPS: Copy-number detection in targeted NGS panel data for clinical diagnostics. Hum Mutat.

Is the rationale for creating the dataset(s) clearly described? Yes
Are the protocols appropriate and is the work technically sound? Yes

Are sufficient details of methods and materials provided to allow replication by others? Yes
Are the datasets clearly presented in a useable and accessible format? Yes

Competing Interests: No competing interests were disclosed.

I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.