SeqPlots - Interactive software for exploratory data analyses, pattern discovery and visualization in genomics

Experiments involving high-throughput sequencing are widely used for analyses of chromatin function and gene expression. Common examples are the use of chromatin immunoprecipitation for the analysis of chromatin modifications or factor binding, enzymatic digestions for chromatin structure assays, and RNA sequencing to assess gene expression changes after biological perturbations. To investigate the pattern and abundance of coverage signals across regions of interest, data are often visualized as profile plots of average signal or stacked rows of signal in the form of heatmaps. We found that available plotting software was either slow and laborious or difficult to use by investigators with little computational training, which inhibited wide data exploration. To address this need, we developed SeqPlots, a user-friendly exploratory data analysis (EDA) and visualization software for genomics. After choosing groups of signal and feature files and defining plotting parameters, users can generate profile plots of average signal or heatmaps clustered using different algorithms in a matter of seconds through the graphical user interface (GUI) controls. SeqPlots accepts all major genomic file formats as input and can also generate and plot user-defined motif densities. Profile plots and heatmaps are highly configurable and batch operations can be used to generate a large number of plots at once. SeqPlots is available as a GUI application for Mac, Windows and Linux, or as an R/Bioconductor package. It can also be deployed on a server for remote and collaborative usage. The analysis features and ease of use of SeqPlots encourage wide data exploration, which should aid the discovery of novel genomic associations.


Introduction
Sequencing based techniques such as ChIP-seq and RNA-seq are widespread experimental tools that generate vast amounts of data for downstream analyses such as uncovering global patterns of genomic activity. After aligning sequence reads to the reference genome, read coverage is calculated. Visualizing coverage tracks using genome browsers is the simplest way to inspect the results. Nevertheless, calculating and plotting signals across groups of selected genomic locations is essential for genome-wide hypothesis testing and quantitative comparisons.
Typically, users plot the abundance of signal (e.g., read coverage) across a set of genomic regions (e.g., transcription start sites) either as a profile plot of average signal or as stacked rows of individual signals visualized as a heatmap. Such plots are usually generated using online or command line tools such as Galaxy/Cistrome, ngs.plot, and deeptools, or using custom scripts combined with plotting software such as Gnuplot 1-5. We found that these methods were either laborious, as each plot needed to be set up individually, or were difficult to use by those with little computational training. These factors inhibited users from generating a large number of plots for data exploration.
To address this, we developed SeqPlots, a highly configurable, graphical user interface (GUI) operated application that rapidly generates publication quality average profile plots or heatmaps that can be clustered using different algorithms to uncover patterns within the data. A key feature of SeqPlots is the ability to select a set of features and signals, then rapidly plot them in any combination, facilitating wide data exploration.

Methods
SeqPlots can plot signals from any experimental or in silico data (e.g. ChIP-seq or RNA-seq read coverage, density of sequence motifs, mappability, nucleosome occupancy) over one or multiple sets of genomic features (e.g. TSSs, gene bodies, peak calls). Users first add signal tracks and genomic feature files to an integrated SeqPlots database (see Table 1 for accepted file formats). Then any combination of signal and feature files in the database, together with any user-entered sequence motifs, can be analyzed. Plots can be anchored at either end of a feature, at both ends, or at centers, and users can define which lengths of upstream and downstream sequence to plot. Additionally, three different methods can be used to cluster heatmaps: k-means, hierarchical clustering, and self-organizing maps (unsupervised neural networks); heatmap rows can also be sorted by signal strength.
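The anchoring and row-sorting ideas described above can be illustrated with a minimal Python sketch. This is not SeqPlots code: the function names, the toy per-base coverage list, the `(start, end, strand)` feature tuples and the treatment of minus-strand features are all assumptions made for illustration only.

```python
from statistics import mean


def anchored_window(coverage, anchor, upstream, downstream, strand="+"):
    """Return per-base signal around `anchor`, oriented 5'->3'.

    Positions outside the chromosome are returned as None.
    """
    if strand == "+":
        positions = range(anchor - upstream, anchor + downstream)
    else:
        # For minus-strand features, upstream lies at higher coordinates.
        positions = range(anchor + upstream, anchor - downstream, -1)
    return [coverage[p] if 0 <= p < len(coverage) else None for p in positions]


def bin_means(values, bin_size):
    """Average consecutive `bin_size` values, ignoring missing (None) entries."""
    out = []
    for i in range(0, len(values), bin_size):
        chunk = [v for v in values[i:i + bin_size] if v is not None]
        out.append(mean(chunk) if chunk else None)
    return out


def heatmap_matrix(coverage, features, upstream, downstream, bin_size):
    """Build one binned row per feature, anchored at the feature start.

    Assumption: features are half-open (start, end, strand) tuples, so the
    start of a minus-strand feature is at coordinate end - 1.
    """
    rows = []
    for start, end, strand in features:
        anchor = start if strand == "+" else end - 1
        window = anchored_window(coverage, anchor, upstream, downstream, strand)
        rows.append(bin_means(window, bin_size))
    return rows


def average_profile(matrix):
    """Column-wise mean over features, skipping missing bins."""
    profile = []
    for col in zip(*matrix):
        vals = [v for v in col if v is not None]
        profile.append(mean(vals) if vals else None)
    return profile


def sort_by_signal(matrix):
    """Order heatmap rows from strongest to weakest mean signal."""
    def row_mean(row):
        vals = [v for v in row if v is not None]
        return mean(vals) if vals else float("-inf")
    return sorted(matrix, key=row_mean, reverse=True)


# Toy data: a 40 bp "chromosome" with a signal peak at positions 10-14.
coverage = [0.0] * 40
for p in range(10, 15):
    coverage[p] = 4.0
features = [(10, 20, "+"), (25, 30, "+")]

matrix = heatmap_matrix(coverage, features, upstream=4, downstream=6, bin_size=2)
profile = average_profile(matrix)
```

Here `matrix` plays the role of the heatmap (one row per feature) and `profile` the role of the average profile plot. Clustering methods such as k-means would operate on the rows of the same matrix.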

Implementation
SeqPlots utilizes indexing and the multi-layer summarization properties of bigWig files for rapid data acquisition 6, and precalculates and stores profiles for all combinations of selected signals and features. Users are presented with a clickable array of signal/feature pairs that can be plotted individually or in any combination in a matter of seconds. Average profile plots or heatmaps are immediately displayed as previews and can be downloaded as PDF files. Profile plots can display standard error and 95% confidence intervals. Spreadsheets with annotated heatmap clusters can be downloaded for downstream analyses such as additional clustering or gene enrichment analyses. Scaling, colors, axes, and titles are also easily configurable. Signal and feature files uploaded to the integrated SeqPlots database are available for use in later plot setups. Users can search and sort uploaded files, and annotate them with comments, user names and reference genome versions.
Figure 1 illustrates a typical use of SeqPlots. Five feature files in bed format containing genomic coordinates of protein coding genes in different expression bins were selected together with three bigWig signal files (normalized read coverage of H3K4me3, H2A.Z, and H3K36me3). In addition, the dinucleotide motif CG was entered and SeqPlots generated a CG density track for use in the analyses. A plot type anchored at the start position (the TSS) was then selected, and 1 kb upstream and 1.5 kb downstream of the TSS was specified. Following the setup and calculation, SeqPlots presented a clickable grid (top of Figure 1a,b). Selecting the desired combinations and plot type (average profile plot or heatmap) generates a plot. In Figure 1a, three signals (H3K36me3, H3K4me3, and H2A.Z) and one feature (top 20% TSSs) were selected for an average profile plot. For Figure 1b, find relationships between genomic features and signals.
The rapid plotting capability and ease of use of SeqPlots should facilitate wide exploration of high-throughput sequencing data, leading to the discovery of novel biological associations.
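As an illustration of the standard error and 95% confidence intervals that profile plots can display, the following Python sketch computes them per bin from a feature-by-bin signal matrix. It assumes a normal approximation (mean ± 1.96 × standard error); whether SeqPlots uses exactly this formula is not stated in the text, and the function name and toy matrix are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def profile_with_ci(matrix, z=1.96):
    """Per-bin mean, standard error and normal-approximation 95% interval.

    `matrix` holds one row per feature and one column per bin.
    Assumption: the interval is mean +/- z * SE, with z = 1.96.
    """
    summary = []
    for col in zip(*matrix):
        n = len(col)
        m = mean(col)
        # The sample standard deviation needs at least two observations.
        se = stdev(col) / sqrt(n) if n > 1 else 0.0
        summary.append(
            {"mean": m, "se": se, "ci_low": m - z * se, "ci_high": m + z * se}
        )
    return summary


# Toy matrix: three features (rows) by two bins (columns).
matrix = [[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]]
summary = profile_with_ci(matrix)
```

In this toy example the first bin varies between features and therefore gets a non-zero error band, while the second bin is identical across features and its interval collapses onto the mean.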

Software availability
SeqPlots is distributed as user-friendly stand-alone applications for Mac, Windows and Linux, and is available as an R programming language package from the Bioconductor repository. SeqPlots can also be deployed as a server application, which is useful for data sharing within laboratories, collaborative usage and remote work.

Author contributions
PS conceived, designed, and wrote the software. JA contributed to the design and supervised the study. PS and JA wrote the manuscript.

Competing interests
No competing interests were disclosed.

Grant information
This work was supported by the Wellcome Trust [101863].
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Seqplots is an analysis and visualization software for genomic data sets, including ChIP-seq, RNA-seq and others. It allows investigating the pattern and abundance of signal coverage across genomic regions. Seqplots provides options for average profiling and for generating heatmaps of normalized signal, which can optionally be clustered and retrieved for further exploration of the data. Seqplots is accessible as an application for all platforms (Mac, Windows, Linux) or as an R/Bioconductor package. The online documentation presents a dataset of a transcription factor and two epigenetic marks (H3K4me3 and H3K36me3) on the first chromosome of .

Installation:
To challenge Seqplots, we tested our own data on the complete human genome. The latest version of Seqplots (3.0.12, MacOSX bundle version) was first downloaded on a Mac but yielded genome installation issues. We therefore proceeded with an older version (1.7.16, MacOS bundle version 2.0.2) that contains two Drosophila genome versions (dm3 and dm6), the C. elegans genome ce10 and the human genome hg19.

Genomes upload:
One can select the desired genome from the proposed list, but the « install selected » action stops halfway after indicating « installing packages: BSgenome.Scerevisiae.U ». The blue progress bar stops but no error message appears, leaving the user no way to diagnose or correct the failed installation. There is no indication of the file format needed to upload a genome, whether FASTA, an indexed genome or anything else.

Data upload:
Data (bam, wig, bigwig) can be easily uploaded to Seqplots, which automatically converts any format to bigwig after upload to allow better handling. Upload time depends directly on the size of the data.

New Plot Set:
To generate a profile plot or heatmap, one has to choose two files: the file containing the signal intensity (wig, sam …) and the file containing the selected annotations of interest, such as genes, protein coding genes or promoters. This second file can be in bed, gtf or gff format.
Several files can be processed in a single analysis; the visualization tool uses results to plot individual or combined profiles as selected by the user.

Options to select:
- The bin size of the signal file has to be known in advance and it is important to adjust this parameter properly, ideally making it identical to the bin size of the original input data. Using a bin size smaller than that of the actual data will introduce holes in the plot. Conversely, using a larger bin size will smooth the data, which can be useful for clarity of the results but can also result in a loss of resolution. This point could be clarified in the documentation. We also note that the link to the explanation is essentially inactive.
- The 'point view', 'end point' or 'midpoint' options are easy to understand as start, middle and end of the features. The 'anchored features' option is less obvious and should be explained more clearly.
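The effect of bin size described above can be reproduced with a toy Python sketch. This models the general behaviour only and is not how SeqPlots summarizes bigWig files; the sparse `signal` dictionary (one value every 10 bp) and the `summarize` helper are hypothetical.

```python
from statistics import mean


def summarize(signal, start, end, bin_size):
    """Mean of the defined positions in each bin; None marks a bin with no data."""
    bins = []
    for b in range(start, end, bin_size):
        vals = [signal[p] for p in range(b, min(b + bin_size, end)) if p in signal]
        bins.append(mean(vals) if vals else None)
    return bins


# Toy track holding one value every 10 bp (positions 0, 10, 20 and 30).
signal = {0: 2.0, 10: 8.0, 20: 2.0, 30: 8.0}

fine = summarize(signal, 0, 40, 5)      # bins smaller than the data step
matched = summarize(signal, 0, 40, 10)  # bins equal to the data step
coarse = summarize(signal, 0, 40, 20)   # bins larger than the data step
```

With 5 bp bins every second bin contains no data (the holes noted above), 10 bp bins reproduce the original values, and 20 bp bins average neighbouring values so that the alternation between 2 and 8 disappears.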

Profiles:
Profiles are easily generated and modified using the indicated options, and this part of the tool functions well overall. A possible improvement would be to let the user directly change the labels below the plots.

Heatmaps:
This option also allows clustering of the data with a nice graphical interface, including nice color options. When the process completes, this tool is very useful to identify classes of genes/features and to explore the data. However, we note that calculation time for heatmap generation can be limiting and varies considerably from one run to the next. Unfortunately, no error message appears on the interface to explain whether something went wrong while generating the heatmap. Sometimes the status message remains stuck at "exporting results", but the results are never exported.

General considerations and conclusion:
Seqplots can be used quite easily by non-bioinformaticians with minimal training. We believe that this software provides many of the analysis options that biologists are looking for when dealing with high-throughput sequencing data sets, including ChIP-seq and RNA-seq. It therefore has great potential for use by the community of scientists interested in genomic science. We also found that improvements remain possible and suggest addressing the following points in particular.
- The genome upload issue has to be fixed.
- Documentation should be developed for the sections 'genome upload',
- There is no possibility to run two jobs in parallel. Seqplots could allow the opening of two windows or more.
- Error messages have to be clearer and help the user to make a decision on what should be done.
- Heatmap generation jobs are often aborted during calculations.
We have read this submission. We believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
Competing Interests: No competing interests were disclosed.

Institute of Clinical Science, Imperial College London, London, UK

'SeqPlots' is clearly a mature package and it provides a graphical interface to make complex plots from within a web browser. I was able to get the package up and running, and generate plots from my own data very quickly after installation through the GUI, which was intuitively navigable. It is clear that a lot of work has gone into providing an impressive array of options to the user.
The manuscript itself gives an adequate, if brief, description of the purpose and typical output of the resource. It does not attempt to serve as user manual or tutorial.
A tutorial is provided as a separate, regularly updated document. The tutorial is very detailed, although a few relative links were broken. My one suggestion would be to provide easy access to the example data through the tutorial; I couldn't find it without digging into the package source. Adding an example line of `run(root='/path/to/ex/data')` in the tutorial would be a simple way to do this.
In the future, it may be worth considering developing a scriptable back-end to the package, so that users who are comfortable with R can automate their pipelines.
Overall, for those who want a graphical interface to make these plots this is a very useful resource.
Note: Malcolm Perry tested the software and wrote most of the above comments.
We have read this submission. We believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
Competing Interests: One of us (Malcolm Perry) is developing an R/Bioconductor package with some overlap in functionality with SeqPlots.

SeqPlots is a graphical, interactive tool for exploratory visualization of high-throughput sequencing data. To start with my major point, I regret to say that I cannot recommend SeqPlots in its current form. The main issue I have is with installation. This is a common hurdle with bioinformatics software; however, it is an important factor for a software tool. I don't gain any satisfaction from pointing this out, as the author has clearly spent a lot of effort on creating installers for all three major operating systems. However, I was not able to install SeqPlots in any form. I do have to admit here that I am not experienced in R. However, I do use other R modules without problems and SeqPlots is positioned as a tool for those with little computational training.