Algorithmic considerations when analysing capture Hi-C data

Chromosome conformation capture methodologies have provided insight into the effect of 3D genomic architecture on gene regulation. Capture Hi-C (CHi-C) is a recent extension of Hi-C that improves the effective resolution of chromatin interactions by enriching for defined regions of biological relevance. The varying targeting efficiency between capture regions, however, introduces bias not present in conventional Hi-C, making analysis more complicated. Here we consider salient features of an algorithm that should be assessed when evaluating the performance of a program used to analyse CHi-C data in order to infer meaningful interactions. As a case study, we use the program CHiCAGO to analyse promoter capture Hi-C data generated on 28 different cell lines.


Introduction
Chromosome conformation capture (3C) methodologies [1][2][3] have provided insight into the effect of 3D genomic architecture on gene regulation [4][5][6] . They preserve chromatin interactions by cross-linking followed by fragmentation, ligation and sequencing of interacting genomic regions. Hi-C exploits high-throughput paired-end sequencing to retrieve a short sequence from each end of each ligated fragment, allowing all pairwise interactions between fragments to be tested 7 (Figure 1).
Chromatin interactions can result from biological functions, such as promoter-enhancer interactions, or from random polymer looping, whereby undirected physical motion of chromatin causes loci to collide. To identify 'true' interactions, it is necessary to identify the contribution from the null hypothesis, largely attributed to constrained Brownian motion and noise 8 . While not completely eliminating background noise, the development of in situ Hi-C, which preserves the integrity of the nucleus during Hi-C library generation, has gone some way to reducing it 3 .
Analysis of Hi-C libraries involves filtering of invalid di-tags, such as self-ligated pairs or adjacent fragment di-tags 9 , before determining statistically significant and biologically important di-tag interactions. The expected frequency of interactions between two fragments decreases with increasing genomic distance, and is lowest when the fragments lie on different chromosomes 8 . Hence, reliable estimates of the dependence on distance are a prerequisite to any analysis.
While Hi-C allows genome-wide characterisation of chromatin contacts, its effective resolution is determined by both the restriction fragmentation and the sensitivity of the experiment. Herein we consider data based on restriction enzyme digestion using the 6 bp cutter HindIII, sequencing of the CHi-C library, alignment and filtering of valid di-tags using HiCUP 9 , and identification of significant di-tag interactions using CHiCAGO 8 .

Amendments from Version 1
This new version addresses the comments of the reviewers, making a few minor edits to presentation (e.g., changing promotor to promoter, more carefully labelling some figures, and including explanations in the main text as well as in the methods), and adding further to the discussion. These additions to the discussion focused on two areas: approaches to handling bait-bait pairs, and the Suggested Score Threshold (SST). We provide a paragraph discussing approaches one could take to handling bait-bait pairs when only one direction is significant, highlighting possible areas of future research; on the SST, we mention how potential issues surrounding undersampling described in the initial CHiCAGO paper may apply to our measure of the False Discovery Rate. No changes were made to the methods section, and the only changes to the results were that Table 2 and Table 3 were extended.

Capture Hi-C (CHi-C) is a recent extension of the Hi-C methodology that improves resolution by enriching defined regions of biological significance 10 (Figure 1). Analysis of CHi-C data is, however, more complicated than conventional Hi-C because: (1) varying targeting efficiency between capture regions introduces a bias not present in Hi-C 8 ; (2) contact maps in CHi-C arise from two distinct sources that have innately different visibility profiles: between two captured fragments and between captured and non-captured fragments; (3) null hypotheses for each di-tag pair are not independent as these are tested simultaneously, requiring an alternative statistic instead of reliance on raw p-values from hypothesis tests.
CHi-C, especially in the guise of promoter capture Hi-C (PCHi-C) is increasingly being used to decipher the genetic basis of aberrant gene expression in cancer. Since cancers rarely have diploid genomes, the analysis of PCHi-C from tumours is further complicated by copy number variation (CNV) 11 and presence of inter-chromosomal translocations 12 .
Here we examine a number of features of an algorithm that should be considered in evaluating the performance of a program used to analyse CHi-C data. Specifically, (i) the appropriateness of the distance-correction used in the model; (ii) the relative importance of weights assigned to model parameters; (iii) whether the null accurately reproduces the distribution of the large majority of contacts, and how thresholds for declaring significant interactions are obtained; (iv) whether the underlying model leads to asymmetry in test statistics of bait-bait pairs; and (v) how an algorithm behaves when processing CHi-C cancer cell line data. As an illustration we consider CHiCAGO 8 as a case study, since it is a widely-used program for analysing PCHi-C data 13,14 . The algorithm features a novel background correction procedure using a two-component convolution model designed to account for real but expected interactions as well as experimental and sequence-based artefacts. Additionally, CHiCAGO implements a p-value weighting procedure, based on parameters that can be estimated from the data.
Raw sequencing data was processed using HiCUP v0.6.1 9 to obtain only valid interaction di-tags aligned to build 38 of the human genome. Summary statistics for each PCHi-C dataset are provided in Supplementary Data 1 (see Extended data 15 ).
The significance of interaction frequencies for di-tags where both ends were baited (bait-bait) and where only one end was baited (bait-other end) was estimated using CHiCAGO v1.1.8 8 .
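For orientation, the following is a minimal sketch of how a CHiCAGO run is typically invoked from R using the Bioconductor Chicago package; the design directory and .chinput file names are placeholders rather than the files used in this study.

```r
## Minimal sketch of a CHiCAGO run from R using the Bioconductor "Chicago" package.
## The design directory and .chinput file names are placeholders, not those used here.
library(Chicago)

## Create a chicagoData object pointing at the capture design files
## (.rmap, .baitmap and associated design files for the HindIII digest).
cd <- setExperiment(designDir = "hg38_HindIII_design")

## Read HiCUP-derived .chinput count files (one per biological replicate)
## and merge the replicates into a single object.
cd <- readAndMerge(files = c("rep1.chinput", "rep2.chinput"), cd = cd)

## Run the pipeline: normalisation, background (Delaporte) estimation,
## p-value calculation and weighting into the final score.
cd <- chicagoPipeline(cd)

## Write interactions passing the default score threshold (5) to disk.
exportResults(cd, outfileprefix = "sample_chicago")
```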

Model considerations
We initially considered PCHi-C libraries from the 18 non-tumour cell lines. In CHiCAGO, background interactions are modelled by the following components of a Delaporte distribution, which are assumed to be independent: (1) Brownian collisions, modelled by a negative binomial random variable whose expected level is a function of genomic distance, adjusted for biases associated with individual fragments, and whose size parameter is independent of the interacting pair; (2) assay artefacts/technical noise (i.e. sequencing errors), modelled by a Poisson random variable whose mean depends on the properties of the interacting fragments but is independent of the genomic distance between them.
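To make the two-component null concrete, the sketch below simulates background counts for a single fragment pair as the sum of a negative binomial (Brownian) draw and a Poisson (technical noise) draw, and evaluates the corresponding Delaporte probability mass by direct convolution; the parameter values are illustrative, not estimates from these data.

```r
## Illustrative two-component background model for a single fragment pair:
## count = NegBinom (Brownian collisions) + Poisson (technical noise),
## i.e. a Delaporte distribution. Parameter values are placeholders.
mu     <- 4.0    # Brownian mean: distance function scaled by fragment biases
size_r <- 1.5    # negative binomial dispersion (size) parameter
lambda <- 0.2    # technical-noise rate for this pair

set.seed(1)
counts <- rnbinom(1e5, mu = mu, size = size_r) + rpois(1e5, lambda = lambda)

## Delaporte probability mass by convolving the two components
ddelap <- function(k, mu, size, lambda) {
  j <- 0:k
  sum(dnbinom(j, mu = mu, size = size) * dpois(k - j, lambda = lambda))
}
pmf <- sapply(0:20, ddelap, mu = mu, size = size_r, lambda = lambda)

## Compare simulated and analytical probabilities for counts 0..20
round(rbind(simulated = as.numeric(table(factor(counts, levels = 0:20))) / length(counts),
            analytic  = pmf), 4)
```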
We examined the validity of the model and the estimation of central parameters. Assuming that in 'small' distance bins technical noise is low, as per the CHiCAGO specifications, test statistics and corresponding p-values for the Kolmogorov-Smirnov (KS) test (testing the probability that the observed data are generated by the specified model, aggregated over distance bins) were generated for ACD4 (Table 1) and the other 17 cell lines (Supplementary Table 2, Extended data 15 ). In most cell lines the p-values associated with the small distance bins were effectively zero but rapidly increased to near-unity in the larger distance bins. This is to be expected, as the asymptotics of Rosa et al. 16 only hold when genomic separation is much larger than the distance between adjacent regions. The notable outlier was GM12878, which was typified by near-zero p-values across all distance bins. The exact estimates of bin-wise p-values are not necessarily important, since they will be impacted by true interactions, by bait-specific biases permitted by the model (the effect of which is expected to be small), and by the distribution of distances within each bin. Nevertheless, the fact that there was no rejection of the null hypothesis at large distance bins implies that the negative binomial model fits the data well for broad-scale behaviour. As technical noise is proportionally greater at large distances, the discrepancy in how well the negative binomial fits over distance cannot be attributed to the KS test ignoring the Poisson component.
The 'distance function', a key component of CHiCAGO's implementation of genomic distance dependence in the mean of the negative binomial, was generated for each cell line. Coefficients of the distance fit curves and plots of the estimated distance function are provided in the Extended data 15 . In all 18 cases the cubic spline fitted by CHiCAGO provides a good fit to the data. With the exception of GM12878, there was strong concordance between the theoretical, linear and cubic fits, and the curvature of the cubic spline, K, further shows GM12878 as an outlier, which may well reflect the GM12878 Hi-C libraries being prepared by dilution rather than in situ ligation. In view of GM12878 being an outlier, a linear model (Linear Intercept ~ Linear Gradient) was fitted with and without GM12878 (y = -0.776 - 14.681x and y = 1.067 - 12.919x, respectively), with the second having a lower residual sum of squares (RSS) and so providing the better fit. This linear fit is consistent with the interactions detected being primarily cis-chromosomal.
The assumption that assay artefacts have minimal effect on expected reads in small distance bins was confirmed by calculating the mean technical noise parameter λ and mean number of trans pairs observed per bait for each cell line. Box plots of the parameter estimate per bait or other-end pool, shown in Supplementary Figure 2 (see Extended data 15 ), adhere largely to the patterns expected as laid out in the CHiCAGO vignette 8 , where it is stated that to interpret the noise box plots one needs to check that "distributions' median and variance should trend upwards as we move from left to right".

Score statistic
CHiCAGO implements a novel score statistic as a proxy for the strength of evidence supporting an interaction 8 . We investigated the suitability of this statistic and the threshold advocated for declaring significance. Initially, we compared CHiCAGO interaction scores between bait-bait pairs. Intuitively it might be assumed that the score for baits AB will be identical to BA. However, this is not the case, as evidenced by plots of score_ij against score_ji for ACD4 (Figure 2) and the other 17 cell lines (Supplementary Figure 3). The asymmetry arises because, when CHiCAGO constructs bait-end biases and other-end biases, the former are assumed to be fixed for each bait, whereas the other-end bias is assumed to be drawn from a random distribution, resulting in a different number of expected reads for the two directions of the pair (mean correlation 0.4854, interquartile range (IQR) 0.0970). To further understand this asymmetry, we define an interaction as 'reversible' if, for a given threshold, the significance of the interaction does not depend on the direction in which the score was calculated. The mean percentage of reversible interactions was only 23.06% (IQR = 4.06%); the presence of non-reversible interactions represents a failure of the algorithm to consistently assign biological relevance to a bait-bait interaction.
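The correlation and reversibility statistics used here can be computed directly from a table of ordered bait-bait scores; the sketch below shows one way to do so, using a small illustrative data frame with hypothetical columns baitA, baitB and score rather than actual CHiCAGO output.

```r
## Sketch: score asymmetry and reversibility for bait-bait pairs.
## 'bb' stands in for CHiCAGO output, with hypothetical columns
## baitA, baitB and score (the CHiCAGO score in the direction A -> B).
bb <- data.frame(baitA = c("b1", "b2", "b1", "b3", "b2", "b3"),
                 baitB = c("b2", "b1", "b3", "b1", "b3", "b2"),
                 score = c(6.2, 3.1, 0.4, 5.6, 7.8, 7.5))

## pair each ordered score with the score of the reverse direction
score_pairs <- function(bb) {
  key_fwd <- paste(bb$baitA, bb$baitB)
  key_rev <- paste(bb$baitB, bb$baitA)
  data.frame(score_ij = bb$score,
             score_ji = bb$score[match(key_fwd, key_rev)])
}

asymmetry_stats <- function(bb, threshold = 5) {
  s <- na.omit(score_pairs(bb))
  nonzero  <- s$score_ij > 0 | s$score_ji > 0           # drop doubly-zero pairs
  sig_any  <- s$score_ij > threshold | s$score_ji > threshold
  sig_both <- s$score_ij > threshold & s$score_ji > threshold
  c(correlation = cor(s$score_ij[nonzero], s$score_ji[nonzero]),
    reversible  = sum(sig_both) / sum(sig_any))          # proportion reversible
}

asymmetry_stats(bb)
```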

Significance threshold
The threshold advocated by the developers of CHiCAGO for declaring a significant interaction is score_ij > 5 8 , referred to here as the normal score threshold (NST). We investigated power and the false-discovery rate (FDR) at this threshold. To evaluate power, or equivalently the false-negative rate (FNR), we calculated the proportion of interactions with log(p) < -10 (considering the null hypothesis of the Delaporte distribution) and score > 5. The threshold log(p) < -10 (i.e. 'robust' interactions) was used because the mathematical specification of CHiCAGO suggests that reproducible interactions are those that pass this threshold in all replicates 8 .
Because the principle underlying the CHiCAGO score statistic precluded simply using p-values to identify true interactions, except in extreme cases, we used the Jaccard index as a proxy for the FDR (assuming interactions passing the score threshold in all replicates are true, and using the Jaccard index to measure the proportion of the total interactions that are significant in all replicates; see the methods section). To quantify the suitability of the score as a statistic at alternative thresholds, we calculated alternative score thresholds with the aim of improving the FNR and FDR, as well as the family-wise error rate (FWER) (Table 2). Across all 18 cell lines the threshold to fix the theoretical FWER was a score ≳ 15; however, this limited discovery to O(10^4) interactions per cell line, compared with O(10^5) interactions when imposing a score threshold of 5.
Weighting procedure
CHiCAGO scores are computed from raw p-values corrected for the prior probability of a true interaction, fitted by a four-parameter logistic regression model 8 . The parameters can be calculated from the reproducibility of interaction frequencies at different genomic distances for the cell line; otherwise, by default, the program uses estimates from macrophage data. To investigate the appropriateness of the estimation method, and the extent to which the choice of weighting affects the identification of significant interactions, we used CHiCAGO's pre-built method to calculate parameters for each cell line.
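As an illustration of this step, the sketch below fits a generic four-parameter logistic curve of prior interaction probability against log distance using nls and reports the residual sum of squares used later to compare fits; the data points are invented for illustration and the parameterisation is a stand-in for CHiCAGO's exact bounded logistic form.

```r
## Sketch: fitting a four-parameter logistic curve of prior interaction
## probability against log distance, as a stand-in for CHiCAGO's bounded
## logistic regression. The data points below are invented for illustration.
logdist <- log(c(2e4, 5e4, 1e5, 2.5e5, 5e5, 1e6, 1.5e6))
prior   <- c(0.64, 0.57, 0.47, 0.27, 0.17, 0.10, 0.07)

fit <- nls(prior ~ lower + (upper - lower) / (1 + exp(slope * (logdist - mid))),
           start = list(lower = 0.05, upper = 0.6, slope = 1, mid = 12))

sum(residuals(fit)^2)   # residual sum of squares used to compare fits
```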
CHiCAGO uses the observed interactions to fit a curve of true-interaction prior probability that decreases monotonically with distance. The monotonicity of the model is intuitive because, in general, baits separated by a greater distance will have a lower prior probability of interacting. The lack of monotonicity of the observed data (a statistic ranging from 0 to 1, with 0 corresponding to perfect monotonicity) had mean 0.3222 and IQR 0.0917 (4 d.p.). While it is not possible to quantify how much of the lack of monotonicity is due to expected variance without making additional hypotheses about the model, visual inspection of the data to which the CHiCAGO weighting curves are fitted, shown in Supplementary Figure 4 (see Extended data 15 ), shows a local peak around a log distance of 12. This behaviour cannot be captured by the logistic model, which is monotonically decreasing, suggesting that the non-zero lack of monotonicity is caused by underlying biological features.
The RSS for the logistic regression had mean 26.08 and IQR 6.50, with the GM12878 cell line an outlier at RSS 73.11. Visually this is identifiable as the fit being near horizontal for GM12878. GM12878 in fact has a lower RSS than one would expect, as only seven distance bins contributed, all others having zero observed interactions. This is a more general phenomenon whereby fits with zero-observed-interaction bins will have an artificially lower RSS.
As the threshold of log(p) < -10 for defining 'true' interactions is somewhat arbitrary, yet it affects weight parameters, we sought an alternate p-threshold and recalculated the RSS. With the weight parameters calculated by CHiCAGO that minimised the RSS, key statistics for the data were recalculated, giving a mean change of 0.0153 for the score correlation, 1.36% for the reversibility, 0.0355 for the FDR at the NST, and -0.0708 for the FNR. This suggests that, on balance, using the custom calculated weight parameters improves the quality of the resulting calculations. Moreover, the Jaccard index for concordance between significant interactions with and without suggested weights had mean and IQR values of 0.8732 and 0.0737, respectively, demonstrating that the choice of weights affects which interactions are reported as significant. Applying the new weights had no significant effect on the number of interactions observed at the NST.
Application to cancer cell lines
We next analysed PCHi-C data generated on the 10 cancer cell lines. Plots of the score symmetry for bait-bait pairs are shown in Supplementary Figure 6 (see Extended data 15 ). Cancer cell lines tended to have a higher non-zero score correlation for bait-bait pairs (mean 0.5136, IQR 0.1803) but a significantly lower percentage of reversible interactions (mean 14.54%, IQR 6.54%). The BLN2 and BLN3 cell lines showed substantial aberrant behaviour in their plots, in which proximal bait-bait pairs in a similar region extended in long 'arms' away from the theoretical fit. This was not seen in any of the other cell lines and is likely to be a consequence of a vastly different underlying genomic architecture. Summary statistics to assess the quality of the score threshold were again calculated. The FNR was higher for cancer cell lines (mean 0.197, IQR 0.069). Suggested score thresholds to improve the FNR or theoretical FWER (Table 3) were similar to those seen for non-cancer cell lines.
The cancer cell line data also showed a lack of monotonicity (mean 0.5999, IQR 0.0865 (4 d.p.); mean RSS 55.4, IQR 56.9), significantly higher than that observed in the non-cancer cell lines. This distortion was most pronounced for the BLN2, BLN3 and HT29 cell lines, which all showed very low concordance between the fit and the data points. Mean changes in summary statistics with CHiCAGO-calculated weight parameters were 0.0292 for the score correlation, 2.27% for reversibility, 0.0683 for the FDR at the NST and -0.0870 for the FNR. The corresponding mean Jaccard index was 0.6878 (IQR 0.2010), highlighting the importance of using derived weights in analyses.
Finally, we examined heatmaps of Hi-C interaction frequencies to detect potential cancer-related chromosomal abnormalities, finding that BLN2 and BLN3 exhibit large-scale inter-chromosomal translocations (Supplementary Figure 7, Extended data 15 ). However, such features are unlikely to be sufficient to solely account for the increased score asymmetry observed in cancer cell lines.

Discussion
When utilising any statistical test, it is necessary to verify that the required properties of the input data are satisfied, and that under these assumptions sensible conclusions are drawn. In this study we have sought to evaluate CHiCAGO as a methodology for identifying statistically significant genomic interactions in PCHi-C data. This evaluation included examination of: (i) the suitability of the distance-correction model employed; (ii) evidence of discordance in association statistics at bait-bait pairs; (iii) significance thresholds of called interactions; (iv) the importance of weight parameter estimates; and (v) specific considerations for application to the analysis of cancer cell-line data.
Our findings indicate that the Delaporte null fitted the data well in large distance bins, with the assumption that the Poisson contribution is small being verified. The cubic spline distance function fitted the data well, with the linear fit being sufficient for most non-cancerous cell lines. The symmetry in the score parameter was very low for bait-bait pairs. The default CHiCAGO score threshold of >5 was typically too low to ensure reliability in the data, evaluated either from the FNR or FWER, but correspondingly the sensitivity to detect interactions was greater than at higher score thresholds, with a resulting higher false discovery rate.
In handling bait-bait pairs for which there is asymmetry in the assigned score, either the maximum or the minimum of the score and its reverse could be reported, depending on whether false negatives or false positives are the greater concern. Alternative approaches worthy of consideration might be combining the scores weighted by the expected score variance, or imposing a symmetric framework. We leave studies of the suitability of such approaches to future work. Estimating biases for other ends is inherently more difficult, as noted in the original mathematical specification for CHiCAGO, due to the comparatively low number of nearby bait fragments, and so for discovery we would not expect bait-bait pairs to be a large source of biological information.
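A minimal sketch of the two simple resolutions described above, assuming the two directional scores for each bait-bait pair are available as vectors:

```r
## Sketch: two simple resolutions of asymmetric bait-bait scores, where s_ij and
## s_ji are the CHiCAGO scores for the two directions of the same pair.
resolve_scores <- function(s_ij, s_ji, favour = c("sensitivity", "specificity")) {
  favour <- match.arg(favour)
  if (favour == "sensitivity") pmax(s_ij, s_ji)   # prioritise avoiding false negatives
  else                         pmin(s_ij, s_ji)   # prioritise avoiding false positives
}

resolve_scores(c(6.2, 0.4), c(3.1, 5.6), favour = "sensitivity")   # 6.2 5.6
```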
Using custom cell line-specific weight parameters marginally improved summary statistics of the data compared with reliance on the default parameters. The overlap of significant interactions with and without the suggested parameters was around 90%, demonstrating the presence of either false positives or false negatives when using the standard weight parameters. These features were also seen, albeit more pronounced, in the cancer cell lines. Using custom weights improved the metrics applied to the output, as expected, since it provides the theory-mandated adjustment of the p-values. In a recent best-practice guide for CHiCAGO published by its authors, further aspects of parameter tuning have been proposed 17 .
As the framework we provide only considers how CHiCAGO processes input data, our methodology is largely resistant to limitations due to the underlying CHi-C inputs. There are small differences between the two versions of the oligonucleotide baits designed to capture promoter fragments in the cancer and non-cancer cell line data, but as CHiCAGO produces bait-specific biases as part of its model, we should not expect this to have a major influence on our conclusions. For calculations involving the FDR, FNR and FWER, owing to the lack of bona fide reference interactions, we were reliant on theoretically equivalent proxies. As a result, point estimates will be inherently imprecise and only allow us to make comparisons, or to reference confidence intervals, between different score thresholds. As demonstrated in Cairns et al. 8 (Figure S4), pairs with a low number of counts that are nevertheless significant (because they occur at large genomic distance) within a dataset are unlikely to be identified as significant when viewing a subset of interactions. This was presented in the context of undersampling, but suggests that the SST value obtained by optimising the FDR will be skewed towards a threshold which removes more low-read interactions. As the score is intended to be a measure of significance that does not depend on distance, we do not expect this skew to be strong. Furthermore, although we provide a large range of statistics as an example of how to assess Hi-C algorithms, there are some visual features that are not necessarily amenable to numerical description. Criteria for selecting the best summary statistics to efficiently assess algorithms are desirable, something that may potentially be tractable by applying approximate Bayesian computation 18 .
From a numerical and computational perspective our study highlights a few key points. The fact that the cubic distance function fit implemented by CHiCAGO matched the data for every cell line is unsurprising, given the large number of parameters it was able to utilise. We should similarly expect the same of the logistic regression, and so large failures to fit the data are indicative of the unsuitability of the underlying form of the curve for the data it is approximating. Moreover, the exact implementation of methodologies is demonstrated to be important. Numerical optimisation improved the FDR on average, but Nelder-Mead (NM) often produced thresholds too large to be useful in discovery, whereas Broyden-Fletcher-Goldfarb-Shanno (BFGS) provided more practical thresholds, serving to demonstrate the importance of understanding the underlying processes behind standard R functions. This difference in behaviour stems both from the fact that the Jaccard index is not a continuous function of the score threshold, and from NM being a heuristic algorithm. Using BFGS highlighted certain cell lines for which the SST was considerably different from the NST, but we did not see any meaningful biological reason for this occurrence. It is possible that this reflects the sparsity skew discussed previously.
Inevitably, a challenge in evaluating the performance of Hi-C and CHi-C algorithms is the absence of a large "gold standard" reference set of bona fide true interactions in a given cell line against which to test whether previously identified interactions are recovered. One way of generating a "null model" for comparison of di-tag interaction frequencies is to sequence a "random ligation" library prepared by reversal of cross-links prior to ligation 10 . This has, however, not generally been standard practice in the preparation of large numbers of libraries.
Analysis of CHi-C data generated from cancer cells clearly presents challenges beyond those of diploid cells. Translocations affect distance estimates, leading to highly significant interaction p-values between translocation breakpoints. As a prelude to any analysis of CHi-C data, examining pre-capture Hi-C data can be used to identify translocations and inform downstream analyses. Other molecular abnormalities in cancer cell lines, such as focal amplifications/deletions, regions of kataegis and chromothripsis, are more intractable sources of bias. Comprehensively accounting for such aberrations ideally requires de novo assembly of the cancer genome being investigated.
In conclusion, our analysis highlights a number of features that should be considered when evaluating CHi-C algorithms. In application to CHiCAGO, while we saw that the underlying null hypothesis was entirely sensible, assigning significance to a given interaction is not entirely straightforward. It is clear that many issues associated with the processing of CHi-C data are exacerbated when studying cancer-derived data because of the complex nature of their genomes.

Datasets analysed
The 28 cell lines and PCHi-C datasets analysed are detailed in the Supplementary material (see Extended data 15 ). Genome-wide heatmaps of Hi-C contacts were generated using HiCExplorer v2.1.1 21 to identify large-scale chromosomal translocations.

Evaluation of CHiCAGO
Cairns et al. 8 provide a mathematical specification of the algorithm used by CHiCAGO, and we utilise the same notation. For pairs less than 1.5 Megabasepairs (Mbp) apart, the CHiCAGO algorithm assumes that the contribution to the total number of counts from the 'technical noise' component of the null model is sufficiently lower than that from the 'Brownian' component, so it is reasonable to approximate the model as

N_ij ~ NB(s_i s_j f(d_ij), r),

where NB(μ, r) is a negative binomial distribution with probability mass function

P(X = k) = [Γ(k + r) / (Γ(r) k!)] (r / (r + μ))^r (μ / (r + μ))^k, k = 0, 1, 2, …

Discrete Kolmogorov-Smirnov (KS) test statistics for goodness of fit were calculated in each distance bin B_b. These were calculated under the null that counts for a pair at distance d within B_b follow NB(s_1 s_2 f(d), r), where d is uniform over the bin of width w, and s_1, s_2 are drawn from the bait and other-end bias distributions. We implemented Monte Carlo hypothesis testing to obtain p-values, using 5,000 simulations of 5,000 pairs to measure the D-statistic, using the standard p-value estimator of Davison and Hinkley (1997) 22 . Deriving a statistic per distance bin allowed us to examine how appropriate the null hypothesis is for each distance bin separately. A cell line-wise Bonferroni correction was applied to the significance threshold.
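A sketch of the Monte Carlo Kolmogorov-Smirnov procedure for a single distance bin is given below; it assumes a plain negative binomial null with fixed illustrative parameters rather than the full pair-specific means used in the analysis, and uses fewer simulations than the 5,000 reported above for speed.

```r
## Sketch: Monte Carlo Kolmogorov-Smirnov goodness-of-fit test for counts in one
## distance bin under a negative binomial null. Parameters are illustrative.
ks_D <- function(x, mu, size) {
  k <- 0:max(x)                                    # support over which the CDFs can differ
  max(abs(ecdf(x)(k) - pnbinom(k, mu = mu, size = size)))
}

mc_ks_test <- function(x, mu, size, nsim = 1000) {
  d_obs <- ks_D(x, mu, size)
  d_sim <- replicate(nsim, ks_D(rnbinom(length(x), mu = mu, size = size), mu, size))
  ## Davison & Hinkley (1997) Monte Carlo p-value estimator
  p <- (sum(d_sim >= d_obs) + 1) / (nsim + 1)
  list(D = d_obs, p.value = p)
}

## Data drawn from the null should typically give a non-small p-value
set.seed(42)
mc_ks_test(rnbinom(5000, mu = 3, size = 1.2), mu = 3, size = 1.2)
```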
We plotted estimates of the distance function f(d) against the data, namely the geometric mean of the non-zero read counts between bait-other end pairs in each distance bin, alongside a linear fit and a 'theoretical' fit. Specifically, we fit

- log f(d) = a_0 + a_1 log d + a_2 (log d)^2 + a_3 (log d)^3 (cubic)
- log f(d) = b_0 + b_1 log d (linear)

The 'theoretical' fit is of the form f(d) ∝ d^-1, as suggested to be the large-distance limit by Rosa et al. 16 . We further calculated the integral of the curvature of f over the distances considered, [10^4, 1.5 × 10^6] base pairs, as a measure of the deviation from a power law, which would be represented by a straight line on a log-log scale. Specifically, we calculated K = ∫ k(x) dx, where k(x) is the curvature of log f as a function of x = log d and the integral is taken over the range above. The limits of integration can be chosen in this way because, outside of this range, the distance function is extrapolated linearly on a log-log scale, where the curvature k is zero; careful treatment of the second derivatives at these limits is therefore not necessary.
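The sketch below illustrates, on simulated data, the cubic and linear fits on a log-log scale and one plausible numerical implementation of the curvature integral K; the curvature formula used is the standard plane-curve definition and is an assumption about the exact form intended above.

```r
## Sketch: cubic and linear fits to the distance function on a log-log scale, and a
## numerical curvature integral K. Data are simulated; the curvature formula is the
## standard plane-curve definition and is an assumption about the exact form of K.
set.seed(1)
logd <- log(seq(2e4, 1.5e6, length.out = 40))
logf <- 5 - 1.1 * logd + 0.01 * (logd - 12)^3 + rnorm(40, sd = 0.05)

cubic  <- lm(logf ~ poly(logd, 3, raw = TRUE))   # log f(d) = a0 + a1 log d + ... + a3 (log d)^3
linear <- lm(logf ~ logd)                        # linear fit; the 'theoretical' fit fixes the slope at -1

curvature_K <- function(fit, from = min(logd), to = max(logd)) {
  a  <- coef(fit)                                # a0, a1, a2, a3
  d1 <- function(x) a[2] + 2 * a[3] * x + 3 * a[4] * x^2
  d2 <- function(x) 2 * a[3] + 6 * a[4] * x
  integrate(function(x) abs(d2(x)) / (1 + d1(x)^2)^1.5, from, to)$value
}
curvature_K(cubic)
```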
To validate the assumption that the technical noise will have minimal effect at small distance, the mean λ parameter for the pairs was calculated as per CHiCAGO. Moreover, boxplots were produced demonstrating the distribution of the parameter in each pool used in the estimation procedure.
To adjust for multiple testing in CHiCAGO, p-values are weighted according to the prior probability of a given null hypothesis being true or false 8 . By default, the weights are those estimated from human macrophage data, which defines a reproducible interaction as one for which log p < -10 in all replicates. To evaluate this definition, we considered the function g(ρ), defined as the proportion of pairs with log p < ρ that do not pass the score threshold; for ρ = -10, g gives an FNR for reproducible interactions. Furthermore, we calculated the value of ρ that gives g(ρ) = 0.05 (assuming that such a value exists) to determine a p-value threshold for reproducible interactions coherent with the score statistic.
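A sketch of the g(ρ) calculation and the search for the ρ giving a 5% FNR is shown below, using a simulated per-pair results table with hypothetical columns log_p and score in place of actual CHiCAGO output.

```r
## Sketch: the false-negative-rate function g(rho) and the rho giving g(rho) = 0.05.
## 'res' is a simulated stand-in for per-pair CHiCAGO output, with hypothetical
## columns log_p (natural-log p-value under the null) and score.
set.seed(7)
res <- data.frame(log_p = -rexp(5e4, rate = 0.2))
res$score <- pmax(0, -res$log_p + rnorm(5e4, sd = 3))   # score loosely tracks -log p

g <- function(rho, res, threshold = 5) {
  robust <- res$log_p < rho                  # 'robust' pairs at this p-value threshold
  if (!any(robust)) return(NA_real_)
  mean(res$score[robust] <= threshold)       # proportion of robust pairs missed at the NST
}

g(-10, res)                                  # FNR at the default definition

## crude grid search for the rho at which g(rho) = 0.05 (assuming it exists)
rho_grid <- seq(-50, -5, by = 0.5)
g_values <- sapply(rho_grid, g, res = res)
rho_grid[which.min(abs(g_values - 0.05))]
```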
CHiCAGO's algorithm is not symmetric in its treatment of bait-bait pairs, and we calculated the correlation between non-zero values of score_ij and score_ji for bait-bait pairs. By excluding pairs where both scores were zero, we avoided correlations being artificially inflated because there are many more non-interacting pairs than interacting pairs. We further computed the proportion of bait-bait pairs that passed the advocated score threshold of > 5 in both directions, relative to those passing the threshold in at least one direction, that is

|{pairs : score_ij > 5 and score_ji > 5}| / |{pairs : score_ij > 5 or score_ji > 5}|.

To examine the reproducibility of interactions called as significant by CHiCAGO, we produced alternative score thresholds based on: (i) a Bonferroni correction to control the FWER in the smallest distance bin; (ii) controlling the FNR; and (iii) minimising the FDR, as measured via the Jaccard index between replicates. Specifically, for the Bonferroni correction, as thresholding at a score α requires that the evidence for an interaction exceeds that of a proximal pair with p-value e^-α, we imposed the threshold

α = -log(0.05 / N),

where N is the number of pairs tested in the smallest distance bin. The measure of reproducibility used was the Jaccard index of the sets of significant interactions in each replicate, which is reported as an FDR. Under the assumption that true interactions will be significant in all replicates and false interactions will not, we have the equivalence

False Discovery Rate = 1 - Jaccard Index.

At each score threshold, the number of interactions was reported to balance reliability and sensitivity. To demonstrate the importance of the choice of optimisation methodology when minimising the FDR for a given cell line, two optimisation methods were used: bound-constrained limited-memory BFGS (L-BFGS-B), and the default Nelder-Mead (NM) method utilised by optim in R.
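The sketch below illustrates the Jaccard-based FDR proxy and how the two optimisers would be called through optim in R; the replicate scores are simulated, so the numerical output is not expected to reproduce the thresholds reported in Tables 2 and 3.

```r
## Sketch: Jaccard-index FDR proxy across two replicates and score-threshold
## optimisation with the two optim methods named above. Replicate scores are
## simulated, so the output is not expected to match Tables 2 and 3.
set.seed(3)
true_int <- rbinom(2e4, 1, 0.05)                           # latent 'true' interactions
rep1 <- true_int * rgamma(2e4, 8, 1) + rgamma(2e4, 1, 1)   # replicate 1 scores
rep2 <- true_int * rgamma(2e4, 8, 1) + rgamma(2e4, 1, 1)   # replicate 2 scores

fdr_proxy <- function(threshold, s1, s2) {
  a <- which(s1 > threshold)
  b <- which(s2 > threshold)
  1 - length(intersect(a, b)) / length(union(a, b))        # FDR proxy = 1 - Jaccard index
}

nm    <- optim(5, fdr_proxy, s1 = rep1, s2 = rep2, method = "Nelder-Mead")
bfgsb <- optim(5, fdr_proxy, s1 = rep1, s2 = rep2, method = "L-BFGS-B",
               lower = 0, upper = 30)
c(NelderMead = nm$par, LBFGSB = bfgsb$par)
```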
We evaluated CHiCAGO's weighting procedure. In the estimation of the weight parameters, CHiCAGO's algorithm fits a monotonic decreasing curve to the observed prior probability of interaction through bounded logistic regression. To examine the extent to which monotonicity is observed in the data, a lack-of-monotonicity statistic, taking values between 0 and 1, was generated for the data (a statistic with these properties is sketched below); the statistic is natural in the sense that if the sequence (v_i) is monotonic, the lack of monotonicity is 0. To evaluate the fit we calculated an RSS; distance bins in which no interactions were observed were neglected, since for these bins CHiCAGO estimates the prior probability of interaction to be 0, hence log(p) is infinite. Thus, these points provide no information for the RSS, but nevertheless indicate a failure of the fitted model to represent the data, and so the presence of bins in which no interactions were observed is also reported.
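The exact formula is not reproduced in the text above; the sketch below implements one natural statistic with the stated properties (0 for a monotonically decreasing sequence, bounded above by 1), namely the share of the total movement between successive bins that is upward. The authors' precise definition may differ.

```r
## Sketch: a lack-of-monotonicity statistic with the properties described in the text
## (0 for a monotonically decreasing sequence, bounded above by 1): the share of the
## total movement between successive distance bins that is upward. The authors' exact
## formula may differ.
lack_of_monotonicity <- function(v) {
  d <- diff(v)
  if (all(d == 0)) return(0)
  sum(pmax(d, 0)) / sum(abs(d))
}

lack_of_monotonicity(c(0.6, 0.5, 0.4, 0.2))    # 0: perfectly decreasing
lack_of_monotonicity(c(0.6, 0.5, 0.55, 0.2))   # 0.1: local peak at the third bin
```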
We estimated the weight parameters at the thresholds of -10 and ρ for each cell line. The parameters that provided the better fit (i.e. had the lower RSS) were used and CHiCAGO was re-run, after which a Jaccard index was calculated for the sets of significant interactions called by CHiCAGO using either the default or the updated weight parameters.
All methods described were implemented in R version 3.6.3. A copy of the program, modified for readability above utility, is released as Extended data (Trim_of_CHiCAGO_evaluation.R) 15 . This project contains the following extended data:

Data availability
…significance threshold. The paper highlights the importance of training p-value weights on one's own data and the challenges of choosing an optimal score threshold in the absence of a gold standard.
As a developer of CHiCAGO, I thank the authors for an independent evaluation of our pipeline and am happy to see that it has withstood most tests.
I only have a very small number of comments:
1. "In most cell lines the p-values associated with the small distance bins were effectively zero but rapidly increased to near-unity in the larger distance bins". In the authors' opinion, why are the p-values so low in the small distance bins? It would be good to state this more explicitly.
2. It would be good to unpick a bit more the phenomena that underlie "non-reversible" bait-to-bait contacts, which are generally expected due to the asymmetrical nature of CHi-C data and CHiCAGO's analytical approach. Theoretically, this may be to do with challenges in estimating s_i's (other-end scaling factors) based on much less information than available for s_j's (bait scaling factors). Secondly, this could be due to differences in the coverage of respective baits, potentially leading to differential sensitivity in signal detection depending on the viewpoint. Finally, from Fig 2 it seems that there are a few cases where the difference in score is quite small and so their position on either side of the threshold is incidental. It might be worth discussing these possible situations in a bit more detail. Also, what is the authors' suggested strategy for dealing with these situations? In our lab, we pick the pair with the higher score, as we believe that false negatives are generally a bigger issue in CHi-C and CHiCAGO than false positives. Alternatively, bait-to-bait contacts could be analysed separately as a symmetrical matrix using appropriate tools.

3. I wonder to what extent the score threshold selection methods based on reproducibility across replicates are inflated due to sparsity issues (as we showed in Cairns et al., 2016, Fig S4) 1 . This is the reason why in Javierre 2 we used a different approach (based on Blangiardo et al. 3 ) to show consistency between datasets, while Freire-Pritchett et al. (eLife 2017; Nat Protoc 2021) 4,5 use a threshold-tuning approach based on balancing enrichment and recall of PIRs containing enhancer-associated histone marks. It would be good to discuss these points.
…those highlighted by the reviewer.

We mention the possible skew effect of sparsity, which we expect to be small, and comment on the possible link to comment 9 of Reviewer 1.

These changes you will see in an updated version of the paper. Having made changes to the manuscript, we hope it is now suitable for indexing.

Borbala Mifsud
College of Health and Life Sciences, Hamad Bin Khalifa University, Doha, Qatar

Disney-Hogg et al. have considered what the important features are of algorithms used for identifying biologically meaningful interactions in capture Hi-C analysis. They used CHiCAGO, a widely applied algorithm, as their case study and assessed the appropriateness of the distance function used for correction, the effect of altering the weights in the model, how thresholding influences the resulting significant interactions, and the reciprocity of bait-bait interactions. Given the lack of a set of gold standard interactions, this evaluation is based on the assumption that true interactions are reproducible. They performed the analyses both in normal hematopoietic cells and in cancer cell lines. They found that the cubic spline fits the data well and that the model used by CHiCAGO creates highly asymmetric results for bait-bait interactions. Changing the weights to custom calculated ones slightly improves the quality of the identified interactions. The suggested threshold for significance is appropriate for most cell lines to optimize FNR and FDR; however, there are notable exceptions to this. There are a number of methods developed for capture Hi-C data analysis and this approach can be applied to evaluate and compare their performance, and it is therefore a valuable contribution to the field.

Minor comments:
1. Promoter is the correct term for the genomic region instead of promotor.
2. Supplementary data 1 is missing information on the captured read count.
3. In figure 1, the fragmentation step is missing after the ligation "Fragmentation, biotin pulldown, adapter ligation and PCR".
4. Table 1 would probably be easier to see in a plot where the test statistics and the p-value are plotted against the distance.
5. Please add labels to the scales in Supp. Figure 2.
6. In figures where the distance is one of the variables, could you indicate the unit?
7. The definition of FDR based on Jaccard distance should be in the main text.
8. In Table 2, it would add valuable information if the number of significant interactions were included for each threshold.
9. Could you comment on the observation that the optimal threshold for FDR using BFGS was mostly around the recommended threshold, but in some cases it was very different? Do those data sets share any characteristics that could explain it?

Are sufficient details provided to allow replication of the method development and its use by others? Yes
If any results are presented, are all the source data underlying the results available to ensure full reproducibility? Yes

Are the conclusions about the method and its performance adequately supported by the findings presented in the article? Yes

…for the delay in responding; however, this unfortunate delay was the consequence of me starting a PhD on an unrelated subject compounded by Covid-related issues. All your comments were extremely insightful and pertinent, and we have largely implemented all