A response to Yu et al. "A forward-backward fragment assembling algorithm for the identification of genomic amplification and deletion breakpoints using high-density single nucleotide polymorphism (SNP) array", BMC Bioinformatics 2007, 8: 145
Yu et al. (BMC Bioinformatics 2007,8: 145+) have recently compared the performance of several methods for the detection of genomic amplification and deletion breakpoints using data from high-density single nucleotide polymorphism arrays. One of the methods compared is our non-homogenous Hidden Markov Model approach. Our approach uses Markov Chain Monte Carlo for inference, but Yu et al. ran the sampler for a severely insufficient number of iterations for a Markov Chain Monte Carlo-based method. Moreover, they did not use the appropriate reference level for the non-altered state.
We rerun the analysis in Yu et al. using appropriate settings for both the Markov Chain Monte Carlo iterations and the reference level. Additionally, to show how easy it is to obtain answers to additional specific questions, we have added a new analysis targeted specifically to the detection of breakpoints.
The reanalysis shows that the performance of our method is comparable to that of the other methods analyzed. In addition, we can provide probabilities of a given spot being a breakpoint, something unique among the methods examined.
Markov Chain Monte Carlo methods require using a sufficient number of iterations before they can be assumed to yield samples from the distribution of interest. Running our method with too small a number of iterations cannot be representative of its performance. Moreover, our analysis shows how our original approach can be easily adapted to answer specific additional questions (e.g., identify edges).
KeywordsMarkov Chain Monte Carlo Hide State Markov Chain Monte Carlo Algorithm Viterbi Algorithm Single Nucleotide Polymorphism Array
array-based comparative genomic hybridization
aCGH analysis method developed by 
aCGH analysis method developed by 
aCGH analysis method developed by 
Markov Chain Monte Carlo
Reversible Jump-based analysis of aCGH data developed by 
- ROC (curve)
Receiver operating characteristic (curve)
Single nucleotide polymorphism
The recent paper by Yu et al.  proposes a new method for the analysis (segmentation) of high density single nucleotide polymorphism (SNP) arrays to detect copy number changes (CNAs) in genomic DNA. Their approach has been designed to be highly sensitive for edge detection and is tailored to SNP arrays. In their paper, Yu et al. compare the performance of their approach with that of several alternative methods initially developed for the analysis of array-CGH (aCGH) data. The methods compared include Circular Binary Segmentation , GLAD , CGHseg  – though not in the original formulation of their authors, as the recommended adaptive penalization of  is not used–, and three Hidden Markov Model approaches: a homogenous one , the non-homogeneous one of Marioni et al. , and RJaCGH, our own non-homogeneous HMM using Markov Chain Monte Carlo (MCMC) with Reversible Jump .
The different aCGH technologies, from BAC-based aCGH to oligonucleotide aCGH (oaCGH), including Affymetrix SNP arrays, differ in several ways, both in terms of coverage, technology, noise per individual element, requirement for reference samples and minimal DNA quality, and LOH information returned (e.g., see [8, 9, 10, 11, 12, 13]). Some of these differences might require customized approaches for SNP-based arrays [11, 13].
Thus, our objective here is not to show the optimality of RJaCGH for SNP-based arrays but, rather, to reanalyze the data of Yu et al.  using more appropriate parameters for the Markov Chain Monte Carlo in RJaCGH, which yield much improved performance, as well as to show extensions of our model for edge detection, which show the relevance of model-based methods of direct interpretation that explicitly return probabilities.
Re-analyzing the data
We will re-analyze the data in : in the original comparison RJaCGH was used with a) the incorrect reference for normal samples and b) too few iterations for both the burn-in and posterior sampler in the MCMC algorithm.
The data analyzed, in the non-altered condition, have a mean value of 1.03 and a median of 1.05. RJaCGH, by default, expects that the "normal" values will be centered around 0. This value of 0 is, often, the result of the log2 of a (normalized) ratio of 1. Expecting values of non-altered regions to be centered around 0 is not unique to RJaCGH; for instance, Fridlyand et al. , in p. 138, explain why they expect that the median copy number of the array will typically be normalized to 0. Likewise, Marioni et al.  assume a Gaussian for the log2 ratios of each state, which therefore means that regions with unaltered copy number are expected to show values centered around 0. Similarly, Picard et al.  indicate (p. 2) "Array CGH data are normalized with a median set to log2(ratio) = 0 for regions of no change (...)". Now, the last three methods, CGHseg , aCGH , and BioHMM , might not be adversely affected, in terms of edge detection, if the normal samples, instead of being centered around 0, are centered around 1 (as in the current data). However, RJaCGH is adversely affected: our algorithm, by default, uses a step where clones are separated into the three groups "gained", "lost", "no change", and information about the expected value of the non-changed clones is used to separate these three states, and therefore to identify some of the edges. Appropriately using RJaCGH, thus, requires either normalizing the data so that unaltered areas are centered around 0 or, else, explicitly using the "normal.reference" parameter to the RJaCGH function in our code (this parameter might not have been available in the version used in , but recentering the data would have had the same effect).
Second, and more importantly, RJaCGH has been run with a burn-in run of 50 iterations and results obtained from 450 runs (TOT parameter set to 500). Except for extremely simple Bayesian models, in real-life usage, much longer runs are often necessary (e.g., ) both for the burn-in period (i.e., to try to ensure that we are sampling from the true posterior) and for the number of MCMC iterations used after the burn-in period (i.e., the number of samples used to actually carry out inferences). The number of iterations should be even larger when we are using Reversible Jump and a complex model as HMM [15, 16, 17]. It is extremely unlikely that with only 50 burn-in iterations our model would be anywhere near convergence, and thus all inferences from these models are suspect.
Parameters for re-analyzing the data
The data and further details on the analysis and simulations were kindly provided by the first author of the paper, Tianwei Yu. We rerun the RJaCGH analysis, using 10000 burn-in iterations and 40000 samples for the posterior. Note that, to depart from the original  as little as possible, we do not use multiple chains in parallel (which would, however, be the recommended procedure and would also allow to check for convergence [14, 18], as explained in our paper ).
As explained above, we set the reference mean to 1 (parameter "normal.reference" in RJaCGH). We could have, instead, centered all data to ensure that the non-altered regions were centered around 0. To account for different levels of noise in the data (or to modify the trade-off between false positive and false negative rates) RJaCGH uses only one parameter, "normal.ref.percentile". This parameter might not have been easily available to the authors of  and, thus, none of our key criticisms hinge around non-usage of this parameter. "normal.ref.percentile" is used to set the width of the confidence band based on a Normal (Gaussian) distribution: higher values of "normal.ref.percentile" will tend to incorporate more states into the "non-change" final state. We have varied "normal.ref.percentile" from 0.5 to 0.99 to construct ROC curves as in Figure 3 of . All code for our analysis is available from .
Searching directly for edges
In the reanalysis above, we used RJaCGH in the default way. This means that the inference about the edges is not one where we estimate the probability of a spot being an edge. As RJaCGH is an HMM-based model, its main objective is to try to assign clones to their true (hidden) state; RJaCGH additionally automatically collapses the hidden states to three biologically-motivated levels: gained/lost/no-change. Thus, to find edges, we first use RJaCGH to infer the status of a clone (gained, lost, no change) and then we identify the edges by looking for two consecutive probes where status changes. There is, however, a simple and direct procedure to obtain the probability of an edge that does not involve the above indirect procedure. Since we can apply the Viterbi algorithm to every MCMC sample , for each MCMC sample we can directly obtain the locations where there is an edge: spot i is an edge if its state –as identified by the path from the Viterbi algorithm– is different from that of spot i + 1. This approach has the advantage of returning the probability of an edge in a given spot, directly incorporating averaging over number of hidden states, and not requiring to specify "normal.ref.percentile". The later also illustrates one difference with regards to the previous approach. In the previous approach, the edges are identified from the "collapsed" classification of all hidden states into gained/lost/no-change (and this is why we required the specification of "normal.ref.percentile"). Here, in contrast, we will identify as an edge any change in state, even if the change in state is between hidden states that will eventually be considered a single biological level in the classification (e.g., if the change in state is between two different hidden states that will later be both considered "gain").
Finally, there is no need to re-run the algorithm to modify the trade-off between sensitivity and FDR, since this trade-off can be changed by changing the probability threshold over which we consider a spot to be an edge. This feature provides added flexibility, which allows users to choose the best threshold for any application, or even to use different thresholds for different probes or chromosomes. Moreover, even if the edge is not located precisely, we can obtain evidence suggestive of the area where an edge is located. For instance, in Figure 3, panel f) the exact location of any edge is not well defined, but there are three areas where spots of high edge probability are located, and an additional two with areas of medium edge probability; these types of representations might suggest refining searches in these areas. In contrast, panel e) is indicative of a situation where the algorithm is likely to be performing poorly, thus alerting users that those results should not be trusted.
We have shown (Figure 1) that, when used properly (i.e., with realistic number of chains for the MCMC algorithm and with the correct reference for the normal–no-change–probes), RJaCGH can achieve performance on-par with, or better than, other methods in this data set. Moreover, and in contrast to other methods, the model of RJaCGH is sufficiently flexible that we can directly obtain answers to the questions we are interested in, as shown in Figure 2, where we directly obtain the probabilities of an edge from the algorithm, without intermediate and indirect calculations. In fact, one of the advantages of the explicit model underlying RJaCGH  is the possibility of obtaining answers (in the form of statistics, posterior probabilities, etc) to the biologically relevant questions. Our original code was targeted to correctly classifying probes into the gained/lost/no-change status but we can, as well, focus the analysis on the probability that a probe is an edge. Moreover, it is easy to extend this approach to ask (and answer) more specific questions; for example, we could be interested in the probability that a probe is an edge between a state of no-change and a state of gain. In contrast, answers to these types of questions are often hard to obtain from other methods and the answers, if available, are rarely as immediately applicable as a probability. Finally, both our method and BioHMM  are unique among the methods compared because they can incorporate unequal distance between probes. With the data provided, the performance advantage of these two methods can not be detected since, by construction of the simulations, the inter-probe distance is the same (or plays no role in the probability of change in state).
A broad issue raised by the paper of  and our reanalysis is why HMM-based models, as well as approaches such as CGHseg and Circular Binary Segmentation, do not achieve better performance with these data, especially given their excellent performance in other aCGH platforms [7, 20, 21]. The comparatively poor performance of these approaches with SNP-based data might be attributable more to specifics of the Affymetrix platform [11, 12, 13, 22, 23, 24] than to specific values of parameters related to noise in the data (as exemplified by Figure 3).
Response from original authors
Tianwei Yu & Xiaofeng Zhou
Department of Biostatistics, Rollins School of Public Health, Emory University, Atlanta, GA USACenter for Molecular Biology of Oral Diseases, College of Dentistry, University of Illinois at Chicago, Chicago, IL USAGuanghua School & Research Institute of Stomatology, Sun Yat-Sen University, Guangzhou, China
Corresponding author: Xiaofeng Zhou
In our paper , the comparison study was done with a previous version of RJaCGH (version 1.0.0). The current version of RJaCGH (version 1.1.1) was publicly available only after our paper was already published. Two key parameters were non-existent in the version 1.0.0 . They are "normal.ref.percentile" and "normal.reference". The first one controls the sensitivity in edge detection, and the second one determines the expected value of the non-changed clones.
As Rueda and Diaz-Uriarte pointed out, two factors contributed to the apparent suboptimal performance of RJaCGH in our original paper . The first is the expected value of the two copy clones being non-zero. It is a characteristic of copy number data processed by Copy Number Analysis Tool (CNAT) from Affymetrix Inc. In the version of RJaCGH we tested, there was no such parameter to adjust and its relevance would not have been known to the user. Obviously the RJaCGH package has been improved afterwards to add the parameter "normal.reference" to address this issue. The other two HMM-based packages already reached excellent performance without setting such a parameter . We wish to point out the difficulty and importance of correctly setting this parameter in Affymetrix SNP array-based copy number analysis. With the current Affymetrix SNP array platforms, the normal reference dataset was provided by an internal library, which is implemented as an integrated part in Affymetrix CNAT software . While this approach greatly reduces the effort and cost to perform copy number analysis, it could also introduce bias from various sources. Hence even minimum variations in the experimental procedure may cause the median value of the two copy clones to drift from the ideal value. In addition, the X chromosome without copy number deviation from a male sample will yield copy numbers centered around 1, as opposed to 2 from the autosomes. These issues require special care by the user when a "normal.reference" value needs to be set.
The authors criticized us for conducting simulations using existing methods "not in the original formulation of their authors". We would like to point out that in Supplement Table 1 of our paper , the comparison of all the tested methods were performed with the default settings, except the number of permutations in DNAcopy and number of iterations in RJaCGH. In the results reported in Figure 3 of , modifications of the parameters were made in an effort to adapt them better to the SNP array data and improve their performance. Certainly this was done from a user's perspective and we could not guarantee optimality for every package tested. All the parameters we used were listed in Table 1 of our paper . These manipulations were necessitated by the fact that copy number data generated by the Affymetrix arrays have different characteristics than traditional array-CGH data. It would be beneficial if authors of the packages could provide users guidance as to how to optimize their packages for newer arrays. In general, we are happy that changes are now implemented in the new version of RJaCGH as targeted improvements. And it is encouraging to see the boosted performance of the method on SNP array-based copy number data.
X. Zhou and, particularly, T. Yu for explanations and details of their simulations. Two reviewers for comments on the ms.
Funding provided by Fundacion de Investigacion Medica Mutua Madrilena. R.D.-U. is partially supported by the Ramon y Cajal programme of the Spanish MEC. Funding bodies had no role in study design, in the collection, analysis, and interpretation of data, in the writing of the manuscript or in the decision to submit the manuscript for publication.
- 1.Yu T, Ye H, Sun W, Li KC, Chen Z, Jacobs S, Bailey DK, Wong DT, Zhou X: A forward-backward fragment assembling algorithm for the identification of genomic amplification and deletion breakpoints using high-density single nucleotide polymorphism (SNP) array. BMC Bioinformatics 2007, 8: 145.PubMedCentralCrossRefPubMedGoogle Scholar
- 11.Huang J, Wei W, Chen J, Zhang J, Liu G, Di X, Mei R, Ishikawa S, Aburatani H, Jones KW, Shapero MH: CARAT: a novel method for allelic detection of DNA copy number changes using high density oligonucleotide arrays. BMC Bioinformatics 2006, 7.Google Scholar
- 12.Nannya Y, Sanada M, Nakazaki K, Hosoya N, Wang L, Hangaishi A, Kurokawa M, Chiba S, Bailey DK, Kennedy GC, Ogawa S: A robust algorithm for copy number detection using high-density oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Res 2005, 65(14):6071–6079.CrossRefPubMedGoogle Scholar
- 14.Gelman A, Carlin JB, Stern HS, Rubin DB: Bayesian Data Analysis. 2nd edition. Chapman & Hall/CRC; 2003.Google Scholar
- 15.Green P: Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 1995, 7: 11–732.Google Scholar
- 16.Capp'e O, Moulines E, Ryden T: Inference in Hidden Markov Models (Springer Series in Statistics). Springer 2005.Google Scholar
- 18.Brooks S, Gelman A: General Methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics 1998, 7: 434–455.Google Scholar
- 19.Scripts for analysis[http://ligarto.org/rdiaz/R-code-reponse-Yu-BMC-Bioinfo.tar.gz]
- 23.Zhao X, Li C, Paez JG, Chin K, Jänne PA, Chen TH, Girard L, Minna J, Christiani D, Leo C, Gray JW, Sellers WR, Meyerson M: An integrated view of copy number and allelic alterations in the cancer genome using single nucleotide polymorphism arrays. Cancer Res 2004, 64(9):3060–3071.CrossRefPubMedGoogle Scholar
- 25.Rueda OM, Diaz-Uriarte R: A flexible, accurate and extensible statistical method for detecting genomic copy-number changes. COBRA Preprint Series 2006., 9:Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.