Fast MCMC sampling for hidden markov models to determine copy number variations
 5.4k Downloads
 2 Citations
Abstract
Background
Hidden Markov Models (HMM) are often used for analyzing Comparative Genomic Hybridization (CGH) data to identify chromosomal aberrations or copy number variations by segmenting observation sequences. For efficiency reasons the parameters of a HMM are often estimated with maximum likelihood and a segmentation is obtained with the Viterbi algorithm. This introduces considerable uncertainty in the segmentation, which can be avoided with Bayesian approaches integrating out parameters using Markov Chain Monte Carlo (MCMC) sampling. While the advantages of Bayesian approaches have been clearly demonstrated, the likelihood based approaches are still preferred in practice for their lower running times; datasets coming from highdensity arrays and next generation sequencing amplify these problems.
Results
We propose an approximate sampling technique, inspired by compression of discrete sequences in HMM computations and by kdtrees to leverage spatial relations between data points in typical data sets, to speed up the MCMC sampling.
Conclusions
We test our approximate sampling method on simulated and biological ArrayCGH datasets and highdensity SNP arrays, and demonstrate a speedup of 10 to 60 respectively 90 while achieving competitive results with the stateofthe art Bayesian approaches.
Availability: An implementation of our method will be made available as part of the open source GHMM library from http://ghmm.org.
Keywords
Hide Markov Model Markov Chain Monte Carlo Copy Number Variation Expectation Maximization Single Nucleotide Polymorphism ArrayBackground
The Sirens' call of Bayesian methods is that we can do without the parameters of models and, instead, compute probabilities of interest directly, indicating for example how likely a biological fact is given our data. The price one pays for that convenience is on one hand the conundrum of which prior distributions to choose and how to set their hyperparameters; the frequent use of uniform priors is evidence that this is indeed nontrivial. On the other hand, unless the choice of likelihood and prior yields simple posteriors which we can integrate symbolically, we have to resort to sampling for example with Markov Chain Monte Carlo (MCMC) [1].
In the following we will concentrate on HMMs, stochastic functions of Markov Chains, which have not only been used extensively for discrete sequencespairwisesequence alignments with pairHMMs [2], gene finding with labeled HMMs [3], and detecting remote homologs using profile HMMs [4]but also for continuousvalued observations, such as gene expression timecourses [5]. Continuous observation sequences from either DNA microarrays or next generation sequencing experiments, note that the proportion of mapped reads in an interval is frequently used as a continuous measure of copy number, to detect chromosomal aberrations or copy number variations lead to the same fundamental computational problem and share characteristics of the data. The goal is to segment an observation sequence into regions in which there is little variation around a common mean. In other words, the assumption is that the data can be approximately described by piecewise constant functions. Indeed, if hybridization intensity was an exact, unbiased measurement of DNA concentration before amplification, the sequence of hybridization intensities of probes along a chromosome would yield a piecewise constant function in ArrayCGH experiments. This assumption holds true for a mixture of different cell populations because a finite sum of piecewise constant functions is again a piecewise constant function.
A wide range of methods for copy number detection in ArrayCGH data have been developed in recent years, including changepoint detection based methods [6, 7], smoothing based methods [8, 9], and hierarchical clustering [10]. Here, we concentrate on HMMbased approaches which have been proposed for segmenting sequences of continuousvalued observations and shown to match or improve upon the stateoftheart [11, 12, 13]. Once a model is trained from the data, either using maximum likelihood (ML) or maximum a posteriori (MAP), the segmentation is given by the most likely state sequence obtained with the Viterbi algorithm [14]. The segmentation, however, is highly dependent on the model parameters. A small change in the computed parameter values can adversely affect the recovered segmentation. A full Bayesian approach alleviates this dependence by integrating out the model parameters. As analytic integration of a complex high dimensional model is impossible for most distributions, the Bayesian approach requires the use of numerical integration techniques like MCMC [15], for example by direct Gibbs sampling [16] of model parameters and state paths. Scott [17] reports that combining the forwardbackward recursions [18] and Gibbs sampling improves the converge rate and consequently the running time. Nevertheless, MCMC remains substantially slower than training one model and running Viterbi once and the loss in reliability introduced by relying on one ML or MAP model is ignored in practice. For discrete emissions, compressing sequences and computing forward and backward variables and Viterbi paths on the compressed sequences yields impressive speedups [19]. However, discretization of continuous emissions, similar to vector quantization used in speech recognition [18], is not viable as the separation between the different classes of observations is low since the observations are onedimensional. Moreover, maximal compression is to be expected for small number of discrete symbols and clearly compression ratio conflicts with fidelity in the analysis.
At the core of our approach is a similar geometrical argument about several univariate data points based on a modified kdtree. We adaptively identify blocks of observations, cf. Figure 1 (middle). For all observations in a block we now estimate, at least conceptually, the most likely state simultaneously depending on the means of the Gaussians in each state to gain a considerable speedup proportional to the average block length. Similarly, we can avoid sampling states for each individual observation in a block if we can bound the posterior. Considerable care has to be taken for combining blocks and to bound the errors introduced by the approximations based on geometric arguments.
In summary, we

propose the first use of spatial index structures for several consecutive observations and approximate computations based on geometric arguments to substantially speedup the problem of segmenting sequences of continuous observations using HMM,

demonstrate that very frequently used biological datasets of high relevance measuring chromosomal aberration and copy number variations are consistent with our assumptions of piecewise constant observations with added noise, and

achieve speedups between 10 and 90 on those biological datasets while maintaining competitive performance compared to the stateoftheart.
Methods
HMM
We only consider HMMs with Gaussian emission distributions; see Figure 1 (right) for an example and [18] for an introduction. We will use the following notation: N denotes the number of states, S = {s_{1}, s_{2}, ..., s_{ N }} the set of states, T the length of an observation sequence O = {o_{1}, o_{2}, ..., o_{ T }} with o_{ t }∈ ℝ, A = {a_{ ij }}_{1≤i,j≤N}the transition matrix, π = (π_{1}, π_{2}, ..., π_{ N }) the initial distribution over states, $B=\left\{\left({\mu}_{1},{\sigma}_{1}^{2}\right),\dots ,\left({\mu}_{N},{\sigma}_{N}^{2}\right)\right\}$ with μ_{1} ≤ ... ≤ μ_{ N }are parameters of the emission distributions, and Q = {q_{1}, q_{2}, ..., q_{ T }} the hidden state sequences with q_{ t }∈ S. From an observation sequence O we can obtain a maximum likelihood point estimate of the parameters (A, B, π) using the ExpectationMaximization (EM) or BaumWelch [23] algorithm and compute the most likely path using the Viterbi algorithm [14].
MCMC Sampling
Bayesian analysis requires choosing prior distributions on parameters and we use standard conjugate prior distributions following [1]. Specifically, we choose ${\mu}_{i}~N\left({\stackrel{\u0303}{\mu}}_{i},{\stackrel{\u0303}{\sigma}}_{i}\right),{\sigma}_{i}^{2}~Gamma\left({a}_{i},{b}_{i}\right),{A}_{i}~Dirichlet\left({\lambda}^{{A}_{i}}\right)$, and π ~ Dirichlet(λ^{π}), where ${\stackrel{\u0303}{\mu}}_{i},\phantom{\rule{2.77695pt}{0ex}}{\stackrel{\u0303}{\sigma}}_{i},\phantom{\rule{2.77695pt}{0ex}}{a}_{i},\phantom{\rule{2.77695pt}{0ex}}{b}_{i},\phantom{\rule{2.77695pt}{0ex}}{\lambda}^{{A}_{i}}$, and λ^{π} are the hyperparameters of the model.
 1.
Choose initial parameters θ ^{0} = (A ^{0}, B ^{0}, π^{0})
 2.
Perform the following steps for iteration 0 ≤ m < M.
 (a)
Compute forward variables P(q _{ t }= i, O _{1, ...,t}θ ^{ m }) for 1 ≤ t ≤ T iteratively using the forward algorithm [18] for all i.
 (b)
Sample ${q}_{T}^{m}~P\left({q}_{T},O{\theta}^{m}\right)$.
 (c)Sample the state sequence Q ^{ m }in a backward fashion for T > t ≥ 1.$\begin{array}{cc}{q}_{t}^{m}& ~P({q}_{t}^{m}{q}_{t+1}^{m},O,{\theta}^{m})\hfill \\ \propto P({q}_{t}^{m},{O}_{1},\dots ,t{\theta}^{m}){a}_{{q}_{t}^{m}}{,}_{{q}_{t+1}^{m}}.\hfill \end{array}$
 (d)Sample,$\begin{array}{ll}\hfill {\theta}^{m+1}~& \mathsf{\text{PriorDistribution}}\left(H,O,{Q}^{m},{\theta}^{m}\right)\phantom{\rule{2em}{0ex}}\\ \left[H=\mathsf{\text{Set}}\phantom{\rule{0.3em}{0ex}}\mathsf{\text{of}}\phantom{\rule{0.3em}{0ex}}\mathsf{\text{hyperparameters}}\right].\phantom{\rule{2em}{0ex}}\end{array}$
Despite its fast convergence, FBGsampling takes O(MTN^{2}) time for M iterations. For long observation sequences with millions of observations, as they are common in biological applications, realistic values for M and N make FGBsampling intractable. In the next section we discuss how FBGsampling can be approximated to improve the running time to O(γMTN^{2}), where γ is the compression ratio of the observation sequence, while maintaining good statistical properties. We refer to our sampling technique as approximate sampling.
Approximate sampling
Through application of a modified kdtree algorithm (details below), we compress the observation sequence O = o_{1}, ..., o_{ T }into ${O}^{\prime}={o}_{1}^{\prime},\dots ,{o}_{{T}^{\prime}}^{\prime}$, cf. Figure 1 (middle), and store precomputed first moment, second moment, and the number of observations compressed into block ${o}_{t}^{\prime}$ for all i ≤ T'. In subsequent MCMC iterations we assume that observations compressed in a block${o}_{t}^{\prime}$ arise from the same underlying state. In other words we ignore the contribution of the state paths that do not go through the same state for${o}_{t}^{\prime}$. By ignoring those state paths, we refer to them as weak state paths, when computing forwardvariables and by reusing precomputed statistics we are able to accelerate sampling.
At first ignoring weak state paths may sound like a very crude approximation for computing forward variables. But in many applications this is certainly not true. We demonstrate with a symmetric Gaussian HMM that the weak state path assumption is a fairly realistic approximation and leads to faster sampling. We define a symmetric HMM λ = (A, B, π) with N states s_{1}, ..., s_{ N }, where we set selftransition probability a_{ ii }= t and nonselftransition probability ${a}_{ij}=\frac{1t}{N1}$ for 1 ≤ i ≠ j ≤ N, and B = {(μ_{1}, σ^{2}), ..., (μ_{ N }, σ^{2})}. Given a sequence of observations O (assumed to be generated by λ) and its compressed form O' we describe an important lemma and some remarks below.
Lemma 1. Let ${O}^{i1}={o}_{1},\dots ,{o}_{i1},{o}^{\prime}={o}_{i},\dots ,{o}_{i+n1},{o}_{min}^{\prime}=\underset{{o}_{i}\in {o}^{\prime}}{min}{o}_{l},{o}_{max}^{\prime}=\underset{{o}_{i}\in {o}^{\prime}}{max}{o}_{l},d=\underset{j\ne k}{min}{\mu}_{j}{\mu}_{k}$ and $\frac{P\left({q}_{i}{O}^{i1}\right)}{P\left({q}_{i}={s}_{x}{O}^{i1}\right)}\le \alpha $. Assuming there exists a state s_{ x }s.t. $\tau =min\left({{o}^{\prime}}_{min}\frac{{\mu}_{{s}_{x1}}+{\mu}_{{s}_{x}}}{2},\frac{{\mu}_{{s}_{x}}+{\mu}_{{s}_{x+1}}}{2}{{o}^{\prime}}_{max}\right)\ge 0$, we can show that $\frac{\sum _{\left({q}_{i},\dots ,{q}_{i+n1}\right)\in {S}^{n}}P\left({q}_{i},\dots ,{q}_{i+n1},{o}^{\prime}{O}^{i1}\right)}{\sum _{s\in S}P\left({q}_{i}=\dots ={q}_{i+n1}=s,{o}^{\prime}{O}^{i1}\right)}\le \alpha \left({\left(1+rc\right)}^{n1}+\left(N1\right){c}^{\frac{2n}{N}}{\left(1+r\right)}^{n1}\right)$, where $r=\frac{1t}{t}$ and $c={e}^{\frac{dx}{2{\sigma}^{2}}}$.
Proof. See Appendix.
Remark 1 For realistic values of τ, t, and n, the contribution from ignored weak state paths, which we call ϵ, can be very small. If ϵ ≪ 1, ignoring weak state paths will not introduce large errors in the computation. For the 2state example in Section: Results, where t = 0.9, d = 1, and σ^{2} = 0.1, ϵ is at most $\frac{1}{3}$ for a block length n ≤ 10 if we assume τ > 0.25 and α = 1. If τ is much larger and consequently ${c}^{\frac{2n}{N}}$ is much smaller, we can roughly say that n can be as large as 1 + log_{1+}_{ rc }(1 + ϵ) in a symmetric Gaussian HMM.
Remark 2 We often encounter situations where P(q_{ i }= s_{ x }O^{ i }^{1}) ≫ P(q_{ i }≠ s_{ x }O^{ i }^{1}). Even though it is not exploited in the lemma (α being greater than or equal to 1), as a consequence of this, the observation sequence can be compressed into larger blocks keeping ϵ small in practice.
From the above lemma and the remarks we see that the longer blocks created by an algorithm should concentrate around the means of the Gaussian distributions. While a brute force algorithm looks at local information, a kdtree like algorithm alternately looks at both dimensions and utilizes global information like the density of data points (around the means data concentration should be high) to create better quality blocks. We use a modified kdtree based algorithm to find such blocks and discuss the details below.
kdtree Based Sequence Compression
 1.
Let O' = ϕ be the starting list, δ = 1.25 (picked empirically), level L = 1, and dimension d = 1.
 2.
If $\underset{{o}_{i}\in O}{max}\left({o}_{i}\right)\underset{{o}_{i}\in O}{max}\left({o}_{i}\right)<\frac{w}{{\delta}^{L}}$ or O = 1, create a node storing the first and second moment of the observations in O, append it to O', and then go to the end step. Otherwise, go to the next step.
 3.
If d = 1, find o _{ m }, the median value of the observations in O. Partition O into maximal set of consecutive observations O ^{1}, ..., O ^{ i }, ..., O ^{ p }such that ${\forall}_{o\in {O}^{i}}o\le {o}_{m}$ or ${\forall}_{o\in {O}^{t}}o\ge {o}_{m}$. For each O ^{ i }, go to step 2 with new input set O ^{ i }, level L + 1, and d = 0.
 4.
If d = 0, divide the input set O into two parts O ^{ l }= o _{1}, ..., o _{ i }and O ^{ r }= o _{i+1}, ..., o _{O}such that ${o}_{i}{o}_{i+1}\phantom{\rule{2.77695pt}{0ex}}\ge \underset{j}{max}{o}_{j}{o}_{j+1}$. Then for each set O ^{ l }and O ^{ r }, go to step 2 keeping the level value L unchanged, and setting d = 1.
 5.
End step.
In the above recursive algorithm, w states the initial width, δ controls the rate of width shrinking in successive levels of the iterations, and O' accumulates the compressed blocks of observations. The current iteration level L, the current dimension d, and the current input set O are local variables in the recursive algorithm. Notice that we start with an empty list O' and at the end of the recursive procedure O' contains an ordered list of compressed observations. To gain further compression of the sequence, we sequentially go through the blocks of O' and combine consecutive blocks if the distance between their means is less than w. We also combine three consecutive blocks if the outer blocks satisfy this condition and the inner block has only one observation. In step 3 of the above algorithm, the input set is divided into two subsets and each subset contains half of the elements from the original set. Consequently, the height of the recursion tree is at most 2logT and the running time of the above algorithm is O(T log T). This overhead is negligible compared to the time that it takes to run M iterations of MCMC sampling.
Width Parameter Selection
For increasing values of w the average block size increases exponentially in the above kdtree based compression. As a result, the compression ratio $\gamma =\frac{{T}^{\prime}}{T}$ plotted as a function of w, has a knee which can inform the choice of w. Moreover, methods originally developed to find the optimal numbers of clusters in clustering can be used to find the knee of such a curve automatically. In particular, we use the Lmethod [27] which finds the knee as the intersection of two straight lines fitted to the compression curve.
Fast Approximate Sampling
 1.Modified forward variables at positions ${t}_{*}=\sum _{i=1}^{t}\left{{o}^{\prime}}_{i}\right$ are computed using the following formula,$\begin{array}{c}P\left(q{t}_{*}=i,{O\prime}_{1},\dots ,t\theta \right)\\ =\sum _{1\le j\le N}P\left(q{t}_{*}\left{o\prime}_{t}\right\phantom{\rule{2.77695pt}{0ex}}=j,{O\prime}_{1,\dots ,t1}\theta \right){a}_{ji}\\ \phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\phantom{\rule{2.77695pt}{0ex}}\underset{\begin{array}{c}\mathsf{\text{constanttimecomputation}}\\ \mathsf{\text{usingprecomputedstatistics}}\end{array}}{\underset{\u23df}{{a}_{ii}^{\left{o\prime}_{t}\right1}\prod _{{o}_{k}\in {o\prime}_{t}}P\left({o}_{k}{\mu}_{i},{\sigma}_{i}\right).}}\end{array}$
 2.
After sampling the state sequence, parameters are updated ignoring nonself transitions between consecutive observations in ${o}_{t}^{\prime}$.
Clearly, each iteration of approximate sampling takes O(T' N^{2}) resulting in $\frac{T}{{T}^{\prime}}$ times speed up.
Results
We evaluate FBGsampling and approximate sampling in three different settings. First, its effectiveness is verified for a simple two state model. Then, we test on simulated ArrayCGH data which is the accepted standard for method evaluation [28]. Finally, we report findings from an analysis of Mantle Cell Lymphoma (MCL) cell lines [29], Corriel cell lines [30], GBM datasets [31], and high resolution SNP arrays [13, 32]. For biological data, if multiple chromosomes are present, we use pooling [25] across chromosomes, which does not allow transition between different chromosomes but assumes model parameters to be identical across chromosomes. Throughout this section we define σ_{ D }to be the standard deviation of all observations in the dataset. We compress the dataset with increasing values of w = 0.25σ_{ D }, 0.5σ_{ D }, 0.75σ_{ D }, .... For evaluation we consider the experiments as two class problems: aberrant clones belong to the positive class and normal clones belong to the negative class. When ground truth labels of a dataset are available we report F1measure, recall, and precision for the experiment. With tp, fp, tn, fn we denote the number of true and false positives and true and false negatives respectively. Recall is defined as $\frac{tp}{tp+fn}$, precision as $\frac{tp}{tp+fp}$, and F1measure as $\frac{2\times recall\times precision}{recall+precision}$. Experiments were run with a Python implementation on a Linux machine with 1.6 GHz Intel Core 2 Duo processor and 2 GB memory. For Expectation Maximization (EM), we use the BaumWelch algorithm from the GHMM package which is implemented in C and considerably faster than a Python implementation.
Synthetic Data
2State HMM
We define a HMM λ_{2}_{ ST }= (A, B, π) with $A=\left[\left[0.9,0.1\right],\left[0.1,0.9\right]\right],B=\left[\left(0,0.1\right),\left(1,0.1\right)\right],\pi =\left[\frac{1}{2},\frac{1}{2}\right]$. From λ_{2}_{ ST }we sample an observation sequence O = o_{1}, ..., o_{10,000}, and run MCMC for M = 100 steps with hyperparameter values ${\stackrel{\u0303}{\mu}}_{1:2}=0,1$ for the prior mean on μ, ${\stackrel{\u0303}{\sigma}}_{1:2}=0.5,0.5$ for the prior variance on μ, a_{1:2} = 4, 4 for the shape of Gamma prior on σ^{2}, b_{1:2} = 1, 1 for the rate of Gamma prior on σ^{2}, δ^{ π }= 1, 1 for the Dirichlet prior on the initial distribution π, and ${\delta}_{1:2}^{At}=1,1$ for the Dirichlet prior on row i of transition matrix A.
We show the average posterior error $\stackrel{\u0303}{P}=\frac{1}{2T}{\sum}_{t}{\sum}_{i}P\left({q}_{t}=i{\theta}^{M},O\right)P\left({q}_{t}=i{\theta}^{true},O\right)$ and total number of mismatches between the two Viterbi paths $\u1e7c$, generated by models with parameters θ^{ true }and θ^{ M }
Method  w(in σ_{ D })  $\stackrel{\u0303}{P}$  $\u1e7c$  Likelihood  Time(in sec)  Speed up 

0.25  0.001  3  5470  74  1.2  
0.50  0.001  3  5475  61  1.4  
0.75  0.002  6  5469  35  2.4  
Approx  1.00  0.004  22  5478  21  4.2 
1.25  0.012  81  5588  13  6.5  
1.50  0.054  410  6576  8  10.4  
1.75  0.219  2345  8230  4  20.1  
2.00  0.248  2857  8492  3  34.1  
FBG  ...  0.003  12  5471  87  1.0 
True  ...  ...  5470  ... 
Simulation from Genetic Template
We use 500 simulated datasets published in [28]. Each dataset has 20 chromosomes and 100 clones per chromosome for a total of 2,000 clones per dataset. A fourstate HMM predicts the aberrant regionsloss defined as state S_{1} and gain defined as state S_{3} or S_{4}. The neutral region is modeled as state S_{2}. We put an ordering constraint on the means, μ_{1} < μ_{2} < μ_{3} < μ_{4}, to prevent label switching of the states [17]. Hyperparameter choices follow [25] and are ${\stackrel{\u0303}{\mu}}_{1:4}=0.5,0,0.58,1$ for the prior mean on μ, ${\stackrel{\u0303}{\sigma}}_{1:4}=0.5,0.001,1.0,1.0$ for the prior variance on μ, a_{1:4} = 10,100, 5, 5 for the shape of gamma prior on σ^{2}, and ${b}_{1:4}={\delta}^{\pi}={\delta}_{1:4}^{{A}_{t}}=1,1,1,1$ for the rate of gamma prior on σ^{2}, the Dirichlet prior on initial distribution π, and the Dirichlet prior on row i of transition matrix A, respectively.
EM, FBGsampling, and approximate sampling results for simulated, HBL2, and Corriel dataset are shown here
Dataset  Method  w  F1measure  Recall  Precision  Time  Compression  Speedup  Likelihood 

0.50  0.97 ± 0.04  0.96 ± 0.07  0.98 ± 0.02  27  0.387  2.2  
0.75  0.97 ± 0.04  0.96 ± 0.06  0.98 ± 0.03  16  0.195  3.7  
1.00  0.97 ± 0.05  0.95 ± 0.07  0.98 ± 0.03  10  0.097  5.9  
Approx  1.25  0.96 ± 0.06  0.94 ± 0.09  0.98 ± 0.03  7  0.050  8.8  
Simulated  1.50  0.94 ± 0.09  0.92 ± 0.12  0.97 ± 0.07  5  0.031  11.3  
1.75  0.91 ± 0.15  0.89 ± 0.18  0.96 ± 0.12  5  0.023  12.2  
2.00  0.86 ± 0.19  0.85 ± 0.21  0.92 ± 0.19  5  0.018  12.2  
FBG  ...  0.97 ± 0.04  0.96 ± 0.05  0.98 ± 0.03  58  ...  1.0  
EM prior, ML  ...  0.96 ± 0.09  0.97 ± 0.04  0.96 ± 0.11  58  ...  ...  
1.0  0.85 ± 0.00  0.83 ± 0.00  0.88 ± 0.00  72  0.078  11.3  
2.0  0.87 ± 0.00  0.83 ± 0.00  0.91 ± 0.00  21  0.018  39.3  
3.0  0.89 ± 0.00  0.83 ± 0.00  0.95 ± 0.00  13  0.006  61.8  
Approx  4.0  0.84 ± 0.08  0.77 ± 0.11  0.95 ± 0.01  13  0.003  61.9  
5.0  0.71 ± 0.17  0.60 ± 0.22  0.95 ± 0.01  13  0.002  64.8  
6.0  0.79 ± 0.07  0.69 ± 0.10  0.96 ± 0.01  14  0.002  59.3  
7.0  0.76 ± 0.08  0.64 ± 0.11  0.93 ± 0.01  13  0.001  61.4  
HBL2  FBG  ...  0.82 ± 0.00  0.84 ± 0.00  0.80 ± 0.00  810  ...  1.0  
EM prior, ML  ...  0.65  0.90  0.50  810  ...  ...  15158  
EM prior, best  ...  0.85  0.84  0.86  810  ...  ...  9616  
EM prior, mean  ...  0.76 ± 0.09  0.86 ± 0.03  0.68 ± 0.12  810  ...  ...  13744  
EM unif, ML  ...  0.64  0.90  0.50  810  ...  ...  15159  
EM unif, best  ...  0.86  0.84  0.88  810  ...  ...  9136  
EM unif, mean  ...  0.54 ± 0.24  0.77 ± 0.21  0.46 ± 0.27  810  ...  ...  13457  
GM05296  Approx  2.0  0.96 ± 0.00  1.00 ± 0.00  0.93 ± 0.01  3  0.027  13.0  
FBG  ...  0.89 ± 0.01  1.00 ± 0.00  0.81 ± 0.01  40  ...  1.0  
GM00143  Approx  2.0  0.98 ± 0.00  1.00 ± 0.00  0.96 ± 0.00  3  0.027  13.8  
FBG  ...  0.89 ± 0.24  1.00 ± 0.00  0.86 ± 0.26  40  ...  1.0 
Biological Data
Mantle Cell Lymphoma (MCL)
Corriel
Corriel cell lines were used by Snijders et al. [30] and are widely considered a gold standard in ArrayCGH data analysis. This dataset is smaller and, in fact, fairly easy compared to the MCL cell lines. For the Corriel cell line we use a 4state HMM and report the results for GM05296 and GM00143 in Table 2. Again, approximate sampling performs competitively while achieving more than a 10fold speedup. Hyperparameter choices follow [24].
GBM
SNP Array
To set hyperparameters we follow the default parameters of the HMM used in PennCNV [13]. We have observed that HMMs for large arrays are particularly sensitive to the selftransition probabilities (which is also reflected in the default parameter values of the HMM used in PennCNV). Hence, hyperparameters were set to reflect the choice of high selftransition probability for each statewe set ${\delta}_{1:3}^{{A}_{t}}={\alpha}_{i1}l,{\alpha}_{i2}l,{\alpha}_{i3}l$, the Dirichlet prior on row i of transition matrix A, where l = 5000, α_{ ii }is 0.99 for i = 2, 0.95 for i ≠ 2, ${\alpha}_{ij}=\frac{1{\alpha}_{ii}}{2}$ for i ≠ j. Other hyperparameters for the 3state HMM were set such that the expected values of prior distributions match the default values for PennCNV. In particular, they were ${\stackrel{\u0303}{\mu}}_{1:3}=10.66,0,0.54$ for the prior mean on μ, ${\stackrel{\u0303}{\sigma}}_{1:3}=0.001,0.001,0.001$ for the prior variance on μ, a_{1:3} = 12, 30, 25 for the shape of gamma prior on σ^{2}, b_{1:3} = 1, 1, 1 for the rate of gamma prior on σ^{2}, and δ^{ π }= 1, 9, 1 for the Dirichlet prior on initial distribution π, respectively.
Discussion
EM vs. MCMC
Confusion matrices showing the proportion of accurate predictions based on EM, FBGsampling, and approximate sampling methods on 500 simulated datasets
Truth  

Loss  Neutral  Gain  
Loss  0.855  0.071  0.074  
EM  Neutral  0.001  0.996  0.003 
Gain  0.190  0.087  0.723  
Loss  0.980  0.020  0.000  
FBG  Neutral  0.002  0.995  0.003 
Gain  0.002  0.020  0.973  
Loss  0.981  0.019  0.000  
Approx. (w = 1.25σ_{ D })  Neutral  0.002  0.993  0.005 
Gain  0.009  0.022  0.969 
FBG vs. Approximate Sampling
In an ideal setting, like the 2state HMM example, approximate sampling closely mimics the performance of FBG sampling up to moderate compression level. For the simulated and real dataset approximate sampling's performance is comparable to FBG's while achieving a speedup of 10 or larger. We also observe that for higher compression levels approximate sampling reports small number of aberrant clones, which results in small t p and f p values, but large f n value. As a result, we see low recall and high precision rate when the compression level is too high for a particular dataset (see the rows with w ≥ 4.0σ_{ D }for HBL2 in Table 2).
From Figures 4, 5, 6, and 7 we observe that segmentations by both sampling methods are almost identical at the recommended width w value. In case of HBL2, they differ from the ground truth in some places. They predict a few extra gain regions and outliers are generally predicted as gains. We, as well as Shah et al. [25], have noticed that the HBL2 dataset has many outliers, and the variance of emission distribution of gain state 4 converges to a high value which tries to explain the outliers. In contrast, GBM has fewer outliers (see Figure 5) and approximate sampling seems robust to those outliers. As the compression algorithm forces possible outliers to be included in a compressed block, it is robust to moderate frequencies of outliers.
Width Parameter
Outliers
Gaussian HMMs are known to be sensitive to outliers which is evident from our results of HBL2 and SNP arrays. Traditionally, outliers have been handled either by using a mixture distribution as the emission distribution or by preprocessing the data to remove possible outliers or impute more appropriate values. We have observed that a simple local median approach works very well to identify the outliers in a time series of log_{2}ratio values. Although using a mixture distribution or a distribution with fat tails, i.e. Student'st distribution, is a better choice we lose a significant computational advantage in approximate sampling. For a block of observations o' = o_{ i }, ..., o_{ k }we can compute $\prod _{j=i}^{k}P\left({o}_{j}{q}^{\prime},\theta \right)$ in constant time using precomputed values $\sum _{j=1}^{k}{o}_{j}$ and $\sum _{j=i}^{k}{o}_{j}^{2}$ if the emission distribution is Gaussian. But it is not obvious how we can accomplish this for a more complicated distribution. Another approach, which we prefer in this context, is to use a HMM state with a very wide Gaussian and low selftransition probability to model the outliers. We have observed very good performance in this way. However, as our primary focus is to compare FBGsampling with approximate sampling we choose to use a simple Gaussian model at the end.
Conclusions
Analyzing CGH data either from DNA microarrays or next generation sequencing to estimate chromosomal aberrations or investigate copy number variations (CNV), leads to the problem of segmenting sequences of observations which are essentially noisy versions of piecewiseconstant functions. For reasons of efficiency, ML or MAP point estimates of HMM parameters combined with the Viterbialgorithm to compute a most likely sequence of hidden states and thus a segmentation of the input are most popular in practice. This ignores research which clearly demonstrates that Bayesian approaches, where MCMC is used for sampling and thus integrating out model parameters, is more robust with higher recall and higher precision [24]. Additionally, our experiments show that likelihood is not informative with respect to the quality of CNV calls putting the use of ML into question even if the estimation problem could be solved.
We propose a method using approximate sampling to accelerate MCMC for HMMs on such data. Our method constitutes the first use of ideas from spatial index structures for several consecutive observations and approximate computations based on geometric arguments for HMM; the effectiveness of this approach was previously demonstrated for kmeans clustering, mixture estimation, and fast evaluation of a mixture of Gaussians.
We demonstrate that these very abundant biological CGH datasets, which measure chromosomal aberrations and copy number variations, are consistent with our assumptions of piecewise constant ground truths, and we are able to achieve speedups between 10 and 60 respectively 90, on these biological datasets while maintaining competitive prediction accuracy compared to the stateoftheart. As datasets with even higher resolution, both from higher density DNA microarrays and next generation sequencing, become available, we believe that the need for precise and efficient MCMC techniques will increase. The added precision over popular ML/MAPbased methods is of particular biological relevance as for complex neurodegenerative diseases such as Autism denovo copy number variations have recently been shown to play a role [37]; a precise and quick analysis on large collectives of patients is desirable.
Applying approximate sampling to multidimensional observationsto jointly analyze data sets for recurrent CNVs [38] instead of analyzing individuals and postprocessing resultsand considering more complicated HMM topologies and observation densities are directions for our future work.
Appendix
Lemma 1. Let ${O}^{i1}={o}_{1},\dots ,{o}_{i1},{o}^{\prime}={o}_{i},\dots ,{o}_{i+n1},{o}_{min}^{\prime}=\underset{{o}_{i}\in {o}^{\prime}}{min}{o}_{l},{o}_{max}^{\prime}=\underset{{o}_{i}\in {o}^{\prime}}{max}{o}_{l},d=\underset{j\ne k}{min}{\mu}_{j}{\mu}_{k}$ and $\frac{P\left({q}_{i}{O}^{i1}\right)}{P\left({q}_{i}={s}_{x}{O}^{i1}\right)}\le \alpha $. Assuming there exists a state s_{ x }s.t. $\tau =min\left({{o}^{\prime}}_{min}\frac{{\mu}_{{s}_{x1}}+{\mu}_{{s}_{x}}}{2},\frac{{\mu}_{{s}_{x}}+{\mu}_{{s}_{x+1}}}{2}{{o}^{\prime}}_{max}\right)\ge 0$, we can show that $\frac{\sum _{\left({q}_{i},\dots ,{q}_{i+n1}\right)\in {S}^{n}}P\left({q}_{i},\dots ,{q}_{i+n1},{o}^{\prime}{O}^{i1}\right)}{\sum _{s\in S}P\left({q}_{i}=\dots ={q}_{i+n1}=s,{o}^{\prime}{O}^{i1}\right)}\le \alpha \left({\left(1+rc\right)}^{n1}+\left(N1\right){c}^{\frac{2n}{N}}{\left(1+r\right)}^{n1}\right)$, where $r=\frac{1t}{t}$ and $c={e}^{\frac{dx}{2{\sigma}^{2}}}$.
We partition S^{ n }, the set of all possible partial state paths of length n, into N subsets ${S}^{{s}_{1}}\dots {S}^{{s}_{N}}$ such that, ${S}^{{s}_{j}}=\left\{\stackrel{\u0303}{S}\in {S}^{n}:\left({\forall}_{{s}_{l}\ne {s}_{j}}C\left(\stackrel{\u0303}{S},{s}_{j}\right)>C\left(\stackrel{\u0303}{S},{s}_{l}\right)\right)\vee \left(\left({\forall}_{{s}_{l}\ne {s}_{j}}C\left(\stackrel{\u0303}{S},{s}_{j}\right)\ge C\left(\stackrel{\u0303}{S},{s}_{l}\right)\right)\wedge {\stackrel{\u0303}{S}}_{1}={s}_{j}\right)\right\}$ for 1 ≤ j ≤ N, where $C\left(\stackrel{\u0303}{S},s\right)=\sum _{{q}_{k}\in \stackrel{\u0303}{S}}1\left({q}_{k}=s\right)$. We again partition ${S}^{{s}_{j}}={\cup}_{k=0}^{n1}{S}_{k}^{{s}_{j}}$ such that, ${S}_{k}^{{s}_{j}}=\left\{\stackrel{\u0303}{S}\in {S}^{{s}_{j}}:\left(\sum _{l=1}^{n1}1\left({\stackrel{\u0303}{S}}_{l}\ne {\stackrel{\u0303}{S}}_{l+1}\right)\right)=k\right\}$.
As the sets ${S}^{{s}_{j}}$ are equal sized partitions of S^{ n }, $\left{S}^{{s}_{j}}\right\phantom{\rule{2.77695pt}{0ex}}=\sum _{k=0}^{n1}\left(\begin{array}{c}\hfill n1\hfill \\ \hfill k\hfill \end{array}\right){\left(N1\right)}^{k}$. Also notice that, by definition, the partial state paths in S^{ n }with exactly k number of nonselftransitions are equally distributed among the subsets ${S}^{{s}_{j}}$. As a result, $\left{S}_{k}^{{s}_{j}}\right\phantom{\rule{2.77695pt}{0ex}}=\left(\begin{array}{c}\hfill n1\hfill \\ \hfill k\hfill \end{array}\right){\left(N1\right)}^{k}$.
Note: For simplicity of the notation, we follow the convention that ${\mu}_{{x}_{0}}=\infty $ and ${\mu}_{{x}_{N+1}}=\infty $ so that the proof holds for x = 1 or x = N.
Notes
Supplementary material
References
 1.Bishop CM: Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, NJ, USA: SpringerVerlag New York, Inc; 2006.Google Scholar
 2.Durbin R, Eddy SR, Krogh A, Mitchison GJ: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press; 1998.CrossRefGoogle Scholar
 3.Krogh A: Hidden Markov models for labeled sequences. Pattern Recognition, 1994. Vol. 2  Conference B: Computer Vision & Image Processing., Proceedings of the 12th IAPR International. Conference on 1994, 2: 140–144. vol 2 vol 2Google Scholar
 4.Eddy SR: Multiple Alignment Using Hidden Markov Models. ISMB 1995, 114–120.Google Scholar
 5.Schliep A, Costa IG, Steinhoff C, Schönhuth A: Analyzing gene expression timecourses. IEEE/ACM transactions on computational biology and bioinformatics/IEEE, ACM 2005, 2(3):179–93. 10.1109/TCBB.2005.31CrossRefPubMedGoogle Scholar
 6.Olshen AB, Venkatraman ES, Lucito R, Wigler M: Circular binary segmentation for the analysis of arraybased DNA copy number data. Biostatistics (Oxford, England) 2004, 5(4):557–72.CrossRefGoogle Scholar
 7.Picard F, Robin S, Lebarbier E, Daudin JJ: A segmentation/clustering model for the analysis of array CGH data. Biometrics 2007, 63(3):758–66. 10.1111/j.15410420.2006.00729.xCrossRefPubMedGoogle Scholar
 8.Eilers PHC, de Menezes RX: Quantile smoothing of array CGH data. Bioinformatics 2005, 21(7):1146–53. 10.1093/bioinformatics/bti148CrossRefPubMedGoogle Scholar
 9.Tibshirani R, Wang P: Spatial smoothing and hot spot detection for CGH data using the fused lasso. Biostatistics 2008, 9: 18–29.CrossRefPubMedGoogle Scholar
 10.Wang P, Kim Y, Pollack J, Narasimhan B, Tibshirani R: A method for calling gains and losses in array CGH data. Biostatistics (Oxford, England) 2005, 6: 45–58.CrossRefGoogle Scholar
 11.Andersson R, Bruder CEG, Piotrowski A, Menzel U, Nord H, Sandgren J, Hvidsten TR, Diaz de Ståhl T, Dumanski JP, Komorowski J: A segmental maximum a posteriori approach to genomewide copy number profiling. Bioinformatics 2008, 24(6):751–758. 10.1093/bioinformatics/btn003CrossRefPubMedGoogle Scholar
 12.Fridlyand J, Snijders A, Pinkel D, Albertson D, Jain A: Hidden Markov models approach to the analysis of array CGH data. J Multivariate Anal 2004, 90: 132–153. 10.1016/j.jmva.2004.02.008CrossRefGoogle Scholar
 13.Wang K, Li M, Hadley D, Liu R, Glessner J, Grant SFA, Hakonarson H, Bucan M: PennCNV: an integrated hidden Markov model designed for highresolution copy number variation detection in wholegenome SNP genotyping data. Genome Research 2007, 17(11):1665–74. 10.1101/gr.6861907PubMedCentralCrossRefPubMedGoogle Scholar
 14.Viterbi A: Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. Information Theory, IEEE Transactions on 1967, 13(2):260–269.CrossRefGoogle Scholar
 15.Gilks W, Gilks W, Richardson S, Spiegelhalter D: Markov chain Monte Carlo in practice. Interdisciplinary statistics, Chapman & Hall; 1996.Google Scholar
 16.Geman S, Geman D: Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. Pattern Analysis and Machine Intelligence, IEEE Transactions on 1984, PAMI6(6):721–741.CrossRefGoogle Scholar
 17.Scott S: Bayesian Methods for Hidden Markov Models: Recursive Computing in the 21st Century. Journal of the American Statistical Association 2002, 337–351.Google Scholar
 18.Rabiner L: A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 1989, 77(2):257–286. 10.1109/5.18626CrossRefGoogle Scholar
 19.Mozes S, Weimann O, ZivUkelson M: Speeding Up HMM Decoding and Training by Exploiting Sequence Repetitions. Lecture Notes in Computer Science 2007.Google Scholar
 20.Pelleg D, Moore A: Accelerating exact kmeans algorithms with geometric reasoning. In KDD '99: Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining. New York, NY, USA: ACM; 1999:277–281.CrossRefGoogle Scholar
 21.Fritsch J, Rogina I: The Bucket Box Intersection (BBI) Algorithm For Fast Approximative Evaluation Of Diagonal Mixture Gaussians. In Proc ICASSP 1996, 837–840.Google Scholar
 22.Srivastava S: Fast gaussian evaluations in large vocabulary continuous speech recognition. M.S. Thesis, Department of Electrical and Computer Engineering, Mississippi State University 2002.Google Scholar
 23.Baum LE, Petrie T, Soules G, Weiss N: A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains. The Annals of Mathematical Statistics 1970, 41: 164–171. 10.1214/aoms/1177697196CrossRefGoogle Scholar
 24.Guha S, Li Y, Neuberg D: Bayesian Hidden Markov Modeling of Array CGH Data. Journal of the American Statistical Association 2008, 103: 485–497. 10.1198/016214507000000923PubMedCentralCrossRefPubMedGoogle Scholar
 25.Shah SP, Xuan X, DeLeeuw RJ, Khojasteh M, Lam WL, Ng R, Murphy KP: Integrating copy number polymorphisms into array CGH analysis using a robust HMM. Bioinformatics 2006, 22(14):e431e439. 10.1093/bioinformatics/btl238CrossRefPubMedGoogle Scholar
 26.Chib S: Calculating posterior distributions and modal estimates in Markov mixture models. Journal of Econometrics 1996, 75: 79–97. 10.1016/03044076(95)017704CrossRefGoogle Scholar
 27.Salvador S, Chan P: Determining the Number of Clusters/Segments in Hierarchical Clustering/Segmentation Algorithms. In Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence, ICTAI '04. Washington, DC, USA: IEEE Computer Society; 2004:576–584.CrossRefGoogle Scholar
 28.Willenbrock H, Fridlyand J: A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics 2005, 21(22):4084–4091. 10.1093/bioinformatics/bti677CrossRefPubMedGoogle Scholar
 29.Leeuw RJD, Davies JJ, Rosenwald A, Bebb G, Gascoyne YD, Dyer MJS, Staudt LM, Martinezcliment JA, Lam WL: Comprehensive whole genome array CGH profiling of mantle cell lymphoma model genomes. Hum Mol Genet 2004, 13: 1827–1837. 10.1093/hmg/ddh195CrossRefPubMedGoogle Scholar
 30.Snijders AM, Nowak N, Segraves R, Blackwood S, Brown N, Conroy J, Hamilton G, Hindle AK, Huey B, Kimura K, Law S, Myambo K, Palmer J, Ylstra B, Yue JP, Gray JW, Jain AN, Pinkel D, Albertson DG: Assembly of microarrays for genomewide measurement of DNA copy number. Nat Genet 2001, 29(3):263–264. 10.1038/ng754CrossRefPubMedGoogle Scholar
 31.Bredel M, Bredel C, Juric D, Harsh GR, Vogel H, Recht LD, Sikic BI: HighResolution GenomeWide Mapping of Genetic Alterations in Human Glial Brain Tumors. Cancer Research 2005, 65(10):4088–4096. 10.1158/00085472.CAN044229CrossRefPubMedGoogle Scholar
 32.Harada T, Chelala C, Bhakta V, Chaplin T, Caulee K, Baril P, Young BD, Lemoine NR: Genomewide DNA copy number analysis in pancreatic cancer using highdensity single nucleotide polymorphism arrays. Oncogene 2007, 27(13):1951–1960.PubMedCentralCrossRefPubMedGoogle Scholar
 33.Lai WR, Johnson MD, Kucherlapati R, Park PJ: Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 2005, 21(19):3763–70. 10.1093/bioinformatics/bti611PubMedCentralCrossRefPubMedGoogle Scholar
 34.Nannya Y, Sanada M, Nakazaki K, Hosoya N, Wang L, Hangaishi A, Kurokawa M, Chiba S, Bailey DK, Kennedy GC, Ogawa S: A robust algorithm for copy number detection using highdensity oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Res 2005, 65(14):6071–6079. 10.1158/00085472.CAN050465CrossRefPubMedGoogle Scholar
 35.McNemar Q: Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 1947, 12: 153–157. 10.1007/BF02295996CrossRefPubMedGoogle Scholar
 36.Morganella S, Cerulo L, Viglietto G, Ceccarelli M: VEGA: variational segmentation for copy number detection. Bioinformatics 2010, 26(24):3020–3027. 10.1093/bioinformatics/btq586CrossRefPubMedGoogle Scholar
 37.Pinto D, Pagnamenta AT, Klei L, Anney R, Merico D, Regan R, Conroy J, Magalhaes TR, Correia C, Abrahams BS, Almeida J, Bacchelli E, Bader GD, Bailey AJ, Baird G, Battaglia A, Berney T, Bolshakova N, Bölte S, Bolton PF, Bourgeron T, Brennan S, Brian J, Bryson SE, Carson AR, Casallo G, Casey J, Chung BHY, Cochrane L, Corsello C: Functional impact of global rare copy number variation in autism spectrum disorders. Nature 2010.Google Scholar
 38.Shah SP, Lam WL, Ng RT, Murphy KP: Modeling recurrent DNA copy number alterations in array CGH data. Bioinformatics 2007, 23(13):i450i458. 10.1093/bioinformatics/btm221CrossRefPubMedGoogle Scholar
Copyright information
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.