A Stochastic Segmentation Model for the Identification of Histone Modification and DNase I Hypersensitive Sites in Chromatin

  • Conference paper
Applied Statistics in Biomedicine and Clinical Trials Design

Part of the book series: ICSA Book Series in Statistics (ICSABSS)

Abstract

Focal alterations in chromatin structure are essential for the proper functioning of various classes of transcriptional regulatory elements in the human genome. These changes can be detected through an increased sensitivity to DNase I and other nucleases, which results from an open and accessible chromatin conformation. Current quantitative analysis approaches use heuristic procedures to identify regions enriched for histone modifications and DNase I hypersensitivity. Here we develop a stochastic segmentation model and an associated inference framework to characterize the categorical and continuous features of hierarchical structures hidden in sequences. The proposed model has attractive statistical and computational properties and yields explicit formulas for the posterior distribution of hidden states with a hierarchical structure. We propose an approximation method whose computational complexity is only linear in the sequence length. We demonstrate the performance of the model via extensive simulations. We further use our model to identify DNase I sensitivity and DNase I hypersensitive sites over the Encyclopedia of DNA Elements (ENCODE) regions in human lymphoblastoid cells.

Acknowledgement

The first two authors contributed equally to the chapter. Haipeng Xing’s research was supported by the National Science Foundation grant DMS-1206321. Michael Q. Zhang’s research was supported by the National Institutes of Health grants HG001696 and ES-17166, the National Basic Research Program of China grant 2012CB316503, and the National Natural Science Foundation of China grant 91019016.

Author information

Corresponding author

Correspondence to Haipeng Xing.

Appendices

Appendix A. Proofs of (27.5) and (27.12)

Proof of (27.5): To derive the mixture weight \(\xi_{i,t}^{(k)}\), we first note that

$$f(\theta_t, y_t, s_{t-1}=k | {\cal F}_{t-1}) = \sum_{l=1}^K f(\theta_t, y_t, s_{t-1}=k, s_t=l | {\cal F}_{t-1}).$$

When \(l\neq k\),

$$\begin{aligned} &f(\theta_t, y_t, s_{t-1}=k, s_t=l | {\cal F}_{t-1})\\ =& f(\theta_t, y_t | {\cal F}_{t-1}, s_{t-1}=k, s_t =l)P(s_{t-1}=k, s_t =l | {\cal F}_{t-1}) \\ =& f(y_t | {\cal F}_{t-1}, J^{(l)}_{t}=t)f(\theta_t| {\cal F}_{t}, J^{(l)}_{t}=t)P(s_t =l| s_{t-1}=k)P(s_{t-1}=k| {\cal F}_{t-1}) \\ =& f(y_t | {\cal F}_{t-1}, J^{(l)}_{t}=t)f(\theta_t| {\cal F}_{t}, J^{(l)}_{t}=t)p_{k,l} \xi_{t-1}^{(k)}.\\ \end{aligned}$$

When \(l= k\),

$$\begin{aligned} &f(\theta_t, y_t, s_{t-1}=k, s_t=k | {\cal F}_{t-1})= \sum_{i=1}^{t-1} f(J^{(k)}_{t}=i, \theta_t, y_t|{\cal F}_{t-1})\\ =& \sum_{i=1}^{t-1} f(\theta_t, y_t|{\cal F}_{t-1},J^{(k)}_{t}=i)P(J^{(k)}_{t}=i | {\cal F}_{t-1}) \\ =& \sum_{i=1}^{t-1} f(y_t | {\cal F}_{t-1},J^{(k)}_{t}=i)f(\theta_t| {\cal F}_{t}, J^{(k)}_{t}=i)P(s_t =k| s_{t-1}=k)P(J^{(k)}_{t-1}=i| {\cal F}_{t-1}) \\ =& \sum_{i=1}^{t-1} f(y_t | {\cal F}_{t-1}, J^{(k)}_{t}=i)f(\theta_t| {\cal F}_{t}, J^{(k)}_{t}=i)p_{k,k} \xi_{i, t-1}^{(k)}. \end{aligned}$$

Let

$$\xi_{i,t}^{(k)*} = \left\{\begin{array}{ll} \big( \sum_{l \neq k} \xi_{t-1}^{(l)} p_{lk} \big) f(y_t | J_t^{(k)}=t) & i=t, \\ p_{kk} \xi_{i, t-1}^{(k)} f(y_t | {\cal F}_{t-1}, J_t^{(k)}=i) & i < t.\end{array} \right.$$

We then have

$$f(\theta_t | {\cal F}_t) \propto \sum_{k=1}^K \xi_{t,t}^{(k)*} f(\theta_t| {\cal F}_{t}, J^{(k)}_{t}=t) + \sum_{k=1}^K \sum_{i=1}^{t-1} \xi_{i,t}^{(k)*}f(\theta_t| {\cal F}_{t}, J^{(k)}_{t}=i).$$

Hence, the mixture weight \(\xi_{i,t}^{(k)}\) is a conditional probability that is obtained by normalizing \(\xi_{i,t}^{(k)*}\) over all \(k\) and \(i\). Furthermore, simple algebra shows that

$$f(y_t | J_t^{(k)}=t)=\psi_{0,0}^{(k)} \big/ \psi_{t,t}^{(k)}, \qquad f(y_t | {\cal F}_{t-1}, J_t^{(k)}=i)=\psi_{i,t-1}^{(k)} \big/ \psi_{i,t}^{(k)},$$

where

$$\psi_{0,0}^{(k)}=(\kappa^{(k)})^{-\frac{1}{2}} \frac{(\lambda^{(k)})^{-g^{(k)}}}{\Gamma(g^{(k)})}, \qquad \psi_{ij}^{(k)}=(\kappa_{ij}^{(k)})^{-\frac{1}{2}} \frac{(\lambda_{ij}^{(k)})^{-g_{ij}^{(k)}}}{\Gamma(g_{ij}^{(k)})},$$

for \(i\le j\). This proves (27.5).
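As an illustration only, the weight recursion above can be sketched in Python. The function name, the list-based data layout, and the convention that the state probability \(\xi_{t-1}^{(k)}\) equals the sum of \(\xi_{i,t-1}^{(k)}\) over \(i\) are assumptions made for this sketch; the predictive densities are passed in as plain numbers rather than computed from the \(\psi\) ratios.

```python
def update_filter_weights(xi_prev, p, lik_new, lik_old):
    """One forward step: build xi*_{i,t}^{(k)} and normalise it.

    xi_prev[k][i] -- weight xi_{i,t-1}^{(k)}; list position i indexes the change-point
    p[l][k]       -- transition probability p_{lk}
    lik_new[k]    -- f(y_t | J_t^{(k)} = t)
    lik_old[k][i] -- f(y_t | F_{t-1}, J_t^{(k)} = i)
    Returns the normalised weights, with the new change-point (i = t) appended last.
    """
    n_states = len(p)
    # Assumption of this sketch: xi_{t-1}^{(k)} is the sum over i of xi_{i,t-1}^{(k)}.
    state_mass = [sum(row) for row in xi_prev]
    unnormalised = []
    for k in range(n_states):
        # i < t: the current segment in state k continues.
        stay = [p[k][k] * w * f for w, f in zip(xi_prev[k], lik_old[k])]
        # i = t: a new segment in state k starts, entered from some state l != k.
        enter = sum(state_mass[l] * p[l][k] for l in range(n_states) if l != k)
        unnormalised.append(stay + [enter * lik_new[k]])
    total = sum(sum(row) for row in unnormalised)
    return [[w / total for w in row] for row in unnormalised]


# Toy input with two states and a single earlier change-point.
xi = update_filter_weights(
    xi_prev=[[0.6], [0.4]],
    p=[[0.9, 0.1], [0.2, 0.8]],
    lik_new=[0.5, 0.25],
    lik_old=[[1.0], [0.5]],
)
```

Each call adds one weight per state, so the full recursion carries a number of weights that grows with \(t\).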

Proof of (27.12): We use Bayes’ theorem to combine the forward filter (27.4) with its backward variant (27.10) to derive the posterior distribution of \(\theta_t\) given \({\cal F}_T\) \((1\le t < T)\):

$$\begin{aligned}f(\theta_t|{\cal F}_T) = \sum_{k=1}^K f(\theta_t, s_t=k|{\cal F}_T) \propto \sum_{k=1}^K f(\theta_t, s_t=k|{\cal F}_t)\frac{ f(\theta_t, s_t=k| {\cal F}_{t+1,T})}{f(\theta_t, s_t=k)}.\end{aligned}$$
(27.22)

We first consider the following:

$$\begin{aligned}&f(\theta_t, s_t=k|{\cal F}_t) f(\theta_t, s_t=k|{\cal F}_{t+1,T}) \big/ f(\theta_t, s_t=k)\\ =&\frac{\sum_{i=1}^t\xi_{i,t}^{(k)}f(\theta_t | {\cal F}_{i,t})\cdot\{\widetilde{q}_{kk}\sum_{j=t+1}^T\eta_{t+1,j}^{(k)}f(\theta_t |{\cal F}_{t+1,j})+\sum_{l\neq k}\widetilde{q}_{lk}\eta_{t+1}^{(l)}f(\theta_t|s_t=k)\}}{P(s_t=k)f(\theta_t|s_t=k)}\\ =&\frac{\sum_{i=1}^t\xi_{i,t}^{(k)}f(\theta_t | {\cal F}_{i,t})\cdot \widetilde{q}_{kk}\sum_{j=t+1}^T\eta_{t+1,j}^{(k)}f(\theta_t |{\cal F}_{t+1,j})}{\pi_k f(\theta_t|s_t=k)}\\ & \hspace{20pt}+\frac{\sum_{i=1}^t\xi_{i,t}^{(k)}f(\theta_t | {\cal F}_{i,t})\cdot \sum_{l\neq k}\widetilde{q}_{lk}\eta_{t+1}^{(l)}f(\theta_t|s_t=k)}{\pi_k f(\theta_t|s_t=k)}\\ =&\sum_{i=1}^t\xi_{i,t}^{(k)}\sum_{l\neq k}\frac{\widetilde{q}_{lk}}{\pi_k}\eta_{t+1}^{(l)}f(\theta_t | {\cal F}_{i,t})+\frac{\widetilde{q}_{kk}}{\pi_k}\sum_{1 \le i \le t < j \le T}\xi_{i,t}^{(k)}\eta_{t+1,j}^{(k)}\frac{f(\theta_t | {\cal F}_{i,t})f(\theta_t | {\cal F}_{t+1,j})}{f(\theta_t|s_t=k)}.\end{aligned}$$

Note that

$$\frac{f(\theta_t | {\cal F}_{i,t})f(\theta_t | {\cal F}_{t+1,j})}{f(\theta_t|s_t=k)}=\frac{\psi_{i,t}^{(k)}\psi_{t+1,j}^{(k)}}{\psi_{i,j}^{(k)} \psi_{0,0}^{(k)}}f(\theta_t | {\cal F}_{i,j}).$$

Hence, (27.12) is proved.
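The two groups of terms in the last display can be turned into smoothing weights by direct bookkeeping. The following Python sketch is an illustration under stated assumptions: the function name, the dictionary layout, and the use of log-scale \(\psi\) values supplied by the caller are choices made here, not part of the chapter, and the final normalization over all \((k, i, j)\) reflects the proportionality in (27.22).

```python
import math


def smoothing_weights(t, xi, eta_seg, eta_state, q_tilde, pi, log_psi, log_psi0):
    """Normalised mixture weights of f(theta_t | F_T), read off the derivation above.

    xi[k]            -- dict {i: xi_{i,t}^{(k)}} for 1 <= i <= t
    eta_seg[k]       -- dict {j: eta_{t+1,j}^{(k)}} for t < j <= T
    eta_state[l]     -- eta_{t+1}^{(l)}
    q_tilde[l][k]    -- backward transition weight q~_{lk}
    pi[k]            -- P(s_t = k)
    log_psi(k, i, j) -- log psi_{i,j}^{(k)} (hypothetical callable supplied by the user)
    log_psi0[k]      -- log psi_{0,0}^{(k)}
    Returns {(k, i, j): weight}; the key (k, i, t) marks a segment that ends at t.
    """
    n_states = len(pi)
    weights = {}
    for k in range(n_states):
        # Next observation belongs to a different state l != k: segment is (i, t).
        leave = sum(q_tilde[l][k] * eta_state[l]
                    for l in range(n_states) if l != k) / pi[k]
        # Next observation stays in state k: segment (i, t) merges with (t+1, j).
        stay = q_tilde[k][k] / pi[k]
        for i, xi_it in xi[k].items():
            weights[(k, i, t)] = xi_it * leave
            for j, eta_j in eta_seg[k].items():
                log_ratio = (log_psi(k, i, t) + log_psi(k, t + 1, j)
                             - log_psi(k, i, j) - log_psi0[k])
                weights[(k, i, j)] = stay * xi_it * eta_j * math.exp(log_ratio)
    total = sum(weights.values())
    return {key: value / total for key, value in weights.items()}
```

Working with \(\log\psi\) avoids overflow in the Gamma-function terms when segments are long.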

Appendix B. EM Algorithm for Hyperparameter Estimation

The inference procedure in the preceding sections involves the hyperparameters \(\Phi= \{ Q, z^{(k)}, \kappa^{(k)}, \lambda^{(k)}, g^{(k)}; k=1, \dots, K \}\), which form a \([4K+K(K-1)]\)-dimensional vector. We can use the EM algorithm to exploit the much simpler structure of the log likelihood \(l_c(\Phi)\) of the complete data \(\{(y_t, s_t, \theta_t), 1\le t \le T\}\), which is expressed as

$$\begin{aligned}& l_c(\Phi) = \sum_{t=1}^T \log f(\{y_t, s_t, \theta_t \}) \\ = &\sum_{t=1}^T \Big\{ \log f(y_t|\theta_t) + \sum_{k=1}^K \log f(\theta_t | s_t=k) {\bf 1}_{ \{ s_t = k \}} + \sum_{k,l=1}^K \log (p_{kl}) {\bf 1}_{ \{ s_{t-1} = k, s_t = l \} } \Big\}\\ = &-\sum_{t=1}^T \Big\{ \frac{(y_t-\mu_t)^2}{2\sigma_t^2}+\frac{1}{2}\log(2\sigma_t^2)\Big\} -\sum_{t=1}^T\sum_{k=1}^K\Big\{\frac{(\mu_t-z^{(k)})^2}{ 2\sigma_t^2 \kappa^{(k)}}+ \frac{1}{2}\log( 2\sigma_t^2 \kappa^{(k)})\Big\}{\bf 1}_{\{s_t=k\}}\\ &-\sum_{t=1}^T\sum_{k=1}^K\Big\{g^{(k)}\log(\lambda^{(k)})- \log(\Gamma(g^{(k)}))-(g^{(k)}-1)\log (2\sigma_t^2)+ \frac{1}{2\sigma_t^2\lambda^{(k)}}\Big\} {\bf 1}_{\{s_t=k\}}\\ &+ \sum_{t=1}^T \sum_{k,l=1}^K \log (p_{kl}) {\bf 1}_{ \{ s_{t-1} = k, s_t = l \} }.\end{aligned}$$
(27.23)

The E-step of the EM algorithm calculates \(E[l_c(\Phi)|{\cal F}_T]\), which involves the computation of the conditional expectations

$$E\Big[\frac{ (y_t-\mu_t)^2}{2\sigma_t^2} \Big|{\cal F}_T\Big],\qquad E[\log(2\sigma_t^2)|{\cal F}_T],\qquad E\Big(\frac{(\mu_t-z^{(k)})^2}{2\sigma_t^2\kappa^{(k)}} {\bf 1}_{\{s_t=k\}}\Big|{\cal F}_T\Big),$$
$$E[\log( 2\sigma_t^2\kappa^{(k)}) {\bf 1}_{\{s_t=k\}}|{\cal F}_T], \qquad E[\log (2\sigma_t^2) {\bf 1}_{\{s_t=k\}}|{\cal F}_T], \qquad E\Big(\frac{1}{2 \sigma_t^2 \lambda^{(k)}} {\bf 1}_{\{s_t=k\}}\Big|{\cal F}_T\Big),$$

and the conditional probabilities

$$P(s_t = k|{\cal F}_T), \qquad P(s_{t-1} = k, s_t = l|{\cal F}_T).$$

The M-step of the EM algorithm sets the partial derivatives of \(E[l_c(\Phi)|{\cal F}_T]\) with respect to Φ to zero. Simple algebra yields the following updating formulas for Φ:

$$\begin{aligned}\widehat{q}_{kl, \rm new} =& \frac{\sum_{t=2}^T P(s_{t-1}=k, s_t=l|{\cal F}_T, \widehat{\Phi}_{\rm old})}{\sum_{t=2}^T P(s_{t-1}=k|{\cal F}_T, \widehat{\Phi}_{\rm old})},\end{aligned}$$
$$\begin{aligned} \widehat{z}^{(k)}_{\rm new} =& \frac{\sum_{t=1}^T E[\mu_t/(2\sigma_t^2) {\bf 1}_{ \{ s_t = k \}} |{\cal F}_T, \widehat{\Phi}_{\rm old}]}{\sum_{t=1}^T E[(2\sigma_t^2)^{-1} {\bf 1}_{\{s_{t}=k\}}|{\cal F}_T, \widehat{\Phi}_{\rm old}]}, \end{aligned}$$
$$\begin{aligned} \widehat{\kappa}^{(k)}_{\rm new} =& \frac{2\sum_{t=1}^T E[(\mu_t - \widehat{z}^{(k)}_{\rm old})^2/ (2\sigma_t^2) {\bf 1}_{ \{ s_t = k \}} |{\cal F}_T, \widehat{\Phi}_{\rm old}]}{\sum_{t=1}^T E[{\bf 1}_{\{s_{t}=k\}}|{\cal F}_T, \widehat{\Phi}_{\rm old}]}, \end{aligned}$$
$$\begin{aligned} \widehat{\lambda}^{(k)}_{\rm new} =& \frac{\sum_{t=1}^T E[(2\sigma_t^2)^{-1} {\bf 1}_{\{s_t=k\}}|{\cal F}_T, \widehat{\Phi}_{\rm old}]}{ \sum_{t=1}^T g^{(k)}_{\rm old}E[ {\bf 1}_{\{s_t=k\}} |{\cal F}_T, \widehat{\Phi}_{\rm old}]}.\end{aligned}$$
(27.24)
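Once the E-step quantities are available, the updates in (27.24) reduce to ratios of sums. The Python sketch below is a minimal illustration, assuming the expectations have already been computed and stored as nested lists indexed by time and state; the function and argument names are hypothetical, and the denominator of the \(z^{(k)}\) update is taken to be the expected value of \(1/(2\sigma_t^2)\) on \(\{s_t=k\}\), which is an interpretation adopted for this sketch.

```python
def m_step(pair_prob, state_prob, e_mu_prec, e_prec, e_sq_prec, g_old):
    """Hyperparameter updates in the spirit of (27.24).

    pair_prob[t][k][l] -- P(s_{t-1}=k, s_t=l | F_T), one entry per consecutive pair
    state_prob[t][k]   -- P(s_t=k | F_T)
    e_mu_prec[t][k]    -- E[mu_t / (2 sigma_t^2) 1{s_t=k} | F_T]
    e_prec[t][k]       -- E[1 / (2 sigma_t^2) 1{s_t=k} | F_T]
    e_sq_prec[t][k]    -- E[(mu_t - z_old^(k))^2 / (2 sigma_t^2) 1{s_t=k} | F_T]
    g_old[k]           -- current value of g^(k)
    """
    n_times, n_states = len(state_prob), len(state_prob[0])
    states = range(n_states)
    # Expected number of time points spent in each state.
    occupancy = [sum(state_prob[t][k] for t in range(n_times)) for k in states]
    # Transitions start at times 1..T-1, so the last time point is excluded.
    leave_from = [sum(state_prob[t][k] for t in range(n_times - 1)) for k in states]
    q_new = [[sum(pp[k][l] for pp in pair_prob) / leave_from[k] for l in states]
             for k in states]
    z_new = [sum(row[k] for row in e_mu_prec) / sum(row[k] for row in e_prec)
             for k in states]
    kappa_new = [2.0 * sum(row[k] for row in e_sq_prec) / occupancy[k] for k in states]
    lambda_new = [sum(row[k] for row in e_prec) / (g_old[k] * occupancy[k])
                  for k in states]
    return q_new, z_new, kappa_new, lambda_new
```

In an EM loop this function would be called after each E-step, with its outputs replacing the old hyperparameters until the estimates stabilize.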

The terms in (27.24) can be obtained as follows:

$$E\Big( \frac{\mu_t}{2\sigma_t^2} {\bf 1}_{ \{ s_t = k \}} \Big|{\cal F}_T, \widehat{\Phi}_{\rm old}\Big) = \sum_{1 \le i \le t \le j \le T} \alpha^{(k)}_{ijt} E \Big( \frac{\mu_t}{2\sigma_t^2} \Big|C_{ij}^{(k)}, {\cal F}_T, \widehat{\Phi}_{\rm old}\Big),$$
(27.25)
$$E\Big( \frac{\mu_t}{2\sigma_t^2} \Big|C_{ij}^{(k)}, {\cal F}_T, \widehat{\Phi}_{\rm old} \Big) = \lambda_{ij}^{(k)} g_{ij}^{(k)} z_{ij}^{(k)},$$
(27.26)
$$E\Big( \frac{1}{2\sigma_t^2} {\bf 1}_{\{s_{t}=k\}} \Big|{\cal F}_T, \widehat{\Phi}_{\rm old} \Big) = \sum_{1 \le i \le t \le j \le T} \alpha_{ijt}^{(k)} g_{ij}^{(k)} \lambda_{ij}^{(k)},$$
(27.27)
$$\begin{aligned} E\Bigg( &\frac{(\mu_t-z_{\rm old}^{(k)})^2}{2\sigma_t^2}{\bf 1}_{\{s_t=k\}} \Big|{\cal F}_T, \widehat{\Phi}_{\rm old}\Bigg) = \sum\limits_{1 \leq i \leq t \leq j \leq T}\alpha_{ijt}^{(k)} \Bigg\{E \left(\frac{\mu_t^2}{2\sigma_t^2}\Big|C_{ij}^{(k)}, {\cal F}_T,\widehat{\Phi}_{\rm old} \right) \\&- 2 z_{\rm old}^{(k)}E \left( \frac{\mu_t}{2\sigma_t^2}\Big|C_{ij}^{(k)}, {\cal F}_T, \widehat{\Phi}_{\rm old} \right)+(z_{\rm old}^{(k)})^2 E\left( \frac{1}{2\sigma_t^2}\Big|C_{ij}^{(k)}, {\cal F}_T, \widehat{\Phi}_{\rm old}\right)\Bigg\}, \end{aligned}$$
(27.28)

in which

$$\begin{aligned} E\left( \frac{\mu_t^2}{2\sigma_t^2} \Big|C_{ij}^{(k)}, {\cal F}_T,\widehat{\Phi}_{\rm old}\right)=\frac{\kappa_{ij}^{(k)}}{2}+\lambda_{ij}^{(k)}g_{ij}^{(k)}\left(z_{ij}^{(k)}\right)^2.\end{aligned}$$
(27.29)
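The segment-level moments in (27.26) and (27.29) combine into the bracketed quantity of (27.28) by simple arithmetic. The Python sketch below illustrates this for a single segment; the function names are hypothetical, and taking the per-segment value of \(E(1/(2\sigma_t^2))\) to be \(g_{ij}^{(k)}\lambda_{ij}^{(k)}\) is an assumption read off (27.27).

```python
def segment_moments(z, kappa, lam, g):
    """Posterior moments for one segment C_ij^(k) with parameters (z, kappa, lam, g)."""
    prec = g * lam                              # E[1/(2 sigma^2)], as read off (27.27)
    mu_prec = lam * g * z                       # E[mu/(2 sigma^2)], equation (27.26)
    mu2_prec = kappa / 2.0 + lam * g * z ** 2   # E[mu^2/(2 sigma^2)], equation (27.29)
    return prec, mu_prec, mu2_prec


def expected_sq_dev(z_old, z, kappa, lam, g):
    """Bracketed term of (27.28): E[(mu - z_old)^2 / (2 sigma^2)] for one segment."""
    prec, mu_prec, mu2_prec = segment_moments(z, kappa, lam, g)
    return mu2_prec - 2.0 * z_old * mu_prec + z_old ** 2 * prec


# When z_old equals the segment mean z, only kappa / 2 remains.
value = expected_sq_dev(z_old=1.0, z=1.0, kappa=2.0, lam=0.5, g=4.0)
```

Expanding the square shows that the result equals \(\kappa_{ij}^{(k)}/2 + g_{ij}^{(k)}\lambda_{ij}^{(k)}(z_{ij}^{(k)} - z_{\rm old}^{(k)})^2\), which gives a quick consistency check on an implementation.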

To speed up computation, we can use the BCMIX approximations instead of the full recursions to compute the terms (27.25)–(27.29). The iteration scheme (27.24) is repeated until convergence to estimate the hyperparameters.
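The BCMIX rule itself is not reproduced in this appendix, so the following Python sketch only illustrates the general idea of a bounded-complexity mixture: cap the number of mixture weights carried forward at each step. The function name, the parameters `max_components` and `keep_recent`, and the specific rule (always retain the most recent change-point candidates, then drop the smallest remaining weight) are assumptions made for illustration and may differ from the approximation used in the chapter.

```python
def prune_weights(weights, max_components, keep_recent):
    """Cap a mixture at max_components terms and renormalise.

    weights        -- dict {change_point_index: weight}; larger index = more recent
    max_components -- upper bound on the number of retained weights
    keep_recent    -- number of most recent indices that are never dropped
    """
    if not 0 <= keep_recent < max_components:
        raise ValueError("need 0 <= keep_recent < max_components")
    kept = dict(weights)
    while len(kept) > max_components:
        # Protect the most recent candidates, then drop the weakest of the rest.
        recent = set(sorted(kept)[-keep_recent:]) if keep_recent > 0 else set()
        candidates = [i for i in kept if i not in recent]
        weakest = min(candidates, key=lambda i: kept[i])
        del kept[weakest]
    total = sum(kept.values())
    return {i: w / total for i, w in kept.items()}


pruned = prune_weights({1: 0.1, 2: 0.4, 3: 0.05, 4: 0.3, 5: 0.15},
                       max_components=3, keep_recent=1)
```

With the number of weights bounded at every time step, the cost of one pass over the sequence grows linearly with its length, in line with the linear complexity stated in the abstract.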

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Xing, H., Mo, Y., Liao, W., Cai, Y., Zhang, M. (2015). A Stochastic Segmentation Model for the Identification of Histone Modification and DNase I Hypersensitive Sites in Chromatin. In: Chen, Z., Liu, A., Qu, Y., Tang, L., Ting, N., Tsong, Y. (eds) Applied Statistics in Biomedicine and Clinical Trials Design. ICSA Book Series in Statistics. Springer, Cham. https://doi.org/10.1007/978-3-319-12694-4_27
