A Stochastic Segmentation Model for the Identification of Histone Modification and DNase I Hypersensitive Sites in Chromatin

Xing, Haipeng; Mo, Yifan; Liao, Will; Cai, Ying; Zhang, Michael

doi:10.1007/978-3-319-12694-4_27


Part of the book series: ICSA Book Series in Statistics (ICSABSS)


Abstract

Focal alterations in chromatin structure are essential for the proper functioning of various classes of transcriptional regulatory elements in the human genome. These changes can be detected through an increased sensitivity to DNase I and other nucleases, which results from an open and accessible chromatin conformation. Current quantitative analysis approaches use heuristic procedures to identify regions enriched for histone modifications and DNase I hypersensitivity. Here we develop a stochastic segmentation model and an associated inference framework to characterize the categorical and continuous features of hierarchical structures hidden in sequences. The proposed model has attractive statistical and computational properties and yields explicit formulas for the posterior distribution of hidden states with a hierarchical structure. We propose an approximation method whose computational complexity is only linear in sequence length. We demonstrate the performance of the model via extensive simulations. We further use our model to identify DNase I sensitivity and DNase I hypersensitive sites over the Encyclopedia of DNA Elements (ENCODE) regions in human lymphoblastoid cells.



Acknowledgement

The first two authors contributed equally to the chapter. Haipeng Xing’s research was supported by the National Science Foundation grant DMS-1206321. Michael Q. Zhang’s research was supported by the National Institutes of Health grants HG001696 and ES-17166, the National Basic Research Program of China grant 2012CB316503, and the National Natural Science Foundation of China grant 91019016.

Author information

Authors and Affiliations

Department of Applied Mathematics and Statistics, State University of New York, 11794, Stony Brook, NY, USA
Haipeng Xing & Ying Cai
Mount Sinai Hospital, New York, NY, 10029, USA
Yifan Mo
New York Genome Center, 10013, New York, NY, USA
Will Liao
Department of Molecular & Cell Biology, Center for Systems Biology, The University of Texas at Dallas, 75080, Richardson, TX, USA
Michael Zhang
MOE Key Laboratory of Bioinformatics and Bioinformatics Division, Center for Synthetic and System Biology, TNLIST, Department of Automation, Tsinghua University, 100084, Beijing, P. R. China
Michael Zhang


Corresponding author

Correspondence to Haipeng Xing.

Editor information

Editors and Affiliations

National Institutes of Health, Rockville, Maryland, USA
Zhen Chen
National Institutes of Health, Rockville, Maryland, USA
Aiyi Liu
Lilly Corporation Center, Indianapolis, Indiana, USA
Yongming Qu
George Mason University, Fairfax, Virginia, USA
Larry Tang
Boehringer-Ingelheim, Ridgefield, Connecticut, USA
Naitee Ting
Food and Drug Administration, Silver Spring, Maryland, USA
Yi Tsong

Appendices

Appendix A. Proof of (27.5) and (27.12)

Proof of (27.5): To derive the mixture weight $\xi_{i,t}^{(k)}$, we first note that

$$f(\theta_t, y_t, s_{t-1}=k | {\cal F}_{t-1}) = \sum_{l=1}^K f(\theta_t, y_t, s_{t-1}=k, s_t=l | {\cal F}_{t-1}).$$

When $l\neq k$,

$$\begin{aligned} &f(\theta_t, y_t, s_{t-1}=k, s_t=l | {\cal F}_{t-1})\\ =& f(\theta_t, y_t | {\cal F}_{t-1}, s_{t-1}=k, s_t =l)P(s_{t-1}=k, s_t =l | {\cal F}_{t-1}) \\ =& f(y_t | {\cal F}_{t-1}, J^{(l)}_{t}=t)f(\theta_t| {\cal F}_{t}, J^{(l)}_{t}=t)P(s_t =l| s_{t-1}=k)P(s_{t-1}=k| {\cal F}_{t-1}) \\ =& f(y_t | {\cal F}_{t-1}, J^{(l)}_{t}=t)f(\theta_t| {\cal F}_{t}, J^{(l)}_{t}=t)p_{k,l} \xi_{t-1}^{(k)}.\\ \end{aligned}$$

When $l= k$,

$$\begin{aligned} &f(\theta_t, y_t, s_{t-1}=k, s_t=k | {\cal F}_{t-1})= \sum_{i=1}^{t-1} f(J^{(k)}_{t}=i, \theta_t, y_t|{\cal F}_{t-1})\\ =& \sum_{i=1}^{t-1} f(\theta_t, y_t|{\cal F}_{t-1},J^{(k)}_{t}=i)P(J^{(k)}_{t}=i | {\cal F}_{t-1}) \\ =& \sum_{i=1}^{t-1} f(y_t | {\cal F}_{t-1},J^{(k)}_{t}=i)f(\theta_t| {\cal F}_{t}, J^{(k)}_{t}=i)P(s_t =k| s_{t-1}=k)P(J^{(k)}_{t-1}=i| {\cal F}_{t-1}) \\ =& \sum_{i=1}^{t-1} f(y_t | {\cal F}_{t-1}, J^{(k)}_{t}=i)f(\theta_t| {\cal F}_{t}, J^{(k)}_{t}=i)p_{k,k} \xi_{i, t-1}^{(k)}. \end{aligned}$$

Let

$$\xi_{i,t}^{(k)*} = \left\{\begin{array}{ll} \big( \sum_{l \neq k} \xi_{t-1}^{(l)} p_{lk} \big) f(y_t | J_t^{(k)}=t) & i=t, \\ p_{kk} \xi_{i, t-1}^{(k)} f(y_t | {\cal F}_{t-1}, J_t^{(k)}=i) & i < t.\end{array} \right.$$

We then have

$$f(\theta_t | {\cal F}_t) \propto \sum_{k=1}^K \xi_{t,t}^{(k)*} f(\theta_t| {\cal F}_{t}, J^{(k)}_{t}=t) + \sum_{k=1}^K \sum_{i=1}^{t-1} \xi_{i,t}^{(k)*}f(\theta_t| {\cal F}_{t}, J^{(k)}_{t}=i).$$

Hence, the mixture weight $\xi_{i,t}^{(k)}$ is a conditional probability that is obtained by normalizing $\xi_{i,t}^{(k)*}$ over all $k$ and $i$. Furthermore, simple algebra shows that

$$f(y_t | J_t^{(k)}=t)=\psi_{0,0}^{(k)} \big/ \psi_{t,t}^{(k)}, \qquad f(y_t | {\cal F}_{t-1}, J_t^{(k)}=i)=\psi_{i,t-1}^{(k)} \big/ \psi_{i,t}^{(k)},$$

where

$$\psi_{0,0}^{(k)}=(\kappa^{(k)})^{-\frac{1}{2}} \frac{(\lambda^{(k)})^{-g^{(k)}}}{\Gamma(g^{(k)})}, \qquad \psi_{ij}^{(k)}=(\kappa_{ij}^{(k)})^{-\frac{1}{2}} \frac{(\lambda_{ij}^{(k)})^{-g_{ij}^{(k)}}}{\Gamma(g_{ij}^{(k)})},$$

for $i\le j$. This proves (27.5).
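The recursion for the unnormalized weights $\xi_{i,t}^{(k)*}$ can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `log_psi` and `update_weights` are hypothetical, the predictive densities $f(y_t | \cdot)$ are passed in as plain numbers (in practice they would be ratios of $\psi$ terms whose posterior parameters come from the main text, which is not reproduced here), and `P[l][k]` is taken to be the transition probability $p_{lk}$.

```python
import math

def log_psi(kappa, lam, g):
    """log of psi = kappa^(-1/2) * lam^(-g) / Gamma(g), the form given above.

    Working on the log scale keeps ratios of psi terms numerically stable.
    """
    return -0.5 * math.log(kappa) - g * math.log(lam) - math.lgamma(g)

def update_weights(xi_prev, P, pred_new, pred_cont):
    """One step of the mixture-weight recursion (hypothetical helper).

    xi_prev[k][i-1]   : xi_{i,t-1}^{(k)} for i = 1..t-1
    P[l][k]           : transition probability p_{lk} (assumed layout)
    pred_new[k]       : f(y_t | J_t^{(k)} = t)
    pred_cont[k][i-1] : f(y_t | F_{t-1}, J_t^{(k)} = i)
    Returns xi[k][i-1] = xi_{i,t}^{(k)} for i = 1..t, normalized to sum to 1.
    """
    K = len(xi_prev)
    # xi_{t-1}^{(k)}: total probability of being in state k at time t-1.
    state_prob = [sum(row) for row in xi_prev]
    xi_star = []
    for k in range(K):
        # i < t: the chain stays in state k and the segment continues.
        stay = [P[k][k] * w * f for w, f in zip(xi_prev[k], pred_cont[k])]
        # i = t: the chain enters state k from some other state l != k.
        inflow = sum(state_prob[l] * P[l][k] for l in range(K) if l != k)
        xi_star.append(stay + [inflow * pred_new[k]])
    total = sum(sum(row) for row in xi_star)
    return [[w / total for w in row] for row in xi_star]
```

Each predictive density could be obtained as `math.exp(log_psi(...) - log_psi(...))` following the ratio formulas above, once the posterior parameters for each segment are available.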

Proof of (27.12): We use Bayes’ theorem to combine the forward filter (27.4) with its backward variant (27.10) to derive the posterior distribution of $\theta_t$ given ${\cal F}_T$ $(1\le t < T)$:

$$\begin{aligned}f(\theta_t|{\cal F}_T) = \sum_{k=1}^K f(\theta_t, s_t=k|{\cal F}_T) \propto \sum_{k=1}^K f(\theta_t, s_t=k|{\cal F}_t)\frac{ f(\theta_t, s_t=k| {\cal F}_{t+1,T})}{f(\theta_t, s_t=k)}.\end{aligned}$$

(27.22)

We first consider the following:

$$\begin{aligned}&f(\theta_t, s_t=k|{\cal F}_t) f(\theta_t, s_t=k|{\cal F}_{t+1,T}) \big/ f(\theta_t, s_t=k) \notag\\ =&\frac{\sum_{i=1}^t\xi_{i,t}^{(k)}f(\theta_t | {\cal F}_{i,t})\cdot\{\widetilde{q}_{kk}\sum_{j=t+1}^T\eta_{t+1,j}^{(k)}f(\theta_t |{\cal F}_{t+1,j})+\sum_{l\neq k}\widetilde{q}_{lk}\eta_{t+1}^{(l)}f(\theta_t|s_t=k)\}}{P(s_t=k)f(\theta_t|s_t=k)} \notag\\ =&\frac{\sum_{i=1}^t\xi_{i,t}^{(k)}f(\theta_t | {\cal F}_{i,t})\cdot \widetilde{q}_{kk}\sum_{j=t+1}^T\eta_{t+1,j}^{(k)}f(\theta_t |{\cal F}_{t+1,j})}{\pi_k f(\theta_t|s_t=k)}\notag\\ & \hspace{20pt}+\frac{\sum_{i=1}^t\xi_{i,t}^{(k)}f(\theta_t | {\cal F}_{i,t})\cdot \sum_{l\neq k}\widetilde{q}_{lk}\eta_{t+1}^{(l)}f(\theta_t|s_t=k)}{\pi_k f(\theta_t|s_t=k)}\notag\\ =&\sum_{i=1}^t\xi_{i,t}^{(k)}\sum_{l\neq k}\frac{\widetilde{q}_{lk}}{\pi_k}\eta_{t+1}^{(l)}f(\theta_t | {\cal F}_{i,t})+\frac{\widetilde{q}_{kk}}{\pi_k}\sum_{1 \le i \le t < j \le T}\xi_{i,t}^{(k)}\eta_{t+1,j}^{(k)}\frac{f(\theta_t | {\cal F}_{i,t})f(\theta_t | {\cal F}_{t+1,j})}{f(\theta_t|s_t=k)}.\end{aligned}$$

Note that

$$\frac{f(\theta_t | {\cal F}_{i,t})f(\theta_t | {\cal F}_{t+1,j})}{f(\theta_t|s_t=k)}=\frac{\psi_{i,t}^{(k)}\psi_{t+1,j}^{(k)}}{\psi_{i,j}^{(k)} \psi_{0,0}^{(k)}}f(\theta_t | {\cal F}_{i,j}).$$

Hence, (27.12) is proved.
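The last display of the proof splits the smoothed density for state $k$ into two groups of mixture terms: segments that end at $t$, and segments that span $i \le t < j$. A minimal Python sketch of the corresponding unnormalized weights follows. The function name `smoothing_terms` and its argument layout are hypothetical, and the $\psi$ ratio is assumed to be supplied on the log scale; this illustrates the bookkeeping only and is not the authors' code.

```python
import math

def smoothing_terms(xi_k, eta_k, inflow_k, q_kk, pi_k, log_ratio):
    """Unnormalized smoother weights for one state k at time t (a sketch).

    xi_k[i-1]         : forward weight xi_{i,t}^{(k)}, i = 1..t
    eta_k[m]          : backward weight eta_{t+1,j}^{(k)}, j = t+1+m
    inflow_k          : sum over l != k of q~_{lk} * eta_{t+1}^{(l)}
    q_kk, pi_k        : q~_{kk} and pi_k = P(s_t = k)
    log_ratio[i-1][m] : log of psi_{i,t} psi_{t+1,j} / (psi_{i,j} psi_{0,0})
    """
    # Segments that end at t: weight attached to f(theta_t | F_{i,t}).
    end_here = [w * inflow_k / pi_k for w in xi_k]
    # Segments spanning i..j with j > t: weight attached to f(theta_t | F_{i,j}).
    spans = [
        [(q_kk / pi_k) * w * e * math.exp(lr) for e, lr in zip(eta_k, row)]
        for w, row in zip(xi_k, log_ratio)
    ]
    return end_here, spans
```

To obtain posterior mixture weights, the returned values would still need to be normalized jointly over all states $k$ and all components.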

Appendix B. EM Algorithm for Hyperparameter Estimation

The inference procedures in the sections above involve the hyperparameters $\Phi=\{ Q, z^{(k)}, \kappa^{(k)}, \lambda^{(k)}, g^{(k)};\ k=1, \dots, K\}$, which form a $[4K+K(K-1)]$-dimensional vector. We can use the EM algorithm to exploit the much simpler structure of the log-likelihood $l_c(\Phi)$ of the complete data $\{(y_t, s_t, \theta_t), 1\le t \le T\}$, which is expressed as

$$\begin{aligned}& l_c(\Phi) = \sum_{t=1}^T \log f(\{y_t, s_t, \theta_t \}) \nonumber\\ = &\sum_{t=1}^T \Big\{ \log f(y_t|\theta_t) + \sum_{k=1}^K \log f(\theta_t | s_t=k) {\bf 1}_{ \{ s_t = k \}} + \sum_{k,l=1}^K \log (p_{kl}) {\bf 1}_{ \{ s_{t-1} = k, s_t = l \} } \Big\}\nonumber\\ = &-\sum_{t=1}^T \Big\{ \frac{(y_t-\mu_t)^2}{2\sigma_t^2}+\frac{1}{2}\log(2\sigma_t^2)\Big\} -\sum_{t=1}^T\sum_{k=1}^K\Big\{\frac{(\mu_t-z^{(k)})^2}{ 2\sigma_t^2 \kappa^{(k)}}+ \frac{1}{2}\log( 2\sigma_t^2 \kappa^{(k)})\Big\} {\bf 1}_{\{s_t=k\}}\nonumber\\ &-\sum_{t=1}^T\sum_{k=1}^K\Big\{g^{(k)}\log(\lambda^{(k)})- \log(\Gamma(g^{(k)}))-(g^{(k)}-1)\log (2\sigma_t^2)+ \frac{1}{2\sigma_t^2\lambda^{(k)}}\Big\} {\bf 1}_{\{s_t=k\}}\nonumber\\ &+ \sum_{t=1}^T \sum_{k,l=1}^K \log (p_{kl}) {\bf 1}_{ \{ s_{t-1} = k, s_t = l \} }.\end{aligned}$$

(27.23)

The E-step of the EM algorithm calculates $E[l_c(\Phi)|{\cal F}_T]$, which involves the computation of the conditional expectations

$$E\Big[\frac{ (y_t-\mu_t)^2}{2\sigma_t^2} \Big|{\cal F}_T\Big],\qquad E[\log(2\sigma_t^2)|{\cal F}_T],\qquad E\Big(\frac{(\mu_t-z^{(k)})^2}{2\sigma_t^2\kappa^{(k)}} {\bf 1}_{\{s_t=k\}}\Big|{\cal F}_T\Big),$$

$$E[\log( 2\sigma_t^2\kappa^{(k)}) {\bf 1}_{\{s_t=k\}}|{\cal F}_T], E[\log (2\sigma_t^2) {\bf 1}_{\{s_t=k\}}|{\cal F}_T], E\Big(\frac{1}{2 \sigma_t^2 \lambda^{(k)}} {\bf 1}_{\{s_t=k\}}|{\cal F}_T\Big),$$

and the conditional probability:

$$P(s_t = k|{\cal F}_T), \qquad P(s_{t-1} = k, s_t = l|{\cal F}_T).$$

The M-step of the EM algorithm involves calculating the partial derivatives of $E[l_c(\Phi)|{\cal F}_T]$ with respect to $\Phi$. Simple algebra yields the following updating formulas for $\Phi$:

$$\begin{aligned}\widehat{q}_{kl, \rm new} =& \frac{\sum_{t=2}^T P(s_{t-1}=k, s_t=l|{\cal F}_T, \widehat{\Phi}_{\rm old})}{\sum_{t=2}^T P(s_{t-1}=k|{\cal F}_T, \widehat{\Phi}_{\rm old})},\notag\end{aligned}$$

$$\begin{aligned} \widehat{z}^{(k)}_{\rm new} =& \frac{\sum_{t=1}^T E[\mu_t/(2\sigma_t^2) {\bf 1}_{ \{ s_t = k \}} |{\cal F}_T, \widehat{\Phi}_{\rm old}]}{\sum_{t=1}^T E[P_t {\bf 1}_{\{s_{t}=k\}}|{\cal F}_T, \widehat{\Phi}_{\rm old}]},\notag \end{aligned}$$

$$\begin{aligned} \widehat{\kappa}^{(k)}_{\rm new} =& \frac{2\sum_{t=1}^T E[(\mu_t - \widehat{z}^{(k)}_{\rm old})^2/ (2\sigma_t^2) {\bf 1}_{ \{ s_t = k \}} |{\cal F}_T, \widehat{\Phi}_{\rm old}]}{\sum_{t=1}^T E[{\bf 1}_{\{s_{t}=k\}}|{\cal F}_T, \widehat{\Phi}_{\rm old}]}, \notag\end{aligned}$$

$$\begin{aligned} \widehat{\lambda}^{(k)}_{\rm new} =& \frac{\sum_{t=1}^T E[(2\sigma_t^2)^{-1} {\bf 1}_{\{s_t=k\}}|{\cal F}_T, \widehat{\Phi}_{\rm old}]}{ \sum_{t=1}^T g^{(k)}_{\rm old}E[ {\bf 1}_{\{s_t=k\}} |{\cal F}_T, \widehat{\Phi}_{\rm old}]}.\end{aligned}$$

(27.24)

The terms in (27.24) can be obtained as follows:

$$E\Big( \frac{\mu_t}{2\sigma_t^2} {\bf 1}_{ \{ s_t = k \}} \Big|{\cal F}_T, \widehat{\Phi}_{\rm old}\Big) = \sum_{1 \le i \le t \le j \le T} \alpha^{(k)}_{ijt} E \Big( \frac{\mu_t}{2\sigma_t^2} \Big|C_{ij}^{(k)}, {\cal F}_T, \widehat{\Phi}_{\rm old}\Big),$$

(27.25)

$$E\Big( \frac{\mu_t}{2\sigma_t^2} \Big|C_{ij}^{(k)}, {\cal F}_T, \widehat{\Phi}_{\rm old} \Big) = \lambda_{ij}^{(k)} g_{ij}^{(k)} z_{ij}^{(k)},$$

(27.26)

$$E\Big( \frac{1}{2\sigma_t^2} {\bf 1}_{\{s_{t}=k\}} \Big|{\cal F}_T, \widehat{\Phi}_{\rm old} \Big) = \sum_{1 \le i \le t \le j \le T} \alpha_{ijt}^{(k)} g_{ij}^{(k)} \lambda_{ij}^{(k)},$$

(27.27)

$$\begin{aligned} E\Bigg( &\frac{(\mu_t-z_{\rm old}^{(k)})^2}{2\sigma_t^2}{\bf 1}_{\{s_t=k\}} \Big|{\cal F}_T, \widehat{\Phi}_{\rm old}\Bigg) = \sum\limits_{1 \leq i \leq t \leq j \leq T}\alpha_{ijt}^{(k)} \Bigg\{E \left(\frac{\mu_t^2}{2\sigma_t^2}\Big|C_{ij}^{(k)}, {\cal F}_T,\widehat{\Phi}_{\rm old} \right) \nonumber\\&- 2 z_{\rm old}^{(k)}E \left( \frac{\mu_t}{2\sigma_t^2}\Big|C_{ij}^{(k)}, {\cal F}_T, \widehat{\Phi}_{\rm old} \right)+(z_{\rm old}^{(k)})^2 E\left( \frac{1}{2\sigma_t^2}\Big|C_{ij}^{(k)}, {\cal F}_T, \widehat{\Phi}_{\rm old}\right)\Bigg\}, \end{aligned}$$

(27.28)

in which

$$\begin{aligned} E\left( \frac{\mu_t^2}{2\sigma_t^2} \Big|C_{ij}^{(k)}, {\cal F}_T,\widehat{\Phi}_{\rm old}\right)=\frac{\kappa_{ij}^{(k)}}{2}+\lambda_{ij}^{(k)}g_{ij}^{(k)}\left(z_{ij}^{(k)}\right)^2.\end{aligned}$$

(27.29)
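The component-wise moments in (27.26) and (27.29) and the quadratic expectation in (27.28) can be computed directly from the per-component parameters. The Python sketch below is illustrative only: the function names are hypothetical, and it assumes that the summand of (27.27) gives the conditional expectation of $1/(2\sigma_t^2)$ for a single component $C_{ij}^{(k)}$, namely $g_{ij}^{(k)}\lambda_{ij}^{(k)}$.

```python
def component_moments(z_ij, kappa_ij, lam_ij, g_ij):
    """Conditional moments given one component C_ij^{(k)} (hypothetical helper)."""
    e_prec = lam_ij * g_ij                                # E[1/(2 sigma^2)], assumed from (27.27)
    e_mu_prec = lam_ij * g_ij * z_ij                      # E[mu/(2 sigma^2)], as in (27.26)
    e_mu2_prec = kappa_ij / 2 + lam_ij * g_ij * z_ij**2   # E[mu^2/(2 sigma^2)], as in (27.29)
    return e_prec, e_mu_prec, e_mu2_prec

def expected_quadratic(alpha, params, z_old):
    """Right-hand side of (27.28) for one state k at one time t.

    alpha[c]  : mixture weight alpha_{ijt}^{(k)} of component c
    params[c] : tuple (z_ij, kappa_ij, lam_ij, g_ij) for component c
    z_old     : current estimate of z^{(k)}
    """
    total = 0.0
    for a, (z, kappa, lam, g) in zip(alpha, params):
        e_prec, e_mu, e_mu2 = component_moments(z, kappa, lam, g)
        # Expand (mu - z_old)^2 = mu^2 - 2 z_old mu + z_old^2 term by term.
        total += a * (e_mu2 - 2 * z_old * e_mu + z_old**2 * e_prec)
    return total
```

As a sanity check, when a single component has `z_ij == z_old`, the expression reduces to `kappa_ij / 2`.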

To speed up computation, we can use the BCMIX approximations instead of the full recursions to determine the terms (27.25)–(27.29). The iteration scheme (27.24) is repeated until the hyperparameter estimates converge.
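Given the E-step quantities, part of one M-step in (27.24) can be sketched as follows. This is a hedged illustration rather than the authors' code: the function name `m_step` and the array layouts are assumptions, the update for $z^{(k)}$ is omitted because it depends on $P_t$, which is defined outside this section, and $g^{(k)}$ is held at its old value.

```python
def m_step(pair_post, state_post, quad_exp, prec_exp, g_old):
    """Update the transition matrix, kappa and lambda as in (27.24) (a sketch).

    pair_post[t][k][l] : P(s_{t-1}=k, s_t=l | F_T), one KxK table per t = 2..T
    state_post[t][k]   : P(s_t=k | F_T), t = 1..T
    quad_exp[t][k]     : E[(mu_t - z_old^{(k)})^2 / (2 sigma_t^2) 1{s_t=k} | F_T]
    prec_exp[t][k]     : E[1 / (2 sigma_t^2) 1{s_t=k} | F_T]
    g_old[k]           : current value of g^{(k)}
    """
    K = len(g_old)
    q_new, kappa_new, lam_new = [], [], []
    for k in range(K):
        # Denominator: sum_t P(s_{t-1}=k | F_T), obtained by marginalizing over l.
        leave_k = sum(sum(table[k]) for table in pair_post)
        q_new.append([sum(table[k][l] for table in pair_post) / leave_k
                      for l in range(K)])
        occupancy = sum(row[k] for row in state_post)  # sum_t P(s_t=k | F_T)
        kappa_new.append(2 * sum(row[k] for row in quad_exp) / occupancy)
        lam_new.append(sum(row[k] for row in prec_exp) / (g_old[k] * occupancy))
    return q_new, kappa_new, lam_new
```

In a full EM loop, this step would alternate with the E-step until the change in the estimates falls below a chosen tolerance.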


Cite this paper

Xing, H., Mo, Y., Liao, W., Cai, Y., Zhang, M. (2015). A Stochastic Segmentation Model for the Identification of Histone Modification and DNase I Hypersensitive Sites in Chromatin. In: Chen, Z., Liu, A., Qu, Y., Tang, L., Ting, N., Tsong, Y. (eds) Applied Statistics in Biomedicine and Clinical Trials Design. ICSA Book Series in Statistics. Springer, Cham. https://doi.org/10.1007/978-3-319-12694-4_27


DOI: https://doi.org/10.1007/978-3-319-12694-4_27
Published: 01 May 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12693-7
Online ISBN: 978-3-319-12694-4
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
