Abstract
Focal alterations in chromatin structure are essential for the proper functioning of various classes of transcriptional regulatory elements in the human genome. These changes can be detected through an increased sensitivity to DNase I and other nucleases due to an open and accessible chromatin conformation. Currently, quantitative analysis approaches use heuristic procedures to identify regions enriched for histone modifications and DNase I hypersensitivity. Here we develop a stochastic segmentation model and an associated inference framework to characterize the categorical and continuous features of hierarchical structures hidden in sequences. The proposed model has attractive statistical and computational properties and yields explicit formulas for the posterior distribution of hidden states with a hierarchical structure. We propose an approximation method whose computational complexity is only linear in sequence length. We demonstrate the performance of the model via extensive simulations. We further use our model to identify DNase I sensitivity and DNase I hypersensitive sites over the Encyclopedia of DNA Elements (ENCODE) regions in human lymphoblastoid cells.
References
Barski A, Cuddapah S, Cui K, Roh T-Y, Schones DE, Wang Z, Wei G, Chepelev I, Zhao K (2007) High-resolution profiling of histone methylations in the human genome. Cell 129:823–837
Bernstein BE, Stamatoyannopoulos JA, Costello JF, Ren B, Milosavljevic A, Meissner A, Kellis M, Marra MA, Beaudet AL, Ecker JR et al (2010) The NIH roadmap epigenomics mapping consortium. Nat Biotechnol 28:1045–1048
ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the human genome. Nature 489:57–74
Dorschner MO, Hawrylycz M, Humbert R, Wallace JC, Shafer A, Kawamoto J, Mack J, Hall R, Goldy J, Sabo PJ et al (2004) High-throughput localization of functional elements by quantitative chromatin profiling. Nat Methods 1:219–225
Lian H, Thompson WA, Thurman R, Stamatoyannopoulos JA, Noble WS, Lawrence CE (2008) Automated mapping of large-scale chromatin structure in ENCODE. Bioinformatics 24:1911–1916
Sabo PJ, Kuehn MS, Thurman R, Johnson BE, Johnson EM, Cao H, Yu M, Rosenzweig E, Goldy J, Haydock A et al (2006) Genome-scale mapping of DNase I sensitivity in vivo using tiling DNA microarrays. Nat Methods 3:511–518
Xing H, Mo Y, Liao W, Zhang MQ (2012) Genome-wide localization of protein-DNA binding and histone modification by a Bayesian change-point method with ChIP-seq data. PLoS Comput Biol 8:e1002613
Xing H, Ying C (2014) A stochastic segmentation model for recurrent copy number alteration analysis. Technical Report, Department of Applied Mathematics and Statistics, State University of New York at Stony Brook
Mikkelsen TS, Ku M, Jaffe DB, Issac B, Lieberman E, Giannoukos G, Alvarez P, Brockman W, Kim T-K, Koche RP et al (2007) Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature 448:553–560
Qin ZS, Yu J, Shen J, Maher CA, Hu M, Kalyana-Sundaram S, Yu J, Chinnaiyan AM (2010) HPeak: an HMM-based algorithm for defining read-enriched regions in ChIP-Seq data. BMC Bioinformatics 11:369
Rozowsky J, Euskirchen G, Auerbach RK, Zhang ZD, Gibson T, Bjornson R, Carriero N, Snyder M, Gerstein MB (2009) PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nat Biotechnol 27:66–75
Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum C, Myers RM, Brown M, Li W, Liu XS (2008) Model-based analysis of ChIP-Seq (MACS). Genome Biol 9:R137
Acknowledgement
The first two authors contributed equally to the chapter. Haipeng Xing’s research was supported by the National Science Foundation grant DMS-1206321. Michael Q. Zhang’s research was supported by the National Institutes of Health grants HG001696 and ES-17166, the National Basic Research Program of China grant 2012CB316503, and the National Natural Science Foundation of China grant 91019016.
Appendices
Appendix A. Proof of (27.5) and (27.12)
Proof of (27.5): To derive the mixture weight \(\xi_{i,t}^{(k)}\), we first note that
When \(l\neq k\),
When \(l= k\),
Let
We then have
Hence, the mixture weight \(\xi_{i,t}^{(k)}\) is the conditional probability which can be determined via normalization of \(\xi_{i,t}^{(k*)}\). Furthermore, simple algebra shows that
where
for \(i\le j\). This proves (27.5).
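The normalization step used in this proof can be illustrated numerically. The sketch below is not the chapter's formula for the weights; it only shows how a set of unnormalized weights is turned into conditional probabilities. It assumes the unnormalized weights are stored on a log scale (an assumption made here for numerical stability), and the function name `normalize_log_weights` is hypothetical.

```python
import math

def normalize_log_weights(log_w):
    """Turn unnormalized log-weights into probabilities that sum to 1."""
    # Subtract the maximum before exponentiating to avoid overflow/underflow.
    m = max(log_w)
    shifted = [math.exp(v - m) for v in log_w]
    total = sum(shifted)
    return [s / total for s in shifted]

# Illustrative unnormalized weights 2, 1, 1 (made-up numbers).
weights = normalize_log_weights([math.log(2.0), math.log(1.0), math.log(1.0)])
```

Working on the log scale is a common design choice when the unnormalized weights are products of many small likelihood terms, as tends to happen over long sequences.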
Proof of (27.12): We use Bayes’ theorem to combine the forward filter (27.4) with its backward variant (27.10) to derive the posterior distribution of \(\theta_t\) given \({\cal F}_T\) \((1\le t < T)\)
We first consider the following:
Note that
Hence, (27.12) is proved.
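The idea of combining a forward filter with a backward one through Bayes' theorem can be seen in a much simpler setting. The sketch below uses a plain two-state discrete hidden Markov model as a stand-in, which is an assumption for illustration and not the chapter's hierarchical model; the function name and all numerical values are hypothetical.

```python
def forward_backward(init, trans, emit, obs):
    """Smoothed state posteriors P(s_t | y_1..y_T) for a discrete HMM."""
    n, T = len(init), len(obs)
    # Forward pass: alpha[t][i] = P(y_1..y_t, s_t = i)
    alpha = [[init[i] * emit[i][obs[0]] for i in range(n)]]
    for t in range(1, T):
        alpha.append([
            emit[j][obs[t]] * sum(alpha[t - 1][i] * trans[i][j] for i in range(n))
            for j in range(n)
        ])
    # Backward pass: beta[t][i] = P(y_{t+1}..y_T | s_t = i)
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [
            sum(trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j] for j in range(n))
            for i in range(n)
        ]
    # Bayes' theorem: the posterior is proportional to forward times backward.
    post = []
    for t in range(T):
        unnorm = [alpha[t][i] * beta[t][i] for i in range(n)]
        z = sum(unnorm)
        post.append([u / z for u in unnorm])
    return post

# Made-up parameters for a two-state, two-symbol example.
init = [0.6, 0.4]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [[0.7, 0.3], [0.1, 0.9]]
obs = [0, 1, 1]
post = forward_backward(init, trans, emit, obs)
```

Each row of `post` is a probability vector over the hidden states at one position, conditioned on the whole observed sequence rather than only on the observations up to that position.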
Appendix B. EM Algorithm for Hyperparameter Estimation
The inference procedures in the above sections involve the hyperparameters \(\Phi=\{Q, z^{(k)}, \kappa^{(k)}, \lambda^{(k)}, g^{(k)};\ k=1, \dots, K\}\), which form a \([4K+K(K-1)]\)-dimensional vector. We can use the EM algorithm to exploit the much simpler structure of the log likelihood \(l_c(\Phi)\) of the complete data \(\{(y_t, s_t, \theta_t), 1\le t \le T\}\), which is expressed as
The E-step of the EM algorithm calculates \(E[l_c(\Phi)|{\cal F}_t]\), which involves the computation of the conditional expectations:
and the conditional probability:
The M-step of the EM algorithm involves calculating the partial derivatives of \(E[l_c(\Phi)|{\cal F}_t]\) with respect to Φ. Simple algebra yields the following updating formulas for Φ:
I-terms in (27.24) can be obtained as follows:
in which
We can use the BCMIX approximations instead of the full recursions to determine the items (27.25)–(27.29) in order to speed up computation. The iteration scheme (27.24) is carried out until convergence to estimate hyperparameters.
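The overall shape of such an iteration (an E-step, a closed-form M-step, and a stop at convergence) can be sketched on a toy problem. The example below fits a two-component Gaussian mixture with unit variances, which is an assumed stand-in chosen for brevity and not the chapter's model or its updating formulas; the function name, starting values, tolerance, and data are all hypothetical.

```python
import math

def em_two_gaussians(data, mu=(-1.0, 1.0), w=0.5, tol=1e-8, max_iter=500):
    """EM for a two-component Gaussian mixture with unit variances (toy model)."""
    mu0, mu1 = mu
    history = []
    for _ in range(max_iter):
        # E-step: responsibility of component 1 for each observation.
        resp = []
        loglik = 0.0
        for y in data:
            p0 = (1 - w) * math.exp(-0.5 * (y - mu0) ** 2)
            p1 = w * math.exp(-0.5 * (y - mu1) ** 2)
            resp.append(p1 / (p0 + p1))
            loglik += math.log((p0 + p1) / math.sqrt(2 * math.pi))
        history.append(loglik)
        # Stop once the observed-data log-likelihood no longer improves.
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break
        # M-step: closed-form updates of the means and the mixing weight.
        n1 = sum(resp)
        n0 = len(data) - n1
        mu0 = sum((1 - r) * y for r, y in zip(resp, data)) / n0
        mu1 = sum(r * y for r, y in zip(resp, data)) / n1
        w = n1 / len(data)
    return (mu0, mu1), w, history

# Made-up data with two well-separated clusters.
data = [-2.1, -1.9, -2.0, 2.0, 2.2, 1.8]
(mu0_hat, mu1_hat), w_hat, history = em_two_gaussians(data)
```

Tracking the log-likelihood in `history` gives a simple convergence check, since each EM iteration should not decrease it.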
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Xing, H., Mo, Y., Liao, W., Cai, Y., Zhang, M. (2015). A Stochastic Segmentation Model for the Indentification of Histone Modification and DNase I Hypersensitive Sites in Chromatin. In: Chen, Z., Liu, A., Qu, Y., Tang, L., Ting, N., Tsong, Y. (eds) Applied Statistics in Biomedicine and Clinical Trials Design. ICSA Book Series in Statistics. Springer, Cham. https://doi.org/10.1007/978-3-319-12694-4_27
DOI: https://doi.org/10.1007/978-3-319-12694-4_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12693-7
Online ISBN: 978-3-319-12694-4
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)