Abstract
We introduce inhomogeneous parsimonious Markov models for modeling statistical patterns in discrete sequences. These models are based on parsimonious context trees, which are a generalization of context trees, and thus generalize variable order Markov models. We follow a Bayesian approach, consisting of structure and parameter learning. Structure learning is challenging because the number of possible tree structures is super-exponential, so we describe an exact and efficient dynamic programming algorithm for finding the optimal tree structures.
We apply the model and the learning algorithm to the problem of modeling binding sites of the human transcription factor C/EBP, and find improved prediction performance compared to fixed order and variable order Markov models. We investigate the reason for this improvement and find several instances of context-specific dependencies that can be captured by parsimonious context trees but not by traditional context trees.
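To illustrate how a parsimonious context tree differs from a traditional context tree, the following minimal sketch may help. It assumes the usual definition in which every node is labeled with a subset of the alphabet and the children of each inner node partition the alphabet; a traditional context tree is the special case where every label is a single symbol. The names `Node`, `is_valid`, `find_leaf`, and the example tree are hypothetical and are not taken from the paper or from any library.

```python
from dataclasses import dataclass, field

ALPHABET = frozenset("ACGT")  # assumed DNA alphabet, for illustration only


@dataclass
class Node:
    label: frozenset                      # symbols this node matches
    children: list = field(default_factory=list)


def is_valid(node, alphabet=ALPHABET):
    """Check that the children of every inner node partition the alphabet."""
    if not node.children:
        return True
    labels = [child.label for child in node.children]
    union = frozenset().union(*labels)
    disjoint = sum(len(label) for label in labels) == len(union)
    return (union == alphabet and disjoint
            and all(is_valid(child, alphabet) for child in node.children))


def find_leaf(root, context):
    """Return the labels on the path to the leaf that matches `context`.

    context[-1] is the symbol directly preceding the position to predict,
    so the context is read from right to left while descending the tree.
    """
    node, path = root, []
    for symbol in reversed(context):
        if not node.children:
            break
        node = next(c for c in node.children if symbol in c.label)
        path.append("".join(sorted(node.label)))
    return tuple(path)


# Hypothetical depth-2 tree: the first level merges {A,C} and {G,T};
# below {A,C} the second symbol is ignored, below {G,T} only A is split off.
root = Node(ALPHABET, [
    Node(frozenset("AC"), [Node(ALPHABET)]),
    Node(frozenset("GT"), [Node(frozenset("A")), Node(frozenset("CGT"))]),
])
```

In this sketch the contexts `"TA"` and `"GC"` reach the same leaf and would therefore share one conditional distribution, whereas `"AG"` and `"CG"` are separated. A traditional context tree cannot merge sibling symbols such as A and C into one node in this way.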
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Eggeling, R., Gohr, A., Bourguignon, PY., Wingender, E., Grosse, I. (2013). Inhomogeneous Parsimonious Markov Models. In: Blockeel, H., Kersting, K., Nijssen, S., Železný, F. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2013. Lecture Notes in Computer Science, vol 8188. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40988-2_21
DOI: https://doi.org/10.1007/978-3-642-40988-2_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40987-5
Online ISBN: 978-3-642-40988-2
eBook Packages: Computer Science, Computer Science (R0)