Abstract
Probabilistic language models are fundamental to understanding and learning the profile of the underlying code and to estimating its entropy, which enables the verification and prediction of “natural” emanations of the language. Language models aim to capture the salient statistical characteristics of the distribution of word sequences; transposed to the genomic language, they allow the peculiarities and regularities of the genomic code to be modeled predictively under different inter- and intra-genomic conditions. In this paper, we propose applying compact intra-genomic language models to predict the composition of genomic sequences, with the aim of providing useful resources for data compression and broadening the scope of similarity analysis in genomic sequences. The results obtained encourage further investigation and validate the use of language models in biological sequence analysis.
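As an illustration of the general idea, the following minimal sketch trains a fixed-order n-gram model on a nucleotide sequence and reports its cross-entropy in bits per base, a common proxy for sequence entropy. It is not the model proposed in the chapter: the add-alpha smoothing, the fixed context length, and the names `train_counts`, `prob`, and `cross_entropy` are assumptions made here for illustration only.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: sequences contain only these four symbols


def train_counts(seq, order):
    """Count how often each nucleotide follows each context of length `order`."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(order, len(seq)):
        counts[seq[i - order:i]][seq[i]] += 1
    return counts


def prob(counts, context, symbol, alpha=1.0):
    """Add-alpha smoothed estimate of P(symbol | context).

    Smoothing is an illustrative choice; unseen contexts fall back to a
    uniform distribution over ALPHABET.
    """
    ctx = counts.get(context, {})
    total = sum(ctx.values())
    return (ctx.get(symbol, 0) + alpha) / (total + alpha * len(ALPHABET))


def cross_entropy(counts, seq, order, alpha=1.0):
    """Average number of bits per base the model needs to predict `seq`."""
    n = len(seq) - order
    if n <= 0:
        raise ValueError("sequence must be longer than the model order")
    bits = 0.0
    for i in range(order, len(seq)):
        bits -= math.log2(prob(counts, seq[i - order:i], seq[i], alpha))
    return bits / n


if __name__ == "__main__":
    training = "ACGT" * 50          # toy, highly repetitive sequence
    model = train_counts(training, order=2)
    print(cross_entropy(model, training, order=2))
```

An untrained model assigns every base probability 1/4, which gives exactly 2 bits per base; any regularity the model captures pushes the estimate below that bound, and this gap is what a compressor can exploit.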
© 2010 Springer-Verlag Berlin Heidelberg
Deusdado, S., Carvalho, P. (2010). Employing Compact Intra-genomic Language Models to Predict Genomic Sequences and Characterize Their Entropy. In: Rocha, M.P., Riverola, F.F., Shatkay, H., Corchado, J.M. (eds) Advances in Bioinformatics. Advances in Intelligent and Soft Computing, vol 74. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13214-8_19
DOI: https://doi.org/10.1007/978-3-642-13214-8_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13213-1
Online ISBN: 978-3-642-13214-8
eBook Packages: Engineering, Engineering (R0)