Employing Compact Intra-genomic Language Models to Predict Genomic Sequences and Characterize Their Entropy

Deusdado, Sérgio; Carvalho, Paulo

doi:10.1007/978-3-642-13214-8_19

Part of the book series: Advances in Intelligent and Soft Computing (AINSC, volume 74)

Abstract

Probabilistic language models are fundamental to understanding and learning the profile of the underlying code in order to estimate its entropy, enabling the verification and prediction of “natural” emanations of the language. Language models are designed to capture the salient statistical characteristics of the distribution of word sequences; transposed to the genomic language, they allow the modeling of a predictive system for the peculiarities and regularities of genomic code under different inter- and intra-genomic conditions. In this paper, we propose applying compact intra-genomic language models to predict the composition of genomic sequences, aiming to provide valuable resources for data compression and to broaden the perspectives of similarity analysis in genomic sequences. The results obtained encourage further investigation and validate the use of language models in biological sequence analysis.
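As a rough illustration of the general idea, and not the authors' actual models (which this abstract does not specify), the sketch below trains a fixed-order n-gram model on a DNA string, predicts the most likely next base, and estimates entropy as the average number of bits per base. The function names, the additive smoothing, and the choice of order are all assumptions made for the example.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumed alphabet; ambiguity codes such as N are not handled


def train_counts(seq, order):
    """Count how often each base follows each context of length `order`."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(order, len(seq)):
        counts[seq[i - order:i]][seq[i]] += 1
    return counts


def prob(counts, context, base, alpha=1.0):
    """Additively smoothed estimate of P(base | context).

    Additive smoothing is a stand-in chosen for brevity, not the
    smoothing scheme used in the paper.
    """
    ctx = counts.get(context, {})
    total = sum(ctx.values())
    return (ctx.get(base, 0) + alpha) / (total + alpha * len(ALPHABET))


def predict_next(counts, context, alpha=1.0):
    """Return the base with the highest estimated probability after `context`."""
    return max(ALPHABET, key=lambda b: prob(counts, context, b, alpha))


def cross_entropy(counts, seq, order, alpha=1.0):
    """Average -log2 P(base | context) over `seq`, in bits per base."""
    n = len(seq) - order
    if n <= 0:
        raise ValueError("sequence must be longer than the model order")
    bits = 0.0
    for i in range(order, len(seq)):
        bits -= math.log2(prob(counts, seq[i - order:i], seq[i], alpha))
    return bits / n


if __name__ == "__main__":
    genome = "ACGT" * 50  # toy, highly regular sequence
    model = train_counts(genome, order=2)
    print(predict_next(model, "AC"))
    print(cross_entropy(model, genome, order=2))
```

A model with no training data assigns each base probability 1/4, which gives the 2 bits per base baseline for a four-letter alphabet. A model that has learned regularities in the sequence scores below that baseline, which is the link between prediction and compression that the abstract refers to.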

Author information

Authors and Affiliations

CIMO - Mountain Research Centre, Polytechnic Institute of Bragança, Portugal
Sérgio Deusdado
Department of Informatics, School of Engineering, University of Minho, Braga, Portugal
Paulo Carvalho

Editor information

Editors and Affiliations

Dep. Informática / CCTC, Universidade do Minho, Campus de Gualtar, 4710-057, Braga, Portugal
Miguel P. Rocha
Escuela Superior de Ingeniería Informática Edificio Politécnico, Despacho 408 Campus Universitario As Lagoas s/n, 32004, Ourense, Spain
Florentino Fernández Riverola
Computational Biology and Machine Learning Lab, School of Computing, Queen’s University, K7L 3N6, Kingston, Ontario, Canada
Hagit Shatkay
Departamento de Informática y Automática Facultad de Ciencias, Universidad de Salamanca, Plaza de la Merced S/N, 37008, Salamanca, Spain
Juan Manuel Corchado

About this paper

Cite this paper

Deusdado, S., Carvalho, P. (2010). Employing Compact Intra-genomic Language Models to Predict Genomic Sequences and Characterize Their Entropy. In: Rocha, M.P., Riverola, F.F., Shatkay, H., Corchado, J.M. (eds) Advances in Bioinformatics. Advances in Intelligent and Soft Computing, vol 74. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13214-8_19

DOI: https://doi.org/10.1007/978-3-642-13214-8_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13213-1
Online ISBN: 978-3-642-13214-8
eBook Packages: Engineering, Engineering (R0)
