Employing Compact Intra-genomic Language Models to Predict Genomic Sequences and Characterize Their Entropy

  • Conference paper
Advances in Bioinformatics

Part of the book series: Advances in Intelligent and Soft Computing ((AINSC,volume 74))

Abstract

Probabilistic language models are fundamental to understanding and learning the profile of an underlying code and to estimating its entropy, which enables the verification and prediction of “natural” emanations of the language. Language models aim to capture the salient statistical characteristics of the distribution of word sequences; transposed to the genomic language, they allow the peculiarities and regularities of genomic code to be modeled predictively under different inter- and intra-genomic conditions. In this paper, we propose applying compact intra-genomic language models to predict the composition of genomic sequences, with the aim of providing valuable resources for data compression and of broadening the perspectives for similarity analysis in genomic sequences. The results obtained encourage further investigation and validate the use of language models in biological sequence analysis.
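To make the idea of predicting bases and estimating entropy with a language model concrete, the following is a minimal sketch only. It assumes a fixed-order k-mer model with additive smoothing, which is a simple stand-in and not necessarily the model used in the paper. The names `train_counts`, `prob`, and `cross_entropy` and the parameters `order` and `alpha` are hypothetical and chosen for illustration.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: sequences contain only these four bases


def train_counts(seq, order):
    """Count how often each base follows each context of length `order`."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(order, len(seq)):
        counts[seq[i - order:i]][seq[i]] += 1
    return counts


def prob(counts, context, base, alpha=1.0):
    """Additively smoothed estimate of P(base | context).

    Unseen contexts fall back to a uniform distribution over ALPHABET.
    """
    ctx = counts.get(context, {})
    total = sum(ctx.values())
    return (ctx.get(base, 0) + alpha) / (total + alpha * len(ALPHABET))


def cross_entropy(counts, seq, order, alpha=1.0):
    """Average number of bits per base the model needs to encode `seq`."""
    n = len(seq) - order
    bits = -sum(
        math.log2(prob(counts, seq[i - order:i], seq[i], alpha))
        for i in range(order, len(seq))
    )
    return bits / n


# Usage: train on one part of a (toy) genome and score another part.
model = train_counts("ACGT" * 50, order=2)
print(cross_entropy(model, "ACGT" * 10, order=2))
```

An untrained model assigns probability 1/4 to every base, which gives 2 bits per base. A value below 2 bits on held-out sequence indicates that the model has captured some regularity, and this is the property that makes such models useful for compression.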

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Deusdado, S., Carvalho, P. (2010). Employing Compact Intra-genomic Language Models to Predict Genomic Sequences and Characterize Their Entropy. In: Rocha, M.P., Riverola, F.F., Shatkay, H., Corchado, J.M. (eds) Advances in Bioinformatics. Advances in Intelligent and Soft Computing, vol 74. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13214-8_19

  • DOI: https://doi.org/10.1007/978-3-642-13214-8_19

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-13213-1

  • Online ISBN: 978-3-642-13214-8

  • eBook Packages: Engineering, Engineering (R0)
