Complexity Profiles of DNA Sequences Using Finite-Context Models

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 7058)

Abstract

Every data compression method assumes a certain model of the information source that produces the data. When we improve a data compression method, we are also improving the model of the source. This happens because, when the probability distribution of the assumed source model is closer to the true probability distribution of the source, a smaller relative entropy results and, therefore, fewer redundancy bits are required. This is why the importance of data compression goes beyond the usual goal of reducing the storage space or the transmission time of the information. In fact, in some situations, seeking better models is the main aim. In our view, this is the case for DNA sequence data. In this paper, we give hints on how finite-context (Markov) modeling may be used for DNA sequence analysis, through the construction of complexity profiles of the sequences. These profiles are able to unveil structures of the DNA, some of them with potential biological relevance.
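As an illustration of the idea, the following is a minimal sketch of how a complexity profile might be computed with a single finite-context model. It is not the authors' implementation. It assumes an order-k model whose counts are updated causally (each base is scored before it is counted) and an additive-smoothing estimator; the function name `complexity_profile` and the parameters `order` and `alpha` are hypothetical names chosen for this example. With `alpha = 1` the estimator is Laplace's rule, and with `alpha = 1/2` it is the Krichevsky-Trofimov (Jeffreys) estimator.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def complexity_profile(seq, order=2, alpha=1.0):
    """Per-base information content, in bits, under one finite-context model.

    Illustrative sketch only: `order` is the context length k and `alpha`
    is an additive-smoothing parameter (both names are assumptions).
    """
    # counts[context][symbol] = how often `symbol` has followed `context` so far
    counts = defaultdict(lambda: dict.fromkeys(ALPHABET, 0))
    profile = []
    for i, symbol in enumerate(seq):
        # The context is the previous `order` bases; near the start of the
        # sequence it is shorter, which simply yields separate short contexts.
        context = seq[max(0, i - order):i]
        ctx_counts = counts[context]
        total = sum(ctx_counts.values())
        # Smoothed estimate of P(symbol | context) from past data only.
        prob = (ctx_counts[symbol] + alpha) / (total + alpha * len(ALPHABET))
        # An ideal arithmetic coder would spend about -log2(prob) bits here.
        profile.append(-math.log2(prob))
        # Update the model after scoring, so a decoder could do the same.
        ctx_counts[symbol] += 1
    return profile


if __name__ == "__main__":
    bits = complexity_profile("ACGT" * 10, order=2, alpha=1.0)
    print(" ".join(f"{b:.2f}" for b in bits))
```

Low values in the returned list mark positions the model predicts well, such as repeats, while values near 2 bits per base mark positions that look random to the model. In practice such a profile would typically be smoothed with a sliding window before plotting.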

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Pinho, A.J., Pratas, D., Garcia, S.P. (2011). Complexity Profiles of DNA Sequences Using Finite-Context Models. In: Holzinger, A., Simonic, KM. (eds) Information Quality in e-Health. USAB 2011. Lecture Notes in Computer Science, vol 7058. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25364-5_8

  • DOI: https://doi.org/10.1007/978-3-642-25364-5_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-25363-8

  • Online ISBN: 978-3-642-25364-5

  • eBook Packages: Computer Science, Computer Science (R0)
