The Application of Data Compression-Based Distances to Biological Sequences

  • Chapter
Information Theory and Statistical Learning

Text compression algorithms can be used to construct metric distance measures (compression-based distances, CBDs) suitable for comparing character sequences. Here we review the principles behind various types of compression algorithms and describe their general behaviour when applied to the comparison of protein and DNA sequences. We employ reduced and enlarged alphabets, and we model biological rearrangements such as domain shuffling. In classification experiments evaluated with ROC analysis, CBDs perform less well than substring-based methods such as BLAST and the Smith-Waterman algorithm, but better than distances based on word composition. CBDs outperformed substring methods on domain-shuffled sequences, and in some cases their performance improved when the alphabet was reduced.
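
As a rough illustration of the idea, the sketch below computes one common compression-based distance, the normalized compression distance NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. The abstract does not say which formula or compressor the chapter uses, so both the NCD form and the choice of zlib are assumptions made here for illustration. The helper names (compressed_size, ncd, random_protein) and the synthetic "domains" are hypothetical and are not taken from the chapter.

```python
import random
import zlib

# The 20 standard amino acid one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def compressed_size(seq: str) -> int:
    """Length in bytes of the zlib-compressed sequence, a stand-in for C(x)."""
    return len(zlib.compress(seq.encode("ascii"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two sequences (assumed form)."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def random_protein(length: int, rng: random.Random) -> str:
    """Synthetic sequence used only for this demonstration."""
    return "".join(rng.choices(AMINO_ACIDS, k=length))


if __name__ == "__main__":
    rng = random.Random(0)
    # Four synthetic "domains"; real experiments would use database sequences.
    dom_a, dom_b, dom_c, dom_d = (random_protein(150, rng) for _ in range(4))

    protein = dom_a + dom_b    # two-domain sequence
    shuffled = dom_b + dom_a   # same domains, order swapped
    unrelated = dom_c + dom_d  # no shared domains

    print("identical :", ncd(protein, protein))
    print("shuffled  :", ncd(protein, shuffled))
    print("unrelated :", ncd(protein, unrelated))
```

In this toy setting the compressor can reuse each domain wherever it occurs in the concatenation, so the domain-shuffled pair should score much closer to the identical pair than to the unrelated pair. That is one intuition for why such distances tolerate domain shuffling, although synthetic random sequences say nothing about performance on real protein families.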

Author information

Corresponding authors

Correspondence to Attila Kertesz-Farkas, Andras Kocsor or Sandor Pongor.

Copyright information

© 2009 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Kertesz-Farkas, A., Kocsor, A., Pongor, S. (2009). The Application of Data Compression-Based Distances to Biological Sequences. In: Emmert-Streib, F., Dehmer, M. (eds) Information Theory and Statistical Learning. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-84816-7_4

  • DOI: https://doi.org/10.1007/978-0-387-84816-7_4

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-0-387-84815-0

  • Online ISBN: 978-0-387-84816-7

  • eBook Packages: Computer Science, Computer Science (R0)
