The Application of Data Compression-Based Distances to Biological Sequences

Kertesz-Farkas, Attila; Kocsor, Andras; Pongor, Sandor

doi:10.1007/978-0-387-84816-7_4

Text compression algorithms can be used to construct metric distance measures, known as compression-based distances (CBDs), that are suitable for character sequences. Here we review the principles of various types of compression algorithms and describe their general behaviour in the comparison of protein and DNA sequences. We employ reduced and enlarged alphabets, and model biological rearrangements such as domain shuffling. In classification experiments evaluated with ROC analysis, CBDs perform less well than substring-based methods such as BLAST and the Smith-Waterman algorithm, but better than distances based on word composition. CBDs outperformed substring methods with respect to domain shuffling, and in some cases performed better when the alphabet was reduced.
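
To make the idea concrete, the sketch below computes one widely used compression-based distance, the normalized compression distance, defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C denotes compressed length in bytes. It is an illustration only: the choice of zlib as the compressor, the function names, and the six-group reduced amino acid alphabet are all assumptions made for this example, and the chapter's own compressors and alphabets may differ.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (level 9)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two sequences.

    Uses zlib as a stand-in compressor (an assumption of this sketch).
    Values near 0 suggest similar sequences; values near 1 suggest
    unrelated ones. Short inputs are dominated by compressor overhead.
    """
    bx, by = x.encode("ascii"), y.encode("ascii")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical reduced alphabet: each group of amino acids is replaced
# by the first letter of the group. The grouping is illustrative only.
_GROUPS = ["AVLIMC", "FWYH", "STNQ", "KR", "DE", "GP"]
_REDUCED = {aa: group[0] for group in _GROUPS for aa in group}


def reduce_alphabet(seq: str) -> str:
    """Map a protein sequence onto the reduced alphabet.

    Characters outside the 20 standard amino acids are left unchanged.
    """
    return "".join(_REDUCED.get(ch, ch) for ch in seq)


if __name__ == "__main__":
    a = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVV"
    b = "GSHMWDPNCERTYHHGFLWMNPCDYTRWQEHGNPMCFWYDTHRESGNPQWCMYFHDTGERWNCPMHYQ"
    print("NCD(a, a) =", round(ncd(a, a), 3))
    print("NCD(a, b) =", round(ncd(a, b), 3))
    print("NCD on reduced alphabet =",
          round(ncd(reduce_alphabet(a), reduce_alphabet(b)), 3))
```

A general-purpose compressor such as zlib has a limited window and a fixed header overhead, so the values it gives for short protein sequences are only rough; the sketch is meant to show the shape of the computation rather than to reproduce any reported result.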

Author information

Authors and Affiliations

Research Group on Artificial Intelligence, Aradi vértanúk tere 1, Szeged, 6720, Hungary
Attila Kertesz-Farkas & Andras Kocsor
Protein Structure and Bioinformatics Group, International Centre for Genetic Engineering and Biotechnology, Padriciano 99, Trieste, 34012, Italy
Sandor Pongor
Bioinformatics Group, Biological Research Centre, Hungarian Academy of Sciences, Temesvári krt. 62, Szeged, 6701, Hungary
Sandor Pongor

Corresponding authors

Correspondence to Attila Kertesz-Farkas, Andras Kocsor or Sandor Pongor.

Editor information

Editors and Affiliations

Department of Biostatistics and Department of Genome Sciences, University of Washington, 1705 NE Pacific St., Box 357730, Seattle, WA, 98195, USA
Frank Emmert-Streib
Queen's University Belfast Computational Biology and Machine Learning, Center for Cancer Research and Cell Biology School of Biomedical Sciences, 97 Lisburn Road, Belfast, BT9 7BL, UK
Frank Emmert-Streib
Institute of Discrete Mathematics and Geometry, Vienna University of Technology, Wiedner Hauptstr. 8–10, Vienna, 1040, Austria
Matthias Dehmer
Probability and Statistics, University of Coimbra Center for Mathematics, Apartado 3008, Coimbra, 3001–454, Portugal
Matthias Dehmer

About this chapter

Cite this chapter

Kertesz-Farkas, A., Kocsor, A., Pongor, S. (2009). The Application of Data Compression-Based Distances to Biological Sequences. In: Emmert-Streib, F., Dehmer, M. (eds) Information Theory and Statistical Learning. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-84816-7_4

DOI: https://doi.org/10.1007/978-0-387-84816-7_4
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-84815-0
Online ISBN: 978-0-387-84816-7
eBook Packages: Computer Science, Computer Science (R0)
