Text compression algorithms can be used to construct metric distance measures, called compression-based distances (CBDs), that are suitable for character sequences. Here we review the principles of various types of compression algorithms and describe their general behaviour when used to compare protein and DNA sequences. We employ reduced and enlarged alphabets, and model biological rearrangements such as domain shuffling. In classification experiments evaluated with ROC analysis, CBDs perform less well than substring-based methods such as BLAST and the Smith-Waterman algorithm, but better than distances based on word composition. CBDs outperformed substring methods with respect to domain shuffling, and in some cases performed better when the alphabet was reduced.
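As a rough illustration of the idea, the sketch below computes the normalized compression distance, one widely used compression-based distance, for a pair of sequences. The abstract does not state which formula or compressor the chapter uses, so both are assumptions here: zlib stands in for the compressor, and the helper names (`compressed_size`, `ncd`, `random_protein`) and the random test sequences are hypothetical. The example also mimics domain shuffling by swapping the two halves of a sequence.

```python
import random
import zlib


def compressed_size(seq: str) -> int:
    """Length in bytes of the zlib-compressed sequence (assumed stand-in for C(x))."""
    return len(zlib.compress(seq.encode("ascii"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def random_protein(length: int, rng: random.Random) -> str:
    """Hypothetical helper: a uniformly random string over the 20 amino acids."""
    return "".join(rng.choice("ACDEFGHIKLMNPQRSTVWY") for _ in range(length))


# Synthetic data only; these are not real protein sequences.
rng = random.Random(0)
domain_a = random_protein(300, rng)
domain_b = random_protein(300, rng)
original = domain_a + domain_b
shuffled = domain_b + domain_a          # same two "domains", order swapped
unrelated = random_protein(600, rng)

print(f"original vs shuffled:  {ncd(original, shuffled):.3f}")
print(f"original vs unrelated: {ncd(original, unrelated):.3f}")
```

Because the compressor can reuse earlier material wherever it appears in the concatenated input, the shuffled pair should come out much closer than the unrelated pair, which is consistent with the abstract's remark about domain shuffling. A general-purpose compressor such as zlib is only a convenient placeholder; results on real sequences depend strongly on the compressor chosen.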
Copyright information
© 2009 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Kertesz-Farkas, A., Kocsor, A., Pongor, S. (2009). The Application of Data Compression-Based Distances to Biological Sequences. In: Emmert-Streib, F., Dehmer, M. (eds) Information Theory and Statistical Learning. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-84816-7_4
DOI: https://doi.org/10.1007/978-0-387-84816-7_4
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-84815-0
Online ISBN: 978-0-387-84816-7
eBook Packages: Computer Science, Computer Science (R0)