Abstract
Efficient family classification of newly discovered protein sequences is a central problem in bioinformatics. We present a new algorithm, based on Probabilistic Suffix Trees, that identifies, for each family, equivalences between the amino acids in different positions of a motif. We also show that better classification can be achieved by identifying representative fingerprints in the amino acid chains.
This work is partially supported by CAPES and is part of PRONEX/FAPESP’s Project Stochastic behavior, critical phenomena and rhythmic pattern identification in natural languages (grant number 03/09930-9).
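As a rough illustration of the kind of model the abstract refers to, the following minimal sketch trains one variable-length context model per family and assigns a new sequence to the family whose model gives it the highest log-likelihood. It is not the authors' algorithm: it does not merge equivalent amino acids or extract fingerprints, and the function names (`train_context_model`, `log_likelihood`, `classify`), the `max_depth` parameter, the add-one smoothing, and the toy sequences are all assumptions made for this example.

```python
import math
from collections import defaultdict

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def train_context_model(sequences, max_depth=2):
    """Count next-symbol occurrences after every context of length 0..max_depth."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, symbol in enumerate(seq):
            for depth in range(min(i, max_depth) + 1):
                context = seq[i - depth:i]  # the empty string is the root context
                counts[context][symbol] += 1
    return counts


def log_likelihood(counts, seq, max_depth=2):
    """Score seq using, at each position, the longest context seen in training."""
    total = 0.0
    for i, symbol in enumerate(seq):
        depth = min(i, max_depth)
        # Back off to shorter contexts until one was observed (or the root is reached).
        while depth > 0 and seq[i - depth:i] not in counts:
            depth -= 1
        context_counts = counts.get(seq[i - depth:i], {})
        n = sum(context_counts.values())
        # Add-one smoothing (an assumption of this sketch) keeps probabilities positive.
        p = (context_counts.get(symbol, 0) + 1) / (n + len(ALPHABET))
        total += math.log(p)
    return total


def classify(models, seq, max_depth=2):
    """Return the family whose model assigns seq the highest log-likelihood."""
    return max(models, key=lambda fam: log_likelihood(models[fam], seq, max_depth))


# Toy training data, invented for illustration only (not real protein families).
families = {
    "fam_a": ["ACACACAC", "ACACAC"],
    "fam_b": ["GGGTGGGT", "GGTGGG"],
}
models = {name: train_context_model(seqs) for name, seqs in families.items()}
predicted = classify(models, "ACAC")
```

A full probabilistic suffix tree would additionally prune contexts that do not change the next-symbol distribution; the sketch keeps every observed context up to `max_depth` for simplicity.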
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Leonardi, F., Galves, A. (2005). Sequence Motif Identification and Protein Family Classification Using Probabilistic Trees. In: Setubal, J.C., Verjovski-Almeida, S. (eds) Advances in Bioinformatics and Computational Biology. BSB 2005. Lecture Notes in Computer Science, vol 3594. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11532323_20
DOI: https://doi.org/10.1007/11532323_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28008-8
Online ISBN: 978-3-540-31861-3
eBook Packages: Computer Science (R0)