Protein Sequence Classification Through Relevant Sequence Mining and Bayes Classifiers

Ferreira, Pedro Gabriel; Azevedo, Paulo J.

doi:10.1007/11595014_24

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3808)

Included in the following conference series:

Portuguese Conference on Artificial Intelligence

Abstract

We tackle the problem of sequence classification using relevant subsequences found in a dataset of labelled protein sequences. A subsequence is relevant if it is frequent and has at least a minimum length. For each query sequence, a vector of features is obtained. The features consist of the number and average length of the relevant subsequences shared with each of the protein families. Classification is performed by combining these features in a Bayes classifier. The result is a multi-class, multi-domain method that requires neither data transformation nor background knowledge. We illustrate the performance of our method using three collections of protein datasets. The tests showed that the method performs equivalently to state-of-the-art methods in protein classification.
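The feature construction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that subsequences are contiguous substrings, that "frequent" means occurring in at least `min_support` sequences of a family, and that pattern length is capped at a hypothetical `max_length` to keep enumeration small. The function names, parameters, and toy sequences are all invented for illustration, and the Bayes classification step is omitted.

```python
from collections import Counter


def relevant_subsequences(sequences, min_support, min_length, max_length=6):
    """Return substrings that occur in at least `min_support` sequences.

    Assumption: subsequences are contiguous, with length in
    [min_length, max_length]. Each substring counts once per sequence.
    """
    counts = Counter()
    for seq in sequences:
        seen = set()
        for n in range(min_length, max_length + 1):
            for i in range(len(seq) - n + 1):
                seen.add(seq[i:i + n])
        counts.update(seen)
    return {s for s, c in counts.items() if c >= min_support}


def feature_vector(query, family_patterns):
    """Map each family to (count, average length) of the relevant
    subsequences of that family that also occur in the query."""
    features = {}
    for family, patterns in family_patterns.items():
        shared = [p for p in patterns if p in query]
        avg = sum(len(p) for p in shared) / len(shared) if shared else 0.0
        features[family] = (len(shared), avg)
    return features


# Toy, made-up families; real protein families would be far larger.
families = {
    "A": ["MKVLAA", "MKVLGA", "TMKVLA"],
    "B": ["GGHWPE", "AGHWPE", "GHWPEK"],
}
patterns = {
    name: relevant_subsequences(seqs, min_support=2, min_length=3)
    for name, seqs in families.items()
}
features = feature_vector("QMKVLT", patterns)
```

Here the query shares three relevant subsequences with family A (`MKV`, `KVL`, `MKVL`) and none with family B. A Bayes classifier would then combine these per-family features to pick the most probable family.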

Author information

Authors and Affiliations

Department of Informatics, University of Minho, Campus of Gualtar, 4710-057, Braga, Portugal
Pedro Gabriel Ferreira & Paulo J. Azevedo

Editor information

Editors and Affiliations

Portugal Telecom Inovação (PTI), Centro de Informatica e Sistemas da Universidade de Coimbra (CISUC)
Carlos Bento
Department of Informatics Engineering, Coimbra University, Portugal
Amílcar Cardoso
Centre of Human Language Technology and Bioinformatics, University of Beira Interior, 6201-001, Covilhã, Portugal
Gaël Dias

About this paper

Cite this paper

Ferreira, P.G., Azevedo, P.J. (2005). Protein Sequence Classification Through Relevant Sequence Mining and Bayes Classifiers. In: Bento, C., Cardoso, A., Dias, G. (eds) Progress in Artificial Intelligence. EPIA 2005. Lecture Notes in Computer Science, vol 3808. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11595014_24

DOI: https://doi.org/10.1007/11595014_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-30737-2
Online ISBN: 978-3-540-31646-6
eBook Packages: Computer Science, Computer Science (R0)