Protein Sequence Classification Through Relevant Sequence Mining and Bayes Classifiers

  • Conference paper
Progress in Artificial Intelligence (EPIA 2005)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3808)

Abstract

We tackle the problem of sequence classification using relevant subsequences found in a dataset of labelled protein sequences. A subsequence is relevant if it is frequent and meets a minimum length. For each query sequence, a vector of features is obtained. The features consist of the number and average length of the relevant subsequences that the query shares with each of the protein families. Classification is performed by combining these features in a Bayes classifier. The result is a multi-class, multi-domain method that requires neither data transformation nor background knowledge. We illustrate the performance of our method on three collections of protein datasets. The tests showed that the method performs on par with state-of-the-art methods in protein classification.
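The feature construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that a subsequence is a contiguous substring, that "frequent" means occurring in at least `min_support` sequences of a family, and it caps pattern length with a hypothetical `max_length` parameter to keep the search small. The function names and the toy sequences are invented for illustration, and the Bayes classifier that would consume the features is left out.

```python
from collections import Counter


def relevant_subsequences(sequences, min_support, min_length, max_length=6):
    """Return contiguous substrings found in at least `min_support` sequences.

    Assumption: relevance = frequency (support) plus a minimum length.
    `max_length` is a hypothetical cap, not something stated in the abstract.
    """
    counts = Counter()
    for seq in sequences:
        seen = set()
        for n in range(min_length, max_length + 1):
            for i in range(len(seq) - n + 1):
                seen.add(seq[i:i + n])
        counts.update(seen)  # each substring counts once per sequence
    return {s for s, c in counts.items() if c >= min_support}


def feature_vector(query, family_patterns):
    """Map each family to (number, average length) of patterns found in query."""
    features = {}
    for family, patterns in family_patterns.items():
        shared = [p for p in patterns if p in query]
        avg_len = sum(map(len, shared)) / len(shared) if shared else 0.0
        features[family] = (len(shared), avg_len)
    return features


# Toy data, invented for illustration only.
families = {
    "A": ["MKVLAA", "MKVLGA", "TMKVLA"],
    "B": ["GGSTPW", "AGSTPW", "GSTPWK"],
}
patterns = {
    name: relevant_subsequences(seqs, min_support=2, min_length=3)
    for name, seqs in families.items()
}
features = feature_vector("QMKVLAG", patterns)
print(features)
```

On this toy input the query shares six patterns with family A and none with family B, so the per-family pairs already separate the two classes before any probabilistic model is applied.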




© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ferreira, P.G., Azevedo, P.J. (2005). Protein Sequence Classification Through Relevant Sequence Mining and Bayes Classifiers. In: Bento, C., Cardoso, A., Dias, G. (eds) Progress in Artificial Intelligence. EPIA 2005. Lecture Notes in Computer Science, vol 3808. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11595014_24

  • DOI: https://doi.org/10.1007/11595014_24

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-30737-2

  • Online ISBN: 978-3-540-31646-6

  • eBook Packages: Computer Science, Computer Science (R0)
