On Utilizing Optimal and Information Theoretic Syntactic Modeling for Peptide Classification

  • Eser Aygün
  • B. John Oommen
  • Zehra Cataltepe
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5780)

Abstract

Syntactic methods in pattern recognition have been used extensively in bioinformatics, in particular in the analysis of gene and protein expressions and in the recognition and classification of bio-sequences. These methods are almost universally distance-based. This paper concerns the use of an Optimal and Information Theoretic (OIT) probabilistic model [11] to achieve peptide classification using the information residing in the peptides' syntactic representations. Such classification has traditionally been achieved using the edit distances required in the respective peptide comparisons. We advocate that one can model the differences between compared strings as a mutation model consisting of random Substitutions, Insertions and Deletions (SID) obeying the OIT model. We show that the probability measure obtained from the OIT model can serve as a sequence similarity metric, from which a Support Vector Machine (SVM)-based peptide classifier, referred to as OIT_SVM, can be devised.

The classifier that we have built has been tested with eight different “substitution” matrices and on two different data sets, namely, the HIV-1 Protease Cleavage sites and the T-cell Epitopes. The results show that the OIT model performs significantly better than a classifier that uses the Needleman-Wunsch sequence alignment score, and better than the peptide classification methods previously evaluated on the same two data sets.
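
For readers unfamiliar with the baseline mentioned above, the following is a minimal sketch of a Needleman-Wunsch global alignment score. It is not the authors' implementation: the function name `needleman_wunsch_score` and the constant `match`, `mismatch` and `gap` values are illustrative assumptions, standing in for the substitution matrices and gap settings that the paper would actually use.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the global alignment score of strings a and b.

    Assumptions (not taken from the paper): a constant match/mismatch
    score instead of a substitution matrix, and a linear gap penalty.
    """
    # prev[j] is the best score for aligning the first i-1 symbols of a
    # with the first j symbols of b; row 0 aligns b's prefix against gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: the first i symbols of a aligned against gaps only.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + sub,   # substitute (or match) a[i-1] with b[j-1]
                prev[j] + gap,       # delete a[i-1]
                curr[j - 1] + gap,   # insert b[j-1]
            ))
        prev = curr
    return prev[-1]
```

In a setting like the one described, the constant `sub` term would presumably be replaced by a lookup into a substitution matrix, and the resulting pairwise scores would feed a similarity-based classifier such as an SVM.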

Keywords

Biological Sequence Analysis; Optimal and Information Theoretic Syntactic Classification; Peptide Classification; Sequence Processing; Syntactic Pattern Recognition

References

  1. Aygün, E., Oommen, B.J., Cataltepe, Z.: Peptide Classification Using Optimal and Information Theoretic Syntactic Modeling (submitted for publication)
  2. Bucher, P., Hofmann, K.: A sequence similarity search algorithm based on a probabilistic interpretation of an alignment scoring system. In: Proceedings of the Conference on Intelligent Systems for Molecular Biology, pp. 44–51 (1996)
  3. Cai, Y.D., Chou, K.C.: Artificial neural network model for predicting HIV protease cleavage sites in protein. Advances in Engineering Software 29(2), 119–128 (1998)
  4. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001), http://www.csie.ntu.edu.tw/~cjlin/libsvm
  5. Dayhoff, M., Schwartz, R., Orcutt, B.: A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure 5(suppl. 3), 345–352 (1978)
  6. Duin, R.P.W., Juszczak, P., Paclik, P., Pekalska, E., de Ridder, D., Tax, D.M.J.: PRTools, a Matlab Toolbox for Pattern Recognition. Delft University of Technology (2004)
  7. Guide, M.R.: The MathWorks. Inc., Natick, MA (1998)
  8. Kim, H., Zhang, Y., Heo, Y.S., Oh, H.B., Chen, S.S.: Specificity rule discovery in HIV-1 protease cleavage site analysis. Computational Biology and Chemistry 32(1), 71–78 (2008)
  9. Liao, L., Noble, W.S.: Combining Pairwise Sequence Similarity and Support Vector Machines for Detecting Remote Protein Evolutionary and Structural Relationships. Journal of Computational Biology 10(6), 857–868 (2003)
  10. Needleman, S.B., Wunsch, C.D.: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48(3), 443–453 (1970)
  11. Oommen, B.J., Kashyap, R.L.: A formal theory for optimal and information theoretic syntactic pattern recognition. Pattern Recognition 31(8), 1159–1177 (1998)
  12. Tatusova, T.A., Madden, T.L.: BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiology Letters 174(2), 247–250 (1999)
  13. Thomson, R., Hodgman, T.C., Yang, Z.R., Doyle, A.K.: Characterizing proteolytic cleavage site activity using bio-basis function neural networks. Bioinformatics 19(14), 1741–1747 (2003)
  14. Trudgian, D.C., Yang, Z.R.: Substitution Matrix Optimisation for Peptide Classification. In: Marchiori, E., Moore, J.H., Rajapakse, J.C. (eds.) EvoBIO 2007. LNCS, vol. 4447, pp. 291–300. Springer, Heidelberg (2007)
  15. Zhao, Y., Pinilla, C., Valmori, D., Martin, R., Simon, R.: Application of support vector machines for T-cell epitopes prediction. Bioinformatics 19(15), 1978–1984 (2003)

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Eser Aygün (1)
  • B. John Oommen (2, 3)
  • Zehra Cataltepe (1)
  1. Department of Computer Eng., Istanbul Technical University, Istanbul, Turkey
  2. School of Computer Science, Carleton University, Ottawa, Canada
  3. Adjunct Professor at the University of Agder in Grimstad, Norway
