Classification of Protein Interaction Sentences via Gaussian Processes

  • Tamara Polajnar
  • Simon Rogers
  • Mark Girolami
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5780)


The increase in the availability of protein interaction studies in textual format, coupled with the demand for easier access to the key results, has led to a need for text mining solutions. In the text processing pipeline, classification is a key step for the extraction of small sections of relevant text. Consequently, for the task of locating protein-protein interaction sentences, we examine the use of a classifier that has rarely been applied to text: Gaussian processes (GPs). GPs are a non-parametric probabilistic analogue to the more popular support vector machines (SVMs). We find that GPs outperform the SVM and naïve Bayes classifiers on binary sentence data, whilst showing equivalent performance on abstract and multiclass sentence corpora. In addition, the absence of a margin parameter, which requires costly tuning, together with the principled multiclass extensions enabled by the probabilistic framework, makes GPs an appealing alternative worthy of further adoption.
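As a rough illustration of the idea of classifying sentences with a GP, the sketch below fits a GP to ±1 labels by label regression (least-squares classification) with a linear kernel over bag-of-words counts. This is a simplification chosen for brevity and is not the authors' implementation. The toy sentences, the function names, and the hand-fixed noise value are all hypothetical.

```python
from collections import Counter

# Hypothetical toy data: +1 = interaction sentence, -1 = other sentence.
train = [
    ("protein a binds protein b", 1.0),
    ("protein c interacts with protein d", 1.0),
    ("samples were stored at low temperature", -1.0),
    ("the study was approved by the committee", -1.0),
]
vocab = sorted({w for s, _ in train for w in s.split()})


def bag_of_words(sentence):
    """Word-count vector over the training vocabulary (unknown words are dropped)."""
    counts = Counter(sentence.lower().split())
    return [float(counts[w]) for w in vocab]


def linear_kernel(a, b):
    """Linear kernel: the dot product of two count vectors."""
    return sum(x * y for x, y in zip(a, b))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


NOISE = 0.1  # assumed noise variance, fixed by hand for this sketch
X = [bag_of_words(s) for s, _ in train]
y = [label for _, label in train]
# Gram matrix with the noise variance added on the diagonal.
K = [[linear_kernel(a, b) + (NOISE if i == j else 0.0)
      for j, b in enumerate(X)] for i, a in enumerate(X)]
alpha = solve(K, y)  # alpha = (K + noise * I)^-1 y


def predict(sentence):
    """Return the GP predictive mean and variance of the latent score."""
    x_new = bag_of_words(sentence)
    k_star = [linear_kernel(x, x_new) for x in X]
    mean = sum(a * k for a, k in zip(alpha, k_star))
    v = solve(K, k_star)
    var = linear_kernel(x_new, x_new) - sum(k * vi for k, vi in zip(k_star, v))
    return mean, var


if __name__ == "__main__":
    for s in ("protein x binds protein y", "samples were stored overnight"):
        mean, var = predict(s)
        print(f"{s!r}: mean={mean:.3f}, variance={var:.3f}")
```

The sign of the predictive mean gives the class, and the predictive variance indicates how uncertain the model is about that score. The sketch has no margin parameter to tune; a fuller treatment would use a classification likelihood and learn the kernel and noise settings from the data instead of fixing them.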


Keywords: Support Vector Machine, Gaussian Process, Mean Average Precision, Machine Learn Research, Name Entity



Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Tamara Polajnar (1)
  • Simon Rogers (1)
  • Mark Girolami (1)
  1. University of Glasgow, Glasgow, Scotland
