Abstract
The growing availability of protein interaction studies in textual format, coupled with the demand for easier access to their key results, has led to a need for text mining solutions. In the text processing pipeline, classification is a key step in extracting small sections of relevant text. Consequently, for the task of locating protein-protein interaction sentences, we examine a classifier that has rarely been applied to text: Gaussian processes (GPs). GPs are a non-parametric probabilistic analogue of the more popular support vector machines (SVMs). We find that GPs outperform the SVM and naïve Bayes classifiers on binary sentence data, whilst showing equivalent performance on abstract and multiclass sentence corpora. In addition, the absence of a margin parameter, which requires costly tuning, together with the principled multiclass extensions enabled by the probabilistic framework, makes GPs an appealing alternative worthy of further adoption.
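To make the idea of a probabilistic kernel classifier for sentences concrete, the following is a minimal sketch and not the authors' implementation. It assumes a bag-of-words representation with a linear kernel, and it swaps in the simple label-regression (least-squares) approximation to GP classification, with the predictive mean and variance squashed through a probit function. The toy sentences, the `NOISE` value, and all function names are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Toy, made-up sentences for illustration only (not taken from any corpus).
TRAIN = [
    ("protein a binds protein b", 1.0),                # interaction
    ("kinase x interacts with receptor y", 1.0),       # interaction
    ("samples were stored at low temperature", -1.0),  # no interaction
    ("the study was approved by the committee", -1.0), # no interaction
]
NOISE = 1.0  # assumed noise variance; a hyperparameter of this sketch

VOCAB = sorted({w for s, _ in TRAIN for w in s.split()})


def bag_of_words(sentence):
    """Word-count vector over the training vocabulary (unknown words are dropped)."""
    counts = Counter(sentence.lower().split())
    return [float(counts[w]) for w in VOCAB]


def linear_kernel(x, y):
    """Inner product of two count vectors."""
    return sum(a * b for a, b in zip(x, y))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


X = [bag_of_words(s) for s, _ in TRAIN]
Y = [y for _, y in TRAIN]
# Kernel matrix with the noise variance added on the diagonal.
K = [[linear_kernel(xi, xj) + (NOISE if i == j else 0.0)
      for j, xj in enumerate(X)] for i, xi in enumerate(X)]
ALPHA = solve(K, Y)


def predict_proba(sentence):
    """Approximate probability that a sentence belongs to the positive class."""
    x = bag_of_words(sentence)
    k_star = [linear_kernel(x, xi) for xi in X]
    mean = sum(k * a for k, a in zip(k_star, ALPHA))
    v = solve(K, k_star)
    var = max(linear_kernel(x, x) - sum(k * vi for k, vi in zip(k_star, v)), 0.0)
    z = mean / math.sqrt(1.0 + var)  # probit squashing of the latent prediction
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


p_pos = predict_proba("receptor y binds kinase x")
p_neg = predict_proba("samples were approved by the committee")
print(round(p_pos, 3), round(p_neg, 3))
```

Unlike a hard-margin decision, the output is a class probability that shrinks towards 0.5 as predictive variance grows. A full treatment would replace the label-regression step with an approximate inference scheme for a non-Gaussian likelihood, and would learn the hyperparameters rather than fixing them.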
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Polajnar, T., Rogers, S., Girolami, M. (2009). Classification of Protein Interaction Sentences via Gaussian Processes. In: Kadirkamanathan, V., Sanguinetti, G., Girolami, M., Niranjan, M., Noirel, J. (eds) Pattern Recognition in Bioinformatics. PRIB 2009. Lecture Notes in Computer Science(), vol 5780. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04031-3_25
DOI: https://doi.org/10.1007/978-3-642-04031-3_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04030-6
Online ISBN: 978-3-642-04031-3
eBook Packages: Computer Science, Computer Science (R0)