Classification of Protein Interaction Sentences via Gaussian Processes

Polajnar, Tamara; Rogers, Simon; Girolami, Mark

doi:10.1007/978-3-642-04031-3_25

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 5780)

Abstract

The increase in the availability of protein interaction studies in textual format, coupled with the demand for easier access to the key results, has led to a need for text mining solutions. In the text processing pipeline, classification is a key step for the extraction of small sections of relevant text. Consequently, for the task of locating protein-protein interaction sentences, we examine the use of a classifier that has rarely been applied to text: Gaussian processes (GPs). GPs are a non-parametric probabilistic analogue of the more popular support vector machines (SVMs). We find that GPs outperform the SVM and naïve Bayes classifiers on binary sentence data, whilst showing equivalent performance on abstract and multiclass sentence corpora. In addition, the absence of a margin parameter, which requires costly tuning, together with the principled multiclass extensions enabled by the probabilistic framework, makes GPs an appealing alternative worthy of further adoption.
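
To make the idea of a probabilistic kernel classifier for sentences more concrete, the following is a minimal sketch only, and not the method used in the paper. It assumes a bag-of-words representation, a linear kernel, and the simple "label regression" approximation (GP regression on ±1 labels, with the predictive mean and variance squashed through a probit link). The toy sentences, the noise value, and all function names are hypothetical and chosen purely for illustration.

```python
import math

def bag_of_words(sentence, vocab):
    """Count how often each vocabulary word occurs in the sentence."""
    tokens = sentence.lower().split()
    return [float(tokens.count(w)) for w in vocab]

def linear_kernel(a, b):
    """Inner product of two feature vectors."""
    return sum(x * y for x, y in zip(a, b))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def gp_predict_proba(train_X, train_y, test_x, noise=1.0):
    """Approximate P(y = +1 | test_x) using GP regression on +/-1 labels.

    This is the label-regression shortcut, an assumption made for brevity;
    it is not the inference scheme of the paper.
    """
    n = len(train_X)
    K = [[linear_kernel(train_X[i], train_X[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [linear_kernel(x, test_x) for x in train_X]
    alpha = solve(K, train_y)                      # (K + noise*I)^-1 y
    mean = sum(a * k for a, k in zip(alpha, k_star))
    v = solve(K, k_star)                           # (K + noise*I)^-1 k_star
    var = linear_kernel(test_x, test_x) - sum(k * w for k, w in zip(k_star, v))
    var = max(var, 0.0)                            # guard against round-off
    z = mean / math.sqrt(1.0 + var)                # probit link with variance
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical toy corpus: +1 = interaction sentence, -1 = other sentence.
sentences = [
    ("protein a binds protein b", 1.0),
    ("protein c interacts with protein d", 1.0),
    ("protein a was purified", -1.0),
    ("cells were grown overnight", -1.0),
]
vocab = sorted({w for s, _ in sentences for w in s.lower().split()})
train_X = [bag_of_words(s, vocab) for s, _ in sentences]
train_y = [y for _, y in sentences]

p_interaction = gp_predict_proba(
    train_X, train_y, bag_of_words("protein e binds protein f", vocab))
p_other = gp_predict_proba(
    train_X, train_y, bag_of_words("cells were purified", vocab))
print(p_interaction, p_other)
```

On this toy data the first probability comes out above 0.5 and the second below it. The point of the sketch is that the classifier returns a probability rather than a hard margin decision, and that no margin parameter has to be tuned; the kernel and noise level still have to be chosen.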

Author information

Authors and Affiliations

University of Glasgow, Glasgow, Scotland, G12 8QQ
Tamara Polajnar, Simon Rogers & Mark Girolami

Editor information

Editors and Affiliations

Department of Automatic Control and Systems Engineering, University of Sheffield, Mappin Street, S1 3JD, Sheffield, UK
Visakan Kadirkamanathan
Department of Computer Science and Department of Chemical and Process Engineering, University of Sheffield, Mappin Street, S1 3JD, Sheffield, UK
Guido Sanguinetti
University of Glasgow, Department of Computing Science, Sir Alwyn Williams Building, Lilybank Gardens, Glasgow, G12 8QQ, UK, and, University of Glasgow, Department of Statistics, 14 University Gardens, Glasgow, G12 8QQ, UK
Mark Girolami
School of Electronics and Computer Science, University of Southampton, SO17 1BJ, Southampton, UK
Mahesan Niranjan
Department of Chemical and Process Engineering, University of Sheffield, Mappin Street, S1 3JD, Sheffield, UK
Josselin Noirel

About this paper

Cite this paper

Polajnar, T., Rogers, S., Girolami, M. (2009). Classification of Protein Interaction Sentences via Gaussian Processes. In: Kadirkamanathan, V., Sanguinetti, G., Girolami, M., Niranjan, M., Noirel, J. (eds) Pattern Recognition in Bioinformatics. PRIB 2009. Lecture Notes in Computer Science, vol 5780. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04031-3_25

DOI: https://doi.org/10.1007/978-3-642-04031-3_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04030-6
Online ISBN: 978-3-642-04031-3
eBook Packages: Computer Science, Computer Science (R0)

Societies and partnerships

The International Association for Pattern Recognition