Semi-supervised Prediction of Protein Interaction Sentences Exploiting Semantically Encoded Metrics

  • Tamara Polajnar
  • Mark Girolami
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5780)


Protein-protein interaction (PPI) identification is an integral component of many biomedical research and database curation tools. Automation of this task through classification is one of the key goals of text mining (TM). However, labelled PPI corpora required to train classifiers are generally small. In order to overcome this sparsity in the training data, we propose a novel method of integrating corpora that do not contain relevance judgements. Our approach uses a semantic language model to gather word similarity from a large unlabelled corpus. This additional information is integrated into the sentence classification process using kernel transformations and has a re-weighting effect on the training features that leads to an 8% improvement in F-score over the baseline results. Furthermore, we discover that some words which are generally considered indicative of interactions are actually neutralised by this process.
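The abstract does not give the details of the semantic model or the kernel transformation, so the following is only a minimal sketch of the general idea under stated assumptions: word similarity is estimated from window-based co-occurrence counts in an unlabelled corpus, compared by cosine, and then folded into a bag-of-words sentence kernel as k(x, y) = Σ_a Σ_b x_a S_ab y_b. All function names, the window size, and the toy sentences are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
import math


def cooccurrence_vectors(sentences, window=2):
    """For each word, count the words seen within `window` positions of it."""
    vectors = defaultdict(lambda: defaultdict(float))
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1.0
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def word_similarity(vectors, vocab):
    """Similarity table S over `vocab`; every word is fully similar to itself."""
    return {
        (a, b): 1.0 if a == b else cosine(vectors.get(a, {}), vectors.get(b, {}))
        for a in vocab
        for b in vocab
    }


def semantic_kernel(x, y, sim):
    """k(x, y) = sum over word pairs of x[a] * S[a, b] * y[b]."""
    return sum(
        xa * sim.get((a, b), 0.0) * yb
        for a, xa in x.items()
        for b, yb in y.items()
    )


# Toy unlabelled corpus (assumption): two verbs appear in identical contexts.
unlabelled = [
    ["proteinA", "binds", "proteinB"],
    ["proteinA", "interacts", "proteinB"],
]
sim = word_similarity(cooccurrence_vectors(unlabelled), ["binds", "interacts"])

# Two labelled-sentence feature vectors that share no words.
x = {"binds": 1.0}
y = {"interacts": 1.0}
linear = sum(x[w] * y[w] for w in x if w in y)   # plain bag-of-words kernel
semantic = semantic_kernel(x, y, sim)            # similarity-weighted kernel
```

In this toy case the plain bag-of-words kernel sees no overlap between the two sentences, while the similarity-weighted kernel does, because the unlabelled text places "binds" and "interacts" in the same contexts. This is the kind of re-weighting of training features that the abstract describes, although the paper's actual model may differ.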


Keywords: Radial Basis Function; Target Word; Text Mining; Semantic Model; Unlabelled Data
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Tamara Polajnar (1)
  • Mark Girolami (1)
  1. University of Glasgow, Glasgow, Scotland
