Learning from Positive and Unlabeled Documents for Retrieval of Bacterial Protein-Protein Interaction Literature

  • Conference paper
Linking Literature, Information, and Knowledge for Biology

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 6004)

Abstract

With the advance of high-throughput genomics and proteomics technologies, it has become critical to mine and curate protein-protein interaction (PPI) networks from the biological research literature. Several PPI knowledge bases have been curated by domain experts, but they are far from comprehensive. Observing that PPI-relevant documents can be obtained from PPI knowledge bases that record literature evidence, and that a large number of unlabeled documents (mostly negative) are freely available, we investigated learning from positive and unlabeled data (LPU) and developed an automated system for retrieving PPI-relevant articles, with the aim of assisting the curation of a bacterial PPI knowledge base, MPIDB. Two approaches to obtaining unlabeled documents were used: one based on PubMed MeSH term search and the other based on an existing knowledge base, UniProtKB. We found that unlabeled documents obtained from UniProtKB tend to yield better document classifiers for PPI curation purposes. Our study shows that LPU is a viable approach to developing an automated system for retrieving PPI-relevant articles, one that requires no extra annotation effort. The choice of machine learning algorithm and of unlabeled documents would be critical in constructing an effective LPU-based system.
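
The LPU setting described in the abstract can be illustrated with a minimal sketch. It assumes the simplest possible baseline, in which unlabeled documents are treated as provisional negatives and a multinomial Naive Bayes model ranks new documents by their log-odds of being PPI-relevant. This is not the authors' system: the abstract does not specify the algorithms or features used, and the toy documents and the function names `train_nb` and `score` are hypothetical.

```python
import math
from collections import Counter


def train_nb(positive_docs, unlabeled_docs):
    """Fit multinomial Naive Bayes, treating unlabeled docs as provisional negatives."""
    classes = {"pos": positive_docs, "unl": unlabeled_docs}
    vocab = {w for docs in classes.values() for d in docs for w in d.split()}
    total_docs = len(positive_docs) + len(unlabeled_docs)
    model = {"vocab": vocab, "prior": {}, "counts": {}, "totals": {}}
    for label, docs in classes.items():
        counts = Counter(w for d in docs for w in d.split())
        model["prior"][label] = math.log(len(docs) / total_docs)
        model["counts"][label] = counts
        model["totals"][label] = sum(counts.values())
    return model


def score(model, doc):
    """Return the log-odds of 'positive' vs 'unlabeled'; higher means more PPI-like."""
    v = len(model["vocab"])
    log_odds = model["prior"]["pos"] - model["prior"]["unl"]
    for w in doc.split():
        if w not in model["vocab"]:
            continue  # words never seen in training carry no evidence
        # Laplace (add-one) smoothing keeps every probability non-zero.
        p_pos = (model["counts"]["pos"][w] + 1) / (model["totals"]["pos"] + v)
        p_unl = (model["counts"]["unl"][w] + 1) / (model["totals"]["unl"] + v)
        log_odds += math.log(p_pos / p_unl)
    return log_odds


# Hypothetical toy data: positives stand in for articles cited by a PPI
# knowledge base; the unlabeled set is mostly, but not entirely, negative.
positive = [
    "protein a binds protein b",
    "two-hybrid interaction between proteins",
    "complex formation interaction assay",
]
unlabeled = [
    "genome sequence of strain",
    "antibiotic resistance in strain",
    "protein interaction observed in strain",
    "gene expression under stress",
]

model = train_nb(positive, unlabeled)
candidates = [
    "genome sequence and gene expression",
    "interaction between protein a and protein b",
]
# Rank candidate articles so that curators see the most PPI-like ones first.
ranked = sorted(candidates, key=lambda d: score(model, d), reverse=True)
```

Because the unlabeled set contains hidden positives, scores from such a baseline are better read as a ranking for curators than as calibrated probabilities, which is one reason the choice of unlabeled documents matters.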

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Liu, H., Torii, M., Xu, G., Hu, Z., Goll, J. (2010). Learning from Positive and Unlabeled Documents for Retrieval of Bacterial Protein-Protein Interaction Literature. In: Blaschke, C., Shatkay, H. (eds) Linking Literature, Information, and Knowledge for Biology. Lecture Notes in Computer Science, vol 6004. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13131-8_8

  • DOI: https://doi.org/10.1007/978-3-642-13131-8_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-13130-1

  • Online ISBN: 978-3-642-13131-8

  • eBook Packages: Computer Science, Computer Science (R0)