Learning from Positive and Unlabeled Documents for Retrieval of Bacterial Protein-Protein Interaction Literature

Liu, Hongfang; Torii, Manabu; Xu, Guixian; Hu, Zhangzhi; Goll, Johannes

doi:10.1007/978-3-642-13131-8_8

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 6004)

Abstract

With the advance of high-throughput genomics and proteomics technologies, it has become critical to mine and curate protein-protein interaction (PPI) networks from the biological research literature. Several PPI knowledge bases have been curated by domain experts, but they are far from comprehensive. Observing that PPI-relevant documents can be obtained from PPI knowledge bases that record literature evidence, and that a large number of unlabeled documents (mostly negative) are freely available, we investigated learning from positive and unlabeled data (LPU) and developed an automated system for retrieving PPI-relevant articles, with the aim of assisting the curation of a bacterial PPI knowledge base, MPIDB. We used two approaches to obtaining unlabeled documents: one based on a PubMed MeSH term search and the other based on an existing knowledge base, UniProtKB. We found that unlabeled documents obtained from UniProtKB tend to yield better document classifiers for PPI curation purposes. Our study shows that LPU is a feasible setting for developing an automated system to retrieve PPI-relevant articles, with no requirement for extra annotation effort. The choice of machine learning algorithm and of unlabeled documents is critical in constructing an effective LPU-based system.
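
The abstract does not state which classifier or features the system uses, so the following is only a minimal sketch of the LPU setting, not the authors' method. It assumes one common baseline: treat every unlabeled document as a provisional negative, fit a multinomial Naive Bayes model on whitespace tokens, and rank candidate articles by the log-odds of the positive class. The function names (`train_pu_naive_bayes`, `relevance_score`, `rank_documents`) and the toy documents are hypothetical and chosen purely for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an illustrative simplification)."""
    return text.lower().split()


def train_pu_naive_bayes(positive_docs, unlabeled_docs, alpha=1.0):
    """Fit multinomial Naive Bayes, treating unlabeled docs as negatives."""
    groups = {"pos": positive_docs, "unl": unlabeled_docs}
    counts = {label: Counter() for label in groups}
    for label, docs in groups.items():
        for doc in docs:
            counts[label].update(tokenize(doc))
    vocab = set(counts["pos"]) | set(counts["unl"])
    n_docs = len(positive_docs) + len(unlabeled_docs)
    model = {"vocab": vocab, "log_prior": {}, "log_lik": {}}
    for label, docs in groups.items():
        model["log_prior"][label] = math.log(len(docs) / n_docs)
        # Laplace smoothing so unseen words never get zero probability.
        total = sum(counts[label].values()) + alpha * len(vocab)
        model["log_lik"][label] = {
            w: math.log((counts[label][w] + alpha) / total) for w in vocab
        }
    return model


def relevance_score(model, doc):
    """Log-odds of 'positive' versus 'unlabeled'; higher means more PPI-like."""
    score = model["log_prior"]["pos"] - model["log_prior"]["unl"]
    for word in tokenize(doc):
        if word in model["vocab"]:  # words never seen in training are ignored
            score += model["log_lik"]["pos"][word] - model["log_lik"]["unl"][word]
    return score


def rank_documents(model, docs):
    """Return candidate documents sorted from most to least PPI-like."""
    return sorted(docs, key=lambda d: relevance_score(model, d), reverse=True)


# Toy data: positives stand in for articles cited by a PPI knowledge base,
# unlabeled documents stand in for articles gathered by a broad search.
positives = [
    "flagellar protein interacts with chaperone protein",
    "two-hybrid screen detects binding interaction between proteins",
]
unlabeled = [
    "complete genome sequence of a bacterial strain",
    "antibiotic resistance gene expression in biofilm",
    "genome annotation of a soil bacterial isolate",
]
model = train_pu_naive_bayes(positives, unlabeled)
candidates = [
    "genome sequence of a novel strain",
    "binding interaction between chaperone and protein",
]
for doc in rank_documents(model, candidates):
    print(round(relevance_score(model, doc), 2), doc)
```

Because some unlabeled documents are in fact positive, this baseline tends to underestimate relevance scores; it is best read as a ranking tool for curators rather than a calibrated classifier. How the unlabeled pool is assembled changes what the model learns to treat as "negative", which is consistent with the abstract's observation that the source of unlabeled documents matters.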

Author information

Authors and Affiliations

Department of Biostatistics, Bioinformatics, and Biomathematics,
Hongfang Liu & Guixian Xu
Imaging Science and Information Systems Center,
Manabu Torii
Department of Oncology, Georgetown University Medical Center, Washington DC
Zhangzhi Hu
School of Computer Science and Technology, Beijing Institute of Technology,
Guixian Xu
The J. Craig Venter Institute, Rockville, Maryland
Johannes Goll

Editor information

Editors and Affiliations

Bioalma, C/Ronda de Poniente, 4, 2-C, 28760, Tres Cantos, Madrid, Spain
Christian Blaschke
Computational Biology and Machine Learning Lab, School of Computing, Queen’s University, K7L 3N6, Kingston, ON, Canada
Hagit Shatkay

Cite this paper

Liu, H., Torii, M., Xu, G., Hu, Z., Goll, J. (2010). Learning from Positive and Unlabeled Documents for Retrieval of Bacterial Protein-Protein Interaction Literature. In: Blaschke, C., Shatkay, H. (eds) Linking Literature, Information, and Knowledge for Biology. Lecture Notes in Computer Science, vol 6004. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13131-8_8

DOI: https://doi.org/10.1007/978-3-642-13131-8_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13130-1
Online ISBN: 978-3-642-13131-8
eBook Packages: Computer Science, Computer Science (R0)
