Retrieval and Discovery of Cell Cycle Literature and Proteins by Means of Machine Learning, Text Mining and Network Analysis

  • Conference paper
8th International Conference on Practical Applications of Computational Biology & Bioinformatics (PACBB 2014)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 294)

Abstract

The cell cycle is one of the most important biological processes and is studied intensely by both experimental and bioinformatics means. A considerable amount of literature provides relevant descriptions of proteins involved in this complex process. These proteins are often key to understanding cellular alterations encountered in pathological conditions such as abnormal cell growth. The authors explored the use of text mining strategies to improve the retrieval of relevant articles and individual sentences for this topic. Moreover, information extraction and text mining were used to automatically detect and rank Arabidopsis proteins important for the cell cycle. The results were evaluated using independent data collections and compared to keyword-based strategies. They indicate that machine learning methods can improve sensitivity compared to term co-occurrence, although with considerable differences between abstracts and full-text articles as input. At the level of document triage, recall for abstracts ranges from around 16% for keyword indexing, through 37% for a sentence SVM classifier, to 57% for an SVM abstract classifier. For full-text data, keyword and cell cycle phrase indexing obtained a recall of 42% and 55% respectively, compared to 94% reached by a sentence classifier. For cell cycle protein detection, the cell cycle keyword-protein co-occurrence strategy had a recall of 52% for abstracts and 70% for full text, while a protein-mentioning sentence classifier obtained a recall of over 83% for abstracts and 79% for full text. The generated cell cycle term co-occurrence statistics and SVM confidence scores for each protein were explored to rank proteins and filter a protein network in order to derive a topic-specific subnetwork.
All the generated protein cell cycle scores together with a global protein interaction and gene regulation network for Arabidopsis are available at: http://zope.bioinfo.cnio.es/cellcyle_addmaterial.
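
To make the keyword-protein co-occurrence baseline and the recall figures above more concrete, the following is a minimal sketch and not the authors' implementation. The term list, the placeholder protein names (`PROT_A`, `PROT_B`, `PROT_C`), the toy sentences, and the helper functions `cooccurrence_scores`, `recall`, and `filter_network` are all assumptions introduced for illustration. Plain substring matching stands in for the entity recognition and indexing steps a real system would need.

```python
from collections import defaultdict

# Hypothetical cell cycle vocabulary; the paper's actual term list is not given here.
CELL_CYCLE_TERMS = {"cell cycle", "mitosis", "cytokinesis"}


def cooccurrence_scores(sentences, proteins, terms=CELL_CYCLE_TERMS):
    """Count, per protein, the sentences that mention it together with a term."""
    scores = defaultdict(int)
    for sentence in sentences:
        text = sentence.lower()
        # Skip sentences that contain no cell cycle term at all.
        if not any(term in text for term in terms):
            continue
        for protein in proteins:
            if protein.lower() in text:
                scores[protein] += 1
    return dict(scores)


def recall(predicted, gold):
    """Fraction of gold-standard items that were retrieved."""
    gold = set(gold)
    if not gold:
        return 0.0
    return len(set(predicted) & gold) / len(gold)


def filter_network(edges, scores, min_score=1):
    """Keep only edges whose two endpoints both reach the score threshold."""
    return [
        (a, b)
        for a, b in edges
        if scores.get(a, 0) >= min_score and scores.get(b, 0) >= min_score
    ]


# Toy input data, invented purely for illustration.
sentences = [
    "PROT_A is discussed together with the cell cycle.",
    "PROT_B and PROT_A are both mentioned in a sentence about mitosis.",
    "PROT_C appears in a sentence about an unrelated topic.",
]
proteins = ["PROT_A", "PROT_B", "PROT_C"]
gold_standard = {"PROT_A", "PROT_C"}  # hypothetical curated cell cycle proteins
edges = [("PROT_A", "PROT_B"), ("PROT_B", "PROT_C")]  # hypothetical interactions

scores = cooccurrence_scores(sentences, proteins)
ranking = sorted(scores, key=scores.get, reverse=True)
protein_recall = recall(scores, gold_standard)
subnetwork = filter_network(edges, scores)

print(ranking, protein_recall, subnetwork)
```

In this toy setting `PROT_C` is missed because it never shares a sentence with a cell cycle term, which is the kind of gap that lowers the recall of a co-occurrence baseline. A trained sentence classifier, as described in the abstract, would replace the `any(term in text ...)` test with a learned decision.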




Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Krallinger, M., Leitner, F., Valencia, A. (2014). Retrieval and Discovery of Cell Cycle Literature and Proteins by Means of Machine Learning, Text Mining and Network Analysis. In: Saez-Rodriguez, J., Rocha, M., Fdez-Riverola, F., De Paz Santana, J. (eds) 8th International Conference on Practical Applications of Computational Biology & Bioinformatics (PACBB 2014). Advances in Intelligent Systems and Computing, vol 294. Springer, Cham. https://doi.org/10.1007/978-3-319-07581-5_34

  • DOI: https://doi.org/10.1007/978-3-319-07581-5_34

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-07580-8

  • Online ISBN: 978-3-319-07581-5

  • eBook Packages: Engineering, Engineering (R0)
