Retrieval and Discovery of Cell Cycle Literature and Proteins by Means of Machine Learning, Text Mining and Network Analysis

Krallinger, Martin; Leitner, Florian; Valencia, Alfonso

doi:10.1007/978-3-319-07581-5_34

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 294)

Abstract

The cell cycle is one of the most important biological processes and is studied intensely by both experimental and bioinformatics means. A considerable body of literature describes the proteins involved in this complex process. These proteins are often key to understanding cellular alterations encountered in pathological conditions such as abnormal cell growth. The authors explored the use of text mining strategies to improve the retrieval of relevant articles and individual sentences for this topic. Moreover, information extraction and text mining were used to automatically detect and rank Arabidopsis proteins important for the cell cycle. The results were evaluated using independent data collections and compared to keyword-based strategies. They indicate that machine learning methods can improve sensitivity compared to term co-occurrence, although with considerable differences between abstracts and full-text articles as input. At the level of document triage, recall for abstracts ranges from around 16% for keyword indexing and 37% for a sentence SVM classifier to 57% for an SVM abstract classifier. For full-text data, keyword and cell cycle phrase indexing obtained a recall of 42% and 55%, respectively, compared to 94% reached by a sentence classifier. For cell cycle protein detection, the cell cycle keyword-protein co-occurrence strategy had a recall of 52% for abstracts and 70% for full text, while a protein-mentioning sentence classifier obtained a recall of over 83% for abstracts and 79% for full text. The generated cell cycle term co-occurrence statistics and SVM confidence scores for each protein were explored to rank proteins and to filter a protein network in order to derive a topic-specific subnetwork.
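To make the document triage comparison concrete, the following is a minimal sketch of keyword indexing and of how recall is computed against a set of relevant documents. The keyword list, the toy documents, and the function names are assumptions made for illustration; they are not the chapter's actual term list, data, or implementation.

```python
# Hypothetical keyword list; the chapter's actual cell cycle terms are not given here.
CELL_CYCLE_TERMS = ("cell cycle", "mitosis", "cyclin")


def keyword_hit(text, terms=CELL_CYCLE_TERMS):
    """Return True if any term occurs in the text (case-insensitive substring match)."""
    lowered = text.lower()
    return any(term in lowered for term in terms)


def recall(predicted, relevant):
    """Fraction of the relevant items that were retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(predicted) & relevant) / len(relevant)


# Toy corpus (invented for illustration): document id -> text.
docs = {
    "d1": "CYCB1;1 expression peaks during mitosis in root tips.",
    "d2": "KRP2 inhibits CDKA;1 and delays the G1/S transition.",
    "d3": "Cyclin D3 levels respond to sucrose availability.",
    "d4": "Leaf wax composition varies with humidity.",
}
relevant_docs = {"d1", "d2", "d3"}

# Keyword indexing retrieves every document that mentions at least one term.
retrieved = {doc_id for doc_id, text in docs.items() if keyword_hit(text)}
print(sorted(retrieved), round(recall(retrieved, relevant_docs), 2))
```

In this toy corpus, document d2 is relevant but contains none of the listed terms, so keyword indexing misses it. This is the kind of recall gap that a trained sentence or abstract classifier is meant to narrow.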
All the generated protein cell cycle scores together with a global protein interaction and gene regulation network for Arabidopsis are available at: http://zope.bioinfo.cnio.es/cellcyle_addmaterial.
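As a rough illustration of how per-protein scores could be used to rank proteins and derive a topic-specific subnetwork, the sketch below counts keyword-protein co-occurrences at the sentence level and keeps only network edges whose two endpoints reach a score threshold. The sentences, protein tags, edges, threshold, and function names are all hypothetical; the chapter's own scoring (including SVM confidence scores) and filtering rules may differ.

```python
from collections import Counter

# Hypothetical keyword list; protein mentions are assumed to have been
# tagged beforehand by some named entity recognition step.
CELL_CYCLE_TERMS = ("cell cycle", "mitosis", "g1/s")

# Toy data (invented): (sentence text, proteins mentioned in it).
sentences = [
    ("CYCB1;1 accumulates at the onset of mitosis.", ["CYCB1;1"]),
    ("KRP2 binds CDKA;1 to block the G1/S transition.", ["KRP2", "CDKA;1"]),
    ("CDKA;1 is a core cell cycle kinase.", ["CDKA;1"]),
    ("PHYB mediates red light responses.", ["PHYB"]),
]


def cooccurrence_scores(tagged_sentences, terms=CELL_CYCLE_TERMS):
    """Count, per protein, the sentences that also mention a topic term."""
    counts = Counter()
    for text, proteins in tagged_sentences:
        lowered = text.lower()
        if any(term in lowered for term in terms):
            counts.update(set(proteins))
    return counts


def filter_network(edges, scores, min_score=1):
    """Keep edges whose two endpoints both reach the score threshold."""
    return [
        (a, b)
        for a, b in edges
        if scores.get(a, 0) >= min_score and scores.get(b, 0) >= min_score
    ]


scores = cooccurrence_scores(sentences)
# Rank by descending score, breaking ties alphabetically.
ranking = sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Toy interaction network (invented edges).
edges = [("KRP2", "CDKA;1"), ("CDKA;1", "CYCB1;1"), ("PHYB", "CDKA;1")]
subnetwork = filter_network(edges, scores)
print(ranking)
print(subnetwork)
```

Requiring both endpoints to pass the threshold is one simple design choice; a looser filter could keep any edge with at least one high-scoring endpoint, at the cost of a less topic-specific subnetwork.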

Author information

Authors and Affiliations

Structural Biology and Biocomputing Programme, Spanish National Cancer Research Centre (CNIO), C/ Melchor Fernández Almagro, 3, 28029 Madrid, Spain
Martin Krallinger, Florian Leitner & Alfonso Valencia

Editor information

Editors and Affiliations

EMBL Outstation - Hinxton, European Bioinformatics Institute, Hinxton, United Kingdom
Julio Saez-Rodriguez
Department of Informatics, University of Minho, Braga, Portugal
Miguel P. Rocha
Department of Informatics Campus Universitario As Lagoas s/n, University of Vigo, Ourense, Spain
Florentino Fdez-Riverola
Department of Computing Science, University of Salamanca, Salamanca, Spain
Juan F. De Paz Santana

About this paper

Cite this paper

Krallinger, M., Leitner, F., Valencia, A. (2014). Retrieval and Discovery of Cell Cycle Literature and Proteins by Means of Machine Learning, Text Mining and Network Analysis. In: Saez-Rodriguez, J., Rocha, M., Fdez-Riverola, F., De Paz Santana, J. (eds) 8th International Conference on Practical Applications of Computational Biology & Bioinformatics (PACBB 2014). Advances in Intelligent Systems and Computing, vol 294. Springer, Cham. https://doi.org/10.1007/978-3-319-07581-5_34

DOI: https://doi.org/10.1007/978-3-319-07581-5_34
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-07580-8
Online ISBN: 978-3-319-07581-5
eBook Packages: Engineering, Engineering (R0)
