Extracting and Normalizing Gene/Protein Mentions with the Flexible and Trainable Moara Java Library

Neves, Mariana L.; Carazo, José Maria; Pascual-Montano, Alberto

doi:10.1007/978-3-642-13131-8_9

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 6004)

Abstract

Gene/protein recognition and normalization are important prerequisite steps for many biological text mining tasks. Although great effort has been dedicated to these problems and effective solutions have been reported, easily integrated tools that perform these tasks are still scarce. We therefore propose Moara, a Java library that implements gene/protein recognition and normalization steps based on machine learning approaches. The system may be trained with extra documents for the recognition procedure, and new organisms may be added in the normalization step. The novelty of the methodology used in Moara lies in the design of a system that is not tailored to a specific organism and therefore needs no organism-dependent tuning of its algorithms or of the dictionaries it uses. Moara can be used either as a standalone application or incorporated into a text mining system, and it is available at http://moara.dacya.ucm.es

Author information

Authors and Affiliations

Biocomputing Unit, Centro Nacional de Biotecnología – CSIC, C/ Darwin 3, Campus de Cantoblanco, 28049, Madrid, Spain
Mariana L. Neves, José Maria Carazo & Alberto Pascual-Montano

Editor information

Editors and Affiliations

Bioalma, C/Ronda de Poniente, 4, 2-C, 28760, Tres Cantos, Madrid, Spain
Christian Blaschke
Computational Biology and Machine Learning Lab, School of Computing, Queen’s University, K7L 3N6, Kingston, ON, Canada
Hagit Shatkay

About this paper

Cite this paper

Neves, M.L., Carazo, J.M., Pascual-Montano, A. (2010). Extracting and Normalizing Gene/Protein Mentions with the Flexible and Trainable Moara Java Library. In: Blaschke, C., Shatkay, H. (eds) Linking Literature, Information, and Knowledge for Biology. Lecture Notes in Computer Science, vol 6004. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13131-8_9

DOI: https://doi.org/10.1007/978-3-642-13131-8_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13130-1
Online ISBN: 978-3-642-13131-8
eBook Packages: Computer Science, Computer Science (R0)
