Extracting and Normalizing Gene/Protein Mentions with the Flexible and Trainable Moara Java Library

  • Conference paper
Linking Literature, Information, and Knowledge for Biology

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 6004)

Abstract

Gene/protein recognition and normalization are important prerequisite steps for many biological text mining tasks. Although great effort has been dedicated to these problems and effective solutions have been reported, easily integrated tools that perform these tasks are still scarce. We therefore propose Moara, a Java library that implements the gene/protein recognition and normalization steps using machine learning approaches. The system can be trained with additional documents for the recognition procedure, and new organisms can be added to the normalization step. The novelty of the methodology used in Moara lies in the design of a system that is not tailored to a specific organism and therefore does not need any organism-dependent tuning of its algorithms or of the dictionaries it uses. Moara can be used either as a standalone application or incorporated into a text mining system, and it is available at http://moara.dacya.ucm.es

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Neves, M.L., Carazo, J.M., Pascual-Montano, A. (2010). Extracting and Normalizing Gene/Protein Mentions with the Flexible and Trainable Moara Java Library. In: Blaschke, C., Shatkay, H. (eds) Linking Literature, Information, and Knowledge for Biology. Lecture Notes in Computer Science, vol 6004. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13131-8_9

  • DOI: https://doi.org/10.1007/978-3-642-13131-8_9

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-13130-1

  • Online ISBN: 978-3-642-13131-8

  • eBook Packages: Computer Science, Computer Science (R0)
