Machine Learning for Information Extraction in Genomics — State of the Art and Perspectives

  • Conference paper
Text Mining and its Applications

Part of the book series: Studies in Fuzziness and Soft Computing (STUDFUZZ, volume 138)

Abstract

The considerable development of multimedia communication goes along with an exponentially increasing volume of textual information. Information Retrieval (IR) technology provides information at the level of a document collection and is thus unable to answer requests for specific pieces of information. The development of intelligent tools and methods that give access to document content and extract relevant information is more than ever a key issue for knowledge and information management. Information Extraction (IE) is one of the main research fields that attempt to fulfill this need. The IE field was initiated by DARPA’s MUC program (Message Understanding Conference) in 1987 (MUC Proceedings). MUC originally defined IE as the task of (1) extracting specific, well-defined pieces of information from homogeneous sets of textual documents in restricted domains (2) in order to fill the slots of predefined forms or templates. MUC also brought about a new evaluation paradigm: the comparison of machine-extracted information to human-produced results. MUC inspired a large amount of work in IE and has become a major reference in the text-mining field. Even under this restrictive definition, designing an efficient IE system with good recall (coverage) and precision (correctness) rates remains a challenging task.
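
The evaluation paradigm described above, comparing machine-extracted slot fills with human-produced ones, can be illustrated with a minimal sketch. It assumes exact-match scoring over (slot, value) pairs, which is a simplification chosen for illustration and not necessarily the scoring protocol used in MUC; the function name, slot names and values are hypothetical.

```python
def precision_recall(extracted, reference):
    """Score machine-extracted slot fills against human-produced ones.

    Assumption: a fill counts as correct only if the (slot, value) pair
    matches a reference pair exactly.
    """
    extracted, reference = set(extracted), set(reference)
    correct = len(extracted & reference)
    # Precision (correctness): share of extracted fills that are right.
    precision = correct / len(extracted) if extracted else 0.0
    # Recall (coverage): share of reference fills that were found.
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical template for one sentence describing a gene interaction.
reference = {("agent", "GerE"), ("target", "sigK"), ("interaction", "inhibits")}
extracted = {
    ("agent", "GerE"),
    ("target", "sigK"),
    ("interaction", "activates"),  # wrong value for this slot
    ("location", "nucleus"),       # spurious slot fill
}

precision, recall = precision_recall(extracted, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four extracted fills are correct, and two of the three reference fills are recovered, so the two measures penalize spurious and missing fills separately.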

References

  • Adar E. (2002). S-RAD: A Simple and Robust Abbreviation Dictionary. HP Laboratories Technical Report, Sept.

  • Bikel D. M., Miller S., Schwartz R., Weischedel R. (1997). Nymble: a High-Performance Learning Name-finder. Conference on Applied Natural Language Processing.

  • Blaschke C., Andrade M. A., Ouzounis C. and Valencia A. (1999). Automatic Extraction of Biological Information from Scientific Text: Protein-Protein Interactions. Proc. Int’l Symp. Molecular Biology (ISMB’99), AAAI Press, USA pp. 60–67.

  • Borthwick A. (1999). A Maximum Entropy Approach to Named Entity Recognition. Ph.D. thesis, Computer Science Department, New York University.

  • Collier N., Nobata C., Tsujii J. (2000). Extracting the Names of Genes and Gene Products with a Hidden Markov Model. Proceedings of COLING-2000, Saarbrücken.

  • Castaño J., Zhang J., Pustejovsky J. (2002). Anaphora Resolution in Biomedical Literature. International Symposium on Reference Resolution. Alicante, Spain.

  • Chang J. T., Schütze H. and Altman R. B. (2002). Creating an online dictionary of abbreviations from MEDLINE. J. Am. Med. Inform. Assoc. 9(6): 612–620.

  • Chieu H. L., and Ng H. T. (2002). Named Entity Recognition: A Maximum Entropy Approach Using Global Information. Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002). (pp. 190–196). Taiwan.

  • Cohen K. B., Dolbey A. E., Acquaah-Mensah G. K. and Hunter L. (2002). Contrast and variability in gene names. Proceedings of the Workshop on Natural Language Processing in the Biomedical Domain. pp. 14–20.

  • Cowie J., Wilks Y. (2000). Information Extraction. In R. Dale, H. Moisl and H. Somers (eds.) Handbook of Natural Language Processing. New York: Marcel Dekker.

  • Craven M. and Kumlien J. (1999). Constructing Biological Knowledge Bases by Extracting Information from Text Sources. Proc. 7th Int’l Conf. Intelligent Systems for Molecular Biology (ISMB-99), AAAI Press, USA, pp. 77–86, Heidelberg, Germany.

  • Franzen K., Eriksson G., Olsson F., Asker L., Liden P. and Coster J. (2002). Protein names and how to find them. Int J Med Inf. 67(1–3): pp 49–61.

  • Freitag D. (1998). Toward General-Purpose Learning for Information Extraction. Proceedings of COLING-ACL-98.

  • Fukuda K., Tamura A., Tsunoda T., Takagi T. (1998). Toward information extraction: identifying protein names from biological papers. PSB’98. pp 707–18.

  • Gildea D., Jurafsky D. (2002). Automatic Labeling of Semantic Roles. Computational Linguistics, 28(3):245–288.

  • Hanisch D., Fluck J., Mevissen H. T., Zimmer R. (2003). Playing Biology’s Name Game: Identifying Protein Names in Scientific Text. Pacific Symposium on Biocomputing 8:403–414.

  • Hatzivassiloglou V., Duboue P. A. and Rzhetsky A. (2001). Disambiguating proteins, genes, and RNA in text: a machine learning approach. Bioinformatics. 17 Suppl 1: S97–S106.

  • Harris Z., Gottfried M., Ryckman T., Mattick P., Daladier A., Harris T. N., Harris S. (1989). The Form of Information in Science: Analysis of an Immunology Sublanguage, Kluwer Academic Publishers, Dordrecht.

  • Hearst M. A. (1992). Automatic Acquisition of Hyponyms from Large Text Corpora. Proceedings of COLING’92, pp. 539–545.

  • Isozaki H., Kazawa H. (2002). Efficient Support Vector Classifiers for Named Entity Recognition. Proceedings of COLING-2002, pp. 390–396.

  • Hishiki T., Collier N., Nobata C., Ohta T., Ogata N., Sekimizu T., Steiner R., Park H. S., Tsujii J. (1998). Developing NLP tools for Genome Informatics: An Information Extraction Perspective. Genome Informatics. Universal Academy Press Inc., Tokyo, Japan.

  • Hobbs J. R., Appelt D., Bear J., Israel D., Kameyama M., Stickel M., Tyson M. (1997). FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural Language Text. In E. Roche and Y. Schabes (eds.), Finite-State Language Processing, chapter 13, pp. 383–406. MIT Press.

  • Humphreys K., Demetriou G., Gaizauskas R. (2000). Two Applications of Information Extraction to Biological Science Journal Articles: Enzyme Interactions and Protein Structures. PSB’2000, 5:502–513.

  • Kazama J., Makino T., Ohta Y. and Tsujii Y. (2002). Tuning support vector machines for biomedical named entity recognition. In Proceedings of the Workshop of the Natural Language Processing in the Biomedical Domain in ACL ’02, Philadelphia, PA, USA, July.

  • Krauthammer M., Rzhetsky A., Morozov P. and Friedman C. (2000). Using BLAST for identifying gene and protein names in journal articles. Gene. 259(1–2):245–252.

  • Leroy G., Chen H. (2002). Filling preposition-based templates to capture information for medical abstracts. PSB’2001, Kaua’i, January.

  • Majoros W. H. and Subramanian G. M. and Yandell M. D. (2003). Identification of key concepts in biomedical literature using a modified Markov heuristic. Bioinformatics. 19(3): 402–407.

  • Marcotte E. M., Xenarios I. and Eisenberg D. (2001). Mining literature for protein-protein interactions. Bioinformatics, vol. 17, no. 4, pp. 359–363.

  • Mikheev A. (1998). Feature Lattices for Maximum Entropy Modelling. In proceedings of COLING-ACL, pp. 848–854.

  • MUC Proceedings (1987-). Message Understanding Conference.

  • Narayanaswamy M., Ravikumar K. E., Vijay-Shanker K. (2003). A Biological Named Entity Recognizer. Pacific Symposium on Biocomputing 8.

  • Nédellec, C., Ould Abdel Vetah, M. and Bessières, P. (2001). Sentence Filtering for Information Extraction in Genomics: A Classification Problem. In Proceedings of the International Conference on Practical Knowledge Discovery in Databases (PKDD’2001), pp. 326–338. Springer Verlag, LNAI 2167, Freiburg, Sept.

  • Nenadic G., Mima H., Spasic I., Ananiadou S. and Tsujii J. (2002). Terminology-driven literature mining and knowledge acquisition in biomedicine. Int J Med Inf. 67(1–3): 33–48.

  • Nenadic G., Spasic I. and Ananiadou S. (2003). Terminology-driven mining of biomedical literature. Bioinformatics. 19(8): 938–943.

  • Nobata C., Collier N. and Tsujii J. (1999). Automatic Term Identification and Classification in Biology Texts. In Proceedings of the Fifth Natural Language Processing Pacific Rim Symposium (NLPRS). Beijing, China. pp. 369–374.

  • Ohta T., Tateisi Y., Mima H. and Tsujii J. (2002). GENIA Corpus: an Annotated Research Abstract Corpus in Molecular Biology Domain. Proceedings of the Human Language Technology Conference.

  • Ono T., Hishigaki H., Tanigami A., Takagi T. (2001). Automated extraction of information on protein-protein interactions from the biological literature. Bioinformatics. 17(2): 155–161.

  • Park J. C., Kim H. S., Kim J. J. (2001). Bidirectional incremental parsing for automatic pathway identification with combinatory categorial grammar. In proceedings of PSB’2001.

  • Proux D., Rechenmann F., Julliard L., Pillet V. and Jacq B. (1998). Detecting Gene Symbols and Names in Biological Texts: A First Step toward Pertinent Information Extraction. Genome Informatics. 9:72–80.

  • Pustejovsky J., Bergler S. and Anick P. (1993). Lexical Semantic Techniques for Corpus Analysis, in Computational Linguistics. Special Issue on Using Large Corpora: II, 19(2) pp. 331–358.

  • Pustejovsky J., Castaño J., Cochran B., Kotecki M., Morrell M. and Rumshisky A. (2001). Automatic extraction of acronym-meaning pairs from MEDLINE databases. Medinfo. 10(Pt 1):371–5.

  • Pustejovsky J., Castaño J., Zhang J., Kotecki M. and Cochran B. (2002). Robust Relational Parsing Over Biomedical Literature: Extracting Inhibit Relations. PSB’2002, 7:362–373.

  • Riloff E. (1993). Automatically constructing a Dictionary for Information Extraction Tasks. Proceedings of AAAI’93, Washington DC, pp 811–816.

  • Rindflesch T. C., Tanabe L., Weinstein J. N., Hunter L. (2000). EDGAR: Extraction of Drugs, Genes and Relations from the Biomedical Literature. Proceedings of PSB’2000, vol 5:514–525.

  • Schwartz A.S., Hearst M.A. (2003). A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text. Pacific Symposium on Biocomputing 8:451–462.

  • Roux C., Proux D., Rechenmann F., Julliard L. (2000). An Ontology Enrichment Method for a Pragmatic Information Extraction System gathering Data on Genetic Interactions. Proceedings of the ECAI’2000 Ontology Learning Workshop, S. Staab et al. (eds.).

  • Sekimizu T., Park H. S., Tsujii J. (1998). Identifying the Interaction between Genes and Gene Products Based on Frequently Seen Verbs in MedLine Abstracts. In Genome Informatics. Universal Academy Press Inc., Tokyo, Japan.

  • Takeuchi K. and Collier N. (2002). Use of Support Vector Machines in Extended Named Entity Recognition. Proceedings of the Sixth Conference on Natural Language Learning (CoNLL-2002), Taipei, Taiwan, August.

  • Tanabe L. and Wilbur W. J. (2002). Tagging gene and protein names in biomedical text. Bioinformatics. 18(8): 1124–1132.

  • Thomas J. et al., (2000). Automatic Extraction of Protein Interactions from Scientific Abstracts. Proc. Pacific Symp. Biocomputing (PSB’2000), vol. 5, pp. 502–513.

  • Weston J. and Watkins C. (1998). Multi-class support vector machines. Technical Report CSD-TR-98-04, Dept. of Computer Science, Royal Holloway, University of London, Egham, Surrey TW20 0EX, England.

  • Wilks Y. (1997). Information Extraction as a core language technology. In Information Extraction, M. T. Pazienza (ed.), Springer, Berlin.

  • Yakushiji A., Tateisi Y., Miyao Y. and Tsujii J.-I. (2001). Extraction from biomedical papers using a full parser. Proceedings of PSB’2001.

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Nédellec, C. (2004). Machine Learning for Information Extraction in Genomics — State of the Art and Perspectives. In: Sirmakessis, S. (ed.) Text Mining and its Applications. Studies in Fuzziness and Soft Computing, vol 138. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45219-5_8

  • DOI: https://doi.org/10.1007/978-3-540-45219-5_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-05780-9

  • Online ISBN: 978-3-540-45219-5

  • eBook Packages: Springer Book Archive
