Machine Learning for Information Extraction in Genomics — State of the Art and Perspectives

Nédellec, C.

doi:10.1007/978-3-540-45219-5_8

Part of the book series: Studies in Fuzziness and Soft Computing (STUDFUZZ, volume 138)

Abstract

The considerable development of multimedia communication goes along with an exponentially increasing volume of textual information. Information Retrieval (IR) technology provides information at the document collection level and thus cannot answer requests for specific pieces of information. The development of intelligent tools and methods that give access to document content and extract relevant information is more than ever a key issue for knowledge and information management. Information Extraction (IE) is one of the main research fields that attempt to fulfill this need. The IE field was initiated by DARPA’s MUC program (Message Understanding Conference) in 1987 (MUC Proceedings). MUC originally defined IE as the task of (1) extracting specific, well-defined pieces of information from homogeneous sets of textual documents in restricted domains (2) in order to fill the slots of predefined forms or templates. MUC also brought about a new evaluation paradigm: the comparison of machine-extracted information to human-produced results. MUC inspired a large amount of work in IE and has become a major reference in the text-mining field. Even under this restrictive definition, designing an efficient IE system with good recall (coverage) and precision (correctness) rates remains a challenging task.
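
The evaluation paradigm described above, comparing machine-filled template slots with a human-produced reference, can be illustrated with a minimal sketch. The slot names, the values, and the exact-match scoring rule are assumptions made for illustration only; they are not taken from the chapter or from the official MUC scoring procedure.

```python
from typing import Dict, Tuple

Slots = Dict[str, str]


def precision_recall(extracted: Slots, reference: Slots) -> Tuple[float, float]:
    """Score machine-filled slots against a human-filled reference template."""
    # Assumption: a slot is correct only if its value matches the reference exactly.
    correct = sum(
        1 for slot, value in extracted.items() if reference.get(slot) == value
    )
    # Precision (correctness): share of extracted slots that are right.
    precision = correct / len(extracted) if extracted else 0.0
    # Recall (coverage): share of reference slots that were found.
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical template for one interaction sentence (placeholder values).
reference = {"agent": "proteinX", "target": "geneY", "interaction": "activates"}
extracted = {"agent": "proteinX", "target": "geneZ"}

precision, recall = precision_recall(extracted, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here one of the two extracted slots is correct, so precision is 1/2, while only one of the three reference slots is recovered, so recall is 1/3. Real evaluations typically aggregate these counts over many templates and may give partial credit, which this sketch does not attempt.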

References

Adar E. (2002). S-RAD: A Simple and Robust Abbreviation Dictionary. HP Laboratories Technical Report, Sept.
Bikel D. M., Miller S., Schwartz R., Weischedel R. (1997). Nymble: a High-Performance Learning Name-finder. Conference on Applied Natural Language Processing.
Blaschke C., Andrade M. A., Ouzounis C. and Valencia A. (1999). Automatic Extraction of Biological Information from Scientific Text: Protein-Protein Interactions. Proc. Int’l Symp. Molecular Biology (ISMB’99), AAAI Press, USA pp. 60–67.
Borthwick A. (1999). A Maximum Entropy Approach to Named Entity Recognition. Ph.D. thesis, Computer Science Department, New York University.
Collier N., Nobata C., Tsujii J. (2000). Extracting the Names of Genes and Gene Products with a Hidden Markov Model. Proceedings of COLING-2000, Saarbrücken.
Castaño J., Zhang J., Pustejovsky J. (2002). Anaphora Resolution in Biomedical Literature. International Symposium on Reference Resolution. Alicante, Spain.
Chang J. T., Schütze H. and Altman R. B. (2002). Creating an online dictionary of abbreviations from MEDLINE. J. Am. Med. Inform. Assoc. 9(6): 612–620.
Chieu H. L., and Ng H. T. (2002). Named Entity Recognition: A Maximum Entropy Approach Using Global Information. Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002). (pp. 190–196). Taiwan.
Cohen K. B., Dolbey A. E., Acquaah-Mensah G. K. and Hunter L. (2002). Contrast and variability in gene names. Proceedings of the Workshop on Natural Language Processing in the Biomedical Domain. pp. 14–20.
Cowie J., Wilks Y. (2000). Information Extraction. In R. Dale, H. Moisl and H. Somers (eds.) Handbook of Natural Language Processing. New York: Marcel Dekker.
Craven M. and Kumlien J. (1999). Constructing Biological Knowledge Bases by Extracting Information from Text Sources. Proc. 7th Int’l Conf. Intelligent Systems for Molecular Biology (ISMB-99), AAAI Press, USA, pp. 77–86, Heidelberg, Germany.
Franzen K., Eriksson G., Olsson F., Asker L., Liden P. and Coster J. (2002). Protein names and how to find them. Int J Med Inf. 67(1–3): pp 49–61.
Freitag D. (1998). Toward General-Purpose Learning for Information Extraction. Proceedings of COLING-ACL-98.
Fukuda K., Tamura A., Tsunoda T., Takagi T. (1998). Toward information extraction: identifying protein names from biological papers. PSB’98. pp 707–18.
Gildea D., Jurafsky D. (2002). Automatic Labeling of Semantic Roles. Computational Linguistics, 28(3):245–288.
Hanisch D., Fluck J., Mevissen H. T., Zimmer R. (2003). Playing Biology’s Name Game: Identifying Protein Names in Scientific Text. Pacific Symposium on Biocomputing 8:403–414.
Hatzivassiloglou V., Duboue P. A. and Rzhetsky A. (2001). Disambiguating proteins, genes, and RNA in text: a machine learning approach. Bioinformatics. 17 Suppl 1: S97–S106.
Harris Z., Gottfried M., Ryckman T., Mattick P., Daladier A., Harris T. N., Harris S. (1989). The Form of Information in Science: Analysis of an Immunology Sublanguage, Kluwer Academic Publishers, Dordrecht.
Hearst M. A. (1992). Automatic Acquisition of Hyponyms from Large Text Corpora. Proceedings of COLING’92, pp. 539–545.
Isozaki H., Kazawa H. (2002). Efficient Support Vector Classifiers for Named Entity Recognition. Proceedings of COLING-2002, pp. 390–396.
Hishiki T., Collier N., Nobata C., Ohta T., Ogata N., Sekimizu T., Steiner R., Park H. S., Tsujii J. (1998). Developing NLP tools for Genome Informatics: An Information Extraction Perspective. Genome Informatics. Universal Academy Press Inc., Tokyo, Japan.
Hobbs J. R., Appelt D., Bear J., Israel D., Kameyama M., Stickel M., Tyson M. (1997). FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural Language Text. In E. Roche and Y. Schabes (eds.), Finite-State Language Processing, chapter 13, pp. 383–406. MIT Press.
Humphreys K., Demetriou G., Gaizauskas R. (2000). Two Applications of Information Extraction to Biological Science Journal Articles: Enzyme Interactions and Protein Structures. PSB’2000, 5:502–513.
Kazama J., Makino T., Ohta Y. and Tsujii J. (2002). Tuning support vector machines for biomedical named entity recognition. In Proceedings of the Workshop on Natural Language Processing in the Biomedical Domain at ACL ’02, Philadelphia, PA, USA, July.
Krauthammer M., Rzhetsky A., Morozov P. and Friedman C. (2000). Using BLAST for identifying gene and protein names in journal articles. Gene. 259(1–2):245–252.
Leroy G., Chen H. (2002). Filling preposition-based templates to capture information for medical abstracts. PSB’2001, Kaua’i, January.
Majoros W. H. and Subramanian G. M. and Yandell M. D. (2003). Identification of key concepts in biomedical literature using a modified Markov heuristic. Bioinformatics. 19(3): 402–407.
Marcotte E. M., Xenarios I. and Eisenberg D. (2001). Mining literature for protein-protein interactions. Bioinformatics, vol. 17, no. 4, pp. 359–363.
Mikheev A. (1998). Feature Lattices for Maximum Entropy Modelling. In proceedings of COLING-ACL, pp. 848–854.
MUC Proceedings (1987-) Message Understanding conference.
Narayanaswamy M., Ravikumar K. E., Vijay-Shanker K. (2003). A Biological Named Entity Recognizer. Pacific Symposium on Biocomputing 8.
Nédellec, C., Ould Abdel Vetah, M. and Bessières, P. (2001). Sentence Filtering for Information Extraction in Genomics: A Classification Problem. In Proceedings of the International Conference on Practical Knowledge Discovery in Databases (PKDD’2001), pp. 326–338. Springer Verlag, LNAI 2167, Freiburg, Sept.
Nenadic G., Mima H., Spasic I., Ananiadou S. and Tsujii J. (2002). Terminology-driven literature mining and knowledge acquisition in biomedicine. Int J Med Inf. 67(1–3): 33–48.
Nenadic G., Spasic I. and Ananiadou S. (2003). Terminology-driven mining of biomedical literature. Bioinformatics. 19(8): 938–943.
Nobata C., Collier N. and Tsujii J. (1999). Automatic Term Identification and Classification in Biology Texts. In Proceedings of the Fifth Natural Language Processing Pacific Rim Symposium (NLPRS). Beijing, China. pp. 369–374.
Ohta T., Tateisi Y., Mima H. and Tsujii J. (2002). GENIA Corpus: an Annotated Research Abstract Corpus in Molecular Biology Domain. Proceedings of the Human Language Technology Conference.
Ono T., Hishigaki H., Tanigami A., Takagi T. (2001). Automated extraction of information on protein-protein interactions from the biological literature. Bioinformatics. 17(2): 155–161.
Park J. C., Kim H. S., Kim J. J. (2001). Bidirectional incremental parsing for automatic pathway identification with combinatory categorial grammar. In proceedings of PSB’2001.
Proux D., Rechenmann F., Julliard L., Pillet V. and Jacq B. (1998). Detecting Gene Symbols and Names in Biological Texts: A First Step toward Pertinent Information Extraction. Genome Informatics. 9:72–80.
Pustejovsky J., Bergler S. and Anick P. (1993). Lexical Semantic Techniques for Corpus Analysis, in Computational Linguistics. Special Issue on Using Large Corpora: II, 19(2) pp. 331–358.
Pustejovsky J., Castaño J., Cochran B., Kotecki M., Morrell M. and Rumshisky A. (2001). Automatic extraction of acronym-meaning pairs from MEDLINE databases. Medinfo. 10(Pt 1):371–5.
Pustejovsky J., Castaño J., Zhang J., Kotecki M. and Cochran B. (2002). Robust Relational Parsing Over Biomedical Literature: Extracting Inhibit Relations. PSB’2002, 7:362–373.
Riloff E. (1993). Automatically constructing a Dictionary for Information Extraction Tasks. Proceedings of AAAI’93, Washington DC, pp 811–816.
Rindflesch T. C., Tanabe L., Weinstein J. N., Hunter L. (2000). EDGAR: Extraction of Drugs, Genes and Relations from the Biomedical Literature. Proceedings of PSB’2000, vol 5:514–525.
Schwartz A.S., Hearst M.A. (2003). A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text. Pacific Symposium on Biocomputing 8:451–462.
Roux C., Proux D., Rechenmann F., Julliard L. (2000). An Ontology Enrichment Method for a Pragmatic Information Extraction System gathering Data on Genetic Interactions. Proceedings of the ECAI’2000 Ontology Learning Workshop, S. Staab et al. (eds.).
Sekimizu T., Park H. S., Tsujii J. (1998). Identifying the Interaction between Genes and Gene Products Based on Frequently Seen Verbs in MedLine Abstracts. In Genome Informatics. Universal Academy Press Inc., Tokyo, Japan.
Takeuchi K. and Collier N. (2002). Use of Support Vector Machines in Extended Named Entity Recognition. Proceedings of the Sixth Conference on Natural Language Learning (CoNLL-2002), Taipei, Taiwan, August.
Tanabe L. and Wilbur W. J. (2002). Tagging gene and protein names in biomedical text. Bioinformatics. 18(8): 1124–1132.
Thomas J. et al., (2000). Automatic Extraction of Protein Interactions from Scientific Abstracts. Proc. Pacific Symp. Biocomputing (PSB’2000), vol. 5, pp. 502–513.
Weston J. and Watkins C. (1998). Multi-class support vector machines. Technical Report CSDTR-98-04, Dept. of Computer Science, Royal Holloway, University of London, Egham, Surrey TW20 0EX, England.
Wilks Y. (1997). Information Extraction as a core language technology. In Information Extraction, M. T. Pazienza (ed.), Springer, Berlin.
Yakushiji A., Tateisi Y., Miyao Y. and Tsujii J.-I. (2001). Extraction from biomedical papers using a full parser. Proceedings of PSB’2001.

Author information

Authors and Affiliations

Laboratoire Mathématique, Informatique et Génome (MIG), INRA, Domaine de Vilvert, 78352, F-Jouy-en-Josas, France
C. Nédellec

Editor information

Editors and Affiliations

Computer Technology Institute, Research Academic, 61 Riga Feraiou Str, 26221, Patras, Greece
Spiros Sirmakessis (Assistant Professor)

About this paper

Cite this paper

Nédellec, C. (2004). Machine Learning for Information Extraction in Genomics — State of the Art and Perspectives. In: Sirmakessis, S. (eds) Text Mining and its Applications. Studies in Fuzziness and Soft Computing, vol 138. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45219-5_8

DOI: https://doi.org/10.1007/978-3-540-45219-5_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-05780-9
Online ISBN: 978-3-540-45219-5
eBook Packages: Springer Book Archive
