Abstract
The considerable development of multimedia communication goes along with an exponentially increasing volume of textual information. Information Retrieval (IR) technology provides information at the document collection level and is therefore unable to answer requests for specific pieces of information. The development of intelligent tools and methods that give access to document content and extract relevant information is more than ever a key issue for knowledge and information management. Information Extraction (IE) is one of the main research fields that attempt to fulfill this need. The IE field was initiated by DARPA’s MUC program (Message Understanding Conference) in 1987 (MUC Proceedings). MUC originally defined IE as the task of (1) extracting specific, well-defined pieces of information from homogeneous sets of textual documents in restricted domains (2) in order to fill the slots of pre-defined forms or templates. MUC also brought about a new evaluation paradigm: the comparison of machine-extracted information to human-produced results. MUC inspired a large amount of work in IE and has become a major reference in the text-mining field. Even under this restrictive definition, designing an efficient IE system with good recall (coverage) and precision (correctness) rates remains a challenging task.
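The evaluation paradigm described above, comparing machine-extracted slot fills with human-produced ones, can be illustrated with a minimal Python sketch. It assumes exact-match scoring over (template, slot, value) triples, which is a simplification and not the official MUC scoring procedure. The function name, the template layout, and the example values are all hypothetical and chosen only for illustration.

```python
from typing import Set, Tuple

# A slot fill is modelled as (template, slot, value).
# This representation is an assumption made for illustration only.
Fill = Tuple[str, str, str]


def precision_recall(extracted: Set[Fill], reference: Set[Fill]) -> Tuple[float, float]:
    """Score machine-extracted slot fills against human-produced ones (exact match)."""
    correct = len(extracted & reference)
    # Precision (correctness): share of extracted fills that are right.
    precision = correct / len(extracted) if extracted else 0.0
    # Recall (coverage): share of reference fills that were found.
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


# Invented example data: a human-filled template and a system's output.
reference = {
    ("interaction", "agent", "GerE"),
    ("interaction", "target", "cotD"),
    ("interaction", "type", "stimulates"),
}
extracted = {
    ("interaction", "agent", "GerE"),
    ("interaction", "target", "cotD"),
    ("interaction", "type", "inhibits"),
    ("interaction", "agent", "sigK"),
}

p, r = precision_recall(extracted, reference)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four extracted fills match the reference, so precision is 0.50, and two of the three reference fills are recovered, so recall is about 0.67.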
References
Adar E. (2002). S-RAD: A Simple and Robust Abbreviation Dictionary. HP Laboratories Technical Report, Sept.
Bikel D. M., Miller S., Schwartz R., Weischedel R. (1997). Nymble: a High-Performance Learning Name-finder. Conference on Applied Natural Language Processing.
Blaschke C., Andrade M. A., Ouzounis C. and Valencia A. (1999). Automatic Extraction of Biological Information from Scientific Text: Protein-Protein Interactions. Proc. Int’l Symp. Molecular Biology (ISMB’99), AAAI Press, USA pp. 60–67.
Borthwick A. (1999). A Maximum Entropy Approach to Named Entity Recognition. Ph.D. thesis, Computer Science Department, New York University.
Collier N., Nobata C., Tsujii J. (2000). Extracting the Names of Genes and Gene Products with a Hidden Markov Model. Proceedings of COLING-2000, Saarbrücken.
Castaño J., Zhang J., Pustejovsky J. (2002). Anaphora Resolution in Biomedical Literature. International Symposium on Reference Resolution. Alicante, Spain.
Chang J. T., Schutze H. and Altman R. B. (2002). Creating an online dictionary of abbreviations from MEDLINE. J. Am. Med. Inform. Assoc. 9(6): 612–620.
Chieu H. L., and Ng H. T. (2002). Named Entity Recognition: A Maximum Entropy Approach Using Global Information. Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002). (pp. 190–196). Taiwan.
Cohen K. B., Dolbey A. E., Acquaah-Mensah G. K. and Hunter L. (2002). Contrast and variability in gene names. Proceedings of the Workshop on Natural Language Processing in the Biomedical Domain. pp. 14–20.
Cowie J., Wilks Y. (2000). Information Extraction. In R. Dale, H. Moisl and H. Somers (eds.) Handbook of Natural Language Processing. New York: Marcel Dekker.
Craven M. and Kumlien J. (1999). Constructing Biological Knowledge Bases by Extracting Information from Text Sources. Proc. 7th Int’l Conf. Intelligent Systems for Molecular Biology (ISMB-99), AAAI Press, USA, pp. 77–86, Heidelberg, Germany.
Franzen K., Eriksson G., Olsson F., Asker L., Liden P. and Coster J. (2002). Protein names and how to find them. Int J Med Inf. 67(1–3): pp 49–61.
Freitag D. (1998). Toward General-Purpose Learning for Information Extraction. Proceedings of COLING-ACL-98.
Fukuda K., Tamura A., Tsunoda T., Takagi T. (1998). Toward information extraction: identifying protein names from biological papers. PSB’98. pp 707–18.
Gildea D., Jurafsky D. (2002). Automatic Labeling of Semantic Roles. Computational Linguistics, 28(3):245–288.
Hanisch D., Fluck J., Mevissen H. T., Zimmer R. (2003). Playing Biology’s Name Game: Identifying Protein Names in Scientific Text. Pacific Symposium on Biocomputing 8:403–414.
Hatzivassiloglou V., Duboue P. A. and Rzhetsky A. (2001). Disambiguating proteins, genes, and RNA in text: a machine learning approach. Bioinformatics. 17 Suppl 1: S97–S106.
Harris Z., Gottfried M., Ryckman T., Mattick P., Daladier A., Harris T. N., Harris S. (1989). The Form of Information in Science: Analysis of an Immunology Sublanguage, Kluwer Academic Publishers, Dordrecht.
Hearst M. A. (1992). Automatic Acquisition of Hyponyms from Large Text Corpora. Proceedings of COLING’92, pp. 539–545.
Isozaki H., Kazawa H. (2002). Efficient Support Vector Classifiers for Named Entity Recognition. Proceedings of COLING-2002, pp. 390–396.
Hishiki T., Collier N., Nobata C., Ohta T., Ogata N., Sekimizu T., Steiner R., Park H. S., Tsujii J. (1998). Developing NLP tools for Genome Informatics: An Information Extraction Perspective. Genome Informatics. Universal Academy Press Inc., Tokyo, Japan.
Hobbs J. R., Appelt D., Bear J., Israel D., Kameyama M., Stickel M., Tyson M. (1997). FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural Language Text. In E. Roche and Y. Schabes (eds.), Finite-State Language Processing, chapter 13, pp. 383–406. MIT Press.
Humphreys K., Demetriou G., Gaizauskas R. (2000). Two Applications of Information Extraction to Biological Science Journal Articles: Enzyme Interactions and Protein Structures. PSB’2000, 5:502–513.
Kazama J., Makino T., Ohta Y. and Tsujii J. (2002). Tuning support vector machines for biomedical named entity recognition. In Proceedings of the Workshop on Natural Language Processing in the Biomedical Domain at ACL ’02, Philadelphia, PA, USA, July.
Krauthammer M., Rzhetsky A., Morozov P. and Friedman C. (2000). Using BLAST for identifying gene and protein names in journal articles. Gene. 259(1–2):245–252.
Leroy G., Chen H. (2002). Filling preposition-based templates to capture information for medical abstracts. PSB’2001, Kaua’i, January.
Majoros W. H., Subramanian G. M. and Yandell M. D. (2003). Identification of key concepts in biomedical literature using a modified Markov heuristic. Bioinformatics. 19(3): 402–407.
Marcotte E. M., Xenarios I. and Eisenberg D. (2001). Mining literature for protein-protein interactions. Bioinformatics. 17(4): 359–363.
Mikheev A. (1998). Feature Lattices for Maximum Entropy Modelling. In proceedings of COLING-ACL, pp. 848–854.
MUC Proceedings (1987-). Message Understanding Conference.
Narayanaswamy M., Ravikumar K. E., Vijay-Shanker K. (2003). A Biological Named Entity Recognizer. Pacific Symposium on Biocomputing 8.
Nédellec, C., Ould Abdel Vetah, M. and Bessières, P. (2001). Sentence Filtering for Information Extraction in Genomics: A Classification Problem. In Proceedings of the International Conference on Practical Knowledge Discovery in Databases (PKDD’2001), pp. 326–338. Springer Verlag, LNAI 2167, Freiburg, Sept.
Nenadic G., Mima H., Spasic I., Ananiadou S. and Tsujii J. (2002). Terminology-driven literature mining and knowledge acquisition in biomedicine. Int J Med Inf. 67(1–3): 33–48.
Nenadic G., Spasic I. and Ananiadou S. (2003). Terminology-driven mining of biomedical literature. Bioinformatics. 19(8): 938–943.
Nobata C., Collier N. and Tsujii J. (1999). Automatic Term Identification and Classification in Biology Texts. In Proceedings of the Fifth Natural Language Processing Pacific Rim Symposium (NLPRS). Beijing, China. pp. 369–374.
Ohta T., Tateisi Y., Mima H. and Tsujii J. (2002). GENIA Corpus: an Annotated Research Abstract Corpus in Molecular Biology Domain. Proceedings of the Human Language Technology Conference.
Ono T., Hishigaki H., Tanigami A., Takagi T. (2001). Automated extraction of information on protein-protein interactions from the biological literature. Bioinformatics. 17(2): 155–161.
Park J. C., Kim H. S., Kim J. J. (2001). Bidirectional incremental parsing for automatic pathway identification with combinatory categorial grammar. In proceedings of PSB’2001.
Proux D., Rechenmann F., Julliard L., Pillet V. and Jacq B. (1998). Detecting Gene Symbols and Names in Biological Texts: A First Step toward Pertinent Information Extraction. Genome Informatics. 9:72–80.
Pustejovsky J., Bergler S. and Anick P. (1993). Lexical Semantic Techniques for Corpus Analysis, in Computational Linguistics. Special Issue on Using Large Corpora: II, 19(2) pp. 331–358.
Pustejovsky J., Castano J., Cochran B., Kotecki M., Morrell M. and Rumshisky A. (2001). Automatic extraction of acronym-meaning pairs from MEDLINE databases. Medinfo. 10(Pt 1):371–5.
Pustejovsky J., Castaño J., Zhang J., Kotecki M. and Cochran B. (2002). Robust Relational Parsing Over Biomedical Literature: Extracting Inhibit Relations. PSB’2002, 7:362–373.
Riloff E. (1993). Automatically constructing a Dictionary for Information Extraction Tasks. Proceedings of AAAI’93, Washington DC, pp 811–816.
Rindflesch T. C., Tanabe L., Weinstein J. N., Hunter L. (2000). EDGAR: Extraction of Drugs, Genes and Relations from the Biomedical Literature. Proceedings of PSB’2000, vol 5:514–525.
Schwartz A.S., Hearst M.A. (2003). A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text. Pacific Symposium on Biocomputing 8:451–462.
Roux C., Proux D., Rechenmann F., Julliard L. (2000) An Ontology Enrichment Method for a Pragmatic Information Extraction System gathering Data on Genetic Interactions. Proceedings of the ECAI’2000 Ontology Learning Workshop, S. Staab et al. (eds.).
Sekimizu T., Park H. S., Tsujii J. (1998). Identifying the Interaction between Genes and Gene Products Based on Frequently Seen Verbs in MedLine Abstracts. In Genome Informatics. Universal Academy Press Inc., Tokyo, Japan.
Takeuchi K. and Collier N. (2002). Use of Support Vector Machines in Extended Named Entity Recognition. Proceedings of the Sixth Conference on Natural Language Learning (CoNLL-2002), Taipei, Taiwan, August.
Tanabe L. and Wilbur W. J. (2002). Tagging gene and protein names in biomedical text. Bioinformatics. 18(8): 1124–1132.
Thomas J. et al., (2000). Automatic Extraction of Protein Interactions from Scientific Abstracts. Proc. Pacific Symp. Biocomputing (PSB’2000), vol. 5, pp. 502–513.
Weston J. and Watkins C. (1998). Multi-class support vector machines. Technical Report CSD-TR-98-04, Dept. of Computer Science, Royal Holloway, University of London, Egham, Surrey TW20 0EX, England.
Wilks Y. (1997). Information Extraction as a core language technology. In M. T. Pazienza (ed.), Information Extraction. Springer, Berlin.
Yakushiji A., Tateisi Y., Miyao Y. and Tsujii J.-I. (2001). Extraction from biomedical papers using a full parser. Proceedings of PSB’2001.
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Nédellec, C. (2004). Machine Learning for Information Extraction in Genomics — State of the Art and Perspectives. In: Sirmakessis, S. (eds) Text Mining and its Applications. Studies in Fuzziness and Soft Computing, vol 138. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45219-5_8
DOI: https://doi.org/10.1007/978-3-540-45219-5_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-05780-9
Online ISBN: 978-3-540-45219-5
eBook Packages: Springer Book Archive