An Overview and Classification of Adaptive Approaches to Information Extraction

  • Conference paper
Journal on Data Semantics IV

Part of the book series: Lecture Notes in Computer Science (JODS, volume 3730)

Abstract

Most of the information stored in digital form is hidden in natural language texts. Extracting and storing it in a formal representation (e.g. in the form of relations in databases) allows efficient querying, easy administration and further automatic processing of the extracted data. The area of information extraction (IE) comprises techniques, algorithms and methods that perform two important tasks: finding (identifying) the desired, relevant data and storing it in an appropriate form for future use.

The rapidly increasing number and diversity of IE systems are evidence of continuous activity and growing attention to this field. At the same time, it is becoming increasingly difficult to survey the scope of IE and to see the advantages of certain approaches and how they differ from others. In this paper we identify and describe promising approaches to IE. Our focus is on adaptive systems that can be customized for new domains through training or the use of external knowledge sources. Based on the observed origins and requirements of the examined IE techniques, we establish a classification of the different types of adaptive IE systems.
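
The two tasks named in the abstract, finding the relevant data and storing it in a formal representation, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the sample text, the hand-written pattern, and the `employment` table do not come from the paper. An adaptive system of the kind the paper surveys would presumably learn its extraction patterns from training data rather than rely on a fixed regular expression.

```python
import re
import sqlite3

# Hypothetical hand-written pattern: "<Person> joined <Company> in <year>".
# An adaptive IE system would learn such patterns from training data instead.
PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+) joined "
    r"(?P<company>[A-Z][A-Za-z]+) in (?P<year>\d{4})"
)

def extract(text):
    """Task 1: find the relevant data in natural language text."""
    return [(m["person"], m["company"], int(m["year"]))
            for m in PATTERN.finditer(text)]

def store(conn, rows):
    """Task 2: store the extracted data as a database relation."""
    conn.execute("CREATE TABLE IF NOT EXISTS employment "
                 "(person TEXT, company TEXT, year INTEGER)")
    conn.executemany("INSERT INTO employment VALUES (?, ?, ?)", rows)
    conn.commit()

# Illustrative input: two relevant sentences and one irrelevant one.
text = ("Alice Miller joined Acme in 2001. The weather was fine. "
        "Later, Bob Stone joined Initech in 2003.")

conn = sqlite3.connect(":memory:")
store(conn, extract(text))

# Once stored as a relation, the extracted data can be queried directly.
rows = conn.execute(
    "SELECT person FROM employment WHERE year > 2002").fetchall()
print(rows)
```

Once the facts are in a relation, a plain SQL query answers a question that would otherwise require rereading the text, which is the efficiency argument the abstract makes.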

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Siefkes, C., Siniakov, P. (2005). An Overview and Classification of Adaptive Approaches to Information Extraction. In: Spaccapietra, S. (eds) Journal on Data Semantics IV. Lecture Notes in Computer Science, vol 3730. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11603412_6

  • DOI: https://doi.org/10.1007/11603412_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-31001-3

  • Online ISBN: 978-3-540-31447-9

  • eBook Packages: Computer Science, Computer Science (R0)