Information Extraction via Automatic Pattern Discovery in Identified Region

  • Conference paper
Database and Expert Systems Applications (DEXA 2004)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3180)

Abstract

Pattern discovery has become a fundamental technique for modern information extraction tasks. This paper presents a new two-phase pattern (2PP) discovery technique for information extraction. 2PP consists of orthographic pattern discovery (OPD) and semantic pattern discovery (SPD). OPD determines the structural features of an identified region of a document, and SPD discovers a dominant semantic pattern for the region via inference, apposition and analogy. 2PP then applies the discovered pattern back to the region to extract the required data items through pattern matching. Experimental evaluation on a large number of identified regions indicates that our 2PP technique achieves effective results.
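
The discover-then-match loop described in the abstract can be illustrated with a small sketch. This is not the authors' 2PP implementation: it covers only a simplified orthographic step, and the semantic step (inference, apposition and analogy) is not modelled. The function names (`shape`, `line_pattern`, `dominant_pattern`, `extract`), the character classes, and the sample region are all assumptions made for illustration.

```python
from collections import Counter


def shape(token):
    """Map a token to a coarse orthographic shape (hypothetical encoding).

    A = uppercase run, a = lowercase run, 9 = digit run;
    any other character is kept as-is. Repeated classes are collapsed.
    """
    out = []
    for ch in token:
        if ch.isupper():
            cls = "A"
        elif ch.islower():
            cls = "a"
        elif ch.isdigit():
            cls = "9"
        else:
            cls = ch
        if not out or out[-1] != cls:
            out.append(cls)
    return "".join(out)


def line_pattern(line):
    """Orthographic pattern of a line: the tuple of its token shapes."""
    return tuple(shape(tok) for tok in line.split())


def dominant_pattern(region):
    """Pick the most frequent line pattern in the region (assumed heuristic)."""
    counts = Counter(line_pattern(line) for line in region if line.strip())
    return counts.most_common(1)[0][0]


def extract(region, pattern):
    """Apply the pattern back to the region; keep only the matching lines."""
    return [line.split() for line in region if line_pattern(line) == pattern]


# Made-up region for illustration only.
region = [
    "Name Phone Room",
    "Smith 5551234 K17",
    "Jones 5559876 K21",
    "Contact the office for updates.",
    "Brown 5550000 E12",
]

pattern = dominant_pattern(region)
records = extract(region, pattern)
print(pattern)
print(records)
```

In this sketch the header line and the free-text line have different shapes from the three data rows, so only the data rows survive the matching step. A real system would need a more tolerant matcher than exact tuple equality.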

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ma, L., Shepherd, J. (2004). Information Extraction via Automatic Pattern Discovery in Identified Region. In: Galindo, F., Takizawa, M., Traunmüller, R. (eds) Database and Expert Systems Applications. DEXA 2004. Lecture Notes in Computer Science, vol 3180. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30075-5_23

  • DOI: https://doi.org/10.1007/978-3-540-30075-5_23

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-22936-0

  • Online ISBN: 978-3-540-30075-5

  • eBook Packages: Springer Book Archive
