Information Extraction via Automatic Pattern Discovery in Identified Region

Ma, Liping; Shepherd, John

doi:10.1007/978-3-540-30075-5_23

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3180)

Included in the following conference series:

International Conference on Database and Expert Systems Applications

Abstract

Pattern discovery has become a fundamental technique for modern information extraction tasks. This paper presents a new two-phase pattern (2PP) discovery technique for information extraction. 2PP consists of orthographic pattern discovery (OPD) and semantic pattern discovery (SPD). OPD determines the structural features of an identified region of a document, and SPD discovers a dominant semantic pattern for the region via inference, apposition, and analogy. 2PP then applies the discovered pattern back to the region to extract the required data items through pattern matching. Experimental evaluation on a large number of identified regions indicates that our 2PP technique achieves effective results.
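
The abstract gives no algorithmic detail, so the following is only an illustrative sketch of the general idea behind the orthographic phase and the final matching step, not the authors' method. It assumes a region is a list of text lines, that tokens can be reduced to a few coarse orthographic classes (the class names CAP, LOW and NUM are hypothetical), and that the "dominant" pattern is simply the most frequent class sequence. The semantic phase (inference, apposition, analogy) is not modelled here.

```python
import re
from collections import Counter


def orthographic_signature(line):
    """Reduce a line to a sequence of coarse orthographic classes.

    The classes (CAP, LOW, NUM) are assumptions made for this sketch;
    the tokenizer only groups ASCII letters and digits.
    """
    classes = []
    for token in re.findall(r"[A-Za-z]+|\d+|\S", line):
        if token.isdigit():
            classes.append("NUM")
        elif token.isalpha():
            classes.append("CAP" if token[0].isupper() else "LOW")
        else:
            classes.append(token)  # any other single character is kept literally
    return " ".join(classes)


def dominant_pattern(region):
    """Return the most frequent signature among the lines of a region."""
    counts = Counter(orthographic_signature(line) for line in region)
    return counts.most_common(1)[0][0]


def extract(region, pattern):
    """Apply the pattern back to the region: keep the lines that match it."""
    return [line for line in region if orthographic_signature(line) == pattern]


# Hypothetical region: three records sharing one layout, plus one stray line.
region = [
    "Smith, John: 1998",
    "Lee, Anna: 2003",
    "see also the appendix",
    "Brown, Carl: 2001",
]

pattern = dominant_pattern(region)
records = extract(region, pattern)
```

In this toy region the three record lines share one signature and the stray line does not, so matching against the dominant signature keeps only the records. A real system would need finer-grained classes and some tolerance for near matches.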

Author information

Authors and Affiliations

School of Computer Science and Engineering, The University of New South Wales, Sydney, NSW 2052, Australia
Liping Ma & John Shepherd

Editor information

Editors and Affiliations

University of Zaragoza, Ciudad Universitaria, Plaza San Francisco, 50009, Zaragoza
Fernando Galindo
Seikei University, Japan
Makoto Takizawa
Institute of Informatics in Business and Government, University of Linz, Altenbergerstr. 69, 4040, Linz, Austria
Roland Traunmüller

About this paper

Cite this paper

Ma, L., Shepherd, J. (2004). Information Extraction via Automatic Pattern Discovery in Identified Region. In: Galindo, F., Takizawa, M., Traunmüller, R. (eds) Database and Expert Systems Applications. DEXA 2004. Lecture Notes in Computer Science, vol 3180. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30075-5_23

DOI: https://doi.org/10.1007/978-3-540-30075-5_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22936-0
Online ISBN: 978-3-540-30075-5
eBook Packages: Springer Book Archive
