
Information Extraction via Automatic Pattern Discovery in Identified Region

  • Liping Ma
  • John Shepherd
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3180)

Abstract

Pattern discovery has become a fundamental technique for modern information extraction tasks. This paper presents a new two-phase pattern (2PP) discovery technique for information extraction. 2PP consists of orthographic pattern discovery (OPD) and semantic pattern discovery (SPD). OPD determines the structural features of an identified region of a document, and SPD discovers a dominant semantic pattern for the region via inference, apposition and analogy. 2PP then applies the discovered pattern back to the region to extract the required data items through pattern matching. Experimental evaluation on a large number of identified regions indicates that our 2PP technique achieves effective results.
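
The abstract does not give the details of OPD or of the matching step, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes that a region is a list of text lines, that structural features can be approximated by coarse token classes, and that the dominant pattern is simply the most frequent class sequence. The class names (`CAP`, `NUM`, and so on) and all function names are hypothetical, and the semantic phase (SPD) is not modelled at all.

```python
import re
from collections import Counter


def tokenize(line):
    """Split a line into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", line)


def token_shape(token):
    """Map a token to a coarse orthographic class (hypothetical classes)."""
    if token.isdigit():
        return "NUM"
    if token.isalpha():
        if token.isupper():
            return "UPPER"
        if token[0].isupper():
            return "CAP"
        return "LOWER"
    if all(not ch.isalnum() for ch in token):
        return "PUNCT"
    return "MIXED"


def line_pattern(line):
    """Orthographic pattern of a line: the sequence of its token classes."""
    return tuple(token_shape(t) for t in tokenize(line))


def dominant_pattern(region):
    """Return the most frequent line pattern in the region, or None if empty."""
    counts = Counter(line_pattern(line) for line in region if line.strip())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def extract(region, pattern):
    """Apply the pattern back to the region: keep tokens of matching lines."""
    return [tokenize(line) for line in region if line_pattern(line) == pattern]


if __name__ == "__main__":
    # Illustrative region, invented for this sketch.
    region = [
        "Smith , John 1998",
        "Ma , Liping 2003",
        "see also the notes",
        "Shepherd , John 2004",
    ]
    pattern = dominant_pattern(region)
    print(pattern)
    for record in extract(region, pattern):
        print(record)
```

In this sketch the line "see also the notes" has a different class sequence from the other three lines, so it is skipped when the dominant pattern is applied back to the region. A real system would need a more tolerant matcher than exact sequence equality.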

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Liping Ma (1)
  • John Shepherd (1)
  1. School of Computer Science and Engineering, The University of New South Wales, Sydney, Australia
