Automated Text Data Extraction Based on Unsupervised Small Sample Learning

  • Conference paper

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 213)

Abstract

Most Web information extraction systems use DOM tree-based structured extraction rules to extract data records from Web pages; however, some data items of these records, or even whole records, are often in semi-structured or unstructured text form. Text data extraction rules are therefore needed to extract the fine-grained data elements from those coarse-grained text items or records. Generating such rules is challenging, whether it is done manually or automatically. In this paper, we propose an unsupervised learning approach that automatically deduces text data extraction rules from a small sample of text records. First, to prepare for extraction rule template deduction, we propose an iterative center core multiple sequence alignment method that aligns the text columns in the sample text records. Then, we propose an information entropy model, based on the statistical features of text columns, that identifies each column as either a template column or a data column. From the identified template and data columns, plus some additional processing, we can quickly deduce the template, that is, the text data extraction rule. Finally, we use this rule to perform automated text data extraction on test text records. The approach needs no manual labeling and automates both the generation of text data extraction rules and the extraction process itself. It is the first study of unsupervised small sample learning for automated text data extraction rule generation. The experimental results show that our approach achieves high accuracy.
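
The column classification and rule deduction steps described above can be illustrated with a minimal sketch. The abstract does not give the details of the entropy model, so the code below makes several assumptions that are not from the paper: it uses plain Shannon entropy over the token values of each column, a hypothetical threshold of 0.5 bits, sample records that are already tokenized and aligned (the alignment step is skipped), and a regular expression as the rule format. All function names and sample records are illustrative only.

```python
import math
import re
from collections import Counter


def column_entropy(values):
    """Shannon entropy (in bits) of the token values in one aligned column."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def classify_columns(aligned_records, threshold=0.5):
    """Label each column 'template' (low entropy) or 'data' (high entropy).

    The threshold is an assumed value, not one taken from the paper.
    """
    columns = zip(*aligned_records)
    return ["template" if column_entropy(col) <= threshold else "data"
            for col in columns]


def deduce_rule(aligned_records, labels):
    """Build a regex rule: template columns become literal text,
    data columns become capture groups."""
    parts = []
    for label, col in zip(labels, zip(*aligned_records)):
        if label == "template":
            # Use the most frequent token in the column as the literal.
            literal = Counter(col).most_common(1)[0][0]
            parts.append(re.escape(literal))
        else:
            parts.append(r"(\S+)")
    return re.compile(r"\s+".join(parts))


# Hypothetical sample records, already tokenized and aligned by column.
samples = [
    ["Price:", "12.50", "USD", "Stock:", "7"],
    ["Price:", "8.99", "USD", "Stock:", "120"],
    ["Price:", "104.00", "USD", "Stock:", "33"],
]

labels = classify_columns(samples)
rule = deduce_rule(samples, labels)

# Apply the deduced rule to an unseen test record.
extracted = rule.fullmatch("Price: 45.10 USD Stock: 2").groups()
print(labels)
print(extracted)
```

In this sketch, columns whose value never changes across the samples have zero entropy and become literal parts of the template, while columns with varying values become capture groups. A real system would also need the alignment step and some handling for optional or multi-token fields, which this example leaves out.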

Acknowledgments

This work is funded by China NSF Grant (#61072152) and Jiangsu Province Industry Promotion Program (#BE2011172).

Author information

Corresponding author

Correspondence to Yihua Huang.

Copyright information

© 2014 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Liu, Y., Shi, S., Yuan, C., Huang, Y. (2014). Automated Text Data Extraction Based on Unsupervised Small Sample Learning. In: Sun, F., Li, T., Li, H. (eds) Foundations and Applications of Intelligent Systems. Advances in Intelligent Systems and Computing, vol 213. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37829-4_12

  • DOI: https://doi.org/10.1007/978-3-642-37829-4_12

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-37828-7

  • Online ISBN: 978-3-642-37829-4

  • eBook Packages: Engineering, Engineering (R0)
