Abstract
Most Web information extraction systems use DOM tree-based structured extraction rules to extract data records from Web pages. However, some data items of these records, or even whole records, are often in semi-structured or unstructured text form. Text data extraction rules are therefore needed to extract fine-grained data elements from these coarse-grained text items or records. Generating such rules is challenging, whether it is done manually or automatically. In this paper, we propose an unsupervised learning approach that automatically deduces text data extraction rules from a small sample of text records. First, to prepare for deducing the extraction rule template, we propose an iterative center core multiple sequence alignment method that aligns the text columns of the sample text records. Then, we propose an information entropy model, based on the statistical features of text columns, that identifies each column as either a template column or a data column. From the identified template and data columns, with some additional processing, we can quickly deduce the template, that is, the text data extraction rule. Finally, we use this rule to extract data automatically from test text records. The approach requires no manual labeling and automates both the generation of text data extraction rules and the extraction process itself. It is the first study of unsupervised small sample learning for automated text data extraction rule generation. The experimental results show that our approach achieves high accuracy.
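The column classification and rule deduction steps described in the abstract can be illustrated with a minimal sketch. It is not the paper's model: it assumes the sample records are already tokenized and aligned into equal-length columns (the alignment step is skipped), it uses plain Shannon entropy with a hypothetical threshold of 0.5 bits to separate template columns from data columns, and it represents the deduced rule as a regular expression. The names `column_entropy`, `deduce_rule`, and `extract` are illustrative only.

```python
import math
import re
from collections import Counter


def column_entropy(values):
    """Shannon entropy (in bits) of the token distribution in one column."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def deduce_rule(aligned_records, threshold=0.5):
    """Build a regex rule from aligned sample records.

    Low-entropy columns are treated as template text (kept literally);
    high-entropy columns are treated as data (turned into capture groups).
    The threshold is an assumption made for this sketch.
    """
    parts = []
    for column in zip(*aligned_records):
        if column_entropy(column) <= threshold:
            literal = Counter(column).most_common(1)[0][0]
            parts.append(re.escape(literal))
        else:
            parts.append(r"(.+?)")
    # Allow optional whitespace between adjacent columns.
    return re.compile(r"\s*".join(parts))


def extract(rule, text):
    """Apply the rule to a whole test record; return the data fields or None."""
    match = rule.fullmatch(text.strip())
    return match.groups() if match else None


# Hypothetical sample records, already tokenized and aligned column by column.
samples = [
    ["Price", ":", "12", "USD", ",", "Stock", ":", "40"],
    ["Price", ":", "7", "USD", ",", "Stock", ":", "3"],
    ["Price", ":", "150", "USD", ",", "Stock", ":", "18"],
]

rule = deduce_rule(samples)
fields = extract(rule, "Price: 99 USD, Stock: 5")
print(fields)
```

In this toy input the constant columns have zero entropy and become literal template text, while the two varying columns become capture groups, so the test record yields its two data fields. A real implementation would need the alignment step to handle records of different lengths and a more careful criterion than a single fixed threshold.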
Acknowledgments
This work is funded by China NSF Grant (#61072152) and Jiangsu Province Industry Promotion Program (#BE2011172).
Copyright information
© 2014 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Liu, Y., Shi, S., Yuan, C., Huang, Y. (2014). Automated Text Data Extraction Based on Unsupervised Small Sample Learning. In: Sun, F., Li, T., Li, H. (eds) Foundations and Applications of Intelligent Systems. Advances in Intelligent Systems and Computing, vol 213. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37829-4_12
DOI: https://doi.org/10.1007/978-3-642-37829-4_12
Published:
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37828-7
Online ISBN: 978-3-642-37829-4
eBook Packages: Engineering, Engineering (R0)