Automated Text Data Extraction Based on Unsupervised Small Sample Learning

Liu, Yulong; Shi, Shengsheng; Yuan, Chunfeng; Huang, Yihua

doi:10.1007/978-3-642-37829-4_12

Conference paper
First Online: 23 November 2013

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 213)

Abstract

Most Web information extraction systems use DOM tree-based structured extraction rules to extract data records from Web pages. However, some data items of these records, or even whole records, are often in semi-structured or unstructured text form. We therefore need text data extraction rules to further extract the fine-grained data elements from those coarse-grained text items or records. Generating such rules is challenging, whether done manually or automatically. In this paper, we propose an unsupervised learning approach that automatically deduces text data extraction rules from a small sample of text records. First, to prepare for extraction rule template deduction, we propose an iterative center core multiple sequence alignment method to align text columns in the sample text records. Then, we propose an information entropy model based on the statistical features of text columns to identify each column as either a template column or a data column. From the identified template and data columns, plus some additional processing, we can quickly deduce the template, that is, the text data extraction rule. Finally, we use the text data extraction rule to perform automated text data extraction from test text records. This unsupervised learning approach needs no manual labeling and automates both the generation of text data extraction rules and the extraction process itself. It is the first study of an unsupervised small sample learning approach to automated text data extraction rule generation. The experimental results show that our approach achieves high accuracy.
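
The entropy step described in the abstract can be illustrated with a minimal Python sketch. This is not the chapter's actual model, which the abstract does not specify. It assumes the sample records are already tokenized and aligned into columns of equal length, uses plain Shannon entropy over the tokens in each column, and applies a hypothetical threshold of 0.5 bits. The function names (`column_entropy`, `classify_columns`, `deduce_template`, `extract`) and the sample records are likewise made up for illustration.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the token distribution in one aligned column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def classify_columns(aligned_rows, threshold=0.5):
    """Label each column 'template' (low entropy) or 'data' (high entropy).

    The threshold is an assumed value, not one taken from the paper.
    """
    columns = list(zip(*aligned_rows))
    return ["template" if column_entropy(col) <= threshold else "data"
            for col in columns]


def deduce_template(aligned_rows, labels):
    """Keep the most common token of each template column; None marks a data slot."""
    columns = list(zip(*aligned_rows))
    return [Counter(col).most_common(1)[0][0] if label == "template" else None
            for col, label in zip(columns, labels)]


def extract(template, tokens):
    """Return the tokens that fall in data slots of an equally long record."""
    if len(tokens) != len(template):
        raise ValueError("record must be aligned to the template length")
    return [tok for slot, tok in zip(template, tokens) if slot is None]


# Hypothetical, already-aligned sample records (one token per column).
samples = [
    ["Price", ":", "12", "USD"],
    ["Price", ":", "340", "USD"],
    ["Price", ":", "7", "USD"],
]
labels = classify_columns(samples)
template = deduce_template(samples, labels)
print(labels)
print(template)
print(extract(template, ["Price", ":", "99", "USD"]))
```

In this sketch a column whose tokens never vary has zero entropy and is treated as part of the template, while a column with varied tokens is treated as data. The alignment step that would produce such columns from raw text records is omitted here.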

Acknowledgments

This work is funded by China NSF Grant (#61072152) and Jiangsu Province Industry Promotion Program (#BE2011172).

Author information

Authors and Affiliations

National Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University, Nanjing, 210093, China
Yulong Liu, Shengsheng Shi, Chunfeng Yuan & Yihua Huang

Corresponding author

Correspondence to Yihua Huang.

Editor information

Editors and Affiliations

Department of Computer Science and Technology, Tsinghua University, Beijing, People's Republic of China
Fuchun Sun
School of Information Science and Technology, Southwest Jiaotong University, Chengdu, People's Republic of China
Tianrui Li
Department of Computer Science and Techn, Tsinghua University, Beijing, People's Republic of China
Hongbo Li

About this paper

Cite this paper

Liu, Y., Shi, S., Yuan, C., Huang, Y. (2014). Automated Text Data Extraction Based on Unsupervised Small Sample Learning. In: Sun, F., Li, T., Li, H. (eds) Foundations and Applications of Intelligent Systems. Advances in Intelligent Systems and Computing, vol 213. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37829-4_12

DOI: https://doi.org/10.1007/978-3-642-37829-4_12
Published: 23 November 2013
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37828-7
Online ISBN: 978-3-642-37829-4
eBook Packages: Engineering, Engineering (R0)
