Exploiting Structural Similarity for Automatic Information Extraction from Lists

Huynh, Dat T.; Xu, Jiajie; Sadiq, Shazia; Zhou, Xiaofang

doi:10.1007/978-3-642-41154-0_15

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8181)

Included in the following conference series:

International Conference on Web Information Systems Engineering

Abstract

In this paper, we propose a novel technique to reduce the dependence on a knowledge base of ONDUX, the current state-of-the-art method for information extraction by text segmentation. While the existing approach relies mainly on a high overlap between pre-existing data and input lists to build an extraction model, our approach exploits the structural similarity of text segments in the sequences of a list to align them into groups, achieving effectiveness with low dependence on pre-existing data. First, a structural similarity measure between text segments is proposed and combined with content similarity to assess how likely it is that two text segments in a list should be aligned in the same group. Then we devise a data shifting-alignment technique in which positional information and the similarity scores are employed to cluster text segments into groups before their labels are revised by an HMM-based graphical model. Experimental results on different datasets demonstrate that our method extracts information from lists with high performance and less dependence on a knowledge base than the current state-of-the-art method.
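
The idea of combining structural and content similarity can be illustrated with a minimal Python sketch. This is not the measure defined in the paper, which the abstract does not specify. Everything here is an assumption made for illustration: the character-class mask, the use of `difflib.SequenceMatcher` for the structural score, Jaccard token overlap for the content score, and the weight `alpha`. The function names are hypothetical.

```python
import re
from difflib import SequenceMatcher


def structure_mask(segment: str) -> str:
    """Collapse a segment into a coarse pattern (assumed, not the paper's).

    Runs of letters become 'A', runs of digits become '9', and
    punctuation is kept, so 'pp. 20-29' and 'pp. 337-348' share a mask.
    """
    mask = re.sub(r"[A-Za-z]+", "A", segment)
    mask = re.sub(r"\d+", "9", mask)
    mask = re.sub(r"\s+", " ", mask)
    return mask.strip()


def structural_similarity(a: str, b: str) -> float:
    """Similarity of the two masks, in [0, 1]."""
    return SequenceMatcher(None, structure_mask(a), structure_mask(b)).ratio()


def content_similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word tokens, in [0, 1]."""
    tokens_a = set(re.findall(r"\w+", a.lower()))
    tokens_b = set(re.findall(r"\w+", b.lower()))
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return len(tokens_a & tokens_b) / len(union)


def combined_similarity(a: str, b: str, alpha: float = 0.5) -> float:
    """Weighted mix of structure and content; alpha is a hypothetical knob."""
    return alpha * structural_similarity(a, b) + (1 - alpha) * content_similarity(a, b)


if __name__ == "__main__":
    # Two page-range segments share structure but little content.
    print(combined_similarity("pp. 20-29", "pp. 337-348"))
    # A page range and a venue name share neither.
    print(combined_similarity("pp. 20-29", "ACM SIGMOD"))
```

In this sketch, two page-range segments score highly on structure even though their tokens barely overlap, which is the kind of signal that could group segments without a matching knowledge-base entry.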

Author information

Authors and Affiliations

School of Information Technology and Electrical Engineering, University of Queensland, Australia
Dat T. Huynh, Shazia Sadiq & Xiaofang Zhou
School of Computer Science and Technology, Soochow University, China
Jiajie Xu & Xiaofang Zhou

Editor information

Editors and Affiliations

The University of New South Wales, Sydney, NSW, Australia
Xuemin Lin
Aristotle University of Thessaloniki, Thessaloniki, Greece
Yannis Manolopoulos
AT&T Labs-Research, Florham Park, NJ, USA
Divesh Srivastava
Victoria University, Melbourne, Australia
Guangyan Huang

About this paper

Cite this paper

Huynh, D.T., Xu, J., Sadiq, S., Zhou, X. (2013). Exploiting Structural Similarity for Automatic Information Extraction from Lists. In: Lin, X., Manolopoulos, Y., Srivastava, D., Huang, G. (eds) Web Information Systems Engineering – WISE 2013. WISE 2013. Lecture Notes in Computer Science, vol 8181. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41154-0_15

DOI: https://doi.org/10.1007/978-3-642-41154-0_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41153-3
Online ISBN: 978-3-642-41154-0
eBook Packages: Computer Science, Computer Science (R0)
