Generating Xpath Expressions for Structured Web Data Record Segmentation

Grigalis, Tomas; Čenys, Antanas

doi:10.1007/978-3-642-33308-8_4

Part of the book series: Communications in Computer and Information Science (CCIS, volume 319)

Included in the following conference series:

International Conference on Information and Software Technologies

Abstract

Record segmentation is a core problem in structured web data extraction. In this paper we present a novel technique that segments structured web data into individual data records originating from an underlying database. The proposed technique exploits both visual and structural features of web page elements to group them into semantically similar clusters. The resulting clusters reflect the page structure and are used to segment data records. During the segmentation process the technique also generates XPath expressions. These expressions can later be used to extract data records directly from web pages generated from the same template, without repeating the clustering and segmentation processes. The extracted structured data can be reused in a wide range of applications, such as price comparison portals, meta-search, and knowledge bases. An experimental evaluation of the proposed technique on three publicly available benchmark data sets demonstrates nearly perfect precision and recall.
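
The idea of deriving an XPath expression once and reusing it on other pages built from the same template can be illustrated with a minimal Python sketch. This is not the authors' algorithm: it clusters elements by structural signature only (tag path plus class attribute) and ignores the visual features the abstract mentions. The two sample pages, the function names cluster_by_structure, record_xpath and extract_records, and the rule "the largest repeated cluster whose elements have children is the record cluster" are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical pages generated from the same template (well-formed XHTML).
TRAINING_PAGE = """<html><body>
<div class="header">Shop</div>
<ul class="results">
  <li class="item"><span class="title">Red kettle</span><span class="price">19.99</span></li>
  <li class="item"><span class="title">Blue teapot</span><span class="price">24.50</span></li>
  <li class="item"><span class="title">Green mug</span><span class="price">7.25</span></li>
</ul>
</body></html>"""

NEW_PAGE = """<html><body>
<div class="header">Shop</div>
<ul class="results">
  <li class="item"><span class="title">Black pan</span><span class="price">31.00</span></li>
  <li class="item"><span class="title">White bowl</span><span class="price">5.10</span></li>
</ul>
</body></html>"""


def cluster_by_structure(root):
    """Group elements that share the same tag path and class attributes.

    Assumption: structural similarity alone stands in for the combined
    visual and structural features used in the paper.
    """
    clusters = defaultdict(list)

    def walk(elem, steps):
        step = elem.tag
        if elem.get("class"):
            step += "[@class='%s']" % elem.get("class")
        steps = steps + [step]
        # The key doubles as an XPath expression relative to the root element.
        clusters["./" + "/".join(steps)].append(elem)
        for child in elem:
            walk(child, steps)

    for child in root:
        walk(child, [])
    return clusters


def record_xpath(root):
    """Pick the largest cluster of repeated elements that contain children."""
    clusters = cluster_by_structure(root)
    candidates = {
        path: elems
        for path, elems in clusters.items()
        if len(elems) > 1 and len(elems[0]) > 0
    }
    return max(candidates, key=lambda path: len(candidates[path]))


def extract_records(root, xpath):
    """Apply a previously generated XPath; no clustering is repeated here."""
    return [[field.text for field in record] for record in root.findall(xpath)]


xpath = record_xpath(ET.fromstring(TRAINING_PAGE))
records = extract_records(ET.fromstring(NEW_PAGE), xpath)
print(xpath)    # ./body/ul[@class='results']/li[@class='item']
print(records)  # [['Black pan', '31.00'], ['White bowl', '5.10']]
```

The expression is computed from the first page only. The second page is processed with a single findall call, which mirrors the reuse scenario described in the abstract. Real HTML is rarely well-formed XML, so a practical version would first need a cleanup step or an HTML-aware parser.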

Author information

Authors and Affiliations

Department of Information Systems, Vilnius Gediminas Technical University, Vilnius, Lithuania
Tomas Grigalis & Antanas Čenys

Editor information

Editors and Affiliations

Kaunas University of Technology, Studentu g. 50-313a, LT-51368, Kaunas, Lithuania
Tomas Skersys & Rimantas Butleris
Kaunas University of Technology, Studentu g. 50-309a, LT-51368, Kaunas, Lithuania
Rita Butkiene

About this paper

Cite this paper

Grigalis, T., Čenys, A. (2012). Generating Xpath Expressions for Structured Web Data Record Segmentation. In: Skersys, T., Butleris, R., Butkiene, R. (eds) Information and Software Technologies. ICIST 2012. Communications in Computer and Information Science, vol 319. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33308-8_4

DOI: https://doi.org/10.1007/978-3-642-33308-8_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33307-1
Online ISBN: 978-3-642-33308-8
eBook Packages: Computer Science, Computer Science (R0)
