Abstract
Record segmentation is a core problem in structured web data extraction. In this paper we present a novel technique that segments structured web data into individual data records originating from an underlying database. The proposed technique exploits both visual and structural features of web page elements to group them into semantically similar clusters. The resulting clusters reflect the page structure and are used to segment data records. During the segmentation process the technique also generates XPath expressions. These expressions can later be used to extract data records directly from web pages generated from the same template, without repeating the clustering and segmentation steps. The extracted structured data can be reused in a wide range of applications, such as price comparison portals, meta-search engines, and knowledge bases. Experimental evaluation of the proposed technique on three publicly available benchmark data sets demonstrates near-perfect precision and recall.
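The reuse of a generated XPath expression across pages of the same template can be illustrated with a minimal sketch. This is not the technique proposed in the paper: it ignores visual features entirely and groups elements only by their tag path, which is a much cruder structural signal. The sample pages, the function names (`tag_paths`, `induce_record_xpath`, `extract_records`) and the "largest repeated container" heuristic are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Two hypothetical pages produced by the same template (well-formed for ElementTree).
PAGE_A = """<html><body>
<div class="header"><span>Shop</span></div>
<ul class="results">
<li><span class="title">Laptop</span><span class="price">900</span></li>
<li><span class="title">Phone</span><span class="price">500</span></li>
<li><span class="title">Tablet</span><span class="price">300</span></li>
</ul>
</body></html>"""

PAGE_B = """<html><body>
<div class="header"><span>Shop</span></div>
<ul class="results">
<li><span class="title">Monitor</span><span class="price">200</span></li>
<li><span class="title">Keyboard</span><span class="price">40</span></li>
</ul>
</body></html>"""


def tag_paths(root):
    """Yield (tag path, element) for every element, with paths relative to root."""
    def walk(elem, path):
        yield path, elem
        for child in elem:
            yield from walk(child, f"{path}/{child.tag}")
    yield from walk(root, ".")


def induce_record_xpath(root):
    """Cluster elements by tag path and return the path of the largest
    cluster of repeated container elements (assumed heuristic)."""
    clusters = defaultdict(list)
    for path, elem in tag_paths(root):
        clusters[path].append(elem)
    # Keep clusters that repeat and whose members all have child elements.
    candidates = {
        path: elems for path, elems in clusters.items()
        if len(elems) >= 2 and all(len(e) > 0 for e in elems)
    }
    return max(candidates, key=lambda p: len(candidates[p]))


def extract_records(root, xpath):
    """Apply a previously induced XPath and return each record's field texts."""
    return [[field.text for field in record] for record in root.findall(xpath)]


# Induce the expression once on one page, then reuse it on another page.
record_xpath = induce_record_xpath(ET.fromstring(PAGE_A))
records_b = extract_records(ET.fromstring(PAGE_B), record_xpath)
print(record_xpath)
print(records_b)
```

The point of the sketch is the division of labour: the clustering step runs once, and every later page from the same template needs only a single XPath evaluation. Real web pages are rarely well-formed XML, so a practical version would need an HTML-tolerant parser and a richer similarity measure than tag paths alone.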
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Grigalis, T., Čenys, A. (2012). Generating Xpath Expressions for Structured Web Data Record Segmentation. In: Skersys, T., Butleris, R., Butkiene, R. (eds) Information and Software Technologies. ICIST 2012. Communications in Computer and Information Science, vol 319. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33308-8_4
DOI: https://doi.org/10.1007/978-3-642-33308-8_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33307-1
Online ISBN: 978-3-642-33308-8
eBook Packages: Computer Science, Computer Science (R0)