Generating Xpath Expressions for Structured Web Data Record Segmentation

  • Conference paper
Information and Software Technologies (ICIST 2012)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 319)

Abstract

Record segmentation is a core problem in structured web data extraction. In this paper we present a novel technique that segments structured web data into individual data records originating from an underlying database. The proposed technique exploits both visual and structural features of web page elements to group them into semantically similar clusters. The resulting clusters reflect the page structure and are used to segment data records. During the segmentation process the technique also generates XPath expressions. These expressions can later be used to extract data records directly from web pages generated from the same template, without repeating the clustering and segmentation processes. The extracted structured data can be reused in a wide range of applications, such as price comparison portals, meta-search, and knowledge bases. Experimental evaluation of the proposed technique on three publicly available benchmark data sets demonstrates nearly perfect precision and recall.
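
As a rough illustration of the reuse step described in the abstract, the sketch below applies an already generated record-level expression to a second page built from the same template. It does not reproduce the paper's clustering or expression generation. The page markup, the class names, the expression in RECORD_XPATH, and the helper extract_records are all hypothetical. The pages are written as well-formed XHTML so that Python's standard-library parser, which supports only a subset of XPath, can read them.

```python
import xml.etree.ElementTree as ET

# Hypothetical pages generated from one template (well-formed XHTML).
PAGE_1 = """
<html><body>
  <div id="nav"><a href="/">Home</a></div>
  <ul class="results">
    <li class="item"><span class="name">Laptop A</span><span class="price">499</span></li>
    <li class="item"><span class="name">Laptop B</span><span class="price">649</span></li>
  </ul>
</body></html>
"""

PAGE_2 = """
<html><body>
  <div id="nav"><a href="/">Home</a></div>
  <ul class="results">
    <li class="item"><span class="name">Phone C</span><span class="price">299</span></li>
  </ul>
</body></html>
"""

# Assumed result of an earlier segmentation run: one expression that
# selects every data record on pages built from this template.
RECORD_XPATH = ".//ul[@class='results']/li[@class='item']"


def extract_records(page_source, record_xpath):
    """Return one dict per record node, keyed by each child's class attribute."""
    root = ET.fromstring(page_source)
    records = []
    for node in root.findall(record_xpath):
        fields = {child.get("class"): (child.text or "").strip() for child in node}
        records.append(fields)
    return records


records_page_1 = extract_records(PAGE_1, RECORD_XPATH)
records_page_2 = extract_records(PAGE_2, RECORD_XPATH)
```

Real web pages are rarely well-formed XML, so a practical pipeline would likely need an HTML-tolerant parser and a full XPath engine; the standard-library parser is used here only to keep the sketch self-contained.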

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Grigalis, T., Čenys, A. (2012). Generating Xpath Expressions for Structured Web Data Record Segmentation. In: Skersys, T., Butleris, R., Butkiene, R. (eds) Information and Software Technologies. ICIST 2012. Communications in Computer and Information Science, vol 319. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33308-8_4

  • DOI: https://doi.org/10.1007/978-3-642-33308-8_4

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-33307-1

  • Online ISBN: 978-3-642-33308-8

  • eBook Packages: Computer Science, Computer Science (R0)
