${{\textsc {ber}}}_{y}{\textsc {l}}$: A System for Web Block Classification

Kravchenko, Andrey

doi:10.1007/978-3-662-58039-4_4

Part of the book series: Lecture Notes in Computer Science (TCOMPUTATSCIE, volume 10990)

Abstract

Web blocks such as navigation menus, advertisements, headers, and footers are key components of Web pages that define not only the appearance, but also the way humans interact with different parts of the page. For machines, however, classifying and interacting with these blocks is a surprisingly hard task. Yet, Web block classification has varied applications in the fields of wrapper induction, assistance to visually impaired people, Web adaptation, Web page topic clustering, and Web search. Our system for Web block classification, ${{\textsc {ber}}}_{y}{\textsc {l}}$, performs automated classification of Web blocks through a combination of machine learning and declarative, model-driven feature extraction based on Datalog rules. ${{\textsc {ber}}}_{y}{\textsc {l}}$ uses refined feature sets for the classification of individual blocks to achieve accurate classification for all the block types we have observed so far. The high accuracy is achieved through these carefully selected features, some even tuned to the specific block type. At the same time, ${{\textsc {ber}}}_{y}{\textsc {l}}$ avoids a high cost of feature engineering through a model-driven rather than programmatic approach to extracting features. Not only does this reduce the time for feature engineering, the model-driven, declarative approach also allows for semi-automatic optimisation of the feature extraction system. We perform evaluation to validate these claims on a selected range of Web blocks.
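
The combination of declarative feature extraction and machine learning described above can be illustrated with a minimal sketch. It is not the system's actual implementation: the facts, the two feature rules (`link_density`, `near_top`), the block labels, and the toy nearest-centroid classifier are all hypothetical stand-ins, and plain Python expressions only approximate what a Datalog rule would express.

```python
from collections import defaultdict

# Hypothetical facts about page blocks: (predicate, block_id, value).
FACTS = [
    ("link_count", "b1", 12), ("word_count", "b1", 14), ("top", "b1", 0),
    ("link_count", "b2", 1), ("word_count", "b2", 240), ("top", "b2", 300),
    ("link_count", "b3", 10), ("word_count", "b3", 11), ("top", "b3", 5),
]

# Declarative feature definitions (assumed for illustration). Each entry
# loosely mimics a Datalog rule such as
#   link_density(B, D) :- link_count(B, L), word_count(B, W), D = L / W.
FEATURE_RULES = {
    "link_density": lambda f: f["link_count"] / max(f["word_count"], 1),
    "near_top": lambda f: 1.0 if f["top"] < 100 else 0.0,
}

def extract_features(facts):
    """Group facts per block, then evaluate every feature rule on each block."""
    per_block = defaultdict(dict)
    for predicate, block, value in facts:
        per_block[block][predicate] = value
    return {
        block: [rule(f) for rule in FEATURE_RULES.values()]
        for block, f in per_block.items()
    }

def nearest_centroid(train, labels, query):
    """Toy classifier: return the label whose mean feature vector is closest."""
    sums, counts = {}, defaultdict(int)
    for block, vec in train.items():
        label = labels[block]
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[label] += 1

    def dist(label):
        centroid = [s / counts[label] for s in sums[label]]
        return sum((a - b) ** 2 for a, b in zip(centroid, query))

    return min(sums, key=dist)

features = extract_features(FACTS)
labels = {"b1": "navigation_menu", "b2": "main_content"}  # labelled examples
train = {b: features[b] for b in labels}
prediction = nearest_centroid(train, labels, features["b3"])
```

Because the feature definitions live in a table rather than in classifier code, adding or tuning a feature for a specific block type means editing one entry, which is the kind of reduced engineering cost the abstract attributes to a model-driven approach.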

This work was supported by the EPSRC programme grant EP/M025268/1 “VADA: Value Added Data Systems – Principles and Architecture”.

Author information

Authors and Affiliations

Department of Computer Science, University of Oxford, Oxford, UK
Andrey Kravchenko

Corresponding author

Correspondence to Andrey Kravchenko.

Editor information

Editors and Affiliations

University of Calgary, Calgary, AB, Canada
Marina L. Gavrilova
Sardina Systems OÜ, Tallinn, Estonia
C.J. Kenneth Tan

About this chapter

Cite this chapter

Kravchenko, A. (2018). ${{\textsc {ber}}}_{y}{\textsc {l}}$: A System for Web Block Classification. In: Gavrilova, M., Tan, C. (eds) Transactions on Computational Science XXXIII. Lecture Notes in Computer Science, vol 10990. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-58039-4_4

DOI: https://doi.org/10.1007/978-3-662-58039-4_4
Published: 16 September 2018
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-58038-7
Online ISBN: 978-3-662-58039-4
eBook Packages: Computer Science; Computer Science (R0)
