BERyL: A System for Web Block Classification

  • Andrey Kravchenko
Chapter
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10990)

Abstract

Web blocks such as navigation menus, advertisements, headers, and footers are key components of Web pages that define not only the appearance of a page, but also the way humans interact with its different parts. For machines, however, classifying and interacting with these blocks is a surprisingly hard task. Yet Web block classification has varied applications in wrapper induction, assistance to visually impaired people, Web adaptation, Web page topic clustering, and Web search. Our system for Web block classification, BERyL, performs automated classification of Web blocks through a combination of machine learning and declarative, model-driven feature extraction based on Datalog rules. BERyL uses refined feature sets for the classification of individual blocks to achieve accurate classification for all the block types we have observed so far. This high accuracy is achieved through carefully selected features, some of them tuned to a specific block type. At the same time, BERyL avoids the high cost of feature engineering by taking a model-driven rather than programmatic approach to extracting features. This not only reduces the time needed for feature engineering; the model-driven, declarative approach also allows for semi-automatic optimisation of the feature extraction system. We evaluate the system on a selected range of Web blocks to validate these claims.
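To make the general idea of pairing declarative feature rules with a learned classifier more concrete, here is a minimal Python sketch. It is not BERyL's implementation: the `Block` fields, the four feature rules, their thresholds, the toy training data, and the nearest-centroid classifier are all assumptions chosen for illustration. Plain Python predicates stand in for Datalog rules.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Block:
    """Hypothetical facts about one Web block (not the paper's schema)."""
    num_links: int
    num_words: int
    top: float    # vertical position as a fraction of page height (0 = top)
    width: float  # width as a fraction of page width


# Declarative feature model: each feature is a named rule over block facts,
# loosely analogous to a Datalog rule such as
#   link_dense(B) :- num_links(B, L), num_words(B, W), W > 0, 2 * L >= W.
# Rule names and thresholds are illustrative assumptions.
FEATURE_RULES: Dict[str, Callable[[Block], bool]] = {
    "link_dense": lambda b: b.num_words > 0 and 2 * b.num_links >= b.num_words,
    "near_top": lambda b: b.top <= 0.2,
    "near_bottom": lambda b: b.top >= 0.8,
    "full_width": lambda b: b.width >= 0.9,
}


def extract_features(block: Block) -> Tuple[int, ...]:
    """Evaluate every rule on the block, giving a 0/1 feature vector."""
    return tuple(int(rule(block)) for rule in FEATURE_RULES.values())


def train_centroids(
    examples: List[Tuple[Block, str]]
) -> Dict[str, Tuple[float, ...]]:
    """Average the feature vectors of each label (nearest-centroid training)."""
    grouped: Dict[str, List[Tuple[int, ...]]] = {}
    for block, label in examples:
        grouped.setdefault(label, []).append(extract_features(block))
    return {
        label: tuple(sum(column) / len(vectors) for column in zip(*vectors))
        for label, vectors in grouped.items()
    }


def classify(block: Block, centroids: Dict[str, Tuple[float, ...]]) -> str:
    """Return the label whose centroid is closest in squared distance."""
    vector = extract_features(block)
    return min(
        centroids,
        key=lambda label: sum(
            (v - c) ** 2 for v, c in zip(vector, centroids[label])
        ),
    )


# Toy, hand-made training data (an assumption, not data from the paper).
TRAINING = [
    (Block(num_links=8, num_words=10, top=0.05, width=1.0), "navigation_menu"),
    (Block(num_links=6, num_words=8, top=0.10, width=0.95), "navigation_menu"),
    (Block(num_links=5, num_words=40, top=0.95, width=1.0), "footer"),
    (Block(num_links=3, num_words=30, top=0.90, width=1.0), "footer"),
]

CENTROIDS = train_centroids(TRAINING)
```

Because the features live in a table of named rules rather than in extraction code, adding or tuning a feature for a particular block type only means editing `FEATURE_RULES`; the extraction and classification functions stay unchanged.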

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Computer Science, University of Oxford, Oxford, UK
