Abstract
We compare three known semantic web page segmentation algorithms, each serving as an example of a particular approach to the problem, and one self-developed algorithm, WebTerrain, that combines two of the approaches. We compare the performance of the four algorithms on a large benchmark of modern websites that we constructed, examining each algorithm in a total of eight configurations. We found that all algorithms performed better, on average, on random pages than on popular pages, and that results are better when the algorithms run on the HTML obtained from the DOM rather than on the plain HTML. Overall, there is much room for improvement: the best average F-score we found is 0.49, indicating that currently available algorithms are not yet of practical use for modern websites.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Kreuzer, R., Hage, J., Feelders, A. (2015). A Quantitative Comparison of Semantic Web Page Segmentation Approaches. In: Cimiano, P., Frasincar, F., Houben, GJ., Schwabe, D. (eds) Engineering the Web in the Big Data Era. ICWE 2015. Lecture Notes in Computer Science, vol 9114. Springer, Cham. https://doi.org/10.1007/978-3-319-19890-3_24
DOI: https://doi.org/10.1007/978-3-319-19890-3_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19889-7
Online ISBN: 978-3-319-19890-3
eBook Packages: Computer Science (R0)