Abstract
Automatic classification of webpages has several applications in industry, including digital marketing, search engines and content filtering. Traditionally, this classification has relied only on the textual information of webpages, which includes the HTML code, tags, title and, more recently, the URL. The aim of this paper is to show that, for some subjective variables that are nonetheless very important to the applications mentioned, the visual information of webpages as rendered by the browser carries extremely rich content for the classification task. The variables studied are aesthetic value (whether pages are beautiful or ugly) and design recency (whether pages look old-fashioned or modern). We show that automatic classifiers relying only on the visual look and feel can achieve very high accuracies. By using several low-level and mid-level features and studying several criteria for selection and classification, our classifiers improve on the state of the art. Finally, we applied this framework to classify webpages by topic (content aware) and to classify whether or not a page is a blog (functional aware).
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Goncalves, N., Videira, A. (2015). Automatic Web Page Classification Using Visual Content for Subjective and Functional Variables. In: Monfort, V., Krempels, KH. (eds) Web Information Systems and Technologies. WEBIST 2014. Lecture Notes in Business Information Processing, vol 226. Springer, Cham. https://doi.org/10.1007/978-3-319-27030-2_18
DOI: https://doi.org/10.1007/978-3-319-27030-2_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27029-6
Online ISBN: 978-3-319-27030-2
eBook Packages: Computer Science, Computer Science (R0)