Abstract
The growth of the Internet has produced an immense number of user accesses to WWW resources. These accesses are recorded in web server log files, a rich data source for discovering patterns and rules of user browsing behavior, and their availability has given rise to Web usage mining. Current Web usage mining applications rely exclusively on web server log files. The main hypothesis of this paper is that Web content analysis can improve Web usage mining results. We propose a system that integrates Web page clustering into association rule mining of log files and uses the cluster labels as indicators of Web page content. We demonstrate that novel and interesting association rules can be mined from the combined data source. These rules can be used in various applications, including Web user profiling and Web site construction. We experiment with several approaches to content clustering, using keyword-based and character n-gram-based clustering with different distance measures and parameter settings. Our evaluation shows that character n-gram-based clustering scores about three times better than word-based clustering on an internal quality measure, whereas word-based cluster profiles are easier to summarize manually. We also show that high-quality rules can be extracted from the combined dataset.
This work is supported by NSERC.
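To make the idea of the combined data source concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each user session is treated as a transaction, that every visited page contributes its content-cluster label as an extra item, and that rules are scored by the standard support and confidence measures. The page URLs, cluster labels, thresholds, and the helper names `augment` and `mine_pair_rules` are all hypothetical, and only rules between single items are mined.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sessions reconstructed from a server log (URLs are made up).
sessions = [
    ["/courses/ai.html", "/people/faculty.html"],
    ["/courses/ml.html", "/people/faculty.html"],
    ["/courses/ai.html", "/news/index.html"],
    ["/courses/ml.html", "/people/faculty.html", "/news/index.html"],
]

# Hypothetical output of a content-clustering step: page -> cluster label.
page_cluster = {
    "/courses/ai.html": "cluster:courses",
    "/courses/ml.html": "cluster:courses",
    "/people/faculty.html": "cluster:people",
    "/news/index.html": "cluster:news",
}


def augment(session, page_cluster):
    """Return the session as an item set: its pages plus their cluster labels."""
    items = set(session)
    items.update(page_cluster[p] for p in session if p in page_cluster)
    return frozenset(items)


def mine_pair_rules(transactions, page_cluster, min_support, min_confidence):
    """Mine rules lhs -> rhs between single items by support and confidence."""
    n = len(transactions)
    item_count = Counter()
    pair_count = Counter()
    for t in transactions:
        item_count.update(t)
        pair_count.update(combinations(sorted(t), 2))

    rules = {}
    for (a, b), count in pair_count.items():
        support = count / n
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            # Skip trivial rules that link a page to its own cluster label.
            if page_cluster.get(lhs) == rhs or page_cluster.get(rhs) == lhs:
                continue
            confidence = count / item_count[lhs]
            if confidence >= min_confidence:
                rules[(lhs, rhs)] = (support, confidence)
    return rules


transactions = [augment(s, page_cluster) for s in sessions]
rules = mine_pair_rules(transactions, page_cluster,
                        min_support=0.5, min_confidence=0.7)

for (lhs, rhs), (support, confidence) in sorted(rules.items()):
    print(f"{lhs} -> {rhs}  support={support:.2f}  confidence={confidence:.2f}")
```

In this toy data, no single course page co-occurs with the faculty page often enough to support a rule from `/courses/ai.html`, but the two course pages share a cluster label, so a rule from `cluster:courses` to `/people/faculty.html` passes both thresholds. This is the kind of rule that the log file alone would not yield.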
© 2005 Springer-Verlag Berlin Heidelberg
Cite this paper
Guo, J., Kešelj, V., Gao, Q. (2005). Integrating Web Content Clustering into Web Log Association Rule Mining. In: Kégl, B., Lapalme, G. (eds) Advances in Artificial Intelligence. Canadian AI 2005. Lecture Notes in Computer Science, vol 3501. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11424918_19
DOI: https://doi.org/10.1007/11424918_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25864-3
Online ISBN: 978-3-540-31952-8
eBook Packages: Computer Science (R0)