Integrating Web Content Clustering into Web Log Association Rule Mining

Guo, Jiayun; Kešelj, Vlado; Gao, Qigang

doi:10.1007/11424918_19

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3501)

Included in the following conference series:

Conference of the Canadian Society for Computational Studies of Intelligence

Abstract

One effect of the Internet's growth is an immense number of user accesses to WWW resources. These accesses are recorded in web server log files, a rich data source for discovering patterns and rules of user browsing behavior, and this has driven the rise of Web usage mining technologies. Current Web usage mining applications rely exclusively on web server log files. The main hypothesis of this paper is that Web content analysis can improve Web usage mining results. We propose a system that integrates Web page clustering into log file association rule mining and uses the cluster labels as indicators of Web page content. We demonstrate that novel and interesting association rules can be mined from the combined data source; these rules can be used in various applications, including Web user profiling and Web site construction. We experiment with several approaches to content clustering, using keyword-based and character n-gram-based clustering with different distance measures and parameter settings. The evaluation shows that character n-gram-based clustering scores about three times better than word-based clustering on an internal quality measure, whereas word-based cluster profiles are easier to summarize manually. The rules extracted from the combined dataset are also shown to be of high quality.
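
To make the integration idea concrete, the following is a minimal sketch, not the system described in the paper. It assumes that each page has already been assigned a cluster label by some content clustering step, adds those labels to each log session as extra items, and then mines simple one-to-one association rules by support and confidence. The function names `augment_sessions` and `mine_pair_rules`, the `cluster:` prefix, the page names, and the thresholds are all hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import combinations


def augment_sessions(sessions, page_cluster):
    """Turn each session into an item set of pages plus their cluster labels.

    `page_cluster` is an assumed, precomputed mapping from page to label.
    """
    augmented = []
    for pages in sessions:
        items = set(pages)
        items |= {f"cluster:{page_cluster[p]}" for p in pages if p in page_cluster}
        augmented.append(items)
    return augmented


def mine_pair_rules(transactions, min_support=0.5, min_confidence=0.6):
    """Return sorted rules (lhs, rhs, support, confidence) over item pairs."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for t in transactions:
        item_counts.update(t)
        # Sorting gives each unordered pair a single canonical key.
        pair_counts.update(combinations(sorted(t), 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair can yield a rule in either direction.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


if __name__ == "__main__":
    # Hypothetical sessions and cluster labels, for illustration only.
    sessions = [
        ["/courses.html", "/cs101.html"],
        ["/courses.html", "/cs102.html"],
        ["/faculty.html", "/cs101.html"],
        ["/courses.html", "/news.html"],
    ]
    page_cluster = {
        "/courses.html": "index",
        "/cs101.html": "course",
        "/cs102.html": "course",
        "/faculty.html": "people",
        "/news.html": "news",
    }
    for rule in mine_pair_rules(augment_sessions(sessions, page_cluster)):
        print(rule)
```

In this toy data no pair of individual pages is frequent enough to form a rule, but a rule from `/courses.html` to `cluster:course` does reach the support threshold, because two different course pages share one label. The sketch also emits trivial rules from a page to its own cluster label, which a real pipeline would presumably filter out.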

This work is supported by NSERC.

Author information

Authors and Affiliations

Faculty of Computer Science, Dalhousie University, 6050 University Avenue, Halifax, NS, B3H 1W5, Canada
Jiayun Guo, Vlado Kešelj & Qigang Gao

Editor information

Editors and Affiliations

Département d’informatique et de recherche opérationnelle, CP 6128 succ. Centre-Ville, Université de Montréal, H3C 3J7, Montréal, Canada
Balázs Kégl
Département d’informatique et de recherche opérationnelle, Université de Montréal,
Guy Lapalme

About this paper

Cite this paper

Guo, J., Kešelj, V., Gao, Q. (2005). Integrating Web Content Clustering into Web Log Association Rule Mining. In: Kégl, B., Lapalme, G. (eds) Advances in Artificial Intelligence. Canadian AI 2005. Lecture Notes in Computer Science, vol 3501. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11424918_19

DOI: https://doi.org/10.1007/11424918_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25864-3
Online ISBN: 978-3-540-31952-8
eBook Packages: Computer Science (R0)
