Block Clustering for Web Pages Categorization

Charrad, Malika; Lechevallier, Yves; Ahmed, Mohamed ben; Saporta, Gilbert

doi:10.1007/978-3-642-04394-9_32

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5788)

Included in the following conference series:

International Conference on Intelligent Data Engineering and Automated Learning

Abstract

With the growth of web-based applications and the rising popularity of the World Wide Web (WWW), the Web has become the largest source of information available in the world, which makes extracting relevant information increasingly difficult. Moreover, the content of web sites changes constantly, leading to continual changes in Web users’ behaviour. There is therefore significant interest in analysing web content data to better serve users. Our proposed approach, which is based on automatic textual analysis of a web site independently of its usage, attempts to define groups of documents dealing with the same topic. Both document clustering and word clustering are well-studied problems. However, most existing algorithms cluster documents and words separately rather than simultaneously. In this paper, we propose to apply a block clustering algorithm to categorize the pages of a web site according to their content. We report results of our recent tests of the CROKI2 algorithm on a tourist web site.
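
To make the idea of clustering documents and words simultaneously concrete, the following is a minimal sketch of a generic alternating block clustering scheme. It is not the CROKI2 algorithm evaluated in the paper: it simply alternates k-means-style reassignment of rows (pages) and columns (words) using squared Euclidean distance. The toy page-by-word count matrix, the word list, the seeding rule, and all function names are assumptions made for illustration only.

```python
# Illustrative sketch only: a simple alternating row/column clustering,
# not the CROKI2 algorithm. All names and data below are hypothetical.

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def transpose(M):
    return [list(col) for col in zip(*M)]

def group_means(M, labels, k):
    """Summarise each row of M by its mean value within each of the k column groups."""
    out = []
    for row in M:
        sums, counts = [0.0] * k, [0] * k
        for value, g in zip(row, labels):
            sums[g] += value
            counts[g] += 1
        out.append([s / c if c else 0.0 for s, c in zip(sums, counts)])
    return out

def reassign(points, labels, k):
    """One k-means step: recompute group centroids, then reassign every point."""
    centroids = {}
    for g in range(k):
        members = [p for p, lab in zip(points, labels) if lab == g]
        if members:  # empty groups get no centroid and are never chosen
            centroids[g] = [sum(col) / len(members) for col in zip(*members)]
    return [min(centroids, key=lambda g: sq_dist(p, centroids[g])) for p in points]

def nearest_seed(points, k):
    """Assumed initialisation: the first k points act as seeds."""
    return [min(range(k), key=lambda g: sq_dist(p, points[g])) for p in points]

def block_cluster(X, k, l, n_iter=10):
    """Alternate row and column reassignment until the labels stop changing."""
    rows = nearest_seed(X, k)
    cols = nearest_seed(transpose(X), l)
    for _ in range(n_iter):
        # Rows are compared through their profile over the current word groups.
        new_rows = reassign(group_means(X, cols, l), rows, k)
        # Columns are compared through their profile over the updated page groups.
        new_cols = reassign(group_means(transpose(X), new_rows, k), cols, l)
        if new_rows == rows and new_cols == cols:
            break
        rows, cols = new_rows, new_cols
    return rows, cols

# Hypothetical toy data: 4 pages x 4 words, with counts invented for illustration.
words = ["hotel", "beach", "room", "sea"]
X = [
    [5, 0, 4, 0],  # page about accommodation
    [0, 5, 0, 4],  # page about the seaside
    [4, 0, 5, 1],  # page about accommodation
    [1, 4, 0, 5],  # page about the seaside
]

rows, cols = block_cluster(X, k=2, l=2)
print("page groups:", rows)
print("word groups:", dict(zip(words, cols)))
```

In this sketch each page is described by its average count over the current word groups, and each word by its average count over the current page groups, so the two partitions inform each other. This is the property that separate document and word clustering lacks.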

Author information

Authors and Affiliations

National School of Computer Sciences, Manouba, 2010, Tunisia
Malika Charrad, Yves Lechevallier, Mohamed ben Ahmed & Gilbert Saporta
INRIA-Rocquencourt, 78153, Le Chesnay, cedex, France
Malika Charrad, Yves Lechevallier, Mohamed ben Ahmed & Gilbert Saporta
Conservatoire National des Arts et Métiers, 292 rue Saint-Martin, 75141, Paris, France
Malika Charrad, Yves Lechevallier, Mohamed ben Ahmed & Gilbert Saporta

Editor information

Editors and Affiliations

Escuela Politécnica Superior, Universidad de Burgos, Calle Francisco de Vitoria, S/N, Edificio C, 09006, Burgos, Spain
Emilio Corchado
School of Electrical and Electronic Engineering, University of Manchester, Sackville Street Building, Sackville Street, M60 1QD, Manchester, UK
Hujun Yin

Cite this paper

Charrad, M., Lechevallier, Y., Ahmed, M.b., Saporta, G. (2009). Block Clustering for Web Pages Categorization. In: Corchado, E., Yin, H. (eds) Intelligent Data Engineering and Automated Learning - IDEAL 2009. IDEAL 2009. Lecture Notes in Computer Science, vol 5788. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04394-9_32

DOI: https://doi.org/10.1007/978-3-642-04394-9_32
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04393-2
Online ISBN: 978-3-642-04394-9
eBook Packages: Computer Science, Computer Science (R0)