Block Clustering for Web Pages Categorization

  • Conference paper
Intelligent Data Engineering and Automated Learning - IDEAL 2009 (IDEAL 2009)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5788)

Abstract

With the growth of web-based applications and the increased popularity of the World Wide Web (WWW), the WWW has become the largest source of information available in the world, which makes extracting relevant information increasingly difficult. Moreover, the content of web sites changes constantly, leading to continual changes in Web users' behaviour. There is therefore significant interest in analysing web content data to better serve users. Our approach, which is based on automatic textual analysis of a web site independently of its usage, attempts to define groups of documents dealing with the same topic. Document clustering and word clustering are both well-studied problems. However, most existing algorithms cluster documents and words separately rather than simultaneously. In this paper, we propose to apply a block clustering algorithm to categorize the pages of a web site according to their content. We report the results of our recent tests of the CROKI2 algorithm on a tourist web site.
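
To make the idea of clustering documents and words simultaneously concrete, here is a minimal sketch in plain Python. It is not the authors' CROKI2 implementation: it is a simplified alternating scheme that reassigns pages and then words using a chi-square distance between profiles, and every name in it (`reassign`, `block_cluster`, the toy page-by-word count table, the modulo initialisation) is an assumption made for illustration only.

```python
from math import inf


def reassign(M, own, other, n_own, n_other):
    """Move each row of M to the cluster whose chi-square profile is nearest."""
    # Collapse the columns of M into the current clusters of the other axis.
    U = [[0.0] * n_other for _ in M]
    for i, row in enumerate(M):
        for j, v in enumerate(row):
            U[i][other[j]] += v
    # Sum the collapsed rows of each cluster to obtain cluster prototypes.
    G = [[0.0] * n_other for _ in range(n_own)]
    for i, u in enumerate(U):
        for l, v in enumerate(u):
            G[own[i]][l] += v
    col = [sum(c) for c in zip(*G)]  # masses used as chi-square weights
    new = []
    for i, u in enumerate(U):
        best, best_d = own[i], inf
        su = sum(u)
        for k, g in enumerate(G):
            sg = sum(g)
            if su == 0 or sg == 0:
                continue  # empty rows keep their label; empty clusters are skipped
            d = sum((u[l] / su - g[l] / sg) ** 2 / col[l]
                    for l in range(n_other) if col[l] > 0)
            if d < best_d:
                best, best_d = k, d
        new.append(best)
    return new


def block_cluster(N, K, L, max_iter=20):
    """Alternate row and column reassignment until both partitions are stable."""
    z = [i % K for i in range(len(N))]     # arbitrary initial page clusters
    w = [j % L for j in range(len(N[0]))]  # arbitrary initial word clusters
    Nt = [list(c) for c in zip(*N)]        # transpose: words become rows
    for _ in range(max_iter):
        new_z = reassign(N, z, w, K, L)
        new_w = reassign(Nt, w, new_z, L, K)
        if new_z == z and new_w == w:
            break
        z, w = new_z, new_w
    return z, w


# Hypothetical counts: 5 pages x 4 words (hotel, room, beach, sun).
counts = [
    [4, 3, 0, 1],
    [3, 4, 1, 0],
    [0, 1, 4, 3],
    [1, 0, 3, 4],
    [0, 0, 5, 3],
]
page_labels, word_labels = block_cluster(counts, K=2, L=2)
print(page_labels, word_labels)
```

On this toy table the two pages dominated by "hotel" and "room" should land in one page cluster and the three "beach" and "sun" pages in the other, with the words split the same way. The cluster numbers themselves are arbitrary; only the grouping is meaningful.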

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Charrad, M., Lechevallier, Y., Ahmed, M.b., Saporta, G. (2009). Block Clustering for Web Pages Categorization. In: Corchado, E., Yin, H. (eds) Intelligent Data Engineering and Automated Learning - IDEAL 2009. IDEAL 2009. Lecture Notes in Computer Science, vol 5788. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04394-9_32

  • DOI: https://doi.org/10.1007/978-3-642-04394-9_32

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-04393-2

  • Online ISBN: 978-3-642-04394-9

  • eBook Packages: Computer Science (R0)
