Web Page Clustering: A Hyperlink-Based Similarity and Matrix-Based Hierarchical Algorithms

  • Jingyu Hou
  • Yanchun Zhang
  • Jinli Cao
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2642)


This paper proposes a hyperlink-based web page similarity measurement and two matrix-based hierarchical web page clustering algorithms. The similarity measurement incorporates hyperlink transitivity and page importance within the web page space under consideration. One clustering algorithm takes cluster overlapping into account; the other does not. These algorithms do not require predefined similarity thresholds for clustering and are independent of the page order. The primary evaluations show the effectiveness of the proposed algorithms in improving clustering.
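The abstract does not give the similarity formula, so the following is only a minimal sketch of the general idea of a transitivity-aware, hyperlink-based similarity, not the measurement proposed in the paper. It assumes a 0/1 adjacency matrix, a hypothetical `decay` factor that discounts two-step links, and cosine similarity over combined out-link and in-link profiles. Page importance is not modelled here, and all function names are illustrative.

```python
import math

def transitive_weights(adj, decay=0.5):
    """Off-diagonal link weights: direct links plus decayed two-step links.

    `decay` is an assumed discount for indirect (length-2) paths.
    """
    n = len(adj)
    w = [[float(adj[i][j]) for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # ignore paths that return to the same page
            two_step = sum(adj[i][k] * adj[k][j] for k in range(n))
            w[i][j] += decay * two_step
    return w

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def page_similarity(w, i, j):
    """Compare pages i and j by their out-link rows and in-link columns."""
    n = len(w)
    profile_i = w[i] + [w[k][i] for k in range(n)]
    profile_j = w[j] + [w[k][j] for k in range(n)]
    return cosine(profile_i, profile_j)

# Toy graph: pages 0 and 1 both link to page 2; page 2 links to page 3.
adj = [
    [0, 0, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
w = transitive_weights(adj)
sim01 = page_similarity(w, 0, 1)  # identical link profiles
sim03 = page_similarity(w, 0, 3)  # no shared link profile entries
```

In this toy graph, pages 0 and 1 have the same direct and indirect links, so they score higher than the pair (0, 3), whose profiles do not overlap. A pairwise matrix of such scores is the kind of input a matrix-based hierarchical clustering step could operate on.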







Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Jingyu Hou (1)
  • Yanchun Zhang (2)
  • Jinli Cao (3)

  1. School of Information Technology, Deakin University, Melbourne, Australia
  2. Department of Mathematics and Computing, University of Southern Queensland, Toowoomba, Australia
  3. Department of Computer Science and Computer Engineering, La Trobe University, Melbourne, Australia
