Similarity Based Hierarchical Clustering with an Application to Text Collections

  • Julien Ah-Pine
  • Xinyu Wang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9897)


The Lance-Williams formula is a framework that unifies seven schemes of agglomerative hierarchical clustering. In this paper, we establish a new expression of this formula using cosine similarities instead of distances, and we state the conditions under which the new formula is equivalent to the original one. The interest of our approach is twofold. First, it naturally extends agglomerative hierarchical clustering techniques to kernel functions. Second, reasoning in terms of similarities allows us to design thresholding strategies on proximity values. We therefore propose to sparsify the similarity matrix in order to make these clustering techniques more efficient. We apply our approach to text clustering tasks. Our results show that sparsifying the inner product matrix considerably decreases memory usage and shortens running time while preserving clustering quality.
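To make the two ingredients of the abstract concrete, the following is a minimal Python sketch. It assumes the standard textbook form of the Lance-Williams update on dissimilarities, d(k, i∪j) = α_i d(k,i) + α_j d(k,j) + β d(i,j) + γ |d(k,i) − d(k,j)|, shown for three of the seven schemes. It does not reproduce the similarity-based expression derived in the paper, which is not given in this abstract. The function names `lw_coefficients`, `lw_update`, and `sparsify`, and the simple "zero out values below a threshold" rule, are hypothetical illustrations and not necessarily the authors' strategy.

```python
import numpy as np


def lw_coefficients(method, n_i, n_j):
    """Return (alpha_i, alpha_j, beta, gamma) for a classical linkage scheme.

    n_i and n_j are the sizes of the two clusters being merged.
    Only three of the seven classical schemes are listed in this sketch.
    """
    if method == "single":
        return 0.5, 0.5, 0.0, -0.5
    if method == "complete":
        return 0.5, 0.5, 0.0, 0.5
    if method == "average":
        n = n_i + n_j
        return n_i / n, n_j / n, 0.0, 0.0
    raise ValueError(f"unsupported method: {method}")


def lw_update(d_ki, d_kj, d_ij, method, n_i=1, n_j=1):
    """Dissimilarity between cluster k and the merged cluster (i u j)."""
    a_i, a_j, beta, gamma = lw_coefficients(method, n_i, n_j)
    return a_i * d_ki + a_j * d_kj + beta * d_ij + gamma * abs(d_ki - d_kj)


def sparsify(S, threshold):
    """Hypothetical thresholding rule: zero out similarities below threshold."""
    S = np.array(S, dtype=float, copy=True)
    S[S < threshold] = 0.0
    return S


if __name__ == "__main__":
    # Cosine similarities are inner products of L2-normalised rows.
    X = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = X @ X.T
    print(sparsify(S, 0.5))
    print(lw_update(1.0, 3.0, 2.0, "single"))
```

With single linkage the update reduces to the minimum of the two dissimilarities, and with complete linkage to the maximum, which is a quick sanity check on the coefficients. In practice a thresholded matrix would be stored in a sparse format to obtain the memory savings the abstract refers to; the dense array here is kept only for brevity.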


Agglomerative hierarchical clustering; Lance-Williams formula; Scalable hierarchical clustering; Kernel machines; Text clustering



This work was supported by the French national project Request PIA/FSN.



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  1. University of Lyon, Eric Lab, Bron Cedex, France
