Agglomerative Hierarchical Clustering with Constraints: Theoretical and Empirical Results

Davidson, Ian; Ravi, S. S.

doi:10.1007/11564126_11

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3721)

Abstract

We explore the use of instance and cluster-level constraints with agglomerative hierarchical clustering. Though previous work has illustrated the benefits of using constraints for non-hierarchical clustering, their application to hierarchical clustering is not straightforward for two primary reasons. First, some constraint combinations make the feasibility problem (does there exist a single feasible solution?) NP-complete. Second, some constraint combinations, when used with traditional agglomerative algorithms, can cause the dendrogram to stop prematurely in a dead-end solution even though other feasible solutions with significantly fewer clusters exist. When constraints lead to efficiently solvable feasibility problems and standard agglomerative algorithms do not give rise to dead-end solutions, we empirically illustrate the benefits of using constraints to improve cluster purity and average distortion. Furthermore, we introduce the new γ constraint and use it in conjunction with the triangle inequality to considerably improve the efficiency of agglomerative clustering.
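
The behaviour described in the abstract can be illustrated with a minimal sketch, which is not the authors' algorithm. It assumes centroid linkage, treats instance-level constraints as cannot-link pairs, and reads the γ constraint as "never merge two clusters whose centroids are more than γ apart"; the abstract itself does not define γ, so that reading is an assumption. The function name `constrained_agglomerate` and its parameters are hypothetical, and the triangle-inequality speed-up is not implemented here.

```python
from itertools import combinations
import math


def constrained_agglomerate(points, cannot_link=(), gamma=math.inf):
    """Greedy centroid-linkage merging under constraints (illustrative only).

    points      : list of equal-length coordinate tuples
    cannot_link : pairs of point indices that must never share a cluster
                  (assumed form of an instance-level constraint)
    gamma       : assumed reading of the gamma constraint; clusters whose
                  centroids are farther apart than gamma are never merged
    Returns one clustering per dendrogram level, finest level first.
    """
    clusters = [frozenset([i]) for i in range(len(points))]
    forbidden = {frozenset(pair) for pair in cannot_link}
    levels = [list(clusters)]

    def centroid(cluster):
        dim = len(points[0])
        return tuple(
            sum(points[i][d] for i in cluster) / len(cluster) for d in range(dim)
        )

    def violates(a, b):
        # A merge is infeasible if it would join any cannot-link pair.
        return any(frozenset((i, j)) in forbidden for i in a for j in b)

    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            if violates(a, b):
                continue
            dist = math.dist(centroid(a), centroid(b))
            if dist > gamma:
                continue
            if best is None or dist < best[0]:
                best = (dist, a, b)
        if best is None:
            # No feasible merge remains, so the dendrogram stops early.
            break
        _, a, b = best
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        levels.append(list(clusters))
    return levels


points = [(0.0,), (1.0,), (10.0,), (11.0,)]
levels = constrained_agglomerate(points, cannot_link=[(0, 2)])
print(len(levels[-1]))  # 2 clusters remain: the cannot-link pair blocks the last merge
```

In this sketch the dendrogram simply stops when every remaining merge is infeasible. Whether the stopping point is a dead end in the abstract's sense (a feasible clustering with fewer clusters exists but the greedy merge order cannot reach it) depends on the constraint set, and the sketch does not check for that.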

Author information

Authors and Affiliations

Department of Computer Science, University at Albany – State University of New York, Albany, NY, 12222, USA
Ian Davidson & S. S. Ravi

Editor information

Editors and Affiliations

LIACC/FEP, Universidade do Porto, Portugal
Alípio Mário Jorge
LIAAD-INESC Porto LA / FEP, University of Porto, R. de Ceuta, 118, 6, 4050-190, Porto, Portugal
Luís Torgo
LIAAD-INESC Porto L.A./Faculty of Economics, University of Porto, Rua de Ceuta, 118-6, 4050-190, Porto, Portugal
Pavel Brazdil
Faculdade de Engenharia & LIAAD, Universidade do Porto, Portugal
Rui Camacho
Faculty of Economics of the University of Porto, Portugal
João Gama

Cite this paper

Davidson, I., Ravi, S.S. (2005). Agglomerative Hierarchical Clustering with Constraints: Theoretical and Empirical Results. In: Jorge, A.M., Torgo, L., Brazdil, P., Camacho, R., Gama, J. (eds) Knowledge Discovery in Databases: PKDD 2005. PKDD 2005. Lecture Notes in Computer Science, vol 3721. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11564126_11

DOI: https://doi.org/10.1007/11564126_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29244-9
Online ISBN: 978-3-540-31665-7
eBook Packages: Computer Science, Computer Science (R0)
