Abstract
We explore the use of instance-level and cluster-level constraints with agglomerative hierarchical clustering. Though previous work has illustrated the benefits of using constraints for non-hierarchical clustering, their application to hierarchical clustering is not straightforward for two primary reasons. First, some constraint combinations make the feasibility problem (does there exist a single feasible solution?) NP-complete. Second, some constraint combinations, when used with traditional agglomerative algorithms, can cause the dendrogram to stop prematurely in a dead-end solution even though other feasible solutions with significantly fewer clusters exist. When constraints lead to efficiently solvable feasibility problems and standard agglomerative algorithms do not give rise to dead-end solutions, we empirically illustrate the benefits of using constraints to improve cluster purity and average distortion. Furthermore, we introduce the new γ constraint and use it in conjunction with the triangle inequality to considerably improve the efficiency of agglomerative clustering.
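The dead-end behaviour described in the abstract can be illustrated with a small sketch. It is not the authors' algorithm: it assumes that an instance-level cannot-link constraint forbids two points from sharing a cluster, uses centroid distance on 1-D points as the merge criterion, and always merges the closest feasible pair. The function names (`violates`, `greedy_constrained_agglomerative`, `is_feasible`) and the four-point data set are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def violates(c1, c2, cannot_link):
    """True if merging c1 and c2 would place a cannot-link pair in one cluster."""
    return any((a in c1 and b in c2) or (a in c2 and b in c1)
               for a, b in cannot_link)


def centroid(cluster, coords):
    """Mean position of a cluster of 1-D points (assumed linkage criterion)."""
    return sum(coords[i] for i in cluster) / len(cluster)


def is_feasible(partition, cannot_link):
    """True if no cluster in the partition contains a cannot-link pair."""
    return all(not (a in c and b in c)
               for c in partition for a, b in cannot_link)


def greedy_constrained_agglomerative(coords, cannot_link):
    """Merge the closest feasible pair until no feasible merge remains."""
    clusters = [frozenset([i]) for i in range(len(coords))]
    history = [list(clusters)]
    while len(clusters) > 1:
        feasible = [
            (abs(centroid(c1, coords) - centroid(c2, coords)), i, j)
            for (i, c1), (j, c2) in combinations(enumerate(clusters), 2)
            if not violates(c1, c2, cannot_link)
        ]
        if not feasible:  # dead end: every remaining merge breaks a constraint
            break
        _, i, j = min(feasible)
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        history.append(list(clusters))
    return clusters, history


# Hypothetical data: points 0 and 3 are closest, so they are merged first.
coords = [0.0, 5.0, -5.0, 1.0]
cannot_link = [(0, 1), (1, 2), (2, 3)]

final, history = greedy_constrained_agglomerative(coords, cannot_link)
print(len(final))                                    # 3
print(is_feasible([{0, 2}, {1, 3}], cannot_link))    # True
```

In this example the greedy procedure halts with three clusters, because after merging points 0 and 3 every remaining merge would violate a constraint, yet the partition {0, 2}, {1, 3} satisfies all three constraints with only two clusters. The dendrogram therefore stops short of a feasible solution that an exhaustive search would find.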
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Davidson, I., Ravi, S.S. (2005). Agglomerative Hierarchical Clustering with Constraints: Theoretical and Empirical Results. In: Jorge, A.M., Torgo, L., Brazdil, P., Camacho, R., Gama, J. (eds) Knowledge Discovery in Databases: PKDD 2005. Lecture Notes in Computer Science, vol 3721. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11564126_11
DOI: https://doi.org/10.1007/11564126_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29244-9
Online ISBN: 978-3-540-31665-7
eBook Packages: Computer Science