Abstract
Forming consensus clusters from multiple input clusterings can improve accuracy and robustness. Current clustering ensemble methods require the number of consensus clusters to be specified in advance, and a poor choice can lead to underfitting or overfitting. This paper proposes a nonparametric Bayesian clustering ensemble (NBCE) method, which can discover the number of clusters in the consensus clustering. Three inference methods are considered: collapsed Gibbs sampling, variational Bayesian inference, and collapsed variational Bayesian inference. Comparison of NBCE with several other algorithms demonstrates its versatility and superior stability.
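To illustrate how a nonparametric Bayesian prior can leave the number of clusters open, the following is a minimal sketch of sampling a partition from a Chinese restaurant process, a standard representation of the Dirichlet process. It is not the NBCE model or any of its inference procedures; the function name `crp_partition` and the concentration parameter `alpha` are assumptions introduced here for illustration only.

```python
import random


def crp_partition(n_items, alpha, seed=0):
    """Sample a partition of n_items from a Chinese restaurant process.

    Hypothetical helper for illustration; alpha > 0 is the concentration
    parameter. Returns (assignments, cluster_sizes).
    """
    rng = random.Random(seed)
    assignments = []    # cluster index chosen for each item, in arrival order
    cluster_sizes = []  # current number of items in each cluster

    for _ in range(n_items):
        # An existing cluster is chosen with weight equal to its size;
        # a brand-new cluster is chosen with weight alpha.
        weights = cluster_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(cluster_sizes):
            cluster_sizes.append(1)  # open a new cluster
        else:
            cluster_sizes[k] += 1
        assignments.append(k)

    return assignments, cluster_sizes


if __name__ == "__main__":
    # The number of clusters is an outcome of sampling, not an input.
    for alpha in (0.5, 2.0, 10.0):
        _, sizes = crp_partition(200, alpha, seed=1)
        print(f"alpha={alpha}: {len(sizes)} clusters")
```

In this sketch the number of clusters is never passed in; it emerges from the sampling process, and larger values of `alpha` tend to produce more clusters.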
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wang, P., Domeniconi, C., Laskey, K.B. (2010). Nonparametric Bayesian Clustering Ensembles. In: Balcázar, J.L., Bonchi, F., Gionis, A., Sebag, M. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2010. Lecture Notes in Computer Science, vol 6323. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15939-8_28
DOI: https://doi.org/10.1007/978-3-642-15939-8_28
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15938-1
Online ISBN: 978-3-642-15939-8
eBook Packages: Computer Science (R0)