Pairwise Similarity for Cluster Ensemble Problem: Link-Based and Approximate Approaches

  • Natthakan Iam-On
  • Tossapon Boongoen
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7980)


Cluster ensemble methods have emerged as powerful techniques that aggregate several input data clusterings into a single output clustering with improved robustness and stability. In particular, link-based similarity techniques have recently been introduced, with performance superior to that of the conventional co-association method. Their potential and applicability are, however, limited by their time complexity. In light of this shortcoming, this paper presents two approximate approaches that mitigate the problem of time complexity: the approximate algorithm approach (Approximate SimRank Based Similarity matrix) and the approximate data approach (Prototype-based cluster ensemble model). The first approach decreases the computational requirement of the existing link-based technique; the second reduces the size of the problem by finding a smaller, representative, approximate dataset, derived by a density-biased sampling technique. The advantages of both approximate approaches are empirically demonstrated over 22 datasets (both artificial and real data), with statistical comparisons of performance (at the 95% confidence level) under three well-known validity criteria. Results obtained from these experiments suggest that approximate techniques can efficiently help scale up the application of link-based similarity methods to a wider range of data sizes.
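
For orientation, the conventional co-association method that the abstract uses as its baseline can be sketched as follows. This is a minimal illustration and not the authors' link-based or approximate technique: the function name `co_association`, the toy `ensemble` of three base clusterings, and the use of NumPy are all assumptions made here for the example. Each entry of the resulting matrix is the fraction of base clusterings that place the two data points in the same cluster.

```python
import numpy as np


def co_association(labelings):
    """Return the N x N co-association matrix for M base clusterings.

    `labelings` has shape (M, N): row m holds the cluster label that
    base clustering m assigns to each of the N data points.
    """
    labelings = np.asarray(labelings)
    m, n = labelings.shape
    ca = np.zeros((n, n))
    for labels in labelings:
        # Pairwise indicator: 1 where two points share a label in this clustering.
        ca += labels[:, None] == labels[None, :]
    return ca / m


# Hypothetical toy ensemble: 3 base clusterings of 4 data points.
ensemble = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
ca = co_association(ensemble)
print(ca)
```

Building this matrix touches every pair of data points for every base clustering, so the cost grows quadratically with the number of points. Pairwise similarity methods in general share that scaling, which is why reducing the dataset to a smaller set of representatives, as the second approach in the abstract does, shrinks the problem.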


clustering, cluster ensembles, pairwise similarity matrix, cluster relation, link analysis, data prototype





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Natthakan Iam-On (1)
  • Tossapon Boongoen (2)
  1. School of Information Technology, Mae Fah Luang University, Chiang Rai, Thailand
  2. Department of Mathematics and Computer Science, Royal Thai Air Force Academy, Bangkok, Thailand
