What are Clusters in High Dimensions and are they Difficult to Find?

  • Frank Klawonn
  • Frank Höppner
  • Balasubramaniam Jayaram
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7627)

Abstract

The distribution of distances between points in a high-dimensional data set tends to look quite different from the distribution of distances in a low-dimensional data set. Concentration of norm is one of the phenomena from which high-dimensional data sets can suffer: under certain general assumptions, the distances from any point to its closest and farthest neighbours become almost identical in relative terms as the dimension grows. Since cluster analysis is usually based on distances, such effects and their influence on the results must be taken into account. This paper investigates the consequences that the special properties of high-dimensional data have for cluster analysis. We discuss when clustering in high dimensions is meaningful at all, whether clusters can be mere artifacts, and which algorithmic problems clustering methods face in high dimensions.
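The concentration effect described in the abstract can be illustrated numerically. The following is a minimal sketch, not the authors' experiment: it assumes points drawn independently and uniformly from the unit cube and uses the Euclidean distance, and the function name `relative_contrast` and its parameters are hypothetical choices made for this illustration. It measures how much farther the farthest neighbour of a query point is than its nearest neighbour, relative to the nearest distance.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for one random query point.

    Assumption for illustration only: the query and all n_points
    neighbours are i.i.d. uniform on the unit cube [0, 1]^dim and
    distances are Euclidean.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


if __name__ == "__main__":
    # The printed contrast should shrink as the dimension grows.
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(dim), 3))
```

Under these assumptions the value is large in two dimensions and falls well below 1 in a thousand dimensions, meaning the nearest and farthest neighbours are then at nearly the same distance. Data with genuine cluster structure or strongly dependent attributes may behave differently, so the sketch shows the tendency only for this unstructured setting.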

Keywords

Clustering High-dimensional Data · Hubness Phenomenon · True Cluster Center · Subspace Clustering · Prototype-based Clustering
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  • Frank Klawonn (1, 2)
  • Frank Höppner (1)
  • Balasubramaniam Jayaram (3)
  1. Department of Computer Science, Ostfalia University of Applied Sciences, Wolfenbuettel, Germany
  2. Biostatistics, Helmholtz Centre for Infection Research, Braunschweig, Germany
  3. Department of Mathematics, Indian Institute of Technology Hyderabad, Yeddumailaram, India
