Clustering High-Dimensional Data

Chapter in: Machine Learning for Data Science Handbook

Abstract

Clustering algorithms have been adapted or specifically designed for high-dimensional data, where many attributes may be pure noise: patterns can then be identified only in appropriate combinations of attributes and would otherwise be obfuscated by the noise. In this chapter, we give an overview of the basic strategies and techniques used by these specialized algorithms, along with pointers to example methods.
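
The central point of the abstract, namely that structure which is clear in a suitable subspace of attributes gets drowned out once many noisy attributes enter the distance computation, can be illustrated with a small synthetic experiment. The following sketch is illustrative only and not part of the chapter; it assumes NumPy and scikit-learn are available, plants two well-separated clusters in a two-dimensional relevant subspace, pads the data with a hundred irrelevant noise attributes, and compares k-means on the full attribute space against k-means restricted to the relevant subspace.

```python
# Illustrative sketch (not from the chapter): a cluster structure that is
# obvious in a 2-dimensional relevant subspace is obscured when many noisy,
# irrelevant attributes dominate the full-space distance computation.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Two well-separated clusters living in a 2-dimensional relevant subspace.
labels = np.repeat([0, 1], 200)
relevant = rng.normal(loc=labels[:, None] * 5.0, scale=1.0, size=(400, 2))

# Pad with 100 attributes of pure noise, as hypothesized in the abstract.
noise = rng.normal(scale=5.0, size=(400, 100))
full = np.hstack([relevant, noise])

for name, X in [("relevant subspace only", relevant), ("all attributes", full)]:
    pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(f"{name}: adjusted Rand index = {adjusted_rand_score(labels, pred):.2f}")

# Typically, clustering the relevant subspace recovers the true structure
# (index close to 1), while clustering over all attributes degrades toward
# a random assignment (index close to 0).
```

The subspace, projected, and correlation clustering methods surveyed in this chapter aim to discover such relevant attribute combinations automatically rather than assuming they are known in advance.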

Notes

  1. This should not be confused with the different meaning of “correlation clustering” as introduced by Bansal et al. [21].

References

  1. Achtert E, Böhm C, Kriegel HP, Kröger P, Müller-Gorman I, Zimek A (2006) Finding hierarchies of subspace clusters. In: Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), Berlin, Germany, pp 446–453, https://doi.org/10.1007/11871637_42

  2. Achtert E, Böhm C, Kriegel HP, Kröger P, Zimek A (2006) Deriving quantitative models for correlation clusters. In: Proceedings of the 12th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Philadelphia, PA, pp 4–13, https://doi.org/10.1145/1150402.1150408

  3. Achtert E, Böhm C, Kröger P, Zimek A (2006) Mining hierarchies of correlation clusters. In: Proceedings of the 18th International Conference on Scientific and Statistical Database Management (SSDBM), Vienna, Austria, pp 119–128, https://doi.org/10.1109/SSDBM.2006.35

  4. Achtert E, Böhm C, Kriegel HP, Kröger P, Müller-Gorman I, Zimek A (2007) Detection and visualization of subspace cluster hierarchies. In: Proceedings of the 12th International Conference on Database Systems for Advanced Applications (DASFAA), Bangkok, Thailand, pp 152–163, https://doi.org/10.1007/978-3-540-71703-4_15

  5. Achtert E, Böhm C, Kriegel HP, Kröger P, Zimek A (2007) On exploring complex relationships of correlation clusters. In: Proceedings of the 19th International Conference on Scientific and Statistical Database Management (SSDBM), Banff, Canada, pp 7–16, https://doi.org/10.1109/SSDBM.2007.21

  6. Achtert E, Böhm C, Kriegel HP, Kröger P, Zimek A (2007) Robust, complete, and efficient correlation clustering. In: Proceedings of the 7th SIAM International Conference on Data Mining (SDM), Minneapolis, MN, pp 413–418, https://doi.org/10.1137/1.9781611972771.37

  7. Achtert E, Böhm C, David J, Kröger P, Zimek A (2008) Global correlation clustering based on the Hough transform. Statistical Analysis and Data Mining 1(3):111–127, https://doi.org/10.1002/sam.10012

  8. Achtert E, Böhm C, David J, Kröger P, Zimek A (2008) Robust clustering in arbitrarily oriented subspaces. In: Proceedings of the 8th SIAM International Conference on Data Mining (SDM), Atlanta, GA, pp 763–774, https://doi.org/10.1137/1.9781611972788.69

  9. Aggarwal CC (2009) On high dimensional projected clustering of uncertain data streams. In: Proceedings of the 25th International Conference on Data Engineering (ICDE), Shanghai, China, pp 1152–1154, https://doi.org/10.1109/ICDE.2009.188

  10. Aggarwal CC, Yu PS (2000) Finding generalized projected clusters in high dimensional space. In: Proceedings of the ACM International Conference on Management of Data (SIGMOD), Dallas, TX, pp 70–81, https://doi.org/10.1145/342009.335383

  11. Aggarwal CC, Procopiuc CM, Wolf JL, Yu PS, Park JS (1999) Fast algorithms for projected clustering. In: Proceedings of the ACM International Conference on Management of Data (SIGMOD), Philadelphia, PA, pp 61–72, https://doi.org/10.1145/304182.304188

  12. Aggarwal CC, Han J, Wang J, Yu PS (2005) On high dimensional projected clustering of data streams. Data Mining and Knowledge Discovery 10:251–273, https://doi.org/10.1007/s10618-005-0645-7

  13. Aggarwal CC, Ta N, Wang J, Feng J, Zaki M (2007) XProj: a framework for projected structural clustering of XML documents. In: Proceedings of the 13th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), San Jose, CA, pp 46–55

  14. Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Proceedings of the 20th International Conference on Very Large Data Bases (VLDB), Santiago de Chile, Chile, pp 487–499

  15. Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of the ACM International Conference on Management of Data (SIGMOD), Seattle, WA, pp 94–105, https://doi.org/10.1145/276304.276314

  16. Amsaleg L, Chelly O, Furon T, Girard S, Houle ME, Kawarabayashi K, Nett M (2018) Extreme-value-theoretic estimation of local intrinsic dimensionality. Data Mining and Knowledge Discovery 32(6):1768–1805

  17. Assent I, Krieger R, Müller E, Seidl T (2007) DUSC: dimensionality unbiased subspace clustering. In: Proceedings of the 7th IEEE International Conference on Data Mining (ICDM), Omaha, NE, pp 409–414, https://doi.org/10.1109/ICDM.2007.49

  18. Assent I, Krieger R, Müller E, Seidl T (2008) EDSC: efficient density-based subspace clustering. In: Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM), Napa Valley, CA, pp 1093–1102, https://doi.org/10.1145/1458082.1458227

  19. Assent I, Krieger R, Müller E, Seidl T (2008) INSCY: indexing subspace clusters with in-process-removal of redundancy. In: Proceedings of the 8th IEEE International Conference on Data Mining (ICDM), Pisa, Italy, pp 719–724, https://doi.org/10.1109/ICDM.2008.46

  20. Aziz MS, Reddy CK (2010) A robust seedless algorithm for correlation clustering. In: Proceedings of the 14th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Hyderabad, India, pp 28–37

  21. Bansal N, Blum A, Chawla S (2004) Correlation clustering. Machine Learning 56:89–113, https://doi.org/10.1023/B:MACH.0000033116.57574.95

  22. Barbará D, Chen P (2000) Using the fractal dimension to cluster datasets. In: Proceedings of the 6th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Boston, MA, pp 260–264, https://doi.org/10.1145/347090.347145

  23. Becker R, Hafnaoui I, Houle ME, Li P, Zimek A (2019) Subspace determination through local intrinsic dimensional decomposition. In: Proceedings of the 12th International Conference on Similarity Search and Applications (SISAP), Newark, NJ, pp 281–289

  24. Borutta F, Kröger P, Hubauer T (2019) A generic summary structure for arbitrarily oriented subspace clustering in data streams. In: Proceedings of the 12th International Conference on Similarity Search and Applications (SISAP), Newark, NJ, pp 203–211

  25. Bouveyron C, Girard S, Schmid C (2007) High-dimensional data clustering. Computational Statistics and Data Analysis 52:502–519, https://doi.org/10.1016/j.csda.2007.02.009

  26. Böhm C, Kailing K, Kriegel HP, Kröger P (2004) Density connected clustering with local subspace preferences. In: Proceedings of the 4th IEEE International Conference on Data Mining (ICDM), Brighton, UK, pp 27–34, https://doi.org/10.1109/ICDM.2004.10087

  27. Böhm C, Kailing K, Kröger P, Zimek A (2004) Computing clusters of correlation connected objects. In: Proceedings of the ACM International Conference on Management of Data (SIGMOD), Paris, France, pp 455–466, https://doi.org/10.1145/1007568.1007620

  28. Campello RJGB, Kröger P, Sander J, Zimek A (2020) Density-based clustering. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 10(2), https://doi.org/10.1002/widm.1343

  29. Cheng CH, Fu AWC, Zhang Y (1999) Entropy-based subspace clustering for mining numerical data. In: Proceedings of the 5th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), San Diego, CA, pp 84–93, https://doi.org/10.1145/312129.312199

  30. Cordeiro RLF, Traina AJM, Faloutsos C, Traina Jr C (2013) Halite: Fast and scalable multiresolution local-correlation clustering. IEEE Transactions on Knowledge and Data Engineering 25(2):387–401, https://doi.org/10.1109/TKDE.2011.176

  31. Domeniconi C, Papadopoulos D, Gunopulos D, Ma S (2004) Subspace clustering of high dimensional data. In: Proceedings of the 4th SIAM International Conference on Data Mining (SDM), Lake Buena Vista, FL, https://doi.org/10.1137/1.9781611972740.58

  32. Domeniconi C, Gunopulos D, Ma S, Yan B, Al-Razgan M, Papadopoulos D (2007) Locally adaptive metrics for clustering high dimensional data. Data Mining and Knowledge Discovery 14(1):63–97, https://doi.org/10.1007/s10618-006-0060-8

  33. Duda RO, Hart PE (1972) Use of the Hough transformation to detect lines and curves in pictures. Communications of the ACM 15(1):11–15, https://doi.org/10.1145/361237.361242

  34. Elhamifar E, Vidal R (2009) Sparse subspace clustering. In: Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), Miami, FL, pp 2790–2797, https://doi.org/10.1109/CVPR.2009.5206547

  35. Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd ACM International Conference on Knowledge Discovery and Data Mining (KDD), Portland, OR, pp 226–231

  36. Friedman JH, Meulman JJ (2004) Clustering objects on subsets of attributes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 66(4):825–849

  37. Färber I, Günnemann S, Kriegel HP, Kröger P, Müller E, Schubert E, Seidl T, Zimek A (2010) On using class-labels in evaluation of clusterings. In: MultiClust: 1st International Workshop on Discovering, Summarizing and Using Multiple Clusterings Held in Conjunction with KDD 2010, Washington, DC

  38. Gionis A, Hinneburg A, Papadimitriou S, Tsaparas P (2005) Dimension induced clustering. In: Proceedings of the 11th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Chicago, IL, https://doi.org/10.1145/1081870.1081880

  39. Günnemann S, Färber I, Müller E, Assent I, Seidl T (2011) External evaluation measures for subspace clustering. In: Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM), Glasgow, UK, pp 1363–1372, https://doi.org/10.1145/2063576.2063774

  40. Günnemann S, Färber I, Virochsiri K, Seidl T (2012) Subspace correlation clustering: finding locally correlated dimensions in subspace projections of the data. In: Proceedings of the 18th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Beijing, China, pp 352–360, https://doi.org/10.1145/2339530.2339588

  41. Haralick R, Harpaz R (2005) Linear manifold clustering. In: Proceedings of the 4th International Conference on Machine Learning and Data Mining in Pattern Recognition (MLDM), Leipzig, Germany, pp 132–141

  42. Haralick RM, Harpaz R (2007) Linear manifold clustering in high dimensional spaces by stochastic search. Pattern Recognition 40(10):2672–2684, https://doi.org/10.1016/j.patcog.2007.01.020

  43. Hough PVC (1962) Methods and means for recognizing complex patterns. U.S. Patent 3069654

  44. Houle ME (2017) Local intrinsic dimensionality I: an extreme-value-theoretic foundation for similarity applications. In: Proceedings of the 10th International Conference on Similarity Search and Applications (SISAP), Munich, Germany, pp 64–79

  45. Houle ME (2017) Local intrinsic dimensionality II: multivariate analysis and distributional support. In: Proceedings of the 10th International Conference on Similarity Search and Applications (SISAP), Munich, Germany, pp 80–95

  46. Huang JZ, Ng MK, Rong H, Li Z (2005) Automated variable weighting in k-means type clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(5):657–668, https://doi.org/10.1109/TPAMI.2005.95

  47. Jain AK, Dubes RC (1988) Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs

  48. Jing L, Ng MK, Huang JZ (2007) An entropy weighting k-means algorithm for subspace clustering of high-dimensional sparse data. IEEE Transactions on Knowledge and Data Engineering 19(8):1026–1041, https://doi.org/10.1109/TKDE.2007.1048

  49. Jolliffe IT (2002) Principal Component Analysis, 2nd edn. Springer

  50. Kailing K, Kriegel HP, Kröger P (2004) Density-connected subspace clustering for high-dimensional data. In: Proceedings of the 4th SIAM International Conference on Data Mining (SDM), Lake Buena Vista, FL, pp 246–257, https://doi.org/10.1137/1.9781611972740.23

  51. Kazempour D, Mauder M, Kröger P, Seidl T (2017) Detecting global hyperparaboloid correlated clusters based on Hough transform. In: Proceedings of the 29th International Conference on Scientific and Statistical Database Management (SSDBM), Chicago, IL, pp 31:1–31:6

  52. Kazempour D, Bein K, Kröger P, Seidl T (2018) D-MASC: A novel search strategy for detecting regions of interest in linear parameter space. In: Proceedings of the 11th International Conference on Similarity Search and Applications (SISAP), Lima, Peru, pp 163–176

  53. Kazempour D, Hünemörder M, Seidl T (2019) On coMADs and Principal Component Analysis. In: Proceedings of the 12th International Conference on Similarity Search and Applications (SISAP), Newark, NJ, pp 273–280

  54. Kriegel HP, Kröger P, Schubert E, Zimek A (2008) A general framework for increasing the robustness of PCA-based correlation clustering algorithms. In: Proceedings of the 20th International Conference on Scientific and Statistical Database Management (SSDBM), Hong Kong, China, pp 418–435, https://doi.org/10.1007/978-3-540-69497-7_27

  55. Kriegel HP, Kröger P, Zimek A (2009) Clustering high dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data (TKDD) 3(1):1–58, https://doi.org/10.1145/1497577.1497578

  56. Kriegel HP, Kröger P, Ntoutsi E, Zimek A (2011) Density based subspace clustering over dynamic data. In: Proceedings of the 23rd International Conference on Scientific and Statistical Database Management (SSDBM), Portland, OR, pp 387–404, https://doi.org/10.1007/978-3-642-22351-8_24

  57. Kriegel HP, Schubert E, Zimek A (2011) Evaluation of multiple clustering solutions. In: 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings Held in Conjunction with ECML PKDD 2011, Athens, Greece, pp 55–66

  58. Kriegel HP, Kröger P, Zimek A (2012) Subspace clustering. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2(4):351–364, https://doi.org/10.1002/widm.1057

  59. Li J, Huang X, Selke C, Yong J (2007) A fast algorithm for finding correlation clusters in noise data. In: Proceedings of the 11th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Nanjing, China, pp 639–647

  60. Liu G, Lin Z, Yan S, Sun J, Yu Y, Ma Y (2013) Robust recovery of subspace structures by low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(1):171–184, https://doi.org/10.1109/TPAMI.2012.88

  61. Lu Y, Wang S, Li S, Zhou C (2010) Particle swarm optimizer for variable weighting in clustering high-dimensional data. Machine Learning 82(1):43–70, https://doi.org/10.1007/s10994-009-5154-2

  62. Moise G, Sander J, Ester M (2006) P3C: A robust projected clustering algorithm. In: Proceedings of the 6th IEEE International Conference on Data Mining (ICDM), Hong Kong, China, pp 414–425, https://doi.org/10.1109/ICDM.2006.123

  63. Moise G, Sander J, Ester M (2008) Robust projected clustering. Knowledge and Information Systems (KAIS) 14(3):273–298, https://doi.org/10.1007/s10115-007-0090-6

  64. Moise G, Zimek A, Kröger P, Kriegel HP, Sander J (2009) Subspace and projected clustering: Experimental evaluation and analysis. Knowledge and Information Systems (KAIS) 21(3):299–326, https://doi.org/10.1007/s10115-009-0226-y

  65. Müller E, Günnemann S, Assent I, Seidl T (2009) Evaluating clustering in subspace projections of high dimensional data. Proceedings of the VLDB Endowment 2(1):1270–1281

  66. Nagesh HS, Goil S, Choudhary A (2001) Adaptive grids for clustering massive data sets. In: Proceedings of the 1st SIAM International Conference on Data Mining (SDM), Chicago, IL, pp 1–17, https://doi.org/10.1137/1.9781611972719.7

  67. Ng KKE, Fu AW, Wong CW (2005) Projective clustering by histograms. IEEE Transactions on Knowledge and Data Engineering 17(3):369–383, https://doi.org/10.1109/TKDE.2005.47

  68. Ntoutsi E, Zimek A, Palpanas T, Kröger P, Kriegel HP (2012) Density-based projected clustering over high dimensional data streams. In: Proceedings of the 12th SIAM International Conference on Data Mining (SDM), Anaheim, CA, pp 987–998, https://doi.org/10.1137/1.9781611972825.85

  69. Patrikainen A, Meila M (2006) Comparing subspace clusterings. IEEE Transactions on Knowledge and Data Engineering 18(7):902–916, https://doi.org/10.1109/TKDE.2006.106

  70. Scott DW (2009) Sturges’ rule. Wiley Interdisciplinary Reviews: Computational Statistics 1(3):303–306

  71. Sim K, Gopalkrishnan V, Zimek A, Cong G (2013) A survey on enhanced subspace clustering. Data Mining and Knowledge Discovery 26(2):332–397, https://doi.org/10.1007/s10618-012-0258-x

  72. Tibshirani R (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B 58(1):267–288

  73. Vidal R, Ma Y, Sastry SS (2016) Generalized Principal Component Analysis, Interdisciplinary Applied Mathematics, vol 40. Springer, https://doi.org/10.1007/978-0-387-87811-9

  74. Woo KG, Lee JH, Kim MH, Lee YJ (2004) FINDIT: a fast and intelligent subspace clustering algorithm using dimension voting. Information and Software Technology 46(4):255–271, https://doi.org/10.1016/j.infsof.2003.07.003

  75. Yip KY, Cheung DW, Ng MK (2005) On discovery of extremely low-dimensional clusters using semi-supervised projected clustering. In: Proceedings of the 21st International Conference on Data Engineering (ICDE), Tokyo, Japan, pp 329–340, https://doi.org/10.1109/ICDE.2005.96

  76. Yiu ML, Mamoulis N (2005) Iterative projected clustering by subspace mining. IEEE Transactions on Knowledge and Data Engineering 17(2):176–189, https://doi.org/10.1109/TKDE.2005.29

  77. Zhang Q, Liu J, Wang W (2007) Incremental subspace clustering over multiple data streams. In: Proceedings of the 7th IEEE International Conference on Data Mining (ICDM), Omaha, NE, pp 727–732, https://doi.org/10.1109/ICDM.2007.100

  78. Zhang X, Pan F, Wang W (2008) CARE: finding local linear correlations in high dimensional data. In: Proceedings of the 24th International Conference on Data Engineering (ICDE), Cancun, Mexico, pp 130–139, https://doi.org/10.1109/ICDE.2008.4497421

  79. Zimek A (2013) Clustering high-dimensional data. In: Aggarwal CC, Reddy CK (eds) Data Clustering: Algorithms and Applications, CRC Press, chap 9, pp 201–230

  80. Zimek A, Vreeken J (2015) The blind men and the elephant: On meeting the problem of multiple truths in data from clustering and pattern mining perspectives. Machine Learning 98(1–2):121–155, https://doi.org/10.1007/s10994-013-5334-y

  81. Zimek A, Assent I, Vreeken J (2014) Frequent pattern mining algorithms for data clustering. In: Aggarwal CC, Han J (eds) Frequent Pattern Mining, Springer, chap 16, pp 403–423

  82. Zou H, Xue L (2018) A selective overview of sparse principal component analysis. Proceedings of the IEEE 106(8):1311–1320

Author information

Correspondence to Arthur Zimek.

Copyright information

© 2023 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Houle, M.E., Kiermeier, M., Zimek, A. (2023). Clustering High-Dimensional Data. In: Rokach, L., Maimon, O., Shmueli, E. (eds) Machine Learning for Data Science Handbook. Springer, Cham. https://doi.org/10.1007/978-3-031-24628-9_11
