Abstract
Clustering algorithms have been adapted or specifically designed for high-dimensional data, where many attributes may be pure noise. In such data, patterns can be identified only in appropriate combinations of attributes and would otherwise be obscured by the noise. In this chapter, we give an overview of the basic strategies and techniques used by these specialized algorithms, along with pointers to example methods.
Notes
1. This should not be confused with the different meaning of “correlation clustering” as introduced by Bansal et al. [21].
References
Achtert E, Böhm C, Kriegel HP, Kröger P, Müller-Gorman I, Zimek A (2006) Finding hierarchies of subspace clusters. In: Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), Berlin, Germany, pp 446–453, https://doi.org/10.1007/11871637_42
Achtert E, Böhm C, Kriegel HP, Kröger P, Zimek A (2006) Deriving quantitative models for correlation clusters. In: Proceedings of the 12th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Philadelphia, PA, pp 4–13, https://doi.org/10.1145/1150402.1150408
Achtert E, Böhm C, Kröger P, Zimek A (2006) Mining hierarchies of correlation clusters. In: Proceedings of the 18th International Conference on Scientific and Statistical Database Management (SSDBM), Vienna, Austria, pp 119–128, https://doi.org/10.1109/SSDBM.2006.35
Achtert E, Böhm C, Kriegel HP, Kröger P, Müller-Gorman I, Zimek A (2007) Detection and visualization of subspace cluster hierarchies. In: Proceedings of the 12th International Conference on Database Systems for Advanced Applications (DASFAA), Bangkok, Thailand, pp 152–163, https://doi.org/10.1007/978-3-540-71703-4_15
Achtert E, Böhm C, Kriegel HP, Kröger P, Zimek A (2007) On exploring complex relationships of correlation clusters. In: Proceedings of the 19th International Conference on Scientific and Statistical Database Management (SSDBM), Banff, Canada, pp 7–16, https://doi.org/10.1109/SSDBM.2007.21
Achtert E, Böhm C, Kriegel HP, Kröger P, Zimek A (2007) Robust, complete, and efficient correlation clustering. In: Proceedings of the 7th SIAM International Conference on Data Mining (SDM), Minneapolis, MN, pp 413–418, https://doi.org/10.1137/1.9781611972771.37
Achtert E, Böhm C, David J, Kröger P, Zimek A (2008) Global correlation clustering based on the Hough transform. Statistical Analysis and Data Mining 1(3):111–127, https://doi.org/10.1002/sam.10012
Achtert E, Böhm C, David J, Kröger P, Zimek A (2008) Robust clustering in arbitrarily oriented subspaces. In: Proceedings of the 8th SIAM International Conference on Data Mining (SDM), Atlanta, GA, pp 763–774, https://doi.org/10.1137/1.9781611972788.69
Aggarwal CC (2009) On high dimensional projected clustering of uncertain data streams. In: Proceedings of the 25th International Conference on Data Engineering (ICDE), Shanghai, China, pp 1152–1154, https://doi.org/10.1109/ICDE.2009.188
Aggarwal CC, Yu PS (2000) Finding generalized projected clusters in high dimensional space. In: Proceedings of the ACM International Conference on Management of Data (SIGMOD), Dallas, TX, pp 70–81, https://doi.org/10.1145/342009.335383
Aggarwal CC, Procopiuc CM, Wolf JL, Yu PS, Park JS (1999) Fast algorithms for projected clustering. In: Proceedings of the ACM International Conference on Management of Data (SIGMOD), Philadelphia, PA, pp 61–72, https://doi.org/10.1145/304182.304188
Aggarwal CC, Han J, Wang J, Yu PS (2005) On high dimensional projected clustering of data streams. Data Mining and Knowledge Discovery 10:251–273, https://doi.org/10.1007/s10618-005-0645-7
Aggarwal CC, Ta N, Wang J, Feng J, Zaki M (2007) XProj: a framework for projected structural clustering of XML documents. In: Proceedings of the 13th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), San Jose, CA, pp 46–55
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Proceedings of the 20th International Conference on Very Large Data Bases (VLDB), Santiago de Chile, Chile, pp 487–499
Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of the ACM International Conference on Management of Data (SIGMOD), Seattle, WA, pp 94–105, https://doi.org/10.1145/276304.276314
Amsaleg L, Chelly O, Furon T, Girard S, Houle ME, Kawarabayashi K, Nett M (2018) Extreme-value-theoretic estimation of local intrinsic dimensionality. Data Mining and Knowledge Discovery 32(6):1768–1805
Assent I, Krieger R, Müller E, Seidl T (2007) DUSC: dimensionality unbiased subspace clustering. In: Proceedings of the 7th IEEE International Conference on Data Mining (ICDM), Omaha, NE, pp 409–414, https://doi.org/10.1109/ICDM.2007.49
Assent I, Krieger R, Müller E, Seidl T (2008) EDSC: efficient density-based subspace clustering. In: Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM), Napa Valley, CA, pp 1093–1102, https://doi.org/10.1145/1458082.1458227
Assent I, Krieger R, Müller E, Seidl T (2008) INSCY: indexing subspace clusters with in-process-removal of redundancy. In: Proceedings of the 8th IEEE International Conference on Data Mining (ICDM), Pisa, Italy, pp 719–724, https://doi.org/10.1109/ICDM.2008.46
Aziz MS, Reddy CK (2010) A robust seedless algorithm for correlation clustering. In: Proceedings of the 14th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Hyderabad, India, pp 28–37
Bansal N, Blum A, Chawla S (2004) Correlation clustering. Machine Learning 56:89–113, https://doi.org/10.1023/B:MACH.0000033116.57574.95
Barbará D, Chen P (2000) Using the fractal dimension to cluster datasets. In: Proceedings of the 6th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Boston, MA, pp 260–264, https://doi.org/10.1145/347090.347145
Becker R, Hafnaoui I, Houle ME, Li P, Zimek A (2019) Subspace determination through local intrinsic dimensional decomposition. In: Proceedings of the 12th International Conference on Similarity Search and Applications (SISAP), Newark, NJ, pp 281–289
Borutta F, Kröger P, Hubauer T (2019) A generic summary structure for arbitrarily oriented subspace clustering in data streams. In: Proceedings of the 12th International Conference on Similarity Search and Applications (SISAP), Newark, NJ, pp 203–211
Bouveyron C, Girard S, Schmid C (2007) High-dimensional data clustering. Computational Statistics and Data Analysis 52:502–519, https://doi.org/10.1016/j.csda.2007.02.009
Böhm C, Kailing K, Kriegel HP, Kröger P (2004) Density connected clustering with local subspace preferences. In: Proceedings of the 4th IEEE International Conference on Data Mining (ICDM), Brighton, UK, pp 27–34, https://doi.org/10.1109/ICDM.2004.10087
Böhm C, Kailing K, Kröger P, Zimek A (2004) Computing clusters of correlation connected objects. In: Proceedings of the ACM International Conference on Management of Data (SIGMOD), Paris, France, pp 455–466, https://doi.org/10.1145/1007568.1007620
Campello RJGB, Kröger P, Sander J, Zimek A (2020) Density-based clustering. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 10(2), https://doi.org/10.1002/widm.1343
Cheng CH, Fu AWC, Zhang Y (1999) Entropy-based subspace clustering for mining numerical data. In: Proceedings of the 5th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), San Diego, CA, pp 84–93, https://doi.org/10.1145/312129.312199
Cordeiro RLF, Traina AJM, Faloutsos C, Traina Jr C (2013) Halite: Fast and scalable multiresolution local-correlation clustering. IEEE Transactions on Knowledge and Data Engineering 25(2):387–401, https://doi.org/10.1109/TKDE.2011.176
Domeniconi C, Papadopoulos D, Gunopulos D, Ma S (2004) Subspace clustering of high dimensional data. In: Proceedings of the 4th SIAM International Conference on Data Mining (SDM), Lake Buena Vista, FL, https://doi.org/10.1137/1.9781611972740.58
Domeniconi C, Gunopulos D, Ma S, Yan B, Al-Razgan M, Papadopoulos D (2007) Locally adaptive metrics for clustering high dimensional data. Data Mining and Knowledge Discovery 14(1):63–97, https://doi.org/10.1007/s10618-006-0060-8
Duda RO, Hart PE (1972) Use of the Hough transformation to detect lines and curves in pictures. Communications of the ACM 15(1):11–15, https://doi.org/10.1145/361237.361242
Elhamifar E, Vidal R (2009) Sparse subspace clustering. In: Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), Miami, FL, pp 2790–2797, https://doi.org/10.1109/CVPR.2009.5206547
Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd ACM International Conference on Knowledge Discovery and Data Mining (KDD), Portland, OR, pp 226–231
Friedman JH, Meulman JJ (2004) Clustering objects on subsets of attributes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 66(4):825–849
Färber I, Günnemann S, Kriegel HP, Kröger P, Müller E, Schubert E, Seidl T, Zimek A (2010) On using class-labels in evaluation of clusterings. In: MultiClust: 1st International Workshop on Discovering, Summarizing and Using Multiple Clusterings Held in Conjunction with KDD 2010, Washington, DC
Gionis A, Hinneburg A, Papadimitriou S, Tsaparas P (2005) Dimension induced clustering. In: Proceedings of the 11th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Chicago, IL, https://doi.org/10.1145/1081870.1081880
Günnemann S, Färber I, Müller E, Assent I, Seidl T (2011) External evaluation measures for subspace clustering. In: Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM), Glasgow, UK, pp 1363–1372, https://doi.org/10.1145/2063576.2063774
Günnemann S, Färber I, Virochsiri K, Seidl T (2012) Subspace correlation clustering: finding locally correlated dimensions in subspace projections of the data. In: Proceedings of the 18th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Beijing, China, pp 352–360, https://doi.org/10.1145/2339530.2339588
Haralick R, Harpaz R (2005) Linear manifold clustering. In: Proceedings of the 4th International Conference on Machine Learning and Data Mining in Pattern Recognition (MLDM), Leipzig, Germany, pp 132–141
Haralick RM, Harpaz R (2007) Linear manifold clustering in high dimensional spaces by stochastic search. Pattern Recognition 40(10):2672–2684, https://doi.org/10.1016/j.patcog.2007.01.020
Hough PVC (1962) Methods and means for recognizing complex patterns. U.S. Patent 3069654
Houle ME (2017) Local intrinsic dimensionality I: an extreme-value-theoretic foundation for similarity applications. In: Proceedings of the 10th International Conference on Similarity Search and Applications (SISAP), Munich, Germany, pp 64–79
Houle ME (2017) Local intrinsic dimensionality II: multivariate analysis and distributional support. In: Proceedings of the 10th International Conference on Similarity Search and Applications (SISAP), Munich, Germany, pp 80–95
Huang JZ, Ng MK, Rong H, Li Z (2005) Automated variable weighting in k-means type clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(5):657–668, https://doi.org/10.1109/TPAMI.2005.95
Jain AK, Dubes RC (1988) Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs
Jing L, Ng MK, Huang JZ (2007) An entropy weighting k-means algorithm for subspace clustering of high-dimensional sparse data. IEEE Transactions on Knowledge and Data Engineering 19(8):1026–1041, https://doi.org/10.1109/TKDE.2007.1048
Jolliffe IT (2002) Principal Component Analysis, 2nd edn. Springer
Kailing K, Kriegel HP, Kröger P (2004) Density-connected subspace clustering for high-dimensional data. In: Proceedings of the 4th SIAM International Conference on Data Mining (SDM), Lake Buena Vista, FL, pp 246–257, https://doi.org/10.1137/1.9781611972740.23
Kazempour D, Mauder M, Kröger P, Seidl T (2017) Detecting global hyperparaboloid correlated clusters based on Hough transform. pp 31:1–31:6
Kazempour D, Bein K, Kröger P, Seidl T (2018) D-MASC: A novel search strategy for detecting regions of interest in linear parameter space. In: Proceedings of the 11th International Conference on Similarity Search and Applications (SISAP), Lima, Peru, pp 163–176
Kazempour D, Hünemörder M, Seidl T (2019) On coMADs and Principal Component Analysis. In: Proceedings of the 12th International Conference on Similarity Search and Applications (SISAP), Newark, NJ, pp 273–280
Kriegel HP, Kröger P, Schubert E, Zimek A (2008) A general framework for increasing the robustness of PCA-based correlation clustering algorithms. In: Proceedings of the 20th International Conference on Scientific and Statistical Database Management (SSDBM), Hong Kong, China, pp 418–435, https://doi.org/10.1007/978-3-540-69497-7_27
Kriegel HP, Kröger P, Zimek A (2009) Clustering high dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data (TKDD) 3(1):1–58, https://doi.org/10.1145/1497577.1497578
Kriegel HP, Kröger P, Ntoutsi E, Zimek A (2011) Density based subspace clustering over dynamic data. In: Proceedings of the 23rd International Conference on Scientific and Statistical Database Management (SSDBM), Portland, OR, pp 387–404, https://doi.org/10.1007/978-3-642-22351-8_24
Kriegel HP, Schubert E, Zimek A (2011) Evaluation of multiple clustering solutions. In: 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings Held in Conjunction with ECML PKDD 2011, Athens, Greece, pp 55–66
Kriegel HP, Kröger P, Zimek A (2012) Subspace clustering. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2(4):351–364, https://doi.org/10.1002/widm.1057
Li J, Huang X, Selke C, Yong J (2007) A fast algorithm for finding correlation clusters in noise data. In: Proceedings of the 11th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Nanjing, China, pp 639–647
Liu G, Lin Z, Yan S, Sun J, Yu Y, Ma Y (2013) Robust recovery of subspace structures by low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(1):171–184, https://doi.org/10.1109/TPAMI.2012.88
Lu Y, Wang S, Li S, Zhou C (2010) Particle swarm optimizer for variable weighting in clustering high-dimensional data. Machine Learning 82(1):43–70, https://doi.org/10.1007/s10994-009-5154-2
Moise G, Sander J, Ester M (2006) P3C: A robust projected clustering algorithm. In: Proceedings of the 6th IEEE International Conference on Data Mining (ICDM), Hong Kong, China, pp 414–425, https://doi.org/10.1109/ICDM.2006.123
Moise G, Sander J, Ester M (2008) Robust projected clustering. Knowledge and Information Systems (KAIS) 14(3):273–298, https://doi.org/10.1007/s10115-007-0090-6
Moise G, Zimek A, Kröger P, Kriegel HP, Sander J (2009) Subspace and projected clustering: Experimental evaluation and analysis. Knowledge and Information Systems (KAIS) 21(3):299–326, https://doi.org/10.1007/s10115-009-0226-y
Müller E, Günnemann S, Assent I, Seidl T (2009) Evaluating clustering in subspace projections of high dimensional data. Proceedings of the VLDB Endowment 2(1):1270–1281
Nagesh HS, Goil S, Choudhary A (2001) Adaptive grids for clustering massive data sets. In: Proceedings of the 1st SIAM International Conference on Data Mining (SDM), Chicago, IL, pp 1–17, https://doi.org/10.1137/1.9781611972719.7
Ng KKE, Fu AW, Wong CW (2005) Projective clustering by histograms. IEEE Transactions on Knowledge and Data Engineering 17(3):369–383, https://doi.org/10.1109/TKDE.2005.47
Ntoutsi E, Zimek A, Palpanas T, Kröger P, Kriegel HP (2012) Density-based projected clustering over high dimensional data streams. In: Proceedings of the 12th SIAM International Conference on Data Mining (SDM), Anaheim, CA, pp 987–998, https://doi.org/10.1137/1.9781611972825.85
Patrikainen A, Meila M (2006) Comparing subspace clusterings. IEEE Transactions on Knowledge and Data Engineering 18(7):902–916, https://doi.org/10.1109/TKDE.2006.106
Scott DW (2009) Sturges’ rule. Wiley Interdisciplinary Reviews: Computational Statistics 1(3):303–306
Sim K, Gopalkrishnan V, Zimek A, Cong G (2013) A survey on enhanced subspace clustering. Data Mining and Knowledge Discovery 26(2):332–397, https://doi.org/10.1007/s10618-012-0258-x
Tibshirani R (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58(1):267–288
Vidal R, Ma Y, Sastry SS (2016) Generalized Principal Component Analysis, Interdisciplinary Applied Mathematics, vol 40. Springer, https://doi.org/10.1007/978-0-387-87811-9
Woo KG, Lee JH, Kim MH, Lee YJ (2004) FINDIT: a fast and intelligent subspace clustering algorithm using dimension voting. Information and Software Technology 46(4):255–271, https://doi.org/10.1016/j.infsof.2003.07.003
Yip KY, Cheung DW, Ng MK (2005) On discovery of extremely low-dimensional clusters using semi-supervised projected clustering. In: Proceedings of the 21st International Conference on Data Engineering (ICDE), Tokyo, Japan, pp 329–340, https://doi.org/10.1109/ICDE.2005.96
Yiu ML, Mamoulis N (2005) Iterative projected clustering by subspace mining. IEEE Transactions on Knowledge and Data Engineering 17(2):176–189, https://doi.org/10.1109/TKDE.2005.29
Zhang Q, Liu J, Wang W (2007) Incremental subspace clustering over multiple data streams. In: Proceedings of the 7th IEEE International Conference on Data Mining (ICDM), Omaha, NE, pp 727–732, https://doi.org/10.1109/ICDM.2007.100
Zhang X, Pan F, Wang W (2008) CARE: finding local linear correlations in high dimensional data. In: Proceedings of the 24th International Conference on Data Engineering (ICDE), Cancun, Mexico, pp 130–139, https://doi.org/10.1109/ICDE.2008.4497421
Zimek A (2013) Clustering high-dimensional data. In: Aggarwal CC, Reddy CK (eds) Data Clustering: Algorithms and Applications, CRC Press, chap 9, pp 201–230
Zimek A, Vreeken J (2015) The blind men and the elephant: On meeting the problem of multiple truths in data from clustering and pattern mining perspectives. Machine Learning 98(1–2):121–155, https://doi.org/10.1007/s10994-013-5334-y
Zimek A, Assent I, Vreeken J (2014) Frequent pattern mining algorithms for data clustering. In: Aggarwal CC, Han J (eds) Frequent Pattern Mining, Springer, chap 16, pp 403–423
Zou H, Xue L (2018) A selective overview of sparse principal component analysis. Proceedings of the IEEE 106(8):1311–1320
Copyright information
© 2023 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Houle, M.E., Kiermeier, M., Zimek, A. (2023). Clustering High-Dimensional Data. In: Rokach, L., Maimon, O., Shmueli, E. (eds) Machine Learning for Data Science Handbook. Springer, Cham. https://doi.org/10.1007/978-3-031-24628-9_11
DOI: https://doi.org/10.1007/978-3-031-24628-9_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-24627-2
Online ISBN: 978-3-031-24628-9
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)