Clustering High-Dimensional Data

Chapter in: Machine Learning for Data Science Handbook

Abstract

Clustering algorithms have been adapted or specifically designed for high-dimensional data, where many attributes may be pure noise: patterns can then be identified only in appropriate combinations of attributes and would otherwise be obfuscated by the noise. In this chapter, we give an overview of the basic strategies and techniques used by these specialized algorithms, along with pointers to example methods.
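
The central point of the abstract, namely that structure which is clear in a suitable subspace of attributes gets drowned out once many noisy attributes enter the distance computation, can be illustrated with a small synthetic experiment. The following sketch is illustrative only and not part of the chapter; it assumes NumPy and scikit-learn are available, plants two well-separated clusters in a two-dimensional relevant subspace, pads the data with a hundred irrelevant noise attributes, and compares k-means on the full attribute space against k-means restricted to the relevant subspace.

```python
# Illustrative sketch (not from the chapter): a cluster structure that is
# obvious in a 2-dimensional relevant subspace is obscured when many noisy,
# irrelevant attributes dominate the full-space distance computation.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Two well-separated clusters living in a 2-dimensional relevant subspace.
labels = np.repeat([0, 1], 200)
relevant = rng.normal(loc=labels[:, None] * 5.0, scale=1.0, size=(400, 2))

# Pad with 100 attributes of pure noise, as hypothesized in the abstract.
noise = rng.normal(scale=5.0, size=(400, 100))
full = np.hstack([relevant, noise])

for name, X in [("relevant subspace only", relevant), ("all attributes", full)]:
    pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(f"{name}: adjusted Rand index = {adjusted_rand_score(labels, pred):.2f}")

# Typically, clustering the relevant subspace recovers the true structure
# (index close to 1), while clustering over all attributes degrades toward
# a random assignment (index close to 0).
```

The subspace, projected, and correlation clustering methods surveyed in this chapter aim to discover such relevant attribute combinations automatically rather than assuming they are known in advance.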

Notes

  1. This should not be confused with the different meaning of “correlation clustering” as introduced by Bansal et al. [21].

References

  1. Achtert E, Böhm C, Kriegel HP, Kröger P, Müller-Gorman I, Zimek A (2006) Finding hierarchies of subspace clusters. In: Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), Berlin, Germany, pp 446–453, https://doi.org/10.1007/11871637_42

  2. Achtert E, Böhm C, Kriegel HP, Kröger P, Zimek A (2006) Deriving quantitative models for correlation clusters. In: Proceedings of the 12th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Philadelphia, PA, pp 4–13, https://doi.org/10.1145/1150402.1150408

  3. Achtert E, Böhm C, Kröger P, Zimek A (2006) Mining hierarchies of correlation clusters. In: Proceedings of the 18th International Conference on Scientific and Statistical Database Management (SSDBM), Vienna, Austria, pp 119–128, https://doi.org/10.1109/SSDBM.2006.35

  4. Achtert E, Böhm C, Kriegel HP, Kröger P, Müller-Gorman I, Zimek A (2007) Detection and visualization of subspace cluster hierarchies. In: Proceedings of the 12th International Conference on Database Systems for Advanced Applications (DASFAA), Bangkok, Thailand, pp 152–163, https://doi.org/10.1007/978-3-540-71703-4_15

  5. Achtert E, Böhm C, Kriegel HP, Kröger P, Zimek A (2007) On exploring complex relationships of correlation clusters. In: Proceedings of the 19th International Conference on Scientific and Statistical Database Management (SSDBM), Banff, Canada, pp 7–16, https://doi.org/10.1109/SSDBM.2007.21

  6. Achtert E, Böhm C, Kriegel HP, Kröger P, Zimek A (2007) Robust, complete, and efficient correlation clustering. In: Proceedings of the 7th SIAM International Conference on Data Mining (SDM), Minneapolis, MN, pp 413–418, https://doi.org/10.1137/1.9781611972771.37

  7. Achtert E, Böhm C, David J, Kröger P, Zimek A (2008) Global correlation clustering based on the Hough transform. Statistical Analysis and Data Mining 1(3):111–127, https://doi.org/10.1002/sam.10012

  8. Achtert E, Böhm C, David J, Kröger P, Zimek A (2008) Robust clustering in arbitrarily oriented subspaces. In: Proceedings of the 8th SIAM International Conference on Data Mining (SDM), Atlanta, GA, pp 763–774, https://doi.org/10.1137/1.9781611972788.69

  9. Aggarwal CC (2009) On high dimensional projected clustering of uncertain data streams. In: Proceedings of the 25th International Conference on Data Engineering (ICDE), Shanghai, China, pp 1152–1154, https://doi.org/10.1109/ICDE.2009.188

  10. Aggarwal CC, Yu PS (2000) Finding generalized projected clusters in high dimensional space. In: Proceedings of the ACM International Conference on Management of Data (SIGMOD), Dallas, TX, pp 70–81, https://doi.org/10.1145/342009.335383

  11. Aggarwal CC, Procopiuc CM, Wolf JL, Yu PS, Park JS (1999) Fast algorithms for projected clustering. In: Proceedings of the ACM International Conference on Management of Data (SIGMOD), Philadelphia, PA, pp 61–72, https://doi.org/10.1145/304182.304188

  12. Aggarwal CC, Han J, Wang J, Yu PS (2005) On high dimensional projected clustering of data streams. Data Mining and Knowledge Discovery 10:251–273, https://doi.org/10.1007/s10618-005-0645-7

  13. Aggarwal CC, Ta N, Wang J, Feng J, Zaki M (2007) XProj: a framework for projected structural clustering of XML documents. In: Proceedings of the 13th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), San Jose, CA, pp 46–55

  14. Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Proceedings of the 20th International Conference on Very Large Data Bases (VLDB), Santiago de Chile, Chile, pp 487–499

  15. Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of the ACM International Conference on Management of Data (SIGMOD), Seattle, WA, pp 94–105, https://doi.org/10.1145/276304.276314

  16. Amsaleg L, Chelly O, Furon T, Girard S, Houle ME, Kawarabayashi K, Nett M (2018) Extreme-value-theoretic estimation of local intrinsic dimensionality. Data Mining and Knowledge Discovery 32(6):1768–1805

  17. Assent I, Krieger R, Müller E, Seidl T (2007) DUSC: dimensionality unbiased subspace clustering. In: Proceedings of the 7th IEEE International Conference on Data Mining (ICDM), Omaha, NE, pp 409–414, https://doi.org/10.1109/ICDM.2007.49

  18. Assent I, Krieger R, Müller E, Seidl T (2008) EDSC: efficient density-based subspace clustering. In: Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM), Napa Valley, CA, pp 1093–1102, https://doi.org/10.1145/1458082.1458227

  19. Assent I, Krieger R, Müller E, Seidl T (2008) INSCY: indexing subspace clusters with in-process-removal of redundancy. In: Proceedings of the 8th IEEE International Conference on Data Mining (ICDM), Pisa, Italy, pp 719–724, https://doi.org/10.1109/ICDM.2008.46

  20. Aziz MS, Reddy CK (2010) A robust seedless algorithm for correlation clustering. In: Proceedings of the 14th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Hyderabad, India, pp 28–37

  21. Bansal N, Blum A, Chawla S (2004) Correlation clustering. Machine Learning 56:89–113, https://doi.org/10.1023/B:MACH.0000033116.57574.95

  22. Barbará D, Chen P (2000) Using the fractal dimension to cluster datasets. In: Proceedings of the 6th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Boston, MA, pp 260–264, https://doi.org/10.1145/347090.347145

  23. Becker R, Hafnaoui I, Houle ME, Li P, Zimek A (2019) Subspace determination through local intrinsic dimensional decomposition. In: Proceedings of the 12th International Conference on Similarity Search and Applications (SISAP), Newark, NJ, pp 281–289

  24. Borutta F, Kröger P, Hubauer T (2019) A generic summary structure for arbitrarily oriented subspace clustering in data streams. In: Proceedings of the 12th International Conference on Similarity Search and Applications (SISAP), Newark, NJ, pp 203–211

  25. Bouveyron C, Girard S, Schmid C (2007) High-dimensional data clustering. Computational Statistics and Data Analysis 52:502–519, https://doi.org/10.1016/j.csda.2007.02.009

  26. Böhm C, Kailing K, Kriegel HP, Kröger P (2004) Density connected clustering with local subspace preferences. In: Proceedings of the 4th IEEE International Conference on Data Mining (ICDM), Brighton, UK, pp 27–34, https://doi.org/10.1109/ICDM.2004.10087

  27. Böhm C, Kailing K, Kröger P, Zimek A (2004) Computing clusters of correlation connected objects. In: Proceedings of the ACM International Conference on Management of Data (SIGMOD), Paris, France, pp 455–466, https://doi.org/10.1145/1007568.1007620

  28. Campello RJGB, Kröger P, Sander J, Zimek A (2020) Density-based clustering. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 10(2), https://doi.org/10.1002/widm.1343

  29. Cheng CH, Fu AWC, Zhang Y (1999) Entropy-based subspace clustering for mining numerical data. In: Proceedings of the 5th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), San Diego, CA, pp 84–93, https://doi.org/10.1145/312129.312199

  30. Cordeiro RLF, Traina AJM, Faloutsos C, Traina Jr C (2013) Halite: Fast and scalable multiresolution local-correlation clustering. IEEE Transactions on Knowledge and Data Engineering 25(2):387–401, https://doi.org/10.1109/TKDE.2011.176

  31. Domeniconi C, Papadopoulos D, Gunopulos D, Ma S (2004) Subspace clustering of high dimensional data. In: Proceedings of the 4th SIAM International Conference on Data Mining (SDM), Lake Buena Vista, FL, https://doi.org/10.1137/1.9781611972740.58

  32. Domeniconi C, Gunopulos D, Ma S, Yan B, Al-Razgan M, Papadopoulos D (2007) Locally adaptive metrics for clustering high dimensional data. Data Mining and Knowledge Discovery 14(1):63–97, https://doi.org/10.1007/s10618-006-0060-8

  33. Duda RO, Hart PE (1972) Use of the Hough transformation to detect lines and curves in pictures. Communications of the ACM 15(1):11–15, https://doi.org/10.1145/361237.361242

  34. Elhamifar E, Vidal R (2009) Sparse subspace clustering. In: Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), Miami, FL, pp 2790–2797, https://doi.org/10.1109/CVPR.2009.5206547

  35. Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd ACM International Conference on Knowledge Discovery and Data Mining (KDD), Portland, OR, pp 226–231

  36. Friedman JH, Meulman JJ (2004) Clustering objects on subsets of attributes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 66(4):825–849

  37. Färber I, Günnemann S, Kriegel HP, Kröger P, Müller E, Schubert E, Seidl T, Zimek A (2010) On using class-labels in evaluation of clusterings. In: MultiClust: 1st International Workshop on Discovering, Summarizing and Using Multiple Clusterings Held in Conjunction with KDD 2010, Washington, DC

  38. Gionis A, Hinneburg A, Papadimitriou S, Tsaparas P (2005) Dimension induced clustering. In: Proceedings of the 11th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Chicago, IL, https://doi.org/10.1145/1081870.1081880

  39. Günnemann S, Färber I, Müller E, Assent I, Seidl T (2011) External evaluation measures for subspace clustering. In: Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM), Glasgow, UK, pp 1363–1372, https://doi.org/10.1145/2063576.2063774

  40. Günnemann S, Färber I, Virochsiri K, Seidl T (2012) Subspace correlation clustering: finding locally correlated dimensions in subspace projections of the data. In: Proceedings of the 18th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Beijing, China, pp 352–360, https://doi.org/10.1145/2339530.2339588

  41. Haralick R, Harpaz R (2005) Linear manifold clustering. In: Proceedings of the 4th International Conference on Machine Learning and Data Mining in Pattern Recognition (MLDM), Leipzig, Germany, pp 132–141

  42. Haralick RM, Harpaz R (2007) Linear manifold clustering in high dimensional spaces by stochastic search. Pattern Recognition 40(10):2672–2684, https://doi.org/10.1016/j.patcog.2007.01.020

  43. Hough PVC (1962) Methods and means for recognizing complex patterns. U.S. Patent 3069654

  44. Houle ME (2017) Local intrinsic dimensionality I: an extreme-value-theoretic foundation for similarity applications. In: Proceedings of the 10th International Conference on Similarity Search and Applications (SISAP), Munich, Germany, pp 64–79

  45. Houle ME (2017) Local intrinsic dimensionality II: multivariate analysis and distributional support. In: Proceedings of the 10th International Conference on Similarity Search and Applications (SISAP), Munich, Germany, pp 80–95

  46. Huang JZ, Ng MK, Rong H, Li Z (2005) Automated variable weighting in k-means type clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(5):657–668, https://doi.org/10.1109/TPAMI.2005.95

  47. Jain AK, Dubes RC (1988) Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs

  48. Jing L, Ng MK, Huang JZ (2007) An entropy weighting k-means algorithm for subspace clustering of high-dimensional sparse data. IEEE Transactions on Knowledge and Data Engineering 19(8):1026–1041, https://doi.org/10.1109/TKDE.2007.1048

  49. Jolliffe IT (2002) Principal Component Analysis, 2nd edn. Springer

  50. Kailing K, Kriegel HP, Kröger P (2004) Density-connected subspace clustering for high-dimensional data. In: Proceedings of the 4th SIAM International Conference on Data Mining (SDM), Lake Buena Vista, FL, pp 246–257, https://doi.org/10.1137/1.9781611972740.23

  51. Kazempour D, Mauder M, Kröger P, Seidl T (2017) Detecting global hyperparaboloid correlated clusters based on Hough transform. In: Proceedings of the 29th International Conference on Scientific and Statistical Database Management (SSDBM), Chicago, IL, pp 31:1–31:6

  52. Kazempour D, Bein K, Kröger P, Seidl T (2018) D-MASC: A novel search strategy for detecting regions of interest in linear parameter space. In: Proceedings of the 11th International Conference on Similarity Search and Applications (SISAP), Lima, Peru, pp 163–176

  53. Kazempour D, Hünemörder M, Seidl T (2019) On coMADs and Principal Component Analysis. In: Proceedings of the 12th International Conference on Similarity Search and Applications (SISAP), Newark, NJ, pp 273–280

  54. Kriegel HP, Kröger P, Schubert E, Zimek A (2008) A general framework for increasing the robustness of PCA-based correlation clustering algorithms. In: Proceedings of the 20th International Conference on Scientific and Statistical Database Management (SSDBM), Hong Kong, China, pp 418–435, https://doi.org/10.1007/978-3-540-69497-7_27

  55. Kriegel HP, Kröger P, Zimek A (2009) Clustering high dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data (TKDD) 3(1):1–58, https://doi.org/10.1145/1497577.1497578

  56. Kriegel HP, Kröger P, Ntoutsi E, Zimek A (2011) Density based subspace clustering over dynamic data. In: Proceedings of the 23rd International Conference on Scientific and Statistical Database Management (SSDBM), Portland, OR, pp 387–404, https://doi.org/10.1007/978-3-642-22351-8_24

  57. Kriegel HP, Schubert E, Zimek A (2011) Evaluation of multiple clustering solutions. In: 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings Held in Conjunction with ECML PKDD 2011, Athens, Greece, pp 55–66

  58. Kriegel HP, Kröger P, Zimek A (2012) Subspace clustering. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2(4):351–364, https://doi.org/10.1002/widm.1057

  59. Li J, Huang X, Selke C, Yong J (2007) A fast algorithm for finding correlation clusters in noise data. In: Proceedings of the 11th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Nanjing, China, pp 639–647

  60. Liu G, Lin Z, Yan S, Sun J, Yu Y, Ma Y (2013) Robust recovery of subspace structures by low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(1):171–184, https://doi.org/10.1109/TPAMI.2012.88

  61. Lu Y, Wang S, Li S, Zhou C (2010) Particle swarm optimizer for variable weighting in clustering high-dimensional data. Machine Learning 82(1):43–70, https://doi.org/10.1007/s10994-009-5154-2

  62. Moise G, Sander J, Ester M (2006) P3C: A robust projected clustering algorithm. In: Proceedings of the 6th IEEE International Conference on Data Mining (ICDM), Hong Kong, China, pp 414–425, https://doi.org/10.1109/ICDM.2006.123

  63. Moise G, Sander J, Ester M (2008) Robust projected clustering. Knowledge and Information Systems (KAIS) 14(3):273–298, https://doi.org/10.1007/s10115-007-0090-6

  64. Moise G, Zimek A, Kröger P, Kriegel HP, Sander J (2009) Subspace and projected clustering: Experimental evaluation and analysis. Knowledge and Information Systems (KAIS) 21(3):299–326, https://doi.org/10.1007/s10115-009-0226-y

  65. Müller E, Günnemann S, Assent I, Seidl T (2009) Evaluating clustering in subspace projections of high dimensional data. Proceedings of the VLDB Endowment 2(1):1270–1281

  66. Nagesh HS, Goil S, Choudhary A (2001) Adaptive grids for clustering massive data sets. In: Proceedings of the 1st SIAM International Conference on Data Mining (SDM), Chicago, IL, pp 1–17, https://doi.org/10.1137/1.9781611972719.7

  67. Ng KKE, Fu AW, Wong CW (2005) Projective clustering by histograms. IEEE Transactions on Knowledge and Data Engineering 17(3):369–383, https://doi.org/10.1109/TKDE.2005.47

  68. Ntoutsi E, Zimek A, Palpanas T, Kröger P, Kriegel HP (2012) Density-based projected clustering over high dimensional data streams. In: Proceedings of the 12th SIAM International Conference on Data Mining (SDM), Anaheim, CA, pp 987–998, https://doi.org/10.1137/1.9781611972825.85

  69. Patrikainen A, Meila M (2006) Comparing subspace clusterings. IEEE Transactions on Knowledge and Data Engineering 18(7):902–916, https://doi.org/10.1109/TKDE.2006.106

  70. Scott DW (2009) Sturges’ rule. Wiley Interdisciplinary Reviews: Computational Statistics 1(3):303–306

  71. Sim K, Gopalkrishnan V, Zimek A, Cong G (2013) A survey on enhanced subspace clustering. Data Mining and Knowledge Discovery 26(2):332–397, https://doi.org/10.1007/s10618-012-0258-x

  72. Tibshirani R (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B 58(1):267–288

  73. Vidal R, Ma Y, Sastry SS (2016) Generalized Principal Component Analysis, Interdisciplinary Applied Mathematics, vol 40. Springer, https://doi.org/10.1007/978-0-387-87811-9

  74. Woo KG, Lee JH, Kim MH, Lee YJ (2004) FINDIT: a fast and intelligent subspace clustering algorithm using dimension voting. Information and Software Technology 46(4):255–271, https://doi.org/10.1016/j.infsof.2003.07.003

  75. Yip KY, Cheung DW, Ng MK (2005) On discovery of extremely low-dimensional clusters using semi-supervised projected clustering. In: Proceedings of the 21st International Conference on Data Engineering (ICDE), Tokyo, Japan, pp 329–340, https://doi.org/10.1109/ICDE.2005.96

  76. Yiu ML, Mamoulis N (2005) Iterative projected clustering by subspace mining. IEEE Transactions on Knowledge and Data Engineering 17(2):176–189, https://doi.org/10.1109/TKDE.2005.29

  77. Zhang Q, Liu J, Wang W (2007) Incremental subspace clustering over multiple data streams. In: Proceedings of the 7th IEEE International Conference on Data Mining (ICDM), Omaha, NE, pp 727–732, https://doi.org/10.1109/ICDM.2007.100

  78. Zhang X, Pan F, Wang W (2008) CARE: finding local linear correlations in high dimensional data. In: Proceedings of the 24th International Conference on Data Engineering (ICDE), Cancun, Mexico, pp 130–139, https://doi.org/10.1109/ICDE.2008.4497421

  79. Zimek A (2013) Clustering high-dimensional data. In: Aggarwal CC, Reddy CK (eds) Data Clustering: Algorithms and Applications, CRC Press, chap 9, pp 201–230

  80. Zimek A, Vreeken J (2015) The blind men and the elephant: On meeting the problem of multiple truths in data from clustering and pattern mining perspectives. Machine Learning 98(1–2):121–155, https://doi.org/10.1007/s10994-013-5334-y

  81. Zimek A, Assent I, Vreeken J (2014) Frequent pattern mining algorithms for data clustering. In: Aggarwal CC, Han J (eds) Frequent Pattern Mining, Springer, chap 16, pp 403–423

  82. Zou H, Xue L (2018) A selective overview of sparse principal component analysis. Proceedings of the IEEE 106(8):1311–1320

Author information

Correspondence to Arthur Zimek.

Copyright information

© 2023 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Houle, M.E., Kiermeier, M., Zimek, A. (2023). Clustering High-Dimensional Data. In: Rokach, L., Maimon, O., Shmueli, E. (eds) Machine Learning for Data Science Handbook. Springer, Cham. https://doi.org/10.1007/978-3-031-24628-9_11
