Encyclopedia of Database Systems

2018 Edition
Editors: Ling Liu, M. Tamer Özsu

Feature Selection for Clustering

  • Manoranjan Dash
  • Poon Wei Koot
Reference work entry
DOI: https://doi.org/10.1007/978-1-4614-8265-9_613

Definition

The problem of feature selection arises because, when collecting data, one tends to record every attribute available. For a specific learning task such as clustering, however, not all attributes (features) are important. Feature selection is well established in supervised learning, that is, for classification, because the class labels are given and it is easier to select the features that lead to these classes. For unsupervised data without class labels, that is, for clustering, it is less obvious which features should be selected. Some features may be redundant, some irrelevant, and others only “weakly relevant.” The task of feature selection for clustering is to select the “best” set of relevant features, the one that helps uncover the natural clusters in the data according to the chosen criterion.

Figure 1 shows an example using synthetic data. There are three clusters in the F1-F2 dimensions, each following a Gaussian distribution, whereas F3, which does not define...
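
The setting described above can be illustrated with a minimal sketch. It is not the authors' method: the data generator, the cluster centers, the uniform noise range for F3, the small k-means routine with farthest-first initialization, and the use of the mean silhouette width as the "chosen criterion" are all assumptions made for illustration. The sketch builds synthetic data with three Gaussian clusters in F1-F2 plus an F3 that carries no cluster structure, then scores three candidate feature subsets.

```python
import math
import random


def make_data(n_per_cluster=30, seed=0):
    """Three Gaussian clusters in (F1, F2); F3 is uniform noise (assumed setup)."""
    rng = random.Random(seed)
    centers = [(0.0, 0.0), (6.0, 0.0), (3.0, 6.0)]  # hypothetical cluster centers
    data = []
    for cx, cy in centers:
        for _ in range(n_per_cluster):
            data.append((rng.gauss(cx, 0.5), rng.gauss(cy, 0.5),
                         rng.uniform(0.0, 10.0)))
    return data


def project(data, features):
    """Keep only the selected feature indices of every row."""
    return [tuple(row[f] for f in features) for row in data]


def kmeans(points, k, iters=50):
    """Plain k-means with deterministic farthest-first initialization."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # leave the center of an empty cluster unchanged
                centers[j] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette width; assumes at least two non-empty clusters."""
    clusters = {}
    for i, l in enumerate(labels):
        clusters.setdefault(l, []).append(i)
    total = 0.0
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette of 0
        a = sum(math.dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(sum(math.dist(p, points[j]) for j in idx) / len(idx)
                for l, idx in clusters.items() if l != labels[i])
        total += (b - a) / max(a, b)
    return total / len(points)


def score(data, features, k=3):
    """Criterion value of a feature subset: silhouette of its k-means clustering."""
    pts = project(data, features)
    return silhouette(pts, kmeans(pts, k))


data = make_data()
candidates = {"F1,F2": (0, 1), "F1,F2,F3": (0, 1, 2), "F3": (2,)}
scores = {name: score(data, feats) for name, feats in candidates.items()}
for name, s in scores.items():
    print(f"{name}: {s:.3f}")
```

Under these assumptions the subset {F1, F2} should score higher than either subset that involves F3, because adding the noise feature stretches within-cluster distances without adding separation. The exact values depend on the seed and on the relative scale of the features, and a different criterion may rank subsets differently.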

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Nanyang Technological University, Singapore, Singapore

Section editors and affiliations

  • Dimitrios Gunopulos (1)
  1. Department of Computer Science and Engineering, The University of California at Riverside, Bourns College of Engineering, Riverside, USA