Abstract
Clustering high-dimensional datasets is a major area of research because of its widespread applications in many domains. However, meaningful clustering of high-dimensional data is challenging because (i) such datasets usually contain many irrelevant dimensions that hide the clusters, (ii) distance, the most common similarity measure, loses its meaning in high dimensions, and (iii) different clusters may exist in different subsets of dimensions. Feature selection based clustering methods address the problem of clustering high-dimensional data; however, finding all clusters in a single subset of a few selected relevant dimensions is not justified, since different clusters may exist in different subsets of dimensions. In this article, we propose PROFIT (PROjective clustering algorithm based on FIsher score and Trimmed mean), which extends the idea of feature selection based clustering to projective clustering and works well with high-dimensional datasets whose attributes lie in a continuous variable domain. It works in four phases: a sampling phase, an initialization phase, a dimension selection phase and a refinement phase. We experiment on five real datasets with different input parameters and compare against three well-known top-down subspace clustering methods (PROCLUS, ORCLUS and PCKA) as well as our feature selection based non-subspace clustering method FAMCA. The results are evaluated with two well-known subspace clustering quality measures (the Jagota index and the sum of squared error), and Student's t-test is applied to determine whether the differences between clustering results are significant. The results and quality measures show the effectiveness and superiority of the proposed method PROFIT over its competitors.
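The chapter body details the four phases of PROFIT. As a rough, hypothetical sketch of the two statistics the algorithm's name refers to (not the authors' implementation), the per-dimension Fisher score used for dimension relevance and a trimmed mean used for robust centering can be computed as follows; the function names and the trim fraction are illustrative assumptions:

```python
import numpy as np

def trimmed_mean(x, trim=0.1):
    """Mean after discarding the lowest and highest `trim` fraction of values.

    The trim fraction here is an illustrative choice, not a value
    prescribed by the PROFIT chapter.
    """
    x = np.sort(np.asarray(x, dtype=float))
    k = int(trim * len(x))
    return x[k:len(x) - k].mean() if len(x) - 2 * k > 0 else x.mean()

def fisher_score(X, labels):
    """Per-dimension Fisher score: between-cluster scatter of the cluster
    means divided by within-cluster variance, computed independently for
    each dimension. Higher scores suggest dimensions that separate clusters."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    overall_mean = X.mean(axis=0)
    num = np.zeros(X.shape[1])  # between-cluster scatter per dimension
    den = np.zeros(X.shape[1])  # within-cluster scatter per dimension
    for c in np.unique(labels):
        Xc = X[labels == c]
        num += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        den += len(Xc) * Xc.var(axis=0)
    return num / np.maximum(den, 1e-12)  # guard against zero variance
```

A dimension-selection step of the kind the abstract describes could then keep, per cluster, only the dimensions whose Fisher score exceeds a threshold, with the trimmed mean supplying outlier-resistant cluster centers.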
References
Aggarwal, C., Yu, P.: Finding generalized projected clusters in high dimensional spaces. In: ACM SIGMOD International Conference on Management of Data, pp. 70–81. ACM (2000)
Aggarwal, C., Wolf, J., Yu, P., Procopiuc, C., Park, J.: Fast algorithms for projected clustering. ACM SIGMOD Record 28(2), 61–72 (1999)
Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data for data mining applications. In: ACM SIGMOD International Conference on Management of Data, pp. 94–105. ACM Press (1998)
Andrews, H., Patterson, C.: Singular value decompositions and digital image processing. IEEE Trans. Acoust. Speech Signal Process 24(1), 26–53 (1976)
Apolloni, B., Bassis, S., Brega, A.: Feature selection via boolean independent component analysis. Inf. Sci. 179(22), 3815–3831 (2009)
Arai, K., Barakbah, A.: Hierarchical k-means: An algorithm for centroids initialization for k-means. Rep. Fac. Sci. Eng. 36(1), 25–31 (2007)
Barakbah, A., Kiyoki, Y.: A pillar algorithm for k-means optimization by distance maximization for initial centroid designation. In: Computational Intelligence and Data Mining, 2009. IEEE Symposium on CIDM'09, pp. 61–68. IEEE (2009)
Berkhin, P.: A survey of clustering data mining techniques. Technical Report (2002)
Bouguessa, M., Wang, S.: Mining projected clusters in high-dimensional spaces. IEEE Trans. Knowl. Data Eng. 21(4), 507–522 (2009)
Celebi, M.: Effective initialization of k-means for color quantization. In: 16th IEEE International Conference on Image Processing (ICIP), 2009, pp. 1649–1652. IEEE (2009)
Cheng, C., Fu, A., Zhang, Y.: Entropy-based subspace clustering for mining numerical data. In: Proceedings of the fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 84–93. ACM (1999)
Chu, Y., Huang, J., Chuang, K., Yang, D., Chen, M.: Density conscious subspace clustering for high-dimensional data. IEEE Trans. Knowl. Data Eng. 22(1), 16–30 (2010)
Ding, C., He, X.: K-means clustering via principal component analysis. In: Proceedings of the twenty-first International Conference on Machine Learning, pp. 225–232. ACM (2004)
Gheyas, I., Smith, L.: Feature subset selection in large dimensionality domains. Pattern Recognit. 43(1), 5–13 (2010)
Goil, S., Nagesh, H., Choudhary, A.: Mafia: Efficient and scalable subspace clustering for very large data sets. In: Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 443–452 (1999)
Günnemann, S., Färber, I., Müller, E., Seidl, T.: Asclu: Alternative subspace clustering. In: MultiClust at KDD. Citeseer (2010)
Han, J., Kamber, M.: Data Mining: Concepts and Techniques, 2nd edn. Morgan Kaufmann (2001)
Hu, Q., Che, X., Zhang, L., Yu, D.: Feature evaluation and selection based on neighborhood soft margin. Neurocomputing 73(10), 2114–2124 (2010)
Jagota, A.: Novelty detection on a very large number of memories stored in a hopfield-style network. In: IJCNN-91-Seattle International Joint Conference on Neural Networks, 1991, vol. 2, pp. 905–. IEEE (1991)
Jain, A., Dubes, R.: Algorithms for Clustering Data. Prentice-Hall, Inc. (1988)
Jain, A., Murty, M., Flynn, P.: Data clustering: A review. ACM Computing Surveys (CSUR) 31(3), 264–323 (1999)
Kabir, M., Islam, M., et al.: A new wrapper feature selection approach using neural network. Neurocomputing 73(16), 3273–3283 (2010)
Khan, S., Ahmad, A.: Cluster center initialization algorithm for k-means clustering. Pattern Recognit. Lett. 25(11), 1293–1302 (2004)
Kriegel, H., Kröger, P., Zimek, A.: Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans. Knowledge Discov. Data (TKDD) 3(1), 1–58 (2009)
Kruskal, J., Wish, M.: Multidimensional Scaling, Quantitative Applications in the Social Sciences. Beverly Hills (1978)
Liu, Y., Liu, Y., Chan, K.: Dimensionality reduction for heterogeneous dataset in rushes editing. Pattern Recognit. 42(2), 229–242 (2009)
Moise, G., Zimek, A., Kröger, P., Kriegel, H., Sander, J.: Subspace and projected clustering: Experimental evaluation and analysis. Knowl. Inf. Syst. 21(3), 299–326 (2009)
Ng, R., Han, J.: Clarans: A method for clustering objects for spatial data mining. IEEE Trans. Knowl. Data Eng. 14(5), 1003–1016 (2002)
Parsons, L., Haque, E., Liu, H.: Subspace clustering for high dimensional data: A review. ACM SIGKDD Explorations Newsletter. 6(1), 90–105 (2004)
Parsons, L., Haque, E., Liu, H., et al.: Evaluating subspace clustering algorithms. In: Workshop on Clustering High Dimensional Data and its Applications, SIAM International Conference on Data Mining, pp. 48–56. Citeseer (2004)
Pearson, E.: Studies in the history of probability and statistics. XX: Some early correspondence between W.S. Gosset, R.A. Fisher and Karl Pearson, with notes and comments. Biometrika 55(3), 445–457 (1968)
Puri, C., Kumar, N.: Projected Gustafson-Kessel clustering algorithm and its convergence. Trans. on Rough Sets XIV, 159–182 (2011)
Rajput, D., Singh, P., Bhattacharya, M.: An efficient technique for clustering high dimensional data set. In: 10th International Conference on Information and Knowledge Engineering. pp. 434–440. WASET, USA (July 2011)
Roweis, S., Saul, L.: Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500), 2323–2326 (2000)
Sugiyama, M., Kawanabe, M., Chui, P.: Dimensionality reduction for density ratio estimation in high-dimensional spaces. Neural Netw. 23(1), 44–59 (2010)
Tenenbaum, J., De Silva, V., Langford, J.: A global geometric framework for nonlinear dimensionality reduction. Science 290(5500), 2319–2323 (2000)
Veenman, C., Reinders, M., Backer, E.: A maximum variance cluster algorithm. IEEE Trans. Patt. Anal. Machine Intell. 24(9), 1273–1280 (2002)
Wang, D., Ding, C., Li, T.: K-subspace clustering. Machine Learn. Knowl. Discov. Databases 506–521 (2009)
© 2015 Springer International Publishing Switzerland
Cite this chapter
Rajput, D., Singh, P., Bhattacharya, M. (2015). PROFIT: A Projected Clustering Technique. In: Abou-Nasr, M., Lessmann, S., Stahlbock, R., Weiss, G. (eds) Real World Data Mining Applications. Annals of Information Systems, vol 17. Springer, Cham. https://doi.org/10.1007/978-3-319-07812-0_4
DOI: https://doi.org/10.1007/978-3-319-07812-0_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-07811-3
Online ISBN: 978-3-319-07812-0