Abstract
Text clustering involves data of very high dimension. Feature selection techniques find subsets of relevant features in the original feature space that support efficient and effective clustering. Selecting relevant features merely by their ranking scores, without considering the correlation among them, interferes with clustering performance. An efficient feature selection technique should be capable of preserving the multi-cluster structure of the data. The purpose of the present work is to demonstrate that feature selection techniques which take the correlation among features into account in a multi-cluster scenario yield better clustering results than techniques that simply rank features independently of each other. This paper compares two feature selection techniques in this regard, viz. the traditional Tf-Idf and the Multi-Cluster Feature Selection (MCFS) technique. Experimental results on the TDT2 and Reuters-21578 datasets show that MCFS gives better clustering results than traditional Tf-Idf.
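To make concrete what "ranking features independently of each other" means, the following is a minimal sketch of Tf-Idf-based feature ranking. It is an illustration under stated assumptions, not the chapter's implementation: the function name `tfidf_rank`, the whitespace tokenization, the natural-log IDF, and the choice to aggregate a term's score by summing over documents are all hypothetical.

```python
import math
from collections import Counter


def tfidf_rank(docs, top_k):
    """Return the top_k terms ranked by summed Tf-Idf weight.

    Assumption: summing a term's Tf-Idf weight over all documents is one
    common aggregate; the chapter may score terms differently.
    """
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]

    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = Counter()
    for tokens in tokenized:
        if not tokens:
            continue
        counts = Counter(tokens)
        for term, count in counts.items():
            tf = count / len(tokens)            # relative term frequency
            idf = math.log(n_docs / df[term])   # 0 for terms in every document
            scores[term] += tf * idf

    # Each term is scored on its own; correlation between terms is ignored.
    # Ties are broken alphabetically so the ranking is deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_k]]


# Toy corpus (hypothetical): two finance documents, two sports documents.
docs = [
    "stock market shares fall",
    "stock market shares rise",
    "football match final score",
    "football match league table",
]
selected = tfidf_rank(docs, top_k=4)
print(selected)
```

In this toy corpus "stock", "market" and "shares" always occur together, so they receive identical scores. A top-k cut on such a ranking can therefore keep several redundant terms from one cluster, which is the weakness of independent ranking that the abstract contrasts with MCFS.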
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Gupta, A., Begum, S.A. (2019). A Comparative Study on Feature Selection Techniques for Multi-cluster Text Data. In: Yadav, N., Yadav, A., Bansal, J., Deep, K., Kim, J. (eds) Harmony Search and Nature Inspired Optimization Algorithms. Advances in Intelligent Systems and Computing, vol 741. Springer, Singapore. https://doi.org/10.1007/978-981-13-0761-4_21
DOI: https://doi.org/10.1007/978-981-13-0761-4_21
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-0760-7
Online ISBN: 978-981-13-0761-4
eBook Packages: Intelligent Technologies and Robotics (R0)