A Two-Stage Unsupervised Dimension Reduction Method for Text Clustering

Bharti, Kusum Kumari; Singh, Pramod Kumar

doi:10.1007/978-81-322-1041-2_45

Kusum Kumari Bharti &
Pramod Kumar Singh

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 202)

Abstract

Feature selection is widely used in text clustering to reduce the dimensionality of the feature space. In this paper, we study and propose two-stage unsupervised feature selection methods that determine a subset of relevant features in order to improve the accuracy of the underlying clustering algorithm. We experiment with two hybrid approaches: feature selection followed by feature selection (FS–FS), and feature selection followed by feature extraction (FS–FE). In the first stage, each feature in the documents is scored on its importance for clustering using one of two feature selection methods, applied individually: Mean-Median (MM) and Mean Absolute Difference (MAD). In the second stage, in two separate experiments, we hybridize them with a feature selection method, absolute cosine (AC), and with a feature extraction method, principal component analysis (PCA), to further reduce the dimensionality of the feature space. We perform comprehensive experiments to compare FS, FS–FS, and FS–FE using k-means clustering on the Reuters-21578 dataset. The experimental results show that the two-stage feature selection methods are more effective at obtaining good-quality results from the underlying clustering algorithm. Additionally, we observe that the FS–FE approach is superior to the FS–FS approach.
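
The two-stage pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the commonly used definitions of the first-stage scores (MAD as the mean absolute deviation of a feature from its mean, MM as the absolute difference between a feature's mean and median), and it assumes that AC acts as a greedy redundancy filter that drops a feature whose absolute cosine with an already kept feature reaches a threshold. The function names, the threshold, and the use of a dense document-term matrix `X` (rows are documents, columns are terms) are hypothetical choices made for this sketch.

```python
import numpy as np

def mad_scores(X):
    """Stage 1 (assumed definition): mean absolute difference of each
    feature (column) from its own mean."""
    return np.mean(np.abs(X - X.mean(axis=0)), axis=0)

def mm_scores(X):
    """Stage 1 (assumed definition): absolute difference between each
    feature's mean and its median."""
    return np.abs(X.mean(axis=0) - np.median(X, axis=0))

def select_top(scores, m):
    """Indices of the m highest-scoring features, best first."""
    return np.argsort(scores)[::-1][:m]

def ac_filter(X, ranked, threshold):
    """Stage 2, FS-FS variant (assumed form): walk the ranked features and
    keep one only if its absolute cosine with every feature kept so far
    is below `threshold`."""
    norms = np.linalg.norm(X, axis=0)
    kept = []
    for j in ranked:
        if norms[j] == 0:
            continue  # an all-zero feature has no defined cosine
        redundant = any(
            abs(X[:, j] @ X[:, k]) / (norms[j] * norms[k]) >= threshold
            for k in kept
        )
        if not redundant:
            kept.append(int(j))
    return np.array(kept, dtype=int)

def pca_project(X, n_components):
    """Stage 2, FS-FE variant: project the centered data onto its leading
    principal components, computed with an SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

if __name__ == "__main__":
    # Toy document-term matrix: 4 documents, 4 terms (illustrative only).
    X = np.array([[1, 0, 2, 2],
                  [0, 0, 2, 4],
                  [5, 0, 2, 6],
                  [0, 0, 2, 8]], dtype=float)
    top = select_top(mad_scores(X), 2)          # stage 1: FS
    fs_fs = X[:, ac_filter(X, top, 0.9)]        # stage 2: FS-FS
    fs_fe = pca_project(X[:, top], 1)           # stage 2: FS-FE
    print(top, fs_fs.shape, fs_fe.shape)
```

Either reduced matrix (`fs_fs` or `fs_fe`) would then be passed to a k-means implementation for clustering. For a real corpus such as Reuters-21578 the document-term matrix is sparse, so a sparse-aware variant of these functions would be needed.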

Author information

Authors and Affiliations

Computational Intelligence and Data Mining Research Lab, ABV-Indian Institute of Information Technology and Management Gwalior, Morena Link Road, Gwalior, Madhya Pradesh, India
Kusum Kumari Bharti & Pramod Kumar Singh

Corresponding author

Correspondence to Kusum Kumari Bharti.

Editor information

Editors and Affiliations

South Asian University, Chankya Puri, New Delhi, 110021, India
Jagdish Chand Bansal
ABV-IIITM, Gwalior, 474015, Madhya Pradesh, India
Pramod Singh
Department of Mathematics, Indian Institute of Technology Roorkee, Roorkee, 247667, India
Kusum Deep
Department of Paper Technology, Indian Institute of Technology Roorkee, Saharanpur Campus, Roorkee, India
Millie Pant
Department of Computer Science, Liverpool Hope University, Office: FML 412, Liverpool, L16 9JD, United Kingdom
Atulya Nagar

About this paper

Cite this paper

Bharti, K.K., Singh, P.K. (2013). A Two-Stage Unsupervised Dimension Reduction Method for Text Clustering. In: Bansal, J., Singh, P., Deep, K., Pant, M., Nagar, A. (eds) Proceedings of Seventh International Conference on Bio-Inspired Computing: Theories and Applications (BIC-TA 2012). Advances in Intelligent Systems and Computing, vol 202. Springer, India. https://doi.org/10.1007/978-81-322-1041-2_45

DOI: https://doi.org/10.1007/978-81-322-1041-2_45
Published: 04 December 2012
Publisher Name: Springer, India
Print ISBN: 978-81-322-1040-5
Online ISBN: 978-81-322-1041-2
eBook Packages: Engineering, Engineering (R0)
