A Comparative Study on Feature Selection Techniques for Multi-cluster Text Data

  • Conference paper

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 741)

Abstract

Text clustering involves very high-dimensional data. Feature selection techniques find subsets of relevant features from the original feature space that make clustering more efficient and effective. Selecting features on ranking scores alone, without considering the correlation among them, degrades clustering performance. An efficient feature selection technique should preserve the multi-cluster structure of the data. The present work aims to demonstrate that feature selection techniques that account for the correlation among features in a multi-cluster scenario yield better clustering results than techniques that simply rank features independently of each other. To this end, the paper compares two feature selection techniques: the traditional Tf-Idf and Multi-Cluster Feature Selection (MCFS). Experimental results on the TDT2 and Reuters-21578 datasets show that MCFS gives better clustering results than traditional Tf-Idf.
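To make the idea of independent ranking concrete, the sketch below scores each term by its Tf-Idf weight summed over all documents and keeps the top k. It is only an illustration under stated assumptions: the function name `tfidf_rank`, the whitespace tokenization, the plain `log(N / df)` form of idf, the summing of scores across documents, and the toy documents are all hypothetical choices and are not taken from the paper. Each term is scored on its own, so any correlation between the selected terms is ignored, which is the limitation the abstract attributes to ranking-based selection.

```python
import math
from collections import Counter


def tfidf_rank(docs, k):
    """Return the k terms with the highest Tf-Idf score summed over docs.

    Assumption: whitespace tokenization and idf = log(N / df).
    """
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]

    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = Counter()
    for tokens in tokenized:
        tf = Counter(tokens)
        for term, count in tf.items():
            # A term that occurs in every document gets idf = log(1) = 0.
            scores[term] += (count / len(tokens)) * math.log(n_docs / df[term])

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical toy corpus with two rough topics and one ubiquitous term.
docs = [
    "news stock market shares fall",
    "news stock market shares rise",
    "news football match ends draw",
    "news football team wins match",
]
selected = tfidf_rank(docs, 5)
full_ranking = tfidf_rank(docs, 12)
```

In this toy corpus the term "news" occurs in every document, so its score is zero and it falls to the bottom of the ranking. Nothing in the procedure checks whether the selected terms cover both topics, which is the gap a correlation-aware method such as MCFS is meant to address.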



Author information

Corresponding author

Correspondence to Ananya Gupta.

Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Gupta, A., Begum, S.A. (2019). A Comparative Study on Feature Selection Techniques for Multi-cluster Text Data. In: Yadav, N., Yadav, A., Bansal, J., Deep, K., Kim, J. (eds) Harmony Search and Nature Inspired Optimization Algorithms. Advances in Intelligent Systems and Computing, vol 741. Springer, Singapore. https://doi.org/10.1007/978-981-13-0761-4_21
