Skip to main content

Investigation of Feature Selection Techniques on Performance of Automatic Text Categorization

  • Chapter
  • First Online:
Data, Engineering and Applications

Abstract

Automatic text categorization (ATC) is a technique of the text document classification. Based on the textual content of documents, predefined classes are assigned. Large numbers of features are extracted from text documents, and documents are represented as feature vectors. However, feature vector contains many redundant features which cost high processing overhead, and sometimes, the performance of the classification is reduced. Therefore, feature selection schemes are used to select a most relevant feature from the feature vector of a text document for reducing the processing cost and improve the performance of the classification system. In this paper, mutual information-based weighted feature selection algorithms are used for automatic text categorization on the Ohsumed test collection dataset which is a subset of the MEDLINE database available in KEEL text classification dataset. The implementation of four learners SVM, kNN, DT, and NB along with nine feature selection algorithms such as BetaGamma, CMIM, MRMR, MIFS, JMI, DISR, ICAP, Condred, and CIFE is used for experimentation from FEAST toolbox. The extensive experiments are carried out for the performance evaluation using accuracy. On comparing nine feature selection algorithm on text document data set. The results suggested that weighted feature selection is enhancing the classification performance of text documentation.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 84.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 109.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info
Hardcover Book
USD 109.99
Price excludes VAT (USA)
  • Durable hardcover edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Joachims, T.: Text categorization with support vector machines: learning with many relevant features. In: Proceedings of the 10th European Conference on Machine Learning ECML ’98, pp. 137–142 (1998)

    Google Scholar 

  2. Markowetz, F.: Classification by support vector machines. In: Discrete Methods in Epidemiology, pp. 1–9 (2000)

    Google Scholar 

  3. Leung, K.M.: Naive Bayesian Classifier (2007)

    Google Scholar 

  4. Friedl, M.A., Brodley, C.E.: Decision tree classification of land cover from remotely sensed data. Remote Sens. Environ. 61, 399–409 (1997)

    Article  Google Scholar 

  5. Cai, Y., Ji, D., Cai, D.: A KNN research paper classification method based on shared nearest neighbor. In: Proceedings of the 8th NTCIR Workshop Meeting on Evaluation of Information Access Technologies: Information Retrieval, Question Answering and Cross-Lingual Information Access, pp. 336–340 (2010)

    Google Scholar 

  6. Ladha, L., Deepa, T.: Feature selection methods and algorithms. Int. J. Comput. Sci. Eng. 3, 1787–1797 (2011)

    Google Scholar 

  7. Brown, G., Pocock, A., Zhao, M.-J., Lujan, M.: Conditional likelihood maximisation: a unifying framework for mutual information feature selection. J. Mach. Learn. Res. 13, 27–66 (2012)

    MathSciNet  MATH  Google Scholar 

  8. Albrechtsen, H.: Subject analysis and indexing. From automated indexing to domain analysis. Indexer 18, 219–224 (1993)

    Google Scholar 

  9. Amati, G., van Rijsbergen, C.J.: Term frequency normalization via Pareto distributions. Adv. Inf. Retr. 2291, 183–192 (2002)

    Article  Google Scholar 

  10. Gini coefficient

    Google Scholar 

  11. Kraskov, A., Stögbauer, H., Grassberger, P.: Estimating mutual information. Phys. Rev. E-Stat. Nonlinear Soft Matter Phys. 69 (2004)

    Google Scholar 

  12. Boulis, C., Ostendorf, M.: Text classification by augmenting the bag-of-words representation with redundancy compensated bigrams. In: Workshop on Feature Selection in Data Mining, pp. 9–16 (2005)

    Google Scholar 

  13. Agre, G., Dzhondzhorov, A.: A weighted feature selection method for instance-based classification. In: International Conference on Artificial Intelligence: Methodology, Systems, and Applications, pp. 14–25 (2016)

    Google Scholar 

  14. Pluim, J.P.W., Maintz, J.B.A.A., Viergever, M.A.: Mutual-Information-Based Registration of Medical Images: A survey (2003)

    Google Scholar 

  15. Li, W.: Mutual information functions versus correlation functions. J. Stat. Phys. 60, 823–837 (1990)

    Article  MathSciNet  Google Scholar 

  16. Nigam, K., Lafferty, J., Mccallum, A.: Using maximum entropy for text classification. In: IJCAI-99 Workshop on Machine Learning for Information Filtering, vol. 1, pp. 61–67 (1999)

    Google Scholar 

  17. Alcalá-Fdez, J., Fernández, A., Luengo, J., Derrac, J., García, S., Sánchez, L., Herrera, F.: KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J. Mult. Valued Log. Soft Comput. 17, 255–287 (2011)

    Google Scholar 

  18. Fleuret, F.: Fast binary feature selection with conditional mutual information. J. Mach. Learn. Res. 5, 1531–1555 (2004)

    MathSciNet  MATH  Google Scholar 

  19. Bennasar, M., Hicks, Y., Setchi, R.: Feature selection using joint mutual information maximisation. Expert Syst. Appl. 42, 8520–8532 (2015)

    Article  Google Scholar 

  20. Long, W.C., Swiney, K.M., Harris, C., Page, H.N., Foy, R.J.: Effects of ocean acidification on juvenile red king crab (Paralithodes camtschaticus) and tanner crab (Chionoecetes bairdi) growth, condition, calcification, and survival. PLoS ONE 8 (2013)

    Google Scholar 

  21. Rades, M., Ewins, D.: Mifs and macs in modal analysis. In: Modal Analysis Conference (IMAC-20), pp. 771–778 (2002)

    Google Scholar 

  22. Jakulin, A.: Machine learning based on attribute interactions. PhD thesis, pp. 1–252 (2005)

    Google Scholar 

  23. Bar-Nun, A., Dimitrov, V., Tomasko, M.: Titan’s aerosols: comparison between our model and DISR findings. Planet. Space Sci. 56, 708–714 (2008)

    Article  Google Scholar 

  24. Fischer, M., Stone, M., Liston, K., Kunz, J., Singhal, V.: Multi-stakeholder collaboration : the CIFE iRoom. In: International Council for Research and Innovation in Building and Construction. CIB W78 Conference, pp. 12–14 (2002)

    Google Scholar 

  25. Lewis, D.: Feature selection and feature extract ion for text categorization. In: Speech and Natural Language: Proceedings of a Workshop Held at Harriman, New York, 23–26 Feb 1992

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Dilip Singh Sisodia .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this chapter

Check for updates. Verify currency and authenticity via CrossMark

Cite this chapter

Sisodia, D.S., Shukla, A. (2019). Investigation of Feature Selection Techniques on Performance of Automatic Text Categorization. In: Shukla, R.K., Agrawal, J., Sharma, S., Singh Tomer, G. (eds) Data, Engineering and Applications. Springer, Singapore. https://doi.org/10.1007/978-981-13-6347-4_7

Download citation

  • DOI: https://doi.org/10.1007/978-981-13-6347-4_7

  • Published:

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-13-6346-7

  • Online ISBN: 978-981-13-6347-4

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics