Investigation of Feature Selection Techniques on Performance of Automatic Text Categorization

Sisodia, Dilip Singh; Shukla, Ankit

doi:10.1007/978-981-13-6347-4_7

Abstract

Automatic text categorization (ATC) is a technique for classifying text documents: predefined classes are assigned to documents based on their textual content. A large number of features are extracted from text documents, and each document is represented as a feature vector. However, a feature vector often contains many redundant features, which incur high processing overhead and can sometimes reduce classification performance. Feature selection schemes are therefore used to select the most relevant features from the feature vector of a text document, reducing processing cost and improving the performance of the classification system. In this paper, mutual information-based weighted feature selection algorithms are used for automatic text categorization on the Ohsumed test collection, a subset of the MEDLINE database available among the KEEL text classification datasets. Four learners (SVM, kNN, DT, and NB) are combined with nine feature selection algorithms from the FEAST toolbox (BetaGamma, CMIM, MRMR, MIFS, JMI, DISR, ICAP, Condred, and CIFE). Extensive experiments are carried out, with accuracy as the performance measure. A comparison of the nine feature selection algorithms on the text document dataset suggests that weighted feature selection enhances the classification performance for text documents.
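To make the idea of mutual information-based feature selection concrete, the following is a minimal Python sketch of a greedy selector of the BetaGamma type. It is not the FEAST implementation and not the authors' code. It assumes discrete feature values, estimates mutual information from simple counts, and scores each candidate as relevance minus beta times redundancy plus gamma times conditional redundancy. The function names (mutual_information, conditional_mi, beta_gamma_select) and the toy data are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits for two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def conditional_mi(xs, zs, ys):
    """Plug-in estimate of I(X;Z|Y): average of I(X;Z) within each class y."""
    n = len(ys)
    total = 0.0
    for y, count in Counter(ys).items():
        idx = [i for i in range(n) if ys[i] == y]
        total += (count / n) * mutual_information([xs[i] for i in idx],
                                                  [zs[i] for i in idx])
    return total


def beta_gamma_select(columns, labels, k, beta=1.0, gamma=1.0):
    """Greedily pick k feature indices (illustrative sketch only).

    Score of candidate j given the already selected set S:
        I(X_j;Y) - beta * sum_s I(X_s;X_j) + gamma * sum_s I(X_s;X_j|Y)
    gamma = 0 gives a MIFS-style criterion; beta = gamma = 1 gives CIFE.
    """
    relevance = [mutual_information(col, labels) for col in columns]
    selected, remaining = [], list(range(len(columns)))

    def score(j):
        redundancy = sum(mutual_information(columns[s], columns[j])
                         for s in selected)
        conditional = sum(conditional_mi(columns[s], columns[j], labels)
                          for s in selected)
        return relevance[j] - beta * redundancy + gamma * conditional

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (assumed, not from the paper): the class is determined by two
# independent binary features a and b; a_copy is a redundant duplicate of a.
a = [0, 0, 0, 0, 1, 1, 1, 1]
b = [0, 0, 1, 1, 0, 0, 1, 1]
a_copy = list(a)
labels = [2 * ai + bi for ai, bi in zip(a, b)]
columns = [a, a_copy, b]

print(beta_gamma_select(columns, labels, k=2, beta=1.0, gamma=0.0))  # [0, 2]
```

On this toy data the selector first takes feature 0 and then skips its duplicate (feature 1) in favour of feature 2, which is the behaviour a redundancy penalty is meant to produce. In a text categorization setting the columns would be discretized term features and the selected subset would then be passed to a learner such as SVM, kNN, DT, or NB.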

Author information

Authors and Affiliations

National Institute of Technology, Raipur, India
Dilip Singh Sisodia
Jaypee University of Engineering & Technology, Guna, India
Ankit Shukla

Corresponding author

Correspondence to Dilip Singh Sisodia.

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, Sagar Institute of Research & Technology (SIRT), Bhopal, Madhya Pradesh, India
Rajesh Kumar Shukla
School of Information Technology, Rajiv Gandhi Technical University, Bhopal, Madhya Pradesh, India
Jitendra Agrawal
School of Information Technology, Rajiv Gandhi Technological University, Bhopal, Madhya Pradesh, India
Sanjeev Sharma
THDC Institute of Hydropower Engineering and Technology, Tehri, Uttarakhand, India
Geetam Singh Tomer

About this chapter

Cite this chapter

Sisodia, D.S., Shukla, A. (2019). Investigation of Feature Selection Techniques on Performance of Automatic Text Categorization. In: Shukla, R.K., Agrawal, J., Sharma, S., Singh Tomer, G. (eds) Data, Engineering and Applications. Springer, Singapore. https://doi.org/10.1007/978-981-13-6347-4_7

DOI: https://doi.org/10.1007/978-981-13-6347-4_7
Published: 19 March 2019
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-6346-7
Online ISBN: 978-981-13-6347-4
eBook Packages: Computer Science; Computer Science (R0)
