A Feature Selection Method for Improved Document Classification

Basu, Tanmay; Murthy, C. A.

doi:10.1007/978-3-642-35527-1_25


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7713)

Included in the following conference series:

International Conference on Advanced Data Mining and Applications


Abstract

The aim of text document classification is to automatically assign a document to a predefined class. The main problems in document classification are the high dimensionality and sparsity of the data matrix. This article proposes a new feature selection technique based on the Google distance that effectively obtains a feature subset which improves classification accuracy. The normalized Google distance can automatically extract the meaning of terms from the World Wide Web. It uses the number of hits returned by the Google search engine to compute the semantic relation between two terms. In the proposed approach, only the distance function of the Google distance is used to develop a relation between a feature and a class for document classification, and it is independent of Google search results. Every feature generates a score based on its relation with all the classes, and all the features are then ranked accordingly. Experimental results are presented using the kNN classifier on several TREC and Reuters data sets. Precision, recall, F-measure and classification accuracy are used to analyze the results. The proposed method is compared with four other feature selection methods for document classification: document frequency thresholding, information gain, mutual information and the χ² statistic. The empirical studies show that the proposed method performs feature selection effectively in most cases, with either an improvement or no change in classification accuracy.
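The abstract does not give the scoring formula, so the following is only a rough sketch of how a term-to-class distance of this kind might be computed. The `ngd` function follows the normalized Google distance as defined by Cilibrasi and Vitanyi, with search-engine hit counts replaced by document counts from a labelled corpus. The choice of counts (document frequency of the term, size of the class, number of class documents containing the term), the use of the minimum distance over classes as the term's score, and the names `ngd` and `rank_terms` are all assumptions made for illustration, not the authors' method.

```python
import math


def ngd(fx, fy, fxy, n):
    """Normalized Google distance computed from raw counts.

    NGD(x, y) = (max(log fx, log fy) - log fxy) / (log n - min(log fx, log fy))
    fx, fy: counts for x and y; fxy: co-occurrence count; n: total count.
    """
    if fxy == 0:
        return math.inf  # x and y never occur together
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    denom = math.log(n) - min(lx, ly)
    if denom == 0:
        return 0.0  # degenerate case: both x and y cover the whole corpus
    return (max(lx, ly) - lxy) / denom


def rank_terms(docs, labels):
    """Rank terms by their smallest distance to any class.

    docs: list of sets of terms; labels: class label of each document.
    Assumption: a term's score is the minimum term-to-class distance,
    so terms tightly bound to one class rank first.
    """
    n = len(docs)
    classes = sorted(set(labels))
    class_size = {c: labels.count(c) for c in classes}
    vocab = sorted(set().union(*docs))
    scores = {}
    for term in vocab:
        df = sum(1 for d in docs if term in d)  # document frequency
        dists = []
        for c in classes:
            co = sum(1 for d, l in zip(docs, labels) if l == c and term in d)
            dists.append(ngd(df, class_size[c], co, n))
        scores[term] = min(dists)
    ranked = sorted(vocab, key=lambda t: scores[t])
    return ranked, scores


# Toy corpus (made up for illustration only).
docs = [{"goal", "match"}, {"goal", "team"}, {"vote", "party"}, {"vote", "team"}]
labels = ["sport", "sport", "politics", "politics"]
ranked, scores = rank_terms(docs, labels)
```

In this toy corpus, "goal" appears in exactly the sport documents and so has distance zero to that class, while "team" is split evenly across both classes and ranks last. A feature subset would then be taken from the top of `ranked`; how many features to keep is a separate choice that the abstract does not specify.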



Author information

Authors and Affiliations

Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India
Tanmay Basu & C. A. Murthy


Editor information

Editors and Affiliations

School of Computer Science, Fudan University, Handan Road 220, 200433, Shanghai, China
Shuigeng Zhou
Chinese Academy of Sciences, Academy of Mathematics and Systems Science, Dongguancun East Road 55, 100190, Beijing, China
Songmao Zhang
Department of Computer Science and Engineering, University of Minnesota, Union Street SE 200, 55455, Minneapolis, MN, USA
George Karypis


About this paper

Cite this paper

Basu, T., Murthy, C.A. (2012). A Feature Selection Method for Improved Document Classification. In: Zhou, S., Zhang, S., Karypis, G. (eds) Advanced Data Mining and Applications. ADMA 2012. Lecture Notes in Computer Science, vol 7713. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35527-1_25


DOI: https://doi.org/10.1007/978-3-642-35527-1_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35526-4
Online ISBN: 978-3-642-35527-1
eBook Packages: Computer Science, Computer Science (R0)
