Abstract
Feature selection (FS) aims at reducing the size of the feature space. Its advantage is, in general, twofold: improved accuracy of the learned classifiers, and a more efficient learning process. Behind the use of an FS method lies the implicit assumption that only the selected terms are representative of the category being learned, while the rest are redundant. Predicting an appropriate dimensionality for the feature space is therefore a crucial task: an overly aggressive feature selection might discard terms that carry essential information, while retaining redundant features might mislead the learning algorithm. In practice, this task is usually accomplished manually, that is, the learning process is rerun over several vocabularies of different sizes and the best result is eventually taken. Unfortunately, this may require very long training times.
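The manual procedure described above can be sketched as follows. This is a minimal illustration, not the paper's method: the toy corpus, the scoring function (a crude per-class document-frequency difference standing in for metrics such as chi-square or information gain), and the proxy classifier are all assumptions introduced for the example.

```python
from collections import Counter

# Toy labeled corpus (illustrative assumption; label 1 = category member).
docs = [
    ("cheap pills buy now", 1),
    ("buy cheap watches now", 1),
    ("meeting agenda for monday", 0),
    ("monday project meeting notes", 0),
]

def score_terms(docs):
    """Score each term by |df(term, positive) - df(term, negative)|,
    a crude stand-in for metrics such as chi-square or information gain."""
    pos_df, neg_df = Counter(), Counter()
    for text, label in docs:
        for term in set(text.split()):
            (pos_df if label == 1 else neg_df)[term] += 1
    vocab = set(pos_df) | set(neg_df)
    return {t: abs(pos_df[t] - neg_df[t]) for t in vocab}, pos_df, neg_df

def accuracy(selected, docs, pos_df, neg_df):
    """Toy proxy for re-training a classifier on the reduced vocabulary:
    predict positive when the document hits more positive-leaning than
    negative-leaning selected terms."""
    correct = 0
    for text, label in docs:
        terms = set(text.split()) & selected
        pos_hits = sum(1 for t in terms if pos_df[t] > neg_df[t])
        neg_hits = sum(1 for t in terms if neg_df[t] > pos_df[t])
        pred = 1 if pos_hits > neg_hits else 0
        correct += (pred == label)
    return correct / len(docs)

scores, pos_df, neg_df = score_terms(docs)
ranked = sorted(scores, key=lambda t: (-scores[t], t))

# The manual sweep: rerun training over several vocabulary sizes, keep the best.
best_k, best_acc = None, -1.0
for k in (2, 4, 6, len(ranked)):
    acc = accuracy(set(ranked[:k]), docs, pos_df, neg_df)
    if acc > best_acc:
        best_k, best_acc = k, acc
print(best_k, best_acc)
```

On a realistic corpus each iteration of the sweep is a full training run, which is exactly the cost the paper's automatic detection of the vocabulary size is meant to avoid.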
In this paper we propose an FS technique that automatically detects a number of features sufficient to learn accurate classifiers for text categorization (TC) while keeping the training process efficient. One peculiarity of the proposed approach is that it combines both positive and negative features, the latter being considered relevant for effective TC. The proposed approach has been tested by running three well-known classifiers, namely Ripper, C4.5 and a linear SVM (the SMO implementation), over seven real-world data sets with varying characteristics.
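The idea of retaining negative features can be illustrated with a small sketch: if terms are scored by a signed correlation with the category and ranked by absolute score, terms strongly anti-correlated with the category survive the cut alongside the positive ones, since their presence is evidence against membership. The corpus and scoring below are toy assumptions, not the algorithm proposed in the paper.

```python
from collections import Counter

# Toy labeled corpus (illustrative assumption; label 1 = category member).
docs = [
    ("cheap pills buy now", 1),
    ("buy cheap watches now", 1),
    ("meeting agenda for monday", 0),
    ("monday project meeting notes", 0),
]

pos_df, neg_df = Counter(), Counter()
for text, label in docs:
    for term in set(text.split()):
        (pos_df if label == 1 else neg_df)[term] += 1

# Signed correlation with the category: > 0 positive feature, < 0 negative.
signed = {t: pos_df[t] - neg_df[t] for t in set(pos_df) | set(neg_df)}

# Rank by |score| so both extremes of the ranking survive the cut.
k = 4
selected = sorted(signed, key=lambda t: (-abs(signed[t]), t))[:k]
positive = [t for t in selected if signed[t] > 0]
negative = [t for t in selected if signed[t] < 0]
print(positive, negative)
```

A purely positive ranking would keep only category-indicative terms; here the selected vocabulary also contains terms whose presence argues against the category, which is the kind of feature the paper deems relevant for effective TC.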
© 2012 Springer-Verlag Berlin Heidelberg
Cite this paper
Pietramala, A., Policicchio, V.L., Rullo, P. (2012). Automatic Filtering of Valuable Features for Text Categorization. In: Zhou, S., Zhang, S., Karypis, G. (eds) Advanced Data Mining and Applications. ADMA 2012. Lecture Notes in Computer Science(), vol 7713. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35527-1_24
Print ISBN: 978-3-642-35526-4
Online ISBN: 978-3-642-35527-1