Abstract
Feature selection (FS) aims at reducing the size of the feature space. Its advantage is, in general, twofold: improved accuracy of the learned classifiers, and a more efficient learning process. Behind the use of an FS method lies the implicit assumption that only the selected terms are representative of the category being learned, while the rest are redundant. Predicting an appropriate dimensionality for the feature space is therefore a crucial task: an overly aggressive feature selection might discard terms that carry essential information, while retaining redundant features might mislead the learning algorithm. In practice, this task is usually accomplished manually, that is, the learning process is rerun over several vocabularies of different sizes and the best result is eventually taken. Unfortunately, this may require very long training times.
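The manual procedure described above can be sketched as follows. This is a minimal illustration, not the paper's method: the toy corpus, the scoring function (a crude per-class document-frequency difference standing in for metrics such as chi-square or information gain), and the proxy classifier are all assumptions introduced for the example.

```python
from collections import Counter

# Toy labeled corpus (illustrative assumption; label 1 = category member).
docs = [
    ("cheap pills buy now", 1),
    ("buy cheap watches now", 1),
    ("meeting agenda for monday", 0),
    ("monday project meeting notes", 0),
]

def score_terms(docs):
    """Score each term by |df(term, positive) - df(term, negative)|,
    a crude stand-in for metrics such as chi-square or information gain."""
    pos_df, neg_df = Counter(), Counter()
    for text, label in docs:
        for term in set(text.split()):
            (pos_df if label == 1 else neg_df)[term] += 1
    vocab = set(pos_df) | set(neg_df)
    return {t: abs(pos_df[t] - neg_df[t]) for t in vocab}, pos_df, neg_df

def accuracy(selected, docs, pos_df, neg_df):
    """Toy proxy for re-training a classifier on the reduced vocabulary:
    predict positive when the document hits more positive-leaning than
    negative-leaning selected terms."""
    correct = 0
    for text, label in docs:
        terms = set(text.split()) & selected
        pos_hits = sum(1 for t in terms if pos_df[t] > neg_df[t])
        neg_hits = sum(1 for t in terms if neg_df[t] > pos_df[t])
        pred = 1 if pos_hits > neg_hits else 0
        correct += (pred == label)
    return correct / len(docs)

scores, pos_df, neg_df = score_terms(docs)
ranked = sorted(scores, key=lambda t: (-scores[t], t))

# The manual sweep: rerun training over several vocabulary sizes, keep the best.
best_k, best_acc = None, -1.0
for k in (2, 4, 6, len(ranked)):
    acc = accuracy(set(ranked[:k]), docs, pos_df, neg_df)
    if acc > best_acc:
        best_k, best_acc = k, acc
print(best_k, best_acc)
```

On a realistic corpus each iteration of the sweep is a full training run, which is exactly the cost the paper's automatic detection of the vocabulary size is meant to avoid.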
In this paper we propose an FS technique that automatically detects a number of features sufficient to learn accurate classifiers for text categorization (TC) while keeping the training process efficient. One peculiarity of the proposed approach is that it combines both positive and negative features, the latter being considered relevant for effective TC. The proposed approach has been tested by running three well-known classifiers, namely Ripper, C4.5 and a linear SVM (the SMO implementation), over seven real-world data sets with varying characteristics.
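The idea of retaining negative features can be illustrated with a small sketch: if terms are scored by a signed correlation with the category and ranked by absolute score, terms strongly anti-correlated with the category survive the cut alongside the positive ones, since their presence is evidence against membership. The corpus and scoring below are toy assumptions, not the algorithm proposed in the paper.

```python
from collections import Counter

# Toy labeled corpus (illustrative assumption; label 1 = category member).
docs = [
    ("cheap pills buy now", 1),
    ("buy cheap watches now", 1),
    ("meeting agenda for monday", 0),
    ("monday project meeting notes", 0),
]

pos_df, neg_df = Counter(), Counter()
for text, label in docs:
    for term in set(text.split()):
        (pos_df if label == 1 else neg_df)[term] += 1

# Signed correlation with the category: > 0 positive feature, < 0 negative.
signed = {t: pos_df[t] - neg_df[t] for t in set(pos_df) | set(neg_df)}

# Rank by |score| so both extremes of the ranking survive the cut.
k = 4
selected = sorted(signed, key=lambda t: (-abs(signed[t]), t))[:k]
positive = [t for t in selected if signed[t] > 0]
negative = [t for t in selected if signed[t] < 0]
print(positive, negative)
```

A purely positive ranking would keep only category-indicative terms; here the selected vocabulary also contains terms whose presence argues against the category, which is the kind of feature the paper deems relevant for effective TC.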
© 2012 Springer-Verlag Berlin Heidelberg
Cite this paper
Pietramala, A., Policicchio, V.L., Rullo, P. (2012). Automatic Filtering of Valuable Features for Text Categorization. In: Zhou, S., Zhang, S., Karypis, G. (eds) Advanced Data Mining and Applications. ADMA 2012. Lecture Notes in Computer Science(), vol 7713. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35527-1_24
Print ISBN: 978-3-642-35526-4
Online ISBN: 978-3-642-35527-1