
Automatic Filtering of Valuable Features for Text Categorization

  • Conference paper
Advanced Data Mining and Applications (ADMA 2012)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7713)

Abstract

Feature selection (FS) is aimed at reducing the size of the feature space. Its benefits are, in general, two-fold: improved accuracy of the learned classifiers and a more efficient learning process. Behind the use of an FS method lies the implicit assumption that only the selected terms are representative of the category being learned, while the rest are redundant. Choosing an appropriate dimensionality for the feature space is therefore a crucial task: overly aggressive feature selection may discard terms that carry essential information, while redundant features may mislead the learning algorithm. In practice, this task is usually carried out manually, that is, the learning process is rerun over several vocabularies of different sizes and the best result is eventually taken. Unfortunately, this may require very long training times.
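
To make the manual procedure concrete, here is a minimal sketch, assuming a scikit-learn setup and the 20 Newsgroups corpus (both illustrative choices, not part of the paper): the classifier is retrained over vocabularies of several candidate sizes, and the dimensionality that scores best under cross-validation is kept.

# Minimal sketch of the manual tuning loop described above (illustrative only):
# retrain a classifier over vocabularies of different sizes and keep the best one.
# The corpus, the candidate sizes, and the chi^2 / linear-SVM choices are assumptions.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
X = TfidfVectorizer(stop_words="english").fit_transform(train.data)
y = train.target

best_k, best_score = None, -1.0
for k in (500, 1000, 2000, 5000, 10000):      # candidate vocabulary sizes
    pipe = Pipeline([("select", SelectKBest(chi2, k=k)), ("clf", LinearSVC())])
    score = cross_val_score(pipe, X, y, cv=3).mean()
    if score > best_score:
        best_k, best_score = k, score

print(f"best vocabulary size: {best_k} (cross-validated accuracy {best_score:.3f})")

Each candidate size requires a full retraining, which is exactly the cost the paper aims to avoid.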

In this paper we propose an FS technique that automatically determines a number of features for text categorization (TC) that is sufficient to learn accurate classifiers while keeping the training process efficient. A distinctive trait of the proposed approach is that it combines positive and negative features, the latter being regarded as relevant for effective TC. The approach has been tested by running three well-known classifiers, namely Ripper, C4.5, and a linear SVM (the SMO implementation), over seven real-world data sets with varying characteristics.
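
As an illustration of what combining positive and negative features can look like (a sketch under assumed choices, not the paper's algorithm), one can rank the terms of a binary TC task by chi-square, split them by the sign of their correlation with the category, and take the top terms from each side:

# Illustrative sketch: combine positive and negative features for a binary category.
# The corpus, the chi^2 ranking, and the per-side feature counts are assumptions,
# not the selection procedure proposed in the paper.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

train = fetch_20newsgroups(subset="train", categories=["sci.med", "rec.autos"])
vec = CountVectorizer(stop_words="english", binary=True)
X = vec.fit_transform(train.data)
y = (train.target == 0).astype(int)             # 1 = documents of the positive category

scores, _ = chi2(X, y)                          # relevance of each term to the category
pos_rate = np.asarray(X[y == 1].mean(axis=0)).ravel()   # term frequency in positive docs
neg_rate = np.asarray(X[y == 0].mean(axis=0)).ravel()   # term frequency in negative docs

pos_terms = np.where(pos_rate > neg_rate)[0]    # terms indicative of the category
neg_terms = np.where(pos_rate <= neg_rate)[0]   # terms indicative of its complement
top_pos = pos_terms[np.argsort(scores[pos_terms])[::-1][:50]]
top_neg = neg_terms[np.argsort(scores[neg_terms])[::-1][:50]]

selected = np.concatenate([top_pos, top_neg])   # combined positive + negative vocabulary
print(np.array(vec.get_feature_names_out())[selected][:10])

How many such features to retain overall is what the proposed technique sets out to determine automatically.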

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Pietramala, A., Policicchio, V.L., Rullo, P. (2012). Automatic Filtering of Valuable Features for Text Categorization. In: Zhou, S., Zhang, S., Karypis, G. (eds) Advanced Data Mining and Applications. ADMA 2012. Lecture Notes in Computer Science (LNAI), vol. 7713. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35527-1_24

  • DOI: https://doi.org/10.1007/978-3-642-35527-1_24

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-35526-4

  • Online ISBN: 978-3-642-35527-1

  • eBook Packages: Computer Science, Computer Science (R0)
