
Interactions Between Document Representation and Feature Selection in Text Categorization

  • Conference paper
Database and Expert Systems Applications (DEXA 2006)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4080)


Abstract

Many studies in automated Text Categorization focus on the performance of classifiers, with or without considering feature selection methods, but almost as a rule take into account just one document representation. Only relatively recently have detailed studies on the impact of various document representations stepped into the spotlight, showing that there may be statistically significant differences in classifier performance even among variations of the classical bag-of-words model. This paper examines the relationship between the idf transform and several widely used feature selection methods, in the context of Naïve Bayes and Support Vector Machines classifiers, on datasets extracted from the dmoz ontology of Web-page descriptions. The experimental study shows that the idf transform considerably affects the distribution of classification performance over feature selection reduction rates. It also offers an evaluation method that permits the discovery of relationships between different document representations and feature selection methods, independently of absolute differences in classification performance.
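
For readers unfamiliar with the idf transform mentioned above, the following is a minimal sketch of applying it to a bag-of-words representation. It is an illustration only, not the paper's implementation: the abstract does not state which idf variant was used, so the common form log(N / df) is an assumption, and the toy documents and the function names `bag_of_words`, `idf_weights` and `apply_idf` are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return (sorted vocabulary, per-document term-frequency Counters)."""
    counts = [Counter(doc) for doc in docs]
    vocab = sorted(set().union(*counts))
    return vocab, counts


def idf_weights(vocab, counts):
    """Assumed idf variant: log(N / df), where df is the number of
    documents that contain the term at least once."""
    n_docs = len(counts)
    return {
        term: math.log(n_docs / sum(1 for c in counts if term in c))
        for term in vocab
    }


def apply_idf(counts, idf):
    """Scale each term frequency by the term's idf weight."""
    return [{term: tf * idf[term] for term, tf in c.items()} for c in counts]


# Toy stand-ins for short Web-page descriptions (hypothetical data).
docs = [
    "open directory of web pages".split(),
    "directory of science pages".split(),
    "science news".split(),
]

vocab, tf = bag_of_words(docs)
idf = idf_weights(vocab, tf)
tfidf = apply_idf(tf, idf)
```

Under this variant, terms that occur in fewer documents receive larger weights, which is one plausible reason why a feature selection method that ranks terms could behave differently on idf-transformed and untransformed representations.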

This work was supported by project Abstract Methods and Applications in Computer Science (no. 144017A), of the Serbian Ministry of Science and Environmental Protection.




Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Radovanović, M., Ivanović, M. (2006). Interactions Between Document Representation and Feature Selection in Text Categorization. In: Bressan, S., Küng, J., Wagner, R. (eds) Database and Expert Systems Applications. DEXA 2006. Lecture Notes in Computer Science, vol 4080. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11827405_48


  • DOI: https://doi.org/10.1007/11827405_48

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-37871-6

  • Online ISBN: 978-3-540-37872-3

  • eBook Packages: Computer Science, Computer Science (R0)
