Interactions Between Document Representation and Feature Selection in Text Categorization

Radovanović, Miloš; Ivanović, Mirjana

doi:10.1007/11827405_48


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4080)

Included in the following conference series:

International Conference on Database and Expert Systems Applications


Abstract

Many studies in automated Text Categorization focus on the performance of classifiers, with or without considering feature selection methods, but almost as a rule take into account just one document representation. Only relatively recently have detailed studies on the impact of various document representations stepped into the spotlight, showing that there may be statistically significant differences in classifier performance even among variations of the classical bag-of-words model. This paper examines the relationship between the idf transform and several widely used feature selection methods, in the context of Naïve Bayes and Support Vector Machines classifiers, on datasets extracted from the dmoz ontology of Web-page descriptions. The described experimental study shows that the idf transform considerably affects the distribution of classification performance over feature selection reduction rates. It also offers an evaluation method that permits the discovery of relationships between different document representations and feature selection methods independently of absolute differences in classification performance.
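To make the idf transform mentioned in the abstract concrete, here is a minimal sketch of how it reweights a bag-of-words vector. It assumes the common form idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t; the abstract does not state which variant the paper uses. The function names and the toy corpus are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Map each term to log(N / df), an assumed (common) idf variant."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        # Count each term once per document to get document frequency.
        df.update(set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def tfidf_vector(doc, idf):
    """Scale raw term frequencies of one document by the idf weights."""
    return {term: tf * idf[term] for term, tf in Counter(doc).items()}


# Hypothetical toy corpus of tokenized short Web-page descriptions.
docs = [
    ["web", "search", "engine"],
    ["web", "page", "design"],
    ["web", "search", "results", "search"],
]

idf = idf_weights(docs)
vectors = [tfidf_vector(doc, idf) for doc in docs]
```

In this sketch a term that occurs in every document, such as "web", receives weight zero, while rarer terms are emphasized. Because the transform changes which terms carry weight, it is plausible that it changes how a classifier responds as a feature selection method removes terms, which is the kind of interaction the paper studies.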

This work was supported by project Abstract Methods and Applications in Computer Science (no. 144017A), of the Serbian Ministry of Science and Environmental Protection.



Author information

Authors and Affiliations

Department of Mathematics and Informatics, University of Novi Sad, Faculty of Science, Trg D. Obradovića 4, 21000, Novi Sad, Serbia and Montenegro
Miloš Radovanović & Mirjana Ivanović


Editor information

Editors and Affiliations

School of Computing, National University of Singapore,
Stéphane Bressan
University of Linz, Altenbergerstraße 69, 4040, Linz, Austria
Josef Küng & Roland Wagner


About this paper

Cite this paper

Radovanović, M., Ivanović, M. (2006). Interactions Between Document Representation and Feature Selection in Text Categorization. In: Bressan, S., Küng, J., Wagner, R. (eds) Database and Expert Systems Applications. DEXA 2006. Lecture Notes in Computer Science, vol 4080. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11827405_48


DOI: https://doi.org/10.1007/11827405_48
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-37871-6
Online ISBN: 978-3-540-37872-3
eBook Packages: Computer Science, Computer Science (R0)
