A Comparative Study on Term Weighting Schemes for Text Classification

Mazyad, Ahmad; Teytaud, Fabien; Fonlupt, Cyril

doi:10.1007/978-3-319-72926-8_9

Ahmad Mazyad¹⁸,
Fabien Teytaud¹⁸ &
Cyril Fonlupt¹⁸

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 10710))

Included in the following conference series:

International Workshop on Machine Learning, Optimization, and Big Data

3045 Accesses
2 Citations

Abstract

Text Classification (or Text Categorization) is a popular machine learning task. It consists in assigning categories to documents. In this paper, we are interested in comparing state of the art classifiers and state of the art feature weights. Feature weight methods are classic tools that are used in text categorization. We extend previous studies by evaluating numerous term weighting schemes for state of the art classification methods. We aim at providing a complete survey on text classification for fair benchmark comparisons.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

References

Apte, C., Damerau, F., Weiss, S., et al.: Text mining with decision rules and decision trees. Citeseer (1998)
Google Scholar
Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20, 273–297 (1995). Springer
MATH Google Scholar
Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., Singer, Y.: Online passive-aggressive algorithms. J. Mach. Learn. Res. 7, 551–585 (2006)
MathSciNet MATH Google Scholar
Debole, F., Sebastiani, F.: Supervised term weighting for automated text categorization. In: Sirmakessis, S. (ed.) Text Mining and its Applications. Studies in Fuzziness and Soft Computing, vol. 138, pp. 81–97. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-45219-5_7
Chapter Google Scholar
Deng, Z.-H., Tang, S.-W., Yang, D.-Q., Li, M.Z.L.-Y., Xie, K.-Q.: A comparative study on feature weight in text categorization. In: Yu, J.X., Lin, X., Lu, H., Zhang, Y. (eds.) APWeb 2004. LNCS, vol. 3007, pp. 588–597. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-24655-8_64
Chapter Google Scholar
Joachims, T.: Text categorization with support vector machines: learning with many relevant features. In: Nédellec, C., Rouveirol, C. (eds.) ECML 1998. LNCS (LNAI), vol. 1398, pp. 137–142. Springer, Heidelberg (1998). https://doi.org/10.1007/BFb0026683
Chapter Google Scholar
Lan, M., Tan, C.L., Su, J., Lu, Y.: Supervised and traditional term weighting methods for automatic text categorization. IEEE Trans. Pattern Anal. Mach. Intell. 31(4), 721–735 (2009)
Article Google Scholar
McCallum, A., Nigam, K., et al.: A comparison of event models for Naive Bayes text classification. In: Workshop on Learning for Text Categorization, AAAI 1998, vol. 752, pp. 41–48. Citeseer (1998)
Google Scholar
Ng, H.T., Goh, W.B., Low, K.L.: Feature selection, perceptron learning, and a usability case study for text categorization. In: ACM SIGIR Forum, vol. 31, pp. 67–73. ACM (1997)
Google Scholar
Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco (1993)
Google Scholar
Schapire, R.E., Singer, Y.: Boostexter: a boosting-based system for text categorization. Mach. Learn. 39(2–3), 135–168 (2000)
Article Google Scholar
Sparck Jones, K.: A statistical interpretation of term specificity and its application in retrieval. J. Documentation 28(1), 11–21 (1972)
Article Google Scholar
Tibshirani, R., Hastie, T., Narasimhan, B., Chu, G.: Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl. Acad. Sci. 99(10), 6567–6572 (2002)
Article Google Scholar
Wang, D., Zhang, H.: Inverse category frequency based supervised term weighting scheme for text categorization. Preprint arXiv:1012.2609v4 (2013)
Wu, X., Kumar, V., Quinlan, J.R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G.J., Ng, A., Liu, B., Philip, S.Y., et al.: Top 10 algorithms in data mining. Knowl. Inf. Syst. 14(1), 1–37 (2008)
Article Google Scholar
Yang, Y.: Expert network: effective and efficient learning from human decisions in text categorization and retrieval. In: Croft, B.W., van Rijsbergen, C.J. (eds.) SIGIR 1994, pp. 13–22. Springer, Cham (1994). https://doi.org/10.1007/978-1-4471-2099-5_2
Chapter Google Scholar
Youngjoong, K.: A study of term weighting schemes using class information for text classification. In: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1029–1030. ACM (2012)
Google Scholar
Zhang, T.: Solving large scale linear prediction problems using stochastic gradient descent algorithms. In: Proceedings of the Twenty-first International Conference on Machine Learning, ICML 2004, pp. 919–926. Omnipress (2004)
Google Scholar

Download references

Author information

Authors and Affiliations

LISIC, Université du Littoral Côte d’Opale, 50 Rue Ferdinand Buisson, 62100, Calais, France
Ahmad Mazyad, Fabien Teytaud & Cyril Fonlupt

Authors

Ahmad Mazyad
View author publications
You can also search for this author in PubMed Google Scholar
Fabien Teytaud
View author publications
You can also search for this author in PubMed Google Scholar
Cyril Fonlupt
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Ahmad Mazyad .

Editor information

Editors and Affiliations

University of Catania, Catania, Italy
Giuseppe Nicosia
University of Florida, Gainesville, FL, USA
Panos Pardalos
University of Catania, Catania, Italy
Giovanni Giuffrida
Harvard University, Cambridge, MA, USA
Renato Umeton

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Mazyad, A., Teytaud, F., Fonlupt, C. (2018). A Comparative Study on Term Weighting Schemes for Text Classification. In: Nicosia, G., Pardalos, P., Giuffrida, G., Umeton, R. (eds) Machine Learning, Optimization, and Big Data. MOD 2017. Lecture Notes in Computer Science(), vol 10710. Springer, Cham. https://doi.org/10.1007/978-3-319-72926-8_9

Download citation

DOI: https://doi.org/10.1007/978-3-319-72926-8_9
Published: 21 December 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-72925-1
Online ISBN: 978-3-319-72926-8
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics