Undersampling Approach for Imbalanced Training Sets and Induction from Multi-label Text-Categorization Domains

Dendamrongvit, Sareewan; Kubat, Miroslav

doi:10.1007/978-3-642-14640-4_4


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5669)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining


Abstract

Text categorization is an important application domain of multi-label classification, where each document can simultaneously belong to more than one class. The most common approach addresses multi-label examples by inducing a separate binary classifier for each class and then using these classifiers in parallel. What the information-retrieval community has all but ignored, however, is that such classifiers are almost always induced from highly imbalanced training sets. The study reported in this paper shows that taking this imbalance into account through majority-class undersampling can improve classification performance as measured by criteria common in text categorization: macro- and micro-averaged precision, recall, and F₁. We also show how a slight modification of an older undersampling technique further improves the results.
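The setting described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' method: the abstract does not specify their undersampling procedure, so plain random undersampling of the majority class is assumed here, and the names `binary_targets`, `undersample_majority`, `macro_micro_f1`, and the `ratio` parameter are hypothetical. The sketch shows the one-classifier-per-class view of a multi-label corpus, how the training set for one class might be rebalanced, and how macro- and micro-averaged F₁ differ.

```python
import random
from typing import Dict, List, Sequence, Set, Tuple


def binary_targets(label_sets: Sequence[Set[str]], cls: str) -> List[int]:
    """One-vs-rest view of a multi-label corpus: 1 if the document has `cls`."""
    return [1 if cls in labels else 0 for labels in label_sets]


def undersample_majority(targets: Sequence[int], ratio: float = 1.0,
                         seed: int = 0) -> List[int]:
    """Return sorted indices of a rebalanced training set for one class.

    Assumption: random undersampling. Every minority example is kept, and
    about `ratio` majority examples are kept per minority example.
    """
    rng = random.Random(seed)
    pos = [i for i, t in enumerate(targets) if t == 1]
    neg = [i for i, t in enumerate(targets) if t == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    keep = min(len(majority), int(round(ratio * len(minority))))
    return sorted(minority + rng.sample(majority, keep))


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(counts: Dict[str, Tuple[int, int, int]]) -> Tuple[float, float]:
    """`counts` maps each class to its (tp, fp, fn) on the test set.

    Macro averaging computes F1 per class and then averages, so rare classes
    weigh as much as frequent ones. Micro averaging pools the counts first,
    so frequent classes dominate.
    """
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return macro, f1(tp, fp, fn)


if __name__ == "__main__":
    # Toy corpus: class "a" is rare, so its binary training set is imbalanced.
    docs = [{"a"}, {"a", "b"}, {"b"}, {"b"}, {"b"}, {"b"}]
    y_a = binary_targets(docs, "a")
    print(undersample_majority(y_a))  # indices of a balanced subset for "a"
    print(macro_micro_f1({"frequent": (90, 10, 10), "rare": (1, 1, 3)}))
```

Because macro averaging gives each class equal weight, weak performance on rare (minority) classes lowers the macro score much more than the micro score, which is one reason both are commonly reported in text categorization.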



Author information

Authors and Affiliations

Department of Electrical & Computer Engineering, University of Miami, Coral Gables, FL, 33146, USA
Sareewan Dendamrongvit & Miroslav Kubat


Editor information

Editors and Affiliations

Thammasat University, Sirindhorn International Institute of Technology, 131 Moo 5 Tiwanont Road, Bangkadi, 12000, Muang, Pathumthani, Thailand
Thanaruk Theeramunkong
Department of Architecture for Intelligence, The Institute of Scientific and Industrial Research, Osaka University, 8-1 Mihogaoka, Ibaraki, 567-0047, Osaka, Japan
Cholwich Nattee
Center for Informatics, Federal University of Pernambuco, Brazil
Paulo J. L. Adeodato
Computer Science and Engineering Department, University of Notre Dame, 353 Fitzpatrick Hall, 46556, Notre Dame, IN, USA
Nitesh Chawla
Department of Computer Science, The Australian National University, Australia
Peter Christen
TELECOM Bretagne, Lab-STICC, Institut TELECOM, Brest, France
Philippe Lenca
School of Information Technologies, University of Sydney, P.O. Box, Australia
Josiah Poon
Australian Taxation Office, Australia
Graham Williams


Cite this paper

Dendamrongvit, S., Kubat, M. (2010). Undersampling Approach for Imbalanced Training Sets and Induction from Multi-label Text-Categorization Domains. In: Theeramunkong, T., et al. New Frontiers in Applied Data Mining. PAKDD 2009. Lecture Notes in Computer Science, vol 5669. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14640-4_4


DOI: https://doi.org/10.1007/978-3-642-14640-4_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14639-8
Online ISBN: 978-3-642-14640-4
eBook Packages: Computer Science, Computer Science (R0)
