Boosting Prediction Accuracy on Imbalanced Datasets with SVM Ensembles

Liu, Yang; An, Aijun; Huang, Xiangji

doi:10.1007/11731139_15

Yang Liu²²,
Aijun An²² &
Xiangji Huang²²

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 3918))

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

3572 Accesses
60 Citations

Abstract

Learning from imbalanced datasets is inherently difficult due to lack of information about the minority class. In this paper, we study the performance of SVMs, which have gained great success in many real applications, in the imbalanced data context. Through empirical analysis, we show that SVMs suffer from biased decision boundaries, and that their prediction performance drops dramatically when the data is highly skewed. We propose to combine an integrated sampling technique with an ensemble of SVMs to improve the prediction performance. The integrated sampling technique combines both over-sampling and under-sampling techniques. Through empirical study, we show that our method outperforms individual SVMs as well as several other state-of-the-art classifiers.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Kubat, M., Holte, R.C., Matwin, S.: Machine learning for the detection of oil spills in satellite radar images. Mach. Learn. 30, 195–215 (1998)
Article Google Scholar
Fawcett, T., Provost, F.J.: Adaptive fraud detection. Data Mining and Knowledge Discovery 1, 291–316 (1997)
Article Google Scholar
Ling, C.X., Li, C.: Data mining for direct marketing: Problems and solutions. In: KDD, pp. 73–79 (1998)
Google Scholar
Guo, H., Viktor, H.L.: Learning from imbalanced data sets with boosting and data generation: the databoost-im approach. SIGKDD Explorations 6, 30–39 (2004)
Article Google Scholar
Kubat, M., Matwin, S.: Addressing the curse of imbalanced training sets: onesided selection. In: Proc. 14th International Conference on Machine Learning, pp. 179–186 (1997)
Google Scholar
Chen, C., Liaw, A., Breiman, L.: Using random forest to learn imbalanced data. Technical Report 666, Statistics Department, University of California at Berkeley (2004)
Google Scholar
Wilson, D.R., Martinez, T.R.: Reduction techniques for instance-basedlearning algorithms. Mach. Learn. 38, 257–286 (2000)
Article MATH Google Scholar
Veropoulos, K., Cristianini, N., Campbell, C.: Controlling the sensitivity of support vector machines. In: International Joint Conference on Artificial Intelligence, IJCAI 1999 (1999)
Google Scholar
Wu, G., Chang, E.Y.: Aligning boundary in kernel space for learning imbalanced dataset. In: ICDM, pp. 265–272 (2004)
Google Scholar
Akbani, R., Kwek, S., Japkowicz, N.: Applying support vector machines to imbalanced datasets. In: ECML, pp. 39–50 (2004)
Google Scholar
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: Smote: Synthetic minority over-sampling technique. J. Artif. Intell. Res. (JAIR) 16, 321–357 (2002)
MATH Google Scholar
Chawla, N.V., Lazarevic, A., Hall, L.O., Bowyer, K.W.: Smoteboost: Improving prediction of the minority class in boosting. In: Lavrač, N., Gamberger, D., Todorovski, L., Blockeel, H. (eds.) PKDD 2003. LNCS (LNAI), vol. 2838, pp. 107–119. Springer, Heidelberg (2003)
Chapter Google Scholar
Weiss, G.M., Provost, F.J.: Learning when training data are costly: The effect of class distribution on tree induction. J. Artif. Intell. Res. (JAIR) 19, 315–354 (2003)
MATH Google Scholar
Drummond, C., Holte, R.C.: C4.5, class imbalance, and cost sensitivity: Why under-sampling beats over-sampling. In: Workshop on Learning from Imbalanced Datasets II held in conjunction with ICML 2003 (2003)
Google Scholar
Japkowicz, N., Stephen, S.: The class imbalance problem: A systematic study. Intell. Data Anal. 6, 429–449 (2002)
MATH Google Scholar
Burges, C.J.C.: A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2, 121–167 (1998)
Article Google Scholar
Swets, J.: Measuring the accuracy of diagnostic systems. Science 240, 1285–1293 (1988)
Article MathSciNet MATH Google Scholar

Download references

Author information

Authors and Affiliations

Department of Computer Science and Engineering, York University, Toronto, Ontario, M3J 1P3, Canada
Yang Liu, Aijun An & Xiangji Huang

Authors

Yang Liu
View author publications
You can also search for this author in PubMed Google Scholar
Aijun An
View author publications
You can also search for this author in PubMed Google Scholar
Xiangji Huang
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Nanyang Technological University, Singapore
Wee-Keong Ng
Institute of Industrial Science, The University of Tokyo, 4-6-1 Komaba, Meguro-ku, 153-8505, Tokyo, Japan
Masaru Kitsuregawa
School of Computer Science and Technology, Heilongjiang University, China
Jianzhong Li
School of Computer Engineering, Nanyang Technological University, 639798, Singapore, Singapore
Kuiyu Chang

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Liu, Y., An, A., Huang, X. (2006). Boosting Prediction Accuracy on Imbalanced Datasets with SVM Ensembles. In: Ng, WK., Kitsuregawa, M., Li, J., Chang, K. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2006. Lecture Notes in Computer Science(), vol 3918. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11731139_15

Download citation

DOI: https://doi.org/10.1007/11731139_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33206-0
Online ISBN: 978-3-540-33207-7
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics