Abstract
Imbalanced datasets occur in many domains, such as fraud detection, cancer detection, and the web, and in such domains the class of interest often corresponds to rarely occurring events. It is therefore important to perform well on these rare classes while maintaining reasonable overall accuracy. Although imbalanced datasets can be difficult to learn from, previous research has suggested that the skewed class distribution is not necessarily what poses problems for learning. When learning the rare class becomes problematic, the skewed class distribution is thus not necessarily to blame; the imbalance may instead be a byproduct of other hidden intrinsic difficulties.
This paper tries to shed some light on the issue of learning from imbalanced datasets. We propose using data complexity models to profile datasets and relate those profiles to class imbalance, which can potentially lead to better learning approaches. We extend our previous work with an improved implementation of the CODE framework in order to tackle a more difficult learning challenge. Despite the increased difficulty, CODE still achieves reasonable performance in profiling the data complexity of imbalanced datasets.
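To make the idea of profiling a dataset concrete, the following minimal sketch computes two simple descriptors for a two-class dataset: the imbalance ratio and a Fisher-style discriminant ratio of the kind used in the data complexity literature. It is an illustration under stated assumptions and not the CODE framework itself; the function names `imbalance_ratio` and `fisher_ratio` and the toy data are hypothetical.

```python
from statistics import mean, pvariance


def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest class."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return max(counts.values()) / min(counts.values())


def fisher_ratio(X, labels, pos, neg):
    """Maximum over features of (mean_pos - mean_neg)^2 / (var_pos + var_neg).

    Higher values suggest that at least one feature separates the two
    classes well. Population variances are used (an assumption of this sketch).
    """
    best = 0.0
    for j in range(len(X[0])):
        a = [row[j] for row, y in zip(X, labels) if y == pos]
        b = [row[j] for row, y in zip(X, labels) if y == neg]
        gap = (mean(a) - mean(b)) ** 2
        denom = pvariance(a) + pvariance(b)
        if denom == 0:
            if gap > 0:
                # Zero spread in both classes but different means:
                # the feature separates the classes perfectly.
                return float("inf")
            continue  # constant feature, carries no information
        best = max(best, gap / denom)
    return best


if __name__ == "__main__":
    # Hypothetical toy data: four majority examples, two minority examples.
    X = [[0, 5], [1, 5], [2, 5], [3, 5], [10, 5], [11, 5]]
    y = ["neg", "neg", "neg", "neg", "pos", "pos"]
    print("imbalance ratio:", imbalance_ratio(y))
    print("Fisher ratio:", fisher_ratio(X, y, "pos", "neg"))
```

In this toy example the classes are imbalanced yet easy to separate on the first feature, which mirrors the point above that skew alone need not make a problem hard.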
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Weng, C.G., Poon, J. (2010). CODE: A Data Complexity Framework for Imbalanced Datasets. In: Theeramunkong, T., et al. New Frontiers in Applied Data Mining. PAKDD 2009. Lecture Notes in Computer Science(), vol 5669. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14640-4_2
DOI: https://doi.org/10.1007/978-3-642-14640-4_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14639-8
Online ISBN: 978-3-642-14640-4
eBook Packages: Computer Science, Computer Science (R0)