Skip to main content

A Robust Classifier for Imbalanced Datasets

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2014)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 8443))

Included in the following conference series:

Abstract

Imbalanced dataset classification is a challenging problem, since many classifiers are sensitive to class distribution so that the classifiers’ prediction has bias towards majority class. Hellinger Distance has been proven that it is skew-insensitive and the decision trees that employ Hellinger Distance as a splitting criterion have shown better performance than other decision trees based on Information Gain. We propose a new decision tree induction classifier (HeDEx) based on Hellinger Distance that is randomized ensemble trees selecting both attribute and split-point at random. We also propose hyperplane as a decision surface for HeDEx to improve the performance. A new pattern-based oversampling method is also proposed in this paper to reduce the bias towards majority class. The patterns are detected from HeDEx and the new instances generated are applied after verification process using Hellinger Distance Decision Trees. Our experiments show that the proposed methods show performance improvements on imbalanced datasets over the state-of-the-art Hellinger Distance Decision Trees.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 84.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 109.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Hido, S., Kashima, H., Takahashi, Y.: Roughly balanced bagging for imbalanced data. Stat. Anal. Data Min. 2(5-6), 412–426 (2009)

    Article  MathSciNet  Google Scholar 

  2. Provost, F., Domingos, P.: Tree induction for probability-based ranking. Mach. Learn. 52(3), 199–215 (2003)

    Article  MATH  Google Scholar 

  3. Cieslak, D.A., Hoens, T.R., Chawla, N.V., Kegelmeyer, W.P.: Hellinger distance decision trees are robust and skew-insensitive. Data Min. Knowl. Discov. 24(1), 136–158 (2012)

    Article  MATH  MathSciNet  Google Scholar 

  4. Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Machine Learning 63(1), 3–42 (2006)

    Article  MATH  Google Scholar 

  5. Drummond, C., Holte, R.C.: Exploiting the cost (in)sensitivity of decision tree splitting criteria. In: Proceedings of the Seventeenth International Conference on Machine Learning, pp. 239–246. Morgan Kaufmann (2000)

    Google Scholar 

  6. Flach, P.A.: The geometry of roc space: understanding machine learning metrics through roc isometrics. In: Proceedings of the Twentieth International Conference on Machine Learning, pp. 194–201. AAAI Press (2003)

    Google Scholar 

  7. Liu, W., Chawla, S., Cieslak, D.A., Chawla, N.V.: A Robust Decision Tree Algorithm for Imbalanced Data Sets. In: SDM, pp. 766–777. SIAM (2010)

    Google Scholar 

  8. Cieslak, D.A., Chawla, N.V.: Learning decision trees for unbalanced data. In: Daelemans, W., Goethals, B., Morik, K. (eds.) ECML PKDD 2008, Part I. LNCS (LNAI), vol. 5211, pp. 241–256. Springer, Heidelberg (2008)

    Chapter  Google Scholar 

  9. Van Hulse, J., Khoshgoftaar, T.M., Napolitano, A.: Experimental perspectives on learning from imbalanced data. In: Proceedings of the 24th International Conference on Machine Learning, ICML 2007, pp. 935–942. ACM, New York (2007)

    Google Scholar 

  10. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: Smote: synthetic minority over-sampling technique. J. Artif. Int. Res. 16(1), 321–357 (2002)

    MATH  Google Scholar 

  11. Han, H., Wang, W.-Y., Mao, B.-H.: Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In: Huang, D.-S., Zhang, X.-P., Huang, G.-B. (eds.) ICIC 2005, Part I. LNCS, vol. 3644, pp. 878–887. Springer, Heidelberg (2005)

    Chapter  Google Scholar 

  12. Bunkhumpornpat, C., Sinapiromsaran, K., Lursinsap, C.: Safe-level-SMOTE: Safe-level-synthetic minority over-sampling tEchnique for handling the class imbalanced problem. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, T.-B. (eds.) PAKDD 2009. LNCS, vol. 5476, pp. 475–482. Springer, Heidelberg (2009)

    Chapter  Google Scholar 

  13. Maciejewski, T., Stefanowski, J.: Local neighbourhood extension of smote for mining imbalanced data. In: 2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM), pp. 104–111 (2011)

    Google Scholar 

  14. Jo, T., Japkowicz, N.: Class imbalances versus small disjuncts. SIGKDD Explor. Newsl. 6(1), 40–49 (2004)

    Article  MathSciNet  Google Scholar 

  15. Alhammady, H., Ramamohanarao, K.: Using emerging patterns and decision trees in rare-class classification. In: Fourth IEEE International Conference on Data Mining, ICDM 2004, pp. 315–318 (2004)

    Google Scholar 

  16. Breiman, L.: Bagging predictors. Machine Learning 24(2), 123–140 (1996)

    MATH  MathSciNet  Google Scholar 

  17. Schapire, R.E.: The strength of weak learnability. Machine Learning 5(2), 197–227 (1990)

    Google Scholar 

  18. Ho, T.K.: The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(8), 832–844 (1998)

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Kang, S., Ramamohanarao, K. (2014). A Robust Classifier for Imbalanced Datasets. In: Tseng, V.S., Ho, T.B., Zhou, ZH., Chen, A.L.P., Kao, HY. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2014. Lecture Notes in Computer Science(), vol 8443. Springer, Cham. https://doi.org/10.1007/978-3-319-06608-0_18

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-06608-0_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-06607-3

  • Online ISBN: 978-3-319-06608-0

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics