
Machine Learning Methods for Imbalanced Data

Part of the book series: SpringerBriefs in Statistics ((JSSRES))

Abstract

We discuss high-dimensional data analysis in the framework of pattern recognition and machine learning, including single-component analysis and cluster analysis. Several boosting methods for handling imbalanced sample sizes are investigated.
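To illustrate the kind of boosting-for-imbalance approach the abstract refers to, here is a minimal sketch, not the chapter's own algorithm: standard AdaBoost with decision stumps, modified so that the initial sample weights up-weight the minority class. The `minority_boost` knob and all function names are hypothetical, introduced only for this illustration.

```python
import numpy as np

def stump_predict(X, feat, thresh, polarity):
    # A decision stump: +1 where polarity * (x[feat] - thresh) > 0, else -1.
    return np.where(polarity * (X[:, feat] - thresh) > 0, 1.0, -1.0)

def fit_stump(X, y, w):
    # Exhaustive search for the stump with the smallest weighted error.
    best = None
    for feat in range(X.shape[1]):
        for thresh in np.unique(X[:, feat]):
            for polarity in (1.0, -1.0):
                pred = stump_predict(X, feat, thresh, polarity)
                err = np.sum(w[pred != y])
                if best is None or err < best[0]:
                    best = (err, feat, thresh, polarity)
    return best

def adaboost_imbalanced(X, y, n_rounds=20, minority_boost=2.0):
    # AdaBoost with an imbalance-aware initialization: the minority class
    # (assumed here to carry label +1) starts with larger sample weights.
    n = len(y)
    w = np.ones(n)
    w[y == 1] *= minority_boost
    w /= w.sum()
    ensemble = []
    for _ in range(n_rounds):
        err, feat, thresh, polarity = fit_stump(X, y, w)
        err = max(err, 1e-10)
        if err >= 0.5:            # weak learner no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / err)
        pred = stump_predict(X, feat, thresh, polarity)
        w *= np.exp(-alpha * y * pred)   # up-weight misclassified samples
        w /= w.sum()
        ensemble.append((alpha, feat, thresh, polarity))
    return ensemble

def predict(ensemble, X):
    # Weighted vote of all stumps; sign gives the class label.
    score = np.zeros(len(X))
    for alpha, feat, thresh, polarity in ensemble:
        score += alpha * stump_predict(X, feat, thresh, polarity)
    return np.where(score >= 0, 1.0, -1.0)
```

On a synthetic 90-vs-10 two-class sample, the weighted initialization shifts early rounds toward fitting the minority class; more elaborate variants (e.g., SMOTEBoost) instead resample the minority class inside each boosting round.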



Author information


Correspondence to Osamu Komori.


Copyright information

© 2019 The Author(s), under exclusive licence to Springer Japan KK


Cite this chapter

Komori, O., Eguchi, S. (2019). Machine Learning Methods for Imbalanced Data. In: Statistical Methods for Imbalanced Data in Ecological and Biological Studies. SpringerBriefs in Statistics. Springer, Tokyo. https://doi.org/10.1007/978-4-431-55570-4_5
