Handling Imbalanced Data Sets in Multistage Classification

Astrostatistics and Data Mining

Part of the book series: Springer Series in Astrostatistics ((SSIA,volume 2))

Abstract

Multistage classification is a divide-and-conquer approach to problems with a large number of classes. The classification problem is divided into several sequential steps, each associated with a single classifier that works with subgroups of the original classes. At each level, the current set of classes is split into smaller subgroups until each subgroup contains only one class. The resulting chain of classifiers can be represented as a tree, which (1) simplifies the classification process by using fewer categories in each classifier and (2) makes it possible to combine several algorithms or use different attributes in each stage. Most classification algorithms can be biased toward selecting the most populated class in overlapping areas of the input space. This can degrade a multistage classifier's performance if the training set sample frequencies do not reflect the real prevalence in the population. Several techniques, such as applying prior probabilities, assigning weights to the classes, or replicating instances, have been developed to overcome this handicap. Most of them are designed for two-class (accept-reject) problems. In this article, we evaluate several of these techniques as applied to multistage classification and analyze how they can be useful for astronomy. We compare the results obtained by classifying a data set based on Hipparcos with and without these methods.
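As a rough illustration of one of the techniques mentioned in the abstract (applying prior probabilities), the sketch below rescales the class probabilities produced at each stage of a two-level tree by the ratio of assumed population priors to training-set frequencies. It is not taken from the chapter: the hierarchy, the class names (`G1`, `G2`, `a`, `b`, `c`), the prior values, the classifier outputs, and the helper functions are all hypothetical, and the rescaling shown is the standard Bayes prior correction rather than necessarily the procedure the author evaluates.

```python
from typing import Dict, List

Probs = Dict[str, float]


def group_priors(hierarchy: Dict[str, List[str]], leaf_priors: Probs) -> Probs:
    """Prior of each group node = sum of the priors of its leaf classes."""
    return {g: sum(leaf_priors[c] for c in leaves) for g, leaves in hierarchy.items()}


def conditional_priors(leaves: List[str], leaf_priors: Probs) -> Probs:
    """Priors of the leaf classes, renormalised within one group."""
    total = sum(leaf_priors[c] for c in leaves)
    return {c: leaf_priors[c] / total for c in leaves}


def reweight(posteriors: Probs, train_priors: Probs, target_priors: Probs) -> Probs:
    """Scale each posterior by target_prior / train_prior, then renormalise."""
    scaled = {c: p * target_priors[c] / train_priors[c] for c, p in posteriors.items()}
    total = sum(scaled.values())
    return {c: v / total for c, v in scaled.items()}


# Hypothetical two-stage tree: stage 1 separates G1 from G2,
# stage 2 separates the classes inside G1.
HIERARCHY = {"G1": ["a", "b"], "G2": ["c"]}

# Assumed class frequencies: the training set over-represents class "a".
TRAIN = {"a": 0.5, "b": 0.25, "c": 0.25}
POPULATION = {"a": 0.2, "b": 0.2, "c": 0.6}

# Made-up outputs of the stage-1 and stage-2 classifiers for one object.
stage1_raw = {"G1": 0.6, "G2": 0.4}
stage2_raw = {"a": 0.7, "b": 0.3}

stage1_adj = reweight(
    stage1_raw,
    group_priors(HIERARCHY, TRAIN),
    group_priors(HIERARCHY, POPULATION),
)
stage2_adj = reweight(
    stage2_raw,
    conditional_priors(HIERARCHY["G1"], TRAIN),
    conditional_priors(HIERARCHY["G1"], POPULATION),
)

print("stage 1:", max(stage1_raw, key=stage1_raw.get), "->", max(stage1_adj, key=stage1_adj.get))
print("stage 2:", max(stage2_raw, key=stage2_raw.get), "->", max(stage2_adj, key=stage2_adj.get))
```

With these made-up numbers, the stage-1 decision moves from `G1` to `G2` once the priors are corrected, while the stage-2 decision stays with `a`. The point of the sketch is that in a tree of classifiers the correction has to be applied per node, using group priors at the upper level and within-group priors at the lower level.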


Author information

Correspondence to M. López.

Copyright information

© 2012 Springer Science+Business Media New York

About this chapter

Cite this chapter

López, M. (2012). Handling Imbalanced Data Sets in Multistage Classification. In: Sarro, L., Eyer, L., O'Mullane, W., De Ridder, J. (eds) Astrostatistics and Data Mining. Springer Series in Astrostatistics, vol 2. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-3323-1_17
