Handling Imbalanced Data Sets in Multistage Classification

Part of the Springer Series in Astrostatistics book series (SSIA, volume 2)


Multistage classification is a natural divide-and-conquer approach to problems with a large number of classes. The classification problem is divided into several sequential steps, each one associated with a single classifier that works on subgroups of the original classes. At each level, the current set of classes is split into smaller subgroups until each subgroup contains only one class. The resulting chain of classifiers can be represented as a tree, which (1) simplifies the classification process by using fewer categories in each classifier and (2) makes it possible to combine several algorithms or use different attributes at each stage. Most classification algorithms are biased toward selecting the most populated class in overlapping regions of the input space. This can degrade the performance of a multistage classifier when the class frequencies in the training set do not reflect the real prevalence in the population. Several techniques, such as applying prior probabilities, assigning weights to the classes, or replicating instances, have been developed to overcome this handicap, although most of them are designed for two-class (accept-reject) problems. In this article, we evaluate several of these techniques as applied to multistage classification and analyze how they can be useful for astronomy. We compare the results obtained by classifying a data set based on Hipparcos with and without these methods.
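The two ideas above can be combined in a small sketch: a two-stage classifier whose first stage separates superclasses and whose second stage discriminates within a subgroup, with the minority class balanced by instance replication before training. All class names, data, and the nearest-centroid base classifier below are illustrative assumptions, not the chapter's actual Hipparcos classes or algorithms.

```python
import random
from collections import Counter

# Hypothetical toy data: (feature, label) pairs; labels are placeholders.
random.seed(0)
data = (
    [(random.gauss(0.0, 0.5), "cepheid") for _ in range(50)]
    + [(random.gauss(2.0, 0.5), "rr_lyrae") for _ in range(50)]
    + [(random.gauss(6.0, 0.5), "eclipsing") for _ in range(5)]  # minority class
)

def replicate_minority(samples):
    """Naive oversampling: replicate each class's instances (with
    wrap-around) until every class reaches the size of the largest one."""
    counts = Counter(label for _, label in samples)
    target = max(counts.values())
    out = []
    for label in counts:
        cls = [s for s in samples if s[1] == label]
        out.extend(cls[i % len(cls)] for i in range(target))
    return out

class NearestCentroid:
    """Minimal 1-D nearest-centroid classifier used at each stage."""
    def fit(self, samples):
        by_class = {}
        for x, y in samples:
            by_class.setdefault(y, []).append(x)
        self.centroids = {y: sum(xs) / len(xs) for y, xs in by_class.items()}
        return self

    def predict(self, x):
        return min(self.centroids, key=lambda y: abs(x - self.centroids[y]))

# Stage 1: separate the "pulsating" superclass from the minority class,
# balancing the training set by replication first.
superclass = {"cepheid": "pulsating", "rr_lyrae": "pulsating",
              "eclipsing": "eclipsing"}
stage1 = NearestCentroid().fit(
    replicate_minority([(x, superclass[y]) for x, y in data]))

# Stage 2: discriminate only within the pulsating subgroup.
stage2 = NearestCentroid().fit(
    [(x, y) for x, y in data if superclass[y] == "pulsating"])

def classify(x):
    top = stage1.predict(x)
    return stage2.predict(x) if top == "pulsating" else top

print(classify(0.1), classify(2.1), classify(6.2))
# → cepheid rr_lyrae eclipsing
```

Without the replication step, the stage-1 centroid of the 5-instance minority class would carry no extra influence, but with a less separable feature the majority superclass would dominate the overlap region; replication (or, equivalently, class weights) is what counteracts that bias.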


Keywords: Minority class · Variable star · Imbalanced problem · Misclassification cost · Populated class



Copyright information

© Springer Science+Business Media New York 2012

Authors and Affiliations

  1. Centro de Astrobiología (CSIC-INTA), Unidad de Archivo de Datos, Madrid, Spain
