Abstract
Multistage classification is a divide-and-conquer approach to problems with a large number of classes. The classification problem is divided into several sequential steps, each associated with a single classifier that works with subgroups of the original classes. At each level, the current set of classes is split into smaller subgroups until each subgroup contains only one class. The resulting chain of classifiers can be represented as a tree, which (1) simplifies the classification process by using fewer categories in each classifier and (2) makes it possible to combine several algorithms or use different attributes in each stage. Most classification algorithms can be biased toward selecting the most populated class in overlapping areas of the input space. This can degrade the performance of a multistage classifier if the class frequencies in the training set do not reflect the real prevalence in the population. Several techniques, such as applying prior probabilities, assigning weights to the classes, or replicating instances, have been developed to overcome this handicap. Most of them are designed for two-class (accept-reject) problems. In this article, we evaluate several of these techniques as applied to multistage classification and analyze how they can be useful for astronomy. We compare the results obtained by classifying a data set based on Hipparcos with and without these methods.
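As a rough illustration of the ideas in the abstract, the following Python sketch shows a tree of classifiers that routes an instance through successive stages, together with two of the rebalancing techniques mentioned: class weights and instance replication. It is not the chapter's implementation. The function names, the toy tree, the class labels `A`, `B`, `C`, and the thresholds are all hypothetical, and the inverse-frequency weighting formula is one common choice assumed here for illustration.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights (assumed formula): n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def oversample(rows, labels, seed=0):
    """Replicate randomly chosen minority instances until all classes match the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def classify_multistage(node, x):
    """Walk the tree: each internal node's classifier selects a child subgroup.

    Internal nodes are dicts with a 'classifier' callable and a 'children' map;
    leaves are plain class labels.
    """
    while isinstance(node, dict):
        node = node["children"][node["classifier"](x)]
    return node


# Hypothetical two-stage tree: stage 1 separates {A, B} from C,
# stage 2 separates A from B. Thresholds are arbitrary placeholders
# standing in for trained classifiers.
toy_tree = {
    "classifier": lambda x: "group_AB" if x[0] < 0.5 else "C",
    "children": {
        "group_AB": {
            "classifier": lambda x: "A" if x[1] < 0.5 else "B",
            "children": {"A": "A", "B": "B"},
        },
        "C": "C",
    },
}

if __name__ == "__main__":
    labels = ["A"] * 8 + ["B"] * 2
    rows = [(i / 10, i / 10) for i in range(10)]
    print(balanced_class_weights(labels))
    print(Counter(oversample(rows, labels)[1]))
    print(classify_multistage(toy_tree, (0.2, 0.9)))
```

In a multistage setting, rebalancing of this kind would be applied separately at each node, because the relative class frequencies change whenever classes are merged into subgroups.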
Copyright information
© 2012 Springer Science+Business Media New York
About this chapter
Cite this chapter
López, M. (2012). Handling Imbalanced Data Sets in Multistage Classification. In: Sarro, L., Eyer, L., O'Mullane, W., De Ridder, J. (eds) Astrostatistics and Data Mining. Springer Series in Astrostatistics, vol 2. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-3323-1_17
DOI: https://doi.org/10.1007/978-1-4614-3323-1_17
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-3322-4
Online ISBN: 978-1-4614-3323-1
eBook Packages: Mathematics and Statistics (R0)