Handling Imbalanced Data Sets in Multistage Classification

López, M.

doi:10.1007/978-1-4614-3323-1_17

Part of the book series: Springer Series in Astrostatistics (SSIA, volume 2)

Abstract

Multistage classification is a divide-and-conquer approach to problems with a large number of classes. The classification problem is divided into several sequential steps, each associated with a single classifier that works with subgroups of the original classes. At each level, the current set of classes is split into smaller subgroups until each subgroup contains only one class. The resulting chain of classifiers can be represented as a tree, which (1) simplifies the classification process by using fewer categories in each classifier and (2) makes it possible to combine several algorithms or use different attributes at each stage. Most classification algorithms can be biased toward the most populated class in overlapping areas of the input space. This can degrade a multistage classifier's performance if the sample frequencies in the training set do not reflect the real prevalence in the population. Several techniques, such as applying prior probabilities, assigning weights to the classes, or replicating instances, have been developed to overcome this handicap. Most of them are designed for two-class (accept-reject) problems. In this article, we evaluate several of these techniques as applied to multistage classification and analyze how they can be useful for astronomy. We compare the results obtained by classifying a data set based on Hipparcos with and without these methods.
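
As an illustration only, two of the techniques named in the abstract (class weights and instance replication) can be sketched for the root stage of a classifier tree. This is a minimal sketch under stated assumptions, not the method evaluated in the chapter: the toy labels, the grouping of classes into "single" and "pair", and the helper names `inverse_frequency_weights`, `replicate_to_balance`, and `stage_labels` are all hypothetical, and the inverse-frequency weighting rule is one common heuristic chosen for the example.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}


def replicate_to_balance(rows, labels, seed=0):
    """Randomly replicate instances until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for c, m in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == c]
        for _ in range(target - m):
            out_rows.append(rng.choice(pool))
            out_labels.append(c)
    return out_rows, out_labels


def stage_labels(labels, groups):
    """Relabel original classes with the subgroup they belong to at one stage."""
    lookup = {c: g for g, members in groups.items() for c in members}
    return [lookup[c] for c in labels]


# Hypothetical toy training set: three classes, strongly imbalanced.
labels = ["A"] * 8 + ["B"] * 3 + ["C"] * 1
rows = [[float(i)] for i in range(len(labels))]

# Root stage of a two-level tree: {A} versus {B, C}.
# A second stage (not shown) would separate B from C.
root = stage_labels(labels, {"single": ["A"], "pair": ["B", "C"]})

# Option 1: per-class weights to pass to a weight-aware classifier.
root_weights = inverse_frequency_weights(root)

# Option 2: replicate minority instances before training the stage.
balanced_rows, balanced_root = replicate_to_balance(rows, root)
```

Because each stage sees a different grouping of the original classes, the imbalance differs from stage to stage, so in a sketch like this the weights or replication would be recomputed for every node of the tree rather than once for the whole training set.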

Author information

Authors and Affiliations

Centro de Astrobiología (CSIC-INTA), Unidad de Archivo de Datos, Madrid, Spain
M. López

Corresponding author

Correspondence to M. López.

Editor information

Editors and Affiliations

Department of Statistics, Universidad Nacional Educacion, C/Juan del Rosal, 16 Despacho-U Madrid, Madrid, 28040, Spain
Luis Manuel Sarro
Observatoire de Genève, Université de Genève, 51 Chemin des Maillettes, Sauverny, 1290, Switzerland
Laurent Eyer
European Space Astronomy Centre, P.O. Box, 78, Villanueva de la Cañada, Madrid, E-28691, Spain
William O'Mullane
Instituut voor Sterrenkunde, Katholieke Universiteit Leuven, Celestijnenlaan 200D, Leuven, 3001, Belgium
Joris De Ridder

About this chapter

Cite this chapter

López, M. (2012). Handling Imbalanced Data Sets in Multistage Classification. In: Sarro, L., Eyer, L., O'Mullane, W., De Ridder, J. (eds) Astrostatistics and Data Mining. Springer Series in Astrostatistics, vol 2. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-3323-1_17

DOI: https://doi.org/10.1007/978-1-4614-3323-1_17
Published: 05 June 2012
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-3322-4
Online ISBN: 978-1-4614-3323-1
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
