Parameter-Free Imputation for Imbalance Datasets

Takum, Jintana; Bunkhumpornpat, Chumphol

doi:10.1007/978-3-319-12823-8_27


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8839)

Included in the following conference series:

International Conference on Asian Digital Libraries


Abstract

Class imbalance is a problem in which the goal is to improve the accuracy on a minority class, while imputation is the process of replacing missing values. Traditionally, the class imbalance and imputation problems are considered independently. In addition, minority-class values filled in by traditional methods are not sufficient for imbalanced datasets. In this paper, we propose a new parameter-free imputation method for imbalanced datasets. It estimates a missing value as a random value between the mean of the attribute containing the missing value and the value of that attribute in the record closest to the record with the missing value. Our proposed algorithm ignores the mean of instances to avoid an over-fitting problem. Experimental results on imbalanced datasets show that our imputation outperforms other techniques when class imbalance measures are used.
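The imputation rule stated in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation, since the full text is not available here. The function name `impute_between_mean_and_nearest`, the Euclidean distance over attributes observed in both records, the uniform sampling, and the use of all records (rather than only one class) to compute the mean are all assumptions made for illustration. The sketch also assumes numeric attributes and that every attribute has at least one observed value.

```python
import math
import random


def impute_between_mean_and_nearest(records, rng=None):
    """Fill None entries with a random value drawn between the attribute mean
    and the value of that attribute in the nearest record.

    Assumptions (not taken from the paper): numeric attributes, Euclidean
    distance over attributes observed in both records, uniform sampling.
    """
    rng = rng or random.Random()
    n_attrs = len(records[0])

    # Mean of each attribute over its observed (non-missing) values.
    means = []
    for j in range(n_attrs):
        observed = [r[j] for r in records if r[j] is not None]
        means.append(sum(observed) / len(observed))

    def distance(a, b):
        # Compare only the attributes that are present in both records.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared))

    filled = [list(r) for r in records]
    for i, rec in enumerate(records):
        for j, value in enumerate(rec):
            if value is not None:
                continue
            # Donor candidates: other records in which attribute j is observed.
            donors = [r for k, r in enumerate(records) if k != i and r[j] is not None]
            nearest = min(donors, key=lambda r: distance(rec, r))
            low, high = sorted((means[j], nearest[j]))
            filled[i][j] = rng.uniform(low, high)
    return filled


# Hypothetical toy data: the second record is missing its second attribute.
data = [
    [1.0, 10.0],
    [2.0, None],
    [9.0, 50.0],
]
result = impute_between_mean_and_nearest(data, rng=random.Random(0))
# The nearest record to [2.0, None] is [1.0, 10.0] and the attribute mean is
# 30.0, so the filled value lies between 10.0 and 30.0.
```

Because the interval is fully determined by the data, the sketch needs no tuning parameter such as a neighbour count, which is one plausible reading of "parameter-free" in the abstract.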



Author information

Authors and Affiliations

Theoretical and Empirical Research Group, Department of Computer Science, Faculty of Science, Chiang Mai University, Chiang Mai, 50200, Thailand
Jintana Takum & Chumphol Bunkhumpornpat


Editor information

Editors and Affiliations

Khon Kaen University, 40002, Khon Kaen, Thailand
Kulthida Tuamsuk
Graduate School of Informatics, Kyoto University, Yoshida-Honmachi, 606-8501, Sakyo-ku, Kyoto, Japan
Adam Jatowt
University of British Columbia, Vancouver, B.C., Canada
Edie Rasmussen


Cite this paper

Takum, J., Bunkhumpornpat, C. (2014). Parameter-Free Imputation for Imbalance Datasets. In: Tuamsuk, K., Jatowt, A., Rasmussen, E. (eds) The Emergence of Digital Libraries – Research and Practices. ICADL 2014. Lecture Notes in Computer Science, vol 8839. Springer, Cham. https://doi.org/10.1007/978-3-319-12823-8_27


DOI: https://doi.org/10.1007/978-3-319-12823-8_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12822-1
Online ISBN: 978-3-319-12823-8
eBook Packages: Computer Science, Computer Science (R0)
