Abstract
The class imbalance problem concerns improving classification accuracy on a minority class, while imputation is the process of replacing missing values. Traditionally, the class imbalance and imputation problems are considered independently. In addition, the minority-class values filled in by traditional methods are not adequate for imbalanced datasets. In this paper, we propose a new parameter-free imputation method for imbalanced datasets. It estimates each missing value as a random value between the mean of the attribute that contains the missing value and the value of that attribute in the record closest to the record with the missing value. Our proposed algorithm ignores the mean of instances to avoid an over-fitting problem. Experimental results on imbalanced datasets show that our imputation outperforms other techniques when class imbalance measures are used.
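The imputation rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not state the distance function, the sampling distribution, or how class labels are used, so the sketch assumes numeric attributes, a squared Euclidean distance averaged over the attributes observed in both records, a uniform draw, and no class restriction on the closest record. The function name `impute_between_mean_and_nearest` is hypothetical.

```python
import numpy as np


def impute_between_mean_and_nearest(X, rng=None):
    """Fill each NaN with a random value between the attribute mean and the
    value of that attribute in the closest record (illustrative sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    # Per-attribute mean over observed (non-missing) values.
    col_means = np.nanmean(X, axis=0)

    for i, j in zip(*np.where(np.isnan(X))):
        # Candidate records: those with attribute j observed.
        donors = np.where(~np.isnan(X[:, j]))[0]
        if donors.size == 0:
            continue  # nothing to borrow from; leave the value missing

        # Assumption: mean squared difference over attributes that are
        # observed in both record i and the candidate record.
        diff = X[donors] - X[i]
        shared = (~np.isnan(diff)).sum(axis=1)
        dist = np.where(
            shared > 0,
            np.nansum(diff ** 2, axis=1) / np.maximum(shared, 1),
            np.inf,
        )
        nearest = donors[np.argmin(dist)]

        # Assumption: uniform draw between the two bounds.
        low, high = sorted((col_means[j], X[nearest, j]))
        filled[i, j] = rng.uniform(low, high)

    return filled


if __name__ == "__main__":
    data = [[1.0, 10.0], [2.0, np.nan], [9.0, 50.0]]
    print(impute_between_mean_and_nearest(data, np.random.default_rng(0)))
```

In the small example, the second record is closest to the first record on the observed attribute, so its missing value is drawn between that record's value (10) and the attribute mean (30). The sketch needs no neighbourhood size such as k, which matches the parameter-free description.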
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Takum, J., Bunkhumpornpat, C. (2014). Parameter-Free Imputation for Imbalance Datasets. In: Tuamsuk, K., Jatowt, A., Rasmussen, E. (eds) The Emergence of Digital Libraries – Research and Practices. ICADL 2014. Lecture Notes in Computer Science, vol 8839. Springer, Cham. https://doi.org/10.1007/978-3-319-12823-8_27
DOI: https://doi.org/10.1007/978-3-319-12823-8_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12822-1
Online ISBN: 978-3-319-12823-8
eBook Packages: Computer Science (R0)