Parameter-Free Imputation for Imbalance Datasets
Class imbalance is the problem of improving classification accuracy on a minority class, while imputation is the process of replacing missing values. Traditionally, class imbalance and imputation are treated as independent problems. In addition, minority-class values filled in by traditional methods are not adequate for imbalanced datasets. In this paper, we propose a new parameter-free imputation method for imbalanced datasets that estimates each missing value as a random value between the mean of the attribute containing the missing value and the value of that attribute in the record closest to the record with the missing value. Our proposed algorithm ignores the mean of instances to avoid over-fitting. Experimental results on imbalanced datasets show that our imputation outperforms other techniques when class imbalance measures are used.
Keywords: Imputation, Parameter-Free, Class Imbalance, Classification, K-Nearest Neighbours
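The imputation rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the distance metric, the sampling distribution, or which records contribute to the attribute mean, so the sketch assumes Euclidean distance over the attributes observed in both records, a uniform draw, and a mean over all observed values of the attribute. The function names `column_mean`, `distance`, and `impute` are hypothetical, and missing values are represented as `None`.

```python
import math
import random


def column_mean(rows, j):
    """Mean of attribute j over the records where it is observed."""
    observed = [r[j] for r in rows if r[j] is not None]
    return sum(observed) / len(observed)


def distance(a, b):
    """Euclidean distance over attributes observed in both records (assumed metric)."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared))


def impute(rows, rng=None):
    """Return a copy of rows with each missing value (None) filled in.

    Assumes every attribute is observed in at least one other record.
    """
    rng = rng or random.Random()
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbours: other records with attribute j observed.
            donors = [r for k, r in enumerate(rows) if k != i and r[j] is not None]
            nearest = min(donors, key=lambda r: distance(row, r))
            mean_j = column_mean(rows, j)
            # Assumed uniform draw between the attribute mean and the
            # nearest record's value for that attribute.
            low, high = sorted((mean_j, nearest[j]))
            filled[i][j] = rng.uniform(low, high)
    return filled


data = [
    [1.0, 2.0],
    [1.1, None],
    [5.0, 9.0],
    [5.2, 10.0],
]
result = impute(data, random.Random(0))
```

In this toy dataset the record closest to `[1.1, None]` is `[1.0, 2.0]` and the mean of the second attribute is 7.0, so the filled-in value falls between 2.0 and 7.0. No neighbour count or other tuning parameter is needed, which is consistent with the parameter-free claim.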