Dealing with Missing Values

García, Salvador; Luengo, Julián; Herrera, Francisco

doi:10.1007/978-3-319-10247-4_4

Part of the book series: Intelligent Systems Reference Library (ISRL, volume 72)

Abstract

This chapter introduces the approaches used in the literature to handle Missing Values (MVs). In real-life data, missing attribute values frequently cause information loss in data mining. Several schemes have been studied to overcome the drawbacks that MVs produce in data mining tasks; one of the best known is a preprocessing approach, formally known as imputation. After the introduction in Sect. 4.1, the chapter presents the theoretical background in Sect. 4.2, which analyzes the underlying distribution of the missingness. The subsequent sections move from the simplest approaches (Sect. 4.3) to the most advanced proposals, focusing on the imputation of MVs. These advanced methods include the classic maximum likelihood procedures, such as Expectation-Maximization and Multiple Imputation (Sect. 4.4), and the latest machine learning based approaches, which use classification or regression algorithms to perform the imputation (Sect. 4.5). Finally, Sect. 4.6 presents a comparative experimental study.

Author information

Authors and Affiliations

Department of Computer Science, University of Jaén, Jaén, Spain
Salvador García
Department of Civil Engineering, University of Burgos, Burgos, Spain
Julián Luengo
Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain
Francisco Herrera

Corresponding author

Correspondence to Salvador García.

About this chapter

Cite this chapter

García, S., Luengo, J., Herrera, F. (2015). Dealing with Missing Values. In: Data Preprocessing in Data Mining. Intelligent Systems Reference Library, vol 72. Springer, Cham. https://doi.org/10.1007/978-3-319-10247-4_4

DOI: https://doi.org/10.1007/978-3-319-10247-4_4
Published: 31 August 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10246-7
Online ISBN: 978-3-319-10247-4
eBook Packages: Engineering, Engineering (R0)
