A Genetic Programming-Based Imputation Method for Classification with Missing Data

  • Cao Truong TranEmail author
  • Mengjie Zhang
  • Peter Andreae
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9594)


Many industrial and real-world datasets suffer from an unavoidable problem of missing values. The ability to deal with missing values is an essential requirement for classification because inadequate treatment of missing values may lead to large errors on classification. The problem of missing data has been addressed extensively in the statistics literature, and also, but to a lesser extent in the classification literature. One of the most popular approaches to deal with missing data is to use imputation methods to fill missing values with plausible values. Some powerful imputation methods such as regression-based imputations in MICE [36] are often suitable for batch imputation tasks. However, they are often expensive to impute missing values for every single incomplete instance in the unseen set for classification. This paper proposes a genetic programming-based imputation (GPI) method for classification with missing data that uses genetic programming as a regression method to impute missing values. The experiments on six benchmark datasets and five popular classifiers compare GPI with five other popular and advanced regression-based imputation methods in MICE on two measures: classification accuracy and computation time. The results showed that, in most cases, GPI achieves classification accuracy at least as good as the other imputation methods, and sometimes significantly better. However, using GPI to impute missing values for every single incomplete instance is dramatically faster than the other imputation methods.


Missing data Imputation methods Genentic programming Symbolic regression Classification 


  1. 1.
    Agapitos, A., Brabazon, A., O’Neill, M.: Controlling overfitting in symbolic regression based on a bias/variance error decomposition. In: Coello, C.A.C., Cutello, V., Deb, K., Forrest, S., Nicosia, G., Pavone, M. (eds.) PPSN 2012, Part I. LNCS, vol. 7491, pp. 438–447. Springer, Heidelberg (2012)CrossRefGoogle Scholar
  2. 2.
    Andridge, R.R., Little, R.J.: A review of hot deck imputation for survey non-response. Int. Stat. Rev. 78, 40–64 (2010)CrossRefGoogle Scholar
  3. 3.
    Asuncion, A., Newman, D.: UCI machine learning repository (2007).
  4. 4.
    Augusto, D.A., Barbosa, H.J.: Symbolic regression via genetic programming. In: Sixth Brazilian Symposium on Neural Networks, 2000, Proceedings, pp. 173–178 (2000)Google Scholar
  5. 5.
    Barmpalexis, P., Kachrimanis, K., Tsakonas, A., Georgarakis, E.: Symbolic regression via genetic programming in the optimization of a controlled release pharmaceutical formulation. Chemometr. Intell. Lab. Syst. 107, 75–82 (2011)CrossRefGoogle Scholar
  6. 6.
    Barnard, J., Meng, X.L.: Applications of multiple imputation in medical studies: from AIDS to NHANES. Stat. Methods Med. Res. 8, 7–36 (1999)CrossRefGoogle Scholar
  7. 7.
    Berger, J.O.: Statistical Decision Theory and Bayesian Analysis. Springer Science & Business Media, New York (2013)Google Scholar
  8. 8.
    Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A.: Classification and Regression Trees. CRC Press, Boca Raton (1984)zbMATHGoogle Scholar
  9. 9.
    Buuren, S., Groothuis-Oudshoorn, K.: MICE: multivariate imputation by chained equations in R. J. Stat. Soft. 45, 1–67 (2011)CrossRefGoogle Scholar
  10. 10.
    Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20, 273–297 (1995)zbMATHGoogle Scholar
  11. 11.
    Cunningham, P., Delany, S.J.: k-Nearest Neighbour classifiers. In: Multiple Classifier Systems, pp. 1–17 (2007)Google Scholar
  12. 12.
    Draper, N.R., Smith, H., Pownell, E.: Applied Regression Analysis, vol. 3. Wiley, New York (1966)Google Scholar
  13. 13.
    Farhangfar, A., Kurgan, L., Dy, J.: Impact of imputation of missing values on classification error for discrete data. Pattern Recogn. 41, 3692–3705 (2008)CrossRefzbMATHGoogle Scholar
  14. 14.
    Farhangfar, A., Kurgan, L.A., Pedrycz, W.: A novel framework for imputation of missing values in databases. IEEE Trans. Syst. Man Cybern. Part A: Syst. Hum. 37, 692–709 (2007)CrossRefGoogle Scholar
  15. 15.
    García-Laencina, P.J., Sancho-Gómez, J.L., Figueiras-Vidal, A.R.: Pattern classification with missing data: a review. Neural Comput. Appl. 19, 263–282 (2010)CrossRefGoogle Scholar
  16. 16.
    Graham, J.W.: Missing data analysis: making it work in the real world. Ann. Rev. Psychol. 60, 549–576 (2009)CrossRefGoogle Scholar
  17. 17.
    Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. ACM SIGKDD Explor. Newslett. 11, 10–18 (2009)CrossRefGoogle Scholar
  18. 18.
    Han, J., Kamber, M., Pei, J.: Data Mining, Southeast Asia Edition: Concepts and Techniques. Morgan Kaufmann, San Francisco (2006)zbMATHGoogle Scholar
  19. 19.
    Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Netw. 2, 359–366 (1989)CrossRefGoogle Scholar
  20. 20.
    Keijzer, M.: Improving symbolic regression with interval arithmetic and linear scaling. In: Genetic programming, pp. 70–82 (2003)Google Scholar
  21. 21.
    Kleinbaum, D., Kupper, L., Nizam, A., Rosenberg, E.: Applied regression analysis and other multivariable methods. Cengage Learning (2013)Google Scholar
  22. 22.
    Koza, J.R.: Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge (1992)zbMATHGoogle Scholar
  23. 23.
    Liaw, A., Wiener, M.: Classification and regression by randomforest. R News 2, 18–22 (2002)Google Scholar
  24. 24.
    Little, R.J.A., Rubin, D.B.: Statistical Analysis with Missing Data. Wiley-Interscience, New York (2002)CrossRefzbMATHGoogle Scholar
  25. 25.
    Luke, S., Panait, L., Balan, G., Paus, S., Skolicki, Z., Bassett, J., Hubley, R., Chircop, A.: ECJ: A java-based evolutionary computation research system (2006) Downloadable versions and documentation can be found at the following
  26. 26.
    Minka, T.: Bayesian linear regression. Technical report, 3594 Security Ticket Control (1999)Google Scholar
  27. 27.
    Murphy, K.P.: Naive Bayes classifiers. University of British Columbia (2006)Google Scholar
  28. 28.
    Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco (1993)Google Scholar
  29. 29.
    Schafer, J.L.: Analysis of Incomplete Multivariate Data. Monographs on Statistics & Applied Probability. Chapman & Hall/CRC, New York (1997)CrossRefzbMATHGoogle Scholar
  30. 30.
    Schafer, J.L.: Analysis of Incomplete Multivariate Data. CRC Press, New York (1997)CrossRefzbMATHGoogle Scholar
  31. 31.
    Silva, S., Dignum, S., Vanneschi, L.: Operator equalisation for bloat free genetic programming and a survey of bloat control methods. Genet. Program. Evolvable Mach. 13, 197–238 (2012)CrossRefGoogle Scholar
  32. 32.
    Topchy, A., Punch, W.F.: Faster genetic programming based on local gradient search of numeric leaf values. In: Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2001), vol. 155162 (2001)Google Scholar
  33. 33.
    Tran, C.T., Zhang, M., Andreae, P.: Multiple imputation for missing data using genetic programming. In: Proceedings of the 2015 on Genetic and Evolutionary Computation Conference, pp. 583–590 (2015)Google Scholar
  34. 34.
    Uy, N.Q., Hoai, N.X., O’Neill, M., Mckay, R.I., Galván-López, E.: Semantically-based crossover in genetic programming: application to real-valued symbolic regression. Genet. Program. Evolvable Mach. 12, 91–119 (2011)CrossRefGoogle Scholar
  35. 35.
    Van Buuren, S., Oudshoorn, C.: Multivariate imputation by chained equations. MICE V1. 0 user’s manual. Leiden: TNO Preventie en Gezondheid (2000)Google Scholar
  36. 36.
    Van Buuren, S., Oudshoorn, K.: Flexible multivariate imputation by MICE. Technical report, PG/VGZ/99.054: TNO Prevention and Health, Leiden (1999)Google Scholar
  37. 37.
    Vladislavleva, E.J., Smits, G.F., Den Hertog, D.: Order of nonlinearity as a complexity measure for models generated by symbolic regression via pareto genetic programming. IEEE Trans. Evol. Comput. 13, 333–349 (2009)CrossRefGoogle Scholar
  38. 38.
    White, I.R., Royston, P., Wood, A.M.: Multiple imputation using chained equations: issues and guidance for practice. Stat. Med. 30, 377–399 (2011)MathSciNetCrossRefGoogle Scholar

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Cao Truong Tran
    • 1
    Email author
  • Mengjie Zhang
    • 1
  • Peter Andreae
    • 1
  1. 1.School of Engineering and Computer ScienceVictoria University of WellingtonWellingtonNew Zealand

Personalised recommendations