An Out of Sample Version of the EM Algorithm for Imputing Missing Values in Classification

  • Sergio Campos
  • Alejandro Veloz
  • Hector Allende
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11401)


It is not uncommon to find real-world applications whose records contain missing values. Since many data analysis algorithms are not designed to work with missing data, a frequent approach is to remove such records, or the variables associated with them, from the analysis. A much better alternative is to employ data imputation techniques that estimate the missing values by exploiting the statistical relationships among the variables.

The Expectation-Maximization (EM) algorithm is a classic method for dealing with missing data, but it is not designed for typical Machine Learning settings with separate training and testing sets.
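To make the role of EM concrete, the following is a minimal sketch of EM-based imputation under a multivariate Gaussian assumption: the M-step re-estimates the mean and covariance from the completed data, and the E-step replaces each missing block with its conditional expectation. This is a generic textbook variant (in the spirit of Little and Rubin, or Schneider), not the specific algorithm of this paper; the function name and iteration count are illustrative.

```python
import numpy as np

def em_impute(X, n_iter=50):
    """EM imputation under a multivariate Gaussian model.

    X: (n, d) array with np.nan marking missing entries.
    Returns a completed copy of X; observed entries are untouched.
    """
    X = X.copy()
    miss = np.isnan(X)
    # Initialise missing cells with their column means.
    X[miss] = np.take(np.nanmean(X, axis=0), np.where(miss)[1])
    for _ in range(n_iter):
        # M-step: re-estimate mean and covariance from the completed data.
        mu = X.mean(axis=0)
        sigma = np.cov(X, rowvar=False)
        # E-step: fill each missing block with its conditional mean,
        # E[x_m | x_o] = mu_m + S_mo S_oo^{-1} (x_o - mu_o).
        for i in np.where(miss.any(axis=1))[0]:
            m, o = miss[i], ~miss[i]
            if not o.any():  # fully missing record: fall back to the mean
                X[i, m] = mu[m]
                continue
            X[i, m] = mu[m] + sigma[np.ix_(m, o)] @ np.linalg.solve(
                sigma[np.ix_(o, o)], X[i, o] - mu[o])
    return X
```

Because the E-step exploits the covariance between variables, correlated observed entries pull the imputed values away from a plain column-mean fill.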

In this work we present an extension of the EM algorithm that addresses this problem. We evaluate the algorithm on the ADNI (Alzheimer's Disease Neuroimaging Initiative) data set, in which about 80% of the samples have missing values.
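The out-of-sample idea can be sketched as follows: the model parameters are estimated once on the training set (e.g., by EM as above) and then reused to impute unseen test records without refitting. This is a hedged illustration of the general principle under a Gaussian model, not necessarily the authors' exact formulation; `impute_out_of_sample` and its arguments are hypothetical names.

```python
import numpy as np

def impute_out_of_sample(x, mu, sigma):
    """Impute one (possibly incomplete) test record using parameters
    (mu, sigma) learned on the training set -- no refitting.

    Missing entries get their conditional expectation under N(mu, sigma):
    E[x_m | x_o] = mu_m + S_mo S_oo^{-1} (x_o - mu_o).
    """
    x = x.copy()
    m = np.isnan(x)
    if m.any():
        o = ~m
        x[m] = mu[m] + sigma[np.ix_(m, o)] @ np.linalg.solve(
            sigma[np.ix_(o, o)], x[o] - mu[o])
    return x
```

Keeping the parameters fixed at test time is what makes the procedure usable in a standard train/test classification pipeline: the classifier and the imputation model both see only training information.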

Our extension of EM achieved higher accuracy and more robust classification performance. It was evaluated with three different classifiers and showed a significant improvement over similar approaches proposed in the literature.


Keywords: Missing data · Imputation · Classification · ADNI · EM · Out of Sample



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Sergio Campos (1)
  • Alejandro Veloz (2)
  • Hector Allende (1)
  1. Departamento de Informática, Universidad Técnica Federico Santa María, Valparaíso, Chile
  2. Departamento de Ingeniería Biomédica, Universidad de Valparaíso, Valparaíso, Chile
