Principal Component Analysis in the Presence of Missing Data

Abstract

The aim of this chapter is to provide an overview of recent developments in principal component analysis (PCA) methods for incomplete data. Missing data introduce uncertainty into the analysis, and their treatment requires statistical approaches tailored to the specific missing data process at hand (i.e., ignorable or nonignorable mechanisms). Since the publication of the classic textbook by Jolliffe, which includes a short section of the same title on the missing data problem in PCA, there have been a few methodological contributions that hinge upon a probabilistic approach to PCA. In this chapter, we unify methods for ignorable and nonignorable missing data in a general likelihood framework. We also provide real data examples to illustrate the application of these methods using the R language and environment for statistical computing and graphics.

References

  1. Bartolucci, F., Farcomeni, A.: A discrete time event-history approach to informative drop-out in mixed latent Markov models with covariates. Biometrics 71(1), 80–89 (2015)

  2. Booth, J.G., Hobert, J.P.: Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. Journal of the Royal Statistical Society B 61(1), 265–285 (1999)

  3. Creemers, A., Hens, N., Aerts, M., Molenberghs, G., Verbeke, G., Kenward, M.G.: A sensitivity analysis for shared-parameter models for incomplete longitudinal outcomes. Biometrical Journal 52(1), 111–125 (2010)

  4. de Brevern, A., Hazout, S., Malpertuy, A.: Influence of microarrays experiments missing values on the stability of gene groups by hierarchical clustering. BMC Bioinformatics 5(1), 114 (2004)

  5. de Souto, M.C., Jaskowiak, P.A., Costa, I.G.: Impact of missing data imputation methods on gene expression clustering and classification. BMC Bioinformatics 16(1), 64 (2015)

  6. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B 39(1), 1–38 (1977)

  7. Ding, C., Zhou, D., He, X., Zha, H.: \(L_{1}\)-PCA: rotational invariant \(L_{1}\)-norm principal component analysis for robust subspace factorization. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 281–288. ACM (2006)

  8. Farcomeni, A., Greco, L.: Robust methods for data reduction. CRC Press, Boca Raton, FL (2015)

  9. Geraci, M.: Estimation of regression quantiles in complex surveys with data missing at random: An application to birthweight determinants. Statistical Methods in Medical Research 25(4), 1393–1421 (2016)

  10. Geraci, M., Bottai, M.: Use of auxiliary data in semi-parametric spatial regression with nonignorable missing responses. Statistical Modelling 6(4), 321–336 (2006)

  11. Geraci, M., Farcomeni, A.: Probabilistic principal component analysis to identify profiles of physical activity behaviours in the presence of nonignorable missing data. Journal of the Royal Statistical Society C 65(1), 51–75 (2016)

  12. Gilks, W.R., Wild, P.: Adaptive rejection sampling for Gibbs sampling. Journal of the Royal Statistical Society C 41(2), 337–348 (1992)

  13. Griffiths, L.J., Cortina-Borja, M., Sera, F., Pouliou, T., Geraci, M., Rich, C., Cole, T.J., Law, C., Joshi, H., Ness, A.R., Jebb, S.A., Dezateux, C.: How active are our children? Findings from the Millennium Cohort Study. BMJ Open 3(8), e002893 (2013)

  14. Heitjan, D.F., Basu, S.: Distinguishing “missing at random” and “missing completely at random”. The American Statistician 50(3), 207–213 (1996)

  15. Houseago-Stokes, R.E., Challenor, P.G.: Using PPCA to estimate EOFs in the presence of missing values. Journal of Atmospheric and Oceanic Technology 21(9), 1471–1480 (2004)

  16. Husson, F., Josse, J.: missMDA: Handling missing values with/in multivariate data analysis (principal component methods) (2013). https://CRAN.R-project.org/package=missMDA. R package version 1.7.2

  17. Ibrahim, J.G., Chen, M.H., Lipsitz, S.R.: Missing responses in generalised linear mixed models when the missing data mechanism is nonignorable. Biometrika 88(2), 551–564 (2001)

  18. Ibrahim, J.G., Molenberghs, G.: Missing data methods in longitudinal studies: A review. Test 18(1), 1–43 (2009)

  19. Ilin, A., Raiko, T.: Practical approaches to principal component analysis in the presence of missing values. Journal of Machine Learning Research 11(Jul), 1957–2000 (2010)

  20. Jolliffe, I.T.: Principal component analysis, 2nd edn. Springer-Verlag, New York, NY (2002)

  21. Josse, J., Husson, F.: Handling missing values in exploratory multivariate data analysis methods. Journal de la Société Française de Statistique 153(2), 79–99 (2012)

  22. Josse, J., Husson, F.: Selecting the number of components in principal component analysis using cross-validation approximations. Computational Statistics and Data Analysis 56(6), 1869–1879 (2012)

  23. Josse, J., Pagès, J., Husson, F.: Multiple imputation in principal component analysis. Advances in Data Analysis and Classification 5(3), 231–246 (2011)

  24. Laird, N.M.: Missing data in longitudinal studies. Statistics in Medicine 7(1–2), 305–315 (1988)

  25. Lê, S., Josse, J., Husson, F.: FactoMineR: A package for multivariate analysis. Journal of Statistical Software 25(1), 1–18 (2008)

  26. Little, R.J.A., Rubin, D.B.: Statistical analysis with missing data. Wiley, New York, NY (1987)

  27. Little, R.J.A., Rubin, D.B.: Statistical analysis with missing data, 2nd edn. Wiley, Hoboken, NJ (2002)

  28. Mehrotra, D.V.: Robust elementwise estimation of a dispersion matrix. Biometrics 51(4), 1344–1351 (1995)

  29. Melgani, F., Mercier, G., Lorenzi, L., Pasolli, E.: Recent methods for reconstructing missing data in multispectral satellite imagery. In: R.S. Anderssen, P. Broadbridge, Y. Fukumoto, K. Kajiwara, T. Takagi, E. Verbitskiy, M. Wakayama (eds.) Applications + Practical Conceptualization + Mathematics = fruitful Innovation: Proceedings of the Forum of Mathematics for Industry 2014, pp. 221–234. Springer Japan, Tokyo (2016)

  30. Molenberghs, G., Beunckens, C., Sotto, C., Kenward, M.G.: Every missingness not at random model has a missingness at random counterpart with equal fit. Journal of the Royal Statistical Society B 70(2), 371–388 (2008)

  31. Morelli, M.S., Giannoni, A., Passino, C., Landini, L., Emdin, M., Vanello, N.: A cross-correlational analysis between electroencephalographic and end-tidal carbon dioxide signals: Methodological issues in the presence of missing data and real data results. Sensors (Basel, Switzerland) 16(11), e1828 (2016)

  32. Oh, S., Kang, D.D., Brock, G.N., Tseng, G.C.: Biological impact of missing-value imputation on downstream analyses of gene expression profiles. Bioinformatics 27(1), 78–86 (2011)

  33. Orchard, T., Woodbury, M.A.: A missing information principle: theory and applications. In: Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Theory of Statistics, pp. 697–715. University of California Press, Berkeley, CA (1972)

  34. Pearson, K.: LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2(11), 559–572 (1901)

  35. Petris, G., Tardella, L.: HI: Simulation from distributions supported by nested hyperplanes (2013). https://CRAN.R-project.org/package=HI. R package version 0.4

  36. R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2016). https://www.R-project.org/

  37. Rich, C., Cortina-Borja, M., Dezateux, C., Geraci, M., Sera, F., Calderwood, L., Joshi, H., Griffiths, L.J.: Predictors of non-response in a UK-wide cohort study of children’s accelerometer-determined physical activity using postal methods. BMJ Open 3(3), e002290 (2013)

  38. Roweis, S.: EM algorithms for PCA and SPCA. In: M.I. Jordan, M.J. Kearns, S.A. Solla (eds.) Advances in neural information processing systems 10: Proceedings of the 1997 conference, vol. 10, pp. 626–632. MIT Press, Cambridge, MA (1998)

  39. Rubin, D.B.: Inference and missing data. Biometrika 63(3), 581–592 (1976)

  40. Sattari, M.T., Rezazadeh-Joudi, A., Kusiak, A.: Assessment of different methods for estimation of missing data in precipitation studies. Hydrology Research (2016). https://doi.org/10.2166/nh.2016.364

  41. Schneider, T.: Analysis of incomplete climate data: Estimation of mean values and covariance matrices and imputation of missing values. Journal of Climate 14(5), 853–871 (2001)

  42. Tipping, M.E., Bishop, C.M.: Probabilistic principal component analysis. Journal of the Royal Statistical Society B 61(3), 611–622 (1999)

Author information

Correspondence to Marco Geraci.

Appendix – EM Algorithm for PPCA with MNAR Values

In this appendix, we provide additional details on the Monte Carlo EM algorithm introduced in Sect. 3.2 and derive a simplified E-step in which the random effects are integrated out of the complete-data log-likelihood.

The Monte Carlo E-step requires sampling from \(f\left( \mathbf {z}_{i},\mathbf {u}_{i}|\mathbf {x}_{i},\mathbf {m}_{i},\varvec{\lambda }^{(t)}\right) \). This task can be carried out efficiently via ARMS [12] using the full conditionals

$$\begin{aligned} f\left( \mathbf {z}_{i}|\mathbf {x}_{i},\mathbf {u}_{i},\mathbf {m}_{i},\varvec{\lambda }^{(t)}\right)&\propto f\left( \mathbf {y}_{i}|\mathbf {u}_{i},\varvec{\lambda }^{(t)}\right) f\left( \mathbf {m}_{i}|\mathbf {y}_{i},\varvec{\lambda }^{(t)}\right) ,\end{aligned}$$
(13)
$$\begin{aligned} f\left( \mathbf {u}_{i}|\mathbf {x}_{i},\mathbf {z}_{i},\mathbf {m}_{i},\varvec{\lambda }^{(t)}\right)&\propto f\left( \mathbf {y}_{i}|\mathbf {u}_{i},\varvec{\lambda }^{(t)}\right) f\left( \mathbf {u}_{i}\right) . \end{aligned}$$
(14)

An implementation of ARMS is available in the R package HI [35].
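
To fix ideas, here is a minimal R sketch of a single draw from the full conditional (13) using the arms function of the HI package; a draw from (14) is obtained analogously. The logistic form assumed below for \(f\left( \mathbf {m}_{i}|\mathbf {y}_{i},\varvec{\eta }\right) \) and all object names (x_i, u_i, m_i, miss_i, mu, W, psi, eta) are our own illustrative choices, not part of the chapter's notation.

    library(HI)  # provides arms() [35]

    ## Log full conditional (13) of the missing block z_i, up to an additive constant.
    ## Assumes y_i | u_i ~ N(mu + W u_i, psi * I) and, for illustration only, a
    ## logistic missingness model logit Pr(m_ij = 1) = eta[1] + eta[2] * y_ij.
    log_fc_z <- function(z, x_i, u_i, m_i, miss_i, mu, W, psi, eta) {
      y <- numeric(length(mu))
      y[miss_i] <- z                    # candidate values for the missing entries
      y[!miss_i] <- x_i                 # observed entries
      r <- y - drop(mu + W %*% u_i)
      ll_y <- -sum(r^2) / (2 * psi)     # log f(y_i | u_i, lambda)
      pr <- plogis(eta[1] + eta[2] * y)
      ll_m <- sum(dbinom(m_i, 1, pr, log = TRUE))  # log f(m_i | y_i, eta)
      ll_y + ll_m
    }

    ## One ARMS draw; arms() requires a bounded support, here a generous box
    z_new <- arms(rep(0, sum(miss_i)), log_fc_z,
                  function(z, ...) all(abs(z) < 100), 1,
                  x_i = x_i, u_i = u_i, m_i = m_i, miss_i = miss_i,
                  mu = mu, W = W, psi = psi, eta = eta)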

A sample \(\varvec{\xi }_{i1},\ldots ,\varvec{\xi }_{iK}\) for \(i=1,\ldots ,n\) is obtained at each EM iteration t, where the \((s_{i} + q) \times 1\) vector \(\varvec{\xi }_{ik} = \left( \tilde{\mathbf {z}}_{ik},\tilde{\mathbf {u}}_{ik}\right) \), \(k = 1, \ldots , K\), contains ‘imputed’ values for \(\mathbf {z}_{i}\) and \(\mathbf {u}_{i}\) (with the understanding that \(\varvec{\xi }_{ik} = \tilde{\mathbf {u}}_{ik}\) if \(s_{i}=0\)). Here the Monte Carlo sample size K is kept constant throughout. Alternative strategies with varying \(K^{(t)}\) that may increase the speed or the accuracy of the EM algorithm can be considered [2, 17]. The E-step (11) is approximated by

$$\begin{aligned} Q(\varvec{\lambda }|\varvec{\lambda }^{(t)}) = \frac{1}{K} \sum _{i = 1}^{n}\sum _{k=1}^{K}l\left( \varvec{\lambda }; \varvec{\xi }_{ik}, \mathbf {x}_{i},\mathbf {m}_{i}\right) . \end{aligned}$$
(15)

The maximization of (15) with respect to \(\varvec{\lambda }\) is straightforward. Define \(\tilde{\mathbf {y}}_{ik} = \left( \tilde{\mathbf {z}}_{ik},\mathbf {x}_{i}\right) \) if \(s_{i}>0\) and \(\tilde{\mathbf {y}}_{ik} = \mathbf {y}_{i}\) if \(s_{i}=0\), \(i = 1,\ldots ,n\), \(k = 1,\ldots ,K\). The maximum likelihood solution of the M-step at the \((t+1)\)th iteration is given by

$$\begin{aligned} \hat{\varvec{\mu }}^{(t+1)}&= \frac{1}{nK}\sum _{i = 1}^{n} \sum _{k=1}^{K}\left( \tilde{\mathbf {y}}_{ik} - \hat{\mathbf {W}}^{(t)}\tilde{\mathbf {u}}_{ik}\right) ,\end{aligned}$$
(16)
$$\begin{aligned} \hat{\mathbf {W}}^{(t+1)}&= \left\{ \sum _{i = 1}^{n}\sum _{k=1}^{K}\left( \tilde{\mathbf {y}}_{ik} - \hat{\varvec{\mu }}^{(t+1)}\right) \tilde{\mathbf {u}}^{\top }_{ik}\right\} \left( \sum _{i = 1}^{n}\sum _{k=1}^{K}\tilde{\mathbf {u}}_{ik}\tilde{\mathbf {u}}^{\top }_{ik}\right) ^{-1},\end{aligned}$$
(17)
$$\begin{aligned} \hat{\psi }^{(t+1)}&= \frac{1}{nKp} \sum _{i = 1}^{n}\sum _{k=1}^{K}\Vert \tilde{\mathbf {y}}_{ik} - \hat{\varvec{\mu }}^{(t+1)} - \hat{\mathbf {W}}^{(t+1)}\tilde{\mathbf {u}}_{ik}\Vert _{2}^2. \end{aligned}$$
(18)

Analogously, the MLE of \(\varvec{\eta }\) can be easily obtained using standard results for generalized linear models.
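
As an illustration, the closed-form updates (16)–(18) take only a few lines of R. The sketch below assumes, with our own naming, that the Monte Carlo draws from the E-step are stored in the arrays Y_tilde (n × K × p, completed data vectors) and U_tilde (n × K × q, sampled scores).

    ## One M-step, Eqs. (16)-(18); W_old is the p x q loading matrix from the
    ## previous EM iteration
    m_step <- function(Y_tilde, U_tilde, W_old) {
      n <- dim(Y_tilde)[1]; K <- dim(Y_tilde)[2]; p <- dim(Y_tilde)[3]
      q <- dim(U_tilde)[3]
      ## flatten the n x K replicates into rows; column-major storage pairs the
      ## rows of Y and U consistently
      Y <- matrix(Y_tilde, n * K, p)
      U <- matrix(U_tilde, n * K, q)
      mu_new <- colMeans(Y - U %*% t(W_old))                 # Eq. (16)
      Yc <- sweep(Y, 2, mu_new)                              # center the data
      W_new <- t(Yc) %*% U %*% solve(crossprod(U))           # Eq. (17)
      psi_new <- sum((Yc - U %*% t(W_new))^2) / (n * K * p)  # Eq. (18)
      list(mu = mu_new, W = W_new, psi = psi_new)
    }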

Note that the computational burden can be alleviated by first integrating out the random effects in (11) and then sampling from \(f\left( \mathbf {z}_{i}|\mathbf {x}_{i},\mathbf {m}_{i},\varvec{\lambda }^{(t)}\right) \) during the Monte Carlo E-step. We obtain what we call a simplified E-step

$$\begin{aligned} Q_{i}(\varvec{\lambda }|\varvec{\lambda }^{(t)})&= \int \!\!\!\int \left\{ \log f\left( \mathbf {y}_{i},\mathbf {u}_{i}|\varvec{\theta }\right) + \log f\left( \mathbf {m}_{i}|\mathbf {y}_{i},\varvec{\eta }\right) \right\} f\left( \mathbf {z}_{i},\mathbf {u}_{i}|\mathbf {x}_{i},\mathbf {m}_{i},\varvec{\lambda }^{(t)}\right) d\mathbf {u}_{i}\,d\mathbf {z}_{i}\nonumber \\&= \int \!\!\!\int \left\{ \log f\left( \mathbf {y}_{i}|\mathbf {u}_{i},\varvec{\theta }\right) + \log f\left( \mathbf {u}_{i}\right) \right\} f\left( \mathbf {z}_{i},\mathbf {u}_{i}|\mathbf {x}_{i},\mathbf {m}_{i},\varvec{\lambda }^{(t)}\right) d\mathbf {u}_{i}\,d\mathbf {z}_{i}\nonumber \\&\qquad {}+ \int \log f\left( \mathbf {m}_{i}|\mathbf {y}_{i},\varvec{\eta }\right) f\left( \mathbf {z}_{i}|\mathbf {x}_{i},\mathbf {m}_{i},\varvec{\lambda }^{(t)}\right) d\mathbf {z}_{i}\nonumber \\&= \int \left\{ -\frac{p}{2} \log (\psi ) - \frac{1}{2\psi } {\text {tr}}\left( \mathbf {W}^{\top }\mathbf {W}\mathbf {B}^{(t)}\right) - \frac{1}{2\psi }\Vert \mathbf {y}_{i} - \varvec{\mu } - \mathbf {W}\mathbf {v}_{i}^{(t)}\Vert _{2}^{2} \right. \nonumber \\&\qquad {}+ \log f\left( \mathbf {m}_{i}|\mathbf {y}_{i},\varvec{\eta }\right) \bigg \} f\left( \mathbf {z}_{i}|\mathbf {x}_{i},\mathbf {m}_{i},\varvec{\lambda }^{(t)}\right) d\mathbf {z}_{i}\nonumber \\&\equiv {\text {E}}_{\mathbf {z}|\mathbf {x},\mathbf {m},\varvec{\lambda }^{(t)}}\left\{ l\left( \varvec{\lambda }; \mathbf {y}_{i},\mathbf {m}_{i}\right) \right\} , \end{aligned}$$
(19)

where \(\mathbf {v}_{i}^{(t)} = \mathbf {B}^{(t)}\mathbf {W}^{(t)^{\top }}\left( \mathbf {y}_{i} - \varvec{\mu }^{(t)}\right) /\psi ^{(t)}\) and \(\mathbf {B}^{(t)} = \left\{ \mathbf {W}^{(t)^{\top }} \mathbf {W}^{(t)}/\psi ^{(t)} + \mathbf {I}_{q}\right\} ^{-1}\). Note that, by assumption, \(\mathbf {m}_{i}\) is conditionally independent of \(\mathbf {u}_i\) given \(\mathbf {y}_{i}\); this is what allows the missingness term to be separated out in the second equality above. The expectation is now taken with respect to

$$\begin{aligned} f\left( \mathbf {z}_{i}|\mathbf {x}_{i},\mathbf {m}_{i},\varvec{\lambda }^{(t)}\right) \propto \exp \left\{ -\frac{1}{2}\left( \mathbf {y}_{i}-\varvec{\mu }^{(t)}\right) ^{\top }\mathbf {C}^{(t)^{-1}}\left( \mathbf {y}_{i}-\varvec{\mu }^{(t)}\right) \right\} f\left( \mathbf {m}_{i}|\mathbf {y}_{i},\varvec{\eta }^{(t)}\right) , \end{aligned}$$

where \(\mathbf {C}^{(t)} = \mathbf {W}^{(t)}\mathbf {W}^{(t)^{\top }} + \varvec{\varPsi }^{(t)}\) is the marginal covariance matrix of \(\mathbf {y}_{i}\).
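
A minimal sketch of the corresponding log target, which can be passed to HI::arms exactly as before, is the following; the logistic missingness model is again our illustrative assumption.

    ## Log of f(z_i | x_i, m_i, lambda), up to an additive constant, with
    ## C = W W' + psi * I the marginal covariance matrix of y_i
    log_fz <- function(z, x_i, m_i, miss_i, mu, W, psi, eta) {
      y <- numeric(length(mu))
      y[miss_i] <- z
      y[!miss_i] <- x_i
      C <- tcrossprod(W) + diag(psi, length(mu))  # in practice, factor C once per iteration
      ll_y <- -0.5 * drop(crossprod(y - mu, solve(C, y - mu)))
      pr <- plogis(eta[1] + eta[2] * y)
      ll_m <- sum(dbinom(m_i, 1, pr, log = TRUE))
      ll_y + ll_m
    }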

Again, we obtain a sample \(\tilde{\mathbf {z}}_{ik}\), \(i = 1,\ldots ,n\), \(k = 1,\ldots ,K\) and calculate the approximate E-step

$$\begin{aligned} Q(\varvec{\lambda }|\varvec{\lambda }^{(t)}) = \frac{1}{K} \sum _{i = 1}^{n}\sum _{k=1}^{K}l\left( \varvec{\lambda }; \tilde{\mathbf {z}}_{ik}, \mathbf {x}_{i},\mathbf {m}_{i}\right) . \end{aligned}$$
(20)

The M-step equations obtained by maximizing the log-likelihood in (20) are similar to equations (27) and (28) in [42]; they do not require explicit computation of the covariance matrix and are omitted here for the sake of brevity.

Finally, we note that, based on the linear predictions

$$\begin{aligned} \hat{\mathbf {u}}_{ik} = \left( \hat{\mathbf {W}}^{\top }\hat{\mathbf {W}} + \hat{\varvec{\varPsi }}\right) ^{-1}\hat{\mathbf {W}}^{\top }(\tilde{\mathbf {y}}_{ik} - \hat{\varvec{\mu }}), \end{aligned}$$
(21)

where \(\tilde{\mathbf {y}}_{ik} = \left( \tilde{\mathbf {z}}_{ik},\mathbf {x}_{i}\right) \) is the complete data vector at convergence, we can calculate the element-wise variances of \(\frac{1}{K}\sum _{k=1}^{K} \hat{\mathbf {u}}_{ik}\) across individuals as estimates of \(\delta _{1},\ldots ,\delta _{q}\). The quantity \((p-q)\,\hat{\psi }\) provides the portion of the total variability associated with the 'discarded' components.
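
The sketch below carries out these final computations under the same illustrative array layout used above, with Y_tilde now holding the completed data vectors at convergence and \(\hat{\varvec{\varPsi }} = \hat{\psi }\mathbf {I}_{q}\) assumed in (21).

    ## Predicted scores (21), averaged over the K draws, and variance summaries
    pc_summary <- function(Y_tilde, W_hat, mu_hat, psi_hat) {
      n <- dim(Y_tilde)[1]; K <- dim(Y_tilde)[2]; p <- dim(Y_tilde)[3]
      q <- ncol(W_hat)
      A <- solve(crossprod(W_hat) + diag(psi_hat, q), t(W_hat))  # q x p, Eq. (21)
      U_hat <- matrix(NA_real_, n, q)
      for (i in 1:n) {
        Yi <- matrix(Y_tilde[i, , ], K, p)              # K completed vectors for subject i
        U_hat[i, ] <- rowMeans(A %*% (t(Yi) - mu_hat))  # (1/K) sum_k u_hat_ik
      }
      list(scores = U_hat,                    # n x q predicted component scores
           delta = apply(U_hat, 2, var),      # estimates of delta_1, ..., delta_q
           discarded = (p - q) * psi_hat)     # variability of 'discarded' components
    }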

Copyright information

© 2018 Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Geraci, M., Farcomeni, A. (2018). Principal Component Analysis in the Presence of Missing Data. In: Naik, G. (ed.) Advances in Principal Component Analysis. Springer, Singapore. https://doi.org/10.1007/978-981-10-6704-4_3

  • DOI: https://doi.org/10.1007/978-981-10-6704-4_3

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-6703-7

  • Online ISBN: 978-981-10-6704-4
