Principal Component Analysis in the Presence of Missing Data

Abstract

The aim of this chapter is to provide an overview of recent developments in principal component analysis (PCA) methods for incomplete data. Missing data introduce uncertainty into the analysis, and their treatment requires statistical approaches tailored to the specific missing data process at hand (i.e., ignorable or nonignorable mechanisms). Since the publication of the classic textbook by Jolliffe, which includes a short section of the same title on the missing data problem in PCA, there have been a few methodological contributions that hinge upon a probabilistic approach to PCA. In this chapter, we unify methods for ignorable and nonignorable missing data in a general likelihood framework. We also provide real data examples to illustrate the application of these methods using the R language and environment for statistical computing and graphics.

References

  1. Bartolucci, F., Farcomeni, A.: A discrete time event-history approach to informative drop-out in mixed latent Markov models with covariates. Biometrics 71(1), 80–89 (2015)

  2. Booth, J.G., Hobert, J.P.: Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. Journal of the Royal Statistical Society B 61(1), 265–285 (1999)

  3. Creemers, A., Hens, N., Aerts, M., Molenberghs, G., Verbeke, G., Kenward, M.G.: A sensitivity analysis for shared-parameter models for incomplete longitudinal outcomes. Biometrical Journal 52(1), 111–125 (2010)

  4. de Brevern, A., Hazout, S., Malpertuy, A.: Influence of microarrays experiments missing values on the stability of gene groups by hierarchical clustering. BMC Bioinformatics 5(1), 114 (2004)

  5. de Souto, M.C., Jaskowiak, P.A., Costa, I.G.: Impact of missing data imputation methods on gene expression clustering and classification. BMC Bioinformatics 16(1), 64 (2015)

  6. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B 39(1), 1–38 (1977)

  7. Ding, C., Zhou, D., He, X., Zha, H.: \(L_{1}\)-PCA: rotational invariant \(L_{1}\)-norm principal component analysis for robust subspace factorization. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 281–288. ACM (2006)

  8. Farcomeni, A., Greco, L.: Robust methods for data reduction. CRC Press, Boca Raton, FL (2015)

  9. Geraci, M.: Estimation of regression quantiles in complex surveys with data missing at random: An application to birthweight determinants. Statistical Methods in Medical Research 25(4), 1393–1421 (2016)

  10. Geraci, M., Bottai, M.: Use of auxiliary data in semi-parametric spatial regression with nonignorable missing responses. Statistical Modelling 6(4), 321–336 (2006)

  11. Geraci, M., Farcomeni, A.: Probabilistic principal component analysis to identify profiles of physical activity behaviours in the presence of nonignorable missing data. Journal of the Royal Statistical Society C 65(1), 51–75 (2016)

  12. Gilks, W.R., Wild, P.: Adaptive rejection sampling for Gibbs sampling. Journal of the Royal Statistical Society C 41(2), 337–348 (1992)

  13. Griffiths, L.J., Cortina-Borja, M., Sera, F., Pouliou, T., Geraci, M., Rich, C., Cole, T.J., Law, C., Joshi, H., Ness, A.R., Jebb, S.A., Dezateux, C.: How active are our children? Findings from the Millennium Cohort Study. BMJ Open 3(8), e002893 (2013)

  14. Heitjan, D.F., Basu, S.: Distinguishing “missing at random” and “missing completely at random”. The American Statistician 50(3), 207–213 (1996)

  15. Houseago-Stokes, R.E., Challenor, P.G.: Using PPCA to estimate EOFs in the presence of missing values. Journal of Atmospheric and Oceanic Technology 21(9), 1471–1480 (2004)

  16. Husson, F., Josse, J.: missMDA: Handling missing values with/in multivariate data analysis (principal component methods) (2013). https://CRAN.R-project.org/package=missMDA. R package version 1.7.2

  17. Ibrahim, J.G., Chen, M.H., Lipsitz, S.R.: Missing responses in generalised linear mixed models when the missing data mechanism is nonignorable. Biometrika 88(2), 551–564 (2001)

  18. Ibrahim, J.G., Molenberghs, G.: Missing data methods in longitudinal studies: A review. Test 18(1), 1–43 (2009)

  19. Ilin, A., Raiko, T.: Practical approaches to principal component analysis in the presence of missing values. Journal of Machine Learning Research 11(Jul), 1957–2000 (2010)

  20. Jolliffe, I.T.: Principal component analysis, 2nd edn. Springer-Verlag, New York, NY (2002)

  21. Josse, J., Husson, F.: Handling missing values in exploratory multivariate data analysis methods. Journal de la Société Française de Statistique 153(2), 79–99 (2012)

  22. Josse, J., Husson, F.: Selecting the number of components in principal component analysis using cross-validation approximations. Computational Statistics and Data Analysis 56(6), 1869–1879 (2012)

  23. Josse, J., Pagès, J., Husson, F.: Multiple imputation in principal component analysis. Advances in Data Analysis and Classification 5(3), 231–246 (2011)

  24. Laird, N.M.: Missing data in longitudinal studies. Statistics in Medicine 7(1–2), 305–315 (1988)

  25. Lê, S., Josse, J., Husson, F.: FactoMineR: A package for multivariate analysis. Journal of Statistical Software 25(1), 1–18 (2008)

  26. Little, R.J.A., Rubin, D.B.: Statistical analysis with missing data. Wiley, New York, NY (1987)

  27. Little, R.J.A., Rubin, D.B.: Statistical analysis with missing data, 2nd edn. Wiley, Hoboken, NJ (2002)

  28. Mehrotra, D.V.: Robust elementwise estimation of a dispersion matrix. Biometrics 51(4), 1344–1351 (1995)

  29. Melgani, F., Mercier, G., Lorenzi, L., Pasolli, E.: Recent methods for reconstructing missing data in multispectral satellite imagery. In: R.S. Anderssen, P. Broadbridge, Y. Fukumoto, K. Kajiwara, T. Takagi, E. Verbitskiy, M. Wakayama (eds.) Applications + Practical Conceptualization + Mathematics = fruitful Innovation: Proceedings of the Forum of Mathematics for Industry 2014, pp. 221–234. Springer Japan, Tokyo (2016)

  30. Molenberghs, G., Beunckens, C., Sotto, C., Kenward, M.G.: Every missingness not at random model has a missingness at random counterpart with equal fit. Journal of the Royal Statistical Society B 70(2), 371–388 (2008)

  31. Morelli, M.S., Giannoni, A., Passino, C., Landini, L., Emdin, M., Vanello, N.: A cross-correlational analysis between electroencephalographic and end-tidal carbon dioxide signals: Methodological issues in the presence of missing data and real data results. Sensors (Basel, Switzerland) 16(11), e1828 (2016)

  32. Oh, S., Kang, D.D., Brock, G.N., Tseng, G.C.: Biological impact of missing-value imputation on downstream analyses of gene expression profiles. Bioinformatics 27(1), 78–86 (2011)

  33. Orchard, T., Woodbury, M.A.: A missing information principle: theory and applications. In: Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Theory of Statistics, pp. 697–715. University of California Press, Berkeley, CA (1972)

  34. Pearson, K.: LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2(11), 559–572 (1901)

  35. Petris, G., Tardella, L.: HI: Simulation from distributions supported by nested hyperplanes (2013). https://CRAN.R-project.org/package=HI. R package version 0.4

  36. R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2016). https://www.R-project.org/

  37. Rich, C., Cortina-Borja, M., Dezateux, C., Geraci, M., Sera, F., Calderwood, L., Joshi, H., Griffiths, L.J.: Predictors of non-response in a UK-wide cohort study of children’s accelerometer-determined physical activity using postal methods. BMJ Open 3(3), e002290 (2013)

  38. Roweis, S.: EM algorithms for PCA and SPCA. In: M.I. Jordan, M.J. Kearns, S.A. Solla (eds.) Advances in neural information processing systems 10: Proceedings of the 1997 conference, vol. 10, pp. 626–632. MIT Press, Cambridge, MA (1998)

  39. Rubin, D.B.: Inference and missing data. Biometrika 63(3), 581–592 (1976)

  40. Sattari, M.T., Rezazadeh-Joudi, A., Kusiak, A.: Assessment of different methods for estimation of missing data in precipitation studies. Hydrology Research (2016). https://doi.org/10.2166/nh.2016.364

  41. Schneider, T.: Analysis of incomplete climate data: Estimation of mean values and covariance matrices and imputation of missing values. Journal of Climate 14(5), 853–871 (2001)

  42. Tipping, M.E., Bishop, C.M.: Probabilistic principal component analysis. Journal of the Royal Statistical Society B 61(3), 611–622 (1999)

Author information

Correspondence to Marco Geraci.

Appendix – EM Algorithm for PPCA with MNAR Values

In this appendix, we provide additional details on the Monte Carlo EM algorithm introduced in Sect. 3.2 and derive a simplified E-step in which the random effects are integrated out of the complete-data log-likelihood.

The Monte Carlo E-step requires sampling from \(f\left( \mathbf {z}_{i},\mathbf {u}_{i}|\mathbf {x}_{i},\mathbf {m}_{i},\varvec{\lambda }^{(t)}\right) \). This task can be carried out efficiently via ARMS [12] using the full conditionals

$$\begin{aligned} f\left( \mathbf {z}_{i}|\mathbf {x}_{i},\mathbf {u}_{i},\mathbf {m}_{i},\varvec{\lambda }^{(t)}\right)&\propto f\left( \mathbf {y}_{i}|\mathbf {u}_{i},\varvec{\lambda }^{(t)}\right) f\left( \mathbf {m}_{i}|\mathbf {y}_{i},\varvec{\lambda }^{(t)}\right) ,\end{aligned}$$
(13)
$$\begin{aligned} f\left( \mathbf {u}_{i}|\mathbf {x}_{i},\mathbf {z}_{i},\mathbf {m}_{i},\varvec{\lambda }^{(t)}\right)&\propto f\left( \mathbf {y}_{i}|\mathbf {u}_{i},\varvec{\lambda }^{(t)}\right) f\left( \mathbf {u}_{i}\right) . \end{aligned}$$
(14)

An implementation of ARMS is available in the R package HI [35].
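
To fix ideas, here is a minimal R sketch of a single draw from the full conditional (13) using the arms function of the HI package; a draw from (14) is obtained analogously. The logistic form assumed below for \(f\left( \mathbf {m}_{i}|\mathbf {y}_{i},\varvec{\eta }\right) \) and all object names (x_i, u_i, m_i, miss_i, mu, W, psi, eta) are our own illustrative choices, not part of the chapter's notation.

    library(HI)  # provides arms() [35]

    ## Log full conditional (13) of the missing block z_i, up to an additive constant.
    ## Assumes y_i | u_i ~ N(mu + W u_i, psi * I) and, for illustration only, a
    ## logistic missingness model logit Pr(m_ij = 1) = eta[1] + eta[2] * y_ij.
    log_fc_z <- function(z, x_i, u_i, m_i, miss_i, mu, W, psi, eta) {
      y <- numeric(length(mu))
      y[miss_i] <- z                    # candidate values for the missing entries
      y[!miss_i] <- x_i                 # observed entries
      r <- y - drop(mu + W %*% u_i)
      ll_y <- -sum(r^2) / (2 * psi)     # log f(y_i | u_i, lambda)
      pr <- plogis(eta[1] + eta[2] * y)
      ll_m <- sum(dbinom(m_i, 1, pr, log = TRUE))  # log f(m_i | y_i, eta)
      ll_y + ll_m
    }

    ## One ARMS draw; arms() requires a bounded support, here a generous box
    z_new <- arms(rep(0, sum(miss_i)), log_fc_z,
                  function(z, ...) all(abs(z) < 100), 1,
                  x_i = x_i, u_i = u_i, m_i = m_i, miss_i = miss_i,
                  mu = mu, W = W, psi = psi, eta = eta)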

A sample \(\varvec{\xi }_{i1},\ldots ,\varvec{\xi }_{iK}\) for \(i=1,\ldots ,n\) is obtained at each EM iteration t, where the \((s_{i} + q) \times 1\) vector \(\varvec{\xi }_{ik} = \left( \tilde{\mathbf {z}}_{ik},\tilde{\mathbf {u}}_{ik}\right) \), \(k = 1, \ldots , K\), contains ‘imputed’ values for \(\mathbf {z}_{i}\) and \(\mathbf {u}_{i}\) (with the understanding that \(\varvec{\xi }_{ik} = \tilde{\mathbf {u}}_{ik}\) if \(s_{i}=0\)). Here the Monte Carlo sample size K is kept constant throughout. Alternative strategies with varying \(K^{(t)}\) that may increase the speed or the accuracy of the EM algorithm can be considered [2, 17]. The E-step (11) is approximated by

$$\begin{aligned} Q(\varvec{\lambda }|\varvec{\lambda }^{(t)}) = \frac{1}{K} \sum _{i = 1}^{n}\sum _{k=1}^{K}l\left( \varvec{\lambda }; \varvec{\xi }_{ik}, \mathbf {x}_{i},\mathbf {m}_{i}\right) . \end{aligned}$$
(15)

The maximization of (15) with respect to \(\varvec{\lambda }\) is straightforward. Define \(\tilde{\mathbf {y}}_{ik} = \left( \tilde{\mathbf {z}}_{ik},\mathbf {x}_{i}\right) \) if \(s_{i}>0\) and \(\tilde{\mathbf {y}}_{ik} = \mathbf {y}_{i}\) if \(s_{i}=0\), \(i = 1,\ldots ,n\), \(k = 1,\ldots ,K\). The maximum likelihood solution of the M-step at the \((t+1)\)th iteration is given by

$$\begin{aligned} \hat{\varvec{\mu }}^{(t+1)}&= \frac{1}{nK}\sum _{i = 1}^{n} \sum _{k=1}^{K}\left( \tilde{\mathbf {y}}_{ik} - \hat{\mathbf {W}}^{(t)}\tilde{\mathbf {u}}_{ik}\right) ,\end{aligned}$$
(16)
$$\begin{aligned} \hat{\mathbf {W}}^{(t+1)}&= \left\{ \sum _{i = 1}^{n}\sum _{k=1}^{K}\left( \tilde{\mathbf {y}}_{ik} - \hat{\varvec{\mu }}^{(t+1)}\right) \tilde{\mathbf {u}}^{\top }_{ik}\right\} \left( \sum _{i = 1}^{n}\sum _{k=1}^{K}\tilde{\mathbf {u}}_{ik}\tilde{\mathbf {u}}^{\top }_{ik}\right) ^{-1},\end{aligned}$$
(17)
$$\begin{aligned} \hat{\psi }^{(t+1)}&= \frac{1}{nKp} \sum _{i = 1}^{n}\sum _{k=1}^{K}\Vert \tilde{\mathbf {y}}_{ik} - \hat{\varvec{\mu }}^{(t+1)} - \hat{\mathbf {W}}^{(t+1)}\tilde{\mathbf {u}}_{ik}\Vert _{2}^2. \end{aligned}$$
(18)

Analogously, the MLE of \(\varvec{\eta }\) can be easily obtained using standard results for generalized linear models.
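
As an illustration, the closed-form updates (16)–(18) take only a few lines of R. The sketch below assumes, with our own naming, that the Monte Carlo draws from the E-step are stored in the arrays Y_tilde (n × K × p, completed data vectors) and U_tilde (n × K × q, sampled scores).

    ## One M-step, Eqs. (16)-(18); W_old is the p x q loading matrix from the
    ## previous EM iteration
    m_step <- function(Y_tilde, U_tilde, W_old) {
      n <- dim(Y_tilde)[1]; K <- dim(Y_tilde)[2]; p <- dim(Y_tilde)[3]
      q <- dim(U_tilde)[3]
      ## flatten the n x K replicates into rows; column-major storage pairs the
      ## rows of Y and U consistently
      Y <- matrix(Y_tilde, n * K, p)
      U <- matrix(U_tilde, n * K, q)
      mu_new <- colMeans(Y - U %*% t(W_old))                 # Eq. (16)
      Yc <- sweep(Y, 2, mu_new)                              # center the data
      W_new <- t(Yc) %*% U %*% solve(crossprod(U))           # Eq. (17)
      psi_new <- sum((Yc - U %*% t(W_new))^2) / (n * K * p)  # Eq. (18)
      list(mu = mu_new, W = W_new, psi = psi_new)
    }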

Note that the computational burden can be alleviated by first integrating out the random effects in (11) and then sampling from \(f\left( \mathbf {z}_{i}|\mathbf {x}_{i},\mathbf {m}_{i},\varvec{\lambda }^{(t)}\right) \) during the Monte Carlo E-step. We obtain what we call a simplified E-step

$$\begin{aligned} Q_{i}(\varvec{\lambda }|\varvec{\lambda }^{(t)})&= \int \!\!\!\int \left\{ \log f\left( \mathbf {y}_{i},\mathbf {u}_{i}|\varvec{\theta }\right) + \log f\left( \mathbf {m}_{i}|\mathbf {y}_{i},\varvec{\eta }\right) \right\} f\left( \mathbf {z}_{i},\mathbf {u}_{i}|\mathbf {x}_{i},\mathbf {m}_{i},\varvec{\lambda }^{(t)}\right) d\mathbf {u}_{i}\,d\mathbf {z}_{i}\nonumber \\&= \int \!\!\!\int \left\{ \log f\left( \mathbf {y}_{i}|\mathbf {u}_{i},\varvec{\theta }\right) + \log f\left( \mathbf {u}_{i}\right) \right\} f\left( \mathbf {z}_{i},\mathbf {u}_{i}|\mathbf {x}_{i},\mathbf {m}_{i},\varvec{\lambda }^{(t)}\right) d\mathbf {u}_{i}\,d\mathbf {z}_{i}\nonumber \\&\qquad {}+ \int \log f\left( \mathbf {m}_{i}|\mathbf {y}_{i},\varvec{\eta }\right) f\left( \mathbf {z}_{i}|\mathbf {x}_{i},\mathbf {m}_{i},\varvec{\lambda }^{(t)}\right) d\mathbf {z}_{i}\nonumber \\&= \int \left\{ -\frac{p}{2} \log (\psi ) - \frac{1}{2\psi } {\text {tr}}\left( \mathbf {W}^{\top }\mathbf {W}\mathbf {B}^{(t)}\right) - \frac{1}{2\psi }\Vert \mathbf {y}_{i} - \varvec{\mu } - \mathbf {W}\mathbf {v}_{i}^{(t)}\Vert _{2}^{2} \right. \nonumber \\&\qquad {}+ \log f\left( \mathbf {m}_{i}|\mathbf {y}_{i},\varvec{\eta }\right) \bigg \} f\left( \mathbf {z}_{i}|\mathbf {x}_{i},\mathbf {m}_{i},\varvec{\lambda }^{(t)}\right) d\mathbf {z}_{i}\nonumber \\&\equiv {\text {E}}_{\mathbf {z}|\mathbf {x},\mathbf {m},\varvec{\lambda }^{(t)}}\left\{ l\left( \varvec{\lambda }; \mathbf {y}_{i},\mathbf {m}_{i}\right) \right\} , \end{aligned}$$
(19)

where \(\mathbf {v}_{i}^{(t)} = \mathbf {B}^{(t)}\mathbf {W}^{(t)^{\top }}\left( \mathbf {y}_{i} - \varvec{\mu }^{(t)}\right) /\psi ^{(t)}\) and \(\mathbf {B}^{(t)} = \left\{ \mathbf {W}^{(t)^{\top }} \mathbf {W}^{(t)}/\psi ^{(t)} + \mathbf {I}_{q}\right\} ^{-1}\). Note that, by assumption, \(\mathbf {m}_{i}\) is conditionally independent of \(\mathbf {u}_i\) given \(\mathbf {y}_{i}\); this is what allows the missingness term to be separated out in the second equality above. The expectation is now taken with respect to

$$\begin{aligned} f\left( \mathbf {z}_{i}|\mathbf {x}_{i},\mathbf {m}_{i},\varvec{\lambda }^{(t)}\right) \propto \exp \left\{ -\frac{1}{2}\left( \mathbf {y}_{i}-\varvec{\mu }^{(t)}\right) ^{\top }\mathbf {C}^{(t)^{-1}}\left( \mathbf {y}_{i}-\varvec{\mu }^{(t)}\right) \right\} f\left( \mathbf {m}_{i}|\mathbf {y}_{i},\varvec{\eta }^{(t)}\right) , \end{aligned}$$

where \(\mathbf {C}^{(t)} = \mathbf {W}^{(t)}\mathbf {W}^{(t)^{\top }} + \varvec{\varPsi }^{(t)}\) is the marginal covariance matrix of \(\mathbf {y}_{i}\).
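
A minimal sketch of the corresponding log target, which can be passed to HI::arms exactly as before, is the following; the logistic missingness model is again our illustrative assumption.

    ## Log of f(z_i | x_i, m_i, lambda), up to an additive constant, with
    ## C = W W' + psi * I the marginal covariance matrix of y_i
    log_fz <- function(z, x_i, m_i, miss_i, mu, W, psi, eta) {
      y <- numeric(length(mu))
      y[miss_i] <- z
      y[!miss_i] <- x_i
      C <- tcrossprod(W) + diag(psi, length(mu))  # in practice, factor C once per iteration
      ll_y <- -0.5 * drop(crossprod(y - mu, solve(C, y - mu)))
      pr <- plogis(eta[1] + eta[2] * y)
      ll_m <- sum(dbinom(m_i, 1, pr, log = TRUE))
      ll_y + ll_m
    }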

Again, we obtain a sample \(\tilde{\mathbf {z}}_{ik}\), \(i = 1,\ldots ,n\), \(k = 1,\ldots ,K\) and calculate the approximate E-step

$$\begin{aligned} Q(\varvec{\lambda }|\varvec{\lambda }^{(t)}) = \frac{1}{K} \sum _{i = 1}^{n}\sum _{k=1}^{K}l\left( \varvec{\lambda }; \tilde{\mathbf {z}}_{ik}, \mathbf {x}_{i},\mathbf {m}_{i}\right) . \end{aligned}$$
(20)

The M-step equations obtained by maximizing the log-likelihood in (20) are similar to equations (27) and (28) in [42]; they do not require explicit computation of the covariance matrix and are omitted here for the sake of brevity.

Finally, we note that, based on the linear predictions

$$\begin{aligned} \hat{\mathbf {u}}_{ik} = \left( \hat{\mathbf {W}}^{\top }\hat{\mathbf {W}} + \hat{\varvec{\varPsi }}\right) ^{-1}\hat{\mathbf {W}}^{\top }(\tilde{\mathbf {y}}_{ik} - \hat{\varvec{\mu }}), \end{aligned}$$
(21)

where \(\tilde{\mathbf {y}}_{ik} = \left( \tilde{\mathbf {z}}_{ik},\mathbf {x}_{i}\right) \) is the complete data vector at convergence, we can calculate the element-wise variances of \(\frac{1}{K}\sum _{k=1}^{K} \hat{\mathbf {u}}_{ik}\) across individuals as estimates of \(\delta _{1},\ldots ,\delta _{q}\). The quantity \((p-q)\,\hat{\psi }\) provides the portion of the total variability associated with the 'discarded' components.
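
The sketch below carries out these final computations under the same illustrative array layout used above, with Y_tilde now holding the completed data vectors at convergence and \(\hat{\varvec{\varPsi }} = \hat{\psi }\mathbf {I}_{q}\) assumed in (21).

    ## Predicted scores (21), averaged over the K draws, and variance summaries
    pc_summary <- function(Y_tilde, W_hat, mu_hat, psi_hat) {
      n <- dim(Y_tilde)[1]; K <- dim(Y_tilde)[2]; p <- dim(Y_tilde)[3]
      q <- ncol(W_hat)
      A <- solve(crossprod(W_hat) + diag(psi_hat, q), t(W_hat))  # q x p, Eq. (21)
      U_hat <- matrix(NA_real_, n, q)
      for (i in 1:n) {
        Yi <- matrix(Y_tilde[i, , ], K, p)              # K completed vectors for subject i
        U_hat[i, ] <- rowMeans(A %*% (t(Yi) - mu_hat))  # (1/K) sum_k u_hat_ik
      }
      list(scores = U_hat,                    # n x q predicted component scores
           delta = apply(U_hat, 2, var),      # estimates of delta_1, ..., delta_q
           discarded = (p - q) * psi_hat)     # variability of 'discarded' components
    }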

Copyright information

© 2018 Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Geraci, M., Farcomeni, A. (2018). Principal Component Analysis in the Presence of Missing Data. In: Naik, G. (ed.) Advances in Principal Component Analysis. Springer, Singapore. https://doi.org/10.1007/978-981-10-6704-4_3

  • DOI: https://doi.org/10.1007/978-981-10-6704-4_3

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-6703-7

  • Online ISBN: 978-981-10-6704-4
