Principal Component Analysis
Keywords
Principal Component Analysis; Singular Value Decomposition; Component Score; Spectral Decomposition; Component Loading
Glossary
PCA: Principal Component Analysis
SVD: Singular Value Decomposition
Definition
Nowadays, almost everything can be monitored and measured, implying that huge amounts of data are usually available. A key problem is how to transform such data into information; in other words, it is crucial to extract the relevant information hidden in a data set. This is, in general, the main goal of statistical methods such as Principal Component Analysis (PCA). PCA is a well-known tool often used for the analysis of a numerical data set concerning a number of objects measured on several variables (features). Its aim is to synthesize the data set in terms of so-called components, i.e., unobserved variables expressed as linear combinations of the observed ones. These components are orthogonal and are found in such a way that they optimize a certain algebraic criterion (discussed in detail in the next sections). Equivalently, if we look at the data as a cloud of points in the variable space, the components can be interpreted geometrically as a low-dimensional space constructed so as to miss as little relevant information as possible. Although we can usually extract a number of components equal to that of the variables (assuming it is lower than the number of objects), PCA effectively synthesizes the data only if a few components (noticeably fewer than the variables) retain most of the information. As we shall see, this is the case if the sum of the variances of the components is high in comparison with the sum of the variances of the observed variables (the two sums coincide if the number of extracted components equals that of the variables).
The above brief description of PCA has been carried out according to an exploratory approach in the sense that no probabilistic assumptions concerning the observed data are made. In other words, the conclusions drawn are based on mathematical arguments and involve only the available data set. Nonetheless, in the literature several (theoretical) papers on statistical aspects connected to PCA can be found. A standard assumption is that the data follow a multivariate normal distribution and, for instance, the probability distribution of the extracted components is investigated. Further analysis on the statistical significance of the PCA results is usually not provided. In this entry, we introduce PCA as an exploratory tool. A reader interested in inferential aspects involving PCA can refer to Jolliffe (2002) and the references therein.
Introduction
PCA is a statistical tool for summarizing a large number of quantitative variables, say p, through components. When p is huge, it is impossible to manage the data properly; in this respect, the need for PCA arises. In PCA, one seeks unobservable variables, called components, capable of both synthesizing the huge amount of data and capturing most of the information contained in the observed variables. The components are uncorrelated and are found sequentially. The first component is the linear combination of the observed variables having maximum variance (provided that the coefficients of such a linear combination are suitably constrained to avoid trivial solutions). At the next step, the second component is found as a new linear combination of the original variables with maximum variance, orthogonal to the first component. This procedure can be repeated a number of times equal to the number of variables (assuming this number is smaller than that of the objects). However, for practical purposes, the number of extracted components is remarkably lower than that of the variables (usually 2 or 3). In the following, PCA is discussed both theoretically and in practice by means of examples and real data. Furthermore, the use of PCA in the most common statistical software packages is briefly described.
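The sequential extraction just described can be sketched in a few lines of code. The following is a minimal, illustrative implementation (Python with NumPy; the function name `pca` is our own, not a library routine) that obtains the first t components from an eigendecomposition of the covariance matrix, as detailed in the next sections:

```python
import numpy as np

def pca(X, t):
    """Illustrative PCA: loadings and scores of the first t components of X (n x p)."""
    Xc = X - X.mean(axis=0)                   # center the variables
    Sigma = np.cov(Xc, rowvar=False)          # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]         # re-sort in decreasing order
    A = eigvecs[:, order[:t]]                 # loadings: top-t eigenvectors
    C = Xc @ A                                # component scores
    return A, C, eigvals[order[:t]]
```

By construction, the variance of the sth component equals the sth largest eigenvalue, and distinct components are uncorrelated.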
Key Points
To further assess the role of the correlation between X _{1} and X _{2}, we can observe the lower part of Fig. 1, displaying two data sets with correlated variables (r _{12} = 0.967 for the left side, r _{12} = −0.981 for the right side). The main difference between the two data sets involves the variances of the variables. On the left side, the variances are almost the same (\( {\sigma}_1^2=1.04, {\sigma}_2^2=0.94 \)), whereas, on the right side, \( {\sigma}_1^2=1.08<{\sigma}_2^2=3.39 \). By applying PCA and looking at the first component, we can see that it synthesizes the data, explaining most of the variability. The variances of the first component are 1.94 (left side) and 4.44 (right side), hence only slightly lower than the corresponding sums of variances (1.98 and 4.47 for the left and right sides, respectively). Obviously, in both cases, the second component accounts for a very limited amount of variance (0.03). A relevant distinction concerns the way the component is derived. In the equal-variance case, X _{1} and X _{2} play essentially equally strong roles (C _{1} = 0.726X _{1} + 0.687X _{2}). On the right side, the role of X _{2} is emphasized, since C _{1} = −0.488X _{1} + 0.873X _{2}, and the different signs of the coefficients are connected with the negative correlation between the observed variables.
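The effect described above is easy to reproduce with synthetic data. The sketch below (Python with NumPy; the generated figures are our own and will not match those of Fig. 1) builds a strongly correlated pair of variables and checks that the first component absorbs almost all of the total variance:

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = 0.95 * x1 + 0.3 * rng.normal(size=500)   # a strongly correlated companion variable
X = np.column_stack([x1, x2])

Sigma = np.cov(X, rowvar=False)
lam = np.linalg.eigvalsh(Sigma)[::-1]          # component variances, decreasing
share = lam[0] / lam.sum()                     # fraction of total variance explained by C1
```

With such a strong correlation, the first component explains well over 90% of the total variance, and the two eigenvalues sum to the total of the observed variances (the trace of Σ), mirroring the behavior reported for the figure.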
Historical Background
The two most common references on the origins of PCA are Pearson (1901) and Hotelling (1933). Pearson (1901) approaches the problem from a geometric point of view, looking for a low-dimensional space that represents a cloud of points in a higher-dimensional space as well as possible. Although Pearson was an English mathematician, such a view is usually referred to as the French approach to PCA. Hotelling (1933) introduces PCA as the attempt to detect a limited number of independent variables best summarizing a set of (correlated) observed variables. In this respect, the adopted approach is the Anglo-Saxon one, characterized by the algebraic derivation of PCA. Obviously, both approaches lead to the same solution, found in terms of the spectral decomposition of the correlation (or covariance) matrix. Alternatively, the PCA solution can be obtained by a singular value decomposition (SVD) of the original (centered and/or normalized) data matrix. SVD is also known as the Eckart–Young decomposition (Eckart and Young 1936). Nonetheless, some earlier works on SVD in a form underlying PCA can be found in the literature (for more details, refer to Preisendorfer and Mobley (1988)), and thus it is very difficult to state the exact origin of PCA.
Originally, PCA was applied especially in the psychological domain. Nowadays, however, it is applied in a wide variety of research fields.
Principal Components
where Σ is the covariance matrix of x. This constrained maximization problem can be solved by Lagrange multipliers. It can be shown that the problem boils down to computing the eigenvectors and eigenvalues of Σ. In particular, since the variance must be maximized, it is easy to see that a _{1} is the eigenvector corresponding to the largest eigenvalue of Σ, say λ _{1}. In this case, \( \mathrm{var}\left({C}_1\right)=\mathrm{var}\left({\mathbf{a}}_1^{\mathbf{\prime}}\mathbf{x}\right)={\mathbf{a}}_1^{\mathbf{\prime}}\Sigma {\mathbf{a}}_1={\lambda}_1 \); thus, the variance of the first component is equal to the largest eigenvalue of Σ.
The solution of Eq. 2 is the eigenvector corresponding to the second largest eigenvalue (λ _{2}) and var(C _{2}) = λ _{2}. In general, the sth principal component can be found by considering the eigenvector corresponding to the sth largest eigenvalue of Σ and var(C _{ s }) = λ _{ s }. Hence, it should be clear that the principal components are uncorrelated linear combinations of the original variables (with weights having a fixed sum of squares) such that their variances are as large as possible.
The above description of PCA clarifies that it is a multistep procedure finding components with maximal variance. Perhaps this does not fully highlight that PCA is a method for reducing the data complexity, passing from a large number of original variables to a few principal components. In this respect, we can consider PCA as a method for sequentially finding "new" variables, i.e., the principal components, that best recover the data, i.e., that miss as little relevant information as possible. This amounts to looking for the principal component that, at each step, best "explains" (or "accounts for") the variance in the data. Every principal component explains an amount of variance of the original data equal to its variance. For instance, λ _{1} is the amount of variance explained by C _{1}.
where A is the orthonormal matrix containing the eigenvectors of Σ in its columns (the sth column contains the eigenvector corresponding to the sth largest eigenvalue) and Λ is a diagonal matrix with diagonal elements equal to the eigenvalues of Σ in decreasing order.
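The spectral decomposition just described is straightforward to verify numerically. A small sketch (Python with NumPy, on random data of our own making) checks that Σ = AΛA′ and that A is orthonormal:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 3)) @ rng.normal(size=(3, 3))  # correlated variables
Sigma = np.cov(X, rowvar=False)

lam, vecs = np.linalg.eigh(Sigma)          # eigenvalues in ascending order
A = vecs[:, ::-1]                          # eigenvectors, decreasing eigenvalue order
Lam = np.diag(lam[::-1])                   # diagonal matrix of eigenvalues

reconstructed = A @ Lam @ A.T              # should equal Sigma (= A Lam A')
orthonormal = A.T @ A                      # should equal the identity matrix
```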
where \( {\sigma}_j^2 \) is the variance of variable j; for further details on the derivation of Eqs. 4 and 5, see Mardia et al. (1979). Except for situations where the standard deviations of the variables vary noticeably, the loadings are usually considered for interpreting the extracted components. In this respect, it is worth mentioning that the sign of a principal component is fully arbitrary: if we reverse the signs of all the loadings of a principal component, the variance of the component remains the same. One must only reverse the corresponding interpretation.
If n > p, we can derive up to p principal components (t ≤ p). However, if the rank of Σ is r < p, only the first r principal components have positive variances var(C _{1}) ≥ … ≥ var(C _{ r }) ≥ var(C _{ r+1}) = … = var(C _{ p }) = 0.
When the principal components are derived, the principal component scores (or simply component scores) of the objects with respect to such new variables can be computed as \( {\mathbf{c}}_s={\mathbf{a}}_s^{\mathbf{\prime}}\left(\mathbf{x}-\overline{\mathbf{x}}\right) \), s = 1, …, t, where \( \overline{\mathbf{x}} \) is the mean vector of X _{1}, …, X _{ p }. Therefore, apart from the centering by the mean vector, the component scores are an orthonormal linear transformation of x.
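The score computation can be sketched numerically (Python with NumPy, random data of our own): projecting the centered data on the eigenvectors yields scores whose mean is zero and whose covariance matrix is exactly the diagonal matrix of eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 4))
xbar = X.mean(axis=0)

Sigma = np.cov(X, rowvar=False)
lam, vecs = np.linalg.eigh(Sigma)
A = vecs[:, ::-1]                         # loadings in decreasing eigenvalue order

scores = (X - xbar) @ A                   # c_is = a_s'(x_i - xbar), all objects at once
```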
Geometric Interpretation
The component scores satisfy several interesting properties (see, e.g., Jolliffe (2002)). Here, we mention only one property connected with the geometrical interpretation of PCA. Let (x _{1}, …, x _{ n }) be the data points, which can be seen as a cloud of points in the reference space ℜ^{ p }. The orthonormal matrix A _{ t } = [a _{1} ⋯ a _{ t }], containing the first t eigenvectors of Σ in its columns, provides the basis of the t-dimensional subspace minimizing the sum of squared perpendicular distances of the data points x _{ i } (i = 1, …, n) from the subspace. Thus, the principal components determine a low-dimensional space accounting for as much of the total variation in the data as possible. In particular, the first component recovers the best-fitting line. The second component gives the best-fitting line orthogonal to the first one and, together with the first line, forms a plane. If we use t components, a collection of t best-fitting lines, orthogonal to each other, is found and the best-fitting hyperplane is built. In this setting, the component scores represent the orthogonal projections of (x _{1}, …, x _{ n }) onto such a t-dimensional subspace.
Covariance and Correlation Matrices
The above description of PCA is based on the analysis of the covariances of X _{1}, …, X _{ p }. PCA can also be applied to the correlation matrix R. In other words, this consists of considering the standardized versions of X _{1}, …, X _{ p } when deriving the principal components. The choice between standardized and raw variables is crucial because the eigenvalues and eigenvectors of the correlation and covariance matrices differ, and no simple relation linking the two solutions is available. Thus, the principal components based on Σ and R are not the same.
When the units of measurement of the variables differ, the correlation matrix should be used; otherwise, artificial differences among the variables will affect the PCA results. When the units of measurement coincide, the researcher should assess whether standardizing the variables is a sensible choice. Specifically, the principal components extracted from Σ will depend mostly on the variables with the largest variances. This makes sense if the researcher believes that a larger (smaller) variance implies a greater (lesser) importance of the variable involved. If this is not the case, the use of R is recommended, and all the variables will be treated on an equal footing.
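The contrast between the two choices can be demonstrated with synthetic data (Python with NumPy; the common-factor construction and the scale factors below are our own illustrative assumptions). Three equally informative variables are given wildly different scales; PCA on Σ is then dominated by the largest-variance variable, while PCA on R weights all three comparably:

```python
import numpy as np

rng = np.random.default_rng(4)
f = rng.normal(size=200)                             # common factor driving all variables
E = 0.5 * rng.normal(size=(200, 3))                  # idiosyncratic noise
X = (f[:, None] + E) * np.array([1.0, 10.0, 100.0])  # same signal, very different scales

Sigma = np.cov(X, rowvar=False)
R = np.corrcoef(X, rowvar=False)
a1_cov = np.linalg.eigh(Sigma)[1][:, -1]   # first PC loadings from the covariance matrix
a1_cor = np.linalg.eigh(R)[1][:, -1]       # first PC loadings from the correlation matrix
```

Here `a1_cov` is essentially the third coordinate axis (the variable with variance on the order of 10^4), whereas `a1_cor` gives all three variables weights of comparable size.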
Choice of the Number of Components
The number of components t must be chosen so that the original data are well synthesized, maintaining as much of the relevant information as possible. If t = 1, a drastic reduction of the data size is achieved, but in general a large part of the data variation is lost. Obviously, the opposite comment holds when t = p.
In the literature there exist several proposals for the optimal choice of t. Here, we limit our attention to some basic procedures for selecting t. The underlying idea of such procedures is to keep the most important components in terms of the sum of their variances. Taking into account the connection between the variances of the components and the eigenvalues of either Σ or R, these procedures are based on the eigenvalues λ _{ s }.
A straightforward proposal, known as Kaiser's rule, suggests retaining only the components such that λ _{ s } > 1. Another tool is the so-called scree plot. It consists of plotting the variances λ _{ s } (Y-axis) against the component number s (X-axis). If we connect the plotted points, the curve drops moving to the right. The optimal value of t is found in connection with the presence of an elbow, i.e., a bend related to a high ratio of differences between consecutive eigenvalues.
The index compares the variance explained by the first t components with the total sum of variances. A high value of F _{ t } (close to 100%) implies that the first t components synthesize the observed data well. In general, a rule to be adopted can be to choose t as the smallest integer such that F _{ t } is higher than a prespecified percentage (e.g., 70%). However, how to fix such a cutoff is a complex issue and depends on the data under investigation. Alternatively, t can be chosen as the smallest value such that F _{ t } − F _{ t−1} is large and F _{ t+1} − F _{ t } is small. This means that passing from t − 1 to t components leads to a remarkable increase of fit, whereas adding one more component leads to a negligible increase of explained variance. A plot of F _{ t } against t can be helpful in order to select the optimal value of t; this offers essentially the same information as the scree plot. Nonetheless, it must be underlined that a degree of subjectivity in the choice of t exists: there is no rule for selecting t to be adopted slavishly. Moreover, the choice of t should not be based solely on the analysis of the fit values. The fit values help to select some interesting solutions, which balance fit and parsimony well; a deeper insight into these solutions is then needed in order to choose the optimal one. The illustrative example will help to clarify this point.
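Both Kaiser's rule and the cumulative-fit index F _{ t } can be computed directly from the eigenvalues. A small sketch follows (Python with NumPy; the function `choose_t` and the eigenvalues are our own hypothetical example, not taken from the data discussed here):

```python
import numpy as np

def choose_t(eigvals, cutoff=0.70):
    """Smallest t such that the cumulative explained-variance share F_t reaches cutoff."""
    lam = np.sort(np.asarray(eigvals))[::-1]   # eigenvalues in decreasing order
    F = np.cumsum(lam) / lam.sum()             # F_t for t = 1, ..., p
    return int(np.argmax(F >= cutoff)) + 1, F

lam = [3.0, 1.5, 0.3, 0.2]                   # hypothetical eigenvalues of a correlation matrix
t, F = choose_t(lam)                         # F = [0.60, 0.90, 0.96, 1.00], so t = 2
kaiser_t = int(np.sum(np.asarray(lam) > 1))  # Kaiser's rule: keep components with lambda > 1
```

On this hypothetical spectrum, both rules agree on t = 2.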
PCA and Singular Value Decomposition
where S and D are the matrices of the left and right singular vectors of X, respectively, associated with the singular values of X stored in decreasing order in the main diagonal of the diagonal matrix V. The solution of Eq. 8 is then found as C = S _{ t } V _{ t } and A = D _{ t }, where V _{ t } is the diagonal matrix containing the t largest singular values and S _{ t } and D _{ t } are the matrices holding in their columns the first t columns of S and D, respectively. The matrices C and A are the so-called (principal) component score and loading matrices, respectively, and coincide with the PCA solution previously described. In fact, the SVD of X and the spectral decompositions of Σ and R are related. Assuming without loss of generality that the variables in X have zero mean (i.e., the matrix is columnwise centered), we have that S contains the eigenvectors of XX′, D those of X′X (which is proportional to Σ), and V has in its main diagonal the square roots of the eigenvalues of X′X. If X contains standardized scores, the same links hold with R in place of Σ.
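The link between the SVD of a centered data matrix and the eigendecomposition of X′X can be checked directly (illustrative Python/NumPy code on random data; `v` below holds the singular values that the text stores in the diagonal matrix V):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 3))
X = X - X.mean(axis=0)                      # columnwise centered

S, v, Dt = np.linalg.svd(X, full_matrices=False)
lam = np.linalg.eigvalsh(X.T @ X)[::-1]     # eigenvalues of X'X, decreasing

# squared singular values of X equal the eigenvalues of X'X, and the
# component scores X D coincide with the left singular vectors scaled by v
scores = X @ Dt.T
```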
The solution of Eq. 8 provides the best t-rank decomposition of X. Although the PCA solution remains the same, in this case the aim of PCA is to look for the best low-rank approximation of a given matrix. Thus, the optimal solution is the t-rank matrix \( \widehat{\mathbf{X}}=\mathbf{C}{\mathbf{A}}^{\mathbf{\prime}} \). In this respect, it is interesting to see that the optimal solution previously reported is not unique. Specifically, for any rotation matrix T of order (t × t), an equally fitting solution is \( \tilde{\mathbf{C}}=\mathbf{C}\mathbf{T} \) and \( \tilde{\mathbf{A}}=\mathbf{A}{\left({\mathbf{T}}^{\mathbf{\prime}}\right)}^{-1} \), such that \( \tilde{\mathbf{C}}{\tilde{\mathbf{A}}}^{\mathbf{\prime}}=\mathbf{C}{\mathbf{A}}^{\mathbf{\prime}} \). The use of T leads to a new set of components having the same sum of explained variances as the previous set. Nonetheless, differently from C and A, the components in \( \tilde{\mathbf{C}} \) and \( \tilde{\mathbf{A}} \) no longer maximize the explained sum of variances sequentially. For this reason, we may refer to them as rotated principal components, or simply components, avoiding the adjective "principal."
Rotation matrices have a high impact for practical purposes. In fact, they can be used to facilitate the interpretation of the extracted components by means of simple structure rotations, such as the varimax criterion (Kaiser 1958), which is usually applied to the component loadings. This generally leads to loadings that are either high or low in absolute value, and not in between. In this way, for every component, we can determine a subset of variables that remarkably affect the component involved. At the same time, it is desirable that only one loading per observed variable is high in absolute value, so that each variable is mainly related to exactly one component.
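A minimal varimax implementation is quite short. The sketch below is our own, following the commonly used SVD-based iteration for the varimax criterion (it is an illustration, not the reference algorithm of any particular package); since the rotation matrix T is orthogonal, the total sum of squared loadings, and hence the total explained variance, is unchanged:

```python
import numpy as np

def varimax(A, max_iter=100, tol=1e-8):
    """Rotate a loading matrix A (p x t) by the varimax criterion; returns (rotated A, T)."""
    p, t = A.shape
    T = np.eye(t)
    crit_old = 0.0
    for _ in range(max_iter):
        L = A @ T
        # gradient of the varimax criterion, solved via an SVD at each step
        U, s, Vt = np.linalg.svd(A.T @ (L ** 3 - L @ np.diag((L ** 2).sum(axis=0)) / p))
        T = U @ Vt                        # best orthogonal update
        crit_new = s.sum()
        if crit_new - crit_old < tol:     # criterion no longer improves
            break
        crit_old = crit_new
    return A @ T, T
```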
PCA and Sparseness
As we already saw, the interpretation of the components is carried out by looking at the largest loadings in absolute value. In other words, since every component is a linear combination of all the observed variables, one usually selects the subset of variables having the loadings farthest from zero. This implicitly consists of subjectively fixing a threshold and artificially setting the loadings with absolute values smaller than the threshold to zero. Such an approach is very intuitive, especially when the solution is varimax rotated to simple structure, since in such cases we usually have a large number of loadings close to zero that can be ignored when interpreting the extracted components. Although this is the most common strategy in practice, it has been criticized by several authors (see, e.g., Cadima and Jolliffe 1995). In the case of a rotated solution, the components are obviously no longer ordered with respect to the explained variance, and this may represent a limitation if one is interested in the ranking of the components in terms of explained variance. Furthermore, it is not guaranteed that the rotated solution is as simple as required. Another drawback refers to the case in which PCA is applied to the covariance matrix Σ, i.e., the data in matrix X are not normalized. As we saw in Eq. 5, it is not just the size of the loadings but also the standard deviation of each variable that determines the importance of a variable for a component.
with ζ ≥ 0. The new constraint takes inspiration from the Lasso (Tibshirani 1996), a well-known method in regression which represents a compromise between variable selection and shrinkage estimators. This stems from its geometry, which forces some of the regression coefficients to zero. In PCA, the same geometry implies that some loadings are shrunk and some others are set exactly to zero.
A closely related procedure involving the Lasso has been developed by Zou et al. (2006). The so-called SPCA is built on the fact that PCA can be written as a regression-type optimization problem with a quadratic penalty. In this case, the sparsity of the loadings is achieved by imposing the Lasso constraint on the regression coefficients. The optimal solution of SPCA can be found by either the nonconvex algorithm proposed by Zou et al. (2006) or the more general one developed by d'Aspremont et al. (2007).
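The qualitative effect of the Lasso-type constraint (shrink large loadings, set small ones exactly to zero) is that of a soft-thresholding operator. The stand-alone sketch below (Python/NumPy; the loading vector and threshold are hypothetical, and this is only an illustration of the geometry, not the full SPCA algorithm) makes the shrink-and-zero behavior concrete:

```python
import numpy as np

def soft_threshold(a, gamma):
    """Lasso-style operator: shrinks entries toward zero, zeroing the small ones."""
    return np.sign(a) * np.maximum(np.abs(a) - gamma, 0.0)

a = np.array([0.60, -0.45, 0.08, -0.05])   # hypothetical loading vector
a_sparse = soft_threshold(a, 0.10)         # -> [0.50, -0.35, 0.00, 0.00]
```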
Software
Several monographs on the application of PCA using the most common statistical software packages, such as Matlab, R, SAS, SPSS, and Statistica, are available (e.g., Everitt (2005); Marques de Sá (2007)). In fact, all the above-mentioned packages have specific tools for applying PCA. In this respect, note that to perform PCA using SPSS, one has to select the Factor Analysis (FA) menu, implicitly suggesting that PCA is a particular option of FA. This partly explains the existing confusion between PCA and FA; in fact, the two techniques strongly differ. A better insight into the distinctive features of PCA and FA can be found in, e.g., Jolliffe (2002).
Illustrative Example
In order to show how PCA works in practice, the flea-beetle data set (Lubischew 1962) is investigated. The data refer to physical measurements on n = 74 male flea-beetles. In particular, p = 6 measurements on each flea-beetle are considered. These are the width of the first joint of the first tarsus (Tarsus1), the width of the second joint of the first tarsus (Tarsus2), the maximal width of the head between the external edges of the eyes (Head), the maximal width of the aedeagus in the forepart (AedeagusWF), the front angle of the aedeagus (AedeagusA), and the aedeagus width from the side (AedeagusWS). Furthermore, the information concerning the species (variable Species with modalities Ch. concinna, Ch. heptapotamica, and Ch. heikertingeri) is also reported. The aim of the original work was to discriminate the species by means of (subsets of) the measured characters, since the external differences between these three species are scarcely perceptible.
PCA can be applied only to the six quantitative measurements, whereas the qualitative information about the species can be used as a supplementary variable. As will be clarified in the following, this means that the species does not play an active role in determining the components, but it will be used later to provide a better insight into the obtained results.
Correlation matrix for the flea-beetle data

              Tarsus1  Tarsus2   Head   AedeagusWF  AedeagusA  AedeagusWS
Tarsus1         1.00     0.03   −0.10     −0.33       0.78      −0.57
Tarsus2         0.03     1.00    0.67      0.56      −0.12       0.49
Head           −0.10     0.67    1.00      0.59      −0.31       0.52
AedeagusWF     −0.33     0.56    0.59      1.00      −0.25       0.78
AedeagusA       0.78    −0.12   −0.31     −0.25       1.00      −0.48
AedeagusWS     −0.57     0.49    0.52      0.78      −0.48       1.00
Principal component loadings matrix with t = 2 (loadings higher than 0.30 in absolute value are in bold)

Measurement   Component 1  Component 2
Tarsus1          −0.33         0.60
Tarsus2           0.36         0.48
Head              0.41         0.34
AedeagusWF        0.46         0.19
AedeagusA        −0.35         0.50
AedeagusWS        0.50        −0.05
Component loadings matrix with t = 2 after varimax rotation (loadings higher than 0.30 in absolute value are in bold)

Measurement   Component 1  Component 2
Tarsus1           0.07         0.69
Tarsus2           0.57         0.19
Head              0.53         0.05
AedeagusWF        0.49        −0.11
AedeagusA         0.00         0.61
AedeagusWS        0.38        −0.32
Component loadings matrix with t = 3 after varimax rotation (loadings higher than 0.30 in absolute value are in bold)

Measurement   Component 1  Component 2  Component 3
Tarsus1          −0.24         0.61         0.23
Tarsus2           0.10         0.11         0.64
Head             −0.04        −0.10         0.71
AedeagusWF        0.71         0.12         0.06
AedeagusA         0.17         0.76        −0.19
AedeagusWS        0.63        −0.13        −0.00
Key Applications
The applicability of PCA is witnessed by the large number of papers available in the literature. PCA is a widely used tool for synthesizing the available data when the number of (quantitative) variables is large. In recent years, this has become extremely common in almost all areas, since the production of data is dramatically increasing. In this respect, PCA appears to be very helpful for extracting relevant information from massive data sets.
Future Directions
The current research interests in PCA lie in attempts to extend the method to more complex situations, e.g., nonlinear interactions among variables, qualitative data, multiway analysis, etc. Here, we briefly illustrate the multiway extension of PCA, limiting our attention to the so-called three-way case.
The need for three-way analysis arises when variables are observed on the same objects on several occasions, not necessarily related to time. Thus, the information can be characterized in terms of three modes, namely, objects, variables, and occasions. This situation can be extended to higher-way data, hence leading to multiway data. In the case of three-way data, the available information is a three-way array, denoted by X, to distinguish it from a standard (two-way) matrix X. An array can be seen as a collection of (object × variable) matrices, where each matrix pertains to a given occasion. In principle, PCA can still be applied to a three-way array, for instance, by rearranging the elements of the array into a matrix. A way to do this is the so-called matricization of the array, which can consist of placing all the (object × variable) matrices of the array next to each other. Denoting by n, p, and m the numbers of objects, variables, and occasions, respectively, the resulting matrix has n rows and pm columns. Its rows refer to the objects, while its columns refer to "new variables," which are all the possible combinations of variables and occasions, with the variables nested within the occasions (see, e.g., Kiers (2000)). A PCA on such a huge matrix has two disadvantages: the component loadings matrix is hard to read, and the interpretation of the components is rather difficult, since the components will depend on the variables in some specific occasions. A more relevant problem, however, is that the triple interactions among the three modes are fully missed, since one does not properly take into account that the new variables are combinations of two modes. This leads to erroneous or incomplete results.
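The matricization step just described is a simple reshaping operation. A sketch on a tiny hypothetical array (Python/NumPy; the dimensions and values are our own) places the m (object × variable) slices next to each other, yielding an n × pm matrix with variables nested within occasions:

```python
import numpy as np

n, p, m = 4, 3, 2
X3 = np.arange(n * p * m, dtype=float).reshape(n, p, m)   # hypothetical three-way array

# place the m (object x variable) slices side by side:
# columns are grouped by occasion, with the p variables nested within each occasion
Xmat = np.concatenate([X3[:, :, k] for k in range(m)], axis=1)
```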
An alternative approach for performing standard PCA on three-way data is to first aggregate the array over one mode (for instance, averaging the data across the occasions) and then perform PCA. Unfortunately, once again, the triple interactions in the data cannot be recovered by the extracted components.
In the literature, the two most common proposals for extending PCA to three-way data are the Tucker3 and Candecomp/Parafac models (for an overview of these and other methods, see Kroonenberg (2007)). Bearing in mind the PCA formulation in Eq. 7, the main idea underlying both models is to consider not only the matrices C and A, summarizing the objects and the variables, respectively, but also a third matrix summarizing the occasions. In Candecomp/Parafac, the same number of components is sought for every mode. The Tucker3 model generalizes Candecomp/Parafac by admitting different numbers of components for every mode and introducing the so-called core array to study the triple interactions among the components of every mode. This implies that Candecomp/Parafac can be seen as a constrained version of Tucker3. Such a constraint guarantees that, under mild conditions, the Candecomp/Parafac solution is unique, whereas Tucker3 admits rotations.
References
Cadima J, Jolliffe I (1995) Loadings and correlations in the interpretation of principal components. J Appl Stat 22:203–214
d'Aspremont A, El Ghaoui L, Jordan MI, Lanckriet GRG (2007) A direct formulation for sparse PCA using semidefinite programming. SIAM Rev 49:434–448
Eckart C, Young G (1936) The approximation of one matrix by another of lower rank. Psychometrika 1:211–218
Everitt BS (2005) An R and S-Plus companion to multivariate analysis. Springer, London
Gabriel KR (1971) The biplot graphical display of matrices with applications to principal component analysis. Biometrika 58:453–467
Hausmann R (1982) Constrained multivariate analysis. In: Zanckis SH, Rustagi JS (eds) Optimisation in statistics. North-Holland, Amsterdam, pp 137–151
Hotelling H (1933) Analysis of a complex of statistical variables into principal components. J Educ Psychol 24:417–441
Jolliffe IT (2002) Principal component analysis. Springer, New York
Jolliffe IT, Uddin M (2000) The simplified component technique: an alternative to rotated principal components. J Comput Graph Stat 9:689–710
Jolliffe IT, Trendafilov NT, Uddin M (2003) A modified principal component technique based on the LASSO. J Comput Graph Stat 12:531–547
Kaiser HF (1958) The varimax criterion for analytic rotation in factor analysis. Psychometrika 23:187–200
Kiers HAL (2000) Towards a standardized notation and terminology in multiway analysis. J Chemom 14:105–122
Kroonenberg PM (2007) Applied multiway data analysis. Wiley, Hoboken
Lubischew AA (1962) On the use of discriminant functions in taxonomy. Biometrics 18:455–477
Mardia KV, Kent JT, Bibby JM (1979) Multivariate analysis. Academic Press, London
Marques de Sá JP (2007) Applied statistics using SPSS, STATISTICA, MATLAB and R. Springer, Berlin/Heidelberg
Pearson K (1901) On lines and planes of closest fit to systems of points in space. Philos Mag 2:559–572
Preisendorfer RW, Mobley CD (1988) Principal component analysis in meteorology and oceanography. Elsevier, Amsterdam
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J Roy Stat Soc B 58:267–288
Vines SK (2000) Simple principal components. Appl Stat 49:441–451
Zou H, Hastie T, Tibshirani R (2006) Sparse principal component analysis. J Comput Graph Stat 15:262–286