Encyclopedia of Social Network Analysis and Mining

Living Edition
| Editors: Reda Alhajj, Jon Rokne

Principal Component Analysis

  • Paolo Giordani
Living reference work entry
DOI: https://doi.org/10.1007/978-1-4614-7163-9_154-1

Keywords

Principal Component Analysis; Singular Value Decomposition; Component Score; Spectral Decomposition; Component Loading

Synonyms

Glossary

PCA

Principal Component Analysis

SVD

Singular Value Decomposition

Definition

Nowadays, almost everything can be monitored and measured, which means that huge amounts of data are routinely available. A central problem is how to transform such data into information, i.e., how to extract the relevant information hidden in a data set. This is, in general, the main goal of statistical methods such as Principal Component Analysis (PCA). PCA is a well-known tool for the analysis of a numerical data set in which a number of objects are measured on several variables (features). Its aim is to synthesize the data set in terms of so-called components, i.e., unobserved variables expressed as linear combinations of the observed ones. These components are orthogonal and are found so as to optimize a certain algebraic criterion (discussed in detail in the next sections). Equivalently, if we look at the data as a cloud of points in the variable space, the components can be interpreted geometrically as spanning a low-dimensional space constructed so as to miss as little relevant information as possible. Although we can usually extract a number of components equal to the number of variables (assuming it is lower than the number of objects), PCA effectively synthesizes the data only if a few components (a number noticeably lower than that of the variables) retain most of the information. As we shall see, this is the case when the sum of the variances of the components is high in comparison with the sum of the variances of the observed variables (the two sums coincide if the number of extracted components equals the number of variables).

The above brief description of PCA follows an exploratory approach, in the sense that no probabilistic assumptions about the observed data are made. In other words, the conclusions drawn are based on mathematical arguments and involve only the available data set. Nonetheless, several (theoretical) papers on statistical aspects of PCA can be found in the literature. A standard assumption is that the data follow a multivariate normal distribution and, for instance, the probability distribution of the extracted components is investigated; further analysis of the statistical significance of the PCA results is usually not provided. In this entry, we introduce PCA as an exploratory tool. A reader interested in inferential aspects of PCA can refer to Jolliffe (2002) and the references therein.

Introduction

PCA is a statistical tool for summarizing a large number of quantitative variables, say p, through components. When p is huge, it is impossible to manage the data properly; in this respect, the need for PCA arises. In PCA, one seeks unobservable variables, called components, capable of both synthesizing the huge amount of data and capturing most of the information contained in the observed variables. The components are uncorrelated and are found sequentially. The first component is the linear combination of the observed variables having maximum variance (provided that the coefficients of such a linear combination are suitably constrained to avoid trivial solutions). At the next step, the second component is found as a new linear combination of the original variables with maximum variance, orthogonal to the first component. This procedure can be repeated a number of times equal to the number of variables (assuming this number is smaller than the number of objects). However, for practical purposes, the number of extracted components is remarkably lower than the number of variables (usually 2 or 3). In the following, PCA is discussed both theoretically and in practice by means of examples and real data. Furthermore, the use of PCA in the most common statistical software packages is briefly described.

Key Points

The output generated by PCA is strongly connected with the variances of the observed variables and their correlation structure. For this purpose, consider the examples reported in Fig. 1, where four bivariate (p = 2) data sets are displayed along with the corresponding first principal component. Every data set has been generated randomly from a bivariate normal distribution with null mean vector and a different covariance matrix. In the upper left panel of the figure, the variables \(X_1\) and \(X_2\) are approximately uncorrelated (\(r_{12} = 0.001\)) and have almost equal variances (\({\sigma}_1^2 = 0.92\), \({\sigma}_2^2 = 0.81\)). In the presence of uncorrelated variables, PCA fails to synthesize the data: the extracted components tend to reproduce the original variables. The first component \(C_1 = 1.000X_1 + 0.008X_2\) essentially coincides with the variable having the highest variance (in fact, its variance is equal to \({\sigma}_1^2\)). A similar result is found by considering the data in the upper right panel of Fig. 1. Here, the variance of \(X_1\) is noticeably larger than that of \(X_2\) (\({\sigma}_1^2 = 2.95\), \({\sigma}_2^2 = 1.07\)) and a small correlation exists (\(r_{12} = 0.111\)). Once again, the first component is closely related to \(X_1\) (\(C_1 = 0.995X_1 + 0.103X_2\)); the (small) correlation is also captured by the component, which gives a limited importance to \(X_2\). This leads to a variance for such a component equal to 2.97, thus slightly higher than \({\sigma}_1^2\).
Fig. 1

Bivariate normally distributed data sets and first principal component (dotted line)

To further assess the role of the correlation between \(X_1\) and \(X_2\), consider the lower panels of Fig. 1, which display two data sets with correlated variables (\(r_{12} = 0.967\) on the left, \(r_{12} = -0.981\) on the right). The main difference between the two data sets involves the variances of the variables. On the left, the variances are almost the same (\({\sigma}_1^2 = 1.04\), \({\sigma}_2^2 = 0.94\)), whereas, on the right, \({\sigma}_1^2 = 1.08 < {\sigma}_2^2 = 3.39\). By applying PCA and looking at the first component, we can see that it synthesizes the data, explaining most of the variability. The variances of the first component are 1.94 (left) and 4.44 (right), hence only slightly lower than the corresponding sums of variances (1.98 and 4.47 for the left and right panels, respectively). Obviously, in both cases, the second component accounts for a very limited amount of variance (0.03). A relevant distinction concerns the way the component is derived. In the equal-variance case, \(X_1\) and \(X_2\) play essentially equally strong roles (\(C_1 = 0.726X_1 + 0.687X_2\)). On the right, the role of \(X_2\) is emphasized, since \(C_1 = -0.488X_1 + 0.873X_2\), and the different signs of the coefficients reflect the negative correlation between the observed variables.
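
As a minimal sketch (not the code used to produce Fig. 1, and with an illustrative covariance matrix), the kind of computation behind these examples can be reproduced in a few lines of Python/NumPy:

```python
# A minimal sketch: generate a bivariate normal sample with strongly
# correlated variables and extract the first principal component from
# its sample covariance matrix. The covariance values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.95],
                  [0.95, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=500)

S = np.cov(X, rowvar=False)            # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)   # eigenvalues in ascending order
a1 = eigvecs[:, -1]                    # eigenvector of the largest eigenvalue
lam1 = eigvals[-1]                     # variance of the first component

print("first component loadings:", np.round(a1, 3))
print("variance of C1: %.3f of total %.3f" % (lam1, eigvals.sum()))
```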

Historical Background

The two most common references on the origins of PCA are Pearson (1901) and Hotelling (1933). Pearson (1901) approaches the problem from a geometric point of view, looking for the low-dimensional space that best represents a cloud of points lying in a higher-dimensional space. Although Pearson was an English mathematician, such a view is usually referred to as the French approach to PCA. Hotelling (1933) introduces PCA as the attempt to detect a limited number of independent variables best summarizing a set of (correlated) observed variables. In this respect, the adopted approach is the Anglo-Saxon one, characterized by the algebraic derivation of PCA. Obviously, both approaches lead to the same solution, found in terms of the spectral decomposition of the correlation (or covariance) matrix. Alternatively, the PCA solution can be obtained by a singular value decomposition (SVD) of the original (centered and/or normalized) data matrix. SVD is also known as the Eckart-Young decomposition (Eckart and Young 1936). Nonetheless, earlier works on SVD in a form underlying PCA can be found in the literature (for more details refer to Preisendorfer and Mobley 1988), and thus it is very difficult to state the exact origin of PCA.

Originally, PCA was applied mainly in the psychological domain. Nowadays, however, it is applied in a wide variety of research fields.

Principal Components

Suppose that p quantitative variables \(\mathbf{x} = [X_1, \dots, X_p]'\) are observed on n objects and we are interested in synthesizing the available data by PCA. The first principal component is a linear combination of the original variables, \(C_1 = a_{11}X_1 + \dots + a_{1p}X_p = \mathbf{a}_1'\mathbf{x}\), with \(\mathbf{a}_1 = [a_{11}, \dots, a_{1p}]'\). The weights in \(\mathbf{a}_1\) are chosen so as to maximize the variance of \(C_1\). A trivial solution would be obtained if we did not impose a normalization constraint on \(\mathbf{a}_1\) (\(\mathbf{a}_1'\mathbf{a}_1 = 1\)). Hence, the problem to be solved is
$$ \begin{array}{l}\underset{\mathbf{a}_1}{\max}\ \mathrm{var}\left(\mathbf{a}_1'\mathbf{x}\right)=\mathbf{a}_1'\Sigma \mathbf{a}_1\\ {}\mathrm{s}.\mathrm{t}.\ \mathbf{a}_1'\mathbf{a}_1=1,\end{array} $$
(1)

where Σ is the covariance matrix of x. This constrained maximization problem can be solved via Lagrange multipliers. It can be shown that the problem boils down to computing the eigenvectors and eigenvalues of Σ. In particular, since the variance must be maximized, \(\mathbf{a}_1\) is the eigenvector corresponding to the largest eigenvalue of Σ, say \(\lambda_1\). In this case, \( \mathrm{var}\left({C}_1\right)=\mathrm{var}\left(\mathbf{a}_1'\mathbf{x}\right)=\mathbf{a}_1'\Sigma \mathbf{a}_1={\lambda}_1 \); thus the variance of the first component is equal to the largest eigenvalue of Σ.

The derivation of the remaining components is analogous to Eq. 1, with the additional constraint that each component must be uncorrelated with the previous ones. For instance, the second component is the solution of
$$ \begin{array}{l}\underset{\mathbf{a}_2}{\max}\ \mathrm{var}\left(\mathbf{a}_2'\mathbf{x}\right)=\mathbf{a}_2'\Sigma \mathbf{a}_2\\ {}\mathrm{s}.\mathrm{t}.\ \mathbf{a}_2'\mathbf{a}_2=1,\\ {}\mathrm{cov}\left(\mathbf{a}_1'\mathbf{x},\mathbf{a}_2'\mathbf{x}\right)=\mathbf{a}_2'\Sigma \mathbf{a}_1={\lambda}_1\mathbf{a}_2'\mathbf{a}_1=0.\end{array} $$
(2)

The solution of Eq. 2 is the eigenvector corresponding to the second largest eigenvalue (\(\lambda_2\)), and \(\mathrm{var}(C_2) = \lambda_2\). In general, the s-th principal component is obtained from the eigenvector corresponding to the s-th largest eigenvalue of Σ, and \(\mathrm{var}(C_s) = \lambda_s\). Hence, the principal components are uncorrelated linear combinations of the original variables (with weights having a fixed sum of squares) whose variances are as large as possible.

The above description of PCA clarifies that it is a multistep procedure finding components with maximal variance. However, it may not fully highlight that PCA is a method for reducing data complexity, passing from a large number of original variables to a few principal components. In this respect, PCA can be viewed as a method for finding sequentially "new" variables, i.e., the principal components, that best recover the data, i.e., that miss as little relevant information as possible. This amounts to looking for the principal component that, at each step, best "explains" (or "accounts for") the variance in the data. Every principal component explains an amount of variance of the original data equal to its own variance; for instance, \(\lambda_1\) is the amount of variance explained by \(C_1\).

For practical purposes, the derivation of principal components can be carried out by performing the spectral decomposition of Σ:
$$ \Sigma =\mathbf{A}\Lambda {\mathbf{A}}^{\mathbf{\prime}}, $$
(3)

where A is the orthonormal matrix containing the eigenvectors of Σ in its columns (the s-th column contains the eigenvector corresponding to the s-th largest eigenvalue) and Λ is a diagonal matrix with diagonal elements equal to the eigenvalues of Σ in decreasing order.
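
A minimal sketch of this spectral-decomposition route, assuming the data are stored row-wise in a NumPy array X (the helper name pca_spectral is ours, not a library function):

```python
# Sketch of the spectral-decomposition route of Eq. 3.
import numpy as np

def pca_spectral(X, t):
    """Return the first t eigenvalues and loading vectors of the covariance of X."""
    Sigma = np.cov(X, rowvar=False)           # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # ascending order
    order = np.argsort(eigvals)[::-1]         # sort eigenvalues decreasingly
    Lambda = eigvals[order][:t]               # variances of the components
    A = eigvecs[:, order][:, :t]              # loadings (columns a_1, ..., a_t)
    return Lambda, A

# toy usage with random data
X = np.random.default_rng(1).normal(size=(100, 5))
lam, A = pca_spectral(X, t=2)
print(lam)        # var(C_1), var(C_2)
print(A.T @ A)    # approximately the identity: loading vectors are orthonormal
```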

If one is interested in the first t principal components, with a total explained variance equal to \( \sum_{s=1}^t{\lambda}_s \), it is sufficient to consider the first t columns of A and to construct the components as \( {C}_s=\mathbf{a}_s'\mathbf{x},\ s=1,\dots, t \). The elements of \(\mathbf{a}_s\) are the so-called (principal) component loadings and express the importance of the original variables for every component. For instance, \(a_{sj}\) gives the importance of variable j for component s and is strongly related to their covariance; in particular,
$$ \mathrm{cov}\left({X}_j,{C}_s\right)={\lambda}_s{a}_{sj}. $$
(4)
A loading that is high in absolute value means that the variable remarkably affects the component, and its sign specifies the type of relationship (a positive loading denotes a direct relationship between the component and the variable). Similarly, the correlation between component s and variable j is equal to
$$ \mathrm{corr}\left({X}_j,{C}_s\right)=\frac{a_{sj}\,{\lambda_s}^{1/2}}{\sigma_j}, $$
(5)

where \( {\sigma}_j \) is the standard deviation of variable j. For further details on the derivation of Eqs. 4 and 5, see Mardia et al. (1979). Except for situations where the standard deviations of the variables vary noticeably, the loadings are usually used for interpreting the extracted components. In this respect, it is worth mentioning that the sign of a principal component is fully arbitrary: if we reverse the signs of all the loadings of a principal component, the variance of the component remains the same; one must only reverse the corresponding interpretation.

If n > p, we can derive up to p principal components (t ≤ p). However, if the rank of Σ is r < p, only the first r principal components have positive variances: var(C_1) ≥ … ≥ var(C_r) > var(C_{r+1}) = … = var(C_p) = 0.

Once the principal components are derived, the principal component scores (or simply component scores) of the objects on these new variables can be computed as \( {\mathbf{c}}_s=\mathbf{a}_s'\left(\mathbf{x}-\overline{\mathbf{x}}\right) \), s = 1, …, t, where \( \overline{\mathbf{x}} \) is the mean vector of X_1, …, X_p. Therefore, apart from the mean vector, the component scores are an orthonormal linear transformation of x.
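
The following self-contained sketch (with randomly generated, purely illustrative data) computes component scores in this way and checks Eqs. 4 and 5 empirically:

```python
# Component scores and the empirical counterparts of Eqs. 4 and 5.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))   # correlated variables

Sigma = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
lam, A = eigvals[order], eigvecs[:, order]                # eigenvalues, loadings

Xc = X - X.mean(axis=0)                                   # centre the variables
C = Xc @ A                                                # component scores

# Eq. 4: cov(X_j, C_s) = lambda_s * a_sj  (checked here for j = 0, s = 0)
print(np.cov(Xc[:, 0], C[:, 0])[0, 1], lam[0] * A[0, 0])

# Eq. 5: corr(X_j, C_s) = a_sj * sqrt(lambda_s) / sigma_j
print(np.corrcoef(Xc[:, 0], C[:, 0])[0, 1],
      A[0, 0] * np.sqrt(lam[0]) / Xc[:, 0].std(ddof=1))
```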

Geometric Interpretation

The component scores satisfy many interesting properties (see, e.g., Jolliffe 2002). Here, we mention only one property connected with the geometric interpretation of PCA. Let (x_1, …, x_n) be the data points, which can be seen as a cloud of points in the reference space ℜ^p. The orthonormal matrix A_t = [a_1 ⋯ a_t], containing the first t eigenvectors of Σ in its columns, provides the basis of the t-dimensional subspace such that the sum of squared perpendicular distances of the data points x_i (i = 1, …, n) from the subspace is minimized. The principal components thus determine a low-dimensional space accounting for as much as possible of the total variation in the data. In particular, the first component recovers the best-fitting line. The second component gives the best-fitting line orthogonal to the first one and, together with the first line, forms a plane. If we use t components, a collection of t best-fitting lines, orthogonal to each other, is found and the best-fitting hyperplane is built. In this setting, the component scores represent the orthogonal projections of (x_1, …, x_n) onto this t-dimensional subspace.

Covariance and Correlation Matrices

The above description of PCA is based on the analysis of the covariances of X_1, …, X_p. PCA can also be applied to the correlation matrix R; in other words, this amounts to considering the standardized variables X_1, …, X_p when deriving the principal components. The choice between standardized and raw variables is crucial because the eigenvalues and eigenvectors of the correlation and covariance matrices differ, and no simple relation linking the two solutions is available. Thus, the principal components based on Σ and R are not the same.

When the variables are expressed in different units of measurement, the correlation matrix should be used; otherwise, artificial differences among the variables will affect the PCA results. When the variables share the same unit of measurement, the researcher should assess whether standardizing them is a sensible choice. Specifically, the principal components extracted from Σ will depend mostly on the variables with the highest variances. This makes sense if the researcher believes that a larger (smaller) variance implies a greater (lesser) importance of the variable involved. If this is not the case, the use of R is recommended, and all the variables will be treated on an equal footing.
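
A small illustration of this point, on hypothetical data in which one variable is simply expressed in a different unit:

```python
# Covariance-based versus correlation-based PCA: rescaling one variable
# changes the former but not the latter. Data are illustrative.
import numpy as np

rng = np.random.default_rng(3)
corr = np.array([[1.0, 0.6, 0.6],
                 [0.6, 1.0, 0.6],
                 [0.6, 0.6, 1.0]])
X = rng.normal(size=(300, 3)) @ np.linalg.cholesky(corr).T
X[:, 0] *= 100.0                         # variable 1 measured in a different unit

def first_loading(M):
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, -1]                   # eigenvector of the largest eigenvalue

print(first_loading(np.cov(X, rowvar=False)))       # dominated by variable 1
print(first_loading(np.corrcoef(X, rowvar=False)))  # all variables weigh roughly equally
```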

Choice of the Number of Components

The number of components t must be chosen so that the original data are well synthesized, retaining as much of the relevant information as possible. If t = 1, a drastic reduction of the data size is achieved, but in general much of the data variation is lost. Obviously, the opposite holds when t = p.

In the literature there exist several proposals for the optimal choice of t. Here, we limit our attention to some basic procedures for selecting t. The underlying idea of such procedures is to keep the most important components in terms of the sum of their variances. Taking into account the connection between the variances of the components and the eigenvalues of either Σ or R, these procedures are based on the eigenvalues λ_s.

A straightforward proposal, known as Kaiser's rule, suggests retaining only the components with λ_s > 1 (when PCA is applied to the correlation matrix). Another tool is the so-called scree plot, which plots the variances λ_s (Y-axis) against the component number s (X-axis). If we connect the plotted points, the resulting curve drops moving to the right. The optimal value of t is found in correspondence with an elbow, i.e., a bend related to a high ratio of differences in consecutive eigenvalues.

A fruitful criterion for choosing t is based on the cumulative percentage of explained variance. An index of goodness of fit for the first t components can be computed as
$$ {F}_t=\frac{\sum_{s=1}^t{\lambda}_s}{\sum_{s=1}^p{\lambda}_s}100. $$
(6)

The index compares the variance explained by the first t components with the total sum of variances. A high value of F_t (close to 100%) implies that the first t components synthesize the observed data well. In general, a rule can be to choose t as the smallest integer such that F_t is higher than a prespecified percentage (e.g., 70%). However, how to fix such a cut-off is a complex issue and depends on the data under investigation. In general, t can be chosen as the smallest value such that F_t − F_{t−1} is high and F_{t+1} − F_t is small. This means that passing from t − 1 to t components leads to a remarkable increase of fit, whereas adding one more component leads to a negligible increase of the explained variance. A plot of F_t against t can be helpful for selecting the optimal value of t; this is essentially the same information offered by the scree plot. Nonetheless, it must be underlined that a degree of subjectivity in the choice of t exists: there is no rule for selecting t to be adopted slavishly. The choice of t should not be based solely on the analysis of the fit values; these help to select some interesting solutions that balance fit and parsimony well, after which a deeper insight into these solutions is needed in order to choose the optimal one. The illustrative example will help to clarify this point.
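
A sketch of these criteria on hypothetical data; the eigenvalues, Kaiser's rule, and the index F_t of Eq. 6 are computed from the correlation matrix:

```python
# Kaiser's rule and the cumulative percentage of explained variance F_t.
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(150, 6)) @ rng.normal(size=(6, 6))   # correlated variables

R = np.corrcoef(X, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]            # lambda_1 >= ... >= lambda_p

kaiser_t = int(np.sum(eigvals > 1.0))                     # Kaiser's rule
F = 100.0 * np.cumsum(eigvals) / eigvals.sum()            # F_1, ..., F_p of Eq. 6

print("Kaiser's rule retains", kaiser_t, "components")
for t, Ft in enumerate(F, start=1):
    print("t = %d: F_t = %5.1f%%" % (t, Ft))
```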

PCA and Singular Value Decomposition

So far, PCA has been derived from the spectral decomposition of the covariance or correlation matrix, using the eigenvectors associated with the largest eigenvalues. Rather than considering Σ or R, the principal components can also be extracted from the (raw) data stored in the matrix X of order (n × p). This leads to a different interpretation of PCA, which can be seen as a model of the form
$$ \mathbf{X}=\mathbf{C}{\mathbf{A}}^{\mathbf{\prime}}+\mathbf{E}, $$
(7)
where C and A are matrices of order (n × t) and (p × t), respectively, and E is the error matrix of order (n × p). The optimal matrices C and A are found by solving the following least squares problem:
$$ \min {\left\Vert \mathbf{X}-\mathbf{C}{\mathbf{A}}'\right\Vert}^2, $$
(8)
where ‖⋅‖² denotes the squared Frobenius norm, i.e., the sum of squares of the elements of the matrix at hand. It can be shown that the solution of Eq. 8 can be determined by performing the Singular Value Decomposition (SVD) of X:
$$ \mathbf{X}=\mathbf{SV}{\mathbf{D}}^{\mathbf{\prime}}, $$
(9)

where S and D are the matrices of the left and right singular vectors of X, respectively, and the corresponding singular values of X are stored in decreasing order on the main diagonal of the diagonal matrix V. The solution of Eq. 8 is then found as C = S_t V_t and A = D_t, where V_t is the diagonal matrix containing the t largest singular values and S_t and D_t are the matrices holding the first t columns of S and D, respectively. The matrices C and A are the so-called (principal) component score and loading matrices, respectively, and coincide with the PCA solution previously described. In fact, the SVD of X and the spectral decompositions of Σ and R are related. Assuming without loss of generality that the variables in X have zero mean (i.e., the matrix is columnwise centered), S contains the eigenvectors of XX′, D contains those of X′X (a matrix proportional to Σ), and the main diagonal of V holds the square roots of the eigenvalues of X′X. If X contains standardized scores, the same links hold with R in place of Σ.

The solution of Eq. 8 provides the best rank-t decomposition of X. Although the PCA solution remains the same, in this case the aim of PCA is to look for the best low-rank approximation of a given matrix; the optimal solution is the rank-t matrix \( \widehat{\mathbf{X}}=\mathbf{C}{\mathbf{A}}' \). In this respect, it is interesting to note that the optimal solution is not unique. Specifically, for any rotation (orthogonal) matrix T of order (t × t), an equally well-fitting solution is \( \tilde{\mathbf{C}}=\mathbf{CT} \) and \( \tilde{\mathbf{A}}=\mathbf{AT} \), since \( \tilde{\mathbf{C}}{\tilde{\mathbf{A}}}'=\mathbf{CT}{\mathbf{T}}'{\mathbf{A}}'=\mathbf{C}{\mathbf{A}}' \). The use of T leads to a new set of components having the same total explained variance as the previous set. Nonetheless, differently from C and A, the components in \( \tilde{\mathbf{C}} \) and \( \tilde{\mathbf{A}} \) no longer maximize the explained variance sequentially. For this reason, we may refer to them as rotated principal components, or simply components, avoiding the adjective "principal."
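
A compact sketch of the SVD route and of the rotational freedom just described, on an illustrative centered matrix X:

```python
# SVD-based PCA (Eqs. 7-9) and rotational freedom of the rank-t decomposition.
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))
X = X - X.mean(axis=0)                               # column-wise centring

t = 2
S, v, Dt = np.linalg.svd(X, full_matrices=False)     # X = S V D'
D = Dt.T
C = S[:, :t] * v[:t]                                 # component scores  C = S_t V_t
A = D[:, :t]                                         # component loadings A = D_t

X_hat = C @ A.T                                      # best rank-t approximation of X
print("fit:", np.linalg.norm(X - X_hat) ** 2)

# An orthogonal rotation T leaves the fitted matrix (and the fit) unchanged
theta = 0.3
T = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
C_rot, A_rot = C @ T, A @ T                          # rotated scores and loadings
print("same fit:", np.allclose(X_hat, C_rot @ A_rot.T))
```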

Rotation matrices have a high impact in practice. In fact, they can be used to facilitate the interpretation of the extracted components by means of simple structure rotations, such as the varimax criterion (Kaiser 1958), usually applied to the component loadings. This generally leads to loadings that are either high or low in absolute value and not in between. In this way, for every component, we can determine a subset of variables that remarkably affect the component involved. At the same time, it is desirable that only one loading per observed variable is high in absolute value, so that each variable is mainly related to exactly one component.
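
Since not every environment ships a varimax routine, the following is a minimal sketch of the classical varimax algorithm (the function name and stopping rule are ours, not a specific library implementation); it rotates a (p × t) loading matrix A:

```python
# A minimal varimax sketch (Kaiser 1958) for a p x t loading matrix A.
import numpy as np

def varimax(A, max_iter=100, tol=1e-8):
    p, t = A.shape
    T = np.eye(t)
    obj_old = 0.0
    for _ in range(max_iter):
        L = A @ T
        # SVD-based update of the rotation maximizing the varimax criterion
        U, s, Vt = np.linalg.svd(
            A.T @ (L ** 3 - L @ np.diag((L ** 2).sum(axis=0)) / p))
        T = U @ Vt
        obj = s.sum()
        if obj < obj_old * (1.0 + tol):
            break
        obj_old = obj
    return A @ T, T          # rotated loadings and the rotation matrix

# toy usage
A = np.array([[0.7, 0.5], [0.6, -0.4], [0.1, 0.8]])
A_rot, T = varimax(A)
print(np.round(A_rot, 2))
```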

PCA and Sparseness

As already noted, the interpretation of the components is carried out by looking at the largest loadings in absolute value. In other words, since every component is a linear combination of all the observed variables, one usually selects the subset of variables having the loadings farthest from zero. This implicitly consists in subjectively fixing a threshold and artificially setting the loadings with absolute values smaller than the threshold to zero. Such an approach is very intuitive, especially when the solution is varimax-rotated to simple structure, since in such cases we usually have a large number of loadings close to zero that can be ignored when interpreting the extracted components. Although this is the most common strategy in practice, it has been criticized by several authors (see, e.g., Cadima and Jolliffe 1995). In the case of a rotated solution, the components are obviously no longer ordered with respect to the explained variance, and this may represent a limitation if one is interested in ranking the components in terms of explained variance. Furthermore, it is not guaranteed that the rotated solution is as simple as required. Another drawback concerns the case in which PCA is applied to the covariance matrix Σ, i.e., the data in matrix X are not normalized. As we saw in Eq. 5, it is not just the size of the loadings but also the standard deviation of each variable that determines the importance of a variable for a component.

In order to simplify the interpretation of the PCA solution, several authors have proposed alternative strategies. The common idea is to extract components, ordered with respect to the explained variance, constrained in such a way that either some loadings are equal to zero or, more generally, the loadings have a simple structure. Hausmann (1982) proposes a branch-and-bound algorithm that produces estimated loadings taking only three values, namely +1, −1, and 0. An extension has been suggested by Vines (2000), where the loadings can take only integer values. All these proposals aim at finding components with maximum variance and with the loadings restricted to a small number of values. A more general approach has been followed by Jolliffe and Uddin (2000): the so-called SCoT method seeks the optimal components so as to maximize both the explained variance and the simplicity by solving the optimization problem
$$ \underset{\mathbf{a}_s}{\max}\ \mathrm{var}\left(\mathbf{a}_s'\mathbf{x}\right)+\psi\,\mathrm{Sim}\left(\mathbf{a}_s\right) $$
(10)
with the constraints \( \mathbf{a}_s'\mathbf{a}_s=1\ \left( s=1,\dots, t\right) \) and \( \mathbf{a}_s'\mathbf{a}_{s'}=0\ \left( s\ge 2,\ {s}^{\prime }< s\right) \). In Eq. 10, Sim(a_s) is a measure of simplicity of a_s and ψ is a nonnegative tuning parameter controlling the importance of the simplicity of component s. An extension of SCoT is SCoTLASS (Jolliffe et al. 2003), where the function in Eq. 10 is maximized under the additional constraint that
$$ \sum_{j=1}^p\left|{a}_{sj}\right|\le \zeta, $$
(11)

with ζ ≥ 0. The new constraint takes inspiration from the Lasso (Tibshirani 1996), a well-known regression method representing a compromise between variable selection and shrinkage estimation; its geometry forces some of the regression coefficients to zero. In PCA, the analogous geometry implies that some loadings are shrunk and others are set exactly to zero.

A closely related procedure involving the Lasso has been developed by Zou et al. (2006). The so-called SPCA is built on the fact that PCA can be written as a regression-type optimization problem with a quadratic penalty. In this case, sparsity of the loadings is achieved by imposing the Lasso constraint on the regression coefficients. The optimal solution of SPCA can be found by either the nonconvex algorithm proposed by Zou et al. (2006) or the more general one developed by d'Aspremont et al. (2007).
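
As one possible route in practice, scikit-learn offers a SparsePCA estimator in the spirit of these proposals (the penalty parameter alpha below is purely illustrative, and the data are randomly generated):

```python
# Sparse loadings via scikit-learn's SparsePCA estimator.
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 8)) @ rng.normal(size=(8, 8))
X = (X - X.mean(axis=0)) / X.std(axis=0)       # standardized variables

spca = SparsePCA(n_components=2, alpha=1.0, random_state=0)
scores = spca.fit_transform(X)
print(np.round(spca.components_, 2))           # many loadings driven exactly to zero
```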

Software

Several monographs on the application of PCA with the most common statistical software packages, such as Matlab, R, SAS, SPSS, and Statistica, are available (e.g., Everitt 2005; Marques de Sá 2007). In fact, all the above-mentioned packages have specific tools for applying PCA. In this respect, note that to perform PCA in SPSS, one has to select the Factor Analysis (FA) menu, implicitly suggesting that PCA is a particular option of FA. This is part of the cause of the existing ambiguity between PCA and FA; in fact, the two techniques differ substantially. A better insight into the distinctive features of PCA and FA can be found in, e.g., Jolliffe (2002).
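
For completeness, a minimal sketch with one widely used implementation, scikit-learn's PCA class, applied to standardized (hence correlation-based) hypothetical data; R users would typically reach for prcomp or princomp:

```python
# PCA with a standard library routine on standardized data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
X = rng.normal(size=(80, 5)) @ rng.normal(size=(5, 5))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize: PCA on R

pca = PCA(n_components=2).fit(Z)
print(pca.components_)                   # loadings, one row per component
print(pca.explained_variance_ratio_)     # proportion of variance explained
```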

Illustrative Example

In order to show how PCA works in practice, the flea-beetle data set (Lubischew 1962) is investigated. The data refer to physical measurements of n = 74 male flea-beetles. In particular, p = 6 measurements on each flea-beetle are considered: the width of the first joint of the first tarsus (Tarsus1), the width of the second joint of the first tarsus (Tarsus2), the maximal width of the head between the external edges of the eyes (Head), the maximal width of the aedeagus in the fore-part (AedeagusWF), the front angle of the aedeagus (AedeagusA), and the aedeagus width from the side (AedeagusWS). Furthermore, the species of each specimen (variable species, with levels Ch. concinna, Ch. heptapotamica, Ch. heikertingeri) is also reported. The aim of the original work was to discriminate the species by means of (subsets of) the measured characters, since external differences between these three species are scarcely perceptible.

PCA can be applied only to the six quantitative measurements, whereas the qualitative information about the species can be used as a supplementary variable. As will be clarified in the following, this means that the species does not play an active role in determining the components, but it will be used later to provide a better insight into the obtained results.

The units of measurement of the six variables are different; hence, PCA should be performed on the standardized variables, i.e., on the correlation matrix, which is reported in Table 1. Since several large correlations are visible, we can partially anticipate the structure of some components. In particular, a large correlation can be seen between AedeagusWF and AedeagusWS. We can expect that such a correlation structure will be captured by a component.
Table 1

Correlation matrix for the flea-beetle data

 

             Tarsus1  Tarsus2   Head  AedeagusWF  AedeagusA  AedeagusWS
Tarsus1         1.00     0.03  −0.10       −0.33       0.78       −0.57
Tarsus2         0.03     1.00   0.67        0.56      −0.12        0.49
Head           −0.10     0.67   1.00        0.59      −0.31        0.52
AedeagusWF     −0.33     0.56   0.59        1.00      −0.25        0.78
AedeagusA       0.78    −0.12  −0.31       −0.25       1.00       −0.48
AedeagusWS     −0.57     0.49   0.52        0.78      −0.48        1.00

To assess whether the species are related to some of the physical measurements, we draw box-plots of all variables for each species (Fig. 2). Inspecting Fig. 2, we can conclude that the peculiarities of the species are related to the measurements, except for Head and, to a lesser extent, Tarsus2. If we find a component mainly related to the measurements that discriminate the species well, we will be able to interpret the component accordingly, i.e., with respect to the levels of the species.
Fig. 2

Box-plots of the six variables with respect to the species

In order to choose the optimal number of components, in Fig. 3 we plot the cumulative percentage of explained variance. In this respect, the solutions with t = 2 and t = 3 should be preferred and are thus studied in detail in order to choose the best one. We discard the case t = 1 because its percentage of explained variance is very low and passing from one to two components remarkably increases it. The solution with t = 4 is also not considered optimal, since the corresponding increase of explained variance is rather limited (about 5%).
Fig. 3

Cumulative percentage of the explained variance

We start by inspecting the component loadings for t = 2 (Table 2). The first component is related to all the measurements, contrasting AedeagusA and Tarsus1 with the remaining ones. High component scores pertain to flea-beetles with high values for the variables with positive loadings and low values for the ones with negative loadings. The second component measures the overall size of the flea-beetles in terms of Tarsus1, Tarsus2, AedeagusA, and, to a lesser extent, Head. This solution is hard to interpret, since several variables are related to both components. For this purpose, we check whether the varimax rotation leads to a more interpretable solution. The obtained loadings are given in Table 3. Apart from AedeagusWS (whose loadings are slightly higher than 0.30 in absolute value for both components), the remaining variables remarkably affect only one component. Thus, every component seems to capture different peculiarities of the flea-beetles according to different subsets of measurements. Although this solution is easier to interpret, it is not fruitful for discriminating the three species: if we take a look at the component scores (not reported here), the peculiarities of the species are not easily distinguished. We thus discard this solution and move on to inspect the one with t = 3.
Table 2

Principal component loadings matrix with t = 2 (loadings higher than 0.30 in absolute value are in bold)

Measurement   Component 1   Component 2
Tarsus1              0.33          0.60
Tarsus2              0.36          0.48
Head                 0.41          0.34
AedeagusWF           0.46          0.19
AedeagusA            0.35          0.50
AedeagusWS           0.50         −0.05

Table 3

Component loadings matrix with t = 2 after Varimax rotation (loadings higher than 0.30 in absolute value are in bold)

Measurement   Component 1   Component 2
Tarsus1              0.07          0.69
Tarsus2              0.57          0.19
Head                 0.53          0.05
AedeagusWF           0.49         −0.11
AedeagusA            0.00          0.61
AedeagusWS           0.38          0.32

Since the principal component solutions are nested, when t = 3 an additional component is extracted with respect to those reported in Table 2. This third component contrasts the Aedeagus variables with Head and, to a lesser extent, the Tarsus variables. In general, this solution is also hard to interpret, since the components depend on several variables and the species are not well distinguished by the extracted components, as was the case for t = 2. For all of these reasons, we checked the varimax-rotated solution. Applying the varimax rotation yields the loadings of Table 4 (varimax solutions for different values of t are no longer nested). By inspecting Table 4, we can conclude that every variable remarkably influences only one component. In particular, Component 1 is related to AedeagusWF and AedeagusWS (the strong connection between the two variables was already observed by inspecting the correlation matrix of Table 1), and Component 2 depends on AedeagusA and Tarsus1. Finally, Component 3 is highly connected with Head and Tarsus2.
Table 4

Component loadings matrix with t = 3 after Varimax rotation (loadings higher than 0.30 in absolute value are in bold)

Measurement   Component 1   Component 2   Component 3
Tarsus1             −0.24          0.61          0.23
Tarsus2              0.10          0.11          0.64
Head                −0.04         −0.10          0.71
AedeagusWF           0.71          0.12          0.06
AedeagusA            0.17          0.76         −0.19
AedeagusWS           0.63         −0.13         −0.00

To further investigate the extracted components, we consider the supplementary variable species, studying whether such components discriminate the species of flea-beetles. An interesting feature of this solution is that it distinguishes the species well. Specifically, low scores for Component 1 pertain to the species Ch. heikertingeri; hence, the specimens of Ch. heikertingeri have low values for AedeagusWF and AedeagusWS. Therefore, this component can be interpreted as "Ch. heikertingeri," in the sense that it contrasts such a species with the other ones. In a similar way, we can interpret Component 2 as "Ch. heptapotamica," because the flea-beetles belonging to Ch. heptapotamica are characterized by the lowest component scores (due to low values of Tarsus1 and AedeagusA). This allows us to observe that the specimens of Ch. concinna have high scores for both Components 1 and 2. Finally, no relationship between the species and Component 3 is easily visible. The above conclusions are drawn by studying the component scores matrix; this can be done by plotting the scores. A useful way to do so is the so-called biplot (Gabriel 1971), which has the additional advantage of displaying the loading information in order to characterize the components in terms of the observed variables. In Fig. 4 we report the biplot of Components 1 and 2, from which we can appreciate that the three species are well discriminated by these components, and we can easily assess the existing relationships between observed variables and extracted components. Note that in Fig. 4 the flea-beetles are identified by the numbers "1," "2," and "3," corresponding to the species Ch. concinna, Ch. heptapotamica, and Ch. heikertingeri, respectively. Taking into account the goodness of fit, the interpretability of the solution, and its capability to discriminate the species, the solution described above is considered optimal.
Fig. 4

Biplot of Components 1 and 2 from the varimax solution with t = 3 (1 = Ch. concinna, 2 = Ch. heptapotamica, 3 = Ch. heikertingeri)
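
For readers who wish to reproduce this kind of analysis, the following sketch outlines the workflow (standardization, extraction of t = 3 components, varimax rotation, simplified biplot). The file and column names are hypothetical, the plot is not the one in Fig. 4, and the varimax helper is the one sketched in the rotation section above:

```python
# Workflow sketch: standardized PCA, varimax rotation, simplified biplot.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("flea_beetles.csv")                  # hypothetical file name
meas = ["Tarsus1", "Tarsus2", "Head",
        "AedeagusWF", "AedeagusA", "AedeagusWS"]
Z = (df[meas] - df[meas].mean()) / df[meas].std()     # standardized variables

eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z.values, rowvar=False))
A = eigvecs[:, np.argsort(eigvals)[::-1]][:, :3]      # loadings of the first 3 components

A_rot, T = varimax(A)                                 # varimax helper sketched earlier
C = Z.values @ A_rot                                  # rotated component scores

fig, ax = plt.subplots()
for code, species in enumerate(df["species"].unique(), start=1):
    mask = (df["species"] == species).values
    ax.scatter(C[mask, 0], C[mask, 1], marker=f"${code}$", label=species)
for j, name in enumerate(meas):                       # loading arrows (scaled for visibility)
    ax.arrow(0.0, 0.0, 2 * A_rot[j, 0], 2 * A_rot[j, 1], head_width=0.05)
    ax.annotate(name, (2 * A_rot[j, 0], 2 * A_rot[j, 1]))
ax.set_xlabel("Component 1")
ax.set_ylabel("Component 2")
ax.legend()
plt.show()
```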

Key Applications

The applicability of PCA is witnessed by the large number of papers available in the literature. PCA is a widely used tool for synthesizing the available data when the number of (quantitative) variables is large. In recent years, this situation has become extremely common in almost all areas, since the production of data is increasing dramatically. In this respect, PCA is very helpful for extracting relevant information from massive data sets.

Future Directions

Current research on PCA focuses on extending the method to more complex situations, e.g., nonlinear interactions among variables, qualitative data, multiway analysis, etc. Here, we briefly illustrate the multiway extension of PCA, limiting our attention to the so-called three-way case.

The need for three-way analysis arises when variables are observed on the same objects on several occasions, not necessarily related to time. Thus, the information can be characterized in terms of three modes, namely objects, variables, and occasions. This situation can be extended to higher-way data, hence leading to multiway data. In the case of three-way data, the available information is a three-way array, denoted by X, to distinguish it from a standard (two-way) matrix X. An array can be seen as a collection of (object × variable) matrices, where each matrix pertains to a given occasion. In principle, PCA can still be applied to a three-way array, for instance by rearranging the elements of the array into a matrix. One way to do this is the so-called matricization of the array, which consists of placing all the (object × variable) matrices of the array next to each other. Denoting by n, p, and m the numbers of objects, variables, and occasions, respectively, the resulting matrix has n rows and pm columns. Its rows refer to the objects, while its columns refer to "new variables" given by all possible combinations of variables and occasions, with the variables nested within the occasions (see, e.g., Kiers 2000). A PCA of such a huge matrix has two disadvantages: the component loadings matrix is hard to read, and the interpretation of the components is rather difficult, since the components will depend on the variables in some specific occasions. A more relevant problem, however, is that the triple interactions among the three modes are entirely missed, since one does not properly take into account that the new variables are combinations of two modes. This leads to erroneous or incomplete results.
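
A tiny sketch of the matricization step just described, using a toy NumPy array:

```python
# Unfold a three-way array of n objects x p variables x m occasions into an
# n x (p*m) matrix with the variables nested within the occasions.
import numpy as np

n, p, m = 4, 3, 2
X3 = np.arange(n * p * m).reshape(n, p, m)       # toy three-way array

# column block k holds the (object x variable) slice of occasion k
X_wide = np.concatenate([X3[:, :, k] for k in range(m)], axis=1)
print(X_wide.shape)                              # (4, 6): n rows, p*m columns
```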

An alternative approach for performing standard PCA on three-way data is to first aggregate the array over one mode (for instance, averaging the data across the occasions) and then perform PCA. Unfortunately, once again, the triple interactions among the data cannot be recovered by the extracted components resulting from PCA.

In the literature, the two most common proposals to extend PCA to three-way data are the Tucker3 and Candecomp/Parafac models (for an overview on these and other methods, see Kroonenberg (2007)). Bearing in mind the PCA formulation in Eq. 7, the main idea underlying both the models is to consider not only the matrices C and A for summarizing the objects and the variables, respectively, but also a third matrix for summarizing the occasions. In Candecomp/Parafac the same number of components for every mode is sought. The Tucker3 model generalizes Candecomp/Parafac admitting different numbers of components for every mode and introducing the so-called core array to study the triple interactions among the components of every mode. This implies that Candecomp/Parafac can be seen as a constrained version of Tucker3. Such a constraint guarantees that, under mild conditions, the Candecomp/Parafac solution is unique, whereas Tucker3 admits rotations.

Cross-References

References

  1. Cadima J, Jolliffe I (1995) Loadings and correlations in the interpretation of principal components. J Appl Stat 22:203–214
  2. d'Aspremont A, El Ghaoui L, Jordan MI, Lanckriet GRG (2007) A direct formulation for sparse PCA using semidefinite programming. SIAM Rev 49:434–448
  3. Eckart C, Young G (1936) The approximation of one matrix by another of lower rank. Psychometrika 1:211–218
  4. Everitt BS (2005) An R and S-plus companion to multivariate analysis. Springer, London
  5. Gabriel KR (1971) The biplot graphical display of matrices with applications to principal component analysis. Biometrika 58:453–467
  6. Hausmann R (1982) Constrained multivariate analysis. In: Zanckis SH, Rustagi JS (eds) Optimisation in statistics. North-Holland, Amsterdam, pp 137–151
  7. Hotelling H (1933) Analysis of a complex of statistical variables into principal components. J Educ Psychol 24:417–441
  8. Jolliffe IT (2002) Principal component analysis. Springer, New York
  9. Jolliffe IT, Uddin M (2000) The simplified component technique: an alternative to rotated principal components. J Comput Graph Stat 9:689–710
  10. Jolliffe IT, Trendafilov NT, Uddin M (2003) A modified principal component technique based on the LASSO. J Comput Graph Stat 12:531–547
  11. Kaiser HF (1958) The varimax criterion for analytic rotation in factor analysis. Psychometrika 23:187–200
  12. Kiers HAL (2000) Towards a standardized notation and terminology in multiway analysis. J Chemom 14:105–122
  13. Kroonenberg PM (2007) Applied multi-way data analysis. Wiley, Hoboken
  14. Lubischew AA (1962) On the use of discriminant functions in taxonomy. Biometrics 18:455–477
  15. Mardia KV, Kent JT, Bibby JM (1979) Multivariate analysis. Academic Press, London
  16. Marques de Sá JP (2007) Applied statistics using SPSS, STATISTICA, MATLAB and R. Springer, Berlin/Heidelberg
  17. Pearson K (1901) On lines and planes of closest fit to systems of points in space. Philos Mag 2:559–572
  18. Preisendorfer RW, Mobley CD (1988) Principal component analysis in meteorology and oceanography. Elsevier, Amsterdam
  19. Tibshirani R (1996) Regression shrinkage and selection via the lasso. J Roy Stat Soc B 58:267–288
  20. Vines SK (2000) Simple principal components. Appl Stat 49:441–451
  21. Zou H, Hastie T, Tibshirani R (2006) Sparse principal component analysis. J Comput Graph Stat 15:262–286

Copyright information

© Springer Science+Business Media LLC 2016

Authors and Affiliations

  1. Department of Statistical Sciences, Sapienza University of Rome, Rome, Italy

Section editors and affiliations

  • Suheil Khoury, American University of Sharjah, Sharjah, United Arab Emirates