
1 Introduction

Massive amounts of data are generated daily, for example by city-wide installations of high-speed cameras for public safety. These data are mostly incomplete due to sensor failures or environmental obstructions during recording. This poses a great challenge to video surveillance algorithms, which must process such missing, noisy and high-dimensional datasets.

Many manifold learning methods have been proposed for dimensionality reduction (DR). These methods can broadly be divided into local structure and global structure learning. Among local structure learning methods, Locality Sensitive Discriminant Analysis (LSDA) [1] finds a projection that maximizes the margin between data points from different classes in each local area, Neighborhood Preserving Embedding (NPE) [2] is an unsupervised linear dimensionality reduction technique that solves the out-of-sample problem of Locally Linear Embedding (LLE), and Locality Preserving Projections (LPP) [3] finds a linear embedding that preserves local structure information. Among global structure learning methods, Linear Discriminant Analysis (LDA) [4] captures the global geometric structure of the data by maximizing the between-class distance while minimizing the within-class distance, and Isomap [5] estimates the geodesic distances between samples and uses multidimensional scaling to induce a low-dimensional manifold.

Among the global DR methods, PCA is the most popular, simplest and most efficient technique [6]. However, it is sensitive to outliers and noisy data points [7,8,9,10]. Thus, several adaptations of PCA have been developed over the past few years to improve its performance. For example, graph-Laplacian PCA (gLPCA) [11] learns a low-dimensional representation of data that incorporates graph structures. Optimal mean robust principal component analysis (RPCA-OM) [12] removes the mean of a given dataset automatically by integrating it into the dimensionality reduction objective function. Abeo et al. [13] extended the least-squares idea of PCA to consider both the data distribution and penalty weights when dealing with outliers. Yang et al. [14] estimated corrupt instances by making full use of both intra-view and inter-view correlations between samples, considering samples in the same view to be linearly correlated. Li et al. [15] proposed ordinal preserving projection (OPP) for learning to rank, using two matrices that work in the row and column directions respectively to leverage the global structure of the dataset and the ordinal information of the observations. Most existing adaptations of standard PCA learn how to select suitable features rather than suitable samples; because of this, corrupt instances have not been handled efficiently in the past. Research in DR is still being vigorously pursued, either to improve the performance of existing techniques or to develop new ones.

We propose a novel framework, incomplete-data oriented dimension reduction via instance factoring PCA (IFPCA), to address the sensitivity of PCA to corrupt instances. In this framework, a scaling-factor that imposes a penalty on the instance space is introduced to suppress the impact of outliers or corrupt instances in pursuing projections. Two strategies, a total distance metric and a cosine similarity metric, are used geometrically to iteratively learn the relationship between each instance and the principal projection in the feature space. Through this, the two strategies are able to distinguish between authentic and corrupt instances, so low-rank projections are achieved through enhanced discrimination between relevant and noisy instances. The main contributions of this paper are summarized as follows:

  1.

    We propose a novel framework by introducing a scaling-factor into the traditional PCA model to impose a penalty on the instance space in pursuing projections. The goal here is to significantly suppress the impact of outliers.

  2.

    We further propose two scaling-factor strategies: total distance and cosine similarity metrics. These metrics iteratively evaluate the importance of each instance by learning the relationship between each instance and the principal projection in the feature space.

  3.

    Finally, with the iterative discrimination ability, IFPCA can obtain better low-rank projections in incomplete datasets.

Extensive experiments on the COIL-20, ORL and USPS datasets demonstrate the superiority of our method over state-of-the-art methods. The rest of the paper is organized as follows: Sect. 2 presents the formulation of the proposed framework; experiments, results and complexity analysis are presented in Sect. 3; and conclusions and recommendations are given in Sect. 4.

2 The Proposed IFPCA Method

To illustrate our idea, we start by observing the objective function of PCA:

$$\begin{aligned} \begin{aligned} \min _{\mathbf {w}}\sum _{i=1}^{n}\left\| \mathbf {x}_{i}-\mathbf {w}\mathbf {w}^{T}\mathbf {x}_{i}\right\| _{2}^{2} \quad s.t.\, \mathbf {w}^{T}\mathbf {w}=I \end{aligned} \end{aligned}$$
(1)

where \(\{\mathbf {w}_{j}\}_{j=1}^{d}\) is a set of orthogonal projection vectors in \(\mathfrak {R}^{m}\) and \(\{\mathbf {x}_{i}\}_{i=1}^{n}\) are zero-mean m-dimensional data points. It can be seen that PCA uses a least squares framework to minimize the sum of distances between the original dataset X and the reconstructed dataset \({w}{w}^{T}{X}\). Geometrically, this forces the projection vector w to pass through the densest region of data points in order to minimize that sum of distances. This can be seen in Fig. 1, where \({u}_{1}\) is the first principal projection vector. From this intuition, we evaluate the importance of each instance by considering its relationship with the principal projection \(u_{1}\): the closer an instance is to the projection vector \(u_{1}\), the more important it is in pursuing projections.
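For concreteness, the following is a minimal sketch (our own illustration, not code from the paper) of how the first principal projection vector \(u_{1}\) of Fig. 1 can be computed from zero-mean data, assuming the instances are stored as the columns of X:

import numpy as np

def principal_projections(X, d=1):
    """Top-d orthogonal projection vectors of zero-mean X (m features x n instances)."""
    eigvals, eigvecs = np.linalg.eigh(X @ X.T)       # eigen-decomposition of the scatter matrix
    order = np.argsort(eigvals)[::-1]                # largest eigenvalues first
    return eigvecs[:, order[:d]]

X = np.random.randn(5, 100)
X = X - X.mean(axis=1, keepdims=True)                # zero-mean, as formula (1) assumes
u1 = principal_projections(X)[:, 0]                  # first principal projection vector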

Therefore, we extend formula (1) to include a scaling-factor. This factor imposes a penalty on the instance space to suppress the impact of noise in incomplete datasets. The following is our proposed objective function:

$$\begin{aligned} \begin{aligned} \max _{{p}}\, ({D}{p})^{T}{X}^{T}{X}({D}{p}) \quad s.t.\, ({D}{p})^{T}({D}{p})=I \end{aligned} \end{aligned}$$
(2)

where p is a projection vector in the sample space and \(D=diag\left( {d}_{1}, {d}_{2},\cdots ,{d}_{n}\right) \) is a diagonal matrix that evaluates the importance of each instance in X. With this penalty, we are actually pursuing a projection \(Z=Dp\) with \(Z^{T}Z= I\) that takes the effect of individual instances into account. For example, if a lower scaling-factor \(d_{i}\) is assigned, the corresponding component \(Z_{i}\) of the sample-space projection is suppressed, which means the corresponding sample \(x_{i}\) contributes little to the projection Z.
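A toy illustration of this suppression effect (a constructed example under the same column-wise data layout, not one of our experiments): when a corrupt instance keeps a unit scaling factor it drags the first projection towards itself, whereas a near-zero \(d_{i}\) leaves the projection aligned with the clean data.

import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1.0, 200),              # clean data spread along the first axis
               rng.normal(0, 0.1, 200)])
X[:, 0] = [0.0, 25.0]                                # one corrupt instance far off that axis
X = X - X.mean(axis=1, keepdims=True)

def first_projection(Xw):
    U, _, _ = np.linalg.svd(Xw, full_matrices=False)
    return U[:, 0]

d = np.ones(200)
print(first_projection(X * d))                       # pulled towards the corrupt instance
d[0] = 1e-3                                          # suppress it with a small scaling factor
print(first_projection(X * d))                       # close to the clean direction, about [1, 0] up to sign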

To enforce the constraint in formula (2), we introduce a Lagrange multiplier \(\lambda \), take the partial derivative w.r.t. p and set it to zero to obtain:

$$\begin{aligned} \begin{aligned} {X}^{T}{X}{D}{p}= \lambda {D}{p} \end{aligned} \end{aligned}$$
(3)

It can be observed that Eq. (3) is a standard eigenvalue problem.

Mathematically, there is a direct relationship [16] between PCA and singular value decomposition (SVD) when the PCA components are calculated from the covariance matrix. The SVD of X is as follows:

$$\begin{aligned} \begin{aligned} {X}={u}\varSigma v^{T} \quad s.t.\, {u}^{T}{u} = I_{r}, v^{T}v=I_{r} \end{aligned} \end{aligned}$$
(4)

In our proposed model, \(v=DP\), where P is the matrix whose columns are the r projection vectors p. Thus, the projection u in the feature space can be obtained as follows:

$$\begin{aligned} \begin{aligned} {u}={X}{D}{P}\varSigma ^{-1} \end{aligned} \end{aligned}$$
(5)

where the low-dimensional feature-space projection u is obtained with an injection of sample factors, unlike in traditional PCA. In this way, IFPCA can learn a low-dimensional subspace from both the sample and feature spaces of a dataset for improved performance.
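As an illustration, the sketch below follows Eqs. (3)-(5) literally for a given scaling matrix D (this is our reading of the formulas, not the authors' released code): the eigenproblem (3) is solved for p, and the feature-space projection is then recovered through Eq. (5).

import numpy as np

def feature_projection(X, d, r):
    """X: m x n zero-mean data (instances in columns); d: length-n scaling factors; r: target dim."""
    D = np.diag(d)
    # Eq. (3): X^T X D p = lambda D p.  Substituting z = D p gives the standard
    # eigenproblem X^T X z = lambda z, from which p = D^{-1} z.
    eigvals, Z = np.linalg.eigh(X.T @ X)
    order = np.argsort(eigvals)[::-1][:r]
    lam, Z = eigvals[order], Z[:, order]
    P = np.linalg.solve(D, Z)                        # columns are the r projections p
    Sigma = np.diag(np.sqrt(np.maximum(lam, 1e-12))) # corresponding singular values of X
    return X @ D @ P @ np.linalg.inv(Sigma)          # Eq. (5): u = X D P Sigma^{-1}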

Fig. 1. Illustration of an instance's relationship with the principal projection

2.1 Strategies for Building the Matrix D

In this subsection, we model the relationship between the scaling-factor D and the principal projection \({u}_{1}\) using two strategies: a total distance metric and a cosine similarity metric. Both can be obtained geometrically, as shown in Fig. 1.

The first strategy uses the total distance metric to iteratively learn the relationship between each instance and the principal projection \({u_{1}}\). The total distance of an instance is defined as the sum of squared differences between its coordinate along the projection \({u_{1}}\) and the coordinates of all other instances in the training set along \({u_{1}}\). The total distance of an instance is a natural way to evaluate its importance within the set. From Fig. 1, we can observe that the total distance of instance \({x}_{i}\), which lies outside the cluster, will be relatively larger than that of instance \({x}_{j}\) within the cluster. Therefore, instance \({x}_{i}\) is more likely to be an outlier or corrupt instance than \({x}_{j}\). From Fig. 1, the coordinate of instance \(x_{i}\) along the projection \(u_{1}\) is computed as:

$$\begin{aligned} \begin{aligned} {s}_{i}={u}_{1}^{T}{x}_{i} \end{aligned} \end{aligned}$$
(6)

We then compute \({d}_{i}\) through the total distance metric as follows:

$$\begin{aligned} \begin{aligned} {d}_{i}=\sum _{j=1}^{n}({s}_{i}-{s}_{j})^2 \end{aligned} \end{aligned}$$
(7)

Thus, the larger \({d}_{i}\) is, the more likely \({x}_{i}\) is a noisy or corrupt instance, and hence its relevance is scaled accordingly to suppress its effect on the projection.
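A compact sketch of this metric, under the same assumption that instances are the columns of X and \(u_{1}\) is unit-norm:

import numpy as np

def total_distance_factors(X, u1):
    """d_i of Eq. (7) for every instance, given the current principal projection u1."""
    s = u1 @ X                                       # Eq. (6): s_i = u1^T x_i
    return np.sum((s[:, None] - s[None, :]) ** 2, axis=1)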

The second strategy uses a cosine similarity metric to build the scaling-factor D. It also iteratively learns the angular relationship between each instance in the training set and the principal projection \({u}_{1}\). Thus, by normalizing formula (6), the cosine of the angle between each instance and the principal projection \({u}_{1}\) is defined as follows:

$$\begin{aligned} \begin{aligned} \cos \theta _{i}=\frac{{u}_{1}^{T}{x}_{i}}{\left\| {u}_{1}\right\| \left\| {x}_{i}\right\| } \end{aligned} \end{aligned}$$
(8)

From formula (8), a larger \(\cos \theta _{i}\) implies a smaller angle \(\theta _{i}\) between instance \({x}_{i}\) and the principal projection \({u}_{1}\), and vice versa. This relationship is illustrated in Fig. 1, where the angle \(\phi \) of instance \({x}_{j}\) is relatively smaller than the angle \(\theta \) of instance \({x}_{i}\). Thus, \({x}_{j}\) is considered more important in finding the best projections than \({x}_{i}\), which might be noisy. Recalling that \({d}_{i}\) acts as a penalizing factor, we compute \({d}_{i}\) through the cosine similarity metric as follows:

$$\begin{aligned} \begin{aligned} {d}_{i}=\frac{1}{abs(\cos \theta _{i})+\epsilon } \end{aligned} \end{aligned}$$
(9)

where \(\epsilon = 0.0001\) is a small parameter that prevents \({d}_{i}\) from approaching infinity.
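A compact sketch of this metric under the same assumptions (instances in the columns of X, unit-norm \(u_{1}\)); the small constant added to the instance norms is our own safeguard against zero-norm columns:

import numpy as np

def cosine_factors(X, u1, eps=1e-4):
    """d_i of Eq. (9) for every instance, given the current principal projection u1."""
    s = u1 @ X                                       # Eq. (6)
    cos = s / (np.linalg.norm(X, axis=0) + eps)      # Eq. (8) with ||u1|| = 1
    return 1.0 / (np.abs(cos) + eps)                 # Eq. (9)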

By iteratively scaling the data using these two strategies, the effect of corrupt instances in the training set is considerably minimized, leading to better low-rank projections. The procedure of the proposed IFPCA method is summarized in Algorithm 1.
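Since Algorithm 1 is reproduced only as a figure, the sketch below gives our reading of the overall loop; the weighted-SVD step and the conversion of the metric values \(d_{i}\) into importance weights (a larger \(d_{i}\) marks a more suspicious instance, so its reciprocal is used as the weight) are assumptions made for illustration.

import numpy as np

def ifpca(X, r, metric="cosine", n_iter=10, eps=1e-4):
    """X: m x n zero-mean data (instances in columns); r: target dimension."""
    m, n = X.shape
    w = np.ones(n)                                   # importance weights, the diagonal of D
    for _ in range(n_iter):
        U, _, _ = np.linalg.svd(X * w, full_matrices=False)
        u1 = U[:, 0]                                 # current principal projection
        s = u1 @ X                                   # Eq. (6)
        if metric == "distance":
            d = np.sum((s[:, None] - s[None, :]) ** 2, axis=1)    # Eq. (7)
        else:
            cos = s / (np.linalg.norm(X, axis=0) + eps)           # Eq. (8)
            d = 1.0 / (np.abs(cos) + eps)                         # Eq. (9)
        w = 1.0 / (d + eps)                          # suppress instances with large d_i (our reading)
        w = w / w.max()                              # keep weights in a comparable range (assumption)
    U, _, _ = np.linalg.svd(X * w, full_matrices=False)
    return U[:, :r]                                  # low-dimensional feature-space projection u

A low-dimensional representation of a sample x can then be obtained as \(u^{T}x\).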


3 Experiments and Complexity Analysis

To demonstrate the effectiveness of the proposed IFPCA algorithm, we conduct experiments on the COIL-20, ORL and USPS datasets using the proposed IFPCA and six state-of-the-art DR methods: LSDA [1], gLPCA [11], RPCA-OM [12], PCA [17], LPP [3] and RCDA [18].

3.1 Parameter Settings

For each dataset, we randomly sampled \(60\%\) and \(70\%\) of the instances for training and used the remaining 40 and \(30\%\), respectively, for testing. The parameters of LSDA [1], gLPCA [11], RPCA-OM [12], PCA [17], LPP [3] and RCDA [18] were set according to their original papers. We set the nearest-neighbor parameter K to 5 in IFPCA and in all other relevant comparative methods to ensure a fair comparison. Classification is finally performed with a K-nearest-neighbor (KNN) classifier. Results for our framework are reported as IFPCA-1 and IFPCA-2, which use the cosine similarity and total distance metrics respectively. The experiments are repeated 15 times, and we record the average classification accuracies, corresponding optimal dimensions and standard deviations for the various methods.
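A sketch of this evaluation protocol (the random split, the repeated runs and the KNN classifier); the data loaders, the use of the ifpca() sketch above and the exact neighbour count used for classification are illustrative assumptions rather than the paper's exact settings:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def evaluate(X, y, train_ratio=0.6, r=30, k=5, n_runs=15):
    """X: instances in rows (n x m), y: labels; returns mean accuracy and std over runs."""
    accs = []
    for run in range(n_runs):
        Xtr, Xte, ytr, yte = train_test_split(X, y, train_size=train_ratio, random_state=run)
        mu = Xtr.mean(axis=0, keepdims=True)
        U = ifpca((Xtr - mu).T, r)                              # projection learned on training data
        clf = KNeighborsClassifier(n_neighbors=k).fit((Xtr - mu) @ U, ytr)
        accs.append(clf.score((Xte - mu) @ U, yte))
    return np.mean(accs), np.std(accs)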

3.2 Results Discussion and Analysis

In this section, we discuss and analyze the results obtained by each method on the three datasets used in our experiments.

Object Recognition. We validate the proposed methods on object recognition using the COIL-20 dataset. This dataset [19] contains 1440 observations of 20 objects with 72 poses each. The objects were placed on a motorized turntable against a black background and rotated through \(360^{\circ }\) to vary the object pose with respect to a fixed camera. Images of the 20 objects were taken at pose intervals of \(5^{\circ }\). The results for the various methods are shown in Table 1, with the best results in bold. From Table 1, we can see that IFPCA-1 and IFPCA-2 both perform better than all the comparative methods. In terms of optimal dimensions, IFPCA-1 and IFPCA-2 obtain notably better optimal dimensions for both samples than the other six comparative methods. For the \(60\%\) sample, IFPCA-1 outperforms IFPCA-2 by a small margin of \(0.16\%\), gLPCA by an impressive margin of \(2.69\%\), RPCA-OM by \(1.61\%\), PCA by \(3.44\%\), LPP by \(11.86\%\), LSDA by \(11.25\%\) and RCDA by \(2.19\%\). Moreover, IFPCA-1 and IFPCA-2 obtain the lowest variances in both cases, which demonstrates their consistency in performance.

Thus, the proposed methods outperform all the other comparative methods in both object recognition accuracy and optimal dimensions. This is because the two proposed methods can detect and suppress the impact of noisy data points in the training set more effectively than the other comparative methods. Figure 2 shows the trend of classification accuracy of each method against the variation of dimensions. It is evident from Fig. 2 that IFPCA-1, IFPCA-2, RPCA-OM, PCA and RCDA attain stable performance at dimensions above 10, while the performances of gLPCA, LPP and LSDA decline considerably as the dimension increases.

Table 1. Mean classification accuracies ± standard deviation (%) and (optimal dimensions) of the various methods on the COIL-20 dataset.
Fig. 2. Classification accuracies against the variation of dimensions on the COIL-20 dataset by the various methods

Face Recognition. We further validate the effectiveness of the proposed method on face recognition using the ORL face dataset. This dataset [20] has 40 subjects, each with 10 face images at different poses, making a total of 400 images of size \(112 \times 92\); the images were resized to \(32 \times 32\) for our experiments. The images were taken at different times and under varying lighting and facial expressions. The faces are in an upright, frontal position with a slight left-right rotation. The results for the various methods are shown in Table 2, with the best results in bold.

Table 2. Mean classification accuracies ± standard deviation (%) and (optimal dimensions) of the various methods on the ORL dataset.
Fig. 3. Classification accuracies against the variation of dimensions on the ORL dataset by the various methods

The results in Table 2 clearly indicate that IFPCA-1 and IFPCA-2 both perform exceptionally well compared to all the comparative methods on face recognition for both the 60 and \(70\%\) training samples. For the \(70\%\) training sample, IFPCA-1 achieves a remarkable face recognition accuracy of \(98.67\%\) and the best optimal dimension of 25, while gLPCA and LPP yield the worst dimensions of 279 each. In terms of face recognition accuracy, IFPCA-1 outperforms IFPCA-2 by a small margin of \(0.38\%\), gLPCA by \(3.06\%\), RPCA-OM by \(2.84\%\), PCA by \(4.35\%\), LPP by \(6.55\%\), LSDA by \(8.13\%\) and RCDA by \(2.20\%\). Figure 3 shows the trend of classification accuracy of each method against the variation of dimensions. It is clear from Fig. 3 that the performances of IFPCA-1, IFPCA-2, RPCA-OM, PCA and RCDA are once again stable irrespective of increases in dimension, while those of gLPCA, LPP and LSDA are considerably unstable. The results further show that IFPCA-1 and IFPCA-2 have the most consistent performances for the 60 and \(70\%\) samples respectively.

Handwritten Digit Recognition. To further demonstrate the effectiveness of our framework, we run experiments on the USPS dataset. This dataset [21] consists of handwritten digits from 0 to 9. The training and testing sets consist of 7291 and 2007 examples respectively. Each example has 256 attributes (pixels) that describe the digit. The results for the various methods are shown in Table 3, with the best results in bold.

From Table 3, IFPCA-1 and IFPCA-2 once again outperform all the comparative methods in handwritten digit recognition for both training samples of the USPS dataset. For the \(60\%\) sample, the digit recognition accuracy of IFPCA-1 is \(0.55\%\) higher than that of IFPCA-2, \(1.87\%\) higher than gLPCA, \(1.69\%\) higher than RPCA-OM, \(3.06\%\) higher than PCA, \(3.09\%\) higher than LPP, \(3.73\%\) higher than LSDA and \(1.75\%\) higher than RCDA. IFPCA-1 further obtains the best optimal dimensions of 27 and 29 for the 60 and \(70\%\) training samples respectively.

Fig. 4. Classification accuracies against the variation of dimensions on the USPS dataset by the various methods

Table 3. Mean classification accuracies ± standard deviation (%) and (optimal dimensions) of the various methods on USPS dataset.
Table 4. Computation time in seconds for training and testing for each method

IFPCA-1 and IFPCA-2 obtain the lowest variances among all the comparative methods. The consistency in the performance of IFPCA-1 and IFPCA-2 demonstrates their ability to discover the intrinsic structure of the dataset. Figure 4 shows the trend of classification accuracy of each method against the variation of dimensions. From Fig. 4, all the methods show stable performance at dimensions above 20, with the proposed methods in the lead.

Complexity Analysis. We compare the computational times of the proposed methods with those of the other six comparative methods. All algorithms were implemented in MATLAB R2016b (version 9.1.0.441655, 64-bit) on a personal computer with an Intel(R) Core(TM) i5-7500 CPU @ 3.40 GHz, 8.00 GB of memory and the Windows 7 operating system. The convergence of the proposed framework depends on the importance-evaluation diagonal matrix D. The computation time of an eigenvalue problem on a training set X of size \(m \times n\) is \(O(m^3)\). This means a complexity of \(O(m^3)\) is required by the proposed framework to compute the projection vector p, since our framework is an eigenvalue problem. For the inner loop, if it takes k iterations of pursuing D for convergence to be attained, the complexity is O(kmn). Hence, the total complexity of the framework is \(O(t(m^3 + kmn))\), where t is the number of iterations of the outer loop. Table 4 shows the computation time of each method on all three datasets.

4 Conclusions and Recommendation

In this paper, we proposed a novel incomplete-data oriented dimension reduction framework via instance factoring PCA (IFPCA). Different from other variants of PCA, a scaling-factor that imposes a penalty on the instance space is introduced to suppress the impact of noise in pursuing projections. Two strategies, the cosine similarity and total distance metrics, are used geometrically to iteratively learn the relationship between each instance and the principal projection.

Comprehensive experiments on the COIL-20, ORL and USPS datasets demonstrate the effectiveness of the proposed framework in both dimension reduction and classification tasks. This is because it is able to obtain low-rank projections from incomplete datasets by suppressing the effect of noisy or corrupt instances, which shows that our framework is more noise tolerant than the other comparative methods. We will extend this framework to low-rank representation in the near future.