
1 Introduction

This paper aims to extract features directly from data with missing entries. Many real-world data are multi-dimensional, in the form of tensors, which are ubiquitous (e.g., multichannel images) and have become increasingly popular [1]. Tucker decomposition, which decomposes a tensor into a core tensor multiplied by factor matrices [2], is widely used to solve tensor learning problems. Based on Tucker decomposition, many tensor methods have been proposed for feature extraction (dimension reduction) [3,4,5,6,7]. For example, multilinear principal component analysis (MPCA) [3] extracts features directly from tensors and is a popular extension of classical principal component analysis (PCA) [8]. Furthermore, robust methods such as robust tensor PCA (TRPCA) [9] have been well studied for data with corruptions (e.g., noise and outliers) [9,10,11].

In practice, some entries of a tensor are often missing due to problems in the acquisition process, costly experiments, etc. [12]. This missing data problem appears in a wide range of fields such as social sciences, computer vision and medical systems [13]. For example, partial responses in surveys are common in the social sciences, leading to incomplete datasets with arbitrary missing patterns [14]. Moreover, some images are corrupted during image acquisition so that part of their entries are missing [15]. In these scenarios, the existing feature learning methods above cannot work well. How to correctly handle missing data is a fundamental yet challenging problem in machine learning [16], and the problem of extracting features from incomplete tensors is not well explored.

One natural solution to this problem is to recover the missing data and then view the recovered tensors as the extracted features. Tensor completion techniques are widely used for missing data problems and have drawn much attention in applications such as image recovery [17] and video completion [18]. For example, the high accuracy low-rank tensor completion algorithm (HaLRTC) [17] estimates missing values in tensors of visual data, and the generalized higher-order orthogonal iteration (gHOI) [19] achieves simultaneous low-rank Tucker decomposition and completion efficiently. Although these tensor completion methods can recover the missing entries well under certain conditions, they only focus on data recovery without exploring the relationships among samples for effective feature extraction. Besides, if the recovered data are taken as features, the feature dimension is not reduced.

Another straightforward solution is a “two-step” strategy, i.e., “tensor completion methods + feature extraction methods”: the missing entries are first recovered by the former and then features are extracted from the completed data by the latter. For example, LRANTD [20] performs nonnegative Tucker decomposition (NTD) for incomplete tensors by realizing “low-rank representation (LRA) + nonnegative feature extraction”, and it needs a tensor completion method to estimate the missing values in the preceding LRA step. However, this “two-step” strategy probably amplifies the reconstruction errors, since the missing entries and features are not learned in a single stage, and the errors from the tensor completion method can deteriorate the performance of feature extraction in the succeeding step. Moreover, this approach is generally not computationally efficient.

Recently, a few works have applied tensor completion to feature classification by incorporating a completion model with discriminant analysis [21, 22]. These methods are supervised and require labels, which are expensive and difficult to obtain. To the best of our knowledge, there is no unsupervised method that extracts features directly from tensors with missing entries.

To solve the problem of extracting features from incomplete tensors, we propose an unsupervised method, i.e., incorporating low-rank Tucker Decomposition with feature Variance Maximization in a unified framework, namely TDVM. In this framework, based on Tucker decomposition with orthonormal factor matrices (a.k.a. higher-order singular value decomposition (HOSVD) [23]), we impose nuclear norm regularization on the core tensors while minimizing the reconstruction error, and meanwhile maximize the variance of the core tensors. In this paper, the learned core tensors (analogous to the singular values of a matrix) are viewed as the extracted features. Compared with tensor completion methods and “two-step” strategies:

  • Although Tucker decomposition-based tensor completion methods can also obtain core tensors, those core tensors are learned with the aim of recovering the tensor samples, without exploring the relationships among samples for effective feature extraction. Unlike these tensor completion methods, we focus on low-dimensional feature extraction rather than missing data recovery. Besides, we incorporate a specific term (feature variance maximization) to enhance the discriminative properties of the learned core tensors.

  • Different from the “two-step” strategies, we simultaneously learn the missing entries and the features directly from the observed entries in a unified framework. Besides, TDVM directly learns low-dimensional features in one step, which saves computational cost.

We optimize our model using alternating direction method of multipliers (ADMM) [24]. After feature extraction, we evaluate the extracted features for face recognition, which empirically demonstrates that TDVM outperforms the competing methods consistently. In a nutshell, the contributions of this paper are twofold:

  • We propose an efficient unsupervised feature extraction method, TDVM, based on low-rank Tucker decomposition. TDVM can simultaneously obtain low-dimensional core tensors and features for incomplete data.

  • We incorporate nuclear norm regularization with variance maximization on core tensors (features) to explore the relationships among tensor samples while estimating missing entries, leading to informative features extracted directly from observed entries.

2 Preliminaries and Backgrounds

2.1 Notations and Operations

The number of dimensions of a tensor is its order, and each dimension is a mode. A vector (i.e., first-order tensor) is denoted by a bold lower-case letter \(\mathbf {x} \in \mathbb {R}^I\). A matrix (i.e., second-order tensor) is denoted by a bold capital letter \(\mathbf X \in \mathbb {R}^{I_1 \times I_2 }\). A higher-order (\(N \ge 3\)) tensor is denoted by a calligraphic letter \(\mathcal {X} \in \mathbb {R}^{I_1 \times \cdots \times I_N}\). The ith entry of a vector \(\mathbf a \in \mathbb {R}^{I}\) is denoted by \(\mathbf a_{i}\), and the (i, j)th entry of a matrix \(\mathbf X \in \mathbb {R}^{I_1 \times I_2}\) is denoted by \( \mathbf X_{i,j}\). The \((i_1, \cdots , i_N)\)th entry of an Nth-order tensor \(\mathcal {X}\) is denoted by \(\mathcal {X}_{i_1, \cdots , i_N}\), where \(i_n \in \{1, \cdots , I_n\}\) and \(n \in \{1, \cdots , N\}\). The Frobenius norm of a tensor \(\mathcal X\) is defined by \(\Vert \mathcal X\Vert _F = \langle \mathcal X,\mathcal X\rangle ^{1/2}\) [25]. \(\varvec{\varOmega }\in \{0,1\}^{I_1 \times \cdots \times I_N}\) is a binary index set: \(\varvec{\varOmega }_{i_1, \cdots , i_N}=1\) if \(\mathcal {X}_{i_1, \cdots , i_N}\) is observed, and \(\varvec{\varOmega }_{i_1, \cdots , i_N}=0\) otherwise. \( \mathcal {P}_{\varvec{\varOmega }}\) is the associated sampling operator which keeps only the entries indexed by \(\varvec{\varOmega }\), defined as:

$$\begin{aligned} (\mathcal {P}_{\varvec{\varOmega }}(\mathcal {X}))_{i_1, \cdots , i_N} = \left\{ \begin{array}{ll} \mathcal {X}_{i_1, \cdots , i_N}, & \text {if } (i_1, \cdots , i_N) \in \varvec{\varOmega }, \\ 0, & \text {if } (i_1, \cdots , i_N) \in \varvec{\varOmega }^c, \end{array} \right. \end{aligned}$$
(1)

where \({{\varvec{\varOmega }}^c}\) is the complement of \(\varvec{\varOmega }\), and \(\mathcal {P}_{\varvec{\varOmega }}(\mathcal {X})+\mathcal {P}_{\varvec{\varOmega }^c}(\mathcal {X})=\mathcal {X}\).
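For concreteness, the sampling operator in (1) amounts to an elementwise product with a 0/1 mask. The sketch below is illustrative NumPy (the paper's implementation is in MATLAB), and the function names are ours, not the paper's.

```python
# A minimal sketch of Eq. (1), assuming Omega is stored as a 0/1 array
# of the same shape as X; function names are illustrative only.
import numpy as np

def sample(X, Omega):
    """P_Omega(X): keep the observed entries, zero out the missing ones."""
    return X * Omega

def sample_complement(X, Omega):
    """P_{Omega^c}(X), so that sample(X, Omega) + sample_complement(X, Omega) == X."""
    return X * (1 - Omega)
```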

Definition 1

Mode-n Product. A mode-n product between a tensor \(\mathcal {X} \in \mathbb {R}^{I_1 \times \cdots \times I_N}\) and a matrix/vector \(\mathbf U \in \mathbb {R}^{I_n \times J_n}\) is denoted by \(\mathcal {Y} = \mathcal {X} \times _n \mathbf U^T \). The size of \( \mathcal {Y} \) is \({I_1 \times \cdots \times I_{n-1} \times J_n \times I_{n+1} \times \cdots \times I_N}\), with entries given by \(\mathcal {Y}_{i_1 \cdots i_{n-1} j_n i_{n+1} \cdots i_N} = \sum _{i_n} \mathcal {X}_{i_1 \cdots i_{n-1} i_n i_{n+1} \cdots i_N} \mathbf U_{i_n,j_n} \), and we have \(\mathbf Y_{(n)} = \mathbf U^T \mathbf X_{(n)} \)[25].

Definition 2

Mode-n Unfolding. Unfolding, a.k.a., matricization or flattening, is the process of reordering the elements of a tensor into matrices along each mode [1]. A mode-n unfolding matrix of a tensor \(\mathcal {X} \in \mathbb {R}^{I_1 \times \cdots \times I_N}\) is denoted as \(\mathbf X_{(n)} \in \mathbb {R}^{I_n \times \varPi _{n^* \ne n} I_{n^*} }\).
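The two definitions above translate directly into a few lines of array code. The following sketch (illustrative NumPy, not the paper's MATLAB code; helper names are ours) gives mode-n unfolding, the inverse folding, and the mode-n product of Definition 1.

```python
import numpy as np

def unfold(X, n):
    """Mode-n unfolding X_(n), of shape (I_n, product of the remaining dimensions)."""
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

def fold(M, n, shape):
    """Inverse of unfold: rebuild a tensor with the given shape from its mode-n unfolding."""
    rest = [s for i, s in enumerate(shape) if i != n]
    return np.moveaxis(M.reshape([shape[n]] + rest), 0, n)

def mode_n_product(X, U, n):
    """Y = X x_n U^T for U of shape (I_n, J_n), so that Y_(n) = U^T X_(n) (Definition 1)."""
    shape = list(X.shape)
    shape[n] = U.shape[1]
    return fold(U.T @ unfold(X, n), n, shape)
```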

Fig. 1. The Tucker decomposition of tensors (a third-order tensor \(\mathcal {X}\) shown for illustration).

2.2 Tucker Decomposition

A tensor \(\mathcal {X} \in \mathbb {R}^{I_1 \times I_2 \times \cdots \times I_N}\) is represented as a core tensor with factor matrices in Tucker decomposition model [1]:

$$\begin{aligned} \mathcal {X} = \mathcal {G} {\times _1} \mathbf {U}^{(1)} {\times _2} \mathbf {U}^{(2)} \cdots {\times _N} \mathbf {U}^{(N)}, \end{aligned}$$
(2)

where \( \{\mathbf {U}^{(n)} \in \mathbb {R}^{I_n \times R_n}, n= 1, 2, \cdots, N, \ \text {and} \ R_n < I_n \}\) are factor matrices with orthonormal columns and \(\mathcal {G} \in \mathbb {R}^{R_1 \times R_2 \times \cdots \times R_N}\) is the core tensor with smaller dimensions. The Tucker rank of an Nth-order tensor \(\mathcal {X}\) is an N-dimensional vector, denoted as \((R_1, \cdots , R_N)\), where \(R_n\) is the rank of the mode-n unfolded matrix \(\mathbf {X}_{(n)}\) of \(\mathcal {X}\). Figure 1 illustrates this decomposition. In this paper, we regard the core tensor as consisting of the extracted features of a tensor.
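As a quick illustration of Eq. (2), the reconstruction applies each factor matrix along its mode. The sketch below reuses the `unfold`/`fold` helpers above; `multilinear_product` is a name introduced here for illustration, under the same conventions.

```python
def multilinear_product(G, U_list):
    """X = G x_1 U^(1) x_2 ... x_N U^(N), with U_list[n] of shape (I_n, R_n)."""
    X = G
    for n, U in enumerate(U_list):
        shape = list(X.shape)
        shape[n] = U.shape[0]
        X = fold(U @ unfold(X, n), n, shape)  # Y_(n) = U^(n) X_(n)
    return X

# Example: a random core with orthonormal factors gives a rank-(5, 4, 3) tensor.
# G = np.random.randn(5, 4, 3)
# U = [np.linalg.qr(np.random.randn(I, R))[0] for I, R in [(20, 5), (15, 4), (10, 3)]]
# X = multilinear_product(G, U)   # shape (20, 15, 10)
```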

3 Feature Extraction for Incomplete Data

3.1 Problem Definition

Given M tensor samples {\(\mathcal {T}_1, \cdots , \mathcal {T}_m, \cdots , \mathcal {T}_M\)}, each sample \(\mathcal {T}_m \in \mathbb {R}^{I_1\times \cdots \times I_N}\) contains missing entries, where \(I_n\) is the mode-n dimension. We denote \(\mathcal {T}= [\mathcal {T}_1, \cdots , \mathcal {T}_m, \cdots, \mathcal {T}_M] \in \mathbb {R}^{I_1\times \cdots \times I_N \times M} \), where the M tensor samples are concatenated along mode-\((N+1)\) of \( \mathcal {T}\). For feature extraction (dimension reduction), we aim to directly extract low-dimensional features \(\mathcal {G}= [\mathcal {G}_1, \cdots , \mathcal {G}_m, \cdots, \mathcal {G}_M] \in \mathbb {R}^{R_1 \times \cdots \times R_N\times M}\) \((R_n < I_n, n =1, \cdots , N) \) from the given high-dimensional incomplete tensors \(\mathcal {T}\).

Remark: This problem is different from the case of data with corruptions (e.g., noise and outliers) widely studied in [9, 26,27,28]: missing data could be regarded as a special case of corruption only if the corruptions were allowed to be arbitrary (with the locations of corruption known). However, the magnitudes of corruptions in reality are not arbitrarily large. In other words, we study a new problem here, and the existing feature extraction methods are not applicable.

3.2 Formulation of the Proposed Method: TDVM

To solve this problem, we propose an unsupervised feature extraction method. Based on Tucker decomposition, we impose the nuclear norm on the core tensors of observed tensors while minimizing reconstruction errors, and meanwhile maximize the variance of core tensors (features), i.e., incorporating low-rank Tucker Decomposition with feature Variance Maximization, namely TDVM. Thus, the objective function of TDVM is:

$$\begin{aligned} \begin{aligned} \min _{{\mathcal {X}_m, \mathcal G_m, \mathbf {U}^{(n)}}} \ &\sum _{m=1}^{M} \frac{1}{2}\Vert \mathcal {X}_m - \mathcal G_m {\times _1} \mathbf {U}^{(1)} \cdots {\times _N} \mathbf {U}^{(N)}\Vert _F^2 \ + \ \sum _{m=1}^{M}\Vert \mathcal G_m\Vert _* \ - \ \frac{1}{2} \sum _{m=1}^{M} \Vert \mathcal G_m- \bar{\mathcal {G}}\Vert _F^2, \\ \text {s.t.} \ \ &\mathcal {P}_{\varvec{\varOmega }}(\mathcal {X}_m)= \mathcal {P}_{\varvec{\varOmega }}(\mathcal {T}_m), \quad {\mathbf {U}^{(n)}}^{\top } \mathbf {U}^{(n)} =\mathbf I, \end{aligned} \end{aligned}$$
(3)

where \(\{\mathbf {U}^{(n)} \in \mathbb {R}^{I_n \times R_n} \}_ {n=1}^N \) are common factor matrices with orthonormal columns. \(\mathbf I \in \mathbb {R}^{R_n \times R_n}\) is an identity matrix. \(\mathcal G_m\) is the core tensor which consists of the extracted features (analogous to the singular values of a matrix) of an incomplete tensor \(\mathcal {T}_m\) with observed entries in \(\varvec{\varOmega }\). \(\Vert \mathcal G_m\Vert _*\) is the nuclear norm of \(\mathcal G_m\) (i.e., the summation of the singular values of the unfolded matrices along modes of \(\mathcal G_m\) [17]). \(\bar{\mathcal G} = \frac{1}{M} \sum _{m=1}^{M} \mathcal G_m \) is the mean of core tensors (extracted features).

Remark: Our objective function (3) integrates three terms into a unified framework:

  • The first term: minimizing \(\sum _{m=1}^{M} \frac{1}{2}\Vert \mathcal {X}_m - \mathcal G_m {\times _1} \mathbf {U}^{(1)} \cdots {\times _N} \mathbf {U}^{(N)}\Vert _F^2 \), aims to minimize the reconstruction error based on given observed entries.

  • The second term: minimizing \( \sum _{m=1}^{M}\Vert \mathcal G_m\Vert _*\), aims to obtain low-dimensional features. It has been proved that imposing the nuclear norm on a core tensor \(\mathcal G_m\) is essentially equivalent to imposing it on the original tensor \(\mathcal X_m\) [19]. We thus obtain a low-rank solution, i.e., \(R_n\) can be small (\(R_n < I_n\)), so the learned feature subspace is naturally low-dimensional. Besides, imposing the nuclear norm on the core tensors \(\mathcal G_m\) instead of the original \(\mathcal {X}_m\) saves computational cost.

  • The third term: minimizing \( - \sum _{m=1}^{M} \frac{1}{2}\Vert \mathcal G_m - \bar{\mathcal G}\Vert _F^2\), is equivalent to maximizing the variance of the extracted features (core tensors), following PCA. We thus explore the relationships among incomplete tensors via variance maximization while estimating the missing entries via the first and second terms (low-rank Tucker decomposition).

By this unified framework, we can efficiently extract low-dimensional informative features directly from observed entries, which is different from tensor completion methods (only focusing on data recovery without considering the relationships among samples for effective feature extraction) and “two-step” strategies (the reconstruction error from tensor completion step probably deteriorates the performance of feature extraction in the succeeding step, and combining two methods is generally time consuming).
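To make the three terms of (3) concrete, the sketch below evaluates them for a single sample. It is an illustration of the objective rather than of the optimization, it reuses the `unfold` and `multilinear_product` helpers sketched in Sect. 2, and the function name is ours.

```python
import numpy as np

def tdvm_objective_terms(X_m, G_m, U_list, G_bar):
    """Return the (fit, nuclear norm, negative variance) terms of Eq. (3) for one sample."""
    fit = 0.5 * np.linalg.norm(X_m - multilinear_product(G_m, U_list)) ** 2
    # nuclear norm of the core: sum of nuclear norms of its mode-n unfoldings [17]
    nuclear = sum(np.linalg.norm(unfold(G_m, n), ord='nuc') for n in range(G_m.ndim))
    variance = -0.5 * np.linalg.norm(G_m - G_bar) ** 2
    return fit, nuclear, variance
```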

3.3 Optimization by ADMM

To optimize (3) using ADMM, we apply the variable splitting technique and introduce a set of auxiliary variables \(\{\mathcal S_{m} \in \mathbb {R}^{R_1 \times \cdots \times R_N}\}_{m=1}^{M}\), and then reformulate (3) as:

$$\begin{aligned} \begin{aligned} \min _{{\mathcal {X}_m, \mathcal G_m,\mathcal S_m,\mathbf {U}^{(n)}}} \sum _{m=1}^{M} \frac{1}{2}\Vert \mathcal {X}_m - \mathcal G_m {\times _1} \mathbf {U}^{(1)} \cdots {\times _N} \mathbf {U}^{(N)}\Vert _F^2&\\ + \ \sum _{m=1}^{M}\Vert \mathcal S_m\Vert _* \ - \ \frac{1}{2} \sum _{m=1}^{M} \Vert \mathcal G_m- \bar{\mathcal {G}}\Vert _F^2,&\\ \text {s.t.} \ \ \mathcal {P}_{\varvec{\varOmega }}(\mathcal {X}_m)= \mathcal {P}_{\varvec{\varOmega }}(\mathcal {T}_m), \mathcal S_{m}= \mathcal G_{m}, {\mathbf {U}^{(n)}}^{\top } \mathbf {U}^{(n)} =\mathbf I. \end{aligned} \end{aligned}$$
(4)

For ease of derivation, we reformulate (4) by unfolding each tensor variable along mode n and absorbing the constraint \(\mathcal S_{m}= \mathcal G_{m}\) into the augmented Lagrange function:

$$\begin{aligned} \begin{aligned} \mathcal L \ =&\sum _{m=1}^{M}\sum _{n=1}^{N} \Big ( \frac{1}{2} \Vert \mathbf {X}_m^{(n)} - \mathbf {U}^{(n)} \mathbf G_m^{(n)} {\mathbf {P}^{(n)}}^{\top } \Vert _F^2 + \Vert \mathbf S_m^{(n)} \Vert _* \\&+ \ \langle \mathbf Y_{mn}, \mathbf G_m^{(n)} \ - \ \mathbf S_m^{(n)} \rangle \ + \ \frac{\mu }{2}\Vert \mathbf G_m^{(n)} - \mathbf S_m^{(n)}\Vert _F^2 \ - \ \frac{1}{2} \Vert \mathbf G_m^{(n)} - \bar{\mathbf {G}}^{(n)} \Vert _F^2 \Big ) \end{aligned} \end{aligned}$$
(5)

where \( {\mathbf {P}^{(n)}}= \mathbf {U}^{(N)} \otimes \cdots \otimes \mathbf {U}^{(n+1)} \otimes \mathbf {U}^{(n-1)} \otimes \cdots \otimes \mathbf {U}^{(1)} \in \mathbb {R}^{\prod _{j \ne n} I_j \times \prod _{j \ne n} R_j}\) (with \(\otimes\) denoting the Kronecker product) and \(\{\mathbf Y_{mn} \in \mathbb {R}^{R_n \times \prod _{j \ne n}R_j}, n=1, \cdots , N, m= 1, \cdots , M\}\) are the matrices of Lagrange multipliers. \(\mu >0\) is a penalty parameter. \(\mathbf {X}_m^{(n)} \in \mathbb {R}^{I_n \times \prod _{j \ne n} I_j} \) and \(\{\mathbf G_m^{(n)} , \mathbf S_m^{(n)}, \bar{\mathbf {G}}^{(n)} \}\in \mathbb {R}^{R_n \times \prod _{j \ne n}R_j}\) are the mode-n unfolded matrices of the tensor \(\mathcal {X}_m\) and of {the core tensor \(\mathcal G_m\), the auxiliary variable \(\mathcal S_{m}\), the mean of features \(\bar{\mathcal {G}}\)}, respectively.
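For reference, \(\mathbf P^{(n)}\) can be formed as a Kronecker product of the remaining factor matrices, as in the sketch below (illustrative; an efficient implementation would avoid materializing this large matrix and use mode-n products instead).

```python
import numpy as np
from functools import reduce

def build_P(U_list, n):
    """P^(n) = U^(N) kron ... kron U^(n+1) kron U^(n-1) kron ... kron U^(1)."""
    others = [U for j, U in enumerate(U_list) if j != n]   # skip U^(n)
    return reduce(np.kron, list(reversed(others)))         # rightmost factor is U^(1)
```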

ADMM solves the problem (5) by successively minimizing \(\mathcal {L}\) over \(\{\mathbf {X}_m^{(n)}, \mathbf G_m^{(n)},\mathbf S_m^{(n)}, \mathbf {U}^{(n)}\}\), and then updating \(\mathbf Y_{mn}\).

Update \(\mathbf S _{{\varvec{m}}}^{({\varvec{n}})}\) . The Lagrange function (5) with respect to \(\mathbf S_m^{(n)}\) is,

$$\begin{aligned} \begin{aligned} \mathcal L_{\mathbf S_m^{(n)}} = \sum _{m=1}^{M}\sum _{n=1}^{N} \big (\Vert \mathbf S_m^{(n)} \Vert _* + \frac{\mu }{2} \Vert (\mathbf G_m^{(n)} + \mathbf Y_{mn} / \mu ) -\mathbf S_m^{(n)} \Vert _F^2 \big ). \end{aligned} \end{aligned}$$
(6)

To solve (6), we use the spectral soft-thresholding operation [29] to update \(\mathbf S_m^{(n)}\):

$$\begin{aligned} \begin{aligned} \mathbf S_m^{(n)} = \textit{prox}_{1/\mu }(\mathbf G_m^{(n)} + \mathbf Y_{mn} / \mu ) = \mathbf U \, \text {diag} \big (\max (\sigma - \tfrac{1}{\mu }, 0)\big ) \mathbf V^\top , \end{aligned} \end{aligned}$$
(7)

where \(\textit{prox}_{1/\mu }\) is the spectral soft-thresholding (singular value shrinkage) operator, \(\mathbf U \text {diag}(\sigma ) \mathbf V^\top \) is the singular value decomposition (SVD) of \((\mathbf G_m^{(n)} + \mathbf Y_{mn} / \mu )\), and the \(\max \) is taken entrywise over the singular values \(\sigma \).
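In code, the update (7) is the standard singular value thresholding step. The sketch below is illustrative and the names are ours.

```python
import numpy as np

def svt(A, tau):
    """prox of tau * nuclear norm: shrink the singular values of A by tau."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

# S_mn = svt(G_mn + Y_mn / mu, 1.0 / mu)   # Eq. (7) for one (m, n) pair
```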

Update \(\mathbf U ^{({\varvec{n}})}\) . The Lagrange function (5) with respect to \(\mathbf {U}^{(n)}\) is:

$$\begin{aligned} \begin{aligned} \mathcal L_{\mathbf {U}^{(n)}}&= \sum _{m=1}^{M}\sum _{n=1}^{N} \frac{1}{2} \Vert \mathbf {X}_m^{(n)} - \mathbf {U}^{(n)} \mathbf G_m^{(n)} {\mathbf {P}^{(n)}}^{\top } \Vert _F^2, \ \ \text {s.t.} \ \ {\mathbf {U}^{(n)}}^{\top } \mathbf {U}^{(n)} =\mathbf I, \end{aligned} \end{aligned}$$
(8)

According to Theorem 4 in [30], minimizing problem (8) over the matrices \(\{\mathbf {U}^{(1)}, \cdots , \mathbf {U}^{(N)} \}\) with orthonormal columns is equivalent to the following maximization problem:

$$\begin{aligned} \begin{aligned} {\mathbf {U}^{(n)} } = \arg \max \ \text {trace} \big ( {\mathbf {U}^{(n)}}^{\top } \mathbf {X}_m^{(n)} {({\mathbf G_m^{(n)}}{\mathbf {P}^{(n)}}^{\top })}^{\top }\big ) \end{aligned} \end{aligned}$$
(9)

where trace() is the trace of a matrix, and we denote \( \mathbf W^{(n)} = {\mathbf G_m^{(n)}}{\mathbf {P}^{(n)}}^{\top }\).

Problem (9) is the well-known orthogonal Procrustes problem [31], whose globally optimal solution is given by the SVD of \( \mathbf {X}_m^{(n)} {\mathbf W^{(n)}}^{\top }\), i.e.,

$$\begin{aligned} \begin{aligned} \mathbf {U}^{(n)} = \hat{\mathbf U}^{(n) }{(\hat{\mathbf V}^{(n)})}^{\top }, \end{aligned} \end{aligned}$$
(10)

where \(\hat{\mathbf U}^{(n)}\) and \(\hat{\mathbf V}^{(n)}\) are the left and right singular vectors of SVD of \( \mathbf {X}_m^{(n)} {\mathbf W^{(n)}}^{\top }\), respectively.
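The update (10) takes a few lines of code, as in the sketch below (illustrative; note that Eqs. (9)–(10) are written for a single sample m, so a full implementation would presumably accumulate \(\mathbf X_m^{(n)} {\mathbf W^{(n)}}^{\top }\) over all samples before the SVD).

```python
import numpy as np

def procrustes_factor(X_n, W_n):
    """U^(n) maximizing trace(U^T X_n W_n^T) subject to U^T U = I, cf. Eq. (10)."""
    Uh, _, Vht = np.linalg.svd(X_n @ W_n.T, full_matrices=False)
    return Uh @ Vht
```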

Algorithm 1. The proposed TDVM method (optimization by ADMM).

Update \(\mathbf G _{{\varvec{m}}}^{({\varvec{n}})}\) . The Lagrange function (5) with respect to \(\mathbf G_m^{(n)}\) is:

$$\begin{aligned} \begin{aligned} \mathcal L_{\mathbf G_m^{(n)} } = \ \frac{1}{2}\Vert \mathbf {X}_m^{(n)} - \mathbf {U}^{(n)} \mathbf G_m^{(n)} {\mathbf {P}^{(n)}}^{\top } \Vert _F^2 \ + \ \frac{\mu }{2} \Vert \mathbf G_m^{(n)} - \mathbf S_m^{(n)} + \mathbf Y_{mn} / \mu \Vert _F^2 \ - \ \frac{1}{2} \Vert (1- \tfrac{1}{M})\mathbf G_m^{(n)} - \frac{1}{M} \sum _{j\ne m}^{M}\mathbf G_j^{(n)} \Vert _F^2. \end{aligned} \end{aligned}$$
(11)

Then we set the partial derivative \(\frac{\partial \mathcal L_{\mathbf G_m^{(n)}}}{\partial {\mathbf G_m^{(n)}}}\) to zero, and get:

$$\begin{aligned} \begin{aligned} \mathbf G_m^{(n)} = \frac{M^2}{M^2\mu +2M-1} \Big (\mu \mathbf S_m^{(n)} - \mathbf Y_{mn} + {\mathbf {U}^{(n)}}^{\top } \mathbf {X}_m^{(n)} {\mathbf {P}^{(n)}} - \big (\tfrac{1}{M} - \tfrac{1}{M^2}\big ) \sum _{j\ne m}^{M}\mathbf G_j^{(n)} \Big ). \end{aligned} \end{aligned}$$
(12)
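Written as code, the closed-form update (12) is straightforward, as sketched below. Here `sum_G_others` denotes \(\sum_{j\ne m} \mathbf G_j^{(n)}\), \(\mathbf P^{(n)}\) can be formed with the `build_P` helper sketched earlier, and the function name is ours.

```python
def update_core(S_mn, Y_mn, U_n, X_mn, P_n, sum_G_others, mu, M):
    """Closed-form minimizer of (11) with respect to G_m^(n), cf. Eq. (12)."""
    rhs = (mu * S_mn - Y_mn + U_n.T @ X_mn @ P_n
           - (1.0 / M - 1.0 / M**2) * sum_G_others)
    return (M**2 / (M**2 * mu + 2 * M - 1)) * rhs
```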

Update \({\varvec{\mathcal {X}}}_{{\varvec{m}}}\) . The Lagrange function (5) with respect to \(\mathcal {X}_m\) is:

$$\begin{aligned} \begin{aligned} \mathcal L_{\mathcal {X}_m}&= \frac{1}{2} \sum _{m=1}^{M} \Vert \mathcal {X}_m - \mathcal G_m {\times _1} \mathbf {U}^{(1)} \cdots {\times _N} \mathbf {U}^{(N)}\Vert _F^2, \\ {}&\text {s.t.} \ \mathcal {P}_{\varvec{\varOmega }}(\mathcal {X}_m)= \mathcal {P}_{\varvec{\varOmega }}(\mathcal {T}_m), \end{aligned} \end{aligned}$$
(13)

By deriving the Karush-Kuhn-Tucker (KKT) conditions for function (13), we can update \( \mathcal {X}_m \) by \( \mathcal {X}_m = {\mathcal {P}_{{\varvec{\varOmega }}} ( \mathcal {T}_m )} + \mathcal {P}_{{\varvec{\varOmega }}^c} (\mathcal {Z}_m)\), where \(\mathcal {Z}_m= \mathcal G_m {\times _1} \mathbf {U}^{(1)} \cdots {\times _N} \mathbf {U}^{(N)}\) is the current reconstruction.
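In code, this update is simply a masked combination of the observed data and the current reconstruction (a sketch, with `Omega` the 0/1 mask of Sect. 2.1; the function name is ours).

```python
def update_sample(T_m, Z_m, Omega):
    """X_m = P_Omega(T_m) + P_{Omega^c}(Z_m), with Z_m the current reconstruction."""
    return Omega * T_m + (1 - Omega) * Z_m
```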

We summarize the proposed method, TDVM, in Algorithm 1.

3.4 Complexity Analysis

We analyze the complexity of TDVM following [32]. For simplicity, we assume the size of each tensor is \( I_1 = \cdots = I_N= I\) and the feature dimensions are \( R_1 =\cdots = R_N= R\). At each iteration, the time complexity of the soft-thresholding operation (7) is \(O(M N R^{N+1})\). The time complexities of the multiplication operations in (10)/(12) and in (13) are \(O(M N R I^N)\) and \(O(M R I^N)\), respectively. Hence, the total time complexity of TDVM is \(O(M(N+1)RI^N)\) per iteration.

4 Experimental Results

We implemented TDVM in MATLAB and performed all experiments on a PC (Intel Xeon(R) 4.0 GHz, 64 GB).

4.1 Experimental Setup

Compared Methods: We compare our TDVM with nine methods in three categories:

  • Two tensor completion methods based on Tucker decomposition: HaLRTC [17] and gHOI [19]. The recovered tensors are regarded as the features.

  • Six {tensor completion methods + feature extraction methods} (i.e., “two-step” strategies): HaLRTC + PCA [8], gHOI + PCA, HaLRTC + MPCA [3], gHOI + MPCA, HaLRTC + LRANTD [20] and gHOI + LRANTD.

  • One robust tensor feature learning method: TRPCA [9].

After the feature extraction stage, we use the nearest neighbor classifier (NNC) to evaluate the extracted features on two real datasets for face recognition. We also evaluated TDVM on the MNIST handwritten digits [33] for object classification, where TDVM obtained the best results in all cases; we do not report those results here due to limited space.

Data: We evaluate the proposed TDVM on two real datasets for face recognition tasks. One is a subset of the Facial Recognition Technology database (FERET) [34], which has 721 face samples from 70 subjects. Each subject has 8–31 faces with at most 15\(^\circ \) of pose variation, and each face image is normalized to an \(80 \times 60\) gray image. The other is a subset of the extended Yale Face Database B (YaleB) [35], which has 2414 face samples from 38 subjects. Each subject has 59–64 near-frontal images under different illuminations, and each face image is normalized to a \(32\times 32\) gray image.

Missing Data Settings: We set the tensors with two types of missing data:

  • Pixel-based missing: we uniformly select \(10 \% - 90 \%\) of the pixels (entries) of the tensors as missing at random. The pixel-based missing setting is widely used in the tensor completion domain. One example (e.g., \(50\%\) missing entries) is shown in Fig. 2(b).

  • Block-based missing: we randomly select a \(B_1 \times B_2\) block of pixels in each tensor sample as missing. The missing block is random in each sample. One example (e.g., \(\{B_1=40, B_2=30\}\) for FERET and \(\{B_1=16, B_2=16\}\) for YaleB) is shown in Fig. 2(c). In practice, parts of a face can be covered by objects such as sunglasses, which can be regarded as the block-based missing case.

Intuitively, handling data with block-based missing entries is more difficult than handling data with pixel-based missing entries if the same number of entries is missing.

Parameter Settings: We set the maximum number of iterations \(K=200\) and \(\textit{tol} = 1e-5\) for all methods, and set the feature dimension \(D = R_1 \times R_2 = \{40 \times 30, 16 \times 16\} \) for TDVM, gHOI and LRANTD on {FERET, YaleB}, respectively. In other words, we directly learn \( 40\times 30 \times 721\) features from FERET (\(80\times 60 \times 721\)) and extract \(16\times 16 \times 2414\) features from YaleB (\(32\times 32 \times 2414\)). The other parameters of the compared methods follow the original papers.

Fig. 2. One example of (a) original images of FERET and YaleB with (b) \(50\%\) pixel-based and with (c) \(40 \times 30\) and \(16 \times 16\) block-based missing entries, respectively.

Fig. 3. Recognition results on FERET via TDVM with different feature dimensions \(D\).

To apply the extracted features for face recognition using NNC, we randomly select \(L = \{1, 2,\cdots , 7\} \) extracted feature samples from each subject (with 8–31 samples) of FERET for training. On YaleB, we randomly select \(L = \{5, 10, \cdots , 50 \}\) extracted feature samples from each subject (with 59–64 samples) for training.

Fig. 4. Convergence curves of TDVM in terms of Relative Error: \(\Vert \mathcal G_m -\mathcal S_m\Vert _F^2 / \Vert \mathcal G_m\Vert _F^2\) on FERET with (a) pixel-based/(b) block-based missing entries.

Fig. 5. Convergence curves of feature extraction on FERET with pixel-based (\(50\%\))/block-based (\(40 \times 30\)) missing entries via TDVM with ten different values of \(\mu _0\) and \(\rho \), respectively.

4.2 Parameter Sensitivity and Convergence Study

Effect of Feature Dimension \({{\varvec{D:}}}\) We study the effect of different feature dimensions (the size of each core tensor) on face recognition with TDVM on FERET. We set the feature dimension D of each face sample as \(R_1 \times R_2\) in TDVM and show the corresponding face recognition results. Figure 3 shows that TDVM with different feature dimensions stably yields similar recognition results on FERET in both the pixel-based and block-based missing cases, except for \(D=5 \times 5\) (i.e., only 25 features extracted from each \(80 \times 60\) face image), where the number of features is too limited to achieve good results. Since a larger D costs more time and we aim to learn low-dimensional features, we set \(D=R_1 \times R_2= 40 \times 30\) and \(16 \times 16\) for TDVM to extract features from FERET and YaleB, respectively.

Convergence: We study the convergence of TDVM in terms of the Relative Error \(\Vert \mathcal G_m -\mathcal S_m\Vert _F^2 / \Vert \mathcal G_m\Vert _F^2\) on FERET with pixel/block-based missing entries. Here, we set \(\mu _0=5\) and \(\rho =10\) for TDVM. Figure 4 shows that the relative error decreases dramatically to a very small value (on the order of \(10^{-13}\)) within about 10 iterations. In other words, with \( \textit{tol}=1e-5\), the proposed TDVM converges within five iterations.

Fig. 6. Recognition results on FERET with pixel-based (\(50\%\))/block-based (\(40 \times 30\)) missing entries via TDVM with ten different values of \(\mu _0\) and \(\rho \), respectively.

Sensitivity Analysis of Parameters \(\varvec{\mu }_{\mathbf {0}}\) and \(\varvec{\rho }\): In line 10 of Algorithm 1, we iteratively update the penalty parameter \(\mu \) with a step size \(\rho \) from an initial value \(\mu _0\); this scheme is widely used in many methods such as [9] and makes the algorithm converge faster. Figures 5 and 6 show the convergence curves and the corresponding recognition results on FERET with \(50\%\) missing pixels and a \(40 \times 30\) missing block via TDVM with different \(\mu _0\) and \(\rho \), respectively. As seen from Figs. 5(a) and (b), with different \(\mu _0\), TDVM stably converges to a small value (on the order of \(10^{-13}\)) within around 10 iterations. In terms of \(\rho \), the relative errors converge to a small value faster when TDVM uses a larger \(\rho \) (e.g., \(\rho =10\)), as shown in Figs. 5(c) and (d). Figures 6(a) and (c) show that the feature extraction performance of TDVM is stable and not sensitive to the values of \(\mu _0\) and \(\rho \) on FERET with \(50\%\) missing pixels. Besides, as seen from Figs. 6(b) and (d), with a larger \(\mu _0\) (\(\mu _0>1\)) and \(\rho \) (\(\rho >1.5\)) for TDVM on FERET with \(40 \times 30\) block-based missing entries, the corresponding face recognition results are similar and stable.

In general, we do not need to carefully tune the parameters \(\mu _0\) and \(\rho \) for the proposed TDVM. In this paper, we set \(\mu _0=5\) and \(\rho =10\) in Algorithm 1 for all tests.

Table 1. Face recognition results (average recognition rates \(\%\)) on the FERET and YaleB with \(\{10\%, 30\%, 50\%, 90\% \}\) pixel-based missing entries.

4.3 Evaluate Features Extracted from Data with Pixel/Block-Based Missing

To save space, for the pixel-based missing case, we report the results on FERET and YaleB with {\(10\%,30\%, 50\%,90\%\)} missing pixels in Table 1. For the block-based missing case, we report the results on FERET with {\(5 \times 10\), \(20 \times 20\), \(40 \times 30\), \(55 \times 55\)} missing blocks and YaleB with {\(5 \times 5\), \(8 \times 10\), \(16 \times 16\), \(30 \times 25\)} missing blocks in Table 2, respectively. In each pixel/block-based missing case, we report the recognition rates when randomly selecting \(L = \{1, 7\} \) and \(L = \{ 5, 50 \}\) extracted feature samples from each subject of FERET and YaleB, respectively, for training in NNC. We highlight the best results in bold and the second best with underlines. We repeat feature extraction and recognition 10 times each, separately, and report the average results.

Face Recognition Results on FERET/YaleB with Pixel-Based Missing. TDVM outperforms the other nine methods by \(\{34.69\%, 35.72\%, 35.19\%, 46.65\%\}\) on average in all cases of FERET with \(\{10\%,30\%,50\%,90\%\}\) missing pixels, respectively, as shown in the left half of Table 1. Besides, TRPCA achieves the second best results in six cases given more than \(50\%\) observed entries, while its performance drops dramatically when \(90\%\) of pixels are missing, where gHOI + LRANTD takes second place. Moreover, with fewer training features (e.g., \(L=1\)) in NNC, our TDVM shows a greater advantage, as it aims to extract low-dimensional informative features.

The right half of Table 1 shows that TDVM outperforms the best performing existing algorithm (HaLRTC) in all cases of YaleB with {\(10\%, 30\%, 50\%\)} missing pixels, by \(\{39.52\%, 40.67\%, 43.37\%\}\) on average, respectively. When the missing rate reaches \(90\%\), the performance of the compared methods drops sharply, except for HaLRTC + MPCA, which outperforms the other existing methods in this scenario; our TDVM keeps the best performance, with a margin of \(77.45\%\) over all the existing methods.

Table 2. Face recognition results (average recognition rates \(\%\)) on the FERET and YaleB with block-based missing entries.

Face Recognition Results on FERET/YaleB with Block-Based Missing. The left half of Table 2 shows that TDVM outperforms all competing methods by \(\{35.99\%, 37.70\%, 44.48\%, 50.68\%\}\) on average in all cases of FERET with {\(5 \times 10\), \(20 \times 20\), \(40 \times 30\), \(55 \times 55\)} missing blocks, respectively. Furthermore, gHOI/HaLRTC + MPCA and HaLRTC share second place in these cases.

As shown in the right half of Table 2, TDVM outperforms the nine state-of-the-art methods by \(\{40.53\%, 40.40\%, 46.07\%, 50.42\%\}\) on average in all cases of YaleB with {\(5 \times 5\), \(8 \times 10\), \(16 \times 16\), \(30 \times 25\)} missing blocks, respectively. Specifically, HaLRTC is the best performing existing algorithm in the cases of {\(8 \times 10\), \(16 \times 16\), \(30 \times 25\)} missing blocks, but our TDVM outperforms it there by \(\{29.55\%,30.31\%, 30.02\%\}\), respectively.

4.4 Computational Cost

We report the average time cost of feature extraction in Table 3. As shown in Table 3, TDVM is much more efficient than all the compared methods in all cases, as we impose the nuclear norm on core tensors instead of the original tensors to learn low-dimensional features. Specifically, HaLRTC is the second fastest method on FERET with block-based missing entries, while it is slower than gHOI and HaLRTC + PCA in two pixel-based missing cases. Besides, HaLRTC is also the second most efficient method on YaleB with pixel-based missing entries, except in the case of \(90\%\) missing pixels. In the block-based missing cases of YaleB, TRPCA is faster than TDVM, but it yields worse results. Moreover, the “two-step” strategies such as gHOI + MPCA/LRANTD are the most time consuming (more than 10 times slower than TDVM on average).

Table 3. Time cost (seconds) of feature extraction on the FERET and YaleB with pixel/block-based missing entries.

4.5 Summary of Experimental Results

  • TDVM outperforms the nine competing methods in all cases of face recognition on the two real datasets, especially on data with more missing entries. Besides, our method is much more efficient than all compared methods. Moreover, with fewer training features (e.g., \(L=1\) for FERET and \(L=5\) for YaleB) in NNC, TDVM shows a greater advantage, as it extracts low-dimensional informative features. These results verify the superiority of incorporating low-rank Tucker decomposition with feature variance maximization.

  • The tensor learning method TRPCA is the best performing existing algorithm in six cases of FERET with pixel-based missing entries. However, it performs much worse than TDVM as the number of missing entries increases. For example, on YaleB with \(90\%\) missing pixels, TRPCA falls behind TDVM by up to \( 94.48\%\) on average.

  • Tensor completion methods (HaLRTC and gHOI) obtain similar results in most cases and HaLRTC achieves the second best results in about half of all cases, while TDVM outperforms these two methods by \(34.92\%\) and \(41.71\%\) on average on FERET and YaleB respectively. These results echo our claim: tensor completion methods focus on recovering missing data and do not explore the relationships among samples for effective feature extraction.

  • The “two-step” strategies (e.g., gHOI + PCA/MPCA) do not bring much improvement and sometimes even perform worse than using only tensor completion methods (e.g., gHOI); as we claimed, the reconstruction errors from the completion step can deteriorate the performance of the feature extraction step. Although gHOI + LRANTD/MPCA and HaLRTC + PCA/MPCA achieve the second best results in a few cases, TDVM outperforms the “two-step” strategies in all cases, as it extracts informative features directly from observed entries.

5 Conclusion

In this paper, we have proposed an unsupervised feature extraction method, TDVM, which solves the problem of feature extraction for tensors with missing data. TDVM incorporates low-rank Tucker decomposition with feature variance maximization in a unified framework, resulting in low-dimensional informative features extracted directly from observed entries. We have evaluated the proposed method on two real datasets with different pixel/block-based missing entries and applied the extracted features to face recognition. Experimental results have shown the superiority of TDVM in both the pixel-based and block-based missing cases, where the proposed method consistently outperforms the nine competing methods in all cases, especially on data with more missing entries. Moreover, TDVM is not sensitive to its parameters and is more efficient than the compared methods.