
1 Introduction

The general goal of data analysis is to extract previously unknown information from a given dataset. Many data analysis tasks, such as pattern recognition, classification, clustering, and prognosis, deal with real-world data presented in high-dimensional spaces, and the ‘curse of dimensionality’ is often an obstacle to applying many methods to these tasks.

Fortunately, in many applications, especially in pattern recognition, real high-dimensional data occupy only a very small part of the high-dimensional ‘observation space’ Rp; that is, the intrinsic dimension q of the data is small compared to the dimension p (usually q << p) [1, 2]. Various dimensionality reduction (feature extraction) algorithms, whose goal is to find a low-dimensional parameterization of such high-dimensional data, transform the data into low-dimensional representations (features) while preserving certain chosen, subject-driven data properties [3, 4].

The most popular model for high-dimensional data occupying a small part of the observation space Rp is the Manifold model, according to which the data lie on or near an unknown Data manifold (DM) of known lower dimensionality q < p embedded in the ambient high-dimensional space Rp (the Manifold assumption about high-dimensional data [5]). Typically, this assumption is satisfied for ‘real-world’ high-dimensional data obtained from ‘natural’ sources.

Dimensionality reduction under the manifold assumption about the processed data is usually referred to as Manifold learning [6, 7], whose goal is to construct a low-dimensional parameterization of the DM (global low-dimensional coordinates on the DM) from a finite dataset sampled from the DM. This parameterization produces an Embedding mapping from the DM to a low-dimensional Feature space that should preserve specific properties of the DM determined by a chosen optimized cost function, which defines an ‘evaluation measure’ for the dimensionality reduction and reflects the desired properties of the initial data that should be preserved in their features.

Most manifold learning algorithms include the solution of large-dimensional global optimization problems and, thus, are computationally expensive. Incremental versions of many popular algorithms (Locally Linear Embedding, Isomap, Laplacian Eigenmaps, Local Tangent Space Alignment, Hessian Eigenmaps, etc. [6, 7]), which reduce their computational complexity, have been developed [8–17].

Manifold learning algorithms are usually used as a first key step in the solution of machine learning tasks: the low-dimensional features are used in reduced learning procedures instead of the initial high-dimensional data, avoiding the curse of dimensionality [18]: ‘dimensionality reduction may be necessary in order to discard redundancy and reduce the computational cost of further operations’ [19]. If the low-dimensional features preserve only specific properties of the data, then substantial data losses are possible when the features are used instead of the initial data. To prevent these losses, the features should preserve as much as possible of the information contained in the high-dimensional data [20]; this means the possibility of recovering the initial data from their features with small reconstruction error. Such Manifold reconstruction algorithms result in both the parameterization and the recovery of the unknown DM [21].

Mathematically [22], ‘preserving the important information of the DM’ means that manifold learning algorithms should ‘recover the geometry’ of the DM, and ‘the information necessary for reconstructing the geometry of the manifold is embodied in its Riemannian metric (tensor)’ [23]. Thus, the learning algorithms should accurately recover the Riemannian data manifold, that is, the DM equipped with the Riemannian tensor.

A further requirement on the recovery follows from the necessity of providing a good generalization capability of manifold reconstruction algorithms and preserving the local structure of the DM: the algorithms should preserve the differential structure of the DM, providing proximity between the tangent spaces to the DM and to the Recovered data manifold (RDM) [24]. In Manifold theory [23, 25], the set composed of the manifold points equipped with the tangent spaces at these points is called the Tangent bundle of the manifold; thus, a reconstruction of the DM that also ensures accurate reconstruction of its tangent spaces is referred to as Tangent bundle manifold learning.

The earlier proposed, geometrically motivated Grassmann&Stiefel Eigenmaps algorithm (GSE) [24, 26] solves the Tangent bundle manifold learning problem and recovers the Riemannian tensor of the DM; thus, it solves the Riemannian manifold recovery problem.

The GSE, like most manifold learning algorithms, includes the solution of large-dimensional global optimization problems and, thus, is computationally expensive.

In this paper, we propose an incremental version of the GSE that reduces the solution of the computationally expensive global optimization problems to the solution of a sequence of local optimization problems solved in explicit form.

The rest of the paper is organized as follows. Section 2 contains a strict definition of the Tangent bundle manifold learning problem and describes the main ideas realized in its GSE solution. The proposed incremental version of the GSE is presented in Sect. 3.

2 Tangent Bundle Manifold Learning

2.1 Definitions and Assumptions

Consider an unknown q-dimensional Data manifold with known intrinsic dimension q

$$ \mathbf{M} = \{X = g(y) \in R^p: y \in \mathbf{Y} \subset R^q\} $$

covered by a single chart g and embedded in an ambient p-dimensional space Rp, q < p. The chart g is a one-to-one mapping from an open bounded Coordinate space \( \mathbf{Y} \subset R^q \) to the manifold M = g(Y) with differentiable inverse mapping hg(X) = g−1(X), whose values y = hg(X) ∈ Y give low-dimensional coordinates (representations, features) of the high-dimensional manifold-valued data X.

If the mappings hg(X) and g(y) are differentiable and Jg(y) is the p × q Jacobian matrix of the mapping g(y), then the q-dimensional linear space L(X) = Span(Jg(hg(X))) in Rp is the tangent space to the DM M at the point X ∈ M; hereinafter, Span(H) is the linear space spanned by the columns of an arbitrary matrix H.

The tangent spaces can be considered as elements of the Grassmann manifold Grass(p, q) consisting of all q-dimensional linear subspaces in Rp.

The standard inner product in Rp induces an inner product on the tangent space L(X) that defines the Riemannian metric (tensor) Δ(X) at each manifold point X ∈ M, smoothly varying from point to point; thus, the DM M is a Riemannian manifold (M, Δ).

Let \( \mathbf{X}_n = \{X_1, X_2, \ldots, X_n\} \) be a dataset randomly sampled from the DM M according to a certain (unknown) probability measure whose support coincides with M.

2.2 Tangent Bundle Manifold Learning Definition

The conventional manifold learning problem, usually called the Manifold embedding problem [6, 7], is to construct a low-dimensional parameterization of the DM from the given sample X n; it produces an Embedding mapping \( h: \mathbf{M} \subset R^p \to \mathbf{Y}_h = h(\mathbf{M}) \subset R^q \) from the DM M to the Feature space (FS) Y h ⊂ Rq, q < p, which preserves specific chosen properties of the DM.

A Manifold reconstruction algorithm, which additionally provides a possibility of accurate recovery of the original vectors X from their low-dimensional features y = h(X), includes constructing a Recovering mapping g(y) from the FS Y h to the Euclidean space Rp in such a way that the pair (h, g) ensures the approximate equalities

$$ r_{h,g}(X) \equiv g(h(X)) \approx X \quad \text{for all points } X \in \mathbf{M}. $$
(1)

The mappings (h, g) determine q-dimensional Recovered data manifold

$$ \mathbf{M}_{h,g} = r_{h,g}(\mathbf{M}) = \{r_{h,g}(X) \in R^p: X \in \mathbf{M}\} = \{X = g(y) \in R^p: y \in \mathbf{Y}_h \subset R^q\} $$
(2)

which is embedded in the ambient space Rp, covered by the single chart g, and consists of all recovered values rh,g(X) of manifold points X ∈ M. The proximities (1) imply the manifold proximity M h,g ≈ M, meaning a small Hausdorff distance dH(M h,g, M) between the DM M and the RDM M h,g due to the inequality \( d_H(\mathbf{M}_{h,g}, \mathbf{M}) \le \sup_{X \in \mathbf{M}} |r_{h,g}(X) - X| \).

Let G(y) = Jg(y) be the p × q Jacobian matrix of the mapping g(y), which determines the q-dimensional tangent space Lh,g(X) to the RDM M h,g at the point rh,g(X) ∈ M h,g:

$$ L_{h,g}(X) = \mathrm{Span}(G(h(X))) $$
(3)

The Tangent bundle manifold learning problem is to construct the pair (h, g) of mappings h and g from the given sample X n ensuring both the proximities (1) and the proximities

$$ L_{h,g}(X) \approx L(X) \quad \text{for all points } X \in \mathbf{M}; $$
(4)

the proximities (4) are defined with the use of a certain chosen metric on Grass(p, q).

The matrix G(y) also determines the metric tensor \( \Delta_{h,g}(X) = G^T(h(X)) \times G(h(X)) \) on the RDM M h,g, which is the q × q matrix consisting of the inner products {(Gi(h(X)), Gj(h(X)))} between the ith and jth columns Gi(h(X)) and Gj(h(X)) of the matrix G(h(X)). Thus, the pair (h, g) determines the Recovered Riemannian manifold (M h,g, Δh,g) that accurately approximates the initial Riemannian data manifold (M, Δ).
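As a small illustration of the last point, the recovered metric tensor is simply the Gram matrix of the Jacobian columns. The following minimal Python sketch (the function name and the assumption that a p × q Jacobian G is available are ours, not from the paper) computes Δh,g(X) = GT(h(X)) × G(h(X)):

```python
import numpy as np

def recovered_metric_tensor(G: np.ndarray) -> np.ndarray:
    """Return the q x q metric tensor Delta_{h,g}(X) = G^T G, i.e. the Gram
    matrix of inner products between the columns of the Jacobian G(h(X))."""
    return G.T @ G

# Usage (illustrative): if G has shape (p, q), recovered_metric_tensor(G) has shape (q, q).
```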

2.3 Grassmann&Stiefel Eigenmaps: An Approach

The Grassmann&Stiefel Eigenmaps algorithm gives a solution to the Tangent bundle manifold learning problem and consists of three successively performed parts: Tangent manifold learning, Manifold embedding, and Manifold recovery.

Tangent Manifold Learning Part.

A sample-based family H consisting of p × q matrices H(X) smoothly depending on X ∈ M is constructed to meet the relations

$$ L_H(X) \equiv \mathrm{Span}(H(X)) \approx L(X) \quad \text{for all } X \in \mathbf{M} $$
(5)

in a certain chosen metric on the Grassmann manifold. In the next steps, the mappings h and g will be built in such a way that both the equalities (1) and

$$ G(h(X)) \approx H(X) \quad \text{for all points } X \in \mathbf{M} $$
(6)

are fulfilled. Hence, the linear space LH(X) (5) approximates the tangent space Lh,g(X) (3) to the RDM M h,g at the point rh,g(X).

Manifold Embedding Part.

Given the family H already constructed, the embedding mapping y = h(X) is constructed as follows. The Taylor series expansions

$$ g(h(X')) - g(h(X)) \approx G(h(X)) \times (h(X') - h(X)) $$
(7)

of the mapping g at near points h(X′), h(X) ∈ Y h, under the desired approximate equalities (1) and (6) for the mappings h and g to be specified further, imply the equalities:

$$ X' - X \approx H(X) \times (h(X') - h(X)) $$
(8)

for near points X, X′ ∈ M. These equations, considered further as regression equations, allow constructing the embedding mapping h and the FS Y h = h(M).

Manifold Recovery Part.

Given the family H and the mapping h(X) already constructed, the expansion (7), under the desired proximities (1) and (6), implies the relation

$$ g(y) \approx X + H(X) \times (y - h(X)) $$
(9)

for near points y, h(X) ∈ Y h, which is used for constructing the mapping g.

2.4 Grassmann&Stiefel Eigenmaps: Some Details

Details of the GSE are presented below. The numbers {εi > 0} denote the algorithm’s parameters, whose values are chosen depending on the sample size n (εi = εi,n) and tend to zero as n → ∞ at the rate O(n−1/(q+2)).

Step S1: Neighborhoods (Construction and Description).

The necessary preliminary calculations are performed at the first step S1.

Euclidean Kernel.

Introduce the Euclidean kernel KE(X, X′) = I{|X′ – X| < ε1} on the DM at points X, X′ ∈ M; here I{·} is the indicator function.

Grassmann Kernel.

Applying Principal Component Analysis (PCA) [27] to the points from the set Un(X, ε1) = {X′ ∈ X n: |X′ – X| < ε1} ∪ {X} results in a p × q orthogonal matrix QPCA(X) whose columns are the PCA principal eigenvectors corresponding to the q largest PCA eigenvalues. These matrices determine q-dimensional linear spaces LPCA(X) = Span(QPCA(X)) in Rp which, under certain conditions, approximate the tangent spaces L(X):

$$ L_{\mathrm{PCA}}(X) \approx L(X). $$
(10)

In what follows, we assume that the sample size n is large enough to ensure a positive value of the qth PCA eigenvalue at the sample points and to provide the proximities (10). To provide a trade-off between the ‘statistical error’ (depending on the number n(X) of sample points in the set Un(X, ε1)) and the ‘curvature error’ (caused by the deviation of the manifold-valued points from the linear space assumed in the PCA) in (10), the ball radius ε1 should tend to 0 as n → ∞ at the rate O(n−1/(q+2)), providing, with high probability, the order O(n−1/(q+2)) for the error in (10) [28, 29]; here ‘an event occurs with high probability’ means that its probability exceeds the value (1 – Cα/nα) for any n and α > 0, where the constant Cα depends only on α.
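A minimal sketch of this local PCA step is given below; it assumes the sample is stored as an (n × p) array and uses an eigendecomposition of the local covariance matrix. The function and variable names are illustrative, not taken from any reference implementation, and centering at the local mean is one common choice rather than a prescription of the paper.

```python
import numpy as np

def local_pca_basis(X_n: np.ndarray, X: np.ndarray, eps1: float, q: int) -> np.ndarray:
    """Return a p x q orthonormal matrix Q_PCA(X) spanning the PCA-estimated
    tangent space L_PCA(X), built from the sample points in the eps1-ball around X."""
    U = X_n[np.linalg.norm(X_n - X, axis=1) < eps1]   # points of U_n(X, eps1)
    U = np.vstack([U, X[None, :]])                    # include X itself
    Uc = U - U.mean(axis=0)                           # local centering (one common choice)
    C = Uc.T @ Uc                                     # p x p local covariance (unnormalized)
    eigval, eigvec = np.linalg.eigh(C)                # eigenvalues in ascending order
    return eigvec[:, ::-1][:, :q]                     # q principal eigenvectors
```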

The Grassmann kernel KG(X, X′) on the DM at points X, X′ ∈ M is defined as

$$ K_G(X, X') = I\{d_{BC}(L_{\mathrm{PCA}}(X), L_{\mathrm{PCA}}(X')) < \varepsilon_2\} \times K_{BC}(L_{\mathrm{PCA}}(X), L_{\mathrm{PCA}}(X')) $$

using the Binet-Cauchy kernel KBC(LPCA(X), LPCA(X′)) = Det2[S(X, X′)] and the Binet-Cauchy metric dBC(LPCA(X), LPCA(X′)) = {1 − Det2[S(X, X′)]}1/2 on the Grassmann manifold Grass(p, q) [30, 31]; here S(X, X′) = \( Q_{\mathrm{PCA}}^T(X) \times Q_{\mathrm{PCA}}(X') \).

The p × p matrix \( \pi_{\mathrm{PCA}}(X) = Q_{\mathrm{PCA}}(X) \times Q_{\mathrm{PCA}}^T(X) \) is the orthogonal projector onto the linear space LPCA(X); it approximates the projection matrix π(X) onto the tangent space L(X).

Aggregate Kernel.

Introduce the kernel K(X, X′) = KE(X, X′) × KG(X, X′) as the product of the Euclidean and Grassmann kernels; it reflects not only the geometric nearness between the points X and X′ but also the nearness between the linear spaces LPCA(X) and LPCA(X′) (and thus, by (10), the nearness between the tangent spaces L(X) and L(X′)).
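The three kernels can be sketched in a few lines of Python; here Q1 and Q2 stand for the matrices QPCA(X) and QPCA(X′) produced by the local PCA step, and the function names are illustrative.

```python
import numpy as np

def euclidean_kernel(X, X2, eps1):
    """K_E(X, X') = I{|X' - X| < eps1}."""
    return float(np.linalg.norm(X2 - X) < eps1)

def grassmann_kernel(Q1, Q2, eps2):
    """Binet-Cauchy kernel K_BC = det^2(Q1^T Q2), thresholded by the
    Binet-Cauchy metric d_BC = sqrt(1 - det^2(Q1^T Q2))."""
    k_bc = np.linalg.det(Q1.T @ Q2) ** 2
    d_bc = np.sqrt(max(1.0 - k_bc, 0.0))
    return k_bc if d_bc < eps2 else 0.0

def aggregate_kernel(X, X2, Q1, Q2, eps1, eps2):
    """K(X, X') = K_E(X, X') * K_G(X, X')."""
    return euclidean_kernel(X, X2, eps1) * grassmann_kernel(Q1, Q2, eps2)
```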

Step S2: Tangent Manifold Learning.

The matrices H(X) will be constructed to meet the equalities LH(X) = LPCA(X) for all points X ∈ M, which implies the representation

$$ H(X) = Q_{\mathrm{PCA}}(X) \times v(X), $$
(11)

in which the q × q matrices v(X) should provide a smooth dependence of H(X) on the point X.

At first, the p × q matrices {Hi = QPCA(Xi) × vi} are constructed to minimize the form

$$ \Delta_{H,n} = \frac{1}{2}\sum\nolimits_{i,j=1}^{n} K(X_i, X_j) \times \|H_i - H_j\|_F^2 $$
(12)

over the q × q matrices v1, v2, …, vn, under the normalizing constraint

$$ \sum\nolimits_{i=1}^{n} K(X_i) \times (H_i^T \times H_i) = \sum\nolimits_{i=1}^{n} K(X_i) \times (v_i^T \times v_i) = K \times I_q $$
(13)

used to avoid a degenerate solution; here \( K(X) = \sum\nolimits_{j=1}^{n} K(X, X_j) \) and \( K = \sum\nolimits_{i=1}^{n} K(X_i) \).

The quadratic form (12) and the constraint (13) take the forms (K – Tr(VT × Φ × V)) and VT × F × V = K × Iq, respectively; here V is the (nq) × q matrix whose transpose consists of the consecutively written transposed q × q matrices v1, v2, …, vn, and Φ = ||Φij|| and F = ||Fij|| are nq × nq matrices consisting, respectively, of the q × q matrices

$$ \{\Phi_{ij} = K(X_i, X_j) \times S(X_i, X_j)\} \text{ and } \{F_{ij} = \delta_{ij} \times K(X_i) \times I_q\}. $$

Thus, the minimization of (12) under (13) is reduced to the generalized eigenvector problem

$$ \Phi \times V = \lambda \times F \times V, $$
(14)

and the (nq) × q matrix V, whose columns V1, V2, …, Vq ∈ Rnq are the orthonormal eigenvectors corresponding to the q largest eigenvalues in the problem (14), determines the required q × q matrices v1, v2, …, vn.
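A minimal sketch of this reduction is shown below, assuming the kernel matrix K[i, j] = K(Xi, Xj) and the blocks S[i][j] = S(Xi, Xj) have been precomputed; the dense construction, the function name, and the SciPy call are for illustration only, and the rescaling of the eigenvectors needed to satisfy (13) exactly is omitted.

```python
import numpy as np
from scipy.linalg import eigh

def solve_tangent_alignment(K: np.ndarray, S: list, q: int) -> list:
    """Build the nq x nq matrices Phi and F of problem (14) and return the
    q x q matrices v_1, ..., v_n taken from the top-q generalized eigenvectors."""
    n = K.shape[0]
    Kdiag = K.sum(axis=1)                              # K(X_i)
    Phi = np.zeros((n * q, n * q))
    F = np.zeros((n * q, n * q))
    for i in range(n):
        F[i*q:(i+1)*q, i*q:(i+1)*q] = Kdiag[i] * np.eye(q)
        for j in range(n):
            Phi[i*q:(i+1)*q, j*q:(j+1)*q] = K[i, j] * S[i][j]
    # eigh solves Phi v = lambda F v (Phi symmetric, F positive definite),
    # returning eigenvalues in ascending order; keep the q largest.
    w, V = eigh(Phi, F)
    V_top = V[:, -q:]
    return [V_top[i*q:(i+1)*q, :] for i in range(n)]   # v_1, ..., v_n
```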

The value H(X) (11) at an arbitrary point X ∈ M is chosen to minimize the form

$$ d_{H,n}(H) = \sum\nolimits_{j=1}^{n} K(X, X_j) \times \|Q_{\mathrm{PCA}}(X) \times v(X) - H_j\|_F^2 $$
(15)

over v(X) under the condition Span(H) = LPCA(X), whose solution is

$$ H(X) = Q_{\mathrm{PCA}}(X) \times v(X) = Q_{\mathrm{PCA}}(X) \times \frac{1}{K(X)}\sum\nolimits_{j=1}^{n} K(X, X_j) \times S(X, X_j) \times v_j. $$
(16)

It follows from the above formulas that the q × p matrix

$$ G_h(X) = H^{-}(X) \times \pi_{\mathrm{PCA}}(X) = v^{-1}(X) \times Q_{\mathrm{PCA}}^{T}(X) $$

estimates the Jacobian matrix Jh(X) of the Embedding mapping h(X) constructed afterwards; here \( H^{-}(X) \) is the q × p Moore-Penrose pseudoinverse of the p × q matrix H(X) [32].
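The out-of-sample formula (16) and the Jacobian estimate Gh(X) can be sketched as follows, reusing the illustrative helpers local_pca_basis and aggregate_kernel from the sketches above and assuming the lists Qs = {QPCA(Xi)} and vs = {vi} are already available; the function name is ours.

```python
import numpy as np

def H_and_Gh(X, X_n, Qs, vs, eps1, eps2, q):
    """Return H(X) by formula (16) and the Jacobian estimate G_h(X) = v^{-1}(X) Q_PCA^T(X)."""
    Q = local_pca_basis(X_n, X, eps1, q)                # Q_PCA(X)
    num, den = np.zeros((q, q)), 0.0
    for Xj, Qj, vj in zip(X_n, Qs, vs):
        k = aggregate_kernel(X, Xj, Q, Qj, eps1, eps2)  # K(X, X_j)
        num += k * (Q.T @ Qj) @ vj                      # K(X, X_j) * S(X, X_j) * v_j
        den += k
    v = num / den                                       # v(X); assumes den > 0
    H = Q @ v                                           # H(X), formula (16)
    Gh = np.linalg.inv(v) @ Q.T                         # G_h(X)
    return H, Gh
```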

Step S3: Manifold Embedding.

The embedding mapping h(X) with already known (estimated) Jacobian Gh(X) is constructed to meet the equalities (8) written for all pairs of near points X, X′ ∈ M, which can be considered as regression equations.

At first, the vector set {h1, h2, …, hn} ⊂ Rq is computed as the standard least squares solution of this regression problem by minimizing the residual

$$ \Delta_{h,n} = \sum\nolimits_{i,j=1}^{n} K(X_i, X_j) \times |X_j - X_i - H_i \times (h_j - h_i)|^2 $$
(17)

over the vectors h1, h2, …, hn under the normalizing condition h1 + h2 + … + hn = 0.
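A minimal (and deliberately naive, dense) sketch of this weighted least-squares problem is given below; it stacks one regression equation per kernel-connected pair and appends the centering condition as extra rows, which the least-squares solution satisfies exactly because the objective (17) is invariant to a common shift of all hi. The function name and the dense design matrix are illustrative only, not the efficient normal-equation solution used in practice.

```python
import numpy as np

def embed_sample(X_n: np.ndarray, Hs: list, K: np.ndarray, q: int) -> np.ndarray:
    """Return the n x q array of features h_1, ..., h_n minimizing (17)
    under h_1 + ... + h_n = 0."""
    n, p = X_n.shape
    rows, rhs = [], []
    for i in range(n):
        for j in range(n):
            if i != j and K[i, j] > 0:
                w = np.sqrt(K[i, j])
                A = np.zeros((p, n * q))
                A[:, j*q:(j+1)*q] = w * Hs[i]           # + H_i h_j
                A[:, i*q:(i+1)*q] = -w * Hs[i]          # - H_i h_i
                rows.append(A)
                rhs.append(w * (X_n[j] - X_n[i]))
    rows.append(np.tile(np.eye(q), (1, n)))             # centering: sum_i h_i = 0
    rhs.append(np.zeros(q))
    h, *_ = np.linalg.lstsq(np.vstack(rows), np.concatenate(rhs), rcond=None)
    return h.reshape(n, q)
```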

Then, considering the obtained vectors {hj} as preliminary values of the mapping h(X) at the sample points, choose the value

$$ h(X) = \frac{1}{K(X)}\sum\nolimits_{i=1}^{n} K(X, X_i) \times \{h_i + G_h(X) \times (X - X_i)\} $$
(18)

for an arbitrary point X ∈ M as the result of minimizing over h the residual

$$ d_{h,n}(h) = \sum\nolimits_{j=1}^{n} K(X, X_j) \times |X_j - X - H(X) \times (h_j - h)|^2. $$
(19)

The mapping (18) determines the Feature sample Y h,n = {yh,i = h(Xi), i = 1, 2, …, n}.
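The out-of-sample embedding (18) then becomes a kernel-weighted average, sketched below with the illustrative helpers from the previous sketches (H_and_Gh, local_pca_basis, aggregate_kernel) and the sample features hs = {hi}; names are ours.

```python
import numpy as np

def embed_point(X, X_n, Qs, vs, hs, eps1, eps2, q):
    """Return h(X) by formula (18)."""
    _, Gh = H_and_Gh(X, X_n, Qs, vs, eps1, eps2, q)     # estimated Jacobian G_h(X)
    Q = local_pca_basis(X_n, X, eps1, q)                # Q_PCA(X) for the kernel values
    num, den = np.zeros(q), 0.0
    for Xi, Qi, hi in zip(X_n, Qs, hs):
        k = aggregate_kernel(X, Xi, Q, Qi, eps1, eps2)  # K(X, X_i)
        num += k * (hi + Gh @ (X - Xi))
        den += k
    return num / den                                    # assumes den > 0
```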

Step S4: Manifold Recovery.

A kernel on the FS Y h and, then, the recovering mapping g(y) and its Jacobian matrix G(y) are constructed in this step.

Kernel on the Feature Space.

It follows from (8) that proximities

$$ |X - X_i| \approx d(y, y_{h,i}) = \{(y - y_{h,i})^T \times [H^T(X_i) \times H(X_i)] \times (y - y_{h,i})\}^{1/2} $$

hold true for near points y = h(X) and yh,i ∈ Y h,n. Let uE(y, ε1) = {yh,i: d(y, yh,i) < ε1} be a neighborhood of the feature y = h(X) consisting of the sample features that are images of the sample points from Un(X, ε1).

Applying the PCA to the set h−1(uE(y, ε1)) = {Xi: yh,i ∈ uE(y, ε1)} results in the linear space LPCA*(y) ∈ Grass(p, q), which meets the proximity LPCA*(h(X)) ≈ LPCA(X).

Introduce the feature kernel k(y, yh,i) = I{yh,i ∈ uE(y, ε1)} × KG(LPCA*(y), LPCA*(yh,i)), which meets the equalities k(h(X), h(X′)) ≈ K(X, X′) for near points X ∈ M and X′ ∈ X n.

Constructing the Recovering Mapping and its Jacobian.

The matrix G(y), which should meet both the conditions (6) and the constraint Span(G(y)) = LPCA*(y), is chosen by minimizing the quadratic form \( \sum\nolimits_{j=1}^{n} k(y, y_{h,j}) \times \|G(y) - H_j\|_F^2 \) over G; this results in

$$ G(y) = \pi^{*}(y) \times \frac{1}{k(y)}\sum\nolimits_{j=1}^{n} k(y, y_{h,j}) \times H_j, $$
(20)

here π*(y) is the projector onto the linear space LPCA*(y) and \( k(y) = \sum\nolimits_{j=1}^{n} k(y, y_{h,j}) \).

Based on the expansions (9) written for the features yh,j ∈ uE(y, ε1), g(y) is chosen by minimizing the quadratic form \( \sum\nolimits_{j=1}^{n} k(y, y_{h,j}) \times |X_j - g(y) - G(y) \times (y_{h,j} - y)|^2 \) over g; thus

$$ g(y) = \frac{1}{k(y)}\sum\nolimits_{j=1}^{n} k(y, y_{h,j}) \times \{X_j + G(y) \times (y - y_{h,j})\}. $$
(21)

The constructed mappings (18), (21) allow recovering the DM M and its tangent spaces L(X) by the formulas (2) and (4).
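A minimal sketch of formulas (20) and (21) is given below, assuming the feature-kernel values k[j] = k(y, yh,j), the matrices Hs[j] = Hj, the feature sample y_h, and the projector Pstar onto LPCA*(y) have been precomputed; the function and variable names are illustrative.

```python
import numpy as np

def recover_point(y, y_h, X_n, Hs, k, Pstar):
    """Return g(y) by formula (21) and G(y) by formula (20)."""
    ksum = k.sum()                                       # k(y); assumed positive
    # (20): G(y) = pi*(y) x (1/k(y)) sum_j k(y, y_hj) H_j
    G = Pstar @ sum(kj * Hj for kj, Hj in zip(k, Hs)) / ksum
    # (21): g(y) = (1/k(y)) sum_j k(y, y_hj) { X_j + G(y) (y - y_hj) }
    g = sum(kj * (Xj + G @ (y - yj)) for kj, Xj, yj in zip(k, X_n, y_h)) / ksum
    return g, G
```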

2.5 Grassmann&Stiefel Eigenmaps: Some Properties

As n → ∞, with ε1 = O(n−1/(q+2)), the relation dH(M h,g, M) = O(n−2/(q+2)) holds true with high probability [33]. This rate coincides with the asymptotically minimax lower bound for the Hausdorff distance dH(M h,g, M) [34]; thus, the RDM M h,g estimates the DM M with the optimal rate of convergence.

The main computational complexity of the GSE algorithm lies in the second and third steps, in which global high-dimensional optimization problems are solved.

The first problem is the generalized eigenvector problem (14) with nq × nq matrices F and Φ. This problem is usually solved using the Singular Value Decomposition (SVD) [32], whose computational complexity is O(n3) [35].

The second problem is the regression problem (17) for an nq-dimensional estimated vector. This problem is reduced to the solution of the linear least-squares normal equations with an nq × nq matrix, whose computational complexity is also O(n3) [32].

Thus, the GSE has total computational complexity O(n3) and is computationally expensive for large sample size n.

3 Incremental Grassmann&Stiefel Eigenmaps

The incremental version of the GSE divides the most computationally expensive generalized eigenvector and regression problems into n local optimization procedures, each of which (at step k) is solved in explicit form for only one new variable (the matrix Hk and the feature hk), k = 1, 2, …, n.

The proposed incremental version includes an additional preliminary step S1+, performed after step S1, in which a weighted undirected sample graph Г(X n) with the sample points {Xi} as nodes is constructed and the shortest paths between an arbitrary node chosen as the origin of the graph and all the other nodes are calculated.

The second and third steps S2 and S3 are replaced by a common incremental step S2–3 in which the matrices {Hk} and features {hk} are computed sequentially at the graph nodes, moving along the shortest paths starting from the chosen origin of the graph. Step S4 of the GSE remains unchanged in the incremental version.

3.1 Step S1+: Sample Graph

Introduce a weighted undirected sample graph Г(X n) with the sample points {Xi} as nodes. The edges of Г(X n) connect the nodes Xi and Xj if and only if K(Xi, Xj) > 0; the length of such an edge (Xi, Xj) equals |Xi – Xj|/K(Xi, Xj).

Choose an arbitrary node X(1) ∈ Г(X n) as the origin of the graph. Using Dijkstra’s algorithm [36], compute the shortest paths between the chosen node and all the other nodes X(2), X(3), …, X(n), written in ascending order of the lengths of the shortest paths from the origin X(1). Denote by Гk the subgraph consisting of the nodes {X(1), X(2), …, X(k)} and the edges connecting them.
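Step S1+ can be sketched with SciPy’s sparse-graph Dijkstra routine, assuming the aggregate kernel matrix K[i, j] = K(Xi, Xj) has already been computed; the dense weight matrix and the function name are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra

def order_nodes(X_n: np.ndarray, K: np.ndarray, origin: int = 0) -> np.ndarray:
    """Return the node indices X_(1), X_(2), ... ordered by shortest-path length
    from the chosen origin in the sample graph Gamma(X_n)."""
    n = X_n.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and K[i, j] > 0:                  # edge exists iff K(X_i, X_j) > 0
                W[i, j] = np.linalg.norm(X_n[i] - X_n[j]) / K[i, j]   # edge length
    dist = dijkstra(csr_matrix(W), directed=False, indices=origin)
    return np.argsort(dist)
```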

Note.

The origin X(1) can be chosen as a node with minimal eccentricity; the eccentricity of a node equals the maximum of the lengths of the shortest paths between the node under consideration and all the other nodes. However, such a construction requires the shortest paths between all pairs of nodes in the graph Г(X n), i.e., applying Dijkstra’s algorithm n times.

3.2 Step S2–3: Incremental Tangent Manifold Learning and Manifold Embedding

The incremental version computes sequentially the matrices H(X) and features h(X) at the points X(1), X(2), …, X(n), starting from the matrix H(1) and the feature h(1) (initialization). Thus, step S2–3 consists of n substeps {S2–3k, k = 1, 2, …, n}, of which the first is the initialization substep:

Initialization substep S2–31.

Put v(1) = Iq and h(1) = 0; thus, H(X(1)) = QPCA(X(1)).

At the k-th substep S2–3k, k > 1, when the matrices H(j), j < k, have already been computed, the quadratic form ΔH,k, similar to the form (12) but written only for the points Xi, Xj ∈ Гk, is minimized over the single unknown matrix H(k) = QPCA(X(k)) × v(k). This problem, in turn, is reduced to a minimization over v(k) of the form dH,k(H(k)), similar to the form dH,n(H(k)) (15) but written only for the points Xj ∈ Гk−1. Its solution v(k), similar to the solution (16), is written in explicit form.

Let Δh,k be a quadratic form similar to the form Δh,n (17) but written only for the points Xi, Xj ∈ Гk. The value h(k), given the already computed values h(j), j < k, is calculated by minimizing the quadratic form Δh,k over the single vector h(k). This problem, in turn, is reduced to a minimization over h(k) of the form dh,k(h(k)), similar to the form dh,n(h(k)) (19) but written only for the points Xj ∈ Гk−1; its solution, similar to the solution (18), is also written in explicit form.

Thus, the substeps S2–3k, k = 1, 2, …, n, are:

Typical substep S2–3k, 1 < k ≤ n.

Given {(H(j), h(j)), j < k} already obtained, put

$$ H_{(k)} = Q_{\mathrm{PCA}}(X_{(k)}) \times v_{(k)} = Q_{\mathrm{PCA}}(X_{(k)}) \times \frac{\sum\nolimits_{j<k} K(X_{(k)}, X_{(j)}) \times S(X_{(k)}, X_{(j)}) \times v_{(j)}}{\sum\nolimits_{j<k} K(X_{(k)}, X_{(j)})}, $$
(22)
$$ h_{(k)} = \frac{\sum\nolimits_{j<k} K(X_{(k)}, X_{(j)}) \times \{h_{(j)} + v_{(k)}^{-1} \times Q_{\mathrm{PCA}}^{T}(X_{(k)}) \times (X_{(k)} - X_{(j)})\}}{\sum\nolimits_{j<k} K(X_{(k)}, X_{(j)})}. $$
(23)

Given {(H(k), h(k)), k = 1, 2, …, n}, the values H(X) = QPCA(X) × v(X) and h(X) at an arbitrary point X ∈ M are calculated using formulas (16) and (18), respectively.
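A minimal sketch of the typical substep (formulas (22) and (23)) is given below; it assumes the nodes have already been reordered along the shortest paths and that K[k, j] = K(X(k), X(j)), S[k][j] = S(X(k), X(j)) and Qs[k] = QPCA(X(k)) are precomputed, with zero-based indexing and illustrative names.

```python
import numpy as np

def incremental_substep(k, X_ord, K, S, Qs, vs, hs, q):
    """Return v_(k), H_(k) (formula (22)) and h_(k) (formula (23)), given the
    already processed nodes j < k; the first node is initialized with v = I_q, h = 0."""
    num_v, den = np.zeros((q, q)), 0.0
    for j in range(k):
        num_v += K[k, j] * S[k][j] @ vs[j]
        den += K[k, j]
    v_k = num_v / den                                   # assumes den > 0
    H_k = Qs[k] @ v_k                                   # formula (22)
    Gh_k = np.linalg.inv(v_k) @ Qs[k].T                 # v_(k)^{-1} Q_PCA^T(X_(k))
    num_h = np.zeros(q)
    for j in range(k):
        num_h += K[k, j] * (hs[j] + Gh_k @ (X_ord[k] - X_ord[j]))
    h_k = num_h / den                                   # formula (23)
    return v_k, H_k, h_k
```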

3.3 Incremental GSE: Properties

Computational Complexity.

The incremental GSE works mainly with the sample data lying in a neighborhood of some point X, namely in the set Un(X, ε1) of sample points falling into the ε1-ball centered at X. The number n(X) of sample points falling into this ball, under ε1 = ε1,n = O(n−1/(q+2)), with high probability equals n × O(n−q/(q+2)) = O(n2/(q+2)) uniformly in X ∈ M [37].

The sample graph Г(X n) consists of V = n nodes and E edges connecting the graph nodes {Xk}. Each node Xk is connected with no more than n(Xk) other nodes; thus E < 0.5 × n × maxk n(Xk) = O(n(q+4)/(q+2)) and, hence, Г(X n) is a sparse graph.

The running time of Dijkstra’s algorithm (step S1+), which computes the shortest paths in the sparse connected graph Г(X n), is O(E × ln V) = O(n(q+4)/(q+2) × ln n) in the worst case; using a Fibonacci heap improves this rate to O(E + V × ln V) = O(n(q+4)/(q+2)) [38].

The running time of the k-th step S2–3k (formulas (22) and (23)) is proportional to n(Xk); thus, the total running time of step S2–3 is n × O(n2/(q+2)) = O(n(q+4)/(q+2)).

Therefore, the running time of the incremental version of the GSE is O(n(q+4)/(q+2)), in contrast to the running time O(n3) of the original GSE.

Accuracy.

It follows from (18) and (21) that X − rh,g(X) ≈ \( (\pi_{\mathrm{PCA}}^{T}(X) \times e(X)) \times |\delta(X)| \), in which \( \delta(X) = X - \frac{1}{K(X)}\sum\nolimits_{i=1}^{n} K(X, X_i) \times X_i \) and e(X) = δ(X)/|δ(X)|. The first and second multipliers are majorized by the PCA error in (10) and by ε1,n, respectively, each of which has rate O(n−1/(q+2)). Thus, the reconstruction error (X − rh,g(X)) in the incremental GSE has the same asymptotically optimal rate O(n−2/(q+2)) as in the original GSE.

4 Conclusion

An incremental version of the Grassmann&Stiefel Eigenmaps algorithm, which constructs low-dimensional representations of high-dimensional data with asymptotically minimal reconstruction error, is proposed. This version has the same optimal convergence rate O(n−2/(q+2)) of the reconstruction error and a significantly smaller computational complexity in the sample size n: the running time of the incremental version is O(n(q+4)/(q+2)), in contrast to O(n3) for the original algorithm.