
1 Introduction

A hyperspectral image (HSI) usually consists of hundreds of spectral bands ranging from the visible to the infrared spectrum [1]. Each pixel of an HSI can be represented by a high-dimensional spectral vector. It is this rich spectral information that has not only attracted the attention of the remote sensing community but also aroused great interest in other fields, for instance, military [2], agriculture [3], urban planning, and environmental monitoring [4]. Classification plays a crucial role in these fields. However, HSI contains a large amount of irrelevant or redundant data, which causes a number of issues, including significantly increased computation time and computational complexity as well as degraded classification performance, especially when the training data are limited. A number of classical dimensionality reduction (DR) algorithms have been explored to address these issues.

One classic linear DR method is principal component analysis (PCA) [5], but as an unsupervised method, PCA does not take advantage of class label information. Another classic linear DR method is linear discriminant analysis (LDA) [6]; as a supervised method, it often suffers from the small sample size problem. The biggest disadvantage of these linear methods is their failure to discover the nonlinear structure inherent in HSI.

Since nonlinear techniques have the merit of preserving the geometrical structure of the data manifold, they can overcome the above problems. Laplacian eigenmaps (LE) [7], locally linear embedding (LLE) [8], and other manifold learning algorithms have been successfully applied to DR for HSI. Besides, locality preserving projection (LPP) [9], a linear version of LE, has been introduced. To overcome the tendency of LDA to produce undesirable results when the samples within a class follow multimodal non-Gaussian distributions [10], local Fisher’s discriminant analysis (LFDA) [11], which combines the advantages of LDA and LPP, was introduced. Later, unlike LPP, which uses only one graph to describe the geometry of the samples, the local discriminant embedding (LDE) [12] method was proposed, which uses two graphs to characterize the geometric structure of the samples: an intrinsic graph to characterize the compactness of samples within a class, and a penalty graph to describe the separation between classes. Thus, LDE is more discriminative than LPP. The advantage of LDE is that it keeps the intrinsic neighbor relations of data from the same class while pushing data from different classes away from each other. However, one thing these methods have in common is that the affinity matrix is computed from the K nearest neighborhood, which is sensitive to outlier samples.

To overcome this problem, a graph embedding (GE) framework [13] was proposed. To represent the sparse nature of the samples, sparse graph embedding (SGE) [14] was developed. Later, a sparse graph-based discriminant analysis (SGDA) [15] model was developed by exploiting the class label information, resulting in better performance than SGE. Building on SGDA, sparse and low-rank graph discriminant analysis (SLGDA) [16] was proposed by incorporating local information of the samples. Recently, by taking into account the shape of the spectral curves across bands, a graph-based discriminant analysis with spectral similarity (GDA-SS) [17] method was proposed.

Each pixel of an HSI is a high-dimensional spectral vector that directly records the spectral reflectance of the targets in different bands. Under ideal conditions, the same targets should have the same spectral characteristics. Nevertheless, in the real world HSI is easily influenced by environmental changes (e.g., atmosphere and illumination) and instrument problems (e.g., sensor noise). Moreover, the K nearest neighborhood based on the Euclidean distance, which is usually used to compute the similarity between two vectors, is highly susceptible to interference from outliers. These factors may lead to inaccurate graph construction and poor classification performance. Inspired by the region covariance descriptor in [18] and the superiority of second-order statistics for representing data, a novel modified local discriminant embedding (MLDE) is proposed by constructing neighborhoods in a new spectral feature space instead of the original space. We use variance to characterize the similarity of pixels within the same class and covariance to characterize the separation between pixels of different classes. Considering that covariance matrices are symmetric positive definite and lie on a Riemannian manifold, the Log-Euclidean metric is used to capture the similarity, which works better than the Euclidean distance. The main contributions of this paper are summarized as follows: (a) The combination of variance and covariance brings data points of the same class closer together and increases the separation between data points of different classes, which enhances the classification performance of HSI. (b) Representing the data by variance and covariance attenuates the effects of noise, so the method handles noise in HSI better. (c) The Log-Euclidean metric provides a more accurate similarity evaluation than the Euclidean distance and better expresses the characteristics of the spectral information.

2 Related Work

2.1 Local Discriminant Embedding (LDE)

Assume a hyperspectral dataset with N samples is denoted as \(X=\{x_{i}\}_{i=1}^N\), where each sample lies in an \(\mathbb {R}^{m\times 1}\) feature space and m is the number of bands. The class labels are \(y_{i}\in \{1, 2, \ldots , C\}\), where C is the number of classes.

LDE, which was designed for manifold learning and pattern classification, seeks an optimal projection matrix by considering both the class label information of the data points and the local neighborhood information between data points. Specifically, the LDE algorithm can be described as follows.

Step 1: Construct neighborhood graphs. An intrinsic graph G and a penalty graph \(G^{\prime }\) are constructed from the K nearest neighborhood (KNN) over all the data points.

Step 2: Compute affinity weights. The affinity matrix W of the intrinsic graph G and the affinity matrix \(W^{\prime }\) of the penalty graph \(G^{\prime }\) are computed as follows:

$$\begin{aligned} w_{ij} = {\left\{ \begin{array}{ll} exp(-||x_{i} - x_{j}||^{2} / t) &{} x_{j}\in O(K, x_{i}) \\ &{}\text {or }x_{i}\in O(K, x_{j}) \\ &{}\text {and }y_{i} = y_{j}; \\ 0 &{}\text {otherwise} \end{array}\right. } \end{aligned}$$
(1)

and

$$\begin{aligned} w'_{ij} = {\left\{ \begin{array}{ll} exp(-||x_{i} - x_{j}||^{2} / t) &{} x_{j}\in O(K, x_{i}) \\ &{}\text {or } x_{i}\in O(K, x_{j}) \\ &{}\text {and }y_{i} \not = y_{j}; \\ 0 &{}\text {otherwise} \end{array}\right. } \end{aligned}$$
(2)

where \(O(K,x_{i})\) denotes the K nearest neighbors of data point \(x_{i}\) and t is a kernel width parameter.
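As an illustration only, a minimal NumPy sketch of the two affinity matrices of Eqs. (1) and (2) is given below (not the authors' code; the function name and the dense distance computation are our own choices).

```python
import numpy as np

def lde_affinities(X, y, K=12, t=1.0):
    """Intrinsic (W) and penalty (W') affinities of Eqs. (1)-(2).

    X : (N, m) array of spectral vectors, y : (N,) integer class labels.
    """
    N = X.shape[0]
    # Pairwise squared Euclidean distances between all spectral vectors.
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # knn[i, j] is True when x_j is among the K nearest neighbours of x_i.
    order = np.argsort(d2, axis=1)
    knn = np.zeros((N, N), dtype=bool)
    rows = np.repeat(np.arange(N), K)
    knn[rows, order[:, 1:K + 1].ravel()] = True       # column 0 is the point itself
    neigh = knn | knn.T                               # x_j in O(K, x_i) or x_i in O(K, x_j)
    same = y[:, None] == y[None, :]
    heat = np.exp(-d2 / t)
    W = np.where(neigh & same, heat, 0.0)             # Eq. (1): same-class neighbours
    W_pen = np.where(neigh & ~same, heat, 0.0)        # Eq. (2): different-class neighbours
    return W, W_pen
```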

The optimization problem of LDE is described as follows:

$$\begin{aligned} \begin{aligned} \mathop {\arg }&\min _P \sum _{i,j}||P^{T}x_{i}-P^{T}x_{j}||^{2}w_{ij}\\ s.t.\;&\sum _{ij}||P^{T}x_{i}-P^{T}x_{j}||^{2}w_{ij}^{'}=1 \end{aligned} \end{aligned}$$
(3)

Step 3: Complete the embedding. The projection matrix P is obtained by solving for the eigenvectors corresponding to the H smallest nonzero eigenvalues of the following generalized eigenvalue problem:

$$\begin{aligned} X(D-W)X^{T}P = \Lambda X(D^{'}-W^{'})X^{T}P \end{aligned}$$
(4)

where \(\Lambda \) is a diagonal eigenvalue matrix, and D and \(D^{'}\) are diagonal matrices with \(D_{ii} = \sum _{j=1}^{N}w_{ij}\) and \(D_{ii}^{'} = \sum _{j=1}^{N}w_{ij}^{'}\).
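A possible way to carry out Step 3 numerically is sketched below, again only as an illustration: it assumes the affinity matrices from the previous sketch, works with column-sample matrices as in Eq. (4), and adds a small ridge to the right-hand matrix (our own addition) so that the generalized eigensolver stays well posed when samples are scarce.

```python
import numpy as np
from scipy.linalg import eigh

def lde_projection(X, W, W_pen, H):
    """Solve Eq. (4) and return the (m, H) projection matrix P."""
    Xc = X.T                                   # columns are samples, as in Eq. (4)
    D = np.diag(W.sum(axis=1))
    D_pen = np.diag(W_pen.sum(axis=1))
    A = Xc @ (D - W) @ Xc.T                    # left-hand side of Eq. (4)
    B = Xc @ (D_pen - W_pen) @ Xc.T            # right-hand side of Eq. (4)
    B = B + 1e-6 * np.trace(B) / B.shape[0] * np.eye(B.shape[0])   # ridge (assumption)
    eigvals, eigvecs = eigh(A, B)              # eigenvalues returned in ascending order
    return eigvecs[:, :H]                      # H smallest eigenvalues -> P
```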

2.2 Region Covariance Descriptor for HSI

As a robust and novel data descriptor, the region covariance descriptor has been successfully and effectively applied to many computer vision problems [19, 20]. Consider HSI data \({\varvec{{X}}}\in \mathbb {R}^{\textit{l}\times \textit{w}\times \textit{m}}\), where \(\textit{m}\) is the number of bands and \(\textit{l}\times \textit{w}\) is the spatial size. A third-order spatial-spectral tensor \(x\in \mathbb {R}^{(2\textit{n}+1)\times (2\textit{n}+1)\times \textit{m}}\) is a small patch of \({\varvec{{X}}}\): its center is a pixel and the remaining entries form that pixel's local neighborhood. Therefore, the pixels of the HSI data \({\varvec{{X}}}\) can be denoted as \(\{x_{i}\}_{i=1}^{\textit{N}}\), where \(x_i\in \mathbb {R}^{(2\textit{n}+1)\times (2\textit{n}+1)\times \textit{m}}\) denotes the patch around the ith pixel and N is the number of pixels [18]. Let \(x_{s}\;(s = 1, 2, \ldots , (2n+1)\times (2n+1))\) be the spectral vectors in the region of interest around the ith pixel. Then, a spectral region covariance descriptor \(C_{i}\) can be obtained by Eq. (5).

$$\begin{aligned} \begin{aligned} C_{i}&= {1\over S-1}\sum _{s=1}^{S} (x_{s}-\mu _{i})(x_{s}-\mu _{i})^{T}\\ \mu _i&= {1\over S}\sum _{s=1}^{S} x_{s} \end{aligned} \end{aligned}$$
(5)

where S is the number of spectral vectors in the region of interest and \(\mu _i\) is the mean vector. Meanwhile, \(C_{i}\) is regarded as the feature of \(x_i\).
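For concreteness, a short sketch of Eq. (5) for one patch is given below (illustrative only; the function name, border handling, and array layout are our assumptions).

```python
import numpy as np

def region_covariance(cube, row, col, n):
    """Eq. (5): covariance descriptor of the (2n+1)x(2n+1) patch around (row, col).

    cube : (l, w, m) hyperspectral array; the patch must lie inside the image.
    """
    patch = cube[row - n:row + n + 1, col - n:col + n + 1, :]
    vecs = patch.reshape(-1, patch.shape[-1])          # S = (2n+1)^2 spectra of length m
    mu = vecs.mean(axis=0)                              # mean spectral vector
    centered = vecs - mu
    C = centered.T @ centered / (vecs.shape[0] - 1)     # (m, m) region covariance
    return C, mu
```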

3 Our Work

3.1 Variance and Covariance for HSI

Inspired by the region covariance descriptor in [18], we introduce the variance and covariance in place of the region covariance descriptor to attenuate the effects of noise, because in this paper the hyperspectral data are used as input in vector form rather than tensor form. Consider a hyperspectral dataset denoted as \(X = \{x_{i}|x_{i1}, x_{i2}, \ldots , x_{im}\}_{i=1}^N\) in an \(\mathbb {R}^{m\times 1}\) feature space, where m is the number of bands. Then, a spectral variance \(C_i\;(i = 1, 2, \ldots , N)\) and a covariance \(C_{ij}\;(i, j = 1, 2, \ldots , N)\) can be obtained by Eq. (6).

$$\begin{aligned} \begin{aligned} C_i&= {1\over m-1}\sum _{k=1}^{m} (x_{ik}-\mu _{i})(x_{ik}-\mu _{i})^{T}\\ \mu _i&= {1\over m}\sum _{k=1}^{m} x_{ik}\\ C_{ij}&= {1\over m-1}\sum _{k=1}^{m} (x_{ik}-\mu _{i})(x_{jk}-\mu _{j})^{T} \end{aligned} \end{aligned}$$
(6)

where \(\mu _i\) is the spectral mean value of \(x_i\). Meanwhile, the variance \(C_i\) is regarded as the feature of \(x_i\), and the covariance \(C_{ij}\) is regarded as the feature between \(x_i\) and \(x_j\).
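Since the pixels are now plain m-dimensional vectors, \(C_i\) and \(C_{ij}\) in Eq. (6) reduce to scalars. A minimal sketch follows (illustrative only; the function names are ours).

```python
import numpy as np

def variance_features(X):
    """Eq. (6): per-pixel variance C_i over the m bands; X is (N, m)."""
    mu = X.mean(axis=1, keepdims=True)
    return np.sum((X - mu) ** 2, axis=1) / (X.shape[1] - 1)

def covariance_features(X):
    """Eq. (6): pairwise covariances C_ij between the band profiles of all pixels."""
    centered = X - X.mean(axis=1, keepdims=True)
    return centered @ centered.T / (X.shape[1] - 1)     # (N, N), diagonal equals C_i
```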

3.2 Modified Local Discriminant Embedding (MLDE)

Because the Euclidean distance is sensitive to noise, and because the data inevitably contain noise created by environmental changes (e.g., atmosphere and illumination) and instrument problems (e.g., sensor), the LDE algorithm may produce an inaccurate graph construction and poor classification performance. In this section, we propose the MLDE algorithm to overcome this problem.

As in LDE, the intrinsic graph G and the penalty graph \(G^{\prime }\) must be constructed first. The difference in MLDE is that we use the variance features \(\{C_{i}\}_{i=1}^N\) and the covariance features \(\{C_{ij}\}_{i,j=1}^N\) obtained by Eq. (6) to construct the intrinsic graph and the penalty graph, denoted as \(G_{var}\) and \(G_{cov}^{\prime }\), respectively. Since the variance and covariance features lie on a Riemannian manifold, the Log-Euclidean metric is a good choice to compute the affinity:

$$\begin{aligned} D_{LE}(C_{i},C_{j}) = |log(C_{i}) - log(C_{j})| \end{aligned}$$
(7)

Then, the affinity matrix \(W_{var}\) of the intrinsic graph \(G_{var}\) and the affinity matrix \(W_{cov}^{\prime }\) of the penalty graph \(G_{cov}^{\prime }\) can be computed as follows:

$$\begin{aligned} w_{var}\;_{ij} = {\left\{ \begin{array}{ll} exp(- D_{LE}(C_{i},C_{j})^{2} / t) &{} C_{j}\in O(K, C_{i}) \\ &{}\text {or } C_{i}\in O(K,C_{j}) \\ &{}\text {and } y_{i} = y_{j}; \\ 0 &{}\text {otherwise} \end{array}\right. } \end{aligned}$$
(8)

and

$$\begin{aligned} w'_{cov}\;_{ij} = {\left\{ \begin{array}{ll} exp(-|log(C_{ij})|^{2} / t) &{} C_{ij}\in O(K,C_{ii}) \\ &{}\text {or } C_{ii}\in O(K,C_{ij}) \\ &{}\text {and } y_{i} \not = y_{j}; \\ 0 &{}\text {otherwise} \end{array}\right. } \end{aligned}$$
(9)

where \(O(K,C_{i})\) denotes the K nearest neighbors of the variance feature \(C_{i}\) and t is a kernel width parameter.
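A minimal sketch of Eqs. (7)–(9) follows; it is not the authors' code. Eq. (8) is implemented as written, while Eq. (9) leaves the neighborhood test and the treatment of non-positive covariances open, so the choices below (KNN decided on the Log-Euclidean variance distances, log of the absolute covariance) are our own assumptions.

```python
import numpy as np

def mlde_affinities(C_var, C_cov, y, K=12, t=1.0):
    """W_var of Eq. (8) and a penalty matrix in the spirit of Eq. (9).

    C_var : (N,) per-pixel variances, C_cov : (N, N) pairwise covariances,
    y : (N,) integer class labels.
    """
    N = C_var.shape[0]
    log_c = np.log(C_var)                                  # variances are positive
    d_le = np.abs(log_c[:, None] - log_c[None, :])         # Eq. (7), scalar case
    order = np.argsort(d_le, axis=1)
    knn = np.zeros((N, N), dtype=bool)
    rows = np.repeat(np.arange(N), K)
    knn[rows, order[:, 1:K + 1].ravel()] = True
    neigh = knn | knn.T
    same = y[:, None] == y[None, :]
    W_var = np.where(neigh & same, np.exp(-d_le ** 2 / t), 0.0)        # Eq. (8)
    pen = np.exp(-np.log(np.abs(C_cov) + 1e-12) ** 2 / t)              # cf. Eq. (9), assumption
    W_cov = np.where(neigh & ~same, pen, 0.0)
    np.fill_diagonal(W_var, 0.0)
    np.fill_diagonal(W_cov, 0.0)
    return W_var, W_cov
```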

The optimization problem of MLDE is described as follows:

$$\begin{aligned} \begin{aligned} J&(P) = \mathop {\arg }\min _P \sum _{i,j}||P^{T}x_{i}-P^{T}x_{j}||^{2}w_{var}\;_{ij}\\&\;s.t.\;\;\sum _{ij}||P^{T}x_{i}-P^{T}x_{j}||^{2}w'_{cov}\;_{ij}=1 \end{aligned} \end{aligned}$$
(10)

Similarly to LDE, the optimization problem (10) can be rewritten as (11) using properties of the trace.

$$\begin{aligned} \begin{aligned} J(P)&= \mathop {\arg }\min _P \sum _{i,j}||P^{T}x_{i}-P^{T}x_{j}||^{2}w_{var}\;_{ij}\\&= \mathop {\arg }\min _P \sum _{i,j}tr\{(P^{T}x_{i}-P^{T}x_{j})(P^{T}x_{i}-P^{T}x_{j})^{T}\}w_{var}\;_{ij}\\&=\mathop {\arg }\min _P\sum _{i,j}tr\{P^{T}(x_{i}-x_{j})(x_{i}-x_{j})^{T}P\}w_{var}\;_{ij} \end{aligned} \end{aligned}$$
(11)

Since \(w_{var}\;_{ij}\) is a scalar and the trace operator is linear, Eq. (11) can be rewritten as (12):

$$\begin{aligned} \begin{aligned} J(P)&= \mathop {\arg }\min _{P}\;tr\{P^{T}\sum _{i,j}((x_{i}-x_{j})w_{var}\;_{ij}(x_{i}-x_{j})^{T})P\}\\&=\mathop {\arg }\min _{P}\;tr\{P^{T}(2XD_{var}X^{T}-2XW_{var}X^{T})P\}\\&= \mathop {\arg }\min _{P}\;2tr\{P^{T}X(D_{var}-W_{var})X^{T}P\} \end{aligned} \end{aligned}$$
(12)

where \(D_{var}\) is a diagonal matrix with \(D_{var}\;_{ii}=\sum _{j=1}^{N}W_{var}\;_{ij}\). Then, the optimization problem (10) can be rewritten as (13):

$$\begin{aligned} \begin{aligned} J&\!(P) = \mathop {\arg }\min _{P}\;2tr\{P^{T}X(D_{var}-W_{var})X^{T}P\}\\&\;s.t.\;\;2tr\{P^{T}X(D_{cov}-W_{cov})X^{T}P\}=1 \end{aligned} \end{aligned}$$
(13)

The projection matrix P can be obtained by solving the eigenvectors corresponding to the H smallest nonzero eigenvalues of the following generalized eigenvalue problem:

$$\begin{aligned} X(D_{var}-W_{var})X^{T}P = \Lambda X(D_{cov}-W_{cov})X^{T}P \end{aligned}$$
(14)

Thus, MLDE for hyperspectral image classification is carried out following the steps in Algorithm 1.

Algorithm 1. MLDE for hyperspectral image classification (variance and covariance features by Eq. (6), Log-Euclidean graph construction by Eqs. (7)–(9), generalized eigenvalue problem (14), projection by P)
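Purely as an illustration of Algorithm 1 under the assumptions stated above, the whole pipeline can be sketched as follows; it reuses the mlde_affinities helper sketched in Sect. 3.2 and the same ridge term as before, both of which are our own choices rather than the paper's specification.

```python
import numpy as np
from scipy.linalg import eigh

def mlde(X, y, K=12, H=27, t=1.0):
    """Return the (m, H) projection P of Eq. (14); X is (N, m), y is (N,)."""
    centered = X - X.mean(axis=1, keepdims=True)
    C_var = np.sum(centered ** 2, axis=1) / (X.shape[1] - 1)      # Eq. (6)
    C_cov = centered @ centered.T / (X.shape[1] - 1)              # Eq. (6)
    W_var, W_cov = mlde_affinities(C_var, C_cov, y, K=K, t=t)     # Eqs. (7)-(9), sketched above
    D_var = np.diag(W_var.sum(axis=1))
    D_cov = np.diag(W_cov.sum(axis=1))
    Xc = X.T                                                      # columns are samples
    A = Xc @ (D_var - W_var) @ Xc.T                               # left-hand side of Eq. (14)
    B = Xc @ (D_cov - W_cov) @ Xc.T                               # right-hand side of Eq. (14)
    B = B + 1e-6 * np.trace(B) / B.shape[0] * np.eye(B.shape[0])  # ridge (assumption)
    _, eigvecs = eigh(A, B)                                       # ascending eigenvalues
    return eigvecs[:, :H]
```

New samples are then projected as \(P^{T}x\) (X @ P in row-vector form) before being fed to the classifier.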

4 Experimental Results and Discussions

In this section, we apply MLDE to two hyperspectral datasets. Firstly, we introduce the experimental datasets. Secondly, we describe how the experimental parameters are chosen. Finally, the classification accuracies and classification maps of the compared algorithms and the MLDE algorithm are presented. The MLDE algorithm is implemented in MATLAB. The results are generated on a personal computer equipped with an Intel Core i7-3370 running at 3.40 GHz and 4 GB of memory.

Table 1. Number of training and testing samples for the University of Pavia dataset
Table 2. Number of training and testing samples for the Salinas dataset

4.1 Experimental Dataset

The first experimental dataset was acquired by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor over the University of Pavia, Italy. The image contains \(610\times 340\) pixels and 115 spectral bands in the wavelength range 0.43–0.86 \(\upmu \)m. In our experiments, 12 noisy spectral bands are removed, so a total of 103 bands is used. The image contains 9 different classes and a total of 42776 ground-truth samples (Table 1).

The second experimental dataset was acquired by the National Aeronautics and Space Administration's Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over Salinas Valley, California. The image contains \(512 \times 217\) pixels and 204 bands after 20 water-absorption bands are removed. The image contains 16 different classes and a total of 54129 ground-truth samples.

For the University of Pavia dataset and the Salinas dataset, 8% and 5% of the samples in each class are randomly selected as training samples, respectively, and the rest are used as testing samples. More detailed information on the numbers of training and testing samples is summarized in Tables 1 and 2.

4.2 Experiment Parameters

An SVM is used to evaluate the proposed MLDE algorithm. The SVM classifier is implemented with libsvm (RBF kernel, penalty parameter 1000, and sigma searched in {0.01, 0.05, 0.5, 1, 5, 10, 50, 100, 500, 1000}). To demonstrate the benefits of the MLDE algorithm, the experimental results are compared with nine other classical DR algorithms, namely PCA, LDA, LPP, LDE, LFDA, LGDA, SGDA, SLGDA, and GDA-SS.
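For readers who prefer Python, the same classifier setup can be approximated with scikit-learn as sketched below (an assumption: the original experiments use libsvm in MATLAB, and libsvm's sigma and scikit-learn's gamma parameterise the RBF kernel differently, so the grid is carried over only by analogy).

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

sigma_grid = [0.01, 0.05, 0.5, 1, 5, 10, 50, 100, 500, 1000]
# RBF-kernel SVM with penalty parameter C = 1000; gamma is tuned over the grid.
svm = GridSearchCV(SVC(kernel='rbf', C=1000),
                   param_grid={'gamma': sigma_grid}, cv=3)
# Typical usage (X_train_dr / X_test_dr are the dimension-reduced features):
#   svm.fit(X_train_dr, y_train)
#   overall_accuracy = svm.score(X_test_dr, y_test)
```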

Fig. 1. The overall accuracy corresponding to different reduced dimensionalities and different values of K for MLDE on the two hyperspectral datasets

The reduced dimensionality and the neighborhood size K are two important parameters that have a significant influence on the classification performance.

If K is too small, classification accuracy may decrease; if K is too large, the computational complexity grows, more noise is introduced, and the classification performance degrades. To find a good value of K, the even numbers from 2 to 60 are tried, and the reduced dimensionality is searched in the range {2, 7, 12, 15, 20, 25, 27, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75}. For a clearer presentation, only the range 2–30 of K is shown in Fig. 1.
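A minimal sketch of this parameter sweep is shown below (our own loop; it assumes the mlde function and the svm classifier sketched earlier, plus held-out training and test splits).

```python
def sweep_parameters(X_tr, y_tr, X_te, y_te, classifier):
    """Grid-search K and the reduced dimensionality H by overall accuracy."""
    K_grid = range(2, 61, 2)
    H_grid = [2, 7, 12, 15, 20, 25, 27, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75]
    best = (0.0, None, None)
    for K in K_grid:
        for H in H_grid:
            P = mlde(X_tr, y_tr, K=K, H=H)            # hypothetical helper from Sect. 3
            classifier.fit(X_tr @ P, y_tr)
            acc = classifier.score(X_te @ P, y_te)    # overall accuracy
            if acc > best[0]:
                best = (acc, K, H)
    return best
```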

Fig. 2. The overall accuracy corresponding to different reduced dimensionalities for MLDE on the University of Pavia dataset

Figure 1 shows the classification performance of MLDE for different values of K on the two hyperspectral datasets. It can be seen from Fig. 1 that the overall accuracy increases with K when K is relatively small, while it declines with increasing K when K is relatively large. It is also noticeable that the overall accuracy becomes stable and less affected by K when the reduced dimensionality is high. From Fig. 1, the highest overall accuracies are 94.28% and 93.30% on the University of Pavia dataset and the Salinas dataset, obtained with K = 12 and K = 22, respectively.

Thus, K is fixed at 12 and 22 for the two datasets according to Fig. 1. Next, a good value of the reduced dimensionality is searched within the above range, in the same way as for the other algorithms, e.g., LFDA, SGDA, and SLGDA.

Fig. 3. The overall accuracy corresponding to different reduced dimensionalities for MLDE on the Salinas dataset

Fig. 4. The computational time of different methods on the two hyperspectral datasets: (a) the University of Pavia dataset, (b) the Salinas dataset

Figure 2 illustrates the overall accuracy corresponding to the reduced dimensionality H for all the compared algorithms on the University of Pavia dataset. The performance is poor when the reduced dimensionality is low, and it increases and stabilizes as the reduced dimensionality grows. From Fig. 2(a), PCA, LDA, LGDA, SLGDA, and GDA-SS clearly do not achieve better classification performance than MLDE. Although the curves of LPP, LFDA, SGDA, and MLDE cross each other, the highest point, 94.28%, lies on the MLDE curve in Fig. 2(b). Therefore, setting the reduced dimensionality to 27 can be considered a good choice.

Figure 3 also illustrates the overall accuracy corresponding to the reduced dimensionality H for all the compared algorithms, this time on the Salinas dataset. The performance is poor when the reduced dimensionality is low, and it increases and stabilizes as the reduced dimensionality grows. From Fig. 3(b), LPP, LDE, LFDA, and SGDA clearly do not achieve better classification performance than MLDE. Except for the curve of GDA-SS, which intersects the MLDE curve at a few points, no other method exceeds MLDE in Fig. 3(a), and the highest overall accuracy, 93.12%, is found on the MLDE curve. Therefore, setting the reduced dimensionality to 70 can be considered a good choice.

From Fig. 4(a), the computational time of MLDE is 5.276 s, ranking second with only a 0.251 s gap to the fastest method. Because the computational time of SLGDA (1564.8 s) is so large that it would distort the plot, it is not shown in Fig. 4(a). From Fig. 4(b), the computational time of MLDE is 12.879 s, ranking third.

4.3 Experimental Results

Based on our experiments, for the University of Pavia dataset, K is set to 12 and the reduced dimensionality to 27; for the Salinas dataset, K is set to 22 and the reduced dimensionality to 70.

The each class’s accuracy, overall accuracy (OA), average accuracy (AA) and kappa coefficient of two hyperspectral datasets are listed in Tables 3 and 4.

From Table 3, MLDE achieves the best classification performance on class 3, class 7, and class 8. Its OA, AA, and \(\kappa \) are all better than those of the compared methods. In detail, compared with the other methods, the OA of MLDE is higher by 0.44% to 10.89%, the AA by 1% to 17.08%, and the \(\kappa \) by 0.59% to 15.27%. In particular, the accuracy on class 7 is 83.83% while the other methods are mostly below 80%, and the accuracy on class 8 is 91.53% while the other methods are mostly below 90%. Meanwhile, when other methods achieve the best results on a certain class, the results of MLDE are not inferior, for instance, on classes 1, 2, 5, and 9.

Table 3. Classification accuracy (%) for the University of Pavia dataset
Table 4. Classification accuracy (%) for the Salinas dataset

From Table 4, although MLDE only achieves the best classification performance on class 16, its performance on the other classes is also competitive; for example, the classification accuracies on class 1 and class 12 are also good. Moreover, the OA of MLDE is better than that of the compared methods; in detail, it is higher by 3.64% to 7.21%.

Fig. 5. Classification maps of different methods for the University of Pavia dataset: (a) legend; (b) ground truth; (c) PCA: 84.01%; (d) LDA: 84.87%; (e) LPP: 83.39%; (f) LDE: 91.86%; (g) LFDA: 93.59%; (h) LGDA: 89.35%; (i) SGDA: 93.84%; (j) SLGDA: 86.27%; (k) GDA-SS: 90.75%; (l) MLDE: 94.28%

Fig. 6. Classification maps of different methods for the Salinas dataset: (a) ground truth; (b) PCA: 84.93%; (c) LDA: 88.34%; (d) LPP: 87.38%; (e) LDE: 87.15%; (f) LFDA: 87.49%; (g) LGDA: 86.63%; (h) SGDA: 88.21%; (i) SLGDA: 87.86%; (j) GDA-SS: 89.50%; (k) MLDE: 92.14%

Figure 5 illustrates the classification maps produced by these methods on the University of Pavia dataset. In Fig. 5, the number of misclassified points in class 3 (Gravel) and class 8 (Self-Blocking Bricks) for MLDE is significantly smaller than for the other methods, which further confirms the results in Table 3.

Figure 6 illustrates the classification maps produced by these methods on the Salinas dataset. In Fig. 6, the number of misclassified points in class 16 (Vinyard-vertical-trellis) for MLDE is significantly smaller than for the other methods.

5 Conclusion

In this paper, we proposed an MLDE algorithm for HSI that constructs neighborhood graphs in a new spectral feature space instead of the original space. We use variance to characterize the similarity of pixels within the same class and covariance to characterize the separation between pixels of different classes. The combination of variance and covariance brings pixels of the same class closer together and increases the separation between pixels of different classes, which enhances the classification performance of HSI. Representing the data by variance and covariance attenuates the effects of noise, so the method handles noise in HSI better. Considering that covariance is symmetric positive definite and lies on a Riemannian manifold, the MLDE algorithm uses the Log-Euclidean metric to capture the similarity between spectral vectors, which provides a more accurate similarity evaluation than the Euclidean distance and better expresses the characteristics of the spectral information. The experimental results on two hyperspectral datasets demonstrate the effectiveness of the proposed MLDE method.