
1 Introduction

Linear discriminant analysis (LDA) is an effective and efficient method for dimensionality reduction (feature extraction). It has been successfully used in many pattern recognition problems such as face recognition [1, 2], document recognition [3] and image retrieval [4, 5]. It uses the within-class scatter matrix \(S_w\) to evaluate the compactness within each class, and the between-class scatter matrix \(S_b\) to evaluate the separation between different classes. The objective of LDA is to find an optimal transformation matrix W which simultaneously minimizes the within-class scatter and maximizes the between-class scatter.

In the past two decades various improvements over the original LDA have been proposed to enhance its performance in different ways, resulting in different LDA variants. These variants fall into two categories. In the first category, the LDA variants attempt to tackle the singularity problem of the within-class scatter matrix \(S_w\). In LDA, we take the leading eigenvectors of \(S_w^{ - 1}S_b\) as the columns of the optimal transformation matrix W. Guaranteeing that \(S_w\) is nonsingular requires at least \(N+C\) samples [6], where N is the data dimension and C is the number of classes. This condition rarely holds in practice and is almost never met in high-dimensional spaces. Singularity then makes the within-class scatter matrix irreversible, so \(S_w^{ - 1}S_b\) cannot be used to obtain the transformation matrix W. To address the singularity problem, Li-Fen Chen et al. [1] proposed null space linear discriminant analysis (NLDA). It is based on a new Fisher’s criterion function and calculates the transformation matrix in the null space of the within-class scatter matrix, which implicitly avoids the singularity problem. In [7] regularized linear discriminant analysis (RLDA) is proposed. It obtains an optimal constant \(\alpha \) by a heuristic approach and adds \(\alpha \) to the diagonal elements of the within-class scatter matrix to overcome the singularity problem. Some newer approaches have also been proposed. For example, Alok Sharma et al. [8] proposed a new method to compute the transformation matrix W, which gave a new perspective on NLDA and presented a fast implementation of NLDA using random matrix multiplication with scatter matrices; Alok Sharma et al. [9] proposed an improvement of RLDA, which presented a recursive method to compute the optimal parameter; and Xin Shu et al. [10] proposed LDA with spectral regularization to tackle the singularity problem. Other LDA variants for solving the singularity problem can be found in [11].

In the second category, the LDA variants apply the original LDA in local data space instead of the whole data space. For example, Zizhu Fan et al. [12] presented two local linear discriminant analysis (LLDA) approaches, vector-based LLDA (VLLDA) and matrix-based LLDA (MLLDA), which select a proper number of nearest neighbors of a test sample from the training set to capture the local data structure and use the selected nearest neighbors to produce the local linear discriminant vectors or matrix. Chao Yao et al. [13] proposed a subset method for improving linear discriminant analysis, which divides the whole set into several subsets and applies the original LDA in each subset. There are other LDA variants such as nonparametric discriminant analysis [14], sparse discriminant analysis [15], semi-supervised linear discriminant analysis [16], incremental LDA [17], tensor-based LDA [18], and local tensor discriminant analysis [19].

The original LDA and most of its variants have elegant mathematical properties, one of which is that the dimensionality of the data space can be reduced to at most one less than the number of classes. One consequence is that if there are few classes in a data set, e.g., two classes in a binary classification problem, there will be only one or a few features left after dimensionality reduction, probably insufficient for deciding the class boundaries. This leads to the over-reducing problem, meaning that dimensionality reduction is overdone.

In this paper we propose changes to the original LDA to address the over-reducing problem. Instead of using only the mean of each class and the mean of the whole data set to evaluate the separation between different classes, our new LDA variant uses a new method to compute the between-class scatter matrix. As a result we obtain more between-class information and more features after dimensionality reduction, even for binary classification.

The rest of the paper is organized as follows. Section 2 reviews the original LDA and two well-known LDA variants – Uncorrelated LDA and Orthogonal LDA. Section 3 presents our orLDA, Sect. 4 presents our experimental results and Sect. 5 concludes the paper.

2 Linear Discriminant Analysis

2.1 The Original LDA

LDA has been widely used for dimensionality reduction and feature extraction. In the original LDA, the within-class scatter matrix and the between-class scatter matrix are used to measure the class compactness and separability respectively. They are defined as [20]:

$$\begin{aligned} S_w = \frac{1}{N}\sum _{i=1}^C\sum _{j=1}^{n_i}(x_{ij} - \mu _i)(x_{ij} - \mu _i)^T, \end{aligned}$$
(1)
$$\begin{aligned} S_b = \frac{1}{N}\sum _{i=1}^C n_i(\mu _i - \mu )(\mu _i - \mu )^T, \end{aligned}$$
(2)

where N denotes the number of data samples, C denotes the number of classes, \( n_i \) denotes the number of samples in class i, \(\mu _i\) denotes the mean of the samples in class i, \(\mu \) denotes the mean of all samples, and \( x_{ij} \) is the jth sample in class i. The original LDA aims to find a transformation matrix \( W_{opt}=[w_1,w_2,...,w_f] \) that maximizes Fisher’s criterion

$$\begin{aligned} J(W)=\frac{W^TS_bW}{W^TS_wW} \end{aligned}$$
(3)

Mathematically, the solution to this problem corresponds to an eigenvalue decomposition of \( S_w^{ - 1}S_b \), taking its leading eigenvectors as the columns of \( W_{opt} \).
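To make the procedure concrete, the following is a minimal NumPy sketch (ours, not the authors' code) of the original LDA: it builds \(S_w\) and \(S_b\) as in Eqs. (1) and (2) and takes the leading eigenvectors of \( S_w^{ - 1}S_b \) as the columns of \( W_{opt} \). The function and variable names are our own, and \(S_w\) is assumed to be nonsingular.

```python
# Minimal sketch of the original LDA (Eqs. (1)-(3)); assumes S_w is nonsingular.
import numpy as np

def lda_fit(X, y, n_components=None):
    """X: (N, n) data matrix, y: (N,) integer class labels."""
    N, n = X.shape
    classes = np.unique(y)
    mu = X.mean(axis=0)                                  # overall mean
    Sw = np.zeros((n, n))
    Sb = np.zeros((n, n))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c)                # within-class scatter, Eq. (1)
        Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu)   # between-class scatter, Eq. (2)
    Sw /= N
    Sb /= N
    # Leading eigenvectors of Sw^{-1} Sb form the columns of W_opt.
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-evals.real)
    if n_components is None:
        n_components = len(classes) - 1                  # rank(Sb) <= C - 1
    return evecs.real[:, order[:n_components]]
```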

From Eq. (2), we can see that LDA uses only the class centers and the center of the whole data set to compute the between-class scatter matrix, which may lose much class-separating information. Moreover, since \(S_b\) is a sum of C rank-one matrices whose weighted deviations \(n_i(\mu _i - \mu )\) sum to zero, its rank is at most \(C-1\), so the number of features extracted by LDA is at most \(C-1\). However, \(C-1\) features are often insufficient to separate the classes well, especially for binary classification in high-dimensional spaces.

2.2 Uncorrelated LDA and Orthogonal LDA

Uncorrelated LDA (ULDA) and Orthogonal LDA (OLDA) were presented in [21], where Jieping Ye proposed a new optimization criterion to obtain the optimal transformation matrix \(W_{opt}\). \(W_{opt}\) is defined as \(W_{opt}=X_qM\), where X is a matrix that simultaneously diagonalizes \(S_b\), \(S_w\) and the total scatter matrix \(S_t\), \(X_q\) is the matrix consisting of the first q columns of X, and M is an arbitrary nonsingular matrix. When M is the identity matrix, we obtain the ULDA algorithm, whose features in the reduced space are uncorrelated; if instead we let \(X_q=QR\) be the QR decomposition of \(X_q\) and choose M as the inverse of R, we obtain the OLDA algorithm, whose discriminant vectors are orthogonal to each other.
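As an illustration of this last step only, here is a minimal sketch of how the ULDA and OLDA transformations are obtained once \(X_q\) is available; computing \(X_q\) itself (the simultaneous diagonalization in [21]) is omitted, and the names are ours.

```python
# Minimal sketch: ULDA and OLDA transformations from X_q (first q columns of X).
import numpy as np

def ulda_olda_from_Xq(Xq):
    W_ulda = Xq                        # M = I: features in the reduced space are uncorrelated
    Q, R = np.linalg.qr(Xq)            # Xq = QR
    W_olda = Xq @ np.linalg.inv(R)     # M = R^{-1}: discriminant vectors are orthonormal (= Q)
    return W_ulda, W_olda
```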

3 Linear Discriminant Analysis that Avoids Over-Reducing

In this section we present the proposed changes to the original LDA that address the over-reducing problem; they concern how the between-class scatter matrix is computed.

Suppose there are N samples \( X_i \in R^n\), \(i=1,2,\ldots ,N\), from two classes, \(N_k\) is the number of samples in class k (\(k=1,2\)) such that \(\sum _{k=1}^2 N_k=N\), \( \mu _k \) is the mean of the samples in class k, and \( x_{kj}\) is the jth sample in class k. The two scatter matrices, the within-class scatter matrix \(\widetilde{S_w}\) and the between-class scatter matrix \(\widetilde{S_b}\), are defined as follows:

$$\begin{aligned} \widetilde{S_w}=\frac{1}{N}\sum _{k=1}^2\sum _{j=1}^{N_k}(x_{kj} - \mu _k)(x_{kj} - \mu _k)^T \end{aligned}$$
(4)
$$\begin{aligned} \widetilde{S_b}= \frac{1}{N}(N_1\sum _{j=1}^{N_1}(x_{1j} - \mu _2)(x_{1j} - \mu _2)^T + N_2\sum _{j=1}^{N_2}(x_{2j} - \mu _1)(x_{2j} - \mu _1)^T) \end{aligned}$$
(5)

When the number of classes is two, Eq. (1), the computation of the within-class scatter matrix in the original LDA, is the same as Eq. (4). However, Eq. (5), the computation of the between-class scatter matrix, is quite different from the original LDA: in Eq. (5), the mean of the other class is subtracted from every sample in a class.

Computing the between-class scatter matrix in this way captures more between-class information than the original LDA, so better classification performance can be expected. Moreover, with Eq. (5) we can extract more than one feature in binary classification. From linear algebra and Eqs. (4) and (5), we obtain \(rank(\widetilde{S_b})=min(n,N-2)\) and \(rank(\widetilde{S_w})=min(n,N-2)\), where n is the dimensionality of the data space and N is the total number of samples. The ranks of \(\widetilde{S_b}\) and \(\widetilde{S_w}\) thus depend only on n and N, so \(rank(\widetilde{S_w}^{ - 1}\widetilde{S_b})=min(rank(\widetilde{S_w}^{ - 1}), rank(\widetilde{S_b}))\) and the number of extracted features is no longer limited to one. Our optimal transformation matrix \(W_{opt}\) maximizes \( J(W)=\frac{W^T\widetilde{S_b}W}{W^T\widetilde{S_w}W}\), and its columns are the eigenvectors corresponding to the largest eigenvalues of \( \widetilde{S_w}^{ - 1}\widetilde{S_b} \).
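The following minimal NumPy sketch (ours, not the authors' code) implements orLDA for two classes as described by Eqs. (4) and (5), assuming \(\widetilde{S_w}\) is nonsingular (e.g., after PCA); the function and variable names are our own.

```python
# Minimal sketch of the proposed orLDA for two classes (Eqs. (4)-(5)).
import numpy as np

def orlda_fit(X1, X2, n_components):
    """X1: (N1, n) samples of class 1, X2: (N2, n) samples of class 2."""
    N1, N2 = len(X1), len(X2)
    N = N1 + N2
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter, Eq. (4) -- identical to Eq. (1) for two classes.
    Sw = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / N
    # Between-class scatter, Eq. (5): every sample minus the mean of the *other* class.
    Sb = (N1 * (X1 - mu2).T @ (X1 - mu2) + N2 * (X2 - mu1).T @ (X2 - mu1)) / N
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))   # assumes Sw is nonsingular
    order = np.argsort(-evals.real)
    return evecs.real[:, order[:n_components]]
```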

4 Experiments

In this section we take K-Nearest Neighbor (KNN, K=1) as the classifier and use ten-fold cross-validation to evaluate our method on three face datasets – the ORL face database, Labeled Faces in the Wild (LFW) [22], and Extended Cohn-Kanade [23] – and one DNA microarray gene expression dataset from the Kent Ridge Bio-medical Dataset (KRBD) repository.

The ORL face database consists of a total of 400 images of 40 distinct people. Each person has ten different images and the size of each image is 92*112 pixels, with 256 grey levels per pixel. All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position.

The LFW dataset consists of 13,233 images of 5,749 people, organized into two views – a development set of 3,200 pairs for building models and choosing features, and a ten-fold cross-validation set of 6,000 pairs for evaluation. The size of each image is 250*250 pixels. All the images were collected from the Internet and exhibit large intra-personal variations. There are three versions of LFW: original, funneled and aligned. In our experiments, we use the aligned version [24].

For the above two face datasets, we perform a face verification experiment, which is a binary classification problem: the goal is to decide whether two given face images match or not. For LFW we use a subset of view 2: we randomly choose 200 matched face pairs and 200 mismatched face pairs from view 2 and crop each image to 80 * 150 pixels as in [25]. For ORL, by randomly pairing face images we obtain 80 matched face pairs and 391 mismatched face pairs. We therefore have 400 samples for LFW and 471 samples for ORL, with sample dimensionalities of 24,000 and 20,608, respectively.
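The reported dimensionalities (24,000 = 2 * 80 * 150 and 20,608 = 2 * 92 * 112) suggest that each face pair is represented by concatenating the two pixel vectors; the following short sketch rests on that assumption and is not taken from the paper.

```python
# Hedged sketch: turn a face pair into one sample by flattening and concatenating
# the two grayscale images (assumption consistent with the reported dimensionalities).
import numpy as np

def pair_to_sample(img_a, img_b):
    """img_a, img_b: 2-D grayscale arrays of the same size."""
    return np.concatenate([img_a.ravel(), img_b.ravel()]).astype(np.float64)
```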

The Extended Cohn-Kanade dataset (CK+) is a complete dataset for action unit and emotion-specified expression. In this paper, we focus on emotion-specified expressions. There are 593 sequences from 123 subjects which are FACS coded at the peak frame, but only 327 of the 593 sequences have emotion labels; we use the last frame of each labeled sequence for expression classification. The emotion expressions include neutral, anger, contempt, disgust, fear, happy, sadness and surprise. Here, we perform a positive versus negative expression classification experiment, taking happy as the positive expression and the remaining emotions as negative. We therefore have 69 positive expression samples and 258 negative expression samples, and the dimensionality of each sample is 10,000.

The Acute Leukemia dataset [26] consists of DNA microarray gene expression data of human acute leukemia for cancer classification. There are two types of acute leukemia, with 47 acute lymphoblastic leukemia (ALL) samples and 25 acute myeloid leukemia (AML) samples, each described by 7129 probes from 6817 human genes.

We compare our orLDA with three discriminant dimensionality reduction methods: the original LDA, Uncorrelated LDA (ULDA) and Orthogonal LDA (OLDA) [21]. To guarantee that \( S_w \) does not become singular, we use the two-stage PCA+LDA approach [27] – we reduce the data dimensionality by PCA, retaining the principal components that explain \(95\,\%\) of the variance, before the original LDA and orLDA are applied.
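For illustration, here is a sketch of this evaluation protocol (PCA retaining 95 % of the variance, then orLDA, then 1-NN, with ten-fold cross-validation), reusing the orlda_fit sketch from Sect. 3. Labels are assumed to be 0/1 and the number of orLDA features (10 here) is a placeholder, since the exact value is not stated in the text.

```python
# Sketch of the evaluation protocol: PCA (95% variance) + orLDA + 1-NN, ten-fold CV.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold

def evaluate(X, y, n_components=10, seed=0):
    accs = []
    for train, test in StratifiedKFold(n_splits=10, shuffle=True, random_state=seed).split(X, y):
        pca = PCA(n_components=0.95).fit(X[train])             # keep 95% of the variance
        Xtr, Xte = pca.transform(X[train]), pca.transform(X[test])
        W = orlda_fit(Xtr[y[train] == 0], Xtr[y[train] == 1], n_components)
        knn = KNeighborsClassifier(n_neighbors=1).fit(Xtr @ W, y[train])
        accs.append(knn.score(Xte @ W, y[test]))
    return np.mean(accs), np.std(accs) / np.sqrt(len(accs))    # mean accuracy and SEM
```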

Experimental results on the four datasets are shown in Table 1. Our orLDA has better classification performance than the original LDA, ULDA and OLDA on all datasets except Extended Cohn-Kanade. We credit the better performance to the facts that (1) orLDA obtains more between-class information than the other three LDA variants; and (2) more than one extracted feature can better separate two classes.

Table 1. Mean accuracy and standard error of the mean on four datasets

5 Conclusion

In this paper, we proposed a new LDA variant, orLDA, to address the over-reducing problem associated with LDA. orLDA uses a new method to compute the between-class scatter matrix, which captures more between-class information and allows more features to be extracted. Experiments have shown that orLDA outperformed the original LDA, ULDA and OLDA significantly on two face datasets, and outperformed ULDA and OLDA on the gene expression dataset. orLDA achieved the same performance as the original LDA on the emotion expression dataset and the gene expression dataset, and underperformed ULDA and OLDA slightly on the emotion expression dataset. It is therefore reasonable to conclude that the new LDA variant is an improvement over the state of the art.