
1 Introduction

The performance of a generic learning algorithm, especially when applied to classification problems, relies heavily on the quality of the feature representation learned from the raw input data. Good features not only remove irrelevant or redundant features coexisting in the original input space, but also preserve the essential information for the target tasks. A good feature extractor built for the input space, especially one obtained with unsupervised learning methods, can be further utilized for computer vision tasks. Deep hierarchical features produced by stacked unsupervised models have been demonstrated to be a powerful tool and have attracted growing attention [1, 2].

In recent years, studies have found that deep learning, constructed from multiple non-linear transformations, can be a powerful feature learning tool, and it has already been broadly used to address image classification tasks [3,4,5,6]. As a deep learning tool with a special architecture, the autoencoder has been stacked to pre-train a deep neural network in a greedy layer-wise manner [7], where each layer is separately initialized by an unsupervised pre-training method and then fine-tuned via backpropagation with a supervised learning algorithm [8, 9], which addresses the limited expressive power of shallow networks.

By restricting the output of the model to be identical to the input data, an autoencoder, composed of an encoding phase and a decoding phase, can be regarded as an identity function that reconstructs the raw input data. Meanwhile, sparse representation has proven its significant impact on computer vision [10,11,12]: the performance of an image classifier can be improved if the input image is represented by a sparse representation. Ghifary [12] demonstrated that, in most cases, sparse network structures have better classification performance than dense structures. In recent years, sparse deep models have been proposed based on sparse coding strategies, sparse regularization terms and sparse filtering, which bring sparsity of the input samples into deep neural network models.

However, the autoencoder and its variants have not taken the statistical characteristics and the domain knowledge of the training set into the design of deep networks, and they abandon many features learned at different levels. Therefore, the autoencoder can only provide a relatively coarse parameter setting and serves as a pre-training method, owing to its large variance and low generalization ability on unknown testing datasets. How to fully utilize the features existing in the input is thus one of the most important points of our work. It is well known that an ensemble of multiple classifiers is a practical technique for improving accuracy and stability compared with a single classifier. Ensemble learning employs several weak classifiers and, according to some combination rule, constructs a stronger one whose generalization error is significantly lower than that of any weak one. However, two key issues, namely the diversity and accuracy of each classifier and the combination (fusion) rules [13], must be taken into consideration to ensure better performance.

In this paper, we introduce a novel sparsity-induced autoencoder that can further exploit the representative and intrinsic features of the input. Then, to benefit from ensemble learning, an ensemble sparse feature learning algorithm based on this sparse autoencoder, named BoostingAE, is proposed. On the one hand, pre-training the sparsity-induced autoencoders yields sparse features at multiple levels of abstraction; on the other hand, ensemble learning can effectively improve the recognition rate and stability of a single classifier. Experimental results on three different datasets show that the proposed ensemble feature learning method can significantly improve the overall performance.

2 Related Work

2.1 Sparse Representation

Sparse Coding.

Sparse coding provides a family of methods for acquiring compact features from the input. Given only an unlabeled dataset, it can discover basis functions aimed at capturing higher-level features in the data itself. Despite its close relationship to traditional techniques for image denoising, the main drawback of sparse coding is its high computational cost. Moreover, it is well known that sparse coding is not “smooth” [14, 15], which means a tiny variation in input space might result in a significant difference in code space.

Sparse Filtering [16].

In contrast to many existing feature learning models, one important property of sparse filtering is that it requires only one hyper-parameter rather than extensive hyper-parameter tuning, owing to its very simple cost function:

$$ \min \sum\limits_{i = 1}^{M} \left\| \hat{f}^{(i)} \right\|_{1} = \sum\limits_{i = 1}^{M} \left\| \frac{\tilde{f}^{(i)}}{\left\| \tilde{f}^{(i)} \right\|_{2}} \right\|_{1} $$
(1)

where \( f \) denotes the learned feature values of an input sample, \( \tilde{f} \) is the \( \ell_{2} \)-normalized version of \( f \), and \( M \) is the number of samples.
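For concreteness, here is a minimal NumPy sketch of the cost in Eq. (1), assuming a feature matrix `F` of shape (M, d) whose rows are the features \( f^{(i)} \) of the samples; the soft-absolute value and the feature-wise normalization step follow the common sparse filtering recipe and are not spelled out above, so the details are illustrative.

```python
import numpy as np

def sparse_filtering_cost(F, eps=1e-8):
    """Cost of Eq. (1) for a feature matrix F of shape (M, d).

    Each row F[i] is the raw feature vector f^(i) of one sample.
    Features are first normalized across samples (columns), then each
    sample (row) is l2-normalized, and the l1 norms of the rows are summed.
    """
    F = np.sqrt(F ** 2 + eps)                                 # soft absolute value
    F_tilde = F / (np.linalg.norm(F, axis=0) + eps)           # per-feature l2 normalization
    F_hat = F_tilde / (np.linalg.norm(F_tilde, axis=1, keepdims=True) + eps)  # per-sample
    return np.abs(F_hat).sum()                                # sum of l1 norms over samples

# toy usage: 5 samples with 16-dimensional features
rng = np.random.default_rng(0)
print(sparse_filtering_cost(rng.normal(size=(5, 16))))
```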

Sparse Regularization.

Compared with sparse coding, sparse regularization performs an extra separate stage to induce sparsity and encourage sparse representations of the input. Various sparsity regularization methods have been employed in deep belief networks or autoencoders [17]; similar to sparse coding, each has been proved beneficial for particular scenarios.

2.2 Softmax Regression

Softmax regression is a generalized version of logistic regression applied to classification problems where the class label \( y \) can take more than two values. Assume that there are \( k \) labels and \( m \) training samples \( \left\{ \left( x^{(1)}, y^{(1)} \right), \left( x^{(2)}, y^{(2)} \right), \ldots, \left( x^{(m)}, y^{(m)} \right) \right\} \), where \( x^{(i)} \) is an input sample and \( y^{(i)} \in \left\{ 1, 2, \ldots, k \right\} \) is the corresponding label.

For every input, the output probability function can be defined as follows:

$$ h_{\theta}\left( x^{(i)} \right) = \begin{bmatrix} p\left( y^{(i)} = 1 \mid x^{(i)}; \theta \right) \\ p\left( y^{(i)} = 2 \mid x^{(i)}; \theta \right) \\ \vdots \\ p\left( y^{(i)} = k \mid x^{(i)}; \theta \right) \end{bmatrix} = \frac{1}{\sum\nolimits_{j = 1}^{k} e^{\theta_{j}^{T} x^{(i)}}} \begin{bmatrix} e^{\theta_{1}^{T} x^{(i)}} \\ e^{\theta_{2}^{T} x^{(i)}} \\ \vdots \\ e^{\theta_{k}^{T} x^{(i)}} \end{bmatrix} $$
(2)

where \( \theta \) denotes the parameters of the softmax model. For each input, the probability that it belongs to category \( j \) is estimated as:

$$ p\left( y^{(i)} = j \mid x^{(i)}; \theta \right) = \frac{e^{\theta_{j}^{T} x^{(i)}}}{\sum\nolimits_{l = 1}^{k} e^{\theta_{l}^{T} x^{(i)}}} $$
(3)
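As a minimal illustration of Eqs. (2) and (3), the following NumPy sketch computes the class probabilities for a batch of inputs; the matrix shapes and names are illustrative assumptions.

```python
import numpy as np

def softmax_probs(X, Theta):
    """Class probabilities of Eq. (2): X has shape (m, d), Theta has shape (k, d).

    Row i of the result holds p(y^(i) = 1..k | x^(i); theta).
    """
    logits = X @ Theta.T                          # theta_j^T x^(i) for every class j
    logits -= logits.max(axis=1, keepdims=True)   # stabilize the exponentials
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)   # normalize as in Eq. (3)

# toy usage: 3 samples, 4 features, k = 5 classes
rng = np.random.default_rng(0)
P = softmax_probs(rng.normal(size=(3, 4)), rng.normal(size=(5, 4)))
print(P.sum(axis=1))   # each row sums to 1
```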

2.3 Ensemble Learning

According to a certain combination rule, ensemble learning employs several weak classifiers to construct a stronger one whose generalization error is significantly lower than that of any weak one. A weak learner is one whose generalization performance on an unknown testing dataset is only slightly better than random guessing. Mathematically, ensemble learning can significantly reduce the variance and thus achieve more stable performance. To obtain a better integrated result, it is necessary to make the individual learners as different as possible; that is, a high degree of diversity between the base learners is helpful to the performance of ensemble learning.

Boosting is a widely used method in statistical learning and serves as an important means of ensemble learning. By changing the weights of the training samples, a boosting method trains a group of individual learners and obtains the final decision with a voting-based combination rule.
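To make the re-weighting idea concrete, here is a minimal NumPy sketch of a generic AdaBoost-style sample-weight update for binary (+/-1) labels; it is a textbook sketch rather than the exact procedure used later in Algorithm 2.

```python
import numpy as np

def adaboost_update(weights, y_true, y_pred, eps=1e-12):
    """One boosting round: raise the weights of misclassified samples.

    weights, y_true, y_pred are 1-D arrays of equal length; weights sum to 1
    and labels are +/-1. Returns the updated (normalized) weights and the
    learner's voting weight alpha.
    """
    miss = (y_true != y_pred).astype(float)
    err = np.clip(np.sum(weights * miss), eps, 1 - eps)   # weighted error rate
    alpha = 0.5 * np.log((1 - err) / err)                 # learner's vote
    weights = weights * np.exp(alpha * np.where(miss > 0, 1.0, -1.0))
    return weights / weights.sum(), alpha
```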

3 Boosting Sparsity-Induced Autoencoder

To learn more representative and intrinsic features of the input, a novel sparsity-encouraging method is first introduced to build a new autoencoder, called the sparsity-induced autoencoder (SparsityAE). Based on SparsityAE and ensemble learning, we further propose a boosting sparsity-induced autoencoder (BoostingAE), which is capable of utilizing hierarchical and diverse features, ensuring accuracy and diversity, and boosting the performance of a single SparsityAE on computer vision tasks.

3.1 Sparsity-Induced Autoencoder

Inspired by the assumptions of sparse representation and by the efficient reconstruction from the low-dimensional feature representation obtained in the encoding phase of deep models, SparsityAE is proposed; its structure is shown in Fig. 1.

Fig. 1. The topology of the proposed sparsity-induced autoencoder

We feed the encoder with data from a high-dimensional space. The length of the code learned by each layer decreases as the encoder deepens. At the end of the encoding phase, we employ a sparsity-induced layer to generate sparse codes. Conversely, the decoding phase deals with the compressed and sparse codes given by the sparsity-induced layer. In the sparsity-induced layer, neurons without a significant activation value are set to zero, which decreases the number of active neurons, removes correlations between attributes and compresses the raw inputs.
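As a minimal sketch of the zeroing rule described above, the following assumes that a "significant activation" is judged by a simple magnitude threshold; the threshold form and value are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sparsity_induced_layer(h, threshold=0.1):
    """Zero out hidden activations whose magnitude falls below a threshold.

    h is an array of hidden-layer activations; values below the threshold
    are set to zero, producing the sparse code passed to the decoder.
    """
    return np.where(np.abs(h) >= threshold, h, 0.0)
```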

Let \( y_{i} \,(i = 1, 2, \ldots, N) \) be the original data and \( x_{i} \) be its degraded version. The input is first mapped to a hidden representation \( h(x_{i}) \), from which the reconstruction is obtained as follows:

$$ \hat{y}(x_{i} ) = \sigma \left( {W^{'} h\left( {x_{i} } \right) + b^{'} } \right) $$
(4)

where \( \hat{y}(x_{i}) \) is an approximation of \( y_{i} \), and \( \sigma (\cdot) \) is the activation function.

To benefit both from the virtues of sparse representation and deep neural networks, we optimize the reconstruction loss regularized by a sparsity-inducing term. The cost function is designed as follows:

$$ L\left( X, Y; \theta \right) = \frac{1}{N}\sum\limits_{i = 1}^{N} \left\| y_{i} - \hat{y}(x_{i}) \right\|_{2}^{2} + \beta \cdot KL\left( \hat{\rho} \| \rho \right) $$
(5)

where \( \theta = \left( W, b, W^{'}, b^{'} \right) \) represents the weights and biases, \( KL\left( \hat{\rho} \| \rho \right) \) is the sparsity regularization term used to extract sparse representations, and \( \hat{\rho} \) is the average activation of the hidden neurons:

$$ KL\left( \hat{\rho} \| \rho \right) = \sum\limits_{j = 1}^{\left| \hat{\rho} \right|} \left[ \rho \log \left( \frac{\rho}{\hat{\rho}_{j}} \right) + \left( 1 - \rho \right)\log \left( \frac{1 - \rho}{1 - \hat{\rho}_{j}} \right) \right] $$
(6)
$$ \hat{\rho} = \frac{1}{N}\sum\limits_{i = 1}^{N} h\left( x_{i} \right) $$
(7)
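Putting Eqs. (5), (6) and (7) together, a minimal NumPy sketch of the cost for one batch could look as follows; the logistic sigmoid, the parameter shapes and all names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparsity_ae_cost(X, Y, W, b, W2, b2, rho=0.05, beta=3.0, eps=1e-8):
    """Reconstruction loss of Eq. (5) with the KL sparsity term of Eqs. (6)-(7).

    X: corrupted inputs (N, d), Y: clean targets (N, d),
    (W, b): encoder parameters, (W2, b2): decoder parameters.
    """
    H = sigmoid(X @ W + b)                 # hidden representation h(x_i)
    Y_hat = sigmoid(H @ W2 + b2)           # reconstruction, Eq. (4)
    recon = np.sum((Y - Y_hat) ** 2) / X.shape[0]
    rho_hat = H.mean(axis=0)               # average activation, Eq. (7)
    kl = np.sum(rho * np.log(rho / (rho_hat + eps))
                + (1 - rho) * np.log((1 - rho) / (1 - rho_hat + eps)))  # Eq. (6)
    return recon + beta * kl
```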

In the training process, since processing whole images is intractable, the model is given the original overlapping patches \( y_{i} \left( i = 1, 2, \ldots, N \right) \) as the reconstruction targets and their corrupted versions \( x_{i} \) as the polluted inputs. Once training is completed, the learned model can reconstruct the corresponding clean image from any polluted observation. The detailed process is shown in Algorithm 1.

Algorithm 1. The training process of SparsityAE

3.2 Feature Ensemble Method

Multiple sparse features at different abstraction levels can be obtained using the SparsityAE introduced above, and ensemble feature learning can effectively improve the accuracy and stability of a single classifier. Combining these two points, BoostingAE is proposed: it uses the hierarchical features obtained in the pre-training stage to train multiple classifiers, and integrates the outputs of the classifiers with specific fusion rules to obtain the final prediction for image classification.

To make the whole structure easy to understand, Fig. 2 gives a clear and detailed view of BoostingAE, which indicates that by cascading multiple SparsityAEs, BoostingAE can in principle obtain N compressed sparse features derived from the outputs of the SparsityAEs.

Fig. 2. The topology of the proposed BoostingAE

In our work, we train three SparsityAEs and use softmax regression to perform the classification task. First, we train SparsityAE_1 in Fig. 2. Assuming that the original input sample is \( x \), the weight matrix connecting the input layer and the hidden layer is \( W^{(1)} \) and the bias vector is \( b^{(1)} \), its output can be computed as in Eq. (4):

$$ \hat{y}_{1} = \sigma \left( {x \bullet W^{\left( 1 \right)} + b^{\left( 1 \right)} } \right) $$
(8)

Regarding \( \hat{y}_{1} \) as the input of the second SparsityAE, we then train SparsityAE_2, whose weight matrix connecting the input layer and the hidden layer is \( W^{(2)} \) and whose bias vector is \( b^{(2)} \). By the same operation, SparsityAE_2’s output can be obtained, which is also used as the input of the next SparsityAE:

$$ \hat{y}_{2} = \sigma \left( {\hat{y}_{1} \bullet W^{\left( 2 \right)} + b^{\left( 2 \right)} } \right) $$
(9)

Along the cascaded sparsity-induced autoencoder network, the feature attributes learned at the current layer are passed to the next layer by the above process, and thus three SparsityAEs can be trained. At the same time, in the longitudinal direction, the model and optimal parameters of each classifier are obtained by training on the feature attributes of the corresponding encoding stage. In this way, three classifiers are obtained.
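The cascade of Eqs. (8) and (9) amounts to feeding each encoder's output to the next SparsityAE and attaching a classifier at every level; a minimal NumPy sketch of the forward pass through already-trained encoders follows (names and shapes are illustrative assumptions).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cascade_features(x, encoders):
    """Forward the input through the cascaded SparsityAE encoders.

    encoders is a list of (W, b) pairs for SparsityAE_1 ... SparsityAE_N.
    Returns the feature produced at every level (y_hat_1, y_hat_2, ...),
    each of which is later fed to its own softmax classifier.
    """
    features, h = [], x
    for W, b in encoders:
        h = sigmoid(h @ W + b)   # Eqs. (8)/(9): each level takes the previous output
        features.append(h)
    return features
```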

3.3 Combination Method of Voting

After training all base classifiers, the final prediction is given by integrating the results of the three classifiers with some fusion rules. Here, the Naïve Bayes combination rules [18], which assume that the individual classifiers are mutually independent, are applied.

We adopt three Naïve Bayes combination methods, namely the MAX, MIN, and AVG rules. Given a sample \( x \) whose label \( y \) can take \( C \) possible values, and assuming that the current BoostingAE model consists of \( N \) base classifiers, let \( P_{nj} \left( x \right) \) be the probability that the category of \( x \) is \( j \) according to the nth classifier. The label \( y \) is then defined as follows:

  • MAX rule: \( y = \arg \max\limits_{j = 1,2, \ldots ,C} \max\limits_{n = 1,2, \ldots ,N} P_{nj} \left( x \right), \)

  • MIN rule: \( y = \arg \max\limits_{j = 1,2, \ldots ,C} \min\limits_{n = 1,2, \ldots ,N} P_{nj} \left( x \right), \)

  • AVG rule: \( y = \arg \max\limits_{j = 1,2, \ldots ,C} \frac{1}{N}\sum\limits_{n = 1}^{N} P_{nj} \left( x \right). \)
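A minimal NumPy sketch of the three rules above, assuming `P` is an array of shape (N, C) that holds \( P_{nj}(x) \) for the N base classifiers:

```python
import numpy as np

def fuse(P, rule="avg"):
    """Combine per-classifier class probabilities P (shape (N, C)) into one label.

    MAX / MIN / AVG reduce over the N classifiers, then arg-max over the C classes.
    """
    if rule == "max":
        scores = P.max(axis=0)
    elif rule == "min":
        scores = P.min(axis=0)
    else:                       # "avg"
        scores = P.mean(axis=0)
    return int(np.argmax(scores))
```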

3.4 BoostingAE Algorithm

As shown in Fig. 2, the multi-layer architecture based on ensemble learning consists of an input layer, several hidden layers and an output layer that carries out the specific task.

Here, how to measure the importance of each layer's features and how to select the optimal model for each layer are two key issues that directly influence the performance of the model. As a main contribution of this paper, we employ AdaBoost to supervise the adjustment of parameters and weight coefficients. Algorithm 2 gives the detailed process of the proposed BoostingAE.

Algorithm 2. The detailed process of the proposed BoostingAE

Compared with the traditional sparse stacked autoencoder, BoostingAE has the following characteristics:

  • the construction of its base learners is similar to that of AdaBoost: BoostingAE utilizes a cascaded serialization mechanism among the base learners, which keeps the individual learners related to each other while maintaining their differences;

  • each subsequent layer takes the output of the previous layer as its input to obtain rich feature representations, which lets each learner receive a different “training input” while avoiding the waste of computing and storage resources;

  • when designing the individual learners, the topology of each model can be specified separately rather than by a unified model topology, which makes it possible to further increase the diversity of the base learners while maintaining their homogeneity.

4 Experiments

First of all, we verify the performance of SparsityAE in sparse feature learning. Next, to show the performance and stability of BoostingAE on real-world image classification in an unbiased and accurate manner, experiments are carried out on three widely used datasets, i.e., MNIST, CIFAR-10, and SVHN. Moreover, several state-of-the-art methods are employed to provide comparable results on the same datasets.

4.1 The Sparse Feature Learning of SparsityAE

To validate the performance of SparsityAE in sparse feature learning, we mainly focus on the denoising of grey-scale images. From http://decsai.ugr.es/cvg/dbimagenes, a set of natural images is employed as the training set, and a set of standard natural images widely used in image processing serves as the testing set.

In the training process, we randomly pick a clean image \( y \) from the dataset and generate its corresponding noisy patches \( x \) by corrupting it with additive white Gaussian noise of a specific strength. Once training is performed, the learned model is capable of reconstructing the corresponding clean image from any noisy observation. To avoid local minima, we adopt the layer-wise pre-training procedure introduced in [7].
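A minimal sketch of this corruption step, assuming pixel values in [0, 1] and an illustrative patch size, stride and noise strength:

```python
import numpy as np

def make_noisy_patches(image, patch=17, stride=3, sigma=25.0 / 255.0, rng=None):
    """Extract overlapping clean patches y_i and their noisy versions x_i."""
    rng = rng or np.random.default_rng()
    clean, noisy = [], []
    H, W = image.shape
    for r in range(0, H - patch + 1, stride):
        for c in range(0, W - patch + 1, stride):
            y = image[r:r + patch, c:c + patch].ravel()
            x = np.clip(y + rng.normal(0.0, sigma, y.shape), 0.0, 1.0)
            clean.append(y)
            noisy.append(x)
    return np.array(noisy), np.array(clean)
```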

Figure 3 shows the comparison between SparsityAE and the classic image denoising methods KSVD [19] and BM3D [20] on the standard testing images degraded by various noise levels. We can see that when \( \sigma = 25 \), SparsityAE (magenta line) is competitive, but it degrades for other, i.e., higher, noise strengths; this is because our model knows nothing about the noise level, while the other methods were provided with such information. The green line shows that if we train the proposed model on several different noise levels, our SparsityAE is more robust to changes of the noise level, which means it generalizes significantly better to higher noise levels.

Fig. 3. Denoising performance comparison of various methods with various noise levels (Color figure online)

Moreover, we compared SparsityAE with the state-of-the-art denoising method WNNM [21] and two training-based methods, MLP [22] and TNRD [23]. The numerical results are shown in Table 1, measured by the peak signal-to-noise ratio (PSNR, in dB). The best PSNR result for each image is highlighted in bold. Although images with a lot of repeating structure are ideal for both KSVD and BM3D, we outperform KSVD, BM3D, and WNNM on every image except Barbara. It is also shown that our SparsityAE is able to compete with MLP and TNRD.

Table 1. Comparison of the various methods’ denoising performance measured by PSNR.

The results illustrate that SparsityAE can not only project the original high-dimensional space to a lower-dimensional and more intrinsic space from the perspective of dimension reduction, but also capture more representative sparse features from multiple layers so as to make the best use of the information contained in the original space.

4.2 BoostingAE for Classification on MNIST

MNIST is a large dataset of handwritten digits that is widely used for image processing and computer vision tasks. It contains 60,000 training images and 10,000 testing images with labels, and the size of a single image is \( 28 \times 28 \).

The topology of the SparsityAE on MNIST, with three hidden layers, is first determined as \( 784 \times 500 \times 250 \times 100 \times 10 \), where 784 is the size of the image and 10 is the number of labels. Then, Classifier_1 in Fig. 2 is obtained by pre-training and fine-tuning SparsityAE_1. After that, SparsityAE_2 takes the feature representations learned by SparsityAE_1 as input to obtain Classifier_2. With this process, we obtain the three base classifiers, which are integrated once all base learners converge after fine-tuning. With this, the whole BoostingAE model is constructed and trained completely. When predicting real samples, the three Naïve Bayes combination rules are respectively used to vote on the integrated result of the three classifiers to improve performance.
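As a concrete illustration of this topology, here is a minimal NumPy sketch that initializes the stacked encoder layers for the 784 × 500 × 250 × 100 sizes; the Glorot-style initialization is an illustrative assumption, not stated in the paper.

```python
import numpy as np

def init_encoders(sizes=(784, 500, 250, 100), seed=0):
    """Randomly initialize the (W, b) pairs for the stacked encoder layers."""
    rng = np.random.default_rng(seed)
    encoders = []
    for d_in, d_out in zip(sizes[:-1], sizes[1:]):
        limit = np.sqrt(6.0 / (d_in + d_out))            # Glorot-style range
        W = rng.uniform(-limit, limit, size=(d_in, d_out))
        b = np.zeros(d_out)
        encoders.append((W, b))
    return encoders
```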

Table 2 shows that the three individual classifiers achieve better classification results than KNN and SVM because of the introduction of the sparsity-induced layer in SparsityAE. Moreover, BoostingAE with the three different fusion rules performs better than any individual classifier, achieving accuracy rates of 98.37%, 98.43% and 98.87% respectively, which is very close to the result of Lp-norm AE [24]. The stacked CAE [25] and CASE [26] employ more feature maps obtained by convolutional operations and more hidden layers, so our performance is slightly worse than theirs.

Table 2. Classification results on three datasets.

4.3 BoostingAE for Classification on CIFAR-10 and SVHN

CIFAR-10 is a dataset containing ten classes of color images, each of which contains 6000 images. The training set contains 5000 images of each class, and the remaining images are used for testing. The SVHN dataset can be regarded as an upgrade of MNIST and also contains ten classes of color images. Both datasets are captured from real life, so the backgrounds are more complex and the images are more difficult to identify. SVHN is divided into a training set, a testing set and an extra set; the validation set is constructed in a random way: 2/3 of it is derived from the training set (400 samples per class), and the remaining samples come from the extra set (200 samples per class).

Before the experiment, the original images of CIFAR-10 and SVHN are transformed from RGB space into grey-scale space and then normalized. To improve training efficiency, the mini-batch gradient descent algorithm is used in pre-training and fine-tuning. Considering the unsupervised learning mechanism of the autoencoder, both CIFAR-10 and SVHN use a certain proportion of unlabeled samples as the training set in pre-training; in the fine-tuning process, both datasets require ground-truth labels to perform the classification. Next, the topology of SparsityAE is determined as \( 1024 \times 500 \times 250 \times 100 \times 10 \). The subsequent operations are similar to those performed on MNIST.
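A minimal sketch of this preprocessing and mini-batch setup; the grayscale weights, the normalization choice and the batch size are standard conventions used here as illustrative assumptions.

```python
import numpy as np

def preprocess(rgb_images):
    """Convert RGB images (N, H, W, 3) to normalized grayscale vectors (N, H*W)."""
    gray = rgb_images @ np.array([0.299, 0.587, 0.114])            # ITU-R BT.601 weights
    gray = gray.reshape(gray.shape[0], -1).astype(np.float64)
    gray = (gray - gray.min()) / (gray.max() - gray.min() + 1e-8)  # scale to [0, 1]
    return gray

def minibatches(X, y, batch_size=128, rng=None):
    """Yield shuffled mini-batches for mini-batch gradient descent."""
    rng = rng or np.random.default_rng()
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        sel = idx[start:start + batch_size]
        yield X[sel], y[sel]
```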

We report the results of the comparison methods, the individual classifiers and the proposed method in Table 2, and reach a conclusion similar to that on MNIST: our method achieves the best results among the comparison methods. The results illustrate that BoostingAE captures sparser representations and utilizes multi-layer features, resulting in an improvement of the overall accuracy and diversity.

5 Conclusion

In this work, we first built SparsityAE by adding an extra sparsity-induced layer, which efficiently abstracts sparse feature representations. Then, based on SparsityAE and ensemble learning, we further proposed the BoostingAE model to integrate the sparse features learned from multiple layers, so as to improve the performance of an individual sparse autoencoder; the model has been successfully applied to image classification.

The main advantage of our approach is that it abstracts significantly sparser representations that better reflect the distribution of the original data, and makes full use of the features learned from multiple layers to improve the diversity of the base learners. Moreover, it promotes the overall performance after integrating multiple weak learners. Experiments on three different datasets validate the effectiveness of the proposed algorithm in image classification.