
1 Introduction

With the evolution of devices and techniques for information creation, acquisition and distribution, digital data of all kinds are emerging at a remarkable rate and enriching people’s everyday life. To manipulate such large-scale data effectively and efficiently, machine learning models need to be developed for automatic content analysis and understanding [1, 2]. The learning performance of a data-driven model largely depends on two key factors [3, 4], i.e. the number and quality of the training data, and the modeling strategy designed to explore the training data. On one hand, acquiring labeled instances requires intensive human effort for manual labeling. As a result, the accessible training data are usually very limited, which inevitably jeopardizes the learning performance. On the other hand, inferring the projection function from the training data is a process that mimics human perception of the world. To bridge the gap between low-level features and high-level concepts, the sophisticated mechanism behind the human learning process should be formulated into the model [5,6,7].

The idea of automatically generating extra instances as an extension of the limited training data is attractive, because it is a far more cost-effective way to collect a large number of instances. As a deep learning [8,9,10] method for estimating generative models based on game theory, generative adversarial networks (GANs) [11] have aroused widespread academic concern. The main idea behind GANs is a minimax two-player game, in which a generator and a discriminator are trained simultaneously via an adversarial process with conflicting objectives. After convergence, the GAN model is capable of generating realistic synthetic instances, which have great potential as augmentation of the existing training data. As for imitating the human learning process, self-paced learning (SPL) [12, 13] is a recently rising technique following the learning principle of humans: it starts by learning the easier aspects of a task, and then gradually takes more complex instances into training. The easiness of an instance is closely related to the loss between ground truth and estimation, based on which the curriculum is dynamically constructed and the training data are progressively and effectively explored.

In this paper, we propose a novel augmented self-paced learning with generative adversarial networks (ASPL-GANs) algorithm to cope with the issues of limited training data and learning scheme, by combining the strengths of two promising techniques, i.e. GANs and SPL. In brief, our framework consists of three component modules: a generator G, a discriminator D, and a self-paced learner S. To extend the limited training data, realistic synthetic instances with predefined labels are generated via the G vs. D rivalry. To fully explore the augmented training data, S dynamically maintains a curriculum and progressively refines the model in a self-paced fashion. The three modules are jointly optimized in a unified process, and a robust model is achieved with satisfactory experimental results.

2 Augmented Self-paced Learning with GANs

In the text that follows, we let \( \varvec{x} \) denote an instance, and a \( C \)-dimensional vector \( \varvec{y} = \left[ {y_{1} , \ldots ,y_{C} } \right]^{T} \in \left\{ {0,1} \right\}^{C} \) denote the corresponding class label, where \( C \) is the number of classes. The \( i \) th element \( y_{i} \) is a class label indicator, i.e. \( y_{i} = 1 \) if instance \( \varvec{x} \) falls into class \( i \), and \( y_{i} = 0 \) otherwise. \( D\left( \varvec{x} \right) \) is a scalar indicating the probability that \( \varvec{x} \) comes from real data. \( S\left( \varvec{x} \right) \) is a \( C \)-dimensional vector whose elements indicate the probabilities that \( \varvec{x} \) falls into the corresponding classes.

2.1 Overview

The framework and architecture of ASPL-GANs are illustrated in Fig. 1. The model consists of three components, i.e. a generator G, a discriminator D and a self-paced learner S. The generator G produces synthetic instances that fall into different classes. The discriminator D and the self-paced learner S are both classifiers: the former is a binary classifier that distinguishes the synthetic instances from the real ones, and the latter is a multi-class classifier that categorizes the instances into various classes. By competing with each other, G generates increasingly realistic synthetic instances, and meanwhile D’s discriminative capacity is constantly improved. As a self-paced learner, S embraces the idea behind the human learning process: it gradually incorporates instances from easy to more complex into training and achieves a robust learning model. Moreover, the synthetic instances generated by G are leveraged to further augment the classification performance. The three components are jointly optimized in a unified framework.

Fig. 1. The framework (left) and architecture (right) of ASPL-GANs.

2.2 Formulation

Firstly, based on the two classifiers in ASPL-GANs, i.e. D and S, we formulate two classification losses on an instance \( \varvec{x} \), i.e. \( \ell_{d} \) and \( \ell_{s} \), as follows.

$$ \begin{aligned} \ell_{d} \left( \varvec{x} \right) = & - I\left( {\varvec{x} \in {\mathcal{X}}} \right)\log \left( {P\left( {{\text{source}}\left( \varvec{x} \right) = real\left| \varvec{x} \right.} \right)} \right) \\ & - I\left( {\varvec{x} \in {\mathcal{X}}_{syn} } \right)\log \left( {P\left( {{\text{source}}\left( \varvec{x} \right) = synthetic\left| \varvec{x} \right.} \right)} \right) \\ = & - I\left( {\varvec{x} \in {\mathcal{X}}} \right)\log \left( {D\left( \varvec{x} \right)} \right) - I\left( {\varvec{x} \in {\mathcal{X}}_{syn} } \right)\log \left( {1 - D\left( \varvec{x} \right)} \right) \\ \end{aligned} $$
(1)
$$ \begin{aligned} \ell_{s} \left( \varvec{x} \right) = & - \sum\nolimits_{i = 1}^{C} {I\left( {y_{i} = 1} \right)\log \left( {P\left( {y_{i} = 1\left| \varvec{x} \right.} \right)} \right)} \\ = & - \varvec{y}^{T} \log \left( {S\left( \varvec{x} \right)} \right) \\ \end{aligned} $$
(2)

where \( {\mathcal{X}} \) and \( {\mathcal{X}}_{syn} \) denote the collection of real and synthetic instances, respectively. Note that \( {\mathcal{X}} \) is divided into labeled and unlabeled subsets according to whether or not the instances’ labels are revealed, i.e. \( {\mathcal{X}} = {\mathcal{X}}_{L} \mathop {\bigcup }\nolimits {\mathcal{X}}_{U} \), whereas \( {\mathcal{X}}_{syn} \) can be regarded as “labeled” because in the framework the class label is already predefined before a synthetic instance is generated. The indicator function is defined as:

$$ I\left( {condition} \right) = \left\{ {\begin{array}{*{20}l} {1,} \hfill & {condition = true} \hfill \\ {0,} \hfill & {condition = false} \hfill \\ \end{array} } \right. $$
(3)

\( \ell_{d} \) depicts the consistency between the real source and the predicted source of an instance, whereas \( \ell_{s} \) measures the consistency between the real class label and the predicted label of an instance. Based on (1) and (2), the three component modules of ASPL-GANs, i.e. G, D and S, can be formulated according to their corresponding objectives, respectively.
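As a concrete illustration, the two per-instance losses in (1) and (2) can be sketched in a few lines of NumPy. The helper names `loss_d` and `loss_s` are our own hypothetical shorthand, not part of any released implementation.

```python
import numpy as np

def loss_d(d_x, is_real):
    # Eq. (1): -log D(x) if x is real, -log(1 - D(x)) if x is synthetic,
    # where d_x = D(x) is the predicted probability that x comes from real data.
    return float(-np.log(d_x) if is_real else -np.log(1.0 - d_x))

def loss_s(s_x, y):
    # Eq. (2): cross-entropy -y^T log S(x) for a one-hot label vector y
    # and a vector s_x = S(x) of predicted class probabilities.
    return -float(np.dot(y, np.log(s_x)))
```

For instance, a real input with \( D\left( \varvec{x} \right) = 0.5 \) incurs \( \ell_{d} = \log 2 \approx 0.693 \), the loss of a maximally uncertain discriminator.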

Generator G. In ASPL-GANs, by jointly taking a random noise vector \( \varvec{z}\sim p_{noise} \) and a class label vector \( \varvec{y}_{g} \in \left\{ {0,1} \right\}^{C} \) as input, G aims to generate a synthetic instance \( \varvec{x}_{\varvec{g}} = \varvec{G}\left( {\varvec{z},\varvec{y}_{g} } \right) \) that is hardly discernible from the real instances and meanwhile consistent with the given class label. The loss function for G is formulated as:

$$ \begin{aligned} {\mathcal{L}}_{G} = & \sum\nolimits_{{\varvec{x}_{g} \in {\mathcal{X}}_{syn} }} {\left( { - \ell_{d} \left( {\varvec{x}_{g} } \right) + \alpha \ell_{s} \left( {\varvec{x}_{g} } \right)} \right)} \\ = & \sum\nolimits_{{\varvec{z}\sim p_{noise} }} {\left( {\log \left( {1 - D\left( {\varvec{G}\left( {\varvec{z},\varvec{y}_{g} } \right)} \right)} \right) - \alpha \varvec{y}_{g}^{T} \log \left( {S\left( {\varvec{G}\left( {\varvec{z},\varvec{y}_{g} } \right)} \right)} \right)} \right)} \\ \end{aligned} $$
(4)

The first term in the summation encourages synthetic instances that D mistakenly identifies as real, i.e. instances receiving high \( D\left( {\varvec{x}_{g} } \right) \). The second term, in contrast, favors synthetic instances that fall into the categories specified by their class labels at generation. \( \alpha \) is the parameter balancing the two terms.
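Under the same notation, the generator objective (4) for a single synthetic instance can be sketched as follows; `generator_loss` is a hypothetical helper name, and in practice the sum over \( \varvec{z}\sim p_{noise} \) would be approximated over a mini-batch of sampled noise vectors.

```python
import numpy as np

def generator_loss(d_out, s_out, y_g, alpha=1.0):
    # Eq. (4) for one instance: log(1 - D(G(z, y_g))) is minimized by pushing
    # D's output toward 1 (fooling the discriminator), while
    # -alpha * y_g^T log S(G(z, y_g)) rewards consistency with the given label.
    return float(np.log(1.0 - d_out) - alpha * np.dot(y_g, np.log(s_out)))
```

Note that when D is fooled half the time and S is maximally uncertain over two classes, the two terms cancel exactly.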

Discriminator D. Similar to the classic GANs, D receives both real and synthetic instances as input and tries to correctly distinguish the synthetic instances from the real ones. The loss function for D is formulated as:

$$ \begin{array}{*{20}c} {{\mathcal{L}}_{D} = \sum\nolimits_{{\varvec{x} \in {\mathcal{X}}\mathop \cup \nolimits {\mathcal{X}}_{syn} }} {\ell_{d} \left( \varvec{x} \right)} } \\ { = - \sum\nolimits_{{\varvec{x} \in {\mathcal{X}}}} {\log \left( {D\left( \varvec{x} \right)} \right) - } \sum\nolimits_{{\varvec{z}\sim p_{noise} }} {\log \left( {1 - D\left( {\varvec{G}\left( {\varvec{z},\varvec{y}_{g} } \right)} \right)} \right)} } \\ \end{array} $$
(5)

D aims to maximize the log-likelihood of assigning each input to its correct source. For the real instances, both labeled and unlabeled ones are leveraged in modeling D, because their specific class labels are irrelevant to the fact that they are real.
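A batch version of (5) can be sketched directly from the two sums; again, `discriminator_loss` is an illustrative name of our own.

```python
import numpy as np

def discriminator_loss(d_real, d_syn):
    # Eq. (5): -sum log D(x) over real instances (labeled and unlabeled alike)
    # plus -sum log(1 - D(x_g)) over synthetic instances.
    d_real, d_syn = np.asarray(d_real, dtype=float), np.asarray(d_syn, dtype=float)
    return float(-np.sum(np.log(d_real)) - np.sum(np.log(1.0 - d_syn)))
```

The loss vanishes for a perfect discriminator (outputs 1 on real, 0 on synthetic) and grows as D's predictions drift toward the wrong source.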

Self-paced Learner S. Different from the traditional self-paced learning model, S receives both real and synthetic instances as training data. In other words, S is trained on the dataset \( {\mathcal{X}}_{L} \mathop \cup \nolimits {\mathcal{X}}_{syn} \), and aims to classify its instances correctly. The training data are organized adaptively w.r.t. their easiness, and the model learns gradually from the easy instances to the complex ones in a self-paced way. The loss function for S is formulated as:

$$ {\mathcal{L}}_{S} = \sum\nolimits_{{\varvec{x} \in {\mathcal{X}}_{L} \cup {\mathcal{X}}_{syn} }} {\left( {v\left( \varvec{x} \right)u\left( \varvec{x} \right)\ell_{s} \left( \varvec{x} \right) + f\left( {v\left( \varvec{x} \right),\lambda } \right)} \right)} $$
(6)

where

$$ u\left( \varvec{x} \right) = \left\{ {\begin{array}{*{20}l} {1,} \hfill & {\varvec{x} \in {\mathcal{X}}_{L} } \hfill \\ {\gamma D\left( \varvec{x} \right),} \hfill & {\varvec{x} \in {\mathcal{X}}_{syn} } \hfill \\ \end{array} } \right. $$
(7)

is a weight penalizing the fake training data, and \( v\left( \varvec{x} \right) \) is the weight reflecting the instance’s importance in the objective. Based on (6) and (7), the loss function can be rewritten as:

$$ \begin{array}{*{20}c} {{\mathcal{L}}_{S} = \begin{array}{*{20}l} {\sum\nolimits_{{\varvec{x} \in {\mathcal{X}}_{L} }} {\left( {v\left( \varvec{x} \right)\ell_{s} \left( \varvec{x} \right) + f\left( {v\left( \varvec{x} \right),\lambda } \right)} \right)} } \hfill \\ { + \sum\nolimits_{{\varvec{x}_{g} \in {\mathcal{X}}_{syn} }} {\left( {\gamma v\left( {\varvec{x}_{g} } \right)D\left( {\varvec{x}_{g} } \right)\ell_{s} \left( {\varvec{x}_{g} } \right) + f\left( {v\left( {\varvec{x}_{g} } \right),\lambda } \right)} \right)} } \hfill \\ \end{array} } \\ { = \begin{array}{*{20}l} {\sum\nolimits_{{\varvec{x} \in {\mathcal{X}}_{L} }} {\left( { - v\left( \varvec{x} \right)\varvec{y}^{T} \log \left( {S\left( \varvec{x} \right)} \right) + f\left( {v\left( \varvec{x} \right),\lambda } \right)} \right)} } \hfill \\ { + \sum\nolimits_{{\varvec{z}\sim p_{noise} }} {\left( {\begin{array}{*{20}l} { - \gamma v\left( {\varvec{G}\left( {\varvec{z},\varvec{y}_{g} } \right)} \right)D\left( {\varvec{G}\left( {\varvec{z},\varvec{y}_{g} } \right)} \right)\varvec{y}_{g}^{T} \log \left( {S\left( {\varvec{G}\left( {\varvec{z},\varvec{y}_{g} } \right)} \right)} \right)} \hfill \\ { + f\left( {v\left( {\varvec{G}\left( {\varvec{z},\varvec{y}_{g} } \right)} \right),\lambda } \right)} \hfill \\ \end{array} } \right)} } \hfill \\ \end{array} } \\ \end{array} $$
(8)

where \( f\left( {v,\lambda } \right) \) is the self-paced regularizer and \( \lambda \) is the age parameter controlling the learning pace. Given \( \lambda \), the easy instances (with smaller losses) are preferred and leveraged for training. By jointly learning the model parameter \( \varvec{\theta}_{S} \) and the latent weight \( \varvec{v} \) with gradually increasing \( \lambda \), more instances (with larger losses) are automatically included. In this self-paced way, the model learns from easy to complex and becomes a “mature” learner. S effectively simulates the learning process of intelligent human learners by adaptively implementing a learning scheme, embodied as the weight \( v\left( \varvec{x} \right) \), according to the learning pace. Apart from the real ones, the synthetic instances are leveraged as extra training data to further augment the learning performance. Prior knowledge is encoded as the weight \( u\left( \varvec{x} \right) \) imposed on the training instances. Under this mechanism, both predetermined heuristics and dynamic learning preferences are incorporated into an automatically optimized curriculum for robust learning.
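To make the role of \( \lambda \) concrete, consider the hard regularizer \( f\left( {v,\lambda } \right) = - \lambda v \) with \( v \in \left[ {0,1} \right] \), one common choice in the SPL literature (the exact regularizer used here may differ). With the model parameters fixed, the optimal weights then have a simple closed form:

```python
import numpy as np

def update_weights(losses, lam):
    # Closed-form v* under f(v, lambda) = -lambda * v: minimizing
    # v * loss - lambda * v over v in [0, 1] selects an instance (v = 1)
    # iff its loss is below the current age parameter lambda.
    return (np.asarray(losses, dtype=float) < lam).astype(float)
```

As \( \lambda \) grows between alternating optimization rounds, instances with larger losses enter the curriculum, realizing the easy-to-complex schedule described above.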

3 Experiments

To validate the effectiveness of ASPL-GANs, we apply it to the classification of handwritten digits and real-world images, respectively. A detailed description of the datasets can be found in [2].

The proposed ASPL-GANs is compared with the following methods:

  • SL: traditional supervised learning based on labeled dataset \( {\mathcal{X}}_{L} \);

  • SPL: self-paced learning based on labeled dataset \( {\mathcal{X}}_{L} \);

  • SL-GANs: supervised learning with GANs based on labeled dataset \( {\mathcal{X}}_{L} \) and synthetic dataset \( {\mathcal{X}}_{syn} \).

Softmax regression, also known as multi-class logistic regression, is adopted to classify the images. For fairness, all the methods have access to the same number of labeled real instances. We use two distributions to determine the numbers per class. One is the uniform distribution, according to which the labeled instances are divided equally among the classes. The other is a Gaussian distribution, under which the majority of labeled instances fall into only a few classes. The two settings simulate the balanced and imbalanced scenarios of training data. For the methods leveraging augmented training data, synthetic instances falling into the minority classes are preferentially generated to alleviate the data imbalance problem.
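The two labeling schemes can be sketched as follows; the function name and the Gaussian width are our own illustrative choices rather than the exact experimental protocol.

```python
import numpy as np

def per_class_counts(n_total, n_classes, mode="uniform"):
    # Split n_total labeled instances among n_classes.
    if mode == "uniform":
        # Balanced scenario: equal counts per class.
        return np.full(n_classes, n_total // n_classes)
    # Imbalanced scenario: a Gaussian profile concentrates most labels
    # on a few central classes (width chosen arbitrarily for illustration).
    centre = (n_classes - 1) / 2.0
    w = np.exp(-0.5 * ((np.arange(n_classes) - centre) / (n_classes / 6.0)) ** 2)
    return np.floor(n_total * w / w.sum()).astype(int)
```

Under the Gaussian profile, the classes in the tails receive few (or no) labeled instances, which is precisely where synthetic minority-class instances can compensate.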

Figure 2 illustrates the classification results of SL, SPL, SL-GANs and ASPL-GANs on both the handwritten digit and the real-world image datasets. The horizontal axis shows the number of initial training instances.

Fig. 2. The classification accuracies on (1) the handwritten digit dataset and (2) the real-world image dataset.

The analysis of the experimental results is as follows.

  • The traditional learning method SL is trained on the limited training data, and the training data are incorporated all at once, indiscriminately. As a result, the learning performance is severely hampered.

  • Both SPL and SL-GANs achieve improvements over SL. The former explores the limited training data in a more effective way, whereas the latter leverages extra training data via GANs. As we can see, SL-GANs is especially helpful for simpler datasets such as the handwritten digit dataset, because the generated instances are more reliable. In contrast, the synthetic real-world images are less realistic, and thus less helpful in augmenting the learning performance. SPL successfully simulates the process of human cognition, and thus achieves consistent improvement on both datasets, especially in the balanced scenario. The problem of data imbalance can be alleviated by generating minority-class instances.

  • The proposed ASPL-GANs achieves the highest classification accuracy among all the methods. By naturally combining GANs and SPL, the problems of insufficient training data and ineffective modeling are effectively addressed.

4 Conclusion

In this paper, we have proposed augmented self-paced learning with generative adversarial networks (ASPL-GANs) to address the issues of limited training data and an unsophisticated learning scheme. The contributions of this work are three-fold. Firstly, we developed a robust learning framework, which consists of three component modules formulated with their corresponding objectives and optimized jointly in a unified process to achieve improved learning performance. Secondly, realistic synthetic instances with predetermined class labels are generated via competition between the generator and the discriminator to provide extra training data. Last but not least, both real and synthetic instances are incorporated in a self-paced learning scheme, which integrates prior knowledge and a dynamically constructed curriculum to fully explore the augmented training dataset. Encouraging results are obtained in experiments on multiple classification tasks.