1 Introduction

Over the past few years, we have witnessed the great success of generative adversarial networks (GANs) [1] for a variety of applications. GANs are a useful family of generative models that casts generative modeling as a zero-sum game between two networks: a generator network produces plausible samples given some noise, while a discriminator network distinguishes between the generator’s output and real data. Numerous works have been inspired by the original GANs, [2,3,4,5] to name a few. While GANs can produce visually pleasing samples, they lack a reliable way of measuring the difference between the fake and real data distributions, which leads to unstable training.

To address this issue, [6] introduced the Wasserstein-1 metric (W-met) to the GAN framework. Compared to the Jensen-Shannon (JS) or the Kullback-Leibler (KL) divergence, W-met is considered to be more sensible for distributions supported by low dimensional manifolds. Given that the primal form of W-met is intractable to compute, [6] proposed to use the dual form of W-met, which requires the k-Lipschitz constraint. A series of ideas [6,7,8,9] were proposed to approximate the dual W-met and achieved impressive results compared to non-Wasserstein based GANs. However, they generally suffer from unsatisfactory regularization of the k-Lipschitz constraint, mainly because it is a very strict constraint that is non-trivial to approximate [9, 10].

Other studies have tackled the stability issue from different angles. For example, [10] proposed a gradient-based regularizer associated with the \(\mathfrak {f}\)-divergence [11] to address the dimensional misspecification problem. In order to stabilize the training towards high resolution images, [12, 13] applied deep stacked architectures that incorporate extra information. Recently, building upon the dual W-met objective of [7], [14] presented a sophisticated progressive growing training scheme and obtained excellent high resolution images.

In this paper, we propose to resolve the k-Lipschitz constraint by introducing a relaxed version of W-met and incorporating it in the GAN framework. Our contributions can be summarized as follows:

  1. We introduce a novel Wasserstein divergence (W-div) and prove that the proposed W-div is a symmetric divergence. Moreover, we explore the connection between the proposed W-div and W-met.

  2. Benefiting from the relaxed constraint required by the W-div, we introduce Wasserstein divergence GANs (WGAN-div) as its practical application. The proposed objective can faithfully approximate the corresponding W-div through optimization.

  3. We demonstrate the stability of WGAN-div under various settings, including progressive growing training. Also, we conduct extensive experiments on standard image synthesis benchmarks and present superior results of WGAN-div compared to the state-of-the-art methods, both quantitatively and qualitatively.

2 Background

Imagine there are two players in a game. One player (Generator) intends to generate visually plausible images, aiming to fool its opponent, while the opponent (Discriminator) attempts to discriminate real images from synthetic images. Such adversarial competition is the key idea behind GAN models. To measure the distance between real and fake data distributions, [1] proposed the objective

$$\begin{aligned} L_{\text {JS}}(\mathbb {P}_r, \mathbb {P}_g) = \underset{\varvec{x} \sim \mathbb {P}_r}{\mathbb {E}}[\text {ln}(f(\varvec{x}))] + \underset{\tilde{\varvec{x}} \sim \mathbb {P}_g}{\mathbb {E}}[\text {ln}(1 -f(\tilde{\varvec{x}}))], \end{aligned}$$
(1)

which can be interpreted as the JS divergence up to a constant [15] and where f is a discriminative function. The model can thus be defined as a min-max optimization problem:

$$\begin{aligned} \underset{G}{\mathrm {min}} \underset{D}{\mathrm {max}} \underset{\varvec{x} \sim \mathbb {P}_r}{\mathbb {E}}[\mathrm {ln}(D(\varvec{x}))] + \underset{G(\varvec{z}) \sim \mathbb {P}_g}{\mathbb {E}}[\mathrm {ln}(1 - D(G(\varvec{z})))], \end{aligned}$$
(2)

where G is the generator parametrized by a neural network and D is the discriminative neural network parametrizing f. Usually, we let \(\varvec{z}\) be low dimensional random noise, and \(\varvec{x}, G(\varvec{z})\) are the real and fake data distributed according to the probability measures \(\mathbb {P}_r, \mathbb {P}_g\).
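As a concrete illustration of Eq. 2, the following minimal PyTorch sketch (our own illustration, not the authors' code) computes the two losses for one batch; D is assumed to end in a sigmoid, and `x_real` and `z` are hypothetical batch tensors.

```python
# Minimal sketch of the min-max objective in Eq. 2, assuming a discriminator D
# with sigmoid output and hypothetical batch tensors `x_real` (real images)
# and `z` (noise vectors).
import torch

def gan_losses(D, G, x_real, z, eps=1e-8):
    x_fake = G(z)
    # D ascends E[ln D(x)] + E[ln(1 - D(G(z)))], i.e. minimizes the negative sum.
    d_loss = -(torch.log(D(x_real) + eps).mean()
               + torch.log(1.0 - D(x_fake.detach()) + eps).mean())
    # G descends E[ln(1 - D(G(z)))].
    g_loss = torch.log(1.0 - D(x_fake) + eps).mean()
    return d_loss, g_loss
```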

Wasserstein GANs (WGANs). The rise of the Wasserstein-1 metric (W-met) in GAN models is primarily motivated by unstable training caused by the gradient vanishing problem [6]. Given two probability measures \(\mathbb {P}_r, \mathbb {P}_g\), the W-met [16] is defined as

$$\begin{aligned} \mathcal {W}_1(\mathbb {P}_r, \mathbb {P}_g) = \underset{f \in \mathrm {Lip}_1}{\mathrm {sup}} \underset{\varvec{x} \sim \mathbb {P}_r}{\mathbb {E}}[f(\varvec{x})] - \underset{\tilde{\varvec{x}} \sim \mathbb {P}_{g}}{\mathbb {E}}[f(\tilde{\varvec{x}})], \end{aligned}$$
(3)

where \(\mathrm {Lip}_1\) is the function space of all f satisfying the 1-Lipschitz constraint \(\Vert f\Vert _L \le 1\). It is worth mentioning that \(\mathcal {W}_1\) is invariant up to a positive scalar k if the 1-Lipschitz constraint is replaced by a k-Lipschitz constraint. \(\mathcal {W}_1\) is believed to be more sensible for distributions supported by low dimensional manifolds such as images and videos. Generally, the existing Wasserstein GANs (WGANs) fall into two categories:

Weight Constraints. To approximately satisfy the Lipschitz constraint, [6] proposed a weight clipping method that imposes a hard threshold \(c > 0\) on the weights \(\varvec{w}\) of the discriminator D, which parametrizes f in Eq. 3:

$$\begin{aligned} \varvec{w'} = {\left\{ \begin{array}{ll} \varvec{w} &{} \quad \text {if } |\varvec{w}| < c\\ c &{} \quad \text {if } \varvec{w} \ge c \\ -c &{} \quad \text {if } \varvec{w} \le -c \end{array}\right. } \end{aligned}$$
(4)

This approach was proven to be unsatisfactory by [7], since through weight clipping, the neural network tends to learn oversimplified functions. Later, [8] proposed spectral normalization GANs (SNGANs). To impose the 1-Lipschitz constraint, SNGANs normalize the weights \(\varvec{w}_i\) of each layer i by the \(L_2\) matrix norm,

$$\begin{aligned} \varvec{w'}_i = \frac{\varvec{w}_i}{\Vert \varvec{w}_i \Vert _2}. \end{aligned}$$
(5)

Because the set of functions satisfying the local 1-Lipschitz constraint is merely a subset of the function space \(\text {Lip}_1\), such a constraint inevitably narrows the effective search space and entails a sub-optimal solution.
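To make the two weight-constraint strategies concrete, the sketch below (our illustration, not the reference code of [6] or [8]) applies hard clipping as in Eq. 4 and a per-layer normalization in the spirit of Eq. 5 to the parameters of a PyTorch discriminator; SNGAN estimates the spectral norm with power iteration, whereas we use an exact matrix norm for clarity.

```python
# Weight-constraint sketches for Eqs. 4 and 5, applied to a PyTorch module D.
import torch

def clip_weights(D, c=0.01):
    # Hard clipping (Eq. 4): force every weight into [-c, c].
    with torch.no_grad():
        for w in D.parameters():
            w.clamp_(-c, c)

def normalize_weights(D):
    # Per-layer normalization (Eq. 5): divide each weight matrix by its
    # L2 (spectral) matrix norm; conv kernels are flattened to 2D first.
    with torch.no_grad():
        for w in D.parameters():
            if w.dim() >= 2:
                sigma = torch.linalg.matrix_norm(w.flatten(1), ord=2)
                w.div_(sigma + 1e-12)
```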

Gradient Constraints. To overcome the disadvantages of weight clipping, [7] introduced a gradient penalty term to Wasserstein GANs (WGAN-GP). The objective is defined as

$$\begin{aligned} L_{\text {GP}} = \underbrace{\underset{\varvec{x} \sim \mathbb {P}_r}{\mathbb {E}}[f(\varvec{x})] - \underset{\tilde{\varvec{x}} \sim \mathbb {P}_{g}}{\mathbb {E}}[f(\tilde{\varvec{x}})]}_\text {Wasserstein term} + k \underbrace{\underset{\hat{\varvec{x}} \sim \mathbb {P}_{y}}{\mathbb {E}}[(\Vert \nabla f(\hat{\varvec{x}})\Vert _2 - 1) ^2]}_\text {gradient penalty}, \end{aligned}$$
(6)

where \(\nabla \) is the gradient operator and \(\mathbb {P}_{y}\) is the distribution obtained by sampling uniformly along straight lines between points from the real and fake data distributions \(\mathbb {P}_{r}\) and \(\mathbb {P}_{g}\). As pointed out by [9, 10], with a finite number of training iterations on limited input samples, it is very difficult to guarantee the k-Lipschitz constraint for the whole input domain. Thus, [9] further proposed Wasserstein GANs with a consistency term (CTGANs). Inspired by the original 1-Lipschitz constraint, CTGANs add the following term to Eq. 6,

$$\begin{aligned} \text {CT}|_{\varvec{x}_1, \varvec{x}_2} = \mathbb {E}_{\varvec{x}_1, \varvec{x}_2}[\text {max}(0, \frac{d(f(\varvec{x}_1), f(\varvec{x}_2))}{d(\varvec{x}_1, \varvec{x}_2)} - c)], \end{aligned}$$
(7)

where \(\varvec{x}_1, \varvec{x}_2\) are two data points, d is a metric and c is a threshold. Recently, to improve stability and image quality, [14] proposed a training scheme in which GANs are grown progressively. In addition to progressive growing, [14] also proposed an objective \(L_\text {PG} = L_\text {GP} + \text {PG}\), where

$$\begin{aligned} \text {PG} = {\left\{ \begin{array}{ll} \underset{\hat{\varvec{x}} \sim \mathbb {P}_{y}}{\mathbb {E}}[(\Vert \nabla f(\hat{\varvec{x}})\Vert _2 - 750) ^2 / 750 ^2] &{} \quad \text {for CIFAR-10} \\ 0.001 \underset{\hat{\varvec{x}} \sim \mathbb {P}_{y}}{\mathbb {E}}[\Vert \nabla f(\hat{\varvec{x}})\Vert _2 ^2] &{} \quad \text {for other datasets} \end{array}\right. } \end{aligned}$$
(8)
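For reference, the following is a minimal sketch of the gradient penalty of Eq. 6, under the common interpretation that \(\mathbb {P}_y\) is obtained by interpolating real and fake samples; D, `x_real`, and `x_fake` are hypothetical PyTorch objects, and this is our illustration rather than the code of [7].

```python
# Gradient penalty of Eq. 6 on interpolated samples x_hat ~ P_y.
import torch

def gradient_penalty(D, x_real, x_fake, k=10.0):
    eps = torch.rand(x_real.size(0), 1, 1, 1, device=x_real.device)
    x_hat = (eps * x_real + (1.0 - eps) * x_fake).requires_grad_(True)
    grad = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
    grad_norm = grad.flatten(1).norm(2, dim=1)
    # Penalize deviation of the gradient norm from the 1-Lipschitz target.
    return k * ((grad_norm - 1.0) ** 2).mean()
```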

\(\mathfrak {f}\)-GANs. Outside the family of Wasserstein metrics, there is another important family of divergences—the \(\mathfrak {f}\)-divergences. [11] argued that \(\mathfrak {f}\)-divergence can be used for training generative samplers and proposed \(\mathfrak {f}\)-GANs. Since the \(\mathfrak {f}\)-GANs are vulnerable to the dimension mismatch between fake and real data, [10] proposed a gradient-based regularizer to stabilize the training and gave an example based on JS-divergence:

$$\begin{aligned} \begin{aligned} L_{\text {RJS}}(\mathbb {P}_r, \mathbb {P}_g)&= \underset{\varvec{x} \sim \mathbb {P}_r}{\mathbb {E}}[\text {ln}(f(\varvec{x}))] + \underset{\tilde{\varvec{x}} \sim \mathbb {P}_g}{\mathbb {E}}[\text {ln}(1 -f(\tilde{\varvec{x}}))] - k \varOmega (\mathbb {P}_r, \mathbb {P}_g) \\ \varOmega (\mathbb {P}_r, \mathbb {P}_g)&:= \underset{\varvec{x} \sim \mathbb {P}_r}{\mathbb {E}} \left[ (1-f(\varvec{x}))^2 \Vert \nabla f(\varvec{x}) \Vert ^2 \right] + \underset{\tilde{\varvec{x}} \sim \mathbb {P}_g}{\mathbb {E}} \left[ f(\tilde{\varvec{x}})^2 \Vert \nabla f(\tilde{\varvec{x}}) \Vert ^2 \right] . \end{aligned} \end{aligned}$$
(9)

Information Geometry. In information geometry, [17] studied the connections between the Wasserstein distance and the Kullback-Leibler (KL) divergence employed by early GANs. They exploit the fact that regularizing the Wasserstein distance with an entropy term yields a divergence, which naturally defines certain geometrical structures from the information geometry viewpoint.

3 Proposed Method

As discussed above, it is very challenging to approximate the W-met. This is due to the gap between limited input samples on the one hand and the strict 1-Lipschitz constraint on the whole input sample domain [9, 18] on the other hand. At the same time, it is natural to ask whether there exists an optimal \(f^*\) for W-met (Eq. 3). According to [19], by solving a family of minimization problems given \(p > 0\)

$$\begin{aligned} f_p = \underset{f \in W_c^{1, p}}{\mathrm {argmin}} \underset{\varvec{x} \sim \mathbb {P}_r}{\mathbb {E}}[f(\varvec{x})] - \underset{\tilde{\varvec{x}} \sim \mathbb {P}_{g}}{\mathbb {E}}[f(\tilde{\varvec{x}})] + \frac{1}{p} \underset{\hat{\varvec{x}} \sim \mathbb {P}_{u}}{\mathbb {E}}[\Vert \nabla f(\hat{\varvec{x}})\Vert ^p], \end{aligned}$$
(10)

where \(\mathbb {P}_u\) is a Radon probability measure and \(W_c^{1, p}\) is the Sobolev space containing all the functions f in \(L^p\) space with first order weak derivatives and compact support, we can find a sequence \(p_k \rightarrow \infty \) such that \(f_{p_k} \rightarrow -f^*\).

3.1 Wasserstein Divergence

The connection between Eq. 10 and W-met inspires us to propose a novel Wasserstein divergence (W-div) and we prove that it is indeed a valid symmetric divergence.

Theorem 1

(Wasserstein divergence). Let \(\varOmega \subset \mathbb {R}^n\) be an open, bounded, connected set and S be the set of all the Radon probability measures on \(\varOmega \). If for some \(p \ne 1, k > 0\) we define

$$\begin{aligned} \begin{aligned} \mathcal {W}_{p,k}^{'}:S \times S&\rightarrow \mathbb {R}^- \cup \{ 0 \} \\ (\mathbb {P}_r, \mathbb {P}_g)&\rightarrow \underset{f \in C_c^{1}(\varOmega )}{\mathrm {inf}} \underset{\varvec{x} \sim \mathbb {P}_r}{\mathbb {E}}[f(\varvec{x})] - \underset{\tilde{\varvec{x}} \sim \mathbb {P}_{g}}{\mathbb {E}}[f(\tilde{\varvec{x}})] + k \underset{\hat{\varvec{x}} \sim \mathbb {P}_{u}}{\mathbb {E}}[\Vert \nabla f(\hat{\varvec{x}})\Vert ^p], \end{aligned} \end{aligned}$$
(11)

where \(C_c^1(\varOmega )\) is the function space of all the first order differentiable functions on \(\varOmega \) with compact support, then \(\mathcal {W}_{p,k}^{'}\) is a symmetric divergence (up to the negative sign).

Proof

See supplementary material.

By imposing the \(C_c^1(\varOmega )\) function space, we rule out pathological functions with weak derivatives. Compared to the k-Lipschitz constraint, \(f \in C_c^1(\varOmega )\) is less restrictive, since \(\Vert \nabla f\Vert \) does not need to be bounded by a hard threshold k. Given the universal approximation theorem and the modern architecture of neural networks—stacking differentiable layers to form a nonlinear differentiable function—\(f \in C_c^1(\varOmega )\) can easily be parameterized by a neural network.

In the following we further explore the connection between the proposed W-div and the original W-met in Eq. 3.

Remark 1

(Upper bound). Given Radon probability measures \(\mathbb {P}_r, \mathbb {P}_g, \mathbb {P}_u\) on \(\varOmega \), let

$$\begin{aligned} \mathcal {W}_{\mathbb {P}_u}^{'}(\mathbb {P}_r, \mathbb {P}_g):= \underset{f \in C_c^{\infty }(\varOmega )}{\mathrm {inf}} \underset{\varvec{x} \sim \mathbb {P}_r}{\mathbb {E}}[f(\varvec{x})] - \underset{\tilde{\varvec{x}} \sim \mathbb {P}_{g}}{\mathbb {E}}[f(\tilde{\varvec{x}})] + \frac{1}{2} \underset{\hat{\varvec{x}} \sim \mathbb {P}_{u}}{\mathbb {E}}[\Vert \nabla f(\hat{\varvec{x}})\Vert ^2], \end{aligned}$$
(12)

where \(C_c^{\infty }\) is the function space of all the smooth functions f with compact support. There exists an optimal \(f^*\) for \(\mathcal {W}_1\) (Eq. 3) such that

$$\begin{aligned} \mathcal {W}_1(\mathbb {P}_r, \mathbb {P}_g) = \underset{\varvec{x} \sim \mathbb {P}_r}{\mathbb {E}}[f^*(\varvec{x})] - \underset{\tilde{\varvec{x}} \sim \mathbb {P}_{g}}{\mathbb {E}}[f^*(\tilde{\varvec{x}})], \end{aligned}$$
(13)

and a \(\mathcal {W}_{\mathbb {P}_{u^*}}^{'}\) determined by \(f^*\) such that

$$\begin{aligned} \mathcal {W}_{\mathbb {P}_{u^*}}^{'}(\mathbb {P}_r, \mathbb {P}_g) = \underset{\mathbb {P}_{u} \in S}{\mathrm {sup}} \, \mathcal {W}_{\mathbb {P}_{u}}^{'}(\mathbb {P}_r, \mathbb {P}_g). \end{aligned}$$
(14)

Please see the detailed discussion in [19].

Remark 1 indicates that \(\mathcal {W}_{\mathbb {P}_{u^*}}^{'}\), which is determined by the optimal \(f^*\), is the upper bound of our W-div \(\mathcal {W}_{\mathbb {P}_{u}}^{'}\).

Given the similarities between our proposed W-div and \(L_{\text {GP}}\) (Eq. 6), it may be interesting to know if there exists a divergence corresponding to \(L_{\text {GP}}\). In general, the answer is no.

Remark 2

If for \(n > 0\) we let

$$\begin{aligned} \mathcal {W}_{p,k,n}^{''}(\mathbb {P}_r, \mathbb {P}_g):= \underset{f \in C_c^{1}(\varOmega )}{\mathrm {inf}} \underset{\varvec{x} \sim \mathbb {P}_r}{\mathbb {E}}[f(\varvec{x})] - \underset{\tilde{\varvec{x}} \sim \mathbb {P}_{g}}{\mathbb {E}}[f(\tilde{\varvec{x}})] + k \underset{\hat{\varvec{x}} \sim \mathbb {P}_{u}}{\mathbb {E}}[(\Vert \nabla f(\hat{\varvec{x}})\Vert - n)^p], \end{aligned}$$
(15)

then \(\mathcal {W}_{p,k,n}^{''}\) is not a divergence in general.

Counterexample

Assuming \(\varOmega = (-1, 1)\) and \(p = 2\), it suffices to show that \(\mathcal {W}_{2,k,n}^{''}(\mathbb {P}_r, \mathbb {P}_g) \ne 0\) for \(\mathbb {P}_r = \mathbb {P}_g\) almost everywhere. Since \({\mathbb {E}}_{\varvec{x} \sim \mathbb {P}_r}[f(\varvec{x})]\) and \({\mathbb {E}}_{\tilde{\varvec{x}} \sim \mathbb {P}_{g}}[f(\tilde{\varvec{x}})]\) cancel out, in order to guarantee \(\mathcal {W}_{2,k,n}^{''}(\mathbb {P}_r, \mathbb {P}_g) = 0\), \(\Vert \nabla f(\hat{\varvec{x}})\Vert \) must be equal to n on \((-1, 1)\); by continuity of \(\nabla f\), this forces f to be affine with nonzero slope, which contradicts the compact support constraint. For m-dimensional sets such as \((-1, 1)^m\) and an even integer p we need to employ the uniqueness argument of the Picard-Lindelöf Theorem to show that f can only be affine.

Remark 2 implies that the plausible statistical distance \(\mathcal {W}_{2,k,1}^{''}\) corresponding to Eq. 6 is neither a divergence nor a valid metric.

3.2 Wasserstein Divergence GANs

Although W-met enjoys the appealing property of providing useful gradients, in practice, the original formulation \({\mathbb {E}}_{\varvec{x} \sim \mathbb {P}_r}[f(\varvec{x})] - {\mathbb {E}}_{\tilde{\varvec{x}} \sim \mathbb {P}_{g}}[f(\tilde{\varvec{x}})]\) of W-met cannot be directly applied as an objective without imposing the strict 1-Lipschitz constraint. In contrast, it is very straightforward to use our proposed W-div as an objective. Therefore, we introduce Wasserstein divergence GANs (WGAN-div). Our objective can be directly derived as

$$\begin{aligned} L_{\text {DIV}} = \underset{\varvec{x} \sim \mathbb {P}_r}{\mathbb {E}}[f(\varvec{x})] - \underset{\tilde{\varvec{x}} \sim \mathbb {P}_{g}}{\mathbb {E}}[f(\tilde{\varvec{x}})] + k \underset{\hat{\varvec{x}} \sim \mathbb {P}_{u}}{\mathbb {E}}[\Vert \nabla f(\hat{\varvec{x}})\Vert ^p], \end{aligned}$$
(16)
Table 1. The default architecture of WGAN-div for \(64\times 64\) image generation
Table 2. Visual and FID comparison for generated samples (green dots) and real samples (yellow dots) on Swiss Roll, 8 Gaussians and 25 Gaussians. The value surfaces of the discriminators are also plotted.

which is identical to the formulation of W-div without the infimum. Minimizing \(L_{\text {DIV}}\) faithfully approximates \(\mathcal {W}_{p,k}^{'}\), in the sense that a decrease of \(L_{\text {DIV}}\) indicates a better approximation of \(\mathcal {W}_{p,k}^{'}\). In comparison, lowering \(L_{\text {GP}}\) does not necessarily imply that \(L_{\text {GP}}\) approximates \(\mathcal {W}_1\) better, since \(L_{\text {GP}}\) can be decreased at the cost of violating the gradient penalty term (Eq. 6).

By incorporating our objective \(L_{\text {DIV}}\) in the GAN framework, together with parameterizing \(f \in C_c^1\) by a discriminator D and the fake data distribution \(\mathbb {P}_g\) by a generator G, our min-max optimization problem can be written as

$$\begin{aligned} \min _{G} \max _{D} \, \underset{G(\varvec{z}) \sim \mathbb {P}_{g}}{\mathbb {E}}[D(G(\varvec{z}))] - \underset{\varvec{x} \sim \mathbb {P}_r}{\mathbb {E}}[D(\varvec{x})] - k \underset{\hat{\varvec{x}} \sim \mathbb {P}_{u}}{\mathbb {E}}[\Vert \nabla _{\hat{\varvec{x}}} D(\hat{\varvec{x}})\Vert ^p], \end{aligned}$$
(17)

where \(\varvec{z}\) is random noise, \(\varvec{x}\) is the real data, and \(\hat{\varvec{x}}\) is sampled as a linear combination of real and fake data points. For further studies of sampling strategies we refer readers to our supplementary material. The final algorithm is summarized in Algorithm 1. Following the good practice of [7], our building blocks for D and G are residual blocks [20]. The default architecture of WGAN-div is presented in Table 1. We apply Adam optimization [21] to update G and D. We study the crucial hyperparameters, namely the coefficient k and the power p, in the next section.
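To make the optimization in Eq. 17 concrete, the following PyTorch sketch shows one discriminator step and one generator step with the default \(p=6\), \(k=2\) and the linear interpolation sampling of \(\hat{\varvec{x}}\) described above. It is our own illustration under these assumptions, not the reference implementation; in the full loop the discriminator step is repeated several times (4 or 5 in our experiments) per generator step.

```python
# One WGAN-div training step for the objective in Eq. 17 (sketch).
import torch

def wgan_div_d_step(D, G, x_real, z, opt_D, p=6, k=2):
    x_fake = G(z).detach()
    # Sample x_hat from P_u as a linear combination of real and fake points.
    eps = torch.rand(x_real.size(0), 1, 1, 1, device=x_real.device)
    x_hat = (eps * x_real + (1.0 - eps) * x_fake).requires_grad_(True)
    grad = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
    div_term = k * grad.flatten(1).norm(2, dim=1).pow(p).mean()
    # D maximizes E[D(G(z))] - E[D(x)] - k E[||grad||^p]; minimize the negative.
    d_loss = D(x_real).mean() - D(x_fake).mean() + div_term
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()
    return d_loss.item()

def wgan_div_g_step(D, G, z, opt_G):
    # G minimizes E[D(G(z))], the only term of Eq. 17 that depends on G.
    g_loss = D(G(z)).mean()
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()
    return g_loss.item()
```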

Fig. 1. Curves of FID vs. iteration (top left), discriminator cost vs. iteration (top right), FID vs. power p (bottom left), and FID vs. coefficient k (bottom right) for WGAN-div on CelebA.

4 Experiments

In this section, we evaluate WGAN-div on toy datasets and three widely used image datasets—CIFAR-10, CelebA [22] and LSUN [23]. As a preliminary evaluation, we use low-dimensional datasets, namely Swiss Roll, 8 Gaussians and 25 Gaussians, to show that our proposed W-div can be learned more effectively than the W-met used by WGAN-GP and CTGAN, in terms of more meaningful value surfaces of the discriminator D (i.e., f) and better generated data distributions (Table 2). Meanwhile, the three large scale datasets highlight a variety of challenges that WGAN-div should address, and evaluation on them is adequate to support the advantages of WGAN-div.

Recently, [24] pointed out that the inception score (IS) [25] is not reliable because it does not incorporate the statistics of real image samples. As an alternative, they introduced the Fréchet inception distance (FID) to measure the difference between the real and fake data distributions. Experiments verified that the FID score is consistent with human visual judgment. Later, [26] conducted a comprehensive study of the state-of-the-art GANs based on FID, which confirmed that FID provides a fairer assessment. Hence, we consider the FID score as the major criterion for evaluating our method. Also, visual results are provided as a complementary form of verification.
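For completeness, the FID reduces to a Fréchet distance between Gaussians fitted to Inception activations of real and generated images; the sketch below implements this standard formula. It is our illustration, not the exact evaluation script behind the numbers in the tables, and it assumes the activations `act_real` and `act_fake` have already been extracted.

```python
# Standard FID formula: ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 (C_r C_f)^(1/2)).
import numpy as np
from scipy import linalg

def fid(act_real, act_fake):
    mu_r, mu_f = act_real.mean(axis=0), act_fake.mean(axis=0)
    cov_r = np.cov(act_real, rowvar=False)
    cov_f = np.cov(act_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    covmean = covmean.real  # drop tiny imaginary parts from numerical error
    return float(np.sum((mu_r - mu_f) ** 2)
                 + np.trace(cov_r + cov_f - 2.0 * covmean))
```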

We compare our WGAN-div to the state-of-the-art DCGAN [2], WGAN-GP [7], RJS-GAN [18], CTGAN [9], SNGAN [8], and PGGAN [14]. For each method, we apply the default architectures and hyperparameters recommended by their papers. The default architectures for G and D of WGAN-div follow the ResNet design [20] as presented in Table 1. We use Adam optimization [21] for updating G and D with a learning rate of 0.0002 for all three datasets. The number of training steps is 100,000 for CelebA and CIFAR-10, and 200,000 for LSUN. By cross validation we determine the number of iterations for D per training step to be 4 for CelebA and LSUN, and 5 for CIFAR-10.

4.1 Hyperparameter Study

We demonstrate the impact of two important hyperparameters—the power p and the coefficient k—on our WGAN-div method. Both of them control the gradient term of \(L_\text {DIV}\). We report the obtained FID scores on the \(64\times 64\) CelebA dataset in the bottom row of Fig. 1. For a fixed optimal \(p = 6\) and varying k, Fig. 1 shows that \(L_\text {DIV}\) is not sensitive to changes of k, with the FID score fluctuating mildly around 16. On the other hand, for a fixed \(k = 2\) and changing p, we obtain the optimal FID at \(p=6\), which differs from the common choice \(p=2\) applied in WGAN methods. The fact that \(f_p\) (Eq. 10) converges to the optimal discriminator as p becomes larger may explain why \(L_\text {DIV}\) favors a larger power p. To summarize, our default values are determined to be \(p=6\) and \(k=2\).

4.2 Stability Study

In this section we evaluate the stability of our method under changes in architecture and compare it to other approaches. In this light, we apply various architecture settings to WGAN-div, WGAN-GP, and RJS-GAN, which represent three types of statistical distances: W-div, W-met, and \(\mathfrak {f}\)-divergence. We train these methods with two standard architectures—ConvNet as used by DCGAN [2] and ResNet [20], which is used by WGAN-GP [7]. Since batch normalization [27] (BN) is considered to be a key ingredient in stabilizing the training process [2], we also evaluate the FID without BN. In total, we use four settings: ResNet, ResNet without BN, ConvNet, and ConvNet without BN. In Table 3, each column reports the visual and FID results obtained under the same architecture. Our WGAN-div achieves the best FID scores for all four settings. Table 3 also features the corresponding visual results. Compared to WGAN-GP and RJS-GAN, WGAN-div produces more visually pleasing images, and the visual quality remains more stable under changing settings. This experimental study confirms the advantages gained by our W-div and the objective \(L_\text {DIV}\) derived from it.

Table 3. FID scores and qualitative comparison of various architectures on CelebA.

4.3 Evaluation on the Standard Training Scheme

In this experiment, we intend to fairly compare the performance of various GANs by ruling out the impact caused by fine-tuned training strategies. For this purpose, we follow the standard, i.e. non-growing, training scheme, which fixes the size and architecture of the discriminator and generator through the whole training process. We compute the FID scores for DCGAN, WGAN-GP, RJS-GAN, CTGAN, and WGAN-div. The configurations of the compared methods are set according to the recommendations from the authors. The results are reported in Table 4. WGAN-div reaches the best FID scores among the compared approaches, which quantitatively confirms the advantages of our method.

While the FID score of WGAN-div mildly outperforms the state-of-the-art methods on CIFAR-10, it demonstrates clearer improvements on the larger scale datasets CelebA and LSUN. Similarly, the facial results shown in Fig. 2 indicate that WGAN-div surpasses the compared methods with regard to diversity and semantics. For example, Fig. 2 shows diverse faces generated by WGAN-div in terms of gender, age, facial expression and makeup. We can draw the same conclusions on LSUN. The proposed WGAN-div outperforms the compared methods by a considerable margin, both quantitatively and qualitatively. For example, WGAN-div achieves an FID score of 15.9 on LSUN, which is 4.4 lower than CTGAN, itself an improved version of WGAN-GP that introduces an extra consistency regularizer.

The examples of visually plausible bedrooms shown in Fig. 2 further highlight the advantages gained by introducing W-div in the GAN model. For the interpolation results in the latent space please check our supplementary material.

Fig. 2. Visual results of WGAN-div and compared methods on CelebA (top row), LSUN (middle row), and CIFAR-10 (bottom row).

Table 4. FID comparison between WGAN-div and the state-of-the-art methods. The result with a * was taken from the original paper [8].

The top row of Fig. 1 reports the learning curves of the compared methods, showing that the training process of our WGAN-div is comparatively stable and converges fast. It achieves top FID scores in fewer than 60 K iterations. The top right plot of Fig. 1 illustrates the meaningful correlation between image quality and discriminator cost. It is worth mentioning that [24] proposed a two time-scale update rule to generally improve the training of a variety of GANs. We believe that WGAN-div can also benefit from such a sophisticated update rule. However, due to the space limit, this is left for future work.

Table 5. FID comparison between PGGAN-div and PGGAN at different resolutions.

4.4 Evaluation on the Progressive Growing Training Scheme

Inspired by the success of PGGAN [14], which trained a W-met based GAN model in a progressive growing fashion, we evaluate how our objective \(L_\text {DIV}\) performs with this sophisticated training scheme. More specifically, we replace \(L_\text {PG}\) with our \(L_\text {DIV}\) while following the default configurations suggested in [14] and propose PGGAN-div. However, computing the FID scores for this experimental setting is challenging, as it is non-trivial to adapt existing FID models for evaluating higher resolution generated images. Since [14] does not specify the details of how their FID scores were computed for higher resolution images, we propose to downscale higher resolution images to \(64\times 64\) resolution and then compute the FID score. The resulting scores are reported in Table 5.
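The downscaling step of this evaluation protocol is simple; as a sketch (using bilinear interpolation, which is our assumption since the exact resampling method is not critical here), it could look as follows before the images are fed to the FID routine sketched earlier.

```python
# Downscale higher resolution samples to 64x64 before computing FID (sketch).
import torch.nn.functional as F

def downscale_for_fid(images, size=64):
    # `images`: hypothetical batch tensor of shape N x 3 x H x W in [0, 1].
    return F.interpolate(images, size=(size, size),
                         mode='bilinear', align_corners=False)
```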

Interestingly, Table 5 shows that, for low resolution images, the FID score of PGGAN is slightly worse than those of some top methods reported in Table 4, including WGAN-div. We believe that this phenomenon is not surprising. Since it is comparatively easy to learn a data distribution in a low dimensional space, applying the standard training scheme suffices to achieve good FID scores, and there is no need to introduce the sophisticated progressive growing strategy during the low dimensional phase. For higher resolution images (\(128\times 128\) and \(256\times 256\)), on the other hand, the FID scores for both PGGAN and PGGAN-div decrease by a non-negligible margin. It is worth mentioning that our PGGAN-div slightly improves the FID scores over the original PGGAN, demonstrating the stability of our objective \(L_\text {DIV}\) under a sophisticated training scheme.

Fig. 3. Visual results of PGGAN (top) and PGGAN-div (bottom) on CelebA-HQ.

Fig. 4. Visual results of PGGAN (top) and PGGAN-div (bottom) on \(256\times 256\) LSUN.

We also present the \(256 \times 256\) visual results for CelebA-HQ (Fig. 3) and LSUN (Fig. 4). Since CelebA-HQ was generated by post-processing CelebA [14], we do not report its FID scores due to the distribution shift introduced by the artificial post-processing algorithms. The visual results in Figs. 3 and 4 demonstrate that our PGGAN-div is very competitive compared to the original PGGAN for both datasets. To summarize, we demonstrate the stability of our W-div objective under this training scheme.

5 Conclusion

In this paper, we introduced a novel Wasserstein divergence which does not require the 1-Lipschitz constraint. As a concrete application, we equipped the GAN model with our Wasserstein divergence objective, resulting in WGAN-div. Both FID scores and qualitative evaluation demonstrate the stability and superiority of the proposed WGAN-div over the state-of-the-art methods.