
1 Introduction

Artificial Intelligence (AI) research has made substantial progress towards human-level understanding across the spectrum of perception, reasoning and planning [1,2,3]. Another key yet relatively understudied direction is creativity, where the goal is for machines to generate original items with realistic, aesthetic attributes, usually in artistic contexts. AI could serve as a source of inspiration for humans in the creative process, and also act as an assistant for more mundane tasks, especially in the digital domain. Previous work has, for instance, explored writing pop songs [4], imitating the styles of great painters [5, 6], or doodling sketches [7].

There has also been growing interest in generating images with GANs, given their ability to produce appealing images both unconditionally [8] and conditionally, e.g. from text or class labels, and for paired and unpaired image translation [9,10,11]. However, it is unclear how creative such attempts really are, since most of them mainly mimic training samples without expressing much originality. Creative Adversarial Networks (CANs) [12] were therefore proposed to adapt GANs to generate original content (paintings) by encouraging the model to deviate from existing painting styles. Technically, a CAN is a Deep Convolutional GAN (DCGAN) model [13] combined with an entropy loss that encourages novelty with respect to known art styles. The specific application domain of CANs (art paintings) allows very abstract generations to be acceptable; as a result, it strongly rewards originality without assessing how such enhanced creativity can be reconciled with realism and established standards.

In this paper we study how AI can generate creative samples for fashion. Fashion is an interesting domain because designing original garments requires a lot of creativity, under the constraint that items must be wearable. In contrast to most work on generative models [14,15,16], the originality angle we introduce makes us go beyond replicating images seen during training. Fashion image generation also lets us decompose creativity into design elements (shape and texture in our case), which is a novel aspect of our work compared to CANs. More specifically, this work explores various architectures and losses that encourage GANs to deviate from the existing fashion styles covered in the training dataset, while still generating realistic pieces of clothing without requiring any image as input. To the best of our knowledge, this is the first attempt to address creative fashion generation by explicitly relating it to its design elements.

Contributions. (1) We are the first to propose a novelty loss for image generation of fashion items with specific conditioning on texture and shape, learning to deviate from existing ones. (2) We re-purpose automatic entropy-based evaluation criteria to assess fashion items in terms of texture and shape; the correlations between these automatic metrics and our human study allow us to identify metrics that reflect human judgment. (3) We propose a shape-conditioned model named StyleGAN and a concrete solution to make it work in a non-deterministic way; trained with creative losses, it yields a novel and powerful model. Our best models generate realistic images at a resolution of \(512\,\times \,512\) using a relatively small dataset (about 4000 images). More than 60% of our generated designs are judged as being created by a human designer while also being considered original, showing that an AI could serve as an efficient and inspirational assistant.

2 Models: Architectures and Losses

2.1 Network Architectures

We experiment with two architectures: a modified version of the DCGAN model [13] for higher-resolution output images, and our proposed StyleGAN model described below. In addition to its real/fake classification branch, the discriminator in each architecture is augmented with optional classification branches for shape and texture classes.

GANs with Optional Classification Loss. Let \(\mathcal D\) be a dataset of N images. Following [10], we use shape and texture labels to learn a shape classifier and a texture classifier in the discriminator. Adding these labels improves over the plain model and stabilizes training at larger resolutions. We add to the discriminator network either one branch (for texture or for shape classification) or two branches (for both), and denote the extra classification output of the discriminator \(D_b\). The additional loss is:

$$\begin{aligned} \mathcal {L}_{D \text{ classif }} = - \sum _{x_i\in \mathcal {D}} \log \big (\text{softmax}(D_{b}(x_i))_{c_i}\big ), \end{aligned}$$
(1)

where \(c_i\) denotes the ground-truth (shape or texture) class of \(x_i\).
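
As a rough illustration (not the authors' released code), this classification loss amounts to a standard cross-entropy on the extra branch output. The PyTorch sketch below assumes a discriminator exposing a feature body plus a linear classification head for \(D_b\); all module and argument names are ours.

```python
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    """Minimal sketch: DCGAN-style body with a real/fake head and an
    optional classification branch D_b (shape or texture classes)."""
    def __init__(self, feat_dim=512, num_classes=7):
        super().__init__()
        # Placeholder for the convolutional stack of the real discriminator.
        self.body = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim),
                                  nn.LeakyReLU(0.2))
        self.real_fake = nn.Linear(feat_dim, 1)         # adversarial head
        self.branch = nn.Linear(feat_dim, num_classes)  # D_b

    def forward(self, x):
        h = self.body(x)
        return self.real_fake(h), self.branch(h)

def d_classification_loss(branch_logits, labels):
    # Eq. (1): -log softmax(D_b(x_i))[c_i], averaged over the batch.
    return F.cross_entropy(branch_logits, labels)
```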

StyleGAN: Conditioning on Masks. In this model, a generator is trained to compute realistic images from a mask input and a noise vector representing style information (Fig. 1). We use the same discriminator architecture as in DCGAN, with classifier branches that learn shape and texture classification on real images on top of the real/fake prediction. Training StyleGAN with two inputs is difficult: previous image-to-image translation approaches such as pix2pix [17] and CycleGAN [11] create a deterministic mapping from an input image to a single corresponding output, e.g. edges to handbags, or from one domain to another. To make sure that no input is neglected, we add an \(\ell _1\) loss forcing the generator to output the mask itself when given a null style input z, thus ensuring the impact of the shape on the generations, as shown in Fig. 1.
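
A minimal sketch of this \(\ell _1\) constraint is shown below, assuming a generator callable as G(mask, z) with a known latent dimension z_dim (both are assumptions on our part):

```python
import torch

def mask_identity_loss(generator, mask, l1_weight=1.0):
    """With a null style vector z = 0, the generator should reproduce the
    input mask itself, which prevents the shape input from being ignored.
    The G(mask, z) signature and the z_dim attribute are assumptions."""
    z_null = torch.zeros(mask.size(0), generator.z_dim, device=mask.device)
    reconstruction = generator(mask, z_null)
    return l1_weight * torch.abs(reconstruction - mask).mean()
```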

Fig. 1.

From the segmented mask of a fashion item and different random vectors z, our StyleGAN model generates differently styled images.

Fig. 2.

From the mask of a product, our StyleGAN model generates a differently styled image for each style noise vector.

2.2 Novelty Losses

Because GANs learn to generate images very similar to the training images, we explore ways to make them deviate from this replication by studying the impact of two additional losses for the generator: the CAN loss (as used in [12]), and an MCE loss that encourages the generator to confuse the discriminator.

  • CAN loss: As proposed in [12], the CAN loss is defined as

    $$\begin{aligned} \mathcal {L}_{\text{ CAN }} = - \lambda \left[ \sum _{i} \sum _{k=1}^K \frac{1}{K} \log (\sigma (D_{b,k} (G(z_i))) ) + \frac{K-1}{K} \log (1-\sigma (D_{b,k}(G(z_i)))) \right] , \end{aligned}$$
    (2)

    where \(\sigma \) is the sigmoid function and K is the number of classes (texture, shape, or both).

  • MCE loss: As an alternative additional generator loss, we propose the Multi-class Cross-Entropy (MCE) loss between the class prediction of the discriminator and the uniform probability vector.

    $$\begin{aligned} \mathcal {L}_{\text{ MCE }} = -\lambda \sum _{i} \sum _{k=1}^{K} \frac{1}{K}\log \big (\text{softmax}(D_{b}(G(z_i)))_k\big ). \end{aligned}$$
    (3)

Both the MCE loss and the CAN sum of binary cross-entropy losses encourage deviation from the existing categories. However, our MCE criterion considers all classes jointly through a single softmax, unlike the CAN loss, which is based on a sum of K independent binary classification losses.
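
For concreteness, here is a hedged PyTorch sketch of both generator novelty losses written directly from Eqs. (2) and (3); `branch_logits` stands for the discriminator classification branch evaluated on generated images, \(D_b(G(z_i))\), and batch averaging replaces the explicit sum over i.

```python
import torch
import torch.nn.functional as F

def can_loss(branch_logits, lam=1.0):
    """Eq. (2): K independent binary cross-entropies, each pushing the
    per-class sigmoid towards the uniform target 1/K."""
    K = branch_logits.size(1)
    target = torch.full_like(branch_logits, 1.0 / K)
    bce = F.binary_cross_entropy_with_logits(branch_logits, target,
                                             reduction='none')
    return lam * bce.sum(dim=1).mean()

def mce_loss(branch_logits, lam=5.0):
    """Eq. (3): multi-class cross-entropy between the branch's softmax and
    the uniform distribution, treating all K classes jointly."""
    log_probs = F.log_softmax(branch_logits, dim=1)
    return -lam * log_probs.mean(dim=1).mean()
```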

3 Results

Dataset. Unlike similar work focusing on fashion item generation [15, 16], we choose a dataset containing fashion items on a uniform background, allowing the trained models to learn features useful for creative generation without having to generate wearer faces and backgrounds. We augment the dataset of 4157 images by a factor of 5 by jittering images with random scaling and translations. The images are classified into seven clothing categories: jackets, coats, shirts, tops, t-shirts, dresses and pullovers, and seven texture categories: uniform, tiled, striped, animal skin, dotted, print and graphical pattern.
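
The jittering could be implemented, for example, with a torchvision pipeline such as the sketch below; the exact scaling and translation ranges are assumptions, not values from the paper.

```python
import torchvision.transforms as T

# Hypothetical augmentation approximating the x5 jittering described above;
# a white fill matches the uniform background of the dataset.
jitter = T.Compose([
    T.RandomAffine(degrees=0, translate=(0.05, 0.05), scale=(0.9, 1.1),
                   fill=255),
    T.Resize((512, 512)),
    T.ToTensor(),
])
```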

Automatic Evaluation Metrics. Evaluating the diversity and quality of a set of images has been tackled by scores such as the inception score and variants like the AM score [18]. We adapt both of them to the two labels specific to fashion design (shape and texture) and supplement them with a mean nearest-neighbor distance. Our final set of automatic scores contains 5 metrics: (1, 2) shape score and texture score, each based on a ResNet-18 classifier [19] of shape or texture respectively; (3, 4) shape AM score and texture AM score, based on the output of the same classifiers; (5) mean distance to the 10 nearest neighbors. For metric (5), we compute for each sample the mean Euclidean distance to its \(k=10\) nearest neighbors (NN) in the feature space of a ResNet-18 pre-trained on ImageNet with its last fully connected layer removed.
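
A hedged sketch of metric (5) follows; the function names are ours, and the torchvision weight-loading API may differ across versions.

```python
import torch
import torchvision.models as models

def resnet18_features(images):
    """Pooled features from an ImageNet-pretrained ResNet-18 with its last
    fully connected layer removed."""
    backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    backbone.fc = torch.nn.Identity()          # drop the classification layer
    backbone.eval()
    with torch.no_grad():
        return backbone(images)                # (N, 512) feature vectors

def mean_knn_distance(gen_feats, real_feats, k=10):
    """Mean Euclidean distance from each generated sample to its k nearest
    real neighbors, averaged over the generated set."""
    d = torch.cdist(gen_feats, real_feats)     # pairwise Euclidean distances
    knn, _ = d.topk(k, dim=1, largest=False)   # k smallest distances per row
    return knn.mean().item()
```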

Creating Evaluation Sets. For each setup (DCGAN or StyleGAN trained with a texture, shape, both, or no novelty criterion), we select four saved models after a sufficient number of iterations. Our models produce plausible results after training for 15000 iterations with a batch size of 64 images.

Given a set of 10000 generations from a model, we extract different sets of images with particular visual properties: (i) high/low shape entropy, (ii) high/low texture entropy, and (iii) high/low NN distance to real images. We also build random and mixed sets, such as low shape entropy combined with high nearest-neighbor distance. We expect such a set to contain plausible generations, since low shape entropy usually correlates with well-defined shapes, while high nearest-neighbor distance indicates unusual designs. Overall, we have 8 different sets that may overlap, and we evaluate 100 images from each set.
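
As an illustration only (the paper does not detail the selection procedure), the mixed low-shape-entropy / high-NN-distance set could be built along the following lines, where `shape_probs` and `nn_dist` hold the per-generation shape-classifier softmax outputs and mean 10-NN distances:

```python
import numpy as np

def select_mixed_set(shape_probs, nn_dist, n=100):
    """Hypothetical selection of n generations combining low shape-classifier
    entropy (well-defined shapes) with high nearest-neighbor distance
    (unusual designs); the rank combination is an assumption."""
    entropy = -(shape_probs * np.log(shape_probs + 1e-12)).sum(axis=1)
    rank = np.argsort(np.argsort(entropy)) + np.argsort(np.argsort(-nn_dist))
    return np.argsort(rank)[:n]   # indices of the selected images
```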

Automatic Evaluation Results. We set \(\lambda =5\) for the MCE loss and \(\lambda =1\) for the CAN loss, as these values appeared to work best. All models were trained with the default learning rate of 0.0002 as in [13]. Our models take about half a day to train on 4 Nvidia P100 GPUs for the \(256\times 256\) models and almost 2 days for the \(512\times 512\) ones.
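
A sketch of the corresponding optimizer setup is given below; the Adam \(\beta \) values follow common DCGAN practice and are an assumption, as is the config layout.

```python
import torch

# lambda values, batch size and iteration count come from the text above;
# the Adam betas are assumed (standard DCGAN practice), not from the paper.
config = dict(batch_size=64, iterations=15_000, lr=2e-4,
              lambda_mce=5.0, lambda_can=1.0)

def make_optimizers(G, D, cfg=config):
    opt_g = torch.optim.Adam(G.parameters(), lr=cfg["lr"], betas=(0.5, 0.999))
    opt_d = torch.optim.Adam(D.parameters(), lr=cfg["lr"], betas=(0.5, 0.999))
    return opt_g, opt_d
```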

Table 1 presents the shape and texture scores, the AM scores (for shape and texture), and the average NN distances computed for each model over 4 selected iterations. Our first observation is that the plain DCGAN model seems to perform worse than all other tested models, with the highest NN distance and lower shape and texture scores. The NN distance score can be read in different ways: a high value could indicate enhanced “creativity” of the model, but also a higher failure rate. The two models with high shape score, shape AM score, texture AM score and NN distance are the DCGAN models trained with creativity losses.

Table 1. Quantitative automatic evaluation. High scores appear in bold.
Table 2. Human evaluation ranked by decreasing overall score (higher is better).

Human Evaluation. Each image was rated by 5 persons asked to answer 6 questions. Q1: How much do you like this design overall, on a scale from 1 to 5? Q2/Q3: Rate the novelty of the shape (Q2) and texture (Q3) from 1 to 5. Q4/Q5: Rate the complexity of the shape (Q4) and texture (Q5) from 1 to 5. Q6: Do you think this image was created by a fashion designer or generated by a computer? (yes/no)

Table 2 presents the average score obtained by each model on each human evaluation question for the RTW dataset. From this table, we can see that using our novelty loss (MCE shape and MCE tex) performs better than the DCGAN baseline. While the two proposed models with the MCE originality loss rank best on the overall score, we observe that the preferred images have low nearest-neighbor distance. This means that generations far from their nearest neighbors are not always pleasant; it is indeed a challenge to obtain models that generate images which are both novel (high nearest-neighbor distance) and pleasant. However, the models that score best on the high nearest-neighbor distance set are clearly the ones trained with our novelty loss (MCE). Figure 3 shows how well our approaches worked along two axes: likability and real appearance. The best-rated methods are the models employing an originality loss, in particular our proposed MCE criterion: they are perceived as the most likely to have been created by designers and are the most liked overall.

Fig. 3.

Evaluation of the different models on the RTW dataset by human annotators along two axes: likability and real appearance. Our models reach good trade-offs between real appearance and likability.

We greatly improve over classical GANs here, going from a likability score of 64 to more than 75 with our best model with shape creativity. We display the images that obtained the best scores for each of the 6 questions in Fig. 4. Our proposed StyleGAN (see Fig. 2) produces scores competitive with the best DCGAN setups; in particular, StyleGAN with an originality loss ranks in the top 3.

We computed correlation scores between our automatic metrics and the human ratings. The metric that correlates most with the overall score is the NN distance. The NN distance also correlates negatively with real appearance.
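
Such a correlation can be computed, for instance, with a rank correlation as in the sketch below (the text does not specify which correlation coefficient is used, so Spearman's \(\rho \) here is an assumption); both inputs are 1-D sequences aligned per model or evaluation set.

```python
from scipy.stats import spearmanr

def metric_human_correlation(metric_values, human_scores):
    """Rank correlation between one automatic metric (e.g. per-model NN
    distance) and the corresponding average human rating (e.g. Q1 overall)."""
    rho, pvalue = spearmanr(metric_values, human_scores)
    return rho, pvalue
```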

Fig. 4.

Best generations as rated by annotators. Left: Q1: overall score, Q2: shape novelty, Q4: shape complexity; Right: Q3: texture novelty, Q5: texture complexity, Q6: realism.

4 Conclusion

We introduced a specific conditioning of GANs on texture and shape elements for generating fashion design images. While GANs with such a classification loss produce realistic results, they tend to reconstruct the training images. Using an MCE originality loss, we learn to deviate from a reproduction of the training set. We also propose a novel architecture, the StyleGAN model, conditioned on an input mask, enabling shape control while leaving the creative space free inside the item. Together, these contributions lead to the best results in our human evaluation study. We manage to accurately generate \(512\times 512\) images; in future work we will seek higher resolution, a fundamental aspect of image quality. Finally, while our results show visually pleasing textural novelty, it will be interesting to explore larger families of novelty loss functions and to enforce wearability constraints.