Fashion Recommendation with Multirelational Representation Learning
Abstract
Driven by the increasing demand to help users dress and match clothing properly, fashion recommendation has attracted wide attention. Its core idea is to model the compatibility among fashion items by jointly projecting their embeddings into a unified space. However, modeling item compatibility in such a category-agnostic manner can barely preserve intra-class variance, thus resulting in suboptimal performance. In this paper, we propose a novel category-aware metric learning framework, which not only learns cross-category compatibility notions but also preserves the intra-category diversity among items. Specifically, we define a category complementary relation representing a pair of category labels, e.g., tops-bottoms. Given a pair of item embeddings, we first project them into their corresponding relation space, then model the mutual relation of a pair of categories as a relation transition vector to capture compatibility among fashion items. We further derive a negative sampling strategy with non-trivial instances to enable the generation of expressive and discriminative item representations. Comprehensive experimental results on two public datasets demonstrate the superiority and feasibility of our proposed approach.
Keywords
Fashion compatibility · Fashion recommendation · Representation learning
1 Introduction
With the proliferation of online fashion websites, such as Polyvore^{1} and Farfetch^{2}, there are increasing demands for intelligent applications in the fashion domain for a better user shopping experience. This drives researchers to develop various machine learning techniques to meet such demands. Existing work is mainly conducted for three types of fashion applications: (1) clothing retrieval [1, 8]: retrieving similar clothing items from the data collection based on the query clothing item; (2) fashion attribute detection [3, 11, 12]: identifying clothing attributes such as color, pattern and texture from the given clothing image; (3) complementary clothing recommendation [5, 10, 16, 21, 22]: recommending complementary clothes that match the query clothing item to the user. In this paper, we focus on the third application, which is more challenging and sophisticated due to the complexity and heterogeneity of fashion data. It requires the model to infer compatibility among fashion items according to various complementary characteristics, which goes beyond visual similarity measurement.
The key to tackling the above challenges is to derive an appropriate compatibility measurement for pairs of fashion items, which can effectively capture various fashion attributes (e.g., colors and patterns) from item images for comparison. The major stream of existing approaches for fashion compatibility modeling adopts metric learning techniques to extract effective fashion item representations. A typical fashion compatibility modeling strategy is to learn a latent style space, where matching item pairs stay closer than incompatible pairs. The compatibility of two given fashion items is computed by the pairwise Euclidean distance or inner product between fashion item embeddings. Nevertheless, previous work has two main limitations that lead to suboptimal performance. Firstly, some approaches consider fashion compatibility modeling as a single-relational task. However, this neglects the fact that people usually focus on different aspects of clothes from different categories. For example, people are more likely to focus on color and material for blouses and pants, while they may pay attention to shape and style for jeans and shoes. Moreover, using a single unified space is likely to result in incorrect similarity transitivity in fashion compatibility. For instance, if item A matches both B and C, while B and C may not be compatible, the embeddings of A, B and C will be forced to be close to each other in a single unified space, which degrades prediction performance because compatibility essentially does not hold the transitivity property. Therefore, such a category-independent approach will result in inaccurate item representations. Secondly, most existing approaches merely sample negative instances randomly from the training set. However, most of the randomly sampled triplets are trivial ones, which may fail to support the model in learning discriminative item representations.
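The two standard measurements named above can be made concrete with a small sketch. The helper below is hypothetical (not from the paper) and only illustrates scoring compatibility in a single unified embedding space via negative Euclidean distance or inner product:

```python
import numpy as np

def compatibility_unified(v_i, v_j, metric="euclidean"):
    """Compatibility of two item embeddings in one unified space.

    Higher score means more compatible. Hypothetical helper illustrating
    the two measurements used by prior work in the text above."""
    if metric == "euclidean":
        # matching pairs stay closer, so negate the distance
        return -float(np.linalg.norm(v_i - v_j))
    # otherwise, use the inner product between the embeddings
    return float(np.dot(v_i, v_j))
```

Note that either choice scores every pair the same way regardless of category, which is exactly the category-agnostic limitation the paper targets.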
We present a novel category-aware embedding learning framework for fashion compatibility modeling, which not only captures cross-category relationships but also preserves the diversity of intra-category fashion item representations.
We devise a negative sampling strategy with non-trivial samples for discriminative item representations.
Extensive experiments have been conducted on two real-world datasets, Polyvore and FashionVC, to demonstrate the superior performance of our model over other state-of-the-art methods.
2 Related Work
2.1 Fashion Compatibility Modeling
The mainstream of work aims to map fashion items into a latent space where compatible item pairs are close to each other, while incompatible pairs lie far apart. McAuley et al. [13] propose to use a Low-rank Mahalanobis Transformation to learn a latent style space that minimizes the distance between matched items and maximizes that of mismatched ones. Following this work, Veit et al. [19] employ Siamese CNNs to learn a metric for compatibility measurement in an end-to-end manner. Some researchers argue that complex compatibility cannot be captured by directly learning a single latent space. He et al. [6] propose to learn a mixture of multiple metrics with weight confidences to model the relationships between heterogeneous items. Veit et al. [18] propose the Conditional Similarity Network, which learns disentangled item features whose dimensions can be used for separate similarity measurements. Following this work, Vasileva et al. [17] claim that respecting type information has important consequences; thus, they first form type-type spaces from each pair of types and train these spaces with a triplet loss.
2.2 Knowledge Graph Embedding Learning
Representation learning on knowledge graphs has attracted much attention in recent years. Different from approaches based on tensor factorization, e.g., [14], translation-based models [2, 7, 20], which are partially inspired by the idea of word2vec, have achieved state-of-the-art performance in the knowledge graph field. Similar to a knowledge graph, heterogeneous fashion recommendation can also be considered a multi-relational problem, where complementary categories form various relations. Enlightened by these findings, we apply a similar idea from knowledge graph embedding to the fashion domain for compatibility modeling.
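To make the translation-based idea concrete, the canonical TransE score [2] treats a relation as a translation in embedding space, so a plausible triple (head, relation, tail) satisfies h + r ≈ t. A minimal sketch:

```python
import numpy as np

def transe_score(h, r, t):
    """TransE-style plausibility score (Bordes et al. [2]).

    A relation is modeled as a translation vector: for a true triple,
    h + r should land near t, so the score is the negative distance."""
    return -float(np.linalg.norm(h + r - t))
```

In the fashion analogy used in this paper, the head and tail would be two item embeddings and the relation a category pair such as tops-bottoms.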
3 Problem Formulation
The fashion complementary recommendation task we are tackling is formulated as follows. Suppose we have a collection of fashion item images denoted as \(\mathcal {O} = \{o_1, o_2, o_3, ..., o_n\}\), where n is the number of items, and a set of category labels denoted as \(\mathcal {C} = \{c_1,c_2,c_3,...,c_m\}\), where m is the number of categories. Each fashion item \(o_i \in \mathcal {O}\) has a corresponding k-dimensional visual feature vector \({\varvec{{v}}}_i = g(o_i;\varTheta _{\textit{v}}), {\varvec{{v}}}_i \in \mathbb {R}^k\) and a category label \(c_i \in \mathcal {C}\). Here, \(g(o;\varTheta _{\textit{v}})\) represents a pre-trained CNN with trainable parameters \(\varTheta _{\textit{v}}\), which extracts visual features from a fashion item image \(o \in \mathcal {O}\). We denote a set of category complementary relations as \(\mathcal {R} = \{r^{c_ic_j}\}\), where \(c_i,c_j \in \mathcal {C}\) represent a pair of complementary categories, such as tops-bottoms. We now use a triplet \(({\varvec{{v}}}_i, {\varvec{{v}}}_j, r^{c_ic_j}), s.t., \forall i,j, r^{c_ic_j} \in \mathcal {R}\) to represent the embeddings of a pair of fashion items \(o_i\) and \(o_j\) and their corresponding category complementary relation \(r^{c_ic_j}\). Each relation \(r^{c_ic_j} \in \mathcal {R}\) corresponds to an embedding vector \({\varvec{{r}}}^{c_ic_j} \in \mathbb {R}^d\) from the relation embedding space. Our target is to derive a fashion compatibility scoring function \(f({\varvec{{v}}}_i, {\varvec{{v}}}_j, r^{c_ic_j})\), which captures visual characteristics from the item embeddings for compatibility measurement.
4 Proposed Approach
In this section, we first present our CAFME model for fashion compatibility modeling. Then, we introduce a novel negative sampling strategy for more effective training. Finally, we describe the optimization algorithm used to train our model. The overview of our proposed framework is shown in Fig. 1. We aim to build a model that can (1) effectively model the notion of compatibility; (2) be easily generalized to unseen fashion item compatibility measurement; (3) focus on different aspects of item embeddings regarding different category complementary relations for the compatibility measurement. In particular, the framework consists of a pre-trained CNN for visual feature extraction and multiple category complementary relation subspaces for category-aware compatibility modeling.
4.1 Compatibility Modeling
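Following the description in the abstract (project both item embeddings into the relation space, then apply a relation transition vector), the score can be sketched as below. The projection matrix `M_r` and the exact functional form are assumptions for illustration, not the paper's verbatim equations:

```python
import numpy as np

def cafme_score(v_i, v_j, M_r, r_vec):
    """Hedged sketch of a relation-specific compatibility score.

    M_r:   assumed projection matrix that maps item embeddings into the
           space of relation r (e.g., tops-bottoms).
    r_vec: relation transition vector of r.
    Higher score = more compatible."""
    p_i = M_r @ v_i  # project item i into the relation space
    p_j = M_r @ v_j  # project item j into the same relation space
    # translation-style score: p_i + r should land near p_j for a match
    return -float(np.linalg.norm(p_i + r_vec - p_j))
```

Because each relation owns its projection, the same pair of items can score differently under different category relations, which is what preserves intra-category diversity.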
4.2 Negative Sampling
1. The strategy should consider both sides of training triplets.
2. The strategy should identify hard negative instances effectively and efficiently.
3. The strategy should avoid false negative samples effectively.
Now we introduce the details of how our negative sampling strategy meets the requirements defined above. We also present the details of our strategy in Algorithm 1.
Requirement 1: We propose to sample negative instances from both sides of a given positive triplet \(({\varvec{{v}}}_i,{\varvec{{v}}}_j, r^{c_ic_j})\). In particular, we first fix \({\varvec{{v}}}_i\) and category complementary relation \(r^{c_ic_j}\), then replace \({\varvec{{v}}}_j\) by randomly sampling an item embedding vector \({\varvec{{v}}}'_j\) from category \(c_j\). Similarly, we perform the same negative sampling for the other side item \({\varvec{{v}}}_j\).
Requirement 2: Given a positive triplet \(({\varvec{{v}}}_i, {\varvec{{v}}}_j, r^{c_ic_j})\), we first uniformly sample N negative candidates denoted as \(\hat{\mathcal {H}}_{({\varvec{{v}}}_i, r^{c_ic_j})}\) from category \(c_j\)’s item set. Then, for each training triplet, we calculate scores for all negative triplets. These two steps correspond to steps 1–2 in Algorithm 1. Intuitively, the negative triplets with high compatibility scores can be regarded as hard negative samples.
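The two-step selection above can be sketched as follows. The function name, signature, and candidate-pool size are hypothetical; `score_fn` stands in for the model's compatibility score:

```python
import random

def sample_hard_negatives(v_i, r, candidates, score_fn, n_candidates=20, k=5):
    """Sketch of the hard-negative selection described above (hypothetical helper).

    candidates: item embeddings drawn from category c_j;
    score_fn(v_i, v_neg, r) -> compatibility score (higher = more compatible)."""
    # Step 1: uniformly sample a pool of negative candidates
    pool = random.sample(candidates, min(n_candidates, len(candidates)))
    # Step 2: score each candidate against the anchor under relation r;
    # the highest-scoring negatives are the hard ones
    pool.sort(key=lambda v: score_fn(v_i, v, r), reverse=True)
    return pool[:k]
```

The same routine would be applied from the other side of the triplet (fixing \({\varvec{{v}}}_j\)) to satisfy Requirement 1.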
We adopt the stochastic gradient descent (SGD) algorithm for model optimization. In each step, we sample a mini-batch of training triplets and update the parameters of the whole network.
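A minimal sketch of one SGD-with-momentum parameter update, as used in the optimization described above (the hyperparameter values shown in the test mirror Sect. 5.3 but are otherwise assumptions for this sketch):

```python
import numpy as np

def sgd_momentum_step(params, grads, velocity, lr=1e-4, momentum=0.9):
    """One SGD-with-momentum update over a dict of parameter arrays.

    velocity accumulates a decaying average of past gradient steps,
    which smooths the mini-batch updates."""
    for key in params:
        velocity[key] = momentum * velocity[key] - lr * grads[key]
        params[key] = params[key] + velocity[key]
    return params, velocity
```

In a full training loop, `grads` would come from back-propagating the triplet loss over a sampled mini-batch of positive and (hard) negative triplets.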
5 Experiments
In this section, we first describe the experimental settings and then give a comprehensive analysis of the experimental results.
5.1 Dataset
We conduct our experiments on two public datasets, FashionVC and Polyvore-Maryland, provided by Song et al. [16] and Han et al. [5], respectively.
FashionVC [16]. This dataset consists of 14,871 top item images and 13,663 bottom item images, where each item has a corresponding image, a title and a category label. In this paper, we only consider the visual modality. Therefore, we use the images for visual information extraction and the category labels to determine which category complementary relation each item pair belongs to. We randomly split the data into 80%/10%/10% for the training, validation and test sets, respectively.
Polyvore-Maryland [5]. This dataset contains 21,799 outfits crawled from the online social community website Polyvore. We use the splits provided by Han et al. [5], which have 17,316, 3,076 and 1,407 outfits in the training, testing and validation sets, respectively. In this paper, we mainly study item-to-item compatibility; therefore, we keep four main groups of fashion items from the outfit data: tops, bottoms, bags and shoes. Each fashion item contains an image, a title and a category label. Note that each group of fashion items has several detailed category labels, e.g., there are hand bags and shoulder bags in the “bags” group.
5.2 Baseline Methods
We compare our model CAFME with several state-of-the-art models for heterogeneous recommendation. For a fair comparison, we use the pre-trained AlexNet [9] as the visual feature extractor for all methods.

SiameseNet [19]: The approach models compatibility by minimizing the Euclidean distance between compatible pairs and maximizing the distance between incompatible ones in a unified latent space through contrastive loss.

Monomer [6]: The approach models fashion compatibility with a mixture of distances computed from multiple latent spaces.

BPR-DAE [16]: The approach models compatibility through the inner product of the top’s and bottom’s embeddings and uses Bayesian Personalized Ranking (BPR) [15] as its optimization objective.

TripletNet [4]: The approach models fashion compatibility in a unified latent space through triplet loss.

TransNFCM [22]: The state-of-the-art method that learns item-item compatibility by modeling categorical relations among different fashion items.

TA-CSN [17]: The state-of-the-art method that builds type-aware subspaces for fashion compatibility modeling.
5.3 Parameter Settings
In our experiments, all the hyper-parameters of our approach are tuned to perform best on the validation set. For a fair comparison, we apply AlexNet [9] as the visual feature extractor for all methods. In our model, we set the margin \(\gamma \) to 1, the learning rate \(\alpha = 10^{-4}\) with momentum 0.9, and the batch size \(B = 512\). The visual embedding dimension is \(k=128\) with a dropout rate of 0.5, and the relation embedding dimension is also set to 128.
5.4 Compatibility Prediction
Performance comparison between our proposed CAFME and other baseline methods. CAFME (Neg.) indicates the application of the negative sampling training strategy.

Methods        FashionVC                            Polyvore-Maryland
               AUC   Hit@5  Hit@10  Hit@20  Hit@40  AUC   Hit@5  Hit@10  Hit@20  Hit@40
SiameseNet     60.4   9.7   18.1    31.2    52.8    59.1   8.3   15.5    29.0    51.8
Monomer        70.2  16.9   28.6    45.8    69.1    70.5  17.6   28.9    45.7    69.0
BPR-DAE        70.9  16.7   27.3    46.7    70.4    69.5  17.3   28.2    43.9    67.5
TripletNet     70.6  16.3   28.0    45.7    69.6    70.1  18.1   28.7    44.9    68.3
TA-CSN         71.6  16.7   28.4    46.7    70.8    70.2  17.3   28.4    45.1    68.4
TransNFCM      73.6  19.0   32.3    51.6    74.0    73.6  19.3   33.1    50.9    73.4
CAFME          88.6  26.6   48.5    81.9    99.9    95.0  59.8   84.4    97.7    99.7
CAFME (Neg.)   88.9  26.4   49.9    83.2    99.9    96.2  59.6   88.4    96.7    99.7
5.5 Performance Comparison
Our model achieves the best performance on both datasets by significant margins over all the other state-of-the-art methods, which demonstrates the effectiveness and superior performance of our approach.
The category-unaware models, including SiameseNet and TripletNet, which merely learn fashion compatibility notions in a single latent space, perform worse than the category-aware models, including TA-CSN and TransNFCM. This demonstrates that category label information is of great importance in fashion compatibility modeling, as it helps avoid incorrect compatibility transitivity. It also indicates that items from different categories may exhibit very different visual characteristics for compatibility.
Compared with the category-aware methods TA-CSN and TransNFCM, our model obtains around 15% and 30% improvements on AUC and Hit@20, respectively. Although they build category-aware mask vectors to capture different fashion characteristics among categories, this is still not sufficient to preserve the intra-category diversity among items. With the help of our relation-specific projection spaces, our model can capture much more category-specific compatibility information. The improvements on the Polyvore-Maryland dataset are even larger in terms of AUC and Hit@5. This is mainly due to the different numbers of relations in the two datasets: we define 146 category relations in the Polyvore dataset, while there are only 30 relations in the FashionVC dataset. This indicates that more relational spaces contribute significantly to the performance improvement.
The results of CAFME (Neg.) show that our negative sampling strategy helps improve the model’s performance, which demonstrates the effectiveness of our proposed training strategy.
5.6 Case Study
6 Conclusion
In this work, we introduced a novel category-aware neural model, CAFME, to model fashion compatibility notions. It not only captures cross-category compatibility by constructing category relation embeddings but also preserves intra-category diversity among items by building relation-specific projection spaces. To optimize our model, we further introduce a weighted negative sampling strategy to identify high-quality negative instances, which consequently helps our model infer discriminative representations. In addition, although we mainly study the compatibility of tops and bottoms in this paper, our approach can be easily generalized to arbitrary types of clothing items. Extensive experiments were conducted on two public fashion datasets, which show that our CAFME model significantly outperforms all the state-of-the-art methods on fashion recommendation.
Acknowledgments
We would like to thank all reviewers for their comments. This work was partially supported by Australian Research Council Discovery Project (ARC DP190102353).
References
1. Ak, K.E., Kassim, A.A., Lim, J.H., Tham, J.Y.: Learning attribute representations with localization for flexible fashion search. In: CVPR (2018)
2. Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., Yakhnenko, O.: Translating embeddings for modeling multi-relational data. In: NIPS, pp. 2787–2795 (2013)
3. Chen, H., Gallagher, A., Girod, B.: Describing clothing by semantic attributes. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part III. LNCS, vol. 7574, pp. 609–623. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33712-3_44
4. Chen, L., He, Y.: Dress fashionably: learn fashion collocation with deep mixed-category metric learning. In: AAAI, pp. 2103–2110 (2018)
5. Han, X., Wu, Z., Jiang, Y., Davis, L.S.: Learning fashion compatibility with bidirectional LSTMs. In: ACM MM, pp. 1078–1086 (2017)
6. He, R., Packer, C., McAuley, J.J.: Learning compatibility across categories for heterogeneous item recommendation. In: ICDM, pp. 937–942 (2016)
7. Ji, G., Liu, K., He, S., Zhao, J.: Knowledge graph completion with adaptive sparse transfer matrix. In: AAAI (2016)
8. Kiapour, M.H., Han, X., Lazebnik, S., Berg, A.C., Berg, T.L.: Where to buy it: matching street clothing photos in online shops. In: ICCV (2015)
9. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS, pp. 1097–1105 (2012)
10. Li, Y., Luo, Y., Huang, Z.: Graph-based relation-aware representation learning for clothing matching. In: Borovica-Gajic, R., Qi, J., Wang, W. (eds.) ADC 2020. LNCS, vol. 12008, pp. 189–197. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-39469-1_15
11. Liu, Z., Luo, P., Qiu, S., Wang, X., Tang, X.: DeepFashion: powering robust clothes recognition and retrieval with rich annotations. In: CVPR (2016)
12. Luo, Y., Wang, Z., Huang, Z., Yang, Y., Zhao, C.: Coarse-to-fine annotation enrichment for semantic segmentation learning. In: CIKM (2018)
13. McAuley, J.J., Targett, C., Shi, Q., van den Hengel, A.: Image-based recommendations on styles and substitutes. In: SIGIR, pp. 43–52 (2015)
14. Nickel, M., Tresp, V., Kriegel, H.: A three-way model for collective learning on multi-relational data. In: ICML, pp. 809–816 (2011)
15. Rendle, S., Freudenthaler, C., Gantner, Z., Schmidt-Thieme, L.: BPR: Bayesian personalized ranking from implicit feedback. In: UAI, pp. 452–461 (2009)
16. Song, X., Feng, F., Liu, J., Li, Z., Nie, L., Ma, J.: NeuroStylist: neural compatibility modeling for clothing matching. In: ACM MM, pp. 753–761 (2017)
17. Vasileva, M.I., Plummer, B.A., Dusad, K., Rajpal, S., Kumar, R., Forsyth, D.: Learning type-aware embeddings for fashion compatibility. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018, Part XVI. LNCS, vol. 11220, pp. 405–421. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01270-0_24
18. Veit, A., Belongie, S.J., Karaletsos, T.: Conditional similarity networks. In: CVPR, pp. 1781–1789 (2017)
19. Veit, A., Kovacs, B., Bell, S., McAuley, J.J., Bala, K., Belongie, S.J.: Learning visual clothing style with heterogeneous dyadic co-occurrences. In: ICCV, pp. 4642–4650 (2015)
20. Wang, Z., Zhang, J., Feng, J., Chen, Z.: Knowledge graph embedding by translating on hyperplanes. In: AAAI (2014)
21. Yang, X., et al.: Interpretable fashion matching with rich attributes. In: SIGIR (2019)
22. Yang, X., Ma, Y., Liao, L., Wang, M., Chua, T.: TransNFCM: translation-based neural fashion compatibility modeling. In: AAAI, pp. 403–410 (2019)