Abstract
In this paper, we present effective algorithms to automatically annotate clothes from social media data, such as Facebook and Instagram. Clothing annotation can be informally stated as recognizing, as accurately as possible, the garment items appearing in the query photo. This task brings huge opportunities for recommender and e-commerce systems, such as capturing new fashion trends based on which clothes have been worn most recently. It also poses interesting challenges for existing vision and recognition algorithms, such as distinguishing between similar but different types of clothes or identifying the pattern of a garment even when it appears in different colors and shapes. We formulate the annotation task as a multi-label and multi-modal classification problem: (i) both image and textual content (i.e., tags about the image) are available for learning classifiers, (ii) the classifiers must recognize a set of labels (i.e., a set of garment items), and (iii) the decision on which labels to assign to the query photo comes from a set of instances used to build a function that separates the labels that should be assigned to the query photo from those that should not. Using this configuration, we propose two approaches: (i) a pointwise one, called MMCA, which receives a single image as input, and (ii) a multi-instance one, called M3CA, also known as the pairwise approach, which uses pairs of images to create the classifiers. We conducted a systematic evaluation of the proposed algorithms using everyday photos collected from two major fashion-related social media sites, namely pose.com and chictopia.com. Our results show that, compared to popular first-choice multi-label, multi-modal, multi-instance algorithms, the proposed approaches provide improvements ranging from 20% to 30% in terms of accuracy.
Notes
Hereafter we refer to each \(f_{i}\) as the corresponding interval.
Labels for which \(\hat {p}(l_{i}|\tilde {q})>0\).
The L1 distance function calculates the difference between two feature vectors by summing the absolute differences between their corresponding components: \(L1(P,Q) = {\sum }_{i=1}^{N} |p_{i} - q_{i}|\)
Both datasets used in this paper, Chictopia and Pose, are available for download at: http://www.patreo.dcc.ufmg.br/downloads/fashion-datasets/
The reported processing time includes only the time spent by the classification algorithm.
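The L1 distance defined in the notes above can be illustrated with a short sketch. This is a minimal example, not the authors' implementation: the function name `l1_distance` and the sample vectors are assumptions chosen for illustration, and the feature vectors are taken to be plain sequences of numbers of equal length N.

```python
from typing import Sequence


def l1_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Return sum_i |p_i - q_i| for two feature vectors of equal length."""
    if len(p) != len(q):
        raise ValueError("feature vectors must have the same length")
    # Sum the absolute difference of each pair of corresponding components.
    return sum(abs(pi - qi) for pi, qi in zip(p, q))


# Hypothetical feature vectors, used only to show the arithmetic.
P = [1.0, 2.0, 3.0]
Q = [2.0, 0.0, 3.5]
print(l1_distance(P, Q))  # |1-2| + |2-0| + |3-3.5| = 3.5
```

Identical vectors give a distance of 0, and the value grows linearly with each component's deviation.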
Acknowledgments
The authors would like to acknowledge grants from CNPq (grant 449638/2014-6), CAPES, Fundação de Apoio à Pesquisa do Estado de Minas Gerais (Fapemig, under grant APQ-00768-14), PRPq/Universidade Federal de Minas Gerais, Finep, and InWeb, the Brazilian National Institute of Science and Technology for the Web.
About this article
Cite this article
Nogueira, K., Veloso, A.A. & dos Santos, J.A. Pointwise and pairwise clothing annotation: combining features from social media. Multimed Tools Appl 75, 4083–4113 (2016). https://doi.org/10.1007/s11042-015-3087-2
DOI: https://doi.org/10.1007/s11042-015-3087-2