Multimedia Tools and Applications, Volume 75, Issue 7, pp 4083–4113

Pointwise and pairwise clothing annotation: combining features from social media

  • Keiller Nogueira
  • Adriano Alonso Veloso
  • Jefersson Alex dos Santos


In this paper, we present effective algorithms to automatically annotate clothes from social media data, such as Facebook and Instagram. Clothing annotation can be informally stated as recognizing, as accurately as possible, the garment items appearing in a query photo. This task opens significant opportunities for recommender and e-commerce systems, such as capturing new fashion trends based on which clothes have been worn most recently. It also poses interesting challenges for existing vision and recognition algorithms, such as distinguishing between similar but different types of clothes, or identifying the pattern of a garment even when it appears in different colors and shapes. We formulate the annotation task as a multi-label and multi-modal classification problem: (i) both image and textual content (i.e., tags about the image) are available for learning classifiers, (ii) the classifiers must recognize a set of labels (i.e., a set of garment items), and (iii) the decision on which labels to assign to the query photo comes from a set of instances used to build a function that separates the labels that should be assigned to the query photo from those that should not. Using this configuration, we propose two approaches: (i) a pointwise one, called MMCA, which receives a single image as input, and (ii) a multi-instance classification approach, called M3CA, also known as the pairwise approach, which uses pairs of images to create the classifiers. We conducted a systematic evaluation of the proposed algorithms using everyday photos collected from two major fashion-related social media sites. Our results show that, in terms of accuracy, the proposed approaches provide improvements ranging from 20% to 30% over popular first-choice multi-label, multi-modal, multi-instance algorithms.
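The difference between the pointwise and pairwise settings can be illustrated with a minimal Python sketch. It is not the MMCA or M3CA algorithm: the `Photo` type, the `v:`/`t:` feature prefixes, and in particular the rule that a pair is described by the features and labels its two photos share are assumptions made only for illustration, since the abstract does not specify how pairs are represented.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Photo:
    """Hypothetical multi-modal, multi-label training example."""
    visual_words: frozenset  # ids of visual words found in the image
    tags: frozenset          # textual tags attached to the image
    labels: frozenset        # garment items present (multi-label target)


def multimodal_features(photo):
    # Prefix each feature with its modality so that visual and textual
    # features share one feature space without colliding.
    visual = {f"v:{w}" for w in photo.visual_words}
    textual = {f"t:{t}" for t in photo.tags}
    return frozenset(visual | textual)


def pointwise_instances(photos):
    # Pointwise setting: one training instance per single photo.
    return [(multimodal_features(p), p.labels) for p in photos]


def pairwise_instances(photos):
    # Pairwise setting: one training instance per pair of photos.
    # ASSUMPTION (not from the paper): a pair keeps only the features
    # and the labels that both photos have in common.
    instances = []
    for a, b in combinations(photos, 2):
        shared_features = multimodal_features(a) & multimodal_features(b)
        shared_labels = a.labels & b.labels
        instances.append((shared_features, shared_labels))
    return instances


# Toy data, invented for this example.
photos = [
    Photo(frozenset({3, 17}), frozenset({"summer", "denim"}),
          frozenset({"jeans", "t-shirt"})),
    Photo(frozenset({3, 42}), frozenset({"denim", "street"}),
          frozenset({"jeans", "jacket"})),
    Photo(frozenset({8, 42}), frozenset({"street"}),
          frozenset({"jacket", "boots"})),
]

point = pointwise_instances(photos)
pairs = pairwise_instances(photos)
```

With three photos, the pointwise view yields three instances and the pairwise view yields one instance per unordered pair. Under the intersection assumption, a pair with no shared features carries no evidence, which is one reason a real pairwise method would likely need a richer pair representation than this sketch.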


Keywords: Image annotation, Clothing annotation, Bag of visual words, Machine learning, Multi-modal, Multi-instance, Multi-label



The authors would like to acknowledge grants from CNPq (grant 449638/2014-6), CAPES, Fundação de Apoio à Pesquisa do Estado de Minas Gerais (Fapemig, grant APQ-00768-14), PRPq/Universidade Federal de Minas Gerais, Finep, and InWeb, the Brazilian National Institute of Science and Technology for the Web.



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Keiller Nogueira (1)
  • Adriano Alonso Veloso (1)
  • Jefersson Alex dos Santos (1)
  1. Department of Computer Science, Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, Brazil
