Advertisement

Deep Clustering for Unsupervised Learning of Visual Features

  • Mathilde CaronEmail author
  • Piotr Bojanowski
  • Armand Joulin
  • Matthijs Douze
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11218)

Abstract

Clustering is a class of unsupervised learning methods that has been extensively applied and studied in computer vision. Little work has been done to adapt it to the end-to-end training of visual features on large-scale datasets. In this work, we present DeepCluster, a clustering method that jointly learns the parameters of a neural network and the cluster assignments of the resulting features. DeepCluster iteratively groups the features with a standard clustering algorithm, k-means, and uses the subsequent assignments as supervision to update the weights of the network. We apply DeepCluster to the unsupervised training of convolutional neural networks on large datasets like ImageNet and YFCC100M. The resulting model outperforms the current state of the art by a significant margin on all the standard benchmarks.

Keywords

Unsupervised learning Clustering 

Supplementary material

474202_1_En_9_MOESM1_ESM.pdf (8.9 mb)
Supplementary material 1 (pdf 9130 KB)

References

  1. 1.
    Agrawal, P., Carreira, J., Malik, J.: Learning to see by moving. In: ICCV (2015)Google Scholar
  2. 2.
    Bach, F.R., Harchaoui, Z.: Diffrac: a discriminative and flexible framework for clustering. In: NIPS (2008)Google Scholar
  3. 3.
    Bautista, M.A., Sanakoyeu, A., Tikhoncheva, E., Ommer, B.: Cliquecnn: deep unsupervised exemplar learning. In: Advances in Neural Information Processing Systems, pp. 3846–3854 (2016)Google Scholar
  4. 4.
    Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H.: Greedy layer-wise training of deep networks. In: NIPS (2007)Google Scholar
  5. 5.
    Bojanowski, P., Joulin, A.: Unsupervised learning by predicting noise. In: ICML (2017)Google Scholar
  6. 6.
    Bojanowski, P., Joulin, A., Lopez-Paz, D., Szlam, A.: Optimizing the latent space of generative networks. arXiv preprint arXiv:1707.05776 (2017)
  7. 7.
    Bottou, L.: Stochastic Gradient Descent Tricks. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) Neural Networks: Tricks of the Trade. LNCS, vol. 7700, pp. 421–436. Springer, Heidelberg (2012).  https://doi.org/10.1007/978-3-642-35289-8_25CrossRefGoogle Scholar
  8. 8.
    Carreira, J., Agrawal, P., Fragkiadaki, K., Malik, J.: Human pose estimation with iterative error feedback. In: CVPR (2016)Google Scholar
  9. 9.
    Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv preprint arXiv:1606.00915 (2016)
  10. 10.
    Coates, A., Ng, A.Y.: Learning feature representations with k-means. In: Montavon, G., Orr, G.B., Müller, K.R. (eds.) NN: Tricks of the Trade. LNCS, vol. 7700, pp. 561–580. Springer, Heidelberg (2012).  https://doi.org/10.1007/978-3-642-35289-8_30CrossRefGoogle Scholar
  11. 11.
    Csurka, G., Dance, C., Fan, L., Willamowski, J., Bray, C.: Visual categorization with bags of keypoints. In: Workshop on Satistical Learning in Computer Vision ECCV, vol. 1, pp. 1–2. Prague (2004)Google Scholar
  12. 12.
    Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: a large-scale hierarchical image database. In: CVPR (2009)Google Scholar
  13. 13.
    Doersch, C., Gupta, A., Efros, A.A.: Unsupervised visual representation learning by context prediction. In: ICCV (2015)Google Scholar
  14. 14.
    Doersch, C., Zisserman, A.: Multi-task self-supervised visual learning (2017)Google Scholar
  15. 15.
    Donahue, J., Krähenbühl, P., Darrell, T.: Adversarial feature learning. arXiv preprint arXiv:1605.09782 (2016)
  16. 16.
    Dosovitskiy, A., Springenberg, J.T., Riedmiller, M., Brox, T.: Discriminative unsupervised feature learning with convolutional neural networks. In: NIPS (2014)Google Scholar
  17. 17.
    Dumoulin, V., et al.: Adversarially learned inference. arXiv preprint arXiv:1606.00704 (2016)
  18. 18.
    Erhan, D., Bengio, Y., Courville, A., Vincent, P.: Visualizing higher-layer features of a deep network. Univ. Montr. 1341, 3 (2009)Google Scholar
  19. 19.
    Friedman, J., Hastie, T., Tibshirani, R.: The Elements of Statistical Learning, vol. 1. Springer, New York (2001).  https://doi.org/10.1007/978-0-387-21606-5CrossRefzbMATHGoogle Scholar
  20. 20.
    Goodfellow, I., et al.: Generative adversarial nets. In: NIPS (2014)Google Scholar
  21. 21.
    He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In: ICCV (2015)Google Scholar
  22. 22.
    Huang, F.J., Boureau, Y.L., LeCun, Y., et al.: Unsupervised learning of invariant feature hierarchies with applications to object recognition. In: CVPR (2007)Google Scholar
  23. 23.
    Huang, G., Liu, Z., Weinberger, K.Q., van der Maaten, L.: Densely connected convolutional networks. arXiv preprint arXiv:1608.06993 (2016)
  24. 24.
    Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: ICML (2015)Google Scholar
  25. 25.
    Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734 (2017)
  26. 26.
    Joulin, A., Bach, F.: A convex relaxation for weakly supervised classifiers. arXiv preprint arXiv:1206.6413 (2012)
  27. 27.
    Joulin, A., Bach, F., Ponce, J.: Discriminative clustering for image co-segmentation. In: CVPR (2010)Google Scholar
  28. 28.
    Joulin, A., van der Maaten, L., Jabri, A., Vasilache, N.: Learning visual features from large weakly supervised data. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9911, pp. 67–84. Springer, Cham (2016).  https://doi.org/10.1007/978-3-319-46478-7_5CrossRefGoogle Scholar
  29. 29.
    Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)
  30. 30.
    Kovashka, A., Russakovsky, O., Fei-Fei, L., Grauman, K.: Crowdsourcing in computer vision. Found. Trends® Comput. Graph. Vis. 10(3), 177–243 (2016)CrossRefGoogle Scholar
  31. 31.
    Krähenbühl, P., Doersch, C., Donahue, J., Darrell, T.: Data-dependent initializations of convolutional neural networks. arXiv preprint arXiv:1511.06856 (2015)
  32. 32.
    Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: NIPS (2012)Google Scholar
  33. 33.
    Larsson, G., Maire, M., Shakhnarovich, G.: Learning representations for automatic colorization. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 577–593. Springer, Cham (2016).  https://doi.org/10.1007/978-3-319-46493-0_35CrossRefGoogle Scholar
  34. 34.
    LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)CrossRefGoogle Scholar
  35. 35.
    Liao, R., Schwing, A., Zemel, R., Urtasun, R.: Learning deep parsimonious representations. In: NIPS (2016)Google Scholar
  36. 36.
    Lin, F., Cohen, W.W.: Power iteration clustering. In: ICML (2010)Google Scholar
  37. 37.
    Linsker, R.: Towards an organizing principle for a layered perceptual network. In: NIPS (1988)Google Scholar
  38. 38.
    Mairal, J., Koniusz, P., Harchaoui, Z., Schmid, C.: Convolutional kernel networks. In: NIPS (2014)Google Scholar
  39. 39.
    Malisiewicz, T., Gupta, A., Efros, A.A.: Ensemble of exemplar-SVMS for object detection and beyond. In: ICCV (2011)Google Scholar
  40. 40.
    Masci, J., Meier, U., Cireşan, D., Schmidhuber, J.: Stacked convolutional auto-encoders for hierarchical feature extraction. In: Honkela, T., Duch, W., Girolami, M., Kaski, S. (eds.) ICANN 2011. LNCS, vol. 6791, pp. 52–59. Springer, Heidelberg (2011).  https://doi.org/10.1007/978-3-642-21735-7_7CrossRefGoogle Scholar
  41. 41.
    Misra, I., Zitnick, C.L., Mitchell, M., Girshick, R.: Seeing through the human reporting bias: visual classifiers from noisy human-centric labels. In: CVPR (2016)Google Scholar
  42. 42.
    Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by Solving Jigsaw Puzzles. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9910, pp. 69–84. Springer, Cham (2016).  https://doi.org/10.1007/978-3-319-46466-4_5CrossRefGoogle Scholar
  43. 43.
    Noroozi, M., Pirsiavash, H., Favaro, P.: Representation learning by learning to count. In: ICCV (2017)Google Scholar
  44. 44.
    Owens, A., Wu, J., McDermott, J.H., Freeman, W.T., Torralba, A.: Ambient sound provides supervision for visual learning. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 801–816. Springer, Cham (2016).  https://doi.org/10.1007/978-3-319-46448-0_48CrossRefGoogle Scholar
  45. 45.
    Pathak, D., Girshick, R., Dollár, P., Darrell, T., Hariharan, B.: Learning features by watching objects move. In: CVPR (2017)Google Scholar
  46. 46.
    Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., Efros, A.A.: Context encoders: feature learning by inpainting. In: CVPR (2016)Google Scholar
  47. 47.
    Paulin, M., Douze, M., Harchaoui, Z., Mairal, J., Perronin, F., Schmid, C.: Local convolutional features with unsupervised training for image retrieval. In: ICCV (2015)Google Scholar
  48. 48.
    Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Object retrieval with large vocabularies and fast spatial matching. In: CVPR (2007)Google Scholar
  49. 49.
    Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Lost in quantization: improving particular object retrieval in large scale image databases. In: CVPR (2008)Google Scholar
  50. 50.
    Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: NIPS (2015)Google Scholar
  51. 51.
    Russakovsky, O., et al.: Imagenet large scale visual recognition challenge. IJCV 115(3), 211–252 (2015)MathSciNetCrossRefGoogle Scholar
  52. 52.
    de Sa, V.R.: Learning classification with unlabeled data. In: NIPS (1994)Google Scholar
  53. 53.
    Sharif Razavian, A., Azizpour, H., Sullivan, J., Carlsson, S.: CNN features off-the-shelf: an astounding baseline for recognition. In: CVPR workshops (2014)Google Scholar
  54. 54.
    Shi, J., Malik, J.: Normalized cuts and image segmentation. TPAMI 22(8), 888–905 (2000)CrossRefGoogle Scholar
  55. 55.
    Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
  56. 56.
    Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. JMLR 15(1), 1929–1958 (2014)MathSciNetzbMATHGoogle Scholar
  57. 57.
    Stock, P., Cisse, M.: Convnets and imagenet beyond accuracy: explanations, bias detection, adversarial examples and model criticism. arXiv preprint arXiv:1711.11443 (2017)
  58. 58.
    Thomee, B., et al.: The new data and new challenges in multimedia research. arXiv preprint arXiv:1503.01817 (2015)
  59. 59.
    Tolias, G., Sicre, R., Jégou, H.: Particular object retrieval with integral max-pooling of CNN activations. arXiv preprint arXiv:1511.05879 (2015)
  60. 60.
    Turk, M.A., Pentland, A.P.: Face recognition using eigenfaces. In: CVPR (1991)Google Scholar
  61. 61.
    Van De Sande, K., Gevers, T., Snoek, C.: Evaluating color descriptors for object and scene recognition. TPAMI 32(9), 1582–1596 (2010)CrossRefGoogle Scholar
  62. 62.
    Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.A.: Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. JMLR 11(Dec), 3371–3408 (2010)MathSciNetzbMATHGoogle Scholar
  63. 63.
    Wang, X., Gupta, A.: Unsupervised learning of visual representations using videos. In: ICCV (2015)Google Scholar
  64. 64.
    Wang, X., He, K., Gupta, A.: Transitive invariance for self-supervised visual representation learning. arXiv preprint arXiv:1708.02901 (2017)
  65. 65.
    Weinzaepfel, P., Revaud, J., Harchaoui, Z., Schmid, C.: Deepflow: Large displacement optical flow with deep matching. In: ICCV (2013)Google Scholar
  66. 66.
    Xie, J., Girshick, R., Farhadi, A.: Unsupervised deep embedding for clustering analysis. In: ICML (2016)Google Scholar
  67. 67.
    Xu, L., Neufeld, J., Larson, B., Schuurmans, D.: Maximum margin clustering. In: NIPS (2005)Google Scholar
  68. 68.
    Yang, J., Parikh, D., Batra, D.: Joint unsupervised learning of deep representations and image clusters. In: CVPR (2016)Google Scholar
  69. 69.
    Yosinski, J., Clune, J., Nguyen, A., Fuchs, T., Lipson, H.: Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579 (2015)
  70. 70.
    Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8689, pp. 818–833. Springer, Cham (2014).  https://doi.org/10.1007/978-3-319-10590-1_53CrossRefGoogle Scholar
  71. 71.
    Zhang, R., Isola, P., Efros, A.A.: Colorful image colorization. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 649–666. Springer, Cham (2016).  https://doi.org/10.1007/978-3-319-46487-9_40CrossRefGoogle Scholar
  72. 72.
    Zhang, R., Isola, P., Efros, A.A.: Split-brain autoencoders: unsupervised learning by cross-channel prediction. arXiv preprint arXiv:1611.09842 (2016)
  73. 73.
    Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., Oliva, A.: Learning deep features for scene recognition using places database. In: NIPS (2014)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Mathilde Caron
    • 1
  • Piotr Bojanowski
    • 1
  • Armand Joulin
    • 1
  • Matthijs Douze
    • 1
  1. 1.Facebook AI ResearchParisFrance

Personalised recommendations