Multi-Attention Multi-Class Constraint for Fine-grained Image Recognition

  • Ming Sun
  • Yuchen Yuan
  • Feng ZhouEmail author
  • Errui Ding
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11220)


Attention-based learning for fine-grained image recognition remains a challenging task, where most of the existing methods treat each object part in isolation, while neglecting the correlations among them. In addition, the multi-stage or multi-scale mechanisms involved make the existing methods less efficient and hard to be trained end-to-end. In this paper, we propose a novel attention-based convolutional neural network (CNN) which regulates multiple object parts among different input images. Our method first learns multiple attention region features of each input image through the one-squeeze multi-excitation (OSME) module, and then apply the multi-attention multi-class constraint (MAMC) in a metric learning framework. For each anchor feature, the MAMC functions by pulling same-attention same-class features closer, while pushing different-attention or different-class features away. Our method can be easily trained end-to-end, and is highly efficient which requires only one training stage. Moreover, we introduce Dogs-in-the-Wild, a comprehensive dog species dataset that surpasses similar existing datasets by category coverage, data volume and annotation quality. Extensive experiments are conducted to show the substantial improvements of our method on four benchmark datasets.


Fine-grained classification Metric learning Visual attention Multi-attention Multi-class constraint One-squeeze Multi-excitation 


  1. 1.
    Bossard, L., Guillaumin, M., Gool, L.V.: Food-101 - mining discriminative components with random forests. In: ECCV (2014)Google Scholar
  2. 2.
    Branson, S., Van Horn, G., Belongie, S., Perona, P.: Bird species categorization using pose normalized deep convolutional nets. In: BMVC (2014)Google Scholar
  3. 3.
    Branson, S., Van Horn, G., Wah, C., Perona, P., Belongie, S.: The ignorant led by the blind: a hybrid human-machine vision system for fine-grained categorization. Int. J. Comput. Vis. 108(1–2), 3–29 (2014)MathSciNetzbMATHGoogle Scholar
  4. 4.
    Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., Shah, R.: Signature verification using a “Siamese" time delay neural network. In: NIPS (1994)Google Scholar
  5. 5.
    Collobert, R., Kavukcuoglu, K., Farabet, C.: Torch7: A matlab-like environment for machine learning. In: BigLearn, NIPS workshop (2011)Google Scholar
  6. 6.
    Cui, Y., Zhou, F., Lin, Y., Belongie, S.: Fine-grained categorization and dataset bootstrapping using deep metric learning with humans in the loop. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1153–1162 (2016)Google Scholar
  7. 7.
    Cui, Y., Zhou, F., Wang, J., Liu, X., Lin, Y., Belongie, S.: Kernel pooling for convolutional neural networks. In: CVPR (2017)Google Scholar
  8. 8.
    Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: a large-scale hierarchical image database. In: CVPR (2009)Google Scholar
  9. 9.
    Farrell, R., Oza, O., Zhang, N., Morariu, V.I., Darrell, T., Davis, L.S.: Birdlets: subordinate categorization using volumetric primitives and pose-normalized appearance. In: ICCV (2011)Google Scholar
  10. 10.
    Fu, J., Zheng, H., Mei, T.: Look closer to see better: recurrent attention convolutional neural network for fine-grained image recognition. In: CVPR (2017)Google Scholar
  11. 11.
    He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)Google Scholar
  12. 12.
    Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)CrossRefGoogle Scholar
  13. 13.
    Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507 (2017)
  14. 14.
    Jaderberg, M., Simonyan, K., Zisserman, A.: Spatial transformer networks. In: NIPS (2015)Google Scholar
  15. 15.
    Jia, Y., et al.: Caffe: convolutional architecture for fast feature embedding. In: ACM MM (2014)Google Scholar
  16. 16.
    Khosla, A., Jayadevaprakash, N., Yao, B., Li, F.F.: Novel dataset for fine-grained image categorization: stanford dogs. In: CVPR Workshops on Fine-Grained Visual Categorization (2011)Google Scholar
  17. 17.
    Krause, J., Gebru, T., Deng, J., Li, L.J., Fei-Fei, L.: Learning features and parts for fine-grained recognition. In: ICPR (2014)Google Scholar
  18. 18.
    Krause, J., Jin, H., Yang, J., Fei-Fei, L.: Fine-grained recognition without part annotations. In: CVPR (2015)Google Scholar
  19. 19.
    Krause, J., et al.: The unreasonable effectiveness of noisy data for fine-grained recognition. In: ECCV (2016)CrossRefGoogle Scholar
  20. 20.
    Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3D object representations for fine-grained categorization. In: ICCV Workshops on 3D Representation and Recognition (2013)Google Scholar
  21. 21.
    Li, Z., Yang, Y., Liu, X., Wen, S., Xu, W.: Dynamic computational time for visual attention. arXiv preprint arXiv:1703.10332 (2017)
  22. 22.
    Lin, D., Shen, X., Lu, C., Jia, J.: Deep LAC: deep localization, alignment and classification for fine-grained recognition. In: CVPR (2015)Google Scholar
  23. 23.
    Lin, T.Y., RoyChowdhury, A., Maji, S.: Bilinear CNN models for fine-grained visual recognition. In: ICCV (2015)Google Scholar
  24. 24.
    Lin, Y., Morariu, V.I., Hsu, W.H., Davis, L.S.: Jointly optimizing 3D model fitting and fine-grained classification. In: ECCV (2014)Google Scholar
  25. 25.
    Liu, J., Kanazawa, A., Jacobs, D.W., Belhumeur, P.N.: Dog breed classification using part localization. In: ECCV (2012)Google Scholar
  26. 26.
    Liu, X., Xia, T., Wang, J., Yang, Y., Zhou, F., Lin, Y.: Fully convolutional attention networks for fine-grained recognition. arXiv preprint arXiv:1603.06765 (2017)
  27. 27.
    Mnih, V., Heess, N., Graves, A., Kavukcuoglu, K.: Recurrent models of visual attention. In: NIPS (2014)Google Scholar
  28. 28.
    Nilsback, M., Zisserman, A.: Automated flower classification over a large number of classes. In: ICVGIP (2008)Google Scholar
  29. 29.
    Parkhi, O.M., Vedaldi, A., Jawahar, C., Zisserman, A.: The truth about cats and dogs. In: ICCV (2011)Google Scholar
  30. 30.
    Perronnin, F., Larlus, D.: Fisher vectors meet neural networks: a hybrid classification architecture. In: CVPR (2015)Google Scholar
  31. 31.
    Rosenfeld, A., Ullman, S.: Visual concept recognition and localization via iterative introspection. In: ACCV (2016)Google Scholar
  32. 32.
    Salakhutdinov, R., Hinton, G.E.: Learning a nonlinear embedding by preserving class neighbourhood structure. In: AISTATS (2007)Google Scholar
  33. 33.
    Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: A unified embedding for face recognition and clustering. In: CVPR (2015)Google Scholar
  34. 34.
    Simon, M., Gao, Y., Darrell, T., Denzler, J., Rodner, E.: Generalized orderless pooling performs implicit salient matching. arXiv preprint arXiv:1705.00487 (2017)
  35. 35.
    Simon, M., Rodner, E.: Neural activation constellations: Unsupervised part model discovery with convolutional networks. In: ICCV (2015)Google Scholar
  36. 36.
    Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
  37. 37.
    Sohn, K.: Improved deep metric learning with multi-class n-pair loss objective. In: NIPS (2016)Google Scholar
  38. 38.
    Van Horn, G., et al.: The iNaturalist challenge 2017 dataset. arXiv preprint arXiv:1707.06642 (2017)
  39. 39.
    Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The Caltech-UCSD birds-200-2011 dataset. Technical report CNS-TR-2011-001, California Institute of Technology (2011)Google Scholar
  40. 40.
    Wang, D., Shen, Z., Shao, J., Zhang, W., Xue, X., Zhang, Z.: Multiple granularity descriptors for fine-grained categorization. In: ICCV (2015)Google Scholar
  41. 41.
    Wang, F., et al.: Residual attention network for image classification. In: CVPR (2017)Google Scholar
  42. 42.
    Wang, J., Zhou, F., Wen, S., Liu, X., Lin, Y.: Deep metric learning with angular loss. In: ICCV (2017)Google Scholar
  43. 43.
    Wang, J., et al.: Learning fine-grained image similarity with deep ranking. In: CVPR (2014)Google Scholar
  44. 44.
    Wang, Y., Choi, J., Morariu, V., Davis, L.S.: Mining discriminative triplets of patches for fine-grained classification. In: CVPR (2016)Google Scholar
  45. 45.
    Wilber, M., Kwak, I.S., Kriegman, D., Belongie, S.: Learning concept embeddings with combined human-machine expertise. In: ICCV (2015)Google Scholar
  46. 46.
    Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: ECCV (2014)Google Scholar
  47. 47.
    Zhang, H., et al.: SPDA-CNN: unifying semantic part detection and abstraction for fine-grained recognition. In: CVPR (2016)Google Scholar
  48. 48.
    Zhang, N., Donahue, J., Girshick, R., Darrell, T.: Part-based R-CNNs for fine-grained category detection. In: ECCV (2014)Google Scholar
  49. 49.
    Zhang, N., Paluri, M., Ranzato, M., Darrell, T., Bourdev, L.: Panda: Pose aligned networks for deep attribute modeling. In: CVPR (2014)Google Scholar
  50. 50.
    Zhang, X., Zhou, F., Lin, Y., Zhang, S.: Embedding label structures for fine-grained feature representation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1114–1123 (2016)Google Scholar
  51. 51.
    Zhang, X., Xiong, H., Zhou, W., Lin, W., Tian, Q.: Picking deep filter responses for fine-grained image recognition. In: CVPR (2016)Google Scholar
  52. 52.
    Zhao, B., Wu, X., Feng, J., Peng, Q., Yan, S.: Diversified visual attention networks for fine-grained object classification. IEEE Trans. Multimed. 19(6), 1245–1256 (2017)CrossRefGoogle Scholar
  53. 53.
    Zheng, H., Fu, J., Mei, T., Luo, J.: Learning multi-attention convolutional neural network for fine-grained image recognition. In: ICCV (2017)Google Scholar
  54. 54.
    Zhou, B., Khosla, A., Lapedriza, À., Oliva, A., Torralba, A.: Object detectors emerge in deep scene CNNs. In: ICLR (2014)Google Scholar
  55. 55.
    Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: CVPR (2016)Google Scholar
  56. 56.
    Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., Oliva, A.: Learning deep features for scene recognition using places database. In: NIPS (2014)Google Scholar
  57. 57.
    Zhou, F., Lin, Y.: Fine-grained image classification by exploring bipartite-graph labels. In: CVPR (2016)Google Scholar
  58. 58.
    Zhu, Y., Zhou, Y., Ye, Q., Qiu, Q., Jiao, J.: Soft proposal networks for weakly supervised object localization. In: ICCV (2017)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. 1.Department of Computer Vision Technology (VIS)Baidu Inc.BeijingChina
  2. 2.Baidu ResearchBeijingChina

Personalised recommendations