A Two-Teacher Framework for Knowledge Distillation

  • Xingjian Chen
  • Jianbo Su
  • Jun Zhang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11554)

Abstract

Knowledge distillation aims at transferring knowledge from a teacher network to a student network. Commonly, the teacher network has high capacity, while the student network is compact and can be deployed on embedded systems. However, existing distillation methods use only one teacher to guide the student network, and there is no guarantee that the knowledge is sufficiently transferred to the student network. We therefore propose a novel framework to improve the performance of the student network. The framework consists of two teacher networks trained with different strategies: one is trained strictly to guide the student network to learn sophisticated features, while the other is trained loosely to guide the student network to learn general decisions based on the learned features. We perform extensive experiments on two standard image classification datasets, CIFAR-10 and CIFAR-100, and the results demonstrate that the proposed framework significantly improves the classification accuracy of the student network.
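
The abstract describes the framework only at a high level; the paper defines the exact training objectives. As a rough illustration of how two differently trained teachers could jointly supervise one student, the sketch below combines a FitNets-style feature-matching term (guidance from the strictly trained teacher) with a Hinton-style temperature-softened soft-label term (guidance from the loosely trained teacher), on top of the usual cross-entropy with the ground-truth labels. Every name and hyperparameter here (two_teacher_loss, regressor, T, alpha, beta) is an assumption made for illustration, not taken from the paper.

import torch
import torch.nn.functional as F

def two_teacher_loss(student_logits, student_feat,
                     strict_teacher_feat, loose_teacher_logits,
                     labels, regressor, T=4.0, alpha=0.5, beta=0.5):
    """Illustrative two-teacher distillation loss (not the paper's exact formulation)."""
    # Supervised term: cross-entropy on the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)

    # Feature-level guidance from the strictly trained teacher:
    # a small regressor maps the student feature to the teacher's shape,
    # and the two are matched with an L2 loss (FitNets-style).
    feat = F.mse_loss(regressor(student_feat), strict_teacher_feat.detach())

    # Decision-level guidance from the loosely trained teacher:
    # KL divergence between temperature-softened output distributions
    # (Hinton-style knowledge distillation), scaled by T^2.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(loose_teacher_logits.detach() / T, dim=1),
        reduction="batchmean",
    ) * (T * T)

    return ce + alpha * feat + beta * kd

In a typical setup the student, both teachers, and the regressor (for example a 1x1 convolution matching channel counts) run on the same mini-batch, with both teachers kept frozen and only the student and regressor parameters updated; how the feature and soft-label terms are weighted is a tunable design choice.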

Keywords

Knowledge distillation · Convolutional neural networks · Adversarial learning · Image classification

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grants 61533012 and 91748120.


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Research Center of Intelligent Robotics, Shanghai Jiao Tong University, Shanghai, China
  2. UM-SJTU Joint Institute, Shanghai Jiao Tong University, Shanghai, China