
Blended coarse gradient descent for full quantization of deep neural networks

  • Penghang Yin
  • Shuai Zhang
  • Jiancheng Lyu
  • Stanley Osher
  • Yingyong Qi
  • Jack Xin

Abstract

Quantized deep neural networks (QDNNs) are attractive due to their much lower memory storage and faster inference speed than their regular full-precision counterparts. To maintain the same performance level, especially at low bit-widths, QDNNs must be retrained. Their training involves piecewise constant activation functions and discrete weights; hence, mathematical challenges arise. We introduce the notion of coarse gradient and propose the blended coarse gradient descent (BCGD) algorithm for training fully quantized neural networks. A coarse gradient is generally not the gradient of any function but an artificial ascent direction. The BCGD weight update applies a coarse gradient correction to a weighted average of the full-precision weights and their quantization (the so-called blending), which yields sufficient descent in the objective value and thus accelerates the training. Our experiments demonstrate that this simple blending technique is very effective for quantization at extremely low bit-widths such as binarization. In full quantization of ResNet-18 for the ImageNet classification task, BCGD gives 64.36% top-1 accuracy with binary weights across all layers and 4-bit adaptive activation. If the weights in the first and last layers are kept in full precision, this number increases to 65.46%. As theoretical justification, we present a convergence analysis of coarse gradient descent for a two-linear-layer neural network model with Gaussian input data and prove that the expected coarse gradient correlates positively with the underlying true gradient.
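To make the blending step concrete, the sketch below shows one BCGD-style weight update, assuming a simple sign-based binarization with a mean-magnitude scaling factor as the quantizer and a plain SGD step. The names binarize, bcgd_step, rho, and lr, along with their default values, are illustrative, and coarse_grad is assumed to be supplied by a straight-through-type estimator for the piecewise constant activation; this is a minimal illustration of the update described in the abstract, not the paper's exact scheme.

```python
import numpy as np

def binarize(w):
    """Binary weight quantization: sign of the weights scaled by their
    mean magnitude (one common choice of quantizer; illustrative here)."""
    alpha = np.mean(np.abs(w))
    return alpha * np.sign(w)

def bcgd_step(w, coarse_grad, lr=0.01, rho=1e-4):
    """One blended coarse gradient descent (BCGD) update.

    w           : full-precision weights kept throughout training
    coarse_grad : coarse gradient evaluated at the quantized weights,
                  e.g. from a straight-through-type estimator
    lr, rho     : learning rate and blending parameter (illustrative values)

    The new iterate blends the float weights with their quantization and
    then takes a step along the negative coarse gradient.
    """
    w_q = binarize(w)
    return (1.0 - rho) * w + rho * w_q - lr * coarse_grad

# Example: one update on random weights with a random stand-in gradient
w = np.random.randn(256)
w = bcgd_step(w, coarse_grad=np.random.randn(256))
```

The blending term pulls the maintained float weights toward their quantized values at every iteration, which is the mechanism behind the sufficient descent property noted in the abstract; the quantized weights binarize(w) are the ones used for inference.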

Keywords

Weight/activation quantization · Blended coarse gradient descent · Sufficient descent property · Deep neural networks

Mathematics Subject Classification

90C35 · 90C26 · 90C52 · 90C90


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Department of Mathematics, University of California at Los Angeles, Los Angeles, USA
  2. Qualcomm AI Research, San Diego, USA
  3. Department of Mathematics, University of California at Irvine, Irvine, USA
