Information Bottleneck Theory on Convolutional Neural Networks


In recent years, many studies have attempted to open the black box of deep neural networks, proposing a variety of theories to explain their behavior. Among these, information bottleneck (IB) theory claims that training consists of two distinct phases: a fitting phase followed by a compression phase. This claim has attracted considerable attention owing to its success in explaining the internal dynamics of feedforward neural networks. In this paper, we employ IB theory to understand the training dynamics of convolutional neural networks (CNNs) and investigate how fundamental architectural features, such as convolutional layer width, kernel size, network depth, pooling layers, and multiple fully connected layers, affect the performance of CNNs. In particular, through a series of experiments on the MNIST and Fashion-MNIST benchmarks, we demonstrate that the compression phase is not observed in any of these cases. This shows that CNNs exhibit considerably more complicated behavior than feedforward neural networks.
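In IB analyses of this kind, the two phases are read off the information plane: for each hidden representation \(T\), the mutual information values \(I(X;T)\) and \(I(T;Y)\) are estimated at successive training epochs. As a rough illustrative sketch (not necessarily the estimator used in this paper), a common binning-based estimate of \(I(T;Y)\) discretizes the activations and applies \(I(T;Y)=H(T)-H(T\mid Y)\); the bin count `n_bins` and the hashing of binned rows are choices of this sketch:

```python
import numpy as np

def discrete_entropy(symbols):
    """Shannon entropy (in bits) of a sequence of discrete symbols."""
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def binned_mutual_information(y, t, n_bins=30):
    """Binning estimate of I(T; Y) for labels y (shape (N,)) and layer
    activations t (shape (N, n_units)): discretize each activation into
    equal-width bins, treat each binned row as one symbol, and use
    I(T; Y) = H(T) - H(T | Y)."""
    edges = np.linspace(t.min(), t.max(), n_bins + 1)
    binned = np.digitize(t, edges[1:-1])            # shape (N, n_units)
    symbols = np.array([hash(row.tobytes()) for row in binned])
    h_t = discrete_entropy(symbols)
    h_t_given_y = 0.0
    for label in np.unique(y):
        mask = y == label
        h_t_given_y += mask.mean() * discrete_entropy(symbols[mask])
    return h_t - h_t_given_y
```

A known caveat of binning-based estimates is that, when nearly every sample falls into a distinct activation pattern, the estimate saturates toward \(H(Y)\); this sensitivity to the discretization is one reason estimated MI paths must be interpreted with care.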






  1. Advani MS, Saxe AM (2017) High-dimensional dynamics of generalization error in neural networks. Preprint arXiv:1710.03667

  2. Amjad RA, Geiger BC (2019) Learning representations for neural network-based classification using the information bottleneck principle. IEEE Trans Pattern Anal Mach Intell 42:2225–2239

  3. Chechik G, Globerson A, Tishby N, Weiss Y (2005) Information bottleneck for Gaussian variables. J Mach Learn Res 6(Jan):165–188

  4. Dai B, Zhu C, Wipf D (2018) Compressing neural networks using the variational information bottleneck. Preprint arXiv:1802.10399

  5. Elidan G, Friedman N (2005) Learning hidden variable networks: the information bottleneck approach. J Mach Learn Res 6(Jan):81–127

  6. Friedman N, Mosenzon O, Slonim N, Tishby N (2013) Multivariate information bottleneck. Preprint arXiv:1301.2270

  7. Gabrié M, Manoel A, Luneau C, Macris N, Krzakala F, Zdeborová L et al (2018) Entropy and mutual information in models of deep neural networks. In: Advances in neural information processing systems, pp 1821–1831

  8. Goldfeld Z, Berg E, Greenewald K, Melnyk I, Nguyen N, Kingsbury B, Polyanskiy Y (2018) Estimating information flow in deep neural networks. Preprint arXiv:1810.05728

  9. Goldfeld Z, Van Den Berg E, Greenewald K, Melnyk I, Nguyen N, Kingsbury B, Polyanskiy Y (2019) Estimating information flow in deep neural networks. In: Proceedings of the 36th international conference on machine learning, vol 97, pp 2299–2308

  10. Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press, Cambridge

  11. Guidotti R, Monreale A, Ruggieri S, Turini F, Giannotti F, Pedreschi D (2019) A survey of methods for explaining black box models. ACM Comput Surv 51(5):93

  12. Hsu WH, Kennedy LS, Chang SF (2006) Video search reranking via information bottleneck principle. In: Proceedings of the 14th ACM international conference on multimedia, pp 35–44

  13. Jónsson H, Cherubini G, Eleftheriou E (2019) Convergence of DNNs with mutual-information-based regularization. In: Proceedings of the Bayesian deep learning workshop @ advances in neural information processing systems (NeurIPS), Vancouver

  14. Kadmon J, Sompolinsky H (2016) Optimal architectures in a solvable model of deep networks. In: Advances in neural information processing systems, pp 4781–4789

  15. Kolchinsky A, Tracey B (2017) Estimating mixture entropy with pairwise distances. Entropy 19(7):361

  16. Kolchinsky A, Tracey BD, Wolpert DH (2019) Nonlinear information bottleneck. Entropy 21(12):1181

  17. Krizhevsky A, Hinton G (2009) Learning multiple layers of features from tiny images. Technical report, University of Toronto

  18. Painsky A, Tishby N (2017) Gaussian lower bound for the information bottleneck limit. J Mach Learn Res 18(1):7908–7936

  19. Poole B, Ozair S, Oord A, Alemi AA, Tucker G (2019) On variational bounds of mutual information. Preprint arXiv:1905.06922

  20. Saxe AM, Bansal Y, Dapello J, Advani M, Kolchinsky A, Tracey BD, Cox DD (2019) On the information bottleneck theory of deep learning. J Stat Mech Theory Exp 2019(12):124020

  21. Saxe AM, McClelland JL, Ganguli S (2014) Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In: International conference on learning representations

  22. Shamir O, Sabato S, Tishby N (2010) Learning and generalization with the information bottleneck. Theoret Comput Sci 411(29–30):2696–2711

  23. Shwartz-Ziv R, Tishby N (2017) Opening the black box of deep neural networks via information. Preprint arXiv:1703.00810

  24. Slonim N, Tishby N (2000) Document clustering using word clusters via the information bottleneck method. In: Proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval, pp 208–215

  25. Strouse D, Schwab DJ (2017) The deterministic information bottleneck. Neural Comput 29(6):1611–1630

  26. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2818–2826

  27. Tishby N, Pereira FC, Bialek W (2000) The information bottleneck method. Preprint arXiv:physics/0004057

  28. Tishby N, Zaslavsky N (2015) Deep learning and the information bottleneck principle. In: 2015 IEEE information theory workshop (ITW), pp 1–5

  29. Wang Q, Gao J, Li X (2019) Weakly supervised adversarial domain adaptation for semantic segmentation in urban scenes. IEEE Trans Image Process 28(9):4376–4386

  30. Wang Q, Yuan Z, Du Q, Li X (2018) GETNET: a general end-to-end 2-D CNN framework for hyperspectral image change detection. IEEE Trans Geosci Remote Sens 57(1):3–13

  31. Yu S, Principe JC (2019) Understanding autoencoders with information theoretic concepts. Neural Netw 117:104–123

  32. Yu S, Wickstrøm K, Jenssen R, Principe JC (2020) Understanding convolutional neural networks with information theory: an initial exploration. IEEE Trans Neural Netw Learn Syst

  33. Yu Y, Chan KHR, You C, Song C, Ma Y (2020) Learning diverse and discriminative representations via the principle of maximal coding rate reduction. Preprint arXiv:2006.08558



Our research is supported by the Tianjin Natural Science Foundation of China (20JCYBJC00500), the Science and Technology Development Fund of Tianjin Education Commission for Higher Education (2018KJ217).

Author information



Corresponding author

Correspondence to Ding Liu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.



In order to further verify our conclusions, we conduct additional experiments on the CIFAR-10 dataset [17]. This dataset consists of 60,000 \(32\times 32\) colour images in 10 classes, with 6000 images per class. There are 50,000 training images and 10,000 test images.

In this experiment, all 50,000 training images and all 10,000 test images are used as the training and test datasets respectively, which is the only setting that differs from the Experiments and discussion section. Furthermore, to simplify the computation of mutual information, we average the three colour channels of each image and use the resulting single-channel image as input data.
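The channel-averaging preprocessing described above can be sketched as follows (a minimal illustration; the function name `to_single_channel` and the shape convention \((N, 32, 32, 3)\) are our assumptions, not from the paper):

```python
import numpy as np

def to_single_channel(images):
    """Collapse RGB images of shape (N, 32, 32, 3) to a single channel
    by averaging over the colour axis; the trailing channel dimension
    is kept so the result, of shape (N, 32, 32, 1), can be fed to a
    2-D convolutional layer directly."""
    return images.mean(axis=-1, keepdims=True)
```

Keeping the channel axis (rather than squeezing it away) avoids a reshape before the first convolutional layer.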

Fig. 9

(Color figure online) MI paths of CNNs with different convolutional layer widths on the training data of CIFAR-10. The convolutional layer widths of the three networks are a 3-3-3-3-3-3, b 6-6-6-6-6-6, c 12-12-12-12-12-12

Fig. 10

(Color figure online) MI paths of CNNs with different convolutional layer depths on the training data of CIFAR-10. The depths of the four networks are a \(\hbox {depth}=2\), b \(\hbox {depth}=4\), c \(\hbox {depth}=7\), d \(\hbox {depth}=10\). The width of all these networks is set to 6

Fig. 11

(Color figure online) MI paths of CNNs with and without a pooling layer on the test data of CIFAR-10. a Without pooling layer, b with pooling layer. The widths of the convolutional layers are both 6-6-6 and the kernel size is fixed to \(3\times 3\). Moreover, the MaxPooling2D layer is added after layer 1 and the pooling size is set to \(2\times 2\)

Figures 9 and 10 show the MI paths for different widths and depths on the training data, respectively, and Fig. 11 shows the MI paths with a pooling layer on the test data. These results provide further evidence for our conclusions regarding IB theory on CNNs.


About this article


Cite this article

Li, J., Liu, D. Information Bottleneck Theory on Convolutional Neural Networks. Neural Process Lett (2021).



  • Information bottleneck
  • Convolutional neural networks
  • Deep learning
  • Representation learning