Neural Networks as Model Selection with Incremental MDL Normalization

  • Baihan Lin
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 1072)


If we view neural network optimization as a model selection problem, the implicit space can be constrained by a normalizing factor: the minimum description length of the optimal universal code. Inspired by the adaptation phenomenon of biological neuronal firing, we propose a class of reparameterizations of the activations in a neural network that takes into account the statistical regularity of the implicit space under the Minimum Description Length (MDL) principle. We introduce an incremental method for computing this universal code as a normalized maximum likelihood, demonstrate its flexibility to include data priors such as top-down attention and other oracle information, and show its compatibility with batch normalization and layer normalization. Empirical results show that the proposed method outperforms existing normalization methods on limited and imbalanced data drawn from non-stationary distributions, benchmarked on computer vision and reinforcement learning tasks. As an unsupervised attention mechanism over the input data, this biologically plausible normalization has the potential to handle other complicated real-world scenarios, as well as reinforcement learning settings where rewards are sparse and non-uniform. Further research is proposed to identify these scenarios and to explore the behaviors of the different variants.
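To make the idea above concrete, the following is a minimal sketch (not the paper's implementation) of an incremental NML-style normalizer: each activation is scored under running maximum-likelihood Gaussian estimates, and the normalizing factor accumulates the per-sample maximized likelihoods as an incremental stand-in for the NML denominator of the optimal universal code. The class name and interface are hypothetical.

```python
import numpy as np

class IncrementalMDLNormalizer:
    """Hypothetical sketch of incremental NML-style normalization.

    Activations are modeled by a Gaussian whose parameters are running
    maximum-likelihood estimates; the normalizer accumulates per-sample
    maximized likelihoods, approximating the NML denominator incrementally.
    """

    def __init__(self):
        self.n = 0        # samples seen so far
        self.mean = 0.0   # running ML mean
        self.m2 = 0.0     # running sum of squared deviations (Welford)
        self.comp = 0.0   # incremental universal-code normalizer

    def _update_stats(self, x):
        # Welford's online update of mean and variance
        for v in np.ravel(x):
            self.n += 1
            delta = v - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (v - self.mean)

    def __call__(self, x):
        x = np.asarray(x, dtype=float)
        self._update_stats(x)
        var = self.m2 / self.n if self.n > 1 else 1.0
        var = max(var, 1e-8)
        # likelihood of each activation under the current ML Gaussian
        lik = np.exp(-(x - self.mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        self.comp += float(lik.sum())  # grow the NML-style normalizer
        # description length of each activation under the normalized code
        return -np.log(lik / self.comp + 1e-12)
```

In this toy version, activations that are statistically regular (close to the running mean) receive shorter description lengths than outliers, which is the sense in which the normalizer acts as an unsupervised attention signal over the input statistics.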


Neuronal adaptation · Minimum description length · Model selection · Universal code · Normalization methods in neural networks



Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  1. Center for Theoretical Neuroscience, Columbia University, New York, USA
  2. Zuckerman Mind Brain Behavior Institute, Columbia University, New York, USA
  3. Department of Applied Mathematics, University of Washington, Seattle, USA