
An information-theoretic approach to curiosity-driven reinforcement learning


We provide a fresh look at the problem of exploration in reinforcement learning, drawing on ideas from information theory. First, we show that Boltzmann-style exploration, one of the main exploration methods used in reinforcement learning, is optimal from an information-theoretic point of view, in that it optimally trades expected return for the coding cost of the policy. Second, we address the problem of curiosity-driven learning. We propose that, in addition to maximizing the expected return, a learner should choose a policy that also maximizes the learner’s predictive power. This makes the world both interesting and exploitable. Optimal policies then have the form of Boltzmann-style exploration with a bonus, containing a novel exploration–exploitation trade-off which emerges naturally from the proposed optimization principle. Importantly, this exploration–exploitation trade-off persists in the optimal deterministic policy, i.e., when there is no exploration due to randomness. As a result, exploration is understood as an emerging behavior that optimizes information gain, rather than being modeled as pure randomization of action choices.



Notes

  1. We will refer to this parameter as the temperature in the rest of the article. One has to keep in mind that this is a metaphor, not a physical temperature.

  2. Here and throughout, we use capital letters to denote random variables, and small letters to denote particular realizations of these variables.

  3. If there are N actions that maximize \(Q^{\pi}(x,a)\), then those occur with probability 1/N, while all other actions occur with probability 0.

  4. The assignment becomes deterministic if there are no degeneracies; otherwise all those actions occur with equal probability, as in Sect. 2.





Acknowledgements

This research was funded in part by NSERC and ONR.

Author information



Corresponding author

Correspondence to Susanne Still.



Clever random policy

There are two world states, \(x \in \{0,1\}\) and a continuous action set, \(a \in [ 0,1 ].\) The value of the action sets how strongly the agent tries to stay in or leave a state, and \(p({\bar x}|x,a) = a.\) The interest in reward is switched off (\(\alpha = 0\)), so that the optimal action becomes the one that maximizes only the predictive power.

  • Policies that maximize \(I[X_{t+1}, \{X_t, A_t\}]\)

For brevity of notation, we drop the index t for the current state and action.

$$ I [ X_{t+1}, \{X,A\} ] = H [ X_{t+1} ] - H [ X_{t+1}|X,A ] $$

The second term in this decomposition is minimized, and equal to zero, for all policies that result in deterministic world transitions. These are all policies for which \(\pi(\tilde{a}|x) = 0\) for all \(\tilde{a} \notin \{0, 1\}\). This limits the agent to using only the two most extreme actions, \(a \in \{0, 1\}\). Since we have only two states, policies in this class are determined by two probabilities, for example the flip probabilities π(A = 0|X = 1) and π(A = 1|X = 0).

The first term is maximized for \(p(X_{t+1} = 1) = p(X_{t+1} = 0) = 1/2\). Setting \(p(X_{t+1} = 1)\) to 1/2 yields

$$ \pi(A=0|X=1) p(X=1) + \pi(A=1|X=0) p(X=0) = \frac{1}{2}. $$

We assume that p(X = 0) is estimated by the learner. The above condition holds for all values of p(X = 0) if π(A = 0|X = 1) = π(A = 1|X = 0) = 1/2. We call this the “clever random” policy (\(\pi_R\)). The agent uses only those actions that make the world transitions deterministic, and it uses them at random, i.e., it explores within the subspace of actions that render the world deterministic. This policy maximizes \(I[X_{t+1}, \{X,A\}]\), independently of the estimated value of p(X = 0).

However, when stationarity holds, i.e., p(X = 0) = p(X = 1) = 1/2, all policies for which

$$ \pi(A=0|X=1) = \pi(A=0|X=0) $$

maximize \(I[X_{t+1}, \{X,A\}]\). Those include “STAY-STAY” and “FLIP-FLIP”.
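As a quick numerical check of the argument above (a sketch, not from the article; the `predictive_power` helper and the policy encodings are our own), one can compute \(I[X_{t+1}, \{X,A\}]\) directly for policies restricted to the two deterministic-transition actions:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def predictive_power(pi, p_x):
    """I[X_{t+1}; (X, A)] in the two-state world where the action value a
    is the flip probability: p(X_{t+1} != x | x, a) = a.  The policy is
    restricted to the two extreme actions a in {0, 1} (STAY, FLIP).
    pi[x] = (prob. of STAY, prob. of FLIP) in state x."""
    joint = np.zeros((2, 2, 2))              # indices: (x, a, x_next)
    for x in (0, 1):
        for a in (0, 1):
            for xn in (0, 1):
                p_trans = a if xn != x else 1 - a
                joint[x, a, xn] = p_x[x] * pi[x][a] * p_trans
    h_next = entropy(joint.sum(axis=(0, 1)))  # H[X_{t+1}]
    h_cond = 0.0                              # H[X_{t+1} | X, A]
    for x in (0, 1):
        for a in (0, 1):
            pxa = joint[x, a].sum()
            if pxa > 0:
                h_cond += pxa * entropy(joint[x, a] / pxa)
    return h_next - h_cond

clever_random = {0: (0.5, 0.5), 1: (0.5, 0.5)}   # pi_R
stay_stay     = {0: (1.0, 0.0), 1: (1.0, 0.0)}

print(predictive_power(clever_random, np.array([0.7, 0.3])))  # 1.0 for any p(X)
print(predictive_power(stay_stay,     np.array([0.7, 0.3])))  # H(0.7) < 1
print(predictive_power(stay_stay,     np.array([0.5, 0.5])))  # 1.0 under stationarity
```

Only \(\pi_R\) attains the full bit for every estimate of p(X = 0); STAY-STAY matches it only when p(X = 0) = 1/2, in line with the stationarity argument above.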

  • Self-consistent policies.

Since α = 0, the term in the exponent of Eq. (21), for a given state x and action a, is:

$$ {\cal D}^{\pi}(x,a)= -H [ a ] + a \log\left[\frac{p(X_{t+1} = x)}{p(X_{t+1} = \bar{x})}\right] - \log [ p(X_{t+1} = x) ] $$

with \(\bar{x}\) being the opposite state, and \(H[a] = -(a\log(a) + (1-a)\log(1-a))\). Note that H[0] = H[1] = 0. The clever random policy \(\pi_R\) is self-consistent, because under this policy, for all x, both actions, STAY (a = 0) and FLIP (a = 1), are equally likely. This follows from \(p(X_{t+1} = x) = p(X_{t+1} = \bar{x}) = 1/2\), hence \({\cal D}^{\pi_R}(x,0) = {\cal D}^{\pi_R}(x,1), \forall x\). If stationarity holds, p(X = 0) = 1/2, and no policy using only actions \(a \in \{0, 1\}\) other than \(\pi_R\) is self-consistent. This is because under any other such policy we also have \(p(X_{t+1} = x) = p(X_{t+1} = \bar{x}) = 1/2\) and H[0] = H[1] = 0, and therefore \({\cal D}^{\pi}(x,0) - {\cal D}^{\pi}(x,1) = 0\). This means that the algorithm reaches \(\pi_R\) after one iteration. We conclude that \(\pi_R\) is the unique optimal self-consistent solution.
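The tie between the two extreme actions can be checked directly. The sketch below (our own helper, assuming only that the optimal policy weights actions by the exponent term \({\cal D}^{\pi}(x,a)\) given above; the full Eq. (21) is not reproduced here) evaluates that exponent at \(p(X_{t+1}=x) = p(X_{t+1}=\bar{x}) = 1/2\):

```python
import math

def D(a, p_next_x, p_next_xbar):
    """Exponent term of the appendix at alpha = 0, for the two-state world.
    For the extreme actions a in {0, 1}, the action entropy H[a] is zero."""
    h_a = 0.0 if a in (0.0, 1.0) else -(a * math.log2(a) + (1 - a) * math.log2(1 - a))
    return -h_a + a * math.log2(p_next_x / p_next_xbar) - math.log2(p_next_x)

# Under pi_R -- and under any binary-action policy with stationary p = 1/2 --
# p(X_{t+1} = x) = p(X_{t+1} = xbar) = 1/2, so the two actions tie exactly:
d_stay = D(0.0, 0.5, 0.5)
d_flip = D(1.0, 0.5, 0.5)
print(d_stay, d_flip)  # equal values -> the exponential update returns the uniform policy pi_R
```

Because the exponents are equal for STAY and FLIP, one update step maps any such policy to the uniform \(\pi_R\), which is the fixed point.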

A reliable and an unreliable state

There are two possible actions, STAY (s) or FLIP (f), and two world states, \(x \in \{0,1\}\), distinguished by the transitions: \(p(X_{t+1}=0|X_t=0, A_t=s) = p(X_{t+1}=1|X_t=0, A_t=f) = 1\), while \(p(X_{t+1}=x|X_t=1, a) = 1/2, \forall x, \forall a\). In other words, state 0 is fully reliable, and state 1 is fully unreliable, in terms of the action effects. There is no uncertainty when we start in the reliable state, and the uncertainty when starting in the unreliable state is exactly one bit. The predictive power is then given by

$$ I [ X_{t+1}, \{X,A\} ] = - \sum_{x \in \{0,1\}} p(X_{t+1} = x)\log_2 [ p(X_{t+1} = x) ] - p(X_t = 1) $$

Starting with a fixed value for \(p(X_t=1)\), estimated from past experiences, the maximum is reached by a policy that results in equiprobable futures, i.e., \(p(X_{t+1}=1) = 1/2\). We have \(p(X_{t+1}=0) = \pi(A=s|X=0)\, p(X=0) + \frac{1}{2} p(X=1)\). Therefore, \(\pi(A=s|X=0) = 1/2\), which, in turn, implies that after some time \(p(X_t=1) = 1/2\), and thus \(I[X_{t+1}, \{X,A\}] = 1/2\). However, asymptotically \(p(X_t=0) = p(X_{t+1}=0)\), and the information is given by \(-\left(p(X=0)\log_2 [ p(X=0) ] + (1-p(X=0))\log_2 [ 1-p(X=0) ] \right) - (1-p(X=0))\), i.e., the entropy of the future state minus the one bit of uncertainty incurred whenever the agent is in state 1. Setting the first derivative, \(1 - \log_2 [ p(X=0)/(1-p(X=0)) ]\), to zero shows that the extremum lies at p(X = 0) = 2/3, where the information reaches \(\log_2(3) - 1 \simeq 0.58\) bits. Now, \(p(X_{t+1}=0) = 2/3\) implies that \(\pi(A=s|X=0) = 3/4\). Asymptotically, the optimal strategy is to stay in the reliable state with probability 3/4. We conclude that the agent starts with the random strategy in state 0, i.e., \(\pi(A=s|X=0) = 1/2\), and asymptotically finds the strategy \(\pi(A=s|X=0) = 3/4\). This asymptotic strategy still allows for exploration, but it results in a more controlled environment than the purely random strategy. Note that the optimal policy in state 1 is obviously random, i.e., \(\pi(A|1) = 1/2\), because \(D_{{\rm KL}} [ p(X_{t+1}|X_t=1, A_t=s) \| p(X_{t+1}) ] = D_{{\rm KL}} [ p(X_{t+1}|X_t=1, A_t=f) \| p(X_{t+1}) ]\).
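The asymptotic argument lends itself to a numerical check (a sketch with helper names of our own): maximize the stationary predictive power \(H(p) - (1-p)\) over \(p = p(X=0)\) on a grid, then recover the STAY probability from the stationarity condition \(p = \pi(s|0)\,p + \frac{1}{2}(1-p)\).

```python
import numpy as np

def stationary_info(p):
    """Asymptotic predictive power as a function of p = p(X = 0):
    the entropy of the future state, minus the one bit of uncertainty
    incurred whenever X_t = 1 (which happens with probability 1 - p)."""
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p)) - (1 - p)

ps = np.linspace(0.01, 0.99, 98001)           # grid over p(X = 0)
p_star = ps[np.argmax(stationary_info(ps))]

# stationarity: p = pi(s|0) * p + 0.5 * (1 - p)  =>  pi(s|0) = (p - 0.5*(1-p)) / p
pi_stay = (p_star - 0.5 * (1 - p_star)) / p_star

print(p_star)                   # ~ 2/3
print(stationary_info(p_star))  # ~ log2(3) - 1 ~ 0.585 bits
print(pi_stay)                  # ~ 3/4
```

The grid maximum lands at p(X = 0) ≈ 2/3 with π(A = s|X = 0) ≈ 3/4, matching the closed-form result; note that the information stays below the one-bit ceiling that a binary future state imposes.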



Cite this article

Still, S., Precup, D. An information-theoretic approach to curiosity-driven reinforcement learning. Theory Biosci. 131, 139–148 (2012).



Keywords

  • Reinforcement learning
  • Exploration–exploitation trade-off
  • Information theory
  • Rate distortion theory
  • Curiosity
  • Adaptive behavior