Abstract
We provide a fresh look at the problem of exploration in reinforcement learning, drawing on ideas from information theory. First, we show that Boltzmann-style exploration, one of the main exploration methods used in reinforcement learning, is optimal from an information-theoretic point of view, in that it optimally trades expected return for the coding cost of the policy. Second, we address the problem of curiosity-driven learning. We propose that, in addition to maximizing the expected return, a learner should choose a policy that also maximizes the learner’s predictive power. This makes the world both interesting and exploitable. Optimal policies then have the form of Boltzmann-style exploration with a bonus, containing a novel exploration–exploitation trade-off which emerges naturally from the proposed optimization principle. Importantly, this exploration–exploitation trade-off persists in the optimal deterministic policy, i.e., when there is no exploration due to randomness. As a result, exploration is understood as an emergent behavior that optimizes information gain, rather than being modeled as pure randomization of action choices.
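The Boltzmann-style exploration discussed above assigns action probabilities proportional to \(\exp(Q^{\pi}(x,a)/T)\), with T the temperature parameter. A minimal sketch of such a softmax policy (the function name and example values are ours, for illustration only):

```python
import numpy as np

def boltzmann_policy(q_values, temperature):
    """pi(a|x) proportional to exp(Q(x,a)/T); T -> 0 recovers the greedy policy."""
    # Shift by the maximum before exponentiating, for numerical stability.
    z = (np.asarray(q_values, dtype=float) - np.max(q_values)) / temperature
    w = np.exp(z)
    return w / w.sum()

# High temperature: nearly uniform. Low temperature: nearly greedy.
print(boltzmann_policy([1.0, 2.0, 0.5], temperature=100.0))
print(boltzmann_policy([1.0, 2.0, 0.5], temperature=0.01))
```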
Notes
 1.
We will refer to this parameter as the temperature in the rest of the article. One has to keep in mind that this is a metaphor, not a physical temperature.
 2.
Here and throughout, we use capital letters to denote random variables, and small letters to denote particular realizations of these variables.
 3.
If there are N actions that maximize \(Q^{\pi}(x, a)\), then those occur with probability 1/N, while all other actions occur with probability 0.
 4.
The assignment becomes deterministic if there are no degeneracies, otherwise all those actions occur with equal probability, as in Sect. 2.
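Notes 3 and 4 both describe the zero-temperature limit, in which ties among maximizing actions are broken uniformly. A minimal sketch of this tie-breaking rule (the function name is ours, for illustration):

```python
import numpy as np

def greedy_with_ties(q_values):
    """Zero-temperature limit of the Boltzmann policy: the N maximizing
    actions each get probability 1/N; all other actions get probability 0."""
    q = np.asarray(q_values, dtype=float)
    best = np.isclose(q, q.max())
    return best / best.sum()

# Two actions tie for the maximum, so each occurs with probability 1/2.
print(greedy_with_ties([0.3, 0.9, 0.9, 0.1]))
```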
Acknowledgment
This research was funded in part by NSERC and ONR.
Appendix
Clever random policy
There are two world states, \(x \in \{0,1\},\) and a continuous action set, \(a \in [0,1].\) The value of the action sets how strongly the agent tries to stay in or leave a state: \(p(\bar{x} \mid x, a) = a,\) where \(\bar{x}\) denotes the state other than x. The interest in reward is switched off (\(\alpha = 0\)), so that the optimal action becomes the one that maximizes predictive power alone.

Policies that maximize \(I[X_{t+1}, \{X_t, A_t\}]\)
For brevity of notation, we drop the index t for the current state and action.
The second term in (24) is minimized and equal to zero for all policies that result in deterministic world transitions. Those are all policies for which \(\pi(\tilde{a} \mid x) = 0\) for all \(\tilde{a} \notin \{0, 1\}.\) This limits the agent to using only the two most extreme actions: \(a \in \{0, 1\}.\) Since we have only two states, policies in this class are determined by two probabilities, for example the flip probabilities π(A = 0 | X = 1) and π(A = 1 | X = 0).
The first term in Eq. (26) is maximized for p(X_{t+1} = 1) = p(X_{t+1} = 0) = 1/2. Setting p(X_{t+1} = 1) to 1/2 yields
We assume that p(X = 0) is estimated by the learner. Equation 27 is true for all values of p(X = 0) if π(A = 0 | X = 1) = π(A = 1 | X = 0) = 1/2. We call this the “clever random” policy (π_R). The agent uses only those actions that make the world transitions deterministic, and uses them at random, i.e., it explores within the subspace of actions that make the world deterministic. This policy maximizes \(I[X_{t+1}, \{X,A\}],\) independent of the estimated value of p(X = 0).
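This can be checked directly: once the agent restricts itself to the deterministic actions \(a \in \{0, 1\},\) the conditional entropy vanishes and the predictive information reduces to the entropy of the next-state distribution, which the clever random policy pins at one bit for every estimated p(X = 0). A small numerical check (the function and parameter names are ours, and we take a = 1 to flip the state, a = 0 to keep it):

```python
import numpy as np

def predictive_info(flip_given_0, flip_given_1, p0):
    """I[X_{t+1}; {X_t, A_t}] for the two-state world, with actions
    restricted to a in {0, 1} so that transitions are deterministic.
    Then H[X_{t+1} | X_t, A_t] = 0 and I equals the entropy H[X_{t+1}]."""
    p_next_1 = p0 * flip_given_0 + (1.0 - p0) * (1.0 - flip_given_1)
    p = np.array([1.0 - p_next_1, p_next_1])
    p = p[p > 0]                      # avoid log(0)
    return float(-(p * np.log2(p)).sum())

# Clever random policy (both flip probabilities 1/2): one bit,
# regardless of the estimated p(X = 0) ...
print(predictive_info(0.5, 0.5, p0=0.3))
# ... whereas e.g. a pure STAY policy reaches one bit only if p(X = 0) = 1/2.
print(predictive_info(0.0, 0.0, p0=0.3))
```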
However, when stationarity holds, i.e., p(X = 0) = p(X = 1) = 1/2, all policies for which
maximize \(I[X_{t+1}, \{X, A\}]\). Those include “STAY-STAY” and “FLIP-FLIP”.

Self-consistent policies
Since α = 0, the term in the exponent of Eq. (21), for a given state x and action a, is:
with \(\bar{x}\) being the opposite state, and \(H[a] = -(a\log(a) + (1-a)\log(1-a)).\) Note that H[0] = H[1] = 0. The clever random policy π_R is self-consistent, because under this policy, for all x, both actions, STAY (a = 0) and FLIP (a = 1), are equally likely. This is due to the fact that \(p(X_{t+1} = x) = p(X_{t+1} = \bar{x}) = 1/2,\) hence \(D^{\pi_R}(x,0) = D^{\pi_R}(x,1), \forall x.\) If stationarity holds, p(X = 0) = 1/2, and no policy which uses only actions \(a \in \{0, 1\},\) other than policy π_R, is self-consistent. This is because under any other such policy we also have \(p(X_{t+1} = x) = p(X_{t+1} = \bar{x}) = 1/2,\) and H[0] = H[1] = 0, and therefore \(D^{\pi}(x,0) - D^{\pi}(x,1) = 0.\) This means that the algorithm reaches π_R after one iteration. We conclude that π_R is the unique optimal self-consistent solution.
A reliable and an unreliable state
There are two possible actions, STAY (s) or FLIP (f), and two world states, \(x \in \{0,1\},\) distinguished by the transitions: \(p(X_{t+1} = 0 \mid X_t = 0, A_t = s) = p(X_{t+1} = 1 \mid X_t = 0, A_t = f) = 1,\) while \(p(X_{t+1} = x \mid X_t = 1, a) = 1/2, \forall x, \forall a.\) In other words, state 0 is fully reliable, and state 1 is fully unreliable, in terms of the action effects. There is no uncertainty when we start in the reliable state, and the uncertainty when starting in the unreliable state is exactly one bit. The predictive power is then given by
Starting with a fixed value for p(X_t = 1), estimated from past experiences, the maximum is reached by a policy that results in equiprobable futures, i.e., p(X_{t+1} = 1) = 1/2. We have \(p(X_{t+1} = 0) = \pi(A = s \mid X = 0)\, p(X = 0) + \frac{1}{2}\, p(X = 1).\) This implies that π(A = s | X = 0) = 1/2, which, in turn, implies that after some time p(X_t = 1) = 1/2, and thus I[X_{t+1}, {X, A}] = 1/2. However, asymptotically, p(X_t = 0) = p(X_{t+1} = 0), and the information is given by \(-\left( p(X=0) \log_2\left[ \frac{p(X=0)}{1-p(X=0)} \right] + \log_2\left[ 1-p(X=0) \right] \right) + p(X=0) - 1.\) Setting the first derivative, \(1 - \log_2\left[ \frac{p(X=0)}{1-p(X=0)} \right],\) to zero implies that the extremum lies at p(X = 0) = 2/3, where the information reaches \(\log_2(3) - 1 \simeq 0.58\) bits. Now, p(X_{t+1} = 0) = 2/3 implies that π(A = s | X = 0) = 3/4. Asymptotically, the optimal strategy is to stay in the reliable state with probability 3/4. We conclude that the agent starts with the random strategy in state 0, i.e., π(A = s | X = 0) = 1/2, and asymptotically finds the strategy π(A = s | X = 0) = 3/4. This asymptotic strategy still allows for exploration, but it results in a more controlled environment than the purely random strategy. Note that the optimal policy in state 1 is random, i.e., π(A = a | X = 1) = 1/2 for both actions, because \(D_{\rm KL}\left[ p(X_{t+1} \mid X_t = 1, A_t = s) \,\|\, p(X_{t+1}) \right] = D_{\rm KL}\left[ p(X_{t+1} \mid X_t = 1, A_t = f) \,\|\, p(X_{t+1}) \right].\)
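The asymptotic solution above can be verified numerically by maximizing \(I(q) = H[q] - (1 - q)\) over \(q = p(X = 0)\), where H is the binary entropy in bits: the next-state entropy minus the one bit of unpredictability paid whenever the agent is in the unreliable state. A quick check (code and names are ours, for illustration):

```python
import numpy as np

def info(q):
    """I(q) = H[q] - (1 - q), with q = p(X = 0)."""
    h = -(q * np.log2(q) + (1.0 - q) * np.log2(1.0 - q))
    return h - (1.0 - q)

qs = np.linspace(0.01, 0.99, 9801)        # grid over q = p(X = 0)
q_star = qs[np.argmax(info(qs))]          # maximizer, close to 2/3
# Stationarity, p(X_{t+1} = 0) = q, fixes the stay probability in state 0:
# q = pi * q + (1 - q) / 2  =>  pi = (q - (1 - q) / 2) / q, close to 3/4.
pi_stay = (q_star - 0.5 * (1.0 - q_star)) / q_star
print(q_star, pi_stay)
```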
Still, S., Precup, D. An information-theoretic approach to curiosity-driven reinforcement learning. Theory Biosci. 131, 139–148 (2012). https://doi.org/10.1007/s12064-011-0142-z
Keywords
 Reinforcement learning
 Exploration–exploitation trade-off
 Information theory
 Rate distortion theory
 Curiosity
 Adaptive behavior