An information-theoretic approach to curiosity-driven reinforcement learning

Abstract

We provide a fresh look at the problem of exploration in reinforcement learning, drawing on ideas from information theory. First, we show that Boltzmann-style exploration, one of the main exploration methods used in reinforcement learning, is optimal from an information-theoretic point of view, in that it optimally trades expected return for the coding cost of the policy. Second, we address the problem of curiosity-driven learning. We propose that, in addition to maximizing the expected return, a learner should choose a policy that also maximizes the learner’s predictive power. This makes the world both interesting and exploitable. Optimal policies then have the form of Boltzmann-style exploration with a bonus, containing a novel exploration–exploitation trade-off which emerges naturally from the proposed optimization principle. Importantly, this exploration–exploitation trade-off persists in the optimal deterministic policy, i.e., when there is no exploration due to randomness. As a result, exploration is understood as an emerging behavior that optimizes information gain, rather than being modeled as pure randomization of action choices.
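
To make the flavor of such a policy concrete, the following is a minimal Python sketch (ours, not the paper's implementation) of Boltzmann-style action selection in which a hypothetical information-gain bonus is added to the Q-values before the softmax; the function name, the toy numbers, and the particular form of the bonus are illustrative assumptions only.

    import numpy as np

    def boltzmann_policy(q_values, bonus, temperature=1.0):
        """Softmax over (Q + bonus); the bonus stands in for an
        information-gain term added to the expected return."""
        scores = (np.asarray(q_values) + np.asarray(bonus)) / temperature
        scores -= scores.max()                # for numerical stability
        weights = np.exp(scores)
        return weights / weights.sum()

    # Toy example: three actions, one carrying a large (hypothetical) bonus.
    q = [1.0, 0.9, 0.2]
    b = [0.0, 0.0, 1.5]
    for T in (2.0, 0.5, 0.05):
        print(T, boltzmann_policy(q, b, temperature=T))

As the temperature is lowered the policy becomes nearly deterministic, yet the bonus still changes which action is preferred, which is the sense in which the exploration–exploitation trade-off can persist in the deterministic limit.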

Notes

  1. We will refer to this parameter as the temperature in the rest of the article. One has to keep in mind that this is a metaphor, not a physical temperature.

  2. Here and throughout, we use capital letters to denote random variables, and small letters to denote particular realizations of these variables.

  3. If there are N actions that maximize \(Q^{\pi}(x,a)\), then those occur with probability 1/N, while all other actions occur with probability 0 (see the sketch after these notes).

  4. The assignment becomes deterministic if there are no degeneracies, otherwise all those actions occur with equal probability, as in Sect. 2.
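
As a minimal illustration of the tie-breaking rule in notes 3 and 4 (our sketch; the helper name is made up), the deterministic "greedy" assignment puts probability 1/N on each of the N maximizing actions and 0 elsewhere:

    import numpy as np

    def greedy_with_ties(q_values, tol=1e-12):
        """Probability 1/N on each of the N actions that maximize Q,
        probability 0 on all other actions."""
        q = np.asarray(q_values, dtype=float)
        best = np.isclose(q, q.max(), atol=tol)
        return best / best.sum()

    print(greedy_with_ties([0.3, 0.7, 0.7]))   # -> [0.  0.5  0.5]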


Acknowledgment

This research was funded in part by NSERC and ONR.

Author information

Correspondence to Susanne Still.

Appendix

Clever random policy

There are two world states, \(x \in \{0,1\}\), and a continuous action set, \(a \in [0,1]\). The value of the action sets how strongly the agent tries to stay in or leave a state, via \(p({\bar x}|x,a) = a\). The interest in reward is switched off (\(\alpha = 0\)), so that the optimal policy is the one that maximizes only the predictive power.

  • Policies that maximize \(I[X_{t+1}, \{X_t,A_t\}]\)

For brevity of notation, we drop the index t for the current state and action.

$$ I [ X_{t+1}, \{X,A\} ] = H [ X_{t+1} ] - H [ X_{t+1}|X,A ] $$
(26)

The second term in Eq. (26) is minimized and equal to zero for all policies that result in deterministic world transitions. Those are all policies for which \(\pi(\tilde{a}|x) = 0\) for all \(\tilde{a} \notin \{0, 1\}.\) This limits the agent to using only the two most extreme actions: \(a \in \{ 0, 1\}.\) Since we have only two states, policies in this class are determined by two probabilities, for example \(\pi(A=0|X=1)\) and \(\pi(A=1|X=0)\) (the probabilities of choosing, in each state, the action that leads to state 1).

The first term in Eq. (26) is maximized for \(p(X_{t+1} = 1) = p(X_{t+1} = 0) = 1/2\). Setting \(p(X_{t+1} = 1)\) to 1/2 yields

$$ \pi(A=0|X=1) p(X=1) + \pi(A=1|X=0) p(X=0) = \frac{1}{2}. $$
(27)

We assume that p(X = 0) is estimated by the learner. Eq. (27) holds for all values of p(X = 0) if \(\pi(A=0|X=1) = \pi(A=1|X=0) = 1/2\). We call this the “clever random” policy (\(\pi_R\)). The agent uses only those actions that make the world transitions deterministic, and uses them at random, i.e., it explores within the subspace of actions that make the world deterministic. This policy maximizes \(I [ X_{t+1}, \{X,A\} ]\), independent of the estimated value of p(X = 0).

However, when stationarity holds, i.e., p(X = 0) = p(X = 1) = 1/2, all policies for which

$$ \pi(A=0|X=1) = \pi(A=0|X=0) $$
(28)

maximize \(I[X_{t+1}, \{X,A\}]\). These include “STAY-STAY” and “FLIP-FLIP”.

  • Self-consistent policies.

Since α = 0, the term in the exponent of Eq. (21), for a given state x and action a, is:

$$ {\cal D}^{\pi}(x,a)= -H [ a ] + a \log\left[\frac{p(X_{t+1} = x)}{p(X_{t+1} = \bar{x})}\right] - \log [ p(X_{t+1} = x) ] $$
(29)

with \(\bar{x}\) being the opposite state, and \(H [ a ] = - (a\log(a) + (1-a)\log(1-a)).\) Note that H[0] = H[1] = 0. The clever random policy \(\pi_R\) is self-consistent, because under this policy, for all x, both actions, STAY (a = 0) and FLIP (a = 1), are equally likely. This is due to the fact that \(p(X_{t+1} = x) = p(X_{t+1} = \bar{x}) = 1/2\), hence \({\cal D}^{\pi_R} (x,0) = {\cal D}^{\pi_R} (x,1), \forall x.\) If stationarity holds, p(X = 0) = 1/2, then no policy that uses only actions \(a \in \{ 0, 1\}\), other than \(\pi_R\), is self-consistent. This is because under any other such policy we also have \(p(X_{t+1} = x) = p(X_{t+1} = \bar{x}) = 1/2\) and H[0] = H[1] = 0, and therefore \({\cal D}^{\pi} (x,0) - {\cal D}^{\pi} (x,1) = 0\); the self-consistent assignment then puts equal probability on both actions, which is exactly \(\pi_R\). This means that the algorithm reaches \(\pi_R\) after one iteration. We conclude that \(\pi_R\) is the unique optimal self-consistent solution.
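
As a sanity check on this example, here is a small Python sketch (ours, not part of the paper) that computes the predictive power \(I[X_{t+1}, \{X,A\}]\) for policies restricted to the two extreme actions. Since such policies make the transitions deterministic, the conditional entropy vanishes and the predictive power equals \(H[X_{t+1}]\); the sketch confirms that the clever random policy attains the full 1 bit for an arbitrary estimated p(X = 0), while, e.g., “STAY-STAY” does so only when p(X = 0) = 1/2.

    import numpy as np
    from itertools import product

    def entropy(p):
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    def predictive_power(pi, p_x):
        """I[X_{t+1}; {X, A}] for the two-state world with a in {0, 1} and
        p(flip | x, a) = a; the transitions are deterministic, so the
        conditional entropy term is zero."""
        p_next = np.zeros(2)
        for x, a in product((0, 1), (0, 1)):
            p_next[x ^ a] += p_x[x] * pi[x][a]   # x ^ a: a = 1 flips the state
        return entropy(p_next)                   # = H[X_{t+1}] - 0

    p_x = [0.8, 0.2]                                 # arbitrary estimated p(X=0), p(X=1)
    clever_random = {0: [0.5, 0.5], 1: [0.5, 0.5]}   # pi(a | x)
    stay_stay     = {0: [1.0, 0.0], 1: [1.0, 0.0]}
    print(predictive_power(clever_random, p_x))  # 1.0 bit, for any p_x
    print(predictive_power(stay_stay, p_x))      # ~0.72 bit; reaches 1.0 only for uniform p_x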

A reliable and an unreliable state

There are two possible actions, STAY (s) or FLIP (f), and two world states, \(x \in \{0,1\}\), distinguished by the transitions: \(p(X_{t+1}=0|X_t=0, A_t=s) = p(X_{t+1}=1|X_t=0, A_t=f) = 1\), while \(p(X_{t+1}=x|X_t=1,a) = 1/2, \forall x, \forall a.\) In other words, state 0 is fully reliable, and state 1 is fully unreliable, in terms of the action effects. There is no uncertainty when we start in the reliable state, and the uncertainty when starting in the unreliable state is exactly one bit. The predictive power is then given by

$$ I [ X_{t+1}, \{X,A\} ] = - \sum_{x \in \{0,1\}} p(X_{t+1} = x)\log_2 [ p(X_{t+1} = x) ] - p(X_t = 1) $$
(30)

Starting with a fixed value for \(p(X_t = 1)\), which is estimated from past experience, the maximum is reached by a policy that results in equiprobable futures, i.e., \(p(X_{t+1} = 1) = 1/2\). We have \(p(X_{t+1}=0) = \pi(A=s|X=0) p(X=0) + \frac{1}{2} p(X=1).\) Setting \(p(X_{t+1}=0)\) to 1/2 therefore implies that π(A = s|X = 0) = 1/2, which, in turn, implies that after some time \(p(X_t = 1) = 1/2\), and thus \(I[X_{t+1}, \{X,A\}] = 1/2\) bit. However, asymptotically, \(p(X_t = 0) = p(X_{t+1} = 0)\), and the information is given by \(-p(X=0) \log_2 [ p(X=0)/(1-p(X=0)) ] - \log_2 [ 1-p(X=0) ] + p(X=0) - 1.\) Setting the first derivative, \(1-\log_2 [ p(X=0)/(1-p(X=0)) ]\), to zero implies that the extremum lies at p(X = 0) = 2/3, where the information reaches \(\log_2(3) - 1 \simeq 0.58\) bits. Now, \(p(X_{t+1} = 0) = 2/3\) implies that π(A = s|X = 0) = 3/4. Asymptotically, the optimal strategy is therefore to stay in the reliable state with probability 3/4. We conclude that the agent starts with the random strategy in state 0, i.e., π(A = s|X = 0) = 1/2, and asymptotically finds the strategy π(A = s|X = 0) = 3/4. This asymptotic strategy still allows for exploration, but it results in a more controlled environment than the purely random strategy. Note that the optimal policy in state 1 is obviously random, i.e., \(\pi(a|X=1) = 1/2\) for both actions, because \(D_{{\rm KL}} [ p(X_{t+1}|X_t=1, A_t=s) || p(X_{t+1}) ] = D_{{\rm KL}} [ p(X_{t+1}|X_t=1, A_t=f) || p(X_{t+1}) ]\).
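
The asymptotic optimum can be checked numerically. The short Python sketch below (ours, written under the stationarity assumption above) scans π(A = s|X = 0), computes the induced stationary p(X = 0) and the resulting predictive power, and recovers the optimum near 3/4 with roughly \(\log_2(3) - 1 \approx 0.58\) bits.

    import numpy as np

    def binary_entropy(q):
        if q in (0.0, 1.0):
            return 0.0
        return -(q * np.log2(q) + (1 - q) * np.log2(1 - q))

    def asymptotic_predictive_power(pi_stay):
        """Predictive power at the stationary distribution induced by
        pi(A=s | X=0) = pi_stay; state 1 always transitions at random."""
        # Stationarity: q = pi_stay * q + 0.5 * (1 - q)  =>  q = 0.5 / (1.5 - pi_stay)
        q = 0.5 / (1.5 - pi_stay)                 # q = p(X = 0)
        return binary_entropy(q) - (1 - q)        # H[X_{t+1}] - H[X_{t+1} | X, A]

    grid = np.linspace(0.0, 1.0, 100001)
    values = [asymptotic_predictive_power(p) for p in grid]
    best = grid[int(np.argmax(values))]
    print(best, max(values), np.log2(3) - 1)      # ~0.75  ~0.585  0.585...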

Cite this article

Still, S., Precup, D. An information-theoretic approach to curiosity-driven reinforcement learning. Theory Biosci. 131, 139–148 (2012). https://doi.org/10.1007/s12064-011-0142-z