An information-theoretic approach to curiosity-driven reinforcement learning

Abstract

We provide a fresh look at the problem of exploration in reinforcement learning, drawing on ideas from information theory. First, we show that Boltzmann-style exploration, one of the main exploration methods used in reinforcement learning, is optimal from an information-theoretic point of view, in that it optimally trades expected return for the coding cost of the policy. Second, we address the problem of curiosity-driven learning. We propose that, in addition to maximizing the expected return, a learner should choose a policy that also maximizes the learner’s predictive power. This makes the world both interesting and exploitable. Optimal policies then have the form of Boltzmann-style exploration with a bonus, containing a novel exploration–exploitation trade-off which emerges naturally from the proposed optimization principle. Importantly, this exploration–exploitation trade-off persists in the optimal deterministic policy, i.e., when there is no exploration due to randomness. As a result, exploration is understood as an emerging behavior that optimizes information gain, rather than being modeled as pure randomization of action choices.
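
To make the flavor of such a policy concrete, the following is a minimal Python sketch (ours, not the paper's implementation) of Boltzmann-style action selection in which a hypothetical information-gain bonus is added to the Q-values before the softmax; the function name, the toy numbers, and the particular form of the bonus are illustrative assumptions only.

    import numpy as np

    def boltzmann_policy(q_values, bonus, temperature=1.0):
        """Softmax over (Q + bonus); the bonus stands in for an
        information-gain term added to the expected return."""
        scores = (np.asarray(q_values) + np.asarray(bonus)) / temperature
        scores -= scores.max()                # for numerical stability
        weights = np.exp(scores)
        return weights / weights.sum()

    # Toy example: three actions, one carrying a large (hypothetical) bonus.
    q = [1.0, 0.9, 0.2]
    b = [0.0, 0.0, 1.5]
    for T in (2.0, 0.5, 0.05):
        print(T, boltzmann_policy(q, b, temperature=T))

As the temperature is lowered the policy becomes nearly deterministic, yet the bonus still changes which action is preferred, which is the sense in which the exploration–exploitation trade-off can persist in the deterministic limit.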

Notes

  1. We will refer to this parameter as the temperature in the rest of the article. One has to keep in mind that this is a metaphor, not a physical temperature.

  2. Here and throughout, we use capital letters to denote random variables, and small letters to denote particular realizations of these variables.

  3. If there are N actions that maximize \(Q^{\pi}(x,a)\), then those occur with probability 1/N, while all other actions occur with probability 0 (see the sketch after these notes).

  4. The assignment becomes deterministic if there are no degeneracies, otherwise all those actions occur with equal probability, as in Sect. 2.
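
As a minimal illustration of the tie-breaking rule in notes 3 and 4 (our sketch; the helper name is made up), the deterministic "greedy" assignment puts probability 1/N on each of the N maximizing actions and 0 elsewhere:

    import numpy as np

    def greedy_with_ties(q_values, tol=1e-12):
        """Probability 1/N on each of the N actions that maximize Q,
        probability 0 on all other actions."""
        q = np.asarray(q_values, dtype=float)
        best = np.isclose(q, q.max(), atol=tol)
        return best / best.sum()

    print(greedy_with_ties([0.3, 0.7, 0.7]))   # -> [0.  0.5  0.5]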


Acknowledgment

This research was funded in part by NSERC and ONR.

Author information

Correspondence to Susanne Still.

Appendix

Clever random policy

There are two world states, \(x \in \{0,1\}\), and a continuous action set, \(a \in [0,1]\). The value of the action sets how strongly the agent tries to stay in or leave a state, via \(p({\bar x}|x,a) = a\). The interest in reward is switched off (\(\alpha = 0\)), so that the optimal policy is the one that maximizes only the predictive power.

  • Policies that maximize \(I[X_{t+1}, \{X_t,A_t\}]\)

For brevity of notation, we drop the index t for the current state and action.

$$ I [ X_{t+1}, \{X,A\} ] = H [ X_{t+1} ] - H [ X_{t+1}|X,A ] $$
(26)

The second term in Eq. (26) is minimized and equal to zero for all policies that result in deterministic world transitions. Those are all policies for which \(\pi(\tilde{a}|x) = 0\) for all \(\tilde{a} \notin \{0, 1\}.\) This limits the agent to using only the two most extreme actions: \(a \in \{ 0, 1\}.\) Since we have only two states, policies in this class are determined by two probabilities, for example \(\pi(A=0|X=1)\) and \(\pi(A=1|X=0)\) (the probabilities of choosing, in each state, the action that leads to state 1).

The first term in Eq. (26) is maximized for \(p(X_{t+1} = 1) = p(X_{t+1} = 0) = 1/2\). Setting \(p(X_{t+1} = 1)\) to 1/2 yields

$$ \pi(A=0|X=1) p(X=1) + \pi(A=1|X=0) p(X=0) = \frac{1}{2}. $$
(27)

We assume that p(X = 0) is estimated by the learner. Eq. (27) holds for all values of p(X = 0) if \(\pi(A=0|X=1) = \pi(A=1|X=0) = 1/2\). We call this the “clever random” policy (\(\pi_R\)). The agent uses only those actions that make the world transitions deterministic, and uses them at random, i.e., it explores within the subspace of actions that make the world deterministic. This policy maximizes \(I [ X_{t+1}, \{X,A\} ]\), independent of the estimated value of p(X = 0).

However, when stationarity holds, i.e., p(X = 0) = p(X = 1) = 1/2, all policies for which

$$ \pi(A=0|X=1) = \pi(A=0|X=0) $$
(28)

maximize \(I[X_{t+1}, \{X,A\}]\). These include “STAY-STAY” and “FLIP-FLIP”.

  • Self-consistent policies.

Since α = 0, the term in the exponent of Eq. (21), for a given state x and action a, is:

$$ {\cal D}^{\pi}(x,a)= -H [ a ] + a \log\left[\frac{p(X_{t+1} = x)}{p(X_{t+1} = \bar{x})}\right] - \log [ p(X_{t+1} = x) ] $$
(29)

with \(\bar{x}\) being the opposite state, and \(H [ a ] = - (a\log(a) + (1-a)\log(1-a)).\) Note that H[0] = H[1] = 0. The clever random policy \(\pi_R\) is self-consistent, because under this policy, for all x, both actions, STAY (a = 0) and FLIP (a = 1), are equally likely. This is due to the fact that \(p(X_{t+1} = x) = p(X_{t+1} = \bar{x}) = 1/2\), hence \({\cal D}^{\pi_R} (x,0) = {\cal D}^{\pi_R} (x,1), \forall x.\) If stationarity holds, p(X = 0) = 1/2, then no policy that uses only actions \(a \in \{ 0, 1\}\), other than \(\pi_R\), is self-consistent. This is because under any other such policy we also have \(p(X_{t+1} = x) = p(X_{t+1} = \bar{x}) = 1/2\) and H[0] = H[1] = 0, and therefore \({\cal D}^{\pi} (x,0) - {\cal D}^{\pi} (x,1) = 0\); the self-consistent assignment then puts equal probability on both actions, which is exactly \(\pi_R\). This means that the algorithm reaches \(\pi_R\) after one iteration. We conclude that \(\pi_R\) is the unique optimal self-consistent solution.
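
As a sanity check on this example, here is a small Python sketch (ours, not part of the paper) that computes the predictive power \(I[X_{t+1}, \{X,A\}]\) for policies restricted to the two extreme actions. Since such policies make the transitions deterministic, the conditional entropy vanishes and the predictive power equals \(H[X_{t+1}]\); the sketch confirms that the clever random policy attains the full 1 bit for an arbitrary estimated p(X = 0), while, e.g., “STAY-STAY” does so only when p(X = 0) = 1/2.

    import numpy as np
    from itertools import product

    def entropy(p):
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    def predictive_power(pi, p_x):
        """I[X_{t+1}; {X, A}] for the two-state world with a in {0, 1} and
        p(flip | x, a) = a; the transitions are deterministic, so the
        conditional entropy term is zero."""
        p_next = np.zeros(2)
        for x, a in product((0, 1), (0, 1)):
            p_next[x ^ a] += p_x[x] * pi[x][a]   # x ^ a: a = 1 flips the state
        return entropy(p_next)                   # = H[X_{t+1}] - 0

    p_x = [0.8, 0.2]                                 # arbitrary estimated p(X=0), p(X=1)
    clever_random = {0: [0.5, 0.5], 1: [0.5, 0.5]}   # pi(a | x)
    stay_stay     = {0: [1.0, 0.0], 1: [1.0, 0.0]}
    print(predictive_power(clever_random, p_x))  # 1.0 bit, for any p_x
    print(predictive_power(stay_stay, p_x))      # ~0.72 bit; reaches 1.0 only for uniform p_x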

A reliable and an unreliable state

There are two possible actions, STAY (s) or FLIP (f), and two world states, \(x \in \{0,1\}\), distinguished by the transitions: \(p(X_{t+1}=0|X_t=0, A_t=s) = p(X_{t+1}=1|X_t=0, A_t=f) = 1\), while \(p(X_{t+1}=x|X_t=1,a) = 1/2, \forall x, \forall a.\) In other words, state 0 is fully reliable, and state 1 is fully unreliable, in terms of the action effects. There is no uncertainty when we start in the reliable state, and the uncertainty when starting in the unreliable state is exactly one bit. The predictive power is then given by

$$ I [ X_{t+1}, \{X,A\} ] = - \sum_{x \in \{0,1\}} p(X_{t+1} = x)\log_2 [ p(X_{t+1} = x) ] - p(X_t = 1) $$
(30)

Starting with a fixed value for \(p(X_t = 1)\), which is estimated from past experience, the maximum is reached by a policy that results in equiprobable futures, i.e., \(p(X_{t+1} = 1) = 1/2\). We have \(p(X_{t+1}=0) = \pi(A=s|X=0) p(X=0) + \frac{1}{2} p(X=1).\) Setting \(p(X_{t+1}=0)\) to 1/2 therefore implies that π(A = s|X = 0) = 1/2, which, in turn, implies that after some time \(p(X_t = 1) = 1/2\), and thus \(I[X_{t+1}, \{X,A\}] = 1/2\) bit. However, asymptotically, \(p(X_t = 0) = p(X_{t+1} = 0)\), and the information is given by \(-p(X=0) \log_2 [ p(X=0)/(1-p(X=0)) ] - \log_2 [ 1-p(X=0) ] + p(X=0) - 1.\) Setting the first derivative, \(1-\log_2 [ p(X=0)/(1-p(X=0)) ]\), to zero implies that the extremum lies at p(X = 0) = 2/3, where the information reaches \(\log_2(3) - 1 \simeq 0.58\) bits. Now, \(p(X_{t+1} = 0) = 2/3\) implies that π(A = s|X = 0) = 3/4. Asymptotically, the optimal strategy is therefore to stay in the reliable state with probability 3/4. We conclude that the agent starts with the random strategy in state 0, i.e., π(A = s|X = 0) = 1/2, and asymptotically finds the strategy π(A = s|X = 0) = 3/4. This asymptotic strategy still allows for exploration, but it results in a more controlled environment than the purely random strategy. Note that the optimal policy in state 1 is obviously random, i.e., \(\pi(a|X=1) = 1/2\) for both actions, because \(D_{{\rm KL}} [ p(X_{t+1}|X_t=1, A_t=s) || p(X_{t+1}) ] = D_{{\rm KL}} [ p(X_{t+1}|X_t=1, A_t=f) || p(X_{t+1}) ]\).
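
The asymptotic optimum can be checked numerically. The short Python sketch below (ours, written under the stationarity assumption above) scans π(A = s|X = 0), computes the induced stationary p(X = 0) and the resulting predictive power, and recovers the optimum near 3/4 with roughly \(\log_2(3) - 1 \approx 0.58\) bits.

    import numpy as np

    def binary_entropy(q):
        if q in (0.0, 1.0):
            return 0.0
        return -(q * np.log2(q) + (1 - q) * np.log2(1 - q))

    def asymptotic_predictive_power(pi_stay):
        """Predictive power at the stationary distribution induced by
        pi(A=s | X=0) = pi_stay; state 1 always transitions at random."""
        # Stationarity: q = pi_stay * q + 0.5 * (1 - q)  =>  q = 0.5 / (1.5 - pi_stay)
        q = 0.5 / (1.5 - pi_stay)                 # q = p(X = 0)
        return binary_entropy(q) - (1 - q)        # H[X_{t+1}] - H[X_{t+1} | X, A]

    grid = np.linspace(0.0, 1.0, 100001)
    values = [asymptotic_predictive_power(p) for p in grid]
    best = grid[int(np.argmax(values))]
    print(best, max(values), np.log2(3) - 1)      # ~0.75  ~0.585  0.585...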

Cite this article

Still, S., Precup, D. An information-theoretic approach to curiosity-driven reinforcement learning. Theory Biosci. 131, 139–148 (2012). https://doi.org/10.1007/s12064-011-0142-z