Adaptive Exploration Using Stochastic Neurons

  • Michel Tokic
  • Günther Palm
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7553)


Stochastic neurons are deployed for efficient adaptation of exploration parameters by gradient-following algorithms. The approach is evaluated in model-free temporal-difference learning using discrete actions. The advantage is in particular memory efficiency, because memorizing exploratory data is only required for starting states. Hence, if a learning problem consist of only one starting state, exploratory data can be considered as being global. Results suggest that the presented approach can be efficiently combined with standard off- and on-policy algorithms such as Q-learning and Sarsa.


reinforcement learning exploration/exploitation 


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. 1.
    Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)Google Scholar
  2. 2.
    Wiering, M.: Explorations in Efficient Reinforcement Learning. PhD thesis, University of Amsterdam, Amsterdam (1999)Google Scholar
  3. 3.
    Thrun, S.B.: Efficient exploration in reinforcement learning. Technical Report CMU-CS-92-102, Carnegie Mellon University, Pittsburgh, USA (1992)Google Scholar
  4. 4.
    Auer, P.: Using confidence bounds for exploitation-exploration trade-offs. The Journal of Machine Learning Research 3, 397–422 (2002)MathSciNetGoogle Scholar
  5. 5.
    van Eck, N.J., van Wezel, M.: Application of reinforcement learning to the game of Othello. Computers and Operations Research 35, 1999–2017 (2008)CrossRefzbMATHMathSciNetGoogle Scholar
  6. 6.
    Faußer, S., Schwenker, F.: Learning a strategy with neural approximated temporal-difference methods in english draughts. In: Proceedings of the 20th International Conference on Pattern Recognition, ICPR 2010, pp. 2925–2928. IEEE Computer Society (2010)Google Scholar
  7. 7.
    Rummery, G.A., Niranjan, M.: On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University (1994)Google Scholar
  8. 8.
    Daw, N.D., O’Doherty, J.P., Dayan, P., Seymour, B., Dolan, R.J.: Cortical substrates for exploratory decisions in humans. Nature 441(7095), 876–879 (2006)CrossRefGoogle Scholar
  9. 9.
    Williams, R.J.: Simple statistical Gradient-Following algorithms for connectionist reinforcement learning. Machine Learning 8, 229–256 (1992)zbMATHGoogle Scholar
  10. 10.
    Watkins, C.: Learning from Delayed Rewards. PhD thesis, University of Cambridge, England (1989)Google Scholar
  11. 11.
    Grzes, M., Kudenko, D.: Online learning of shaping rewards in reinforcement learning. Neural Networks 23(4), 541–550 (2010)CrossRefGoogle Scholar
  12. 12.
    Tokic, M., Palm, G.: Value-Difference Based Exploration: Adaptive Control between Epsilon-Greedy and Softmax. In: Bach, J., Edelkamp, S. (eds.) KI 2011. LNCS, vol. 7006, pp. 335–346. Springer, Heidelberg (2011)Google Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Michel Tokic
    • 1
    • 2
  • Günther Palm
    • 1
  1. 1.Institute of Neural Information ProcessingUniversity of UlmGermany
  2. 2.Institute of Applied ResearchUniversity of Applied Sciences Ravensburg-WeingartenGermany

Personalised recommendations