An Exploration Strategy for RL with Considerations of Budget and Risk

  • Jonathan Serrano Cuevas
  • Eduardo Morales Manzanares
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10267)


Reinforcement Learning (RL) algorithms create a mapping from states to actions in order to maximize an expected reward and derive an optimal policy. However, traditional learning algorithms rarely consider that learning has an associated cost and that the resources available for learning may be limited. We can therefore think of learning under a limited budget. If we are developing a learning algorithm for an agent, e.g. a robot, we should consider that it may have a limited amount of battery; if we do the same for a finance broker, it will have a limited amount of money. Both examples require planning according to a limited budget. Another important concept related to budget-aware reinforcement learning is the risk profile, which describes how risk-averse the agent is. The risk profile can be used as an input to the learning algorithm so that different policies can be learned according to how much risk the agent is willing to expose itself to. This paper describes a new strategy that incorporates the agent’s risk profile as an input to the learning framework by using reward shaping. The paper also studies the effect of a constrained budget on RL and shows that, under such restrictions, RL algorithms can be forced to make more efficient use of the available resources. The experiments show that, although it is possible to learn on a constrained budget, the learning process becomes slow when the budget is low. They also show that the reward shaping process is able to guide the agent to learn a less risky policy.
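The two ideas in the abstract, a learning budget and a risk profile injected through reward shaping, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the grid world, the hazard cells, the function names (`potential`, `shaped_reward`, `train`), the choice of a potential-based shaping term, and the interpretation of the budget as a cap on the total number of environment steps.

```python
import random

GAMMA = 0.95
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]


def manhattan(a, b):
    """Manhattan distance between two grid cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def potential(state, hazards, risk_aversion):
    """Hypothetical potential: more negative the closer a state is to a hazard.

    risk_aversion stands in for the risk profile (0 means risk-neutral).
    """
    if not hazards:
        return 0.0
    nearest = min(manhattan(state, h) for h in hazards)
    return -risk_aversion / (1.0 + nearest)


def shaped_reward(reward, state, next_state, hazards, risk_aversion, gamma=GAMMA):
    """Environment reward plus a potential-based shaping term."""
    return (reward
            + gamma * potential(next_state, hazards, risk_aversion)
            - potential(state, hazards, risk_aversion))


def step(state, action, size, goal, hazards):
    """Assumed toy dynamics: deterministic moves clipped to the grid."""
    nxt = (min(max(state[0] + action[0], 0), size - 1),
           min(max(state[1] + action[1], 0), size - 1))
    if nxt in hazards:
        return nxt, -10.0, True
    if nxt == goal:
        return nxt, 10.0, True
    return nxt, -1.0, False


def train(budget, risk_aversion, size=4, goal=(3, 3),
          hazards=((1, 1), (2, 1)), alpha=0.5, epsilon=0.1, seed=0):
    """Q-learning that stops once `budget` environment steps are spent."""
    rng = random.Random(seed)
    q = {}
    spent = 0
    state = (0, 0)
    while spent < budget:
        # Epsilon-greedy action selection over the tabular Q-values.
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q.get((state, a), 0.0))
        nxt, reward, done = step(state, action, size, goal, hazards)
        spent += 1  # every interaction consumes one unit of budget
        r = shaped_reward(reward, state, nxt, hazards, risk_aversion)
        best_next = 0.0 if done else max(q.get((nxt, a), 0.0) for a in ACTIONS)
        old = q.get((state, action), 0.0)
        q[(state, action)] = old + alpha * (r + GAMMA * best_next - old)
        state = (0, 0) if done else nxt
    return q, spent
```

Calling `train(budget=200, risk_aversion=2.0)` and `train(budget=200, risk_aversion=0.0)` would give two Q-tables learned from the same number of interactions under different assumed risk profiles. Shrinking `budget` is a simple way to see how little can be learned from few steps.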


Reinforcement learning · Risk · Budget · Reward shaping



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Department of Computer Science, Instituto Nacional de Astrofísica, Óptica y Electrónica, Puebla, Mexico
