A Method for the Online Construction of the Set of States of a Markov Decision Process Using Answer Set Programming

  • Leonardo Anjoletto Ferreira
  • Reinaldo A. C. Bianchi
  • Paulo E. Santos
  • Ramon Lopez de Mantaras
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10868)


Non-stationary domains, which change in unpredictable ways, are a challenge for agents searching for optimal policies in sequential decision-making problems. This paper presents a combination of Markov Decision Processes (MDP) with Answer Set Programming (ASP), named Online ASP for MDP (oASP(MDP)), a method that constructs the set of domain states while the agent interacts with a changing environment. oASP(MDP) updates previously obtained policies, learnt by means of Reinforcement Learning (RL), using rules that represent the domain changes observed by the agent. These rules represent a set of domain constraints that are processed as ASP programs, reducing the search space. Results show that oASP(MDP) is capable of finding solutions for problems in non-stationary domains without interfering with the action-value function approximation process.
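The abstract's core idea — rebuilding the state set from logical constraints whenever the environment changes, while reusing previously learned action values — can be illustrated with a small sketch. This is not the paper's implementation: the ASP solver is abstracted here as a plain constraint set over a hypothetical 4×4 grid world, standard Q-learning stands in for the RL component, and all names (`build_states`, `GRID`, the reward values) are illustrative assumptions, not from the paper.

```python
import random

GRID = 4                                  # illustrative 4x4 grid world
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]
GOAL = (3, 3)

def build_states(blocked):
    """Stand-in for the ASP answer set: all cells consistent with
    the current constraints (the 'blocked' set plays the ASP rules)."""
    return {(r, c) for r in range(GRID) for c in range(GRID)} - blocked

def step(state, action, states):
    """Transition function: moves into excluded states leave the agent put."""
    r, c = state[0] + action[0], state[1] + action[1]
    nxt = (r, c) if (r, c) in states else state
    reward = 1.0 if nxt == GOAL else -0.01
    return nxt, reward, nxt == GOAL

def q_learning(states, q=None, episodes=500, alpha=0.5, gamma=0.9, eps=0.1):
    """Tabular Q-learning restricted to the given state set; previous
    Q-values are reused, mirroring how oASP(MDP) keeps learned policies."""
    q = dict(q or {})
    q = {(s, a): q.get((s, a), 0.0) for s in states for a in range(4)}
    for _ in range(episodes):
        s = (0, 0)
        for _ in range(50):
            if random.random() < eps:
                a = random.randrange(4)
            else:
                a = max(range(4), key=lambda b: q[(s, b)])
            s2, r, done = step(s, ACTIONS[a], states)
            best = max(q[(s2, b)] for b in range(4))
            q[(s, a)] += alpha * (r + gamma * best - q[(s, a)])
            s = s2
            if done:
                break
    return q

random.seed(0)
states = build_states(blocked=set())
q = q_learning(states)
# The environment changes: a new obstacle is observed, the constraint
# set is updated, the state set rebuilt, and learning continues from
# the retained action values rather than from scratch.
states = build_states(blocked={(1, 1)})
q = q_learning(states, q)
```

The key design point mirrored from the abstract is that the state set is always derived from the constraint representation, so a domain change only requires updating the constraints and re-deriving the states, while the action-value table carries over unchanged for all states that still exist.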



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Leonardo Anjoletto Ferreira (1)
  • Reinaldo A. C. Bianchi (2)
  • Paulo E. Santos (2)
  • Ramon Lopez de Mantaras (3)

  1. Accesstage Tecnologia S.A., São Paulo, Brazil
  2. Artificial Intelligence in Automation Group, Centro Universitário FEI, São Bernardo do Campo, Brazil
  3. Institut d'Investigació en Intel·ligència Artificial, Spanish National Research Council, Barcelona, Spain
