Abstract
Partially observable Markov decision processes (POMDPs) provide a mathematical framework for agent planning in stochastic, partially observable environments. The classical Bayesian optimal solution can be obtained by transforming the problem into a Markov decision process (MDP) over belief states. However, because the belief-state space is continuous, solving this transformed problem exactly is highly intractable. Many practical heuristic-based methods have been proposed, but most of them require complete prior knowledge of the environment. This article presents a memory-based reinforcement learning algorithm, Reinforcement Based U-Tree, which not only learns state transitions from experience but also builds its own state model from raw sensory input. The article describes an enhancement of the original U-Tree's state-generation process that makes the generated model more compact, and demonstrates its performance on a car-driving task with 31,224 world states. It also presents a modification to the statistical test used for reward estimation, which allows the algorithm to be benchmarked against several model-based algorithms on a set of well-known POMDP problems.
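As background for the belief-state transformation mentioned in the abstract: given a transition model P(s' | s, a) and an observation model P(o | s', a), the belief is maintained by the standard Bayes filter b'(s') ∝ P(o | s', a) Σ_s P(s' | s, a) b(s), which is what turns a POMDP into an MDP over beliefs. The sketch below is a minimal illustration of this update only, not the U-Tree algorithm itself; all array names and the toy probabilities are hypothetical.

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Bayes-filter update of a discrete POMDP belief state.

    b -- current belief over states, shape (S,)
    a -- action index
    o -- observation index
    T -- transition model, shape (A, S, S); T[a, s, s2] = P(s2 | s, a)
    O -- observation model, shape (A, S, Z); O[a, s2, o] = P(o | s2, a)
    """
    predicted = b @ T[a]                 # predictive distribution over next states
    posterior = predicted * O[a, :, o]   # weight by likelihood of the observation
    return posterior / posterior.sum()   # normalize (assumes P(o | b, a) > 0)

# Toy 2-state, 1-action, 2-observation model with made-up probabilities.
T = np.array([[[0.9, 0.1],
               [0.2, 0.8]]])
O = np.array([[[0.7, 0.3],
               [0.4, 0.6]]])
b0 = np.array([0.5, 0.5])
print(belief_update(b0, a=0, o=1, T=T, O=O))   # -> [0.379..., 0.620...]
```

Because this belief lives in a continuous simplex, exact planning over it is intractable, which is the motivation for the memory-based approach the chapter develops.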
© 2010 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Zheng, L., Cho, S.-Y., Quek, C. (2010). Reinforcement Based U-Tree: A Novel Approach for Solving POMDP. In: Jain, L.C., Lim, C.P. (eds.) Handbook on Decision Making. Intelligent Systems Reference Library, vol. 4. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13639-9_9
DOI: https://doi.org/10.1007/978-3-642-13639-9_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13638-2
Online ISBN: 978-3-642-13639-9