Multi-stage Adaptive Sampling Algorithms
In Chap. 2, we present simulation-based algorithms for estimating the optimal value function in finite-horizon MDPs with large (possibly uncountable) state spaces, where the usual techniques of policy iteration and value iteration are computationally impractical or infeasible to implement. The chapter describes two adaptive sampling algorithms, each of which estimates the optimal value function by choosing which action to sample in each state visited on a finite-horizon simulated sample path. The first approach builds on the expected regret analysis of multi-armed bandit models and uses upper confidence bounds to determine which action to sample next, whereas the second approach uses ideas from learning automata to make that choice. The first approach is also the predecessor of a closely related approach in artificial intelligence (AI) called Monte Carlo tree search, which led to a breakthrough in the development of the strongest computer Go-playing programs.
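The upper-confidence-bound idea behind the first approach can be illustrated for a single state. The following is a minimal sketch, not the chapter's algorithm: it assumes a one-stage problem with Bernoulli rewards, uses the standard UCB1 index (sample mean plus sqrt(2 ln n / N_a)), and forms the value estimate as a count-weighted average of the per-action sample means. The function name `ucb_sample`, the reward means, and the sampling budget are all hypothetical choices made for illustration.

```python
import math
import random


def ucb_sample(reward_fns, budget, rng):
    """Allocate `budget` samples across actions using UCB1 indices.

    reward_fns: list of callables, each taking an RNG and returning a
                reward in [0, 1] (hypothetical one-stage simulator).
    Returns (value_estimate, counts).
    """
    n_actions = len(reward_fns)
    if budget < n_actions:
        raise ValueError("budget must cover one sample per action")
    counts = [0] * n_actions   # N_a: times action a was sampled
    means = [0.0] * n_actions  # running sample mean of each action

    for t in range(1, budget + 1):
        if t <= n_actions:
            a = t - 1  # initialization: sample every action once
        else:
            n = t - 1  # total samples drawn so far
            a = max(
                range(n_actions),
                key=lambda i: means[i] + math.sqrt(2.0 * math.log(n) / counts[i]),
            )
        r = reward_fns[a](rng)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]  # incremental mean update

    # Assumed estimator: average of sample means weighted by sample counts.
    value = sum(c * m for c, m in zip(counts, means)) / budget
    return value, counts


def bernoulli(p):
    """Return a sampler that yields 1.0 with probability p, else 0.0."""
    return lambda rng: 1.0 if rng.random() < p else 0.0


rng = random.Random(0)
actions = [bernoulli(0.2), bernoulli(0.5), bernoulli(0.8)]
value, counts = ucb_sample(actions, budget=2000, rng=rng)
print(value, counts)
```

In this sketch most of the budget ends up on the action with the highest mean reward, so the weighted average approaches the best action's mean as the budget grows. Extending the idea to multiple stages would replace each reward draw with a recursive estimate from the sampled next state, which is omitted here.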
- 108. Kocsis, L., Szepesvári, C.: Bandit based Monte-Carlo planning. In: Proceedings of the 17th European Conference on Machine Learning, pp. 282–293. Springer, Berlin (2006)
- 133. Narendra, K.S., Thathachar, M.A.L.: Learning Automata: An Introduction. Prentice-Hall, Englewood Cliffs (1989)
- 144. Poznyak, A.S., Najim, K., Gomez-Ramirez, E.: Self-Learning Control of Finite Markov Chains. Marcel Dekker, New York (2000)