Multi-stage Adaptive Sampling Algorithms

  • Hyeong Soo Chang
  • Jiaqiao Hu
  • Michael C. Fu
  • Steven I. Marcus
Part of the Communications and Control Engineering book series (CCE)


In Chap. 2, we present simulation-based algorithms for estimating the optimal value function in finite-horizon MDPs with large (possibly uncountable) state spaces, where the usual techniques of policy iteration and value iteration are computationally impractical or infeasible to implement. We present two adaptive sampling algorithms that estimate the optimal value function by choosing which actions to sample in each state visited on a finite-horizon simulated sample path. The first approach builds upon the expected regret analysis of multi-armed bandit models and uses upper confidence bounds to determine which action to sample next, whereas the second approach uses ideas from learning automata. The first approach is also the predecessor of a closely related approach in artificial intelligence (AI) called Monte Carlo tree search, which led to a breakthrough in developing the current best computer Go-playing programs.
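The bandit-based idea in the first approach can be illustrated with a minimal sketch. The snippet below applies the standard UCB1 index (empirical mean plus an exploration bonus) to allocate a fixed simulation budget across the actions of a single state; the function names, the one-stage setting, and the budget parameter are illustrative assumptions, not the book's notation, and the full algorithm in Chap. 2 applies this recursively along a finite-horizon sample path.

```python
import math

def ucb1_sample(rewards, counts, total):
    """Return the index of the next action to sample, using the UCB1
    index: empirical mean + sqrt(2 ln(total) / n_a)."""
    for a in range(len(counts)):
        if counts[a] == 0:
            return a  # sample every action at least once
    return max(
        range(len(counts)),
        key=lambda a: rewards[a] / counts[a]
        + math.sqrt(2.0 * math.log(total) / counts[a]),
    )

def estimate_value(simulate, num_actions, budget):
    """Adaptively allocate `budget` simulations across `num_actions`
    actions and return the largest empirical mean reward as the
    value estimate for the state."""
    rewards = [0.0] * num_actions  # cumulative sampled reward per action
    counts = [0] * num_actions     # number of samples per action
    for t in range(1, budget + 1):
        a = ucb1_sample(rewards, counts, t)
        rewards[a] += simulate(a)  # one simulated reward for action a
        counts[a] += 1
    return max(rewards[a] / counts[a] for a in range(num_actions))
```

The exploration bonus shrinks as an action accumulates samples, so the budget concentrates on empirically promising actions while still guaranteeing that every action continues to be sampled; this is the mechanism behind the logarithmic-regret guarantees the chapter builds on.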


Keywords: Action Space · Inventory Level · Order Amount · Sampling Algorithm · Recursive Call



Copyright information

© Springer-Verlag London 2013

Authors and Affiliations

  • Hyeong Soo Chang, Dept. of Computer Science and Engineering, Sogang University, Seoul, South Korea
  • Jiaqiao Hu, Dept. of Applied Mathematics & Statistics, State University of New York, Stony Brook, USA
  • Michael C. Fu, Smith School of Business, University of Maryland, College Park, USA
  • Steven I. Marcus, Dept. of Electrical & Computer Engineering, University of Maryland, College Park, USA
