Abstract
In many application domains, temporal changes in the structure of the reward distribution are modeled as a Markov chain. In this chapter, we present the formulation, theoretical bounds, and algorithms for the Markov MAB problem, in which rewards are generated by unknown irreducible Markov processes. Two important classes of the problem are discussed, namely, rested and restless Markov MAB.
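To make the rested/restless distinction concrete, the sketch below simulates arms whose rewards follow two-state Markov chains. All states, transition probabilities, rewards, and the round-robin policy are hypothetical choices for illustration; they are not taken from the chapter. The only structural point the code encodes is the one stated above: in the rested model, only the pulled arm's chain evolves, while in the restless model every chain evolves at every round.

```python
import random


class MarkovArm:
    """A two-state Markov reward chain (illustrative; parameters are made up).

    State 0 yields reward 0.0 and state 1 yields reward 1.0 by default.
    """

    def __init__(self, p01, p10, rewards=(0.0, 1.0), state=0):
        self.p01 = p01          # transition probability from state 0 to state 1
        self.p10 = p10          # transition probability from state 1 to state 0
        self.rewards = rewards
        self.state = state

    def step(self):
        """Advance the chain one step and return the reward of the new state."""
        if self.state == 0:
            self.state = 1 if random.random() < self.p01 else 0
        else:
            self.state = 0 if random.random() < self.p10 else 1
        return self.rewards[self.state]


def play(arms, choose, horizon, restless):
    """Play `horizon` rounds with policy `choose` (a map from round to arm index).

    Rested model (restless=False): only the pulled arm's chain transitions.
    Restless model (restless=True): every arm's chain transitions each round,
    but only the pulled arm's reward is observed.
    """
    total = 0.0
    for t in range(horizon):
        k = choose(t)
        for i, arm in enumerate(arms):
            if i == k:
                total += arm.step()   # pulled arm: chain evolves, reward observed
            elif restless:
                arm.step()            # restless: unobserved arms evolve too
    return total


random.seed(0)
arms = [MarkovArm(p01=0.9, p10=0.1), MarkovArm(p01=0.2, p10=0.8)]
# Round-robin is used purely to exercise the dynamics, not as a good policy.
gain = play(arms, choose=lambda t: t % 2, horizon=1000, restless=True)
print(gain)
```

Because both chains are irreducible, each has a unique stationary distribution; the learning problem studied in this chapter is to achieve low regret against the best arm's stationary reward rate without knowing the transition probabilities.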
Notes
- 1.
In some literature, rested Markov MAB is also called sleeping Markov MAB.
Copyright information
© 2016 Springer International Publishing AG
About this chapter
Cite this chapter
Zheng, R., Hua, C. (2016). Markov Multi-armed Bandit. In: Sequential Learning and Decision-Making in Wireless Resource Management. Wireless Networks. Springer, Cham. https://doi.org/10.1007/978-3-319-50502-2_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-50501-5
Online ISBN: 978-3-319-50502-2