A Hidden Markov Restless Multi-armed Bandit Model for Playout Recommendation Systems

  • Rahul Meshram
  • Aditya Gopalan
  • D. Manjunath
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10340)


We consider a restless multi-armed bandit (RMAB) in which each arm can be in one of two states, say 0 or 1. Playing an arm generates a unit reward with a probability that depends on the state of the arm. The belief about the state of the arm can be calculated using a Bayesian update after every play. This RMAB has been designed for use in recommendation systems in which the user's preferences depend on the history of recommendations. In this paper we analyse the RMAB by first studying the single-armed bandit. We show that it is Whittle-indexable and obtain a closed-form expression for the Whittle index. For an RMAB to be useful in practice, we need to be able to learn the parameters of the arms. We present a Thompson sampling scheme that learns the parameters of the arms, and illustrate its performance numerically.


Keywords: Restless multi-armed bandit · Recommendation systems · POMDP · Automated playlist creation systems · Learning
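The Bayesian belief update mentioned in the abstract can be sketched as follows. This is an illustrative sketch, not the paper's exact formulation: it assumes a single hidden two-state arm with transition probabilities into state 1 denoted p01 and p11, reward probabilities rho0 and rho1 in states 0 and 1, and a belief pi = P(state = 1); all of these names are assumptions introduced here for illustration.

```python
def belief_update(pi, reward, p01, p11, rho0, rho1):
    """Return the updated belief P(state = 1) after one play of the arm.

    pi     -- prior belief that the arm is in state 1
    reward -- observed reward from this play (0 or 1)
    p01, p11 -- probability of moving to state 1 from states 0 and 1
    rho0, rho1 -- reward probability in states 0 and 1
    """
    # Bayes step: condition the current-state belief on the observed reward.
    if reward == 1:
        num = pi * rho1
        den = pi * rho1 + (1 - pi) * rho0
    else:
        num = pi * (1 - rho1)
        den = pi * (1 - rho1) + (1 - pi) * (1 - rho0)
    posterior = num / den
    # Prediction step: push the posterior through the Markov transition,
    # giving the belief about the state at the next play.
    return posterior * p11 + (1 - posterior) * p01
```

Iterating this update after every play yields the belief trajectory on which an index policy (such as the Whittle index policy analysed in the paper) would act.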



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Electrical Engineering Department, Indian Institute of Technology Bombay, Mumbai, India
  2. ECE Department, Indian Institute of Science, Bangalore, India
