Distributed Competitive Decision Making Using Multi-Armed Bandit Algorithms


This paper tackles the problem of Opportunistic Spectrum Access (OSA) in Cognitive Radio (CR). The main challenge for a Secondary User (SU) in OSA is to learn the availability of the existing channels in order to select and access the one with the highest vacancy probability. To reach this goal, we propose a novel Multi-Armed Bandit (MAB) algorithm, called \(\epsilon\)-UCB, that enhances the spectrum learning of a SU and decreases the regret, i.e. the loss of reward caused by selecting worse channels. We corroborate with simulations that the regret of the proposed algorithm has a logarithmic behavior. This means that, within a finite number of time slots, the SU can estimate the vacancy probability of the targeted channels and select the best one for transmission. We then extend \(\epsilon\)-UCB to multiple priority users, where each SU selfishly estimates and accesses the channels according to its prior rank. The simulation results show the superiority of the proposed algorithms over existing MAB algorithms in both the single-user and multi-user cases.


Availability of Data and Materials

The authors declare that all the data and materials in this manuscript are available from the authors.


Notes

  1. A SU in the context of OSA can be considered as an agent in the classic MAB problem, and the frequency channels become equivalent to different arms.

  2. The variable \(S_i(t)\) may also represent the reward of the \(i\)-th channel at slot \(t\).

  3. According to [24], the Chernoff–Hoeffding theorem states the following: let \(X_1,\ldots ,X_n\) be independent random variables in \(\{0, 1\}\) with \(E[X_t]= \mu\), and let \(S_n = \sum _{i=1}^{n} X_i\). Then, \(\forall\) \(a\ge 0\), \(P\{S_n \ge n \mu +a\} \le \exp \big (\frac{-2 a^2}{n}\big )\) and \(P\{S_n \le n \mu -a\} \le \exp \big (\frac{-2 a^2}{n}\big )\).

  4. According to [24], the Chernoff–Hoeffding theorem states the following: let \(X_1, \ldots ,X_n\) be independent random variables in \([0,1]\) with \(E[X_t]= \mu\), and let \(S_n = \sum _{i=1}^{n} X_i\). Then, \(\forall\) \(a\ge 0\), we have \(P\{S_n \ge n \mu +a\} \le \exp \big (\frac{-2 a^2}{n}\big )\) and \(P\{S_n \le n \mu -a\} \le \exp \big (\frac{-2 a^2}{n}\big )\).
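The Chernoff–Hoeffding bound quoted in the last two notes can be checked numerically. The following is a minimal sketch, assuming independent Bernoulli(\(\mu\)) samples; the function names, the parameter values and the number of trials are illustrative choices and do not come from the paper.

```python
import math
import random


def hoeffding_bound(n: int, a: float) -> float:
    """Right-hand side exp(-2 a^2 / n) of the Chernoff-Hoeffding bound."""
    return math.exp(-2.0 * a * a / n)


def empirical_upper_tail(n: int, mu: float, a: float, trials: int, seed: int = 0) -> float:
    """Monte Carlo estimate of P{S_n >= n*mu + a} for i.i.d. Bernoulli(mu) samples."""
    rng = random.Random(seed)
    threshold = n * mu + a
    hits = 0
    for _ in range(trials):
        # S_n is the number of successes among n Bernoulli(mu) draws.
        s_n = sum(1 for _ in range(n) if rng.random() < mu)
        if s_n >= threshold:
            hits += 1
    return hits / trials


# Illustrative values (assumptions, not taken from the paper).
n, mu, a = 100, 0.5, 10.0
bound = hoeffding_bound(n, a)
estimate = empirical_upper_tail(n, mu, a, trials=5000)
print(f"empirical tail = {estimate:.4f}, Hoeffding bound = {bound:.4f}")
```

For these values the bound is loose: the estimated tail probability should fall well below \(\exp (-2)\).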


References

  1. Marcus, M., Burtle, C. J., Franca, B., Lahjouji, A., & McNeil, N. (2002). Federal Communications Commission (FCC): Spectrum Policy Task Force. ET Docket, 02-135.

  2. Watkins, C. (1989). Learning from delayed rewards. Cambridge: University of Cambridge.

  3. Lai, T., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6, 4–22.

  4. Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25, 285–294.

  5. Auer, P., Cesa-Bianchi, N., Freund, Y., & Schapire, R. (2002). The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32, 48–77.

  6. Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47, 235–256.

  7. Burtini, G., Loeppky, J., & Lawrence, R. (2015). A survey of online experiment design with the stochastic multi-armed bandit. arXiv preprint arXiv:1510.00757.

  8. Kaufmann, E., Cappé, O., & Garivier, A. (2012). On Bayesian upper confidence bounds for bandit problems. In Artificial intelligence and statistics, La Palma, Canary Islands.

  9. Maillard, O., Munos, R., & Stoltz, G. (2011). A finite-time analysis of multi-armed bandits problems with Kullback–Leibler divergences. In Annual conference on learning theory, Budapest, Hungary.

  10. Scott, S. (2010). A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26, 639–658.

  11. Chapelle, O., & Li, L. (2011). An empirical evaluation of Thompson sampling. In Advances in neural information processing systems, Granada, Spain.

  12. Kaufmann, E., Korda, N., & Munos, R. (2012). Thompson sampling: An asymptotically optimal finite-time analysis. In International conference on algorithmic learning theory, Lyon, France.

  13. Agrawal, S., & Goyal, N. (2013). Further optimal regret bounds for Thompson sampling. In Artificial intelligence and statistics, Scottsdale, USA.

  14. Agrawal, S., & Goyal, N. (2012). Analysis of Thompson sampling for the multi-armed bandit. In Annual conference on learning theory, Edinburgh, Scotland.

  15. Gai, Y., & Krishnamachari, B. (2011). Decentralized online learning algorithms for opportunistic spectrum access. In Global communications conference, Texas, USA.

  16. Torabi, N., Rostamzadeh, K., & Leung, V. C. (2012). Rank-optimal channel selection strategy in cognitive networks. In Global communications conference, California, USA.

  17. Rosenski, J., Shamir, O., & Szlak, L. (2016). Multi-player bandits-a musical chairs approach. In International conference on machine learning, New York, USA.

  18. Avner, O., & Mannor, S. (2014). Concurrent bandit and cognitive radio networks. In European conference on machine learning and principles and practice of knowledge discovery in databases, Nancy, France.

  19. Almasri, M., Mansour, A., Moy, C., Assoum, A., Osswald, C., & Lejeune, D. (2019). Distributed algorithm to learn OSA channels availability and enhance the transmission rate of secondary users. In International symposium on communications and information technologies, HoChiMinh, Vietnam.

  20. Almasri, M., Mansour, A., Moy, C., Assoum, A., Osswald, C., & Lejeune, D. (2018). Opportunistic spectrum access in cognitive radio for tactical network. In European conference on electrical engineering and computer science, Bern, Switzerland.

  21. Modi, N., Mary, P., & Moy, C. (2017). QoS driven channel selection algorithm for cognitive radio network: Multi-user multi-armed bandit approach. IEEE Transactions on Cognitive Communications and Networking, 3, 1–6.

  22. Tekin, C., & Liu, M. (2011). Online learning in opportunistic spectrum access: A restless bandit approach. In International conference on computer communications, Shanghai, China.

  23. Cauchy, A. (1889). Sur la convergence des séries. Oeuvres completes Sér, 2(2), 267–279.

  24. Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58, 13–30.

  25. Almasri, M., Mansour, A., Moy, C., Assoum, A., Osswald, C., & Lejeune, D. (2019). All-powerful learning algorithm for the priority access in cognitive network. In European signal processing conference, A Coruña, Spain.

Author information




All the authors have contributed to the analytic and numerical results. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Mahmoud Almasri.

Ethics declarations

Conflict of interest

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Appendix 1

In this appendix, we derive an upper bound on the term \(\mathbb {A}= \epsilon _t \times Prob\) in \(\epsilon\)-UCB, where Prob can be bounded as follows:

$$\begin{aligned} Prob \le P\big \{B_i(t-1, T_i(t-1)) \ge B_1(t-1, T_1(t-1)); T_i(t-1)\ge l \big \} \end{aligned}$$

The index of the \(i\)-th channel, \(B_i(t, T_i(t))\), is the sum of the exploitation factor \(X_i(T_i(t))\), i.e. the empirical mean reward, and the exploration factor \(A_i(t, T_i(t))\):

$$\begin{aligned} B_i(t, T_i(t))= X_i(T_i(t))+ A_i(t, T_i(t)) \end{aligned}$$
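As an illustration, the index above can be computed as follows. This is a minimal sketch, assuming that \(X_i\) is the empirical mean reward and that \(A_i(t, T_i(t)) = \sqrt{\alpha \ln (t)/T_i(t)}\), the form used later in this appendix. The function names and the channel statistics are hypothetical, and the \(\epsilon _t\) part of \(\epsilon\)-UCB is not modelled here.

```python
import math


def confidence_term(t: int, pulls: int, alpha: float = 2.0) -> float:
    """A_i(t, T_i(t)) = sqrt(alpha * ln(t) / T_i(t))."""
    return math.sqrt(alpha * math.log(t) / pulls)


def channel_index(reward_sum: float, pulls: int, t: int, alpha: float = 2.0) -> float:
    """B_i(t, T_i(t)) = X_i(T_i(t)) + A_i(t, T_i(t))."""
    empirical_mean = reward_sum / pulls  # X_i(T_i(t))
    return empirical_mean + confidence_term(t, pulls, alpha)


# Hypothetical statistics after t = 100 slots: (sum of rewards, number of accesses).
t = 100
stats = {"channel 1": (45.0, 50), "channel 2": (20.0, 40), "channel 3": (3.0, 10)}
indices = {name: channel_index(s, k, t) for name, (s, k) in stats.items()}
best = max(indices, key=indices.get)
print(indices, "->", best)
```

A rarely accessed channel (here "channel 3") receives a large confidence term, so its index stays high until it has been sampled more often.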

Then, we obtain:

$$\begin{aligned}&Prob \le P\bigg \{X_1({T_1(t-1)}) + A_1(t-1, T_1(t-1)) \le \nonumber \\&\quad X_i({T_i(t-1)}) + A_i(t-1, T_i(t-1)) \text { and } T_i(t-1)\ge l \bigg \} \end{aligned}$$

By taking the minimum value of \(X_1({T_1(t-1)}) + A_1(t-1, T_1(t-1))\) and the maximum value of \(X_i({T_i(t-1)}) + A_i(t-1, T_i(t-1))\) over the possible numbers of accesses, we can upper bound Prob as follows:

$$\begin{aligned} Prob \le P\Bigg \{ \min _{0<S_1<t} \bigg [ X_1({S_1}) + A_1(t,S_1) \bigg ] \le \max _{l\le S_i<t} \bigg [ X_i({S_i}) + A_i(t,S_i) \bigg ] \Bigg \} \end{aligned}$$

where \(S_i \ge l\) to fulfill the condition \(T_i(t-1) \ge l\). Then, we obtain:

$$\begin{aligned} Prob \le \sum _{S_1=1}^{t-1} \sum _{S_i=l}^{t-1} P\Bigg \{X_1({S_1}) + A_1(t,S_1) < X_i(S_i)+A_i(t,S_i) \Bigg \} \end{aligned}$$

The above probability can be upper bounded by:

$$\begin{aligned}&Prob \le \sum _{S_1=1}^{t-1} \sum _{S_i=l}^{t-1} \bigg [ P\Big \{ X_1({S_1})+A_1({t,S_1}) \le \mu _1 \Big \} \nonumber \\&\qquad + P\Big \{\mu _1<\mu _i+2 A_i({t,S_i})\Big \} \nonumber \\&\qquad + P\Big \{ X_i(S_i)+ A_i({t,S_i}) \ge \mu _i+2A_i({t,S_i}) \Big \} \bigg ] \end{aligned}$$

Using the ceiling operator \(\lceil \cdot \rceil\), let \(l=\lceil \frac{4 \alpha \ln (n)}{\Delta _{i}^2}\rceil\), where \(\Delta _{i} = \mu _1-\mu _i\). Then, for \(S_i\ge l\) and \(t \le n\), the inequality \(\mu _1<\mu _i+2 A_i({t,S_i})\) in the second term of Eq. (32), the three-term bound above, is false. Indeed:

$$\begin{aligned} \mu _1-\mu _i-2 A_i({t,S_i})&= \mu _1-\mu _i-2\sqrt{\frac{\alpha \ln (t)}{S_i} }\\&\ge \mu _1-\mu _i-2\sqrt{\frac{\alpha \ln (n)}{l} } \\&\ge \mu _1-\mu _i - \Delta _{i} = 0 \end{aligned}$$
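The choice of \(l\) can be illustrated numerically. The sketch below computes \(l\) and verifies that \(\mu _1-\mu _i-2A_i(t,S_i)\) stays non-negative for \(S_i \ge l\) and \(t \le n\). The values of \(\alpha\), \(n\) and \(\Delta _i\), as well as the function names, are illustrative assumptions.

```python
import math


def learning_threshold(alpha: float, n: int, delta: float) -> int:
    """l = ceil(4 * alpha * ln(n) / delta^2)."""
    return math.ceil(4.0 * alpha * math.log(n) / delta ** 2)


def gap(alpha: float, t: int, s_i: int, delta: float) -> float:
    """mu_1 - mu_i - 2 * A_i(t, S_i), written with delta = mu_1 - mu_i."""
    return delta - 2.0 * math.sqrt(alpha * math.log(t) / s_i)


# Illustrative values (assumptions, not taken from the paper).
alpha, n, delta = 2.0, 1000, 0.5
l = learning_threshold(alpha, n, delta)

# Smallest gap over all slots t <= n and a few access counts S_i >= l.
worst_gap = min(
    gap(alpha, t, s_i, delta)
    for t in range(1, n + 1)
    for s_i in (l, l + 1, n)
)
print(f"l = {l}, smallest gap = {worst_gap:.5f}")
```

The smallest gap occurs at \(t=n\) and \(S_i=l\), which is exactly the case handled by the chain of inequalities above.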

Based on Eq. (32), we obtain:

$$\begin{aligned} Prob \le \sum _{S_1=1}^{t-1} \sum _{S_i=l}^{t-1} \bigg [ P\Big \{X_1({S_1})\le \mu _1 - A_1( t,S_1) \Big \} + P\Big \{X_i(S_i)\ge \mu _i + A_i(t,S_i) \Big \} \bigg ] \end{aligned}$$

Using the Chernoff–Hoeffding bound (see footnote 4) [24], we can prove that:

$$\begin{aligned} P\Big \{X_1({S_1}) \le \mu _1-A_1({t,S_1})\Big \}&\le \exp \bigg ( \frac{-2}{S_1}\Big [S_1 \sqrt{\frac{\alpha \ln (t)}{S_1}} \Big ]^2 \bigg ) \nonumber \\&= t^{-2\alpha } \end{aligned}$$
$$\begin{aligned} P\Big \{X_i(S_i)\ge \mu _i+A_i({t,S_i})\Big \}&\le \exp \bigg ( \frac{-2}{S_i}\Big [S_i \sqrt{\frac{\alpha \ln (t)}{S_i}} \Big ]^2 \bigg ) \nonumber \\&= t^{-2\alpha } \end{aligned}$$
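Both right-hand sides reduce to \(t^{-2\alpha }\) by the same algebra. Assuming that \(X_1(S_1)\) denotes the empirical mean of \(S_1\) samples, the bound of footnote 4 is applied with \(n=S_1\) samples and deviation \(a=S_1 A_1(t,S_1)\), and the exponent simplifies as follows:

$$\begin{aligned} \frac{2}{S_1}\Big [S_1 \sqrt{\frac{\alpha \ln (t)}{S_1}} \Big ]^2&= \frac{2}{S_1} \times S_1^2 \times \frac{\alpha \ln (t)}{S_1} = 2\alpha \ln (t) \\ \exp \big (-2\alpha \ln (t)\big )&= t^{-2\alpha } \end{aligned}$$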

The two inequalities above, together with inequality (33), lead to:

$$\begin{aligned} Prob \le \sum _{S_1=1}^{t-1} \sum _{S_i=l}^{t-1} 2 t^{-2 \alpha } \le 2 t^{-2\alpha +2} \end{aligned}$$
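The last step only counts the terms of the double sum. For \(1 \le l \le t-1\), the outer sum has \(t-1\) terms and the inner sum has \(t-l\) terms, each equal to \(2t^{-2\alpha }\):

$$\begin{aligned} \sum _{S_1=1}^{t-1} \sum _{S_i=l}^{t-1} 2 t^{-2 \alpha } = 2 (t-1)(t-l)\, t^{-2\alpha } \le 2 t^{2} \times t^{-2\alpha } = 2 t^{-2\alpha +2} \end{aligned}$$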

Finally, we obtain:

$$\begin{aligned} \mathbb {A} \le \frac{H}{t} \times 2 t^{-2\alpha +2} = 2H \times t^{-2\alpha +1} \end{aligned}$$
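To see why this bound is useful, note that for \(\alpha =2\) (the value used in Appendix 2) it becomes \(2H t^{-3}\), whose sum over all slots stays bounded because \(\sum _t t^{-3}\) converges to Apéry's constant \(\zeta (3)\approx 1.202\). The sketch below checks this numerically. It assumes \(\epsilon _t = H/t\), as the last equation suggests, and uses \(H=1\) as an illustrative constant; the function name is hypothetical.

```python
def bound_on_A(t: int, H: float, alpha: float = 2.0) -> float:
    """Upper bound 2 * H * t^(-2*alpha + 1) on the term A at slot t."""
    return 2.0 * H * t ** (-2.0 * alpha + 1.0)


H = 1.0  # illustrative constant (assumption)
partial_sums = {
    n: sum(bound_on_A(t, H) for t in range(1, n + 1))
    for n in (10, 100, 10_000)
}
for n, total in partial_sums.items():
    print(f"n = {n:>6}: sum of bounds = {total:.6f}")
```

The partial sums increase with the horizon but remain below \(2H\zeta (3)\), so under these assumptions the term \(\mathbb {A}\) adds at most a constant when summed over time.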

Appendix 2

This appendix derives an upper bound on Z, which contributes to the upper bound obtained for \(\epsilon\)-UCB:

$$\begin{aligned} Z= P\big \{X_1(T_1(t-1)) \le a ; T_i(t-1)\ge l \big \} \end{aligned}$$

where a is a constant that can be chosen as follows: \(a= \frac{\mu _1 + \mu _i}{2}= \mu _1-\frac{\Delta _i}{2}= \mu _i+\frac{\Delta _i}{2}\), with \(\Delta _i=\mu _1-\mu _i\). After the learning period, where \(T_i(t-1)\ge l\), we have \(T_1(t-1) \gg T_i(t-1)\). Then Z can be upper bounded by:

$$\begin{aligned} Z \le&P\big \{X_1(T_1(t-1)) \le a ; T_1(t-1)\ge l \big \}\nonumber \\ \le&\sum _{z=l}^{n} P\big \{X_1(T_1(t-1)) \le \mu _1-\frac{\Delta _i}{2}; T_1(t-1) = z \big \} \nonumber \\ \le&\sum _{z=l}^{n} P\big \{X_1(z) \le \mu _1-\frac{\Delta _i}{2}\big \} \end{aligned}$$

Using the Chernoff–Hoeffding bound [24], we can upper bound the above expression as follows:

$$\begin{aligned} Z \le \sum _{z=l}^{n} \exp \Big (- \frac{2 \Delta _i^2 z^2}{4z}\Big ) \le n \exp \Big (\frac{- l\Delta _i^2}{2}\Big ) \end{aligned}$$

According to the proof provided in Appendix 1, we have \(l=\lceil \frac{4 \alpha \ln (n)}{\Delta _{i}^2}\rceil\) where \(\alpha =2\), so that \(\frac{l\Delta _i^2}{2} \ge 4\ln (n)\). We therefore obtain:

$$\begin{aligned} Z \le n \exp (-4\ln n) = \frac{1}{n^3} \end{aligned}$$
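The final bound can be checked numerically for a few horizons and gaps. This is a minimal sketch: the values of \(n\) and \(\Delta _i\) and the function name are illustrative assumptions, and \(\alpha =2\) is taken from the text above.

```python
import math


def z_bound(n: int, delta: float, alpha: float = 2.0) -> float:
    """n * exp(-l * delta^2 / 2) with l = ceil(4 * alpha * ln(n) / delta^2)."""
    l = math.ceil(4.0 * alpha * math.log(n) / delta ** 2)
    return n * math.exp(-l * delta ** 2 / 2.0)


# (n, delta, bound on Z, target 1/n^3) for illustrative horizons and gaps.
checks = [
    (n, delta, z_bound(n, delta), 1.0 / n ** 3)
    for n in (10, 100, 1000)
    for delta in (0.1, 0.3, 0.5)
]
for n, delta, bound, target in checks:
    print(f"n = {n:>4}, delta = {delta}: {bound:.3e} <= {target:.3e}")
```

Because of the ceiling in \(l\), the computed bound is slightly smaller than \(1/n^3\) in each case.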


Cite this article

Almasri, M., Mansour, A., Moy, C. et al. Distributed Competitive Decision Making Using Multi-Armed Bandit Algorithms. Wireless Pers Commun (2021). https://doi.org/10.1007/s11277-020-08064-w



  • Opportunistic spectrum access
  • Cognitive networks
  • Multi-armed bandit
  • Single or multi-users
  • Priority access