Sensing time and power allocation for cognitive radios using distributed Q-learning
Abstract
In cognitive radio systems, the sparse assigned frequency bands are opened to secondary users, provided that the aggregated interference induced by the secondary transmitters on the primary receivers is negligible. Cognitive radios are established in two steps: the radios first sense the available frequency bands and then communicate using these bands. In this article, we propose two decentralized resource allocation Q-learning algorithms. The first one is used to share the sensing time among the cognitive radios in a way that maximizes the throughputs of the radios. The second one is used to allocate the cognitive radio powers in a way that maximizes the signal-to-interference-plus-noise ratio (SINR) at the secondary receivers while meeting the primary protection constraint. Numerical results show the convergence of the proposed algorithms and allow the discussion of the exploration strategy, the choice of the cost function and the frequency of execution of each algorithm.
Keywords
Time Slot, Cognitive Radio, Power Allocation, Secondary User, Allocation Algorithm
1. Introduction
The scarcity of available radio spectrum frequencies, densely allocated by the regulators, represents a major bottleneck in the deployment of new wireless services. Cognitive radios have been proposed as a new technology to overcome this issue [1]. For cognitive radio use, the assigned frequency bands are opened to secondary users, provided that interference induced on the primary licensees is negligible. Cognitive radios are established in two steps: the radios firstly sense the available frequency bands and secondly communicate using these bands.
To tackle the fading phenomenon (an attenuation of the received power due to destructive interference between the multiple interactions of the emitted wave with the environment) when sensing the frequency spectrum, cooperative spectrum sensing has been proposed to take advantage of the spatial diversity in wireless channels [2, 3]. In cooperative spectrum sensing, the secondary cognitive nodes send the results of their individual observations of the primary signal to a base station through specific control channels. The base station then combines the received information in order to make a decision about the primary network presence. Each cognitive node observes the primary signal during a certain sensing time, which should be chosen high enough to ensure the correct detection of the primary emitter but low enough so that the node still has enough time to communicate. In the literature [4, 5], the sensing times used by the cognitive nodes are generally assumed to be identical and allocated by a central authority. In [6], the sensing performance of a network of independent cognitive nodes that individually select their sensing times is analyzed using evolutionary game theory.
It is generally considered in the literature that the secondary users can only transmit if the primary network is inactive or if the secondary users are located outside a keep-out region surrounding the primary transmitter, or equivalently, if the secondary users generate an interference below a given threshold on a so-called protection contour surrounding the primary transmitter [7, 8]. However, multiple simultaneously transmitting secondary users may individually meet the protection contour constraint while collectively generating an aggregated interference that exceeds the acceptable threshold. In [7], the effect of aggregated interference caused by IEEE 802.22 secondary users on primary DTV receivers is analyzed. In [9], the aggregated interference generated by a large-scale secondary network is modeled and the impact of the secondary network density on the sensing requirements is investigated. In [10], a decentralized power allocation Q-learning algorithm is proposed to protect the primary network from harmful aggregated interference. The proposed algorithm removes the need for a central authority to allocate the powers in the secondary network and therefore minimizes the communication overhead. The cost functions used by the algorithm are chosen so that the aggregated interference constraint is exactly met on the protection contour. Unfortunately, the cost functions do not take into account the preferences of the secondary network.
This article aims to illustrate the potential of Q-learning for cognitive radio systems. For this purpose, two decentralized Q-learning algorithms are presented to solve the allocation problems that appear during the sensing phase on the one hand and during the communication phase on the other hand. The first algorithm shares the sensing times among the cognitive radios in a way that maximizes the throughputs of the radios. The second algorithm allocates the secondary user powers in a way that maximizes the signal-to-interference-plus-noise ratio (SINR) at the secondary receivers while meeting the primary protection constraint. The agents self-adapt by directly interacting with the environment in real time and by properly utilizing their past experience. They aim to distributively learn an optimal strategy to maximize their throughputs or their SINRs.
Reinforcement learning algorithms such as Q-learning are particularly efficient in applications where reinforcement information (i.e., cost or reward) is provided after an action is performed in the environment [11]. The sensing time and power allocation problems both allow for the easy definition of such information. In this article, we make the assumption that no information is exchanged between the agents for each of the two problems. As a result, many traditional multi-agent reinforcement learning algorithms like fictitious play and Nash-Q learning cannot be used [12], which justifies the use of multi-agent Q-learning in this article to solve the sensing time and power allocation problems.
This distributed allocation of the sensing times and the node powers presents several advantages compared to a centralized allocation [10]: (1) robustness of the system towards a variation of parameters (such as the gains of the sensing channels), (2) maintainability of the system thanks to the modularity of the multiple agents and (3) scalability of the system as the need for control communication is minimized: on the one hand there is no need for a central authority to send the result of a centralized allocation to the multiple nodes and on the other hand these nodes do not have to send their specific parameters (sensing SNRs and data rates for the sensing time allocation, space coordinates for the power allocation problem). In addition, a centralized allocation is not a trivial operation, as the sensing time and the power allocation problems are both essentially multi-criteria problems in which multiple objective functions to maximize can be defined (e.g., the sum of the individual rewards to aim for a global optimum or the minimum individual reward to guarantee more fairness).
The rest of this article is organized as follows: in Section 2, we formulate the problem of sensing time allocation in the secondary network. In Section 3, we formulate the problem of power allocation in the secondary network. In Section 4, we present the decentralized Q-learning algorithms used to solve the sensing time allocation problem and the power allocation problem. In Section 5, we present numerical results allowing the discussion of the performance of the Q-learning algorithms for different exploration strategies, cost functions and execution frequencies.
2. Sensing time allocation problem formulation
2.1. Cooperative spectrum sensing
The licensed band is assumed to be divided into N sub-bands, and each secondary user is assumed to communicate in one of the N sub-bands when the primary user is absent. When it is present, the primary network is assumed to use all N sub-bands for its communications. Therefore, the secondary users can jointly sense the primary network presence on these sub-bands and report their observations via a narrow-band control channel.
where s_{ ji } and n_{ ji } denote the received primary signal and additive white noise at the i th sample of the j th cognitive radio, respectively, (1 ≤ j ≤ N, 1 ≤ i ≤ M_{ j }). These samples are assumed to be real without loss of generality. H_{0} and H_{1} represent the hypotheses associated to primary signal absence and presence, respectively. In the distributed detection problem, the coordinator node receives information from each of the N nodes (e.g., the communicated Y_{ j }) and must decide between the two hypotheses.
We assume that the instantaneous noise at each node n_{ ji } can be modeled as a zero-mean Gaussian random variable with unit variance ${n}_{ji}\sim \mathcal{N}\left(0,1\right)$. Let γ_{ j } be the signal-to-noise ratio (SNR) computed at the j th node, defined as ${\gamma}_{j}=\frac{1}{{M}_{j}}{\sum}_{i=1}^{{M}_{j}}{s}_{ji}^{2}$.
where $Q\left(x\right)={\int}_{x}^{+\infty}\frac{1}{\sqrt{2\pi}}{e}^{-\frac{{t}^{2}}{2}}\mathsf{\text{d}}t$.
where Q^{-1}(x) is the inverse function of Q(x).
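The Gaussian tail function Q(x) defined above and its inverse can be evaluated numerically with the Python standard library. The following is a minimal sketch; the function names `q_function` and `q_inverse` are illustrative and do not come from the article.

```python
import math
from statistics import NormalDist


def q_function(x: float) -> float:
    """Gaussian tail probability Q(x) = P(Z > x) for Z ~ N(0, 1)."""
    # Identity: Q(x) = 0.5 * erfc(x / sqrt(2))
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def q_inverse(p: float) -> float:
    """Inverse of Q: the value x such that Q(x) = p, for 0 < p < 1."""
    # Q(x) = 1 - Phi(x), hence x = Phi^{-1}(1 - p)
    return NormalDist().inv_cdf(1.0 - p)


print(q_function(0.0))  # 0.5
```

Since Q is strictly decreasing, `q_inverse` returns negative values for probabilities above 0.5, which is the regime of a high target detection probability.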
2.2. Throughput of a secondary user
The random variable representing the presence of the primary network in each time slot n is denoted H (n) (H (n) ∈ {H_{0}, H_{1}}) and is assumed to be a Markov chain characterized by a transition matrix [p_{ uv }]. It is assumed that the probability p_{01} that the primary network appears is small compared to the probability p_{00}. As a result, the secondary users can decide to communicate or not during a time slot based on the result of their sensing in the previous time slot while limiting the probability of interference with the primary network.
2.3. Sensing time allocation problem
Equations (6), (8), and (10) show that there is a tradeoff for the choice of the sensing window length M_{ j }: on the one hand, if M_{ j } is high then user j will not have enough time to perform its data transmission and R_{ j } will be low. On the other hand, if all the users use low M_{ j } values, then the global false alarm probability in (10) will be high and all the average throughputs will be low.
The sensing time allocation problem consists in finding the optimal sensing window lengths {M_{1}, . . ., M_{ N }} that minimize a cost function f (R_{1}, . . ., R_{ N }) depending on the secondary throughputs.
where ${\bar{R}}_{j}$ denotes the throughput required by node j.
It is observed that the cost decreases with respect to R_{ j } until R_{ j } reaches the threshold value ${\bar{R}}_{j}$, then the cost increases with respect to R_{ j }. This should prevent secondary users from selfishly transmitting with a throughput higher than required, which would reduce the achievable throughputs for the other secondary users.
Although a base station could determine the sensing window lengths that minimize function (11) and send these optimal values to each secondary user, in this article we rely on the secondary users themselves to determine their individual best sensing window length. This decentralized allocation avoids the introduction of signaling overhead in the system.
3. Power allocation problem formulation
We consider a large circular primary cell made up of one central primary emitter and several primary receivers whose positions are unknown. The primary emitter could be a DTV broadcasting station that communicates with multiple passive receivers.
In order to protect the primary receivers from receiving harmful interference from the secondary users, a protection contour is defined around the primary emitter as a circle on which the received primary SINR must be superior to a given threshold ${\mathsf{\text{SINR}}}_{\mathsf{\text{Th}}}^{p}$. The secondary cells are located around the protection contour. As the radius of the primary cell is assumed to be much larger than the radius of the secondary cells, the protection contour can be approximated by a line parallel to the line of secondary base stations.
The secondary network is assumed to follow a Time Division Multiple Access (TDMA) scheme, so that at any given time only one secondary user SU_{ l } communicates with its base station BS_{ l } in cell l (l ∈ {1, . . ., L}). The difference between the abscissas of SU_{ l } and BS_{ l } is denoted x_{ l }. The point on the protection contour whose distance to SU_{ l } is minimal is denoted I_{ l }. We assume that each cell l deploys sensors on the protection contour so that it is able to measure the primary network SINR at the point I_{ l }, denoted ${\mathsf{\text{SINR}}}_{l}^{p}$.
In this article, the analysis is focused on the interference generated by the upstream transmissions of the secondary users. It is assumed that the secondary SINR at each base station l, denoted ${\mathsf{\text{SINR}}}_{l}^{s}$, needs to be superior to a given threshold ${\mathsf{\text{SINR}}}_{\mathsf{\text{Th}}}^{s}$ for the secondary communication to be reliable.
It is observed that the cost decreases with respect to ${\mathsf{\text{SINR}}}_{l}^{s}$ until ${\mathsf{\text{SINR}}}_{l}^{s}$ reaches the threshold value ${\mathsf{\text{SINR}}}_{\mathsf{\text{Th}}}^{s}$, then the cost increases with respect to ${\mathsf{\text{SINR}}}_{l}^{s}$. This should prevent secondary users from selfishly transmitting with a power higher than required, which would remove transmission opportunities for other secondary users.
where P^{ p }is the power that is received on the protection contour from the primary transmitter, σ^{2} is the noise power and ${h}_{{\mathsf{\text{I}}}_{l}}^{\mathsf{\text{S}}{\mathsf{\text{U}}}_{k}}$ is the link gain between SU_{ k }and the point I_{ l } on the protection contour.
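The quantities defined above can be combined to illustrate why aggregated interference matters. The sketch below assumes that the primary SINR at a contour point is the received primary power divided by the noise power plus the sum of the secondary powers weighted by their link gains to that point, which is consistent with the expression of P_{max} given in Section 5.2. The numerical values of the noise power, link gains and transmit powers are hypothetical and chosen only for illustration.

```python
def db_to_linear(value_db: float) -> float:
    """Convert a ratio in dB (or a power in dBm) to linear scale (or mW)."""
    return 10.0 ** (value_db / 10.0)


def primary_sinr(p_primary_mw, noise_mw, tx_powers_mw, gains_to_point):
    """Primary SINR at a contour point under aggregated secondary interference.

    Assumed form: P^p / (sigma^2 + sum_k P_k * h_k), with h_k the link gain
    between secondary user k and the contour point.
    """
    interference = sum(p * h for p, h in zip(tx_powers_mw, gains_to_point))
    return p_primary_mw / (noise_mw + interference)


p_primary = db_to_linear(0.0)   # P^p = 0 dBm, i.e. 1 mW
threshold = db_to_linear(20.0)  # primary SINR threshold of 20 dB
noise = 1e-3                    # hypothetical noise power in mW
gain = 1e-6                     # hypothetical link gain to the contour point
power = 6000.0                  # hypothetical secondary transmit power in mW

alone = primary_sinr(p_primary, noise, [power], [gain])
together = primary_sinr(p_primary, noise, [power, power], [gain, gain])
print(alone >= threshold, together >= threshold)  # True False
```

With these illustrative numbers, each secondary user meets the protection constraint when transmitting alone, but the two users violate it when transmitting simultaneously.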
where ${h}_{{\mathsf{\text{BS}}}_{l}}^{{\mathsf{\text{SU}}}_{l}}$is the link gain between SU_{ l }and BS_{ l }.
where r_{ s } is the radius of the secondary cells, f_{ c } is the transmission frequency and c is the speed of light in vacuum.
4. Learning algorithm
4.1. Q-learning algorithm
In this article, we use two multi-agent Q-learning algorithms. The first one is used to allocate the secondary user sensing times and the second one is used to allocate the secondary user transmission powers. In the sensing time allocation algorithm, each secondary user is an agent that aims to learn an optimal sensing time allocation policy for itself. In the power allocation algorithm, each secondary base station is an agent that aims to learn an optimal power allocation policy for its cell.
- 1) The agent senses the state $s\in \mathcal{S}$ of the environment.
- 2) Based on s and its accumulated knowledge, the agent chooses and performs an action $a\in \mathcal{A}$.
- 3) Because of the performed action, the state of the environment is modified. The new state is denoted s'. The transition from s to s' generates a cost c ∈ ℝ for the agent.
- 4) The agent uses c and s' to update the accumulated knowledge that made it choose the action a when the environment was in state s.
where ϵ is the randomness for exploration of the learning algorithm.
where α is the learning rate and γ is the discount rate of the algorithm.
The learning rate α ∈ [0, 1] is used to control the linear blend between the previously accumulated knowledge about the (s, a) couple, Q(s, a), and the newly received quality information $\left(-c+\gamma \max_{a\prime \in \mathcal{A}}Q\left(s\prime ,a\prime \right)\right)$. A high value of α gives little importance to previous experience, while a low value of α gives an algorithm that learns slowly, as the stored Q-values are only slightly altered by new information.
The discount rate γ ∈ [0, 1] is used to control how much the success of a later action a' should be brought back to the earlier action a that led to the choice of a'. A high value of γ gives a low importance to the cost of the current action compared to the Q-value of the new state this action leads to, while a low value of γ would rate the current action almost only based on the immediate cost it incurs.
The randomness for exploration ϵ ∈ [0, 1] is used to control how often the algorithm should take a random action instead of the best action it knows. A high value of ϵ favors exploration of new good actions over exploitation of existing knowledge, while a low value of ϵ reinforces what the algorithm already knows instead of trying to find new better actions. The exploration-exploitation trade-off is typical of learning algorithms. In this article, we consider online learning (i.e., at every time step the agents should display intelligent behaviors) which requires a low ϵ value.
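The four steps and the roles of α, γ and ϵ described above can be sketched as a small tabular agent. This is a minimal illustration that assumes the standard Q-learning update written as a linear blend, Q(s, a) ← (1 - α) Q(s, a) + α (-c + γ max Q(s', a')), and an ϵ-greedy action choice; the class name `QAgent` and its method names are hypothetical.

```python
import random
from collections import defaultdict


class QAgent:
    """Tabular Q-learning agent with epsilon-greedy exploration (sketch)."""

    def __init__(self, actions, alpha=0.5, gamma=0.7, epsilon=0.1, seed=None):
        self.actions = list(actions)
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount rate
        self.epsilon = epsilon  # randomness for exploration
        self.q = defaultdict(float)  # Q-values, 0 for unseen (state, action)
        self.rng = random.Random(seed)

    def choose_action(self, state):
        # With probability epsilon take a random action, otherwise the best known one.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, cost, next_state):
        # Blend the stored Q-value with the new information -c + gamma * max Q(s', a').
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        target = -cost + self.gamma * best_next
        old = self.q[(state, action)]
        self.q[(state, action)] = (1.0 - self.alpha) * old + self.alpha * target


agent = QAgent(actions=[0, 1], alpha=0.5, gamma=0.7, epsilon=0.0)
agent.update(state="s", action=0, cost=2.0, next_state="s2")
print(agent.q[("s", 0)])  # -1.0
```

Because costs enter the update with a negative sign, an action that incurs a cost sees its Q-value decrease, and the greedy choice then favors the other actions.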
4.2. Q-Learning implementation for sensing time allocation
Each secondary user is an agent in charge of sensing the environment state, selecting an action according to policy (16), performing this action, sensing the resulting new environment state, computing the induced cost and updating the state-action Q-value according to rule (17). In this section, we specify the states, actions and cost function used to solve the sensing time allocation problem.
where ${n}_{{H}_{0},t-1}$ denotes the number of time slots that have been identified as free by the base station during the (t -1)th learning period.
This last cost function penalizes the actions that lead to a realized average throughput that is higher than required, which should help the disadvantaged nodes (i.e., the nodes that have a low data rate ${C}_{{H}_{0},j}$) to achieve the required average throughput.
4.3. Q-Learning implementation for distributed power allocation
Each secondary BS is an agent in charge of sensing the environment state, selecting an action according to policy (16), performing this action, sensing the resulting new environment state, computing the induced cost and updating the state-action Q-value according to rule (17). In this section, we specify the states, actions, and cost function used to solve the power allocation problem.
where P_{min} and P_{max} are the minimum and maximum effective radiated powers (ERP) in dBm.
At each iteration t, the action selected by the base station BS_{ l }is the power to allocate to the currently transmitting secondary user SU_{ l }. The set of all possible actions is therefore given by Equation (27).
where +∞ represents a positive constant that is chosen large enough compared to ${({\mathsf{\text{SINR}}}_{l,t}^{s}-{\mathsf{\text{SINR}}}_{\mathsf{\text{Th}}}^{s})}^{2}$. This last cost function penalizes the actions that lead to a secondary SINR that is higher than required, which should help the disadvantaged secondary cells (i.e., the cells in which the transmission distance |x_{ l, t }| and/or the aggregated interference ${\sum}_{k=1,k\ne l}^{N}{P}_{k}{h}_{\mathsf{\text{B}}{\mathsf{\text{S}}}_{l}}^{\mathsf{\text{S}}{\mathsf{\text{U}}}_{k}}$ is high) to achieve the required secondary SINR threshold.
Finally, three exploration strategies are compared in this article. These three exploration strategies are characterized by the same average randomness for exploration $\bar{\epsilon}$.
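The conclusion of this article names the three strategies as a constant exploration parameter, a linearly decreasing parameter, and an alternation between full exploration and full exploitation. The sketch below shows one way to build three such schedules with the same average value; the exact parameterization (a linear decrease from twice the average down to zero, and an alternation period of 10 iterations) is an assumption made for illustration and is not taken from the article.

```python
def constant_epsilon(t: int, horizon: int, mean_eps: float) -> float:
    """Constant exploration parameter."""
    return mean_eps


def linear_epsilon(t: int, horizon: int, mean_eps: float) -> float:
    """Linear decrease from 2 * mean_eps at t = 0 to 0 at t = horizon - 1.

    Assumes horizon > 1 and 2 * mean_eps <= 1.
    """
    return 2.0 * mean_eps * (1.0 - t / (horizon - 1))


def alternating_epsilon(t: int, horizon: int, mean_eps: float, period: int = 10) -> float:
    """Full exploration (1.0) for a fraction mean_eps of each period, else 0.0."""
    explore_steps = round(mean_eps * period)
    return 1.0 if (t % period) < explore_steps else 0.0


horizon, mean_eps = 1000, 0.1
for schedule in (constant_epsilon, linear_epsilon, alternating_epsilon):
    average = sum(schedule(t, horizon, mean_eps) for t in range(horizon)) / horizon
    print(schedule.__name__, round(average, 6))
```

All three schedules average to 0.1 over the horizon, so differences in convergence speed can be attributed to how the exploration is distributed in time rather than to its total amount.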
Note that for both the sensing time and power allocation problems, the agents have an imperfect knowledge of the state of the environment. The state represented by an agent at each iteration of the Q-learning algorithm is actually an imperfect estimation of the environment state. In this case, the convergence demonstration of single agent Q-learning [18] does not hold. However, multi-agent Q-learning algorithms have been successfully applied in multiple scenarios [11] and in particular to cognitive radios [10, 12, 19]. Numerical results will show that both Q-learning algorithms presented in this article converge as well.
5. Numerical results
5.1. Sensing time allocation algorithm
Unless otherwise specified, the following simulation parameters are used: we consider N = 2 nodes able to transmit at a maximum data rate ${C}_{{H}_{0},1}={C}_{{H}_{0},2}=0.6$. They each require a data rate ${\bar{R}}_{1}={\bar{R}}_{2}=0.1$. One node has a sensing channel characterized by γ_{1} = 0 dB and the second one has a poorer sensing channel characterized by γ_{2} = -10 dB.
It is assumed that the primary network transition probabilities are p_{00} = 0.9, p_{01} = 0.1, p_{10} = 0.2, and p_{11} = 0.8. The target detection probability is ${\bar{P}}_{D}=0.95$.
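For these transition probabilities, the long-run fraction of time slots in which the primary network is absent follows from the stationary distribution of the two-state Markov chain, π_{0} = p_{10} / (p_{01} + p_{10}). The sketch below computes this value and checks it against a short simulation; the function names and the simulation length are illustrative choices.

```python
import random


def stationary_free_probability(p01: float, p10: float) -> float:
    """Stationary probability of H0 (primary absent) for a two-state chain."""
    return p10 / (p01 + p10)


def simulate_free_fraction(p01: float, p10: float, n_slots: int, seed: int = 0) -> float:
    """Empirical fraction of slots spent in H0, starting from H0."""
    rng = random.Random(seed)
    state = 0  # 0 stands for H0 (absent), 1 for H1 (present)
    free_slots = 0
    for _ in range(n_slots):
        if state == 0:
            free_slots += 1
            state = 1 if rng.random() < p01 else 0
        else:
            state = 0 if rng.random() < p10 else 1
    return free_slots / n_slots


pi_free = stationary_free_probability(p01=0.1, p10=0.2)
print(round(pi_free, 4))  # 0.6667
```

With p_{01} = 0.1 and p_{10} = 0.2, the band is therefore free about two thirds of the time.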
We consider s = 10 samples per time slot and r = 100 time slots per learning period. The Q-learning algorithm is implemented with a learning rate α = 0.5 and a discount rate γ = 0.7. The chosen exploration strategy consists in using ϵ = 0.1 during the first K/2 iterations and then ϵ = 0 during the remaining K/2 iterations.
Average sensing window lengths and realized throughputs obtained with the competitive and cooperative cost functions
Parameter or result | Configuration 1 | Configuration 2 | Configuration 3
---|---|---|---
${C}_{{H}_{0},1}$ | 0.6 | 0.6 | 1.0
${C}_{{H}_{0},2}$ | 0.6 | 0.6 | 0.2
γ_{1} | -5 dB | 0 dB | -5 dB
γ_{2} | -5 dB | -10 dB | -5 dB
Competitive: M_{1}, M_{2} | 2.3, 2.0 | 3.3, 0.67 | 2.5, 2
Competitive: ${\widehat{R}}_{1}$, ${\widehat{R}}_{2}$ | 0.0378, 0.0397 | 0.0556, 0.0780 | 0.0635, 0.0133
Cooperative: M_{1}, M_{2} | 3.8, 3.7 | 3.7, 1.8 | 3.3, 2.3
Cooperative: ${\widehat{R}}_{1}$, ${\widehat{R}}_{2}$ | 0.0423, 0.0432 | 0.0602, 0.0779 | 0.0640, 0.0147
After convergence of the algorithm, if the value of the local SNR γ_{1} decreases from 0 dB to -10 dB, the algorithm requires an average of 1200 iterations before converging to the new optimal solution M_{1} = M_{2} = 1. According to Equation (17), each Q-learning iteration requires four additions and five multiplications per node. This result can be compared with the complexity of the centralized allocation algorithm which must be solved numerically. By using a constant step gradient descent optimization algorithm to solve the centralized allocation problem, it was measured that the convergence occurred after an average of four iterations. At each iteration of the algorithm, the partial derivatives of the cost function with respect to the sensing times are evaluated. It can be shown that 18N - 1 multiplications and 8N - 1 additions are needed for this evaluation. As a result, the centralized allocation algorithm will have a lower computational complexity per node than the Q-learning algorithm. The main advantage of the Q-learning algorithm is therefore the minimization of control information sent between the secondary nodes and the coordinator node.
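The operation counts quoted above can be tallied as follows. This is a rough sketch that weights additions and multiplications equally, which is a simplifying assumption; the function names are illustrative.

```python
def q_learning_ops_per_node(iterations: int, additions: int = 4, multiplications: int = 5) -> int:
    """Total arithmetic operations per node for the Q-learning algorithm."""
    return iterations * (additions + multiplications)


def centralized_ops(iterations: int, n_nodes: int) -> int:
    """Total arithmetic operations for the gradient descent allocation.

    Each iteration evaluates the partial derivatives of the cost function:
    18N - 1 multiplications and 8N - 1 additions.
    """
    multiplications = 18 * n_nodes - 1
    additions = 8 * n_nodes - 1
    return iterations * (multiplications + additions)


print(q_learning_ops_per_node(iterations=1200))   # 10800
print(centralized_ops(iterations=4, n_nodes=2))   # 200
```

For N = 2 nodes, the centralized algorithm thus needs about 200 operations in total, against about 10800 operations per node for the Q-learning algorithm to converge again after the change of SNR.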
5.2. Power allocation algorithm
The performance of the Q-learning algorithm presented in Section 4 is evaluated by comparison with the optimal centralized power allocation scheme in which a base station having a perfect knowledge of the environment chooses the optimal transmission powers each time there is a change in the environment (i.e., whenever a TDMA time slot ends in any of the L cells). The optimal allocated powers are determined by selecting the transmission powers (P_{1}, . . ., P_{ L }) ∈ Ψ ^{ L } that maximize Equation (13) under the constraints given in Equation (12).
where ${\mathsf{\text{SINR}}}_{t,l}^{s}$ denotes the secondary SINR measured at iteration t at BS_{ l }in the distributed learning scenario and ${\widehat{\mathsf{\text{SINR}}}}_{t,l}^{s}$ denotes the secondary SINR measured at iteration t at BS_{ l }in the optimal centralized scenario.
The performance is evaluated for L = 2 secondary cells with a radius r_{ s } = 15 km. The received power from the primary emitter on the protection contour is P^{ p } = 0 dBm. Both the primary and the secondary network use a frequency f_{ c } = 2.45 GHz. The minimum acceptable primary SINR on the protection contour is ${\mathsf{\text{SINR}}}_{\mathsf{\text{Th}}}^{p}=20\,\mathsf{\text{dB}}$. The desired secondary SINR at the base stations is ${\mathsf{\text{SINR}}}_{\mathsf{\text{Th}}}^{s}=3\,\mathsf{\text{dB}}$. The secondary users are allocated powers ranging from P_{min} = 0 dBm to ${P}_{max}=\frac{1}{{h}_{{I}_{l}}^{\mathsf{\text{S}}{\mathsf{\text{U}}}_{l}}}\left(-{\sigma}^{2}+\frac{{P}^{p}}{{\mathsf{\text{SINR}}}_{\mathsf{\text{Th}}}^{p}}\right)=66.4\,\mathsf{\text{dBm}}$.
The secondary transmission powers P_{ l, t } are quantized on ϕ = 15 levels and the local coordinates x_{ l, t } of the secondary users are quantized on ξ = 10 levels. The Q-learning algorithm is implemented with a learning rate α = 0.5, a discount rate γ = 0.9 and an average randomness for exploration $\bar{\epsilon}=0.1$.
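A discrete action set of this kind can be sketched as follows. The uniform spacing of the ϕ = 15 levels in dBm between P_{min} = 0 dBm and P_{max} = 66.4 dBm is an assumption made for illustration, since the exact quantization grid is not reproduced here; the function names are hypothetical.

```python
def power_levels(p_min_dbm: float, p_max_dbm: float, n_levels: int) -> list:
    """Uniformly spaced power levels in dBm (assumed grid), endpoints included."""
    step = (p_max_dbm - p_min_dbm) / (n_levels - 1)
    return [p_min_dbm + k * step for k in range(n_levels)]


def quantize(value: float, levels: list) -> float:
    """Map a value to the nearest level of the grid."""
    return min(levels, key=lambda level: abs(level - value))


actions = power_levels(p_min_dbm=0.0, p_max_dbm=66.4, n_levels=15)
print(len(actions))  # 15
```

The same `quantize` helper could be applied to the local coordinates x_{ l, t } with a grid of ξ = 10 levels, so that the Q-table of each base station holds at most ξ × ϕ entries per quantized power state.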
The complexity of the decentralized power allocation Q-learning algorithm can be compared to a reference gradient descent centralized power allocation algorithm, similarly to the analysis performed in Section 5.1. The conclusion is the same as for the sensing time allocation algorithm: the centralized allocation algorithm has a lower computational complexity than the decentralized Q-learning algorithm, whose main advantage is therefore that the base stations do not need to exchange control information.
6. Conclusion
In this article, we have proposed two decentralized Q-learning algorithms. The first one was used to solve the problem of the allocation of the sensing durations in a cooperative cognitive network in a way that maximizes the throughputs of the cognitive radios. The second one was used to solve the problem of power allocation in a secondary network made up of several independent cells, given a strict limit for the allowed aggregated interference on the primary network. Compared to a centralized allocation system, a decentralized allocation system is more robust, scalable and maintainable, and it minimizes the exchange of control information, at the price of a higher computational complexity as observed in Section 5.
Numerical results have demonstrated the need for an exploration strategy for the convergence of the sensing time allocation algorithm. It has also been observed that the strategy of keeping the exploration parameter constant in the power allocation algorithm is less efficient than using a linearly decreasing parameter or implementing an alternation between full exploration and full exploitation, the latter exploration policy leading to the fastest convergence of the power allocation algorithm.
It has furthermore been shown that the implementation of a cost function that penalizes the actions leading to a higher than required throughput in the sensing time allocation algorithm gives better results than the implementation of a cost function without such penalty. Similarly, the implementation of a cost function that penalizes the actions leading to a higher than required secondary SINR in the power allocation algorithm gives better results than the implementation of a cost function without such penalty.
Finally, it has been shown that there is an optimal tradeoff value for the frequency of execution of the sensing time allocation algorithm. The power allocation algorithm has been shown to converge faster when its frequency of execution increases, until the frequency reaches an upper bound where the increase of the convergence speed gets insignificant.
References
- 1. Jondral FK, Weiss TA: Spectrum pooling: An innovative strategy for the enhancement of spectrum efficiency. IEEE Radio Commun 2004, 42(3):S8-S14.
- 2. Aazhang B, Sendonaris A, Erkip E: User cooperation diversity. Part I: system description. IEEE Trans Commun 2003, 51(11):1927-1938.
- 3. Bazerque GB, Giannakis JA: Distributed spectrum sensing for cognitive radio networks by exploiting sparsity. IEEE Trans Signal Process 2010, 58(3):1847-1862.
- 4. Peh E, Liang Y-C, Zeng Y, Hoang AT: Sensing-throughput tradeoff for cognitive radio networks. IEEE Trans Wirel Commun 2008, 4(7):1326-1337.
- 5. Stotas S, Nallanathan A: Sensing time and power allocation optimization in wideband cognitive radio networks. In GLOBECOM 2010, 2010 IEEE Global Telecommunications Conference. Miami; 2010:1-5.
- 6. Beibei W, Liu KJR, Clancy TC: Evolutionary cooperative spectrum sensing game: how to collaborate? IEEE Trans Commun 2010, 58(3):890-900.
- 7. Shankar S, Cordeiro C: Analysis of aggregated interference at DTV receivers in TV bands. In Proceedings of the 3rd International Conference on Cognitive Radio Oriented Wireless Networks and Communications (CrownCom). Singapore; 2008:1-6.
- 8. Tandra R, Shellhammer SJ, Shankar S, Tomcik J: Performance of power detector sensors of DTV signals in IEEE 802.22 WRANs. In Proceedings of First International Workshop on Technology and Policy for Accessing Spectrum. Boston; 2006.
- 9. Dejonghe A, Bahai A, der Perre LV, Timmers M, Pollin S, Catthoor F: Accumulative interference modeling for cognitive radios with distributed channel access. In Proceedings of the 3rd International Conference on Cognitive Radio Oriented Wireless Networks and Communications (CrownCom). Singapore; 2008:1-7.
- 10. Galindo-Serrano A, Giupponi L: Distributed Q-learning for aggregated interference control in cognitive radio networks. IEEE Trans Veh Technol 2010, 59: 1823-1834.
- 11. Liviu P, Sean L: Cooperative multi-agent learning: the state of the art. Auton Agents Multi-Agent Syst 2005, 11(3):387-434.
- 12. Li H: Multi-agent Q-learning for competitive spectrum access in cognitive radio systems. 5th IEEE Workshop on Networking Technologies for Software Defined Radio Networks 2010, 1-6.
- 13. Urkowitz H: Energy detection of unknown deterministic signals. Proceedings of the IEEE 1967, 55: 523-531.
- 14. Digham FF, Alouini M-S, Simon MK: On the energy detection of unknown signals over fading channels. IEEE Trans Commun 2007, 55(1):21-24.
- 15. Ma J, Zhao G, Li Y: Soft combination and detection for cooperative spectrum sensing in cognitive radio networks. IEEE Trans Wirel Commun 2008, 7(11):4502-4507.
- 16. Liang Y-C, Zeng Y, Peh ECY, Hoang AT: Sensing-throughput tradeoff for cognitive radio networks. IEEE Trans Wirel Commun 2008, 7(4):1326-1337.
- 17. Millington I: Artificial Intelligence for Games. Morgan Kaufmann Publishers, San Francisco, CA; 2006:612-628.
- 18. Watkins C, Dayan P: Technical note: Q-learning. Mach Learn 1992, 8: 279-292. doi:10.1023/A:1022676722315
- 19. Wu C, Chowdhury K, Di Felice M, Meleis W: Spectrum management of cognitive radio using multi-agent reinforcement learning. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems: Industry track, AAMAS'10, (International Foundation for Autonomous Agents and Multiagent Systems). Richland, SC; 2010:1705-1712.
Copyright information
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.