Effective norm emergence in cell systems under limited communication
Abstract
Background
The cooperation of cells in biological systems resembles that of agents in cooperative multiagent systems, so research findings from the multiagent systems literature can provide valuable inspiration for biological research. The well-coordinated states in cell systems can be viewed as desirable social norms in cooperative multiagent systems. One important research question is how a norm can rapidly emerge with limited communication resources.
Results
In this work, we propose a learning approach that trades off the agents’ performance in coordinating on a consistent norm against the communication cost involved. During the learning process, the agents dynamically adjust their coordination sets according to their own observations and pick out the most crucial agents to coordinate with. In this way, our method significantly reduces the coordination dependence among agents.
Conclusion
The experimental results show that our method can efficiently facilitate the emergence of social norms among agents and scales well to large-scale populations.
Keywords
Cell system, Cooperative multiagent system, Reinforcement learning, Social norms, Limited communication
Background
In research on cooperative MASs, social norms play an important role in regulating agents’ behaviors to ensure coordination among agents. For example, in everyday life we drive on the left (or right) according to the traffic rules. In biological systems, this corresponds to converging on well-coordinated states for better survival. In biology, different cells serve different functions, and cells must coordinate these functions to ensure that the overall biological system works correctly.
Many studies have investigated biological systems composed of cells and environments via modeling and simulation [1, 12]. If we regard cells in a biological system as agents in a multiagent system, the well-coordinated states among cells can be viewed as social norms in multiagent systems. Thus, investigating how social norms can emerge efficiently among agents in multiagent systems provides valuable insights for understanding how cells interact to achieve well-coordinated states. One commonly adopted description of a norm is a consistent equilibrium that all agents follow during interactions in which multiple equivalent equilibria may exist. Until now, significant effort has been devoted to the norm emergence problem [13, 14, 15, 16, 17, 18, 19, 20]. However, most existing approaches require significant communication and intensive computation.
Considering that communication between cells in biological systems is limited (to sending electrical or chemical signals), we develop a learning approach based on individual learning methods and a DCOP algorithm under limited communication bandwidth to facilitate norm emergence in agent societies. In many practical applications, although the agents may interact with many others over time to make better decisions, they usually only need to coordinate with the few agents that strongly affect their performance. Based on previous research [21, 22], we first define a criterion to measure the importance of different subgroups of neighbors by estimating the maximum potential utility each subgroup can bring. Based on this, each agent can estimate the utility loss due to the lack of coordination with any subgroup of agents. Furthermore, each agent dynamically selects the best subset of neighbors to coordinate with so as to minimize the utility loss. Finally, each agent trades off learning performance against communication cost by bounding the maximum miscoordination cost. Experimental results indicate that (1) with limited communication bandwidth and on different networks (e.g., regular, random, small-world, and scale-free networks), our method facilitates the emergence of norms more efficiently than existing approaches; (2) our method allows agents to trade off norm emergence performance against communication cost by adjusting the parameters; and (3) compared with previous methods, our method significantly reduces the communication cost among agents and results in efficient and robust norm emergence.
The remainder of this paper is organized as follows. “Methods” section first discusses the basic domain knowledge, then formally defines the single-state coordination problem and the notation, and finally presents the architecture and details of our method. “Results and discussion” section presents experimental evaluation results. Finally, we conclude in “Conclusion” section.
Methods
Game theory and Nash equilibrium
Game theory

players, the players of the game.

actions, the actions available to each player at each decision point.

payoffs, the feedback of making a decision and taking the selected action.

strategies, also called policies: a high-level plan to achieve the goal under conditions of uncertainty.
Normal form games
1,…,n, the n players of the game.

A_{ i }, a finite set of actions for each player i.

A, A=A_{1}×…×A_{ n } is the set of joint actions, where × is the Cartesian product operator.

R_{ i }, A_{1}×…×A_{ n }→R, the reward received by agent i for a joint action \(\vec a \in A\).

π_{ i }, A_{ i }→[ 0,1], the probability with which player i selects each action in A_{ i }.

pure strategy, π(a_{ k })=1 for some action a_{ k }, and π(a_{ j })=0 for all other actions j≠k.

mixed strategy, the probability of selecting each action follows some distribution. A pure strategy is a special case of a mixed strategy.
Nash equilibrium

Best Response:
when player 1 selects an action a_{1}, the best response of player 2 is to select the action that maximizes its reward, that is, \(a_{2}={\text{argmax}}_{a_{2}\in A_{2}}R_{2}(a_{1},a_{2})\).

Nash Equilibrium:
If each player has chosen a strategy and no player can benefit by changing its strategy while the other players keep theirs unchanged, then each player’s chosen action is a best response to the other players’ choices, and the current set of strategy choices together with the corresponding payoffs constitutes a Nash equilibrium.
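As an illustration, best responses and pure-strategy Nash equilibria of a two-player normal-form game can be checked programmatically. The helper names below are our own; the coordination game at the end, with two equivalent equilibria (two candidate “norms”), mirrors the multi-equilibrium setting discussed above.

```python
def best_responses(payoffs, player, opponent_action):
    """Pure-strategy best responses of `player` given the opponent's action.

    payoffs[(a1, a2)] = (reward_player1, reward_player2)
    """
    actions = sorted({a[player] for a in payoffs})

    def reward(a):
        joint = (a, opponent_action) if player == 0 else (opponent_action, a)
        return payoffs[joint][player]

    best = max(reward(a) for a in actions)
    return [a for a in actions if reward(a) == best]


def pure_nash_equilibria(payoffs):
    """Joint actions where each player's action is a best response to the other's."""
    return [
        (a1, a2) for (a1, a2) in payoffs
        if a1 in best_responses(payoffs, 0, a2)
        and a2 in best_responses(payoffs, 1, a1)
    ]


# A 2-action coordination game with two equivalent equilibria ("norms"):
coord = {
    ("L", "L"): (1, 1), ("L", "R"): (-1, -1),
    ("R", "L"): (-1, -1), ("R", "R"): (1, 1),
}
```

Here both (L, L) and (R, R) are Nash equilibria, so which one emerges as the norm depends on the learning dynamics.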
Reinforcement learning
Markov decision process

S, a finite set of states representing the state space.

A, a finite set of actions for the agent.

T, a state transition probability function, T:S×A×S→[ 0,1], which specifies the probability of transitioning from state s∈S to s^{′}∈S when action a∈A is taken by the agent. Hence, T(s,a,s^{′})=Pr(s^{′}|s,a).

R, a reward function \(R:S \times A \times S \rightarrow \mathbb {R}\), giving the immediate reward for being in state s∈S, taking action a∈A, and transitioning to state s^{′}∈S.
When the states, actions, transition function, and reward function are all known, we can use search methods (e.g., Monte Carlo Tree Search) to solve the problem; this is one class of reinforcement learning, namely model-based methods. The other class is model-free, in which the model is unknown.
Introduction of reinforcement learning
In simple terms, reinforcement learning (RL) is a class of methods in which the agent continuously interacts with the environment and, according to the feedback reward, dynamically adjusts its policy to maximize the expected long-term reward. By exploring the environment through trial and error, these methods gradually improve their performance and finally converge to an optimal policy. Trial and error and delayed reward are important characteristics of RL. RL methods always include four basic elements: (1) agent: the subject of learning and the object interacting with the environment; (2) environment: the environment the agents reside in (static or dynamic); (3) action space: the actions available to an agent in certain states (discrete or continuous); (4) feedback reward: a measure of the utility of an action in certain states.
Q-learning
$$ Q(s_{t},a_{t}) \leftarrow Q(s_{t},a_{t}) + \alpha\left[r_{t} + \gamma \max_{a^{\prime}}Q\left(s_{t+1},a^{\prime}\right) - Q(s_{t},a_{t})\right] $$
where α∈ [0,1] is the learning rate, r_{ t } is the immediate reward of taking a_{ t } in state s_{ t }, and γ∈[ 0,1] is the discount factor, which is usually set to 1 for a finite horizon. Q(s_{ t },a_{ t }) is the state-action value function, which represents the expected long-term accumulated reward when selecting action a_{ t } in state s_{ t }. A typical procedure of Q-learning is described in Algorithm 1.
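The standard tabular Q-learning procedure can be sketched as follows. The routine and the toy chain environment are our own illustration, not part of the paper; `step(s, a)` returns the next state, reward, and a termination flag.

```python
import random


def q_learning(n_states, n_actions, step, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration.

    `step(s, a) -> (s_next, reward, done)` defines the environment.
    """
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy action selection
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])
            s2, r, done = step(s, a)
            # TD target: bootstrap from the best next action unless terminal
            target = r if done else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q


# Toy 3-state chain: action 1 moves right (reward 1 at the end), action 0 stays.
def chain_step(s, a):
    if a == 1:
        return s + 1, (1.0 if s + 1 == 2 else 0.0), s + 1 == 2
    return s, 0.0, False
```

On this chain the learned values prefer moving right in every state, with the value at state 1 approaching the terminal reward.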
Topology of networks
Regular network
Random network
Small world network
Scale free network
Coordination problem

n, number of agents.

A_{ i }, the action space of each agent i.

S_{ i }, the state space of each agent i; each agent has only one state here, meaning there is no state transition.

r_{ i }, the immediate reward of agent i.

π_{ i }, the policy of agent i, mapping its state to an action a_{ i }.

\(\vec A\), \(\vec A=A_{1} \times... \times A_{n}\), the joint action space of all agents.

\(\vec S\), the joint state space of all agents.

Q_{ i }(s,a), the local expectation of the discounted reward for agent i selecting action a in state s.

\(Q(\vec s,\vec a)\), the global expected reward of selecting joint action \(\vec a\) in joint state \(\vec s\).

τ(i), all neighbors of agent i.

CS(i), the coordination set of agent i, and agent i should coordinate its action selection with the agents in CS(i), CS(i)⊆τ(i).

NC(i), the neighbors of agent i that are not in CS(i), NC(i)=τ(i)∖CS(i).

CG, coordination graph which is composed of the CS of all agents.
Coordinated learning with controlled interaction
Coordination graph
Cooperative Q-learning
What remains unknown in Eq. (6) is the optimal action \(a_{i}^{*}\) for each agent i. Since enumerating all combinations of \(\vec a^{*}\) is intractable, we use the message-passing DCOP algorithm to find the optimal action \(a_{i}^{*}\) for each agent i in the next section.
Coordinated action selection
The algorithm for each agent i to obtain the optimal action \(a_{i}^{*}\) is described in Algorithm 2. For more details on max-plus, refer to the paper by J. R. Kok and N. Vlassis [21].
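A simplified sketch of max-plus message passing on a coordination graph follows the general scheme of Kok and Vlassis [21]. All names are ours, agent-local payoff terms are omitted for brevity, and messages are mean-normalized to stay bounded; on tree-structured graphs the procedure recovers the exact joint optimum.

```python
def max_plus(nodes, edges, edge_payoff, n_actions, iters=20):
    """Approximate the joint action maximizing the sum of edge payoffs.

    edge_payoff[(i, j)][(ai, aj)] is the local payoff of edge (i, j).
    """
    # mu[(i, j)][aj]: message from agent i to agent j about j's actions
    mu = {(i, j): [0.0] * n_actions for (i, j) in edges}
    mu.update({(j, i): [0.0] * n_actions for (i, j) in edges})
    neigh = {i: [] for i in nodes}
    for (i, j) in edges:
        neigh[i].append(j)
        neigh[j].append(i)

    def payoff(i, j, ai, aj):
        if (i, j) in edge_payoff:
            return edge_payoff[(i, j)][(ai, aj)]
        return edge_payoff[(j, i)][(aj, ai)]

    for _ in range(iters):
        new = {}
        for (i, j) in mu:
            # mu_ij(aj) = max_ai [ f_ij(ai, aj) + sum of incoming messages to i except from j ]
            msg = [max(payoff(i, j, ai, aj)
                       + sum(mu[(k, i)][ai] for k in neigh[i] if k != j)
                       for ai in range(n_actions))
                   for aj in range(n_actions)]
            mean = sum(msg) / n_actions
            new[(i, j)] = [v - mean for v in msg]  # normalize messages
        mu = new

    # each agent picks the action maximizing the sum of its incoming messages
    return {i: max(range(n_actions),
                   key=lambda ai: sum(mu[(k, i)][ai] for k in neigh[i]))
            for i in nodes}


# Example: three agents on a line; agreeing on action 1 is the better norm.
joint = max_plus(
    nodes=[0, 1, 2], edges=[(0, 1), (1, 2)],
    edge_payoff={(0, 1): {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.0},
                 (1, 2): {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.5}},
    n_actions=2)
```

Because this example graph is a tree, message passing converges and all three agents settle on action 1, the joint action with the highest total payoff.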
Coordination set selection: random
For large problems, the number of messages passed in the network is directly proportional to the number of edges of the CG, but communication is limited. To reduce the number and frequency of communications, we need to eliminate some non-critical edges of the CG without significantly affecting system performance. In this subsection, we define two different methods to minimize the communication cost.

Random agents: for each agent i, during the learning process, only δ percent of its neighbors τ(i) are selected as CS(i).

Random agents with decay: we first initialize δ=δ_{0}. During the learning process, we randomly select δ percent of the neighbors τ(i) as CS(i) for each agent i at each decision point, and then decrease δ by some small decay (e.g., δ=δ−0.01). Over time, δ becomes smaller and smaller until it reaches a specified minimum value (e.g., 0).
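The decaying random selection above can be sketched as a generator that yields a fresh coordination set at each decision point. Names and default values are illustrative, not from the paper.

```python
import random


def random_cs_with_decay(neighbors, delta0=1.0, decay=0.01,
                         min_delta=0.0, seed=0):
    """Yield a random coordination set CS(i) of ~delta percent of the neighbors
    at each decision point, decaying delta toward min_delta over time."""
    rng = random.Random(seed)
    delta = delta0
    while True:
        k = round(delta * len(neighbors))
        yield rng.sample(neighbors, k)          # random subset of size k
        delta = max(min_delta, delta - decay)   # shrink delta each step
```

With δ_0 = 1 every neighbor is initially in the coordination set, and the set shrinks steadily as δ decays.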
Coordination set selection: loss rate
To reduce communication without significantly affecting system performance, we need to quantify the difference between communicating with an agent or not. For this purpose, we divide the neighbors τ(i) of each agent i into two groups, CS(i) and NC(i), as mentioned before. Each agent i only has to communicate with the agents in CS(i) to coordinate their actions.
Obviously, if CS_{1}(i)⊆CS_{2}(i)⊆τ(i), then for any action a_{ i }, PV_{ i }(a_{ i },CS_{1}(i))≤PV_{ i }(a_{ i },CS_{2}(i)).
It is easy to verify that (1) if NC_{1}(i)⊆NC_{2}(i)⊆τ(i), then PL_{ i }(NC_{1}(i))≤PL_{ i }(NC_{2}(i)); (2) PL_{ i }(∅)=0; (3) for each NC(i)⊆τ(i), 0≤PL_{ i }(NC(i))≤PL_{ i }(τ(i)).
Each agent i then selects the best coordination set CS(i) according to PL(τ(i)∖CS(i)) to minimize the utility loss. The algorithm is described in Algorithm 3, where δ is the predefined loss rate. When δ=0, each agent i coordinates with all neighbors; when δ=1, each agent i does not coordinate with any agent at all.
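Assuming, for illustration only, that the potential loss is additive over neighbors (the paper defines PL via the potential values PV, whose exact form is given in the equations above), the loss-rate selection of Algorithm 3 can be sketched as a greedy procedure. `pair_gain[j]` is a hypothetical per-neighbor estimate of the utility lost by not coordinating with j.

```python
def select_coordination_set(neighbors, pair_gain, delta):
    """Greedy sketch of loss-rate coordination set selection.

    Drops the least important neighbors into NC(i) while the accumulated
    loss stays within delta * PL(tau(i)); the rest form CS(i).
    """
    budget = delta * sum(pair_gain[j] for j in neighbors)
    nc, loss = [], 0.0
    # consider the least important neighbors first
    for j in sorted(neighbors, key=lambda j: pair_gain[j]):
        if loss + pair_gain[j] <= budget:
            nc.append(j)
            loss += pair_gain[j]
    cs = [j for j in neighbors if j not in nc]
    return cs, nc
```

Consistent with the text, δ=0 keeps every (positive-gain) neighbor in CS(i), while δ=1 allows every neighbor to be dropped.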
Learning processes with emergent coordination
Combining cooperative Q-learning, coordinated action selection, and coordination set selection, the cooperative learning process is described in Algorithm 4.
Results and discussion
In this section, we evaluate the performance of our algorithm on a large single-state problem. First, we give the common settings of the problem. Then, we compare the norm emergence performance of our algorithm with existing approaches. Finally, we explore the effects of some important parameters and the performance of the different coordination set selection methods proposed in “Coordination set selection: random” and “Coordination set selection: loss rate” sections.
Large-scale single-state problems
There is only one state for each agent, and the reward function is defined in “Coordination problem” section (see Fig. 2 for an example). The goal of the agents is to learn and select a joint action that maximizes the global reward. In the following subsections, unless otherwise stated, we consider 100 agents playing a 10-action coordination game in which 10 norms exist, with the agents distributed in a small-world network. The average connection degree of the graph is set to 6.
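The small-world topology used in the experiments can be generated with the Watts-Strogatz procedure: a ring lattice of degree k whose edges are each rewired with probability beta. The following pure-Python sketch is our own illustration, not the paper's generator.

```python
import random


def watts_strogatz(n, k, beta, seed=0):
    """Small-world network: ring lattice of degree k with random rewiring.

    Returns a set of undirected edges as sorted (i, j) tuples.
    """
    rng = random.Random(seed)
    # ring lattice: each node linked to its k nearest neighbors
    edges = set()
    for i in range(n):
        for d in range(1, k // 2 + 1):
            edges.add(tuple(sorted((i, (i + d) % n))))
    rewired = set()
    for (i, j) in sorted(edges):
        if rng.random() < beta:
            # rewire the far endpoint, avoiding self-loops and duplicates
            choices = [m for m in range(n)
                       if m != i and tuple(sorted((i, m))) not in edges
                       and tuple(sorted((i, m))) not in rewired]
            if choices:
                j = rng.choice(choices)
        rewired.add(tuple(sorted((i, j))))
    return rewired
```

With n=100 and k=6 this yields 300 edges, i.e., the average connection degree of 6 used in the experiments.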
Norm Emergence Performance
 Independent Learners (IL): each agent i uses independent Q-learning and adjusts its policy depending only on its own action and reward. The Q-function is updated according to Eq. (11).
$$ {\begin{aligned} Q_{i}(s,a_{i})&=Q_{i}(s,a_{i})\\ & \quad +\alpha\left[r\left(s,a,s^{\prime}\right)+ \gamma \max_{a_{i}^{\prime}}Q_{i}\left(s^{\prime}, a_{i}^{\prime}\right)-Q_{i}\left(s,a_{i}\right)\right] \end{aligned}} $$
(11)
Table 1
Parameter settings for “Norm Emergence Performance” section

Parameter name  Value

Agent number  50
Action number  2
Init explore rate  1.0
Delta explore rate  0.004 (IL: 0.04)
Init learning rate  1.0
Delta learning rate  0.0005
Min learning rate  0.6
Message differ (max-plus)  0.00001
Message sent deadline (max-plus)  5
 Distributed Value Functions (DVF): each agent i records a local Q-function based on its own action and reward, and updates it by incorporating the neighbors’ Q-functions following Eq. (12). f(i,j) is the contribution rate of agent j to agent i, here f(i,j)=1/|τ(i)|. For the stateless problem, we make an adjustment so that each agent selects its action considering the neighbors’ Q-functions, that is \(a_{i}^{*}={\text {argmax}}_{a\in A_{i}}\sum _{j\in \{i\} \cup \tau (i)}f(i,j) \max _{a_{j}^{\prime }}Q_{j}\left (s^{\prime }, a_{j}^{\prime }\right)\).
$$ {\begin{aligned} Q_{i}(s,a_{i})&=Q_{i}(s,a_{i})\\ &\quad +\alpha \left[r\left(s,a,s^{\prime}\right)+ \gamma \sum_{j\in \{i\} \cup \tau(i)}f(i,j) \max_{a_{j}^{\prime}}Q_{j}\left(s^{\prime}, a_{j}^{\prime}\right) - Q_{i}\left(s,a_{i}\right)\right] \end{aligned}} $$
(12)
Influence of key parameters
In this section, we investigate the influence of some key parameters on the norm emergence performance and the number of messages passed. The parameters of the compared algorithms are identical except for the parameter under comparison.
The influence of random parameter δ
Parameter settings for “The influence of random parameter δ” section
Parameter name  Value 

Agent number  100 
Action number  10 
Init explore rate  1.0 
Delta explore rate  0.003 
Init learning rate  1.0 
Delta learning rate  0.0005 
Min learning rate  0.6 
Init random rate  1, 0.5, 0.1, 0.05, 0.01 
Delta random rate  0.005, 0.002, 0.0004, 0.0002, 0.000004 
Message differ (max-plus)  0.00001 
Message sent deadline (max-plus)  5 
The influence of loss rate δ
Parameter settings for “The influence of loss rate δ” section
Parameter name  Value 

Agent number  100 
Action number  10 
Init explore rate  1.0 
Delta explore rate  0.004 
Init learning rate  1.0 
Delta learning rate  0.0005 
Min learning rate  0.6 
Loss rate  None, 0, 0.01, 0.1, 0.5, 0.7, 0.9 
Message differ (max-plus)  0.00001 
Message sent deadline (max-plus)  5 
The influence of population size n
Parameter settings for “The influence of population size n” section
Parameter name  Value 

Agent number  100,200,500,1000 
Action number  10 
Init explore rate  1.0 
Delta explore rate  0.004 
Init learning rate  1.0 
Delta learning rate  0.0005 
Min learning rate  0.6 
Message differ (max-plus)  0.00001 
Message sent deadline (max-plus)  5 
Conclusion
In this paper, we develop a framework based on the max-plus algorithm to accelerate norm emergence in large cooperative MASs. Under limited communication bandwidth, we propose two kinds of approaches to minimize the communication cost: random and deterministic. Random methods select the coordination set stochastically, while the deterministic methods identify the best coordination set for each agent by bounding the utility loss due to the lack of coordination. Both approaches significantly reduce the number of links in the coordination graph and result in less communication without deteriorating the learning performance. Experimental results show that our methods lead to better norm emergence performance across all kinds of networks compared with existing methods and scale well to large populations. Thus, our methods can efficiently accelerate social norm emergence under limited communication.
As future work, we will further investigate the performance of our methods in more complicated games, such as the Prisoner’s Dilemma, to better reflect the interaction dynamics in cell systems. We will also evaluate our algorithm in a simulated cell communication environment.
Acknowledgements
We thank the reviewers’ valuable comments for improving the quality of this work. We would also like to acknowledge Shuai Zhao (zhaoshuai@catarc.ac.cn, China Automotive Technology and Research Center, Tianjin, China) as an additional corresponding author of this article, who contributed to the overall design of the algorithmic framework and cosupervised the work.
Funding
The publication costs of this article was funded by Tianjin Research Program of Application Foundation and Advanced Technology (No.: 16JCQNJC00100), Special Program of Talents Development for High Level Innovation and Entrepreneurship Team in Tianjin and Comprehensive standardization of intelligent manufacturing project of Ministry of Industry and Information Technology (No.: 2016ZXFB01001) and Special Program of Artificial Intelligence of Tianjin Municipal Science and Technology Commission (No.: 17ZXRGGX00150).
Availability of data and materials
All data generated or analysed during this study are included in this published article.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 19 Supplement 5, 2018: Selected articles from the Biological Ontologies and Knowledge bases workshop 2017. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume19supplement5.
Authors’ contributions
XH contributed to the algorithm design and theoretical analysis. JH, LW and HH contributed equally to the quality control and document reviewing. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
 1. Kang S, Kahan S, Mcdermott J, Flann N, Shmulevich I. Biocellion: accelerating computer simulation of multicellular biological system models. Bioinformatics. 2014; 30(21):3101–8.
 2. Cheng L, Jiang Y, Wang Z, Shi H, Sun J, Yang H, Zhang S, Hu Y, Zhou M. Dissim: an online system for exploring significant similar diseases and exhibiting potential therapeutic drugs. Sci Rep. 2016; 6:30024.
 3. Cheng L, Sun J, Xu W, Dong L, Hu Y, Zhou M. OAHG: an integrated resource for annotating human genes with multi-level ontologies. Sci Rep. 2016; 6:34820.
 4. Hu Y, Zhao L, Liu Z, Ju H, Shi H, Xu P, Wang Y, Cheng L. Dissetsim: an online system for calculating similarity between disease sets. J Biomed Semant. 2017; 8(1):28.
 5. Hu Y, Zhou M, Shi H, Ju H, Jiang Q, Cheng L. Measuring disease similarity and predicting disease-related ncRNAs by a novel method. BMC Med Genet. 2017; 10(5):71.
 6. Peng J, Lu J, Shang X, Chen J. Identifying consistent disease subnetworks using DNet. Methods. 2017; 131:104–10.
 7. Peng J, Wang H, Lu J, Hui W, Wang Y, Shang X. Identifying term relations cross different gene ontology categories. BMC Bioinformatics. 2017; 18(16):573.
 8. Peng J, Xue H, Shao Y, Shang X, Wang Y, Chen J. A novel method to measure the semantic similarity of HPO terms. Int J Data Min Bioinforma. 2017; 17(2):173–88.
 9. Peng J, Zhang X, Hui W, Lu J, Li Q, Shang X. Improving the measurement of semantic similarity by combining gene ontology and co-functional network: a random walk based approach. BMC Syst Biol. 2018; 12(Suppl 2).
 10. Sycara KP. Multiagent systems. AI Mag. 1998; 19(2):79.
 11. Wooldridge M. An Introduction to Multiagent Systems. Chichester: Wiley; 2009.
 12. Torii M, Wagholikar K, Liu H. Detecting concept mentions in biomedical text using hidden Markov model: multiple concept types at once or one at a time? J Biomed Semant. 2014; 5(1):3.
 13. Sen S. Emergence of norms through social learning. In: International Joint Conference on Artificial Intelligence; 2007. p. 1507–12.
 14. Airiau S, Sen S, Villatoro D. Emergence of conventions through social learning. Auton Agent Multi-Agent Syst. 2014; 28(5):779–804.
 15. Yu C, Lv H, Ren F, Bao H, Hao J. Hierarchical learning for emergence of social norms in networked multiagent systems. In: Australasian Joint Conference on Artificial Intelligence. Springer; 2015. p. 630–43.
 16. Hao J, Sun J, Huang D, Cai Y, Yu C. Heuristic collective learning for efficient and robust emergence of social norms. In: Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems; 2015. p. 1647–8.
 17. Hao J, Leung HF, Ming Z. Multiagent reinforcement social learning toward coordination in cooperative multiagent systems. ACM Trans Auton Adapt Syst (TAAS). 2015; 9(4):20.
 18. Yang T, Meng Z, Hao J, Sen S, Yu C. Accelerating norm emergence through hierarchical heuristic learning. In: European Conference on Artificial Intelligence; 2016.
 19. Hao J, Huang D, Cai Y, Leung HF. The dynamics of reinforcement social learning in networked cooperative multiagent systems. Eng Appl Artif Intell. 2017; 58:111–22.
 20. Hao J, Sun J, Chen G, Wang Z, Yu C, Ming Z. Efficient and robust emergence of norms through heuristic collective learning. ACM Trans Auton Adapt Syst (TAAS). 2017; 12(4):23.
 21. Kok JR, Vlassis N. Collaborative multiagent reinforcement learning by payoff propagation. J Mach Learn Res. 2006; 7(1):1789–828.
 22. Li J, Qiu M, Ming Z, Quan G, Qin X, Gu Z. Online optimization for scheduling preemptable tasks on IaaS cloud systems. J Parallel Distrib Comput. 2012; 72(5):666–77.
 23. Guestrin C, Lagoudakis MG, Parr R. Coordinated reinforcement learning. In: Nineteenth International Conference on Machine Learning; 2002. p. 227–34.
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.