1. Introduction
Multiagent coordination in cooperative multiagent systems (MASs) is a significant and widely studied problem in the literature. In cooperative MASs, the agents share common interests defined by the same reward functions Sen and Airiau (2007). This requires the agents to have the capability of coordinating with each other effectively towards desirable outcomes.
Until now, a lot of research efforts have been devoted to addressing multiagent coordination problems in cooperative MASs Claus and Boutilier (1998); Panait and Luke (2005); Matignon et al. (2012); Lauer and Riedmiller (2000); Kapetanakis and Kudenko (2002); Matignon et al. (2008). One class of research focuses on coordination issues in fixedagent repeated interaction settings. Claus and Boutilier Claus and Boutilier (1998) firstly proposed two representative learning strategies  Independent Learner (IL) and JointAction Learner (JAL) to investigate their coordination performance in repeated twoagent cooperative games. Later a number of improved strategies Lauer and Riedmiller (2000); Kapetanakis and Kudenko (2002); Panait and Luke (2005); Matignon et al. (2008) have been proposed to achieve coordination more efficiently and overcome the noise introduced by the high miscoordination cost and stochasticity of games. Matignon et al. Matignon et al. (2012) extensively investigate the comparative performance of existing independent strategies in terms of their coordination efficiency in twoagent repeated cooperative games. Their results show that perfect coordination can hardly be achieved for fully stochastic games. However, in practical complex environments, the interactions between agents can be sparse, i.e., it is highly likely that each agent may not have global access and only have the opportunity to interact with local neighbors. Moreover, agent’s interacting partners are not fixed and may change frequently and randomly. For example, in social media Ellison et al. (2007a), it is commonly accepted that the user attention span is limited, hence it is of great importance to identify interactions that have optimal rewards. Thereby, another line of research focuses on investigating the question of how a population of cooperative agents can effectively coordinate among each other under the social learning framework Sen and Airiau (2007); Villatoro et al. (2011); Yu et al. (2013); Hao and Leung (2013); Hao et al. (2014, 2017); Airiau et al. (2014); Mihaylov et al. (2014). For example, Hao and Leung Hao and Leung (2013) extend IL and JAL into the social learning framework and find that better coordination performance can be achieved by leveraging the observation mechanism.
Most of existing works under the social learning framework assume that agents are located in a static network. During each round of interaction, each agent plays the same cooperative game with a randomly selected partners from its neighborhood. However, two important aspects of dynamics in realworld multiagent scenarios are currently missing. First, the cooperative games might be different for different pairs of agents due to a variety of reasons (e.g., the difference of agents’ capabilities and preferences, and the difference of interaction timing and locations) in practical scenarios. For example, in sensor networks Zhang and Lesser (2013), the overlapping scanning areas of each pair of sensors are of various importance and different rewards are obtained when specific scanning areas are coordinated between two senor agents. Another example is that in negotiation domain Baarslag and Gerding (2015)
, an agent may have different preferences on each offer provided by different opponents. Therefore, in this paper, we relax this assumption by assuming that the payoff matrix between each pair of agents is different which is generated from certain probability distribution, and the payoff information is unknown before their interaction.
Second, the network topologies can be dynamic. For example in social networks Ellison et al. (2007b); Kwak et al. (2010), it is common that users follow and unfollow other users on their own initiative due to individual preference and interest. Therefore, in this paper, we consider a dynamic environment where each agent is given the opportunities to change its connections through rewiring to interact with agents which may bring higher payoffs during the course of interactions. The rewiring mechanism has been previously used to explore indirect reciprocity Peleteiro et al. (2014) or cope with cheaters Griffiths and Luck (2010) to promote cooperations among agents in noncooperative environments such as games. In contrast, in this paper, we focus on cooperative environments where agents can utilize the rewiring mechanism to increase their benefits in longrun interactions by breaking connections to badperformance neighbors and replacing them with new connections.
In this work, during each round, each agent goes through two main phases: the rewiring phase and the interaction phase. In the rewiring phase, each agent is likely to be given the opportunity to alter their connections through rewiring for higher payoffs in future interactions. The key problem here is how to compute the optimal rewiring strategy to maximize the longrun interaction payoff against the network dynamics. To make efficient rewiring decisions, each agent has to take two aspects of uncertainties into consideration, i.e., the uncertainties of opponents’ behaviors and the payoff matrix with unknown peers. To this end, we model an agent’s rewiring problem as a sequential decisionmaking problem and propose an optimal rewiring approach for agents to select most beneficial peers among all reachable agents. In the interaction phase, agents learn their policies from pairwise interactions through playing certain cooperative game against randomly selected opponent from its neighborhood. Empirical results show that our optimal rewiring strategy outperforms other existing benchmark strategies in terms of agents’ average accumulated payoff and robustness against different environments. Besides, the relative performance of three representative learning strategies (i.e., Fictitious Play, JointAction Learner and JointAction WoLFPHC) is analyzed as well.
The remainder of the paper is organized as follows. In Section 2, we give an overview of related works. In Section 3, we give a formal description of our coordination problem in dynamic cooperative MASs. In Section 3, the social learning framework and both rewiring and learning strategies are described. In Section 4, we empirically demonstrate the efficiency of our rewiring approach and analyze the influence of three learning strategies under the social learning framework with rewiring. Lastly conclusion and future work are given in Section 5.
2. Related Work
There has been a large amount of research in the multiagent reinforcement learning literature on solving coordination problem in cooperative MASs One line of research focuses on solving coordination problem in fixedagent repeated interaction settings Claus and Boutilier (1998); Lauer and Riedmiller (2000); Kapetanakis and Kudenko (2002); Panait and Luke (2005); Matignon et al. (2008); Panait et al. (2006); Matignon et al. (2012). Claus and Boutilier Claus and Boutilier (1998) investigate the coordination performance of the Independent Learner and the JointAction learner, in the context of repeated twoagent cooperative games. It shows that both types of learners achieve success in simple cooperative games. Lauer and Riedmiller Lauer and Riedmiller (2000)
propose the distributed Qlearning algorithm base on the optimistic assumption. It is proved that optimal joint actions can be guaranteed to achieve if the cooperative game is deterministic. FMQ heuristic
Kapetanakis and Kudenko (2002) and recursive FMQ Matignon et al. (2008)are proposed to alter the Qvalue estimation function to handle the stochasticity of the games. FMQ obtains success in partially stochastic climbing games but still can not achieve satisfactory performance in fully stochastic cooperative games. Panait et al.
Panait et al. (2006) propose the lenient multiagent learning algorithm to overcome the noise introduced by the stochasticity of the games. The results show that coordination on the optimal joint action can be achieved in more than 93.5% of the runs in the fully stochastic climbing game. Matignon et al. Matignon et al. (2012) review all existing independent multiagent reinforcement learning algorithms and evaluate and discuss their strength and weakness. It shows that all of them fail to achieve perfect coordination for fully stochastic games and only recursive FMQ can achieve coordination for 58% of the runs.The other line of research considers the multiagent coordination problem under the social learning framework Sen and Airiau (2007); Villatoro et al. (2011); Yu et al. (2013); Hao and Leung (2013); Hao et al. (2014, 2017); Airiau et al. (2014); Mihaylov et al. (2014), where agents learn through pairwise interactions by playing the same game. Sen and Airiau Sen and Airiau (2007) propose the social learning model and investigate the emergence of consistent coordination policies (norms) in MASs (e.g., conflictinginterest games) under this social learning framework. Hao and Leung Hao and Leung (2013) investigate the multiagent coordination problems in cooperative environments under the social learning framework. They demonstrate individual action learners (IALs) and joint action learners (JALs) and achieve better coordination performance by leveraging the observation mechanism. Most of previous works assume that agents learn through playing the same cooperative game with randomly selected partners from their neighborhood within a static network. In contrast, in this paper, we consider a dynamic environment in which agents learn through pairwise interactions through playing uniquely and randomly generated cooperative games.
There are some existing works on using rewiring mechanism to support cooperation in social networks Peleteiro et al. (2014); Griffiths and Luck (2010); Hales and Edmonds (2005); Dekker (2007). Peleteiro et al. Peleteiro et al. (2014) propose a new mechanism based on three main pillars: indirect reciprocity, coalitions and rewiring, to improve cooperation among selfinterested agents placed in a complex network. Rewiring, as a part of the proposed mechanism, alters the worst social links with the best coalition members the neighbors’ reputation coalitions information. They confirm that the use of rewiring indeed improves cooperation when they play the donation game in their social scenario. Griffiths and Luck Griffiths and Luck (2010) present and demonstrate a decentralised tagbased mechanism to support cooperation in the presence of cheaters without requiring reciprocity. The simple rewiring enables agents to change their neighbour connections, in particular by removing connections to the worst neighbours and replacing them with connections to neighbours with whom others have had positive experiences. Their results show that cooperation can be improved by enabling agents to change their neighbour connections. In contrast, in this paper, we consider the multiagent coordination problem in cooperative MASs and how rewiring can increase the benefits of individual agents during longrun interactions under the social learning framework.
3. Problem Description
We consider a population of agents, each of which has (bidirectional) connections to its neighbors. We use to denote agent ’s neighborhood, such that each agent is only able to interact with its neighbors in . We model the strategic interaction among each pair of agents as a cooperative game, where each agent always receives the same payoff under the same outcome. One example of deterministic twoaction cooperative games is shown in Fig.1(a), in which there exist two optimal Nash equilibria. In this game, agents are desired to coordinate their actions towards a consistent Nash equilibrium to maximize their benefits.
Previous works Panait et al. (2006); Matignon et al. (2012); Hao and Leung (2013); Hao et al. (2014, 2015) have investigated the problem of how a population of agents could achieve coordination under the same cooperative environments (games) to maximize the system’s overall benefits. However, in general, the payoff matrix between any pair of agents may be different depending on the agents’ interacting environments. Following the setting in Damer and Gini (2008), we assume that each game is drawn independently from certain probability distribution to model agents’ interaction environments in a generalized way. Without loss of generality, a general form of 2action cooperative game between agent and is shown in Fig.1(b), where the value of (or ) is sampled from a stochastic variable (or ) following certain cumulative probability distribution (or ), and represents the payoff under miscoordination. The payoff matrix of any pair of agents is not known as a prior, which can be learned through repeated interactions. In our settings, we assume that each agent in the environment can observe the actions of its interaction neighbor at the end of each interaction.
Inspired from human society Ellison et al. (2007b); Kwak et al. (2010), agents should have the freedom of choosing whom they want to interact with. Thus, in this paper, we consider that the underlying interaction topology is dynamic: agents are allowed to break existing connections which they do not benefit from and establish new connections with nonneighbor agents through rewiring. Each rewiring is associate with certain cost and we use to denote the rewiring cost of agent when it establishes new connection with another agent . As in many scenarios, agents usually do not have access to all agents in the environment but only local access due to physical limitations, e.g. the distance of signal reception and delivery, the radius of human relation circle Crow et al. (1997); Kwak et al. (2010). Hence it is not feasible to allow each agent to be able to rewire any other agents in the environments, and here we assume that each agent is only allowed to establish connections with the set of reachable peers, which can be defined as the set of agents within certain distance of agent . We use to denote agent ’s reachable peers, consisting of for agent ’s neighborhood and for all nonneighbor agents that agent can potentially interact with through rewiring.
Moreover, unlimited increase of communication in realworld networks is usually impractical and the number of connections should have a upper bound due to finite connection resources Zhang and Lesser (2013). Each agent should break an old connection each time it decides to establish a new connection through rewiring.^{1}^{1}1This can be easily extended to the case of allowing multiple rewiring by ranking the expected payoff of reachable peers in the descending order. The interest of any rational agent is to maximize its accumulative payoff during the course of interactions. Therefore, a rational agent needs to balance the tradeoff between rewiring to explore more beneficial partners to interact with and exploiting the current connected neighbors to avoid rewiring cost.
4. Social Learning Framework
Under the social learning framework, there is a population of agents and each agent learns its policy through repeated pairwise interactions with its neighbors. The initial neighborhood of each agent is determined by the underlying topology and three representative topologies are considered here: Regular Network, Smallworld Network, ScaleFree Network Watts and Strogatz (1998); Albert and Barabási (2002). The overall interaction flow under the social learning framework is shown in Algorithm 1.
During each round, each agent first is given the opportunity of rewiring with probability . During the rewiring phase, agent breaks its connection with a neighbor with poorperformance and establishes a new connection with another agent from its potential partners (Line 24). Next, in interaction phase, agent interacts with a randomly selected neighbor agent from its neighborhood by playing their corresponding cooperative game (Line 5). After the interaction, agent and receive the corresponding payoff and update their policies and opponents’ models respectively (Line 67). The details of the rewiring and learning strategies will be introduced in following subsection.
4.1. An Optimal Rewiring Strategy
The goal of rewiring is to explore the set of unconnected peers (potential partners ) in the population and seek more beneficial partners to interact with. Each agent is faced with two aspects of uncertainties: the uncertainty of the behaviors of its neighbors and unknown peers; the uncertainty of the payoff matrix during interaction with unknown peers. To select the optimal agent to rewire, we need to evaluate the potential benefits of interacting with agent by taking the aforementioned uncertainties into consideration.
Before making a rewiring decision, each agent first needs to evaluate the benefits of interacting with each neighbor. One natural way is to use the optimistic assumption, i.e., computing the expected payoff of selecting different actions based on the estimated policy of a neighbor and picking the action with the highest expected value. In other words, agent evaluates the expected payoff of interacting with agent by using the highest expected payoff that can be received among all possible actions . Formally we have:
(1) 
where represents the explicit payoff when both players choose action by sampling from during historic interactions, and is for the miscoordination payoff. denotes agent ’s estimated probability of agent choosing action . The value of can be easily obtained as the empirical frequency distribution of agent ’s actions based on historical interactions and other advanced techniques such as weighting more on recent experiences may also be considered. Provided there is no need to rewire, we use agent ’s worstcase expected payoff among all the neighbors as its baseline utility. Formally we have:
(2) 
For each unknown potential partner to interact with, agent is faced with two aspects of uncertainties: agent ’s behavior, and the payoff matrix . For a previously unseen partner ’s behavior, it can be estimated using the observed expected behaviors from agent ’s existing neighborhood, while the payoff matrix has to be learned through repeated interactions after agent and are connected through rewiring. Therefore, agent ’s estimation of expected benefit from interacting with an unknown agent can be formalized as follows:
(3) 
where is a stochastic variable and
denotes the corresponding cumulative distribution function of
, and is the payoff under miscoordination. is agent ’s estimated probability of agent choosing action within its neighborhood. As no interaction has been made between agent and the unknown peer before, the distribution over agent ’s actions can not be observed a prior but estimated from the neighbors which once interact with agent .Intuitively, it is reasonable for agent to establish new connection with any unknown potential partner if rewiring leads to higher expected payoff. The difference between expected payoff and baseline value causes a incoming benefit which is promising to pay back the rewiring cost and to generate better payoffs in further interactions. Otherwise, agent should keep its current neighborhood unchanged. During each rewiring phase, each agent has to unlink a badperformance neighbor first before rewiring. Note that there is no need for agent to consider those agents already disconnected during previous rewiring, since the expected payoff of interacting with discarded agent is obviously below the current interaction baseline, i.e. .
Each agent’s situated environments are continuously changing as it breaks old connections and establishes new ones, which also influences the following interactions and rewiring decisions thereafter. Therefore, each agent is actually faced with a sequential rewiring decisionmaking problem. Formally, we propose modeling the above sequential rewiring problem for each agent
as an Markov Decision Process (MDP),
as follows:
A finite set of states : each state of agent can be represented as a tuple , in which describes the set of potential partners of agent and is agent ’s current baseline value.

A finite set of actions : agent ’s action set under is which consists of actions of rewiring with an agent in and the Null action for not rewiring.

A transition function : represents the probability of reaching state after taking action under state . The transition is probabilistic because the value of baseline may change stochastically. If agent rewires agent , the new baseline with the updated is determined by new included neighbor and left neighbors after breaking the worstperforming connection.

A reward function : denotes the reward of executing action under state , e.g., if agent chooses to rewire agent and if stops rewiring.
Under the above MDP formulation, our goal is to construct a rewiring strategy specifying which new connection to establish, or to not rewire under each state . We start with a shortsight version where each agent is only interested in maximizing its oneshot payoff after rewiring. The utility of a policy can be defined recursively as follows,
(4) 
where denotes the new rewiring state after agent is rewired, where is the new expected baseline payoff as the neighborhood is altered.
Our final goal is to find an optimal policy among all feasible policies, which considers the following: either not rewire and obtain baseline , or rewire an unknown partner by sampling from at cost , while taking the following into account:

: this indicates the expected payoff interacting with agent is not better than its current baseline and should be the new baseline value after rewiring. Thus the expected utility under state becomes ;

: the expected payoff interacting with agent is higher than our baseline value. In this case, the expected utility under state becomes , where is the updated baseline value since agent breaks the connection with the worstperformance neighbor during rewiring.
The above way of formalizing the dynamics of an optimal strategy only considers the oneshot interaction benefits. However, in our social learning framework, each agent tries to maximize accumulated payoffs during multirounds interactions. To this end, we propose a new farsight way of modeling our rewiring problem such that each agent considers the accumulated expected payoff through multiple interactions with any neighbor after rewiring. Formally, the step utility function of an optimal strategy must satisfy the following recursive relation:
(5)  
where the rewiring sight value of models an agent’s farsight degree by taking the accumulated payoff from rounds of interactions into account during each rewiring. We can see that Eq.5 is essentially a Bellman equation, which could be solved by backward induction in principle. However, even for a moderatesize MAS, this approach quickly becomes intractable. To this end, we provide a simple but optimal method to compute which agents to rewire or not in time (further explained later), which is inspired from Pandora’s Rule^{2}^{2}2Pandora’s Rule Weitzman (1979) is a solution concept for interaction under uncertainty; this framework has a wide range of applications to dynamic alternative selection problems, such as optimal service provider selection in TaskProcurement Problem Gerding et al. (2009), data management Baarslag et al. (2017) and optimal preference elicitation in negotiation Baarslag and Gerding (2015). Weitzman (1979).
Supposing that agent has current baseline and a potential partner in , to determine whether to rewire partner through our sight optimal rewiring approach, we have:

In the following rounds, the expected interaction value we could obtain from neighborhood is at least:
(6) 
If choose to rewire agent , we could get a net benefit of the following rounds interactions in the worst cases as follows:
(7) where is the new baseline if after breaking the connection with old baseline one. We define as where is the second minimum expected payoff in given that agent has at least two neighbors before rewiring.^{3}^{3}3If agent has only one neighbor, the new baseline degenerates to be , which is similar and simpler. Formally we have:
(8) 
Further, we use for the singleround payoff difference between not rewiring (Eq.6) and rewiring agent (Eq.7) which can be formalized as follows:
(9) If , agent is just indifferent between rewiring and not rewiring. Otherwise, the larger value indicates that agent is more beneficial to rewire.
For each unknown partner , the expected payoff satisfies the cumulative distribution function and the rewiring cost is . We can calculate an index through Eq.9, which fully captures the relevant information about agent : it should be rewired when it has the highest positive index and exceeds the interaction baseline of current neighborhood. It is proven in Weitzman (1979) that this strategy is optimal in terms of expected reward Eq.5.
The overall Ksight rewiring strategy is shown in Algorithm 2. At each rewiring phase, each agent first calculates the baseline interaction value and second minimum expected value in (Line 16). Second, for each potential partner , its corresponding index is computed following Eq.9 (Line 79). Finally, agent makes the rewiring decision accordingly — to rewire target agent with the maximum value of at cost if or not to rewire otherwise (Line 1015). Agent unlinks the worstperforming neighbor before rewiring a new partner:
(10) 
After certain rounds, each agent stops rewiring and converges to an optimal neighborhood.
Next we analyze the computational complexity of our proposed algorithm. In Algorithm 2, for each rewiring phase, agent computes the interaction baselines of current neighborhood (Line 16) and the value (Line 69) for each potential peer in to find the optimal action (Line 1015). This leads to a computational complexity of and the value of is for the size of reachable peers . Solving the Bellman equation (Eq.5) is equivalent with calculating the optimal rewiring action for all feasible states which are proportional to the size of potential peers. Thus the total computational complexity of solving our sequential decisionmaking problem is . Note that the computational complexity can be quite low even when the population size increases significantly since the number of reachable peers for each agent is usually much smaller than the number of agents in whole populations, i.e. .
After rewiring phase, each agent proceeds to interact with an agent randomly selected from its neighborhood. We consider the following three representative learning strategies in the literature as agents’ learning strategies.

Fictitious play(FP). FP is a wellknown learning approach in literature. Agent keeps a frequency count of agent ’s decisions from a history of past moves and assumes that the opponent is playing a mixed strategy represented by this frequency distribution. Agent chooses the bestresponse action to that mixed strategy to maximize its expected payoff. Formally we have:
(11) where is agent ’s estimated distribution over agent ’s actions and is the payoff when both agent and choose action , exactly the same in Eq.1.

Joint Action Learner (JAL) Claus and Boutilier (1998). JAL is a classic Multiagent Reinforcement Learning algorithm under the assumptions that each agent can observe the actions of other agents. A JAL agent learns its Qvalues for joint actions. Formally, represents agent ’s Qvalue when agent and select action and respectively. To determine the relative values of their individual actions, agent assumes that each other agent will choose actions in accordance with agent ’s empirical distribution over agent ’s action. In general, agent assesses the expected value of each action as follows:
(12) where is agent ’s estimation for the probability of agent choosing action .

JointAction WoLFPHC (JAWoLF). WoLF Bowling and Veloso (2001) extends Qlearning and learns mixed strategies with the idea of quickly adapting when losing but being cautious when winning. Specifically we modify WoLF to a JointAction version as agents can observe others’ actions in our environments. We replace with and each agent updates its Qvalues for each jointaction. In addition, to determine ’win’ or ’lose’, we need to keep the frequency distribution over opponents’ actions., and agent adjust its learning rate against agent as follows:
(13) where and denotes the learning rate when win or lose separately, is agent ’s strategy and is the average strategy, and is the probability agent maintains for interacting with agent .
5. Experimental Evaluations
In this section, we evaluate the performance of our optimal rewiring strategy with two benchmark strategies against various interaction environments. Following that, we analyze the performance of different learning strategies under our social learning framework.
5.1. Parameter Settings
We consider a population of hundreds of agents and three representative network topologies: regular network, smallworld network and scalefree network. In our experiments, there is no apparent discrimination in results of these three initial topologies,^{4}^{4}4The network topologies of interaction environments are changed along with the occurrence of dynamic rewirings. Thus, different initial topologies show similar results in longterm interactions. and we use regular networks as default choice for following illustration. Besides, we consider a wide range of interaction environments from three major aspects: the number of agents , the size of each agent’s initialized neighborhood and the size of each agent’s reachable agents . For example, interaction environment denotes a scenario consisting of 100 agents which each agent has 4 neighbors () and 12 reachable agents initially(). In following experiments the value of neighborhood
are initialized to be different constances equally for each agent. Note that we put no limitation on the size of neighborhood and it can be constance or other function forms to model the variance of agents’ connection capabilities. The set of reachable agents can also be defined in different ways. For example, in
Peleteiro et al. (2014) agents are allowed to do the rewiring through leaving its worst neighbor and joins the one with the highest score in its coalition. In contrast, in Griffiths and Luck (2010) an agent replaces its badperformance neighbors and replaces them with specific agents from the neighborhood of its (best) neighbors. We set the value of reachable agents size to constances for the purpose of generalization.For each pair of agents and , the cooperative game is uniquely generated. The payoffs , on the diagonal of are sampled independently from , according to probability distribution ,
which are separately set to either a uniform distribution
or a beta distribution
, with both uniformly sampled from and , . The miscoordination payoff is set to a constant value randomly sampled from the range of for each payoff matrix.5.2. Influence of Rewiring Sight
In our optimal rewiring approach, the value of rewiring sight is the key parameter which may directly influence the performance. We investigate the performance of varying sight values of with several rewiring costs.^{5}^{5}5The rewiring costs are usually set to be larger than singleround payoff because changing topologies is a nontrivial task. For learning strategy, we use FP for illustration and similar results can be observed for other learning strategies (i.e., JAL or JAWoLF), which are omitted due to space limitation.
Fig.2(a) shows the average accumulated payoff obtained by each agent with different costsight settings. We can see that the average payoffs for different costs increase rapidly with the growth of sight value and reach the peaks when is within the range of . Intuitively this indicates that an agent with relatively small sight value is shortsighted and may not be willing to rewire any highrisk agent (high cost) to maximize its longterm benefits. This phenomenon can be explained from Eq.9: the index is more likely to be negative with large which means few rewiring decisions can be made with our optimal strategy. After the peaks, it starts to descend with different speed when the value of further increases. This phenomenon indicates that extremely large sight values make optimal estimations become unreliable. This can also be explained from Eq.9: the small value of leads to the increase of , and thus increasing the set of candidate agents for rewiring. In other words, large is deemed to be overoptimistic and aggressive as any connection may be replaced in the future. In the extreme case where , the infinitely long sight makes the rewiring cost meaningless as .
Based on the above analysis, we can see that given any cost , we should select a reasonable as the optimal rewiring sight value. In following experiments, we choose a suitable value of to let the value of be within the optimal range of for different rewiring costs.
5.3. Performance Comparison of Rewiring Strategies
To evaluate the performance of our optimal approach, we compare our rewiring strategy with two benchmark strategies as follows:

Random (Ran): Random rewiring is included as the baseline strategy, in which an agent establishes new connection with a potential peer randomly selected each time.

Ksight Highest Expect (KHE): KHE is a rational benchmark strategy that rewires the agent with the highest Kround expected value minus the cost, i.e. .
To the best of our knowledge, there exist works on rewiring in social networks, but they use rewiring as an additional mechanism instead of designing optimal rewiring strategies to facilitate multiagent coordination under the social learning framework. Thus, most of them are are vanilla and cannot be directly applied in our context.
We first evaluate the performance of each rewiring strategy in different interaction environments with varying parameters of networks. Both KHE and our optimal strategy use the same value of () as the rewiring sight value. We simulate 1000rounds interactions in pure environments where the population uses the same rewiring strategy. The average accumulated payoffs obtained by agents using different rewiring strategies are shown in Table 1.
No.  (x, y, z)  Rew_Stg  Avg.  Max.  Min. 

First we can observe that our optimal strategy outperforms benchmark strategies in terms of average, best and worst cases across all different parameter settings, especially when the size of neighborhood is relatively small (each neighbor matters). Second, another observation we can find in No.1 to 9, is that as the size of reachable agents becomes larger, the performance of the two benchmark strategies decreases, while the optimal strategy still shows good and robust performance (actually a slight ascending trend) against the variation of interaction environments. This is because our optimal strategy is more likely to pick off better peers when our agents have more rewiring alternatives, which ensures robust and even better performance. In contrast, random strategy is more likely to make bad choices. For KHE strategy, it only focuses on the highest expected interaction payoff of potential peers regardless of the network dynamics caused by rewiring. Hence the loss of optimality becomes even worse when facing more choices.
Another observation is that, comparing the results from No.10 to 15, we can see that the average payoff for our optimal strategy decreases and the performance gap between optimal strategy and others becomes smaller as the size of neighborhood becomes larger. The is because in our experiments, each agent interacts with a randomly selected neighbor. Thus, even though our approach always makes the optimal rewiring decision, the expected payoff during the course of interaction will approach the expectation value of random interactions with the neighborhood, as the neighborhood size approaches infinity. This can be analyzed formally as follows. For any agent , let denotes the agent ’s expected value of random interactions with neighborhood and naturally we have . We can see in Eq.1 when , which means agent and coordinate on action . Thus, in the uniform distribution case, where and , we easily have .
Furthermore, we compare the performance of our optimal strategy and existing benchmark strategies in different rewiring cost settings. The value of sight is set to for both KHE and the optimal strategy. We vary rewiring cost value of in the range of . Fig.2(b) shows the average accumulated payoffs for each rewiring strategy in a pure environment. We can see that our optimal rewiring strategy outperforms others and acquires higher payoff across almost all value (). Next we evaluate our optimal rewiring strategy in a competitive environment where each agent is randomly assigned a rewiring strategy (from Random, KHE and optimal strategy) with equal probabilities and the results are shown in Fig.2(c). Fig.2(b) and Fig.2(c) are similar and we can see that our optimal rewiring strategy significantly outperforms existing benchmark strategies for different settings in competitive environments. As expected, the optimal strategy’s payoff starts at the peak when the rewiring cost is zero, and decreases gradually as the cost increases. The obtained payoff slowly declines until , where the rewiring costs are too high and not rewiring is the optimal choice.
The KHE strategy ranks the second. It starts with average accumulated payoff near and drops significantly to the minimum when . The reason why KHE has a good start is the initial rewiring cost is negligible even though its rewiring decision is suboptimal. As the rewiring cost increases, the performance gap between KHE and our optimal strategy becomes even larger. This is because when is small, there are many rewiring alternatives for both KHE and our optimal strategy. For KHE, it is more likely to pick the suboptimal one while our optimal strategy always chooses the optimal one. Further, KHE’s performance gradually improves when further increases and reach the same level as our optimal strategy when . This is because few alternatives exist for selection as the value of becomes larger and it is more likely for KHE to also select the optimal rewiring action by chance. Therefore, for KHE strategy, rewiring with beneficial agents is a safer choice for higher rewiring costs because such agents are sure to give higher payoff; in contrast, rewiring decisions are not reliable when is within the range of . Overall, the performance difference between KHE and our optimal strategy is significant, which stems from the fact that our optimal policy takes future interactions into account when exploring the different potential partners.
The random rewiring method performs the worst since it always rewires a randomly selected peers. It quickly degenerates and moves off the chart for higher costs because of the rapid increase of the total rewiring costs. In addition, we vary the proportions of agents using KHE strategy from to for more competitive environments consisting of agents using only KHE or optimal strategies, and the results are shown in Fig.2(d). The random strategy is not considered due to its poor performance. We can see our optimal strategy significantly prevails KHE over all proportion values against settings, which indicates our optimal strategy is robust and can always achieve higher accumulated payoff.
5.4. Performance Analysis of Different Learning Strategies
We compare the performance of three representative learning strategies: FP, JAL and JAWoLF. Fig.3(a) shows the dynamics of the average singleround payoff for different learning strategies. It shows the agents using FP strategy can fast reach a good payoff level within 10000 rounds and then stay at near 1.3 after 100000 rounds. JAL and JAWoLF start with worse performance but outperform FP from near 200000 round and maintain a 0.1 lead at the value of average perround reward. For further explanation, Fig.3(b) shows the dynamics of the percentage of agents reaching optimal Nash equilibrium (OptNE) and suboptimal equilibrium (SubOptNE) for above three strategies. We can find that JAWoLF is the quickest to reach almost 100% OptNE (40000 rounds), followed by JAL (60000 rounds). The population of agents using FP shows a quick start but leads to only over 70% of OptNE (the rest of 17% for SubOptNE) which results in its poor behavior during longrun interactions in Fig.3(a). We hypothesize that FP’s bad performance is because its convergence strategy highly depends on the initial strategies of players. For example, if two players are assigned a high probability of choosing the action which coordinates to the suboptimal NE, each of them are going to choose the suboptimal action as a best response which will be reinforced gradually and eventually converge to the suboptimal NE. In contrast, JAL and JAWoLF can always reach almost perfect coordination on the optimal NE.
6. Conclusion and Future Work
In this paper, we deal with multiagent coordination problem in cooperative environments under the social learning framework with limited rewiring chances available. We proposed a new rewiring strategy which is optimal and efficient for the agents to maximize accumulated payoff during longrun interactions. Our empirical results show that our method outperforms other benchmark strategies in both competitive and pure environments. Moreover, our method is robust under a variety of circumstances. Besides, we analyzed the comparative performance of three representative learning strategies (i.e., FP, JAL and JAWoLF) under our social learning framework with optimal rewiring. The empirical results show that JAL and JAWoLF outperforms FP on both accumulated payoff and the percentage of times agents reaching the optimal Nash equilibria (NE). In this paper, we only consider the rewiring problem in cooperative MASs. One worthwhile direction is to investigate how this rewiring strategy can be extended and applied in noncooperative MASs. Another research direction is to explicitly consider how to better utilize the property of the underlying network topology to further facilitate coordination.
References
 (1)
 Airiau et al. (2014) Stéphane Airiau, Sandip Sen, and Daniel Villatoro. 2014. Emergence of conventions through social learning. Autonomous Agents and MultiAgent Systems 28, 5 (2014), 779–804.
 Albert and Barabási (2002) Réka Albert and AlbertLászló Barabási. 2002. Statistical mechanics of complex networks. Reviews of modern physics 74, 1 (2002), 47.
 Baarslag et al. (2017) Tim Baarslag, Alper T. Alan, Richard Gomer, Mudasser Alam, Charith Perera, Enrico H. Gerding, and M.C. Schraefel. 2017. An Automated Negotiation Agent for Permission Management. In Proceedings of the 2017 International Conference on Autonomous Agents and Multiagent Systems (AAMAS ’17). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 380–390. http://dl.acm.org/citation.cfm?id=3091125.3091184
 Baarslag and Gerding (2015) Tim Baarslag and Enrico H Gerding. 2015. Optimal Incremental Preference Elicitation during Negotiation.. In IJCAI. 3–9.

Bowling and
Veloso (2001)
Michael Bowling and
Manuela Veloso. 2001.
Rational and convergent learning in stochastic
games. In
International joint conference on artificial intelligence
, Vol. 17. LAWRENCE ERLBAUM ASSOCIATES LTD, 1021–1026.  Claus and Boutilier (1998) Caroline Claus and Craig Boutilier. 1998. The dynamics of reinforcement learning in cooperative multiagent systems. AAAI/IAAI 1998 (1998), 746–752.
 Crow et al. (1997) Brian P Crow, Indra Widjaja, Jeong Geun Kim, and Prescott T Sakai. 1997. IEEE 802.11 wireless local area networks. IEEE Communications magazine 35, 9 (1997), 116–126.
 Damer and Gini (2008) Steven Damer and Maria L Gini. 2008. Achieving Cooperation in a Minimally Constrained Environment.. In AAAI. 57–62.
 Dekker (2007) AH Dekker. 2007. Realistic social networks for simulation using network rewiring. In International Congress on Modelling and Simulation. 677–683.
 Ellison et al. (2007a) Nicole B Ellison et al. 2007a. Social network sites: Definition, history, and scholarship. Journal of ComputerMediated Communication 13, 1 (2007), 210–230.
 Ellison et al. (2007b) Nicole B Ellison, Charles Steinfield, and Cliff Lampe. 2007b. The benefits of Facebook “friends:” Social capital and college students’ use of online social network sites. Journal of ComputerMediated Communication 12, 4 (2007), 1143–1168.
 Gerding et al. (2009) Enrico H Gerding, Kate Larson, Alex Rogers, and Nicholas R Jennings. 2009. Mechanism design for task procurement with flexible quality of service. In International Workshop on ServiceOriented Computing: Agents, Semantics, and Engineering. Springer, 12–23.
 Griffiths and Luck (2010) Nathan Griffiths and Michael Luck. 2010. Changing neighbours: improving tagbased cooperation. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems: volume 1Volume 1. International Foundation for Autonomous Agents and Multiagent Systems, 249–256.
 Hales and Edmonds (2005) David Hales and Bruce Edmonds. 2005. Applying a socially inspired technique (tags) to improve cooperation in P2P networks. IEEE Transactions on Systems, Man, and CyberneticsPart A: Systems and Humans 35, 3 (2005), 385–395.
 Hao et al. (2014) Jianye Hao, Dongping Huang, Yi Cai, and HoFung Leung. 2014. Reinforcement social learning of coordination in networked cooperative multiagent systems. In AAAI workshop on multiagent interaction without prior coordination (MIPC 2014).
 Hao et al. (2017) Jianye Hao, Dongping Huang, Yi Cai, and Hofung Leung. 2017. The dynamics of reinforcement social learning in networked cooperative multiagent systems. Engineering Applications of Artificial Intelligence 58 (2017), 111–122.
 Hao and Leung (2013) Jianye Hao and Hofung Leung. 2013. The Dynamics of Reinforcement Social Learning in Cooperative Multiagent Systems.. In IJCAI, Vol. 13. 184–190.
 Hao et al. (2015) Jianye Hao, HoFung Leung, and Zhong Ming. 2015. Multiagent reinforcement social learning toward coordination in cooperative multiagent systems. ACM Transactions on Autonomous and Adaptive Systems (TAAS) 9, 4 (2015), 20.
 Kapetanakis and Kudenko (2002) Spiros Kapetanakis and Daniel Kudenko. 2002. Reinforcement learning of coordination in cooperative multiagent systems. AAAI/IAAI 2002 (2002), 326–331.
 Kwak et al. (2010) Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. 2010. What is Twitter, a social network or a news media?. In Proceedings of the 19th international conference on World wide web. ACM, 591–600.

Lauer and
Riedmiller (2000)
Martin Lauer and Martin
Riedmiller. 2000.
An algorithm for distributed reinforcement learning
in cooperative multiagent systems. In
In Proceedings of the Seventeenth International Conference on Machine Learning
. Citeseer.  Matignon et al. (2008) Laëtitia Matignon, Guillaume Laurent, and Nadine Le FortPiat. 2008. A study of FMQ heuristic in cooperative multiagent games.. In The 7th International Conference on Autonomous Agents and Multiagent Systems. Workshop 10: MultiAgent Sequential Decision Making in Uncertain MultiAgent Domains, aamas’ 08., Vol. 1. 77–91.

Matignon
et al. (2012)
Laetitia Matignon,
Guillaume J Laurent, and Nadine
Le FortPiat. 2012.
Independent reinforcement learners in cooperative
Markov games: a survey regarding coordination problems.
The Knowledge Engineering Review
27, 1 (2012), 1–31.  Mihaylov et al. (2014) Mihail Mihaylov, Karl Tuyls, and Ann Nowé. 2014. A decentralized approach for convention emergence in multiagent systems. Autonomous Agents and MultiAgent Systems 28, 5 (2014), 749–778.
 Panait and Luke (2005) Liviu Panait and Sean Luke. 2005. Cooperative multiagent learning: The state of the art. Autonomous agents and multiagent systems 11, 3 (2005), 387–434.
 Panait et al. (2006) Liviu Panait, Keith Sullivan, and Sean Luke. 2006. Lenient learners in cooperative multiagent systems. In Proceedings of the fifth international joint conference on Autonomous agents and multiagent systems. ACM, 801–803.
 Peleteiro et al. (2014) Ana Peleteiro, Juan C Burguillo, and Siang Yew Chong. 2014. Exploring indirect reciprocity in complex networks using coalitions and rewiring. In Proceedings of the 2014 international conference on Autonomous agents and multiagent systems. International Foundation for Autonomous Agents and Multiagent Systems, 669–676.
 Sen and Airiau (2007) Sandip Sen and Stéphane Airiau. 2007. Emergence of norms through social learning.. In IJCAI, Vol. 1507. 1512.
 Villatoro et al. (2011) Daniel Villatoro, Jordi SabaterMir, and Sandip Sen. 2011. Social instruments for robust convention emergence. In IJCAI, Vol. 11. 420–425.
 Watts and Strogatz (1998) Duncan J Watts and Steven H Strogatz. 1998. Collective dynamics of’smallworld’networks. nature 393, 6684 (1998), 440.
 Weitzman (1979) Martin L Weitzman. 1979. Optimal search for the best alternative. Econometrica: Journal of the Econometric Society (1979), 641–654.
 Yu et al. (2013) Chao Yu, Minjie Zhang, Fenghui Ren, and Xudong Luo. 2013. Emergence of social norms through collective learning in networked agent societies. In Proceedings of the 2013 international conference on Autonomous agents and multiagent systems. International Foundation for Autonomous Agents and Multiagent Systems, 475–482.
 Zhang and Lesser (2013) Chongjie Zhang and Victor Lesser. 2013. Coordinating multiagent reinforcement learning with limited communication. In Proceedings of the 2013 international conference on Autonomous agents and multiagent systems. International Foundation for Autonomous Agents and Multiagent Systems, 1101–1108.
Comments
There are no comments yet.