**Reinforcement Learning**

We examine a reinforcement learning model.

Introducing a population renewal algorithm, which in the presence of superlinear reinforcement, considerably enhances

### Introduction

The functioning of societies is to a large extent regulated by various norms and conventions shared by their members.

In some cases, these rules are centrally imposed or coordinated, e.g., a dress code in a company or the side of the road that one should drive on. But some conventions, such as the color of the clothes that we wear in mourning or greeting our friends with a handshake, emerged more spontaneously.

Perhaps the most important convention of this kind, which emerged in the absence of any explicit, centralized coordination, is language.

Human language provides a highly efficient communication system acquired by individuals in cultural interactions.

Some researchers try to explain how such a system could have appeared and evolved using evolutionary game theory, evolutionary linguistics, or cognitive science.

A promising approach considers language as a signaling system, which emerged via a reinforcement learning process.

Such a framework originates from the Lewis signaling game.

In the simplest version, there are two players and a fixed number of signals.

The speaker sends a signal (which is meant to correspond to the state of the world) and the hearer interprets the signal (i.e., takes some action).

If the hearer interprets the signal correctly, both players receive some payoff, which might influence their further actions.

The model can in fact be regarded as a kind of urn model. Some mathematical subtleties concerning, e.g., the convergence of the above scheme, were analyzed by Skyrms [[8]] and Beggs [[9]], while an adaptation focusing on language evolution was proposed by Lenaerts *et al*. [[10]]. Attempts to compare several related approaches in which learned signaling might emerge were also made [[11]].

In all these studies, the examined number of agents was rather small (≲ 30); one should note, however, that reinforcement learning leads to nontrivial results already in two-agent models [[1], [8], [12]].

Nonetheless, having in mind the evolution of language, much larger populations of agents should certainly be examined.

In such a case, one has to specify the network of interactions between the agents.

While for a small group, a complete graph, where each agent interacts with each of the others, seems the most natural topology, for larger groups of agents, other structures such as planar or heterogeneous graphs can also be relevant (e.g., when studying the emergence of linguistic coherence in large-scale communities such as a city population or a nation).

The emergence of language is often modeled as a process of reaching an agreement (consensus) about linguistic forms used in a population. Opinion formation and ferromagnetism are also manifestations of such agreement dynamics. For such processes, the structure of the network usually plays an important role, determining whether consensus will be reached at all and affecting the way it is reached [[13]–[15]]. The networks examined in the present paper (Cartesian lattices, complete graphs, random graphs) are only mathematically and computationally appealing idealizations of real networks. Certainly, placing our models on more realistic networks, which take into account node-distribution heterogeneity, directionality, small-world effects, modular structure, or assortativity [[16]], would be desirable.

A model that is often examined in the context of language emergence is the Naming Game [[17]]. Due to its computational simplicity, the Naming Game allows for analytical as well as numerical approaches, and global aspects of its dynamics are now relatively well understood [[18]]. In particular, it is found that a consensus typically emerges in the Naming Game, and that reaching such a state resembles coarsening in the Ising model. The similarity is not accidental: owing to the presence of surface tension [[19]], both models exhibit so-called curvature-driven dynamics [[20]]. Note that the coarsening dynamics of the Naming Game, which gradually eliminates certain languages and eventually leads to a global consensus, is appealing in some linguistic contexts. There are even some indications that curvature-driven dynamics may underlie linguistic processes such as the evolution of dialects [[21]]. The simplicity of the Naming Game implies, however, the simplicity of the emerging language, and in many of its versions agents negotiate the name of just a single object. On the other hand, for models that have the potential to generate more complex languages, global aspects of their dynamics are rather poorly understood. Such models could incorporate agents that, using reinforcement learning, try to establish a language reflecting their multi-object and multi-agent world. An objective of the present paper is to determine whether and how efficient communication might emerge in such a system.

### Methods

### Reinforcement learning via urn model

The basic building block of our model is a Pólya urn model. In the simplest version of this model, a ball is drawn randomly from an urn with black and white balls [[6], [7]]. Then the ball is put back into the urn along with an extra ball of the same color (reinforcement), and the process is repeated *ad infinitum*. In this scheme, the probability of selecting a ball of a given color is proportional to the number of such balls in the urn. We can also consider a generalized version of this model with the selection probability proportional to the number of balls raised to a certain power *α* [[22]]. In this case, the behavior of the model strongly depends on *α*. For *α* < 1, the model converges toward an equal number of balls of each color, but for *α* > 1, a monopolistic solution appears with the urn dominated by one color. The monopolistic solution is in fact a simple manifestation of spontaneous symmetry breaking, a phenomenon of much interest in statistical mechanics and particle physics. The basic Pólya urn model is equivalent to the *α* = 1 case and thus marks the transition between these two regimes.
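The generalized urn described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names `polya_urn` and `majority_share`, the starting state of one ball per color, and the `seed` parameter are all hypothetical choices made here.

```python
import random

def polya_urn(alpha, steps, n_colors=2, seed=0):
    """Simulate a generalized Polya urn and return the final ball counts.

    Assumption (not from the source): the urn starts with one ball per color.
    """
    rng = random.Random(seed)
    counts = [1] * n_colors
    for _ in range(steps):
        # Selection probability is proportional to (number of balls) ** alpha.
        weights = [c ** alpha for c in counts]
        color = rng.choices(range(n_colors), weights=weights)[0]
        # Reinforcement: the drawn ball goes back with one extra ball of its color.
        counts[color] += 1
    return counts

def majority_share(counts):
    """Fraction of balls that belong to the most common color."""
    return max(counts) / sum(counts)

if __name__ == "__main__":
    for alpha in (0.5, 1, 2):
        counts = polya_urn(alpha, steps=10_000)
        print(alpha, counts, round(majority_share(counts), 3))
```

Running the loop for several values of *α* and several seeds gives a rough feel for the two regimes: the majority share should tend to stay near 1/2 for *α* < 1 and to approach 1 for *α* > 1, although any single finite run is only suggestive.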

Our intention is to study a multi-agent model of a signaling game with communicating agents as interacting urns. In the simplest (single-object) version, agents engage in pairwise interactions to negotiate the word to be associated with an object. After a weighted selection of a word, the speaker and the hearer increase its weight (reinforcement learning), which affects subsequent selections. It seems plausible that in the *α* > 1 regime, a monopolistic solution would emerge with agents almost always selecting the same word. There are, however, a number of questions that one can ask concerning such a linguistic consensus. For example, is it a global consensus, where the entire population of agents uses the same word, or rather a local one corresponding to a certain multi-word solution? Most likely the answer will depend on the topology of interactions between agents, e.g., networks with long-range connectivity should favor the global consensus. Furthermore, agents may be involved in more complicated interactions, e.g., negotiating simultaneously the names for several objects (multi-object version). In that case they need some recognition mechanism, and the resulting language is likely to be more complex.

It is difficult to argue conclusively that *α* > 1 should be used in linguistic contexts. In economics, the emergence of a monopoly is sometimes associated with a positive superlinear feedback known as Metcalfe’s Law [[23]]. For example, in social networks, the more users a service has, the more valuable it becomes to the community, and hence its total value is likely to increase quadratically (*α* = 2) with the number of its users. One might expect that a similar superlinear feedback appears during language formation. Most of the results presented in our paper are for *α* = 2; some of our results demonstrate that the behavior of the model is qualitatively similar as long as *α* > 1. For *α* = 1, convergence toward consensus is typically much slower, and in some cases the model does not evolve toward consensus at all.

### Single-object version

In the simplest version of our model, we have a population of *N* agents, which try to establish a name for a given object. Each agent *A* has an inventory of the same *N*_{w} words *W*_{i} with their corresponding weights *w*_{i}(*A*) (*i* = 1, 2, …, *N*_{w}; initially all *w*_{i}(*A*) = 1). In an elementary step, a randomly selected agent (the speaker) interacts with one of its randomly selected neighbors (the hearer) communicating a word. The probability that the speaker *A* will select the *i*-th word depends on its weight and is given as

$$s_i(A) = \frac{w_i^{\alpha}(A)}{\sum_{k=1}^{N_w} w_k^{\alpha}(A)}. \qquad (1)$$

After the interaction, both the speaker and the hearer increase their weights of the communicated word by 1. Such an elementary step of our model is illustrated in Fig 1. In our simulations, a unit of time (*t* = 1) comprises *N* elementary steps (i.e., in a unit of time, each agent is on average selected once as a speaker).
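The elementary step of the single-object version can be sketched as follows. This is a minimal illustration, assuming a complete graph as the interaction network; the helper names (`selection_probs`, `complete_graph_neighbors`, `elementary_step`, `run_time_unit`) and the list-of-lists layout `weights[agent][word]` are hypothetical and not taken from the paper.

```python
import random

def selection_probs(word_weights, alpha):
    """Word selection probabilities of one agent, following Eq (1)."""
    powered = [w ** alpha for w in word_weights]
    total = sum(powered)
    return [p / total for p in powered]

def complete_graph_neighbors(n_agents):
    """Assumed topology: every agent is a neighbor of every other agent."""
    return [[b for b in range(n_agents) if b != a] for a in range(n_agents)]

def elementary_step(weights, neighbors, alpha, rng):
    """One speaker-hearer interaction; weights[agent][word] is updated in place."""
    speaker = rng.randrange(len(weights))
    hearer = rng.choice(neighbors[speaker])
    probs = selection_probs(weights[speaker], alpha)
    word = rng.choices(range(len(probs)), weights=probs)[0]
    # Both agents reinforce the communicated word by 1.
    weights[speaker][word] += 1
    weights[hearer][word] += 1
    return speaker, hearer, word

def run_time_unit(weights, neighbors, alpha, rng):
    """One unit of time comprises N elementary steps, N being the number of agents."""
    for _ in range(len(weights)):
        elementary_step(weights, neighbors, alpha, rng)

if __name__ == "__main__":
    rng = random.Random(0)
    n_agents, n_words = 10, 5
    weights = [[1] * n_words for _ in range(n_agents)]  # all weights start at 1
    neighbors = complete_graph_neighbors(n_agents)
    for _ in range(100):
        run_time_unit(weights, neighbors, alpha=2, rng=rng)
    print(weights[0])
```

Passing the neighbor lists in as an argument keeps the step function independent of the topology, so a lattice or random graph could be swapped in without touching the update rule.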

### Multi-object version

We also examine a more general version of our model, in which agents try to establish names for a set of *N*_{o} objects. Their inventories are more complex now as they contain the same set of *N*_{w} words *W*_{i} (coupled with their respective weights) for each object. In other words, each inventory consists now of *N*_{o} copies of inventories from the single-object version and thus each agent *A* has *N*_{w}*N*_{o} weights *w*_{i, j}(*A*), where *i* = 1, …, *N*_{w} and *j* = 1, …, *N*_{o}. First, a randomly selected speaker chooses an object with a uniform probability 1/*N*_{o}. Then the speaker selects the word to be communicated taking into account the weights associated with the words for the chosen object. By analogy with Eq (1), the probability that agent *A* will select the *i*-th word for the *j*-th object equals

$$s_{i,j}(A) = \frac{w_{i,j}^{\alpha}(A)}{\sum_{k=1}^{N_w} w_{k,j}^{\alpha}(A)}. \qquad (2)$$

Next, the role of the hearer (*H*) is to assign an object to the communicated word. This word, say *W*_{i}, appears in the hearer’s inventory *N*_{o} times with weights *w*_{i, j}(*H*), where *j* denotes the object. The hearer uses these weights to guess which object the speaker is talking about. Hence, the hearer recognizes the *j*-th object as that communicated by the *i*-th word with probability

$$r_{i,j}(H) = \frac{w_{i,j}^{\alpha}(H)}{\sum_{k=1}^{N_o} w_{i,k}^{\alpha}(H)}. \qquad (3)$$

Provided that the object recognized by the hearer is the same as that chosen by the speaker, both agents increase the corresponding weights by 1. An elementary step of this version of the model is illustrated in Fig 2.
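One speaker-hearer exchange of the multi-object version might be sketched as below. The sketch is illustrative only: the names `weighted_index` and `multi_object_step` are hypothetical, and each inventory is stored as `w[object][word]`, i.e., with the indices in the opposite order to *w*_{i, j} in the text.

```python
import random

def weighted_index(values, alpha, rng):
    """Draw an index with probability proportional to values[index] ** alpha."""
    return rng.choices(range(len(values)), weights=[v ** alpha for v in values])[0]

def multi_object_step(speaker_w, hearer_w, alpha, rng):
    """One interaction between a given speaker and hearer.

    speaker_w[j][i] and hearer_w[j][i] hold the weight of word i for object j
    (an assumed storage layout). Both inventories are updated in place.
    """
    n_objects = len(speaker_w)
    # The speaker chooses an object uniformly at random.
    obj = rng.randrange(n_objects)
    # The speaker selects a word for that object, following Eq (2).
    word = weighted_index(speaker_w[obj], alpha, rng)
    # The hearer compares the weights of this word across all objects, Eq (3).
    across_objects = [hearer_w[j][word] for j in range(n_objects)]
    guess = weighted_index(across_objects, alpha, rng)
    success = guess == obj
    if success:
        # Reinforcement only after a correct recognition.
        speaker_w[obj][word] += 1
        hearer_w[obj][word] += 1
    return obj, word, guess, success

if __name__ == "__main__":
    rng = random.Random(0)
    n_objects, n_words = 3, 4
    speaker = [[1] * n_words for _ in range(n_objects)]
    hearer = [[1] * n_words for _ in range(n_objects)]
    print(multi_object_step(speaker, hearer, alpha=2, rng=rng))
```

Choosing the speaker and one of its neighbors as the hearer is left to the caller here, so the same function can be reused on any interaction network.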

The rules specified above rest on a number of simplifying assumptions, and more realistic versions could certainly be considered. For example, one might assume that the words in agents’ inventories are not necessarily identical and that agents could learn new words from each other. Most likely such a change would require a more sophisticated recognition mechanism, and perhaps a notion of distance between words would have to be used. Further analysis of such a version, although it seems more realistic and potentially interesting, is left for the future.

### Population renewal

We also introduce a simple modification of our model (in both its single- and multi-object versions) that takes population renewal into account. The modification seems plausible, especially for modeling the formation of a communication system in a human population. In such a population, when considered at a timescale of, say, hundreds of years, we should take into account generational turnover (and possibly migrations [[24]]). A child learns the language of its parents, but it might also acquire a (possibly different) language of its neighbors. Certainly, this is more likely to happen for a young person than for an adult. Note that in urn models, because weights accumulate over a large number of iterations, it becomes almost impossible to shift their balance (i.e., to change the language). To allow for such a shift, we introduce population renewal: with (usually small) probability *p*, the agent selected to be a speaker is replaced with a new agent (with all weights equal to 1), while with probability 1 − *p*, the speaker acts as previously defined.
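The renewal rule can be expressed as a small check performed once a speaker has been selected. The following is a sketch for the single-object layout `weights[agent][word]`; the function name `maybe_renew` and this layout are assumptions made for illustration only.

```python
import random

def maybe_renew(weights, speaker, p, rng):
    """With probability p, replace the selected speaker with a fresh agent.

    Returns True if the speaker was replaced; in that case the caller skips
    the communication step. Otherwise returns False and the speaker acts as
    usual. weights[agent][word] is an assumed single-object layout.
    """
    if rng.random() < p:
        # A new agent starts with all weights equal to 1.
        weights[speaker] = [1] * len(weights[speaker])
        return True
    return False

if __name__ == "__main__":
    rng = random.Random(0)
    weights = [[40, 1, 3], [2, 35, 1]]
    speaker = rng.randrange(len(weights))
    if maybe_renew(weights, speaker, p=0.01, rng=rng):
        print("speaker", speaker, "replaced:", weights[speaker])
    else:
        print("speaker", speaker, "communicates as previously defined")
```

For the multi-object version, the same idea would apply with every weight of the replaced agent reset to 1 for all objects.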