
MSc Artificial Intelligence

Master Thesis

The Grammar of Emergent Languages

by

Oskar Douwe van der Wal

11913754
48 ECTS
October 2019 - July 2020
Supervisors: Dr D. Hupkes, Dr E. Bruni
Assessor: Dr W.H. Zuidema

Institute for Logic, Language and Computation
University of Amsterdam


Abstract

The successful communication of deep neural agents in referential games is attracting increasing interest as a way to simulate the emergence of language, both for studying the origins of human language and for developing better NLP models. However, despite the agents' efficacy in exchanging information about complex scenarios, it remains unclear to what extent their languages resemble human languages. This can be attributed to the limited choice of proper evaluation techniques. More importantly, the existing techniques primarily concern the semantics; none focus on the syntax. As a result, the effect of many modelling choices on the development of syntactic structure cannot be studied properly. To address this gap, we explore an approach that does concentrate on the syntax of emergent languages.

Here we show that unsupervised grammar induction techniques developed for natural language can also be applied to analyse these artificial languages, with the advantage of not depending on the underlying semantics. Using these techniques, we investigate the effect of the message length and vocabulary size on the emergence of syntactic structure in a simple referential game. We find that the languages only start to exhibit syntax once a certain message length and vocabulary size are reached. However, the structure we find is still limited compared to the complexity of human-like syntax, suggesting that more complicated game scenarios should be investigated. To this end, this thesis also proposes a template for designing new referential games.

The findings of this research form another step towards understanding which settings lead to the emergence of natural language, and provide future researchers with the necessary tools for analysing syntax.


Acknowledgements

First and foremost, I would like to thank Dieuwke and Elia for their supervision and guidance while working on this thesis and the paper. The Master thesis has been a transformative journey, and my supervisors have been a vital part of my academic growth. I could not have wished for better mentors to prepare me for the academic path that lies ahead of me, for which I am truly grateful.

I would also like to thank Jelle for his willingness to examine my work. Given his experience with the research areas of my thesis, it is inspiring to have him as my assessor. It should be noted that the research presented in this thesis does not stand on its own. I would like to thank Diana Luna Rodríguez for providing the emergent languages from her own work and for joining several of our project meetings. Also my friend Silvan de Boer should not go unmentioned, as he started this project with me and kept me motivated throughout the thesis.

Furthermore, I want to thank my classmates and Henning Bartsch in particular for their help by proofreading my work and giving valuable advice when needed. I am also grateful for the support of my friends, family, and especially my girlfriend during this stressful time.


Contents

Abstract

Acknowledgements

1. Introduction
   1.1. Motivation
   1.2. Research questions and contributions
   1.3. Thesis structure

2. Background
   2.1. Description of the referential game
   2.2. Unsupervised grammar induction
        2.2.1. Pre-neural statistical approaches
        2.2.2. Neural approaches

3. Related work on evaluation techniques
   3.1. Communication success
   3.2. Symbol usage statistics
   3.3. Semantic analysis
        3.3.1. Qualitative inspection
        3.3.2. Cluster quality
        3.3.3. Diagnostic classification
        3.3.4. Zero shot performance
        3.3.5. RSA / topographic similarity
   3.4. Syntactic analysis
        3.4.1. Previous work on syntactic structure
   3.5. Discussion

4. Framework of referential game design
   4.1. The notion of pressures
   4.2. The world
        4.2.1. Modality of the data
        4.2.2. Distribution and ontology of the world
        4.2.3. Game setup and different perspectives
   4.3. The agent
        4.3.1. Architecture
   4.4. The learning
        4.4.1. Gradient estimation
        4.4.2. Regularisation and extra prediction tasks
   4.5. Discussion and conclusion

5. Emergent grammar induction
   5.1. Game
   5.2. Languages
   5.3. Grammar induction
        5.3.1. Constituency structure induction
        5.3.2. Constituency labelling
   5.4. Evaluation
        5.4.1. Grammar aptitude
        5.4.2. Language compressibility
        5.4.3. Grammar nature
        5.4.4. Induction consistency
   5.5. Baselines

6. Suitability of induction techniques
   6.1. Grammars for structured baselines
        6.1.1. Results and discussion
   6.2. Grammar consistency and data size
        6.2.1. Method
        6.2.2. Results and discussion
   6.3. Discussion and conclusion

7. Analysis of the emergent languages
   7.1. Grammar aptitude and language compressibility
        7.1.1. General observation
        7.1.2. The L3 and L5 languages
        7.1.3. The L10 languages
   7.2. Nature of syntactic structure
        7.2.1. Word class structure
        7.2.2. Higher level structure
        7.2.3. Recursion
   7.3. Discussion and conclusion

8. Conclusion and outlook

Bibliography

A. Structured baseline grammars


1. Introduction

The use of deep neural agents for simulating the evolution of language is a promising area of research for investigating the origins of human language (Kirby, 2002) as well as improving current NLP techniques (Lazaridou et al., 2017), by having the agents develop a language similar to natural language. Motivated by the idea that language serves a functional purpose (Wittgenstein, 1953), it is encouraging that these agents can successfully communicate using discrete symbols to solve collaborative tasks.

However, the characteristics of the emergent languages and how these compare to natural language are still poorly understood, mainly because of the limitations of present-day evaluation techniques. In particular, the question of what circumstances lead to the emergence of syntax has received little attention in previous work. To the best of our knowledge, we are the first to conduct a syntactic analysis of emergent languages using unsupervised grammar induction (UGI) techniques, and our findings show that the choice of the maximum message length and vocabulary size influences the development of syntactic structure.1

In this thesis, we take advantage of both recent and earlier developments in the field of UGI for natural language to conduct a syntactic analysis of languages emerging in a simple referential game with deep neural agents. After confirming the suitability of UGI techniques for our artificial setup, we demonstrate their application by investigating the degree of structure in emergent languages of varying message lengths and vocabulary sizes. We conclude that only the largest languages contain some structure, although still not of the complexity found in natural language, and we call for more interesting game scenarios.

To aid in the development of different game scenarios, this thesis also contributes a framework for designing these referential games. I provide an overview of current work and identify the dimensions of the referential game that can be manipulated to study the mechanisms underlying the emergence of syntax and other characteristics.

Before I discuss the research questions and contributions of this thesis in more detail, I give a brief overview of the field of simulating language emergence to provide the academic context and subsequently clarify the main motivation behind the research.

1The experiments conducted in this thesis are part of a collaborative project, and there is an overlap between this thesis (in particular Chapters 5, 6, and 7) and the EMNLP 2020 submission "The Grammar of Emergent Languages" (in prep.). I have worked closely with Silvan de Boer, who has implemented the pre-processing of the languages (§5.2) and the baseline (§5.5), the evaluation of the grammar aptitude (§5.4.1), language compressibility (§5.4.2) and induction consistency (§5.4.4), as well as the experiment in §6.2, while the further interpretation of the results is by me. Furthermore, Diana Luna Rodríguez has provided us with the emergent languages (§5.2). All the other work presented in this thesis is my own.


Studying the origins of natural language Natural language is a fascinating cognitive feat, allowing us to communicate with others using complex utterances and other signals connected to some meaning. However, the origins of natural language are up for debate, with many theories being proposed, and it remains uncertain how its various characteristics, such as grammatical structure, have evolved (Christiansen and Kirby, 2003).

The evolution of language is studied in many ways. Naturally, testing theories of language emergence empirically in humans and in animals, both through field work and in labs, has its limitations, so researchers have also resorted to mathematical and computational modelling (Christiansen and Kirby, 2003; Nolfi and Mirolli, 2010; Skyrms, 2010). For example, simulations with simple computational models have been used to investigate how linguistic structure can emerge (Batali, 1998; Briscoe, 1999; Kirby, 2000). However, these early experiments are often characterised by careful modelling of the communication channel and, even more so, by disentangled structured input data.

Simulating language emergence using deep neural agents Recently, there has been a renewed interest in computational modelling of language evolution, but now through conversing deep neural agents, which contrasts with previous experiments in having complex input scenarios (e.g. real pictures) and a more spontaneous emergence of language through end-to-end learning from input to action (Foerster et al., 2016; Lazaridou et al., 2017; Havrylov and Titov, 2017). This 'new wave' is motivated not only by creating a test bed for language evolution, but also by the promise of developing practical AI through grounding language understanding in situated learning and capturing the functional aspect of human communication (Mikolov et al., 2016; Kiela et al., 2016; Gauthier and Mordatch, 2016).

Referential games are used extensively to create a setting for simulating language emergence (e.g. Skyrms, 2010; Huttegger and Zollman, 2011; Steels and Loetzsch, 2012; Lazaridou et al., 2017; Havrylov and Titov, 2017).2 The referential game is a form of the Lewis signalling game (Lewis, 1969): a fully cooperative game in which one agent needs to communicate information unknown to the other by means of arbitrary signals. Through playing this game, the agents generally develop a shared understanding of what the signals mean, but recent work has also shown that interesting language properties may emerge in certain scenarios.

Researchers face various challenges in implementing these referential games with deep neural agents: technical challenges (e.g. how to update the parameters of the agents?), challenges in finding interesting scenarios (e.g. which pressures lead to the emergence of syntactic structure?), and challenges in evaluating the emergent languages (e.g. what linguistic structure underlies the signals?). One of the goals of this thesis is to study what variations are possible in order to design new interesting scenarios. However, in the experiments I primarily focus on evaluating emergent languages, in particular their syntactic structure. In the next section I clarify the motivation for studying the syntax of these languages.

2However, there are examples of other games being used (e.g. the non-cooperative negotiation game, Cao et al., 2018).

1.1. Motivation

Simulating language emergence in referential games has shown that deep neural agents are able to successfully communicate about symbolic data and even real pictures (e.g. Lazaridou et al., 2018). However, a successful game does not make clear to what extent the emergent languages are similar to natural language, or what the agents are actually talking about. For this reason, various qualitative and quantitative evaluation techniques have been proposed in previous work to uncover the characteristics of these languages.

Most of this work concerns the semantics of the emergent languages. First of all, it is not always the case that the agents talk about concepts (e.g. the word "bear" referring to the concept of the animal); often the game can be played perfectly well by relying only on low-level feature information (e.g. individual pixel values) (Bouchacourt and Baroni, 2018; Lazaridou et al., 2018). However, the categorisation of reality to express information is an important aspect of human symbolic communication systems (Steels, 2010). Besides categorisation, compositionality is another sought-after semantic characteristic (e.g. Kottur et al., 2017; Lazaridou et al., 2018; Choi et al., 2018). In a compositional language, the meanings of sub-signals can be composed into new meanings, greatly increasing the range of information that can be expressed with a finite set of signals and making generalisation to unseen descriptions possible. However, researchers face various challenges in their analyses of these semantic properties, especially in demonstrating compositionality (Ren et al., 2020; Kharitonov and Baroni, 2020), while also being restricted to games for which a symbolic representation of the input is available.
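The compositional property described above can be made concrete with a small sketch. All symbols and attribute names below are hypothetical, chosen purely for illustration: each attribute value maps to a fixed sub-signal, messages are formed by concatenation, and attribute combinations never seen before can still be encoded and decoded.

```python
# Toy compositional code: each attribute value has a fixed sub-signal,
# and a message is the concatenation of the sub-signals. All symbols
# and attributes here are hypothetical, purely for illustration.
COLOR = {"red": "aa", "blue": "ab"}
SHAPE = {"circle": "ba", "square": "bb"}

def encode(color, shape):
    # Compose the sub-signals for the two attributes into one message.
    return COLOR[color] + SHAPE[shape]

def decode(message):
    # Invert the composition: each attribute is read off independently,
    # so even unseen attribute combinations can be interpreted.
    inv_color = {v: k for k, v in COLOR.items()}
    inv_shape = {v: k for k, v in SHAPE.items()}
    return inv_color[message[:2]], inv_shape[message[2:]]
```

With four sub-signals this code covers all four attribute combinations; a holistic (non-compositional) language would instead need a separate arbitrary signal for every combination.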

While the semantics have been studied extensively, little attention has been given to the syntactic analysis of languages emerging between deep neural agents. Syntax is another important part of language, responsible for (hierarchically) structuring the words in a sentence, for example through word order and function words. A grammatical language is subject to extra rules that guide meaning formation, and can add more semantic information or facilitate more complex messages by reducing the ambiguity of the composition and making the interpretation less dependent on the shared context of the agents (Parisi, 1983; Steels, 2010). The formation of grammar is argued to be one of the most complex transitions in signal structure towards a natural language, and it is seen as extremely challenging to make it emerge in embodied agents (Parisi, 1983; Mirolli and Nolfi, 2010; Steels, 2010). The syntactic analysis of emergent languages would better equip researchers to investigate the various hypotheses on how structured languages can emerge from less structured ones (e.g. Parisi, 1983; Kirby, 2002; Steels, 2010; Loreto et al., 2010; Kirby et al., 2014).

The research presented in this thesis addresses this gap in current analyses of the languages emerging between deep neural agents. First of all, to the best of our knowledge, we are the first to conduct a thorough syntactic analysis of these languages. Moreover, the findings may not only facilitate a better understanding of the evolution of syntax, but also complement existing analyses of the semantics, and of compositionality in particular, because of the close relationship between these properties. Furthermore, in contrast to many current evaluations, our analysis does not rely on a symbolic representation of the meaning space.

1.2. Research questions and contributions

This thesis focuses on the analysis of languages emerging in deep neural referential games (emergent languages). The main contribution is the syntactic analysis of emergent languages using unsupervised grammar induction (UGI) techniques, with the important benefit of not being limited to game scenarios for which a symbolic representation of the semantic content of the language is available. In particular, I examine the following research questions.

RQ1: What variations are possible in the design of a referential game experiment for simulating language emergence with deep neural agents?

The first research goal is to provide a map of current research on languages emerging in referential games, which could serve as a template for designing new experiments. In this thesis, I define a framework describing several dimensions of the referential game that can be manipulated to pressure the language emergence towards more interesting features. Two simple-to-implement pressures that have been studied in relation to the semantic structure of the emergent languages are varying the message length and vocabulary size of the communication channel. These variations are also the subject of RQ3.

RQ2: Can we use existing unsupervised grammar induction techniques, originally developed for natural language, for emergent languages as well?

The results of the experiments show that UGI techniques are suitable for the syntactic analysis of emergent languages, without many of the downsides of other evaluation techniques for emergent languages. We compare two UGI procedures, in which either a pre-neural statistical model or a neural parser (CCL and DIORA, respectively) induces the constituents, and Bayesian model merging is used to infer a grammar given the unlabelled constituency trees. For our setup, CCL seems to be a more suitable constituency parser than the neural DIORA, in addition to being computationally more efficient and simpler to use. These UGI techniques form the basis of the syntactic analysis of the emergent languages for answering RQ3.

RQ3: How do the message length and vocabulary size influence the emergence of syntactic structure?


We demonstrate that the message length and vocabulary size have some effect on the degree of structure found in languages emerging in a simple referential game.3 Specifically, based on measures of aptitude and language compressibility of the induced grammars, we only find significantly more structure than the random baselines for messages of length 10 and the largest tested vocabulary sizes of 13 and 27. The simpler languages with a message length up to 5 or a vocabulary size of 6 do not exhibit much structure. However, the structure we do find is still limited compared to natural language, which suggests that more research is needed into finding more interesting scenarios with different factors.

1.3. Thesis structure

In Chapter 2, I introduce the reader to the referential game and give a brief overview of solutions to unsupervised grammar induction (UGI). Chapter 3 contains a review of the relevant literature on evaluating emergent languages. I proceed in Chapter 4 with a framework of referential game design to address RQ1 and provide the reader with a background of variations of referential game setups found in the literature. The approach to analysing the syntactic structure of emergent languages is then described in Chapter 5, which also explains the referential game that gives rise to the studied languages. Subsequently, I present and discuss the results of two experiments concerning RQ2 and RQ3. First, in Chapter 6 I validate the suitability of the UGI procedure for the artificial setup, since these techniques are originally designed for natural language. I then provide a syntactic analysis of the emergent languages in Chapter 7. Finally, Chapter 8 summarises the main conclusions and recommends further work.

3The code for reproducing this research and applying the same methodology to other emergent languages will be published at https://github.com/i-machine-think/emergent_grammmar_induction when the anonymity period of the accompanying paper submission is over.


2. Background

In this thesis I investigate whether unsupervised grammar induction (UGI) techniques can be used for the syntactic analysis of languages emerging in referential games. To fully appreciate this research, one requires an understanding of what a referential game is and how UGI techniques can operate. First of all, this chapter starts with a description of the type of referential game we consider in this thesis and serves as a useful reference point, in particular for the framework discussed in Chapter 4. This description explains the terminology used in later chapters, while also showing how the different variables of the game relate to each other and the emergent language. Secondly, in §2.2 I present some state-of-the-art UGI techniques, including the ones used in the experiments discussed later in this thesis. This should provide the reader with enough understanding of how the employed UGI techniques work and how these compare to some other existing solutions.

2.1. Description of the referential game

For a better understanding of where the emergent languages come from, I give a mathematical description of the referential game and define some relevant terms to which I refer throughout the rest of this thesis.

In the referential game we consider, there are two agents, the sender and the receiver, which have to cooperate to successfully complete the task. One round of the game consists of the sender describing an object in a message to the receiver, who then has to point out which object the sender referred to. During training, the model parameters of the agents are updated according to how well the receiver identifies the right object given the messages. Specifically, the following points define the game.

1. The game is played with objects sampled from a data-set, one target t and one or more distractors, which are then presented to the agents. Depending on the design of the game, the sender may or may not see the distractors.

2. The message m is composed of symbols from a vocabulary of size V and has (maximum) length L.1 These messages constitute an emergent language once the agents can successfully communicate with them.

3. Given m, the receiver has to choose the target t from a set of objects that also contains a number of distractors. The model parameters of both the sender and the receiver are updated according to the prediction (distribution) of the played round.

1The effect of the vocabulary size and message length on the language emergence is studied later in this thesis.


4. The communication happens in one step and is uni-directional, meaning that the sender sends one message to the receiver and the receiver does not send a message back.
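As a minimal sketch, one round of the game described in points 1 to 4 might look as follows. The `sender` and `receiver` callables are hypothetical stand-ins for the neural agents, not an actual implementation.

```python
import random

def play_round(sender, receiver, dataset, n_distractors=3, seed=None):
    """One round of the referential game sketched above. `sender` maps a
    target object to a message of discrete symbols; `receiver` maps the
    message and a candidate set to the index of its guess."""
    rng = random.Random(seed)
    candidates = rng.sample(dataset, n_distractors + 1)  # target + distractors
    target = rng.choice(candidates)
    message = sender(target)               # uni-directional: one message only
    guess = receiver(message, candidates)  # the receiver sends nothing back
    return candidates[guess] == target     # success signal used for learning
```

In a real setup, the boolean outcome (or the receiver's prediction distribution) would drive the parameter updates of both agents.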

An illustration of the referential game can be found in Figure 2.1.

Many variations on the above description of the referential game are possible, which could lead to different emergence behaviours. Intentional changes that steer the language emergence are referred to as pressures and are explained in more detail in Chapter 4.

Figure 2.1.: Illustration of one round of the referential game. The sender agent (left) describes the target t in a message m to the receiver agent (right). Given the message, the receiver has to predict the target from a set also containing distractors.

2.2. Unsupervised grammar induction

An important research goal of this thesis is to investigate the syntactic structure of artificial languages emerging in the previously discussed referential game. Naturally, there are no gold annotations for the emergent languages, and we cannot use common supervised parsing techniques. Hence, we resort to unsupervised grammar induction (UGI) techniques for the syntactic analysis, which require little to no additional information about the sentences. Although a comprehensive review is not the goal of this thesis, I consider previous work on unsupervised parsing algorithms to set a context for the reader and to highlight the differences between the various techniques.

In this section, I discuss a few of these successful unsupervised grammar induction techniques. First, I discuss some traditional pre-neural statistical techniques, after which I proceed with more recent neural approaches. It is important to note that most of these techniques induce either the syntactic structure, its labels, or certain distributional information about the words, which means that some of them may complement each other. Since explaining all the required details can become quite dense, especially for the neural parsers, I refer the reader to the original papers for more information on the implementations.

2.2.1. Pre-neural statistical approaches

Statistical approaches were the main focus of research on unsupervised grammar induction before the advent of neural parsers. However, even now pre-neural statistical approaches have several benefits: they require little hyper-parameter tuning, often work with interpretable models, are computationally efficient, and tend to require less data than their neural counterparts (see for example the comparisons by Li et al., 2020). I give a brief overview of a selection of these models.

DMV One of the most successful pre-neural statistical unsupervised parsing models is the Dependency Model with Valence (DMV, Klein and Manning, 2004). DMV is an algorithm that generates dependency parses of sequences of POS-tags or word classes. A dependency parse consists of head-dependent relationships between the words and starts from one of the words being the ROOT (see Figure 2.2 for an example).

Figure 2.2.: Example dependency structure (from Klein and Manning, 2004).

DMV generates a dependency parse by taking a series of decisions that are conditioned on several parameters, which indicate whether to consider a dependent for the current head in a certain direction, and if so, which dependent would be the most likely. Specifically, DMV works with three parameters:

• P_stop(STOP | h, dir, adj), the probability of stopping the generation of dependents for the head h in direction dir, with the binary adjacency value adj indicating whether an argument has already been generated in this direction;

• P_choose(a | h, dir), the probability of selecting a dependent a for the head h in direction dir; and

• P_root, the probability that a specific word is the root.

Then, a dependency structure D(h) with root h has the probability given by

P (D(h)) = Y

dir∈{l,r}

Y

a∈depsD(h,dir)

Pstop(¬STOP|h, dir, adj)Pchoose(a|h, dir)


where deps_D(h, l) and deps_D(h, r) denote the dependents on the left and right of h, respectively.
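The product above can be computed directly for a one-level dependency structure. A sketch, with hypothetical toy parameter tables rather than learned DMV parameters:

```python
def dep_prob(head, deps, p_not_stop, p_choose):
    """P(D(head)) as in the formula above: a product, over both directions
    and all dependents, of P_stop(not-STOP | h, dir, adj) * P_choose(a | h, dir).
    `deps` maps 'l'/'r' to the ordered dependents on that side; the parameter
    tables are hypothetical toy values, not learned DMV parameters."""
    p = 1.0
    for direction in ("l", "r"):
        for i, dep in enumerate(deps.get(direction, [])):
            adj = (i == 0)  # adjacency: no argument generated yet on this side
            p *= p_not_stop[(head, direction, adj)]
            p *= p_choose[(dep, head, direction)]
    return p
```

For the full DMV one would additionally recurse into each dependent's own sub-structure and include the terminating stop probabilities.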

While technically DMV could be used on raw text directly, this decreases its performance considerably.2 Without the categorisation of similar words into word classes, the number of parameters grows greatly for any considerably large vocabulary. Spitkovsky et al. (2011) show that DMV can achieve good results with word classes provided by word clustering algorithms (discussed later in this section), but the parsing accuracy generally seems to be lower than when supervised POS-tags are used (Mareček and Straka, 2013).

CCL Another successful statistical model is the common cover links (CCL) parser introduced by Seginer (2007a,b). CCL is one of the constituency parsers we test in this thesis; it outputs unlabelled constituency structures for sentences without the need for POS annotations. Even after the introduction of neural models, the CCL parser is still one of the best unsupervised constituency parsers that does not require gold POS-tags, and it performs even better than many approaches that do require those (Reichart and Rappoport, 2010; Ponvert et al., 2011).

CCL works with common cover links, a representation similar to dependency trees but with some subtle differences that suit the parsing algorithm. These common cover links are directed links that start at the highest leaf node of a sub-tree and end at another leaf node in that same sub-tree; from these links the parser can infer the constituents. All links x → y have a depth d ∈ {0, 1}, where the value of d indicates that the linked words x and y share all except d of the constituency brackets. Restricting the depth to 0 or 1 is one of the heuristics to limit the search space for learning and parsing, and it results in skewed trees where every sub-tree has at least one short branch. The example in Figure 2.3 illustrates the common cover link representation of a sentence.

Figure 2.3.: An example shortest common cover link set (from Seginer, 2007a).

CCL is an incremental parser, which means that the words in a sentence are read one by one, and only links from already-seen words to the last-read word are considered.

Another important feature is that the parsing of the sentences is guided by a learned lexicon that is used to score the possible links. This lexicon is updated based on the links found after each parse, and contains a list of labels describing the contexts in which the words have appeared. In contrast to DMV, CCL performs better with lexical items than with POS-tags, since it exploits the assumed Zipfian distribution of the words: the labels in the lexicon are dominated by high-frequency words, and words with a similar distribution are therefore parsed using the same labels.3

BMM Another technique proposed for unsupervised grammar induction is Bayesian Model Merging (BMM, Stolcke and Omohundro, 1994), which can find a probabilistic context-free grammar (PCFG). While using BMM to induce grammars for natural language corpora proved to be computationally infeasible (Stolcke and Omohundro, 1994), BMM has been used successfully to infer labels for unlabelled constituency trees (Borensztajn and Zuidema, 2007), from which a PCFG can be read. It can therefore complement a technique such as the previously discussed CCL algorithm, which can provide the constituents.

The BMM algorithm finds the most likely grammar by applying the minimum description length principle, choosing the grammar that requires the smallest number of bits to encode both the grammar and the data. It does so by iteratively searching for the grammar that minimises the sum of the grammar description length (GDL) and the data description length (DDL).

The GDL is defined as the number of bits required for encoding the grammar in three parts: the top productions R1, the lexical productions R2, and the non-lexical productions R3. This results in the following formula:

GDL = \log(N+1) \cdot \sum_{r \in R_1} (N_r + 1) + \left(\log(N+1) + \log(T)\right) \cdot T + \log(N+1) \cdot \sum_{r \in R_3} (N_r + 2) + \log(N+1) \cdot 2,   (2.1)

where N is the number of unique non-terminals, N_r the number of non-terminals on the right-hand side (RHS) of production r, and T the number of terminals.

The DDL is the number of bits required to encode the data given the grammar M, i.e. −\log P(X \mid M), where the total likelihood is defined as the product of the likelihoods of the sentences X:

P(X \mid M) = \prod_{x \in X} \sum_{der:\, yield(der) = x} P(der \mid M),   (2.2)

where the likelihood of a sentence x is the sum of the conditional probabilities of all possible derivations der of the sentence given grammar M; in practice, it is approximated by the probability of only the most likely derivation.
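Equation 2.1 can be implemented directly. The sketch below assumes a toy grammar in which each top and non-lexical production is represented only by its count of right-hand-side non-terminals; the representation and names are my own, not taken from the BMM implementation.

```python
from math import log2

def grammar_description_length(r1, r3, n_nonterminals, n_terminals):
    """GDL of Eq. 2.1 in bits. `r1` and `r3` hold, for each top and
    non-lexical production respectively, the number of non-terminals N_r
    on its right-hand side; the lexical productions R2 contribute one
    term per terminal."""
    n, t = n_nonterminals, n_terminals
    gdl = log2(n + 1) * sum(nr + 1 for nr in r1)    # top productions R1
    gdl += (log2(n + 1) + log2(t)) * t              # lexical productions R2
    gdl += log2(n + 1) * sum(nr + 2 for nr in r3)   # non-lexical productions R3
    gdl += log2(n + 1) * 2
    return gdl
```

During merging, BMM would recompute this quantity (together with the DDL) for each candidate grammar and keep the merge that lowers their sum.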

Word clustering As indicated previously, some techniques such as DMV require word classes and cannot be used on raw text directly. Unsupervised word clustering techniques can be used to find word classes if these are not available; for example, those of Brown et al. (1992) and Clark (2000) seem to work relatively well (Spitkovsky et al., 2011; Mareček, 2016).

Brown’s hierarchical clustering algorithm exploits the distributional context of the words; more specifically, it makes use of the bigram statistics of the corpus. Each word


starts in its own class, and pairs of classes are iteratively merged until K clusters are left. The merging process is based on some measure, for example maximising the average mutual information. The mutual information of two adjacent clusters c_i and c_j is defined as

MI(c_i, c_j) = \log \frac{P(c_i, c_j)}{P(c_i) \cdot P(c_j)},  (2.3)

where P(c_i, c_j) is the probability of c_j following c_i, and P(c) the probability of cluster c in the corpus.
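To make Eq. 2.3 concrete, the sketch below computes the corpus-level average mutual information over adjacent class bigrams, the quantity Brown clustering greedily tries to preserve when merging classes. The function is an illustrative objective computation of mine, not the merging algorithm itself.

```python
import math
from collections import Counter

def average_mutual_information(class_seq):
    """Average of Eq. 2.3 over the adjacent class bigrams of a corpus,
    weighted by the bigram probabilities (a sketch of the objective,
    not of Brown's merging procedure)."""
    bigrams = Counter(zip(class_seq, class_seq[1:]))
    left = Counter(class_seq[:-1])
    right = Counter(class_seq[1:])
    total = sum(bigrams.values())
    mi = 0.0
    for (ci, cj), n in bigrams.items():
        p_ij = n / total
        mi += p_ij * math.log2(p_ij / ((left[ci] / total) * (right[cj] / total)))
    return mi

# Strictly alternating classes: each class is highly predictive of the next
print(round(average_mutual_information(list("ababababab")), 2))  # → 0.99
```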

Clark’s flat clustering algorithm also groups words based on their context, which is defined by the preceding and next word. K clusters are then found using the Kullback-Leibler divergence as distance between the context distributions of the words. An expectation-maximisation (EM) algorithm (Dempster et al., 1977) can be used to find the minimum.

2.2.2. Neural approaches

More recently, a few neural unsupervised parsing models have also been proposed. I discuss two neural constituency parsers that could provide the constituents for the previously discussed BMM, and a model for computing the word embeddings required by one of these parsers.

URNNG One of these neural models is unsupervised recurrent neural network grammar (URNNG, Kim et al., 2019)4, which uses amortised variational inference (Kingma and Welling, 2013) to handle the large search space of possible parse trees when jointly optimising a syntax model and a language model. To limit the computational complexity, only binary trees are considered in URNNG and the constituent labels are ignored.

URNNG consists of a generative model and an inference network, as can be seen in the illustration of the procedure in Figure 2.4. The inference network q_φ(z|x) is a CRF parser producing a distribution over binary trees given the sentences x, where a binary tree z consists of a sequence of SHIFT and REDUCE actions. The generative model p_θ(x, z) incrementally generates the terminals while building the parse tree based on the current stack representation. At each step of the generative model, either i) the next terminal symbol is generated and pushed to the stack (if z_t = SHIFT), or ii) the last two elements of the stack are popped and a tree LSTM composes a new representation from them, which is then pushed to the stack (if z_t = REDUCE). For training URNNG, a binary tree is sampled from the inference network and the log joint likelihood log p_θ(x, z) is optimised.

While state-of-the-art results can be obtained, the performance of URNNG is, as the authors note, heavily dependent on the punctuation of the sentences, and it cannot improve on a right-branching baseline on an unpunctuated corpus. URNNG might therefore not be a good fit for emergent languages, which may not have any punctuation.5

4 URNNG is an unsupervised version of RNNG (Dyer et al., 2016).
5 We also found that URNNG exclusively generated right-branching trees for the emergent languages in an initial stage of our experiments.
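The SHIFT/REDUCE encoding of binary trees used by URNNG can be illustrated with a small sketch. The function below only rebuilds the tree structure from an action sequence; the actual model scores each action and composes stack entries with a tree LSTM.

```python
def build_tree(tokens, actions):
    """Rebuild the binary tree encoded by a SHIFT/REDUCE action sequence
    (a structural sketch of URNNG's stack mechanics, without scoring)."""
    stack, tokens = [], list(tokens)
    for a in actions:
        if a == "SHIFT":                      # generate the next terminal
            stack.append(tokens.pop(0))
        else:                                 # REDUCE: merge the top two entries
            right, left = stack.pop(), stack.pop()
            stack.append((left, right))
    assert len(stack) == 1, "actions must yield a single tree"
    return stack[0]

print(build_tree(["the", "cat", "sleeps"],
                 ["SHIFT", "SHIFT", "REDUCE", "SHIFT", "REDUCE"]))
# → (('the', 'cat'), 'sleeps')
```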


Figure 2.4.: Illustration of URNNG with an inference network (left) for sampling a binary tree z, and a generative model (right) defining the joint probability distribution over the sentences and parse trees (from Kim et al., 2019).

DIORA Another state-of-the-art neural parser is the deep inside-outside recursive auto-encoder (DIORA, Drozdov et al., 2019), which is one of the constituency models we use later in this thesis. DIORA is an unsupervised latent tree induction model that can find constituency parses, but in contrast with URNNG it does require word embeddings in the process. The main principle of the model is based on the hypothesis that the true syntactic structure leads to the most effective compression. As is illustrated in Figure 2.5, DIORA comprises two steps: i) an inside pass for finding a representation for the input sentence given all the possible binary constituency trees, and ii) an outside pass where the external context is taken into account when computing a representation for the constituents.

The inside pass starts with a vector representation for each leaf node, which is conditioned on the provided word embeddings, and recursively composes the representation of a parent node using a tree LSTM as a function of its two child nodes. This step is repeated until the root vector for the whole sentence is reached. In addition, a learned compatibility function provides the scalar weights used in combining the inside vectors. The outside pass has a similar procedure, but it starts at a separately learned generic root vector at the top and unfolds into the leaf node representations. The same tree LSTM is used for composing each node representation from its parent node and the inside vector of the sibling node. The final outside leaf node representations are optimised to reconstruct the original input, for which a max-margin loss for sentence x with T tokens


is used, defined as

L_x = \sum_{i=0}^{T-1} \sum_{i^*=0}^{N-1} \max\left(0,\; 1 - \bar{b}(i) \cdot \bar{a}(i) + \bar{b}(i) \cdot \bar{a}(i^*)\right),  (2.4)

where N negative examples are sampled according to the vocabulary frequency, \bar{b}(i) and \bar{a}(i) are the outside and inside vectors for input token i, respectively, and \bar{a}(i^*) is the inside vector for the sampled negative example. Finally, given the inside and outside charts populated by DIORA for a sentence, a CKY parser is used for reconstructing the constituents.6
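A minimal sketch of the loss in Eq. 2.4, assuming the inside and outside vectors are given as matrices; the function name and shapes are my own choices, not the DIORA reference implementation.

```python
import numpy as np

def diora_reconstruction_loss(outside, inside, neg_inside):
    """Max-margin reconstruction loss of Eq. 2.4 (a sketch).

    outside:    (T, d) outside vectors b(i) for the T input tokens
    inside:     (T, d) inside vectors a(i) for the same tokens
    neg_inside: (N, d) inside vectors a(i*) of N sampled negative tokens
    """
    pos = np.sum(outside * inside, axis=1, keepdims=True)  # b(i)·a(i), (T, 1)
    neg = outside @ neg_inside.T                           # b(i)·a(i*), (T, N)
    return float(np.maximum(0.0, 1.0 - pos + neg).sum())

rng = np.random.default_rng(0)
loss = diora_reconstruction_loss(rng.normal(size=(5, 8)),
                                 rng.normal(size=(5, 8)),
                                 rng.normal(size=(3, 8)))
print(loss >= 0.0)  # hinge terms are clipped at zero, so the loss is non-negative
```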

Figure 2.5.: Illustration of an inside-outside pass for DIORA (from Drozdov et al., 2019). During the inside pass, the model incrementally computes an inside vector (in blue) for each constituent by taking the weighted average over all possible pairs of constituents that could be part of it, ending at the root of the sentence. The outside pass starts at the root, and incrementally computes the outside vectors (in red) as a function of the parent vector and the inside vector of the sibling. The outside vectors at the bottom of the chart are optimised to predict the original words of the input sentence.

GloVe GloVe (Pennington et al., 2014) is one of the suitable vector models for DIORA and is the one we use later in this thesis. The main idea is to find vectors for representing each word in the vocabulary, where words that occur in similar contexts are placed together in the vector space, thereby also capturing some semantic information.

The GloVe model starts with the one-time creation of a word-word co-occurrence matrix X_{ij}, which describes how often word j occurs in the context of word i. The context is defined by a word window of fixed size before and after the word, and in computing the values of the co-occurrence matrix, words farther away in the context window receive less weight.

6 I refer the reader to the original paper for more details on the CKY algorithm, as well as the other discussed procedures.

The co-occurrence matrix then serves as the basis for learning the GloVe vectors, where the authors rely on the ratios of the co-occurrences to capture the relationships between the words in the learning objective.7 They define the following cost function:

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,  (2.5)

where V indicates the vocabulary size, w_i and \tilde{w}_j the word vector and context word vector, respectively, b and \tilde{b} the additional bias terms, and f(x) a weighting function defined as

f(x) = \begin{cases} (x / x_{max})^\alpha & \text{if } x < x_{max} \\ 1 & \text{otherwise,} \end{cases}  (2.6)

where \alpha = 3/4 is empirically shown to work well and the cutoff x_{max} is another hyper-parameter.


3. Related work on evaluation techniques

The main focus of this thesis is to investigate the syntactic structure of languages emerging in referential games with deep neural agents. In the previous chapter I explained what a referential game is and examined some unsupervised grammar induction (UGI) techniques that can be used to analyse the syntactic structure of languages. This chapter continues with a review of the related work on the evaluation of emergent languages, where I also point out the downsides of previous analyses and what is still missing.

My review is divided into four sections, to capture the broad nature and aims of the evaluation techniques. In particular, I discuss i) the communication success, ii) symbol usage statistics, iii) semantic analyses, and iv) (the lack of) syntactic analyses. What follows is a summary of the most important techniques used in the recent literature on language emergence in referential games.1 I conclude with a discussion of the current analyses and a motivation of the need for a proper syntactic analysis of emergent languages.

3.1. Communication success

An important step in the emergence of language in referential games is the successful use of a priori meaningless symbols for communication. From a pragmatic point of view, the agents can be said to ‘understand’ a language when they can use it to play the referential game, which can be quantified by measuring the task success. Hence, a (near) perfect task accuracy is used to show that the agents have reached the point of successful communication. However, even though successful communication is an important prerequisite for more interesting languages, it does not mean that the emergent language is human-interpretable or that it resembles a natural language (Kottur et al., 2017). A high task success is thus a necessary but not a sufficient condition, and other evaluation techniques are required.

3.2. Symbol usage statistics

Various analyses of the languages involve studying the statistics of the agents’ use of the symbols. These statistics can reveal some interesting insights into the nature of the emergent languages. Some of these symbol statistics do not depend on knowledge about the objects, while other analyses relate the message distributions to the described input context.

1 I also discuss one work on the non-cooperative negotiation game by Cao et al. (2018), because of the similarities of its analysis to those of languages emerging in referential games.


Input-agnostic statistics Examples of input-agnostic analyses are counting the number of active symbols (Lazaridou et al., 2017; Rodríguez et al., 2020), studying the (average) message lengths (Lazaridou et al., 2018; Chaabouni et al., 2019; Rodríguez et al., 2020), computing the language entropy to quantify the variability in symbol usage (Rodríguez et al., 2020), and examining the signal distributions (Chaabouni et al., 2019; Cao et al., 2018). One important finding from such metrics is that Zipf's law of abbreviation (the natural language phenomenon where more frequent words are shorter) only occurs in emergent languages under special circumstances, such as a length minimisation pressure (Chaabouni et al., 2019).
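Several of these input-agnostic statistics are simple to compute directly from a list of messages; the sketch below (with illustrative names of my own) counts the active symbols, the average message length, and the language entropy.

```python
import math
from collections import Counter

def language_entropy(messages):
    """Shannon entropy (in bits) of the symbol distribution over all
    messages, one of the input-agnostic statistics discussed above."""
    counts = Counter(sym for msg in messages for sym in msg)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

msgs = [[1, 2, 3], [1, 2], [1]]
active_symbols = len({s for m in msgs for s in m})
avg_length = sum(map(len, msgs)) / len(msgs)
print(active_symbols, avg_length, round(language_entropy(msgs), 3))
# → 3 2.0 1.459
```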

Input-dependent statistics The symbol and signal distributions can also be studied in relation to the described input objects. For example, the message distinctness compares the number of unique messages with the number of unique input objects (Choi et al., 2018; Rodríguez et al., 2020), with which Choi et al. find that in their experiment the agents tend to use the same message for different input objects. Another example is the perplexity per symbol, which focuses on individual symbols and reveals how often a symbol occurs in a message describing a particular input object (Havrylov and Titov, 2017; Rodríguez et al., 2020). With this metric Rodríguez et al. find that certain scenarios result in a consistent use of the same symbols for the same input objects.
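Message distinctness, for instance, can be sketched in a few lines; the function name and the convention of dividing unique messages by unique inputs follow my reading of the metric, not a reference implementation.

```python
def message_distinctness(messages, inputs):
    """Ratio of the number of unique messages to the number of unique
    input objects; values below 1 indicate that the agents reuse the
    same message for different inputs."""
    return len({tuple(m) for m in messages}) / len({tuple(x) for x in inputs})

# Two distinct objects described by one and the same message
print(message_distinctness([[5, 5], [5, 5]], [[0, 1], [1, 0]]))  # → 0.5
```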

While the above discussed metrics can provide some insights into the properties of the emergent languages and are easy to interpret, not all important questions can be answered with these. For example, they do not show the semantics of the messages and what the agents are talking about. However, various other approaches have been proposed that do concentrate on the semantics and go further than the aforementioned symbol statistics.

3.3. Semantic analysis

Many researchers use semantics-based analyses to study the characteristics of the emergent languages, relying on the meaning of the messages by examining the relation with the described objects. These analyses can help answer questions about the nature of the languages, such as whether the messages pertain to actual concepts or instead rely on low-level feature information (e.g. Bouchacourt and Baroni, 2018; Lazaridou et al., 2018), or whether there is any structure in the mapping between the signals and their meanings (e.g. Lazaridou et al., 2018). Various metrics have been proposed that shed light on the semantics of the messages, of which I give a brief overview below.

3.3.1. Qualitative inspection

Qualitative inspections of the messages and the described inputs have been used to study whether category-specific information is captured by the emergent languages through affixes (Lazaridou et al., 2018), word order (Havrylov and Titov, 2017), or even a hierarchical coding scheme (Havrylov and Titov, 2017). Similarly, Bouchacourt and Baroni


(2018) have used inspections to cross-check their finding that the messages may not capture concepts at all and that the agents can play the game by relying on low-level feature information instead. Next to that, previous studies have qualitatively shown that the messages may describe particular properties of the inputs under certain conditions, such as the colour, shape or size of the objects (Kottur et al., 2017; Choi et al., 2018; Słowik et al., 2020; Rodríguez et al., 2020).

3.3.2. Cluster quality

Clustering the signal space according to the meanings the signals represent is another way to study the semantic structure of the language. These clusters can be inspected qualitatively through visualisation in a lower dimension (Lazaridou et al., 2017) or by sampling the neighbouring words if these are interpretable (Lee et al., 2018). Quantitative measures such as the purity (Lazaridou et al., 2017) and the F-measure (Lan et al., 2020) are also used to study the cluster quality, generally showing above-chance clustering with both symbolic data and real pictures. Complementary to this, Lan et al. (2020) not only examine the sender's output, but also the receiver's perception, by sampling new messages from the clusters and finding a similar task accuracy as with the actual messages.
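Purity, one of the quantitative cluster measures mentioned above, can be sketched as follows (the function name and toy data are mine).

```python
from collections import Counter

def purity(cluster_ids, labels):
    """Cluster purity: the fraction of items covered by the majority
    label of their cluster; higher is better, 1.0 is a perfect match."""
    by_cluster = {}
    for c, y in zip(cluster_ids, labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(labels)

# Cluster 1 mixes a 'dog' and a 'cat' item, so purity drops below 1
print(purity([0, 0, 1, 1], ["cat", "cat", "dog", "cat"]))  # → 0.75
```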

3.3.3. Diagnostic classification

To investigate what information is captured by a neural network, diagnostic classification can be used (Hupkes et al., 2018).2 A diagnostic classifier is trained on the internal representations of the network to predict a distinct input feature; the test accuracies then show whether the specific feature information is retained. Diagnostic classification has been used to reveal which concepts are present in the visual representation of the agents (Lazaridou et al., 2018). This approach also works for examining the messages, by first training an additional recurrent neural network to encode these, for example to study whether the hidden states of the agents are communicated (Cao et al., 2018) or which properties of the objects are represented in the messages (Rodríguez et al., 2020). Rodríguez et al. demonstrate that of the different input properties (i.e. colour, shape, and position), the position is most consistently retained.
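A diagnostic classifier can be sketched with a simple linear probe. The snippet below uses a least-squares fit as a minimal stand-in for the logistic-regression classifiers typically used; the data, names, and thresholding are all illustrative.

```python
import numpy as np

def diagnostic_accuracy(train_h, train_y, test_h, test_y):
    """Fit a linear probe on hidden representations to predict a binary
    input feature, and report test accuracy (a least-squares sketch of
    diagnostic classification, not the usual logistic regression)."""
    H = np.hstack([train_h, np.ones((len(train_h), 1))])  # add a bias column
    w, *_ = np.linalg.lstsq(H, train_y, rcond=None)
    Ht = np.hstack([test_h, np.ones((len(test_h), 1))])
    pred = (Ht @ w > 0.5).astype(int)
    return float((pred == test_y).mean())

rng = np.random.default_rng(1)
h = rng.normal(size=(200, 8))        # stand-in "hidden states"
y = (h[:, 0] > 0).astype(int)        # feature linearly encoded in dimension 0
acc = diagnostic_accuracy(h[:150], y[:150], h[150:], y[150:])
print(acc > 0.8)  # the probe recovers the linearly encoded feature
```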

3.3.4. Zero shot performance

The test accuracy on unseen combinations of attributes measures the generalisability of the language. Zero shot performance is used as a proxy or indication for underlying (compositional) structure (Kottur et al., 2017; Lazaridou et al., 2018; Choi et al., 2018; Mordatch and Abbeel, 2018; Hupkes et al., 2018; Dess`ı et al., 2019; Resnick et al., 2020), as it is argued that the agents need to combine known attributes in new ways.


3.3.5. RSA / topographic similarity

Lastly, we consider representational similarity analysis (RSA) and topographic similarity. RSA is a technique to test how similar two embedding spaces are to each other, by measuring the correlation between the distances of corresponding pairs from both spaces (Kriegeskorte et al., 2008). Bouchacourt and Baroni (2018) use it to compare the agents' embedding spaces with each other, as well as to show how strongly these are correlated with the input space. They conclude that the agents' internal representations focus on low-level input features to distinguish between images, instead of capturing conceptual representations. RSA is also used to compare the meaning and signal space, in which case it is referred to as topographic similarity (Brighton and Kirby, 2006; Brighton et al., 2005). Several studies have used the topographic similarity to measure the extent to which similar inputs map to similar messages in emergent languages (Lazaridou et al., 2018; Andreas, 2019; Li and Bowling, 2019; Słowik et al., 2020).
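Topographic similarity can be sketched as the correlation between pairwise distances in the meaning space and in the signal space. The snippet below uses a Pearson correlation and Hamming distances, both illustrative choices; implementations also commonly use Spearman's rank correlation.

```python
import itertools
import numpy as np

def topographic_similarity(meanings, messages, dist_meaning, dist_message):
    """Correlation between the pairwise distances in the meaning space
    and the corresponding pairwise distances in the signal space."""
    pairs = list(itertools.combinations(range(len(meanings)), 2))
    dm = np.array([dist_meaning(meanings[i], meanings[j]) for i, j in pairs])
    ds = np.array([dist_message(messages[i], messages[j]) for i, j in pairs])
    return float(np.corrcoef(dm, ds)[0, 1])

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

meanings = [(0, 0), (0, 1), (1, 0), (1, 1)]
messages = [(5, 5), (5, 6), (7, 5), (7, 6)]  # mirrors the meaning structure
ts = topographic_similarity(meanings, messages, hamming, hamming)
print(ts)  # → 1.0 for a perfectly structure-preserving mapping
```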

3.4. Syntactic analysis

Previous work has primarily focused on the semantics of emergent languages, using techniques such as the ones discussed above. However, only a few attempts have been made to understand the syntactic structure, and the works that do exist are limited to either qualitative inspections or brief mentions of possible links with certain metrics.

3.4.1. Previous work on syntactic structure

Only a few attempts have been made to study the formation of grammar, although even these are limited in scope. Most of these studies use qualitative inspections of the messages to discover patterns in relation to the meaning space, and find for example that affixes are (consistently) used for expressing category specific or positional information (Lazaridou et al., 2018)3, or that these contain information on the colour and shape of the described object (Choi et al., 2018).

Yet others have speculated about a possible link between grammatical structure and certain metrics, such as the topographic similarity (Lazaridou et al., 2018) or the omission score of the language (Havrylov and Titov, 2017). The omission score of a word is computed as the difference between the probability that the receiver assigns to the target image given the message with and without that word, and Havrylov and Titov note a relationship between a higher omission score and a language with separate function and content words. A higher omission score would mean that semantic information is less evenly distributed over the words in the message: the maximum omission score within a message is higher if information is concentrated in a single word. However, both these metrics only give a weak indication of syntactic structure and may very well be confounded by other factors.
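The omission score can be sketched as follows, with placeholder probabilities standing in for the receiver's actual outputs (the function name is mine).

```python
def omission_scores(p_full, p_without):
    """Omission score per word of one message: the drop in the receiver's
    probability of the target image when that word is removed (a sketch;
    in practice the probabilities come from the trained receiver)."""
    return [p_full - p for p in p_without]

# The receiver is confident given the full message; removing the third
# word hurts the most, so that word carries most of the information.
scores = omission_scores(0.9, [0.85, 0.88, 0.30])
print([round(s, 2) for s in scores])  # → [0.05, 0.02, 0.6]
```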


3.5. Discussion

Earlier work on understanding emergent languages has concentrated on the analysis of the semantics of the messages. Here researchers aim to answer questions of what the agents talk about, whether the symbols refer to conceptual information, and whether there is any structure in the mapping between the signal space and the meaning space. Various approaches have been proposed to uncover the semantics. However, these evaluation techniques have their limitations. An important downside of often-used measures such as topographic similarity, diagnostic classification, and zero-shot performance is the reliance on a known symbolic representation of the input space. This means that researchers are limited to simple game scenarios (e.g. single-step games) and input data that is either generated from a known symbolic representation or for which there is a hypothesis thereof. However, even if such a representation can be found, it is possible that there is a misalignment between the researchers' ideas of the semantics and the features actually captured by the agents. It would therefore be beneficial to have a different approach to complement and possibly cross-check these evaluation techniques.

Additionally, focusing only on the semantics neglects the syntax of the messages, which is another important aspect of natural language. Until now, no serious attempts have been made at studying the syntactic structure besides qualitative inspections. The syntactic analyses that rely on qualitative inspections are limited in scale and are prone to ‘cherry picking’, notwithstanding that these approaches still require detailed information about the input, with the aforementioned disadvantages.

In short, the study of emergent languages would benefit from an analysis that focuses on the syntactic structure and does not require a detailed description of the input space. The syntactic analysis done later in this thesis addresses both of these points. Moreover, the use of unsupervised grammar induction techniques may not only shed light on the development of syntax, but may also complement the semantics-based analyses concentrating on the structure between the signal and the meaning space.


4. Framework of referential game design

The first research goal of this thesis is to provide an overview of the possible variations found in the design of referential games.1 The field of language emergence has received quite some attention in recent years, and new experiments are continually being proposed, all with different variations and hypotheses about their effect on the emergent languages. To aid in the understanding of these works and improve readability, I use a simple framework with several dimensions to categorise the different pressures, which are intentional changes to the game to steer the language emergence. Besides being a literature review of possible variations, this framework may also serve as a template for future referential game design to find more interesting scenarios.

In the following section, I discuss why the notion of pressures is a useful concept for explaining the mechanisms underlying the language emergence. Subsequently, I provide an overview of the literature by discussing the following three dimensions of the framework:

1. the world : everything related to the environment of the agents, including the data and the design of the referential game;

2. the agent : everything related to the internal world of the agent, including the architecture and communication channel; and

3. the learning: everything related to the approaches used for the optimisation of the agents’ parameters.

Table 4.1 provides an overview of the discussed topics for each of these dimensions.

The world      Modality of the data
               Distribution and ontology
               Game setup and different perspectives
The agent      Architecture
               Communication channel
The learning   Gradient estimation
               Regularisation and extra prediction tasks

Table 4.1.: Overview of the three dimensions for categorising variations in the referential game setups and the discussed topics.


4.1. The notion of pressures

Some authors of the papers I explore in this review use the term ‘pressure’ to describe the mechanisms used for steering the language emergence in a certain direction, often using metaphors or ideas from the cognitive sciences and evolutionary linguistics to illustrate these mechanisms (Choi et al., 2018; Chaabouni et al., 2019; Dagan et al., 2020; Rodríguez et al., 2020). The concept of pressures is useful for communicating the intuitions behind the experimental design and how it influences the nature of the emergent language. In relation to the framework of this chapter, I define pressures as all mechanisms applied to the basic referential game to coax the language evolution into a desired direction.

Rodríguez et al. (2020) make the useful further distinction between internal and external pressures, separating the constraints, capabilities, and biases within the agent design from the pressures that come from the environment. In the framework discussed in this chapter, I instead use three ways to categorise the pressures, as I take the optimisation of the model parameters as a separate dimension.

Commonly, researchers come up with different pressures that share the same underlying idea or address the same problem, only from another perspective. For instance, both the least-effort pressure of Rodríguez et al. (2020) and the length minimisation pressure of Chaabouni et al. (2019) encourage the agents to use shorter messages, but the former is a vocabulary loss while the latter is a penalty proportional to the message length. Furthermore, both works study the effect of a non-uniform distribution of the objects, but use different types of data and focus on different aspects of the language emergence. In the next sections I discuss the variations in referential game design found in the literature and categorise these along the different dimensions of the game. While these variations are regularly designed as pressures, not all of them are; some game design choices are simply arbitrary (e.g. the choice of method for the gradient estimation). However, it is still useful to be aware of these different modelling choices, as they may result in different language emergence behaviours.

4.2. The world

The world is the environment in which the agents ‘live’ and it pertains to how they perceive the objects in the referential game. This dimension includes the data as well as the design of the game, and makes the grounding of the emergent language in the objects possible by providing an environment shared by the speakers. The world is the main area for studying the impact of external pressures on the language emergence. In this section I discuss i) the modality of the data, ii) the ontology and distribution of the world, and iii) the game setup.

4.2.1. Modality of the data

There is a wide variety of different data that is used in referential game setups, but an important distinction can be made between symbolic and continuous data. Symbolic


data is structured and often in the form of attribute vectors (e.g. Kottur et al., 2017; Lazaridou et al., 2018; Li and Bowling, 2019; Dessì et al., 2019; Ren et al., 2020) or one-hot vectors (e.g. Chaabouni et al., 2019). With continuous data the agents have to form their own structure of the world, and the objects are typically represented by images (e.g. Havrylov and Titov, 2017; Lazaridou et al., 2017, 2018; Choi et al., 2018). It is therefore more similar to how we perceive the external world through our raw sensory input, which makes it interesting from a grounding perspective, as it provides an opportunity to study how to ground discrete symbolic meanings in a continuous world through shared experiences.

The choice of either continuous or symbolic data can have a significant impact on the emergent language. Lazaridou et al. (2018) show that whether the agents see the world as structured (as is the case with symbolic data) has a noticeable effect on the structure of the language, and they thus call for methods to increase the disentanglement in the visual representation.

Further distinctions in the modality of the data can be made in the case of image data. These continuous data can be categorised into real-life pictures (e.g. MSCOCO; Havrylov and Titov, 2017; Lazaridou et al., 2017) on the one hand, and synthetic images (Lazaridou et al., 2018; Rodríguez et al., 2020) on the other. One could argue that real-life pictures are closer to how we visually perceive the world, but it is less clear how to formalise their semantic representation in symbolic properties, as is required for some evaluation techniques (discussed in Chapter 3). For this reason, many researchers opt for synthetic data that is generated from a known symbolic representation, resulting in a world of simple shapes with compositional properties (i.e. different sizes, colours and locations) (Rodríguez et al., 2020); 3-D projections of simple geometric objects (Lazaridou et al., 2018); or abstract cartoon scenes (Lazaridou et al., 2020).

4.2.2. Distribution and ontology of the world

While the modality of the data is about the representation (e.g. real-life pictures), the world’s distribution and ontology are about the internal organisation of the world.

The world’s distribution defines how the objects are sampled in the referential game, which is often done uniformly (e.g. Kottur et al., 2017; Li and Bowling, 2019; Lazaridou et al., 2018), but some experiments pressure towards a more natural vocabulary distribution by skewed sampling (Dessì et al., 2019; Chaabouni et al., 2019; Rodríguez et al., 2020).2 Others create carefully constructed mini-batches with a skewed property distribution (Choi et al., 2018) or examine the effect of balancing the number of different properties the objects can have (Lazaridou et al., 2018).

The world’s ontology defines not the frequency, but how the objects relate to each other: for example, through classes (and possibly a hierarchy therein) in datasets such as MSCOCO, or through shared properties in the case of synthetic data. Lazaridou et al. (2018) test the effect of context-dependent sampling of the distractors, where co-occurrences are more realistic3, and find that the agents are more pressured to focus on concepts rather

2 Most do so successfully, but the results of Chaabouni et al. (2019) are negative.


than simple low-level feature similarities.

4.2.3. Game setup and different perspectives

Another pressure can be to change the perspective of both agents to improve concept formation, where the sender and receiver see different representations of the objects. The first choice in the perspectives is whether the sender sees the distractors (e.g. Lazaridou et al., 2017, 2020) or not (e.g. Havrylov and Titov, 2017; Lazaridou et al., 2018; Mihai and Hare, 2019; Choi et al., 2018; Kottur et al., 2017). Other examples of changes in perspective are to show a different image, but of the same class, as the target (e.g. Lazaridou et al., 2017); or to augment the data by rotating the images (Mihai and Hare, 2019). Rodríguez et al. (2020) successfully apply a few such pressures by changing the location of the shape in the image and adding noise to the colour.

4.3. The agent

Where the world is about the external environment of the agents, the agent dimension is about their internal world, specifically the architecture and the communication channel.

4.3.1. Architecture

The agents are generally modelled by recurrent neural networks that serve as a language module to handle multi-symbol messages, for example an LSTM (e.g. Havrylov and Titov, 2017; Lazaridou et al., 2018; Li and Bowling, 2019; Chaabouni et al., 2019; Rodríguez et al., 2020).4 If the referential game is played with images that are not featurised into an embedding beforehand, then the architecture also encompasses a visual module, for example a CNN or VGG.

The architecture of the agent is important, because it must be capable enough to capture the (visual) semantics and produce an ‘interesting’ language. As far as I am aware, there are no studies published with transformer-based agents instead of recurrent neural networks, but it would be interesting to see how the different inductive biases of the architectures result in different emergent languages.

An interesting architecture design is the obverter technique, introduced to referential games by Choi et al. (2018). The basic idea is reminiscent of the theory of mind (Premack and Woodruff, 1978): using the sender's own understanding as a proxy for how the receiver will understand the message. The sender samples the symbol from the vocabulary that most increases the probability that the agent itself would choose the target image. This step of sampling symbols for the message is repeated until the total probability exceeds a certain threshold or the maximum message length is reached. As Choi et al. (2018) point out, this has the added benefit of adding some pressure in line with the principle of least effort. However, it is important to point out

(31)

that only the receiver’s architecture is optimised and that the two agents have to switch roles throughout the training, which has the benefit of avoiding the non-trivial optimi-sation of the sender (discussed in more detail in§4.4), but it adds additional constraints to the agent design and training procedure.
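As a minimal sketch of the obverter decoding loop described above, the following greedily extends the message with the symbol that most raises the sender's own probability of picking the target; the `toy_score` function is a hypothetical stand-in for the sender's listener model, not the actual architecture of Choi et al. (2018):

```python
def obverter_encode(score_fn, vocab_size, max_len, threshold):
    """Greedy obverter message construction (sketch).

    score_fn(message) -> probability that the sender's OWN listener
    model would pick the target image given `message`; it stands in
    for the agent's understanding module.
    """
    message = []
    for _ in range(max_len):
        # Try every symbol and keep the one that raises the target
        # probability the most.
        best_sym = max(range(vocab_size),
                       key=lambda s: score_fn(message + [s]))
        message.append(best_sym)
        # Stop once the sender is confident enough in its own message.
        if score_fn(message) >= threshold:
            break
    return message

# Toy scoring function: the target happens to be "described" by
# symbol 3, and confidence grows with each occurrence of 3.
toy_score = lambda msg: 1.0 - 0.5 ** (1 + msg.count(3))
print(obverter_encode(toy_score, vocab_size=5, max_len=10, threshold=0.9))
# → [3, 3, 3]
```

Note how the confidence threshold, rather than a fixed length, terminates the message, which is the source of the least-effort pressure.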

Resetting the agents during the game is another interesting tool. Kottur et al. (2017) reset the receiver’s hidden state in a multi-step game to ensure that symbols do not change their meaning depending on when in the conversation they are uttered.

4.3.2. Communication channel

Even though the communication channel can be seen as part of the language module of the agent, it is useful to consider the vocabulary size and message length separately. Previous works have extensively studied the effect of the vocabulary size and message length on the emergent languages, in particular how the messages are structured in relation to the described objects (Kottur et al., 2017; Lazaridou et al., 2018; Li and Bowling, 2019; Cogswell et al., 2019; Chaabouni et al., 2019).

Regarding the message length, either single-symbol (e.g. Lazaridou et al., 2017) or multi-symbol messages are used, where the latter can have variable length (e.g. Havrylov and Titov, 2017). For generating messages of variable length, Havrylov and Titov (2017) sample from a categorical distribution parameterised by the hidden state of the recurrent neural network until the end-of-sentence token is generated or the maximum message length is reached. Typically, the symbols are sampled stochastically during training and greedily during testing (e.g. Lazaridou et al., 2018; Li and Bowling, 2019; Ossenkopf, 2020).
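The sampling loop for variable-length messages can be sketched as follows; `step_fn` is a hypothetical stand-in for the softmax over the sender RNN's hidden state, and `EOS = 0` is an assumed token index:

```python
import random

EOS = 0  # assumed index of the end-of-sentence token

def generate_message(step_fn, max_len, greedy=False):
    """Sample a variable-length message (sketch).

    step_fn(prefix) -> list of next-symbol probabilities, standing in
    for the sender RNN's output distribution.  Symbols are typically
    sampled stochastically during training (greedy=False) and decoded
    greedily during testing (greedy=True).
    """
    message = []
    for _ in range(max_len):
        probs = step_fn(message)
        if greedy:
            symbol = max(range(len(probs)), key=probs.__getitem__)
        else:
            symbol = random.choices(range(len(probs)), weights=probs)[0]
        message.append(symbol)
        # Generation stops at the EOS token or at the maximum length.
        if symbol == EOS:
            break
    return message

# Toy next-symbol distribution that always favours symbol 2.
toy_step = lambda prefix: [0.1, 0.1, 0.7, 0.1]
print(generate_message(toy_step, max_len=5, greedy=True))
# → [2, 2, 2, 2, 2]
```

With greedy decoding the loop is deterministic; during training, the stochastic branch is where the gradient-estimation problem of §4.4 arises.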

4.4. The learning

The learning dimension relates to the approaches used for optimising the agent’s model parameters. For this dimension, I only consider gradient-based methods, as these are most widely used in the context of deep neural agents. First, I discuss the choice of the gradient estimation method, where researchers have to resort to tricks for end-to-end optimisation, since simple back-propagation through the discrete sampling of the messages is not possible. I then give a brief summary of some regularisations and extra prediction tasks used to pressure language emergence.

4.4.1. Gradient estimation

For updating the parameters, the gradient of the loss function is required. Finding this gradient, however, is not straightforward, since it is not possible to back-propagate through the sampling of the discrete messages. Conventional back-propagation with stochastic gradient descent can be used for the receiver, but another method is required for estimating the gradient of the sender. Two approaches are generally taken:

1. treating the game as a reinforcement learning problem and using a score-function estimator,

2. using a continuous relaxation of the discrete message distribution to allow for end-to-end differentiation,

where the most common methods are REINFORCE and Gumbel-Softmax relaxation for these approaches, respectively (Kharitonov et al., 2019).

REINFORCE In the first approach, the referential game is seen as a (model-free) reinforcement learning problem. REINFORCE (Williams, 1992) is a commonly used method to update both the sender and receiver (e.g. Lazaridou et al., 2017; Chaabouni et al., 2019) and estimates the gradient using the following formula:

\[
\mathbb{E}_{m,o}\left[ L(o, l)\, \nabla_{\theta} \log P(m, o \mid \theta) \right], \tag{4.1}
\]

where m and o are the sampled message and the output of the receiver, respectively, L the loss function, l the target, θ the parameters of the agents, and P(m, o|θ) the joint probability distribution of the outputs.
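Equation (4.1) can be made concrete with a Monte-Carlo estimate for a toy setting in which the "message" is a single categorical choice from a softmax policy and the receiver is ignored; the loss vector and sample count below are illustrative, not taken from any of the cited works:

```python
import math
import random

def softmax(theta):
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    z = sum(e)
    return [x / z for x in e]

def reinforce_gradient(theta, loss, n_samples=50_000, seed=0):
    """Monte-Carlo estimate of E_m[ L(m) * d/dtheta log p(m|theta) ]
    for a single categorical 'message' (simplified form of Eq. 4.1)."""
    rng = random.Random(seed)
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for _ in range(n_samples):
        m = rng.choices(range(len(theta)), weights=probs)[0]
        # For a softmax policy: d/dtheta_j log p(m) = 1[j == m] - p_j.
        for j in range(len(theta)):
            grad[j] += loss[m] * ((1.0 if j == m else 0.0) - probs[j])
    return [g / n_samples for g in grad]

theta = [0.0, 0.0, 0.0]   # uniform policy over three messages
loss = [1.0, 0.0, 0.0]    # only message 0 incurs a loss
print(reinforce_gradient(theta, loss))
```

For this uniform policy the analytic gradient of the expected loss is $p_0(\mathbb{1}[j{=}0] - p_j)$, i.e. roughly $[2/9, -1/9, -1/9]$, which the estimate approaches as the sample count grows; the estimator is unbiased but, as discussed below, high-variance.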

Gumbel-Softmax relaxation Another frequently used solution is Gumbel-Softmax relaxation, based on the Gumbel-Softmax distribution (Jang et al., 2017; Maddison et al., 2016), which transforms the categorical message distribution into a differentiable continuous approximation (e.g. Havrylov and Titov, 2017; Mordatch and Abbeel, 2018). For samples from the categorical distribution with probabilities $p_k$, we obtain the continuous approximation $\tilde{w}_k$ by the following transformation:

\[
\tilde{w}_k = \frac{\exp\left( (\log p_k + g_k) / \tau \right)}{\sum_{i=1}^{K} \exp\left( (\log p_i + g_i) / \tau \right)}, \tag{4.2}
\]

where $g_k$ is sampled from the Gumbel distribution and $\tau$ is the temperature.

The straight-through Gumbel-Softmax (ST-GS) estimator is found by applying the Gumbel-Softmax relaxation on the backward pass, but discretising $\tilde{w}$ back during the forward pass using the arg-max operator. Compared to REINFORCE, it has less variance, but this comes at the cost of being a biased estimator, while REINFORCE is unbiased. Furthermore, it still requires setting the temperature parameter $\tau$, where too low a value results in a vanishing gradient, while too high a value can make it an unfaithful approximation.
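Equation (4.2) and the straight-through discretisation can be sketched without any deep learning framework as follows; this only illustrates the forward computation, whereas in practice an autodiff library routes gradients through the relaxed sample $\tilde{w}$:

```python
import math
import random

def gumbel_softmax(probs, tau, rng=random):
    """Continuous relaxation of a categorical sample (Eq. 4.2, sketch)."""
    # Standard Gumbel noise: g = -log(-log(u)), u ~ Uniform(0, 1).
    g = [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in probs]
    logits = [(math.log(p) + gk) / tau for p, gk in zip(probs, g)]
    m = max(logits)
    e = [math.exp(l - m) for l in logits]
    z = sum(e)
    return [x / z for x in e]

def straight_through(w_tilde):
    """Forward pass of the ST-GS estimator: one-hot arg-max of the
    relaxed sample (gradients would flow through w_tilde instead)."""
    k = max(range(len(w_tilde)), key=w_tilde.__getitem__)
    return [1.0 if i == k else 0.0 for i in range(len(w_tilde))]

w = gumbel_softmax([0.7, 0.2, 0.1], tau=0.5)
print(straight_through(w), round(sum(w), 6))
```

Lowering `tau` pushes the relaxed sample $\tilde{w}$ towards a one-hot vector, which is exactly the trade-off between gradient fidelity and gradient magnitude mentioned above.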

4.4.2. Regularisation and extra prediction tasks

There are several ways to apply pressures to the language emergence in the learning dimension, which can either be explicit by adding an extra loss term, or implicit by adding another prediction task.

For grounding the symbols in natural language, often a supervised image captioning task is used (Lazaridou et al., 2017; Havrylov and Titov, 2017; Lazaridou et al., 2020). Another method for grounding the emergent language is tried by Havrylov and Titov (2017) through adding a Kullback-Leibler divergence regularisation, where the difference between the agent’s distribution and that of natural language (approximated by a language model) is minimised; something similar is done by Resnick et al. (2020).
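The KL regulariser in question reduces to a standard divergence between two symbol distributions; a minimal sketch, where `p` stands for the agent's distribution and `q` for a language-model approximation of natural language:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) between the agent's symbol distribution p and a
    language-model approximation q of natural language (sketch of a
    grounding regulariser in the spirit of Havrylov and Titov, 2017).
    Adding this term to the loss penalises the agent for drifting
    away from natural-language statistics."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

print(round(kl_divergence([0.5, 0.5], [0.9, 0.1]), 4))
```

The divergence is zero only when the two distributions coincide, so minimising it pulls the emergent symbol statistics towards those of the language model.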

Others have used pressures in this dimension to change the symbol usage of the agents. For instance, Rodríguez et al. (2020) used an auxiliary loss for their least-effort pressure that penalises the agents for using a larger number of distinct symbols, effectively reducing message redundancy. Similarly, Chaabouni et al. (2019) added an additional cost proportional to the message length, called the length minimisation pressure, to evoke a distribution closer to Zipf’s Law of Abbreviation.
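The length minimisation pressure amounts to a one-line addition to the loss; in this sketch `alpha` is a hypothetical weighting hyper-parameter, not a value reported by Chaabouni et al. (2019):

```python
def total_loss(task_loss, message, alpha=0.01):
    """Task loss plus a length-minimisation pressure (sketch): an
    extra cost proportional to the message length, pushing the sender
    towards shorter, more Zipf-like messages.  `alpha` is a
    hypothetical weighting hyper-parameter."""
    return task_loss + alpha * len(message)

print(total_loss(1.0, [4, 7, 7], alpha=0.1))
```

Because the penalty grows linearly with length, frequently used messages are pressured hardest towards brevity, mirroring the Law of Abbreviation.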

Another frequently used pressure is entropy regularisation (Mnih et al., 2016) to stimulate more exploration (Lazaridou et al., 2018; Li and Bowling, 2019; Ren et al., 2020). Other prediction tasks are the prediction of the receiver’s hidden layer after hearing the message (Ossenkopf, 2020) to encourage ‘empathy’, or the prediction of the transformation applied to an image (e.g. rotation) to encourage more conceptual properties in the visual semantics (Mihai and Hare, 2019).
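Entropy regularisation likewise reduces to a simple extra term; a minimal sketch, where `beta` is a hypothetical coefficient and `probs` the sender's symbol distribution:

```python
import math

def entropy_bonus(probs, beta=0.01):
    """Entropy regularisation term (in the spirit of Mnih et al.,
    2016, sketch): the bonus is subtracted from the loss, rewarding
    higher-entropy (more exploratory) symbol distributions.  `beta`
    is a hypothetical coefficient."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return beta * h

# A uniform distribution earns the maximum bonus; a deterministic
# (one-hot) distribution earns none.
print(round(entropy_bonus([0.25, 0.25, 0.25, 0.25], beta=1.0), 4))
```

A sender collapsing early onto a single symbol thus loses the bonus, which counteracts premature convergence of the protocol.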

4.5. Discussion and conclusion

In this chapter I reviewed the literature on simulating language emergence, focusing on the possible variations found in the design of the referential game. To aid the understanding and comparison of the different works, I suggested a framework of referential game design with three dimensions, namely the world, the agent, and the learning.

Subsequently, I formulated the notion of pressures, which are intentional changes to the basic referential game, and recommended their use for making the intuitions about the underlying mechanisms of language emergence explicit. I proceeded by summarising the literature on referential game design using the proposed framework and gave various examples of how pressures can be applied to the different dimensions to achieve certain effects on the emergent languages.

The framework of referential game design provides insight into the possible modelling choices researchers can make and how these could tie to the emergence of specific properties of the emergent languages. Moreover, it may serve as a template for future researchers to identify unexplored (combinations of) pressures and design new referential games.

In the following chapters I discuss the syntactic analysis of emergent languages resulting from a game where the vocabulary size and message length of the communication channel serve as pressures towards more syntactic structure. The ease of implementing these pressures, and the fact that they have been studied extensively in the context of semantic structure, makes them particularly interesting as a starting point for investigating the syntax as well.
