
Modelling the Generation and Retrieval of

Word Associations with Word Embeddings

A Case Study for a Guesser Agent in the Location Taboo Game

Verna Dankers
10761225

Bachelor thesis
Credits: 18 EC

Bachelor's programme in Artificial Intelligence
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisors:
Dr. Aysenur Bilgin
Dr. Raquel Fernández
Institute for Language and Logic
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam


Abstract

Word associations capture important aspects of the semantic representation of words, by telling us about the contexts in which words appear in the world. Artificially mimicking word associations involves emulating the generation of word associations and the retrieval mechanisms underlying associative responses. Tasks in which this plays a primary role are word-guessing games, such as the Location Taboo Game. In this game, artificial guesser agents should guess the names of cities from simple textual hints and are evaluated with games played by humans. Thus, playing the games successfully requires mimicking associations that humans have with geographical locations. In this thesis, a method for modelling word associations is presented and applied to the construction of an artificial guesser agent for the word-guessing Location Taboo Game.

The acquisition of word associations is modelled through the construction of a semantic vector space from a tailored corpus about travel destinations, using context-predicting distributional semantic models. A targeted corpus annotation method is introduced to make the word associations more explicit. The guesser agent architec-ture retrieves associations during the game by calculating the associative similarity between a city and a hint from the semantic vector space. The annotation method significantly improves performance. The results on a dataset of example games in-dicate that the proposed architecture can guess the target city with up to 27.50% accuracy – a substantial improvement over the 5% accuracy achieved by the baseline architecture.


Acknowledgements

Foremost, I would like to express my sincere gratitude to my supervisors, Dr. Aysenur Bilgin and Dr. Raquel Fernández, for offering me the opportunity to participate in the Taboo City Challenge and to work under their guidance. Dr. Aysenur Bilgin inspired me through her enthusiasm and the creative, helpful and innovative ideas she suggested for the research presented here. Thank you for the extensive feedback on my work and the many hours we spent discussing it. My thanks also go to Dr. Raquel Fernández, who challenged me to think about my research and taught me how to write about it. I thank both of you for sharing your knowledge and expertise.

Second, I would like to thank my dear friend Martin, for his very valuable comments on my presentations and this thesis. Collaborating with you throughout these past three years has been an honour, competing with you has been a delight and discussing our work is always insightful and enjoyable.

Finally, I want to express my gratitude to my GEO specialist, partner in life and love. Wietze, thank you for always having more faith in me than I have in myself.


Contents

1 Introduction
2 Background
  2.1 Theoretical Foundation
  2.2 Algorithms
    2.2.1 Continuous Bag of Words
    2.2.2 Skip-Gram
    2.2.3 Hierarchical Softmax
    2.2.4 Negative Sampling
    2.2.5 Similarity Metrics
  2.3 Related Work on Word-Guessing Games
3 Case Study: Location Taboo Game
4 Approach
  4.1 Data and Pre-processing
    4.1.1 Targeted Corpus Annotation
  4.2 Agent Architecture
    4.2.1 Strategies for Choosing Candidate Cities
    4.2.2 Game Strategy
  4.3 Implementation
5 Experiments and Results
  5.1 Experimental Setup
  5.2 Word Embedding Parameter Tuning
  5.3 Agent Architecture Parameter Tuning
  5.4 Extrinsic Evaluation
  5.5 Intrinsic Evaluation
6 Discussion
7 Conclusion and Future Work
References
Appendix A Word Embedding Parameter Tuning Results
Appendix B Agent Architecture Parameter Tuning Results

1 Introduction

The study of word associations lies at the centre of understanding the organisation of the lexical knowledge of humans. Word associations are acquired through world experience. When words frequently occur in close proximity, an associative link is formed between them in the observer. Examples of words with an associative link are ‘plant’ and ‘garden’ or ‘film’ and ‘actor’. Besides capturing important aspects of the semantic representation of words, word association responses tell us something about the underlying retrieval mechanisms (De Deyne & Storms, 2008b). This work contributes to the modelling of word association structures and to approximating the generation and retrieval of word associations in an artificial manner, which is an important aspect of modelling lexical knowledge in general.

Word associations have been obtained in multiple ways in the literature: through free association norms (De Deyne & Storms, 2008a), through semantic networks, and by inferring them from corpora (Heath et al., 2013). The third approach uses distributional semantic models (DSMs) that learn word embeddings, represented as real-valued vectors, from word co-occurrence patterns in a corpus. DSMs have mostly been applied to semantic similarity tasks rather than to associative similarity tasks (Turney & Pantel, 2010), but they can be used to mirror word associations made by humans as well (Peirsman et al., 2008; Agres et al., 2016). Free association norms are gathered by asking subjects to react to a stimulus word with the first word that comes to mind. While the gathering of free association norms tends to be expensive and time-consuming for domain-specific tasks, the extraction of associations from a corpus through a DSM is entirely data-driven and automatic.

In this thesis, a method of modelling word associations is developed through the creation of an artificial agent for the Taboo City Challenge. Taboo is a word-guessing game in which one agent provides clues about the term to be guessed without mentioning the target term or any of the related terms from a list of taboo words. Thus, the game requires the describer to think of well-known facts about the target word to enable the other agent to guess it correctly. The Taboo City Challenge[1] is a competition inspired by Taboo, where artificial guesser agents play the Location Taboo Game (LTG). The objective of the LTG is to guess the names of cities from simple textual hints. The data provided for training an Artificial Guesser Agent (AGA) are games that were successfully played by various human players. To play the LTG successfully, the AGA should be able to mimic the word associations that human players have with geographical locations.

The presented AGA employs context-predicting DSMs to create word representations from a tailored corpus, which is constructed with data from the online encyclopedias Wikipedia[2] and Wikivoyage.[3] The hypothesis is that context-predicting DSMs can capture human-like geographical associations from such a tailored corpus. These sources, based on the knowledge and experience of the volunteer authors, implicitly contain associations humans have with geographical locations.

Through the creation of an AGA that models word associations for the LTG and the evaluation of the architecture, the following research question is investigated:

How can context-predicting distributional semantic models facilitate modelling human-like word associations for the Location Taboo Game?

[1] The challenge is organised by the ESSENCE Network: https://www.essence-network.com/challenge/.
[2] Wikipedia is an online collaborative encyclopedia. The dump used was from April 20, 2017, https://dumps.wikimedia.org/enwiki/.
[3] Wikivoyage is a global travel guide for travel destinations and travel topics written by volunteer authors.

The creation and evaluation of the architecture can be broken down into different components, which leads to the following subquestions:

1 To what extent can the artificial guesser approximate human guessing performance in the Location Taboo Game?

2 What distributional semantic models are most suited to capture associations humans have with geographical locations?

3 How can word embeddings be used to facilitate human word-guessing game playing behaviour?

4 What is the impact of tailoring a corpus for a domain-specific task such as the Location Taboo Game?

The outline of this thesis is as follows: In the following section, a theoretical foundation is provided and related work is reviewed. Section 3 discusses the rules of the LTG. Section 4 presents the different components of the AGA architecture and discusses the algorithms used. An intrinsic evaluation of the word embeddings and an extrinsic evaluation of the architecture are presented in Section 5. The results are discussed in Section 6 and concluding remarks and future research directions are given in Section 7.

2 Background

Firstly, this section motivates the approach taken through a theoretical foundation. Secondly, the algorithms employed for the AGA are explained. Thirdly, related work about two alternative approaches to word-guessing games is presented.

2.1 Theoretical Foundation

Vector space models are based on the Distributional Hypothesis, according to which words that appear in similar contexts tend to have related meanings (Harris, 1954). The existing training methods for word vectors can generally be divided into two classes: count-based models and context-predicting models (Baroni et al., 2014). Within the count-based models, a word co-occurrence matrix is constructed from a text corpus (Turney & Pantel, 2010) and multiple post-processing steps are applied to improve context informativeness and reduce dimensionality. Context-predicting models use neural networks and set the weights to maximise the probability of the context in which a word is observed in a text corpus.

Baroni et al. (2014) provide a systematic comparison of those two types of models. Count-based vectors with the positive Pointwise Mutual Information and Local Mutual Information weighting schemes were compared to predictive DSMs constructed with the Continuous Bag of Words approach. The models were tested on a variety of benchmarks widely used to test and compare DSMs. The types of tasks included were semantic relatedness, synonym detection, concept categorisation, selectional preferences and analogy reasoning. The results indicate that the context-predicting models can outperform the count-based models on most of these tasks.

An opposing viewpoint has been provided by Levy et al. (2015), who argued that, if for all methods a set of hyperparameters is tuned, the performance of the different methods may become more comparable. These hyperparameters include association metrics and parameters for pre- and post-processing. They are already a part of the context-predicting DSMs, but can have a substantial impact when adapted for and transferred to the count-based methods.


The effects of such hyperparameters were analysed by Lai et al. (2016), who evaluated six context-predictive models and one newly developed count-based model named GloVe. The models were evaluated through eight tasks of three types: calculating a word embedding's semantic properties, text classification and sentiment classification. The performance of the methods differed greatly. Lai et al. found that the size of the corpus matters, though the corpus domain is more important. More complex models require a larger corpus to outperform the simpler methods. Additionally, for analysing the semantic properties of a word vector, larger dimensionalities can provide better performance.

Mikolov et al. (2013a) proposed two context-predicting models known as word2vec: Skip-gram (SG) and Continuous Bag of Words (CBOW). The models improve upon feedforward and recurrent neural net language models (NNLMs), and CBOW and SG have much lower computational complexity than both. While the objective of the SG model is to predict the context given the word itself, the CBOW model predicts a word given its context. A more detailed explanation of these models is presented in the next section. The models were tested on a word analogy task containing questions about both syntactic analogies and semantic analogies. SG and CBOW outperformed prior NNLMs, and SG outperformed all other models on the semantic analogy questions. An interesting discovery was that simple algebraic operations expose the learned relationships between vectors. One example mentioned by Mikolov et al. is Paris − France + Italy ≈ Rome: the word vector for Rome was the word closest to the vector that resulted from the subtraction and the addition.

An association metric is needed to measure the similarity of a possible target term and a hint. Three commonly used similarity measures are the Cosine similarity, the Dice coefficient and the Jaccard coefficient (Turney & Pantel, 2010). Cai et al. (2015) investigate a variety of combined strategies of different measures. Most works use one metric to evaluate semantic similarity, but Cai et al. argue that a single metric cannot capture all the aspects of semantic similarity and suit all types of input data. They apply a population-based stochastic search strategy to find an optimal combined strategy for a semantic similarity task. The similarity is compared to human ratings to assess its correctness. The six similarity measures considered were: Cosine, Euclidean, Manhattan, Chebyshev, Correlation and Tanimoto. The combined strategy found by the differential evolutionary algorithm outperformed the individual measures. In a systematic study of semantic vector space model parameters by Kiela and Clark (2014), 12 similarity measures were evaluated on a variety of count-based distributional semantic models for several word similarity tasks. Three similarity metrics performed particularly well: Cosine, Correlation and the Tanimoto coefficient. The results suggested that the Correlation metric has the most consistent performance.

2.2 Algorithms

The two algorithms employed for the AGA are the SG and CBOW algorithms proposed by Mikolov et al. (2013a), based upon the results presented by Baroni et al. (2014). Although count-based models might provide comparable performance through the adaptation and transferral of hyperparameters (Levy et al., 2015), these hyperparameters are already a part of SG and CBOW.

Two extensions to the original algorithms, proposed by Mikolov et al. (2013b), are the application of Hierarchical Softmax (HS) and Negative Sampling (NS). These two extensions are computationally efficient training methods. This subsection lays out both algorithms and training methods, and presents the similarity metrics employed in the AGA architecture.

A word embedding is defined as a mapping $V \to \mathbb{R}^d : w \to \vec{v}_w$ that maps a word $w$ from a vocabulary $V$ to a real-valued vector $\vec{v}_w$ in an embedding space of dimensionality $d$. The context of a centre word consists of the $m$ words before and $m$ words after the centre. In the word2vec toolkit — and its Python implementation gensim — the algorithms are implemented in a shallow neural network that consists of an input layer, one hidden layer and an output layer. The weights between the input layer and the hidden layer can be represented by a $|V| \times d$ matrix $W$. The weight matrix between the hidden layer and the output layer, $W'$, is a $d \times |V|$ matrix. The parameters of the neural network, henceforth represented by $\theta$, constitute the vectors. Each row of $W$ is the $d$-dimensional word vector representation of the corresponding word from the vocabulary. The columns of $W'$ represent the context vectors. The word vectors are constructed by learning $\theta$, through training the model on the objectives specified in the following sections.

2.2.1 Continuous Bag of Words

The training objective of CBOW is to maximise the probability of the centre word based on its context (Mikolov et al., 2013a):

$$\arg\max_{\theta} \; \frac{1}{T} \sum_{t=1}^{T} \log p(w_t \mid w_{t-m}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+m}; \theta) \quad (1)$$

where $T$ represents the number of tokens in the corpus and $w_t$ represents token $t$ as the current centre word. In the neural network model a one-hot vector is given as input per context word. The average of the input vectors, weighted by the matrix $W$, represents the output of the hidden layer $\vec{h}$:

$$\vec{h} = \frac{1}{2m} W^{T} (\vec{x}_1 + \vec{x}_2 + \dots + \vec{x}_{2m}) \quad (2)$$

The softmax function is used to obtain the posterior distribution of words (Rong, 2014):

$$p(w_t \mid w_{t-m}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+m}; \theta) = \frac{\exp(\vec{v}_{w_t} \cdot \vec{h})}{\sum_{j=1}^{|V|} \exp(\vec{v}_{w_j} \cdot \vec{h})} \quad (3)$$

2.2.2 Skip-Gram

Within the SG algorithm the goal is to set the parameters $\theta$ so as to maximise the probability of the context of a word, given the term that is in the centre:

$$\arg\max_{\theta} \; \frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\, j \ne 0} \log p(w_{t+j} \mid w_t) \quad (4)$$

The parameterisation that follows the neural-network language model approach models the conditional probability $p(w_{t+j} \mid w_t)$ using softmax:

$$p(w_c \mid w_t; \theta) = \frac{\exp(\vec{v}_{w_c} \cdot \vec{v}_{w_t})}{\sum_{j=1}^{|V|} \exp(\vec{v}_{w_j} \cdot \vec{v}_{w_t})} \quad (5)$$

2.2.3 Hierarchical Softmax

The basic formulation of the algorithms uses a global softmax normalisation, which is computationally expensive because of the summation over all words in the vocabulary (as shown in Equation 4 and Equation 5). This makes the model impractical to use for large training corpora. HS approximates the softmax function efficiently by using a binary Huffman tree for the representation of the output layer. The words from the vocabulary are the leaves of this tree and each leaf unit can be reached by a unique path from the root of the tree. At every step along the path, a probability is associated with choosing the right or left subtree and every step represents a local normalisation. This entire path is used to estimate the probability of the word.

For SG, HS defines $p(w_c \mid w_t)$ as follows:

$$p(w \mid w_t) = \prod_{j=1}^{L(w)-1} \sigma\!\left( [\![ n(w, j+1) = \mathrm{ch}(n(w, j)) ]\!] \cdot \vec{v}\,'^{T}_{n(w,j)} \vec{v}_{w_t} \right) \quad (6)$$

where $w$ is a word at a leaf unit, $n(w, j)$ is the $j$th node on the path from the root to the leaf unit and $L(w)$ is the length of this path. $[\![ x ]\!]$ is $1$ if $x$ is true and $-1$ if $x$ is false, and $\mathrm{ch}(n)$ represents an arbitrary fixed child of $n$. Therefore, if $\mathrm{ch}(n)$ represents the left child node of $n$ and $m$ is the right child node, $[\![ m = \mathrm{ch}(n) ]\!]$ is false and thus $-1$.

HS can be applied to the CBOW architecture in a very similar manner.

2.2.4 Negative Sampling

NS uses the notion that a probability is associated with a pair $(w, c)$ of a word and a context coming from the corpus data $D$, controlled by the parameters $\theta$: $p(D = 1 \mid w, c; \theta)$ (Goldberg & Levy, 2014). In this case, $c$ is regarded as a positive example for $w$. Correspondingly, $p(D = 0 \mid w, c; \theta)$ represents the probability that $(w, c)$ did not come from $D$, where $c$ is considered a negative example for $w$. For every training step, instead of looping over the entire vocabulary, NS only uses the positive example and $n$ negative examples, where $n$ is a parameter. The negative examples are sampled from the frequency distribution of words in the corpus. Therefore, terms that occur more often in the corpus have a higher probability of appearing as a negative sample.
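To make the sampling step concrete, the following toy sketch (not taken from the thesis; the vocabulary, counts and function name are invented for illustration) draws negative examples from the corpus frequency distribution as described above:

import numpy as np

# Toy corpus frequencies; more frequent terms are more likely to be
# drawn as negative samples, as described in Section 2.2.4.
vocab = ["city", "sea", "festival", "bridge", "art"]
counts = np.array([50.0, 30.0, 10.0, 7.0, 3.0])
probs = counts / counts.sum()

def draw_negative_samples(positive_idx, n=5, rng=np.random.default_rng(0)):
    """Draw n negative examples, skipping the positive context word."""
    samples = []
    while len(samples) < n:
        idx = int(rng.choice(len(vocab), p=probs))
        if idx != positive_idx:
            samples.append(idx)
    return samples

print([vocab[i] for i in draw_negative_samples(positive_idx=0)])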

2.2.5 Similarity Metrics

A variety of similarity metrics can be used to calculate the association between the word vector of a hint and the word vector of a city. Four similarity metrics, shown in Table 1, are considered in this work.

Table 1: Similarity metrics between vectors $\vec{u}$ and $\vec{v}$; $v_i$ represents the $i$th component of $\vec{v}$.

Metric       Definition
Chebyshev    $\frac{1}{1 + \max_i |u_i - v_i|}$
Cosine       $\frac{\vec{u} \cdot \vec{v}}{|\vec{u}|\,|\vec{v}|}$
Correlation  $\frac{(\vec{u} - \mu_u) \cdot (\vec{v} - \mu_v)}{|\vec{u} - \mu_u|\,|\vec{v} - \mu_v|}$
Tanimoto     $\frac{\vec{u} \cdot \vec{v}}{|\vec{u}|^2 + |\vec{v}|^2 - \vec{u} \cdot \vec{v}}$
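The sketch below renders the four metrics from Table 1 for dense numpy vectors; it is an illustrative transcription, not the thesis code:

import numpy as np

def chebyshev(u, v):
    # Turns the Chebyshev distance into a similarity in (0, 1].
    return 1.0 / (1.0 + np.max(np.abs(u - v)))

def cosine(u, v):
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

def correlation(u, v):
    # Pearson correlation: the cosine of the mean-centred vectors.
    return cosine(u - u.mean(), v - v.mean())

def tanimoto(u, v):
    dot = u.dot(v)
    return dot / (u.dot(u) + v.dot(v) - dot)

u, v = np.array([0.2, 0.5, 0.1]), np.array([0.1, 0.4, 0.3])
for metric in (chebyshev, cosine, correlation, tanimoto):
    print(metric.__name__, round(float(metric(u, v)), 3))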

2.3 Related Work on Word-Guessing Games

Heath et al. (2013) reviewed methods for building an agent for a word-guessing game named Wordlery, which is similar to Taboo. In their work, two methods for obtaining word associations were considered: using human free association norms and applying count-based semantic models to build word vectors from a corpus. The models were evaluated by playing the game. The free association norms outperformed the count-based semantic models. Combining the two methods of forming word associations was superior to each of the methods in isolation. Heath et al. (2013) suggested that more advanced corpus-based semantic models, which take into account additional semantic information, may improve the results on similar tasks. This thesis builds upon this suggestion by using context-predicting models rather than count-based models and proposes a novel corpus annotation method to improve corpus-based semantic models.

The baseline for the performance of AGAs has been set by Adrian et al. (2016), who proposed a semantic-distance-based architecture. The architecture uses a two-step approach. First, the geographical area of the guess is narrowed down to the country. Next, the area is further narrowed down to the city. Two types of resources are used to measure the distance between a geographical location and a hint: WordNet and Wikipedia. WordNet is used to measure semantic distance through the hierarchical relations of the location and the hint. For Wikipedia, the similarity is measured by combining the number of hits for the hint, the location and the combination of both, through multiple association metrics. The highest score, yielding 23.17% accuracy and 68.42% faster guessing performance, was achieved with the Wikipedia corpus and the Pointwise Mutual Information metric. A quantitative comparison of this baseline and the AGA proposed in this thesis is provided in Section 5.

3 Case Study: Location Taboo Game

In this section, the rules of the LTG are laid out.[4] A more detailed description of the game has been presented by Adrian et al. (2015). An LTG is played by a describer agent and a guesser agent. Hints are simple English noun phrases, consisting of one to three words that are common nouns, adjectives, or connectors. The hints may not include proper nouns. For example, if ‘Verona’ is the target city, the clue ‘Romeo and Juliet’ is not allowed, but ‘tragic love story’ is. Although there is no closed set of cities available, the challenge does focus on well-known cities.

The describer starts the game by providing a hint about the target city. Based on this hint, the guesser tries to guess the city that is being described. As long as the guess is incorrect, the describer provides a new hint and the game continues until there are no more hints left. If the guesser has not been able to find the right city before the describer runs out of hints, the game is considered to have failed. An example game is shown in Table 2.

In the competition, the clues from the describer agent are hints from real games played by human players, for which the target term was guessed successfully. Therefore, the number of hints available differs per game, depending on how many clues the human player needed. The games for the evaluation of the AGA were extracted from a set of 82 real-world games provided by ESSENCE and from a set of 149 games made available through an API for the Taboo City Challenge.[5] The games present in both sets were only used once. An important difference with a real, interactive word-guessing game is that the hints from the dataset are static, i.e. they do not depend upon the guesses that are given. Multiple games from the set contained hints that clearly depend upon the other agent's guess, such as ‘close’ or ‘different accent’. Two games contained two or more of such hints; those games were excluded from the dataset. One game in which ‘Europe’ was used as a hint and two games that contained hints with more than three words were excluded as well, since these violate the rules of the LTG. Three games were excluded because hints were repeated multiple times. The remaining dataset consists of 202 games. The number of hints per game varies from 1 to 10 and the average number of hints is 2.8.

[4] https://www.essence-network.com/challenge/

Table 2: An example LTG for which the target city is Venice.

Agent      Message
Describer  sea
Guesser    Sydney
Describer  yearly festival
Guesser    Rio de Janeiro
Describer  bridges
Guesser    Amsterdam
Describer  renaissance art
Guesser    Venice

In the Taboo City Challenge, the AGA performance is evaluated with the total game score, which is derived from the number of submitted guesses; the AGA should therefore minimise the number of guesses. If the AGA has not found the answer in a game, the score for that game is the number of hints increased by five. The total score of the AGA is the sum of the scores of the individual games the AGA played:

$$5 \cdot f + \sum_{i=0}^{j} h_i \quad (7)$$

where $j$ represents the number of games for which the guesser agent is evaluated, $h_i$ is the number of hints used in game $i$ and $f$ is the number of failed games.
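As a reading aid, here is a hypothetical helper that evaluates Equation 7 over a list of played games; the tuple representation of a game is an assumption made for this sketch:

# Each game is a (hints_used, solved) tuple; failed games add a penalty of
# five on top of the hints they consumed (Equation 7).
def game_score(games):
    failed = sum(1 for _, solved in games if not solved)
    hints = sum(hints_used for hints_used, _ in games)
    return 5 * failed + hints

# Two solved games (2 and 4 hints) and one failed game (3 hints):
print(game_score([(2, True), (4, True), (3, False)]))  # 5*1 + (2+4+3) = 14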

4 Approach

This section discusses the different components that make up the AGA architecture. Firstly, the tailored corpus and a targeted annotation method are discussed. Secondly, the AGA architecture is presented: four methods for the construction of lists of candidate cities and four game strategies are discussed. Thirdly, details about the implementation of the architecture are provided.

4.1 Data and Pre-processing

Data sets from Wikipedia and Wikivoyage were used, available under an open license by the Wikimedia Foundation. An advantage of Wikivoyage is that all entries come with information that is useful for tourists. However, many of these pages are not specifically about travel destinations but rather about tourist attractions, such as historic buildings or tours. Therefore, an initial filtering of the entries of Wikipedia and Wikivoyage was performed using the database of populated places from Natural Earth cultural vectors.[6] This database includes all capitals, major cities and towns, and smaller towns from sparsely inhabited regions. The result of this initial filtering was a set of 215 countries and 7,267 cities and towns. The Wikipedia and Wikivoyage pages titled with these names were extracted. For the Wikivoyage pages, the outlinks to other Wikivoyage pages were considered relevant and were included in the corpus. A second filter was applied to the resulting text corpus to remove markup language and English stop words. The corpus was split into sentences before punctuation was removed. City names composed of two or more words were joined into a single token: ‘New York’ is represented as ‘New_York’. All tokens were lowercased apart from the city names, to avoid bias in the word vectors of city names that are also known as common nouns, such as ‘Tours’. The final normalised corpus consists of around 17 million tokens and 600 thousand types. More detailed size information is displayed in Table 3; a small sketch of the name-joining and lowercasing step follows the table.

[6] The Populated Places database is available at: http://www.naturalearthdata.com/downloads/10m

Table 3: Detailed size data for the tailored corpus.

source                                  Wikivoyage, Wikipedia
size of normalised pages, uncompressed  124.7 MB
n location pages                        10,877
n all pages                             26,618
n tokens                                17,139,619
n types                                 629,479
n types included in word embeddings     111,630
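The following sketch illustrates the normalisation step described above; the tiny city list and helper function are hypothetical stand-ins for the actual pipeline:

CITY_NAMES = ["New York", "Tours", "Amsterdam"]  # toy subset of the 7,267 cities

def normalise(sentence, city_names=CITY_NAMES):
    # Join multi-word city names with underscores before tokenising.
    for name in sorted(city_names, key=len, reverse=True):
        sentence = sentence.replace(name, name.replace(" ", "_"))
    tokens = []
    for token in sentence.split():
        # Keep the casing of city names; lowercase every other token.
        restored = token.replace("_", " ")
        tokens.append(token if restored in city_names else token.lower())
    return tokens

print(normalise("Tours of New York start Downtown"))
# ['Tours', 'of', 'New_York', 'start', 'downtown']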

4.1.1 Targeted Corpus Annotation

The human readers and authors of Wikipedia and Wikivoyage pages know that the content of a page is closely related to its subject, represented by the city name in the title of the page. Even though the relation is not mentioned explicitly throughout the text, the connection between the subject and the content is made implicitly. In order to improve the association between a geographical location and its corresponding Wikipedia and Wikivoyage pages, this work introduces a novel type of targeted annotation. This annotation method serves to make the association that human readers or authors implicitly assume explicit for the DSM.

For the pages entitled with the name of a country or city, this name is inserted in every sentence. The number of insertions is calculated by rounding up the length of the sentence divided by 50. The names are inserted at an equal distance from each other and from the beginning and ending of the sentence, based upon the number of insertions and the length of the sentence. An example of a rather short sentence, in which the city name is inserted in the middle, is shown below:

‘pizza traditionally eaten locally pasta [Verona] dishes feature widely restaurant menus’

An example of a sentence whose length exceeds 50 words, for which the city name is inserted twice, is shown below:

‘foreign policy priorities Azerbaijan include first restoration territorial integrity elimination consequences occupation nagorno karabakh seven regions Azerbaijan surrounding nagorno karabakh integration european euro [Azerbaijan] atlantic structure contribution international security cooperation international organizations regional cooperation bilateral relations strengthening defense capability promotion security domestic policy means strengthening democracy preservation [Azerbaijan] ethnic religious tolerance scientific educational cultural policy preservation moral values economic social development enhancing internal border security migration energy transportation security policy’
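A minimal sketch of the insertion procedure, under my reading of the description above (⌈length/50⌉ insertions at equal distances); rounding details may differ from the thesis implementation:

import math

def annotate(tokens, name):
    # ceil(len/50) insertions, spaced evenly between the sentence ends.
    n_insertions = math.ceil(len(tokens) / 50)
    step = len(tokens) / (n_insertions + 1)
    annotated = list(tokens)
    # Insert from the back so earlier insertion points stay valid.
    for i in range(n_insertions, 0, -1):
        annotated.insert(round(i * step), name)
    return annotated

sentence = ("pizza traditionally eaten locally pasta dishes "
            "feature widely restaurant menus").split()
print(" ".join(annotate(sentence, "[Verona]")))
# pizza traditionally eaten locally pasta [Verona] dishes feature widely restaurant menus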

4.2 Agent Architecture

4.2.1 Strategies for Choosing Candidate Cities

A large part of the corpus incorporates little-known locations that are not likely to appear in the LTG, which only concerns well-known locations. Considering all cities in the world could produce unexpected results due to data sparseness. Therefore, candidate cities were filtered in order to consider only sufficiently relevant places. Three definitions of relevance were formulated, for which three lists were compiled for the AGA. These lists all contain 500 cities. A fourth list is created with only the cities that actually occur in the games.

Using Occurrence Counts Tailored Corpus (A) The first list of cities is based on occurrence counts of city names in the tailored corpus. Defining the most salient cities through this metric relies on the assumption that the volunteer authors of the encyclopedias write more about cities that are well-known, so these cities appear more often in the corpus. Besides, well-known cities not only tend to have longer Wikivoyage or Wikipedia pages dedicated to them, but also tend to appear more often on pages of other geographical locations.

Using Occurrence Counts Google News Corpus (B) The second method uses a set of word vectors that is pre-trained on the Google News dataset.[7] The occurrence counts of cities in the corpus are extracted from the vocabulary of the word vectors. The rationale behind this method is that cities that appear more often in news items are more well-known. A disadvantage of the first two methods is that city names that can also function as a common noun or proper noun with a different meaning have a high frequency count in corpora, even though they are not well-known as a city. Examples of such outliers are the names ‘George’ and ‘David’ and the nouns ‘orange’ and ‘price’.

Using Web Resources (C) Instead of defining a popularity metric based on corpora, this third method uses crowd-sourced information from digital nomads living all around the world. The ranking of cities from the popular website NomadList[8] is used. NomadList allows its users to rank travel destinations on multiple factors, such as quality of life, average trip length or number of visits. The number of visits has been used to construct this third list of candidate cities. Although a city's popularity as a travel destination may be a good estimate for the list of candidate cities, a possible disadvantage could be that cities that are well-known but are not attractive to visit as a tourist, such as ‘Fukushima’, may be excluded.

Using Manual Curation (D) In the set of 202 games used for training and testing purposes, 120 well-known cities are included. The fourth list of candidate cities contains only these 120 cities. This list allows us to investigate the performance of the guesser agent in the situation of a perfect set of candidate cities.

4.2.2 Game Strategy

The AGA retrieves the cities nearest to the hints in the semantic vector space via a similarity metric and employs a game strategy to choose from those cities. Multiple game strategies were considered: two strategies that are adapted versions of cross-situational learning algorithms named Enumeration and Elimination (De Beule, 2016), a two-step strategy similar to the approach of Adrian et al. (2016), and a game strategy based on vector arithmetic. For the first three game strategies the similarity score between the hints and a city is defined as follows:

$$s(H, c) = \sum_{h \in H} m(\vec{h}, \vec{c}) \quad (8)$$

where $c$ represents the name of a country or city, $m$ is a similarity metric of choice, and $H$ is the bag of words constructed from all hints presented by the describer agent.
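Transcribed directly into Python, with a toy two-dimensional vector table standing in for the trained embedding lookup, Equation 8 reads as follows:

import numpy as np

def cosine(u, v):
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Equation 8: the score of a candidate city is the sum of a similarity
# metric m over all hint vectors.
def cumulative_similarity(hints, city, vectors, metric=cosine):
    return sum(metric(vectors[h], vectors[city]) for h in hints)

vectors = {"sea": np.array([1.0, 0.2]),
           "bridges": np.array([0.3, 0.9]),
           "Venice": np.array([0.8, 0.7])}
print(round(cumulative_similarity(["sea", "bridges"], "Venice", vectors), 3))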

[7] The vectors are available at: https://code.google.com/archive/p/word2vec/.
[8]

Enumeration This game strategy considers all cities from its list of candidate cities at all times. The cities are ranked according to the cumulative similarity (Equation 8) between the vectors of words from the set of hints and the vectors of cities. In this way, cities that are more associated with the hints have a higher chance of being chosen, but less similar cities also have a non-zero chance. This game strategy is useful if very strong associations that can strongly change the ranking appear in the game. The city at the top of the ranking is guessed at each iteration.

Elimination The Elimination – or informed guess (Simonton, 2013) – algorithm consists of eliminating inconsistent candidate cities across game iterations, narrowing down the list of cities after every new hint. This strategy picks the n cities closest to every hint in the vector space and uses the intersection of those different sets to choose the guess from. Therefore, with every new hint, cities that are not present in this intersection are eliminated. If the intersection is the empty set, the city with the highest score from the union of the sets of most similar cities is guessed. In the following iterations of the game, the sets of most similar cities for new hints are intersected with the union of the sets for the hints that were provided before the empty set occurred. Table 4 presents the 5 cities closest to every hint for the example game presented in Section 3. For the first two and the first three clues, intersecting the sets of most similar cities produces an empty set. However, after the fourth clue, ‘Venice’ is in the intersection of the cities for the new hint and the union of the cities found before. Therefore, the target city is guessed correctly.

Table 4: The cities most similar to the hints from the example LTG presented in Table 2.

Clue             5 most similar cities
sea              Istanbul, Gold Coast, Kota Kinabalu, Punta del Este, Nice
yearly festival  Poznan, Oranjestad, Sendai, Omaha, Battle Creek
bridges          Wuhan, Hartford, Venice, Washington, Khartoum
renaissance art  Ferrara, Milan, Florence, Venice, Scottsdale

Country to City The third game strategy is similar to the two-step approach of Adrian et al. (2016), in which the architecture first narrows down to a set of countries and then chooses from the cities of those countries. With every new hint, the set of the top n most similar countries is recalculated for the entire set of hints. The guess is chosen from the cities of those countries. Table 5 presents the 5 most similar countries for each iteration of the example game presented in Section 3.

Table 5: The countries most similar to the hints from the example LTG presented in Table 2.

Clue             5 most similar countries
sea              Mauritius, Turkey, Maldives, Georgia, Oman
yearly festival  Malta, Turkey, Mauritius, Japan, Denmark
bridges          Malta, Japan, Norway, Denmark, Sweden
renaissance art  Italy, Denmark, Georgia, Japan, Malta

Vector Arithmetic The fourth and last game strategy uses the fact that Mikolov et al. (2013a) reported that semantic relationships between words can be made explicit with the word2vec word embeddings through vector arithmetic. This game strategy uses the word vectors of all words present in the hints to calculate their average normalised vector. The vector of the city closest to this vector in the semantic vector space is guessed. The performance on the example game is illustrated in Figure 1, which shows the word vectors, normalised average vectors per game iteration, the guesses and the correct answer.

Figure 1: The Vector Arithmetic game strategy illustrated through a 2-dimensional PCA projection of 200-dimensional vectors for the example LTG. Guesses, hints and summed vectors are grouped by colour. Game iterations are indicated through numbers. (Axes: PC1 × PC2.)
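A sketch of this strategy, assuming cosine as the nearness measure and a token-to-array mapping in place of the trained embeddings:

import numpy as np

def cosine(u, v):
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Average the hint-word vectors, normalise, and guess the nearest city.
def vector_arithmetic_guess(hint_words, candidate_cities, vectors):
    summed = np.sum([vectors[w] for w in hint_words], axis=0)
    query = summed / np.linalg.norm(summed)
    return max(candidate_cities, key=lambda city: cosine(vectors[city], query))

vectors = {"bridges": np.array([0.3, 0.9]), "art": np.array([0.6, 0.6]),
           "Venice": np.array([0.5, 0.8]), "Sydney": np.array([0.9, 0.1])}
print(vector_arithmetic_guess(["bridges", "art"], ["Venice", "Sydney"], vectors))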

4.3 Implementation

The AGA architecture was implemented in Python 2.7. For the processing of the corpus, the Python library gensim.corpora.wikicorpus[9] was used for the removal of markup language. The list of English stopwords used was the one provided with the NLTK Toolkit.[10] The SG and CBOW models were used to train the word embeddings, as implemented in gensim.models.word2vec[11] (a minimal training sketch follows the footnotes below). This implementation uses a corpus that is separated into sentences, and uses sentence boundaries in calculating the context of a term. The main AGA architecture is shown in Algorithm 1. The implementations of the game strategies are provided as well: Elimination in Algorithm 2, Enumeration in Algorithm 3, Vector Arithmetic in Algorithm 4 and Country to City in Algorithm 5. If the first hint is not present in the semantic vector space, a default set of 300-dimensional SGNS vectors trained on a Google News dataset[12] is used. In the games from the ESSENCE API there were only five games for which the similarity scores for the first hint had to be extracted from the Google News vectors.

[9] The library used to construct a corpus from a MediaWiki-based database dump is https://radimrehurek.com/gensim/corpora/wikicorpus.html.
[10] http://www.nltk.org/
[11] word2vec Python implementation: https://radimrehurek.com/gensim/models/word2vec.html.
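As a rough illustration of the training call, assuming gensim's pre-4.0 parameter names (size, iter) and toy sentences in place of the tailored corpus:

from gensim.models import Word2Vec

sentences = [["sea", "yearly", "festival", "Venice"],
             ["bridges", "renaissance", "art", "Venice"]]

model = Word2Vec(sentences,
                 sg=1,         # 1 = Skip-Gram, 0 = CBOW
                 hs=1,         # 1 = Hierarchical Softmax, 0 = Negative Sampling
                 negative=0,   # number of negative samples when hs=0
                 size=300,     # embedding dimensionality
                 window=20,    # context window size
                 min_count=1,  # 5 in the thesis; 1 here for the toy corpus
                 sample=1e-5,  # sub-sampling threshold for frequent words
                 iter=15)      # number of training epochs

print(model.wv.most_similar("Venice", topn=2))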


Algorithm 1: Main algorithm of guesser agent
Input: candidate cities, wiki vectors, google news vectors, threshold
while guessed = false and new hints exist = true do
    hint ← getHint()
    if hint not in wiki vectors and empty(all hints) then
        vectors ← google news vectors
    else
        vectors ← wiki vectors
    end
    all hints ← addHint(hint)
    guess of agent ← gameStrategy(candidate cities, all hints, vectors, threshold)
    candidate cities ← remove(guess of agent)
    guessed ← guessCity(guess of agent)
    new hints exist ← askAPI()
end

Algorithm 2: Elimination game strategy
Input: candidate cities, hints, vectors, threshold
Output: guess of strategy
for word in hints do
    for city in candidate cities do
        candidate cities.city.score ← similarityMetric(vectors[word], vectors[city])
    end
    hints.word.top cities ← getBestCities(candidate cities, threshold)
end
cities used ← intersectionCities(hints)
if empty(cities used) then
    cities used ← unionCities(hints)
end
guess of strategy ← getHighestScore(candidate cities, cities used)
return guess of strategy

Algorithm 3: Enumeration game strategy
Input: candidate cities, hints, vectors
Output: guess of strategy
for city in candidate cities do
    for word in hints do
        candidate cities.city.score ← candidate cities.city.score + similarityMetric(vectors[word], vectors[city])
    end
end
guess of strategy ← getHighestScore(candidate cities)
return guess of strategy


Algorithm 4: Vector Arithmetic game strategy
Input: candidate cities, hints, vectors
Output: guess of strategy
summed vector ← emptyVector()
for word in hints do
    summed vector ← summed vector + vectors[word]
end
normalised vector ← normalise(summed vector)
for city in candidate cities do
    candidate cities.city.score ← similarityMetric(vectors[city], normalised vector)
end
guess of strategy ← getHighestScore(candidate cities)
return guess of strategy

Algorithm 5: Country to City game strategy
Input: candidate cities, hints, vectors, threshold
Output: guess of strategy
countries ← findCountries(candidate cities)
for country in countries do
    countries.country.cities ← getCitiesFromCountry(candidate cities)
end
for country in countries do
    for word in hints do
        countries.country.score ← countries.country.score + similarityMetric(vectors[word], vectors[country])
    end
end
cities used ← citiesFromNBestCountries(countries, threshold)
for city in cities used do
    for word in hints do
        cities used.city.score ← cities used.city.score + similarityMetric(vectors[city], vectors[word])
    end
end
guess of strategy ← getHighestScore(cities used)
return guess of strategy


5 Experiments and Results

This section provides an optimisation of the parameter values for the word embeddings and the AGA architecture. Additionally, an extrinsic evaluation of the AGA architecture and an intrinsic evaluation of the word embeddings are provided. Where the extrinsic evaluation assesses the architecture through the LTG, the intrinsic evaluation tests the word embeddings on semantic analogy tasks to investigate the quality of the word embeddings for geographical semantic relations in general.

5.1 Experimental Setup

Four algorithms are considered throughout the experiments: Skip-Gram Negative Sampling (SGNS), Skip-Gram Hierarchical Softmax (SGHS), Continuous Bag of Words Negative Sampling (CBOWNS) and Continuous Bag of Words Hierarchical Softmax (CBOWHS), as explained in Section 2.2. Table 6 displays all hyperparameters involved, along with the explored values and the models for which the parameters are applicable. Considering the small size of the corpus, the minimum count setting used is 5; any word that does not occur at least this many times in the entire corpus is ignored. For the sub-sampling of very frequent words $t = 10^{-5}$ is used, as suggested by Mikolov et al. (2013b). The number of training epochs is set to 15.

Firstly, the parameters of the AGA architecture are fixed to determine the optimal settings for the word embeddings. Secondly, the word embeddings are fixed per algorithm and the optimal configurations for the AGA architecture are investigated. The parameter optimisation is based upon the accuracy, which is the number of games for which the target city was guessed divided by the total number of games in the dataset. Thirdly, the best performing AGAs are evaluated according to three evaluation metrics: the game score (Section 3), the accuracy and the faster guessing performance (FGP). The FGP represents the percentage of successfully finished games for which the AGA submitted fewer guesses than the human player before guessing the target city correctly. The accuracy and FGP each represent a different aspect of modelling word associations. Through maximising the accuracy and FGP, the game score is minimised.

The data has been divided into two parts randomly; 40 games were set aside for final testing and the remaining 162 games were used for parameter tuning with 5-fold cross validation. For every training fold the accuracies were calculated for the different configurations of word embedding parameters and the different configurations of agent architecture parameters. For every training fold, the best configuration was selected and re-evaluated on the corresponding validation set. The configuration with the highest out-of-sample accuracy per algorithm was used for the final settings of the different AGAs.

Table 6: The space of hyperparameters explored in this work.

Hyperparameter    Explored Values                                 Applicable models
corpus            Unannotated, Annotated                          All
window size       5, 10, 15, 20, 25                               All
dimensions        200, 300, 400                                   All
negative samples  5, 10, 25                                       SGNS, CBOWNS
city list         A, B, C, D                                      All
strategy          Enumeration, Elimination, Two-Step, Arithmetic  All
metric            Cosine, Correlation, Tanimoto, Chebyshev        All

5.2 Word Embedding Parameter Tuning

To determine the optimal context window size and number of dimensions per algorithm, the parameters of the agent architecture are fixed: the game strategy is set to Enumeration and the similarity metric used is Cosine. Enumeration was chosen as the initial game strategy, as it does not eliminate any cities in the decision-making process. For the NS algorithms, 5 negative samples are used, since the usage of 5 negative samples has provided very respectable results before (Mikolov et al., 2013b) and has proven to be very efficient computationally. Figure 2 displays the guessing accuracies per algorithm, for both the unannotated and the annotated corpus, for the list of candidate cities from the website NomadList, city list C. The accuracies in the figure are the average accuracies on the training folds. The optimal configurations per fold were re-evaluated on the validation sets and the configuration with the highest out-of-sample accuracy has been selected per algorithm, as shown in Table 7. The targeted corpus annotation had a significant effect for all models (t-test, p < 0.001): CBOWNS t = 6.766, CBOWHS t = 8.469, SGNS t = 12.427, SGHS t = 15.489.

The ideal set of cities, list D, can be used to research the effect of the hyperparameters without the influence of noise created by the presence of unused cities. These results are provided in Appendix A.

For CBOWNS and SGNS the number of negative samples can be varied. Mikolov et al. (2013b) suggest using 5 to 20 negative samples for small corpora. The optimal configurations for CBOWNS and SGNS were trained with 5, 10 and 25 negative samples. The average accuracy over the different training folds is shown in Table 8. The results illustrate that increasing the number of negative samples does not necessarily improve performance.

Table 7: The optimal settings per algorithm, as found through 5-fold cross validation.

corpus                    Algorithm  dimensions  window size  negative samples
Wiki* Corpus Unannotated  SGHS       300         20           -
                          SGNS       400         20           5
                          CBOWHS     200         20           -
                          CBOWNS     300         20           5
Wiki* Corpus Annotated    SGHS       300         25           -
                          SGNS       300         20           5
                          CBOWHS     200         10           -
                          CBOWNS     200         10           5

Figure 2: Results for varying the number of dimensions and the size of the context window per algorithm, with city list C, cosine similarity and the Enumeration game strategy. (Panels: (a) SGHS Unannotated; (b) SGHS Annotated; (c) SGNS Unannotated; (d) SGNS Annotated; (e) CBOWHS Unannotated; (f) CBOWHS Annotated; (g) CBOWNS Unannotated; (h) CBOWNS Annotated. Axes: context window size × dimensions.)


Table 8: Results for varying the number of negative samples, n, for CBOWNS and SGNS.

                                     Accuracy (%)
corpus                    Algorithm  n = 5   n = 10  n = 25
Wiki* Corpus Unannotated  SGNS       19.75   18.52   18.52
                          CBOWNS     14.82   11.73   14.20
Wiki* Corpus Annotated    SGNS       24.69   20.37   22.84
                          CBOWNS     17.29   13.58   16.05

5.3 Agent Architecture Parameter Tuning

The results of the word embedding parameter optimisation indicate that the annotated corpus, combined with the SGHS or the SGNS model, yields the best results for the LTG. For these algorithms the optimal word embedding configurations have been selected. These configurations are used to select the most suitable game strategy and similarity metric, for which results are displayed in Figure 3. The strategy-specific parameters used for the Elimination and the Country to City game strategies were 50 and 10, respectively. Additional results for CBOWNS, CBOWHS, the unannotated corpus and city list D are provided in Appendix B.

The Enumeration game strategy appears to be the most appropriate game strategy for the SG algorithm. The Tanimoto, Correlation and Cosine similarity measures all result in very similar performance; none of them clearly dominates. Based upon accuracies on the validation sets, Tanimoto was selected for SGHS and Correlation was selected for SGNS. These configurations were fixed per algorithm to evaluate the different city lists, as shown in Table 9. City list C performs much better than lists A and B; therefore C has been selected for the final configurations of the AGA. City list D results in the best performance, which was expected, since cities that were not present in the dataset were excluded from this list. The results for CBOWNS, CBOWHS and the unannotated corpus are shown in Appendix B.

Figure 3: Results for varying the game strategy and similarity metric for the best configurations of the word embeddings per algorithm, with city list C. The game strategies are Enumeration (EN), Elimination (EL), Country to City (CC) and Vector Arithmetic (VA). The similarity metrics are Cosine (cos), Correlation (cor), Tanimoto (tan) and Chebyshev (che). (Panels: (a) SGHS Annotated; (b) SGNS Annotated. Axes: game strategy × similarity metric.)

A complete overview of the optimal configurations per algorithm is presented in Appendix B, Table 14, for both the annotated and the unannotated corpus, for city lists C and D.


Table 9: Results for varying the city lists.

                                            Accuracy (%)
Algorithm  strategy     metric       A      B      C      D
SGHS       Enumeration  Tanimoto     17.28  16.66  24.07  41.97
SGNS       Enumeration  Correlation  20.37  19.13  24.69  33.94

5.4 Extrinsic Evaluation

The optimal combination of parameters has been selected per algorithm through the parameter optimisation process. To evaluate how well these AGAs perform on games that were unseen before, this section presents the results for the set of 40 games that was put aside during parameter tuning. The AGAs are evaluated according to the three evaluation metrics: the game score, the FGP and the accuracy. The highest game score possible for these 40 games is 300; the lowest game score possible is 40. The AGA should minimise the number of guesses submitted and thus minimise the game score. The approach presented here is compared to the performance of the baseline AGA architecture of Adrian et al. (2016) and to two sets of pre-trained word vectors that were evaluated within the architecture presented here: the pre-trained SGNS vectors from the Google News corpus, and vectors created with a word embedding model named LexVec, trained on a Wikipedia corpus.[13] The LexVec model achieves state-of-the-art results in multiple natural language processing tasks (Salle et al., 2016) and improves upon SGNS. The performance on the set of 40 games is presented in Table 10, for the tailored word embeddings, the pre-trained word embeddings and the baseline architecture, respectively. The optimal AGA parameters for the LexVec word embeddings are the Enumeration game strategy and the Correlation similarity metric. For the SGNS Google News vectors the Enumeration game strategy and the Cosine similarity metric were used.

Additional results for CBOWNS, CBOWHS, the unannotated corpus and city list D are provided in Appendix B, Table 15.

Table 10: The performance of the presented AGA with tailored and pre-trained word embeddings, and the performance of the baseline architecture.

corpus                  Algorithm              Game Score  Accuracy (%)  FGP (%)
Wiki* Corpus Annotated  SGHS                   239         27.50         27.27
Wiki* Corpus Annotated  SGNS                   252         22.50         22.22
Google News             SGNS                   256         20.00         37.50
Wikipedia               LexVec                 267         15.00         33.33
-                       Baseline Architecture  290         5.00          0.00

5.5 Intrinsic Evaluation

Intrinsic evaluation methods of word embeddings test for syntactic or semantic relationships between words. These tasks typically involve a query inventory, which is a pre-selected set of semantically related query words. The evaluation method compiles an aggregate score that serves as an absolute measure of quality.


Mikolov et al. (2013b) developed query inventories of analogical reasoning tasks to evaluate the quality of word vectors.[14] The results for two of those tasks are presented in Table 11. These tasks were selected based upon subject, as they query word embeddings for geographical semantic relations. The tasks use the word vectors to answer queries about semantic relationships between a capital and the country it belongs to, for example ‘Quito’ is to ‘Ecuador’ as ‘Valletta’ is to ‘Malta’.

Table 11: Results for the intrinsic evaluation of the top 5 best configurations per algorithm, for both the unannotated and annotated corpus, and two pre-trained sets of vectors.

corpus                    Algorithm  capital-common-countries  capital-world
Wiki* Corpus Unannotated  SGHS       83.99% (425/506)          87.80% (3972/4524)
                          SGNS       91.11% (461/506)          85.46% (3866/4524)
                          CBOWHS     70.55% (357/506)          44.92% (2032/4524)
                          CBOWNS     83.99% (425/506)          76.66% (3468/4524)
Wiki* Corpus Annotated    SGHS       82.21% (416/506)          89.28% (4039/4524)
                          SGNS       90.71% (459/506)          91.22% (4127/4524)
                          CBOWHS     65.81% (333/506)          65.36% (2957/4524)
                          CBOWNS     92.29% (467/506)          83.02% (3756/4524)
Google News               SGNS       83.20% (421/506)          79.13% (3580/4524)
Wikipedia                 LexVec     95.06% (481/506)          94.36% (4268/4524)

6 Discussion

The method presented in this work aims at approximating human guessing performance in the LTG (Research Question 1). The best performing AGA can guess the target city with up to 27.50% accuracy if the candidate cities are unknown, and with up to 37.50% accuracy if they are known. Although the performance is a substantial improvement compared to the baseline architecture of Adrian et al. (2016), the accuracy of the AGA is still rather low. However, the AGAs were only evaluated with games that have been played successfully by human guessers before, which may have biased the results.

The different configurations used in the experiments of the extrinsic evaluation demonstrate the optimal settings for neural word embeddings for the LTG (Research Question 2). The best AGA has been constructed with the SGHS algorithm by creating word embeddings from the annotated corpus and applying the Enumeration game strategy. A rather large context window size of 20 yields the best results on average. The fact that a larger context window size is more suitable for capturing associative similarity is consistent with the findings of Peirsman et al. (2008), who show that large context windows tend to model human associations better.

The difference in performance between SG and CBOW may be explained by the size of the corpus and the models' training objectives. While CBOW conditions upon the context of a word and smooths over this context by averaging its vectors, SG creates many small training instances of word-context pairs. The results indicate that SG performs better at the LTG than CBOW, and that on average HS has a slight advantage over NS, as illustrated in Figure 2. An explanation for the difference between the scores of HS and NS is that more frequent words are more likely to be selected as negative samples (see Section 2.2.4); thus NS represents frequent words better than infrequent words. Hints can be very infrequent words, as they are specific to a particular target city. Hence, clues are more accurately represented by training the algorithm with HS. There is a striking relative difference in performance for CBOWNS when the experiments with city list C (Figure 2) are compared to the experiments with city list D (Appendix B, Figure 4). CBOWNS appears to benefit more from the exclusion of the noise presented by cities that do not appear in the games. For the final configurations selected for CBOWNS, annotation increases performance by 2.5% for city list C and by 7.5% for city list D (Table 15). As CBOW generally needs more data than SG and NS represents frequent words better, a possible cause could be that cities that are excluded by city list D were relatively infrequent and caused bias in the results for CBOWNS for city list C.

A targeted annotation method has been proposed to make the associations with geographical locations, implicitly present in the corpus, more explicit during the construction of word embeddings. Applying the proposed annotation significantly improves the results for the SGHS, SGNS, CBOWHS and CBOWNS models. For the SGHS model, annotation increases the average accuracy by 5% absolute across the word embedding parameter space (Figure 2).
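As an illustration, the sketch below implements the core of the annotation idea as described: the name of a location is inserted into every sentence of the page pertaining to it. Appending the token at the end of each sentence and underscore-joining multi-word names are simplifying assumptions.

```python
def annotate_page(sentences, location):
    """Insert the page's location name into every sentence of that page."""
    token = location.replace(" ", "_")  # e.g. 'New York' -> 'New_York'
    return [sentence + [token] for sentence in sentences]

page = [["the", "eiffel", "tower", "attracts", "millions", "of", "visitors"],
        ["the", "seine", "crosses", "the", "city"]]
print(annotate_page(page, "Paris"))
# every sentence now explicitly co-occurs with 'Paris'
```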

Four game strategies were implemented to model game-playing behaviour using word embeddings (Research Question 3), of which the simplest one, sketched below, resulted in the best performance for the LTG. This may indicate that, in retrieving word associations, humans do not always apply a specific rule-governed strategy.
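The following is a minimal sketch of an Enumeration-style guess, under the assumption that the strategy ranks every candidate city by its mean similarity to the hints received so far and returns the best-scoring city that has not been guessed yet; the exact aggregation used by the agent may differ.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def enumeration_guess(hints, candidates, vectors, previous_guesses=()):
    """Return the candidate city with the highest mean similarity to the hints."""
    known_hints = [h for h in hints if h in vectors]
    if not known_hints:
        return None
    best_city, best_score = None, float("-inf")
    for city in candidates:
        if city in previous_guesses or city not in vectors:
            continue  # skip earlier wrong guesses and out-of-vocabulary cities
        score = sum(cosine(vectors[city], vectors[h])
                    for h in known_hints) / len(known_hints)
        if score > best_score:
            best_city, best_score = city, score
    return best_city
```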

The intrinsic evaluation illustrates that performance on the analogy questions regarding countries and capitals does not directly correlate with capturing word associations for the LTG. While the LexVec word embeddings yield a high accuracy in the intrinsic evaluation, their performance in the LTG is lower than that achieved with the tailored word embeddings. Additionally, despite the small size of the presented corpus, the tailored word embeddings yield a rather high accuracy on the analogy tasks (Research Question 4). This may indicate that the domain of the corpus matters more than its size, which is consistent with the findings of Lai et al. (2016). The effect of the targeted corpus annotation is less apparent in the intrinsic evaluation. As the intrinsic evaluation only uses the representations of countries and cities, which are very frequent terms in the tailored corpus, this suggests that the corpus annotation mostly enhances the representation of clues rather than of candidate cities.
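For reference, a capital-country analogy question of the kind used in the intrinsic evaluation can be checked with standard vector arithmetic in gensim; the model path below is hypothetical.

```python
from gensim.models import Word2Vec

model = Word2Vec.load("sghs_annotated.model")  # hypothetical path
# 'Athens' is to 'Greece' as ? is to 'France'; 'Paris' should rank near the top.
print(model.wv.most_similar(positive=["Athens", "France"],
                            negative=["Greece"], topn=3))
```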

The choice of a tailored corpus for the AGA architecture is based upon the assumption that the associations with cities depend upon their role as travel destinations (Wikivoyage) or general facts about places (Wikipedia). Corpora from other domains, for example news items, may enhance the associations. A possible disadvantage of the resources used is that encyclopedias are mostly useful for little-known facts, whereas the associations that play a role in word-guessing games may be significantly simpler (Von Ahn et al., 2006). The current approach also lacks a method for capturing the relations between the words in multi-word hints, such as the hint ‘always summer’ (‘San Jose’) or the hint ‘stealing from rich’ (‘Nottingham’).

The example games used for the evaluation of the agent may not suffice to evaluate to what extent the AGA models word associations accurately. Possibly, the agent would perform better if interactive games were allowed, as suggested by Heath et al. (2013), who found a similar effect. Multiple games from the dataset include hints that depend upon the guesses given by the human guesser. These hints are meaningless for the AGA, as they do not change as the AGA's guesses change. Moreover, several hints are not about geographical associations but rather about word distances, such as the hints ‘marble’ and ‘more bells’ for ‘Marbella’. Apart from containing unsuitable hints, the dataset only includes games in which the target city was successfully guessed by the human guesser. Therefore, the AGA could not be evaluated with games in which the describer presented a city that the human guesser did not recognise. However, if the AGA were able to guess the target city for some of those games, that would still be relevant for modelling the associations humans have with geographical locations, since it would model the associations of the describer agent.


7 Conclusion and Future Work

In this work, an AGA architecture has been presented that uses context-predicting DSMs to infer word associations from a tailored corpus and employs different game strategies for the LTG. This method for the generation and retrieval of word associations could be generalised to different domains or adapted to different tasks. A targeted corpus annotation method is proposed that artificially inserts the name of a location into every sentence of the page pertaining to it. The corpus annotation method amplifies the associations implicitly present in the tailored corpus and significantly improves the performance in the LTG. The architecture improves upon the baseline architecture of Adrian et al. (2016) and can guess target cities from the set of games provided by the ESSENCE network with up to 27.50% accuracy. Based upon the results presented by Dankers et al. (2017), an AGA employing SGHS word embeddings and the Enumeration game strategy will be presented in the Taboo Challenge Competition workshop at the International Joint Conference on Artificial Intelligence 2017.

Firstly, future research could focus on corpus improvements by combining multiple types of resources, such as reviews from tourists and news articles about a city. The targeted annotation method could be extended to pages reachable via outlinks, or one could experiment with annotating more or less frequently.

Secondly, the list of candidate cities could be improved, since this has a large impact on the game performance. A more accurate estimation of well-known cities could combine information from multiple resources to take into account different aspects that influence a city’s salience. Some examples of relevant aspects could be a city’s population, its appearance in news items and the number of well-known events hosted in the city.

Thirdly, more complex architectures could be created. A better method could be developed for the interpretation of multi-word hints. For example, terms such as ‘always’ or ‘very’ could be used as an indication of the strength of an association, by attenuating or increasing the similarity score. Also, the knowledge of multiple agents could be combined in a larger system in which agents work together to find the optimal answer. A history log per agent could indicate the reliability of an agent's guess. This could be utilised by a voting system that assigns a weight to every agent's answer and returns the target city with the highest cumulative weight, as sketched below. The agents in the system could, for example, employ different algorithms for the construction of word embeddings, use different similarity metrics, or use different corpora to model associations of multiple topics, such as the local cuisine or the history of a location.
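As a rough illustration of the voting idea, the fragment below assumes a hypothetical agent interface with a guess method and a reliability weight derived from each agent's history log; all names are illustrative, not part of the presented architecture.

```python
from collections import defaultdict

def weighted_vote(agents, hints, candidates):
    """Aggregate guesses from several agents, weighted by their reliability."""
    totals = defaultdict(float)
    for agent in agents:
        guess = agent.guess(hints, candidates)  # assumed agent interface
        totals[guess] += agent.reliability()    # e.g. fraction of past games won
    return max(totals, key=totals.get)
```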

References

Adrian, K., Batsuren, K., Bova, N., Brochhagen, T., Chocron, P., Van Eecke, P., . . . Vourliotakis, A. (2015). ‘Taboo challenge’, technical report. ESSENCE Marie Curie Initial Training Network.

Adrian, K., Bilgin, A. & Van Eecke, P. (2016). A semantic distance based architecture for a guesser agent in ESSENCE's location taboo challenge. DIVERSITY@ECAI 2016, 33–39.

Agres, K. R., McGregor, S., Rataj, K., Purver, M. & Wiggins, G. A. (2016). Modeling metaphor perception with distributional semantics vector space models. In C3GI@ESSLLI.

Baroni, M., Dinu, G. & Kruszewski, G. (2014). Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of ACL (pp. 238–247). doi: 10.3115/v1/P14-1023


Cai, Y., Lu, W., Che, X. & Shi, K. (2015). Differential evolutionary algorithm based on multiple vector metrics for semantic similarity assessment in continuous vector space. In DMS (pp. 241–249). doi: 10.18293/DMS2015-001

Dankers, V., Bilgin, A. & Fernández, R. (2017). Modelling word associations with word embeddings for a guesser agent in the Taboo City Challenge Competition. In The Taboo Challenge Competition (IJCAI-17). (To appear)

De Beule, J. (2016). The multiple word guessing game. Belgian Journal of Linguistics, 30.

De Deyne, S. & Storms, G. (2008a). Word associations: Network and semantic properties. Behavior Research Methods, 40 (1), 213–231. doi: 10.3758/BRM.40.1.213

De Deyne, S. & Storms, G. (2008b). Word associations: Norms for 1,424 Dutch words in a continuous task. Behavior Research Methods, 40 (1), 198–205. doi: 10.3758/BRM.40.1.198

Goldberg, Y. & Levy, O. (2014). Word2vec explained: Deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.

Harris, Z. S. (1954). Distributional structure. Word, 10 (2-3), 146–162.

Heath, D., Norton, D., Ringger, E. & Ventura, D. (2013). Semantic models as a combination of free association norms and corpus-based correlations. In Seventh international conference on semantic computing (pp. 48–55). doi: 10.1109/ICSC.2013.18

Kiela, D. & Clark, S. (2014). A systematic study of semantic vector space model parameters. In Proceedings of the 2nd workshop on continuous vector space models and their compositionality at EACL (pp. 21–30). doi: 10.3115/v1/W14-1503

Lai, S., Liu, K., He, S. & Zhao, J. (2016). How to generate a good word embedding. IEEE Intelligent Systems, 31 (6), 5–14. doi: 10.1109/MIS.2016.45

Levy, O., Goldberg, Y. & Dagan, I. (2015). Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3, 211–225.

Mikolov, T., Chen, K., Corrado, G. & Dean, J. (2013a). Efficient estimation of word representations in vector space. In Proceedings of Workshop at the ICLR.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. & Dean, J. (2013b). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111–3119).

Peirsman, Y., Heylen, K. & Geeraerts, D. (2008). Size matters: Tight and loose context definitions in English word space models. In Proceedings of the ESSLLI workshop on distributional lexical semantics (pp. 34–41).

Rong, X. (2014). word2vec parameter learning explained. arXiv preprint arXiv:1411.2738.

Salle, A., Idiart, M. & Villavicencio, A. (2016). Matrix factorization using window sampling and negative sampling for improved word representations. arXiv preprint arXiv:1606.00819. doi: 10.18653/v1/P16-2068

Simonton, D. K. (2013). Creative problem solving as sequential BVSR: Exploration (total ignorance) versus elimination (informed guess). Thinking Skills and Creativity, 8, 1–10. doi: 10.1016/j.tsc.2012.12.001

Turney, P. D. & Pantel, P. (2010). From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37, 141–188. doi: 10.1613/jair.2934

Von Ahn, L., Kedia, M. & Blum, M. (2006). Verbosity: a game for collecting common-sense facts. In Proceedings of the SIGCHI conference on human factors in computing systems (pp. 75–78). doi: 10.1145/1124772.1124784


A Word Embedding Parameter Tuning Results

[Figure 4: eight heat maps, panels (a) SGHS Unannotated, (b) SGHS Annotated, (c) SGNS Unannotated, (d) SGNS Annotated, (e) CBOWHS Unannotated, (f) CBOWHS Annotated, (g) CBOWNS Unannotated, (h) CBOWNS Annotated; x-axis: Context Window Size (5–25), y-axis: Dimensions (200–400), colour scale: accuracy (0.15–0.45).]

Figure 4: Results for varying the number of dimensions and the size of the context window per algorithm, with city list D, cosine similarity and the Enumeration game strategy.


Table 12: Results for varying the number of negative samples, n, for CBOWNS and SGNS.

city list   corpus                     Algorithm   Accuracy (%)
                                                   n = 5    n = 10   n = 25
C           Wiki* Corpus Unannotated   SGNS        19.75    18.52    18.52
                                       CBOWNS      14.82    11.73    14.20
            Wiki* Corpus Annotated     SGNS        24.69    20.37    22.84
                                       CBOWNS      17.29    13.58    16.05
D           Wiki* Corpus Unannotated   SGNS        32.71    32.10    27.16
                                       CBOWNS      32.10    32.10    30.86
            Wiki* Corpus Annotated     SGNS        38.27    35.80    36.42
                                       CBOWNS      38.27    37.04    37.04

B Agent Architecture Parameter Tuning Results

Table 13: Results for varying the city list.

corpus                     Model    Accuracy (%)
                                    A       B       C       D
Wiki* Corpus Unannotated   SGHS     12.35   12.35   21.65   36.42
                           SGNS     15.43   16.67   20.37   30.25
                           CBOWHS   6.79    8.64    12.35   22.83
                           CBOWNS   10.49   9.88    15.43   30.24
Wiki* Corpus Annotated     SGHS     17.28   16.67   24.07   41.98
                           SGNS     20.37   19.14   24.69   33.95
                           CBOWHS   12.35   13.58   16.67   28.40
                           CBOWNS   10.49   12.96   17.90   35.19


[Figure 5: eight heat maps, panels (a) SGHS Unannotated, (b) SGHS Annotated, (c) SGNS Unannotated, (d) SGNS Annotated, (e) CBOWHS Unannotated, (f) CBOWHS Annotated, (g) CBOWNS Unannotated, (h) CBOWNS Annotated; x-axis: Game Strategy (VA, CC, EL, EN), y-axis: Similarity Metric (che, cor, cos, tan), colour scale: accuracy (0–0.30).]

Figure 5: Results for varying the game strategy and similarity metric for the best configurations for the word embeddings per algorithm, with city list C. The game strategies are Enumeration (EN), Elimination (EL), Country to City (CC) and Vector Arithmetic (VA). The similarity metrics are Cosine (cos), Correlation (cor), Tanimoto (tan) and Chebyshev (che).


[Figure 6: eight heat maps, panels (a) SGHS Unannotated, (b) SGHS Annotated, (c) SGNS Unannotated, (d) SGNS Annotated, (e) CBOWHS Unannotated, (f) CBOWHS Annotated, (g) CBOWNS Unannotated, (h) CBOWNS Annotated; x-axis: Game Strategy (VA, CC, EL, EN), y-axis: Similarity Metric (che, cor, cos, tan), colour scale: accuracy (0.15–0.45).]

Figure 6: Results for varying the game strategy and similarity metric for the best configurations for the word embeddings per algorithm, with city list D. The game strategies are Enumeration (EN), Elimination (EL), Country to City (CC) and Vector Arithmetic (VA). The similarity metrics are Cosine (cos), Correlation (cor), Tanimoto (tan) and Chebyshev (che).


Table 14: The configurations per algorithm for city lists C and D, for both the annotated and unannotated corpus, as found through 5-fold cross validation. The number of negative samples was set to 5 for methods to which negative samples are applicable.

city list   corpus                     Algorithm   dimensions   window size   strategy          similarity metric
C           Wiki* Corpus Unannotated   SGHS        300          20            Enumeration       cosine
                                       SGNS        400          20            Enumeration       correlation
                                       CBOWHS      200          20            Enumeration       cosine
                                       CBOWNS      300          20            Enumeration       correlation
            Wiki* Corpus Annotated     SGHS        300          25            Enumeration       tanimoto
                                       SGNS        300          20            Enumeration       correlation
                                       CBOWHS      200          10            Enumeration       tanimoto
                                       CBOWNS      200          10            Enumeration       tanimoto
D           Wiki* Corpus Unannotated   SGHS        300          25            Enumeration       cosine
                                       SGNS        200          5             Enumeration       tanimoto
                                       CBOWHS      200          20            Country to City   correlation
                                       CBOWNS      400          10            Enumeration       tanimoto
            Wiki* Corpus Annotated     SGHS        300          25            Enumeration       tanimoto
                                       SGNS        200          25            Enumeration       correlation
                                       CBOWHS      400          5             Country to City   tanimoto
