
Academic year: 2021


An exploration of the role of wordtype

on language model performance for

encoding and decoding

Esra Solak (10001812)

Bachelor thesis
Credits: 18 EC

Bachelor Opleiding Kunstmatige Intelligentie
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisors:
J. Zuidema
S. Abnar
Institute for Logic, Language and Computation


UNIVERSITY OF AMSTERDAM

Abstract

BSc Kunstmatige Intelligentie

by Esra Solak

We rely on our knowledge of words and linguistic abilities in many aspects of our functioning. How does the brain encode the meanings of all these words? Many language models aim to approach the answer to this question, with varying degrees of success. Previous research has successfully encoded concrete nouns, and some studies have even found interesting patterns indicating possible complementary roles. This thesis aims to explore the generalisation of these findings by applying Representational Similarity Based Encoding and Decoding, as introduced by Anderson et al. (2016), on a wordset with nouns of varying degrees of concreteness and a wordset with various wordtypes. As previous research suggests possible complementary roles, where Word2Vec is better at encoding and GloVe at decoding, this thesis compares the Word2Vec and GloVe models.

Contents

1 Introduction 1
2 Theoretical background 3
  2.1 Language Models and Word Representations 3
    2.1.1 GloVe 5
    2.1.2 Word2Vec 5
  2.2 Representational Similarity Analysis 5
3 Method 6
  3.1 Materials 6
    3.1.1 Language Models 6
    3.1.2 Dataset 6
    3.1.3 Words 7
  3.2 Experiments 7
  3.3 Data Preprocessing 7
  3.4 Similarity Based Encoding Implementation 7
  3.5 Similarity Based Decoding Implementation 9
4 Results 11
  4.1 Word2Vec vs. GloVe 11
  4.2 Word type 11
5 Discussion 11
  5.1 Word2Vec vs. GloVe 11
  5.2 The role of wordtype on the performance of GloVe 12
  5.3 Error analysis 13
6 Conclusion 13
7 Appendix A: Nouns 15


1

Introduction

The average native speaker of a language has a vocabulary size between 20,000 and 35,000 words [1]. These numbers vary greatly from person to person depending on various conditions such as environment, socio-economic status and the number of languages they know. We even have words that are very similar in meaning, but hold distinguishing characteristics that we deem important. When we also note that we think in the languages we know, it is clear that the way our brains encode words is crucial for the understanding of many different aspects of human life.

So how does the brain encode all these words, with all their differences and similarities? Mitchell et al. (2008) [2] have taken one step towards answering this question. For 9 participants, they obtained neural activity patterns for concrete nouns. To build word representations, they counted the co-occurrences of 60 concrete nouns with 25 carefully selected verbs, the selection of which was based on sensory-motor properties. In order to capture the essence of the neural activity patterns of each noun, all nouns were presented 6 times, in random order. The participants were shown a picture of the noun as well as its written form, and were tasked with thinking about the noun while the fMRI was taken. With these word representations and the neural activity patterns, Mitchell et al. (2008) attempted to predict brain activity from word representations and vice versa, with promising results.

Since then, however, much more research has been conducted in this area. As such, there are now various language models that seem promising [2][3][4][5]. All of these language models make assumptions about factors that possibly hold the meaning of a word in one way or another. Context-based language models, for example, hold the assumption that the words surrounding a word contain some form of information on that word's meaning. Yet other language models make use of lexical or even phonological information about words. All these models yield word representations. So how do these language models compare against each other, and could we, perhaps, combine them to obtain an 'ideal' word representation?

Abnar et al. (2017) [6] have pitted various such models against each other on the same dataset Mitchell et al. (2008) used, as provided by the researchers. They used linear regression with L2 regularisation to compare various kinds of language models. Among various interesting and promising findings, the study suggested that two language models tended to perform better than the rest, and had complementary errors. Word2Vec, one such language model, seemed to work well in predicting the neural activity patterns, also known as encoding. GloVe, on the other hand, worked well in predicting a word's representation from neural activity patterns, referred to as decoding. The findings of this study also indicated patterns in the errors these models tended to make, which differed between the models and thus suggested complementary roles.

The dataset provided by Mitchell et al. (2008) contains concrete nouns only. How well do these findings generalise to nouns of varying levels of concreteness?


How do they fare against various word types, phrases, sentences? The current thesis aims to explore these questions, the first focus being how well these findings generalise. In order to do this, three questions are researched: how well do GloVe and Word2Vec generalise to other nouns of varying degrees of concreteness? Is GloVe better at decoding, while Word2Vec is better at encoding? How well do the findings for GloVe generalise to different wordtypes?

Data containing neural activity patterns requires delicate and thoughtful handling. Not only does the data contain quite some noise, within-subject variance tends to be on the higher end, and between-subject variance even more so [7][8], making generalisation more difficult. Linear regression is quick to implement, and allows one to adjust parameters in order to influence the performance of the model. In combination with the previous considerations on neural activity data, however, it is clear that this does not suit the purpose of this research, which is to explore the generalisability of the previous findings to words with different characteristics, as well as the generalisability of the seemingly complementary roles of the two language models, GloVe and Word2Vec. Representational Similarity Analysis (RSA) makes use of the structure of pattern-similarities in different domains, does not require model fitting, and is straightforward to implement as well as computationally efficient [9]. In other words, it lends itself perfectly to comparing neural activity patterns with word representations, specifically in light of the goal of this study. While it is limited in its use, as it does not generally allow synthesis of neural activity patterns, Anderson et al. (2016) [9] have introduced Similarity Based Encoding and Similarity Based Decoding. They applied this method, using the GloVe language model, on the dataset provided by Mitchell et al. (2008), as Abnar et al. (2017) did with only the method as the difference, and did so with success. While various similarity metrics are suitable for RSA, Pearson's correlation has been used in this research in order to maintain comparability to Anderson's paper.

In summary, this research aims to explore how well the success of GloVe and Word2Vec generalises to nouns of varying degrees of concreteness, as well as to different wordtypes for GloVe. A third question is whether the relative performances of GloVe and Word2Vec are indeed complementary, focusing on the accuracies.

The first section of this study dives into the theoretical background of the methods and models used. While not necessary to be able to follow the research, it provides insights and indicates areas within the methods where multiple options are available. The second section gives insight into the actual experiments, implementation and materials used. The report of the results is followed by its analysis in a discussion, after which the conclusion summarizes the findings of the paper and contains suggestions for future work.


2

Theoretical background

2.1 Language Models and Word Representations

Words can be represented in various ways. Some language models represent words by making use of the words they tend to co-occur with. While some of these models use the context, meaning the words surrounding a word (Word2Vec), others use corpus-global co-occurrence probabilities (GloVe). Some language models, however, use syntactical information (WordNet based) [4].

Regardless of the theory behind the model, the result is always a numerical representation (a vector) for a word, referred to as word embeddings and/or word representations. This way, a word space can be constructed, and relations between words can be observed not only in a manner that is friendly to the human eye, but also explored with the use of basic linear algebra.

Well-trained models result in representations that, when plotted in the vector space, group words that are similar in meaning closer to one another than to dissimilar words. In terms of their vectors or representations, this means that words with similar meanings have similar word representations. These findings thus suggest that these models have managed to capture at least part of the meanings of words. An example would be the likelihood of the words 'king' and 'queen' co-occurring in a corpus, compared to the likelihood that either of the previous two words co-occurs with the word 'fridge'. With a well-trained model, 'king' and 'queen' would be plotted closer to each other than to 'fridge'. The word 'cook', however, is more likely to co-occur with the word 'fridge' and will thus appear closer to it. Figure 1 shows an example for the Word2Vec model, where capitals and countries are clearly clustered [3].


Figure 1: Word2Vec, as taken from Mikolov et al. (2013) [3]

This is not the only way in which the vector differences capture some form of meaning. It seems the relations between words are also in some way encoded. With simple linear algebra we find that even analogies are captured by these models. By observing the vector differences between 'king' and 'queen', and 'man' and 'woman', the word representations seem to hold the information that 'king is to queen as man is to woman' (see Figure 2). Indeed, word representations from well-trained models perform very well on analogy tasks.

Figure 2
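The clustering and analogy behaviour described above can be illustrated with a minimal sketch. The toy 3-dimensional vectors below are invented for illustration only; real trained embeddings have hundreds of dimensions and learned values:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: angle-based closeness of two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional embeddings (illustrative values, not trained ones).
emb = {
    "king":   np.array([0.9, 0.8, 0.1]),
    "queen":  np.array([0.9, 0.1, 0.8]),
    "man":    np.array([0.5, 0.9, 0.1]),
    "woman":  np.array([0.5, 0.2, 0.8]),
    "fridge": np.array([0.1, 0.4, 0.4]),
}

# 'king' lies closer to 'queen' than to 'fridge'.
assert cosine(emb["king"], emb["queen"]) > cosine(emb["king"], emb["fridge"])

# Analogy arithmetic: king - man + woman should land near queen.
target = emb["king"] - emb["man"] + emb["woman"]
best = max((w for w in emb if w not in ("king", "man", "woman")),
           key=lambda w: cosine(target, emb[w]))
print(best)  # queen
```

With real embeddings the analogy vector rarely lands exactly on the answer, so the nearest neighbour (excluding the query words) is taken, as above.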

The two language models examined in this study are GloVe, a model that uses corpus-global word co-occurrence statistics, and Word2Vec, a model which uses word-context representations. Both models thus follow the assumption that words can be known by the company they keep, and differ in the way 'company' is interpreted. The observations described in the previous paragraphs regarding the encoding of similarities and relations between words apply to both models.

2.1.1 GloVe

GloVe is an unsupervised learning algorithm that tries to capture the meaning of a word from the structure of corpora. The main assumption underlying this language model is that word-to-word co-occurrence statistics encode some form of meaning.

While training, the objective of the algorithm is to minimise a least squares error, learning word representations such that the dot product of any two word embeddings equals the logarithm of the probability of their co-occurrence. As a consequence, the algorithm associates ratios of co-occurrence probabilities with vector differences in the word vector space.
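The objective described above can be sketched as follows. This is not a trainable GloVe implementation, only the loss it minimises; the weighting constants follow Pennington et al. (2014) [5], and the one-word example is invented:

```python
import numpy as np

def glove_loss(W, W_ctx, b, b_ctx, X, x_max=100.0, alpha=0.75):
    """Sum over nonzero co-occurrence counts X[i, j] of
    f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2,
    where f down-weights rare pairs and caps very frequent ones."""
    total = 0.0
    for i, j in zip(*np.nonzero(X)):
        f = min(1.0, (X[i, j] / x_max) ** alpha)
        err = W[i] @ W_ctx[j] + b[i] + b_ctx[j] - np.log(X[i, j])
        total += f * err ** 2
    return total

# A one-word "corpus" whose parameters fit the objective exactly:
# 0 + 1 + 0 = log(e), so the loss is zero.
X = np.array([[np.e]])
loss = glove_loss(np.zeros((1, 2)), np.zeros((1, 2)),
                  np.array([1.0]), np.array([0.0]), X)
print(loss)  # 0.0
```

When the dot products plus biases match the log co-occurrence counts everywhere, the loss vanishes, which is exactly the property the body text describes.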

2.1.2 Word2Vec

Word2Vec is a method that makes use of the context of a word. The context is defined through two characteristics. The window size indicates the number of words taken into consideration as part of the context. Then there is the placement of the context, for which there are three options: x words to the left and x words to the right of the word are taken as context, or only the words to the left, or only the words to the right.

There are two methods with which this model can learn its word representations: skip-gram and Continuous Bag of Words (CBOW). In the first case, the task is to predict the context of a given word, whereas for CBOW the task is predicting the word given the context.
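The window and placement notions above can be made concrete by listing the (target, context) pairs a symmetric window generates. The `training_pairs` helper is mine, not part of Word2Vec itself; skip-gram predicts the second element from the first, CBOW the reverse:

```python
# Sketch: (target, context) training pairs generated by a symmetric
# context window. Skip-gram predicts each context word from the target;
# CBOW predicts the target from its whole context.
def training_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                      # the target is not its own context
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the cook opened the fridge".split()
pairs = training_pairs(sentence, window=1)
print(pairs)
```

With a window of 1, every interior word yields two pairs (one neighbour on each side) and the edge words yield one, so the five-word sentence yields eight pairs.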

2.2 Representational Similarity Analysis

Representational Similarity Analysis, henceforth RSA, is a technique used widely in research involving fMRI-based data. The core assumption of this technique is that, while two different domains may not be directly transformable into a form in which they can be compared directly, similarities computed within the domains can be compared, with useful results [7][8][10].

Computing similarities within the domains results in the construction of self-referential distance spaces, through which the similarities between the elements can be used to find characteristics of the elements, and to compare these across the various domains. In this paper, neural activity patterns are compared with word representations obtained from language models. For every word, similarity vectors and/or matrices are computed and compared within the domains; that is, between the fMRI data of words and between the word representations of words, separately.
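The within-domain computation can be sketched as follows; the word count, embedding dimensionality and voxel count below are invented toy sizes:

```python
import numpy as np

# Toy setup: the same 5 words represented in two incomparable domains.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 50))   # 50-d word representations
fmri = rng.normal(size=(5, 2000))       # 2000 voxel activities per word

# Within-domain similarity matrices (row-wise Pearson correlation).
S_model = np.corrcoef(embeddings)
S_brain = np.corrcoef(fmri)

# The domains have different dimensionality and are never compared
# directly; only their self-referential similarity structures are.
assert embeddings.shape != fmri.shape
assert S_model.shape == S_brain.shape == (5, 5)
```

Both similarity matrices are words-by-words, regardless of how many dimensions or voxels each domain has, which is what makes the cross-domain comparison possible.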

The similarity matrix of a domain can give many useful insights, some of which we have described in subsection 2.1. The main takeaway was that words that are close in meaning appear closer to each other, not only in a plot, but also in the similarities of their vectors, their representations. The similarity matrix is thus nothing less than the numerical representation of the relations between the elements in that domain. In summary, RSA is based on the idea that words that are similar in meaning have similar relations to other words and concepts, much like the language models described.

As RSA uses similarity metrics, parameter estimation (and thus overfitting) can be sidestepped. This in itself can be a pro or a con depending on your purpose. Two popular and straightforward yet very useful similarity metrics are the Euclidean distance and the cosine similarity. While the Euclidean distance is a direct measure of the difference between two vectors, the cosine similarity measures the angle between two vectors. The Pearson correlation is the cosine similarity between two centered vectors, which means this similarity metric is not as influenced by the scale of the vector space as the Euclidean distance is. In order to incorporate parameter estimation into the RSA method, other similarity metrics can be used; the Gaussian similarity is one option for those interested in some degree of model fitting. In order to maintain comparability to the previously mentioned study of Anderson et al. (2016), and in line with the exploratory purpose of the current paper, Pearson's correlation has been used as the similarity metric.
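The relationship between these three metrics can be checked numerically; the two vectors below are arbitrary examples:

```python
import numpy as np

u = np.array([2.0, 4.0, 6.0, 8.0])
v = np.array([1.0, 3.0, 2.0, 5.0])

euclid = np.linalg.norm(u - v)                         # raw distance
cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))  # angle only

# Pearson's correlation is the cosine similarity of the centered vectors.
uc, vc = u - u.mean(), v - v.mean()
pearson = uc @ vc / (np.linalg.norm(uc) * np.linalg.norm(vc))
assert np.isclose(pearson, np.corrcoef(u, v)[0, 1])

# Centering makes Pearson invariant to shifting and rescaling one vector;
# the Euclidean distance has no such invariance.
assert np.isclose(pearson, np.corrcoef(10.0 * u + 3.0, v)[0, 1])
assert not np.isclose(euclid, np.linalg.norm((10.0 * u + 3.0) - v))
```

This scale invariance is precisely why Pearson's correlation is less affected by the overall scale of the vector space than the Euclidean distance.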

RSA is also fairly straightforward in its implementation and requires little computational effort, which contributes to its fit with the purpose of the current study.

3

Method

3.1 Materials

3.1.1 Language Models

The following pre-trained word embeddings have been used:

1. GloVe

The GloVe vectors used for this study have been trained on Wikipedia and Gigaword and consist of 50-dimensional vectors, as provided by Pennington et al. (2014) [5].

2. Word2Vec

The Word2Vec vectors used for this study have been trained on the English Wikipedia dump (February 2017), with a context window of 5 words to the left and the right, and consist of 50 dimensions. The representations were obtained through the skip-gram model [3][11]. 300-dimensional vectors with these same characteristics have also been used.

3.1.2 Dataset

Acquisition

The data used is that of experiment 1 as conducted by and described in Pereira et al. (2018) [12]. The procedure through which the fMRI data has been obtained for the participants is described in the original paper. In short, the individual concepts have been presented in sentences and together with pictures.

Composition

The dataset contains fMRI data for 11 participants.

3.1.3 Words

The words used for the experiments regarding only nouns can be found in Appendix A.

The complete set of words, used for the experiments on mixed word types, is supplied by Pereira et al. (2018) [12].

3.2 Experiments

In order to answer the research questions, three tasks have been executed. On the wordset consisting of only nouns, the 50-dimensional GloVe and the 50- as well as 300-dimensional Word2Vec models have been used for encoding and decoding. Additionally, the 50-dimensional GloVe model has also been used for encoding and decoding on a wordset consisting of various wordtypes.

3.3 Data Preprocessing

The neural activity patterns for words are stored in column vectors of voxel activities. Similarity based encoding and decoding require no additional processing of the datasets as provided by Pereira et al. (2018) [12].

3.4 Similarity Based Encoding Implementation

Similarity-based encoding of a word

Similarity based encoding, as implemented in this study, uses Pearson's correlation as a measure of similarity between word representations. Given the word representations and neural activation patterns for a set of words, the task of encoding is to predict the neural activity for a word given only its word representation, by making use of the similarity measures of the word representations. This is achieved in two phases: the computation of similarity code vectors and the actual synthesis of a predicted neural activity pattern.

In the first phase, similarity code vectors are obtained by correlating the word representation of the word to predict with the word representations of all the other words of which we have word representations and neural activity patterns.

The similarity code vectors obtained in the first phase are used as weights for the average of the neural activity patterns of the respective words. First, the similarity codes of the desired word to the other words are used as weights for their corresponding neural activity patterns. These weighted neural activity patterns are summed, voxelwise, and normalised by dividing by the sum of the similarity codes, resulting in a similarity weighted average. This phase can be formally expressed as follows:

b'_{N+1} = (1/C) · Σ_{i=1}^{N} corr(s_{N+1}, s_i) · b_i,   with C = Σ_{i=1}^{N} corr(s_{N+1}, s_i)   [9]

where N is the number of words for which we have fMRI data, and each fMRI pattern is stored in a vector b that is linked to a word representation s. The word that is being predicted is indexed N+1, with b'_{N+1} as the synthesized neural activity pattern for the new word s_{N+1}. C, the normalising constant, is the sum of the correlation values in the word representation similarity code for the predicted word.

The following figure contains the visualization as provided by Anderson et al. (2016) in their paper [9].

Figure 4: Visualization as provided by Anderson et al.(2016)9
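The two phases above can be sketched in a few lines of numpy. The `encode` helper name is mine, not from the thesis or Anderson et al.; the shapes follow the formula's notation:

```python
import numpy as np

def encode(s_new, S, B):
    """Similarity-based encoding: synthesize the neural pattern for a
    new word as the similarity-weighted average of the known patterns.
    s_new: representation of the new word, shape (d,)
    S:     representations of the N known words, shape (N, d)
    B:     neural activity patterns of the N known words, shape (N, v)"""
    # Phase 1: Pearson correlation of the new word with each known word.
    r = np.array([np.corrcoef(s_new, s)[0, 1] for s in S])
    # Phase 2: voxelwise weighted sum, normalised by the sum of weights.
    C = r.sum()
    return (r[:, None] * B).sum(axis=0) / C

# Sanity check: if the new word is equally similar to every known word,
# the prediction reduces to the plain voxelwise mean of the patterns.
rng = np.random.default_rng(1)
s = rng.normal(size=10)
S = np.tile(s, (3, 1))          # three identical representations
B = rng.normal(size=(3, 100))   # three neural patterns (100 voxels)
assert np.allclose(encode(s, S, B), B.mean(axis=0))
```

The sanity check mirrors the role of the normalising constant C: equal weights of 1 give C = N, turning the weighted sum into an ordinary mean.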


In order to be able to evaluate the performance of the similarity-based encoding method, leave-2-out cross-validation, as introduced by Mitchell et al. (2008) [2] and also employed by Anderson et al. (2016) [9], is used.

For all possible word pairs, neural activity patterns are predicted through similarity based encoding. For a word pair, this results in two predicted neural activity patterns. These are then correlated to the two observed neural activity patterns for these words, resulting in four correlations. If the sum of the correlations of the correctly matched pairs of predicted and observed neural activity patterns is higher than the sum of the correlations of the wrongly matched pairs, the encoding is deemed a success.

By doing this for all possible word pairs, we end up with a list of successes and failures, through which we are able to obtain a mean of accuracy as well as a standard deviation.
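The leave-2-out success criterion for one pair can be sketched as follows (the `pair_match` helper name is mine); the mean of this boolean over all pairs gives the reported accuracy:

```python
import numpy as np

def pair_match(pred_a, pred_b, obs_a, obs_b):
    """Leave-2-out test: success iff the correctly matched
    predicted/observed patterns out-correlate the swapped matching."""
    c = lambda x, y: np.corrcoef(x, y)[0, 1]
    return c(pred_a, obs_a) + c(pred_b, obs_b) > c(pred_a, obs_b) + c(pred_b, obs_a)

obs_a = np.array([1.0, 2.0, 3.0, 4.0])
obs_b = np.array([4.0, 1.0, 3.0, 2.0])
# Perfect predictions are matched correctly; swapping them fails the test.
assert pair_match(obs_a, obs_b, obs_a, obs_b)
assert not pair_match(obs_b, obs_a, obs_a, obs_b)
```

Since a random labeling is correct half the time, chance-level accuracy for this test is 50%, which is the baseline against which the results in Section 4 are judged.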

3.5 Similarity Based Decoding Implementation

Similarity Based Decoding for a word

The decoder is tasked with estimating a neural activity pattern's label, per pair of words. This is accomplished in roughly three phases: the computation of similarity matrices, the extraction and preparation of the vectors for the words to decode, and the actual decoding phase.

In the first phase, two similarity code matrices are computed. One of these is the similarity code matrix for the word representations, the other for the neural activity patterns.

For any pair of words, the decoder then extracts the similarity vectors out of the matrices, and deletes from these vectors all elements that pertain to the words themselves or to each other.

The last phase completes the decoding by matching the vectors extracted in the previous phase to each other. The previous phase yields four vectors: two neural activity similarity vectors and two word representation similarity vectors. Correlating these with each other gives four correlations. Similar to the encoding counterpart of this method, the labeling is deemed a success if the highest sum of correlations belongs to the correct match of neural activity and word representation similarity vectors.
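The three phases can be sketched for one pair of words as follows (the `decode_pair` helper name and the toy matrix sizes are mine):

```python
import numpy as np

def decode_pair(i, j, S_model, S_brain):
    """Similarity-based decoding for words i and j: after removing the
    entries for the pair itself, the labeling succeeds iff the correctly
    matched brain/model similarity vectors out-correlate the swapped match."""
    keep = [k for k in range(S_model.shape[0]) if k not in (i, j)]
    m_i, m_j = S_model[i, keep], S_model[j, keep]   # model similarity vectors
    b_i, b_j = S_brain[i, keep], S_brain[j, keep]   # brain similarity vectors
    c = lambda x, y: np.corrcoef(x, y)[0, 1]
    return c(b_i, m_i) + c(b_j, m_j) > c(b_i, m_j) + c(b_j, m_i)

# If the two domains share an identical similarity structure,
# every pair should be decoded correctly.
rng = np.random.default_rng(2)
S = np.corrcoef(rng.normal(size=(6, 40)))   # 6 words, toy similarity matrix
assert all(decode_pair(i, j, S, S) for i in range(6) for j in range(i + 1, 6))
```

As with encoding, accuracy is the fraction of all word pairs for which this test succeeds, with 50% as chance level.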

Figure 5 contains the visualization as provided by Anderson et al. (2016) [9] in their paper.

Evaluation

The evaluation process for the decoder is identical to that of the encoder, using leave-2-out cross-validation.


Figure 5: Visualization as provided by Anderson et al. (2016) [9]


4

Results

Model           Encoding             Decoding
                mean    std. dev.    mean    std. dev.
GloVe 50d       65%     6%           49%     3%
Word2Vec 50d    55%     7%           50%     4%
Word2Vec 300d   50%     3%           50%     3%

Table 1: Mean accuracies for nouns only

Wordset           Encoding             Decoding
                  mean    std. dev.    mean    std. dev.
Nouns             65%     6%           49%     3%
Mixed wordtypes   68%     7%           51%     1%

Table 2: Mean accuracies for GloVe 50d on different wordsets

4.1 Word2Vec vs. GloVe

Table 1 shows the results for the 50-dimensional GloVe and Word2Vec models for the experiment on nouns. While GloVe seems to be successful in encoding, with a mean accuracy of 65%, both models perform at chance-level for decoding.

4.2 Word type

Table 2 shows the results for the experiment on different wordtypes versus nouns only for the GloVe model. GloVe seems successful in encoding for both wordsets, with a mean accuracy of 65% for the nouns and 68% for the set with mixed wordtypes. For decoding, however, GloVe performs at chance-level for both wordsets.

5

Discussion

5.1 Word2Vec vs. GloVe

First, 50-dimensional Word2Vec and GloVe word representations, trained on Wikipedia dumps, were compared on their performance on a list of 123 nouns. While only GloVe performed above chance-level for encoding, both models failed at decoding. These results do not seem to be in accordance with the results of Anderson et al. (2016) [9] and Abnar et al. (2017) [6]. The finding that GloVe performs better at decoding, while Word2Vec performs better at encoding, has also not been supported by the findings of this thesis. There are, however, various explanations for these differences.

The first explanation lies in the fMRI data. The fMRI data from Mitchell et al. (2008) consisted of 6 neural activity patterns per word, whereas the dataset used in this thesis only had 1 trial per word. As the mean of the 6 images was used to represent the neural activation of a noun, at least part of the noise was filtered out.

As mentioned previously, RSA does not allow model fitting generally, which is something Abnar et al. (2017) have made use of, and is thus another possible explanation for the differences.

This thesis, however, has used the same method as Anderson et al. (2016), and the results are different from that study as well. The 6 trials brought another difference with them that more likely than not contributes to the difference in results, namely the possibility of voxel selection. Anderson et al. (2016) have made use of voxel selection, with the aim to eliminate the influence of unstable voxels.

Both aforementioned studies used 300-dimensional models. In order to explore how likely that is as an explanation, the same experiment was conducted with 300-dimensional Word2Vec. The results are not better or worse for the 300-dimensional model compared to the 50-dimensional model, suggesting that the differences in performance are not caused by differences in dimensionality.

                Nouns                                         Mixed word types
                Encoding                       Decoding       Encoding        Decoding
                most mistakes   most correct   most mistakes  most mistakes   most mistakes
                garbage         vacation       reaction       reaction        reaction
                elegance        obligation     cook           elegance        skin
                disease         personality    pleasure       body            attitude
                reaction        texture        ball           driver          charity
                driver          sin            charity        philosophy      cook
                star            electron       weather        invention       job
                bear            applause       pig            computer        residence
                computer        liar           job            disease         liar
                food            apartment      residence      brain           challenge
                time            residence      skin           delivery        brain

Table 3: Error analysis for GloVe

5.2 The role of wordtype on the performance of GloVe

The second question regards the role of wordtype on the performance of the GloVe model. The experiment consisted of encoding and decoding for two wordsets, the first of which contained only nouns, and the latter a mix of word types. While at first sight it may seem that GloVe performed better on the mixed wordset, taking the standard deviations into consideration, there seems to be no real difference between the performance of GloVe on wordsets of mixed nature and on nouns only.

It is, however, possible that this was caused by the rather large skew in the number of nouns versus other types of words. Out of all 180 words, 123 were unmistakably nouns. Some of the remaining words could be seen as nouns as well, for example 'play', which can be read as a theatrical play or as a verb. This means that we cannot conclude whether GloVe performs better, worse or equally for wordsets containing only one wordtype versus a mixture. As similar words tend to have similar representations, and word types introduce another distinguishing characteristic into the wordset, it is possible that a variety of wordtypes would result in higher accuracies.

5.3 Error analysis

Table 3 shows the 10 words GloVe performed worst on, as well as the 10 best encoded words, for the experiment with only nouns. The top-10 lists for all other experiments can be found in Appendix B.

The GloVe model performed best on the encoding tasks for the nouns and mixed wordsets. A quick look at the words the model performed worst on shows that the majority of these nouns are closely related to a verb, for example reaction, driver, invention and bear. That said, for the encoding task on the nouns wordset we can see that words like applause, which share the aforementioned characteristic, appear in the top 10 best encoded words as well, though they are relatively in the minority.

Some of the words on which the model performed subpar recur across the various tasks: reaction, elegance and disease, for example. While it cannot be said that there is a definite pattern in the errors made, nouns closely related to a verb seem to be the most difficult. It is important to note that there are few differences between the top 10 most mistaken words for GloVe on the nouns and mixed wordsets, suggesting that the addition of different wordtypes did not confuse the model.

Unfortunately, for the Word2Vec model, no pattern could be discerned at all (see Appendix for more).

6

Conclusion

In this thesis, three questions were explored. First, the performance of GloVe and Word2Vec on a group of nouns of varying levels of concreteness was researched. While GloVe performed well on the encoding task, neither model performed above chance-level for the decoding task. As discussed in the previous section, at least for the Word2Vec model, this does not seem to be caused by a smaller dimensionality than in related research. Furthermore, it is highly plausible that the performance of the two models was limited by the lack of fMRI data. Taking previous related work and the noisy, high-variance nature of fMRI data into account, it is thus expected that with the availability of several fMRI images per participant per concept, the models could perform above chance-level for all tasks. As such, while the findings, at first glance, suggest that Word2Vec's previous success does not generalise to other nouns of varying degrees of concreteness, and no complementary roles were found where GloVe performed better at decoding and Word2Vec at encoding, this cannot be ruled out with certainty.


For future work it is suggested that a dataset is used where the fMRI data consists of multiple trials per word. The choice of method, RSA or otherwise, depends on the goals of the study, although we suggest Pearson's correlation for comparability. Using the Gaussian similarity as the similarity metric, however, would provide a balance between the need for model fitting and efficiency.

GloVe performed as well on the encoding task for the wordset of mixed wordtypes as it did on the wordset containing only nouns, if not better. It seems highly plausible that GloVe could perform better for wordsets containing a mixture of wordtypes, as that would add a distinguishable characteristic to the wordset. As the wordset used in this thesis was highly skewed, yet the performance of the model was not hindered, it would be interesting for future research to focus on heterogeneous wordsets.

References

[1] Summary of results - Test Your Vocab research. http://testyourvocab.com/blog/ 2013-05-10-Summary-of-results.

[2] Tom M Mitchell et al. “Predicting human brain activity associated with the meanings of nouns”. In: science 320.5880 (2008), pp. 1191–1195.

[3] Tomas Mikolov et al. “Distributed representations of words and phrases and their com-positionality”. In: Advances in neural information processing systems. 2013, pp. 3111– 3119.

[4] Ahmad Babaeian Jelodar, Mehrdad Alizadeh, and Shahram Khadivi. “WordNet Based Features for Predicting Brain Activity associated with meanings of nouns”. In: Pro-ceedings of the NAACL HLT 2010 First Workshop on Computational Neurolinguistics. Association for Computational Linguistics. 2010, pp. 18–26.

[5] Jeffrey Pennington, Richard Socher, and Christopher Manning. “Glove: Global vectors for word representation”. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014, pp. 1532–1543.

[6] Samira Abnar et al. “Experiential, distributional and dependency-based word em-beddings have complementary roles in decoding brain activity”. In: arXiv preprint arXiv:1711.09285 (2017).

[7] Nikolaus Kriegeskorte, Marieke Mur, and Peter A Bandettini. “Representational simi-larity analysis-connecting the branches of systems neuroscience”. In: Frontiers in sys-tems neuroscience 2 (2008), p. 4.

[8] Jörn Diedrichsen and Nikolaus Kriegeskorte. “Representational models: A common framework for understanding encoding, pattern-component, and representational-similarity analysis”. In: PLoS computational biology 13.4 (2017), e1005508.

[9] Andrew James Anderson, Benjamin D Zinszer, and Rajeev DS Raizada. “Representational similarity encoding for fMRI: Pattern-based synthesis to predict brain activity using stimulus-model-similarities”. In: NeuroImage 128 (2016), pp. 44–53.

[10] Representational Similarity Analysis (RSA). http://brainvoyager.com/bv/doc/UsersGuide/RSA/RepresentationalSimilarityAnalysis.html.

[11] Word2Vec. https://code.google.com/archive/p/word2vec/.

[12] Francisco Pereira et al. “Toward a universal decoder of linguistic meaning from brain activation”. In: Nature communications 9.1 (2018), p. 963.

7 Appendix A: Nouns

8 Appendix B: Error Analysis
w2v50d nouns decoding

Most mistakes:
'obligation', 0.36661698956780925
'student', 0.37779433681073027
'plant', 0.3897168405365127
'doctor', 0.39120715350223545
'clothes', 0.3986587183308495
'show', 0.40312965722801786
'king', 0.4090909090909091
'brain', 0.4098360655737705
'tool', 0.4128166915052161
'investigation', 0.4210134128166915

Most correct:
'professional', 0.5700447093889717
'sin', 0.5730253353204173
'news', 0.5752608047690015
'light', 0.5797317436661699
'beer', 0.587183308494784
'soul', 0.6140089418777943
'nation', 0.6177347242921013
'cockroach', 0.6251862891207154
'job', 0.6333830104321908
'star', 0.6423248882265276

w2v50d nouns encoding

Most mistakes:
'skin', 0.38286580742987114
'news', 0.38524590163934425
'law', 0.38748137108792846
'damage', 0.3889716840536513
'reaction', 0.4008941877794337
'hair', 0.4046199701937407
'king', 0.4165424739195231
'big', 0.4314456035767511
'dedication', 0.4366616989567809
'electron', 0.43815201192250375

Most correct:
'invention', 0.6154992548435171
'bar', 0.6177347242921013
'dinner', 0.620715350223547
'texture', 0.6289833080424886
'delivery', 0.6341281669150521
'table', 0.644916540212443
'pain', 0.6609538002980626
'trial', 0.6707132018209409
'solution', 0.6853677028051555
'engine', 0.7317436661698957


glove50d nouns encoding

Most mistakes:
'garbage', 0.4640762463343108
'elegance', 0.4970674486803519
'disease', 0.49853372434017595
'reaction', 0.5263929618768328
'driver', 0.5271260997067448
'star', 0.5417888563049853
'bear', 0.5447214076246334
'computer', 0.5469208211143695
'food', 0.5571847507331378
'time', 0.5579178885630498

Most correct:
'vacation', 0.7741935483870968
'obligation', 0.7749266862170088
'personality', 0.782991202346041
'texture', 0.7844574780058651
'sin', 0.7895894428152492
'electron', 0.7917888563049853
'applause', 0.7983870967741935
'liar', 0.8035190615835777
'apartment', 0.8049853372434017
'residence', 0.8218475073313783
'ignorance', 0.8299120234604106


glove50d nouns decoding

Most mistakes:
'reaction', 0.3460410557184751
'cook', 0.3812316715542522
'pleasure', 0.38343108504398826
'ball', 0.38563049853372433
'charity', 0.38929618768328444
'weather', 0.39002932551319647
'pig', 0.40249266862170086
'job', 0.4039589442815249
'residence', 0.40542521994134895
'skin', 0.40615835777126097

Most correct:
'ignorance', 0.5835777126099707
'road', 0.5894428152492669
'personality', 0.5960410557184751
'invention', 0.5982404692082112
'student', 0.6011730205278593
'ship', 0.6070381231671554
'sound', 0.6099706744868035
'help', 0.6363636363636364
'collection', 0.6458944281524927
'blood', 0.6612903225806451


glove50d mixed wordset encoding

Most mistakes:
'reaction', 0.4482191780821918
'elegance', 0.5098576122672508
'body', 0.5206949412365867
'driver', 0.5301204819277109
'philosophy', 0.531763417305586
'invention', 0.5345016429353778
'computer', 0.537321063394683
'disease', 0.5410733844468785
'brain', 0.5444785276073619
'delivery', 0.549079754601227

Most correct:
'prison', 0.7441095890410959
'illness', 0.7552026286966046
'mechanism', 0.755750273822563
'construction', 0.7561349693251533
'apartment', 0.7644353602452734
'tree', 0.7677984665936474
'road', 0.7747945205479452
'bar', 0.7818088911599387
'residence', 0.8268493150684931
'ignorance', 0.8433734939759037


glove50d mixed wordset decoding

Most mistakes:
'reaction', 0.3682588597842835
'skin', 0.3682588597842835
'attitude', 0.37082691319979455
'charity', 0.38931689779147405
'cook', 0.3975346687211094
'job', 0.4036979969183359
'residence', 0.40575243965074476
'liar', 0.40626605033384694
'challenge', 0.4093477144324602
'brain', 0.4108885464817668

Most correct:
'dessert', 0.6004108885464817
'picture', 0.6029789419619929
'invention', 0.6050333846944016
'bag', 0.6076014381099127
'ship', 0.6101694915254238
'team', 0.6153055983564458
'tree', 0.6163328197226502
'student', 0.640472521828454
'collection', 0.6420133538777607
'blood', 0.7021058038007191
