How the brain gives meaning to words

How the brain gives meaning to words

Rasyan Ahmed (10784063)
Bachelor Thesis
Credits: 18 EC

Bachelor Opleiding Kunstmatige Intelligentie
University of Amsterdam
Faculty of Science
Science Park 904, 1098 XH Amsterdam

Supervised by: Samira Abnar and dr. Jelle Zuidema
2nd July 2017

Institute for Logic, Language and Computation (ILLC)
University of Amsterdam
Faculty of Science
Science Park 107, 1098 XG Amsterdam
https://www.illc.uva.nl/

Abstract

When a human thinks about a word, their brain lights up in a characteristic pattern. This pattern is unique for each word and is how the brain encodes its semantic meaning. Previous studies have shown that the brain activations belonging to each word can be predicted by a multiple linear regression system trained on a distributional word vector representation with a hand-picked feature set. A second prediction task proposed in this thesis shows that these manually made vectors are insufficient in the opposite direction: they cannot predict the word vectors from the brain activations with high accuracy. In this thesis, more modern distributional word vector representations based on the GloVe and Word2Vec algorithms are introduced that complete both tasks with accuracies that match or exceed those achieved by the previous vectors. Non-distributional word vectors were also shown to outperform all other word vectors on the second task, which suggests that the brain itself does not work in a completely distributional fashion.

Contents

1 Introduction

2 Literature review

3 Method

  3.1 The Brain fMRI data

  3.2 Word vector representations

    3.2.1 25 verbs vectors

    3.2.2 Word2Vec

    3.2.3 GloVe

    3.2.4 Non-distributional word vector

  3.3 Preprocessing

  3.4 Testing methods

    3.4.1 Predicting brain activations from word vectors

    3.4.2 Predicting word vectors from brain activations

4 Results

  4.0.1 Original Test Results

  4.0.2 Reverse Test Results

5 Discussion

  5.0.1 Future Research

6 Conclusion

Chapter 1

Introduction

The ability to understand language is one of the most important skills that we as humans possess. It is the primary method we use to communicate with others and to transfer our knowledge to our descendants, so that the species as a whole moves forward with each passing generation. Even our internal thoughts are given form through language, as we use it to reason and apply logic. It is such an important aspect of our lives that we cannot even imagine what it would be like to live without it.

It is clear that in order to understand humans we need to thoroughly understand their use of language. It is therefore no surprise that understanding natural language is one of the most important problems facing modern artificial intelligence, and that the field dedicated to it, natural language processing, is one of the largest branches of AI research. In the many decades that this field has existed, the ability of computers to interpret the structure of natural languages has significantly improved, but the richness and ambiguity of human-made language still pose many problems for perceiving the actual meaning of words and sentences. To combat these problems it may be useful to look at the only system known to have solved them: the human brain. Yet surprisingly, this area of research has seen limited development: while research has shown which areas of the brain are related to language processing, it is still unknown how these brain regions extract meaning from words and sentences.

With the advancement of both artificial intelligence and neuroimaging techniques it has become possible to delve deeper inside the brain to try to understand the processes going on inside, and in 2008 a group led by Tom Mitchell at Carnegie Mellon University did precisely that (Mitchell et al., 2008a). They showed that it is possible to predict which brain regions show fMRI activity when a person thinks about a certain word, as they could predict with high accuracy which fMRI image belonged to which word. Perhaps more interestingly, they also showed that words with similar semantic meanings have similar fMRI activations, which suggests that the brain represents the semantic meaning of a word by the regions of the brain that show activity.

While their results are a significant breakthrough in our understanding of how the brain processes language, there is still room for improvement in the methods used to achieve them. In particular, the word vectors constructed for use in this model contain a hand-picked feature set, chosen according to insights gained in previous neuroscience research. While these vectors are able to successfully predict fMRI brain activations, they have been made specifically for this task and cannot be used in other natural language processing tasks. This is in contrast to the human brain, which is a general system that can solve not only all tasks related to natural language but allows us to perform all the complex behaviour that humans are known for.

The process of building a predictor from a corpus can be seen as the construction of a model as similar as possible to the one that originally generated the corpus; in this case, that means constructing a model as similar as possible to the human brain. The performance of a model in predicting fMRI brain activations can therefore be seen as a form of similarity measure between the model in question and the human brain, which suggests that general word vectors would be better suited to brain-related prediction tasks. Looking at it from this perspective allows us to learn more about both the human brain and the models used to predict its fMRI activations: if the word vectors that perform well in this task have a certain property, then it is likely that the human brain has that property as well, and the better a model scores on this task, the closer it is to the gold standard in natural language processing, the human brain. Due to recent developments in natural language processing, a number of more general word vector representations have become available and have seen successful use in a variety of language-related tasks.
In this thesis these word vector representations (which form the heart of the predictive model) will be compared to each other and to Tom Mitchell's word vectors, in order to test the hypothesis that these more general models, like the human brain, perform better at predicting brain activations, and to learn more about both the human brain and the word vector representations in question.

After an overview of the literature on the subject, the data sets and all word vectors used in the comparisons will be discussed individually, alongside the two testing methods employed in this research. The results of these experiments show that the general models perform better than the hand-picked vectors, which confirms the hypothesis. Afterwards, the discussion section will explain why this is the case, what these results tell us about the human brain, and possible directions for future research.

Chapter 2

Literature review

Neuroscientists have shown that viewing different categories of objects results in different fMRI brain activations, indicating that these activations encode the meaning of the objects (Polyn et al., 2005). Computational linguists showed in the 1950s that the meaning of a word can be inferred by looking at the words that often occur alongside it (Harris, 1954). Methods based upon this principle are often called distributional. A 2008 study by Mitchell et al. combined the research of these two fields to show that the fMRI activation in each brain voxel can be predicted for a person thinking about a specific word (Mitchell et al., 2008a). This was done with a distributional word vector representation that had co-occurrence counts with 25 hand-picked verbs as its features. Further research showed that automatically chosen feature sets performed competitively with, if not better than, those chosen by hand (Murphy et al., 2012). This suggests that general word vectors, usable for a variety of natural language problems, could also perform well on this task. The goal of this thesis is to improve further upon the research of Mitchell et al. by introducing more modern general vector representations, in the hope that these representations achieve a higher prediction accuracy.

One such vector representation algorithm is GloVe, which stands for Global Vectors and is similarly based upon co-occurrence counts (Pennington et al., 2014). The difference is that GloVe counts the co-occurrence of the target word with all words in the context that co-occur more than once, compared to the original method of counting co-occurrences with only the 25 specific hand-picked verbs. Another modern word vector representation is Word2Vec (Mikolov et al., 2013). This method is not based on co-occurrence counts like the previous two, but is instead a predictive method. What all three methods have in common is that they are based upon the distributional hypothesis. To test whether the brain assigns meaning to words in a similarly distributional fashion, these distributional methods have to be compared with a non-distributional one. In 2015 Manaal Faruqui and Chris Dyer created one such method: a word vector representation based on completely non-distributional sparse data (Faruqui & Dyer, 2015). In this thesis, word vectors from all three of these algorithms will be compared with the vectors from the original Mitchell et al. paper. While some of these have been used in other brain fMRI prediction tasks (Anderson et al., 2017), this research will be the first to compare such a diverse set of word vectors on this data set and problem.

Chapter 3

Method

In this section the methodology used to construct the two predictive models will be explained: how the brain images were captured and processed, which vector representations were used, and what the two testing methods are.

3.1

The Brain fMRI data

One of the primary factors that has made this research possible is the advent of functional magnetic resonance imaging (fMRI). Using this technology it has become possible to peer into the workings of the brain without complex and dangerous procedures (Ogawa et al., 1990). This technology was used in the Mitchell et al. paper to produce images of how the brain organises and represents conceptual knowledge of specific words, which serve as training and test data for the predictive model. The 3D images in the corpus used by Mitchell et al. and this research were captured using a Siemens Allegra 3.0T scanner on college-aged participants. These nine participants, aged between 18 and 32, are all right-handed and consist of 5 females and 4 males. The images were captured while the participants were shown a line drawing and word pair, after which they were asked to think about the properties of the object in question. The participants were shown drawing-word pairs from 12 different semantic categories with 5 objects per category, for a total of 60 different objects.

One of the largest problems with fMRI data is the large amount of noise inherent in it, the primary source being head motion during the capturing process. To limit the effect of noise as much as possible, each object was shown 6 times, totalling 360 showings, which were then randomised to remove any effect the order of showings may have. The machine learning algorithm used in this thesis requires the 3D images to be in a discrete format. To accomplish this, each 3D image was split into 21764 voxels, each representing a 3x3x6 mm part of the brain image. Each voxel has an activation value that belongs to the area it represents. The captured brain images and processed voxels are explained and provided in the supporting online material (Mitchell et al., 2008b) for the original Mitchell et al. paper (Mitchell et al., 2008a). In this thesis, the goal of the first test is to predict the activation value of each of these voxels, while the second test uses these voxels as training data to predict the word vectors.

3.2

Word vector representations

The data supplied to a computer and its learning algorithms to represent the meaning of a word is almost always in the form of a word vector: a sequence of numbers that serves as an identifier for the meaning of that word. The closer one word vector is to another, the closer the semantic meanings of the two words. The word vector for 'apple' is similar to that of 'orange' but significantly different from the word vector for 'car'. The interesting question is how to construct these word vectors so that they best represent the meaning of the words they encode.
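This notion of vector closeness is usually made concrete with cosine similarity. A minimal sketch, where the 4-dimensional vectors and their feature values are invented purely for illustration and do not come from any of the representations discussed:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy vectors for three words; the dimensions are invented.
apple  = np.array([0.9, 0.8, 0.1, 0.0])
orange = np.array([0.8, 0.9, 0.2, 0.1])
car    = np.array([0.0, 0.1, 0.9, 0.8])

sim_fruit = cosine_similarity(apple, orange)
sim_other = cosine_similarity(apple, car)
assert sim_fruit > sim_other  # 'apple' lies closer to 'orange' than to 'car'
```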

In this research the results of seven different word vector representations will be compared to each other; these are:

1. The 25 verbs word vectors used in the Mitchell et al. paper (Mitchell et al., 2008a) as a baseline for comparison.

2. Three word embeddings based upon the predictive Word2Vec algorithm.

3. Two word embeddings from the count-based GloVe algorithm.

4. A non-distributional word vector representation formed from resources such as WordNet and FrameNet.

3.2.1 25 verbs vectors

The word vector representation used in the Mitchell et al. paper consists of 25 features that represent the co-occurrence counts of the target words with 25 different verbs (Mitchell et al., 2008a). These verbs were hand-picked on the principle that the brain largely assigns meaning to nouns based upon sensory-motor features. As such, most of these verbs are actions that can be performed on an object, actions that change spatial relationships, or basic sensory and motor activities.

The corpus used to construct these word vectors consists of a million words organised into n-grams. These n-grams vary from 5-grams (sequences of 5 words that follow each other) to unigrams (single words), based upon their location in the sentence. All n-grams that occur fewer than 40 times in the complete data set were removed in order to reduce noise and limit the number of n-grams. The vector representation is constructed by simply counting the number of times each of the 60 nouns for which fMRI data is available co-occurred with one of the 25 hand-picked verbs in the same n-gram. The form in which a verb occurs in the n-gram is of no importance; all verb forms are accepted as an occurrence of the verb. The result is a 60 by 25 matrix where each row is a word vector that represents the meaning of one of the 60 nouns.
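The counting procedure above can be sketched as follows. The handful of n-grams, nouns, and verbs below are hypothetical stand-ins for the real million-word corpus and the 25 hand-picked verbs:

```python
from collections import Counter

# Hypothetical mini-corpus of n-grams; the real corpus and the full 25-verb
# list come from Mitchell et al. (2008a).
ngrams = [
    ("you", "can", "eat", "an", "apple"),
    ("she", "will", "drive", "the", "car"),
    ("eat", "the", "apple", "now", "please"),
]
nouns = ["apple", "car"]
verbs = ["eat", "drive", "ride"]  # stand-ins for the 25 hand-picked verbs

# Count how often each noun shares an n-gram with each verb.
counts = {noun: Counter() for noun in nouns}
for gram in ngrams:
    for noun in nouns:
        if noun in gram:
            for verb in verbs:
                if verb in gram:
                    counts[noun][verb] += 1

# Each row is a (here 3-dimensional) co-occurrence vector for one noun.
vectors = {noun: [counts[noun][verb] for verb in verbs] for noun in nouns}
assert vectors["apple"] == [2, 0, 0]
assert vectors["car"] == [0, 1, 0]
```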

While easy to construct, easy to apply to a large data set, and highly specialised for this task, this model does have its downsides. The limited number of features in each vector limits predictive power, which makes under-fitting very likely: it is highly probable that many dimensions necessary to properly encode the semantic meaning of a word are missing from this representation.

3.2.2 Word2Vec

Since its inception in 2013 by a group of researchers at Google (Mikolov et al., 2013), Word2Vec has seen widespread use in natural language processing due to its fast training speed and the high quality of its word vectors. It constructs these word vectors in a completely different way from the methods used before. Whereas word vectors such as the 25 verbs vectors are solely based on counting co-occurrences, Word2Vec constructs word vectors in a predictive fashion. Word2Vec initializes an embedding layer that contains a vector for each word and then passes this layer to a neural network, which can have one of two predictive tasks with the text corpus as training data. The first possible task is to predict the target word from its context; this flavour of Word2Vec is called continuous bag of words. The second flavour is the skip-gram method, which predicts the context from the target word; all Word2Vec vectors used in this thesis are based upon it. During this predictive process, the word vectors in the embedding layer move closer to the vectors of similar words, while the distance to dissimilar words increases.
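The skip-gram setup can be illustrated by how its (target, context) training pairs are extracted from a sentence. The sketch below shows only the pair generation, not the neural network training itself:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) pairs as used by skip-gram Word2Vec:
    the model is trained to predict each context word from its target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = ["the", "dog", "chased", "the", "cat"]
pairs = skipgram_pairs(sentence, window=1)
assert ("dog", "chased") in pairs and ("chased", "dog") in pairs
assert len(pairs) == 8  # 2 edge tokens give 1 pair each, 3 inner tokens 2 each
```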

At the end of the process, what remains are word embeddings that have encoded the meanings of the words. They show some interesting properties that the earlier co-occurrence methods did not. One such property is that they learn relationships between different words, for example showing that the relationship between man and woman is similar to the relationship between king and queen. This human-like ability to reason about the relationships between words may indicate that the Word2Vec algorithm is somewhat similar to how the brain encodes the meaning of words.

Figure 3.1: The Word2Vec embeddings shown here have captured the relationship between a country and its capital (Mikolov et al., 2013)

In the tests performed in this thesis, two different precomputed word embeddings made using the Word2Vec algorithm were used. The first is provided by the original Google research group (Mikolov et al., 2013); it has 300 dimensions and was trained on 100 billion word tokens acquired from Google News.

Previous research suggested that dependency-based models may perform very well on this task (Murphy et al., 2012). For this reason a second set of Word2Vec embeddings, made by Omer Levy and Yoav Goldberg (Levy & Goldberg, 2014), has been included. These vectors were constructed using a 1-billion-word Wikipedia corpus, but in a slightly different fashion from the normal Word2Vec algorithm. While the previous embedding defined the context of a word as the 5 words surrounding the target word on either side, these embeddings define the context as the words that the target word depends on, or that depend on it, identified using part-of-speech tagging. An example of words and their contexts is shown in figure 3.2.

Lastly, a precomputed word vector representation trained on the same 1-billion-word Wikipedia corpus using the FastText algorithm has also been included. FastText is a modified Word2Vec algorithm that takes syntactic information into account by also looking at a word's substructure.

Figure 3.2: An example of words and their corresponding dependency based contexts (Levy & Goldberg, 2014)

3.2.3 GloVe

After the inception of Word2Vec, a great deal of new research was started on creating even better word embeddings. One such effort, by a group at Stanford, resulted in the GloVe algorithm (Pennington et al., 2014). GloVe was made to remedy one of the problems of Word2Vec, namely that it does not take into account any global statistical data, without losing the interesting properties that make Word2Vec unique, such as encoding the relationships between words. GloVe does this by going back to the count-based methods used before: co-occurrence counts are used to directly influence the word embedding layer. Instead of passing the embedding layer into a neural network that corrects the embeddings step by step, GloVe directly optimises the vectors by setting the dot product of two vectors in the embedding layer to be as close as possible to the log of the co-occurrence count between the words those vectors represent. This results in significantly shorter training times without affecting the accuracy and the properties of the final embeddings.
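The GloVe objective just described can be sketched as the weighted squared error for a single word pair. The vectors, biases, and count below are toy values; the full algorithm additionally distinguishes word and context vectors and sums this term over the whole co-occurrence matrix:

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe's weighting function f(x): down-weights rare pairs, caps frequent ones."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """Weighted squared error between the dot product (plus bias terms)
    and the log of the co-occurrence count x_ij."""
    return glove_weight(x_ij) * (np.dot(w_i, w_j) + b_i + b_j - np.log(x_ij)) ** 2

# Toy 3-dimensional vectors; real GloVe optimises these over the whole matrix.
w_i, w_j = np.array([0.5, 0.1, 0.3]), np.array([0.4, 0.2, 0.6])
loss = glove_pair_loss(w_i, w_j, b_i=0.0, b_j=0.0, x_ij=20.0)
assert loss >= 0.0
```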

In this thesis, two 300-dimensional precomputed word embeddings will be used, both provided by the original Stanford group (Pennington et al., 2014), which differ only in the corpus they use. One uses an 840-billion-word CommonCrawl corpus, while the other uses a corpus consisting of 1 billion words from Wikipedia and another 5 billion words from the Gigaword corpus.

3.2.4 Non-distributional word vector

With most recent natural language processing research focused on distributional semantics, it is easy to forget that there are many large linguistic resources, built over the years, that have each tried to encode some semantic aspect of natural language. Each of these resources was built manually over many years and contains vast amounts of information on many different words that should not simply be discarded. One of the oldest and most important examples is Roget's Thesaurus, originally built in 1805, which contains synonyms for many different words and allows us to link their meanings together (Roget, 1911). A more modern resource is WordNet, which categorises words into synsets with similar meanings and assigns properties to these synsets (Miller, 1995). There are also linguistic resources that capture many other properties a word can have, such as its possible connotations (Feng et al., 2013) or its emotional and sentimental properties (Socher et al., 2013).

Figure 3.3: Sources used to create the Non distributional word vectors (Faruqui & Dyer, 2015)

In 2015 Manaal Faruqui and Chris Dyer combined many of these resources and organised the large amounts of information within them into word vectors that each encode different aspects of a word (Faruqui & Dyer, 2015). By going through these resources, adding a new vector for each word not yet seen and a new feature dimension for each new property, they constructed a word vector representation containing information about 119,257 unique words with a total of 172,418 different features. An indicator function describes the value of each feature: it is either 1, indicating that the feature is present in the word, or 0, indicating that it is not. A small example of this feature vector representation can be seen in figure 3.4. As can be expected from such a database, these vectors are 99.9% sparse, with each word containing on average 34 non-zero features and each feature being present in only 15 words on average.

Figure 3.4: An example of words and their corresponding non-distributional feature vectors (Faruqui & Dyer, 2015)

Such a large number of features requires heavy computational resources and is likely to cause overfitting in most tasks, which means a form of dimensionality reduction needs to be applied. Fortunately, for the purpose of the experiments in this thesis only 60 of the almost 120,000 words are of interest, which means that all features that are constant across these 60 words (present in all of them or in none of them) can be ignored, as they add no information that allows the training algorithm to differentiate the words from each other. After this process, what remains are vectors with a more manageable 7430 dimensions.
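This pruning step amounts to dropping every feature column that is constant across the 60 rows of interest. A minimal sketch, where the small binary matrix stands in for the real 60 x 172,418 one:

```python
import numpy as np

# Hypothetical binary feature matrix: 4 words x 6 features. Columns that are
# all 0s or all 1s carry no information for telling these words apart.
X = np.array([
    [1, 0, 0, 1, 0, 1],
    [1, 1, 0, 0, 0, 1],
    [1, 0, 0, 1, 0, 1],
    [1, 0, 0, 0, 1, 1],
])

# Keep only the columns that differ somewhere from the first row,
# i.e. drop columns that are constant across all rows.
keep = ~np.all(X == X[0, :], axis=0)
X_reduced = X[:, keep]
assert X_reduced.shape == (4, 3)  # columns 0, 2, and 5 were constant
```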

3.3

Preprocessing

The fMRI activation corpus consists of 9 participants who were each shown 60 different words 6 times, for a total of 360 showings, each represented by 21764 activation values. Such a large amount of data can be useful when training an algorithm, but can also lead to false conclusions due to the noise in the data, largely caused by head movements during the capturing process. Another downside of having so many data points is the computational requirement of training on such a large dataset. To alleviate these concerns, two preprocessing steps were taken.

First, a selection of the voxels least affected by noise is performed by assigning a stability score to each voxel, with all but the 500 highest-scoring voxels being discarded. Each voxel is given a 6 by 60 grid that represents its activation value for each of the showings of the 60 words. The stability score is then calculated using pairwise correlation between all pairs of rows, so that the voxels consistently showing stable response patterns that outweigh the noise across the different presentations score highest. These are the voxels used to train the different prediction models and to compare their predictions against.
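One plausible reading of this stability score, as the mean pairwise correlation between the six presentation rows of a voxel's 6 x 60 grid, can be sketched as follows; the synthetic data is purely illustrative:

```python
import numpy as np
from itertools import combinations

def stability_score(voxel_grid):
    """Mean pairwise Pearson correlation between the 6 presentation rows of a
    voxel's 6 x 60 activation grid; stable voxels respond consistently."""
    rows = list(voxel_grid)
    corrs = [np.corrcoef(a, b)[0, 1] for a, b in combinations(rows, 2)]
    return float(np.mean(corrs))

# A stable voxel repeats the same signal plus small noise; an unstable
# voxel responds randomly on every presentation.
rng = np.random.default_rng(0)
signal = rng.normal(size=60)
stable = np.stack([signal + 0.1 * rng.normal(size=60) for _ in range(6)])
unstable = rng.normal(size=(6, 60))
assert stability_score(stable) > stability_score(unstable)
```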

The second step is to average over the 6 showings of each word that each participant was shown. With no way of knowing which of the 6 showings is least affected by noise, there is no better option than taking the average of all 6 activation values. This step should further reduce the inherent noise in the data.

While the primary reason for these two steps is noise reduction, they also significantly reduce the computational requirements of training the different models compared in this research. The brain activation corpus has been reduced to only 500 activation values per word per participant, which results in significantly shorter training times.

3.4

Testing methods

The goal of this research is to find out how the brain represents the meaning of different words by building predictive models that try to emulate the human brain, which can be used to learn more about the human brain and the models themselves. This is done by comparing the performance of the different word vector representations mentioned above to each other in two different testing methods.

3.4.1 Predicting brain activations from word vectors

The first testing method was introduced in the 2008 Mitchell et al. research paper on which this research is based. The goal of this test is to construct a model that predicts the fMRI activity in each brain voxel when given a new word vector as input. This model is built using a multiple regression learning algorithm, where the word vectors are labelled with their corresponding brain image as the training corpus. The predicted brain activation in each brain voxel is calculated as the weighted sum of the different word vector features.

This process is captured by the following equation:

$$Y_v = \sum_{i=1}^{n} c_{vi} \, F_i(w) \tag{3.1}$$

Where:

• $Y_v$ = the predicted activation at voxel $v$

• $w$ = the input word vector

• $F_i(w)$ = the $i$th feature value of word vector $w$

• $n$ = the number of features

• $c_{vi}$ = the weight for voxel $v$ at feature $i$

The goal of the linear regression algorithm is to find the right weights $c_{vi}$ such that, for each brain voxel, the weighted sum of all feature values for each word is as close as possible to the activation value measured for that word. This process is repeated for all 500 most stable voxels, and all these optimised weights together form the predictive model. Using the weights of the trained model and the equation above, a brain activation prediction can be calculated for any new word vector.
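The per-voxel regression of equation 3.1 amounts to fitting one weight vector per voxel by least squares. A minimal sketch with random stand-in data, using the shapes from the thesis (58 training words, 25 features, 500 stable voxels):

```python
import numpy as np

# Random stand-in data; the real F and Y come from the word vectors and the
# measured voxel activations respectively.
rng = np.random.default_rng(1)
F = rng.normal(size=(58, 25))    # word-vector features, one row per word
Y = rng.normal(size=(58, 500))   # activations of the 500 stable voxels

# Least-squares fit of one weight column per voxel: Y ~ F @ C
C, *_ = np.linalg.lstsq(F, Y, rcond=None)

# Predicting the activations for a new word is a single matrix product.
new_word = rng.normal(size=25)
predicted_activations = new_word @ C   # one predicted value per voxel
assert C.shape == (25, 500)
assert predicted_activations.shape == (500,)
```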

The performance of a word vector representation is measured using leave-two-out cross-validation: 58 of the 60 word vectors are used as the training corpus for the predictive model, which is then used to predict the brain activations for the two word vectors that were left out. The predicted brain activations for these two word vectors are then compared with the real brain activations using the cosine similarity measure. The predictions are counted as accurate if the combined cosine distance between the two predicted brain activations and their corresponding measured activations is smaller than the combined distance under the swapped pairing. This process is repeated for each possible pair of left-out words, for a total of 1770 iterations. The final accuracy for the model is calculated as follows:

$$\text{Number of iterations} = \frac{60 \times 59}{2} = 1770 \tag{3.2}$$

$$\text{Accuracy} = \frac{\text{number of accurate predictions}}{1770} \tag{3.3}$$
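The leave-two-out matching criterion can be sketched directly: the correct pairing of the two predictions with the two measured images must beat the swapped pairing in combined cosine similarity. The vectors below are toy values:

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two activation vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def match_correct(pred1, pred2, real1, real2):
    """Leave-two-out criterion (Mitchell et al., 2008a): the correct pairing
    of predictions with measured images must score higher than the swap."""
    correct = cos(pred1, real1) + cos(pred2, real2)
    swapped = cos(pred1, real2) + cos(pred2, real1)
    return correct > swapped

# Toy 3-voxel "images": each prediction is a noisy copy of its real image.
real1, real2 = np.array([1.0, 0.0, 0.2]), np.array([0.0, 1.0, 0.3])
pred1 = np.array([0.9, 0.1, 0.2])   # noisy prediction of real1
pred2 = np.array([0.1, 0.8, 0.4])   # noisy prediction of real2
assert match_correct(pred1, pred2, real1, real2)
```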

3.4.2 Predicting word vectors from brain activations

The second testing method is the reverse of the first: its goal is to predict the right word vector from the brain activation, the complete opposite of the original test. In many other ways it is quite similar to the first test: it uses the same multiple linear regression training algorithm and also uses leave-two-out cross-validation. The algorithm trains on 58 of the 60 brain activations and then predicts the word vectors for the two left-out fMRI activations. The two predicted word vectors are mapped into the vector space together with the 60 real word vectors, and for each prediction it is checked whether the closest neighbouring word vector is the correct one. As in the first testing method, this process is repeated for each pair of left-out words, with the final performance grade of a word vector being the average over all 1770 repetitions.
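The nearest-neighbour check used in this reverse test can be sketched as follows, with a hypothetical three-word vocabulary standing in for the 60 nouns:

```python
import numpy as np

def nearest_word(predicted_vec, vocab_vectors):
    """Return the word whose vector lies closest (Euclidean) to the prediction."""
    dists = {w: np.linalg.norm(predicted_vec - v) for w, v in vocab_vectors.items()}
    return min(dists, key=dists.get)

# Toy 2-dimensional vocabulary; in the thesis these are the 60 nouns' vectors.
vocab = {
    "apple": np.array([1.0, 0.0]),
    "car":   np.array([0.0, 1.0]),
    "dog":   np.array([1.0, 1.0]),
}
predicted = np.array([0.9, 0.1])  # regression output for one held-out brain image
assert nearest_word(predicted, vocab) == "apple"
```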

Chapter 4

Results

One of the biggest problems when working with fMRI data is the large amount of noise in the data, the largest source being head movements during the capturing process. While the three participants who showed the largest amount of movement were removed from the fMRI corpus, the remaining 9 participants still show significant differences in the quality of their data.

Figure 4.1: Prediction accuracy over 9 participants on the original testing method using the 25 verbs vectors.

As can be seen in figure 4.1, the prediction accuracy for the 25 verbs vectors shows significant variation between the different participants on the first testing method. The accuracy on the brain images of the first participant is significantly higher than the others: 5.5% higher than the second-best performing participant and close to 25% better than the sixth participant, who performed the worst. While it is possible that different people encode the semantic meaning of words in significantly different ways, making some easier to predict than others, this is unlikely, since these variations are consistent across all word vector representations on both testing methods. It is almost always the case that the accuracy on the first participant is the highest while the accuracy on the sixth participant is the lowest, which reinforces the belief that these variations are likely caused by noise.

4.0.1 Original Test Results

Figure 4.2: Average accuracy over all 9 participants with the original testing method.

The 25 verbs vectors used in the Mitchell et al. paper were constructed based upon the idea that the neural representations of nouns are partly grounded in sensory-motor features (Mitchell et al., 2008a) and were made specifically for this task; however, their results are not unique. Figure 4.2 shows that their results can be duplicated or exceeded by word vectors that can be used for a variety of tasks, as all three word vectors trained on the Wikipedia corpus show similar results. Interestingly, neither the Wikipedia word vectors nor the 25 verbs vectors exceed 74%, indicating that this may be the upper limit for the possible accuracy on this data set, with the likely cause being the noise inherent in the fMRI data.

Perhaps the most interesting result of this experiment is that the word vectors trained on the Wikipedia corpus perform significantly better than those trained on other data sets. This is especially noticeable in the two GloVe vector representations: both were made in exactly the same fashion, with the only difference being the corpus they learned from, indicating that the higher accuracies are indeed a result of using data from Wikipedia. This is surprising because the Wikipedia corpus consists of only 1 billion words, whereas the other GloVe word vectors were trained on an 840-billion-word corpus, which goes against the expectation that a larger corpus would result in better accuracy.

The non-distributional word vectors perform moderately on this test; while their results are not as good as those of some distributional methods, especially the Wikipedia based ones, they still perform respectably.
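As a reference for how such accuracies are computed, the matching criterion of the leave-two-out test from Mitchell et al. (2008a) can be sketched as below. This is an illustrative reconstruction, not the exact evaluation code: the function names are invented here, and the toy vectors stand in for the much larger fMRI images.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two 1-D vectors.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def two_vs_two_correct(pred1, pred2, true1, true2):
    """Return True when the correct pairing of predicted and observed
    images scores a higher summed cosine similarity than the swapped
    pairing, i.e. the leave-two-out trial counts as correct."""
    correct = cosine(pred1, true1) + cosine(pred2, true2)
    swapped = cosine(pred1, true2) + cosine(pred2, true1)
    return correct > swapped

# Toy example: predictions that resemble their own targets.
t1, t2 = np.array([1.0, 0.0, 0.2]), np.array([0.0, 1.0, 0.1])
p1, p2 = np.array([0.9, 0.1, 0.2]), np.array([0.1, 0.8, 0.1])
print(two_vs_two_correct(p1, p2, t1, t2))  # True
```

The reported accuracy is then the fraction of held-out word pairs for which this check succeeds, so chance level is 50%.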

4.0.2 Reverse Test Results

Figure 4.3: Average accuracy over all 9 participants with the reverse testing method.

The first testing method showed that the results for the 25 verbs word vectors were close to those of the best performing word vectors; this is contrary to what is shown in figure 4.3. Their average accuracy on the second test is significantly lower than that of the second worst performing word vector. While above chance level (1/60), the measured accuracy is 32% lower than that of the second lowest performing word vector and close to 52% lower than that of the


best performing word vector. The likely cause is that the 25 verbs word vectors have only 25 dimensions, in contrast to the other tested word vectors, which all have at least 300. It is likely that 25 features are simply not enough to encode all the information required to predict the word vectors from their brain activations with high accuracy.

As with the original testing method, in the reverse test the word vectors trained on the smaller Wikipedia corpus perform much better than their counterparts trained on larger data sets. Although the difference is small, the Wikipedia based Glove vectors again outperform their non-Wikipedia counterpart. While the same is true for the Word2Vec algorithms, both versions perform significantly worse than the other word vectors, including FastText, which is a modified version of Word2Vec. These results suggest that predictive methods like Word2Vec perform worse than count based methods such as Glove. Another possibility is that the low results of the Wikipedia based Word2Vec model are caused by the dependency based context it employs, since the FastText word vectors achieve significantly higher accuracy even though the syntactic information they encode is unlikely to contribute heavily to this largely semantic task. This may indicate that Wikipedia based Word2Vec vectors with a standard context would perform at a similarly higher level.

The most surprising result shown in figure 4.3 is that the non-distributional word vectors performed better than the vectors based on the distributional hypothesis. These sparse word vectors with binary values were able to predict the word vector belonging to a brain activation pattern with such precision that, more than 90% of the time, the predicted word vector's closest neighbour was the vector belonging to the target word. These results suggest that the non-distributional vectors were able to encode information that the distributional methods could not.
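The nearest-neighbour criterion used in the reverse test can be sketched as follows; this is an illustrative reconstruction under the assumption that similarity is measured with cosine similarity over all 60 candidate word vectors, and the function name is invented here:

```python
import numpy as np

def nearest_neighbour_accuracy(predicted, targets):
    """Fraction of predicted vectors whose nearest target vector
    (by cosine similarity) belongs to the same word.
    predicted, targets: arrays of shape (n_words, n_dims)."""
    pred = predicted / np.linalg.norm(predicted, axis=1, keepdims=True)
    targ = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    sims = pred @ targ.T  # cosine similarity matrix, (n_words, n_words)
    hits = np.argmax(sims, axis=1) == np.arange(len(predicted))
    return hits.mean()

# Toy check: noisy copies of 60 random 300-dimensional targets
# should all be matched back to their own target vector.
rng = np.random.default_rng(0)
targets = rng.normal(size=(60, 300))
predicted = targets + 0.1 * rng.normal(size=targets.shape)
print(nearest_neighbour_accuracy(predicted, targets))  # 1.0
```

Under this criterion a random prediction would pick the right word roughly 1 time in 60, which is the chance level quoted above.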


Chapter 5

Discussion

The main goal of this research was to test whether more modern and general word vectors could perform similarly to the specialised word vector representation that Mitchell et al. used in their 2008 research (Mitchell et al., 2008a). The results presented in this thesis show that there are general word vectors and embeddings capable of predicting brain activations with the same accuracy. The results from a second testing method with the opposite goal, predicting word vectors from brain activations, showed that the Mitchell et al. word vectors with their hand picked feature set may not be the right choice for prediction tasks related to brain activity, as all other tested word vector representations showed significantly better results. These results confirm the hypothesis that more general word vectors, usable for a variety of natural language processing tasks, perform better at brain related prediction tasks, as they are more similar to the human brain, which is likewise a general model capable of performing all manner of tasks.

Additionally, these results show that selecting the right corpus is more important than the size of the corpus, as on both tests Wikipedia based word vectors outperformed their non-Wikipedia counterparts trained on text corpora hundreds of times larger. The most likely explanation is that Wikipedia's goal as an encyclopedia is to explain the meaning of words, which aligns with the goal of this experiment: predicting brain activations which encode the meaning of words. Another interesting result is the high accuracy achieved by the non-distributional word vector representation in the reverse test. These vectors seem able to encode information relevant to this task that the distributional methods cannot. This suggests that the brain does not work in a completely distributional fashion; if it does, the importance of corpus selection suggests that it only considers co-occurrences with selected words that define the meaning of the target word, instead of taking into account all words that often appear close to it.


5.0.1 Future Research

One of the most limiting factors in this sort of brain related prediction experiment is the amount of data available. There are fMRI images for only 60 nouns, limiting the options for any such research. Fortunately, in 2014 brain images for another 5000 words were made available (Wehbe et al., 2014), which can be used to answer a variety of interesting questions, such as whether synonyms produce the exact same brain activations, or whether it is possible to predict brain activations for words other than nouns. Future research could also explore the relationships between words, such as those between a country and its capital shown in fig 3.1, to investigate whether a similar pattern can be found inside the brain fMRI activations.

As both distributional and non-distributional methods were able to achieve high accuracies on both testing methods, hybrid vectors that combine data from both could form a promising direction to explore. Future research on this topic may want to include several such vectors, as they may be capable of achieving higher results on both of the experiments done in this thesis. Such research should also pay attention to methods that reduce the noise inherent in fMRI data.
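As an illustration of what such a hybrid vector could look like, a minimal sketch is given below. It simply concatenates a dense distributional vector with a sparse binary feature vector; the z-scoring of the dense part is one possible way to keep either part from dominating, not an established recipe, and all names are illustrative:

```python
import numpy as np

def hybrid_vector(dense, sparse_binary):
    """Concatenate a (z-scored) dense distributional vector with a
    sparse binary feature vector into one hybrid representation."""
    dense = (dense - dense.mean()) / dense.std()
    return np.concatenate([dense, sparse_binary.astype(float)])

dense = np.array([0.3, -1.2, 0.8, 0.1])  # e.g. a few Glove dimensions
binary = np.array([1, 0, 0, 1, 1])       # e.g. a few lexical features
v = hybrid_vector(dense, binary)
print(v.shape)  # (9,)
```

The regression systems used in this thesis are agnostic to where the input dimensions come from, so such a concatenated vector could be dropped into both testing methods unchanged.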


Chapter 6

Conclusion

A 2008 study by Mitchell et al. showed that it was possible to predict brain fMRI activations using a word vector consisting of a feature set of 25 hand picked words. In this thesis a systematic comparison was made between seven different word vectors, including the original vectors used in the 2008 paper. By comparing the results for predicting brain activations from word vectors as well as those for predicting word vectors from brain activations, this thesis has shown that more general models are better suited to brain activation related prediction tasks. Additionally, the results suggest that the brain itself does not work in a completely distributional fashion.


Bibliography

Anderson, A. J., Kiela, D., Clark, S., & Poesio, M. (2017). Visually grounded and tex-tual semantic models differentially decode brain activity associated with concrete and abstract nouns. Transactions of the Association for Computational Linguist-ics, 5 , 17–30.

Faruqui, M., & Dyer, C. (2015). Non-distributional word vector representations. arXiv preprint arXiv:1506.05230 .

Feng, S., Kang, J. S., Kuznetsova, P., & Choi, Y. (2013). Connotation lexicon: A dash of sentiment beneath the surface meaning. In Acl (1) (pp. 1774–1784).

Harris, Z. S. (1954). Distributional structure. Word , 10 (2-3), 146–162.

Levy, O., & Goldberg, Y. (2014). Dependency-based word embeddings. In Acl (2) (pp. 302–308).

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111–3119).

Miller, G. A. (1995). Wordnet: a lexical database for english. Communications of the ACM , 38 (11), 39–41.

Mitchell, T. M., Shinkareva, S. V., Carlson, A., Chang, K.-M., Malave, V. L., Mason, R. A., & Just, M. A. (2008a). Predicting human brain activity associated with the meanings of nouns. science, 320 (5880), 1191–1195.

Mitchell, T. M., Shinkareva, S. V., Carlson, A., Chang, K.-M., Malave, V. L., Mason, R. A., & Just, M. A. (2008b). Supporting online material, Predicting human brain activity associated with the meanings of nouns. http://science.sciencemag.org/content/suppl/2008/05/28/320.5880.1191.DC1

Murphy, B., Talukdar, P., & Mitchell, T. (2012). Selecting corpus-semantic models for neurolinguistic decoding. In Proceedings of the first joint conference on lexical and computational semantics, volume 1: Proceedings of the main conference and the shared task, and volume 2: Proceedings of the sixth international workshop on semantic evaluation (pp. 114–123).

Ogawa, S., Lee, T.-M., Kay, A. R., & Tank, D. W. (1990). Brain magnetic reson-ance imaging with contrast dependent on blood oxygenation. Proceedings of the National Academy of Sciences, 87 (24), 9868–9872.

Pennington, J., Socher, R., & Manning, C. D. (2014). Glove: Global vectors for word representation. In Emnlp (Vol. 14, pp. 1532–1543).


Polyn, S. M., Natu, V. S., Cohen, J. D., & Norman, K. A. (2005). Category-specific cortical activity precedes retrieval during memory search. Science, 310 (5756), 1963–1966.

Roget, P. M. (1911). Roget’s thesaurus of english words and phrases... TY Crowell Company.

Socher, R., Perelygin, A., Wu, J. Y., Chuang, J., Manning, C. D., Ng, A. Y., . . . others (2013). Recursive deep models for semantic compositionality over a sentiment tree-bank. In Proceedings of the conference on empirical methods in natural language processing (emnlp) (Vol. 1631, p. 1642).

Wehbe, L., Murphy, B., Talukdar, P., Fyshe, A., Ramdas, A., & Mitchell, T. (2014). Simultaneously uncovering the patterns of brain regions involved in different story reading subprocesses. PloS one, 9 (11), e112575.
