
Combining experiential and distributional semantic data to predict neural activity patterns

June 29, 2017
Student: Max Mijnheer (10762302)
Supervisors: Samira Abnar, Jelle Zuidema
Lecturer: Sander van Splunter

Abstract

The way the brain represents meaning has been studied widely. One way of studying it is to use a machine-trained model to predict the neural activity patterns associated with thinking about a word. If the semantic representations used in the model predict accurately, we can learn something about the representation of meaning in the brain. This method has been applied to several semantic representations, including a distributional and an experiential one. Following a suggestion in previous work, in the current study these two semantic representations were combined and tested in order to come closer to understanding human semantic representation. While simply combining the feature vectors did not lead to better results, the results suggest the possibility of a more complete model.


Contents

1 Introduction
2 Related work
   2.1 Predicting neural activity
   2.2 Distributional models
   2.3 Experiential models
3 Method
   3.1 BAP
   3.2 Evaluation
4 Experiments and results
   4.1 fMRI data
   4.2 Data for the dependency parse
   4.3 Data for the experiential model
   4.4 Experimental setup
   4.5 Results
5 Discussion
6 Conclusion
7 Future work
References

1 Introduction

Understanding how the human brain functions has been a field of interest for centuries. Because of the enormous complexity of the brain, research has been focused on small parts of its functionality. Natural language processing, the field of research concerned with building machines that can understand and produce language, has been studied thoroughly, partly because the ability to understand and produce language is unique to the human brain. If we look at how computers understand and model language, we might learn something about how the human brain does this.

We can learn about the way the brain represents meaning by using a computational model. Whereas traditional neuroscience investigates small parts of the brain in order to learn something about their functionality, computational models can learn complex relations between different parts of the brain. Computational models therefore lend themselves well to studying the widely distributed representation of meaning in the brain. Multiple word representation methods, as used in natural language processing, have been used to learn something about the representation of meaning in the brain. In the current study, two of those methods are combined in order to come closer to an understanding of how the brain captures meaning.

Computational representations of the meaning of a word can be learned through different techniques. The techniques considered in the current study try to capture the meaning of a word through its relation with other words. Words are converted to multi-dimensional vectors based on multiple features. The position of the vector in the multi-dimensional space determines its meaning: the closer two vectors are in this space, the more related their meanings are. For instance, the vectors of the words 'chair' and 'couch' would be closer to each other than those of 'chair' and 'lightbulb', since their meanings are more related.
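As a toy illustration, the following Python snippet computes the cosine similarity that is commonly used to compare such vectors. The three-dimensional vectors are invented purely for illustration; real semantic vectors have hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: 1.0 means identical direction.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy vectors, invented for illustration only.
chair = np.array([0.9, 0.8, 0.1])
couch = np.array([0.8, 0.9, 0.2])
lightbulb = np.array([0.1, 0.2, 0.9])

print(cosine_similarity(chair, couch))      # high: related meanings
print(cosine_similarity(chair, lightbulb))  # low: unrelated meanings
```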

Two ways of learning semantic representations can be distinguished. The first method makes use of experiential features, while the second uses distributional features. Experiential features focus on how a word, or the concept behind it, is experienced; this includes sensory, spatial and motor features of experience. These features are hand-picked and subsequently rated per feature and per word, which is a very time-consuming process, since every feature needs to be rated by hand. Another disadvantage of experiential features is that the data needs to be crowdsourced, which is a costly procedure as well. The second method makes use of the distributional data of large text corpora. A distributional method looks at how a word co-occurs with other words in the text and builds a multi-dimensional vector from this. Various methods of looking at the distribution of words exist, and these methods all create different vectors.

To see how well these semantic models correspond to how the brain represents meaning, a predictive model can be trained that predicts brain activation data from these different semantic models. The idea behind this is that the more predictive data is captured in a semantic model, the more it is in line with how the brain captures semantics. In the first study of this kind, Mitchell et al. (2008) did exactly this. The semantic model used there was based on the co-occurrence of a word with 25 hand-picked words in a large text corpus. Following this research, others applied the same technique to a wide range of semantic models. For example, Murphy et al. (2012) used multiple corpus-based models and showed that these models all achieved significant results on the task and sometimes even outperformed hand-picked models. Further research by Anderson et al. (2016) showed that a 65-feature neurobiologically inspired model performed even better on this task. This model reflects sensory, spatial, motor and other neural systems of the brain (Binder et al., 2016). For example, a concept that is edible will score higher on the attribute taste than one that is not. An advantage of an experiential model over a distributional model is that it is clear which attributes are important in the conceptual representation in the brain. Therefore, alignments can be found between specific features and the associated parts of the brain. A distributional model, by contrast, only gives abstract features that are hard to interpret in terms of neural features.

Most previous research used and tested these hand-picked experiential and distributional models separately. Research by Andrews et al. (2009) suggests that these models come closer to actual human semantic representation when combined. It is argued by Vigliocco et al. (2009) that word meanings are learned in two distinct ways: through experiential information and through linguistic information. Both types of data capture different aspects of semantics, and the combination of both should therefore be more complete and closer to the neural representation of semantics. This hypothesis will be tested by combining experiential models with distributional (linguistic) models. On this combined data a model will be trained to predict neural activity for unseen words. Comparisons with other models will be made to test the validity of the hypothesis.

2 Related work

2.1 Predicting neural activity

As mentioned before, in the first study of this kind, Mitchell et al. (2008) demonstrated that neural activity patterns associated with thinking about a concept could be predicted by a model trained on semantic representations of other concepts. In their study, 9 participants read 60 words 6 times each in an fMRI scanner while their brain activation patterns were recorded. To represent the meaning of a word as a vector, Mitchell et al. (2008) used the co-occurrence of the word with 25 hand-picked other words in a large text corpus. A linear model was trained to map the relation between these semantic representations and their corresponding neural activity patterns. Testing on concepts that were left out of the training set, it was demonstrated that neural activity patterns could be predicted with significant accuracy. This method was subsequently used by many others to test different semantic models in the same manner (Murphy et al., 2012; Anderson et al., 2016; Anderson et al., 2017). The same method and neural activity data will be used in the current study.


2.2 Distributional models

Instead of using a manually constructed model, Murphy et al. (2012) demonstrated that corpus-derived models achieved similar and sometimes even better results on the prediction task. These models do not require any form of manual intervention and are therefore applicable to a wider domain of words. A variety of distributional semantic models was tested by Murphy et al. (2012), including a dependency parse. Whereas most distributional models look exclusively at the linear context of a word, the dependency parse also includes the grammatical structure of a sentence. This dependency parse achieved the highest accuracy on the neural activity prediction task, and a similar parse will therefore be used in this study as well.

One of the state-of-the-art distributional methods is skip-gram with negative sampling (Mikolov et al., 2013). This model uses machine learning techniques to create word representations based on the linear context of a word, i.e. the words that precede and follow it. The method is very efficient and works well with large text corpora. While it is currently widely used in word representation tasks, linear context on its own might not be enough to capture the full semantics of a word. As Levy and Goldberg (2014) demonstrated, using syntactic context combined with linear context leads to more functional semantic representations. These syntactic contexts are derived from automatically produced dependency parse trees. The dependency model that was created with these syntactic contexts will be used in the current study.
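A minimal sketch of how such pretrained vectors could be loaded, assuming the plain-text word2vec-style format of the Levy and Goldberg release, in which each line holds a word followed by its 300 vector values; the file name 'deps.words' is an assumption about the downloaded archive.

```python
import numpy as np

def load_embeddings(path):
    # Read lines of the form 'word v1 v2 ... v300' into a {word: vector} dict.
    vectors = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            vectors[parts[0]] = np.array(parts[1:], dtype=np.float32)
    return vectors

embeddings = load_embeddings('deps.words')  # assumed file name
print(embeddings['chair'].shape)            # expected: (300,)
```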

2.3 Experiential models

While research has shown that distributional models are very effective on the neural activity prediction task, Anderson et al. (2016) have demonstrated that a more neurobiologically motivated model yields good results on the task as well. Specifically, they demonstrated that this kind of model could be used to predict neural activity patterns for unseen sentences, i.e. sentences that were not used in training the predictive model. This experiential model focuses on the distinctive neural systems of experience, as described by Binder et al. (2016). Sixty-five features of how a word or concept can be experienced are manually rated through crowdsourcing software. For instance, scores are given to how pleasant or unpleasant the experience of a concept is. The average values of these ratings are then put in a vector that serves as a model of word meaning. An overview of all the different features can be found in table 1.

The main advantage of using a model with manually picked features, in this case based on the neural systems of experience, is that you know exactly what all the features mean. This is not the case for distributional models, where the features are much harder to interpret. It means that after training the model, you can analyze exactly how much each feature contributed to a correct or wrong prediction. Through such an analysis, Anderson et al. (2016) discovered that semantic information is widely distributed over the brain and that a wide range of neural systems contribute to the meaning of words.


Table 1: 65 experiential features, grouped by dominant modality (Anderson et al., 2016)

Vision: vision, bright, dark, color, pattern, large, small, motion, biomotion, fast, slow, shape, complexity, face, body
Auditory: audition, loud, low, high, sound, music, speech
Somatosensory: touch, temperature, texture, weight, pain
Gustatory+Smell: taste, smell
Motor: head, upper limb, lower limb, practice
Attention: attention, arousal
Event: duration, long, short, caused, consequential, social, time
Evaluation: benefit, harm, pleasant, unpleasant
Cognition: human, communication, self, cognition, number
Emotion: happy, sad, angry, disgusted, fearful, surprised
Drive: drive, needs
Spatial: landmark, path, scene, near, toward, away

3 Method

To answer the question of how compatible a computational representation of a word is with its mental representation, a single-layer fully connected neural network is used that predicts the neural activity patterns associated with thinking about a word, given its semantic representation. The more accurate the model is in predicting neural activity patterns, the more the computational representation of a word is in line with its neural representation. The model used here is called BAP, short for Brain Activation Predictor.

3.1 BAP

In BAP, both word representations and brain activation patterns need to be presented as n-dimensional vectors, so that they can be treated as the input and output of the neural network. In the case of the neural activity patterns, each fMRI scan is flattened into a vector. The fMRI scans contain activation values for different voxels in the brain. A voxel is a 3-dimensional pixel of the brain that represents a tiny area of brain tissue and its activation level during the brain scan. By flattening the fMRI scans into a vector, the spatial relationship among voxels is ignored and each element in the vector corresponds to the activation of a particular voxel. The spatial locations of the voxels are saved for further analysis.
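A minimal sketch of this flattening step, using an invented scan shape; the only convention that matters is that element i of the vector is the activation of voxel i, with the (x, y, z) coordinates stored alongside.

```python
import numpy as np

# Hypothetical 3-D scan of activation values; the shape is illustrative only.
scan = np.random.randn(51, 61, 23)

# Flatten to a 1-D vector; element i corresponds to one voxel.
activation_vector = scan.ravel()

# Save the (x, y, z) location of every vector element (same C order as ravel)
# so the spatial structure can be recovered later for analysis.
voxel_coordinates = np.array(list(np.ndindex(scan.shape)))
print(activation_vector.shape, voxel_coordinates.shape)
```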

The number of voxels in each fMRI scan is around 20,000, and the activation of most of them is not related to processing the meaning of the noun. Considering these superfluous voxels in our task would only increase the computational power and time needed. Therefore, in the same manner as described by Mitchell et al. (2008), dimensionality reduction is applied to the brain activation vector to reduce it to 500 voxels. This reduction is done in order to speed up the computational process and remove outliers from the data.
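A rough sketch of how this voxel selection could be computed, assuming the stability criterion of Mitchell et al. (2008): each voxel is scored by how consistent its activation profile over the words is across the repeated presentations, and the highest-scoring voxels are kept. Array shapes are illustrative.

```python
import numpy as np
from itertools import combinations

def stability_scores(data):
    # data: shape (presentations, words, voxels). Score each voxel by the
    # mean pairwise correlation of its per-presentation activation profiles.
    n_pres, n_words, n_voxels = data.shape
    scores = np.zeros(n_voxels)
    for v in range(n_voxels):
        profiles = data[:, :, v]  # (presentations, words)
        corrs = [np.corrcoef(profiles[i], profiles[j])[0, 1]
                 for i, j in combinations(range(n_pres), 2)]
        scores[v] = np.mean(corrs)
    return scores

# Illustrative data: 6 presentations, 60 words, 2000 voxels (the real
# scans have around 20,000 voxels).
data = np.random.randn(6, 60, 2000)
scores = stability_scores(data)
stable_idx = np.argsort(scores)[-500:]  # indices of the 500 most stable voxels
```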

A model is trained to map the relation between the semantic representation of a word and the associated neural activity patterns. This training is done with a shallow neural network that consists of one layer with a sigmoid activation function. The model is given the semantic representation vectors as input data and the neural activity patterns as output data. Through training, weights are formed between the feature and voxel values. A schematic overview of the model can be found in figure 1. This method is comparable to the method of previous research on this task (Mitchell et al., 2008; Murphy et al., 2012).

Figure 1: Schematic overview of the Brain Activation Predictor model.
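A minimal Keras sketch of such a single-layer network. The loss function, optimizer, number of epochs and feature count below are assumptions for illustration, not values taken from this study; note that the sigmoid output requires the target voxel activations to be scaled into (0, 1).

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

n_features = 365  # e.g. 65 experiential + 300 dependency features (assumed)
n_voxels = 500    # the selected stable voxels

# One fully connected layer with a sigmoid activation, mapping the semantic
# feature vector directly onto the voxel activation vector.
model = Sequential()
model.add(Dense(n_voxels, input_dim=n_features, activation='sigmoid'))
model.compile(optimizer='adam', loss='mse')

# Random stand-ins for the real word vectors and (rescaled) voxel activations.
X = np.random.rand(26, n_features)  # 26 training words
y = np.random.rand(26, n_voxels)
model.fit(X, y, epochs=100, verbose=0)
```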

3.2 Evaluation

To evaluate how much predictive data is in either the separate or the combined models, two words are left out of the training set every time the model is trained. Subsequently, the expected brain activation patterns for these two words are calculated with the trained model. Finally, based on the cosine similarity between the vectors, the predicted brain activation values and the actual values of the two words are matched to each other. A score of 1 is given if this match is correct and a score of 0 if it is wrong. This process is repeated for all possible combinations of two left-out words, leading to an accuracy score between 0 and 1. Since guessing at chance level would eventually lead to an accuracy score of 0.5, the models can be compared to each other based on how much higher than chance level they score.
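The evaluation loop can be sketched as follows; `train_fn` is a hypothetical placeholder for whatever routine fits the predictive model on the remaining words.

```python
import numpy as np
from itertools import combinations

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def match_score(pred1, pred2, true1, true2):
    # 1 if the correct pairing of predictions to observed scans is more
    # similar (by total cosine similarity) than the swapped pairing.
    correct = cosine(pred1, true1) + cosine(pred2, true2)
    swapped = cosine(pred1, true2) + cosine(pred2, true1)
    return 1 if correct > swapped else 0

def leave_two_out_accuracy(train_fn, X, Y):
    # X: semantic vectors, Y: voxel activation vectors, one row per word.
    n = len(X)
    scores = []
    for i, j in combinations(range(n), 2):
        keep = [k for k in range(n) if k not in (i, j)]
        model = train_fn(X[keep], Y[keep])
        pred = model.predict(X[[i, j]])
        scores.append(match_score(pred[0], pred[1], Y[i], Y[j]))
    return np.mean(scores)
```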


4 Experiments and results

4.1 fMRI data

The data set used here is from Mitchell et al. (2008) and is publicly available.1 Functional MRI data was gathered from 9 participants while they viewed the stimuli. The stimuli consisted of 60 distinct nouns with corresponding line drawings. Each stimulus was displayed six times for three seconds in random order, for a total of 360 fMRI images per participant. The fMRI images were recorded with a 3.0T scanner at 1-second intervals, with a spatial resolution of 3 x 3 x 6 mm. Subsequently, the data was pre-processed with the SPM package (Penny et al., 2011). An approximation of the blood-oxygen-level response was made by averaging over the fMRI images in each trial. Finally, to select only the parts of the brain that overlap with the cortex, an anatomical mask was applied. This pre-processing resulted in approximately 20,000 features per participant per image.

4.2 Data for the dependency parse

The trained word representations for the dependency model used here are from Levy and Goldberg (2014). This data set is publicly available to experiment with.2 Instead of creating word representations from linear (bag-of-words) context alone, this parse uses context that is both linear and syntactic. A dependency parse of the English Wikipedia was created with an implementation of the parser described in Goldberg and Nivre (2012). In combination with the bag-of-words method, this resulted in feature vectors of 300 dimensions per word. These feature vectors are used for the words that are experimented with in this study.

4.3 Data for the experiential model

Data for the 65-feature experiential model was provided by Dr. Andrew Anderson, who used this data in previous research in which he and his colleagues predicted neural activity patterns of sentences based on the 65-feature experiential model (Anderson et al., 2016). Using the crowdsourcing platform Amazon Mechanical Turk, attribute ratings on 65 features were collected for 242 words. For each attribute, a worker was asked to rate, on a scale from 0 to 6, how much that attribute was associated with the word. Approximately 30 complete ratings were collected for each word. These ratings were averaged and outliers were removed. Some values were missing in the current data set; these were replaced by the average value of that feature. Not all of the 60 nouns for which the fMRI patterns are known were present in this data set. Both the brain activation data and the rated 65 features were available for 28 of the 60 nouns.
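The mean imputation step can be sketched as follows, with an invented miniature ratings matrix standing in for the real word-by-attribute data.

```python
import numpy as np

# Hypothetical ratings: rows are words, columns are attributes; np.nan
# marks a missing rating.
ratings = np.array([[3.0, np.nan, 5.0],
                    [2.0, 4.0,    np.nan],
                    [1.0, 6.0,    4.0]])

# Replace each missing value with the mean of its attribute (column),
# computed over the words for which a rating is available.
feature_means = np.nanmean(ratings, axis=0)
filled = np.where(np.isnan(ratings), feature_means, ratings)
print(filled)
```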

1 http://www.cs.cmu.edu/afs/cs/project/theo-73/www/science2008/data.html

2 https://levyomer.wordpress.com/2014/04/25/


4.4 Experimental setup

The single-layer neural network is implemented in Python 3.4.3 on an Ubuntu operating system. The open-source Keras library is used to create the trained networks. Firstly, the 500 most stable voxels are selected, in the same manner as described in Mitchell et al. (2008). This is done in order to speed up the computational process (500 output points instead of ±20,000) and to remove outliers from the data. To verify that 500 was the right number of voxels for this task, a plot of the stability scores was made, which can be seen in figure 2. Since it shows no clear threshold, experiments were done with increasing numbers of voxels. As these experiments resulted in no significant change in accuracy and only made the computation heavier, a selection of 500 voxels was used throughout the experiments. Secondly, the predictive model is trained with the vectors of semantic features as input data and the vectors of the 500 most stable voxel activation values as output. The activation layer uses the sigmoid function to normalize the data. The weights in this model are modified after each round of training in order to create the predictive model. The predictive models are first trained on the experiential and distributional vectors separately, and subsequently on the combined vectors of both semantic models. The combination of both models is done by concatenating the feature vectors.
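The combination step itself is a plain concatenation of the two feature vectors per word; a sketch with the dimensions used in this study (random stand-ins for the real vectors):

```python
import numpy as np

experiential = np.random.rand(28, 65)  # 28 words x 65 experiential attributes
dependency = np.random.rand(28, 300)   # 28 words x 300 dependency dimensions

# Concatenate per word to obtain the combined 365-dimensional model.
combined = np.concatenate([experiential, dependency], axis=1)
print(combined.shape)  # (28, 365)
```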


4.5 Results

The predictive models were trained, for all participants, on the separate experiential and distributional semantic features and on the combination of both. Furthermore, models were also trained on the 25-feature vectors used by Mitchell et al. (2008) for comparison. The predictive models were evaluated with the leave-two-out evaluation task described above. This evaluation led to the accuracy scores shown in figure 3; the average accuracy scores for the models are 0.69, 0.68, 0.66 and 0.64, respectively. To begin with, as can be seen in figure 3, accuracy scores per model vary considerably across participants. This suggests that the brain activation patterns associated with thinking about a word differ so much between persons that no model is generally sufficient. Furthermore, combining the experiential model with the distributional model does not lead to a higher average accuracy on this task, contrary to what was suggested by Anderson et al. (2016). The accuracy of the combined model is lower than that of the best-performing separate model for all participants; for a further analysis of this result, see section 5. Finally, the average accuracy score for the 25-feature vectors is significantly lower when trained on 26 instead of 58 words: 0.64 instead of the 0.77 reported by Mitchell et al. (2008).

Figure 3: Accuracy scores for all trained models per participant.

Although simply combining the models did not lead to better predictions, an analysis was made to see whether the models capture different characteristics of the words that are subsequently reflected in different locations in the brain. Analyzing which voxels were predicted most accurately by each separate model, no overlap was found between the 50 best-predicted voxels of the two models. Figure 4 shows the locations of these 50 best-predicted voxels in the brain for both models. This demonstrates that the experiential and the dependency model predict best for different voxels, i.e. different parts of the brain, which means that the two models capture semantic patterns that are reflected in different locations in the brain.
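A sketch of how such an overlap analysis could be carried out. The per-voxel accuracy metric below, the correlation between predicted and observed activation across the test words, is one plausible choice; the exact metric used is not specified here, so treat it as an assumption.

```python
import numpy as np

def per_voxel_accuracy(pred, true):
    # Correlation between predicted and observed activation for each voxel,
    # computed over the test words.
    return np.array([np.corrcoef(pred[:, v], true[:, v])[0, 1]
                     for v in range(pred.shape[1])])

# Hypothetical predictions of the two models over 28 words and 500 voxels.
pred_dep = np.random.rand(28, 500)
pred_exp = np.random.rand(28, 500)
observed = np.random.rand(28, 500)

top_dep = set(np.argsort(per_voxel_accuracy(pred_dep, observed))[-50:])
top_exp = set(np.argsort(per_voxel_accuracy(pred_exp, observed))[-50:])
print(len(top_dep & top_exp))  # size of the overlap between the top-50 sets
```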

Figure 4: 50 best-predicted voxels for the dependency model (black) and the experiential model (orange).

To analyze the difference between the predictive models at the word level, a count was made of all the pairs of words that were matched wrongly. An overview of all the counts can be found in figure 5. For instance, the combination of the words 'chair' and 'church' is always predicted correctly by the experiential model and wrongly by the dependency model. Contrarily, the combination of the words 'ant' and 'dog' is always predicted correctly by the dependency model and almost never by the experiential model. For most of the remaining word combinations only a small difference in the number of mistakes can be seen. Again, this suggests that the two models capture different semantic information and different neural patterns.

Another interesting result is that of all the wrongly matched combinations of words, only 11.4 percent of the mistakes involved two words from the same category. One would expect the matching task to be harder for words in the same category, since they have similar semantic representations.

5 Discussion

The main result of the analysis described above is that the experiential and the dependency model capture different semantic information and map to different neural patterns. While this suggests that a combination of the models should be more complete and obtain better results, the current experiments do not support this. The reason the accuracy scores of the combined model are always lower in the current experiments is most likely that the model becomes much more complicated when the number of features is increased. While the model does have more information, the neural network has twice as many features to train on, which significantly increases the chance of errors in the weights. The amount of training data stays the same, and this increases the chance of the model overfitting. Consequently, the model performs worse on unseen data. This could be resolved by gathering more training data and thus decreasing the chance of overfitting.

Overfitting is a general problem in the current study, since only 28 of the 60 available nouns could be used. For instance, the average accuracy score on the 25-feature word vectors used by Mitchell et al. (2008) is significantly lower when only 28 words are used. Because the range of semantic information is significantly smaller when training on 26 words instead of 58, the chance of overfitting becomes substantially higher. This could be countered by using a bigger set of distinct words, and thus a wider range of semantic information.


Figure 5: The number of mistakes made for all combinations of words in the leave-two-out task. (a) Dependency model. (b) Experiential model. (c) Difference between mistakes made.


Another result that stands out is the difference in model performance per participant. In contrast to participants 1 and 4, participants 2 and 6 score remarkably low, i.e. around chance level. On the one hand, these low scores might be due to noisy data or differences in how the participants did the experiment; for instance, someone could have moved a lot during the trial or thought about the concepts in a completely different way. On the other hand, it could mean that their neural activity patterns are completely different and are not captured at all by the semantic models used here. Differences like this between participants are not rare in this research area: using a different data set, Anderson et al. (2017) recently found similar differences in results on this task. This suggests a fundamental difference in semantic representation between human brains. Consequently, a general model that fits every human brain might not be achievable.

6 Conclusion

Firstly, after analysis of the results, it can be concluded that different kinds of information are captured by the distributional and the experiential model. Although the models predict best on different parts of the brain, i.e. different voxels, simply concatenating the feature vectors does not lead to better results on the task. The failure of simply combining the vectors is likely due to overfitting of the learning algorithm: while the amount of training data stays the same, the number of features doubles. This lack of training data, combined with the fact that data is only available for 28 nouns, is likely to result in overfitting. The range of semantic information is much smaller when 28 words are used instead of the 60 words used by Mitchell et al. (2008). Consequently, overfitting is more likely to occur and worse results are expected on the leave-two-out task.

Secondly, the results show a lot of variance in model performance per participant. This variance demonstrates that none of these models is sufficient to describe the way the human brain represents semantics. The representation of semantics differs so much between participants that the models obtain completely different results per participant. This suggests a fundamental difference in semantic representation between human brains, and thus a general best model for the current task might not be found.

7 Future work

In the current study, an attempt was made to create a model that combines two distinct types of semantic information in order to predict neural activity patterns more accurately. Although simply concatenating the feature vectors did not result in a more accurate model, the results suggest the possibility of a more complete model using both types of data. To improve the performance of the model and create a model that is more compatible with human semantic representation, several improvements can be made.

Firstly, increasing the amount of training data could prove useful to counter overfitting of the computational models. If fMRI data of more participants and a wider range of words and concepts is gathered, a more general model can be created. Furthermore, more data would also open up the possibility of using more complex machine learning models. For instance, a deep neural network could be used to learn more complex relations between voxel activation values.

Secondly, regarding the different kinds of computational representations that can be used for this task, a visually grounded model could be added alongside the experiential and distributional models. Anderson et al. (2017) demonstrated that a model using deep convolutional neural networks trained on Google Images can be used on this task. Conceivably, such a model would be more complete, since visual information is another form of information that is processed in the brain when thinking about different concepts.

Lastly, a computational model could be made that selects which semantic model to use per voxel. Since it was shown that the two semantic models capture different information and predict best for different voxels, such voxel-wise model selection is likely to result in better performance on the brain activation prediction task. Although this combination will probably give better results per participant, these results will probably overfit heavily per participant. To counter this, a model could be trained on group-level brain activation. In the same manner as described in Anderson et al. (2017), an average image of participants' brain activation can be made based on the correlation between voxel activations. Simply averaging over fMRI data does not work, since every brain is anatomically different. Using this averaged image of participants' brain activation, a model could be created that applies to the general human brain.
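A rough sketch of the voxel-wise selection idea: for every voxel, keep the prediction of whichever model is more accurate. In practice the per-voxel choice would have to be made on separate validation data, not on the test data itself, to avoid the overfitting discussed above.

```python
import numpy as np

def select_per_voxel(pred_a, pred_b, true):
    # Per-voxel mean squared error of each model, computed over held-out words.
    err_a = np.mean((pred_a - true) ** 2, axis=0)
    err_b = np.mean((pred_b - true) ** 2, axis=0)
    # Keep, for each voxel, the prediction of the more accurate model.
    return np.where(err_a <= err_b, pred_a, pred_b)

# Hypothetical held-out predictions from the two models (10 words, 500 voxels).
pred_dep = np.random.rand(10, 500)
pred_exp = np.random.rand(10, 500)
observed = np.random.rand(10, 500)
hybrid = select_per_voxel(pred_dep, pred_exp, observed)
```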


References

Anderson, A. J., Binder, J. R., Fernandino, L., Humphries, C. J., Conant, L. L., Aguilar, M., Wang, X., Doko, D., and Raizada, R. D. (2016). Predicting neural activity patterns associated with sentences using a neurobiologically motivated model of semantic representation. Cerebral Cortex.

Anderson, A. J., Kiela, D., Clark, S., and Poesio, M. (2017). Visually grounded and textual semantic models differentially decode brain activity associated with concrete and abstract nouns. Association for Computational Linguistics.

Andrews, M., Vigliocco, G., and Vinson, D. (2009). Integrating experiential and distributional data to learn semantic representations. Psychological Review, 116(3):463.

Binder, J. R., Conant, L. L., Humphries, C. J., Fernandino, L., Simons, S. B., Aguilar, M., and Desai, R. H. (2016). Toward a brain-based componential semantic representation. Cognitive Neuropsychology, 33(3-4):130–174.

Goldberg, Y. and Nivre, J. (2012). A dynamic oracle for arc-eager dependency parsing. In COLING, pages 959–976.

Levy, O. and Goldberg, Y. (2014). Dependency-based word embeddings. In ACL (2), pages 302–308.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Mitchell, T. M., Shinkareva, S. V., Carlson, A., Chang, K.-M., Malave, V. L., Mason, R. A., and Just, M. A. (2008). Predicting human brain activity associated with the meanings of nouns. Science, 320(5880):1191–1195.

Murphy, B., Talukdar, P., and Mitchell, T. (2012). Selecting corpus-semantic models for neurolinguistic decoding. In Proceedings of the First Joint Conference on Lexical and Computational Semantics, Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, pages 114–123. Association for Computational Linguistics.

Penny, W. D., Friston, K. J., Ashburner, J. T., Kiebel, S. J., and Nichols, T. E. (2011). Statistical Parametric Mapping: The Analysis of Functional Brain Images. Academic Press.

Vigliocco, G., Meteyard, L., Andrews, M., and Kousta, S. (2009). Toward a theory of semantic representation. Language and Cognition, 1(2):219–247.
