A multimodal approach to finding form-meaning systematicity

Arie Soeteman (10060565)
Bachelor thesis, 18 EC
Bachelor Opleiding Kunstmatige Intelligentie
Faculty of Science, University of Amsterdam
Science Park 904, 1098 XH Amsterdam

Supervisor: dr. E. Shutova
Institute for Language and Logic
Faculty of Science, University of Amsterdam
Science Park 904, 1098 XH Amsterdam


Abstract

Form-meaning systematicity refers to structural ties between words and their meanings. I investigate the structure and significance of systematicity using a Kernel Regression framework. I evaluate correlations between form and meaning over a lexicon of English monomorphemes, and take the predictability of semantic vectors from strings as a measure of systematicity for individual words. I base my semantic vectors on linguistic as well as visual data, making use of Skip-gram with Negative Sampling and Convolutional Neural Network extraction. I combine these monomodal vectors into a multimodal model using scoring-level fusion, concatenation, and Neural Network extraction. My findings corroborate the existence of form-meaning systematicity and show that this systematicity is concentrated in localized clusters rather than diffusely distributed throughout the English language. Furthermore, I show the value of applying a multimodal approach to finding form-meaning systematicity using quantitative as well as qualitative analyses, and I provide starting points for further multimodal approaches to finding form-meaning systematicity.


Contents

1 Introduction: Arbitrariness in Language

2 Literature Review
  2.1 Data-driven approaches to finding form-meaning systematicity
  2.2 Distributional Semantic Models
    2.2.1 Text based models
      2.2.1.1 Skip-gram
      2.2.1.2 Negative Sampling
    2.2.2 Image based models
      2.2.2.1 Bag of Visual Words
      2.2.2.2 Convolutional Neural Networks
    2.2.3 Constructing a Multimodal Model

3 Data
  3.1 Lexicon
  3.2 Linguistic and Visual Corpora
  3.3 Concreteness ratings

4 Methods
  4.1 String Metric Learning for Kernel Regression
  4.2 Linguistic, Visual and Multimodal semantic representations
    4.2.1 Linguistic Semantic Vectors
    4.2.2 Visual Semantic Vectors
    4.2.3 Multimodal Semantic Vectors
    4.2.4 t-SNE

5 Results
  5.1 Lexicon-wide Correlations between form and meaning
  5.2 Phonaesthemes: quantitative and qualitative analysis
  5.3 Concreteness
  5.4 t-SNE Findings
  5.5 Weight Analysis

6 Conclusion

1 Introduction: Arbitrariness in Language

Linguistic arbitrariness refers to the notion that there are no structural ties between words and their meanings. The concept of arbitrariness has been at the foundation of numerous linguistic theories ever since its introduction into the field of linguistics by Ferdinand de Saussure in 1916 (De Saussure and Baskin, 1916). In ‘The Origin of Speech’, for example, Charles Hockett (Hockett and Hockett, 1960) argues that arbitrariness is a fundamental and universal feature of human language:

“The word ‘salt’ is not salty nor granular; ‘dog’ is not canine; ‘whale’ is a small word for a large object; ‘microorganism’ is the reverse.”

According to Hockett, arbitrariness distinguishes human language from the communicative systems used by other animals. He even considered arbitrariness a necessary condition for the extensive and flexible communication that characterizes language and human communication (Hockett and Hockett, 1960). In 2004, Michael Gasser introduced a formalized notion of iconicity as the existence of systematic relations between form and meaning. In his framework, a language is arbitrary when it lacks such systematic relations (Gasser, 2004). Gasser simulated human language acquisition by applying learning algorithms to formalized languages. He argued that iconicity can facilitate language acquisition, since structural relations between form and meaning reduce the amount of information required to understand a language. However, as the number of form-meaning pairs in a language increases, iconicity limits the number of forms that can be assigned to a specific meaning. This restrictive structure limits linguistic flexibility and unavoidably leads to ambiguity, which hinders language acquisition. Based on formal simulations, Gasser therefore reached the same conclusion as Hockett: arbitrariness is a fundamental property of flexible and expressive languages.

While Gasser thought of arbitrariness as an essential feature of language, he did note several structural relations between form and meaning that can be found in Japanese, Tamil and Zulu. Indeed, linguists have long noted exceptions to arbitrariness. Word forms can be transparently motivated by their semantic referents. For instance, vowels with high acoustic frequency tend to be associated with smaller items, while vowels with low acoustic frequency are associated with larger items (Ohala, 1984). Another example is onomatopoeia: words (e.g. ‘bark’, ‘cling’, ‘clang’, ‘slurp’) that directly echo the sound of their referent. Such direct relations between words and perceptual features of their referents are generally referred to as iconicity (Gutiérrez et al., 2016; Gasser, 2004).

Another phenomenon that provides counterevidence against the ubiquity of arbitrariness in language is the existence of phonaesthemes. A phonaestheme is a phonetic cluster that occurs in words with related meanings. For instance, numerous words that start with ‘sn-’ are related to the nose (e.g. ‘snore’, ‘snorkel’, ‘sniff’, ‘snout’, ‘snot’). Otis and Sagi (2008) found that the existence of 27 of these phonaesthemes in the English language can be statistically confirmed, while research into the psychological reality of phonaesthemes suggested that native English speakers perceive as many as 46 phonaesthemes (Hutchins, 1999).

While these studies point towards localized clusters of non-arbitrariness, they do not address the role of arbitrariness in a language as a whole. After all, some systematic clusters can be expected to emerge by chance in arbitrary languages (Gutiérrez et al., 2016). Shillcock et al. (2001) first developed an approach to gain insight into the significance of systematicity within a language. By analyzing the phonological and semantic distances between English words, they found that a small but statistically significant correlation exists between these two distances. In addition, Monaghan et al. (2014) found that this correlation between word form and meaning is robust to different choices of distance measures for phonological and semantic distances. While these studies support the existence of non-arbitrariness in language, they find this systematicity to be small and diffusely distributed across languages:

“the observed systematicity . . . is not a consequence only of small pockets of sound symbolism, but is rather a feature of the mappings from sound to meaning across the vocabulary as a whole” (Monaghan et al., 2014).

This notion of diffusely distributed non-arbitrariness seems incompatible with the existence of phonaesthemes, onomatopoeia, and other localized clusters of systematicity that has been corroborated by linguistic research (Gasser, 2004; Ohala, 1984).

In 2016, Gutiérrez et al. developed an approach to reconcile the findings of these localized studies with corpus-wide analyses of non-arbitrariness (Gutiérrez et al., 2016). They trained a model that predicts semantic vectors for words based on their edit distances to other words in a lexicon. By minimizing the error of this model, they optimized weights for string edits, representing the notion that different string mutations can have different semantic significance. Using this optimized distance metric, new insight was obtained into the manifestation of non-arbitrariness in the English language. The results of Gutiérrez et al. support the existence of a corpus-wide correlation between word form and semantic representations. In addition, their optimized distance metric results in a substantial increase of this correlation, suggesting that string mutations indeed differ in their semantic significance. Furthermore, they conclude that systematicity is not diffusely distributed, but centered in localized phonological clusters.

They specifically point towards 10 phonaesthemes that display so much systematicity that their existence in a language with diffusely distributed non-arbitrariness is highly unlikely. This approach provides a promising basis for research into non-arbitrariness on a corpus-wide scale (Gutiérrez et al., 2016).

The semantic vectors used to represent words in this model are based solely on their usage in textual corpora. This approach is based on the distributional hypothesis, which states that words that occur in similar contexts are semantically similar (Harris, 1970; Wittgenstein, 1953). However, research into grounded cognition suggests that simulations of visual, auditory and other empirical stimuli play a significant role in language comprehension (Barsalou, 2008). For instance, Zwaan and Madden (2005) found that participants who had to classify an image after reading a sentence processed this image faster if its content was congruent with the sentence, suggesting that visual simulation plays a role in comprehending written text. Furthermore, neurological evidence shows that simulation of bodily motion plays a part in the processing, evaluating and understanding of language (Buccino et al., 2005). Even the simulation of emotional states has been shown to contribute to language comprehension (Barsalou, 2008). These findings have motivated the development of multimodal semantic models, in which vector representations of words are not merely based on their usage in textual corpora, but also on visual and auditory data.

I expand on the findings of Gutiérrez et al. (2016) by incorporating visual information. As noted earlier, multiple iconic relations have been identified between word form and visible properties of the corresponding semantic referents (Ohala, 1984; Gutiérrez et al., 2016; Gasser, 2004). Furthermore, neurological and behavioral research has shown that visual information plays a role in semantically representing words in human cognition (Barsalou, 2008; Zwaan and Madden, 2005). This suggests that by incorporating visual information in semantic vectors, more insight can be gained into the systematic relations between form and meaning that exist in the English language. The main question to be answered in this research is the following:

What conclusions can be drawn regarding the significance and structure of non-arbitrariness in the English language, based on a multimodal model that incorporates linguistic as well as visual information?

In the multimodal model that I have constructed, semantic vectors are extracted from relevant images of words as well as from their usage in textual corpora. I use these multimodal vectors to train weights for string mutations, following the approach developed by Gutiérrez et al. (2016). I then compare the resulting weighted string distances to the distances between multimodal semantic vectors to draw conclusions regarding form-meaning systematicity in the English language. By following this method, I evaluate the robustness of the findings of Gutiérrez et al. Furthermore, I draw conclusions regarding the added benefit of incorporating visual data in semantic vectors. Based on linguistic as well as psychological research, I hypothesize that visual data is especially helpful for representing concrete words (West and Holcomb, 2000; Bruni et al., 2014). Another point of interest is the construction of the multimodal model itself, for which I compare and evaluate different approaches. By training a neural network on both text- and image-based representations to predict string distances, I extract multimodal vectors that are fit to the task of predicting form-meaning systematicity. I compare the resulting multimodal model to simpler multimodal models to evaluate its efficiency in finding form-meaning systematicity.

In section 2, I discuss data-driven approaches to finding form-meaning systematicity, text- and image-based distributional semantic models, and methods for constructing a multimodal model that have been implemented in past research. In section 3, I describe the data used in this project and how it has been obtained. In section 4, I describe my experimental setup for finding form-meaning systematicity in a lexicon, as well as the exact manner in which I have constructed my monomodal and multimodal models. Subsequently, I list my findings in section 5. In sections 6 and 7, I draw conclusions regarding form-meaning systematicity in the English language, and discuss potential sources of bias as well as starting points for future research.

2 Literature Review

2.1 Data-driven approaches to finding form-meaning systematicity

Data-driven approaches to finding form-meaning systematicity have only recently been developed. As noted earlier, Shillcock et al. (2001) were the first to measure form-meaning systematicity by analyzing the correlation between phonological distances and semantic distances. They found a small but statistically significant correlation. Additionally, they calculated this same correlation after omitting individual words from the dataset to determine the systematicity of individual words. The extent to which the correlation decreased after omitting an individual word was taken as a measure of that word’s systematicity. Many systematic words were found to be communicatively important.

In a subsequent study, Monaghan et al. (2014) expanded on the findings of Shillcock et al. (2001). They introduced phonological edit distance, also known as Damerau-Levenshtein distance (Yarkoni et al., 2008), as a phonological distance metric. In addition, they used WordNet as well as distributional semantic vectors as a measure of semantic similarity. They made use of the Mantel test (Mantel, 1967) to adequately measure the magnitude and statistical significance of any found systematicity. Similar to Shillcock et al. (2001), they found a small (r = 0.016) but statistically significant correlation. Furthermore, they found comparable correlations using different distance metrics, emphasizing the robustness of the results. Using the same methods for evaluating the systematicity of individual words as Shillcock et al., they found that systematicity is not concentrated in localized clusters, but is a property of language as a whole. They did, however, find a significant negative correlation between systematicity and the age at which words are generally learned, indicating that form-meaning systematicity plays a role in the early stages of language acquisition.

The aforementioned study by Gutiérrez et al. (2016) expanded on the methods of Shillcock et al. (2001) and Gasser (2004). Instead of using phonological edit distances, they analyzed edit distances between strings. They optimized weights to represent the difference in semantic relevance between string edits. Furthermore, they proposed a new method for evaluating the systematicity of individual words: assessing the extent to which their semantic vectors can be predicted based on their strings. Edit-weight optimization increased the found correlation between word form and meaning (from r = 0.019 to r = 0.0464). Furthermore, using their new evaluation method for the systematicity of individual words, Gutiérrez et al. found numerous localized clusters in which systematicity is concentrated.

2.2 Distributional Semantic Models

Numerous earlier corpus-wide studies on form-meaning systematicity have made use of distributional semantic models (DSMs) (Gutiérrez et al., 2016; Monaghan et al., 2014; Bruni et al., 2014). In these models the meanings of words are approximated by vectors that represent their co-occurrences with different contexts (Turney and Pantel, 2010; Landauer and Dumais, 1997).

This approach to semantically representing words is based on the assumption that words that occur in similar contexts are semantically similar (Harris, 1970; Wittgenstein, 1953). Accordingly, distributional semantic models provide a systematic approach for computing semantic similarity between words: by comparing the relevant semantic vectors. This section provides an overview of the techniques used for constructing semantic vectors based on linguistic corpora as well as visual data.

2.2.1 Text based models

Traditional distributional semantic models base their semantic representations solely on text. These models are referred to as monomodal (Bruni et al., 2014; Kiela and Bottou, 2014). A classic implementation of such a monomodal model is the semantic space model, in which semantic vectors are constructed by analyzing the occurrences of words in a corpus. Each feature in a semantic vector represents the occurrence of the corresponding word in a specific context. These contexts can be defined in a variety of manners, such as document types or the occurrence of other words within a predefined window. More complex context definitions can take the syntactic structure in which a word occurs into account. Once the context counts for a word have been obtained, they can be transformed into scores that take context relevance into account. Usually, the weights of contexts that have a high probability of chance co-occurrence with a word are reduced. The rank of the resulting semantic vectors can then optionally be reduced using dimensionality reduction techniques, such as Principal Component Analysis or Singular Value Decomposition, to enable more efficient computation (Bruni et al., 2014).
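As an illustration of this count-transform-reduce pipeline, the sketch below builds co-occurrence counts over a word window, applies PPMI weighting (one common choice for down-weighting chance co-occurrences; the text above does not commit to a specific scheme), and reduces dimensionality with a truncated SVD. All names and parameter values are illustrative assumptions.

```python
import numpy as np

def count_dsm(sentences, window=2, k=100):
    """Count word-context co-occurrences, weight them with PPMI,
    and reduce the rank with a truncated SVD."""
    vocab = sorted({w for s in sentences for w in s})
    idx = {w: i for i, w in enumerate(vocab)}
    C = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            lo, hi = max(0, i - window), min(len(s), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    C[idx[w], idx[s[j]]] += 1.0
    # PPMI: reduce the weight of contexts likely to co-occur by chance
    total = C.sum()
    pw = C.sum(axis=1, keepdims=True) / total
    pc = C.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        ppmi = np.nan_to_num(np.maximum(np.log((C / total) / (pw * pc)), 0.0))
    # optional dimensionality reduction via truncated SVD
    U, S, _ = np.linalg.svd(ppmi, full_matrices=False)
    return vocab, U[:, :k] * S[:k]   # row i is the vector for vocab[i]
```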

While this approach is the basic framework for distributional semantic models, recent research has shown that better results can be obtained by predicting semantic vectors instead of counting contexts (Baroni et al., 2014; Goldberg and Levy, 2014). By making use of unsupervised learning, vector weights can be set to optimally predict the contexts in which words occur. In a comparative study, Baroni et al. (2014) conclude that the resulting context-predicting vectors provide better semantic representations of words than traditional context-counting vectors. Their prediction-based model outperformed counting models and other state-of-the-art models on numerous tasks, such as computing semantic relatedness and categorizing concepts. The most widely used method for creating context-predicting vectors is the Skip-gram model (Mikolov et al., 2013), for which Negative Sampling is an efficient extension. This method will now be explained in more detail.

2.2.1.1 Skip-gram. Given a set of target words w_1, w_2, ..., w_T that occur amid context words c within a predefined context window γ, the objective of the Skip-gram model is to maximize the average probability of the contexts found in the language corpus given the target words, i.e. to maximize

\[ \frac{1}{T} \prod_{t=1}^{T} \prod_{i=-\gamma}^{\gamma} p(c_i \mid w_t), \tag{1} \]

in which T is the number of target words to be predicted. Higher values for the context window γ result in more accurate word representations, at the expense of training time (Mikolov et al., 2013). The probability p(c|w) of a context word occurring within the context window of a target word is usually computed using the softmax function:

\[ p(c \mid w) = \frac{e^{v_c \cdot v_w}}{\sum_{c' \in C} e^{v_{c'} \cdot v_w}}, \tag{2} \]

where v_c and v_w are vector representations of a context word c and a target word w respectively, and C is the set of all context words in the language corpus. The Skip-gram model maximizes the average probability in equation (1) by optimizing the vector representations used in equation (2). Usually the logarithm is taken to enable more efficient computation, leading to the following objective to be maximized:

\[ \frac{1}{T} \sum_{t=1}^{T} \sum_{i=-\gamma}^{\gamma} \log p(c_i \mid w_t), \tag{3} \]

which is equal to

\[ \frac{1}{T} \sum_{t=1}^{T} \sum_{i=-\gamma}^{\gamma} \left( \log e^{v_{c_i} \cdot v_{w_t}} - \log \sum_{c' \in C} e^{v_{c'} \cdot v_{w_t}} \right). \tag{4} \]

The larger this probability is for a set of target words and a corpus in which they occur, the better the vector representations of the model predict the contexts in which these words occur. Iteratively updating the vector weights to maximize equation (4) thus results in semantic vector representations for the target words (Baroni et al., 2014; Goldberg and Levy, 2014; Mikolov et al., 2013). However, this method requires that softmax probabilities are computed for each word-context pair separately, making the required computation proportional to the total number of target words and context words in the language corpus. Since this total is often large (10^5 to 10^7 terms), extensions of the Skip-gram model have been developed that enable more efficient computation. A widely used and effective extension that is also implemented in this research is Skip-gram with Negative Sampling (Mikolov et al., 2013).

2.2.1.2 Negative Sampling. Mikolov et al. (2013) introduced Skip-gram with Negative Sampling as an efficient extension of the Skip-gram model (Goldberg and Levy, 2014). The objective for negative sampling is not to maximize the probabilities of contexts given target words, but to maximize the probability of the occurrence of word-context pairs in a corpus. Let D = 1 denote the statement that a word-context pair does indeed occur in a corpus. The probability of this statement given a specific word and context can be calculated with the sigmoid function:

\[ p(D = 1 \mid w, c) = \frac{1}{1 + e^{-v_c \cdot v_w}}, \tag{5} \]

where v_c and v_w are again the vector representations for context words and target words respectively, which are optimized to maximize this probability for all word-context pairs in the data. The probability of the whole corpus is computed by taking the product of the probabilities of all word-context pairs (w, c) ∈ D. The logarithm can again be used to improve computational efficiency, leading to the following maximization objective:

\[ \sum_{(w,c) \in D} \log \frac{1}{1 + e^{-v_c \cdot v_w}}. \tag{6} \]

This objective has a trivial solution: if v_c = v_w for all target words and context words (i.e. all vector representations are exactly the same) and the length of these vectors is at least 40, the probability for each word-context pair approaches 1 (Goldberg and Levy, 2014). To prevent this, a set of random word-context pairs is generated that are assumed to be absent from the corpus (D = 0). This random set is named D′, so that (w, c) ∈ D′ implies that (w, c) does not occur in the corpus. The objective to be maximized can now be updated as follows:

\[ \sum_{(w,c) \in D} \log p(D = 1 \mid w, c) + \sum_{(w,c) \in D'} \log p(D = 0 \mid w, c) \tag{7} \]

\[ = \sum_{(w,c) \in D} \log p(D = 1 \mid w, c) + \sum_{(w,c) \in D'} \log\bigl(1 - p(D = 1 \mid w, c)\bigr) \tag{8} \]

\[ = \sum_{(w,c) \in D} \log \frac{1}{1 + e^{-v_c \cdot v_w}} + \sum_{(w,c) \in D'} \log \frac{1}{1 + e^{v_c \cdot v_w}}. \tag{9} \]

Using this random set of negative samples, the trivial solution is avoided and functional semantic vectors for the target words can be obtained; this is why the extension is called negative sampling (Goldberg and Levy, 2014). Since computing p(D = 1 | w, c) (5) avoids the normalization over all contexts required by the softmax p(c | w) (2), Skip-gram with Negative Sampling provides more efficient computation of semantic vectors. Furthermore, research has shown Skip-gram with Negative Sampling to produce better semantic vectors than implementations that make use of softmax or other objectives, such as hierarchical softmax (Mikolov et al., 2013).
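To make the negative-sampling objective concrete, the following sketch performs one stochastic update of equation (9) for a single observed word-context pair in plain NumPy. It is a minimal illustration rather than the Word2Vec implementation used later in this thesis; vocabulary size, dimensionality and learning rate are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
V, dim, lr = 10_000, 100, 0.025             # vocabulary size, vector size, learning rate
W_target = rng.normal(0.0, 0.1, (V, dim))   # target-word vectors v_w
W_context = rng.normal(0.0, 0.1, (V, dim))  # context-word vectors v_c

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_update(w, c_pos, n_negative=5):
    """One gradient step on equation (9) for an observed pair (w, c_pos) in D,
    plus n_negative random pairs assumed absent from the corpus (the set D')."""
    c_negatives = rng.integers(0, V, n_negative)
    for c, label in [(c_pos, 1.0)] + [(c, 0.0) for c in c_negatives]:
        score = sigmoid(W_target[w] @ W_context[c])
        g = lr * (label - score)            # gradient of the log-sigmoid objective
        c_old = W_context[c].copy()
        W_context[c] += g * W_target[w]
        W_target[w] += g * c_old
```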

2.2.2 Image based models

The aforementioned research into grounded cognition emphasizes the role of simulation in human language comprehension (Zwaan and Madden, 2005; Barsalou, 2008). As this body of research has grown, so has the motivation for incorporating sensory data into semantic models. This has led to distributional semantic models that do not base their semantic representations solely on textual corpora, but that incorporate data sources such as images and audio. These models have been named multimodal distributional semantic models. Distributional semantic models that incorporate visual data have indeed been shown to outperform numerous state-of-the-art text-based models on tasks such as word categorization and computing word relatedness (Bruni et al., 2014; Kiela and Bottou, 2014). This section describes and compares two methods for the construction of semantic vectors from visual data that have both been implemented in earlier research: Bag of Visual Words and Convolutional Neural Networks.

2.2.2.1 Bag of Visual Words. The Bag of Visual Words (BoVW) technique has been widely used to extract visual information from images. Its main advantages are its simplicity, its computational efficiency, and its robustness under affine transformations as well as variation in lighting and occlusion within an image dataset (Csurka et al., 2004; Bruni et al., 2014). It is inspired by the traditional Bag of Words technique, which represents textual documents as unordered bags of words. The central notion of the BoVW technique is that visual words are extracted from annotated images without taking the spatial orientation of these visual words into account. These visual words then provide the basis for vector representations of the target words.

The BoVW technique can roughly be divided into four steps. First, a database has to be constructed of images that provide visual representations of the words in the dataset. The second step is finding salient image patches that can be used to differentiate these images. A widely used method for the identification and representation of these image patches is the Scale Invariant Feature Transform (SIFT) algorithm. This algorithm uses scale-space filtering to identify potential keypoints. A grid of 16×16 pixels is then taken around each keypoint. This grid is subdivided into sixteen 4×4 subgrids, for which 8 local gradients are computed, resulting in a 128-feature keypoint vector (Lowe, 2004; Csurka et al., 2004). Thirdly, these keypoints can be classified as different types of visual words using a clustering algorithm. Images can now be represented as vectors by counting the occurrences of visual words in each image. Since the same vocabulary of visual words is used to represent each image, these vectors automatically have the same dimensionality. The final step is creating vector representations of words in the dataset by aggregating the vector representations of all relevant images. This can be done by summing, averaging, or any other method of vector aggregation (Bruni et al., 2014; Kiela and Bottou, 2014; Kiela, 2016; Csurka et al., 2004; Lowe, 2004).
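A minimal sketch of steps two through four using OpenCV and scikit-learn is given below. The image lists are assumed inputs, the vocabulary size of 500 is an arbitrary choice, and SIFT availability depends on the OpenCV build; this is not the pipeline of any of the cited studies.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

sift = cv2.SIFT_create()  # step 2: detect and describe salient image patches

def descriptors(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = sift.detectAndCompute(img, None)   # 128-feature keypoint vectors
    return desc if desc is not None else np.empty((0, 128))

# step 3: cluster all keypoints into a vocabulary of visual words
all_descriptors = np.vstack([descriptors(p) for p in all_image_paths])
kmeans = KMeans(n_clusters=500, n_init=4).fit(all_descriptors)

def bovw_vector(path):
    """One image as a histogram of visual-word occurrences."""
    words = kmeans.predict(descriptors(path))
    return np.bincount(words, minlength=kmeans.n_clusters)

def word_vector(image_paths):
    """Step 4: aggregate (here: average) the vectors of a word's images."""
    return np.mean([bovw_vector(p) for p in image_paths], axis=0)
```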

2.2.2.2 Convolutional Neural Networks. Another highly efficient approach for extracting visual features that has been used in recent research implements Convolutional Neural Networks (CNNs) (Kiela and Bottou, 2014; Kiela, 2016; Krizhevsky et al., 2012). CNNs have been used in a variety of natural language processing applications, and it is generally recognized in the field of computer vision that CNNs have superseded the BoVW approach (Kiela and Bottou, 2014; Kiela, 2016; Krizhevsky et al., 2012).


CNNs are Neural Networks with a specialized structure that enables recognition of local patterns, and robustness under scale variance and other input distortions (LeCun et al., 1998). Since their original development in the 1980s, CNNs have been further developed and widely used in a variety of tasks. However, CNNs are mainly applied in computer vision, since their structure is especially well suited for the recognition of local features in large image datasets, without requiring extensive preprocessing (LeCun et al., 1998; Krizhevsky et al., 2012; Kiela and Bottou, 2014; Karpathy et al., 2014).

CNNs typically consist of convolutional layers, pooling layers and fully connected layers. An example of the structure of a CNN is given in Figure 1 (LeCun et al., 1998; Karpathy et al., 2014). Relatively unprocessed images are used as input. A convolutional layer, consisting of several feature maps, recognizes distinctive features in the input such as corners and edges. Each feature map recognizes one type of distinctive feature for the entire input image. Therefore, all pixels in a feature map share the same weights and bias. This limits the number of parameters needed to train the network, resulting in faster and more efficient training (LeCun et al., 1998). After a convolutional layer, a subsampling or pooling layer is typically used to reduce the spatial size of the image representation and the number of parameters that need training. Common pooling methods are taking the maximum or average of multiple neurons in the previous layer (Krizhevsky et al., 2012; LeCun et al., 1998). This pooling also makes the network more robust to small differences in scale and spatial orientation (LeCun et al., 1998; Karpathy et al., 2014; Krizhevsky et al., 2012). After multiple convolutional and pooling layers, the recognized features are given as input to fully connected layers. These layers are connected to all activations in the previous layer, and convert these activations into the required output. Their structure is equal to that of the hidden layers in a standard Neural Network (LeCun et al., 1998).

Figure 1: Structure of a Convolutional Neural Network (LeCun et al., 1998)

To use a CNN for visual feature extraction, it has to be trained on a task that requires semantically representing images. This training is typically performed using a large labeled image dataset such as ImageNet (Krizhevsky et al., 2012; Kiela, 2016; Kiela and Bottou, 2014). This ensures that the data used to train the network is of sufficient size and that images are correctly labeled with the linguistic concepts that they represent. Using such a high-quality dataset, the CNN is trained to perform an image-classification or -regression task (Krizhevsky et al., 2012; Kiela and Bottou, 2014; Poria et al., 2015). After convergence, the CNN is applied to a new set of images that represent the words for which semantic representations are to be extracted. This method, in which a model is trained on a high-quality dataset before applying it to the target dataset, is called transfer learning. It has the benefit of using extensive and high-quality training data, while ensuring that coverage is sufficient for all words in the target lexicon (Shin et al., 2016). The images in this second dataset are fed to the trained CNN, which performs the same task that it has been trained on. Since efficient feature representations are needed to execute the task, the layer preceding the final layer in the CNN can now be extracted and stored as a vector representation of the classified image (Kiela and Bottou, 2014; Kiela, 2016). Vector representations of words in the dataset are constructed by combining the vector representations of all relevant images (Kiela and Bottou, 2014; Kiela, 2016).
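The extraction procedure described above can be sketched with a pretrained AlexNet from torchvision as follows. This is an illustrative reconstruction rather than the MMFeat/Caffe setup used later in this thesis, and the image paths are assumed inputs.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# AlexNet pretrained on ImageNet classification (the transfer-learning source task)
net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()
# keep the classifier up to (but excluding) the final layer: 4096-dim activations
head = torch.nn.Sequential(*list(net.classifier.children())[:-1])

prep = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def image_vector(path):
    """Pre-final-layer activation (4096 features) for a single image."""
    x = prep(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return head(net.avgpool(net.features(x)).flatten(1)).squeeze(0)

def word_vector(paths):
    """Aggregate (average) the vectors of all images representing one word."""
    return torch.stack([image_vector(p) for p in paths]).mean(dim=0)
```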

2.2.3 Constructing a Multimodal Model

Since the introduction of multimodal distributional semantic models by Feng and Lapata (2010), several methods have been developed for combining text- and image-based features into a single model. In their model, textual documents are paired with images. These texts and images cover multiple topics. Words and discrete image features are then considered samples from the probability distributions of these topics. Using this approach, multimodal semantic representations of words can be constructed from their distribution over topics in both textual documents and images. However, this model requires the textual and visual data to be extracted from the same corpus, and rigidly pairs texts with images. This does not allow for much flexibility in the way textual and visual information is combined (Bruni et al., 2014).

Leong and Mihalcea (2011) proposed a different approach that has later been referred to as scoring-level fusion (Bruni et al., 2014). They use both textual and visual data to compute word relatedness. However, unlike Feng and Lapata, they do not construct multimodal semantic representations. Instead, separate text-based and image-based models are developed, and the similarity estimates of both models are combined by taking the sum or harmonic mean. The resulting hybrid similarity measures outperformed both text-based and image-based models (Leong and Mihalcea, 2011).

An even more extensive approach for combining multimodal data into a single model is feature-level fusion (Bruni et al., 2014; Kiela and Bottou, 2014). In feature-level fusion, semantic representations from different modalities are combined in the feature space to create multimodal vectors, instead of combining the results of separate monomodal models. Features from different modalities can be concatenated into a single matrix and projected onto a common space using a form of dimensionality reduction, such as singular value decomposition or principal component analysis (Bruni et al., 2014). Even more flexible construction of multimodal features can be achieved using deep learning methods. By concatenating the features from separate modalities and feeding these concatenated vectors to a supervised classifier, semantic representations can be obtained that are fit to a specific task, similarly to the visual feature extraction explained in 2.2.2.2 (Poria et al., 2015; Kim et al., 2013; Ngiam et al., 2011). Indeed, Convolutional Neural Networks can also be used for multimodal fusion by extracting the pre-final layer (Poria et al., 2015). Similar methods for multimodal fusion have been implemented using Deep Belief Networks consisting of stacked Restricted Boltzmann Machines (Kim et al., 2013; Ngiam et al., 2011).
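A minimal sketch of the simplest feature-level fusion variant (per-modality L2 normalization, concatenation, and a truncated SVD projection onto a common space) is given below; the array names and the target dimensionality are assumptions.

```python
import numpy as np

def feature_level_fusion(text_vecs, image_vecs, k=300):
    """Row-wise L2-normalize each modality, concatenate in feature space,
    and project onto a common k-dimensional space with a truncated SVD.
    text_vecs: (n_words, d_text); image_vecs: (n_words, d_image)."""
    t = text_vecs / np.linalg.norm(text_vecs, axis=1, keepdims=True)
    v = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    fused = np.hstack([t, v])                 # concatenation of modalities
    U, S, _ = np.linalg.svd(fused, full_matrices=False)
    return U[:, :k] * S[:k]                   # rank-k multimodal vectors
```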

3 Data

3.1 Lexicon

To find non-arbitrariness in language, words that are explicitly morphologically related should not be taken into account. It is evident that words such as ’walk’ and ’walks’, or ’telephone’ and ’television’, are related syntactically as well as semantically. Words that consist of the same morphemes are generally semantically related to those morphemes (Gutiérrez et al., 2016; Shillcock et al., 2001). I focus on form-meaning systematicity that is not the result of such superficial relations. To minimize the risk of detecting morphemic rather than submorphemic relations, I train the model only on monomorphemic words (Shillcock et al., 2001). I use the same wordlist as Gutiérrez et al. (2016), which was selected by cross-referencing monomorphemic English words in the CELEX lexical database (Baayen et al., 1996) with monomorphemic words in the Oxford English Dictionary Online (Simpson et al., 1989). Remaining polymorphemic words, place names, demonyms, spelling variants and proper nouns were removed. Words that were not among the 40,000 most frequent non-filler word types were excluded, resulting in a list of 4958 monomorphemes. Only the words for which both text- and image-based semantic vectors were available have been used. The final wordlist consists of 4537 monomorphemes.

3.2 Linguistic and Visual Corpora

Two separate linguistic corpora have been used to extract two distinct sets of text-based semantic vectors. The first corpus was constructed by Gutiérrez et al. (2016) and is a combination of three English language corpora: ukWaC, BNC and Gigaword (Ferraresi et al., 2008; Consortium et al., 2007; Parker et al., 2011). ukWaC is a web-derived corpus introduced by Ferraresi et al. (2008). It contains over 2 billion word tokens, gathered by crawling the .uk internet domain. To ensure that a variety of text types and topics are represented in the corpus, webpages were selected by submitting random pairs of randomly selected words to Google and retrieving the most prominent results as seed URLs. These words were selected from the BNC language corpus (Consortium et al., 2007) and a vocabulary list for foreign learners of English (http://wordlist.sourceforge.net). Duplicates were removed from the webpages behind the sampled URLs, and linguistically irrelevant text such as code and boilerplate was removed using the Hyppia project BTE tool (http://www.smi.ucd.ie/hyppia). After further removal of machine-generated texts, and text processing such as annotation and lemmatization, the resulting corpus has been made freely available online.

The British National Corpus (BNC) was originally created in the 1990s by Oxford University Press and contains over 100 million words. It contains text from a wide range of sources such as books, scientific expositions, essays, emails, newspapers and leaflets. The BNC was created with the purpose of constructing a general corpus that characterizes modern British English; targets were therefore set guaranteeing a variety of topics and media in the corpus. Texts were collected manually, by either scanning or typing them into PCs. The corpus has been annotated with grammatical classifications and made available on the Oxford Text Archive (Burnage and Dunlop, 1992).

The Gigaword corpus is currently the largest available corpus of English news documents, containing nearly 10 million documents with a total of over 4 billion words (Parker et al., 2011; Napoles et al., 2012). It covers news from 7 international news sources. Data was gathered as electronic text via internet retrieval, and improper characters have been removed.

The second language corpus used for extracting text-based semantic vectors contains 100 billion words from the Google News dataset, which consists of texts from more than 4500 English news sites. The text was extracted from Google News in 2013.

The visual corpus used for extracting image-based semantic vectors was constructed using the Bing image search engine: 10 images were retrieved for over 38,000 words using the MMFeat miner module (Kiela, 2016).

3.3 Concreteness ratings

I use concreteness ratings from Brysbaert et al. (2014), who obtained ratings for 40,000 words from over four thousand participants. Although the questionnaire instructions stressed that the assessment of concreteness should be based on experiences involving all senses and motor responses, Brysbaert et al. (2014) found that participants largely focused on visual and haptic experiences when evaluating concreteness. A 5-point rating scale was used to judge concreteness. Of the original 4537 words in my lexicon, 3991 are represented with a concreteness rating in this database.


4 Methods

4.1 String Metric Learning for Kernel Regression

Following the methodological structure of Gutiérrez et al. (2016), I use a Kernel Regression framework to analyze form-meaning systematicity. Kernel Regression is a nonparametric supervised learning technique that has become widely used for pattern detection and discrimination problems (Yee and Haykin, 1993; Takeda et al., 2007). Since Kernel Regression is nonparametric, it has the advantage of letting the data shape the structure of the model. Data samples are defined by predictor variables as well as target variables. Target variable values for individual data samples are predicted based on their distance in predictor variables to other data samples for which the target variables are known. This enables the model to capture local structures in the data, in contrast to parametric models, which generally provide a more global fit (Takeda et al., 2007).

To predict values for target variables, numerous estimators of varying complexity have been developed (Takeda et al., 2007). I implement the linear Nadaraya-Watson estimator (Nadaraya, 1964). Given a set of N data points {x_i}, i = 1, ..., N, with target values {y_i}, the Nadaraya-Watson estimate for a data sample x_j is

\[ \hat{y}(x_j) = \frac{\sum_{i \neq j} k_{ij} \, y_i}{\sum_{i \neq j} k_{ij}}, \tag{10} \]

where k_{ij} is the kernel between data points i and j, computed using a kernel function that penalizes distance in predictor variables between two samples. I implement the exponential kernel function:

\[ k(x_i, x_j) = \exp(-d(x_i, x_j)/h). \tag{11} \]

The variable h specifies a bandwidth that determines the radius of the neighborhood in which data samples effectively contribute to each other's prediction. In other words, h controls the strength of the distance penalization implemented by the kernel function (Takeda et al., 2007). The distance metric d defines the distance in predictor variables between data samples.
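Equations (10) and (11) amount to a few lines of NumPy when the pairwise distances are precomputed; the sketch below gives leave-one-out predictions for the whole lexicon at once. The names and the bandwidth value are assumptions.

```python
import numpy as np

def kernel_regression_predict(D, Y, h=1.0):
    """Leave-one-out Nadaraya-Watson estimates (equation 10) with the
    exponential kernel (equation 11).
    D: (n, n) pairwise string distances; Y: (n, d) semantic vectors."""
    K = np.exp(-D / h)                # kernel matrix k_ij
    np.fill_diagonal(K, 0.0)          # exclude each word from its own prediction
    return (K @ Y) / K.sum(axis=1, keepdims=True)

# Per-word squared prediction error as a systematicity measure (lower = more systematic):
# errors = ((Y - kernel_regression_predict(D, Y)) ** 2).sum(axis=1)
```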

Analogous to Gutiérrez et al. (2016), I use Kernel Regression to predict semantic vectors of words based on their strings, which function as predictor variables. The distance metric used is the Levenshtein edit-distance (Levenshtein, 1966), which captures the distance between two strings as the minimum number of edits needed to transform one string into the other. An edit is a mutation, insertion or deletion of one letter. This is a widely used and effective measure of string distance (Nosofsky, 1986; Baily et al., 2001; Gutiérrez et al., 2016). However, using Levenshtein edit-distance as a predictor variable in Kernel Regression implies that all string-edits are equally relevant predictors for semantic representations. There is no reason to assume that this is the case. In fact, linguistic research has indicated that string-edits differ in their semantic relevance (Gil et al., 2005; Gutiérrez et al., 2016). The general problem of specifying the relative relevance of predictor variables for Kernel Regression has been covered by Weinberger and Tesauro (2007), who introduced Metric Learning for Kernel Regression (MLKR). Their algorithm optimizes a weight matrix W for predictor variables, creating a distance metric that is more fit to the task of predicting target variables. This optimization is executed by using Gradient Descent to minimize the mean squared error of kernel regression, defined as:

\[ L = \sum_{i=1}^{N} (y_i - \hat{y}_i)^T (y_i - \hat{y}_i). \tag{12} \]

Since all weights are set to minimize this error, Metric Learning for Kernel Regression computes weights that implicitly optimize the bandwidth variable h: instead of setting h to 0.1, for instance, MLKR can multiply all predictor weights by 10, producing the same results.

MLKR is applied to the string-edits used for the Levenshtein distance. Originally, Levenshtein distance is defined as the length of the minimum edit path between two strings (Levenshtein, 1966). The constructed weighted Levenshtein distance is defined as the weighted sum of all edits in the minimum edit path between two strings. String edits can be represented in a vector V, in which each locus represents one type of string-edit (e.g. substituting an ’a’ for a ’b’, or deleting a ’t’). Since each locus specifies one string-edit, the edit path between two strings can be stored as a vector by counting the required string edits and assigning these counts to the relevant loci. This results in an edit-vector V_ij for each pair of strings. Figure 2 provides an example of such an edit-vector for the strings ’boot’ and ’bee’. Substitutions are represented symmetrically (i.e. substituting ’a’ for ’b’ has the same locus as substituting ’b’ for ’a’), reflecting the assumption that opposite substitutions share the same semantic significance. Furthermore, the weights are bounded between 0 and infinity, since negative weights for string-edits would imply that an edit negatively contributes to the distance between two strings (Gutiérrez et al., 2016).

Figure 2: Mutation vector between ’boot’ and ’bee’ (Gutiérrez et al., 2016)

The edit-vectors V are multiplied with a weight-vector W of the same size S, resulting in a single weighted edit-distance for each pair of strings:

\[ d(s_i, s_j) = \sum_{s=1}^{S} W_s V_{ij}^{s} = W^T V_{ij}. \tag{13} \]

Using the Gradient Descent technique of MLKR, the weights are optimized by computing the derivative of the total mean squared error with respect to the weight vector W:

\[ \frac{\partial L}{\partial W} = \frac{\partial L}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial W}, \tag{14} \]

where the partial derivatives are:

\[ \frac{\partial L}{\partial \hat{y}_i} = \frac{2}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i), \tag{15} \]

\[ \frac{\partial \hat{y}_i}{\partial W} = \frac{\sum_{j \neq i} (y_j - \hat{y}_i)^T k_{ij} v_{ij}}{\sum_{j \neq i} k_{ij}}. \tag{16} \]
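To illustrate equation (13), the sketch below recovers one minimum edit path with standard dynamic programming and counts its edits into a sparse edit vector. The symmetric keying of substitutions follows Figure 2; the function names and the tie-breaking among equally short paths are assumptions.

```python
from collections import Counter

def edit_vector(s1, s2):
    """Counter of edits on one minimum Levenshtein path between s1 and s2.
    Substitutions are keyed symmetrically: ('sub', 'a', 'b') == ('sub', 'b', 'a')."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 or j == 0:
                D[i][j] = i + j
            else:
                D[i][j] = min(D[i-1][j] + 1, D[i][j-1] + 1,
                              D[i-1][j-1] + (s1[i-1] != s2[j-1]))
    edits, i, j = Counter(), n, m                 # backtrace one optimal path
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i-1][j-1] + (s1[i-1] != s2[j-1]):
            if s1[i-1] != s2[j-1]:
                edits[('sub',) + tuple(sorted((s1[i-1], s2[j-1])))] += 1
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i-1][j] + 1:
            edits[('del', s1[i-1])] += 1
            i -= 1
        else:
            edits[('ins', s2[j-1])] += 1
            j -= 1
    return edits

def weighted_distance(s1, s2, W):
    """Equation (13): weighted sum of edit counts; W maps edit type -> weight."""
    return sum(W.get(e, 1.0) * c for e, c in edit_vector(s1, s2).items())
```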

The optimized distance metric is used as input for Kernel Regression to predict semantic representations of words based on their strings. Following the approach of Gutiérrez et al. (2016), I take the extent to which the semantic representation of a word can be predicted as a measure of its systematicity. Words whose meanings can be accurately predicted based on their strings can be said to display systematicity between word form and meaning. Following this line of thought, Kernel Regression can now be used as a productive system for corpus-wide evaluation of form-meaning systematicity: the distribution of Kernel Regression error conveys information regarding the distribution of systematicity. I apply the following methods to investigate systematicity in the language corpus:

1. Computing the correlation between semantic distances and Levenshtein distances as a measure of corpus-wide systematicity.

I compute correlations using the Mantel test for pairwise distances (Mantel, 1967). This test computes the correlation between two distance matrices in which each entry in the first matrix is paired with the entry at the same index in the second matrix. Subsequently, both matrices are subjected to random permutations, after which the same correlation is computed. The p-value is the proportion of permuted matrix pairs that display a higher correlation than the initial two matrices (Mantel, 1967); a minimal sketch of this test follows this list. This p-value represents the probability that the measured correlation is found in a language corpus under the null hypothesis that form-meaning assignments are arbitrary. This test is performed with weighted as well as unweighted Levenshtein distances, to evaluate form-meaning systematicity in the lexicon and the added benefit of the optimized distance metric.


2. Analyzing phonaesthemes.

Phonosemantic clusters are classical examples of non-arbitrariness in language (Otis and Sagi, 2008). I compare the Kernel Regression error for words containing possible phonaesthemes with the average Kernel Regression error, determining to what extent these systematic clusters exist. I analyze all 26×26 two-letter combinations as possible phonaesthemes. The average Kernel Regression error of all words beginning with one candidate phonaestheme is compared to the average Kernel Regression error of 1000 random samples of an equal number of words. The proportion of these random sets that have a lower error than the investigated set is assigned as a p-value. This p-value represents the probability of a set of words displaying the found systematicity, under the null hypothesis that phonaesthemes do not exist (Gutiérrez et al., 2016).

3. Comparing systematicity to concreteness

The hypothesis that form-meaning systematicity correlates with concreteness has been brought forward earlier in linguistic research (Dingemanse et al., 2015). In addition, Bruni et al. (2014) found that multimodal and image-based models capture semantic referents of concrete concepts better than text-based models. Using concreteness ratings obtained from English native speakers (Brysbaert et al., 2014), I evaluate whether Kernel Regression error is negatively correlated with concreteness. If this is the case, the hypothesis that concrete words are systematic is confirmed. I especially analyze the difference between linguistic, visual and multimodal models in this context. Under the assumption that visual and multimodal models capture the semantic referents of concrete concepts better than linguistic models (Bruni et al., 2014), I hypothesize that concreteness is more strongly negatively correlated with Kernel Regression error in these models. If semantic vectors that represent concrete words provide better semantic representations, the logical consequence is that the predictability of these vectors is increased.
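The Mantel permutation test referenced in method 1 can be sketched as follows, assuming precomputed square distance matrices; the permutation count is an arbitrary choice.

```python
import numpy as np

def mantel_test(D1, D2, n_perm=1000, seed=0):
    """Correlate two (n, n) distance matrices; the p-value is the share of
    row/column permutations of D2 whose correlation with D1 reaches the
    observed one."""
    rng = np.random.default_rng(seed)
    iu = np.triu_indices_from(D1, k=1)            # unique pairwise entries
    observed = np.corrcoef(D1[iu], D2[iu])[0, 1]
    hits = 0
    for _ in range(n_perm):
        p = rng.permutation(D1.shape[0])
        if np.corrcoef(D1[iu], D2[p][:, p][iu])[0, 1] >= observed:
            hits += 1
    return observed, hits / n_perm
```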

Furthermore, the optimized weight vector W for string edits can provide interesting information itself. Since the weights are optimized for predicting semantic representations, the extent to which edit weights differ provides information about the semantic relevance of edits. I analyze the weight vector to evaluate to what extent string edits differ in their semantic relevance. In addition, I list the most and least relevant string edits.

4.2 Linguistic, Visual and Multimodal semantic representations

I conduct the experiments described in section 4.1 with linguistic, visual and multimodal models to investigate form-meaning systematicity. In this section, I describe the methods used for constructing semantic vectors from the textual and visual corpora described in 3.2. Furthermore, I specify which methods I use to construct multimodal semantic representations. In addition, I describe the t-SNE visualization technique I use to analyze all semantic models qualitatively.


4.2.1 Linguistic Semantic Vectors

I use two sets of text-based semantic vectors extracted from the two text corpora described in 3.2. Both sets have been constructed using Skip-gram with Negative Sampling as explained in sections 2.2.1.1 and 2.2.1.2. Vector training and optimization have been executed using the Google Word2Vec tool (Mikolov et al., 2013). The first model is a replication of the text-based model used by Gutiérrez et al. (2016). It has been trained using the Word2Vec implementation in the Gensim package for Python (Rehurek and Sojka, 2010) with default parameters. The model contains 100-dimensional vector representations for 4958 words. I use this model to extend upon the research of Gutiérrez et al. (2016) and to compare the systematicity found using visual and multimodal models with their findings. The second model has been trained by Google on the Google News dataset and is supplied as part of the Word2Vec tool. It contains 300-dimensional vector representations for 3 million words and phrases. I have found that this second model displays more systematicity than the first text-based model. I therefore use this second model to further investigate form-meaning systematicity in the English language.
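For reference, training such a Skip-gram with Negative Sampling model in Gensim takes a single call. This is a generic sketch: the corpus iterable is an assumed input, and older Gensim versions use the keyword size instead of vector_size.

```python
from gensim.models import Word2Vec

# sentences: any iterable of token lists, e.g. [["the", "dog", "barks"], ...]
model = Word2Vec(
    sentences,
    vector_size=100,   # dimensionality of the semantic vectors
    sg=1,              # use Skip-gram (rather than CBOW)
    negative=5,        # number of negative samples per observed pair
    window=5,          # context window (gamma)
    min_count=5,       # ignore rare words
)
vec = model.wv["dog"]  # 100-dimensional semantic vector
```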

4.2.2 Visual Semantic Vectors

The set of image-based semantic vectors I use has been obtained using the MMFeat tool, a modular Python toolkit that uses bindings to the Caffe deep learning framework (Jia et al., 2014). MMFeat contains implementations for mining data and extracting features using a CNN (Kiela, 2016). It supports multiple neural network structures, of which AlexNet was used (Krizhevsky et al., 2012). This CNN was trained by performing a classification task on images from ImageNet. After convergence, the trained network was used to extract features from a set of images that visually represent words in the lexicon. These images were obtained by applying the MMFeat miner module to the Bing image search engine. Using Bing, 10 images were obtained to represent each word in the lexicon. These images were fed to the trained CNN, after which the layer preceding the final layer was extracted as a vector representation of each image, as explained in section 2.2.2.2. This transfer learning approach has the advantage of using a large annotated dataset to train the model, while ensuring that sufficient visual data is available for all words in the lexicon. By aggregating 10 image representations for each word in the dataset, 4096-dimensional vector representations were obtained for over 38,000 words.

4.2.3 Multimodal Semantic Vectors

The first method I use to investigate the potential of a multimodal approach is based on scoring-level fusion as developed by Leong and Mihalcea (2011). I perform Kernel Regression separately on the linguistic and visual models, and compute the semantic distance between two words as a weighted average of the cosine distances between the two relevant vectors in the linguistic and the visual model. This results in a multimodal distance which I correlate with edit distance using the Mantel permutation test described in section 4.1 (Mantel, 1967). I correlate with unoptimized Levenshtein edit-distances as well as optimized edit-distances. Optimized edit-distances for this model are computed by taking the weighted average of the optimized edit-distances for both models:

\[ \alpha \cdot d_{\text{visual}} + (1 - \alpha) \cdot d_{\text{linguistic}}. \tag{17} \]

The resulting correlation is computed for different values of α, after which the weighting factor that results in the highest correlation between semantic distances and edit distances is identified as the optimal ratio for combining the text-based and image-based models. The found correlation indicates whether the linguistic and visual models complement each other and whether a multimodal approach is effective.
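The α sweep of equation (17) can be sketched as follows, reusing the mantel_test helper sketched in section 4.1; the grid resolution is an arbitrary choice.

```python
import numpy as np

def best_alpha(D_visual, D_linguistic, D_edit, alphas=np.linspace(0, 1, 21)):
    """Scoring-level fusion: pick the alpha in equation (17) whose fused
    distance matrix correlates best with the (weighted) edit distances."""
    results = []
    for a in alphas:
        fused = a * D_visual + (1 - a) * D_linguistic
        r, p = mantel_test(fused, D_edit)   # helper sketched in section 4.1
        results.append((r, a, p))
    return max(results)                     # (best correlation, alpha, p-value)
```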

Secondly, I implement feature-level fusion by concatenating the normalized linguistic and visual semantic representations and running Kernel Regression on this multimodal model (Bruni et al., 2011). I conduct all experiments described in section 4.1 using this model. This concatenation approach has the benefit over the scoring-level fusion approach of explicitly training weights to optimize the prediction of multimodal semantic representations. However, simple concatenation could prove to be inefficient, since the dimensionality of the image-based vectors (4096) is of a different order of magnitude than the dimensionality of the text-based vectors (300). I therefore normalize the linguistic and visual models separately before concatenation.

Thirdly, I use a trained regressor in the form of a Siamese Neural Network to extract multimodal features, similar to the approach of Poria et al. (2015). The structure of this Neural Network is shown in Figure 3. I use the concatenation of text- and image-based semantic representations as input data, and randomly pair words in my lexicon. I then use the Levenshtein edit-distances between these pairs as the values to be predicted. The two concatenated semantic representations for each pair are fed to two separate branches in the neural network. Each layer in the network shares the same weights over both branches. Before the final layer, I concatenate the vectors in both branches, after which the edit distance between the two words is predicted. This network is trained until convergence. Subsequently, I run all words in the dataset through the trained network and extract the layer preceding the final layer (before concatenation) as a multimodal representation of the input word. I train two separate instantiations of this neural network. I train the first network on a training set consisting of 80% of my lexicon, and evaluate its performance on the remaining 20%. I report the mean squared error achieved on this 20% as a measure of the network's performance. I train the second network on all words in my lexicon to extract optimal multimodal semantic representations. The neural networks are implemented in TensorFlow (Abadi et al., 2015), and use mean squared error as the loss function for optimization.
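A miniature version of this shared-weight regressor in Keras might look as follows. The layer sizes, the input dimensionality (4096 visual plus 300 linguistic features), and the training settings are illustrative assumptions rather than the exact architecture of Figure 3.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

dim = 4096 + 300                    # assumed concatenated image + text input size

# Shared branch: identical weights are applied to both words of a pair
branch = tf.keras.Sequential([
    layers.Dense(1024, activation="relu"),
    layers.Dense(300, activation="relu"),  # extracted later as the multimodal vector
])

in_a = layers.Input(shape=(dim,))
in_b = layers.Input(shape=(dim,))
merged = layers.Concatenate()([branch(in_a), branch(in_b)])  # join before output
out = layers.Dense(1)(merged)                                # predict edit distance

siamese = Model([in_a, in_b], out)
siamese.compile(optimizer="adam", loss="mse")        # mean squared error loss
# siamese.fit([X_a, X_b], edit_distances, epochs=50) # X_a, X_b: paired word vectors

# After training, the branch output serves as the multimodal representation:
extractor = Model(in_a, branch(in_a))
# multimodal_vectors = extractor.predict(X_all)
```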

4.2.4 t-SNE

Figure 3: The Neural Network used for feature extraction

I use the t-distributed Stochastic Neighbour Embedding (t-SNE) visualization technique to visualize the semantic vectors of all described models in a two-dimensional space (Maaten and Hinton, 2008). t-SNE is an extension of Stochastic Neighbour Embedding (SNE) (Hinton and Roweis, 2003) that excels at preserving data structures at several scales.

In SNE, Euclidean distances between data points are converted into conditional probabilities. More specifically, the similarity between a data point x_i and a data point x_j is computed as the probability that x_j would pick x_i as its neighbour, if neighbours were picked in proportion to their probability density under a Gaussian centered at x_i (Maaten and Hinton, 2008). SNE visualizes high-dimensional data in a low-dimensional map by optimizing the similarities between data points in the low-dimensional map to resemble the similarities between data points in the high-dimensional space. The Gaussians used to compute similarities are specified by sigma values that are a function of the entropy in the data and a perplexity parameter specified by the user. This perplexity can be interpreted as a measure of the effective number of neighbours (Maaten and Hinton, 2008). Optimization is executed by performing gradient descent on the Kullback-Leibler divergence between individual data points.

The SNE technique has the disadvantage that optimization is computationally heavy, since similarities have to be computed between individual data points. Furthermore, SNE is susceptible to the crowding problem, which causes data points to crush together in the center of the low-dimensional map. t-SNE reduces the complexity of optimization by minimizing the Kullback-Leibler divergence between the two joint probability distributions of all data points in the high-dimensional space and all data points in the low-dimensional map. In addition, heavy-tailed t-distributions are used instead of Gaussians to compute similarities in the low-dimensional map, which alleviates the crowding problem (Maaten and Hinton, 2008).
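The corresponding low-dimensional similarities and the cost function minimized by t-SNE, again as defined by Maaten and Hinton (2008):

```latex
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}},
\qquad
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}
```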

I use the t-SNE visualization technique to project semantic vectors onto a two-dimensional space, and plot the obtained data sample as a labeled scatter plot. I analyze the resulting clusters to investigate the manner in which semantic referents are captured.
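A minimal sketch of how such a projection could be produced with scikit-learn, assuming the semantic vectors are available as a NumPy matrix; the stand-in data, perplexity value and labels are illustrative, not those used in my experiments:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Stand-in data: one semantic vector per word plus a label for plotting.
vectors = np.random.randn(500, 300)
words = [f"word{i}" for i in range(len(vectors))]

coords = TSNE(n_components=2, perplexity=30).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1], s=4)
for (x, y), w in zip(coords[:100], words[:100]):  # label a subset for legibility
    plt.annotate(w, (x, y), fontsize=6)
plt.show()
```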

5 Results

5.1 Lexicon-wide Correlations between form and meaning

Tables 1, 2 and 3 show correlations between edit distances and cosine distances between semantic vectors in different models. These correlations were computed using the Mantel permutation test described in section 4.1 (Mantel, 1967). Table 1 shows correlations for 3 different monomodal semantic models: the Skip-gram model trained by Gutiérrez et al. (2016) (1), the Word2Vec pretrained text-based model (2) and the image-based model. Tables 2 and 3 show correlations for 6 multimodal models. The 3 methods for creating a multimodal model (scoring-level fusion, concatenation, and fusion using a neural network) have been applied to the image-based model combined with each of the two text-based models. 'U' denotes that the correlation is computed between semantic distances and unweighted edit distances; 'W' denotes that the correlation is computed between semantic distances and weighted edit distances, where the weights have been optimized by minimizing Kernel Regression error on the relevant semantic model. All models using the first text-based model cover the entire lexicon of 4537 words. Due to a slight difference in coverage, all models using the second text-based model, as well as the monomodal image-based model, cover 4479 of these words. The Neural Network used for Neural Network extraction predicts edit distance from semantic vectors with a mean squared error of 0.69.
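A minimal sketch of the Mantel permutation test used here, assuming precomputed pairwise edit-distance and cosine-distance matrices aligned by word; the function name and defaults are illustrative:

```python
import numpy as np

def mantel(dist_a, dist_b, permutations=1000, seed=None):
    """Mantel permutation test (Mantel, 1967): correlate the upper
    triangles of two square distance matrices, then estimate a p-value
    by permuting the rows and columns of one matrix."""
    rng = np.random.default_rng(seed)
    iu = np.triu_indices_from(dist_a, k=1)

    def corr(a, b):
        return np.corrcoef(a[iu], b[iu])[0, 1]

    observed = corr(dist_a, dist_b)
    hits = sum(
        corr(dist_a, dist_b[np.ix_(p, p)]) >= observed
        for p in (rng.permutation(len(dist_b)) for _ in range(permutations))
    )
    return observed, (hits + 1) / (permutations + 1)
```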

Table 1: Correlations for monomodal models

Model             Correlation   p-value
Text-based 1, U   0.0206        0.003
Text-based 1, W   0.0281        0.001
Text-based 2, U   0.0362        0.001
Text-based 2, W   0.0383        0.001
Image-based, U    0.0198        0.025
Image-based, W    0.0243        0.007

Optimizing edit weights improves the correlation found for all models. This shows that string edits do indeed differ in their semantic relevance. All correlations found using weighted and unweighted monomodal models are statistically significant, confirming the existence of form-meaning systematicity in the investigated lexicon. The unweighted correlation of 0.0206 found for the first text-based model resembles the correlation of 0.0194 reported by Gutiérrez et al. (2016). A cause of this slight difference could be that I used 91.5% of their lexicon due to limited visual coverage. The weighted correlation of 0.0281, however, is far below the 0.0464 reported by Gutiérrez et al. (2016), which indicates that my implementation of SMLKR optimizes weights less effectively. Interestingly, the second text-based model displays substantially more systematicity than the first text-based model. The image-based model also displays a statistically significant correlation between form and meaning, although smaller than both text-based models. The relatively small size of all correlations between form and meaning is to be expected under the assumption that concentrated systematic clusters exist as exceptions within an overall arbitrary language.

Table 2: Correlations for multimodal models constructed from the Skip-gram and image-based models

Model                         Correlation   p-value
Scoring-level Fusion, U       0.0275        0.001
Scoring-level Fusion, W       0.0353        0.001
Multimodal Concatenation, U   0.0275        0.001
Multimodal Concatenation, W   0.0322        0.001
Neural Network Fusion, U      0.0256        0.001
Neural Network Fusion, W      0.0292        0.001

All multimodal fusion models display more systematicity than the two monomodal models they were constructed from. This result is found with optimized as well as unoptimized edit distances. This shows that the information conveyed in the monomodal models is complementary, and that more systematicity is found using a multimodal approach. Furthermore, all correlations found are again statistically significant, which emphasizes the existence of systematicity in the investigated lexicon. The correlation found using the Neural Network Fusion model is smaller than the correlations found using the other two fusion models. This shows that this method does not combine the information available in both monomodal models as effectively as the first two methods.

Table 3: Correlations for multimodal models constructed from the Word2Vec and image-based models

Model                         Correlation   p-value
Scoring-level Fusion, U       0.0401        0.001
Scoring-level Fusion, W       0.0420        0.001
Multimodal Concatenation, U   0.0351        0.001
Multimodal Concatenation, W   0.0376        0.001
Neural Network Fusion, U      0.0175        0.004
Neural Network Fusion, W      0.0266        0.001

All of these models again show a statistically significant positive correlation between form and meaning. As expected, the Word2Vec model is harder to improve upon than the Skip-gram model, since its initial performance is better. However, the model constructed using Scoring-Level Fusion does outperform both the pretrained Word2Vec model and the image-based model. This shows that while the Word2Vec model displays more systematicity than the Skip-gram model, incorporating visual data still increases the systematicity found. The two other multimodal models, however, do not display more systematicity than the pretrained Word2Vec model, which shows that the used methods do not yet utilize the full potential of a multimodal approach.


The scoring-level fusion models have been optimized by identifying the parameter α that results in the highest correlation between edit distances and fused semantic distances, computed as α ∗ linguistic distance + (1 − α) ∗ visual distance. For weighted models, the edit distances are computed as a weighted average of the two relevant optimized edit distances, using the same parameter α. The results of this process are listed in Tables 4, 5, 6 and 7.
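A minimal sketch of this grid search, assuming aligned pairwise distance matrices; significance would still be assessed with the Mantel permutation test sketched above:

```python
import numpy as np

def fused_correlation(edit_dist, linguistic_dist, visual_dist, alpha):
    # Scoring-level fusion: alpha weights the linguistic distances.
    fused = alpha * linguistic_dist + (1 - alpha) * visual_dist
    iu = np.triu_indices_from(edit_dist, k=1)
    return np.corrcoef(edit_dist[iu], fused[iu])[0, 1]

def best_alpha(edit_dist, linguistic_dist, visual_dist):
    # Evaluate alpha on a coarse grid from 0 (visual only) to 1 (linguistic only).
    alphas = np.linspace(0.0, 1.0, 11)
    scores = [fused_correlation(edit_dist, linguistic_dist, visual_dist, a)
              for a in alphas]
    return float(alphas[int(np.argmax(scores))])
```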


Table 4: Scoring-Level Fusion of the image-based and Word2Vec models: α ∗ linguistic + (1 − α) ∗ visual

α     Correlation   p
0     0.0198        0.021
0.1   0.0222        0.014
0.2   0.0249        0.009
0.3   0.0281        0.001
0.4   0.0316        0.001
0.5   0.0351        0.001
0.6   0.0380        0.001
0.7   0.0398        0.001
0.8   0.0400        0.001
0.9   0.0386        0.001
1.0   0.0362        0.001

After further optimization α was set to 0.75.

Table 5: Scoring-Level Fusion of the Skip-gram and image-based models

α     Correlation   p
0     0.0198        0.021
0.1   0.0221        0.014
0.2   0.0239        0.009
0.3   0.0261        0.001
0.4   0.0274        0.001
0.5   0.0275        0.001
0.6   0.0266        0.001
0.7   0.0252        0.001
0.8   0.0236        0.001
0.9   0.0220        0.001
1.0   0.0206        0.001

After optimization α was set to 0.45.

Table 6: Weighted Scoring-Level Fusion of the image-based and Word2Vec models

α     Correlation   p
0     0.0243        0.007
0.1   0.0264        0.005
0.2   0.0289        0.001
0.3   0.0318        0.001
0.4   0.0349        0.001
0.5   0.0380        0.001
0.6   0.0406        0.001
0.7   0.0419        0.001
0.8   0.0417        0.001
0.9   0.0400        0.001
1.0   0.0373        0.001

After optimization α was set to 0.75.

Table 7: Weighted Scoring-Level Fusion of the Skip-gram and image-based models

α     Correlation   p
0     0.0243        0.007
0.1   0.0279        0.002
0.2   0.0311        0.001
0.3   0.0336        0.001
0.4   0.0351        0.001
0.5   0.0353        0.001
0.6   0.0343        0.001
0.7   0.0331        0.001
0.8   0.0314        0.001
0.9   0.0296        0.001
1.0   0.0281        0.001

After optimization α was set to 0.45.

As noted earlier, incorporating visual data has more effect on the systematicity found in the Skip-gram model than in the Word2Vec model; consequently, the optimal α is lower for the Skip-gram model. The parameter α stays the same under weight optimization, confirming that it is predominantly dependent on the extent to which the information conveyed in the two monomodal models is complementary.

5.2 Phonaesthemes: quantitative and qualitative analysis

I analyze to what extent form-meaning systematicity is localized in phonaesthemes by comparing the average Kernel Regression error for phonosemantic clusters with random samples from the lexicon, as described in section 4.1 and sketched below. For this analysis I use weighted edit distances, optimized for each individual model. Tables 8, 9, 10, 11 and 12 show the two-letter word beginnings that display the most form-meaning systematicity for each investigated semantic model. The p-value represents the probability that the systematicity found for the relevant cluster occurs under the null hypothesis that systematicity is not localized in phonosemantic clusters. The third column lists the most systematic words within these clusters.
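A minimal sketch of this sampling procedure, assuming a per-word Kernel Regression error is available for every word in the lexicon; the function name and sample count are illustrative:

```python
import numpy as np

def cluster_p_value(word_errors, cluster_words, samples=10000, seed=None):
    """Probability of observing a mean Kernel Regression error at least
    as low as the cluster's under random sampling from the lexicon.
    word_errors maps every word in the lexicon to its per-word MSE."""
    rng = np.random.default_rng(seed)
    all_errors = np.array(list(word_errors.values()))
    cluster_mean = np.mean([word_errors[w] for w in cluster_words])
    k = len(cluster_words)
    random_means = np.array([
        rng.choice(all_errors, size=k, replace=False).mean()
        for _ in range(samples)
    ])
    # Low error = high systematicity, so count samples at least as systematic.
    return (np.sum(random_means <= cluster_mean) + 1) / (samples + 1)
```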

Table 8: Skip-gram text-based model

Phonaestheme   P-value   Systematic Words
sn-            <0.0001   snail, snake, sneak, sneeze, sniff, snore, snort, snout, snuff
fl-            <0.0001   flake, flash, flea, flick, flip, flight, fluff, flush, flutter
sw-            0.0002    sweep, swell, swerve, swing, swipe, swirl, swish, swoop
tw-            0.0002    twang, twig, twine, twinkle, twist, twitch
mu-            0.0004    muck, mucus, mud, mush, mushroom, musk, musket
sl-            0.0010    slab, slap, sledge, sleek, slice, slime, sling, slit, slither, sliver, slush
sq-            0.0010    squabble, squeak, squeal, squirrel
sh-            0.0014    sheath, sheen, shank, shovel
wh-            0.0037    whack, wheel, whiff, whimper, whine, whirl, whisper, whistle

Table 9: Word2Vec text-based model

Phonaestheme   P-value   Systematic Words
sn-            <0.0001   snake, snail, sneak, sneeze, sniff, snore, snort, snout
mu-            <0.0001   muck, mushroom, mush, musk, munch, murmur, mutter
tw-            <0.0001   twig, twine, tweed, twitch, twang, twinkle, twit
sq-            0.0032    squabble, squeak, squeal, squirt, squirrel
pe-            0.0079    pea, peach, pear, pearl, pebble, pee, peep, pellet
bu-            0.0087    buff, buffalo, bull, bully, bungalow, bunk, bunker
sw-            0.0173    swirl, swish, swipe, swim, swerve, sway, swing, swoop
cr-            0.0137    crab, crawl, creep, creek, crevice, cripple, crook, crouch

Using the Skip-gram model, I find all 10 phonaesthemes listed by Gutiérrez et al. (2016) to be statistically significant. Listing the most systematic words for each phonaestheme shows how words within many phonaesthemes resemble each other in meaning. For instance, 'sn-' refers to the nose, 'fl-' refers to small motion, 'sw-' refers to larger motion and 'mu-' refers to the ground and dirt. Interestingly, I find the phonaestheme 'sq-' using the Skip-gram model, which is not listed by Gutiérrez et al.; words in this cluster refer to sounds and animals.

Using the pretrained Word2Vec model I find many of the same phonaesthemes as with the Skip-gram model, which is further evidence for their existence. Additionally, I find the following phonaesthemes: 'pe-', which refers to small round objects; 'bu-', which refers to things large and strong; and 'cr-', which refers to things crooked and close to the ground. A total of 25 phonaesthemes with p-values below 0.05 were found using the Skip-gram model. The second text-based model displays comparable clustering of systematicity, with a total of 22 phonaesthemes with p < 0.05.

Table 10: Image-based model

Phonaestheme   P-value   Systematic Words
re-            0.0054    reach, reckon, reed, reel, referendum, rend
jo-            0.0113    join, joint, joke, jolly, jolt, jot, joy
ar-            0.0114    area, arena, aria, argue, arbiter, ark, aroma
hu-            0.0125    hubbub, huddle, hustle, hubris, huff, huge, hulk, humble, humiliate
si-            0.0202    sick, sigh, sin, silly, simple
id-            0.0209    idea, idiom, idiot, idle, idol, idyll
pr-            0.0230    pray, preach, press, prestige, pride, priest, proud, proster, prophet
fa-            0.0270    fact, faculty, fail, faint, fallacy, farce, fare, falter

Table 11: Multimodal Concatenation

Phonaestheme   P-value   Systematic Words
sn-            0.0020    snake, snail, sneak, sneeze, sniff, snore, snort, snout
hu-            0.0037    hubbub, hustle, hubris
id-            0.0089    idea, idiom, idiot, idol, idyll
cr-            0.0116    crab, crawl, creep, creek, crevice, cringe, cripple, crone, crook, crouch
fl-            0.0165    flash, fleck, flee, flick, flimsy, flinch, fling
si-            0.0286    sight, sign, silhouette, simulate, sing, siren
fa-            0.0489    fail, faint, fall, fallacy, fare, fake, famine, fatigue, fault
jo-            0.0492    job, joke, jolly, joy, jolt, jot


Table 12: Neural Network Fusion

Phonaestheme   P-value   Systematic Words
ar-            0.0081    arbiter, area, arena, ark, arsenic, arson
hu-            0.0173    huge, hulk, hull, humiliate, humble
bl-            0.0289    blanch, blank, bless, blight, bliss
le-            0.0300    leach, leaf, leek, learn, lethargy, lesson, letter
id-            0.0352    idea, idiom, idiot, idle, idol, idyll

Systematicity is far less clustered in the image-based and multimodal models. Using the image-based model I find 15 phonaesthemes with p < 0.05. The Multimodal Concatenation and Neural Network Fusion models show 10 and 7 phonaesthemes with p < 0.05, respectively. Furthermore, while most of the same phonaesthemes are identified by the two text-based models, the image-based and multimodal models find different phonaesthemes. The text-based models capture concrete similarities between words. The visual and multimodal models, on the other hand, capture many similarities of a more abstract nature. The image-based model, for instance, finds 're-', which refers to thought, 'jo-', which refers to joy, 'ar-', which refers to roundness and sports, 'hu-', which refers to disorganization, 'pr-', which refers to high esteem and religion, and 'fa-', which has a negative connotation. The Multimodal Concatenation and Neural Network Fusion models show phonaesthemes found in the text-based models as well as phonaesthemes found in the image-based model, although their results predominantly resemble those of the image-based model. In addition, the Neural Network Fusion model identifies the phonaestheme 'le-', which refers to vegetation as well as education.

5.3 Concreteness

I correlate concreteness with Mean Squared Error for all investigated semantic models. A negative correlation shows that the semantic referents of concrete words can be more efficiently predicted from their strings. Under the assumption that low Kernel Regression error indicates high systematicity (Gutiérrez et al., 2016), a negative correlation shows that concrete words are more systematic. All correlations have been computed using optimized weights for string edits.
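A minimal sketch of this correlation analysis, with stand-in arrays in place of the actual concreteness ratings and per-word errors, and assuming a Pearson correlation:

```python
import numpy as np
from scipy.stats import pearsonr

# Stand-in arrays: a concreteness rating and a per-word Kernel
# Regression MSE for each word, aligned by word.
concreteness = np.random.uniform(1, 5, size=4479)
per_word_mse = np.random.uniform(0, 2, size=4479)

r, p = pearsonr(concreteness, per_word_mse)
# A negative r would indicate that more concrete words are predicted
# more accurately from their strings, i.e. display more systematicity.
print(f"r = {r:.3f}, p = {p:.4f}")
```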
