MSc Artificial Intelligence
Master Thesis

HyGAP: Hybrid Context-based Embeddings generated from Taxonomical Paths

by
Vikrant Kumar Yadav
11729465

August 12, 2019
36 EC
January 2019 - August 2019

Supervisor: Dr I. Calixto (UvA)
            Dr G. Tsatsaronis (Elsevier)
Assessor:   Prof. Dr. Evangelos Kanoulas

Abstract

In recent years we have experienced a plethora of new applications which rely on taxonomies or ontologies in order to index and organize content. Major advantages stemming from the utilization of taxonomies in applications such as search, sentiment analysis, and machine translation include the ability to handle ambiguous words, as well as to annotate content with taxonomical concepts, enabling its reuse by different text mining applications. Though there have been many advances and research works in representing words with embeddings, aiming at improving text annotation performance, little attention has been given to approaches that learn hybrid word embeddings comprising context and taxonomical concepts. In this thesis, we introduce a novel hybrid deep neural network approach to learn deep semantic representations of words, which captures the context of target words and the relevant taxonomical paths and concepts from a source taxonomy. Our approach learns hybrid context embeddings of words by attending to non-linear projections of the taxonomical nodes included in the most relevant path. We demonstrate empirically that the suggested approach significantly outperforms an existing approach for learning word embeddings in the task of predicting the most relevant word given the context in which it appears. For the comparative evaluation we utilize a benchmark data set that has traditionally been used for Word Sense Disambiguation to evaluate the quality of our hybrid embeddings.


Acknowledgements

I would first like to thank my thesis supervisors Dr Iacer Calixto and Dr George Tsatsaronis. Both Iacer and George were very helpful, giving their valuable guidance at the right time. They consistently allowed this thesis to be my own work, but motivated me when I had moments of losing faith. I would like to thank Iacer for providing valuable feedback on the thesis that I wrote and George for helping me write a paper for the CIKM conference.

I would also like to acknowledge Prof. Dr. Evangelos Kanoulas for being my assessor, and I am gratefully indebted to him for his very valuable comments on this thesis.

Finally, I express my very profound gratitude to my family and to my friends for providing me with unfailing support and continuous encouragement throughout my years of study. This accomplishment would not have been possible without them. Thank you.


Contents

1 Introduction
  1.1 Research Questions (RQ)
  1.2 Contribution
  1.3 Outline Of The Thesis
  1.4 Mathematical Convention Used
2 Background
  2.1 Continuous Bag Of Words (CBOW)
      2.1.1 Negative Sampling
      2.1.2 Negative Samples And Sub-sampling In CBOW
  2.2 Attention
  2.3 Bag Of Words
  2.4 TF-IDF
3 Related Work
  3.1 Learning Vector Representations Of Words
  3.2 Semantic Knowledge Based Representations
  3.3 Sense-based Representations Of Words And Texts
4 Data Sets
  4.1 Corpus Data Sets
      4.1.1 Wikipedia
      4.1.2 Semcor
      4.1.3 Other Corpora
  4.2 Synset Data Set
      4.2.1 Wordnet
  4.3 Evaluation Data Set
      4.3.1 WS-353
      4.3.2 SemEval Semantic Textual Similarity
      4.3.3 Semcor
5 Learning Hybrid Word Embeddings
  5.1 Overview
  5.2 Preparing WordNet To Apply HyGAP
  5.3 Context Component C & Taxonomical Component T
      5.3.1 CBOW and CBOWT
      5.3.2 CBOW and Non-Linear Projection Network (NLPN)
      5.3.3 CBOW and Attention based Non-Linear Projection (ANLP)
      5.3.4 CBOW and CBOWT With Weighted Nodes (CBOWTW)
  5.4 Software Implementation
6 Experiments
  6.1 Experimental Setup
  6.2 Word Prediction Task
  6.3 Semantic Textual Similarity (STS)
      6.3.1 Semantic Textual Similarity Without UNK
      6.3.2 Semantic Textual Similarity With UNK
  6.4 Word Level Evaluation
      6.4.1 Comparison Of Baseline With Taxonomy Based Model
      6.4.2 Comparison Between Taxonomy Based Models
7 Qualitative Analysis
  7.1 Capturing Different Meanings Of Words
  7.2 Word Embedding Visualization
8 Conclusion And Future Work
  8.1 Conclusion
  8.2 Future Work


Chapter 1

Introduction

This chapter briefly describes the background of the research area of the thesis, the motivation for doing this research, the research questions addressed, the major research contributions, and the outline of the thesis.

The work done in this thesis falls under research in the field of Natural Language Processing (NLP). NLP is a subfield of computer science, information engineering, and artificial intelligence focusing on the interactions between computers and human natural languages1. NLP deals with how to program computers to process and analyze large amounts of natural language data. The research done in this thesis draws on many concepts from the field of artificial intelligence, and more precisely from Machine Learning (ML) and Deep Learning (DL). ML is a sub-field of artificial intelligence which studies scientific algorithms and statistical models that computer systems use to perform a specific task effectively without being given explicit instructions, relying on patterns and inference instead. DL is a sub-field of ML which utilizes Artificial Neural Networks (ANN)2. ANNs are computing systems inspired by the biological neural networks that constitute animal brains3. These are very sophisticated systems which can perform tasks by considering examples, generally without being programmed with any task-specific rules.

The internet is rich in textual content coming from a myriad of sources such as social networking websites, blogs, Wikipedia, industry websites, etc. This plethora of data is very useful for learning data-driven models (DL models), which are applied in various text-based applications such as Question Answering [Li and Roth, 2002], Machine Translation [Vaswani et al., 2017a], Named Entity Recognition (NER) [Lample et al., 2016], Data Mining [Witten et al., 2016], Information Extraction [Abadi et al., 2016], etc. It has also been shown that data-driven models learn not only from the unstructured data present on the World Wide Web but also from other signals, such as Knowledge Graphs (KGs) [Smaili et al., 2018]. KGs are a way of describing the relations among a set of objects which are related to each other and grouped by certain relations. Although these data-driven DL models do not share a common architecture, they all rely on mathematical word representations [Iacobacci et al., 2016; Luong and Manning, 2016; Tao et al., 2017], i.e., in the field of machine learning a word representation is a way of representing a word using real-number vectors.

Although a lot of work has already been done in the field of learning word representations [Bengio et al., 2003; Mikolov et al., 2013a; Neelakantan et al., 2014; Bojanowski et al., 2016], word representations are still far from perfect [Bengio et al., 2013; Tang et al., 2015]. One of the most prominent challenges that deceive word representation algorithms is the polysemous nature of words, i.e., a word or phrase having multiple meanings depending upon the context it is used in. The algorithms need to understand not only the meaning of the word but also its syntactic and semantic relationships with other words. In this thesis we investigate whether knowledge graphs help in learning a better context-based representation of words. The input we require for this work is knowledge-graph-tagged words in documents, and the output is a static vector representation of the words.

1 https://en.wikipedia.org/wiki/Natural_language_processing
2 https://en.wikipedia.org/wiki/Deep_learning


Word Embeddings  Representing words as mathematical entities has its roots in the late 1950s [Firth, 1957]. In the early 1960s the first attempts were made to represent words in a vector space model for the field of information retrieval4. Conceptually, it involves a mathematical embedding from a space with many dimensions per word to a continuous vector space with a much lower dimension. Word representations are also known as Word Embeddings. Different machine learning based methods have been tried to learn word embeddings [Lebret and Lebret, 2013; Levy and Goldberg, 2014; Pennington et al., 2014; Li et al., 2015], but in recent research neural network based approaches have shown far more promising results [Mikolov et al., 2013a; Neelakantan et al., 2014; Bojanowski et al., 2016; Peters et al., 2018; Vashishth et al., 2018]. Neural network based embeddings (and embeddings in general) can further be subdivided into two categories: context independent embeddings and context dependent embeddings. Context independent embeddings are static embeddings, i.e., once the method has learned from the training data, the embeddings of the words are independent of co-occurrence with other words [Mikolov et al., 2013a; Vashishth et al., 2018]. On the other hand, context dependent embeddings are not static; they fine-tune themselves according to the applications they are applied in and always depend on their surrounding words [Peters et al., 2018]. What is common to both is that during learning, i.e., when the model is in the training phase, the model always depends upon the surrounding words; how and what information is captured, however, varies from algorithm to algorithm. The research in this thesis falls under the category of context independent embeddings. Although context dependent embeddings are both application and context aware, their drawback is that they have millions of parameters, which makes them much more costly to train. What we intend to do in this thesis is to make context independent embeddings more aware of the context they are applied in. Even though the embeddings are independent, the model tries to align them in such a way that when the combination of the vectors in a context is taken, the target representation is more context-aware. Details of how we incorporate this are shown in Chapter 5.

Knowledge Graphs  Graphs are mathematical structures used to model pairwise relations between objects. A graph is made up of a set of vertices (V) (also called nodes or points) which are connected by a set of edges (E) (also called links or lines). Graphs can be either undirected or directed. In an undirected graph the relation between nodes is bi-directional, and in a directed graph the relation between nodes is directional. Mathematically, a graph G can be represented in the form G = (V, E). KGs have been very useful in representing lexical relations in linguistics [Paulheim, 2017; Wang et al., 2017]. A few openly available knowledge graphs exist; Wordnet (described in Chapter 4)5, DBpedia6, YAGO7, and Freebase8 are among the most prominent ones. For our work we have used the Wordnet knowledge graph. It groups English words into sets of synonyms called synsets, provides definitions and usage examples, and records relations among these synonym sets or their members. All synsets are connected to other synsets through semantic relations. Hypernymy, hyponymy, meronymy, holonymy, etc. are the semantic relations captured in WordNet. These types of relationships come under the category of taxonomy. The majority of edges in Wordnet are of the hypernym and hyponym types. A hypernym defines an IS-A type of relationship, and a hyponym is just the opposite of it. For instance, Dog IS-A Canine IS-A Carnivore IS-A Placental IS-A Mammal IS-A Vertebrate IS-A Chordate IS-A Animal IS-A ... is a typical example of hypernym/hyponym relationships in Wordnet. In this thesis, we bank upon the hypernym and hyponym relations present in Wordnet.
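To make such hypernym paths concrete, the following is a minimal sketch, assuming NLTK and its WordNet corpus are installed; the sense name 'dog.n.01' is only an illustrative choice of the first noun sense of dog.

```python
# Minimal sketch: walking a WordNet hypernym path with NLTK.
# Assumes `pip install nltk` and that the WordNet corpus has been
# downloaded via nltk.download('wordnet').
from nltk.corpus import wordnet as wn

dog = wn.synset('dog.n.01')            # first noun sense of "dog" (illustrative)

# hypernym_paths() returns every root-to-synset path; each path is a list
# of synsets ordered from the most general node ("entity") down to "dog".
for path in dog.hypernym_paths():
    print(' IS-A '.join(s.name() for s in reversed(path)))
```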

Shortcomings of the current state of the art word embedding algorithms  Although the word embeddings learned from the models discussed [Mikolov et al., 2013a; Bojanowski et al., 2016; Peters et al., 2018; Vashishth et al., 2018] capture semantic and syntactic relationships between words, they learn only a single representation per word. However, most words are polysemous in nature. For example, in the sentences below (1)

(1) a. Mortgages were processed through a private bank owned by Merrill Cooms' entities.
    b. The town of Banaras is situated near the left bank of the Ganga.

4 https://en.wikipedia.org/wiki/Word_embedding
5 https://wordnet.princeton.edu/
6 https://www.dbpedia-spotlight.org/
7 https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/
8 https://en.wikipedia.org/wiki/Free_base


the word bank has two different meanings. In the first sentence bank means a financial institution, and therefore has surrounding words associated with financial institutions like Mortgages, Merrill Cooms' and processed, while in the second sentence bank represents the river bank and has surrounding words like town, Banaras (a town name) and Ganga (a river name). The models generating context independent embeddings [Mikolov et al., 2013a; Pennington et al., 2014] learn a generic (global) embedding of the word, i.e., depending upon the majority topic covered in the corpus (set of documents), the embedding of the word takes on that specific meaning. For example, the word bank could have neighbours like (loan, mortgages, money) or (river, water, muddy) depending upon the training corpus. For many NLP applications like Information Retrieval, Word Sense Disambiguation (WSD), Machine Translation, Named Entity Recognition (NER), Question Answering, etc. it is essential that we understand the word in the context in which it is used.

1.1  Research Questions (RQ)

It has been shown that taxonomical relationship information between words helps in learning better word representations [Faruqui et al., 2014; Mrksic et al., 2016; Smaili et al., 2018]. The drawback of vector based representations is that each word has a unique vector representation, and since words are polysemous in nature, having a unique representation does not help. With the help of taxonomies, word representations can be changed to sense based representations. For instance, the word bank, based on the corpus on which the word vector representation model is trained, can learn to have one semantic meaning. However, if we consider its noun and verb forms, the word bank has 18 different senses9, i.e., it has 18 different meanings. This power of taxonomies in word representations leads us to the following research questions:

RQ1  Can hypernym derivations of words improve context based embeddings? In this thesis we introduce a novel approach which provides a direction for closing the research gap between learning word embeddings and learning taxonomical concept representations. Given the plethora of industrial systems that utilize taxonomies to index and organize content, such an approach will enable the annotation and/or classification steps to take into account in parallel the corpus contexts as well as the conceptualizations of these domains.

However, this step can also be achieved with traditional word sense disambiguation (WSD) techniques [Navigli, 2009], when a general word taxonomy with a respective sense inventory is used, such as the WordNet10 thesaurus [Miller, 1995]. In this work we are using WordNet to demonstrate the applicability of the approach; nevertheless, the focus of this work is not on optimizing the annotation or the disambiguation process, but rather on showcasing how it can be leveraged efficiently to learn word embeddings jointly from the raw text and the used taxonomy.

RQ2  Given a taxonomical structure, can gating the information from hypernym nodes help in learning better word representations? The average path length of the Wordnet words in the corpus we used is 4.2 nodes, and at the Wordnet level it is 7 nodes. Previous research [Luu et al., 2016] has recommended excluding higher level hypernyms in each path. For instance, [Luu et al., 2016] found synsets such as object, entity and the upper levels of the hierarchical path to be noisy during the learning phase. As a matter of fact, for the nouns in WordNet, the word "entity" is the root for all words. Additionally, the words physical entity, abstraction, object and whole appear in the hierarchical paths of 58%, 47.27%, 34.74% and 30.95% of the WordNet nodes respectively.

Although the percentage of shared nodes drops as we go from entity towards the leaf level nodes, even then, the nodes before the leaf level are rarely mutually exclusive. The question that thus arises is how much of the abstract meaning of each node is essential for the word in consideration with respect to other nodes in the taxonomy. Should we encapsulate the whole information of the node, or gate it given the presence of other nodes? For example, plant and person both have the living-things node as a hypernym, but it can easily be said that in semantic space plant is not a replacement for person and vice-versa. However, the knowledge that both come under living-things is useful to some extent [Alsuhaibani and Bollegala, 2018].

9 http://wordnetweb.princeton.edu/perl/webwn?s=bank&sub=Search+WordNet
10 https://wordnet.princeton.edu/


RQ3  Can paying more attention to specific words in the context help in learning better target word representations? As humans, when we read a sentence we know which words we should focus on more to grab the meaning of the sentence. Similarly, it has been shown that for a given context, if more attention is given to certain words, then the target word representation is better learned [Devlin et al., 2018]. Likewise, in the field of MT it has been shown that the model learns to focus more on certain words (original language) when predicting the target word (foreign language), given what the model has already predicted in the foreign language and the original sentence [Bahdanau et al., 2015; Vaswani et al., 2017b]. Absorbing this idea, it becomes important to know whether the taxonomy encoding (of the target word) should be dependent on or independent of the context words. For example, in the sentence Water particles are more loosely packed than solid particles, the word Water can be replaced by its hypernym liquid, which in semantic space also lies close to the word solid. Thus the information of the word liquid becomes much more important in this context as compared to other hypernyms in its taxonomical structure. Whereas in the example Water particles are more loosely packed than ice particles, solid has been replaced by ice, and thus it becomes more important to look at the word water than at its more abstract meaning liquid.

1.2  Contribution

We make the following contributions in the course of our work:

• Joint model learning of word representations and taxonomical representations  In this thesis we create a hybrid model in which the context based word representation is learnt from the taxonomical representation of the target word. The approach leverages deep neural networks to learn representations of words with embeddings stemming both from the raw text and the context of the words, as well as from the taxonomical concepts occurring in the taxonomy path that is considered most relevant to that context. This helps in learning better context based representations of the words. The results of this idea are shown in Chapter 6, with the semantic evaluations done on different data sets.

• Gating information of hypernym nodes to learn better word representations  In this thesis we show that, in order to learn the taxonomical representation of a taxonomy, giving equal weight to every hypernym node is not a good idea. The average of the path starts losing its meaning as we include more nodes of the hierarchical path. Word representations are learned better if the node weights are taken in inverse order of the number of hierarchical paths the nodes appear in. Therefore, giving more weight to nodes which appear in fewer hierarchical paths, in comparison to the other nodes in consideration, helps us in learning better word representations.

• Creating an attention based mechanism to learn dynamic embeddings for the target word given the context and target word  In this thesis we show that word representations can be further improved if the taxonomical derivation of the target word is created dynamically for the context. For this we used the soft-attention mechanism devised by [Bahdanau et al., 2015] to give more weight to words in the context which can help in deriving better target word taxonomical representations. This also helps in moving from a discrete number of senses in Wordnet to a continuous number of senses, since each taxonomical derivation can have varying weights for each node based on the context.

1.3  Outline Of The Thesis

The rest of the thesis is structured as follows:

• Chapter 2 deals with the theoretical background and mathematical implementations of algorithms used in this thesis.

• Chapter 3 describes the related research work done in this field and sheds light on the current state of the art techniques.

• Chapter 4 describes the different data sets that are used as training corpora for the models, as well as the data sets used for evaluation, and provides information to the reader about the attributes in each corpus. We also describe how data is fed into the model architecture that we have used.


• Chapter 5 explains in detail how the models are implemented.

• Chapter 6 discusses the different quantitative intrinsic and extrinsic evaluations done on the baseline as well as the taxonomy based models.

• Chapter 7 shows the qualitative evaluation of the models.

• Chapter 8 briefly discusses the conclusion of this thesis and outlines future research work in this field.

1.4  Mathematical Convention Used

Throughout this thesis, the mathematical notation used is shown in Table 1.1.

Type        Style                 Notation
Scalar      Latin                 v
Variables   Normal                v
Vectors     Bold-faced Latin      v
Constants   Capital Latin         V
Matrix      Bold-faced Capital    V


Chapter 2

Background

In brief, this chapter provides the theoretical background and mathematical implementations of the algorithms used in this thesis. The introduction in Chapter 1 states briefly the work done in the research area, but the primary focus of this chapter is to discuss in detail the workings of the CBOW model, the Negative Sampling objective function, negative samples & sub-sampling, the attention mechanism, Bag Of Words, and how TF-IDF works.

2.1  Continuous Bag Of Words (CBOW)

The models built in this thesis can work with any architectural approach which utilizes context based representations to predict target words. We consider Continuous Bag of Words (CBOW) [Mikolov et al., 2013a] as our base model. In CBOW, a target word is predicted given a context, with the number of adjacent words considered being a hyper-parameter of the model. Let V be the word vocabulary set of the whole corpus being utilized for training, and v(w) ∈ R^d the word vector for a specific target word w ∈ V, with d being the dimensionality of the vectors. The context word vectors in CBOW are passed through a common projection layer. The input to the projection layer is independent of the direction and distance from the target word w. An average is calculated over all context word projections, and the context projection is then used to predict the target word as a classification task using a softmax activation.

h = \frac{1}{2n} \sum_{k=t-2,\, k \neq t}^{t+2} v_k    (2.1)

Equation 2.1 shows the context representation for the target word. In Equation 2.1 the summation range of 2 is arbitrary; depending upon the window size, the model will encode different types of information. v_k is the vector of the k-th word in the context window. The context window size plays a very important role in determining whether the word vectors in the embedding space are semantically or syntactically aligned. If the context window is small, word vectors will be more sensitive to the syntactic role of the target word; if the context window size is large, word vectors will be more sensitive to the topical role of the target word. Formally, we maximize the following objective function

L = \sum_{i=1}^{C} \log P(w_t \mid w_{t-n}, w_{t-n+1}, \ldots, w_{t+n-1}, w_{t+n}),    (2.2)

where C and i are scalar values ranging over the contexts in the corpus: C is the total number of contexts available in the corpus, t is the index of the target word, n is the window size considered, w_t is the target word, and w_{t-n}, w_{t-n+1}, ..., w_{t+n-1}, w_{t+n} are the context words. P(w_t | ...) is defined as

P(w_t \mid w_{t-n}, w_{t-n+1}, \ldots, w_{t+n-1}, w_{t+n}) = \frac{\exp(v_t^\top h)}{\sum_{i=1}^{V} \exp(v_i^\top h)}.    (2.3)

This reduces L to

L = \sum_{i=1}^{C} \left( v_t^\top h - \log \sum_{i=1}^{V} \exp(v_i^\top h) \right).    (2.4)


Figure 2.1: CBOW softmax based architecture. To show the working of the model, "I love Espresso Coffee" is taken as an example. The input to the model is "I love Coffee" and the target word is "Espresso".

Figure 2.1 represents the architecture used for the softmax based approach. There are two weight matrices, W and W′: W is a linear space transformation from the input to the hidden layer, and W′ is from the hidden layer to the output softmax layer. The learned embeddings are taken from W, which has size (|V| × |H|), or from W′, which has size (|H| × |V|).
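As an illustration of Equations 2.1-2.4, the following is a minimal NumPy sketch of a single CBOW forward pass with a full softmax; the vocabulary size, embedding dimension, and the toy context/target indices are illustrative assumptions, not values used in the thesis.

```python
# Minimal sketch of one CBOW forward pass with a full softmax (Eq. 2.1-2.4).
# All sizes and the toy context are illustrative assumptions.
import numpy as np

V, d = 10000, 100                     # vocabulary size, embedding dimension
W_in = np.random.randn(V, d) * 0.01   # input (context) embedding matrix W
W_out = np.random.randn(V, d) * 0.01  # output embedding matrix W'

context_ids = [12, 7, 431, 98]        # indices of the 2n context words
target_id = 55                        # index of the target word w_t

h = W_in[context_ids].mean(axis=0)    # Eq. 2.1: average of context projections

scores = W_out @ h                    # v_i^T h for every word in the vocabulary
scores -= scores.max()                # numerical stability before exponentiation
probs = np.exp(scores) / np.exp(scores).sum()   # Eq. 2.3: softmax over V words

loss = -np.log(probs[target_id])      # negative of one summand of Eq. 2.2
print(loss)
```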

2.1.1  Negative Sampling

If |V| is in the order of 10^6 or 10^7, a scenario which is realistic in a very large corpus of documents, then the application of the softmax becomes computationally expensive. To address this issue, [Mikolov et al., 2013b] presented a few extensions which increased the training speed and the quality of the word vectors; Negative Sampling was one of the proposed approaches. It is a simplified version of Noise Contrastive Estimation (NCE), originally introduced by [Gutmann and Hyvärinen, 2012]. NCE postulates that a model which is good at its task should be able to separate data from noise by means of logistic regression.

To further simplify the model, the projection layer can be removed [Mikolov et al., 2013b]. Now, let v_t be the vector representation of a target word t and h be the average embedding of the context words, defined as in Equation 2.5.

h = \frac{1}{N} \sum_{i=1}^{N} v_i    (2.5)

Figure 2.2 shows the new architecture of CBOW once the model starts using negative sampling in its objective function; two embedding layers are used, one for the context words and one for the target words (both positive and negative). Words observed in the corpus for a given context are denoted with a (+) sign, and a (−) sign is used for words not observed in that context. The probability of the target word t+ being in the observed word space is given by the following equation:

p(S^+ = 1 \mid v_{t^+}, h) = \frac{1}{1 + \exp(-v_{t^+}^\top h)}.    (2.6)

The probability of observing the noise v_{t^-}, in order to maximize the likelihood, is given by Equations 2.7-2.9:

p(S^- = 0 \mid v_{t^-}, h) = 1 - p(S^- = 1 \mid v_{t^-}, h)    (2.7)
                           = 1 - \frac{1}{1 + \exp(-v_{t^-}^\top h)}    (2.8)
                           = \frac{1}{1 + \exp(v_{t^-}^\top h)}.    (2.9)


Figure 2.2: CBOW negative sampling based architecture. To show the working of the model, "I love Espresso Coffee" is taken as an example. The input to the model is "I love Coffee" and the target word is "Espresso". The words "Cat", "Orange", "President" are negative samples for the target word.

Let S^+ denote the set where the target and context (v_{t^+}, h) are observed in the corpus, S^- the set of (v_{t^-}, h) pairs that are not observed, and N the number of negative samples. Given a training set, maximizing the following objective function gives us the learned word embedding layers, as in Equations 2.10-2.11:

L(\theta) = \sum_{i=1}^{C} \left( p(S^+ = 1 \mid v^i_{t^+}, h^i) + \sum_{j=1}^{N} p(S^- = 0 \mid v^j_{t^-}, h^i) \right)    (2.10)
          = \sum_{i=1}^{C} \left( \frac{1}{1 + \exp(-v^i_{t^+} \cdot h^i)} + \sum_{j=1}^{N} \frac{1}{1 + \exp(v^j_{t^-} \cdot h^i)} \right)    (2.11)

where v_{t^+} stands for the positive target embedding and v_{t^-} for the negative target embedding.
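The following is a minimal NumPy sketch of one summand of the negative-sampling objective above; the sigmoid form mirrors Equations 2.6-2.9, and the embedding matrices and sampled word indices are illustrative assumptions.

```python
# Minimal sketch of the CBOW negative-sampling objective (Eq. 2.6-2.11)
# for one training example; embeddings and sampled indices are illustrative.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

V, d, N = 10000, 100, 5
W_ctx = np.random.randn(V, d) * 0.01   # context embedding layer
W_tgt = np.random.randn(V, d) * 0.01   # target embedding layer

context_ids = [12, 7, 431, 98]
pos_id = 55                            # observed target word t+
neg_ids = [901, 3, 77, 5120, 640]      # N sampled noise words t-

h = W_ctx[context_ids].mean(axis=0)    # Eq. 2.5: average context embedding

pos_term = sigmoid(W_tgt[pos_id] @ h)        # p(S+ = 1 | v_{t+}, h), Eq. 2.6
neg_terms = sigmoid(-(W_tgt[neg_ids] @ h))   # p(S- = 0 | v_{t-}, h), Eq. 2.9

objective = pos_term + neg_terms.sum()       # one summand of Eq. 2.10-2.11
print(objective)
```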

2.1.2  Negative Samples And Sub-sampling In CBOW

Negative Samples

The set of negative samples v_{t^-} for the context words h is constructed by randomly sampling noisy targets from the distribution in which the probability of each word is given by the following equation:

P(w) = \frac{p_{unigram}(w)^{0.75}}{Z},    (2.12)

where p_{unigram}(w) is the unigram frequency of a word w and Z is a normalization constant.

Sub-sampling

Sub-sampling is a method of diluting very frequent words, i.e., it is a way of removing stop-words or very high frequency words [Pathical and Serpen, 2010]. Before creating contexts in a sentence, words in the sentence are sampled based on Equation 2.13:

P(w) = \left( \sqrt{\frac{z(w)}{0.001}} + 1 \right) \cdot \frac{0.001}{z(w)},    (2.13)

where z(w) is the fraction of the total words in the corpus that the word represents, i.e., its frequency of occurrence, as given by Equation 2.14; f(w) is the frequency of the word in the corpus and T is the total number of words in the corpus. Sub-sampling mainly removes the stop words, or the words which have a very high frequency, and it helps in enlarging the context window while still considering words that are associated by context with that word [Mikolov et al., 2013b].

z(w) = \frac{f(w)}{T}    (2.14)
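The two sampling equations can be sketched as follows; the toy word counts are assumptions chosen only to illustrate the computation.

```python
# Minimal sketch of the negative-sampling distribution (Eq. 2.12) and the
# sub-sampling keep probability (Eq. 2.13-2.14); the toy counts are assumptions.
import numpy as np

counts = {'the': 50000, 'coffee': 120, 'espresso': 15}   # f(w)
total = sum(counts.values())                             # T

# Eq. 2.12: unigram distribution raised to the 0.75 power, renormalised by Z.
unigram = np.array([c / total for c in counts.values()])
neg_dist = unigram ** 0.75
neg_dist /= neg_dist.sum()

# Eq. 2.13-2.14: probability of keeping a word when building contexts
# (frequent words like "the" get a low value, rare words a high one).
def keep_probability(word, threshold=0.001):
    z = counts[word] / total                              # Eq. 2.14
    return (np.sqrt(z / threshold) + 1) * threshold / z   # Eq. 2.13

print(dict(zip(counts, neg_dist)))
print({w: round(keep_probability(w), 3) for w in counts})
```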

2.2  Attention

Attention was originally introduced by [Bahdanau et al., 2015] for machine translation. They proposed a neural machine translation model that learned to translate and align in parallel. Following an encoder-decoder architecture, previous methods had the bottleneck of a fixed-length vector representation. In order to solve this issue, the authors introduced a soft-attention mechanism which, given the current decoder state, decides which words of the input sentence to focus on by giving normalized weights to each hidden state of the words in that sentence. More recently, the notion of connecting the encoder and decoder with an attention mechanism has been used in several other research efforts, such as in the work of Vaswani et al. [Vaswani et al., 2017b]. Beyond the performance and higher accuracy of [Vaswani et al., 2017b], they also showed how attention utilizes the information flowing through the network. Figure 2.3 shows an example given by [Vaswani et al., 2017b] of English to French translation, and how for the word it the network focuses on different words based on the context. In the first sentence it refers to "animal" and in the second it refers to "street". While translating to French, the word it takes different genders based on whether it refers to "animal" or "street", and the network is able to learn that in both cases by focusing on specific words given the word it in context. Figure 2.4 shows how much weight the attention mechanism gives to each word when the translation of the word it is in consideration, i.e., the darker the box surrounding a word, the higher the weight the word received. In the left sentence, more weight is given to the word animal, and in the right one, the focus is more on the word street than on the word animal. This clearly shows that the attention mechanism, by understanding the context, understood the weight of each word in the context. We have used the attention mechanism in our work to learn dynamic embeddings.

Figure 2.3: Translation of an English sentence to French using the attention mechanism. In both sentences the target word for translation is "it" but the contexts are different, and in French the words for "it" differ based on the context. The attention mechanism understands this and assigns the right French word for "it" based on the context. Source: [Vaswani et al., 2017b]

The attention mechanism used in this work follows the principle employed by Bahdanau et al. in [Bahdanau et al., 2015]. In order to derive the weights that each of the hidden states (h_1, h_2, ..., h_n) of the encoder should receive, we concatenate each hidden state with the current decoder state (d_{current}). The concatenation [h_i : d_{current}] is transformed through a weight matrix and a hyperbolic tangent (tanh) is applied. A space transformation is then performed again through a weight matrix. The concatenation and the score computation are given by the following equations:

h_c = [h_i : d_{current}] \quad \forall i    (2.15)

score = v^\top \tanh(W^\top h_c)    (2.16)

The weights for each of the hidden states of the encoder can be calculated by normalizing the scores, as shown in Equation 2.17:

\alpha_i = \frac{\exp(score_i)}{\sum_{j=1}^{N} \exp(score_j)}    (2.17)

Figure 2.4: How the attention mechanism focuses on context words for the target word in consideration during translation. In both figures the target word is "it" but the contexts are different. In the sentence on the left, the word "it" refers to "animal", and the attention mechanism puts more focus on the word "animal". In the figure on the right, the word "it" refers to "street" and not "animal"; the attention mechanism understands this and gives more focus to the word "street". [Vaswani et al., 2017b]

The context based vector can finally be calculated as shown in Equation 2.18:

vector_{context} = \sum_{i=1}^{N} \alpha_i h_i    (2.18)
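Equations 2.15-2.18 can be sketched as follows in NumPy; the dimensions and the random encoder/decoder states are illustrative assumptions rather than values from the thesis models.

```python
# Minimal sketch of additive (Bahdanau-style) attention, Eq. 2.15-2.18;
# shapes and the random states are illustrative assumptions.
import numpy as np

n, d_h, d_dec, d_att = 6, 64, 64, 32
H = np.random.randn(n, d_h)               # encoder hidden states h_1..h_n
d_current = np.random.randn(d_dec)        # current decoder state

W = np.random.randn(d_h + d_dec, d_att)   # weight matrix of Eq. 2.16
v = np.random.randn(d_att)                # scoring vector of Eq. 2.16

# Eq. 2.15: concatenate each hidden state with the decoder state.
Hc = np.concatenate([H, np.tile(d_current, (n, 1))], axis=1)

scores = np.tanh(Hc @ W) @ v              # Eq. 2.16
alphas = np.exp(scores - scores.max())
alphas /= alphas.sum()                    # Eq. 2.17: normalised weights

context_vector = alphas @ H               # Eq. 2.18: weighted sum of hidden states
print(alphas, context_vector.shape)
```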

2.3  Bag Of Words

The Bag of Words (BOW) representation is one of the most used word representations in NLP and Information Retrieval (IR). In this methodology, a text such as a sentence or a document is represented as the bag of its words, disregarding semantic and syntactic relationships in the sentence and only keeping the frequency of the words. This statistic is helpful in search [Liu, 2013], document classification [Alahmadi et al., 2013] and topic modelling [Shi et al., 2013; Newman et al., 2010].
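A minimal sketch of a bag-of-words representation, where only word frequencies are kept and order is discarded (the example sentence is an assumption):

```python
# Minimal sketch of a bag-of-words representation: only word frequencies
# are kept; word order and syntax are discarded.
from collections import Counter

sentence = "I love espresso coffee and I love coffee"
bow = Counter(sentence.lower().split())
print(bow)   # Counter({'i': 2, 'love': 2, 'coffee': 2, 'espresso': 1, 'and': 1})
```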

2.4  TF-IDF

Term frequency-inverse document frequency (TF-IDF) [Salton and Buckley, 1988] is another way to encode the information of words. TF was introduced by Hans Peter Luhn1 and IDF was introduced independently by Karen Spärck Jones2. TF is the number of times a word appears in a document, and IDF is a factor which diminishes the weight of terms that occur very frequently across documents and increases the weight of terms that occur rarely. For example, the word "and" or the word "it" will have a very high term frequency, since they occur multiple times in every document, but they are not relevant to the topic of the document.

TF-IDF is the product of two statistics, term frequency and inverse document frequency. Term frequency and inverse document frequency have been very helpful in the fields of Information Retrieval [Ramos et al., 2003; Singhal et al., 2001] and Text Mining [Neto et al., 2000a,b]. Instead of simply counting, TF-IDF is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. TF-IDF is calculated in two parts; the term frequency TF is calculated as

TF(t_j) = \frac{f(t_j)}{\sum_{i=1}^{N} f(t_i)},    (2.19)

where f(t_j) is the number of times the word t_j occurs in a document. The inverse document frequency IDF is calculated as

IDF(t_j) = \log\left( \frac{|D|}{|D|_{t_j}} \right),    (2.20)

where |D| is the total number of documents and |D|_{t_j} is the number of documents containing the word t_j. The final equation becomes

TF(t_j) \cdot IDF(t_j) = \frac{f(t_j)}{\sum_{i=1}^{N} f(t_i)} \cdot \log\left( \frac{|D|}{|D|_{t_j}} \right)    (2.21)

1 https://en.wikipedia.org/wiki/Hans_Peter_Luhn

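A minimal sketch of Equations 2.19-2.21 over a toy corpus (the three documents below are assumptions chosen only to make the computation concrete):

```python
# Minimal sketch of TF-IDF (Eq. 2.19-2.21) over a toy corpus of three documents.
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "banks lend money and charge interest",
]
tokenised = [d.split() for d in docs]

def tf(term, doc_tokens):
    counts = Counter(doc_tokens)
    return counts[term] / len(doc_tokens)                 # Eq. 2.19

def idf(term):
    containing = sum(1 for d in tokenised if term in d)   # |D|_t
    return math.log(len(tokenised) / containing)          # Eq. 2.20

print(tf("cat", tokenised[0]) * idf("cat"))               # Eq. 2.21
```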

Chapter 3

Related Work

The research in this thesis relates to several aspects of recent research work, primarily in the direction of generating context independent word embeddings [Mikolov et al., 2013a; Alsuhaibani and Bollegala, 2018]. This chapter discusses some of the most recent and prominent works in these directions, and their relation to the current work.

3.1  Learning Vector Representations Of Words

Considerable work has been done in recent years on learning vector representations of words. In this section we intend to discuss only the machine learning based models which are relevant to our work. Our suggested model is hybrid, utilizing input signals from the raw text and the underlying taxonomy, and relies on the work of [Mikolov et al., 2013a] and the efficient estimation described in [Mikolov et al., 2013b].

The log-linear models in [Mikolov et al., 2013a] learn distributed representations of words with a computationally cheap approach, enabling for the first time the learning of word representations at very large scale. [Mikolov et al., 2013a] introduced two log-linear models, Continuous Bag of Words (CBOW) and Skip-Gram, but we focus only on CBOW, although our approach is also compatible with the Skip-Gram model. In the CBOW model the surrounding words of a target word are passed through a projection layer and then an average of the projections is derived. The projection layer is shared among all the context words, and the projection itself is independent of the position of a word in the context window.

CBOW learns generic embeddings of words well, using an average of the bag of words in the context window. However, for many NLP applications such as WSD, NER and MT it is important that the information captured at the word level not only uses the surrounding words, but also uses semantic and syntactic information for the words in consideration, as opposed to the Bag Of Words approach. A better resolution would be to consider the directional flow of information from the left and right of the target word. To address this issue, [Melamud et al., 2016a] created context2vec, where two bidirectional LSTMs are used to encode the context information around the target word. One of the LSTMs encodes the right side context of the target word, while the second LSTM encodes the left side. A final hidden layer is generated from these two LSTMs, whose concatenated output is used to predict the target word.

Other recent research works in the field include ELMo [Peters et al., 2018] and Charagram [Wieting et al., 2016]. In the former, the model utilizes an N-layer bidirectional LSTM and computes 2N + 1 representations of the context. All bidirectional layers are fine-tuned appropriately, given the application of the model. In the latter work, word vectors are enriched with sub-word information utilizing character n-grams, an idea which is similar to FastText1 embeddings [Bojanowski et al., 2016].

3.2  Semantic Knowledge Based Representations

Research on semantic relationships between words (e.g. synonyms, antonyms, hypernyms) has been used to improve the vector representation of words [Bordes et al., 2011; Alsuhaibani and Bollegala, 2018; Smaili et al., 2018]. Most of this work relates either to joint training [Vashishth et al., 2018], where the word vectors are learned from a text corpus and a knowledge base, or to a fine-tuning step where the vectors are further improved once they have been learned on a big corpus. [Faruqui et al., 2014] proposed to retrofit (fine-tune) word vectors to semantic lexicons. In this process of retrofitting, the word vectors learned from a big corpus are further refined using semantic constraints from semantic knowledge sources. The semantic knowledge sources used for this type of learning are WordNet, FrameNet, etc. Another fine-tuning approach, proposed by [Mrksic et al., 2016], counter-fits word vectors to linguistic constraints. Counter-fitting is a method to inject antonym and synonym relations of words into vector space representations in order to learn better semantic similarity. The objective function consists of two parts: the first is an antonym repeller, whose objective is to maximize the distance between antonyms, and the other is a synonym attractor, whose goal is to bring synonyms closer to each other in the Euclidean space. The retrofitting and counter-fitting fine-tuning processes are independent of how the vectors were created. However, retrofitting refines the vector space representation using relational information from knowledge bases by encouraging linked words to have similar vector representations, whereas the counter-fitting approach does not rely on knowledge bases at all; it relies on publicly available antonym and synonym relation data. [Alsuhaibani and Bollegala, 2018] proposed an approach where joint training is done on a corpus and a knowledge base. Their objective function exploits the objective used in GloVe [Pennington et al., 2014]. The GloVe model leverages statistical information from the global word co-occurrence matrix in its objective function. The drawback of GloVe is that it is not able to learn word vector representations when co-occurrence is very rare, hence failing to learn the desired semantics. [Alsuhaibani and Bollegala, 2018] improved the objective by capturing not only word co-occurrence but also existing semantic relationships between words in a knowledge base. To further enhance this work, [Vashishth et al., 2018] utilized Graph Convolutional Networks [Kipf and Welling, 2016]; in order to learn better word representations, the model devised by [Vashishth et al., 2018] uses syntactic information, encoded in the form of edges in the graph, and these relations are fed to the graph as an adjacency matrix to learn word vector representations. Semantic information is also exploited to learn these word vector representations, but only as a fine-tuning step.

3.3  Sense-based Representations Of Words And Texts

The models discussed above either create vector representations of words which are context independent, or depend on the context of the surrounding target words. The idea of using a hybrid vector representation of words or texts which includes both the surface word and senses sourced from a sense inventory (e.g., WordNet) is not a new one and has been explored in the past. Before the rebirth of deep neural networks to learn such representations, hybrid word sense representations would be created by means of generalized vector space models, e.g. [Wong et al., 1985; Tsatsaronis and Panagiotopoulou, 2009], where the sense information would either be used in a hybrid vector space, or would be utilized to learn pairwise term similarities in order to break the orthogonality of the original term vectors in a traditional Euclidean vector space.

More recently, [Neelakantan et al., 2014] derived word representations from the distribution of their senses. They proposed a model which is an extension of Skip-Gram, where they learned multiple embeddings per word type by learning non-parametric estimations of the number of senses per word type. The Bayesian Skip-Gram model proposed by [Brazinskas et al., 2017] creates context-based embeddings in which the senses of the word are not discrete, overcoming the disadvantage of the previous approach by [Neelakantan et al., 2014]. The senses of a word are created from a prior distribution over the word's different occurrences. These priors essentially capture the different meanings, or senses, of each target word, given its different contexts. Our work lies in this branch of learning word vector representations. The way our work differs from the above mentioned ones is that they do not use knowledge bases to find the sense of a word; they learn the senses from corpora by assuming senses to be discrete or continuous in nature. In our work we capitalize on knowledge bases and assume that we have a correct sense available for certain target words.


Chapter 4

Data Sets

This chapter discusses in detail the different data sets used for training the baseline model as well as the taxonomy based models. The Wordnet structure, and what type of knowledge from the graph is used in this thesis, is presented under the Synset Data Set section. This chapter also describes the different evaluation data sets used for evaluating the models.

4.1  Corpus Data Sets

This section describes the different data sets used for training the models proposed in this thesis. The data sets that we have used are the Wikipedia data dump and the Semcor sense-tagged corpus. The Wikipedia data dump is commonly used for learning word embeddings in an unsupervised way [Neelakantan et al., 2014] (and for training other NLP machine learning based models as well). We trained our baseline on Wikipedia, and the embeddings learned from this data set are further fine-tuned on the Semcor data set. Semcor is a sense-tagged corpus; the data in this corpus is helpful for learning taxonomy based word vector representations. This is discussed in detail in later chapters.

4.1.1  Wikipedia

To train our baseline we chose the Westbury Lab Wikipedia Corpus 2010 snapshot as our corpus. The data size is 6 GB in raw format; it has over 2 million documents, 990,248,478 words and around 40 million sentences. Pre-processing and cleaning of the data is also required, as the data is in raw format. The cleaning and pre-processing steps we followed were similar to those done by [Mikolov et al., 2013a].

4.1.2  Semcor

Semcor is a sense-tagged corpus of English created at Princeton University by the WordNet Project research team [Miller, 1995]. It was created under the WordNet project [Miller, 1995]1, and is one of the biggest manually annotated sense-tagged corpora produced for English. The corpus is a subset of the Brown Corpus (700,000 words, with more than 200,000 sense-annotated). The corpus has 37,176 sentences, and each word is tagged with its sense and part of speech.

For each sentence, words, multi-words (also known as phrases2), and named entities are tagged. Table 4.1 shows how an example sentence, "The Fulton County Grand Jury said Friday an investigation of Atlanta recent primary election", is annotated in Semcor. Every word is marked with a Part of Speech (POS) tag, and most (though not all) words also have a lemma, which is the base form of the word. Each word which has a tagged lemma also has a corresponding synset key defining the link between the word and its corresponding synset in Wordnet. Some of the words are joined together to form phrases (multi-words); for example primary election becomes primary_election and Fulton County Grand Jury becomes Fulton_County_Grand_Jury. The lemma of such phrases is generally group, highlighting the fact that the more abstract meaning of the phrase is a group. This helps in learning node based embeddings of the word. The annotation is known to be imperfect: [Bentivogli and Pianta, 2005] showed that 2.5% of the words are not aligned to the right synset.

1 Wordnet is distributed under the Princeton Wordnet License.
2 https://aclweb.org/aclwiki/Multiword_Expressions


Words                      POS   Lemma              Synset key
The                        DT    -                  -
Fulton_County_Grand_Jury   NN    group              group%1:03:00::
said                       VBD   say                say%2:32:00::
Friday                     NNP   friday             friday%1:28:00::
an                         DT    -                  -
investigation              NN    investigation      investigation%1:09:00::
of                         IN    -                  -
Atlanta                    NNP   atlanta            produce%2:39:01::
recent                     JJ    recent             recent%3:00:00:past:00:
primary_election           NN    primary_election   primary_election%1:04:00::

Table 4.1: Example of how a sentence is annotated in Semcor. The column Words contains the words of the sentence, and the other columns are the tagged attributes for each word.
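The sense keys shown in Table 4.1 can be mapped back to WordNet synsets; below is a minimal sketch assuming NLTK's WordNet interface is available (the key used is the one for "said" taken from the table).

```python
# Minimal sketch: resolving a Semcor sense key to its WordNet synset via NLTK.
# Assumes nltk and its WordNet corpus are installed (nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

lemma = wn.lemma_from_key('say%2:32:00::')   # sense key from Table 4.1
synset = lemma.synset()
print(synset.name())          # the synset this sense key points to
print(synset.definition())    # its gloss
```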

The corpus is divided into two parts. In Semcor-All, 186 texts have all open-class words (such as nouns, verbs, adjectives and adverbs) semantically annotated; this part has 359,732 word tokens [Lupu et al., 2005], out of which 192,639 are semantically annotated. The second part, Semcor-Verbs, has only verbal occurrences annotated with senses: 41,497 out of 316,814 tokens. The syntactic (POS tags) and semantic (word sense) information tagged in Semcor has been very useful in various parsing experiments [Bickel et al., 2000; Agirre et al., 2008].

4.1.3  Other Corpora

Although Semcor is a manually tagged corpus, its limitation is that it has only 37,176 sentences, which makes it possibly too small for training deep learning models. This is one of the prime reasons we explored more corpora. The largest publicly available one is the One Million Sense-Tagged Instances (OMSTI) corpus created by [Taghipour and Ng, 2015]. It is semi-automatically annotated, contains one million training samples, and the words are annotated with Wordnet version 3.0. It has 820,557 sentences and 35,843,024 words, out of which 920,794 are annotated. We did not use it because this corpus includes the Semcor data and most of the accurate tagging comes from the Semcor part of the annotated data only. The remaining data in OMSTI is not densely tagged and has a low accuracy, which makes it not a good candidate for learning synset based embeddings. Table 4.2 shows statistics of the different corpora which were considered for joint learning of the models. Similar to OMSTI, most of the corpora in Table 4.2 have considerable noise (OMSTI, WordNet GlossTag), or they are very small in size (MASC, Ontonotes), or they are only available under license (DSO).

Corpus            Sentences   Total Words   Annotated Words   Nouns    Verbs    Adj     Adv
Semcor            37176       778587        229517            87581    89037    33751   19148
DSO               178119      5317184       176915            105925   70990    0       0
WordNet GlossTag  117659      1634691       496776            232319   62211    84233   19445
MASC              34217       596333        114950            49263    40325    25016   0
OMSTI             820557      35843024      920794            476944   253644   190206  0
Ontonotes         21938       435340        52263             9220     43042    0       0

Table 4.2: Statistics of the different annotated corpora considered for training the model. The table shows the total number of words and the distribution of annotated words segmented by POS tag.

4.2  Synset Data Set

4.2.1  Wordnet

Wordnet [Miller, 1995] is the most commonly used lexical database for English sense relations [Lu et al., 2015; Goikoetxea et al., 2016; Khodak et al., 2017; Jimenez et al., 2019]. It consists of three separate databases, one with nouns, one with verbs, and a third one with adjectives and adverbs; closed class words, i.e., function words in English, which include conjunctions (and, or), articles (the, a), demonstratives (this, that), and prepositions (to, from, at, with), are not included. Each database contains a set of lemmas, each one annotated with a set of senses. The Wordnet 3.0 release has 117,798 nouns, 11,529 verbs, 22,479 adjectives, and 4,481 adverbs. The average noun has 1.23 senses, and the average verb has 2.16 senses. Wordnet has a graph based, tree like structure. Figure 4.1 shows what a subset of this structure looks like. At each level there are siblings of the word. If we traverse from the leaf to the root we get a more general meaning of the word, and if we traverse from the root towards the leaves we move towards more specific meanings of the words. Wordnet has cycles in the graph, as most words are polysemous in nature, because of which a single node can be connected to multiple parents.

In Figure 4.1 the root synset is entity, and as we traverse towards the leaves we can see the partition of living and non-living things. If we focus on the living node, it gets further subdivided into plants and persons. A person, for example, can continue down to a person's name, and so on. Table 4.3 shows the typical relationships in Wordnet, their definitions and examples. There are other relationships as well, but these are the ones which contribute approximately 90% of the links between nodes. Some of the edges between nodes are bi-directional whereas some are uni-directional, depending on the type of relationship. For example, a relationship between parent and child is uni-directional. On the other hand, words which are antonyms of each other have a bi-directional relationship.

For our application we will not be utilizing all the relationships in Wordnet; we focus on the hypernym and hyponym relationships, since these are the ones which contribute the maximum number of edges in WordNet. Their contribution is approximately 70% in the noun and verb categories, while the other 23 relationships contribute only 30% of the edges in WordNet. Table 4.3 lists the major relationships in Wordnet, their description and an example of each relationship. The relationships are described using two pseudo nodes X and Y.

Figure 4.2 shows the hypernym structure for the synset male. The nodes that are filled with color are the hypernyms for the synset male; the other nodes shown in Figure 4.2 are related to the hypernym nodes. The hops in the figure display how far a synset is from the surface word. For example, for a surface word (e.g. John) whose synset is male, male is 1 hop away, person is 2 hops away, organism is 3 hops away, and so on. The farther we go in number of hops, the more general the meaning of the word becomes, and vice-versa. This figure is a typical example of a hypernym representation of synsets in Wordnet. During training/evaluation we fixed the maximum number of hops, i.e., a hyper-parameter, and it is independent of the word or synset.

Relationship   Description                                        Example
Hypernym       Y is a hypernym of X if every X is a (kind of) Y   breakfast → meal
Hyponym        Y is a hyponym of X if every Y is a (kind of) X    meal → lunch
Synonyms       Words with the same meaning                        soil ↔ dirt
Holonyms       Y is a holonym of X if X is a part of Y            Fulton_Grand_County → group
Meronyms       Y is a meronym of X if Y is a part of X            faculty → professor
Antonyms       Semantic opposition between lemmas                 leader ↔ follower

Table 4.3: The major relationships in Wordnet, their description and an example of each relationship. The relationships are described using two pseudo nodes X and Y. Hypernym and Hyponym are inverses of each other, as are Holonyms and Meronyms.

4.3  Evaluation Data Set

Below are the data sets we considered for evaluating the models. We used these data sets to perform both intrinsic and extrinsic evaluation of the models. Intrinsic evaluation of word vectors is the evaluation of word vectors on specific intermediate sub-tasks (such as word similarity). Extrinsic evaluation of word vectors is the evaluation of word vectors on the real task at hand (such as semantic similarity of sentences). We briefly describe each of the data sets and the type of evaluation they help with.

4.3.1  WS-353

Figure 4.1: How hypernym/hyponym relationships are represented in WordNet. The relationships shown are IS-A relationships, which contribute approximately 70% of the edges for nouns and verbs. The figure also shows how two nodes which are not semantically close to each other are connected (e.g. Table and Female, whose common ancestor is Physical Object). The data represented is only a subset of the relationships in Wordnet.

The Word Similarity-353 collection [Finkelstein et al., 2001] contains two sets of word pairs along with a human-assigned similarity score (1-10), where 1 means "they have almost no similarity" and 10 means "they are equivalent or have similar meanings". The data set can be used to test algorithms implementing semantic similarity measures (i.e., algorithms that numerically estimate the similarity of two words in one language). This data set is used for quantitative intrinsic evaluation of the embeddings, i.e., it helps in understanding how well the word vectors are semantically aligned to other word vectors in the space. This evaluation is done with the help of cosine similarity3 and Spearman correlation4. Cosine similarity is used to find the similarity between two vectors, and Spearman correlation measures how well the human-annotated scores of the word pairs and the cosine similarity scores of the corresponding word vectors correlate with each other. The higher the correlation (maximum value 1), the better the model.
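A minimal sketch of this intrinsic evaluation follows, using SciPy's Spearman correlation; the embedding dictionary, word pairs and human scores are illustrative assumptions rather than the actual WS-353 data.

```python
# Minimal sketch of the WS-353-style intrinsic evaluation: cosine similarity of
# the learned vectors vs. human similarity scores, compared with Spearman's rho.
# The embeddings, word pairs and human scores below are illustrative assumptions.
import numpy as np
from scipy.stats import spearmanr

embeddings = {w: np.random.randn(100) for w in ["bank", "money", "river", "water"]}

pairs = [("bank", "money"), ("bank", "river"), ("money", "water")]
human_scores = [8.5, 6.0, 1.5]          # annotator scores on the 1-10 scale

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

model_scores = [cosine(embeddings[a], embeddings[b]) for a, b in pairs]
rho, _ = spearmanr(human_scores, model_scores)
print(f"Spearman correlation: {rho:.3f}")
```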

4.3.2  SemEval Semantic Textual Similarity

Semantic Textual Similarity (STS) measures the degree of semantic equivalence. Similar to the data set described in subsection 4.3.1, this data set also has human-annotated scores in the range of 1-10, but it differs from WS-353 in that instead of word pairs the data set has sentence pairs. Since the data set has pairs of sentences and not pairs of words, this becomes a downstream application and thus helps in the quantitative extrinsic evaluation of word vectors.

STS is related to both textual entailment (TE) and paraphrase, but differs in a number of ways, and it is more directly applicable to a number of NLP tasks. TE in NLP is a directional relation between sentences: the relation holds whenever the truth of one sentence follows from another sentence. STS is different from TE inasmuch as it assumes bidirectional equivalence between the pair of textual sentences. In the case of TE the equivalence is only directional, e.g. a person is an organism, but an organism is not necessarily a person. STS also differs from both TE and paraphrase in that, rather than being a binary decision (e.g. an organism is not a person), STS is a graded similarity notion (e.g. a plant and a person are more similar than a non-living thing and a person). This bidirectional nature of STS is useful for NLP tasks such as Machine Translation evaluation, information extraction and question answering.

3 https://en.wikipedia.org/wiki/Cosine_similarity

Figure 4.2: The hypernym structure for the synset male. The nodes that are filled with color are hypernyms of the synset male; the other nodes show their relationships to the nodes in consideration. Although the uncolored nodes are not part of the hypernym structure itself, because of the relationships in Wordnet the nodes learn from each other.

For our evaluation we considered SemEval's STS tasks of 2012, 2013 and 2014. Each year's task was further divided by the corpora from which the sentences were drawn. The data comprise pairs of news headlines (HDL), pairs of glosses (OnWN), image descriptions (Images), Deft-related discussion forum posts (Deft-forum), news (Deft-news), tweet-to-news-headline mappings (Tweets), and newswire headlines. For HDL, naturally occurring news headlines were gathered by the Europe Media Monitor (EMM) engine from several different news sources. For OnWN, sense definition pairs from OntoNotes [Hovy et al., 2006] and WordNet are used; the pairs contain similar words, but the words differ in the sense in which they are used. The Images data set is a subset of the PASCAL VOC-2008 data set [Rashtchian et al., 2010], which consists of 750 images and has been used by a number of image description systems [Li et al., 2009; Marchesotti et al., 2009]. Deft-forum and Deft-news come from the DEFT data: Deft-forum contains forum post sentences, and Deft-news contains news summaries.

We segmented the data into two parts for each data set, since we trained our models on the SemCor data set, which has a limited vocabulary. The first part contains only those sentences whose words are present in the SemCor vocabulary; the second part includes all sentences. In a nutshell, the first part is a subset of the second one. Table 4.4 shows the total number of pairs per year and per data set, and also shows how many pairs were considered when evaluating only on the words present in the training corpus.
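As a sketch of how such an extrinsic evaluation can be run, the snippet below scores a sentence pair by the cosine similarity of averaged word vectors; the averaging choice and the helper names are assumptions for illustration, not necessarily the exact scoring used in the experiments.

    # Sketch of scoring an STS sentence pair, under the assumption that a
    # sentence vector is the average of its in-vocabulary word vectors.
    # `embeddings` and the tokenized inputs are placeholders.
    import numpy as np

    def sentence_vector(tokens, embeddings, dim):
        vectors = [embeddings[t] for t in tokens if t in embeddings]
        if not vectors:                    # no known words: back off to zeros
            return np.zeros(dim)
        return np.mean(vectors, axis=0)

    def sts_score(sent1_tokens, sent2_tokens, embeddings, dim=300):
        v1 = sentence_vector(sent1_tokens, embeddings, dim)
        v2 = sentence_vector(sent2_tokens, embeddings, dim)
        denom = np.linalg.norm(v1) * np.linalg.norm(v2)
        return float(np.dot(v1, v2) / denom) if denom > 0 else 0.0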

4.3.3 Semcor

We use 1,000 sentences from the SemCor data set for extrinsic evaluation. Since our model's objective is to learn more context-aware embeddings for the target word, we performed a task which, to the best of our knowledge, has not been done before on a Word Sense Disambiguation (WSD) data set, WSD being the NLP problem of identifying which sense of a word is used in a sentence. From the 1,000 sentences, 1,600 contexts were created, and the model was evaluated on predicting the target word given the context. For evaluation we used Recall at 1, 5, 10, 50 and 100.
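A minimal sketch of the Recall@k computation used for this task, assuming that for every context we store a ranked list of predicted words and the single gold target word (names are illustrative):

    # Sketch of Recall@k for the SemCor target-word prediction task: for each
    # context we have a ranked candidate list and one gold target word.
    def recall_at_k(ranked_predictions, gold_targets, k):
        hits = sum(1 for ranked, gold in zip(ranked_predictions, gold_targets)
                   if gold in ranked[:k])
        return hits / len(gold_targets)

    # Example usage with the cut-offs used in the thesis:
    # for k in (1, 5, 10, 50, 100):
    #     print(k, recall_at_k(ranked_predictions, gold_targets, k))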


Year  Dataset      Number of pairs  Pairs with words in training vocabulary  Source
2012  MSRpar       700              50                                       newswire
2012  MSRvid       700              250                                      videos
2012  OnWN         750              206                                      glosses
2012  SMTnews      400              120                                      MT eval.
2012  SMTeuropar   460              93                                       MT eval.
2013  HDL          750              23                                       newswire
2013  OnWN         561              200                                      glosses
2014  OnWN         750              225                                      glosses
2014  Deft-forum   450              112                                      forum posts
2014  Deft-news    300              10                                       news summary
2014  Images       750              230                                      image descriptions
2014  Tweet-news   750              25                                       tweet-news pairs

Table 4.4: The table lists 13 different SemEval STS data sets used for quantitative extrinsic evaluation: 5 from the 2012 evaluation, 2 from 2013 and 6 from 2014. The table also shows the number of pairs in each data set and the number of pairs when only words from the SemCor vocabulary are considered.


Chapter 5

Learning Hybrid Word Embeddings

This chapter is the core of the thesis, providing a detailed description of the joint learning of the global representation and the taxonomical representation of words. Before discussing the joint learning of the models, the chapter also describes how the WordNet data preprocessing is done. Since in our approach the context-based representation learns from a taxonomical representation of the target word, we have named our model Hybrid Context Embedding from Taxonomical Path (HyGAP).

Figure 5.1: Deep learning architecture of HyGAP. The left part represents the Context Component (C) of the model and the right part represents the Taxonomical Component (T).

5.1 Overview

Figure 5.1 shows an overview of the architecture of the suggested model for learning hybrid embeddings from context and taxonomical paths. The model consists of two components: the Context (C) component and the Taxonomical (T) component. The role of component C is to create a context embedding for a given target word, and the role of component T is to encode a taxonomic representation of the target word. For example, given the following sentence:

All the players were gathered around the pool table in the lounge, testing their new cues.

Assuming the target word is pool, the role of component C is to encode the vector representation of all the context words of the target word, and the role of component T is to encode the most appropriate taxonomic representation of the target word. In this example, the taxonomic derivation of the target word pool, following WordNet's hierarchy, becomes

Entity --IS-A--> ... --IS-A--> Activity --IS-A--> Game --IS-A--> Table --IS-A--> Pool,

i.e., the taxonomy graph is traversed from the broader to the more specific meaning. The objective function then tries to maximize the likelihood of the target word given the context, by increasing the similarity between the context vector representation and the taxonomic representation of the target word.

Component C in our approach can be replaced by any component that encodes the context representation of a target word, for example CBOW [Mikolov et al., 2013a], Context2Vec [Melamud et al., 2016b] or ELMo [Peters et al., 2018]. The major contribution of this work is not to derive a novel model for the context representation of a target word (i.e., a novel C component), but to enhance such models by allowing the context representation of the target word to be combined with the appropriate taxonomical derivation and representation. A major benefit of this combination is that it can fully exploit existing methods that disambiguate words in context, and utilize only this specific context from the taxonomy to enhance the embedding representation, leading to a hybrid word embedding.

5.2 Preparing WordNet to Apply HyGAP

As discussed in Chapter 4, the data used for training and evaluation is the SemCor data set. In SemCor, words are manually assigned their correct sense in every occurrence, using WordNet as the sense inventory. We have used version 3.0 of SemCor, which is mapped to WordNet 3.0. The total number of sentences in SemCor is 37,176, with a total of ~360,000 word occurrences, of which ~193,000 (~55%) are tagged with a WordNet sense (synset).

We used the data pipelines described in [Mikolov et al., 2013a] and [Mikolov et al., 2013b] for the preprocessing of the data. Following the work of [Mikolov et al., 2013b], removal of the most frequent words (sub-sampling) and negative sampling were applied for the C component of the model. However, due to the joint training of components C and T, the negative sampling approach entails combining the output of both models. All words with fewer than 5 occurrences in SemCor were removed.
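The sketch below illustrates the two preprocessing steps named above in the form used by Mikolov et al. (2013b): the keep-probability for sub-sampling frequent words and the unigram^0.75 distribution for drawing negative samples. The threshold t = 1e-5 and the helper names are illustrative assumptions, not the thesis's exact settings.

    # Sketch of word2vec-style preprocessing: sub-sampling of frequent words
    # and a unigram^0.75 table for negative sampling (Mikolov et al., 2013b).
    # `counts` is a word-frequency mapping (e.g. collections.Counter over the
    # corpus) and `total` the total token count.
    import math
    import random

    def keep_probability(word, counts, total, t=1e-5):
        freq = counts[word] / total
        return min(1.0, math.sqrt(t / freq)) if freq > 0 else 0.0

    def subsample(tokens, counts, total, t=1e-5):
        return [w for w in tokens
                if random.random() < keep_probability(w, counts, total, t)]

    def negative_sampling_table(counts, power=0.75):
        words = list(counts)
        weights = [counts[w] ** power for w in words]
        # draw negatives with: random.choices(words, weights, k=num_negatives)
        return words, weights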

The different types of edges in WordNet that can be considered for the synset derivation are hypernym, hyponym, holonym, meronym, etc. However, as discussed in detail in Chapter 4, hypernyms and hyponyms are the only relationships considered in this research, since they contribute the majority of the edges in WordNet. Thus each word's hypernyms/hyponyms are considered for creating the synset derivation of the word. Algorithm 1 shows how we convert the WordNet graph structure into a tree structure, i.e., a hypernym structure.

Algorithm 1: Converting the WordNet graph into a tree structure

    sent_arr = sentence array
    for i = 0; i < number_of_sentences; i++ do
        for j = 0; j < number_of_words; j++ do
            if sent_arr[i][j] is tagged then
                hypernym_paths = list of paths to the root (shortest first)
                for k = 0; k < number_of_paths; k++ do
                    if path = correct_path then
                        word = path_tagged
                    else
                        add path to words_paths_list
                    end if
                end for
            end if
        end for
    end for
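The same kind of derivation can be obtained with NLTK's WordNet interface; the sketch below only illustrates the idea of Algorithm 1 (keep the shortest root-to-synset path), it is not the thesis implementation, and it assumes the WordNet corpus has been downloaded via nltk.download('wordnet').

    # Sketch: extract the shortest hypernym derivation of a synset with NLTK.
    from nltk.corpus import wordnet as wn

    def shortest_hypernym_path(synset):
        # hypernym_paths() returns all root-to-synset paths; keep the shortest.
        paths = synset.hypernym_paths()
        return min(paths, key=len)

    # Example: the derivation of one sense of "pool".
    path = shortest_hypernym_path(wn.synset('pool.n.01'))
    print([s.name() for s in path])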

In practice, as illustrated in Figure 5.2, the synsets are interconnected via edges. If this edge information is utilized while creating the vector representations of synsets, then synsets can learn from other synsets as well, gaining semantic information from them. The examples in Figure 5.2 also illustrate the relations between different words as well as their polysemous nature. For example, the word bank can be associated both with a financial institution (depicted in the Financial graph) and with river and water (Water graph). If we assume that for a certain context we know that the synset of the word bank belongs to the Financial graph, then the embeddings of Financial Institution and Insurance will influence the embedding of the synset that represents bank. Similarly, if in a certain context bank means river bank, then the synset representing the river bank will learn from words like river, body of water, pool and lake.

Figure 5.2: Examples of disambiguated occurrences of words, and how they relate in the conceptual graphs that can be extracted from WordNet.

In order to encode this information, the hypernym relation of the synsets is considered, and in case a synset has multiple hypernyms, the one leading to the root entity with the least number of nodes is considered the right derivation for the synset. Only those synsets were considered that occurred more than five times in the corpus. Algorithm 2 shows how nodes are removed on the basis of their number of occurrences. Since for each word there is always a dominant synset with a comparatively higher frequency than the word's other synsets (known in the NLP community as the Most Frequent Sense), in order to perform negative sampling and allow less frequent synsets to be picked, we increased the frequency of the negative synsets for every occurrence of a positive synset of that word.

Algorithm 2: Removal of nodes based on frequency or branch_max

    words_hypernym_list
    for i = 0; i < number_of_words; i++ do
        for j = 0; j < number_of_hypernyms; j++ do
            if synset is not tagged as correct at least once then
                remove synset
                continue
            end if
            for k = 0; k < length_till_root; k++ do
                if frequency(node) ≤ minimum_threshold then
                    remove node
                end if
            end for
        end for
    end for
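The boosting of negative synset frequencies described before Algorithm 2 could be implemented along the following lines; the increment of 1 per occurrence is an assumption, since the exact value is not specified here.

    # Sketch of the frequency boost for non-dominant synsets: each time a word
    # occurs with its (positive) synset, the counts of the word's other synsets
    # are also increased so they remain visible to negative sampling.
    # The +1 increment is an assumption, not the thesis's exact setting.
    def update_synset_counts(counts, word_to_synsets, word, positive_synset):
        counts[positive_synset] = counts.get(positive_synset, 0) + 1
        for synset in word_to_synsets[word]:
            if synset != positive_synset:
                counts[synset] = counts.get(synset, 0) + 1  # boost negatives
        return counts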

In parallel, we introduce a hyperparameter that defines the number of hops allowed when traversing the WordNet taxonomy graph following hypernymy. This parameter thresholds the number of hypernyms considered when traversing towards the WordNet root. The reason for this is that higher-level synsets tend to produce very general and not very meaningful representations (e.g., almost everything is an entity or a process) [Luu et al., 2016]. We refer to this parameter as branch_max.
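A possible way to apply branch_max to a root-to-synset derivation is sketched below; keeping the nodes closest to the synset (and dropping the general ancestors near the root) is an assumption consistent with the motivation above.

    # Sketch: truncate a root-to-synset hypernym path to branch_max hops,
    # keeping the nodes closest to the synset and dropping general ancestors.
    def truncate_derivation(path_root_to_synset, branch_max):
        return path_root_to_synset[-branch_max:]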


5.3 Context Component C & Taxonomical Component T

In the following section we describe how components C and T are designed, implemented, and combined. We start with a straightforward formulation, and iterate over improvements related to the non-linear projection of component T, as well as the incorporation of an attention mechanism.

5.3.1 CBOW and CBOW_T

In a straightforward implementation of the HyGAP architecture illustrated in Figure 5.1, we can use CBOW for the implementation of both components C and T. Component C encodes the global embedding (global representation) of the context words (CBOW), while component T encodes the right taxonomical path derivation for the target word (CBOW_T). In the joint learning, since the target is not the word itself but one of its synsets, the negative samples chosen are also synsets, not surface words. It is also the responsibility of component T to encode randomly sampled derivations of other words. The global embedding of the context words is derived as shown in Equation 5.1, where v_i^{global} is the global vector representation of each word in the context, v_{context} is the global vector representation of the context and N is the context window size.

v_{context} = \frac{1}{N} \sum_{i=1}^{N} v_i^{global}    (5.1)

The embedding of the taxonomical path derivation, in our case stemming from the linearized WordNet graph using the paths following the hypernym relation, is shown in Equation 5.2, where v_j^{synset} is the vector representation of each node in the derivation of the synset, v_{synset} is the vector representation of the target synset and L is the number of nodes considered for creating the synset vector.

v_{synset} = \frac{1}{L} \sum_{j=1}^{L} v_j^{synset}    (5.2)
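A direct reading of Equations 5.1 and 5.2 as code, assuming the embeddings are stored as word-to-vector and node-to-vector mappings (a sketch, not the training implementation):

    # Sketch of Equations 5.1 and 5.2: the context vector is the average of the
    # global word vectors in the window, and the synset vector is the average
    # of the node vectors along its (possibly truncated) derivation.
    import numpy as np

    def context_vector(context_words, global_embeddings):
        return np.mean([global_embeddings[w] for w in context_words], axis=0)   # Eq. 5.1

    def synset_vector(derivation_nodes, synset_embeddings):
        return np.mean([synset_embeddings[n] for n in derivation_nodes], axis=0)  # Eq. 5.2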

The vector for a WordNet synset is thus created by taking the average embedding of its derivation, where L can be at most branch_max. For training the model, during the first 5 epochs only CBOW is trained, using global embeddings for both target and context words, with the objective function defined in Equation 5.3, where U is the total number of contexts in the corpus, S^+ is the set of cases in which the target word v^i_{(t+)} is observed with the context h_i in the data set, h_i is v_{context}, and S^- is the set of cases in which the target word v^j_{(t-)} is not observed with h_i in the data set.

L = \sum_{i=1}^{U} \left( p(S^{+} = 1 \mid v^{i}_{(t+)}, h_i) + \sum_{j=1}^{Y} p(S^{-} = 0 \mid v^{j}_{(t-)}, h_i) \right)    (5.3)

Once the CBOW objective function converges, the CBOW and CBOW_T models, i.e., components C and T respectively, are trained jointly. The objective function then changes as described in Equation 5.4. The difference from Equation 5.3 is that the target word vectors are replaced by the synset vectors of the respective target words.

L = \sum_{i=1}^{U} \left( p(S^{+} = 1 \mid v^{i}_{(synset+)}, h_i) + \sum_{j=1}^{Y} p(S^{-} = 0 \mid v^{j}_{(synset-)}, h_i) \right)    (5.4)

Here v^{i}_{(synset+)} is the synset vector observed with the context, and v^{j}_{(synset-)} are synset vectors that are not observed with the context.
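For illustration, the sketch below writes a negative-sampling loss in the spirit of Equations 5.3 and 5.4 using the usual log-sigmoid parameterisation of word2vec (Mikolov et al., 2013b); the exact form of p(·) in the thesis may differ, and the tensor shapes are assumptions.

    # Sketch of a joint negative-sampling loss: the context vector h is scored
    # against the positive synset vector and against sampled negative synsets.
    import torch
    import torch.nn.functional as F

    def hybrid_negative_sampling_loss(h, v_pos, v_neg):
        # h: (batch, dim) context vectors
        # v_pos: (batch, dim) positive synset vectors
        # v_neg: (batch, num_neg, dim) negative synset vectors
        pos_score = F.logsigmoid((h * v_pos).sum(-1))                 # p(S+ = 1 | v+, h)
        neg_score = F.logsigmoid(-(v_neg * h.unsqueeze(1)).sum(-1))   # p(S- = 0 | v-, h)
        return -(pos_score + neg_score.sum(1)).mean()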

5.3.2 CBOW and Non-Linear Projection Network (NLPN)

We then implement component T as a network that learns a non-linear projection of the taxonomical derivation of the synsets. In this improvement, component C is still implemented simply as CBOW, while component T is changed from CBOW to a non-linear projection network. The NLPN model has a projection layer at each hop of the taxonomical path derivation of the target word; the hops and their levels are shown in Figure 4.2 in Chapter 4. In this methodology, each hop level has its own non-linear space transformation, i.e., every hop has its own neural layer. These layers are shared irrespective of the word, so if a node appears at different hop levels in different derivations it will
