• No results found

Hot topic prediction with deep learning Short term text-based predictions of word frequencies using a deep neural network based approach

N/A
N/A
Protected

Academic year: 2021

Share "Hot topic prediction with deep learning Short term text-based predictions of word frequencies using a deep neural network based approach"

Copied!
53
0
0

Bezig met laden.... (Bekijk nu de volledige tekst)

Hele tekst

(1)

Hot topic prediction

with deep learning

Short term text-based predictions of word frequencies

using a deep neural network based approach

Lydia Mennes

A master thesis for the degree of

Master of Science in Artificial Intelligence

University of Amsterdam

April 2015

Supervisors

Dr. M.W. van Someren

Faculty of Science

University of Amsterdam

MSc. V. Hoekstra

Sentient Information

Systems

(2)

Abstract

Can we predict the topics in the news that will be spoken about more tomorrow based on the text produced in the previous days? That is the question leading to this thesis report. Based on a word frequency representation of all articles from the previous days, a prediction is made about the word frequencies of the text produced tomorrow. The dimensionality of this problem (due to the number of words used in natural language) is very large. Furthermore the temporal patterns in the data are expected to be complex, making this an interesting problem. If we predict the frequency of a word based on the past frequencies of all other words, the required model becomes too large. To reduce the dimensionality we create a form of local context to base the prediction for target words on. This local context is extracted from a semantic landscape, which is generated by the Puzzle algorithm proposed in this thesis. The actual predictions are made using a deep neural network approach, which is inspired by the use of deep networks in computer vision. The quality of the resulting semantic landscape needs improvement, and the deep neural network approach to word frequency prediction seems promising.

(3)

Acknowledgements

I would like to thank a number of people, without whom this master’s thesis would not have been here. Firstly I would like to thank my supervisor from the University of Am-sterdam, Dr. M.W. van Someren, for his helpful advice during this thesis.

I also want to thank Marten den Uyl for providing the possibility to do a research internship at Sentient Information Systems which resulted in this thesis. Furthermore I would like to thank Vincent Hoekstra for his help and feedback as my daily supervisor. I would also like to thank all the colleagues at SMR for their interest in my research and for the pleasant working atmosphere, with a special thanks to Amogh Gudi for the support.

Lastly, I would like to thank my parents, my family and friends, and especially Markus, for their continuing support and encouragement.

Thank you all so much, Lydia Mennes

(4)

Contents

1 Introduction 2

1.1 Research questions . . . 4

2 Background and related Work 5 2.1 Dimensionality reduction to two dimensions . . . 5

2.1.1 Principal Component Analysis . . . 5

2.1.2 t-SNE . . . 6

2.1.3 Multiple maps t-SNE . . . 7

2.2 Deep Neural networks . . . 8

3 Method 11 3.1 Method Semantic Landscape . . . 11

3.1.1 Semantic similarity . . . 12

3.1.2 Pre-processing for the semantic landscape . . . 14

3.1.3 Puzzle algorithm . . . 15

3.1.4 Intelligent initialization of puzzle algorithm . . . 18

3.2 Method Prediction . . . 19

3.2.1 Using the Semantic Landscape for prediction . . . 19

3.2.2 Normalisation . . . 20

3.2.3 Neural network . . . 21

3.2.4 Baselines . . . 25

4 Results 26 4.1 Data . . . 26

4.2 Results Semantic Landscape . . . 28

4.2.1 Effect of different initializations . . . 28

4.2.2 Comparison of local versus global knowledge of best matches . . . 30

4.2.3 Effect of noise . . . 31

4.2.4 Effect of stemming . . . 32

4.2.5 Solution quality . . . 32

4.3 Results prediction . . . 37

4.3.1 Single word networks experiment . . . 37

4.3.2 Shallow network experiments . . . 38

4.3.3 Deep network experiment . . . 43

(5)

Chapter 1

Introduction

Based on what has been spoken about in the recent past, is it possible to predict what will be spoken about tomorrow? That is the problem addressed in this thesis: the pre-diction of language usage in the near future. Language usage will be represented using word frequencies. The resulting problem can be stated as follows: based on text from the past, represented as word frequencies, can we predict the word frequencies in the near future? There are a number of potential applications for such predictions. For instance, predicting which news topics will remain spoken about, and which topics fade away quietly. This is a rather straightforward application, but these text based hot-topic predictions could be used in different contexts as well. Data from, for example, websites that sell illegal drugs can be collected. It could then be an option to try and predict which illegal drugs will become available more or will be available less on line. This would be purely based on the textual content on websites that sell them. A second such example would be to see which spam campaigns will grow and which will diminish, solely based on the text that they contain.

This task is challenging for a number of reasons. Topics pop up due to real world events, which are unlikely to be predictable purely based on text from the past. There are events that are cyclic in time, such as weekly football matches or yearly events in politics in the new. Non-cyclic events, however, will be hard to predict out of nowhere. The mechanics that cause topics to become more or less discussed are expected to be complex and with high dependency among topics. The dimensionality of the data is large: a lot of text is published every day, and the set of words that can be used is large. However, there are expected to be global patterns available in the data. First of all there is a maximum capacity of text that can be published (in the on line news, on drug websites etc.) and there will always be published text, therefore not all topics can increase or decrease in the near future. The relation between topics might contain global patterns. Topics might also, once started, have a general behavior over time. When not taking into account predicting the out of nowhere rise of topics, the task is therefore expected to be difficult, though not impossible.

The large dimensionality of the data causes a challenge for the prediction task. A lexicon consists of a large number of words. When using only a lexicon for a specific domain and

(6)

excluding low frequency words it still consist easily of 60.000 words. If we try to predict the word frequency of some word based on the past frequencies of only this target word it is unlikely that it is possible to make a good prediction, due to a lack of information. Predicting a word frequency based on all past word frequencies makes the dimensionality of the input too large. An intermediate solution is to predict a word frequency of some target word on its frequencies in the past, together with the past frequencies of a limited number of context words that are relevant for the prediction. This is what the approach in this thesis aims to do.

Word frequencies are predicted using a two-step pipeline. First, useful structured context is generated. Words are arranged on a two dimensional grid (called a semantic landscape) in such a way that each word is surrounded words that make up meaningful structured context. This meaningful structured context is defined using semantic similarity. An iterative algorithm called the Puzzle algorithm is proposed to generate such a landscape, which is then used in the second step to make predictions. The idea is that the semantic landscape can provide local information that can be exploited when making predictions about future word frequencies. I.e. that target word frequencies can be predicted better using the word frequencies of context words that are near in the landscape. The data that is used consists of articles collected on-line during approximately seven years by ParaBotS, a sister company of Sentient Information Systems.

The goal of generating a semantic landscape using the Puzzle algorithm is to provide context to be able to make better predictions. The semantic landscape will be created by placing the words on a two dimensional grid. Since we arrange words on a two dimen-sional grid based on their semantic similarity, this requires a representation of semantic similarity. For this co-occurrence vectors will be used. Since this vector representation has a far larger dimensionality than the semantic landscape, the Puzzle algorithm can be considered a dimensionality reduction technique. There are good methods available for dimensionality reduction, such as PCA and t-SNE. However, these methods have their limitations for this approach. When PCA reduces the data to two dimensions it maintains as much of the variance within the data as possible. For this approach we wish to maintain a different property of the data, namely neighbours in the high dimen-sional space. Neighbors are more interesting in this context because in the prediction stage the word frequencies of neighbors are used. This means neighborhood relations are the only property that is used later on. A method for visualising data (and therefore reducing data to two dimensions) is t-SNE, and it optimizes exactly this property. These two methods result in data points (words in this case) in two dimensional space. For the purpose of providing context, as opposed to visualisation, a grid is a lot more convenient. Instead of having to calculate all the pairwise distances to define the k neighbors of a word, we simply extract all k words surrounding the target word in the grid when using the proposed approach.

The predictions of word frequencies are made using a neural network approach. Neural networks are powerful non-linear architectures that can be trained for the task of rec-ognizing patterns in data. Generally neural networks make predictions or classifications based on some vector representation of carefully created features. Convolutional deep networks are capable of integrating feature construction and prediction/classification,

(7)

resulting in state of the art performances in fields like image classification [11]. The ap-proach to word frequency predictions is inspired by these deep convolutional networks.

1.1

Research questions

The approach of using a semantic landscape to improve word frequency predictions re-sults in the following research question:

Can the puzzle algorithm produce a semantic landscape that provides useful local context that improves performance in a word frequency prediction task? To investigate this question systematically the following sub-questions will be answered: • Can the puzzle algorithm result in a semantic landscape that puts

se-mantically closely related word closer together in a landscape?

This is the goal of the Puzzle algorithm as proposed in this thesis, the choices made for this approach are discussed in further detail in Section 3.1. It needs to be evaluated if the resulting landscape actually has this property.

• When looking at word frequencies over time on the landscape, do local patterns arise?

If related words are grouped together in the landscape, this does not guarantee that local patterns arise. Although this is expected to happen, it needs to be evaluated independently.

• How does using the semantic landscape for context influence accuracy on the word frequency prediction task?

Independent of the two questions above, the effect of providing structured context for word frequency needs to be investigated. The approach to this will build towards a deep approach and evaluate every step along the way. The performance will be compared to not having context and to using random context. This approach will be discussed in more detail in Section 3.2.

The remainder of this thesis report is organized as follows. First some background about language predictions, dimensionality reduction techniques and deep convolutional networks is provided in Section 2. In Section 3 the approach to generating a semantic landscape (3.1) and to using this landscape for prediction (3.2) are discussed in detail. The conducted experiments and their results will be provided in Section 4. This is followed by a discussion and conclusion in Section 5. Finally some ideas for future work are provided in Section 6.

(8)

Chapter 2

Background and related Work

Predicting word frequencies as proposed in this thesis is not a common task. Relevant literature with a task that is remotely similar has not been found. Predictions based on natural text are more often made, but the approaches are so unrelated to this thesis that I will not mention these here. Both steps of the pipeline (semantic landscape and predictions) do require background and have related work available. Both these steps will be discussed here.

2.1

Dimensionality reduction to two dimensions

As discussed in the introduction the landscape that will be used in the approach presented in this thesis is de facto a dimensionality reduction. A few requirements need to be met by the dimensionality reduction technique to be suitable for this purpose:

• The technique needs to be suitable for word similarity data.

• It should try to maintain neighbors in high dimensional space in its low dimensional counterpart.

• It needs to be able to be applicable to a large number of words which have repre-sentations with a high dimensionality.

• It should result in a grid.

A few well known and widely used dimensionality reduction techniques will now be discussed here, including their limitations for this approach.

2.1.1

Principal Component Analysis

A well known dimensionality reduction technique is Principal Component Analysis (PCA), which is well explained in [10]. PCA decomposes the covariance matrix of a data set into eigenvalues and eigenvectors. The eigenvalues indicate the importance of each of the eigenvectors. When using PCA for dimensionality reduction the eigenvectors with the smallest accompanying eigenvalues are removed. A more concise representation of the

(9)

data with minimal loss of information can then be obtained by multiplying the eigenvec-tors with the original data. Let V be the matrix with eigenveceigenvec-tors, where each column represents one eigenvector, and D the data where each row contains one instance of the data. The low-dimensional representation of the data Dl is then defined as:

Dl= V>· D> (2.1) The full covariance matrix is required in order to do Singular Value Decomposition (SVD) to obtain the eigenvalues and eigenvectors. When representations contain a lot of features this can become a difficult task. If a data set contains n samples with some (high) dimensionality d, the covariance matrix has size d×d. The RAM requirements and the computation time of SVD can then become too large. There are good approximations available for very high dimensional data sets, such as proposed by Halko et al. [7]. This algorithm decomposes the data directly as opposed of using the covariance matrix. Let k be the number of dimensions that the data will be reduced to. The decomposition is then defined as:

D ≈ UΣT> (2.2)

Where U is an m × k matrix, V an n × k matrix and Σ an k × k matrix. With very high dimensional data this can be a large improvement with respect to needing the full d × d covariance matrix. This method can also be applied incrementally, making it unnecessary to load all data into RAM. A different disadvantage of PCA is that it reduces the dimensionality of the data in such a way that as much of the variation within the data remains. Maintaining variance does not necessarily mean that neighbors are maintained, i.e. that neighbors in high dimensional space are still neighbors in low dimensional space. Since neighbors are the variable we wish to optimize, PCA is not an ideal technique for this task.

2.1.2

t-SNE

A second option for dimensionality reduction to two dimensions is t-SNE. This is an algorithm described by van der Maaten et al. ([21]) for visualizing high dimensional data in a low dimensional space. Besides the examples provided by van der Maaten et al. themselves it is very succesful in visualizing other complex real world data. Two ex-amples are visualizing evolutionary search spaces [13] and visualizing similarity between images[14].

The t-SNE algorithm tries to maintain as much of the local and global organization of the data as possible. Similarity between a data point xi and some other datapoint xj is defined as the probability pi|j that they are neighbors, if neighbors were picked in proportion to their probability density under a Gaussian centered at xi. In order to symmetrize they define the joint probability pij as (pi|j+ pj|i)/2. In the low dimensional space, for the counterparts yiand yj, the similarity qij is defined as the joint probability under a Cauchy distribution (student-t distribution with 1 degree of freedom). The use of this distribution makes the results look very well, since it allows unrelated clusters to be close enough to use all space for the visualisation. The aim of t-SNE is to find a low dimensional representation that minimizes the difference between pij and qij. The asso-ciated cost function C is the Kullbach-Leibler divergence between the joint distributions P and Q:

(10)

C(P, Q) = KL(P |Q) =X i X j pijlog pij qij (2.3) This definition results in the following gradient:

δC δyi

= 4X j

(pij− qij)(yi− yj)(1 + kyi− yjk2)−1 (2.4)

The low dimensional representation can now be optimized using gradient descent. The complexity of a naive implementation of t-SNE has time and space complexity O(N2), which is not favourable when you wish to deal with a large number of data instances. There is, however, a more complex implementation of t-SNE described by the same authors ([20]) with a time complexity of O(N logN ) and a memory complexity of O(N ): Barnes-Hut-SNE. There is, however, another problem with t-SNE in this context: it is only suitable for data in metric space. Word similarity violates the triangle inequality that holds in metric space (and transitivity that follows from this). If d(x, y) is some distance function, the triangle inequality is defined as:

d(A, C) ≤ d(A, B) + d(B, C) (2.5) A simple example from van der Maaten in [22] is sufficient to show that this axiom does not hold for word similarity. The words knot and suit are not very related to each other, d(knot, suit) will therefore be large. However, both words are strongly related to tie, making d(knot, suit) and d(knot, suit) very small. It can be easily seen how word ambiguity causes problems in the limitations of metric space.

A second problem with similarity data is that it has high centrality: there are a number of words that are semantically similar to a very large number of other words, without these words necessarily being very similar to each other. If only two dimensions are available this causes problems in the representation.

2.1.3

Multiple maps t-SNE

There is a different version of t-SNE that is suitable for non-metric similarity data, called multiple maps t-SNE (MM-t-SNE) by van der Maaten and Hinton [22]. In this case the data is represented on multiple planes, where only part of the data is displayed on each plane. The advantage to this approach can be shown using the knot - tie - suit example. In the multiple maps approach knot and tie can be present and close on one plane and suit and tie present and close together on a different plane, while suit and knot are never close together on any of the maps. Multiple maps can also solve the problem of high centrality. On every map a highly central word can have different neighbors, and these different neighbors do not have to be closer together than they are to the central word. In order to introduce multiple maps, importance weights are introduced to the model. If there are m maps, there are m importance weights πm

i for each data point i. When visualizing the results, high importance weights are used to indicate which data points should be shown on each map. The weights need to be postive and sum up to 1 over maps per data point. They are defined in terms of unconstrained weights wm

(11)

(a)

(b)

Figure 2.1: Examples of clusters found in the result of mm-tSNE on word similarity data as provided by [22].

weights in softmax nodes in neural networks) to make optimization with gradient descent easier. For the exact mathematical definitions of the neighbor probabilities, weights and cost function, see the work of van der Maaten himself [22]. The result on an example of word similarity data provided by van der Maaten is impressive, as can be seen in Figure 2.1.

It needs to be taken into account that they use very easy-to-handle word similarity data. In this thesis project real world text data is used to automatically determine word similarity. In the word by van der Maaten a word is represented in other words that people mentioned in a word association task, which results in much easier data. Additionally, in their example they use only five thousand words. And a full pairwise similarity matrix is still required for MM-t-SNE, which grows fast with the number of words. Furthermore, the number of required maps for 5000 words is already 40, adding an additional dimension with a large domain. And finally, there is no guarantee that all words will have a weight exceeding the threshold in at least 1 map, making it possible that some of the words in the data set will vanish from the dimensionality reduction by spreading their importance weight mass over all maps.

2.2

Deep Neural networks

Neural networks result in good performance in different language related tasks. For instance parsing natural scenes and natural language such as shown by Socher et al. in [19]. Building a probabilistic language model can also be done using a neural network as shown by Bengio et al. in [2]. However, related work to the approach in this thesis has not been found. The approach taken here is inspired on how deep neural networks are used in computer vision. In this field deep networks result in state of the art performance in various tasks. An example of such a task is object recognition as shown by Krizhevsky et al. in [11]. Another task is human action recognition as shown by Ji et al in [9]. A

(12)

Figure 2.2: A convolutional Neural Network. Architecture of LeNet-5 in [12].

final example is emotion recognition as shown by Gudi in [6]. In the approach in this thesis we try to use the strengths of the deep neural network approach in the language prediction task.

Architecture

Deep convolutional neural networks consist of muliple layers of neurons, where each layer represents a stage of processing. An example of a deep convolutional architecture which uses an image as input (represented as a 2D array) can be found in Figure 2.2. The network consists of three main components: the input layer, the body of the network, and the output component of the network.

The body of the network consists of some number of repetitions of a convolutional layer and a sub-sampling layer. A convolutional layer consists of a number of feature maps (the planes in Figure 2.2). Each node in a feature map receives input from a set of units located in a small neighborhood in the previous layer (for the first convolutional layer that is the image itself). All nodes within a feature map share the same weights and therefore become feature detectors. A feature map is a kind of heatmap, showing where and how strong a certain feature is present in the previous layer. It can be seen in the example that convolutions result in a large amount of nodes. In the case of image classification tasks the exact location of features becomes less relevant once features have been detected and therefore convolution is followed by sub-sampling, where the dimen-sionality is reduced greatly. An example of sub-sampling is max-pooling. In this case a small neighborhood in a feature map is reduced to a single node in the next layer which has the activation of the most activated node in the neighborhood it received input from. After the first layers higher-level features can be detected by repeating the convolutional and sub-sampling layer. In the body of the network more operations can be added to the layers depending on the task. The final layers of the network consist of some number of fully connected layers and an output layer, which together result in a classification or prediction.

In the case of a semantic landscape the sub-sampling as done in image classification becomes problematic. In the case of a semantic landscape each ”pixel” is a separate

(13)

word. Therefore discarding information as in the image classification tasks is not di-rectly possible. Suppose we have a semantic landscape containing 10,000 words (which is the size used in the experimental section), resulting in a 100 × 100 landscape. Since we will be incorporating 5 time steps this results in an input of 50,000 features. The images used in computer vision can be 256 × 256 pixels such as in [11], although often they are smaller. Suppose we use 6 feature maps as is done in Figure 2.2 and a windows size of 4. In the computer vision setting we would end up with six feature maps, each with a size of 252 × 252, resulting in 63,504 input nodes for the next layer. Through sub-sampling this number is halved in the example figure, resulting in 31,752 nodes in our vision example. By applying another convolution the next layer will consist of more nodes again. If we now look at the semantic landscape setting, we see that in the first convolutional layer we would end up with 6 feature maps, each with size 5 × 96 × 96, resulting in 276,480 nodes. Without being able to use sub-sampling we end up with 63,504 nodes for the second convolutional layer in the vision setting and 276,480 in the landscape setting. This classic approach is not an option in this thesis, therefore we will approximate this process. This will be further explained in Section 3.2.

(14)

Chapter 3

Method

The proposed approach consists of a two-step pipeline. In the first step a semantic land-scape is created, in the second step a neural network that uses this semantic landland-scape in the input is trained to make predictions about future language use. Both steps will be discussed in this chapter in more detail.

3.1

Method Semantic Landscape

The goal of generating a semantic landscape is to provide context to be able to make better predictions. This context needs to be specific, depending on the word for which the word frequency is predicted in the second step in the pipeline. The approach taken here is to arrange the words on a two dimensional grid according to some property (spec-ified in more depth in section 3.1.1), such that local patterns arise. We speak about a local pattern when the future word frequency of the target word is explained better by the word frequencies of the target word in the past plus the past frequencies of the words surrounding it in the semantic landscape, than by the past frequencies of the target word alone. An example of what that could look like can be found in Figure 3.1. If ’blobs’

(15)

Flower Bear Bear Dog Cat Bear Flower Bear Bear Flower Bear Dog Cat Flower Bear Flower

Table 3.1: Example document

Co-occurrence counts Word Flower Bear Dog Cat Flower 0 7 0 1

Bear 7 4 2 1 Dog 0 2 0 2 Cat 1 1 2 0

Table 3.2: Example co-occurrences

of higher frequency appear over time in the semantic landscape this provides local in-formation about structure over time in word frequencies. For constructing the semantic landscape this means that words which are related according to our chosen property should be close together on the grid, while unrelated words should be located further apart.

There are many methods for dimensionality reduction that could reduce the dimension-ality of our word space to two dimensions. These usually result in data points in a two dimensional space. Here, however, we choose to use a grid as opposed to two dimensional space. The reason for this lies in the fact that the dimensionality reduction itself is not the goal of this project. We wish to improve prediction by using locality in the resulting structure. Neighbors and other close points are extremely easily defined in a grid, and there is no need to calculate a pairwise distance matrix. When using a grid the true dis-tances are lost, but neighbor relations can (mostly) be maintained, which is the property that we are interested in. For this specific purpose it is therefore worthwhile to use a grid. Due to the number of words (n) that need to be placed on the grid the size of the solution space is very large: n!. A optimal solution is therefore unlikely to be found. A global optimization such as proposed in for instance t-SNE (which also does not result in a grid), requires the calculation of a full pairwise distance matrix. With the number of words ideally used in a landscape this is not desirable. Therefore we will try to optimize the grid locally using an iterative approach, while judging the quality of the grid on a global level. The iterative approach proposed in this thesis is the Puzzle algorithm and is explained in more detail in section 3.1.3.

3.1.1

Semantic similarity

As discussed previously a property is needed to base the arrangement of the grid on, in or-der to let local patterns arise. Intuitively semantic similarity would be a good candidate. Whether or not words are used in a certain period of time depends on their semantic content, i.e. on whether or not the word is related to the topic at hand. Therefore pro-viding context of words with similar semantics could provide a useful context. In order to use this property of words for constructing a semantic landscape we need to define a semantic representation of words and a similarity measure between representations.

(16)

Co-occurrences

A measure often used for semantic similarity in linguistics is a similarity between co-occurrence vectors. A co-co-occurrence vector is a vector representation of a target word containing for all other words the frequency at which it was encountered within a cer-tain window of the target word in some collection of text. If, for example, you make a co-occurrence representation with a windowsize of 1 from the words in the document found in table 3.1, the resulting co-occurrence representation of the words can be found in table 3.2.

Co-occurrence representations are considered a valid semantic representation because the context in which words are used tells us a good deal about word meaning. Or more beautifully stated from the field of linguistics: “You shall know a word by the company it keeps” (Firth, J. R. 1957 ). A more precise definition of this intuition can be found in the Distributional Hypothesis from Z. Harris [8]: The degree of semantic similarity between two linguistic expressions A and B is a function of the similarity of the linguistic contexts in which A and B can appear. The resulting representation of a single word obviously becomes very large for a realistic body of text, since a word is expressed in all other words found in a text. Usually such a representation contains numerous zeros (words that never co-occur). The dimensionality of the representation could be reduced: include only words in the representation with a sufficiently high frequency and a high variation across target words. Normal dimensionality reduction techniques might be ap-plied such as PCA. A different option is use a more efficient encoding of this lengthy representation of word meaning in order to make computations feasible using the full co-occurrence vector. This is the option used in this approach. Since we only need to make computations between non zero elements in the vector (explained in more detail in the discussion on similarity later on), these are the only ones that need to be represented. Using co-occurrence vectors for semantic representation also has some limitations. When a word has multiple meanings, e.g. the word bank, both these meanings have an effect on the co-occurrence vector. Both meanings get accumulated in a single co-occurrence vector, even though the two meanings of a word are not necessarily related in any way. However, the number of highly ambiguous words is very low compared to the number of sufficiently unambiguous words. The effect of these few words is therefore expected to be limited and acceptable.

Similarity between co-occurrence representations

In order to measure similarity between co-occurrence representations the cosine similarity will be used. In computational linguestics the cosine similarity is considered a generally well established similarity measure for vectors in high dimensional word spaces [23]. The cosine similarity is defined as:

cos(x, y) = x kxk·

y

kyk (3.1)

The cosine similarity has a number of desirable properties in the context of co-occurrence similarity. Due to its definition the similarity can only become larger when both words have a non-zero value for a frequency and these frequencies share the same pattern. This

(17)

is an advantage since a co-occurrence representation consists of mainly zeros. When using a normal distance metric the words which are less frequent, and therefore contain an even larger amount of zeros than the high frequent, would be considered more similar than words which are both frequent and share the same pattern of co-occurrences. Sometimes the semantic distance instead of semantic similarity is mentioned, this distance is defined as 1-similarity.

3.1.2

Pre-processing for the semantic landscape

The original data is not perfectly clean data. It contains spelling mistakes, it includes sometimes words that are slang, and words that are part of the website as opposed to the article. Therefore a priori a number of words are removed. All words which contain numbers are excluded. The words that have been identified as usually being part of the website (e.g. comment or guest ) are removed. All words which are on the list of Dutch stopwords from the nltk library for python are also removed. When using a collection of 1000 documents from the domain politics this results already in 17515 unique words. In case of 30000 documents this becomes 122077 unique words. If all these words would be used the semantic landscape would become too large for these initial experiments. Only allowing words with a certain frequency does not only reduce the number of words for convenience, it also is likely to remove misspelled words and slang. Finally a stemmer is a good option to combine different versions of a word (e.g. verbs and reductions) into a single concept. The effect of a stemmer is investigated experimentally.

Word selection

When using all documents available the complete set of words when using a stemmer and a minimum frequency of 10 this results in 60074 words for the politics category and 106243 words for the football category. The semantic landscape would become very large in this case and event though this might not matter for a final pipeline, it is not suitable for experimentation due to computation time. Reducing the set of words is expected to be influential on the performance in the prediction phase and will therefore be investigated experimentally. Three approaches to word selection are considered: using all words based on a smaller set of documents, selecting words based on semantic relevance for a category of documents, and word selection based on a combination of semantic relevance and frequency. These three are explained in more detail below:

• Document sampling. This is achieved by taking a month of documents, resulting in 8100 words after pre-processing.

• Semantic relevance criterium. In order to reduce the set of words by semantic relevance a measure is required to order the words by. Rayson and Garside [16] have developed such a measure for word selection for comparing corpora based on frequency profiling. Words are ranked based on the most significant relative frequency differences: the log likelihood statistic. The words with the largest difference are the most different between the two domains. If we limit ourselves for a corpus to the words that have a higher frequency in that corpus compared to some other corpus, and rank those words with the log likelihood the top words should contain those words that are most defining for that corpus. These are the words we

(18)

would like to include in the semantic landscape. We can consider our documents from two different categories as different corpora. Let fA

wi be the frequency of word

wi in corpus A, ftotA the total word length of corpus A. In order to calculate the log likelihood for words we need to calculate the expected values for each word in a corpus, using some other corpus B. The expected value EA

wi is then defined as:

EwAi =f A tot(fwAi+ f B wi) fA tot+ ftotB (3.2) And the log likelihood for a word LLwi is then defined as :

LLwi= 2 · ((f A wilog(f A wi/E A wi)) + (f B wilog(f B wi/E B wi))) (3.3)

From the ordered set of words that is now obtained the top 10,000 words are selected for the semantic landscape.

• Semantic relevance and frequency. The final method of word selection takes the entire set of words with a higher frequency in a category A compared to some other category B. A higher frequency in comparison indicates semantic relevance in that particular corpus, although less strongly than the words resulting from the previous method. The resulting words are then sorted according to frequency. The top 10,000 most frequent words are then selected for the semantic landscape.

3.1.3

Puzzle algorithm

As described at the beginning of this chapter the goal of creating a semantic landscape is to make local patterns arise. Furthermore we have established that we will try to achieve this by using semantic similarity as the property according to which we will allocate the words on the grid. The puzzle algorithm is an algorithm that takes a local and iterative approach to improve the semantic similarity between neighboring words on a grid. It takes as input some initial configuration of all n words placed on a s × s grid. The output of the algorithm is the same grid, but with a different allocation of the words. In this resulting allocation the semantic similarity between words that are close together on the grid should have improved significantly with respect to the initial allocation. The basis of the algorithm is swapping words with neighboring words when this makes the swapped words migrate towards their most similar words within their knowledge. This means that words need to have at least some knowledge about their most similar words, resulting in a pre-iterative phase. And after that we need to iterate over the words and swap them if this is desirable. Pseudo code for the algorithm can be found in Algorithm 1, and a more detailed description is provided below.

(19)

input : Grid with some initial allocation of all words W

output: Grid with a new allocation of words W that better represents semantic distance relations

for word wi in W do

Initialize local most similar words of word wi; end for nrIterations do for word wi in W do p = randomUniformValue(0,1); if p ≤ probabilityRandomSwap then swap(random wj in W , wi); else d = empty list; for wj in neighbors(wi) do d.append(calcDistanceChange(wj,wi)); end if max(d) >0 then k = argmax(d); swap(wk,wi); end end end Randomize order W ; probabilityRandomSwap · = 0.95 ; end

Algorithm 1: Pseudo code for the Puzzle algorithm The pre-iterative phase

If the most similar words are interpreted globally this means that the full pairwise se-mantic similarity matrix needs to be computed. Since we incorporate a large number of words this is not desirable. We can, however, initialize the most similar words within a certain range of the target word. This reduces the number of pairwise similarities that need to be computed greatly. And if during the iterative phase, when words start migrating over the grid, new words are encountered we can see if this word is more sim-ilar than the current most simsim-ilar words and update the list if necessary. Additionally the list of most similar words of the encountered word can be checked for other words that are more similar than the current list of most similar words. Due to the fact that the areas over which the list of most similar words is initialized overlap, all words have information about words in different areas of the semantic landscape. Information about words should travel fast, and it is expected that such an initialization should be sufficient to result in a good solution.

(20)

The iterative phase

After the pre-iterative phase the algorithm can start swapping words. Per complete iteration all words in the landscape are visited in a random order. For each word a sub-iteration takes place which consists of the following steps:

• For each neighboring word of the target word: calculate how the average euclidean distance to the most similar words changes for both words if they are swapped. • If the swap with the biggest total improvement (sum of changes for both

ele-ments involved in the swap) is an improvement larger than 0, these two words are swapped.

Since this kind of iterative algorithms might end up in a local optimum noise needs to be added in order to be able to overcome these local optima. In this case noise means that with some small probability a swap does not follow the above procedure but the target word is swapped with a randomly picked word in the landscape. Besides overcoming deadlocks (local optima) this should also result in a more rapid spread of knowledge about most similar words over the grid. In order to be able to converge, the noise should of course be decreased over time. Currently the algorithm terminates after a set number of iterations, but ideally a criterion that is based on the quality of the grid should be used as a termination criterion.

High centrality and triangle inequality

A careful reader might already have noticed that the Puzzle algorithm does not address the problems of high centrality and triangle inequality that similarity data in metric space causes (recall from Section 2.1.2). The puzzle algorithm is used nonetheless for a number of reasons:

• The favorable properties of a grid as discussed at the beginning of this section. • The na¨ıve implementation of t-SNE for N words has time and space complexion

O(N2) (not taking into account the iterative nature for time complexion). The na¨ıve implementation of the Puzzle algorithm has (looking at k neighbors when swapping and q iterations) a time complexion of O(N kq) where k, q  N . The memory complexity is O(N l) where l is the number of most similar words that is taken into account.

• Even though MM-tSNE does address the problems with high centrality and triangle inequality, the multiple maps representation is difficult to use for the prediction task. Furthermore the time and space complexities are the same as for t-SNE, without there being a fast implementation available.

(21)

3.1.4

Intelligent initialization of puzzle algorithm

The approach of the puzzle algorithm would most likely benefit from an intelligent initial allocation of the words on the grid. Intelligent in this case means an already roughly interesting solution in which the words within an area are already more similar to each other than to words very far away. This would reduce not only the distance that words itself need to travel for the algorithm to converge to a solution, but also the distance that information about most similar words needs to travel is assumed to be less. If such an intelligent initialization can be found this can therefore both reduce the time needed to converge and improve the final solution, especially for larger landscapes.

A proposal for intelligent initialization is to use the t-SNE algorithm on a sample of the words. This results in a dimensionality reduction to two dimensional space. After applying the algorithm we can convert the 2-dimensional data points to a grid. The words that were not in the sample then need to be combined with the sample, resulting in a initialization for the puzzle algorithm. Two potential procedures are used for this: wave space-to-grid and box space-to-grid. Both methods first use wave space-to-grid as a method of converting the result from t-SNE to a grid (containing only a small amount of words). The behavior for combining the sample with the remaining words is different. • Wave space to grid. The grid entries are initialized at integer positions. The data is scaled such that all data points are located within the grid. The following is then iterated until all data points are assigned to a grid point:

– Assign each data point to the nearest grid point (which, due to scaling, comes down to rounding the position of the data point)

– Let each grid point that has no data point assigned to it send a message to all other grid points

– Each grid point with k > 1 data points assigned lets k − 1 datapoints take a step toward to the nearest grid point with no data points assigned.

• Box space-to-grid First the grid is blown up such as to provide room for the remaining words. The words are then added one by one in a randomized order. Each word is allocated to the open grid entry that is closest to the word in the grid they are assigned to. This assignment is probabilistic, based on the number of words that are most similar and the number of words they already had assigned to them.

(22)

(a) Result tsne

(b) Result space to grid conversion

Figure 3.2: Result from the space to grid conversion using the MNIST dataset. The different colors indicate the different classes which represent the handwritten digits 0-9.

An example of a wave space-to-grid conversion for the MNIST data set can be seen in Figure 3.2. It can be seen that the shape and structure between clusters is retained in the conversion to a grid.

3.2

Method Prediction

As discussed previously the second step in the approach of this thesis is using the semantic landscape in order to improve the prediction of word frequencies in the near future. In order to improve the frequency prediction for a certain word the prediction will be made not only on the past frequencies of the target word, but also on the past frequencies of surrounding words in the semantic landscape. A neural network approach is taken to the prediction task. In order to use a neural network the frequency data is pre-processed and suitable samples are extracted. The approach to the prediction task is expanded one step at a time. First we would like to see if adding local information provided by the semantic landscape has a positive effect on predicting word frequencies per word. In this setting global information is not taken into account. Finally, word frequencies are predicted for the semantic landscape as a whole, based on yesterdays frequencies and the previously predicted frequencies for the entire landscape. This step incorporates global information from the semantic landscape to see if this improves accuracy on the prediction task.

3.2.1

Using the Semantic Landscape for prediction

The semantic landscape created with the Puzzle algorithm is now kept fixed and can be used to exploit locality within the landscape. A technique that is well capable of exploiting locality are deep neural networks. The approach that will be taken here is inspired by the deep neural network approach taken in the field of computer vision. There are a few key aspects in the approach in computer vision that are relevant for taking a deep network approach in the language prediction task.

(23)

• The deep network approach in computer vision generalizes within images by using weight sharing, resulting in feature maps. This means that independent of where a certain structure is present in an image the feature map associated with this structure will pick up on it. Using weight sharing for the landscape can be seen as a good technique for exploiting locality while generalizing over the landscape. There are, however, two key differences between images used in computer vision and the semantic landscape used in the language prediction task. In the case of a semantic landscape each ’pixel’ represents a different concept and the concepts have a fixed location, whereas in an image multiple pixels make up a concept and this concept can be located anywhere in the image while still representing the same concept. Nonetheless, in the case of word frequencies we expect that patterns in the data will not be completely different for every word. Therefore we should still be able to generalize within a landscape using the same approach.

• The use of weight sharing also results in a reduction in the number of weights needed within the network. When using images of 48x48 pixels, without weight sharing a single feature map requires 2116 weights (using a window size of 3 and a stride of 1). Taking into account that 32 or 64 feature maps are often used, it is easy to see how convolution without weight sharing results in an unfeasible number of weights to be trained. When using weights sharing in the same example only 9 weights per feature map are needed. The effect of weight sharing on the reduction of the number of parameters to be learned is obvious. The size of the semantic landscape used in the language prediction task is larger than that of the images that are usually used in the computer vision setting. This makes the reduction of the number of parameters even more of a necessity than in the computer vision setting.

• An important difference between the computer vision setting and the language pre-diction setting to take into account is location. The deep neural network approach in the computer vision setting uses pooling (recall from Section 2.2). Pooling results in discarding the location (or at least reducing precision to a very large extend) of found features. This discarding reduces the number of free parameters greatly. If we want to be able to predict the future word frequency for every word in the landscape, discard word location in the process. However, pooling also results in another substantial reduction of the number of parameters that need to be learned. This is an issue that needs to be addressed in the approach we take in the language prediction domain, which is discussed in Section 3.2.3.

3.2.2

Normalisation

The word frequencies for each day are preprocessed before samples are constructed. When using neural networks it is favorable to scale and normalize data. The problem with the word frequencies that the data consists of is that the distribution is heavily skewed. Most word frequencies are 0 on a given day. Words with low word frequencies (1 or 2) are much more common, whereas words with higher word frequencies are more rare. The value is also unbounded, which is not ideal for neural networks. In this ap-proach we represent the non zero frequencies as the percentile score in the set of non zero word frequencies on some day d, which is called this the relative ranking score (rr).

(24)

0.0

0.2

0.4

0.6

0.8

1.0

Value

0

1

2

3

Frequency (x 100,000)

Distribution over median normalized frequencies

Figure 3.3: Distribution over rr-scores (called value in the figure)

The frequencies that are 0, remain 0. Since there are often multiple words with the same frequency, we will use the median percentile score of these words. We make an exception for the most frequent word, which will have an rr score of 1. The distribution over rr-scores can be found in Figure 3.3.

An example of how the rr-score is calculated follows. If we have the frequencies 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 4, 4 (3.4) there are 10 non zero frequencies. Since there are 4 words with a frequency of 1, the median percentile score is that of the third element: 2.5/10 = 0.25. For the words with frequency 2, the median of all words with a frequency 2 is in the 6th position: 6/10 = 0.6, and so on. This results in the following representation in rr scores of the original word frequencies:

0, 0, 0, 0.25, 0.25, 0.25, 0.25, 0.6, 0.6, 0.6, 0.8, 1, 1 (3.5) This representation has the advantage that it is bounded, and it reduces the effect of the heavily skewed distribution. A disadvantage is that the resulting representation encodes the same frequency different on different days, depending on the distribution over frequencies on that particular day.

3.2.3

Neural network

Two different aspects of context are investigated using a neural network approach: local and global context. Only local context provided by the semantic landscape is used in the single word approach and the shallow network approach. These approaches predict the word frequency of a single word per prediction. The difference is that in the first case a network is trained for every word, and in the latter a single network for the entire landscape. In the deep network approach the predictions of the shallow network for all words are used to simultaneously predict all future word frequencies at once. This introduces global context, since it takes (implicitly) the past frequencies of all words into account to predict all future word frequencies at once. Before the details of the networks

(25)

Bin Included Bin Included number values number values 0 0 4 ≥ 0.7 < 0.8 1 > 0 < 0.3 5 ≥ 0.8 < 0.9 2 ≥ 0.3 < 0.6 6 ≥ 0.9 3 ≥ 0.6 < 0.7

Table 3.3: Bin values

for these different steps are discussed, the included history and the output representation are discussed

History

Each of the approaches to the prediction discussed below use (at least implicitly) the past frequencies of target words and context words in order to predict the word frequency of the next day. When predicting the frequency fton day t this history always consists of the past frequencies ft−xwhere x ∈ 1, 2, 3, 7, 30.

Output representation

The target frequencies are represented using bins. Initial experiments showed that repre-senting the output using classes makes the prediction task feasible in the shallow setting. The bins can be found in Table 3.3.

It can be seen that the bins for classes 1 and 2 are three times bigger than the other bins. The bins are chosen as such for two reasons. As discussed before the low frequency words (especially frequency 1 and 2) are represented by a range of rr-scores, explaining the larger bins for low frequencies. The nature of the data makes it more interesting to distinguish between words with a high frequency, rather than those of a low frequency: hot topics consist of the current high frequency words. If a small number of classes can be easily distinguished, the number of classes can always be increased.

Usually, when classifying data, one output node per class is used. A sample is then assigned to the class represented by the output node with the highest activation. Such a class encoding does not suffice here, since it does not take into account the ordinal nature of the classes. Therefore a thermometer encoding is used as described by Adamczak in [1] and by Smith in [18]. Instead of encoding the output as a series of zeros with a single one for the class of that particular sample, all classes smaller or equal to the desired class are encoded with a one and all remaining classes with a zero. For a frequency of zero all values are zero. For example, if the output falls in the fifth bin the output encoding is: 1, 1, 1, 1, 1, 0. And for the second bin this would be 1, 1, 0, 0, 0, 0. When using the neural network model for prediction, a sample is assigned to the class represented by the first node with an activation smaller than 0.5.

(26)

In case of the deep network real valued rr-scores are predicted. The output dimen-sionality in the deep setting is very high, if this would need to be represented using thermometer encoding the resulting dimensionality would become too large. It is used in the other settings because early experimentation indicated that in this setting it was very difficult to predict real valued rr-scores, and the thermometer encoding resulted in better performance.

Single word networks

The first step in this approach is to train a network for a single word. This is done for a sample of words from the landscape. This step is unrealistic for the entire landscape, which would require a large number of networks to be trained. It is, however, important for comparison purposes. This experiment also gives a initial indication of the additional value of providing context.

We train a simple neural network with a single hidden layer. This network predicts the bin-class in which the future frequency of a single target word lies. For a word w_i it predicts the bin-class of f_{w_i,t} based on the past frequencies of the target word, f_{w_i,t−x} with x in the history set, and the past frequencies of the context words, f_{w_j,t−x} where j is in the set of neighboring words in the semantic landscape. The neighboring set is defined by a closeness criterion in the semantic landscape: the window size. If we use, for example, a window size of 7, the window covers 7 × 7 = 49 words, i.e. the target word plus 48 context words. This results in (N_j + 1) × N_x = 49 × 5 = 245 input values for the neural network, with N_j = 48 context words and N_x = 5 history days.
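The following sketch illustrates how such an input vector could be assembled from a grid of daily frequencies. The array layout and the helper name single_word_input are assumptions, and border handling is omitted for brevity.

```python
import numpy as np

HISTORY = [1, 2, 3, 7, 30]  # days in the past included in the input

def single_word_input(freqs, row, col, day, window=7, history=HISTORY):
    """Build the input vector for one target word.

    freqs is assumed to have shape (n_days, height, width), holding the word
    frequencies on the semantic landscape per day; (row, col) is the position
    of the target word. A window of 7 covers 7 * 7 = 49 words (target plus
    48 context words), giving 49 * len(history) = 245 input values.
    """
    half = window // 2
    features = []
    for x in history:
        patch = freqs[day - x, row - half:row + half + 1, col - half:col + half + 1]
        features.append(patch.ravel())
    return np.concatenate(features)
```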

Shallow network

In the second step we take a shallow neural network approach. We use the concept of weight sharing to generalize over the semantic landscape and to reduce the number of free parameters. A small network is slid over the semantic landscape to predict the frequencies in the semantic landscape for the next day, see Figure 3.4 on the next page. This network does not take global information into account. The network itself is the same as described for the single word networks, but it is trained with many more samples, belonging to different words in the semantic landscape. This approach is related to a convolutional layer in a deep convolutional network: in a deep network the weights are shared within each feature map (as explained in Section 2.2), while in this shallow network the weights are shared across the samples that are extracted from the semantic landscape. Each hidden node with its associated weights from the input can therefore be considered a feature map. Instead of processing the entire image (here the semantic landscape) at once, a sliding technique is used, which still results in weight sharing across the landscape and thus in generalisation and a reduction of the number of free parameters.


Figure 3.4: This figure explains the sliding network approach. The blue grids represent semantic landscapes and the input layer of the neural network is shown in the middle. The same network is used in a, b and c to predict the different target frequencies (green), based on the frequencies of the target word (T) plus the frequencies of the surrounding words in the semantic landscape (yellow). When the network has been slid over the entire landscape, all target frequencies have been predicted. Here only a single day in the past is used, but of course multiple days can be used.
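The sliding procedure of Figure 3.4 could be sketched as follows, reusing the hypothetical single_word_input helper and HISTORY constant from the previous sketch; a real implementation would also have to handle the cells near the border of the landscape.

```python
import numpy as np

def predict_landscape(net, freqs, day, window=7, history=HISTORY):
    """Predict the bin-class of every word on the landscape for day `day`.

    net is assumed to expose a predict(features) method for a single target
    word; the same weights are reused at every position, which is what gives
    the weight sharing described above.
    """
    half = window // 2
    _, height, width = freqs.shape
    predictions = np.zeros((height, width), dtype=int)
    for row in range(half, height - half):
        for col in range(half, width - half):
            features = single_word_input(freqs, row, col, day, window, history)
            predictions[row, col] = net.predict(features)
    return predictions
```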

Deep network

In the final step of this research we wish to introduce global information in addition to local information. For this preliminary research it is not possible to take a true deep network approach in which a single network is trained that takes both local and global information into account; as discussed before, without making use of pooling such a network would become too large. Instead, we use the predictions made by the shallow network. These predictions are combined with the frequency of the target word on the previous day. So, for each word w_i in the landscape, the frequency f_{w_i,t−1} and the predicted frequency p^s_{w_i,t} from the shallow network are used to predict all p^d_{w_i,t} at once. The combination of the prediction from the shallow network and the actual frequency on day t − 1 implicitly encodes the past frequencies of the target word and the context words. The landscape that is used contains 10,000 words. This means the input for this network consists of 20,000 features: two for each word. The output is a real value for each word, resulting in 10,000 output nodes.
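A sketch of how the global input for the deep network could be assembled is shown below; the array shapes and the function name are assumptions.

```python
import numpy as np

def deep_network_input(freqs, shallow_pred, day):
    """Assemble the global input for the deep network on day `day`.

    freqs holds the actual frequencies with shape (n_days, height, width) and
    shallow_pred the shallow-network predictions for day `day` on the same
    grid. For the 10,000-word landscape this yields 20,000 input features; the
    training target is the vector of 10,000 real-valued rr-scores of day `day`.
    """
    previous = freqs[day - 1].ravel()   # f_{w_i, t-1} for every word
    predicted = shallow_pred.ravel()    # p^s_{w_i, t} for every word
    return np.concatenate([previous, predicted])
```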

Training

The approach to training the network is largely the same in all settings described above. For the small network 100 hidden nodes (or feature maps) are used. All nodes use a sigmoid as their activation function. Momentum back propagation is used as the learning rule. This learning rule requires a number of parameters, which are set to the following values:


the learning rate is set to 0.00001, momentum to 0.5 and alpha decay to 0.99. Alpha is reduced after each complete pass through the data set, which we shall call one iteration. The network is trained for at least 100 iterations. After that, training terminates when no better accuracy is achieved on the validation set for 10 iterations, or when the maximum number of 650 iterations is reached.
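The training schedule described above could be summarised in Python as follows; net.train_epoch and net.accuracy are hypothetical helpers standing in for one pass of momentum back propagation and for evaluation on the validation set.

```python
def train(net, train_set, val_set, lr=1e-5, momentum=0.5, alpha_decay=0.99,
          min_iters=100, max_iters=650, patience=10):
    """Sketch of the training schedule: momentum back propagation with
    learning-rate (alpha) decay per iteration and validation-based stopping."""
    best_acc, since_best = 0.0, 0
    for it in range(max_iters):
        net.train_epoch(train_set, lr, momentum)  # one full pass = one iteration
        lr *= alpha_decay                          # alpha decay after each iteration
        acc = net.accuracy(val_set)
        if acc > best_acc:
            best_acc, since_best = acc, 0
        else:
            since_best += 1
        # stop once at least min_iters have run and there has been no
        # improvement on the validation set for `patience` iterations
        if it + 1 >= min_iters and since_best >= patience:
            break
    return net
```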

3.2.4 Baselines

The predictions made by the different approaches need to be compared against a baseline prediction. The baseline used for the shallow networks differs from the baseline in the deep network setting. In the shallow network setting the main question is whether using context derived from the semantic landscape is useful, where useful is defined as resulting in a higher accuracy on the prediction task. Therefore the accuracy of the shallow network setting is compared to the exact same setting, except that the input consists only of the past frequencies of the target word. Providing the frequencies of other words on the same day might by itself already be beneficial for the prediction task, since it carries information such as an estimate of the number of articles published that day. To exclude this effect of unstructured context (as opposed to the structured context provided by the semantic landscape), accuracies are also compared to a setting in which a random landscape is used: all words are placed randomly on a landscape, which is then used in the same manner as in the shallow network setting. For the deep network the relevant question is the effect of providing global context; its performance is therefore compared to that of the shallow network alone.
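As an illustration, the random landscape baseline could be constructed by shuffling the word-to-position assignment; the data structure below (a dict from words to grid positions) is an assumption.

```python
import random

def random_landscape(landscape, seed=42):
    """Baseline sketch: shuffle the word-to-cell assignment so that the grid
    provides unstructured context with the same amount of input as the
    semantic landscape. `landscape` is assumed to map words to (row, col)."""
    words = list(landscape.keys())
    positions = list(landscape.values())
    random.Random(seed).shuffle(positions)
    return dict(zip(words, positions))
```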


Chapter 4

Results

In order to be able to discuss the results in full depth, some further information about the data is provided in Section 4.1. The results of generating a semantic landscape and of using the landscape to make predictions are discussed in Sections 4.2 and 4.3, respectively.

4.1 Data

A large collection of articles has been provided by Sentient Information Systems BV. The set of articles spans almost seven years, from October 1st 2007 until June 5th 2014. The articles belong to different categories, of which the categories Politics and Football will be used here. The number of articles collected over time for the category politics can be seen in Figure 4.1a. It is clear that the number of articles collected per day is not steady over time; there are even periods in which hardly any articles were collected. Some additional statistics can be found in Table 4.1. Besides the number of articles that have been collected, the word frequencies are of great influence on this task.

[Figure 4.1: (a) Document distribution for politics (number of documents per day, 2008 to 2014); (b) distribution over the frequencies normalized by the maximum frequency per day.]


Number of documents                                               177,655
Total length of documents in words                                16,135,244
Number of unique words                                            537,378
Number of days with articles                                      2,237
Average number of articles per day (if articles were collected)   79

Table 4.1: Features of the texts in the category politics in the period from 01-10-2007 to 05-07-2014. The number of unique words excludes function words.

L      Lim      Sem rel   Sem + freq
1      672419   217468    444277
2      136966   21780     56867
3      163422   22843     63879
4      93650    9835      31066
5      55364    4898      16695
6      31166    2633      9495
7      18539    1488      5383
8      13250    886       3729
9      9403     682       2771
10     6869     449       1922
11     5477     370       1587
12     4873     297       1367
13     3692     219       980
14     2407     149       691
15     1917     129       582
>15    16097    1024      4835

Table 4.2: Distribution over streak lengths (L) of the words in a semantic landscape. Different sets of words depending on the word selection are considered: Lim is the setting using a limited amount of documents, Sem rel uses only semantic relevance, and Sem + freq uses both semantic relevance and frequency.

Obviously there are many words that are not used on any given day: when using the words of a semantic landscape to look at the word frequencies per day, about 96% of all word frequencies are 0. This holds more or less for all categories and landscapes that have been considered. The remaining non-zero values follow a highly skewed distribution, as can be seen in Figure 4.1b.

When predicting word frequencies it is likely impossible to predict when a word frequency will become non-zero after a period in which the word was not used, since this is tied to events in the real world. Such events are highly unpredictable unless there is some periodic pattern in the data (e.g. after the word 'elections' the words 'seat' and 'coalition' might be used a lot). Even when this is not the case, it may still be possible to predict how word frequencies develop over time once they are active. It is therefore interesting to know how long words continue to have a non-zero frequency once they are used. Table 4.2 shows the distribution over streak lengths for the words in different landscapes. The streak length is defined as the number of days that a word has a non-zero frequency, allowing at most one day in between with a frequency of zero.
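A sketch of how such streak lengths could be computed for a single word is given below; whether the tolerated zero-frequency day itself counts toward the streak length is an assumption.

```python
def streak_lengths(daily_freqs, max_gap=1):
    """Streak lengths for one word: number of days with a non-zero frequency,
    where a streak survives at most `max_gap` consecutive zero-frequency days."""
    streaks, length, gap = [], 0, 0
    for f in daily_freqs:
        if f > 0:
            length += 1
            gap = 0
        elif length:
            gap += 1
            if gap > max_gap:
                streaks.append(length)
                length, gap = 0, 0
    if length:
        streaks.append(length)
    return streaks

# Example: frequencies 3, 0, 2, 1, 0, 0, 5 give streaks of length 3 and 1.
print(streak_lengths([3, 0, 2, 1, 0, 0, 5]))
```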


4.2

Results Semantic Landscape

Two main topics with regard to the semantic landscape are investigated here: the process of generating a semantic landscape using the puzzle algorithm, and the resulting landscape quality. The quality of the landscape depends entirely on its ability to result in better performance in the prediction task. However, it is useful to be able to estimate the quality of the landscape beforehand. The solution quality is discussed at the end of this section. The puzzle algorithm itself is explored by investigating the effect of the initialization type, the effect of noise, scalability and the effect of a lack of knowledge of the best neighbors. A number of measures, explained below, are used to give insight into the performance of the Puzzle algorithm. These measures are used to see whether the puzzle algorithm behaves as expected and how it is affected by different settings. Semantic distance refers to the cosine distance in the high-dimensional space of co-occurrence representations of the words; a small sketch of how the first two measures can be computed is given after the list.

• Optimal semantic distance. The average semantic distance between a word and its current best matches. Since these best matches get updated over time when applying the Puzzle algorithm, the average semantic distance is updated as well.

• Average semantic distance to neighbors. The average semantic distance between a word and the eight words surrounding that word in the semantic landscape.

• Number of encountered words. The number of words that a word encounters while the Puzzle algorithm is applied. The more words are encountered, the more likely it is that a word can update its best matches and lower its optimal semantic distance.
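A minimal sketch, assuming the landscape is stored as a grid of words and each word has a high-dimensional co-occurrence vector; the names grid and vectors are assumptions and border cells are not handled.

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance between two co-occurrence vectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def avg_neighbor_distance(grid, vectors, row, col):
    """Average semantic distance between the word at (row, col) and the eight
    words surrounding it on the landscape."""
    target = vectors[grid[row][col]]
    distances = [cosine_distance(target, vectors[grid[row + dr][col + dc]])
                 for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                 if (dr, dc) != (0, 0)]
    return float(np.mean(distances))
```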

4.2.1 Effect of different initializations

The goal of this experiment is to analyze the effect of different initializations on the puzzle algorithm. The three methods that were discussed are compared: a random initialization and the intelligent initializations using t-SNE combined with wave or box space-to-grid. The comparison is made using word selection based on a limited amount of documents. Intelligent initialization is a computationally expensive step and needs to improve results significantly in order to be worthwhile. An example of the resulting initializations can be found in Figure 4.2. It can be seen that the wave space-to-grid method results in more 'stripy' blobs. This means that words in the same blob might be further apart from each other than they are from words in a different blob. With the box space-to-grid it happens that some words are placed far from where they belong, as seen in the figure; however, this concerns only a few words, as opposed to entire blobs being stretched in the wave space-to-grid approach. Figure 4.3 shows the effect of the different initializations more clearly. Figure 4.3a shows that the initialization has no impact on the optimal semantic distance that is reached, and Figure 4.3b shows only a slight difference in the average semantic distance to neighbors. Both space-to-grid methods perform similarly to a random initialization. Since both non-random initializations are computationally expensive, a random initialization seems preferable.


Figure 4.2: Resulting grids for different space-to-grid methods using the limited politics words: (a) box space-to-grid; (b) wave space-to-grid. Each area with the same color represents a blob: a word from the t-SNE sample with the words that are assigned to it.

[Figure 4.3 panels: (a) average optimal semantic distance vs. trial number; (b) average semantic distance to neighbors vs. trial number. Legend: football lim, football lim stripy, football lim random, politics lim, politics lim stripy, politics lim random.]

Figure 4.3: The effect of different initializations during the puzzle algorithm on the optimal semantic distance, average semantic distance to neighbors and the stress value. All graphs show averages over all words in the landscape. Stripy refers to wave space-to-grid.


[Figure 4.4 panels: (a) average optimal semantic distance vs. trial number; (b) average semantic distance to neighbors vs. trial number. Legend: politics big puzz, politics big random.]

Figure 4.4: Results for a random initialization and a wave space-to-grid initialization using a large semantic landscape with 29,584 words.

[Figure 4.5 panels: (a) average optimal semantic distance vs. trial number; (b) average semantic distance to neighbors vs. trial number. Legend: football lim, football lim gold, politics lim, politics lim gold.]

Figure 4.5: Local versus global (gold) initialization of best matches

A remaining question is how this result scales to a larger landscape. In that case the initialization might have a stronger effect, since it is more likely that very similar words do not encounter each other when starting from a random initialization, which could affect the quality of the final solution. Therefore the random initialization is compared with the box space-to-grid approach for a large landscape containing 29,584 words. The results can be found in Figure 4.4. They show that a random initialization still performs equally well and remains preferable, given the computational effort required for a non-random initialization.

4.2.2 Comparison of local versus global knowledge of best matches

How is the quality of the final solution affected by having only local knowledge of best matches? By initializing the best matches only partially, calculating the entire pairwise distance matrix is avoided. However, this might have a large effect on the quality of the final solution, because words might try to migrate to words that are not actually their best matches. In order to evaluate this effect the entire pairwise distance matrix is computed as well, so that the globally determined best matches (the 'gold' setting in Figure 4.5) can be compared with the locally obtained ones.
