
Comparing encoding of number information in dense and sparse LSTMs

Layout: typeset by the author using LaTeX.

Jeroen Taal 11755075

Bachelor thesis Credits: 18 EC

Bachelor Kunstmatige Intelligentie

University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisors:
Dr. D. Hupkes
J.W.D. Jumelet MSc
T. Kersten BSc

Institute for Logic, Language and Computation
Faculty of Science
University of Amsterdam
Science Park 904
1098 XG Amsterdam


Abstract

The area of natural language processing (NLP) has made rapid progress in the past years, using recurrent neural networks (RNNs) and more specifically the Long Short-Term Memory (LSTM). While these networks have made astounding progress on a variety of NLP tasks, they, just as neural networks in general, appear to suffer from over-parametrization. Techniques such as pruning have been introduced to overcome this problem. There is, however, much progress to be made in interpreting the inner workings and the encoding of information of both dense and sparse LSTMs. Therefore, this thesis compares a dense LSTM and a sparse LSTM.

I will focus in particular on the encoding of number information in both models. First, I train both LSTM models on a Dutch language modeling task. Next, I assess the ability of both networks to perform number agreement by evaluating their accuracy on the number agreement (NA) task. Previous research has shown that LSTMs are surprisingly good at processing the long-term dependencies necessary to perform the NA task successfully. Furthermore, it was found that number information was encoded very locally in single units. It is therefore interesting to see to what extent a sparse LSTM contains the same encoding of number information as a dense LSTM. I localize the encoding of number information in single units by ablating units from both networks. Next, a more distributed encoding of number information is investigated by examining the variance in the decoder weights of plural and singular verbs across units. Finally, I examine the activations of the different components of the LSTM cell during the processing of sentences in order to gain insight into the way these units encode number information.

The sparse network was found to contain a strong bias towards singular verbs. However, an intervention on the cell-state activation of a single unit of the sparse network greatly improved the performance on the NA task, indicating that number information is encoded locally in this single unit. For the dense model, multiple units which seemed to encode information about plural verbs were found. Nevertheless, the dense network achieved significantly higher accuracy scores on the NA task than the sparse network.


Acknowledgements

First, I would like to thank my supervisors Dieuwke Hupkes, Jaap Jumelet, and Tom Kersten for their guidance and extensive feedback throughout this project. The weekly meetings were very helpful and provided great insights. In addition, Dieuwke Hupkes also provided the computational resources necessary to complete this project. Secondly, I would like to thank Dennis Ulmer and Rochelle Choenni for allowing me to use their pre-processed Dutch dataset. Finally, I thank Hugh Mee Wong for allowing me to use the NA task she created as part of her Bachelor's thesis.


Contents

1 Introduction
2 Background
   2.1 Recurrent Neural Networks
      2.1.1 LSTM
   2.2 Network Sparsity
   2.3 Interpretability of Language models
3 Method
   3.1 Training
      3.1.1 Data
      3.1.2 Language modeling
      3.1.3 Training sparse model
   3.2 Evaluation
      3.2.1 NA task
      3.2.2 Detection of local encoding of number information
4 Results and Evaluation
   4.1 Model overview
      4.1.1 Training
      4.1.2 Model Comparison
      4.1.3 Sparse Model
   4.2 PCA
   4.3 Unit Ablations
      4.3.1 Single unit ablations
      4.3.2 Ablating LSTM cell weights and decoder weights
      4.3.3 Group perturbation
      4.3.4 Inverse ablation of units
      4.3.5 Performance of the dense model on NA tasks
   4.4 Cell dynamics
      4.4.1 Intervention
      4.4.2 Dense group
5 Discussion and Conclusion
   5.1 Summary
   5.2 Critical reflection


Chapter 1

Introduction

The area of natural language processing (NLP) has made rapid progress in the past years. State-of-the-art results have been improved upon multiple times on tasks such as language modelling, sentiment classification, named entity recognition, question answering, and reading comprehension. In the most recent works, an architecture based on the transformer, introduced by Vaswani et al. (2017), is often used. However, preceding the transformer-based networks, recurrent neural networks (RNNs) achieved state-of-the-art results on most NLP tasks. RNNs are a network architecture using a recurrent connection, which makes them suitable for processing sequential data. The recurrent connection, also called the hidden state, between time-steps allows the network to use information from preceding time-steps. In this thesis I will refer to RNNs with a single recurrent connection as vanilla RNNs. Vanilla RNNs suffer from the vanishing gradient problem as described by Bengio et al. (1994). This problem is caused by repeated multiplication of the gradient of the hidden state during backpropagation. As a consequence, vanilla RNNs are hard to train and often do not learn long-term dependencies in sequential data.

To solve the vanishing gradient problem in vanilla RNNs, Hochreiter and Schmidhuber (1997) proposed the Long Short-Term Memory (LSTM) architecture. The LSTM has an additional recurrent connection between units, called the cell state. Information from the hidden state and the input is added to the cell state using different gates, and point-wise multiplication and addition are the only operations applied to the cell state. As a result, the gradients of the LSTM weights can be propagated via the cell state without multiplication between different time-steps. Because the cell state accumulates information from the hidden state and the input while a sequence is processed, it contains information from previous time-steps. This makes LSTMs better at learning long-term dependencies than vanilla RNNs.

Another disadvantage of LSTMs, and of neural networks in general, is over-parametrization: the networks contain more parameters than necessary. Multiple efforts exist which try to reduce the number of parameters in neural networks. One technique, often referred to as pruning, reduces the number of weights by selecting the superfluous ones and setting their value to zero. Recent works such as Frankle and Carbin (2018), Zhou et al. (2019), and Zhu and Gupta (2017) have shown that sparse networks, obtained by pruning a dense network, achieve performance similar to that of a dense model while containing only a fraction of the number of non-zero parameters.

The performance of LSTM networks on multiple sequence processing tasks has been extensively studied. However, research into the inner workings and the representation of acquired knowledge of LSTMs is scarcer. In recent work, Linzen et al. (2016) introduced the number agreement (NA) task to assess the ability of LSTMs to learn syntax-sensitive dependencies. Gulordava et al. (2018) expand upon this work by assessing the ability of an LSTM trained on a language modeling task to learn syntax-sensitive dependencies. In further work, Lakretz et al. (2019) localise the encoding of number information in an LSTM trained on a language modeling task.

The goal of this thesis is to assess the ability of an LSTM network to perform number agreement in Dutch by training it on a Dutch language modeling task. In addition, the representation of Dutch number information in a sparse network will be explored. This should answer the question: to what extent do sparse LSTM language models achieve the same representation of number information as dense LSTM language models?

This thesis consists of five chapters, including this introduction. Where this chapter introduces the research question and provides some context, the next chapter provides a literary and theoretical background as a basis for the following chapters. The third chapter states all methods of the experiments executed in order to answer the research question. The fourth chapter presents the results of these experiments. Finally, the fifth chapter provides a summary, a conclusion, and a discussion of the results.


Chapter 2

Background

The previous chapter introduced this thesis by providing context and stating the research question. In this chapter I will discuss the relevant literature and theoretical background for this thesis. I will start by explaining the idea behind RNNs; thereafter the LSTM and its components will be discussed in detail. The reason for defining the different components in detail is to provide a foundation for the intuition behind the results and conclusion. Thereafter, relevant literature about pruning neural networks will be discussed. Finally, I will discuss the relevant literature about the interpretability of LSTM language models, and in particular the NA task.

2.1 Recurrent Neural Networks

As briefly discussed in the introduction, RNNs are a type of neural network architecture suitable for processing sequential data. As RNNs are able to map an input sequence to an output sequence, they are suitable for processing natural language. The reason RNNs can process sequential data is that they maintain a hidden state. At each time-step in the sequence, the hidden state is altered by learned weights. As a result, at each time-step, the hidden state contains information about the data at preceding time-steps.

An RNN generally contains one cell at each layer. At every time-step during the processing of a sequence, this cell takes both the data and the output of the previous time-step as input. The output of the cell at time-step t will not only go to the cell at the next time-step, but also to the next layer of the network. This allows the network to use information from preceding time-steps.

However, there are challenges in training RNNs. As described in Bengio et al. (1994) and more recently in Pascanu et al. (2013), learning long-term dependencies is difficult: training an RNN to learn long-term dependencies in sequences using gradient descent often results in vanishing or exploding gradients. This problem is caused by repeated multiplication of the gradient of the recurrent connection between time-steps. As a result, vanilla RNNs are almost always outperformed by modified versions of the vanilla RNN when processing sequences containing long-term dependencies.

2.1.1 LSTM

One of the modifications of the vanilla RNN is the Long Short-Term Memory (LSTM) introduced by Hochreiter and Schmidhuber (1997). In addition to the single recurrent connection of vanilla RNNs, the LSTM also contains a cell state. At each time-step, the cell state contains information from previous time-steps. Through multiple gates, previous information from the cell state is erased, new information is added to the cell state, and information from the cell state is presented as output. All of this is done through point-wise multiplication and addition. During backpropagation, the gradient of the weight matrices is calculated using the gradient of the cell state. However, as these are only connected via point-wise multiplication and addition, multiplication of gradients between time-steps is not needed. Therefore, the gradient flowing through the cell state during backpropagation will not vanish.

An LSTM cell contains two recurrent connections: the hidden state and the cell state. At every time-step in a sequence, the hidden state is presented as the output. The value of the hidden state is determined by the cell state. The cell state is formed from the previous cell state, the previous hidden state, and the incoming data using four different gates. Each of these gates takes information from the previous hidden state h_{t-1} and from the current input x_t. The outputs, also called activations, of these gates are then used to alter the cell state C_t. The different gates are: the forget gate f, the input gate i, the candidate gate \tilde{C}, and the output gate o.

The activations of each of the components are defined by equations 2.1 to 2.6. In these equations, W denotes a weight matrix, [h_{t-1}, x_t] denotes the concatenation of the hidden state from the previous time-step and the input data, and b denotes a bias vector. In practice, W and b are often split into two matrices and two vectors respectively, one of which operates on h_{t-1} and one of which operates on x_t. However, for simplicity I denote them using a single symbol. Another important detail to note is that the equations describe the operations of an entire layer in an LSTM; thus the outputs and inputs are vectors.

In equation 2.6 we can see how C_t is formed: it is the combination of the previous cell state and new incoming information. Information from the previous cell state is filtered by f, which is defined in equation 2.1. It takes h_{t-1} and x_t, multiplies them with the learned weights W_f, and adds a learned bias vector b_f. The result is put through a sigmoid function, resulting in a vector with values between zero and one. A point-wise multiplication of this vector with C_{t-1} determines to which degree each element of C_{t-1} should be retained. The new information added to the cell state is determined by \tilde{C}_t \times i_t. As can be seen in equation 2.2, i_t is also a vector of values between zero and one and is based on the input and the previous hidden state. \tilde{C}_t, on the other hand, is a vector of values between negative one and one. Intuitively, i determines which information from the input and the previous hidden state is added to C_t, while \tilde{C}_t determines how much of the selected information is added.

The hidden state h_t represents the output of the cell. Just like C_t, it is used by the LSTM at time-step t + 1; however, it is also passed to the next layer of the network. Thus, for an LSTM cell in layer two or above, the input is h_t from the LSTM cell of the previous layer. In equation 2.5 we can see that h_t is a combination of o_t and C_t. The values of C_t are transformed to a range of negative one to one by the tanh function. o_t is a vector of values between zero and one, based on x_t and h_{t-1}. Intuitively, o_t determines, based on the current input and the previous output, which information from C_t is presented as output.

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)    (2.1)

i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)    (2.2)

\tilde{C}_t = \tanh(W_{\tilde{C}} \cdot [h_{t-1}, x_t] + b_{\tilde{C}})    (2.3)

o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)    (2.4)

h_t = o_t \times \tanh(C_t)    (2.5)

C_t = f_t \times C_{t-1} + \tilde{C}_t \times i_t    (2.6)

Finally, there is another important detail about the notation. As described above, equations 2.1 to 2.6 describe the operations of an entire layer, so the value of each of the components is a vector. However, in the upcoming chapters the term unit will often be used. When creating an LSTM model, the size of the hidden state in each layer needs to be defined, and every other component will have the same size. Each element in these vectors corresponds to a unit. As a result, the activation of each of the components of a unit is a single value.
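As a concrete illustration of equations 2.1 to 2.6, the sketch below computes one time-step of a single LSTM layer in NumPy. The dictionaries W and b stand in for the learned weight matrices and bias vectors of the four gates; their names and shapes are illustrative and do not correspond to the trained models used in this thesis.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # One LSTM time-step for an entire layer; every vector element is one unit.
    z = np.concatenate([h_prev, x_t])        # the concatenation [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate, equation 2.1
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate, equation 2.2
    c_tilde = np.tanh(W["C"] @ z + b["C"])   # candidate gate, equation 2.3
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate, equation 2.4
    c_t = f_t * c_prev + c_tilde * i_t       # new cell state, equation 2.6
    h_t = o_t * np.tanh(c_t)                 # new hidden state and output, equation 2.5
    return h_t, c_t

Calling lstm_step once per token reproduces the recurrence; the hidden state h_t is what the next layer and, for the last layer, the decoder receive.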

Furthermore, the LSTM networks used by Gulordava et al. (2018), Lakretz et al. (2019), and in this work also contain an encoder and a decoder layer. The encoder layer is a look-up table which assigns an embedding vector to each input word; the embedding vectors are learned during training. The decoder layer is a linear layer which transforms the output of the network, the hidden state of the last layer, into an output vector of the size of the vocabulary. This vector represents the probability the network assigns to each word in the vocabulary.

2.2 Network Sparsity

The goal of sparsifying neural networks is to create networks which contain as few parameters as needed to model the training data. As will be discussed in this section, neural networks appear to be over-parameterized, as networks with significantly fewer non-zero weights are able to reach similar performance.

Studies by LeCun et al. (1990) and Hassibi et al. (1993) propose methods that use second-order derivative information to remove superfluous weights after training. Specifically, LeCun et al. (1990) train a network until a local minimum is reached, compute the second-order derivative of the error with respect to every weight (the saliency), order the weights by saliency, and remove those with the lowest saliency.

More recently, studies such as Zhu and Gupta (2017) and Narang et al. (2017) focus on techniques which deactivate weights based on their magnitude. Narang et al. (2017) find that large sparse networks often outperform small dense networks. They trained both a regular RNN and a gated recurrent unit (GRU) network to perform speech recognition on English data. In case of the regular RNN, a large sparse network with 1.7 times more units but 16.7 million non-zero parameters instead of 67 million achieved a relative performance increase of 3.95%. The sparse GRU model, with 17.8 million non-zero parameters, achieved a relative performance decrease of 2.20% with respect to a GRU model with 115 million non-zero parameters.

Zhu and Gupta (2017) also find that large sparse models can outperform small dense models. Specifically, an Inception-V3 (Szegedy et al. (2016)) with a sparsity of 50% achieves equal performance on the ILSVRC 2012 image recognition task, and an Inception-V3 model with 87.5% sparsity achieves a relative performance of −2% with respect to its counterpart with 0% sparsity. An LSTM trained on the Penn Treebank (PTB) corpus achieves a test perplexity of 115.30, while a larger LSTM with 97.5% sparsity achieves a test perplexity of 103.20.

Elsewhere, Frankle and Carbin (2018) formulate the lottery ticket hypothesis. This hypothesis states that when the weights of a network are randomly initialised, a subset of these weights is initialised such that it forms a sub-network which learns faster than the rest of the network. When re-initialised to their initial values and re-trained, this sub-network achieves at least equal performance to the original network. Such a sub-network can be found by pruning the smallest-magnitude weights as described in Han et al. (2015).


Expanding on this work, Zhou et al. (2019) investigate multiple pruning criteria and re-initialisation techniques. They find that retaining weights with large final magnitudes, as well as keeping weights whose magnitude has changed the most during training, results in sub-networks with similar performance when retrained. They hypothesise that the reason for this is that the pruned weights are being set to their final value, which is zero. Moreover, they find that the sign of the weights of the sub-network after training is an important factor: as long as the sign of the re-initialised weights is equal to the sign after training, the found sub-network will perform well.

Mehta (2019) expands upon the lottery ticket hypothesis by investigating multiple transfer learning tasks. This work introduces the ticket transfer hypothesis, which states that a sub-network of a randomly initialised feed-forward neural network can be found during training on a source task. After fine-tuning this sub-network on a target task, it will match the performance of the original network on that target task. This hypothesis is validated by fine-tuning the entire network as well as by fine-tuning the fully connected layers while freezing the other layers.

2.3 Interpretability of Language models

As described in the previous section, LSTM networks are suitable for processing sequential data such as natural language. However, interpreting the information they contain is not straightforward. In the past years, multiple efforts have been made to provide insight into the information encoded in LSTM networks (Linzen et al. (2016), Gulordava et al. (2018), Giulianelli et al. (2018), and Lakretz et al. (2019)).

First, Linzen et al. (2016) proposed the number agreement (NA) task to assess the ability of LSTMs to learn syntax-sensitive dependencies. They trained multiple LSTM language models with 50 hidden units using multiple training objectives on a dataset of approximately 1.35 million number agreement problems from Wikipedia. The different training objectives are: a supervised number agreement objective, a language modelling task, grammaticality judgements (classifying a sentence as grammatical or not), and verb inflection. In the verb inflection objective, the network receives the sentence including the singular form of the verb and then has to predict the correct number of the verb. The LSTM trained on the number prediction task achieved an error rate of 0.83%. The language modelling LSTM, on the other hand, achieved the highest error rate of the different objectives: 6.78%. They concluded that an LSTM can learn number agreement fairly well under supervision. Language modelling, on the other hand, was not found to be a sufficient task for learning number agreement, and a joint training objective would be needed.

Elsewhere, Gulordava et al. (2018) expand upon this task by creating an NA task with nonce sentences in different languages, namely English, Italian, Russian, and Hebrew. However, their findings are in contrast with the results from Linzen et al. (2016). Namely, Gulordava et al. (2018) find that an LSTM trained on an unsupervised language modeling task has an accuracy on the NA task, when using no more than four attractors, equal to that of the supervised LSTM from Linzen et al. (2016). Furthermore, the LSTM scores only slightly lower than human evaluators on the NA task with up to two attractors. Therefore, Gulordava et al. (2018) suggest that an LSTM trained on an unsupervised language modelling task does learn syntactic hierarchical structure and that a joint training objective is not needed to perform the NA task.

Giulianelli et al. (2018) take a different approach by training diagnostic classifiers (DCs), introduced by Hupkes et al. (2017), on the internal activations of LSTMs to predict the number of the verb. If a DC is able to predict the verb number with a higher accuracy than chance, this indicates that the LSTM has encoded number information. They find that number information is encoded dynamically during the processing of a sentence. Furthermore, they found that in incorrectly predicted sentences, the information was encoded incorrectly from the start of the sentence.

In other work, Lakretz et al. (2019) focus on the distribution of the representation of syntactic hierarchical structures in LSTM networks. Using a set-up similar to that of Gulordava et al. (2018), they performed ablation experiments in which units were ablated from the network during the NA task. In these experiments, two units (the LR units) were found whose ablation significantly impacted the accuracy on different conditions of the NA task.

When evaluated, the entire network did make errors on long range conditions of the NA task. However, when training a diagnostic classifier on the activity of the LR units, the number of the verb could reliably be predicted. Based on this, Lakretz et al. (2019) suggest that there exists a more distributed representation of short range number information which conflicts with the LR units.

Furthermore, Lakretz et al. (2019) also found a group of units of which the activity correlated with syntactic depth of different words in the sentence. The ablation of this group of units as a whole from the network also affected the accuracy of the network on the NA task. One particular unit in this group contained connections with exceptionally high magnitude to the input and forget gates of the two LR units.

To conclude, Lakretz et al. (2019) found that training an LSTM network unsupervised on a large language corpus resulted in a local sub-network processing hierarchical syntactic information. However, they also hypothesised that there exists a more distributed representation of short-range number information in the network. As the network contains 1300 units divided over two layers, it is computationally intractable to find, through ablation, a large subset of units which significantly impacts accuracy on short-range NA conditions.


Chapter 3

Method

In the previous chapters I introduced the topic of this thesis, the question this thesis tries to answer, and the theoretical background this thesis builds upon. In this chapter I will discuss the methods used to obtain the results presented in chapter 4.[1] This chapter is split into two sections. In the first section, I discuss the methods used to train a dense and a sparse LSTM model. In the second section, I discuss the methods used to evaluate both models.

3.1 Training

This section describes the process of training both models. This process consists of three parts: The preparation of the data used to train the models, the training task, and the pruning algorithm to create the sparse model.

3.1.1 Data

The data used to train the language model (LM) was extracted from a Dutch Wikipedia dump[2] by Ulmer and Choenni. The dump still contains article markup after this step, so the text of the articles was extracted using wikiextractor and split into sentences. Sentences shorter than 5 tokens were removed, as such short sentences are often article headers or the result of an incorrect sentence split.

However, the size of the English dataset used in Gulordava et al. (2018) and Lakretz et al. (2019) is only 26% of that of the Dutch dataset. In order to relate the results of the Dutch LSTM models from this thesis to those from Lakretz et al. (2019) and Gulordava et al. (2018), an equal amount of data should be used. Therefore, 26% of the Dutch sentences were sampled from a uniform distribution.

After these steps, the sentences were tokenized. The vocabulary used for the model consists of the 50,000 most common words in the data. Every word which does not occur in the vocabulary is replaced with an <unk> token. Furthermore, the <eos> token is used to mark the end of a sentence. Thereafter, following Gulordava et al. (2018), the data is split into a training, validation, and test set following an 8:1:1 ratio. The resulting training, validation, and test sets consist of 7.5%, 7.4%, and 7.7% <unk> tokens respectively.

[1] The code used to implement these methods is available upon request from jeroen.taal@student.uva.nl.
[2] https://ftp.acc.umu.se/mirror/wikimedia.org/dumps/nlwiki/


3.1.2 Language modeling

Both the dense and the sparse model were trained on a standard language modelling task. This entails that the model processes a sequence, in this case a sentence, one token at a time. For every token in the sequence, the model outputs a probability distribution over the vocabulary for the next word in the sequence. The code used to train both models was taken from Lakretz et al. (2019) and slightly adjusted. The LSTM networks were implemented and trained using the PyTorch API developed by Paszke et al. (2019).

The loss of the network during training is calculated using the cross-entropy loss criterion. This criterion takes the output of the model x, which is in this case an unnormalized score for each word in the vocabulary, and the index of the target t. The loss for output x is then calculated as follows:

Loss(x, t) = -x[t] + \log\left(\sum_{j} \exp(x[j])\right)    (3.1)

The loss over a batch is the average loss over the outputs x in the batch.

Optimizing the network during training is done using stochastic gradient descent. The gradient of the loss function L is accumulated for each sample i in the training batch B. When the entire batch has been processed, the gradient is averaged. Thereafter, the weights w are updated by subtracting the averaged gradient multiplied by the learning rate \eta. This process is stated in the following equation:

w_{t+1} = w_t - \eta \, \frac{\sum_{i \in B} \nabla_w L_i}{\lVert B \rVert}    (3.2)

The model is trained for 40 epochs with an annealing learning rate starting at \eta = 20. After every epoch, the model is evaluated on the validation set; if the perplexity on the validation set does not improve, \eta is divided by a factor of 4.
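A minimal sketch of this training and annealing loop, assuming a PyTorch language model model, an iterable train_batches of input and target token indices, and a helper validate that returns the validation perplexity (all hypothetical names; hidden-state handling and gradient clipping are omitted):

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()    # computes equation 3.1, averaged over the batch
lr = 20.0                            # initial learning rate eta
best_val_ppl = float("inf")

for epoch in range(40):
    model.train()
    for inputs, targets in train_batches:
        model.zero_grad()
        logits = model(inputs)       # scores over the 50,000-word vocabulary
        loss = criterion(logits.view(-1, logits.size(-1)), targets.view(-1))
        loss.backward()
        with torch.no_grad():        # plain SGD update, equation 3.2
            for p in model.parameters():
                p -= lr * p.grad
    val_ppl = validate(model)
    if val_ppl >= best_val_ppl:      # anneal when validation perplexity stalls
        lr /= 4.0
    else:
        best_val_ppl = val_ppl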

3.1.3 Training sparse model

The sparse model is trained on the same dataset using the same configuration of settings described in subsection 3.1.2. However, during training, weights are set to zero (pruned) following the algorithm proposed by Zhu and Gupta (2017). At every pruning step during training, a binary mask is constructed; the number of zeros in this mask, i.e. the weights to be pruned, is determined using equation 3.3. The weights are pruned by applying the binary masks to the weight matrices. During backpropagation, the gradient flows through the binary masks, preventing pruned weights from being updated.

The sparsity of the network at time-step t during training, and hence the number of weights to be pruned, is calculated using equation 3.3. In this equation, t_0 denotes the step at which pruning starts, \Delta t denotes the interval between pruning steps, n denotes the number of pruning steps, s_i denotes the initial sparsity of the network, and s_f denotes the final sparsity.

s_t = s_f + (s_i - s_f)\left(1 - \frac{t - t_0}{n \Delta t}\right)^3 \quad \text{for } t \in \{t_0, t_0 + \Delta t, \ldots, t_0 + n \Delta t\}    (3.3)

The sparse model is trained following this method using the hyper-parameter configuration recommended in Zhu and Gupta (2017). Pruning starts at t_0 = 125000; by this step in training the network has already learned from the data, and there should be a distinction between superfluous and necessary weights. However, pruning weights may damage the network, as their values are suddenly changed. To allow recovery from this damage, \Delta t has a value of 1000, as Zhu and Gupta (2017) found that this leaves the network enough time to recover from the damage done by a pruning step.

The model is pruned layer by layer: in each layer the weights are sorted by magnitude and pruned according to equation 3.3. As a result, the resulting network can contain different sparsity levels per unit and per component of the LSTM cell.
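The sketch below illustrates this gradual pruning procedure: the cubic schedule of equation 3.3 and a per-layer magnitude mask. The value of n and the final sparsity are illustrative defaults (the average sparsity reported later is around 70%), not the exact configuration used for the sparse model.

import torch

def target_sparsity(t, s_i=0.0, s_f=0.70, t_0=125_000, delta_t=1_000, n=100):
    # cubic sparsity schedule of Zhu and Gupta (2017), equation 3.3
    if t < t_0:
        return s_i
    progress = min((t - t_0) / (n * delta_t), 1.0)
    return s_f + (s_i - s_f) * (1.0 - progress) ** 3

def magnitude_mask(weight, sparsity):
    # binary mask that zeroes the fraction `sparsity` of lowest-magnitude weights
    k = int(sparsity * weight.numel())
    if k == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()

# At every pruning step the mask is recomputed per weight matrix and applied to both
# the weights and, after backpropagation, their gradients:
# mask = magnitude_mask(weight.data, target_sparsity(step))
# weight.data.mul_(mask)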

3.2 Evaluation

This section will focus on the evaluation of a dense model trained via the methods described in 3.1 and a sparse model trained via the methods described in 3.1.3. First the NA task will be discussed. As described in 2.3, the NA task is used to assess the ability of a network to perform number agreement. Thereafter, I will present the methods used to detect local encoding of number information.

3.2.1 NA task

After training, the ability of both models to perform number agreement is assessed by measuring their performance on the NA task. Recall from section 2.3 that the NA task uses the fact that the form of a verb is only dependent on the form of the subject in the sentence. Thus, if the network manages to correctly predict the form of the verb with high accuracy on a collection of sentences, this indicates that the network has encoded information about the number of the subject.

The Dutch version of the NA task was created by H.M. Wong (2020) as part of a Bachelor's thesis. The task is divided into multiple templates, of which six are used in this thesis, which vary in the number of words and the types of words used. The entire dataset can be found on GitHub. I summarise the templates below.

The templates used are: Simple, ADV, NounPP, NamePP, Noun_conj, and S_conj. The Simple template consists of a subject and a verb which applies to a second noun occurring after the verb. The ADV template is similar to the Simple template, but also contains an adverb positioned after the second noun. The NounPP template contains a subject, a verb, and an interfering noun (attractor) between the subject and the verb. The NamePP template is similar to the NounPP template but contains an interfering name as an attractor instead of an interfering noun. The Noun_conj template contains two nouns and a verb which applies to both nouns. As a result, the verb in the Noun_conj template is always plural. The S_conj template is a conjunction of two pairs of nouns and verbs. In case of the S_conj template, the network has to predict the form of the second verb, which applies to the second noun in the sentence.

Each of the templates has multiple conditions based on the form of the nouns. The templates with a single noun, Simple and ADV, have the conditions Singular (S) and Plural (P), referring to the number of the noun. The other templates, containing two nouns, have the following conditions: Singular Singular (SS), Singular Plural (SP), Plural Singular (PS), and Plural Plural (PP). I refer to the conditions in which the first noun is singular as the singular conditions, and to the conditions in which the first noun is plural as the plural conditions.

Table 3.1 provides an example sentence for each template. The subject of each sentence is highlighted in blue, the attractors in red, and the verbs in green. The Simple and ADV tasks do not contain any words between subject and verb. The NounPP task consists of sentences in which one noun, acting as an attractor, is positioned between the subject and the verb. In order to predict the correct verb form, the network should remember the form of the subject and ignore the form of the attractor. This also applies to the NamePP task, but in this task the network should ignore an interfering name. To correctly predict the form of the verb in the Noun_conj task, the network needs to have learned that two singular nouns connected by a conjunction require a plural verb. In the S_conj task, the form of the verb to be predicted depends on the last noun only.

These templates have been selected from the NA task because they highlight the ability of the models to predict the correct verb form over a short range with no interfering words, and over a longer range with interfering words.

The form of a verb is considered correctly predicted by the network if the probability assigned by the network to the correct form w_+ is greater than the probability assigned to the incorrect form w_-:

P(w_+) > P(w_-)    (3.4)
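A sketch of this criterion, assuming a trained model that maps a tokenised sentence prefix to next-word scores, and vocabulary indices for the correct and incorrect verb forms (the function and variable names are hypothetical):

import torch

def predicts_correct_form(model, prefix_ids, correct_id, wrong_id):
    # Feed the sentence up to (but not including) the verb and compare the scores
    # assigned to the two verb forms (equation 3.4). Since the softmax is monotonic,
    # comparing raw scores is equivalent to comparing probabilities.
    with torch.no_grad():
        logits = model(prefix_ids)      # shape: [prefix length, vocabulary size]
    next_word_scores = logits[-1]
    return bool(next_word_scores[correct_id] > next_word_scores[wrong_id])

The accuracy on a condition is then the fraction of its sentences for which this comparison holds.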

Task        Example sentence
Simple      De boer vermijdt de persoon
ADV         De dichter vermijdt de persoon publiekelijk
NounPP      De vader onder de vrachtwagens mist
NamePP      De man op Pat begrijpt de persoon
Noun_conj   De dokter en de leraar begrijpen de persoon
S_conj      De zanger onderzoekt en de schrijver vermijdt

Table 3.1: Example sentences for the different templates used from the NA task. The subjects are printed in blue, the attractors in red, and the verbs in green.

3.2.2 Detection of local encoding of number information

The NA task is not only used to evaluate the ability of the networks to encode number information, but also to localize this encoding. This is done by evaluating the networks on the NA task while the weights of a unit are ablated. A drop in accuracy on one of the conditions while a unit is ablated, with respect to the accuracy of the full model, suggests that this unit processes number information. In this thesis, ablating the weights of a unit entails setting their values to zero.
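A minimal sketch of such an ablation for a two-layer PyTorch nn.LSTM, assuming the model exposes the LSTM as model.rnn and the decoder as a linear layer model.decoder (hypothetical attribute names), with a hidden size of 650 and unit indices counted within a layer. PyTorch stacks the four gate matrices along the first dimension, so one unit owns four rows of each weight matrix:

import torch

def ablate_unit(model, layer, unit, hidden_size=650, ablate_decoder=True):
    # rows of weight_ih_l{layer} and weight_hh_l{layer} producing this unit's gate activations
    rows = [gate * hidden_size + unit for gate in range(4)]
    with torch.no_grad():
        for name in (f"weight_ih_l{layer}", f"weight_hh_l{layer}"):
            getattr(model.rnn, name)[rows, :] = 0.0
        if ablate_decoder and layer == 1:
            # only second-layer units feed the decoder; a unit corresponds to one decoder column
            model.decoder.weight[:, unit] = 0.0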

The method of ablating one unit at a time will only detect single units which play a role in the processing of number information. However, there may exist a sub-network of units which encodes number information in a more distributed manner. Ablating one unit from this sub-network might not impact the accuracy scores on the NA task, yet these units should still differentiate between the plural and singular forms of words. Therefore, units which differentiate between verb forms but do not affect the accuracy of the network are found by looking at the variance in the decoder weights of plural and singular verbs. Recall from subsection 2.1.1 that every unit in the second layer of the network has a vector of decoder weights, one for each word in the vocabulary. By selecting all plural and singular verbs from the NounPP task, we only select the decoder weights we are interested in. The resulting matrix is of shape [30, 650], i.e. 30 weights for each unit. We can then find the units whose decoder weights show the most variance over the selected verbs using principal component analysis (PCA). Take X to be the matrix consisting of the weight values of each of the verbs for all units. A covariance matrix \Sigma over the 650 unit dimensions can then be constructed from the mean-centered data:

\Sigma = \frac{1}{n - 1} (X - \bar{X})^\top (X - \bar{X})    (3.5)

where \bar{X} denotes the column-wise mean of X and n the number of selected verbs. Using singular value decomposition, \Sigma can be decomposed into three matrices:

\Sigma = U V D    (3.6)

The units with the most variance in their decoder weights over the verbs do not necessarily differentiate between singular and plural verbs. After all, PCA is an unsupervised method which constructs lower-dimensional axes such that most of the variance of the original dataset is retained. Even though the decoder weights of the units will contain variance, it is not guaranteed that verb form is the segregating factor. To confirm that the units found using PCA do differentiate between verb forms, the values of the weights need to be examined. This method was performed using the PCA implementation of scikit-learn.
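The following sketch shows how such units can be found with scikit-learn, assuming decoder_weights is the [50000, 650] decoder matrix as a NumPy array and verb_ids holds the vocabulary indices of the 15 singular and 15 plural NounPP verbs (both names are assumptions):

import numpy as np
from sklearn.decomposition import PCA

X = decoder_weights[verb_ids, :]     # shape [30, 650]: 30 verb weights per unit
pca = PCA(n_components=2)
pca.fit(X)                           # principal axes over the 650 unit dimensions
loadings = pca.components_.T         # shape [650, 2]: each unit on the two components

# units with the largest magnitude on the first principal component show the most
# variance in their decoder weights over the selected verbs
candidate_units = np.argsort(-np.abs(loadings[:, 0]))[:5]

Whether that variance actually separates singular from plural verbs is then checked by inspecting the weight values of the candidate units directly.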

Next, evidence that the units found using this method encode number information in a more distributed manner is gathered by ablating the weights of these units simultaneously while performing the NA task.

To further analyze the behaviour of the units found, the activations of the components of each unit, as described by equations 2.1 to 2.6, are visualized. The average activation during the processing of all correctly classified NounPP sentences is used. The activations of the different components of a unit while processing each word provide insight into the way the network determines the correct form of the verb.


Chapter 4

Results and Evaluation

The previous chapter presented the methods of the experiments executed to answer the research question. In this chapter the results of these experiments are presented. The chapter is divided into four sections. The first section provides an overview of the training of the dense and the sparse model and a comparison of some model characteristics. The second section presents the results of applying the PCA method described in subsection 3.2.2. Next, the results of the unit ablations are presented in section 4.3. Finally, the activations of the different components of the relevant units, as described by equations 2.1 to 2.6, are presented.

4.1 Model overview

This section provides an overview of the training process of both models and of model characteristics such as the distribution of weight values and the sparsity levels among units in the sparse model. Comparing the dense and the sparse model should provide insight into the manner in which pruning has affected the sparse model.

4.1.1 Training

Figures 4.1 and 4.2 depict the development of the training process of the dense and the sparse model respectively. In figure 4.1a we can see the loss of the dense model during training. The loss on both the training and the validation set drops quickly in the first epochs. At epoch 15 the loss no longer improves, so the learning rate is adjusted from 20 to 5, as visible in figure 4.1b. This results in another significant drop of the loss on both the training and the validation set after the next epoch. The loss stops improving at epoch 25, after which the learning rate is again adjusted, to a value of 1.25. The learning rate is further annealed at epochs 34 and 37.

The development of training the sparse model is different. In figure 4.2a we can see that both the validation and the training loss decrease quickly. However, when pruning starts in epoch 5, the loss on the training and validation sets explodes. In the following epochs the loss quickly decreases, up until epoch 20, after which the learning rate is adjusted from 20 to 5, as visible in figure 4.2b. After epoch 20, the loss per epoch stagnates and the sparse network does not reach the training and validation loss it achieved before pruning started.

The explosion of the training and validation loss after the start of pruning is remarkable. Recall from subsection 3.1.3 that pruning started at step 125000. The reason for this was that by then the network would already have learned from the data, such that there would be a clear distinction between necessary and superfluous weights, and sorting the weights by magnitude should select the superfluous ones. However, the training and validation loss explode to values higher than at random initialization when only 1.67% of the weights have been pruned. This indicates that the pruned weights were not superfluous.

Another remarkable detail is that after the explosion the network seems to recover to a certain extent; this indicates that the interval of 1000 steps between pruning steps allows the network to recover from the damage done by pruning. However, the network never reaches the training and validation loss from before pruning. This might indicate that the sparse network does not contain enough non-zero weights to model the data to the same extent as the dense network. However, further research is necessary to draw any conclusions on this matter.

[Figure 4.1: Training results of the dense model. Panel (a) shows the validation loss and the average training loss after each epoch; panel (b) shows the learning rate per epoch.]

[Figure 4.2: Training results of the sparse model. Panel (a) shows the validation loss, the average training loss, and the average network sparsity per epoch; panel (b) shows the learning rate per epoch.]


4.1.2 Model Comparison

In Figure 4.3, the distribution of non-zero weights of both models can be seen. In figure 4.3a, we can see that the distribution of non-zero weight values in the dense model is centered around zero in both layers. Figure 4.3b shows that the distribution of non-zero weights in the sparse model is split around zero in both layers. This is a result of pruning the lowest magnitude weights. Furthermore, we can see that the distribution of weight values of the sparse model has significantly more density at higher magnitudes than the distribution of weight values of the dense model.

Moreover, it is visible that for both models the second layer contains more density at higher magnitudes. This might be caused by the fact that the second layer is connected to the decoder layer of the network. However to draw any conclusions as to whether this is the case, further research is needed.

[Figure 4.3: Distribution of the values of non-zero weights in both layers, for (a) the dense model and (b) the sparse model.]

4.1.3 Sparse Model

Figure 4.4 shows that differences in sparsity exist between the layers, units, and gates. Please recall equations 2.1 to 2.4 describing the operations of the different gates in the LSTM cell. The sparsity values in figure 4.4 correspond to the sparsity values in the weight matrices from these equations.

It can be seen that the sparsity values of units in the first layer (figure 4.4a) contain less variance than those in the second layer (figure 4.4b). This means that the second layer contains more units whose weights are significantly less sparse than the average sparsity of 70%. This might indicate that certain units are specialized in a certain functionality. For example, consider a unit whose weights for o, the gate which controls which parts of C are presented as output, are very sparse in comparison with the rest of the network. This unit might rarely present the cell state as output because it only processes information about a rarely occurring part of speech. This is, however, purely speculative, and further research is needed to confirm a possible reason for the large differences in weight sparsity between units.

Furthermore, for both layers, the weights of \tilde{C} are on average less sparse than those of the other gates. The reason for this is unknown, and further research is necessary to confirm any possible explanation.


[Figure 4.4: Sparsity of the weights per gate for (a) the first layer and (b) the second layer; each dot represents the sparsity of the weights of one unit.]

4.2 PCA

The previous section discussed the training development of both models. In this section I present the results of executing the PCA method described in subsection 3.2.2. This experiment was executed to find units which do differentiate between singular and plural verbs, but do not affect the accuracy on the NA task when ablated. These units might form, or be part of, a bigger sub-network which only affects the accuracy on the NA task when ablated as a whole. For convenience of the reader, a short summary of the method follows.

Recall that the decoder weights of the LSTM network are of shape [50000, 650] (the vocabulary size by the number of units in the second layer). We select the weights which correspond to the verbs from the NounPP task; this selection is of shape [30, 650] (15 plural and 15 singular verbs). Using PCA, the two principal components containing the most variance are constructed. As verb form is one factor which separates the two groups of verbs from each other, it should contribute to the variance in the encoding of these verbs. However, this is not necessarily the case; therefore the units found need to be inspected further.

Figure 4.5 shows the units projected on the two principal components for both models. It can be seen that units 873 and 813 are outliers in the space constructed from the two principal components. For both models, there exist other units close to units 873 and 813 which possibly also differentiate in their weights between singular and plural verbs.

Figure 4.6 confirms this. In 4.6a, units 873 and 1207 clearly differentiate between plural and singular verbs. In the weights of units 819, 835, and 961, there exists a minimal amount of overlap. In 4.6b, units 813 and 1207 contain a clear segregation between singular and plural verbs, while units 857 and 1081 contain some overlap between the two verb forms.


[Figure 4.5: Decoder weights of the units, restricted to the verbs in the NounPP task, projected on the two principal components, for (a) the dense model and (b) the sparse model.]

[Figure 4.6: Decoder weights of the verbs from the NounPP task for (a) the dense model and (b) the sparse model.]

4.3 Unit Ablations

The previous section identified units which differentiate between plural and singular verbs in their decoder weights. This does not necessarily mean that these units encode number information or are part of a larger sub-network which encodes number information. Therefore, in the upcoming subsections, these units are ablated from the networks to measure the impact on the NA task. This will provide more insight into whether these units encode number information. First, I discuss the results of ablating every single unit from both the dense and the sparse network while performing the NounPP task. Thereafter, I present the results of ablating, as a group, the weights of the units found using the PCA method of subsection 3.2.2. In order to gain more insight into the way the different verb forms are encoded, the following subsection discusses the results of multiplying the decoder weights by a multiplier. Next, the results of ablating all decoder weights of the network except those of the groups are presented, to evaluate the information encoded in the groups of units without the influence of other decoder weights. Finally, I show the performance of the dense model on all NA tasks described in subsection 3.2.1.


4.3.1 Single unit ablations

Figure 4.7 shows the results of ablating single units from both models while performing the NounPP task. In figure 4.7a, we can see that ablating unit 873 from the dense model has the largest negative impact across all conditions. For the singular conditions, ablating unit 1207 has an impact on the accuracy similar to that of ablating unit 873, whereas unit 819 only has a similar impact on the Singular Plural condition. For the plural conditions, unit 835 is the unit whose ablation has the second-largest impact. This indicates that these weights might encode information about verb form.

Figure 4.7b shows the accuracy scores of the sparse model while single units are ablated from the network. It is clear that ablating unit 813 has the largest impact on the accuracy scores on all four conditions. Ablating the weights of unit 813 from the sparse model shows a different pattern than ablating units from the dense model: in the singular conditions, ablating the weights of unit 813 decreases the accuracy, whereas in the plural conditions it increases the accuracy. The fact that the sparse model has significantly higher accuracy scores on the singular conditions indicates that the model is biased towards singular verbs. However, when the weights of unit 813 are ablated, the difference in accuracy scores decreases such that there is no obvious bias towards singular verbs. This indicates that the bias towards singular verbs might be encoded locally in the weights of unit 813.

Furthermore, it is visible that ablating the single units discussed in section 4.2 does not always affect the accuracy of the networks on the NounPP task. This might be because these units encode number information as a sub-network, so that ablating a single unit does not affect the accuracy while most of the sub-network is still intact. To test whether this is the case, these units are also ablated together in the upcoming subsections.

[Figure 4.7: Accuracy on the NounPP task while ablating one unit at a time, for (a) the dense model and (b) the sparse model.]

4.3.2 Ablating LSTM cell weights and decoder weights

In section 4.2, groups of units which differentiate between the forms of the verbs from the NounPP task were found, and the previous subsection presented the results of ablating single units from the networks. It was found that not all of these units significantly affect the accuracy on the NounPP task. As described in subsection 3.2.2, it is possible that these groups of units encode number information in a more distributed manner. To test this, the weights of all units of these groups were ablated at the same time. For the dense model, unit 961 will not be discussed, as its presence or absence did not have a significant effect on the results. Furthermore, I will refer to the group of units 873, 819, 835, and 1207 as the dense group, and to the group of units 857, 1032, and 1081 as the sparse group.

In addition to ablating all weights of the relevant units, this subsection also discusses the results of ablating only the weights of the LSTM cell. By the weights of the LSTM cell I mean the weight matrices corresponding to the different gates of the LSTM cell, as described by equations 2.1 to 2.4 in subsection 2.1.1. This provides insight into whether the impact of ablating the entire unit is caused by the weights of its gates, by its decoder weights, or by both.

Table 4.1 shows the results of ablating only unit 873 and of ablating the units of the dense group while performing the NounPP task. It is visible that ablating only the LSTM cell weights does not result in a significant drop in accuracy across the different conditions. Ablating both the decoder and the LSTM cell weights, on the other hand, does result in a significant drop in accuracy. This indicates that if number information is encoded in the LSTM cell weights of unit 873, it is not encoded in a way that greatly impacts the accuracy on the NounPP task.

Ablating the LSTM cell weights and decoder weights of the dense group results in a drop in accuracy of more than 10 percentage points for the singular conditions, and of 63.5 and 54.8 percentage points for the plural conditions. This indicates that the dense group plays an important role in the encoding of information about plural verbs and that the encoding of singular verbs is more distributed across the network.

The results of ablating only the LSTM cell weights, or both the LSTM cell weights and the decoder weights, of the sparse model are presented in table 4.2. As was the case in table 4.1, ablating only the LSTM cell weights of the groups does not have a significant impact on the accuracy scores of the model, whereas ablating both the LSTM cell weights and the decoder weights does. Ablating the decoder and LSTM cell weights of the units of the sparse group seems to increase the bias towards singular verbs: the accuracy on the singular conditions increases to 93.17 and 91.17 percent, while the accuracy on the plural conditions decreases to 8.67 and 12.83 percent respectively.

Another prominent result in table 4.2 is that ablating the decoder and LSTM cell weights of unit 813 results in no apparent bias, whereas ablating the LSTM cell weights and decoder weights of the sparse group increases the bias towards singular verbs. Ablating the LSTM cell weights and decoder weights of the sparse group together with unit 813 decreases the bias, as the gap between the accuracy scores on the singular and plural conditions shrinks, but to a lesser extent than ablating only the weights of unit 813. This indicates that the decoder weights of unit 813 contain a strong bias towards singular verbs and that the decoder weights of the units of the sparse group are significantly less biased.


Group              Condition   Full model   Weights   Decoder and weights
873                SS          99.83        99.67     98.33
                   SP          97.83        97.34     94.67
                   PS          94.67        94.34     83.33
                   PP          93.5         92.16     84.83
873 1207 835 819   SS          99.83        99.33     88.83
                   SP          97.83        96.33     81.50
                   PS          94.67        92.83     29.33
                   PP          93.5         89.5      34.67

Table 4.1: Results of ablating the weights of unit 873 and of the dense group. The "Weights" column corresponds to ablating only the LSTM cell weights, while the "Decoder and weights" column corresponds to ablating both the decoder weights and the LSTM cell weights.

Group               Condition   Full model   Weights   Decoder and weights
813                 SS          84.67        83.50     61.00
                    SP          81.33        82.50     60.67
                    PS          27.67        27.33     52.64
                    PP          28.50        30.00     56.00
857 1032 1081       SS          84.67        84.50     94.50
                    SP          81.33        84.00     92.17
                    PS          27.67        24.34     8.0
                    PP          28.50        25.50     11.17
813 1081 857 1032   SS          84.67        83.83     76.00
                    SP          81.33        83.67     74.83
                    PS          27.67        24.83     27.33
                    PP          28.50        26.33     30.34

Table 4.2: Results of ablating the weights of unit 813, the sparse group, and the sparse group including unit 813. The "Weights" column corresponds to ablating only the LSTM cell weights, while the "Decoder and weights" column corresponds to ablating both the decoder weights and the LSTM cell weights.

4.3.3 Group perturbation

The results from the previous subsection showed that ablating only the LSTM cell weights of the relevant units never had a significant impact on the accuracy on the NounPP task. Therefore, in this subsection, only the decoder weights are perturbed. Furthermore, as can be seen in figure 4.6, the decoder weights of the plural and singular verbs differ mostly in sign. Applying a high-magnitude multiplier to these decoder weights should increase the distance between the decoder weight values of singular and plural verbs; as a result, the model should be better at differentiating between singular and plural verbs and achieve higher accuracy scores on the NounPP task. Conversely, changing the sign of the decoder weights should change the predictions of the model. To test this, the decoder weights of the relevant units were multiplied by a multiplier while performing the NounPP task.
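A sketch of this perturbation, using the same hypothetical model.decoder linear layer as before; a multiplier of 0 ablates the decoder weights of the selected units, while a negative multiplier flips their sign:

import torch

def scale_decoder_columns(model, units, multiplier):
    # second-layer unit indices select columns of the [50000, 650] decoder matrix
    with torch.no_grad():
        model.decoder.weight[:, units] *= multiplier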

In table 4.3, the results of multiplying the decoder weights of the selected groups of units while performing the NounPP task are presented. We can see that ablating the decoder weights of unit 873 (multiplier zero) decreases the performance on all conditions to a level similar to ablating both the LSTM cell weights and the decoder weights, as reported in table 4.1. Ablating just the decoder weights of the units from the dense group, on the contrary, does not result in a similar decrease in accuracy across all conditions. This indicates that the drop in accuracy caused by ablating the weights of the dense group, as reported in table 4.1, is due to the ablation of both the decoder and the LSTM cell weights.

In table 4.4 we can see that, in contrast to the dense model, the accuracy scores when ablating the decoder weights only hardly differ from those when ablating both the LSTM cell weights and the decoder weights: for all groups in table 4.4, ablating only the decoder weights yields accuracy scores on the NounPP task similar to ablating both.

Both tables do, however, hint at a similarity in the encoding of number information in the two models. In table 4.3 it is visible that multiplying the decoder weights of the dense group by a negative multiplier results in accuracy scores close to zero, whereas multiplying the decoder weights of this group by two results in higher accuracy on all conditions. The fact that increasing the magnitude of the decoder weights leads to higher accuracy means that the network becomes better at differentiating between singular and plural verbs. This indicates that the network distinguishes verb forms by sign: the difference between the decoder weights of the two verb forms grows when they are multiplied by a large positive multiplier, whereas the representation of verb form is reversed when they are multiplied by a negative one.
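To spell out why the sign matters, assume the standard linear decoder followed by a softmax: the logit of a verb form v at time step t is a weighted sum of hidden activations plus a bias, and the preference between the singular and the plural form depends only on the difference between their logits,

\[
  z_{t,v} = \sum_{u} h_{t,u}\, d_{v,u} + b_v,
  \qquad
  z_{t,\mathrm{sing}} - z_{t,\mathrm{plur}}
    = \sum_{u} h_{t,u}\,\bigl(d_{\mathrm{sing},u} - d_{\mathrm{plur},u}\bigr)
      + \bigl(b_{\mathrm{sing}} - b_{\mathrm{plur}}\bigr).
\]

Multiplying the decoder weights $d_{\cdot,u}$ of the selected units by a factor larger than one enlarges their contribution to this difference, while a negative factor reverses it, which matches the reversal seen in tables 4.3 and 4.4.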

A similar pattern is visible in table 4.4. Multiplying the decoder weights of the groups containing unit 813 by two results in an increased bias towards singular verbs, whereas for the sparse group excluding unit 813, applying a multiplier of two results in a decrease of the bias towards singular verbs. Applying a negative multiplier to the decoder weights of unit 813, on the contrary, results in a bias towards plural verbs. For the sparse group, a negative multiplier results in a strong bias towards singular verbs, and for the sparse group including unit 813 it leads to a decrease in bias.

These results indicate that, for the sparse model too, the difference between the decoder weights of plural and singular verbs is largely a matter of sign: for both the dense and the sparse model, the effect of multiplying the decoder weights of the different groups is reversed when the sign of the multiplier changes.

                               Multiplier
Group              Condition   1        0        2        -1       -2
873                SS          99.83    99.33    99.83    94.83    85.67
873                SP          97.83    95.00    99.16    89.50    80.00
873                PS          94.67    84.50    98.83    73.00    48.67
873                PP          93.50    86.33    97.50    70.83    49.00
873 1207 835 819   SS          99.83    90.50    100.00   39.67    6.00
873 1207 835 819   SP          97.83    83.50    100.00   42.00    12.00
873 1207 835 819   PS          94.67    44.16    99.83    2.50     0.00
873 1207 835 819   PP          93.50    43.34    99.83    2.67     0.17

Table 4.3: Results of multiplying the decoder weights of unit 873 and of the dense group with a multiplier. All other weights in the network and the decoder layer are left unaltered. The column with multiplier 1 corresponds to the accuracy of the full model.


                                Multiplier
Group               Condition   1        0        2        -1       -2
813                 SS          84.67    59.67    96.50    33.00    12.33
813                 SP          81.33    59.67    95.33    33.33    11.17
813                 PS          27.67    53.00    8.33     79.67    95.83
813                 PP          28.50    57.99    9.67     83.67    95.50
1081 857 1032       SS          84.67    93.17    71.50    98.33    99.50
1081 857 1032       SP          81.33    91.17    71.67    96.67    98.67
1081 857 1032       PS          27.67    8.67     49.84    2.50     0.17
1081 857 1032       PP          28.50    12.83    53.67    3.17     0.50
813 1081 857 1032   SS          84.67    73.83    90.16    62.50    48.50
813 1081 857 1032   SP          81.33    72.33    88.17    58.34    45.33
813 1081 857 1032   PS          27.67    30.50    25.84    34.84    37.84
813 1081 857 1032   PP          28.50    33.33    24.34    38.83    43.84

Table 4.4: Results of multiplying the decoder weights of unit 813, the sparse group, and the sparse group including unit 813, with a multiplier. All other weights in the network and the decoder layer are left unaltered. The column with multiplier 1 corresponds to the accuracy of the full model.

4.3.4 Inverse ablation of units

In the previous subsection it was shown that the sign and magnitude of the decoder weights of the units in question have a large effect on the predictions of both networks on the NounPP task. However, these predictions could still be influenced by the decoder weights of other units. Therefore, this subsection discusses the results of ablating all decoder weights of the networks while retaining, and possibly scaling, only the decoder weights of the units from the different groups.
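A sketch of this inverse ablation, under the same illustrative PyTorch-style assumptions as before; in this reading the multiplier columns of tables 4.5 and 4.6 correspond to scaling the retained columns, and the decoder bias vector is deliberately left untouched:

import torch

def keep_only_units(decoder, unit_ids, multiplier=1.0):
    """Zero every decoder weight except the columns reading from unit_ids."""
    with torch.no_grad():
        # Copy the columns to keep, optionally rescaled.
        kept = decoder.weight[:, unit_ids].clone() * multiplier
        # Ablate all decoder weights, then restore the selected columns.
        decoder.weight.zero_()
        decoder.weight[:, unit_ids] = kept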

The results in table 4.5 are in concordance with the results of subsection 4.3.3. Recall from that subsection that ablating the decoder weights of unit 873 resulted in a considerable decrease in accuracy on the plural conditions of the NounPP task, while ablating the decoder weights of the dense group resulted in a bias towards singular verbs. As can be seen in table 4.5, retaining only the decoder weights of unit 873 results in accuracy scores of 95.33 and 94.67 percent on the plural conditions. This confirms that these units encode important information about plural verbs.

However, the results of ablating all decoder weights, visible in the column with multiplier zero, indicate that the network contains a bias towards singular verbs. The decoder bias values and the decoder weights of the units from the dense group compensate for this bias.

The results in table 4.6 are also in concordance with those of the previous subsections. Recall from subsection 4.3.3 that ablating the decoder weights of unit 813 resulted in a decreased bias towards singular verbs, while ablating the decoder weights of the sparse group resulted in an increased bias towards singular verbs. In table 4.6 it is visible that retaining only the decoder weights of unit 813 results in a bias towards singular verbs, whereas retaining only the decoder weights of the sparse group results in a bias towards plural verbs. This confirms that the decoder weights of unit 813 contain a bias towards singular verbs and that the decoder weights of the sparse group contain a bias towards plural verbs.

The results from tables 4.5 and 4.6 are also in concordance with those of the previous subsection. Recall that, based on those results, I suggested that the difference between the decoder weights for singular and plural verbs of the groups in question is, for both models, largely caused by the sign of these weights. In tables 4.5 and 4.6 we can see that, yet again, changing the sign of the multiplier reverses the effect of modifying the decoder weights.


                               Multiplier
Group              Condition   1        0        2        -1       -2
873                SS          76.67    23.00    89.83    7.33     0.00
873                SP          59.50    22.33    79.00    13.34    5.17
873                PS          95.33    78.00    99.17    15.00    7.67
873                PP          94.67    81.00    99.17    23.33    12.83
873 1207 835 819   SS          90.67    23.00    98.00    0.00     0.00
873 1207 835 819   SP          87.67    22.33    93.67    0.67     0.00
873 1207 835 819   PS          100.00   78.00    100.00   6.17     0.50
873 1207 835 819   PP          100.00   81.00    100.00   8.00     2.34

Table 4.5: Results of multiplying the decoder weights of unit 873 and the dense group, while ablating the decoder weights of all other units.

                                Multiplier
Group               Condition   1        0        2        -1       -2
813                 SS          88.00    42.00    100.00   0.00     0.00
813                 SP          87.50    39.83    100.00   0.00     0.00
813                 PS          11.67    57.67    0.00     100.00   100.00
813                 PP          14.67    59.67    0.00     100.00   100.00
1081 857 1032       SS          22.17    42.00    7.17     75.50    90.33
1081 857 1032       SP          22.83    39.83    10.00    72.83    89.33
1081 857 1032       PS          94.17    57.67    99.50    14.00    3.83
1081 857 1032       PP          93.16    59.67    99.00    18.00    7.17
813 1081 857 1032   SS          76.33    42.00    80.50    16.83    6.50
813 1081 857 1032   SP          74.17    39.83    81.33    14.50    4.67
813 1081 857 1032   PS          49.50    57.67    46.50    57.33    60.83
813 1081 857 1032   PP          48.34    59.67    41.50    62.50    69.50

Table 4.6: Results of multiplying the decoder weights of unit 813, the sparse group, and the sparse group including unit 813, while ablating all other decoder weights.

4.3.5 Performance of the dense model on NA tasks

In the previous subsections all experiments were performed on the NounPP task only. However, as described in section 3.2.1, there exist multiple templates, each requiring different operations to be performed correctly. Therefore, this section presents the performance of the dense model on all templates described in section 3.2.1. It is shown that multiplying the decoder weights of the units from the dense group results in an increased accuracy on all but one task and condition. The performance of the sparse model on the different templates is not discussed, as this model is not able to correctly predict verb form; discussing its performance on all templates would therefore be redundant.

In table 4.7, it is shown that the dense model reaches close to perfect scores on all singular conditions except for the S_conj template, where the accuracy on the SP condition is significantly lower. In case of the plural conditions of the S_conj template, the accuracy on the PS condition is significantly lower than on the PP condition. Recall from section 3.2.1 that the S_conj template contains two subject-verb pairs, and that the form of the second verb needs to be predicted. These results therefore indicate that the network encodes the number information of the first noun it encounters, as this would only lead to incorrect predictions when the two nouns differ in number. Furthermore, the accuracy scores on the plural conditions are generally lower than on the singular conditions.

Moreover, it is visible that multiplying the decoder weights of the units from the dense group by 5 results in an increased accuracy on all tasks and conditions, except for the PS condition of the S_conj task. This might be explained by the suggestion that the network encodes the number information of the first noun and uses this to predict the form of both verbs. However, this is not an entirely satisfying explanation, as it does not account for the increase in accuracy on the SP condition.

Task        Condition   Multiplier 1   Multiplier 5
NounPP      SS          99.83          100
NounPP      SP          97.83          100
NounPP      PS          94.67          100
NounPP      PP          93.50          100
S_conj      SS          99.50          100
S_conj      SP          87.00          99.33
S_conj      PS          76.17          61.67
S_conj      PP          92.50          100
NamePP      S           98.67          100
NamePP      P           69.17          99.67
Simple      S           99.62          100
Simple      P           95.86          100
ADV         S           99.50          100
ADV         P           96.00          100
Noun_Conj   SS          79.83          98.33
Noun_Conj   SP          93.83          100
Noun_Conj   PS          88.00          100
Noun_Conj   PP          95.33          100

Table 4.7: Accuracy scores on the different NA tasks when multiplying the decoder weights of units 873, 835, 819, and 1207 by 5; the column with multiplier 1 corresponds to the full model. In sentences from the Noun_conj template, the verb is always plural.


4.4 Cell dynamics

The previous sections provided an overview of the effect of ablating or perturbing the weights of certain units on the performance on the NounPP task. These results indicate that the units in question encode information relevant for number agreement. However, they do not provide insight into how this number information is encoded by these units. Therefore, this section presents the average activations of each of the cell components while processing correctly classified NounPP sentences.

In figure 4.8, we can see the activations of the gates of unit 873 and unit 813 during the processing of a NounPP sentence. For the convenience of the reader, I summarise the intuition behind the different components of an LSTM cell as stated in section 2.1.1:

• The forget gate f determines which information in the cell state from previous time steps is forgotten. When the activation is zero, $f_t = 0$, previous information is erased; when the activation is one, $f_t = 1$, previous information is preserved on the cell state.

• The input gate i determines which incoming information to write to the cell state.

• The candidate gate C̃ determines how much of the incoming information to write to the cell state.

• The output gate o determines what information from the cell state will be written to the hidden state, and thus presented as output.

• The cell state C contains information from preceding time steps, and is used to form the hidden state of the current time step.

• The hidden state h functions as the output of the cell. At every time step the hidden state of the previous time step is also processed as input.
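The analysis below reads figure 4.8 in terms of these quantities. For reference, a single LSTM step with the usual PyTorch gate ordering (i, f, C̃, o) can be written out as in the sketch below; this is the textbook formulation assumed throughout, and the actual model may differ in minor details such as bias handling.

import torch

def lstm_step(w_ih, w_hh, b_ih, b_hh, x_t, h_prev, c_prev):
    """One LSTM time step, returning all gate activations for inspection."""
    hidden = h_prev.shape[-1]
    pre = x_t @ w_ih.T + h_prev @ w_hh.T + b_ih + b_hh
    i, f, g, o = pre.split(hidden, dim=-1)            # gate pre-activations
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    c_tilde = torch.tanh(g)                           # candidate values C~
    c = f * c_prev + i * c_tilde                      # new cell state
    h = o * torch.tanh(c)                             # new hidden state / output
    return {"i": i, "f": f, "c_tilde": c_tilde, "o": o, "c": c, "h": h}

Recording the returned dictionary at every time step while feeding in a NounPP sentence yields the kind of traces plotted in figure 4.8.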

In figure 4.8a, we can see that when the first word “De" is processed, the activation of i is zero, so no information is written to the cell. When the subject “vader" is processed, the activation of i is close to one regardless of the form of the subject, and information about the subject is written to the cell state. The activation of f is zero, which means that previous information in the cell state is erased. The activation of C̃ is one for a singular subject and minus one for a plural subject, meaning that different values are written to C depending on the form of the subject. Furthermore, the activations of C and h are symmetrical around zero for the different conditions, so the unit differentiates between verb forms using sign.

It is visible that between the subject and the verb, i has an activation of zero, so no new information is written to the cell state. Gate f has an activation close to one, so the information about the subject is retained. The activation of C remains symmetrical, which confirms that the information about the subject stays in the cell state. The activation of h is lower between the subject and the verb than while processing the subject and the verb, indicating that the unit only outputs information while processing the subject and the verb.

While processing the noun preceding the verb, the activations of C and h are still symmetrical around zero, indicating that the unit still differentiates between singular and plural by sign. Furthermore, o also has an activation close to one, which entails that the information from C is presented as output of the cell.


Figure 4.8: Gate activations of unit 873 from the dense model (panel a) and unit 813 from the sparse model (panel b) while processing NounPP sentences.

4.4.1 Intervention

The activations of unit 813 from the sparse model differ from those of unit 873, as can be seen in figure 4.8b. The activations of i show behaviour similar to that of unit 873. Gate f, however, has an activation close to one until the noun preceding the verb is encountered, so previous information in C is never forgotten. Gate o also behaves similarly to unit 873, as its activation is close to one when processing the subject, the interfering noun, and the verb itself. The biggest discrepancy between the activations of units 813 and 873, however, is the activation of C. As can be seen, this activation is centered around two instead of zero, and is therefore always positive. Recall from section 2.1.1 that the hidden state is formed as $h_t = o_t \times \tanh(C_t)$; as the activation of $C_t$ is always above zero, the same holds for the activation of $h_t$. This might explain the bias towards singular verbs of the sparse network and unit 813 discussed in section 4.3: it was suggested there that the networks use the sign of the weights to differentiate between singular and plural verbs, so if the output of unit 813 is always positive and corresponds to singular verbs, this might cause the network's bias towards singular verbs.

To test this, I made an intervention on the activation of C of unit 813 while performing the NounPP task. Using an approach similar to Giulianelli et al. (2018), the activation of C while processing the first word of each sentence was set to zero; the activation of C while processing the other words was left unaltered. This resulted in a significant increase in accuracy on all conditions of the NounPP task: the accuracy on the SS and SP conditions was 76.33% and 75.17% respectively, while the accuracy on the PS and PP conditions was 73.83% and 77.00% respectively. Comparing these results with the full model in table 4.2, it is visible that the intervention removed the network's bias towards singular verbs. It is clear that the overall performance of the network has not only increased compared to the performance of the full
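A minimal sketch of this intervention, with illustrative names only: a hand-rolled model exposing a per-token step function and an (h, c) state is assumed, and the layer-local index of unit 813 as well as the actual interface of the codebase may differ.

import torch

def run_with_cell_intervention(model, tokens, unit, layer):
    """Process a sentence, zeroing one unit's cell state after the first word."""
    hidden = model.init_hidden(batch_size=1)             # hypothetical helper
    logits = None
    with torch.no_grad():
        for t, token in enumerate(tokens):
            logits, hidden = model.step(token, hidden)   # hypothetical step API
            if t == 0:                                   # intervene on the first word only
                h, c = hidden
                c[layer, :, unit] = 0.0                  # clamp the unit's cell state
                hidden = (h, c)
    return logits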
