Subject-Verb Number Agreement in a Dutch LSTM Language Model

Academic year: 2021


Layout: typeset by the author using LaTeX.
Cover illustration:


Hugh Mee Wong 11827750

Bachelor thesis
Credits: 18 EC

Bachelor Kunstmatige Intelligentie

University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor:
Dr. D. (Dieuwke) Hupkes
J.W.D. (Jaap) Jumelet, MSc
T. (Tom) Kersten, BSc

Institute for Logic, Language and Computation
Faculty of Science
University of Amsterdam
Science Park 107
1098 XG Amsterdam


Abstract  The field of natural language processing (NLP) has enjoyed great success on a remarkable variety of linguistic processing tasks with the deployment of artificial neural networks. In recent years, researchers have tried to understand the inner mechanisms of such systems by probing networks for genuine syntactic representations and high-level (frequency-based) heuristics. In this thesis, we explore the capabilities and limits of a long short-term memory (LSTM; Hochreiter and Schmidhuber, 1997) language model trained and tested on Dutch data. Following the research carried out by Lakretz et al. (2019), we focus on grammatical number agreement and examine the network performance on a wide range of tasks in which the network is set to predict the correct verbs in sentences. We also provide the basis for research on cross-serial dependencies, a complex structure found in a mere handful of spoken languages. Through an ablation study, we discover one specific "plural network unit" that affects performance on number agreement to a significant degree when the subject at hand is plural. We also find that the network performs better than usual on subject-verb agreement in declarative content clauses. The plural unit we found does not affect the network performance on sentence constructions involving such dependent content clauses. We conclude that the Dutch LSTM language model constructs syntactic representations of its input to some extent, though likely in a distributed rather than a local fashion, and that such inner representations differ between dependent and independent clauses.

Keywords natural language processing, cross-serial dependencies, syntax, number agreement,


Contents

1 Introduction
2 Theoretical Background
  2.1 Long Short-Term Memory Networks
  2.2 Number Agreement
  2.3 Cross-Serial Dependencies
3 Related Work
  3.1 Prediction Tasks
  3.2 Syntactic and Semantic Cues
  3.3 Ablation
4 Method
  4.1 Long Short-Term Memory Network on a Dutch Corpus
  4.2 Language Modelling
  4.3 Number Agreement Tasks
    4.3.1 Standard Templates
    4.3.2 Control Tasks
    4.3.3 Expressions of Quantity or Collection
    4.3.4 Declarative Content Clauses (That-Clauses)
    4.3.5 Defining and Non-Defining Relative Clauses
    4.3.6 Cross-Serial Dependencies
5 Results
  5.1 Full Network Performance
    5.1.1 Comparison with the English LSTM Language Model
    5.1.2 Control Tasks
    5.1.3 Varying the Parts of Speech
  5.2 Ablation Study
6 Discussion
  6.1 Key Findings
  6.2 Future Work


Acknowledgements

I would like to thank my supervisor Dieuwke Hupkes for her never-ending enthusiasm and support during this graduation project and for never stopping me when I wanted to “test just one more thing”. I would also like to thank Jaap Jumelet and Tom Kersten for sharing their knowledge and time with me and my fellow third-year student Jeroen Taal. It truly has been a pleasure to work with these great-minded people.

A bachelor’s thesis might not be the most appropriate place to thank my secondary school teachers (now colleagues!), but I would like to thank Ward Leeuwaarden and Wietske van der Ven. I am eternally grateful to them for guiding the angsty teenager I used to be in the right direction; my thesis would not have been realised without their help all those years ago. I hope the next generations of students are able to see them for the wonderful people they are.

I would like to thank my parents. We do not share one “whole” language, but through a combination of the three mother tongues in the household they sure have let me know that they will always support me and welcome me back home – even though I moved out during this graduation project. I would also like to thank Casper van Velzen for fuelling my passion for linguistics and for keeping me updated on his opinion on everything regarding the linguistic works of Chomsky, both of which he started doing when I was still in secondary school.

Finally, I would like to thank Jelle Bosscher for being the kindest and most supportive co-TA and fellow student, for listening to the excessive amount of complaints and for repeatedly reminding me that I had to get back to work. As per his suggestions, I have decided to plot the most remarkable results of this thesis in white.


CHAPTER 1

Introduction

Natural languages have been around for so long that it has become unthinkable that human societies could function without them. Our reliance on language and the never-ending quest to understand its intricacies are reflected in the extensive amount of research conducted in the field of natural language processing (NLP). NLP aims to develop systems that can compare to humans in their abilities to understand and make use of language. In recent years, researchers have shifted from supervised to unsupervised learning, with artificial neural networks often the technique of choice. This shift is in part motivated by the lack of knowledge regarding the ways in which humans process language.

With neural networks currently dominating the field of NLP, we appear to have traded transparency for performance: while this branch of artificial intelligence has enjoyed great success on a range of NLP tasks with the deployment of neural networks, researchers have a difficult time understanding how such systems operate. As this might imply that our understanding of (the processing of) natural language is yet again hampered, researchers are now attempting to understand how and what such networks learn. This is in line with the emergence of explainable AI, a machine learning field that steps back from the black-box approach in which internal dynamics are inferred from performance alone.

A substantial amount of NLP research on linguistics and cognition has centred around long short-term memory networks (LSTMs), a neural network architecture proposed by Hochreiter and Schmidhuber (1997) as an extended version of the classic recurrent neural network (RNN). Such networks have been shown to be able to learn simple (artificial) context-free and context-sensitive languages (Gers and Schmidhuber, 2001). As the popularity of LSTM models and other RNN-based networks grows, the question remains what such networks learn when they process sentences: do they employ high-level frequency-based heuristics, or is syntactic information genuinely stored in units of the network?

Research on the ability of neural networks to learn syntactic structures has mainly focused on subject-verb agreement (e.g., Gulordava et al., 2018; Linzen et al., 2016), which has been regarded as evidence for structured syntactic representations in humans (Everaert et al., 2015). Lakretz et al. (2019) have conducted a detailed subject-verb agreement study on the inner workings of an English LSTM language model. They conclude that LSTMs do implement genuine syntactic processing mechanisms, at least to some extent. Their findings include two so-called “number units” that play a large part in the encoding and tracking of subject-verb number dependencies in the neural network they studied. In this thesis, we adopt their approach in order to locate individual units essential to the performance on subject-verb number agreement of a neural network trained on a Dutch corpus.

Research Questions

We set a Dutch LSTM language model to predict a verb in a sentence so as to make its number agree with that of its subject. Such sentences are included in a set of tasks now known as number agreement tasks, first proposed by Linzen et al. (2016). We extend the range of number agreement tasks used by Lakretz et al. (2019) so as to test the network performance on a wide range of tasks. This serves mainly as an exploratory method for answering the first of our two main research questions: What factors affect the behaviour of the network on subject-verb number agreement? In Chapter 4, we describe the various aspects on which we test the network in order to answer this main question.

For the second question, we aim to investigate the manner in which number information is encoded in the network: Is number agreement encoded in a local or distributed fashion? To this end, the activation of each unit is set to zero, one unit at a time, in a process known as ablation (Lakretz et al., 2019). Model performance and the internal cell-state dynamics of individual units are analysed to find out whether such information can reveal anything about the way the model encodes subject number information.

Thesis Outline

In Chapter 2, we provide the theoretical background on long short-term memory networks (2.1) and number agreement (2.2), thereby highlighting both the technical and the linguistic aspects of this thesis. Chapter 3 continues with an overview of related work bridging these two topics. Following the information regarding the approaches and findings of studies on subject-verb dependencies in neural networks, we describe our approach and the motivation for the constructed number agreement tasks in Chapter 4. We present our results in Chapter 5 and evaluate our approach and findings in Chapter 6.


CHAPTER 2

Theoretical Background

In this chapter, we provide a brief description of the most important technical and linguistic concepts of this thesis: long short-term memory networks (2.1) and number agreement (2.2). We also briefly introduce cross-serial dependencies in Dutch (2.3), a further syntactic structure that we include in our experiments.

2.1 Long Short-Term Memory Networks

Standard feedforward neural networks suffer from one major shortcoming that generally affects their performance on NLP tasks: they cannot percolate information about previously encountered data through to the succeeding data in a sequence. As information passes through the input layer and the hidden layers of the network to the output layer, it cannot be reused to influence the output for the next item in a sequence. As a consequence, semantic and syntactic dependencies essential to the comprehension and generation of text are discarded.

Recurrent neural networks (RNNs; e.g., Elman, 1990; Jordan, 1997) have been designed to bypass this problem of immediate memory loss. This class of networks allows output vectors to be fed back as input vectors, and has thereby enjoyed great success as a sequence-processing device for short-range dependencies. As the distance between the current information and the relevant previous information grows, however, the network ceases to be able to connect the two pieces of information. Although it seems possible in theory, it is difficult for a simple recurrent neural network to latch onto information for longer periods (Bengio et al., 1994; Hochreiter, 1991).

The learning of long-range dependencies is hampered by the gradient update in the process of gradient descent optimisation. The longer the range, the smaller the gradient becomes, resulting in a vector that contains no new information, thereby inhibiting the learning process of the network (Hochreiter, 1991; Hochreiter et al., 2001). It is also possible for the gradient of an RNN to grow very large at an undesirable rate, resulting in an unstable learning process. These two complications, known as the vanishing gradient problem and the exploding gradient problem, are avoided with the use of long short-term memory networks (LSTMs; Hochreiter and Schmidhuber, 1997).
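To make the scale of these two problems concrete, consider the following toy Python calculation (our own illustration, not tied to any specific network): backpropagating through T time steps multiplies roughly T per-step scaling factors, so factors below 1 shrink the gradient exponentially while factors above 1 blow it up.

```python
def gradient_magnitude(per_step_factor: float, num_steps: int) -> float:
    """Rough magnitude of a gradient after backpropagating through
    `num_steps` time steps, assuming the same scaling factor at each
    step (a deliberate simplification of the chain-rule product)."""
    return per_step_factor ** num_steps

# A factor of 0.9 per step leaves well under 1% of the gradient after
# 50 steps (vanishing), while 1.1 per step grows it by more than two
# orders of magnitude (exploding).
vanishing = gradient_magnitude(0.9, 50)
exploding = gradient_magnitude(1.1, 50)
```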

Memory Updates  The core concept of LSTMs lies in the functions of the gates and cells of its units. The cell state Ct (Equation 2.2) is where information is stored for an arbitrary amount of time. For each step t in the data sequence, the information in the cell state is updated with new information that is represented by C̃t (Equation 2.1). Relevant information for the output is then encoded in the hidden state ht following Equation 2.3.

    C̃t = tanh(WC · [ht−1, xt] + bC)    (2.1)
    Ct = ft ◦ Ct−1 + it ◦ C̃t           (2.2)
    ht = ot ◦ tanh(Ct)                  (2.3)


Figure 2.1: A graphical overview of a long short-term memory network by Colah (2015). Units of a standard LSTM network are controlled by three gates. The input gate it, output gate ot and forget gate ft control the flow of information for cell state Ct by updating the candidate values C̃t at each time step t. The forget gate determines what previous information is retained and what gets discarded, clearing the cell of irrelevant information (2.1a). The input gate then determines the flow of incoming information and decides which new candidate values will be added to the cell state (2.1b). The cell state is then updated based on this information (2.1c). Finally, the output gate decides what information in the cell state gets output (2.1d).

C̃t ∈ (−1, 1) in Equation 2.1 is a vector of new candidate values to be considered for the new cell state. ft, it, ot ∈ (0, 1) are scalars that control the gates of the LSTM network. These gating scalars are recomputed for each input vector based on the information of the hidden state and the information currently presented to the network. The standard composition of a unit in an LSTM network consists of an input gate, a forget gate and an output gate¹ (Figure 2.1).

Input Gate  The right-hand side of Equation 2.2 is partially determined by the input gate (Figure 2.1a), with gating scalar it. The value of it is determined by the previous states, the information currently presented to the network at step t and the corresponding weights, represented by Wi and bi in Equation 2.4. The sigmoid ensures a value between 0 and 1, allowing the input gate to control the flow of input values.

    it = σ(Wi · [Ct−1, ht−1, xt] + bi)    (2.4)

The new candidate values in C̃t are scaled by it to determine the influence that these new values have on cell state Ct: it = 0 indicates that the new information is disregarded, while it = 1 implies that the new values are fully taken into account in the state value updates.

Forget Gate  The remaining component of the right-hand side of Equation 2.2 is determined by the forget gate (Figure 2.1b), with gating scalar ft. The value of ft is computed in a way that is similar to the computation of it, but with its own corresponding weights Wf and bf (Equation 2.5), determined during the training of the network.

    ft = σ(Wf · [Ct−1, ht−1, xt] + bf)    (2.5)

¹Note that we abuse notation here: we use all symbols in Section 2.1 to denote components of individual cells rather than whole vectors.


A value of 0 represents the complete forgetting of the previous cell state Ct−1, while a value of 1 implies that the previous state is fully retained².

Output Gate  The function of the last gate is to determine the information that will be output (Figure 2.1d). The value of gating scalar ot (Equation 2.6) is also in part influenced by the weights determined during training. A tanh function is applied to the new cell state to ensure a value between −1 and 1 for each element of the state vector.

    ot = σ(Wo · [Ct, ht−1, xt] + bo)    (2.6)

These tanh-squashed cell-state values, scaled by ot, determine how strongly each (positive or negative) element appears in the resulting output vector ht (Equation 2.3).
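The update equations of this section can be sketched in a few lines of Python/NumPy. This is a minimal illustration of Equations 2.1 to 2.6, not the implementation used in this thesis; the parameter names and shapes are our own. Note that, following the equations above, the gates here also receive the cell state as input (so-called peephole connections).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One step of the LSTM cell described by Equations 2.1-2.6.

    Shapes (illustrative): x_t is (d_in,), h_prev and c_prev are
    (d_hid,). Each weight matrix acts on the concatenation given in
    the corresponding equation.
    """
    z = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    zc = np.concatenate([c_prev, h_prev, x_t])   # [C_{t-1}, h_{t-1}, x_t]

    c_tilde = np.tanh(params["Wc"] @ z + params["bc"])    # Eq. 2.1: candidates
    i_t = sigmoid(params["Wi"] @ zc + params["bi"])       # Eq. 2.4: input gate
    f_t = sigmoid(params["Wf"] @ zc + params["bf"])       # Eq. 2.5: forget gate
    c_t = f_t * c_prev + i_t * c_tilde                    # Eq. 2.2: cell state

    zo = np.concatenate([c_t, h_prev, x_t])      # [C_t, h_{t-1}, x_t]
    o_t = sigmoid(params["Wo"] @ zo + params["bo"])       # Eq. 2.6: output gate
    h_t = o_t * np.tanh(c_t)                              # Eq. 2.3: hidden state
    return h_t, c_t
```

Because the hidden state is a sigmoid-gated tanh, every element of h_t lies strictly between −1 and 1, matching the bounds discussed above.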

2.2 Number Agreement

Studies on the extent to which RNN-based models can use sequential cues to learn syntactic structures have mainly focused on subject-verb number agreement (e.g., Giulianelli et al., 2018; Gulordava et al., 2018; Linzen et al., 2016). Grammatical agreement refers to the phenomenon of co-occurring words in a sentence sharing a property (such as gender or person), with the property of one changing according to a change in the other. This phenomenon is one that demonstrates that phrases cannot simply be reduced to mere properties of their strings, as language is about “structures, not strings” (Everaert et al., 2015). Subject-verb agreement has therefore been regarded as evidence for structured syntactic representation in humans.

In English and Dutch, the present-tense third person form of a verb is inflected for number, i.e., its conjugation is (partially) based on the number of the head of the subject it describes: “the cat walks” but “the cats walk”³. The head of a phrase refers to the word on which the syntactic categories (e.g., person and case) of the whole phrase are based. The following examples illustrate the use of “dog” as the head (in boldface) of the subject (in italics):

(1) “The dog walks.”
(2) “The barking dog walks.”
(3) “The dog that is barking walks.”

In sentences (1) and (2), the verb is adjacent to the head of the subject, but sentence (3) shows that in English, the verb can occur further along in the sentence, while its number is still dependent on the subject number. The unbounded length of the gap between the noun and the dependent verb showcases the advantage of LSTMs over models and techniques that rely on windows of fixed size (e.g., n-grams): it is not uncommon for, i.a., noun phrases in both English and Dutch to become increasingly complex. Giulianelli et al. (2018) coin the term “context size” to refer to the number of tokens (mainly words and punctuation marks) between the head of the subject and the verb. The context size of the English example in (4) is 9, as there are nine words separating the verb from the noun that its form is dependent on. Note that the Dutch sentence in (4) has a context size of 10: the Dutch language requires that a comma (a token on its own) separates the two predicates.

(4) a. In English: “The dog that attacked the man near the park last week walks down the street.”
    b. In Dutch: “De hond die de man bij het park vorige week aanviel, loopt op straat.”

²The naming of this layer as the forget layer is rather unfortunate, as ft = 1 indicates that the forget gate is open, causing the network to fully remember the previous state.

³Note that this is the case for the present-tense third person form of a regular verb, which does not imply that this phenomenon occurs across all verbs and persons in a language. In English, the first person form is the same for singular and plural: “I/We walk down the street.”
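As an illustration of this definition, the context sizes of the two sentences in (4) can be computed directly from token indices (a sketch of our own; the whitespace tokenisation shown here is a simplification):

```python
def context_size(tokens, head_index, verb_index):
    """Number of tokens strictly between the head of the subject and the
    verb whose number depends on it (Giulianelli et al., 2018).
    Punctuation marks count as tokens of their own."""
    return verb_index - head_index - 1

# English example (4a): nine tokens separate "dog" from "walks".
en = ("The dog that attacked the man near the park last week walks "
      "down the street .").split()

# Dutch example (4b): the obligatory comma adds one token, giving 10.
nl = ("De hond die de man bij het park vorige week aanviel , loopt "
      "op straat .").split()
```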

(12)

CHAPTER 2. THEORETICAL BACKGROUND

Agreement Attraction  Bock and Miller (1991) conducted a study on erroneous number agreement in human speech in sentences such as those shown below, which are extracted from their 1991 paper:

(5) a. “The cost of the improvements have not yet been estimated.” (incorrect)
    b. “The cost of the improvements has not yet been estimated.” (correct)

It was found that the number of a neighbouring preverbal noun had a significant effect on the incidence of number agreement errors made by the human participants. Such sentences might also pose a challenge for sequential language models, as it has been hypothesised that they might employ surface-based heuristics such as “agreeing with the most recent noun” (Linzen et al., 2016). If this is indeed the case, the language model will predict the verb in (5) based on improvements rather than on cost, the noun that the conjugation of the verb actually depends on. In this sentence, improvements is the agreement attractor. The same sentence construction is allowed in Dutch: (6) is the Dutch translation of (5).

(6) a. “De prijs van de verbeteringen zijn nog niet geschat.” (incorrect)
    b. “De prijs van de verbeteringen is nog niet geschat.” (correct)

The handling of agreement attractors might provide valuable insight into the way that LSTM language models process language, especially with respect to the question concerning potential high-level heuristics.
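The surface heuristic hypothesised above can be stated precisely as a baseline. The snippet below (our own illustration; the function and data names are hypothetical) encodes “agree with the most recent preverbal noun”, which necessarily fails on attraction sentences like (5) and (6):

```python
def most_recent_noun_heuristic(preverbal_nouns):
    """The surface-based heuristic hypothesised by Linzen et al. (2016):
    predict the verb number from the most recent preverbal noun.
    `preverbal_nouns` is a list of (noun, number) pairs in sentence
    order, with number "S" (singular) or "P" (plural)."""
    return preverbal_nouns[-1][1]

# On sentence (5), the heuristic agrees with the attractor
# 'improvements' (plural) instead of the head 'cost' (singular),
# producing the incorrect form 'have'.
prediction = most_recent_noun_heuristic([("cost", "S"), ("improvements", "P")])
```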

2.3 Cross-Serial Dependencies

Most research on syntactic representations in neural networks is centred around grammatical number agreement, a structure found in context-free grammars, one of the categories in Chomsky’s hierarchy of formal grammars (Chomsky, 1956). This implies that such agreement dependencies can be created with recursive production rules of the form A → α, in which A must be a nonterminal (replaceable) symbol and where α can be a terminal, a nonterminal or the empty symbol. Lakretz et al. (2019) conclude based on their findings for number agreement that LSTMs do have a way of (partially) representing syntactic structures of sentences. While the findings of all studies on such structures are commendable, it must be noted that grammatical number agreement is only one of the many syntactic dependencies a language may display.

An arguably more complex syntactic structure found in Dutch is that of cross-serial dependencies (alternatively called crossing dependencies). The Dutch language has been claimed to be weakly context-sensitive (Huybregts, 1976) and strongly context-free (Bresnan et al., 1982) due to the appearance of these dependencies. In Dutch, an arbitrary number of noun phrases and corresponding verbs (infinitives without complementiser te) can be inserted into a subordinate clause to create constructions such as those shown in (7)-(9).

(7)  ... dat  de jongen  de kinderen  ziet         lezen
         that the boy    the children see-present  read-inf
     “... that the boy saw the children read”

(8)  ... dat  de jongen  de moeder   de kinderen  ziet         laten    lezen
         that the boy    the mother  the children see-present  let-inf  read-inf
     “... that the boy saw the mother let the children read”

(9)  ... dat  de jongen  de leraar    de moeder   de kinderen  ziet         helpen    laten    lezen
         that the boy    the teacher  the mother  the children see-present  help-inf  let-inf  read-inf
     “... that the boy saw the teacher help the mother let the children read”

Swiss German is another spoken language that allows cross-serial dependencies, but the appearance of these crossing dependencies differs from those in Dutch in that they include grammatical case markings (10): the objects take on the dative or accusative case and verbs subcategorise their objects for these different cases (Shieber, 1987). As cases have been abolished in modern Dutch, though a few instances still appear in set phrases, the crossing dependencies in Dutch are not apparent on string level, i.e., from merely looking at the words it is not apparent that a verb belongs to a particular noun and not to another neighbouring one. If a neural network is to correctly understand a Dutch sentence containing cross-serial dependencies, it must understand the concept of dependencies being allowed to cross.

(10) ... das  mer  d’chind           em Hans    es huus     lönd  hälfe  aastriche
         that we   the children-ACC  Hans-DAT   the house-ACC  let   help   paint
     “... that we let the children help Hans paint the house”

In Chapter 4, we describe the ways in which crossing dependencies have been incorporated in our experiments regarding number agreement, but we note that this merely scratches the surface of the possibilities. We propose and discuss alternative ways of probing a network on crossing dependencies in Chapter 6.
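To make the pattern in (7)-(9) explicit: the noun phrases and their verbs appear in the same left-to-right order, so the i-th noun phrase is governed by the i-th verb. A small generator reproduces the three examples (a sketch of our own; the vocabulary, pairing and function name are illustrative and not the stimulus generator actually used in this thesis):

```python
# Each noun phrase is stored together with the verb that governs it.
FIRST = ("de jongen", "ziet")          # the boy / see-present
MIDDLE = [("de leraar", "helpen"),     # the teacher / help-inf
          ("de moeder", "laten")]      # the mother / let-inf
LAST = ("de kinderen", "lezen")        # the children / read-inf

def cross_serial(n_middle=0):
    """Build '... dat NP1 ... NPk V1 ... Vk', where NP_i belongs to V_i.
    `n_middle` middle NP/verb pairs are inserted between the first and
    last pair, mirroring examples (7)-(9)."""
    pairs = [FIRST] + MIDDLE[len(MIDDLE) - n_middle:] + [LAST]
    nouns = " ".join(noun for noun, _ in pairs)
    verbs = " ".join(verb for _, verb in pairs)
    return f"dat {nouns} {verbs}"
```

The crossing is visible in the return value: all noun phrases precede all verbs, yet each verb must be matched with the noun phrase in the same serial position.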


CHAPTER 3

Related Work

In recent years, the growing popularity of neural networks in the field of natural language processing has encouraged many to pursue a line of research that aims to understand and explain the inner workings of these systems. In Chapter 2, we presented the testing of networks on subject-verb number agreement as a way to probe these networks for evidence of genuine syntactic representations. In this chapter, we give an overview of work centred around this concept. Note that, as research with Dutch language models is scarce, most papers mentioned in this chapter concern neural networks trained and evaluated on English data.

3.1 Prediction Tasks

Various tasks can be devised as a means of testing the abilities of a neural network on subject-verb number agreement. In this section, we present two types of binary tasks with which researchers have presented neural language models.

Number Prediction Task  Linzen et al. (2016) proposed the number prediction task, in which a model is presented with the part of a sentence preceding a present-tense verb, as shown in (11) and (12). For this prediction task, the network is set to assign a probability distribution over the grammatical numbers that the next verb can take on, with the options for English being singular and plural. A prediction is correct if the network assigns a higher probability to the correct number than to the incorrect one.

(11) “The keys to the cabinet ___”
(12) “The girl he met in France ___”

This task was the first one proposed for these purposes and has been described in great detail by Linzen et al. (2016), but it is another one of their suggestions that has gained momentum in recent years for research on number agreement in neural networks.

Number Agreement Task  Linzen et al. (2016) subsequently proposed four alternative training objectives and corresponding tasks. The task corresponding to the language modelling objective in Table 3.1 has been referred to as the number agreement task (NA-task) in follow-up studies. For an NA-task, the neural network is set to assign probabilities to both the singular and plural forms of a verb. The prediction of the model is said to be correct if the probability assigned to the correct form is higher than the one assigned to the incorrect form. Since the experiment conducted by Linzen et al. (2016), number agreement tasks have been widely used as a standardised way of probing the performance, internal behaviour and syntactic capabilities of neural networks (e.g., Giulianelli et al., 2018; Gulordava et al., 2018; Lakretz et al., 2019).
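The NA-task criterion boils down to a probability comparison. The sketch below (our own; the `model_prob` interface is an assumption made for illustration, not a real library API) computes NA-task accuracy for any language model that can score a word given a prefix:

```python
def na_correct(p_correct_form, p_incorrect_form):
    """A prediction counts as correct iff the model assigns a higher
    probability to the correctly inflected verb form."""
    return p_correct_form > p_incorrect_form

def na_accuracy(model_prob, items):
    """Fraction of NA-task items the model gets right.

    `items` is a list of (prefix, correct_form, incorrect_form) triples;
    `model_prob(prefix, word)` returns P(word | prefix) under the
    language model (an assumed interface)."""
    hits = sum(na_correct(model_prob(prefix, good), model_prob(prefix, bad))
               for prefix, good, bad in items)
    return hits / len(items)
```

With a toy scoring function one can see how an attraction error registers: a model that prefers “is” after “the keys to the cabinet” is simply scored as incorrect on that item.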

A recent paper by Lakretz et al. (2019) describes the use of synthetic NA-tasks generated using seven different templates (Table 3.2), an extended version of the tests mentioned by Linzen et al. (2016), to challenge the network with increasingly complex dependencies. Recall the concept of grammatical (number) agreement attraction introduced in Chapter 2 (Section 2.2), where the grammatical number of a noun is incorrectly extended to a neighbouring verb that is in fact dependent on a different noun. For each NA-task in Table 3.2, Lakretz et al. (2019) generated all possible combinations of the numbers of the subject and the intervening common nouns, resulting in multiple conditions per number agreement task.

Training Objective   Sample Input                          Training Signal   Prediction Task
Number prediction    “The keys to the cabinet”             plural            singular/plural?
Verb inflection      “The keys to the cabinet [is/are]”    plural            singular/plural?
Grammaticality       “The keys to the cabinet are here.”   grammatical       grammatical/ungrammatical?
Language model       “The keys to the cabinet”             are               P(are) > P(is): true/false?

Table 3.1: Linzen et al. (2016) propose four objectives to train neural networks on. The task corresponding to their language modelling objective is now known as the number agreement task, in which the neural network is set to assign probabilities to the different forms of a verb that is inflected for number. The network is said to have made a correct prediction if the probability it assigns to the correct form is higher than the probability it assigns to the incorrect form.

NA-Task     Template                   Example
Simple      The N V …                  The boy greets …
Adv         The N adv V …              The boy probably greets …
2Adv        The N adv1 adv2 V …        The boy most probably greets …
CoAdv       The N adv and adv V …      The boy openly and nicely greets …
NamePP      The N prep name V          The boy near Mary greets …
NounPP      The N prep the N V         The boy near the car greets …
NounPPAdv   The N prep the N adv V     The boy near the car kindly greets …

Table 3.2: Lakretz et al. (2019) make use of synthetic number agreement tasks generated using various templates. Different conditions per task are determined by the combination formed by the numbers of all nouns preceding the verb.
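For the NounPP template, for instance, this procedure yields the four conditions SS, SP, PS and PP. A small sketch of generating them (our own illustration; the lexicon is made up and is not the actual stimulus set of Lakretz et al. (2019) or of this thesis):

```python
from itertools import product

# Illustrative lexicon: one noun and one verb, each in both numbers.
NOUNS = {"S": "boy", "P": "boys"}
VERBS = {"S": "greets", "P": "greet"}

def nounpp_items():
    """All four number conditions for the NounPP template
    'The N prep the N V …'. The condition label gives the subject
    number first (which the verb must agree with) and the number of
    the intervening prepositional noun (a potential attractor) second.
    Each item is (prefix, correct_form, incorrect_form)."""
    items = {}
    for subj, attr in product("SP", repeat=2):
        prefix = f"the {NOUNS[subj]} near the {NOUNS[attr]}"
        wrong = VERBS["P" if subj == "S" else "S"]
        items[subj + attr] = (prefix, VERBS[subj], wrong)
    return items
```

The mismatch conditions PS and SP are the interesting ones: there, agreeing with the most recent noun gives the wrong verb form.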

3.2 Syntactic and Semantic Cues

Linzen et al. (2016) conclude that RNNs are able to capture subject-verb number agreement to some extent in strongly supervised settings, but state that the unsupervised language modelling objective is insufficient to enable an RNN to capture syntax-sensitive dependencies such as long-distance agreement. Bernardy and Lappin (2017) conducted a follow-up study in which they used the approach devised by Linzen et al. (2016) and experimented with several neural network architectures on a larger corpus. They found that neural networks perform better at learning structural patterns of syntactic agreement when they construct substantial lexical representations, prompted by larger vocabularies. This finding suggests that neural networks use rich lexical embeddings to learn syntactic dependencies, thereby relying on distributional regularities of semantic cues.

Gulordava et al. (2018) extend the work carried out by Linzen et al. (2016) and reevaluate their findings. In their approach, they include nonce sentences, sentences that are grammatical but semantically nonsensical, e.g., a variant of Chomsky’s well-known “Colorless green ideas sleep furiously” (Chomsky, 1957, p. 15): “The colorless green ideas I ate with the chair sleep furiously”. With this approach, they circumvent the possibility of having to attribute the performance of the neural networks on syntactic dependencies to heuristics based on semantic (frequency-based) cues.

Gulordava et al. (2018) centre their research around the unsupervised language modelling objective, as Linzen et al. (2016) deemed this insufficient for a network to learn syntactic structures. They examined English, Italian, Hebrew and Russian and trained RNNs to perform generic language modelling, in which the networks are not extrinsically motivated to focus on number agreement. This in turn implies that the networks do not carry an explicit prior bias towards syntactic number agreement. They found that the RNNs perform well on solving long-distance number agreement, a finding that is consistent across all languages they examined. From this they conclude that RNNs do construct grammatical representations of the sentences they process. Lastly, they observe that the ability of a network to perform both generic language modelling and syntactic generalisation is related to the careful tuning of hyperparameters, a finding consistent with those of Melis et al. (2018) and Kuncoro et al. (2018b).

NA-task     Condition   Full model   776 ablated   988 ablated
Simple      S           100          –             –
            P           100          –             –
Adv         S           100          –             –
            P           99.6         –             –
2Adv        S           99.9        –             –
            P           99.3         –             –
CoAdv       S           98.7         –             82.1
            P           99.3         79.2          –
namePP      SS          99.3         –             –
            PS          68.9         39.9          –
nounPP      SS          99.2         –             –
            PS          92.0         48.0          –
            SP          87.2         –             54.2
            PP          99.0         78.3          –
nounPPAdv   SS          99.5         –             –
            PS          99.2         63.7          –
            SP          91.2         –             54.0
            PP          99.8         –             –
Linzen      –           93.9         75.3          –

Table 3.3: Results of the Lakretz et al. (2019) ablation study. The first column indicates the NA-task and the second column the condition, i.e. the number of the subject (first letter) and the number of the prepositional object (second letter, where appropriate). The third column shows the model accuracy when no units have been ablated, and the last two columns the accuracy when unit 776 and unit 988, respectively, are ablated. Dashes indicate a reduction in performance of less than 10%.

In spite of the promising results presented by Gulordava et al. (2018), it is safe to say that overall, findings concerning potential syntactic representations in neural networks are mixed. Aside from the standard frequency-based semantic heuristics (a dog is more likely to bark than a chair) that the findings by Bernardy and Lappin (2017) suggested, there are other heuristics for which evidence has been found. Kuncoro et al. (2018a) found that LSTMs tend to extend the number of the first noun in the sentence to the verb (we present our approach to testing this in Chapter 4), Linzen and Leonard (2018) showed that LSTMs trained on English corpora expect embedded relative clauses to be relatively short, and Marvin and Linzen (2018) found that network performance is impaired on sentence constructions that are not common, e.g., nested centre-embedded clauses ("The teacher that the children see sings").

3.3

Ablation

Lakretz et al. (2019) have shown that long-distance number information in English LSTM language models is largely encoded by two specific units. This suggests that LSTMs do track such syntactic structures rather than merely employing high-level heuristics to capture syntax-sensitive generalisations in the data. The fact that the behaviour of the LSTM can be traced back to a small set of units at the neuron level suggests that the encoding of long-distance subject-verb number agreement is a highly local process.

The findings were made through ablation, in which the activation of each unit in the LSTM was set to zero, one unit at a time. As the model consists of 1300 units, 1300 ablation experiments were carried out to look for number units. The significance of an individual unit was determined based on the reduction in task performance of the model with that particular unit ablated. For the English language model at hand, units 776 and 988 were found to have a significant influence on the



performance of the network, with ablation of either leading to a performance reduction of more than 10% across various conditions (see Table 3.3 for an overview). The effect of the ablation depends on the grammatical number of the subject: unit 776 was found to only impact conditions in which the subject is plural and unit 988 only those in which the subject is singular (Table 3.3). The units have therefore been referred to as plural and singular units, respectively.

The results of the ablation study show that tasks such as Simple and Adv are not affected by the two aforementioned units: ablation of these units does not lead to a reduction of performance by more than 10% on these NA-tasks. As such tasks are those in which the subject and verb are relatively close to each other (Simple has a subject-verb distance of 0 and Adv a distance of 1), it seems that the two units can only account for the performance of the network when subject and verb are far apart. Through diagnostic classification, in which one network is trained to make predictions based on the hidden states of another network, they found the units that play an important part in the encoding and tracking of short-range dependencies. Lakretz et al. (2019) conclude that syntactic structures are, to some degree, genuinely (locally) stored in units of the network.


CHAPTER 4

Method

In this chapter, we first introduce the Dutch long short-term memory model and the setup of the experiments. We then propose a new set of number agreement tasks for the Dutch LSTM network, varying both the subject-verb distance and the classes (parts-of-speech) of the words intervening between the subject and the dependent verb.

4.1

Long Short-Term Memory Network on a Dutch Corpus

We provide a brief overview of the specifications of an LSTM network trained by Taal (2020), which we evaluate in our research. Full specifications of this language model can be found in the Bachelor's thesis written by Taal (2020).

The LSTM language model has been trained using PyTorch (Paszke et al., 2019) on a corpus extracted from a Dutch Wikipedia dump using wikiextractor. The vocabulary of this network is

made up of the 50,000 most common tokens appearing in this corpus, with ⟨eos⟩ denoting the end of a sentence and ⟨unk⟩ replacing tokens that do not appear among the selected 50,000. Following

Gulordava et al. (2018), this network has been trained with a generic language modelling objective with no explicit supervision on number agreement. The language model processes a sentence in a sequential manner, one token at a time, and is asked at each step to construct a probability distribution over the whole vocabulary regarding the next word in the sentence.

The neural network consists of two layers with 650 units each. The loss function used during training is the cross-entropy (log) loss, and optimisation uses stochastic gradient descent with a learning rate of 20, which is multiplied by 0.25 whenever the validation perplexity does not improve. The network has been trained for 40 epochs, ending with a perplexity of 19.28 on the training set and 27.99 on the validation set.
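The optimisation schedule described above can be sketched in a few lines of Python. This is a minimal reimplementation of the stated rule (initial learning rate 20, multiplied by 0.25 whenever validation perplexity fails to improve), not code from Taal (2020); the function name and the perplexity values in the example are our own illustration.

```python
def anneal_lr(lr, best_ppl, current_ppl, factor=0.25):
    """Reduce the learning rate when validation perplexity does not improve.

    Returns the (possibly reduced) learning rate together with the best
    perplexity seen so far. Illustrative sketch of the schedule described
    in the text; not the original training code.
    """
    if best_ppl is None or current_ppl < best_ppl:
        return lr, current_ppl           # improvement: keep the learning rate
    return lr * factor, best_ppl         # no improvement: multiply lr by 0.25


# Made-up validation perplexities for five epochs:
lr, best = 20.0, None
for ppl in [35.1, 30.2, 30.5, 28.6, 28.9]:
    lr, best = anneal_lr(lr, best, ppl)
print(lr)  # 1.25: the rate was quartered twice (epochs 3 and 5)
```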

4.2

Language Modelling

As explained in Section 4.1, the LSTM network has been trained on a language modelling objective with no emphasis put on number agreement. In our evaluation of this network, we test it on number agreement specifically. We present the network with NA-tasks containing sentences generated using templates discussed in the remainder of this chapter (Section 4.3). Each task includes singular and plural conditions, determined by all possible combinations of the grammatical numbers of the nouns preceding the verb (as discussed in Chapter 3). For each sentence of a task, the neural language model is presented with the tokens preceding the verb it has been set to predict. It is then asked to output a probability distribution over the whole vocabulary for the next word in the sentence. The network is said to have made a correct prediction if it has assigned a higher probability to the correct verb (form) than to the incorrect one. For each task, we calculate the accuracies on each of its conditions separately in order to isolate the effects of agreement attractors. We follow up on the obtained results with ablation experiments described below.
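The evaluation procedure above can be made concrete with a short sketch. The code below is our own illustration, not the actual evaluation script: `predict_proba` stands in for the language model's next-word distribution (any function mapping a sentence prefix and a candidate word to a probability), and accuracies are aggregated per condition so that the effect of agreement attractors can be isolated.

```python
def is_correct(predict_proba, prefix, correct_verb, incorrect_verb):
    """A prediction counts as correct if the model assigns a higher
    probability to the correct verb form than to the incorrect one."""
    return predict_proba(prefix, correct_verb) > predict_proba(prefix, incorrect_verb)


def accuracy_per_condition(predict_proba, items):
    """items: (condition, prefix, correct_verb, incorrect_verb) tuples.
    Returns one accuracy per condition (e.g. S/P or SS/SP/PS/PP)."""
    totals, hits = {}, {}
    for cond, prefix, good, bad in items:
        totals[cond] = totals.get(cond, 0) + 1
        hits[cond] = hits.get(cond, 0) + is_correct(predict_proba, prefix, good, bad)
    return {cond: hits[cond] / totals[cond] for cond in totals}


# Toy stand-in for the LSTM that always slightly prefers the plural "groeten":
toy_model = lambda prefix, word: 0.6 if word == "groeten" else 0.4
items = [
    ("S", "de jongen", "groet", "groeten"),
    ("P", "de jongens", "groeten", "groet"),
]
print(accuracy_per_condition(toy_model, items))  # {'S': 0.0, 'P': 1.0}
```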

Ablation Experiments In order for the LSTM language model to perform well on an NA-task, it must identify the correct subject, encode and hold on to its grammatical number and track the main subject-verb syntactic dependency (Lakretz et al., 2019; Linzen et al., 2016). In our research, we build on the hypothesis put forward by Lakretz et al. (2019): if a neural network stores grammatical number information in a local or sparse fashion (rather than in a distributed way), then there must be a small set of units whose absence leads to a significant decrease in network performance. To test this, we run 1300 distinct ablation experiments, one for each individual unit of the model. An ablation experiment here refers to fixing the activation of one unit to zero while keeping the other units intact.

Emulating the setup used by Lakretz et al. (2019) on the English language model, we carry out the ablation experiments using only sentences generated with the Dutch version of the NounPP template (Lakretz et al., 2019) introduced in Chapter 3 and further discussed in this chapter. We focus on this task as it is the simplest number agreement task involving a long-range dependency, with the additional challenge of having one grammatical attractor in two of the four conditions. As in the "full model" experiment introduced at the beginning of this chapter, the neural network is successful on a sentence if the probability it assigns to the correct verb form is higher than the one assigned to the incorrect form. A unit is then considered to be significant if ablating it leads to a performance reduction of more than 10% for at least one of the number conditions in the NounPP task.
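Translating the two paragraphs above into a sketch: one ablation experiment per unit, and a unit is flagged as significant if its ablation costs more than 10 percentage points on at least one NounPP condition. Here `accuracy_with_ablation` is a hypothetical helper (it would re-run the NounPP task with the given unit's activation fixed to zero); the unit index and the accuracy values in the usage example are invented for illustration.

```python
def significant_units(accuracy_with_ablation, baseline_by_cond,
                      n_units=1300, threshold=10.0):
    """Run one ablation experiment per unit and flag units whose ablation
    reduces accuracy by more than `threshold` points on some condition.

    accuracy_with_ablation(unit) -> {condition: accuracy in percent}
    baseline_by_cond             -> accuracies of the full (unablated) model
    """
    flagged = {}
    for unit in range(n_units):
        ablated = accuracy_with_ablation(unit)
        drops = {c: baseline_by_cond[c] - ablated[c] for c in baseline_by_cond}
        if any(d > threshold for d in drops.values()):
            flagged[unit] = drops
    return flagged


# Invented example: pretend unit 42 carries plural number information.
baseline = {"SS": 99.0, "SP": 97.0, "PS": 94.0, "PP": 93.0}

def toy_ablation(unit):
    if unit == 42:
        return {"SS": 98.0, "SP": 96.0, "PS": 50.0, "PP": 55.0}
    return {c: acc - 1.0 for c, acc in baseline.items()}

print(significant_units(toy_ablation, baseline, n_units=100))
# {42: {'SS': 1.0, 'SP': 1.0, 'PS': 44.0, 'PP': 38.0}}
```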

4.3

Number Agreement Tasks

Each number agreement task consists of sentences that are generated using a set template indicating the part-of-speech at each position of the sentence. With the use of such templates, we fix the syntactic structures of the sentences in NA-tasks. Lexical material for these sentences is composed of words selected from the 50,000 most common words in the training corpus of the LSTM model introduced in Section 4.1.

Starting with the sentences used in the study conducted by Linzen et al. (2016), it has been a goal to test neural networks on sentences of varying complexity. This has mostly been achieved by increasing the context size (Giulianelli et al., 2018), i.e., the linear distance between the subject and the verb. Increasing the context size is one factor that might contribute to the performance of an LSTM language model on number agreement, but the findings made by Lakretz et al. (2019) have shown that the classes (parts-of-speech) that make up the context can also influence the performance of the network.

4.3.1

Standard Templates

We adopt the Simple, NamePP and NounPP templates constructed by Lakretz et al. (2019), which we briefly introduced in Chapter 3, as the basis for our templates. Note that we omit the templates involving adverbs (Adv, 2Adv, CoAdv and NounPPAdv in Table 4.1), as Dutch generally does not allow adverbs to separate the verb from its subject in independent clauses, requiring them to appear after the verb instead. Since the LSTM network processes sentences in a sequential manner, it disregards all words appearing after the verb it has been set to predict. Consequently, the addition of adverbs would reduce these templates to the three discussed in this section (Table 4.1).

Simple: DETERMINER NOUN VERB This template generates sentences with the simplest construction, in which the verb immediately follows the subject. It acts as the baseline template to which we compare the other templates discussed in this chapter. As this template contains one noun preceding the verb, we split it into two conditions: one in which the subject is singular (S) and one in which it is plural (P).



NA-Task     Template                   Example

Simple      DET N V …                  "The boy greets …"
            DET N V …                  "De jongen groet …"
Adv         DET N adv V …              "The boy probably greets …"
            DET N V … adv              "De jongen groet … waarschijnlijk"
2Adv        DET N adv1 adv2 V …        "The boy most probably greets …"
            DET N V … adv1 adv2        "De jongen groet … zeer waarschijnlijk"
CoAdv       DET N adv and adv V …      "The boy openly and nicely greets …"
            DET N V … adv en adv       "De jongen groet … publiekelijk en aardig"
NamePP      DET N prep name V …        "The boy near Mary greets …"
            DET N prep name V …        "De jongen nabij Mary groet …"
NounPP      DET N prep DET N V …       "The boy near the car greets …"
            DET N prep DET N V …       "De jongen nabij de auto groet …"
NounPPAdv   DET N prep DET N adv V …   "The boy near the car kindly greets …"
            DET N prep DET N V … adv   "De jongen nabij de auto groet … aardig"

Table 4.1: The templates used in the English ablation study by Lakretz et al. (2019) (every first line) with the corresponding templates for Dutch (every second line). As the neural network processes a sentence from left to right, the Adv, 2Adv and CoAdv templates reduce to the Simple template in Dutch, whereas NounPPAdv is reduced to NounPP.

NounPP: DETERMINER NOUN PREPOSITION DETERMINER NOUN VERB The complex sentences used by Linzen et al. (2016) are those involving a prepositional phrase modifying a noun, later dubbed NounPP by Lakretz et al. (2019). This template is mainly useful for being the shortest long-range dependency that allows for the presence of an attractor. The Dutch version of this template is identical to its English equivalent, which means that the context size of 3 is retained for the Dutch task. This template allows two nouns to precede the verb, resulting in four (2²) double-numbered conditions: SS, SP, PS and PP, where the first letter marks the grammatical number of the subject (first noun) and the second letter that of the potential attractor (second noun).
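As an illustration of this template-based setup, the sketch below generates NounPP prefixes (the tokens up to the verb) for all four conditions. The toy lexicon and the preposition "naast" are our own examples; the actual tasks draw lexical material from the model's 50,000-word vocabulary.

```python
from itertools import product

# Toy lexicon; illustrative only.
NOUNS = {"S": ["jongen", "boom"], "P": ["jongens", "bomen"]}
DET = {"S": "de", "P": "de"}

def nounpp_prefixes():
    """Generate NounPP prefixes (DET N prep DET N) for all four conditions.
    The first letter of a condition marks the number of the subject, the
    second that of the noun inside the prepositional phrase."""
    prefixes = {}
    for subj_num, pp_num in product("SP", repeat=2):
        prefixes[subj_num + pp_num] = [
            f"{DET[subj_num]} {subj} naast {DET[pp_num]} {obj}"
            for subj in NOUNS[subj_num] for obj in NOUNS[pp_num]
        ]
    return prefixes

prefixes = nounpp_prefixes()
print(prefixes["SP"][0])  # de jongen naast de jongens
```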

NamePP: DETERMINER NOUN PREPOSITION PROPER_NOUN VERB As mentioned in the introduction of this section, we do not care about merely increasing the context size, as we also wish to vary the context itself. This template has a context size of 2, which is fairly small, but the performance of the English LSTM model was found to be hampered by the appearance of a proper noun in the context. To find out whether such results can be reproduced with a Dutch LSTM model, the same template has been used to generate Dutch sentences, with no modifications needed for the adaptation to Dutch lexical material.

4.3.2

Control Tasks

To find out whether heuristics such as “extend the number of the first noun in the sentence to the verb” (Chapter 3) occur with the Dutch LSTM model, we set up several control tasks in which the number of the verb is determined by a noun that is not the first one in the sentence (Table 4.2).

NounConj: DETERMINER NOUN en DETERMINER NOUN VERB In sentences following this template, the verb is still dependent on the first subject in the sentence, but here the verb is inflected for the conjunction of two simple noun phrases. Note that the linking word has been set to "en" (and), calling for a plural verb in all sentences of this template.

SConj: DETERMINER NOUN VERB en DETERMINER NOUN VERB Another manner in which a second noun can be incorporated into a sentence is by splitting the sentence into two independent clauses. In the case of SConj, the relevant verb is the second one, which follows the subject in the second

(21)

CHAPTER 4. METHOD

NA-Task     Template              Example

Simple      DET N V …             "De jongen groet …" (The boy greets …)
NounConj    DET N en DET N V …    "De jongen en de moeder groeten …" (The boy and the mother greet …)
SConj       DET N V en DET N V …  "De jongen groet en de moeder mist …" (The boy greets and the mother misses …)
ThatSimple  DET N V dat DET N V   "De jongen denkt dat de moeder loopt" (The boy thinks that the mother walks)

Table 4.2: The Simple template and new templates that have been introduced to analyse the model's behaviour when the verb is dependent on a noun that is not the first one in the sentence. The verb that the neural network would be set to predict in the examples is displayed in boldface; underlined is the head of the subject that this verb is dependent on. All examples shown are in the singular(-singular) condition: all nouns in the sentences displayed are singular. Also note that the verb in the ThatSimple example is intransitive.

clause. This template serves as a double control, as we also use it to test whether “en” is always associated with plural verbs, as should be the case for NounConj sentences.

ThatSimple: DETERMINER NOUN VERB dat DETERMINER NOUN VERB Declarative content clauses (also known as that-clauses) incorporate a second subject-verb dependency, following "dat" (that), in the object of the first verb. The verb in the that-clause is intransitive, as an object in the content clause would have followed the subject and preceded the verb, thereby enlarging the context size. The purpose of this control task requires the context size to be 0, allowing for a comparison with Simple sentences and thus calling for an intransitive verb in the content clause.

4.3.3

Expressions of Quantity or Collection

Similar to English, the Dutch language has specific collective and quantity nouns to make statements about groups and quantities. In both languages, such words precede the nouns they express the quantities of. English and Dutch differ in the way the verb is inflected when the subject contains a collective or quantity noun. In English, the form of the verb is dependent on whether the collective noun refers to a group as one unit or to its members as individuals. Consider the following sentences:

(13) A group of friends is planning its vacation. (14) A group of friends are planning their vacations.

In (13), the group of friends is considered to be one unit that collectively plans the same vacation involving all friends, while (14) refers to each friend individually planning his or her own vacation. Often, the form of the verb is also a matter of style and preference. In Dutch, quantity and collective nouns do not change the rules of verb inflection: the verb remains dependent on the head of the subject, in this case the noun expressing quantity or collection. The noun following this first one must always be in plural form. Adding a quantity or collective noun thus increases the context size for Dutch sentences, with the added challenge of a potential attractor immediately following the head of the subject. In this section, we propose three templates involving such expressions.
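The Dutch rule described here — the verb agrees with the head of the subject, not with the plural noun that follows it — can be captured in a one-line helper. This is purely illustrative; the verb forms shown are examples, not items from the task lexicon.

```python
def correct_verb_form(head_number, verb_sg, verb_pl):
    """Dutch agreement with quantity/collective subjects: the verb agrees
    with the head noun, so the singular head 'groep' in 'de groep jongens'
    selects the singular verb despite the plural noun right before it."""
    return verb_sg if head_number == "S" else verb_pl


print(correct_verb_form("S", "groet", "groeten"))  # groet:   'de groep jongens groet ...'
print(correct_verb_form("P", "groet", "groeten"))  # groeten: 'de groepen jongens groeten ...'
```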

QntySimple: DETERMINER NOUN NOUN VERB This is the shortest dependency involving an expression of quantity. With a fairly small context size of 1, its sentences differ from the standard Simple sentences only in the one noun intervening between the head of the subject and the verb.

QntyNamePP: DETERMINER NOUN NOUN PREPOSITION PROPER_NOUN VERB The QntyNamePP template serves as an extension of the plain NamePP template, increasing the context size from 2 to 3. This results in a context size that is equal to those of various other templates, i.a. NounPP, creating another opportunity to investigate the effects of contexts with different parts-of-speech.



NA-Task      Template                  Example

QntySimple   DET N N V DET N.          "De groep jongens groet de persoon." (The group of boys greet the person.)
QntyNamePP   DET N N P PN V DET N.     "De groep jongens bij Mary groet de persoon." (The group of boys near Mary greet the person.)
QntyNounPP   DET N N P DET N V DET N.  "De groep jongens bij de boom groet de persoon." (The group of boys near the tree greet the person.)

Table 4.3: The Qnty templates that have been introduced for the Dutch NA-tasks. The QntySimple, QntyNamePP and QntyNounPP templates are extensions of the Simple, NamePP and NounPP templates.

QntyNounPP: DETERMINER NOUN NOUN PREPOSITION DETERMINER NOUN VERB This template is extended from NounPP, which allows for one attractor in a sentence. With the added expression of quantity, we make it possible for two attractors to appear: one immediately follows the head, and the second appears further along in the sentence.

4.3.4

Declarative Content Clauses (That-Clauses)

Declarative content clauses in English are often preceded by the linking word "that"; accordingly, they are also known as that-clauses. In Dutch, that-clauses are preceded by "dat". Such clauses are dependent clauses: they cannot stand on their own as a sentence. An interesting phenomenon in the Dutch constructions is that the word order in the dependent clause differs from that of a similar independent clause. In Chapter 5, we compare the network performance on verbs in these dependent clauses with the performance on those in the standard templates of Section 4.3.1. We propose three templates involving that-constructions (Table 4.4).

ThatSimple: DETERMINER NOUN VERB dat DETERMINER NOUN VERB This template has been introduced in Section 4.3.2 as a control task.

ThatAdv: DETERMINER NOUN VERB dat DETERMINER NOUN ADVERB VERB Adverbs generally do not intervene between the subject and the verb in Dutch sentences (see Section 4.3.1), but they do in content clauses. Note that the that-clauses in ThatAdv sentences emulate the Adv sentences that Lakretz et al. (2019) constructed ("The boy probably greets the person", Table 4.1). The addition of the adverb increases the context size from 0 in ThatSimple to 1.

ThatNounPP: DETERMINER NOUN VERB dat DETERMINER NOUN PREPOSITION DETERMINER NOUN VERB In the same way that ThatAdv is similar to the English Adv, the that-clauses in ThatNounPP follow a pattern similar to both the Dutch and English NounPP templates ("The boy near the car greets the person", Table 4.1). Consequently, sentences generated with this template contain three nouns preceding the verb, one of which also precedes the relevant subject.

4.3.5

Defining and Non-Defining Relative Clauses

Similar to English, relative clauses in Dutch can be either defining or non-defining.

(15) a. In English: "The dog, that eats my food, barks."
     b. In Dutch: "De hond, die mijn voedsel eet, blaft."
(16) a. In English: "The dog that eats my food barks."
     b. In Dutch: "De hond die mijn voedsel eet, blaft."

In (15), "that eats my food / die mijn voedsel eet" is a non-defining clause, characterised by the comma preceding it. As this relative clause is positioned in the middle of the sentence, it is also followed by a comma. In (16), this same clause is a defining clause, as it provides information that is essential to



NA-task      Template                      Example

ThatSimple   DET N V dat DET N V.          "De jongen denkt dat de moeder loopt." (The boy thinks that the mother walks.)
ThatAdv      DET N V dat DET N adv V.      "De jongen denkt dat de moeder voorzichtig loopt." (The boy thinks that the mother walks carefully.)
ThatNounPP   DET N V dat DET N P DET N V.  "De jongen denkt dat de moeder bij de fiets loopt." (The boy thinks that the mother walks near the bicycle.)

Table 4.4: Tasks based on the simple That template.

NA-Task     Template                     Example

RelNondef   DET N , REL V , V DET N.     "De jongen , die komt , groet de persoon." (The boy , who is coming , greets the person.)
RelDef      DET N REL V , V DET N.       "De jongen die komt , groet de persoon." (The boy who is coming greets the person.)
RelDefObj   DET N REL DET N V , V DET N. "De jongen die de moeders zien , groet de persoon." (The boy whom the mothers see greets the person.)

Table 4.5: The templates used to create sentences containing defining and non-defining clauses.

understand which particular dog is being referred to. As the Dutch language requires one to separate two finite verb forms in different verb phrases ("eet" and "blaft") with a comma, the only difference between (15) and (16) is the comma right after the noun that the relative clause is describing. These two types of sentences are interesting mainly for two reasons: they provide a possibility of adding an agreement attractor to the sentence, and they allow us to investigate the effect of punctuation marks on the behaviour of the LSTM model. We construct such sentences using three templates (Table 4.5).

RelNondef: DETERMINER NOUN , RELATIVE_PRONOUN VERB , VERB The context size for sentences generated using the RelNondef template is 4, as punctuation marks such as commas are treated as tokens of their own rather than as an addition to the words they follow or precede. This, too, allows for comparison with sentences of equal context size that differ in context content.

RelDef: DETERMINER NOUN RELATIVE_PRONOUN VERB , VERB The RelDef sentences are similar to the ones generated with RelNondef, with the only difference being the omission of the comma directly following the subject. This template serves two purposes: to compare the model performance on these sentences with other ones of context size 3 (such as NounPP), and to investigate whether one comma makes a difference for the performance of the network.

RelDefObj: DETERMINER NOUN RELATIVE_PRONOUN DETERMINER NOUN VERB , VERB In addition, the template RelDefObj has been included to further investigate the effect of agreement attractors in various positions in a sentence. This construction introduces an attractor in the relative clause.



NA-Task        Template                             Example

Crossing1Inf   DET N V dat DET N DET N V V          "De boer wil dat de jongen de kinderen ziet zwemmen." (The farmer wants the boy to see the children swim.)
Crossing2Inf   DET N V dat DET N DET N DET N V V V  "De boer wil dat de jongen de moeder de kinderen ziet helpen zwemmen." (The farmer wants the boy to see the mother help the children swim.)

Table 4.6: The templates for the sentences containing cross-serial dependencies are distinguished by the number of infinitive forms per sentence. In the current setup, a Crossing sentence can contain one (Crossing1Inf) or two (Crossing2Inf) infinitive forms.

4.3.6

Cross-Serial Dependencies

Cross-serial dependencies in Dutch require the presence of at least three verbs in the sentence: one dependent on the subject of the sentence, one on the subject of the content clause, and at least one infinitive form. We propose two templates to generate sentences involving these dependencies, named after the number of infinitive forms they contain (Table 4.6). Note that the language model will in fact not process these infinitives, as they appear after the verb it is set to predict, thereby reducing the templates to "standard" that-templates. In Chapter 6 we suggest other tasks related to the processing of crossing dependencies.

Crossing1Inf: DETERMINER NOUN VERB dat DETERMINER NOUN DETERMINER NOUN VERB This template contains one attractor despite its relatively small context size of 2, as illustrated in (17). Disregarding the infinitive verb, it is similar to ThatSimple (Section 4.3.4), adding one object to the verb.

(17) De boer wil dat de jongen de kinderen ziet zwemmen
     the farmer want-present that the boy the children see-present swim-inf

Crossing2Inf: DETERMINER NOUN VERB dat DETERMINER NOUN DETERMINER NOUN DETERMINER NOUN VERB This template shows the relevance of the crossing dependencies for the number agreement tasks: we can insert an arbitrary number of simple noun phrases between the main subject and verb of the that-clause. In this case, the context size is 4, allowing for two attractors in a sentence, as illustrated in (18).

(18) De boer wil dat de jongen de moeder de kinderen ziet helpen zwemmen
     the farmer want-present that the boy the mother the children see-present help-inf swim-inf


CHAPTER 5

Results

As the original research question concerns the behaviour of the neural network on a variety of distinct challenges, each set up by a group of tasks, we refrain from providing a detailed analysis of all tasks individually, limiting our observations to the most relevant. First, we compare our results with those produced by Lakretz et al. (2019) in order to highlight similarities and differences between the Dutch and the English LSTM language models. We then present a comparison study with various control tasks to examine the extent to which simple syntactic representations in the Dutch language model can be accounted for by surface-based heuristics. All analyses in the first section are based on the results of the language model with no units ablated. The second section of this chapter covers the findings of the ablation study. We concentrate on one significant unit and investigate its gate and cell-state dynamics during the processing of different sentences.

5.1

Full Network Performance

In this section, the results of the Dutch LSTM language model with no units ablated ("full performance") are presented. We first compare the results to those obtained with the English language model trained by Gulordava et al. (2018) on NA-tasks constructed by Lakretz et al. (2019).

5.1.1

Comparison with the English LSTM Language Model

As discussed in Section 4.3.1, the templates that we found to be useful for the Dutch language model were Simple, NamePP and NounPP, as the inclusion of adverbs reduces the other templates of the original study to these three. Table 5.1 shows that the overall performance of the Dutch model is similar to the performance of the English LSTM network, with the exception of the Dutch model yielding a higher accuracy on the SP-condition of the NounPP task, containing sentences such as (19).

(19) a. English: "The boy next to the tables greets the person."
     b. Dutch: "De jongen naast de tafels groet de persoon."

Lakretz et al. (2019) provide a possible explanation for the relatively low (although generally speaking still high) performance of the English model on SP compared to PS: as the plural form, being identical to the infinitive and other forms, is more common in English than the singular form, the network tends to select the plural. This might also account for the same observation with the Dutch language model, though the effect is less pronounced, as the results show that the Dutch LSTM is generally less affected by agreement attractors.

5.1.2

Control Tasks

In order to find out whether the LSTM network trained on the Dutch corpus makes use of a high-level heuristic such as “track the number of the first noun in the sentence” (Kuncoro et al., 2018a),



NA-task      Singular Subject            Plural Subject
             Condition  Dutch  English   Condition  Dutch  English

Simple       S          99.6   100       P          95.9   100
NamePP       S          98.7   99.3      P          69.2   68.9
NounPP       SS         99.8   99.2      PS         94.7   92.0
             SP         97.8   87.2      PP         93.5   99.0
ThatSimple   SS         94.7   –         SP         100    –
             PS         96.8   –         PP         100    –
ThatNamePP   SS         97.2   –         SP         96.7   –
             PS         95.0   –         PP         96.8   –
ThatNounPP   SSS        99.8   –         SPS        100    –
             PSS        100    –         PPS        100    –
             SSP        98.5   –         SPP        100    –
             PSP        99.3   –         PPP        100    –

Table 5.1: We compare our model on the Dutch NA-tasks with the English model (Gulordava et al., 2018) on the original English NA-tasks (Lakretz et al., 2019), shown in the top half of the table. The single major observation is that the Dutch model outperforms the English one on the SP-condition of the NounPP task ("The boy next to the cars greets the person"). We then compare the original tasks with their that-equivalents (bottom half of the table). That-constructions contain an additional noun preceding the one that the conjugation is dependent on – note the additional letter preceding the one in bold for the That-conditions. Note that the network performs better on the plural conditions of the that-templates.

several control NA-tasks were set up (Section 4.3.2). The hypothesis is that if the performance of the language model can indeed be (partly) accounted for by such a heuristic, then introducing another noun at the beginning (unrelated to the verb that the network has to predict) must lead to a decrease in performance.

That-constructions To evaluate the hypothesis, we first compare Simple, NamePP and NounPP with their that-equivalents, where the model must predict the verb contained in the content clause. The results in Table 5.1 show no decrease in performance when the relevant subject-verb dependency is positioned in a content clause. The opposite appears to be true: looking at the plural conditions, we find that the network performs better on that-constructions, with the most pronounced increase, of more than 25%, for NamePP.

Simple-based constructions While that-constructions create dependent clauses in which the relevant noun and verb are situated, we also evaluate network performance on two other control tasks (Section 4.3.2) to investigate the potential presence of a surface-based heuristic. SConj and NounConj have a context size of 0, i.e., in both templates, the verb immediately follows the subject. All such Simple-based tasks are included in Table 5.2.

Comparing SConj ("The boy greets and the mother calls") to Simple, we once again notice a significant difference in performance for one condition only: the accuracy for the PS-condition has decreased by 20%, while performances on other conditions remain unaffected. (20) is an example of a PS SConj sentence.

(20) "De moeders roepen en de jongen groet de persoon."

It might be that the setting in which the connective "en" is followed by a plural form triggers the network to predict a plural verb, resulting in a relatively poor (although in general still adequate) performance on the PS SConj task. A similar construction with the same connective can be found in the PS NounConj task, for which the correct verb form is plural. If the combination of "en" with a plural noun results in a plural bias, then we would expect above-average results of the network on this particular NounConj task. Table 5.2 shows that this is not the case.



NA-task      Context Size   Singular Subject       Plural Subject
                            Condition  Accuracy    Condition  Accuracy

Simple       0              S          99.6        P          95.9
SConj        0              SS         99.5        SP         93.8
                            PS         79.2        PP         92.5
ThatSimple   0              SS         94.7        SP         94.7
                            PS         96.8        PP         96.8
NounConj     0              –          –           SS         79.8
                            –          –           SP         93.8
                            –          –           PS         88.0
                            –          –           PP         95.3
RelDef       3              S          100         P          86.5
RelNondef    4              S          100         P          82.5
RelDefObj    5              SS         100         PS         85.8
                            SP         97.2        PP         91.0

Table 5.2: Results for the control NA-tasks constructed to investigate the LSTM language model for

potential surface-based heuristics. Note that all conditions of NounConj are positioned in the second half of the table, as the conjunction in the subject causes it to be plural across all conditions.

Another possible explanation is that the network mainly expects a plural verb after having encountered a plural noun earlier in the sentence, as the plural form is generally more common than the singular form (as stated in Section 5.1.1). The high performance of the network on the PS-condition of ThatSimple would immediately contradict this suggestion, but note that the favourable results on all that-templates might be accounted for by a separate mechanism that overrides the plural bias.
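The accuracies reported throughout this chapter follow the standard targeted-evaluation setup: at the target position, the model scores the correct verb form against its number-opposite, and a trial counts as correct when the right form receives the higher probability. A schematic version of this scoring, with a toy stand-in for the scoring function (the thesis of course uses the trained Dutch LSTM instead):

```python
def na_accuracy(items, score):
    """items: (prefix, correct_verb, wrong_verb) triples; score(prefix, w)
    returns the model's (log-)probability of word w after the prefix.
    A trial is correct when the right verb form outscores its opposite."""
    hits = sum(score(pfx, good) > score(pfx, bad) for pfx, good, bad in items)
    return 100.0 * hits / len(items)

# Toy stand-in "model" that always prefers plural verb forms (suffix -en),
# mimicking the plural bias hypothesised above.
plural_biased = lambda prefix, w: 1.0 if w.endswith("en") else 0.0

items = [("De jongen", "groet", "groeten"),                         # S
         ("De jongens die de moeder roept ,", "groeten", "groet")]  # PS
print(na_accuracy(items, plural_biased))  # → 50.0
```

Such a purely plural-biased scorer would get every plural-target condition right and every singular-target condition wrong, which is exactly the asymmetry the control tasks are designed to detect.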

Relative Clauses  Since Linzen and Leonard (2018) found that error rates increased for sentences with relative clauses, thereby suggesting that a high-level heuristic might be at work, we evaluate our Dutch LSTM network on such clauses. As defining and non-defining relative clauses in Dutch merely differ by one comma, we are able to investigate the effect that a single comma has on the performance of the network. In Table 5.2 we see that an additional comma preceding the relative clause does not significantly affect the performance of the neural network, except in the plural conditions. Comparing RelDef to NounPP, a template whose context size is also 3, we see that it does matter what the context is made up of: the combination of a punctuation mark and a relative pronoun leads to a worse performance than a preposition followed by a noun.
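The one-comma contrast between the two relative-clause templates can be made explicit with a small sketch. The vocabulary and the embedded clause are invented here for illustration; only the structural contrast (a comma before “die”, and the resulting context sizes of 3 and 4 tokens) reflects the templates described above.

```python
# Hypothetical vocabulary; only the RelDef/RelNondef structure is the point.
NOUNS = {"S": "jongen", "P": "jongens"}
VERBS = {"S": "groet", "P": "groeten"}
RC_VERBS = {"S": "zingt", "P": "zingen"}  # the relative-clause verb agrees too

def rel_sentence(num, defining=True):
    """RelDef vs RelNondef: the templates differ only in a comma before
    'die', giving subject-verb context sizes of 3 and 4 tokens."""
    rc = f"die {RC_VERBS[num]} ," if defining else f", die {RC_VERBS[num]} ,"
    return f"De {NOUNS[num]} {rc} {VERBS[num]} de persoon ."

print(rel_sentence("P", defining=True))   # → De jongens die zingen , groeten de persoon .
print(rel_sentence("P", defining=False))  # → De jongens , die zingen , groeten de persoon .
```

Counting the tokens between subject and target verb confirms the context sizes reported in Table 5.2 for RelDef (3) and RelNondef (4).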

Table 5.2 also shows that compared to the Simple conditions, the performance of the network does not decrease when the context size is lengthened by a relative clause. In the plural cases, however, the network does perform worse on the sentences containing relative clauses. While attractors do not hamper the network performance on the ThatSimple task, the performance of the network on RelDefObj suggests that an attractor in a relative clause (rather than a that-clause) does influence the ability of the network to predict the correct verb form. This effect is pronounced for PS sentences such as (21), where the noun intervening between the plural subject and verb is singular.

(21) “De jongens die de moeder roept, groeten de persoon.” (“The boys whom the mother calls greet the person.”)

5.1.3 Varying the Parts of Speech

NA-tasks are generally made more complex by lengthening the linear distance between the subject and the verb. Here, we instead evaluate how the network performs when we vary the types of tokens that make up the context of the task.

Immediate attractors  First, we evaluate the performance of the network on immediate attractors, where the word directly following the subject is itself a noun. The Qnty-templates (“De groep jongens groet”,
