
Data-driven Parsers in ACT-R and the Prediction of Reading Patterns

Puck de Haan (11305150)

Bachelor thesis, 18 EC
Bachelor Opleiding Kunstmatige Intelligentie
University of Amsterdam, Faculty of Science
Science Park 904, 1098 XH Amsterdam

Supervisor: Dr. J. Dotlačil
Institute for Logic, Language and Computation
Faculty of Science, University of Amsterdam
Science Park 107, 1090 GE Amsterdam


Abstract

A transition-based, data-driven parsing model has been developed that can simultaneously parse a text and predict the reading time of each word. In this thesis the model is tested and optimized to explore which parsing model best fits psycholinguistic evidence. The model is developed with pyactr, a Python implementation of the ACT-R cognitive architecture. Two different data-sets comprising texts and corresponding eye-tracking measures were used for training and testing the parsing model. To optimize the model, optimal values for three free parameters – the latency factor (F), the latency exponent (f) and rule firing (r) – were estimated using a Markov chain Monte Carlo sampling method. Furthermore, three different implementations of the model – exploiting visual, syntactic, lexical and/or reanalysis information – were tested. First of all, we obtained an R̂ value that approximated 1 for all parameter sampling trials, indicating a high probability that the respective chains had converged. Secondly, it turned out that the syntactic model was significantly less successful than the models that also made use of visual and lexical information. Finally, the estimated optimal parameters brought about a vast increase in the correlation between the predictions and the observed data (excluding the syntax-only model), compared to the default parameters. With the default parameters a correlation between 0.141 and 0.238 was achieved, whereas the model using the estimated parameters achieved a correlation of approximately 0.650 for one data-set and 0.750 for the other. The parameter estimation was successful and the fit of the model was significantly improved with respect to the default parameters.


Contents

1 Introduction
   1.1 Research Question
2 Theoretical Framework
   2.1 Computational Linguistics
      2.1.1 Context-Free Grammar
      2.1.2 Syntactic Parsing
      2.1.3 Classifier-Based Parsing
   2.2 Psycholinguistics
      2.2.1 Cognitive Architectures
      2.2.2 Adaptive Control of Thought-Rational
      2.2.3 Reading Tasks
   2.3 Bayesian Methods
      2.3.1 Markov Chain Monte Carlo Methods
   2.4 Related Works
3 Research
   3.1 Materials
      3.1.1 Parsing Model in ACT-R
      3.1.2 Data
   3.2 Method
      3.2.1 Data Preprocessing
      3.2.2 The Model
      3.2.3 Parameter Estimation
   3.3 Results
      3.3.1 Frank et al corpus
      3.3.2 Dundee corpus
4 Evaluation and Discussion
   4.1 Frank et al Corpus
      4.1.1 Sampling
      4.1.2 Reading Time Predictions
   4.2 Dundee Corpus
      4.2.1 Sampling
      4.2.2 Reading Time Predictions
5 Conclusion
References
A Appendix
   A.1 Classifier Feature-Set

1 Introduction

It could be argued that language is one of the most important aspects of being human. It is therefore not surprising that Natural Language Processing (NLP) has become a significant subject in Artificial Intelligence (AI) research. To succeed in AI, computers should be able to accurately process and produce natural language at a human level. There are different perspectives on how this level of language proficiency can best be reached in computers. The most prominent perspective is that processing language is best done in a computational manner, through computational linguistics. Another view, which will be explained in more depth later on, is that an adequate level of proficiency can also be reached in a cognitive manner, using theories from psycholinguistics and cognitive science.

The field of computational linguistics tries to implement the production and comprehension of language in a statistical and rule-based way. With the help of huge corpora consisting of thousands of annotated sentences, rules and statistics are derived that in turn can be used in either the production or comprehension of natural language by computers. One application of NLP is the syntactic parsing of sentences: assigning grammatical structure to sentences by making use of deterministic or statistical algorithms. Humans usually know in an instant what the function of a word is and how it connects with other words in a sentence, through natural experience and insight. For a computer, however, this is not so easy.

It turns out that parsers like this are not only relevant for computational linguistics, but for psycholinguistics and cognitive science as well. Psycholinguistics could be described as the psychology of language; it tries to determine how the human mind processes language. To uncover the workings of this process, research is done in several ways: behavioral experiments, neuroimaging and computational models are all methods used to gain further insight into the subject. It might even be possible that computational models of language can be used to predict human reading patterns, such as eye-fixation times and eye movements. If a parsing model could be designed that “reads” like a human does, such a parser could be used to support psycholinguistic evidence. In this thesis we will focus on a specific implementation of a syntactic parsing model that attempts to parse sentences in as human-like a way as possible.

1.1 Research Question

In this thesis we will explore how a parsing model can be used to predict reading patterns. The focus will be on a data-driven transition-based parser that is constructed in the Adaptive Control of Thought-Rational (ACT-R) cognitive architecture (Anderson et al., 2004). Data-driven parsers are primarily designed and used in computational linguistics and not as frequently in psycholinguistics. These computational data-driven parsers are trained on substantial amounts of data and learn the syntactic structure of sentences by means of statistics and rules. However, the reading process in these parsers cannot be compared to that in humans, which is why they cannot be used to directly predict human reading patterns without any adjustments.


To resolve this dissimilarity, the data-driven parser used in this thesis is embedded in the ACT-R cognitive architecture. This is an architecture that models executive functions of the mind and simulates human behavior. The embedded parser can predict psycholinguistic reading patterns and simultaneously provide us with the syntactic structure of a sentence. At the moment the parsers designed in psycholinguistics are hand-coded to be able to predict these patterns (Lewis & Vasishth, 2005; Brasoveanu & Dotlacil, 2018). The problem with hand-coded parsers is that, as a result, the model is less open-ended and thus less human-like. As a solution, our parser will be trained, tested and optimized with the use of large data-sets, such that its development is completely automatic and data-driven.

At the moment there is solely a default version of the parsing model and its parameters available. First of all we want to examine how accurate the results produced by this default model and parameters are. Then it is our goal to improve the model and estimate the optimal parameters, to match human data as well as possible. Thus, the main question we will try to answer is as follows: “What data-driven ACT-R parsing model fits best and, as a result, produces the most accurate reading pattern predictions?”. To decide which model fits best, we will compare the results produced by the model to human data, retrieved from two data-sets that were the result of two different psycholinguistic reading tasks.

2 Theoretical Framework

2.1 Computational Linguistics

Computational linguistics is a broad and interdisciplinary field that, generally, is concerned with the processing of language by computers. This can be done in different ways, but we will focus on syntactic parsing – unraveling the underlying structure of a text. In this section a description will be given of the different computational linguistic aspects that were relevant for the development of this parsing model. First of all the concept of a context-free grammar (CFG) will be explained: such a formal grammar makes it possible to transform a language to a format that can be processed by computers. Then an overview will be given of syntactic parsing and finally we will conclude with a description of transition-based parsing – specifically classifier-based parsing – the data-driven parsing technique used in our model.

2.1.1 Context-Free Grammar

As a way of enabling computers to process natural language, language is given structure with syntax. Syntax makes it possible to label all the parts of a sentence with their grammatical function and to determine the relations between words. Words on their own have a specific meaning, of course, but there are also groups of words that function as a unit and have their own distinct function. These groupings of words are called constituents. By assigning objective functions to words and constituents, language is transformed in such a manner that it can be used by computers. The most frequently used formal system to model the syntactic structure in sentences is the Context-Free Grammar (CFG) (Jurafsky & Martin, 2009).

A CFG consists of two parts: a collection of rules and productions that specify how words group together and a lexicon of all the words and symbols that are part of the grammar. These words and symbols can be split up into two groups: terminals and non-terminals. Two examples of a rule will be provided and used to explain the difference between terminal and non-terminal symbols:

NP → Det Nom   (1)

Det → a   (2)

Rule 1 indicates that a noun phrase (NP) can be composed of a determiner (Det) and a nominal noun (Nom). Det, Nom and NP are all non-terminal symbols. This means that they are still generalizations, or represent a constituent. Rule 2 shows how a non-terminal symbol (Det) is composed of a terminal symbol, namely a word from our lexicon: a. Symbols that directly specify the function of a word, like the symbol Det does for a, are called part-of-speech (POS) tags. Whenever structure is derived from a sentence, this structure is usually depicted with a parse tree. Take for example the parse tree in Figure 1. It originates in root node S and is then parsed into two constituents: a noun phrase (NP) and a verb phrase (VP). These two phrases consist of a determiner (Det) and a nominal noun (Nom), and a 3rd person singular verb (VBZ), respectively. These functions match the three words in our lexicon: A, boy and runs.

(S (NP (Det A) (Nom boy)) (VP (VBZ runs)))

Figure 1: A parse tree of the sentence “A boy runs.”
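The toy grammar and parse above can be reproduced with NLTK, the toolkit that is also used later in this thesis for tokenization and tagging. The snippet below is an illustrative sketch, not part of the thesis's parsing model:

    import nltk

    # Toy grammar mirroring Rules (1)-(2) and the lexicon of Figure 1.
    grammar = nltk.CFG.fromstring("""
        S   -> NP VP
        NP  -> Det Nom
        VP  -> VBZ
        Det -> 'A'
        Nom -> 'boy'
        VBZ -> 'runs'
    """)

    parser = nltk.ChartParser(grammar)
    for tree in parser.parse(['A', 'boy', 'runs']):
        print(tree)  # (S (NP (Det A) (Nom boy)) (VP (VBZ runs)))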

2.1.2 Syntactic Parsing

The derivation of structure from a sentence, illustrated in the previous section, is called syntactic parsing. Parsing can be seen as a search problem, where a correct parse tree has to be found that matches the input sentence. The constraint, however, is that the solution should be composed of the rules and symbols available in the provided grammar. There are two main approaches to solve this problem: either bottom-up or top-down (Jurafsky & Martin, 2009).

In top-down parsing the parser starts at the root of the parse tree (S) and then works its way down to the actual words. First, the parser will look at all the rules in the grammar with S as their left-hand side and expand these into their corresponding right-hand side constituents. Following this strategy it will keep expanding rules until it has generated a legitimate parse tree that ends with the words of the input sentence. An advantage of the top-down strategy is that it will never explore a subtree that does not connect to the root node S. A disadvantage, on the other hand, is that many subtrees may be explored that, in the end, do not match the input sentence.

In bottom-up parsing the parser starts with the actual input sentence, “a boy runs” for example, and works its way up to the root of the tree. At first different subtrees are formed using the POS tags for every word in the sentence. These tags are subsequently used to consider the rules and constituents that follow. This process of finding the constituents that are made up of the given symbols is continued until a sentence structure is found that explains the entire sentence while correctly using the grammar. An advantage of a bottom-up parser is that it will always form a tree that is grounded in the input sentence. A disadvantage, however, is that subtrees may be created that are not consistent with the root node S. The bottom-up approach can be implemented in transition-based parsers, and this is the case for our parser.

Bottom-Up Parsing

Now that a global overview of syntactic parsing methods has been sketched, further insight will be given into the parsing strategy that is implemented in our parser. The parsing model used in this thesis is transition-based: it follows a bottom-up shift-reduce strategy to be precise. The shift-reduce strategy works through a deduction system, of which the rules can be seen in Figure 2 (Shieber, Schabes, & Pereira, 1995).


Figure 2: The bottom-up shift-reduce deductive parsing algorithm

First the notation of the rules will be explained. Every item consists of two parts:

• Before the comma there is a dot (•) that denotes how much of the sentence has been seen at a given moment in the parse. The left-hand side of the dot represents the symbols that have been processed; all processed symbols are pushed onto the stack and saved until they can be used.

• The index after the comma denotes the position of the parser in the sentence.

The parser starts at position 0, where no symbol has been pushed onto the stack yet; this is the axiom. The goal of the parser is reached when the only symbol left on the stack is the root of the parse tree (S) at n, the last position in the sentence. Furthermore there are two inference rules: shift and reduce. Shift means that the parser pushes the word wj+1 onto the stack and moves to the next position (j + 1) in the sentence. The reduce rule specifies that the group of symbols on the stack, αγ, can be reduced to the group of symbols αB if there is a rule in the grammar that states B → γ. This way the parser pops the symbols denoted by γ from the stack and replaces them with B. Finally, after enough of these shift and reduce actions have been executed, only S should be left on the stack at the final position of the sentence. When this point has been reached, the sentence is parsed and the algorithm terminates.

2.1.3 Classifier-Based Parsing

To conclude the overview of the computational linguistic components of our parsing model, we will explain what it means that the model is also constituent- and classifier-based. Constituent-based implies that the text is given structure through the procedures explained above: rules are inferred from a CFG to group words and their functions into constituents, until the structure of the entire sentence is clear. Another type of parser is the dependency-based parser, which does not use a CFG and constituents, but looks at the lexical relations between words (Jurafsky & Martin, 2009).

A classifier-based parser is a parser that makes parsing decisions through a trained classifier. The parsing strategy used in this thesis is very similar to the one employed by Sagae and Lavie (2005). In the previous section an overview was given of the shift-reduce algorithm for parsing, but how does the algorithm know what action to take at what moment during parsing? When the only possible action is to shift, or to reduce, the choice is obvious, but what if it is possible to either shift or reduce? Another difficulty is that sometimes there might be multiple rules in the grammar that can be applied to reduce a symbol on the stack. Which rule will assign the right syntactic structure to the sentence? This is where the classifier comes into play.

The classifier is responsible for making these crucial parsing decisions within the shift-reduce algorithm. There are two types of questions the classifier has to decide on: should the parser shift or reduce, and which rule should be used to reduce a symbol on the stack? At every point in the parsing process the classifier chooses the best action based on the local context. To make these decisions the classifier uses a set of features that it retrieves from the current form of the parse tree and the local context in the sentence. A more in-depth explanation of the classifier used in this research will be provided in Section 3.1.1.
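To make the interplay between the shift-reduce algorithm and the classifier concrete, the sketch below parses a tagged sentence and defers every shift/reduce conflict to a pluggable decision function. The choose callable and the toy rules are illustrative stand-ins, not the actual classifier or grammar used in this research:

    def shift_reduce_parse(tags, rules, choose):
        """Classifier-driven shift-reduce parsing over POS tags.

        `choose(stack, position, actions)` stands in for the trained
        classifier: given the current stack, the input position and the
        legal actions, it returns the action to execute."""
        stack, j = [], 0
        while not (stack == ['S'] and j == len(tags)):
            actions = []
            if j < len(tags):                       # shift is legal
                actions.append(('shift', None))
            for lhs, rhs in rules:                  # reduce with B -> gamma
                if len(rhs) <= len(stack) and stack[len(stack) - len(rhs):] == list(rhs):
                    actions.append(('reduce', (lhs, rhs)))
            if not actions:
                raise ValueError('dead end: no legal action')
            action, rule = choose(stack, j, actions)
            if action == 'shift':
                stack.append(tags[j])
                j += 1
            else:
                lhs, rhs = rule                     # pop gamma, push B
                stack[len(stack) - len(rhs):] = [lhs]
        return stack[0]

    # Toy run with a stand-in "classifier" that always prefers to reduce:
    def prefer_reduce(stack, position, actions):
        return actions[-1]      # the last-listed action is a reduce, if any

    rules = [('NP', ('Det', 'Nom')), ('VP', ('VBZ',)), ('S', ('NP', 'VP'))]
    print(shift_reduce_parse(['Det', 'Nom', 'VBZ'], rules, prefer_reduce))  # 'S'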


2.2 Psycholinguistics

Psycholinguistics is a subfield of psychology that studies the processing of language in the human mind. To design a parsing model that is able to accurately predict reading patterns given a specific text, it will be necessary to integrate psycholinguistic theories. Additionally, psycholinguistic experimental data is crucial for the training and testing of the resulting model. To integrate all these elements properly, it is important to have a fair understanding of psycholinguistics – specifically of cognitive architectures (mainly the ACT-R model) and reading tasks. These two aspects provide the psycholinguistic basis for this thesis, hence a comprehensive explanation of both will be supplied in this section.

2.2.1 Cognitive Architectures

Cognitive architectures are formal theories of the mind used in cognitive science and psychology to model the workings of the human brain. Whereas many psychological models explain one specific component of the mind, a cognitive architecture tries to explain how all the components of the mind work together to produce cognition. There are several advantages to such a complete theory. First of all, the architecture can be used to handle real-world problems that involve multiple mental tasks. On top of that, these complete theories are better able to work with data produced by cognitive science, or in our case psycholinguistics (Anderson et al., 2004).

Generally a cognitive architecture is implemented as a computer simulation, which provides it with a few more advantages (Forstmann, Wagenmakers, et al., 2015). The main perk of these simulated models is that they can produce very precise results. When modelling a complete task, like reading a text or driving a car, it is possible to predict accurate reading times or response times using the implemented architecture. Furthermore, by implementing the model of a complete task, its various parts are required to be as general as possible; if they were not, the model could not simulate a general task. Our parsing model in ACT-R, for example, can handle both self-paced reading tasks and eye-tracking tasks because of its generality.

2.2.2 Adaptive Control of Thought-Rational

The Adaptive Control of Thought-Rational (ACT-R) model is an example of such a cognitive architecture (Anderson et al., 2004). The ACT-R architecture consists of several different modules that are coordinated by a central production system. These modules interact with the central production system and each other through associated buffers. An overview of the model can be seen in Figure 3. It is not out of the question that the model could include more modules and buffers than depicted in this figure, but for the purpose of this thesis we will assume the representation in the figure.

Figure 3: A schematic overview of the ACT-R architecture (Anderson et al., 2004)

The center of the ACT-R architecture is the central production system – the equivalent of procedural memory. The central system is connected to four different modules through their respective buffers. These buffers hold information that can be used to execute actions, directed by production rules fired by the procedural memory. A production rule is a conditionalized cognitive action, as can be seen in Figure 4a (an example of the production rules used in our parsing model can be found in Section 3.2.1). Whenever the procedural memory fires a production, the implied action will not be executed until the states of the buffers match the condition of the production. We will use the production in Figure 4a to illustrate this process. Above the arrow we see the condition: IF the goal buffer contains the TASK ‘reading’ of the FORM ‘car’, THEN the cognitive action depicted below the arrow will be taken. The cognitive action in this case is retrieving the category of the word ‘car’ from declarative memory. This way the central production system connects all the modules, by recognizing information in the buffers and using that information to execute cognitive actions.

Then there is the declarative module, which can be compared to long-term memory. It is here that all declarative information can be retrieved from memory. This information is stored in so-called chunks. These chunks can take on different forms; an example of a chunk containing lexical information of a word can be seen in Figure 4b. It is important to note that every buffer can hold merely one chunk of information. Furthermore, the declarative memory is connected to the central production system by means of the retrieval buffer. The buffer holds information retrieved from memory, to be used by the procedural memory. Later on we will expand further on the retrieval of information from the declarative module.

Figure 4: Examples of a production rule (a) and a chunk (b) in our ACT-R framework (Brasoveanu & Dotlacil, 2018)

The ACT-R architecture also entails a goal module and a perceptual-motor system, consisting of the visual module and the motor module. The goal module is responsible for keeping track of the current goal in mind, similar to the working memory (Anderson et al., 2004). This module directs the actions that have to be taken. Accordingly, the goal buffer checks the internal state in the process of reaching this goal. This way the architecture is able to work towards a goal by splitting the task up into sub-goals, which can all be kept track of.

The last two modules that will be highlighted are the visual and the motor modules. The visual module is necessary to identify objects in the world that can be visually perceived; the motor module is in place to control the hands. Within the visual module a distinction is made between the visual-location module and the visual-object module. The visual-location module keeps track of the location of visual objects and can transfer chunks containing this information to the central production system through its buffer. Additionally, the visual-object module concerns itself with the identification of objects and helps the central production system to produce chunks with declarative information on the subject. The visual and manual buffers are responsible for communicating the information gathered by their respective modules to the central production system.

2.2.3 Reading Tasks

Reading tasks have long been a staple in psycholinguistic research as a means to investigate the connection between reading patterns and cognition. It has turned out that inferences can indeed be made about cognition with the use of reading task data (Rayner, 1998). The reading task data we will use in this thesis are eye-tracking data and self-paced reading data (Rayner, 1978; Just, Carpenter, & Woolley, 1982).

The goal of eye-tracking research is to discover the connection between eye movements and cognitive processes. During such a task the eye movements are tracked not only in a spatial manner, but also a temporal one. When the eye movements are tracked during a reading task, they can give us information on the linguistic properties of a text (Kennedy & Pynte, 2005). In most research the following features of eye movement are recorded per word that is read: duration of the initial gaze, total fixation duration and the probability of refixating a word (Kennedy & Pynte, 2005; Frank, Monsalve, Thompson, & Vigliocco, 2013; Engelmann, Vasishth, Engbert, & Kliegl, 2013). In Figure 5 an example is given of these measurements. The first time the gaze is focused on the word “invited”, for 270ms and 243ms, is called the initial gaze or the first pass. The reader then focuses on “a juggler”, but subsequently returns his attention to “invited”. The 235ms period that follows is the second pass time, and the first and second pass together form the total fixation duration. We observe that the reader returns his gaze to a word he has already fixated on before; this behaviour is called regression.

Figure 5: Hypothetical eye movement record (Liversedge, Paterson, & Pickering, 1998)

In a self-paced reading task (Just et al., 1982) participants are presented with the first word of a sentence on a screen and can proceed to the next word of the sentence by pressing a specific key on the keyboard. The time it takes for a participant to press the button is assumed to be the reading time for that specific word (Frank et al., 2013). Self-paced reading does not give results as elaborate as eye-tracking, since there is only one fixation period and no regression, but it is still a solid indicator of the time it takes to process a word.
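The eye-tracking measures described above (first pass, second pass, total duration and regression) can be made concrete with a short sketch that derives them from a simplified fixation record. The input format and the exact operationalization of regression are assumptions made for illustration; they do not reproduce the coding scheme of any particular corpus:

    def word_measures(fixations, n_words):
        """Per-word reading measures from a simplified fixation record:
        a list of (word_index, duration_ms) pairs in order of occurrence."""
        first_pass = [0] * n_words      # fixations before the word is first left
        total = [0] * n_words           # total fixation duration
        regressed = [False] * n_words   # word re-entered from further right
        left = [False] * n_words
        furthest, prev = -1, None
        for idx, dur in fixations:
            if prev is not None and prev != idx:
                left[prev] = True
            if idx < furthest and left[idx]:
                regressed[idx] = True   # gaze returned to an earlier word
            furthest = max(furthest, idx)
            total[idx] += dur
            if not left[idx]:
                first_pass[idx] += dur
            prev = idx
        second_pass = [t - f for t, f in zip(total, first_pass)]
        return first_pass, second_pass, total, regressed

    # The hypothetical record around "invited" (word 0) from Figure 5:
    record = [(0, 270), (0, 243), (1, 150), (2, 180), (0, 235)]
    print(word_measures(record, 3))
    # word 0: first pass 513 ms, second pass 235 ms, regression True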


2.3 Bayesian Methods

The parsing model used to predict reading times has several parameters (more on this in Section 3.1.1) that can be tuned to optimize the fit to the reading task data. To find the optimal values of these parameters, we will make use of Bayesian statistical methods. This allows us to explore the likely range – not just one specific value – of the parameter values that can be inferred from the data. To make the methods used for parameter estimation later on easier to follow, an overview will be given with the help of two explanatory works on Bayesian statistics (Kruschke, 2010; Farrell & Lewandowsky, 2018).

2.3.1 Markov Chain Monte Carlo Methods

Whenever multiple parameters have to be estimated and the problem cannot be solved analytically, a frequently used solution is the Markov chain Monte Carlo (MCMC) method. A Monte Carlo sampling method tries to estimate the properties of a target distribution by sampling a large number of random values from a proposed distribution. This is possible since Bayes’ rule tells us that the product of the prior distribution and the likelihood is proportional to the posterior distribution (3) – assuming that the marginal probability can be discarded.

P(θ | y) ∝ P(y | θ) × P(θ)   (3)

In an MCMC process each sampled value depends only on the value that was sampled immediately before it, not on any earlier value in the chain. This is the defining property of a first-order Markov chain, hence the name (Kruschke, 2010).

Metropolis-Hastings Algorithm

A particular MCMC algorithm that is widely used, and that we will use as well, is the Metropolis-Hastings algorithm (Metropolis, Rosenbluth, Rosenbluth, Teller, & Teller, 1953; Hastings, 1970). The algorithm can be summarized as follows:

• First a starting value is picked for the parameter(s) whose posterior distribution (the target distribution) has to be estimated. This will be the first sample.

• Then some random noise is drawn from a proposal distribution and added to the current value; this sum is the proposal value. The proposal distribution is not equal to the prior or the target distribution and can be adjusted to fit the problem at hand.

• The density of the posterior is compared for the current and the proposed value, which is possible since the posterior is proportional to the known prior and likelihood. If the density of the target distribution at the proposed value is higher than at the current value, the proposed value will become the new value. If the posterior density at the proposed value is lower, it will be accepted with a probability equal to the ratio of the proposed to the current density.

• When the proposal value is accepted it becomes the next sample in the chain and the process starts over.

This way the algorithm keeps sampling values until it converges to the desired target distribution and we have an accurate estimation of the posterior (Farrell & Lewandowsky, 2018).
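As a minimal illustration of these steps, the sketch below implements the Metropolis variant (a symmetric Gaussian proposal, so the acceptance ratio simplifies) for a one-dimensional target; the target, start value and step size are arbitrary choices for demonstration:

    import math
    import random

    def metropolis(log_posterior, start, n_samples, step=0.5):
        """Metropolis sampling with a symmetric Gaussian proposal.
        `log_posterior` need only be proportional to the true posterior,
        i.e. log prior + log likelihood, as in Equation (3)."""
        samples, current = [start], start
        for _ in range(n_samples):
            proposal = current + random.gauss(0.0, step)    # add proposal noise
            # Accept with probability min(1, p(proposal) / p(current)),
            # computed on the log scale for numerical stability.
            if math.log(random.random()) < log_posterior(proposal) - log_posterior(current):
                current = proposal
            samples.append(current)
        return samples

    # Example: a standard-normal target, with the burn-in period discarded.
    chain = metropolis(lambda x: -0.5 * x * x, start=5.0, n_samples=5000)
    kept = chain[1000:]
    print(sum(kept) / len(kept))    # approximately 0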

There are, however, a few possible issues with MCMC methods (Farrell & Lewandowsky, 2018). First of all, the initial sample and first proposals are generally not representative of the posterior distribution; it takes a number of iterations before the sampled values can be considered reasonable. These early samples can skew the final distribution returned by the algorithm, especially when it takes many samples before the values move into a region of the posterior with a higher density. This is why samples from the initial period, also called the burn-in period, are usually discarded.

Secondly, it is important to run several chains, so they can be checked for convergence. A chain has converged when the entire distribution has been explored and the final samples are independent of the starting values. However, since the sampling method is random, it is possible that the target distribution is not sufficiently explored in its entirety. One solution to this problem is to use multiple chains with different starting points and compare them to see if they have converged. Gelman, Rubin, et al. (1992) proposed a method to inspect whether the chains have converged, by comparing the variability within and between the chains.
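The within- and between-chain comparison can be sketched as follows; this is the standard formulation of the diagnostic, computed here for equally long chains:

    def gelman_rubin(chains):
        """R-hat for a list of equally long chains: compares between-chain
        and within-chain variance; values close to 1 suggest convergence."""
        m, n = len(chains), len(chains[0])
        means = [sum(c) / n for c in chains]
        grand_mean = sum(means) / m
        b = n / (m - 1) * sum((mu - grand_mean) ** 2 for mu in means)   # between
        w = sum(sum((x - mu) ** 2 for x in c) / (n - 1)                 # within
                for c, mu in zip(chains, means)) / m
        var_hat = (n - 1) / n * w + b / n   # pooled estimate of posterior variance
        return (var_hat / w) ** 0.5

    # e.g. four Metropolis chains started from different values:
    # r_hat = gelman_rubin([chain_1, chain_2, chain_3, chain_4])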


2.4 Related Works

A quick review will be given of related works in this line of research – the explaining and predicting of reading patterns. Two main approaches to explain and predict reading patterns will come up: either making use of cognitive architectures, or of an independent data-driven parser in combination with psycholinguistic theories. We will consider examples of both approaches and discuss their similarities and differences with our research.

The first strategy that will be expanded upon entails a parser that is directly developed as part of a cognitive architecture. Lewis and Vasishth (2005) developed a parsing model where the parser was embedded in the ACT-R architecture. As a result this model could be used to parse a sentence and directly predict corresponding reading patterns, just like our parsing model is able to. Brasoveanu and Dotlacil (2018) also used a parsing model embedded in an ACT-R framework, to examine syntactic structures and predict reading patterns. There is one significant difference between these approaches and ours: in these models every parsing rule had to be hand-coded as a production, whereas our parsing model only has a few standard productions and retrieves the parsing rules from data. By retrieving the parsing rules from the data, the model is more general and open-ended, thus more human-like. If all the parsing rules have to be manually stored through corresponding productions, the model is less generally applicable.

Another strategy in this line of research is to explain reading patterns with the help of statistical models, cognitive theories and data-driven parsers. In the work of Demberg and Keller (2008), for example, research is done into the computation of syntactic dependencies in texts. They test two theories for syntactic dependence – DLT and surprisal (see Demberg and Keller (2008) for further explanation on these theories) – on the eye-tracking measures from the Dundee corpus (Kennedy & Pynte, 2005). First of all they parsed the sentences in a correct format and then predicted the reading patterns by retrieving the DLT and surprisal values for the parsed data. These values were then combined in a statistical model with the eye-tracking measures, to examine whether there was a correlation between the values calculated with the use of both theories and the eye-tracking measures.

This work certainly attempts to explain reading patterns, resembling our research, but not with the use of a parser that is embedded in a cognitive model. The sentences are parsed by a “standard” parser and afterwards two independent cognitive theories are applied to the parsed sentences, to find a connection with the eye-tracking measures. Instead of striving towards a parser that can explain reading patterns in a human-like manner, reading patterns are explained after the parsing process by external cognitive theories. This same strategy can be found in the work of, for example, Boston, Hale, Vasishth, and Kliegl (2011), where research is done into parallel processing and sentence comprehension difficulty, with the help of eye-tracking measures and two different cognitive metrics.

The final related works that should be mentioned are the bachelor’s theses by Bal (2018) and van Oers (2018). They looked into a more primitive, earlier version of the parsing model and analyzed its performance. First of all they evaluated the parser’s parsing abilities on sentences of the Penn Treebank corpus, by looking at the precision and the recall of the parses it produced, compared to the gold standards in the corpus. They retrieved an average precision of 39.5% and an average recall of 39.7%; our parsing model, on the other hand, achieved a precision of 75.9% and a recall of 73.7%. Furthermore, they examined the correlation measures between the reading patterns that were predicted by the parser and the actual reading task data from Frank et al. (2013) and the Dundee corpus (Kennedy & Pynte, 2005). The maximum Pearson correlation they achieved was 0.212 for the Frank et al data and 0.276 for the Dundee corpus data. One of the goals of this research is to improve on these results and achieve higher correlations between the predicted and observed reading times.

3 Research

3.1 Materials

This section will elaborate on what implementations, data and programs were used to design, test and optimize the parsing model. First of all we will take a look at the default parsing model embedded in ACT-R that was implemented with pyactr (Brasoveanu & Dotlacil, 2019). All the cognitive parts of the parser will be explained, a summary of the modelling parameters is given and a little more background information on the classifier will be provided as well. When the workings of the parsing model are clear, we will expand on the two different data sets that were used, retrieved from Frank et al. (2013) and Kennedy and Pynte (2005).

3.1.1 Parsing Model in ACT-R

The parsing model provided was constructed with a Python3 implementation of ACT-R: pyactr (Brasoveanu & Dotlacil, 2019). This implementation is very useful for our cause, since it makes it easy to combine the model with Bayesian estimation methods and it also comes with an extensible parsing framework (Brasoveanu & Dotlacil, 2018), including all the modules mentioned in Section 2.2.2. A pyactr model contains two kinds of informational units: chunks and productions – exactly like the ACT-R architecture designed by Anderson et al. (2004). These units represent declarative and procedural information, respectively.

Chunks

The chunks in the declarative memory of our parser hold the information of all words that appear in the task being modelled. Below, an example is provided of the chunk declaration of the word “car” in pyactr:

    dm_entry = actr.chunkstring(string="""
        isa word
        form car
        category noun
    """)

A word chunk, like the one depicted above, holds two kinds of information: the form and the category (POS tag) of a word. To initialize the parsing model, it is important to make sure that there is a chunk stored in the declarative memory module for every word that appears in the data. Apart from word chunks, the model also contains reading and action chunks. The reading chunk keeps track of the model’s current state with respect to the reading task (state, position, word, reanalysis). Finally, the action chunk tracks the actions that are being chosen by the classifier and also all the information that is necessary for the classifier to make a new parsing decision (e.g. tree structure, next word, category of the next word).


Modules and Buffers

Before the productions can be explained in depth, it is important to have an understanding of the different modules and buffers included in the model. As mentioned earlier, the main modules in a model are the procedural memory and the declarative memory. These both have their own buffers associated with them: the goal buffer and the retrieval buffer. To model the eye-tracking task a visual module is also necessary; this module is connected with the visual buffer. For the self-paced reading model an additional manual module is required to model the key presses. The procedural memory controls all these modules by giving them instructions and combining the information they supply. An important aspect that should not be overlooked is that the buffers can only hold one chunk of information at a time, just like the procedural memory can only fire one production at a time.

Productions

The second type of information in this ACT-R parsing model is a production or a production rule. Productions are fired by the procedural memory of the model and instruct the parsing model on what actions to take. First of all, we will give an example of a production in our pyactr model and then use this example to explain how a production should be interpreted.

     1  parser.productionstring(name="recall word", string="""
     2      =g>
     3      isa         reading
     4      state       reading_word
     5      word        None
     6      ?visual>
     7      buffer      full
     8      =visual>
     9      isa         _visual
    10      value       =w
    11      value       ~___
    12      ?retrieval>
    13      state       free
    14      ==>
    15      =g>
    16      isa         reading
    17      state       recall_word
    18      retrieve_wh None
    19      ~visual>
    20      +retrieval>
    21      isa         word
    22      form        =w""")


This is the production for the action recall word. First of all, it is necessary to note that a production is a conditional statement. The preconditions are stated before the arrow (==>, line 14) and when these are satisfied, the actions below the arrow will be executed. The goal buffer (lines 2-5) is denoted by g, and the = indicates that the content of the buffer has to match the conditions stated on lines 3-5: the chunk has to be of the reading type, its state should be reading_word and the word slot should still be empty. Then the visual buffer (lines 6-11) is checked, indicated by the ?, to see if it is indeed full. Furthermore there are a few conditions for the visual buffer stated on lines 9-11: the chunk held by the visual buffer has to be of the _visual type and its value should be the variable =w, where the ~ requires that the value does not match ___. The final precondition asks that the retrieval buffer is free (lines 12-13).

Assuming that all these preconditions are satisfied, the actions below the arrow will be executed. The state in the goal buffer is changed to recall_word, the visual buffer is flushed completely (~visual>, line 19) and the retrieval buffer is accessed to add a new chunk. This is a word chunk whose form is the variable =w – the word just read and retrieved from the visual buffer.

To make the parsing model “read” like a human by using the ACT-R framework, productions have been written to model every action a person takes when participating in an eye-tracking task. By firing these productions in the right order, it is possible for the model to simulate such a task (Figure 6). To start off, the first word of the sentence has to be found. When this word is located the model can execute an action to recall the word. Subsequently, the model signals that the word has been recalled and then a fitting action can be recalled. If the right action has been recalled and executed, a new production fires to move the attention to the next word in the sentence. This procedure keeps repeating itself until the last word of the sentence has been found and recalled, and the final action is chosen.


Temporal Properties of Retrieval – Modelling Parameters

Now that it is clear how the actions within the brain are mimicked by the model, we will elaborate on the subsymbolic parameters that are used to tune the model (Brasoveanu & Dotlacil, 2019). The free parameters are the latency factor (F), the latency exponent (f) and rule firing (r). These parameters will be optimized to try to produce a model that better fits the psycholinguistic data. First, we provide (4), the definition of the latency of retrieval (Ti) in the ACT-R model for chunk i, where F and f are free parameters and Ai is the activation of chunk i. The latency of retrieval represents the time it takes for the model to retrieve a chunk from memory and is thus imperative for the prediction of reading times.

Ti = F · e^(−f · Ai)   (4)

Before f and F can be explained, it is important to note the activation parameter A. The activation level of a chunk indicates its availability to be retrieved from memory. It determines both the probability that the chunk will be retrieved and the latency of retrieval (Brasoveanu & Dotlacil, 2019). Equation (5), the activation function, consists of two parts: the ACT-R base activation (Bi) and the spreading activation.

Ai = Bi + Σj=1..m (Wj · Sji)   (5)

First we will take a closer look at the base activation (6). In this equation we see that base activation Bi is a function of the times tk: the time that has passed since presentation k of chunk i, summed over all its presentations 1 to n. The free parameter d represents our standard decay of memory – forgetting – and is usually set to 0.5.

Bi = log( Σk=1..n tk^(−d) )   (6)

The spreading activation, the second part of Equation (5), can be described as a context-driven increase of activation (Brasoveanu & Dotlacil, 2019). Whenever a chunk, e.g. the chunk of the word “car”, is activated by retrieval, all the chunks that share a value with this chunk will receive spreading activation. The amount of activation these connected chunks receive depends on the weight Wj for value j of chunk i. This weight is subsequently boosted by the associative strength Sji, which represents the strength of the connection between chunk i and value j. Equation (7) depicts the definition of associative strength that is frequently used in ACT-R modelling.

Sji = S − log(fanj)   if i and j are associated
Sji = 0               otherwise   (7)

Whenever chunk i is not associated with value j, there is no associative strength, thus no spreading activation. If value j is indeed present in one of the slots of chunk i, the associative strength is calculated from the maximum associative strength (S) and the fan of value j. The value of S has been set to 20 in our model, to ensure that the associative strength is always positive. The fan of j is the number of chunks in the declarative memory that have j as a value for one of their slots. This means that when a value appears frequently in the declarative memory, the associative strength will be lower than when a value is very specific and does not appear as frequently.

Let us return to Equation (4). The latency of retrieval is thus calculated with Ai, F and f. In this calculation the activation differs per chunk and is calculated by means of a base activation and a spreading activation that depends on the frequency of a value. To tune the activation levels of chunks we have the latency factor F as a scaling factor for the logarithmic scale of the activation, and the latency exponent f to determine the slope. With these free parameters it is possible to adjust the model so as to optimize the latency of retrieval, to approach the psycholinguistic data as closely as possible. The default values for F and f, based on the values found in the literature, are 0.1 and 1, respectively (Lewis & Vasishth, 2005; Brasoveanu & Dotlacil, 2018).

To conclude, a quick explanation of the rule firing parameter r will be supplied, which is significantly more straightforward. The rule firing parameter models the time it takes for the central production system to fire a production. This means that every time an action is taken, the rule firing time will have to be added to the total time it takes to execute this action. It is assumed that the value of this parameter is 0.05 seconds (Brasoveanu & Dotlacil, 2019).
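Read together, Equations (4) to (7) translate directly into code. The sketch below uses the parameter values mentioned above (d = 0.5, S = 20, and the default F = 0.1 and f = 1); the weight W and the example presentation times are illustrative assumptions:

    import math

    def base_activation(times_since_presentation, d=0.5):
        """Equation (6): decayed traces of the times t_k since each
        presentation of chunk i, with decay d = 0.5."""
        return math.log(sum(t ** -d for t in times_since_presentation))

    def spreading_activation(shared_values, fan, w=1.0, s=20):
        """The spreading term of Equation (5) combined with Equation (7):
        S = 20 as in the thesis; fan[j] counts the chunks that carry
        value j. The weight w is an illustrative assumption."""
        return sum(w * (s - math.log(fan[j])) for j in shared_values)

    def retrieval_latency(activation, lf=0.1, le=1.0):
        """Equation (4): T_i = F * exp(-f * A_i), with the default
        latency factor F = 0.1 and latency exponent f = 1."""
        return lf * math.exp(-le * activation)

    # A chunk presented 10 s and 60 s ago, sharing one value with the cue:
    a_i = base_activation([10.0, 60.0]) + spreading_activation(['noun'], {'noun': 50})
    print(retrieval_latency(a_i))   # predicted retrieval time in seconds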

Classifier-Based Parsing in ACT-R

The classifier used in the parsing model is based on the activation levels in ACT-R that were explained previously. To decide what action to take, the base- and spreading activation are calculated for every action chunk that is available for the classifier at each point in the parsing process and the action with the highest activation level will be chosen and executed. The different values of the action chunk can be seen as the feature-set the classifier uses to choose the best action.

The specific feature-set that is available to the classifier depends on whether it is following a blind or a non-blind parsing strategy. A blind parsing strategy implies that the classifier has no knowledge of words in the sentence that it has not encountered yet. When the parser uses a non-blind parsing strategy, on the other hand, it does have information on words that are ahead of the current position in the parse. To illustrate what these features entail, both for the blind and the non-blind parser, we provide a list of all the features used by our classifier in the Appendix.

Before the classifier knows how to interpret these features and make solid decisions, it has to be trained on feature-sets paired with the right decision that should follow from them. To create such a set one can run the parsing model on a set of sentences for which the correct parse is already known – in our case this was the Penn Treebank (Marcus, Santorini, & Marcinkiewicz, 1993). When this training is finished, the classifier knows what decisions follow from which feature-sets and it can be used to make decisions when parsing new, unknown sentences.

3.1.2 Data

For the optimization and testing of the parsing model two sets of reading time and eye-movement data were used: self-paced reading and eye-tracking data from Frank et al. (2013), and eye-tracking data from the Dundee corpus (Kennedy & Pynte, 2005). Both data sets are the result of psycholinguistic research and thus well suited for the evaluation of our model. There are, however, a few differences between the two sets of data, both in content and in format. In this section an overview will be given of both.

Frank et al Corpus

Since the goal of the authors was for this corpus to be a “gold standard” for model evaluation, a fair amount of consideration has gone into the stimuli selection (Frank et al., 2013). The final selection of stimuli consists of 361 hand-picked sentences gathered from online English novels. There was a constraint on sentence selection, namely that the sentences had to be made up of words that are used with high frequency in the English language. Overall the sentences are of varying length, ranging from 5 to 38 words with an average of 13.7 words. POS tags were generated for each sentence and adjusted to correspond with the Penn Treebank. A total of 117 subjects participated in the self-paced reading task and 48 subjects participated in the eye-tracking task.

The standard self-paced reading (Just et al., 1982) procedure was followed for the first task. Subjects were shown a word on a computer screen and could proceed to the next word of the sentence by pressing a key. The times between the display of the word and key press were recorded as reading time per word. For the eye-tracking task the subjects were presented with an entire sentence and their eye movements were recorded while reading. Some sentences had corresponding clarification questions, but this part of the data was irrelevant for our research and thus discarded. The final reading time measures retrieved with this task were: first-fixation time, first-pass time, right-bounded time and go-past time. The first two measures have been mentioned in section 2.2.3. To clarify the latter two: right-bounded time is the “summed duration of all fixations on the current word before the first fixation on a word further to the right.” (Frank et al., 2013) and go-past time can be defined as “summed duration of all fixations from the first fixation on the current word up to (but not including) the first fixation on a word further to the right.” (Frank et al., 2013).

Dundee Corpus

This corpus made use of English and French stimuli, each tested on 10 participants. We only used the English-language part of the corpus, since the parsing model is designed for the parsing of English texts. The English stimuli are composed of newspaper texts (from The Independent). For the experiment 20 text files, consisting of approximately 2800 words each, were prepared. In contrast to the Frank et al corpus, the text was not ordered in sentences, but in 40 five-line screens. Participants had to read each screen and indicate with the press of a button that they were finished reading; meanwhile their eye movements were measured. Results were saved in two types of files: MA1 and MA2 files. The first type codes the eye movements in the order in which they occurred and the second type codes the movements in word order. It turned out that the second data type was the most convenient for us, since we process the data word for word. A fair number of different features was collected during the task, but the (for us) relevant data were the words with their corresponding fixation duration and fixation number – the order in which the words were fixated on while reading.

3.2 Method

In this section a detailed description of the conducted research will be given. The research consisted roughly of three steps: preprocessing the data, testing the parsing model on the data and making some adjustments, and finally estimating the optimal parameters. There was a fair amount of preprocessing that had to be done, especially for the Dundee corpus. A few alterations have been made to the parsing model during the research process, most notably the addition of several high-level parsing parameters. The considerations behind this adjustment and others will be expanded upon later on. Finally we will guide you through the main research component of this thesis: the parameter optimization using Bayesian methods.

3.2.1 Data Preprocessing

Frank et al corpus

The Frank et al corpus required a small amount of preprocessing. The stimuli and corresponding POS tags were already ordered per sentence and saved in well-structured .csv files. On top of that, the retrieved reading time data was divided by the creators of the corpus into the four different possible measures mentioned in the previous section. To prepare the reading time data – both self-paced reading and eye tracking – the average was taken of all reading times and measures (in ms), per subject, per word, per task.

For the parsing model to be able to parse the sentences used by Frank et al, the stimuli and their POS tags had to be processed and saved in a specific format. First the sentences had to be split into words; this was done with the NLTK Toolkit (version 3.2.4) (Bird, Loper, & Klein, 2009). For every word, six features had to be saved: word position, word, POS tag, sentence number, frequency and activation. The first four features are self-explanatory, but we will provide a further explanation of the latter two. To calculate the frequency of a word we retrieved the frequency of the word according to the British National Corpus and normalized this number by the average amount of language exposure a person has had throughout his life (Reitter, Keller, & Moore, 2011). The calculations for the activation levels can be found in Section 3.1.1.
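A sketch of how such per-word records might be assembled is given below; freq_lookup and activation are hypothetical stand-ins for the BNC-based frequency normalization and the activation computation of Section 3.1.1:

    import nltk

    def sentence_records(sentence, tags, sent_no, freq_lookup, activation):
        """Build the six per-word features described above. `freq_lookup`
        and `activation` are hypothetical stand-ins for the BNC-based
        frequency table and the activation computation of Section 3.1.1."""
        words = nltk.word_tokenize(sentence)
        return [{"position": i, "word": w, "pos": t, "sentence": sent_no,
                 "frequency": freq_lookup(w), "activation": activation(w)}
                for i, (w, t) in enumerate(zip(words, tags))]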


Dundee corpus

The Dundee corpus needed more extensive processing before it was fit to use for the parsing model. Initially the stimuli were not ordered per sentence, but all the separate words were saved by order of appearance. This meant that first of all, the words in the text files had to be grouped by sentence. Then these sentences had to be tagged with the corresponding POS tags. These two processing steps had already been implemented in a thesis from last year where the same data set was used (Bal, 2018). The author’s code could be reused, with a few minor adjustments, to group the stimuli and generate their tags. To generate the tags a built-in POS tagger from the NLTK toolkit was used. The next step in the process was to transform the fixation duration and fixation number per word, to useful eye tracking measures for each word: total duration, first pass, second pass and regression.

The final words and their tags were then converted to the same final format as the data from the Frank et al corpus. Not all the data from the Dundee corpus was used: merely the stimuli of the first 5 text files and the corresponding data of the first 9 subjects – the 10th subject’s data caused issues during parsing and was excluded. Finally, the mean was taken of the eye-tracking measures retrieved from these subjects.

3.2.2 The Model

Initial Parsing and Data Decisions

When the data had been converted to a usable format for the parsing model, some initial parses were executed to examine whether any adjustments had to be made to the data or the parser. The model parses a sentence word for word, creates a parse-tree (Figure 7) and finally returns a list with reading times (RT) per word in seconds.

Figure 7: Parse tree produced by the parser for sentence 1 of the Frank et al corpus. Note that there is a slight mistake in the parse, since laughed should be attached to a VP not an NP!


After the model was run on several sentences from both the Frank et al corpus and the Dundee corpus, it turned out that not all sentences were suited to be part of the set used to train and test the model. Sentences were excluded from the data-set for two possible reasons: the sentence was too long (>20 words), or the sentence was too hard to parse. We decided to exclude sentences consisting of more than 20 words because the parsing model produced considerably more incorrect parses as the number of words in a sentence increased. Additionally, it took the parser an extremely long time to parse these long sentences. This would make them unsuitable for the Bayesian estimation, where a sentence has to be parsed at least a few hundred times, resulting in a very long run-time. Whenever a sentence is encountered that is too hard to parse, the parser will either return default RTs for each word (10s) or throw an error. All sentences for which this happened were excluded from the data-set.

Before it was possible to successfully evaluate the RT predictions made with the parsing model, several decisions had to be made on how to format the predicted RTs and the observed RTs per sentence. To this end we decided not to include the RTs for the first and last words of a sentence; it has been shown that the RTs for these words are usually outliers compared to the other words in the sentence (Rayner, 1998). Lastly, we decided to exclude the self-paced reading data and use exclusively the eye-tracking data from both corpora. After initial tests with the Frank et al corpus, for which the first 50 sentences were parsed with the default parameters, it became clear that the predictions for the self-paced reading data were significantly worse than for the eye-tracking data. Whereas we obtained a correlation of ~0.3 with the observed data for the eye-tracking task, the correlation for the self-paced reading data was ~0.1 or even lower. This is why the decision was made to exclude the self-paced reading data and adjust the parsing model slightly to simulate an eye-tracking task.
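The evaluation step itself can be sketched as follows, assuming predicted RTs in seconds and observed averages in milliseconds; this is an illustration of the procedure, not the exact evaluation script used in this research:

    from scipy.stats import pearsonr

    def sentence_correlation(predicted_rts, observed_rts):
        """Pearson correlation between predicted RTs (seconds) and
        observed mean RTs (ms), dropping the sentence-initial and
        sentence-final words as described above."""
        predicted = [rt * 1000 for rt in predicted_rts[1:-1]]   # s -> ms
        observed = observed_rts[1:-1]
        r, p_value = pearsonr(predicted, observed)
        return r, p_value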

High-Level Parameters

During the process, four high-level parameters were added to the model that function as “knobs” and can be turned either on or off. These knobs regulate which information the parser uses: lexical, visual, syntactic and/or reanalysis information. The effects of the high-level parameters are as follows (a configuration sketch follows the list):

• Lexical: If the lexical parameter is set to False, the activation values for all words present in declarative memory are set to 100, which nullifies the effect of lexical retrieval from memory. If the parameter is set to True, all words in declarative memory have their activation calculated with the aforementioned methods.

• Visual: Whenever this knob is turned off, all words have the same spatial length for the model, meaning that the eye movements – focusing and reading – are the same for every word. If the knob is turned on, every word has a spatial length proportional to the number of letters it is composed of.

• Syntactic: If the parser does not make use of syntactic information, the activations of the syntactic rules that have to be retrieved are all identical. Furthermore, the parser does not keep track of difficult syntactic structures (like subject-object gaps). Whenever this knob is turned on, the activations are proportional to the relevance of the syntactic rules, and the parser does keep track of special structures and, as a result, is better able to parse them.

• Reanalysis: When this knob is turned on, the parser makes use of reanalysis: whenever the blind parser and the non-blind parser end up with a different parse, that part of the sentence is reanalyzed and parsed again. If the reanalysis parameter is turned off, reanalysis never takes place, so only information from the non-blind parser is used.
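To illustrate how such a knob enters the model, consider the lexical parameter; this is only a sketch of the idea, not the model's actual interface:

knobs = {'lexical': True, 'visual': True, 'syntactic': True, 'reanalysis': True}

def word_activation(computed_activation, knobs):
    # With the lexical knob off, every word receives the same flat
    # activation (100), so retrieval no longer reflects lexical differences.
    return computed_activation if knobs['lexical'] else 100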

3.2.3 Parameter Estimation

To estimate the optimal parameter values with which our model simulates both data-sets as accurately as possible, we followed the overall strategy described by Brasoveanu and Dotlacil (2019). They recommend the Python library pymc3 for implementing Bayesian modeling – especially the MCMC methods. In this section we elaborate on the exact algorithm that was used to estimate the parameters and on the priors that were assumed for the three parameters. The final part of the section is dedicated to the parallel computing methods that were used to speed up the sampling process.

Bayesian Methods

The pymc3 Python library provides an excellent framework for estimating the optimal parameters with Bayesian methods. First, a Bayesian model is created from the data and the parameters we want to estimate. Then a sampling method is chosen to approximate the posterior distribution of the relevant numerical parameters of the model. Finally, the results of the sampling can be analyzed to determine the estimated posterior distributions of the parameters.

To start off, the prior distributions for the parameters (F, f, r) had to be set. A uniform distribution was chosen for the latency factor (F) and the latency exponent (f), because this imposes no restrictions on the sampling except for the lower and upper limits. Especially since there are no clearly defined optimal values for these parameters yet, it seemed sensible to start with a fairly uninformative prior (Lewis & Vasishth, 2005). For the rule firing parameter a Gamma distribution was chosen: it penalizes values equal to 0 – the rule firing time in humans is not 0 either – while still allowing very small values between 0 and 0.10. The mean of the distribution (α/β = 2/40) was set to 0.05, since that is the default value for r. See the code snippet and Figure 8 below for the three prior distributions of the parameters:

from pymc3 import Uniform, Gamma

lf = Uniform('lf', lower=0.00001, upper=0.6)   # latency factor F
le = Uniform('le', lower=0.00001, upper=1)     # latency exponent f
rf = Gamma('rf', alpha=2, beta=40)             # rule firing r


Figure 8: Prior distributions for each parameter used by the Bayesian model.

Besides determining which priors had to be used, an algorithm to estimate the posterior distribution had to be chosen as well. The method used in this research is the aforementioned Metropolis-Hastings algorithm. This algorithm uses the prior distributions and the likelihood – which depends on the difference between the observed and the predicted RTs – to estimate the posterior distributions of the parameters (see Section 2.3.1).
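For concreteness, the following is a minimal, self-contained sketch of such a model in pymc3 with Metropolis sampling. The linear expression for mu, the noise level and the toy data are placeholders of our own: in the actual model the predicted RTs come from running the ACT-R parser with the sampled values of F, f and r.

import numpy as np
import pymc3 as pm

# Toy stand-in for the observed per-word reading times (seconds).
observed_rts = np.array([0.21, 0.35, 0.28, 0.40])

with pm.Model():
    lf = pm.Uniform('lf', lower=0.00001, upper=0.6)   # latency factor F
    le = pm.Uniform('le', lower=0.00001, upper=1)     # latency exponent f
    rf = pm.Gamma('rf', alpha=2, beta=40)             # rule firing r (mean 0.05)

    # Placeholder predictor; the thesis model instead runs the parser
    # with these parameter values to obtain per-word RT predictions.
    mu = lf + 0.1 * le + rf
    pm.Normal('rt', mu=mu, sd=0.1, observed=observed_rts)

    trace = pm.sample(draws=1000, step=pm.Metropolis(), chains=3)

burned_in = trace[200:]   # discard the 200-draw burn-in before analysis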

The data that was used for the sampling consisted of the first 50 suitable sentences from the Frank et al corpus and the first 50 suitable sentences from the Dundee corpus. Only 50 sentences were used per corpus to keep the run-time manageable. All sampling trials with the Frank et al sentences consisted of 1000 draws and 2 or 3 chains. The trial with the Dundee sentences consisted of just 800 draws and 2 chains, since the average sentence in the Dundee corpus is almost twice as long as most sentences in the Frank et al corpus, resulting in a significantly longer run-time.

With an MCMC method, more draws generally yield a more accurate final estimate (up to a certain point, of course). A thousand seemed like a fair number of draws to end up with a trustworthy estimate of the posteriors. The burn-in period of every chain was 200 draws for all the sampling trials; these draws were discarded before the analysis of the final results. A drawback of taking a large number of samples is that the run-time of the program grows very quickly, especially since the parsing model takes a fairly long time to syntactically parse and predict reading times simultaneously. To solve this problem and make a sufficient number of draws feasible, the sampling was done in parallel.


Parallel Sampling

To speed up the sampling process, we implemented it in such a way that it could be executed in parallel. Instead of predicting the RTs for all the sentences in a serial manner for each draw, the predictions were made in parallel with the Python 3 library mpi4py, which allows several processes to run at once. The sampling was parallelized in the following way (a minimal sketch follows the list):

• Indicate at the initiation of the program how many parallel processes should be executed (this depends on the number of cores and threads on the machine).

• The main function of the program is handled by the “master” process, and the function responsible for predicting the RTs of all the sentences is executed by the remaining “slave” processes.

• All the sentences that have to be parsed are split up between the slaves, so that every slave process parses and predicts its own share of the sentences.

• When all the slaves are done parsing, they communicate the retrieved values to the master, where they are merged and used for the sampling.

This parallelization of the sentence processing speeds up the sampling considerably. For the trials that consisted of 1000 draws each, a machine was used that could run 50 parallel processes. Compared with the author's machine, which can run 4 parallel processes, this sped up the process more than tenfold.
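A minimal scatter/gather sketch of this scheme is given below (run with, e.g., mpiexec -n 4 python script.py). Unlike the actual implementation, every rank here parses a share, and predict_rts is a hypothetical stand-in for the parsing model:

from mpi4py import MPI

def predict_rts(sentence):
    # Stand-in for the ACT-R parsing model; returns one RT per word.
    return [0.25 for _ in sentence]

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

chunks = None
if rank == 0:
    # The master splits the sentences into one chunk per process.
    sentences = [['so', 'he', 'laughed'], ['he', 'saw', 'her']] * size
    chunks = [sentences[i::size] for i in range(size)]

my_sentences = comm.scatter(chunks, root=0)       # each process gets a chunk
my_predictions = [predict_rts(s) for s in my_sentences]

gathered = comm.gather(my_predictions, root=0)    # master collects all results
if rank == 0:
    merged = [rts for chunk in gathered for rts in chunk]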

Sampling

When the priors had been decided and the model had been parallelized, the sampling trials could be run. For the Frank et al corpus we first sampled 3 chains of 1000 draws each, with all the high-level parameter knobs of the model turned on. Then 2 chains were sampled with the model in which the reanalysis knob was turned off, also for 1000 draws. Finally, we ran a sampling trial in which only the syntax knob was turned on. The trials were run with varying high-level parameters in the hope that the impact of these parameters on the parsing process and the reading time predictions would become clear. Since the Dundee corpus consists almost exclusively of long sentences (13–20 words), which take much longer to process, we ran only one sampling trial for it: 2 chains of 800 draws, with all the high-level parameters turned on.


3.3 Results

3.3.1 Frank et al corpus

Sampling – Overview

An overview is given of the sampling results for all three parsing models: the regular model with all knobs turned on, the model that does not make use of reanalysis, and the model that only uses syntactic information. As mentioned in the previous section, every sampling trial consisted of 1000 draws with a burn-in period of 200 draws. For every parameter we report the sampling distribution, the mean and variance of the sampled values, and the Gelman-Rubin convergence statistic (R̂).

Regular sampling

              r                            F                            f
Chain   Mean           Var            Mean           Var            Mean    Var
0       5.90 × 10^−5   2.44 × 10^−9   2.34 × 10^−4   5.81 × 10^−8   0.491   0.070
1       5.86 × 10^−5   2.26 × 10^−9   2.21 × 10^−4   4.46 × 10^−8   0.521   0.086
2       5.37 × 10^−5   1.70 × 10^−9   2.15 × 10^−4   3.43 × 10^−8   0.507   0.073

Table 1: Mean and variance of the parameter values over all draws with the regular parsing model. The first 200 draws are considered the burn-in period and are discarded.

Figure 9: Distribution of the parameter values (r, F , f ) drawn using the regular parsing model – chain 0.


        r       F       f
R̂       1.003   1.000   1.005

Table 2: The Gelman-Rubin statistics on convergence for all chains of the regular model.

No reanalysis

              r                            F                            f
Chain   Mean           Var            Mean           Var            Mean    Var
0       6.85 × 10^−5   2.99 × 10^−9   2.01 × 10^−4   5.81 × 10^−8   0.567   0.087
1       5.94 × 10^−5   1.78 × 10^−9   2.55 × 10^−4   4.51 × 10^−8   0.516   0.091

Table 3: Mean and variance of the parameter values over all draws with the parsing model without reanalysis. The first 200 draws are considered the burn-in period and are discarded.

Figure 10: Distribution of the parameter values (r, F , f ) drawn using the parsing model without reanalysis – chain 0.

        r       F       f
R̂       1.000   1.000   1.017

Table 4: The Gelman-Rubin statistics on convergence for all chains of the model without reanalysis.


Only syntax

              r                            F                            f
Chain   Mean           Var            Mean           Var            Mean    Var
0       6.15 × 10^−4   1.46 × 10^−7   7.62 × 10^−4   1.55 × 10^−8   0.590   0.0040
1       7.73 × 10^−4   1.90 × 10^−7   4.10 × 10^−4   7.50 × 10^−8   0.694   0.0011

Table 5: Mean and variance of the parameter values over all draws with the parsing model that only considers syntax. The first 200 draws are considered the burn-in period and are discarded.

Figure 11: Distribution of the parameter values (r, F , f ) drawn using the parsing model that only considers syntax – chain 0.

        r       F       f
R̂       1.004   1.000   1.061

Table 6: The Gelman-Rubin statistics on convergence for all chains of the model that only uses syntactic information.


Reading Time Predictions – Overview

The retrieved means for the three parameters that resulted from the sampling trials were used to predict the RTs for sentences 50–100 with all three models. Subsequently, the RTs were also predicted with the default parameters (F = 0.1, f = 1, r = 0.05) so as to compare the results. The correlation between the predicted RTs and the data was calculated (Table 7), and the predictions using both parameter sets are plotted against the observed RTs (Figures 12, 13a, 13b). Since the default parameters produced several extremely high predicted values, these predictions have been omitted from the plot of the only-syntax model.
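Computing a Pearson coefficient and its two-sided p-value over the flattened per-word values can be done with scipy, sketched here on toy data (the use of scipy is our assumption; the thesis only reports the Pearson statistics):

from scipy.stats import pearsonr

predicted = [0.21, 0.35, 0.28, 0.40, 0.33]   # toy flattened per-word predictions
observed = [0.19, 0.30, 0.27, 0.45, 0.31]    # toy flattened observed RTs
r, p = pearsonr(predicted, observed)         # coefficient and two-sided p-value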

Standard Correlation Measure

              Regular                      No reanalysis                Only syntax
              Estimated       Default      Estimated       Default      Estimated   Default
Correlation   0.649           0.141        0.650           0.238        0.179       0.146
p-value       3.09 × 10^−38   0.0135       2.77 × 10^−38   2.49 × 10^−5 0.00179     0.0112

Table 7: Pearson correlation coefficient between the observed RTs and the predicted RTs, both with the default and the estimated parameters.

Regular model

Figure 12: RT predictions for sentences 50-100 using the estimated parameters (blue) and the default parameters (green), together with the observed RTs (red) – chain 0.


No Reanalysis and Only Syntax

(a) No reanalysis model

(b) Only syntax model

Figure 13: RT predictions using the estimated parameters (blue) and using the default parameters (green, omitted for the only-syntax model), shown together with the observed RTs (red/brown) – chain 0.


3.3.2 Dundee corpus

Sampling

Since only one sampling trial was done with the Dundee corpus data, the distributions of both chains will be shown to give a more complete overview. The trial consisted of 800 draws, of which the first 200 are considered the burn-in period. The mean and variance of the estimated parameters are given in Table 8, and two plots illustrate the sampling distributions (Figures 14, 15). The Gelman-Rubin statistics (R̂) were calculated for all three parameters and can be seen in Table 9.

              r                             F                            f
Chain   Mean           Var             Mean           Var            Mean    Var
0       2.76 × 10^−5   1.48 × 10^−10   5.88 × 10^−5   2.64 × 10^−9   0.456   0.0753
1       2.66 × 10^−5   1.48 × 10^−10   6.63 × 10^−5   2.70 × 10^−9   0.485   0.0665

Table 8: Mean and variance of the parameter values over all draws with the regular model. The first 200 samples are considered the burn-in period.

Figure 14: Distribution of the parameter values (r, F , f ) drawn using the regular parsing model – chain 0.


Figure 15: Distribution of the parameter values (r, F , f ) drawn using the regular parsing model – chain 1.

        r       F       f
R̂       1.001   1.000   1.000

Table 9: The Gelman-Rubin statistics on convergence for both chains.

Reading Time Predictions

The retrieved means for the three parameters that resulted from the sampling trial were used to predict the RTs for the 50 suitable sentences in the sentence range [183, 314]. This was done twice with the regular model: once using the resulting parameters from chain 0 and once with those from chain 1. Finally, the RTs were also predicted with the default parameters (F = 0.1, f = 1, r = 0.05) so as to compare the results. The Pearson correlation between the predictions and the data was calculated (Table 10), and the predictions using both parameter sets are plotted against the observed RTs (Figures 16a, 16b).

Standard correlation measure

              Chain 0         Chain 1         Default
Correlation   0.745           0.745           0.214
p-value       3.13 × 10^−80   3.13 × 10^−80   4.84 × 10^−6

Table 10: Pearson correlation coefficient between the observed RTs and the predicted RTs, using the estimated parameters from chain 0 and chain 1, and the default parameters.


(a) Chain 0

(b) Chain 1

Figure 16: RT predictions made by the model for the 50 sentences in the range [183, 314] using the estimated parameters (blue) and the default parameters (green), shown together with the observed RTs (red).
