
Exploiting Preposition Contexts for Computer-Assisted Language Learning

Academic year: 2021


Exploiting Preposition Contexts for

Computer-Assisted Language Learning

Martijn Boekema

m.boekema-1@student.rug.nl

Master Thesis

Primary Supervisor: Malvina Nissim, Ph.D.

Secondary Supervisor: Drs. Jeroen van Engen

Department of Information Sciences

University of Groningen


Abstract

Preposition learning can be a challenging task for second language learners. Yet, little research has been done on partitioning prepositions for the purpose of enriching Computer-Assisted Language Learning (CALL). Exploiting the linguistic domains of prepositions can help learners by providing a level-appropriate and personalized learning process. Twenty-seven language learners of Dutch and 15 native speakers participated in a learning experiment built around an online platform on which learners could practice prepositions through multiple-choice fill-in-the-blank exercises. Participants were assessed on five domains, some of which have been suggested in previous studies on prepositions: preposition frequency, semantic context, syntactic context, surrounding lexical complexity, and sentence length. High correlations were measured for the first three domains.


Preface

In front of you lie the fruits of my labor, the benefits of my hard work, and the acknowledgment that anyone, with the right mindset, can enjoy an academic study. I am humbled to have had the opportunity to study at the University of Groningen and the Department of Information Sciences. After years of studying at lower academic levels, I have developed a strong intrinsic appreciation for education and its role in our lifetime journey of self-discovery.

I would not have been able to come so far without the help of many friends, family members, colleagues and teachers, which is why I would like to take the time to show them some appreciation. First, I thank my friends and family for taking the time to assist me with my thesis and the annotations, and for hearing me out whenever I felt discouraged. Secondly, I would like to thank one person in particular: my dearest Sensei, who has taught me about discipline, perseverance and respect. I am sure that I would never have made it so far without you. Thank you for being there when I needed it. Lastly, I would like to thank my supervisor, Malvina. The many times I was stuck, confused or just unsure, your advice always gave me the clarification I needed. I hope you have enjoyed our collaboration as much as I have.

After many years of hard work I finally find myself looking at the first of hopefully many meaningful contributions to the field of Information Science and (Language) Education. With that, I do not just refer to intelligent tools to assist learning, but also to our collective aspiration to improve our understanding of the most important technological principle known to man: language.


“Don’t you see that the whole aim of Newspeak is to narrow the range of thought? In the end we shall make thoughtcrime literally impossible, because there will be no words in which to express it. Every concept that can ever be needed will be expressed by exactly one word, with its meaning rigidly defined and all its subsidiary meanings rubbed out and forgotten.”


Contents

1 Introduction
  1.1 Preposition learning
  1.2 Research question
  1.3 Contributions
2 Related work
  2.1 Intelligent CALL applications
    2.1.1 Current standards on ICALL applications
    2.1.2 Challenges in ICALL applications
    2.1.3 ICALL system architecture
  2.2 The semantics of prepositions
    2.2.1 Semantic roles
    2.2.2 The polysemy of prepositions
  2.3 The computational modeling of preposition knowledge
    2.3.1 The computational modeling of semantic roles
    2.3.2 Preposition error detection and correction
  2.4 Theoretical contributions
3 Method
  3.1 Experiment design
  3.2 Data
    3.2.1 Corpus
    3.2.2 Annotations
  3.3 Participants
4 Preposition model
  4.1 Overall design
  4.2 Preposition frequency
  4.3 Semantic context
    4.3.1 Semantic roles
  4.4 Syntactic context
  4.5 Surrounding lexical complexity
  4.6 Sentence length
5 System architecture
  5.1 General design
  5.2 Dataset preparation
  5.3 Database configuration
  5.4 Automated extraction
    5.4.1 Syntactic context extraction
    5.4.2 Surrounding lexical complexity extraction
  5.5 Manual annotations
    5.5.1 Semantic role annotations
    5.5.2 Preposition annotations
    5.5.3 Inter-rater agreement
  5.6 Interface design
6 Results
  6.1 Collected data
  6.2 Preposition complexity
  6.3 Feature scores
  6.4 Preposition frequency
  6.5 Distribution of the semantic scores
  6.6 Polysemy scores
  6.7 Distribution of the syntactic classes
7 Discussion
  7.1 Partitioning prepositions
  7.2 The computational modeling of prepositions
  7.3 The benefit of preposition models for CALL
8 Conclusion
  8.1 Summary


Chapter 1

Introduction

One of the hallmarks of an academic study is the training of language proficiency. Most studies require students to learn a second language (L2) as an integral part of their academic curriculum, and a growing number of people decide to study (or work) abroad. As Second Language Acquisition (SLA) and the use of online technology have become ubiquitous, the quality of Computer-Assisted Language Learning (CALL) is becoming increasingly important as well.

One of the more difficult challenges in acquiring an L2 is the use of prepositions [Dale et al., 2012]. Preposition learning can be quite a time-consuming and labor-intensive endeavor, because prepositions cannot simply be acquired by learning a set of linguistic rules. It requires language learners to familiarize themselves with the syntax of a language and the semantic roles that prepositions can have.

As an addition to traditional language courses, CALL can help students practice with, for instance, prepositions. Computer-assisted learning tools have the benefit of being quicker, more consistent, and more accessible than human tutors. However, most CALL applications do not have the ability to understand the grammatical proficiency of individual learners.


1.1 Preposition learning

Prepositions are used to describe a relation between two or more words in a sentence. Some uses of prepositions are learned relatively fast because of their frequent use in everyday social interaction, and because the semantic relation is rather straightforward. Examples of these common uses of prepositions in Dutch are:

(1) Dit is de trein naar Amsterdam; (This is the train to Amsterdam)
(2) Ik zit in de trein; (I am sitting in the train)

(3) Deze trein stopt op halte Amsterdam. (This train stops on platform Amsterdam)

The role performed by the prepositions in examples (1) to (3) can be classified with relative ease: it is of the spatial kind. There are, however, many cases where even native speakers of Dutch find it difficult to select the appropriate preposition for a sentence:

(4) * Hij is slimmer als ik; (He is smarter than me)
(5) Hij is slimmer dan ik; (He is smarter than me)

(6) * Hij komt 15 minuten na zes; (He will arrive 15 minutes after six)
(7) Hij komt 15 minuten over zes. (He will arrive 15 minutes over six)

Dutch language users often find it cumbersome to remember the exact meanings of the words als/dan: the first is used to communicate equative utterances (e.g. as smart as) and the other comparatives (e.g. smarter than). Native English speakers studying Dutch might find learning the als/dan differentiation even more difficult because the English language does not differentiate between equative and comparative prepositions, as exemplified in sentences (4) and (5). The als/dan differentiation might also be found more complex because in most cases als and dan function as particles (i.e. a specific type of preposition), for which the relation often seems more arbitrary than for regular prepositions [O’Dowd, 1998].


While some prepositions can occupy a temporal function, others can never have that same role. The Dutch preposition bij (i.e. with) can never be used to tell the time:

(8) * Ik ben er bij 10 minuten; (I will be there with 10 minutes)
(9) Ik ben er in 10 minuten. (I will be there in 10 minutes)

It takes a considerable amount of effort for learners to get used to the way in which the Dutch (or any other) language makes use of prepositions. The reason for this is that there are so many ways in which a single preposition can be used. The extent to which a preposition can be used (i.e. its polysemic nature) might even play a very important role in its learning complexity. Another reason why prepositions are difficult to learn is that they are almost never directly transferable to a different language, as can be seen in sentences (4) and (5).

Each language has its own rules for how prepositions should or should not be used, and some languages do not even have prepositions (e.g. Japanese only uses postpositions). As a consequence, learning a new language requires learners to familiarize themselves with that language’s specific way of using semantics and (morpho)syntax.

To learn prepositions effectively a learner needs to experiment with the language, but for effective language teaching it is equally important to provide proper feedback. Communicating right and wrong does not always help learners to understand which semantic roles prepositions can adopt. Most CALL applications can measure which prepositions a learner struggles with (e.g. by calculating scores for each assessed preposition), but they do not measure the extent to which learners know how to apply temporal, spatial or instrumental prepositions accurately. They also do not measure whether a learner understands common sentence constructions. Instead of letting learners exercise with random prepositions, it would be more effective to measure which domains of prepositions a learner understands, and to present exercises and feedback according to a personalized competency profile.
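For illustration, such a per-domain competency profile could be computed along the following lines. The domain labels follow the roles discussed above; the response data and function are invented for this sketch, not taken from the actual system:

```python
from collections import defaultdict

# Hypothetical exercise log as (domain, answered_correctly) pairs;
# the data below is invented for illustration.
responses = [
    ("temporal", True), ("temporal", False), ("temporal", True),
    ("spatial", True), ("spatial", True),
    ("instrumental", False), ("instrumental", False),
]

def competency_profile(responses):
    """Per-domain accuracy instead of a single overall score."""
    totals, correct = defaultdict(int), defaultdict(int)
    for domain, ok in responses:
        totals[domain] += 1
        correct[domain] += ok
    return {d: correct[d] / totals[d] for d in totals}

profile = competency_profile(responses)
```

A tutoring system could then prioritize exercises from the weakest domains rather than sampling prepositions at random.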

1.2 Research question


The central question of this thesis is: which linguistic domains of prepositions affect preposition learning? This question can be further articulated into the following sub-questions:

• Can prepositions be partitioned into linguistic domains?

• Is it possible to (automatically) exploit these domains to assess preposition usage?

• Do these domains benefit preposition learning?

After constructing an empirically based model of preposition learning, I will attempt to (automatically) extract these linguistic domains from a corpus containing Dutch sentences, for the purpose of assessment. Evaluating these domains will be done through an experiment involving NT2 (Dutch as a second language) learners and a linguistically aware CALL application, specifically designed to assess this preposition learning model.

1.3 Contributions

The main focus of the CALL application is the development of a conceptual preposition learning model. This model will include several linguistic domains that influence whether or not a learner understands the preposition that has to be applied in a certain context. Such domains could be the semantics of a sentence (i.e. does the learner understand the meaning of the sentence?), or its syntax (i.e. which types of sentence constructions is the learner familiar with?).

The accuracy of this conceptual model will be tested through an online preposition learning interface. During a six-week experiment, NT2 learners will exercise with a large number and variety of preposition contexts. The system will automatically feed learners sentences that have certain (linguistic) contexts for which the system requires more data. Acquiring enough data for each studied linguistic domain will make it possible to measure the extent to which each of these prepositional domains contributes to preposition learning.


Chapter 2

Related work

In this chapter I will lay the theoretical groundwork for developing a linguistically aware learning application enriched with a preposition classification scheme. The first section of this chapter discusses current standards and limitations of automatic tutoring systems. The remainder will elaborate on the preposition-modeling task, targeted from two perspectives. The first is a linguistic perspective on the cognitive processing of prepositions. Knowing how language learners develop an understanding of prepositions will help to distinguish the features relevant for machine classification. The second perspective addresses preposition modeling as a Natural Language Processing (NLP) task. Several studies employ machine learning to statistically model prepositions and preposition semantics, through syntactic constructions and a preposition’s lexical context. While this study does not utilize any machine learning, understanding the techniques used to develop computational models of prepositions could help to distinguish relevant linguistic domains.

2.1 Intelligent CALL applications


With the increasing development of open-source applications, APIs, modules and plugins, automated tutoring systems are slowly evolving towards more Intelligent Computer-Assisted Language Learning (ICALL) [Volodina et al., 2012; Amaral and Meurers, 2011].

There are two strategies that linguistically aware tutoring systems utilize to support a learning process [Meurers, 2015]. A system can assess a learner’s language use to provide individual feedback, and it can learn the properties of a language to become more conscious of a learner’s competency. Providing feedback does not always mean that a system has an understanding of a language: storing all of the correct answers for a set of questions or problems allows for perfect and immediate feedback, but it does not mean that the system knows anything about the language. When a system does have language intelligence, it will also be capable of determining a learner’s strengths and weaknesses, and of establishing a personalized learning process accordingly.

2.1.1 Current standards on ICALL applications

Amaral and Meurers [2011] propose several heuristics as prerequisites for successful ICALL integration in Foreign Language Teaching and Learning (FLTL). They focus on the aspects of NLP reliability and the pedagogical considerations in the design of ICALL systems for FLTL practice. Considering these two factors, they perform a case study on three systems that focus on the broad task of language teaching:

• Robo-sensei – A stand-alone ICALL system for learning and practicing grammar principles for the Japanese language [Noriko, 2009];

• Spanish for business professionals – A language learning tool for novice L2 Spanish learners [Hagen, 1999];


Of course, an activity that has proven to be successful for assessment purposes is not necessarily an effective learning tool. MPC fill-in-the-blank exercises are rather machine-friendly, and widely used, because of their high input constraint. Giving students a couple of predefined options to choose from makes the data quite easy to process, unlike creative writing tasks, where there are numerous ways in which students can express themselves.

Amaral and Meurers [2011] also mention the manner in which the studied CALL systems utilize native-language data for learning purposes. L1 refers to the use of a learner’s native language (or often just English) as the interface’s utility language. In the case of Robo-sensei, it is essential that the application makes consistent use of the English language (and the Roman alphabet), because learners have to learn to read the kanji/kana writing systems while they also have to learn the meaning of the Japanese words. So the language used in the application, including its feedback, is always displayed in English. This is the case for most CALL applications. Spanish for Business Professionals (SBP) and E-tutor even accompany learning exercises with an English translation, making the activity feel like a translation exercise. The E-tutor system also provides simple feedback in German, but the rest of the interface is in English.

SBP does not adopt a learner model. The application essentially starts and ends in the same way for every learner. It does make use of a learning program and exercises with alternating/increasing difficulty, but no personalized activity or feedback is employed. E-tutor is somewhat more progressive in its use of NLP. The system measures and communicates the learner’s performance using several domains of language learning. E-tutor employs learner uptake as well, meaning that learners can respond to corrective feedback by improving upon their mistakes.

2.1.2 Challenges in ICALL applications

Building on the limitations of these case studies, Amaral and Meurers [2011] discuss four challenges for ICALL system design, based on the previously mentioned heuristics:

• Constraining learner input;

• Activity specification and instructions;

• Use of L1 activities, instructions, and feedback;

• Feedback on linguistic, learner and activity information.


Constraining learner input

Constraining learner input is important for NLP tasks because proper input allows the system to process data effectively and efficiently. It is also essential for calculating feedback with high precision, which is an essential asset to the learning process [Tschichold, 1999]. Input can be constrained via simple multiple-choice (MPC) features. But despite giving accurate feedback, MPC exercises are not always to the benefit of language learning. For example, giving feedback on creative writing can be a helpful and stimulating activity for more advanced learners. MPC does have the benefit of making it easy to test learner competency on several linguistic domains, and it allows computers to process learner data more efficiently and effectively.

Activity specification and instructions

Providing proper activity specification and instructions is a logical yet important challenge that accompanies activity design. Complex activities generally require more explanation in order to make sure that the test accurately reflects the outcomes of the task that a learner has to carry out. A system, for instance, has to differentiate between system instructions and exercise instructions in a logical manner. At the same time, activity design should be implemented in such a way that it minimizes the need for additional explanation as much as possible. Of course, constraining user input has a major influence on the activity design and makes it easier for learners to interpret the exercise.

Use of L1 in activities, instruction, and feedback


Use of feedback on linguistic, learner & activity information


2.1.3 ICALL system architecture

Amaral and Meurers [2011] present the design of a new ICALL application for Portuguese language learning called TAGARELA (Teaching Aid for Grammatical Awareness, Recognition, and Enhancement of Linguistic Abilities). The previously mentioned challenges have been carefully taken into account in the development of TAGARELA, with an extensive learner model that takes in all of a learner’s input, analyzes it linguistically and strategically, and subsequently returns feedback. Figure 2.1 shows their system architecture, which will be used as an example to elaborate on the role of linguistic analysis in ICALL applications.

Figure 2.1: TAGARELA system architecture


The Expert module, in charge of the linguistic and strategic analysis, is what actually drives the intelligence of TAGARELA. The linguistic analysis contains an analysis of form and content. The form section checks the submitted words by spell-checking, tokenizing and disambiguating them (i.e. it makes sure that the submitted word is a grammatically correct Portuguese word). The content analysis then evaluates the appropriateness of the answer by checking whether the submitted word (or its stem) matches the required answer. An interesting addition to the model is the use of a strategic analysis that assesses whether or not the selected task and exercise are appropriate for the learner, and how much knowledge transfer is taking place.
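As an illustration of this form/content split, a heavily simplified sketch could look as follows. The toy lexicon, stemmer and example sentence are invented; TAGARELA’s actual modules are far more sophisticated:

```python
# Toy word list standing in for a real Portuguese lexicon (invented).
LEXICON = {"eu", "estou", "no", "hotel"}

def stem(word):
    """Toy stemmer: crude prefix truncation, for illustration only."""
    return word[:5]

def form_check(tokens):
    """Form analysis: return the tokens not found in the lexicon."""
    return [t for t in tokens if t.lower() not in LEXICON]

def content_check(answer, required):
    """Content analysis: does the (stemmed) answer match the target?"""
    return stem(answer.lower()) == stem(required.lower())

tokens = "Eu estou no hotel".split()
unknown = form_check(tokens)          # form analysis: all tokens known
matched = content_check("hotel", "hotel")  # content analysis: stem match
```

The real system layers spell-checking and disambiguation before this comparison, but the two-stage structure, form first, content second, is the same.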

2.2 The semantics of prepositions

As briefly introduced in Section 1.1, prepositions can be rather complex because their meanings (and their suitability for a given phrase) have to be derived from their context. Understanding a sentence’s semantics is one of the important factors in the cognitive process of determining preposition appropriateness. Examples of frequently used semantic roles are time (10) and location (11):

(10) Ik ben er in 5 minuten; (I will be there in 5 minutes)
(11) Ik verblijf in een hotel. (I am staying in a hotel)


Unfortunately, the “periodic table” of semantic roles (SRs) has not been agreed upon, so many different models of semantic families exist [Saint-Dizier, 2006; Luraghi, 2003; Lauer, 1995; Zelinsky-Wibbelt, 1993; Finin, 1980], and a lot of attention has been given to automated semantic role labeling [Girju, 2009]. But considering the unlimited number of possible relations, it is very unlikely that a complete (and machine-friendly) model will be available any time soon. Several simplistic models containing frequently occurring roles have been developed for more general purposes. Zelinsky-Wibbelt [1993], for example, adopts prepositions of time, state, area, instrument, circumstance, and cause. In the next section, I will shortly describe the meaning of each role, with the addition of including the spatial roles location and direction separately, as proposed by Luraghi [2003] and many others. Phrases containing spatial roles tend to occur quite frequently and are, for the purpose of this study, valuable types to include.

2.2.1 Semantic roles

Prepositions of location, sometimes called prepositions of place, refer to the physical location of a sentence’s subject in relation to the sentence’s object. Like the directional role, this role is classified under the spatial domain.

(12) Jelmer is op de universiteit (Jelmer is at the university);
(13) Ik zit in een stoel (I am sitting in a chair).

Dissimilar to prepositions of location, directional prepositions denote the direction in which a sentence’s subject (often an entity) moves towards or away from the sentence’s object (an entity, situation or concept).

(14) Ik ga naar de bioscoop (I am going to the cinema);

(15) Ruby zwemt door het water (Ruby is swimming through the water).

Temporal prepositions are used in sentences that describe time periods, such as dates on a calendar, days of the week, actual times (16), and temporal distances (17).

(16) Lisanne is thuis om 8 uur (Lisanne is home at 8 o’clock);
(17) Ik ben er in een uur. (I will be there in an hour)

Prepositions of state describe the state or condition that the sentence’s subject is in:

(18) Gemma is in prima conditie vandaag (Gemma is in great shape today);

(19) Hij werd bekroond als de snelste man ter wereld (He was crowned as the fastest man in the world).

The area that this preposition type refers to can denote two (or more) entities, situations or concepts that have a specific type of connection/relation (e.g. a person having or exercising some kind of occupation). An area is often an area of profession.

(20) Hij wordt bekroond tot koning (He is crowned to be king);

(21) Als politieagent ben je een ambassadeur van de wet (As a police officer, you are an ambassador of the law).

Prepositions of instrument describe the use of a physical or a conceptual object by the sentence’s subject.

(22) Floris reisde met zijn fiets (Floris was travelling by bike);

(23) Ik was alleen met mijn gedachten (I was alone with my thoughts).

Prepositions of circumstance/manner are used to specify the manner in which actions are performed. Example (24) specifies a bag as part of John’s belongings, which are being carried. Carried is the action performed in this statement, like eat is the action exercised in example (25).

(24) De groep droeg Johns spullen, inclusief zijn tas (The group was carrying John’s belongings, including his bag);

(25) Thijs mag geen eten met kaas erin. (Thijs should not eat food with cheese in it).

Causal prepositions denote a cause/consequence relation. Describing the cause of an event can be done in numerous ways, which explains why so many prepositions can adopt a causal role. Although due to and because of occur only as causal prepositions, in or for are also quite possible.

(26) Rochel is niet meer wezen fietsen vanwege zijn rugklachten (Rochel has not been cycling because of his back problems);

(27) Aangezien hij moe was, ging hij naar bed (Since he was tired, he went to bed);


(29) Ik zal niet opgeven aangezien ik al zo ver ben gekomen (I will not quit as I have already come so far).

Reading through this list, it should become apparent that an infinite number of semantic roles can be distinguished and that the roles will start to overlap each other sooner or later. So a more comprehensive model might lead to more ambiguity and vagueness as well.

2.2.2 The polysemy of prepositions

Zelinsky-Wibbelt [1993] studies the polysemic nature of prepositions, for which she has used the earlier mentioned classification of SRs. Polysemic words, by definition, can adopt different senses/meanings depending on their context. It is also true that the meaning of some prepositions extends further than that of others. The preposition in can adopt any of the relations presented in the classification of Zelinsky-Wibbelt [1993], as discussed in the previous section. The word in can function as a preposition of location (13), of time (17), of state (18), et cetera. But the preposition als (if) could never be used to describe spatial (30)(31) or temporal (32) events:

(30) Ik zwem in een zwembad (I am swimming in a pool);
(31) * Ik zwem als een zwembad (I am swimming if a pool);
(32) * Tom komt als 10 uur (Tom will arrive if ten o’clock).
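One way to operationalize this notion of polysemy is to count the distinct semantic roles a preposition has been annotated with. A minimal sketch, using invented (preposition, role) observations rather than real annotation data:

```python
# Invented annotated observations of (preposition, semantic role) pairs.
observations = [
    ("in", "location"), ("in", "time"), ("in", "state"), ("in", "cause"),
    ("als", "area"),
]

def polysemy_scores(observations):
    """Count the distinct semantic roles each preposition occurs with."""
    roles = {}
    for prep, role in observations:
        roles.setdefault(prep, set()).add(role)
    return {prep: len(role_set) for prep, role_set in roles.items()}

scores = polysemy_scores(observations)
```

Under this measure, a highly polysemic preposition such as in receives a high score, while a role-restricted word such as als receives a low one.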


2.3 The computational modeling of preposition knowledge

A number of papers exist on machine learning and the data that is required for effective and efficient processing. Computational approaches to modeling prepositions can help to understand how preposition knowledge can be translated to a computational model. They also illustrate the methodology adopted in similar machine-classification research, which has been beneficial for the structuring of this research project. The outcomes of this study might, in turn, be beneficial for new machine learning approaches to preposition modeling.

2.3.1 The computational modeling of semantic roles

Girju [2009] has developed a novel procedure for the automatic annotation of the SRs that prepositions can encode. She first presents a list of 22 SRs that cover a large majority of frequently occurring roles. The model distinguishes, for instance, prepositions of cause, time, location, and instrument, as described in the previous section. To develop the list of SRs, Girju [2009] performs a cross-lingual study on a corpus containing noun-noun pairs and nominal phrases. The study utilizes two corpora: Europarl and CLUVI. From the Europarl corpus, lists of bilingually aligned sentences were matched using their English translation, resulting in a matching list of sentences in Spanish, Italian, French, Portuguese and, of course, English. The English corpus was then parsed syntactically with Part of Speech (POS) tags using Charniak’s parser [Charniak, 2000]. Using these tags, patterns can be analyzed to find out whether there is a correlation between a sentence’s semantics and its syntactic properties (i.e. can we estimate a sentence’s meaning based on its syntax?). Sentences were finally tagged manually by a group of annotators, showing a fair to good inter-annotator agreement (0.56-0.8). A distribution of semantic roles for both corpora is presented in Table 2.1. The displayed percentages correspond to Girju’s unique number of instances per SR.

          Part-Whole  Location  Property  Agent  Purpose  Topic  Theme  Other
Europarl  2.4%        2.1%      6%        7.1%   7.2%     11%    19.2%  8.13%
CLUVI     34.4%       8%        2.8%      5.8%   4.5%     0.8%   4%     8.8%

Table 2.1: Distribution of semantic roles in the Europarl and CLUVI corpora
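The reported agreement range (0.56-0.8) presumably refers to a chance-corrected coefficient such as Cohen’s kappa, which for two annotators can be computed as follows. The annotation labels below are invented for illustration:

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two annotators over the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Expected agreement under independent labeling by each annotator.
    expected = sum(ca[label] * cb[label] for label in ca) / (n * n)
    return (observed - expected) / (1 - expected)

ann1 = ["time", "location", "time", "cause", "time", "location"]
ann2 = ["time", "location", "cause", "cause", "time", "time"]
kappa = cohen_kappa(ann1, ann2)  # 11/23, roughly 0.48 for this toy data
```

Kappa discounts the agreement two annotators would reach by chance, which is why it is preferred over raw percent agreement for annotation studies like this one.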


One of the intentions of the study is to discover how frequently certain roles occur in written text. Both corpora show high variation in their distributions: a high number of Part-Whole prepositions were found in the CLUVI corpus, while the Europarl corpus shows a high number of Theme prepositions. Based on the syntactic constructions and the SR distribution, Girju [2009] attempts to automate the process of semantic parsing. The procedure for this task leverages several features:

• Features 1 and 2 - Semantic class of noun specifies the WordNet sense of the head noun (F1), and the modifier noun (F2);

• Features 3 and 4 - WordNet derivationally related form specifies if the head noun (F3), and the modifier noun (F4) are related to a corresponding verb in WordNet;

• Feature 5 - Prepositional cues link the two nouns in a nominal phrase. These can be either simple or complex prepositions such as of or according to;

• Features 6 and 7 - Type of nominalized noun indicates the specific class of nouns the head (F6) or modifier (F7) belongs to depending on the verb from which it derives.
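For illustration, features F1-F5 could be encoded roughly as below. The WordNet lookups (F1-F4) are stubbed with invented values, since the actual procedure queries WordNet senses and derivationally related forms; F6 and F7 are omitted from this sketch:

```python
# A rough, invented encoding of Girju-style features for one nominal phrase.
def extract_features(head, modifier, preposition):
    return {
        "F1_head_sense": head + "#n#1",        # stub: WordNet sense of head
        "F2_mod_sense": modifier + "#n#1",     # stub: WordNet sense of modifier
        "F3_head_deverbal": head.endswith("tion"),    # stub: derivational link
        "F4_mod_deverbal": modifier.endswith("tion"),
        "F5_preposition": preposition,         # the prepositional cue
    }

features = extract_features("destruction", "city", "of")
```

A classifier would consume many such feature dictionaries, paired with manually annotated semantic roles, to learn the mapping from context to role.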


2.3.2 Preposition error detection and correction

Recently, more attention has been given to the automated detection and correction of prepositional errors. Computationally detecting prepositional errors poses a challenge for machines for the same reason prepositions are hard for humans to learn: there are no simple rules that determine whether or not a preposition selection is correct. Recent systems detect prepositional errors with a precision range of 50% to 80%, but with a recall as low as 10% to 20% [Tetreault et al., 2010], which generally means that a high proportion of errors is falsely regarded as correct. Kloppenburg [2015] proposes a supervised approach to the tasks of detecting and correcting preposition misuse. Although similar approaches have been adopted earlier by De Felice and Pulman [2009] and Chodorow et al. [2007], it is the first time anyone has researched preposition error detection and correction for the Dutch language. Kloppenburg [2015] adopts three types of error cases in his work: insertion errors, deletion errors and substitution errors. In this summary I will explain his approach to detecting prepositional substitution errors (i.e. cases in which a preposition should be present but where the preposition has been selected erroneously). This type of prepositional misuse most strongly resonates with the task used for this study.

Multiple approaches employ syntactic parsers to automatically detect the likelihood of a prepositional error [Kloppenburg, 2015; De Felice and Pulman, 2009]. The strategy involved in the detection of prepositional misuse through syntactic parsing is the use of n-gram based features (in a two- or three-word window) for the POS tags surrounding a preposition. For example, a preposition (e.g. in) could be followed by a determiner (e.g. the) and a noun (e.g. train). The combination of these two lexical categories represents a bigram class (e.g. det noun) for contexts to the right of a preposition. If this determiner-noun context follows the preposition more often, then it could be the case that it is grammatically correct for other sentences sharing the same (or similar) syntactic constructions. The frequencies with which syntactic contexts occur (preceding or following a given preposition) determine the weights (i.e. the ranking) that will be given to a list of prepositions.
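The right-context bigram counting described here can be sketched as follows. The tiny POS-tagged corpus is invented, and the tag set is a simplified stand-in for the tags a real parser would produce:

```python
from collections import Counter, defaultdict

# Invented POS-tagged sentences as (word, tag) pairs.
tagged = [
    [("ik", "pron"), ("zit", "verb"), ("in", "prep"), ("de", "det"), ("trein", "noun")],
    [("hij", "pron"), ("woont", "verb"), ("in", "prep"), ("een", "det"), ("stad", "noun")],
    [("ze", "pron"), ("wacht", "verb"), ("op", "prep"), ("de", "det"), ("bus", "noun")],
]

# For each preposition, count the POS bigram immediately to its right.
contexts = defaultdict(Counter)
for sentence in tagged:
    for i, (word, pos) in enumerate(sentence):
        if pos == "prep" and i + 2 < len(sentence):
            bigram = (sentence[i + 1][1], sentence[i + 2][1])
            contexts[word][bigram] += 1
```

These per-preposition context counts are what would be turned into weights for ranking candidate prepositions in a substitution-error detector.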


Dutch   English translation   Frequency
Van     Of                    22.8%
In      In                    16.3%
Op      On                     7.2%
Te      At                     7.1%
Voor    For/Before             6.3%
Met     With                   5.7%
Aan     On, To                 4.3%
Door    Through, By            3.3%
Bij     At, With               2.8%
Uit     Out                    2.6%
Om      By, Around             2.6%
Over    Over, About            2.4%
Tot     Until, To              2.2%
Naar    To                     2.1%
Als     If, As                 1.2%
Others                        11.1%

Table 2.2: Frequency distribution of Dutch prepositions in LASSY Large

2.4 Theoretical contributions

The related work discussed in this chapter has provided several useful empirical observations, some of which I will briefly summarize here.

It seems that, while the classification of a preposition's semantics is a difficult and ambiguous task, multiple useful models exist that can be used to assess learners on their understanding of a sentence's meaning(s). Some of these roles cover more meanings than others, which might influence their complexity.

Several criteria exist for the proper implementation of CALL systems. The clarity of instructions and the usability of the activities play an important role in their design. However, to effectively exploit intelligent language models, it is also necessary to constrain the input to such an extent that the data can be easily processed, which in turn allows for personalized learning activities.


Chapter 3

Method

In this chapter I will explain the setup of the experiment and the corpus used to generate Dutch sentences and extract preposition contexts for experimentation.

3.1 Experiment design

During a six week experiment, L2 learners of Dutch will exercise their preposition skills using an online web application.

The experiment primarily consists of fill-in-the-blank Multiple Choice (MPC) exercises with random Dutch sentences, and 15 possible prepositions to choose from. A fill-in-the-blank MPC strategy has been used because of its highly constrained input, making it easier to process and evaluate the data. Consequently, it simplifies the activity design, as discussed in Section 2.1.3.

Participants will practice with Dutch sentences that have been selected from the preposition model containing 5 contextual features that might influence preposition complexity. Each feature contains a large number of sentences, to allow for sufficient random selection during the experiment.

The analysis will only include classes with a minimum of eight answers, to ensure reliable assessment. For most features, a lot more data will be generated through random selection, seeing as there are always multiple features present in a single sentence. For example, a sentence could have the semantic meaning place, the preposition in, a high surrounding lexical complexity, a sentence length of 50 characters, et cetera. The system will measure and require data for all of these metrics every time a participant solves a prepositional problem.


sentence containing a class for which it has not yet acquired enough data. If a sentence containing the preposition in has only been answered once, while others have been answered more than once, it will generate a sentence with the in class. If all classes are represented equally, it will select a random class until every class, for a particular feature, has been populated sufficiently. It will then proceed to the next phase, which is the next feature (prepositional context) in line.

Because many classes will be represented in a single sentence, a lot of data will be generated for a single answer. Most classes will always be present in a sentence (e.g. a sentence always has a sentence length or a semantics that it communicates), with the exception of the syntactic context feature. Only the ten most frequent syntactic contexts will be used, which means that some sentences will contain contexts that the system will not store for analysis. Other than that, each answer will contain a relevant class for each feature. The system will collect more data for some classes because they tend to appear more often naturally. For instance, when collecting data for the in preposition class, it is more likely that in has a semantic role that occurs more often naturally. It will, of course, still store this data so that it can skip the collection of data for many classes later in the experiment. Even so, the experiment still requires a learner to answer approximately 300 sentences. Without the addition of a multiple-classes collection function, learners would have to generate data for 608 sentences to allow for reliable analysis. The 300 already stretches the extent to which most participants are willing to cooperate, as it takes an average learner around four hours when fully concentrated.
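The selection strategy described above — serve sentences for whichever class is furthest from its minimum — can be sketched as follows. The function name, the target of eight answers, and the example counts are illustrative; the real system tracks counts for all 76 classes in its database.

```python
# Illustrative sketch of the selection strategy described above: pick the
# class with the fewest collected answers, then serve a sentence containing
# that class. Names and example data are hypothetical.
TARGET = 8  # minimum answers required per class

answer_counts = {"in": 9, "op": 3, "naar": 8}  # answers collected so far

def next_class(counts, target=TARGET):
    """Return the class most in need of data, or None when all are covered."""
    needy = [c for c in counts if counts[c] < target]
    if not needy:
        return None  # feature fully populated; move on to the next feature
    return min(needy, key=lambda c: counts[c])

print(next_class(answer_counts))  # only "op" is still below the target
```

Because one sentence carries classes for several features at once, a single answer updates many such counters simultaneously, which is what reduces the required number of sentences from 608 to roughly 300.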

After answering a sentence, participants will receive immediate feedback. This design choice has been made so that learners can always focus on a single sentence at a time, without worrying too much about previous results. A different and popular alternative is to show scores after answering a set of sentences (e.g. 10, 25 or 50). Showing scores after solving a set of problems would be a good strategy to stimulate participants to do the experiment in separate sessions. However, showing immediate feedback has proven to be more useful for the learning process, because learners can more easily reflect on their answer after receiving immediate feedback. Human learners generally do not have the capacity to remember their solutions to large collections of problems that have to be solved. For future work, and to optimize the learning process, it would still be a good idea to evaluate participant results after submitting a set of problems, in addition to providing immediate feedback.


this research project. Of course, not all participants will understand (or invest enough time to learn to understand) what syntax or surrounding lexical complexity means. That is why the system will first focus on giving feedback on prepositions, and additionally focus on other (possibly) relevant contexts such as semantics.

The experiment is not designed to measure the learning effects of a linguistically aware CALL system, as it is currently used to assess a preposition model. It also behaves as a personal assessment tool, rather than a learning interface. Nevertheless, where possible it is still important to take pedagogical and motivational considerations into account. If and when the preposition model (or some part of it) has been validated, the platform can be further developed into an actual language learning tool.

3.2 Data

The data needed for the experiment consists of a corpus containing syntactically annotated sentences, and the addition of annotations for each assessed linguistic feature.

3.2.1 Corpus

Lassy (Large Scale Syntactic Annotation of written Dutch) is a large corpus containing 1 million syntactically annotated words, most of which have been corrected manually. An additional larger corpus called LASSY Large is available that contains 7 million words, which have not been corrected. Lassy consists of seven different Dutch/Flemish corpora, like the Europarl corpus [Koehn, 2005], a Wikipedia dump of 2011, the SONAR500 corpus [Oostdijk et al., 2013], et cetera.

The Lassy Small corpus has been used because, aside from providing a dataset with Dutch sentences, it has been annotated with POS-tags and a dependency structure. The dataset makes it easy to extract sentences containing the prepositions used in this study, because each word has been annotated with its lexical category. Additionally, LASSY sentences are provided in XML, making it easy to automatically extract the data via XPath queries.

3.2.2 Annotations


classifying a preposition's semantic role is almost impossible, as even humans find these roles too abstract in many cases. For this reason, a relatively simple semantic model has been used, with rather concrete roles that should not be too difficult to annotate manually.

The application will also have to know which answers are correct and sensible for each preposition case. All of the 15 prepositions included in this study have to be assessed for each case. Seeing as a sentence always includes one preposition, 14 other candidates exist that have to be evaluated. As shown in Section 2.3.2, the selection model of Kloppenburg [2015] would have been ideal for this particular task. Unfortunately, no API exists yet to allow for easy access to this model. Over 1300 prepositions have, therefore, been manually annotated with all possible alternative prepositions by an expert committee.

3.3 Participants

The experiment has been carried out by a group of 27 NT2 learners, of which 16 participants have been active enough to ensure reliable data. Participants have been regarded as active when they have finished at least 30% of the experiment, as this ensures that there is enough data for reliable assessment. For the lower percentages, only a couple of classes can be used for statistical analysis. For these cases, a bare minimum of eight answers per class has been used as the lower limit. This minimum requirement is also needed to provide a learner with adequate feedback.

Most NT2 learners have been recruited via the Language Center at the University of Groningen and have a minimum Dutch proficiency of A2 (i.e. basic language user) and a maximum proficiency of C2 (i.e. proficient language user). The CEFL (Center for English and Foreign Languages) has played a crucial role in this study, because their language tests have allowed for the selection of proper candidates. The remainder of the NT2 participants have been recruited via online communities, although most of them have not participated actively enough to include their data in the analysis.


Chapter 4

Preposition model

Previous studies suggest a relation between several preposition contexts (i.e. linguistic domains) and the grammatical appropriateness of prepositions [Luraghi, 2003]. Such contexts are semantics, (morpho)syntax, and of course a preposition's lexical context. Seeing as prepositions depend on these domains, it is likely that the complexity of these domains will also influence a preposition's complexity. In this chapter, a partitioned preposition model will be proposed, including some domains that are already empirically supported for different languages. The model also includes some new domains for which, to the extent of my knowledge, no related research has been done. During the experiment, learners will be assessed on each of these domains by practicing with prepositions that have properties like a semantic meaning, a syntactic context, a length, and even subjective properties like their contextual lexical complexity.

4.1 Overall design

The goal of this classification is to find any relevant indications that could serve as proxies (i.e. approximations) to estimate preposition complexity. Based on previous studies and intuition on contexts that might be relevant to preposition complexity, the following features have been included in this study:

• Preposition frequency (extracted automatically);
• Semantic context (obtained via manual annotations);
• Syntactic context (extracted automatically);
• Surrounding lexical complexity (extracted automatically);
• Sentence length (extracted automatically).


Obviously, a lot more relevant dependencies can and eventually should be added to this list. Future studies should, for example, include the assessment of object and subject complexity as well. Seeing as a preposition, by definition, describes a relation between two or more words (i.e. the subject and object) in a sentence, understanding these words is probably important for choosing a grammatically correct and pragmatically sensible preposition. The LASSY corpus does already contain annotations for sentence objects and subjects, but for the purpose of collecting enough data for reliable analysis, this feature has not been included in this study. The list has been limited to 5 features, with sub-features for the semantics and syntax features. Aside from the semantic context, all features have been extracted automatically from the dataset. The remainder of the chapter is dedicated to explaining the details of each feature.

4.2 Preposition frequency

The first and most sensible feature to include is the frequency with which prepositions are used in everyday social interaction. It makes sense that prepositions which are used often are also prepositions that a learner understands more quickly. A counter argument to this assumption is that when prepositions are used more often, they will probably also express a high number of meanings. This makes them more ambiguous, and therefore perhaps more difficult to use as well. The relation between frequency and understanding will be assessed by measuring whether there is a difference in the complexity of prepositions, and whether or not this correlates with the frequency with which the prepositions are produced. As an indication of how many times prepositions are produced, the LASSY corpus will represent the use of prepositions in the Dutch language. A frequency list has been presented earlier in Table 2.2. As can be read in this table, this study is limited to the 15 Dutch prepositions van, in, op, te, voor, met, aan, door, bij, uit, om, over, tot, naar and als. Because of the relatively low number of prepositions, it is easier to select certain prepositions more repetitively, which should also have a beneficial effect on the learning process. Training novice learners on prepositions that are used rarely makes less sense. Limiting the number of prepositions is also necessary for reliable analysis, which is the primary goal of this study.
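A relative frequency distribution such as Table 2.2 is straightforward to compute once the preposition tokens have been extracted. A minimal sketch with a toy token list (the real distribution was computed over LASSY Large):

```python
from collections import Counter

# Toy token list standing in for the extracted preposition tokens; the
# percentages in Table 2.2 were computed over the full LASSY Large corpus.
prepositions = ["van", "in", "van", "op", "in", "van", "te", "in", "van", "op"]

counts = Counter(prepositions)
total = sum(counts.values())
distribution = {p: round(100 * n / total, 1) for p, n in counts.items()}
print(distribution)  # percentage of all preposition tokens per preposition
```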

4.3 Semantic context


semantic role (e.g. time or place). Some roles also tend to be more ambiguous than others. For this reason, two semantic sub-features will be assessed. The first sub-feature will determine if there is a significant difference between the complexity of certain roles. If so, then it should prove useful to train and assess learners on how to communicate certain meanings, instead of solely focusing on teaching preposition understanding. The second sub-feature will assess whether or not there is a relation between a semantic role's ambiguity and preposition complexity. Communicating meanings that can express themselves in more ways than others could make it difficult to select an appropriate preposition. A more detailed explanation will be presented in this section.

4.3.1 Semantic roles

An understanding of a preposition does not mean that a learner will be able to understand how to apply that preposition in every context. Prepositions, by definition, can have different meanings based on the relation that they have to communicate. The following example contains two prepositions with different semantic functions:

(33) Ik reis naar Duitsland met de bus (I am traveling to Germany by bus).

The preposition naar (i.e. to) has a directional function because it describes the physical movement of an object (e.g. I) and a subject (e.g. Germany). The preposition met has an instrumental function. Prepositions of instrument typically describe the use of an object (e.g. a bus) by the sentence's subject. So, prepositions can have different semantic functions that determine the manner in which they should be used and which prepositions are appropriate for that specific context. The preposition by, for instance, can never be used as a directional preposition. Prepositions can also have different meanings while being lexically identical. Take the following sentences:

(34) Tim speelt op zijn gitaar (Tim is playing on his guitar);

(35) Tim legt het boek op de tafel (Tim placed the book on the table);
(36) Tim is op tijd aangekomen (Tim has arrived on time).

In the first sentence op (on) has an instrumental function. In the second sentence on clearly describes the location of the sentence's object (i.e. the table), and in the third it portrays a time function.


been expanded with a differentiation between location and direction, as mentioned in different studies [Vorsah, 2012; Luraghi, 2003; Saint-Dizier, 2006]. I have also omitted the circumstance role to further reduce the number of roles in the model. With that, the new model includes the roles time, location, direction, instrument, cause, and a separate others class to represent all unclassifiable roles. Many sensible classifications can be made, but these semantic roles are widely supported by different studies. Including more roles would make it increasingly more difficult to train learners sufficiently on particular roles.

4.3.2 Preposition meaning extension

As shown earlier, most semantic roles can be expressed through different prepositions. The meaning extensiveness (or polysemy) varies for each role. Table 4.1 presents an overview of the meaning extensions of the six semantic classes and the 15 most frequently occurring Dutch prepositions in the LASSY corpus. The plus symbols illustrate that a semantic role can be expressed by the corresponding preposition.

Prep.  Time  Place  Dir.  Instr.  Cause  Others
In     +     +      +     +       +      +
Op     +     +      +     +       +      +
Te     +     +      -     +       +      +
Met    +     +      -     +       +      +
Tot    +     +      -     -       +      +
Voor   +     +      -     -       +      +
Van    +     +      -     -       +      +
Aan    +     +      -     -       +      +
Naar   +     +      +     -       +      +
Door   -     -      -     +       +      +
Om     +     +      -     -       +      +
Over   +     +      -     -       +      +
Uit    -     +      -     -       +      +
Als    -     -      -     -       +      +
Bij    -     +      -     -       +      +

Table 4.1: Survey of the meaning extensions of frequently occurring prepositions in the LASSY corpus


The meaning extensions displayed in Table 4.1 have been put together by manually annotating prepositions with their semantic roles. The process for this will be explained in Chapter 5. Because manual annotations almost always contain errors to some degree, the model will most likely contain some incorrect classifications, and certainly some missing ones. However, for the purpose of this study it is more than sufficient to use an indication based on the roles that the annotation committee has found within LASSY.

During the analysis, correlations will be calculated between meaning extensions and the difficulty of prepositions. It is expected that prepositions with lower extensions will probably be easier to use than those with higher extensions, because a higher extension increases the ambiguity.
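The meaning extension itself follows directly from the role annotations: it is simply the number of distinct roles a preposition was observed to express. A minimal sketch using three rows of Table 4.1:

```python
# A compressed version of three rows of Table 4.1: the semantic roles each
# preposition was observed to express in the annotated data.
roles = {
    "in":  {"time", "place", "direction", "instrument", "cause", "others"},
    "uit": {"place", "cause", "others"},
    "als": {"cause", "others"},
}

# The meaning extension is the number of roles per preposition.
extension = {prep: len(r) for prep, r in roles.items()}
print(extension)  # {'in': 6, 'uit': 3, 'als': 2}
```

During the analysis, these per-preposition counts can then be correlated with the error rates observed in the experiment.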

4.4 Syntactic context

Like semantics, a sentence's syntactic construction can, to some extent, determine whether or not a preposition is appropriate. It intuitively makes sense that some of these constructions are more difficult to comprehend than others, as some constructions might occur more frequently in our language, or are similar to constructions in a learner's native language. Figure 4.1 displays a dependency tree for the Dutch sentence wij zetten ons in voor vrede en veiligheid (i.e. we are committed to providing peace and security). The tree has been generated by the Alpino dependency parser [van Noord et al., 2011].


To find out whether preposition use is indeed influenced by the syntactic context, n-gram based classes will be constructed using a preposition's surrounding syntactic context. For this particular task, bigrams and trigrams are utilized for the lexical categories left and right of a preposition. Table 4.2 displays the classes generated from Figure 4.1.

Sub-feature     Words                POS categories        Class name
Bigram-left     ons, in              Pronoun, prep.        VNW P
Bigram-right    vrede, en            Noun, conj.           N VG
Trigram-left    zetten, ons, in      Verb, pronoun, prep.  WW VNW P
Trigram-right   vrede, en, veili...  Noun, conj., noun     N VG N

Table 4.2: Syntactic context features

The left column in Table 4.2 contains the four sub-features, consisting of bigrams and trigrams on both sides of a preposition. These sub-features contain POS categories that represent the feature's classes. The class names (see right column) consist of abbreviated Dutch POS categories. Table 4.3 includes translations of all of the POS categories (in Dutch and English) used in this study.

Abbreviation  Lexical category (Dutch)  Lexical category (English)  Dutch example  English example
LID           Lidwoord                  Determiner                  De             The
N             Naamwoord                 Noun                        Trein          Train
ADJ           Adjectief                 Adjective                   Prachtig       Beautiful
WW            Werkwoord                 Verb                        Loop           Walk
VNW           Voornaamwoord             Pronoun                     Zij            They
TW            Telwoord                  Numeral                     Twee           Two
LET           Leesteken                 Punctuation                 &              &
VG            Voegwoord                 Conjunction                 En             And
VZ            Voorzetsel                Preposition                 Te             Too

Table 4.3: Syntax abbreviations table


A list of the top ten most frequently occurring contexts (i.e. classes) for each sub-feature has been extracted from the LASSY corpus, as displayed in Table 4.4. This frequency distribution illustrates that the Dutch language often makes use of determiner/noun (i.e. LID N) combinations, like ...on the table, or the man walks to his house. The extraction process for the syntactic context feature will be explained in Chapter 5.

Sub-features

Bigram-left      Bigram-right      Trigram-left        Trigram-right
LID N     7627   LID N     10772   VZ LID N     3899   LID N VZ     4013
ADJ N     3253   LID ADJ    3837   LID ADJ N    1894   LID ADJ N    3114
N WW      2596   N LET      2401   N VZ N        966   LID N LET    2379
VZ N      2349   N VZ       2344   WW LID N      963   LID N WW     1864
WW WW     1432   VNW N      2076   LID N WW      850   N VZ LID      878
VNW N     1179   N WW       1764   N WW WW       639   LID N VG      660
WW N       790   ADJ N      1730   VG LID N      589   ADJ N LET     621
TW N       725   WW LET     1444   WW VZ N       561   VNW N WW      605
VNW WW     704   TW N       1000   ADJ N WW      478   N VZ N        553
N LET      700   N VG        950   N VG N        433   N VG N        536

Table 4.4: Frequency distribution of the top ten syntactic contexts per sub-feature


4.5 Surrounding lexical complexity

A preposition's lexical context refers to the word context surrounding a preposition. Unlike the syntactic context, which looks at the surrounding POS categories, this feature represents the complexity of contextual words (i.e. an understanding of the surrounding lexemes).

To estimate the complexity of a word, its frequency will be used as a proxy. The rationale behind this feature is that rare words have a higher chance of being less familiar to a learner, which would make them more complex. Not knowing specific words surrounding a preposition could make it more difficult to select an appropriate preposition for that particular context.

To calculate the surrounding lexical complexity, the frequencies of the words left and right of a preposition are used, over which a mean complexity score is calculated. I will explain this more thoroughly using the following sentence:

(37) De man liep naar zijn huis (The man walked to his house).

To evaluate the lexical context complexity, the words surrounding the preposition naar (i.e. to) are extracted. This particular sentence contains the Dutch words liep (i.e. walked) and zijn (i.e. his). A mean complexity prediction is calculated for both words by comparing the frequency of the word within the LASSY corpus to the total number of tokens in the LASSY corpus. The following equation expresses how a complexity score is calculated:

f(x) = \left( \frac{1}{lassy} \sum_{i=1}^{n} n \right) - \frac{tokens}{lassy}


Word       Van    Mens   Nieuw   Vakbond   Communist   D66minister
Frequency  3421   135    125     10        4           1

Table 4.5: Word frequencies

To determine whether a context is simple or complex, thresholds are applied to the mean complexity scores. Scores can vary from 15057 to 1, meaning very easy and very difficult, respectively. A logarithmic scale has been used to classify scores into 5 classes. A logarithmic scale makes sense to adequately partition the complexity classes, and conveniently the 5 classes map easily to the (up to) 5 digits that a complexity score can have (i.e. 1 to 15057). Table 4.6 shows how the classes are divided and how many unique words each class contains.

Class      Very easy   Easy     Intermediate   Difficult    Very difficult
Threshold  <10         10-100   100-1000       1000-10000   >10000
Words      1457        3578     10861          16795        987

Table 4.6: Surrounding lexical complexity frequencies
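The mapping from a score to one of the five classes can be sketched as follows. This is an illustrative sketch only: it assumes the complexity score is the mean LASSY frequency of the two words around the preposition, follows the thresholds of Table 4.6 as printed, and uses a hypothetical function name.

```python
import math

# Sketch of the thresholding above. Assumptions: the complexity score is the
# mean LASSY frequency of the words around the preposition, and the class
# boundaries follow Table 4.6 (one class per order of magnitude).
CLASSES = ["very easy", "easy", "intermediate", "difficult", "very difficult"]

def complexity_class(word_frequencies):
    score = sum(word_frequencies) / len(word_frequencies)
    band = int(math.log10(score)) if score >= 1 else 0  # digits - 1
    return CLASSES[min(band, 4)]

# Two frequencies from Table 4.5 (van: 3421, mens: 135) for illustration:
print(complexity_class([3421, 135]))  # mean 1778 falls in the 1000-10000 band
```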


4.6 Sentence length

The last (and by far the technically easiest) feature to implement is sentence length. Longer sentences tend to have more complex constructions, which makes them more difficult to interpret. It might be true that this affects preposition complexity as well. It might also have no effect at all, seeing as prepositions can be selected without having to interpret a whole sentence. Additionally, it requires little extra work, and it is easy to query a sentence with a specific length aside from selecting a sentence because it contains other relevant classes.

For the sentence length feature, ten classes will be used with an even distribution of lengths. Seeing as the sentence length distribution of LASSY roughly ends around 250 characters, ten classes of 25 characters each are used. Table 4.7 shows all of the classes and the number of sentences in LASSY that correspond to each length class.

Length     <25   25-50   50-75   75-100   100-125   125-150   150-175   175-200   200-225   >225
Sentences  135   919     2462    2887     2534      1812      1183      717       450       690

Table 4.7: Sentence length frequency distribution
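Deriving the class from a sentence is a one-line operation; the sketch below assumes classes are indexed 0-9, each spanning 25 characters, with everything of 225 characters or more falling in the last class.

```python
def length_class(sentence):
    """Map a sentence to one of the ten 25-character length classes (0-9)."""
    return min(len(sentence) // 25, 9)

print(length_class("De man liep naar zijn huis"))  # 26 characters -> class 1
```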


Chapter 5

System architecture

In this chapter, I present the architecture of Ponère. Ponère is a CALL application specifically designed to evaluate preposition usage on several linguistic domains. The application, in its current state, is an evaluation tool designed to assess a preposition model. This means that it does not yet select prepositions intelligently, but instead decides what to feed learners based on what data is required for statistical analysis. The system does provide direct and competency-based feedback for learning purposes, and it also has all the mechanics in place to integrate intelligent language models for personalized learning. The application's current architecture will be described in this chapter, which includes the system's database configuration, the feature extraction process, and the interface design.

5.1 General design

Ponère consists of an online platform where NT2 learners can practice with Dutch prepositions. The method of experimentation currently consists of a fill-in-the-blank Multiple Choice (MPC) exercise. Figure 5.1 shows the train/experiment page, where learners choose a preposition for a random Dutch sentence.


Figure 5.1: Ponère train page

The application is built on the open source Content Management System Drupal. Drupal allows for the development of highly customizable applications through integrative and custom-built PHP modules. The Drupal API also allows for easy communication with a MySQL database. The process of development consists of several stages, which I will explain in the next sections.

5.2 Dataset preparation

The dataset has to be prepared to generate the input for the feature extraction. The input consists primarily of sentences, POS-tags, sentence IDs, tagged prepositions and their locations within the sentences. To allow the system to select sentences that match certain requirements (e.g. has a specific semantic role), prepositions need to be annotated with the appropriate classes (e.g. time). The LASSY small corpus was used to create this dataset, as it provides all of the necessary data. The steps required to prepare the dataset for feature extraction are as follows:

• Extract sentences and sentence IDs (using LASSY's unique document names), containing at least 1 of the 15 studied prepositions, and populate a dataset table in MySQL;

• Extract all of the POS-tags and add them to the dataset table.


As LASSY consists of XML documents, XPath queries have been used to parse the required data into usable SQL tables. This preliminary step is needed to allow for a clear database structure, efficient data processing and better integration into the Drupal CMS.
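The XPath-based extraction can be sketched as follows. The element and attribute names (node, sentence, pt="vz", word, begin) follow the Alpino XML conventions used by LASSY, but treat the exact names, and the toy document, as assumptions of this sketch rather than a faithful copy of a LASSY file.

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for one LASSY document; real documents contain a full
# dependency structure with many more attributes per node.
doc = """
<alpino_ds>
  <node cat="top">
    <node pt="lid" word="De" begin="0"/>
    <node pt="n" word="man" begin="1"/>
    <node pt="ww" word="liep" begin="2"/>
    <node pt="vz" word="naar" begin="3"/>
    <node pt="vnw" word="zijn" begin="4"/>
    <node pt="n" word="huis" begin="5"/>
  </node>
  <sentence>De man liep naar zijn huis</sentence>
</alpino_ds>
"""

tree = ET.fromstring(doc)
sentence = tree.findtext("sentence")
# XPath query for all preposition nodes, with their position in the sentence.
preps = [(n.get("word"), int(n.get("begin")))
         for n in tree.findall(".//node[@pt='vz']")]
print(sentence, preps)
```

The extracted sentence, position and preposition then map directly onto the columns of the dataset table described in the next section.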

After preparation, the dataset is annotated in two separate stages. I first extracted the classes that can be generated through a set of straightforward algorithms. All of the classes from the features preposition frequency, syntactic context, surrounding lexical complexity and sentence length allow themselves to be extracted automatically. Unfortunately, no easy and reliable methods exist to allow for the automated classification of a preposition's semantic role. For this reason, Ponère is accommodated with several manual annotation functions, annotation progression statistics, and a separate annotator user role with additional functional permissions.

5.3 Database configuration

Due to the high number of features (5) and classes (76) for which Ponère will collect data, a lot of processing is required. For this reason, the structuring and querying of the data play an important role in the system's development. Figure 5.2 displays an ERD containing the table designs and their relations.


The dataset, as described in section 5.2, functions as the central table. It uses two primary keys (i.e. PKs): a sentence ID (sentence_id) and a preposition position (pos_in_sentence). The combination of these keys is used to identify each unique preposition in the dataset. The primary keys also serve as relational data, being referenced as foreign keys (FKs) by the surrounding (feature) tables.

Table 5.1 shows how rows in the dataset are populated. The dataset table stores sentences as arrays for the purpose of iteration. Arrays make it easier to match words with the POS-tags array, containing all of the lexical categories. The dataset table also contains a sentence ID column derived from LASSY's document titles (each document always contains one sentence). Sentences are also stored in a separate string format to allow for easy character length counts in SQL. This is convenient for the extraction of the sentence length classes.

Sentence id    Sentence        Sentence array         Postags         Prep.   Pos.
WS-U-E-A..     De politie..    ('De','politie'..      ('LID', 'N'..   van     7
WR-P-P-C..     Achteraan in..  ('Achteraan','in'..    ('VZ', 'VZ'..   in      2
dpc-rou-98..   Ik kreeg bij..  ('Ik','kreeg'..        ('VNW', 'W'..   bij     3
dpc-eli-862..  Het genees..    ('Het','genees..       ('LID', 'N'..   voor    6

Table 5.1: Example dataset table rows

The ERD (Figure 5.2) shows three data tables assigned for feature data storage. The semantic context table stores the semantic role per preposition. The syntactic context table stores all of the syntactic sub-features (e.g. bigram left, etc.) with their corresponding syntactic constructions (e.g. LID N). The last feature table contains the lexical complexities. This table stores the lexical items to the left and right of a preposition. It also stores the frequencies with which these items are produced within LASSY. These word frequencies are rough estimations used to approximate word complexity. Lastly, the table contains a mean complexity score for the surrounding words. These scores have been generated using the average over the surrounding words, a process which will be explained in section 5.4.

No table for the sentence length feature is included because the length of a sentence can be queried directly via the number of characters in a sentence string and through SQL’s LENGTH() function. This does not require much additional data processing, so a separate table for this feature has been omitted.


5.4 Automated extraction

The first prerequisite for the automated extraction of features is the dataset preparation described in the previous section. The second prerequisite is the development of a set of algorithms to translate sentence data into new SQL tables containing the classes needed for the experiment. The automated annotation processes of the syntactic context and surrounding lexical complexity features are explained in this section.

5.4.1 Syntactic context extraction

As described in section 4.4, the lexical categories surrounding a preposition are referred to as the syntactic context. N-grams are used to store the lexical categories left and right of a preposition. This results in a set of four sub-features: trigram left, bigram left, bigram right and trigram right.

The frequency list in Table 4.4 shows the distribution of the top ten most common syntactic contexts for each sub-feature. The analysis will validate whether some contexts (e.g. LID N or WW WW) are significantly more complex than others. The analysis will also include whether or not the frequency of a syntactic context is of influence on preposition complexity.

ID  Sentence ID     Pos.  Sub-feature    Class
1   WS-U-E-A-204..  5     bigram left    LID N
2   WS-U-E-A-204..  5     bigram right   N N
3   WS-U-E-A-204..  5     trigram left   VZ LID N
4   WS-U-E-A-204..  5     trigram right  N WW WW

Table 5.2: Example syntactic context table

As displayed in Table 5.2, the syntactic context table requires a sentence ID, a preposition position, the sub-feature (e.g. bigram left) and one of the ten classes (e.g. LID N) consisting of concatenated POS-tags. The POS-tags are abbreviated Dutch lexical categories (e.g. N stands for the Dutch category naamwoord, meaning noun). A list of translations is available in Table 4.3.


Algorithm 1: POS-tags to syntactic context class conversion

1  Procedure SyntacticClassConvertor(sentence, pos_tags, position)
     Input  : a pos_tags array and a non-negative integer position, which
              stores a preposition's location in the sentence
     Output : an object syntactic_context containing classes for one to
              four n-grams
     Init   : position += 1;
              separator = " ";
              length = length(pos_tags);
              sub_features = {bi_left, tri_left, bi_right, tri_right};
2    for ft ∈ sub_features do
3      if ft = tri_left and position > 3 then
4        class_3 = pos_tags[position − 3];
5        class_2 = pos_tags[position − 2];
6        class_1 = pos_tags[position − 1];
7        ft ← class_3 + separator + class_2 + separator + class_1;
8      else if ft = bi_left and position > 2 then
9        class_2 = pos_tags[position − 2];
10       class_1 = pos_tags[position − 1];
11       ft ← class_2 + separator + class_1;
12     else if ft = bi_right and position ≤ (length − 2) then
13       class_1 = pos_tags[position + 1];
14       class_2 = pos_tags[position + 2];
15       ft ← class_1 + separator + class_2;
16     else if ft = tri_right and position ≤ (length − 3) then
17       class_1 = pos_tags[position + 1];
18       class_2 = pos_tags[position + 2];
19       class_3 = pos_tags[position + 3];
20       ft ← class_1 + separator + class_2 + separator + class_3;
21     end
22   end
23   return sub_features;
24 end
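Algorithm 1 can be rendered in Python as follows. This sketch uses 0-based indexing rather than the algorithm's 1-based positions, and slicing replaces the per-class assignments; the function name is an assumption:

```python
def syntactic_context(pos_tags, position):
    """N-gram POS classes around the preposition at `position`
    (0-based index into pos_tags); unavailable contexts stay None."""
    n = len(pos_tags)
    ctx = {"bi_left": None, "tri_left": None, "bi_right": None, "tri_right": None}
    if position >= 2:
        ctx["bi_left"] = " ".join(pos_tags[position - 2:position])
    if position >= 3:
        ctx["tri_left"] = " ".join(pos_tags[position - 3:position])
    if position + 2 <= n - 1:
        ctx["bi_right"] = " ".join(pos_tags[position + 1:position + 3])
    if position + 3 <= n - 1:
        ctx["tri_right"] = " ".join(pos_tags[position + 1:position + 4])
    return ctx

# The preposition (VZ) sits at index 3 in this tag sequence.
tags = ["WW", "LID", "N", "VZ", "LID", "N", "WW"]
ctx = syntactic_context(tags, 3)
print(ctx["tri_left"], "|", ctx["bi_right"])  # WW LID N | LID N
```

A preposition near the sentence boundary simply leaves the out-of-range sub-features empty, mirroring the guards in lines 3, 8, 12 and 16 of Algorithm 1.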

5.4.2

Surrounding lexical complexity extraction


ID  St. ID      Pos.  Left    Right  L. freq.  R. freq.  Score
1   WS-U-E-A..  5     Loopt   Het    11        2700      5647
2   WR-P-P-C..  2     Staat   Mooie  15        4         157
3   dpc-rou..   7     Vriend  dat    4         791       813
4   dpc-eli..   2     hebt    Prijs  11        9         343

Table 5.3: Example surrounding lexical context table

The score column in Table 5.3 shows the lexical complexity score based on the word frequencies. Although these scores can be recalculated from the frequencies at any time, they are stored once for performance reasons: extracting them in advance avoids recomputing the score for every sentence request.


Algorithm 2: Surrounding lexical complexity score calculation

1  Procedure SLCCalculator(f_l, f_r)
     Input  : two non-negative integers f_l and f_r containing the
              frequencies of the words surrounding a preposition
     Output : a score slc containing the surrounding lexical complexity
     Init   : token = 5.6904;
              lassy = 33678;
              lassy_inv = 1 / 33678;
              t_l_ratio = token / lassy;
              slc = 0;
2    if f_l > 0 and f_r > 0 then
3      slc = (lassy_inv * ((f_l + f_r) / 2)) − t_l_ratio;
4    else if f_l = 0 and f_r > 0 then
5      slc = (lassy_inv * f_r) − t_l_ratio;
6    else if f_l > 0 and f_r = 0 then
7      slc = (lassy_inv * f_l) − t_l_ratio;
8    end
9    return slc;
10 end
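A Python sketch of Algorithm 2, with the single-sided branches falling back to the one available frequency when a neighbouring word was not observed in LASSY (function name is an assumption):

```python
# Constants taken from Algorithm 2: mean token frequency and the LASSY
# normaliser used to scale the raw frequencies.
TOKEN = 5.6904
LASSY = 33678
T_L_RATIO = TOKEN / LASSY

def slc_score(f_l, f_r):
    """Surrounding lexical complexity from the frequencies of the words
    left (f_l) and right (f_r) of a preposition."""
    if f_l > 0 and f_r > 0:
        return ((f_l + f_r) / 2) / LASSY - T_L_RATIO
    if f_r > 0:
        return f_r / LASSY - T_L_RATIO
    if f_l > 0:
        return f_l / LASSY - T_L_RATIO
    return 0.0

# Frequencies of row 1 in Table 5.3.
print(round(slc_score(11, 2700), 5))  # 0.04008
```

Note that Table 5.3 stores integer scores, which suggests an additional scaling step not shown in Algorithm 2.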

5.5

Manual annotations

The manual annotation process consists of two parts. The first is the annotation of the semantic context feature. The second is the annotation of alternative prepositions that are also grammatically correct and pragmatically sensible. A fill-in-the-blank exercise allows learners to select an appropriate preposition from a list of 15 options, and in most cases multiple prepositions are valid. The method of Kloppenburg [2015] would be a perfect tool to automate this annotation process, but unfortunately the model has not yet been prepared for such annotation tasks. This means that Ponère had to be provided with several new annotation functions and that an expert committee had to manually annotate a rather extensive dataset.

5.5.1

Semantic role annotations


required for analysis. If too little data has been collected for the directional roles, the platform will show learners more sentences and prepositions containing that specific class. Table 5.4 shows the number of annotations per semantic role.

Time  Location  Direction  Instrument  Cause  Others
146   349       50         52          437    323

Table 5.4: Semantic annotation distribution

Table 5.5 shows how the annotations are stored in the database. The table includes a separate alternative annotations column containing the annotations that have been made by a second annotator for the purpose of calculating the inter-annotator reliability.

ID  St. ID      Pos.  Prep.  Semantic Role  Alt. annotation
1   WS-U-E-A..  5     Naar   Time           Time
2   WR-P-P-C..  2     Door   Instrument     Instrument
3   dpc-rou..   7     In     Location       Direction
4   dpc-eli..   2     Bij    Other          Cause

Table 5.5: Example semantic context table
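The table itself does not fix which reliability metric is used; assuming Cohen's kappa, the score can be computed from the two annotation columns. The sketch below uses the four example rows of Table 5.5 purely for illustration:

```python
from collections import Counter

def cohens_kappa(ann_a, ann_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(ann_a)
    # Observed agreement: proportion of items with identical labels.
    p_o = sum(a == b for a, b in zip(ann_a, ann_b)) / n
    # Expected chance agreement from the marginal label distributions.
    freq_a, freq_b = Counter(ann_a), Counter(ann_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / n ** 2
    return (p_o - p_e) / (1 - p_e)

first = ["Time", "Instrument", "Location", "Other"]
second = ["Time", "Instrument", "Direction", "Cause"]
print(round(cohens_kappa(first, second), 3))  # 0.429
```

With two of four labels matching, observed agreement is 0.5 against a chance agreement of 0.125, giving kappa ≈ 0.429 on this toy sample.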

5.5.2

Preposition annotations

The same 1300+ sentences have been annotated with their appropriate alternative prepositions. This step is needed to allow learners to experiment with prepositions properly. Table 5.6 shows the distribution of alternative prepositions. These numbers only include the alternative prepositions, not the prepositions that were already available in the dataset.

The process for annotating prepositions is relatively simple. Annotators access a separate section on Ponère, where they see sentences and select alternative prepositions. Because they are asked to give a subjective judgment and have to evaluate a rather large number of prepositions, the task is prone to error. For this reason, annotation agreements have been calculated after completing the annotation process.

aan  als  naar  te  bij  om  van  tot  in   op   voor  uit  met  door  over
45   19   59    36  409  32  227  82   230  127  521   152  219  283   127

Table 5.6: Alternative preposition distribution
