
The influence of different lexical features on vocabulary learning

Manon Schriever
10762418

Bachelor thesis
Credits: 18 EC

Bachelor Opleiding Kunstmatige Intelligentie
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor: Dr. A. Bisazza
Informatics Institute (IvI)
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam


Abstract

The goal of this project is to determine the effects of morphosyntactic features, cognates, number of senses, phonetics, word length, and frequency on the difficulty of vocabulary words. A logistic regression model is used to predict the recall rate of a word by a certain student on the online learning platform Duolingo. The new features are created based on data from Duolingo, the Open Multilingual Wordnet, Google Translate, OpenSubtitles2016 and the International Phonetic Alphabet and are added to the logistic regression model. It is shown that all features help improve the prediction of the recall rate, and thus have some effect on the difficulty of vocabulary words. Furthermore, all features aside from the cognates have a positive effect on the ranking quality of the predicted recall rates.

Contents

1 Introduction
2 Related work
3 Data
4 Model
5 New features
  5.1 Morphosyntactic features
  5.2 Cognate status
  5.3 Number of senses
  5.4 Pronunciation & Word Length
  5.5 Word Frequency
6 Experiment & Evaluation
  6.1 Results
  6.2 Results per language pair
7 Discussion
8 Conclusion
References
A Morphosyntactic features lexeme tags
B Morphosyntactic features weights

1 Introduction

The internet enables us to instantly communicate with people worldwide, which means an increasing number of people will need to correspond in a language other than their own. Luckily, the internet also gives us the ability to learn another language from the comfort of our own home. Online platforms mean that language learning is no longer an activity reserved only for the classroom. One of these online language learning platforms is Duolingo (https://www.duolingo.com/). With 120 million users around the world (Huynh, Zuo, & Iida, 2016), it is one of the most popular language learning platforms. At the time of writing, Duolingo teaches 26 languages, although most of them are only available for English speakers.

After users on Duolingo have learned new vocabulary for the first time, they will be invited to practise the words again after a certain amount of time has passed. This method is based on the fact that memory performance increases when content is reviewed in multiple sessions over a longer period of time rather than in one session (Mubarak & Smith, 2008). To calculate the time between sessions, Duolingo uses a model similar to the Leitner system (Settles & Meeder, 2016). The Leitner system was originally designed for flashcards sorted into boxes. If the word on the flashcard can be recalled, the card is promoted to a higher box. Because a higher box corresponds to an increasing time lag between repetitions, a student will practise the words they find challenging more frequently than the ones they can easily remember (Mubarak & Smith, 2008).

It can be useful to predict which words are more difficult to learn than others, both for online platforms and for offline teaching programs, before a student has started learning. For example, words with a higher difficulty could be taught later in the course. This thesis aims to answer the following question:

What is the effect of different lexical features on the difficulty of foreign vocabulary words?

Five different features are considered: morphosyntactic features, such as part-of-speech, number and tense; cognate status, the similarity between a word and its translation; the number of word senses; a combined measure of pronunciation difficulty and word length; and word frequency.

Examples of morphosyntactic features influencing difficulty could be irregular verbs being harder to remember than regular ones, or the past tense being more challenging than the present. Cognates would ease learning as well; an English student learning French might find it more difficult to remember the translation for "dog" (chien) than the translation for "cat" (chat) because the latter is more similar to the English word. The number of senses of a word adds to the confusion about which word to use in which context, while a word with a pronunciation that is not directly clear from its written form will be misspelt sooner. Longer words are also usually regarded as more difficult than shorter words. A high word frequency means the user will have more experience with the word outside the learning environment, which adds to their ability to correctly recall its meaning.

The new features are implemented into a logistic regression model that is trained and evaluated on two weeks of data from Duolingo (Settles & Meeder, 2016). This data encompasses seven different language pairs with five unique target languages and more than 11 million student-word interactions. The logistic regression model predicts the recall rate of a specific word by a specific student during a session. The different features are evaluated based on their effect on these predictions.

In the next section, related work will be explored, followed by an analysis of the data in section 3. Section 4 will give an outline of the model used for this project. In section 5 each of the newly added features will be described in more detail, including the pre-processing needed for each feature. The results will be presented in section 6, followed by the discussion and conclusion in sections 7 and 8.

2 Related work

The new features used in this project have been identified in previous studies as possible explanations for the difficulty of a word. Settles and Meeder (2016) suggest using morphosyntactic features such as gender and part-of-speech and add that corpus frequency and word length might be effective as well. The use of morphosyntactic features is further supported by DeKeyser (2005), who states that differences in morphosyntactic features between languages can cause problems for second language learners. De Groot and Keijzer (2000); Xu, Chen, and Li (2015); Montalvo, Pardo, Martinez, and Fresno (2012) and Hauer and Kondrak (2011) all state that cognates, words with similar spelling and meaning, are easier to learn and harder to forget than non-cognates. The number of senses of a word is also mentioned in multiple articles as an important feature for predicting difficulty (Laufer, 1990; Dela Rosa & Eskenazi, 2011). Multiple meanings can cause confusion for students, especially if the meanings do not match up with the senses in their native language. Phonetic complexity, also referred to as pronounceability, plays a role according to Dela Rosa and Eskenazi, Beinborn, Zesch, and Gurevych (2016) and Chen and Chung (2008). Word length influences difficulty according to Chen and Chung and Baddeley, Thomson, and Buchanan (1975). Although De Groot and Keijzer only see a minor influence of frequency, Koirala (2015); Reynolds, Wu, Liu, Kuo, and Yeh (2015) and Chen and Chung do see an effect.

Most of these papers study a small group of participants. Chen and Chung (2008) only use 15 students, Dela Rosa and Eskenazi (2011) use two groups of 21 learners and De Groot and Keijzer (2000) have two groups of 40 students. Koirala (2015) and Reynolds et al. (2015) use 217 and 330 participants respectively. The research that comes closest to the number of students studied in this paper is Beinborn et al. (2016) with 85,000 learners from different corpora.

3 Data

The data used in this project consists of two weeks of log data from Duolingo, collected and provided by Settles and Meeder (2016). This data contains information about the interaction of over 100,000 unique students with more than 10,000 unique words, where each data point describes one session or lesson of interactions. The names used for each feature in this data are outlined in Table 1, while an example of a data point is provided in Table 2. Hereinafter, these names will be used to indicate these features.

Name                Description
p_recall            Recall rate of this word by this student during the current session; the value the model predicts
timestamp           Time of this interaction
delta               Time in seconds since the last interaction of this student with this word
user_id             Unique id of the student
learning_language   The language the student is learning
ui_language         The language the user interface of the student is in
lexeme_id           The id of the lexeme string
lexeme_string       The current word, in the form: word/lemma<tag1><tag2>
history_seen        Number of times this word has been seen by this student
history_correct     Number of times this word has been correctly translated by this student
session_seen        Number of times this word has been seen by this student during the current session
session_correct     Number of times this word has been correctly translated by this student during the current session

Table 1: Features present in the Duolingo log data

p_recall            1.0
timestamp           1362113669
delta               176
user_id             u:hT_J
learning_language   es
ui_language         en
lexeme_id           767fce885fae7a70ef9eab520d78f1bb
lexeme_string       una/uno<det><ind><f><sg>
history_seen        102
history_correct     96
session_seen        2
session_correct     2

Table 2: Random example of a data point
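For concreteness, the sketch below shows one way such data points could be loaded in Python. The file name and CSV layout are assumptions based on Table 1, not details confirmed by the thesis.

```python
import csv

def load_data_points(path="learning_traces.csv"):
    """Yield one dictionary per session, typed according to Table 1."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            yield {
                "p_recall": float(row["p_recall"]),
                "timestamp": int(row["timestamp"]),
                "delta": int(row["delta"]),
                "user_id": row["user_id"],
                "learning_language": row["learning_language"],
                "ui_language": row["ui_language"],
                "lexeme_id": row["lexeme_id"],
                "lexeme_string": row["lexeme_string"],
                "history_seen": int(row["history_seen"]),
                "history_correct": int(row["history_correct"]),
                "session_seen": int(row["session_seen"]),
                "session_correct": int(row["session_correct"]),
            }
```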

The data provided by Settles and Meeder consists of eight different language pairs: English-German, English-Spanish, English-Portuguese, English-Italian, English-French, Spanish-English, Italian-English and Portuguese-English. Because German is not available in the Open Multilingual Wordnet (see section 5.2), the English-German language pair was removed from the data. All the remaining language pairs and how often they occur in the data are specified in Table 3.

Pair (ui-learning)    Counts
Spanish-English       3,641,179
English-Spanish       3,407,689
English-French        1,873,734
Portuguese-English      949,460
English-Italian         793,935
Italian-English         424,152
English-Portuguese      311,480
Total                11,401,629

Table 3: Ordered occurrences of language pairs in the data

4 Model

Settles and Meeder (2016) have compared four different algorithms in their experiments: the Pimsleur method, the Leitner system, half-life regression, and logistic regression. Each of these methods is used to predict the p_recall: the recall rate of a word by a student during one session. The p_recall in the data is bounded between 0.0001 and 0.9999 in the program, to prevent overflow and underflow errors (Settles & Meeder, 2016).


The Pimsleur method (Pimsleur, 1967) is a spaced repetition model with fixed intervals that increase exponentially. Pimsleur reasoned that the probability of remembering vocabulary decreases rapidly over time, but that each time the vocabulary is relearned this decrease happens more slowly. He suggests that this decrease happens in such a way that an exponential schedule would be ideal, meaning that a student who starts with a five-second interval will reach the tenth recall 5^10 seconds (about 9.8 million seconds, or 113 days) after first learning the word. Pimsleur (1967) already points out that this schedule is not universal and should be changed based on the individual circumstances of the student and the vocabulary.

Settles and Meeder (2016) state that the Leitner system is more adaptive than the Pimsleur method because the intervals depend on the performance of a student. The Leitner system consists of a number of real or virtual flashcards with vocabulary words and a series of numbered boxes. A box with a higher number corresponds to a larger time lag between repetitions. Flashcards are promoted to higher boxes when the student can recall a word correctly and are demoted when the student has forgotten a word (Mubarak & Smith, 2008).

The half-life regression model is an attempt by Settles and Meeder to create a more accurate model. The model is based on the Ebbinghaus model, also called the forgetting curve:

RecallProbability = 2^(-LagTime / HalfLife),

where the half-life represents the strength of a word in a student's long-term memory. Since the half-life is not available in the data, it is estimated with h = -t / log2(p), where p is the p_recall of a data point and t its lag time.

Based on experiments on the French part of the data using both logistic regression and half-life regression, it was decided to use logistic regression. Although half-life regression produces better results overall, the effects of the different features were clearer when using logistic regression. Since the goal of this research is not to produce the best possible predictor, but rather to determine the effect of newly added features, logistic regression was chosen. The formulas used for updating the feature weights with logistic regression are shown in Algorithm 1. The predicted p_recall is 1 / (1 + e^(-sum over all features of weight(Feature) × FeatureValue)).

Settles and Meeder (2016) did not use all the features available in the data for their program. Both the half-life regression model and the logistic regression model only use the number of times the user has provided the right (history_correct) or wrong (history_seen - history_correct) translation of the word, plus a binary feature indicating the lexeme_string. The feature values for these three features are √(1 + right), √(1 + wrong) and either 0 or 1. All three of these features were maintained in the new model; however, the time feature present in the original logistic regression was removed, which puts the focus on the inherent difficulty of the words and away from the spaced repetition factor.


Algorithm 1 Updating feature weights for logistic regression
learningRate = 0.001
α = 0.01
λ = 0.1
for Feature, FeatureValue in FeatureList do
    rate = (1 / (1 + true_p)) × learningRate / √(1 + counts(Feature))
    weight(Feature) -= rate × (predicted_recall - true_recall) × FeatureValue
    weight(Feature) -= rate × λ × weight(Feature)
    counts(Feature) += 1
end for
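A minimal Python sketch of this update rule follows. The learning-rate expression mirrors the reconstruction of Algorithm 1 above; the function and variable names are assumptions, not the thesis's own code.

```python
import math
from collections import defaultdict

LEARNING_RATE = 0.001
ALPHA = 0.01   # listed in Algorithm 1 but not used in this logistic update
LAMBDA = 0.1   # L2 regularisation strength

weights = defaultdict(float)  # one weight per feature name
counts = defaultdict(int)     # how often each feature has been updated

def predict_recall(features):
    """Predicted p_recall = 1 / (1 + e^(-sum of weight x value))."""
    z = sum(weights[f] * x for f, x in features)
    return 1.0 / (1.0 + math.exp(-z))

def update_weights(features, true_recall):
    """One stochastic gradient step over a data point's (feature, value) pairs."""
    predicted = predict_recall(features)
    for feature, value in features:
        # Per-feature rate shrinks with the feature count, as in Algorithm 1.
        rate = (1.0 / (1.0 + true_recall)) * LEARNING_RATE \
               / math.sqrt(1 + counts[feature])
        weights[feature] -= rate * (predicted - true_recall) * value   # error term
        weights[feature] -= rate * LAMBDA * weights[feature]           # weight decay
        counts[feature] += 1
```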

5 New features

In the following section, all the newly added features will be explained in more detail and the process of adding each new feature will be outlined. All the features were tuned on 90 per cent of the French part of the data, using 5-fold cross-validation.

5.1 Morphosyntactic features

The term "morphosyntactic features" represents features such as part-of-speech, gender, number, tense, case and person. Settles and Meeder (2016) suggest adding these features because it "might be able to capture useful and interesting regularities". DeKeyser (2005) states that there are three types of problems that explain why certain morphological features can cause dif-ficulties for second language learners. Problems of meaning occur when the meaning of a form is not present in the learner’s native language. Exam-ples of such forms are articles, classifiers and grammatical gender. Problems of form represent the complexity of picking the correct morphemes, even when the meaning is clear. Here, too, the problem lies in the differences between languages, such as English, which is morphology-poor and Spanish, which is morphology-rich (DeKeyser, 2005). The third type of problems that DeKeyser mentions are the problems of form-meaning mapping that occur when the relationship between a form and its meaning is not transparent. One way this can happen is when the form is semantically redundant be-cause another part of the sentence already expresses it. Morphemes that are homophonous to other morphemes can cause form-meaning problems as well. An example of this is ’s’, which gets added at the end of a word in English when that word is either a third-person singular verb, a plural noun or the genitive of a noun (DeKeyser, 2005).

The morphosyntactic features are already available in the original data, specifically in the lexeme_string, as tags enclosed in angle brackets. However, to be able to use them they had to be separated, which was achieved by using a regular expression. For example, the morphosyntactic features for the data point in Table 2 would be "det", "ind", "f" and "sg", meaning that this word is a determiner, indefinite, feminine and singular. All the tags and their meanings can be found in Appendix A. Among these newly extracted features are two special cases: missing features and expressions.

Missing morphosyntactic features occur when, for example, the number or gender of a word is not clear. This is indicated in the tag by an asterisk, which makes these missing features easy to detect. Because these tags do not provide any information, they were not added as features to the model. Expressions are indicated in the tags by a colon, showing that the lemma is part of an expression. Examples of this are "a/avoir" from "il y a" and "you" in "thank you". It was decided to leave the data points that are expressions out of the data for several reasons. Firstly, the tags belonging to an expression spell out the entire expression, meaning that the tag is no longer generic. Secondly, individual words in expressions often do not have a translation, which would be a problem for the cognate feature. Finally, these expressions make up only a relatively small part of the data and thus will not have a major influence on the results.

The algorithm for finding the morphosyntactic features and adding them to the list of features for a data point is shown in Algorithm 2. The data points with expressions are removed prior to this algorithm. All added features received a feature value of 1.

Algorithm 2 Adding morphosyntactic features to the model
for datapoint in dataset do
    taglist = all <([^<>]*)> in lexeme_string of datapoint
    for tag in taglist do
        if not * in tag then    ▷ remove tags with missing features
            add tag to features of datapoint
        end if
    end for
end for
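In Python, the tag extraction of Algorithm 2 amounts to a single regular expression. This is a minimal sketch; the exact pattern is an assumption consistent with the lexeme_string format shown in Table 2.

```python
import re

TAG_PATTERN = re.compile(r"<([^<>]+)>")

def morphosyntactic_features(lexeme_string: str) -> list:
    """Extract the angle-bracket tags, skipping tags with missing features (*).

    Data points whose tags contain ':' (expressions) are assumed to have
    been filtered out beforehand, as described above.
    """
    return [tag for tag in TAG_PATTERN.findall(lexeme_string) if "*" not in tag]

# The data point from Table 2:
print(morphosyntactic_features("una/uno<det><ind><f><sg>"))  # ['det', 'ind', 'f', 'sg']
```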

5.2 Cognate status

Three types of cognates can be distinguished: true cognates, false cognates and semi-cognates (Xu et al., 2015). True cognates are words that have a common etymological origin (Montalvo et al., 2012) and, because of their common origin, often have a similar spelling and the same meaning. False cognates and semi-cognates both have a similar spelling, but false cognates do not share a meaning and semi-cognates share one only in some circumstances (Xu et al., 2015). These two types of cognates are also known as false friends (Montalvo et al., 2012). Although false friends could lead to users confusing translations, many researchers agree that cognates usually help second language learners (Xu et al., 2015; Montalvo et al., 2012; Hauer & Kondrak, 2011; De Groot & Keijzer, 2000).

There are multiple possible ways to detect cognates (for example, Montalvo et al. (2012) use a fuzzy logic system that combines multiple similarity measures), but, to prevent the program from becoming too complicated, the normalised Levenshtein distance was used for this project. Levenshtein distance counts the minimal number of deletions, insertions and substitutions necessary to convert one string into the other. The distance was normalised as 1 - LevenshteinDistance / MaxLength(word1, word2), meaning that two words that are exactly the same receive a score of 1. In addition to its relative simplicity, normalised edit distance also produces high accuracy scores (Inkpen, Frunza, & Kondrak, 2005; Montalvo et al., 2012).

To determine the cognates, translations of the lemmas were required. The translating occurred in three steps: creating a list of all unique lemmas per language, translating this list with the Open Multilingual Wordnet and translating the remaining lemmas with Google Translate.

For the first step, the lemmas were extracted from the lexeme_string of each word. Most of the time, this could be done by splitting the lexeme_string at the slash and at the leftmost angle bracket and taking the middle string of the resulting list. However, for the data points with a Spanish or English learning_language, the lemma of personal pronouns is replaced with "prpers". When this happened, the lemma was derived from the word based on observations of the lemmas of personal pronouns in the other languages. These observations have shown that all pronouns that have multiple genders are reduced to the male form and that plural third-person pronouns are reduced to the singular male form. For the other pronouns, the word is taken as the lemma. These lemmas are used across all features. In addition to the lemma, the part-of-speech (POS) tag was also extracted from the lexeme_string to obtain more accurate translations. A list was made of all the unique lemma-POS pairs, which could then be passed on to the Open Multilingual Wordnet.

The Open Multilingual Wordnet is a combination of 34 individual single-language wordnets created by Bond and Paik (2012). As a result of this combination, the Wordnet not only returns senses of words but also multiple translations of each sense of a word. The part-of-speech tags used by the Open Multilingual Wordnet differ from the tags used in the data. More specifically, Wordnet recognises only four part-of-speech categories, while the data includes 17 different part-of-speech categories. Where the conversion between the two was clear, the part-of-speech tags were changed into those usable by the Open Multilingual Wordnet. The lemmas without a clear part-of-speech tag were left without one, and the translations for these lemmas were chosen regardless of part-of-speech. For each lemma-POS pair a list of translations was collected, using only the translations for that specific part-of-speech where possible and using all translations when no part-of-speech was present. Out of these translations, the one with the best normalised Levenshtein distance score was picked. Orthographic heuristic rules were used similar to the ones used by Xu et al. (2015): the distance is calculated without accents, meaning that 'é' and 'e' have a distance of zero between each other. Unfortunately, the Open Multilingual Wordnet does not contain every lemma in the data. For these missing lemmas, another translation method was needed in the form of Google Translate.

The translation with Google Translate was automated by using the Google Cloud Translate API. This API only gives one translation per lemma and does not take part-of-speech into account. These Google translations are combined with the Wordnet translations to create a dictionary that receives a lemma-POS pair and returns a translation.

This dictionary is used by the model to determine the cognate status of each data point. Lemmas and part-of-speech tags are again extracted from the lexeme_strings and used to retrieve a translation from the dictionary. Next, all accents are removed from both the lemma and the translation and the normalised Levenshtein distance between the two is calculated. To establish whether this score indicates a cognate, the threshold determined by Inkpen et al. (2005) was utilised, which is 0.34845. However, Inkpen et al. used a normalisation measure without subtracting from 1, so the threshold was set to 1 - 0.34845 = 0.65155. All lemmas with a normalised score above this threshold were labelled as cognates with a feature value of 1, while all other lemmas received a cognate feature value of 0. Table 4 shows examples of the cognate status of six lemmas.
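A minimal sketch of this cognate check follows, assuming a plain dynamic-programming implementation of Levenshtein distance and Unicode decomposition for the accent stripping; the helper names are not from the thesis.

```python
import unicodedata

def strip_accents(s: str) -> str:
    # Drop combining marks so that 'é' and 'e' compare equal.
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if unicodedata.category(c) != "Mn")

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def is_cognate(lemma: str, translation: str, threshold: float = 0.65155) -> bool:
    a, b = strip_accents(lemma), strip_accents(translation)
    similarity = 1 - levenshtein(a, b) / max(len(a), len(b), 1)
    return similarity >= threshold

print(is_cognate("diferencia", "difference"))  # True  (similarity 0.7)
print(is_cognate("aimer", "like"))             # False (similarity 0.4)
```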

Lemma        Translation   Cognate Status
diferencia   difference    cognate
american     americano     cognate
career       carrera       cognate
fish         pisces        non-cognate
apple        macieira      non-cognate
aimer        like          non-cognate

Table 4: Examples of lemma-translation pairs and their cognate status


5.3 Number of senses

The number of senses of a word can, according to Laufer (1990), lead to problems when there are differences between the first and second language of a student. One word with many meanings could have multiple translations, depending on the meaning, which makes it difficult for a student to pick the correct meaning and translation. Laufer mentions that some students will refuse to use a word in a certain sense when this sense is very different from the other meanings of the word. A word with multiple meanings can be a polyseme, when its meanings are related, or a homonym, when its meanings are unrelated (Laufer, 1990). Reynolds et al. (2015) mention that the meanings of polysemes are generally easier to learn than those of homonyms because of the relationship between the different meanings. Laufer states that it is often difficult to decide whether a word is a polyseme or a homonym and suggests treating them as one problem; the same will be done in this research.

For the number of senses feature, the Open Multilingual Wordnet was used again. For this feature, the number of senses in the Wordnet of each unique lemma was counted. For the lemmas that are not present in the Open Multilingual Wordnet, the number of senses was set to 1. This time, part-of-speech was not taken into account because users might also be confused by different meanings across several parts of speech. A dictionary was created, with the lemmas serving as keys and the number of senses serving as values. In line with the research done by Dela Rosa and Eskenazi (2011), the lemmas were divided into three groups. The first group consists of the lemmas with 1 word sense, the second group contains the lemmas with 2 to 4 word senses and the third group consists of the lemmas with 5 or more word senses.

Of these groups, the first received a feature value of 1, the second received a feature value of 0.5 and the third a feature value of 0.
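As an illustration, a sketch of this binning using NLTK's interface to the Open Multilingual Wordnet is shown below. The thesis does not specify how the Wordnet was accessed, so the NLTK route and the language code are assumptions.

```python
# Requires: nltk.download("wordnet"); nltk.download("omw-1.4")
from nltk.corpus import wordnet as wn

def senses_feature(lemma: str, lang: str = "fra") -> float:
    """Bin the Wordnet sense count: 1 sense -> 1.0, 2-4 -> 0.5, 5+ -> 0.0.

    Lemmas missing from the Open Multilingual Wordnet count as one sense,
    as described above; part-of-speech is deliberately ignored.
    """
    n = len(wn.synsets(lemma, lang=lang)) or 1
    if n == 1:
        return 1.0
    return 0.5 if n <= 4 else 0.0
```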

5.4 Pronunciation & Word Length

According to Chen and Chung (2008), over 82 per cent of Taiwanese students learning English thought that the pronunciation and length of a word influence its difficulty. The idea that shorter words are usually easier than longer words can also be observed in books for young children, which mostly contain shorter words. Baddeley et al. (1975) confirm that memory span is inversely related to the length of words. Difficult pronunciation can lead to misspellings, especially in English, where the relationship between a grapheme and its phoneme is often not clear (Beinborn et al., 2016). Dela Rosa and Eskenazi (2011) show that words with a grapheme-to-phoneme ratio of 1 are easier than words with a ratio lower or higher than 1.

Pronunciation and word length were combined into a single feature using a formula introduced by Chen and Chung (2008):

0.7 × LengthParameter + 0.3 × PhoneticParameter.

Chen and Chung further add the grading level of the Taiwanese General English Proficiency Test to this formula as a weight, which was not included here because the data covers more languages. The outcome of this formula serves as the feature value for the combined pronunciation-wordlength feature.

For the measurement of the length of a lemma, accented characters are regarded as one character. The length parameter is calculated according to Algorithm 3, which puts the lengths on a 0-1 scale.

Algorithm 3 Computing the length parameter
1: WordLength = length(word)
2: MaxLength = 19
3: MinLength = 1
4: MiddleLength = 6
5: if WordLength > MiddleLength then
6:     LengthPar = 0.5 + (WordLength - MiddleLength) × (1 - 0.5) / (MaxLength - MiddleLength)
7: else
8:     LengthPar = 0 + (WordLength - MinLength) × (0.5 - 0) / (MiddleLength - MinLength)
9: end if

The MaxLength is the length of the longest word in the entire dataset, which is 19. Similarly, MinLength is the length of the shortest word in the dataset. For the MiddleLength, a sorted list was created of all word lengths; the length in the middle of this list is taken as the MiddleLength.

The phonetic parameter is calculated with the ratio between the length of the grapheme of a word and the length of its phoneme. The phonemes are denoted with the International Phonetic Alphabet (IPA), using the Epitran tool for Python. After they are established, the length of these phonemes is determined, treating special characters in the International Phonetic Alphabet as a single character. With the length of both the phoneme and the grapheme, the phonetic parameter is calculated using Algorithm 4. The MaxRatio in this algorithm is the maximum ratio of a grapheme-phoneme pair in the whole data; MinRatio is the smallest ratio of a grapheme-phoneme pair.


Algorithm 4 Computing the phonetic parameter
1: MaxRatio = 4
2: MinRatio = 0.6
3: Ratio = length(word) / length(phoneme)
4: PhonemePar = (1 - 0) / (MaxRatio - MinRatio) × (Ratio - MinRatio)
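Putting Algorithms 3 and 4 together, the combined feature value could be computed as in the sketch below. The branch condition and denominators follow the reconstruction above (chosen so the length parameter runs from 0 to 1), and the phoneme string is assumed to come from Epitran.

```python
MAX_LEN, MIN_LEN, MID_LEN = 19, 1, 6   # longest, shortest and median word lengths
MAX_RATIO, MIN_RATIO = 4.0, 0.6        # extreme grapheme/phoneme length ratios

def length_parameter(word: str) -> float:
    # Piecewise-linear 0-1 scale: 0 at MIN_LEN, 0.5 at MID_LEN, 1 at MAX_LEN.
    n = len(word)
    if n > MID_LEN:
        return 0.5 + (n - MID_LEN) * (1 - 0.5) / (MAX_LEN - MID_LEN)
    return (n - MIN_LEN) * (0.5 - 0) / (MID_LEN - MIN_LEN)

def phonetic_parameter(word: str, phonemes: str) -> float:
    # e.g. phonemes = epitran.Epitran("fra-Latn").transliterate(word)
    ratio = len(word) / len(phonemes)
    return (1 - 0) / (MAX_RATIO - MIN_RATIO) * (ratio - MIN_RATIO)

def pronunciation_length_feature(word: str, phonemes: str) -> float:
    # 0.7 * LengthParameter + 0.3 * PhoneticParameter, as in Chen and Chung (2008).
    return 0.7 * length_parameter(word) + 0.3 * phonetic_parameter(word, phonemes)
```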

5.5 Word Frequency

Word frequency represents how often a student might come across a vocabulary word. The literature is not unanimous about the effect of word frequency. Koirala (2015) saw that difficulty decreased when frequency increased, and Chen and Chung (2008) report that 98 per cent of learners expect that frequency impacts the learning outcome. Reynolds et al. (2015) explain that this impact is intuitive because seeing a word more often makes it more likely that the student will learn it. On the other hand, De Groot and Keijzer (2000) see no significant effect of a word frequency feature.

The frequencies were determined by the number of occurrences of each word in the OpenSubtitles2016 corpus (Lison & Tiedemann, 2016). This corpus was used because it offers a wide range of languages, including all languages used in the data. In addition, Brysbaert et al. (2011) show that frequencies based on subtitles outperform frequencies based on written texts. For each learning_language, the unique lemmas were sorted from least common to most common based on these occurrences. This sorted list is made into a dictionary, with the lemmas serving as keys and the index of each lemma in the sorted list as the values. When multiple lemmas have the same number of occurrences, they receive the same index in the dictionary; lemmas that are not in the OpenSubtitles corpus get a frequency of 0 and thus an index of 0. The indices are normalised to a 0 to 1 range with the following formula:

NormalisedFreq = (Frequency - MinFrequency) / (MaxFrequency - MinFrequency). The outcome of this formula is the feature value for the word frequency feature. MaxFrequency and MinFrequency are determined per language because the number of occurrences is not directly comparable between languages; they are the highest and lowest value in the dictionary.
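A sketch of this ranking-and-normalising step is given below; the function name and the use of a Counter are assumptions.

```python
from collections import Counter

def frequency_features(lemma_counts: Counter) -> dict:
    """Rank lemmas from least to most common and normalise the ranks to [0, 1].

    Lemmas with the same count share an index; lemmas absent from the corpus
    should be looked up with .get(lemma, 0.0) so they fall to index 0.
    """
    index = {}
    last_count, last_index = None, 0
    for i, (lemma, count) in enumerate(sorted(lemma_counts.items(),
                                              key=lambda kv: kv[1])):
        if count != last_count:          # ties share an index
            last_index, last_count = i, count
        index[lemma] = last_index
    lo, hi = min(index.values()), max(index.values())
    span = (hi - lo) or 1                # guard against a single shared index
    return {lemma: (idx - lo) / span for lemma, idx in index.items()}
```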

6 Experiment & Evaluation

Experiments were conducted with the logistic regression model on the program without new features added, also known as the baseline, and on the program with each of the new features, one feature at a time. All preexisting features were kept, with the exception of the time feature. For these experiments, the full data set was used, with the first 90 per cent acting as training data and the final 10 per cent as testing data. The results of these experiments are the predicted recall scores for each data point, which were compared with the actual p_recall using two evaluation measures.

6.1 Results

The mean absolute error (MAE) measures the resemblance between the real and predicted p_recall; a lower MAE score indicates a better prediction. The area under the ROC curve (AUC) measures the ranking quality; a higher AUC score means a better ranking order. Table 5 and Figures 1 and 2 show the average results of ten experiments per feature. All features show a significant improvement of the MAE score compared to the baseline and only the cognates feature gives a worse AUC score. When all features are used together, the AUC score worsens compared to the frequency feature alone. Settles and Meeder (2016) also ran a test with a constant predicted p_recall, namely the average recall rate of 0.859. Two experiments with a constant predicted recall rate were executed, one with the recall rate from Settles and Meeder and one with the most common recall rate: 0.9999 (or 1 before the bound). Both of these outperform all features on the MAE score.
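Both measures are available in scikit-learn; the sketch below shows one way to compute them. The binarisation of the observed recall rates for the AUC is an assumption, as the thesis does not state how the continuous p_recall was binarised.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, roc_auc_score

def evaluate(true_recall, predicted_recall):
    mae = mean_absolute_error(true_recall, predicted_recall)
    # Assumed binarisation: a data point counts as "recalled" when the
    # observed p_recall hits the upper bound of 0.9999.
    labels = (np.asarray(true_recall) >= 0.9999).astype(int)
    auc = roc_auc_score(labels, predicted_recall)
    return mae, auc
```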

Feature                       MAE↓      AUC↑
Baseline                      0.2638    0.5030
Morphosyntactic               0.2546*   0.5054*
Cognates                      0.2630*   0.5016
Number of senses              0.2625*   0.5036*
Phonetics and wordlength      0.2631*   0.5031*
Frequency                     0.2613*   0.5066*
All features                  0.2505*   0.5064*
Constant p_recall = 0.9999    0.1045    0.5
Constant p_recall = 0.859     0.1991    0.5

Table 5: MAE and AUC score per feature. Significant improvements (p<0.001) are marked with *.

Figure 1: Visualisation of the MAE score per feature

Figure 2: Visualisation of the area under the ROC curve per feature

Additionally, the program produced the following feature weights:

• Cognate status: 0.1098
• Number of senses: 0.1100
• Phonetics & wordlength: 0.0649
• Frequency: 0.1275

The feature weights for the individual morphosyntactic features are available in Appendix B.

6.2 Results per language pair

To examine the difference in results per language pair, the predictions for one experiment per feature were split per pair. The model was still trained on the full data set. Table 6 and Figure 3 show the MAE scores for each feature per language pair. Overall, the data for students learning English (es-en, it-en and pt-en) have the best MAE scores, although the addition of the morphosyntactic features brings the other language pairs to approximately the same level.

Feature                   en-es    en-fr    en-it    en-pt    es-en    it-en    pt-en
Baseline                  0.2725   0.2646   0.2681   0.2684   0.2567   0.2574   0.2589
Morphosyntactic           0.2549   0.2540   0.2524   0.2530   0.2541   0.2548   0.2553
Cognates                  0.2719   0.2627   0.2677   0.2672   0.2562   0.2571   0.2594
Number of senses          0.2687   0.2656   0.2631   0.2656   0.2566   0.2573   0.2589
Phonetics & wordlength    0.2716   0.2640   0.2669   0.2675   0.2563   0.2568   0.2584
Frequency                 0.2692   0.2611   0.2658   0.2654   0.2550   0.2554   0.2571

Table 6: MAE↓ scores per feature per language pair (ui-learning)

Figure 3: Visualisation of the MAE scores per language pair

Table 7 and Figure 4 show the AUC scores for each feature per language pair. Here, the best results are obtained on the English-Italian and the English-Portuguese language pairs. The AUC scores show more pronounced differences in the effect of different features. The morphosyntactic feature, for instance, produces a more accurate ranking for English-Spanish, English-French and English-Portuguese, but a worse ranking for the other language pairs.

Feature             en-es    en-fr    en-it    en-pt    es-en    it-en    pt-en
Baseline            0.5047   0.5110   0.5410   0.5384   0.5200   0.5203   0.5039
Morphosyntactic     0.5149   0.5179   0.5382   0.5517   0.5108   0.5027   0.5069
Cognates            0.5042   0.5132   0.5428   0.5404   0.5172   0.5181   0.5018
Number of senses    0.5038   0.5086   0.5343   0.5331   0.5205   0.5190   0.5051

Table 7: AUC↑ scores per feature per language pair (ui-learning)

Figure 4: Visualisation of the AUC scores per language pair

7 Discussion

All the newly added features have some effect on the quality of the predicted recall rate, with all features improving the MAE scores and only the cognate status not improving the AUC score. This supports the hypothesis that these features have some influence on the difficulty of foreign vocabulary. However, the results vary greatly between language pairs, especially for the ranking quality.

Differences between the outcomes of the MAE and the AUC are caused by the preference that the mean absolute error has for higher predicted values, as shown by the two static p_recall predictions in Table 5. This preference is a result of an unbalanced dataset, with a perfect recall rate for over 83 per cent of the data points. The weak performance of the cognate feature with regard to the ranking quality could be a consequence of differences between the estimated and real translations. Words could have been identified as cognates even when the users did not recognise them as such.

Concerning the MAE score, there are two features that do not improve the prediction quality for all languages. The number of senses makes predictions worse for the English-French language pair. A possible reason for this is the high maximum number of senses in French, which is 103 senses for the word "donner". To compare, the maximum value for Spanish is 34, that for Portuguese is 32, that for Italian is 38 and the maximum number of senses in the English data is 75. This causes the distribution of the number of senses in French to be spread out more than in the other languages, which means a smaller percentage of French words will fall in the range of 1-4 senses. The cognate status worsens the MAE score for Portuguese students learning English, but not the other way around; perhaps the English to Portuguese translations have more inaccuracies.

Although the direction of the effects was not the focus of this research, it is possible to look at the feature weights to discover whether the features increase or decrease the difficulty. Considering the differences in implementation between features, it is not possible to compare them directly. Aside from some of the morphosyntactic features, all features received a positive weight. This indicates that cognates, words with a high number of senses and words with a high frequency are easier to learn, which matches the effect described in the literature. The positive weight for the phonetics and wordlength feature indicates that words with a higher word-phoneme ratio or a longer length would be easier, which does not match the literature. The weights for the morphosyntactic features mark nouns, singular words and masculine words as the three easiest categories, while words in the future tense, auxiliary verbs (helper verbs) and translations of "to have" are the three most difficult categories. The feature weights will also have been influenced by the high number of perfect recall rates, so these values should not be taken as fact.

8 Conclusion

This research has aimed to answer the following question: What is the effect of different lexical features on the difficulty of foreign vocabulary words? Six types of features were studied: morphosyntactic features, cognates, number of senses, phonetics, word length and frequency. These features were implemented into a logistic regression model, which uses data from Duolingo to predict the recall rate of a word by a specific student. The results show that all features positively affect the prediction quality of the recall rate and that morphosyntactic features, number of senses, phonetics, word length and frequency improve the ranking quality of the predictions. This indicates that these features could be used to improve the presentation order of vocabulary words and to find the optimal time between practice sessions for language learning programs. There are some major differences in the effect of different features on different language pairs, especially regarding the ranking quality. These differences and their causes could be studied further in the future. Furthermore, future research could focus on additional features, such as the concreteness of words.

References

Baddeley, A. D., Thomson, N., & Buchanan, M. (1975). Word length and the structure of short-term memory. Journal of Verbal Learning and Verbal Behavior, 14(6), 575–589.

Beinborn, L., Zesch, T., & Gurevych, I. (2016). Predicting the spelling difficulty of words for language learners. In Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications (pp. 73–83).

Bond, F., & Paik, K. (2012). A survey of wordnets and their licenses. In Proceedings of the 6th Global WordNet Conference (Vol. 8, pp. 64–71). Matsue.

Brysbaert, M., Buchmeier, M., Conrad, M., Jacobs, A. M., Bölte, J., & Böhl, A. (2011). The word frequency effect: A review of recent developments and implications for the choice of frequency estimates in German. Experimental Psychology, 58, 412–424.

Chen, C.-M., & Chung, C.-J. (2008). Personalized mobile English vocabulary learning system based on item response theory and learning memory cycle. Computers & Education, 51(2), 624–645.

De Groot, A., & Keijzer, R. (2000). What is hard to learn is easy to forget: The roles of word concreteness, cognate status, and word frequency in foreign-language vocabulary learning and forgetting. Language Learning, 50(1), 1–56.

DeKeyser, R. M. (2005). What makes learning second-language grammar difficult? A review of issues. Language Learning, 55(S1), 1–25.

Dela Rosa, K., & Eskenazi, M. (2011). Effect of word complexity on L2 vocabulary learning. In Proceedings of the Sixth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 76–80).

Hauer, B., & Kondrak, G. (2011). Clustering semantically equivalent words into cognate sets in multilingual lists. In Proceedings of the 5th International Joint Conference on Natural Language Processing (pp. 865–873).

Huynh, D., Zuo, L., & Iida, H. (2016). Analyzing gamification of "Duolingo" with focus on its course structure. In R. Bottino, J. Jeuring, & R. Veltkamp (Eds.), Games and Learning Alliance. GALA 2016. Lecture Notes in Computer Science (Vol. 10056, pp. 268–277). Springer.

Inkpen, D., Frunza, O., & Kondrak, G. (2005). Automatic identification of cognates and false friends in French and English. In Proceedings of the International Conference Recent Advances in Natural Language Processing (pp. 251–257).

Koirala, C. (2015). The word frequency effect on second language vocabulary learning. In F. Helm, L. Bradley, M. Guarda, & S. Thouësny (Eds.), Critical CALL: Proceedings of the 2015 EUROCALL Conference, Padova, Italy (pp. 318–323).

Laufer, B. (1990). Ease and difficulty in vocabulary learning: Some teaching implications. Foreign Language Annals, 23(2), 147–155.

Lison, P., & Tiedemann, J. (2016). OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation.

Montalvo, S., Pardo, E. G., Martinez, R., & Fresno, V. (2012). Automatic cognate identification based on a fuzzy combination of string similarity measures. In International Conference on Fuzzy Systems (pp. 1–8).

Mubarak, R., & Smith, D. C. (2008). Spacing effect and mnemonic strategies: A theory-based approach to e-learning. In e-Learning (pp. 269–272).

Pimsleur, P. (1967). A memory schedule. The Modern Language Journal, 51(2), 73–75.

Reynolds, B. L., Wu, W.-H., Liu, H.-W., Kuo, S.-Y., & Yeh, C.-H. (2015). Towards a model of advanced learners' vocabulary acquisition: An investigation of L2 vocabulary acquisition and retention by Taiwanese English majors. Applied Linguistics Review, 6(1), 121–144.

Settles, B., & Meeder, B. (2016). A trainable spaced repetition model for language learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (pp. 1848–1858).

Xu, Q., Chen, A., & Li, C. (2015). Detecting English-French cognates using orthographic edit distance. In Proceedings of the Australasian Language Technology Association Workshop (pp. 145–149).

A Morphosyntactic features lexeme tags

Lexeme Tag                Meaning
@compound_past            compound past
@cond_perfect             conditional perfect
@formal                   formal
@future                   future
@future_perfect           perfect future
@future_phrasal           phrasal future
@ger_past                 past gerund
@obj                      object
@passive                  passive
@past                     past
@past_cond                past conditional
@past_inf                 past infinitive
@past_perfect             past perfect
@past_subjunctive         past subjunctive
@pluperfect               pluperfect
@pos                      possessive
@present_perfect          present perfect
@ref                      reflexive
@subjunctive_perfect      perfect subjunctive
@subjunctive_pluperfect   subjunctive pluperfect
aa                        animate
acr                       acronym
adj                       adjective
adv                       adverb
an                        animate or inanimate
ant                       anthroponym
apos                      apostrophe
cni                       conditional
cnjadv                    conjunctive adverb
cnjcoo                    co-ordinating conjunction
cnjsub                    sub-ordinating conjunction
comp                      comparative
def                       definite
dem                       demonstrative
det                       determiner
dim                       diminutive
enc                       enclitic
f                         feminine
fti                       future indicative
gen                       genitive
ger                       gerund
ifi                       past definite
ij                        interjection
imp                       imperative
itg                       interrogative
loc                       locative
m                         masculine
mf                        masculine or feminine
n                         noun
nn                        inanimate
np                        proper noun
nt                        neuter
num                       numeral
obj                       object
ord                       ordinal
p1                        first person
p2                        second person
p3                        third person
past                      past
pii                       imperfect
pis                       imperfect subjunctive
pl                        plural
pos                       possessive
pp                        past participle
pprs                      present participle
pr                        preposition
preadv                    pre-adverb
predet                    pre-determiner
pres                      present
pri                       present indicative
prn                       pronoun
pro                       proclitic
pron                      pronominal
prs                       present subjunctive
qnt                       quantifier
ref                       reflexive
rel                       relative
sg                        singular
sint                      synthetic
sp                        singular or plural
subj                      subject
sup                       superlative
tn                        tonic (stressed form)
vaux                      auxiliary verb
vbdo                      'to do' verb
vbhaver                   'to have' verb
vblex                     standard verb

B Morphosyntactic features weights

Feature                   Weight
n                         0.1791
sg                        0.1559
m                         0.1048
pl                        0.1003
vblex                     0.0988
f                         0.0889
adj                       0.0880
pri                       0.0845
ij                        0.0691
p1                        0.0689
pres                      0.0524
adv                       0.0516
p3                        0.0512
prn                       0.0422
mf                        0.0413
p2                        0.0355
tn                        0.0332
sp                        0.0314
itg                       0.0313
sint                      0.0313
subj                      0.0291
det                       0.0240
vbser                     0.0224
pos                       0.0223
def                       0.0217
cnjadv                    0.0175
pron                      0.0157
prs                       0.0155
pr                        0.0151
gen                       0.0149
vbmod                     0.0134
num                       0.0110
cnjcoo                    0.0108
nt                        0.0107
apos                      0.0096
np                        0.0078
loc                       0.0077
inf                       0.0060
ger                       0.0054
pp                        0.0051
ord                       0.0050
cnjsub                    0.0047
preadv                    0.0047
obj                       0.0042
qnt                       0.0027
sup                       0.0026
cni                       0.0023
acr                       0.0022
ifi                       0.0019
@future_perfect           0.0016
enc                       0.0015
fti                       0.0014
comp                      0.0013
aa                        0.0013
@subjunctive_pluperfect   0.0012
pis                       0.0009
ant                       0.0008
dim                       0.0006
predet                    0.0005
@formal                   0.0005
@ref                      0.0005
@past_inf                 0.0003
rel                       0.0002
imp                       0.0001
@obj                      0.0001
@passive                  -0.0000
@past_subjunctive         -0.0002
@ger_past                 -0.0002
@pos                      -0.0002
@subjunctive_perfect      -0.0007
@pluperfect               -0.0013
@past_cond                -0.0013
pii                       -0.0022
@past                     -0.0026
@past_perfect             -0.0029
vbdo                      -0.0036
@cond_perfect             -0.0040
nn                        -0.0047
pprs                      -0.0052
past                      -0.0066
@compound_past            -0.0069
@present_perfect          -0.0078
an                        -0.0083
ref                       -0.0084
pro                       -0.0117
dem                       -0.0122
@future                   -0.0142
vaux                      -0.0161
