
Improving automatic correction of article errors

Rutger Kraaijer
10382259
Bachelor thesis
Credits: 18 EC

Bachelor Opleiding Kunstmatige Intelligentie

University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor
Dr. T. Deoskar

Institute for Logic, Language and Computation
Faculty of Science
University of Amsterdam
Science Park 107
1098 XG Amsterdam


Abstract

In this thesis, possible improvements to the automatic correction of article errors (the, a, an) by means of a classification algorithm are evaluated. Two distinct problems are examined: representing the head noun as a vector, and quantifying information around a noun phrase. The first is addressed by using a clustering algorithm on word embedding vectors to group similar words, reducing the feature dimensionality of the head noun representation; the second by considering the frequency of the head noun in previous sentences. The clustering method shows a small positive influence on the classifier. The context feature did not show a significant impact.


Contents

1 Introduction
  1.1 Words as a feature
  1.2 Word categories
  1.3 Context
  1.4 Thesis overview
2 Literature review
  2.1 Word categories
  2.2 Context
3 Methodology
  3.1 Analysis of the data
    3.1.1 Noun frequencies
    3.1.2 Baseline
    3.1.3 Cutoffs
    3.1.4 Mass nouns
  3.2 Features
    3.2.1 Reference
    3.2.2 Word categories
    3.2.3 Context
  3.3 Pre-processing
    3.3.1 Extracting from corpus
    3.3.2 Tagging
    3.3.3 Parsing
    3.3.4 Frequency analysis
    3.3.5 Clustering
    3.3.6 Feature creation
  3.4 Machine learning
4 Results
  4.1 Building a reference model
  4.2 Cutoff influence
  4.3 Word categories
  4.4 Context
5 Conclusion
6 Discussion
7 Future research
8 Acknowledgements

1 Introduction

Over a billion people worldwide speak English as their second language, and this number is constantly growing. Native speakers of many languages have already become used to automatic feedback on their writing regarding typing mistakes or grammatical errors like subject-verb agreement, but this has not yet been the case for those learning a language.

Mistakes that native speakers most commonly make are syntax related. Since the underlying structure of syntax is driven by concrete rules, correcting these automatically is comparatively easy. There exist many more types of errors, however. Results from an evaluation on the Cambridge Learner Corpus by Tetreault, Leacock, and CTB [14] are partially displayed below, showing the three types of mistakes language learners most often make. Automatic correction of any of these can be used to improve automatic feedback for learners, but may, for example, also be useful for improving the output of Machine Translation applications.

1. Word choice 20%
2. Preposition 13%
3. Article 12%

To correct a mistake, a system needs to be able to 'know' the correct answer, similar to how a person does. The process of finding the right answer becomes more difficult in the cases listed, however. Even though an incorrect word may not lead to a grammatically incorrect sentence, the contents of a message may be completely different from what was actually meant. Consider a native Spanish speaker wanting to say that he likes your clothes. Instead, he says "I like your ropes". The Spanish word for clothing is ropa, which coincidentally resembles the English word rope. This error is more complex to detect than syntax errors, because the sentence itself is completely valid.

This thesis will focus on the third most frequent error, which concerns errors made in article choice. The English articles are the, a and an, and can be split into two types: the definite the and the indefinite a and an. For example, a person may have knowledge, but more specifically, have a knowledge of English. Most people would say that they enjoy the bright sun, except for the rare occasion someone is observing one in a different solar system. Learners have to determine the best option every time they write a noun, which is especially difficult if the construct is not familiar from their native language; examples of such languages are Japanese, Chinese, Korean, and Russian.

Machine learning algorithms are a common solution to automatically detecting article errors. The development roughly revolves around finding a suitable algorithm and selecting appropriate features, which mainly provide semantic information about the words in the relevant context of the problem. In this situation, the algorithm should be a classifier, because it is constantly choosing between a selection of options. Since a and an are syntactically the same word and the occurrence of no article is also a valid option, the options for a classifier are the, a or an, and null, with the latter describing the omission of an article. The task of determining the best option between a and an can be fulfilled in a post-processing step and is not an issue for the task at hand.

1.1 Words as a feature

Systems that try to predict the correct article require semantic information about the words relevant for determining which article fits best in a given instance. The most informative feature has been shown to be the noun itself. However, representing a noun (or more generally, words) as a feature is typically done by using a one-hot vector. This is an N-dimensional vector which is able to represent N words by using a one-to-one mapping of a word to one of the cells in the vector. All cells in this vector are 0, except for the one representing the word, which has the value 1.
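As an illustration, a minimal sketch of such a one-hot encoding is given below; the small example vocabulary and the function name are illustrative and not taken from the thesis code.

import numpy as np

def one_hot(word, vocabulary):
    # Return a one-hot vector for `word` given an ordered vocabulary list.
    # Words outside the vocabulary map to the all-zero vector, which is how
    # out-of-vocabulary nouns end up unrepresented by this feature.
    vec = np.zeros(len(vocabulary), dtype=np.int8)
    if word in vocabulary:
        vec[vocabulary.index(word)] = 1
    return vec

# Example: a (tiny) vocabulary of the most frequent nouns
vocabulary = ["time", "people", "way", "year", "government"]
print(one_hot("year", vocabulary))      # [0 0 0 1 0]
print(one_hot("aardvark", vocabulary))  # [0 0 0 0 0]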

Word occurrences are distributed in such a way that a relatively small number of words covers a large share of actual usage, while a large number of words make up the infamous 'long tail'. Determining the size of the one-hot vector therefore inevitably results in a compromise. When the speed of an algorithm is a concern, the size should be kept low, but this results in a large portion of words being left out. Trying to represent as many words as possible quickly leads to a vector size in the millions, harming both training and testing speed.

1.2 Word categories

In this thesis an approach is proposed that aims to reduce the negative consequences of limiting the size of one-hot vectors for words. The concept is based on the assumption that many low-frequency words are semantically similar to more frequent words and may be used in the same manner as well. If it is possible to collect these relations, a feature can be constructed that allows the more frequent words to pose as an 'example' for the semantically similar, but low-frequency ones.

To formalise this, the following question is posed:

How can word embedding vectors be used to categorise semantically similar words so low-frequency words can be represented without explicit encoding?

As stated, a system that can learn across word categories may perform better in determining the correct article, since it has gained more information thanks to the similar words in the same category.

1.3 Context

Other research has carefully shown that context, which is information outside the sentence in which a noun appears, is a possible feature that may prove useful. Only evaluating the sentence in which the noun lies can be compared to joining a conversation at a bar that started a while ago. Without knowledge of what has already been discussed, and only hearing "... dog is barking every morning", there is no way of knowing whether the speakers are annoyed by a dog whose owner they do not know, or by the dog of the neighbours.

No implementation of context has yet been explored. A possible way of representing context as a feature will also be evaluated in this thesis. The question corresponding to this is:

How can context be incorporated for improving correction of article errors?

It is expected that there are functioning methods to be found for interpreting context as features for a classifier, as other research also expects. Finding the optimal way of doing so requires exploring the concept and evaluating different solutions.

1.4 Thesis overview

This thesis will explore a possible implementation of expressing word categories and a method for representing context, to be used as features for a classifier. To represent the word categories, mappings resulting from a word embedding algorithm are used. The technique of word embedding results in a mapping from words to vectors in a low-dimensional space. These mappings are obtained by a neural network-based method and require a large (unannotated) corpus. The vectors representing words are then clustered, which results in groups (clusters) of words that share meaning. To represent context, the occurrence of head nouns in previous parts of the text is taken into account.

To evaluate the utility of the new features, a reference model is developed. This model aims to represent current research and to provide a more realistic view of the possible increase that the new features may bring.

The process is elaborated on further in the following sections. Previous research is discussed in the Literature review to show what has already been achieved in the field of article error correction. The Methodology section provides an in-depth overview of the steps taken to produce the features that are used to construct the reference model and the features that represent the proposed ideas. Finally, in Results, different models trained on combinations of feature sets are evaluated alongside each other to review the possible impact of the newly proposed ideas.

2 Literature review

There is no definite method which performs better than others when faced with the task of generating the most fitting article. Machine learning techniques are used most often; however, other approaches have also been shown to produce comparable results.

For example, research by Turner and Charniak [16] used a language model (developed earlier by Charniak [2]) that was based on a parser. This model was trained on the Penn Treebank and on the North American News Text Corpus. Their approach resulted in a significant improvement over previous research: 86.6% accuracy over 83.6%. However, it is not clear how the classes (null, the, a/an) were distributed in their data. Without this information, it is difficult to compare their approach to others'.

A study that used machine learning for article selection is by Han, Chodorow, and Leacock [6]. They used a maximum entropy classifier, which was trained on features extracted from noun phrases. The noun phrases (NPs) were gathered using a maximum entropy noun phrase chunker (provided by Thomas Morton, http://opennlp.sourceforge.net/), which in turn was run on part of speech (POS) tagged sentences from the MetaMetrics corpus. The features that were constructed focussed on the words inside and partly around the NPs and represented lexical and syntactic information. Examples of this are the POS tags of all the words in the NP and the POS tag of the word following the NP.

Another feature that was used by Han et al. was the countability of the head noun. This feature is based on the fact that uncountable (mass) nouns rarely require an article, so the null article is very likely to be the correct option. This feature is implemented for the reference model; details on the implementation can be found in Section 3.1.4.

The classifier scored an accuracy of 83%. In the data used, the null article occurred 71.8% of the time. This measure was initially used to compare the benefit gained by individual features, but it also provides a helpful tool for comparing results from different approaches.

Another study that applied machine learning to both the article and the preposition problem, which is another frequently made mistake, is by De Felice and Pulman [4]. In a preceding paper [5], the pre-processing steps are discussed: data from the British National Corpus is processed by a pipeline of scripts that produces "stemmed words, POS tags, named entities, grammatical relations, and a Combinatory Categorial Grammar (CCG) derivations of each sentence." [5]. These outputs are then used as input for another script which creates vectors for each sentence and populates them with 0s and 1s according to the value of the feature at each position. These vectors are afterwards combined into a format to be processed by a machine learning algorithm.

The features implemented in this study are similar in design to those of the previously discussed study by Han et al., but rely on more "manual labour". The authors express the belief that if the correct syntactic and semantic information is made available to an algorithm, the underlying structure should provide an "acceptable" accuracy. Similarly to Han et al., information from the NP is gathered, but in this case not the POS tags but more specific semantic information. For example, the head noun is not represented by its POS tag, but by its plurality, whether it is a named entity, and information regarding possible adjectives and prepositions in the NP.

The performance of the classifier was a significant improvement over previous work, including the two previously mentioned studies. With a baseline of 59.8%, their classifier achieved an accuracy of 92.2%. Again, note that the baseline refers to the most occurring option, which is also null.

Because the study by De Felice et al. achieved the highest accuracy found in the literature on the subject, nearly all features discussed in their paper are implemented for use in the reference model; these are described in Section 3.2.

As also discussed in the introduction, these systems are generally built to aid language learners. The goal of the two previous studies was therefore not only to perform well on the native-language corpora on which their classifiers were trained, but also to work well on text written by language learners. However, this is not necessary to take into account for this thesis, since it focusses on the utility of the features themselves. It is assumed that improving correction on systems for native language will also positively impact the performance on learner language.

2.1 Word categories

A notable feature that was used by De Felice et al. was the category of the head noun in WordNet. This feature provides semantic information about words by showing membership of the word in one of the 26 WordNet categories. The feature itself did not provide any noticeable influence on the classifier, and its significance was even questioned in the thesis following up on the original article [3].

Even though the feature did not work favourably, the idea of utilising a source that contains information about which words are similar has led this thesis to explore the concept further as a basis for constructing the aforementioned word categories.

2.2 Context

A separate study by Lee et al. examined the human capability of utilising context in the task of determiner selection. They found that "although the context substantially reduces the range of acceptable answers, there are still often multiple acceptable answers given a context" [7]. Even though this result does not seem to fully guarantee the correct selection of a determiner, implementing context has been shown to substantially improve the process. To illustrate this, consider an example situation provided in their paper:

Given the sentence "He drove down to the paper's office and presented ... lamb, killed the night before, to the editor", the task of selecting the most fitting article is quite difficult. Without any knowledge of the context, there is no definite answer available. Even though null does not fit, both the and a are possible answers. Keeping these options in mind, consider the preceding sentence: "Three years ago John Small, a sheep farmer in the Mendip Hills, read an editorial in his local newspaper which claimed that foxes never killed lambs." This sentence does not specify any particular lamb and discusses the more general category lambs, which indicates that the correct answer is a rather than the.

The implementation of context for this problem has not been researched, however the idea has been briefly discussed in the study by De Felice and Pulman [4]:

“We plan to experiment with some simple heuristics: for example, given a sequence ‘Determiner Noun’, has the noun appeared in the preceding few sentences? If so, we might expect the to be the correct choice rather than a.”

3 Methodology

This section gives an overview of all steps taken to handle the data, of the features and how they are created, and of the method used for classification.

3.1 Analysis of the data

In the following sections, details about the data used for feature creation are discussed. This concerns the proportion of nouns in the used corpus, the frequency of the different articles, and the method of dividing the data into groups for use in the evaluation process.


Figure 1: The frequency of nouns in the British National Corpus (log scale on x and y axes)

3.1.1 Noun frequencies

Figure 1 shows a histogram of the frequency of nouns in the British National Corpus (BNC) [13]. Due to the steep curve on a linear scale, a log scale is necessary to show the information in a readable format. The BNC is used in this thesis as a resource for acquiring English sentences and contains nearly 400,000 unique nouns, with a total usage count of slightly over 25 million. As can be seen in the figure, the mass (blue surface area in the histogram) of words gravitates heavily towards the leftmost region of the total word count. To illustrate this effect more clearly, more detailed numbers can be found in Table 1.

Noun %             10    20    30    40    50    60    70    80    90    100
Mass in millions   24.0  24.5  24.7  24.9  24.9  25.0  25.0  25.1  25.1  25.1
Mass %             95.3  97.6  98.5  98.9  99.2  99.4  99.6  99.7  99.9  100

Table 1: The mass and corresponding percentage of noun usage compared to the percentage of most frequent words used

3.1.2 Baseline

In the Literature review, the baseline, i.e. the percentage of the null article in the testing data, is used to make comparisons between results easier. The same is done for the British National Corpus, of which the results can be found in Table 2. These results are based on the whole of the BNC, but due to computational (and time) limitations, the models will not be trained on all feature vectors. Because of this, the baseline may differ from the values presented here. The percentages are only based on sentences containing at least one NP, which resulted in 4,927,667 sentences containing a total of 10,841,032 NPs. This results in an average of 2.2 NPs per sentence.

Article   Occurrence
Null      59.7%
The       27.2%
A/an      13.1%

Table 2: Occurrence of articles in the BNC

3.1.3 Cutoffs

Since, due to the 'long tail', not all nouns can be represented as a feature, a metric is used that quantifies how words are split based on their frequency in the BNC. The words are split into three 'ordinarity groups', marked by two cutoffs. These groups are 'common', 'rare' and 'noise', which are the result of the 'rare cutoff' and the 'noise cutoff'. All common words will be represented by the one-hot vector, and both the common and rare groups will be represented via the clusters, as described in more detail in Section 3.3.5. The noise group will not be represented by either feature. Not representing these words is thought to reduce training on words that are unlikely to be of significant use and will not occur in real-world situations. A visual representation of how the data is split using these cutoff values can be found in Table 3.

Frequency   Ordinarity
0 – 4       noise
5 – 9       rare
10 – n      common

Table 3: An example of word category distribution with noise cutoff 5 and rare cutoff 10


3.1.4 Mass nouns

To be able to make a distinction between mass and count nouns, a method has been devised that determines which nouns are countable and which are not. This information as a feature is very descriptive, since mass nouns cannot be used with all articles. For example, scissors, meat, and English do not have both a singular and a plural form, and the indefinite article a/an cannot be used with any of these. The feature, although informative, is not a perfect heuristic. Some words, like water, are generally regarded as a mass noun (you cannot put two water(s) in a bucket), but can still be used with a (water can also represent a glass of water).

To identify the mass nouns among the others, all nouns in the BNC are counted. To estimate which words share a common base, a stemmer is used, in this case the Snowball stemmer as made available by the NLTK package [1]. Using the stem, all variations of the same word can be grouped together. Per stem an evaluation is made: if a certain form of a word occurs significantly more often than the other forms, the word itself can be regarded as (practically) a mass noun. The thresholds have been estimated empirically, by ensuring that some known mass nouns are marked as such. Words are saved as being mass if one form takes up more than 90% of the base concept's frequency, but only if the total frequency (the sum of the frequencies of all words with the same stem) is above 10. If this were not done, many rare words could be incorrectly flagged as being a mass noun.

3.2 Features

3.2.1 Reference

The features used for the reference model are based on the research done by De Felice and Pulman [4] and Han, Chodorow, and Leacock [6], as previously discussed. These are:

• Head noun vector

The one-hot vector representing the head noun. To construct this vector, the output described in Section 3.3.4 is used; in this case, the list of nouns is consulted. Based on the rare cutoff, this list is truncated to only keep the words up to that index. Since the word list is placed in descending order of word frequency, the most frequently occurring words remain. A vector of zeros of this same size is constructed, and the index at which the head noun occurs in the list becomes a 1 in the vector. The value at which the cut is made is varied for the head noun.

• Head noun noise

A flag feature based on the frequency of the head noun. If the word falls in the noise category, the word is flagged as such.

• Head noun number

Marks if the head noun is singular or plural. This information is gathered using the POS tag of the word, which holds separate tags for these cases. The POS tags for singular nouns are NN and NNP; plural nouns are denoted with NNS and NNPS.

• Head noun type

Gathered from the output described in Section 3.3.4, marks if the head noun is a count or a mass noun.

• Head noun named entity

Marks if the head noun is a concept, like a person or place. This information is also gathered using the POS tag of the word, namely the NNP and NNPS tags.

• Preposition modification

Marks if a preposition is used inside the noun phrase. This is done by checking if one of the words in the noun phrase is tagged as such. The corresponding POS tag is IN.

• Preposition modification vector

The one-hot representation of the found preposition. How this vector is built is explained at the Head noun vector feature. A significant difference between nouns and prepositions is the required cardinality of the vector. Since there are orders of magnitude fewer different prepositions than nouns, all prepositions used in the BNC can be represented. The cardinality of this vector (and thus the total number of unique prepositions in the BNC) is 1,701.

• Head noun object of prep

Marks whether the head noun is the object of the found preposition. This is the case if the first noun after the preposition is the head noun.

• Adjective modification

Marks whether an adjective modifies the head noun. This is determined by checking if an adjective occurs in the noun phrase, before the head noun.

• Adjective modification vector

The one-hot encoding representation of the found adjective. How this vector is built is explained at the Head noun vector feature. Like nouns, there exist many adjectives (close to 140,000 in the BNC). Because of this, the cardinality of this vector is set to 4,000. This allows 90% of all adjective usage to be represented, but due to the 'long tail', only 2.9% of all possible words are covered.

• Adjective grade

Marks whether the adjective is in its base, comparative (good → better), or superlative (good → best) form. The grade of the adjective can be read from the POS tag, which denotes the adjective, comparative and superlative as JJ, JJR, and JJS respectively.

• Noun phrase relation

Marks if a noun phrase is the object or subject of a verb. This is tested by checking the siblings of the noun phrase in the tree structure. If a verb can be found in one of these, the NP is marked as the object; otherwise, it is marked as the subject.

• POS ±3

Marks the three POS tags on either side of the article, if one exists. If the article is null, the POS tags around the head noun are given instead. There are 36 POS tags specified in the Penn Treebank POS tagset. To accommodate the possibility of there not being three words on either side, a one-hot vector of size 37 is used to represent a single tag.
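To illustrate the POS ±3 feature, the sketch below concatenates size-37 one-hot vectors for the surrounding tags. Whether the centre position is skipped and the exact ordering of the tag set are assumptions made for the example, not details taken from the thesis.

import numpy as np

# 36 Penn Treebank tags plus one extra slot for "no word at this position"
PENN_TAGS = ["CC", "CD", "DT", "EX", "FW", "IN", "JJ", "JJR", "JJS", "LS",
             "MD", "NN", "NNS", "NNP", "NNPS", "PDT", "POS", "PRP", "PRP$",
             "RB", "RBR", "RBS", "RP", "SYM", "TO", "UH", "VB", "VBD", "VBG",
             "VBN", "VBP", "VBZ", "WDT", "WP", "WP$", "WRB"]
NONE_INDEX = len(PENN_TAGS)  # index 36: position falls outside the sentence

def pos_window_feature(tags, position, width=3):
    # Concatenate one-hot vectors (size 37 each) for the `width` tags on
    # either side of `position` in the tagged sentence `tags`.
    vectors = []
    for offset in range(-width, width + 1):
        if offset == 0:
            continue  # skip the article/head noun position itself (assumption)
        i = position + offset
        vec = np.zeros(len(PENN_TAGS) + 1, dtype=np.int8)
        if 0 <= i < len(tags) and tags[i] in PENN_TAGS:
            vec[PENN_TAGS.index(tags[i])] = 1
        else:
            vec[NONE_INDEX] = 1
        vectors.append(vec)
    return np.concatenate(vectors)

tags = ["DT", "JJ", "NN", "VBZ", "RB", "JJ"]
print(pos_window_feature(tags, 0).shape)  # (222,) = 6 positions x 37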

3.2.2 Word categories

The word categories are represented by being part of a certain cluster. How these clusters are made can be found in Section 3.3.5. All clusters have an index, which allows for the representation of the cluster as a one-hot vector, similar to how the nouns, prepositions and adjectives are represented.

3.2.3 Context

The implementation of context is kept simple: A single value is used for this feature, which is 1 if the head noun of the noun phrase occurs in the previous X sentences. If this is not the case, its value is 0.
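A minimal sketch of this feature is shown below, assuming the preceding sentences are available as whitespace-separated strings; names are illustrative.

def context_feature(head_noun, previous_sentences, window=5):
    # Return 1 if the head noun appears (literally) in any of the last
    # `window` sentences, 0 otherwise.
    recent = previous_sentences[-window:]
    return int(any(head_noun in sentence.split() for sentence in recent))

history = ["Three years ago John Small read an editorial about foxes and lambs .",
           "He drove down to the paper 's office ."]
print(context_feature("lambs", history, window=5))   # 1
print(context_feature("editor", history, window=5))  # 0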

3.3 Pre-processing

Figure 2 provides an overview of the steps in which the pre-processing is divided. In short, the British National Corpus [13] is tagged with Penn Treebank POS tags and parsed using a CCG parser. The parsed trees, together with the result of a procedure that analyses the frequency of certain POS tags, and clusters from the Google News mappings are combined to create feature vectors. These can then be used as input for a machine learning algorithm. The following sections will discuss the implementation of these steps in more detail.

3.3.1 Extracting from corpus

The resource used for the conversion of English language to the final form of feature vectors is the British National Corpus (BNC). This corpus is heavily annotated, which is very useful for many applications, but not in this case. The words in the BNC are provided with POS tags, but these are from the BNC Basic (C5) Tagset and not the Penn Treebank POS tagset, which is required for the parser that will be used at a later point. Therefore, only the sentences in the BNC themselves are useful and need to be extracted, so the required information can be added in the subsequent steps.

The BNC consists of around 4000 files, placed in a three-level hierarchy. The files themselves are in an XML format with tags for all sentences and words, carrying multiple pieces of information. An example of how a sentence is represented can be found in Figure 3. To extract the 'raw' sentences from this structure, the hierarchy is traversed and the files are read. The containing XML structure is parsed and the sentences, without any tags, are saved.

<s n="7">
  <w c5="PNP" hw="i" pos="PRON">I </w>
  <w c5="AV0" hw="particularly" pos="ADV">particularly </w>
  <w c5="VVD" hw="wish" pos="VERB">wished </w>
  <w c5="TO0" hw="to" pos="PREP">to </w>
  <w c5="VVI" hw="clarify" pos="VERB">clarify </w>
  <w c5="AVQ" hw="why" pos="ADV">why </w>
  <w c5="PNI" hw="one" pos="PRON">one </w>
  <w c5="VVZ" hw="study" pos="VERB">studies </w>
  <w c5="NN1" hw="history" pos="SUBST">history </w>
  <mw c5="AV0">
    <w c5="PRP" hw="at" pos="PREP">at </w>
    <w c5="DT0" hw="all" pos="ADJ">all</w>
  </mw>
  <c c5="PUN">.</c>
</s>

Figure 3: Fragment of BNC XML structure for the sentence “I particularly wished to clarify why one studies history at all.” (indented for readability)
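As an illustration of this extraction step, the sketch below walks a BNC directory and strips the XML tags from every <s> element. The directory path is illustrative, and the standard-library ElementTree parser is assumed rather than whatever tooling was actually used.

import os
import xml.etree.ElementTree as ET

def extract_sentences(bnc_root):
    # Walk the BNC directory tree and yield plain-text sentences by
    # concatenating the text of every <s> element (dropping all tags).
    for dirpath, _dirnames, filenames in os.walk(bnc_root):
        for name in filenames:
            if not name.endswith(".xml"):
                continue
            tree = ET.parse(os.path.join(dirpath, name))
            for s in tree.iter("s"):
                # itertext() gathers the text of <w>, <mw> and <c> children
                sentence = "".join(s.itertext()).strip()
                if sentence:
                    yield sentence

# Example usage (path is illustrative):
# for sentence in extract_sentences("BNC/Texts"):
#     print(sentence)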


3.3.2 Tagging

To provide the Penn Treebank POS tags for the extracted sentences, the Stanford Log-linear Part-Of-Speech Tagger [15] is used. The parser requires the tagged sentences to be in a slightly different format than what the Stanford tagger outputs. To accommodate this, another script processes the Stanford output into a readable format for the next step. Sentences are then structured as can be seen in Figure 4.

The|DT mean|JJ density|NN of|IN Mercury|NNP indicates|VBZ that|IN its|PRP$ interior|NN is|VBZ substantially|RB different|JJ from|IN the|DT interiors|NNS of|IN the|DT other|JJ terrestrial|JJ planets|NNS

Figure 4: Example of the tagged sentence "The mean density of Mercury indicates that its interior is substantially different from the interiors of the other terrestrial planets."

3.3.3 Parsing

To parse the tagged sentences into tree structures, a CCG parser called EasyCCG, developed by Lewis and Steedman [8], is used. One of the reasons for this choice is its high speed (150 sentences per second), which is necessary to prevent pre-processing from taking up too much time. How this output is formatted can be seen in Figure 5.

(<T S[dcl] 1 2>
  (<L NP PRP PRP It NP>)
  (<T S[dcl]\NP 0 2>
    (<T (S[dcl]\NP)/(S[pss]\NP) 0 2>
      (<L (S[dcl]\NP)/(S[pss]\NP) VBZ VBZ is (S[dcl]\NP)/(S[pss]\NP)>)
      (<L (S\NP)\(S\NP) RB RB not (S\NP)\(S\NP)>)
    )
    (<T S[pss]\NP 0 2>
      (<L S[pss]\NP VBN VBN transmitted S[pss]\NP>)
      (<L (S\NP)\(S\NP) IN IN from (S\NP)\(S\NP)>)
    )
  )
)

Figure 5: Example of the CCG parsed “It is not transmitted from.” (indented for readability)


The EasyCCG parser produces a tree structure that is not directly readable into a data structure. Additionally, many POS tags created by the CCG parser are not needed for further usage, since many tags hold information about the sentence structure of their children nodes. This information is not necessary for further processing, and to both reduce the size of the final file and make it more human-readable for intermediate evaluation, it is removed in this step. An example of the final format for the tree structures can be found in Figure 6.

['S',
  ['S',
    ['NNP', 'Gim'],
    ['NP',
      ['PRP', 'me'],
      ['NP', ['RP', 'back']]]],
  ['NP',
    ['NP', ['DT', 'that'], ['NN', 'bottle']],
    ['NP',
      ['IN', 'of'],
      ['NP',
        ['NP', ['NP', ['NNP', 'Sainsbury']], ['POS', "'s"]],
        ['NNP', 'Cider']]]]]

Figure 6: Example of the final parsed sentence “Gimme back that bottle of Sainsbury’s Cider.” (indented for readability)


3.3.4 Frequency analysis

To represent nouns, adjectives and prepositions as vectors, a script is used that loops through all trees and counts the words which are a noun, adjective or preposition. The word lists are placed in descending order based on the gathered frequencies. Using this information, the X most occurring words with these tags can be represented as a vector in feature creation. Additionally, as described in Section 3.1.4, the mass nouns are identified and saved as well.
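A minimal sketch of such a counting pass over the nested-list trees (as shown in Figure 6) could look as follows; tag sets and function names are illustrative.

from collections import Counter

NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
ADJ_TAGS = {"JJ", "JJR", "JJS"}
PREP_TAGS = {"IN"}

def count_words(tree, noun_counts, adj_counts, prep_counts):
    # Recursively walk a parsed sentence tree (nested lists as in Figure 6)
    # and count nouns, adjectives and prepositions by their POS tag.
    if len(tree) == 2 and isinstance(tree[1], str):  # leaf: [tag, word]
        tag, word = tree[0], tree[1].lower()
        if tag in NOUN_TAGS:
            noun_counts[word] += 1
        elif tag in ADJ_TAGS:
            adj_counts[word] += 1
        elif tag in PREP_TAGS:
            prep_counts[word] += 1
        return
    for child in tree[1:]:
        count_words(child, noun_counts, adj_counts, prep_counts)

nouns, adjs, preps = Counter(), Counter(), Counter()
count_words(['NP', ['NP', ['DT', 'that'], ['NN', 'bottle']],
             ['NP', ['IN', 'of'], ['NN', 'cider']]], nouns, adjs, preps)
# Most frequent words first, as used for the one-hot index
print([w for w, _ in nouns.most_common()])  # ['bottle', 'cider']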

3.3.5 Clustering

A popular toolkit for the creation of word embedding mappings is word2vec, developed by Mikolov et al. [9]. They also provide pre-trained vectors trained on a part of the Google News dataset (∼100 billion words). The model contains 300-dimensional vectors representing 3 million words [10].

The Google News mappings are read using the gensim library [12], which allows convenient reading of the binary file in which the mappings are stored. The mappings are then converted to another data format, so they can be fed as input to a clustering algorithm. This is done using the scikit-learn library's MiniBatchKMeans algorithm [11]. This algorithm performs very similarly to k-means, except that it works with smaller batches to decrease computation time.
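A hedged sketch of this step is shown below; the vocabulary limit, cluster count and batch size are illustrative choices, not the values used in the thesis.

from gensim.models import KeyedVectors
from sklearn.cluster import MiniBatchKMeans

# Pre-trained Google News vectors (path is illustrative); `limit` keeps memory modest
kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True, limit=500_000)

n_clusters = 1000
kmeans = MiniBatchKMeans(n_clusters=n_clusters, batch_size=10_000, random_state=0)
labels = kmeans.fit_predict(kv.vectors)  # one cluster index per word vector

# Map each word to its cluster index; the index is later one-hot encoded
words = kv.index_to_key  # `index2word` in gensim < 4.0
word_cluster = dict(zip(words, labels))
print(word_cluster.get("lamb"), word_cluster.get("sheep"))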

The number of clusters over which the word vectors are spread can be varied. There are no specific initial assumptions made about this value, but it is expected that more clusters will lead to more specific word categories, and that this specificity of the word groups may have a positive impact on the performance of the feature. Variations are tested and can be found in the Results section (Section 4).

3.3.6 Feature creation

The final step in pre-processing is to collect the data from the parsing trees, word counts, mass nouns, and clusters to construct features from these sources that can be used for machine learning.

A general idea of the construction of the features is outlined in Procedure 1. The process is separated into three parts: extracting information from the noun phrase, the sentence, and the context. However, not all noun phrases are useful for this process. The ones that contain determiners like 'those' or 'every' are sure not to need an article and are therefore marked invalid. These cases are not converted to feature vectors, since they can easily be recognised by the pre-processing steps.

Procedure 1 Overview of feature construction

for all sentenceTree ← file do
    for all nounPhraseTree ← sentenceTree do
        validFeature := checkValid(nounPhraseTree)
        if validFeature then
            getNounPhraseFeatures()
            getSentenceFeatures()
            getContextFeatures()
        end if
    end for
end for

3.4 Machine learning

Due to the large number of feature vectors and their high cardinality, not all data can be trained on at once, since it does not fit in memory. Scikit-learn [11] offers several machine learning algorithms that can be used out-of-core, meaning they do not require all data to be present in memory at once. Instead, they are able to train on multiple 'batches' one after another. The algorithms available with this functionality are:

• Naive Bayes for multinomial models
• Naive Bayes for Bernoulli models
• Perceptron
• Stochastic Gradient Descent
• Passive Aggressive Algorithms

For all results presented in this thesis, the machine learning algorithms are trained on 75% of the data and tested on the remaining 25%. During training, each batch is randomly split into a training and a testing set. The training set is trained upon for a set number of iterations; afterwards, the test set is used to evaluate the classifier at that point. After every batch, especially in the early iterations, the performance of the classifier increases rapidly.

To evaluate the final version of the classifier, a separate evaluation script is used. Since every model is trained on a random sample of the data, this evaluation cannot be done on the same data for every model. Therefore, all intermediate test sets are saved in between the training sessions to be used for this purpose. Because of the randomness of the train-test splitting and the variety of text sources in the BNC, there are no concerns that the model is optimised for the testing data.

To determine which algorithm suits the data best, candidate models need to be trained and compared.
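As a sketch of this out-of-core training loop, assuming batches of pre-built feature matrices and article labels, the multinomial Naive Bayes variant could be trained as follows; the helper name and the per-batch evaluation are illustrative.

import numpy as np
from sklearn.naive_bayes import MultinomialNB

CLASSES = np.array(["null", "the", "a/an"])

def train_out_of_core(batches, test_fraction=0.25, seed=0):
    # Train a multinomial Naive Bayes model batch by batch with partial_fit,
    # holding out part of every batch for evaluation afterwards.
    rng = np.random.RandomState(seed)
    model = MultinomialNB()
    held_out = []
    for X, y in batches:  # each batch: feature matrix and article labels
        mask = rng.rand(len(y)) < test_fraction
        model.partial_fit(X[~mask], y[~mask], classes=CLASSES)
        held_out.append((X[mask], y[mask]))
    accuracy = np.mean([model.score(X, y) for X, y in held_out])
    return model, accuracy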

Apart from the choice of algorithm itself, some initial values can be set to improve performance as well. The process of finding the most suitable algorithm and its parameters is described in the following section.

4 Results

This section is split into four parts. First, a reference model is constructed. Then, the influence of different values for the rare cutoff is evaluated. Finally, using the reference model as a comparison measure, the impact of the word categories is evaluated, followed by the context feature.

4.1 Building a reference model

To start, a model needs to be made that works only on the reference features, which are discussed in Section 3.2.1. Since several machine learning algorithms are available, the first step in establishing the reference model is to determine which algorithm suits the data best. Table 4 shows a comparison of the available algorithms trained on 75,000 items and tested on 25,000.

The head noun is vectorised with a rare cutoff of 10, which results in the head noun vector feature being of size 68,578. This value enables the feature to represent 97.5% of noun usage, but still only 18.6% of possible nouns.

Algorithm                       Accuracy
Baseline                        55.0%
Naive Bayes multinomial         67.0%
Naive Bayes Bernoulli           65.5%
Perceptron                      61.1%
Passive Aggressive Algorithms   60.9%
Stochastic Gradient Descent     65.8%

Table 4: Comparing the performance of different machine learning algorithms on the reference data

The Naive Bayes model for multinomial data achieves the highest accuracy compared to the other options. However, it does not achieve the scores of previously discussed research. There is no definite reason to explain this, but it is clear that the model is predicting based on the features. The baseline in the table again refers to the frequency of the most occurring class, null, and the difference between Naive Bayes and the baseline is significant. Therefore, the Naive Bayes for multinomial data trained on only the reference features will be used for all comparisons when regarding the reference model.

4.2 Cutoff influence

To investigate the impact of the head noun representation for the common category, Table 5 shows an overview of different rare cutoff values, the corresponding mass of noun usage that can be represented, and the accuracy achieved. Note that no baseline is mentioned in this table, since the goal of this test is not to search for the best outcome, but rather to see what shifting the rare cutoff means for performance. The models are trained on 750,000 items and tested on 250,000.

Rare cutoff   Noun mass   Accuracy
10            97.5%       68.1%
50            93.9%       67.9%
1,000         73.7%       67.1%
5,000         47.3%       64.5%
10,000        31.8%       63.2%

Table 5: Accuracy for different rare cutoff values

As can be seen in the table, the higher the rare cutoff is set (and thus the more the represented noun mass decreases), the lower the accuracy of the classifier becomes. A significant difference can be seen between cutoffs 1,000 and 5,000, which decreased the noun mass by 26.4% and lowered the accuracy by a relatively large portion. The difference observed one row above, from cutoff 50 to cutoff 1,000, is relatively small compared to the other decrease: a mass decrease of 20.2% leads to just a 0.8% decrease in accuracy. This is to be expected, though. As observed in the analysis of the data (Section 3.1), word usage is distributed on a near-logarithmic scale. The impact on accuracy is therefore expected to be largest when the noun mass approaches zero.

4.3 Word categories

To test the impact of the word clusters, the reference model is compared to a model trained on both the reference features and the noun category feature. The comparison is made using two rare cutoff values: 10 and 1,000. Even though the lowest possible cutoff value is normally preferred, the higher cutoff is also considered because it creates a larger portion of rare words, which allows the word category feature to be trained and tested on more instances. For both values of the rare cutoff, three models are compared: the reference model as described in previous sections, and two variations of models which have been trained on both the reference features and the word category feature. These differ in the cluster count, which determines the specificity of the categories (more clusters, more specific groups).


The performance of previous models has only been tested on all words in the testing set. However, since the word categories are expected to show the most impact on rare words, which are not represented directly by a one-hot vector, and the percentage of words in this group is relatively low by definition, the accuracy on the different word ordinarities (common, rare, noise) is evaluated separately. The results are displayed in Table 6. The models are all trained on 75,000 items and tested on 25,000.

Model             Rare cutoff   Common   Rare    Noise   All words
Baseline          10            54.0%    71.1%   76.3%   55.3%
                  1,000         51.2%    62.2%   75.7%   55.2%
Reference         10            66.0%    68.6%   77.5%   66.6%
                  1,000         65.2%    63.5%   76.6%   65.4%
Clusters 1,000    10            64.4%    65.1%   76.8%   65.3%
                  1,000         64.4%    65.2%   75.8%   65.2%
Clusters 10,000   10            65.0%    70.2%   78.4%   65.8%
                  1,000         64.4%    66.8%   76.1%   65.6%

Table 6: Accuracy when clusters are used, evaluated on different word ordinarities

From these results a few observations can be made. To start, the baseline percentage seems to increase as words become rarer. A possible explanation is that rarer words are more specific, can be used in fewer different ways, and are therefore less compatible with all articles. However, a more in-depth study will be required to investigate the exact reason behind this.

It also seems that the reference model is using its predictive power on the common words, which may indicate that the other reference features besides the one-hot vector for the head noun are not utilised very well. This would also explain the relatively low accuracy that the model achieves when compared to the models it is based on.

The results of the cluster model are somewhat positive. The accuracy of the models when tested on all word ordinarities stays similar, but when regarding the common words only, the model using 1,000 clusters shows a significantly higher accuracy than the reference and the cluster model with a higher cluster count in the same situation. This result shows promise, but is not sufficient to confirm whether the feature works well in all cases.

4.4 Context

To evaluate the context feature, the baseline and reference models are compared to two models that incorporate context. Results can be found in Table 7. The number after "Context size" is the context size: the number of preceding sentences that are checked for the head noun. The models are trained on 750,000 items and tested on 250,000. A rare cutoff of 50 is used to speed up the evaluation process.

Model            Accuracy
Baseline         55.4%
Reference        67.9%
Context size 1   68.3%
Context size 5   68.4%

Table 7: Accuracy when context is introduced

The models trained with the additional context feature achieved a slightly higher accuracy, but the significance of this is debatable. However, even though the improvement is small, a more sophisticated implementation of the feature is expected to affect the model positively rather than negatively.

5 Conclusion

The reference classifier performed with a lower accuracy than desired, but was useful for the evaluations nonetheless. The steep curve of word usage compared to its occurrence was as noticeable as expected with regard to determining the size of the one-hot vector used to represent a noun.

The individual evaluation on different word ordinarities suggests a possible property of article usage that was not found in previous research: highly frequent words seem to be used with more different articles than infrequent words.

The word category feature seems to provide a small positive impact on representing infrequent words by using a relatively small vector, but the single significant value that was observed is not enough to make a final judgement on the utility of the feature. The implementation of the context feature in this thesis has shown no significant impact.

6 Discussion

It was decided quite early in the project to base the reference model largely on the paper by De Felice and Pulman [4], but due to their features being heavy on semantics, a large portion of the available time for the project was spent on this instead of on the exploration of new features. If faced with the same situation again, I would have preferred to base it on the features used by Han, Chodorow, and Leacock [6], since most of their features were based on the simpler idea of representing all POS tags in a given NP. Perhaps time spent on this goal instead of the conducted plan would have allowed more time to be spent on the new ideas, but at this point it is difficult to make such a prediction confidently.

In the last week of the project, it was discovered that named entities (including names) had been part of the noun set. This was also done by De Felice et al., but would not have been necessary, since Named Entity Recognition can also be used to solve that problem. The Named Entity feature implemented as part of the reference model was also quite a simplified approach, since it relied on the Stanford Tagger to assign the proper noun tag to named entities.

7 Future research

From the results on word categories, it was observed that less frequently occurring words had a higher baseline percentage for the null article. A possible reason for this phenomenon was discussed, but more analytic methods could be used to map the frequencies of different articles to word occurrence. It may be that some relation can be found between these variables.

The context feature has been the least explored feature in this thesis, with only a very basic implementation being evaluated. Only the literal occurrence of the head noun is evaluated in this thesis, but a logical next step would be to also regard similar words that have occurred. Lemmatising or stemming could be used for this, but word embedding vectors could also be used to calculate the cosine similarity between the head noun and previous (head) nouns.
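As a sketch of that embedding-based variant, assuming a gensim KeyedVectors model is available, the similarity lookup could look as follows; the function is illustrative, not an evaluated implementation.

def max_similarity(head_noun, previous_nouns, kv):
    # Highest cosine similarity between the head noun and any earlier noun,
    # using word embedding vectors (kv is a gensim KeyedVectors model).
    candidates = [n for n in previous_nouns if n in kv]
    if head_noun not in kv or not candidates:
        return 0.0
    return float(max(kv.similarity(head_noun, n) for n in candidates))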

The approach for creating word categories could easily be used for other languages as well, for example Dutch. Word embedding is language independent, which makes it possible to find categories in many natural languages.

8 Acknowledgements

I want to thank the supervisor of my thesis, Dr. Tejaswini Deoskar, for her guidance throughout the project and suggestions that pushed me in the right direction when it was needed. The experience of working on this project has been very educational, which can be accredited to her.


References

[1] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O'Reilly Media, Inc., 2009.

[2] Eugene Charniak. "Immediate-head parsing for language models". In: Proceedings of the 39th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2001, pp. 124–131.

[3] Rachele De Felice. "Automatic error detection in non-native English". PhD thesis. University of Oxford, 2008.

[4] Rachele De Felice and Stephen G Pulman. "A classifier-based approach to preposition and determiner error correction in L2 English". In: Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1. Association for Computational Linguistics, 2008, pp. 169–176.

[5] Rachele De Felice and Stephen G Pulman. "Automatically acquiring models of preposition use". In: Proceedings of the Fourth ACL-SIGSEM Workshop on Prepositions. Association for Computational Linguistics, 2007, pp. 45–50.

[6] Na-Rae Han, Martin Chodorow, and Claudia Leacock. “Detecting errors in English article usage by non-native speakers”. In: (2006).

[7] John Lee, Joel Tetreault, and Martin Chodorow. “Human evaluation of article and noun number usage: Influences of context and construction variability”. In: Proceedings of the Third Linguistic Annotation Workshop. Association for Computational Linguistics. 2009, pp. 60–63.

[8] Mike Lewis and Mark Steedman. “A* CCG Parsing with a Supertag-factored Model.” In: EMNLP. 2014, pp. 990–1000.

[9] Tomas Mikolov et al. "Efficient estimation of word representations in vector space". In: arXiv preprint arXiv:1301.3781 (2013).

[10] Tomas Mikolov et al. Google Code Archive: word2vec. July 30, 2013. url: https://code.google.com/archive/p/word2vec/.

[11] F. Pedregosa et al. "Scikit-learn: Machine Learning in Python". In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.

[12] Radim Řehůřek and Petr Sojka. "Software Framework for Topic Modelling with Large Corpora". In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. Valletta, Malta: ELRA, May 2010, pp. 45–50. url: http://is.muni.cz/publication/884893/en.

[13] Oxford University Computing Services. The British National Corpus. url: http://www.natcorp.ox.ac.uk/.

[14] Joel R Tetreault, Claudia Leacock, and McGraw-Hill Education CTB. “Automated Grammatical Error Correction for Language Learners.” In: COLING (Tutorials). 2014, pp. 8–10.

[15] Kristina Toutanova et al. "Feature-rich part-of-speech tagging with a cyclic dependency network". In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1. Association for Computational Linguistics, 2003, pp. 173–180.


[16] Jenine Turner and Eugene Charniak. "Language modeling for determiner selection". In: Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers. Association for Computational Linguistics, 2007, pp. 177–180.
