
Automatically Assessing the Need for Revision of Academic Writing using Text Classification

submitted in partial fulfillment for the degree of Master of Science

Michael Schniepp
12160067

Master Information Studies: Data Science
Faculty of Science, University of Amsterdam

2019-07-05

Industry Supervisor: Dr Georgios Tsatsaronis, Elsevier BV, g.tsatsaronis@elsevier.com
Academic Supervisor: Dr Pengjie Ren, Universiteit van Amsterdam (ILPS), p.ren@uva.nl


Automatically Assessing the Need for Revision of Academic Writing using Text Classification

Michael Schniepp

University of Amsterdam 12160067

michael.schniepp@student.uva.nl

ABSTRACT

The task of automatically identifying erroneously written text can be approached in a variety of ways, depending on the context and on what is to be identified. In this thesis we explore the feasibility of using non-specialized machine learning tooling to detect erroneous text in the context of academic writing. We use a corpus of 72,500 scientific manuscripts, each consisting of an original late-stage rough draft and its professionally edited counterpart. In a task that is typically heavily reliant on statistical parser output, our proposed method effectively classifies erroneous text using widely available tools in a scalable manner. Specifically, we obtain acceptable performance using an ensemble of classifiers in conjunction with raw-text and part-of-speech bag-of-words representations, while also exploring the effectiveness of probabilistic representations and syntactic complexity measurements.

KEYWORDS

Text classification, Automatic language assessment, Sentence classification, Quality assessment

1 INTRODUCTION

Text classification methods can be applied to a variety of useful tasks such as grammaticality detection, question classification, language detection [1], sentiment analysis [2], and quality assessment [3]. Although definitions of quality vary, value can be realized by specifying a definition appropriate to the application, such as the grading of student essays, where commercial applications such as e-rater [4] focus on assigning a grade to a document. The US Navy and other government entities have found use in standardizing acceptable levels of readability for official documentation [5]. Detecting specific grammatical errors in text has also found widespread value in commercial grammar checkers that aid writing and highlight specific errors made by learners of English as a second language [6].

Elsevier B.V. (https://www.elsevier.com/en-gb), a major academic journal publisher, possesses vast quantities of textual data ripe for the application of automatic quality assessment methods to aid authors and editors alike. In contrast, this thesis eschews any attempt to define quality in favor of remaining focused on a simpler, complementary task. The data Elsevier provides comes in the form of roughly drafted academic papers accompanied by professionally edited versions, making this a unique case study in that we train on organically created errors, whereas many similar studies rely on artificially generated errors.

We then frame our assessment of the text as a binary classification task in which we construct a model to differentiate between error-containing text and text that has been deemed acceptable. Many effective text classification applications rely on statistical parser output as features [1], but parsing tens of thousands of documents at scale can prove costly in time and computing resources. To overcome this difficulty we sought a methodology built on a widely available, scalable machine learning toolkit (Apache Spark, https://spark.apache.org/). Using its native tools, we created a methodology that leverages various bag-of-words representations in conjunction with commonly used classification models, while also incorporating manually computed probabilistic representations. Finally, we explore the potential of parser-generated syntactic complexity measures, albeit on a much smaller sample of our dataset. We found compelling results by ensembling individual logistic regressions, each trained on its own raw-text or part-of-speech n-gram bag-of-words representation. Our experiments consistently discriminated between erroneous and non-erroneous text, achieving 72% and 71% accuracy on sentences and paragraphs, respectively.

Our primary contribution is a demonstration of the feasibility of erroneous text detection using non-specialized machine learning tooling. Further value comes from applying our method to a unique dataset of human-generated errors, which offers practical results in a real-world use case.

2 RELATED WORK

Domains such as applied statistical language modeling, grammatical error correction, and automated essay scoring contributed inspiration to our methodology and are described further here.

2.1 Language Models

Language modeling (LM) is a foundational tool in natural language processing (NLP) and generally entails developing a probability distribution over the occurrence of words or strings. Specific adaptations of LMs, such as discriminative language models (DLMs), can be described most generally as linear models consisting of a weight vector associated with a feature vector representation of a sentence. These models have been used for classifying erroneous text, as in the work of Okanohara et al. [7], where a DLM is constructed from Markov-like probabilistic representations of sentences; they further utilize artificially generated negative training examples to enable the model to learn to differentiate between erroneous and non-erroneous text. Cherry et al. [8] use a parser enhanced by latent SVM training, along with the resulting parse tree probabilities, to identify erroneous text. When attempting to classify fine- and coarse-grained ungrammaticality, Post et al. [1] found the best performance on these tasks using top-n parse tree probabilities as described by [9]. Significant differences between these studies and ours include our use of multiple sentence representations rather than only probabilistic representations; moreover, they use pseudo-negative examples whereas we use organically produced errors, which can offer valuable insight for real-world application.

2.2 Grammatical Error Correction

Writing containing grammatical errors or poor syntactic and lexical choices often signals poor writing, and thus we draw inspiration from the field of automatic grammatical error correction (GEC). Many techniques have been applied to the problem, with systems being rule-based, machine learning based, or a hybrid of both [10].

The often-cited ALEK system (Assessing Lexical Knowledge) [11] uses bi-gram and tri-gram information to detect potentially poorly formulated sequences of words, with the intent of detecting incorrect word choice. The system computes the mutual information of a given bi-gram or tri-gram, in which the probability of a sequence and that of its constituent uni-grams are evaluated to detect erroneous sequences. Sun et al. [12] attempt to detect syntactically erroneous sentences using POS tag sequences that permit arbitrary gaps between tags when checking new sequences against a bank of known sequence patterns.

Wagner et al. [13] demonstrate the viability of n-gram probabilities in conjunction with decision tree voting.

Gamon [14] introduces an error-agnostic method that uses POS tag sequences to predict grammatical issues. Using an annotated corpus, he marks each token as inside or outside an error in a sentence (inspired by named entity recognition) and applies a maximum entropy Markov model to predict the inside/outside tags on sentence tokens. Additional features, such as n-gram probabilities, n-gram ratios, and n-gram probability deltas across a window, are also used when predicting a tag.

Rozovskaya and Roth [15] compare machine translation and machine classification approaches to error detection. They describe the machine translation approach as maximizing the probability of a translation (the correct sentence) given an original (erroneous) sentence. Special steps must be taken to adapt such systems to the task of grammar correction [16].

Neural networks have also become popular for the task of error detection; Kaneko et al. [17] use specially tuned embeddings to capture grammaticality.

GEC studies are similar to ours in that incorrect sentences often take the form of incorrect grammar. Our study differs in that we attempt to capture more general forms of error that may not be only grammatical in nature.

2.3 Automated Essay Scoring

Automated essay scoring (AES) is the task of scoring essays written by students in an examination setting, with the aim of alleviating the need for multiple human graders. It is also common for AES systems to leverage techniques from GEC, and many of these studies also rely on pseudo-erroneous training data.

Rudner and Liang [18] use Naive Bayes to classify student essays into score categories, while Higgins et al. [19] assess coherence in student essays by extending the functionality of a discourse-segment-identifying system called Criterion [20]. The software identifies discourse segments typically present in student essays, such as the thesis statement, main ideas, and supporting material. Once a discourse segment is identified, each segment follows a slightly different protocol for evaluating semantic relatedness to other parts of the text, assessing both local and global coherence via vector-based semantic representations. A support vector machine is then used to classify essays as high or low quality in terms of how well the relevant discourse segments are related. The unique aspect of this study lies in the richly annotated student essays.

E-rater 2.0 [4] is an automatic essay evaluation system used by the Educational Testing Service (https://www.ets.org/). The system also works in conjunction with Criterion, which offers features relating to grammar, usage, mechanics, and style, combined in a weighted average.

Pitler and Nenkova [3] utilize a number of lexical, syntactic, and discourse features and combine them in a multiple linear regression to predict the readability of articles. The data used was the Penn Discourse Treebank, which is composed of Wall Street Journal articles taken from the Penn Treebank [21] and hand-annotated for discourse relations on each sentence. The articles had additionally been rated for readability by human readers.

Neural networks have also shown promising application in this domain, such as the LSTM experiments conducted by Alikaniotis et al. [22], where application-specific tuned word embeddings enhanced effectiveness and human-level scoring accuracy was obtained.

We find similarities to AES in that these systems aim to capture and assess a wide variety of characteristics of written language and subsequently score it. In our case we also seek to capture some of these characteristics but rather than score entire documents we aim to classify small pieces of text.

3 METHODOLOGY

Here we describe the techniques used to assess and classify texts. First, the task is described along with the data used. The individual feature sets are also described in detail and finally the examined classifier models are listed.

3.1 Task Description

The primary task of this study is to determine whether we can predict if a piece of text requires revision. That is to say, can we reliably discriminate between erroneous text and edited text that is assumed to be free of errors? We frame this problem as a binary classification task in which, given a piece of unseen text, we mark it as needing revision or not needing revision. The task is performed on both sentences and paragraphs, and the results are compared to determine which level can be discriminated more reliably.


3.2 Feature Sets

In order for text to be machine-interpretable it must be represented in an alternative numerical format. In this study we use a variety of numerical representations of text as well as additional features extracted to describe the composition of the text.

3.2.1 Bag of Words. A bag-of-words (BoW) representation of text was used as the foundation on which classification would take place. Bag-of-words is a numerical representation of text in which a matrix is constructed over a given vocabulary used in a corpus, with rows representing individual documents and columns representing words. Row vectors are then composed of binary or weighted values indicating the presence of a word in the document. In this study we used term frequency-inverse document frequency (TF-IDF) weighting, a feature vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus.

TF-IDF can be described as follows (see the Spark MLlib documentation: https://spark.apache.org/docs/2.2.0/mllib-feature-extraction.html#tf-idf). Denote a term by t, a document by d, and a corpus by D. Term frequency TF(t, d) is the number of times that term t appears in document d, while document frequency DF(t, D) is the number of documents that contain term t. If we only use term frequency to measure importance, it is very easy to over-emphasize terms that appear very often but carry little information about the document, e.g., "a", "the", and "of". If a term appears very often across the corpus, it carries little information about any particular document. Inverse document frequency is a numerical measure of how much information a term provides:

IDF(t, D) = log( (|D| + 1) / (DF(t, D) + 1) ),    (1)

where |D| is the total number of documents in the corpus. Since the logarithm is used, if a term appears in all documents its IDF value becomes 0. Note that a smoothing term is applied to avoid dividing by zero for terms outside the corpus. The TF-IDF measure is simply the product of TF and IDF:

TFIDF(t, d, D) = TF(t, d) · IDF(t, D).    (2)
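To make the weighting concrete, the minimal Python sketch below implements Equations (1) and (2) directly. It is an illustration of the smoothed formulation above, not the Spark implementation actually used in the experiments, and the toy corpus is hypothetical.

```python
import math
from collections import Counter

def tf(term, doc):
    # Term frequency: raw count of `term` in the tokenized document.
    return Counter(doc)[term]

def idf(term, corpus):
    # Smoothed inverse document frequency, as in Equation (1).
    df = sum(1 for doc in corpus if term in doc)
    return math.log((len(corpus) + 1) / (df + 1))

def tfidf(term, doc, corpus):
    # Equation (2): the product of TF and IDF.
    return tf(term, doc) * idf(term, corpus)

# Hypothetical toy corpus of tokenized "documents".
corpus = [
    ["the", "dog", "is", "ran"],
    ["the", "experiment", "was", "repeated"],
    ["results", "were", "significant"],
]
print(tfidf("the", corpus[0], corpus))  # common term, low weight
print(tfidf("dog", corpus[0], corpus))  # rarer term, higher weight
```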

3.2.2 POS Tagging. As an alternative representation of the data, we also parsed and converted text into granular part-of-speech (POS) tags, giving a POS sequence representation for each document. This enables evaluation of the text in a generalized grammatical representation; the purpose is to allow the model to associate incorrect POS sequences with erroneous text labels. For example, the sentence "The dog is ran" yields the POS sequence "DT NN VBZ VBN", which indicates: determiner, singular noun, present-tense verb, past-participle verb. The use of VBZ and VBN in sequence is incorrect here and would most likely not be found in correct English writing. As such, POS tags can assist in identifying erroneous writing.
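As a sketch of how such POS sequences can be produced, the snippet below uses spaCy's fine-grained tagger (the library named later in Section 4.2.4); the exact tags depend on the model version, so the output shown is indicative only.

```python
import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("The dog is ran")
pos_sequence = " ".join(token.tag_ for token in doc)
print(pos_sequence)  # e.g. "DT NN VBZ VBN" for this sentence
```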

3.2.3 N-Grams. One limitation of BoW is the loss of sequential information: word order is not preserved in the vectorized representation. This can mean losing important insight into the grammaticality of a sentence. To counteract this limitation we can also break sentences down into small chunks rather than individual terms. These chunks come in the form of n-length sequences of terms known as n-grams. These short sequences are extracted as sliding windows of length n over a sequence of terms (sentences or paragraphs), so that word order is preserved within short spans. Using n-grams can help identify potentially incorrect or obscure collocations of words or POS sequences. For this study, 1-, 2-, 3-, and 4-grams were computed for both raw text and POS tag sequences. These collections of n-grams, and combinations of them, can be used in the same fashion as the TF-IDF weighted vector representation previously described.
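A minimal sketch of the sliding-window extraction described above, applicable to either word tokens or POS tags; the function name and the example sequence are illustrative.

```python
def ngrams(tokens, n):
    # Slide a window of length n over the token sequence.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

pos_tags = ["DT", "NN", "VBZ", "VBN"]
print(ngrams(pos_tags, 2))  # [('DT', 'NN'), ('NN', 'VBZ'), ('VBZ', 'VBN')]
print(ngrams(pos_tags, 3))  # [('DT', 'NN', 'VBZ'), ('NN', 'VBZ', 'VBN')]
```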

3.2.4 Probability of N-Grams. Inspired by the work of Wagner et al. [13], we further utilized the probability of a given n-gram occurring. The notion is that, given some n-gram of raw text tokens or POS tags, the probability of such a sequence occurring can be used to detect erroneous usage, where abnormally low-probability sequences could indicate erroneous use of language. For example, given some sentence we can compute the minimum occurring probability of its POS 3-grams; for sentences containing erroneous language we would expect a significantly lower minimum occurring probability compared to correct sentences. The probabilities computed for our experiments were: the minimum probability of raw and POS bi-grams and of POS tri-grams, along with the average probability of raw and POS bi-grams and of POS tri-grams.

Probabilities are derived from the construction of a language model where, given a corpus of text, each unique term is counted and divided by the total number of terms of the same type. Thus, given some term (word, POS tag, n-gram, etc.) t, the probability of term t can be expressed as:

p(t, D) = TF(t, D) / |D|_tokens,    (3)

where D is a corpus, |D|_tokens is the length of the corpus in number of tokens of that term type, and TF(t, D) is the term frequency of term t over the entire corpus D.
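A sketch of how such probability features might be derived, assuming the estimate of Equation (3) applied to POS bi-grams; the reference counts shown are hypothetical, whereas in the thesis they were computed from the edited manuscripts.

```python
from collections import Counter

def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

# Hypothetical reference corpus of POS tag sequences (edited text).
reference = [
    ["DT", "NN", "VBZ", "VBG"],
    ["DT", "JJ", "NN", "VBD"],
    ["PRP", "VBP", "DT", "NN"],
]
counts = Counter(bg for seq in reference for bg in bigrams(seq))
total = sum(counts.values())

def bigram_prob(bg, floor=1e-6):
    # Equation (3) over bi-grams; unseen bi-grams get a small floor value.
    return counts[bg] / total if counts[bg] else floor

def probability_features(pos_seq):
    probs = [bigram_prob(bg) for bg in bigrams(pos_seq)]
    return {"min_prob": min(probs), "avg_prob": sum(probs) / len(probs)}

# "VBZ VBN" is unseen in the reference corpus, so min_prob is very low.
print(probability_features(["DT", "NN", "VBZ", "VBN"]))
```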

3.2.5 Syntactic Complexity Measures. To further attempt to reliably differentiate between erroneous and non-erroneous text, we computed a series of syntactic complexity measures as put forth by Lu in his work on automatic complexity analysis [23]. These measures analyze the syntactic complexity of written English samples using 14 different measures covering length of production units, amount of coordination, amount of subordination, degree of phrasal sophistication, and overall sentence complexity (a full listing is given in Table 7 of the appendix). We use these complexity measures under the assumption that erroneous and non-erroneous text may exhibit different patterns of syntactic complexity.

3.3 Classification Framework

The workflow is as follows: for each paragraph or sentence in a document, construct the feature sets to be tested. Once the text has been converted to a numerical representation and the aforementioned feature sets have been computed, we proceed to testing their effectiveness as predictors in the classification task. Specifically, we seek to take a piece of text as input to a classifier and receive a binary output signal of either 0 or 1, with 1 designating erroneous text. The task was performed with four different classifiers, each compatible with the same input structure. The classifiers are listed here:

• Logistic Regression: logistic regression is chosen as the basic classifier and is a demonstrably effective discriminative linear model [24].

• Naive Bayes: Naive Bayes has shown to be very effective in various natural language processing tasks and has been demonstrated as an effective essay scoring tool [18].

• Support Vector Machine: the SVM is also a proven, powerful classifier and is ubiquitous throughout the NLP classification literature.

• Gradient Boosted Decision Trees: GBDTs are ensemble learners that improve on the concept of random forests at the cost of additional computational complexity.

3.3.1 Individual Feature Sets. Eight feature sets were examined individually and compared for their own predictive power, these being:

• TF-IDF uni-grams
• TF-IDF bi-grams
• TF-IDF tri-grams
• POS TF-IDF uni-grams
• POS TF-IDF bi-grams
• POS TF-IDF tri-grams
• N-gram probabilities
• Syntactic complexity

3.3.2 Combining Feature Sets. After the individual assessments we proceeded to combine features. The first method was to concatenate various combinations of the sets into single feature vectors; once combined, the same procedure as for individual feature sets was followed, as sketched below.
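As a sketch of this concatenation step, the snippet below stacks two TF-IDF representations horizontally with SciPy; the vectorizer settings and toy inputs are illustrative and not the exact Spark configuration used in the experiments.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["The dog is ran .", "The experiment was repeated ."]   # raw text
pos_texts = ["DT NN VBZ VBN .", "DT NN VBD VBN ."]               # POS sequences

raw_vec = TfidfVectorizer(ngram_range=(1, 1)).fit_transform(texts)
pos_vec = TfidfVectorizer(ngram_range=(2, 2)).fit_transform(pos_texts)

# Simple concatenation of the two representations into one feature matrix.
combined = hstack([raw_vec, pos_vec])
print(combined.shape)
```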

In addition to simple concatenation, we expected that improved results could be obtained by isolating the signals offered by individual feature sets, as has been shown in previous literature [25, 26]. To best handle individual input signals when combining feature sets, we chose to train an individual classifier on each feature set and combine the subsequent outputs. Success in improving performance by ensembling individual learners has been observed in other, similar applications [27]. Each individual classifier outputs a binary response or a vector of class probabilities, which are concatenated into a new feature vector and then passed into a final logistic regression to combine and weigh each signal. Figure 1 illustrates this ensemble setup.

Figure 1: Ensemble of Multiple Classifiers

As an example, an ensemble of logistic regressions can be expressed mathematically as follows. Consider the logistic regression model parameterized by θ:

h_θ(X) = 1 / (1 + e^(−θ^T X)).    (4)

Given a collection of n logistic regressions trained on n feature sets, we can then compute a vector of probabilities

h = [h_θ1(X), h_θ2(X), ..., h_θn(X)].    (5)

We then pass the vector h through another, uniquely parameterized logistic regression h_θ, giving the final classification y:

y = 1 if h_θ(h) > 0.5, and y = 0 otherwise.    (6)

Through this framework we test and evaluate the effectiveness of this model type on the previously described task.
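The following sketch illustrates this stacked-ensemble idea with scikit-learn standing in for the Spark pipeline actually used; the feature matrices, split sizes, and model settings are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
# Stand-ins for two feature sets (e.g. raw-text and POS n-gram TF-IDF vectors).
X_raw = rng.normal(size=(n, 50))
X_pos = rng.normal(size=(n, 30))
y = rng.integers(0, 2, size=n)

idx_train, idx_test = train_test_split(np.arange(n), test_size=0.3, random_state=0)

base_probs_train, base_probs_test = [], []
for X in (X_raw, X_pos):
    clf = LogisticRegression(max_iter=1000).fit(X[idx_train], y[idx_train])
    # Keep the probability of the positive (erroneous) class as a meta-feature.
    base_probs_train.append(clf.predict_proba(X[idx_train])[:, 1])
    base_probs_test.append(clf.predict_proba(X[idx_test])[:, 1])

# The final logistic regression learns how to weigh each base learner's signal.
meta = LogisticRegression().fit(np.column_stack(base_probs_train), y[idx_train])
y_pred = meta.predict(np.column_stack(base_probs_test))
print("accuracy:", (y_pred == y[idx_test]).mean())
```

In practice, the base-learner probabilities would normally be produced with out-of-fold predictions or a held-out set so that the final classifier does not see probabilities fitted on its own training labels.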

4 EXPERIMENTAL SETUP

The techniques described in Methodology were established with the intent to address the following research questions:

• RQ1: Can the proposed methodology be used to reliably discriminate between erroneous and non-erroneous text?
• RQ2: Of paragraphs and sentences, which is more reliable in determining the need for revision?
• RQ3: Of the classification models described, which performs best on the task?
• RQ4: How does classification performance change with the length of the text?

Experiments were conducted at both the paragraph and sentence level, and the comparison of these results addresses RQ2, while RQ1, RQ3, and RQ4 are addressed at each of these respective levels. The experimental setup and implementation details are described further in the following subsections.

4.1 The Data

4.1.1 Overview. The data used here comes in the form of a corpus of 72,500 scientific manuscripts, that is, scientific papers in the preliminary stages of refinement before being submitted for publication. Each of these manuscripts has been read and edited by professional editors as part of Elsevier's language-editing service (https://webshop.elsevier.com/languageservices/languageediting/). The purpose of the editing effort is to enhance or correct the mechanical usage of the English language, meaning that efforts focus on grammaticality and syntactic correctness rather than on the content of the paper.

4.1.2 Sampling the Data. Computational difficulty was encountered during the extraction of the syntactic complexity measures, which could therefore only be computed for a sample of sentences and paragraphs. Summary statistics are given in Table 1. Feature sets were evaluated against each other within these samples to gauge the effectiveness of the syntactic complexity measures.

4.1.3 The Nature of the Edits. Here we briefly describe the results of an audit of the nature of the edits, these being the context and qualities that differentiate the positive and negative labels. Understanding the patterns in the edits helps in understanding how the text differs between classes. For each edited paragraph, the edit comes in the form of deleted and/or inserted characters, and only a single deletion or insertion is required to assign a positive label. Across all paragraphs we observe an average of 16 edits each. Further, Figure 2 shows the distribution of the length of the inserts and deletes in number of characters, whitespace included.

Figure 2: Histograms Showing Length of Edits in Number of Characters

With 26% of edits being a single character and 96% being fewer than 10 characters, we can see that most of the edits are small changes rather than large overhauls of the original writing. Among single-character edits, spaces are the second most common insertion and the most common deletion, while just under 10% of all single-character edits are punctuation marks. Frequent insertion and deletion of capital letters also indicates that many edits are related to capitalization errors or notation improvements. Three-character edits reveal the word "the" as the most common insertion and deletion, indicating frequent misuse of the article, a common problem in ESL writing [28]. Other three-character edits involve common word endings such as "ing", "ies", and "ion", which can indicate frequent changes in conjugation and other morphological constructions. Larger edits indicate frequent insertion of prepositional phrases and deletion of idioms or other informal language. The edits in this dataset corroborate the description by Dale and Kilgarriff [29] of the types of edits used to improve academic writing, e.g. correcting domain- and genre-specific spelling errors (including casing errors), dispreferred or suboptimal lexical choices, and basic grammatical errors.

4.2 Data Preparation

For each document in the corpus we have two versions, the original un-edited version and the corresponding edited version, thus forming a parallel corpus from which to draw comparisons. Available documents were converted and concatenated into a large dataframe in which each row contains a paragraph or, more generally, text separated by a line break. Accompanying the pairs of pre-edited and post-edited text are columns containing the specific character deletions and insertions made by the editors (provided with string indexes of the characters). Because each paragraph is accompanied by its edits, we could label each original paragraph as needing revision or not (no edits).

4.2.1 Extracting Sentences. From these paragraph pairs we were able to break the text down further into sentence pairs using the spaCy (https://spacy.io/) sentence tokenizer. Sentence alignment was required to match corresponding sentences and the presence of edits; after sentences and edits were aligned, labels could be assigned.

4.2.2 Raw Text Processing. It is important to note that preprocessing techniques such as stop-word removal and lemmatization were not employed in this task. In order to capture small differences between edited texts, it is important to maintain granular detail so that subtle signals can be identified. Moreover, the original dataset contains many rows of fragmented writing in the form of headers, references, titles, and other standard typographical features; accordingly, data filtering was required to capture only body text that could be parsed as complete English sentences. For the paragraph level, various minimum token thresholds were examined and a 50-token minimum was chosen, as this ensured the dataframe contained only complete English writing; the same evaluation was performed for the sentence-level dataframe, resulting in the choice of a 15-token minimum (see Section 6.2 for further exploration). The summary statistics in Table 1 represent the datasets used in the experiments. The maximum values were selected to remove clear outliers in which some rows were clearly entire documents. In removing outliers, no more than 0.004% of the original dataset was removed.
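A sketch of the sentence extraction and token-threshold filtering described above, using spaCy's sentence segmentation; the thresholds follow the values stated in the text, while the alignment of edits to sentences is omitted here.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

MIN_SENT_TOKENS = 15  # thresholds chosen in Section 4.2.2
MIN_PAR_TOKENS = 50

def extract_sentences(paragraph):
    # Split a paragraph into sentences and keep only those long enough
    # to be treated as complete English writing.
    doc = nlp(paragraph)
    return [sent.text for sent in doc.sents if len(sent) >= MIN_SENT_TOKENS]

paragraph = ("We measured the reaction rate at three temperatures and found that "
             "it increased monotonically. The effect was strongest at 80 degrees, "
             "which is consistent with prior work on similar catalysts.")
print(extract_sentences(paragraph))
```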

Stat       Pars       Sents       Pars Sample  Sents Sample
Min        51         16          52           16
Q1         80         20          83           20
Median     121        24          125          24
Q3         184        31          189          31
Max        500        80          500          80
Count      5,146,603  12,235,411  74,605       143,640
Pos Label  48.60%     40.30%      48.65%       40.33%

Table 1: Length of Text in Tokens; Pos Label Represents the Class Distribution as the Percent of Positively Labeled Rows

4.2.3 Labeling Process. To be precise about how the data was labeled and concatenated, the technique is described here. In cases where no edits were made to the text, only one entry was kept, to prevent double entries and increased class imbalance. If a row had edits made to the text, the pre-edit portion was kept and given a positive label to indicate errors, and the post-edited text was given a negative label, indicating correct writing. If a row had no edits, only the post-edited text was kept and assigned the negative label. All text, once labeled, was concatenated vertically into a single column of text alongside the labels. The resulting class balances can be seen in Table 1.
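A sketch of this labeling protocol with pandas, assuming a hypothetical dataframe with pre_edit, post_edit, and has_edits columns.

```python
import pandas as pd

rows = pd.DataFrame({
    "pre_edit":  ["The dog is ran.", "Results were significant."],
    "post_edit": ["The dog ran.",    "Results were significant."],
    "has_edits": [True, False],
})

labeled = []
for _, row in rows.iterrows():
    if row["has_edits"]:
        # Edited rows contribute both versions: erroneous (1) and corrected (0).
        labeled.append({"text": row["pre_edit"], "label": 1})
        labeled.append({"text": row["post_edit"], "label": 0})
    else:
        # Unedited rows contribute only one negative example.
        labeled.append({"text": row["post_edit"], "label": 0})

print(pd.DataFrame(labeled))
```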

4.2.4 Part of Speech Tagging. SpaCy (English model en_core_web_sm) was also used to create accompanying fine-grained part-of-speech tag sequences. The tagger uses Penn Treebank-style annotation tags (https://spacy.io/api/annotation).

4.2.5 N-Gram Probabilities. In accordance with the described methodology, frequencies and subsequent probabilities were computed from scratch using the edited versions of our corpus of scientific manuscripts.

4.2.6 Syntactic Complexity Measures. Syntactic complexity measures were computed by means of the Stanford Parser (https://nlp.stanford.edu/software/lex-parser.shtml) and its Tregex system (a regex-like system used to find patterns in a parse tree). Fourteen measures of syntactic complexity were computed: each sentence was parsed and searched using Tregex patterns, and the matches were counted. Examples include counts of clauses, coordinate phrases, dependent clauses, verb phrases, and more (see Table 7 in the appendix for the full listing).
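As a simplified stand-in for the Tregex-based counting, the sketch below counts verb phrases and clauses in a bracketed constituency parse with NLTK; the parse string is hypothetical, and the real Tregex patterns (e.g. for T-units or complex nominals) are more involved.

```python
from nltk.tree import Tree

# A hypothetical bracketed parse, as produced by a constituency parser.
parse = Tree.fromstring(
    "(S (NP (DT The) (NN dog)) (VP (VBZ is) (VP (VBN ran))))"
)

def count_labels(tree, labels):
    # Count subtrees whose label is in the given set (e.g. clauses, verb phrases).
    return sum(1 for sub in tree.subtrees() if sub.label() in labels)

print("verb phrases:", count_labels(parse, {"VP"}))   # 2
print("clauses:", count_labels(parse, {"S", "SBAR"})) # 1
```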

4.3 Classification Models and Baseline

A bag-of-words (BoW) of raw-text uni-grams was used as the baseline against which experiments were evaluated and compared. This method was selected as the baseline due to its simplicity and because it is commonly the first technique tested in NLP tasks such as sentence classification. Our bag-of-words representation accounts for the 32,768 most frequent terms. This vocabulary size was arrived at by testing vocabulary sizes on the uni-gram baseline: little performance was gained beyond this range, and the exact figure is given as a power of 2 per the recommendation of the Spark documentation (https://spark.apache.org/docs/latest/ml-features.html#tf-idf).

Some experiments evaluate a single feature set, and thus a single classification model was required; in these cases the four models were tested and evaluated accordingly. When combining feature sets we opted for an ensemble of classifiers in which a single model was trained on each separate feature set, and the resulting class probabilities were combined as features for input into a final logistic regression. All reported results used a 70/30 train-test split of the data, 5-fold cross-validation, and grid-search hyper-parameter tuning.
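The tuning setup could look like the following scikit-learn sketch (standing in for Spark's cross-validation tooling); the parameter grid and hyper-parameter values are illustrative, while the scoring choice follows the AUROC-based tuning described in Section 4.4.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

texts = ["the dog is ran", "the dog ran", "results was significant", "results were significant"] * 25
labels = [1, 0, 1, 0] * 25

X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.3, random_state=0)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 1), max_features=32768)),
    ("clf", LogisticRegression(max_iter=1000)),
])
grid = GridSearchCV(
    pipeline,
    param_grid={"clf__C": [0.1, 1.0, 10.0]},  # illustrative grid
    scoring="roc_auc",                        # tuned to maximize AUROC
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```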

4.4 Model Evaluation

Finally, in order to test the effectiveness of these features on the task, individual groups and various combinations of features were composed and tested in pertinent classification models. As this is a binary classification task, traditional binary classification metrics were used in the evaluation of our models:

• Precision
• Recall
• F1 Score
• Accuracy
• Area Under the ROC Curve

Specifically, models were trained to maximize the area under the ROC curve (AUROC) rather than accuracy, as this metric is resilient to skew in the class distribution, but the best models were selected based on the highest accuracy, precision, and recall.
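A minimal sketch of how these metrics can be computed from held-out predictions using scikit-learn's metric functions; the prediction arrays are placeholders.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                   # placeholder gold labels
y_prob = [0.9, 0.2, 0.6, 0.4, 0.1, 0.7, 0.8, 0.3]   # predicted P(erroneous)
y_pred = [int(p > 0.5) for p in y_prob]

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
print("accuracy: ", accuracy_score(y_true, y_pred))
print("auroc:    ", roc_auc_score(y_true, y_prob))  # uses probabilities, not hard labels
```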

4.5 Behavior of Longer Text

Further, we examined how the dataset responded to the classification task when only longer text was kept. Subsets of the full datasets were taken by keeping only text of length n or greater, where n = 0, 50, 100, 200, 300, 500, 700 for paragraphs and n = 0, 15, 25, 50, 100, 150, 200 for sentences. On each of these filtered datasets the uni-gram logistic regression baseline was trained and tested to compare the change in performance.

5 RESULTS & DISCUSSION

In this section we report the results of experiments relating to the primary task and RQ1. Each subsection addresses a specific approach to handling the features used.

5.1 Individual Feature Sets

In the task of classifying a piece of text as erroneous or not, we first explored the predictive power of each set of features individually. The results are presented in Table 2.

Features        Precision  Recall  Accuracy
Sentences
1-grams Raw     0.6054     0.3252  0.6427
2-grams Raw     0.5743     0.2204  0.6200
3-grams Raw     0.5348     0.103   0.6024
1-grams POS     0.545      0.2065  0.6108
2-grams POS     0.5925     0.3132  0.6365
3-grams POS     0.6121     0.3708  0.6518
Probabilities   0.5187     0.1431  0.6011
Paragraphs
1-grams Raw     0.6709     0.5995  0.6712
2-grams Raw     0.6312     0.5207  0.6291
3-grams Raw     0.5796     0.3919  0.5782
1-grams POS     0.4982     0.2470  0.5260
2-grams POS     0.4954     0.3356  0.5254
3-grams POS     0.4966     0.3608  0.5246
Probabilities   0.5415     0.5150  0.5523

Table 2: Individual Feature Set Performance

Modest results were obtained when using individual feature sets, with notable differences between how sentences and paragraphs responded to the techniques applied. Generally, sentences responded better to POS tag n-grams and paragraphs responded better to raw-text n-grams, with paragraphs performing comparatively much worse when using POS tags. With regard to the n-gram features, the uni-gram vocabulary contains over 200,000 unique terms, and the unique vocabulary for larger n-grams grows roughly by the power of n. As such, with higher-order n-grams we find increasingly large vocabularies and therefore sparser vector representations. Because the maximum permitted vocabulary did not change during training, it is possible that we could not capture the information required to match the effectiveness of the uni-gram BoW.

Additionally, the probabilities intended to detect incorrect grammatical sequences in sentences performed better when applied to paragraphs, as indicated by the very low recall in the sentence setting, but overall they were poor predictors. Examining the probabilities more closely, the averages of the probabilities did not differ greatly between erroneous and non-erroneous text (see Figures 8 and 9 in the appendix for a visual representation). A Student's t-test on the log average probability of POS tri-grams (the log probabilities were found to be normally distributed, and the required assumptions for the t-test were met) further indicates that we cannot claim the difference in means between erroneous and non-erroneous text is non-zero, meaning the two classes exhibited very similar values for this feature (p < 0.01). Similar results were found for the other average-probability features. In contrast, the log of the minimum occurring POS n-gram probability showed some differentiation, as can be seen in Figures 6 and 7 in the appendix: erroneous text had, on average, lower minimum occurring probabilities, as expected. Another hypothesis test confirmed with 95% confidence that erroneous and non-erroneous texts have significantly different mean minimum probabilities of occurring POS tri-grams (p < 0.01). Again, similar results were found across paragraphs and sentences for POS n-grams. While the computed probabilities did offer some predictive power, they alone were not strong predictors in our study, as they fail to capture a large portion of the errors that the edits account for.
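A sketch of such a two-sample comparison with SciPy, assuming two arrays of log minimum POS tri-gram probabilities, one per class; the data here is synthetic.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic log minimum tri-gram probabilities for the two classes.
log_min_prob_erroneous = rng.normal(loc=-9.0, scale=1.0, size=500)
log_min_prob_clean = rng.normal(loc=-8.5, scale=1.0, size=500)

# Two-sample t-test on the class means (assumes approximate normality).
t_stat, p_value = stats.ttest_ind(log_min_prob_erroneous, log_min_prob_clean)
print(f"t = {t_stat:.3f}, p = {p_value:.3g}")
```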

5.2 Combining Feature Sets

To enhance model performance we proceeded to combine feature sets and examine their effectiveness together. Adding n-gram features is a known technique for supplementing BoW with information that gives words more context. The initial attempts to incorporate n-grams were simple concatenations with the baseline BoW; the results, shown in Table 3, indicate little to no performance gain from simple concatenation.

We postulate that simple concatenation does not permit each feature set to contribute a clear enough signal for the model to learn distinct patterns from, and that the addition of overly sparse n-gram vectors increases dimensionality while bringing in little additional information. As a result, to improve the clarity of the signals produced by each feature set, we proceeded with an ensemble of learners in which a logistic regression (see the Analysis section for the reasoning behind this selection) was trained on each feature set individually, and the subsequent probabilities were used as features for a final classification. The results of various ensembles of features are shown in Table 4. Ensembling did prove to be a more effective way to boost performance, demonstrating the best performance yet (performance improved only slightly, at the cost of much more computational time, when incorporating 5-grams, so we stopped at 4-grams). The results here confirm that the model is better able to learn from each feature set by isolating each in its own independent learner.

Features                    Precision  Recall  Accuracy
Sentences
1,2,3-grams Raw             0.5899     0.3144  0.6356
2,3-grams Raw               0.5637     0.1937  0.6147
1,2,3-grams POS             0.5993     0.3609  0.6452
2,3-grams POS               0.616      0.3725  0.6536
3-gram POS + Probs          0.6133     0.3790  0.6534
Paragraphs
1,2,3-grams Raw             0.6521     0.5893  0.6573
2,3-grams Raw               0.6102     0.5224  0.6165
1,2,3-grams POS             0.4961     0.3562  0.5248
2,3-grams POS               0.4954     0.3576  0.5241
1-gram Raw + Probabilities  0.6682     0.5985  0.6698

Table 3: Simple Feature Concatenation


Features                Precision  Recall   Accuracy
Sentences
1,2,3,4-grams Raw       0.5992     0.3057   0.7118
1,2,3,4-grams POS       0.6056     0.33185  0.7165
1,2,3,4-grams Raw+POS   0.6340     0.3732   0.7299
Paragraphs
1,2,3,4-grams Raw       0.6814     0.6692   0.6783
1,2,3,4-grams POS       0.6955     0.6797   0.6908
1,2,3,4-grams Raw+POS   0.7225     0.7061   0.7172

Table 4: Ensembling Feature Sets

5.3 Complexity Measures

Here we present the results of experiments conducted on the sample subset of our data which contains the previously described syntactic complexity measures. For the comparison we use the uni-gram base case, n-gram probabilities, the complexity measures themselves, and an ensemble of them all.

Firstly, the results for the baseline on the sample set are significantly lower, which indicates the importance of data quantity in this task. Furthermore, the complexity measures did not prove to be particularly powerful predictors, performing worse than the other features used. In the case of sentences, recall dropped to a low 0.06, indicating an inability to reliably identify positive examples of erroneous text. We attribute this ineffectiveness to the nature of the edits: considering that most of the edits are small changes to the text, the syntactic structure usually does not change substantially.


Features         Precision  Recall  Accuracy
Sentences
TF-IDF uni-gram  0.4975     0.424   0.5948
Probabilities    0.5093     0.1537  0.5987
Complexity       0.4981     0.0607  0.5956
Ensemble         0.5187     0.4622  0.6099
Paragraphs
TF-IDF uni-gram  0.5931     0.5821  0.6021
Probabilities    0.5433     0.5182  0.5533
Complexity       0.535      0.3666  0.5365
Ensemble         0.5774     0.5718  0.5877

Table 5: Results with Complexity on Sample Datasets

With positive labels being applied to text due to a variety of errors, major syntactic changes are infrequent compared to the quantity of minor changes. Because all edits share the same class label, syntactic differences may not provide a strong enough signal, corroborating a similar result found by Okanohara and Tsujii [7]. This problem could also be exacerbated by the lack of data in the sample, offering too few examples of complexity shifts between pre- and post-edits. Examining a few key paragraph measures, such as the number of verb phrases and the number of clauses, a Kolmogorov-Smirnov test indicated that we cannot claim that erroneous and non-erroneous text differ significantly in these measures (p < 0.01; see the appendix for box plot examples). These observations hold for both paragraphs and sentences, and thus in this task we find these measures to hold little predictive power.

6 ANALYSIS

In this section we discuss the results of a few auxiliary experiments that aided in some of our design decisions contributing to the primary task and addressing our research sub-questions.

6.1 Model Comparisons (RQ3)

In comparing models we began with the baseline case, but we also tried numerous configurations in the style of our best-performing ensemble model. Results for the baseline case are presented in Table 6. The observed differences in performance propagated consistently through all of the model configurations and feature combinations we tried, so for brevity we report only the baseline.

Logistic regression proved to be the strongest-performing classifier with our BoW model. The linear support vector classifier came in a close second, usually trailing by a few percentage points in accuracy at the cost of 3-4 times the training time. Naive Bayes did not perform particularly well at all, even in the ensemble setting, but did train faster than logistic regression, corroborating the findings of Ng and Jordan [24]. Gradient boosted decision trees (GBTs) similarly performed poorly, while also not enhancing performance on the dense vector representations (the probability and complexity measures) when used independently.

Model                   Precision  Recall  Accuracy
Sentences
Logistic Regression     0.6688     0.5969  0.6697
Naive Bayes             0.4621     0.7153  0.5497
Gradient Boosted Trees  0.6471     0.1142  0.6179
Support Vector Machine  0.6513     0.1984  0.6345
Paragraphs
Logistic Regression     0.6827     0.6259  0.6857
Naive Bayes             0.5812     0.6825  0.6175
Gradient Boosted Trees  0.6553     0.4679  0.6320
Support Vector Machine  0.6593     0.6195  0.6690

Table 6: Comparing Models on Raw Text Uni-Grams

6.2 Minimum Token Threshold

Model performance on text of varying length is shown in Figures 3 and 4. The results here motivated the selection of the minimum thresholds used in our best-performing model.

Figure 3: Effects of Adjusting Minimum Number of Tokens in Sentences

Many of the rows we aimed to eliminate are short titles, headers, and similar features that tend not to have edits made to them. We believe that removing these no-edit "freebies", which are simple for the model to classify, leads to the reduction in accuracy as the token threshold increases, while also increasing the ratio of positively labeled examples as many of these unedited negative examples are left out; this is illustrated by the large drop in the number of available sentences. Additionally, we see a major drop in performance in the jump from 50 to 100 tokens, coinciding with a large dip in data quantity: at n > 100 tokens we drop from about 1 million rows to about 28,000 rows. This offers some insight into the importance of data quantity in this task. Our final selection was to take only sentences of 15 tokens or greater, to maintain a larger dataset to train on.

The same approach was taken in assessing the token threshold at the paragraph level. Firstly, we see the ratio of positive examples rise to just about 50%, which we attribute to many of the completely unedited rows being short headers, titles, and the like, thus leaving longer paragraphs, most of which have edits made to them.


Figure 4: Effects of Adjusting Minimum Number of Tokens in Paragraphs

This results in a nearly 50/50 split of erroneous and non-erroneous paragraphs. Interestingly, in the case of paragraphs we find that performance only climbs higher with an increase in minimum length, despite data quantity dropping rapidly. The row count goes down from 85,523 at lengths greater than 500 to 20,616 rows at lengths greater than 700, where we see over 80% accuracy while maintaining a nearly 50% class balance. We postulate that this increase in accuracy may be due to higher feature availability: longer documents contain more words and thus more contributing feature values, which can enable better discrimination. Our final selection was a minimum of 50 tokens, again to preserve more training examples.

6.3 Error Analysis

To better understand where our models were failing, we manually investigated erroneous classifications and looked for patterns or clues on how to improve the current model. The assessed predictions were taken from our best-performing ensemble model at the paragraph level. The first thing we noticed was the frequent occurrence of typographical errors in the post-edit column resulting from failures in the initial data collection. These errors were mostly spacing errors in which newly inserted text ran up against adjacent words, e.g. "be approximately" being edited to become "isapproximately". This can be particularly problematic for model learning, as post-edit text that is labeled as non-erroneous does in fact contain errors. In a manual audit of 50 misclassified paragraphs, 11 typographical errors were found in the post-edit column; 8 of these errors occurred in text that was labeled as non-erroneous and predicted as erroneous. Further, we noticed that some misclassified rows were edited only to correct notation or nomenclature, or to make stylistic adjustments; in these cases the differences between the pre-edit and post-edit text were minimal. From this we hypothesized that many misclassifications could be attributed to the fact that some texts, while having opposing labels, were still very similar to each other, bearing few or minor changes from the editing process. A visualization is presented in Figure 5, which indicates that misclassified text has, on average, fewer insertions and deletions than correctly classified text. In testing the hypothesis that misclassified rows have fewer edits on average than those classified correctly, we used a Kolmogorov-Smirnov test, which indicated that the data is sufficient to support this claim (p < 0.01; the hypothesis was tested on inserts and deletes separately, with similar p-values). This observation further supports our intuition that text with more changes made to it differs more significantly from its erroneous counterpart. We believe that cleaner data, free of the described typographical errors in the post-edit text, could improve the performance of our proposed classification framework, but we do not believe the framework has the ability to capture the most subtle changes or errors, such as capitalization errors.

Figure 5: Average Number of Inserts and Deletes. 1 Indicates Correct Predictions

7 CONCLUSION & FUTURE WORK

In this study we have presented a methodology that demonstrates the feasibility of erroneous text detection using widely available and scalable machine learning tools. We evaluated our proposed classification framework using variations of TF-IDF weighted BoWs on raw text and POS tags. Each variant was tested independently and then in combination, by means of both simple vector concatenation and ensembling of individually trained classifiers. Experimental results indicated that uni-grams were the strongest base feature set, but the best results were ultimately obtained by using multiple feature sets together.

We also showed that the task is feasible not only at the paragraph level but also at the sentence level, where our best-performing models achieved 72% and 71% accuracy on sentences and paragraphs, respectively. Given that the proposed framework can use many types of binary classifier, we also compared the performance of four commonly used classifiers: logistic regression, support vector machine, gradient boosted decision trees, and naive Bayes. Of the four, logistic regression proved to be the best performing in nearly every iteration of our framework.

We also evaluated how the proposed framework performs on text of varying length. Longer text was more easily discriminated between, as it offers more feature values to consider; of course, this limits the usability of real systems, since all of the text in a given document may need to be evaluated.

While the framework did prove useful for the task, it showed shortcomings in that it could not detect small edits or less meaningful errors, such as capitalization errors.

In the future, to make a more effective system, we would like to create subsystems that target specific error types relevant to the Elsevier use case. Additionally, given the popularity and success of neural models, they could prove very effective at this task as well and would be worth exploring further.

ACKNOWLEDGMENTS

I would like to thank my supervisors Pengjie Ren and George Tsatsaronis for their continual feedback and guidance throughout my research. I would also like to thank my Elsevier colleagues Ehsan and Jenny for day-to-day discussion and encouragement. I thank my family for supporting my move halfway around the world to pursue this crazy dream. Finally, I would like to thank the Hardly Working group for the incredible journey we have made together, full of wonderful memories, unmatched camaraderie, and of course becoming Masters of Science.

REFERENCES

[1] Matt Post and Shane Bergsma. Explicit and implicit syntactic features for text classification. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 866–872, 2013.
[2] Prem Melville, Wojciech Gryc, and Richard D Lawrence. Sentiment analysis of blogs by combining lexical knowledge with text classification. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1275–1284. ACM, 2009.
[3] Emily Pitler and Ani Nenkova. Revisiting readability: A unified framework for predicting text quality. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 186–195. Association for Computational Linguistics, 2008.
[4] Yigal Attali and Jill Burstein. Automated essay scoring with e-rater v. 2. The Journal of Technology, Learning and Assessment, 4(3), 2006.
[5] J Peter Kincaid, Robert P Fishburne Jr, Richard L Rogers, and Brad S Chissom. Derivation of new readability formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy enlisted personnel. 1975.
[6] Chris Brockett, William B Dolan, and Michael Gamon. Correcting ESL errors using phrasal SMT techniques. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 249–256. Association for Computational Linguistics, 2006.
[7] Daisuke Okanohara and Jun'ichi Tsujii. A discriminative language model with pseudo-negative samples. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 73–80, 2007.
[8] Colin Cherry and Chris Quirk. Discriminative, syntactic language modeling through latent SVMs. In Proceedings of the Association for Machine Translation in the Americas (AMTA-2008), 2008.
[9] Eugene Charniak and Mark Johnson. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 173–180. Association for Computational Linguistics, 2005.
[10] Madhvi Soni and Jitendra Singh Thakur. A systematic review of automated grammar checking in English language. arXiv preprint arXiv:1804.00540, 2018.
[11] Martin Chodorow and Claudia Leacock. An unsupervised method for detecting grammatical errors. In 1st Meeting of the North American Chapter of the Association for Computational Linguistics, 2000.
[12] Guihua Sun, Xiaohua Liu, Gao Cong, Ming Zhou, Zhongyang Xiong, John Lee, and Chin-Yew Lin. Detecting erroneous sentences using automatically mined sequential patterns. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 81–88, 2007.
[13] Joachim Wagner, Jennifer Foster, and Josef van Genabith. A comparative evaluation of deep and shallow approaches to the automatic detection of common grammatical errors. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 112–121, 2007.
[14] Michael Gamon. High-order sequence modeling for language learner error detection. In Proceedings of the 6th Workshop on Innovative Use of NLP for Building Educational Applications, pages 180–189. Association for Computational Linguistics, 2011.
[15] Alla Rozovskaya and Dan Roth. Grammatical error correction: Machine translation and classifiers. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 2205–2215, 2016.
[16] Courtney Napoles and Chris Callison-Burch. Systematically adapting machine translation for grammatical error correction. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pages 345–356, 2017.
[17] Masahiro Kaneko, Yuya Sakaizawa, and Mamoru Komachi. Grammatical error detection using error- and grammaticality-specific word embeddings. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 40–48, 2017.
[18] Lawrence M Rudner and Tahung Liang. Automated essay scoring using Bayes' theorem. The Journal of Technology, Learning and Assessment, 1(2), 2002.
[19] Derrick Higgins, Jill Burstein, Daniel Marcu, and Claudia Gentile. Evaluating multiple aspects of coherence in student essays. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, 2004.
[20] Jill Burstein, Martin Chodorow, and Claudia Leacock. Automated essay evaluation: The Criterion online writing service. AI Magazine, 25(3):27–27, 2004.
[21] Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. 1993.
[22] Dimitrios Alikaniotis, Helen Yannakoudakis, and Marek Rei. Automatic text scoring using neural networks. arXiv preprint arXiv:1606.04289, 2016.
[23] Xiaofei Lu. Automatic analysis of syntactic complexity in second language writing. International Journal of Corpus Linguistics, 15(4):474–496, 2010.
[24] Andrew Y Ng and Michael I Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Advances in Neural Information Processing Systems, pages 841–848, 2002.
[25] David A Hull, Jan O Pedersen, and Hinrich Schütze. Method combination for document filtering. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 279–287. Citeseer, 1996.
[26] Leah S Larkey and W Bruce Croft. Combining classifiers in text categorization. In SIGIR, volume 96, pages 289–297. Citeseer, 1996.
[27] Mangi Kang, Jaelim Ahn, and Kichun Lee. Opinion mining using ensemble text hidden Markov models for text classification. Expert Systems with Applications, 94:218–227, 2018.
[28] María Luisa Carrió-Pastor and Eva María Mestre-Mestre. Lexical errors in second language scientific writing: Some conceptual implications. International Journal of English Studies, 14(1):97–108, 2014.
[29] Robert Dale and Adam Kilgarriff. Helping our own: Text massaging for computational linguistics as a new shared task. In Proceedings of the 6th International Natural Language Generation Conference, pages 263–267. Association for Computational Linguistics, 2010.


APPENDIX

Figure 6: Boxplot of Log of Min POS Tri-gram: Sentences

Figure 7: Boxplot of Log of Min POS Tri-gram: Paragraphs

Figure 8: Log Average Probability of Raw Bi-Gram. Label 1 Indicates Erroneous

Figure 9: Log Average Probability of POS Bi-Gram. Label 1 Indicates Erroneous

Figure 10: Flesch-Kincaid Readability

Figure 11: Comparing Number of Verb Phrases Across Paragraphs


Table 7: Features (probabilities grouped as one set, the remaining grouped as complexity features)

Probability features:
- minimum probability of raw bi-grams
- minimum probability of POS tag bi-grams
- minimum probability of POS tag tri-grams
- average probability across all raw bi-grams
- average probability of POS bi-grams
- average probability of POS tri-grams
- natural log of each probability above

Complexity features:
- total characters
- average characters per word
- number of tokens
- Dale-Chall readability
- Flesch-Kincaid readability
- Flesch-Kincaid grade level
- #Clauses
- #Clauses per sentence
- #Clauses per T-unit
- #Complex nominals
- #Complex nominals per clause
- #Complex nominals per T-unit
- #Coordinate phrases
- #Coordinate phrases per clause
- #Coordinate phrases per T-unit
- #Complex T-units
- #Complex T-unit ratio
- #Dependent clauses
- #Dependent clauses per clause
- #Dependent clauses per T-unit
- #Mean length of clause
- #Mean length of sentence
- #Mean length of T-unit
- #Sentences
- #T-units
- #T-units per sentence
- #Verb phrases
- #Verb phrases per T-unit

Figure 12: Comparing Number of T-Units Across Paragraphs
