
Bachelor Informatica

Bachelor Thesis - Automated Writing Feedback using Natural Language Processing

ar Halberkamp (10758380)

March 25, 2019

Supervisor(s): A. van Inge

Informatica - Universiteit van Amsterdam


Abstract

While tools that evaluate and provide feedback for code written in formal languages are ubiquitous, natural language analysis tools are still lacking. We use existing text readability metrics taken from the education sector and a natural language linter to discover whether these can be used within a data-driven context to give writers insight into what they could improve, and we construct a proof-of-concept web-based tool to put this research into practice. We also explore alternative ways to give meaning to the results generated by our tool by comparing writers within the same group. By constructing a corpus of thousands of paragraphs of scientific text, we attempt to answer the question whether these metrics can be used to accurately and consistently differentiate between different datasets, using a quartet of experiments. The results show that the reliability of the metrics varies across metrics and situations. The metrics show differences in median values between different types of texts, but within our corpus of scientific texts, our experiments generally indicate that these metrics cannot be relied on for more pinpointed feedback. However, natural language linting shows promise, and the precedent set by other projects makes applying statistical analysis to texts within a group an approach worth looking into.


Contents

1 Introduction
    1.1 Background
        1.1.1 Previous Research
    1.2 Research Specifics
2 Method
    2.1 Data
        2.1.1 Corpus
        2.1.2 Feedback Dashboard
    2.2 Metrics
    2.3 Software
    2.4 Experiments
    2.5 Implementation
3 Results
    3.1 Experiment 1
    3.2 Experiment 2
    3.3 Experiment 3
    3.4 Experiment 4
4 Conclusion
    4.1 Further Research
5 Appendix
    5.1 Source code
    5.2 Additional plots


CHAPTER 1

Introduction

1.1 Background

Automated feedback is commonplace in many areas of Computer Science. For a long time, programmers have relied on code linters to check their code for syntax errors and bad practices. Despite the ubiquity of these linters, there has traditionally been a severe blind spot when it comes to automated feedback on writing style, which in turn makes it substantially more difficult for writers to know how to improve their writing style, as there is no instant feedback, and vastly increases the burden on readers to thoroughly read and comment on the quality of writing. To attempt to fill in this blind spot, programmers have started experimenting with natural language linters [10]. These linters work identically to how a linter for a formal (programming) language would, except they run on English text instead of source code. In this thesis, we are going to look beyond the natural language linters that are already widely available and see how we can implement these techniques in a more data-driven environment in order to provide accurate, automated feedback on people's writing style.

Before we can start, it is necessary to first establish what we mean by "feedback" within the context of this paper. For this paper, we define feedback as "an observed distinction between the input text and a body of properly written texts". As we cannot provide a judgment of a text's quality, since this lies outside both the scope of this paper and its field of research, we avoid putting emphasis on whether or not a text is well written, instead focusing purely on texts as "inliers" and "outliers", letting writers and readers determine what this means for the quality of their writing.

1.1.1 Previous Research

As mentioned, one of the current ways to receive feedback on written text is by using a natural language linter. These linters have a rather primitive, but effective approach to pointing out perceived issues in the text. Linters work with a number of rules which encode common bad practices within writing. The input text is then parsed, and a note is made in places where the linter observes a problem. While this type of generated feedback can be very useful to spot certain mistakes within a text, such as spelling mistakes or typos, these linters operate based entirely on the judgment of the creator of the linter's rules, and do not make any comparisons of their own. A more comparative approach is the one used in Jordy Bottelier's thesis [3]. Bottelier used readability metrics to point out differences between different authors within a text. In his research, he trained a model using different texts from different authors as a training set, and then used these generated values to determine the text of origin of new input. Using his data and metrics, he managed to consistently differentiate authors in 20 sentences of writing. To accomplish this, he used five different metrics, taken from the education sector, to both train the model and run the comparisons.

Also important is the success of collaborative reading and studying software such as Perusall [14]. While this software tries to help students read and understand texts better, as opposed to this paper's focus on writing, it does show that a collaborative automated tool can be valuable for students. The usage of Perusall has significantly improved students' perceptions of the courses that utilize it at the University of Groningen [12].

1.2 Research Specifics

This paper focuses on answering the question "Is it possible for an automated tool to provide accurate feedback on input text based on parameters generated from a body of existing work?". Here, "accurate" means feedback that shows an existing difference between texts on as small a scale as possible. To run experiments and answer this question, we will construct an automated tool using the previously discussed findings. The tool will be used to parse a corpus into parameter values based on the different metrics and formulas used in the research. After it is done parsing, it can then use these parameters to compare incoming texts to the corpus. This tool is aimed at usage within the Computer Science Bachelor's programme of the University of Amsterdam, and this will be reflected in the choices made regarding implementation and datasets. Other terms used throughout this paper include "reader", "writer" and "group". Given our focus on education, these terms in most cases refer to students being writers within a workgroup, with TAs, teachers and professors being the readers. However, the terms were kept intentionally vague, as they can apply to various education, work and hobby groups.

Before we can formulate a hypothesis, we first need to make a few choices regarding the specifics of the tool we are constructing:

• What type of data does the body of work consist of? The corpus will consist of documents similar in length and style to the input texts, allowing us to run the exact same code on input texts as on the corpus. This in turn allows us to assume a similar distribution of scores for inlier input texts as within the corpus.

• What is the format of the input text? Both the corpus and input text will have parameters generated per paragraph. The choice for paragraphs as the unit was made to allow more specific feedback on parts of the input text. This would also allow comparing paragraphs within a text internally, and still allows an average to be calculated if this is deemed more desirable. Since Bottelier’s research showed a sample size of 20 sentences is enough to accurately match texts, we assume a paragraph should be enough text to generate consistent results.

• How does the tool generate the parameters? The tool will utilize the formulas used in Bottelier's thesis to score text readability, as well as the output generated by a natural language linter, to generate the parameters.

We predict it is possible for an automated tool to provide accurate feedback. Assuming there is a consistency of style between texts, one could create a profile of academic texts through the usage of multiple metrics. The metrics used by Bottelier are established and verified to work for classifying texts for various levels of education [8]. An assumption could be made that this extends to academic-level texts, and could be used to differentiate between strict inliers and outliers. However, readability metrics might prove insufficient to discern minor differences on a smaller scale, as these metrics aim to compress entire pieces of text into a single number, which leads to an obvious loss of information. Since natural language linters have already proven useful for enforcing basic style guides within writing on this smaller scale [10], we assume the feedback generated there will be more useful for comparing inliers. Both methods of parsing together should provide accurate statistics.

The next section will go into greater detail about the proposed data mining tool, including the data and metrics used, as well as describing the experiments we will run to test the usage of the tool.


CHAPTER 2

Method

2.1 Data

One of the most important parts of the data mining tool is the data we use to compare new texts to. For this paper, we are going to be looking at data coming from two different sources: one from generating a corpus of pre-existing pieces of scientific writing, the other from looking at statistics of other writers in the same group.

2.1.1 Corpus

In order to use the tool to generate feedback on incoming texts, we must first construct a corpus to compare texts to. Initially, we planned to base it on a dataset of previous Bachelor's theses at the UvA, using the work available from Science in Progress [16]. However, this would require manual parsing of a large number of .pdf documents to extract the text and build the corpus, which would in turn consist of texts of varying quality and grades.

Instead, we decided to use the VTeX Language Editing Dataset of Academic Texts [5], which consists of paragraphs from 7996 documents with a length between 100 and 5000 characters, sorted by the field the paper was published in. The dataset is fully sanitized for usage within data mining and machine learning, meaning text formatting, mathematical formulas and figures were removed from the text, as these make it harder to parse the text accurately while not providing any significance for text analysis. The dataset also contains suggested edits for every sentence in each paragraph, showing which syntax mistakes from the source texts were fixed by the researchers. Although this part of the dataset is ignored for the scope of this thesis, it is worth mentioning as it can be used for other approaches to text analysis in the future. For the rest of this thesis, this dataset will be referred to as the VTeX dataset. We will use the VTeX dataset to construct our corpus, and use it in the experiments to signify the group of inliers.

2.1.2 Feedback Dashboard

Instead of writing a computer program that outputs feedback on the comparison between input text and corpus, it is also possible to take a different approach. As opposed to the computer returning written-out feedback on the texts, the program can return the raw statistics and components, putting them in perspective alongside other writers in a group on a dashboard. A writer might, for example, see the tool reporting an issue with usage of the passive voice. Based on the comparative data, the writer can compare the number of reported issues to that of other writers or other texts, and use this to discern whether the observed differences are cause for concern. Another large benefit of this more collaborative approach is the vastly increased amount of metadata. As every submitted version can be saved as a snapshot of the text, statistics can be generated on the progression of the text over time, which can provide insight to the writer and their progress, as well as give readers easy insight into the progression of the writers, allowing them to determine which writers might need extra support or time.

However, the main downside of such an approach is the lack of quality comparative material, as well as a general scarcity of data. As this approach relies on statistics coming from other texts written by writers from the same group, we would need to guarantee that the majority of writers are producing stylistically adequate texts, which is something we cannot do in a number of use cases. Another limit is the number of writers in a group. The sample size of the statistics provided is limited to the number of writers in a group, which in turn means statistics become less consistent the smaller the group becomes, with the approach not working at all for singular writers not part of a group, as there is simply no data to compare to.

While these two approaches are completely different in the way they approach the generated data from the algorithm and the ”judgment” of a computer, it is possible to implement them both in the same piece of software. By putting the computer-generated classification next to the raw statistics writers themselves can use, we can compare both approaches, and conclude whether either of them works, and how useful the generated feedback is. Putting both approaches side by side can also help cover up some of the shortcomings of either approach. It allows the results of the corpus to be put in perspective within the group the writer is operating in, while also giving writers a more consistent statistic to refer to.

2.2 Metrics

For this thesis we will use three of the five metrics provided in Bottelier’s thesis. These metrics are the Flesch Reading Ease, Gunning Fog Index, and Automated Readability Index.

\[ \text{Flesch Reading Ease} = 206.835 - 1.015 \cdot \frac{\text{words}}{\text{sentences}} - 84.6 \cdot \frac{\text{syllables}}{\text{words}} \qquad (2.1) \]

\[ \text{Gunning Fog Index} = 0.4 \cdot \left(\frac{\text{words}}{\text{sentences}} + 100 \cdot \frac{\text{complicated words}}{\text{words}}\right) \qquad (2.2) \]

\[ \text{Automated Readability Index} = 4.71 \cdot \frac{\text{characters}}{\text{words}} + 0.5 \cdot \frac{\text{words}}{\text{sentences}} - 21.43 \qquad (2.3) \]

While these three metrics are similar in use case, they use different statistics from the text to generate their results, and with different weights. All three metrics use the ratio between words and sentences as well as a measure of word complexity. Two of the three metrics use the number of syllables to judge the complexity of words, while the Automated Readability Index (ARI) uses the number of characters in a word instead. Since the English language has a large number of loan words and exceptions, it is impossible to give a 100% accurate reading of how many syllables are in a word purely from looking at the syntax. Despite this, tools have been developed to calculate the number of syllables with a less than 1% error margin [11]. Besides not being 100% precise, the algorithm to detect syllables also complicates generation of the value, while the ARI is designed to be easily calculated while the text is being written.

Although the Flesch and Gunning Fog scores both rely on (mostly) the same components, the returned values are vastly different. First of all, the Flesch reading ease is the only one of the three metrics where low values imply a complicated text as opposed to a simple one (which is the case for the other metrics). It also generates much larger numbers, as the other metrics are scaled to the American grade system, and attempt to return a value equivalent to an American grade. The Gunning Fog score does the calculation of the word complexity differently from the other metrics as well, relying on a percentage of complicated words as opposed to an average of syllables or characters in a word.

The reason we are omitting the Coleman-Liau Index Bottelier used is that it uses the exact same components as the Automated Readability Index, which is also used in his thesis. We are also omitting the SMOG index, as this index is specifically made to be used with fragments of 30 sentences, which is not something we can guarantee within this thesis.
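To make the formulas concrete, the sketch below computes all three scores for a single paragraph in Python. The sentence splitter, the vowel-run syllable counter and the three-syllable threshold for "complicated" words are rough placeholder heuristics rather than the exact routines the tool uses in its Node.js parser plugins, so the resulting numbers are only indicative.

```python
import re

def text_statistics(text):
    """Collect the raw counts used by formulas (2.1)-(2.3)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    characters = sum(len(w) for w in words)

    def syllables(word):
        # Crude placeholder: count runs of vowels in the word.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    syllable_count = sum(syllables(w) for w in words)
    # Gunning Fog commonly treats words of three or more syllables as "complicated".
    complicated = sum(1 for w in words if syllables(w) >= 3)
    return len(sentences), len(words), characters, syllable_count, complicated

def readability_metrics(text):
    s, w, c, syl, comp = text_statistics(text)
    return {
        "flesch_reading_ease": 206.835 - 1.015 * (w / s) - 84.6 * (syl / w),
        "gunning_fog": 0.4 * ((w / s) + 100 * (comp / w)),
        "automated_readability_index": 4.71 * (c / w) + 0.5 * (w / s) - 21.43,
    }

if __name__ == "__main__":
    print(readability_metrics(
        "Automated feedback is commonplace in many areas of Computer Science. "
        "Readability metrics compress a paragraph into a handful of numbers."
    ))
```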

2.3 Software

For the scope of this thesis, we are going to focus on the usage of the tool exclusively within the Bachelor's Computer Science degree at the University of Amsterdam (UvA), and this is also where we aim to apply the tool. For the implementation of the tool, we will focus on creating a lightweight, easy to deploy tool that still retains sufficient performance. We will also create a clear separation between the front-end collection of writer data and comparison of statistics on the one hand, and the generation of the feedback and corpus on the other. This is to allow one central feedback server to serve multiple front ends at multiple organizations, to allow front-end interfaces to use different layouts and implementations than the one provided in this thesis, and to ensure privacy of writer details and connections.

Since the data mining aspect takes up the majority of the focus of this thesis, we are going to utilize an external library to implement a natural language linter. To do this, we are using the write-good linter in Node.js [7]. This linter has eight different checks for common bad practices within the English language. It uses a regex implementation to scan text, and generates comments based on whether it can detect certain structures. In addition to the built-in filters, it is possible to add your own extensions to the linter. Another commonly used linter is Vale [6], which is a more feature-rich application written in Golang. As this linter is an external binary and does not have any Node.js bindings, we have decided not to use Vale for this thesis, relying entirely on write-good for the proof-of-concept software.

After our formulas and linter generate values for every paragraph in the input text, these are compared to the values generated from the corpus. Since the focus of this tool and thesis is on the parameter generation, there is no specified method for comparison. The version of the software used in this research contains a placeholder implementation of a comparison function; a more sophisticated algorithm could be created based on the results from the experiments (a hypothetical example of such a comparison is sketched at the end of this section).

As well as a parser for the data, there must also be an interface for writers to submit texts, and for readers to observe the texts and the generated feedback. For this we are implementing a feedback dashboard, using the approach described in subsection 2.1.2. The dashboard should contain elements to submit texts, view the generated feedback inline, and view the generated statistics of the text in comparison to the averages of other writers, in a clear format. Supervisors should be able to securely access a page with a view of the statistics of all writers, as well as their progress over time. It is imperative here that anomalies are easy to detect by readers, since the goal is to enable easier identification of writers who need extra attention. In production, this dashboard would feature different groups for different courses, and allow students to log in using their university's ID system. For the scope of this thesis, however, these features are not mandatory, and could always be added at a later date. To track which text belongs to which writer, an identifier must be chosen. For this thesis, we chose IP addresses as the identifier. Identifiers are not visible to any other writer, and are meant to be used by the server and readers to track writers' progress.
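Purely as an illustration of what the comparison step could look like (the tool itself only ships a placeholder), one simple option is to flag a paragraph whose score falls outside interquartile-range fences computed from the corpus distribution. The sketch below is a hypothetical example, not the tool's implementation; the stand-in corpus data is generated at random.

```python
import numpy as np

def is_outlier(corpus_scores, paragraph_score, k=1.5):
    """Flag a paragraph as an outlier for one metric using Tukey-style fences
    derived from the corpus distribution of that metric."""
    q1, q3 = np.percentile(corpus_scores, [25, 75])
    iqr = q3 - q1
    return not (q1 - k * iqr <= paragraph_score <= q3 + k * iqr)

# Stand-in corpus of ARI scores; real scores would come from the parsed corpus.
corpus_ari = np.random.normal(loc=15.0, scale=3.0, size=1000)
print(is_outlier(corpus_ari, paragraph_score=27.4))
```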

2.4 Experiments

As previously mentioned, the tool we are constructing relies on a number of assumptions regarding the usability of the metrics for the data mining portion of the thesis. If our assumptions turn out to be wrong, we expect (one of) the following reasons to be the cause:

• The size of the paragraphs in the VTeX dataset might be insufficient. While Bottelier used samples of 20 sentences, not all paragraphs from the VTeX dataset reach that number of sentences. This can potentially lead to noisy results.


• The VTeX dataset is potentially not large enough for the variance to stabilize. This could lead to incorrect results, and a larger dataset would need to be found.

• The metrics we use to determine the complexity and readability of the text look purely at sentence structure, not the actual content of the text. This could lead to a similar distribution for every text written for an educated audience, regardless of subject matter.

• There might not be as uniform a style between academic texts as assumed. There are multiple factors - such as the different academic fields and differences in style between different writers - that have a potential impact on the outcome of the metrics. The amount of impact these factors have in practice is unknown, and will have to be investigated.

• Language is a very complex concept with a great number of different factors and oddities. It might simply be impossible for a mathematical formula to generate a single number that says anything meaningful about something as complicated as style of writing. Since we are condensing an entire text into a single number, it is unavoidable that details get lost in compression, and in this case, the devil could be in the details.

• Related to the previous point is the potential issue with noise. Because language is a complicated concept, there is a possibility for the calculations to produce a substantial amount of noise. We make the assumption that most paragraphs within a single text have a similar structure and complexity, while this does not have to be the case.

• The metrics might not be suitable for all texts. Since the metrics we use were developed entirely for use within grade school education in the United States, they might not work on non-English texts, or work better on certain types of texts than others.

In order to test whether or not these concerns are grounded, we are going to run a quartet of experiments. These experiments are run using Python 3, with NumPy and PyPlot used to parse the data and generate the plots. The source code for the experiments can be found in the Appendix. The data used was generated using our feedback tool.

The first experiment will look at the consistency within the VTeX dataset, and find out whether the VTeX dataset works to generate a usable body of statistics to use within the tool. In order to put this to the test, we are going to make plots to see whether the potential concerns are grounded.

• We are first plotting the distribution of the values for paragraphs in the VTeX dataset, one plot per metric used. By putting this in a boxplot, it becomes easy for us to see the range within which the majority of points fall, and thus how easy it would be to use these values.

• To see whether the length of the paragraphs influences the variance of the statistics, we are also plotting the variance of the distribution for different minimum numbers of sentences per paragraph. If longer paragraphs in fact lead to more accurate metrics, this graph should follow a downward trend. If there is no correlation, we expect this graph to trend upwards, as more scarce data should lead to more variance.

• To discover whether the number of documents in the database is enough for consistent results, we are generating line plots displaying the relative variance of samples of the data of various sizes, both a plot using a different sample for each value and a cumulative plot. The value of this graph should stabilize for larger samples, with the point at which this happens showing the amount of data required for a consistent result (a sketch of this check is shown after this list).

• Finally, to test the tool's ability to classify non-scientific texts as outliers, we are using the book "A Short History of Nearly Everything" by Bill Bryson as a secondary test dataset. The .pdf file for the e-book was parsed by the tool in the same way as the VTeX database. This book was chosen as it covers a large variety of scientific topics, written in a distinctively non-scientific prose style. It is described in the Amazon.com description as "Though A Short History (. . . ) covers the same material as every science book before it, it reads something like a particularly detailed novel" [1], which makes it a perfect fit for our experiments. On top of Bill Bryson's book, we have also taken a .pdf of Harry Potter and the Sorcerer's Stone as additional test data. Since Harry Potter is a book written for grade-school-age children, we assume this book should work optimally with the metrics, as they were developed for use on grade school texts. If our assumptions are correct, the metrics should show a clear difference in distributions between the VTeX, Bill Bryson and Harry Potter datasets.
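A minimal sketch of the sample-size check mentioned above, assuming the per-paragraph metric values have been exported from the tool's cache to a JSON list; the file name and layout below are assumptions, and the real experiment additionally smooths the resulting curve with scipy.

```python
import json
import numpy as np
import matplotlib.pyplot as plt

def load_scores(path="corpus_scores.json", metric="ari"):
    # Assumed export format: a list of {"flesch": ..., "fog": ..., "ari": ...} objects.
    with open(path) as f:
        return np.array([paragraph[metric] for paragraph in json.load(f)])

def mean_drift(scores, sizes):
    """Mean of increasingly large random samples, relative to the first sample."""
    baseline, drift = None, []
    for n in sizes:
        sample_mean = np.random.choice(scores, size=n, replace=False).mean()
        baseline = sample_mean if baseline is None else baseline
        drift.append(sample_mean / baseline)
    return drift

if __name__ == "__main__":
    scores = load_scores()
    sizes = list(range(250, len(scores), 250))
    plt.plot(sizes, mean_drift(scores, sizes))
    plt.xlabel("sample size (paragraphs)")
    plt.ylabel("sample mean relative to first sample")
    plt.show()
```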

The second experiment will investigate the potential variance between different scientific fields. It is possible that the variance between these different fields is large enough that it negatively impacts the effectiveness of the tool. We are going to use the VTeX dataset’s labeling to sort all paragraphs in the corpus per field. To display the distribution, we will generate a boxplot for every used metric, displaying all the different fields as labeled in the VTeX dataset, as well as the distribution of the values of the metrics for every paragraph.
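A sketch of how such a per-field boxplot could be produced from the same assumed JSON export, now with a "field" label per paragraph (again an assumption about the export format, not the experiment code itself):

```python
import json
from collections import defaultdict
import matplotlib.pyplot as plt

def scores_by_field(path="corpus_scores.json", metric="ari"):
    # Group per-paragraph scores by the VTeX field label.
    groups = defaultdict(list)
    with open(path) as f:
        for paragraph in json.load(f):
            groups[paragraph["field"]].append(paragraph[metric])
    return groups

if __name__ == "__main__":
    groups = scores_by_field()
    fields = sorted(groups)
    plt.boxplot([groups[field] for field in fields], labels=fields)
    plt.ylabel("Automated Readability Index")
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.show()
```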

The third experiment will compare a much wider variety of texts, written for different purposes by different authors in different languages, in order to see if the readability metrics prove useful in differentiating these texts, and for which texts they provide the most consistent results. We will be adding the following texts to our test data:

• The book ”Boyhood: Scenes from Provincial Life” by J. M. Coetzee, an autobiographical historical novel about growing up during South African apartheid.

• The book ”De buitenvrouw” by Joost Zwagerman, a Dutch literary novel.

• The book ”Joe Speedboot” by Tommy Wieringa, a Dutch coming of age story, notable for its simple writing style and vivid imagery. [9]

• The May 2018 issue of Scientific American, a scientific magazine. [15]

These texts, combined with the previously mentioned VTeX, Bill Bryson and Harry Potter datasets, will be used to make 2D and 3D plots showing the distributions of statistics for every text.

The final experiment will take a look at the validity of the write-good linter. If the bad practices pointed out by the linter are in fact things to be avoided while writing professional texts, we should observe a difference in the number of positives when comparing a text written by learners with a text written by professionals. We are going to investigate this by comparing two samples of texts: one being the submitted reports for a UvA Computer Science course during the Bachelor's degree, the other being a sample of PhD theses from the UvA. If the linter provides accurate statements, this should be reflected in a much lower number of positives for the PhD sample.
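A sketch of how this comparison could be plotted, assuming the tool has exported the number of write-good matches per rule for each document set to JSON files; the file names and export format below are assumptions made purely for illustration.

```python
import json
import numpy as np
import matplotlib.pyplot as plt

def load_counts(path):
    # Assumed export format: {"passive": 12, "tooWordy": 4, "weasel": 9, ...}
    with open(path) as f:
        return json.load(f)

if __name__ == "__main__":
    bsc = load_counts("netcentric_counts.json")   # placeholder file names
    phd = load_counts("phd_counts.json")
    rules = sorted(set(bsc) | set(phd))
    x = np.arange(len(rules))
    plt.bar(x - 0.2, [bsc.get(r, 0) for r in rules], width=0.4, label="BSc reports")
    plt.bar(x + 0.2, [phd.get(r, 0) for r in rules], width=0.4, label="PhD thesis")
    # Normalising by total word count would make corpora of different sizes
    # easier to compare; raw counts are shown here to mirror Figure 3.10.
    plt.xticks(x, rules, rotation=45, ha="right")
    plt.ylabel("number of matches")
    plt.legend()
    plt.tight_layout()
    plt.show()
```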


2.5 Implementation

Figure 2.1: Architecture of the software

Due to the time constraints of the research, we had to make a pragmatic choice to program the software in a language with a high abstraction level. Initially, we chose Python 3 as the language for the project. However, performance while loading and parsing data rapidly became an issue. After comparing various benchmarks [2], we reached the conclusion that Node.js [13] provides superior performance at a similar abstraction level to Python, and it was deemed a superior choice for this project. In an enterprise environment with more resources available, Java would be the language most suitable for this project, as it provides even greater performance [2] while retaining widespread support for database bindings and web request handling, which are important components of this tool. Another benefit of using Node over other platforms is the great amount of support for asynchronous execution, allowing concurrency to be easily implemented without adding to the workload during development.

To load and parse the corpus, the parser uses a multiprocess structure with 4 workers. This is done to keep from overloading the main process, as Node.js has a limit on RAM usage per process due to a limit in the V8 JavaScript engine Node uses [4]; avoiding this on a single-process architecture would require severe throttling. To load our training data, either the VTeX dataset or other documents (as used by the experiments), the main process reads the data from the file, which can be a pdf file, plain text or an xml file. After reading the data, it splits the data into chunks (without parsing it), either pages of a pdf or chunks of 500 paragraphs from the VTeX dataset, and then sends the chunks to the workers. The worker processes then parse the text further, and return the generated parameters to the main process. These values are then combined into a single JavaScript object by the main process, which also handles all incoming requests for documents to be compared by the tool. A similar worker structure to the one used for parsing the corpus could be applied here as well if extra throughput is desired, although this was not done in the current implementation to lower the workload and ease debugging.

In order to avoid having to load the corpus every time the tool is launched, the parsed corpus is cached by splitting it up into chunks and writing these chunks to .json files, which can then be loaded back in. The reason to split the corpus up into chunks here is to avoid a memory overload caused by having to stringify the entire object in memory at once, which can lead to running out of memory if the whole corpus is parsed at once. The cache generated from the corpus is also the source of the data the experiments use.
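For illustration only, the chunk-and-dispatch pattern described above looks roughly like the following when written in Python with a process pool; the actual tool implements it with Node.js worker processes, and a stand-in parse step is used here instead of the real parameter generation.

```python
import json
from multiprocessing import Pool

CHUNK_SIZE = 500  # paragraphs per chunk, mirroring the Node.js implementation
WORKERS = 4

def parse_chunk(paragraphs):
    # Stand-in for the real parameter generation (readability metrics, linter output).
    return [{"characters": len(p)} for p in paragraphs]

def chunked(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]

def parse_corpus(paragraphs):
    with Pool(WORKERS) as pool:
        chunks = pool.map(parse_chunk, list(chunked(paragraphs, CHUNK_SIZE)))
    # Cache each chunk as its own .json file so the whole corpus never has to
    # be serialised in memory at once.
    for i, chunk in enumerate(chunks):
        with open(f"cache_{i}.json", "w") as f:
            json.dump(chunk, f)
    return [entry for chunk in chunks for entry in chunk]

if __name__ == "__main__":
    corpus = [f"Paragraph number {i}." for i in range(2000)]
    print(len(parse_corpus(corpus)))
```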

The parser relies on plugins to generate parameters, which it takes from the plugins/ directory. Plugins are loaded into the parser as modules, and export a Plugin object. Plugin objects have a specific layout, providing a simple API for the main parser process to use, while allowing plugin modules to include any amount and kind of JavaScript code, as long as an object following the specification is exported. For the scope of the current research, there are two different plugins: one to generate feedback based on our data mining research, and one to utilize the previously mentioned write-good linter to generate more pinpointed feedback.

To implement the feedback dashboard, we use the WebInterface class, as well as client-side JavaScript code within the index.html file. The parser runs on port 8000 in the default setup. On top of that, there is a separate port for the web interface. The dashboard uses a simple Node.js webserver to serve the webpages to the user and keep track of the feedback for texts submitted by writers. The front-end code relies on a few AJAX requests to send and receive feedback. The graphs are generated using d3.js. In order to generate the graphs shown to writers after they submit a text, the WebInterface class keeps track of submissions and averages by storing them in a JSON object, using an array of all attempts with timestamps as the values, with the most recent attempt at index 0.

The components are designed to function without needing any shared memory, with the other components relying only on the feedback parser. The web interface is designed to work with a remote feedback parser, allowing multiple interface instances to run off a single centralized feedback server. The full separation between the web interface and parser ensures the potentially remote and untrusted parser server does not obtain any potentially identifying information about students or educational facilities from the web interface. Initially, we planned to integrate the tool directly into Overleaf.com, although this was deemed unfeasible due to the complexity of the Overleaf editor and the lack of a proper API. The tool is extensible enough to facilitate plugins for text editors with an API if those are made in the future, but this is not part of the scope of this thesis.


CHAPTER 3

Results

3.1 Experiment 1

Figure 3.1: Test results for the VTeX dataset using all paragraphs.

In this distribution, the variance is noticeably high. There is a small set of inliers, but the large variance within the outliers makes these statistics hard to use. As mentioned earlier in the paper, the rest of this experiment will focus on finding out what the underlying cause of this variance is. Our initial suspicion is that it might be related to the shorter paragraphs in the dataset.

Figure 3.2: Bar plot of the VTeX dataset filtered by minimum number of sentences per paragraph, and average standard deviation over all three metrics.

This plot shows the suspicion was correct, and putting a lower bound on the number of sentences per paragraph does lead to a lower standard deviation. However, if we make this lower bound too high, the variance goes up again. We conclude that our assumption about the volume of data becoming a problem is correct. A sentence lower bound of 20, as proposed by Bottelier in his thesis, will be used in the next plots, which test the effect of the volume of data. If we want to use this dataset in our tool, we need the variance of larger samples of data from the dataset to stabilize.


Figure 3.3: Line plot of the mean of samples of paragraphs from the corpus (with 20 or more sentences), relative to the initial value, against sample size, starting at x = 250 documents. A new sample is generated for every x value. Values were smoothed using scipy. Samples were generated randomly using numpy.random.choice.


Figure 3.4: Line plot of the mean of samples of paragraphs from the corpus (with 20 or more sentences), relative to the initial value, against sample size, starting at x = 250 documents. New samples are appended to the previous samples. Values were smoothed using scipy. Samples were generated randomly using numpy.random.choice.

In these plots, we observe that, while the Flesch Reading Ease is more stable than the other metrics, none have reached a fully stabilized mean value. We assume these fluctuations are, however, small enough to still allow correct identification of inliers and outliers, but they lead to less conclusive results overall. While it would be possible to combine multiple paragraphs into one larger one to ensure all paragraphs contain at least 20 sentences, this would potentially lead to incorrect results; the paragraphs from the VTeX dataset come from different fields, different authors and different parts of papers, which are factors that can all affect the generated parameter values and invalidate the values generated for those combined paragraphs.

In order to get insight into whether this sparseness of data makes it hard to discern between inliers and outliers, we plot the VTeX dataset (with a lower bound of 20 sentences) alongside all paragraphs from the Bill Bryson and Harry Potter books, with each point signifying a paragraph in the dataset:


Figure 3.5: Scatterplot of VTeX dataset compared to Bill Bryson’s book and Harry Potter.

Figure 3.6: Same results as the previous plot, except in three different subplots to more clearly show the separate datasets.

This scatterplot shows a clear overlap between the Bill Bryson book and the VTeX dataset. While there is a slight difference between the centers of the two collections of points, the overlap is still substantial and there is too little clustering for a formula to be able to tell the difference between paragraphs with any sort of certainty.

For the Harry Potter dataset however, there is a clear difference in parameter values, with the Harry Potter dataset being almost entirely separated from the other points. Notably, there is much less variance in metric values between different data points from the Harry Potter book compared to the other datasets. The level of readability seems to be relatively constant compared to the other datasets.

The next experiments will focus on trying to find an explanation for the large amount of variance within the VTeX dataset and Bill Bryson dataset, which is not observed in the Harry Potter dataset. We will also experiment with different datasets in order to see if the metrics perform better or worse on those, or can be used to differentiate different types of texts.

3.2 Experiment 2

As mentioned previously, one of the theories for the largely unclustered results within the VTeX database is that there is a significant difference in the values generated for paragraphs in different academic fields.

Only one of the three plots is shown below for the sake of compactness. The other plots can be found in the appendix, but were omitted as they show similar, but less pronounced results. Following the results of Experiment 1, we have chosen to limit the data used to only paragraphs of at least 5 sentences long to lower the noise, while still avoiding the data becoming too sparse to provide a stable result.

Figure 3.7: Boxplot comparing the different tagged fields in the VTeX dataset based on their automated readability index

The plot shows that there is a clear difference between fields when it comes to the lower end of the distribution. While the median is close between fields - Mathematics being the notable outlier - there is a large difference in the lower bound between fields. Whether this is caused by actual differences in readability or due to the difference in noise between fields is unknown, although it does provide a possible explanation for the large variance within the dataset: Different subject matter leads to differently structured texts, which can vary greatly in complexity.

On the other hand, the plot shows that despite large amounts of variance in the lower bound between fields, the upper bound is (within a small margin) the same for every field. This is the opposite of what is typical for a normal distribution, which is symmetric around a point. This suggests that there is a limit to how complicated and long words and sentences realistically get in any scientific field or context. As these metrics all seem to approach the limit of how complex English text can reasonably be, this could mean the metrics are much less sensitive here than they would be for datasets focusing on generally easier-to-read texts.

3.3 Experiment 3

The third experiment compares a much wider variety of texts, written for different purposes by different authors in different languages, in order to see if the readability metrics prove useful in differentiating any of the texts.

Much like experiment 2, we made a boxplot for every single metric for each of the texts. To keep results consistent with experiment 2, we will focus on the plot of the Automated Readability Index:


Figure 3.8: Boxplot comparing the different datasets based on their Automated Readability Index

The first thing the plot shows is a large similarity within the inliers of all the non-fictional texts. While there is still a very large variance between data points in the VTeX dataset, the inner 25% of the data, as well as the median, are almost identical. This is in line with the theory that there is simply a lot of variance in smaller texts, and these metrics only work when applied to longer stretches of text. However, the similar value of the median of the VTeX, Bill Bryson and Scientific American datasets also supports the theory that the metrics are just not sensitive enough to accurately point out the differences.

Figure 3.9: 3D scatterplot of our different datasets, with our readability metrics as the axes

One thing the 3D plot shows that the 2D plots do not illustrate as well is the few complete outliers. We presume these consist of paragraphs containing nontypical text, such as bibliography entries or footnotes, and do not indicate anything regarding the actual document body, which is what the feedback would focus on. This statistical noise could be filtered out with a more sophisticated text parsing algorithm, though this is out of scope for this paper.

Because of the other results, it is impossible to assess how much the difference in language between the English and Dutch texts causes a difference in metric values, as they only show a similarity with the English texts from novels. In order to investigate this further and find out the effect of language on the readability metrics, specific and elaborate research must be done. This is however not within the scope of this thesis, and falls within a different field of research entirely.


3.4 Experiment 4

As our data for the PhD thesis, we used the thesis ”Exploratory Search over Semi-structured Documents” by Hosein Azarbonyad, as found on UvA-DARE [17]. For the Netcentric dataset, we are using the final submitted versions of the group papers in the Netcentric Computing course:

Figure 3.10: Comparison of the number of matches for linter rules between Netcentric Computing papers and a PhD thesis

While this plot indicates there is definitely relevance to the matches generated by the linter, this relevance seems to differ greatly based on the rule used. The rules for adverbs and "there is" structures do not indicate an actual difference between the papers, while matches for the weasel and so rules occur significantly more often in the PhD thesis. While this does not imply these rules are ineffective, it could be that these perceived issues in the text are simply a matter of style.

The strength of the linter lies in the rules for passive voice, overly wordy language use, and lexical illusions. These show a significant decrease for the PhD text compared to the BSc texts, which we assume indicates these are undesirable structures to use in an academic text, and the linter accurately points this out. However, since there are still nonzero matches, even in the PhD thesis, it remains unclear how useful students perceive this feedback to be.

What can be concluded from this is that while some of the default rules in the write-good linter do not target the correct structures within the text, a linter with the proper rules can definitely be used to help writers produce better texts. However, the exact usefulness is unclear, as these tests do not take into account false positives, false negatives, the effect of personal writing style on these values, or whether implementing the suggestions from the linter actually leads to a "better" text.


CHAPTER 4

Conclusion

The results from the experiments show that, in their current state, the metrics are not useful for making the small-scope observations on paragraphs of text which are required for our assumed use case to be functional. There are a number of potential reasons that could have caused the unfavorable results:

• The metrics are potentially unsuitable for analyzing or classifying college-level texts. Metrics like the Flesch Reading Ease are specifically developed to be used in grade school education, not on academic, college-level texts. The metrics show differences in complexity using components such as sentence length and average number of syllables per word. For these components, there is simply an upper limit: words only have a limited number of syllables, and sentences cannot be infinitely long. This might explain the nearly identical distribution of the Bill Bryson and VTeX datasets. These are both datasets made without a self-imposed limit on complexity, as they are both written for adults. As a result, the differences in sentence length and word use could be too subtle to detect using these metrics.

• The characteristics the metrics attempt to quantify might not be relevant for our use case. These metrics are purely intended to give an idea of the complexity and readability of a text, and that on the scale of the whole text. It is very much possible that these metrics do not target the correct areas where there is a difference between texts. While Bottelier's research showed success in using the metrics, and confirmed that the metrics give correct results for input texts, it did not guarantee that these correct results are actually distinct for different input texts. The results from our experiments show a similar trend. It is possible to identify obvious outliers, but texts that aim to reach the same audience or have the same purpose are often written in a similarly readable style. Readability is simply one aspect of a text to quantify, and other facets of a text could provide more distinct results.

• Another possibility is that the metrics are potentially useful, but our approach and/or dataset was unsuitable for the results we were looking for. The metrics we used are meant to provide an easy way to classify the complexity of an entire text. However, both our input parser and the VTeX dataset work on a paragraph level. It is possible that using the dataset and classification on a larger scope than singular paragraphs would have yielded better results. After all, we omitted one of the metrics Bottelier used in his thesis, since that metric was intended to be used on texts of at least 30 sentences long. We also saw in the experiments that the variance in the values of the metrics went down the longer the paragraphs got, up to a point where we concluded that the data was too sparse to get an accurate average. A dataset which is larger or contains longer texts could be used instead, with potentially better results. The metrics could then also be run on the entire input text as opposed to specific paragraphs, which could lead to an overall less noisy result. However, the issue with this different implementation is that the pinpointed feedback, which was the main objective of the data mining on paragraphs, gets less reliable, as writers would not be able to see which paragraph contains problems.


However, we still assume there is a potential use for these statistics, based on the precedent set by software like Perusall, as well as the ubiquity of spellcheckers, document statistics and linters. The results from Experiment 4 also show that, since there is a very large difference in the number of passive, tooWordy, and illusion matches, linter rules can be developed to help writers cut down on certain bad practices within their writing. Despite the amount of variance within the per-paragraph application of the metrics, the observed stability and consistency of the median values for the metrics in the experiments is in line with Bottelier's research, which showed that the metrics give correct results for texts. From this we can conclude that even without pinpointed readability-based feedback, the used readability metrics can be used as statistics on the feedback dashboard, and should give a correct assessment of the average complexity of the writer's submitted text, as well as the average of all other texts.

4.1 Further Research

One of the main avenues for further research is looking into ways to research and draw conclusions about the usability of the dashboard and linter feedback. As mentioned before, the issue here is that there is inherently a large human component. This in turn makes it hard to get quantifiable, objective results on a concept as personal and subjective as "usefulness". However, there are a few possibilities for experiments:

• The best way to test whether the comparative aspect works for writers and whether readers find the features useful is to deploy the tool during a college-level course, and have both students and supervisors fill in a feedback form after the course is over. The results of these feedback forms could then be used. This is however very much dependent on the type of course, type of class and type of college this tool is deployed in. The solution here would be to deploy it on a larger scale, though this would bring different practical problems with it, namely the size and effort required.

• To test the quality of the linter feedback, the best approach would be to run a small, varied corpus through the linter and have multiple judges manually classify each point of feedback as useful or not useful. They could also point out places where feedback should be given but is not. This can be used to calculate the precision and recall of the linter feedback, as well as the kappa measure between judges (a small worked example follows this list). This generates the most reliable results possible. The challenge for this experiment would be choosing the corpus, as the results may vary greatly depending on the type of texts used. The corpus would need to contain a wide variety of texts, while still being small enough for a human to manually classify the entirety of it within a reasonable amount of time.
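A small worked example of that evaluation, with invented labels: two judges mark, for a set of candidate spans, whether feedback is warranted there; kappa measures their agreement, and precision and recall compare the linter's flags against one judge's labels.

```python
from sklearn.metrics import cohen_kappa_score, precision_score, recall_score

judge_a      = [1, 1, 0, 1, 0, 0, 1, 1]  # judge A: does this span deserve feedback?
judge_b      = [1, 0, 0, 1, 0, 1, 1, 1]  # judge B: does this span deserve feedback?
linter_flags = [1, 1, 1, 0, 0, 0, 1, 1]  # did the linter flag this span?

print("kappa between judges:", cohen_kappa_score(judge_a, judge_b))
print("linter precision:", precision_score(judge_a, linter_flags))
print("linter recall:", recall_score(judge_a, linter_flags))
```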

Another way to continue this research is by looking into the metrics, more specifically finding different metrics, or finding better ways to apply the current metrics:

• As mentioned in the conclusion, it would be possible to try the experiments from this research again, with a differently built corpus containing longer texts. Plotting the graphs from the experiments for different corpora can provide insight into what kind of corpus (and therefore input text) needs to be used in order to receive usable results.

• The metrics used are in the form of a formula, using variables retrieved from the text, as well as different linear weights attached to those variables. Instead of using the metrics directly, it is possible to analyze the components in order to construct a different, more accurate formula. Since the metrics are intended to be used for grade school education, there is a possibility that some components are vastly more or less important than for college level texts, and potentially need a logarithmic, exponential or polynomial scaling, as opposed to a simple linear weight. Generation of these new formulas would be done mathematically using the corpus.

• Instead of coming up with new metrics, one could also take a look at different, pre-existing metrics. The three metrics used in this thesis were only chosen as they proved effective in Bottelier's previous thesis on this subject, and other metrics could be found to potentially get more accurate results. For an experiment, it would be possible to compare different metrics on a test set, seeing which ones give the most accurate classification of documents, and using those to construct a new feedback generator.

As opposed to finding ways to make the current architecture work, it is also an option to approach the data mining portion of the tool in a completely different way. Instead of trying to condense documents into singular data points, the VTeX dataset could also be used to train a Markov chain based model or a neural network. Using the suggested additions and deletions contained for every paragraph in the VTeX dataset, these models can be trained to recognize these structures in texts, and provide similar additions and deletions. This approach puts emphasis on the actual content of the text as opposed to merely the syntax, and therefore should be able to provide much more specific feedback. However, building these models takes up vastly more resources than storing a single number per document, which makes doing experiments with a substantial amount of data impossible on a setup similar to the one used for this thesis. Additionally, this approach would not mitigate any issues with noise and variance between the dataset and input documents.

Although less useful for future research, another way to expand the tool as a product is by expanding the linter as previously mentioned. There are numerous statistics one can retrieve from a text using regex, which can then be added to the graphs so writers can compare those statistics as well. Statistics that were brought up for future use of the tool include the number of sources, the number of keywords used from a list of keywords relevant to the course or assignment, the amount of field-specific jargon used (data mining using the field labels from the VTeX dataset can be used to generate a list of jargon), the length of paragraphs, and the number of headings and subheadings. The linter could also be expanded to include more error detection, such as a spellchecker or a script to check whether images are correctly captioned. Whether this is implemented as part of future research or as part of making a usable product within education, these linter functions should be, as described previously, deployed in an actual university course alongside plots like in Experiment 4, letting students fill in feedback forms to discern usability. Additionally, by asking students to give their opinion on specific parts of the linter, one could get better insight into which specific features of the linter are and are not useful.


CHAPTER 5

Appendix

5.1 Source code

The thesis source code and the experiment code are included as attachments to this document.

Alternatively, if these attachments do not work on your PDF viewer, the source code for this thesis can be found at https://bumba.me/thesis-src.tar.bz2 and https://bumba.me/experiments-src.tar.bz2 for the tool and experiments respectively.

5.2 Additional plots

Figure 5.1: Boxplot comparing the different tagged fields in the VTeX dataset based on their Gunning Fog score


Figure 5.2: Boxplot comparing the different tagged fields in the VTeX dataset based on their Flesch Reading Ease

Figure 5.3: Boxplot comparing the different datasets from experiment 3 based on their Gunning Fog score

Figure 5.4: Boxplot comparing the different datasets from experiment 3 based on their Flesch Reading Ease


Bibliography

[1] Amazon page for A Short History of Nearly Everything by Bill Bryson. URL: https://www.amazon.com/Short-History-Nearly-Everything-ebook/dp/B000FBFNII (visited on 01/10/2019).

[2] Benchmarks between Node.js and Python 3. URL: https://benchmarksgame-team.pages.debian.net/benchmarksgame/faster/node-python3.html (visited on 01/10/2019).

[3] Jordy Bottelier. "Comparing and Analyzing Individual Writing Statistics in a Collaboratively Written Document in Real-Time". 2017.

[4] Bug report regarding the memory limit in the V8 JavaScript engine. URL: https://bugs.chromium.org/p/v8/issues/detail?id=847 (visited on 01/10/2019).

[5] Vidas Daudaravicius. "Language Editing Dataset of Academic Texts". In: ().

[6] GitHub repository for the Vale linter. URL: https://github.com/errata-ai/vale (visited on 01/10/2019).

[7] GitHub repository for the write-good linter. URL: https://github.com/btford/write-good (visited on 01/10/2019).

[8] Tabea Hensel. "Validation of the Flesch-Kincaid Grade Level within the Dutch educational system". In: (2014). URL: https://essay.utwente.nl/64884/1/Hensel,%20T.N.C.A.%20-%20s0170860%20(verslag).pdf (visited on 01/10/2019).

[9] Judith Janssen. De Volkskrant review of Joe Speedboot by Tommy Wieringa. 2009. URL: https://www.volkskrant.nl/cultuur-media/joe-speedboot-een-meteoriet-dondert-het-dorp-binnen~b7670385/ (visited on 01/10/2019).

[10] Timotheus Kampik. Natural language linting. 2017. URL: https://tech.signavio.com/2017/natural-language-linting (visited on 01/10/2019).

[11] J. Peter Kincaid et al. "Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel". In: (1975).

[12] Hans Kuné. "Interactief onderwijs verhoogt de kwaliteit". In: (2018).

[13] Node.js homepage. URL: https://nodejs.org (visited on 03/15/2019).

[14] Perusall's website. URL: https://perusall.com/ (visited on 01/10/2019).

[15] Scientific American - May 2018 - How the Dinosaurs Got Lucky. Scientific American, 2018.

[16] UvA FNWI Science in Progress. URL: https://esc.fnwi.uva.nl/thesis/ (visited on 01/10/2019).
