
Human Evaluation of Unsupervised, Semi-supervised and Supervised Machine Translation Through Error Annotation

Yuying Ye

MA Digital Humanities SID: s3308936

August 27, 2019

First supervisor: Dr. Antonio Toral Ruiz
Second reader: Tommaso Caselli


Acknowledgements

First and foremost, I want to express my gratitude to my supervisor, Dr. Antonio Toral, for introducing me to the field of machine translation and for his support and guidance throughout this thesis.

I would like to thank Tommaso Caselli, who introduced me to Antonio and suggested that I write a thesis combining my BA background in translation with the MA Digital Humanities. I want to show my appreciation for all the knowledgeable teachers of the University of Groningen who taught me how to write code and, most importantly, how to treat data critically.

I would also like to thank the support crew of the Peregrine HPC cluster, especially Bob Dröge and Cristian Marocico, for always being willing to help me with configuring the unsupervised machine translation system and other technical issues.

I want to say thank you to Fouad El Ghamarti for all his understanding and encouragement, and to my family for supporting me to study in Groningen.


Preface

This thesis is about an evaluation of present machine translation technologies, especially the newly developed unsupervised machine translation systems, and an analysis of their errors. It was written as part of the MA programme in Digital Humanities at the University of Groningen from March to August 2019.

Given the intensity of technical knowledge required in the field of machine translation (MT), the research was challenging for me. Without a background in MT, I had to comprehend the mechanisms behind different MT systems and paradigms and needed to train a system myself. However, learning to configure an unsupervised statistical machine translation system gave me valuable hands-on research experience, something which I hope to profit from in my future work.

Evaluating machine translation performance connects translators and machine translation, which counters the stereotypical fear that machine translation threatens the work of human translators. It helps us to recognise the weaknesses and strengths of machine translation, and thus to collaborate with it better.

I hope the reader will find this subject as interesting as I have found it to be.

Yuying Ye

Groningen, the Netherlands
August 27, 2019


Abstract

Without needing parallel data, unsupervised machine translation (MT) creates opportunities for low-resource language pairs, but its quality remains unclear because of the lack of evaluation. To date, research has evaluated unsupervised MT systems based on automatic metrics only. In this MA thesis, a human evaluation of unsupervised MT was conducted through manual error annotation of the systems' outputs, to investigate the performance of unsupervised MT systems in comparison with supervised MT systems. The thesis aims to answer the question to what extent unsupervised MT proves beneficial when translating from English to Chinese in the news domain.

The unsupervised and semi-supervised MT systems were trained based on the unsupervised statistical MT built by Artetxe et al. (2018b). The outputs from two state-of-the-art neural MT systems were included in the evaluation. Based on existing studies, the error taxonomy for this annotation was developed from the Multidimensional Quality Metrics, with customisation for the English–Chinese language pair. The annotation was performed by one annotator on a sample of the systems' outputs. The results show that the unsupervised MT system generates output of little practical use, and that the neural MT system with the Transformer architecture performs best, producing 91% fewer errors than the semi-supervised system. Further research is needed to investigate the limitations of unsupervised machine translation.


Contents

1 Introduction
1.1 Machine translation
1.2 MT evaluation metrics
1.3 Error annotation
1.4 Research questions and Thesis overview
2 Literature Review
2.1 State-of-the-art human error analysis
2.2 Automatic error analysis
2.3 Error analysis tools
2.3.1 Holistic comparison
2.3.2 Quality annotation environments
3 MT systems and data sets
3.1 Supervised NMT systems
3.2 Unsupervised SMT systems
3.2.1 Data sets
3.2.2 Data pre-processing
3.2.3 Experiment set-up
4 Error annotation
4.1 Annotation setup
4.1.1 Import data structure
4.1.2 Annotator
4.1.3 Calibration set
4.1.4 Evaluation set
4.2 MQM and the Chinese MT tagset
4.2.1 Accuracy and Fluency
4.2.2 Severity level
4.2.3 Chinese MT tagset
4.2.4 Ambiguous Cases
4.3 Analysis approach
5 Results and Discussion
5.1 Baseline scores
5.2 Aggregate scores
5.3 Human error annotation
5.3.1 Normalisation
6 Conclusion
6.1 Summary
6.3 Final statement
Appendices
A Appendix
A.1 Script for training Monoses
A.2 Scripts for pre-processing data
A.3 Scripts for preparing file for annotation
A.4 Scripts for statistical analysis

List of Figures

2.1 The MT error typology proposed by Llitjós et al. (2005).
2.2 The MT error typology for the Chinese–English language pair proposed by Vilar et al. (2006).
2.3 The MT error typology for the English–Greek language pair proposed by Avramidis and Koehn (2008).
2.4 The MT error classification for the Chinese–English language pair proposed by Hsu (2014).
2.5 The MT error typology for the Chinese–English language pair proposed by Castilho et al. (2017).
2.6 The MQM-compliant error taxonomy for Slavic languages developed by Klubička et al. (2018).
2.7 The annotation environment provided by the DQF tools, taken from an evaluation task.
3.1 The architecture of the unsupervised SMT proposed by Artetxe et al. (2018b).
4.1 The sample MQM-compliant error hierarchy for diagnostic MT evaluation.
4.2 The Chinese MT tagset, derived from the MQM framework. All the changes are marked by red boxes and the issue types that are not included in the MQM issue types are italicised.
5.1 Bucket analysis


1. Introduction

This chapter introduces the most significant concepts of this thesis. It will provide in particular a brief explanation of machine translation, machine translation evaluation, and error analysis. In addition, the research questions and objectives of the study are outlined in this chapter. The overview of the thesis will be presented as well.

1.1 Machine translation

Machine translation (MT) is about using computational technologies to translate one natural language into another. Not only is MT an important technology in everyday life, facilitating information exchange and communication between different languages, but it has also been a subject of academic research for decades, where paradigms have changed and developed rapidly.

In the past few years, neural machine translation (NMT) (Bahdanau et al. 2014), and especially the Transformer architecture (Vaswani et al. 2017), has revolutionised the MT domain, which used to be dominated by statistical phrase-based machine translation (SMT) (Koehn et al. 2003). NMT models are based on deep learning, in contrast to SMT, which utilises statistical methods learned from parallel corpora to generate translations.1

Also recently, a new branch of research has emerged in the field of MT, introducing a new paradigm: unsupervised machine translation (Lample et al. 2017; Artetxe et al. 2017a; Artetxe et al. 2018b). Compared to SMT and NMT, unsupervised MT is trained without parallel data. Using only monolingual data, it creates opportunities for conducting research with low-resource language pairs that do not have abundant parallel data.

Although the unsupervised MT systems to date did not perform as well as the state-of-the-art supervised MT systems as per BLEU scores (Papineni et al. 2002), they still showed effective results. Artetxe et al. (2019) showed that their unsupervised machine translation system was able to outperform the previous winning supervised MT system at WMT2014 (Bojar et al. 2014) by 0.5 BLEU points in the language direction English to German.

1 A linguistic corpus of two or more languages where each element in one language corresponds to an element with the same meaning in the other language(s). Glossary, Moses statistical machine translation system, http://www.statmt.org/moses/?n=Moses.Glossary (accessed 18-08-2019).

1.2 MT evaluation metrics

The past decade of MT research has brought new methods, such as NMT and unsupervised MT. These new systems need to be evaluated to see how they perform. The quality assessment of systems is intricate in the field of MT: it ranges from scoring metrics to fine-grained error analysis, and from automatic methods to manual approaches.

Given that over a hundred systems have been submitted annually in the last two editions of the Conference on Machine Translation (WMT) (Bojar et al. 2017, p. 170; Bojar et al. 2018, p. 276), metrics are useful for performing the evaluation efficiently and give a score that can be used for direct comparison of systems2. Several different evaluation metrics have been proposed and utilised for this purpose, including common automatic metrics, such as BLEU3 (Papineni et al. 2002), TER4 (Snover et al. 2006), Meteor (Denkowski and Lavie 2014) and NIST5 (Doddington 2002). Manual options, such as direct assessments (Graham et al. 2017) and relative ranking (Bojar et al. 2016, p. 133; Federmann 2012), have also been put into use, the latter until WMT16 (Bojar et al. 2018).

Automatic measures are often used in research to provide metrics of general performance for direct comparison because they have an edge in efficiency and convenience. These methods produce an overall score by computing the similarity between the system output and the reference translation. The reference translation functions as the gold standard here, which can be controversial: multiple versions can be considered valid, because of variation in word choice and the flexibility of sentence structure. Papineni et al. (2002) also considered this reference bias issue and designed BLEU to use multiple references. However, it is rather commonly used with only one reference, even in the WMT shared tasks.
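To make this concrete, the minimal sketch below computes corpus-level BLEU with one and with several references. It assumes the sacrebleu package and uses placeholder segments and the Chinese tokeniser purely for illustration; it is not the exact WMT set-up.

```python
import sacrebleu  # assumed available: pip install sacrebleu

hypotheses   = ["这是 一个 例子", "机器 翻译 的 输出"]
references_a = ["这是 一个 例子", "机器 翻译 输出"]
references_b = ["这 是 个 例子", "机器 翻译 的 结果"]

# Single reference: the common case, as in the WMT shared tasks.
single = sacrebleu.corpus_bleu(hypotheses, [references_a], tokenize="zh")

# Multiple references mitigate the reference bias discussed above.
multi = sacrebleu.corpus_bleu(hypotheses, [references_a, references_b], tokenize="zh")

print(single.score, multi.score)
```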

Apart from the automatic scoring, other metrics, including direct assessments (DA) and relative ranking (RR), have been carried out manually in WMT to assess absolute and relative translation quality, respectively. In DA, each sentence is rated on a 100-point Likert scale (Graham et al. 2017), while in RR researchers rank five system outputs from best to worst for each sentence (Bojar et al. 2016, p. 133), after which the judgements are aggregated with the TrueSkill algorithm (Herbrich et al. 2007). The result of the RR approach constituted the official ranking of WMT until it was replaced by DA in 2017, since their results showed a strong correlation and DA needs fewer professionals to produce an absolute score (Bojar et al. 2016, p. 146; Bojar et al. 2017, p. 174).
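The way pairwise RR judgements can be turned into a system ranking is illustrated below with the trueskill Python package; the system names and judgements are invented for illustration and this is not the evaluation code actually used at WMT.

```python
from trueskill import Rating, rate_1vs1  # assumed available: pip install trueskill

ratings = {"system_A": Rating(), "system_B": Rating(), "system_C": Rating()}

# Each pairwise judgement states that one system's sentence was ranked above another's.
judgements = [("system_A", "system_B"), ("system_A", "system_C"), ("system_C", "system_B")]

for winner, loser in judgements:
    ratings[winner], ratings[loser] = rate_1vs1(ratings[winner], ratings[loser])

# A higher conservative estimate (mu - 3*sigma) means a better-ranked system.
ranked = sorted(ratings.items(), key=lambda kv: kv[1].mu - 3 * kv[1].sigma, reverse=True)
for name, r in ranked:
    print(f"{name}: mu={r.mu:.2f}, sigma={r.sigma:.2f}")
```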

1.3 Error annotation

However, systematic performance might not be reflected correctly in the aforementioned automatic and human evaluation metrics, due to the lack of information on what types of errors appear in the translations evaluated. Popović (2011) developed Hjerson, an automatic error annotation tool that overcame this major metric limitation. It can detect five types of errors on a word level and provide raw error counts and error rates efficiently. The downsides of this approach are that it only covers a small number of possible erroneous phenomena and leaves grammatical, contextual and stylistic problems unattended, and that its rigid structure is not adaptable to different research purposes. It is crucial for further MT development that these kinds of errors and their severity are investigated.

2 http://www.statmt.org/wmt17/index.html; http://statmt.org/wmt18/index.html.
3 Bilingual evaluation understudy.
4 Translation Error Rate.

Apart from the automatic annotation tool by Popović, other studies were dedicated to error analysis comparing the NMT and SMT paradigms. Castilho et al. (2017) conducted an evaluation of NMT by using a combination of automatic methods and human error evaluation via annotation, while Bentivogli et al. (2016) and Toral and Sánchez-Cartagena (2017) made use of different algorithms to look into certain types of errors in NMT and SMT translation. However, their methods might not be feasible for everyone, because the algorithms are complex and using them would require a considerable amount of knowledge coming from a background in machine translation technology.

Another popular way to conduct the evaluation of MT systems is human error annotation. Although the aforementioned methods are by no means ineffective ways to research MT quality, human error annotation is less cumbersome and easier to replicate because of the integration of human annotators in the evaluation process. The method also allows using a customised error taxonomy that can be adjusted to serve various research purposes. The taxonomy is then used by human annotators when annotating the translation. This method can shed light on which specific types of errors an MT system generates or overcomes. Such an approach has been implemented extensively to explore the translation quality of MT in many studies (Llitjós et al. 2005; Vilar et al. 2006; Hsu 2014; Klubička et al. 2018). Klubička et al. (2018) compared the performance of NMT and SMT on English–Croatian with a substantial error taxonomy that was based on the Multidimensional Quality Metrics (MQM) (Lommel et al. 2014).

To this day, no fine-grained evaluation has been conducted on unsupervised MT. Little is known about its performance on different translation issues, apart from the fact that an unsupervised system was mentioned as one of the worst in word sense disambiguation from German to English, in agreement with BLEU (Bojar et al. 2018, p. 294). Theoretically, the new method provides valuable opportunities for low-resource language pairs and should be able to produce corresponding translation systems. In practice, the present models have only been tested on high-resource and related language pairs, namely French and English, and German and English (Lample et al. 2017; Artetxe et al. 2019; Bojar et al. 2018). There was one submission on unrelated languages, Estonian to English (Del et al. 2018): the unsupervised system could barely outperform the baseline systems, a result considerably poorer than that of the unsupervised systems on English to French or German.

One can argue that the effective results are, to some extent, reliant on the language pairs being related and resource-rich. Yet it is important that the method also be implemented on other low-resource or distant language pairs, if only because global developments do not stand still and wait for research to catch up. In the past decade, China has become a serious competitor economically and politically. From 2008 to 2018 it tripled its Gross Domestic Product (GDP) from 4.6 to 13.6 trillion US dollars, representing 21.5 percent of the world economy.6 The country has become a major hub in the flow of capital, both financial and human. In 2009, the Irish journalist Clifford Coonan figuratively called China "the largest English-speaking country" in the world.7 Vice versa, Mandarin Chinese has become a language of interest in English-speaking countries. In Britain, for example, it is even possible for students to take a GCSE exam in the subject.8

6 Trading Economics, "China GDP", https://tradingeconomics.com/china/gdp (accessed 01-08-2019).
7 "The largest English-speaking country in the world? China, of course", The Irish Times, 20-06-2009, https://www.irishtimes.com/news/the-largest-english-speaking-country-china-of-course-1.788688 (accessed 01-08-2019).

1.4 Research questions and Thesis overview

Given that little research has been done on unsupervised MT quality, this thesis intends to add to the research concerning error analysis of MT systems by conducting a fine-grained human evaluation and error annotation on English to Chinese translation output produced by current state-of-the-art unsupervised SMT and supervised NMT systems.9 The aim of this study is to explore the results of error annotation conducted by human annotators on the translation outputs from unsupervised SMT and supervised NMT systems, based on a defined error taxonomy. News is a recurring genre for translation tasks in the WMT and has been researched and utilised in training, testing and evaluating MT systems. Hence, the main research question of this thesis will be: To what extent can unsupervised SMT prove beneficial when translating from English to Chinese in the news domain?

To answer this research question, the following sub-questions need to be resolved:

1. What kinds of problematic translation phenomena in the English to Chinese direction should be included in the error taxonomy?

2. How do unsupervised and semi-supervised SMT systems perform when translating news articles from English to Chinese?

3. What are the differences in performance of unsupervised SMT and supervised MT systems?

4. What are the differences in performance between supervised MT systems using the two most popular neural architectures: recurrent with attention and Transformer?

In order to facilitate the error analysis, I plan to build the error taxonomy upon the comprehensive MQM10, adapting the MQM Core and modifying its branches to include relevant translation problems or difficulties regarding this language pair, which adds a language direction that, to the best of my knowledge, has not been researched yet with this method. The taxonomy will then be implemented in the manual annotation of the outputs of the state-of-the-art unsupervised statistical machine translation and supervised NMT systems, using the data collected from the shared task on news translation at WMT19.11

The annotation will be performed using translate5, an open-source web-based tool.12 After finishing the annotation, the results will be analysed to explore the possible pros and cons of unsupervised SMT and compare its detailed performance against supervised NMT. The comparison might be able to shed light on the current capabilities of unsupervised SMT. Finally, a conclusion will be drawn addressing the research question.

The rest of the thesis will be organised as follows: Chapter 2 presents a literature review on state-of-the-art error analysis and the available tools. Chapter 3 deals with the state-of-the-art unsupervised SMT and supervised NMT systems, and the data sets used in the project. Next, Chapter 4 introduces the methodology for error annotation and the definition of the error taxonomy, after which the results and statistical analysis of the annotation are shown in Chapter 5, followed by a discussion with examples. Lastly, Chapter 6 gives a conclusion of the project and suggests possible directions for future research.

9 For clarification, Chinese stands for Standard Mandarin Chinese with simplified Chinese as the writing system throughout the thesis, unless indicated otherwise.
10 http://www.qt21.eu/mqm-definition/definition-2015-12-30.html.
11 http://www.statmt.org/wmt19/index.html.


2. Literature Review

This chapter is dedicated to shaping a theoretical framework for the research and it is split into three sections. The first section reviews the previous studies related to human error annotation. The second part introduces automatic error analysis technologies, to be compared with human error annotation. After reviewing both the manual and machine error analysis, the last section will discuss the tools that can contribute to the evaluation of MT quality, including holistic comparison tools and quality annotation environments.

2.1 State-of-the-art human error analysis

According to Hsu (2014), human error analysis involves the use of human evaluators to identify and classify mistakes in a given MT system. A significant use of this strategy is manual error annotation, where errors are marked by human annotators per category. Early work in error annotation tended to take an arbitrary and unsophisticated approach to categorising errors that appear in MT output, with the purpose of introducing a practical framework rather than providing a detailed error analysis. Llitjós et al. (2005) pioneered this line of work, proposing a preliminary MT error typology (see Figure 2.1) for English-to-Spanish translation to collect information on mistakes and corrections, so as to train a module that would be able to locate and correct errors in the translation output automatically. The error typology was structured hierarchically but without justification of or elaboration on their decisions. Its first level included five classes: missing words, superfluous words, wrong word order, incorrect words and wrong agreement. The limited terminology seems straightforward but also shows a strong emphasis on the lexical level.

Figure 2.1: The MT error typology proposed by Llitjós et al. (2005).

This error typology was further developed and employed in error analysis on results from different models of supervised MT that were preprocessed with different kinds of corpora (Vilar et al. 2006). They adjusted the classification by moving categories around: extra words and wrong agreement went to the second level, affiliated to incorrect words, and they added two new classes, unknown words and punctuation, to the first level for the English-to-Spanish language pair. Variations in the taxonomy were made for the Chinese-to-English direction in accordance with the features of this language pair (as shown in Figure 2.2). For example, a refined categorisation of word order was added to mark syntactical mistakes that appear in translations of questions, infinitives, and declarative and subordinate sentences. Also, considering that Chinese and English have different writing systems, English proper names need to be converted to Chinese, sometimes based on pronunciation. For this reason, they specified the error type Unknown Words into four sub-types, including person, location, organisation and other proper names. In addition, they stated that punctuation could be included in the taxonomy, but they did not do so in the Chinese–English error typology, since punctuation errors often only cause minor disturbances (Vilar et al. 2006, p. 698). They found from the results that few or no reordering errors on questions and subordinate sentences were observed in the two MT outputs, and neither were unknown words.


Figure 2.2: The MT error typology for the Chinese–English language pair proposed by Vilar et al. (2006).

Despite the partial overlap in the classification, they defended it by arguing that there is a high possibility that one mistake would lead to another and multiple errors could occur in the same sentence (Vilar et al. 2006, p. 698). Though their argument was not well-established, it touched upon the obstacle to executing annotation when more than one error appears in one sentence. Problematically still, the methods and tools that they used to conduct the error annotation, and the language proficiency and professional background of annotators were not clearly stated and elaborated upon.

In spite of this, the error taxonomy proposed by Vilar et al. (2006) has been extensively utilised as the framework for human error analysis in other studies on the quality evaluation of various MT systems on different language pairs. It has served a wide range of purposes, ranging from the discovery of linguistic problems for systematic improvement, to reporting the effectiveness of improvements, to comparing the strengths and weaknesses of different MT systems with respect to error types. For instance, Avramidis and Koehn (2008) made use of the classification to evaluate SMT on English to Greek, tackling the difficulty of "translating from a morphologically poor to a morphologically rich language" (p. 763). Their error taxonomy is shown in Figure 2.3. They specified the lexical categories in the subset of incorrect forms under incorrect words, and the introductory error annotation on the baseline system showed that the highest numbers of errors appeared in noun-case agreement and verb-person conjugation. The improvement of the SMT system was then carried out with a focus on these linguistic problems, and a subsequent manual error analysis was conducted to assess the degree of success achieved.


Figure 2.3: The MT error typology for the English–Greek language pair proposed by Avramidis and Koehn (2008).

Following Avramidis and Koehn (2008), error analysis with a focus on assessing structural improvement was also used in the evaluation of a refined SMT system in the Chinese-to-Korean direction (Li et al. 2009). They carried out an error analysis with only three categories from the original model by Vilar et al. (2006) (missing words, wrong word order and incorrect words), the latter with a further specification into wrong lexical choices/extra words and incorrect modality. The simplified classification was used to check whether their method of reordering verb phrases, preposition phrases and modality-bearing words in the Chinese data resulted in an improved system. Khalilov and Fonollosa (2009) carried out a similar error analysis with the intention to contrast Arabic-to-English translations produced by an N-gram-based SMT system with those from a syntax-augmented MT system, so as to provide information on the strong and weak linguistic phenomena of the two systems. Because of the time-consuming nature of the annotation, the evaluation was done manually on a random selection considered representative of the two systems. Examples of errors were selected and discussed to form a subjective evaluation of severity.

However, these studies have in common that the way they conducted the error analyses was only vaguely described. The lack of clarification on the annotation set-up, the language proficiency and translation experience of the annotators, and the statistical analysis implies that the research method could be exceedingly questionable. In response to the inspection done by Vilar et al. (2006), Max et al. (2008) proposed a system that included contextual features to enhance the translation quality of a language pair with morphological inequality. They addressed the case of English–French with the help of four native French speakers ranking different outputs to confirm that contextual features were beneficial, in contrast to the results of automatic metrics. In this case, the manual evaluation remained at the holistic level.

Analytic approaches have also been used for identifying errors. In the research of Farrús et al. (2009), in order to improve high-performance SMT on Catalan–Spanish, an error analysis covering morphosyntactic details was completed by a native Catalan and Spanish linguist on the SMT output. However, the error classification was highly language-oriented and did not have a hierarchical structure and therefore was not replicable on other language pairs.

Farrús et al. (2010) organised errors into a language categorisation, covering orthographic, morphological, lexical, semantic and syntactic mistakes, to assess the translations using linguistic criteria. Hsu (2014) adopted this classification scheme (Farrús et al. 2010) in an error analysis on Chinese–English MT. Hsu extended the first-level language classification into a list of language-dependent subcategories, including capitalisation and spelling for English, as shown in Figure 2.4. Hsu's research found that error occurrences are related to the language characteristics of the source language, such as the arrangement of syntactic elements in Chinese.

Figure 2.4: The MT error classification for the Chinese–English language pair proposed by Hsu (2014).

Apart from proposing an error taxonomy to verify that a specific change improves a system, error analysis has been frequently used to compare different systems and paradigms. To illustrate, Castilho et al. (2017) conducted an extensive evaluation to compare the quality of NMT with SMT in three domains: e-commerce, patent and Massive Open Online Courses. Firstly, they conducted a DA and RR and asked bilingual native German speakers to rate the preservation of meaning on a 4-point Likert scale and also to rank the three translation outputs on e-commerce from worst to best. Secondly, two annotators assessed a sample of Chinese–English translation with a content-specific error taxonomy (Figure 2.5) for the patent domain, consisting of punctuation, wrong terminology, part of speech, literal translation, addition, omission and word form. The results showed that NMT outperformed SMT in sentence structure, while SMT had considerably fewer omission problems and more error-free segments. Lastly, professional translators for the language combinations from English to German, Greek, Portuguese and Russian were recruited to annotate with a simple categorisation (mistranslation, inflectional morphology, addition, word order and omission) and again to rate the adequacy and fluency of the outputs on a 4-point Likert scale. They introduced the method of creating a content-specific, in-domain error typology combined with ratings of adequacy and fluency, which is a breakthrough in the field of error analysis.


Figure 2.5: The MT error typology for the Chinese–English language pair proposed by Castilho et al. (2017).

Bentivogli et al. (2016) and Toral and Sánchez-Cartagena (2017) also conducted error analyses to compare the quality of NMT with SMT. Instead of annotating errors manually, errors were detected and classified automatically by means of different algorithms in both studies. Bentivogli et al. (2016) found that NMT outperforms SMT with fewer morphological, lexical and word order errors and on all sentence lengths. Toral and Sánchez-Cartagena (2017) discovered that though SMT performs better with long sentences, NMT produces more fluent output with fewer inflection and reordering problems. Both studies verify that NMT has pushed the state of the art in MT considerably forward.

The error typologies in the above studies were used to identify major problems in the translation, to verify the positive effects of certain methods or to compare different MT systems. While they perform well in isolation, it is difficult to compare them with each other. In response to the lack of a standard in error taxonomies, frameworks were proposed that attempt to standardise the evaluation of translation quality. One example is the Dynamic Quality Framework (DQF) developed by the Translation Automation User Society (TAUS) in 2011 (O'Brien 2012; Görög 2015). DQF aims to provide a shared language on quality evaluation for the translation industry by providing standard evaluation methods, one of which is the TAUS DQF error typology (Görög 2015). This typology includes accuracy, fluency, terminology, style and locale convention at its first level, which can be specified into more granular levels. The DQF error typology is implemented in the DQF tools, which will be discussed in Section 2.3.2, on the TAUS Evaluate platform1 (Görög 2015).

Another example of a translation quality evaluation framework is the MQM framework promoted by the QTLaunchPad project in order to clarify translation phenomena as "issue types" in a systematic manner (Lommel et al. 2014). The MQM Issue types2 define more than 100 issues derived from automatic and manual annotation, which offers a large variety of errors to choose from. Its hierarchical structure and flexible guidelines make it possible for the framework to be adapted and used for a variety of translation genres, purposes, orientations and languages. Meanwhile, it maintains scientific compatibility among different projects. For example, it reached harmonisation with the TAUS DQF error typology and includes the DQF as a subset of MQM.3 Therefore, it will also be used as the main framework for the error taxonomy of this thesis.

Klubička et al. (2018) built their categorisation upon the well-established MQM in one of the most recent studies on error analysis. Their custom MQM taxonomy (Figure 2.6) was built for Slavic languages for the evaluation of SMT and NMT systems in the language direction English to Croatian. Slavic linguistic characteristics were taken into account when extending their error taxonomy based on the MQM Core4, which includes accuracy and fluency at the first level. Considering the morphological complexity of the Croatian language, a subset including person, number, gender and case was added under agreement. The annotation task was conducted in the web-based tool translate5 and completed by two annotators with prior knowledge of MQM and translate5, the same academic background and Croatian as their mother tongue. Apart from a detailed error comparison, statistical analysis was performed with a chi-square (χ2) test to compute whether the differences in the amounts of issue types among the systems were statistically significant, which was not included in previous studies.
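The kind of significance test reported by Klubička et al. (2018) can be run with a few lines of Python; the contingency table of error counts below is made up purely for illustration.

```python
from scipy.stats import chi2_contingency  # assumed available via SciPy

# Rows: systems; columns: counts for three hypothetical issue types.
#               Mistranslation  Omission  Word order
error_counts = [[120,            45,       60],   # system 1
                [ 70,            30,       25]]   # system 2

chi2, p_value, dof, expected = chi2_contingency(error_counts)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, dof = {dof}")
```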

1 https://dqf.taus.net/.
2 http://www.qt21.eu/mqm-definition/issues-list-2015-12-30.html.
3 http://www.qt21.eu/mqm-definition/definition-2015-12-30.html#dqf-mapping.
4 http://www.qt21.eu/mqm-definition/definition-2015-12-30.html.


Figure 2.6: The MQM-compliant error taxonomy for Slavic languages developed by Klubička et al. (2018).

2.2 Automatic error analysis

Other works have been devoted to automation of error classification. Although it is not the main focus of this thesis, it would be helpful to understand what has been accomplished on the automated side as it might shed light on how human evaluation can benefit from automatic tools. In determining whether to make use of automated or human evaluation, there are certain possibilities and weaknesses to consider, which will be briefly discussed in this section. Apart from the well-known automatic evaluation metrics, such as BLEU and TER, researchers and MT experts have sought possibilities to build tools that can perform more sophisticated error analysis to deal with time limitations and labour-intensive manual options.

As mentioned before, Hjerson, one of the computational tools designed for systematic error annotation, was built based on the error category scheme presented by Vilar et al. (2006) (Popović 2011; Popovic and Burchardt 2011). The tool can cover five ma-jor language-independent error categories: inflectional errors, reordering errors, missing words, extra words and incorrect lexical choices. It was tested on the language pairs Arabic–English, Chinese–English, German–English and further made use of by Toral and Sánchez-Cartagena (2017) for an evaluation of nine language directions. Though the re-sults (Popovic and Burchardt2011, in) showed high correlation with the parallel human error analysis, the result was limited to the word level as mentioned earlier in the intro-duction.

Another similar automatic error detection tool is Addicter (Zeman et al. 2011). In Addicter, a monolingual alignment was used to pair words in the hypothesis6 with the matching words in one or more reference translations (Zeman et al. 2011, p. 80). Through the alignment between reference and hypothesis, Addicter was able to spot errors including missing and untranslated words, punctuation, wrong surface forms7 and word order errors. The algorithm should also be capable of identifying extra hypothesis words, though this was not mentioned in the article. All the aforementioned errors were flagged and grouped into four major types: missing reference words, lexical, order and punctuation errors (Zeman et al. 2011, p. 84). Both Hjerson and Addicter shared a word-based mechanism and both error schemes were derived from the work of Vilar et al. (2006). With similar functions and goals, Hjerson was combined with Addicter to further enrich the latter's graphical user interface (GUI) by generating error summaries, and both variations of the error taxonomies were included (Berka et al. 2012).

6 Hypothesis: the translation output produced by the MT system, in the same way that "source text" refers to its input.
7 Surface form: a word/lemma with inflections.

Hjerson and Addicter were developed to provide automatic error analyses that tell more about MT systems than overall automatic metrics, but, conversely, they were also integrated to generate scores for a metric, TerrorCat (Fishel et al. 2012). It made use of machine learning technology to rank sentences from MT outputs and normalised the rankings into a general score to facilitate comparison (Fishel et al. 2012, p. 64). Machine learning generally relies on a large amount of training data; for language pairs without enough manually ranked data, the metric would possibly produce erroneous results (Fishel et al. 2012, p. 69). It is uncertain to what extent TerrorCat could contribute to improving MT quality evaluation.

Different from word-based automatic methods, diagnostic evaluations that aim to examine MT systems over linguistic units have also been a focus in automatic error analysis research. An early study was done by Yu (1993). He presented MTE, an automatic evaluation system that works on test points extracted from target translations. An English–Chinese MT translation was used as an example. Words, phrases and sentences were extracted as test points, ranging from specific words to sentential patterns. The test points were categorised into six groups: words, idioms, morphology, and elementary, moderate and advanced grammar (Yu 1993, p. 189). However, the automatic extraction and categorisation of test points were not subject to approval by users. The system's lack of controllability is likely to have a negative impact on its application.

Further developing the idea of test units, Wang et al. (2014) proposed Woodpecker, a checkpoint-based diagnostic evaluation. Sharing a similar concept with a test unit, a checkpoint refers to a linguistic unit (a word, a phrase or a sentence) which is extracted automatically by word alignment, parsing programmes, linguistic rules and special-word dictionaries (Wang et al. 2014, p. 1411). Chinese–English translation was taken as an example to demonstrate the tool. They adapted a manual categorisation from grammatical references into an extensive error taxonomy of 22 Chinese categories and 21 English categories covering the word, phrase and sentence levels. The working procedure of Woodpecker starts by identifying checkpoints in the source text, reference and hypothesis, aligning source checkpoints with the reference, matching n-grams of the hypothesis with the reference, and finally computing diagnostic scores for the different linguistic categories. Because Woodpecker relies heavily on the identification of grammatical units, it may prefer over-literal translations that resemble the grammatical structure of the source. This downside is likely to persist and become more evident when analysing English to Chinese translation, owing to the grammatical differences between the two languages.

In addition, Woodpecker is not open-source and has strict restrictions on modification, which hinders its wide implementation in further academic research. The Woodpecker toolkit8 is published and distributed by Microsoft Research and is limited to Windows systems only. The last update took place eleven years ago, which makes it seem outdated and user-unfriendly, even though it is one of the few tools that have been compiled into an executable installation file.

To overcome the limitations of Woodpecker, Toral et al. (2012) built DELiC4MT9 on the same paradigm, but the tool was constructed with open-source modules. The core components of DELiC4MT remain part-of-speech (PoS) tagging and word alignment. The architecture also supports user-specified linguistically-motivated units. It has been tested in an evaluation of different language directions covering German, Dutch, Italian and English (Naskar et al. 2011), which overcame the limitation of Woodpecker being language-dependent. The inherent complexity of DELiC4MT is that word aligners need a large amount of parallel data to be trained to perform accurate alignment (Toral et al. 2012, p. 130).

8 https://www.microsoft.com/en-us/download/details.aspx?id=52447.
9 https://github.com/antot/DELiC4MT.

With NLP technology and machine learning, there has been great advancement in automatic error classification methods. The taxonomies have become more extensive, from the model of Vilar et al. (2006) to hundreds of linguistically rich checkpoints. The algorithms have also grown from word-based rates and alignment to the phrase and sentence level. Taxonomy customisation is achievable by listing syntactic and semantic rules, and translation flexibility can be accounted for by including synonyms in dictionaries.

The disadvantages of automation should not be neglected either. How well machine learning algorithms can perform depends highly on the quality and quantity of training data; data scarcity and inferior data could impair results critically. In the three cases of diagnostic evaluation, the encoding language has changed from TDL to XML and then to KAF (Bosma et al. 2009), which could potentially increase the difficulty for future research when such languages become unreadable for computer programs. There are also licensing and accessibility issues: the online demo of DELiC4MT10 cannot be found11 and Woodpecker has been left unattended for eleven years.

2.3 Error analysis tools

Automatic error classification is still limited in many aspects. For instance, it cannot identify stylistic or culturally related errors, and it is difficult to change built-in taxonomies to serve different purposes. Moreover, human evaluation is regarded as the foremost way to assess Chinese outputs in WMT. However, digital tools should not be opposed in humanities research. Integrating tools and digital methods into translation studies could contribute to quantitative analysis and improve the efficiency of human error annotation, for which manual methods have been heavily criticised. It is also an approach which is fitting for the Digital Humanities, where one of the field's focuses is exploring how humanities research can benefit from information technology.

In the previous sections, I provided an overview of recent studies that addressed how error analyses can be conducted manually and automatically. The following section will discuss a set of computational tools that could be used to collaborate with human error analysis, including a holistic comparison tool and quality annotation environments. Such tools can be used to assist manual annotation processes and make them easier and more efficient.

2.3.1 Holistic comparison

compare-mt is a holistic comparison tool that implements many of the aforementioned automatic evaluation metrics and methods that were utilised in automatic error classification tools (Neubig et al. 2019). It succeeds in combining statistical analysis with examples in one open-source Python package, covering BLEU scores (Papineni et al. 2002), length ratio, word accuracy and statistical analysis on sentence buckets, up to characteristic n-grams. The purpose of compare-mt is to provide a basic general review of two systems' performance that can function as a preliminary quality evaluation for further analysis (Neubig et al. 2019). The tool is utilised to provide a part of the statistical analysis for this research. The set-up and results are discussed and displayed in Sections 4 and 5.

10 https://www.computing.dcu.ie/~atoral/delic4mt/.
11 The retrieval attempt was made on June 28, 2019.

For the statistical analysis, Neubig et al. (2019) categorised the methods into aggregate score analysis, bucketed analysis and n-gram difference analysis. In addition to BLEU, aggregate score analysis includes comparing the length of the outputs to the length of the reference, as shown in Equation 2.1. The ratio can reveal whether a system tends to produce verbose or laconic translations (Neubig et al. 2019). However, given that BLEU and length ratios are computed by treating the output as a whole, they do not by themselves allow a meaningful statistical analysis of the results. Bootstrap resampling (Koehn 2004) is therefore used in compare-mt to compute confidence intervals and the statistical significance of differences in these scores.

\[ \text{length ratio} = \frac{\text{total output length}}{\text{total reference length}} \tag{2.1} \]

compare-mt also produces a bucket analysis (Neubig et al. 2019). Words are grouped into buckets based on frequency to calculate their F-score, which reflects accuracy, while sentences are grouped by length, difference in length and score to visualise relations and distributions (Neubig et al. 2019). The flexible bucket analysis can be extended with POS tag analysis (Chiang et al. 2005), and the n-gram analysis is built by finding the best n-grams in the outputs, following Akabe et al. (2014).
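To make the aggregate analysis concrete, the sketch below computes the length ratio of Equation 2.1 and a paired bootstrap comparison of two systems. It assumes the sacrebleu package for BLEU and is only a simplified stand-in for compare-mt's own implementation; all function names are illustrative.

```python
import random
import sacrebleu  # assumed available; compare-mt ships its own scorer

def length_ratio(hyps, refs):
    """Equation 2.1: total output length divided by total reference length."""
    out_len = sum(len(h.split()) for h in hyps)
    ref_len = sum(len(r.split()) for r in refs)
    return out_len / ref_len

def paired_bootstrap(sys_a, sys_b, refs, n_samples=1000, seed=0):
    """Fraction of resampled test sets on which system A beats system B in BLEU."""
    random.seed(seed)
    ids = list(range(len(refs)))
    wins_a = 0
    for _ in range(n_samples):
        sample = [random.choice(ids) for _ in ids]  # resample sentences with replacement
        a = [sys_a[i] for i in sample]
        b = [sys_b[i] for i in sample]
        r = [refs[i] for i in sample]
        bleu_a = sacrebleu.corpus_bleu(a, [r]).score
        bleu_b = sacrebleu.corpus_bleu(b, [r]).score
        wins_a += bleu_a > bleu_b
    return wins_a / n_samples
```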

The results of the multiple features are then presented in graphs and tables in an HTML-based interactive visualisation. All the analyses contribute to offering an overview of salient characteristics and differences, and to shedding light on further evaluation.

There are other tools designed for the holistic comparison of MT systems that share similar functionality, such as MT-ComparEval12 (Klejch et al. 2015). Compared to compare-mt, it does not allow bucket analysis on the word and sentence level and it has a higher dependency on external software, which makes it more complicated to apply.

2.3.2 Quality annotation environments

As mentioned in the introduction, Appraise, a web-based interface implemented in Python and Django, became widely known for collecting rankings and gradings of MT outputs as required by the WMT shared tasks. In the initial design, the open-source toolkit supported not only ranking of translations, but also error classification, quality estimation and post-editing (Federmann 2012). The error classification was possible on the word and sentence level, but it was not possible to change the typology, which was refined from the work of Vilar et al. (2006). During the task, annotators can read one sentence from the source and the corresponding one from a system output. It seems to be impossible to annotate multiple systems at once, which hinders comparison. The segmented annotation of individual sentences might not be helpful for annotators to understand the context, which could conceal contextual problems.

Recent studies have begun to challenge the error categorisation with new discourses and methods in both automatic and human error analysis. Martins and Caseli (2015) employed machine-learning algorithms to train automatic error annotation systems with manually annotated English to Brazilian Portuguese translation data. The manual error annotation was conducted with Blast13, a graphical tool for human error analysis (Stymne 2011). It is worth mentioning that Martins and Caseli (2015) kept a conventional error typology (Vilar et al. 2006) for the manual annotation, including inflectional errors, lexical problems, multiword expressions and reordering. The advantages of Blast over Appraise are that it is configurable with customised hierarchical typologies and that it is possible to view the reference translation during annotation (Stymne 2011). The disadvantages are similar to those of Appraise, for example being incompatible with annotating two or more outputs and the viewing of individual sentences.

For users that are not familiar with configuration, Ocelot14 provides compiled distributions, and the DQF tools can be accessed through its website15. In the graphical interface of Ocelot, multiple sentences can be viewed at the same time and a maximum of five error flags can be attached to units in one sentence. The drawbacks of Ocelot lie in its limited selection of error types, while no hierarchical typology is supported either. As shown in Figure 2.7, although the DQF tools provide an annotation environment that is compatible with the DQF–MQM error typology subset16, users have to count and categorise errors per segment into the first level of the typology and possibly mark down the specific errors in the comment section. This would require a significant amount of annotation work and it would be difficult to quantify the lower levels of errors.

Figure 2.7: The annotation environment provided by the DQF tools, taken from an evaluation task.

In the past few years of research, two new tools have come into use that provide configurable implementations of the MQM framework: the MQM Scorecard17 and translate5.

Both interfaces display several lines of text and the files can be easily scrolled up and down, which avoids examining sentences out of context. In the MQM Scorecard interface, the source text and the target text are displayed in separate columns. When annotating texts in the MQM Scorecard, annotators can add issue types and notes to an entire segment. After annotation, the MQM Scorecard can generate a score based on the amount and severity of the errors, and a report based on the number of each error18.
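The severity-weighted scoring that such tools produce can be sketched as follows; the severity weights and the per-word normalisation below are assumptions chosen for illustration, not the Scorecard's exact formula.

```python
# Illustrative MQM-style scoring; weights and normalisation are assumed values.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

def mqm_score(error_counts, word_count):
    """error_counts: dict like {"minor": 12, "major": 3, "critical": 1}."""
    penalty = sum(SEVERITY_WEIGHTS[s] * n for s, n in error_counts.items())
    # Express the penalty relative to the text length and subtract from a perfect 100.
    return 100.0 - 100.0 * penalty / word_count

print(mqm_score({"minor": 12, "major": 3, "critical": 1}, word_count=850))
```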

translate5 prevails over the MQM Scorecard because translate5 supports tagging sequences of words with issue types and annotating multiple translation outputs simultaneously. Although the installation of translate5 has complicated system requirements and requires several dependencies, it seems to offer the most functionality and the most detailed MQM support with error reports. In order to observe the unsupervised SMT and NMT systems meticulously, translate5 will be used as the quality annotation platform for the human error annotation in this project.

13 https://cl.lingfil.uu.se/~sara/blast/.
14 https://open.vistatec.com/ocelot/index.php?title=Main_Page
15 https://dqf.taus.net/
16 https://www.taus.net/evaluate/qt21-project#harmonized-error-typology.
17 https://github.com/multidimensionalquality/qt21-scorecard
18 http://www.qt21.eu/launchpad/node/1347.html


3. MT systems and data sets

This chapter introduces the MT systems and data sets used in this project. For the supervised systems, I used outputs from top systems in WMT, while for the unsupervised MT, I trained my own system with monolingual corpora, since no system was yet available for the language pair English to Chinese. Firstly, Section 3.1 briefly introduces the mechanism of NMT and presents the two NMT systems involved in this research. It emphasises exploring the salient differences between the systems for better understanding, rather than explaining each system in full technical detail.

Secondly, Section 3.2 presents the unsupervised MT systems and explains the data sets and the experiment set-up for training the semi-supervised and unsupervised models. In total, four different MT systems are included for comparison: a semi-supervised and an unsupervised SMT system built with Monoses1, the open-source implementation of the work of Artetxe et al. (2018b), and two supervised systems, AI-Translation2 and the University of Edinburgh's Neural MT system (uedin-nmt) (Sennrich et al. 2017a).

3.1 Supervised NMT systems

In recent years, NMT has risen rapidly, benefiting from its application of deep learning. Deep learning is also known as deep neural networks, a division of machine learning (Goldberg and Hirst 2017, p. 2). As Goldberg and Hirst (2017) stated, the core idea of machine learning can be briefly summarised as a procedure where a machine learns to make predictions according to patterns recognised in the data, and the process depends on a massive amount of data and algorithms. The mechanism of machine learning is often introduced with the human brain as an analogy. To be more specific, a major component of neural networks for text can be characterised as converting textual data into machine-readable numerical vectors, namely, substituting a sequence of letters or words with a sequence of numbers, which the neural networks can then process.

Goldberg and Hirst (2017) concluded that recurrent neural networks (RNNs) (Elman 1990) and feedforward neural networks are the two main types of neural network architectures. Nematus is a case where both networks are involved, while the Transformer contains no recurrence (Sennrich et al. 2017b; Vaswani et al. 2017). In a simple example of applying RNNs in MT, the networks take sequential data, apply computations on each token of the sequence and transform it into a vector.3 Meanwhile, they also produce a hidden vector with the structured properties of the first token. While transforming the next token, the hidden vector is fed back into the computation and the new token's information is added into the hidden vector (Lipton et al. 2015). The process repeats until the whole sequence is output as a sequence of vectors. On the contrary, feedforward neural networks do not allow feedback of hidden vectors back to another output (Goodfellow et al. 2016, p. 164); namely, all the information only goes forward, including the input and hidden vectors.

1 https://github.com/artetxem/monoses.
2 http://matrix.statmt.org/systems/show/4243.
3 Tokens are the basic unit in a machine translation process. Tokens are a sequence of characters, such as words, punctuation or symbols, separated by a space. Glossary, Moses statistical machine translation system, http://www.statmt.org/moses/?n=Moses.Glossary (accessed 18-08-2019).

One of the translation outputs for the error analysis is from the uedin-nmt system, which is trained with Nematus, a recurrent encoder-decoder architecture with attention (Sennrich et al. 2017b). The system is explained in detail by Sennrich et al. (2017a). Recurrence allows information feedback on the syntactic properties of tokens in the sequence. Encoder-decoder architecture is another name for the seq2seq framework (Cho et al. 2014). The model consists of two parts: an encoder and a decoder. In this scenario, the encoder encodes the input text into vectors, and the decoder then decodes the vectors and predicts the most probable translation. For Nematus, the encoder is composed of an RNN, while the decoder includes a feedforward hidden layer for the hidden vectors. Attention can be characterised as mapping two arrays into a matrix to explore their correlation. The attention mechanism here is involved in a recurrent unit, where the whole context set and the intermediate hidden state serve as input to compute a context vector (Sennrich et al. 2017b).

Different from uedin-nmt, the other supervised MT system included in this research is the AI-Translation system, which benefits from the Transformer architecture, a sequence-to-sequence (seq2seq) self-attention mechanism without any recurrence, described thoroughly by Vaswani et al. (2017). The Transformer makes use of self-attention and feedforward neural networks in both the encoder and decoder rather than RNNs. Self-attention is a kind of attention mechanism where the same sequence is input twice as two arrays to compute a representation of the sequence (Vaswani et al. 2017). Without recurrence, there could not be any feedback of the structured properties of the sequence. To compensate for this loss, positional encoding was added to the architecture to fill in the gap of syntactic information. It encodes the information about the order of the sequence in the same dimension as the vectors of the sequence, so as to be compatible with them.
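To make these two ideas more concrete, the sketch below shows scaled dot-product self-attention and sinusoidal positional encoding in a highly simplified form: the learned query, key and value projections and the multi-head splitting of the real Transformer are omitted, and the token vectors are random placeholders.

```python
import numpy as np

def scaled_dot_product_self_attention(x):
    """x: (sequence_length, d_model). The same sequence provides queries, keys and values."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                          # similarity of every token with every token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over the sequence
    return weights @ x                                     # each position becomes a mix of all positions

def positional_encoding(seq_len, d_model):
    """Sinusoidal positions added to the embeddings so that order information is not lost."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

tokens = np.random.randn(5, 8)                             # 5 placeholder tokens, d_model = 8
contextualised = scaled_dot_product_self_attention(tokens + positional_encoding(5, 8))
```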

Before the development of the Transformer in 2017, the recurrent seq2seq architecture with attention (Bahdanau et al. 2014) was dominant over other NMT and PBMT approaches. The uedin-nmt system was the model with the highest BLEU score, 36.3, for the language pair English–Chinese at WMT2017, and it was also one of the top three systems ranked in DA. Systems implemented with the Transformer architecture have shown a predominant advantage over those with RNNs since 2018 (Bojar et al. 2018). Consistent improvement can be observed in the results of WMT18 and WMT19. AI-Translation achieved the highest BLEU score of 44.6 on English–Chinese news translation at WMT19, which implies that it should produce the best English to Chinese translation. This shows the great potential of the attention-only seq2seq architecture of Transformer models, which have outperformed the RNN-based Nematus toolkit.

From the perspective of BLEU scores, both uedin-nmt and AI-Translation obtained rather impressive results in comparison to systems for other language pairs at WMT, even though the language direction English to Chinese has only been included in WMT for three years. It is therefore worthwhile to explore the performance of these systems in detail, to see how far MT has come in translating English news into Chinese and whether the difference in their main mechanisms leads to a noticeable difference in translation errors related to, e.g., syntactic structure or meaning.


Shared Task: Machine Translation of News4. For AI-Translation, its translation is uploaded and accessible online5; for uedin-nmt, although the system did not participate in the task, its output for this test set was kindly provided by Dr. Rico Sennrich.

3.2 Unsupervised SMT systems

The unsupervised and semi-supervised SMT systems were built with the unsupervised MT system Monoses, which was implemented with Python, Moses v4.06, FastAlign7, Phrase2vec8 and VecMap9. The methodology is described in depth by Artetxe et al. (2018b). In brief, Moses is an SMT system that creates phrase tables of n-grams, similar to probabilistic bilingual dictionaries (Koehn et al. 2007). FastAlign is an unsupervised word aligner that lines up words in the source and target languages (Dyer et al. 2013). Phrase2vec is an extension of word2vec (Mikolov et al. 2013) that learns monolingual word and n-gram embeddings, that is to say, it turns words and n-grams into vectors (Artetxe et al. 2018b). VecMap, also known as cross-lingual word embedding mapping, puts two monolingual word embeddings into the same space (Artetxe et al. 2018a). The target word or phrase whose embedding is closest to the source embedding is then the most probable translation.

Phrase2vec appears highly similar to the step of converting textual data into vectors in deep learning; in fact, both are word embedding methods. It could therefore be argued that Monoses is a hybrid of SMT and NMT. However, although the word embedding technique is used, Monoses does not have a decoder in which vectors are interpreted to make predictions, as in NMT. In Monoses, phrase tables connect the two separate word embeddings by collecting the 100 closest tokens for each source unigram, bigram and trigram as its possible corresponding translations, based on Phrase2vec and VecMap (Artetxe et al. 2018b).
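The induced phrase table thus boils down to a nearest-neighbour search in the shared embedding space. The following is a minimal sketch of that lookup, assuming the source and target embeddings have already been mapped into a common space with VecMap; the function and variable names are illustrative and this is not Monoses code.

import numpy as np

def top_k_translations(src_vec, tgt_vocab, tgt_matrix, k=100):
    # src_vec:    (dim,) embedding of a source n-gram in the shared space
    # tgt_vocab:  list of target n-grams
    # tgt_matrix: (len(tgt_vocab), dim) embeddings of those n-grams
    src = src_vec / np.linalg.norm(src_vec)
    tgt = tgt_matrix / np.linalg.norm(tgt_matrix, axis=1, keepdims=True)
    sims = tgt @ src                      # cosine similarity to every target n-gram
    best = np.argsort(-sims)[:k]          # indices of the k closest candidates
    return [(tgt_vocab[i], float(sims[i])) for i in best]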

Artetxe et al. (2017b) also designed an unsupervised NMT system10 that implements an encoder-decoder model. Rather than consistently using NMT systems throughout this research, the unsupervised SMT system is included in the comparison because Monoses has shown a substantially better performance than unsupervised NMT. Bojar et al. (2018) mention that unsupervised systems performed much more poorly than supervised systems. Results from unsupervised systems are likely to be unintelligible, which would hinder conducting an error analysis on them. It is therefore a practical and feasible decision to include the stronger of the two unsupervised approaches and evaluate its capability and performance in detail.

3.2.1 Data sets

Training set

The semi-supervised and unsupervised SMT systems were both built with Monoses but trained with different data sets on Peregrine. The semi-supervised model was trained with a set of pure news data within which parallel data from News Commentary were included.

4 http://www.statmt.org/wmt19/translation-task.html.
5 http://matrix.statmt.org/.
6 https://github.com/moses-smt/mosesdecoder/tree/RELEASE-4.0.
7 https://github.com/clab/fast_align.
8 https://github.com/artetxem/phrase2vec.
9 https://github.com/artetxem/vecmap.
10 https://github.com/artetxem/undreamt.


News Crawl monolingual corpora were collected annually, covering 2007 to 2018 for English and 2008 to 2018 for Chinese. In total, 23 GB of English news data were available but only 284 MB of Chinese news data. This scarcity of Chinese data is likely to affect the performance of the semi-supervised SMT system. Due to the imbalance of data, the system might not be able to recognise the relation between English and Chinese words, and when an unfamiliar English word appears in the test data, there is a high chance that the system will translate it incorrectly.

                 English training data              Chinese training data
Semi-supervised  News Crawl, News Commentary        News Crawl, News Commentary
                 (sum: 194,543,763 lines,           (sum: 2,091,984 lines,
                 4,518,597,322 tokens)              50,818,101 tokens)
Unsupervised     News Crawl                         News Crawl, News Commentary,
                 (sum: 194,026,719 lines,           Common Crawl
                 4,505,617,231 tokens)              (sum: 194,026,719 lines,
                                                    3,143,791,176 tokens)

Table 3.1: Quantitative information for each corpus used to train the unsupervised and semi-supervised system.

In order to compensate for the data scarcity, I added Chinese data from other monolingual sources to the research and built a second, unsupervised model with them to compare with the first. In order to maintain the unsupervised setting, solely monolingual data was used to train this system. Chinese data from the News Commentary parallel corpora and from Common Crawl11 were added to the training data so as to balance the amounts of data between English and Chinese. The pre-processed and segmented Chinese monolingual data from Common Crawl12 were added to the Chinese training set. After the first step in Monoses, the first 194 million sentences (the same number of lines as in the English training corpora) of the Chinese training set were split off into a new corpus for further training of the system.
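A split of that size cannot be done by loading the corpus into memory. The snippet below is a sketch of how the first 194 million lines can be streamed into a new file; the file names are illustrative and this is not the exact command used on Peregrine.

from itertools import islice

SRC = "zh.monolingual.txt"          # illustrative path to the merged Chinese corpus
DST = "zh.monolingual.194M.txt"     # illustrative path for the reduced corpus
N = 194_026_719                     # same number of lines as the English training data

with open(SRC, encoding="utf-8") as fin, open(DST, "w", encoding="utf-8") as fout:
    # islice streams the input line by line, so the full corpus never sits in memory
    fout.writelines(islice(fin, N))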

Validation set

During the first step of the Monoses training script, 10 thousand random sentences were split off from the English and Chinese training sets to function as the validation set for tuning the model.

Test set

As the English–Chinese test set13 of the WMT19 news translation task had not been involved in the training and validation, it was used as the test set to evaluate the final models. As Chinese is written without whitespace between words, the outputs from Monoses were desegmented by the command shown in Listing A.6.
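Listing A.6 in the appendix gives the exact command; as an illustration of the idea, the sketch below removes the whitespace that segmentation inserted between Chinese characters while leaving spaces around Latin-script tokens untouched. It is a simplification: punctuation and other scripts would need additional rules.

import re

def desegment(line):
    # Drop a space only when both of its neighbours are CJK characters.
    return re.sub(r"(?<=[\u4e00-\u9fff])\s+(?=[\u4e00-\u9fff])", "", line)

print(desegment("二十 年 前 的 news 报道"))   # -> 二十年前的 news 报道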

3.2.2 Data pre-processing

Training data can have a decisive impact on the system. If a machine is fed a large amount of errors and irrelevant information, it is likely to learn from such data and arrive at an undesired outcome, which forms a "garbage in, garbage out" vicious circle.

11 http://web-language-models.s3-website-us-east-1.amazonaws.com/ngrams/zh/deduped/zh.deduped.xz.
12 http://web-language-models.s3-website-us-east-1.amazonaws.com/ngrams/zh/deduped/zh.deduped.xz.
13 http://data.statmt.org/wmt19/translation-task/test.tgz.


Textual noise can be defined as all unwanted, incorrect or unrelated forms in electronic text, ranging from URLs to spelling errors (Subramaniam et al. 2009). Therefore, it is important to clean and pre-process the data before training. The data collected for training Monoses include crawled news (News Crawl and News Commentary) and web-crawled content (Common Crawl). Judging from samples of the data, the crawled news appears to be clean news data with a negligible amount of noise, while the Common Crawl data contain a large amount of non-Chinese text and unrelated metadata obtained from the web pages, and their text is generally less formal and less well written.

All the data were stored and processed on Peregrine because of their size: there were 23 GB of English news data in total and 120 GB of Chinese Common Crawl data. The Peregrine cluster provides a Linux command-line environment; therefore, all the code that I wrote for cleaning the data relies on Linux commands and Python. My scripts cover only rudimentary pre-processing of the Chinese data; methods to remove wrongly written characters need to be addressed in future work, as they would require more advanced programming.

Removal of non-Chinese data

The Common Crawl content includes various website articles, posts, forums, chat room histories and their metadata, such as web page addresses, time-stamps and error codes. The metadata can be considered a kind of noise for the Chinese text. Given that metadata are usually expressed in Latin letters and numbers, a Unicode script property14 was used to identify Chinese characters and keep only the Chinese crawl data (the entire command is shown in Listing A.3 in Appendix A). Considering that the Common Crawl data were used as Chinese monolingual data, text in all other languages is irrelevant and can be removed. This approach was more efficient and comprehensive than only filtering lines that match URL patterns.
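The exact command is the one in Listing A.3; the Python sketch below expresses the same idea of keeping only lines that are predominantly written in Han characters. The 50% threshold and the file names are illustrative choices, not part of the original pipeline.

import re

HAN = re.compile(r"[\u4e00-\u9fff]")      # CJK Unified Ideographs block

def mostly_chinese(line, threshold=0.5):
    # Keep a line only if at least `threshold` of its non-space characters are Han;
    # URLs, error codes and other Latin-script metadata then fall below the cut-off.
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return False
    han = sum(1 for c in chars if HAN.match(c))
    return han / len(chars) >= threshold

with open("zh_common_crawl.txt", encoding="utf-8") as fin, \
     open("zh_common_crawl.clean.txt", "w", encoding="utf-8") as fout:
    fout.writelines(line for line in fin if mostly_chinese(line))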

A tr command was used to remove all numbers from the corpora, including all time-stamp variations. Removing numbers should not have a negative impact on training the MT system: many of the numbers represent information that does not need to undergo translation, such as phone numbers, usernames, amounts of money, sizes, populations and so on. Arabic numerals seem to follow compatible usage rules in English and Chinese, and both languages also have written number forms that correspond to each other. As shown in the example below, it would be acceptable to use either the numeral or the word twenty in both languages.

English 20 years ago / twenty years ago

Chinese 20 年前 / 二十年前

Table 3.2: Example of the usage of Arabic numerals and words.
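The number removal itself was done with tr on the command line; purely for illustration, and to stay consistent with the other Python examples in this chapter, the same operation can be sketched as follows (the full-width digit range is an assumption about what needs to be covered).

import re

def strip_digits(line):
    # Remove runs of Arabic numerals, both ASCII (0-9) and full-width (０-９).
    return re.sub(r"[0-9０-９]+", "", line)

print(strip_digits("2019-08-27 12:00"))   # -> "-- :"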

Rearrangement of newlines

Another observation was that each sentence in the Common Crawl data was segmented into a separate line, but the end of some sentences was recognised incorrectly: those lines ended after the first word of the following sentence rather than at the ideographic full stop, which is used in Chinese in the same manner as a full stop in English.


An example of a line from the training data is shown below to illustrate this. In the example, the sentence in line 1 is complete at the full stop, and a newline symbol should be placed after the full stop to end this line, but the newline was placed after the first character of the next sentence instead. This can cause a problem in word segmentation, since the last character in the original line 1, "别", should be segmented as one token together with the following character "说", meaning "let alone" in Chinese. With the line break in the wrong place, the two characters would be segmented as two separate tokens instead.

Original data         Line 1: 人山人海,见到招聘企业人员的面就已算不错了。别 \n
                      Line 2: 说应聘行不行了!\n
Rearranged newline    Line 1: 人山人海,见到招聘企业人员的面就已算不错了。\n
                      Line 2: 别说应聘行不行了!\n

Table 3.3: Example of a wrongly ended line and its correction. The character 别 was included in line 1 incorrectly; it should be placed at the beginning of line 2.

A Python script (shown in Listing A.4) was written with a for-loop and if statements to process each line in the corpora and relocate the end of line correctly. The textual data were read with the encoding explicitly set to UTF-8, which made sure that Python read the Chinese data correctly as Unicode characters rather than bytes. English words are composed of combinations of 26 letters and separated by whitespace, whereas Chinese has a far larger set of characters and is written without word separation. Character encoding makes sure every Chinese character is machine-readable and separated accurately. A Chinese character can be represented as one Unicode character or as multiple bytes in computer programs. Without the encoding indication, Python might process the Chinese text in the form of bytes, which would make it very difficult to identify which bytes belong to which character and would further disrupt the attempt to rearrange the sentences. Hence it is essential to ensure that Python processes the Chinese data as Unicode.
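The actual script is given in Listing A.4; the sketch below only illustrates the underlying logic, under the assumption that a line should end at its last sentence-final punctuation mark and that anything after it belongs to the next line. The set of punctuation marks and the file names are illustrative.

SENT_END = "。!?!?…"          # punctuation marks treated as valid sentence endings

def rearrange(lines):
    carry = ""
    for line in lines:
        line = carry + line.rstrip("\n")
        carry = ""
        cut = max(line.rfind(ch) for ch in SENT_END)       # last sentence-final mark
        if 0 <= cut < len(line) - 1:
            line, carry = line[:cut + 1], line[cut + 1:]    # move the tail to the next line
        yield line + "\n"
    if carry:
        yield carry + "\n"

with open("zh_raw.txt", encoding="utf-8") as fin, \
     open("zh_fixed.txt", "w", encoding="utf-8") as fout:
    fout.writelines(rearrange(fin))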

Converting traditional Chinese

During the preliminary inspection of random data segments, I encountered text that was written in traditional Chinese. Although WMT makes no explicit claim about which written form of Chinese is meant in the Chinese–English language pair, system outputs and news data from previous years were generally written in simplified Chinese. Therefore, traditional Chinese text should be converted into simplified Chinese. In the same processing pass as the newline rearrangement described above, an additional Python library, hanziconv15, was applied to check whether each string was written in simplified Chinese; if not, the library converted it from traditional to simplified Chinese.
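As an illustration, and assuming the documented hanziconv interface (HanziConv.toSimplified), the conversion step can be sketched as below; the surrounding code is illustrative rather than the thesis's actual script.

from hanziconv import HanziConv

def ensure_simplified(line):
    # toSimplified leaves text that is already simplified unchanged,
    # so it doubles as both the check and the conversion.
    return HanziConv.toSimplified(line)

print(ensure_simplified("漢字轉換"))   # -> 汉字转换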

Word segmentation

Monoses itself incorporates the necessary pre-processing in the training script, which covers tokenising and truecasing. This is sufficient for the English corpora, and no additional text processing is needed for them. However, these methods are not applicable to Chinese, which is written without spaces or a distinction between upper and lower case. Other approaches are required to solve the problem posed by the lack of word separation in Chinese and to split the text into linguistic units. One alternative is Chinese word segmentation, which is a useful and important step to prepare Chinese textual data prior to training.
