Language Processing
Submitted in partial fulfillment for the degree of Bachelor of Science
Credits: 12 EC
Ella Casimiro
11195002
Bachelor Information Studies
Faculty of Science
University of Amsterdam
2019-06-28
Supervisor: Dr. Maarten Marx (UvA, FNWI, IvI), maartenmarx@uva.nl
Abstract
In 2017, 123,930 traffic accidents were recorded in the Netherlands, and news articles on this topic are published daily. Besides containing valuable information as to who was involved and what happened, there is more to be discovered when carefully analyzing the content. This research attempts to give a better understanding of how Dutch media report traffic accidents by searching for hidden patterns with exploratory data analysis and other Natural Language Processing techniques. Accordingly, the main question of this research was: "What patterns can be found in accident reporting by Dutch media?". Identifying these patterns is essential for understanding the influence of media on the perception of traffic accidents and awareness of the dangers of traffic. In this research, a search for the most representative words was conducted by comparing different weighting schemes for word clouds. Afterwards, the most common subject-verb-object triples were extracted and further analyzed. The last part of this research was aimed at automating manually performed analyses, so that they could be applied to larger data sets. The results suggested the use of a specific language for accident reporting, in which the focus is mostly put on the victim, and thus shifted away from the perpetrator, and a vehicle is referred to more often than the person driving it. Furthermore, it was found, among other things, that a vulnerable road user had a higher probability of being the subject in an SVO triple with the verb 'overlijden' (to die) than a motorized road user. Ultimately, regular expressions were found to produce at least reasonable results in automating manual analyses. The application of the analyses to a larger data set also produced some interesting insights, including on the use of vague expressions and on the preferred terms to describe an accident. This research was mostly exploratory in nature, and thus further research is needed to generalize the results.
Contents
1 Introduction
2 Background
3 Related Work
4 Methods & Results
  4.1 Description of the data
    4.1.1 Pre-processing of the DataFrames
  4.2 RQ1 Common words
    4.2.1 Methods
    4.2.2 Results
  4.3 RQ2 SVO triples
    4.3.1 Methods
    4.3.2 Results
  4.4 RQ3 Automation
    4.4.1 Methods
    4.4.2 Results
5 Evaluation
  5.1 RQ1 Common words
  5.2 RQ2 SVO triples
  5.3 RQ3 Automation
6 Conclusions
  6.1 Acknowledgements
References
A Word Clouds
B Automation
  B.1 Regexps
  B.2 ML Classification
1 INTRODUCTION
According to Het Centraal Bureau voor de Statistiek (CBS), 678 people died in a traffic accident in the Netherlands last year. CBS reported an increase of 10.6 percent compared to 2017, the biggest increase since 1989. Fatal accidents only account for a small fraction of the total number of accidents [3]. Data from Wetenschappelijk Onderzoek Verkeersveiligheid (SWOV) shows a total number of 123,930 accidents in 2017. Although most of those accidents ended in material damage only, in 18,706 of the accidents someone was slightly injured or worse [19]. To reduce the number of accidents, the Ministry of Infrastructure and Water Management and the Ministry of Justice and Security have created the 'Strategisch Plan Verkeersveiligheid 2030' (SPV) together with other parties. Some of the themes in the SPV relate to infrastructure, Vulnerable Road Users (VRU), young and elderly drivers, and alcohol consumption. An effort is thus being made to reduce the risk of participating in traffic. The government is, however, not the only party investigating this topic.
In particular, Thalia Verkade (journalist for De Correspondent) and Marco te Brömmelstroet (affiliated with the University of Amsterdam) have also taken an interest in traffic accidents, more specifically in trying to understand what role the media play or should play in this topic. They started the website Hetongeluk.nl, where news articles about traffic accidents are collected and annotated by hand with useful information such as who was involved and whether someone was injured. There is, however, a lot more to be found in those articles that influences the way in which people perceive traffic accidents and their level of awareness with regard to the possible dangers. Exploratory data analysis (EDA) combined with visualization techniques can aid in finding that information by summarizing the main characteristics of accident reporting, while at the same time uncovering hidden patterns and insights. From there on, other Natural Language Processing (NLP) techniques can be applied to dig deeper by performing clear-cut analyses that would otherwise be time-consuming and unscalable when performed manually.
This thesis thus attempts to give a better understanding of how Dutch media report traffic accidents with the use of Natural Language Processing techniques, by applying EDA and automating analyses. Consequently, the following research question and sub-questions were defined:
RQ What patterns can be found in accident reporting by Dutch media?
(1) What are the most common and representative verbs, adjectives and nouns in news articles?
(2) What are the most common subject-verb-object triples?
(3) What manual processes can be automated and applied to larger data sets?

This research is part of a larger project, in collaboration with Sander Siepel and Barry Hendriks, aimed at automatically annotating and analyzing accident reporting articles. The objective of this thesis is to automatically analyze accident reporting articles.
Overview of thesis.
Section 2 provides background information on the methods used in RQ1 through relevant literature. Thereafter, related work on the methods used in RQ2 and RQ3 is discussed in section 3. The methods and results are then explained per sub-question in section 4, followed by an evaluation of the results and a reflection on the process in section 5. Finally, section 6 describes the main conclusions of this research.
2 BACKGROUND
In this section, the use of and difference between word clouds, Venn clouds and Parsimonious Language Models will be illustrated through relevant literature. These methods will be applied in section 4.2.
Word Clouds.
One method that has proven effective in the field of text analysis and summarization is the use of tag or word clouds. Clouds are used to visualize tags assigned to documents, or frequently occurring words in a document or corpus, so that a user can quickly understand what it is about without having to read the entire text. According to Heimerl et al. [9], the literature on tag clouds can be divided into two main areas: studies on the effectiveness and visual interpretability of word clouds, and studies that focus on finding improvements and extensions for the existing concept.
One study on the effectiveness of word clouds conducted an experiment to investigate the influence of font size and word order on a selection task [8]. The authors concluded that font size plays an important role in helping someone find information easily. Secondly, alphabetical ordering of words also improved search time. Word clouds have also been proven effective for very specific applications. An example can be found in the research by DePaolo and Wilkinson [7]. In their study on assessment, students were asked to answer questions with short answers. These answers were visualized with word clouds for a professor to get a better understanding of the knowledge (and potential lack thereof) on specific topics. This method can be applied during lectures or even before and after a test.
Heimerl et al. [9] created Word Cloud Explorer, a system in which users have full control over a word cloud that summarizes a body of text. Part-of-speech (POS) tags can be selected if a user is, for example, only interested in nouns and adjectives. Moreover, users can hover over a specific term to have the system highlight other terms that appear in the same sentence. There is, amongst others, also the possibility to extract extra information about a specific term, such as the frequency count and detected word forms in the entire corpus or a selected part of it. After conducting a qualitative user study aimed at receiving useful feedback, one of their findings was that filtering on POS-tags was considered more useful than using colors to distinguish between them in a single word cloud.
Another interesting study focused on the use of word clouds for exploratory data analysis [5]. One of the examples in the research showed content clouds of public meeting transcripts in different areas of America. All meetings concerned the same topic, but the results showed that people in different states perceived the discussed problem differently. This is an example of how differences in text can be easily spotted with the use of word clouds.
Venn Clouds.
An improvement on using word clouds for differentiating between two bodies of text is discussed by Coppersmith and Kelly [6]. To define the relationship between two corpora, they made use of Venn clouds. In Venn clouds, words associated with both corpora are displayed in the center, while the other words are displayed either left or right depending on the corpus they are found in. In this specific research, association was based on the probability of a term occurring in the corpora. Association could, however, also be based on frequency counts.
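The frequency-based variant of this association can be sketched in a few lines. This is a minimal stand-in, not the authors' implementation; the toy corpora and the function name are invented for illustration:

```python
from collections import Counter

def venn_partition(corpus_a, corpus_b, top_n=60):
    """Split the top-n words of two corpora into left / shared / right sets,
    with association based on raw frequency counts."""
    top_a = {w for w, _ in Counter(corpus_a).most_common(top_n)}
    top_b = {w for w, _ in Counter(corpus_b).most_common(top_n)}
    shared = top_a & top_b  # drawn in the centre of the Venn cloud
    return top_a - shared, shared, top_b - shared

left, centre, right = venn_partition(
    "de auto raakte de fietser en de auto reed door".split(),
    "de minister sprak en de kamer stemde over de wet".split(),
)
```

Stop-word-like terms ('de', 'en') end up in the centre, while the domain-specific words stay on their own side.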
One limitation of word clouds that is not completely solved by Venn clouds is the lack of removal of non-representative words. Stop-word removal can be applied to both methods, but this may not always be sufficient.
Parsimonious Language Models.
Kaptein, Hiemstra & Kamps [12] investigated this problem by searching for additions to term-frequency (TF) word clouds that improve the selection of visualized terms. They note that the terms in a TF word cloud are often not the words that users would like to see, because the most frequently occurring terms are not necessarily very informative. In their research, the addition of bigrams and a language model to word clouds was tested. According to their user study, users prefer word clouds with both unigrams and bigrams, or only bigrams, over the basic unigram cloud. Furthermore, they also investigated whether users preferred term frequency multiplied by inverse document frequency (TF-IDF) or the parsimonious model over TF weighting. With a parsimonious model, common stop-words and article-specific common words are automatically excluded from the word cloud. While the TF word clouds were still found to be most beneficial for retrieval, the parsimonious word clouds outperformed both other methods in returning specific words and removing stop-words.
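The TF-IDF weighting mentioned above can be illustrated with a minimal sketch. The documents are invented toy data, and the standard tf × log(N/df) formulation is used; the exact variant in the cited work may differ:

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Per-document TF-IDF weights: term frequency multiplied by the
    log inverse document frequency over the whole collection."""
    n_docs = len(docs)
    # document frequency: in how many documents each term occurs
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [["de", "auto", "botste"], ["de", "fietser", "viel"], ["de", "auto", "reed"]]
w = tfidf_weights(docs)
```

A term occurring in every document ('de') receives weight zero, which is exactly the stop-word suppression that plain TF lacks.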
In [10], the use of parsimonious language models for Information Retrieval (IR) is investigated. Normally, a language model is created for each document, in which terms are ranked based on their probability of occurrence. The problem with this is that common words or stop-words get assigned a high probability, which is unwanted. To solve this, the concept of parsimony was applied to the language model, meaning that terms that distinguish a document from the rest of the corpus are rewarded with a higher probability. From their research, they concluded that the smaller parsimonious models performed equally well as, or even better than, the normal language models at three stages of retrieval (indexing, request and feedback time).
3 RELATED WORK
SVO triples.
Extracting subject-verb-object (SVO) triples has been used in media analysis for multiple purposes. In [15], an application is described that automatically extracts SVO triples and saves them to a database, in order to identify important actors and actions for a specific domain or during a specific period in time. By applying the application to a large number of crime-related news articles from the New York Times between 1987 and 2007, the key actors and actions could be extracted by weighting their frequency in the crime articles against their frequency in a general news corpus. This eliminated actors that were not specifically relevant for crime news. They also created a network from the top 300 subjects, verbs and objects to visualize the relationships amongst them. In [16], the system ElectionWatch is introduced. The system makes use of SVO triples from U.S. election news to display endorsement and opposition relationships between actors in a graph.
Another application of SVO triples in news content analysis is shown in [13]. Lansdall-Welfare et al. collected and analyzed millions of articles on nuclear power over a five-year period to find out how the Fukushima disaster influenced media coverage. One of their analyses included extracting SVO triples to create a network that showed actors and actions affecting a specific topic. In their results, they showed the network for "nuclear power" before and after Fukushima. They concluded that the network changed after the disaster, with the biggest difference being the growth of the public as an important actor. And where before the disaster actions with a positive tone, like "support", were most prominent, actions found after the disaster were much more negative.
Church et al. [4] took a different approach to the use of SVO triples. They showed in their research that statistics can provide added value that may, in the future, help parsers reduce mistakes. Applying Mutual Information (MI) to rank verbs for a specific noun can show which associations make more sense than others. For example, the noun boat was found to be more associated with the verb cruise (MI = 8.17) than with get (MI = 0.57). If only the co-occurrence of the words and verbs had been used, they would have been ranked equally high, because both pairs occurred together three times. MI, on the other hand, compares the probability of finding boat and cruise/get together with the probability of finding them separately. If they occur separately much more often, the MI will be closer to zero, meaning that there is no true association between the noun and the verb.
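The MI ranking can be reproduced in a few lines. The counts below are invented for illustration and do not come from Church et al.'s corpus; the point is only that equal co-occurrence counts yield very different MI scores when one word is far more frequent overall:

```python
import math

def mutual_information(pair_count, x_count, y_count, total):
    """Pointwise mutual information: log2 of how much more often x and y
    co-occur than chance (independence) would predict."""
    p_xy = pair_count / total
    p_x = x_count / total
    p_y = y_count / total
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts: 'boat' co-occurs 3 times with both 'cruise' and
# 'get', but 'get' is vastly more frequent on its own.
mi_cruise = mutual_information(3, 50, 30, 1_000_000)
mi_get = mutual_information(3, 50, 30_000, 1_000_000)
```

With these toy counts the rare verb scores much higher than the common one, mirroring the cruise-versus-get contrast in the text.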
Euphemisms.
Tayler and Ogden [17] conducted two studies on the use of euphemisms for 'heart failure'. The first part of their study was aimed at doctors and how likely it would be for them to explain the diagnosis of 'heart failure' to a patient with that specific term or some euphemism. The results showed that doctors were significantly more likely to use some of the given euphemisms over 'heart failure'. The preferred euphemism was 'fluid on your lungs as your heart is not pumping hard enough'. There was no significant difference for all euphemisms, and some euphemisms were significantly less likely to be used. Doctors were also significantly more likely to put 'heart failure' in the computer than to tell the patient. The second part of their study focused on the relative impact of the preferred euphemism over 'heart failure' on patients. The results showed that patients were significantly more anxious, and felt that the condition would impact their life more, if the term 'heart failure' was used.
Another interesting study, conducted by Bowers and Pleydell-Pearce [2], measured electrodermal activity (EDA) in participants when reading aloud two swear words, two neutral words and their euphemisms. Since EDA measures can vary largely depending on the test person, the measures were normalised. A one-way ANOVA showed that the test persons found it more stressful to say the swear words than the euphemisms. There was no significant difference between the neutral words and their euphemisms. In their discussion, the authors argue that euphemisms can be effective for talking about sensitive topics that would otherwise be avoided due to the emotional response that the offending word would evoke. In contrast, Johns and DellaSalla [11] argue in their essay, directed at conservation biologists, that euphemisms should not be used because they are misleading and 'sugar-coat' reality.
Blame and inanimate objects.
Ralph et al. [14] conducted a content analysis of 200 news articles in U.S. media reporting accidents involving pedestrians and bicyclists. According to the authors, the aim of the research was "to systematically describe patterns of traffic crash reporting rather than to draw causal inferences". Their results did, however, show interesting patterns in media coverage concerning blame. If a car or driver was mentioned in the article, 81% of articles referred to the vehicle instead of the driver. When looking at sentence types, they found that reporters preferred to leave out the driver or vehicle completely and simply state that "a VRU was hit". These two findings are seen by the authors as ways of shifting blame away from the driver. In the same research, they also found that 'accident' was the most used term to describe a crash. The authors argue that this term, because of its neutral tone, conceals the fact that most accidents are preventable.
4 METHODS & RESULTS
In this section, a description of the data sets used will be given (section 4.1), followed by an explanation of the methods used for data pre-processing (section 4.1.1). Afterwards, the methods and results of the three sub-questions will be addressed separately.
Sections 4.2, 4.3 and 4.4 will begin with a short introduction to demonstrate the relevance of the sub-questions. Subsequently, the process of finding the best results will be clarified by explaining the methods used, along with their utility and possible shortcomings. Lastly, the results will be discussed, starting with the best. An evaluation of the results per sub-question will be given in section 5.
4.1 Description of the data
Three data sets were used in this thesis. Two of them contain accident reporting articles, and the third is a general news corpus that is only used in RQ1 for comparison.
Het Ongeluk.
For the first data set, 396 accident reporting articles from the hetongeluk.nl website were received in JSON format and turned into a Pandas DataFrame. M. te Brömmelstroet provided annotations for a part of these articles in Excel and SPSS file formats. Those files were also loaded into DataFrames and merged together. The articles with partly or fully missing annotations were not included in the final DataFrame. The final DataFrame contained 260 rows and 55 columns. A large part of the columns consisted of labels corresponding to specific analyses, which were used for evaluating the automation of those analyses.
Flitsservice.
The second data set contained accident reporting articles from the website flitsservice.nl. The data was provided as a .csv file by B. Hendriks, who scraped the articles from the website with a self-written Python function. The .csv file contained the title, full text and date of 7784 articles. The file was loaded into a DataFrame, after which some symbols (tab, newline and return) and leading white space were removed.
VU-DNC Corpus.
The VU University Diachronic News text corpus [1] is available as a .zip file that contains general news articles from five Dutch newspapers, published in 2002. The .zip file was loaded in Python, and the text between 'HEADLINE' and 'TAIL' was selected for each article. The remaining text was split into titles and articles by searching for the first newline character. Some symbols (tab, newline and return) and leading and trailing white space were removed from the articles. The final DataFrame contained 2552 articles.
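The extraction steps just described can be sketched as follows. The marker strings follow the description above, while the sample article and the function name are invented for illustration:

```python
import re

def extract_article(raw):
    """Select the text between the HEADLINE and TAIL markers, split off
    the title at the first newline, and strip stray whitespace symbols."""
    match = re.search(r"HEADLINE(.*?)TAIL", raw, flags=re.DOTALL)
    if match is None:
        return None
    # first newline separates the title from the article body
    title, _, article = match.group(1).partition("\n")
    clean = re.sub(r"[\t\r\n]+", " ", article).strip()
    return title.strip(), clean

raw = "HEADLINE Botsing op de A6\nEen auto botste\tgisteren.\nTAIL"
title, text = extract_article(raw)
```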
4.1.1 Pre-processing of the DataFrames.
To prepare the DataFrames for further investigation, some NLP tools were used for pre-processing. The specific modules used during this step are briefly discussed below.
NLTK.
Python's Natural Language Toolkit (NLTK) library offers functions and collections that can assist programmers with NLP. Its tokenization package contains the submodule sent_tokenize, which was used in this thesis to split the articles into sentences.
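NLTK's sent_tokenize is the tool actually used in the thesis; as a dependency-free illustration of the idea, a naive regex stand-in that splits on sentence-final punctuation (real tokenizers handle abbreviations and other edge cases this sketch ignores):

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sents = split_sentences(
    "De fietser raakte gewond. Hij is naar het ziekenhuis gebracht."
)
```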
Pattern.
The web mining module Pattern is written for Python 2.5+ and does not support Python 3 yet. Therefore, the code in this thesis is written in Python 2.7. Pattern was used for parsing the articles and titles because it offers a module for the Dutch language. More specifically, the function parsetree() was used, which takes a string as input and returns a Text object for each row in a DataFrame. The Text object can be broken down into Sentence objects, which in turn break down into Word objects. For each word, the part-of-speech tag, assigned chunk, relation in the chunk (subject, object, ...) and lemma were stored.
Pickle.
Pickling is used to save Python objects as a binary file. These objects can later be loaded in a different Jupyter Notebook without losing any valuable information. This format was chosen over the csv format for storing the DataFrames, since the parsetree functionality would otherwise be lost.
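A minimal sketch of the pickle round trip, with a hypothetical parsed-article object standing in for the actual DataFrames: unlike a csv export, the restored object keeps its full nested structure.

```python
import os
import pickle
import tempfile

# hypothetical parsed-article object (nested structure a csv could not hold)
parsed = {"title": "Botsing op de A6", "tokens": [("auto", "NN"), ("botste", "VB")]}

path = os.path.join(tempfile.mkdtemp(), "parsed.pkl")
with open(path, "wb") as f:
    pickle.dump(parsed, f)        # serialize to a binary file
with open(path, "rb") as f:
    restored = pickle.load(f)     # load it back, e.g. in another notebook
```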
4.2 RQ1 Common words
This section discusses the methods and results concerning the following sub-question:
"What are the most common and representative verbs, adjectives and nouns in news articles?"
4.2.1 Methods.
The language or vocabulary used in news articles can help us understand more about the way in which accidents are reported by Dutch media. Furthermore, it can help us distinguish this type of article from general news. The language of accident reporting contains common Dutch words, often referred to as stop-words, but also very specific words like 'ongeval' (accident) and 'aanrijding' (collision). In between these two extremes are words that are shared amongst multiple news reporting registers and are only meaningful for accident reporting when context is taken into account. Examples of these words include '19-jarige' (19-year-old) and 'auto' (car). The goal was thus to find those words that are common but also sufficiently specific for accident reporting.
Instead of merely listing those words, the decision was made to transform them into word clouds. Word clouds have become very popular in a wide range of research fields because of their ability to visualize text in a straightforward and appealing way. They are often applied in exploratory data analysis because the conclusions drawn from them can guide further NLP steps. The weights assigned to the words decide which words are shown and their magnitude in the cloud.
To find the word cloud that best answers the research question, three different methods for computing the word weights were tested. Secondly, to maximize the value of the word clouds, they were created per part-of-speech (POS) tag. By creating a cloud per POS-tag instead of a single word cloud for all words, more representative words can be found. This is because some POS-tags appear more often in a language and would therefore dominate the word cloud, while not being of great interest to this research. Pronouns, for example, provide no information about a given subject but may appear very frequently in a text. Verbs, nouns and adjectives were chosen for the word clouds because they can be specific to accident reporting and contain a lot of information.
Term-frequency Word Clouds.
Term frequency (TF) word clouds are the most common and simplest application of word clouds. In these clouds, the weight is calculated by counting how often a word occurs in a corpus, also known as the TF. They are often a good first step in finding out how a corpus compares to others, because they show the most common terms. Additionally, in a corpus with a distinct theme, some of the most common words may also be specific. Therefore, word clouds were generated for both the accident reporting corpus and a general Dutch news corpus. In each POS-tag word cloud, the 40 words with the highest TF are shown. In addition, word clouds with bigrams were also created for nouns and adjectives. To make sure that the bigrams were still related to the proper POS-tag, either the first or the second word needed to have the POS-tag of the concerned word cloud. The decision was made not to add bigrams to the verb clouds, because the preceding or following word would not help determine whether a verb is representative or not.
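Assuming the articles are already tokenized and POS-tagged (as Pattern provides), the TF counting and the constrained bigram extraction can be sketched as follows. The tag names and the toy input are illustrative, not the thesis's actual tag set:

```python
from collections import Counter

def tf_top(tagged, pos, n=40):
    """Top-n term frequencies for one POS tag, from (word, tag) pairs."""
    return Counter(w for w, t in tagged if t == pos).most_common(n)

def bigram_top(tagged, pos, n=40):
    """Top-n bigrams in which the first or second word carries the POS tag,
    mirroring the constraint described in the text."""
    pairs = zip(tagged, tagged[1:])
    grams = Counter(
        w1 + " " + w2 for (w1, t1), (w2, t2) in pairs if t1 == pos or t2 == pos
    )
    return grams.most_common(n)

tagged = [("hoge", "JJ"), ("snelheid", "NN"), ("auto", "NN"),
          ("hoge", "JJ"), ("snelheid", "NN")]
```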
Why TF does not suffice.
The problem with TF is that it only takes absolute occurrence into account, leading to a lot of unwanted, or meaningless, words in the word clouds. This is because stop-words are used very frequently and thus belong to the most common words. Moreover, some words are found in both accident reporting and general news. One solution would be stop-word removal, but this would still leave non-representative words in the clouds, and TF would still determine the weight. For that reason, Venn clouds and the Parsimonious LM were chosen, because they solve the problem of stop-words while also improving the weight calculation method.
Venn clouds.
Venn clouds solve the problem of stop-words by making a word cloud for each corpus and showing their intersection. In this intersection, stop-words and some non-representative words are found. This way, the words on both the left and the right side are more representative of the corpus they belong to. The size of the intersection represents the overlap between the corpora. Since Venn clouds still rely on TF, although being an improvement over TF word clouds, this method is not expected to extract the most representative words.
Parsimonious Language Model.
The method that is expected to find the most representative words is creating word clouds from a parsimonious language model. Language models assign a probability to words as weight, representing the chance of a word occurring in a given text. Language models tend to overfit on the given data, because they will select the words that are either very specific to a certain document or common words. The Parsimonious language model solves this by selecting the words that distinguish a certain document or corpus from another.
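The estimation behind this can be sketched with the EM procedure from the parsimonious-model literature (cf. [10]): terms that the background corpus already explains drift toward probability zero, so the document model keeps only distinguishing terms. The toy corpora below are invented, and w denotes the document-model weight:

```python
from collections import Counter

def parsimonious_lm(doc_tokens, corpus_tokens, w=0.01, iters=50):
    """EM estimation of a parsimonious document model against a
    background corpus model (a sketch, not the thesis's exact code)."""
    tf = Counter(doc_tokens)
    corpus_tf = Counter(corpus_tokens)
    corpus_total = sum(corpus_tf.values())
    p_corpus = {t: corpus_tf[t] / corpus_total for t in tf}
    doc_total = sum(tf.values())
    p_doc = {t: tf[t] / doc_total for t in tf}  # maximum-likelihood start
    for _ in range(iters):
        # E-step: expected counts attributed to the document model
        e = {t: tf[t] * (w * p_doc[t])
                / (w * p_doc[t] + (1 - w) * p_corpus.get(t, 0.0))
             for t in tf}
        # M-step: renormalise into a probability distribution
        total = sum(e.values())
        p_doc = {t: e[t] / total for t in tf}
    return p_doc

doc = "auto botsing botsing de de de".split()
background = ("de het een " * 100).split()
p = parsimonious_lm(doc, background)
```

Even though 'de' is the most frequent word in the toy document, the background model absorbs it and its parsimonious probability collapses to (near) zero.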
The Parsimonious LM has two parameters: the number of words to return (n) and the weight of the document model (w). The model works by creating two different kinds of language models. A base model is created from one of the corpora first, on top of which another language model is created. For the latter, words that are already described by the base model receive a probability of zero. A model with high parsimony thus needs fewer parameters (non-zero words) to describe the second corpus. This means that few to no stop-words are returned as the top n words for the second corpus, since those are normally also found in the first corpus.
4.2.2 Results.
Parsimonious LM.
The best word clouds for the given research question were produced by the Parsimonious LM for each POS-tag of interest. The best results were produced with w=0.01 and n=40. When visually inspecting the word clouds from the Parsimonious LM (Fig. 1), the conclusion can be drawn that stop-words were removed and the remaining words are representative of accident reporting.
Fig. 1. Parsimonious LM word clouds. From left: verbs, nouns and adjectives
TF Word Clouds.
Even though the TF word clouds did not show very representative words, they did provide some interesting information that could not be extracted from the Parsimonious LM word clouds due to the weighting scheme.
POS-tag verbs.
Figure 2 shows the accident reporting TF word clouds for conjugated and lemmatized verbs. The word clouds made with conjugated verbs had a larger variety of unique word forms, and therefore repeated certain verbs in different conjugations. Although this may seem unwanted, the conjugations actually contained interesting information. Even though the accident reporting word cloud showed both present and past tense, some verbs were clearly used more often in the past tense. This finding suggests that a sentence like "the victim was hit by the perpetrator" occurs more often than "the perpetrator hit the victim". The emphasis is thus more often put on the victim. Word clouds with lemmata were also created to extract more unique verbs. There were some verbs clearly related to accidents, like 'overlijden' (to pass away) and 'raken' (to hit), but common verbs still dominated. The word clouds for the general news consisted almost solely of commonly used verbs like 'worden' (to become) and 'kunnen' (to be able to). This was expected, as there was no single theme shared amongst the articles.
Fig. 2. TF word clouds for accident reporting. From left: verbs and verb lemmata
POS-tag nouns.
There was a clear difference between the word clouds from the general news corpus and the accident reporting corpus, which indicates that some nouns are typically used in describing accidents. Besides differences between the corpora, another conclusion can be drawn solely from the accident reporting word cloud (Fig. 3). The word 'auto' (car) occurred more often than 'automobilist' (driver), implying that reporters prefer to reference a vehicle instead of the person driving it.
Fig. 3. TF word cloud for accident reporting nouns
POS-tag adjectives.
Figure 4 shows the most common adjectives in accident reporting. Two of the words found were 'vermoedelijk' (presumably) and 'onbekend' (unknown), which may point to the fact that reporters often have to make assumptions about the situation. Other common adjectives seemed to point to the degree of injuries ('dodelijk'; deadly) and the type of accident ('eenzijdig'; single-vehicle).
Fig. 4. TF word cloud for accident reporting adjectives
Bigrams.
The bigrams provided extra information for some of the words that were not considered representative in the unigram clouds. For example, the adjective 'hoge' (high) was not considered representative. 'Hoge snelheid' (high speed), on the other hand, as found in Figure 5, was considered much more related to accident reporting. Similarly, the noun 'leven' (life) was already somewhat representative, but the bigram 'leven gekomen' (passed away) contained more information.
Fig. 5. TF bigram word cloud for accident reporting adjectives
Venn clouds.
For the Venn clouds, 60 words were used instead of 40, because the words in common are moved to the center, leaving fewer possible representative words for accident reporting. It is possible to conclude that some common words were removed from the accident reporting side and instead placed in the center. As a result, more representative words were to be found outside of the intersection. Figure 6 shows the Venn cloud for verbs. The Venn clouds proved to be an improvement over the TF word clouds by removing some of the common and non-representative terms, but still largely relied on frequency counts. That being the case, they were not chosen as the best result.
Fig. 6. Venn cloud for verbs
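The Venn cloud idea above can be sketched with plain term frequencies: take the top-N terms of each corpus, move the shared ones to the center, and keep the rest on each corpus's side. This is an illustrative sketch, not the Venn Clouds module used in the thesis; the function name and whitespace tokenization are assumptions.

```python
from collections import Counter

def venn_terms(corpus_a, corpus_b, top_n=60):
    """Split the top-N terms of two corpora into side-specific and shared sets,
    mimicking how a Venn cloud moves common words to the center.
    Tokenization here is naive whitespace splitting, for illustration only."""
    freq_a = Counter(w for doc in corpus_a for w in doc.lower().split())
    freq_b = Counter(w for doc in corpus_b for w in doc.lower().split())
    top_a = {w for w, _ in freq_a.most_common(top_n)}
    top_b = {w for w, _ in freq_b.most_common(top_n)}
    center = top_a & top_b   # shared terms -> center of the Venn cloud
    only_a = top_a - center  # candidate representative terms for corpus A
    only_b = top_b - center  # candidate representative terms for corpus B
    return only_a, center, only_b
```

Because the center absorbs shared terms, a larger top-N (60 instead of 40) is needed to leave enough candidate terms on the accident reporting side.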
4.3 RQ2 SVO triples
The methods and results of the following sub-question will be discussed in this section:
"What are the most common subject-verb-object triples?"
4.3.1 Methods.
In most languages, including Dutch, the basic structure of a sentence is SVO, or subject-verb-object. Extracting these triples can be useful to summarize sentences and detect recurrent events or similarities between articles. For example, the sentences "de motorrijder is aangereden door een dronken automobilist" (the motorcyclist was hit by a drunk driver) and "de 19-jarige motorrijder is gisteren aangereden op de A6 door een automobilist" (the 19-year-old motorcyclist was hit on the A6 yesterday by a driver) differ in length and in how they describe the situation, but share the same SVO triple, namely "motorrijder-aangereden-automobilist". Whether these two sentences refer to the same accident is unclear, but when triples are extracted from a large body of text it is possible to detect which types of accidents occur more often than others.
To find the most common SVO triples, the two accident reporting DataFrames (Flitsservice and Het Ongeluk) were merged. Afterwards, the underlying relations were extracted from the articles parsed with Pattern’s parsetree. A sentence could have multiple SVO relations, and some may have been incomplete because the subject was named in a previous sentence. To find the correct triples, a list comprehension was created that looped through the key-value pairs of the subject, verb and object relations found in a sentence. The values were Pattern Word() objects and the keys referred to the specific relationship they belonged to. If a subject, verb and object with the same key were found, the lowercase strings of the Word() objects were returned as a 3-tuple. The occurrences of the 3-tuples were counted and returned as a dictionary, with the 3-tuples as keys and their occurrence counts in the entire corpus as values.
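The matching step described above can be sketched as follows. Plain strings stand in for Pattern's Word() objects, and the dict shape `{'SBJ': {index: word}, ...}` is an assumption about the parser output; Pattern itself is not invoked here.

```python
from collections import Counter

def extract_svo(relations):
    """Return (subject, verb, object) 3-tuples, lowercased, for every relation
    index that has all three roles present.  `relations` mimics the shape of a
    parsed sentence's relation dict, e.g. {'SBJ': {1: 'Motorrijder'}, ...}."""
    return [
        (relations['SBJ'][i].lower(),
         relations['VP'][i].lower(),
         relations['OBJ'][i].lower())
        for i in relations.get('SBJ', {})
        if i in relations.get('VP', {}) and i in relations.get('OBJ', {})
    ]

def count_triples(parsed_sentences):
    """Count SVO triples over a corpus; returns a {triple: count} dict."""
    counts = Counter()
    for relations in parsed_sentences:
        counts.update(extract_svo(relations))
    return dict(counts)
```

Incomplete relations (e.g. a missing subject) simply produce no triple for that index, matching the behaviour described for sentences whose subject appeared earlier.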
4.3.2 Results.
The 20 most common triples were extracted from the accident reporting titles and articles. The results are shown in Table 1.
Table 1. 20 most common SVO triples found in accident reporting. From left: for titles and for articles.
Triple titles Count titles Triple articles Count articles
motorrijder-overleden na-ongeval 22 de politie-doet-onderzoek 319
fietser-overleden na-aanrijding 19 dat-meldt-de politie 255
vrouw-overleden na-aanrijding 17 de politie-onderzoekt-de toedracht 197
fietser-overleden na-ongeval 14 dat-heeft-de politie 183
vrouw-overlijdt na-ongeval 10 de politie-stelt-een onderzoek 144
motorrijder-overleden na-ongeluk 10 de politie-heeft-een onderzoek 124
voetganger-overleden na-aanrijding 10 het ongeval-vond-plaats 94
motorrijder-overleden na-aanrijding 9 de politie-is-een onderzoek 65
vrouw-overleden na-ongeval 8 het verkeer-werd-omgeleid 60
vrouw-overleden na-verkeersongeval 8 de politie-onderzoekt-de oorzaak 56
fietsster-overleden na-aanrijding 7 de man-reed-met 56
automobilist-overleden bij-ongeval 7 dat-heeft-de politie gemeld 55
fietsster-overleden na-ongeval 7 het slachtoffer-is-een xx-jarige man 50
bromfietser-overleden na-aanrijding 7 dat-meldde-de politie 47
fietser-overleden na-ongeluk 6 dat-maakte-de politie 45
fietser-overlijdt na-aanrijding 6 dat-bevestigt-de politie 39
automobilist-overleden na-ongeval 6 de politie-doet-verder onderzoek 38
motorrijder-overlijdt na-ongeval 5 de politie-onderzoekt-de zaak 38
automobilist-overleden na-ongeluk 5 de aanrijding-vond-plaats 37
slachtoffer dodelijk ongeluk-is-man 5 het ongeluk-vond-plaats 37
For the titles, 19 out of 20 triples turned out to have the same structure. All objects referred to an accident, using one of three descriptive terms (’ongeval’, ’ongeluk’ and ’aanrijding’). Furthermore, they all contained a conjugation of the verb ’overlijden’ (to die). A possible explanation could be that when an accident leads to someone’s death, this will be mentioned in the title, while an injury may only be mentioned in the article.
Assuming the triples and occurrences were correct, the results could be further analyzed by calculating conditional probabilities. For this, the 19 similar triples were used and all subjects were divided into three groups, namely:
• Vulnerable road users (VRU)
• Motorized road users (Vehicle)
• Man, woman, ... (Undefined)

For each term and each subject, P(Subject|Term) and P(Term|Subject) were calculated. The formula for computing a conditional probability is P(A|B) = P(A ∩ B) / P(B), which should be interpreted as the probability of A, given B. The results are shown as heatmaps in Figure 7.
Fig. 7. Heatmaps showing conditional probabilities. From left: P(Term|Subject), P(Subject|Term)
To give an example of how to interpret the heatmaps: P(aanrijding|VRU), the probability of finding the term ’aanrijding’ in a triple with a VRU as subject, was higher than the probability of finding the term ’ongeluk’.
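Both conditional probabilities can be derived directly from the triple counts. The sketch below assumes counts keyed by (subject group, term) pairs; the example counts in the usage note are illustrative, not the thesis's data.

```python
def conditional_probs(counts):
    """Given {(subject_group, term): count}, return two dicts:
    P(term | subject) = count / total(subject), and
    P(subject | term) = count / total(term)."""
    subj_totals, term_totals = {}, {}
    for (subj, term), n in counts.items():
        subj_totals[subj] = subj_totals.get(subj, 0) + n
        term_totals[term] = term_totals.get(term, 0) + n
    p_term_given_subj = {(s, t): n / subj_totals[s] for (s, t), n in counts.items()}
    p_subj_given_term = {(s, t): n / term_totals[t] for (s, t), n in counts.items()}
    return p_term_given_subj, p_subj_given_term
```

For instance, with counts {('VRU', 'aanrijding'): 3, ('VRU', 'ongeluk'): 1}, P(aanrijding|VRU) would be 3/4, matching the P(A|B) = P(A ∩ B) / P(B) definition.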
Since almost all triples contained a conjugation of the verb ’overlijden’, conditional probabilities were also calculated for P(Subject|overlijden) and P(Object|overlijden). The results are shown as a mirror bar chart in Figure 8. From the chart the conclusion can be drawn that a VRU had the highest probability of being the subject and ’ongeval’ had the highest probability of being the object, given the verb ’overlijden’.
Fig. 8. Mirror bar chart of P(Subject|’overlijden’) and P(Object|’overlijden’)
The most common triples from the articles were very different, mostly because there were many more sentences and a wider variety of words used for describing situations. ’De politie’ (the police) appeared most often as the subject, with ’onderzoek’ (investigation) as the object. This result may suggest that for many of the accidents reported in the news, the police have to get involved to find out what exactly happened.
4.4 RQ3 Automation
The methods and results of the following sub-question will be discussed in this section:
"What manual processes can be automated and applied to larger data sets?"

An attempt was made at automating 5 of the 11 analyses that were manually performed by M. te Brömmelstroet on the accident reporting articles from Het Ongeluk. The methods used will be discussed in 4.4.1. The results from three of the automated analyses will be discussed in 4.4.2. The Jupyter Notebook containing all analyses and evaluations is provided in Appendix B.1.
4.4.1 Methods.
By automating the processes, the larger Flitsservice data set could also be analyzed. Different approaches were used depending on the analysis in question.
Machine Learning.
A common approach to automatically analyzing text is the use of Machine Learning (ML) algorithms. ML can be used for predicting either continuous numbers or classes. This is, however, only possible when (enough) training and test data is available. In the Het Ongeluk DataFrame, labels related to some of the analyses were provided with a value of either 0 or 1 for each article. Since there are only two options and the numbers have no continuous meaning, this is considered a binary classification problem. When multiple labels are taken into account at once, it becomes a multi-label classification. Although many classification algorithms exist, some have proven to deal better with text data than others. Therefore, Multinomial Naive Bayes (MNB) and Support Vector Machines (SVM) were chosen. Grid search was applied for hyperparameter tuning in order to find the optimal model and obtain the best results. The available annotated data set was, however, very small and could therefore yield poor results. Consequently, a second solution for automating the analyses was also tried, one that does not rely on learning algorithms.
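A setup of this kind can be sketched with scikit-learn. The thesis's actual feature extraction and parameter grids are not given, so the pipeline below (bag-of-words plus MNB only; SVM would slot in the same way) and its small grid are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def tune_binary_classifier(texts, labels):
    """Grid-search a bag-of-words Multinomial Naive Bayes classifier for a
    binary (0/1) label.  The grid shown here is a small illustrative one,
    not the grid used in the thesis."""
    pipeline = Pipeline([
        ('vect', CountVectorizer()),   # article text -> term count features
        ('clf', MultinomialNB()),
    ])
    grid = {
        'vect__ngram_range': [(1, 1), (1, 2)],  # unigrams vs uni+bigrams
        'clf__alpha': [0.1, 1.0],               # smoothing strength
    }
    search = GridSearchCV(pipeline, grid, cv=2)
    search.fit(texts, labels)
    return search  # search.best_estimator_ holds the tuned model
```

With only a few hundred annotated articles, the cross-validation folds become very small, which is exactly the limitation that motivated the regexp-based alternative below.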
Regular Expressions.
Regular expressions (regexps) are a way of looking for specific patterns in a text. Inside a regular expression a pattern can be defined that has to match certain words in a given order. Regexps provide a lot of options for writing a sequence, from wildcards that match any single character, to specifying the exact number of occurrences of a character or group. The benefit of regexps is that they do not require training and can thus be used even when available data is limited. The downside is that the patterns have to be constructed by hand, meaning that all options and exceptions have to be taken into account and the regexps can become very long and complex. The vocabulary in the data was, however, expected to be small enough to construct regexps that would provide reasonable results.
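A regexp-based matcher of this kind can be sketched with Python's re module. The exact patterns used in the thesis are not given, so the ones below are illustrative approximations: they match a minimal fixed core per expression, allow a few words in between, and cover some verb conjugations.

```python
import re

# Illustrative approximations of per-expression patterns (not the thesis's
# exact regexps): a minimal fixed core, optional intervening words, and
# alternatives such as 'controle' for 'macht'.
VAGUE_PATTERNS = {
    'er met de schrik vanaf komen': re.compile(r'met de schrik'),
    'geschept worden': re.compile(r'geschept'),
    'iemand over het hoofd zien': re.compile(r'over het hoofd\s+(ge)?z(ien|iet|ag)'),
    'de macht over het stuur verliezen':
        re.compile(r'(macht|controle)(\W+\w+){0,3}\W+over het stuur'),
    'onbekende verwondingen': re.compile(r'onbekende verwondingen'),
}

def vague_expressions(article):
    """Return the names of the vague expressions matched in an article."""
    text = article.lower()
    return [name for name, pattern in VAGUE_PATTERNS.items()
            if pattern.search(text)]
```

Note how the bounded repetition `(\W+\w+){0,3}` lets "de macht volledig over het stuur" still match, which is the flexibility the hand-built patterns needed.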
4.4.2 Results.
Vague language and euphemisms.
Euphemisms and the use of vague language in general are ways of softening an otherwise harsh or unpleasant message. In accident reporting this is sometimes applied to soften the mention of someone getting injured, or to decrease the blame put on a perpetrator. The most common examples found in accident reporting are "iemand over het hoofd zien" (failing to see someone), "de macht over het stuur verliezen" (losing control over the steering wheel), "er met de schrik vanaf komen" (dodging a bullet), "geschept worden" (getting hit) and "onbekende verwondingen" (unknown injuries). The common theme here is that they all leave readers in the dark about what exactly happened. From here on, the term ’vague expressions’ will be used to refer to all of the above expressions.
For all expressions except "iemand over het hoofd zien" and "de macht over het stuur verliezen", regexps were utilized to search for the minimum number of words (that also always occurred together) needed to match the expression. For example, for the expression "er met de schrik vanaf komen", the regexp only had to search for "met de schrik". This method was used since the expressions did not always have the same syntax: there could have been words in between parts of the expressions, certain words could have been replaced, and verbs could have been conjugated. For the other two expressions, patterns with a similar meaning were also included. To illustrate, "controle" was considered to be an alternative for "macht" in "de macht over het stuur verliezen". The results from the analysis are shown in Table 2.
Table 2. Use of vague expressions in accident reporting articles (%)
 Het Ongeluk Flitsservice
’er met de schrik vanaf komen’ 4.62 1.89
’geschept worden’ 4.23 6.12
’iemand over het hoofd zien’ 8.08 8.84
’de macht over het stuur verliezen’ 6.15 10.11
’onbekende verwondingen’ 5.00 0.86
In the data set from Het Ongeluk all euphemisms appeared in roughly four to eight percent of the articles. In the Flitsservice data set, the occurrences were not as equally divided. All euphemisms did however appear, with ’onbekende verwondingen’ being found the least and ’de macht over het stuur verliezen’ occurring most frequently. The conclusion can be drawn that euphemisms were used in less than 10 percent of the news articles. ’De macht over het stuur verliezen’ appeared the most, although this could be due to the type of accidents in the data. If the data had contained fewer car accidents, this number would also have been lower. Drawing general conclusions from these results about the exact share of euphemisms in accident reporting is thus not advisable. The data does however provide enough information to conclude that euphemisms are to be found in accident reporting.

Person vs vehicle.
In accident reporting articles it is not uncommon to find something along the lines of "bicyclist hit by car". According to [18] this is due to the fact that we see a car rather than the person driving it. As a consequence, people tend to refer to the vehicle, and this in turn affects the way in which people perceive reality. This analysis aimed at answering the question "how often is a victim or perpetrator described as either a person or a vehicle?"
The best results were again produced by regexps and are shown in Figure 9.
Fig. 9. Percentage of accident reporting titles that match the regexps for each label
There were two important factors that the solution for this analysis had to be able to deal with: making a distinction between a person and a vehicle, and deciding whether a person or vehicle was considered a victim or an opposite party (perpetrator). This led to four labels for each title and article. A fifth one was also included that looked for words describing consequences for the rest of traffic.
• victim as a person
• victim as a vehicle
• opposite (perpetrator) as a person
• opposite (perpetrator) as a vehicle
• consequences for the rest of traffic

Victim as a person turned out to be much more common than victim as a vehicle in the titles for both data sets. For the opposite party, however, more titles referred to a vehicle instead of a driver. This suggests that blame is often, either consciously or unconsciously, shifted away from perpetrators.

"Ongeluk" vs "aanrijding".
According to [14], car crashes are often preventable and this message should be conveyed to the public by using the right term to describe a crash. Unfortunately, in the research conducted by Ralph et al. the neutral term "accident" was found to be most used. This term is considered to mask the preventable nature of crashes.
For this analysis, variations on the terms were also included in the regexps. Most of those included conjugations of a verb closely related to the term. For example, conjugations of the verb "botsen" (to collide) were also included for the term "botsing".
Fig. 10. Use of specific terms to describe a crash in accident reporting articles (%)
As seen in Figure 10, the term "ongeluk" was found most often in Het Ongeluk. In Flitsservice, "ongeval" was found to be the most occurring term by far. Both terms are translations of the English term accident, thus the conclusion can be drawn that accident was indeed the most used term.

By automating this analysis, the Flitsservice data set could also be analyzed over time to visualize whether term preference changed. The result is shown in Figure 11.
Fig. 11. Use of specific terms to describe a crash in Flitsservice over time (%)
Intervals were used because no data was available for some years. The biggest differences in percentage of use were found for ’ongeval’ and ’ongeluk’. While ’ongeval’ was found in all titles before the year 2000, it only occurred in around 60% of titles between 2006 and 2010. ’Ongeluk’, or something closely related to it like ’verongelukt’ (crashed), on the other hand, was not found in any titles before 2000, but an upward trend was seen afterwards.
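The per-interval percentages behind a chart like this can be computed from dated titles as follows. This is a sketch: the two term patterns shown (and the year-keyed input shape) are illustrative, not the full set of patterns used for Figure 11.

```python
import re

# Illustrative patterns for two of the crash terms; 'verongelukt' is folded
# into the 'ongeluk' pattern as a closely related variant.
TERM_PATTERNS = {
    'ongeval': re.compile(r'ongeval'),
    'ongeluk': re.compile(r'ongeluk|verongelukt'),
}

def term_share_per_interval(titles_by_year, intervals):
    """Percentage of titles matching each term pattern per (start, end)
    interval (inclusive).  `titles_by_year` is a {year: [titles]} dict."""
    shares = {}
    for start, end in intervals:
        titles = [t.lower() for year, ts in titles_by_year.items()
                  if start <= year <= end for t in ts]
        shares[(start, end)] = {
            term: 100 * sum(bool(p.search(t)) for t in titles) / len(titles)
            for term, p in TERM_PATTERNS.items()
        } if titles else {}
    return shares
```

Grouping by interval rather than by year sidesteps the years for which no data was available.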
5 EVALUATION
5.1 RQ1 Common words
There were two types of possible mistakes in the word clouds that may have influenced how well they described the language. The first mistake was related to Pattern’s parsetree module: some words had a wrong POS-tag assigned, meaning they were misclassified. The second mistake, specific to this research, was a word showing up in the word cloud that, although common, was not representative in any way. To evaluate how well the different word clouds performed, the terms were printed as a list (see Appendix A). The mistakes made by the Parsimonious LM are shown in Table 3.
Table 3. Mistakes made by Parsimonious LM
Word type Wrong POS-tag Not representative Total mistakes
Verbs 6/7 4 10/11
Nouns 6 0 6
Adjectives 16 3 19
Most mistakes related to POS-tags were made for adjectives. This could be due to the fact that the accident reporting vocabulary differed a lot from the corpus on which the Pattern module for Dutch was trained. Unseen words are harder to classify and the module may have mistaken words for adjectives based on their position in a sentence. Since wrong POS-tags occurred in all word clouds, the number of non-representative words was considered to be more important. Here, there were only 4 for verbs, none for nouns and 3 for adjectives. All the other words were at least somewhat related to accident reporting. The mistakes detected in the TF word clouds are shown in Table 4.
Table 4. Mistakes made by TF word clouds
Word type Wrong POS-tag Not representative Total mistakes
Verbs 6 22 28
Verbs lemma 9 23 32
Nouns 4 0 4
Adjectives 8 12 20
A lot of mistakes were made for representativeness, with the exception of the nouns word cloud. The nouns were, however, less representative than in the Parsimonious LM cloud. The addition of bigrams provided a few more representative adjectives, but the results were still not sufficient.

Unfortunately, the Venn Clouds module did not provide an easy way to access the words. But since Venn clouds are also largely based on TF, it can be expected that they produced better results than the TF word clouds, but performed worse than the Parsimonious LM clouds.
5.2 RQ2 SVO triples
The results relied heavily on Pattern’s capability of correctly recognizing subjects, verbs and objects. This means that there could have been other frequently occurring triples that were not detected. Consequently, the conditional probabilities are only to be considered correct under the assumption that the triples were as well.
5.3 RQ3 Automation
For the automation, 11 manually created analyses were provided. An attempt was made at automating 5 of them, although some changes were applied. Some of the other 6 were considered too difficult to solve with regexps because there were no very clear patterns to search for. If more data was available, Machine Learning algorithms may have offered a solution.
To evaluate how well the automations performed, annotations provided by M. te Brömmelstroet served as ground truth. Although they were in some way subjective and a few inconsistencies were found, there was no better alternative.
For each annotation or label, 0 translated to negative (N) and 1 to positive (P). The following test statistics were used:

• True Positives (TP): annotated 1 and matched by regexps
• True Negatives (TN): annotated 0 and not matched by regexps
• False Positives (FP): annotated 0 but matched by regexps
• False Negatives (FN): annotated 1 but not matched by regexps

With these statistics, the following metrics were calculated:
accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
Accuracy is defined as the percentage of correctly predicted observations. This metric can suffice for evaluation, but if the data set has unbalanced classes or if a more in-depth evaluation is wanted, precision and recall need to be included. Precision expresses the proportion of data points classified as positive (class 1) that really are positive. Recall, on the other hand, expresses how well a model performs at identifying positive data points. Because of the inconsistencies in the ground truth, some misclassifications were inevitable and thus perfect results were not expected.
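The metrics above can be computed directly from paired annotations and regexp match results, as a small sketch (the function name is illustrative):

```python
def evaluate(annotations, predictions):
    """Accuracy, precision and recall from binary ground-truth annotations
    (0/1) and regexp match results (0/1), following the TP/TN/FP/FN
    definitions above."""
    tp = sum(a == 1 and p == 1 for a, p in zip(annotations, predictions))
    tn = sum(a == 0 and p == 0 for a, p in zip(annotations, predictions))
    fp = sum(a == 0 and p == 1 for a, p in zip(annotations, predictions))
    fn = sum(a == 1 and p == 0 for a, p in zip(annotations, predictions))
    return {
        'accuracy': (tp + tn) / (tp + tn + fp + fn),
        'precision': tp / (tp + fp) if tp + fp else 0.0,  # guard: no positives predicted
        'recall': tp / (tp + fn) if tp + fn else 0.0,     # guard: no positives annotated
    }
```

The guards against empty denominators matter for labels that a regexp never matches, where precision would otherwise divide by zero.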
Vague expressions.
The original analysis focused more on accusations, among which two of the vague expressions also used in this research could be found. The decision was made to change the focus to vague expressions based on the news article published by T. Verkade and M. te Brömmelstroet [18]. In that article the five vague expressions and their effect on interpretation are discussed in detail.
Because of these adaptations it was not possible to evaluate the performance of all regexps, except manually. Annotations were, however, provided for "iemand over het hoofd zien" and "de macht over het stuur verliezen". The results for the two expressions are given in Table 5.
Table 5. Performance of regexps at identifying vague expressions
metric "iemand over het hoofd zien" "de macht over het stuur verliezen" overall
accuracy 0.98 1.00 0.94
precision 0.80 1.00 0.89
The regexps performed very well, both overall and per vague expression.
Person vs vehicle.
The regexps for this analysis were based on the titles of the articles. This means that a lot of exceptions could be captured, but the model was probably also overfit on this particular data. Multi-label classification was also performed as a solution (see Appendix B.2), but the test set only contained 52 articles. It is therefore difficult to say how good the classifiers actually were. On the other hand, ML algorithms are known to perform rather well on text data, so this method could be useful for further research if more training data becomes available.
The analysis was performed for both titles and articles. As expected, the model performed better on titles than on articles. This is mainly due to the fact that context played a bigger role in the articles. Take, for example, the following article:
“Twee voertuigen zijn vanmiddag met elkaar in botsing gekomen op de Leidse Schouw in Alphen aan den Rijn. Een bestuurder moest worden nagekeken door ambulancepersoneel, maar hoefde niet mee naar het ziekenhuis.” (Two vehicles collided this afternoon on the Leidse Schouw in Alphen aan den Rijn. One driver had to be checked by ambulance personnel, but did not have to go to the hospital.)
This was annotated 0 for ’victim as a person’ and 1 for ’victim as a vehicle’ (because of "twee voertuigen"). The regexps on the other hand labelled it 1 for ’victim as a person’ because a driver was mentioned in the second sentence ("een bestuurder"). This was considered incorrect because the fact that the driver was checked by ambulance personnel did not say anything about the accident itself. The results for titles and articles are shown in Table 6 and Table 7, respectively.
Table 6. Performance of regexps for person vs vehicle analysis on titles
metric vic_person vic_vehicle oppo_person oppo_vehicle traffic overall
accuracy 0.95 0.90 0.95 0.90 0.99 0.94
precision 0.97 0.71 0.71 0.80 0.94 0.85
recall 0.94 0.88 1.00 0.81 1.00 0.91

Table 7. Performance of regexps for person vs vehicle analysis on articles
metric vic_person vic_vehicle oppo_person oppo_vehicle traffic overall
accuracy 0.67 0.64 0.72 0.74 0.89 0.73
precision 0.68 0.60 0.53 0.74 0.85 0.68
recall 0.86 0.73 0.70 0.84 0.85 0.81
"Ongeluk" vs "aanrijding".
The performance of the regexps on this analysis is shown in Table 8.
Table 8. Performance of regexps for "ongeluk" vs "aanrijding" analysis
metric ongeluk aanrijding botsing ongeval overall
accuracy 0.94 0.93 0.93 0.96 0.94
precision 0.93 0.87 0.90 0.91 0.91
recall 0.94 0.88 0.93 0.98 0.94
Based on all metrics it is possible to conclude that the automation of this analysis went very well. When manually checking the misclassifications, it was discovered that at least some of them were actually wrongly annotated. This means that the performance of the regexps may have been even higher.
6 CONCLUSIONS
Throughout this research an attempt was made at extracting patterns from accident reporting articles published by Dutch media. Identifying these patterns not only provides useful insights into the Five W’s and How of traffic accidents. More importantly, it is a crucial step towards understanding the influence of media on the public awareness surrounding the dangers of traffic participation and the way in which accidents are perceived. In addition, the application of NLP techniques makes for faster and more scalable analyses, which in turn assist in drawing more general conclusions. With rising numbers of traffic accidents, the relevance of research on this topic becomes more prominent. While the government is taking action to identify the main risks of traffic, the role of the media is left untouched. Therefore, this research and possible future work is of great importance to start a discussion on what role media can or should play in this field.
By applying exploratory data analysis, the content of accident reporting articles was explored on multiple levels. First, word clouds were utilized to visually capture the essence of the language used in accident reporting. To achieve this, a search for the best weight computing method was carried out, resulting in the Parsimonious LM clouds containing the most representative words. Previous research already showed that the addition of parsimony could lead to more representative terms; hence this result was expected. The TF word clouds, while containing a lot of common and unrepresentative words, did provide useful insights that could not be derived from the other clouds. The results suggested that the emphasis is more often put on the victim, and that reporters more often refer to a vehicle instead of the driver. The latter was also found to be true in English accident reporting by Ralph et al. [14]. To further investigate language use, SVO triples were extracted from both titles and articles to detect recurring events. The shared structure amongst the title triples opened up the possibility for further analysis. The calculated conditional probabilities showed, among other things, that the chance of finding the term ’aanrijding’, given a VRU, was higher than that of finding the term ’ongeluk’.
The second part of this research focused on the automation of manually performed analyses. From the original 11, an attempt was made at automating 5. The others were considered too difficult given the lack of training data for Machine Learning, while also not being very suitable for regexps. The main advantage of automated processes is their ability to scale, so that results can more easily be generalized. Therefore, the automated analyses were also conducted for the larger Flitsservice data set. Overall, the metrics showed good results for the three analyses that were discussed in detail. For the vague expressions only two labels could be evaluated, since the original analysis was adapted for this research. The performance of the regexps on ’person vs vehicle’ was lower for articles than for titles. This was mainly because the regexps were based on the titles, while the articles contained more information and context played a greater role. Tuning the regexps to also capture everything in the articles is, if even possible, not a very good solution: it would only lead to more overfitting and a non-scalable model. Further research is needed to find a solution that can overcome this limitation. Machine Learning might provide good results if enough training data is available. In further research, an attempt could be made at automating the remainder of the analyses by applying different NLP techniques.
This research is for the most part considered to be exploratory, since the data sets were relatively small. Therefore, drawing general conclusions was mostly avoided and the results were mainly observations and indications. Interesting patterns were, however, detected, and these findings should be taken into account in further research. Furthermore, the automations provided at least reasonable results and could be applied to larger data sets.
6.1 Acknowledgements
I would like to express my great appreciation to my supervisor for providing me with extensive feedback and supporting me in bringing this thesis to a successful end. Secondly, I would like to offer my special thanks to M. te Brömmelstroet and T. Verkade for making this research possible and for their great effort in investigating the role of media in the perception of traffic accidents and the level of awareness surrounding the dangers. Lastly, I would like to thank B. Hendriks and S. Siepel for their assistance in the initial phases of this research project.
REFERENCES
[1] VU-DNC corpus [online service]. Available at the Dutch Language Institute: http://hdl.handle.net/10032/tm-a2-g4, 2018.
[2] Jeffrey S. Bowers and Christopher W. Pleydell-Pearce. Swearing, Euphemisms, and Linguistic Relativity. PLoS ONE, 6(7):e22341, July 2011.
[3] Centraal Bureau voor de Statistiek. 11 procent meer verkeersdoden in 2018. Centraal Bureau voor de Statistiek, 2019.
[4] Kenneth Church, William Gale, Patrick Hanks, and Donald Hindle. Using Statistics in Lexical Analysis. page 33.
[5] Julie Cidell. Content clouds as exploratory qualitative data analysis. Area, 42(4):514–523, December 2010.
[6] Glen Coppersmith and Erin Kelly. Dynamic Wordclouds and Vennclouds for Exploratory Data Analysis. In Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, pages 22–29, Baltimore, Maryland, USA, June 2014. Association for Computational Linguistics.
[7] Concetta A. DePaolo and Kelly Wilkinson. Get Your Head into the Clouds: Using Word Clouds for Analyzing Qualitative Assessment Data. TechTrends, 58(3):38–44, May 2014.
[8] Martin J. Halvey and Mark T. Keane. An assessment of tag presentation techniques. In Proceedings of the 16th International Conference on World Wide Web - WWW ’07, page 1313, Banff, Alberta, Canada, 2007. ACM Press.
[9] F. Heimerl, S. Lohmann, S. Lange, and T. Ertl. Word Cloud Explorer: Text Analytics Based on Word Clouds. In 2014 47th Hawaii International Conference on System Sciences, pages 1833–1842, January 2014.
[10] Djoerd Hiemstra, Stephen Robertson, and Hugo Zaragoza. Parsimonious language models for information retrieval. In Proceedings of the 27th Annual International Conference on Research and Development in Information Retrieval - SIGIR ’04, page 178, Sheffield, United Kingdom, 2004. ACM Press.
[11] David Johns and Dominick A. DellaSala. Caring, killing, euphemism and George Orwell: How language choice undercuts our mission. Biological Conservation, 211:174–176, July 2017.
[12] Rianne Kaptein, Djoerd Hiemstra, and Jaap Kamps. How Different Are Language Models and Word Clouds? In Cathal Gurrin, Yulan He, Gabriella Kazai, Udo Kruschwitz, Suzanne Little, Thomas Roelleke, Stefan Rüger, and Keith van Rijsbergen, editors, Advances in Information Retrieval, volume 5993, pages 556–568. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010.
[13] T. Lansdall-Welfare, S. Sudhahar, G. A. Veltri, and N. Cristianini. On the coverage of science in the media: A big data study on the impact of the Fukushima disaster. In 2014 IEEE International Conference on Big Data (Big Data), pages 60–66, October 2014.
[14] Kelcie Ralph, Evan Iacobucci, Calvin G. Thigpen, and Tara Goddard. Editorial Patterns in Bicyclist and Pedestrian Crash Reporting. Transportation Research Record: Journal of the Transportation Research Board, 2673(2):663–671, February 2019.
[15] Saatviga Sudhahar, Thomas Lansdall-Welfare, Ilias Flaounas, and Nello Cristianini. Quantitative Narrative Analysis of US Elections in International News Media. page 12.
[16] Saatviga Sudhahar, Thomas Lansdall-Welfare, Ilias Flaounas, and Nello Cristianini. ElectionWatch: Detecting Patterns in News Coverage of US Elections. In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 82–86, Avignon, France, April 2012. Association for Computational Linguistics.
[17] Michael Tayler and Jane Ogden. Doctors’ use of euphemisms and their impact on patients’ beliefs about health: An experimental study of heart failure. Patient Education and Counseling, 57(3):321–326, June 2005.
[18] Thalia Verkade and Marco te Brömmelstroet. ‘Busje ramt auto’, ‘file na ongeluk’. En de mensen dan? https://decorrespondent.nl/9272/busje-ramt-auto-file-na-ongeluk-en-de-mensen-dan/974923679400-3cc81f84, March 2019.
[19] Wetenschappelijk Onderzoek Verkeersveiligheid. https://theseus.swov.nl/single/?appid=73c9f2d7-2873-4e4a-8e6e-095840c66ee5&sheet=0ce1fd1f-761c-40ae-b54e-66823d116a34&opt=currsel,ctxmenu.
Parsimonious LM
POS-tag verbs POS-tag nouns POS-tag adjectives
Terms Weight Terms Weight Terms Weight
overleden 0.0804 ongeval 0.0221 xx-jarige 0.1619
reed 0.0697 aanrijding 0.0198 dodelijk 0.0412
overleed 0.0678 ongeluk 0.0182 inzittenden 0.0410
raakte 0.0629 plaatse 0.0168 ernstig 0.0389
gekomen 0.0562 automobilist 0.0165 eenzijdig 0.0384
overgebracht 0.0344 auto 0.0162 voertuig 0.0378
afgesloten 0.0329 verkeersongeval 0.0148 vermoedelijk 0.0328
gebeurde 0.0315 toedracht 0.0147 dodelijke 0.0233
gebracht 0.0269 slachtoffer 0.0145 frontale 0.0220
geraakt 0.0263 fietser 0.0136 tegemoetkomende 0.0205
aangereden 0.0175 motorrijder 0.0136 plekke 0.0204
verongelukt 0.0156 boom 0.0129 kritieke 0.0202
vervoerd 0.0150 kruising 0.0127 jarige 0.0186
onderzoekt 0.0142 ziekenhuis 0.0127 automobiliste 0.0168
getuigen 0.0136 gewond 0.0125 gewond 0.0144
meldt 0.0131 personenauto 0.0123 terecht 0.0129
sloot 0.0128 botsing 0.0122 raakte 0.0123
onbekende 0.0121 traumahelikopter 0.0116 precieze 0.0093
overlijdt 0.0116 zwaargewond 0.0102 weg 0.0092
bestuurd 0.0115 ambulance 0.0099 flauwe 0.0092
ingesteld 0.0108 richting 0.0093 onbekend 0.0089
belandde 0.0107 stuur 0.0090 file 0.0084
inzittende 0.0096 brandweer 0.0088 bevrijd 0.0083
oversteken 0.0083 vrachtwagen 0.0087 bejaarde 0.0076
geslingerd 0.0083 plekke 0.0079 tegenovergestelde 0.0076
gebotst 0.0080 oorzaak 0.0079 aanrijding 0.0073
kwam 0.0079 vrachtwagenchauffeur 0.0076 dode 0.0071
gereden 0.0078 voertuig 0.0075 noodlottig 0.0065
gereanimeerd 0.0078 berm 0.0074 aanspreekbaar 0.0065
geschept 0.0071 frontaal 0.0073 passerende 0.0064
stak 0.0067 bocht 0.0071 gelderse 0.0063
gewonden 0.0063 weghelft 0.0067 nader 0.0059
bekneld 0.0061 onwel 0.0065 verkeersongevallenanalyse 0.0054
plaatsgevonden 0.0058 vrachtauto 0.0064 naastgelegen 0.0052
terecht 0.0052 politie 0.0063 stilstaande 0.0050
verleende 0.0051 rijbaan 0.0062 gestart 0.0046
botsten 0.0050 letsel 0.0061 slachtofferhulp 0.0046
aangetroffen 0.0046 inwoner 0.0061 eenzijdige 0.0043
gehaald 0.0045 fiets 0.0061 inhaalmanoeuvre 0.0042
gewonde 0.0045 stilstand 0.0061 auto 0.0042
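The weights in the table above come from a parsimonious language model [10]: term counts are iteratively redistributed between a document model and a background corpus model, so that words frequent everywhere lose weight and only document-typical terms survive. A minimal EM sketch in Python, not the thesis implementation; the function name, the mixing weight lam = 0.1, and the iteration count are illustrative assumptions:

```python
def parsimonious_weights(doc_tf, corpus_tf, lam=0.1, iters=20):
    """EM estimation of a parsimonious language model (Hiemstra et al., 2004).

    doc_tf:    term frequencies of the document (or concatenated document set)
    corpus_tf: term frequencies of the background corpus
    lam:       weight of the document model against the background model
    """
    corpus_total = sum(corpus_tf.values())
    p_bg = {t: corpus_tf.get(t, 0) / corpus_total for t in doc_tf}

    # initialise the document model with maximum-likelihood estimates
    doc_total = sum(doc_tf.values())
    p_doc = {t: f / doc_total for t, f in doc_tf.items()}

    for _ in range(iters):
        # E-step: expected share of each term's count credited to the document model
        e = {}
        for t, f in doc_tf.items():
            num = lam * p_doc[t]
            denom = num + (1 - lam) * p_bg[t]
            e[t] = f * num / denom if denom > 0 else 0.0
        # M-step: renormalise the expected counts into a probability distribution
        norm = sum(e.values())
        p_doc = {t: v / norm for t, v in e.items()}
    return p_doc
```

On a toy input where 'b' dominates the background corpus, the model correctly concentrates its weight on the document-specific term 'a', which is the behaviour the verb/noun/adjective columns above reflect.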
TF Word Clouds
POS-tag verbs POS-tag verbs lemmata POS-tag nouns POS-tag adjectives
Terms Weight Terms Weight Terms Weight Terms Weight
aangereden 1513 aanrijden 1568 aanrijding 4681 automobiliste 260
afgesloten 1810 afsluiten 1814 ambulance 1140 bekend 1291
betrokken 872 besturen 881 auto 10645 direct 582
doet 758 betrekken 873 automobilist 3016 dodelijk 945
gebeurde 1890 brengen 1869 boom 1965 dodelijke 569
gebracht 1742 doen 1563 botsing 2081 duidelijk 499
gekomen 4241 gaan 1026 brandweer 1288 eenzijdig 927
geraakt 1497 gebeuren 2467 fiets 880 enige 553
gereden 819 geraken 1528 fietser 1697 ernstig 1795
getuigen 972 getuigen 972 gewond 2555 ernstige 370
geweest 930 halen 784 hoogte 1371 frontale 386
had 1199 hebben 7426 hulp 962 half 599
hebben 1534 komen 11384 jongen 915 hard 333
heeft 4392 kunnen 2764 kruising 1428 hoge 413
is 21549 laten 861 leven 4505 inzittenden 860
kon 1384 liggen 864 man 10756 jarige 325
kwam 5213 lopen 864 motorrijder 1614 kort 434
kwamen 811 melden 1562 onderzoek 3896 kritieke 354
meldt 994 moeten 1112 ongeluk 5396 lange 266
mocht 760 mogen 860 ongeval 8431 mogelijk 526
niet 1467 niet 1469 oorzaak 2439 nader 258
nog 1940 nog 1941 personenauto 1343 onbekend 488
onbekende 1853 onbekennen 1853 plaats 1496 onduidelijk 308
onderzoekt 799 onderzoeken 1260 plaatse 3046 plekke 411
overgebracht 1583 overbrengen 1585 politie 9497 precies 328
overleden 6197 overlijden 10145 richting 3173 precieze 269
overleed 3412 raken 3868 slachtoffer 4940 raakte 277
raakte 3774 reden 4727 stuur 992 snel 374
reed 4273 rijden 987 tijd 1154 technisch 281
sloot 914 schrijven 859 toedracht 2077 tegemoetkomende 324
te 4243 sloten 914 traumahelikopter 1305 uiteindelijk 267
vond 816 staan 848 uur 5873 vast 268
waren 1121 te 4243 verkeer 1317 verkeerde 379
was 5313 vinden 1308 verkeersongeval 1962 vermoedelijk 1350
werd 6773 wezen 938 vrachtwagen 910 voertuig 687
werden 812 willen 876 vrouw 5597 vrij 314
worden 1128 worden 9818 water 1338 waarschijnlijk 430
wordt 1100 zien 1603 weg 3203 weg 328
zat 1116 zijn 34355 ziekenhuis 4702 xx-jarige 12963
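The TF columns, by contrast, are plain per-POS term counts. A minimal sketch, assuming tokens arrive as pre-tagged (word, POS) pairs from a Dutch tagger; the function name and the toy input are illustrative, not the thesis code:

```python
from collections import Counter

def tf_per_pos(tagged_tokens, pos):
    """Count lowercased term frequencies for one POS tag over (word, pos) pairs."""
    return Counter(w.lower() for w, p in tagged_tokens if p == pos)

# hypothetical toy input
tokens = [("Auto", "NOUN"), ("reed", "VERB"), ("auto", "NOUN")]
```

Running `tf_per_pos(tokens, "NOUN")` on this toy input counts "auto" twice; the same per-tag counting over the full article collection yields the four columns above.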
B AUTOMATION
B.1 Regexps