Language Processing
Submitted in partial fulfillment for the degree of Bachelor of Science
Credits: 12 EC
Ella Casimiro
11195002
Bachelor Information Studies
Faculty of Science
University of Amsterdam
2019-06-28
Supervisor: Dr. Maarten Marx (UvA, FNWI, IvI), maartenmarx@uva.nl
Abstract
In 2017, 123,930 traffic accidents were recorded in the Netherlands, and news articles on this topic are published daily. Besides containing valuable information as to who was involved and what happened, there is more to be discovered when carefully analyzing the content. This research attempts to give a better understanding of how Dutch media report traffic accidents by searching for hidden patterns with exploratory data analysis and other Natural Language Processing techniques. Accordingly, the main question of this research was: "What patterns can be found in accident reporting by Dutch media?". Identifying these patterns is essential for understanding the influence of media on the perception of traffic accidents and awareness of the dangers of traffic. In this research, a search for the most representative words was conducted by comparing different weighting schemes for word clouds. Afterwards, the most common subject-verb-object triples were extracted and further analyzed. The last part of this research was aimed at automating manually performed analyses, so that they could be applied to larger data sets. The results suggested the use of a specific language for accident reporting, in which the focus is mostly put on the victim, and thus shifted away from the perpetrator, and a vehicle is referred to more often than the person driving it. Furthermore, it was found, among other things, that a vulnerable road user had a higher probability of being the subject in an SVO triple with the verb 'overlijden' (to die) than a motorized road user. Ultimately, regular expressions were found to produce at least reasonable results in automating manual analyses. The application of the analyses to a larger data set also produced some interesting insights, including on the use of vague expressions and on the preferred terms to describe an accident. This research was mostly exploratory in nature, and thus further research is needed to generalize the results.
Contents
1 Introduction
2 Background
3 Related Work
4 Methods & Results
  4.1 Description of the data
    4.1.1 Pre-processing of the DataFrames
  4.2 RQ1 Common words
    4.2.1 Methods
    4.2.2 Results
  4.3 RQ2 SVO triples
    4.3.1 Methods
    4.3.2 Results
  4.4 RQ3 Automation
    4.4.1 Methods
    4.4.2 Results
5 Evaluation
  5.1 RQ1 Common words
  5.2 RQ2 SVO triples
  5.3 RQ3 Automation
6 Conclusions
  6.1 Acknowledgements
References
A Word Clouds
B Automation
  B.1 Regexps
  B.2 ML Classification
1 INTRODUCTION
According to Het Centraal Bureau voor de Statistiek (CBS), 678 people died in a traffic accident in the Netherlands last year. CBS reported an increase of 10.6 percent compared to 2017, the biggest increase since 1989. Fatal accidents only account for a small fraction of the total number of accidents [3]. Data from Wetenschappelijk Onderzoek Verkeersveiligheid (SWOV) shows a total number of 123,930 accidents in 2017. Although most of those accidents ended in material damage only, in 18,706 of the accidents someone was slightly injured or worse [19]. To reduce the number of accidents, the Ministry of Infrastructure and Water Management and the Ministry of Justice and Security have created the 'Strategisch Plan Verkeersveiligheid 2030' (SPV) together with other parties. Some of the themes in the SPV relate to infrastructure, Vulnerable Road Users (VRU), young and elderly drivers, and alcohol consumption. An effort is thus being made to reduce the risk of participating in traffic. The government is, however, not the only party investigating this topic.
In particular, Thalia Verkade (journalist for De Correspondent) and Marco te Brömmelstroet (affiliated with the University of Amsterdam) have also taken an interest in traffic accidents, more specifically in trying to understand what role the media play or should play in this topic. They started the website Hetongeluk.nl, where news articles about traffic accidents are collected and annotated by hand with useful information such as who was involved and whether someone was injured. There is, however, a lot more to be found in those articles that influences the way in which people perceive traffic accidents and their level of awareness with regard to the possible dangers. Exploratory data analysis (EDA) combined with visualization techniques can aid in finding that information by summarizing the main characteristics of accident reporting, while at the same time uncovering hidden patterns and insights. From there on, other Natural Language Processing (NLP) techniques can be applied to dig deeper by performing clear-cut analyses that would otherwise be time-consuming and unscalable when performed manually.
This thesis thus attempts to give a better understanding of how Dutch media report traffic accidents with the use of Natural Language Processing techniques, by applying EDA and automating analyses. Consequently, the following research question and sub-questions were defined:
RQ What patterns can be found in accident reporting by Dutch media?
(1) What are the most common and representative verbs, adjectives and nouns in news articles?
(2) What are the most common subject-verb-object triples?
(3) What manual processes can be automated and applied to larger data sets?

This research is part of a larger project, in collaboration with Sander Siepel and Barry Hendriks, aimed at automatically annotating and analyzing accident reporting articles. The objective of this thesis is to automatically analyze accident reporting articles.
Overview of thesis.
Section 2 provides background information on the methods used in RQ1 through relevant literature. Thereafter, related work on the methods used in RQ2 and RQ3 is discussed in section 3. The methods and results are then explained per sub-question in section 4, followed by an evaluation of the results and a reflection on the process in section 5. Finally, section 6 describes the main conclusions of this research.
2 BACKGROUND
In this section, the use of and difference between word clouds, Venn clouds and Parsimonious Language Models will be illustrated through relevant literature. These methods will be applied in section 4.2.
Word Clouds.
One method that has proven effective in the field of text analysis and summarization is the use of tag or word clouds. Clouds are used to visualize tags assigned to documents, or frequently occurring words in a document or corpus, so that a user can quickly understand what it is about without having to read the entire text. According to Heimerl et al. [9], the literature on tag clouds can be divided into two main areas: studies on the effectiveness and visual interpretability of word clouds, and studies that focus on finding improvements and extensions for the existing concept.
One study on the effectiveness of word clouds conducted an experiment to investigate the influence of font size and word order on a selection task [8]. The authors concluded that font size plays an important role in helping someone find information easily. Secondly, alphabetical ordering of words also improved search time. Word clouds have also been proven effective for very specific applications. An example can be found in the research by DePaolo and Wilkinson [7]. In their study on assessment, students were asked to answer questions with short answers. These answers were visualized with word clouds for a professor to get a better understanding of the knowledge (and potential lack thereof) on specific topics. This method can be applied during lectures or even before and after a test.
Heimerl et al. [9] created Word Cloud Explorer, a system in which users have full control over a word cloud that summarizes a body of text. Part-of-speech (POS) tags can be selected if a user is, for example, only interested in nouns and adjectives. Moreover, users can hover over a specific term to have the system highlight other terms that appear in the same sentence. There is, amongst others, also the possibility to extract extra information about a specific term, such as the frequency count and detected word forms in the entire corpus or a selected part of it. After conducting a qualitative user study aimed at receiving useful feedback, one of their findings was that filtering on POS-tags was considered more useful than using colors to distinguish between them in a single word cloud.
Another interesting study focused on the use of word clouds for exploratory data analysis [5]. One of the examples in the research showed content clouds of public meeting transcripts in different areas of America. All meetings concerned the same topic, but the results showed that people in different states perceived the discussed problem differently. This is an example of how differences in text can be easily spotted with the use of word clouds.
Venn Clouds.
An improvement on using word clouds for differentiating between two bodies of text is discussed by Coppersmith and Kelly [6]. To define the relationship between two corpora, they made use of Venn clouds. In Venn clouds, words associated with both corpora are displayed in the center, while the other words are displayed either left or right depending on the corpus they are found in. In this specific research, association was based on the probability of a term occurring in the corpora. Association could, however, also be based on frequency counts.
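The frequency-based variant of this association can be sketched in a few lines. This is a minimal stand-in, not the authors' implementation; the toy corpora and the function name are invented for illustration:

```python
from collections import Counter

def venn_partition(corpus_a, corpus_b, top_n=60):
    """Split the top-n words of two corpora into left / shared / right sets,
    with association based on raw frequency counts."""
    top_a = {w for w, _ in Counter(corpus_a).most_common(top_n)}
    top_b = {w for w, _ in Counter(corpus_b).most_common(top_n)}
    shared = top_a & top_b  # drawn in the centre of the Venn cloud
    return top_a - shared, shared, top_b - shared

left, centre, right = venn_partition(
    "de auto raakte de fietser en de auto reed door".split(),
    "de minister sprak en de kamer stemde over de wet".split(),
)
```

Stop-word-like terms ('de', 'en') end up in the centre, while the domain-specific words stay on their own side.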
One limitation of word clouds that is not completely solved by Venn clouds is the lack of removal of non-representative words. Stop-word removal can be applied to both methods, but this may not always be sufficient.
Parsimonious Language Models.
Kaptein, Hiemstra & Kamps [12] investigated this problem by searching for additions to term-frequency (TF) word clouds that improve the selection of visualized terms. They note that the terms in a TF word cloud are often not the words that users would like to see, because the most frequently occurring terms are not necessarily very informative. In their research, the addition of bigrams and a language model to word clouds was tested. According to their user study, users prefer word clouds with both unigrams and bigrams, or only bigrams, over the basic unigram cloud. Furthermore, they also investigated whether users preferred term frequency multiplied by inverse document frequency (TF-IDF) or the parsimonious model over TF weighting. With a parsimonious model, common stop-words and article-specific common words are automatically excluded from the word cloud. While the TF word clouds were still found to be most beneficial for retrieval, the parsimonious word clouds outperformed both other methods in returning specific words and removing stop-words.
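The TF-IDF weighting mentioned above can be illustrated with a minimal sketch. The documents are invented toy data, and the standard tf × log(N/df) formulation is used; the exact variant in the cited work may differ:

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Per-document TF-IDF weights: term frequency multiplied by the
    log inverse document frequency over the whole collection."""
    n_docs = len(docs)
    # document frequency: in how many documents each term occurs
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [["de", "auto", "botste"], ["de", "fietser", "viel"], ["de", "auto", "reed"]]
w = tfidf_weights(docs)
```

A term occurring in every document ('de') receives weight zero, which is exactly the stop-word suppression that plain TF lacks.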
In [10], the use of parsimonious language models for Information Retrieval (IR) is investigated. Normally, a language model is created for each document, in which terms are ranked based on their probability of occurrence. The problem with this is that common words or stop-words get assigned a high probability, which is unwanted. To solve this, the concept of parsimony was applied to the language model, meaning that terms that distinguish a document from the rest of the corpus are rewarded with a higher probability. From their research, they concluded that the smaller parsimonious models performed equally well as, or even better than, the normal language models at three stages of retrieval (indexing, request and feedback time).
3 RELATED WORK
SVO triples.
Extracting subject-verb-object (SVO) triples has been used in media analysis for multiple purposes. In [15], an application is described that automatically extracts SVO triples and saves them to a database, in order to identify important actors and actions for a specific domain or during a specific period in time. By applying the application to a large number of crime-related news articles from the New York Times between 1987 and 2007, the key actors and actions could be extracted by weighting their frequency in the crime articles against their frequency in a general news corpus. This eliminated actors that were not specifically relevant for crime news. They also created a network from the top 300 subjects, verbs and objects to visualize the relationships amongst them. In [16], the system ElectionWatch is introduced. The system makes use of SVO triples from U.S. election news to display endorsement and opposition relationships between actors in a graph.
Another application of SVO triples in news content analysis is shown in [13]. Lansdall-Welfare et al. collected and analyzed millions of articles on nuclear power over a five-year period to find out how the Fukushima disaster influenced media coverage. One of their analyses included extracting SVO triples to create a network that showed actors and actions affecting a specific topic. In their results, they showed the network for "nuclear power" before and after Fukushima. They concluded that the network changed after the disaster, with the biggest difference being the growth of the public as an important actor. And where before the disaster actions with a positive tone, like "support", were most prominent, actions found after the disaster were much more negative.
Church et al. [4] took a different approach to the use of SVO triples. They showed in their research that statistics can provide added value that may, in the future, help parsers reduce mistakes. Applying Mutual Information (MI) to rank verbs for a specific noun can show which associations make more sense than others. For example, the noun boat was found to be more associated with the verb cruise (MI = 8.17) than with get (MI = 0.57). If only the co-occurrence of the words and verbs had been used, they would have been ranked equally high, because both pairs occurred together three times. MI, on the other hand, compares the probability of finding boat and cruise/get together with the probability of finding them separately. If they occur separately much more often, the MI will be closer to zero, meaning that there is no true association between the noun and the verb.
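The MI ranking can be reproduced in a few lines. The counts below are invented for illustration and do not come from Church et al.'s corpus; the point is only that equal co-occurrence counts yield very different MI scores when one word is far more frequent overall:

```python
import math

def mutual_information(pair_count, x_count, y_count, total):
    """Pointwise mutual information: log2 of how much more often x and y
    co-occur than chance (independence) would predict."""
    p_xy = pair_count / total
    p_x = x_count / total
    p_y = y_count / total
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts: 'boat' co-occurs 3 times with both 'cruise' and
# 'get', but 'get' is vastly more frequent on its own.
mi_cruise = mutual_information(3, 50, 30, 1_000_000)
mi_get = mutual_information(3, 50, 30_000, 1_000_000)
```

With these toy counts the rare verb scores much higher than the common one, mirroring the cruise-versus-get contrast in the text.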
Euphemisms.
Tayler and Ogden [17] conducted two studies on the use of euphemisms for 'heart failure'. The first part of their study was aimed at doctors and how likely it would be for them to explain the diagnosis of 'heart failure' to a patient with that specific term or some euphemism. The results showed that doctors were significantly more likely to use some of the given euphemisms over 'heart failure'. The preferred euphemism was 'fluid on your lungs as your heart is not pumping hard enough'. There was no significant difference for all euphemisms, and some euphemisms were significantly less likely to be used. Doctors were also significantly more likely to put 'heart failure' in the computer than to tell the patient. The second part of their study focused on the relative impact of the preferred euphemism over 'heart failure' on patients. The results showed that patients were significantly more anxious, and felt that the condition would impact their life more, if the term 'heart failure' was used.
Another interesting study, conducted by Bowers and Pleydell-Pearce [2], measured electrodermal activity (EDA) in participants when reading aloud two swear words, two neutral words and their euphemisms. Since EDA measures can vary largely depending on the test person, the measures were normalised. A one-way ANOVA showed that the test persons found it more stressful to say the swear words than the euphemisms. There was no significant difference between the neutral words and their euphemisms. In their discussion, the authors argue that euphemisms can be effective for talking about sensitive topics that would otherwise be avoided due to the emotional response that the offending word would evoke. In contrast, Johns and DellaSalla [11] argue in their essay, directed at conservation biologists, that euphemisms should not be used because they are misleading and 'sugar-coat' reality.
Blame and inanimate objects.
Ralph et al. [14] conducted a content analysis of 200 news articles in U.S. media reporting accidents involving pedestrians and bicyclists. According to the authors, the aim of the research was "to systematically describe patterns of traffic crash reporting rather than to draw causal inferences". Their results did, however, show interesting patterns in media coverage concerning blame. If a car or driver was mentioned in the article, 81% of articles referred to the vehicle instead of the driver. When looking at sentence types, they found that reporters preferred to leave out the driver or vehicle completely and simply state that "a VRU was hit". These two findings are seen by the authors as ways of shifting blame away from the driver. In the same research, they also found that 'accident' was the most used term to describe a crash. The authors argue that this term, because of its neutral tone, conceals the fact that most accidents are preventable.
4 METHODS & RESULTS
In this section, a description of the data sets used will be given (section 4.1), followed by an explanation of the methods used for data pre-processing (section 4.1.1). Afterwards, the methods and results of the three sub-questions will be addressed separately.
Sections 4.2, 4.3 and 4.4 will begin with a short introduction to demonstrate the relevance of the sub-questions. Subsequently, the process of finding the best results will be clarified by explaining the methods used, along with their utility and possible shortcomings. Lastly, the results will be discussed, starting with the best. An evaluation of the results per sub-question will be given in section 5.
4.1 Description of the data
Three data sets were used in this thesis. Two of them contain accident reporting articles, and the third is a general news corpus that is only used in RQ1 for comparison.
Het Ongeluk.
For the first data set, 396 accident reporting articles from the hetongeluk.nl website were received in JSON format and turned into a Pandas DataFrame. M. te Brömmelstroet provided annotations for a part of these articles in Excel and SPSS file formats. Those files were also loaded into DataFrames and merged together. The articles with partly or fully missing annotations were not included in the final DataFrame. The final DataFrame contained 260 rows and 55 columns. A large part of the columns consisted of labels corresponding to specific analyses, which were used for evaluating the automation of those analyses.
Flitsservice.
The second data set contained accident reporting articles from the website flitsservice.nl. The data was provided as a .csv file by B. Hendriks, who scraped the articles from the website with a self-written Python function. The .csv file contained the title, full text and date of 7784 articles. The file was loaded into a DataFrame, after which some symbols (tab, newline and return) and leading white space were removed.
VU-DNC Corpus.
The VU University Diachronic News text corpus [1] is available as a .zip file that contains general news articles from five Dutch newspapers, published in 2002. The .zip file was loaded in Python, and the text between 'HEADLINE' and 'TAIL' was selected for each article. The remaining text was split into titles and articles by searching for the first newline character. Some symbols (tab, newline and return) and leading and trailing white space were removed from the articles. The final DataFrame contained 2552 articles.
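The extraction steps just described can be sketched as follows. The marker strings follow the description above, while the sample article and the function name are invented for illustration:

```python
import re

def extract_article(raw):
    """Select the text between the HEADLINE and TAIL markers, split off
    the title at the first newline, and strip stray whitespace symbols."""
    match = re.search(r"HEADLINE(.*?)TAIL", raw, flags=re.DOTALL)
    if match is None:
        return None
    # first newline separates the title from the article body
    title, _, article = match.group(1).partition("\n")
    clean = re.sub(r"[\t\r\n]+", " ", article).strip()
    return title.strip(), clean

raw = "HEADLINE Botsing op de A6\nEen auto botste\tgisteren.\nTAIL"
title, text = extract_article(raw)
```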
4.1.1 Pre-processing of the DataFrames.
To prepare the DataFrames for further investigation, some NLP tools were used for pre-processing. The specific modules used during this step are briefly discussed below.
NLTK.
Python's Natural Language Toolkit (NLTK) library offers functions and collections that can assist programmers with NLP. Its tokenization package contains the submodule sent_tokenize, which was used in this thesis to split the articles into sentences.
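NLTK's sent_tokenize is the tool actually used in the thesis; as a dependency-free illustration of the idea, a naive regex stand-in that splits on sentence-final punctuation (real tokenizers handle abbreviations and other edge cases this sketch ignores):

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sents = split_sentences(
    "De fietser raakte gewond. Hij is naar het ziekenhuis gebracht."
)
```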
Pattern.
The web mining module Pattern is written for Python 2.5+ and does not support Python 3 yet. Therefore, the code in this thesis is written in Python 2.7. Pattern was used for parsing the articles and titles because it offers a module for the Dutch language. More specifically, the function parsetree() was used, which takes a string as input and returns a Text object for each row in a DataFrame. The Text object can be broken down into Sentence objects, which in turn break down into Word objects. For each word, the part-of-speech tag, assigned chunk, relation in the chunk (subject, object, ...) and lemma were stored.
Pickle.
Pickling is used to save Python objects as a binary file. These objects can later be loaded in a different Jupyter Notebook without losing any valuable information. This format was chosen over the csv format for storing the DataFrames, since the parsetree functionality would otherwise be lost.
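A minimal sketch of the pickle round trip, with a hypothetical parsed-article object standing in for the actual DataFrames: unlike a csv export, the restored object keeps its full nested structure.

```python
import os
import pickle
import tempfile

# hypothetical parsed-article object (nested structure a csv could not hold)
parsed = {"title": "Botsing op de A6", "tokens": [("auto", "NN"), ("botste", "VB")]}

path = os.path.join(tempfile.mkdtemp(), "parsed.pkl")
with open(path, "wb") as f:
    pickle.dump(parsed, f)        # serialize to a binary file
with open(path, "rb") as f:
    restored = pickle.load(f)     # load it back, e.g. in another notebook
```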
4.2 RQ1 Common words
This section discusses the methods and results concerning the following sub-question:
"What are the most common and representative verbs, adjectives and nouns in news articles?"
4.2.1 Methods.
The language or vocabulary used in news articles can help us understand more about the way in which accidents are reported by Dutch media. Furthermore, it can help us distinguish this type of article from general news. The language of accident reporting contains common Dutch words, often referred to as stop-words, but also very specific words like 'ongeval' (accident) and 'aanrijding' (collision). In between these two extremes are words that are shared amongst multiple news reporting registers and are only meaningful for accident reporting when context is taken into account. Examples of these words include '19-jarige' (19-year-old) and 'auto' (car). The goal was thus to find those words that are common but also sufficiently specific for accident reporting.
Instead of merely listing those words, the decision was made to transform them into word clouds. Word clouds have become very popular in a wide range of research fields because of their ability to visualize text in a straightforward and appealing way. They are often applied in exploratory data analysis because the conclusions drawn from them can guide further NLP steps. The weights assigned to the words decide which words are shown and their magnitude in the cloud.
To find the word cloud that best answers the research question, three different methods for computing the word weights were tested. Secondly, to maximize the value of the word clouds, they were created per part-of-speech (POS) tag. By creating a cloud per POS-tag instead of a single word cloud for all words, more representative words can be found. This is because some POS-tags appear more often in a language and would therefore dominate the word cloud, while not being of great interest to this research. Pronouns, for example, provide no information about a given subject but may appear very frequently in a text. Verbs, nouns and adjectives were chosen for the word clouds because they can be specific to accident reporting and contain a lot of information.
Term-frequency Word Clouds.
Term frequency (TF) word clouds are the most common and simplest application of word clouds. In these clouds, the weight is calculated by counting how often a word occurs in a corpus, also known as the TF. They are often a good first step in finding out how a corpus compares to others, because they show the most common terms. Additionally, in a corpus with a distinct theme, some of the most common words may also be specific. Therefore, word clouds were generated for both the accident reporting corpus and a general Dutch news corpus. In each POS-tag word cloud, the 40 words with the highest TF are shown. In addition, word clouds with bigrams were also created for nouns and adjectives. To make sure that the bigrams were still related to the proper POS-tag, either the first or the second word needed to have the POS-tag of the concerned word cloud. The decision was made not to add bigrams to the verb clouds, because the preceding or following word would not help determine whether a verb is representative or not.
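Assuming the articles are already tokenized and POS-tagged (as Pattern provides), the TF counting and the constrained bigram extraction can be sketched as follows. The tag names and the toy input are illustrative, not the thesis's actual tag set:

```python
from collections import Counter

def tf_top(tagged, pos, n=40):
    """Top-n term frequencies for one POS tag, from (word, tag) pairs."""
    return Counter(w for w, t in tagged if t == pos).most_common(n)

def bigram_top(tagged, pos, n=40):
    """Top-n bigrams in which the first or second word carries the POS tag,
    mirroring the constraint described in the text."""
    pairs = zip(tagged, tagged[1:])
    grams = Counter(
        w1 + " " + w2 for (w1, t1), (w2, t2) in pairs if t1 == pos or t2 == pos
    )
    return grams.most_common(n)

tagged = [("hoge", "JJ"), ("snelheid", "NN"), ("auto", "NN"),
          ("hoge", "JJ"), ("snelheid", "NN")]
```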
Why TF does not suffice.
The problem with TF is that it only takes absolute occurrence into account, leading to a lot of unwanted, or meaningless, words in the word clouds. This is because stop-words are used very frequently and thus belong to the most common words. Moreover, some words are found in both accident reporting and general news. One solution would be stop-word removal, but this would still leave non-representative words in the clouds, and TF would still determine the weight. For that reason, Venn clouds and the Parsimonious LM were chosen, because they solve the problem of stop-words while also improving the weight calculation method.
Venn clouds.
Venn clouds solve the problem of stop-words by making a word cloud for each corpus and showing their intersection. In this intersection, stop-words and some non-representative words are found. This way, the words on both the left and the right side are more representative of the corpus they belong to. The size of the intersection represents the overlap between the corpora. Since Venn clouds still rely on TF, although being an improvement over TF word clouds, this method is not expected to extract the most representative words.
Parsimonious Language Model.
The method that is expected to find the most representative words is creating word clouds from a parsimonious language model. Language models assign a probability to words as weight, representing the chance of a word occurring in a given text. Language models tend to overfit on the given data, because they will select the words that are either very specific to a certain document or common words. The Parsimonious language model solves this by selecting the words that distinguish a certain document or corpus from another.
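The estimation behind this can be sketched with the EM procedure from the parsimonious-model literature (cf. [10]): terms that the background corpus already explains drift toward probability zero, so the document model keeps only distinguishing terms. The toy corpora below are invented, and w denotes the document-model weight:

```python
from collections import Counter

def parsimonious_lm(doc_tokens, corpus_tokens, w=0.01, iters=50):
    """EM estimation of a parsimonious document model against a
    background corpus model (a sketch, not the thesis's exact code)."""
    tf = Counter(doc_tokens)
    corpus_tf = Counter(corpus_tokens)
    corpus_total = sum(corpus_tf.values())
    p_corpus = {t: corpus_tf[t] / corpus_total for t in tf}
    doc_total = sum(tf.values())
    p_doc = {t: tf[t] / doc_total for t in tf}  # maximum-likelihood start
    for _ in range(iters):
        # E-step: expected counts attributed to the document model
        e = {t: tf[t] * (w * p_doc[t])
                / (w * p_doc[t] + (1 - w) * p_corpus.get(t, 0.0))
             for t in tf}
        # M-step: renormalise into a probability distribution
        total = sum(e.values())
        p_doc = {t: e[t] / total for t in tf}
    return p_doc

doc = "auto botsing botsing de de de".split()
background = ("de het een " * 100).split()
p = parsimonious_lm(doc, background)
```

Even though 'de' is the most frequent word in the toy document, the background model absorbs it and its parsimonious probability collapses to (near) zero.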
The Parsimonious LM has two parameters: the number of words to return (n) and the weight of the document model (w). The model works by creating two different kinds of language models. A base model is created from one of the corpora first, on top of which another language model is created. For the latter, words that are already described by the base model receive a probability of zero. A model with high parsimony thus needs fewer parameters (non-zero words) to describe the second corpus. This means that few to no stop-words are returned as the top n words for the second corpus, since those are normally also found in the first corpus.
4.2.2 Results.
Parsimonious LM.
The best word clouds for the given research question were produced by the Parsimonious LM for each POS-tag of interest. The best results were produced with w=0.01 and n=40. When visually inspecting the word clouds from the Parsimonious LM (Fig. 1), the conclusion can be drawn that stop-words were removed and the remaining words are representative of accident reporting.
Fig. 1. Parsimonious LM word clouds. From left: verbs, nouns and adjectives
TF Word Clouds.
Even though the TF word clouds did not show very representative words, they did provide some interesting information that could not be extracted from the Parsimonious LM word clouds due to the weighting scheme.
POS-tag verbs.
Figure 2 shows the accident reporting TF word clouds for conjugated and lemmatized verbs. The word clouds made with conjugated verbs had a larger variety of unique word forms, and therefore repeated certain verbs in different conjugations. Although this may seem unwanted, the conjugations actually contained interesting information. Even though the accident reporting word cloud showed both present and past tense, some verbs were clearly used more often in the past tense. This finding suggests that a sentence like "the victim was hit by the perpetrator" occurs more often than "the perpetrator hit the victim". The emphasis is thus more often put on the victim. Word clouds with lemmata were also created to extract more unique verbs. There were some verbs clearly related to accidents, like 'overlijden' (to pass away) and 'raken' (to hit), but common verbs still dominated. The word clouds for the general news consisted almost solely of commonly used verbs like 'worden' (to become) and 'kunnen' (to be able to). This was expected, as there was no single theme shared amongst the articles.
Fig. 2. TF word clouds for accident reporting. From left: verbs and verb lemmata
POS-tag nouns.
There was a clear difference between the word clouds from the general news corpus and the accident reporting corpus, which indicates that some nouns are typically used in describing accidents. Besides differences between the corpora, another conclusion can be drawn solely from the accident reporting word cloud (Fig. 3). The word 'auto' (car) occurred more often than 'automobilist' (driver), implying that reporters prefer to reference a vehicle instead of the person driving it.
Fig. 3. TF word cloud for accident reporting nouns
POS-tag adjectives.
Figure 4 shows the most common adjectives in accident reporting. Two of the words found were 'vermoedelijk' (presumably) and 'onbekend' (unknown), which may point to the fact that reporters often have to make assumptions about the situation. Other common adjectives seemed to point to the degree of injuries ('dodelijk'; deadly) and the type of accident ('eenzijdig'; single-vehicle).
Fig. 4. TF word cloud for accident reporting adjectives
Bigrams.
The bigrams provided extra information for some of the words that were not considered representative in the unigram clouds. For example, the adjective 'hoge' (high) was not considered representative. 'Hoge snelheid' (high speed), on the other hand, as found in Figure 5, was considered much more related to accident reporting. Similarly, the noun 'leven' (life) was already somewhat representative, but the bigram 'leven gekomen' (passed away) contained more information.
Fig. 5. TF bigram word cloud for accident reporting adjectives
Venn clouds.
For the Venn clouds, 60 words were used instead of 40, because the words in common are moved to the center, leaving fewer possible representative words for accident reporting. It is possible to conclude that some common words were removed from the accident reporting side and instead placed in the center. As a result, more representative words were to be found outside of the intersection. Figure 6 shows the Venn cloud for verbs. The Venn clouds proved to be an improvement over the TF word clouds by removing some of the common and non-representative terms, but still largely relied on frequency counts. That being the case, they were not chosen as the best result.
Fig. 6. Venn cloud for verbs
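The Venn cloud idea above can be sketched with plain term frequencies: take the top-N terms of each corpus, move the shared ones to the center, and keep the rest on each corpus's side. This is an illustrative sketch, not the Venn Clouds module used in the thesis; the function name and whitespace tokenization are assumptions.

```python
from collections import Counter

def venn_terms(corpus_a, corpus_b, top_n=60):
    """Split the top-N terms of two corpora into side-specific and shared sets,
    mimicking how a Venn cloud moves common words to the center.
    Tokenization here is naive whitespace splitting, for illustration only."""
    freq_a = Counter(w for doc in corpus_a for w in doc.lower().split())
    freq_b = Counter(w for doc in corpus_b for w in doc.lower().split())
    top_a = {w for w, _ in freq_a.most_common(top_n)}
    top_b = {w for w, _ in freq_b.most_common(top_n)}
    center = top_a & top_b   # shared terms -> center of the Venn cloud
    only_a = top_a - center  # candidate representative terms for corpus A
    only_b = top_b - center  # candidate representative terms for corpus B
    return only_a, center, only_b
```

Because the center absorbs shared terms, a larger top-N (60 instead of 40) is needed to leave enough candidate terms on the accident reporting side.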
4.3 RQ2 SVO triples
The methods and results of the following sub-question will be discussed in this section:
"What are the most common subject-verb-object triples?"
4.3.1 Methods.
In most languages, including Dutch, the basic structure of a sentence is SVO, or subject-verb-object. Extracting these triples can be useful to summarize sentences and detect recurrent events or similarities between articles. For example, the sentences "de motorrijder is aangereden door een dronken automobilist" (the motorcyclist was hit by a drunk driver) and "de 19-jarige motorrijder is gisteren aangereden op de A6 door een automobilist" (the 19-year-old motorcyclist was hit on the A6 yesterday by a driver) differ in length and in how they describe the situation, but share the same SVO triple, namely "motorrijder-aangereden-automobilist". Whether these two sentences refer to the same accident is unclear, but when triples are extracted from a large body of text it is possible to detect which types of accidents occur more often than others.
To find the most common SVO triples, the two accident reporting DataFrames (Flitsservice and Het Ongeluk) were merged. Afterwards, the underlying relations were extracted from the articles parsed with Pattern’s parsetree. A sentence could have multiple SVO relations, and some may have been incomplete because the subject was named in a previous sentence. To find the correct triples, a list comprehension was created that looped through the key-value pairs of the subject, verb and object relations found in a sentence. The values were Pattern Word() objects and the keys referred to the specific relationship they belonged to. If a subject, verb and object with the same key were found, the lowercase strings of the Word() objects were returned as a 3-tuple. The occurrences of the 3-tuples were counted and returned as a dictionary, with the 3-tuples as keys and their occurrence counts in the entire corpus as values.
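The matching step described above can be sketched as follows. Plain strings stand in for Pattern's Word() objects, and the dict shape `{'SBJ': {index: word}, ...}` is an assumption about the parser output; Pattern itself is not invoked here.

```python
from collections import Counter

def extract_svo(relations):
    """Return (subject, verb, object) 3-tuples, lowercased, for every relation
    index that has all three roles present.  `relations` mimics the shape of a
    parsed sentence's relation dict, e.g. {'SBJ': {1: 'Motorrijder'}, ...}."""
    return [
        (relations['SBJ'][i].lower(),
         relations['VP'][i].lower(),
         relations['OBJ'][i].lower())
        for i in relations.get('SBJ', {})
        if i in relations.get('VP', {}) and i in relations.get('OBJ', {})
    ]

def count_triples(parsed_sentences):
    """Count SVO triples over a corpus; returns a {triple: count} dict."""
    counts = Counter()
    for relations in parsed_sentences:
        counts.update(extract_svo(relations))
    return dict(counts)
```

Incomplete relations (e.g. a missing subject) simply produce no triple for that index, matching the behaviour described for sentences whose subject appeared earlier.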
4.3.2 Results.
The 20 most common triples were extracted from the accident reporting titles and articles. The results are shown in Table 1.
Table 1. 20 most common SVO triples found in accident reporting. From left: for titles and for articles.
Triple titles Count titles Triple articles Count articles
motorrijder-overleden na-ongeval 22 de politie-doet-onderzoek 319
fietser-overleden na-aanrijding 19 dat-meldt-de politie 255
vrouw-overleden na-aanrijding 17 de politie-onderzoekt-de toedracht 197
fietser-overleden na-ongeval 14 dat-heeft-de politie 183
vrouw-overlijdt na-ongeval 10 de politie-stelt-een onderzoek 144
motorrijder-overleden na-ongeluk 10 de politie-heeft-een onderzoek 124
voetganger-overleden na-aanrijding 10 het ongeval-vond-plaats 94
motorrijder-overleden na-aanrijding 9 de politie-is-een onderzoek 65
vrouw-overleden na-ongeval 8 het verkeer-werd-omgeleid 60
vrouw-overleden na-verkeersongeval 8 de politie-onderzoekt-de oorzaak 56
fietsster-overleden na-aanrijding 7 de man-reed-met 56
automobilist-overleden bij-ongeval 7 dat-heeft-de politie gemeld 55
fietsster-overleden na-ongeval 7 het slachtoffer-is-een xx-jarige man 50
bromfietser-overleden na-aanrijding 7 dat-meldde-de politie 47
fietser-overleden na-ongeluk 6 dat-maakte-de politie 45
fietser-overlijdt na-aanrijding 6 dat-bevestigt-de politie 39
automobilist-overleden na-ongeval 6 de politie-doet-verder onderzoek 38
motorrijder-overlijdt na-ongeval 5 de politie-onderzoekt-de zaak 38
automobilist-overleden na-ongeluk 5 de aanrijding-vond-plaats 37
slachtoffer dodelijk ongeluk-is-man 5 het ongeluk-vond-plaats 37
For the titles, 19 out of 20 triples turned out to have the same structure. All objects referred to an accident, using one of three descriptive terms (’ongeval’, ’ongeluk’ and ’aanrijding’). Furthermore, they all contained a conjugation of the verb ’overlijden’ (to die). A possible explanation could be that when an accident leads to someone’s death, this will be mentioned in the title, while an injury may only be mentioned in the article.
Assuming the triples and occurrences were correct, the results could be further analyzed by calculating conditional probabilities. For this, the 19 similar triples were used and all subjects were divided into three groups, namely:
• Vulnerable road users (VRU)
• Motorized road users (Vehicle)
• Man, woman, ... (Undefined)

For each term and each subject, P(Subject|Term) and P(Term|Subject) were calculated. The formula for computing a conditional probability is P(A|B) = P(A ∩ B) / P(B), which should be interpreted as the probability of A, given B. The results are shown as heatmaps in Figure 7.
Fig. 7. Heatmaps showing conditional probabilities. From left: P(Term|Subject), P(Subject|Term)
To give an example of how to interpret the heatmaps: P(aanrijding|VRU), the probability of finding the term ’aanrijding’ in a triple with a VRU as subject, was higher than the probability of finding the term ’ongeluk’.
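Both conditional probabilities can be derived directly from the triple counts. The sketch below assumes counts keyed by (subject group, term) pairs; the example counts in the usage note are illustrative, not the thesis's data.

```python
def conditional_probs(counts):
    """Given {(subject_group, term): count}, return two dicts:
    P(term | subject) = count / total(subject), and
    P(subject | term) = count / total(term)."""
    subj_totals, term_totals = {}, {}
    for (subj, term), n in counts.items():
        subj_totals[subj] = subj_totals.get(subj, 0) + n
        term_totals[term] = term_totals.get(term, 0) + n
    p_term_given_subj = {(s, t): n / subj_totals[s] for (s, t), n in counts.items()}
    p_subj_given_term = {(s, t): n / term_totals[t] for (s, t), n in counts.items()}
    return p_term_given_subj, p_subj_given_term
```

For instance, with counts {('VRU', 'aanrijding'): 3, ('VRU', 'ongeluk'): 1}, P(aanrijding|VRU) would be 3/4, matching the P(A|B) = P(A ∩ B) / P(B) definition.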
Since almost all triples contained a conjugation of the verb ’overlijden’, conditional probabilities were also calculated for P(Subject|overlijden) and P(Object|overlijden). The results are shown as a mirror bar chart in Figure 8. From the chart the conclusion can be drawn that a VRU had the highest probability of being the subject and ’ongeval’ had the highest probability of being the object, given the verb ’overlijden’.
Fig. 8. Mirror bar chart of P(Subject|’overlijden’) and P(Object|’overlijden’)
The most common triples from the articles were very different, mostly because there were many more sentences and a wider variety of words used for describing situations. ’De politie’ (the police) appeared most often as the subject, with ’onderzoek’ (investigation) as the object. This result may suggest that for many of the accidents reported in the news, the police have to get involved to find out what exactly happened.
4.4 RQ3 Automation
The methods and results of the following sub-question will be discussed in this section:
"What manual processes can be automated and applied to larger data sets?"

An attempt was made at automating 5 of the 11 analyses that were manually performed by M. te Brömmelstroet on the accident reporting articles from Het Ongeluk. The methods used will be discussed in 4.4.1. The results from three of the automated analyses will be discussed in 4.4.2. The Jupyter Notebook containing all analyses and evaluations is provided in Appendix B.1.
4.4.1 Methods.
By automating the processes, the larger Flitsservice data set could also be analyzed. Different approaches were used depending on the analysis in question.
Machine Learning.
A common approach to automatically analyzing text is the use of Machine Learning (ML) algorithms. ML can be used for predicting either continuous numbers or classes. This is, however, only possible when (enough) training and test data is available. In the Het Ongeluk DataFrame, labels related to some of the analyses were provided with a value of either 0 or 1 for each article. Since there are only two options and the numbers have no continuous meaning, this is considered a binary classification problem. When multiple labels are taken into account at once, it becomes a multi-label classification. Although many classification algorithms exist, some have proven to deal better with text data than others. Therefore, Multinomial Naive Bayes (MNB) and Support Vector Machines (SVM) were chosen. Grid search was applied for hyperparameter tuning in order to find the optimal model and obtain the best results. The available annotated data set was, however, very small and could therefore yield poor results. Consequently, a second solution for automating the analyses was also tried, one that does not rely on learning algorithms.
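A setup of this kind can be sketched with scikit-learn. The thesis's actual feature extraction and parameter grids are not given, so the pipeline below (bag-of-words plus MNB only; SVM would slot in the same way) and its small grid are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def tune_binary_classifier(texts, labels):
    """Grid-search a bag-of-words Multinomial Naive Bayes classifier for a
    binary (0/1) label.  The grid shown here is a small illustrative one,
    not the grid used in the thesis."""
    pipeline = Pipeline([
        ('vect', CountVectorizer()),   # article text -> term count features
        ('clf', MultinomialNB()),
    ])
    grid = {
        'vect__ngram_range': [(1, 1), (1, 2)],  # unigrams vs uni+bigrams
        'clf__alpha': [0.1, 1.0],               # smoothing strength
    }
    search = GridSearchCV(pipeline, grid, cv=2)
    search.fit(texts, labels)
    return search  # search.best_estimator_ holds the tuned model
```

With only a few hundred annotated articles, the cross-validation folds become very small, which is exactly the limitation that motivated the regexp-based alternative below.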
Regular Expressions.
Regular expressions (regexps) are a way of looking for specific patterns in a text. Inside a regular expression a pattern can be defined that has to match certain words in a given order. Regexps provide a lot of options for writing a sequence, from wildcards that match any single character, to specifying the exact number of occurrences of a character or group. The benefit of regexps is that they do not require training and can thus be used even when available data is limited. The downside is that the patterns have to be constructed by hand, meaning that all options and exceptions have to be taken into account and the regexps can become very long and complex. The vocabulary in the data was, however, expected to be small enough to construct regexps that would provide reasonable results.
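A regexp-based matcher of this kind can be sketched with Python's re module. The exact patterns used in the thesis are not given, so the ones below are illustrative approximations: they match a minimal fixed core per expression, allow a few words in between, and cover some verb conjugations.

```python
import re

# Illustrative approximations of per-expression patterns (not the thesis's
# exact regexps): a minimal fixed core, optional intervening words, and
# alternatives such as 'controle' for 'macht'.
VAGUE_PATTERNS = {
    'er met de schrik vanaf komen': re.compile(r'met de schrik'),
    'geschept worden': re.compile(r'geschept'),
    'iemand over het hoofd zien': re.compile(r'over het hoofd\s+(ge)?z(ien|iet|ag)'),
    'de macht over het stuur verliezen':
        re.compile(r'(macht|controle)(\W+\w+){0,3}\W+over het stuur'),
    'onbekende verwondingen': re.compile(r'onbekende verwondingen'),
}

def vague_expressions(article):
    """Return the names of the vague expressions matched in an article."""
    text = article.lower()
    return [name for name, pattern in VAGUE_PATTERNS.items()
            if pattern.search(text)]
```

Note how the bounded repetition `(\W+\w+){0,3}` lets "de macht volledig over het stuur" still match, which is the flexibility the hand-built patterns needed.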
4.4.2 Results.
Vague language and euphemisms.
Euphemisms and the use of vague language in general are ways of softening an otherwise harsh or unpleasant message. In accident reporting this is sometimes applied to soften the mention of someone getting injured, or to decrease the blame put on a perpetrator. The most common examples found in accident reporting are "iemand over het hoofd zien" (failing to see someone), "de macht over het stuur verliezen" (losing control over the steering wheel), "er met de schrik vanaf komen" (dodging a bullet), "geschept worden" (getting hit) and "onbekende verwondingen" (unknown injuries). The common theme here is that they all leave readers in the dark about what exactly happened. From here on, the term ’vague expressions’ will be used to refer to all of the above expressions.
For all expressions except "iemand over het hoofd zien" and "de macht over het stuur verliezen", regexps were utilized to search for the minimum number of words (that also always occurred together) needed to match the expression. For example, for the expression "er met de schrik vanaf komen", the regexp only had to search for "met de schrik". This method was used since the expressions did not always have the same syntax: there could have been words in between parts of the expressions, certain words could have been replaced, and verbs could have been conjugated. For the other two expressions, patterns with a similar meaning were also included. To illustrate, "controle" was considered to be an alternative for "macht" in "de macht over het stuur verliezen". The results from the analysis are shown in Table 2.
Table 2. Use of vague expressions in accident reporting articles (%)
 Het Ongeluk Flitsservice
’er met de schrik vanaf komen’ 4.62 1.89
’geschept worden’ 4.23 6.12
’iemand over het hoofd zien’ 8.08 8.84
’de macht over het stuur verliezen’ 6.15 10.11
’onbekende verwondingen’ 5.00 0.86
In the data set from Het Ongeluk all euphemisms appeared in roughly four to eight percent of the articles. In the Flitsservice data set, the occurrences were not as equally divided. All euphemisms did however appear, with ’onbekende verwondingen’ being found the least and ’de macht over het stuur verliezen’ occurring most frequently. The conclusion can be drawn that euphemisms were used in less than 10 percent of the news articles. ’De macht over het stuur verliezen’ appeared the most, although this could be due to the type of accidents in the data. If the data had contained fewer car accidents, this number would also have been lower. Drawing general conclusions from these results about the exact share of euphemisms in accident reporting is thus not advisable. The data does however provide enough information to conclude that euphemisms are to be found in accident reporting.

Person vs vehicle.
In accident reporting articles it is not uncommon to find something along the lines of "bicyclist hit by car". According to [18] this is due to the fact that we see a car rather than the person driving it. As a consequence, people tend to refer to the vehicle, and this in turn affects the way in which people perceive reality. This analysis aimed at answering the question "how often is a victim or perpetrator described as either a person or a vehicle?"
The best results were again produced by regexps and are shown in Figure 9.
Fig. 9. Percentage of accident reporting titles that match the regexps for each label
There were two important factors that the solution for this analysis had to be able to deal with: making a distinction between a person and a vehicle, and deciding whether a person or vehicle was considered a victim or an opposite party (perpetrator). This led to four labels for each title and article. A fifth one was also included that looked for words describing consequences for the rest of traffic.
• victim as a person
• victim as a vehicle
• opposite (perpetrator) as a person
• opposite (perpetrator) as a vehicle
• consequences for the rest of traffic

Victim as a person turned out to be much more common than victim as a vehicle in the titles for both data sets. For the opposite party, however, more titles referred to a vehicle instead of a driver. This suggests that blame is often, either consciously or unconsciously, shifted away from perpetrators.

"Ongeluk" vs "aanrijding".
According to [14], car crashes are often preventable and this message should be conveyed to the public by using the right term to describe a crash. Unfortunately, in the research conducted by Ralph et al. the neutral term "accident" was found to be most used. This term is considered to mask the preventable nature of crashes.
For this analysis, variations on the terms were also included in the regexps. Most of those included conjugations of a verb closely related to the term. For example, conjugations of the verb "botsen" (to collide) were also included for the term "botsing".
Fig. 10. Use of specific terms to describe a crash in accident reporting articles (%)
As seen in Figure 10, the term "ongeluk" was found most often in Het Ongeluk. In Flitsservice, "ongeval" was found to be the most occurring term by far. Both terms are translations of the English term accident, thus the conclusion can be drawn that accident was indeed the most used term.

By automating this analysis, the Flitsservice data set could also be analyzed over time to visualize whether term preference changed. The result is shown in Figure 11.
Fig. 11. Use of specific terms to describe a crash in Flitsservice over time (%)
Intervals were used because no data was available for some years. The biggest differences in percentage of use were found for ’ongeval’ and ’ongeluk’. While ’ongeval’ was found in all titles before the year 2000, it only occurred in around 60% of titles between 2006 and 2010. ’Ongeluk’, or something closely related to it like ’verongelukt’ (crashed), on the other hand, was not found in any titles before 2000, but an upward trend was seen afterwards.
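The per-interval percentages behind a chart like this can be computed from dated titles as follows. This is a sketch: the two term patterns shown (and the year-keyed input shape) are illustrative, not the full set of patterns used for Figure 11.

```python
import re

# Illustrative patterns for two of the crash terms; 'verongelukt' is folded
# into the 'ongeluk' pattern as a closely related variant.
TERM_PATTERNS = {
    'ongeval': re.compile(r'ongeval'),
    'ongeluk': re.compile(r'ongeluk|verongelukt'),
}

def term_share_per_interval(titles_by_year, intervals):
    """Percentage of titles matching each term pattern per (start, end)
    interval (inclusive).  `titles_by_year` is a {year: [titles]} dict."""
    shares = {}
    for start, end in intervals:
        titles = [t.lower() for year, ts in titles_by_year.items()
                  if start <= year <= end for t in ts]
        shares[(start, end)] = {
            term: 100 * sum(bool(p.search(t)) for t in titles) / len(titles)
            for term, p in TERM_PATTERNS.items()
        } if titles else {}
    return shares
```

Grouping by interval rather than by year sidesteps the years for which no data was available.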
5 EVALUATION
5.1 RQ1 Common words
There were two types of possible mistakes in the word clouds that may have influenced how well they described the language. The first mistake was related to Pattern’s parsetree module: some words had a wrong POS-tag assigned, meaning they were misclassified. The second mistake, specific to this research, was a word showing up in the word cloud that, although common, was not representative in any way. To evaluate how well the different word clouds performed, the terms were printed as a list (see Appendix A). The mistakes made by the Parsimonious LM are shown in Table 3.
Table 3. Mistakes made by Parsimonious LM
Word type Wrong POS-tag Not representative Total mistakes
Verbs 6/7 4 10/11
Nouns 6 0 6
Adjectives 16 3 19
Most mistakes related to POS-tags were made for adjectives. This could be due to the fact that the accident reporting vocabulary differed a lot from the corpus on which the Pattern module for Dutch was trained. Unseen words are harder to classify and the module may have mistaken words for adjectives based on their position in a sentence. Since wrong POS-tags occurred in all word clouds, the number of non-representative words was considered to be more important. Here, there were only 4 for verbs, none for nouns and 3 for adjectives. All the other words were at least somewhat related to accident reporting. The mistakes detected in the TF word clouds are shown in Table 4.
Table 4. Mistakes made by TF word clouds
Word type Wrong POS-tag Not representative Total mistakes
Verbs 6 22 28
Verbs lemma 9 23 32
Nouns 4 0 4
Adjectives 8 12 20
A lot of mistakes were made for representativeness, with the exception of the nouns word cloud. The nouns were, however, less representative than in the Parsimonious LM cloud. The addition of bigrams provided a few more representative adjectives, but the results were still not sufficient.

Unfortunately, the Venn Clouds module did not provide an easy way to access the words. But since Venn clouds are also largely based on TF, it can be expected that they produced better results than the TF word clouds, but performed worse than the Parsimonious LM clouds.
5.2 RQ2 SVO triples
The results relied heavily on Pattern’s capability of correctly recognizing subjects, verbs and objects. This means that there could have been other frequently occurring triples that were not detected. Consequently, the conditional probabilities are only to be considered correct under the assumption that the triples were as well.
5.3 RQ3 Automation
For the automation, 11 manually created analyses were provided. An attempt was made at automating 5 of them, although some changes were applied. Some of the other 6 were considered too difficult to solve with regexps because there were no very clear patterns to search for. If more data was available, Machine Learning algorithms may have offered a solution.
To evaluate how well the automations performed, annotations provided by M. te Brömmelstroet served as ground truth. Although they were in some way subjective and a few inconsistencies were found, there was no better alternative.
For each annotation or label, 0 translated to negative (N) and 1 to positive (P). The following test statistics were used:

• True Positives (TP): annotated 1 and matched by regexps
• True Negatives (TN): annotated 0 and not matched by regexps
• False Positives (FP): annotated 0 but matched by regexps
• False Negatives (FN): annotated 1 but not matched by regexps

With these statistics, the following metrics were calculated:
accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
Accuracy is defined as the percentage of correctly predicted observations. This metric can suffice for evaluation, but if the data set has unbalanced classes or if a more in-depth evaluation is wanted, precision and recall need to be included. Precision expresses the proportion of data points classified as positive (class 1) that really are positive. Recall, on the other hand, expresses how well a model performs at identifying positive data points. Because of the inconsistencies in the ground truth, some misclassifications were inevitable and thus perfect results were not expected.
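The metrics above can be computed directly from paired annotations and regexp match results, as a small sketch (the function name is illustrative):

```python
def evaluate(annotations, predictions):
    """Accuracy, precision and recall from binary ground-truth annotations
    (0/1) and regexp match results (0/1), following the TP/TN/FP/FN
    definitions above."""
    tp = sum(a == 1 and p == 1 for a, p in zip(annotations, predictions))
    tn = sum(a == 0 and p == 0 for a, p in zip(annotations, predictions))
    fp = sum(a == 0 and p == 1 for a, p in zip(annotations, predictions))
    fn = sum(a == 1 and p == 0 for a, p in zip(annotations, predictions))
    return {
        'accuracy': (tp + tn) / (tp + tn + fp + fn),
        'precision': tp / (tp + fp) if tp + fp else 0.0,  # guard: no positives predicted
        'recall': tp / (tp + fn) if tp + fn else 0.0,     # guard: no positives annotated
    }
```

The guards against empty denominators matter for labels that a regexp never matches, where precision would otherwise divide by zero.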
Vague expressions.
The original analysis focused more on accusations, among which two of the vague expressions also used in this research could be found. The decision was made to change the focus to vague expressions based on the news article published by T. Verkade and M. te Brömmelstroet [18]. In that article the five vague expressions and their effect on interpretation are discussed in detail.
Because of these adaptations it was not possible to evaluate the performance of all regexps, except manually. Annotations were, however, provided for "iemand over het hoofd zien" and "de macht over het stuur verliezen". The results for the two expressions are given in Table 5.
Table 5. Performance of regexps at identifying vague expressions
metric "iemand over het hoofd zien" "de macht over het stuur verliezen" overall
accuracy 0.98 1.00 0.94
precision 0.80 1.00 0.89
The regexps performed very well, both overall and per vague expression.
Person vs vehicle.
The regexps for this analysis were based on the titles of the articles. This means that a lot of exceptions could be captured, but the model was probably also overfit on this particular data. Multi-label classification was also performed as a solution (see Appendix B.2), but the test set only contained 52 articles. It is therefore difficult to say how good the classifiers actually were. On the other hand, ML algorithms are known to perform rather well on text data, so this method could be useful for further research if more training data becomes available.
The analysis was performed for both titles and articles. As expected, the model performed better on titles than on articles. This is mainly due to the fact that context played a bigger role in the articles. Take, for example, the following article:
“Twee voertuigen zijn vanmiddag met elkaar in botsing gekomen op de Leidse Schouw in Alphen aan den Rijn. Een bestuurder moest worden nagekeken door ambulancepersoneel, maar hoefde niet mee naar het ziekenhuis.” (Two vehicles collided this afternoon on the Leidse Schouw in Alphen aan den Rijn. One driver had to be checked by ambulance personnel, but did not have to go to the hospital.)
This was annotated 0 for ’victim as a person’ and 1 for ’victim as a vehicle’ (because of "twee voertuigen"). The regexps on the other hand labelled it 1 for ’victim as a person’ because a driver was mentioned in the second sentence ("een bestuurder"). This was considered incorrect because the fact that the driver was checked by ambulance personnel did not say anything about the accident itself. The results for titles and articles are shown in Table 6 and Table 7, respectively.
Table 6. Performance of regexps for person vs vehicle analysis on titles
metric vic_person vic_vehicle oppo_person oppo_vehicle traffic overall
accuracy 0.95 0.90 0.95 0.90 0.99 0.94
precision 0.97 0.71 0.71 0.80 0.94 0.85
recall 0.94 0.88 1.00 0.81 1.00 0.91

Table 7. Performance of regexps for person vs vehicle analysis on articles
metric vic_person vic_vehicle oppo_person oppo_vehicle traffic overall
accuracy 0.67 0.64 0.72 0.74 0.89 0.73
precision 0.68 0.60 0.53 0.74 0.85 0.68
recall 0.86 0.73 0.70 0.84 0.85 0.81
"Ongeluk" vs "aanrijding".
The performance of the regexps on this analysis is shown in Table 8.
Table 8. Performance of regexps for "ongeluk" vs "aanrijding" analysis
metric ongeluk aanrijding botsing ongeval overall
accuracy 0.94 0.93 0.93 0.96 0.94
precision 0.93 0.87 0.90 0.91 0.91
recall 0.94 0.88 0.93 0.98 0.94
Based on all metrics it is possible to conclude that the automation of this analysis went very well. When manually checking the misclassifications, it was discovered that at least some of them were actually wrongly annotated. This means that the performance of the regexps may have been even higher.
6 CONCLUSIONS
Throughout this research an attempt was made at extracting patterns from accident reporting articles published by Dutch media. Identifying these patterns not only provides useful insights into the Five W’s and How of traffic accidents. More importantly, it is a crucial step towards understanding the influence of media on the public awareness surrounding the dangers of traffic participation and the way in which accidents are perceived. In addition, the application of NLP techniques makes for faster and more scalable analyses, which in turn assist in drawing more general conclusions. With rising numbers of traffic accidents, the relevance of research on this topic becomes more prominent. While the government is taking action to identify the main risks of traffic, the role of the media is left untouched. Therefore, this research and possible future work is of great importance to start a discussion on what role media can or should play in this field.
By applying exploratory data analysis, the content of accident reporting articles was explored on multiple levels. First, word clouds were utilized to visually capture the essence of the language used in accident reporting. To achieve this, a search for the best weight computing method was carried out, resulting in the Parsimonious LM clouds containing the most representative words. Previous research already showed that the addition of parsimony could lead to more representative terms; hence this result was expected. The TF word clouds, while containing a lot of common and unrepresentative words, did provide useful insights that could not be derived from the other clouds. The results suggested that the emphasis is more often put on the victim, and that reporters more often refer to a vehicle instead of the driver. The latter was also found to be true in English accident reporting by Ralph et al. [14]. To further investigate language use, SVO triples were extracted from both titles and articles to detect recurring events. The shared structure amongst the title triples opened up the possibility for further analysis. The calculated conditional probabilities showed, among other things, that the chance of finding the term ’aanrijding’, given a VRU, was higher than that of finding the term ’ongeluk’.
The second part of this research focused on the automation of manually performed analyses. From the original 11, an attempt was made at automating 5. The others were considered too difficult given the lack of training data for Machine Learning, while also not being very suitable for regexps. The main advantage of automated processes is their ability to scale, so that results can more easily be generalized. Therefore, the automated analyses were also conducted for the larger Flitsservice data set. Overall, the metrics showed good results for the three analyses that were discussed in detail. For the vague expressions only two labels could be evaluated, since the original analysis was adapted for this research. The performance of the regexps on ’person vs vehicle’ was lower for articles than for titles. This was mainly because the regexps were based on the titles, while the articles contained more information and context played a greater role. Tuning the regexps to also capture everything in the articles is, if even possible, not a very good solution: it would only lead to more overfitting and a non-scalable model. Further research is needed to find a solution that can overcome this limitation. Machine Learning might provide good results if enough training data is available. In further research, an attempt could be made at automating the remainder of the analyses by applying different NLP techniques.
This research is for the most part considered to be exploratory, since the data sets were relatively small. Therefore, drawing general conclusions was mostly avoided and the results were mainly observations and indications. Interesting patterns were, however, detected, and these findings should be taken into account in further research. Furthermore, the automations provided at least reasonable results and could be applied to larger data sets.
6.1 Acknowledgements
I would like to express my great appreciation to my supervisor for providing me with extensive feedback and supporting me in bringing this thesis to a successful end. Secondly, I would like to offer my special thanks to M. te Brömmelstroet and T. Verkade for making this research possible and for their great effort in investigating the role of media in the perception of traffic accidents and the level of awareness surrounding the dangers. Lastly, I would like to thank B. Hendriks and S. Siepel for their assistance in the initial phases of this research project.
REFERENCES
[1] VU-DNC corpus [online service]. Available at the Dutch Language Institute: http://hdl.handle.net/10032/tm-a2-g4, 2018.
[2] Jeffrey S. Bowers and Christopher W. Pleydell-Pearce. Swearing, Euphemisms, and Linguistic Relativity. PLoS ONE, 6(7):e22341, July 2011.
[3] Centraal Bureau voor de Statistiek. 11 procent meer verkeersdoden in 2018. Centraal Bureau voor de Statistiek, 2019.
[4] Kenneth Church, William Gale, Patrick Hanks, and Donald Hindle. Using Statistics in Lexical Analysis. page 33.
[5] Julie Cidell. Content clouds as exploratory qualitative data analysis. Area, 42(4):514–523, December 2010.
[6] Glen Coppersmith and Erin Kelly. Dynamic Wordclouds and Vennclouds for Exploratory Data Analysis. In Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, pages 22–29, Baltimore, Maryland, USA, June 2014. Association for Computational Linguistics.
[7] Concetta A. DePaolo and Kelly Wilkinson. Get Your Head into the Clouds: Using Word Clouds for Analyzing Qualitative Assessment Data. TechTrends, 58(3):38–44, May 2014.
[8] Martin J. Halvey and Mark T. Keane. An assessment of tag presentation techniques. In Proceedings of the 16th International Conference on World Wide Web - WWW ’07, page 1313, Banff, Alberta, Canada, 2007. ACM Press.
[9] F. Heimerl, S. Lohmann, S. Lange, and T. Ertl. Word Cloud Explorer: Text Analytics Based on Word Clouds. In 2014 47th Hawaii International Conference on System Sciences, pages 1833–1842, January 2014.
[10] Djoerd Hiemstra, Stephen Robertson, and Hugo Zaragoza. Parsimonious language models for information retrieval. In Proceedings of the 27th Annual International Conference on Research and Development in Information Retrieval - SIGIR ’04, page 178, Sheffield, United Kingdom, 2004. ACM Press.
[11] David Johns and Dominick A. DellaSala. Caring, killing, euphemism and George Orwell: How language choice undercuts our mission. Biological Conservation, 211:174–176, July 2017.
[12] Rianne Kaptein, Djoerd Hiemstra, and Jaap Kamps. How Different Are Language Models and Word Clouds? In Cathal Gurrin, Yulan He, Gabriella Kazai, Udo Kruschwitz, Suzanne Little, Thomas Roelleke, Stefan Rüger, and Keith van Rijsbergen, editors, Advances in Information Retrieval, volume 5993, pages 556–568. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010.
[13] T. Lansdall-Welfare, S. Sudhahar, G. A. Veltri, and N. Cristianini. On the coverage of science in the media: A big data study on the impact of the Fukushima disaster. In 2014 IEEE International Conference on Big Data (Big Data), pages 60–66, October 2014.
[14] Kelcie Ralph, Evan Iacobucci, Calvin G. Thigpen, and Tara Goddard. Editorial Patterns in Bicyclist and Pedestrian Crash Reporting. Transportation Research Record: Journal of the Transportation Research Board, 2673(2):663–671, February 2019.
[15] Saatviga Sudhahar, Thomas Lansdall-Welfare, Ilias Flaounas, and Nello Cristianini. Quantitative Narrative Analysis of US Elections in International News Media. page 12.
[16] Saatviga Sudhahar, Thomas Lansdall-Welfare, Ilias Flaounas, and Nello Cristianini. ElectionWatch: Detecting Patterns in News Coverage of US Elections. In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 82–86, Avignon, France, April 2012. Association for Computational Linguistics.
[17] Michael Tayler and Jane Ogden. Doctors’ use of euphemisms and their impact on patients’ beliefs about health: An experimental study of heart failure. Patient Education and Counseling, 57(3):321–326, June 2005.
[18] Thalia Verkade and Marco te Brömmelstroet. ‘Busje ramt auto’, ‘file na ongeluk’. En de mensen dan? https://decorrespondent.nl/9272/busje-ramt-auto-file-na-ongeluk-en-de-mensen-dan/974923679400-3cc81f84, March 2019.
[19] Wetenschappelijk Onderzoek Verkeersveiligheid. https://theseus.swov.nl/single/?appid=73c9f2d7-2873-4e4a-8e6e-095840c66ee5&sheet=0ce1fd1f-761c-40ae-b54e-66823d116a34&opt=currsel,ctxmenu.
Parsimonious LM
POS-tag verbs POS-tag nouns POS-tag adjectives
Terms Weight Terms Weight Terms Weight
overleden 0.0804 ongeval 0.0221 xx-jarige 0.1619
reed 0.0697 aanrijding 0.0198 dodelijk 0.0412
overleed 0.0678 ongeluk 0.0182 inzittenden 0.0410
raakte 0.0629 plaatse 0.0168 ernstig 0.0389
gekomen 0.0562 automobilist 0.0165 eenzijdig 0.0384
overgebracht 0.0344 auto 0.0162 voertuig 0.0378
afgesloten 0.0329 verkeersongeval 0.0148 vermoedelijk 0.0328
gebeurde 0.0315 toedracht 0.0147 dodelijke 0.0233
gebracht 0.0269 slachtoffer 0.0145 frontale 0.0220
geraakt 0.0263 fietser 0.0136 tegemoetkomende 0.0205
aangereden 0.0175 motorrijder 0.0136 plekke 0.0204
verongelukt 0.0156 boom 0.0129 kritieke 0.0202
vervoerd 0.0150 kruising 0.0127 jarige 0.0186
onderzoekt 0.0142 ziekenhuis 0.0127 automobiliste 0.0168
getuigen 0.0136 gewond 0.0125 gewond 0.0144
meldt 0.0131 personenauto 0.0123 terecht 0.0129
sloot 0.0128 botsing 0.0122 raakte 0.0123
onbekende 0.0121 traumahelikopter 0.0116 precieze 0.0093
overlijdt 0.0116 zwaargewond 0.0102 weg 0.0092
bestuurd 0.0115 ambulance 0.0099 flauwe 0.0092
ingesteld 0.0108 richting 0.0093 onbekend 0.0089
belandde 0.0107 stuur 0.0090 file 0.0084
inzittende 0.0096 brandweer 0.0088 bevrijd 0.0083
oversteken 0.0083 vrachtwagen 0.0087 bejaarde 0.0076
geslingerd 0.0083 plekke 0.0079 tegenovergestelde 0.0076
gebotst 0.0080 oorzaak 0.0079 aanrijding 0.0073
kwam 0.0079 vrachtwagenchauffeur 0.0076 dode 0.0071
gereden 0.0078 voertuig 0.0075 noodlottig 0.0065
gereanimeerd 0.0078 berm 0.0074 aanspreekbaar 0.0065
geschept 0.0071 frontaal 0.0073 passerende 0.0064
stak 0.0067 bocht 0.0071 gelderse 0.0063
gewonden 0.0063 weghelft 0.0067 nader 0.0059
bekneld 0.0061 onwel 0.0065 verkeersongevallenanalyse 0.0054
plaatsgevonden 0.0058 vrachtauto 0.0064 naastgelegen 0.0052
terecht 0.0052 politie 0.0063 stilstaande 0.0050
verleende 0.0051 rijbaan 0.0062 gestart 0.0046
botsten 0.0050 letsel 0.0061 slachtofferhulp 0.0046
aangetroffen 0.0046 inwoner 0.0061 eenzijdige 0.0043
gehaald 0.0045 fiets 0.0061 inhaalmanoeuvre 0.0042
gewonde 0.0045 stilstand 0.0061 auto 0.0042
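The weights in the table above come from a parsimonious language model [10]: term counts are iteratively redistributed between a document model and a background corpus model, so that words frequent everywhere lose weight and only document-typical terms survive. A minimal EM sketch in Python, not the thesis implementation; the function name, the mixing weight lam = 0.1, and the iteration count are illustrative assumptions:

```python
def parsimonious_weights(doc_tf, corpus_tf, lam=0.1, iters=20):
    """EM estimation of a parsimonious language model (Hiemstra et al., 2004).

    doc_tf:    term frequencies of the document (or concatenated document set)
    corpus_tf: term frequencies of the background corpus
    lam:       weight of the document model against the background model
    """
    corpus_total = sum(corpus_tf.values())
    p_bg = {t: corpus_tf.get(t, 0) / corpus_total for t in doc_tf}

    # initialise the document model with maximum-likelihood estimates
    doc_total = sum(doc_tf.values())
    p_doc = {t: f / doc_total for t, f in doc_tf.items()}

    for _ in range(iters):
        # E-step: expected share of each term's count credited to the document model
        e = {}
        for t, f in doc_tf.items():
            num = lam * p_doc[t]
            denom = num + (1 - lam) * p_bg[t]
            e[t] = f * num / denom if denom > 0 else 0.0
        # M-step: renormalise the expected counts into a probability distribution
        norm = sum(e.values())
        p_doc = {t: v / norm for t, v in e.items()}
    return p_doc
```

On a toy input where 'b' dominates the background corpus, the model correctly concentrates its weight on the document-specific term 'a', which is the behaviour the verb/noun/adjective columns above reflect.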
TF Word Clouds
POS-tag verbs POS-tag verbs lemmata POS-tag nouns POS-tag adjectives
Terms Weight Terms Weight Terms Weight Terms Weight
aangereden 1513 aanrijden 1568 aanrijding 4681 automobiliste 260
afgesloten 1810 afsluiten 1814 ambulance 1140 bekend 1291
betrokken 872 besturen 881 auto 10645 direct 582
doet 758 betrekken 873 automobilist 3016 dodelijk 945
gebeurde 1890 brengen 1869 boom 1965 dodelijke 569
gebracht 1742 doen 1563 botsing 2081 duidelijk 499
gekomen 4241 gaan 1026 brandweer 1288 eenzijdig 927
geraakt 1497 gebeuren 2467 fiets 880 enige 553
gereden 819 geraken 1528 fietser 1697 ernstig 1795
getuigen 972 getuigen 972 gewond 2555 ernstige 370
geweest 930 halen 784 hoogte 1371 frontale 386
had 1199 hebben 7426 hulp 962 half 599
hebben 1534 komen 11384 jongen 915 hard 333
heeft 4392 kunnen 2764 kruising 1428 hoge 413
is 21549 laten 861 leven 4505 inzittenden 860
kon 1384 liggen 864 man 10756 jarige 325
kwam 5213 lopen 864 motorrijder 1614 kort 434
kwamen 811 melden 1562 onderzoek 3896 kritieke 354
meldt 994 moeten 1112 ongeluk 5396 lange 266
mocht 760 mogen 860 ongeval 8431 mogelijk 526
niet 1467 niet 1469 oorzaak 2439 nader 258
nog 1940 nog 1941 personenauto 1343 onbekend 488
onbekende 1853 onbekennen 1853 plaats 1496 onduidelijk 308
onderzoekt 799 onderzoeken 1260 plaatse 3046 plekke 411
overgebracht 1583 overbrengen 1585 politie 9497 precies 328
overleden 6197 overlijden 10145 richting 3173 precieze 269
overleed 3412 raken 3868 slachtoffer 4940 raakte 277
raakte 3774 reden 4727 stuur 992 snel 374
reed 4273 rijden 987 tijd 1154 technisch 281
sloot 914 schrijven 859 toedracht 2077 tegemoetkomende 324
te 4243 sloten 914 traumahelikopter 1305 uiteindelijk 267
vond 816 staan 848 uur 5873 vast 268
waren 1121 te 4243 verkeer 1317 verkeerde 379
was 5313 vinden 1308 verkeersongeval 1962 vermoedelijk 1350
werd 6773 wezen 938 vrachtwagen 910 voertuig 687
werden 812 willen 876 vrouw 5597 vrij 314
worden 1128 worden 9818 water 1338 waarschijnlijk 430
wordt 1100 zien 1603 weg 3203 weg 328
zat 1116 zijn 34355 ziekenhuis 4702 xx-jarige 12963
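The TF columns, by contrast, are plain per-POS term counts. A minimal sketch, assuming tokens arrive as pre-tagged (word, POS) pairs from a Dutch tagger; the function name and the toy input are illustrative, not the thesis code:

```python
from collections import Counter

def tf_per_pos(tagged_tokens, pos):
    """Count lowercased term frequencies for one POS tag over (word, pos) pairs."""
    return Counter(w.lower() for w, p in tagged_tokens if p == pos)

# hypothetical toy input
tokens = [("Auto", "NOUN"), ("reed", "VERB"), ("auto", "NOUN")]
```

Running `tf_per_pos(tokens, "NOUN")` on this toy input counts "auto" twice; the same per-tag counting over the full article collection yields the four columns above.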
B AUTOMATION
B.1 Regexps