Retrieving, Cleaning and Analysing Dutch news articles about traffic accidents


Barry Hendriks 11268883

Bachelor thesis
Credits: 12 EC

Bachelor's Programme in Information Science (Bachelor Opleiding Informatiekunde)

University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor: Dr. M. J. Marx
ILPS, IvI, Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam

2019-06-21


Abstract

Every day, several news articles are published about different kinds of traffic accidents all across the Netherlands. These usually short articles give some basic information about what happened, which vehicles were involved, how many victims there were, and so on. The goal of this thesis is to automate the annotation of these news articles, specifically with the number of deaths and the involved vehicles. This is done by using web scraping and information extraction techniques to gather and analyse Dutch news articles about traffic accidents. The research question of this thesis is as follows: "With which accuracy can web scraping and information extraction be used to retrieve and analyse Dutch news articles about traffic accidents?"

First, web scraping functions are developed to gather the title, text, and date of news articles. These functions are then used to gather data to train two information extraction models: the Death Count extractor and the vehicle extractor. Finally, both the web scraping functions and the information extraction models are tested and evaluated.

The results of these evaluations are positive: both the web scraping functions and the information extraction models perform well. It can be concluded that web scraping and information extraction can be used to retrieve and analyse Dutch news articles about traffic accidents with high accuracy.


Contents

1 Introduction
2 Related Work
  2.1 Web Scraping
  2.2 Information Extraction
3 Methodology
  3.1 Description of the data
  3.2 Methods
    3.2.1 RQ1
    3.2.2 RQ2
    3.2.3 RQ3
4 Evaluation
  4.0.1 RQ1
  4.0.2 RQ2
  4.0.3 RQ3
5 Conclusions
  5.1 Future Work
  5.2 Acknowledgements
6 References
7 Appendix
  7.1 Appendix A: Python Code
    7.1.1 Web Scraping functions
    7.1.2 DC Extractor
    7.1.3 Vehicle Extractor
  7.2 Appendix B: Cosine Distributions


1 INTRODUCTION

Dutch news outlets cover a lot of news about traffic accidents. Every day, several news articles are published about different kinds of traffic accidents all across the Netherlands. These usually short articles give some basic information about what happened, which vehicles were involved, how many victims there were, and so on. Several months ago Thalia Verkade, a journalist from De Correspondent, and Marco te Brömmelstroet, an urban planning scholar at the University of Amsterdam, decided to investigate the news coverage of traffic accidents. In doing so they made a website, www.hetongeluk.nl, which functions as a database of news articles about traffic accidents. Users can manually enter the URL of a web article and then add the date of the accident, the involved vehicles, the number of victims that died, and the number of victims that were wounded. While manually entering this data works, it is prone to mistakes. For example, since vehicles are classified into 16 different types, some users may disagree about whether a vehicle belongs to one group or the other, reducing the accuracy of the dataset. Furthermore, manual data entry might reduce the number of articles gathered, since users may not want to make the effort to fill in this data. A possible solution would be to automate this gathering of data.

This thesis is part of a small research group, consisting of Ella Casimiro, Sander Siepel, and Barry Hendriks, whose goal is to help automate this process. The goal of this thesis is to automate the extraction of the number of deaths that occurred and the involved vehicles from the article. This is done by using web scraping to gather the news article, after which information extraction is used to extract the wanted information. Users will still have to enter the URL of a web article, but the annotation of the article will be done automatically.

First of all, it is important to know the definitions of both web scraping and information extraction. Web scraping is the extraction of data from a human-readable source on a website. Information extraction is the extraction of information from unstructured data. First, the text, title, and date of news articles about traffic accidents on Dutch news websites will be gathered using web scraping. Information extraction will then be applied to this gathered data to extract the number of deaths that occurred and the involved vehicles, according to the article. Thus data from websites will be extracted, and then information will be extracted from this data.

There are many different techniques used in both web scraping and information extraction. Some of the most common techniques for both web scraping and information extraction will be explained here.

Web scraping usually uses a combination of several techniques: regular expressions, analysing HTML structure, and web scraping frameworks. A regular expression is a sequence of characters that is used as a search pattern to find matches within a text. In web scraping this can be used to find specific words or text patterns within the HTML or text of a web page. Analysing the HTML structure of a web page allows data to be extracted from specific parts of the HTML of a website. A common tool for this is BeautifulSoup, which provides users with an interface in which they can navigate the HTML content (Lawson, 2015). Web scraping frameworks are tools that try to make web scraping easier. There are many different web scraping frameworks available, and they can be used for many purposes. An example is the Scrapy framework, which allows users to build "web spiders" that crawl and extract data from websites.
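As an illustration of the regular-expression technique described above, the following sketch (not taken from the thesis; the HTML snippet is invented) pulls a title and a date out of a page's HTML:

```python
import re

# A minimal illustration of using regular expressions on a web page's HTML.
html = """
<html><head><title>Dodelijk ongeval op de A2</title></head>
<body><p class="date">21-06-2019</p>
<p>Bij een ongeval op de A2 is een persoon om het leven gekomen.</p>
</body></html>
"""

# Search pattern for the page title.
title = re.search(r"<title>(.*?)</title>", html).group(1)

# Search pattern for a date in day-month-year number format.
date = re.search(r"\d{2}-\d{2}-\d{4}", html).group(0)

print(title)  # Dodelijk ongeval op de A2
print(date)   # 21-06-2019
```

In practice this approach is combined with an HTML parser such as BeautifulSoup, since regular expressions alone are fragile against changes in page layout.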

Information extraction commonly makes use of Natural Language Processing (NLP) techniques like Named Entity Recognition (NER) and Part-of-Speech (POS) tagging, as well as pattern matching. Named Entity Recognition is the identification of names and other entities such as dates, countries, and currencies. Part-of-Speech tagging is the process of tagging words according to their part of speech. For example, the word "docks" is tagged as a verb in the sentence "He docks the ship", but as a noun in the sentence "There are many docks in the harbour". Pattern matching is a technique that makes use of NER and POS tags to extract relevant information. For example, the pattern "vehicle crashes into vehicle" can be used to find vehicles involved in a crash (Grishman, 1997).
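The pattern-matching idea can be sketched with plain regular expressions. This is a simplification: in the literature the pattern slots are typically filled by NER and POS tags rather than a fixed word list, and the vehicle list below is hypothetical:

```python
import re

# Toy version of the pattern "vehicle crashes into vehicle":
# a small, hypothetical list of Dutch vehicle words fills both slots.
VEHICLES = r"(auto|fiets|vrachtwagen|motor|bus)"
pattern = re.compile(VEHICLES + r"\s+botst\s+(?:op|tegen)\s+(?:een\s+)?" + VEHICLES)

sentence = "Een auto botst tegen een vrachtwagen op de N201."
match = pattern.search(sentence)
print(match.groups())  # ('auto', 'vrachtwagen')
```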

The research question for this thesis will be as follows: "With which accuracy can web scraping and information extraction be used to retrieve and analyse Dutch news articles about traffic accidents?"

This question will be divided into the following sub-questions:

RQ1: With which accuracy can the title, text, and date of the article be retrieved?

RQ2: With which accuracy can the number of deaths be extracted?

RQ3: With which accuracy can the type of vehicle that was involved be extracted?

2 RELATED WORK

2.1 Web Scraping

Research into web scraping usually involves explaining and detailing the use of certain web scraping APIs or systems that can be used to gather data from the web. Articles that make actual use of web scraping usually use it to collect data for further research, explaining little about their methods.

Malik and Rizvi (2011) make use of the programming language Prolog along with regular expressions to scrape data from web pages. Even though the article does not explicitly state that web scraping will be used for several different websites, the use of regular expressions might prove useful for this. Regular expressions are probably a good way to extract specific data from several different news sites.

In an article written by Boeing and Paul (2017) the programming language Python is used in combination with Scrapy, a web crawling framework for Python. Here, web scraping was used to gather Craigslist rental listings which were then used for further research. Again, web scraping is used for only one specific website, which will not be the case in this thesis. However, it does show that Python can be used for web scraping, as is done in this thesis.

No articles were found where web scraping was used to gather articles from news sites, making this an interesting subject to explore.

2.2 Information Extraction

When it comes to information extraction, a lot of research has already been conducted. While many articles about information extraction focus on developing new techniques (Banko et al., 2007; Fader et al., 2011) and evaluating or giving an overview of current techniques (Sarawagi, 2008; Cowie & Wilks, 2000), practical research has also been done. Much of this research focuses on the extraction of highly specific data (Ono et al., 2001; Fukuda et al., 1998) or on gaining an overview of a document (Ku et al., 2006; Shinyama et al., 2002; Rose et al., 2010).

Whilst there are many articles detailing the use of information extraction for specific problems, almost none of these articles make use of news articles. Articles that do use news articles usually focus more on giving an overview of the article than on extracting highly specific information from it. In 2002 an article was written by Shinyama et al., detailing the use of information extraction to paraphrase news articles. In this research, named-entity recognition was used in combination with dependency parsing to extract parts of sentences. These sentences were then combined in order to paraphrase the original news article. That article focuses on giving a summary of the article, whilst this thesis focuses on extracting specific information.

On the contrary, there are articles that make use of information extraction to acquire specific information but that do not make use of news articles. These articles usually try to extract information from larger documents like scientific articles. In 2001 Ono et al. made use of Part-of-Speech (POS) tagging, regular expressions, and keywords in order to extract information about interactions between proteins in biology articles. This combination of techniques resulted in a higher precision and recall when compared to previous research. Since these techniques were used to extract information about interactions between proteins, it should be possible to use the same techniques to extract information about interactions between vehicles. This would be very useful for extracting the vehicles involved in an accident.

This thesis seeks to use such techniques to extract specific information from news articles, essentially combining previous work. Furthermore, this thesis seeks to combine both web scraping and information extraction to gather information from unstructured web data.

3 METHODOLOGY

3.1 Description of the data

In order to answer the research question, data was needed to train and test the web scraping functions and information extraction models. Two datasets were obtained and used for this purpose. The first dataset consists of 7784 different articles obtained from the website www.flitsservice.nl and will henceforth be called flitsservice. The articles on www.flitsservice.nl are no longer available on the site at the time of writing. The second dataset consists of 1891 different articles obtained from the website www.hetongeluk.nl and will henceforth be called hetongeluk. The articles from www.hetongeluk.nl can still be obtained from the website at the time of writing.

In the flitsservice dataset all articles are about traffic accidents. Flitsservice obtained these articles from different sources and in some cases added its own commentary. These articles, however, did not contain any additional data like the number of deaths, involved vehicles, etc. The actual title, text, and date of the articles were obtained in two ways. First, the data was obtained by creating a function within Python which gathered the title, text, and date from each web page. A second version, which was eventually used to train and test the information extraction models, was obtained by using the web scraping functions that are explained in the next section.

Like the flitsservice dataset, the hetongeluk dataset also contains articles about traffic accidents. Aside from the annotated data provided with hetongeluk, separate data about the articles was also provided. This separate data contained accident ids that matched those of hetongeluk and included information about the number of deaths, involved vehicles, etc. Both the hetongeluk data and the separate data were provided in JSON format and were extracted from this format using Python.
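A minimal sketch of reading such JSON data with Python. The field names (`accident_id`, `deaths`, `vehicles`) are invented for illustration and may differ from the actual datasets:

```python
import json

# Hypothetical hetongeluk-style records; real field names may differ.
raw = '''
[{"accident_id": 12, "title": "Ongeval op A2", "deaths": 1,
  "vehicles": ["auto", "fiets"]},
 {"accident_id": 13, "title": "Botsing in Utrecht", "deaths": 0,
  "vehicles": ["auto", "auto"]}]
'''

records = json.loads(raw)

# Match the separate annotation data to articles via the accident id.
by_id = {r["accident_id"]: r for r in records}
print(by_id[12]["deaths"])  # 1
```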

3.2 Methods

3.2.1 RQ1

The goal of the functions

In order to answer the first sub-question, "With which accuracy can the title, text, and date of the article be retrieved?", several functions were made in Python. These functions are called Date transformer, Article scraper, Article cleaner, Source scraper, and File scraper. The goal of these functions is to gather the wanted information from an online news article about traffic accidents of which the URL has been provided. Each of these functions handles a part of the web scraping process and will be explained in detail below. The data on which these functions were tested will also be explained. The Python code can be found in appendix A.

Source scraper

The Source scraper function accesses online web pages and makes use of the Article scraper and Article cleaner functions to scrape and clean their content. The Source scraper takes in a list of strings containing the URLs of the web pages from which the articles need to be scraped and cleaned. It then uses the requests and http.cookiejar libraries to make requests to and access these web pages. The http.cookiejar library is used to load a cookie file which allows the function to bypass privacy and cookie statements. This cookie file was made manually and contains cookies for several large news outlets. The responses from these web pages are sent to the Article scraper function. The data returned from this function is then stored and sent to the Article cleaner function. Once the Source scraper has this cleaned content, it transforms it into a pandas dataframe containing the scraped articles' title, text, and date.

File scraper

The File scraper function is very similar to the Source scraper function. The difference, however, is that the File scraper does not access online content, but locally stored HTML files. The function takes in the file name of a zip file containing HTML files, as well as the name of the folder within the zip file in which the HTML files are stored. It then opens these files using the zipfile library and sends the content to the Article scraper and Article cleaner functions. As with the Source scraper function, it also transforms the cleaned content into a pandas dataframe containing the scraped articles' title, text, and date.

Fig. 1. Example of output of Source scraper and File scraper.
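The File scraper's zip handling can be sketched as follows. This is simplified: the archive is built in memory so the example is self-contained, and the folder name is invented:

```python
import io
import zipfile

# Build a small zip archive in memory with one HTML file in a folder.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("articles/art1.html", "<html><title>Ongeval</title></html>")

# Read back every HTML file stored in the given folder, as the
# File scraper does before handing the content to the Article scraper.
with zipfile.ZipFile(buf) as zf:
    html_files = [n for n in zf.namelist() if n.startswith("articles/")]
    contents = [zf.read(n).decode("utf-8") for n in html_files]

print(len(contents))  # 1
```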

Article scraper

The Article scraper function scrapes the provided HTML data for the title, text, and date of the article. It takes in HTML content and uses the BeautifulSoup library to transform the content into something it can work with. First it removes links from the HTML data; these links usually only contain information about other articles, making them useless. Once the links have been removed, it uses a regular expression to look for the title and the date of the article. For the text of the article, multiple regular expressions are used in order to gather as much of the article's text as possible. This usually means the title also gets scraped as text and thus needs to be removed, which is done. The function then returns one of three possible options. In the first option, both the title and date of the article were found, and the title, text, and date are returned. In the second option, either the title or date was not found, and the function returns the text and either the title or date, leaving the other blank. In the third option, neither the title nor date was found, and only the text is returned, leaving the title and date blank. This last result usually gets removed during the cleaning of the data, as it most probably indicates that the HTML page did not contain an article.
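The three possible outcomes can be sketched as follows. This is a heavily simplified stand-in: the thesis uses BeautifulSoup and more elaborate regular expressions, and the patterns below are invented:

```python
import re

# Simplified sketch of the Article scraper's behaviour: blanks in the
# returned tuple mark what could not be found.
def scrape_article(html):
    title_m = re.search(r"<h1[^>]*>(.*?)</h1>", html, re.S)
    date_m = re.search(r"\d{2}-\d{2}-\d{4}", html)
    text = re.sub(r"<[^>]+>", " ", html)        # crude tag stripping
    text = re.sub(r"\s+", " ", text).strip()
    title = title_m.group(1) if title_m else ""
    if title:
        # The title also gets scraped as text, so remove it from the text.
        text = text.replace(title, "").strip()
    date = date_m.group(0) if date_m else ""
    return title, text, date

title, text, date = scrape_article(
    "<h1>Ongeval</h1><p>Een fietser raakte gewond. 01-05-2019</p>")
print((title, date))  # ('Ongeval', '01-05-2019')
```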

Article cleaner

The Article cleaner function cleans the data provided by the Article scraper function so that it can be transformed and stored in a pandas dataframe. It takes in three lists: one containing texts, one containing titles, and one containing dates. The first thing it does is iterate over all the texts in the text list in order to find common sentences. If a sentence is written in exactly the same way in several articles, it usually is a disclaimer or announcement made by the news organisation. These sentences are not needed and are thus removed. The function then removes any texts that are empty or extremely large. Unusually large texts most probably mean that the web page did not contain a news article about a traffic accident, but something like a privacy or cookie statement which the Source scraper function was not able to bypass. If a text is removed, the corresponding title and date are also removed. After this has been done, the function makes use of the Date transformer function to convert all the received dates into the same format. It then returns the remaining texts, titles, and dates.
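The common-sentence removal can be sketched like this. The repetition threshold and maximum length are invented, and the real function also drops the matching titles and dates:

```python
from collections import Counter

# Simplified sketch of the Article cleaner: sentences that recur verbatim
# across articles are treated as disclaimers and dropped, and texts that
# end up empty or unusually large are discarded.
def clean_texts(texts, max_len=5000, min_repeat=2):
    counts = Counter(s for t in texts for s in t.split(". "))
    cleaned = []
    for t in texts:
        kept = [s for s in t.split(". ") if counts[s] < min_repeat]
        t = ". ".join(kept)
        if t and len(t) <= max_len:   # drop empty or oversized texts
            cleaned.append(t)
    return cleaned

texts = ["Ongeval op de A2. Volg ons op Twitter",
         "Fietser gewond. Volg ons op Twitter"]
print(clean_texts(texts))  # ['Ongeval op de A2', 'Fietser gewond']
```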

Date transformer

The Date transformer transforms the received dates into a single format so that they can be stored and easily used for further analysis of the article. It makes use of regular expressions to find days, months, and years in number format. Months, however, are also found using their name or abbreviation. It then places the found day, month, and year into the ISO 8601 format of year-month-day. Both the day and the month consist of two digits and the year of four, e.g. 2000-02-01 would be the first of February of the year 2000. If the function cannot find a day, month, or year in the provided untransformed date, it substitutes the missing part with question marks. The date used in the previous example, with its day missing, would then look like this: 2000-02-??. The Date transformer returns the transformed date as a string.
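A hypothetical re-implementation of the output format described above; the real function recognises more date patterns and month spellings than this sketch does:

```python
import re

# Dutch month names, abbreviated (plus "maart", whose abbreviation
# "mrt" does not occur inside the full name).
MONTHS = {"jan": "01", "feb": "02", "mrt": "03", "maart": "03",
          "apr": "04", "mei": "05", "jun": "06", "jul": "07",
          "aug": "08", "sep": "09", "okt": "10", "nov": "11", "dec": "12"}

def transform_date(raw):
    raw = raw.lower()
    year = re.search(r"\b(\d{4})\b", raw)
    day = re.search(r"\b(\d{1,2})\b", raw)
    month = next((num for name, num in MONTHS.items() if name in raw), None)
    # Missing parts are substituted with question marks, as in the thesis.
    return "-".join([
        year.group(1) if year else "????",
        month if month else "??",
        day.group(1).zfill(2) if day else "??",
    ])

print(transform_date("1 februari 2000"))  # 2000-02-01
print(transform_date("februari 2000"))    # 2000-02-??
```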

Test data

The functions were tested on the flitsservice dataset and on articles gathered from two source sites of flitsservice: RTV Utrecht and Blikopnieuws. The File scraper function was tested on flitsservice and the Source scraper function was tested on both source sites. Since the articles of the flitsservice database are no longer available on the www.flitsservice.nl website, the two source sites had to be used to test the Source scraper function. In order to test both functions, they have to be tested against a gold standard: a scraped version of the article containing the correct title, text, and date. These gold standards were made using a web scraping function targeting the specific HTML design of each website. RTV Utrecht contains 201 different articles and Blikopnieuws contains 519 different articles.

3.2.2 RQ2

The proposed model

To answer the second sub-question, "With which accuracy can the number of deaths be extracted?", an information extraction model was made. The goal of the proposed model is to extract the number of deaths that occurred in a traffic accident, according to a news article. The model takes in a text, a Dutch news article about a traffic accident, and then tries to extract the number of deaths that occurred, after which it outputs a number equal to the extracted number of deaths. The model is built with Python and makes use of a Part-of-Speech (POS) tagger and a dependency parser, both imported from the spaCy module. From here on, the model will be referred to as the Death Count (DC) extractor. First a general description of the DC extractor will be given, followed by a step-by-step description along with an example for each step. Furthermore, the data the DC extractor was tested on will be explained. The Python code can be found in appendix A.

The DC extractor is a rule-based system that makes use of a list of indicator words, POS tags like number modifiers, and a shallow dependency parse of the text. The list containing the used indicator words is shown in the figure below. Each of the steps the DC extractor takes in order to extract information will now be explained.

Fig. 2. All words used in the list of indicator words.

Step by step description of the DC extractor

Step 1 The first step the DC extractor takes is to tokenize the input text, after which the POS tagger and dependency parser tag the tokenized text. For example, take the following short Dutch text: "Een dodelijk ongeval heeft gister plaats gevonden. Twee vrouwen zijn om het leven gekomen. Het ongeval heeft het leven geëist van een Rotterdamse vrouw. De andere vrouw die omkwam was Amsterdams." ("A fatal accident took place yesterday. Two women lost their lives. The accident claimed the life of a woman from Rotterdam. The other woman who died was from Amsterdam.") Each word in this text is transformed into a token. The POS tagger and dependency parser then tag these tokens; for example, "vrouwen" is tagged as a noun and has a nominal subject dependency with "gekomen".

Step 2 The second step of the DC extractor checks whether the word "dodelijk" (fatal) is found within the text. If "dodelijk" is found, the DC extractor uses the tagged text to find any object dependencies between the word "dodelijk" and any other word in the text. If such a dependency is found, the DC extractor tries to use the tagged text to find any number modifiers between the object and a word in the text. If such a dependency is found, it returns this number as the number of deaths; if not, it tries to find an appositional modifier dependency of the object. If it finds this dependency, it again checks whether a number modifier is found. In that case it returns this modifier; if not, it goes to step 3. In the previous example the word "dodelijk" is present, and the dependency parser also found an object dependency between "dodelijk" and "ongeval". It now looks for number modifiers, but none are present, so it looks for appositional modifiers; again none are present. This causes the DC extractor to go on to step 3.

Step 3 The third step involves checking the tagged text for the word "leven" (life) along with a dependency with the verb "kosten" or "eisen" (to cost or to claim). It then tries to find a nominal subject or object dependency between this verb and another word in the text. If a subject is found, a check is done whether the aforementioned filter list contains the subject. If it does, the DC extractor tries to find a number modifier for the subject. If it finds a number modifier, it returns this, unless the POS tagger has indicated the found subject to be singular. If it does not return anything, the DC extractor goes on to step 4. The example does contain the word "leven", in the third sentence: "Het ongeval heeft het leven geëist van een Rotterdamse vrouw." It also finds a dependency with a form of the verb "eisen". A nominal subject dependency is also present between "geëist" and "vrouw", and "vrouw" is present in the filter list. It now checks for a number modifier, but none are present in this sentence, so the DC extractor goes to step 4.

Step 4 The fourth step starts with looking for a dependency between a form of the verb "komen" and the word "om" (together: to die). If it finds this dependency, it starts looking for words with a nominal subject, object, or adjectival clause dependency with the verb. If a subject is found, a check is done whether the aforementioned filter list contains the subject. If it does and the verb is singular, the result is set to 1. However, if the verb is plural, the result is set to 2 and the DC extractor starts looking for a number modifier of the subject. If it finds this number modifier, it returns it, unless the POS tagger has indicated the found subject to be singular; otherwise it goes on to step 5. The second sentence of the example text contains the dependency between a form of the word "komen" and "om" that this step is looking for. A nominal subject dependency between "vrouwen" and "komen" is found. "Vrouwen" is present in the filter list and "komen" has been tagged plural, setting the result to 2. A number modifier is also found in combination with "vrouwen": the word "Twee" (two). The DC extractor now returns 2 as the number of deaths that occurred.

Step 5 The fifth step starts with checking whether any of the words in the tagged text is contained in the list of death indicators. If such a word is found, the result is set to 1 and the DC extractor tries to find a nominal subject dependency between this word and another word in the tagged text. When such a dependency is found, the DC extractor looks for a number modifier for the subject and returns this modifier if found. If no dependency is found, the DC extractor looks for a number modifier for the death indicator itself and returns that, if found. When nothing has been returned until now, the DC extractor returns the result. Now, imagine the example did not contain the second sentence and entered step 5. The DC extractor finds two indicator words in the text, "dodelijk" and "omkwam", so the result is set to 1. No nominal subject dependency is found for "dodelijk", but "omkwam" does have this dependency with "vrouw". No number modifiers are found with either "omkwam" or "vrouw", so the DC extractor returns the result.
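The overall idea of indicator words plus number modifiers can be sketched as follows. This is a heavy simplification: the real DC extractor uses spaCy's POS tags and dependency parse rather than the regular expression below, and both word lists are only fragments:

```python
import re

# Toy stand-in for the DC extractor: indicator words signal a death,
# and a nearby Dutch number word gives the count. Word lists are partial.
NUMBERS = {"een": 1, "twee": 2, "drie": 3, "vier": 4, "vijf": 5}
INDICATORS = ["dodelijk", "omgekomen", "om het leven", "overleden", "omkwam"]

def death_count(text):
    text = text.lower()
    if not any(word in text for word in INDICATORS):
        return 0
    # Look for a number word directly before a plural noun (rough proxy
    # for a number modifier dependency on a plural subject).
    m = re.search(r"\b(" + "|".join(NUMBERS) + r")\b\s+\w+en\b", text)
    return NUMBERS[m.group(1)] if m else 1

print(death_count("Twee vrouwen zijn om het leven gekomen."))  # 2
print(death_count("Een fietser raakte gewond."))               # 0
```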

Test data

To evaluate the DC extractor, a dataset of 1828 articles about traffic accidents was created. This dataset also contained the number of deaths that occurred in the traffic accident according to the article. In order to create this dataset, labelled data from the hetongeluk dataset was combined with manually labelled data from the flitsservice dataset. Out of the 1828 articles, 501 are from the flitsservice dataset and 1327 are from the hetongeluk dataset. The dataset contains 1201 articles where no death is confirmed, 593 articles where one death is confirmed, and 34 articles where multiple deaths are confirmed. The reason why the hetongeluk dataset was combined with manually labelled data from flitsservice is that the hetongeluk dataset contains very few articles with deaths compared to articles with no deaths. On the other hand, the flitsservice dataset contains a lot of articles with one confirmed death and very few articles with no confirmed deaths. Combining them reduces the imbalance between the classes. However, there still is an imbalance, since both datasets contain very few articles with multiple confirmed deaths.

As specified before, the flitsservice data was manually labelled. This was done because the original flitsservice dataset did not contain any labelled data, only the titles, texts, and dates of the articles. The first 501 articles of the flitsservice dataset were labelled with the number of deaths confirmed according to the article. Since there is no particular order in the flitsservice dataset, annotating the first 501 should not have biased the data in any way. However, since the DC extractor was trained on at least a subset of the flitsservice data, it might overperform on this part of the combined dataset.

3.2.3 RQ3

The proposed model

For the third sub-question, "With which accuracy can the type of vehicle that was involved be extracted?", an information extraction model was made. The goal of the proposed model is to extract the involved vehicles of a traffic accident according to a news article. Much like the previously explained DC extractor, the model takes in a Dutch news article about a traffic accident. It then tries to extract the involved vehicles, which it outputs as a list. The model is built with Python and makes use of a POS tagger and dependency parser imported from the spaCy module. Henceforth the model will be referred to as the vehicle extractor. The Python code can be found in appendix A.

The vehicle extractor, like the DC extractor, is a rule-based system that makes use of a list of indicator words consisting of Dutch vehicle names, POS tags, a shallow dependency parse, and a conversion dictionary. The list of indicator words and the conversion dictionary can be found in figures 3 and 4. The vehicle extractor consists of 5 different steps: 1) tokenizing and parsing the text of the article, 2) finding auxiliary vehicles, 3) finding other vehicles, 4) replacing "vague" vehicles, 5) removing duplicates from the result. Each of these steps will be explained below, along with a description of the data the vehicle extractor was tested on.

                      Zero deaths   One death   Multiple deaths
Flitsservice subset        1           474            26
Hetongeluk              1200           119             8
Combined                1201           593            34

Table 1. The number of articles containing zero deaths, one death, and multiple deaths, for the flitsservice subset, hetongeluk, and the combined dataset of the two, on which the DC extractor will be tested.


Fig. 3. List of all vehicle names that the vehicle extractor uses.

Fig. 4. Overview of each class and the vehicles belonging to it.

Step by step description of the vehicle extractor

Step 1 First the text of the article is tokenized. After this has been done the POS tagger and dependency parser will tag the tokenized text.

Given the following short Dutch text as an example: "Twee wielrenners zijn op een vrachtwagen gebotst. De voertuigen waren flink beschadigd, maar de ambulancedienst was er snel bij en er zijn geen doden gevallen." ("Two cyclists crashed into a truck. The vehicles were badly damaged, but the ambulance service arrived quickly and there were no deaths.") Each of the words in this example is tokenized and parsed by the POS tagger and dependency parser. The word "wielrenners", for example, is turned into a token that is tagged as plural and with a number modifier dependency with the token "Twee".

Step 2 The second step involves finding auxiliary vehicles that were possibly involved in the accident. This is done by checking whether a word from the aforementioned list of Dutch vehicle names is present in the tokenized text. The reason why auxiliary vehicles are not found alongside other vehicles is that auxiliary vehicles are often mentioned in articles even when they are not directly involved in the accident. Because of this, auxiliary vehicles are required to have a dependency with a verb that indicates involvement. A list containing these words can be found in the figure below. If a dependency with such a verb is found, the vehicle extractor looks for a number modifier dependency between the auxiliary vehicle and another token in the text. Whether a number modifier is found or not, the auxiliary vehicle is added to the possibly found vehicles. This is done n times (where n is the number modifier) if a number modifier is found, or only once if not. When a vehicle is added to the possibly found vehicles, the vehicle extractor always first tries to convert the vehicle to one of 16 vehicle types. In the example an auxiliary vehicle is present in the word "ambulancedienst". However, a word indicating involvement is not present, so the vehicle extractor goes to the next step.

Fig. 5. All words indicating involvement for auxiliary vehicles.

Step 3 The third step is finding other vehicles that possibly were involved in the accident. This step is very similar to step 2, but involves an extra check between finding a vehicle and finding a number modifier dependency between the found vehicle and a different token in the text. This extra check consists of finding a dependency between the found vehicle and the word "andere". The word "andere" indicates that a different vehicle of the same vehicle type was involved. This allows the vehicle extractor to extract multiple vehicles of the same type, even when a number modifier is not found. In the example used in the previous steps, three words are present that indicate non-auxiliary vehicles: "wielrenners", "vrachtwagen", and "voertuigen". The vehicle extractor now looks for number modifier dependencies between these words and other words in the text. For "vrachtwagen" and "voertuigen" none are found; "wielrenners", however, does have one with the word "Twee". All words will now be converted to a vehicle type: "wielrenners" will become "fiets", "vrachtwagen" will stay the same since it already is the name of a vehicle type, and "voertuigen" will not be converted since it is not present in the converter. At this point all words are added to the found vehicles: "vrachtwagen" and "voertuigen" are added once, "fiets" is added twice. The found vehicles now look like this: [2 fiets, vrachtwagen, voertuig].
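The number-modifier matching of steps 2 and 3 can be sketched without spaCy as follows. This is a minimal illustration, assuming a pre-parsed sentence represented as (text, lemma, dependency, head) tuples; the `VEHICLES` set, `CONVERTER` mapping and `NUMBERS` table are hypothetical stand-ins for the extractor's actual word lists and for the parser output:

```python
# Hypothetical, heavily reduced word lists; the real extractor uses a
# large list of Dutch vehicle names and a converter to 16 vehicle types.
VEHICLES = {"vrachtwagen", "voertuig"}
CONVERTER = {"wielrenner": "fiets"}
NUMBERS = {"twee": 2, "drie": 3}

def find_vehicles(tokens):
    """tokens: list of (text, lemma, dep, head_text) tuples standing in
    for a dependency-parsed sentence. Returns (vehicle_type, count) pairs."""
    found = []
    for text, lemma, dep, head in tokens:
        if lemma in VEHICLES or lemma in CONVERTER:
            vtype = CONVERTER.get(lemma, lemma)  # convert to a vehicle type
            count = 1
            # look for a number modifier (nummod) pointing at this token
            for text2, _, dep2, head2 in tokens:
                if dep2 == "nummod" and head2 == text:
                    count = NUMBERS.get(text2.lower(), 1)
            found.append((vtype, count))
    return found

# "Twee wielrenners zijn op een vrachtwagen gebotst."
sentence = [
    ("Twee", "twee", "nummod", "wielrenners"),
    ("wielrenners", "wielrenner", "nsubj", "gebotst"),
    ("vrachtwagen", "vrachtwagen", "obl", "gebotst"),
]
print(find_vehicles(sentence))  # [('fiets', 2), ('vrachtwagen', 1)]
```

In the real extractor the tuples come from spaCy's parse of the article text rather than being written out by hand.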

Step 4 In this step, "vague" words indicating vehicles are removed or replaced. Currently only one word is considered "vague", which is the word "voertuig", the Dutch word for vehicle. If a vague word is found within the possibly involved vehicles and a non-vague word is also present, the vague word will simply be removed. If a vague word is found with a number modifier dependency and a non-vague word with the same modifier is present, the vague word will be removed. However, if a vague word is found with a number modifier and only one non-vague word without a modifier is present, the vague word will be replaced with the non-vague word, keeping the number modifier. In the example a vague word is present in the found vehicles. Since another non-vague word is present, this word is simply removed. The found vehicles now look like this: [2 fiets, vrachtwagen].

Step 5 The last step removes all duplicates from the found vehicles. When multiple vehicles of the same vehicle type are found, the extractor only keeps the one with the largest number modifier. Since the original example does not contain duplicates, a different example will be used. Imagine the possibly found vehicles contain the list [fiets, fiets, vrachtwagen, 2 fiets, fiets]; only the list [vrachtwagen, 2 fiets] will be kept. Since "2 fiets" has the highest number modifier, the other instances of "fiets" are removed. If no vehicle has a number modifier, simply one of each type is kept.
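The deduplication of step 5 amounts to keeping, per vehicle type, the entry with the largest count. A minimal sketch, with `deduplicate_vehicles` as an illustrative helper rather than the thesis code:

```python
def deduplicate_vehicles(found):
    """Step-5 style deduplication: per vehicle type, keep only the
    entry with the largest number modifier."""
    # `found` is a list of (vehicle_type, count) pairs, where a count of 1
    # means no number modifier was attached to the vehicle.
    best = {}
    for vehicle, count in found:
        best[vehicle] = max(best.get(vehicle, 0), count)
    return best

# [fiets, fiets, vrachtwagen, 2 fiets, fiets] reduces to [2 fiets, vrachtwagen]
print(deduplicate_vehicles(
    [("fiets", 1), ("fiets", 1), ("vrachtwagen", 1), ("fiets", 2), ("fiets", 1)]
))  # {'fiets': 2, 'vrachtwagen': 1}
```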

Test Data

The vehicle extractor is tested on a small subset of the previously described (3.1) hetongeluk data. This subset consists of 494 different articles about 303 different traffic accidents. The labeled data containing the involved vehicles was provided separately from the hetongeluk data, for each of the 303 different accidents. Using the accident id and article id within the hetongeluk data, the articles were matched with the accidents. After this was done, some small changes were made, since some articles did not report all the vehicles involved in the accident. The labeled data for these articles was manually changed to fit what was actually described in the article.

4 EVALUATION

4.0.1 RQ1.

Evaluation Measures

In order to evaluate the created web scraping functions, two things have to be measured. The first is the amount of titles, text, and dates that are successfully scraped, compared to the amount of articles that were accessed; this essentially means calculating the recall of the functions. The second measure is how 'correct' the scraped titles, text and dates are. For the titles and text this can be done by calculating the similarity between the output of the made web scraping functions and the output of the previously mentioned, highly specific function for flitsservice. The output of this specific function is essentially a gold standard: it contains the title, the whole text, and the date of every article on flitsservice. Measuring the similarity between the general scraped version and the specific version should thus give an idea of how close the general scraped version is to the gold standard. To measure this similarity, the cosine similarity between TF-IDF vectors of the general scraped version and the specific version will be used. When it comes to evaluating the date, accuracy will be used: a date can either be correct or not, since unlike the title and the text, a date becomes useless if even a small part of it is missing.
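The similarity computation can be sketched with scikit-learn, assuming that package is available; `scrape_similarity` is an illustrative helper, not the thesis code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def scrape_similarity(scraped, gold):
    """Cosine similarity between the TF-IDF vectors of a scraped text
    and its gold-standard counterpart (1.0 means identical weighting)."""
    vectors = TfidfVectorizer().fit_transform([scraped, gold])
    return cosine_similarity(vectors[0], vectors[1])[0, 0]

# Identical texts score 1.0; texts with no shared terms score 0.0.
print(scrape_similarity("auto botst op fietser", "auto botst op fietser"))
print(scrape_similarity("auto botst", "fietser gewond"))
```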

The web scraping functions were evaluated on the flitsservice dataset as specified. The results of the evaluation of both title and text can be found in table 2. The evaluation of the date can be found in table 3. Graphs containing the distribution of the cosine similarities can be found in appendix B.

From the results we can see that the File scraper function performs very well. This was to be expected, since it only had to scrape files that were downloaded from a single website, meaning that it would have performed either very badly or very well: had it performed very badly, it would have made the same mistake for every page, since the HTML design of each page is the same. When it comes to the Source scraper function, the similarity between the gold standard text and the scraped text is also high. This indicates that the function is able to remove most of the HTML whilst retaining most of the article's text. However, the similarity of the title, while still decent, is considerably lower than the similarity of the text. The explanation for this lower performance lies in the way the Article scraper function scrapes titles. The function looks for HTML code containing the word title, indicating that a title is being displayed in this code, and uses the first instance of a title that is found. However, most sites use title to give a name to the webpage; this is the text that is displayed at the top of your browser when opening a website. This title does still contain the actual title of the article, but usually also contains the name of the news outlet. Since the news outlet name is not present in the gold standard, the similarity between the two is lowered.
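This title behaviour can be illustrated with BeautifulSoup, the library the functions use; the HTML snippet and outlet name are invented for the example:

```python
import re
from bs4 import BeautifulSoup

# Invented example page: the <title> tag holds the article title
# plus the outlet name, as many news sites do.
html = ("<html><head><title>Auto botst op boom - Voorbeeldkrant</title>"
        "</head><body></body></html>")
soup = BeautifulSoup(html, "html.parser")

# Like the Article scraper: take the first tag whose name contains "title".
title = soup.find(re.compile(".*title.*")).get_text()
print(title)  # Auto botst op boom - Voorbeeldkrant
```

The trailing "- Voorbeeldkrant" is exactly the kind of extra text that lowers the cosine similarity against the gold-standard title.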

Function        Dataset               Accuracy  Recall
Source scraper  RTV Utrecht           1.00      1.00
Source scraper  Blikopnieuws          1.00      1.00
Source scraper  Flitsservice sources  -         0.64
File scraper    Flitsservice          1.00      1.00

Table 3. The accuracy and recall of the scraped date.

The results of extracting the date, when tested on the flitsservice, RTV Utrecht and Blikopnieuws datasets, are extremely high. In all three datasets a date was always found, and this date always matched the gold standard. However, since these results were too perfect, an extra test was added: the Source scraper function was also tested on the sources of flitsservice. Whilst the dates of the flitsservice gold standard could not be compared to the dates of the sources, since flitsservice uses the date of the accident and the sources usually use the date the article was written, the recall could still be calculated. The recall of this test is much lower than the recall of the other three tests. Again, an explanation can be given. The low recall is caused by a combination of three things: the way the Article scraper function scrapes dates, the chosen test data, and the sources of flitsservice. The Article scraper function, much like when scraping titles, looks for HTML code containing the word date and returns the first instance of a date it finds. Since a gold standard was required to compare the scraped data against, the chosen websites from which datasets were made, RTV Utrecht and Blikopnieuws, both needed to contain a large amount of articles. However, websites with a large amount of articles are usually well made and make frequent use of tags for things like dates and titles, resulting in a high performance of the web scraping functions. The flitsservice dataset, however, also contains a lot of articles from much smaller and less professional sources. These small websites are usually not very well built and maintained, resulting in an absence of the tags the web scraping functions rely on to identify dates. This is not a problem when only professional sources are used; however, if smaller, less professional sources are also used, this needs to be taken into account.

Function        Dataset       Count  Scraped  Cosine similarity  Standard deviation
Source scraper  RTV Utrecht   201    Title    0.79               0.05
Source scraper  RTV Utrecht   201    Text     0.86               0.07
Source scraper  Blikopnieuws  519    Title    0.71               0.07
Source scraper  Blikopnieuws  519    Text     0.94               0.04
File scraper    Flitsservice  7784   Title    1.00               0.00
File scraper    Flitsservice  7784   Text     0.95               0.07

Table 2. The cosine similarity and standard deviation between the scraped titles and text and the gold standard.

4.0.2 RQ2.

Evaluating the DC extractor

Since the proposed model essentially works like a multi-class classification model, it was decided to evaluate the DC extractor using the proven information retrieval and classification measures accuracy, precision, recall and F1 score. However, since the DC extractor performs multi-class classification, these measures will be calculated using both macro-averaging and micro-averaging.

Macro-averaging simply takes the average of the different classes for the specified measures, thus treating all classes as equal. Micro-averaging first sums the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) from each class, thus creating new values for the TP, TN, FP, and FN. From these new values the aforementioned measures are then calculated. This gives a weight to each class, which favours the larger classes (Sokolova & Lapalme, 2009). Since the testing data contains a significantly smaller class, micro-averaging is the preferred measure, as it gives more weight to the larger classes. However, macro-averaging will still be used, since it cannot be guaranteed that future data will have the same distribution of classes.
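The difference between the two averaging schemes can be sketched for precision; the class names and counts below are invented for illustration:

```python
def micro_macro_precision(per_class):
    """per_class maps a class name to its (TP, FP) counts.
    Macro-averaging averages the per-class precisions; micro-averaging
    pools the counts first, which weights larger classes more heavily."""
    precisions = [tp / (tp + fp) for tp, fp in per_class.values()]
    macro = sum(precisions) / len(precisions)
    total_tp = sum(tp for tp, _ in per_class.values())
    total_fp = sum(fp for _, fp in per_class.values())
    micro = total_tp / (total_tp + total_fp)
    return micro, macro

# Invented counts with one much smaller, weaker class: it drags the
# macro average down but barely moves the micro average.
micro, macro = micro_macro_precision(
    {"zero": (90, 10), "one": (90, 10), "multiple": (1, 1)})
print(round(micro, 2), round(macro, 2))  # 0.9 0.77
```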

Aside from the averages, the measures will also be calculated for each individual class. This way it is easier to identify whether the DC extractor overperforms or underperforms for certain classes, allowing further insight into the performance of the DC extractor. As specified before, the DC extractor will be tested on a combined dataset of hetongeluk and flitsservice data. However, since the DC extractor was trained on a subset of the flitsservice data, its performance will also be tested on the hetongeluk data and the flitsservice data alone. This will give further insight into the possible bias of the DC extractor.

When the DC extractor outputs a value, this value can be right, partially right or wrong. A value is right when the DC extractor value is the same as the labelled value from the test dataset, and wrong when it is not. However, in the case of multiple deaths the value can also be partially right: the value is larger than one, thus belonging to the multiple deaths class, but is not the same as the labelled value. When evaluating the DC extractor, these partially right answers are counted as right answers. An overview of when a value counts as a TP, TN, FP or FN can be found in table 4.
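This scoring rule can be sketched as follows; `dc_class` and `counts_as_right` are illustrative helpers, not the evaluation code of the thesis:

```python
def dc_class(value):
    """Map an extracted death count to the three DC extractor classes."""
    if value == 0:
        return "zero deaths"
    if value == 1:
        return "one death"
    return "multiple deaths"

def counts_as_right(predicted, labeled):
    """Partially right answers (both values larger than 1, but unequal)
    count as right, since they fall in the same class."""
    return dc_class(predicted) == dc_class(labeled)

print(counts_as_right(3, 2))  # True: partially right
print(counts_as_right(1, 2))  # False: wrong class
```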


Zero deaths — TP: predicted and labeled value both equal 0; TN: predicted and labeled value do not equal 0; FP: predicted value equals 0, labeled value does not; FN: predicted value does not equal 0, labeled value does.

One death — TP: predicted and labeled value both equal 1; TN: predicted and labeled value do not equal 1; FP: predicted value equals 1, labeled value does not; FN: predicted value does not equal 1, labeled value does.

Multiple deaths — TP: predicted and labeled value are both larger than 1; TN: predicted and labeled value equal 0 or 1; FP: predicted value is larger than 1, labeled value equals 0 or 1; FN: predicted value equals 0 or 1, labeled value is larger than 1.

Table 4. When a value counts as a True Positive (TP), True Negative (TN), False Positive (FP) or False Negative (FN) for each class of the DC extractor.

The DC extractor was tested on the specified annotated dataset, consisting of 1327 articles of hetongeluk and the first 501 articles of flitsservice. The results can be found in table 5 and the confusion matrix in table 6. For the results on the individual hetongeluk and flitsservice datasets, see appendix C.

From these results we can see that, for both micro-averaging and macro-averaging, the DC extractor performs very well. Both the precision and recall are at least 0.9, indicating that for each class the DC extractor not only collects the largest part of the relevant values, but also that a large part of the collected values are relevant. When looking at individual classes, both the zero deaths class and the one death class perform as expected from the averages; the multiple deaths class, however, underperforms compared to the other two classes. This underperformance might be explained by the DC extractor's reliance on dependency parsing of number modifiers. The dependency parser used does not perform as well for Dutch as it does for English, resulting in fewer correct number modifier dependencies.

                 Predicted: Multiple  Predicted: One  Predicted: Zero
True: Multiple   25                   9               0
True: One        0                    574             19
True: Zero       0                    18              1183

Table 6. Confusion matrix of the DC extractor when tested on the combined dataset.

Dataset   Class            Count  Accuracy  Precision  Recall  F1 score
Combined  Zero deaths      593    0.98      0.96       0.97    0.96
Combined  One death        1201   0.97      0.96       0.97    0.96
Combined  Multiple deaths  34     0.99      1.00       0.74    0.85
Combined  Micro average    1828   0.98      0.97       0.97    0.97
Combined  Macro average    1828   0.98      0.98       0.90    0.94

Table 5. The accuracy, precision, recall and F1 score for each individual class, and the micro and macro averages of the proposed model, when tested on the combined subset of hetongeluk and flitsservice data.


4.0.3 RQ3.

Evaluating the vehicle extractor

To evaluate the vehicle extractor, two different measures will be used: accuracy and the Jaccard similarity. Accuracy is used since it gives a simple overview of how well the vehicle extractor works: an answer is either right or it is not, and the output of the vehicle extractor must match the labelled test data exactly. The Jaccard similarity is used since it gives better insight into how well the extractor performs. The Jaccard similarity is commonly used to measure the similarity between two sets. It is calculated by dividing the number of elements common to both sets by the total number of elements in the two sets. Here, each answer is given a similarity score, meaning that an answer still receives points even if it does not exactly match the labelled test data.

Aside from using two different measures, the vehicle extractor will also be evaluated at both set level and multiset level. At set level the performance of the extraction of vehicle types is measured, e.g. [car, truck]. At multiset level the performance of extracting the correct amount of vehicles is measured, e.g. [2 cars, truck]. Since at multiset level the Jaccard similarity would essentially work the same as accuracy, each multiset is reduced to set level. This is done by counting each vehicle of a vehicle type with a number modifier as a different vehicle, e.g. [2 cars, truck] becomes [car1, car2, truck]. This ensures that, if the vehicle extractor returns one car instead of two, it still gets some points instead of nothing.
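The multiset-to-set reduction and the Jaccard score can be sketched as follows; the dictionaries and the `expand` helper are illustrative, not the thesis code:

```python
def expand(multiset):
    """Reduce a multiset such as {"car": 2, "truck": 1} to set level by
    numbering duplicates: {"car1", "car2", "truck1"}."""
    return {f"{v}{i}" for v, n in multiset.items() for i in range(1, n + 1)}

def jaccard(a, b):
    """Jaccard similarity: shared elements divided by all elements."""
    return len(a & b) / len(a | b)

# The extractor returned one car instead of two: not an exact match,
# but the expanded sets still share two of three elements.
pred = {"car": 1, "truck": 1}
gold = {"car": 2, "truck": 1}
print(round(jaccard(expand(pred), expand(gold)), 2))  # 0.67
```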

The vehicle extractor was tested on the specified labeled subset of hetongeluk, containing 494 different articles. The results can be found in table 7.

As we can see from these results, the vehicle extractor performs well on set level; on multiset level, however, the results are somewhat lacking. The vehicle extractor is capable of extracting the right vehicle types 86% of the time. The Jaccard similarity indicates that in some cases the extractor is at least able to partially return the right vehicle types. On multiset level the vehicle extractor returns the right vehicle types and amount of vehicles 76% of the time. The significantly higher Jaccard similarity indicates that, of the 24% the model gets wrong, it is at least able to partially extract the amount of vehicles.

Level           Accuracy  Jaccard similarity
Set level       0.86      0.91
Multiset level  0.76      0.87

Table 7. The accuracy and Jaccard similarity for each set level of the vehicle extractor when tested on a subset of hetongeluk.

A short error analysis of 50 wrong answers (according to the accuracy) on both set level and multiset level showed several problems. On set level the largest amount of errors came from the fact that vehicles were mentioned that were not involved in the accident. This could be because traffic information for other vehicles as a result of the accident, a witness using a certain vehicle, or a different accident was mentioned. Another frequent error was that the article did not explicitly mention the vehicle type that was involved; instead "bestuurder", Dutch for driver, or something similar was used, e.g. "Driver crashes into tree". Besides these two, other errors included different words being used for the same vehicle, possibly involved vehicles, a word in the article containing a vehicle name, or the used name for the vehicle not being recognized by the vehicle extractor. On multiset level the largest amount of errors came from the dependency parser being unable to connect a number modifier to the found vehicle, either because no number modifier was present or because the dependency parser was not advanced enough to parse the dependency. Another frequent error was again the mentioning of uninvolved vehicles. Other errors included the usage of a different word for the second vehicle of the same type, the extractor falsely using a dependency with the word "andere" as a sign of a second vehicle, or the failure to recognize a used name for a vehicle.

5 CONCLUSIONS

In the end we can see that all models and functions perform, overall, very well. The web scraping functions should be able to extract the text and title without many problems, which allows the information extraction models to do their work. The DC extractor performs really well and should in most cases be able to extract the amount of deaths about as well as, or even better than, a human user. The vehicle extractor does not yet perform as well as a human user; however, the automatic classification of vehicles does reduce errors of the same vehicle being classified as multiple types. As stated before, the research question for this thesis was: "With which accuracy can web scraping and information extraction be used to retrieve and analyse Dutch news articles about traffic accidents?" Judging from the results, the answer is that web scraping and information extraction can be used to retrieve and analyse Dutch news articles about traffic accidents with decently high accuracy. In some cases the accuracy probably matches or even surpasses that of a human user, and in the other cases the accuracy is still decently high.

A few shortcomings can be identified for each of the methods used to answer the sub questions. There are three main problems with the methods used for the first sub question. Firstly, the retrieval of dates is very inaccurate when a web page does not use professional web design; in these cases the web scraping functions are unlikely to find the date of the article. Secondly, due to the use of a manually made cookie file, not all news websites are accessible to the web scraping functions. This can be solved by adding more cookies to this file, but that currently has to be done manually, increasing the amount of work needed to maintain these functions. Lastly, the Article cleaner function performs much better when it is given multiple articles from the same news outlet. This might mean that the web scraping functions do not perform as well when just a single URL is given to them.

The largest shortcoming of the DC extractor, made to answer the second sub question, is the relatively low recall of the multiple deaths class. This is due to the limitations of the used POS tagger and dependency parser. A possible solution is extending the training of the POS tagger and dependency parser with a newspaper-based corpus, which might increase the performance.

The problems with the methods used for the third sub question are similar to those of the second sub question: the vehicle extractor is also limited by the used POS tagger and dependency parser. There are, however, a few other smaller limitations present in the vehicle extractor. The filter list with vehicle names does not contain all vehicle and vehicle brand names, which causes some vehicles not to be recognised. Furthermore, the presence in an article of mentioned vehicles which are not involved in the accident is a large cause of the reduction in accuracy. Currently no way to filter out these vehicles exists within the vehicle extractor.

5.1 Future Work

Further research can be done on both web scraping and information extraction. While the current web scraping functions are able to scrape content from smaller and less professional sites, they are by no means perfect: dates are still hard to find, and content from these websites is cleaned less thoroughly. Further research on web scraping of smaller websites might prove helpful in creating a general web scraper that is able to scrape content from all sites. Complementing this research would be further research into bypassing cookie and privacy statements. Many websites reroute a first-time visitor to a different page for privacy or cookie statements, which interferes with the web scraping process. Finding a way to automatically bypass these statements would greatly improve the ease of web scraping.


As previously described, one of the most limiting factors in the performance of both information extraction models is the POS tagger and dependency parser. For this thesis the POS tagger and dependency parser from the Spacy package were used. Further research into different packages providing POS taggers and dependency parsers might prove useful. A different option is to improve the POS tagger and dependency parser from the Spacy package by training them with more Dutch data. Currently the Dutch model from Spacy is trained on Wikipedia data; further training this model with other data, like Dutch newspaper data, might improve the performance. Aside from improving the used tools, there can also be further research into the use of more tools. Currently Named Entity Recognition (NER) is not used, since the performance of the NER model provided with Spacy was too unreliable. Finding a different NER model, or improving the one used by Spacy, could make NER a viable tool to increase the performance of the models.

5.2 Acknowledgements

I would like to thank my thesis advisor, dr. Maarten Marx, for the great feedback given during the writing of this thesis. I would also like to thank Ella Casimiro and Sander Siepel for the interesting discussions and sharing of ideas, as well as the proofreading of this thesis. Furthermore, I would like to thank my parents for supporting me through my studies.

6 REFERENCES

Banko, M., Cafarella, M. J., Soderland, S., Broadhead, M., & Etzioni, O. (2007). Open information extraction from the web. IJCAI, 7, 2670-2676.

Boeing, G., & Waddell, P. (2017). New insights into rental housing markets across the United States: web scraping and analyzing Craigslist rental listings. Journal of Planning Education and Research, 37(4), 457-476.

Cowie, J., & Wilks, Y. (2000). Information extraction. Handbook of Natural Language Processing, 56, 241-260.

Fader, A., Soderland, S., & Etzioni, O. (2011). Identifying relations for open information extraction. Proceedings of the Conference on Empirical Methods in Natural Language Processing, 1535-1545.

Fukuda, K. I., Tsunoda, T., Tamura, A., & Takagi, T. (1998). Toward information extraction: identifying protein names from biological papers. Pac Symp Biocomput, 707(18), 707-718.

Grishman, R. (1997). Information extraction: Techniques and challenges. Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology, 1299(1), 10-27.

Ku, L.-W., Liang, Y.-T., & Chen, H.-H. (2006). Opinion extraction, summarization and tracking in news and blog corpora. Proceedings of AAAI, 100-107.

Lawson, R. (2015). Web scraping with Python. Birmingham, United Kingdom: Packt Publishing Ltd.

Malik, S. K., & Rizvi, S. A. M. (2011). Information extraction using web usage mining, web scrapping and semantic annotation. International Conference on Computational Intelligence and Communication Networks, 465-469.

Ono, T., Hishigaki, H., Tanigami, A., & Takagi, T. (2001). Automated extraction of information on protein–protein interactions from the biological literature. Bioinformatics, 17(2), 155-161.

Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic keyword extraction from individual documents. Text Mining: Applications and Theory, 1-20.

Sarawagi, S. (2008). Information extraction. Foundations and Trends® in Databases, 1(3), 261-377.

Shinyama, Y., Sekine, S., & Sudo, K. (2002). Automatic paraphrase acquisition from news articles. Proceedings of the Second International Conference on Human Language Technology Research, 313-318.

Sokolova, M., & Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing & Management, 45(4), 427-437.

7 APPENDIX

7.1 Appendix A: Python Code

7.1.1 Web Scraping functions.

import zipfile
import re
import http.cookiejar
from collections import Counter

import pandas as pd
import requests
from bs4 import BeautifulSoup


def date_extractor(rawdate):
    monthtonumb = {v: str(k + 1) for k, v in enumerate(
        ['januari', 'februari', 'maart', 'april', 'mei', 'juni', 'juli',
         'augustus', 'september', 'oktober', 'november', 'december'])}
    monthtonumb.update({v: str(k + 1) for k, v in enumerate(
        ['jan.', 'feb.', 'mrt.', 'apr.', 'mei', 'jun.', 'jul.', 'aug.',
         'sept.', 'okt.', 'nov.', 'dec.'])})
    monthtonumb.update({v: str(k + 1) for k, v in enumerate(
        ['jan', 'feb', 'mrt', 'apr', 'mei', 'jun', 'jul', 'aug', 'sept',
         'okt', 'nov', 'dec'])})
    monthtonumb.update({'febr': '2', 'febr.': '2', 'sep': '9', 'sep.': '9'})
    date = rawdate.lower().split(' ')
    for x in range(len(date)):
        if date[x] in monthtonumb:
            date[x] = monthtonumb[date[x]]
    date = ' '.join(date)
    # Check if date is already in right format
    if re.search('([0-9]){4}[:\/-][0-9]+[:\/-][0-9]+', date) != None:
        date = re.search('([0-9]){4}[:\/-][0-9]+[:\/-][0-9]+', date).group()
        return re.sub('[:\/]', '-', date)
    if re.search('([0-9]){2}|[0-9]', date) != None:
        day = re.search('([0-9]){2}|[0-9]', date).group()
        date = re.sub(day, '', date, 1)
    else:
        day = '?'
    if re.search('([0-9]){2}|[0-9]', date) != None:
        month = re.search('([0-9]){2}|[0-9]', date).group()
        date = re.sub(month, '', date, 1)
    else:
        month = '?'
    if re.search('([0-9]){4}', date) != None:
        year = re.search('([0-9]){4}', date).group()
    else:
        year = '????'


def artikel_scraper(html):
    # create soup
    soup = BeautifulSoup(html, 'html.parser')
    # remove links to minimise useless text
    links = soup.find_all('a')
    for link in links:
        link.extract()
    # get titel from soup
    titel = soup.find(re.compile('.*title.*'))
    # get date from soup
    date = soup.find(re.compile('.*'), re.compile('^(?!vali|unvali).*date.*'))
    if date == None:
        date = soup.find(re.compile('.*time.*'))
    # select text most likely to belong to article
    artikels = [x for x in soup.find_all(string=re.compile('.')) if len(x) > 20
                and len(re.findall('[-\(\\\#\:\=)\[\]\n]', x)) / len(x) < 0.03
                and '©' not in x]
    if date != None and titel != None:
        # remove titel from artikel text
        if titel.get_text() in artikels:
            artikels.remove(titel.get_text())
        return artikels, titel.get_text(), date.get_text()
    elif date != None and titel == None:
        return artikels, titel, date.get_text()
    elif date == None and titel != None:
        return artikels, titel.get_text(), '??/??/????'
    else:
        return artikels, titel, '??/??/????'


def artikel_cleaner(artikels, titels, dates):
    # find common sentences within the articles
    counted = Counter(x for xs in artikels for x in set(xs))
    for sentence in {x: counted[x] for x in counted if counted[x] >= 10}:
        # remove common sentence from article
        for artikel in artikels:
            while sentence in artikel:
                artikel.remove(sentence)
        # extra check for removing common sentence
        for artikel in artikels:
            if sentence in artikel:
                artikel.remove(sentence)
    # return only non empty and not overly large strings
    return [' '.join(artikels[x]) for x in range(len(artikels))
            if artikels[x] != [] and len(artikels[x]) < 50], \
           [titels[x] for x in range(len(artikels))
            if artikels[x] != [] and len(artikels[x]) < 50], \
           [date_extractor(dates[x].strip()) for x in range(len(artikels))
            if artikels[x] != [] and len(artikels[x]) < 50]

def bronschraper(bronlist):
    artikels = []
    titels = []
    dates = []
    bronnen = []
    cj = http.cookiejar.MozillaCookieJar('cookies.txt')
    cj.load()
    # loop over sources in sourcelist
    for bron in bronlist:
        if 'http' in bron:
            try:
                # try to request the source and scrape its contents
                url = bron
                hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                                     'AppleWebKit/537.11 (KHTML, like Gecko) '
                                     'Chrome/23.0.1271.64 Safari/537.11',
                       'Accept': 'text/html,application/xhtml+xml,'
                                 'application/xml;q=0.9,*/*;q=0.8',
                       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
                       'Accept-Encoding': 'none',
                       'Accept-Language': 'en-US,en;q=0.8',
                       'Connection': 'keep-alive'}
                req = requests.get(url, timeout=5, headers=hdr, cookies=cj)
                artikel, titel, date = artikel_scraper(req.content)
                artikels.append(artikel)
                titels.append(titel)
                dates.append(date)
                bronnen.append(bron)
            except:
                # skip source if it can't be reached
                pass
    # clean articles and make the data dataframe ready
    artikels, titels, dates = artikel_cleaner(artikels, titels, dates)
    data = {'Titel': titels, 'Artikel': artikels, 'Datum': dates}
    return pd.DataFrame(data)


def filescraper(file, targetmap):
    artikels = []
    titels = []
    dates = []
    zipje = zipfile.ZipFile(file)
    # loop over files in zipfile
    for f in zipje.namelist():
        # check if files are in targetmap
        if '/' + targetmap + '/' in f:
            open_file = zipje.open(f)
            artikel, titel, date = artikel_scraper(open_file)
            artikels.append(artikel)
            titels.append(titel)
            dates.append(date)
    # clean articles and make the data dataframe ready
    artikels, titels, dates = artikel_cleaner(artikels, titels, dates)
    data = {'Titel': titels, 'Artikel': artikels, 'Datum': dates}
    return pd.DataFrame(data)


7.1.2 DC Extractor. import spacy import pandas as pd import re nlp = spacy.load('nl_core_news_sm') """

Taken from stackoverflow:

https://stackoverflow.com/questions/493174/is-there-a-way-to-convert-number-words-to-integers

"""

def text2int(textnum, numwords={}):
    wordnums = list(range(1, 3000))
    for x in range(len(wordnums)):
        wordnums[x] = str(wordnums[x])
    if textnum in wordnums:
        return int(textnum)

    if not numwords:
        units = ["nul", "een", "twee", "drie", "vier", "vijf", "zes",
                 "zeven", "acht", "negen", "tien", "elf", "twaalf",
                 "dertien", "veertien", "vijftien", "zestien", "zeventien",
                 "achttien", "negentien"]
        tens = ["", "", "twintig", "dertig", "veertig", "vijftig",
                "zestig", "zeventig", "tachtig", "negentig"]
        scales = ["honderd", "duizend", "miljoen", "miljard", "triljoen"]

        numwords["en"] = (1, 0)
        for idx, word in enumerate(units):
            numwords[word] = (1, idx)
        for idx, word in enumerate(tens):
            numwords[word] = (1, idx * 10)
        for idx, word in enumerate(scales):
            numwords[word] = (10 ** (idx * 3 or 2), 0)

    current = result = 0
    for word in textnum.split():
        if word not in numwords:
            if word == 'beide':
                return 2
            return 1
        scale, increment = numwords[word]
        current = current * scale + increment
        if scale > 100:
            result += current
            current = 0

    return result + current
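text2int resolves Dutch number words via a units/tens/scales lookup table. The core of that lookup can be illustrated with a stripped-down, self-contained sketch (`woord_naar_getal` is a hypothetical helper written for this illustration, not part of the thesis code, and it ignores scales and compounds):

```python
# simplified sketch of the units/tens lookup that text2int builds
units = ["nul", "een", "twee", "drie", "vier", "vijf", "zes", "zeven",
         "acht", "negen", "tien", "elf", "twaalf"]
tens = ["", "", "twintig", "dertig", "veertig", "vijftig"]

numwords = {}
for idx, word in enumerate(units):
    numwords[word] = idx
for idx, word in enumerate(tens):
    if word:
        numwords[word] = idx * 10

def woord_naar_getal(woord):
    # digit strings pass through unchanged, known words are looked up,
    # and unknown words default to 1, mirroring text2int's fallback
    if woord.isdigit():
        return int(woord)
    return numwords.get(woord, 1)

print(woord_naar_getal('twee'))    # 2
print(woord_naar_getal('dertig'))  # 30
print(woord_naar_getal('3'))       # 3
```

The default of 1 for unrecognised words matches text2int's behaviour: a sentence mentioning a victim without an explicit numeral is counted as a single victim.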

def om_het_leven(doc, teken, filter_list, result, wordlist, tense):
    if teken.head.text.lower() in wordlist and teken.text.lower() == 'om':
        if tense == 'past':
            result = 1
        for token in doc:
            subject = None
            if token.text.lower() in wordlist and token.dep_ in ['nsubj', 'obj', 'acl'] \
                    and token.head.text.lower() in filter_list:
                if tense == 'sing':
                    result = 1
                if tense == 'mult':
                    result = 2
                if token.text.lower() in wordlist:
                    subject = token.head.text.lower()
            if token.head.text.lower() in wordlist and token.dep_ in ['nsubj', 'obj', 'acl'] \
                    and token.text.lower() in filter_list:
                if tense == 'sing':
                    result = 1
                if tense == 'mult':
                    result = 2
                subject = token.text.lower()
            if tense == 'past':
                subject = token.text.lower()
            if subject != None:
                for token in doc:
                    if token.head.text.lower() == subject and token.dep_ == 'nummod' \
                            and re.search('[-./]', token.text) == None:
                        result = text2int(token.text.lower())
                        if 'Sing' in token.head.tag_:
                            result = 1
                        if result < 10 and result != 1:
                            return [text2int(token.text.lower())]
    return result
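om_het_leven works on spaCy dependency attributes (`text`, `dep_`, `head`). The matching idea can be illustrated without loading a model by mocking those attributes; the namedtuple below is a hand-built stand-in (real spaCy tokens expose `.head` as a Token, not a string, and the dependency labels here are hand-assigned for the example):

```python
from collections import namedtuple

# minimal stand-in for the three spaCy token attributes used above
Token = namedtuple('Token', ['text', 'dep_', 'head_text'])

# hand-built (hypothetical) parse of "Drie fietsers kwamen om"
doc = [
    Token('Drie', 'nummod', 'fietsers'),
    Token('fietsers', 'nsubj', 'kwamen'),
    Token('kwamen', 'ROOT', 'kwamen'),
    Token('om', 'compound:prt', 'kwamen'),
]

filter_list = ['fietser', 'fietsers']

# same two-pass idea as om_het_leven: find the subject,
# then look for a numeric modifier attached to it
subject = next((t.text for t in doc
                if t.dep_ == 'nsubj' and t.text.lower() in filter_list), None)
nummod = next((t.text for t in doc
               if subject is not None and t.head_text == subject
               and t.dep_ == 'nummod'), None)
print(subject, nummod)  # fietsers Drie
```

Once the numeral token is found, text2int turns it into the victim count (here "Drie" → 3).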

def aantal_slachtoffers(tekst):
    # Step 1
    doc = nlp(tekst)
    indicator_list = ['doden', 'dode', 'dodelijk', 'dodelijke', 'dood',
                      'doodgereden', 'overleden', 'overleed', 'overlijdt',
                      'stierf', 'gestorven', 'sterft', 'fataal', 'fatale',
                      'omgekomen', 'omkwam', 'omkwamen']
    filter_list = ['vrouw', 'vrouwen', 'man', 'mannen', 'jongen', 'jongens',
                   'meisje', 'meisjes', 'dame', 'dames', 'heer', 'heren',
                   'persoon', 'personen', 'fietser', 'fietsers', 'fietsster',
                   'fietssters', 'automobilist', 'automobilisten',
                   'automobiliste', 'automobilistes', 'bestuurder',
                   'bestuurders', 'bestuurster', 'bestuursters',
                   'motorrijder', 'motorrijders', 'motorrijdster',
                   'motorrijdsters', 'bromfietser', 'bromfietsers',
                   'bromfietsster', 'bromfietssters', 'scooterrijder',
                   'scooterrijders', 'scooterrijdster', 'scooterrijdsters',
                   'inzittende', 'inzittenden', 'zussen', 'broers', 'neven',
                   'nichten', 'mens', 'mensen', 'vrachtwagenchauffeur',
                   'slachtoffer', 'slachtoffers', 'hij', 'zij', 'echtpaar',
                   'die']
    result = 0

    # Step 2
    if 'dodelijk' in tekst.lower():
        verwijzing = None
        obj = None
        for token in doc:
            if token.text.lower() in ['dodelijk', 'dodelijke'] and token.dep_ == 'obj':
                obj = token.head.text.lower()
        if obj != None:
            for token in doc:
                if token.head.text.lower() == obj and token.dep_ == 'obj' \
                        and token.text.lower() != 'dodelijk':
                    obj = token.text.lower()
            for token in doc:
                if token.head.text.lower() == obj and token.dep_ == 'nummod' \
                        and re.search('[-./]', token.text) == None:
                    return text2int(token.text.lower())
                if token.head.text.lower() == obj and token.dep_ == 'appos':
                    verwijzing = token.text.lower()
            if verwijzing != None:
                for token in doc:
                    if token.head.text.lower() == verwijzing and token.dep_ == 'nummod' \
                            and re.search('[-./]', token.text) == None:
                        return text2int(token.text.lower())

    # Step 3

    for token in doc:
        if token.text.lower() == 'leven' and \
                token.head.text.lower() in ['gekost', 'geëist', 'kostte', 'eiste']:
            for token in doc:
                subject = None
                if token.head.text.lower() in ['gekost', 'geëist', 'kostte', 'eiste'] \
                        and token.dep_ in ['nsubj', 'obj'] \
                        and token.text.lower() in filter_list:
                    subject = token.text.lower()
                if subject != None:
                    for token in doc:
                        if token.head.text.lower() == subject and token.dep_ == 'nummod' \
                                and re.search('[-./]', token.text) == None:
                            result = text2int(token.text.lower())
                            if 'Sing' in token.head.tag_:
                                result = 1
                            if result < 10 and result != 1:
                                return text2int(token.text.lower())

    # Step 4
    for token in doc:
        result = om_het_leven(doc, token, filter_list, result,
                              ['kwam', 'komt'], 'sing')
        result = om_het_leven(doc, token, filter_list, result,
                              ['kwamen', 'komen'], 'mult')
        if type(result) == list:
            return result[0]
        result = om_het_leven(doc, token, filter_list, result,
                              ['gekomen'], 'past')
        if type(result) == list:
            return result[0]

        # Step 5
        if any(indicator == token.head.text.lower() for indicator in indicator_list) \
                or any(indicator == token.text.lower() for indicator in indicator_list):
            result = 1
            subject = None
            if token.dep_ == 'nsubj' and token.text.lower() in filter_list:
                subject = token.text.lower()
            if subject != None:
                for token in doc:
                    if token.head.text.lower() == subject and token.dep_ == 'nummod' \
                            and re.search('[-./]', token.text) == None:
                        result = text2int(token.text.lower())
                        if 'Sing' in token.head.tag_:
                            result = 1
                        if result < 10 and result != 1:
                            return text2int(token.text.lower())
            if token.dep_ == 'nummod' and re.search('[-./]', token.text) == None:
                result = text2int(token.text.lower())
                if 'Sing' in token.head.tag_:
                    result = 1
                if result < 10 and result != 1:
                    return text2int(token.text.lower())
                if result > 10:
                    result = 1

    return result
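The guard `re.search('[-./]', token.text) == None`, which recurs throughout aantal_slachtoffers, rejects any numeral token containing '-', '.' or '/' so that dates (12-06), decimals (1.5) and similar tokens are not mistaken for victim counts. A small self-contained illustration:

```python
import re

def is_plain_count(token_text):
    # reject anything containing '-', '.' or '/' (dates, times, decimals)
    return re.search('[-./]', token_text) is None

kandidaten = ['twee', '3', '12-06', '1.5', '21/06']
print([t for t in kandidaten if is_plain_count(t)])  # ['twee', '3']
```

Only the surviving tokens are passed on to text2int.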

7.1.3 Vehicle Extractor.

import spacy
import pandas as pd
import re

nlp = spacy.load('nl_core_news_sm')


def unique(list1):
    # initialize an empty list
    unique_list = []
    # traverse all elements
    for x in list1:
        # append x only if it is not in unique_list yet
        if x not in unique_list:
            unique_list.append(x)
    return unique_list


def voertuigen(tekst):
    # Step 1
    doc = nlp(tekst)
    result = []
    gevonden_voertuigen = []
    multiples = []
    indicators = ['gereden', 'beschadigd', 'ongeluk', 'verongelukt',
                  'geknald', 'aanrijding', 'raakte', 'raken', 'reden',
                  'reed', 'beschadigde', 'geslingerd', 'gebotst',
                  'gesleurd', 'gekatapulteerd']
    voertuig_list = ['asfalteerder', 'bestelauto', 'auto', 'autobus',
                     'bakfiets', 'bakwagen', 'brommer', 'bestelwagen',
                     'bestelbus', 'buggy', 'bulldozer', 'bus',
                     'bedrijfsbus', 'bedrijfsauto', 'bedrijfswagen',
                     'busje', 'cabrio', 'camper', 'cementwagen', 'combine',
                     'crossfiets', 'damesfiets', 'dienstauto', 'dieplader',
                     'dierenambulance', 'driewieler', 'dubbeldekker',
                     'wielrenner', 'wielrenster', 'fiets', 'fietskar',
                     'gierwagen', 'goederentrein', 'intercity',
                     'invalidenwagen', 'jeep', 'kiepwagen', 'kinderwagen',
                     'koelwagen', 'kraanwagen', 'ladderwagen', 'landrover',
                     'leswagen', 'ligfiets', 'lijkwagen', 'limousine',
                     'locomotief', 'meisjesfiets', 'mestwagen',
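The `unique` helper above is an order-preserving dedupe: the first occurrence of each vehicle is kept and later repeats are dropped. On Python 3.7+ the same behaviour can be had from `dict.fromkeys`; a sketch of the equivalent one-liner:

```python
def unique(list1):
    # order-preserving dedupe, same behaviour as the loop version above
    return list(dict.fromkeys(list1))

print(unique(['auto', 'fiets', 'auto', 'bus', 'fiets']))  # ['auto', 'fiets', 'bus']
```

This relies on dict preserving insertion order, which is guaranteed from Python 3.7 onward.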
