
PREDICTING TRAFFIC ACCIDENT INJURY SEVERITY IN GREAT BRITAIN USING MACHINE LEARNING TECHNIQUES

Master Thesis Econometrics – Specialisation Big Data
Eva Elisabeth Oudemans (10003142)

University of Amsterdam, Amsterdam, Netherlands 2016

Supervisor:
Second Reader:


TABLE OF CONTENTS

1. INTRODUCTION
1.1 The effects of traffic accidents
1.2 Features contributing to injury severity
1.3 Classical statistics versus Machine learning
1.4 Thesis outline
2. METHODOLOGY
2.1 Recursive Feature Elimination
2.2 Support Vector Machine
2.2.1 Choice of Kernel & Penalty Parameter
2.3 Logistic Regression
2.3.1 Penalized Logistic Regression
2.4 Accuracy Ratio & Area Under Receiver Operating Characteristic
3. DATA
4. RESULTS
4.1 Impact of the variables
4.2 Penalty parameter LR
4.3 Penalty parameter SVM
4.4 Comparison LR and SVM
5. CONCLUSION
6. DISCUSSION
7. REFERENCES


ABSTRACT

In this thesis, a Support Vector Machine (SVM) learning model and a Logistic Regression (LR) model are applied to classify the severity of injuries caused by car accidents in Great Britain, and the two models are compared on the accuracy of their predictions of traffic injury severity. The models were developed on 51,178 personal injury accidents on public roads reported to the police, with the severity of each accident classified as severe or non-severe. The performance of the classifications was measured by the Area Under the Receiver Operating Characteristic curve (AUROC). The AUROC of the LR was found to be 0.723, higher than that produced by the SVM model (0.592). However, due to the small proportion of severe injuries in the dataset (7.25%), neither model properly succeeded in predicting severe injuries when the classification threshold was kept at 0.5. Further analyses could apply the same techniques to different datasets in which the incidence of severe injuries is higher.

1. INTRODUCTION

Although engineers and researchers in the vehicle industry keep designing better and safer vehicles, severe and fatal traffic accidents remain a reality. Every day, millions of people take part in road traffic. In the last year in Europe, almost 30,000 people suffered a fatal injury and more than 200,000 people left the hospital with a life-changing, severe injury. For large institutions such as the European Union and the World Health Organization (WHO), road safety has always been a main topic on the agenda. Both the European Union and the WHO have even set an ambitious goal of halving the number of annual road traffic deaths by 2020.

1.1 THE EFFECTS OF TRAFFIC ACCIDENTS

Traffic accidents have numerous effects on people, society and nature. Much research has been done on the psychological effects of a traffic accident. Mayou et al. (1993) studied the consequences of being a road traffic accident victim and concluded that psychiatric symptoms and disorders are frequent after both major and less severe injuries, and that the post-traumatic symptoms a traffic accident causes are common and disabling. Besides such health effects, traffic accidents also cause physical damage to the victim and the vehicle, and involve high aftercare costs that affect the welfare of society by drawing on scarce medical resources. A study by Bastida (2004), using a cost-of-illness method that divides costs into health services costs, insurance administration costs and the cost of material damage to vehicles, shows that the total cost of traffic accidents in Spain was more than 6 billion euros, representing 1.35% of the gross national product. Beyond the damage traffic accidents cause to the welfare of a country, a recent study by Jou and Chen (2015) examines the external costs involved in traffic accidents, including air pollution and time delay.


To minimize the costs of a traffic accident injury, physical as well as financial, prediction models can be developed to detect patterns and classify accidents into different severity types. Over the last decade, researchers have paid increasing attention to determining the features that contribute most significantly to the severity of injuries caused by traffic accidents. Numerous characteristics, such as driver and casualty characteristics, highway characteristics, meteorological characteristics, vehicle characteristics and even the type of accident, can contribute to the severity of an injury after a traffic accident. Understanding which characteristics contribute most to the severity of a traffic accident injury helps decision makers reduce the severity of crashes and thereby minimize the costs of a traffic accident injury.

1.2 FEATURES CONTRIBUTING TO INJURY SEVERITY

Classical statistical models have been widely used for traffic accident injury severity analysis. Injury severity data is often classified into several discrete categories, such as slight injury, severe injury and fatal injury. This kind of categorical (ordered) data suits models such as the Logistic Regression model or variations of it. Al-Ghamdi (2001) used a Logistic Regression to examine the contribution of multiple features to traffic accident injury severity. He classified accident injury severity as either fatal or non-fatal; this binary nature of the dependent variable made the Logistic Regression approach most suitable. He found that Logistic Regression is a promising tool for providing meaningful insights when interpreting and determining relationships between features and injury severity.

In 2011, Oña et al. used Bayesian networks to find the features that contribute to traffic accident injury severity. They found that drivers in the 18-25 age group ran the highest risk of suffering a severe injury. They also controlled for accident type characteristics, finding that head-on collisions and rollovers significantly increase the probability of a severe or fatal traffic accident injury. Other characteristics were identified by Kashani et al. (2011): using classification and regression trees (CART), they determined that improper overtaking and seatbelt usage are the most important features affecting traffic accident injury severity.

Çelik and Oktay (2014) conducted a retrospective cross-sectional study by analysing 11,771 traffic accidents reported by the police in Turkey. Classifying injury severity as fatal, injury or no injury, they performed a multinomial logit analysis to determine the characteristics that most affect the severity of traffic accident injuries. They found the main characteristics that increase the probability of a fatal injury to be: drivers over the age of 65, single-vehicle accidents, accidents occurring on state routes, highways or provincial roads, and the presence of pedestrian crosswalks. Their results also showed that good weather and light conditions have a significant impact on decreasing the probability of fatal injuries.

1.3 CLASSICAL STATISTICS VERSUS MACHINE LEARNING

However, classical statistical models are not always the best inference instruments for predicting injury severity, as they can suffer from some limitations. Statistical regressions rely on many assumptions about the form and construction of the data, including a linear functional form between the dependent and the explanatory variables. When these assumptions fail to hold, it can be hard to find robust, reliable insights about relationships in the data. Also, with our world becoming increasingly reliant on technology, firms know it is important to keep track of their historical data. Datasets are growing exponentially, and regular statistical techniques prove to have trouble coping with data at this volume. Machine learning techniques help mine these big datasets, relax the requirements on the data and lower the dependence on heuristics. To address the limitations of statistical modelling, many researchers have developed (non-parametric) machine learning techniques for predicting injury severity.

Chong et al. (2005) considered numerous machine learning techniques to model the severity of traffic accidents and to help develop new traffic safety control policies. They examined neural networks trained using hybrid learning techniques, Support Vector Machines, decision trees and a concurrent hybrid model involving decision trees and neural networks. They concluded that, among these machine learning paradigms, the hybrid decision tree-neural network approach outperformed all other techniques, giving the highest accuracy in predicting severity. Li et al. (2011) compared the Support Vector Machine technique to a linear ordered probit model for a crash injury severity analysis with data from the United States. Their goal was to show that SVM can be used to model injury severity, and they indeed found that the SVM technique worked best: the percentage of correct predictions for the SVM model was 48.8%, higher than that produced by the ordered probit model (44.0%).

Not only in injury severity analysis but also in other fields, many researchers have tried to show that machine learning techniques such as SVM beat regular classical statistics. When it comes to classification, models such as the aforementioned probit model or the Logistic Regression model prove to be good benchmarks against which to compare the predictive power of machine learning techniques. Machine learning techniques are very popular in the biomedical field. Muniz et al. (2010) compared several machine learning techniques, among them probabilistic neural network (PNN) and Support Vector Machine classifiers, with a Logistic Regression (LR). They tried to classify certain patterns in order to evaluate the effect of treatment in Parkinson's disease. They found that LR, PNN and SVM all had high indexes for classifying the patterns; however, PNN gave the most restrictive predictions and was found to be the best classifier. Also in the medical field, in a very recent study by Decruyenaere et al. (2015), multiple machine learning techniques and Logistic Regression were compared in terms of predicting delayed graft function (DGF) after kidney transplantation. Since Logistic Regression had always been used for predicting DGF, they wanted to see whether machine learning techniques are more valuable. Nine types of prediction models were fitted to a data frame of 497 kidney transplantations: Logistic Regression, linear discriminant analysis, quadratic discriminant analysis, Support Vector Machine (using several Kernel functions), decision trees, random forest and stochastic gradient boosting. The performance of the models was evaluated through sensitivity, positive predictive value and the area under the receiver operating characteristic curve (AUROC). They found that the linear SVM (i.e. SVM with a linear Kernel function) had the highest predictive power (with an AUROC of 84.3%) and was the only model that could outperform Logistic Regression. Musa (2013) compared the predictive power of SVM with that of LR by fitting the models to 13 different data sets. He found that, in general, SVM and LR have comparable performance measures: SVM may perform better on unbalanced data sets, whereas LR performs better on balanced data sets. He concludes that LR has higher interpretability, while SVM is more of a black box predictor².

1.4 THESIS OUTLINE

Combining the results of the aforementioned researchers, the main aim of this thesis is to apply a non-parametric Support Vector Machine (SVM) learning model and a simple linear Logistic Regression (LR) model to classify the severity of an injury in Great Britain, and to compare which model makes the most accurate predictions about traffic injury severity, given that an accident occurred. After identifying the best model, advice could be given to the government in order to reduce the costs involved in preventing severe or fatal traffic accident injuries. An overview of those costs is displayed in Table 1. Comparing SVM with LR on data from Great Britain has never been done before in injury severity analysis.

Table 1: The table shows the prevention costs involved in a fatal, severe and a slight injury in Great Britain. Source: https://www.gov.uk/government/publications/reported-road-casualties-great-britain-annual-report-2012

SVM is a preferred learning algorithm because it makes use of a Kernel function, which allows it to operate in a high-dimensional implicit feature space without ever having to compute the actual coordinates of the data in that space. It develops a decision boundary between two classes by mapping data with the Kernel function into a higher-dimensional space and then maximizing the margin of the separating hyperplane within that space. LR is used to estimate the conditional probability that a severe or fatal traffic accident injury occurs, given that an accident took place.

When trying to identify which model makes the most accurate predictions of severe traffic accidents, a few elements have to be taken into account: the feature selection algorithm (FSA) and the choice of Kernel function. Feature selection is needed to reduce dimensionality. This is done by the backward selection procedure also known as recursive feature elimination, which transforms the original feature space into a smaller one.

² A black box is a device or system that takes an input and produces an output, without revealing any knowledge of what happens in between.


In this thesis, the tested models will first be discussed in more detail. A brief description of the data follows, after which the results are presented. Lastly, a conclusion will be drawn, possible advice for the British government will be formulated, and options for further research will be discussed.

2. METHODOLOGY

In this section, the procedure for injury severity prediction using Support Vector Machine and Logistic Regression techniques is discussed in detail.

2.1 RECURSIVE FEATURE ELIMINATION

Before testing the SVM and LR models on the data, a feature selection algorithm is used to select the key features contributing to the severity of an accident. Feature selection algorithms are used for three main reasons:

1. they reduce dimensionality, thereby reducing overfitting;
2. they help shorten model training times;
3. they simplify models, making them easier for researchers to understand.

The main idea of feature selection is that features that are either irrelevant or redundant can be removed without loss of information. Here, the recursive feature elimination (RFE) algorithm is used to choose the key features that contribute to injury severity.

Recursive Feature Elimination incorporating resampling:

1   for each resampling iteration do
2       Partition data into training and test/hold-back sets via resampling
3       Tune/train the model on the training set using all predictors
4       Predict the held-back samples
5       Calculate variable importance or rankings
6       for each subset size Si, i = 1, ..., S do
7           Keep the Si most important variables
8           Tune/train the model on the training set using Si predictors
9           Predict the held-back samples
10      end
11  end
12  Calculate the performance profile over the Si using the held-back samples
13  Determine the appropriate number of predictors
14  Estimate the final list of predictors to keep in the final model
15  Fit the final model based on the optimal Si using the original training set

Table 2: Recursive Feature Elimination procedure.

Table 2 shows the steps taken to construct the list of features required in the model. The RFE starts by fitting the model to all the features. Each feature is then ranked by calculating its importance to the model. Now, take S as a sequence of ordered subset sizes, the possible values for the number of features to keep. At each iteration, the RFE keeps the Si top-ranked features, refits the model and assesses its performance. The value of Si with the best performance is eventually determined, and the top Si predictors are used to estimate the final list of features to keep in the model. The RFE algorithm here also incorporates a resampling method (the bootstrap) to take the variability caused by feature selection into account. Findings of earlier researchers (Section 1.2) have been taken into account when selecting the appropriate features to test. The resulting features selected for the research are presented in Chapter 3.
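To make the procedure concrete, below is a minimal sketch of cross-validated recursive feature elimination in Python using scikit-learn's RFECV. The thesis does not specify its own implementation, so the estimator choice and the synthetic data here are illustrative only.

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import RFECV
    from sklearn.linear_model import LogisticRegression

    # Synthetic stand-in for the accident data: 10 candidate features,
    # few informative, ~92.5% majority class as in the thesis data.
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                               weights=[0.925], random_state=0)

    # Rank features with the model, drop the least important one per step,
    # and keep the subset size with the best cross-validated performance.
    selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5)
    selector.fit(X, y)
    print(selector.n_features_)   # appropriate number of predictors
    print(selector.support_)      # mask of the features to keep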

Figures 1a and 1b show the first results of the recursive feature elimination algorithm. Figure 1a plots the importance of the features (steps 1-5 in Table 2). Figure 1b shows the appropriate number of variables (steps 6-15 in Table 2). Because of the size of the dataset, features are tested in batches of around 10 variables, and the most important features per batch are used for analysis. In this case, the appropriate number of variables is 1, but 3 variables give the same accuracy; that is why 'Speed limit', 'Skidding and overturning' and 'Light conditions' are selected for analysis.

Figure 1a: The importance of the features. Figure 1b: The appropriate number of variables using RFE.

2.2 SUPPORT VECTOR MACHINE

SVM is a machine learning technique that can be used for either classification or regression. It is widely used and popular because of its strong classification power. Essentially, SVM creates a decision boundary between two classes. Suppose there are several data points, each belonging to one of the classes; the goal is then to find a hyperplane that separates these points. There may be many hyperplanes that classify the data, which is why it is most reasonable to look for the hyperplane with the largest separation (margin) between the data points. In the linear case this is called the maximum-margin hyperplane. This hyperplane can be written as

$$y(\mathbf{x}) = \mathbf{w}^{T}\phi(\mathbf{x}) + b,$$

where $\mathbf{w}$ denotes a weight vector, $b$ a bias and $\phi(\mathbf{x})$ a fixed feature-space transformation (Figure 2).


Figure 2: Feature-space transformation and maximum-margin hyperplane.

In order to find the optimal solutions for the parameters $\mathbf{w}$ and $b$, the following optimization problem has to be solved using Lagrange multipliers:

$$\min_{\mathbf{w},\,b,\,\xi}\ \frac{1}{2}\|\mathbf{w}\|^{2} + C\sum_{n=1}^{N}\xi_{n} \quad \text{s.t.}\quad t_{n}\left(\mathbf{w}^{T}\phi(\mathbf{x}_{n}) + b\right) \ge 1 - \xi_{n},\quad \xi_{n} \ge 0,$$

where $\xi_{n}$ represents the slack variable for observation $n$ and $C$ (> 0) is the specified penalty parameter on the error term.

The SVM algorithm was invented as early as 1963 by Vladimir N. Vapnik and Alexey Ya. Chervonenkis. That algorithm, however, only provided a framework for linear classification. In 1992, Bernhard E. Boser, Isabelle M. Guyon and Vladimir N. Vapnik suggested a way to create nonlinear classifiers by applying kernel functions to maximum-margin hyperplanes. The Kernel function avoids the explicit mapping of data points described above for the linear case; instead, it calculates the inner product of two data points in the transformed space, making it possible to construct a non-linear decision boundary in high-dimensional spaces.

2.2.1 CHOICE OF KERNEL & PENALTY PARAMETER

A crucial problem for SVM is choosing the right Kernel function for the classification task at hand. A good starting point for choosing the optimal kernel is prior knowledge from other researchers and known invariances. SVM has four basic Kernels:

- Linear: $K(\mathbf{x}, \mathbf{x}') = \mathbf{x}^{T}\mathbf{x}'$
- Polynomial: $K(\mathbf{x}, \mathbf{x}') = (\gamma\,\mathbf{x}^{T}\mathbf{x}' + c)^{d},\ \gamma > 0$
- Radial Basis Function (RBF): $K(\mathbf{x}, \mathbf{x}') = \exp(-\gamma\,\|\mathbf{x} - \mathbf{x}'\|^{2}),\ \gamma > 0$
- Sigmoidal: $K(\mathbf{x}, \mathbf{x}') = \tanh(\gamma\,\mathbf{x}^{T}\mathbf{x}' + c)$,

where $\gamma$ is the kernel parameter. For this thesis a (Gaussian) radial basis function kernel is used. Even though other researchers have also proposed other Kernels, the RBF is most commonly used for injury severity analysis. Hsu et al. (2010) explain some advantages of the RBF and why it is a reasonable choice: it can handle possible nonlinear relationships between the features and the classes, and it has fewer hyperparameters influencing the complexity of model selection than other basic Kernels, such as the Polynomial.

In the RBF, two parameters are important. The first is the kernel parameter $\gamma$, which can be written as $\frac{1}{2\sigma^{2}}$. When using the RBF, it is beneficial to normalize the data, because then $\sigma$ does not need to be adjusted with every iteration. This is done by setting the mean to zero and the standard deviation ($\sigma$) to 1. The second parameter is the penalty parameter $C$. The penalty parameter, also called the cost of misclassification (hence the letter C), penalizes misclassifications. A small value of $C$ allows more misclassifications, whereas a large value of $C$ forces the model to fit the data more strictly. The risk of a low value of $C$ is that it can underfit the data, whereas a high value of $C$ can result in overfitting.
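As an illustration of these two parameters, the following is a minimal sketch in Python with scikit-learn: the data is normalized and an RBF-kernel SVM is fitted for a grid of C values. The data is synthetic (mimicking the class imbalance of the accident data); the thesis's own implementation is not shown in the text.

    from sklearn.datasets import make_classification
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Synthetic, imbalanced stand-in data (~7.5% positives).
    X, y = make_classification(n_samples=2000, n_features=10, weights=[0.925],
                               random_state=0)

    # Normalize to mean 0 and standard deviation 1, then sweep the penalty C.
    for C in [0.01, 0.1, 1, 10, 31.62]:
        model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=C))
        model.fit(X, y)
        print(C, model.score(X, y))   # accuracy for each penalty value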

2.3 LOGISTIC REGRESSION

Because the outcome variable is binary, a linear regression is not useful for producing appropriate classifications, since it might produce probabilities greater than one or smaller than zero. That is why a Logistic Regression is more appropriate. Logistic Regression relaxes some of the assumptions made by a linear regression model: the relationship between the dependent variable and the predictors does not have to be linear, the predictors do not have to be normally distributed, and homoscedastic variances are not required.

Logistic Regression (LR) is not in itself a classifier; however, it produces probabilities between zero and one, and in this thesis it is used to predict the probability of a severe or fatal injury after a car accident. The model takes the form

$$\mathbb{P}[y = 1 \mid X = x] = \frac{e^{x'\beta}}{e^{x'\beta} + 1} = \frac{1}{1 + e^{-x'\beta}} \equiv \sigma(x'\beta),$$

where $\sigma(x'\beta)$ represents the sigmoidal S-shaped function.

Figure 3: The sigmoidal function $\sigma(x'\beta)$.

When the probability is greater than 0.5, the observation is classified as a severe or fatal injury. To test the validity of the Logistic Regression (and the selected features), a simple generalized linear model is fitted, and McFadden's pseudo R-squared is then used to assess the predictive power of the model.

Since the SVM model does not produce coefficients for the features but merely produces classification results, the estimated coefficients of the Logistic Regression model will be discussed in order to say something about the impact of the individual features.
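As a sketch of this step, the following Python snippet (statsmodels, synthetic data) fits a logistic regression and reads off McFadden's pseudo R-squared and the odds-ratios discussed in Chapter 4; the data and settings are illustrative, not the thesis's own.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 3))
    p = 1 / (1 + np.exp(-(-2.5 + 0.5 * X[:, 0])))   # rare positives (~8%)
    y = (rng.random(5000) < p).astype(int)

    fit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
    print(fit.prsquared)       # McFadden's pseudo R-squared: 1 - llf / llnull
    print(np.exp(fit.params))  # odds-ratios: exponent of the coefficients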


2.3.1 PENALIZED LOGISTIC REGRESSION

Sometimes, when studying rare-events data, regular Logistic Regression may underestimate the probability of an event. That is why, just as with SVM, it can be beneficial to add a penalty term to the Logistic Regression. Penalized regression shrinks some regression coefficients towards zero by putting a constraint on their size, thereby creating a bias-variance trade-off. The optimization problem takes the form

$$\min_{\beta}\ \sum_{i=1}^{n}\Big(y_{i} - \sum_{j=1}^{m} x_{ij}\beta_{j}\Big)^{2} + \lambda J(\beta).$$

Here $\lambda$ represents the penalty parameter and $J(\beta)$ is known as the penalty function. The choice of $\lambda$ is crucial: a small $\lambda$ provides little shrinkage and can result in overfitting, whereas a large $\lambda$ may cause the model to underfit. Note that if the penalty parameter is set to zero, the model takes its regular form. The behaviour of the resulting estimates depends not only on $\lambda$ but also on the choice of the penalty function $J(\beta)$. Two common forms are known. The L1 penalty results in the LASSO (Tibshirani, 1996); it shrinks many coefficients to exactly zero and leaves the other coefficients with little shrinkage. The L2 penalty, on the other hand, results in a ridge regression (Hoerl & Kennard, 1970); this penalty function tends to produce smaller, but non-zero, coefficients. A combination of L1 and L2 causes fewer coefficients to be exactly zero and more shrinkage of the other coefficients; such a situation, where both penalty functions are imposed, is known as the elastic net (Zou & Hastie, 2005). Different combinations of $\lambda$ and the L1 and L2 penalties will be tested to find the model that produces the best injury severity predictions.
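A minimal sketch of a penalized (elastic net) logistic regression in Python with scikit-learn follows; note that scikit-learn parameterizes the penalty strength inversely, as C = 1/λ. The data is synthetic and the settings illustrative.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler

    X, y = make_classification(n_samples=2000, n_features=10, weights=[0.925],
                               random_state=0)
    X = StandardScaler().fit_transform(X)   # penalized regression needs normalized data

    for lam in [0.0032, 0.01, 0.0316, 0.1]:
        # l1_ratio mixes the L1 (lasso) and L2 (ridge) penalties; 0.5 is an elastic net
        clf = LogisticRegression(penalty="elasticnet", solver="saga",
                                 l1_ratio=0.5, C=1.0 / lam, max_iter=5000)
        clf.fit(X, y)
        print(lam, clf.score(X, y))   # accuracy ratio per penalty value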

2.4 ACCURACY RATIO & AREA UNDER RECEIVER OPERATING CHARACTERISTIC

SVM and LR will be cross-validated multiple times, and their performance will be measured by the accuracy ratio and Receiver Operating Characteristic (ROC) analysis. The accuracy ratio of the models is calculated by dividing the data into a training set and a test set. The training set is used to discover predictive relationships, and the test set is then used to evaluate the strength of the predictions. In this thesis, 70% of the data is used for training and 30% for testing. When dividing the data, the proportion of severe or fatal injuries is kept equal in both sets. The ROC is a graphical plot that shows the performance of a classifier for various thresholds.

Table 3 shows the confusion matrix that will be used to construct the ROC curve. On the y-axis of the ROC curve the True Positive (TP) rate is shown, which represents the correctly classified severe or fatal injuries (TP / total number of severe injuries). On the x-axis the False Positive (FP) rate is shown, which represents the falsely predicted severe or fatal injuries (FP / total number of non-severe injuries). To evaluate the performance of the algorithms, the area under the ROC curve (AUROC) is assessed.

                      True value
Prediction          Negative (y = 0)             Positive (y = 1)          Row total
Negative (y = 0)    TN                           FN                        Predicted total non-severe injuries
Positive (y = 1)    FP                           TP                        Predicted total severe injuries
Column total        Total non-severe injuries    Total severe injuries     Total injuries

Table 3: Confusion matrix that visualises the performance of an algorithm.

$$AUC = \int_{-\infty}^{\infty} TPR(\theta)\,FPR'(\theta)\,d\theta, \quad \text{where } \theta \text{ denotes the classification threshold.}$$

The most important property of the AUROC is that it equals the probability that the classifier ranks a randomly chosen positive instance higher than a randomly chosen negative instance. A value of the AUROC closer to one usually indicates better model performance, while a value of 0.5 means the classifier separates the two classes no better than chance. The models will be compared by plotting both ROC curves in the same graph, and the statistical significance of the difference between the (ROC curves of the) two models will be tested with the DeLong test.
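A minimal sketch of this evaluation in Python with scikit-learn follows: a stratified 70/30 split, a fitted classifier, the ROC curve and the AUROC. The DeLong test is not available in scikit-learn, so that step is omitted here; data and model are illustrative.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score, roc_curve
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, n_features=10, weights=[0.925],
                               random_state=0)

    # 70/30 split, stratified so the proportion of positives is equal in both sets.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y,
                                              random_state=0)

    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores = clf.predict_proba(X_te)[:, 1]

    fpr, tpr, thresholds = roc_curve(y_te, scores)   # one (FPR, TPR) point per threshold
    print(roc_auc_score(y_te, scores))               # area under the ROC curve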

3. DATA

In the last ten years, around 1.6 million traffic accidents have been reported to the police in Great Britain, of which more than 220,000 involved severe injuries and 21,382 were fatal.

For the purpose of this thesis, only traffic accidents involving a car where the driver was the casualty are examined. This choice was made because the outcomes should provide insight into measures that car drivers could take to reduce the possibility of a severe or fatal traffic accident injury. The data stems from an open dataset published by the Department for Transport of the government of Great Britain. It provides detailed road safety information about the circumstances of 51,178 personal injury road accidents in Great Britain in 2014, the types of vehicles involved and the consequential casualties. These accidents are shown in Figures 4a and 4b. The statistics relate only to personal injury accidents on public roads that are reported to the police, and subsequently recorded, using the STATS19 accident reporting form.

Figure 4a: Heat map of all accidents reported to the police in 2014 in Great Britain.
Figure 4b: All fatal accidents reported to the police in 2014 in Great Britain.

These figures are plotted using QGIS. This is a free and open-source desktop geographic information system (GIS) application that enables a user to view, edit and analyse geographical data.


The injury severity is separated into two classes: severe and non-severe (Table 3). Here, a severe car accident means that the casualty suffered a severe or fatal injury (y = 1), and non-severe represents a slight injury (y = 0). Because of the sparsity of fatal accidents in Great Britain, the severe and fatal injuries were combined into one class (a sketch of this recoding follows Table 3).

Classes       Description             Frequency   Percentage
Severe        Severe injury (3,471)   3,831       7.5%
              Fatal injury (360)
Non-severe    Slight injury           47,347      92.5%

Table 3: Binary classes.
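As sketched below, this recoding is a one-line operation in Python/pandas. The column name is hypothetical; STATS19 codes casualty severity as 1 = fatal, 2 = serious, 3 = slight.

    import pandas as pd

    # 'casualty_severity' is a hypothetical column name; STATS19 codes
    # severity as 1 = fatal, 2 = serious, 3 = slight.
    df = pd.DataFrame({"casualty_severity": [1, 2, 3, 3, 2, 3]})
    df["severe"] = (df["casualty_severity"] <= 2).astype(int)  # 1 = severe/fatal, 0 = slight
    print(df["severe"].mean())   # proportion of severe observations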

Table 4 shows the 10 variables that were selected using recursive feature elimination. The features are divided into different categories: accident type, driver characteristics, vehicle characteristics, road circumstances and weather conditions. As most values in the dataset are integers, a small degree of noise was added to ensure the features had enough variation for the classification process.

Category                 Feature                     Description                               Frequency   Percentage
Accident Type            Vehicle Manoeuvre           Driving slow                              13,635      27%
                                                     Turning                                   6,443       13%
                                                     Overtaking                                1,083       2%
                                                     Going ahead                               30,017      59%
                         Skidding and Overturning    None                                      42,399      83%
                                                     Skidded or jack-knifed                    5,374       11%
                                                     Overturned                                3,405       7%
Driver Characteristics   Driver Home Area Type       Urban                                     38,969      76%
                                                     Small town                                5,407       11%
                                                     Rural                                     6,802       13%
                         Driver IMD Decile*          10% most deprived - 10% least deprived
                         Age of driver               12 - 99 years old
Vehicle Characteristics  Age of vehicle              1 - 80 years old
                         Engine Capacity (CC)**      170 - 12,170 CC
Road circumstances       Speed limit                 20 - 70 mph
                         Junction detail             No junction within 20 metres              29,720      58%
                                                     Junction                                  21,458      42%
Weather conditions       Light conditions            Daylight                                  38,123      74%
                                                     Darkness - lights lit                     8,957       18%
                                                     Darkness - lights unlit                   304         1%
                                                     Darkness - no lighting                    3,794       7%

Table 4: Features for analysis.

* The Index of Multiple Deprivation (IMD) is used to identify how deprived an area is. It makes use of a range of economic, social and housing data to create a single deprivation score for each small area of Great Britain.
** CC is a volume unit and stands for cubic centimetres; 1500 cc is roughly equal to 1.5 litres.


4. RESULTS

In this chapter, the results of the analyses are discussed in detail. First, the impact of the variables as estimated by the Logistic Regression is discussed; after that, the penalized Logistic Regression is examined. In the third section, the penalty parameter of the Support Vector Machine is evaluated, and lastly the AUROC of LR and SVM are compared and explained.

4.1 IMPACT OF THE VARIABLES

From the recursive feature selection, 10 variables were chosen for classification. These variables have been used in the Logistic Regression, and the estimation results are presented in Table 5. To estimate the predictive power of the model, McFadden's pseudo R² was computed. This gave a value of 0.0853, which is on the low side but is easily explained by the size of the dataset. Instead of coefficients, the odds-ratios are shown; the odds-ratio is equal to the exponent of the coefficient and measures the impact of a one-unit increase on the odds of having a severe or fatal injury (a short worked example follows Table 5). Almost all variables prove to have a highly significant relationship with injury severity at a significance level of α = 0.001. Only the variable engine capacity proves to have a very small impact on the probability of a severe injury: it is significant at the 0.10 level, and a one-unit increase in engine capacity has almost no impact on the odds of a severe accident. This is easily explained by the very wide spread of the variable, ranging from 170 to 12,170 CC.

Features                   Odds-ratio   Std. Error   z value   Pr(>|z|)
Vehicle Manoeuvre          1.4258       0.0174       20.42     < 2e-16
Skidding and Overturning   1.2798       0.0261       9.44      < 2e-16
Light Conditions           1.1138       0.0092       11.76     < 2e-16
Age of Driver              1.0182       0.0009       18.97     < 2e-16
Engine Capacity (CC)       0.9999       0.0000       -1.84     0.0663
Age of Vehicle             1.0400       0.0035       11.14     < 2e-16
Driver Home Area Type      1.0998       0.0228       4.18      2.88e-05
Speed Limit                1.0136       0.0012       11.25     < 2e-16
Junction Detail            1.5291       0.0387       10.97     < 2e-16
Driver IMD Decile          1.0337       0.0066       5.02      5.09e-07
McFadden's Pseudo R²       0.0853

Table 5: Logistic Regression results.
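As a short worked example of reading these odds-ratios (an illustrative computation using the 'Age of vehicle' value from Table 5, not part of the thesis):

    import numpy as np

    odds_ratio = 1.0400            # 'Age of vehicle' from Table 5
    beta = np.log(odds_ratio)      # implied coefficient, ~0.0392
    print(odds_ratio - 1)          # ~0.04: one extra year raises the odds by 4%
    print(odds_ratio ** 10)        # ~1.48: ten extra years raise the odds by ~48%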

This explanation also suits the fairly small odds-ratios of age of driver and age of vehicle. Both have a positive relationship with injury severity: a one-year-older driver has 1.8% higher odds of a severe or fatal injury, and a one-year-older vehicle has 4.0% higher odds. It is interesting to see that Driver Home Area Type also has a significant positive relationship with injury severity: drivers from a small town or rural area have 10% higher odds of a severe or fatal injury than drivers from an urban area. The Driver IMD Decile odds-ratio shows that a driver from a one-decile-less-deprived area has 3.4% higher odds of a severe or fatal injury. It is no surprise that Light Conditions significantly increase the odds of a severe injury: the darker the road, the less a driver sees, and the bigger the chance of a severe or fatal injury. The variables with the highest impact are vehicle manoeuvre, junction detail, and skidding and overturning, which all have a positive relationship with injury severity. It is interesting that this includes both variables from the accident type category, which intuitively makes sense: a car driver who is overtaking another car is making a riskier move than when driving slowly or making a turn (where a car usually also has a lower speed and thus a lower chance of a severe or fatal injury), and when a car is overturned, the severity of the accident is more likely to be high. Junction detail has a perhaps less obvious outcome: it states that a driver more than 20 metres away from a junction has more than 50% higher odds of ending up with a severe or fatal injury. This could be explained by the fact that, near a junction, a car's speed is lower than, for example, on a highway. A higher speed limit is correlated with a more severe accident (odds-ratio of 1.01): if the speed limit increases by one mile per hour, the odds of a severe or fatal injury are 1.4% higher. Some of these relationships are shown graphically in Figure 5.

The estimated coefficients are applied to a simulated data frame, and upper and lower confidence intervals are then constructed. The narrow confidence intervals reflect the significant relationships. It is easy to see that accidents involving overtaking or going ahead, more than 20 metres away from a junction, with a vehicle age of 25, give the highest probability of a severe or fatal injury. Note, however, that these probabilities are around 0.4 and never reach the threshold of p = 0.5.

Figure 5: Predicted probabilities from the Logistic Regression. On the x-axis the age of the vehicle is plotted; the lines show the different junction details, and the four separate panels represent the four different vehicle manoeuvres.

4.2 PENALTY PARAMETER LR

Since the predicted probabilities of the Logistic Regression never seem to exceed the threshold value (probably due to sparsity in the outcome variable), it is worthwhile to investigate the penalized Logistic Regression. Table 6 shows the results of the penalized Logistic Regression for different values of the penalty parameter λ (log-spaced in base 10) and three different penalty functions: the Lasso (L1 penalty), the elastic net (a combination of the L1 and L2 penalties) and the ridge regression (L2 penalty).

Penalty par. (λ)   0.0032   0.01     0.0316   0.1      0.3162   1        3.1623   10       31.6228  100
Lasso              0.9248   0.9248   0.9248   0.9248   0.9248   0.9248   0.9248   0.9248   0.9248   0.9248
Elastic Net        0.9249   0.9249   0.9248   0.9248   0.9248   0.9248   0.9248   0.9248   0.9248   0.9248
Ridge              0.9249   0.9249   0.9248   0.9248   0.9248   0.9248   0.9248   0.9248   0.9248   0.9248

Table 6: Results (accuracy ratios) of the penalized Logistic Regression.

The predictive strength of the penalized model is tested by looking at the accuracy ratio:

$$\text{Accuracy} = \frac{\#\text{ correct predictions}}{N}.$$

From the results it is seen that the predictive power of the model does not depend on the value of the penalty parameter or the penalty function. The accuracy ratio is only slightly better for a combination of a penalty parameter value smaller than 0.01 with the Elastic Net or Ridge regression. This combination gives an accuracy of 0.9249, an increase in accuracy of 0.007%. The Lasso has the same accuracy ratio for every value of λ. The recurring value of 0.9248 is, interestingly, exactly equal to the proportion of non-severe injuries in the class variable (Chapter 3, Table 3). It indicates that the model classified every row as a non-severe accident. To understand that outcome better, the penalized coefficients of the model are evaluated in Table 7.

Feature                    Logistic Regression   Lasso    Elastic Net   Ridge
Age of driver              0.3092                0.0000   0.0362        0.2054
Skidding and Overturning   0.1383                0.0000   0.0884        0.1719
Vehicle Manoeuvre          0.4723                0.0000   0.0733        0.1355
Engine Capacity (CC)       -0.0308               0.0448   0.1916        0.2884
Age of Vehicle             0.1827                0.0000   0.0189        0.1377
Driver IMD Decile          0.0928                0.0000   0.0933        0.1651
Driver Home Area Type      0.0673                0.0000   0.0000        0.0717
Speed limit                0.2077                0.0000   0.0000        0.0743
Light Conditions           0.1764                0.0000   0.0000        0.1278
Junction Detail            0.2096                0.0000   0.0000        -0.0096

Table 7: Comparison of the coefficients of the Logistic Regression and the penalized Logistic Regression for several penalty functions, using penalty parameter λ = 0.0316.

For the different penalized models, a λ equal to 0.0316 ($10^{-1.5}$) is used, because this seems to be the flip-point beyond which none of the penalized models is able to classify any row as a severe or fatal traffic accident injury. In the second column of Table 7, the results of the non-penalized regular Logistic Regression are shown. Please note that these coefficients do not correspond to the regression results of Table 5 (Section 4.1); the reason is that the penalized regression model requires normalized data to produce the best results. When evaluating the coefficients under the Lasso penalty function, it is immediately seen that the function shrinks all coefficients except that of engine capacity to zero. This result explains the incapability of the model to classify a row as a severe or fatal injury. The same explanation goes for the elastic net: four coefficients are forced to zero by the penalty function, and the coefficients of age of driver, skidding and overturning, vehicle manoeuvre and age of vehicle are all greatly shrunk by the elastic net penalty. The ridge regression has modified all coefficients: it has forced the coefficients of skidding and overturning, engine capacity, driver IMD decile and driver home area type to increase in value, while the other coefficients have decreased. Although these results look more promising than those of the other penalty functions, this model too is unable to predict severe or fatal traffic injuries.

It is interesting to see that all three penalty functions have turned the coefficient of engine capacity from a negative into a positive value. This, however, did not improve the capability of the models to predict severe or fatal injuries. Although all three penalty functions modify the original coefficients to increase the predictive capabilities of the original LR model, it appears that the problem remains the threshold value being kept at p = 0.5. For the purpose of this thesis, the non-penalized regular Logistic Regression will be used for the comparison with SVM, since penalizing does not improve the classification process.

4.3 PENALTY PARAMETER SVM

In this section, the predictive power of the Support Vector Machine is evaluated for different values of the penalty parameter. Optimality is measured by checking which value of the penalty parameter C gives the highest accuracy.

Figure 6: Accuracy ratio of the SVM for multiple values of the penalty parameter C.

As mentioned in Section 2.2.1, the penalty parameter penalizes misclassifications in the model. Figure 6 shows that there is a certain flip-point: after $10^{0} = 1$, the prediction accuracy decreases. Before that, the accuracy is equal to 0.925. This value was also found in the previous section and again corresponds to the proportion of non-severe injuries in the dataset. Although a cost parameter smaller than 1 gives the highest accuracy, it indicates that the model classified every row as a non-severe accident. To evaluate this result in more depth, the predicted results are evaluated for two values of the penalty parameter C.


Prediction results SVM for C = 0.01 ($10^{-2}$):

                   True value
Prediction SVM     0         1        Row total
0                  14,199    1,154    15,353
1                  0         0        0
Column total       14,199    1,154    15,353

Table 8.1: Classification scheme SVM for C = 0.01.

Prediction results SVM for C = 31.62 ($10^{1.5}$):

                   True value
Prediction SVM     0         1        Row total
-1                 27        3        30
0                  13,594    1,038    14,632
1                  574       113      687
2                  4         0        4
Column total       14,199    1,154    15,353

Table 8.2: Classification scheme SVM for C = 31.62.

Tables 8.1 and 8.2 show the predictions of the SVM model versus the actual values of the test set for two different values of the penalty parameter C. The test set consists of 15,353 observations, of which 1,154 are severe and 14,199 non-severe. Table 8.1 shows that with such a low penalty parameter (C = 0.01), a severe or fatal injury is never predicted; hence the accuracy equals 14,199/15,353 = 0.925. Table 8.2 shows that with a very high penalty parameter (C = 31.62), the model is capable of predicting 687 severe or fatal injuries; however, 574 of those are wrongly classified. Also, due to overestimation, the SVM predicts values bigger than one and smaller than zero. The number of wrongly classified severe injuries (a severe injury estimated as non-severe) decreases (from 100% to 89.9%), which is a positive effect: in predicting injury severity, it is more desirable to wrongly classify a non-severe accident than the other way around. More formally, the Type-II error (false negative) is more costly than a Type-I error (false positive). It appears that the cost parameter should not be bigger than 31.62; an optimal value would lie somewhere between 1 and this value, hence a cost value of 3.162 is chosen for the next accuracy measure, where the model is tested with varying thresholds.

4.4 COMPARISON LR AND SVM

In this section, the predictive power of the Logistic Regression is compared with that of the Support Vector Machine. The models are compared using the Area Under the Receiver Operating Characteristic curve, and the statistical significance of their difference is tested with the DeLong test.

Figure 7: ROC curves of the Support Vector Machine and the Logistic Regression.



In Figure 7, the ROC curves of the SVM and LR models are plotted for 1,000 threshold values varying from 0 to 1. Since the penalized Logistic Regression models did not outperform the regular Logistic Regression model, the classification results of the regular Logistic Regression are used to plot this graph. To plot the curve of the SVM, a penalty parameter equal to 3.162 ($10^{0.5}$) is used. The ROC curve of the Logistic Regression is more convex than that of the SVM, indicating that the Logistic Regression outperforms the SVM model in predicting injury severity. The AUROC values of the LR and SVM are 0.723 and 0.592 respectively, and they are significantly different from each other according to DeLong's test (p-value < 0.05). To explain these results more thoroughly, the false positive (FP) and true positive (TP) rates, as well as the false negative (FN) and true negative (TN) rates, are plotted for different threshold values for both models.

Figure 8: Plots of the predicted values of the models versus the true values at a threshold of p = 0.5

Figure 8 shows the results and comparison of the models for a threshold value of p = 0.5. As shown earlier in this thesis, the LR is barely capable of predicting severe or fatal injuries at this threshold value; hence only a minimal number of True Positive points can be seen in this graph. The SVM, since it is penalized, is capable of producing slightly more True Positive points than the LR. As a downside, however, the number of False Positive points (Type-I errors) is also much larger. The number of severe or fatal injuries classified as non-severe (False Negatives) is smaller, which is a preferable outcome. Keeping the threshold at 0.5, SVM would thus be the preferred classification model.

Figure 9: Plots of the predicted values of the models versus the true values at a threshold of p = 0.2

For a threshold equal to 0.2, the performances of the two models are much closer to each other. Figure 9 shows that the Logistic Regression model is now capable of predicting more true positive points. With this improvement in the true positive rate comes an increase in the false positive rate, and the same holds for the SVM. On the one hand, a lower threshold allows more data points to be correctly classified as a severe or fatal injury; on the other hand, it also causes the models to wrongly classify more data points as a severe or fatal injury. Both models now reduce the false negative rate (Type-II error), which is a desirable outcome. Table 9 shows the confusion matrices of LR and SVM for a threshold value equal to 0.2.

Prediction results LR at threshold = 0.2:

                 True value
Prediction       0                1              Row total
0                13,601 (95.8%)   991 (85.9%)    14,592
1                598 (4.2%)       163 (14.1%)    761
Column total     14,199           1,154          15,353

Prediction results SVM at threshold = 0.2:

                 True value
Prediction       0                1              Row total
0                13,639 (96.1%)   1,037 (89.9%)  14,676
1                560 (3.9%)       117 (10.1%)    677
Column total     14,199           1,154          15,353

Table 9: Confusion matrices of LR and SVM for a threshold equal to 0.2.

It is seen that the Logistic Regression has a higher rate of true predictions (TP) of severe or fatal injuries (14.1% > 10.1%). SVM, on the other hand, has a slightly lower FP rate than LR (3.9% < 4.2%). However, the Type-II error is in this case smaller for the Logistic Regression (85.9% < 89.9%). This lower threshold value allows the Logistic Regression to slightly outperform the SVM. An optimal threshold value is found by analysing the point closest to (FP, TP) = (0, 1); this point can also be viewed as the optimal trade-off between the FP and TP rates. For the LR this value is almost equal to 0.08, which is close to the proportion of severe or fatal injuries. The optimal threshold for the SVM is lower: its value is equal to 0.03.
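A minimal sketch of this optimal-threshold search in Python follows, using fpr, tpr and thresholds as returned by scikit-learn's roc_curve; the toy scores are illustrative, and the distance criterion matches the point-closest-to-(0, 1) rule described above.

    import numpy as np
    from sklearn.metrics import roc_curve

    # Toy example: positives tend to receive higher scores.
    y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
    scores = np.array([0.05, 0.1, 0.1, 0.2, 0.3, 0.3, 0.4, 0.6, 0.5, 0.9])

    fpr, tpr, thresholds = roc_curve(y_true, scores)
    dist = np.sqrt(fpr ** 2 + (1 - tpr) ** 2)   # distance of each ROC point to (0, 1)
    print(thresholds[np.argmin(dist)])          # threshold with the best FP/TP trade-off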

5. CONCLUSION

The main aim of this thesis was to apply a non-parametric Support Vector Machine (SVM) learning model and a simple linear Logistic Regression (LR) model to classify the severity of an injury in Great Britain and compare which model makes the most accurate predictions about traffic injury severity.

After intensively examining the models, it is concluded that, based on the value of the Area Under the Receiver Operating Characteristic curve, the Logistic Regression outperformed the Support Vector Machine learning model: the AUROC of the Logistic Regression (0.723) was significantly higher than that of the SVM (0.592). However, both models had much trouble classifying a row as a severe or fatal injury when the threshold value was kept at p = 0.5. This result is not surprising. The proportion of severe or fatal accidents in the dataset is equal to 0.07, a very small proportion, explained by the fact that Great Britain is a country with one of the lowest traffic death rates in Europe. Given this low incidence in the data, the models were hardly able to classify a row as a severe or fatal accident.

Because of this sparsity of severe and fatal injuries in the dataset, it was examined whether penalizing the coefficients of the Logistic Regression model through different penalty methods would increase its classification capabilities. This, however, forced the coefficients to shrink so much, or change in such a way, that it did not affect the classification capabilities of the model. For the same reason, it was tested whether penalizing misclassifications in the SVM would improve its classification capabilities. Here, a penalty parameter equal to 3.162 was chosen, which increased the model's ability to predict severe or fatal injuries.

Lowering the threshold value allowed both models to increase the number of correctly predicted severe or fatal injuries, but as a negative effect this lower threshold also increased the number of wrongly predicted severe or fatal injuries. To find the best trade-off between these two effects, an optimal threshold value was calculated. This resulted in optimal threshold values of 0.08 and 0.03 for the Logistic Regression and SVM respectively. These very low values show that lowering the threshold does not necessarily improve the accuracy of the predictions. In conclusion, neither model succeeded in classifying severe injuries properly. This does not mean that the models are not good classification tools in themselves: the safest choice when analysing a row is to classify it as a non-severe accident, since there is a 92.5% chance that it is indeed non-severe. For predicting injury severity, the Logistic Regression is preferred over the Support Vector Machine. Nevertheless, the models should be explored further or tested on different datasets before an exclusive preference for one of the two can be established.

Next to comparing the two models, it was also analysed which features contribute most to injury severity. This analysis provided some actionable insights for the British government. It was, for instance, found that older vehicles greatly increase the chances of a severe or fatal injury: a one-year-older vehicle increases the odds of a severe or fatal injury by 4%. One piece of advice would be to enforce stricter control of older cars by the DVSA³. Other variables on which action can be taken include light conditions, junction detail and speed limits. It was found that light conditions on the road significantly increase the odds of having a severe or fatal injury by 11.4%; it is therefore advised to build more lampposts on dark roads. More severe injuries tend to happen more than 20 metres away from a junction; more junction control in the form of traffic lights could easily be implemented to lower the speed of vehicles and thereby create safer roads.

6. DISCUSSION

It is clear that the classification process could be much improved if the incidence in the dataset were higher. Further research could add more years to the dataset: increasing the number of observations simultaneously increases the number of severe or fatal accidents. The downside of adding more years is that machine learning techniques such as SVM are computationally heavy; bigger datasets require more computation time and stronger computers. Another possibility is to analyse certain regions within Great Britain, preferably regions with a higher density of severe or fatal injuries (for example the London area), instead of the country as a whole. Exploring datasets from other countries, such as Spain, where the traffic fatality rate is much higher, could also help improve the classification process.

³ The Driver and Vehicle Standards Agency (DVSA) is an agency of the Department for Transport in Great Britain. It is responsible for setting standards for drivers and vehicles in Great Britain.


In this thesis some assumptions have been made, such as evaluating only the RBF Kernel for the SVM and taking a fixed σ for it. More Kernel functions and different values of the hyperparameters could be explored; there is a possibility of finding the optimal combination of hyperparameters and Kernel functions by optimising them simultaneously.

Another improvement would be to find a dataset that also contains non-injury accidents. There may be some distinct differences between car drivers who end up with an injury after a car crash and those who do not.


7. REFERENCES

Al-Ghamdi, A.S. (2001). Using Logistic Regression to estimate the influence of accident factors on accident severity. Accident Analysis & Prevention, 34(6), 729-741.

Bishop, C.M., (2006). Pattern Recognition and Machine Learning. New York, NY: Springer-Verlag New York.

Çelik, A.K., & Oktay, E. (2014). A multinomial logit analysis of risk factors influencing road traffic injury severities in the Erzurum and Kars Provinces of Turkey. Accident Analysis & Prevention, 72, 66-77.

Chong, M., Abraham, A., & Paprzycki, M. (2005). Traffic Accident Analysis Using Machine Learning Paradigms. Informatica, 29, 89-98.

Decruyenaere, A., Decruyenaere, P., Peeters, P., Vermassen, F., Dhaene, T., & Couckuyt, I. (2015). Prediction of delayed graft function after kidney transplantation: comparison between Logistic Regression and machine learning methods. BMC Medical Informatics & Decision Making, 2015.

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. New York, NY: Springer-Verlag New York.

Hoerl, A.E., & Kennard, R.W. (1970). Ridge Regression: Applications to Nonorthogonal Problems. Technometrics, 12(1), 69-82.

Hsu, C., Chang, C., & Lin, C. (2010). A practical guide to Support Vector Classification. Department of Computer Science, National Taiwan University.

Jou, R., & Chen, T. (2015). External Costs to Parties Involved in Highway Traffic Accidents: The Perspective of Highway Users. Sustainability, 7, 7310-7332.

Kashani, A.T., & Mohaymany, A.S. (2011). Analysis of the traffic injury severity on two-lane, two-way rural roads based on classification tree models. Safety Science, 49(10), 1314-1320.

King, G., & Zeng, L. (2001). Logistic Regression in Rare Events Data. Political Analysis, 9, 137-163.

Li, Z., Liu, P., Wang, W., & Xu, C. (2011). Using Support Vector Machine models for crash injury severity analysis. Accident Analysis and Prevention, 45, 478-486.

Lopez Bastida, J., & Serrano Aguilar, P. (2005). The Economic Costs of Traffic Accidents in Spain.


Mayou, R., Bryant, B., & Duthie, R. (1993). Psychiatric consequences of road traffic accidents. British Medical Journal, 307(6905), 647-651.

Muniz, A.M.S., Liu, H., Lyons, K.E., Pahwa, R., Liu, W., Nobre, F.F., & Nadal, J. (2010). Comparison among probabilistic neural network, Support Vector Machine and Logistic Regression for evaluating the effect of subthalamic stimulation in Parkinson disease on ground reaction force during gait. Journal of Biomechanics, 43(4), 720-726.

Musa, A.B. (2012). Comparative study on classification performance between Support Vector Machine and Logistic Regression. International Journal of Machine Learning and Cybernetics, 4(1), 13-24.

Oña, J. de, Mujalli, R.O., & Calvo, F.J. (2011). Analysis of traffic accident injury severity on Spanish rural highways using Bayesian networks. Accident Analysis & Prevention, 43(1), 402-411.

Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1), 267-288.

Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67(2), 301-320.
