The Impact of Big Data and Machine Learning on Insurance

Emma Blanken (1000442)

April 2017

Master Thesis Actuarial Science and Mathematical Finance

Thesis committee:

dr. S.U. Can (UvA)

J. de Mik (EY)


Abstract

The aim of this thesis is to investigate whether the insurance industry can improve its practice by applying data mining and machine learning techniques. The first part of the thesis is an extensive literature survey that gives an overview of the current situation and of how the insurance industry is expected to change in the future as a result of big data analytics. The game changers and opportunities, as well as concerns and challenges, are discussed. The second part of the thesis is an empirical study that can serve as practical guidance for insurers on how to implement and execute the data-driven opportunities. This thesis considers the ‘Cross-Industry Standard Process for Data Mining’ (CRISP-DM), a methodology that is widely used in other financial sectors for data mining and predictive analytics projects. We apply CRISP-DM to a real-world business case, in which a Dutch insurer wants to select potential customers for targeted marketing based on internal and open data. An extensive study is performed in which three different splits into training and test datasets are considered. Seven different feature selection approaches as well as four machine learning algorithms are applied and the results are compared.

Keywords: insurance industry, customer expectations, advanced analytics, big data, machine learning, classification, feature selection, class imbalance, CRISP-DM, Logistic Regression, Decision Tree, Artificial Neural Network, Naive Bayes, Synthetic Minority Over-sampling.


Introduction

“Information is the oil of the 21st century, and analytics is the combustion engine.” - Peter Sondergaard, senior vice president, Gartner Research.

According to a recent study by IBM, 2.5 quintillion bytes of data are created every day and 90% of the data in the world today was created in the last two years [1]. The data are collected from a wide variety of sources, including social media, purchase transaction records, internet-enabled devices, climate sensors, and video and voice recordings. These large volumes of data created by tools, machines and people are called ‘big data’. Big data is characterised by high-volume, high-velocity and high-variety information.

In their book Big Data: A Revolution That Will Transform How We Live, Work and Think [2], V. Mayer-Schönberger and K. Cukier state that big data is altering the way businesses operate. The concept of deriving insights from data in order to create business value and make fact-based decisions is not new. The rise of the internet, multimedia and social media has, however, caused an exponential growth in available information in recent years. According to several studies, including papers by EY [3] and the McKinsey Global Institute [4], effectively using the insights hidden in the large volumes of available data is becoming the basis of competition. Companies that are able to successfully derive value from their data will have an advantage over their competitors.

According to studies by the McKinsey Global Institute [5] and BCG Perspectives [6], the traditionally slow-moving insurance industry has lagged behind other financial services sectors in applying (big) data analytics. The insurance industry relies on the principle of risk. Customers select policies based on their assessment of a certain undesirable event happening to them, while insurers offer protection based on their assessment of the cost of covering possible claims. Insurers traditionally outperformed competitors by combining underwriting expertise with economies of scale. Increasing expectations from consumers and the threat of non-traditional competition force the insurance industry to reinvent its practice well beyond the boundaries of traditional actuarial science. A more accurate assessment of risk would be preferable for both the customer and the insurer. By analysing internal data on their policyholders in combination with open data from various sources, insurers might be able to evaluate the risks of insuring a particular client better and determine the premium for the policy more accurately. As the insurance industry becomes more competitive, insurers have to stand out by offering products that cost less than their rivals’, by operating more efficiently and by providing excellent customer service.

The creative sourcing of data and the refinement of analytical methods, enabled respectively by the outburst of new digital data sources and the revolutionary advances in computing technology in recent years, are expected to become much greater sources of competitive advantage in the insurance industry. Many insurers are currently still in the early stages of considering the potential uses of data analytics, and consumers are not yet really experiencing the advantages of these first big data initiatives. A rapid growth in the application of data analytics within the insurance industry is, however, expected. For example, a paper by Boston Consulting Group and Google [7] states that, in India, 75% of insurance policies sold by 2020 will be influenced by digital channels.

The main goal of this thesis is to investigate whether the traditionally slow-moving insurance industry can improve its practice by applying advanced analytics and using new data sources. This thesis is divided into two parts: a literature study and an empirical study. The literature study investigates how big data analytics is changing the traditional insurance industry. The main game changers are described, including technological advances, increasing customer expectations, the presence of new risks and new sources of external data. The question of how insurers can use big data in order to grow and to gain a competitive edge is answered. Furthermore, big data opportunities within the insurance industry are discussed. Better tools, more data and new risks are creating opportunities alongside demanding challenges and concerns. To cope with the changing insurance industry, not only are large-scale and complex organisational changes required, but privacy and solidarity issues must also be considered. The more insurers rely on the use of data, the more likely it is to become a subject for regulators, and this interference might slow developments or offset growth opportunities. The main concerns and challenges are discussed in the literature study. The literature study shows that large-scale use of big data in the insurance industry is at a turning point. Traditional insurers must reinvent their practice: advanced analytics, sophisticated models, and improved storage and processing capabilities are needed to turn the enormous amounts of available data into useful insights in a cost-effective way.

The aim of the empirical study is to show how insurers can implement and execute the new big data driven opportunities. This thesis considers ‘Cross-Industry Standard Process for Data Mining’ (CRISP-DM), a methodology that is widely used in other financial sectors for data mining and predictive analytics projects. In the empirical study, the methodology of CRISP-DM is explained and applied in a real world business case, in which a Dutch insurer wants to select potential customers for marketing reasons based on internal and open data. To solve the case, machine learning algorithms Logistic Regression, Decision Tree, Artificial Neural Network and Naive Bayes are explained and applied. Issues including feature selection and imbalanced data are addressed.

This thesis is structured as follows. Chapter 1 focuses on the changes in the insurance industry and investigates why the industry has failed to keep up with digital developments in comparison with other financial sectors. The major game changers are also discussed in chapter 1. Chapter 2 investigates how insurers can use big data in order to grow and gain competitive advantages by pointing out some concrete opportunities. Better tools, more available data and new risks are creating opportunities alongside challenges and concerns; chapter 3 describes the main challenges and concerns. Chapter 4 explains the theory, algorithms and formulas to be used in the empirical study. Chapter 5 solves the real-world business case according to the phases of CRISP-DM. Chapter 6 concludes and gives suggestions for further research.


Contents

I Literature study

1 Big data analytics transforms insurance industry
1.1 Game changers
1.1.1 Customer expectations
1.1.2 New sources of external data
1.1.3 Real-time data monitoring
1.1.4 New risks
1.1.5 Technological advances

2 Data analytics opportunities
2.1 Car insurance
2.2 Customer satisfaction
2.2.1 Claim processing capabilities
2.2.2 Mobile based insurance solutions
2.2.3 Risk detection and prevention services
2.3 Cyber Insurance
2.4 mHealth and wearable devices
2.5 Aerial and Digital Imagery
2.6 Fraud detection

3 Challenges and concerns

II Empirical study

4 Theoretical background
4.1 CRISP-DM
4.2 Machine learning
4.3 Classification models
4.3.1 Logistic Regression (LR)
4.3.2 Naive Bayes (NB)
4.3.3 Decision Tree (DT)
4.3.4 Artificial Neural Network (NN)
4.5 Imbalanced dataset
4.6 Evaluation
4.7 Synthetic Minority Over-sampling Technique (SMOTE)
4.8 Pearson’s correlation coefficient

5 Case on effective marketing and the improvement of customer experience
5.1 Business Understanding
5.2 Data Understanding
5.2.1 Initial data collection and description
5.2.2 Data analysis
5.2.3 Data quality
5.3 Data Preparation
5.4 Modelling
5.5 Evaluation
5.6 Deployment

6 Conclusions and suggestions for further research

III Appendix

A Theoretical background extension
B Data description: tables and figures
C Data analysis of socio-demographic attributes
D Results feature selection
D.1 Model 1 without SMOTE
D.2 Model 2 without SMOTE
D.3 Model 1 with SMOTE
D.4 Model 2 with SMOTE

E Evaluation performance of models
E.1 Model 1 without SMOTE
E.2 Model 2 without SMOTE
E.3 Model 1 with SMOTE

Part I

Literature study

1 Big data analytics transforms insurance industry

Historically, the insurance industry has made operational and tactical decisions based on internally obtained data. These decisions can concern how to price risk, which customers to target, how to estimate losses, etc. The traditional insurance practice is changing: the outburst of new digital data sources and the revolutionary advances in computing technology force the insurance industry to reinvent its practice well beyond the boundaries of traditional actuarial science. Large amounts of externally obtained data are expected to be used increasingly. Enabled by the development of sophisticated computational techniques, insurers will use big data and forward-looking simulation and modelling techniques in order to make strategic decisions. Moreover, the creative sourcing of data and the refinement of analytical methods are expected to become much greater sources of competitive advantage.

Currently, however, the insurance industry is lagging behind other players in the financial sector in adopting new technology and using big data analytics. This is surprising: insurers have relevant in-house data and knowledge. More specifically, the insurance industry could be the leader in the use of analytics, since insurers have been collecting data on their clients through daily transactions and interactions for hundreds of years in order to determine risks. Furthermore, insurers have been analysing data in relation to risk assessment and prevention for years through their actuaries.

Based on research from business consultant BearingPoint [8], 90% of insurance firms are yet to implement a company-wide big data strategy. [The research is based on a survey with respondents covering 30 insurance companies in Europe and the US, undertaken between January and February 2014.] Big data is expected to significantly alter the insurance industry, but according to the research, insurers are unprepared. More than 67% of the surveyed insurers indicated that big data will play a highly important role in their future. The research by BearingPoint revealed that by 2018 big data will be a top priority for 71% of the surveyed companies. However, only 24% of the surveyed insurance firms believe that their big data maturity is advanced or leading, and no more than 33% have started an enterprise or departmental implementation process.

The question arises why the insurance industry has failed to keep up with the digital developments. According to the survey by BearingPoint, 53% of insurance executives across Europe blame a lack of skills. Besides the lack of skills, some financial and cultural factors could explain the reluctance to profit from technology and data analytics with the same eagerness as other financial sectors. The so-called black boxes [in-vehicle telematics systems installed in cars to monitor the driving behaviour of the insured], for example, are very expensive for the insurer to fit into cars. British insurer Aviva was one of the first to use these black boxes. Car insurance premiums fell by 30% and policyholders had 30% fewer accidents [9]. Aviva has, however, stopped using black boxes, because they were too expensive to buy and install. It now offers discounts to drivers who use mobile apps that monitor their driving habits. In general, many insurers are waiting for the cost of the black boxes to fall and prefer in the meanwhile the cheaper option of offering discounts to customers who are willing to have their driving behaviour monitored.

As mentioned earlier, a cultural factor might also explain why the insurance industry is lagging behind. Catherine Barton, former partner at EY, said [9]: “Compared to many other industries, (insurers) are still playing catch-up. The sector has a very traditional culture.” Face-to-face interactions and personal contacts have always been crucial in the insurance industry, and personal contact is still important in negotiations and making deals. Technological innovations have thus far not been able to find an adequate replacement for human interaction. Even with the prospect of revolutionary technological advances, insurers believe that there will always be a need for personal interaction in the future [9]. The British insurer RSA, for example, offers innovative telematics-based insurance and at the same time feels the need to have meeting rooms where brokers can do business in the old-fashioned way.

According to a paper from Forrester Research [10], traditional insurers are under threat from non-traditional competition due to their slow implementation of digital technology. Retail, technology and e-commerce institutions are starting to enter different parts of the insurance industry. Besides access to enormous customer databases, non-traditional companies such as Amazon and Google have a large capital base, significant brand presence and existing big data processing capabilities. An Accenture survey of 6,000 consumers in 11 countries in 2013 revealed that 67% of the surveyed consumers would consider taking insurance from companies other than traditional insurers [11]. A total of 43% stated that they would consider banks as possible insurance providers. Moreover, almost a quarter would consider large internet companies like Amazon and Google when choosing an insurer.

The focus of these new non-traditional players lies primarily on health, property and casualty insurance, according to a paper by Capgemini [12]. Google and Walmart have partnered with comparison sites to offer car insurance in the U.S., and IKEA and the Swedish insurance company Ikano Försäkring developed the OMIFALL pregnancy and child insurance in 2014. Competition in the insurance industry will increase due to the entrance of the non-traditional players. The smaller insurers are expected to be impacted the most, as developing the technology infrastructure and competency to cope with the entrance of the giant technology firms is expected to be hardest for these smaller firms. The customer, on the other hand, will benefit from the increasing competition, which forces insurers to provide better products and services. Note, however, that the insurance industry is often more highly regulated than the current businesses in which the new players operate. Therefore, it can be expected that these new players will face challenges on the regulatory front, which might slow down or prevent their entrance.


1.1 Game changers

The traditional insurance industry is changing. This section describes the most influential ‘game changers’.

1.1.1 Customer expectations

Customers have always desired an efficient, reliable and friendly service. With the development of new technology, however, their expectations have been increasing. In general, customers are increasingly demanding transparency, simplicity and speed in their transactions with businesses. Furthermore, customers want personalised service. The old saying ‘know your customer’ might never have been more accurate.

The insurance industry is not immune to this trend. As a consequence of the industry’s slow adaptation to the digital world, many insurers are struggling to satisfy consumers. The design and the delivery of products and services must change in order to meet customer expectations of simplicity and transparency. Investments in technology are needed to fulfil the desire for speed and mobility of service. An increasing use of digital platforms, like apps and mobile websites, is needed. If insurers fail to improve their consumer experience, it is most likely that non-traditional competitors will take advantage of this failure. Section 2.2 suggests some concrete opportunities for insurers to improve their consumer experience.

1.1.2 New sources of external data

The rise of the internet, multimedia and social media has caused an exponential growth in available information in recent years. [There are, for example, over 1.71 billion monthly active Facebook users, a 15 percent increase year over year (source: Facebook, 27/7/16).] Data from computers, smartphones, tablets and other industrial and consumer devices are rich sources of behavioural insights. Furthermore, the European Union, the US and UK governments have launched ‘open data’ websites on which enormous amounts of government statistics about, among others, health, worker safety, energy and education can be found. With better access to external data sources, insurers are able to understand many different types of risks better by combining the obtained insights. Based on these insights, the insurer could for example answer the question what the probability is that a certain person will lose his house to a forest fire or die in a traffic accident given the geographic radius, or which treatment options and combination of geodemographic factors will have the largest impact on the life expectancy of people suffering from cancer.

1.1.3 Real-time data monitoring

Real-time data monitoring is expected to be used increasingly by insurers, since it can improve their risk and exposure analyses. Data obtained from real-time monitoring can for instance give valuable insights into the driving behaviour or the health and lifestyle of a consumer. Furthermore, using real-time data monitoring, an insurer can influence the insured by rewarding good behaviour. Telematics, for example, are used in car insurance to monitor the driving habits of the insured in real time. There is evidence that this affects drivers and changes their driving habits positively. One UK insurance company using telematics stated that the better driving habits have led to a 30% reduction in the number of claims. Another UK insurer using telematics similarly noticed a 53% reduction in accidents caused by risky driving manoeuvres [13].

1.1.4 New risks

The risk landscape is shifting rapidly due to new economic, environmental, socio-political and technological developments and the interdependencies between them [14]. The growing prevalence of new risks presents a major opportunity for the insurance industry in terms of new business. Among the new and (partially) unsettled risks are the following:

• Increasing frequency and severity of catastrophic events

The frequency and severity of natural and man-made catastrophic events have been growing. As the effects of global warming increase, the occurrences of climate extremes increase, according to the Climate Emergency Institute [15]. Research by ‘Verbond van Verzekeraars’ revealed that in the Netherlands the damage caused by extreme rainfall will increase by 5% in the best-case scenario and by 139% in the worst-case scenario by 2085 [16]. Additionally, insurers must consider the degradation of the environment by man. The ‘International Energy Outlook 2016’ [17], recently released by the U.S. Energy Information Administration, reveals for example that world energy consumption will grow by 48% between 2012 and 2040. The report also states that even though non-fossil fuels are expected to grow faster than fossil fuels, fossil fuels will still account for more than 75% of world energy consumption through 2040. With the continued use of fossil fuels, pollution will remain a significant health issue. In order to assess risk accurately in different regions, insurers must closely monitor trends in global climate, atmospheric pollution, etc.

• Terrorism

Terrorism is not a new phenomenon in the world. Due to the attacks in Paris, Brussels, London and New York in recent years, however, the Western world has become more aware of the impact of terrorist strikes. Terrorism does not only have tremendous social impacts but also massive global economic impacts. The bomb explosion in the financial district of London on 4 October 1992, for instance, cost the insurance industry approximately 900 million U.S. dollars [18]. Terrorist attacks often affect multiple product lines, including workers’ compensation, business interruption, commercial property, life and benefits. In traditional actuarial models, the different product lines are often modelled independently and therefore do not capture this interdependence [19]. To understand the capacity requirements for terrorism coverage, more detailed modelling is required.

• Cyber threats

In recent years, cybersecurity attacks have become recurring news items. The hacking attack on a UK mobile phone and broadband provider in October 2015 is one of the largest cyber claims in Europe. Based on an internal investigation into how many customers were affected, the provider revealed that 20,000 bank account numbers and 28,000 partially obscured credit card details were accessed. Additionally, the investigation showed that approximately 1,200,000 email addresses, names and phone numbers of customers were accessed [20]. The World Economic Forum named cyber risk a potential global threat in 2015. Based on a report by the Center for Strategic and International Studies in 2014 [21], the annual cost of cybercrime to the global economy is more than 400 billion U.S. dollars. Deloitte quantified the risk of losses from cyber attacks through data analysis within the Netherlands in 2016; it estimated that the total annual cybercrime losses confronting the largest Dutch corporations and government could amount to 10 billion euros per year [22]. The increasing occurrence of cyber attacks presents an opportunity for the insurance industry, which is discussed in section 2.3. Insurers themselves are also vulnerable to cyber attacks: the downside of the increasing use of digital platforms (apps and mobile websites) to satisfy consumers is that insurers expose themselves more to cyber risk. As insurers store an increasing volume of data about their customers, their exposure to cyber attacks is likely to increase.

Insurers must be more sophisticated in their risk modelling in order to manage these new types of risks. New sources of data and new monitoring technologies will help in understanding and accurately assessing these new risks.

1.1.5 Technological advances

The increasing ability to store data facilitates the accumulation and analysis of extremely large volumes of data. [The storage cost of 1 gigabyte of data was approximately 10,000 dollars in 1990, 10 dollars in 2000, and is currently only a couple of cents [23].] The explosion of computing power and innovation in analytic modelling enable the development of more sophisticated insurance tools. An innovative analytics vendor specializing in insurance applications has for instance introduced a new health-risk model by combining the most accurate actuarial and government data with demographic trends and medical science [24]. This model for longevity risk is both forward- and backward-looking: it includes data from traditional mortality tables and adds data on medical advances and new lifestyle trends such as more exercise, healthier diets and less smoking. Innovation in analytics modelling could furthermore help insurers to understand and underwrite the new risks that currently might be underinsured, including cybersecurity and industry-wide business interruption due to natural disasters.

2 Data analytics opportunities

The insurance industry is based on the principle of risk: customers select policies based on their assessments of a certain undesirable event happening to them, while insurers offer protection based on their assessments of the cost of covering possible claims. A more accurate assessment of risk would be preferable for both the customer and the insurer. Insurers might be able to evaluate the risks of insuring a particular person and set the premium for the policy more accurately by capturing and analysing internal data about their policyholders and open data from various sources. Furthermore, as the insurance industry becomes more competitive, insurers have to stand out by offering products that cost less than their rivals’, by operating more efficiently and by providing excellent customer service. The opportunities discussed in this chapter reveal that these goals can be achieved by using big data and predictive analytics. More specifically, this chapter gives an overview of opportunities that can enable insurers to stand out, improve their practice, meet customers’ expectations and overcome current challenges. This chapter additionally describes whether these opportunities are already applied and how they might be developed and used in the future.

2.1 Car insurance

Car insurers are shifting toward Usage-Based Insurance. Usage-Based Insurance (UBI) refers to the concept of determining the premium for car insurance based on the car usage or driving behaviour. In this type of car insurance, insurers collect real-time driving information, such as speed or brake use. Insights from this data enable insurers to link the individual risk of drivers more closely with their premium. The telematic devices named ‘black boxes’ are used to collect data which reveals driving patterns and driving behaviours.

A better understanding of risk and a refinement of the premium based on driving behaviour can be achieved by combining monitoring technology and data analytics. Insurers are able to gain a competitive advantage and attract low-risk policyholders when using UBI. The insurer’s traditional role of protector against financial loss might be altered using UBI: the insurer can influence policyholder behaviour, the number of claims and the well-being of a driver by creating a reward system and by informing drivers about road and weather conditions. [The percentages given in section 1.1 show that the driving behaviour of customers improved when real-time monitoring was used.] Three types of UBI can be distinguished: pay-as-you-drive (PAYD), pay-how-you-drive (PHYD) and manage-how-you-drive (MHYD). PAYD is a low-mileage insurance: the insurance premium is based on the number of kilometres a vehicle has covered. In PHYD the premium of the customer is set by assessing their driving style. MHYD is an extension of PHYD that additionally provides guidelines about best driving practices to drivers based on their driving behaviour monitored by the telematic devices.
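To make the three flavours concrete, the following R sketch computes annual premiums under hypothetical PAYD and PHYD schemes. All rates, discounts and the form of the risk score are illustrative assumptions, not figures from this thesis or from any insurer.

payd_premium <- function(km_driven, rate_per_km = 0.05, base = 150) {
  # PAYD: the premium grows with the distance driven
  base + rate_per_km * km_driven
}

phyd_premium <- function(base_premium, risk_score) {
  # PHYD: risk_score in [0, 1] summarises the monitored driving style
  # (e.g. speeding and hard-braking frequencies); safe drivers (low
  # score) receive up to a 30% discount, risky drivers a 30% surcharge
  base_premium * (0.7 + 0.6 * risk_score)
}

payd_premium(8000)        # premium for 8,000 km driven in a year
phyd_premium(550, 0.2)    # premium for a relatively safe driver

An MHYD scheme would add a feedback step on top of PHYD, for example reporting which component of the risk score contributed most to the premium.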


According to a report from IHS Automotive [25], close to 12 million consumers globally subscribed to UBI in 2015. [The findings in the ‘IHS Usage Based Insurance Report’ are based on more than 40 interviews from across the value chain: data aggregators, telecommunications companies, insurance carriers and automotive OEMs. More than 70 companies are considered.] This number is expected to grow to 42 million globally by 2023. The most mature UBI markets in Europe are by far Italy and the United Kingdom. In Germany, Spain and France activity is growing, but according to the IHS Automotive report UBI in these countries is still a niche market: consumers are still unfamiliar with the product and insurers are uncertain which business propositions perform best. UBI represented 10% of the market in Italy in 2015. According to the IHS Automotive report, Italy is the only country in which UBI has a double-digit share of the insurance market. This is strongly influenced by government intervention: in response to fraudulent insurance claims, Italy’s Prime Minister mandated in 2012 that if an insurance company already offered a UBI product with a black box, the box must be offered to the customer at no charge and a significant upfront premium discount must additionally be offered. The IHS report states that in 2017 the U.S. will be the leader in UBI innovation and marketing due to its market potential. The car insurance market of the U.S. is the largest in the world, with more than 260 million operating vehicles in 2015. Based on IHS estimates there were more than 5 million UBI policyholders in the U.S. in 2015, by far the most of any country. Italy was a distant second with 3.6 million UBI policyholders out of 36.8 million.

In China a shift toward UBI is expected. In 2016 approximately 14 insurers launched UBI pilot programs according to the IHS report. Estimates suggest that the UBI subscriber volume will grow from 50,000 in 2015 to over 22 million by 2023.

The IHS report states that the technology behind UBI is evolving and that technological developments will stimulate UBI growth. Technological advances will increase the effectiveness and convenience, and drive down the costs, of using telematics devices in the future. Due to the rapid growth of smartphone capabilities, including g-force tracking, GPS and accelerometers, the global UBI market is expected to be eventually dominated by smartphone-only solutions, but probably only for a short time. The ‘connected car’ or ‘smart car’ is on its way. [According to recent research by Gartner, the production of cars equipped with data connectivity, either through a built-in communications module or by a tether to a mobile device, is forecast to be 12.4 million in 2016 and to reach 61 million in 2020 (source: http://www.gartner.com/newsroom/id/3460018).] This vehicle itself might be the ultimate UBI device. In the future connected cars will be common.

Companies like Polisvoormij [in cooperation with Reaal and Fatum Insurance] and Whoosz! [in cooperation with Zurich Financial Services and T-Mobile] were the first to introduce telematics-based car insurance products in the Netherlands. Both companies use mobile phone apps to monitor driving behaviour. VOOROP recently entered the market using a car plug-in monitoring device. These companies claim to be commercially successful with their new UBI propositions. However, the majority of newly purchased car insurance policies in the Netherlands is still non-telematics based. According to research by Deloitte [26], based on a survey among 900 Dutch consumers in 2015, telematics-based car insurance has strong growth potential in the Netherlands. This growth is stimulated by growing consumer acceptance of sharing personal data and by technological advances that will quickly lower the costs and increase the effectiveness of telematics devices. A growth in UBI will have social benefits for the Netherlands, since 65% of the 900 surveyed Dutch consumers are willing to change their driving behaviour when offered a financial reward [20% gave a negative response and 16% stated that they were neutral]. Recognition of these social benefits will increase the demand for UBI insurance in the upcoming years.

2.2 Customer satisfaction

Increasing customer expectations were introduced in section 1.1.1 as a major game changer. By closely analysing consumers’ data, their needs and risks can be identified more accurately. Based on the obtained insights, personalised products can be offered and the desired support can be given to the consumer. The following opportunities can help the insurer to meet the rising customer expectations.

2.2.1 Claim processing capabilities

A quick and smooth claims process helps to increase customer satisfaction. Big data analytics can help insurers to improve their claims processing: it enables insurers to identify and report events in a fast and effective way. It is difficult for an insurer to assess the loss when a claim is first reported; insights obtained from data analytics could make it easier to compute and reassess the loss reserve [12].

2.2.2 Mobile based insurance solutions

The use of smartphones and tablets has grown immensely in recent years in the Netherlands. Despite this enormous growth, a survey by Deloitte [26] shows that 61% of the 900 surveyed Dutch customers did not use their mobile devices for insurance-related purposes in the last 12 months. The customers that did used their smartphone or tablet mostly to view policy terms. The surveyed consumers rated their satisfaction with insurance services provided through a mobile device at 37%.

The preferences of customers are changing: they want simpler and quicker communication. Mobile devices enable the desired ‘anytime, anywhere’ interaction. An increase in mobile-based insurance solutions is expected, since consumers indicate a willingness to use mobile devices for insurance purposes.

The survey by Deloitte [26] showed that 44% of the respondents find the option to compare personal risk on damages and accidents with peers valuable. Moreover, 36% want to compare their driving behaviour with others and want to be rewarded for positive behaviour. Insurers could offer personalised risk-related dashboards to satisfy the desire of consumers to obtain insight into their personal usage and risk profiles in comparison with peers. The Dutch lease company LeasePlan can be taken as an example: the company offers drivers insights into their performance with regard to speeding tickets, number of accidents and fuel efficiency in comparison with their peers.

Mobile solutions can be used not only for efficient and fast interaction with current and new customers, but also to collect data on, for example, location or driving behaviour. Recall that this information can be used in UBI, but real-time monitoring through mobile devices can also be used for custom offerings.

In other words, when more individual data are provided, insurers can offer products that better fit personal needs. An insurer can, based on GPS data, offer winter sports coverage on top of a current travel policy when a customer arrives at a winter sports destination. A GPS-based micro insurance is already offered by the Japanese insurer Tokio Marine & Nichido Fire in partnership with the Japanese telecom company NTT Docomo. They offer ‘One-Time Insurance’ products which use the GPS of Docomo to make location-specific insurance offers. The suggested product can be purchased by Docomo customers anytime and anywhere via mobile devices and protects them against potential risks from short-term and rare events associated with their location.

2.2.3 Risk detection and prevention services

The insurance industry has traditionally used its knowledge of claims and risk mainly for pricing purposes. Data about customers’ lifestyle and driving style will increase significantly due to the growth of new sources like smart cars, smart homes and perhaps eventually smart cities. Besides pricing, insurers can offer risk-reducing products or services based on these individual data.

Research by Deloitte [26] revealed that 66% of the 900 surveyed Dutch consumers would appreciate it if their insurer informed them about a recently increased burglary risk in their neighbourhood. Furthermore, 54% of the respondents indicate that they would change their behaviour if they had insight into their personal risks, and 52% of the surveyed consumers are willing to install risk-reducing devices in return for monthly benefits.

Interpolis Achmea and the ACE Insurance Group have already introduced products to meet these customer desires. In its online prevention store, Interpolis Achmea offers discounts of up to 50% on risk-reducing articles, including emergency kits, smoke detectors and safe locks. The ACE Insurance Group, one of the world’s largest property and casualty insurers, offers a mobile app that provides risk-related information, like political turmoil and natural disasters, based on GPS location.

2.3 Cyber Insurance

As mentioned in section 1.1, cyber attacks are increasing. These attacks can be severe and can have a large impact on organisations. Although many companies have an information security division, complete elimination of cyber risk is impossible. Therefore, it might be attractive for companies to transfer the cyber risk for a premium. Cyber insurance has been available for over 10 years. According to a paper by Capgemini [12], cyber insurance is not widely adopted due to both a lack of awareness among users and a lack of data. Since the occurrence of cyber attacks is increasing, the use of cyber insurance is expected to increase as well. Cyber insurance can cover various damages, including business interruption, intellectual property theft, cyber extortion, data and software loss, breach-of-privacy events and impact on reputation.

2.4 mHealth and wearable devices

mHealth is an abbreviation for mobile health. The term refers to the use of mobile devices in medical care. With mHealth apps a consumer’s health and surroundings can be monitored by the insurer more accurately.

mHealth apps are often supported by so-called wearables. A wearable is an electronic device, often worn on the body, that collects and transmits a variety of data concerning the activity of the person who is wearing it. Measures that can be obtained from wearables go beyond basic fitness measures like heart rate, steps and distance. The bracelet Mio Slice, for example, determines a personal target activity that will reduce the risk of diabetes and cardiovascular diseases and maximise the lifespan of an individual. [At London’s Wearable Technology Show 2016, the Vancouver-based company Mio introduced Mio Slice. The Mio Slice uses a new algorithm that measures your Personal Activity Intelligence (PAI). PAI takes into account an individual’s gender, date of birth, and resting and maximum heart rate. It awards a personalised PAI score every 7 days.] According to EY’s 2016 Sensor Data Survey of senior executives from nearly 400 insurers globally, wearable sensors will be one of the most important data sources for future competitiveness within the insurance industry [27].

In-ear and headband health monitors are also examples of wearables. An application is the LG Heart Rate Headphones, which measure blood oxygen levels and physical activity. The redesigned Google Glass is expected to facilitate a complete health assessment through inner-ear and tongue speech recognition, a holographic visual interface and bone-conduction audio [27]. Sensor-equipped clothing that is able to capture a range of health data will soon be introduced. At CES 2016, a global consumer electronics and consumer technology tradeshow, OMSignal introduced a smart sports bra that measures breathing rhythm, running performance, biometric effort and fatigue, and additionally includes a sophisticated coaching function. Another example is Samsung’s smart belt, which is able to track weight gain and steps and to determine whether a waist expansion is due to weight gain or overeating. It is expected that consumers will make their health data available to an insurer in order to receive a personalised life or health insurance and a premium based on their likelihood of illness and level of fitness. Furthermore, when healthy behaviour is rewarded, insurers will indirectly motivate consumers to improve their physical activity and eating habits. Wearables might even be used to reduce risk through preventative actions: drivers who show indications of extreme tiredness could, for instance, be discouraged from driving based on actuarial estimates of their probability of causing an accident.


The focus of insurers on mHealth apps and wearables is increasing, since these enable insurers to monitor and improve customers’ health and thereby reduce their health care spending. Additionally, insurers will benefit directly when the spread of diseases is monitored and controlled, since this will again lead to a reduction in claims. According to PwC, mHealth can potentially save 99 billion euros in health care spending in the European Union [28].

2.5 Aerial and Digital Imagery

To estimate damages and losses, insurers often have to physically visit the event site. Civil authorities sometimes restrict physical examination for safety reasons; after a catastrophe, for instance, an insurer may not be allowed to physically visit certain locations. This delays the claims estimation process. Using aerial and digital imagery, the estimation of damages before claims processing can be done more quickly, cheaply and accurately. Aerial and digital imagery uses digital images and software to view land and properties. New technological advances in aerial imaging enable an insurer to capture high-resolution 2D and 3D images of properties or land and to estimate their location, size and incurred damages [12]. The cost of processing claims can be reduced using aerial and digital imagery. According to [12], an experiment by a leading U.S. insurer to test the effectiveness of unmanned aerial vehicles, like drones, revealed that it would have overpaid $338 per claim had it not used unmanned aerial vehicles. The experiment revealed that a person would over-measure the actual damage 79% of the time, while unmanned aerial vehicles assessed the properties more accurately.

Although aerial and digital imagery have a huge potential to become a primary tool for insurers to assess properties, there are also a few challenges for its implementation. Regulatory approvals must be obtained for drones to capture images. When drones cross a private area, property access rights might be violated. Public safety may be disturbed, since drones can hit people or distract drivers.

2.6 Fraud detection

Fraud is an immense problem for insurers. Fraudulent claims and the cost of investigating suspected fraud lead to higher premiums. Research in 2016 by ‘Centrum Bestrijding Verzekeringscriminaliteit’ and ‘Verbond van Verzekeraars’ revealed that insurance fraud costs Dutch insurers hundreds of millions of euros per year. A Dutch household pays on average 100 euros more premium to compensate for the fraudulent claims [29].

In the past, fraud detection was a task of claims agents, who had to rely on red-flag business rules, a few facts and a lot of intuition [30]. The arrival of big data and high-performance analytics technology in recent years gives the insurance industry the opportunity to completely reinvent its fraud detection methods. Predictive analytics can be used to prevent and detect fraudulent claims. The obtained insights can help insurers to identify applicants who are most likely to commit fraud or to determine whether a claim is likely to be legitimate.


More specifically, insurers can combine their own collected data, open data and personal data from their customers’ smartphones, telematic devices and social media accounts to determine the veracity of claims. For example, when a crash is reported to be the result of heavy rainfall, the insurer can check the weather at the date and time of the accident. Social media can tell the insurer whether the claimants know the other individuals that have submitted claims or know the person they are making a claim against, which could indicate that the claim is more likely to be fraudulent. Furthermore, all available data can be used to build sophisticated models that predict the likelihood of a claim being legitimate.
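As a minimal sketch of such a model, the following R code fits a logistic regression to simulated claims and ranks claims by predicted fraud probability. The attribute names and the simulated relationships are hypothetical; a production fraud model would use far richer data and more sophisticated algorithms.

# Simulate a toy claims dataset (all names and effects are made up)
set.seed(1)
n <- 5000
claims <- data.frame(
  claim_amount   = rlnorm(n, meanlog = 7),   # claim size in euros
  days_to_report = rpois(n, 5),              # delay before reporting
  prior_claims   = rpois(n, 0.5)             # earlier claims by insured
)
lin <- -6 + 0.0002 * claims$claim_amount +
  0.15 * claims$days_to_report + 0.8 * claims$prior_claims
claims$fraud <- rbinom(n, 1, plogis(lin))    # simulated fraud label

# Fit the scoring model and rank claims for manual investigation
fit <- glm(fraud ~ claim_amount + days_to_report + prior_claims,
           data = claims, family = binomial)
claims$score <- predict(fit, type = "response")
head(claims[order(-claims$score), ])         # most suspicious claims first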

3 Challenges and concerns

Better tools, more data and new risks are creating opportunities alongside demanding challenges and concerns. This chapter describes the main challenges and concerns insurers are facing when changing their traditional actuarial business.

Every day, 2.5 quintillion bytes of data are created according to IBM [1], and 90% of the data in the world today was created in the last two years. The data are obtained from a wide variety of sources. The challenge for insurers is to turn these enormous amounts of often low-quality data into useful insights in a cost-effective way. Advanced analytics, smarter data management, and large-scale and complex organisational changes are needed. As some data might be unimportant or even misleading, insurers must pay attention to the data source. It is necessary to have the right skilled people to capture and analyse big data, since using data in isolation can lead to bad decisions. Furthermore, insurers must improve their storage and processing capabilities.

The availability of individual data is increasing exponentially due to the rise of, among others, the internet, social media and real-time monitoring. The question arises what to do with all this new information and these insights into customer behaviour. Insurers must pay attention to the thin line between using data correctly and intruding on the privacy of an individual. Recent research by Deloitte [26], based on a survey among 900 Dutch consumers in 2015, revealed that 48% of the respondents are concerned about their privacy.

In comparison with traditional insurance products, the new data-driven opportunities described in chapter 2 rely much more on individual customer data. Insurers must consider the willingness of customers to share data. The research by Deloitte revealed that 45% of the 900 surveyed Dutch customers would share personal data, but 71% of the respondents are against the use of information shared on social media. Although consumers are still somewhat hesitant to share their personal data, they do want to benefit from the aggregated data collected by insurers for a better understanding of how to manage property, health and lifestyle risks.

Regulation is another challenge. The more insurers use and rely on data, the more likely it is to become a subject for regulation. In 2016, the Joint Committee of European Supervisory Authorities (JCESA) examined the use of big data by financial institutions. [The Joint Committee of European Supervisory Authorities consists of the European Securities and Markets Authority (ESMA), the European Banking Authority (EBA) and the European Insurance and Occupational Pensions Authority (EIOPA).] JCESA stated that its aim is to ‘analyse the adequacy of sectoral regulatory frameworks and identify any regulatory and/or supervisory measures which may need to be taken’ [31].

In their book We Are Big Data: The Future of the Information Society [32], S. Klous and N. Wielaard state that individuals know more and more about themselves in the big data driven society. Consequently, individuals also know more about, for instance, their individual risk of being involved in a car accident, getting sick or having other problems. According to the book, individuals will know exactly how their individual risk level deviates from the average. Insurers have historically determined the insurance premium based on the average of a group with similar risk characteristics. Traditional insurance is based on this solidarity principle, but with the arrival of big data this is no longer self-evident. More specifically, an insurance business model is often built on the basis of imperfect information or information asymmetry. One insured person has, for example, a higher risk of getting diabetes or getting into a car accident than others do. Insured people with a low risk are not aware that they have a lower than average risk; they are in the same group of insured as those with a higher risk. Since the premium is based on the average, the low-risk insured in a certain sense pay for the high-risk insured.

One of the consequences of the new ‘information society’ is, according to Klous and Wielaard, that individuals are obtaining increasingly better insights into their individual risks and into what exactly they are getting for the insurance premiums they pay. This transparency might be considered good or even fair, since it destroys the information asymmetry, but it also changes the root of the traditional insurance business model. In the ‘information society’, individuals know exactly what their risks are and how much they are paying for other people’s higher risks. In other words, individuals will know exactly how much the acclaimed solidarity with others costs them. An extreme consequence could be that low-risk individuals no longer want to insure themselves, since paying the premium is more costly than bearing the damage or cost of an accident or illness themselves. Insurers must then increase their premiums, because only high-risk consumers remain insured and there are no longer low-risk individuals whose contributions keep premiums down; a worked example of this mechanism follows below. Furthermore, a “couch potato” needs different health care than an athletic person, and a construction worker faces different risks than a doctor. Since the measurement of these differences is becoming much easier, it is likely that the number of affinity groups or social networks that want to group together, like people from the same neighbourhood, unions, colleagues and friends, will increase.

Another concern is that identifying the risk and health profile of an individual becomes much easier using big data and advanced analytics. This could allow insurers to be more selective on risks and to construct portfolios that are more attractive. In the extreme case, certain types of risky consumers could be refused.
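A small worked example in R illustrates the arithmetic behind this argument; the claim probabilities and claim size are made-up numbers chosen only for illustration.

# Two equally large groups share one pool: low risk (claim probability
# 1%) and high risk (5%), with an expected claim size of 10,000 euros.
p_low <- 0.01; p_high <- 0.05; claim_size <- 10000

pooled_premium <- mean(c(p_low, p_high)) * claim_size  # 300
fair_low       <- p_low  * claim_size                  # 100
fair_high      <- p_high * claim_size                  # 500

# Under pooling, every low-risk insured pays 200 euros more than their
# individually fair premium. Once they can observe this, they may leave
# the pool, pushing the premium towards the high-risk level of 500.
pooled_premium - fair_low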


Part II

Empirical study

The theoretical study states that insurers must reinvent their practice in order to meet rising customer expectations and to compete with non-traditional players. Therefore, larger scale use of big data is expected. Advanced analytics, sophisticated models, and improved storage and processing capabilities are needed to turn the enormous amounts of available data into useful insights in a cost-effective way. The goal of the empirical study is to investigate, with a real-world business case, how insurers can improve their practice using advanced analytics and big data. The case will be solved according to the phases of the ‘Cross-Industry Standard Process for Data Mining’ (CRISP-DM). The CRISP-DM methodology is already widely used for data mining and predictive analytics projects in other financial sectors. A dataset provided by the Dutch data mining company Sentient Machine Research is used [33]. The dataset is not directly published by an insurer but is based on real-world business data from the Dutch insurance industry. It was used for the CoIL 2000 data mining competition. [CoIL Challenge 2000 is the second edition of the Computational Intelligence and Learning competition Challenge, an international data mining competition.] Since the dataset was pre-processed and unpersonalised before publishing, it is not possible to extend it with data from open sources. However, the dataset already combines two different types of data: internal product usage data and socio-demographic data. [Socio-demographic refers to a group defined by its demographic and sociological characteristics. These groups are used for analyses in social science as well as for medical studies and marketing. Examples of demographic characteristics are age, sex, religion, place of residence, marital status and educational level. Sociological characteristics are more objective features, like household status, values, social groups and membership in organisations.]

The chosen dataset is used to investigate how an insurer can improve its customer experience and use its marketing budget effectively. Since the dataset is pre-processed and unpersonalised and the numbers of observations and attributes are relatively small, the performance of the machine learning algorithms and the outcomes might not be as good as desired. However, the dataset is appropriate to show in general the way insurers can improve their practice. More specifically, although the techniques are applied to a particular dataset and problem, the CRISP-DM methodology, algorithms and formulas used are suitable for all opportunities suggested in chapter 2.

The empirical study is structured as follows. The theory used to solve the case is explained in chapter 4. The case is described and solved in chapter 5 according to the CRISP-DM phases.

The programming language R is used to analyse the dataset and implement the machine learning algorithms. R is chosen because it is free and open-source software that includes many packages automating particular tasks. Furthermore, R has a large and active community of users, which can provide support.
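As a concrete starting point, the sketch below loads the CoIL 2000 data in R. The same competition data is distributed as the Caravan dataset in the ISLR package; that this packaged version matches the exact Sentient Machine Research release used in this thesis is an assumption worth verifying.

# install.packages("ISLR")   # once, if the package is not yet installed
library(ISLR)

dim(Caravan)                          # 5,822 observations, 86 attributes
table(Caravan$Purchase)               # target: caravan policy purchased?
prop.table(table(Caravan$Purchase))   # roughly 6% 'Yes': class imbalance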

4 Theoretical background

4.1 CRISP-DM

‘Cross-Industry Standard Process for Data Mining’ (CRISP-DM) is a methodology that covers the typical phases and tasks of an analytics project. The methodology, developed in 1996, was first published in 1999 by SPSS, NCR and Mercedes. Although IBM released a new method named ASUM-DM in 2015 [‘Analytics Solutions Unified Method for Data Mining & Predictive Analytics’ (ASUM-DM) is a refinement of CRISP-DM], CRISP-DM is still the leading methodology and the industry standard for data mining and predictive analytics projects [34]. CRISP-DM divides the process of a data mining project into six phases: business understanding, data understanding, data preparation, modelling, evaluation and deployment [35]. The sequence of the six phases is not rigid, since in practice there is always interaction between the phases. Next, the six phases are discussed separately.

Phase 1: Business Understanding

The first phase starts with the understanding of the project objectives and requirements from a business perspective. Next, the obtained knowledge is converted into a problem definition. Finally, a primary project plan is developed to achieve the objectives.

Phase 2: Data Understanding

The second phase starts with an initial collection of data. After the collection, simple statistics can be computed and figures can be plotted in order to get familiar with the data. The quality of the data must also be considered in this phase. Data quality problems occur, for example, when an attribute that indicates a person’s age has the value 10,000, when the height of a person is -10 metres, or when there are a lot of missing values.
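A minimal sketch of such checks in R, on a toy data frame with deliberately planted problems:

persons <- data.frame(age    = c(34, 10000, 28, NA),
                      height = c(1.80, 1.65, -10, 1.72))

summary(persons)                       # value ranges reveal odd entries
colSums(is.na(persons))                # missing values per attribute
subset(persons, age < 0 | age > 120)   # implausible ages
subset(persons, height <= 0)           # impossible heights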

Phase 3: Data Preparation

The Data Preparation phase includes all activities necessary to transform the initial raw dataset into the final dataset used in the modelling phase. The data preparation tasks differ per project; they might include the selection of attributes, the cleaning of data, the solving of data quality problems and the construction of new attributes.

Two examples of possible data preparation actions are given below; the R sketch after the list illustrates both.

• Assume a gender attribute with ‘male’, ‘female’, ‘m’, ‘f’, ‘girl’ and ‘boy’ as possible inputs. These six different inputs all indicate the gender, but only one term for female and one term for male are desired for modelling purposes and predictive power. The data preparation task for this attribute is to replace ‘female’ and ‘girl’ by ‘f’, and ‘male’ and ‘boy’ by ‘m’.

• Based on the available data, new attributes can be created. Consider for example the date of birth. New attributes that can be derived from it are the month of birth, the year of birth, the day of birth or whether the person is a child or an adult. The question might arise why a specific attribute for the month of birth is needed when this information is also captured in the date of birth. The answer is that a model only considers the exact input. For example, a model will ‘read’ 07-02-1992 and 08-02-1992 as two different values and will not notice that the dates are only one day apart, or that they fall in the same year or the same month. These insights must be given to the model explicitly.
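To make these two actions concrete, the sketch below shows how they could be carried out in R. The data and the column names (gender, date_of_birth) are hypothetical and serve only as an illustration.

# Hypothetical raw data with inconsistent gender labels and a date of birth.
raw <- data.frame(
  gender        = c("male", "f", "girl", "m", "boy", "female"),
  date_of_birth = as.Date(c("1992-02-07", "1992-02-08", "2010-05-01",
                            "1985-12-24", "2008-07-15", "1979-03-03"))
)

# Action 1: map the six possible gender labels onto the two desired terms.
raw$gender <- ifelse(raw$gender %in% c("female", "girl", "f"), "f", "m")

# Action 2: construct new attributes from the date of birth, so that the
# model receives these insights explicitly.
raw$year_of_birth  <- as.integer(format(raw$date_of_birth, "%Y"))
raw$month_of_birth <- as.integer(format(raw$date_of_birth, "%m"))
raw$is_adult       <- as.integer(Sys.Date() - raw$date_of_birth) / 365.25 >= 18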

Phase 4: Modelling

Research has shown that the performance of data mining models depends on the underlying problem: Caruana and Niculescu-Mizil [36] and Han and Kamber [37] demonstrate that no individual data mining technique offers the best solution to all problems. The decision on the selection of the model is therefore crucial for the outcome, and the performance of a data mining technique must be evaluated for each specific task. Various modelling techniques are selected and applied in this phase, and their parameters are calibrated to obtain optimal values. Since some techniques require specific input, steps in the Data Preparation phase must often be repeated.

Phase 5: Evaluation

The obtained outcomes of the considered models are analysed and evaluated. It is important to check thoroughly whether the business objectives are achieved, both by evaluating the model and by examining the steps taken to create it. Finally, the modelling technique that fits the problem best is chosen.

Phase 6: Deployment

A predictive analytics project does generally not end when the model is created and the results are evaluated. The obtained knowledge usually needs to be organised and presented in such a way that it is useful to the client or customer. The deployment phase depends on the requirements of the client: it can be as complex as implementing a repeatable process or as simple as generating a report.

Figure 1 summarises the six phases of CRISP-DM accompanied by the generic tasks and outputs. The empirical case will be solved according to these phases and tasks. Additional theory and models used in the empirical study are explained in subsections 4.2, 4.3, 4.4 and 4.6.


Figure 1: General tasks (bold) and outputs (italics) of CRISP-DM. [Source: DataPrix. ‘The reference model CRISP-DM’, http://www.dataprix.net/en/reference-model-crisp-dm]

4.2 Machine learning

As described earlier, CRISP-DM is the standard methodology for data mining projects. Data mining can be defined as the process of investigating large datasets in order to extract new information or interesting patterns. During this process, machine learning algorithms can be used.

Machine learning is a sub-field of artificial intelligence that uses algorithms to, for example, find patterns in data or predict future events. There are two different machine learning techniques: supervised learning and unsupervised learning. In machine learning, a variable is called an ‘attribute’.

Assume a dataset contains labelled observations, i.e. it has a target attribute. In that case, modelling the dataset is called supervised learning. Each observation in supervised learning can be denoted as (x, y), where x is a set of independent attributes and y is the dependent target attribute. The independent attributes can be either continuous or discrete, and the same holds for the target attribute y. The machine learning techniques used when dealing with a continuous or a discrete target attribute y are called regression and classification, respectively. The different discrete values the target attribute can take in classification are called ‘classes’.

The technique of modelling a dataset that does not include a target attribute is called unsupervised learning. The aim of unsupervised learning is to find clusters within the data or similarities among groups.

Depending on the available data and the problem, the required machine learning algorithm can be either supervised or unsupervised. The case studied in this thesis requires a supervised classification technique. The performances of different classification machine learning algorithms on the data of the case are evaluated.

4.3 Classification models

The aim of a classification model is in general to accurately predict the target attribute, or class, of each observation in the dataset. To this end, the dataset is divided into a training and a test dataset, and the target attribute is removed from the test set. The training set is used to build the model; the test set is used to validate the built model and to examine how well it performs on data it has never seen before. The predictions of the target attribute in the test set are compared with the real observations to evaluate how well the algorithm performs.
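As an illustration of this procedure, the sketch below divides a hypothetical dataset into a training set and a test set in R and compares the predictions with the held-out target attribute. The 70/30 split ratio, the simulated data and the use of accuracy as performance measure are arbitrary choices for this example.

set.seed(42)  # for a reproducible split

# Hypothetical dataset with a binary target attribute y.
data <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
data$y <- as.integer(data$x1 + data$x2 + rnorm(200) > 0)

# Divide the observations into a training set (70%) and a test set (30%).
train_idx <- sample(seq_len(nrow(data)), size = 0.7 * nrow(data))
train <- data[train_idx, ]
test  <- data[-train_idx, ]

# Build the model on the training set (here: a logistic regression, see 4.3.1).
model <- glm(y ~ x1 + x2, data = train, family = binomial)

# Predict the target attribute on the test set, which the model has never seen,
# and compare the predictions with the real observations.
pred <- as.integer(predict(model, newdata = test, type = "response") > 0.5)
mean(pred == test$y)  # proportion of correctly classified test observations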

As mentioned before, the performance of a classification model depends on the underlying problem, and no individual model offers the best solution to all classification problems [36, 37]. The decision on the selection of the classification model is crucial for the outcome. Therefore, the performance of a model must be evaluated for each specific problem.

The four classification techniques used in this case are discussed separately.

4.3.1 Logistic Regression (LR)

Logistic Regression is a widely used model to predict a target attribute. A distinction can be made between Binary Logistic Regression and Multinomial Logistic Regression. Binary Logistic Regression predicts a binary attribute (0 or 1)¹⁷. Multinomial Logistic Regression is an extension of Binary Logistic Regression that can predict more than two discrete categorical outcomes. Both use maximum likelihood estimation to find the parameters that best fit the data [38]. The case discussed in chapter 5 considers a binary target attribute; therefore, Binary Logistic Regression is explained here [39].

¹⁷ A linear regression model can generate any real number as predicted value, whereas a categorical attribute can only take on a limited number of discrete values within a predetermined range. It is therefore not appropriate to use linear regression for binary or categorical attributes.

Assume that n is the number of explanatory attributes and m is the number of observations. Introduce the target attribute Y, where Y is an [m × 1] vector. We assume that the target attribute can be 0 or 1. Denote Y_j as the jth observation of Y, for j = 1, ..., m. The set of explanatory attributes is given by X = {X_1, ..., X_n}, where X_i is an [m × 1] vector for i = 1, ..., n. So when discussing a specific explanatory attribute of X, we will address it as X_i. X_j denotes the [1 × n] vector with the values of all n explanatory attributes for observation j, j = 1, ..., m. Furthermore, X_ij corresponds with the value of the ith attribute for the jth observation, and (Y_j, X_j) corresponds with the jth observation in the dataset.

Binary Logistic Regression relies on the logistic function $g(s) = \frac{1}{1 + e^{-s}}$. In the logistic regression, the linear combination of the explanatory attributes is taken,

\[
s = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n = \beta_0 + \sum_{i=1}^{n} \beta_i X_i, \tag{1}
\]

and the following probabilities are assigned:

\[
P(Y = 1 \mid s) = \frac{1}{1 + e^{-s}}, \qquad
P(Y = 0 \mid s) = 1 - P(Y = 1 \mid s) = \frac{e^{-s}}{1 + e^{-s}}.
\]

To simplify equation (1), we augment X so that X = {X_0, X_1, ..., X_n} with X_0 = 1, and define the weight vector β = {β_0, β_1, ..., β_n}. This gives $s = \beta^T X$.

By assuming that the observations are independent, we obtain the likelihood function

\[
L(\beta) = P(Y \mid X, \beta) = \prod_{j=1}^{m} P(Y_j \mid X_j, \beta)
= \prod_{j=1}^{m} \frac{\left(e^{-\beta^T X_j}\right)^{1 - Y_j}}{1 + e^{-\beta^T X_j}}.
\]

The log-likelihood is given by

\[
l(\beta) = \log L(\beta)
= \sum_{j=1}^{m} \left[ (1 - Y_j) \log\left(e^{-\beta^T X_j}\right) - \log\left(1 + e^{-\beta^T X_j}\right) \right]
= \sum_{j=1}^{m} \left[ (1 - Y_j)\left(-\beta^T X_j\right) - \log\left(1 + e^{-\beta^T X_j}\right) \right].
\]

The aim is to find the maximum likelihood estimator $\hat{\beta}$ [39]. This can be done using the Newton-Raphson method, which involves the first and second derivatives of the log-likelihood function. More information on the Newton-Raphson method can be found in [38].
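As a sketch of this step (a standard derivation, included here for completeness): writing $p_j = P(Y_j = 1 \mid X_j, \beta)$ and treating the augmented $X_j$ as a column vector, differentiating the log-likelihood above gives

\[
\nabla l(\beta) = \sum_{j=1}^{m} (Y_j - p_j)\, X_j, \qquad
\nabla^2 l(\beta) = -\sum_{j=1}^{m} p_j (1 - p_j)\, X_j X_j^T,
\quad \text{where } p_j = \frac{1}{1 + e^{-\beta^T X_j}},
\]

so that each Newton-Raphson iteration updates

\[
\beta^{(k+1)} = \beta^{(k)} - \left( \nabla^2 l\big(\beta^{(k)}\big) \right)^{-1} \nabla l\big(\beta^{(k)}\big),
\]

repeated until the estimates converge to $\hat{\beta}$.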

Logistic regression is considered a powerful tool. However, overfitting may occur when a large number of explanatory attributes is considered. Another disadvantage is that various numerical problems can occur when the model is fitted, including non-existent maximum likelihood estimates and collinearity between covariates.
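As an illustration, a Binary Logistic Regression can be fitted in R with the built-in glm() function, which maximises the likelihood via iteratively reweighted least squares (for the logistic model this coincides with the Newton-Raphson method). The dataset and attribute names below are hypothetical.

# Hypothetical training data with a binary target attribute.
train <- data.frame(
  age     = c(23, 45, 36, 52, 29, 61, 41, 33),
  income  = c(20, 55, 48, 40, 25, 80, 47, 31),
  caravan = c(0, 1, 0, 1, 0, 1, 0, 1)
)

# Fit the Binary Logistic Regression; family = binomial selects the logistic link.
fit <- glm(caravan ~ age + income, data = train, family = binomial)
summary(fit)  # maximum likelihood estimates of beta with standard errors

# Predicted probabilities P(Y = 1 | X) for new observations.
new_obs <- data.frame(age = c(30, 58), income = c(28, 75))
predict(fit, newdata = new_obs, type = "response")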

4.3.2 Naive Bayes (NB)

Naive Bayes is a probabilistic model that computes the probability that an observation belongs to each class, given the values of its attributes. Naive Bayes has its theoretical roots in Bayes’ theorem. Again, introduce a target attribute Y and a set of explanatory attributes X = {X_1, ..., X_n}.


Bayes’ theorem states that

\[
P(Y \mid X) = \frac{P(Y)\, P(X \mid Y)}{P(X)}, \tag{2}
\]

where P(Y | X) is the posterior probability of target attribute Y given predictors X, P(X | Y) is the likelihood, i.e. the probability of predictors X given class Y, P(Y) is the prior probability of the class and P(X) is the prior probability of X. P(X) does not depend on Y and can be considered constant. Formula (2) can therefore be rewritten as

\[
P(Y \mid X) \propto P(Y)\, P(X \mid Y), \tag{3}
\]

where the symbol ∝ means ‘is proportional to’: the left-hand side of formula (3) equals the right-hand side multiplied by the scaling constant 1/P(X). Bayes’ theorem can be interpreted as follows [40]: ‘beliefs after having observed the data’ ∝ ‘beliefs before observing the data’ × ‘influence of the data’. In other words, one starts with the class prior probability P(Y), which contains theoretical or intuitive ideas on Y that may be obtained from parallel or earlier studies. One then learns from the data through the likelihood function P(X | Y). This yields the posterior probability P(Y | X).

To obtain the Naive Bayes classifier, the ‘naive’ conditional independence assumption is applied to the likelihood P(X_1, ..., X_n | Y). The assumption states that, given the class, each attribute X_i is conditionally independent of every other attribute X_j for j ≠ i. Therefore,

\[
P(X_1, \ldots, X_n \mid Y) = \prod_{i=1}^{n} P(X_i \mid Y).
\]

This gives

\[
P(Y \mid X) \propto P(Y) \prod_{i=1}^{n} P(X_i \mid Y).
\]

For the Naive Bayes classifier the following classification rule is used:

\[
\hat{Y} = \underset{Y}{\operatorname{argmax}}\; P(Y) \prod_{i=1}^{n} P(X_i \mid Y). \tag{4}
\]

The naive assumption is trivially false when the explanatory attributes are dependent, and in practice the attributes are rarely independent. Therefore, the Naive Bayes classifier has traditionally been considered not very reliable. Research, including [41] and [42], has however shown that the prediction accuracy of the Naive Bayes classifier compares well in many applications with that of more complex algorithms. Additionally, the Naive Bayes classifier is computationally efficient and simple, and has proven to be robust to noise and to irrelevant attributes.
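As an illustration, the sketch below applies classification rule (4) in R using the naiveBayes() function from the e1071 package (one of several available implementations; the package is assumed to be installed). The data and attribute names are hypothetical.

library(e1071)

# Hypothetical training data with two categorical attributes and a binary class.
train <- data.frame(
  gender  = factor(c("m", "f", "f", "m", "m", "f", "f", "m")),
  urban   = factor(c("yes", "yes", "no", "no", "yes", "no", "yes", "no")),
  caravan = factor(c(0, 1, 0, 0, 1, 0, 1, 0))
)

# Estimate the class priors P(Y) and the conditional probabilities P(X_i | Y)
# from the observed frequencies in the training set.
nb <- naiveBayes(caravan ~ gender + urban, data = train)

# Posterior probabilities P(Y | X) and the predicted class for a new observation.
new_obs <- data.frame(gender = factor("f", levels = c("f", "m")),
                      urban  = factor("yes", levels = c("no", "yes")))
predict(nb, new_obs, type = "raw")  # class probabilities, cf. formula (4)
predict(nb, new_obs)                # class label with the highest posterior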


4.3.3 Decision Tree (DT)

A decision tree is a simple and widely used classification algorithm. The classification problem is solved by asking a series of questions about the attributes of the observations. Each time an answer is received, a follow-up question is asked until a conclusion about the class of the observation's target attribute is reached. The series of questions and their possible answers are organised in the form of a decision tree: a hierarchical structure consisting of nodes and directed edges [43]. The tree has three types of nodes:

• Root node: node that has no incoming edges and two or more outgoing edges.

• Internal nodes: nodes that have exactly one incoming edge and two or more outgoing edges.

• Terminal nodes: nodes that have no outgoing edges.

Each terminal node in a decision tree is assigned a class label. Each non-terminal node (the root node or an internal node) contains an attribute test condition to divide observations that have different characteristics. Once the decision tree is constructed, classifying an observation from the test set is straightforward¹⁹.

Tan, Steinbach and Kumar state in their book [43] that the number of possible decision trees grows exponentially with the number of considered attributes. Finding the optimal tree is computationally infeasible because of the exponential size of the search space. However, efficient algorithms have been developed that lead to reasonably accurate decision trees within a reasonable amount of time. These algorithms often use a greedy strategy that builds a decision tree by making a series of locally optimal decisions about which attribute to use for dividing the data.

Hunt’s algorithm is such an algorithm [43]. Introduce D_t as the set of training observations that are associated with node t, and let j = 1, ..., J be the class labels. A recursive definition of Hunt’s algorithm is given by:

• Step 1: When all observations in D_t belong to the same class j, then t is a terminal node labelled as j.

• Step 2: When D_t contains observations that belong to more than one class, an attribute test condition is selected to partition the observations into smaller subsets. A child node is constructed for each outcome of the test condition, and the observations in D_t are passed to the children based on these outcomes. The algorithm is then recursively applied to each child node.

Many decision tree algorithms, including CART, C5.0 and ID3, are based on Hunt's algorithm.
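To clarify the recursion, the sketch below gives a minimal R implementation of Hunt's algorithm, assuming categorical attributes and choosing the test condition by the Gini impurity (the splitting criterion used by CART). It is an illustration only; production algorithms add stopping criteria, pruning and support for continuous attributes.

# Gini impurity of a vector of class labels.
gini <- function(y) 1 - sum((table(y) / length(y))^2)

hunt <- function(data, target, attrs) {
  y <- data[[target]]
  # Step 1: all observations belong to one class (or no attributes are left):
  # node t becomes a terminal node labelled with the (majority) class.
  if (length(unique(y)) == 1 || length(attrs) == 0) {
    return(list(leaf = TRUE, label = names(which.max(table(y)))))
  }
  # Step 2: select the attribute test condition with the lowest weighted Gini.
  best <- NULL; best_gini <- Inf
  for (a in attrs) {
    groups <- split(y, data[[a]], drop = TRUE)
    w <- sum(sapply(groups, function(g) length(g) / length(y) * gini(g)))
    if (w < best_gini) { best_gini <- w; best <- a }
  }
  # Construct a child node for each outcome and apply the algorithm recursively.
  children <- lapply(split(data, data[[best]], drop = TRUE), hunt,
                     target = target, attrs = setdiff(attrs, best))
  list(leaf = FALSE, attribute = best, children = children)
}

# Hypothetical usage: tree <- hunt(train, target = "caravan",
#                                  attrs = c("gender", "urban"))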

The main advantage of decision trees over other techniques is the produced output: the output of a decision tree is transparent and easily interpretable. A disadvantage is that decision trees might encounter scalability and efficiency problems. More specifically, the efficiency of existing algorithms, like CART, ID3 and C5.0, has been well established.

¹⁹ Start at the root node, apply the test condition to the observation and follow the correct branch based on the outcome of the test. The outcome leads either to another internal node, in which a new test condition is applied, or to a terminal node. The class label corresponding to that terminal node is then assigned to the target attribute of the observation.
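As an illustration, a CART-style decision tree can be grown in R with the rpart package (one of several implementations; the package is assumed to be installed). The data and attribute names are hypothetical.

library(rpart)

# Hypothetical training data with a binary target attribute.
train <- data.frame(
  age     = c(23, 45, 36, 52, 29, 61, 41, 33, 57, 26),
  income  = c(20, 55, 48, 40, 25, 80, 47, 31, 66, 22),
  caravan = factor(c(0, 1, 0, 1, 0, 1, 0, 1, 1, 0))
)

# Grow the tree; method = "class" selects classification. The minsplit value
# is lowered here only because this illustrative dataset is tiny.
tree <- rpart(caravan ~ age + income, data = train, method = "class",
              control = rpart.control(minsplit = 2))
print(tree)  # the transparent, easily interpretable structure of the tree

# Classify new observations by following the branches from the root node
# to a terminal node, as described in footnote 19.
new_obs <- data.frame(age = c(30, 58), income = c(28, 75))
predict(tree, newdata = new_obs, type = "class")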
