
Applying machine learning on the data of a control tower in a retail distribution landscape

Thomas Kolner, 30 August 2019

Master thesis

Supervisors: prof. dr. J. van Hillegersberg, dr. N. Sikkel


Abstract

Retail distribution is the activity of getting goods into stores where they are sold to the public. A new concept in retail distribution in the Netherlands is the transport control tower. In this context, a transport control tower is defined as an integrated platform where transportation companies and their stakeholders share information or data and connect different services. This thesis aims to build a predictive model to predict the on-time arrival rate of trucks at the stores and to help explain the variance in on-time arrivals of trucks by using the data from a transport control tower. Building a predictive model and explaining the on-time arrivals, this thesis asks: How can a control tower provide insights and be valuable within the distribution chain of a retailer, using data on planned and actual truck arrivals? And how can a control tower be used to explain the variance in the on-time arrival rate of trucks?

Based on a review of the literature on integration platforms in transportation, it is argued, at a conceptual level, that there is a huge potential for the use of a control tower in the field of retail distribution. To validate the use of the control tower, this thesis conducted a case study in collaboration with Albert Heijn to apply machine learning to the control tower data. This shows a successful application of machine learning on the data of a control tower in a retail distribution landscape. It describes a method to extend the control tower data with open data on weather and traffic, and to apply machine learning to the extended control tower data. The results show that the Random Forest model is most suited for the detection of on-time arrivals: the Random Forest classifier achieves an f1 score of 0.86. Analysis of the outcomes showed that the on-time arrival rate is influenced by several variables. The most important variables in this case study are ranked by using the feature importance from the proposed Random Forest model.

Human factors could also influence the time of arrival, and it is concluded that such factors should be considered in future research.

Title: Applying machine learning on the data of a control tower in a retail distribution landscape

Author: Thomas Kolner, t.kolner@student.utwente.nl, s1505432
Supervisors: prof. dr. J. van Hillegersberg, dr. N. Sikkel

End date: August 30, 2019

Faculty of Electrical Engineering, Mathematics and Computer Science
University of Twente

Drienerlolaan 5, 7522 NB Enschede

https://www.utwente.nl/en/eemcs/


Preface

This thesis could not have been completed without the contribution and help of several people.

First of all, I would like to thank everyone from the Transport department at Albert Heijn for the interest they have shown in my graduation project and the input they gave me.

Next, I want to thank Noortje van Genugten for the opportunity to write my thesis at Albert Heijn, which gave me the possibility to get to know the company from within, and for her advice and comments on my thesis. In particular, I would like to thank Rene Krukkert for all his input and his critical review of my writing.

I also thank my supervisors from the university, prof. dr. Jos van Hillegersberg and dr. N. Sikkel, for guiding me through the process of writing a thesis. Your advice and comments on my research were most useful and helped me create the result you are reading now.

Lastly, I would like to thank my family for their support during the past 11 months.

Thomas Kolner

Enschede, August 2019


Contents

Introduction 1

1.1 Company introduction . . . . 1

1.2 Problem introduction . . . . 2

1.2.1 Problem context . . . . 2

1.2.2 Problem identification . . . . 3

1.3 Research introduction . . . . 5

1.3.1 Research scope . . . . 5

1.3.2 Research objective . . . . 5

1.3.3 Research questions . . . . 5

1.3.4 Research method . . . . 6

1.3.5 Report structure . . . . 9

Background 11

2.1 Distribution network . . . . 11

2.1.1 Supply chain . . . . 11

2.1.2 Distribution network . . . . 12

2.1.3 Retail distribution . . . . 14

2.2 Machine learning . . . . 16

2.2.1 Machine Learning classifiers . . . . 17

2.2.2 Evaluation classifiers . . . . 22

2.2.3 Cross-validation . . . . 25

2.3 Related work . . . . 26

2.3.1 Academic work on control towers . . . . 27

2.3.2 Whitepapers on control towers . . . . 31

Data understanding 34

3.1 Data collection . . . . 34

3.2 Data description . . . . 34

3.3 Data exploration . . . . 36

3.3.1 Variables in the dataset . . . . 36

3.3.2 Transformed variables . . . . 42

3.3.3 External variables . . . . 43

Data Preparation 49

4.1 Data selection . . . . 49


4.2 Data cleansing . . . . 49

4.3 Data transformation . . . . 50

4.4 Data integration . . . . 53

4.5 Data balancing . . . . 53

4.6 Data formatting . . . . 54

4.7 Variables . . . . 54

Modelling 55

5.1 Selection machine learning techniques . . . . 55

5.2 Experimental design . . . . 56

5.2.1 Label . . . . 56

5.2.2 Datasets . . . . 56

5.2.3 Featureset . . . . 56

5.3 Training and testing . . . . 57

5.4 Feature importance . . . . 58

5.5 Tools . . . . 59

Results analysis 60

6.1 Performance overview . . . . 60

6.2 Performance per classifier . . . . 62

6.2.1 Random Forest . . . . 62

6.2.2 Logistic Regression . . . . 64

6.2.3 K-nearest neighbour . . . . 65

6.2.4 Conclusion . . . . 67

6.3 Feature importance . . . . 68

6.3.1 Importance ranking . . . . 68

6.3.2 Recursive feature elimination . . . . 70

6.3.3 Conclusion . . . . 71

Usability of the Model 73

7.1 Technical Usability . . . . 73

7.2 Business usability . . . . 74

7.3 Conclusion . . . . 76

Conclusion 77

8.1 Answers to the research questions . . . . 77

8.2 Variance in arrival time . . . . 79

Discussion 82

9.1 Results . . . . 82

9.2 Contribution . . . . 83

9.3 Challenges . . . . 83

9.4 Limitations . . . . 84

9.5 Recommendations . . . . 84


9.6 Future research . . . . 85

Appendix A - Data integration 91

Appendix B - Overview of the variables 92


Introduction

This report is written for the completion of the Master Business & IT, within the Data Science & Business track, at the University of Twente. To this end, research is conducted at the Transport department of Albert Heijn, located in Zaandam. The research focuses on the complex retail distribution network of Albert Heijn in the southwest of the Netherlands. Retail distribution is the activity of getting goods into stores where they are sold to the public.

This chapter gives an introduction to the company, the problem statement, and the research questions.

1.1 Company introduction

This research is conducted at Albert Heijn (AH). Albert Heijn B.V. is a supermarket chain in the Dutch retail market. It was founded by Albert Heijn, who opened his first store in 1887 in Oostzaan, the Netherlands. From there it expanded through the first half of the 20th century and became the largest supermarket chain in the Netherlands. In 1973, parent company Ahold was established. Ahold merged in 2016 with the Belgian food retailer Delhaize, so AH currently is a subsidiary of Ahold Delhaize N.V. Nowadays Albert Heijn B.V. operates more than 880 stores, of which approximately 840 are located in the Netherlands; the remaining 42 stores are located in Belgium. One third of the stores in the Netherlands are franchise stores. At this moment, approximately 80,000 people work for the AH brand in the Netherlands. Most of them work at AH itself, and 30,000 of them work for franchisees who operate an AH store. Several formats exist within the AH stores. The most common format is the Albert Heijn district store; other formats are Albert Heijn XL, Albert Heijn To Go and Albert Heijn Online. The online service delivers groceries at home, or customers can pick up their groceries at one of the 54 pick-up points. Together, the stores had a market share of 34.7 percent at the end of 2018 [1].

The assortment of AH consists of more than 30,000 product items from numerous brands. Four of these are AH's own brands, called AH Huismerk, AH puur & eerlijk, AH Excellent and AH BASIC.

Albert Heijn Transport is a department of Albert Heijn located in Geldermalsen. Albert Heijn Transport consists of inbound transport and outbound transport. Inbound transport runs from suppliers to one of the DCs, and outbound transport covers the deliveries to the stores. Based on internal information, AH Transport inbound has to deal with approximately 400 suppliers. The outbound transport needs to deliver goods to 886 stores almost every day. Every week, more than 400,000 load carriers are delivered to the stores. In total, AH Transport operates between 1,000 and 1,200 trucks every day.

1.2 Problem introduction

This section introduces the problem. First, it describes the problem context. Next, it identifies the problem.

1.2.1 Problem context

Sparks [2] researched the changes within retail logistics and concluded that in the past, retailers were for a large part passive participants in the supply chain. The suppliers determined when the products arrived at the retailer, which complicated the logistics for the retailers. In the current retail environment, this is unthinkable. Consumers nowadays expect well-stocked shelves, a wide variety of products, and products that are preferably as fresh as possible. To meet these expectations and to compete in a highly competitive market, retailers are forced to become independent of other stakeholders within the supply chain and to operate more efficiently. Over the last 30 years, retailers have built up their own logistics networks and now manage all the operations within these networks. How this impacts the retailer AH is described below.

This section explains the supply chain network of Albert Heijn; the role of Albert Heijn Transport is explained in the next section. The supply chain network of Albert Heijn is divided into a regular products chain, a fresh products chain and a frozen chain. The regular products chain consists mainly of non-perishable products which do not need refrigerated transport. The fresh chain consists of perishables which need both refrigerated transport and storage. The frozen chain consists of products which need to be deeply frozen during transport and storage. The supply chain structure of AH differs from the typical European retail structure, which is known for one central DC (CDC) that supplies every store in the country, some regional DCs (RDC) that supply a subset of the total stores, for example several provinces, and many local DCs that deliver to only a few stores in specific areas, such as agglomerations. Finally, that structure contains internal consolidation points where the load carriers from the different DC types are consolidated before they are shipped to a specific store [3]. The structure of AH consists of ten DCs belonging to three different types: AH operates two central DCs, two central fresh DCs, two frozen DCs and four regional DCs. The two central DCs are located in Geldermalsen (LDC) and Oss (OSS). The two central fresh DCs are the Shared Fresh Center (SFC) in Nieuwegein and the Shared Warehouse Fresh (SWH) in Zeewolde. Both are owned by external parties: the SFC by XPO Logistics and the SWH by Bakker Logistics. The frozen DCs are located in Hoogeveen (ELH) and Tilburg (ELT) and are owned by the external party XPO. The RDCs, however, are managed in-house and are located in Zaandam (DCZ), Zwolle (DCO), Tilburg (DCT) and Pijnacker (DCP). All the DCs in the structure of AH are supplied directly by a wide range of suppliers having their shipping points at their factories, industrial warehouses or wholesale warehouses.

The slow-moving items are products with low sales volumes and are distributed over the four central DCs (LDC, OSS, SFC and SWH). The fast-moving items are products with high sales volumes and are located at the four regional DCs (RDCs). Downstream, all stores are supplied by trucks coming from the RDCs. The bundling of products from the different DC types thus does not happen at dedicated consolidation points, but at the RDCs. Products allocated to a central DC therefore require a shipment to the RDCs, which AH calls a 'transit' ride. Each RDC is dedicated to a subset of stores in its region of the Netherlands: DCZ supplies the stores in the North-West, DCO the stores in the North-East, DCP the stores in the South-West, and DCT the stores in the South-East.

Albert Heijn Transport is responsible for the entire transport planning, which consists of inbound transport and outbound transport. The rides can be distinguished into: rides from suppliers to one of the DCs (inbound), the transit rides between DCs (inbound), and the deliveries to the stores (outbound).

1.2.2 Problem identification

The supply chain and transport domain are a rapidly changing environment [4]. Albert Heijn Transport, their external partners, and several experts identified three main challenges that retail distribution is facing at this moment. These challenges are:

• Increased supply chain complexity

• Increasing traffic congestion

• High dependency on driver availability

The increased supply chain complexity is a challenge AH identified. A supply chain is a complex network of business entities involved in the upstream and downstream flows of products and/or services, along with the related finances and information [5].

Complexity in a supply chain grows as customer requirements, the competitive environment and industry standards change, and as the companies in the supply chain form strategic alliances, engage in mergers and acquisitions, outsource functions to third parties, adopt new technologies, launch new products/services, and extend their operations to new geographies, time zones and markets [5]. In other words, the growth of supply chain complexity accelerates with trends such as globalization, sustainability, customization, outsourcing, innovation, and flexibility.

Traffic congestion is one of the major problems in the transportation domain [6]. Congestion creates a substantial variation in travel speeds during morning and evening rush hours. This is problematic for all vehicle routing models that rely on a constant value to represent vehicle speeds. Urban route designs that ignore these significant speed variations result in inefficient, unrealistic and sub-optimal solutions. Poorly designed routes that lead freight vehicles into congested urban traffic result, in some cases, in a waste of valuable time, and customers have to wait unreasonably long without having any reliable information about the actual arrival times of the vehicles. In these circumstances, it becomes difficult to satisfy the time windows during which the customers must be visited. This increases supply chain and logistics costs.

European road transport firms are racing towards a driver shortage crisis [7]. According to the UWV, the Dutch employment agency, the shortage of drivers in the Netherlands has been escalating since April 2016. More and more transport companies have trouble finding potential truckers, which makes carrying out everyday activities difficult. There are a few reasons for the lack of truckers in the Netherlands [8]. Economic growth, similarly to other EU countries, results in growing demand for transport. Currently the carriers are not able to meet the market demand. At the same time, a lot of truckers resign from work in the industry or retire, while there is little interest in the profession among young Dutch people.

The supply chain at Albert Heijn is a collaboration between different departments: Replenishment, Transport and Logistics. Replenishment is responsible for the forecast. Transport is responsible for the transportation of all the goods. Logistics is the department responsible for all the distribution centers.

For Albert Heijn Transport to remain competitive in the future, it needs to come up with smart solutions. One of the initiatives that started is the Simacan control tower. In cooperation with Simacan and all the external transport partners of Albert Heijn, the control tower was launched in December 2017. Simacan designed the control tower to monitor and control all transportation activities. The Simacan control tower is a neutral, supplier-independent data exchange platform: it ensures that the spatial data sets from all sources, sensors and logistics systems have a place where they can be connected to all other data, creating an ecosystem of information that can lead to insight into the physical world. In this research, a control tower is defined as an integrated platform where transportation companies and their stakeholders share information or data and connect different services. The main advantage of the control tower is the shared real-time information: the Transport department can keep real-time track of all trucks and deliveries. From this real-time data, the Transport department is able to gather a lot of data about its performance, since every delivery is tracked by the control tower. One of the problems that AH Transport identified is on-time delivery. Because more reliable data on all deliveries has been available since the introduction of the control tower, Albert Heijn can identify the number of on-time deliveries with higher accuracy. The next step is to gain insights and optimize the entire transport domain based on the real-time data collection, and thereby be able to deal with the three main challenges Albert Heijn identified. This next step defines the goal for this research: the problem Albert Heijn identified is to use and analyse the data from its control tower in order to deal with the challenges in the transport domain.

Traffic congestion in the Netherlands is increasing. Trucks operating from DC Pijnacker and DC Zaandam face the most problems with traffic congestion. Due to the congestion, it becomes more difficult to plan the routes such that the time windows within which the stores must be visited are met. This increases supply chain and logistics costs.

The dependency on driver availability also applies to Albert Heijn. Especially during holiday periods, there is a shortage of truck drivers, which results in long workdays for the available drivers. During the summer of 2018, the driver shortage eventually resulted in cancelled deliveries.

1.3 Research introduction

This section introduces the research, which consists of a case study in collaboration with Albert Heijn. With a case study, this research can investigate a contemporary phenomenon within its real-life context [9]. First, the scope is determined; second, the objectives are formulated; then the research questions are stated and the methodology is given.

1.3.1 Research scope

This thesis focuses on the outbound transport of the regional distribution center Pijnacker (DCP). The performance of Albert Heijn Transport in the DCP region is below average in comparison with the other regional distribution centers. In particular, the service level of on-time deliveries for stores supplied by DC Pijnacker is significantly lower than the average service degree. This below-average service degree can be partly explained by the previously mentioned main challenges: in the DCP region there is relatively more traffic congestion, and the external transport partners face problems with driver availability. The number of stores delivered by DCP is 139; DCP handles the regular products chain and the fresh chain for those 139 stores. One of the main advantages of DCP is the availability of data. Since Albert Heijn started with the control tower, all the data on the deliveries of the regular chain and the fresh chain from DCP is available in the control tower.

1.3.2 Research objective

The objective of this research is to determine to what extent the data collected by a transport control tower can be valuable for a retail company, and to what extent machine learning is applicable in the field of retail distribution. To apply machine learning in retail distribution, the objective is to build a predictive model that predicts the on-time arrival rate and helps to explain the variance in on-time arrivals of trucks.

1.3.3 Research questions

This research uses the following main research question and sub-questions:

M.Q: How can a control tower provide insights and be valuable within the distribution chain of a retailer, using data on planned and actual truck arrivals? And how can a control tower be used to explain the variance in the on-time arrival rate of trucks?

The main research question is developed together with Albert Heijn. The answer to the main research question provides insights for Albert Heijn based on the available data from the control tower. Furthermore, it provides a model that is able to explain the on-time delivery rate of the distribution center located in Pijnacker.

The practical contribution of this thesis is a proposed model to explain the variance in the arrival time of trucks in the area of retail distribution. The major contribution of this thesis is presenting the application of machine learning models to the data of a control tower in a retail distribution landscape.

The following sub-questions are formulated to help answer the main research question:

S.Q.1: What can we learn from literature about collaboration integration platforms in the domain of logistics?

S.Q.2: Which data is available in the Simacan control tower for Albert Heijn?

S.Q.3: What features are relevant for explaining the accuracy of arrival times of trucks?

S.Q.4: What would be a good machine learning classifier?

S.Q.5: How do different machine learning techniques perform in explaining on-time arrival rate of trucks?

S.Q.6: What is the usability of the machine learning model and the outcomes of the model?

1.3.4 Research method

A research method is required to answer the research questions in a structured manner.

The research method used in this study is based on the research of Shmueli and Koppius [10]. This method provides several steps in building an empirical model (Predictive or Explanatory).

This thesis is organized according to the research method shown in Figure 1.1. The research method and the report structure are described below. In this research, steps 3 and 4 are exchanged. This is done to first explore the data and understand its context and values, so that in the data preparation step the different data sources can be prepared and combined into one dataset for the modelling phase.

Figure 1.1: Research methodology


1. Goal Definition

Building a predictive model requires careful specification of what needs to be predicted, as this impacts the type of models and methods used later on. One common goal in predictive modeling is to accurately predict an outcome value for a new set of observations. This goal is known in predictive analytics as prediction (for a numerical outcome) or classification (for a categorical outcome).

The goal of this research is to build a model to predict and explain the on-time arrival of a truck. The model will be used to classify whether a truck is on-time or not. Based on this predictive model and its outcomes, the impact of several variables can be ranked in order to explain which variables influence the on-time arrival of a truck. So, the goal is to build a predictive classifier.
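As an illustration of this classification target, the following sketch derives a binary late/on-time label from planned and actual arrival times with Python and pandas. The column names are hypothetical and do not necessarily match the actual control tower fields.

import pandas as pd

# Minimal sketch: derive the binary label from planned and actual arrival
# times. Column names are illustrative, not the real control tower fields.
deliveries = pd.DataFrame({
    "planned_arrival": pd.to_datetime(["2019-05-01 08:00", "2019-05-01 09:30"]),
    "actual_arrival": pd.to_datetime(["2019-05-01 08:10", "2019-05-01 09:20"]),
})

# y = 1 if the truck arrives too late, y = 0 if it arrives on-time.
deliveries["y"] = (deliveries["actual_arrival"]
                   > deliveries["planned_arrival"]).astype(int)
print(deliveries)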

2. Data collection & study design

Even at the early stages of study design and data collection, issues of what and how much data to collect, according to what design, and which collection instrument to use are considered differently for prediction versus explanation.

Since the dataset used in this research is an external dataset provided by Simacan, the data collection step is not applicable to this research. Furthermore, the additional external datasets used in this research are publicly available. The only relevant question applicable to this research is the amount of data that will be used. As stated in section 1.3.1, only DC Pijnacker is selected for this research, which limits the amount of data.

3. Data preparation

There are two common data preparation operations: handling missing values and data partitioning. Most real datasets contain missing values, thereby requiring one to identify the missing values, to determine the extent and type of missingness, and to choose a course of action accordingly. A popular solution for avoiding overoptimistic predictive accuracy is to evaluate performance not on the training set, that is, the data used to build the model, but rather on a holdout sample which the model 'did not see'. The creation of a holdout sample can be achieved in various ways, the most commonly used being a random partition of the sample into training and holdout sets. A popular alternative is cross-validation.
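A minimal sketch of both evaluation strategies with scikit-learn, using synthetic data as a stand-in for the prepared delivery dataset, could look as follows:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic stand-in for the prepared delivery dataset.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Random partition into a training set and a holdout set the model 'did not see'.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print("holdout accuracy:", clf.score(X_holdout, y_holdout))

# Popular alternative: k-fold cross-validation.
print("5-fold f1:", cross_val_score(clf, X, y, cv=5, scoring="f1").mean())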

In this research, the data preparation step consists of handling errors in the data and preparing the dataset that will be used by the machine learning models. This 'preparing' of the complete dataset consists of combining the Simacan set with the external data sources and preparing the variables for analysis.

4. Exploratory data analysis

Exploratory data analysis (EDA) is a key initial step in both explanatory and predictive modeling. It consists of summarizing the data numerically and graphically, reducing their dimension, and 'preparing' for the more formal modeling step.

The EDA is conducted to understand the datasets that are used in this research. The data is summarized, and several statistics are shown. The EDA is conducted before the data preparation, in order to understand the data before it is processed and prepared.

5. Choice of variables

Predictive models are based on association rather than causation between the predictors and the response. The choice of potential predictors is often wider than in an explanatory model to better allow for the discovery of new relationships. Predictors are chosen based on a combination of theory, domain knowledge, and empirical evidence of association with the response.

The choice of variables is made, at first, by using the expertise of several experts in the transportation domain. After the dataset is prepared for the modelling phase, the relevant variables are selected based on the domain knowledge of these experts.

6. Choice of potential methods

In predictive modeling, where the top priority is generating accurate predictions of new observations and the prediction is often unknown, the range of plausible methods includes not only statistical models (interpretable and uninterpretable) but also data mining or machine learning algorithms.

In this research the choice of suitable models is based on theory. Well-known machine learning classifiers are selected and briefly explained in the introduction to machine learning.

7. Evaluation, validation & model selection

Choosing the final model among a set of models, validating it, and evaluating its performance based on different metrics.

In Shmueli and Koppius [10], model selection is based on the best-scoring model for prediction purposes. In this research, a prediction model is used to help explain the variance in the on-time arrival rate. The candidate models are therefore compared on their metrics, and the best-scoring models are selected and further analyzed. However, the main goal of these models is not to predict but to explain the variance, so the best-scoring classifier is selected to help explain the variance in the on-time arrival of trucks.
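As an illustration, such a metric-based comparison can be sketched with scikit-learn as follows; the candidate classifiers and the f1 metric mirror those used in this thesis, while the data is synthetic:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Compare the candidate classifiers on the same metric; the best scorer is
# then analyzed further to help explain the variance.
candidates = {
    "random forest": RandomForestClassifier(random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbours": KNeighborsClassifier(),
}
for name, model in candidates.items():
    f1 = cross_val_score(model, X, y, cv=5, scoring="f1").mean()
    print(f"{name}: f1 = {f1:.3f}")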

8. Model Use & Reporting

At the end of the explanatory modeling process, a predictive model is used to make predictions from new data, and the results are used to formulate new hypotheses, establish relevance of existing theories, and assess predictability of certain relations.

The model used in this research explains the variance in the on-time arrivals with the help of the outcomes of the predictions.


1.3.5 Report structure

Background

This chapter consists of the background required to understand this thesis. Its main topics are an introduction to supply chains and retail distribution, to understand the context of this thesis and the domain of retail logistics. Next, a background on machine learning is given, to understand the models and methods used in this thesis. Furthermore, recent papers on collaboration integration platforms in the domain of logistics are examined; this last step is performed to find relevant literature on the use of a control tower in distribution networks. This phase presents the results needed to answer sub-question S.Q.1. Chapter 2 contains the findings of this phase.

Data understanding

The dataset used in this research is provided by an external party. Therefore this phase is required to understand the content of the provided dataset. The dataset content is explored with the use of multiple descriptive statistics. This phase also consists of verifying the data quality. This phase presents the results needed to answer sub-question S.Q.2. Chapter 3 contains the findings of this phase and steps 2 and 4 of the research methodology. Steps 3 and 4 are exchanged in this research, as stated in section 1.3.4.

Data preparation

This chapter describes the data preparation and the choice of variables. Multiple preparation steps are needed to construct a dataset that can be used in this research. Chapter 4 describes the steps taken during this phase and presents the answer to sub-question S.Q.3. It contains the findings of this phase and steps 3 and 5 of the research methodology.

Modelling

This phase describes the choice of potential methods. It consists of selecting machine learning techniques, setting up experiments, and training and testing the machine learning techniques. This phase presents the results needed to answer sub-question S.Q.4. Chapter 5 describes the steps taken during this phase and contains step 6 of the research methodology.

Results analysis

This chapter describes the evaluation, validation and model selection. The results of the experiments and the feature analysis are collected and documented during this phase. This phase presents the results needed to answer sub-question S.Q.5. Chapter 6 contains the findings of this phase and step 7 of the research methodology.


Usability of the Model

This chapter describes the proposed model and how it can be used. This phase consists of analyzing the usability of the model proposed in this research. This phase results in the answer to S.Q.6. Chapter 7 describes the findings of this phase and step 8 of the research methodology.

Conclusion, discussion and future research

Finally, Chapter 8 concludes with the answers to the research questions, and Chapter 9 discusses the results of this research and suggests potential future work.

Background

Each section of this chapter describes the background knowledge necessary to understand a specific part of this thesis. Section 2.1 describes the relevant literature on supply chains within the retail area, to understand the context of this thesis. Section 2.2 gives an introduction to machine learning, to understand the models and methods used in the modelling phase of this thesis. Section 2.3 describes the related work on collaboration integration platforms in the domain of logistics.

2.1 Distribution network

A distribution network is part of the supply chain. The first subsection defines a supply chain; the second defines a distribution network and presents several distribution network design options.

2.1.1 Supply chain

A supply chain is an integrated manufacturing process where raw materials are converted into final products, which are then delivered to customers. At its highest level, a supply chain comprises two basic, integrated processes [11]:

1. The Production Planning and Inventory Control Process
2. The Distribution and Logistics Process

These processes, shown below in Figure 2.1, provide the basic framework for the conversion and movement of raw materials into final products. The Production Planning and Inventory Control Process encompasses the manufacturing and storage sub-processes, and their interface(s). More specifically, production planning describes the design and management of the entire manufacturing process (including raw material scheduling and acquisition, manufacturing process design and scheduling, and material handling design and control). Inventory control describes the design and management of the storage policies and procedures for raw materials, work-in-process inventories, and usually, final products. The Distribution and Logistics Process determines how products are retrieved and transported from the warehouse to retailers. These products may be transported to retailers directly, or may first be moved to distribution facilities, which, in turn, transport products to the retailers. This process includes the management of inventory retrieval, transportation, and final product delivery. These processes interact with one another to produce an integrated supply chain. The design and management of these processes determine the extent to which the supply chain works as a unit to meet the required performance objectives.

Figure 2.1: The two basic, integrated processes of a supply chain

2.1.2 Distribution network

At the highest level, performance of a distribution network should be evaluated along two dimensions [12]:

1. Customer needs that are met
2. Cost of meeting customer needs

The customer needs that are met influence the company’s revenues, which along with cost decide the profitability of the delivery network. While customer service consists of many components, we will focus on those measures that are influenced by the structure of the distribution network. These include:

• Response time

• Product variety

• Product availability

• Customer experience

• Order visibility

• Return-ability

Response time is the time between when a customer places an order and receives the product. Product variety is the number of different products/configurations that a customer desires from the distribution network. Availability is the probability of having a product in stock when a customer order arrives. Customer experience includes the ease with which the customer can place and receive their order. Order visibility is the ability of the customer to track their order from placement to delivery. Return-ability is the ease with which a customer can return unsatisfactory merchandise and the ability of the network to handle such returns. It may seem at first that a customer always wants the highest level of performance along all these dimensions. In practice, however, this is not always the case. For example, customers ordering an item at an online store are willing to wait longer than those that drive to a nearby store to get the same item. On the other hand, customers can find a far larger variety of items at an online store compared to the nearby store.

Figure 2.2: (a) Relationship between desired response time and number of facilities; (b) relationship between number of facilities and logistics cost; (c) variation in logistics cost and response time with number of facilities

Firms that target customers who can tolerate a long response time require only a few locations that may be far from the customer, and can focus on increasing the capacity of each location. On the other hand, firms that target customers who value short response times need to locate close to them. These firms must have many facilities, with each location having a low capacity. Thus, a decrease in the response time customers desire increases the number of facilities required in the network, as shown in Figure 2.2a. For example, a local store provides its customers with items on the same day but requires a large number of stores to achieve this goal. An online shop, on the other hand, takes about a week to deliver an item to its customers, but only uses a few locations to store its products.

Changing the distribution network design affects the following supply chain costs: inventories, transportation, facilities and handling, and information. As the number of facilities in a supply chain increases, the inventory and resulting inventory costs also increase, as shown in Figure 2.2b. For example, Amazon with fewer facilities is able to turn its inventory about 12 times a year, while Borders with about 400 facilities achieves only about two turns per year [12]. As long as inbound transportation economies of scale are maintained, increasing the number of facilities decreases total transportation cost, as shown in Figure 2.2b. If the number of facilities is increased to a point where there is a significant loss of economies of scale in inbound transportation, increasing the number of facilities increases total transportation cost. A distribution network with more than one warehouse allows Amazon.com to reduce transportation cost relative to a network with a single warehouse. Facility costs decrease as the number of facilities is reduced, as shown in Figure 2.2b, because a consolidation of facilities allows a firm to exploit economies of scale.

Total logistics costs are the sum of inventory, transportation, and facility costs for a supply chain network. As the number of facilities is increased, total logistics costs first decrease and then increase, as shown in Figure 2.2c. Each firm should have at least the number of facilities that minimizes total logistics costs. As a firm wants to further reduce the response time to its customers, it may have to increase the number of facilities beyond the point with minimum logistics costs. A firm should add facilities beyond the cost-minimizing point only if managers are confident that the increase in revenues because of better responsiveness is greater than the increase in costs because of the additional facilities.

2.1.3 Retail distribution

The present section elaborates on a retail distribution system. First, we characterize the structure of the retail network. Second, we describe the subsystems and processes that are part of a retailer’s distribution network.

Characterization of the retail network structure

DCs are the core of retail supply network architectures for several retail sectors. Although other flow types (e.g., direct-to-store deliveries from suppliers) also play a role in retail distribution, the overwhelming majority of products are shipped via retail-operated DCs in several sectors [13]. The products are stored temporarily at the DCs until picked according to store orders and shipped to the stores. Retail logistics networks often consist of DCs belonging to different DC types. A retail supply network could for instance comprise central, regional and local DCs. The central DC (one site) might as an example serve all stores in a country, the regional DCs (some sites) could be dedicated to serving a subset of stores in specific areas, e.g., several states or provinces, while local DCs (many sites) serve comparatively few stores in relatively small specific areas, e.g., single states or agglomerations. Referring to Fleischmann [3], this is denoted as the structure of a European retail distribution system. Mostly, DCs are supplied directly from supplier shipping points (e.g., manufacturers, industry warehouses or wholesale warehouses), which implies that there are no supply relationships between the different types of DCs [13].

Characterization of the subsystems and processes

Several processes occur in the distribution network described above, such as transportation, storing products, picking store orders, and stocking the shelves in the stores. Each of these processes belongs to a certain subsystem of the retail chain, i.e., (1) inbound transportation (supplier - DC), (2) warehousing, (3) outbound transportation (DC - stores), and (4) instore operations. These subsystems are explained in the following paragraphs.

Inbound transportation

Inbound transportation comprises transportation tasks between the supply points of the manufacturers and the DCs. The transportation efforts depend on the number of shipments and the distance from a supplier to a specific DC. The number of shipments to a DC is dependent on the sales volumes and the physical volumes of the products delivered. From a single-product perspective, an assignment to the DC type with the lowest volume-weighted average distances may save inbound transportation costs, assuming that trucks always have full loads. Transportation costs are especially important for the allocation of products with high sales and high physical volumes. The assignment of a product to a certain DC type influences the volume per period that a supplier has to deliver to a DC. Regional or local DCs serve fewer stores than a central DC, which serves a large number of stores and from which correspondingly more store demand has to be fulfilled. When choosing more decentralized distribution, i.e., via regional or local DCs, less volume of a product is required per DC, but at multiple places. The efficiency of inbound transportation then particularly depends on the extent to which transportation units can be utilized. Consequently, inbound delivery costs for a stock keeping unit (SKU) depend on the other SKUs of the same supplier that are allocated to the same DC type, as freight space can be shared.

Warehousing

In the retail DCs, products are received, stored in a storage area and provisioned in the picking area, where loading carriers, i.e., pallets or roll cages, are filled with products according to store orders [14]. In the warehousing domain, inventory holding and picking are affected by the SKU assignment and influence the allocation decision. Safety stocks in the DCs are affected by inventory pooling effects, which are higher the more central the stock holding point is and the larger the sales variances of the SKUs are. The system-wide safety stock required for a product increases the more regional the DC type is, since the number of DCs per type goes up and the degree of inventory pooling therefore decreases. Both inventory positions, cycle stock and safety stock, cause costs for the capital tied up in inventories. The higher the value and the larger the sales variance of a product, the greater the importance of inventory costs for the DC type allocation. Besides inventory costs, varying picking costs may also have to be considered when assigning products to DC types. Picking costs are often responsible for more than 50 percent of the total warehouse operating expenses [15]. Differences in picking costs between DCs may occur due to differing picking technologies applied at the DCs and varying labor costs due to divergent wage agreements in different geographical areas.

Outbound transportation

While in inbound transportation the source-destination relations are a result of the allocation of the products (not every supplier has to serve all DC types), outbound transport is predetermined. Outbound transportation bridges the geographical disparity between DCs and the stores. In addition, store delivery frequency is very dependent on store-specific situations and last-mile volume bundling across stores [16]. Outbound transportation from the DCs to the stores depends on the allocation of products across the several DCs. Distributing products via more regional DCs saves outbound transportation efforts, since distances from the DCs to the stores get shorter, and vice versa via central DCs. This is especially important for high-volume products. However, it might be in contradiction to inbound transportation and the resulting cycle stock: a more central DC type comprises fewer DCs, and suppliers may therefore supply higher volumes or deliver their products more frequently. In addition, the safety stocks required increase as DC types become more decentralized [17].

Instore operations

Instore operations efforts are known to cause a high share of total operational costs in retailing [13]. In the stores, the products delivered have to be stacked onto the shelves. As known from the literature, this effort decreases as, for example, the case pack size and the number of case packs per order line rise, assuming sufficient shelf space [18]. Each product sold in the stores belongs to a specific product category. Shelf filling activities are more efficient if all products of one layout segment are sourced from a single DC.

Figure 2.3 gives an overview of the sub-processes in the retail network structure: (A) inbound transportation costs, (B) warehouse inventory costs, (C) warehouse picking costs, (D) outbound transportation costs, and (E) instore operational costs. In addition, the associated cost drivers are denoted: inbound delivery volume and inbound delivery frequency from one supplier to a specific DC, cycle stocks and system-wide safety stocks, picking volume, technology and labor costs, outbound delivery volume, and the number of different DC types to which products of one store layout segment are assigned.

Figure 2.3: Sub-processes in the retail network structure

2.2 Machine learning

The previous section introduced the background on supply chain and retail distribution. This section gives the next necessary background for this thesis: an introduction to machine learning, to explain and understand the models and methods used in the modelling phase. The models are examined and tested on their applicability to the dataset of the AH/Simacan control tower. The definition of machine learning used throughout this research is: "the complex computation process of automatic pattern recognition and intelligent decision making based on training sample data" [19]. A more general definition of machine learning is "the process of applying a computing-based resource to implement learning algorithms" [19]. Based on different books on machine learning [19][20][21][22], the basic theory of the different machine learning techniques used in this research is described in this section. Three categories of learning algorithms are: supervised learning, unsupervised learning, and semi-supervised learning. In supervised learning, the goal is to create a model which predicts $y$ based on some $x$, given a training set consisting of example pairs $(x_i, y_i)$. Here $y_i$ is called the label of the example $x_i$. When $y$ is continuous, the problem at hand is called a regression problem, and when $y$ is discrete the problem at hand is called a classification problem. Throughout this research, the focus is on supervised learning, as we try to detect whether a truck described by some features $x$ is on-time at its destination. In this case, the prediction value $y$ takes the value 1 if a truck arrives too late and 0 if a truck arrives on-time. Section 2.2.1 describes the machine learning classifiers used in this research. Section 2.2.2 describes the metrics used to evaluate classifiers and the challenges of an imbalanced dataset. Lastly, Section 2.2.3 describes the cross-validation methodology.

2.2.1 Machine Learning classifiers

In this section several machine learning algorithms are briefly explained.

Random Forest

The Random Forest (RF) classifier is an ensemble classifier that uses multiple decision tree classifiers to classify test instances. An example of a decision tree is shown in Figure 2.4. The basic principle of this classifier is to train multiple decision trees and have them make a classification together. Each of those trees is trained on a subset of the training data drawn with replacement. The training procedure is similar to how a normal decision tree is trained, except for one difference: at each split in the tree, a random selection of features is drawn, from which the feature for the split is selected. Usually the square root of the number of available features is used as the number of features to be drawn.

The reason for this random feature selection is to decrease the correlation between the individual trees. A major disadvantage of decision trees is their instability: decision trees are known for high variance, and often a small change in the data can cause a large change in the final tree. Random Forests try to reduce the variance of decision trees by taking multiple decision tree classifiers to classify testing instances. Classification is then done using a majority vote among all the decision trees. Some advantages of Random Forest are: i) it overcomes overfitting, and ii) it can deal with high-dimensional data. Disadvantages include: i) accuracy depends on the number of trees, and ii) for very large data sets, the size of the trees can take up a lot of memory.

Figure 2.4: Example of Decision Tree

Algorithms for constructing decision trees for a Random Forest usually work top-down, by choosing at each step the variable which best splits the remaining items. Different algorithms use different metrics for measuring the 'best' split variable. One of the most used metrics is the Gini impurity [23]. Gini impurity is a measure of the likelihood of incorrect classification of a new instance of a random variable, if that new instance were randomly classified according to the distribution of class labels in the data set. In other words, Gini impurity measures how often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset. The Gini impurity is computed by summing, over all classes $i$, the probability $p_i$ of an item with label $i$ being chosen times the probability $\sum_{k \neq i} p_k = 1 - p_i$ of a mistake in categorizing that item:

$$I_G(p) = \sum_{i=1}^{J} p_i \sum_{k \neq i} p_k = \sum_{i=1}^{J} p_i (1 - p_i)$$

It reaches its minimum (zero) when all cases in the node fall into a single target category [24].
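To make this concrete, the sketch below implements the Gini impurity formula directly and fits a scikit-learn Random Forest that uses Gini impurity as its split criterion; the data is synthetic and purely illustrative:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def gini_impurity(labels):
    # I_G(p) = sum_i p_i * (1 - p_i) over the class labels in a node.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1 - p)))

print(gini_impurity([0, 0, 0, 0]))  # 0.0: pure node, minimum impurity
print(gini_impurity([0, 1, 0, 1]))  # 0.5: maximally mixed binary node

# A forest of 100 trees, each grown on a bootstrap sample, with a random
# subset of features considered at every split.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
rf = RandomForestClassifier(n_estimators=100, criterion="gini",
                            random_state=0).fit(X, y)
print(rf.feature_importances_)  # impurity-based feature importance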

K-Nearest Neighbour

The k-nearest neighbour (KNN) classifier is a distance-based classifier. Distance-based classifiers generalise from training data to unseen data by looking at similarities between training instances. Given a query instance q, the classifier finds the k training instances closest in distance to the query instance q. Subsequently, it classifies the query instance using a majority vote among the k neighbours. The distance from the query instance to the training instances can be calculated using different metrics, such as the Euclidean distance, Minkowski distance, or Manhattan distance. An example of k-nearest neighbour classification is given in Figure 2.5. For the computation of the KNN there are three algorithms, described below.

Brute-force

The most basic computation method for nearest neighbour classification is brute force. This algorithm simply calculates the distances between all points in the data set and uses those to determine which points are closest. For small sample sizes this algorithm can return accurate results. Due to its naive nature, however, brute force quickly becomes an unfeasible approach when the sample size increases.

k-d tree

To counter the issue of the brute-force method being unfeasible for larger sample sizes, a more efficient method was developed by Bentley (1975). This method uses a decision tree to efficiently store distance information, requiring fewer computations. Assume three points A, B and C; of these points, A and B are very distant and C and B are close. From this information it follows that points A and C are also very distant. A k-d tree is constructed by iterating over several steps. In each iteration a (not previously used) feature is selected at random, on which a decision will be based. The median value of the selected feature is calculated, and values larger than this median are separated from the smaller values. Now two branches have been created, each with approximately half of the samples. On both branches these steps are repeated. This process continues until the number of samples in a branch drops below a certain threshold. After a tree has been generated, the approximately closest neighbours can easily be determined: all nodes of the tree are applied to the new instance, and the branch where the new instance ends up contains samples that are close. For all of these samples the distance to the new instance is calculated to find the nearest ones.

Ball tree

In high-dimensional space it becomes computationally expensive to create a k-d tree. In those situations it is computationally favorable to create a ball tree. Omohundro [25] describes a ball tree as a binary tree in which each node represents a hypersphere, called a ball. Each node of the tree splits the data into two disjoint sets, and each set is contained by the smallest ball containing all its points. The hyperspheres are allowed to cross; data points are assigned to the sphere whose center is closest.

Advantages of KNN are [3]: i) high precision and accuracy, ii) non-linear classification, and iii) no assumptions about the features. The disadvantages are: i) it is sensitive to unbalanced sample sets, and ii) it is computationally expensive.
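In scikit-learn, the neighbour-search strategy discussed above is exposed as a parameter of the classifier; a minimal sketch on synthetic data:

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# k-NN with k = 5; 'algorithm' selects the neighbour search discussed
# above ('brute', 'kd_tree' or 'ball_tree'); p=2 gives Euclidean distance.
knn = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree",
                           metric="minkowski", p=2).fit(X, y)
print(knn.predict(X[:3]))  # majority vote among the 5 nearest neighbours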

Figure 2.5: Example of K-Nearest Neighbour Classification

Naive Bayes

Naive Bayes (NB) is a statistical classifier that uses Bayes' theorem to predict the probability of a given query instance belonging to a certain class. Bayes' theorem, also called Bayes' rule, calculates the probability of a hypothesis $H$ being true, given some evidence $e$, according to the following formula:

$$P(H \mid e) = \frac{P(e \mid H) \, P(H)}{P(e)}$$

The classifier is called naive because it assumes conditional independence, making the computation of the above formula less computationally expensive, especially for datasets with many features. Although Naive Bayes assumes conditional independence, it performs well in domains where independence is violated [14]. Advantages of Naive Bayes are: i) high speed, ii) insensitivity to irrelevant feature data, and iii) it is a simple and mature algorithm. A disadvantage is that it requires the assumption of independence of features [26].
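As an illustration, the sketch below fits a Gaussian Naive Bayes classifier, a common variant for continuous features (the text above does not prescribe a specific variant), on synthetic data:

from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# Naive Bayes: applies Bayes' theorem under the 'naive' assumption that
# features are conditionally independent given the class.
nb = GaussianNB().fit(X, y)
print(nb.predict_proba(X[:3]))  # posterior probability P(H|e) per class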

Generalized Linear Model

The Generalized Linear Model is a generalization of the general linear model [27]. In its simplest form, a linear model specifies the (linear) relationship between a dependent (or response) variable $Y$ and a set of predictor variables, the $x$'s, so that:

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$

Generalized linear models are a class of linear models which unify several widely used models, including linear regression and logistic regression. The distribution over each output is assumed to be an exponential family distribution whose natural parameters are a linear function of the inputs.

Logistic Regression

Logistic Regression is one of the generalized linear models used in this research. It is one of the simplest and most commonly used machine learning algorithms for two-class classification [28]. It is easy to implement and can be used as the baseline for any binary classification problem; its basic fundamental concepts are also constructive in deep learning. Logistic regression describes and estimates the relationship between one dependent binary variable and the independent variables. Figure 2.6 gives an example of a Logistic Regression.

Figure 2.6: Example of Logistic Regression

Unlike in linear regression, it is not possible to determine a closed-form equation for the coefficients. Instead, other methods, like maximum likelihood estimation, are used. In this method an iterative process is used, during which in each iteration the coefficients are slightly changed to try to improve the likelihood. In this research two methods to fit the model are taken into account. These methods will not be discussed in depth, since that is not within the scope of this research. The first method is liblinear, which uses a coordinate descent algorithm to find suitable values for the coefficients. The second method is saga, which uses a stochastic average gradient descent; this method is usually faster on large data sets.

The loss function that is minimized usually includes a regularization term. Such a term penalizes complex models and favors simpler ones. With Logistic Regression two types of regularization are commonly used, L1 and L2. The first favors sparse models, i.e. models where a large fraction of the coefficients is zero. L2 is used as regularization term when a sparse model is not suitable. When the data set contains highly correlated features, L1 should be used: it picks a single one of the correlated features and sets the coefficients of the others to zero, whereas L2 would simply shrink the coefficients of all correlated features. Usually a parameter is added to the algorithm to control the strength of the regularization.

Artificial neural networks

Artificial neural networks (ANN) are machine learning models that use a structure of nodes, i.e. artificial neurons, to classify test instances [29]. These nodes are connected to each other by directed links. An ANN consists of an input layer, some hidden layers, and an output layer. Every directed link between neurons has a numeric weight, shown as w_ij in the example ANN in Figure 2.7. These weights are used in the activation function of each node, which determines the node's output. Different learning algorithms can be used to determine the number of hidden layers, the number of neurons, and the weights between the neurons; among the most popular are feed-forward back-propagation and radial basis function networks. This research uses the Perceptron classifier, a class of ANN that uses back-propagation for learning.

Figure 2.7: Example of an Artificial Neural Network
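A minimal training sketch, assuming scikit-learn and synthetic data. Note, as a caveat, that scikit-learn's Perceptron is a single linear layer trained with the perceptron update rule; the multi-layer, back-propagation-trained variant in that library is MLPClassifier.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import Perceptron

    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

    # Iteratively adjusts the link weights until convergence (tol) or
    # until max_iter passes over the data are reached.
    clf = Perceptron(max_iter=1000, tol=1e-3, random_state=0)
    clf.fit(X, y)
    print(clf.score(X, y))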


AdaBoost

Adaptive boosting (AdaBoost or Ada) is, like the Random Forest classifier, an ensemble classifier. AdaBoost uses multiple training iterations on subsets of the dataset to boost the accuracy of a (weak) machine learning classifier. The classifier is first trained on a subset of the dataset. Then all training instances are weighted: any sample not correctly classified is weighted more, giving it a higher probability of being chosen for the training set of the next iteration, while any sample correctly classified is weighted less. This process is repeated until the set maximum number of estimators is reached. AdaBoost is known for producing accurate machine learning classifiers [19]. However, a disadvantage of AdaBoost is that it is a greedy learner, i.e. it can settle on suboptimal solutions. In this research, AdaBoost is used with (standard) decision trees.
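A minimal sketch, assuming scikit-learn's AdaBoostClassifier on synthetic data:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier

    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

    # The default base learner is a depth-1 decision tree (a stump);
    # n_estimators caps the number of boosting iterations described above.
    ada = AdaBoostClassifier(n_estimators=100, random_state=0)
    ada.fit(X, y)
    print(ada.score(X, y))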

Gradient Boosting

Boosting is a method of converting weak learners into strong learners. Like the Random Forest classifier and AdaBoost, it is an ensemble classifier. In boosting, each new tree is fit on a modified version of the original data set.

Gradient Boosting trains many models in a gradual, additive and sequential manner.

The major difference between AdaBoost and the Gradient Boosting algorithm is how the two algorithms identify the shortcomings of weak learners (e.g. decision trees). While AdaBoost identifies the shortcomings through highly weighted data points, gradient boosting does so through the gradients of the loss function (y = ax + b + e, where e is the error term) [30]. The loss function is a measure of how well the model's coefficients fit the underlying data. Which loss function is appropriate depends on what is being optimized. For example, when predicting house prices with a regression, the loss function would be based on the error between true and predicted house prices. Similarly, if the goal is to classify credit defaults, the loss function would be a measure of how good the predictive model is at classifying bad loans. One of the biggest motivations for using gradient boosting is that it allows one to optimize a user-specified cost function, instead of a fixed loss function that usually offers less control and does not necessarily correspond with real-world applications.
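A minimal sketch, assuming scikit-learn's GradientBoostingClassifier on synthetic data:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier

    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

    # Each new tree is fit to the gradient of the loss on the current
    # ensemble's predictions; learning_rate shrinks each tree's contribution.
    gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                     max_depth=3)
    gbc.fit(X, y)
    print(gbc.score(X, y))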

2.2.2 Evaluating classifiers

Different performance metrics exist to evaluate a classifier.

Confusion matrix

The most basic performance metrics are summarized in a confusion matrix. The design of a confusion matrix is shown in Table 2.1.

                     Actual Positive       Actual Negative
Predicted Positive   True Positive (TP)    False Positive (FP)
Predicted Negative   False Negative (FN)   True Negative (TN)

Table 2.1: Confusion matrix

The confusion matrix shows how many too-late instances were correctly classified as being too late (TP), how many on-time instances were incorrectly classified as too late (FP), how many on-time instances were correctly classified as being on time (TN), and how many too-late instances were missed, i.e. classified as on time (FN). Other metrics and their formulas, built on the counts in Table 2.1, are shown in Table 2.2, followed by a small computation example. A frequently used metric is the accuracy, defined as the percentage of correct predictions (TP + TN) of the total number of predictions (TP + TN + FP + FN). This metric, however, might not reflect the performance of a classifier well. On a skewed dataset, that is, a dataset containing more of one class than the other, high accuracy can be achieved by always predicting the majority class. For example, in a dataset consisting of 90% on-time arrivals and 10% late arrivals, always predicting on time results in an accuracy of 90%. In the case of a skewed dataset, the performance metrics Precision (PPV) and/or Recall (TPR) reflect the performance of a classifier more realistically. The harmonic mean of Precision and Recall is reflected in the f1-score (F-score with α = 1).

Metric                      Formula
Accuracy                    (TP + TN) / (TP + TN + FP + FN)
True Positive Rate (TPR)    TP / (TP + FN)
False Positive Rate (FPR)   FP / (FP + TN)
True Negative Rate (TNR)    TN / (TN + FP)
Precision (PPV)             TP / (TP + FP)
F-score (F-measure)         (1 + α^2) * (PPV * TPR) / (α^2 * PPV + TPR)

Table 2.2: Performance Metrics
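The following sketch computes these metrics with scikit-learn (an assumption) on a small set of hypothetical labels, where 1 stands for too late and 0 for on time:

    from sklearn.metrics import (confusion_matrix, precision_score,
                                 recall_score, f1_score)

    y_true = [1, 1, 0, 0, 1, 0, 1, 0]  # hypothetical: 1 = too late, 0 = on time
    y_pred = [1, 0, 0, 0, 1, 1, 1, 0]

    # Unpack the counts of Table 2.1 from the confusion matrix.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(tp, fp, fn, tn)
    print(precision_score(y_true, y_pred))  # PPV: TP / (TP + FP)
    print(recall_score(y_true, y_pred))     # TPR: TP / (TP + FN)
    print(f1_score(y_true, y_pred))         # harmonic mean of the two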

Area under the curve

To understand the details of the area under the curve, the receiver operating characteristic (ROC) curve must first be explained. The ROC curve visualizes classifier performance and has long been used in signal detection theory to depict the trade-off between the true and false positive rates of classifiers [31].

An ROC curve is created by plotting the true positive rate (TP/P) against the false positive rate (FP/N) for different thresholds. Since classifiers calculate a score between 0.0 and 1.0, a threshold has to be chosen as the border between positive and negative classifications. The calculated score, x, can be seen as being sampled from a continuous random distribution X. An instance is classified as positive if x > T, with T being the chosen threshold. Different thresholds result in different true and false positive rates.

Figure 2.8 shows three examples of an ROC curve: a random model and two models with predictive capabilities. The ROC curve of a random model approaches the diagonal stretching from (0, 0) to (1, 1). The reason behind this behavior is best explained with an example. Assume that a random fraction K of the instances is classified as positive. Then a fraction K of the truly positive instances will be correctly classified as positive, and the same fraction K of the truly negative instances will be incorrectly classified as positive, so the true and false positive rates are both equal to K.

For models that perform better than random guessing, the true positive rate will be higher than the false positive rate, and thus the model will have an ROC curve above the diagonal.

The area under the curve (AUC) is a measure that summarizes the ROC curve in a single number; note that this is impossible without some loss of information. The name is apt: the AUC is simply the area under the ROC curve. For the ROC curve of the random model graphed in Figure 2.8 the AUC is exactly 0.5; for Model B the AUC is approximately 0.67. Models that perform better than a random classifier have an AUC above 0.5, and a perfect classifier has an AUC of 1.0. A brief computation sketch follows Figure 2.8.

Figure 2.8: Graph containing the ROC of a random classifier and two better performing classifiers.
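As a sketch, assuming scikit-learn and synthetic data, the ROC curve and AUC can be computed from per-instance scores as follows:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_curve, roc_auc_score

    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X, y)

    scores = clf.predict_proba(X)[:, 1]          # score in [0.0, 1.0] per instance
    fpr, tpr, thresholds = roc_curve(y, scores)  # one (FPR, TPR) point per threshold T
    print(roc_auc_score(y, scores))              # area under the ROC curve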

Imbalanced Dataset

In real-world data sets the number of 'interesting cases' is often small in comparison to the total number of instances. Consider for example a data set with the on-time arrival rate of trucks: since in normal situations a truck is on time, the data set is imbalanced. This results in a few challenges while training and testing machine learning classifiers. First, standard machine learning techniques are often biased towards the majority class in an imbalanced dataset [19]. Hence, standard metrics such as the accuracy do not reflect the actual performance of a model well [19]. In a skewed dataset containing 95% on-time examples and 5% too-late examples, an accuracy of 95% might be the result of the classifier predicting on-time labels 100% of the time. This research addresses this challenge by using metrics that take the skewness of a dataset into account, such as the f1-score, the harmonic mean of Precision and Recall (True Positive Rate).

In regular learning, we treat all misclassifications equally, which causes issues in imbalanced classification problems, as there is no extra reward for identifying the minority class over the majority class [32]. Cost-sensitive learning changes this and uses a function C(p, t) (usually represented as a matrix) that specifies the cost of misclassifying an instance of class t as class p. This allows us to penalize misclassifications of the minority class more heavily than misclassifications of the majority class, in the hope that this increases the true positive rate. A common scheme is to set the cost equal to the inverse of the proportion of the data set that the class makes up, so the penalty increases as the class size decreases; see the sketch after Table 2.3.

                               Actual Positive (yi = 1)   Actual Negative (yi = 0)
Predicted Positive (c1 = 1)    C_TP,i                     C_FP,i
Predicted Negative (c0 = 1)    C_FN,i                     C_TN,i

Table 2.3: Sample cost function matrix
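A minimal sketch of the inverse-proportion weighting scheme described above, assuming scikit-learn, where class_weight='balanced' implements exactly this idea:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Imbalanced toy data: roughly 95% majority class.
    X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

    # class_weight='balanced' weights each class by the inverse of its
    # proportion in the data, penalizing minority-class errors more heavily.
    clf = LogisticRegression(class_weight='balanced', max_iter=1000)
    clf.fit(X, y)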

A simple way to deal with imbalanced data sets is to balance them, either by oversampling instances of the minority class or by undersampling instances of the majority class [32]. In theory, the resulting balanced data set should not lead to classifiers biased toward one class or the other. In practice, however, these simple sampling approaches have flaws: i) oversampling the minority class can lead to overfitting, since it introduces duplicate instances drawn from a pool that is already small; ii) undersampling the majority class can leave out instances that capture important differences between the two classes. A resampling sketch follows below.
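The following is a minimal oversampling sketch, assuming scikit-learn's resample utility and hypothetical feature arrays:

    import numpy as np
    from sklearn.utils import resample

    # Hypothetical imbalanced feature arrays: 950 majority vs 50 minority rows.
    X_maj = np.random.rand(950, 5)
    X_min = np.random.rand(50, 5)

    # Oversample the minority class with replacement until it matches the
    # majority class in size; undersampling works analogously with
    # replace=False and a smaller n_samples on the majority class.
    X_min_over = resample(X_min, replace=True, n_samples=len(X_maj),
                          random_state=0)
    X_balanced = np.vstack([X_maj, X_min_over])
    print(X_balanced.shape)  # (1900, 5)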

2.2.3 Cross-validation

When training models to make predictions, a method is needed for estimating how accurately the model will perform in practice. In cross-validation the data is split into a set used to train the model, the training set, and a set against which the model is tested, the test set. Many types of cross-validation rely on multiple iterations to reduce variability. In the following paragraphs three methods are explained.

Holdout method

The holdout method randomly splits the data set into two sets, d0 and d1, the training set and test set, respectively. The model is trained using d0 and validated using d1. Usually a majority of the samples is assigned to d0. Typical splits in data mining applications
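A minimal sketch of such a holdout split, assuming scikit-learn and synthetic data; the 70/30 split fraction here is chosen purely for illustration:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=0)

    # Random holdout split; d0 (the training set) receives the majority
    # of the samples, d1 is held out for validation.
    X_d0, X_d1, y_d0, y_d1 = train_test_split(X, y, test_size=0.3,
                                              random_state=0)
    print(len(X_d0), len(X_d1))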
