Cancellation forecasting in the airline industry

Faculty of Economics and Business

Amsterdam School of Economics

Requirements thesis MSc in Econometrics.

1. The thesis should have the nature of a scientific paper. Consequently, the thesis is divided into a number of sections and contains references. An outline can be something like (this is an example for an empirical thesis; for a theoretical thesis, have a look at a relevant paper from the literature):

(a) Front page (requirements see below)

(b) Statement of originality (compulsory, separate page)
(c) Introduction
(d) Theoretical background
(e) Model
(f) Data
(g) Empirical Analysis
(h) Conclusions

(i) References (compulsory)

If preferred you can change the number and order of the sections (but the order you use should be logical) and the headings of the sections. You have a free choice how to list your references, but be consistent. References in the text should contain the names of the authors and the year of publication, e.g. Heckman and McFadden (2013). In the case of three or more authors: list all names and the year of publication for the first reference, and use the first name followed by et al. and the year of publication for the other references. Provide page numbers.

2. As a guideline, the thesis usually contains 25-40 pages using a normal page format. All that actually matters is that your supervisor agrees with your thesis.

3. The front page should contain:

(a) The logo of the UvA, a reference to the Amsterdam School of Economics and the Faculty as in the heading of this document. This combination is provided on Blackboard (in MSc Econometrics Theses & Presentations).

(b) The title of the thesis

(c) Your name and student number
(d) Date of submission of the final version

(e) MSc in Econometrics

(f) Your track of the MSc in Econometrics


Cancellation Forecasting

in the Airline Industry

Dorien Lugten

(10383646)

MSc in Econometrics, Track: Free

Date of final version: January 15, 2018
Supervisor: dr. K. Pak


Abstract

Overbooking is the general term that describes that more capacity is offered than is actually available; in the airline industry this means that more seats are sold on a flight than are available. Overbooking is one of the most important aspects of revenue management and is based on cancellation forecast. It is crucial to have accurate cancellation rates in order to control overbooking, which in this case consists of two main actions: reducing the risk of empty seats and reducing the number of denied boarding passengers. In this report, some cancellation forecasting models proposed in the literature are reviewed and the methodology and used techniques are provided. Subsequently, the performance of six classification models based on modern-day machine learning techniques are examined using a real-world dataset.


Statement of Originality

This document is written by Dorien Lugten who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


Contents

List of Figures

List of Tables

1 Introduction

2 Theoretical Background
2.1 Existing Forecasting Models
2.2 Terminology

3 Data

4 Methodology and Techniques
4.1 Classification models
4.1.1 Decision Trees
4.1.2 Logistic Regression
4.1.3 Support Vector Machine
4.1.4 Naive Bayes classifier
4.1.5 Random Forests
4.2 Measures of Classification Accuracy
4.3 Test for Significance

5 Results and Analysis
5.1 Tune parameters
5.1.1 Choice of λ for Logistic Regression
5.1.2 Number of Iterations for Stochastic Gradient Descent
5.1.3 Kernel Choice for Support Vector Machine
5.1.4 Set parameters
5.2 Accuracy of Classification Models
5.3 Runtime
5.4 Variable Importance

6 Conclusion and Discussion

References


List of Figures

3.1 Number of bookings in dataset XL
3.2 Cancellation rate of three different attributes for bookings in dataset XL
3.3 Cancellation rate per time frame for bookings in dataset XL

5.1 ROC curves of SVM with three different kernels
5.2 ROC curves of six classification models for each dataset


List of Tables

3.1 Summary of input data
3.2 Three attributes with ranges
3.3 Attributes (class-label attribute included)

4.1 Confusion Matrix

5.1 λ that minimizes the log loss function for each dataset
5.2 Log Loss and runtime versus number of iterations of SGD for each dataset
5.3 Optimal number of iterations for each dataset
5.4 Accuracy measures of three different kernels for support vector machine
5.5 p-values of Paired t-tests of the three different kernels
5.6 Accuracy of classification models on the same out-of-sample set
5.7 p-values of Paired t-tests of the different classification models
5.8 Runtime of the six classification models for each dataset
5.9 Variable importance in the top split of decision trees for each dataset
5.10 Attributes chosen for a split in decision trees for each dataset
5.11 The relative importance of each attribute in decision trees for each dataset
5.12 Summary of variable importance of logistic regression for each dataset
5.13 Summary of variable importance of support vector machine for each dataset
5.14 Summary of variable importance of naive Bayes for each dataset
5.15 Class-conditional probabilities for the attribute IsTicketed from naive Bayes for dataset XL
5.16 Summary of variable importance of random forests for each dataset


Chapter 1

Introduction

The first publication of forecasting models for cancellations and no-shows was in 1958 by Beckmann and Bobkoski. They applied three different distributions to total passenger arrivals and made assumptions about demand arrival that may no longer be valid. In the almost sixty years since this publication, a lot has changed in the airline industry. Nowadays, Passenger Name Record-based cancellation and no-show forecasting is commonly viewed as one of the best methods available in the airline industry. These methods are part of revenue management.

The objective of revenue management is to maximize profits; however, airline short-term costs are largely fixed, and variable costs per passenger are small; thus, in most situations, it is sufficient to seek booking policies that maximize revenues (McGill & Van Ryzin, 1999). A revenue management system must take into account the possibility that a booking may be cancelled, or that a booked customer may fail to show up at the time of service (no-show), which is a special case of cancellation that happens at the time of service (Morales & Wang, Passenger Name Record Data Mining Based Cancellation Forecasting for Revenue Management, 2008).

Iliescu, Garrow and Parker (2006) studied airline passenger cancellation behaviour and stated that leisure passengers, who are more likely to book further in advance of flight departure, are less likely to cancel than business passengers. However, as the flight nears departure, both leisure and business travellers are more likely to refund and exchange their tickets. The study of Iliescu, Garrow and Parker (2006) points out that cancellation proportions of 30% or more are not uncommon today. Cancellation forecasting is one important aspect of revenue management. Accurate forecasts of the expected number of cancellations for each flight can increase airline revenue by reducing the number of spoiled seats (empty seats that might otherwise have been sold) and the number of involuntary denied boardings at the departure gate (Lawrence, Hong, & Cherrier, 2003).

Another important aspect of revenue management in airlines is overbooking. Overbooking intends to increase revenues by deciding on the number of seats to be offered for sale (virtual capacity) such that it maximizes the chance of the aircraft seats being occupied (physical capacity) when the flight departs (Talluri & van Ryzin, 2004). The overbooking levels are based on a cancellation forecast in combination with service criteria to keep the risk of having too many passengers showing up very small. Therefore, it is crucial to have accurate cancellation rates.

The task of forecasting the probability of cancellation of a single booking can be modelled as a two-class probability estimation problem with the two classes being “cancelled” and “not cancelled” (Morales & Wang, Cancellation forecasting Using Support Vector Machine with Discretization, 2008). There are different classification techniques one can use to solve this estimation problem such as Decision Trees (DT), Logistic Regression (LR), Stochastic Gradient Descent (SGD), Support Vector Machines (SVM), Naive Bayes (NB) and Random Forests (RF).

Morales and Wang (2008) showed that decision trees perform better than logistic regression, since there are practical limitations on logistic regression. Furthermore, the study by Wang on airline data showed that dynamic decision trees outperform logistic regression in terms of runtime and forecast accuracy (Morales & Wang, Identify Critical Values for Supervised Learning Involving Transactional Data, 2009). But with the latest developments in machine learning techniques, it may be possible for logistic regression to outperform the dynamic decision tree, as modern-day machine learning techniques are able to deal with a large number of dummy variables.

To forecast cancellation rates, KLM uses dynamic decision trees. KLM faces some questions in the use of these trees, namely: when is it better to make the decision tree more dynamic or more static, what are the best thresholds to create a node, what are the best attributes to consider in the dynamic decision tree model, and what is the best pruning method for decision trees?

The objective of this report is to investigate whether a decision tree is the best model for predicting cancellations. This is examined by comparing the decision tree model with the five other classification models mentioned earlier to see which one performs best.

The rest of this report is structured as follows. In the second chapter, the cancellation forecasting problem is described and detailed background information about previous research on this topic is provided, as well as some basic terminology used throughout this report. The third chapter describes the real-world dataset used for this research, and its attributes are explained. Chapter 4 explains the six different methods commonly used for binary classification. In addition, five different measures of classification accuracy are discussed, as well as a test for significance. Chapter 5 discusses the main results, followed by the conclusion, discussion and possibilities for future research in the sixth and last chapter.


Chapter 2

Theoretical Background

To optimize the expected revenue of an airline company, it is essential to have an accurate passenger cancellation forecast. Using this forecast, the risk of unnecessary empty seats on a flight is reduced by overbooking. Overbooking means that the number of seats available for sale is higher than the physical capacity of the airplane. An optimized overbooking rate leads to a reduction of expenses due to denied boardings and to a reduction of revenue loss due to seats that are not sold although there is demand for them (Hueglin & Vannotti, 2001). In this chapter, some papers about different cancellation forecasting models proposed in the literature are discussed. In addition, some basic definitions used in this report are described.

2.1

Existing Forecasting Models

Most of the forecasting models proposed in the literature focus on the no-show case. However, these models can also be used to forecast cancellation rates. Conventional forecasting methods predict the number of cancellations using time-series methods, such as taking the seasonally-weighted moving average of cancellations for previous instances of the same flight leg (Lawrence, Hong, & Cherrier, 2003). Time series forecasting looks at sequences of data points, trying to identify patterns and regularities in their behaviour that might also apply to future values (Lemke & Gabrys, 2008). Weatherford, Gentry and Wilamowski (2002) compared traditional forecasting methods such as moving averages, exponential smoothing and regression with the neural network method. Neural networks represent a promising generation of intelligent machines that are capable of processing large and complex forms of information (Weatherford, Gentry, & Wilamowski, 2002). Weatherford, Gentry and Wilamowski (2002) concluded that the most basic neural network can outperform the traditional forecasting methods.

Lawrence et al. (2003) used two different passenger-based forecast models to predict no-show rates based on the Passenger Name Record (PNR) and implemented these models using different classification methods such as Naive Bayes, the Adjusted Probability Model (APM), which is an extension of Naive Bayes, ProbE (based on tree algorithms) and C4.5 (an algorithm for making decision trees). They have shown that "models incorporating specific information on individual passengers can produce more accurate predictions of no-show rates than conventional, historical-based, statistical methods" (Lawrence, Hong, & Cherrier, 2003). Neuling, Riedel and Kalka (2003) also used C4.5 decision trees based on PNRs. Hueglin and Vannotti (2001) used classification trees and logistic regression models to predict the cancellation probability of passengers. They concluded that "the accuracy of no-show forecasts can be improved when individual passenger information extracted from passenger name records (PNRs) is used as input" (Hueglin & Vannotti, 2001). The three publications mentioned above conclude that making use of PNR data improves forecasting performance. The PNR data mining approach models cancellation rate forecasting as a two-class probability estimation problem (Morales & Wang, Forecasting Cancellation Rates for Services Booking Revenue Management Using Data Mining, 2009).

Popular two-class probability estimation methods are tree-based methods and kernel-based methods. Probability estimation trees estimate the probability of class membership, in our case the probability that a booking will be cancelled. Quinlan (1993) developed an algorithm, C4.5, that generates decision trees. The trees produced by C4.5 are small and accurate, resulting in fast reliable classifiers and therefore decision trees are valuable and popular methods for classification. Provost and Domingos (2003), in contrast, concluded that the performance of conventional decision-tree learning programs is poor and therefore they made some modifications to the C4.5 algorithm, namely the C4.4 algorithm. The C4.4 uses information gain criteria to divide the tree nodes and no pruning is used. Fierens, Ramon, Blockeel and Bruynooghe (2005) concluded that overall the C4.4-approach outperforms the C4.5-approach. However, the trees of the C4.5-approach are much smaller than those of the C4.4-approach. The C4.4 method builds a single tree; however, random forests can improve the predictive performance of a single tree by aggregating many decision trees. Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forests (Breiman, 2001). For a large number of trees, it follows from the Strong Law of Large Numbers and the tree structure that random forests always converge so that overfitting is not a problem (Breiman, 2001). In random forests, the idea is to decorrelate the several trees and then reduce the variance in the trees by averaging them (Random Forests in R, 2017). Averaging the trees helps to reduce the variance and improve the performance of the trees and eventually to avoid overfitting.

Kernel-based methods make use of kernel functions, which map input data points to a higher-dimensional space such that a linear method in the new space becomes non-linear in the original space; therefore, these methods are able to model non-linear relationships between dependent and independent variables (Morales & Wang, Forecasting Cancellation Rates for Services Booking Revenue Management Using Data Mining, 2009). One of the most popular kernel-based methods for class probability estimation is the Support Vector Machine (SVM). If we have labelled data, SVM can be used to generate multiple separating hyperplanes such that the data space is divided into segments and each segment contains only one kind of data (Machine Learning Using Support Vector Machines, 2017). SVM is able to find the hyperplane that creates the biggest margin between the training points for classes 1 and −1 (Hastie, Tibshirani, & Friedman, 2001).

Caruana and Niculescu-Mizil (2006) evaluated the performance of SVMs, logistic regression, naive Bayes, random forests, decision trees and other supervised learning algorithms on binary classification problems. Of the five methods mentioned, random forests were the best learning method overall, followed by SVMs. The models performing most poorly were logistic regression, naive Bayes and decision trees. However, even the best models sometimes perform poorly, and models with poor average performance occasionally perform exceptionally well (Caruana & Niculescu-Mizil, 2006). Rish (2001) did an empirical study of the NB classifier and concluded that "despite its unrealistic independence assumption, the naive Bayes classifier is surprisingly effective in practice since its classification decision may often be correct even if its probability estimates are inaccurate".

In this report, we evaluate the performance of decision trees, logistic regression, support vector machines, naive Bayes and random forests on a real-world dataset to forecast cancellation rates. In the next section, the terminology used throughout this report is provided.

2.2

Terminology

In this section, we introduce some basic definitions used throughout this report: cancellation, no-show, passenger show-up and denied boarding are described.

Passengers are said to cancel when the associated confirmed seat reservations are freed and returned to the inventory for sale. For every booking i, Y_i is the random variable associated with the realization of the cancellation indicator, which is equal to 1 if all passengers in the booking have cancelled before departure and 0 otherwise. The cancellation probability of booking i is denoted as p_i = P(Y_i = 1). When a booking is cancelled, all passengers in that booking have cancelled.

Passengers are said to no-show when their confirmed booking was not cancelled, but they do not show up at the departure time. For every passenger, N_k is the random variable associated with the realization of the no-show indicator, which is equal to 1 if the booking of the passenger is not cancelled but the passenger does not show up at the time of boarding, and 0 otherwise. The passenger no-show probability is denoted as n_ik = P(N_k = 1 | Y_i = 0), where k indicates a certain passenger in booking i.

Passengers who do not cancel or no-show are said to show up. In this case, both the cancellation indicator and the no-show indicator are equal to 0. We denote S_k as the random variable associated with the realization of the show-up indicator. Hence, the show-up probability of passenger k in booking i is

s_ik = P(S_k = 1) = P(N_k = 0 ∩ Y_i = 0) = P(Y_i = 0) P(N_k = 0 | Y_i = 0) = (1 − p_i)(1 − n_ik).

Passengers who show up with a confirmed reservation but who cannot be accommodated on a flight are said to be denied boarding.
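The definitions above can be tied together in a short sketch. The code below (Python, purely illustrative; the numeric values are invented and do not come from the thesis data) computes the show-up probability s_ik = (1 − p_i)(1 − n_ik):

```python
# Sketch of the show-up probability from the definitions above.
# The numeric values are invented for illustration only.

def show_up_probability(p_i: float, n_ik: float) -> float:
    """s_ik = P(S_k = 1) = (1 - p_i) * (1 - n_ik): booking i survives
    with probability 1 - p_i, and conditional on that, passenger k
    shows up with probability 1 - n_ik."""
    return (1.0 - p_i) * (1.0 - n_ik)

# Example: 40% cancellation probability, 5% conditional no-show probability.
print(round(show_up_probability(0.40, 0.05), 2))  # 0.57
```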


Chapter 3

Data

Information on bookings is available in the form of Passenger Name Records (PNRs), which are typically transferred to a PNR database from an airline’s flight reservation system. A new PNR is generated whenever a customer makes a flight reservation and contains information such as the creation date, the number of passengers, departure date, ticketing status, price class and many other attributes about the booking. Each time a customer contacts the airline in order to change the state of the booking (confirmation, cancellation, etc.), an additional transaction record is written into the PNR and stored in the reservation system. A PNR may include more than one passenger flying the same itinerary. If one of the passengers in a PNR decides to deviate from the existing itinerary, then the PNR is split. For this passenger, a new PNR is generated. Each PNR is tagged with a label indicating whether the booking is cancelled or not; 1 for a cancellation and 0 otherwise. When a PNR is cancelled, all passengers in the PNR have cancelled. This label is used as the target variable for modelling the cancellation probability.

The database that is investigated in this report contains booking records of KLM and AirFrance flights. All booking records with a departure date between 01.10.2016 and 01.10.2017 are selected from the database, which amounts to almost 92 million bookings. The datasets that are used in this report are random samples of those 92 million bookings. The PNRs of these data samples were created in the time period between 22.05.2015 and 01.10.2017. Table 3.1 summarizes the characteristics of the datasets. For all datasets the mean cancellation rate is calculated as follows:

CR = (Σ_{i=1}^{n} Y_i · N_i) / (Σ_{i=1}^{n} N_i),    (3.1)

where n is the total number of PNRs, Y_i is the cancellation label for PNR i with value 1 for a cancellation and value 0 for no cancellation, and N_i is the number of passengers in PNR i; the cancellation rate is thus the number of passengers that have cancelled divided by the total number of passengers. Note that the mean cancellation rate is more than 41% for all datasets. To investigate and compare the performance of the fitted models, another sample, which contains 10561 observations, is generated from the almost 92 million bookings. Predictions of cancellations will be made on this out-of-sample set.
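Equation (3.1) can be sketched in a few lines of code; the (Y_i, N_i) pairs below are invented for illustration and do not come from the KLM/AirFrance data:

```python
# Sketch of equation (3.1): cancelled passengers over total passengers.
# The (Y_i, N_i) pairs are invented for illustration.

def mean_cancellation_rate(bookings):
    """bookings: iterable of (Y_i, N_i) pairs, where Y_i is the
    cancellation label of PNR i (1 = cancelled, 0 = not cancelled)
    and N_i is the number of passengers in PNR i."""
    cancelled_pax = sum(y * n for y, n in bookings)
    total_pax = sum(n for _, n in bookings)
    return cancelled_pax / total_pax

sample = [(1, 2), (0, 1), (0, 4), (1, 1)]  # 3 cancelled out of 8 passengers
print(mean_cancellation_rate(sample))  # 0.375
```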


                             Dataset S   Dataset M   Dataset L   Dataset XL
Number of PNRs (n)              37202      148795      498285       994344
Mean cancellation rate (CR)    0.4150      0.4145      0.4170       0.4174

Table 3.1: Summary of input data

(a) Number of bookings per month (b) Number of bookings vs day of week

Figure 3.1: Number of bookings in dataset XL

Figure 3.1a shows the number of bookings per month and figure 3.1b the number of bookings per day for all bookings in dataset XL. Most flights are booked in January, March and May. December is the least popular month for booking. The number of bookings on weekdays is more or less the same, whereas the number of bookings at the weekend is much lower.

Attributes are used to predict whether a PNR is cancelled or not. Table 3.3 summarizes the set of attributes extracted from the PNR database. The class-label attribute, IsCancelled, tells whether a booking is cancelled or not and has two values: 1 if the booking is cancelled and 0 if it is not. The remaining attributes are used to predict cancellations. Figure 3.2 visualizes the influence of three different attributes on the observed cancellation rate. All departure months have on average the same cancellation probability (≈ 40%), and the same holds for the departure day of the week. In contrast to these two attributes, the cancellation rate differs across price classes. For example, the Z price class, one of the cheapest seats in the business cabin with flexible cancellation conditions, has a higher cancellation rate than the G price class, the cheapest seat in the economy cabin without flexible cancellation conditions. The cancellation rates of classes G and Z are almost 18% and 52%, respectively.

Some of the attributes need more explanation. First, the True and Karma attributes are explained. Suppose that a certain booking consists of a set of two flight legs, for example LHR-CDG and CDG-FRA (London-Paris-Frankfurt). The TrueOriginAirport is LHR and the TrueDestinationAirport is FRA. If both flight legs are executed by KLM or AirFrance, then the Karma O&D (Origin and Destination) Airports are the same as the True O&D Airports. If only the first flight leg is executed by KLM or AF, then the KarmaOriginAirport is LHR and the KarmaDestinationAirport is CDG. And if only the last flight leg is executed by KLM or AF, then the KarmaOriginAirport is CDG and the KarmaDestinationAirport is FRA. The IsTrueLocal attribute indicates whether a booking consists of a single flight leg: it is 1 if the booking consists of only one flight leg and 0 otherwise, so it is 0 in the example above.

The IsOutboundFlow attribute is 1 if the passenger or passengers in the PNR are at the beginning of their journey; otherwise, the attribute is 0. The NegoSpaceType attribute marks a special type of group reservation made by a travel organization. Travel agents reserve a fixed number of seats on a single flight. This reservation is called the 'master' booking, for which the NegoSpaceType attribute has value 1. If a passenger confirms the booking, then this booking is split from the 'master' booking and a new booking is created, which can have more flight legs than the original reservation made by the travel agent. This booking is derived from the original group booking made by the travel agent, and therefore the value of its NegoSpaceType attribute is 2. In all other cases, the NegoSpaceType attribute is 0.

(a) Cancellation rate vs departure month (b) Cancellation rate vs departure day of week

(c) Cancellation rate vs pricing class

Figure 3.2: Cancellation rate of three different attributes for bookings in dataset XL

Table 3.2 gives the ranges for the attributes LengthofStayRange, NbPaxRange and TimeFrameLabel. The NbPaxRange attribute indicates the number of passengers in a certain PNR and the LengthofStayRange attribute indicates the length of stay of these passengers in number of days. The TimeFrameLabel attribute of a booking is calculated as the departure date minus the booking (demand) date and gives the booking time in days before departure. Figure 3.3 shows the fraction of cancellations for each time frame. The cancellation rate decreases as the day of departure comes closer: a booking that is made 200 days before departure and is still active 5 days before departure is unlikely to be cancelled.


            Level:   1    2    3     4     5      6      7      8       9        10     11
LengthofStayRange    0    1   2-6  7-21  22-98   >98
NbPaxRange           0    1    2   3-4   5-9   10-20  21-40   >40
TimeFrameLabel       0    1   2-3  4-7   8-14  15-30  31-90  91-180  181-260  261-360  >360

LengthofStayRange = length of stay in days; NbPaxRange = number of passengers; TimeFrameLabel = booking time in days before departure.

Table 3.2: Three attributes with ranges
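As an illustration of how a raw attribute value could be mapped to these ranges, the sketch below buckets the booking time in days before departure into the TimeFrameLabel levels; the level numbering (1-11, left to right in Table 3.2) is an assumption for this example, not something stated in the thesis:

```python
# Hypothetical bucketing of days-before-departure into TimeFrameLabel
# levels, following the ranges of Table 3.2 (levels numbered 1..11).

TIME_FRAME_UPPER_BOUNDS = [0, 1, 3, 7, 14, 30, 90, 180, 260, 360]

def time_frame_level(days_before_departure: int) -> int:
    """Return the TimeFrameLabel level for a booking made this many
    days before departure; anything above 360 days falls in level 11."""
    for level, upper in enumerate(TIME_FRAME_UPPER_BOUNDS, start=1):
        if days_before_departure <= upper:
            return level
    return len(TIME_FRAME_UPPER_BOUNDS) + 1  # > 360 days

print(time_frame_level(0))    # 1 (booked on the day of departure)
print(time_frame_level(5))    # 4 (range 4-7)
print(time_frame_level(400))  # 11 (> 360)
```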

Figure 3.3: Cancellation rate per time frame for bookings in dataset XL

All four datasets are stored as an n×p matrix, where n is the number of bookings and p the number of attributes. These four datasets, also called the train sets, are used to fit the models. The out-of-sample set is stored as an n×p matrix as well and is used to estimate the prediction error of the models. Subsequently, for both the train sets and the out-of-sample set, dummy variables are created for the explanatory attributes, i.e. all attributes except the class-label attribute IsCancelled. An explanatory attribute with m levels yields m − 1 dummy variables.

Example: the explanatory attribute DepartureDayofWeek has 7 levels, namely Monday, Tuesday, Wednesday, Thursday, Friday, Saturday and Sunday. When we turn this attribute into dummies, 6 variables are created: DepartureDayofWeek_i for i = 2, ..., 7. At most one of these variables can have the value 1; all other variables have the value 0. If all six variables have the value 0, the departure day is Monday; if the variable DepartureDayofWeek2 has value 1, the departure day is Tuesday, etc.

With these dummy variables, the train and out-of-sample sets are now n×k matrices, where n is the number of observations in the train or out-of-sample set and k is the total number of dummy variables (the sum of levels minus 1 over all explanatory attributes). These matrices contain only zeros and ones. For faster computation in R we 'delete' the zeros and create a sparse matrix for both the train and out-of-sample sets. Using the sparse matrix of the train set, six different models are fit, and the sparse matrix of the out-of-sample set is used to estimate the prediction error of the models. The next chapter describes the classification models used in this report. To be fitted, the models have to work well with sparse matrices or with factor attributes with many levels.


Attribute Name Description Type Number of levels

1 IsCancelled Class label (1 = cancellation, 0 = no cancellation) factor 2

2 IsGroup Group: booking > 9 passengers (1 = group, 0 = no group) factor 2

3 IsDepartureWeekend Departure day in weekend? (1 = yes, 0 = no) factor 2

4 IsTrueLocal Set of flight legs executed by KL or AF? (1 = yes, 0 = no) factor 2

5 IsOutboundFlow The outboundflow of a passenger (1 = yes, 0 = no) factor 2

6 IsTicketed Is the booking ticketed? (1 = yes, 0 = no) factor 2

7 IsSaturdayNightStay Is Saturday night included in the booking? (1 = yes, 0 = no, 2 = NA) factor 3

8 ReturnType Is the passenger coming back to origin? (1 = round trip, 0 = one way) factor 2

9 NegoSpaceType Special group bookings of travel organizations (0 = no NegoSpaceType, 1 = master NST, 2 = NST) factor 3

10 DepartureDayofWeek Departure day of week factor 7

11 DepartureMonth Departure month factor 12

12 LengthofStayRange Length of stay in number of days factor 6

13 NbPaxRange Number of passengers factor 8

14 TimeFrameLabel Booking time in days before departure factor 11

15 PricingClass Price class A-Z factor 26

16 DominantAirline Longest flight with AF or KL factor 2

17 ConnexionIndicator No connection, Europe-Ica, Europe-Europe, Ica-Ica factor 4

18 TrueOriginAirport Origin airport of the first flight leg factor 1479

19 TrueDestinationAirport Destination airport of the last flight leg factor 1540

20 KarmaOriginAirport Origin airport of the first flight leg executed by KL or AF factor 325

21 KarmaDestinationAirport Destination airport of the last flight leg executed by KL or AF factor 321

22 TruePointofSaleCountry Country where the booking is made factor 215

Table 3.3: Attributes (class-label attribute included)


Chapter 4

Methodology and Techniques

In this chapter, the methodology and techniques used in this report are discussed. The first section gives an overview of the classification models that are compared to each other by using five accuracy measures which are discussed in the second section of this chapter. Furthermore, a test for significance is explained in the last section.

4.1

Classification models

The task of forecasting the probability of cancellation of a single booking can be modelled as a two-class probability estimation problem with the two classes being “cancelled” and “not cancelled” (Morales & Wang, Cancellation forecasting Using Support Vector Machine with Discretization, 2008). Classification is a process for predicting qualitative responses. Some of the methods commonly used for binary classification are:

1. Decision Trees

2. Logistic Regression

3. Support Vector Machines

4. Naive Bayes classifier

5. Random Forests

In this chapter, the above five models are described.

4.1.1 Decision Trees

The first model is the decision tree, a well-known data mining technique. A decision tree is a structure that can be used to divide a large collection of records into successively smaller sets of records by applying a sequence of simple decision rules (Berry & Linoff, 2004). The model construction method proceeds in two steps.

In the first step, a tree is grown using a greedy heuristic to create the nodes. The algorithm starts with a root node containing the entire population and proceeds by recursively selecting an attribute and splitting each node into child nodes, each of which contains the records that share the same value of that attribute.


Splitting is performed until a termination criterion is met or there is nothing left to split. Each attribute for a split is selected as the one which locally maximizes a heuristic criterion known as the gain function.

In the second step, the tree can be pruned by removing some of its nodes and branches. This phase is necessary to remove nodes that might cause over-fitting of the data. Pre-pruning is done during calibration of the tree; post-pruning is done after the generation of the decision tree and tries to simplify the tree by removing superfluous nodes.

Several strategies are available for splitting and pruning the tree; they are discussed in the following subsections.

Splitting

The objective of the splitting algorithm is to select the best attribute on which to perform a split. The split attributes are selected dynamically and in order to find the best split condition the algorithm makes use of an impurity measure and a gain function.

The impurity function measures how well the classes are separated, so it should be 0 when all data belong to the same class (low impurity means high coherence). Multiple impurity functions are used in the literature, two of which are discussed here. In the following, we consider a node N with data set S that contains examples from k classes, and let pi(S) denote the relative frequency of class i in S. The first impurity function is the Entropy:

$$\text{Entropy}(S) = -\sum_{i=1}^{k} p_i(S) \log p_i(S), \tag{4.1}$$

which is mostly used by full split dynamic decision trees, which split on each value of an attribute. Entropy is intended for attributes whose values fall into classes and is used by the C4.5 algorithm. The second impurity function is the Gini-index:

$$\text{Gini}(S) = 1 - \sum_{i=1}^{k} p_i(S)^2, \tag{4.2}$$

which is intended for continuous attributes and minimizes misclassification. This index is mostly used by binary dynamic decision trees, which split all values of an attribute into two sets. The Gini-index is used by the CART-algorithm.

There is a large selection of gain functions to measure the worth of an attribute; the attribute with the highest worth is selected. The simplest gain function is:

$$\text{Gain}(S, A) = I(S) - \sum_{v \in A} \frac{|S_v|}{|S|}\, I(S_v), \tag{4.3}$$

where I(S) represents the selected impurity function; v ranges over the possible values of attribute A; Sv is the subset of S for which attribute A has value v; |Sv| is the number of elements in Sv; and |S| is the number of elements in S (Du & Zhan, 2002). This gain function

favours attributes that have a large number of values. A second gain function, known as Gain Ratio, compensates for this by taking into account the intrinsic value of the split:

$$\text{GainRatio}(S, A) = \frac{\text{Gain}(S, A)}{\text{SplitInfo}(S, A)} = \frac{I(S) - \sum_{v \in A} \frac{|S_v|}{|S|}\, I(S_v)}{-\sum_{v \in A} \frac{|S_v|}{|S|} \log \frac{|S_v|}{|S|}}. \tag{4.4}$$

The attribute with the highest gain ratio is chosen for the split.

Earlier research on cancellation forecasting at KLM shows that the best dynamic decision tree model is a dynamic full split decision tree calibrated using Entropy as impurity function and Gain Ratio as gain function.
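The impurity and gain functions (4.1)-(4.4) can be sketched in a few lines of Python. This is a simplified illustration for categorical attributes, not the calibration code used at KLM:

```python
import math

def entropy(labels):
    """Impurity per eq. (4.1): -sum_i p_i log p_i over the class frequencies."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log(p) for p in probs if p > 0)

def gini(labels):
    """Impurity per eq. (4.2): 1 - sum_i p_i^2."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gain(labels, attr, impurity=entropy):
    """Gain per eq. (4.3): parent impurity minus weighted child impurity."""
    n = len(labels)
    g = impurity(labels)
    for v in set(attr):
        child = [y for y, a in zip(labels, attr) if a == v]
        g -= len(child) / n * impurity(child)
    return g

def gain_ratio(labels, attr):
    """Gain Ratio per eq. (4.4): gain divided by the split information."""
    n = len(labels)
    split_info = -sum(attr.count(v) / n * math.log(attr.count(v) / n)
                      for v in set(attr))
    return gain(labels, attr) / split_info
```

A splitting algorithm would evaluate `gain_ratio` for every candidate attribute and split on the one with the highest value.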

Pruning

The objective of the pruning algorithm is to remove nodes that over-fit the data in order to improve the forecasting capability of the tree. The simplest pruning method is based on a fixed pruning threshold, a lower bound on the number of observations in a node: if the number of observations in a node does not exceed the threshold, the node is not created. There are three pruning thresholds: parent pruning, child pruning and gain pruning. A node with fewer observations than the parent pruning threshold becomes a leaf. A child node (which is a leaf node) is pruned if its number of observations does not exceed the child pruning threshold. If the gain value of a split is lower than the gain pruning threshold, i.e. the split brings too little information to justify increasing the tree size, then the corresponding candidate attribute is not chosen for the split. It can be interesting to combine pruning with the gain ratio computation.

A second method for pruning decision trees is the pruning algorithm of the C4.5 decision tree algorithm, an error-based pruning algorithm. The algorithm is a kind of post-pruning algorithm; it allows the tree to overfit the examples and then the algorithm post-prunes the tree (Kijsirikul & Chongkasemwongse, 2001). The algorithm starts from the bottom of the tree and examines each non-leaf subtree. If replacement of this subtree with a leaf would lead to a lower estimated error, then the tree is pruned (Kijsirikul & Chongkasemwongse, 2001). The error rate is defined as

$$q_S = \min(p_S, 1 - p_S), \tag{4.5}$$

where pS is the average calibration probability (the event probability) of data set S. For a parent node we calculate the upper bound for the error rate as

$$q_{\max}(S) = q_S + 1.96\,\sqrt{\frac{q_S(1 - q_S)}{|S|}}. \tag{4.6}$$

We also calculate the upper bound for the error rate of all child nodes, qmax(Si) for i = 1, ..., k. If the error rates of the child nodes satisfy the criterion

$$q_{\max}(S) - \sum_{i=1}^{k} \frac{|S_i|}{|S|}\, q_{\max}(S_i) \le 0, \tag{4.7}$$

then the subtree is pruned and replaced by a leaf.
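The error-based pruning test of equations (4.5)-(4.7) can be sketched as follows; nodes are represented here as hypothetical (event probability, number of observations) pairs, an assumption made for illustration:

```python
import math

def q_upper(p, n, z=1.96):
    """Upper bound on the error rate per eq. (4.6), with q = min(p, 1 - p)
    as in eq. (4.5)."""
    q = min(p, 1 - p)
    return q + z * math.sqrt(q * (1 - q) / n)

def should_prune(parent, children):
    """Pruning criterion per eq. (4.7): prune when the parent's error bound
    does not exceed the weighted error bounds of its children.
    parent and children are (event_probability, n_observations) pairs."""
    p_par, n_par = parent
    weighted = sum(n_c / n_par * q_upper(p_c, n_c) for p_c, n_c in children)
    return q_upper(p_par, n_par) - weighted <= 0
```

When the children are much purer than the parent, the weighted child bound is small and the subtree is kept; otherwise replacing the subtree by a leaf does not increase the estimated error and it is pruned.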


4.1.2 Logistic Regression

The second model is logistic regression, which models the probability that a certain booking xi, with i = 1, ..., n, belongs to one of the following classes:

$$Y_i = \begin{cases} 1 & \text{if } x_i \text{ is cancelled;} \\ 0 & \text{if } x_i \text{ is not cancelled.} \end{cases}$$

Multiple logistic regression is a statistical method to estimate an event probability using multiple attributes (X). In other words, we model the conditional distribution of the response Y given the attributes X. To model the relationship between p(X) = Pr(Y = 1|X) (the probability that a certain booking is cancelled given attributes X) and X, the logistic function is used:

$$p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p}}. \tag{4.8}$$

For all values of X, where X = (X1, ..., Xp) are p attributes, the logistic function (4.8) takes

a value between 0 and 1. To fit the model the maximum likelihood method is used, which is discussed in the next subsection. After some derivations of (4.8), we find the odds:

$$\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p}, \tag{4.9}$$

which can take on any value between 0 and ∞. Taking the logarithm of both sides of (4.9) gives

$$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p, \tag{4.10}$$

where the left-hand side is called the log-odds or logit. Increasing X1 by one unit changes the logit by β1. However, because the relationship between p(X) and X in (4.8) is not a straight line, β1 does not correspond to the change in p(X) associated with a one-unit increase in X1 (James, Witten, Hastie, & Tibshirani, 2013). The amount that p(X) changes due to a one-unit change in X1 depends on the current value of X (James, Witten, Hastie, & Tibshirani, 2013). Logistic regression also allows qualitative attributes such as DominantAirline or TrueOriginAirport (see chapter 3) to be used, by creating dummy variables.

Regularized Maximum Likelihood

The coefficients β0, ..., βp in the logistic function are unknown and must be estimated from training data. One option is the regularized maximum likelihood method, which estimates the coefficients by finding the parameter values that maximize the penalized log-likelihood (Friedman, Hastie, & Tibshirani, 2009):

$$l(\beta_0, \ldots, \beta_p) = \frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i)\log(1 - p_i) \right] - \lambda \sum_{j=1}^{p} |\beta_j|, \tag{4.11}$$

where $\sum_{j=1}^{p} |\beta_j|$ is the lasso penalty, i.e. the sum of the absolute values of the β coefficients of all attributes. The effect of the penalty term is to set the coefficients that contribute most to the error to zero; in other words, the lasso penalty picks out the most predictive coefficients. This penalty is useful in any situation where there are many correlated predictor variables.


After the estimation of the logistic regression coefficients β0, ..., βp, we can compute the probability of a cancellation for any combination of attributes:

$$\hat{p}(X) = \frac{e^{\hat\beta_0 + \hat\beta_1 X_1 + \ldots + \hat\beta_p X_p}}{1 + e^{\hat\beta_0 + \hat\beta_1 X_1 + \ldots + \hat\beta_p X_p}}. \tag{4.12}$$
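For concreteness, the penalized log-likelihood (4.11) can be evaluated directly. The sketch below is a simplified illustration in pure Python (the data layout and names are assumptions for the example, not the estimation code used in this thesis):

```python
import math

def penalized_loglik(betas, X, y, lam):
    """Penalized log-likelihood per eq. (4.11): the average log-likelihood of
    the logistic model minus the lasso penalty lam * sum_j |beta_j|.
    betas[0] is the intercept; X is a list of attribute vectors."""
    n = len(y)
    total = 0.0
    for xi, yi in zip(X, y):
        z = betas[0] + sum(b * x for b, x in zip(betas[1:], xi))
        p = 1.0 / (1.0 + math.exp(-z))  # logistic function, eq. (4.8)
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total / n - lam * sum(abs(b) for b in betas[1:])
```

An optimizer (or a grid over candidate coefficients) would then pick the betas that maximize this quantity for a given λ.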

However, the evaluation of the penalized log-likelihood over the full data set is computationally challenging when the number of bookings and attributes is large. Therefore the SGD method is also used for estimating the coefficients β0, ..., βp in the logistic function. This

method is described in the next subsection.

Stochastic Gradient Descent

Another method for estimating the coefficients in the logistic function is stochastic gradient descent. Stochastic gradient descent is popular for large-scale optimization (Johnson & Zhang, 2013). To minimize the error of the logistic regression on the real-world dataset, the stochastic gradient descent (SGD) evaluates and updates the coefficients every iteration.

The aim of stochastic gradient descent is to find the set of coefficients β0, ..., βp that minimizes the error of the model on the training data. For the logistic regression model with a class label y taking only two values, 0 for no cancellation and 1 for a cancellation, this is done by finding the coefficients that minimize the following error measure:

$$\text{LogLoss} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right], \tag{4.13}$$

where n is the number of observations, pi the predicted probability of a cancellation of booking i and yi the class-label attribute of booking i. The predicted probability of a cancellation of booking i is calculated as

$$p_i = \frac{1}{1 + \exp(-x_i \cdot w)}, \tag{4.14}$$

where the parameter vector w is estimated iteratively. Each iteration performs a stochastic gradient descent step on the basis of a randomly picked example (xi, yi) (Bottou, 2010):

$$w_{t+1} = w_t + \gamma (y_i - p_i)\, x_i, \tag{4.15}$$

where γ is the learning rate. The stochastic process {wt, t = 1, ...} depends on the examples (xi, yi) randomly picked at each iteration (Bottou, 2010).

Stochastic gradient descent is efficient because it evaluates the Log Loss function at a single observation yi given xi, rather than over the entire data set, which saves significant computation time (Tran, Toulis, & Airoldi, 2015). Johnson and Zhang (2013) conclude that SGD does not require the storage of gradients, which makes it easily applicable to complex problems such as structured prediction or neural network learning.
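A minimal sketch of the update rule (4.14)-(4.15), with the intercept absorbed as a constant first feature; the learning rate, number of steps and data here are illustrative assumptions, not the settings tuned in chapter 5:

```python
import math
import random

def sigmoid(z):
    """Logistic function used in eq. (4.14)."""
    return 1.0 / (1.0 + math.exp(-z))

def sgd_logistic(data, gamma=0.1, steps=400, seed=0):
    """Stochastic gradient descent per eqs. (4.14)-(4.15): at each step pick
    a random example (x_i, y_i) and update w <- w + gamma * (y_i - p_i) * x_i.
    data is a list of (feature_vector, label) pairs; the first feature is a
    constant 1 playing the role of the intercept."""
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])
    for _ in range(steps):
        x, y = rng.choice(data)
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
        w = [wj + gamma * (y - p) * xj for wj, xj in zip(w, x)]
    return w
```

Note that each step touches exactly one observation, which is why the method scales to the large booking datasets used here.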

In the rest of this report, the Logistic Regression model refers to the logistic regression model with estimations based on the regularized maximum likelihood and the Stochastic Gradient Descent model refers to the logistic regression model estimated with stochastic gradient descent.


4.1.3 Support Vector Machine

The third data classification method is Support Vector Machine (SVM). The basic idea is to find a hyperplane which separates the p-dimensional data perfectly into its two classes (Boswell, 2002). In a p-dimensional space, a hyperplane is a subspace of dimension p − 1 which need not pass through the origin (James, Witten, Hastie, & Tibshirani, 2013). The formula

$$\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p = 0 \tag{4.16}$$

defines a hyperplane, in the sense that a point X = (X1, X2, ..., Xp)T in p-dimensional space lies on the hyperplane if it satisfies (4.16). If a point X does not lie on the hyperplane, then X satisfies

$$\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p < 0 \tag{4.17}$$

or

$$\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p > 0, \tag{4.18}$$

which means that X lies on one of the two sides of the hyperplane. A hyperplane thus divides p-dimensional space into two halves, corresponding to the two classes yi ∈ {−1, 1}, where −1 represents one class and 1 the other (James, Witten, Hastie, & Tibshirani, 2013). If a separating hyperplane exists, we can use it to construct a very natural classifier: a test observation is assigned a class depending on which side of the hyperplane it is located (James, Witten, Hastie, & Tibshirani, 2013). A booking x* is assigned to class 1 if f(x*) is positive and to class −1 if f(x*) is negative, where

$$f(x^*) = \beta_0 + \beta_1 x_1^* + \beta_2 x_2^* + \ldots + \beta_p x_p^*. \tag{4.19}$$

The support vector classifier allows some observations to be on the incorrect side of the margin, or even the incorrect side of the hyperplane. It does not perfectly separate the two classes because it could be worthwhile to misclassify a few training observations in order to do a better job in classifying the remaining observations (James, Witten, Hastie, & Tibshirani, 2013). The support vector classifier is the solution to the optimization problem

$$\underset{\beta_0, \ldots, \beta_p,\; \epsilon_1, \ldots, \epsilon_n,\; M}{\text{maximize}} \quad M \tag{4.20}$$

$$\text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 = 1, \tag{4.21}$$

$$y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip}) \ge M(1 - \epsilon_i), \tag{4.22}$$

$$\epsilon_i \ge 0, \quad \sum_{i=1}^{n} \epsilon_i \le C, \tag{4.23}$$

where M represents the margin of the hyperplane and C is a nonnegative tuning parameter. The higher C, the higher the tolerance of violations of the margin and of the hyperplane. The slack variables ε1, ..., εn allow individual observations to be on the wrong side of the margin or the hyperplane (James, Witten, Hastie, & Tibshirani, 2013). If εi = 0, the ith observation is on the correct side of the margin; if εi > 0, it is on the wrong side of the margin; if εi > 1, it is on the wrong side of the hyperplane (James, Witten, Hastie, & Tibshirani, 2013). Support vectors are observations that lie directly on the margin, or on the wrong side of the margin for their class, and they determine the support vector classifier. An observation that lies strictly on the correct side of the margin does not affect the support vector classifier (James, Witten, Hastie, & Tibshirani, 2013). The solution to the optimization problem above involves only the inner products of the observations; the inner product of two bookings xi and xi' is given by

$$\langle x_i, x_{i'} \rangle = \sum_{j=1}^{p} x_{ij}\, x_{i'j}. \tag{4.24}$$

With some technical computations, it can be shown that the support vector classifier described above can be represented as

$$f(x) = \beta_0 + \sum_{i \in S} \alpha_i \langle x, x_i \rangle, \tag{4.25}$$

where S is the set of all support vector points; αi is nonzero only for the support vectors, so summing over the other points is not necessary.

The support vector classifier described above assumes a linear boundary between the two classes, but when we have to deal with non-linear class boundaries we have to enlarge the feature space using functions of the attributes. The support vector machine (SVM) is an extension of the support vector classifier that enlarges the feature space by using kernels (James, Witten, Hastie, & Tibshirani, 2013). A kernel is a function that quantifies the similarity of two observations and is a generalization of the inner product, of the form

$$K(x_i, x_{i'}). \tag{4.26}$$

There are different kinds of kernels; for example, the linear kernel

$$K(x_i, x_{i'}) = \sum_{j=1}^{p} x_{ij}\, x_{i'j}, \tag{4.27}$$

which is used by the support vector classifier. An SVM uses a non-linear kernel, such as a polynomial kernel of degree d > 1,

$$K(x_i, x_{i'}) = \left(1 + \sum_{j=1}^{p} x_{ij}\, x_{i'j}\right)^{d}, \tag{4.28}$$

or a radial kernel with a positive constant γ,

$$K(x_i, x_{i'}) = \exp\left(-\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2\right). \tag{4.29}$$

When the support vector classifier is combined with a non-linear kernel, the resulting classifier is known as a support vector machine and can be represented as a non-linear function

$$f(x) = \beta_0 + \sum_{i \in S} \alpha_i K(x, x_i), \tag{4.30}$$

where S is the set of all support vector points (James, Witten, Hastie, & Tibshirani, 2013). One advantage of using a kernel rather than simply enlarging the feature


space by using functions of the original features is computational. With kernels, we only need to compute K(xi, xi') for the pairs of observations, which can be done without explicitly working in the enlarged feature space. Without kernels, the enlarged feature space would often be so large that computations become intractable.

4.1.4 Naive Bayes classifier

The fourth technique is the naive Bayes classifier, a classical classification method which has been widely used to derive class probability estimates (Hastie, Tibshirani, & Friedman, 2001). Suppose we have two kinds of feature vectors, x1 and x2. Rather than building one large model, it might be more practical to learn two separate classifiers, p(y|x1) and p(y|x2), and then combine them (Murphy, 2006). A simple way to combine the systems is to assume the features are conditionally independent given the class label: p(x1, x2|y) = p(x1|y) p(x2|y) (Murphy, 2006). This is the naive Bayes assumption.

The naive Bayes classifier works as follows (Leung, 2007):

1. Let T be a training set of n bookings, each with its class label. The class-label attribute, cancellation, tells whether a booking is cancelled or not and has two classes: C1 = 1 (cancellation) and C2 = 0 (no cancellation). Each booking is represented by a p-dimensional vector X = {x1, ..., xp}, containing the measured values of the p attributes A1, ..., Ap for booking X. All attributes are described in chapter 3.

2. Given a booking X, the classifier predicts that the booking is cancelled if and only if P(C1|X) > P(C2|X), where

$$P(C_i|X) = \frac{P(X|C_i)\, P(C_i)}{P(X)}. \tag{4.31}$$

3. Given data sets with many attributes, it would be computationally expensive to compute P(X|Ci) directly. In order to reduce computation in evaluating P(X|Ci)P(Ci), the naive Bayes assumption of class-conditional independence is made. This presumes that the values of the attributes are conditionally independent of one another, given the class label of the booking:

$$P(X|C_i) \approx \prod_{k=1}^{p} P(x_k|C_i). \tag{4.32}$$

The probabilities P(x1|Ci), P(x2|Ci), ..., P(xp|Ci) can easily be estimated from the training set. Here, xk refers to the value of attribute Ak for booking X. If Ak is categorical, then P(xk|Ci) is the number of bookings of class Ci in T having the value xk for attribute Ak, divided by freq(Ci, T), the number of bookings of class Ci in T.

4. In order to predict whether a booking X is cancelled or not, P(X|Ci)P(Ci) is evaluated for each class Ci. The classifier predicts that the class label of X is Ci if and only if it is the class that maximizes P(X|Ci)P(Ci).


The naive Bayes classifier is easy to implement and has a very low computation time. Compared to other models, less training data is needed, and the naive Bayes classifier performs well with categorical input variables. On the other hand, this classifier is also known to be a poor probability estimator, because it relies on the assumption of independent attributes, which rarely holds in real life.
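The counting estimates of steps 1-4 can be sketched as follows. This is a simplified illustration for categorical attributes, without smoothing for attribute values unseen in a class:

```python
from collections import Counter, defaultdict

def naive_bayes(train_X, train_y):
    """Naive Bayes by counting, following steps 1-4 above: estimate P(C_i)
    and each P(x_k | C_i) from the training set, then predict the class that
    maximizes P(C_i) * prod_k P(x_k | C_i), per eqs. (4.31)-(4.32)."""
    n = len(train_y)
    class_count = Counter(train_y)
    value_count = defaultdict(Counter)  # (class, attribute index) -> value counts
    for xi, yi in zip(train_X, train_y):
        for k, v in enumerate(xi):
            value_count[(yi, k)][v] += 1

    def predict(x):
        best_class, best_score = None, -1.0
        for c, freq_c in class_count.items():
            score = freq_c / n  # prior P(C_i)
            for k, v in enumerate(x):
                score *= value_count[(c, k)][v] / freq_c  # P(x_k | C_i)
            if score > best_score:
                best_class, best_score = c, score
        return best_class

    return predict
```

The denominator P(X) of (4.31) is the same for both classes and can be omitted when only the arg-max is needed.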

4.1.5 Random Forests

The last technique is random forests. First, bagging (also known as bootstrap aggregation) is discussed: a procedure for reducing the variance of a statistical learning method such as decision trees.

Bagging works as follows (James, Witten, Hastie, & Tibshirani, 2013): let T be a training data set. Bootstrap T by taking repeated samples from T, so that we generate B different bootstrapped training data sets. We then train our method on the bth bootstrapped training set in order to get $\hat{f}^{*b}(x)$, and finally average all the predictions to obtain

$$\hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x). \tag{4.33}$$

To apply bagging to decision trees, we construct B trees using B bootstrapped training sets and average the resulting predictions, which leads to a reduction in variance. In this report, a classification problem has to be solved in which the dependent variable Y is qualitative. For this situation, the simplest form of bagging is as follows (James, Witten, Hastie, & Tibshirani, 2013): for a given test observation, we record the class predicted by each of the B trees and take a majority vote: the overall prediction is the most commonly occurring class among the B predictions.

The random forests algorithm works as follows (James, Witten, Hastie, & Tibshirani, 2013): while building the decision trees on bootstrapped training samples, each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors. A fresh sample of m predictors is taken at each split, and only one of the m predictors is used for the split. At each split, the algorithm is thus not even allowed to consider a majority of the available predictors. There is a reason for this: when there is one strong predictor, almost all the trees use it in the top split, so the predictions from the trees are highly correlated, and averaging highly correlated predictions does not lead to a large reduction in variance. Random forests improve over bagged trees by forcing each split to consider only a subset of the predictors, so that there is a larger chance that predictors other than the strong one are used. This decorrelates the trees, which reduces the variance compared to a single tree (James, Witten, Hastie, & Tibshirani, 2013).
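The bootstrap-and-vote mechanics of bagging can be sketched as follows. The base learner here is a deliberately trivial majority-class predictor standing in for a decision tree (an assumption for illustration); a random forest would additionally restrict each split inside the base learner to a random subset of m of the p predictors:

```python
import random
from collections import Counter

def majority_class_learner(sample):
    """Trivial base learner: always predicts the majority class of its
    training sample. A real implementation would grow a decision tree."""
    majority = Counter(y for _, y in sample).most_common(1)[0][0]
    return lambda x: majority

def bagged_predict(train, x, base_learner, B=25, seed=0):
    """Bagging for classification: fit the base learner on B bootstrap
    samples (drawn with replacement, per eq. (4.33)) and take a majority
    vote over their B predictions."""
    rng = random.Random(seed)
    n = len(train)
    votes = Counter()
    for _ in range(B):
        sample = [rng.choice(train) for _ in range(n)]  # bootstrap sample
        model = base_learner(sample)
        votes[model(x)] += 1
    return votes.most_common(1)[0][0]
```

Swapping `majority_class_learner` for a tree grower with per-split predictor subsampling turns this skeleton into a random forest.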

4.2 Measures of Classification Accuracy

To evaluate the accuracy of the five classification models mentioned earlier, some different techniques are used. In this section, five accuracy measures are described.


Accuracy with Confusion Matrix

The five classification models used in this report output a probability of a cancellation instead of a class label 0 or 1. However, if we take 0.5 as the cut-off probability, i.e. bookings with a probability of 0.5 and above are marked as a cancellation and bookings below 0.5 as no cancellation, the predicted result can be compared to the actual result. A Confusion Matrix is a popular way to summarize the results of a classification model (see table 4.1). True Positives are observations where the actual and predicted class labels are both 1; True Negatives are observations where both are 0; False Positives are observations where the actual class label is 0 but the predicted class label is 1; and False Negatives are observations where the actual class label is 1 but the predicted class label is 0. A perfect test is one where FP and FN are both zero.

           Predicted 1             Predicted 0
Actual 1   True Positives (TP)     False Negatives (FN)
Actual 0   False Positives (FP)    True Negatives (TN)

Table 4.1: Confusion Matrix

With the use of this Confusion Matrix, we can calculate some measures used for testing the classification models. The first measure is the fraction of observations that is correctly predicted. We call this measure PC and it is calculated as

$$PC = \frac{TP + TN}{n}. \tag{4.34}$$

The second measure is the Matthews Correlation Coefficient (MCC), which takes all cells of the Confusion Matrix into account and is calculated as

$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}. \tag{4.35}$$

The MCC ranges from −1 to +1: a model with a score of +1 is a perfect model and a model with a score of −1 is a poor model (Joseph, 2016).

AUC-ROC

The Area Under the ROC (Receiver Operating Characteristic) Curve measures accuracy by how well the test separates the bookings into two groups: cancellations and no cancellations. The ROC curve plots the true positive rate against the false positive rate: the true positive rate is the fraction of cancelled bookings that the test correctly identifies as cancelled, and the false positive rate is the fraction of non-cancelled bookings that the test incorrectly identifies as cancelled. An AUC of 1 represents a perfect test: the test correctly identifies for every booking whether it is cancelled or not. A worthless test has an area of 0.5, so a useful classification model has an AUC between 0.5 and 1.
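The AUC can be computed without plotting the curve, using its equivalent ranking interpretation; a minimal sketch:

```python
def auc(probs, labels):
    """AUC computed as the probability that a randomly chosen cancelled
    booking receives a higher predicted probability than a randomly chosen
    non-cancelled one, with ties counted as one half (equivalent to the
    area under the ROC curve)."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

The quadratic pairwise loop is fine for a sketch; in practice one sorts the scores once and works with ranks.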


The above three accuracy measures make use of the cut-off probability of 0.5, i.e. bookings with a probability of 0.5 and above are marked as a cancellation, and bookings with a probability below 0.5 are marked as no cancellation. However, a booking with a cancellation probability of 0.49, and thus marked as no cancellation, may well be cancelled. So directly classifying observations as cancellation or no cancellation using the cut-off probability carries a risk of misclassification. Two accuracy measures based directly on the predicted probability of a cancellation are the Mean Squared Error and the Log Loss. Both are described below.

SSE and MSE

The sum of squared errors is a measure of dissimilarity between the actual data and an estimation model: the smaller the SSE, the better the fit of the model on the data. The SSE is calculated as

$$SSE = \sum_{i=1}^{n} (y_i - p_i)^2, \tag{4.36}$$

where n is the number of observations, pi the predicted probability of a cancellation of booking i and yi the class-label attribute of booking i. The mean squared error is the sum of squared errors divided by the number of observations:

$$MSE = \frac{SSE}{n}. \tag{4.37}$$

Log Loss function

The Log Loss measure quantifies the accuracy of a classifier by penalizing false classifications. Minimizing the Log Loss is essentially equivalent to maximizing the accuracy of the classifier (Collier, 2015); a classifier with a Log Loss of zero has 100% accuracy. The Log Loss function is defined as

$$\text{LogLoss} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]. \tag{4.38}$$
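Both probability-based measures (4.36)-(4.38) are one-liners; a minimal sketch:

```python
import math

def mse(y, p):
    """Mean squared error per eqs. (4.36)-(4.37)."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y)

def log_loss(y, p):
    """Log Loss per eq. (4.38); assumes 0 < p_i < 1 for every booking."""
    return -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                for yi, pi in zip(y, p)) / len(y)
```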

4.3 Test for Significance

To determine whether the performances of the six classification models differ significantly, a paired t-test is performed. This test is chosen because the predicted cancellation probabilities of the models are repeated observations on the same bookings. The null and alternative hypotheses of the paired t-test are defined below:

H0: true difference in means is equal to zero

H1: true difference in means is not equal to zero

The Paired t-test must have a p-value below the chosen significance level α in order to reject the null hypothesis. If that is the case, there is a significant difference between the model performances.
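The test statistic itself is straightforward to compute; a minimal sketch (the p-value is then obtained from a t-distribution with n − 1 degrees of freedom, e.g. via `scipy.stats.ttest_rel`, which is an assumed convenience here rather than part of the thesis code):

```python
import math

def paired_t_statistic(a, b):
    """t statistic of the paired t-test: the mean of the pairwise
    differences divided by its standard error."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((di - mean) ** 2 for di in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)
```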


Chapter 5

Results and Analysis

In this chapter, the results of the fitted classification models are discussed. In the first section of this chapter, the chosen parameters of the classification models for each dataset are explained. The models are fitted with these parameters and the accuracy of the predicted cancellation probabilities of each model is discussed in the second section. The accuracy of each classification model is based on the five accuracy measures mentioned in section 4.2. Another important way to compare the classification models is based on the runtime of each model. The runtime of the models on each dataset is given in the third section of this chapter. Furthermore, the variable importance is investigated in the last section.

5.1 Tune parameters

The choice of λ for the logistic regression model, the optimal number of iterations for the stochastic gradient descent and the kernel choice for the support vector machine are tuned and explained in the next three subsections. In the fourth subsection, the chosen parameters for the other classification models are described, such as the number of trees in RF.

5.1.1 Choice of λ for Logistic Regression

As mentioned in chapter 4, the logistic regression is fitted by regularized maximum likelihood, which estimates the coefficients by finding the parameter values that maximize equation (4.11). The question is then which value of λ to choose for the best out-of-sample fit. To answer this question, a hundred different values of λ ∈ [0, 0.5] are used to predict the cancellation probability of each observation in the out-of-sample set. The value of λ for which the predicted probabilities pi minimize the log loss function is chosen as the optimal

λ. This is done for all datasets and the results are shown in table 5.1.

Dataset S Dataset M Dataset L Dataset XL

optimal λ 0.0020 0.0005 0 0

Table 5.1: λ that minimizes the log loss function for each dataset


The value of λ decreases as the number of observations increases. Dataset S and dataset M both have a small value of λ, 0.002 and 0.0005 respectively and no penalty on the β coefficients is needed for dataset L and XL. So the four optimal λ’s in table 5.1 are used to predict the cancellation probability of each observation in the out-of-sample set.
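The tuning procedure above amounts to a one-dimensional grid search on out-of-sample log loss. A minimal sketch, in which `fit_predict` is a hypothetical helper (not in the thesis code) that refits the model with penalty λ and returns the predicted probabilities for the out-of-sample set:

```python
import math

def log_loss(y, p):
    """Out-of-sample log loss, eq. (4.38), used as the tuning criterion."""
    return -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                for yi, pi in zip(y, p)) / len(y)

def tune_lambda(lambdas, fit_predict, y_out):
    """Grid search: pick the lambda whose out-of-sample predicted
    probabilities minimize the log loss."""
    return min(lambdas, key=lambda lam: log_loss(y_out, fit_predict(lam)))
```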

5.1.2 Number of Iterations for Stochastic Gradient Descent

Another method for estimating the β coefficients in the logistic regression is stochastic gradient descent, which updates the coefficients at every iteration; with these updated coefficients the predicted probabilities of the bookings are calculated. The optimal number of iterations is investigated, where optimal means: the number of iterations for which the predicted probabilities pi of the observations in the out-of-sample set minimize

the log loss function. The stochastic gradient descent is implemented on all four datasets and the results are given in table 5.2.

Dataset S
Number of iterations     Log Loss   Runtime
0.10·nrow(datasetS)      0.2404     40.48
0.25·nrow(datasetS)      0.2421     76.68
0.50·nrow(datasetS)      0.3106     141.56
0.75·nrow(datasetS)      0.2325     216.08
1.00·nrow(datasetS)      0.2531     278.54
2.00·nrow(datasetS)      0.2544     562.87

Dataset M
Number of iterations     Log Loss   Runtime
0.10·nrow(datasetM)      0.2420     466.71
0.25·nrow(datasetM)      0.2524     1078.13
0.50·nrow(datasetM)      0.2382     2101.40
0.75·nrow(datasetM)      0.2343     3223.88
1.00·nrow(datasetM)      0.2262     4297.98
2.00·nrow(datasetM)      0.2256     8593.58

Dataset L
Number of iterations     Log Loss   Runtime
0.10·nrow(datasetL)      0.2354     5145.73
0.25·nrow(datasetL)      0.2404     13139.81
0.50·nrow(datasetL)      0.2366     26038.36
0.75·nrow(datasetL)      0.2331     46374.05
1.00·nrow(datasetL)      0.2309     57521.27
2.00·nrow(datasetL)      0.2312     103916.1

Dataset XL
Number of iterations     Log Loss   Runtime
0.10·nrow(datasetXL)     0.2317     27954.34
0.25·nrow(datasetXL)     0.2312     69863.59
0.50·nrow(datasetXL)     0.2479     146354.6
0.75·nrow(datasetXL)     0.2246     202530.7
1.00·nrow(datasetXL)     0.2381     281375.7
2.00·nrow(datasetXL)     0.2406     562755.9

Number of iterations = in fractions of the number of observations in the dataset; Runtime = total running time of the model in seconds.

Table 5.2: Log Loss and runtime versus number of iterations of SGD for each dataset

The number of iterations for which the predicted probabilities pi minimize the log loss function is chosen as the optimal number of iterations. This is done for all datasets; the results are shown in table 5.3, and the stochastic gradient descent is fitted with these numbers of iterations. The choice of the learning rate parameter of the stochastic gradient descent is explained in subsection 5.1.4.

                               Dataset S   Dataset M   Dataset L   Dataset XL
optimal number of iterations   0.75        2           1           0.75

optimal number of iterations = in fractions of the number of observations in the dataset.

Table 5.3: Optimal number of iterations of SGD for each dataset


5.1.3 Kernel Choice for Support Vector Machine

As mentioned earlier, there are different kinds of kernels. Three kernels for the support vector machine model are considered.

1. linear kernel

2. polynomial kernel

3. radial kernel

Table 5.4 summarizes the results of the support vector machine models with three different kernels and figure 5.1 illustrates the ROC curves of the kernels on all four datasets. The linear kernel performs best on all five accuracy measures for dataset S. For dataset M and dataset L, the linear kernel performs best on the accuracy measures PC, MCC and MSE, whereas the radial kernel performs best on the AUC and Log Loss measures. For dataset XL, the linear kernel performs best on all accuracy measures except for the area under the curve. On this accuracy measure the radial kernel outperforms the linear kernel. Overall, the polynomial kernel performs the worst of all kernels on all four datasets.

                   Dataset S                                     Dataset M
Model              PC      MCC     AUC     MSE     Log Loss      PC      MCC     AUC     MSE     Log Loss
Linear Kernel      0.9351  0.8622  0.9252  0.0589  0.2259        0.9352  0.8626  0.9212  0.0590  0.2305
Polynomial Kernel  0.9336  0.8585  0.9198  0.0618  0.2411        0.9244  0.8366  0.9133  0.0698  0.2873
Radial Kernel      0.9344  0.8604  0.9218  0.0601  0.2293        0.9344  0.8604  0.9222  0.0601  0.2292

                   Dataset L                                     Dataset XL
Model              PC      MCC     AUC     MSE     Log Loss      PC      MCC     AUC     MSE     Log Loss
Linear Kernel      0.9356  0.8635  0.9118  0.0588  0.2304        0.9356  0.8635  0.9142  0.0588  0.2258
Polynomial Kernel  0.9174  0.8204  0.9129  0.0730  0.2879        0.9330  0.8569  0.9201  0.0622  0.2425
Radial Kernel      0.9344  0.8604  0.9267  0.0601  0.2290        0.9344  0.8604  0.9203  0.0601  0.2288

PC = Percent Correct; MCC = Matthews Correlation Coefficient; AUC = Area Under the ROC Curve; MSE = Mean Squared Error; Log Loss = Logarithmic Loss.

Table 5.4: Accuracy measures of three different kernels for support vector machine

After fitting the support vector machine with three different kernels, we check whether the model performances differ significantly. This is done by performing a paired t-test with significance level α = 5%, under the assumption that the error terms are normally distributed. Table 5.5 summarizes the p-values of the paired t-tests. For dataset S, dataset L and dataset XL, the performance of the polynomial kernel differs significantly from that of the other two kernels, but there is no significant difference between the linear and radial kernels. Although the performance of the polynomial kernel is statistically different, it brings no improvement in accuracy compared to the linear or radial kernel. For dataset M, there is no significant difference between the performances of any pair of kernels. Although the linear kernel performs best on most accuracy measures, there is no clear answer to the question which kernel to choose for SVM, because the differences between the kernels' performances are insignificant. For ease of comparison, the linear kernel is used to fit the support vector machine on datasets S, M, L and XL in all further results.
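The paired t-test compares the two kernels' per-observation errors pairwise. A minimal sketch of the test statistic, with hypothetical error vectors (not the thesis data; the thesis reports the resulting p-values in table 5.5):

```python
# Paired t-test statistic on two equal-length per-observation error vectors.
# The error values below are hypothetical, for illustration only.
import math
import statistics

def paired_t_statistic(errors_a, errors_b):
    """t statistic of the paired t-test: mean difference over its standard error."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation of differences
    return mean_d / (sd_d / math.sqrt(n))

linear = [0.05, 0.10, 0.02, 0.08, 0.04]  # hypothetical squared errors, kernel A
radial = [0.06, 0.09, 0.03, 0.10, 0.05]  # hypothetical squared errors, kernel B
t = paired_t_statistic(linear, radial)
```

The p-value then follows from the t-distribution with n − 1 degrees of freedom (e.g. via `scipy.stats.ttest_rel`, which bundles both steps).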


            Dataset S               Dataset M           Dataset L               Dataset XL
            Linear     Polynomial   Linear  Polynomial  Linear     Polynomial   Linear     Polynomial
Polynomial  < 2.2e−16  -            0.5964  -           < 2.2e−16  -            < 2.2e−16  -
Radial      0.5249     < 2.2e−16    0.8315  0.6342      0.7477     < 2.2e−16    0.7529     < 2.2e−16

Table 5.5: p-values of Paired t-tests of the three different kernels

Figure 5.1: ROC curves of the three kernels on (a) dataset S, (b) dataset M, (c) dataset L and (d) dataset XL


5.1.4 Parameter Settings

Stochastic Gradient Descent

The learning rate of the stochastic gradient descent method, γ in equation 4.15, has to be set to an appropriate value. This parameter determines how fast the start value of w moves towards the optimal value of w. One technique is to adapt the value of γ in each iteration: start with a large value of γ so the algorithm learns quickly, and decrease the value in each subsequent iteration. However, such a schedule is difficult to tune because the optimal values are not known. Therefore, in this report the learning rate is set to γ = 0.1 in every iteration on all datasets.
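As an illustration of this choice (a sketch, not the thesis's actual code, which is not shown), one stochastic gradient descent pass for logistic regression with the fixed learning rate γ = 0.1 could look like the following; the tiny dataset at the bottom is hypothetical:

```python
# Stochastic gradient descent for logistic regression: per step, update
# w <- w - gamma * gradient of the log loss for one random observation.
import math
import random

def sgd_logistic(X, y, gamma=0.1, n_iter=1000, seed=0):
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    for _ in range(n_iter):
        i = rng.randrange(len(X))          # one random observation per step
        z = sum(wj * xj for wj, xj in zip(w, X[i]))
        p = 1.0 / (1.0 + math.exp(-z))     # predicted cancellation probability
        for j, xj in enumerate(X[i]):      # gradient of log loss for obs. i
            w[j] -= gamma * (p - y[i]) * xj
    return w

# Hypothetical example: an intercept column plus one feature.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0, 0, 1, 1]
w = sgd_logistic(X, y)
```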

Decision Trees

As mentioned in chapter three, there are several strategies available for splitting and pruning the tree. Based on earlier research on cancellation forecast by KLM, Entropy has been chosen as impurity function and Gain Ratio as gain function. As pruning methods, parent and child pruning have been chosen. Both thresholds are set equal to 200 observations.
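The Entropy impurity and Gain Ratio criteria named above can be sketched as follows (an illustrative Python implementation, not the KLM tooling):

```python
# Entropy impurity of a label vector and Gain Ratio of a candidate split
# (information gain normalized by the split information).
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(parent_labels, child_label_lists):
    n = len(parent_labels)
    info_gain = entropy(parent_labels) - sum(
        len(ch) / n * entropy(ch) for ch in child_label_lists)
    split_info = -sum((len(ch) / n) * math.log2(len(ch) / n)
                      for ch in child_label_lists if ch)
    return info_gain / split_info if split_info > 0 else 0.0

# A pure split of a balanced parent yields the maximal gain ratio.
labels = [0, 0, 1, 1]
print(gain_ratio(labels, [[0, 0], [1, 1]]))  # 1.0
```

Parent and child pruning then simply reject any split whose parent (or resulting child) node holds fewer than the 200-observation threshold.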

Random forests

The random forests model trains several independent decision trees on one dataset. In this report, the random forests model trains 30 trees on each dataset for n = 100 rounds. Each round yields a train error that indicates how well the model explains the training data; this error decreases with each round. As mentioned in chapter 4.1.5, the random forests model takes a random sample of m attributes at each split. Each predictor is included in this random sample with probability 0.5.
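The per-split attribute sampling described above, where each predictor enters the candidate set with probability 0.5, can be sketched as follows (the attribute names are hypothetical, not the thesis's actual predictors):

```python
# Draw the candidate attribute set for one split: each attribute is kept
# independently with probability p = 0.5.
import random

def sample_split_attributes(attributes, p=0.5, rng=None):
    rng = rng or random.Random()
    chosen = [a for a in attributes if rng.random() < p]
    # guarantee at least one candidate attribute for the split
    return chosen or [rng.choice(attributes)]

attrs = ["bookingclass", "daystodeparture", "fare", "route"]  # hypothetical
subset = sample_split_attributes(attrs, rng=random.Random(42))
```

Note that with this scheme m is itself random (binomially distributed), rather than fixed as in the classical random forests formulation.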

5.2

Accuracy of Classification Models

To compare the classification models, five accuracy measures are used. Table 5.6 summarizes the values of the five accuracy measures for the six different classification models on all four datasets and figure 5.2 illustrates the ROC curves of the classification models. The predictions of each fitted model are made on the same out-of-sample set. In total, twenty-four models are compared: six different classification models on four datasets. After fitting all classification models, we check whether the model performances differ significantly. Again assuming normally distributed error terms, this is done with a paired t-test at significance level α = 5%. The results are summarized in table 5.7.

It is clear from figure 5.2 that the random forests model has the highest area under the curve closely followed by the logistic regression, whereas naive Bayes definitely has the lowest area under the curve.


From table 5.6 it follows that the decision tree model fitted on dataset L has the highest PC value, i.e. the highest fraction of observations that are correctly predicted under binary classification (1 for cancellation and 0 for no cancellation). This model also performs best on the MCC accuracy measure. For both accuracy measures, there is only a small difference between this model and the logistic regression fitted on dataset L. However, the difference between the performances of these two models is not significant.

The random forests model fitted on dataset L performs best on the AUC and Log Loss accuracy measures compared to all other models. The logistic regression on dataset XL has the lowest mean squared error. However, the difference is only 0.0002 compared to the logistic regression fitted on dataset L and only 0.0003 compared to the random forests model fitted on dataset L. From table 5.7 it follows that there is no significant difference in performance, neither between the logistic regression fitted on dataset XL and the logistic regression on dataset L nor between the logistic regression fitted on dataset XL and the random forests model on dataset L.

It is clear from table 5.7 that only the model performances of stochastic gradient descent and the naive Bayes classifier are significantly different from those of all other classification models. Despite this result, neither of these two models performs best on any accuracy measure. Therefore, applying stochastic gradient descent or naive Bayes does not significantly improve the cancellation forecast accuracy.

The stochastic gradient descent and the logistic regression estimate the coefficients of the same model differently. For both models, the accuracy measures are based on the predictions for which the cancellation probabilities of the observations minimize the Log Loss function. If these two models are compared, it is clear that the logistic regression performs better on all five accuracy measures.

So overall, the decision tree model fitted on dataset L performs well at binary classification, i.e. on the PC and MCC measures. However, the logistic regression fitted on dataset L performs more or less the same on the PC and MCC measures and even better on the MSE measure compared to the decision tree model fitted on dataset L. The random forests model fitted on dataset L performs best on the AUC and Log Loss measures compared to all other models, and its MSE value is more or less the same as that of the logistic regression model fitted on dataset L. A notable result is that dataset XL does not contribute much to better model performance compared to dataset L. Therefore, it is sufficient to make predictions on a relatively smaller dataset. The naive Bayes classifier, support vector machine and stochastic gradient descent are less accurate than the other classification models. The naive Bayes classifier performs the worst of all.
