
Predicting Hotel Booking Cancellations with Machine Learning

submitted in partial fulfillment for the degree of

master of science

Enrique Herreros Jiménez

11400307

master information studies

data science

faculty of science

university of amsterdam

2017-06-23

Internal Supervisor: Evangelos Kanoulas (UvA)
External Supervisor: Germán Gómez-Herrero (FindHotel BV)


Contents

1 Introduction
2 Related Work
3 Methodology
  3.1 Data description
  3.2 Infrastructure
  3.3 Research Question 1
    3.3.1 Data preparation
    3.3.2 Baseline model
    3.3.3 Learning algorithms
    3.3.4 Hyperparameter tuning
    3.3.5 Evaluation Framework
    3.3.6 Deployment
  3.4 Research Question 2
4 Results and Analysis
  4.1 RQ1
  4.2 RQ2
5 Conclusions
  5.1 Limitations and Future Research
  5.2 Acknowledgements
A Appendix
  A.0.1 Gradient Boosting Machines
  A.0.2 Random Forest
  A.0.3 Extreme Gradient Boosting
  A.0.4 LightGBM


Abstract

In this study, we assessed the effectiveness of modern machine learning techniques in predicting hotel booking cancellations. The most explanatory variables learned by the models were also studied.

It is estimated that more than 50% of all travel reservations are made on the internet, amounting to approximately 150M hotel bookings made online every year. This is partly due to the widespread use of free cancellation policies by Online Travel Agencies (OTAs), which encourage hotel clients to book rooms even when they are unsure about their trip. OTAs also create a sense of urgency in their users by conveying limited availability, limited-time deals, loss aversion and scarcity; these marketing strategies aim to pressure clients into finalizing their bookings as soon as possible. Another common behavior is booking with the intention of continuing to look for better options, or double-booking and later deciding which option to keep. Fraudulent bookings, made for example to support visa applications, also rank high among the cancellation drivers.

Being able to predict hotel booking cancellations is relevant for many players in the hospitality industry. For example, in room inventory management, predicting cancellations makes it possible to produce accurate dynamic pricing models for hotel rooms. For marketing departments, accurate prediction of revenue is crucial: without a proper prediction of cancellations, digital marketing analyses may be biased towards marketing targets that exhibit higher cancellation rates. Lastly, OTAs and hotels may want to estimate the value of free cancellation policies, i.e. use those policies only in segments where they are likely to increase revenue. For example, travelers of some nationalities may value "free cancellation" highly yet produce few cancellations, which would make them ideal targets for free cancellation policies.

Surprisingly, only a few studies have examined Cancellation Modeling (CM) using Machine Learning (ML) in the hotel industry. Furthermore, very few have researched the use of machine learning to understand cancellations.

Our study shows that boosted tree models, combined with feature engineering, can outperform classical data mining models. Furthermore, some algorithms are able to train models 30 times faster than popular tree-based algorithms like Random Forest. This may be an important factor to take into account for very large datasets, or when the cancellation dynamics change rapidly and models need to be retrained very often.

1 Introduction

Our research contributes towards understanding the potential factors behind hotel booking cancellations and how they can be predicted using machine learning techniques. Furthermore, different classic and modern tree-based models are compared with each other in terms of accuracy and performance in different scenarios.

The hospitality industry is a large industry that generated a revenue of 550 billion US dollars in 2016, with an expected growth of almost 5% in 2017. Booking cancellations have a huge impact on the industry: it is estimated that between 25% and 30% of hotel bookings are cancelled (own reference, [6] and [21]).

A revenue management system must take into account the possibility that a booking may be cancelled [21]. One way to take cancellations into account is to work with net demand [19] instead of demand, where net demand is defined as the number of demand requests minus the number of cancellations. Accurate cancellation rates are crucial for the revenue management system to produce accurate dynamic pricing models for hotel rooms.

Correctly estimating the net available inventory is important for hotels and OTAs so they can optimize the options that are shown to their users. For example, an OTA is interested in exhausting the inventory of properties in a uniform way, to offer as many options as possible to its users.

As we already mentioned, the cancellation rate in this industry is enormous, which means that any financial or operational forecast must take booking cancellations into account. As an example, a hotel meta-search engine will probably receive booking cancellation information with a lag in the order of months (own source), whilst revenue is reported monthly. This means that financial information is out of sync, and both reporting and forecasting will incur a large gap in their revenue estimations.

From an OTA perspective, the value of a marketing campaign is measured by the number of "click-to-booking" occurrences combined with the commission the OTA received for getting the hotel a new client. Nevertheless, this metric needs to be readjusted in the future by subtracting the losses generated by cancellations, for which the OTA has to return the corresponding commission. Without an accurate cancellation model, it is impossible to assign a real value to different marketing initiatives.

In the case of a meta-search engine, revenue is most commonly generated by receiving a commission on the booking value. In this case, the cancellation probability for each OTA should affect the order in which the different providers are displayed to the user for a specific offer.

From the hotel revenue management perspective, it is helpful to have a better understanding of the cancellation drivers and more accurate models, in order to handle overbooking and forecasting more efficiently. Once the hotel is aware of the number of potential clients that may cancel, they will be


able to adjust their prices based on the virtual availability.

It is also interesting for hotels to identify large bookings, in terms of value, that have a high chance of being cancelled during the free or low-cost cancellation period. Hotels can leverage this knowledge to minimize losses by, for example, linking bookings with prepaid activities at the destination, shortening the free-cancellation period, or removing the free cancellation period in exchange for meals or room upgrades.

During this research we found that machine learning models, together with high-quality processed data, can achieve AUC scores of approximately 0.75, which is considered good performance. Therefore, machine learning should be applied to better predict booking cancellations.

The results shown in this research might be of interest to marketing, financial, operational or any data-related teams in the hospitality industry, such as hotels, OTAs or meta-search engines.

The aim of this study is to answer the following research questions:

RQ1 Are machine learning models suitable tools for predicting and understanding hotel booking cancellations?

RQ2 Which are the most informative features in hotel booking data for unlocking better predictions?

2 Related Work

A number of revenue management research studies concluded that "accurate demand forecasting is a key aspect of revenue management" [20]. Other studies, like [29] and [21], also acknowledge that revenue management systems require forecasts of quantities and that their "performance depends critically on the quality of these forecasts". Moreover, cancellation prediction is a critical tool in revenue management, as it allows net demand to be predicted accurately [19] and unnecessary losses to be avoided [10]. It is also interesting for marketing, where it enables more profitable campaigns; for financial planning, to anticipate sales and predict revenue and profit; and for OTA inventory control, to avoid both overstock and stock-out situations [27].

Existing research has focused on overbooking problem optimization. This is because, if hotels limit reservations to the available capacity of the hotel, late cancellations will leave some rooms empty and the goal of maximizing revenues and profits will not be fulfilled. For example, [23] and [12] present mathematical models to find optimal overbooking levels under stochastic cancellations, with single and multiple room types, considering marginal costs. [12] also concludes that the optimal level of overbooking is inversely related to the amount of the cancellation charges applied by hotels. Studies like [14] have shown that overbooking has a potentially negative impact on customer satisfaction, and consequently on customer loyalty and their


future booking behavior. This fact confirms the importance of being as precise and cautious as possible when applying overbooking techniques to maximize profit and circumvent the problems caused by booking cancellations. Alternatively, our recommendation to hotels and OTAs would be to apply a flexible cancellation policy, either based on time to check-in date (dynamically decreasing the refund as the check-in date gets closer) or under which a refund for cancellation is offered only if the service provider expects the service to be sold out. [15] suggests that the seller can raise the refund for cancellations if it appears that demand for the service is higher than expected, and lower or cancel the refund if it appears that service units will remain unsold. Thus, a service provider can profit from offering partial refunds for service cancellations, because such refunds allow the provider to capture some of the consumer-added surplus that is created when customers find alternative opportunities and cancel the purchased service [15].

There are many indications that deal-seeking travelers continue to search after they have made a reservation, looking for an even better deal for the same tourism product or service. If a better deal is found after the initial booking, these deal-seekers cancel their existing reservation and rebook the better deal [14]. Other booking cancellation drivers are fraudulent bookings (for example, due to stolen personal ID or credit card details, or to comply with visa requirements to enter certain countries), 'pay later' on-site options, and booking accelerators that pressure clients with messages such as "book now or you might lose this opportunity" [7]; perceived scarcity has a significant influence on perceived value and purchase intention [30].

In the scope of data science, specifically in the field of machine learning, supervised predictive modeling problems are usually divided into two types [11]: regression, when the outcome measurement is quantitative (e.g. forecasting the cancellation percentage over the total number of bookings), and classification, when the outcome is a class/category (e.g. predicting whether a specific booking "will cancel" or "will not cancel"). Customer cancellation is typically a classification problem, and machine learning algorithms can be used to model and predict cancellations before the check-in date. This relates to our first research question: what are the most suitable machine learning models for predicting hotel booking cancellations? Previous research has pointed out that Back Propagation Neural Networks and General Regression Neural Networks have excellent cancellation prediction abilities, can assist managers in judging whether customers will cancel reservations, and can help in dynamic service capacity planning [10].

To the best of our knowledge, the only previous study that has applied machine learning algorithms to model hotel booking cancellations is [24], which evaluated a group of machine learning algorithms against every hotel in a particular chain, separately. The study reported that all models built reached accuracy values above 90% and AUC scores above 93.5%, which is considered excellent. Our results, using similar features but one order of magnitude more examples, yield AUC scores 20% lower. This could be an indication that their study overfits the training data.


As for our second research question, "What are the most explanatory features that help predict booking cancellations?", the main conclusion of [4] is that "cancellation deadline affects customers' search behavior". [24] shows that, although the most important variables differ from one hotel to another, the most important variables overall are: country of origin of the client, OTA seller agent, whether the booking was refundable in case of cancellation, night rate, and number of days prior to arrival that the booking was placed.

3 Methodology

3.1 Data description

Various data sources were used to carry out this project: three were gathered from the internal databases at FindHotel BV and two were publicly available external data sets:

1. Bookings and leads: all hotel bookings carried out through the FindHotel BV online reservation platform during 2016, totalling 237,614 rows. This table included the following fields:

• IDs for booking and hotel

• Timestamp of the event creation

• Check-in and check-out dates

• Commission

• Provider code of the OTA to which the user was redirected

• Whether the booking was cancelled or not. This is the target feature, called ”provider cancelled”

• HotelsCombined's estimated probability of the booking being cancelled at the moment of booking, "cancellation rate"

• Currency

• Number of rooms

• Number of children and adults per room

• User's browser language

• Country code of client

• Type of device

• Device brand if mobile

• Browser

2. Hotels: updated information about the characteristics of more than 1.2 million hotels in the world:

• Hotel ID

• Name

• Local address

• Country


• Star rating

• Cleanliness, service, location, price, facilities, rooms and dining average ratings

• Global average rating

• Popularity as a rank for every hotel

• Date of creation

• Number of reviews

• Number of images

• Chain ID if the hotel is part of a chain

• Facilities information. This information includes sport facilities such as volleyball court, sailing boat rental, football field or horse riding; wellness facilities such as sauna, massage or spa, among others

• Minimum nightly rate

3. External data: some data was embedded into the final data set as part of the feature engineering stage, explained later in this section:

• Free visa origins: this data set contains one row per country. For each country, it specifies the number of countries that citizens of that country can travel to without needing a visa

• Country-to-country visa requirements: this data set contains one row for each origin and destination country permutation. Each row contains a boolean that represents whether or not a visa is required

3.2 Infrastructure

All the preprocessing and ML modelling were executed on cloud computing machines from Amazon Web Services (AWS), using the Elastic Compute Cloud (EC2) service. More specifically, we started c3.8xlarge EC2 instances, each of which provides 32 virtual CPU cores and 60 GB of RAM.

3.3 Research Question 1

3.3.1 Data preparation

As in any data science problem, data quality is fundamental and, in general, production data sources provide data with outliers, inconsistencies and null values. Therefore, a series of steps must be taken in order to ensure the quality of the model predictions.

The aforementioned data sources were joined together to produce the final dataset. The objective was to obtain a numerical data set with enough examples to compensate for the high dimensionality of the data, without missing information or duplicates, and with balanced categories for all the features.

First of all, an initial exploratory data analysis of the separate tables was done. This step allowed us to understand the data, its structure and distribution. We created visualizations such as bi-variate pairplot distributions,


Figure 1: Cancellation rates breakdown

a correlation matrix plot (figure 2) to find linear dependencies between the features, and a target variable breakdown into multiple dimensions related to hotel and booking information, such as figure 7, the OTA provider (figure 8), or the countries of origin and destination of the user (figures 9 and 10). Apart from allowing us to understand the data, we think this representation will be relevant to answer the second research question.

The data preparation pipeline is as follows. We discarded variables that contained null values in at least half of the examples. We also discarded examples which missed the booked hotel id. After this, we removed outliers and erroneous examples, which we established to be those bookings with a total length of stay longer than a month or shorter than 1 day, or whose total value was over 10,000 USD. Then, we deleted columns that contained the exact same values as another column. We filled in the remaining null or missing values with different techniques; most typically, we used the most common value in the case of categorical variables and the mean in the case of continuous variables. Regarding the categorical variables, we looked at the frequency distribution of each value and combined values having less than 5% frequency of total observations. Lastly, we applied one-hot encoding to binarize the categorical features.
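A minimal sketch of the pipeline just described, assuming a pandas DataFrame df holding the joined booking and hotel data; the column names (hotel_id, los, total_value_usd) are illustrative placeholders, not the exact production schema:

```python
import pandas as pd

def prepare(df: pd.DataFrame) -> pd.DataFrame:
    # Discard variables that are null in at least half of the examples.
    df = df.loc[:, df.isnull().mean() < 0.5]
    # Discard examples that miss the booked hotel id.
    df = df.dropna(subset=["hotel_id"])
    # Remove outliers: stays longer than a month or shorter than a day,
    # or bookings whose total value exceeds 10,000 USD.
    df = df[(df["los"] >= 1) & (df["los"] <= 31) & (df["total_value_usd"] <= 10_000)]
    # Delete columns that contain exactly the same values as another column.
    df = df.loc[:, ~df.T.duplicated()]
    # Impute remaining nulls: most common value for categoricals, mean otherwise.
    for col in df.columns:
        if df[col].dtype == "object":
            df[col] = df[col].fillna(df[col].mode()[0])
            # Merge categories with less than 5% frequency into one bucket.
            freq = df[col].value_counts(normalize=True)
            rare = freq[freq < 0.05].index
            df[col] = df[col].where(~df[col].isin(rare), "OTHER")
        else:
            df[col] = df[col].fillna(df[col].mean())
    # Binarize the categorical features with one-hot encoding.
    return pd.get_dummies(df)
```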


Figure 2: Correlation matrix

The final data set contains 48 features, of which 35 are continuous variables and 13 are categorical; simple statistics on the continuous variables can be seen in figure 1. After binarizing the categorical features (one-hot encoding), the number of final features is 284. The total number of examples is 237,614.

3.3.2 Baseline model

First of all, it is important to establish a baseline: a meaningful reference point against which to iteratively compare our results. Fortunately, FindHotel BV has access to the cancellation predictions produced by a partner company (HotelsCombined.com, HC), which used the same booking data used throughout this research. In order to create their booking cancellation predictions, they used data mining techniques: first a feature selection step, where they kept what they considered to be the most relevant features (the OTA provider and the origin and destination countries); then they grouped the data by those features and averaged the historical cancellation rate for each combination. For each corresponding combination of features in the data set for which predictions are generated, the computed probability is applied directly. This feature is called "cancellation rate". In figure 3 we show the binary class classification plots, the ROC curve and the AUC score. The AUC score is 0.61 which, as seen in table 1, is a "fair" result, although very close to being "poor". The class separation is almost imperceptible in the plot. This is corroborated by the F1-score, which is 0.07 and thus really low, knowing that the class distribution is around 20:80.
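The following sketch illustrates the idea behind this baseline: group the historical bookings by OTA provider and origin/destination country, and apply each group's average cancellation rate as the prediction. The column names are assumptions for illustration.

```python
import pandas as pd

KEYS = ["provider_code", "origin_country", "destination_country"]

def fit_baseline(history: pd.DataFrame) -> pd.Series:
    # Average historical cancellation rate per feature combination.
    return history.groupby(KEYS)["provider_cancelled"].mean()

def predict_baseline(rates: pd.Series, bookings: pd.DataFrame) -> pd.Series:
    # Look up each booking's group rate; unseen combinations fall back
    # to the global average cancellation rate.
    joined = bookings.join(rates.rename("cancellation_rate"), on=KEYS)
    return joined["cancellation_rate"].fillna(rates.mean())
```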


Figure 3: ROC AUC of HC’s data mining model

Apart from using the cancellation probability predicted by HC as our baseline model, we embed this value as one of the features in our data set.

3.3.3 Learning algorithms

The learning algorithms chosen for this experimental comparison should be able to provide feature importances and predicted class probabilities. After researching the performance of algorithms that fulfilled these conditions, we decided to run the analysis and experiments on the following four algorithms:

1. Random Forest (RF)
2. LightGBM
3. Gradient Boosting Trees (GBT)
4. Extreme Gradient Boosting (XGBoost)

Ensemble methods Ensemble methods are supervised learning algorithms which combine multiple inferences to form a global inference with more suitable properties, such as less sensitivity to noise or outliers, or lower variance [8]. There are two types of ensemble methods: boosting methods and averaging methods. In boosting methods, initial estimators are built sequentially and each one tries to reduce the bias of the combined estimator. The idea is to combine several weak models to generate a more powerful ensemble model (e.g., AdaBoost and Gradient Tree Boosting). In averaging methods, on the other hand, several estimators are built and their predictions averaged; this can be seen as voting based on averaged weak predictions. The combined estimator is usually better than any of the fundamental estimators, since its variance is reduced (e.g., bagging methods and forests of randomized trees). The implementations chosen are scikit-learn's Gradient Tree Boosting and Random Forest, dmlc's Extreme Gradient Boosting and Microsoft's LightGBM.
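As a minimal sketch, the four implementations named above could be instantiated as follows; the hyperparameter values are illustrative, not the tuned settings reported in the appendix.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

models = {
    "Random Forest": RandomForestClassifier(n_estimators=700, max_depth=15),
    "Gradient Boosting Trees": GradientBoostingClassifier(n_estimators=300, learning_rate=0.02),
    "XGBoost": XGBClassifier(n_estimators=500, learning_rate=0.02, max_depth=6),
    "LightGBM": LGBMClassifier(n_estimators=500, learning_rate=0.02),
}
```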

There are many other differences between averaging methods such as RF and boosting methods like GBT, XGBoost and LightGBM. Boosting methods


have better performance, but both have similar fitting speed. RFs are more robust to overfitting and require fewer precautions to avoid it: RF needs just one fundamental hyperparameter to be set, the number of features to be selected at each node, and even then we can choose the rule of thumb of using the square root of the total number of features in the dataset, which typically works well [25]. On the other hand, boosting methods have several hyperparameters, including the number of trees, the depth (or number of leaves), the bagging fraction, the subsampling rates and the learning rate, among others. This makes cross-validation vital to avoid overfitting in boosting models. RF can be understood as boosted trees with a dropout rate of 1; neither 0 nor 1 is the optimal dropout rate [16].

Generally, a well-tuned boosting algorithm will outperform a RF. RF has traditionally been easier to parallelize, but recent algorithms such as XGBoost and LightGBM have been optimized in terms of memory usage, efficiency and GPU capabilities. Whereas LightGBM splits the tree leaf-wise on the best fit, other boosting algorithms such as XGBoost split the tree depth-wise or level-wise rather than leaf-wise. The leaf-wise algorithm can reduce the loss more by growing more complex trees than the level-wise algorithm, and hence results in much better accuracy. LightGBM also applies a histogram-based transformation to the continuous variables, creating discrete bins from continuous values and speeding up the training process. The downside of this algorithm is that it is more prone to overfitting than other boosting algorithms, which must be counteracted by tuning the parameter that sets the maximum depth of the weak learners (trees).

All in all, we decided to use these algorithms instead of other models such as logistic regression, K-Nearest Neighbors or Support Vector Machines (SVM) because the latter can hardly handle categorical features, and because feature importances cannot be retrieved from an SVM unless the kernel used is linear. Naive Bayes (NB) classifiers assume that attributes are conditionally independent of each other given the class, and can only have linear, elliptic or parabolic decision boundaries, which makes them very inflexible classifiers. SVMs are very slow to train and can take enormous amounts of time to tune, especially when it comes to finding the correct kernel function. Last but not least, the models mentioned in this paragraph are unable to output by themselves which features are important, i.e. the ones that leverage the model's predictive power. This is a big drawback for our research, given our second research question, and thus we do not include them in the model comparison.

3.3.4 Hyperparameter tuning

Hyperparameter tuning is an optimization problem where we find the hyperparameter setting that maximizes the performance of the model on a validation set. However, the mapping from the hyperparameter space to the performance on the validation set cannot be described mathematically by a formula, hence we do not know the derivatives of this function, i.e. it is a black-box function. Therefore, optimization techniques such as Newton's method or gradient descent cannot be applied.


There are three main methods to sweep the hyperparameter space: Grid Search, Random Search and Bayesian Optimization. In the first, the hyperparameter space is swept in a brute-force fashion, meaning that every possible option is evaluated and, in the end, the one that maximizes the model performance is chosen. For the second, [13] demonstrates that by navigating the grid of hyperparameters randomly, performance similar to a full grid search is obtained; surprisingly, this works even when the close-to-optimal region occupies only around 5% of the search space. In both the first and the second approach, different configurations are evaluated blindly: the next trial is independent of all the trials done before. This means that the process is easily parallelizable, at the cost of not being able to exploit historical evaluation results. The third approach, Bayesian Optimization, is an adaptive approach that trades off between exploring new areas of the parameter space and exploiting historical information to find the parameters that maximize the function quickly. We use this option due to its balance between speed and performance, through a library called BayesianOptimization, available for Python at https://github.com/fmfn/BayesianOptimization/.
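Below is a sketch of how this library can drive the tuning of, for example, LightGBM; the search bounds and the synthetic data are assumptions for illustration.

```python
from bayes_opt import BayesianOptimization
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared data set, with a ~20:80 class ratio.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.8], random_state=0)

def cv_auc(num_leaves, learning_rate, subsample):
    # Objective: mean cross-validated AUC for one hyperparameter setting.
    model = LGBMClassifier(num_leaves=int(num_leaves), learning_rate=learning_rate,
                           subsample=subsample, n_estimators=200)
    return cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()

optimizer = BayesianOptimization(
    f=cv_auc,
    pbounds={"num_leaves": (16, 256), "learning_rate": (0.01, 0.2), "subsample": (0.5, 1.0)},
    random_state=42,
)
optimizer.maximize(init_points=5, n_iter=25)  # explore first, then exploit
print(optimizer.max)  # best AUC found and the corresponding hyperparameters
```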

3.3.5 Evaluation Framework

The task being solved is a binary classification problem where the possible outcomes are 0 if no cancellation happens or 1 if a cancellation happens. Many classifiers, like the ones we are comparing, are also able to quantify their uncertainty by outputting a probability value. This capability is very relevant in our case because it allows users to make decisions based on the confidence of the predictions and not only on the predicted class.

The main scoring metrics that we use are Area Under the Curve (AUC) and F1-score. Other evaluation methods, such as accuracy, precision, recall, support, ROC curves, confusion matrices, calibration curves, learning curves and binary class separation plots, will also be shown in the evaluation section to compare the four chosen models. It is very important to avoid evaluating metrics on the holdout set more than once. This means that the model will be evaluated on fresh data to which it has never been exposed, and to which it will be exposed only once, to avoid overfitting on the holdout set [5].

Calibration curves Well-calibrated classifiers are probabilistic classifiers for which the predicted probability can be directly interpreted as a confidence level. For instance, a well-calibrated binary classifier should classify the samples such that, among the samples to which it gave a predicted probability close to 0.8, approximately 80% actually belong to the positive class.
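This check can be sketched with scikit-learn's calibration_curve, assuming y_holdout holds the true labels and proba_holdout the predicted probabilities on the holdout set:

```python
from sklearn.calibration import calibration_curve

# Bin the predicted probabilities and compare, per bin, the observed
# fraction of positives against the mean predicted probability.
frac_positives, mean_predicted = calibration_curve(y_holdout, proba_holdout, n_bins=10)
# For a well-calibrated model both arrays lie close to the diagonal:
# bins predicted at ~0.8 should contain ~80% actual positives.
```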

ROC AUC is the de facto standard metric for evaluating classifiers under class imbalance [3], as in our case, where the minority class represents around 20% of the data. This is due to the fact that it is independent of both the selected threshold and the prior probabilities, as well as offering a single number to compare classifiers [28].


Table 1: AUC value interpretation

AUC value    Model performance
0.5 - 0.6    Poor
0.6 - 0.7    Fair
0.7 - 0.8    Good
0.8 - 0.9    Very good
0.9 - 1.0    Excellent

F1-score [28] attempts to measure the trade-off between precision and recall by producing a single value that reflects the goodness of a classifier in the presence of rare classes. While ROC curves represent the trade-off between different TPRs (True Positive Rates) and FPRs (False Positive Rates), the F1 score represents the trade-off among different values of TP (True Positives), FP (False Positives) and FN (False Negatives).

AUC is a summary indicator of the ROC curve that condenses the performance of a classifier into a single metric. Unlike the difficulties encountered when comparing different ROC curves, the AUC outputs a numerical score, which allows us to sort models by overall performance. The AUC has an important statistical property: the AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance [9]. In practice, the value of AUC varies between 0.5 and 1; [2] suggests a scale for the interpretation of the AUC value, which can be seen in table 1.

Cross-validation is used to evaluate the performance of each of the algorithms, specifically stratified k-fold cross-validation, a widely used model assessment technique [11]. Although cross-validation can be computationally costly (Smola & Vishwanathan, 2010), it helps avoid overfitting so that the model can generalize well when predicting unseen examples. Stratified k-fold cross-validation works by randomly partitioning the sample data into k equally sized subsamples while maintaining the original class proportions in each subsample. In this research, a holdout subsample was created and the rest of the data was divided into 5 folds, a typical number of folds ([11], Smola & Vishwanathan, 2010). Then, each of the 5 folds was used as a test set with the remaining 4 as training data. Performance measures were calculated for each of the 5 folds, and their mean and standard deviation were used to assess the global performance of each algorithm. The final evaluation of each model was done on the holdout set, as this is the only data that the model has not yet seen.
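A sketch of this protocol, assuming a prepared feature matrix X and label vector y (synthetic data is used here so the snippet runs standalone):

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=5000, weights=[0.8], random_state=0)

# Carve out a holdout set first; it is scored exactly once, at the very end.
X_dev, X_holdout, y_dev, y_holdout = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Stratified 5-fold cross-validation on the remaining data.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LGBMClassifier(), X_dev, y_dev, cv=cv, scoring="roc_auc")
print(f"AUC: {scores.mean():.4f} +/- {scores.std():.4f}")
```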

Learning curves Additionally, the learning process described previously was repeated with various numbers of training samples, and the mean and standard deviation were plotted at each point for every classifier. This is known as a learning curve, and it is useful for finding out how much we benefit from adding more training data and whether the estimator suffers more from a variance error or a bias error.
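A sketch of producing such curves with scikit-learn's learning_curve helper, again on synthetic stand-in data:

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=5000, weights=[0.8], random_state=0)

# Evaluate the model at increasing training-set sizes with 5-fold CV.
sizes, train_scores, val_scores = learning_curve(
    LGBMClassifier(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5, scoring="roc_auc")
# Plotting mean +/- std of val_scores against sizes shows whether more
# data still helps (variance error) or the curve has flattened (bias error).
```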


Bias detection Regarding bias avoidance, it is very important that the model does not make predictions with a high bias towards the most common values of the most important features. In other words, the model might minimize the predictive loss by learning to predict well only on the most common subsets of data. We want to make sure that the model's probability error is similar given any breakdown along any dimension. To evaluate whether this happens, we show plots of the error distribution to visually assess possible bias, as well as AUC and log-loss comparison tables for each dimension against the baseline. We consider it an excellent improvement over the baseline if the AUC for every category along every dimension is higher than the baseline's best AUC score for any category along any dimension.
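The per-category check can be sketched as below, assuming a results DataFrame with the true label, the predicted probability and one breakdown column; all names are illustrative.

```python
from sklearn.metrics import roc_auc_score

def auc_per_category(results, dimension, top_n=3):
    # Top categories by booking volume within the breakdown dimension.
    top = results[dimension].value_counts().head(top_n).index
    return {
        cat: roc_auc_score(results.loc[results[dimension] == cat, "provider_cancelled"],
                           results.loc[results[dimension] == cat, "predicted_proba"])
        for cat in top
    }

# e.g. auc_per_category(results, "provider_code") for the OTA provider dimension
```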

Speed performance Finally, since our ML models need to operate in real time in large production environments, speed is a relevant characteristic in this scenario. Training should complete in under 1 hour if hyperparameters need not be re-tuned; prediction for a new example should take under a tenth of a millisecond. Thus, we also measure the runtime of all the stages: model hyperparameter optimization, fitting and predicting.

3.3.6 Deployment

Although the deployment of these models in a production environment was not in the scope of this research, the way they are deployed is critical to their success. For this reason, the elaboration of a framework defining how models are deployed is also an important part of this research.

As represented in figure 4, the whole pipeline was segmented into the following Python scripts:

1. Data preparation: contains the whole data munging pipeline explained at the beginning of this section.

2. Model train: calls data preparation to generate the data set that will be used to train the model. This script contains the model tuning, fitting and evaluation stages. The trained model is then uploaded to a bucket in AWS S3.

3. Model predict: calls data preparation to generate the data set that will be used to predict the missing target value. This script loads the previously fitted model, generates the predictions and their probabilities, and uploads the results to AWS S3 in the correct format. AWS RedShift will then integrate the results in a database table, making them accessible to anyone internally.

The final application was packed into a Docker container. This container was built and pushed to AWS ECR (EC2 Container Registry). As can be seen in figure 5, we decided to use AWS Batch to dynamically


Figure 4: Application workflow

provision the optimal quantity and type of compute resources to run the application. Jenkins is in charge of creating a daily AWS Batch job at a fixed time of day.

Figure 5: Deployment diagram

3.4 Research Question 2

There are a few ways to estimate the importance of the variables involved in a tree-based model. The most typical are split-based and gain-based measurements over all the trees in the model. The former refers to the number of splits taken on a feature or feature interaction, weighted by the probability of the splits taking place. The latter measures the total gain of each feature or feature interaction. An example can be seen in figure 6. Other configurations consider the depth at which a feature is used as a decision node in a tree to assess the relative importance of that feature with respect to the predictability of the target variable. Features used at the top of the tree contribute to the final prediction decision of a larger fraction of the input samples; the expected fraction of the samples they contribute to can thus be used as an estimate of the relative importance of the features [26]. By averaging those expected activity rates over several randomized trees, one can reduce the variance of such an estimate and use it for feature importance estimation.
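The two measurements can be contrasted directly on a trained LightGBM model, as sketched below (synthetic data for illustration):

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, weights=[0.8], random_state=0)
model = LGBMClassifier().fit(X, y)

# 'split': how many times each feature is used to split a node.
split_imp = model.booster_.feature_importance(importance_type="split")
# 'gain': total loss reduction contributed by each feature's splits.
gain_imp = model.booster_.feature_importance(importance_type="gain")
# The rankings can disagree: a feature split on often but with small
# gains ranks high on 'split' and low on 'gain'.
```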


Figure 6: Feature importance measurement techniques for tree-based models

As mentioned in the data preparation subsection, an exploratory data analysis was carried out to better understand the data and the relation of the features with our target variable. By intuitively examining figures 7 and 8, the correlation between any pair of variables can reveal dependence, causal or not.

Practically, to find the most informative features, we first use the ad hoc methods integrated in scikit-learn for RF and GBT. For LightGBM we use the embedded function that, by default, employs weighted split measurements. In the case of XGBoost, we use an external library called xgbfir [17], which is based on xgbfi [22]. This library outputs a complete spreadsheet with several measurements (gain, number of splits taken with and without weighting, normalized, expected and averaged gain, tree index and tree depth, among others). These measures are calculated not only for individual features but also for every possible double and triple combination (interaction) of features.
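A usage sketch of xgbfir follows; the exact call signature may vary between versions, so treat it as an assumption-laden illustration rather than the precise invocation used in this research.

```python
import xgbfir
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, weights=[0.8], random_state=0)
model = XGBClassifier().fit(X, y)

# Writes a spreadsheet with gain, FScore, weighted FScore and their averages
# for single features and for 2- and 3-feature interactions.
xgbfir.saveXgbFI(model, OutputXlsxFile="xgb_feature_interactions.xlsx")
```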

It is worth mentioning that we are missing features that, from basic domain knowledge and as previous research suggests, would be meaningful in terms of predictive power. To enumerate a few examples, our data set lacks information such as the refund policy applied to the cancelled booking, the date the cancellation took place, and historical data about the user generating the booking, such as previous bookings and historical cancellations. We concluded that this lack would have only a slight impact on our final results.


Figure 7: Cancellation rates breakdown

Figure 8: Cancellation rate by OTA provider


Figure 9: Cancellation rate by origin country

Figure 10: Cancellation rate by destination country

4 Results and Analysis

4.1 RQ1

Our first research question is: "Are machine learning models suitable tools for predicting and understanding hotel booking cancellations?" For the reasons explained in section 3, we decided to employ ensemble machine learning models to answer this first research question.

We first comment on the results of the calibration evaluation. Figure 11 shows that the probabilistic predictions of the different classifiers are all well calibrated, as their curves are nearly diagonal. The results seem to validate the findings from [1], which states that "bagged decision trees are well-calibrated models". And, given that bagged trees are well calibrated, we can deduce that regular decision trees are also well calibrated on average, in the sense that if decision trees are trained on different samples of the data and their predictions averaged, the average will be well calibrated.

Regarding learning curves, we can see in figure 12 that the training score starts to converge when using one third of the data. This means that adding more training data will only slightly increase the model's AUC score, and thus the amount of data that we used is enough to guarantee the model's stability.

As mentioned in the previous section, the baseline comes from the predictions of HC. We computed the ROC AUC of their predictions, which can be seen in figure 3, scoring 0.61. Looking at figure 13 and table 2, it is clear that the four machine learning models chosen outperform the data-mining-based model used as our baseline. RF, LightGBM, XGBoost and GBT obtained scores of around 0.74 AUC, which is more than 20% higher than the baseline.

The results also seem to validate the findings of [18]. These authors tested 179 classifiers from 17 families and concluded that the best results are usually obtained with the random forest family of algorithms.

To check for the absence of bias, we selected the best performing model in terms of AUC (XGBoost). We first calculated the error (deviation from the target's


Figure 11: Calibration curves

Figure 12: Learning curves with different train sizes

ground truth) of the predicted probabilities for both XGBoost and the baseline model. Next, we measured the AUC and log-loss along four dimensions: OTA provider, mobile phone device brand, origin country and destination country. In figures 14, 16, 18 and 20 we show the distribution of the aforementioned errors for the top 3 categories of each of those dimensions, selected by ranking the volume of bookings. We can visually confirm the similarity of the error probability distributions of all the categories within a given dimension.


Figure 13: ROC AUC of all the models

Table 2: Model performance

Model                    AUC     F1-Score  Accuracy  Precision  Recall
LightGBM                 0.7401  0.7724    0.8229    0.6737     0.1359
Gradient Boosted Trees   0.7417  0.7788    0.8241    0.6560     0.1566
XGBoost                  0.7453  0.7785    0.8246    0.6588     0.1555
Random Forest            0.7389  0.7745    0.8253    0.7223     0.1457

We then computed the AUC and log-loss over the same dimensions as before, as can be seen in figures 15, 17, 19 and 21. The evaluation yields excellent results: all the categories have an AUC higher than the baseline's best AUC score (0.65) for any category in any given dimension.

Figure 14: Error of probabilities. Provider breakdown histogram

Speed performance The final set of experiments concerned speed performance. As we can see in table 3, there is a big difference between the four algorithms studied. The hyperparameter tuning stage is composed of thousands of fitting trials, so its duration is linearly related to the training time. The results show a clear advantage in using LightGBM over the other algorithms, as it is more than five times faster than XGBoost, 20 times faster than GBT and 30 times faster than RF.


Figure 15: Error of probabilities. Provider breakdown

Figure 16: Error of probabilities. Device breakdown histogram

Figure 17: Error of probabilities. Device breakdown

Figure 18: Error of probabilities. Origin country breakdown histogram

4.2 RQ2

Our second research question is: "Which are the most informative features in hotel booking data for unlocking better predictions?" We answer this question both from the data exploration and the model analysis perspectives.

From the exploratory data analysis done at the very beginning of this research, we discovered that there was a slight correlation between the target value


Figure 19: Error of probabilities. Origin country breakdown

Figure 20: Error of probabilities. Destination country breakdown histogram

Figure 21: Error of probabilities. Destination country breakdown

(user cancelled) and, in order of correlation power:

1. Average cancellation rate of the origin country
2. Average cancellation rate of the destination country
3. Whether a visa is required for citizens of the country of origin to travel to the destination country
4. Length of stay
5. Average cancellation rate of the OTA provider
6. Baseline's estimation
7. Number of additional bookings created in the same week by the same user

The previous findings can be seen in the correlation matrix plot in section 3. Likewise, figure 1 shows the linear relation to the number of cancellations for


Table 3: Speed performance

Algorithm                   Hyperparameter optimization   Training
Extreme Gradient Boosting   122 minutes                   128 seconds
Gradient Boosting Machines  238 minutes                   413 seconds
Random Forest               732 minutes                   691 seconds
LightGBM                    25 minutes                    23 seconds

the aforementioned variables.

After the model training stage, we can extract the individual feature importances. This has a very clear advantage over finding linear relations before training the model: the model exposes the "explanatory score" of every feature in the data set. Figure 22 shows how the four models score the importance of every feature. Although, in general, the four models agree on the importance of every feature, LightGBM shows a slight deviation in behaviour. In order of average feature importance:

1. Country of origin
2. Destination country
3. Baseline's cancellation rate
4. Night price
5. Hotel number of reviews
6. Hotel popularity
7. Day distance between booking and check-in

Figure 22: Feature importance comparison

As explained in the methodology section, the Python package xgbfir allows us to map the feature importances from the XGBoost model. It allows us to


measure not only single feature importances, but also 1-way and 2-way feature interaction importances. The top 10 features for the 0-way, 1-way and 2-way feature interaction mappings can be seen in tables 8, 9 and 10 in the appendix.

5 Conclusions

By using data from a meta-search engine company that serves hotel offers from more than 100 OTAs and more than 1 million hotels, in conjunction with the application of data science skills such as data visualization, data mining and machine learning, it was possible to answer the two main research questions of this research:

1. We were able to extract n-way feature interactions from the trained models, which allowed us to understand the features' predictive relevance. It was found that the most important features were origin and destination country, lead and booking day distance to check-in, length of stay and number of additional bookings made close in time. We also identified the number of examples needed to build a good classifier, which starts at 25,000 examples and grows slightly but steadily up to the total number of examples at our disposal.

2. We succeeded in finding a model to classify bookings that allows us to predict not only the class (cancelled / not cancelled) but also the probability of that class being the correct one (confidence). Of all the models tested, we discovered that the best performing ones were the boosted trees. The best score was achieved by Extreme Gradient Boosting with an AUC of 0.7453, and the best balance between accuracy, capabilities and speed was achieved by LightGBM, with an AUC of 0.74 and a total runtime 4 times faster than the rest.

5.1 Limitations and Future Research

Although this research was carefully prepared, we are aware of its limitations and shortcomings.

First of all, the research was conducted without what, in our opinion, is important information. From domain knowledge and previous research, the cancellation policy applied to the booking is a very important feature that would most probably lead to a higher AUC score. It would also be interesting to investigate the predictive power of features such as personal information about the user (age, sex, profession, civil status, number of children, annual income, etc.) and historical bookings from unique users, if they were uniquely identified by using, for example, a login system.

Secondly, the number of examples used was around 240,000. Having a larger amount of data available may lead to a higher generalization ability. Apart from potentially fitting unseen instances even better, we could exploit temporal patterns if we had at least two years of data. In this sense, new time-based features could be engineered to capture time patterns.


5.2 Acknowledgements

This work was carried out during the first semester of 2017 at the University of Amsterdam and FindHotel.

It is a pleasure to thank those who made this thesis possible. I am grateful to my internal supervisor, Professor Evangelos Kanoulas, PhD, for his support and optimism, and to my external supervisor, Germán Gómez-Herrero, PhD, Lead of Data Science at FindHotel, for his continuous guidance and motivation. I would like to deeply thank my loved ones for their encouragement and for helping me be a better person every day with their unconditional love, guidance and care.

References

[1] A. Niculescu-Mizil and R. Caruana. Predicting good probabilities with supervised learning, 2005.

[2] J. Allaire. Introduction à l'analyse ROC (Receiver Operating Characteristic), 2006.

[3] A. P. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms, 1997.

[4] C. Chen, Z. Schwartz and P. Vargas. The search for the best deal: How hotel cancellation policies affect the search and booking decisions of deal-seeking customers, 2011.

[5] C. Dwork, C. Feldman and M. Hardt. Generalization in adaptive data analysis and holdout reuse, 2015.

[6] D. C. Iliescu, L. A. Garrow and R. A. Parker. A hazard model of US airline passengers' refund and exchange behavior, 2008.

[7] P. Delgado. Cancellations on Booking.com: 104% more than on the hotel website; Expedia, 31% more. https://www.mirai.com/blog/cancellations-on-booking-com-104-more-than-on-the-hotel-website-expedia-31-more/, 2016.

[8] Dietterich, Friedman and Ridgeway. Predicting good probabilities with supervised learning, 2000.

[9] T. Fawcett. An introduction to ROC analysis, 2006.

[10] H. Huang, A. Chang and C. Ho. Using artificial neural networks to establish a customer-cancellation prediction model, 2013.

[11] T. Hastie, R. Tibshirani and J. Friedman. The Elements of Statistical Learning, 2001.

[12] S. Ivanov. Management of overbookings in the hotel industry - basic concepts and practices.

[13] J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization, 2012.

[14] J. Lindenmeier and D. K. Tscheulin. The effects of inventory control and denied boarding on customer satisfaction: the case of capacity-based airline revenue management, 2008.

[15] J. Xie and E. Gerstner. Service escape: Profiting from customer cancellations, 2007.

[16] K. V. Rashmi and R. Gilad-Bachrach. DART: Dropouts meet multiple additive regression trees, 2015.

[17] B. Kostenko. xgbfir. https://github.com/limexp/xgbfir, 2016.

[18] M. Fernandez-Delgado, E. Cernadas, S. Barro and D. Amorim. Do we need hundreds of classifiers to solve real world classification problems?, 2014.

[19] M. Rajopadhye, M. Ghalia, P. Wang et al. Forecasting uncertain hotel room demand, 2001.

[20] Mehrotra and Ruttley. Revenue management, 2006.

[21] Morales and Wang. Predicting good probabilities with supervised learning, 2009.

[22] M. Muller. xgbfi. https://github.com/Far0n/xgbfi, 2015.

[23] N. Phumchusri and P. Maneesophon. Optimal overbooking decision for hotel rooms revenue management, 2014.

[24] N. Antonio, A. de Almeida and L. Nunes. Predicting hotel booking cancellations to decrease uncertainty and increase revenue, 2017.

[25] S. Bernard, L. Heutte and S. Adam. Influence of hyperparameters on random forest accuracy, 2012.

[26] scikit-learn. Ensemble methods. http://scikit-learn.org/stable/modules/ensemble.html, 2016.

[27] Suwandi. International Conference on Biotechnology and Environment Management proceedings, 2016.

[28] T. R. Hoens and N. Chawla. Imbalanced datasets: from sampling to classifiers, 2013.

[29] K. Talluri and G. Van Ryzin. The Theory and Practice of Revenue Management, 2004.

[30] W. Wu and Y. Wu. The effects of product scarcity and consumer's need for uniqueness on purchase intention, 2012.


A Appendix

A.0.1 Gradient Boosting Machines

Table 4: Gradient Boosted Trees confusion matrix

            Predicted 1   Predicted 0
Actual 1        62326          1432
Actual 0        12360          2295

The best parameters found for this model were:

1. max_depth = 7
2. learning_rate = 0.02
3. n_estimators = 312
4. max_features = 93
5. min_samples_leaf = 1
6. min_samples_split = 5
7. min_weight_fraction_leaf = 0.0
8. subsample = 0.926813

A.0.2 Random Forest

Table 5: Random Forest confusion matrix

            Predicted 1   Predicted 0
Actual 1        62766           992
Actual 0        12708          1947

The best parameters found for this model were:

1. max_depth = 15
2. n_estimators = 700
3. max_features = 52
4. min_samples_leaf = 5


A.0.3 Extreme Gradient Boosting

Table 6: XGBoost confusion matrix

            Predicted 1   Predicted 0
Actual 1        62399          1359
Actual 0        12398          2257

The best parameters found for this model were:

1. max_depth = 6
2. learning_rate = 0.02
3. n_estimators = 538
4. gamma = 0
5. min_child_weight = 5
6. subsample = 0.701741
7. colsample_bytree = 1.0

A.0.4 LightGBM

Table 7: LightGBM confusion matrix

            Predicted 1   Predicted 0
Actual 1        62609          1149
Actual 0        12736          1919


Table 8: XGBoost feature importances

Interaction                                 Gain Rank  Gain         FScore  wFScore      Avg wFScore  Avg Gain
cancellation_rate_all_time_g_org_country    1          54982.2537   344     84.29689613  0.245049117  159.8321328
book_dd                                     2          33498.9194   443     89.61528576  0.202291841  75.61832821
lead_dd                                     3          25693.91545  285     48.82575721  0.171318446  90.15408929
weekly_bookings                             4          18475.35284  414     57.66910786  0.139297362  44.62645614
los                                         5          17909.85855  327     34.37278526  0.105115551  54.77020963
cancellation_rate_all_time_provider_code    6          16607.94113  323     50.0077542   0.154822768  51.4177744
hc_cancellation_rate                        7          14524.82824  360     39.98099885  0.11105833   40.3467451
cancellation_rate_all_time_g_dest_country   8          12226.27766  239     36.44848328  0.152504114  51.15597347
cancellation_rate                           9          9999.000952  377     21.31053913  0.056526629  26.52254894
number_of_reviews                           10         7714.39744   299     19.28631295  0.064502719  25.80066033


Table 9: XGBoost first-order interaction feature importances

Interaction                                                     Gain Rank  Gain         FScore  wFScore      Avg wFScore  Avg Gain
cancellation_rate_all_time_g_org_country x lead_dd              1          78433.21424  102     34.52304973  0.338461272  768.9530808
book_dd x cancellation_rate_all_time_g_org_country              2          64747.15538  136     34.35441464  0.25260599   476.0820249
book_dd x cancellation_rate_all_time_g_dest_country             3          20834.67263  76      21.68223404  0.285292553  274.1404293
book_dd x weekly_bookings                                       4          20736.2626   140     25.51516532  0.182251181  148.1161614
cancellation_rate x cancellation_rate_all_time_provider_code    5          20471.30696  216     7.184444375  0.033261317  94.77456926
cancellation_rate_all_time_g_org_country x los                  6          20206.18538  87      5.48259373   0.063018319  232.2550044
book_dd x cancellation_rate_all_time_provider_code              7          18970.53536  107     24.27641813  0.226882412  177.294723
cancellation_rate_all_time_provider_code x lead_dd              8          17432.25208  65      14.90753325  0.229346665  268.1884935
los x weekly_bookings                                           9          16777.99474  118     11.78570783  0.09987888   142.1863961
book_dd x hc_cancellation_rate                                  10         14954.65569  82      16.87086591  0.205742267  182.3738499


Table 10: XGBoost second-order interaction feature importances

Interaction                                                                                      Gain Rank  FScore  wFScore      Avg wFScore  Avg Gain
cancellation_rate_all_time_g_org_country x lead_dd x weekly_bookings                             1          37      10.36703427  0.280190115  1010.6337
cancellation_rate_all_time_g_org_country x cancellation_rate_all_time_provider_code x lead_dd    2          39      9.421747687  0.241583274  940.9498103
cancellation_rate_all_time_g_org_country x cancellation_rate_all_time_g_org_country x lead_dd    3          22      2.871661565  0.130530071  1647.089468
book_dd x cancellation_rate_all_time_g_org_country x los                                         4          46      4.876379462  0.106008249  713.1712174
book_dd x cancellation_rate_all_time_g_org_country x weekly_bookings                             5          57      7.706196947  0.135196438  524.2917096
cancellation_rate_all_time_g_org_country x lead_dd x los                                         6          36      2.699005531  0.074972376  800.5262444
book_dd x cancellation_rate_all_time_g_org_country x cancellation_rate_all_time_provider_code    7          47      12.23755593  0.26037353   566.631867
book_dd x cancellation_rate_all_time_g_org_country x cancellation_rate_all_time_g_org_country    8          20      2.992452369  0.149622618  1316.763645
cancellation_rate x cancellation_rate_all_time_provider_code x lead_dd                           9          79      2.743656114  0.034729824  277.7251942
book_dd x cancellation_rate x cancellation_rate_all_time_provider_code                           10         96      2.302519356  0.023984577  223.3654894
