Sales and revenue prediction: a classifier based approach


Sales and Revenue Prediction

A classifier based approach

SUBMITTED IN PARTIAL FULFILLMENT FOR THE DEGREE OF MASTER OF SCIENCE

Lance Gurney

11554509

Master Information Studies

Data Science

Faculty of Science

University of Amsterdam

11th July 2019

1st Examiner: Prof. Ger Koole, Department of Mathematics

2nd Examiner: Dr. Qing Chen Wang, Department of Mathematics

3rd Examiner: Dr. Valeria Krzhizhanovskaya, FNWI


Author

Lance Gurney

Universiteit van Amsterdam

lance.gurney@gmail.com

Supervisor

Ger Koole

Vrije Universiteit Amsterdam

ger.koole@vu.nl

ABSTRACT

In this thesis, we first compare the performance of several machine learning models in predicting the sale of tickets on the online second-hand ticket platform TicketSwap (www.ticketswap.com).¹ We find that the Gradient Boosted Trees classifier outperforms all other models in this task, with an AUC score of 0.8213. However, there is a noticeable difference in both performance and behaviour when predicting sold vs. unsold tickets. Two possible causes of this difference are considered, namely imbalance in the dataset and non-uniformity of a certain feature of the tickets, but neither is found to explain it. It is possible that this difference is due to unknown features which are missing from the model.

We then derive from these predictive models new models for the prediction of total sales and revenue generated by all tickets for a single event. We find that the model derived from the Gradient Boosted Trees classifier outperforms all others in both problems, but that similar performance is offered by a Gradient Boosted Trees model trained directly on the event data. For example, in the problem of predicting total sales the Gradient Boosted Trees derived model had an RMSE of 1.6704 while the directly trained Gradient Boosted Trees regression model had an RMSE of 1.9453.

Finally, using the Gradient Boosted Trees model we investigate the relationship between the price of a ticket and its probability of sale and expected revenue. We find that this relationship is somewhat unexpected: on average, raising the ticket price slightly results in almost no change in the probability of sale, and therefore in an increase in expected revenue. It follows that, as far as this model is concerned, most tickets on the TicketSwap platform are underpriced. We explain how this may not be an accurate reflection of reality, and give a possible explanation in terms of certain peculiarities of the marketplaces of events on the TicketSwap platform.

1 INTRODUCTION

TicketSwap is an online peer-to-peer marketplace providing a platform for users to buy and sell second-hand tickets for concerts, festivals, sporting events and other ticketed events. Sellers are free to set the prices of their tickets anywhere up to 20% above the original price, and tickets for each event are listed on a single page. Potential buyers, before purchasing a ticket, can register their interest in an event in order to be kept updated on which tickets are for sale and at which price.

As far as online marketplaces go, TicketSwap is somewhat unique in that, for a single event, there is no material difference between the tickets being sold save for their price. Moreover, both buyers and sellers can see all tickets for sale, the number of tickets which have already sold and the number of users who have registered interest in the event. There are, however, perhaps two similar cases. The first is stock markets, which are often traded online: individual shares in a given company are indistinguishable and buy and sell orders are visible to all parties involved. However, unlike shares, which have no expiry date, tickets to an event must be bought and used before a fixed time, after which they lose all value. The second is reverse auctions, where a company asks third parties to bid for contracts to supply products or services satisfying certain conditions. In this case there is often a time limit on the bids for the contract; however, competing bids are not visible to all parties and different bids are not distinguished solely by their price. The problem of predicting winning bids in reverse auctions using machine learning methods has previously been considered in [8] and [10].

¹ TicketSwap provided the author of this thesis with a sample dataset and limited technical help. The selection of the research questions and the conclusions they led to are the author’s responsibility.

In this study we propose to investigate and compare the effectiveness of several machine learning techniques for predicting ticket sales and expected revenue, both for individual tickets and in total for all tickets for a single event. Using these models we will also consider how these predicted quantities vary as functions of the price.

When making predictions of ticket sales, these predictions must be made at some time before the event (the lead time). This thesis focuses primarily on the prediction problem with a lead time of one week. This constraint was chosen because, in order for predictions to be useful, there must be a practical window of time before the event during which appropriate actions can be taken.

The first research question of this thesis is:

Research Question 1: Which machine learning models most effectively predict the probability of individual ticket sale with a lead time of seven days?

Of course, it is naive to completely disregard the temporal aspect of these markets and so, while we still only consider predictions with a lead time of seven days, we will also consider machine learning models which are trained on datasets containing information with variable lead times. The second part of this question is:

Research Question 1b: Does allowing for a variable lead time produce significantly more accurate individual ticket sale predictions compared to static lead time?

Although estimating the probability of the sale of a ticket is an important problem, it is often difficult and it can be impractical to act on such high volumes of information. Therefore, we will also consider the accuracy of the combined predictions made by the models of Research Question 1 in the two related problems of predicting total ticket sales and total revenue per event:

Research Question 2: Which of the models constructed in Research Question 1 most effectively predicts the total number of tickets sold and the revenue generated at the level of events?

Finally, one of the most important aspects of any market place is the relationship between the price of the good for sale and both the probability of sale and the expected revenue generated by its sale. An accurate understanding of this relationship allows one to identify possible sources of inefficiency and to then use this information to increase the expected revenue. The final question to be considered is:

Research Question 3: How do the probability of sale and expected revenue of individual tickets vary as functions of the price?

In the case of reverse auctions a similar approach was considered in [10]. In order to determine optimal bid pricing in certain reverse auctions, a Naive Bayes classifier was constructed to estimate the win probability of a bid, and the optimal bid prices, those which maximise expected revenue, were then determined by varying the bid prices.
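The idea of varying the price to maximise expected revenue can be sketched as follows. This is only an illustration: `sale_probability` is a hypothetical stand-in for a trained classifier (a made-up decreasing curve, not a model from this thesis or from [10]), and the candidate price range is likewise assumed.

```python
import math


def sale_probability(price):
    """Hypothetical stand-in for a trained classifier: the probability of
    sale falls smoothly as the price rises past 50."""
    return 1.0 / (1.0 + math.exp((price - 50.0) / 5.0))


def expected_revenue(price, probability):
    """Expected revenue of one item: price times probability of sale."""
    return price * probability(price)


def best_price(candidate_prices, probability):
    """Candidate price with the highest expected revenue."""
    return max(candidate_prices, key=lambda p: expected_revenue(p, probability))


candidates = list(range(30, 71))  # assumed price grid
best = best_price(candidates, sale_probability)
```

The maximiser lies strictly inside the grid here because a higher price is eventually offset by a lower probability of sale.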

2 THEORETICAL FRAMEWORK

This section outlines several machine learning techniques including those that will be used in this study, along with related issues and terminology.

The primary problem of this study concerns the prediction of the sale of tickets given various features of the individual ticket. Therefore, the problem at hand is one of classification of tickets into two classes: sold and unsold. The secondary problems are ones of regression: prediction of the total number of sold tickets and the expected revenue generated per event.

2.1 Machine Learning Techniques

While binary classification methods attempt to identify in which of two categories (positive or negative) a given observation belongs, many classification methods are probabilistic and instead output a probability, or confidence value, that the observation belongs to the positive class. Predictions with a confidence value above some threshold (usually taken to be 0.5) are classified as positive and those with values below the threshold are classified as negative. For the purposes of this study only probabilistic classification methods will be considered.

Logistic Regression. Logistic regression is a statistical model which uses the logistic function $(1 + e^{-x})^{-1}$ (which takes values in the interval (0, 1)) to predict the value of a binary variable $y$ using a linear combination $x$ of the values of one or more variables on which $y$ may depend.
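As a minimal sketch of the prediction step, the following applies the standard logistic function to a linear combination of feature values. The weights, bias and feature values are hypothetical and chosen only for illustration; in practice the coefficients are learned from data.

```python
import math


def logistic(x):
    """Standard logistic function 1 / (1 + exp(-x)), with values in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_probability(weights, bias, features):
    """Probability of the positive class for a single observation."""
    x = bias + sum(w * f for w, f in zip(weights, features))
    return logistic(x)


# Hypothetical coefficients for two features
p = predict_probability([0.8, -1.5], 0.2, [1.0, 0.4])
```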

Naive Bayes. The Naive Bayes classifier uses Bayes’ Theorem, relating the probability of two events occurring with their conditional probabilities, to predict the value of a discrete random variable, given observations of certain variables on which it depends. The word ‘naive’ in its title refers to the additional assumption that the observed variables are conditionally independent given the class. Such classifiers were first used for automatic text classification [12] but have found applications in other areas [5]. In particular, in [10] a Naive Bayes classifier was used to predict winning bids for a particular type of reverse auction contract.

Decision Tree Learning. Decision trees are a supervised machine learning algorithm which can be used for both classification and regression [2]. A decision tree attempts to find the value of a dependent variable (continuous or probabilistic) by successively applying a collection of yes-no decision rules based on the values of feature variables. Having moved through the decision tree and arrived at one of the leaves, the value of the dependent variable is estimated according to the corresponding values of the learned examples which follow the same path along the decision tree. There are several algorithms which can be used to construct decision trees and we shall not discuss them here. Individual decision trees do have some drawbacks. Indeed, they only allow for very simple interactions of the feature variables and can at times overfit.

Gradient Boosted Trees. Individual decision trees can be prone to overfitting and there are several ways to overcome this problem. One common way is to use ‘ensemble’ methods, which make use of many simple models (each of which is too simple to overfit) whose combination still has strong predictive power. One common ensemble method is gradient boosting [4], which applies not only to decision trees, although decision trees are the setting in which we shall use it in this study.

Given a family of predictive models, which can be used to predict the values of a dependent variable, and a differentiable loss function which measures the failure of the model to correctly predict the dependent variable, gradient boosting works by constructing a sequence of models, the first of which attempts to predict the desired variable. Each successive model is then constructed in such a way as to model the negative gradient of the associated loss function and is then combined with the previous model. When applied to a simple decision tree learning algorithm one obtains the Gradient Boosted Trees algorithm.
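The following sketch illustrates the boosting loop for the squared loss on one-dimensional data, using one-split trees (stumps) as the simple base models. It is an illustration of the general idea only, not the implementation used in this study; the function names, the learning rate and the number of rounds are assumptions, and the inputs must contain at least two distinct values.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree (stump) to the residuals by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # candidate split points
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Gradient boosting for the squared loss with stumps as base models."""
    initial = sum(ys) / len(ys)  # first model: a constant prediction
    stumps = []

    def predict(x):
        return initial + learning_rate * sum(s(x) for s in stumps)

    for _ in range(n_rounds):
        # For the loss (y - F(x))^2 / 2 the negative gradient with respect
        # to the prediction F(x) is the residual y - F(x).
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


# Toy data with a single jump between x = 3 and x = 4
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
model = gradient_boost(xs, ys, n_rounds=100)
```

Each round fits a stump to the current residuals and adds a damped copy of it to the ensemble, so the predictions move gradually towards the targets.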

2.1.1 Support Vector Machines. Support vector machines (SVM) are a class of supervised machine learning algorithms which can be used for both classification and regression problems. SVMs incorporate essentially arbitrary (non-linear) transformations of the feature variables and as such allow for very complex interactions among the feature variables. However, the freedom allowed by such transformations of the feature space often results in models with lengthy training times which are also prone to overfitting without extensive tuning of the hyperparameters.

2.1.2 Artificial Neural Networks. Artificial Neural Networks (ANN) are a machine learning technique inspired by the biological neural networks of living animals [1]. Roughly speaking, an artificial neural network consists of a sequence of layers, each of which linearly transforms a sequence of input variables and then applies a (non-linear) activation function, with the result serving as the input to the next layer.

ANNs have been shown to be successful in solving complex problems, both in classification and regression, and their popularity has increased over recent years due to advances in computational power and in the algorithms used to train them [14]. However, their highly customisable structure (the number and size of the layers is arbitrary) makes it difficult to determine which is the best configuration for a given problem and they often require enormous amounts of data to train effectively.

2.2 Class Imbalance

In binary classification problems, class imbalance refers to training datasets in which one class is significantly more represented than the other. Such imbalanced distributions can result in many standard algorithms placing a higher emphasis on correctly predicting the majority class while ignoring the minority class, thus leading to underperforming classifiers, especially when the minority class is of equal importance.

The standard way to overcome the class imbalance problem is to produce from the imbalanced training set, using one method or another, a balanced training set. This is done either by over-sampling the minority class or under-sampling the majority class. Under-sampling the majority class results in a natural dataset (all records are ‘real’). However, as the model is not able to learn from the entire dataset this can lead to underperformance.

Over-sampling the minority class can be done either naturally, by simply repeating elements in the data, or synthetically, by creating ‘new’ representatives of the minority class. Two common synthetic over-sampling techniques are SMOTE [3] and ADASYN [6].

SMOTE works by creating new examples of the minority class which lie on the line segments between ‘real’ minority class examples and their nearest minority class neighbours. ADASYN works in a similar way, but generates more synthetic examples near those minority class examples which are harder to learn, that is, those surrounded by many majority class examples. Although over-sampling retains all the ‘real’ data, the introduction of new data can lead to overfitting and an increase in the time taken to train the algorithms.
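The two resampling strategies can be sketched in a few lines. This is a simplified illustration of random under-sampling and of the interpolation idea behind SMOTE, not the implementation used in this study; the function names, the neighbour count `k` and the toy points are assumptions.

```python
import random


def random_under_sample(majority, minority, rng):
    """Keep a random subset of the majority class equal in size to the minority class."""
    return rng.sample(majority, len(minority)), list(minority)


def smote_like_over_sample(minority, n_new, k, rng):
    """Create n_new synthetic minority points, each interpolated between a
    minority point and one of its k nearest minority-class neighbours."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


rng = random.Random(0)
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
majority = [(5.0 + i, 5.0) for i in range(10)]

kept_majority, kept_minority = random_under_sample(majority, minority, rng)
new_points = smote_like_over_sample(minority, n_new=7, k=2, rng=rng)
```

After over-sampling, the three real minority points plus the seven synthetic ones match the ten majority points in number.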

2.3 Evaluation Criteria

The following metrics will be used to evaluate the performance of the predictive models used in this study. As there are two types of models considered, binary classifiers for individual ticket sales and regression models for event level ticket sale and revenue prediction, two different classes of metrics are required.

2.3.1 Binary Classifiers. Many binary classification models $M$ (and all those used in this study) compute for a given input $x$ not a single binary value 0 or 1, but instead a probability value or confidence score $M(x)$ in the interval [0, 1].

Choosing a threshold $\theta$ in the interval [0, 1] (commonly taken to be $\theta = 0.5$) yields an actual binary classifier $M_\theta$ which assigns to an input $x$ the negative class, or 0, if $M(x)$ is in $[0, \theta)$ and the positive class, or 1, otherwise.

There are many ways to evaluate such probabilistic binary classifiers [7]. The following sections describe those which we shall use in this study.

Confusion Matrix, Precision, Recall and Accuracy. Given a threshold $\theta$ in the interval (0, 1), let $M_\theta$ denote the corresponding binary classifier. Associated to $M_\theta$ there are four basic quantities: the numbers of true and false positive outcomes, $TP$ and $FP$, and the numbers of true and false negative outcomes, $TN$ and $FN$. These counts are usually organised into a confusion matrix as in Figure 1.

From these quantities several evaluation metrics for the binary classifier $M_\theta$ can be computed.

                      Predicted Class
                      Pos     Neg
Actual Class   Pos    TP      FN
               Neg    FP      TN

Figure 1: Confusion Matrix

The precision is given by:

$$\mathrm{Pre} := \frac{TP}{TP + FP}$$

which is the proportion of the positive predictions which are correct. The higher the precision the better the classifier, and a perfect classifier has a precision of 1.

The accuracy is computed by:

$$\mathrm{Acc} := \frac{TP + TN}{N},$$

where $N$ is the number of inputs $x$; this is the proportion of all predictions which are correct.

The recall or true positive rate is computed by:

$$\mathrm{Rec} = \mathrm{TPR} := \frac{TP}{TP + FN}$$

which is the proportion of the elements in the positive class which were correctly predicted.

The false positive rate is computed by:

$$\mathrm{FPR} := \frac{FP}{FP + TN}$$

which is the proportion of the elements in the negative class which were incorrectly predicted.

The F-Score. The F-score is a metric which combines the precision and the recall. It is given by

$$F\text{-score} := 2 \cdot \frac{\mathrm{Pre} \cdot \mathrm{Rec}}{\mathrm{Pre} + \mathrm{Rec}}.$$

A perfect classifier has F-score equal to 1, with a higher F-score corresponding to better classification.
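The metrics defined above can be computed directly from the confidence scores and a threshold. The sketch below uses made-up scores and labels, and the function names are illustrative; it assumes that none of the denominators is zero.

```python
def confusion_counts(scores, labels, threshold=0.5):
    """Count (TP, FP, TN, FN) when scores >= threshold are predicted positive."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Precision, recall, accuracy, false positive rate and F-score."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "fpr": fp / (fp + tn),
        "f_score": 2 * precision * recall / (precision + recall),
    }


# Made-up confidence scores and true classes for six observations
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
counts = confusion_counts(scores, labels)
metrics = classification_metrics(*counts)
```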

2.3.2 ROC Curve and AUC. For a probabilistic classifier $M$ the Receiver Operating Characteristic (ROC) curve is the parametric curve in the plane given by plotting the true positive rate against the false positive rate:

$$(\mathrm{FPR}_\theta, \mathrm{TPR}_\theta),$$

for each of the binary classifiers $M_\theta$ corresponding to the threshold values $\theta$ in [0, 1].

The ROC curve begins at the point (0, 0) and ends at the point (1, 1) and for a random classifier is roughly a diagonal line from (0, 0) to (1, 1).

A single metric, called the Area Under the Curve (AUC), can be obtained by computing the area below the ROC curve. A perfect classifier has AUC equal to 1 and the higher the AUC the better the classifier.
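As an illustration, the ROC points and the AUC can be computed by sweeping the threshold over the distinct scores and applying the trapezoidal rule, with the false positive rate on the horizontal axis. The scores and labels below are made up, and the function names are illustrative.

```python
def roc_points(scores, labels):
    """ROC points (FPR, TPR), ordered from the strictest threshold to the loosest."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # a threshold no score reaches: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
area = auc(points)
```

For these six observations the area equals 8/9, which is also the fraction of (positive, negative) pairs in which the positive observation receives the higher score.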


Brier Score. The Brier score, also called the Mean Squared Error or MSE, is computed by

$$\mathrm{MSE} = \frac{1}{N} \sum_i (M(x_i) - y_i)^2$$

where $N$ is the number of observations $(x_i, y_i)$. The MSE is equal to 0 for a perfect classifier, and a lower MSE corresponds to a better classifier.
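A short sketch of this computation on made-up confidence scores and true classes (the function name is illustrative):

```python
def brier_score(scores, labels):
    """Mean squared difference between confidence scores and true classes (0 or 1)."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(labels)


scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
score = brier_score(scores, labels)
```

Unlike the threshold-based metrics, this score penalises a confident wrong prediction more heavily than a hesitant one.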

2.3.3 Regression Models. Regression models will be evaluated using the Root Mean Squared Error (RMSE). Explicitly, if $(x_i, y_i)$ denotes a set of $N$ observations of the independent and dependent variables and $M$ is a model which predicts $y_i$ given $x_i$, then the RMSE of $M$ over the set of $N$ data points $(x_i, y_i)$ is:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_i (M(x_i) - y_i)^2}.$$

A regression model with RMSE equal to zero corresponds to perfect prediction, while lower RMSE values correspond to a smaller spread of the prediction errors around zero.

Although not an evaluation metric, a useful secondary indicator of the quality of a regression model is its bias:

$$\mathrm{Bias} = \frac{1}{N} \sum_i (M(x_i) - y_i)$$

which is the difference between the mean of the values predicted by the model and the mean of the dependent variable over the dataset. A model with bias close to 0, in absolute terms, has the property that its errors roughly balance out around zero.
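The two regression quantities can be computed together, as in the following sketch on made-up predictions (the function names are illustrative):

```python
import math


def rmse(predictions, targets):
    """Root mean squared error of the predictions."""
    squared = [(p - y) ** 2 for p, y in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


def bias(predictions, targets):
    """Mean signed error: positive when the model over-predicts on average."""
    return sum(p - y for p, y in zip(predictions, targets)) / len(targets)


predictions = [2.0, 4.0, 6.0]
targets = [1.0, 5.0, 6.0]
error = rmse(predictions, targets)
offset = bias(predictions, targets)
```

In this toy example the errors are +1, -1 and 0, so the bias vanishes while the RMSE does not: a bias of zero says nothing about the size of the individual errors.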

3 METHODOLOGY

This section describes the methodological content of this study, including which of the theoretical frameworks from the previous section will be used and why, a description of the raw dataset and how it was processed for use within these frameworks, and an explanation of the evaluation methods used.

3.1 Dataset

3.1.1 Description. The raw data set consisted of two files: a ticket dataset and an event dataset. The ticket dataset contained 171,048 rows and the following columns:

• ticket_id: unique id identifying the ticket.
• event_id: unique id identifying the event.
• created_at: date and time the ticket was listed, with dates ranging from 2017-02-17 until 2018-04-30.
• bought_at: date and time the ticket was sold, or empty if unsold.
• initial_price: initial price of the ticket.
• initial_price_currency: initial price currency.
• original_price: price originally paid for the ticket.
• original_price_currency: original price currency, as reported by the ticket owner.
• is_price_changed: binary variable indicating whether the price has been changed.
• latest_price: new price of the ticket, if the price was changed.
• latest_price_currency: latest price currency.

The 171,048 tickets corresponded to a total of 5,929 unique events. None of the tickets had changed prices and so the final three columns were immediately discarded.

The event dataset contained a daily summary of statistics related to each of the events:

• event_id: unique id identifying the event.
• date: the date at which the statistics were recorded.
• event_started: binary variable indicating whether the event occurs on the date.
• num_tickets_wanted: number of users registered as wanting to buy tickets.
• num_tickets_sold: number of tickets which have already sold.
• num_tickets_total: number of tickets which have sold and which are for sale.

We note that, while highly desirable, it was (at this time) not possible to obtain various event-specific information which may have been relevant, such as the event type or whether the event was sold out.

3.1.2 Preparation. We now describe how the datasets, which will be used to train the models, were constructed.

First, each data point in the new dataset represents a ticket for sale at a given time before the event. As the statistics related to the events are recorded daily, it is only possible to represent tickets at fixed daily intervals before the event takes place. While we are interested in predicting sales of tickets with a lead time of seven days, a single dataset is constructed, containing tickets for sale at 5, 6, 7, 8 and 9 days before the event. This allows for training of models on data containing variable lead times or, by restricting to the corresponding smaller subset, with a static lead time of 7 days. We will refer to these as the variable lead and static lead datasets.

As some of the models to be considered only allow for very simple interactions of the features, several new features, such as original_price_ratio and wanted_cheaper_diff, were constructed to allow for some interaction. A complete list of the features and their descriptions is given in Table 1.

This procedure produced the variable lead dataset, with 43,507 rows corresponding to 2,562 unique events, which contains the static lead dataset as a subset. Event and ticket counts for the variable lead dataset are shown in Table 2.
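As a rough sketch of how the constructed features might be derived for the tickets of one event on one day, consider the following. The record layout and function name are assumptions made for illustration, and the sign convention chosen for wanted_cheaper_diff (tickets_cheaper minus tickets_wanted) is also an assumption, since only the word "difference" is given in the feature description.

```python
def add_derived_features(tickets, tickets_wanted):
    """tickets: list of dicts with 'price' and 'original_price' for the
    tickets of a single event that are for sale on a single day."""
    cheapest_price = min(t["price"] for t in tickets)
    rows = []
    for t in tickets:
        tickets_cheaper = sum(1 for other in tickets if other["price"] < t["price"])
        rows.append({
            **t,
            "cheapest_price": cheapest_price,
            "original_price_ratio": t["price"] / t["original_price"],
            "cheapest_price_diff": t["price"] - cheapest_price,
            "tickets_cheaper": tickets_cheaper,
            # Assumed sign convention: tickets_cheaper minus tickets_wanted.
            "wanted_cheaper_diff": tickets_cheaper - tickets_wanted,
        })
    return rows


# Made-up tickets for one event
tickets = [
    {"price": 50.0, "original_price": 40.0},
    {"price": 40.0, "original_price": 40.0},
    {"price": 45.0, "original_price": 50.0},
]
rows = add_derived_features(tickets, tickets_wanted=2)
```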

A second dataset, the event dataset, was created whose rows correspond to the events and whose columns are the associated event statistics recorded at a lead time of seven days. As we are interested in predicting total revenue, we define the value of an event as the sum of the prices of all tickets for that event and the revenue as the sum of the prices of all tickets for that event which sold. Basic statistics of the event dataset are shown in Table 3 and the ticket counts per event are shown in Figure 2.

3.2 Models

Three models were selected for comparison while answering Research Questions 1a and 1b, the classification problem of ticket sale prediction, and were compared with respect to their performance on several standard evaluation metrics. Using these classification models, subsequent regression models are constructed for use in Research Questions 2 and 3, regarding the regression problems of event level ticket sale prediction and expected revenue prediction.

Feature                 Description
time                    The number of days before the event at which the values of the features are recorded.
tickets_for_sale        The number of tickets for sale.
tickets_sold            The number of tickets previously sold.
tickets_wanted          The number of users registered as looking for a ticket.
cheapest_price          The price of the cheapest ticket.
weekend                 A binary variable indicating whether the event takes place on Friday, Saturday or Sunday.
created_at              The number of days before the event at which the ticket was listed for sale.
price                   The sale price of the ticket in Euros.
original_price          The original purchase price of the ticket (as reported by the user).
original_price_ratio    price divided by original_price.
cheapest_price_diff     The difference in price between the ticket and the cheapest ticket.
tickets_cheaper         The number of tickets for sale at a lower price.
wanted_cheaper_diff     The difference between tickets_cheaper and tickets_wanted.

Table 1: Description of ticket based features

Day                5       6       7       8       9
Events          2143    2018    1940    1863    1804
Tickets         9942    9179    8546    8144    7696
Sold            7309    6804    6400    6177    5874
Unsold          2633    2375    2146    1967    1822
Fraction sold   0.73    0.74    0.75    0.76    0.75

Table 2: Variable lead dataset ticket and event counts by lead time.

                  Mean     Std. Dev.
Value            102.4     211.0
Revenue           72.5     184.5
Tickets            4.4       7.2
Tickets (Sold)    3.30      6.27

Table 3: Event dataset statistics.

Figure 2: Ticket counts per event for the event dataset.

3.2.1 Baseline classifiers. In order to gain a reference point against which the other classification models can be compared, two baseline models were established. These consisted of two ‘dummy’ classifiers, the first of which assigns a random probability of sale to each ticket while the second assigns a constant probability of sale to each ticket, this probability being the ratio of sold tickets to all tickets, as computed on a training dataset.
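The two dummy classifiers can be sketched as follows. This is a minimal illustration with hypothetical function names, and it assumes the random probabilities are drawn uniformly, which the description does not specify.

```python
import random


def random_baseline(n_tickets, rng):
    """Assign each ticket a random probability of sale (assumed uniform on [0, 1))."""
    return [rng.random() for _ in range(n_tickets)]


def constant_baseline(train_labels, n_tickets):
    """Assign every ticket the ratio of sold tickets in the training data."""
    sold_ratio = sum(train_labels) / len(train_labels)
    return [sold_ratio] * n_tickets


rng = random.Random(0)
random_scores = random_baseline(5, rng)
constant_scores = constant_baseline([1, 1, 1, 0], 3)  # 1 = sold, 0 = unsold
```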

3.2.2 Classifiers. The following three classification algorithms were selected for comparison in the ticket sale prediction problem:

(1) Gradient Boosted Trees
(2) Logistic Regression
(3) Naive Bayes

The implementations of these algorithms used for this study are those of the Python library scikit-learn [13]. An important feature of each of these classification algorithms is that its output is a probability or confidence value that an input belongs to the positive class, which will allow subsequent regression models to predict the expected revenue generated by a ticket sale.

While artificial neural networks and support vector machines are powerful algorithms, they were not selected for comparison in this study. This was because both kinds of model often require extensive hyperparameter optimisation and lengthy training times.

3.2.3 Regression Models. The two regression problems considered in this study are estimating the total number of tickets sold per event and the total revenue generated by the sold tickets. A classifier $M$ assigning to a ticket $x$ the probability of its sale yields a regression model $M_{\mathrm{tot}}$ estimating the total number of tickets sold for an event $E$ by summing the probabilities $M(x)$ over the tickets for the event $E$:

$$M_{\mathrm{tot}}(E) := \sum_{x \text{ a ticket for } E} M(x).$$

Similarly, the same classifier $M$ yields a regression model $R$ which estimates the expected revenue generated by the tickets $x$ of an event $E$ by summing over the products of the probabilities $M(x)$ and the sale prices $P(x)$:

$$R(E) := \sum_{x \text{ a ticket for } E} M(x) P(x).$$

Finally, for comparison two Gradient Boosted Trees regression models estimating the total tickets sold and revenue generated will also be trained using the events dataset.
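The derived event-level models amount to a grouped sum over tickets. A minimal sketch, assuming the classifier's probabilities have already been computed and that each ticket is represented as a hypothetical (event_id, probability, price) triple:

```python
from collections import defaultdict


def event_totals(tickets):
    """tickets: iterable of (event_id, sale_probability, price) triples.
    Returns the expected number of tickets sold and the expected revenue per event."""
    expected_sales = defaultdict(float)
    expected_revenue = defaultdict(float)
    for event_id, probability, price in tickets:
        expected_sales[event_id] += probability
        expected_revenue[event_id] += probability * price
    return dict(expected_sales), dict(expected_revenue)


# Made-up predictions for three tickets across two events
tickets = [("A", 0.9, 50.0), ("A", 0.5, 60.0), ("B", 0.2, 30.0)]
sales, revenue = event_totals(tickets)
```

Note that the expected number of tickets sold is generally not an integer, since it is a sum of probabilities rather than a count of thresholded predictions.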

3.2.4 Class Imbalance. As can be seen in Table 2, there is a moderate class imbalance between sold and unsold tickets. We will investigate the effects of random under-sampling of the majority class and synthetic over-sampling of the minority class using both SMOTE and ADASYN. The implementations of these algorithms used for this study are those of the Python package imbalanced-learn [11].

3.2.5 Hyperparameters. Of the models considered in this study, only the Gradient Boosted Trees model requires (moderate) hyperparameter tuning. The values of these parameters were chosen using a grid search of the parameter space, in such a way that they minimised the mean squared error for classifiers and the root mean squared error for regression models. The specific values of the hyperparameters can be found in Appendix A.

3.2.6 Evaluation Frameworks. While evaluating the quality of the models we will make use of $k$-fold cross validation, which is one of the most common methods used in model evaluation and selection [9]. Explicitly, $k$-fold cross validation splits the dataset into $k$ equally sized chunks, the folds, whereupon each fold in turn is held back while the model is trained on the other $k - 1$ folds and performance metrics are computed using the fold which was held back. The obtained metrics are then averaged to produce a single score.

We shall be using three main metrics for evaluation of the classifiers: Mean Squared Error (MSE), Area Under the Curve (AUC) and F-score. We also record the accuracy, recall and precision of each of the classifiers. The regression models will be evaluated using Root Mean Squared Error and their bias will also be recorded. When training on the event dataset or the static lead dataset 5-fold cross validation will be used, and when training on the variable lead dataset 10-fold cross validation will be used (these numbers of folds are standard for datasets of this size [9]).

The points of the dataset correspond to tickets at a given number of days before the event and contain various features some of which are dependent only on the event. Therefore, in order to avoid data leakage between training and validation data sets, the training and validation splits will be made according to the events rather than solely the tickets themselves. Thus, the training and validation sets will each contain either all or none of the tickets for each event.

Finally, when making use of over-sampling techniques, which are insensitive to events, in order to avoid data leakage, the over-sampling methods will be applied only to the training dataset.
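An event-based split of this kind can be sketched in plain Python as follows. The function name and the way events are distributed over folds (shuffled, then dealt out in turn, so folds hold near-equal numbers of events rather than tickets) are assumptions made for illustration; scikit-learn's `GroupKFold` offers comparable functionality, though whether it was used in this study is not stated.

```python
import random


def event_k_fold(event_ids, k, seed=0):
    """Yield (train_indices, validation_indices) pairs such that all tickets
    of any given event fall entirely in one of the two sets."""
    unique_events = sorted(set(event_ids))
    random.Random(seed).shuffle(unique_events)
    folds = [set(unique_events[i::k]) for i in range(k)]
    for held_out in folds:
        train = [i for i, e in enumerate(event_ids) if e not in held_out]
        validation = [i for i, e in enumerate(event_ids) if e in held_out]
        # Any over-sampling would be applied to the rows in `train` only.
        yield train, validation


# One event id per ticket row (made-up)
event_ids = ["a", "a", "b", "c", "c", "c", "d", "e"]
splits = list(event_k_fold(event_ids, k=3))
```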

4 EXPERIMENTS AND EVALUATION

In this section we present and analyse the results of the experiments which were conducted for each of the Research Questions. First, we consider several probabilistic classification algorithms and their performance with respect to individual ticket sale prediction. We also compare the performance of classifiers trained on variable lead time datasets versus static lead time datasets. We then compare the effectiveness of these models at predicting total ticket sales and expected revenue at the level of events and compare them to Gradient Boosted Trees regression models trained on the event dataset. Finally, we investigate how the predicted probability of sale and expected revenue vary as functions of ticket price.

4.1 Research Questions 1a and 1b

The first part of Research Question 1 of this thesis asks: ‘Which machine learning models most effectively predict the probability of individual ticket sale with a lead time of seven days?’ In answering this question several different models were constructed and their performance on various metrics was recorded, the most important of which were the MSE, AUC and F1 scores. The accuracy, precision and recall were also recorded and can be found in Appendix B. The results of these experiments can be found in Table 4, along with the performance of the two baseline models. We note that with the static lead time, all of the trained models outperform the baseline models in the main metrics (MSE, AUC, F1) with respect to ticket sale prediction, and that the Gradient Boosted Trees model is the best performing in each of these metrics.

                          MSE         AUC         F1
Gradient Boosted Trees    0.133805    0.821259    0.880023
Logistic Regression       0.180516    0.76833     0.825734
Naive Bayes               0.224316    0.705646    0.853904
Baseline (Random)         0.327255    0.510804    0.604335
Baseline (Majority)       0.27779     0.5         0.837089

Table 4: Static lead time performance for individual ticket sale.

The second part of this Research Question asks: ‘Does allowing for a variable lead time produce significantly more accurate individual ticket sale predictions compared to static lead time?’ The same models were trained again, now on the variable lead time dataset, which contained all records with lead times of 5, 6, 7, 8 and 9 days. The performance of these models with respect to sale prediction at a lead time of 7 days is recorded, along with the performance of the baseline models, in Table 5.

                          MSE         AUC         F1
Naive Bayes               0.216462    0.754598    0.861118
Baseline (Random)         0.333991    0.493903    0.585999
Baseline (Majority)       0.275394    0.5         0.838351

Table 5: Variable lead time performance for individual ticket sale.

As with the static lead time models, the trained models outperformed the baseline models in all three of the major metrics, and among these the Gradient Boosted Trees algorithm performed the best, although it was still outperformed in all metrics by its static lead time counterpart. However, when comparing the variable lead time Logistic Regression and Naive Bayes models to their static lead time counterparts there is an across-the-board increase in performance in all of the main metrics.


Therefore, while training on variable lead times does not improve on the best overall performance, nor on the performance of the Gradient Boosted Trees model, it does lead to increases in performance for the Logistic Regression and Naive Bayes models.

                Predicted Sold    Predicted Unsold
Truth Sold      1881              336
Truth Unsold    318               669

Figure 3: Confusion Matrix for Static Lead Gradient Boosted Trees model.

Figure 4: Prediction errors for static lead Gradient Boosted Trees model for events with ≤ 5 and > 5 tickets.

The confusion matrix and prediction errors for the static lead Gradient Boosted Trees model, trained and then evaluated on a 70–30 train-test split, can be seen in Figures 3 and 4. We see that there is a significant difference in the relative performance of sale prediction between sold and unsold tickets and that this difference is also reflected in the distribution of the prediction errors (negative prediction errors correspond to unsold tickets, while positive prediction errors correspond to sold tickets).
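The size of this difference can be read off the confusion matrix in Figure 3 by computing, for each class, the fraction of tickets that were classified correctly. The short sketch below does this arithmetic; the dictionary layout and function name are illustrative choices, while the counts are those of Figure 3.

```python
# Counts from Figure 3, keyed by (true class, predicted class).
confusion = {
    ("sold", "sold"): 1881,
    ("sold", "unsold"): 336,
    ("unsold", "sold"): 318,
    ("unsold", "unsold"): 669,
}


def class_recall(matrix, label):
    """Fraction of tickets of the given true class that were predicted correctly."""
    total = sum(count for (truth, _), count in matrix.items() if truth == label)
    return matrix[(label, label)] / total


sold_recall = class_recall(confusion, "sold")      # 1881 / 2217
unsold_recall = class_recall(confusion, "unsold")  # 669 / 987
print(round(sold_recall, 3), round(unsold_recall, 3))
```

Roughly 85% of sold tickets are classified correctly, compared with roughly 68% of unsold tickets.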

This difference could be due to several factors. First, there are many more events with a small number of tickets than events with a large number of tickets. However, the difference in the error distributions is present whether one restricts to tickets for events with few or many tickets, as is also shown in Figure 4.

There is also a moderate class imbalance in the training data set (see Figure 3). In order to investigate whether this is contributing to the performance of the models, random under-sampling of the sold class and synthetic over sampling of the unsold class using both ADASYN and SMOTE were performed. The models were trained and evaluated using these balanced datasets and their performance with respect to the main metrics recorded in Tables 6, 7 and 8.
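As an illustration of the simplest of these three techniques, the following sketch performs random under-sampling of the majority class using only the Python standard library. The record layout (a dictionary with a `sold` key) and the fixed seed are assumptions made for the example; ADASYN [6] and SMOTE [3] instead generate synthetic minority-class records and are not reproduced here.

```python
import random


def undersample(rows, label_key="sold", seed=0):
    """Randomly drop majority-class rows until both classes have equal size."""
    rng = random.Random(seed)
    sold = [r for r in rows if r[label_key] == 1]
    unsold = [r for r in rows if r[label_key] == 0]
    if len(sold) > len(unsold):
        majority, minority = sold, unsold
    else:
        majority, minority = unsold, sold
    # Sample without replacement so no majority record is duplicated.
    balanced = rng.sample(majority, len(minority)) + minority
    rng.shuffle(balanced)
    return balanced


# Toy data (hypothetical): 7 sold tickets and 3 unsold tickets.
rows = [{"id": i, "sold": 1} for i in range(7)]
rows += [{"id": i, "sold": 0} for i in range(7, 10)]
balanced = undersample(rows)
print(len(balanced), sum(r["sold"] for r in balanced))
```

Under-sampling should be applied to the training portion only, so that the evaluation data keeps the original class proportions.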

MSE AUC F1

Naive Bayes 0.216374 0.739374 0.859955

Table 6: Under-sampled static lead time performance.

MSE AUC F1

Naive Bayes 0.189508 0.693020 0.847232

Table 7: Performance on ADASYN over-sampled static lead time dataset.

MSE AUC F1

Naive Bayes 0.195083 0.712055 0.859049

Table 8: Performance on SMOTE over-sampled static lead time dataset.

As in the previous experiments, when trained on each of the three balanced datasets, the Gradient Boosted Trees model continues to outperform the others in all metrics, although it offers no increase in performance compared to the models trained on the unbalanced dataset. Moreover, the difference in error distributions between sold and unsold tickets persisted despite training on balanced datasets, and so it is likely not due to an imbalanced training dataset. One simple explanation is that the dataset and models are missing certain features which would allow them to discriminate more evenly between sold and unsold tickets. An obvious candidate, which was not available for this study, is whether the events in question were sold out.

Finally, we conclude that when predicting the sale of individual tickets with a lead time of seven days, the Gradient Boosted Trees model trained on the static lead time dataset offers the greatest performance when compared with the other models, whether they are trained on balanced or variable lead time datasets.

4.2 Research Question 2

Research Question 2 asks: ‘Which of the models constructed in Research Question 1 most effectively predict total number of tickets sold and the revenue generated at the level of events?’ For this we decided to focus on the models of Research Question 1a rather than those of Research Question 1b.

As explained in Section 3.2.3, each of the classification models constructed in Research Question 1 can be used to construct regression models estimating, for each event, the total ticket sales, by taking the sum over all tickets of their sale probabilities, and the total revenue generated, by taking the sum over all tickets of the product of their sale probability with their price. The same also applies to the two baseline models (random and majority). Finally, for the sake of comparison, two Gradient Boosted Trees regression models were trained using the event dataset for predicting total ticket sales and total revenue.
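The construction of the derived regression models can be sketched as follows. The input is assumed to be a list of (event identifier, predicted sale probability, price) triples, one per ticket; this layout, the function name and the toy values are illustrative assumptions rather than the data structures used in the experiments.

```python
from collections import defaultdict


def event_totals(tickets):
    """Aggregate ticket-level sale probabilities into event-level predictions.

    Returns two dictionaries keyed by event: the expected number of tickets
    sold (sum of probabilities) and the expected revenue (sum of
    probability times price).
    """
    expected_sales = defaultdict(float)
    expected_revenue = defaultdict(float)
    for event_id, probability, price in tickets:
        expected_sales[event_id] += probability
        expected_revenue[event_id] += probability * price
    return dict(expected_sales), dict(expected_revenue)


# Toy data (hypothetical): two tickets for event "e1", one for event "e2".
tickets = [("e1", 0.9, 40.0), ("e1", 0.5, 60.0), ("e2", 0.2, 100.0)]
sales, revenue = event_totals(tickets)
print(sales, revenue)
```

In this example event "e1" has an expected 1.4 tickets sold and an expected revenue of 66, while event "e2" has an expected 0.2 tickets sold and an expected revenue of 20.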

Using 5-fold cross validation the classification models were trained and the Root Mean Squared Errors and bias were recorded for the two derived regression models, and similarly for the two Gradient Boosted Trees regression models. The results for the ticket sale and revenue prediction can be seen in Table 9.

Model                  Total Sales RMSE   Total Sales Bias   Total Revenue RMSE   Total Revenue Bias

GB Trees Classifier    1.6704             -0.0128            48.8554              -0.0542

Logistic Regression    2.9                0.7872             79.1422              17.9328

Naive Bayes            3.5902             -0.6116            103.1661             -7.7493

GB Trees Regressors    1.9453             0.0031             53.8224              -5.7631

Baseline (Random)      4.2340             1.0561             106.4310             20.8029

Baseline (Majority)    5.5952             -1.3681            113.1138             -33.3154

Table 9: RMSE and bias for total sales and total revenue regression models.
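For reference, the two error measures reported in Table 9 can be computed as in the sketch below. The sign convention for the bias (actual minus predicted) is an assumption: it is not stated explicitly here, but it appears consistent with the negative bias of the majority baseline, which predicts every ticket as sold and so can only over-predict. The function names and toy values are illustrative.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error over events."""
    squared = [(a - p) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def bias(predicted, actual):
    """Mean signed error over events (assumed convention: actual minus predicted)."""
    errors = [a - p for p, a in zip(predicted, actual)]
    return sum(errors) / len(errors)


# Toy data (hypothetical): predicted and actual total sales for three events.
predicted_sales = [2.0, 4.0, 6.0]
actual_sales = [3.0, 4.0, 4.0]
print(rmse(predicted_sales, actual_sales), bias(predicted_sales, actual_sales))
```

A bias close to zero together with a large RMSE indicates errors that are large for individual events but cancel out on average.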

Each of the trained models outperformed the baseline models, and again the model derived from the Gradient Boosted Trees classifier was the best performer for both total sales and total revenue. The Gradient Boosted Trees regression models, however, offer similar performance (for example, an RMSE of 1.9453 versus 1.6704 for total sales).

4.3 Research Question 3

The final Research Question asks: ‘How do the probability of sale and expected revenue of individual tickets vary as functions of the price?’ As the Gradient Boosted Trees model has systematically outperformed all other models with respect to individual ticket sale, total ticket sale and total revenue prediction, we will use the Gradient Boosted Trees model to analyse this problem.

Specifically, we trained the Gradient Boosted Trees model using all of the static lead time dataset and then considered variations in probability of sale and expected revenue arising from variations in ticket price. In order to properly compare and aggregate such variations we must first normalise these variations in such a way that they are comparable.

To this end, we shall consider the lift in probability of sale and expected revenue generated by a change in the ticket price. The lift of a quantity $f(p)$ depending on a variable $p$, generated by a change of the variable from a base value $p_0$ to $p_1$, is:

\[ \mathrm{Lift} := \frac{f(p_1) - f(p_0)}{f(p_0)}. \]

We shall take as the base value the listed price of the ticket and consider the average lift in the predicted probability of sale and expected revenue generated by a percentage change in the ticket price. The results of computing these lifts for all tickets with price variations within ±25% and taking the average can be seen in Figure 5. We see that raising and lowering the price results in negative and positive lifts in the probability of sale respectively, which is to be expected, but that these lifts are not symmetric. The result of this is that the average lift in revenue is almost monotonically increasing as a function of a percentage change in price. Therefore, we find that on average the model expects that a rise in the price of the ticket will result in a rise in the expected revenue generated. It may indeed be possible that TicketSwap users are almost systematically underpricing their tickets, but this would be quite a surprising result.
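The lift computation can be illustrated with a small sketch. Here the trained model is replaced by a hypothetical stand-in, `toy_sale_probability`, which depends on the price alone; in the experiments the probability comes from the Gradient Boosted Trees model and depends on all ticket features, so the numbers below are purely illustrative.

```python
def lift(quantity, base_price, new_price):
    """Relative change in quantity(price) when moving from base_price to new_price."""
    base_value = quantity(base_price)
    return (quantity(new_price) - base_value) / base_value


def average_lift(quantity, base_prices, pct_change):
    """Average lift over tickets for a given percentage change in price."""
    lifts = [lift(quantity, p0, p0 * (1 + pct_change)) for p0 in base_prices]
    return sum(lifts) / len(lifts)


def toy_sale_probability(price):
    """Hypothetical stand-in for the model: probability falls linearly with price."""
    return max(0.0, 1.0 - price / 200.0)


def toy_expected_revenue(price):
    """Expected revenue of a single ticket: price times probability of sale."""
    return price * toy_sale_probability(price)


# A 10% price increase on a ticket listed at 50.
probability_lift = lift(toy_sale_probability, 50.0, 55.0)
revenue_lift = lift(toy_expected_revenue, 50.0, 55.0)
print(probability_lift, revenue_lift)
```

In this toy case the probability of sale drops by about 3.3% while the expected revenue rises by about 6.3%: a price increase raises expected revenue whenever the relative fall in sale probability is smaller than the relative rise in price.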

Figure 5: Average lifts for all tickets generated by a percentage change in price.

Another explanation for this may be the peculiar structure of the markets for the events on the TicketSwap platform. Indeed, there is a natural upper limit to the price of each ticket: it may be listed for sale at a price at most 20% above its original sale price (although this original price is reported by the users themselves and may not be accurate). Moreover, if a ticket is listed for sale at a price above its original price, this may be an indicator of higher demand for tickets to that event (for instance, tickets to events which are sold out) and so correspond, despite the inflated price, to a higher likelihood of sale.

To elaborate on this further, Figure 6 shows the actual sale rates of tickets according to their price ratio (the ratio of the listed sale price of a ticket to its original price). We see that it is indeed true that the sale rates of tickets increase as the price ratio moves either below or above a price ratio of one.
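A grouping of this kind can be sketched as follows. The price ratio is taken here to be the listed price divided by the original price, so that a ratio above one means the ticket is listed above its original price. The three coarse buckets, the record layout and the toy values are assumptions made for the example and are not the binning used for Figure 6.

```python
from collections import defaultdict


def sale_rates_by_price_ratio(tickets):
    """Sale rate of tickets listed below, at, or above their original price.

    Each ticket is a (listed_price, original_price, sold) triple, sold being 0 or 1.
    """
    counts = defaultdict(int)
    sold_counts = defaultdict(int)
    for listed_price, original_price, sold in tickets:
        if listed_price < original_price:
            bucket = "below"
        elif listed_price > original_price:
            bucket = "above"
        else:
            bucket = "at"
        counts[bucket] += 1
        sold_counts[bucket] += sold
    return {bucket: sold_counts[bucket] / counts[bucket] for bucket in counts}


# Toy data (hypothetical): (listed price, original price, sold).
tickets = [(40.0, 50.0, 1), (50.0, 50.0, 1), (50.0, 50.0, 0), (55.0, 50.0, 1)]
rates = sale_rates_by_price_ratio(tickets)
print(rates)
```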

Figure 6: Sale rates in the static lead data set according to price ratio.

This, combined with the fact that there are many tickets listed at their original price (see Figure 7), may account for the behaviour seen in Figure 5. Indeed, a small increase in the price of a ticket listed at its original price will result in its price ratio changing from one to more than one, therefore moving it into a class of tickets for which the sale rates are higher.


Figure 7: Ticket counts by price ratio.

Thus, we find that while the probability of sale of tickets decreases monotonically as a function of the price, this decrease is not fast enough (within the range considered) to result in a unimodal revenue function; instead the revenue function is roughly linear. This may be due, not to a systematic underpricing of tickets, but to particular features of the dataset and the TicketSwap platform.

5 CONCLUSION AND FUTURE RESEARCH

The aims of this thesis were first to compare various machine learning models in the task of predicting the sale of tickets on the TicketSwap platform at an individual level, then to compare the models derived from these in the tasks of total ticket sales and revenue prediction and, finally, to understand the relationship between price and sale probability of tickets.

With regard to the first two aims, it was found that the Gradient Boosted Trees classifier outperformed all other models with respect to the predictive tasks and that, for this classifier, training on balanced datasets or on a dataset with variable lead time did not result in increases in performance. While we have focused on predictions with a lead time of seven days, an obvious future direction for research is to consider predictions with a variable lead time and to more deeply understand how temporal features influence the sale of tickets.

There also appeared to be some room for improvement, as the model behaved somewhat differently with respect to the sold and unsold segments of the dataset. This behaviour was found not to be due to imbalance in the dataset, nor to non-uniformity in the distribution of tickets per event, and so may be due to some (hypothetical) missing features which would give the model greater discriminatory power between sold and unsold tickets. Therefore, a second area of future research would be to attempt to engineer new features with the aim of tackling this problem (with the clear first step of determining whether or not the events listed are already sold out).

The final aim of this thesis was to understand the relationship between the price of a ticket, its probability of sale and the expected revenue generated by its sale. To this end, average lifts in probability of sale and expected revenue were computed and it was found that while the average lift in probability was (almost) monotonically decreasing with respect to a percentage change in price, which was roughly to be expected, the average lift in revenue was itself monotonically increasing, rather than unimodal. Therefore, at least on average, the model seems to suggest that tickets are almost always underpriced. We proposed that, rather than a reflection of reality, this may be explained by certain aspects of the TicketSwap event markets and in particular the relationship between probability of sale and the price ratio of the tickets. Therefore, a final future direction of research would be to investigate ways of disentangling this relationship, perhaps with further feature engineering (as mentioned above) and/or a deeper analysis of the data in order to identify segments of the data where demand (or the price ratio) has varying levels of importance.

REFERENCES

[1] S Agatonovic-Kustrin and R Beresford. 2000. Basic concepts of artificial neural network (ANN) modeling and its application in pharmaceutical research. Journal of Pharmaceutical and Biomedical Analysis 22, 5 (2000), 717–727. https://doi.org/10.1016/s0731-7085(99)00272-1

[2] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. 1984. Classification and Regression Trees. Wadsworth and Brooks, Monterey, CA.

[3] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Int. Res. 16, 1 (June 2002), 321–357. http://dl.acm.org/citation.cfm?id=1622407.1622416

[4] Jerome H. Friedman. 2002. Stochastic gradient boosting. Computational Statistics & Data Analysis 38, 4 (2002), 367–378. https://doi.org/10.1016/s0167-9473(01)00065-2

[5] Guillermo Gallego and Garrett Van Ryzin. 1994. Optimal Dynamic Pricing of Inventories with Stochastic Demand over Finite Horizons. Management Science 40, 8 (1994), 999–1020. https://doi.org/10.1287/mnsc.40.8.999

[6] Haibo He, Yang Bai, E. A. Garcia, and Shutao Li. 2008. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence). 1322–1328. https://doi.org/10.1109/IJCNN.2008.4633969

[7] José Hernández-Orallo, Peter Flach, and Cèsar Ferri. 2012. A Unified View of Performance Metrics: Translating Threshold Choice into Expected Classification Loss. J. Mach. Learn. Res. 13, 1 (Oct. 2012), 2813–2869. http://dl.acm.org/citation.cfm?id=2503308.2503332

[8] Jong-Min Kim and Hojin Jung. 2019. Predicting bid prices by using machine learning methods. Applied Economics 51, 19 (2019), 2011–2018. https://doi.org/10.1080/00036846.2018.1537477

[9] Ron Kohavi. 1995. A Study of Cross-validation and Bootstrap for Accuracy Estimation and Model Selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence - Volume 2 (IJCAI’95). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1137–1143. http://dl.acm.org/citation.cfm?id=1643031.1643047

[10] Richard D. Lawrence. 2003. A Machine-Learning Approach to Optimal Bid Pricing. Springer US, Boston, MA, 97–118. https://doi.org/10.1007/978-1-4615-1043-7_5

[11] Guillaume Lemaître, Fernando Nogueira, and Christos K. Aridas. 2017. Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning. Journal of Machine Learning Research 18, 17 (2017), 1–5. http://jmlr.org/papers/v18/16-365.html

[12] M. E. Maron. 1961. Automatic Indexing: An Experimental Inquiry. J. ACM 8, 3 (1961), 404–417. https://doi.org/10.1145/321075.321084

[13] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.

[14] G. P. Zhang. 2000. Neural networks for classification: a survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 30, 4 (Nov 2000), 451–462. https://doi.org/10.1109/5326.897072

A HYPERPARAMETERS

The following hyperparameters were used in the training of the Gradient Boosted Trees models. Defaults were used for unlisted parameters.

GB Trees Classifier

• learning_rate: 0.01
• n_estimators: 650
• max_depth: 5
• min_samples_leaf: 60
• subsample: 0.8

GB Trees Total Sales Regressor

• learning_rate: 0.01
• n_estimators: 1250
• max_depth: 3
• subsample: 0.8

GB Trees Total Revenue Regressor

• learning_rate: 0.01
• n_estimators: 1450
• max_depth: 3
• subsample: 0.8
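For convenience, the settings above can be written as keyword arguments. The sketch below assumes the scikit-learn [13] classes `GradientBoostingClassifier` and `GradientBoostingRegressor`, whose parameter names match those listed; the dictionary names and the helper `build_models` are illustrative, and the classifier's subsample value is read as 0.8.

```python
# Hyperparameters as listed in this appendix; unlisted parameters keep their defaults.
GB_CLASSIFIER_PARAMS = {
    "learning_rate": 0.01,
    "n_estimators": 650,
    "max_depth": 5,
    "min_samples_leaf": 60,
    "subsample": 0.8,
}
GB_SALES_REGRESSOR_PARAMS = {
    "learning_rate": 0.01,
    "n_estimators": 1250,
    "max_depth": 3,
    "subsample": 0.8,
}
GB_REVENUE_REGRESSOR_PARAMS = {
    "learning_rate": 0.01,
    "n_estimators": 1450,
    "max_depth": 3,
    "subsample": 0.8,
}


def build_models():
    """Instantiate the three models (requires scikit-learn to be installed)."""
    from sklearn.ensemble import (
        GradientBoostingClassifier,
        GradientBoostingRegressor,
    )

    return (
        GradientBoostingClassifier(**GB_CLASSIFIER_PARAMS),
        GradientBoostingRegressor(**GB_SALES_REGRESSOR_PARAMS),
        GradientBoostingRegressor(**GB_REVENUE_REGRESSOR_PARAMS),
    )
```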

B TABLES

MSE AUC F1 Acc Prec Rec

Gradient Boosted Trees 0.133805 0.821259 0.880023 0.819022 0.829843 0.937713

Logistic 0.180516 0.76833 0.825734 0.758906 0.841855 0.813688

Naive Bayes 0.224316 0.705646 0.853904 0.765667 0.775718 0.953972

Baseline (Random) 0.327255 0.510804 0.604335 0.513723 0.731898 0.517132

Baseline (Majority) 0.27779 0.5 0.837089 0.72221 0.72221 1

Table 10: Static lead time performance for individual ticket sale.

Logistic 0.153645 0.772472 0.868194 0.789654 0.791475 0.96416

Naive Bayes 0.216462 0.754598 0.861118 0.779772 0.78453 0.956501

Baseline (Random) 0.333991 0.493903 0.585999 0.496696 0.721278 0.495148

Baseline (Majority) 0.275394 0.5 0.838351 0.724606 0.724606 1

Table 11: Variable lead time performance for individual ticket sale.

Logistic Regression 0.158328 0.756595 0.865173 0.787656 0.788558 0.962452

Naive Bayes 0.216374 0.739374 0.859955 0.7787 0.783861 0.955426

Table 12: Static lead time performance on under-sampled balanced dataset.

Naive Bayes 0.189508 0.693020 0.847232 0.762784 0.810452 0.888095

Table 13: Performance on ADASYN over-sampled static lead time dataset.

Naive Bayes 0.195083 0.712055 0.859049 0.772911 0.798489 0.930358

Table 14: Static lead time performance on SMOTE over-sampled balanced dataset.