University of Groningen

Master Thesis

Setting bundle sizes for mobile internet: a practical approach using customer data

Abstract

List of Figures

1.1 Percentages of the Dutch population having a smartphone and having used mobile internet in the past three months (CBS, 2018).

2.1 Timeline of the problem setting.

4.1 A diagram of a decision tree. All dots represent nodes of the tree, where the top node is the starting point, and the bottom nodes, or leaves, are the endpoints.

4.2 Examples of an ROC curve.

5.1 The mean absolute (percentage) error evolving over time.

5.2 Estimated effects of (3-month average) bundle utilization on churn by the logit model, where the points are the estimates and the bars indicate the standard errors.

5.3 Receiver Operating Characteristic (ROC) curve for the logit model

List of Tables

4.1 An overview of models for mobile data usage with corresponding specifications.

4.2 Definitions of true and false, positives and negatives.

5.1 The mean absolute (percentage) error of the benchmark models and panel data models.

5.2 Estimation results of the final model on the main data set.

5.3 Selection of estimated coefficients for the logit model predicting churn probability. For categorical variables the base level is indicated in brackets underneath the variable.

5.4 Three sets of bundle sizes with the lowest churn score. The percentage of renewing customers that will choose the bundle size is indicated below each bundle size.

5.5 Best performing set of bundle sizes for different values of α, the minimal percentage of customers per bundle size.

A.1 Regression results of the panel data models with categorized trend.

A.2 Regression results of the panel data models with homogeneous trend.

Chapter 1

Introduction

Since the launch of the first smartphones at the beginning of the century, their use has increased tremendously. Ninety percent of the Dutch population had a mobile phone or smartphone in 2018, whereas only half of the population did so six years earlier, in 2012 (Figure 1.1). It is not surprising that smartphones are nearly impossible to avoid in daily life, since they can be employed for nearly any purpose. Smartphones nowadays enable their users not only to call and send text messages but also to stream videos and music, surf the internet, pay in stores and receive directions. These functionalities are enabled by the ever-growing number of applications for smartphones, and many of these applications require a (mobile) internet connection. Consequently, aggregate mobile internet traffic has increased exponentially (Cisco, 2017). This growth is driven by two factors: the increased number of mobile internet users (Figure 1.1), and the increased amount of mobile internet used per user. In fact, it is expected that the average mobile connection in Western Europe will generate 7 gigabytes (GB) of mobile data traffic per month in 2021, whereas it was only 1.6 GB in 2016 (Cisco, 2017). As a result, the market for mobile internet bundles is evolving rapidly.

FIGURE 1.1: Percentages of the Dutch population having a smartphone and having used mobile internet in the past three months (CBS, 2018).

This is one of the reasons why telecommunication companies change their mobile internet plan offerings on an annual or biannual basis. When changing mobile internet plans, many factors have to be accounted for, such as the trends in mobile internet usage, competitors’ products and prices and consumers’ product choice behaviour.

1.1 Background and Related Literature

Pricing Schemes

As a result of the increasing popularity of smartphones, mobile service providers have been launching various mobile internet plans (Rahimi and Koosawangsri, 2013). The main static pricing schemes are fixed flat-rate pricing and usage-based pricing. In flat-rate pricing, the consumer pays a fixed amount per month, independent of usage, for an internet subscription that is either unlimited or up to a cap. In usage-based pricing, the consumer pays proportionally to mobile internet usage. More recently, more exotic pricing schemes such as time-of-day pricing (Ha, Joe-Wong, Sen, and Chiang, 2012) have also been proposed as a solution to the congestion problem. Extensive surveys on mobile data plans are provided by Sen, Joe-Wong, Ha, and Chiang (2013) and Gizelis and Vergados (2011).

Consequently, a question that arises for telecommunication companies is what kind of pricing scheme they should use. First of all, the benefits from mobile data plans usually run in opposite directions for telecommunication companies and consumers. In spite of this contrast, it has been shown that capped data plans are more profitable than prepaid plans for both parties (Zheng, Joe-Wong, Andrews, and Chiang, 2018). In addition, unlimited mobile data plans with flat-rate pricing are efficient neither from the company’s perspective nor from most consumers’ perspectives (Paul, Subramanian, Buddhikot, and Das, 2011). Secondly, heterogeneity among customer preferences is of relevance when comparing pricing schemes. Both Rahimi and Koosawangsri (2013) and Zheng et al. (2018) create a model to evaluate mobile data plans for different consumer segments. The findings of Rahimi and Koosawangsri (2013) show that students are price sensitive and have high data and text message usage, whereas seniors prefer a cheap plan that provides many voice minutes, reasonable data and access to a hotspot. He and Walrand (2005) show that internet service providers are able to collect more revenues by offering multiple service classes rather than a single one, whereas Shakkottai, Srikant, Ozdaglar, and Acemoglu (2008) argue that benefits from differentiated pricing for multiple service classes rather than a simple flat fee are small for fixed internet providers.

Tiered Pricing Schemes

In the Netherlands, all four major telecommunication companies are currently using tiered pricing schemes (KPN; T-Mobile; Tele2; Vodafone). These schemes use flat-rate pricing for different usage caps. After reaching the usage cap, one of three options can be applied:

• throttling; the connection speed is reduced once the cap has been reached.

• usage-based pricing; consumers are charged per additional data unit used.

• extra bundles; consumers are forced to buy extra data bundles to be able to use more data.

Although these options have all been in practice historically, currently only throttling and extra bundles are applied in the Dutch telecommunication market.

Once a telecommunication company has decided to use a tiered pricing scheme for its mobile data plans, it still has to decide on the levels of the tier caps and corresponding prices. Several economic models have been developed to address this problem, both for fixed internet services (Lv and Rouskas, 2008, 2009, 2010, 2011) and mobile internet services (Sugiyama, Urakawa, Taya, Yamada, Kobayashi, and Tagami, 2016). These studies are all based on utility theory, but the available data for this project does not enable estimation of the utility parameters. These models are therefore not applicable for the telecommunication company. However, the data sets that are available contain substantial information that could still be useful in determining tier levels of mobile internet bundles. Hence this thesis provides a methodology to size mobile internet bundles that is specifically adjusted to the available data set.

1.2 Research Question and Contribution

This thesis aims to improve the mobile internet plans of a telecommunication company. Improvement is in this case measured by the churn rate, which is the percentage of customers who leave the telecommunications company per month. This improvement is beneficial for both parties: for consumers, appropriate bundle sizes lead to a higher satisfaction level, and for the telecommunication company a churn reduction generally leads to higher revenues. This thesis adds to the existing literature by developing a practical approach to determine tier levels using customer data. The following research steps are proposed for this project:

1. A forecast of data usage on an individual level is obtained to get insights into the heterogeneity in customers’ mobile data usage, and the growth rates thereof.

2. Churn is modelled using bundle size and utilization to find out what the ideal level of bundle utilization is with respect to churn.

3. Possible sets of bundle sizes are evaluated with respect to churn. This step includes a decision rule for consumers’ choice of bundle sizes as well.

Chapter 2

Problem Formulation

This thesis aims to provide a Dutch telecommunication company with a methodology, based on customer data, that enables it to determine the set of mobile internet bundle sizes that minimizes churn. This chapter starts by discussing several aspects of the project setting. The sections thereafter discuss the steps that are proposed to achieve the research goal.

The solution should be applied for the upcoming year, and should be based on data of the last two years. Let $t_0$ be the current month, and assume that the company requires a solution to the decision problem this month. The solution, a new set of bundle sizes for mobile internet, will be implemented at the beginning of the following month, $t_1$. This set of bundle sizes will be in production for a year, thus until $t_{12}$. At the time of problem solving, $t_0$, the company has access to observed monthly customer data over the past two years, which is from $t_{-24}$ until $t_0$. Naturally, the company would like the propositions to match the preferences of consumers that will buy a contract between $t_1$ and $t_{12}$. An overview of the described situation is given in Figure 2.1.

[Figure: timeline with marks at $t_{-24}$, $t_0$, $t_1$ and $t_{12}$, labelled Observed Behavior, Problem Solving, and Implementation New Bundle Sizes.]

FIGURE 2.1: Timeline of the problem setting.

In what follows the decisions regarding the scope of this thesis are defined and properly motivated.

• Only the mobile data component of phone plans is considered, not the call and text messages component.

Since it is believed that customers choose their phone plan based on the mobile internet component, this thesis focuses on the mobile internet component of phone plans only.

• Only the bundle sizes of the phone plans are considered, not the prices.

Naturally, the demand for several mobile data bundles depends on the corresponding prices. However, since price sensitivity data is not available, this thesis focuses on the bundle sizes only and assumes that the prices are set afterwards.

• Only existing customers with a postpaid mobile data plan in the consumer market are considered.

Consumers who will buy a mobile subscription in the period during which the new bundle sizes will be implemented, can either be new customers or existing customers who renew their contract. However, since no behavioural data is accessible for possible new customers, the focus is on existing customers only. This focus is in line with the objective to reduce churn.

The remaining sections of this chapter are devoted to the research steps that were proposed in section 1.2. For each step a motivation is provided, as well as a brief description of the approach.

2.1 Forecast Data Usage on a Customer Level

Why? Mobile internet usage has been increasing in the previous years and is expected to keep increasing in the upcoming years. Changing demand for mobile internet calls for changes in the supply of mobile internet. This shows the need for a forecast of future mobile data usage. Secondly, mobile internet usage is expected to have a different level for different customers as well as a different growth rate. For a company it is important to offer bundle sizes that match the demand of several customer segments. This diversity among customers stresses the importance of an individual forecast rather than one on an aggregate level.

How? A panel data model is estimated in order to obtain an individual forecast and to provide insights on the drivers of data usage. The main advantage of a panel data model is that it is able to take heterogeneity of customers into account, while still learning from the population. This is possible since customers’ different time-invariant preferences that are not captured by the included regressors are captured in an unobserved individual effect. Furthermore, it is possible to capture a time trend in this type of model.

2.2 Model the Effect of Bundle Utilization on Churn

limitations. On the other hand, customers who use a small percentage (e.g. 10%) of their bundle each month might also be dissatisfied with it. These customers pay for more than they need, and may therefore be more likely to churn. Although customers who are dissatisfied with their bundle size can upgrade their bundle within the same provider, these customers are more actively monitoring their other options than customers who are satisfied with their bundle size. As one of these options is churn, it is expected that such customers are more likely to churn.

How? Two approaches that are often encountered in the literature to model binary outcomes, in this case churn, are used. Firstly, the probability to churn for each customer is predicted by a logit model. Secondly, churn is predicted by a random forest, which is often used in the field of machine learning. With regard to the explanatory variables, we are mainly interested in the effects of bundle size and bundle utilization, but other variables are being controlled for as well. The advantage of the logit model is that it provides insights into the drivers of churn, whereas a random forest is difficult to interpret, but might have a better predictive ability.

2.3 Evaluate Sets of Bundle Sizes

Why? The results obtained from the previous steps need to be combined to give a recommendation on a set of bundle sizes that would reduce churn.

Chapter 3

Data Description

This chapter describes the structure of the data set that is used in this thesis as well as the variables of which it consists. Some preliminary analyses regarding trends in mobile internet usage and bundle utilization are not disclosed in this version.

3.1 Data Description

Monthly customer data of the consumer market of a large Dutch telecommunication company is collected. The number of customers is large, and the number of observations per customer ranges from 1 to 24 months. Due to privacy legislation, the telecommunication company is allowed to keep customers’ data for a maximum of two years. Thus, the data is shifting, in the sense that each month new observations are added and the oldest observations are removed. Therefore, this research uses the data available at the moment, just as the company would have to if it implemented this methodology. The original data set is unbalanced, since some customers churned and some were acquired in the past two years. Since the goal is to reduce churn, only customers who are allowed to renew their contract in the next year are kept. Hence, for all customers, the last observation is for January 2019, and the first observation is not before February 2017. The resulting data set is panel data consisting of 1 to 24 monthly observations per customer, where the number of customers is large. Similar to this main data set, a testing data set is obtained. This data set includes the same variables as the main data set, but other customers were chosen, namely those who were able to renew their contract in the past year rather than the upcoming year. Therefore, the last year of this testing data set can be used to test the models.

There are five categories of variables in the data set: mobile internet usage, bundle information, subscription information, churn information and socio-demographics. The next paragraphs discuss the definitions and relevance of some variables. Not all variables are disclosed in this version.

utilization is discretized to the first decimal and truncated at 2.

General information regarding the bundles includes the price and size. The bundle price is the list price for the bundle, and might not correspond to the price that the customer actually pays due to promotions. Secondly, the variable bundle_mb denotes the amount of data that is in the bundle to which the customer is subscribed. Finally, socio-demographic variables are collected. These consist of the year of birth, gender and several categorical variables that constitute customer segments. Six generations are used: the prewar generation (1910-1930), the quiet generation (1931-1940), the babyboomers (1941-1955), generation X (1956-1970), the pragmatic generation (1971-1985), and generation Y (1986-2002) (Spangenberg and Lampert, 2013). It is expected that these generations explain part of the data usage differences among customers. The remaining socio-demographic variables are not disclosed in this version.
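The generation segmentation described above amounts to a lookup on the year of birth. The following is a minimal sketch of such a lookup; the function name `generation_of` and the label strings are illustrative and are not taken from the data set.

```python
# Birth-year ranges as listed in the text (Spangenberg and Lampert, 2013).
GENERATIONS = [
    (1910, 1930, "prewar generation"),
    (1931, 1940, "quiet generation"),
    (1941, 1955, "babyboomers"),
    (1956, 1970, "generation X"),
    (1971, 1985, "pragmatic generation"),
    (1986, 2002, "generation Y"),
]


def generation_of(birth_year):
    """Return the generation label for a birth year, or None if out of range."""
    for first, last, label in GENERATIONS:
        if first <= birth_year <= last:
            return label
    return None
```

How birth years outside 1910-2002 are treated in the thesis is not stated; returning `None` here is an assumption of this sketch.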

Missing Values

Chapter 4

Methodology

This chapter discusses the methodology used in this thesis, following the three research steps introduced in chapter 2. Firstly, section 4.1 specifies several panel data models for mobile data usage and introduces two metrics to assess model performances. Next, section 4.2 describes two models for churn: a logit model and a random forest. Lastly, section 4.3 describes how each set of bundle sizes is evaluated by combining the results of the first two steps.

4.1 Forecast Data Usage

A panel data model is used to model individuals’ data usage over time. This type of model is chosen because it exploits the panel data structure, which has observations over time for many individuals. Such a model is able to learn from the population, while still allowing for heterogeneity among individuals. This section discusses several model specifications and two methods to estimate such models. Next, it describes how the models can be used to create forecasts of data usage and how the accuracy of these forecasts is measured.

4.1.1 Model Specifications

Since aggregate mobile data usage has increased during the past two years, time is one of the explanatory variables in the first model specification. The model is described by

$$y_{it} = x_{it}'\beta + c_i + u_{it}, \qquad (4.1)$$

where $y_{it}$ denotes mobile data usage in MBs of individual $i$ at time $t$. Furthermore, $x_{it} = (\text{birth\_yr}_{it}, \text{bundle\_mb}_{it}, t)'$ is the vector of explanatory variables of individual $i$ at time $t$ and $\beta = (\beta_1, \beta_2, \gamma_0)'$ is the corresponding vector of coefficients. Lastly, $c_i$ denotes the individual effect and $u_{it}$ is the idiosyncratic error of individual $i$ at time $t$.

However, the increase in data usage might differ between generations. As an attempt to improve the first model, the second model includes interaction variables between generation and time rather than time itself. It therefore allows the coefficient of time to differ between generations. This second model is described by

$$y_{it} = \tilde{x}_{it}'\tilde{\beta} + c_i + u_{it}, \qquad (4.2)$$

where $y_{it}$, $c_i$ and $u_{it}$ are defined as before. The included explanatory variables are

of coefficients $\tilde{\beta} = (\beta_1, \beta_2, \gamma_1, \ldots, \gamma_6)'$.

In the previous models, the relation between the explanatory variables and mobile data usage is linear. However, it is possible that a log-linear relation is more suitable for the data generating process. Therefore, two more models are used, in which data usage is replaced by its logarithm. These log-linear specifications also have the advantage that the predicted data usage cannot be negative.

Lastly, two benchmark models are created, such that the added value of the panel data models can be determined. The first benchmark model is a simple linear regression, in which the panel data structure is ignored. That is, the standard model with a homogeneous trend is estimated by Ordinary Least Squares (OLS). The second benchmark originates from a business perspective. This model assumes that the growth rate of mobile data for each customer is approximately 3% per month, which is a rough estimate based on business insights. An overview of all the models that are estimated is given in Table 4.1.

TABLE 4.1: An overview of models for mobile data usage with corresponding specifications.

Benchmark Models
    OLS Benchmark
    Business Benchmark with 3% increase per month

Panel Data Models
    Model Form      Time Trend
    Linear          Homogeneous
    Linear          Categorized
    Log-linear      Homogeneous
    Log-linear      Categorized
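The business benchmark listed in Table 4.1 can be illustrated with a short sketch. It assumes that the 3% monthly growth is compounded from the last observed month; the thesis does not state the exact formula, and the function name `business_benchmark` and the usage figure are hypothetical.

```python
def business_benchmark(last_usage_mb, months_ahead, monthly_growth=0.03):
    """Forecast usage by compounding a fixed monthly growth rate."""
    return last_usage_mb * (1.0 + monthly_growth) ** months_ahead


# A customer using 1000 MB now is forecast to use roughly 1426 MB in 12 months.
forecast = business_benchmark(1000.0, 12)
```

Under this assumption, a 3% monthly growth rate corresponds to an increase of about 43% over a year.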

4.1.2 Estimation Methods

Panel data models can be estimated by either fixed effects or random effects methods. The assumption about the individual effects $c_i$ is the main difference between these methods.

Random Effects

The random effects method considers the individual effects $c_i$ to be a part of the error term. The variance matrix $\Sigma$ is therefore a $T \times T$ matrix with elements $\sigma_c^2 + \sigma_u^2$ on the diagonal and $\sigma_c^2$ outside of the diagonal. The resulting feasible GLS estimator is known as the random effects estimator,

$$\hat{\beta}_{RE} = \left(\sum_{i=1}^{N} X_i'\hat{\Sigma}^{-1}X_i\right)^{-1}\left(\sum_{i=1}^{N} X_i'\hat{\Sigma}^{-1}y_i\right), \qquad (4.3)$$

where $X_i = (x_{i1}, \ldots, x_{iT})'$ and $y_i = (y_{i1}, \ldots, y_{iT})'$. The consistent variance matrix

residuals. The random effects estimator is consistent if the following two assumptions hold:

Ass. 1 The individual effects are not correlated with the regressors.

Ass. 2 The expectation of $X_i'\Sigma^{-1}X_i$ has full rank.

The random effects estimator is efficient as well if a third assumption is made:

Ass. 3 The conditional variances are constant and the conditional covariances are zero, and the variances of the individual effects are homoskedastic;

$$E(u_i u_i' \mid x_i, c_i) = \sigma_u^2 I_T, \qquad (4.4a)$$

$$E(c_i^2 \mid x_i) = \sigma_c^2. \qquad (4.4b)$$

Fixed Effects

The fixed effects estimator considers the individual effects $c_i$ not as a part of the error term, but allows them to be correlated with the explanatory variables $x_{it}$. The fixed effects estimator uses the within transformation to eliminate the individual effects from the estimation. The time-demeaned explanatory variables are given by $\ddot{x}_{it} = x_{it} - \bar{x}_i$, and $\ddot{y}_{it}$ and $\ddot{u}_{it}$ are found similarly. Then, the fixed effects estimator is given by

$$\hat{\beta}_{FE} = \left(\sum_{i=1}^{N} \ddot{X}_i'\ddot{X}_i\right)^{-1}\left(\sum_{i=1}^{N} \ddot{X}_i'\ddot{y}_i\right). \qquad (4.5)$$

Note that the fixed effects method does not require the regressors to be orthogonal to the individual effects, whereas the random effects method does.
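To make the within transformation concrete, the sketch below computes the fixed effects estimate for the special case of a single regressor, in which the matrix expression (4.5) reduces to a ratio of two sums. The function name `within_estimator` and the toy panel are made up for illustration; the models in this thesis have several regressors.

```python
def within_estimator(panel):
    """Fixed effects (within) estimate of a scalar slope.

    `panel` maps each individual to a list of (x, y) observations.
    """
    num = 0.0  # sum of demeaned x times demeaned y
    den = 0.0  # sum of squared demeaned x
    for obs in panel.values():
        x_bar = sum(x for x, _ in obs) / len(obs)
        y_bar = sum(y for _, y in obs) / len(obs)
        for x, y in obs:
            num += (x - x_bar) * (y - y_bar)
            den += (x - x_bar) ** 2
    return num / den


# Toy data (made up): both individuals share slope 2 but differ in level.
toy_panel = {
    "a": [(1, 12), (2, 14), (3, 16)],  # y = 10 + 2x
    "b": [(1, 52), (2, 54), (3, 56)],  # y = 50 + 2x
}
beta_fe = within_estimator(toy_panel)
```

Because each individual's mean is subtracted first, the very different levels of the two individuals do not affect the estimated slope.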

Specification Test

Which estimation method to use depends on the willingness to assume that the composite error term is uncorrelated with the explanatory variables. The specification test that was introduced by Hausman (1978) is used here to test this assumption (i.e. $H_0: E(c_i + u_{it} \mid x_{it}) = 0$). The test is based on the differences between the estimated coefficients of the random effects and fixed effects estimation. The Hausman test statistic is given by

$$H = \left(\hat{\beta}_{RE} - \hat{\beta}_{FE}\right)'\left[V(\hat{\beta}_{FE}) - V(\hat{\beta}_{RE})\right]^{-1}\left(\hat{\beta}_{RE} - \hat{\beta}_{FE}\right).$$

Under the null hypothesis it asymptotically follows a chi-squared distribution with $\mathrm{rank}(V(\hat{\beta}_{FE}) - V(\hat{\beta}_{RE}))$ degrees of freedom. If the null hypothesis holds true, the random effects estimator is consistent and efficient, whereas the fixed effects estimator is only consistent. Therefore, in this case, the random effects estimator is preferred. On the other hand, if the null hypothesis is rejected, only the fixed effects estimator is consistent, and thus preferred.
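As an illustration of the test, the sketch below evaluates the Hausman statistic in the simplest case where a single coefficient is compared, so that the quadratic form reduces to a scalar ratio and the reference distribution is chi-squared with one degree of freedom. The function name `hausman_scalar` and all numbers are hypothetical and are not results of this thesis.

```python
import math


def hausman_scalar(beta_re, beta_fe, var_re, var_fe):
    """Hausman statistic and p-value when a single coefficient is compared."""
    diff_var = var_fe - var_re
    if diff_var <= 0:
        raise ValueError("V(beta_FE) - V(beta_RE) must be positive")
    h = (beta_re - beta_fe) ** 2 / diff_var
    # Survival function of a chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(h / 2.0))
    return h, p_value


# Made-up estimates, for illustration only.
h_stat, p = hausman_scalar(beta_re=1.8, beta_fe=2.0, var_re=0.004, var_fe=0.014)
```

With these made-up inputs the statistic is close to 4 and the p-value is just below 0.05, so the null hypothesis would be rejected at the 5% level and the fixed effects estimator would be preferred.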

4.1.3 Forecasting

a textbook treatment can be found in Baltagi (2008a). Wansbeek and Kapteyn (1978) showed that, in the case of the random effects method, in which the regression and error process parameters are known, the best linear unbiased predictor for period $T+s$ at time $T$ is

$$\hat{y}_{i,T,s} = x_{i,T+s}'\hat{\beta}_{GLS} + \theta\bar{\hat{u}}_i, \qquad (4.6)$$

where $\theta = T\sigma_c^2/(T\sigma_c^2 + \sigma_u^2)$. In practice, when the parameters are unknown, Baillie and Baltagi (1999) recommend using the fixed effects predictor and the ordinary predictor. The fixed effects predictor is given by

$$\tilde{y}_{i,T,s} = x_{i,T+s}'\hat{\beta}_{FE} + \tilde{c}_i, \qquad (4.7)$$

where the individual effects are estimated by $\tilde{c}_i = \bar{y}_i - \bar{x}_i'\hat{\beta}_{FE}$, where $\bar{y}_i$ and $\bar{x}_i$ are the time-averages for individual $i$. Lastly, the ordinary predictor is given by

$$\hat{y}_{i,T,s} = x_{i,T+s}'\hat{\beta}_{RE} + \hat{\theta}\bar{\hat{u}}_i, \qquad (4.8)$$

where the estimated parameters and residuals are obtained by random effects estimation.
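The fixed effects predictor (4.7) can be sketched for a single regressor as follows. The function name `fe_forecast` and the example history are hypothetical; the sketch only shows how the estimated individual effect enters the forecast.

```python
def fe_forecast(xs, ys, beta_fe, x_future):
    """Forecast y for one individual with the fixed effects predictor (4.7).

    xs, ys: observed scalar regressor and outcome histories of the individual.
    """
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    c_tilde = y_bar - x_bar * beta_fe  # estimated individual effect
    return x_future * beta_fe + c_tilde


# Made-up history with a time trend as the only regressor (t = 1, 2, 3).
prediction = fe_forecast([1, 2, 3], [12, 14, 16], beta_fe=2.0, x_future=15)
```

In this toy example the estimated individual effect is 10, so the forecast for period 15 is 40.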

4.1.4 Evaluating Model Performance

Although applied econometrics is usually concerned with inference between variables, in this thesis the most important quality of the models is the predictive ability. To evaluate this ability, the models are estimated on the first year of data, such that the last year of data can be used to compare the forecast with realized values. To match the setting of the final problem, customers who were able to renew in the last year are selected. The forecast of the test data is 12 months ahead, but with less available training data than in the final problem. Therefore, the estimated performance can be seen as the minimum performance of the final problem.

Two measures are used to evaluate the predictive ability of the models. Firstly, the mean absolute error is used, which is the average absolute deviation of the predicted value from the realized value. The mean absolute error is given by

$$\mathrm{MAE} = \frac{1}{N}\sum_{i,t}\left|\hat{y}_{it} - y_{it}\right|. \qquad (4.9)$$

The MAE does not consider the error relative to the actual value. However, a 1GB error has more implications for a customer with a 1GB monthly data usage than for a customer with a 10GB monthly data usage. Therefore, the mean absolute percentage error is used as a second measure. It is given by

$$\mathrm{MAPE} = \frac{1}{N}\sum_{i,t}\left|\frac{\hat{y}_{it} - y_{it}}{y_{it}}\right|. \qquad (4.10)$$
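Both measures are straightforward to compute. The sketch below implements them for flat lists of predicted and realized values and reproduces the 1GB example from the text; the function names and the numbers are illustrative, and the averages are taken over the number of observations in the lists.

```python
def mae(predicted, realized):
    """Mean absolute error over paired predictions and realizations."""
    return sum(abs(p - r) for p, r in zip(predicted, realized)) / len(realized)


def mape(predicted, realized):
    """Mean absolute percentage error; undefined if a realized value is zero."""
    return sum(abs((p - r) / r) for p, r in zip(predicted, realized)) / len(realized)


# The 1 GB error example from the text, in GB: same absolute error,
# very different relative error.
light_user = mape([2.0], [1.0])    # 1 GB error on 1 GB usage
heavy_user = mape([11.0], [10.0])  # 1 GB error on 10 GB usage
```

Here the absolute error is 1 GB in both cases, while the relative error is 100% for the light user and 10% for the heavy user.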

4.2 Model the Effect of Bundle Utilization on Churn

Kamalraj and Malathi (2013) find that the three most-commonly used methods to model churn in the communications sector are neural networks, decision tree methods and logistic regression. The latter two are used in this thesis, because they are easily implemented. In this research step, the panel data structure is ignored. Therefore, the models do not allow for heterogeneity among customers or over time. This section explains the theory regarding the logit model and random forest, and provides a metric that will be used to measure the predictive ability of the models.

Logit model

In econometrics it is common practice to use a logit model when the dependent variable is binary. The logit model is specified by

$$P(y_i = 1 \mid x_i) = \frac{\exp(x_i\beta)}{1 + \exp(x_i\beta)}, \qquad (4.11)$$

where $y_i$ is an indicator for churn in the next month (see the variable churn_next_month) for individual $i$ and $x_i$ are the explanatory variables. The variables in the model are selected by the step-wise procedure of Venables and Ripley (2002). This procedure is based on the Akaike Information Criterion

$$\mathrm{AIC} = -2LL + 2K, \qquad (4.12)$$

where LL is the log-likelihood of the model and K the number of parameters. Hence, the AIC makes a trade-off between model simplicity and model fit. Initially, all variables are included in the model, but in each step of the procedure a variable is omitted, as long as it improves the AIC.
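The two formulas above can be sketched as follows. The function names, coefficients and log-likelihood values are hypothetical; the sketch is meant only to show how the logit probability is computed and how the AIC trades off fit against the number of parameters.

```python
import math


def churn_probability(x, beta):
    """Logit probability P(y = 1 | x) for feature list x and coefficients beta."""
    index = sum(xj * bj for xj, bj in zip(x, beta))
    # Algebraically equal to exp(index) / (1 + exp(index)).
    return 1.0 / (1.0 + math.exp(-index))


def aic(log_likelihood, n_params):
    """Akaike Information Criterion; lower values are preferred."""
    return -2.0 * log_likelihood + 2.0 * n_params


# Made-up step of the backward procedure: dropping one variable lowers the
# log-likelihood by 0.5, which is less than the penalty saved.
full_model = aic(-100.0, 3)
reduced_model = aic(-100.5, 2)
```

In this made-up step the reduced model has the lower AIC, so the variable would be omitted.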

Random Forest

In machine learning, random forests are often used to model binary variables. These models may have a better predictive ability than logit models, but are also referred to as a ‘black box’ because of the lack of insights they provide. The remainder of this section provides the principles of random forests, which were originally formulated by Breiman (2001), following the textbook approach of Suthaharan (2016) and James, Witten, Hastie, and Tibshirani (2013).

Decision trees are the building blocks of a random forest. Figure 4.1 shows the structure of a decision tree. The branches connect the starting node, at the top of the tree, to internal nodes. The terminal nodes or leaves, at the bottom of the tree, are reached by following the branches and passing by several internal nodes. Since the dependent variable is binary here, the predicted value in each terminal node is binary as well. At each internal node a decision rule based on one of the explanatory variables is applied, which divides the data into two or more sub-domains. The explanatory variables and cuts are chosen such that the accuracy in the terminal nodes is maximized.

FIGURE 4.1: A diagram of a decision tree. All dots represent nodes of the tree, where the top node is the starting point, and the bottom nodes, or leaves, are the endpoints.

at each node only a random sample of m explanatory variables is considered, without replacement. The result is a set of decision trees, which is called a random forest. When a random forest is used to predict on a new data set, each observation is pulled through all the trees, and a majority vote is used to determine the final prediction (Varian, 2014).
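The majority vote described above can be sketched in a few lines. The function name `majority_vote` and the votes are made up; fitting the trees themselves is outside the scope of this sketch.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the most common class among the predictions of the trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Made-up votes of five trees for one customer (1 = churn, 0 = no churn).
final_prediction = majority_vote([0, 1, 0, 0, 1])
```

With a binary outcome, using an odd number of trees avoids ties in the vote.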

Evaluating Model Performance

The most intuitive measure of predictive ability is accuracy, which is the fraction of observations that is correctly predicted. However, this measure is misleading if the dependent variable is highly unbalanced. This problem originates from medical research, in which predictions of rare diseases among patients were made. In the prediction of a disease occurring in only 5% of patients, a model that only predicts negative outcomes has a 95% accuracy. Although the model’s accuracy is high, the model is useless. In the case of churn, the outcome is highly unbalanced as well, and hence the accuracy can be misleading.

To overcome the misleading effect of accuracy, the Receiver Operating Characteristic (ROC) curve was developed. The definition and interpretation of an ROC curve are described here, following Metz (1978). The curve is based on sensitivity and specificity, which are given by

$$\text{Sensitivity} = \frac{\text{Number of True Positives}}{\text{Number of Actual Positives}}, \text{ and}$$

$$\text{Specificity} = \frac{\text{Number of True Negatives}}{\text{Number of Actual Negatives}},$$

TABLE 4.2: Definitions of true and false, positives and negatives.

                       Actual Churn      Actual Non-churn
Predicted Churn        True Positive     False Positive
Predicted Non-churn    False Negative    True Negative
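The definitions in Table 4.2 and the 5% example from the text can be combined in a short sketch. The function name `confusion_metrics` and the data are illustrative; the data mimic an outcome that occurs in 5% of cases and a model that always predicts the negative class.

```python
def confusion_metrics(actual, predicted):
    """Accuracy, sensitivity and specificity for binary labels (1 = positive)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    positives = sum(actual)
    negatives = len(actual) - positives
    return {
        "accuracy": (tp + tn) / len(actual),
        "sensitivity": tp / positives,
        "specificity": tn / negatives,
    }


# 5% positives, and a model that always predicts the negative class.
actual = [1] * 5 + [0] * 95
always_negative = [0] * 100
metrics = confusion_metrics(actual, always_negative)
```

The accuracy is 95% and the specificity is 1, but the sensitivity is 0: the model does not identify a single positive case.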

Three examples of ROC curves are shown in Figure 4.2. The left panel of this figure shows the curve for a perfect model (area under the curve, AUC=1), whereas the right panel corresponds to a model that is random (AUC=0.5). The panel in the middle corresponds to a model whose predictive ability is in between. Although an AUC below 0.5 could occur, reversing the labels would ensure that the AUC is always at least 0.5.

FIGURE 4.2: Examples of an ROC curve.
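The AUC values mentioned above can be reproduced with a small sketch. It uses the known equivalence between the area under the ROC curve and the fraction of (positive, negative) pairs in which the positive observation receives the higher score, with ties counted as one half. The function name `auc` and the scores are made up for illustration.

```python
def auc(actual, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0]
perfect = auc(labels, [0.9, 0.8, 0.2, 0.1])        # churners ranked above non-churners
uninformative = auc(labels, [0.5, 0.5, 0.5, 0.5])  # identical scores for everyone
inverted = auc(labels, [0.1, 0.2, 0.8, 0.9])       # churners ranked below non-churners
```

The inverted model has an AUC of 0; reversing its predictions turns it into the perfect model, which is the sense in which an AUC below 0.5 can always be avoided.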

4.3 Evaluate Sets of Bundle Sizes

This section discusses how a recommendation on the set of bundle sizes is formed based on the results from the previous two research steps. Since computational time allows for it and implementation is simple, the sets of bundle sizes are evaluated one by one, rather than using an optimization algorithm. Bundle sizes of 1, 2, . . . , 9, 10, 15, 20, . . . , 50 and 100 GB are considered for evaluation. Since the company offers five bundle sizes, all possible combinations of five bundle sizes are considered. Hence, more than 11000 sets of bundle sizes are evaluated one by one. Then, for each customer the churn probability is computed for each of the bundle sizes. For this purpose the best churn model from the second research step is used, as well as the predicted data usage of the first research step. The churn model uses the predicted MB usage at the moment of renewal as explanatory variable. The renewal moment is assumed to be at the moment the contract ends, or if this has already happened, the first month. Furthermore, the bundle size and type are adjusted, as well as the (3-month average) bundle utilization.
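The number of candidate sets mentioned above can be verified by enumerating them. The sketch below only builds the sets; the variable names are illustrative, and the churn evaluation of each set is not included.

```python
from itertools import combinations
from math import comb

# Candidate sizes in GB: 1..10, then 15, 20, ..., 50, and 100.
candidate_sizes = list(range(1, 11)) + list(range(15, 51, 5)) + [100]

# Every way of offering five distinct bundle sizes.
bundle_sets = list(combinations(candidate_sizes, 5))

n_sizes = len(candidate_sizes)
n_sets = len(bundle_sets)
```

There are 19 candidate sizes and therefore 11628 sets of five, which is small enough for exhaustive evaluation.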


here. By assuming that customers are rational, it could be argued that customers choose the bundle size for which their churn probability is minimal. However, after testing this decision rule on past renewals, an accuracy of only 15% is found. Therefore, a simpler decision rule is used in this thesis. The rule is as follows:

1. If possible, customers choose the smallest bundle that is at least as large as their current bundle.

2. If that is not possible, customers choose the largest bundle that is offered.

After testing this decision rule on historical renewals, accuracies of 65% and 59% are found for the current and previous set of bundle sizes, respectively.
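The two-step decision rule can be written down as a short function. This is a sketch under the assumption that bundle sizes are compared in GB; the function name and the example set are illustrative.

```python
def choose_bundle(current_gb, offered_gb):
    """Pick the renewal bundle according to the two-step rule."""
    # Step 1: smallest offered bundle that is at least the current size.
    candidates = [b for b in offered_gb if b >= current_gb]
    if candidates:
        return min(candidates)
    # Step 2: otherwise fall back to the largest offered bundle.
    return max(offered_gb)


offered = (1, 4, 9, 15, 25)
same_size = choose_bundle(4, offered)   # current size is still offered
next_up = choose_bundle(5, offered)     # moves up to the next size
capped = choose_bundle(40, offered)     # nothing large enough is offered
```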


Chapter 5

Results

This chapter discusses the results that are obtained in this thesis. It starts with the results of the forecast of mobile data usage. Then the findings regarding the relation between churn and bundle utilization are addressed. The chapter concludes with the main result, which is a recommendation on the set of bundle sizes to be offered.

5.1 Forecast Data Usage

In this section, the findings related to future data usage are discussed. First of all, the differences between the model estimations are interpreted. Then, the predictive abilities of these models are evaluated. Based on these evaluations, one model specification is chosen to be used for the forecast that is used in the last research step. Lastly, the estimation of this chosen model specification on the main data set is interpreted.

Model Testing

The estimation results of the OLS benchmark model, as well as six panel data models with homogeneous and categorized trends, can be found in Table A.2 and Table A.1 of Appendix A. The models with categorized trend are not estimated by random effects because extremely long computation times were encountered. For the panel data models with log-linear specifications, the reported coefficients are multiplied by a factor of 100 to improve legibility. Because of this transformation and the fact that the models are log-linear, the reported values can be interpreted as percentages. That is, for each reported value v we can state that a unit increase in the regressor results in approximately a v% increase in the dependent variable, keeping other factors constant.
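The percentage interpretation follows from a short derivation, given here as a sketch. The symbols y (dependent variable), x (a regressor) and β (its unscaled coefficient) are notation introduced for this illustration, with v = 100β the reported value.

```latex
\[
\ln y = \dots + \beta x
\quad\Longrightarrow\quad
\frac{y(x+1)}{y(x)} = e^{\beta},
\qquad
\%\Delta y = 100\,\bigl(e^{\beta} - 1\bigr) \approx 100\,\beta = v
\quad \text{for small } \beta .
\]
```

For example, a reported value of v = 2 corresponds to β = 0.02 and an exact effect of 100(e^{0.02} − 1) ≈ 2.02%, so the approximation is accurate for coefficients of this magnitude.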

For the models with a homogeneous trend, some differences between the fixed and random effects estimations are found. The effect of the year of birth is estimated significantly larger in a random effects estimation, both for the linear and log-linear specification. Furthermore, the random effects estimation of the time effect is negative in the linear model, which contradicts both the fixed effects estimation and the expectations based on a preliminary analysis.


leads to a 0.2% increase in data usage, whereas in the model with a homogeneous time effect this was only 0.001%.

For the panel data models with homogeneous time trend specification, the Hausman test is performed. The Hausman statistics, with corresponding p-values, are reported in Table A.2. For both the linear and the log-linear model, the p-value is practically zero. This implies that the null hypothesis, that the error term is uncorrelated with the regressors, can be rejected with a high level of certainty. Hence, the fixed effects estimator should be used to obtain consistent estimates, since the random effects estimator is inconsistent.
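For reference, the textbook form of the Hausman statistic is sketched below. The notation is an assumption of this illustration and not necessarily the notation used elsewhere in the thesis: the two coefficient vectors are the fixed and random effects estimates of the coefficients that both models identify, and k is the number of such coefficients.

```latex
\[
H = \bigl(\hat\beta_{FE} - \hat\beta_{RE}\bigr)^{\top}
    \Bigl[\widehat{\mathrm{Var}}(\hat\beta_{FE}) - \widehat{\mathrm{Var}}(\hat\beta_{RE})\Bigr]^{-1}
    \bigl(\hat\beta_{FE} - \hat\beta_{RE}\bigr)
\;\overset{a}{\sim}\; \chi^{2}_{k}
\quad \text{under } H_0 .
\]
```

A large value of H, and hence a p-value close to zero, indicates that the two estimators differ systematically, which is the situation described above.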

Predictive Ability

For all the models, the performances with respect to the Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) are reported in Table 5.1. For detailed information regarding the absolute error and absolute percentage error of the model estimations see Table 5.1. By comparing the MAE, it is found that all panel model specifications outperform the benchmark models, especially the OLS model. However, the differences between the panel model specifications are small. Overall, the fixed effects estimations perform better than the random effects estimations. Although the difference is small, the models with a categorized trend perform better than those with a homogeneous trend. The models with linear specification also have a slightly lower MAE than the models with a log-linear specification. Overall, the model with linear specification and a categorized trend performs best, with an improvement of roughly 50% on the OLS benchmark and 25% on the business benchmark. However, it is closely followed by all other fixed effects estimations. Also, the MAE of this model is about 1307 MB, indicating that the average error in predicting data usage is still 1.3 GB. Hence, the panel data models are a substantial improvement on the benchmark models, but the MAE is still relatively high.

TABLE 5.1: The mean absolute (percentage) error of the benchmark models and panel data models.

Model
Relation     Trend         Estimation   MAE (MB)    MAPE
Bench                                   1731.260     4.557
OLS                                     2583.461    10.217
Linear       Homogeneous   Fixed        1318.144     7.122
Linear       Categorized   Fixed        1306.886     5.993
Log-linear   Homogeneous   Fixed        1360.768     4.005
Log-linear   Categorized   Fixed        1353.359     3.964
Linear       Homogeneous   Random       1439.215     6.723
Log-linear   Homogeneous   Random       1568.022     3.489
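The two error measures and the quoted improvements can be reproduced with a few lines. This is a sketch: the function names are illustrative, the MAPE is returned as a fraction, and observations with zero actual usage are skipped, which is an assumption since the text does not state how the thesis handles them. The four error values are taken from Table 5.1.

```python
def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error as a fraction (zero actuals skipped)."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    return sum(abs(a - p) / abs(a) for a, p in pairs) / len(pairs)


def relative_improvement(benchmark_error, model_error):
    """Share by which the model reduces the benchmark error."""
    return (benchmark_error - model_error) / benchmark_error


# MAE values from Table 5.1: linear model with categorized trend vs. benchmarks.
vs_ols = relative_improvement(2583.461, 1306.886)
vs_business = relative_improvement(1731.260, 1306.886)
```

The two ratios come out at roughly 0.49 and 0.25, in line with the improvements reported for the linear model with a categorized trend.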


specification performs best, with a 65% improvement on the OLS model, and a 24% improvement on the business benchmark model.

Lastly, the evolution of the MAE and MAPE over time is investigated using Figure 5.1. As expected, the MAE deteriorates over time for all models. Surprisingly, the OLS benchmark seems to have the least deterioration over time. Also, it is found that in the first few periods the MAE of the business benchmark model is approximately equal to that of the panel data models. However, the decrease in accuracy is faster for this benchmark model, indicating that the panel data models are better at predicting changes over time. Again, the differences between the panel data model specifications are minimal. With respect to the MAPE, more differences appear among the panel data models, and again the benchmark model outperforms some of the panel data models. From the MAPE it is found that the log-linear models perform better, whereas the differences with respect to the MAE are barely observable.

FIGURE 5.1: The mean absolute (percentage) error evolving over time.

Based on the evaluations with respect to MAE, MAPE and performance over time, the log-linear model with a categorized trend is chosen to be used for estimation on the main data set. This model shows a substantial improvement over the benchmark models, although it improves on the other panel data models only with respect to the MAPE.

Final Model Estimation

The last results regarding the forecast of future data usage are obtained from the fixed effects estimation of the log-linear model with a categorized trend on the main data set. The estimation results are reported in Table 5.2. All coefficients are highly significant, except for the one for the interaction between the prewar generation and time. However, the interaction variables are found to be jointly significant.


TABLE 5.2: Estimation results of the final model on the main data set.

Dependent Variable: MB usage
Estimation Method: Fixed Effects

                                 Coefficient   Std. Error
Year of Birth                    0.432∗∗∗      0.074
Bundle Size                      0.002∗∗∗      0.00001
Time×Generation 1 (Prewar)       0.205         0.186
Time×Generation 2 (Quiet)        1.866∗∗∗      0.047
Time×Generation 3 (Babyboom)     2.354∗∗∗      0.015
Time×Generation 4 (X)            2.257∗∗∗      0.010
Time×Generation 5 (Pragmatic)    1.915∗∗∗      0.011
Time×Generation 6 (Y)            1.242∗∗∗      0.012

Observations                     5323292
R2                               0.03665

Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

multiplied by 100 in the table, the coefficients of the interaction variables can be interpreted as the monthly percentage increase of the usage. So the rate of increase is fastest for the baby-boom generation and generation X, with 2.4% and 2.3% growth per month, respectively. The rate of increase is slightly slower for the quiet, pragmatic and Y generations. However, for the prewar generation, there is almost no increase in data usage. This is not surprising since these customers, with a minimum age of 90, might not be picking up new habits, and if they do change their behaviour, they are more likely to stop using mobile phones than to use them more.

Furthermore, it is found that a 1 GB larger mobile internet bundle leads to a 2% increase in data usage. This effect is small and could indicate that customers do not buy larger data bundles because they actually need them, but rather because they want a feeling of safety. Lastly, it is estimated that being born 10 years later leads to a 4% higher data usage.

5.2 The Effect of Bundle Utilization on Churn


Logit model

A selection of the regression results of the logit model predicting churn is reported in Table 5.3. The complete regression results are not disclosed in this version. Four variables are of main interest here: mobile data usage, mobile internet bundle, bundle utilization, and its 3-month average. These variables are most important because they are the only variables that are changed by design in the last research step. The estimated coefficients corresponding to the bundle utilization and its 3-month average are visualised in Figure 5.2.

FIGURE 5.2: Estimated effects of (3-month average) bundle utilization on churn by the logit model, where the points are the estimates and the bars indicate the standard errors. Panels: (A) Bundle Utilization; (B) Bundle Utilization 3-month average.

First of all, the churn probability is increasing with bundle utilization. This implies that customers are more satisfied when they do not reach the cap of their mobile internet bundle. The most obvious reason for this is that no additional costs or speed limitations are experienced. However, this does not explain why the satisfaction is also higher for customers using only 20% of their bundle than for customers using 80% of their bundle, while in both cases no limitations occur. Another reason could be that customers who use 80% of their bundle know that they should control their data usage during the month, or else they will reach the limit, whereas customers with 20% bundle utilization have room for unanticipated behaviour.


TABLE 5.3: Selection of estimated coefficients for the logit model predicting churn probability. For categorical variables the base level is indicated in brackets underneath the variable.

Dependent variable: churn

Variable   Level   Coefficient   Standard Error

Other variables are not disclosed in this version.

Observations        1000000
Akaike Inf. Crit.   98304.150

Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

Furthermore, higher mobile data usage reduces the churn probability. Lastly, the churn probability increases slightly with larger mobile internet bundles.

The remaining explanatory variables remain constant in the last research step, and thus do not affect the evaluation of sets of bundle sizes directly. However, they are necessary to isolate the effect of the main variables of interest that were discussed previously. The interpretation of these coefficients is not disclosed in this version.

Initially, more variables were considered for inclusion in the model. Stepwise selection was performed based on the AIC. This resulted in the omission of several socio-demographic variables, as well as gender, and in a reduction in AIC from 98316 to 98304.
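As a reminder of the criterion behind this selection, the standard definition of the AIC is sketched below, where k denotes the number of estimated parameters and \hat{L} the maximized likelihood. This notation is introduced for the illustration only.

```latex
\[
\mathrm{AIC} = 2k - 2\ln\hat{L},
\qquad
\Delta\mathrm{AIC} = 98316 - 98304 = 12 .
\]
```

Dropping a variable lowers k by the number of parameters it carried, so the AIC decreases whenever the resulting loss in log-likelihood is smaller than that number of parameters. The total reduction of 12 points is therefore modest relative to the level of the criterion.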

FIGURE 5.3: Receiver Operating Characteristic (ROC) curve for the logit model and random forest estimations. Panels: (A) Logit model; (B) Random Forest.

Predictive Ability

For both the logit model estimation and the random forest, the ROC curves are computed, see Figure 5.3. From these plots it can be seen that the curve of the logit model is slightly higher than that of the random forest, and thus better. In fact, the AUC of the logit model is 0.7069, whereas the AUC of the random forest is 0.6346. Hence, in this application the random forest performs worse than the logit model. This could be explained by the limited number of variables that are used here. The logit model is used in the final research step, since its predictive ability is better and it is easier to interpret than the random forest.

5.3 Recommendation on the Set of Bundle Sizes


smallest bundle size that is at least as large as their current bundle size, and that they do so at the end of the contract. The bundle sizes that are chosen to evaluate here are 1, 2, ..., 9, 10, 15, 20, ..., 45, 50 and 100 GB. Making all possible combinations of 5 of these bundle sizes results in more than 11000 sets of bundle sizes.

TABLE 5.4: Three sets of bundle sizes with the lowest churn score. The percentage of renewing customers that will choose the bundle size is indicated below each bundle size.

Size 1   Size 2   Size 3   Size 4   Size 5   Churn Score
2        15       20       25       30       0.730
(56.1)   (37.6)   (0)      (5.5)    (0.5)
2        3        15       25       30       0.730
(56.1)   (0.0)    (37.6)   (5.5)    (0.5)
2        4        15       25       30       0.730
(56.1)   (0.0)    (37.6)   (5.5)    (0.5)

Note: All numbers are reported in percentages, except for the bundle sizes.

For each of the sets of bundle sizes, the churn score is computed by taking the average churn probability according to the logit model with the forecast of data usage as input. Note that this score cannot be interpreted as the actual churn, but rather as a measure of customer satisfaction. Table 5.4 reports the three sets of bundle sizes with the lowest churn score of 0.73%. It may seem surprising that the top three sets of bundle sizes have the exact same churn score. However, it is observed that all of these sets include the bundle sizes 2, 15, 25, and 30 GB, and that the fifth bundle size is not chosen at all. This explains why the churn score is the same for all three sets. It also implies that the company would be just as well off offering four bundle sizes instead of five. In this project, however, the company prefers to offer five bundle sizes. If the customers are evenly spread among the bundle sizes, the risk is smaller: if most customers prefer a certain bundle size, but a competitor offers the same bundle size at a lower price, the company would lose a large part of its customers. Hence, it is preferred that customers are evenly spread among the bundle sizes.
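The computation of the churn score for one set of bundle sizes can be sketched as follows. The renewal rule is the one from Section 4.3. The function `toy_churn_probability` is a hypothetical stand-in for the fitted logit model, with made-up coefficients, since the actual coefficients are not disclosed. The two customers are invented as well.

```python
import math


def choose_bundle(current_gb, offered_gb):
    """Renewal rule: smallest offered bundle >= current size, else the largest."""
    candidates = [b for b in offered_gb if b >= current_gb]
    return min(candidates) if candidates else max(offered_gb)


def toy_churn_probability(usage_gb, bundle_gb):
    """Hypothetical logit: churn rises with utilization (coefficients made up)."""
    utilization = usage_gb / bundle_gb
    return 1.0 / (1.0 + math.exp(-(-5.0 + 2.0 * utilization)))


def churn_score(customers, offered_gb, churn_probability=toy_churn_probability):
    """Average predicted churn probability over all customers for one set."""
    probs = []
    for current_gb, predicted_usage_gb in customers:
        bundle = choose_bundle(current_gb, offered_gb)
        probs.append(churn_probability(predicted_usage_gb, bundle))
    return sum(probs) / len(probs)


# Invented customers: (current bundle in GB, forecast usage in GB).
customers = [(2, 1.0), (10, 12.0)]
score = churn_score(customers, (1, 4, 9, 15, 25))
```

Repeating this call for every candidate set and sorting by the resulting score gives a ranking of the kind shown in Table 5.4. In the thesis the model additionally uses the 3-month average utilization and other covariates, which this sketch leaves out.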


TABLE 5.5: Best performing set of bundle sizes for different values of α, the minimal percentage of customers per bundle size.

α     Size 1   Size 2   Size 3   Size 4   Size 5   Churn Score
0.1   2        9        15       25       30       0.730
      (56.1)   (2.6)    (35.1)   (5.6)    (0.6)
1     1        4        9        15       25       0.758
      (37.1)   (19.1)   (2.6)    (35.1)   (6.2)
2     1        4        9        15       25       0.758
      (37.1)   (19.1)   (2.6)    (35.1)   (6.2)

Note: All numbers are reported in percentages, except for the bundle sizes.

Summarizing, the company faces a trade-off between an even spread of customers among bundle sizes and a lower churn rate. However, the increase in churn rate when setting higher requirements for the spread is minimal. Therefore, the recommendation is to offer the more evenly spread set of bundle sizes, that is, 1, 4, 9, 15, and 25 GB.
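The selection under the spread requirement α can be illustrated with the values reported in Tables 5.4 and 5.5. The sketch below only considers three of the tabulated sets, not the full collection of evaluated sets, and the function name is illustrative.

```python
# (bundle sizes in GB, share of renewing customers per size in %, churn score in %)
candidates = [
    ((2, 15, 20, 25, 30), (56.1, 37.6, 0.0, 5.5, 0.5), 0.730),
    ((2, 9, 15, 25, 30), (56.1, 2.6, 35.1, 5.6, 0.6), 0.730),
    ((1, 4, 9, 15, 25), (37.1, 19.1, 2.6, 35.1, 6.2), 0.758),
]


def best_set(candidates, alpha):
    """Lowest churn score among sets where every size attracts >= alpha percent."""
    feasible = [c for c in candidates if min(c[1]) >= alpha]
    return min(feasible, key=lambda c: c[2]) if feasible else None


best_01 = best_set(candidates, 0.1)
best_1 = best_set(candidates, 1)
best_2 = best_set(candidates, 2)
```

Among these three sets, α = 0.1 selects the set 2, 9, 15, 25, 30 GB, while α = 1 and α = 2 both select 1, 4, 9, 15, 25 GB, which agrees with Table 5.5.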

Furthermore, the density of the churn scores for all the bundle sets is shown in Figure 5.4. Here, we observe that a small share of the sets (about 1% of all bundle sets) has a very high churn score and is therefore unattractive. These are all sets in which the largest bundle is 100 GB while the second-largest bundle is smaller than 10 GB. Hence, according to our renewal decision rule, some customers are forced into an enormous bundle, for which their bundle utilization might be zero, and their churn probability is high.

FIGURE 5.4: Density of the churn scores for all sets of bundle sizes.
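The size of this group of extreme sets can be checked with a short count, sketched here under the reading that such a set consists of the 100 GB bundle plus four bundles chosen from the nine sizes below 10 GB.

```python
from math import comb

total_sets = comb(19, 5)    # all sets of five out of 19 candidate sizes
extreme_sets = comb(9, 4)   # 100 GB plus four sizes from 1, ..., 9 GB
share = extreme_sets / total_sets
```

This gives 126 of the 11628 sets, slightly more than 1%, which is consistent with the share visible in the density plot.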


Chapter 6

Conclusion

6.1 Summary

This thesis has developed a methodology for telecommunication firms to determine the sizes of the mobile internet bundles they offer such that the churn rate will be reduced. For this purpose, three research steps have been performed: forecasting mobile data usage, estimating the effect of bundle utilization on churn, and lastly combining these insights to evaluate several sets of bundle sizes. This thesis contributes to the literature by developing a practical approach to determine bundle sizes using customer data. Furthermore, the methodology can be easily applied by telecommunications companies in practice if the availability of data is sufficient.

In the first research step, several panel data models are tested for their ability to forecast data usage on an individual level. It is found that these models outperform both the OLS benchmark and the business benchmark. Although the trend coefficients, differentiated along generations, show significant differences, the improvement in prediction accuracy with respect to models with a homogeneous trend is small. Generally, the differences between the predictive abilities of all panel data models are small. The log-linear model with a differentiated trend is used on the final data set, as it has an improvement of 48% and 61% on the MAE and MAPE of the OLS benchmark, respectively. This model shows that the absolute increase in data usage is larger for younger generations and that the highest relative increase in data usage is found in the baby-boom and X generation. Summarizing, it is possible to predict data usage on an individual level, but, due to the high volatility in data usage behaviour, the estimated mean absolute error is still 1.3 GB.

Secondly, a logit model and a random forest are used to predict churn. Based on the area under the receiver operating characteristic curve, it is found that the logit model is better in predicting churn than the random forest. The estimation shows that a higher (3-month average) bundle utilization increases the probability to churn. Furthermore, bundle size is positively correlated with churn, and MB usage is negatively correlated with churn. Other important drivers for churn, such as the contract type and other products owned at the location, are found as well, but their relevance is not directly related to the purpose of this thesis.


customers. Here, we need to stress that this outcome is specific to the current data set of the telecommunications company considered.

6.2 Discussion

To finalize this thesis, some limitations of this work are discussed, and some directions for future research are proposed. The section starts with the discussion of some general remarks and then treats the limitations of each research step.

First of all, we have implicitly assumed that demand and supply are independent. However, the changing usage patterns for mobile devices are dependent on the mobile plans that are offered (Lee, Cho, Hong, and Yoon, 2016). For example, customers with an unlimited data plan will cause more traffic than customers with a capped data plan, since the latter might limit their usage behaviour once they approach their data cap. Such behaviour exists for voice and text messages, but differs per user type (Andrews, Bruns, Doğru, and Lee, 2014). These authors believe that a similar analysis can be done for mobile data usage. Both findings contradict the assumption that data usage is independent of the offered data plans.

Secondly, the recommended bundle sets should be seen within the scope of this thesis. That is, the data of one telecommunications company is used; for other companies the recommendations could be different. Furthermore, this thesis only considered changing the sizes of the mobile data bundles, but it did not consider other pricing schemes or the number of bundles to offer. Also, the price of the bundles was not taken into account, although the renewal behaviour of customers depends on the prices of the offered bundles. Furthermore, the recommended set of bundle sizes is matched to the preferences of existing customers but does not consider attracting new customers by matching the bundles to their preferences.

In the first research step, some limitations are encountered as well. First of all, the trend component was differentiated along generations; ideally, however, the trend would be individual. This could be done by interacting the unobserved individual effect with time, but this brings along difficulties in estimation. Further research could indicate whether it is possible to estimate such a model and whether this improves the predictive abilities of the model. Also, due to limited data availability, seasonality is not included in the models. However, the preliminary analysis showed some seasonality, and it might be an interesting direction for future research when more data is available.


Lastly, when combining all parts into one recommendation, the main difficulty is in predicting the bundles that are chosen at renewal. For now, a simple business rule is used, but with more data or experiments the accuracy could be further improved. Currently, only 65% of historical renewals are predicted correctly. Also, the fact that this level of accuracy was found on past bundle sets does not necessarily imply that it will hold for the hypothetical new bundles as well. Regarding the renewals, some other major assumptions are made as well: every customer renews, and the renewal takes place at the end of the contract. However, some customers might churn, while others may stay without renewing. In future research, price sensitivity data could improve the results in this step.


Appendix A

Regression Results of Data Usage Forecast

TABLE A.1: Regression results of the panel data models with categorized trend.

Model Relation                   Linear       Log-linear
Estimation Method                Fixed        Fixed

Year of Birth                    −3.241       0.210∗∗
                                 (2.964)      (0.102)
Bundle Size                      0.159∗∗∗     0.001∗∗∗
                                 (0.0003)     (0.00001)
Time×Generation 1 (Prewar)       −15.952∗     0.774∗∗∗
                                 (8.581)      (0.296)
Time×Generation 2 (Quiet)        −5.114∗∗     2.148∗∗∗
                                 (2.367)      (0.082)
Time×Generation 3 (Babyboom)     −5.927∗∗∗    2.311∗∗∗
                                 (0.761)      (0.026)
Time×Generation 4 (X)            18.230∗∗∗    1.971∗∗∗
                                 (0.508)      (0.018)
Time×Generation 5 (Pragmatic)    16.189∗∗∗    1.448∗∗∗
                                 (0.583)      (0.020)
Time×Generation 6 (Y)            68.323∗∗∗    0.853∗∗∗
                                 (0.650)      (0.022)

Observations                     5145400      5145400
R2                               0.07721      0.01306


Bibliography

Andrews, M., G. Bruns, M. Doğru, and H. Lee (2014). Understanding quota dynamics in wireless networks. ACM Transactions on Internet Technology (TOIT) 14(2-3), 14:1–14:17.

Baillie, R.T. and B.H. Baltagi (1999). Prediction from the regression model with one-way error components. In C. Hsiao, M.H. Pesaran, K. Lahiri, and L.F. Lee (Eds.), Analysis of Panels and Limited Dependent Variable Models, Chapter 10, pp. 255–267. Cambridge: Cambridge University Press.

Baltagi, B. (2008a). Econometric analysis of panel data. John Wiley & Sons.

Baltagi, B.H. (2008b). Forecasting with panel data. Journal of Forecasting 27(2), 153– 173.

Breiman, L. (2001). Random forests. Machine learning 45(1), 5–32.

CBS (2018). Internet; toegang, gebruik en faciliteiten [data file]. https://opendata.cbs.nl/statline/#/CBS/nl/dataset/83429NED/table?ts=1541589343852. Accessed: November 6, 2018.

Cisco (2017). Cisco visual networking index: Global mobile data traffic forecast update, 2016–2021 White Paper.

Gerpott, T.J. (2010). Impacts of mobile internet use intensity on the demand for sms and voice services of mobile network operators: An empirical multi-method study of german mobile internet customers. Telecommunications Policy 34(8), 430–443.

Gizelis, C.A. and D.D. Vergados (2011). A survey of pricing schemes in wireless networks. IEEE Communications Surveys & Tutorials 13(1), 126–145.

Ha, S., C. Joe-Wong, S. Sen, and M. Chiang (2012, 1). Pricing by timing: innovating broadband data plans. In Broadband Access Communication Technologies VI, Volume 8282, pp. 82820D. International Society for Optics and Photonics.

Hausman, J.A. (1978). Specification tests in econometrics. Econometrica: Journal of the econometric society, 1251–1271.

He, L. and J. Walrand (2005). Pricing differentiated internet services. In Proceedings IEEE 24th Annual Joint Conference of the IEEE Computer and Communications Societies, Volume 1, pp. 195–204. IEEE.

James, G., D. Witten, T. Hastie, and R. Tibshirani (2013). An introduction to statistical learning, Volume 112. New York: Springer.


KPN. Sim only abonnement voor 1 of 2 jaar. https://mobielshop.kpn.com/mobiel/sim-only/5-gb-en-onbeperkt-bellen-sms/2-jaar. Accessed: November 9, 2018.

Lee, S., C. Cho, E.K. Hong, and B. Yoon (2016). Forecasting mobile broadband traffic: Application of scenario analysis and delphi method. Expert Systems with Applications 44, 126–137.

Lv, Q. and G.N. Rouskas (2008). On optimal sizing of tiered network services. In INFOCOM 2008-The 27th Conference on Computer Communications, pp. 1822–1830. IEEE.

Lv, Q. and G.N. Rouskas (2009, 11). Internet service tiering as a market segmentation strategy. In GLOBECOM 2009-2009 IEEE Global Telecommunications Conference, pp. 1–6. IEEE.

Lv, Q. and G.N. Rouskas (2010). An economic model for pricing tiered network services. Annals of Telecommunications 65(3-4), 147–161.

Lv, Q. and G.N. Rouskas (2011, 12). On optimal tiered structures for network service bundles. In 2011 IEEE Global Telecommunications Conference -GLOBECOM 2011, pp. 1–5. IEEE.

Metz, C.E. (1978). Basic principles of ROC analysis. In Seminars in nuclear medicine, Volume 8, pp. 283–298. WB Saunders.

Paul, U., A.P. Subramanian, M.M. Buddhikot, and S.R. Das (2011, 4). Understanding traffic dynamics in cellular data networks. In 2011 Proceedings IEEE INFOCOM, pp. 882–890. IEEE.

Rahimi, N. and R. Koosawangsri (2013). Selection of smart phone plans. In 2013 Proceedings of PICMET’13: Technology Management in the IT-Driven Services (PICMET), pp. 426–448. IEEE.

Sen, S., C. Joe-Wong, S. Ha, and M. Chiang (2013). A survey of smart data pricing: Past proposals, current plans, and future trends. ACM computing surveys (CSUR) 46(2), 15:1–15:37.

Shakkottai, S., R. Srikant, A. Ozdaglar, and D. Acemoglu (2008). The price of simplicity. IEEE Journal on Selected Areas in Communications 26(7), 1269–1276.

Spangenberg, F. and M. Lampert (2013). De grenzeloze generatie: en de eeuwige jeugd van hun opvoeders. Nieuw Amsterdam.

Sugiyama, K., J. Urakawa, M. Taya, A. Yamada, A. Kobayashi, and A. Tagami (2016, 4). Empirical analysis of customer behavior for tiered data plans in mobile market. In 2016 IEEE Conference on Computer Communications Workshops, pp. 389–394. IEEE.

Suthaharan, S. (2016). Machine learning models and algorithms for big data classification. In Integrated Series in Information Systems, Volume 36. Springer US.

T-Mobile. Go sim only. https://www.t-mobile.nl/shop/product/go-sim-only?ch=es&cc=con&sc=acq&dr=24&pr=GOP21,GAP23. Accessed: November 9, 2018.

Tele2. 4G sim only abonnement. https://www.tele2.nl/mobiel/sim-only/.

Varian, H.R. (2014). Big data: New tricks for econometrics. Journal of Economic Perspectives 28(2), 3–28.

Venables, W.N. and B.D. Ripley (2002). Modern applied statistics with S. New York: Springer-Verlag.

Vodafone. Mobiele 4G abonnementen - sim only. https://www.vodafone.nl/shop/mobiel/abonnement/. Accessed: November 9, 2018.

Wansbeek, T.J. and A. Kapteyn (1978). The separation of individual variation and systematic change in the analysis of panel data. In Annales de l’INSEE, pp. 659–680. Institut national de la statistique et des études économiques.
