
Prediction through statistical and machine learning: Logistic regression combined with Classification and Regression Trees (CART)

Econometrics and Operational Research

University of Amsterdam

June 26, 2014

Lopamudra Rath

Student number: 10253459

Supervisor: Noud van Giersbergen


Abstract

This thesis primarily investigates the prediction accuracy of two models that combine a logistic regression with a classification and regression tree (CART). In addition, the differences between the two combined logit models are explored, and both are compared with the original trees to assess how well the trees perform. The approach of using combined models as a prediction tool is also compared with logit models containing only variables with highly significant coefficients. Dummy variables induced from a tree might work better in a combined logit model than input variables selected by a tree, but deciding this probably requires sufficiently large datasets with many variables, so that there is enough variation in the data. This limitation does not, however, affect the evaluation of whether combined logit models predict better than the original trees and than the logit models with only highly significant variables; it appears that this is not always the case. It also appears that variable selection by decision trees works quite well, although the technique of selecting variables with highly significant coefficients offers good competition; this finding may likewise have been affected by the relatively small datasets. On the whole, all the models applied in this thesis appear to be good models for predicting failure and might therefore be useful for many companies and institutions.


Table of Contents

1. Introduction
2. Research background
   2.1. Decision trees
   2.2. Logistic regressions
   2.3. Related literature
3. Data and Methods
   3.1. Data
      3.1.1. Wine datasets
      3.1.2. Home sales dataset
   3.2. Methods
4. Results and analysis
   4.1. Estimated models
      4.1.1. White wine dataset
      4.1.2. Red wine dataset
      4.1.3. Home sales dataset
      4.1.4. Prediction accuracies
   4.2. Analysis
5. Conclusion
References
Appendix A
Appendix B


1. Introduction

In quite a few recent papers in the fields of finance and economics, decision trees (DT) and logit models are applied (e.g. Brezigar-Masten & Masten, 2012; Chen, 2011; Cho, Hong & Ha, 2010; Duchessi & Laurìa, 2013; Li, Sun & Wu, 2010; Nie, Rowe, Zhang, Tian & Shi, 2011; Twala, 2010). According to the literature review of Nie et al. (2011), regression and decision trees are even the most popular methods in this line of research.

For business failure prediction, logit models are one of the most popular statistical methods (Li et al., 2010). Chen (2011) uses both decision trees and logistic regression models to analyze corporate financial distress. In the same line, Cho et al. (2010) and Brezigar-Masten and Masten (2012) have tried to combine the two methods to create a good bankruptcy prediction model; the latter also suggest using the combined model to assess credit risk. Twala (2010) has also looked at DT models for credit risk assessment. Duchessi and Laurìa (2013) have used tree models to see what kind of impact promotional and advertising strategies have on sales, which is, however, only slightly related to the field of economics.

There are several reasons why decision trees and logistic regression are such popular methods for analyzing data. According to James, Witten, Hastie and Tibshirani (2013), a few advantages of tree-based methods are that they are easy to explain, realistic, easily interpreted, and able to model complex relationships in datasets; they also handle qualitative as well as quantitative predictors well. Logistic regressions are likewise easy to interpret and relatively simple to explain. The logit model is also very widely used, which is why it can be constructed with almost any kind of statistical software program (Brezigar-Masten & Masten, 2012).

Brezigar-Masten and Masten (2012) and Cho et al. (2010) have attempted to construct a combined bankruptcy prediction model, and the results of these papers seem to achieve just that. In this research I also look at how the combination of these two models performs as far as prediction performance is concerned. Both papers have used variables selected by a classification and regression tree (CART): Cho et al. (2010) have simply used them as input variables, whereas Brezigar-Masten and Masten (2012) have induced dummy variables, which seem to improve the predictive performance of the logit model and to outperform the CART model as well. The main objective of this thesis is to investigate whether the prediction of the logit model with dummy variables is more accurate than that of the logit model with only CART-based selected input variables. Besides that, I also look at the accuracy of the CART models to see how well their prediction performance fares, and at the differences between the estimated logit models. I further look at the prediction accuracies of logit models using only variables with significant coefficients and of logit models with all the variables, to compare this method with the method of using the combined logit model. To do all this, data of a rather non-financial and non-economic character are used, because in the papers of Brezigar-Masten and Masten (2012) and Cho et al. (2010) only data with a financial or economic character have been considered. It might be that financial or economic data share the same kind of characteristics, which might cause the methods of the authors to be less robust on other kinds of datasets.

In this thesis a classification tree (CT) is applied to be as close as possible to the logit model. Next a logit model is created with CT-based selected variables and applied on the same datasets. Afterwards another logit model is constructed with dummy variables induced from the corresponding nodes in the CT model. Also a logit model is built with all the variables to see which variables have significant coefficients. Next these variables are implemented in another logit model. The predictive performance of all the models will then be compared and assessed for each dataset. The other results of the logit models will also be analyzed.

The rest of the thesis is organized as follows. Section 2 gives an overview of the techniques used in this thesis and of relevant papers. The research methodology and the datasets are described in Section 3. The results are reported in Section 4 and conclusions are given in Section 5.

2. Research background

Big data has become a popular topic, because nowadays a lot of information is gathered through transactions that take place online. In this study I also use large amounts of data. To analyze such data one can use statistical or machine learning techniques. Machine learning techniques are used in computer science to manipulate and gain knowledge from complicated and huge datasets; this process of gaining knowledge from large sets of data is referred to as "data mining (DM)" or "knowledge discovery in databases (KDD)" (Chen, 2011). For my research I apply a logistic regression and a decision tree: the former is a statistical technique and the latter a machine learning technique.

2.1. Decision trees

The technique of decision trees is one of the main machine learning techniques applied in data mining. Different algorithms can be used to build a tree model, for example Iterative Dichotomiser 3 (ID3), C4.5/C5.0, CART and the chi-squared automatic interaction detector (CHAID) (Chen, 2011).

In this study the focus is on the CART algorithm, in particular classification trees. CART is a nonparametric model and, as the name suggests, can be applied to classification as well as regression tasks. As James et al. (2013) describe it, the method creates a set of if-then rules that ultimately split the data into different regions R_j in order to give classifications and predictions for a response variable Y. The classification or prediction is the mean (for a regression tree (RT)) or the mode (for a classification tree (CT)) of the observations in the region into which a given observation of the Y-variable falls. These regions are known as "terminal nodes" or "leaves". The nodes where the predictor space (the set of values that the explanatory variables X_j can take on) is split are called "internal nodes", and the parts connecting the nodes are called "branches". Trees can be displayed in "tree form" or in partition plots; the latter is only possible when two variables are used (Varian, 2014). Figure 1 shows an example of a CART model in a tree form and in a partition plot.


Figure 1. An example of a tree form display (left) and a partition plot (right) of a CART model

Recursive binary splitting is applied to build a CART model. In the case of regression trees, this splitting process aims to minimize the residual sum of squares (RSS) of the Y-variable. The method starts at the top of the tree and first chooses a predictor X_j and a cutpoint s in such a way that the RSS is minimized up to that point, which creates two regions. The splitting process is then applied again within each of these regions, creating further regions, and this continues until a certain specified limit is reached. For a classification tree the "classification error rate" is used as the criterion instead of the RSS; this is the percentage of observations in a particular region that do not belong to the most common class in that region. Other, more powerful measures of error are the Gini index and the cross-entropy. According to James et al. (2013) these two measures are more often used for evaluating a split in a tree, whereas the classification error rate is more useful when the focus is on the prediction accuracy after pruning a tree. In this study I use the classification error rate, as the focus of the research is on the prediction performance of the applied models.
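As a minimal sketch of this tree-fitting step in R (not the thesis's Appendix A script), the `tree` package can grow a classification tree from a formula; the data frame `wine` and the binary factor `GoodOrBad` are assumed, illustrative names taken from the setup in Section 3.

```r
# Minimal sketch of fitting a classification tree with the 'tree' package.
# 'wine' and 'GoodOrBad' are assumed (illustrative) object names; see Section 3.
library(tree)

# Fit a classification tree for the binary quality indicator, excluding the
# original numeric quality grade from the predictor set.
ct <- tree(GoodOrBad ~ . - quality, data = wine)

summary(ct)                     # variables actually used and the misclassification rate
plot(ct); text(ct, pretty = 0)  # tree-form display, comparable to Figure 1 (left)
```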

Pruning is useful when one wants to improve the performance of a CART. Trees often overfit the training data and perform poorly on test observations. One can then prune the tree to obtain a subtree that is less complex and has a smaller variance, although pruning introduces a little model bias. To find the optimal subtree one first needs a manageable set of subtrees to choose from. This can be done with cost-complexity pruning (weakest link pruning), which returns a sequence of subtrees as a function of the cost-complexity parameter α. With k-fold cross-validation (CV) one can then choose a value for α; the selected α corresponds to a particular subtree from the sequence, and this subtree has the least test error.
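A rough sketch of this pruning step, assuming the tree object `ct` from the previous snippet and the `tree` package; the number of folds and the seed below are illustrative choices, not taken from the thesis.

```r
# Cost-complexity (weakest link) pruning guided by 10-fold cross-validation.
# 'ct' is the unpruned classification tree from the previous sketch.
set.seed(1)                                              # illustrative seed
cv_ct     <- cv.tree(ct, FUN = prune.misclass, K = 10)   # CV error for each subtree size
best_size <- cv_ct$size[which.min(cv_ct$dev)]            # subtree size with the lowest CV error
pruned_ct <- prune.misclass(ct, best = best_size)        # the corresponding pruned subtree
```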

Besides being quite easy to understand and realistic, tree-based methods also work well with missing data and interactions. Furthermore they have the powerful feature of being indifferent to monotone transformations of the explanatory variables (Twala, 2010). According to Twala (2010) this results in the same decision tree whatever such transformation is applied, a quality not many other classification techniques possess. Moreover, decision trees are not affected by outliers due to this invariance to transformations. Another advantage is that machine learning techniques often do not require strong assumptions compared to statistical techniques (Li et al., 2010). Disadvantages are that trees are not good with linear relationships and are not as good at predicting as some other classification and regression techniques (James et al., 2013).

To improve the predictive performance of trees (even after pruning), methods such as bagging, boosting, and random forests can be applied.
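For illustration only, a hedged sketch of one such ensemble method (a random forest via the `randomForest` package, with the same assumed object names as above); the thesis itself does not apply these methods.

```r
# Illustrative only: a random forest on the same data (not used in the thesis).
library(randomForest)
set.seed(1)
rf <- randomForest(GoodOrBad ~ . - quality, data = wine, ntree = 500)
rf   # prints the out-of-bag error estimate and confusion matrix
```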

2.2. Logistic regressions

The focus in this thesis is on the logistic regression. Along with decision trees, logistic regressions are one of the most commonly used methods for classification problems (Chen, 2011). In contrast with decision trees, logit models are parametric models which classify a response variable Y by modelling the chance that it falls in a particular class. The model has the following form (Heij, Boer, Franses, Kloek & van Dijk, 2004):

$$p(Y \mid X) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k}} = p(Y = \text{``success''})$$

The expression p(Y|X) gives the chance that Y equals a particular category given the explanatory variables (X_1, ..., X_k), with one particular category as the baseline. The expression therefore gives the chance of "success", that is, of falling in the desired category. The formula is obtained by taking the cumulative distribution function of the logistic density, which means that a logit model assumes a logistic distribution (Heij et al., 2004).

Logistic regression is usually used as a binary response model, although multiple response categories are also possible; for multiple responses, linear discriminant analysis (LDA) is often applied (James et al., 2013). An example of a binary response Y can be defined as:

$$Y = \begin{cases} 1 & \text{if ``Yes''} \\ 0 & \text{if ``No''} \end{cases}$$

So when predicting whether Y is "Yes" or "No", p(Y = "Yes" | X) gives the chance that the response will be "Yes" given the predictors (X_1, ..., X_k), and 1 - p(Y = "Yes" | X) gives the chance that it will be "No". The chance of Y being "No" is the baseline in this case. Naturally, p(Y|X) has to lie between 0 and 1.

One can see that no assumption of a linear relationship between the response variable and the explanatory variables is needed. There does, however, need to be a linear relationship between the log-odds (logit) and the explanatory variables, which can be written as:

$$\log\!\left(\frac{p(Y \mid X)}{1 - p(Y \mid X)}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k$$


The parameters (β_0, ..., β_k) can be estimated with maximum likelihood, in which the likelihood function is maximized (the likelihood requires the observations of the response variable to be independent):

$$\max_{\beta_0, \beta_1, \ldots, \beta_k} L(\beta_0, \beta_1, \ldots, \beta_k) = \prod_{\{i:\, y_i = 1\}} p(y_i \mid x_i) \prod_{\{i:\, y_i = 0\}} \bigl(1 - p(y_i \mid x_i)\bigr), \quad i = 1, \ldots, N$$

Instead it is often more convenient to maximize the log-likelihood function.
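In R this estimation is carried out by glm() with a binomial family, which maximizes this log-likelihood numerically (via iteratively reweighted least squares). The snippet below is a sketch with illustrative object and variable names (the red wine predictors from Section 4 are used as an example), not the thesis's own script.

```r
# Sketch of estimating a logit model; object and variable names are illustrative.
logit_fit <- glm(GoodOrBad ~ alcohol + sulphates + volatile.acidity,
                 data = wine, family = binomial(link = "logit"))

summary(logit_fit)                           # coefficients, standard errors, z values, Pr(>|z|)
head(predict(logit_fit, type = "response"))  # fitted probabilities p(Y = "Yes" | X)
```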

Logistic regression is a convenient tool for many reasons. Like decision trees, logistic regressions are easy to understand, and the logit model is a strong and widely used prediction model for variables like financial distress (Brezigar-Masten & Masten, 2012).

According to the literature reviews of Camdeviren, Yazici, Akkus, Bugdayci and Sungur (2007) and Cho et al. (2010), logistic regressions require comparatively few assumptions, which is an attractive feature. However, according to the literature reviews of Li et al. (2010) and Moon, Kang, Jitpitaklert and Kim (2012), some of these required assumptions do limit the applicability of the logit model to real-world data to some extent.

The assumptions on which logit models rely are that the response variable Y is binary or multi-class, that there is a linear relationship between the log-odds and the explanatory variables, that the observations of the response variable are mutually independent, that the error terms of the model follow a logistic distribution, and that the model is homoscedastic (Heij et al., 2004).

2.3. Related literature

The main objective of this thesis is inspired by two papers, namely Brezigar-Masten and Masten (2012) and Cho et al. (2010). The latter attempt to create a hybrid bankruptcy prediction method by applying different types of models, one of them being the logit model, with input variables selected through decision trees. According to Cho et al. (2010), variable selection through decision trees is a very powerful variable selection technique. In their research the case-based reasoning (CBR) model with the Mahalanobis distance using variable weights was the best model, although the logit model also had reasonable prediction accuracy: an average of 70.6% against the 73.2% average of the best model.

Brezigar-Masten and Masten (2012) did similar research by amalgamating the CART model with the logit model for bankruptcy prediction purposes. However, they use dummy variables induced from the CART-based selected variables at the corresponding nodes in the decision tree, as opposed to using the CART-based selected variables as input variables. According to Brezigar-Masten and Masten (2012), dummy variables are better able to capture possible nonlinearities between the explanatory variables and the response variable than CART-based selected input variables. Besides applying models with only dummy variables, the authors also used a step-wise selection procedure to select variables and constructed models containing these variables together with the dummy variables. Furthermore, they tested and reported the results of the approach of Cho et al. (2010) and looked at the prediction performance of all the classification trees. The results show that the method of using CART-based dummy variables with logit gives the best results (whether in addition to the conventionally selected variables or not).

To see whether the approaches of both previously mentioned papers can also be applied for somewhat non-financial or non-economic prediction purposes, I use three different datasets. Two of them are wine-related: one consists of samples of white wine and the other of samples of red wine. The data have been assembled by Cortez, Cerdeira, Almeida, Matos and Reis (2009), who used these datasets to predict wine preferences of individuals from physicochemical properties such as pH value, alcohol and density. According to the authors there have been very few cases in the past where data mining techniques and large datasets have been applied to classify wine quality. In their research three models have been considered: Support Vector Machines (SVM), Neural Networks (NN) and Multiple Regression (MR), with variables selected on the basis of a sensitivity analysis. They found that SVM gave the most accurate predictions of the three models (with the highest prediction accuracy of 86.8% for white wine and 89% for red wine when allowing for an absolute error tolerance of 1).

Another dataset has been collected from the website of the book "Principles of Econometrics, Fourth Edition" by Hill, Griffiths and Lim (2011). The dataset consists of information about home sales in Baton Rouge, Louisiana. The book is an introductory econometrics text for students in (agricultural) economics, finance, accounting, marketing, public policy, sociology, law and political science, and the dataset is used in the book for educational purposes.

3. Data and Methods

In this thesis three different datasets are used, and the same methodology is applied to each dataset. In the next subsections the datasets and the research methodology are described.

3.1. Data

3.1.1. Wine datasets

Two wine-related datasets (under the name "Wine Quality") have been collected from the UCI Machine Learning Repository. The datasets consist of samples of white and red wine: the white wine dataset has 4898 samples and the red wine dataset has 1599 samples. Some samples have been removed from the original data files because of their implausibly high values of alcohol percentage and somewhat large values of density, which might be recording errors. However, as it is not certain whether this is the case, these samples have been removed to be safe. The white wine dataset then consists of 4793 samples and the red wine dataset of 1528 samples.

The wine comes from the Portuguese wine product "Vinho Verde". The data had already been gathered by Cortez et al. (2009) for their research "Modeling wine preferences by data mining from physicochemical properties". As the research title indicates, the dataset consists of input variables which represent the physicochemical properties and a response variable which represents the wine quality. The values of the input variables have been collected using objective tests, whereas the values of the response variable are based on sensory data. The sensory data are medians of at least 3 evaluations of the wine samples done by wine experts; each evaluation is a grade on a scale of 0 (very bad) to 10 (excellent). Cortez et al. (2009) have also indicated that the physicochemical properties might be correlated.

In this thesis I also aim to predict the wine quality on the basis of the other variables. However, the response variable quality has been slightly altered so that it can be used in a logit model (see Section 3.2). The physicochemical properties are used as input variables and dummy variables, as described in Section 4. Further statistical information on the datasets is given in Table 1 and Table 2, and the frequency of the grades for the output variable "quality" of white and red wine is given in Figure 2. In Figure 2 one can see that the response variable quality seems to follow something like a binomial distribution with a success probability of about one half (Bin(n, p = 1/2)).

Table 1

Statistics of physicochemical properties and the quality of white wine

Variables Minimum Maximum Mean Median
Fixed acidity (g(tartaric acid)/dm3) 3.80 14.20 6.85 6.80
Volatile acidity (g(acetic acid)/dm3) 0.08 1.10 0.28 0.26
Citric acid (g/dm3) 0.00 1.66 0.33 0.32
Residual sugar (g/dm3) 0.60 22.60 6.22 5.10
Chlorides (g(sodium chloride)/dm3) 0.01 0.35 0.05 0.04
Free sulfur dioxide (mg/dm3) 2.00 289.00 35.15 34.00
Total sulfur dioxide (mg/dm3) 9.00 440.00 137.70 133.00
Density (g/cm3) 0.99 1.00 0.99 0.99
pH 2.72 3.82 3.19 3.18
Sulphates (g(potassium sulphate)/dm3) 0.22 1.08 0.49 0.47
Alcohol in volume percentage (%) 8.00 14.20 10.53 10.40
Response variable: Quality 3.00 9.00 5.88 6.00

Note. n=4793


Table 2

Statistics of physicochemical properties and the quality of red wine

Input variables Minimum Maximum Mean Median
Fixed acidity (g(tartaric acid)/dm3) 4.60 15.90 8.17 7.80
Volatile acidity (g(acetic acid)/dm3) 0.12 1.58 0.53 0.52
Citric acid (g/dm3) 0.00 1.00 0.26 0.25
Residual sugar (g/dm3) 0.90 13.90 2.46 2.20
Chlorides (g(sodium chloride)/dm3) 0.01 0.61 0.09 0.08
Free sulfur dioxide (mg/dm3) 1.00 72.00 15.84 14.00
Total sulfur dioxide (mg/dm3) 6.00 289.00 46.29 37.00
Density (g/cm3) 0.99 1.00 1.00 1.00
pH 2.74 4.01 3.32 3.32
Sulphates (g(potassium sulphate)/dm3) 0.33 2.00 0.66 0.62
Alcohol in volume percentage (%) 8.40 14.90 10.43 10.20
Response variable: Quality 3.00 8.00 5.63 6.00

Note. n=1528

Figure 2. Histograms for quality of white wine (left) and red wine (right)

3.1.2. Home sales dataset

The third dataset used in this thesis is the home sales dataset. It contains 1080 observations of home sales in Baton Rouge, Louisiana (US) during mid-2005. Looking at the correlations computed in R, the variables in this dataset are also somewhat correlated, as in the wine datasets, and the correlations are overall a bit higher than those between the wine variables.
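A sketch of such a correlation check in R, with `homes` as an assumed name for the home sales data frame:

```r
# Pairwise correlations of the numeric variables in the home sales data.
# 'homes' is an assumed object name for the dataset.
round(cor(homes[sapply(homes, is.numeric)]), 2)
```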

The aim with this dataset is to predict which sales had a price above the median. The response variable has been modified slightly for computational reasons in R (see Section 3.2). The remaining variables play the role of input variables and dummy variables, as described in Section 4.

The descriptive statistics of the home sales dataset are given in Table 3. A histogram of the home sale prices above and below the median is given in Figure 3.

Table 3

Statistics of the home sales dataset

Variables Minimum Maximum Mean Median
Total square feet 662 7897 2326 2186
Number of bedrooms 1.00 8.00 3.18 3.00
Number of full baths 1.00 5.00 1.973 2.00
Age in years 1.00 80.00 19.57 18.00
Occupancy^a 1.00 3.00 1.565 2.00
Pool 0.00 1.00 0.08 0.00
Style^b 1.00 11.00 3.75 1.00
Fireplace 0.00 1.00 0.56 1.00
Waterfront 0.00 1.00 0.07 0.00
Days on market 0.00 728.00 74.06 40.00
Response variable: Price ($) 22000 1580000 154863 130000

Note. n=1080

^a Owner = 1, vacant = 2, tenant = 3. ^b Traditional = 1, Townhouse = 2, Ranch = 3, New Orleans = 4, Mobile Home = 5, Garden = 6, French = 7, Cottage = 8, Contemporary = 9, Colonial = 10, Acadian = 11.

Figure 3. Histogram for the prices at the home sales


3.2. Methods

The research methodology of this thesis is similar to that of Brezigar-Masten and Masten (2012) and Cho et al. (2010). For every dataset the procedure of the research is as follows:

1. A classification tree is constructed to fit the training dataset, which is made by taking 60% of the whole dataset. Pruning is applied if the CT can be improved; this is done with cost-complexity pruning and k-fold cross-validation, where the latter gives the value of the cost-complexity parameter α that yields the subtree with the least test error. The final tree is then tested on the testing dataset (the remaining 40% of the whole dataset). See Figures B1, B5 and B9 in Appendix B for the results of the pruning process for each dataset, and Figures B2, B6 and B10 in Appendix B for other details of the trees.

2. Afterwards a logit model is fitted to the training dataset with the variables selected by the (pruned) classification tree. This model is also tested on the testing dataset.

3. The same procedure as above is followed for the logit model with dummy variables induced from the explanatory variables at the corresponding nodes in the (pruned) CT. The way the dummy variables are created is not unique, so other formations are also possible.

4. The prediction accuracies of all the models are then calculated. There are three types of prediction accuracies: one for each of the two categories and one overall. The exact number of observations in each category is given in Figures B3, B7 and B11 in Appendix B.

5. For comparison, a logit model with all the variables is constructed to see which of them have statistically significant coefficients. These variables are then implemented in another logit model. These models are again built on the training dataset and tested on the testing dataset.

6. The prediction accuracies of the logit models in step 5 are then calculated and reported in the same way as in step 4. The exact number of observations in each category is given in Figures B4, B8 and B12 in Appendix B.

The operations described above are performed with the statistical program R. The scripts written in R for each dataset are available in Appendix A.

For each dataset a different kind of binary response variable is created. The new response variables for the wine datasets are based on the median of the wine samples: if the quality is above the median, the binary response variable takes on the value "Yes", and otherwise "No". The binary response variable for the home sales dataset is defined similarly. Further descriptions of the response variables are given in Table 4. The median was chosen as the split point in order to make the two groups as equal in size as possible.
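A hedged sketch of these preparatory steps in R (the thesis's actual scripts are in Appendix A; object names and the random seed here are illustrative):

```r
# Median-based binary response for the wine data (quality > 6 = "Yes", see Table 4),
# followed by the 60%/40% train-test split described in step 1.
wine$GoodOrBad <- factor(ifelse(wine$quality <= 6, "No", "Yes"))

set.seed(42)                                   # illustrative seed
n        <- nrow(wine)
train_id <- sample(n, size = round(0.6 * n))   # 60% of observations for training
train    <- wine[train_id, ]
test     <- wine[-train_id, ]
```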


Table 4

Description of the binary response variables

Response variable: Quality of white wine ("quality")
Description: Dummy variable (GoodOrBad): "No" if quality ≤ 6, "Yes" otherwise

Response variable: Quality of red wine ("quality")
Description: Dummy variable (GoodOrBad): "No" if quality ≤ 6, "Yes" otherwise

Response variable: Price of the home sales ($) ("PRICE")
Description: Dummy variable (Price): "No" if PRICE ≤ 130000, "Yes" otherwise

Note. The names in brackets are the names defined in R: those under "Response variable" are the original names of the variables, and those under "Description" are the names of the new dummy variables. Therefore, e.g., "PRICE" differs from "Price".

4. Results and analysis

4.1. Estimated models

4.1.1. White wine dataset

The classification tree designed for the white wine dataset (step 1 of the research methodology) is shown in Figure 4. Pruning was actually not required, because after cross validation and weakest link pruning it was shown that the initial tree gave the least test error. Yet the tree with 3 nodes has been chosen, because this has the same test error as the initial tree and is less complex (see Figure B1 in Appendix B). The classification tree has chosen only one variable for the prediction of the quality of wine, namely alcohol.

The variable chosen by the CT is then inserted as an input variable in a logit model (step 2). The results of the logit model are reported in Table 5. Following step 3 of the research methodology, dummy variables have been created as well, to be implemented in a separate logit model. First, an ordinal variable has been created so that R can automatically make dummy variables out of it. The estimates of this logit model are shown in Table 6. The ordinal variable is defined below.

For comparison, two other logit models have been constructed, with all the variables and with only the variables having significant coefficients, respectively (step 5). The results are reported in Tables 7 and 8. As all the coefficients in Tables 5 and 6 are highly statistically significant (at the 0.1% level), only variables with statistically significant coefficients at the 0.1% level have been implemented in the logit model corresponding to Table 8, in order to make a fair comparison.

• Alcohol1 = "A" if alcohol < 10.85; "B" if 10.85 < alcohol < 11.85; "C" otherwise
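A sketch of how such an ordinal variable can be built with cut() and fed to glm(), which expands the factor into the dummy variables of Table 6 automatically; the split points 10.85 and 11.85 come from the pruned CT in Figure 4, while object names are illustrative.

```r
# Ordinal variable from the CT split points; glm() creates dummies Alcohol1B and
# Alcohol1C automatically, with level "A" as the baseline (absorbed by the intercept).
train$Alcohol1 <- cut(train$alcohol,
                      breaks = c(-Inf, 10.85, 11.85, Inf),
                      labels = c("A", "B", "C"))

logit_dummy <- glm(GoodOrBad ~ Alcohol1, data = train, family = binomial)
summary(logit_dummy)   # coefficients for Alcohol1B and Alcohol1C, as in Table 6
```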


Figure 4. Pruned classification tree (CT) of the white wine dataset

Table 5

Estimated logit model with CT-based input variables

Effect of: Parameter Coefficient Z value Pr(>|z|)
Intercept α -10.32*** (0.47) -21.99 < 2.00×10^-16
Alcohol β1 0.83*** (4.2×10^-2) 19.83 < 2.00×10^-16

Note. Standard errors in parentheses. *** Statistical significance at the 0.1% level.

Table 6

Estimated logit model with dummy variables induced from the CT

Effect of: Parameter Coefficient Z value Pr(>|z|)
Intercept α -2.24*** (8.00×10^-2) -28.08 < 2.00×10^-16
Alcohol1B β1 1.45*** (0.12) 11.87 < 2.00×10^-16
Alcohol1C β2 2.35*** (0.12) 19.84 < 2.00×10^-16

Note. Standard errors in parentheses. *** Statistical significance at the 0.1% level.


Table 7

Estimated logit model with all the variables

Effect of: Parameter Coefficient Z value Pr(>|z|)
Intercept α 499.50*** (123.8) 4.03 5.48×10^-5
Fixed acidity β1 0.47*** (0.12) 3.84 1.24×10^-4
Residual sugar β2 0.24*** (4.70×10^-2) 5.18 2.26×10^-7
Total sulfur dioxide β3 -1.48×10^-3 (2.00×10^-3) -0.74 0.46
Sulphates β4 1.84*** (0.46) 4.02 5.93×10^-5
Volatile acidity β5 -3.75*** (0.65) -5.82 6.04×10^-9
Chlorides β6 -14.40** (4.91) -2.93 3.37×10^-3
Density β7 -521*** (125.50) -4.15 3.32×10^-5
Alcohol β8 0.32* (0.15) 2.23 2.60×10^-2
Citric acid β9 -1.11* (0.53) -2.08 3.80×10^-2
Free sulfur dioxide β10 1.23×10^-2** (4.06×10^-3) 3.02 2.52×10^-3
pH β11 2.94*** (0.57) 5.16 2.52×10^-7

Note. Standard errors in parentheses. * Statistical significance at the 5% level. ** Statistical significance at the 1% level. *** Statistical significance at the 0.1% level.

Table 8

Estimated logit model with variables having highly statistically significant coefficients

Effect of: Parameter Coefficient Z value Pr(>|z|)
Intercept α 791.60*** (40.72) 19.44 < 2.00×10^-16
Fixed acidity β1 0.62*** (8.23×10^-2) 7.59 3.23×10^-14
Residual sugar β2 0.35*** (2.39×10^-2) 14.67 < 2.00×10^-16
Sulphates β3 2.26*** (0.43) 5.33 1.01×10^-7
Volatile acidity β4 -3.58*** (0.60) -5.97 2.31×10^-9
Density β5 -817.11*** (41.99) -19.46 < 2.00×10^-16
pH β6 3.86*** (0.43) 8.88 < 2.00×10^-16

Note. Standard errors in parentheses. *** Statistical significance at the 0.1% level.


4.1.2. Red wine dataset

The classification tree built for the red wine dataset is shown in Figure 5. According to the cross validation and weakest link pruning the tree with 4 terminal nodes has the least test error (See Figure B5 in Appendix B). In Figure 5 one can see that the following variables have been chosen by the classification tree: alcohol, sulphates and volatile acidity.

Afterwards these variables have been used in a logit model straightaway as input variables (step 2). The results of this logistic regression can be seen in Table 9. Also dummy variables have been formed from the corresponding nodes in the CT. The values of the coefficients of the logit model with the dummy variables (step 3) are shown in Table 10. The dummy variables have been created out of the ordinal variables, which R does automatically. The ordinal variables are defined below.

The results of the logit models with all the variables and variables with significant coefficients are shown in Table 11 and 12 (step 5). In Table 12 only the variables which had statistically significant coefficients at the 0.1% level have been chosen from the variables in Table 11. This is done because again all the coefficients in Table 9 and 10 are also statistically significant at the 0.1% level.

• Alcohol1 = "Yes" if alcohol > 10.45; "No" otherwise
• Sulphates1 = "Yes" if sulphates > 0.735; "No" otherwise
• Volatile.acidity1 = "Yes" if volatile.acidity > 0.335; "No" otherwise


Figure 5. Pruned classification tree (CT) of the red wine dataset

Table 9

Estimated logit model with CT-based input variables

Effect of: Parameter Coefficient Z value Pr(>|z|)
Intercept α -14.71*** (1.61) -9.11 < 2.00×10^-16
Alcohol β1 1.16*** (0.12) 9.45 < 2.00×10^-16
Sulphates β2 3.18*** (0.60) 5.28 1.32×10^-7
Volatile acidity β3 -4.76*** (0.82) -5.84 5.20×10^-9

Note. Standard errors in parentheses. *** Statistical significance at the 0.1% level.

Table 10

Estimated logit model with dummy variables induced from the CT

Effect of: Parameter Coefficient Z value Pr(>|z|)
Intercept α -3.48*** (0.42) -8.23 < 2.00×10^-16
Alcohol1Yes β1 2.95*** (0.39) 7.64 2.20×10^-14
Sulphates1Yes β2 1.54*** (0.24) 6.36 1.98×10^-10
Volatile.acidity1Yes β3 -1.50*** (0.26) -5.86 4.57×10^-9

Note. Standard errors in parentheses. *** Statistical significance at the 0.1% level.


Table 11

Estimated logit model with all the variables

Effect of: Parameter Coefficient Z value Pr(>|z|)
Intercept α 119.60 (158.00) 0.76 0.45
Fixed acidity β1 0.20 (0.18) 1.15 0.25
Residual sugar β2 0.24* (0.12) 2.00 4.56×10^-2
Total sulfur dioxide β3 -6.52×10^-3 (5.75×10^-3) -1.13 0.26
Sulphates β4 4.08*** (0.74) 5.54 2.98×10^-8
Volatile acidity β5 -3.18** (1.10) -2.89 3.87×10^-3
Chlorides β6 -8.96* (4.34) -2.07 3.89×10^-2
Density β7 -135.80 (160.80) -0.84 0.40
Alcohol β8 1.05*** (0.21) 4.96 7.11×10^-7
Citric acid β9 0.74 (1.23) 0.60 0.55
Free sulfur dioxide β10 -2.01×10^-2 (1.83×10^-3) -1.10 0.27
pH β11 -0.11 (1.36) -8.30×10^-2 0.93

Note. Standard errors in parentheses. * Statistical significance at the 5% level. ** Statistical significance at the 1% level. *** Statistical significance at the 0.1% level.

Table 12

Estimated logit model with variables having highly statistically significant coefficients

Effect of: Parameter Coefficient Z value Pr(>|z|)
Intercept α -17.94*** (1.49) -12.02 < 2.00×10^-16
Sulphates β1 3.71*** (0.57) 6.54 6.03×10^-11
Alcohol β2 1.23*** (0.12) 10.49 < 2.00×10^-16

Note. Standard errors in parentheses. *** Statistical significance at the 0.1% level.

4.1.3. Home sales dataset

The classification tree for the home sales dataset (step 1 of the research methodology) is given in Figure 6. The weakest link pruning and CV indicated that the trees with 5, 7 or 10 terminal nodes have the least test error (see Figure B9 in Appendix B for more details). The tree with 5 terminal nodes was chosen because it is comparatively the least complicated one. Further, one can see that the tree has selected only three variables: total square feet, waterfront and age.


The variables selected by the classification tree have afterwards been used as input variables (step 2) and dummy variables (step 3) in two different logit models. The coefficients of both logit models can be seen in Tables 13 and 14, respectively. The dummy variables made by R have been formed by first creating ordinal variables out of the CT-based selected variables; the ordinal variables are shown below. The results of the logit models with all the variables and with only the variables having significant coefficients can be seen in Tables 15 and 16 (step 5). In Table 16 again only the coefficients statistically significant at the 0.1% level have been chosen, as the coefficients in Tables 13 and 14 are also significant at that level.

• SQFT1 = "A" if SQFT < 2121; "B" if 2121 < SQFT < 2904; "C" otherwise
• WATERFRONT1 = "Yes" if WATERFRONT > 0.5; "No" otherwise
• AGE1 = "Yes" if AGE > 15.5; "No" otherwise

Figure 6. Pruned classification tree (CT) of the home sales dataset


Table 13

Estimated logit model with CT-based input variables

Effect of: Parameter Coefficient Z value Pr(>|z|)
Intercept α -5.29*** (0.50) -10.49 < 2.00×10^-16
Total square feet β1 2.59×10^-3*** (2.20×10^-4) 11.74 < 2.00×10^-16
Waterfront β2 2.27*** (0.60) 3.79 1.52×10^-4
Age β3 -3.62×10^-2*** (6.75×10^-3) -5.37 8.09×10^-8

Note. Standard errors in parentheses. *** Statistical significance at the 0.1% level.

Table 14

Estimated logit model with dummy variables induced from the CT

Effect of: Parameter Coefficient Z value Pr(>|z|)
Intercept α -0.93*** (0.21) -4.45 8.49×10^-6
Total square feet1B β1 2.74*** (0.26) 10.44 < 2.00×10^-16
Total square feet1C β2 4.60*** (0.42) 10.95 < 2.00×10^-16
Waterfront1Yes β3 2.44*** (0.60) 4.08 4.51×10^-5
Age1Yes β4 -2.03*** (0.25) -8.21 < 2.00×10^-16

Note. Standard errors in parentheses. *** Statistical significance at the 0.1% level.


Table 15

Estimated logit model with all the variables

Effect of: Parameter Coefficient Z value Pr(>|z|)
Intercept α -8.02*** (1.20) -6.68 2.45×10^-11
Pool β1 0.81 (0.51) 1.58 0.11
Total square feet β2 2.38×10^-3*** (2.87×10^-4) 8.31 < 2.00×10^-16
Style β3 6.12×10^-2• (3.19×10^-2) 1.92 5.52×10^-2
Number of bedrooms β4 -0.15 (0.26) -0.59 0.55
Fireplace β5 0.31 (0.24) 1.27 0.20
Number of full baths β6 1.44** (0.51) 2.82 4.79×10^-3
Waterfront β7 2.03** (0.63) 3.21 1.33×10^-3
Age in years β8 -3.15×10^-2*** (7.18×10^-3) -4.39 1.11×10^-5
Days on market β9 9.04×10^-4 (1.20×10^-3) 0.75 0.45
Occupancy β10 0.16 (0.22) 0.71 0.48

Note. Standard errors in parentheses. • Statistical significance at the 10% level. * Statistical significance at the 5% level. ** Statistical significance at the 1% level. *** Statistical significance at the 0.1% level.

Table 16

Estimated logit model with variables having highly statistically significant coefficients

Effect of: Parameter Coefficient Z value Pr(>|z|)
Intercept α -4.98*** (0.48) -10.37 < 2.00×10^-16
Total square feet β1 2.52×10^-3*** (2.12×10^-4) 11.91 < 2.00×10^-16
Age in years β2 -3.93×10^-2*** (6.68×10^-3) -5.88 4.21×10^-9

Note. Standard errors in parentheses. *** Statistical significance at the 0.1% level.

4.1.4. Prediction accuracies

The accuracy of the predictions made by the estimated models described in the previous sections is given in Table 17 and Table 18 (steps 4 and 6 of the research methodology). Three prediction accuracies are given for each dataset: one for each of the two classes in which the response variable can fall and one overall. Averages over the prediction categories are given in the last rows of the tables. The exact number of observations in each category is given in Figures B3, B4, B7, B8, B11 and B12 in Appendix B.
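A sketch of how these three accuracies could be computed for one of the logit models on the test set; a 0.5 probability cut-off is assumed here, and object names (from the earlier sketches) are illustrative rather than the thesis's own.

```r
# Per-class and overall prediction accuracy from a confusion table on the test set.
prob <- predict(logit_fit, newdata = test, type = "response")
pred <- factor(ifelse(prob > 0.5, "Yes", "No"), levels = c("No", "Yes"))
tab  <- table(Predicted = pred, Actual = test$GoodOrBad)

acc_yes     <- tab["Yes", "Yes"] / sum(tab[, "Yes"])  # accuracy within class "Yes"
acc_no      <- tab["No", "No"]   / sum(tab[, "No"])   # accuracy within class "No"
acc_overall <- sum(diag(tab))    / sum(tab)           # overall accuracy
```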


Table 17

Prediction accuracies of the new response variables for each dataset

White wine quality:
  Classification tree: Yes 40.00, No 88.75, Overall 77.95
  Logit^a: Yes 21.41, No 95.84, Overall 79.34
  Logit^b: Yes 21.41, No 95.84, Overall 79.34

Red wine quality:
  Classification tree: Yes 15.05, No 97.88, Overall 85.29
  Logit^a: Yes 22.58, No 95.37, Overall 84.29
  Logit^b: Yes 22.58, No 95.37, Overall 84.29

Price of the home sales ($):
  Classification tree: Yes 74.77, No 92.38, Overall 83.33
  Logit^a: Yes 81.53, No 88.57, Overall 84.95
  Logit^b: Yes 81.53, No 88.57, Overall 84.95

Average:
  Classification tree: Yes 43.27, No 93.00, Overall 82.19
  Logit^a: Yes 41.84, No 93.26, Overall 82.86
  Logit^b: Yes 41.84, No 93.26, Overall 82.86

Note. The values presented in the table are in percentage (%). ^a Logit model with input variables selected by the CT method. ^b Logit model with dummy variables induced from the CT.

Table 18

Prediction accuracies of the other logit models

White wine quality:
  Logit^a: Yes 26.12, No 94.77, Overall 79.55
  Logit^b: Yes 26.35, No 94.77, Overall 79.60

Red wine quality:
  Logit^a: Yes 31.18, No 94.98, Overall 85.27
  Logit^b: Yes 17.20, No 97.49, Overall 85.27

Price of the home sales ($):
  Logit^a: Yes 82.43, No 85.24, Overall 83.80
  Logit^b: Yes 79.28, No 88.10, Overall 83.56

Average:
  Logit^a: Yes 46.58, No 91.66, Overall 82.87
  Logit^b: Yes 40.94, No 93.45, Overall 82.81

Note. The values presented in the table are in percentage (%). ^a Logit model with all the variables. ^b Logit model with only variables with significant coefficients at the 0.1% level.

4.2. Analysis

The white wine dataset has the simplest tree of all the trees made for the datasets: it requires only alcohol, in contrast to the tree built for the red wine dataset where other features also matter, namely sulphates and volatile acidity. It is also the only tree with one variable in it, whereas the other two trees have three variables. In any case alcohol seems to be a strong explanatory variable, as its coefficient is highly statistically significant (at the 0.1% level) in both combined logit models for the white wine dataset; in fact, in both combined logit models all the coefficients are highly significant. Further, one notices that the effect of alcohol becomes rather different when it changes from being just an input variable to a dummy variable: as an input variable alcohol has a slightly positive effect, but when implemented as a dummy variable one sees that the effect of alcohol increases as the alcohol level increases. The variable selection using a classification tree indicated that only one variable, namely alcohol, is important. However, when implementing all the variables in a logit model, one can see in Table 7 that multiple variables seem to be important, as their coefficients are significant. Even when one only takes the highly statistically significant ones (at the 0.1% level), one ends up with more variables than the single variable chosen by the CT. One strange finding is that the coefficient of alcohol is not highly significant in this case, so alcohol has not been chosen in Table 8. The effects of the variables that have been chosen have not changed much when comparing Tables 7 and 8. This method of choosing variables with highly significant coefficients also results in a slightly higher overall prediction accuracy than the overall prediction accuracies of the combined logit models; it is even higher than the overall prediction accuracy of the classification tree and of the logit model with all the variables.

Also in the red wine dataset all the coefficients of both combined logit models are highly statistically significant. One can see that the effects of alcohol and volatile acidity have increased when switching from input variable to dummy variable; the intercept has increased a lot compared to the effects of these two variables, and only the effect of sulphates has decreased. Looking at the results of the other logit models, far fewer variables have significant coefficients in this case: only sulphates and alcohol have highly statistically significant coefficients (at the 0.1% level). The effects of sulphates and alcohol have not changed much when comparing Tables 11 and 12; only the intercept has decreased a lot. The variable selection using a classification tree gives almost the same result, except that according to the CT volatile acidity seemed to be an important variable too, so the variable selections almost match in this case. The overall prediction accuracy obtained from the logit model with only highly significant variables is again slightly higher than the overall prediction accuracies of the combined models and slightly lower than the overall prediction accuracy of the classification tree; it is, however, equal to the overall prediction accuracy of the logit model with all the variables.

For the home sales dataset, again all the coefficients of the combined logit models are significant (at the 0.1% level). The effects of total square feet and waterfront have increased when switching from input variable to dummy variable; the effect of waterfront shows the same kind of transformation as the variable alcohol in the white wine dataset. The intercept has increased too, while the effect of age has decreased. Again not many variables have significant coefficients when looking at the other logit models: only total square feet and age have highly statistically significant coefficients (at the 0.1% level). The two variable selections again almost match, except that the variable selection using a classification tree indicated that waterfront is important too. In this dataset the overall prediction accuracies of the combined logit models are higher than the overall prediction accuracies of the other logit models and the classification tree.

One interesting and surprising fact stands out when looking at the prediction accuracies in Table 17 in Section 4.1.4: the prediction accuracies of both combined logit models are exactly the same. This makes it practically impossible to say which type of combined logit model predicts better. One can, however, see that for the white wine and home sales datasets the logit models predict better overall, while for the red wine dataset the classification tree does better. The difference between the overall prediction accuracy of the combined logit models and that of the classification tree is largest for the home sales dataset.

Looking at the averages of all the logit models one can notice that there is not much difference between the prediction performances. However the logit models do overall predict better on average than the classification trees.

The last interesting finding is that the models appear to predict the chance of failure better (“No”) than the chance of success (“Yes”). On average the logit model with only variables having highly statistically significant coefficients predicts failure most accurately.

5. Conclusion

The main objective of this thesis was to investigate whether the logit model with dummy variables performs better than the logit model with CT-based input variables as far as prediction performance is concerned. It is also interesting to see whether the logit model with dummy variables works better than the classification tree, as Brezigar-Masten and Masten (2012) found in their research, and to see what kind of differences there are between the two logit models. In addition, I have also used logit models with all the variables and with only the (highly) significant ones, to compare this method with the method of using combined logit models. For the research, datasets of a rather non-financial and non-economic character have been considered, to see whether the methods of Cho et al. (2010) and Brezigar-Masten and Masten (2012) give the same type of results on these types of datasets.

It is shown that, on the basis of the values reported in Table 17 in Section 4.1.4, one cannot differentiate between the two combined logit models and say which one predicts better. This might be because the datasets contain relatively "small" numbers of observations and variables compared to how large datasets can be, which does not allow for a lot of variation in the data. In future research one can therefore test these methods on larger datasets to obtain more robust results. One can, however, conclude that the logit models do not always perform better than the classification trees, as Brezigar-Masten and Masten (2012) had indicated.

The results of the logit models with CT-based input variables have shown that classification trees do serve as a good means of variable selection, as Cho et al. (2010) pointed out, because all the coefficients were highly significant. However, this technique does not differ much from the technique of using only variables with highly significant coefficients, because the average prediction accuracies of the latter logit model and of the logit model with CT-based input variables do not differ much. Thus the two types of variable selection considered in this thesis are roughly equally good techniques. Again this might be due to the fact that the datasets are relatively small, so in future research one might want to investigate this topic of variable selection using larger datasets. Further, there are no strong differences between the two types of combined logit models, except that perhaps, when a classification tree indicates that a variable has more than two levels, the effect of that variable increases per level once it is transformed from an input variable into a dummy variable. Thus, when variables have more than two levels, creating dummy variables helps to bring out the different positive (negative) effects per level, when the effect is estimated as positive (negative) as a normal input variable.

The method of selecting variables which have highly significant coefficients to build a good prediction model is not much better or worse than using the combined logit models as one can conclude from the analysis. One method is sometimes better than the other. Thus initially using combined logit models seemed to be simple and effective, but perhaps using other techniques will work as efficiently. Nevertheless all the logit models overall predict better on average than the classification trees.

Furthermore, all the models seem to be handy tools for predicting something one wishes to prevent, namely failure, because they seem to predict failure better than success.

In future research one could also use an ordered logit, LDA or any other type of model that can handle response variables with multiple categories, instead of the dichotomous logit model applied in this thesis, to predict the response variables of the wine datasets, since these variables can actually take on 6 or 7 different values. In this thesis these initial variables have been transformed into binary variables, which means that less of the information in the datasets is used. This might have affected the results of the models, and some other type of model might then perform better.

References

Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Brezigar-Masten, A., & Masten, P. (2012). CART-based selection of bankruptcy predictors for the logit model. Expert Systems with Applications, 39, 10153-10159.

Camdeviren, H. A., Yazici, A. C., Akkus, Z., Bugdayci, R. & Sungur, M. A. (2007). Comparison of logistic regression model and classification tree: An application to postpartum depression data. Expert Systems with Applications, 32, 987-994.

Chen, M. (2011). Predicting corporate financial distress based on integration of decision tree classification and logistic regression, Expert Systems with Applications, 38, 11261-11272.

Cho, S., Hong, H., & Ha, B. (2010). A hybrid approach based on the combination of variable selection using decision trees and case-based reasoning using the Mahalanobis distance: For bankruptcy prediction. Expert Systems with Applications, 37, 3482-3488.

Comissão de Viticultura da Região dos Vinhos Verdes (CRVV). (2014). Retrieved from http://www.vinhoverde.pt/en/

Cortez, P., Cerdeira, A., Almeida, F., Matos, T. & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4), 547-553.

Duchessi, P. & Laurìa, E. J. M. (2013). Decision tree models for profiling ski resorts' promotional and advertising strategies and the impact on sales. Expert Systems with Applications, 40, 5822-5829.

Heij, C., de Boer, P., Franses, P. H., Kloek, T. & van Dijk, H. K. (2004). Econometric Methods with Applications in Business and Economics. Oxford: Oxford University Press.

Li, H., Sun, J. & Wu, J. (2010). Predicting business failure using classification and regression tree: An empirical comparison with popular classical statistical methods and top classification mining methods. Expert Systems with Applications, 37, 5895-5904.

Moon, S. S., Kang, S., Jitpitaklert, W., & Kim, S. B. (2012). Decision tree models for characterizing smoking patterns of older adults. Expert Systems with Applications, 39, 445-451.

Nie, G., Rowe, W., Zhang, L., Tian, Y. & Shi, Y. (2011). Credit card churn forecasting by logistic regression and decision tree. Expert Systems with Applications, 38, 15273-15285.

Principles of Econometrics, Fourth Edition. br [br.xlsx]. Retrieved from http://www.principlesofeconometrics.com/poe4/poe4excel.htm

Principles of Econometrics, Fourth Edition. br [br.def]. Retrieved from http://www.principlesofeconometrics.com/poe4/data/def/br.def

Otexts: Online, Open-Access Textbooks. (2014). [A figure of a binary decision tree]. Retrieved from https://www.otexts.org/1512

Otexts: Online, Open-Access Textbooks. (2014). [A partition plot of the binary decision tree]. Retrieved from https://www.otexts.org/1512

R Development Core Team (2014). The R foundation for statistical computing. R [Version 3.1.0]. Available from http://www.r-project.org/

Twala, B. (2010). Multiple classifier application to credit risk assessment. Expert Systems with Applications, 37, 3326-3336.

Varian, H. R. (2014). Big data: New tricks for econometrics. Journal of Economic Perspectives, 28(2), 3-28.

UCI Machine Learning Repository - Center for Machine Learning and Intelligent Systems. (2013). Wine Quality [winequality-red.csv, winequality-white.csv, winequality.names]. Retrieved from https://archive.ics.uci.edu/ml/datasets/Wine+Quality


Appendix A


Appendix B

Figure B1. Values of the CV and weakest link pruning produced by R for the CT of the white wine dataset (size = #terminal nodes, dev = cross validation error rate, k = cost-complexity parameter)

Figure B2. Descriptive statistics produced by R for the pruned CT of the white wine dataset

Figure B3. Prediction tables of the estimated models for the white wine dataset: CT (left), logit with CT-based input variables (middle) and logit with dummy variables (right)

Figure B4. Prediction tables of the other estimated logit models for the white wine dataset: logit with all variables (left) and logit with only variables with significant coefficients (0.1% level) (right)


Figure B5. Values of the CV and weakest link pruning produced by R for the CT of the red wine dataset (size = #terminal nodes, dev = cross validation error rate, k = cost-complexity parameter)

Figure B6. Descriptive statistics produced by R for the pruned CT of the red wine dataset

Figure B7. Prediction tables of the estimated models for the red wine dataset: CT (left), logit with CT-based input variables (middle) and logit with dummy variables (right)

Figure B8. Prediction tables of the other estimated logit models for the red wine dataset: logit with all variables (left) and logit with only variables with significant coefficients (0.1% level) (right)


Figure B9. Values of the CV and weakest link pruning produced by R for the CT of the home sales dataset (size = #terminal nodes, dev = cross validation error rate, k = cost-complexity parameter)

Figure B10. Descriptive statistics produced by R for the pruned CT of the home sales dataset

Figure B11. Prediction tables of the estimated models for the home sales dataset: CT (left), logit with CT-based input variables (middle) and logit with dummy variables (right)

Figure B12. Prediction tables of the other estimated logit models for the home sales dataset: logit with all variables (left) and logit with only variables with significant coefficients (0.1% level) (right)
