
Since supervised machine learning algorithms are the focus of this study, only such algorithms, referred to as classifiers, are considered. There are many classifiers with numerous variants; the ones used in this study are introduced here:

2.7.1 Dummy Classifier

This classifier ignores the input features and always outputs the majority class of the dataset as its prediction. Its predictions therefore provide no insight into the dataset; however, it can be used as a baseline against which the results of the other classifiers are compared.
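As a brief illustration, a minimal sketch of such a baseline using scikit-learn's DummyClassifier is given below; the tiny toy dataset is an assumption made purely for demonstration.

# A minimal sketch (assuming scikit-learn is available): a majority-class
# baseline on a small imbalanced toy dataset.
from sklearn.dummy import DummyClassifier

X = [[0], [1], [2], [3], [4], [5]]     # feature values are ignored by this baseline
y = [0, 0, 0, 0, 1, 1]                 # majority class is 0

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
print(baseline.predict([[10], [20]]))  # always predicts the majority class: [0 0]
print(baseline.score(X, y))            # baseline accuracy: ~0.67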

2.7.2 Naive Bayes Classifier

The Naive Bayes classifier, as its name suggests, is an application of Bayes' theorem:

P(A \mid B) = \frac{P(A)\, P(B \mid A)}{P(B)}    (2.11)

This classifier is based on the naive assumption that the features are pairwise independent of each other given the class. This means that the probability of an instance with features X_1, X_2, \ldots, X_n belonging to class C_k corresponds to the probability of class C_k multiplied by the probability of observing each feature in that class [37]. That is why this classifier is skew insensitive [21]:

P(C_k \mid X_1, X_2, \ldots, X_n) = \frac{P(C_k)\, P(X_1, X_2, \ldots, X_n \mid C_k)}{P(X_1, X_2, \ldots, X_n)}    (2.12)

\propto P(C_k) \prod_{i=1}^{n} P(X_i \mid C_k)    (2.13)

Although independence is a strong assumption that rarely holds exactly, Naive Bayes competes well with other classifiers in practice [38].
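As an illustrative sketch, the snippet below applies scikit-learn's GaussianNB, one possible Naive Bayes variant, to a small made-up dataset; the feature values and labels are assumptions chosen only to show the interface.

# A hedged sketch of the Naive Bayes idea with scikit-learn's GaussianNB;
# the toy feature matrix and labels are made up for illustration.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0, 2.1], [1.2, 1.9], [3.0, 3.5], [3.2, 3.7]])
y = np.array([0, 0, 1, 1])

nb = GaussianNB()
nb.fit(X, y)

# Class posteriors P(C_k | X) for a new instance, computed from per-feature
# likelihoods under the conditional-independence assumption.
print(nb.predict_proba([[1.1, 2.0]]))
print(nb.predict([[1.1, 2.0]]))        # -> class 0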


2.7.3 Logistic Regression Classifier

This classifier is similar to linear regression, with a small difference that makes it suitable for classification tasks. First, the input features should be normalized so that they are on a comparable scale and the effect of their units is eliminated.

Standardization is one of the possible normalization methods; it transforms each feature x into a standardized feature x_s with mean (\mu) 0 and standard deviation (\sigma) 1:

x_s = \frac{x - \mu}{\sigma}    (2.14)
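A minimal sketch of Equation 2.14 with NumPy follows (scikit-learn's StandardScaler implements the same transformation); the feature matrix is a made-up example.

# Column-wise standardization as in Equation 2.14.
import numpy as np

X = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [30.0, 600.0]])

mu = X.mean(axis=0)        # per-feature mean
sigma = X.std(axis=0)      # per-feature standard deviation
X_s = (X - mu) / sigma     # standardized features: mean 0, standard deviation 1

print(X_s.mean(axis=0))    # ~[0, 0]
print(X_s.std(axis=0))     # [1, 1]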

This classifier uses a sigmoid function to transform the output of the linear regression into a number between zero and one for binary classification; a threshold value on the final output (see Equation 2.17) then defines the class of the instance. If the probability is above the threshold, the target class is assigned to the instance [39].
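The following sketch illustrates the sigmoid transformation and the thresholding step; the parameter vector, the instance, and the 0.5 threshold are illustrative assumptions, not values used in this study.

# Sigmoid output followed by thresholding for a single instance.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.5, -1.2, 0.8])   # hypothetical parameters (intercept first)
x = np.array([1.0, 0.3, 2.0])        # one standardized instance, first component 1.0 for the intercept

p = sigmoid(theta @ x)               # probability of the target class
threshold = 0.5                      # commonly used cut-off
predicted_class = int(p >= threshold)
print(p, predicted_class)            # ~0.85 -> class 1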

This classifier can use an optimizer such as Stochastic Gradient Descent to minimize its cost function and find the optimal parameter vector \theta. The cost function to be optimized is the regularized logistic (cross-entropy) loss, where:

• L is the regularization method, e.g. L1 or L2.

• Xi is the standardized feature vector of instance i.

• yi is the target label of instance i.
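A hedged sketch of this setup with scikit-learn is given below: logistic regression fitted by Stochastic Gradient Descent with an L2 penalty on standardized features. The dataset and hyperparameter values are illustrative assumptions, not the configuration used in this study.

# Logistic regression trained with SGD on standardized features.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(
    StandardScaler(),                             # Equation 2.14 applied per feature
    SGDClassifier(loss="log_loss",                # logistic loss ("log" in older scikit-learn releases)
                  penalty="l2",                   # L2 regularization
                  alpha=1e-4, max_iter=1000, random_state=0),
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))                # held-out accuracy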

2.7.4 Decision Tree Classifier

Decision trees use simple decision rules to predict the target class. There are different implementations of decision trees, e.g. C4.5 [40], C5.0 [40], and Classification And Regression Tree (CART) [41]. CART supports both classification and regression tasks. It constructs binary trees using the input features in order to maximize the information gain at each data split. An example of a decision tree (built using the scikit-learn library in Python [42]) is shown in Figure 2.6. This decision tree was built on the publicly available Iris flower dataset [43]. As shown in this figure, this classifier is easy to interpret and it exposes an explicit set of rules leading to the predicted target class. In general, tree–based classifiers do not need the normalization step, as the units of the input features do not affect the data splitting.

Figure 2.6. An example of a decision tree built on the Iris flower dataset.
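A sketch of how a tree like the one in Figure 2.6 can be built and drawn with scikit-learn follows; the exact settings used for the figure are not reproduced here, so the parameters below are assumptions.

# Building and plotting a CART-style decision tree on the Iris dataset.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()
tree = DecisionTreeClassifier(criterion="gini", random_state=0)  # binary splits, CART-style
tree.fit(iris.data, iris.target)

plt.figure(figsize=(12, 8))
plot_tree(tree, feature_names=iris.feature_names,
          class_names=iris.target_names, filled=True)
plt.show()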

2.7.5 Random Forest Classifier

Combining several classifiers of the same type can lead to better results. This idea is the basis of ensemble classifiers. Random forest [44] is an ensemble classifier that combines different decision trees to improve the results. First, the random forest creates n different decision trees, each built from a random sample of instances drawn from the dataset and using a random subset of features at each split. These n decision trees are then applied to the dataset, so each instance receives n predictions (class probabilities in classification and numerical values in regression tasks). Finally, the average of these predictions is reported as the final prediction for each instance. Although this classifier performs quite well, it is not easy to interpret correctly.
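A minimal sketch with scikit-learn's RandomForestClassifier is shown below; n_estimators plays the role of n in the text, and the dataset and parameter values are illustrative assumptions.

# Random forest: bagged instances plus random feature subsets at each split.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=100,      # number of decision trees (n in the text)
    max_features="sqrt",   # random subset of features considered at each split
    bootstrap=True,        # each tree sees a random sample of instances
    random_state=0,
)
print(cross_val_score(forest, X, y, cv=5).mean())  # averaged cross-validated accuracy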

2.7.6 Light Gradient Boosting Classifier

The Gradient Boosting Decision Tree is a popular classifier that combines several classifiers into a strong classifier in a sequential order. First, a classifier is trained on the training set; then a second classifier is fit to the residuals of the first one. This process of fitting a classifier to the residuals of the previous classifier continues until a given stopping condition is satisfied. The Gradient Boosting Decision Tree has different implementations, such as the Extreme Gradient Boosting Tree (XGBT). Although much work has been done, these implementations are not efficient and scalable when the feature space and the dataset size become large.

The Light Gradient Boosting Machine (LGBM) is a newer implementation of the Gradient Boosting Decision Tree, introduced by Microsoft in 2017 [45]. Based on experiments on public datasets, its authors report that it performs up to 20 times faster than XGBT while achieving almost the same accuracy. This classifier differs from other implementations, which scan all data instances to estimate the information gain of all possible split points, an approach that is tedious and time consuming. LGBM therefore proposes two novel methods, Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), in order to be fast and efficient while achieving almost the same accuracy.

With GOSS, instances with large gradients (residuals) are kept and only a random sample of the remaining instances is used when estimating the information gain. With EFB, mutually exclusive features, i.e. features that rarely take non-zero values simultaneously (e.g. one-hot encoded features), are bundled to reduce the number of features. In this way the classifier reduces both the number of instances and the number of features in order to be fast, memory efficient, and highly accurate [45]. Like any other tree–based classifier, it does not need the normalization step. To sum up, the main properties of this classifier are [46]:

1. Fast training speed
2. Efficient memory usage
3. CPU and GPU support
4. Large-scale dataset support

Although this classifier has great advantages, it is not easy to interpret it correctly.
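A short sketch using the lightgbm package's scikit-learn interface (LGBMClassifier) is given below; the dataset and parameter values are illustrative assumptions rather than the configuration used in this study.

# Light Gradient Boosting Machine via its scikit-learn-style API.
from lightgbm import LGBMClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lgbm = LGBMClassifier(
    n_estimators=200,      # number of boosting rounds
    learning_rate=0.05,    # shrinkage applied to each tree's contribution
    num_leaves=31,         # controls the complexity of each tree
    random_state=0,
)
lgbm.fit(X_train, y_train)            # no feature normalization needed for tree-based models
print(lgbm.score(X_test, y_test))     # held-out accuracy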

2.7.7 Stacked Classifier

This classifier, which is commonly used in data science competitions such as Kaggle [47], combines different types of classifiers (base–learners) and applies a new classifier (meta–classifier) on top of their predictions. The meta–classifier learns the relationship between the predictions of the base–learners and the target class [48][49].
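A minimal sketch of this idea with scikit-learn's StackingClassifier follows; the choice of base–learners and meta–classifier here is an illustrative assumption, not the combination used in this study.

# Stacking: a meta-classifier trained on the predictions of base-learners.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)

base_learners = [
    ("naive_bayes", GaussianNB()),
    ("random_forest", RandomForestClassifier(n_estimators=100, random_state=0)),
]
meta_classifier = LogisticRegression(max_iter=1000)   # learns from base-learner predictions

stack = StackingClassifier(estimators=base_learners,
                           final_estimator=meta_classifier, cv=5)
print(cross_val_score(stack, X, y, cv=5).mean())      # averaged cross-validated accuracy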
