
A.2 Selection of Appropriate Classification Model

In terms of the classification model, logistic regression models and modern machine learning algorithms that use bagging (e.g., random forest) or boosting (e.g., extreme gradient boosting) can be considered as candidates. Logistic regression models are beneficial for interpreting the relationship between two variables, as their interpretability is much higher than that of machine learning algorithms. However, machine learning algorithms show better performance even with small sample sizes, which is the case for the data-set on hand. Random forest is a popular method that uses bagging: it combines the predictions of several weak learners (decision trees), each trained on a different bootstrap sample of the data, into one strong predictor. Extreme gradient boosting (XGBoost), on the other hand, has proven to be a good algorithm on a variety of data-sets. It uses boosting, which iteratively trains several weak learners, re-weighting the training examples for the next learner according to how well the previous learner performed. XGBoost also integrates regularization into its gradient boosting algorithm to control overfitting, which can be a strong cause of poor performance in small-sample problems.

The objective function of XGBoost combines the training loss with a penalty on model complexity. That is why overfitting can be prevented by using such models: the predictors get closer to the underlying distribution of the data, and simple models with smaller variance are promoted (Chen et al., 2016). For this reason, the generalizability of XGBoost is considered higher than that of other algorithms such as logistic regression models and random forests.
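As a brief illustration, the regularized objective described above can be written as follows, following the formulation in the cited XGBoost paper (Chen et al., 2016); the symbols $T$ (number of leaves), $w$ (leaf weights), $\gamma$ and $\lambda$ (regularization strengths) are taken from that paper rather than from this thesis.

```latex
% Regularized learning objective of XGBoost (Chen et al., 2016):
% training loss over all observations plus a complexity penalty over all K trees.
\begin{equation*}
  \mathcal{L}(\phi) \;=\; \sum_{i} l\bigl(\hat{y}_i, y_i\bigr) \;+\; \sum_{k=1}^{K} \Omega(f_k),
  \qquad
  \Omega(f) \;=\; \gamma T + \tfrac{1}{2}\,\lambda \lVert w \rVert^{2}.
\end{equation*}
```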

In addition, Bentéjac et al. (2019) presented an empirical analysis of XGBoost, which has proved itself an effective method for classification problems. The speed and accuracy of XGBoost training is compared with the performance of random forest across different types of data-sets, including both small and large samples. The authors confirm the peak performance of XGBoost. In addition, XGBoost allows fine-tuning of its parameters through a highly efficient computational algorithm; this is of limited use for random forest, where only small gains are achieved by adjusting parameters.

The main advantage of logistic regression over modern machine learning algorithms is interpretability, as machine learning algorithms are considered black boxes and are difficult to interpret. Therefore, the literature was consulted to determine whether there is already a method for interpreting machine learning models instead of treating them as black boxes. Lundberg et al. (2018) suggested Shapley additive explanations for understanding the output of complex machine learning models such as XGBoost. The interpretation of XGBoost is difficult because such models can capture complex non-linear relationships between independent and dependent variables. Therefore, it is not possible to interpret them simply by reading off the coefficient values of independent variables, as in linear models such as logistic regression. SHapley Additive exPlanation (SHAP) is a well-structured method to calculate the contribution of independent variables to predictions, and, more importantly, the visualizations provided by SHAP values make the importance of the features easy to understand. Another advantage of SHAP is that it can be used to explain predictions made by XGBoost models. One drawback of SHAP is that its computational complexity can be high depending on the data-set on hand. However, this is not a problem for our case, as the data-set on hand can be considered a small input for XGBoost and SHAP. Further details of the interpretation of SHAP can be seen in Appendix F.

Appendix B Hierarchical Clustering Dendrograms For Other Topics

As the output of the hierarchical clustering algorithm, the dendrograms for each topic can be seen as follows:

Figure 29: Dendrogram of sequence 4 suppliers in environment topic

Figure 30: Dendrogram of sequence 4 suppliers in health & safety topic

Figure 31: Dendrogram of sequence 4 suppliers in business ethics topic

Figure 32: Dendrogram of sequence 4 suppliers in human capital topic

Appendix C Label Spreading Algorithm

The idea of the algorithm is to label the unlabeled data by using the already labeled data.

For example, let us assume we have a data-set $X = \{x_1, \ldots, x_l, x_{l+1}, \ldots, x_n\} \subset \mathbb{R}^m$ and a label set $L = \{1, \ldots, c\}$. Here, the first $l$ points are labeled and their labels are denoted $y_i \in L$, while the remaining points $x_u$ ($l+1 \le u \le n$) are unlabeled. The aim of the algorithm is to predict the labels of the unlabeled points. The steps of the algorithm are as follows:

• Step 1: The affinity matrix $W$ must be calculated, where $W_{ij} = \exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma^2\right)$ if $i \neq j$ and $W_{ii} = 0$.

• Step 2: $S = D^{-1/2} W D^{-1/2}$ must be constructed, where $D$ is a diagonal matrix whose $(i, i)$ element is equal to the sum of row $i$ of $W$.

• Step 3: Iterate $F(t+1) = \alpha S F(t) + (1-\alpha) Y$ until convergence occurs, where $\alpha \in (0, 1)$ is the clamping factor.

• Step 4: Let $F^*$ denote the limit of the sequence $\{F(t)\}$. Label each point $x_i$ with the label $y_i = \arg\max_{j \le c} F^*_{ij}$.

Each point receives information from its neighbors (the first term) and also retains its original information (the second term). The parameter α specifies the relative amount of information coming from the neighbors versus the initial label information. Varying α (the clamping factor) gives the algorithm some degree of freedom, so that the boundary can be relaxed and some unlabeled data can be reassigned to more appropriate adjacent categories.

α = 0 means to keep the initial label information; α = 1 means to replace all initial information.

Since the initial label information has been confirmed by experts, the smallest possible alpha value should be chosen to ensure that the label information remains the same.
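A minimal sketch of this algorithm using scikit-learn's LabelSpreading implementation is given below; the data, labels and parameter values are placeholders, not values from this study.

```python
# Minimal sketch of label spreading with scikit-learn.
import numpy as np
from sklearn.semi_supervised import LabelSpreading

X = np.random.rand(100, 5)             # feature matrix (placeholder data)
y = np.full(100, -1)                   # -1 marks unlabeled points
y[:20] = np.random.randint(0, 3, 20)   # first l = 20 points carry expert labels

# The RBF kernel corresponds to the affinity matrix W above; alpha is the
# clamping factor, kept small so the expert labels are largely preserved.
model = LabelSpreading(kernel="rbf", gamma=20, alpha=0.2, max_iter=1000)
model.fit(X, y)

predicted = model.transduction_        # labels for all points, including the unlabeled ones
```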

Appendix D Data Preprocessing Phase

Before processing the data, the first step is converting categorical variables into binary variables. Therefore, each category is encoded using dummy variables, i.e., converted into a binary column, to be used in the model. Note that the dummy variable trap was also taken into account in order not to introduce multicollinearity between variables, which might harm model performance. As an illustration of the dummy variable trap, consider an example with education categories (high school, bachelors, masters). An illustration of converting categorical variables into dummy variables can be seen in Tables 24 and 25.

Table 24: An illustration of converting categorical variable into binary variables -1

Table 25: An illustration of converting categorical variable into binary variables -2

The dummy variables for the example indicate whether the education level of a person is high school, bachelors or masters. However, if the education level of a person is neither high school nor bachelors, then it must be masters; masters can therefore be used as the reference category. Thus, only two dummy variables are needed to represent the situation. In general, the number of dummy variables needed is one less than the number of categories.
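A minimal sketch of this encoding step with pandas is shown below; the column names and values are hypothetical, and drop_first removes the reference category to avoid the dummy variable trap.

```python
# Dummy encoding with pandas; drop_first=True drops the reference category
# so that only (number of categories - 1) binary columns remain.
import pandas as pd

df = pd.DataFrame({"education": ["high school", "bachelors", "masters", "bachelors"]})

dummies = pd.get_dummies(df["education"], prefix="education", drop_first=True)
print(dummies)  # two binary columns; the dropped category acts as the reference
```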

After converting the variables, further data preprocessing is needed because of several problems that can affect model performance. The first one is multicollinearity between independent variables. It is important to detect highly correlated features in the data to prevent unreliable models and erroneous predictions.

Step 1 – Removing highly correlated features: To remove highly correlated independent variables, the correlation between variables is checked. As there are different types of variables, different types of correlation measures must be employed.

• Pearson's Correlation Coefficient: The most popular correlation measure, used for calculating the correlation between two continuous variables (Rodgers and Nicewander, 1998). The ratio variables are used as input to calculate Pearson's correlation coefficient. Coefficients above 0.6 are considered highly correlated; therefore, one of the two variables is removed from the data-set.

• Point Biserial Correlation: This correlation measure is used for calculating the correlation between a continuous and a categorical variable (Tate, 1954). As the categorical variables are coded as binary columns, the point biserial correlation can be used for both binary and categorical variables. It is calculated as follows:

$$r_{pb} = \frac{\bar{y}_1 - \bar{y}_0}{s_y} \cdot \frac{\sqrt{n_1 n_0}}{n},$$

where $y$ is the continuous variable, $x$ is the binary variable, $n_1$ and $n_0$ are the sizes of the two subsets defined by $x$, $\bar{y}_1$ and $\bar{y}_0$ are the means of $y$ in those subsets, $s_y$ is the standard deviation of $y$, and $n = n_1 + n_0$.

• Phi Coefficient: To calculate the correlation between two binary variables, the Phi coefficient can be used (Cramer, 1946). The Phi coefficient is calculated from the chi-squared statistic by using a 2×2 contingency table. An example contingency table can be seen in Table 26.

Table 26: An example of Contingency table to calculate phi coefficient

In the table, $x$ and $y$ represent the binary variables. Using the contingency table, the Phi coefficient is calculated as follows:

$$\phi = \frac{n_{11}\, n_{00} - n_{10}\, n_{01}}{\sqrt{n_{1\cdot}\, n_{0\cdot}\, n_{\cdot 1}\, n_{\cdot 0}}}$$

A different rule of thumb is used for the Phi coefficient: coefficient values above 0.25 are considered to indicate highly correlated variables, and one of the two variables is removed from the data-set (Akoglu, 2018).

• Cramer's V: Cramer's V is used for calculating the correlation between two categorical variables (Cramer, 1946). The difference between the Phi coefficient and Cramer's V is that the Phi coefficient cannot be used for categorical variables with more than two categories. The same rule of thumb as for the Phi coefficient also applies to Cramer's V: coefficients above 0.25 are considered to indicate highly correlated variables (Akoglu, 2018).
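A minimal sketch of how these correlation measures could be computed with scipy and pandas follows; the data and variable names are placeholders, and the Cramér's V helper follows the standard chi-squared-based definition rather than any code from this study.

```python
# Computing the correlation measures described above on placeholder data.
import numpy as np
import pandas as pd
from scipy.stats import pearsonr, pointbiserialr, chi2_contingency

cont_a = np.random.rand(50)                  # continuous variable
cont_b = np.random.rand(50)                  # continuous variable
binary = np.random.randint(0, 2, 50)         # binary variable
cat_a = np.random.choice(list("ABC"), 50)    # categorical variable
cat_b = np.random.choice(list("XY"), 50)     # categorical variable

r_pearson, _ = pearsonr(cont_a, cont_b)      # continuous vs. continuous
r_pb, _ = pointbiserialr(binary, cont_a)     # binary vs. continuous

def cramers_v(x, y):
    """Cramér's V from the chi-squared statistic of a contingency table."""
    table = pd.crosstab(x, y)
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.values.sum()
    k = min(table.shape) - 1
    return np.sqrt(chi2 / (n * k))

# For two binary variables this reduces to the (absolute) Phi coefficient.
v = cramers_v(cat_a, cat_b)
```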

The correlation values show a strong correlation between the worker distribution variables, as those variables are expected to be approximately linear functions of each other. Therefore, the scale of the worker distribution variables is changed from an absolute to a relative measure by converting the values into percentages. For example, the variable 'number of female workers' is changed into 'percentage of female workers', as 'number of female workers' is highly correlated with the 'total number of workers' variable. After re-checking the correlation values, the multicollinearity problem is considered solved, as there are no correlated independent variables left.

Step 2 – Standardizing the continuous variables: As the data-set contains features with different scales, magnitudes and ranges, they should be converted to a common scale. This is because machine learning algorithms use the Euclidean distance between data points, which is highly dependent on the magnitude and range of the data. Therefore, standardization of the variables must be conducted. Standardization of a continuous variable means re-scaling it so that it has a mean of 0 and a standard deviation of 1. The equation that is applied is as follows:

$$x' = \frac{x - \bar{x}}{\sigma}$$

The original features may not be informative enough to obtain accurate predictions, as their different ranges and magnitudes can be harmful to model performance. That is why a set of manipulation methods must be implemented to transform the original features, or to generate new features with a better representation, which might help the predictive power of the classification model (García et al., 2014).
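A minimal sketch of this standardization step using scikit-learn's StandardScaler, which applies exactly the transformation above, is given below; the data are placeholders.

```python
# Standardizing continuous features to zero mean and unit standard deviation.
import numpy as np
from sklearn.preprocessing import StandardScaler

X_continuous = np.random.rand(50, 3) * 100     # placeholder features on different scales

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_continuous)  # (x - mean) / std, per column

print(X_scaled.mean(axis=0))  # approximately 0
print(X_scaled.std(axis=0))   # approximately 1
```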

Step 3 – Feature Selection: As there are many independent variables in the data-set, overcoming the curse of dimensionality must also be an aim for enhancing model performance.

The problem in machine learning classification models is to find a way to reduce the dimensionality of the feature space in order to decrease the chance of over-fitting. Over-fitting may occur when the number of features is large and the number of training examples is small, as is the case here. Without reducing the dimensionality, the classification model can easily find decision boundaries for the different classes of the dependent variable, but the generalizability of the model will be poor, as its performance on test data will not be good enough. For that purpose, a feature selection algorithm is implemented to detect the set of important variables for each classification model. As there are four different topics plus the overall sustainability score, a feature selection algorithm must be used before constructing each model, since irrelevant features may severely affect model performance. Recursive feature elimination is used as the feature selection algorithm, as it is effective and can be combined with machine learning algorithms such as XGBoost. The method iteratively removes the worst-performing features on a particular model until the subset of features that provides the best performance is obtained. Elimination is performed according to the following steps.

• Train the classifier (e.g., XGBoost or random forest).

• Compute the feature importance of each independent variable with respect to the dependent variable.

• Remove the feature with the smallest feature importance.

This procedure continues until the best subset of features is obtained. It should be noted that recursive feature elimination does not consider the correlation between variables (Guyon et al., 2002). Therefore, recursive feature elimination is applied after Step 1, which removes the variables with strong correlations.
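A minimal sketch of recursive feature elimination combined with XGBoost using scikit-learn is given below; the data and the number of features to keep are placeholders, not values from this study.

```python
# Recursive feature elimination wrapped around an XGBoost classifier.
import numpy as np
from sklearn.feature_selection import RFE
from xgboost import XGBClassifier

X = np.random.rand(100, 20)           # placeholder feature matrix
y = np.random.randint(0, 2, 100)      # placeholder binary target

estimator = XGBClassifier(n_estimators=100)

# At each iteration the least important feature (per XGBoost's feature
# importances) is dropped until n_features_to_select features remain.
selector = RFE(estimator, n_features_to_select=8, step=1)
selector.fit(X, y)

selected = selector.support_          # boolean mask of the retained features
ranking = selector.ranking_           # rank 1 marks selected features
```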

Step 4 – Data Imbalance: A data-set is considered imbalanced when the observations in one class are much more numerous than those in the other classes. As the aim of machine learning classification models is to increase accuracy by reducing error, they do not pay attention to the class distribution. The problem is common in applications such as fraud and anomaly detection. Classification models become biased towards the majority class and tend to predict new instances as the majority class, because the class distribution misleads the model. This results in many more errors on minority-class instances than on majority-class instances. There are two main methods to overcome imbalanced data: undersampling and oversampling (Shelke et al., 2017). They pursue the same goal from opposite directions. The undersampling method reduces the majority-class samples so that the numbers of minority-class and majority-class observations become equal. However, if there are not many available observations in the data, it does not make sense to decrease the sample size further. The oversampling method, on the other hand, adds new observations to the minority classes to balance the distribution. New observations can be generated from combinations of existing observations, so artificial samples are added to the minority class. As the oversampling method is more appropriate for the problem on hand, it is chosen to solve the data imbalance. The most common oversampling technique is the synthetic minority oversampling technique (SMOTE). SMOTE generates new minority observations from existing minority observations using linear interpolation. The steps of SMOTE are as follows:

• For each $x \in A$, where $A$ represents the minority class, the k-nearest neighbors of $x$ are determined using the Euclidean distance between $x$ and the other points in the minority class.

• Based on the imbalance ratio, a sampling rate $N$ is set. For each $x$, $N$ examples $(x_1, x_2, \ldots, x_N)$ are chosen from its k-nearest neighbors to obtain the set $A_1$.

• For each point $x_k \in A_1$, the following formula is used to generate a new example for the minority class:

$$x' = x + \mathrm{rand}(0, 1) \cdot |x - x_k|,$$

where $\mathrm{rand}(0, 1)$ represents a random number between 0 and 1.

It should be noted that the oversampling method must be applied only to the training set, because information could leak from the training data into the test data if oversampling were performed before the train-test split.
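A minimal sketch of this step using the SMOTE implementation from imbalanced-learn, applied after the train-test split as described above, is given below; the data and parameters are placeholders.

```python
# Oversampling the minority class with SMOTE, on the training set only.
import numpy as np
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

X = np.random.rand(200, 10)                # placeholder features
y = np.array([0] * 180 + [1] * 20)         # imbalanced placeholder target

# Split first, so no synthetic information leaks into the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

smote = SMOTE(k_neighbors=5, random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

# The resampled training set now has balanced classes; the test set is untouched.
print(np.bincount(y_train_res))
```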

Appendix E Set of Important Variables and Explanation For The Models

Table 27: Important variables for the models and their explanations

Appendix F Interpreting Shapley Additive Explanations For Understanding The Model

Interpreting prediction models is important for understanding effects, debugging and taking appropriate actions. Interpreting tree ensemble methods such as gradient boosting machines and random forests is not easy, and importance values must be analysed for each input variable. These importance values can be calculated over the whole data-set to understand the average effects of the features on the dependent variable; this is called global interpretation. Here, SHAP (SHapley Additive exPlanation) values, which are constructed using game theory and local explanations, are used. SHAP values express the output of a function $f$ as the sum of the effects $\phi_i$ of each feature, using a conditional expectation. SHAP values are consistent for explaining the model output, as $\sum_{i=0}^{M} \phi_i = f(x)$. In order to calculate $\phi_i$, a binary variable $z' \in \{0, 1\}^M$ is defined, together with a mapping $h_x$ that allows the impact of missing features to be evaluated as $f(h_x(z'))$. The function $f_x(S)$ is then defined as $f(h_x(z'))$, where $S$ is the set of features corresponding to the non-zero indices of $z'$. Then $\phi_i$ is given by

$$\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(M - |S| - 1)!}{M!}\,\bigl[f_x(S \cup \{i\}) - f_x(S)\bigr],$$

where $N$ and $M$ represent the set of input features and the number of input features, respectively. The SHAP interaction effect is also estimated by:

$$\Phi_{ij} = \sum_{S \subseteq N \setminus \{i, j\}} \frac{|S|!\,(M - |S| - 2)!}{2(M-1)!}\, \nabla_{ij}(S)$$

It should be noted that $i \neq j$ must be satisfied, and $\nabla_{ij}(S)$ is given as follows:

$$\nabla_{ij}(S) = \bigl[f_x(S \cup \{i, j\}) - f_x(S \cup \{j\})\bigr] - \bigl[f_x(S \cup \{i\}) - f_x(S)\bigr]$$

The main effect of a feature can be calculated by taking the difference between its SHAP value and its SHAP interaction values:

$$\Phi_{i,i} = \phi_i - \sum_{j \neq i} \Phi_{i,j}$$

In order to understand these effects, SHAP summary plots are used. A SHAP summary plot is similar to a standard feature importance bar chart, but it also shows the range and distribution of the feature impacts. In SHAP summary plots, features are first sorted by their impact on the model output. Figure 33 shows an example SHAP summary plot taken from Lundberg et al. (2018).

Figure 33: SHAP summary plot of data-set from NHANES I (Miller, 1973).

Figure 33 shows the SHAP summary plot of an XGBoost model applied to the 20-year mortality data-set from NHANES I (Miller, 1973). Here, the higher the SHAP value of a feature, the higher the log-odds of death. The color of the dots represents the feature values of the individuals. For example, it can be seen that higher age increases the log-odds of death.
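A minimal sketch of how SHAP values and a summary plot could be produced for an XGBoost model using the shap library is given below; the data are placeholders and no results from this study are implied.

```python
# Computing SHAP values for an XGBoost model and drawing a summary plot.
import numpy as np
import shap
from xgboost import XGBClassifier

X = np.random.rand(200, 10)            # placeholder feature matrix
y = np.random.randint(0, 2, 200)       # placeholder binary target

model = XGBClassifier(n_estimators=100).fit(X, y)

# TreeExplainer implements the efficient SHAP algorithm for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Summary plot: features sorted by impact, colored by feature value.
shap.summary_plot(shap_values, X)
```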

Appendix G SHAP Summary Plot for Each Topic

In this section, SHAP summary plots that are used to interpret effects of supplier characteristics on learning behaviors can be found for each topic and each learning behavior.

Figure 34: SHAP summary plot of Lagged Fast Learner Suppliers in Validation (Overall) Score

Figure 35: SHAP summary plot of Fast Learner Suppliers in Validation (Overall) Score

Figure 36: SHAP summary plot of Slow Learner Suppliers in Validation (Overall) Score

Figure 37: SHAP summary plot of Indifferent Suppliers in Validation (Overall) Score

Figure 38: SHAP summary plot of Lagged Fast Learner Suppliers in environment topic

Figure 39: SHAP summary plot of Fast Learner Suppliers in environment topic

Figure 40: SHAP summary plot of Slow Learner Suppliers in environment topic

Figure 41: SHAP summary plot of Indifferent Suppliers in environment topic

Figure 42: SHAP summary plot of lagged fast learner suppliers in health & safety topic

Figure 43: SHAP summary plot of fast learner suppliers in health & safety topic

Figure 44: SHAP summary plot of slow learner suppliers in health & safety topic

Figure 45: SHAP summary plot of indifferent suppliers in health & safety topic

Figure 46: SHAP summary plot of lagged fast learner suppliers in business ethics topic

Figure 47: SHAP summary plot of fast learner suppliers in business ethics topic

Figure 48: SHAP summary plot of slow learner suppliers in business ethics topic

Figure 49: SHAP summary plot of indifferent suppliers in business ethics topic

Figure 50: SHAP summary plot of lagged fast learner suppliers in human capital topic

Figure 51: SHAP summary plot of fast learner suppliers in human capital topic

Figure 52: SHAP summary plot of slow learner suppliers in human capital topic

Figure 53: SHAP summary plot of indifferent suppliers in human capital topic

Appendix H Results of Interpretation For Each Topic

In this section, the full set of variables that affect learning behaviors, together with their effect directions and sizes, is compared for each topic and learning behavior.
