
University of Amsterdam

Internship Report

Research Master Psychology

JASP: A Fresh Way to Do

Machine Learning

Author:

Koen Derks, BSc

Supervisors:

prof. dr. E.M. Wagenmakers

dr. H.M. Steingröver

August 14, 2017


Foreword

This report is the theoretical part of the result of a research internship under the supervision of prof. dr. E.M. Wagenmakers and dr. H.M. Steingröver. The practical part consists of the implementation of three machine learning analyses in JASP (www.jasp-stats.org). The k-nearest neighbors regression and classification analyses and the k-means clustering analysis were studied in this internship with two main goals: implementing these analyses in JASP and studying the behavior of optimizers for the k-nearest neighbors algorithm. These internship goals have been achieved, and the process of doing so is described in the following pages. The goal of this report is to summarize these classic machine learning methods and provide an easy-to-understand tutorial for their corresponding JASP analyses. All three fully functioning analyses - the k-nearest neighbors classification, the k-nearest neighbors regression, and the k-means clustering - will be added to the machine learning module in JASP in its next release. With this, I hope to make machine learning more popular, because it has some interesting advantages in comparison to other prediction methods. By providing this tutorial and the corresponding module in JASP, prediction through machine learning should become more accessible for students and researchers.

The report contains three chapters. The first chapter describes the k-nearest neighbors algorithm and its implementation in JASP. Second, a chapter on optimizing the k-nearest neighbors algorithm lays the foundation for achieving the best parameter estimation for the data. Third, a chapter on the k-means algorithm describes the method and its implementation in JASP. The sections on the implementations in JASP are written in a tutorial style, featuring explanations of all input options and descriptions of all tables and figures that JASP outputs. Finally, examples of the analyses using well-known data sets are included to guide you through the new module step by step.

This work (and the corresponding JASP module) was conducted in collaboration with Eric-Jan Wagenmakers, Helen Steingröver, Don van den Bergh, Qiaodan Luo, and the JASP team.


Contents

1 The k-Nearest Neighbors Algorithm
1.1 Method
1.1.1 Introduction
1.1.2 Defining Nearest Neighbors
1.1.3 Defining Distances
1.1.4 Weighing the Neighbors
1.1.5 Making Predictions
1.1.6 Summary
1.2 Cross-Validation
1.2.1 Introduction
1.2.2 Leave-One-Out Cross-Validation
1.2.3 K-Fold Cross-Validation
1.2.4 Split-half Cross-validation
1.2.5 Summary
1.3 k-Nearest Neighbors in JASP
1.3.1 The Interface
1.3.2 The Output
1.3.3 Examples in JASP
1.4 List of R Packages

2 An Optimizer for k-Nearest Neighbors
2.1 The Need to Optimize
2.2 The Optimization Algorithm
2.3 The Optimizer in JASP

3 The k-Means Algorithm
3.1 Method
3.1.1 Introduction
3.1.2 Determining Cluster Membership
3.1.3 Determining the Number of Clusters
3.1.4 Summary
3.2 k-Means in JASP
3.2.1 The Interface
3.2.2 The Output
3.2.3 Examples in JASP
3.3 Future Work: Optimizing the k-Means Algorithm
3.4 List of R Packages


Machine Learning

Over the past two decades, machine learning has grown into a vital contributor to science and to large parts of our lives. With the ever-increasing amounts of data becoming available, there is good reason to believe that these kinds of data analysis will become an even more necessary ingredient for technological progress. In recent years, the field of Psychology has also had a taste of machine learning (Pereira et al., 2009; Rosenfield et al., 2012), and it wants more.

One might ask, "Why should machines have to learn? Why not design machines to perform as desired in the first place?" There are several reasons why machine learning is important. Of course, the achievement of learning in machines might help us understand how animals and humans learn. But there are other important reasons as well. The amount of knowledge available about certain tasks might be too large for explicit encoding by humans. Machines that learn this knowledge gradually might be able to capture more of it than humans would be able to write down. A good example is the case of backgammon, where self-learning neural networks are already able to beat world-class human opponents (Tesauro, 1995). This is why machine learning is often used for large data sets, with both many variables and many observations: algorithms that learn from the available data can predict unseen observations based on what they have seen.

There are many different machine learning techniques, used in many different settings. I implemented methods that look at the similarity of observations when making predictions, the so-called k-methods. They compute a measure of the similarity to known observations in a multidimensional search space and base their predictions on the outcome of this similarity measure.


Chapter 1

The k-Nearest Neighbors Algorithm

1.1 Method

1.1.1 Introduction

The k-nearest neighbors algorithm is a machine learning method that can be used for both classification and regression. The basic use of this algorithm, as with most machine learning algorithms, is to predict a certain unknown value. This value might be a class membership (i.e., in classification) or a numerical value (i.e., in regression).

1.1.2 Defining Nearest Neighbors

The k-nearest neighbors algorithm takes a subset of the data - the training set - consisting of entries on both the predictor variables and the target variable, and trains the model on this set. The algorithm can then be applied to the remaining data - the test set - to assess the performance of the model or to make predictions for this subset. For every observation in the test set, the k-nearest neighbors algorithm searches for its k nearest neighbors, where k is a fixed number defined by the researcher. Nearest neighbors are defined by the k smallest distances from the training set observations to each test set observation. Figure 1.1 shows examples of defining the three nearest neighbors in classification and regression. In both cases, the objective is to predict a value for the red star using three nearest neighbors. To do this, we define its three nearest neighbors as the observations that have the three smallest distances to the red star. Using the Euclidean distance, we select the training set observations that lie closest to our test set observation.


Figure 1.1: The objective is to predict a value for the test set observation (red star). To do this, we define its three nearest neighbors as the observations that have the 3 smallest distances to the red star. Using the Euclidean distance, we select training set observations A, B and C to be the three nearest neighbors that lie the closest to our test set observation.

1.1.3 Defining Distances

Different distance metrics can be used to determine the distance between a test set observation x and a training set observation y (Hechenbichler & Schliep, 2004). One of the most commonly used distance metrics is the Euclidean distance, displayed for n dimensions in Equation 1.1. Another commonly used distance metric is the Manhattan or city block distance, displayed for n dimensions in Equation 1.2. In these formulas, n equals the number of dimensions (i.e., the number of predictors). The Manhattan distance metric calculates the distance between two points in a grid based on a strictly horizontal and/or vertical path, while the Euclidean distance metric allows for diagonal paths. Figure 1.2 visualizes how the distance between two points on a grid is calculated by these distance metrics. Both distances are by definition non-negative.

d_{Euclidean}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (1.1)

d_{Manhattan}(x, y) = \sum_{i=1}^{n} |x_i - y_i| \qquad (1.2)
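To make the formulas concrete, the following minimal R sketch computes both distances for two observations; the vectors x and y are hypothetical observations with n = 3 predictors.

# Minimal sketch: Euclidean and Manhattan distances between two observations
x <- c(5.1, 3.5, 1.4)  # hypothetical test set observation (n = 3 predictors)
y <- c(4.9, 3.0, 1.3)  # hypothetical training set observation

d_euclidean <- sqrt(sum((x - y)^2))  # Equation 1.1
d_manhattan <- sum(abs(x - y))       # Equation 1.2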


Figure 1.2: The Euclidean distance (a) between two points versus the Manhattan distance (b) between the same points.

Both the Euclidean distance and the Manhattan distance are special cases of the Minkowski distance with its parameter p. Setting p to 1 yields the Manhattan distance, while setting p to 2 results in the Euclidean distance. Parameter p is continuous and can take any value of 1 or above.

To illustrate the different outcomes of the formula, Figure 1.3 shows the effect of parameter p on unit circles. The unit circle is the circle whose center is at the origin and whose radius is one. Every point of the unit circle has an x and a y coordinate related to the point's position on the circle, computed by simple sine and cosine equations. It can be seen that different values of p drastically influence the shape of the circle and therefore the outcome of the formula.

d_{Minkowski}(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} \qquad (1.3)

Figure 1.3: Visualization of unit circles for the Minkowski distance with various values of p. When p = 1 (left), the distance metric equals the Manhattan distance. When p = 2 (middle), the distance metric equals the Euclidean distance. In the rare case that p = ∞ (right), the distance metric equals the Chebyshev distance.


1.1.4 Weighing the Neighbors

Both for classification and regression, it makes sense to assign weights to the contributions of the neighbors, so that nearer neighbors contribute more to the prediction than more distant ones. Different weighting schemes, also referred to as kernels, can be applied to the distance d between a test set observation and a training set observation (Hechenbichler & Schliep, 2004). These kernels are functions of the distance d with a maximum at d = 0 and values that get smaller with growing absolute value of d. The implementation in the kknn package adds a small constant to the distance in order to avoid weights of 0 for some of the nearest neighbors (Hechenbichler & Schliep, 2004). The available kernels are shown below and are visualized in Figure 1.4.

Rectangular kernel: \frac{1}{2} \cdot I(|d| \leq 1)

Triangular kernel: (1 - |d|) \cdot I(|d| \leq 1)

Epanechnikov kernel: \frac{3}{4}(1 - d^2) \cdot I(|d| \leq 1)

Biweight kernel: \frac{15}{16}(1 - d^2)^2 \cdot I(|d| \leq 1)

Triweight kernel: \frac{35}{32}(1 - d^2)^3 \cdot I(|d| \leq 1)

Cosine kernel: \frac{\pi}{4} \cos(\frac{\pi}{2} d) \cdot I(|d| \leq 1)

Gaussian kernel: \frac{1}{\sqrt{2\pi}} \exp(-\frac{d^2}{2})
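As an illustration of these formulas, the sketch below implements three of the kernels as plain R functions of the distance d; it is only a sketch of the equations above, not the exact kknn implementation.

# Minimal sketch of three kernel weight functions (illustration of the formulas above)
rectangular_kernel <- function(d) 0.5 * (abs(d) <= 1)
triangular_kernel  <- function(d) (1 - abs(d)) * (abs(d) <= 1)
gaussian_kernel    <- function(d) (1 / sqrt(2 * pi)) * exp(-d^2 / 2)

d <- seq(0, 1.5, by = 0.25)     # example distances
round(triangular_kernel(d), 3)  # weights shrink to 0 as d grows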


Figure 1.4: Visualizations of the various kernels with distances d ranging from 0 to 1.5. As can be seen, at some point the weights for the observations become 0. To avoid this, we add a small constant to the distance.


With weights, only positive values of d are allowed. Adding weights adds a third parameter, alongside the distance parameter and the number of nearest neighbors, to the k-nearest neighbors model. However, from experience, the choice of a particular kernel (apart from the special case of the rectangular kernel) is not crucial (Hechenbichler & Schliep, 2004).

1.1.5 Making Predictions

Classification After training the model on the training data, each new observation in the test set (y, x) has one or multiple values of predictors x and may have a value of the target y. The model classifies it into the class with the largest summed weight

\max_r \left( \sum_{i=1}^{k} W(d(x, x_{(i)})) \, I(y_i = r) \right), \qquad (1.4)

where W is the kernel weight, k is the number of nearest neighbors, d is the distance to the nearest neighbors, and r is a class label. To illustrate, in Figure 1.5, the question mark represents an observation we want to classify. The red triangles and blue squares represent the training set that our model has seen. The solid line represents the 3-nearest neighbor boundary, the dashed line the 5-nearest neighbor boundary. The test sample is classified either into the first class of blue squares or into the second class of red triangles. If k = 3 it is assigned to the second class, because there are 2 triangles and only 1 square inside the inner circle. If k = 5 it is assigned to the first class (3 squares vs. 2 triangles inside the outer circle, assuming a rectangular weighting scheme).

Figure 1.5: The objective is to classify the question mark here. Two predictors are used, the first represented at the x-axis and the second represented at the y-axis. If k = 3 and using a rectangular kernel, the question mark will be classified as a red triangle. If k = 5, the question mark will be classified as a blue square.
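To illustrate how such a classification could be run directly in R, the sketch below uses the kknn package (the package underlying the JASP analysis) on the built-in iris data; the 80/20 split, k = 3, and the rectangular kernel are arbitrary example choices.

library(kknn)

set.seed(1)
train_rows <- sample(nrow(iris), size = 0.8 * nrow(iris))  # 80% training set
train <- iris[train_rows, ]
test  <- iris[-train_rows, ]

# k-nearest neighbors classification with k = 3 and a rectangular kernel
fit <- kknn(Species ~ ., train = train, test = test,
            k = 3, distance = 2, kernel = "rectangular")

mean(fitted(fit) == test$Species)  # proportion of correctly classified test set observations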


Regression After determining the distances and weights for the observations in the training set, each new observation is predicted using the weighted average of the numerical target values of its k nearest neighbors. For example, in Figure 1.6, k-nearest neighbors regression with k = 2 and a rectangular kernel is used. The target value of the test observation (question mark) is calculated by taking the mean of the target values of its 2 nearest neighbors. As shown, the predicted value lies exactly between the target values of the 2 nearest neighbors, because of the choice of the rectangular kernel.

Figure 1.6: k-nearest neighbors regression with k=2. The goal is to predict a target value for the test observation (question mark). This is done by taking the mean of the 2 nearest target values of nearest neighbors A and B.

In Figure 1.7, the target value of the test observation (question mark) is calculated by taking the mean of the target values of its 3 nearest neighbors. The predicted value in this case lies somewhat lower than the predicted value in Figure 1.6.


Figure 1.7: k-nearest neighbors regression with k = 3. Again, the objective is to predict a target value for the test observation (question mark). The prediction is much lower now, because the target value of the third nearest neighbor C is much lower than those of nearest neighbors A and B.

1.1.6 Summary

The advantage of the k-nearest neighbors methods is that we can perform classification and regression without making strong parametric assumptions. However, one of the disadvantages of these methods is that we need to manually choose the number of nearest neighbors k, the distance parameter p, and the kernel W. Because of this, it can be useful to perform cross-validation to find out accurately what the model's performance is before committing to these parameters.

1.2 Cross-Validation

1.2.1 Introduction

Cross-validation, in the machine learning context, is a predictive model validation technique for assessing how the results of a machine learning analysis will generalize to an independent data set. It is used in settings where one wants to estimate how accurately a predictive model will perform in practice. In a machine learning problem, cross-validation consists of defining various training and test sets to test the model and assess accuracy on the whole data set, in order to limit problems like over-fitting and give insight on how the model will generalize to an independent data set (Shao, 1993; Kohavi, 1995).

One round of cross-validation involves dividing the data into complementary subsets, running the analysis on the training set, and validating the performance on the test set. To reduce variability, multiple rounds of cross-validation are performed using different training and test sets, and the validation results are averaged over the rounds. The size of the training and test sets depends on the kind of cross-validation used (Refaeilzadeh, Tang & Liu, 2009; Stone, 1974). One of the main reasons for using cross-validation over running the model on a single fixed split is that there might not be enough data available to partition the data set into separate training and test sets without losing significant modeling or testing capability. In these cases, cross-validation provides a fair way to estimate model prediction performance. It also gives an improved estimate of the accuracy of the model when generalized to a new data set (Seni & Elder, 2010; Stone, 1974).

1.2.2 Leave-One-Out Cross-Validation

In the case of leave-one-out cross-validation, we pick only one observation as the test set. We then build a model on all the remaining, complementary observations, and evaluate its error on the single held-out observation. A model accuracy estimate is obtained by repeating this procedure for each of the data points available, returning accuracy as the proportion of correctly classified observations (classification) or the Root Mean Squared Error (regression). Leave-one-out cross-validation can be computationally expensive because it generally requires one to construct many models, equal in number to the size of the data set and the specified search space for nearest neighbors (Kearns & Ron, 1999). Also, leave-one-out cross-validation is similar in behavior to the AIC in that it will favor models that have a large N (Stone, 1974).

1.2.3 K-Fold Cross-Validation

In the case of K-fold cross-validation, we divide the data set into K parts, the so-called folds. It is important to note that we denote the number of folds with a capital K; this should not be confused with the number of neighbors k. Every fold is then iteratively used as the test set, while the model is trained on the remaining parts as the training set. Accuracy is assessed by averaging the proportion of correctly classified observations (classification) or the Root Mean Squared Error (regression) in each test set over the K folds (Rodriguez, Perez & Lozano, 2010).
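A minimal sketch of K-fold cross-validation for a k-nearest neighbors classifier is given below, again assuming the kknn package; K = 5 folds and k = 3 neighbors are arbitrary example values.

library(kknn)

K <- 5  # number of folds (capital K; not the number of neighbors k)
set.seed(1)
folds <- sample(rep(1:K, length.out = nrow(iris)))

accuracy <- numeric(K)
for (i in 1:K) {
  train <- iris[folds != i, ]  # K - 1 folds form the training set
  test  <- iris[folds == i, ]  # the held-out fold is the test set
  fit   <- kknn(Species ~ ., train = train, test = test, k = 3, kernel = "rectangular")
  accuracy[i] <- mean(fitted(fit) == test$Species)
}
mean(accuracy)  # accuracy averaged over the K folds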

1.2.4 Split-half Cross-validation

In split-half cross-validation, the data set is split into a training and a test set that each contain 50% of the data. The training set is then used to train the model, after which the accuracy is assessed on the test set. Then the training set and the test set are switched and the procedure is repeated. The two estimates of the performance of the model are averaged to obtain a general assessment of the model's performance (Steyerberg et al., 2001; Kohavi, 1995).


1.2.5 Summary

Armed with the k-nearest neighbors algorithm and various cross-validation methods, we can successfully solve a large number of classification and regression problems, including recognizing handwritten digits, classifying satellite image scenes, and mapping forests. Closer to the field of Psychology, k-nearest neighbors has proven well suited for pattern classification, visual recognition of categories, and feature learning (Cover & Hart, 1967; Zhang et al., 2006; Cost & Salzberg, 1993). However, it is mostly used in the field of Neuropsychology, which is remarkable. The lack of use in other fields within Psychology may be because there is no accessible software for researchers to apply the k-nearest neighbors algorithm to their data set. JASP features the k-nearest neighbors algorithm in its machine learning module. The next section gives an introduction to the k-nearest neighbors classification and regression analyses in JASP.

1.3 k-Nearest Neighbors in JASP

1.3.1 The Interface

Both the k-nearest neighbors analyses for classification and regression that I implemented can be found in the machine learning module in JASP. In this section, you will be walked through the interface of the k-nearest neighbors method option by option. For every option, an explanation will be given as to how it influences the performance and output of the k-nearest neighbors procedure. The interfaces for classification and regression are practically the same, so I will discuss them simultaneously here. When one of these analyses is clicked, it opens up the graphical user interface shown in Figure 1.8.


Figure 1.8: The graphical user interface for the k-nearest neighbors classification method in JASP.

The Variable Window The variable window at the top left is the place where the variables of the data set are shown. As is usual in JASP, the measurement level of the variables is indicated by the icon to the left of the variable names. The variable level can either be nominal, ordinal, or continuous and will be automatically selected by the JASP engine. However, if the user wants to specify the measurement level of a variable manually, this can be done by clicking the variable name.

The Target Window The target selector lets the user specify which variable needs to be predicted. In the case of regression this has to be a continuous variable. For classification, this has to be a variable at the nominal or ordinal measurement level. The selected variable can be transferred from the variable window to the target window by pressing the arrow button. Only one target variable is allowed.

The Predictor Window This window lets the user specify which predictors should be used for predicting the target variable. Multiple predictors are allowed. Specifying predictor(s) works analogously to specifying a target.

The Apply Indicator Window This window accepts a binary apply indicator variable to separate the data into two parts: one where the target variable is known (indicator value = 0) and one where the target variable is unknown so that the model must be applied to produce a prediction (indicator value = 1).

% of Training Data This option lets the user select the percentage of observations in the data set that is randomly selected to be the training set.

• Auto [default]: Sets the percentage of observations used for the training set to 80%.

• Manual: When clicked, spawns an input box. This lets the user specify a percentage between 1 and 99.

Number of Nearest Neighbors This option lets the user specify the number of nearest neighbors that the algorithm will use for the analysis. There are three possible options for specifying the number of neighbors:

• Auto [default]: the number of neighbors is determined by the number of observations in the data set. If N ≤ 1000, the number of neighbors is set to 1. If N ≥ 21000, the number of neighbors is set to 21 because going higher than 21 neighbors generally does not add more information to the model (Beyer et al., 1999). For all other cases, the number of neighbors is 0.1 % of the observations in the entire data set. As an example, with N = 5600, the number of neighbors is set to 5.

• Manual: When selected, this spawns an input box. The number of neigh-bors is determined by the input of the user in this box. The box requires as input an integer between 1 and N − 1.

• Optimized: When selected, this spawns two input boxes. The optimal number of neighbors is determined by running the k-nearest neighbors analysis for the entire k range specified by the user and selecting the model with the highest accuracy (classification) or the lowest Root Mean Squared Error (regression). Accuracy in classification is defined as the proportion of correctly classified observations in the test set.


Weights The weights determine how much each nearest neighbor contributes to the predicted value. This has to be one of the unweighted, optimal, triangular, Epanechnikov, biweight, triweight, cos, inv, Gaussian, or rank weighting schemes. Formal notations of the weighting schemes can be found in section 1.1.4.

Tables These boxes can be checked to produce tables presenting the output of the k-nearest neighbors algorithm. The user can specify the range of observations for which he or she wants to see the tables. Below the Tables header are two boxes that specify the lower and the upper limit of the observation range.

• Predictions: This table shows the observed and predicted values for each observation in the test set. In regression these are numerical values; in classification these are class memberships. The deviation of the predicted values from the observed values is used to assess model performance.

– Confidence [only available in classification]: This option shows the confidence with which the observation is assigned to the given class. Confidence is calculated for each test case by aggregating the responses of the k nearest neighbors among the training cases using the distances and the weights.

• Distances: This table shows the distances between every observation in the test set and the nearest neighbors associated with this observation. The distances are calculated using the Minkowski distance metric with parameter p, which the user can define as an advanced option. Most common are the Manhattan distance (p = 1) and the Euclidean distance (p = 2).

• Weights: This table shows the weights of every nearest neighbor with respect to the observation that it is associated with. Weight values depend on the weighting scheme selected.

• Confusion table: This table shows the frequencies of the observed classifications and the predicted classifications.

Plots These boxes can be checked to produce a plot presenting the results of the k-nearest neighbors optimization. In regression, an additional plot is possible.

• Accuracy vs k: This check box is only available when the number of nearest neighbors is being optimized. It plots the accuracy of all the models within the k range against k.

• Test set accuracy [regression only]: This check box is only available in the k-nearest neighbors regression analysis and can be considered the equivalent of the confusion table in classification.


Predictions for New Data This bar can be clicked to open up the options for predicting new data. Note that in order to predict new data, there must be a binary apply indicator variable in the data set indicating which observations to predict. The options are displayed in Figure 1.9.

Figure 1.9: The options under ’Predictions for New Data’ in the k-nearest neigh-bors analyses.

Advanced Options When clicked, this button opens up a new part of the interface where the user can select advanced options for the algorithm, displayed in Figure 1.10.

Figure 1.10: The advanced options in the k-nearest neighbors analyses.

• NA action: This option lets the user specify how to deal with missing data in the data set using three different methods. Delete listwise simply omits the missing data when running the procedure. NA predict uses missing value information to predict missing values. Rough fix performs a rough imputation of missing values.

• Scale: This lets the user scale the predictors to have equal standard deviations.


• Distance parameter: This lets the user set the distance parameter for the calculations.

– Auto [default]: This sets the distance parameter p to 2 (i.e., the Euclidean distance).

– Manual: When clicked, this spawns a box. This box lets the user specify the distance parameter manually.

• Model optimization: This optimizes the entire model to estimate the best parameters for the analysis. This method will be discussed in chapter 2.

• Cross-validation: These options let the user specify what kind of cross-validation he or she wants to do. Choices are leave-one-out cross-validation and K-fold cross-validation, where the number of folds can be specified by the user.

– Leave-one-out cross-validation: This option performs the leave-one-out cross-validation described in section 1.2.2 with the same number of nearest neighbors as the regular analysis.

– K-fold cross-validation: This option performs the K-fold cross-validation described in section 1.2.3. The input box with the label No. folds requires a value between 1 and the number of observations in the data set. When the number of folds equals the number of observations in the data set, leave-one-out cross-validation is performed.

• Seed: This option sets the seed for the random selection of the test set in the analysis.

1.3.2 The Output

The Summary Table This table is the only table that is always present in the output, since it displays the key results of the analysis. It contains the number of nearest neighbors used in the analysis and the fit measure associated with the specific analysis, see Figure 1.11. Accuracy is the proportion of correctly classified test set observations, while the root mean squared error (RMSE) is an indicator of the prediction error on the test set in regression. The summary table also contains the cross-validation results upon request.

Figure 1.11: Example of the summary table in a k-nearest neighbors classifica-tion analysis in JASP.
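For reference, the two fit measures reported in this table can be written down in a few lines of R; the vectors observed and predicted are hypothetical.

# Classification: proportion of correctly classified test set observations
accuracy <- function(observed, predicted) mean(observed == predicted)

# Regression: root mean squared error (RMSE) on the test set
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))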


The Confusion Table [Classification only] This table displays the observed levels of the target variable against the predicted levels of the target variable in the test set. An example table is shown in Figure 1.12.

Figure 1.12: Example of the confusion table in a k-nearest neighbors classifica-tion analysis in JASP.

The Predictions Table This table displays the prediction made for each test set observation according to the k-nearest neighbors algorithm. The filter above the table can be used for selecting a portion of the observations of interest. The header "observed" indicates the true values of the test set observations, while the header "predicted" indicates the predicted values, see Figure 1.13.

Figure 1.13: Example of the predictions table in a k-nearest neighbors regression analysis in JASP.

The Distances Table This table displays the distances associated with each of the nearest neighbors of each test set case. The filter above the table header allows the user to inspect a subset of the test set observations. The distances are calculated according to the formula of the Minkowski distance with distance parameter p, see Equation 1.3. When p is set to 1 or 2 (the Manhattan distance and the Euclidean distance, respectively), a table footnote displays which distance metric is used, see Figure 1.14.


Figure 1.14: Example of the distances table in a k-nearest neighbors regression analysis with 3 nearest neighbors in JASP.

The Weights Table This table displays the weights assigned to each of the nearest neighbors of each test set observation. The filter above the table header allows the user to inspect a subset of the test set observations. The weights do not sum to one. A message under the table displays the kernel that was used to assign the weights, see Figure 1.15.

Figure 1.15: Example of the weights table in a k-nearest neighbors regression analysis in JASP.

Predictions for New Data When the apply indicator is specified in the Apply Indicator window, the analysis creates a table with predictions for the observations with an apply indicator equal to 1, see Figure 1.16.


Figure 1.16: Example of predictions for new data in a k-nearest neighbors clas-sification analysis in JASP using the Iris data set.

Accuracy vs k This plot displays the relation between the number of nearest neighbors and the accuracy of the model.


Figure 1.17: Example of accuracy vs k plot in the k-nearest neighbors classifi-cation analysis in JASP.

Test set accuracy When this check box is ticked, the predicted test observations are plotted against the observed test set observations to get an indication of model fit. If the observations lie near the diagonal, the predicted values are close to the observed values.

Figure 1.18: Example of the test set accuracy plot in a k-nearest neighbors regression analysis in JASP.


1.3.3 Examples in JASP

Diabetes The diabetes data set (http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html) contains 442 observations of diabetes patients. Ten baseline variables were collected: age, sex, body mass index, average blood pressure, and six blood serum measurements. The response of interest is a quantitative measure of disease progression after one year. We want to predict this variable from the 10 other variables using k-nearest neighbors regression. First, we load the data set in JASP and select k-nearest neighbors regression. We drag the criterion variable Y to the target window and all the predictor variables to the predictor window.

Figure 1.19: Illustration of the input window with the procedure described above with the Diabetes data set.

As can be seen from Figure 1.19, our RMSE is now 78.9. Optimizing the number of nearest neighbors from 1 to 10 gives us an optimal number of nearest neighbors of 6 with a lower RMSE, 58.5. To see the process of this optimization, we can tick the box for the error vs k plot. This produces Figure 1.20, which, alongside the summary table, plots the different values of k against the RMSE.


Figure 1.20: The error vs k plot of the example above with the Diabetes data set.

We can select 6 as the optimal number of nearest neighbors and see how well the observed values are approximated by the predicted values by ticking the test set accuracy plot check box, which produces Figure 1.21.


Figure 1.21: The test set accuracy plot of the example above with the Diabetes data set and 6 nearest neighbors.

Glass The Glass data set (https://archive.ics.uci.edu/ml/datasets/glass+identification) contains 214 observations of glass samples, with variables indicating the amount of certain minerals in each sample, its refractive index, and the type of glass. This data set is well suited for classification because we have a categorical variable (type of glass) and many continuous variables (the minerals).

We can predict the type of glass from all the minerals inside the glass using k-nearest neighbors classification. To do this, we read in the data set in JASP and drag the Type of glass to the target window. We then drag Sodium, Magnesium, Aluminum, Silicon, Potassium, Calcium, Barium & Iron to the predictor window shown in Figure 1.22.


Figure 1.22: Illustration of the input window with the procedure described above for the glass data set.

The output shows the accuracy (70% of our test set is classified correctly) and that our model is automatically fitted with a number of nearest neighbors equal to 1. The output also shows the confusion table of the test set by default. However, suppose we want to discover and apply a model with an optimal number of nearest neighbors. To achieve this we set the nearest neighbors option to optimized and specify the range from 3 to 20. The output table now shows us that the optimal number of nearest neighbors is 5 with an accuracy of 67.6%. Experimenting with the weights shows that the best weighting scheme in combination with 5 nearest neighbors gives us an accuracy of 70.3%. To get a better estimate of the accuracy of the classifier, we can perform cross-validation by clicking the K-fold button under the advanced options. The output in Figure 1.23 now shows a more accurate estimate of the accuracy, namely 66.8%.

Figure 1.23: The summary table output of the example above with the glass data set.

Breast Cancer The Wisconsin Breast Cancer data set (https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)) contains 699 observations on several indicators of breast cancer. It also contains a variable classifying each observation as a form of breast cancer. We can predict the form of breast cancer from the indicators of breast cancer in the data set using k-nearest neighbors classification. To do so, we first load the data set in JASP and select the k-nearest neighbors classification analysis. We drag Class to the target window and Clump thickness, Uniformity of cell Size, Uniformity of cell Shape, Marginal adhesion, Single ephithelial cell size, Bland chromatin, Normal nucleoli & Mitoses to the predictors window, as shown in Figure 1.24.

Figure 1.24: Illustration of the input window with the procedure described above with the Wisconsin breast cancer data set.

As can be seen, the auto option for the number of nearest neighbors already gives us a pretty good accuracy (91.2%). Suppose we nevertheless want to optimize the number of nearest neighbors to get the optimal accuracy. We click optimized and set the range from 1 to 20. This range is based on an informal guess such that a sufficient number of values is explored. Our range of 1 to 20 obtains an accuracy of 96.4% at k = 7. The confusion table in Figure 1.25 presents the observed and predicted labels and their frequencies.


Figure 1.25: The summary table output of the example above with the Wiscon-sin breast cancer data set.

We can inspect the predictions for the test set data by ticking the predictions box; to also display the confidence of the predictions, we tick the confidence box as well. We set the range of the predictions to 56 to 64. This table displays the observed and predicted values for every observation, and the confidence with which the prediction is made, see Figure 1.26.

Figure 1.26: The prediction table output of the example above with the Wis-consin breast cancer data set.

We can now see that observation 63 was classified incorrectly and can inspect it further to assess why it was classified incorrectly.

1.4 List of R Packages

• kknn
• plot3D


Chapter 2

An Optimizer for k-Nearest Neighbors

2.1 The Need to Optimize

Changing the parameters in the k-nearest neighbors algorithm can often have a large impact on the accuracy or root mean squared error (RMSE) of the model. In order to obtain the most accurate predictions and avoid having to try out many parameter combinations by hand, I implemented an optimizer that takes this work out of users' hands. Optimizing the accuracy involves tuning the learning parameters of the model, such as the distance parameter, the number of nearest neighbors, and the weights; this can be accomplished by exhaustive enumeration, since there are often only a few possible values for some parameters. There are, however, more efficient methods available. These methods range from Bayesian optimization methods to evolutionary optimization algorithms (Kennedy, 2011; Goldberg, 1988; Snoek et al., 2012).

The latter, evolutionary algorithms, are particularly interesting for machine learning because, just like machine learning algorithms, they try to learn from data. The idea behind these algorithms is that, through cooperation and competition among the search particles in the population, population-based optimization techniques can find good solutions to complex problems (Shi & Eberhart, 1999). Such algorithms have been developed to arrive at good solutions to optimization problems, even when dimensionality is high. Evolutionary algorithms and evolutionary-based machine learning have been critiqued on the grounds that natural evolution is simply too slow to accomplish good results in an artificial learning system (Goldberg & Holland, 1988). This argument, however, ignores the complexity that evolution has achieved in three billion years: the performance of even the simplest living organisms is more complex than the most complex human artificial designs. Evolutionary optimization algorithms are therefore a good way to start thinking about optimizing a problem in machine learning.


2.2 The Optimization Algorithm

JASP uses an optimization technique developed in-house, which aims to select the combination of the number of nearest neighbors k, the distance parameter d, and the weights W that leads to the best predictive performance. This means the lowest misclassification error for classification or the lowest RMSE for regression. The performance of the model is assessed by the optimizer using leave-one-out cross-validation (see section 1.2).

When selecting parameters for the k-nearest neighbors algorithm, the user has options for three different parameters. One of them, the distance parameter p, can take on a continuous range of values (Equation 2.1), while the other two, the kernel and the discrete number of nearest neighbors, are bounded. The combination of the weights and the nearest neighbors gives us a finite grid of combinations, see Figure 2.1. Since we can calculate the goodness-of-fit, either the misclassification error or the RMSE, of every element in this plane given a certain distance parameter, we can then optimize this plane across the distance parameter. Even a tiny adjustment of the distance parameter can have a large influence, see Figure 2.3. Equation 2.1 shows the formula for the Minkowski distance again to illustrate where the distance parameter is used. For more information on the Minkowski distance, see section 1.1.3.

d_{Minkowski}(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} \qquad (2.1)

Figure 2.1: A finite grid of combinations between the weights parameter and the number of nearest neighbors at a specific value of the distance parameter p, color indicates different values of predictive performance.


Using cross-validation and the train.kknn function in the kknn package (Schliep & Hechenbichler, 2016), the optimizer starts by optimizing the model in the discrete W × k two-dimensional plane for d = 0. This results in the optimal combination of weights and number of nearest neighbors for that specific distance parameter, see Figure 2.1. The algorithm then proceeds to the next distance parameter, d_i = d_{i-1} + 0.1, until the maximum distance parameter specified by the user has been reached. The optimizer purposely takes these small steps so that it can take advantage of the computation speed of the train.kknn function. When the algorithm has finished, it returns the optimal combination of the distance parameter, the weights parameter, and the number of neighbors, together with the RMSE or accuracy associated with that specific combination.
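The sketch below illustrates this strategy under stated assumptions: train.kknn from the kknn package performs the leave-one-out search over the number of neighbors and the kernels for a fixed distance parameter, and an outer loop steps the distance parameter in increments of 0.1 (here starting at 0.1 and using the iris data as an example). This is a sketch of the general idea, not the exact JASP implementation.

library(kknn)

max_k     <- 10
distances <- seq(0.1, 10, by = 0.1)  # grid for the distance parameter
best      <- list(misclass = Inf)

for (d in distances) {
  fit <- train.kknn(Species ~ ., data = iris, kmax = max_k, distance = d,
                    kernel = c("rectangular", "triangular", "epanechnikov",
                               "biweight", "triweight", "cos", "gaussian"))
  misclass <- min(fit$MISCLASS)  # best leave-one-out error for this distance parameter
  if (misclass < best$misclass) {
    best <- list(misclass = misclass, distance = d,
                 k = fit$best.parameters$k, kernel = fit$best.parameters$kernel)
  }
}
best  # optimal combination of distance parameter, k, and kernel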

2.3 The Optimizer in JASP

In JASP, the complete model optimization described in the previous section is available through a check box under the advanced options. The model optimization routine requires as input both the maximum number of nearest neighbors (Max. k) and the maximum distance parameter (Max. d). Because of computation time, these values are set to 10 by default. When the corresponding check box is ticked, the optimization starts (it can be terminated at any time by unchecking the check box) and returns a table displaying the optimal parameters for the model. An example table is given in Figure 2.2.

Figure 2.2: Table output for the glass data set predicting type of glass from all mineral variables.

The optimization algorithm also returns a plot of the optimization process: for each distance parameter on the x-axis, the goodness-of-fit is displayed on the y-axis, while the optimal number of nearest neighbors is indicated by color (see Figure 2.3).


Figure 2.3: Optimizer plot for the glass data set predicting type of glass from all mineral variables. As can be seen from this plot, the optimal number of nearest neighbors k is 7, the optimal distance parameter is 0.5 and the associated lowest misclassification error equals 0.256. The optimal weighting scheme can be inspected in the optimization table.

(35)

Chapter 3

The k-Means Algorithm

3.1 Method

3.1.1 Introduction

Clustering is the process of partitioning or grouping a given set of observations into disjoint clusters. This is done such that observations in the same cluster are alike and observations belonging to different clusters are different. The k-means algorithm is an unsupervised machine learning method for clustering. Unsupervised methods do not specify a target variable that they want to predict. k-Means is a specific unsupervised method that aims to divide a number of observations into k clusters in which each observation belongs to the cluster with the nearest mean. The main idea is to define k centroids, or means, one for each cluster, so that the within-cluster sum of squares is minimized (Steinley, 2006). Minimizing the within-cluster sum of squares ensures that the clusters reach maximum intra-homogeneity and inter-heterogeneity: observations within a cluster should be similar, while observations from different clusters should be dissimilar.

3.1.2 Determining Cluster Membership

Contrary to supervised machine learning methods, the unsupervised k-means algorithm trains the model on all the specified variables; there is no target variable. The algorithm initially places the centroids at the locations of randomly chosen observations. Since different initial locations of these centroids cause the algorithm to produce different results, it is better to place them as far away from each other as possible. Each observation is then associated with its nearest centroid, and the centroid positions are updated iteratively until no movement of an observation from one cluster to another reduces the within-cluster sum of squares.
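As a minimal illustration, the sketch below runs k-means in R on two example predictors with k = 3 clusters; the base R kmeans function is used here, which is not necessarily identical to the JASP implementation.

set.seed(1)
data <- iris[, c("Petal.Length", "Petal.Width")]  # two example predictors

fit <- kmeans(data, centers = 3, nstart = 25)  # k = 3 clusters, 25 random starting sets

fit$cluster       # cluster membership of every observation
fit$centers       # the three centroids
fit$tot.withinss  # total within-cluster sum of squares that is minimized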


3.1.3 Determining the Number of Clusters

The correct choice of the number of clusters k is often not immediately evident, with the decision depending on the trade-off between parsimony and the information that every new cluster provides. In general, increasing k without penalty always reduces the amount of error in the resulting clustering, to the extreme case of zero error if each data point is considered its own cluster (i.e., when k equals the number of data points). Intuitively, the optimal choice of k strikes a balance between maximum compression of the data using a single cluster and maximum accuracy by assigning each data point to its own cluster. If an appropriate value of k is not apparent from prior knowledge of the properties of the data set, it can be chosen according to one of the following exploratory methods.

The Elbow Method The elbow method looks at the total within-cluster sum of squares as a function of the number of clusters: one should choose a number of clusters such that adding another cluster does not produce a substantially lower within-cluster sum of squares. More precisely, if one plots the within-cluster sum of squares against the number of clusters, the first clusters explain much of the within-cluster sum of squares, but at some point the amount of information a new cluster adds drops, resulting in an angle in the graph (see Figure 3.1). The number of clusters is chosen at this point, hence the "elbow criterion".

Figure 3.1: Example of number of clusters plotted against the total within sum of squares. In this case, the optimal number of clusters is 3 because the 4th cluster does not substantially reduce the sum of squares further.
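A minimal sketch of the elbow method in R: fit k-means for a range of cluster numbers and plot the total within-cluster sum of squares against the number of clusters (the scaled iris data and the range 1 to 10 are arbitrary example choices).

set.seed(1)
data <- scale(iris[, 1:4])  # example data with scaled predictors

k_range <- 1:10
wss <- sapply(k_range, function(k) kmeans(data, centers = k, nstart = 25)$tot.withinss)

plot(k_range, wss, type = "b",
     xlab = "Number of clusters", ylab = "Total within-cluster sum of squares")
# Choose the number of clusters at the "elbow", where the curve flattens.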


Information Criteria Another set of methods for determining the number of clusters are information criteria, such as the Akaike information criterion (AIC) (Akaike, 1987; Akaike, 1981) and the Bayesian information criterion (BIC) (Schwarz, 1978). However, they can only be used for relative model comparisons. The rule of thumb with these information criteria is that a lower value indicates a better fit. The AIC for a model i is defined as

AIC_i = -2 \log L_i + 2V_i \qquad (3.1)

where L_i, the maximum likelihood for candidate model i, is determined by adjusting the V_i free parameters in such a way as to maximize the probability of the observed data under the candidate model. Equation 3.1 shows that the AIC rewards descriptive accuracy via the maximum likelihood, and penalizes lack of parsimony according to the number of free parameters. The BIC for a model i is defined as

BIC_i = -2 \log L_i + V_i \log n \qquad (3.2)

where n is the number of observations that enter into the likelihood calculation. The BIC penalty term is larger than the AIC penalty term when n \geq e^2.

AIC and BIC Weights When the AIC and BIC differences are small, the acceptance of a single model may lead to a false sense of confidence. In addition, the raw AIC and BIC values cannot tell us what the weight of evidence is in favor of model 1 over model 2. Such considerations are important in situations where a specific model 1 may have the lowest AIC, but model 2 may generally be the true model. It is therefore useful to determine the weight of all the candidate models. The weight w_i(IC) can be interpreted as the probability that model i is the best model, given the data and the set of candidate models. The weights (for either AIC or BIC) can be calculated as follows (Wagenmakers & Farrell, 2004):

\Delta_i(IC) = IC_i - \min IC \qquad (3.3)

w_i(IC) = \frac{\exp\{-\frac{1}{2} \Delta_i(IC)\}}{\sum_{k=1}^{K} \exp\{-\frac{1}{2} \Delta_k(IC)\}} \qquad (3.4)
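Given a vector of AIC (or BIC) values for the candidate models, the weights in Equations 3.3 and 3.4 can be computed with a few lines of R; the three example values are made up.

ic_weights <- function(ic) {
  delta <- ic - min(ic)  # Equation 3.3
  w <- exp(-0.5 * delta)
  w / sum(w)             # Equation 3.4
}

aic <- c(210.4, 212.1, 218.9)  # hypothetical AIC values for three candidate models
ic_weights(aic)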

3.1.4 Summary

The advantage of the k-means algorithm is that it is computationally fast and thus easily applicable to large data sets. Fortunately, the k-means algorithm is known in the field of Psychology (Jenkins & Russell, 1952; Vaughan, 1968; Sodian, Schneider & Perlmutter, 1986) but has remained relatively underused outside of the topic of memory and recall. To make the k-means clustering algorithm available for everyone, it is featured in the machine learning module in JASP. The next section will give an introduction to the k-means analysis in JASP.

3.2 k-Means in JASP

The k-means method can be found under the classification header in the machine learning module in JASP. When clicked, it opens up the graphical user interface shown in Figure 3.2.

3.2.1 The Interface

This is the place where all the tuning of the k-means algorithm takes place and where the user can specify the desired model. In this section, we will walk you through all the available options and explain how they affect the outcome of the k-means algorithm. We will also take a careful look at the output and explain exactly how to interpret every part of it. For every option, suggestions will be given on how to optimize the performance of the k-means algorithm.


Figure 3.2: The graphical user interface for the k-means method in JASP.

The Variable Window The variable window is the place where the variables in the data set are shown. As is usual in JASP, the measurement level of the variables is indicated by the icon on the left of the variable names. The variable level can either be nominal, ordinal, or continuous and will be automatically selected by the JASP engine. However, if the user wants to manually specify the measurement level of a variable, this can be done by clicking the variable.

The Predictor Window This window lets the user pick which variables are used in the k-means analysis. The user can move variables from the variable window to the predictor window by selecting one or multiple variables and pressing the arrow button.


Number of Clusters This option lets the user specify the number of clusters that the algorithm will use to do the analysis. There are four possible options for determining the number of clusters:

• Auto [default]: the number of clusters is determined by the number of observations in the data set. If N ≤ 1000, the number of clusters will be set to 2. On the other hand, if N ≥ 21000, the number of clusters will be fixed to 21. For all other cases, the number of clusters will be 0.1 % of the observations in the data set. So, the number of clusters will be set to 5 for N = 5600.

• Manual: When selected, this spawns an input box. The number of clusters is determined by the input of the user in this box. The box requires as input an integer between 1 and the number of observations in the data set.

• Optimized: When selected, this spawns two input boxes, one for the lower limit of the k range and one for the upper limit. The optimal number of clusters is determined by running the k-means analysis for the entire k range specified by the user and selecting the model with the lowest AIC or BIC value.

• Robust: When selected, this spawns two input boxes, one for the lower limit of the k range and one for the upper limit. The difference between the optimized option and the robust option is that, instead of the k-means method, this option performs the k-medoids method (Park & Jun, 2009). It chooses a number of random data points as centers instead of the conventional means. The optimal number of clusters is determined by running the k-medoids analysis for the entire k range specified by the user and selecting the number of clusters with the highest criterion value specified within the Criterion selection menu at the bottom of the Number of clusters menu.

• Criterion: The criterion option is only used when the k-medoids technique is used (Reynolds, Richards & Rayward-Smith, 2004; Jin & Han, 2011). The criterion defaults to Average silhouette width (Rousseeuw, 1987). The silhouette value is a measure of how similar an object is to its own cluster compared to other clusters. This criterion ranges from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. Zero is not an arbitrary null point and indicates that the object is equally similar to all clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters. For large data sets, it is recommended to use the Multi average silhouette width to decrease computation time. Finally, there is the option to use the Calinski-Harabasz criterion, also known as the variance ratio criterion (Caliński & Harabasz, 1974).
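As an illustration of the average silhouette width criterion, the sketch below combines the pam (k-medoids) routine and its silhouette information from the cluster package; the scaled iris data and the range of 2 to 6 clusters are arbitrary example choices, and this is not necessarily how JASP computes the criterion.

library(cluster)

data <- scale(iris[, 1:4])
k_range <- 2:6

# Average silhouette width of the k-medoids solution for 2 to 6 clusters
avg_sil <- sapply(k_range, function(k) pam(data, k = k)$silinfo$avg.width)

k_range[which.max(avg_sil)]  # number of clusters with the highest average silhouette width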


Number of Iterations This option lets the user select the number of iterations the algorithm will run before returning output.

• Auto [default]: sets the number of iterations to 15.

• Manual: When selected, this spawns an input box. This box lets the user set the number of iterations manually. The field requires an input between 1 and 999.

Number of Random Sets This option lets the user select the number of random sets that are chosen from the rows as the initial centroids in the algorithm.

• Auto [default]: sets the number of random sets to 25, which is also the default in the R package kknn used for this analysis.

• Manual: When selected, this spawns an input box. This box lets the user set the number of random sets manually. It requires an input between 1 and 999.

Plots These options can be checked to produce the described plots as a result of the k-means method.

• 2-D cluster plot: This check box should only be ticked when there are 2 variables in the predictor window. This option outputs a plot where the first variable is plotted on the x-axis and the second variable is plotted on the y-axis. The option to show the distances to the centroids can be ticked as well, which draws black lines from every point to its cluster centroid.

• Criterion versus clusters: This option produces a line plot of the criterion as a function of the number of clusters. It can be used to visually determine the optimal number of clusters, since the elbow method can be applied here.

• PCA cluster plot: This option extracts the 2 largest principal components from the data and plots the observations as a function of these components. It then draws cluster boundaries around the observations belonging to the same cluster. Since it extracts 2 principal components from the data, this option can be used with as many predictor variables as desired.

• Within sum of squares versus clusters: This option produces a line plot of the within-cluster sum of squares as a function of the number of clusters. It can be used to visually determine the optimal number of clusters, since the elbow method can be applied here.


Tables These options can be checked to produce the described tables as a result of the k-means method.

• Cluster information: This table displays information about the individual clusters, such as the size of the cluster and the values of the centroids of the clusters, plus the within, between, and total sum of squares of the clusters.

• Predictions: This table shows the predicted values for each observation. Since this is an unsupervised machine learning method, there are no observed target values that we are trying to predict. Instead, we are trying to predict an unknown class membership. Thus, this table shows only the observation numbers and the cluster they are predicted to belong to.

Advanced Options

• Algorithm: This can be one of Hartigan-Wong (Hartigan & Wong, 1979), Lloyd (Lloyd, 1982), or MacQueen (MacQueen, 1967). Lloyd's algorithm is the standard iterative k-means algorithm and is the simplest procedure. The Hartigan-Wong algorithm is smarter but slower. MacQueen's algorithm updates the centroids any time a point is moved and also makes time-saving choices in checking for the closest cluster.

• Seed: This option sets the seed for the random components of the analysis, such as the selection of the initial centroids.
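Assuming the analysis is built on R's standard kmeans routine (the JASP backend may differ in its details), the advanced options map roughly onto the following call; the values shown correspond to the Auto defaults described above.

set.seed(1)                                  # "Seed"
fit <- kmeans(scale(iris[, 1:4]),
              centers   = 3,
              iter.max  = 15,                # "Number of iterations" (Auto)
              nstart    = 25,                # "Number of random sets" (Auto)
              algorithm = "Hartigan-Wong")   # alternatives: "Lloyd", "MacQueen"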

3.2.2 The Output

Summary Table This table is the only table that is always present in the output because it displays key summary statistics of the analysis, such as the number of clusters used in the analysis, the R-squared of the resulting cluster solution, and the AIC and BIC statistics. Figure 3.3 shows an example table.

Figure 3.3: Example of the summary table in a k-means clustering analysis in JASP.
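As a rough illustration of where these summary statistics come from, the sketch below derives them from a kmeans fit in R. The R-squared is the between sum of squares divided by the total sum of squares; the AIC and BIC lines use a commonly applied approximation in which the within sum of squares plays the role of the deviance, which may differ from the exact definitions used in JASP.

x   <- scale(iris[, 1:4])
fit <- kmeans(x, centers = 3, nstart = 25)

r_squared <- fit$betweenss / fit$totss

k <- nrow(fit$centers)      # number of clusters
m <- ncol(fit$centers)      # number of predictors
n <- nrow(x)

aic <- fit$tot.withinss + 2 * m * k          # approximation
bic <- fit$tot.withinss + log(n) * m * k     # approximation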

Cluster Information This table displays the cluster information per cluster. Figure 3.4 shows the table, which grows with the number of variables as it will display a centroid for each predictor. Furthermore, it provides useful information for interpreting the results of the analysis and its implications.


Figure 3.4: Example of the cluster information table in a k-means clustering analysis in JASP.

Predictions Table This table displays the prediction made for each test set observation according to the k-means algorithm. The filter above the table can be used to select a portion of the observations of interest. Figure 3.5 shows an example table.

Figure 3.5: Example of the predictions table in a k-means clustering analysis in JASP.

2-D Cluster Plot This option plots two variables and their cluster membership. The plot in Figure 3.6 displays the relation between two variables. Points assigned to different clusters are shown in different colors. Black lines from the points to the cluster centroids can be drawn in the plot using an additional option.


Figure 3.6: Example of the 2D cluster plot in a k-means clustering analysis in JASP.
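A rough R equivalent of this plot, using two iris variables as an arbitrary example, colors the points by cluster and draws a segment from every observation to its cluster centroid; it is a sketch of the idea, not the code behind the JASP figure.

x   <- iris[, c("Petal.Length", "Petal.Width")]
fit <- kmeans(x, centers = 3, nstart = 25)

plot(x, col = fit$cluster, pch = 19)
segments(x0 = x$Petal.Length, y0 = x$Petal.Width,
         x1 = fit$centers[fit$cluster, 1],
         y1 = fit$centers[fit$cluster, 2])   # lines to the centroids
points(fit$centers, pch = 8, cex = 2)        # mark the centroids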

Criterion vs Cluster Plot When the algorithm is optimized, this plots the selected criterion versus the number of clusters. An example plot is shown in Figure 3.7.

Figure 3.7: Example of the criterion vs clusters plot in a k-means clustering analysis in JASP with the quakes data set.


PCA Cluster Plot This plot displays the data as a function of its two largest principal components. Observations are plotted on these components and clusters are visualized using cluster boundaries; see Figure 3.8.

Figure 3.8: Example of the principal component analysis (PCA) cluster plot in a k-means clustering analysis in JASP.
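The sketch below shows the underlying idea in R: project the data onto its first two principal components and color the observations by cluster. Drawing the actual cluster boundaries, as JASP does, is left out here.

x   <- scale(iris[, 1:4])
fit <- kmeans(x, centers = 3, nstart = 25)

pcs <- prcomp(x)$x[, 1:2]                    # scores on the first two components
plot(pcs, col = fit$cluster, pch = 19,
     xlab = "Principal component 1",
     ylab = "Principal component 2")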

Within Sum of Squares vs Cluster Plot When the algorithm is optimized, this option plots the within sum of squares of the clusters versus the number of clusters. An example plot is shown in Figure 3.9.

Figure 3.9: Example of the optimization results of the k-means clustering analysis using the Iris data set.


3.2.3 Examples in JASP

Iris The iris data set (https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv) contains 150 observations of irises, measuring their petal length, petal width, sepal length, and sepal width, and classifying them as three different species. We can use this data to do a k-means clustering analysis and examine whether the data really shows three clusters.

First, we can put two variables in the predictor window and see how the observations on these variables are distributed over 3 clusters; see Figure 3.10. To do this, we set the manual number of clusters to 3 and tick the 2-D cluster plot option together with the distances to the centroids.

Figure 3.10: Example of the procedure described above in the k-means clustering analysis in JASP.

After inspecting this plot, we can now add the other variables Sepal.Length and Sepal.Width to the predictors. This invalidates the 2-D cluster plot because it requires exactly two variables; when there are more or fewer than two variables, the plot will show an error. To determine the optimal number of clusters, we can optimize the k-means model by selecting the optimized option under number of clusters. We optimize the model from 2 to 10 clusters. The optimization returns all models and their summary statistics; the optimization table is shown in Figure 3.11.


Figure 3.11: Example of the optimization results of the k-means clustering analysis using the Iris data set.

We can inspect how the analysis selected this optimal model by ticking the within sum of squares vs clusters plot. The analysis tries to find the elbow in the curve based on the decrease in the within sum of squares and fills it with a red color to identify it; see Figure 3.12.

Figure 3.12: Example of the optimization results of the k-means clustering analysis using the Iris data set.

The results in Figure 3.12 show that 5 clusters yield optimal performance. However, using more than 5 clusters does not substantially reduce the sum of squares. Now that we have this result, we would like to inspect our clusters. To do so, we can tick the cluster information option and select which results we want to display in the table. For now, we select the size of the clusters, the centroids of the clusters, the between clusters sum of squares, and the total sum of squares.


Figure 3.13: Example of the cluster information results of the k-means clustering analysis using the Iris data set.
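Roughly the same cluster information can be read off a kmeans fit in plain R; this is only a sketch of where the numbers in the table come from, not the code JASP runs.

fit <- kmeans(scale(iris[, 1:4]), centers = 5, nstart = 25)

fit$size          # size of each cluster
fit$centers       # centroid of each cluster on every predictor
fit$withinss      # within-cluster sum of squares per cluster
fit$betweenss     # between-cluster sum of squares
fit$totss         # total sum of squares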

Quakes The quakes data set (https://vincentarelbundock.github.io/Rdatasets/csv/datasets/quakes.csv) contains 1000 observations on earthquakes in Fiji. The variables are the latitude, the longitude, the depth, the number of stations that reported the earthquake, and the magnitude of the earthquake. We can use this data set to see whether we can differentiate between various types of earthquakes without having to classify them beforehand. The input options for this analysis are shown in Figure 3.14.

Figure 3.14: Example of input for the k-means analysis described above in JASP with the Quakes data set.

Because this data contains outliers in the variable stations, we can choose to deviate from the classical k-means method and try a more robust variant, the k-medoids method. This is done by selecting the robust option under number of clusters. We will optimize the model from 2 to 10 clusters with the average silhouette width criterion. The results are displayed in Figure 3.15.


Figure 3.15: Example of the summary table in the k-means clustering analysis in JASP with the quakes data set.
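A comparable k-medoids analysis can be run directly in R with the cluster package. The sketch below fits PAM for 2 to 10 clusters, keeps the average silhouette width of each fit, and retains the best solution; it mimics the procedure described above but is not the exact routine JASP uses.

library(cluster)

x <- scale(quakes)          # lat, long, depth, mag, stations

ks  <- 2:10
asw <- sapply(ks, function(k) pam(x, k = k)$silinfo$avg.width)

best_k <- ks[which.max(asw)]
fit    <- pam(x, k = best_k)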

Because this data has more than two predictors, the 2-D cluster plot cannot be constructed. Instead, we can make a principal component analysis plot (Figure 3.16) of the data by ticking the PCA cluster plot option.

Figure 3.16: Example of the PCA cluster plot of the k-means clustering analysis using the quakes data set.


3.3 Future Work: Optimizing the k-Means Algorithm

Since the optimizer for k-nearest neighbors seems to work well, a natural thought is to implement something similar for the k-means algorithm. However, this raises several problems. First, because the method is unsupervised, there is no single correct criterion to optimize. Second, there are more parameters than in the k-nearest neighbors algorithm. An interesting approach involves combining the k-means algorithm with particle swarm optimization (PSO). Van der Merwe & Engelbrecht (2003) applied this optimization technique and found that the k-means algorithm tends to converge faster than PSO, but usually to a less accurate clustering. They showed that the performance of the PSO clustering algorithm can be further improved by seeding the initial swarm with the result of the k-means algorithm. The hybrid algorithm first executes the k-means algorithm once; in this case the k-means clustering is terminated either when the maximum number of iterations is exceeded, or when the average change in the centroid vectors is less than 0.0001 (a user-specified parameter). The result of the k-means algorithm is then used as one of the particles, while the rest of the swarm is initialized randomly. The hybrid technique was better at defining separate clusters than both the k-means algorithm and the PSO algorithm on their own.
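To make the description above concrete, the sketch below implements a much simplified version of this hybrid in R: one particle is seeded with the k-means solution, the remaining particles are random, and standard PSO velocity updates minimize the within-cluster sum of squares. The swarm size, inertia weight, and acceleration constants are illustrative assumptions, not the settings used by Van der Merwe & Engelbrecht (2003).

hybrid_kmeans_pso <- function(x, k, n_particles = 10, n_iter = 50,
                              w = 0.72, c1 = 1.49, c2 = 1.49) {
  x <- as.matrix(x)
  d <- ncol(x)

  # Fitness of a particle: assign every point to its nearest centroid and
  # sum the squared distances (total within-cluster sum of squares)
  fitness <- function(flat) {
    centroids <- matrix(flat, nrow = k, ncol = d)
    dists <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k]
    sum(apply(dists, 1, min)^2)
  }

  # Seed one particle with the k-means result, the rest with random rows
  km    <- kmeans(x, centers = k)
  swarm <- replicate(n_particles,
                     as.vector(x[sample(nrow(x), k), ]), simplify = FALSE)
  swarm[[1]] <- as.vector(km$centers)
  velocity   <- lapply(swarm, function(p) rep(0, length(p)))

  pbest     <- swarm
  pbest_fit <- sapply(swarm, fitness)
  gbest     <- pbest[[which.min(pbest_fit)]]

  for (iter in seq_len(n_iter)) {
    for (i in seq_len(n_particles)) {
      r1 <- runif(length(swarm[[i]]))
      r2 <- runif(length(swarm[[i]]))
      velocity[[i]] <- w * velocity[[i]] +
        c1 * r1 * (pbest[[i]] - swarm[[i]]) +
        c2 * r2 * (gbest - swarm[[i]])
      swarm[[i]] <- swarm[[i]] + velocity[[i]]
      f <- fitness(swarm[[i]])
      if (f < pbest_fit[i]) {
        pbest[[i]]   <- swarm[[i]]
        pbest_fit[i] <- f
      }
    }
    gbest <- pbest[[which.min(pbest_fit)]]
  }
  matrix(gbest, nrow = k, ncol = d)   # final centroid estimates
}

# Example use: centroids <- hybrid_kmeans_pso(scale(iris[, 1:4]), k = 3)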

3.4 List of R Packages

• vegan
• fpc
• cluster


Further Reading

• Machine learning

– Goldberg, D. E., & Holland, J. H. (1988). Genetic algorithms and machine learning. Machine Learning, 3, 95-99.

https://deepblue.lib.umich.edu/bitstream/handle/2027.42/46947/10994_2005_Article_422926.pdf?sequence=1&isAllowed=y

– Kotsiantis, S. B., Zaharakis, I., & Pintelas, P. (2007). Supervised machine learning: A review of classification techniques.

http://cmapspublic3.ihmc.us/rid=1MV61Z70R-ST09P4-14N3/Reference_Classification_Comparison.pdf

– Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Vanderplas, J. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825-2830.

http://www.jmlr.org/papers/volume12/pedregosa11a/pedregosa11a.pdf

• Weighted k-Nearest Neighbors classification and regression

– Beyer, K., Goldstein, J., Ramakrishnan, R., & Shaft, U. (1999). When is “nearest neighbor” meaningful?. International Conference on Database Theory, 217-235.

http://research.cs.wisc.edu/techreports/1998/TR1377.pdf

– Dudani, S. A. (1976). The distance-weighted k-nearest-neighbor rule. IEEE Transactions on Systems, Man, and Cybernetics, 4, 325-327.

http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5408784

– Hechenbichler, K., & Schliep, K. (2004). Weighted k-nearest-neighbor techniques and ordinal classification.

https://epub.ub.uni-muenchen.de/1769/1/paper_399.pdf

• k-Means clustering

– Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31, 651-666.

https://msu.edu/~ashton/classes/866/papers/2010_jain_kmeans_50yrs__clustering_review.pdf


– Ray, S., & Turi, R. H. (1999). Determination of number of clusters in k-means clustering and application in colour image segmentation. Proceedings of the 4th international conference on advances in pattern recognition and digital techniques, 137-143.

https://pdfs.semanticscholar.org/0ec1/32fce9971d1e0e670e650b58176dc7bf36da.pdf


List of Figures

1.1 The objective is to predict a value for the test set observation (red star). To do this, we define its three nearest neighbors as the observations that have the 3 smallest distances to the red star. Using the Euclidian distance, we select training set observations A, B and C to be the three nearest neighbors that lie the closest to our test set observation.

1.2 The Euclidian distance (a) between two points versus the Manhattan distance (b) between the same points.

1.3 Visualization of the Minkowski distance between unit circles with various values of p. When p = 1 (left), the distance metric equals the Manhattan distance. When p = 2 (middle), the distance metric equals the Euclidian distance. In the rare case that p = ∞ (right), the distance metric equals the Chebyshev distance.

1.4 Visualizations of the various kernels with distances d ranging from 0 to 1.5. As can be seen, at some point the weights for the observations become 0. To avoid this, we add a small constant to the distance.

1.5 The objective is to classify the question mark here. Two predictors are used, the first represented at the x-axis and the second represented at the y-axis. If k = 3 and using a rectangular kernel, the question mark will be classified as a red triangle. If k = 5, the question mark will be classified as a blue square.

1.6 k-nearest neighbors regression with k = 2. The goal is to predict a target value for the test observation (question mark). This is done by taking the mean of the 2 nearest target values of nearest neighbors A and B.

1.7 k-nearest neighbors regression with k = 3. Again, the objective is to predict a target value for the test observation (question mark). It is predicted much lower now, because the third nearest neighbor C is much lower than the nearest neighbors A and B.

1.8 The graphical user interface for the k-nearest neighbors classification method in JASP.

1.9 The options under 'Predictions for New Data' in the k-nearest neighbors analyses.

1.10 The advanced options in the k-nearest neighbors analyses.

1.11 Example of the summary table in a k-nearest neighbors classification analysis in JASP.

1.12 Example of the confusion table in a k-nearest neighbors classification analysis in JASP.

1.13 Example of the predictions table in a k-nearest neighbors regression analysis in JASP.

1.14 Example of the distances table in a k-nearest neighbors regression analysis with 3 nearest neighbors in JASP.

1.15 Example of the weights table in a k-nearest neighbors regression analysis in JASP.

1.16 Example of predictions for new data in a k-nearest neighbors classification analysis in JASP using the Iris data set.

1.17 Example of the accuracy vs k plot in the k-nearest neighbors classification analysis in JASP.

1.18 Example of the test set accuracy plot in a k-nearest neighbors regression analysis in JASP.

1.19 Illustration of the input window with the procedure described above with the Diabetes data set.

1.20 The error vs k plot of the example above with the Diabetes data set.

1.21 The test set accuracy plot of the example above with the Diabetes data set and 6 nearest neighbors.

1.22 Illustration of the input window with the procedure described above for the glass data set.

1.23 The summary table output of the example above with the glass data set.

1.24 Illustration of the input window with the procedure described above with the Wisconsin breast cancer data set.

1.25 The summary table output of the example above with the Wisconsin breast cancer data set.

1.26 The prediction table output of the example above with the Wisconsin breast cancer data set.

2.1 A finite grid of combinations between the weights parameter and the number of nearest neighbors at a specific value of the distance parameter p; color indicates different values of predictive performance.

2.2 Table output for the glass data set predicting type of glass from all mineral variables.

2.3 Optimizer plot for the glass data set predicting type of glass from all mineral variables. As can be seen from this plot, the optimal number of nearest neighbors k is 7, the optimal distance parameter is 0.5 and the associated lowest misclassification error equals 0.256. The optimal weighting scheme can be inspected in the optimization table.
