
MSc in Econometrics Thesis

Predicting the Movement of the S&P500 Index by using different Machine Learning Techniques

Thomas Ovaa (10675698)

Supervisor: Dr. Noud van Giersbergen
Second reader: Dr. Marco van der Leij


Statement of Originality

This document is written by student Thomas Ovaa, who declares to take full responsibility for the contents of this document.

I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


Abstract

In this paper we compare six different machine learning techniques for predicting the movement of a stock market index. As input variables we use several technical indicators derived from the closing prices of the index. We test for significant forecasting outperformance and try to set up a profitable trading strategy. Both statistically and economically, the Neural Network framework performs best.


Contents

1 Introduction
2 Literature
3 Data
4 Methodology
  Logit
  Linear Discriminant Analysis
  Quadratic Discriminant Analysis
  K-nearest Neighbors
  Support Vector Machines
  Neural Networks
5 Results
  Logit
  Linear Discriminant Analysis
  Quadratic Discriminant Analysis
  K-nearest Neighbors
  Support Vector Machines
  Neural Networks
  Comparison with random walk
  Robustness checks
  Setting up a trading strategy
6 Conclusion
References

1 Introduction

Forecasting stock market prices, or even just their direction of movement, is considered one of the hardest tasks in time series prediction, because the stock market is dynamic, nonlinear, complicated, nonparametric and chaotic in nature (Abu-Mostafa & Atiya (1996)). The problem is of great importance: if we can predict the stock index movement accurately, we can develop effective market trading strategies. For instance, investors can hedge against market risks and speculators can make money by trading the stock index.

According to the Efficient Market Hypothesis (Fama (1970)), all relevant information is at any moment fully incorporated in the price, so that investors cannot outperform the market. The only way one can obtain higher returns is by buying riskier securities. In this research we hope to show that we can reject the Efficient Market Hypothesis.

Many authors have applied a wide variety of techniques to individual stocks or stock market indices. Our aim is to compare a broad range of machine learning techniques that can predict the movement of the well-known S&P 500 index. So we are trying to predict a binary variable.

This is what Ou & Wang (2009) do (though applied to the Hang Seng index of Hong Kong), and they achieve high out-of-sample hit rates of up to 86%, but their results are rather vaguely described and their conclusion is very brief. In comparison, Huang et al. (2005) perform much worse when using the same approach. For instance, Ou & Wang (2009) achieve a hit rate of 84% when using a technique called LDA, whereas Huang et al. (2005) only predict 55% correctly. This is a large difference, and it is highly questionable whether it is only due to different input variables and a different market to predict.

We aim to build a more consistent and robust framework, so that we can properly compare and discuss the different methods.

In our research we will try to answer the following question:

Which machine learning technique performs best in predicting the up- and downward movement of the stock market and can we set up a successful trading strategy which beats a standard buy-and-hold strategy?

In particular we will investigate the following models/methods:

- Logit;
- Linear Discriminant Analysis;
- Quadratic Discriminant Analysis;
- K-nearest Neighbors;
- Support Vector Machines;
- Neural Networks.

We split up our research question into the following subquestions:

1. How do we build the different techniques/models and how do we make the corresponding predictions? This is described in depth in Section 4.

2. How do we evaluate the predictions? Can we find significant differences in predictive power between the models, and also in comparison with a naive prediction method? In Section 5 we carry out calculations regarding significant outperformance of our predictions over naive prediction methods.

3. How do we check for robustness? The relative performance of machine learning techniques can depend on characteristics of the data, so we will perform some robustness checks to see whether a certain model outperforms the others on a structural basis. For instance, we can also apply our methods to a different index.

4. Which variables are we going to use to predict the movement? In Section 3 we thoroughly describe the prediction variables.

The outline of this thesis is as follows: first, a discussion of relevant literature; second, a description of the data and the different prediction variables. After that, the different methods are described in the Methodology section. Then the results are discussed, after which the conclusion follows.

2 Literature

Quite some research has been done in this particular field of stock market forecasting, but the existing literature is not always complete from an econometric point of view. Huang et al. (2005) measure the relative performance of the models but do not carry out a proper significance test. Patel et al. (2015) use the same type of input variables as we do, but their calculations seem questionable; for instance, they report negative standard deviations in their summary statistics. Ou & Wang (2009) report out-of-sample hit rates of between 80 and 86% on a daily basis, which seems unlikely. In our research we try to come up with a reliable overview and comparison of the different machine learning techniques.

3 Data

First we have to choose the stock market index for which we want to predict the movement. We choose the American S&P500 index because it is rather broad and is generally considered the most reliable gauge of developments in the US economy. We use daily data because we expect that with less frequent data certain information is already incorporated in the price. We obtain the historical data, both the index prices and the risk-free rates (used in the empirical trading strategy), from Yahoo Finance. The whole dataset covers the period from January 2005 to December 2014.

Patel et al. (2015) use so-called technical indicators to predict the future movement of stocks. These indicators are derived from the stock market prices themselves. Following Kara et al. (2011), we selected ten technical indicators as input variables for the construction of prediction models to forecast the direction of movement of the stock price index. Next we describe the different indicators, give their formulas and explain why they might have forecasting power for the index movement.

Simple n-day Moving Average

\[ SMA_t = \frac{\sum_{i=0}^{n-1} C_{t-i}}{n}. \tag{1} \]


Weighted n-day Moving Average

\[ WMA_t = \frac{\sum_{i=0}^{n-1} (n-i)\, C_{t-i}}{n(n+1)/2}. \tag{2} \]

The first two technical indicators are moving averages. The moving average is a simple technical analysis tool that smooths out price data by calculating a constantly updated average price. We use 10-day moving averages, as we are predicting the short-term future. In general, if the price is above the moving average the trend is up, and vice versa. So to get the final predictor we subtract the closing price from the moving average.

Momentum

\[ MOM_t = C_t - C_{t-n+1}. \tag{3} \]

Momentum measures the rate of rise and fall of stock prices. If momentum is positive, it indicates an upward trend, and vice versa.

Stochastic K%

\[ STCK\%_t = \frac{C_t - L_n}{H_n - L_n} \times 100. \tag{4} \]

Stochastic K% is a stochastic oscillator. It is a momentum indicator that shows the location of the close price relative to the high-low range over a set number of periods. As a rule, the momentum changes direction before the price.

Stochastic D%

\[ STCD\%_t = \frac{\sum_{i=0}^{n-1} STCK\%_{t-i}}{n}. \tag{5} \]

Stochastic D% is a (usually) 3-day moving average of Stochastic K% and serves as a signal or trigger line. We now combine Stochastic K% with Stochastic D% in the following way into a new binary variable:

\[ B\_STC_t = \begin{cases} 1 & \text{if } STCK\%_t > STCD\%_t \\ 0 & \text{else} \end{cases} \tag{6} \]

We use this as a predictor because, according to Achelis (2001), the Stochastic K% rising above the Stochastic D% serves as a buy signal, and vice versa.


Moving Average Convergence Divergence (MACD)

\[ MACDsignal(k)_t = MACD_t \cdot \frac{2}{k+1} + MACDsignal(k)_{t-1}\Big(1 - \frac{2}{k+1}\Big), \]
where \( MACD_t = EMA(12)_t - EMA(26)_t \), and
\[ EMA(l)_t = C_t \cdot \frac{2}{l+1} + EMA(l)_{t-1}\Big(1 - \frac{2}{l+1}\Big). \]

The Moving Average Convergence/Divergence oscillator (MACD) is one of the simplest and most effective momentum indicators available. The MACD turns two trend-following indicators, moving averages, into a momentum oscillator by subtracting the longer moving average from the shorter moving average. A 9-day moving average of the MACD (the signal line) is usually subtracted from the MACD line, and we use the resulting variable as predictor, since according to Achelis (2001) we have to buy when the MACD rises above its signal line, and vice versa.

Relative Strength Index (RSI)

\[ RSI_t = 100 - \frac{100}{1 + \left(\sum_{i=0}^{n-1} UP_{t-i}/n\right) \Big/ \left(\sum_{i=0}^{n-1} DW_{t-i}/n\right)}. \tag{7} \]

The Relative Strength Index is a momentum oscillator that measures the speed and change of price movements. RSI oscillates between zero and 100. Traditionally, the RSI is considered overbought (possible future downward trend) when above 70 and oversold (possible future upward trend) when below 30. So we transform this continuous variable into a categorical variable.

\[ RSI\_CAT_t = \begin{cases} 0 & \text{if } 0 \le RSI_t < 30 \\ 1 & \text{if } 30 \le RSI_t \le 70 \\ 2 & \text{if } 70 < RSI_t \le 100 \end{cases} \tag{8} \]

After that we use the following two dummy variables as inputs in the different prediction models:

\[ D\_RSI1_t = \begin{cases} 1 & \text{if } RSI\_CAT_t = 0 \\ 0 & \text{else} \end{cases} \tag{9} \qquad D\_RSI2_t = \begin{cases} 1 & \text{if } RSI\_CAT_t = 2 \\ 0 & \text{else} \end{cases} \tag{10} \]

Larry Williams' R%

\[ WILLR_t = \frac{H_n - C_t}{H_n - L_n} \times (-100). \tag{11} \]

Williams' R% reflects the level of the close price relative to the highest high for the look-back period. Williams' R% oscillates from 0 to -100. Readings from 0 to -20 are considered overbought (possible future downward trend). Readings from -80 to -100 are considered oversold (possible future upward trend). So, in the same way as for the Relative Strength Index, we create dummy variables for Williams' R%:

\[ D\_WILLR1_t = \begin{cases} 1 & \text{if } -100 \le WILLR_t < -80 \\ 0 & \text{else} \end{cases} \tag{12} \qquad D\_WILLR2_t = \begin{cases} 1 & \text{if } -20 < WILLR_t \le 0 \\ 0 & \text{else} \end{cases} \tag{13} \]

A/D (Accumulation/Distribution) Oscillator

\[ A/D\_OSC_t = \frac{H_t - C_{t-1}}{H_t - L_t}. \tag{14} \]

The A/D Oscillator follows the stock trend, meaning that if its value at time t is greater than that at time t-1, the trend is probably upwards, and vice versa. So we create a binary variable as indicator:

\[ B\_A/D\_OSC_t = \begin{cases} 1 & \text{if } A/D\_OSC_t > A/D\_OSC_{t-1} \\ 0 & \text{else} \end{cases} \tag{15} \]


Commodity Channel Index (CCI)

\[ CCI_t = \frac{M_t - SM_t}{0.015\, D_t}, \quad \text{where } M_t = (H_t + L_t + C_t)/3, \quad SM_t = \frac{\sum_{i=0}^{n-1} M_{t-i}}{n} \quad \text{and} \quad D_t = \frac{\sum_{i=0}^{n-1} |M_{t-i} - SM_t|}{n}. \]

The Commodity Channel Index is a versatile indicator that can be used to identify a new trend or warn of extreme conditions. In general, CCI measures the current price level relative to an average price level over a given period of time. CCI is relatively high when prices are far above their average (indicating a possible future downward trend) and vice versa. To use the CCI as an overbought/oversold indicator, readings above +100 imply an overbought condition while readings below -100 imply an oversold condition. So again we define two dummy variables and use these as predictors:

\[ D\_CCI1_t = \begin{cases} 1 & \text{if } CCI_t < -100 \\ 0 & \text{else} \end{cases} \tag{16} \qquad D\_CCI2_t = \begin{cases} 1 & \text{if } CCI_t > 100 \\ 0 & \text{else} \end{cases} \tag{17} \]

Here \(C_t\) denotes the closing price at time t, \(L_t\) the low price at time t, \(H_t\) the high price at time t, \(L_n = \min(L_t, \ldots, L_{t-n+1})\) and \(H_n = \max(H_t, \ldots, H_{t-n+1})\).
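To make the construction of the predictors concrete, the following Python sketch computes the first few indicators from daily closing, high and low prices. It is an illustration only, not the thesis code (the original computations were done in Matlab and R); the column names and the 10-day window are assumptions.

```python
import numpy as np
import pandas as pd

def technical_indicators(df: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Sketch of indicators (1)-(6) from columns 'Close', 'High', 'Low'."""
    out = pd.DataFrame(index=df.index)
    c, h, l = df["Close"], df["High"], df["Low"]

    # (1)-(2): simple and weighted n-day moving averages
    sma = c.rolling(n).mean()
    w = np.arange(n, 0, -1)  # weight n for today down to 1 for lag n-1
    wma = c.rolling(n).apply(
        lambda x: (x[::-1] * w).sum() / (n * (n + 1) / 2), raw=True)
    out["SMA"] = sma - c  # final predictor: moving average minus closing price
    out["WMA"] = wma - c

    # (3): momentum
    out["MOM"] = c - c.shift(n - 1)

    # (4)-(6): stochastic K%, its 3-day signal line D%, and the buy dummy
    l_n, h_n = l.rolling(n).min(), h.rolling(n).max()
    stck = (c - l_n) / (h_n - l_n) * 100
    stcd = stck.rolling(3).mean()
    out["B_STC"] = (stck > stcd).astype(int)
    return out
```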

Using historical data from January 2005 until December 2014, we calculated summary statistics for the different indicators. They are given in Table 1.


Table 1: Summary statistics for the selected indicators

Indicator   Min       Max       Mean      Standard deviation
SMA         707.93    2070.60   1350.90   276.14
WMA         700.91    2077.30   1351.40   276.77
MOM         -303.35   125.35    3.11      38.31
STCK%       0.00      100.00    60.19     32.02
STCD%       7.35      97.73     60.18     22.30
MACD        -56.90    18.49     2.38      11.48
RSI         1.63      99.82     55.19     18.53
WILLR%      -100.00   0.00      -39.81    32.02
A/D Osc     -0.58     1.67      0.52      0.41
CCI         -272.16   248.42    19.29     102.08

Table 2: Correlations for the selected indicators

            SMA    WMA    MOM    B_STC  MACD   D_RSI1 D_RSI2 D_WILLR1 D_WILLR2 B_A/D  D_CCI1 D_CCI2
SMA         1.00
WMA         0.97   1.00
MOM        -0.87  -0.76   1.00
B_STC      -0.56  -0.55   0.43   1.00
MACD       -0.59  -0.47   0.73   0.12   1.00
D_RSI1      0.52   0.47  -0.56  -0.24  -0.44   1.00
D_RSI2     -0.39  -0.32   0.49   0.23   0.31  -0.17   1.00
D_WILLR1    0.64   0.63  -0.55  -0.42  -0.37   0.49  -0.24   1.00
D_WILLR2   -0.65  -0.63   0.59   0.52   0.41  -0.25   0.50  -0.35    1.00
B_A/D Osc  -0.02  -0.05  -0.02   0.03  -0.06   0.02   0.00  -0.04    0.01     1.00
D_CCI1      0.66   0.65  -0.54  -0.39  -0.38   0.45  -0.24   0.73   -0.35     0.01   1.00
D_CCI2     -0.53  -0.53   0.44   0.41   0.30  -0.19   0.34  -0.27    0.68     0.00  -0.27  1.00

SMA is very strongly correlated with WMA, so we remove WMA from the predictor variables. We end up with eleven predictor variables.

We can analyze the results of the different models in two ways, statistically and economically:

1. by calculating out-of-sample hit rates and testing whether a model significantly outperforms the others;

2. by setting up a trading strategy and comparing its return with that of a simple buy-and-hold strategy in the index.

4 Methodology

In this section we describe the models used for each technique. In general we use 5 years of data to estimate the parameters of the model, i.e. to train the model (training data), and 5 years of data to compare the predictions with (test data). The total dataset runs from January 2005 to December 2014. To make a prediction for day t+1 on day t, we use data from days t-1249, ..., t to train the model. This is called a rolling window. We do not use an expanding window, because we want to minimize the likelihood of having a break in the characteristics of the data that we are using. Next we discuss the different models.
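As an illustration of this scheme, a minimal Python sketch is given below; fit_predict is a hypothetical placeholder for any of the classifiers described in the following subsections.

```python
import numpy as np

def rolling_window_forecasts(X, y, fit_predict, window=1250):
    """For each day t >= window, train on the previous `window` days
    and predict the movement of day t. Row t of X holds the lagged
    predictors, so it is known at the end of day t-1."""
    n = len(y)
    preds = np.full(n, np.nan)
    for t in range(window, n):
        preds[t] = fit_predict(X[t - window:t], y[t - window:t], X[t:t + 1])
    return preds
```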

Logit

Logistic regression can be used to describe the relationship between a set of predictor variables and a categorical response variable, so we can predict the dependent variable on the basis of the explanatory variables. The logit model is non-linear and its parameters can be estimated by maximum likelihood. The estimated model gives predicted probabilities \(\hat{p}_{t+1}\) for the choice \(y_{t+1} = 1\), and these can be transformed into predicted choices by predicting \(\hat{y}_{t+1} = 1\) if \(\hat{p}_{t+1} \ge c\) and \(\hat{y}_{t+1} = 0\) if \(\hat{p}_{t+1} < c\). In practice one often takes \(c = \frac{1}{2}\). The model for logistic regression is given by

\[ P(y_{t+1} = 1) = F(x_t'\beta) = \frac{\exp(x_t'\beta)}{1 + \exp(x_t'\beta)}, \tag{18} \]

where \(y_{t+1}\) is the binary stock market variable and \(x_t\) is the vector of predictors one period earlier. Matlab is used to perform the calculations.
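A minimal sketch of this estimation and prediction step (the thesis used Matlab; here scikit-learn's logistic regression stands in for the maximum-likelihood logit fit):

```python
from sklearn.linear_model import LogisticRegression

def logit_fit_predict(X_train, y_train, x_next, c=0.5):
    # penalty=None (scikit-learn >= 1.2) gives the plain unpenalized ML fit
    model = LogisticRegression(penalty=None, max_iter=1000)
    model.fit(X_train, y_train)
    p_hat = model.predict_proba(x_next)[0, 1]  # predicted P(y_{t+1} = 1)
    return int(p_hat >= c)                     # cutoff c = 1/2
```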

Linear Discriminant Analysis

Let \(\pi_k\) represent the prior probability that a randomly chosen observation comes from the kth class; this is the probability that a given observation is associated with the kth category of the response variable y. Let \(f_k(x) := P(X = x \mid y = k)\) denote the density function of X for an observation that comes from the kth class. Then Bayes' Theorem states that

\[ p_k(x) := P(y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}. \tag{19} \]

In general, estimating \(\pi_k\) is easy if we have a random sample of ys from the population. We refer to \(p_k(x)\) as the posterior probability that an observation X = x belongs to the kth class. We apply the LDA classifier to the case of multiple predictors (p > 1). To do this, we assume that \(X = (X_1, X_2, \ldots, X_p)\) is drawn from a multivariate normal distribution with a class-specific mean vector and a common covariance matrix. The Linear Discriminant Analysis formula is

\[ \delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \log\pi_k. \tag{20} \]

The Bayes classifier assigns an observation X = x to the class that has maximum \(\delta_k(x)\). In practice, the LDA method maximizes the ratio of between-class variance to within-class variance, thus guaranteeing maximal separability between the classes. Matlab is used to perform the calculations.
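A corresponding sketch, again with scikit-learn as a stand-in for the Matlab routine:

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def lda_fit_predict(X_train, y_train, x_next):
    # Class-specific means and a pooled covariance matrix, as in (20)
    model = LinearDiscriminantAnalysis()
    model.fit(X_train, y_train)
    return int(model.predict(x_next)[0])
```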

Figure 1: Scatter plot of the first set of training data (January 2005-December 2009) on two predictor variables, Simple Moving Average and MACD, with classes Up and Down. The linear decision boundary is included.


Quadratic Discriminant Analysis

According to James et al. (2013), the Bayes classifier assigns an observation X = x to the class for which

\[ \delta_k(x) = -\frac{1}{2}(x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k) + \log\pi_k = -\frac{1}{2}x^T \Sigma_k^{-1} x + x^T \Sigma_k^{-1} \mu_k - \frac{1}{2}\mu_k^T \Sigma_k^{-1} \mu_k + \log\pi_k \tag{21} \]

is largest. QDA differs from LDA only in that the covariance matrix is allowed to be different for each class, so a covariance matrix \(\Sigma_k\) has to be estimated separately for each class k. Matlab is used to perform the calculations.
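The corresponding sketch is a one-line variation on the LDA one (scikit-learn as a stand-in):

```python
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

def qda_fit_predict(X_train, y_train, x_next):
    # As LDA, but the covariance matrix is estimated separately per class
    model = QuadraticDiscriminantAnalysis()
    model.fit(X_train, y_train)
    return int(model.predict(x_next)[0])
```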

K-nearest Neighbors

Given a value for K and a prediction point \(x_{t+1}\), KNN first identifies the K training observations that are closest to \(x_{t+1}\), represented by \(N_{t+1}\). It then estimates \(f(x_{t+1})\) using the average of all the training responses in \(N_{t+1}\). So

\[ \hat{f}(x_{t+1}) = \frac{1}{K}\sum_{x_i \in N_{t+1}} y_i. \tag{22} \]

Since \(\hat{f}(x_{t+1})\) lies in the range [0, 1], we transform it into a binary variable by rounding to the nearest integer (0 or 1). Matlab is used to perform the calculations.
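A sketch with the K = 45 neighbours used in the results section; the neighbourhood average of (22) is the predicted class probability, which is then rounded:

```python
from sklearn.neighbors import KNeighborsClassifier

def knn_fit_predict(X_train, y_train, x_next, k=45):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    f_hat = model.predict_proba(x_next)[0, 1]  # average of y over the K neighbours
    return int(round(f_hat))
```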

Support Vector Machines

SVMs classify points by assigning them to one of two disjoint half-spaces in a higher-dimensional feature space. The idea of the support vector machine method is to construct a multi-dimensional hyperplane as a decision surface, such that the margin of separation between observations on both sides of the hyperplane is maximized. The optimization problem is

\[
\begin{aligned}
\underset{\beta_0,\beta_{11},\beta_{12},\ldots,\beta_{p1},\beta_{p2},\,\epsilon_1,\ldots,\epsilon_n}{\text{maximize}}\quad & M \\
\text{subject to}\quad & y_i\Big(\beta_0 + \sum_{j=1}^{p}\beta_{j1}x_{ij} + \sum_{j=1}^{p}\beta_{j2}x_{ij}^2\Big) \ge M(1-\epsilon_i), \\
& \sum_{i=1}^{n}\epsilon_i \le C, \qquad \epsilon_i \ge 0, \qquad \sum_{j=1}^{p}\sum_{k=1}^{2}\beta_{jk}^2 = 1.
\end{aligned}
\]

Here C, a positive constant parameter controlling the trade-off between the training error and the margin, is set at 100. In this paper we use the radial kernel \(K(s,t) = \exp(-\tfrac{1}{10}\|s-t\|^2)\) as kernel function of the SVM, because it tends to give good performance under general smoothness assumptions (Huang et al. (2005)). The e1071 package in R is used to perform the calculations.
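A Python sketch with these settings (the thesis used the e1071 package in R; under scikit-learn's parameterization K(s, t) = exp(-gamma * ||s - t||^2), the kernel above corresponds to gamma = 1/10):

```python
from sklearn.svm import SVC

def svm_fit_predict(X_train, y_train, x_next):
    # Radial kernel exp(-gamma * ||s - t||^2) with gamma = 1/10 and C = 100
    model = SVC(kernel="rbf", gamma=0.1, C=100.0)
    model.fit(X_train, y_train)
    return int(model.predict(x_next)[0])
```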

Figure 2: SVM with a radial kernel

Figure 2 shows an example of an SVM with a radial kernel and two input variables, applied to non-linear data.


Neural Networks

Figure 3: Neural Network framework

If we denote the vector of inputs by x and the outputs by y, then the neural network model can be written as (Hastie (1996))

\[ z_j = \sigma(\alpha_{0j} + \alpha_j^T x), \quad j = 1,\ldots,n, \qquad \hat{y}_k = f_k(\beta_{0k} + \beta_k^T z), \quad k = 1,\ldots,q. \tag{23} \]

The activation function \(\sigma\) is used to introduce a nonlinearity at the hidden layer, and is often taken to be the sigmoid \(\sigma(z) = 1/(1+e^{-z})\). The parameters \(\alpha_{jl}\) and \(\beta_{kj}\) are known as weights and define linear combinations of the input vector x and the hidden unit output vector z, respectively. The intercepts \(\alpha_{0j}\) and \(\beta_{0k}\) are known as biases. The function \(f_k\) permits a final transformation of the output. The number of neurons n was set to 50, the number of epochs to 5000, the momentum constant to 0.5 and the learning rate to 0.1. The nnet package in R is used to perform the calculations.
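A rough Python counterpart of this configuration (the thesis used the nnet package in R; scikit-learn's MLPClassifier with SGD and momentum is an approximation, not the original setup):

```python
from sklearn.neural_network import MLPClassifier

def nn_fit_predict(X_train, y_train, x_next):
    # One hidden layer of 50 sigmoid units, SGD with momentum 0.5,
    # learning rate 0.1 and at most 5000 epochs
    model = MLPClassifier(hidden_layer_sizes=(50,), activation="logistic",
                          solver="sgd", momentum=0.5, learning_rate_init=0.1,
                          max_iter=5000)
    model.fit(X_train, y_train)
    return int(model.predict(x_next)[0])
```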

5 Results

To evaluate the predictions, Heij et al. (2004) use the following method. In the population the fraction of successes is p. If we randomly make the prediction 1 with probability p and 0 with probability (1-p), then we make a correct prediction with probability \(q = p^2 + (1-p)^2\). Using the properties of the binomial distribution for the number of correct random predictions, it follows that the random hit rate \(h_r\) (the hit rate when we predict a 1 with probability p) has expected value \(E[h_r] = q\) and variance \(\mathrm{var}(h_r) = q(1-q)/n\). The predictive quality of our model can be evaluated by comparing our hit rate h with the random hit rate \(h_r\). Under the null hypothesis that the predictions of the model are no better than pure random predictions, the hit rate h is approximately normally distributed with mean q and variance q(1-q)/n. Therefore we reject the null hypothesis of random predictions in favour of the (one-sided) alternative of better-than-random predictions if

\[ z = \frac{h - q}{\sqrt{q(1-q)/n}} = \frac{nh - nq}{\sqrt{nq(1-q)}} \tag{24} \]

is large enough (larger than 1.645 at the 5% significance level). In practice, q is unknown and estimated by \(\hat{q} = \hat{p}^2 + (1-\hat{p})^2\), where \(\hat{p}\) is the fraction of successes in the sample. In our case, \(\hat{p} = 705/1258\) and \(\hat{q} = 0.5073\).
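The test is straightforward to reproduce; the following sketch recovers the z-statistic reported for the logit model below:

```python
import math

def random_prediction_test(hits, n, ups):
    """z-test of H0: the hit rate is no better than random, eq. (24)."""
    p_hat = ups / n                     # fraction of up days in the sample
    q_hat = p_hat**2 + (1 - p_hat)**2   # estimated random hit rate
    return (hits - n * q_hat) / math.sqrt(n * q_hat * (1 - q_hat))

# Logit confusion matrix (Table 3): 683 correct out of 1258, with 705 up days
print(round(random_prediction_test(683, 1258, 705), 2))  # -> 2.53
```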

Logit

We can display the prediction results in a so-called confusion matrix.

Table 3: Evaluation of the logit predictions

Predicted

Observed Down Up Total Correct % Correct % Incorrect

Down 104 449 553 104 18.81 81.19

Up 126 579 705 579 82.13 17.87

Total 230 1028 1258 683 54.29 45.71

Now z = (683 - 638.18)/√(1258 · 0.5073 · 0.4927) = 2.53. The corresponding p-value is 0.0057, so we can conclude that the prediction of up and down movement by a logit model is better than what would have been achieved by random predictions.


Linear Discriminant Analysis

Table 4: Evaluation of the LDA predictions

Predicted

Observed Down Up Total Correct % Correct % Incorrect

Down 114 439 553 114 20.61 79.39

Up 145 560 705 560 79.43 20.57

Total 259 999 1258 674 53.58 46.42

Now z = (674 - 638.18)/√(1258 · 0.5073 · 0.4927) = 2.02. The corresponding p-value is 0.0217, so we can conclude that the prediction of up and down movement by LDA is better than what would have been achieved by random predictions.

Quadratic Discriminant Analysis

Table 5: Evaluation of the QDA predictions

Predicted

Observed Down Up Total Correct % Correct % Incorrect

Down 199 354 553 199 35.99 64.01

Up 280 425 705 425 60.28 39.72

Total 479 779 1258 624 49.60 50.40

Now z = (624 - 638.18)/√(1258 · 0.5073 · 0.4927) = -0.80. The corresponding p-value is 0.7881, so we cannot conclude that the prediction of up and down movement by QDA is better than what would have been achieved by random predictions (at the 5% level).


K-nearest Neighbors

Table 6: Evaluation of the KNN predictions

Predicted

Observed Down Up Total Correct % Correct % Incorrect

Down 174 379 553 174 31.46 68.54

Up 230 475 705 475 67.38 32.62

Total 404 854 1258 649 51.59 48.41

These results are obtained using K = 45 neighbors. Now z = (649 - 638.18)/√(1258 · 0.5073 · 0.4927) = 0.61. The corresponding p-value is 0.2709, so we cannot conclude that the prediction of up and down movement by KNN is better than what would have been achieved by random predictions. We plot the out-of-sample hit rate as a function of the number of neighbors.


Figure 4: Out-of-sample hit rate against the number of neighbours K

It seems that the out-of-sample hit rate reaches its maximum at K = 43.

Support Vector Machines

Table 7: Evaluation of the SVM predictions

Predicted

Observed Down Up Total Correct % Correct % Incorrect

Down 205 348 553 205 37.07 62.93

Up 176 529 705 529 75.04 24.96

Total 381 877 1258 734 58.35 41.65


Now z = (732 - 638.18)/√(1258 · 0.5073 · 0.4927) = 5.29. The corresponding p-value is 6.08e-8, so we can conclude that the prediction of up and down movement by our Support Vector Machine model is better than what would have been achieved by random predictions.

Neural Networks

Table 8: Evaluation of the Neural Network predictions

Predicted

Observed Down Up Total Correct % Correct % Incorrect

Down 204 349 553 204 36.89 63.11

Up 169 536 705 536 76.03 23.97

Total 373 885 1258 740 58.82 41.18

Now z = (740 - 638.18)/√(1258 · 0.5073 · 0.4927) = 5.74. The corresponding p-value is 4.68e-9, so we can conclude that the prediction of up and down movement by our Neural Network framework is better than what would have been achieved by random predictions.

Comparison with random walk

Instead of testing whether our predictions are better than random predictions, we can also compare our out-of-sample test results with a random walk model as benchmark. The random walk is a one-step-ahead forecast method, because it uses today's value as the prediction for tomorrow:

\[ \hat{y}_{t+1} = y_t. \tag{25} \]

The test statistic is now defined as

\[ z = \frac{\hat{h}_i - h_0}{\sqrt{h_0(1-h_0)/n}}, \tag{26} \]

with \(\hat{h}_i\) the out-of-sample hit rate of model i. We say that the prediction performance of a model is better than that of a random walk when z is sufficiently large (larger than 1.645 at the 5% significance level). We can easily calculate the percentage that is correctly predicted by the random walk model: this is \(h_0 = 0.4992\). The total number of predictions is n = 1258.
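The z-statistics of Table 9 below follow directly from these numbers; a quick sketch:

```python
import math

def random_walk_test(h_i, h0=0.4992, n=1258):
    """z-test of eq. (26): model hit rate h_i against the random walk."""
    return (h_i - h0) / math.sqrt(h0 * (1 - h0) / n)

for name, h in [("Logit", 0.5429), ("LDA", 0.5358), ("QDA", 0.4960),
                ("KNN", 0.5159), ("SVM", 0.5835), ("NN", 0.5882)]:
    print(name, round(random_walk_test(h), 2))
# -> 3.10, 2.60, -0.23, 1.18, 5.98, 6.31, as in Table 9
```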


Table 9: Comparison of forecasting performance (out-of-sample hit rate against a random walk)

Model   h_i      z      p-value
Logit*  54.29%   3.10   9.68e-4
LDA*    53.58%   2.60   0.0047
QDA     49.60%   -0.23  0.5898
KNN     51.59%   1.18   0.1181
SVM*    58.35%   5.98   1.12e-9
NN*     58.82%   6.31   1.37e-10

Models denoted with a * perform significantly better out-of-sample than a simple random walk. We cannot say whether QDA and KNN perform better or worse than a random walk.

We see that there is a one-to-one relationship between the out-of-sample outperformance versus random predictions and the out-of-sample outperformance versus a random walk.

Robustness checks

Table 10: Comparison of forecasting performance (out-of-sample hit rate)

Model   S&P500   Rank   NASDAQ   Rank
Logit   54.29%   3      54.45%   3
LDA     53.58%   4      53.42%   5
QDA     49.60%   6      51.11%   6
KNN     51.59%   5      54.13%   4
SVM     58.19%   2      58.90%   2
NN      58.82%   1      59.62%   1

We applied all our methods to a second stock index, the NASDAQ index (the American technology index), with the purpose of checking for robustness. Overall, the results seem rather robust. LDA and KNN switch places, but NN and SVM still come out on top for the NASDAQ index.


Setting up a trading strategy

Table 11: Annual cumulative returns

Model          No transaction costs   0.1% transaction costs
Buy-and-hold   13.05%                 -
Logit          13.61%                 11.08%
LDA            14.27%                 12.79%
QDA            10.66%                 8.28%
KNN            13.14%                 8.28%
SVM            16.60%                 14.71%
NN             16.79%                 14.54%

Besides statistical performance, it is also interesting to see whether we can make a profit when actually applying our strategy in the market. The calculations are done as follows: when predicting a 1, we invest in the market index, and when predicting a 0, we switch to a government bond because we expect a negative return for the index. As basis for the risk-free rate the 30-year US government bond was chosen. This is done both without transaction costs and with transaction costs of 0.1% applied per round-trip trade. Results are compared with a simple buy-and-hold strategy, where we invest in the market and keep our portfolio for 5 years. The best methods (SVM and NN) outperform the buy-and-hold strategy, even with transaction costs applied. The KNN method probably switches its daily prediction between up and down movement a lot, because we see a huge loss in performance when we apply transaction costs. A sketch of the return calculation is given below.
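The sketch assumes daily index and bond returns as inputs; exactly how the 0.1% round-trip cost is charged is an assumption (here it is deducted whenever the position changes):

```python
import numpy as np

def annual_strategy_return(preds, index_ret, bond_ret, cost=0.001):
    """Daily switching strategy: hold the index after a 1-prediction and
    the bond after a 0-prediction; deduct `cost` on position changes."""
    preds = np.asarray(preds)
    daily = np.where(preds == 1, np.asarray(index_ret), np.asarray(bond_ret))
    switches = np.abs(np.diff(preds, prepend=preds[0]))  # 1 where position changes
    daily = daily - cost * switches
    total = np.prod(1 + daily)
    return total ** (252 / len(daily)) - 1  # annualized cumulative return
```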

6 Conclusion

Most of our methods structurally outperform both random predictions and a random walk method. We applied our models to two different markets, and the relative performance order was almost the same. If we apply our methods in a trading strategy in the market, we can outperform a buy-and-hold strategy. Overall, we can say we can be quite successful in applying machine learning techniques to predict stock market movement. However, not every technique works. The Neural Network framework seems to perform best, both statistically and economically. Besides that, it seems to be a robust method.


Concerning further research, the following suggestions can be made. In this paper we have chosen to use technical variables as input variables/predictors. It is also possible to use macroeconomic or financial variables as inputs, as long as we expect these inputs to contain a certain amount of information about the future movement of the stock market. We can easily adapt our models to see whether we can improve the results.

It is also possible not to simply use all our predictors in the different models, but to test our models by comparing all different subsets of input variables in terms of prediction performance. It would be interesting to see whether certain variables are more often incorporated in the "best" performing subsets than others.

Thirdly, in addition to comparing different indices, we can also check for robustness by using different time periods.

Lastly, an interesting option to possibly improve results is to combine different forecasting methods. We can propose the following combining model:

\[ f_{t+1} = \sum_{i=1}^{k} w_i\, \hat{y}_{i,t+1}, \tag{27} \]

where \(w_i\) is the weight assigned to classification method i and \(\hat{y}_{i,t+1}\) is the binary prediction of model i. We can easily transform the continuous value of \(f_{t+1}\) in [0, 1] to a binary value by rounding it.

To calculate the weights we could for instance use

\[ w_i = \frac{a_i}{\sum_{i=1}^{k} a_i}, \tag{28} \]

where \(a_i\) is the in-sample performance of forecasting method i (Huang et al. (2005)).
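A small sketch of this combination rule, using in-sample hit rates as the performance measure a_i (an assumption for illustration):

```python
import numpy as np

def combined_forecast(binary_preds, in_sample_perf):
    """Eqs. (27)-(28): weight each model's 0/1 prediction by its
    normalized in-sample performance and round the result."""
    w = np.asarray(in_sample_perf, dtype=float)
    w = w / w.sum()                     # eq. (28)
    f = float(np.dot(w, binary_preds))  # eq. (27)
    return int(round(f))

# e.g. three models predicting [1, 0, 1] with the in-sample hit rates below
print(combined_forecast([1, 0, 1], [0.54, 0.50, 0.59]))  # -> 1
```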

References

Y. Abu-Mostafa & A. Atiya (1996). Introduction to financial forecasting. Applied Intelligence, 6(3):205–213.

S. Achelis (2001). Technical Analysis from A to Z.

E. Fama (1970). Efficient capital markets: a review of theory and empirical work. The Journal of Finance, 25(2):383–417.


C. Heij, P. de Boer, P.H. Franses, T. Kloek & H.K. van Dijk (2004). Econometric Methods with Applications in Business and Economics.

W. Huang, Y. Nakamori & S.-Y. Wang (2005). Forecasting stock market movement direction with support vector machine. Computers & Operations Research, 32(10):2513–2522.

G. James, D. Witten, T. Hastie & R. Tibshirani (2013). An Introduction to Statistical Learning with Applications in R.

Y. Kara, M. Acar Boyacioglu & O. Baykan (2011). Predicting direction of stock price index movement using artificial neural networks and support vector machines: The sample of the Istanbul Stock Exchange. Expert Systems with Applications, 38(5):5311–5319.

P. Ou & H. Wang (2009). Prediction of stock market index movement by ten data mining techniques. Modern Applied Science, 3(12):28–42.

J. Patel, S. Shah, P. Thakkar & K. Kotecha (2015). Predicting stock and stock price index movement using trend deterministic data preparation and machine learning techniques. Expert Systems with Applications.
