## High-Frequency Stock Direction Prediction Using Machine Learning

### Jan Meppe

Master's Thesis to obtain the degree in Financial Econometrics
University of Amsterdam
Faculty of Economics and Business
Amsterdam School of Economics

Author: Jan Meppe
Student nr: 10326316
Date: September 16, 2016
Supervisor: Noud van Giersbergen

### Preface

I would like to thank my supervisor Noud van Giersbergen for his endless support and patience. I feel privileged to have worked with him. I would also like to thank Simon Broda for providing me with the data set; without him this thesis wouldn't even exist. I would also like to thank my dearest friends: Joris, Wouter, Joaquin, Jelle, and Lasse for their advice, support, and banter. Finally, I would like to express my sincere thanks and appreciation to my family and Gigi for their unconditional love and support.

### Statement of Originality

This document is written by student Jan Meppe, who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.

### Contents

1 Introduction
2 Market efficiency
2.1 The Efficient Market Hypothesis
3 Stock prediction approaches
3.1 Fundamental analysis
3.2 Technical analysis
3.3 Quantitative analysis
4 Machine Learning
4.1 Introduction
4.2 Support vector machine
4.2.1 The hyperplane
4.2.2 Maximal margin classifier
4.2.3 Non-separable case
4.2.4 Non-linear boundaries and kernels
4.2.5 Multiclass SVMs
4.3 Neural Networks
4.3.1 Basic idea behind neural networks
4.3.2 Functional transformations
4.4 Decision trees
4.4.1 Basics of decision trees
4.4.2 Tree pruning
4.4.3 ID3 (Iterative Dichotomiser 3)
4.4.4 C5.0 algorithm
4.5 Bagging and boosting
4.5.1 Bagging
4.5.2 Boosting
4.6 Ensembles
4.6.1 Algebraic combiners
4.6.2 Voting based methods
4.7 Testing and Performance
4.7.1 Confusion matrix
4.7.2 ROC and AUC
4.7.3 Profitability
5 Data
5.1 Data description
5.3 Further processing and scaling
6 Model
6.1 Framework
6.1.1 Training
6.1.2 Evaluating the pool
6.1.3 Ensemble prediction
6.2 Benchmarks
7 Results
7.1 Up or down
7.2 Large upmoves
7.3 Large downmoves
7.4 Decision tree only experiment
8 Conclusion
Appendices
A Technical indicators
A.1 Rate of change
A.2 Percentage change of SMA
A.3 Percentage change of EMA
A.4 Relative Strength Index
A.5 Chande Momentum Oscillator
A.6 Aroon Oscillator
A.7 Bollinger Bands
A.8 Commodity Channel Index
A.9 Chaikin Volatility
A.10 Close Location Value
A.11 Moving Average Convergence Divergence Oscillator
A.12 Trend Detection Index
A.13 Williams Accumulation/Distribution
A.14 Stochastic Momentum Oscillator
A.15 Average True Range

### Abstract

This thesis deals with high-frequency stock direction prediction using an ensemble machine learning model, based on Rechenthin (2014). The data used are 5-minute interval data of the first 6 months of 2012 for Apple, Google, and Microsoft. The model continuously trains various machine learning algorithms (neural nets, support vector machines, and decision trees) and combines them using a wrapper-based ensemble framework to predict next period's upmove or downmove. The proposed model significantly outperforms the random walk benchmark and the linear benchmark, but is outperformed by the benchmark model containing only decision trees. Besides predicting upmoves or downmoves, the model is also used to predict significant upmoves or significant downmoves. The results are similar. The proposed model performs significantly better than random and outperforms the random walk benchmark and linear benchmark, but again is outperformed by one of the benchmark models. Both for predicting significant upmoves and for predicting significant downmoves, the best performing model in terms of AUC is the model containing only neural nets, and the best performing model in terms of accuracy is the model containing only decision trees. Finally, a model containing only (but varying numbers of) decision trees is used to predict upmoves or downmoves. This model shows a significant improvement in performance when increasing the number of decision trees from 100 to 1000, but no significant improvement when increasing the number to 5000. It significantly outperforms all the other estimated models.

### 1 Introduction

Stock prediction has always been one of the most researched subjects for academics, firms, and individuals alike. This is largely due to its widespread practical applications (arbitrage, risk management, hedging) and the potential for profit. In recent years, technological developments and the rising availability of high-quality market data have pushed this field of study forward rapidly.

Stock prediction using machine learning is a field with a large body of literature. Callen, Kwan, Yip, and Yuan (1996) compare neural networks with classical linear regression models for 296 earnings time series. Zhang, Cao, and Schniederjans (2004) consider earnings per share forecasting using machine learning models. Teräsvirta, Van Dijk, and Medeiros (2005) compare neural networks with linear models for different macroeconomic time series. Neural networks have been extensively used to predict stock markets and trends (Kreesuradej, Wunsch, & Lane, 1994; Saad, Prokhorov, & Wunsch, 1998). Another example is the use of recurrent neural networks to predict a stock's index the following day (Saad, Prokhorov, & Wunsch, 1996). In the last few years support vector machines have gained more traction (F. E. H. Tay & Cao, 2001; Kim, 2003). Although there is a lot of research using machine learning algorithms to predict stock prices, there has not been a lot of work that unifies these algorithms in a single model. One of the very few that do is the work of Rechenthin (2014), on which this thesis is based. The research question this thesis tries to answer is: to what extent does a wrapper-based framework of machine learning algorithms improve upon the benchmark of a linear model for predicting high-frequency stock price directions?

In order to investigate this, an ensemble machine learning framework is used to predict high-frequency price directions of the stocks of Apple, Google, and Microsoft. The model works as follows: a large number of machine learning algorithms are continuously trained on chunks of previously seen data. After a classifier is done training, it is added to the so-called pool. At each period in time, all the available classifiers in the pool are evaluated. The top performers from the pool are then used to predict the upmove or downmove for the next period. The performance of the model is then compared against five different benchmarks.

The rest of this thesis is organised as follows. Section 2 discusses the efficient market hypothesis. Section 3 explains the three most commonly used methods to predict stocks. Section 4 describes what machine learning is, and the algorithms used in the model. Section 5 describes the data and how the attributes are created. Section 6 describes the model. Section 7 contains the results. Section 8 concludes and provides directions for future research.

### 2 Market efficiency

### 2.1 The Efficient Market Hypothesis

Amongst economists there is an old joke which goes as follows: an economist walks down the sidewalk with a friend and they see a $100 bill lying on the ground. When his friend reaches down to grab the bill, the economist stops him and says: 'Don't bother. If it were real, somebody would have picked it up already.' There are numerous variations on this joke but the gist is the same. The joke is an example of exaggerated economic logic following from the efficient markets hypothesis.

Developed independently by Fama and Samuelson in the 1960s, the Efficient Markets Hypothesis (EMH) states that market prices fully reflect all available information. This surprisingly simple hypothesis has far-reaching consequences and has been the subject of decades of research. Yet there is still no consensus about whether markets, and in particular financial markets, are efficient.

There are three variants of the EMH: the weak, semi-strong, and strong EMH. The weak EMH states that the prices of currently traded stocks and other assets completely reflect all the information in past prices. This means that no trader can use any form of technical or quantitative analysis to beat the market. The semi-strong form of the EMH states that all past prices are incorporated into the current price, and also that prices instantaneously change to reflect new public information. The strong EMH states that both public and private information is always fully incorporated in the price of a stock or other asset. This means that no investor, no matter how much research they do, can gain an edge in the market without taking on additional risk.

Lo (2007) creates an analogy between quantum mechanics and economics. He states that just like Heisenberg’s uncertainty principle places a fundamental limit on what we can know simultaneously about both an electron’s position and momentum, the EMH places a limit on what we can know about future prices, given that the EMH is true.

If markets are truly efficient, the market never over- or under-reacts during the trading day. This implies that all efforts and research by investors analyzing markets and assets are futile, as no one can consistently beat an efficient market. Furthermore, it also means that high returns come coupled with high risk. The EMH is quite a controversial but important topic, because if it were true then my research would be futile and end right here.

Another thing that might affect the price of a stock is news. There is a large body of literature that studies the effect of news on the price of a stock. The most well-known news effect is the so-called post-earnings-announcement drift (Bernard & Thomas, 1989), which is the tendency for a stock's returns to drift in one direction after an earnings announcement. More relevant for this thesis is the speed at which markets react to news. If markets do not adjust instantaneously, then there is a (very small) window of opportunity in which algorithms can evaluate the news and use this information to make a more educated guess about the price development of the stock. Although the effect of news on the price of a stock is undeniable, this information arrives irregularly and at lower frequencies. For these reasons, and because it is out of the scope of this thesis, news is not incorporated in the model.

### 3 Stock prediction approaches

This chapter gives an overview of the three most common methodologies used to predict stock prices. Fundamental analysis is discussed first. Fundamental analysis uses the economic characteristics of a firm or commodity to value it. Technical analysis tries to predict market prices based on market psychology. Finally, quantitative analysis attempts to find statistical and numerical patterns in the data to form price predictions and trading decisions.

### 3.1 Fundamental analysis

Fundamental analysis aims to uncover the intrinsic (fundamental) value of a company based on its economic characteristics. Every quarter, publicly traded companies are required to file a quarterly report on their financial results in that period. An analyst using fundamental analysis uses all information regarding the economic factors of the company, such as earnings, debt, profits, losses, and revenue forecasts, to determine the intrinsic value of that company and price the stock accordingly.

If this analyst finds that the intrinsic value of the company is higher than the stock price reflects, then the stock is underpriced and the analyst would recommend a long position in that stock. On the other hand, if the stock is overvalued by the market in terms of its underlying characteristics, then the analyst would recommend a short position in the stock.

There are several studies which lend some credibility to this approach. For example, Ang and Bekaert (2007) explore the predictive power of dividend yields for forecasting several economic factors such as (excess) returns, cash flows, and interest rates. They find that the dividend yield and short rate have significant predictive power for excess returns, but only at a short horizon. A strong short rate predicts negative returns, a result that turns out to be robust across countries. On the other hand, the short-term predictive power of the dividend yield was found not to be robust across countries. Welch and Goyal (2008) argue that the average historical excess stock return provides a better forecast than regressions of excess returns on various predictor variables. Campbell and Thompson (2008) then show that this is not the case, as many predictive regressions beat the historical average once some relatively mild restrictions are imposed.

A lot of research has already been done on the subject of stock prediction with respect to their economic variables (i.e. fundamental analysis): valuation ratios (Kothari & Shanken, 1997), interest rates (Hodrick, 1992), relative valuations of low and high beta stocks (Campbell & Thompson, 2008), and financing structure (Baker & Wurgler, 2000). Although there has been some concern that apparent predictability might be simply spurious correlation, research points towards the conclusion that fundamental analysis provides useful information.

In conclusion, fundamental analysis uses the economic characteristics to value a commodity and is often used by analysts who have to forecast prices months or years into the future. This approach is not very useful if our goal is to predict the movement of a stock hours, minutes, or even seconds into the future. To achieve this goal we need a different set of tools and we turn our attention to technical and quantitative analysis.

### Figure 1: Schematic view of the head-and-shoulders price pattern. Source: Savin et al. (2007)

### 3.2 Technical analysis

Technical analysts attempt to predict future price trends based on what they believe the other market participants are thinking. This is done by using historical prices, volumes, or other variables to calculate certain indicators, often summarized and visualised in charts. These indicators are then used as predictors for market movements. The main idea behind technical analysis is that prices move in trends that are determined by the changing attitudes of investors towards the market. According to De Bondt (1993), technical analysis is a form of mass psychology because it attempts to forecast price movement based on the emotion and psychological status of the market.

Malkiel (2003) argues that the success of technical analysis might be due to the fact that it becomes a self-fulfilling prophecy. In fear of missing out on an upwards trend, technical traders might buy a stock. This in turn signals to the market that there is an increased demand for this commodity, increasing the price. On the other hand, if an indicator recommends selling a stock, multiple traders might respond by rapidly selling their stocks on the market, increasing supply and thus decreasing the price. Hence it becomes a self-fulfilling prophecy.

There are many different forms of technical analysis. Most technical analysts use popular chart patterns with varying jargon such as the 'head-and-shoulders', 'double top', or 'double bottom'. Others adhere to so-called candlestick charting or use technical market indicators such as supports, resistances, moving averages, and breakout levels. Figure 1 shows a stylised view of what the head-and-shoulders price pattern looks like.

Although the true value of technical analysis remains unclear, it is still an often used and widespread practice. A survey of London-based chief foreign-exchange traders revealed that more than 90 percent of them place some weight on technical analysis (Taylor & Allen, 1992). More recently, Menkhoff (2010) conducted a survey of 692 fund managers and found that more than 85 percent of them use technical analysis in their valuations. Of the managers who use technical analysis, 18 percent indicated a preference for technical analysis over any other form of analysis.

In summary, technical analysis remains a widespread set of tools that attempts to predict future market prices based on trends, with market psychology as the underlying driver of these trends. This underlying concept of psychology is what differentiates technical analysis and quantitative analysis, which will be discussed in the next section.

### 3.3 Quantitative analysis

Quantitative analysis is the use of mathematical models to uncover patterns in historical data. Where technical analysis aims to uncover patterns in price trends by using visual patterns, quantitative analysis is purely numerical, relying only on algorithms and mathematical models to find these patterns. In this thesis I make the distinction between technical analysis and quantitative analysis as follows: technical analysis is based on 'vague' rules which are hard to quantify, whereas quantitative analysis is based on explicit, well-defined rules. An advantage of quantitative rules is that they can easily be programmed into a computer and backtested against historical data to examine their performance.

An example of a technical indicator that a quantitative analyst might use is the Relative Strength Index (RSI). Introduced by Wilder (1978), the RSI measures the relative strength of a stock on a scale from 0 to 100. When the index drops below 30 this signals that the stock is oversold and should be bought. Similarly, if the index rises above 70 this is interpreted as the stock being overbought and should be sold or shorted. The Relative Strength (RS) is calculated as follows

$$RS_t(n) = \frac{\text{average of the up closes over periods } t-n \text{ to } t}{\text{average of the down closes over periods } t-n \text{ to } t}$$

where an 'up close' refers to a day where the closing price was higher than the previous closing price, and a 'down close' refers to the opposite. Given the RS, the RSI can be calculated as follows:

$$RSI(n) = 100 - \frac{100}{1 + RS(n)}$$

Figure 2 shows the RSI for the stock of Goldman Sachs between April 20, 2015 and April 15, 2016. The decision to trade the stock based on this indicator is as follows: buy the stock if the RSI drops below 30 and sell the stock if the RSI rises above 70.
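To make the indicator concrete, the RS and RSI above can be computed directly from a series of closing prices. The sketch below uses plain averages over the last n price changes (Wilder's original formulation smooths the averages exponentially, which is omitted here for brevity); the function names are my own.

```python
def rsi(closes, n=14):
    """RSI over the last n price changes, using plain averages.

    Note: Wilder's original RSI smooths the averages exponentially;
    plain averages are used here to keep the sketch short.
    """
    changes = [c1 - c0 for c0, c1 in zip(closes, closes[1:])]
    if len(changes) < n:
        raise ValueError("need at least n + 1 closing prices")
    window = changes[-n:]
    avg_up = sum(c for c in window if c > 0) / n      # average up close
    avg_down = sum(-c for c in window if c < 0) / n   # average down close
    if avg_down == 0:
        return 100.0  # only up closes: maximal relative strength
    rs = avg_up / avg_down
    return 100 - 100 / (1 + rs)

def signal(rsi_value):
    """Trading rule from the text: buy below 30, sell above 70."""
    if rsi_value < 30:
        return "buy"
    if rsi_value > 70:
        return "sell"
    return "hold"

closes = [100, 101, 100, 101, 102]   # changes: +1, -1, +1, +1
print(rsi(closes, n=4))              # avg_up = 0.75, avg_down = 0.25, RS = 3, RSI = 75.0
print(signal(rsi(closes, n=4)))      # sell
```

With three up closes and one down close over the window, the average gain is three times the average loss, so RS = 3 and the RSI of 75 triggers the sell rule.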

There is a growing body of literature on quantitative analysis, with recent developments in the availability of high-quality data and computational power being one of the main drivers. For example, Teixeira and De Oliveira (2010) examine the feasibility of an intelligent prediction system for daily prices based on relatively simple quantitative rules such as stop loss, stop gain, nearest neighbours, and the relative strength index. Their results show that the proposed method generated significant profits relative to the buy-and-hold strategy, even with realistic trading costs. Another example is the work of Lai, Fan, Huang, and Chang (2009), who use a combination of technical indicators with clustering, fuzzy decision trees, and genetic algorithms to predict stocks in the Taiwan Stock Exchange. The proposed method had a hit rate of 82%, which significantly outperforms the baseline model used.

### Figure 2: Stock price and RSI(14) for Goldman Sachs' stock between 20-04-2015 and 15-04-2016. Source: Wharton TAQ Research Database.

In sum, quantitative analysis uses solely mathematical models to discover patterns in financial data. The body of literature on quantitative analysis is plentiful, with many studies reporting significant profits. Some care should be taken here, because this could also be the result of simple publication bias. Many studies apply quantitative analysis to daily data. Because the goal of this thesis is to create a model that can learn patterns in the data and make high-frequency predictions of the direction of the stock price, we need a different set of tools. The tool that will allow our model to effectively learn from the data is machine learning, which will be discussed in the next section.

### 4 Machine Learning

### 4.1 Introduction

Machine learning is the study of generic algorithms that aim to learn from the provided data. Machine learning has found applications in many fields, such as spam filtering, computer vision (Wernick, Yang, Brankov, Yourganov, & Strother, 2010), financial time series forecasting (F. E. Tay & Cao, 2001), marketing (Ou & Wang, 2009), and medical diagnosis (Tarassenko, Hayton, Cerneaz, & Brady, 1995).

Most machine learning tasks can broadly be categorised into three categories: supervised learning, unsupervised learning, and reinforcement learning. In a supervised learning setting, the algorithm is trained on input data which have the correct label (output) attached to them. The goal of the algorithm is then to learn from these data and predict the output label of new input data. In an unsupervised learning setting, on the other hand, the algorithm is fed input data without any labels attached. It is then the algorithm's goal to find the labels itself, and possibly to predict the labels of new input data as well. Reinforcement learning (Sutton & Barto, 1998) is concerned with a system interacting with a dynamic environment to maximize some reward, without explicitly telling the system whether it has come closer to its goal or not. Stock price direction prediction is a supervised learning setting because the data are labelled with the correct label, namely that period's historical upmove or downmove.
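In our setting the supervision signal is mechanical to construct: each period is labelled by comparing its close to the previous close. A minimal sketch (the function name and the flat-price convention are my own choices):

```python
def label_directions(closes):
    """Label each period +1 for an upmove and -1 for a downmove relative
    to the previous close. A flat close counts as a downmove here; that is
    an arbitrary convention chosen for this sketch."""
    return [1 if c1 > c0 else -1 for c0, c1 in zip(closes, closes[1:])]

# Four closes yield three labelled transitions
print(label_directions([100.0, 100.5, 100.2, 101.0]))  # [1, -1, 1]
```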

The rest of this section is organised as follows: Section 4.2 discusses support vector machines. Section 4.3 describes neural nets. Section 4.4 describes decision trees and the particular version that is being used in the model. Section 4.5 describes bagging and boosting. In Section 4.6 the method of combining the machine learning algorithms is discussed. Lastly, Section 4.7 describes the performance measures that are used to evaluate and compare the model against its benchmarks.

### 4.2 Support vector machine

The Support Vector Machine (SVM) (Cortes & Vapnik, 1995) is a type of supervised learning algorithm which can be used for both regression and classification. The idea behind the SVM is to find the optimally separating hyperplane that maximizes the so-called margin.

SVMs are effective in high-dimensional spaces, memory efficient, and very versatile due to the possible kernel functions that can be applied. A disadvantage of SVMs is that performance is sub-optimal if the number of features is significantly larger than the number of samples. Because we are dealing with technical indicators as features and a large number of samples, this is not very relevant in this case.

The support vector machine is based on a hierarchy of generalisations. It starts with the maximal margin classifier. Generalizing this maximal margin classifier to allow for some points to lie on the wrong side of the hyperplane, we arrive at the support vector classifier. If we then allow this support vector classifier to accommodate non-linear kernel functions, we arrive at the support vector machine.

Although people often refer to maximal margin classifiers, support vector classifiers, and support vector machines as ‘support vector machines’, they are three different concepts and care has to be taken not to confuse them. Fundamentally the SVM is based on the concept of a hyperplane, which will be discussed next.

### 4.2.1 The hyperplane

For lower dimensions a hyperplane is quite intuitive. In two dimensions a hyperplane is a flat one-dimensional subspace; in other words, a line. Similarly, in three dimensions a hyperplane is a flat two-dimensional subspace, a plane. A more formal definition is as follows: in a $p$-dimensional space, a hyperplane is a flat affine subspace of dimension $p - 1$ (James, Witten, Hastie, & Tibshirani, 2013). The word affine refers to the fact that the hyperplane does not have to go through the origin. A hyperplane in $p$ dimensions is defined by the equation

$$\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} = 0 \qquad (1)$$

If a point $x_i = (x_{i1}, \ldots, x_{ip})$ satisfies (1), we say that $x_i$ lies on the hyperplane. Now, if

$$\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} > 0 \qquad (2)$$

or

$$\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} < 0 \qquad (3)$$

then the point $x_i$ lies on one side or the other of the hyperplane. Intuitively, one can think of a hyperplane as separating a $p$-dimensional space in two, with points lying either on one side or the other. Calculating which side a point is on is as simple as calculating the sign of the left-hand side.

Suppose now that we have a matrix $X$ consisting of $n$ training observations with $p$ features, i.e. $x_i = (x_{i1}, \ldots, x_{ip})$ for $i = 1, 2, \ldots, n$, and that the $n$ observations fall into two classes, $y_1, \ldots, y_n \in \{-1, 1\}$. Finally, there is a test observation $x^* = (x^*_1, \ldots, x^*_p)$. The goal is to classify the test observation based on the observed features.

For now, assume that it is indeed possible to create a hyperplane that perfectly separates the training observations. Then the following relationships must hold:

$$\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} > 0 \quad \text{if } y_i = 1 \qquad (4)$$

and

$$\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} < 0 \quad \text{if } y_i = -1, \qquad (5)$$

which is equivalent to

$$y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) > 0. \qquad (6)$$

We can then classify the test observation $x^*$ based on the sign of $f(x^*) = \beta_0 + \beta_1 x^*_1 + \cdots + \beta_p x^*_p$. If the sign is positive we assign the test observation to class 1; if the sign is negative we assign it to class $-1$. The magnitude of $f(x^*)$ also provides information: the larger $|f(x^*)|$, the more certain we are that the observation indeed lies on that side of the hyperplane. If $f(x^*)$ lies very close to zero, the observation lies close to the hyperplane and its class is less clear-cut.
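The rule above is easy to state in code: evaluate f(x*) and read off the sign, keeping |f(x*)| as a rough confidence measure. A minimal sketch (the names are my own):

```python
def classify(beta0, beta, x_star):
    """Evaluate f(x*) = beta0 + sum_j beta_j * x*_j.
    Returns (predicted class, |f(x*)|): the sign gives the class,
    the magnitude a rough measure of confidence."""
    f = beta0 + sum(b * xj for b, xj in zip(beta, x_star))
    return (1 if f > 0 else -1), abs(f)

# Hyperplane x1 + x2 - 3 = 0 in two dimensions: beta0 = -3, beta = (1, 1)
print(classify(-3, (1, 1), (3, 3)))  # (1, 3): well inside class 1's side
print(classify(-3, (1, 1), (1, 1)))  # (-1, 1): class -1
```

Points with a small second component sit near the hyperplane, so their class assignment is less certain even though the sign rule still picks a side.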

### 4.2.2 Maximal margin classifier

If the data are indeed perfectly linearly separable, then there exist infinitely many hyperplanes which separate the data. To choose between these hyperplanes, the maximal margin hyperplane is introduced. For any given hyperplane, the perpendicular distance to the hyperplane can be calculated for each observation. The smallest of these distances is called the margin. The maximal margin hyperplane is then the hyperplane with the largest margin; in other words, the hyperplane whose smallest distance to the training observations is largest, see Figure 3. The hyperplane fit by the maximal margin classifier is the solution to the following optimisation problem:

$$\max_{\beta_0, \beta_1, \ldots, \beta_p} M \quad \text{s.t.} \quad \sum_{j=1}^{p} \beta_j^2 = 1, \quad y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) \ge M, \quad \forall i = 1, \ldots, n$$

The maximal margin classifier now tests on which side of the maximal margin hyperplane the test observation lies. From Figure 3 we see that there are some observations that lie exactly on the margin. These observations are known as the support vectors, because they "support" the hyperplane: if one of these support vectors were to move, the hyperplane would shift as well. Note that small movements in the other observations (i.e. not the support vectors) would not cause the hyperplane to shift. It turns out that this property, the fact that the hyperplane depends only on a small subset of the data, is an important property of support vector machines.

### 4.2.3 Non-separable case

In the previous part we simply assumed that such a separating hyperplane exists. We assumed that the data is linearly separable, but what happens if this is not the case? To solve this problem the notion of a soft margin is introduced and with that the support vector classifier, as a generalisation of the maximal margin classifier. Figure 4 shows an example of a set of data that are not linearly separable.

The support vector classifier (soft margin classifier) is a classifier based on a hyperplane that does not perfectly separate the two classes. This classifier allows some observations to be on the wrong side of the hyperplane. The margin is soft in the sense that the margin condition may be violated by some observations. This is done by introducing slack variables in the optimisation problem of maximising the margin.

### Figure 3: The maximal margin classifier. The margin is indicated by the arrows. The support vectors are on the dashed line. Source: James et al. (2013)

### Figure 4: Example of data that is not linearly separable. Source: James et al. (2013)

The optimisation problem then becomes the following:

$$\max_{\beta_0, \ldots, \beta_p, \varepsilon_1, \ldots, \varepsilon_n} M \quad \text{s.t.} \quad \sum_{j=1}^{p} \beta_j^2 = 1, \quad y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) \ge M(1 - \varepsilon_i), \quad \varepsilon_i \ge 0, \quad \sum_{i=1}^{n} \varepsilon_i \le T, \quad \forall i = 1, \ldots, n$$

where $T$ is a nonnegative tuning parameter. A lower $T$ results in a narrower margin and a higher $T$ results in a wider margin.

Similarly to the maximal margin classifier, it turns out that the support vectors in this case are the observations that lie directly on the border of the margin or inside the margin. Again, the support vector classifier decision rule is based on only a small subset of the training observations: the support vectors.
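The constrained problem above is typically handed to a quadratic-programming solver. An equivalent unconstrained view, minimising a hinge loss plus a quadratic penalty on the coefficients, can be attacked with plain subgradient descent, which makes for a compact illustration. The sketch below trains a linear soft-margin classifier this way on a toy two-dimensional data set; all names, the toy data, and the hyperparameter values are my own, and this is an illustration rather than the solver used in this thesis.

```python
def train_soft_margin(points, labels, lam=0.01, lr=0.01, epochs=200):
    """Linear soft-margin classifier via subgradient descent on
    lam/2 * ||w||^2 + hinge loss. Returns (w, b) with f(x) = w.x + b."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (w[0] * x[0] + w[1] * x[1] + b)
            # subgradient of the quadratic penalty
            gw0, gw1, gb = lam * w[0], lam * w[1], 0.0
            if margin < 1:  # on or inside the margin: hinge term is active
                gw0 -= y * x[0]
                gw1 -= y * x[1]
                gb -= y
            w[0] -= lr * gw0
            w[1] -= lr * gw1
            b -= lr * gb
    return w, b

points = [(2, 2), (3, 1), (2, 3), (-2, -2), (-3, -1), (-2, -3)]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_soft_margin(points, labels)
preds = [1 if w[0] * x[0] + w[1] * x[1] + b > 0 else -1 for x in points]
print(preds == labels)  # True: the toy data ends up correctly classified
```

Only observations with an active hinge term (margin below 1) contribute gradient updates, which mirrors the fact that the fitted classifier depends only on the support vectors.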

### 4.2.4 Non-linear boundaries and kernels

The support vector classifier works well for data which more or less have a linear boundary between the two classes. In practice, non-linear boundaries occur often, with an example shown in Figure 5. Fitting a linear model to these data, as done in the right frame of Figure 5, would result in a poorly performing model. To address this issue of linearity we enlarge the feature space by including non-linear (quadratic, cubic, polynomial, etc.) functions of the predictors in addition to the original features. Instead of simply using the p features

### Figure 5: Example of non-linear data with a fitted linear boundary. Source: James et al. (2013)

x_{i1}, x_{i2}, \ldots, x_{ip}

for i = 1, \ldots, n, we could now fit a support vector classifier on the 2p features

x_{i1}, x_{i1}^2, x_{i2}, x_{i2}^2, \ldots, x_{ip}, x_{ip}^2.

There are many ways to enlarge the feature space in such a way. The support vector machine is the extension of the support vector classifier that results from enlarging the feature space in a specific way, using kernels. The main idea is to allow the support vector classifier to accommodate non-linear boundaries. It turns out that the solution to the support vector classifier optimisation problem depends only on the inner products of the observations (James et al., 2013). Let \langle a, b \rangle denote the inner product of a and b; the linear support vector classifier can then be represented as

f(x) = \beta_0 + \sum_{i=1}^{n} \alpha_i \langle x, x_i \rangle \qquad (7)

with a parameter \alpha_i for each observation i = 1, \ldots, n. The classification rule then depends on the sign of f(x). It turns out that \alpha_i is nonzero only for the support vectors in the solution; in other words, if a training observation is a support vector then its \alpha_i is nonzero. Let S denote the set of indices of these support vector observations; then we can rewrite (7) as follows:

f(x) = \beta_0 + \sum_{i \in S} \alpha_i \langle x, x_i \rangle. \qquad (8)

Now, instead of the inner product, a more general function K(x_i, x_{i'}) is used, where K(\cdot, \cdot) is referred to as the kernel. Taking the kernel to be the inner product we find that

K(x_i, x_{i'}) = \sum_{j=1}^{p} x_{ij} x_{i'j}, \qquad (9)

which would give us the support vector classifier. However, instead of the inner product there is a wide range of other functions to choose as a kernel. An example is the widely used polynomial kernel of degree d,

K(x_i, x_{i'}) = \left(1 + \sum_{j=1}^{p} x_{ij} x_{i'j}\right)^d \qquad (10)

with d a positive integer. Another popular choice is the radial kernel,

K(x_i, x_{i'}) = \exp\left(-\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2\right) \qquad (11)

with \gamma > 0. In the model I use an SVM with a polynomial kernel of degree 2.

To summarize: we started with the definition of a hyperplane, from which we derived the maximal margin classifier. Allowing some observations to be on the wrong side of the hyperplane gives us the support vector classifier. Finally, allowing for non-linear boundaries through kernel functions, we arrive at the support vector machine.
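The three kernels above can be sketched directly in code. A minimal Python version follows; the function names and toy vectors are mine, not from the thesis:

```python
import math

def linear_kernel(x, z):
    # Inner product, equation (9): recovers the support vector classifier.
    return sum(xj * zj for xj, zj in zip(x, z))

def polynomial_kernel(x, z, d=2):
    # Polynomial kernel of degree d, equation (10).
    return (1 + linear_kernel(x, z)) ** d

def radial_kernel(x, z, gamma=0.5):
    # Radial kernel, equation (11).
    return math.exp(-gamma * sum((xj - zj) ** 2 for xj, zj in zip(x, z)))

x, z = [1.0, 2.0], [0.5, -1.0]
print(polynomial_kernel(x, z))   # (1 + (-1.5))**2 = 0.25
print(radial_kernel(x, x))       # zero distance gives 1.0
```

With d = 2, as used in the model, the polynomial kernel implicitly fits the classifier in a quadratic feature space without ever constructing that space explicitly.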

### 4.2.5 Multiclass SVMs

The support vector machine is fundamentally a binary classifier. In reality, the number of classes often exceeds C = 2. Several solutions have been proposed that combine single SVMs into multiclass SVMs which can accommodate more than two classes.

A commonly used approach is the so-called one-versus-the-rest approach, proposed by Vapnik and Vapnik (1998). In this approach C separate SVMs are constructed, where the c-th model is trained using the data from the c-th class as positive examples and the data from the remaining C − 1 classes as negative examples. Although this approach is commonly used, it suffers from some problems. First, there is no guarantee that the calculated quantities y_c(x) are on the same scale. Second, this method often causes training sets to become imbalanced. For example, imagine a C = 100 class problem, each class with an equal number of training observations. Applying the one-versus-the-rest approach results in training sets with 99% negative examples and 1% positive examples, so the symmetry of the original problem is lost.

Another possible approach is the one-versus-one approach, in which C(C − 1)/2 different binary SVMs are trained on all possible pairs of the C classes. Each SVM learns to distinguish between its two assigned classes. In the prediction phase all C(C − 1)/2 classifiers are applied to an unseen sample, and the class that receives the majority of the votes becomes the final prediction. For large C it is clear that significantly more computational power is required for both training and testing.
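The one-versus-one voting scheme can be illustrated with a short sketch. The pairwise classifiers here are stand-ins (a toy rule), not trained SVMs:

```python
from collections import Counter
from itertools import combinations

def one_vs_one_predict(pairwise_predict, classes, x):
    """One-versus-one voting sketch: query all C(C-1)/2 pairwise
    classifiers and return the class with the most votes.
    pairwise_predict(a, b, x) is assumed to return the winner (a or b)."""
    votes = Counter(pairwise_predict(a, b, x) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]

# Toy pairwise rule standing in for trained binary SVMs: the "larger"
# class always wins its duels.
winner = one_vs_one_predict(lambda a, b, x: max(a, b), ["down", "neutral", "up"], None)
print(winner)  # 'up' wins both of its pairwise contests
```

For C = 3 classes this queries exactly 3 pairwise classifiers, matching C(C − 1)/2.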

In this thesis I discern between upmoves and downmoves of the stock in a high-frequency setting, so there is no need for multiclass SVMs. However, if one wants to extend the research in this thesis to accommodate neutral price movements as well, these multiclass SVMs can be used for that purpose.

### 4.3 Neural Networks

An artificial neural network (ANN) is another widely used machine learning algorithm. A neural net is a set of interconnected nodes (or 'neurons') that simulates the network of neurons in the human brain. The connections between the neurons are weighted, based on training data, allowing the neural network to learn from the data. Neural networks are widely used for their ability to learn from data and their ease of implementation. A disadvantage of neural nets is that training them is computationally intensive.

Neural networks have been successfully applied in many different fields of economics, such as forecasting, asset management, portfolio selection, bankruptcy warning systems, and fraud detection (Fadlalla & Lin, 2001).

This chapter first gives an intuitive explanation of neural networks, and then a more technical one.

### 4.3.1 Basic idea behind neural networks

Figure 6 shows a schematic representation of a simple neural network. This neural network has three layers: the input layer, one hidden layer with five hidden nodes, and the output layer. The input layer consists of four inputs x_1, \ldots, x_4, and the output layer is a single real-valued number y. The objective of the neural network is to choose the set of weights w between the input layer, hidden layer, and output layer that minimizes some cost function. In supervised learning the most commonly used cost function is the mean-squared error. During training the weights are adjusted according to some learning parameter \lambda \in (0, 1). Choosing \lambda too large results in changes in the weights that are too drastic; choosing \lambda too small results in changes that are too small, which increases training time. Now that the general idea behind a neural network is clear, the next section delves a bit deeper into what a neural network is.

### Figure 6: Example of an artificial neural network

### 4.3.2 Functional transformations

What follows is a condensed summary of the chapter on neural nets in Bishop (2006), using slightly different notation. A neural net can be seen as a function that maps a set of inputs \{x_i\} to a set of outputs \{y_k\} according to some weights \{w_j\}, which are updated in the training phase. The first transformation consists of constructing M linear combinations of the p input variables x_1, \ldots, x_p

a_j = \sum_{i=1}^{p} w_{ji}^{(1)} x_i + w_{j0}^{(1)} \qquad (12)

where j = 1, \ldots, M, with M the number of hidden nodes in the following layer. The superscript (1) indicates that these weights belong to the first layer of the neural network. The parameters w_{ji}^{(1)} are called weights and the parameters w_{j0}^{(1)} are called biases. The activations a_j are then transformed by an activation function h(\cdot):

z_j = h(a_j). \qquad (13)

The activation function is usually a sigmoidal function such as the logistic or the hyperbolic tangent function. These values z_j are then linearly combined again (in a similar fashion as in (12)) to give the output unit activations

a_k = \sum_{j=1}^{M} w_{kj}^{(2)} z_j + w_{k0}^{(2)}, \qquad (14)

where k = 1, \ldots, K, with K the total number of outputs. These output unit activations are then transformed by some activation function to give the network outputs y_k. For standard regression the activation function is the identity, such that y_k = a_k. For binary classification the logistic sigmoid function can be used, y_k = \sigma(a_k) with a certain threshold value, where

\sigma(a) = \frac{1}{1 + \exp(-a)}. \qquad (15)

Combining (12), (13), and (14), we can write

y_k(x, w) = \sigma\left( \sum_{j=1}^{M} w_{kj}^{(2)} \, h\left( \sum_{i=1}^{p} w_{ji}^{(1)} x_i + w_{j0}^{(1)} \right) + w_{k0}^{(2)} \right). \qquad (16)

By defining an additional input variable x_0 and fixing its value to x_0 = 1, we can absorb the biases into the weights and write

y_k(x, w) = \sigma\left( \sum_{j=0}^{M} w_{kj}^{(2)} \, h\left( \sum_{i=0}^{p} w_{ji}^{(1)} x_i \right) \right). \qquad (17)

In summary, a neural network is a set of functional transformations from a set of input variables \{x_i, i = 1, \ldots, p\} to a set of output variables \{y_k, k = 1, \ldots, K\}, controlled by a set of weights that are adjusted during the training phase. The next problem is training the neural net to find the optimal weights. Books have been written on this subject and it is beyond the scope of this thesis. In the model the neural net is trained with the resilient backpropagation algorithm of Riedmiller (1994).
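The forward pass in (12)-(16) can be sketched in a few lines. This is a minimal illustration assuming a tanh hidden activation and a logistic output; the toy weights and list-of-lists layout are my own convention:

```python
import math

def sigmoid(a):
    # Logistic sigmoid, equation (15).
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, W1, b1, W2, b2, h=math.tanh):
    """Two-layer forward pass, equations (12)-(14) and (16).
    W1 is an M x p list of lists, b1 has length M; W2 is K x M, b2 length K."""
    # Hidden activations a_j (eq. 12) passed through h (eq. 13).
    z = [h(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    # Output activations a_k (eq. 14) passed through the logistic sigmoid.
    return [sigmoid(sum(w * zj for w, zj in zip(row, z)) + b)
            for row, b in zip(W2, b2)]

# Four inputs, five hidden nodes, one output, as in Figure 6 (toy weights).
x = [0.5, -1.0, 0.25, 1.0]
W1 = [[0.1] * 4] * 5
b1 = [0.0] * 5
W2 = [[0.2] * 5]
b2 = [0.0]
y = forward(x, W1, b1, W2, b2)
print(len(y), 0.0 < y[0] < 1.0)  # 1 True
```

Training (finding W1, b1, W2, b2) is a separate problem; in the thesis it is handled by resilient backpropagation.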

### 4.4 Decision trees

Another branch of machine learning algorithms is the decision tree. Decision trees can be used for both regression and classification problems. In this chapter, the basic idea behind a decision tree is first illustrated through a simple example. After that, tree pruning is explained: a method of generating a smaller subtree from a large base tree. Finally, the ID3 and C4.5 algorithms are discussed, which form the basis for the C5.0 algorithm used in the model.

### 4.4.1 Basics of decision trees

The basic idea behind decision trees is partitioning the input space into multiple regions. Figure 7 shows an example of such a partitioning of a two-dimensional input space (x1, x2) into five different regions named A through E. This partitioning of

### Figure 7: Example of a two-dimensional input space segmented into five regions. Source: Bishop (2006)

the input space can also be illustrated as a tree with binary decisions at each node, see Figure 8. Starting from the top of the tree, the first decision divides the input

### Figure 8: Corresponding binary tree to figure 7. Source: Bishop (2006)

space into x_1 \leq \theta_1 and x_1 > \theta_1, where \theta_1 is some parameter of the model. This step separates the input space into two regions, which can be split into further subregions. For example, the region x_1 \leq \theta_1 can be partitioned further into the regions x_2 \leq \theta_2 and x_2 > \theta_2, indicated by the left branch of the decision tree in Figure 8. These two regions correspond to regions A and B, and so on. For each subregion there is a separate predictive model for the observations that fall into that region. The simplest example of such a model is to predict a chosen constant over each region. In rough terms, this stratification of the input space can be summarized in the following two steps (James et al., 2013):

1. Divide the input space, i.e. all possible values of (x_{i1}, \ldots, x_{ip}), i = 1, \ldots, n, into J distinct non-overlapping regions R_1, R_2, \ldots, R_J.

2. For every observation that falls into region R_j, j = 1, 2, \ldots, J, make a prediction based on some model (where the same model is used for all observations in the same region).

The next question that comes to mind is: how should we construct these regions? The regions R_1, \ldots, R_J are chosen in such a way as to minimize the residual sum of squares (RSS),

RSS = \sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \bar{y}_{R_j})^2

where \bar{y}_{R_j} is the mean response of the training observations in region j. Often a top-down, greedy approach is used to generate such a decision tree. The approach is called top-down because it starts at the top node of the tree and recursively adds nodes that split the predictor space in two. It is called greedy because it makes the best split at the current node, instead of looking ahead and choosing a split that might lead to a better tree (i.e. with lower RSS) further down.
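The greedy split search can be illustrated for a single predictor. This sketch (names and toy data are mine) scans candidate thresholds and keeps the one with the lowest total RSS; a full tree-growing algorithm would apply it recursively to each resulting region:

```python
def rss(ys):
    # Residual sum of squares around the region mean.
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(xs, ys):
    """Greedy search for the single threshold theta that minimises the
    total RSS of the two regions x <= theta and x > theta."""
    best = None
    for theta in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= theta]
        right = [y for x, y in zip(xs, ys) if x > theta]
        total = rss(left) + rss(right)
        if best is None or total < best[1]:
            best = (theta, total)
    return best

xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 5.2, 4.8, 9.0, 9.1, 8.9]
print(best_split(xs, ys)[0])  # 3: the split lands between the two clusters
```

Predicting the region mean in each half is the "constant per region" model mentioned above.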

### 4.4.2 Tree pruning

Given this top-down greedy strategy to grow a decision tree, when do we stop adding nodes (splitting up the input space)? The most straightforward approach is to stop adding nodes when the reduction in residual error falls below some threshold. In practice it is often the case that none of the splits at that node provide a satisfactory reduction in residual error, yet after some more splits a substantial reduction in error occurs (Bishop, 2006). Because of this, a common strategy is to grow a very large tree with a stopping criterion based on the number of observations in the final nodes. Then, the tree is pruned back in order to obtain a subtree.

Pruning is the process of simplifying a decision tree by removing sections of the tree that provide little to no predictive power. The most common way to prune a tree is through a procedure called cost complexity pruning which balances residual error against model complexity based on some tuning parameter. The type of decision tree that is used in this thesis (C5.0) is of the pruned variety as well.

### 4.4.3 ID3 (Iterative Dichotomizer 3)

The algorithm used to generate decision trees in this thesis is the C5.0 algorithm, the successor to the C4.5 algorithm (Ross, 1993). The C4.5 algorithm is itself an extension of the ID3 (Iterative Dichotomizer 3) algorithm (Quinlan, 1986).

Before taking a look at the ID3 algorithm, the concepts of entropy and information gain must first be defined. The entropy H(S) is a measure of the uncertainty in a data set S and is defined as

H(S) = -\sum_{p_s \in P_s} p(x) \log_2 p(x) \qquad (18)

where S is the data set for which the entropy is to be calculated; S changes at every iteration of the ID3 algorithm. Denote the set of features in S by P_s, with individual elements p_s (for example age, height, weight, BMI). To keep the notation consistent I would prefer to denote this set by P, because we have P distinct features, but this might cause confusion in conjunction with p(x), hence P_s. Let p(x) denote the proportion of the number of elements in class p_s to the number of elements in S.

The information gain IG(A, S) is a measure of the difference in entropy after the set S has been split on feature A. It is defined as

IG(A, S) = H(S) - \sum_{u \in U} p(u) H(u) \qquad (19)

where H(S) is the entropy of the set S and U is the collection of subsets created by splitting S on feature A, such that S = \cup_{u \in U}\, u.

The ID3 algorithm can then be summarised in the following four steps:

1. Calculate the entropy of every feature in the current data set S

2. Split S into subsets using the feature with minimum entropy (or, equivalently, maximum information gain)

3. Create a decision tree node containing that feature

4. Recursively follow steps (1)-(3) for the remaining features and stop if one of the stopping criteria is met

There are three stopping criteria. In the first scenario, every element in the remaining set S belongs to the same class (output target); the node is then changed into a terminal node (or leaf) and labelled with that class. In the second scenario, all attributes are exhausted but there are still multiple classes in the subset; in this case the most common class among the observations in the subset is used. The third stopping criterion occurs when the subset is empty. This can happen if the data are split on a feature that has no observations in that particular subset; for example, the data set is split on retirement status after it was already split on age categories in the previous step. In this case a terminal node is created with the most common class of the parent set.
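Entropy and information gain, as used by ID3, can be sketched as follows (the toy 'windy' feature and labels are mine):

```python
import math
from collections import Counter

def entropy(labels):
    # H(S): uncertainty of the class labels in data set S, base-2 logs.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """IG(A, S), equation (19): entropy before the split minus the
    weighted entropy of the subsets created by splitting S on feature A."""
    n = len(rows)
    subsets = {}
    for row, y in zip(rows, labels):
        subsets.setdefault(row[feature], []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in subsets.values())
    return entropy(labels) - remainder

rows = [{"windy": True}, {"windy": True}, {"windy": False}, {"windy": False}]
labels = ["stay", "stay", "go", "go"]
print(information_gain(rows, labels, "windy"))  # 1.0: the split is perfect
```

ID3 would pick the feature with the largest gain, create a node for it, and recurse on the subsets until a stopping criterion is met.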

### 4.4.4 C5.0 algorithm

The C4.5 algorithm is an extension of the ID3 algorithm. The improvements over ID3 are as follows: it is more robust to noise, it works with missing data, it prunes the decision tree once after creation, and, most importantly, it allows the use of continuous features (Ross, 1993). Instead of the C4.5 algorithm I use the C5.0 algorithm, a slightly improved version of C4.5. The main improvements are speed, better handling of extreme edge cases, and boosting.

Although decision trees are easy to estimate and interpret, a single decision tree often does not have the same predictive power as more sophisticated algorithms such as the neural net and the support vector machine. Bagging and boosting are techniques that combine multiple decision trees into one prediction, often dramatically improving predictive performance at the cost of interpretability. These techniques are discussed in the next section.

### 4.5 Bagging and boosting

### 4.5.1 Bagging

Bagging (bootstrap aggregating), proposed by Breiman (1996a), is an ensemble meta-algorithm designed to improve the stability and accuracy of machine learning classifiers. Given a training set X with N observations, bagging creates M (chosen by the user) bootstrapped training sets X_i, i = 1, \ldots, M, each of size N, by uniformly sampling with replacement. The classifier is then trained on each of the M new data sets X_i, and the resulting classifiers are combined by either averaging (for regression) or voting (for classification). Bagging leads to improvements for unstable procedures such as neural networks and decision trees (Breiman, 1996a).
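The resample-then-vote idea can be sketched with a toy base learner (the data set and the threshold classifier below are illustrative, not the thesis's models):

```python
import random
from collections import Counter

def bagged_predict(data, fit, m, x, seed=0):
    """Bagging sketch: fit m base classifiers, each on a bootstrap resample
    of the training data (size N, drawn uniformly with replacement), and
    combine their predictions for x by majority vote."""
    random.seed(seed)
    n = len(data)
    models = [fit([random.choice(data) for _ in range(n)]) for _ in range(m)]
    return Counter(model(x) for model in models).most_common(1)[0][0]

# Toy base learner: predict 1 when x exceeds the resample's mean feature.
def fit(sample):
    threshold = sum(feature for feature, _ in sample) / len(sample)
    return lambda x: 1 if x > threshold else 0

data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.8, 1), (0.9, 1), (1.0, 1)]
print(bagged_predict(data, fit, m=25, x=5.0))  # 1: every bootstrapped model agrees
```

Each resample gives a slightly different classifier; averaging their votes reduces the variance of the unstable base learner.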

### 4.5.2 Boosting

Boosting is another ensemble meta-algorithm, designed to reduce bias and variance (Breiman, 1996b). Boosting grew out of the question of whether it is possible to combine several weak learners (performing slightly better than random) into a single strong learner (a classifier with arbitrarily good performance) (Kearns, 1988). The AdaBoost (Adaptive Boosting) algorithm proposed by Freund and Schapire (1997) has more or less become synonymous with boosting.

Rechenthin (2014) explains boosting as follows: consider the original training set \{(x_1, y_1), \ldots, (x_n, y_n)\}, where x_i is a vector of inputs and y_i is the associated class label. Boosting adds weights to each observation, \{(x_1, y_1, w_1), \ldots, (x_n, y_n, w_n)\}, where the weights sum to 1. The AdaBoost algorithm then builds K classifiers, starting from initial weights w_i = 1/n. Upon each iteration (where the number of iterations is chosen by the user) the weights are adjusted according to the error of the classifier: observations that were classified wrongly receive a higher weight, in the hope that the re-weighting will help to correctly classify previously misidentified observations. In the model a C5.0 decision tree with 100 boosting iterations is used.
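The reweighting step can be sketched as follows. This is the generic AdaBoost-style update (my own minimal version), not the exact boosting variant inside C5.0:

```python
import math

def adaboost_reweight(weights, correct):
    """One AdaBoost-style reweighting step: misclassified observations get
    larger weights, and the weights are renormalised to sum to 1."""
    err = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1 - err) / err)  # classifier vote weight
    new = [w * math.exp(-alpha if ok else alpha)
           for w, ok in zip(weights, correct)]
    s = sum(new)
    return [w / s for w in new]

w = [0.25] * 4                                   # initial weights w_i = 1/n
w = adaboost_reweight(w, [True, True, True, False])
print(round(w[3], 6))  # 0.5: the misclassified point now dominates
```

After the update, the previously misclassified observation carries as much weight as all correctly classified ones combined, forcing the next weak learner to focus on it.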

### 4.6 Ensembles

In any model that uses multiple classification algorithms, as is the case with the model used in this thesis, the classifiers need to be combined in the final step to come to a final prediction of the label. This combining of classifiers can be done using different combination rules, which are discussed in this section. Let d_{k,j} \in \{0, 1\} be the decision of the k-th classifier for the j-th class, where k = 1, \ldots, K and j = 1, \ldots, C, with K the number of classifiers and C the number of classes. Consider an example with three classifiers (K = 3) in the ensemble and a binary output label (C = 2). Then d could look as follows:

d = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \end{pmatrix}

Each row in d represents one classifier and each column a class; a 1 marks the class chosen by that classifier. Although there are numerous combination rules, the most commonly used ones are algebraic combiners and voting-based combiners.

### 4.6.1 Algebraic combiners

Algebraic combiners use the algebraic properties (mean, minimum, maximum, median) of the classifiers' decisions and combine them according to some rule. Table 1 shows different rules that result in different final class supports.

Mean rule:             \mu_j(x) = \frac{1}{K} \sum_{k=1}^{K} d_{k,j}(x)
Sum rule:              \mu_j(x) = \sum_{k=1}^{K} d_{k,j}(x)
Weighted sum rule:     \mu_j(x) = \sum_{k=1}^{K} w_k d_{k,j}(x), with w_k some weight according to performance
Product rule:          \mu_j(x) = \prod_{k=1}^{K} d_{k,j}(x)
Maximum rule:          \mu_j(x) = \max_{k=1,\ldots,K} \{d_{k,j}\}
Minimum rule:          \mu_j(x) = \min_{k=1,\ldots,K} \{d_{k,j}\}
Median rule:           \mu_j(x) = \mathrm{med}_{k=1,\ldots,K} \{d_{k,j}\}
Generalized mean rule: \mu_{j,a}(x) = \left( \frac{1}{K} \sum_{k=1}^{K} d_{k,j}(x)^a \right)^{1/a}, where a = 1 gives the mean rule

### Table 1: Table of algebraic combiners

The final ensemble decision is the class that has the largest support \mu_j(x) after the combining rule has been applied to the individual supports of each class:

\hat{y}_{\text{final}} = \arg\max_j \mu_j(x).

Consider the example above with K = 3 and C = 2. The sum rule results in \mu(x) = (\mu_1(x), \mu_2(x)) = (2, 1), and as such the final decision is \hat{y}_{\text{final}} = 0.
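The sum and weighted sum rules can be sketched directly on the decision matrix from the example (the function name and zero-based class index are my conventions):

```python
def combine(d, weights=None):
    """Sum rule (or weighted sum rule when weights are given) over a K x C
    decision matrix d with d[k][j] in {0, 1}; returns the class supports
    mu and the index arg max_j mu_j."""
    K, C = len(d), len(d[0])
    w = weights if weights is not None else [1] * K
    mu = [sum(w[k] * d[k][j] for k in range(K)) for j in range(C)]
    return mu, max(range(C), key=lambda j: mu[j])

d = [[1, 0], [0, 1], [1, 0]]      # the K = 3, C = 2 example from the text
mu, y_final = combine(d)
print(mu, y_final)  # [2, 1] 0: the first class has the largest support
```

Passing performance-based weights turns the same function into the weighted sum rule.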

### 4.6.2 Voting based methods

Voting-based methods combine the base classifiers according to some vote count. The final ensemble decision is the class j that receives the largest number of votes. Table 2 shows the two most common variants: the majority voting method and the weighted majority voting method.

Majority voting:          \mu_j(x) = \sum_{k=1}^{K} d_{k,j}
Weighted majority voting: \mu_j(x) = \sum_{k=1}^{K} w_k d_{k,j}, with w_k some weights

### Table 2: Table of voting based methods

In the majority voting method the decision is the class that receives the largest number of votes. In the weighted voting method the classifiers' votes are weighted by some weights. Note that the majority voting rule and the weighted voting rule are exactly the same as the algebraic sum and weighted sum rules, respectively.

### 4.7 Testing and Performance

### 4.7.1 Confusion matrix

A confusion matrix is an often used method to visualise classifier performance in a supervised classification setting. In a confusion matrix the rows represent the actual class of the observation and the columns represent the class predicted by the model or classifier. A classification problem with C classes thus requires a C × C confusion matrix. Table 3 shows an example of a

                   Predicted class
                   +      −
Actual class   +   TP     FN
               −   FP     TN

### Table 3: Example of a confusion matrix

confusion matrix for a two class (C = 2) classification problem. A true positive (TP) is an actual positive instance being classified as a positive instance. A true negative (TN) is a negative instance correctly being identified as a negative instance. A false positive (FP) is a negative instance being wrongly classified as a positive instance (type I error). A false negative (FN) is a positive instance being wrongly classified as a negative instance (type II error).

Following from this confusion matrix, a logical measure to compare different models is the Accuracy, defined as the number of correct predictions over the total number of predictions. The Error Rate is defined as the number of misclassified predictions over the total number of predictions.

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (20)

\text{Error Rate} = \frac{FP + FN}{TP + TN + FP + FN} = 1 - \text{Accuracy} \qquad (21)

A problem with the Accuracy is that in a highly imbalanced data set, for example a data set containing 99% positive samples and only 1% negative samples, it is very easy to get a high accuracy by just classifying everything as a positive sample.
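These quantities follow directly from the four confusion-matrix counts; a minimal sketch (the function name is mine) also makes the imbalance caveat concrete:

```python
def metrics(tp, fn, fp, tn):
    """Accuracy and error rate from confusion matrix counts, eqs. (20)-(21)."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    return {"accuracy": accuracy, "error_rate": 1 - accuracy}

# Imbalance caveat: labelling everything positive on a 99%-positive data
# set still scores 99% accuracy while learning nothing.
print(metrics(tp=99, fn=0, fp=1, tn=0)["accuracy"])  # 0.99
```

This is why the ROC curve and AUC, discussed next, are preferred on imbalanced data.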

### 4.7.2 ROC and AUC

The Receiver Operating Characteristic (ROC) curve is another commonly used way to visualise the performance of a binary classifier. The ROC curve is a plot of the true positive rate (TPR) against the false positive rate (FPR). These are defined as follows

TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN}.

### Figure 9: Example of the ROC curve for three classifiers. Source: Rechenthin (2014)

Comparing the performance of the classifiers visually is straightforward. The best performing classifier has a curve that is the closest to the top left corner. A curve in this corner means that for every value of the FPR the TPR is high. A very poorly performing classifier would have a curve which is close to the bottom right corner. Figure 9 shows an example of the ROC curves for three different classifiers. From the figure it is clear that classifier 1 has the best curve (a curve that is closest to the top left corner). The performance of classifier 2 is slightly better than random and the performance of the third classifier is significantly worse than random.

Instead of using the actual ROC curve, the area under the ROC curve (AUC) is used to compare classifiers with a single number. The AUC is calculated by either integration or approximation. It can be interpreted as the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance (Fawcett, 2006). A random classifier (i.e. flipping a coin) has an AUC of 0.50; a classifier better than random therefore has an AUC greater than 0.50, and vice versa. Although most often applied to two-class problems, the AUC can be extended to multi-class problems by weighing the AUC according to the class distribution.

### 4.7.3 Profitability

Ultimately, the main goal of many market researchers is profit. In contrast, the goal of this thesis is to gain a deeper understanding of market predictability using machine learning algorithms, not to build an automated trading system or to look for profit opportunities. In many studies the models seem to significantly outperform existing benchmarks, yet when realistic trading costs are taken into account the profits often disappear.

### 5 Data

This chapter first gives a description of the used data, then the attribute and class label creation is discussed. Finally, further processing and rescaling of the data is discussed.

### 5.1 Data description

High-frequency financial data possess unique characteristics that are otherwise absent in data measured at lower frequencies (e.g. daily, weekly, annually). First, the number of observations is extremely large. Second, the data are often recorded with some form of error (gaps, outliers, disordered series) and require substantial preprocessing before they can be analysed properly. Third, high-frequency data typically exhibit periodic patterns in market activity (volume); it is well known that trading volume is generally lower during lunchtime than at the opening and closing of the trading day. Fourth, these data streams are often sensitive to relevant news items, which induces short-lived volatility increases. Andersen and Bollerslev (1998) provide an extensive discussion of these problems.

The data used in this thesis are high-frequency stock data of three large companies spanning the first six months of 2012: Google (GOOG), Apple (AAPL), and Microsoft (MSFT). The source of these data is the Trade and Quote (TAQ) database from Wharton Research Data Services. The provided data set is post-processed: bad trades and outliers have already been removed and the data have been spaced to equal intervals (seconds). Trading volume is absent.

Although the data have been pre-processed already, further processing is needed. The original data set contains the price of each stock at every second of the trading day. Instead of using the price every second, I record the opening, closing, minimum, and maximum price of the stock in every five minute interval. Table 4 shows the summary statistics for these data and Table 5 shows a snippet of the recorded data. These data are then used to calculate various technical indicators.

Stock   N      Mean     Std. dev   Min      Max
AAPL    9805   542.82   62.88      409.49   643.34
GOOG    9805   607.47   25.43      557.12   669.96
MSFT    9805   30.50    1.41       26.47    32.93

### Table 4: Summary statistics for the price data

### 5.2 Attribute and class label creation

In machine learning, features (attributes) are individually measurable properties of the phenomenon that is being observed (Bishop, 2006). In order to extract

time   avg_price   min_price   max_price   opening_price   closing_price
9800   582.2041    581.86      582.534     582.0115        582.5180
9801   582.6059    582.39      582.800     582.5020        582.7400
9802   582.9962    582.73      583.320     582.7600        583.2742
9803   583.8268    583.36      584.000     583.3600        583.7601

### Table 5: Snippet of AAPL data on June 29th 2012 between 15:35:01 and 15:54:59. Each row is one five minute interval.

regularities in the data, we construct a large set of technical indicators, which are then used as features for the models to be trained on.

Because the complete data set over the whole time span is available, simplifications with respect to the model and attribute creation are possible. In a more realistic setting, where data come in at regular intervals, one would calculate the technical indicators for each five minute interval of observations as they arrive. Because all the data are already available, these indicators can be calculated beforehand.

In total 15 different groups of indicators were used with different parameters, totalling 51 features. The technical indicators used are: rate of change, percentage change of SMA, percentage change of EMA, Relative Strength Index, Chande Momentum Oscillator, Aroon Oscillator, Bollinger Bands, Commodity Channel Index, Chaikin Volatility, Close Location Value, Moving Average Divergence Oscillator, Trend Detection Index, Williams Accumulation/Distribution, Stochastic Momentum Oscillator, and Average True Range. For a complete and detailed description of how these indicators are calculated, see Appendix A.

Stock   Up      Down    No upmove   Sign. upmove   No downmove   Sign. downmove
AAPL    49.4%   50.6%   72.1%       26.9%          74.1%         25.9%
GOOG    49.9%   50.1%   73.7%       26.3%          73.4%         26.6%
MSFT    49.0%   51.0%   74.5%       25.5%          75.7%         24.3%

### Table 6: Amount of (significant) up- and downmoves for the three stocks.

For the class labels, instead of looking at the price of the stock I look at whether there has been an upmove or a downmove. The first two columns of Table 6 show the proportion of upmoves and downmoves in the data set. Let p_t be the price in period t; the class label \{y_i, i = 1, \ldots, N\} is then given by

y_i = 0 \text{ if } \frac{p_t - p_{t-1}}{p_{t-1}} \leq 0, \qquad y_i = 1 \text{ if } \frac{p_t - p_{t-1}}{p_{t-1}} > 0.

This gives us pairs \{(x_{i,1}, \ldots, x_{i,51}, y_i), i = 1, \ldots, N\} to train the machine learning algorithms on. Besides the standard up- or downmove I also consider significant upmoves and downmoves: a significant upmove (downmove) is defined as a price change larger (smaller) than 0.05% (−0.05%). The last four columns of Table 6 show the slightly imbalanced distribution of significant upmoves and significant downmoves for the three stocks.
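The label construction can be sketched as follows (the prices are hypothetical; a threshold of 0.0005 reproduces the 0.05% "significant upmove" rule):

```python
def class_labels(prices, threshold=0.0):
    """Class labels from a price series: y = 1 for an upmove (return >
    threshold), else 0. With threshold = 0.0005 only 'significant'
    upmoves (> 0.05%) are labelled 1."""
    returns = [(p - q) / q for q, p in zip(prices, prices[1:])]
    return [1 if r > threshold else 0 for r in returns]

prices = [100.0, 100.03, 100.2]
print(class_labels(prices))          # [1, 1]
print(class_labels(prices, 0.0005))  # [0, 1]: only the second move exceeds 0.05%
```

A symmetric threshold on negative returns would mark the significant downmoves in the same way.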

### 5.3 Further processing and scaling

Before these indicators can be used to train the machine learning algorithms, they need to be rescaled. There are numerous reasons to rescale the data, but the most important one is that many machine learning algorithms are sensitive to outliers. Commonly used transformations are

x'_t = \frac{x_t - x_{\min}}{x_{\max} - x_{\min}}, \quad x'_t = \frac{x_t}{x_{\max}}, \quad x'_t = \frac{x_t - \mu}{\sigma}, \quad x'_t = \log(x_t).

Ultimately in the model, the machine learning algorithms are being continuously trained on chunks of previously seen data. All these algorithms get added to a large pool and the top performers are then used to predict the next period. Rescaling, and normalising in particular, is problematic here. If we rescale the first chunk with parameters (µ1, σ1) and the second chunk one with (µ2, σ2), then how do we

compare the results of the trained algorithms? To circumvent these problems I rescale the data to the interval [−1, 1]. Scaling a series of observations to [0, 1] one can use

$$f(x) = \frac{x - x_{\min}}{x_{\max} - x_{\min}}.$$

Hence, to scale a series of observations to $[a, b]$ we use

$$f(x) = (b - a)\,\frac{x - x_{\min}}{x_{\max} - x_{\min}} + a. \tag{22}$$
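Equation (22) is a one-line affine map; a minimal Python sketch (hypothetical helper, not thesis code) makes the [−1, 1] default explicit:

```python
def rescale(xs, a=-1.0, b=1.0):
    """Affine rescaling of a series to the interval [a, b], as in eq. (22).
    x_min and x_max are taken from the series itself."""
    x_min, x_max = min(xs), max(xs)
    return [(b - a) * (x - x_min) / (x_max - x_min) + a for x in xs]

print(rescale([0.0, 5.0, 10.0]))  # [-1.0, 0.0, 1.0]
```

With a = −1 and b = 1 the minimum maps to −1, the midpoint to 0, and the maximum to 1, which is the scaling used throughout the model.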

### 6 Model

This chapter describes the model that is used and the benchmarks against which the performance of the ensemble model is measured.

### 6.1 Framework

Figure 10 shows a schematic overview of the model framework used to predict the direction of the stock prices. The model is a variant of the one in Rechenthin (2014). The framework comprises four steps:

1. Classifiers are continuously trained on prior data

2. The pool of classifiers resulting from step 1 is evaluated

3. Top performers are identified and combined into an ensemble

4. The ensemble is used to form a prediction for the next period

### Figure 10: Schematic overview of the wrapper model. Source: Rechenthin (2014)

### 6.1.1 Training

In the first step, classifiers are continuously trained on random-length subsets of previous data (also called chunks). Rechenthin (2014) does not make clear how the chunks are generated, only that their sizes are constrained between some minimum and maximum value. In this thesis a different approach is taken: the size of each chunk is drawn from a Geometric distribution with parameter θ = 1/200.

The starting times of these classifiers are distributed uniformly over the whole interval. The stopping time of each classifier is then given by its starting time plus the randomly generated chunk size plus 50. This padding value has been chosen arbitrarily but is needed because the Geometric distribution has a nonzero probability of generating a zero-length chunk. Another reason to add this constant is that some technical indicators need some time before they can return their first value, for example the N-period simple moving average.
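The chunk-generation step above can be sketched as follows; the inverse-transform sampling of the Geometric distribution and the end-of-sample clipping are my own assumptions, as the thesis does not spell out these details:

```python
import math
import random

def generate_chunks(n_obs, n_classifiers, theta=1 / 200, padding=50, seed=0):
    """Sketch of the chunk generation: starting times uniform over the
    sample; chunk sizes ~ Geometric(theta) (mean ~ 200 observations,
    support {0, 1, ...}) plus a fixed padding of 50 observations."""
    rng = random.Random(seed)
    chunks = []
    for _ in range(n_classifiers):
        start = rng.randrange(n_obs)
        # inverse-transform sample: number of failures before first success
        size = int(math.log(1.0 - rng.random()) / math.log(1.0 - theta))
        # clip the stopping time to the end of the sample (assumption)
        stop = min(start + size + padding, n_obs - 1)
        chunks.append((start, stop))
    return chunks

chunks = generate_chunks(n_obs=10_000, n_classifiers=300)
```

Because the geometric draw can be zero, the +50 padding guarantees every chunk is long enough for indicators such as the N-period simple moving average to produce values.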

In total 300 classifiers (100 neural nets, 100 support vector machines, and 100 decision trees) were trained. The parameter values for the machine learning algorithms were found by trial and error, using the parameter values in Rechenthin (2014) as a starting point. The neural nets were trained with one hidden layer of 50 hidden nodes and a logistic activation function. The support vector machines were trained with a polynomial kernel with a coefficient of 2. The decision trees were of the C5.0 variant with 100 boosting iterations.
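A scikit-learn sketch of the three classifier families with the hyperparameters reported above; this is a stand-in, not the thesis's actual software. In particular, C5.0 has no scikit-learn implementation, so boosted decision trees via AdaBoost are used here as an approximation, and the "coefficient of 2" for the polynomial kernel is read as degree 2:

```python
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier

def make_classifiers():
    # one hidden layer, 50 nodes, logistic activation (as in the text)
    nn = MLPClassifier(hidden_layer_sizes=(50,), activation="logistic")
    # polynomial kernel; "coefficient of 2" interpreted as degree 2
    svm = SVC(kernel="poly", degree=2)
    # 100 boosting iterations; approximates the C5.0 boosted trees
    tree = AdaBoostClassifier(n_estimators=100)
    return nn, svm, tree
```

One hundred instances of each, fit on their randomly assigned chunks, would populate the pool of 300 classifiers.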

### Figure 11: Visualisation of overlapping chunks

Figure 11 visualises the often overlapping chunks of training data for the three types of classifiers over the first 1000 observations. When a classifier is done training on its assigned chunk, it is added to the pool. In the beginning there are relatively few classifiers in the pool, which might lead to poor performance. The first 500 observations of each stock are therefore skipped as a ramp-up period, giving the model time to generate a substantial pool of classifiers to start with.

### 6.1.2 Evaluating the pool

In the second step of the model, each classifier that is finished training on its assigned chunk gets added to the so-called pool. This pool of classifiers contains all the trained classifiers which are ready to be evaluated. Over time the pool grows larger as new classifiers are continuously added.

For each period t = 0, 1, . . . , T the performance of all classifiers in the pool is determined by testing them on the most recent sliding window of 2 hours (24 intervals of 5 minutes). The performance measure used is the AUC. The classifiers are sorted on AUC in descending order, and the top 10 performers are chosen from the pool and grouped into an ensemble.
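The selection step can be sketched as follows. The AUC is computed here via its Mann-Whitney interpretation (the probability that a random positive is scored above a random negative); the `select_ensemble` interface is hypothetical, not the thesis's actual code:

```python
def auc(scores, labels):
    """AUC via the Mann-Whitney U statistic; ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def select_ensemble(pool, window_scores, window_labels, k=10):
    """Rank classifiers by AUC on the most recent sliding window and
    keep the top k. window_scores maps classifier -> window scores."""
    ranked = sorted(pool,
                    key=lambda c: auc(window_scores[c], window_labels),
                    reverse=True)
    return ranked[:k]

print(auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```

In the model the window holds the 24 most recent 5-minute labels, and k = 10.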

### 6.1.3 Ensemble prediction

In the last step, the ensemble formed in the previous step is used to predict the upmove or downmove at period t + 1. To reiterate, the ensemble contains the top 10 classifiers (based on their AUC over a sliding window of the most recent 2 hours) from the evaluated pool. Each of these classifiers is evaluated at time t + 1, resulting in 10 predictions for the class label. The final prediction of the ensemble is based on a majority vote of the classifiers in the ensemble; in case of a tie an upmove is chosen. Averaged over the three stocks, ties account for 2.7%, 3.0%, and 2.6% of the predicted upmoves in the upmove/downmove, significant-upmove, and significant-downmove tasks, respectively.
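The voting rule, including the tie-break in favour of an upmove, reduces to a one-liner (hypothetical helper, not thesis code):

```python
def ensemble_vote(predictions):
    """Majority vote over the ensemble's class predictions
    (0 = downmove, 1 = upmove); a tie is resolved as an upmove."""
    ups = sum(predictions)
    downs = len(predictions) - ups
    return 1 if ups >= downs else 0

print(ensemble_vote([1, 1, 1, 0, 0]))  # 1 (majority up)
print(ensemble_vote([1, 1, 0, 0]))     # 1 (tie -> upmove)
print(ensemble_vote([0, 0, 0, 1]))     # 0 (majority down)
```

With 10 classifiers in the ensemble a 5-5 tie is possible, which is where the reported 2.6%-3.0% tie-induced upmove predictions come from.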

### 6.2 Benchmarks

To analyse the performance of the model, first some benchmarks (or baselines) have to be established. The model described above is compared to five different models for the three different stocks.

The first baseline is the random walk. Assuming the direction series is a martingale, the best prediction for period t + 1 is its value at period t: the predicted upmove or downmove for the next period is this period's upmove or downmove. The second benchmark is a Logit model that uses all the available information up until time t to make a prediction for period t + 1; besides the technical indicators, it also includes the one-, two-, and three-period lagged upmoves. The other three benchmarks are versions of the ensemble model in which respectively only neural nets, only support vector machines, and only decision trees are used in the pool. Because of computational limitations, each of these benchmarks uses only the 100 classifiers of the corresponding type that were already trained for the ensemble model; recall that the full ensemble model consists of 300 classifiers (100 of each), so these benchmark models contain only 100 classifiers in total. A comparison of the ensemble model against these three models measures the added value of the wrapper framework.
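The random-walk (persistence) baseline is trivial to implement and evaluate; the sketch below (my own illustration, not thesis code) scores it by its hit rate on a label series:

```python
def random_walk_predictions(labels):
    """Persistence benchmark: the predicted direction for period t+1
    is simply the observed direction at period t."""
    return labels[:-1]

def hit_rate(labels):
    """Fraction of periods where the persistence forecast is correct."""
    preds = random_walk_predictions(labels)
    actual = labels[1:]
    return sum(p == a for p, a in zip(preds, actual)) / len(actual)

print(hit_rate([1, 1, 0, 1, 0, 0]))  # 0.4
```

On a highly persistent direction series this baseline is hard to beat, which is exactly why it is a meaningful first benchmark for the ensemble model.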