Predicting bitcoin price movements using LSTM


Lucas Arnoud van der Kleij

Student number: 11004517

Faculty of Economics and Business University of Amsterdam

This dissertation is submitted for the degree of BSc Econometrics & Operations Research


Declaration

This document is written by Lucas Arnoud van der Kleij who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


Acknowledgements

I would like to thank Hao Li for giving me the opportunity and freedom to write about something I am passionate about.


Abstract

Because short-term price forecasting of bitcoin has received little research attention, many questions remain unanswered. One of these is whether 30-second price movement forecasts, produced by a long short-term memory (LSTM) model using limit order book (LOB) data, can result in positive overall returns. The results indicate that, when using a Buy-and-Sell strategy, the returns made by the model’s predicted price movements outperform both randomly predicted price movements (using the same strategy) and a Buy-and-Hold strategy. Hence, the LSTM model is capable of improving the out-of-sample price movement prediction accuracy, such that positive overall returns can be made.


List of Tables

1 The amount of delivered and dropped messages via the GDAX API.

2 Parameters of the best tuned model.

3 The precision and recall scores of the predicted classes.

4 Descriptive statistics of predicted up and down returns.


List of Figures

1 Several plots visualizing the distribution and autocorrelation of 30-second returns.

2 A visualization of the mean 30-second returns throughout the day.

3 The blue dots in subfigure a illustrate the 30-second trading volumes throughout the day. The red line is the rolling mean (10-minute window) of the blue dots. The line graph in subfigure b visualizes the realized standard deviation at 10-minute windows. At each window $w$ it is calculated as $\sqrt{\frac{1}{20} \sum_{i \in w} (R_i^m)^2}$.

4 The folded and unfolded representation of a vanilla RNN. The output sequence $y_t$ is predicted from an input sequence $x_t$, using a recurrent hidden state $h_t$, which is updated at each state $t$. Note that $t$ does not necessarily denote a time interval, but rather the position in the sequence; for instance, in a sequence of words $x_t$ = {this, is, a, sentence}.

5 The original LSTM representation [Colah’s Blog, (2015), understanding LSTM]. Just as in a typical RNN, an output sequence $y_t$ is predicted from an input sequence $x_t$. The core of an LSTM is the cell in each state $t$ (the green planes). The top input arrow in the middle cell represents the cell state $C_{t-1}$, which is updated along the middle cell to $C_t$ at the top right output arrow.

6 A visualization of the train and validation loss at each epoch in subfigure a. Subfigure b visualizes the accuracy of the train and test model.

7 A confusion matrix, illustrating the performance of predicted classes compared to the actual classes.

8 A visualization that shows how the returns are distributed among the predicted up and down classes.

9 A visualization that shows how the returns are clustered over time.


1 Introduction

High-frequency trading (HFT) has become one of the fastest growing financial innovations of the last decade due to its high earnings potential. HFT is a subset of algorithmic trading that employs sophisticated computer algorithms for making high-speed trading decisions based on buy and sell activity (Brogaard, Hendershott and Riordan, 2014, p. 1). HFT is often criticized in the media due to its considerable market impact. However, Brogaard et al. (2014, p. 38) explain that the presence of HFT results in better price efficiency. A necessary condition for HFT is access to real-time intraday market data. Although this data is often expensive for traditional markets, cryptocurrency exchanges, such as GDAX and Kraken, offer free real-time intraday data on cryptocurrency markets. Katsiampa (2017) describes cryptocurrency markets as speculative and highly volatile. This volatility, derived from future uncertainty, makes it difficult to generate long-term predictions. However, volatile markets have the advantage of better short-term price movement predictions (Goldstein, Kwan and Philip, 2017, p. 38; Gao, Han, Li and Zhou, 2015). Therefore, it is valuable to research the short-term forecasting potential of cryptocurrencies.

HFT strategies typically aim to take advantage of market-price inefficiencies, such as arbitrage opportunities and limit order book (LOB) imbalances. These market inefficiencies are in contrast with widely accepted market theories, such as the law of one price and the efficient market hypothesis. According to Goldstein et al. (2017, p. 4), HFT firms have a significant market information advantage over non-HFT participants due to their fast processing of market data, especially LOB data. This information advantage can result in better short-term future price movement predictions (Cao, Hansch and Wang, 2009; Cont, Kukanov and Stoikov, 2014; Goldstein et al., 2017).

Financial time series prediction is often based on some form of regression technique. The most well known are the univariate autoregressive integrated moving average (ARIMA) and the multivariate vector autoregressive (VAR) models (Harvey, 1990; Box, Jenkins, Reinsel and Ljung, 2015). However, with the advancements in computational science, state-of-the-art algorithms, such as deep learning, have been developed to forecast time series data. Among the most prominent deep learning algorithms for sequential data are recurrent neural networks (RNN) and long short-term memory (LSTM) networks. These deep learning algorithms outperform regression-based models due to their ability to model non-linear and sequential behavior (Siami-Namini and Namin, 2018; Adebiyi, Adewumi and Ayo, 2014). Unlike regression-based models, which have a high degree of inference and economic interpretation, these deep learning algorithms have a low degree of inference and almost no economic interpretation. Zhang, Patuwo and Hu (1998) mention that this is due to their so-called ‘black-box’ structure.

Contrary to many HFT studies, this dissertation uses intraday cryptocurrency data instead of equity data. The volatile environment of cryptocurrencies makes them well suited for short-term price movement predictions. The data used in this dissertation is obtained from the GDAX exchange and includes the highest level of market data (level III) of the Bitcoin/Euro market. Level III market data has all necessary features to follow the entire order process. This data is then used to fully reconstruct and analyze the LOB. Eventually, an LSTM model is trained on the basis of this data. The main objective of this dissertation is to evaluate, by using an LSTM model, whether price movement predictions can result in positive overall returns.

The remainder of this dissertation is organized as follows: Chapter 2 starts with an explanation of the bitcoin market and its characteristics. It then describes the usefulness of the LOB data in price prediction and discusses intraday patterns in traditional markets. Finally, this chapter explains the advantage of state-of-the-art deep learning models. Chapter 3 analyzes the level III data of the Bitcoin/Euro market and explains the mathematical framework of RNN and LSTM models. Moreover, it proposes the final forecasting model. Chapter 4 summarizes the model output and evaluates its forecasting ability. Chapter 5 presents the concluding remarks on the results, proposes a trading strategy and gives suggestions for future research.


2 Review of Literature

2.1 Bitcoin Market

The characteristics of bitcoin are essentially different from those of other financial markets today. The main differences can be traced back to its economic value, volatility, trading patterns and forecasting power. All of these are described in this section.

2.1.1 Economic Value

Much of the literature on bitcoin seeks to identify its economic role and underlying blockchain technology. Harvey (2014) summarizes the important characteristics of bitcoin and speculates about future possibilities for bitcoin and blockchain technology. He disproves the myth that bitcoin lacks security by illustrating that it is not feasible for hackers to attack bitcoin as an enormous amount of computing power is needed. However, Böhme, Christin, Edelman and Moore (2015) argue that bitcoin is prone to DDoS¹ attacks and explain that Bitcoin transactions reveal personal information, which violates the widespread expectation of privacy. Harvey (2014) further illustrates that many of the tasks executed by central authorities and banks will migrate to blockchain. Both Harvey (2014) and Böhme et al. (2015) emphasize bitcoin’s possibility of disrupting the role of monetary systems.

Balvers and McDonald (2017) explain the characteristics of an ideal currency and describe practical steps to mimic this ideal design using cryptocurrencies. Linking a currency to a weighted combination of inflation for major countries would be the first step in moving toward this ideal currency according to Balvers and McDonald (2017). They empirically show that the costs involved in the stabilization of a currency could be reduced thanks to the creation of a measure of inflation.

Dyhrberg (2016) states that, although bitcoin acts like a currency, it will never behave exactly like the currencies on the market today, as it is both decentralized and largely unregulated. This is also confirmed by Yermack (2015). Yermack (2015) further explains that bitcoin fails to satisfy the three criteria of a real currency (a medium of exchange, a store of value, and a unit of account). He declares that bitcoin is used as a medium of exchange by only a small percentage of buyers and sellers. Its high volatility, which is much higher than that of any traditional currency, imposes a high risk for short-term users. Instead, many users use it as a speculative asset (Yermack, 2015).

¹ Distributed Denial of Service (DDoS) attacks are attempts to disrupt computer systems (https://www.us-cert.gov/ncas/tips/ST04-015).

2.1.2 Volatility

Understanding the volatile behavior of bitcoin prices creates new possibilities for risk managers and speculative traders, as explained by Katsiampa (2017). Katsiampa (2017) compares several generalized autoregressive conditional heteroskedasticity (GARCH) type models to explain bitcoin price volatility and finds that the AR(1)-CGARCH(1) model is the optimal model in terms of goodness-of-fit. However, this model still departs from normality. The study emphasizes the importance of having both a short-run and a long-run component of conditional variance. Dyhrberg (2016) employs GARCH models to examine the similarity between bitcoin, gold and the dollar, and explains that bitcoin has many similarities to both. Specifically, most of bitcoin’s aspects are similar to gold, as they react to similar variables in the GARCH model, possess similar hedging capabilities and react symmetrically to good and bad news. However, the frequency may be higher for bitcoin, as trading is faster and reactions to market sentiment are quick (Dyhrberg, 2016). This is in contrast with Yermack (2015), who explains that bitcoin’s exchange rates are not correlated with any other type of currency or with gold, making bitcoin useless for risk management and exceedingly difficult for its owners to hedge.

Eross, McGroarty, Urquhart and Wolfe (2017) take another approach and examine the return-volatility behaviour of bitcoin itself. They observe a negative correlation coefficient of 0.029 between return and realized volatility. They also inspect the causal relationship between return and volatility. They point out that this relationship is not return- or volatility-driven, but that it is a bilateral relationship according to Granger causality.

2.1.3 Trading Patterns

Trading activity on traditional markets, such as bonds, indices and stocks, typically follows a U-shaped distribution. Cizeau, Liu, Meyer, Peng and Stanley (1997) explain that the U-shape corresponds with peaks around opening and closing times on most days. Xu (2017) provides strong evidence that trades in the morning are mostly information driven and trades in the afternoon are mostly liquidity driven. As the bitcoin market is continuously tradable (it has no closing time), information can be immediately processed and incorporated into the price throughout the day. However, Eross et al. (2017) provide evidence that returns are highest during the hours between 8 am and 4 pm, which is consistent with the opening and closing of European and North American markets. Furthermore, they observe that volume increases throughout the day and falls from around 4 pm until midnight, which is consistent with the intraday patterns found in currency markets. Realized volatility shows an N-shape, which can be explained by the lack of market makers in the bitcoin exchanges (Eross et al., 2017).

2.1.4 Forecasting Returns

Balcilar, Bouri, Gupta and Roubaud (2017) describe that the volume–return relationship has been broadly covered for equities, commodities, bonds, interest rates and currencies. However, it remains unexplored for the bitcoin market. According to Balcilar et al. (2017), studying the relationship between volume and returns is important in generating a better understanding of how market information is transmitted and then embedded in asset prices. Budish, Cramton and Shim (2015) discovered that volume can predict returns, except in bitcoin bear and bull market regimes. They emphasize that this result highlights the importance of modelling non-linearity and accounting for the tail behavior when analyzing causal relationships between bitcoin returns and trading volume. Detzel, Liu, Strauss, Zhou and Zhu (2018) use a different approach to predicting returns by using moving averages. They document that bitcoin returns are largely unpredictable when macroeconomic variables are used, but become predictable when 5- to 100-day moving averages (MAs) of their prices are used (both in- and out-of-sample).

2.2 High-Frequency Trading

HFT strategies aim at finding price imbalances by analyzing intraday market data, specifically LOB data. Since there is little literature on LOB analysis and intraday patterns for bitcoin, most of this section is based on traditional financial markets.

2.2.1 Limit Order Book Information

Goldstein et al. (2017) reveal that LOB analysis can result in a significant market information advantage for HFT. Cao, Hansch and Wang Beardsley (2004) provide empirical evidence that the LOB beyond the first step has an information share of around 30 percent. Specifically, the imbalance between demand and supply from LOB depth 2 to 10 provides additional power in explaining future short-term returns according to them. In a later study, Cao et al. (2009) describe that the order book is moderately informative as its contribution to price discovery is roughly 22 percent. The remaining 78 percent comes from the best bid and ask prices and the last transaction price. Furthermore, Cao et al. (2009) reveal that order imbalances between demand and supply along the book are significantly related to future short-term returns, even after controlling for the autocorrelations in return, the inside spread, and the trade imbalance, which is consistent with their previous work.

In agreement with the above literature, Chordia, Roll and Subrahmanyam (2002) explain that market-wide returns are strongly affected by contemporaneous and lagged order imbalances. They observe that market returns reverse themselves after high-negative-imbalance, large-negative-return days, even after controlling for aggregated volume and liquidity. Moreover, Rime, Sarno and Sojli (2010) emphasize that order flow is a powerful predictor of daily movements in exchange rates, both in- and out-of-sample. Maslov and Mills (2001) discover that a large imbalance in the number of limit orders placed at the bid and ask sides leads to a predictable short-term price change, which is in agreement with the law of supply and demand. Additionally, Zheng, Moulines and Abergel (2012) provide evidence that trade sign and market order size, as well as the liquidity on the best bid and best ask, are consistently informative for predicting the incoming price jump.

2.2.2 Intraday Patterns

As described at the start of section 2.1.3, the volume of traditional markets typically follows a U-shaped distribution. Cartea, Jaimungal and Penalva (2015) provide a more in-depth analysis of the volume distribution, but find that hourly or minutely distributions show no significant pattern. Moreover, they discover that during periods with a constant flow of information more trading and lower spreads occur together. Furthermore, they reveal that declining volume goes hand-in-hand with declining spreads.

A more recent study by Jiang and Zhu (2017) researches the short-term underreaction in the US equity market. They show the existence of short-term underreaction in the US equity market, using jumps in stock prices as a proxy for large information shocks. Their results based on intraday jumps, and in particular overnight jumps, provide further evidence consistent with underreaction. They further reveal that limited investor attention contributes to short-term underreaction. Xu (2017) discovers that trades in the afternoon negatively predict future returns and cause price reversals. He describes that momentum trading strategies based on morning returns and reversal trading strategies based on afternoon returns generate significant returns. These returns cannot be explained by standard risk factors, including momentum and reversal factors, according to Xu (2017).

2.3 State-of-the-Art on Time Series Prediction

Financial time series are often chaotic in the sense that they have little to no linear or deterministic pattern. This makes it difficult for traditional methods, such as ARIMA and VAR, to make accurate out-of-sample predictions. However, in recent years, other advanced algorithms have proven more useful in these prediction tasks.

2.3.1 Recurrent Neural Networks

Traditional methods, such as ARIMA and VAR, typically lack the flexibility to model the non-linear behavior, sequential dependencies, and regime switches characteristic of financial time series (Franses and Van Dijk, 2000). However, Siami-Namini and Namin (2018) describe that recent developments in deep learning have resulted in advanced algorithms, such as RNN and LSTM, that are used in time series prediction. According to Hochreiter and Schmidhuber (1997), the advantage of LSTM over RNN is that it is able to store information over extended time intervals by recurrent backpropagation without decaying error backflow.

Siami-Namini and Namin (2018) compare the prediction error of ARIMA with that of an LSTM model on financial time series. They find that an LSTM network with one hidden layer leads to an 85 percent reduction in errors compared to ARIMA. Gers, Eck and Schmidhuber (2002) explain that, although LSTM is able to solve many time series tasks unsolvable by feed-forward networks and traditional methods, its superiority does not carry over to certain simpler time series prediction tasks solvable by time window approaches. According to Hermans and Schrauwen (2013), time series often have a temporal hierarchy, with information that is spread out over multiple time scales. They find that a hierarchical structure of stacked layers makes it possible to perform hierarchical processing on difficult temporal tasks, and to capture the structure of time series more naturally. This is also confirmed by Pascanu, Gulcehre, Cho and Bengio (2013), who explain that a hierarchical model can be more efficient at representing some functions than a simple one. However, one major problem with training neural nets is the possibility of overfitting during training. Srivastava, Hinton, Krizhevsky, Sutskever and Salakhutdinov (2014) found a simple way to reduce the possibility of overfitting by using a regularization technique called dropout. The dropout parameter d ∈ [0, 1] deactivates a fraction d of the neurons in a particular layer. They describe that it prevents neurons from developing co-dependencies amongst each other during training and forces a neural network to learn more robust features. The dropout parameter is typically used in both the LSTM and input layers.
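The effect of the dropout parameter d can be illustrated with a short NumPy sketch. This is only an illustration under stated assumptions: the function name `dropout` is hypothetical, and the rescaling of the surviving activations by 1/(1 − d) (often called inverted dropout) is a common convention that is assumed here and not taken from the text.

```python
import numpy as np

def dropout(activations, d, rng, training=True):
    """Deactivate a random fraction d of the neurons (hypothetical helper)."""
    if not 0.0 <= d <= 1.0:
        raise ValueError("d must lie in [0, 1]")
    if not training or d == 0.0:
        # At prediction time all neurons stay active.
        return activations
    if d == 1.0:
        return np.zeros_like(activations)
    # Each neuron is kept with probability 1 - d.
    keep_mask = rng.random(activations.shape) >= d
    # Assumption: rescale the kept activations so their expected sum is unchanged.
    return activations * keep_mask / (1.0 - d)

rng = np.random.default_rng(0)
h = np.ones((1000, 50))
h_drop = dropout(h, d=0.2, rng=rng)
dropped_fraction = float(np.mean(h_drop == 0.0))
```

With d = 0.2 roughly one fifth of the entries of `h_drop` are zero, and the remaining entries are scaled up to compensate.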

2.4 Theoretical Findings

Research has shown that LOB imbalances are a powerful predictor for short-term price movements. However, the relationship between LOB imbalances and price movements is often based on linear methods. With the advantages of LSTM, possible unknown non-linear and sequential relationships can be modeled, which can lead to better out-of-sample prediction accuracy. The high volatility of the bitcoin market is well suited for modeling price movements as prediction accuracy is generally higher in volatile markets.

Hence, the expectation is that an LSTM model is able to make accurate short-term price movement predictions for bitcoin. In this dissertation, predictions are considered accurate when a profitable (theoretical) trading strategy can be implemented based on them.


3 Data & Methodology

3.1 Data Description

The data used in this analysis is level III market data from the Bitcoin/Euro market. The data is extracted from the digital currency exchange GDAX (previously known as Coinbase) and covers the period 2018-05-09 to 2018-06-18. Level III market data is the highest level of market data, i.e. it includes all necessary features to follow the entire order process. Every change in the order process results in a message of the corresponding kind. The message types and their fields are:

1. Received message

- sequence - time - order id - order type - side - price - size - funds

2. Open message

- sequence - time - order id - price - done - remaining size - side

3. Done message

- sequence - time - order id - price - remaining size - side

4. Match message

- sequence - time - maker order id - taker order id - size - price - side

The received messages represent incoming orders that are accepted by the GDAX matching engine². The two types of orders that can be registered are limit and market orders. Received limit orders display the price at which the order is made, the side (buy or sell) and its corresponding size. Received market orders do not use the price and size fields, but instead use the “funds” field to indicate how much quote currency will be used. Furthermore, a unique order id is added to each order, after which it is sent to the order book (OB). When an order on the OB is not filled immediately, an open message is sent to indicate how much of the remaining (unfilled) order size goes back on the OB. A done message is sent when an order on the OB is completely filled or canceled. The remaining size of a done message indicates how much of the order is left unfilled (0 for filled orders). A match message is sent when a trade occurs between two orders. This message indicates the order ids of both the taker and the maker, the size of the trade and the matching price. To validate the consistency of the OB, each message includes a unique sequence number that increases by one for each new message.


Some key problems in data analytics, as described by Najafabadi et al. (2015), are data quality and validation, feature engineering, and high-dimensionality and data reduction. These problems will be examined in the next section.

3.2 Data Preprocessing

The quality of the data can be validated by inspecting the consistency of the messages through their sequence numbers. However, the GDAX API documentation states: ‘While a websocket connection is over TCP, the websocket servers receive market data in a manner which can result in dropped messages.’ This makes it difficult to assess the quality of the data through this method. Table 1 shows the number of delivered and dropped messages. The ratio of delivered to dropped messages is approximately 2:1. Over the course of the sampling period this results in approximately 555 dropped messages per minute. However, as there is no further information on these messages, only the delivered messages are used.
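The sequence-number check described above can be sketched in a few lines of Python. This is a minimal illustration and not the code behind table 1: the function name `count_dropped` and the example sequence numbers are hypothetical, and it assumes that every gap between consecutive delivered sequence numbers corresponds to dropped messages.

```python
def count_dropped(sequences):
    """Return (delivered, dropped) for a collection of message sequence numbers."""
    # Sort and de-duplicate so that out-of-order delivery is not counted as a gap.
    ordered = sorted(set(sequences))
    delivered = len(ordered)
    # A jump from a to b means that b - a - 1 messages in between never arrived.
    dropped = sum(b - a - 1 for a, b in zip(ordered, ordered[1:]))
    return delivered, dropped

# Hypothetical example: 103, 104 and 107 to 109 are missing.
delivered, dropped = count_dropped([100, 101, 102, 105, 106, 110])
```

Dividing the dropped count by the length of the sampling period in minutes would give a per-minute figure comparable to the one reported above.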

The data is grouped into 30-second intervals, resulting in a time series of 118,312 observations. Several predictor features are extracted based on the literature in section 2.2. These include variations of: order book price imbalances, order book size imbalances, previous returns, trade imbalance, spread, order flow and placement, previous price movement, and order volume.

The variables that indicate the imbalances are calculated as in Rubisov (2015), with the addition of an order book depth parameter $k$ to capture deeper order book imbalances at time $t \in \{0, 1, \ldots, 118{,}312\}$:

$$I(t, k) = \frac{W_{bid}(t, k) - W_{ask}(t, k)}{W_{bid}(t, k) + W_{ask}(t, k)}, \tag{1}$$

where $I(t, k) \in [-1, 1]$, and both $W_{bid}(t, k)$ and $W_{ask}(t, k)$ are computed as weighted averages using exponentially decreasing weights. The 30-second returns $R_t$ are calculated as the log returns:

$$R_t = \log\left(\frac{S_t}{S_{t-1}}\right), \tag{2}$$

where $S_t$ is the last traded price at time $t$. The spread is calculated as the difference between the lowest ask price $P_t^a$ and highest bid price $P_t^b$:

$$s_t = P_t^a - P_t^b. \tag{3}$$
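Equations (1) to (3) can be sketched in Python as follows. The sketch rests on several assumptions that are not specified in the text: the weighted quantities are taken to be the order sizes at the k best price levels, the weights decay as exp(−decay · level) with a hypothetical `decay` rate of 0.5, and all function names are illustrative.

```python
import numpy as np

def weighted_depth(level_values, k, decay=0.5):
    """Weighted average over the k best levels with exponentially decreasing weights."""
    levels = np.asarray(level_values[:k], dtype=float)
    # Assumption: the best level gets weight 1, deeper levels get exp(-decay * level).
    weights = np.exp(-decay * np.arange(len(levels)))
    return float(np.sum(weights * levels) / np.sum(weights))

def imbalance(bid_values, ask_values, k, decay=0.5):
    """Order book imbalance I(t, k) as in equation (1)."""
    w_bid = weighted_depth(bid_values, k, decay)
    w_ask = weighted_depth(ask_values, k, decay)
    return (w_bid - w_ask) / (w_bid + w_ask)

def log_return(s_t, s_prev):
    """30-second log return R_t as in equation (2)."""
    return float(np.log(s_t / s_prev))

def spread(best_ask, best_bid):
    """Spread s_t as in equation (3)."""
    return best_ask - best_bid

# Hypothetical order sizes at the three best bid and ask levels.
bid_sizes = [2.0, 1.0, 1.0]
ask_sizes = [1.0, 1.0, 1.0]
i_tk = imbalance(bid_sizes, ask_sizes, k=3)
```

With more size resting on the bid side than on the ask side, `i_tk` is positive; swapping the two sides flips its sign, and identical sides give an imbalance of zero.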

Order flow and order volume are variables that indicate how many orders and how much volume are placed on each side. Previous trade signs $p_t$ are derived from the returns $R_t$, where $p_t = 1$ if $R_t > 0$,

3.3 Data Analysis

As described in sections 2.1.3 and 2.1.4, intraday return patterns, the return-volume relationship and other aspects are extensively analyzed for traditional markets, but not for the bitcoin market. Therefore, this section gives a brief analysis of the returns, volume and volatility.

3.3.1 Returns

In this section the 30-second returns are computed using the microprice $S_t^m$, as is done by Cartea et al. (2015). The microprice is the weighted average of the highest bid price and the lowest ask price, weighted by the corresponding lowest ask volume ($V_t^a$) and highest bid volume ($V_t^b$):

$$S_t^m = \frac{V_t^b}{V_t^b + V_t^a} P_t^a + \frac{V_t^a}{V_t^b + V_t^a} P_t^b. \tag{4}$$

The returns are then calculated as:

$$R_t^m = \log\left(\frac{S_t^m}{S_{t-1}^m}\right). \tag{5}$$

The histogram in figure 1a visualizes the sample distribution of the returns. At first sight the distribution resembles a normal or t-distribution. The Q-Q plot in figure 1b shows that the returns have fatter tails than a normal distribution. This is also generally found in returns for traditional financial markets (Cartea et al., 2015). The Q-Q plot in figure 1c suggests that the returns are more likely to follow a t-distribution with 4 degrees of freedom, as the sample and theoretical quantiles almost coincide. However, the two-sided Kolmogorov-Smirnov test still rejects the null hypothesis of identical distributions (p-value < 0.001).
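The microprice and the returns derived from it, equations (4) and (5), can be sketched as follows. The function names and the example quotes are hypothetical; the formulas follow the definitions above.

```python
import numpy as np

def microprice(bid_price, ask_price, bid_volume, ask_volume):
    """Microprice S_t^m: the ask price is weighted by the bid volume and vice versa."""
    total = bid_volume + ask_volume
    return (bid_volume / total) * ask_price + (ask_volume / total) * bid_price

def log_returns(prices):
    """Log returns R_t^m between consecutive microprices."""
    prices = np.asarray(prices, dtype=float)
    return np.log(prices[1:] / prices[:-1])

# Hypothetical best quotes: more volume on the bid pulls the microprice toward the ask.
s_m = microprice(bid_price=100.0, ask_price=102.0, bid_volume=3.0, ask_volume=1.0)
r_m = log_returns([100.0, 100.0, 101.0])
```

When the volumes on both sides are equal, the microprice reduces to the midprice.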

Another interesting aspect is whether past returns are correlated with each other. Observe from the autocorrelation plot in figure 1d that the first, second and third lags show significant autocorrelation. Interestingly, several other lags are also significantly correlated; however, there is no explanation in the literature for this observation.
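A sample autocorrelation check of this kind can be sketched with NumPy. The sketch assumes the usual approximate 95 percent band of ±1.96/√N for white noise as the significance criterion, which may differ from the one used for figure 1d; the function names and the simulated AR(1) series are purely illustrative.

```python
import numpy as np

def autocorrelation(x, lag):
    """Sample autocorrelation of x at a given lag (lag >= 1)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(np.sum(x[lag:] * x[:-lag]) / np.sum(x * x))

def significant_lags(x, max_lag):
    """Lags whose autocorrelation lies outside the approximate 95% white-noise band."""
    bound = 1.96 / np.sqrt(len(x))
    return [lag for lag in range(1, max_lag + 1)
            if abs(autocorrelation(x, lag)) > bound]

# Illustrative data: a simulated AR(1) series with coefficient 0.5, not the bitcoin returns.
rng = np.random.default_rng(1)
noise = rng.normal(size=5000)
series = np.empty(5000)
series[0] = noise[0]
for t in range(1, 5000):
    series[t] = 0.5 * series[t - 1] + noise[t]

lags = significant_lags(series, max_lag=5)
```

For the simulated series the first lag is flagged as significant, since its autocorrelation is close to the AR(1) coefficient.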

The average returns over time, in figure 2, show a peak at 0.003 during the morning trading hours of traditional markets. They then decline to -0.001 in the afternoon. Generally, the returns seem to follow a decreasing trend that starts in the afternoon and ends at around the closing time of traditional markets. There are several peaks in the evening as well. However, no explanation can be given without further research.

3.3.2 Daily Volume and Volatility

The volume during a full day is visualized in figure 3a. Observe that the volume is highest during the trading hours of European stock markets and steadily declines during the afternoon. This is consistent with the findings of Eross et al. (2017). However, the N-shape that Eross et al. (2017) found in the realized volatility is not found in this data set. On the contrary, the realized volatility (figure 3b) seems to follow a U-shape in the period from 6 am to 6 pm. The rest of this dissertation will focus on the methodology and results.
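The realized standard deviation plotted in figure 3b is defined in the figure caption as the square root of (1/20) times the sum of squared microprice returns in a 10-minute window, which holds 20 returns of 30 seconds each. A minimal sketch, assuming non-overlapping windows and a hypothetical function name `realized_std`:

```python
import numpy as np

def realized_std(returns, window=20):
    """Realized standard deviation per window of `window` consecutive 30-second returns."""
    r = np.asarray(returns, dtype=float)
    # Assumption: windows do not overlap; an incomplete final window is discarded.
    n_windows = len(r) // window
    blocks = r[: n_windows * window].reshape(n_windows, window)
    # sqrt((1 / window) * sum of squared returns) in each window.
    return np.sqrt(np.mean(blocks ** 2, axis=1))

# Hypothetical input: 45 constant returns give two complete windows.
vol = realized_std([0.01] * 45)
```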

3.4 Methodology

3.4.1 Mathematics of LSTM

An LSTM network builds on the framework of an RNN (figure 4), whereby an output sequence $y = \{y_i\}_{i \in \mathbb{Z}^+}$ is predicted from an input sequence $x = \{x_j\}_{j \in \mathbb{Z}^+}$, using hidden states $h_j$. The core of an LSTM network is the cell state $C_j$, which can be seen as the information flow running through the entire cell³ from which predictions can be made. A layer has the following variables:

$W_{xk}$ := weight matrix of the input $x$ in layer $k$,

$W_{hk}$ := weight matrix of hidden state $h$ in layer $k$,

$b_k$ := bias vector in layer $k$.

LSTM has the ability to add information to and remove information from the cell state, regulated by several gate layers (see figure 5). The first layer in the LSTM network is the forget gate layer. Based on $x_j$ and $h_{j-1}$, this layer controls the extent to which each value of the previous cell state remains in the cell by giving it a value between 0 and 1:

$$f_j = \sigma(W_{xf} x_j + W_{hf} h_{j-1} + b_f). \tag{6}$$

The second layer consists of two steps. First, the input gate layer uses a sigmoid function to decide which values are going to be updated in the cell state:

$$q_j = \sigma(W_{xq} x_j + W_{hq} h_{j-1} + b_q). \tag{7}$$

³ The explanation of LSTM in this section is a simplified version of the explanation given on Colah’s Blog (2015), Understanding LSTM (http://colah.github.io/posts/2015-08-Understanding-LSTMs/). Please refer to this blog for a more in-depth explanation.

Second, a tanh input transformation is used to create a vector of new candidate values, $\hat{C}_j$, that could be added to the state:

$$\hat{C}_j = \tanh(W_{xC} x_j + W_{hC} h_{j-1} + b_C). \tag{8}$$

Finally, the previous cell state $C_{j-1}$ is updated to the new cell state $C_j$:

$$C_j = f_j C_{j-1} + q_j \hat{C}_j. \tag{9}$$

The first part, $f_j C_{j-1}$, forgets the values that are selected by the forget gate and the second part, $q_j \hat{C}_j$, adds the new candidate values; both products are taken element-wise. Finally, the cell state $C_j$ is put through a tanh layer to rescale the output between -1 and 1, whereafter it is multiplied by the output gate

$$o_j = \sigma(W_{xo} x_j + W_{ho} h_{j-1} + b_o). \tag{10}$$

This results in:

$$h_j = o_j \tanh(C_j). \tag{11}$$
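Equations (6) to (11) describe one step of an LSTM cell, which can be written out directly in NumPy. The sketch below is for illustration only and is not the Keras implementation used in this dissertation; the parameter names, the dimensions and the random initialization are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_j, h_prev, c_prev, p):
    """One LSTM step following equations (6) to (11); p holds the weights and biases."""
    f = sigmoid(p["W_xf"] @ x_j + p["W_hf"] @ h_prev + p["b_f"])      # forget gate (6)
    q = sigmoid(p["W_xq"] @ x_j + p["W_hq"] @ h_prev + p["b_q"])      # input gate (7)
    c_hat = np.tanh(p["W_xC"] @ x_j + p["W_hC"] @ h_prev + p["b_C"])  # candidates (8)
    c = f * c_prev + q * c_hat                                        # cell state (9)
    o = sigmoid(p["W_xo"] @ x_j + p["W_ho"] @ h_prev + p["b_o"])      # output gate (10)
    h = o * np.tanh(c)                                                # hidden state (11)
    return h, c

def init_params(n_in, n_hidden, rng):
    """Hypothetical initialization: small random weights and zero biases."""
    p = {}
    for gate in ("f", "q", "C", "o"):
        p[f"W_x{gate}"] = rng.normal(scale=0.1, size=(n_hidden, n_in))
        p[f"W_h{gate}"] = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
        p[f"b_{gate}"] = np.zeros(n_hidden)
    return p

rng = np.random.default_rng(0)
params = init_params(n_in=3, n_hidden=4, rng=rng)
h, c = np.zeros(4), np.zeros(4)
# Run the cell over a sequence of five input vectors with three features each.
for x_j in rng.normal(size=(5, 3)):
    h, c = lstm_step(x_j, h, c, params)
```

Because the output gate lies in (0, 1) and tanh lies in (-1, 1), every component of the hidden state stays strictly between -1 and 1.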

The next step is to implement the LSTM model.

3.4.2 Model

The final model consists of a stacked (multiple layers) LSTM model, also known as a deep LSTM model. The model is written in Python using the Keras and TensorFlow libraries. Both libraries are leading in designing artificial neural networks, whereby Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano, as described by Keras’ official documentation⁴.

Selecting the best deep LSTM model comes down to fine-tuning several parameters. The most important are: number of hidden layers, number of neurons in each layer, optimizer, class weights, learning rate and dropout rate. The parameter tuning is done using the Hyperopt library⁵. Model training, evaluation and testing are done using the train-validation-test method, with ratio 80:10:10. This corresponds with fitting the model to a train set, tuning the parameters on a validation set and finally testing the tuned model on a test set. However, before passing the data to the model, the data is rescaled using the standardization method, as this improves LSTM’s performance.
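The 80:10:10 split and the standardization step can be sketched as follows. Two assumptions are made that the text does not state: the split is taken chronologically (without shuffling), and the mean and standard deviation are estimated on the train set only and then applied to all three sets. The function names and the random feature matrix are hypothetical.

```python
import numpy as np

def chronological_split(data, train=0.8, val=0.1):
    """Split observations in time order into train, validation and test parts."""
    n = len(data)
    i = int(n * train)
    j = i + int(n * val)
    return data[:i], data[i:j], data[j:]

def standardize(train, *others):
    """Rescale every part with the mean and standard deviation of the train part."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # guard against constant features
    return tuple((part - mean) / std for part in (train, *others))

# Hypothetical feature matrix: 1000 observations of 3 features.
rng = np.random.default_rng(0)
features = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))
train, val, test = chronological_split(features)
train_s, val_s, test_s = standardize(train, val, test)
```

Estimating the scaling statistics on the train set alone keeps information from the validation and test periods out of the fitted model.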

⁴ See https://keras.io/ for its documentation.

⁵ For its documentation see https://conference.scipy.org/proceedings/scipy2013/pdfs/bergstra_hyperopt.pdf and http://hyperopt.github.io/hyperopt/.


The problem of predicting price movements is cast as a multiclass classification problem. To make predictions about the price movement, the 30-second returns $R_t$ are transformed into a 3-dimensional vector $y_t \in \{(y_{t1}, y_{t2}, y_{t3}) \mid \sum_{u=1}^{3} y_{tu} = 1,\ y_{tu} \in \{0, 1\}\}$. Recall that the returns $R_t$ are computed as described in section 3.2. The possible scenarios are

$y_t = \begin{cases} (1, 0, 0) & \text{if } R_t < 0 \\ (0, 1, 0) & \text{if } R_t = 0 \\ (0, 0, 1) & \text{if } R_t > 0 \end{cases}$ (12)
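The mapping in equation (12) amounts to a one-hot encoding of the sign of the return. A minimal sketch, with a hypothetical function name and made-up example returns:

```python
def encode_return(r_t):
    """Map a 30-second return to the one-hot vector of equation (12)."""
    if r_t < 0:
        return (1, 0, 0)  # down movement
    if r_t > 0:
        return (0, 0, 1)  # up movement
    return (0, 1, 0)  # constant price


# Hypothetical returns, not taken from the data set.
example_returns = [-0.0004, 0.0, 0.0012]
labels = [encode_return(r) for r in example_returns]
```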

To let the LSTM model predict the probability of an up, constant or down movement at time t+1, a softmax activation function is used in the output layer. Unlike the sigmoid, softmax does not give independent probabilities per class, but instead outputs a (conditional) probability distribution over the classes. Let the predicted vector be $\hat{y}_t = (\hat{y}_{t1}, \hat{y}_{t2}, \hat{y}_{t3})$. The softmax layer then ensures that $\hat{y}_{tu} \in (0, 1)$ and $\sum_{u=1}^{3} \hat{y}_{tu} = 1$.

The goal of the model is to minimize some loss function J. In line with multiclass classification, a categorical cross-entropy loss function is used. Mathematically it is defined as

$J(y, \hat{y}) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{u \in \{1,2,3\}} y_{iu} \log \hat{y}_{iu}$, (13)

where N is the number of observations. Finally, the accuracy A is measured as

$A(y, \hat{y}) = \frac{1}{N} \sum_{t=1}^{N} V(y_t, \hat{y}_t)$, (14)

where $V(y_t, \hat{y}_t) = 1$ if the class of observation t is correctly predicted and $V(y_t, \hat{y}_t) = 0$ otherwise.
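The softmax output, the loss in equation (13) and the accuracy in equation (14) can be illustrated with a few lines of plain Python. The sketch assumes that the predicted class is the one with the highest softmax probability, which the text does not state explicitly; the raw scores are made up for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into a probability distribution over the classes."""
    top = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(y_true, y_pred):
    """Equation (13): mean negative log-probability assigned to the true class."""
    n = len(y_true)
    return -sum(y * math.log(p)
                for yi, pi in zip(y_true, y_pred)
                for y, p in zip(yi, pi) if y) / n


def accuracy(y_true, y_pred):
    """Equation (14), taking the most probable class as the prediction."""
    hits = sum(yi.index(1) == pi.index(max(pi)) for yi, pi in zip(y_true, y_pred))
    return hits / len(y_true)


# Hypothetical raw network outputs for three observations.
y_true = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
raw_scores = ([2.0, 0.5, 0.1], [0.2, 0.1, 1.5], [0.0, 0.0, 3.0])
y_pred = [softmax(s) for s in raw_scores]
loss = categorical_cross_entropy(y_true, y_pred)
acc = accuracy(y_true, y_pred)
```

In this toy example the second observation is misclassified, so two out of three predictions are correct.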

4 Results

4.1 Empirical Results

4.1.1 Model Performance

The parameters of the best model found by the hyperparameter search are displayed in table 2. The training process of the best model is visualized in figure 6. The train loss follows a decreasing trend, whereas the validation loss remains roughly constant. This suggests that the model overfits slightly. Furthermore, the accuracy on both the train and the validation set follows an increasing trend until epoch 15, after which the validation accuracy decreases slightly. This also indicates some degree of overfitting. Hence, the inclusion of dropout did not fully prevent overfitting.

The next step is to evaluate the prediction accuracy on the out-of-sample test set.

4.1.2 Prediction Accuracy

The corresponding predictions for all classes are presented in the confusion matrix in figure 7. Each row of the matrix represents the instances of the true class and each column represents the instances of the predicted class. A more in-depth analysis is presented in table 3 to evaluate the performance of the predictions. The precision is defined as:

$P(\hat{C} \mid C) = \frac{P(\hat{C} \cap C)}{P(\hat{C})}$, (15)

where $\hat{C}$ represents the correctly predicted classes, C the actual classes and P(x) the probability of x. First, the constant class has the highest precision score (62.0%) compared to the other classes. The precision of the up and down classes is 42.8% and 39.4% respectively. However, note that these precision scores count the misclassification of a constant price movement as an error. Typically, a constant price movement does not result in a loss of investment. Therefore, assume that a loss of investment can only occur when up is predicted and down is the true class, or vice versa. Then, the probability that no loss of investment is made is defined as:

$P(\hat{C} \cup D \mid C) = \frac{P((\hat{C} \cup D) \cap C)}{P(C)}$, (16)

where D represents the predicted constant price movements in a specific class. This results in a probability of 0.772 and 0.822 for the predicted up and down classes respectively. Hence, the probability of no loss is higher for the down class predictions. Another interesting measure is the accuracy of the up and down classes independently of the constant class, expressed as a probability. Mathematically this is defined as:

$P(\hat{C} \mid \hat{C} \cup \bar{C}) = \frac{P(\hat{C} \mid C)}{P(\hat{C} \cup \bar{C} \mid C)}$, (17)

where $\bar{C}$ is the predicted class corresponding to the opposite movement of $\hat{C}$ (for instance, if $\bar{C}$ represents the up predictions of the up class, then $\hat{C}$ represents the down predictions of the up class). This results in a probability of 0.653 and 0.689 for the up and down class respectively. Hence, the model is better at isolating down predictions in the down class than up predictions in the up class.
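The three measures above can be computed from a single column of the confusion matrix. The sketch below assumes that each measure is a share within one predicted class, a reading under which the reported figures fit together. The counts are hypothetical, chosen only so that the first two measures reproduce the reported 42.8% and 0.772 for the up class; they are not the entries of figure 7.

```python
def column_measures(true_same, true_constant, true_opposite):
    """Summarize one predicted class (one column of the confusion matrix).

    The arguments are the numbers of predictions in that column whose true
    class is the same movement, constant, or the opposite movement.
    """
    total = true_same + true_constant + true_opposite
    precision = true_same / total  # share of correct predictions
    no_loss = (true_same + true_constant) / total  # correct or constant
    directional = true_same / (true_same + true_opposite)  # constant class left out
    return precision, no_loss, directional


# Hypothetical counts per 1000 "up" predictions (not the real confusion matrix).
precision, no_loss, directional = column_measures(428, 344, 228)
```

With these counts the third measure comes out at roughly 0.65, close to the 0.653 reported for the up class; the small gap is due to rounding in the published figures.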

Although the model does relatively well in isolating the constant predictions in the constant class, it lacks the ability to correctly isolate each class. Whether these predictions can lead to a profitable trading strategy is discussed in the next sections.

4.1.3 Distribution of Returns

First, inspecting the histograms of the returns corresponding to the up and down predictions in figure 8 gives insight into how the returns are distributed. Observe that in both figures most of the returns are between 0 and 0.2 percent. Observe from table 4 that the mean return is slightly higher for the down returns ($1.8 \times 10^{-4}$) than for the up returns ($1.7 \times 10^{-4}$). The standard deviation of the down returns ($6.5 \times 10^{-4}$) is lower than that of the up returns ($8.4 \times 10^{-4}$). Furthermore, notice that the skewness of the down returns (0.83) is approximately one-third of that of the up returns (2.97). Both distributions are therefore skewed to the right, but the right tail of the down returns is less pronounced than that of the up returns. Another thing to notice is that the kurtosis of the up returns (44.09) is more than three times as high as the kurtosis of the down returns (13.47). This indicates that the up returns contain considerably more “extreme” returns (returns that are more than 1 standard deviation from the mean) than the down returns. This also explains the higher standard deviation of the up returns.
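For reference, the four statistics compared above can be computed as in the sketch below. It assumes population (1/N) moments and a kurtosis that is not reduced by 3; the text does not say which estimator was used for table 4, so the exact values may differ from those produced by other conventions. The sample returns are made up.

```python
from statistics import fmean


def moments(returns):
    """Return mean, standard deviation, skewness and kurtosis of a sample.

    Uses population (1/N) central moments; the kurtosis is the raw fourth
    standardized moment, so a normal distribution would give 3.
    """
    n = len(returns)
    mean = fmean(returns)
    m2 = sum((r - mean) ** 2 for r in returns) / n
    m3 = sum((r - mean) ** 3 for r in returns) / n
    m4 = sum((r - mean) ** 4 for r in returns) / n
    std = m2 ** 0.5
    return mean, std, m3 / std ** 3, m4 / m2 ** 2


# Hypothetical returns with one large positive outlier.
sample = [0.0001, 0.0002, 0.0001, 0.0003, 0.0002, 0.0040]
mean, std, skew, kurt = moments(sample)
```

A single large positive return is enough to produce a positive skewness and a raised kurtosis, which is the pattern described for the up returns.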

Given the return distributions, it is interesting to see whether the predictions are clustered around certain time periods. Figure 9 visualizes when non-constant predictions are made. Both the up (figure 9a) and down (figure 9b) predictions show similar patterns, with most of the predictions occurring during daytime. However, the predictions for the down class are clustered more densely.

4.1.4 Market Timing

Finally, the question that remains unanswered is whether these predictions can result in a profitable trading strategy. Before answering this, several assumptions have to be made: first, transaction costs and other types of costs are ignored. Second, orders are placed and executed instantly, i.e. there is no latency. With these assumptions taken into account, a simple Buy-and-Sell strategy can be formulated. Let $\mu(I_t) \in \{-1, 0, 1\}$ be the predicted price movement at time t + 1, where $I_t$ is the information needed at time t. Then, one buys bitcoin when $\mu(I_t) = 1$, sells bitcoin when $\mu(I_t) = -1$ and does nothing when $\mu(I_t) = 0$.

This strategy is backtested on the test set (approximately 4 days) and compared with uniformly random price movement predictions (using the same strategy) and a Buy-and-Hold strategy. The results are displayed in table 5. Notice from the first row that, when all of the above assumptions are satisfied, the model’s overall return of 156.35% outperforms the other two strategies. The randomly predicted and Buy-and-Hold strategies have overall returns of -3.0% and 1.2% respectively. This implies that the stacked-LSTM model is able to find patterns in LOB data that improve out-of-sample accuracy, such that it outperforms randomly predicted price movements as well as a Buy-and-Hold strategy. This is in line with the expectations stated at the end of section 2: that the model is able to make accurate short-term price movement predictions that result in a profitable (theoretical) trading strategy.

It is interesting to see what happens to the returns when GDAX’s transaction costs of 0.3% are included, ceteris paribus. Notice that this has little impact on the Buy-and-Hold strategy, as only one transaction is made. However, both the model’s predictions and the randomly predicted price movements result in a complete loss of investment. Hence, it is recommended to use neither random predictions nor this model’s predictions for a simple Buy-and-Sell strategy.
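The effect of transaction costs on a Buy-and-Sell strategy can be illustrated with a small backtest. This is a sketch under assumptions that go beyond the text: a signal of 1 is treated as a long position and a signal of -1 as a short position over the next interval, every non-zero signal is a full round trip (open and close), and the fee is charged proportionally on both transactions. The function name, signals and returns are hypothetical, and the exact accounting behind table 5 may differ.

```python
def backtest(signals, returns, fee=0.0):
    """Compound wealth for a simple Buy-and-Sell strategy.

    signals[t] in {-1, 0, 1} is the predicted movement for the next interval,
    returns[t] is the realized return over that interval. Each non-zero signal
    is treated as a round trip, so the proportional fee is paid twice.
    """
    wealth = 1.0
    for signal, realized in zip(signals, returns):
        if signal == 0:
            continue  # no position is taken, so no cost is incurred
        wealth *= (1.0 + signal * realized) * (1.0 - fee) ** 2
    return wealth - 1.0  # overall return as a fraction of the initial wealth


# Hypothetical signals and realized 30-second returns.
signals = [1, 0, -1, 1]
realized_returns = [0.001, 0.0005, -0.002, 0.0015]
gross = backtest(signals, realized_returns)
net = backtest(signals, realized_returns, fee=0.003)
```

Even though all three trades in this toy example are correct, the net return is negative: each trade earns at most 0.2% while paying a 0.3% fee twice.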

There is a simple explanation for why the model’s predictions result in such a large profit when no transaction costs are included, in contrast to when they are included. Recall from the previous section that most of the returns are smaller than 0.2%. If transaction costs are larger than 0.2%, which is true for GDAX, these costs will be higher than the profit made. A possible way to work around GDAX’s transaction costs is to use limit orders, since they have no transaction costs. A suggestion would be to use some form of market making⁶. However, working with limit orders demands more research, as there is a possibility that orders remain unfilled. Other possibilities are to make the prediction interval larger or to use trend forecasting, such as bull and bear market trends⁷.

⁶ See for example: Kanagal, Wu and Chen (2017), Li, Deng, Zhu, Wang and Xie (2014).

⁷ See for example: Chen (2009), Nyberg (2013).


5 Conclusion

With bitcoin shifting from a medium of exchange to a speculative asset, new possibilities arise for risk managers and speculative traders. High-frequency traders often have an information advantage over non-high-frequency traders due to their fast processing of LOB data. The usefulness of LOB data has been extensively researched for traditional markets and proves to be a powerful predictor of future short-term price movements. Traditional methods, such as ARIMA and VAR, are often used for these prediction tasks. However, with the advancements in deep learning, and in particular RNN and LSTM models, these algorithms are able to capture temporal hierarchical structures in time series and thereby outperform these traditional methods in out-of-sample forecasting.

In this dissertation a brief analysis of the bitcoin/euro market is presented. The results of this analysis show similarities with the characteristics of traditional markets, such as the presence of autocorrelation between returns, a heavy-tailed distribution of returns and similar volatility patterns. Furthermore, the results indicate that the out-of-sample precision of the stacked-LSTM model is considerably higher for constant price movements (62.0%) than for up (42.8%) and down (39.4%) price movements. Assuming that a constant price movement results in no loss, the probability of no loss is higher for down predictions (0.822) than for up predictions (0.772).

It is demonstrated that, by using a Buy-and-Sell strategy, the model’s predictions outperform random predictions (using the same strategy) and a Buy-and-Hold strategy. However, when transaction costs are taken into account, the model’s predictions result in a 100 percent loss of investment. As transaction costs clearly eliminate the profits, it might be worth researching the use of limit orders, as they have no transaction costs on GDAX. Other possibilities are to extend the prediction interval or to use trend forecasting.

The work described in this paper advocates the use of LOB data and LSTM algorithms in predicting short-term price movements.


References

Adebiyi, A. A., Adewumi, A. O. & Ayo, C. K. (2014). Comparison of arima and artificial neural networks models for stock price prediction. Journal of Applied Mathematics, 2014.

Balcilar, M., Bouri, E., Gupta, R. & Roubaud, D. (2017). Can volume predict bitcoin returns and volatility? a quantiles-based approach. Economic Modelling, 64, 74–81.

Balvers, R. J. & McDonald, B. (2017). Designing a global digital currency.

Böhme, R., Christin, N., Edelman, B. & Moore, T. (2015). Bitcoin: Economics, technology, and governance. Journal of Economic Perspectives, 29 (2), 213–38.

Box, G. E., Jenkins, G. M., Reinsel, G. C. & Ljung, G. M. (2015). Time series analysis: Forecasting and control. John Wiley & Sons.

Brogaard, J., Hendershott, T. & Riordan, R. (2014). High-frequency trading and price discovery. The Review of Financial Studies, 27 (8), 2267–2306.

Budish, E., Cramton, P. & Shim, J. (2015). The high-frequency trading arms race: Frequent batch auctions as a market design response. The Quarterly Journal of Economics, 130 (4), 1547–1621.

Cao, C., Hansch, O. & Wang Beardsley, X. (2004). The informational content of an open limit order book.

Cao, C., Hansch, O. & Wang, X. (2009). The information content of an open limit-order book. Journal of futures markets, 29 (1), 16–41.

Cartea, Á., Jaimungal, S. & Penalva, J. (2015). Algorithmic and high-frequency trading. Cambridge University Press.

Chen, S.-S. (2009). Predicting the bear stock market: Macroeconomic variables as leading indicators. Journal of Banking & Finance, 33 (2), 211–223.

Chordia, T., Roll, R. & Subrahmanyam, A. (2002). Order imbalance, liquidity, and market returns. Journal of Financial economics, 65 (1), 111–130.

Cizeau, P., Liu, Y., Meyer, M., Peng, C.-K. & Stanley, H. E. (1997). Volatility distribution in the s&p500 stock index. Physica A: Statistical Mechanics and its Applications, 245 (3-4), 441–445.

Cont, R., Kukanov, A. & Stoikov, S. (2014). The price impact of order book events. Journal of financial econometrics, 12 (1), 47–88.

Detzel, A. L., Liu, H., Strauss, J., Zhou, G. & Zhu, Y. (2018). Bitcoin: Predictability and profitability via technical analysis.


Dyhrberg, A. H. (2016). Bitcoin, gold and the dollar–a garch volatility analysis. Finance Research Letters, 16, 85–92.

Eross, A., McGroarty, F., Urquhart, A. & Wolfe, S. (2017). The intraday dynamics of bitcoin.

Franses, P. H. & Van Dijk, D. (2000). Non-linear time series models in empirical finance. Cambridge University Press.

Gao, L., Han, Y., Li, S. Z. & Zhou, G. (2015). Intraday momentum: The first half-hour return predicts the last half-hour return.

Gers, F. A., Eck, D. & Schmidhuber, J. (2002). Applying LSTM to time series predictable through time-window approaches. In Neural nets wirn vietri-01 (pp. 193–200). Springer.

Goldstein, M. A., Kwan, A. & Philip, R. (2017). High-frequency trading strategies.

Harvey. (1990). Forecasting, structural time series models and the kalman filter. Cambridge university press.

Harvey. (2014). Bitcoin myths and facts.

Hermans, M. & Schrauwen, B. (2013). Training and analysing deep recurrent neural networks. In Advances in neural information processing systems (pp. 190–198).

Hochreiter, S. & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9 (8), 1735–1780.

Jiang, G. J. & Zhu, K. X. (2017). Information shocks and short-term market underreaction. Journal of Financial Economics, 124 (1), 43–64.

Kanagal, K., Wu, Y. & Chen, K. (2017). Market making with machine learning methods.

Katsiampa, P. (2017). Volatility estimation for bitcoin: A comparison of garch models. Economics Letters, 158, 3–6.

Li, X., Deng, X., Zhu, S., Wang, F. & Xie, H. (2014). An intelligent market making strategy in algorithmic trading. Frontiers of Computer Science, 8 (4), 596–608.

Maslov, S. & Mills, M. (2001). Price fluctuations from the order book perspective—empirical facts and a simple model. Physica A: Statistical Mechanics and its Applications, 299 (1-2), 234–246.

Najafabadi, M. M., Villanustre, F., Khoshgoftaar, T. M., Seliya, N., Wald, R. & Muharemagic, E. (2015). Deep learning applications and challenges in big data analytics. Journal of Big Data, 2 (1), 1.

Nyberg, H. (2013). Predicting bear and bull stock markets with dynamic binary time series models. Journal of Banking & Finance, 37 (9), 3351–3363.


Pascanu, R., Gulcehre, C., Cho, K. & Bengio, Y. (2013). How to construct deep recurrent neural networks. arXiv preprint arXiv:1312.6026.

Rime, D., Sarno, L. & Sojli, E. (2010). Exchange rate forecasting, order flow and macroeconomic information. Journal of International Economics, 80 (1), 72–88.

Rubisov, A. D. (2015). Statistical arbitrage using limit order book imbalance (Doctoral dissertation, University of Toronto (Canada)).

Siami-Namini, S. & Namin, A. S. (2018). Forecasting economics and financial time series: Arima vs. lstm. arXiv preprint arXiv:1803.06386.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15 (1), 1929–1958.

Xu, H. (2017). Reversal, momentum and intraday returns.

Yermack, D. (2015). Is bitcoin a real currency? an economic appraisal. In Handbook of digital currency (pp. 31–43). Elsevier.

Zhang, G., Patuwo, B. E. & Hu, M. Y. (1998). Forecasting with artificial neural networks: The state of the art. International journal of forecasting, 14 (1), 35–62.

Zheng, B., Moulines, E. & Abergel, F. (2012). Price jump prediction in limit order book. arXiv preprint arXiv:1204.1381.


Appendices

A Figures

(a) 30-second return distribution

(b) Q-Q plot of returns compared to Normal distribution.

(c) Q-Q plot of returns compared to t-distribution with 4 degrees of freedom.

(d) Autocorrelation of the returns

Figure 1: Several plots visualizing the distribution and autocorrelation of 30-second returns.


Figure 2: A visualization of the mean 30-second returns throughout the day.

(a) Trading volume throughout the day. (b) Realized standard deviation of returns throughout the day.

Figure 3: The blue dots in subfigure a illustrate the 30-second trading volumes throughout the day. The red line is the rolling mean (10-minute window) of the blue dots. The line graph in subfigure b visualizes the realized standard deviation over 10-minute windows. At each window w it is calculated as $\sqrt{\frac{1}{20} \sum \dots}$


Figure 4: The folded and unfolded representation of a vanilla RNN. The output sequence $y_t$ is predicted from an input sequence $x_t$, using a recurrent hidden state $h_t$, which is updated at each state t. Note that t does not necessarily denote a time interval, but rather the position in the sequence; for instance in a sequence of words $x_t$ = {this, is, a, sentence}.

Figure 5: The original LSTM representation [Colah’s Blog, (2015), understanding LSTM]. Just as in a typical RNN, an output sequence $y_t$ is predicted from an input sequence $x_t$. The core of an LSTM is the cell in each state t (the green planes). The top input arrow in the middle cell represents the cell state $C_{t-1}$, which is being updated along the middle cell to


(a) train and validation loss (b) train and validation accuracy

Figure 6: Subfigure a visualizes the train and validation loss at each epoch. Subfigure b visualizes the train and validation accuracy at each epoch.

Figure 7: A confusion matrix, illustrating the performance of predicted classes compared to the actual classes


(a) Histogram of predicted up class returns

(b) Histogram of predicted down class returns

Figure 8: A visualization that shows how the returns are distributed among the predicted up and down classes.

(a) Up class predictions through time (b) Down class predictions through time


B Tables

Message      Amount
Delivered    66,857,634
Dropped      32,853,244

Table 1: The amount of delivered and dropped messages via the GDAX API.

Parameter          Value
Neurons Layer 1    200
Neurons Layer 2    75
Epochs             30
Batch Size         1000
Dropout Rate       0.21
Optimizer          Adam
Class Weights      (up: 1.4, constant: 0.18, down: 1.2)
Learning Rate      0.001

Table 2: Parameters of the best tuned model

Class               Precision (%)    Recall (%)
Up                  42.8             57.2
Constant            62.0             37.9
Down                39.4             51.5
Overall Accuracy    46.8

Table 3: The precision and recall scores of the predicted classes.


Class    Mean           Standard dev.    Kurtosis    Skewness
Up       1.7 × 10^-4    8.4 × 10^-4      44.09       2.97
Down     1.8 × 10^-4    6.5 × 10^-4      13.47       0.83

Table 4: Descriptive statistics of predicted up and down returns

                            Model Buy-and-Sell    Random Buy-and-Sell    Buy-and-Hold
% Return (without costs)    156.35                -3.00                  1.22
% Return (with costs)       -100.00               -100.00                1.217
