## High-frequency algorithmic trading strategies for Bitcoin using ensemble learning

Author: Céline van der Heijden (12418765) Supervisor: dr. N.P.A. van Giersbergen

June 30, 2021

Abstract

This study has the objective of acquiring more insight into the suitability of ensemble learning for the prediction of Bitcoin price directions. A logistic regression, random forest, support vector machine and a voting ensemble of these three models are fitted on 5-, 15- and 30-minute Bitcoin data. The data covered the period from May 6, 2020 to May 5, 2021, with the test set beginning on February 4, 2021. Features used for modeling included technical indicators, lagged returns of prominent cryptocurrencies and time-related features. Trading strategies based on these models were backtested using Freqtrade to investigate the profitability of the models.

All models obtained accuracies above 50%, indicating the predictive value of the features and a market that is not fully efficient. However, the voting ensemble showed no superiority compared to the individual models in terms of either model or trading performance. In addition, all strategies were found to be highly unprofitable, which was mainly caused by the trading fees.

Statement of Originality

This document is written by Student Céline van der Heijden (12418765) who declares to take full responsibility for the contents in this document.

I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it.

The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.

### 1 Introduction

In 2009, Bitcoin was introduced as the first cryptocurrency by an anonymous person or group under the pseudonym Satoshi Nakamoto (Nakamoto, 2008). Due to its vast increase in price and high volatility, it has gained the worldwide interest of companies, investors and individuals and remains a hot topic today. To understand the popularity of Bitcoin, it should be understood what a cryptocurrency in general is. A cryptocurrency is a virtual, decentralized currency that guarantees transparency and immutability through a technology known as ‘blockchain’ (Fang et al., 2020). Being decentralized means that no central institution has control over it, discarding the need for a trusted third party, which was especially welcome after the financial crisis of 2008, when Bitcoin was introduced. Decentrality is accomplished by the use of blockchain, a digital ledger that keeps track of all transactions using cryptographic signatures as verification. After Bitcoin, many more cryptocurrencies were introduced. According to CoinMarketCap (2021), there were a total of 4,912 cryptocurrencies as of May 2021, of which the largest in terms of market capitalization are Bitcoin, Ethereum, Binance Coin and Ripple. Together they constitute about 70% of the total market capitalization, which currently equals 1.6 trillion US Dollars (CoinMarketCap, 2021). Cryptocurrencies thus provide an alternative type of currency built on cryptography and have claimed a place within society.

The use of cryptocurrencies is two-fold. Initially, they were designed as an alternative, transparent way to conduct financial transactions. Some companies indeed accept cryptocurrencies as payment for their products, such as Overstock and Namecheap (Tuwiner, 2021). Most people, however, do not use the coins for buying goods, but rather see them as a way to make profit, hoping that another person is willing to pay more for them, as stated by Vo and Yost-Bremm (2020). Their volatile behavior gives rise to opportunities for high short-term returns. For this reason, high-frequency algorithmic trading, defined as the use of computer algorithms to automatically execute trades, is very common in this market. According to Vo and Yost-Bremm (2020), about 60% of trades are algorithmic. So, even though cryptocurrencies are officially currencies, they are mostly used as speculative assets on the financial market.

A common strategy for algorithmic trading on the Bitcoin market is the use of machine learning (ML) algorithms for predictions. In most previous studies, ML is used to predict the direction of the price in the next time period, for which classification algorithms are used. The studies differ mostly in the choice of models, the features used and the time frames on which trading takes place.

Commonly used models are the logistic regression (LR), random forest (RF), support vector machine (SVM) and deep learning (DL) models. For instance, Vo and Yost-Bremm (2020) built an RF and a DL model. Few studies, however, combine several models in an ensemble in order to overcome individual model deficiencies. As regards features, most previous studies only made use of technical indicators, such as the study by Akyildirim et al. (2021). However, more types of features that are available for high-frequency time frames have proved their predictive value for Bitcoin. These include returns of other prominent cryptocurrencies (Sebastião and Godinho, 2021) and time-related features such as day of the week (Keller and Scholz, 2019). In addition, some previous studies only report the accuracies of their predictions, but fail to convert their models into trading strategies to analyze their profitability, such as the studies by Madan et al. (2015) and Akyildirim et al. (2021).

The main purpose of this study is to investigate the performance of a voting ensemble compared to individual ML models, where the voting ensemble takes the majority vote of the individual models to make predictions. An LR, SVM, RF and voting ensemble are fitted on 5-, 15- and 30-minute Bitcoin data. The study uses multiple types of features, which include lagged returns of prominent cryptocurrencies, technical indicators and time-related features.

For all models, a simple trading strategy is constructed, of which the profitability is assessed by backtesting using the open-source software Freqtrade (2021). During the study, the possibility of a more efficient and less predictable market is kept in mind, which will be further discussed in Section 2.1.

The remainder of the paper has the following structure. Section 2 provides a literature review that discusses some fundamental concepts as well as related work. Section 3 presents a description of the data and the methodology, where the models, features and used software are elaborated on. Section 4 provides the classification and trading performance results. Section 5 discusses the results and gives a conclusion.

### 2 Theoretical Framework

Below, an overview of the following concepts is given: the efficiency of the Bitcoin market, factors containing predictive value for Bitcoin price movements and the performance of commonly used ML models and ensembles.

2.1 Efficiency of the Bitcoin market

With respect to the predictability of Bitcoin, Akyildirim et al. (2021) stated that Bitcoin prices are predictable if the Efficient Market Hypothesis (EMH) does not hold. Adhering to the definition of the EMH, they posed that in an efficient market all past information is already reflected in the current price and therefore the price follows a random walk. They concluded that the Bitcoin market is inefficient but is likely gaining weak-form efficiency over time. Here, weak-form efficiency means that the price of Bitcoin cannot be predicted by past prices alone. A study that supports this finding is from Köchling et al. (2019), who found that the introduction of futures in 2017 was probably a factor in turning the Bitcoin market from inefficient to weak-form efficient. This would mean that the use of ML algorithms for high-frequency trading would still be able to partly predict prices and make profits, but this would get harder over time as the market becomes more efficient.

Another perspective on the efficiency of the cryptocurrency market is taken by Chu et al. (2019), who investigated whether the Adaptive Market Hypothesis (AMH) holds for the Bitcoin and Ethereum market. The AMH combines the EMH, that was investigated by Akyildirim et al. (2021), with behavioural finance, stating that investors show a varying degree of rationality over time, leading to markets that vary in efficiency over time. They used hourly data of Bitcoin and Ethereum prices and indeed found results consistent with the AMH, where the markets for Bitcoin and Ethereum are neither efficient nor inefficient, but vary between these states over time. In addition, they found that the two markets sometimes move together and sometimes show opposite trends. So, in contrast to Akyildirim et al. (2021), they stated that the efficiency of the Bitcoin market varies over time and does not necessarily converge to a more efficient market. This would mean that there are times when Bitcoin prices are more predictable and times when they are less predictable.

By investigating the predictability of Bitcoin in this study, some information is also gained about the current state of efficiency of its market.

2.2 Factors containing predictive value for Bitcoin prices

Several information sources have shown their predictive value for Bitcoin prices. Most commonly used are technical indicators, which are transformations of the Open-High-Low-Close-Volume (OHLCV) data giving a lagged summary of the prices. They filter out noise and uncover, among others, momentum-, volatility- and volume-related signals from the time-series. Huang et al. (2019) stood out by using 124 technical indicators in a tree-based model to predict daily Bitcoin returns. On top of the aforementioned categories, these indicators also included overlap study indicators, pattern recognition indicators and cycle indicators, of which the calculation can be found in Achelis (2001). Simple trading strategies based on their tree-based model showed promising results with a win-to-loss ratio of 1.71, outperforming a buy-and-hold strategy that had a win-to-loss ratio of 1.38. These results demonstrated that technical indicators are a valid tool in predicting Bitcoin prices.

Apart from technical indicators, more types of features have shown their predictive power on the cryptocurrency market. For instance, Sebastião and Godinho (2021) tried to predict the price direction of Bitcoin, Ethereum and Litecoin and fitted their models on several feature sets. They chose the set that maximized the average return during the training period and found that all chosen feature sets included the returns of the other cryptocurrencies as well as dummies for day of the week. Keller and Scholz (2019) backed some of the findings by Huang et al. (2019) and Sebastião and Godinho (2021): they examined the influence of different factors on the Bitcoin exchange rate and also found that weekday and technical indicators were significant predictors of the Bitcoin price. Wołk (2020) investigated to what extent daily Bitcoin prices can be predicted from Google Trends and Twitter. They used sentiment analysis indicators from Twitter and the number of Google searches as input features for multiple ML regression models to predict prices of several cryptocurrencies, among which Bitcoin. Using a simple trading strategy based on these models, they found positive profits for Bitcoin, outperforming the buy-and-hold strategy. This indicated that these information sources contain predictive value for Bitcoin prices. Given that this study focuses on a more high-frequency time frame, it was decided to only use features easily available at a high-frequency time frame: technical indicators, returns of prominent cryptocurrencies and time-related features.

2.3 Machine learning algorithms for Bitcoin price predictions

Most research regarding the use of ML algorithms for Bitcoin price predictions focused on individual algorithms. However, promising results have been found for combinations of models in an ensemble. Below, both types of studies are discussed, since the results in this study also shed light on individual model performances; the main focus, however, is on ensemble learning. On top of this, the performance of traditional time-series models for cryptocurrencies is also briefly discussed.

Madan et al. (2015) were among the first to investigate the suitability of ML techniques for Bitcoin and fitted both an RF and a generalized linear model (GLM) on 10-minute Bitcoin data, using features related to the payment network and past prices. They found an accuracy of 57.4% for the RF, outperforming the GLM that had an accuracy of 53.9%. This showed the potential of the use of ML methods on the Bitcoin market.

As mentioned in the introduction, Vo and Yost-Bremm (2020) studied the prediction of Bitcoin price movement by fitting a simple DL model and an RF. They found that the RF outperformed the DL model and that a time frame of 15 minutes with 30 trees in the RF was optimal for predictions, achieving an impressive F1-score of over 97%. Moreover, they found a Sharpe ratio of 8.22 for the 15-minute time frame and 1.77 for the daily time frame, compared to 1.16 for a buy-and-hold strategy. However, they found that the relative performance of the models decreased over time. This might have been a result of the market becoming more efficient, as discussed in Section 2.1. It should also be noted that they fitted their model on the Bitcoin prices of one randomly chosen exchange, CoincheckJPY, and tested on six other exchanges. Since the price follows similar trends on all exchanges, the training set might have been similar to the test sets, which might have caused unreliable results.

In contrast to Vo and Yost-Bremm (2020), who only studied classification algorithms and Bitcoin prices, Sebastião and Godinho (2021) fitted both classification and regression algorithms on Bitcoin as well as Ethereum and Litecoin prices to predict daily returns. They used blockchain features and lagged log returns and volatilities of the three cryptocurrencies to fit linear models, RFs and SVMs. In contrast to Vo and Yost-Bremm (2020), they used more trees in their RFs, ranging from 1,000 to 1,500. In addition to individual models, they fitted multiple voting ensembles, which achieved the best results of all models. Unfortunately, the accuracies of the ensembles were not reported; however, they found Sharpe ratios of 0.55 for Bitcoin, 0.91 for Litecoin and 0.95 for Ethereum using an ensemble, compared to respectively -1.29, -1.19 and -1.50 for a buy-and-hold strategy. The ensembles thus performed well even in a bear market, with prolonged price declines. The study therefore supports the suitability of an ensemble of ML models for cryptocurrency price prediction.

Akyildirim et al. (2021) also tried to predict cryptocurrency returns using ML algorithms and ensemble learning. They fitted an Artificial Neural Network (ANN), RF, SVM and LR on the twelve most liquid cryptocurrencies to predict the daily, 15-, 30- and 60-minute returns. As features they used technical indicators as well as past prices and traded volume. The algorithms reached about 55-65% accuracy. They also considered an ensemble of all models, for which they found accuracies consistently above 50%. However, in contrast to Sebastião and Godinho (2021), they found that one individual model (the SVM) achieved the best average performance, obtaining a higher average accuracy than the ensemble of all models. Since they did not do a profitability analysis, nothing can be said about the trading behaviour of the models. What also stands out is that they compared their models to an autoregressive integrated moving average (ARIMA) model, a traditional time-series model. For this model, however, they found accuracies scattered around 50%: it thus performed no better than a coin toss. This finding is backed by Siami-Namini et al. (2018), who compared the ARIMA model with DL algorithms such as the Long Short-Term Memory (LSTM) for forecasting time-series. They found that the LSTM strongly outperformed ARIMA, obtaining error rates that were 84-87% lower. It might therefore be concluded that ARIMA is not suitable for the prediction of cryptocurrency price directions.

Borges and Neves (2020) also studied the performance of a voting ensemble. Additionally, they performed several data resampling methods with the goal of generating higher returns with less risk. They fitted an LR, RF, SVM and gradient tree boosting algorithm on 1-minute Bitcoin data, using technical indicators as features. In line with Sebastião and Godinho (2021), they found that a voting ensemble outperformed all individual models as well as a buy-and-hold strategy, obtaining an accuracy of 59.26% and a Sharpe ratio of 0.288 compared to an accuracy of 37.40% and a Sharpe ratio of -0.352 for a buy-and-hold strategy. They attributed the high performance of the voting ensemble to a better generalization performance compared to the individual models, stating that the weaknesses of some models were balanced by the strengths of others.

Another approach was taken by Pintelas et al. (2020), who investigated the suitability of DL models for cryptocurrency price prediction. They fitted an LSTM, BiLSTM, CNN-LSTM and CNN-BiLSTM to predict prices of Bitcoin, Litecoin and Ethereum on a 4- and 9-hour horizon, using past prices as input features. They found unimpressive results, with accuracies just above 50%. The study thus indicates that advanced deep learning models might not be suited for predictions on the cryptocurrency market; a DL model is therefore left out of this study. In their discussion they suggested that the use of more types of features might be an important way to improve on their models and on ML models in general, which has been effectuated in this study.

Since ensembles have shown their superiority compared to individual ML models in prior research, the same is expected for the ensembles in this study. The hypotheses regarding the voting ensemble are formulated as follows:

• H0: The voting ensemble does not perform significantly better than the individual ML models in terms of predictive accuracy for predicting Bitcoin price directions.

• H1: The voting ensemble performs significantly better than the individual ML models in terms of predictive accuracy for predicting Bitcoin price directions.

The hypotheses will be tested for each of the three individual ML models on the three different time frames making up a total of nine tests. In Section 3.3.4, a mathematical formulation of the hypotheses is given as well as the testing procedure. In addition to this, it is investigated if the trading behaviour of the voting ensemble is superior to that of the individual ML models, measured by statistics as presented in Section 3.4.

### 3 Methodology

In this section, the data, features and ML models are described as well as the performance metrics. Also, the hypotheses are formulated. Furthermore, the software Freqtrade is explained, which is used for scraping data and backtesting strategies.

3.1 Data

In this study, USDT-denominated cryptocurrency data from the Binance (2021) exchange is used, which is obtained through a Binance API (2021). USDT, also called Tether, is a cryptocurrency that mimics the US Dollar by converting cash into digital currency (Tether, 2021). The data is downloaded with the use of Freqtrade, which is discussed in Section 3.4. The data covers a period of one year, from May 6, 2020, to May 5, 2021, which is divided into a training and a test set. The training set consists of the first 75% of the data points and the test set of the next 25%. The data contains timestamps and the OHLCV data of Bitcoin, Ethereum, Binance Coin and Litecoin on a 5-, 15- and 30-minute time frame. For each coin there is a total of 104,889 observations for the 5-minute time frame, 34,964 for the 15-minute time frame and 17,484 for the 30-minute time frame. Figure 1 shows the Bitcoin price during the training and test set, ranging around 10,000 USDT in the first few months of the data set and then rising quickly to a maximum of 64,702 USDT with a few large drops. Table 1 shows descriptive statistics of the Bitcoin price. It is observed that the mean price was much larger in the test set, while the standard deviation was larger in the training set. Table 2 shows the number of positive returns of Bitcoin out of all returns, where it stands out that there were particularly many upward movements in the test set of the 30-minute time frame. These percentages can be interpreted as a benchmark for the performance of the classifiers, since they correspond to the accuracy of a model that predicts only upward movements.

Figure 1: Bitcoin price from May 6, 2020, to May 5, 2021.

Table 1: Descriptive statistics for Bitcoin data.

| Data | Minimum | Maximum | Mean | Standard deviation | Median |
|---|---|---|---|---|---|
| Bitcoin (train) | 8,270 | 41,895 | 15,474 | 8,437 | 11,434 |
| Bitcoin (test) | 38,888 | 64,702 | 54,361 | 4,732 | 55,381 |

Table 2: Percentage of positive returns out of all returns for the training and test set.

| Data | 5-minute time frame | 15-minute time frame | 30-minute time frame |
|---|---|---|---|
| Bitcoin (train) | 50.44% | 50.73% | 51.18% |
| Bitcoin (test) | 49.83% | 49.89% | 51.77% |

3.2 Features

The ML models made use of three different types of features: past returns of prominent cryptocurrencies, technical indicators and time-related features. The technical indicators are calculated using the TA-Lib package by Benediktsson (2017) in Python (Van Rossum and Drake, 2009). An overview of all used features can be found in Table 3. The calculation and interpretation of the past returns and technical indicators can be found in Appendix A; they are obtained from Achelis (2001).

Table 3: Set of features utilized in the classification algorithms.

| Feature name | Number of lags / window size | Number of features |
|---|---|---|
| **Past returns** | | |
| Past returns of Bitcoin, Ethereum, Binance Coin and Litecoin | Lags 1-3 | 12 |
| **Technical indicators** | | |
| Relative Strength Index (RSI) | Window = 7, 14 | 2 |
| Average True Range (ATR) | Window = 7, 14 | 2 |
| HighLow | Lags 1-3 | 3 |
| On-Balance Volume (OBV) | Lag 1 | 1 |
| Moving Average Convergence/Divergence (MACD); MACD and signal line are included | Fast period = 12, slow period = 26, signal period = 9 | 2 |
| Exponential Moving Average (EMA) | Window = 7, 14 | 2 |
| Aroon oscillator | Window = 14 | 1 |
| Rate of Change (ROC) | Window = 7, 14 | 2 |
| Commodity Channel Index (CCI) | Window = 14 | 1 |
| Linear Regression Slope (LRS) | Window = 7, 14 | 2 |
| **Time-related features** | | |
| Dummies for day of the week (first 6 dummies included) | - | 6 |
| Dummies for hour of the day (first 23 dummies included) | - | 23 |
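To illustrate how indicators of this kind are derived from the close prices, the sketch below reimplements two of the listed indicators (EMA and a simplified RSI) in plain pandas. The input series is hypothetical, and note that TA-Lib, which the study actually uses, computes RSI with Wilder smoothing rather than the simple rolling means used here, so values differ slightly.

```python
import pandas as pd

def ema(close: pd.Series, window: int) -> pd.Series:
    # Exponential moving average with smoothing factor alpha = 2 / (window + 1).
    return close.ewm(span=window, adjust=False).mean()

def rsi(close: pd.Series, window: int = 14) -> pd.Series:
    # Simplified RSI: simple rolling means of gains and losses.
    # (TA-Lib uses Wilder smoothing, which differs slightly.)
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(window).mean()
    loss = (-delta.clip(upper=0)).rolling(window).mean()
    rs = gain / loss
    return 100 - 100 / (1 + rs)

# Hypothetical close-price series for illustration.
close = pd.Series([10.0, 11.0, 10.5, 11.5, 12.0, 11.8, 12.2, 12.5])
features = pd.DataFrame({"ema_7": ema(close, 7), "rsi_7": rsi(close, 7)})
```

Both indicators are lagged summaries of the price: the EMA weights recent prices more heavily, while the RSI bounds the balance of recent gains and losses between 0 and 100.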

3.3 Models and model performance metrics

Four classification algorithms have been fitted on the data. The target variable was the direction of the price in the next time period p:

$$
y_p = \begin{cases} 1 & \text{if } Close_p > Open_p \\ 0 & \text{if } Close_p \leq Open_p \end{cases}
$$

where $p$ stands for a time period of 5, 15 or 30 minutes, $Close_p$ is the close price of Bitcoin at period $p$ and $Open_p$ is the open price of Bitcoin at period $p$.
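The label construction can be sketched as follows on a hypothetical OHLC snippet; in practice the label for the next period is aligned with the features of the current period, so a one-period shift would be applied before fitting.

```python
import pandas as pd

# Hypothetical OHLC rows; column names follow the OHLCV convention in the text.
ohlc = pd.DataFrame({
    "open":  [100.0, 101.0, 100.5, 102.0],
    "close": [101.0, 100.5, 102.0, 101.5],
})

# y_p = 1 if Close_p > Open_p, else 0.
y = (ohlc["close"] > ohlc["open"]).astype(int)
# → [1, 0, 1, 0]
```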

Model performance is reported in terms of accuracy and F1-score. The accuracy is calculated as follows:

$$\text{accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \times 100\%,$$

where T P are the true positives (number of correctly predicted upward trends), T N are the true negatives (number of correctly predicted downward trends), F P are the false positives (number of incorrectly predicted upward trends) and F N are the false negatives (number of incorrectly predicted downward trends).

The calculation of the F1-score is given by:

$$F1\text{-score} = \frac{TP}{TP + \frac{1}{2}(FP + FN)} \times 100\%,$$

where TP, FP and FN are as stated above.
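Both metrics follow directly from the four confusion-matrix counts; a minimal sketch with hypothetical counts:

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    # (TP + TN) / (TP + FP + TN + FN) * 100%
    return (tp + tn) / (tp + fp + tn + fn) * 100

def f1_score(tp: int, fp: int, fn: int) -> float:
    # TP / (TP + (FP + FN) / 2) * 100%
    return tp / (tp + 0.5 * (fp + fn)) * 100

# Hypothetical confusion counts, for illustration only.
tp, tn, fp, fn = 60, 45, 25, 20
acc = accuracy(tp, tn, fp, fn)   # (60 + 45) / 150 * 100 = 70.0
f1 = f1_score(tp, fp, fn)        # 60 / 82.5 * 100 ≈ 72.7
```

Note that, unlike accuracy, the F1-score ignores the true negatives, which is why the two metrics can rank models differently.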

The LR, SVM, RF and voting ensemble are implemented using the scikit-learn package by Pedregosa et al. (2011) in Python. All features are scaled using the StandardScaler also from the scikit-learn package which calculates the z-scores of the features:

$$z = \frac{x - \mu}{\sigma}.$$

In the following, a short explanation of the models is given, as well as the hyperparameters that need to be tuned. These hyperparameters are tuned using GridSearchCV from the scikit-learn package, which determines the optimal parameters using cross-validation. An overview of the optimal hyperparameters for the 5-, 15- and 30-minute time frame can be found in, respectively, Table 9, 10 and 11 of Appendix B.
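A minimal sketch of this pipeline — z-scoring with StandardScaler followed by hyperparameter tuning with GridSearchCV — on synthetic stand-in data; the model, grid values and data are illustrative, not those used in the study:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled feature matrix and binary direction labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)

# Scale features to z-scores, then tune C by cross-validated grid search.
pipe = make_pipeline(StandardScaler(), LogisticRegression(penalty="l2"))
grid = {"logisticregression__C": [0.01, 0.1, 1, 10]}
search = GridSearchCV(pipe, grid, cv=5, scoring="accuracy")
search.fit(X, y)
best_C = search.best_params_["logisticregression__C"]
```

Putting the scaler inside the pipeline ensures the z-scores are recomputed on each training fold, avoiding leakage from the validation folds.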

3.3.1 Logistic Regression

A logistic regression (LR) is a generalized linear model for binary classification (Wright, 1995).

It is closely related to a linear regression. It predicts the log-odds of belonging to a class using a weighted sum of the input features. It does so by computing the logistic function of this weighted sum, $\sigma(t) = \frac{1}{1 + \exp(-t)}$, where $t$ is assumed to be a linear function of the explanatory variables. The model is fitted by minimizing a log loss function on the training data.

In this study, L2 regularization is chosen, which adds an extra term to the loss function to prevent the model from overfitting. The strength of the regularization is influenced by the hyperparameter C, the inverse of the regularization strength: smaller values of C specify stronger regularization, which reduces overfitting. The model is fast to train and fast in making predictions. Also, due to the small number of parameters that need to be estimated, the model has relatively high bias and is less prone to overfitting than more flexible models.

3.3.2 Random Forest Classifier

A random forest (RF) classifier is an ensemble method based on multiple decision trees (Svetnik et al., 2003). The final decision is based on the outcome of the majority of the trees. A decision tree consists of nodes that split the data based on the feature that performs the best split, which is determined by the Gini impurity in this study. Each tree is trained on an artificial sub-sample of the training data. This is done by bootstrapping, a method of sampling the training data with replacement, which means that a certain data point can appear in the sample multiple times while others do not appear at all. The size of the sub-samples on which the trees are trained can be adjusted with the hyperparameter `max_samples`. For each node, only a part of the features is considered, which can be adjusted with the hyperparameter `max_features`. Two other hyperparameters that can be tuned are `n_estimators` and `max_depth`, which are the number of trees fitted and the maximal depth of the trees, respectively. An RF offers a good bias-variance trade-off with a manageable running time.
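The named hyperparameters map one-to-one onto scikit-learn's `RandomForestClassifier` arguments; a sketch on synthetic data with illustrative values (not the tuned ones from Appendix B):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data for illustration.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

rf = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_depth=6,           # maximal depth of each tree
    max_features="sqrt",   # features considered at each split
    max_samples=0.8,       # bootstrap sub-sample size per tree
    criterion="gini",      # split quality measured by Gini impurity
    random_state=0,
)
rf.fit(X, y)
train_acc = rf.score(X, y)
```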

3.3.3 Support Vector Machine

A support vector machine (SVM) classifier algorithm has the primary objective of maximizing the margin of a hyperplane in an n-dimensional space, where n is equal to the number of features used in the model (Wang, 2005). The margin is the distance between the hyperplane and the training samples that are closest to the hyperplane, the support vectors; the hyperplane thereby separates the different classes. The SVM is dependent on its kernel and the hyperparameters C and gamma. The kernel is a transformation on the feature space which enables changing the flexibility of the SVM. In this study, either a linear kernel or a radial basis function kernel is used, determined by the grid search that is performed. C is the penalty parameter, which determines how much misclassification error is tolerated: the higher C, the more data points in the training set are classified correctly, but the more prone the model is to overfitting. Gamma defines how close points should be to influence the hyperplane that separates the classes; a lower gamma means that points further away are also considered in determining the decision boundary.
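Correspondingly, the kernel, C and gamma are arguments of scikit-learn's `SVC`; again a sketch with illustrative values on synthetic data:

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic stand-in data for illustration.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# kernel, C and gamma as described in the text; values are illustrative,
# not the ones selected by the grid search in the study.
svm = SVC(kernel="rbf", C=1.0, gamma="scale")
svm.fit(X, y)
acc = svm.score(X, y)
```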

3.3.4 Voting ensemble and hypotheses

A voting ensemble is a model that aggregates the predictions of the individual classifiers. The final prediction is based on a majority vote of the underlying model predictions. So, if at least two out of three models predict an upward movement, the voting ensemble predicts an upward movement as well. Otherwise it predicts a downward movement.
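The majority-vote rule can be sketched in a few lines; scikit-learn's `VotingClassifier` with `voting="hard"` implements the same behaviour. The prediction arrays below are hypothetical:

```python
import numpy as np

def majority_vote(pred_lr: np.ndarray, pred_rf: np.ndarray,
                  pred_svm: np.ndarray) -> np.ndarray:
    # Hard voting over three binary classifiers:
    # predict "up" (1) iff at least two of the three models predict up.
    votes = pred_lr + pred_rf + pred_svm
    return (votes >= 2).astype(int)

lr  = np.array([1, 0, 1, 1])
rf  = np.array([1, 1, 0, 0])
svm = np.array([0, 0, 1, 1])
ensemble = majority_vote(lr, rf, svm)  # → [1, 0, 1, 1]
```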

To formally compare the performances of the individual models with the ensemble, the binomial test for paired comparisons is used (Raschka, 2018). This test assesses whether the predictions of two binary classification models differ significantly. It does so by looking at two random variables, A and B. A is the event that model 1 produces a correct prediction where model 2 produces an incorrect prediction, and conversely B is the event that model 2 produces a correct prediction where model 1 produces an incorrect prediction. In this study, model 1 stands for the voting ensemble and model 2 for one of the individual ML models. The hypotheses are then formulated as follows:

• H0 : P (A) = P (B)

• H1 : P (A) > P (B)

Now, a is defined as the number of times that event A happens and b as the number of times that event B happens. Given that under the null hypothesis the models perform equally well, the test is a binomial test with proportion 0.5. The p-value is calculated as follows:

$$p = \sum_{i=a}^{n} \binom{n}{i}\, 0.5^{i} (1 - 0.5)^{n-i},$$

where n = a + b. Nine tests will be performed: one for each of the three individual ML models on the three different time frames.
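The p-value formula can be implemented directly with the standard library; the counts a and b below are hypothetical, chosen only to illustrate the computation (`scipy.stats.binomtest` with `alternative="greater"` gives the same one-sided exact p-value):

```python
from math import comb

def binomial_p_value(a: int, b: int) -> float:
    # p = sum_{i=a}^{n} C(n, i) * 0.5^i * (1 - 0.5)^(n - i), with n = a + b.
    # Since both factors are powers of 0.5, each term is C(n, i) * 0.5^n.
    n = a + b
    return sum(comb(n, i) * 0.5**n for i in range(a, n + 1))

# Hypothetical counts: ensemble right where the individual model is wrong (a),
# and the reverse (b).
p = binomial_p_value(130, 100)
```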

3.4 Freqtrade and trading performance metrics

Freqtrade is an open-source cryptocurrency algorithmic trading software, developed in Python (Freqtrade, 2021). It enables users to download historical data, develop cryptocurrency trading algorithms and deploy them on multiple exchanges on time frames as short as 1 minute. In this study, Freqtrade is connected to the Binance API, from which 5-, 15- and 30-minute historical data for Bitcoin, Ethereum, Binance Coin and Litecoin is downloaded.

The main benefit of using Freqtrade is the simplicity for the user. All important settings are stored in one configuration file. These include, for instance, the user API key and password, the currency pairs traded on, a (trailing) stop-loss and take-profit levels. The trading strategy is implemented in a separate file, where OHLCV data is loaded, features are extracted and buy/sell decisions are made. The ML models are loaded into the strategy file, where predictions are made.

In Freqtrade, strategies can be used for three purposes: to backtest, dry-run (live simulation trading) and live-run. In this study, it is used for backtesting, where the strategy is applied to the stored historical data to estimate strategy performance. In the strategy in this study, a fee of 0.075% is used, according to the fee of Binance (2021). Additionally, a stop-loss of -10% is implemented and a take profit of 4%, which is reduced to 2% after 20 minutes. The strategy starts with a fixed capital for investment (equal to 1000 USDT) and invests that capital according to the buy and sell signals of the models. The maximum number of open trades is set to 1 and only the BTC/USDT pair is traded. Backtesting is done on the same period as the test set to prevent overfitting. Apart from the stop-loss and take profit, the strategy can be summarized as follows:

$$
Action = \begin{cases}
\text{Buy} & \text{if } position = 0 \text{ and } E[Close_{p+1} \mid information_p] > Close_p \\
\text{Hold} & \text{if } position = 1 \text{ and } E[Close_{p+1} \mid information_p] > Close_p \\
\text{Sell} & \text{if } position = 1 \text{ and } E[Close_{p+1} \mid information_p] < Close_p
\end{cases}
$$

where $Close_p$ is the close price in period $p$, $E[Close_{p+1} \mid information_p]$ is the expected close price at period $p + 1$ given the information at period $p$, and $position$ indicates whether a position in the trade is open or closed.
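The decision rules can be sketched as a small function; the stop-loss and take-profit, which override these rules in Freqtrade, are omitted, and a "wait" branch is added for the case of no open position and no buy signal, which the rules above leave implicit:

```python
def action(position: int, expected_next_close: float, close: float) -> str:
    # position: 1 if a trade is open, 0 otherwise.
    # expected_next_close: the model's expectation E[Close_{p+1} | information_p].
    if position == 0 and expected_next_close > close:
        return "buy"
    if position == 1 and expected_next_close > close:
        return "hold"
    if position == 1 and expected_next_close < close:
        return "sell"
    return "wait"  # no position and no buy signal

action(0, 101.0, 100.0)  # → 'buy'
action(1, 99.0, 100.0)   # → 'sell'
```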

From the backtest results, the following information will be reported or calculated: return on investment (ROI), annualized Sharpe ratio, trades per day, win-to-loss ratio and the average duration of winning and losing trades. The ROI is calculated as follows:

$$ROI = \frac{FinalCapital - InitialCapital}{InitialCapital} \times 100\%,$$

where $FinalCapital$ is the final balance of the account and $InitialCapital$ is the starting balance of the account.

The annualized Sharpe ratio is calculated as in Neves (2021), using a risk-free rate of zero since interest rates on savings accounts are very low or even negative. The calculation is then as follows:

$$
S_a = \frac{\mu(\text{ROI}_d)}{\sigma(\text{ROI}_d)} \times \sqrt{365},
$$

where $S_a$ is the annualized Sharpe ratio and $\text{ROI}_d$ is the daily return on investment.

The win-to-loss ratio equals:

$$
r = \frac{Wins}{Losses},
$$

where $r$ is the win-to-loss ratio, $Wins$ is the number of trades that made a profit and $Losses$ is the number of trades that made a loss.
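The three metrics can be sketched in a few lines. Assumptions: daily ROI observations, a risk-free rate of zero, and the population standard deviation (the exact estimator used in Neves (2021) is not restated in the thesis):

```python
import math


def roi(final_capital: float, initial_capital: float) -> float:
    """Return on investment in percent."""
    return (final_capital - initial_capital) / initial_capital * 100


def annualized_sharpe(daily_roi: list) -> float:
    """Annualized Sharpe ratio with a risk-free rate of zero."""
    mu = sum(daily_roi) / len(daily_roi)
    var = sum((r - mu) ** 2 for r in daily_roi) / len(daily_roi)
    return mu / math.sqrt(var) * math.sqrt(365)


def win_to_loss(wins: int, losses: int) -> float:
    """Ratio of winning trades to losing trades."""
    return wins / losses


print(roi(1100, 1000))   # 10.0
print(win_to_loss(3, 2)) # 1.5
```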

Apart from being compared to each other, the strategies are also compared to a buy-and-hold strategy that invests the entire initial capital at the start of the test period and sells it at the end of the test period.

### 4 Results

In Table 4 the accuracies and F1-scores of the models are reported. What stands out is that the voting classifier does not outperform all individual ML models on any time frame. On the 5- and 15-minute time frames, the RF achieves the highest accuracies of 54.21% and 53.60% respectively; on the 30-minute time frame, the SVM achieves the highest accuracy of 53.06%. It is also noted that the accuracy of all models decreases as the time frame becomes larger. Still, all accuracies are well above 50%. As regards the F1-scores, it mostly stands out that they are particularly low on the 15-minute time frame, with all scores below 50%.

Table 5 reports the p-values of the binomial tests. It shows that the voting ensemble only significantly outperforms the LR on the 5-minute time frame, with a p-value of 0.0065. For all other models and time frames, the voting classifier does not significantly outperform the individual models, which corresponds to the fact that their accuracies and F1-scores are often better than or similar to those of the voting ensemble.

Table 4: Model performance in terms of accuracy and F1-score.

| | Logistic regression | Random forest | Support vector machine | Voting ensemble |
|---|---|---|---|---|
| **5 min** | | | | |
| Accuracy | 53.90% | 54.21% | 54.04% | 54.19% |
| F1-score | 56.22% | 53.42% | 58.49% | 55.39% |
| **15 min** | | | | |
| Accuracy | 53.31% | 53.60% | 53.56% | 53.55% |
| F1-score | 45.18% | 46.44% | 46.58% | 46.19% |
| **30 min** | | | | |
| Accuracy | 52.49% | 51.96% | 53.06% | 52.35% |
| F1-score | 56.21% | 48.82% | 54.90% | 53.37% |

Table 5: P-values of the binomial test to compare individual models to the voting ensemble. Significant values at the 5% level are indicated by "*".

| | Logistic regression vs. voting ensemble | Random forest vs. voting ensemble | Support vector machine vs. voting ensemble |
|---|---|---|---|
| 5 min | 0.0065* | 0.5495 | 0.2364 |
| 15 min | 0.1180 | 0.5554 | 0.5673 |
| 30 min | 0.6238 | 0.2751 | 0.8281 |

Table 6: Trading performance of the models and the buy-and-hold strategy.

| | Logistic regression | Random forest | Support vector machine | Voting ensemble | Buy-and-hold |
|---|---|---|---|---|---|
| **5 min** | | | | | |
| ROI | -94.55% | -91.02% | -93.41% | -93.44% | 48.02% |
| Annualized Sharpe ratio | -21.643 | -19.238 | -21.066 | -20.785 | 2.587 |
| Trades per day | 30.04 | 27.96 | 31.64 | 30.28 | - |
| Win-to-loss ratio | 0.780 | 0.879 | 0.767 | 0.805 | - |
| Avg. duration winner | 0:14:00 | 0:14:00 | 0:15:00 | 0:14:00 | - |
| Avg. duration loser | 0:36:00 | 0:35:00 | 0:36:00 | 0:36:00 | - |
| **15 min** | | | | | |
| ROI | -48.34% | -66.00% | -42.30% | -41.33% | 48.02% |
| Annualized Sharpe ratio | -6.615 | -7.945 | -5.870 | -4.622 | 2.587 |
| Trades per day | 7.48 | 9.48 | 8.06 | 8.11 | - |
| Win-to-loss ratio | 1.030 | 1.251 | 1.115 | 1.069 | - |
| Avg. duration winner | 0:24:00 | 0:37:00 | 0:24:00 | 0:24:00 | - |
| Avg. duration loser | 0:53:00 | 2:03:00 | 0:54:00 | 0:56:00 | - |
| **30 min** | | | | | |
| ROI | -17.05% | -51.21% | -19.26% | -37.40% | 48.02% |
| Annualized Sharpe ratio | -0.927 | -5.504 | -1.336 | -3.514 | 2.587 |
| Trades per day | 6.39 | 8.34 | 7.84 | 7.36 | - |
| Win-to-loss ratio | 1.421 | 1.120 | 1.148 | 1.282 | - |
| Avg. duration winner | 1:20:00 | 0:54:00 | 1:30:00 | 1:11:00 | - |
| Avg. duration loser | 3:38:00 | 2:17:00 | 2:10:00 | 2:46:00 | - |

Table 6 shows the trading performances of the models and the buy-and-hold strategy. It is observed that all trading strategies using ML models end with a negative ROI and have a negative Sharpe ratio, thereby being inferior to the buy-and-hold strategy. However, the larger the time frame, the lower the loss. The highest Sharpe ratio for the ML models is obtained by the LR on the 30-minute time frame, equalling -0.927, compared to 2.587 for the buy-and-hold strategy. It is also observed that the voting ensemble obtains the best Sharpe ratio of -4.622 on the 15-minute time frame, considering only the strategies based on ML models. What else stands out is that the win-to-loss ratios of all ML models are larger than one on the 15- and 30-minute time frames, meaning that there were more winning trades than losing trades. Since the final ROI is negative, this means that the losing trades on average lost more per trade than the winning trades won. It is also observed that the duration of the losing trades was on average longer than the duration of the winning trades. As regards the number of trades per day, it is observed that on the 5-minute time frame there are around 30 trades per day, while this number lies below 10 on both the 15- and 30-minute time frames.

The evolution of the ROI over time is presented in Figures 2, 3 and 4 for the 5-, 15- and 30-minute time frames respectively. It is observed that in the first half of the test period on the 30-minute time frame, the LR and SVM are in fact profitable; however, they still perform worse than the buy-and-hold strategy. It is also observed that on the 15-minute time frame, the voting classifier outperforms the individual models, although only slightly. On the 5-minute time frame, the models clearly perform similarly badly, showing rapidly declining ROIs. Additionally, it is found that the RF clearly performs worse than all other models on both the 15- and 30-minute time frames. Looking again at Table 4, it is observed that this corresponds to the RF attaining the lowest accuracy (51.96%) and F1-score (48.82%) of all ML models on the 30-minute time frame. In fact, it just outperformed a strategy that predicts only upward movements, which would have had an accuracy of 51.77%, as can be derived from Table 2. However, on the 15-minute time frame, the RF outperformed all other models in terms of accuracy and obtained the second highest F1-score, although the differences were small.

Figure 2: Return on investment on the 5-minute time frame.

Figure 3: Return on investment on the 15-minute time frame.

Figure 4: Return on investment on the 30-minute time frame.

In order to investigate the reason behind the losses, the strategies were also backtested without fees. It should be emphasized that this is not a realistic scenario, but it does provide insight into the impact of fees on trading performance. The ROIs were again plotted over time and can be found in Figures 7, 8 and 9 of Appendix C. Table 12 in Appendix C shows the trading performances of these strategies. It is observed that all strategies using ML models are profitable when fees are not taken into account during trading. The differences are especially large on the 5-minute time frame, where all strategies now strongly outperform the buy-and-hold strategy, while they initially all had losses of more than 90%. The SVM reaches the highest ROI of 328.97% and Sharpe ratio of 10.041 on the 5-minute time frame, compared to 48.02% and 2.587 respectively for the buy-and-hold strategy. On the 15- and 30-minute time frames, all models also outperform the buy-and-hold strategy in terms of the Sharpe ratio, except for the RF on the 15-minute time frame.

In order to explore the progress of the predictability of Bitcoin over time, the models were also fitted on Bitcoin data from earlier time periods: one year earlier (May 6, 2019 - May 5, 2020) and two years earlier (May 6, 2018 - May 5, 2019). A figure of the Bitcoin prices as well as descriptive statistics and the percentage of upward movements during these time periods can be found in Figure 10, Table 13 and Table 14 of Appendix D respectively. It mostly stands out that the mean price and standard deviation were much lower in these time periods than in the period 2020-2021. The model and trading performance of the voting ensemble on the 5-minute time frame in all three time periods are presented in Table 7. It is observed that the accuracy was lowest (52.53%) in the period 2018-2019, highest (54.37%) in 2019-2020 and somewhat lower than that (54.19%) in 2020-2021. The F1-score, however, was highest (55.39%) in 2020-2021. This shows a varying degree of predictability of Bitcoin and no convergence to a less predictable, more efficient state.

It is also observed that the ROI and annualized Sharpe ratio were very low during all time periods, which was most likely caused by the trading fees. The performance of the individual ML models in these time periods can be found in Table 15 of Appendix D.

Table 7: Model and trading performance of the voting ensemble compared over three years on the 5-minute time frame.

| 5 min | 2018-2019 | 2019-2020 | 2020-2021 |
|---|---|---|---|
| Accuracy | 52.53% | 54.37% | 54.19% |
| F1-score | 52.72% | 52.27% | 55.39% |
| ROI | -98.48% | -98.90% | -93.44% |
| Annualized Sharpe ratio | -51.595 | -23.368 | -20.785 |

Next, the feature importances were investigated to get an idea of which features are important and to investigate whether better predictions can be obtained using a smaller feature set. The feature importances of the LR and RF are presented in Figures 11 and 12 of Appendix E respectively. What mostly stands out is that the time-related features seem to have low feature importances in both models. Also, a heatmap of the features was made to see whether features were highly correlated, which can be found in Figure 13 of Appendix E. Based on this, an extra feature set was created that excluded time-related features and highly correlated features: if two features had a correlation higher than 0.9, the one with the lower feature importance in the LR and RF was excluded. These were the ATR with window 7, the RSI with window 14, the EMA with window 14, the MACD signal line and the CCI. All models were then fitted again on this feature set. The results on the 5-minute time frame can be found in Table 8. When comparing these results to the results for the regular feature set as presented in Tables 4 and 6, it is observed that all values are now lower, except for the accuracy (54.24%) and F1-score (53.56%) of the RF, which slightly increased. Since no highly correlated features remain in the feature set, the feature importances can now be interpreted more reliably; they are presented in Figures 5 and 6. It is observed that the feature importances of some features are consistent across both models, such as the RSI with window 7, which achieves the highest feature importance in both models. Also, both ROCs and the first lagged return obtain high feature importances in both models. For other features, the difference in ranking across the models is quite large, such as for the ATR with window 14, which obtains the second highest feature importance in the LR but ranks 18th in the RF.
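The exclusion rule described above can be sketched in plain Python. The function names (`pearson`, `prune`) and the toy data are illustrative; the thesis based the actual selection on a correlation heatmap and the LR and RF importances.

```python
import math


def pearson(x: list, y: list) -> float:
    """Pearson correlation between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def prune(features: dict, importance: dict, threshold: float = 0.9) -> list:
    """From every pair with |correlation| > threshold, drop the feature
    with the lower importance score; return the names that remain."""
    names = list(features)
    kept = list(features)
    for i, f in enumerate(names):
        for g in names[i + 1:]:
            if f in kept and g in kept and \
                    abs(pearson(features[f], features[g])) > threshold:
                kept.remove(f if importance[f] < importance[g] else g)
    return kept


# 'a' and 'b' are perfectly correlated, so the less important 'a' is dropped.
kept = prune({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 1, 3, 2]},
             {"a": 0.2, "b": 0.5, "c": 0.3})
print(kept)  # ['b', 'c']
```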

Table 8: Performance of the models and the buy-and-hold strategy on the 5-minute time frame for the smaller feature set.

| 5 min | Logistic regression | Random forest | Support vector machine | Voting ensemble | Buy-and-hold |
|---|---|---|---|---|---|
| Accuracy | 53.74% | 54.24% | 53.42% | 53.77% | - |
| F1-score | 54.65% | 53.56% | 55.59% | 55.24% | - |
| ROI | -94.77% | -92.47% | -94.76% | -94.63% | 48.02% |
| Annualized Sharpe ratio | -24.101 | -21.462 | -24.024 | -24.168 | 2.587 |

Figure 5: Feature importances of the LR on the 5-minute time frame for the smaller feature set.

Figure 6: Feature importances of the RF on the 5-minute time frame for the smaller feature set.

### 5 Discussion and conclusion

In this study, the predictability of the Bitcoin price direction was investigated using ensemble learning. An LR, RF, SVM and a voting ensemble of all models were fitted on a 5-, 15- and 30-minute time frame using time-related features, technical indicators and past returns of other prominent cryptocurrencies. The three most outstanding results were as follows. First of all, the voting classifier showed no superiority in terms of either model or trading performance, which was against expectations. There was also no clear winner among the individual ML models.

Secondly, all accuracies were above 50%, with the highest accuracy of 54.21% obtained by the RF on the 5-minute time frame. This showed that the Bitcoin market is still partly predictable and thus not fully efficient. Finally, it was found that the trading strategies built upon these models were highly unprofitable and performed worse than a buy-and-hold strategy. The main cause of this result was found to be the trading fees, since all strategies were profitable when the fees were set to zero and nearly all beat the buy-and-hold strategy in terms of the Sharpe ratio.

One of the most striking results found in this study is that all strategies using ML models were not profitable, even though their accuracies were well above 50%. After additional analysis of the profitability of the models, it was found that the fees were the main factor causing these losses. The effect of fees should thus not be underestimated, especially on short time frames, such as the 5-minute time frame in this study. The strategies on this time frame particularly suffered from the fees, since significantly more trades per day were executed here compared to the 15- and 30-minute time frames. Also, a model that obtains the highest accuracy does not necessarily have the best trading performance, as is, for instance, observed from the RF on the 15-minute time frame. Many studies on similar topics only reported accuracies of the models they built (such as Madan et al. (2015) and Akyildirim et al. (2021)), so it is good to keep in mind that a good accuracy does not always go hand in hand with a profitable trading strategy.

As regards the performance of the voting ensemble, the results were not as expected. The voting ensemble only significantly outperformed the LR on the 5-minute time frame in terms of predictive accuracy, but did not achieve either the highest accuracy or the highest F1-score on any time frame. With respect to the trading behaviour, it did outperform the individual models on the 15-minute time frame in terms of the Sharpe ratio and final ROI, but only slightly. On both the 5- and 30-minute time frames, the voting ensemble did not outperform all individual ML models in terms of either Sharpe ratio or final ROI. These findings are in contrast to the findings of Borges and Neves (2020) and Sebastião and Godinho (2021), who found that the ensembles achieved the highest accuracies and the highest Sharpe ratios. This might have to do with the fact that these studies used more models for their predictions (5 and 6 respectively), so that the voting ensemble was even less affected by individual models and had a better generalization performance. The finding is, however, similar to the one by Akyildirim et al. (2021), who also found that the voting ensemble did not outperform all individual models.

They found that the SVM performed best in terms of accuracy across all time frames, but this study has no clear winner. The RF achieved the highest accuracy on the 5- and 15-minute time frames, but it had poor trading performance on the 15- and 30-minute time frames. In contrast, the LR showed no clear superiority in terms of accuracy and F1-score, but did perform well compared to the other ML models in terms of the Sharpe ratio. The study thus indicates no superiority of the voting ensemble, but also does not suggest another winner.

Since all accuracies were above 50% and were higher than the percentage of upward trends as shown in Table 2, it can be concluded that the features have some predictive value for the prediction of the Bitcoin price direction. This supports the possibility of a market that is not fully efficient, which could back either the EMH as discussed by Akyildirim et al. (2021) or the AMH by Chu et al. (2019), as discussed in Section 2.1. When investigating the predictability over the past three years, a varying degree of predictability was found, which would mostly support the AMH. The lowest accuracy was even found for the earliest time period (2018-2019) of the three, which would mean that the Bitcoin market was less predictable, and thus less efficient, in that period than in the two years thereafter. The accuracies obtained are, however, lower than in most previous studies. For instance, Akyildirim et al. (2021) found accuracies ranging from 55% to 65% and Borges and Neves (2020) found an accuracy of 59.26% for the voting ensemble. Both studies used data up to the year 2018. This might be an indication of a market that has become less predictable and thus more efficient, which would mostly support the hypothesis by Akyildirim et al. (2021) of a market that is slowly converging to an efficient state, in accordance with the EMH. Thus, taking prior research into account, there is no clear answer to the question which of the hypotheses is most likely.

Furthermore, when investigating the feature importances, it was found that the time-related features had only low feature importances in both the LR and the RF. Also, when investigating the correlation between features, it was found that some of the features used had a correlation of over 0.9. When both time-related and highly correlated features were excluded, however, the model and trading performances decreased for nearly all models. So, for predictive purposes, even features that are highly correlated or have low importance might contribute to better predictions. The features could, however, be optimized further: for instance, the choice of technical indicators and the number of lags could be varied across time frames.

Additionally, in this study, a voting ensemble was used for predictions, but there are also other ensemble methods that could be used. Zhou (2012) provides many options for this, including weighted voting, where more weight is given to stronger classifiers, and soft voting, which uses class probabilities of classifiers to make predictions. Further research could investigate the suitability of these ensemble methods for the prediction of Bitcoin prices.
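The combination schemes mentioned here can be sketched in a few lines of plain Python. The function names are illustrative; `hard_vote` corresponds to the majority vote used in this study, while `weighted_vote` and `soft_vote` follow the descriptions in Zhou (2012):

```python
def hard_vote(predictions: list) -> int:
    """Majority vote over binary class labels (0 = down, 1 = up)."""
    return int(sum(predictions) > len(predictions) / 2)


def weighted_vote(predictions: list, weights: list) -> int:
    """Weighted majority vote: stronger classifiers get larger weights."""
    score = sum(w * (1 if p == 1 else -1) for p, w in zip(predictions, weights))
    return int(score > 0)


def soft_vote(probabilities: list) -> int:
    """Soft vote: average the classifiers' P(up) and threshold at 0.5."""
    return int(sum(probabilities) / len(probabilities) > 0.5)


print(hard_vote([1, 0, 1]))                          # 1
print(weighted_vote([1, 0, 0], [0.6, 0.25, 0.15]))   # 1
print(soft_vote([0.9, 0.4, 0.45]))                   # 1
```

Note how the weighted and soft votes can overrule a hard majority: in the examples above, two of three classifiers predict "down", yet the strongly weighted (or highly confident) classifier tips the ensemble to "up".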

Another suggestion for further research is to make different labels for the classifiers and only indicate an upward movement if the price increase is larger than the transaction costs of the exchange traded on. Since this would create an imbalanced data set, it might be necessary to implement re-sampling methods in order to maximize the performance of the classifiers. Next to this, more types of ML models could be included in the ensemble in order to obtain a better generalization performance.

### References

Achelis, S. B. (2001). Technical analysis from A to Z. McGraw Hill Education, New York.

Akyildirim, E., Goncu, A., and Sensoy, A. (2021). Prediction of cryptocurrency returns using machine learning. Annals of Operations Research, 297(1):3–36.

Benediktsson, J. (2017). TA-Lib indicators. https://mrjbq7.github.io/ta-lib/doc_index.html.

Binance (2021). Binance Exchange: Cryptocurrency Exchange. Retrieved May 6, 2021, from https://www.binance.com/en.

BinanceAPI (2021). Binance API documentation. Retrieved May 6, 2021, from https://binance-docs.github.io/apidocs/spot/en/#change-log.

Borges, T. A. and Neves, R. F. (2020). Ensemble of machine learning algorithms for cryptocurrency investment with different data resampling methods. Applied Soft Computing, 90:106187.

Chu, J., Zhang, Y., and Chan, S. (2019). The adaptive market hypothesis in the high frequency cryptocurrency market. International Review of Financial Analysis, 64:221–231.

CoinMarketCap (2021). Retrieved May 6, 2021, from https://coinmarketcap.com/all/views/all/.

Fang, F., Ventre, C., Basios, M., Kong, H., Kanthan, L., Li, L., Martinez-Rego, D., and Wu, F. (2020). Cryptocurrency trading: a comprehensive survey. arXiv preprint arXiv:2003.11352.

Freqtrade (2021). Freqtrade. Retrieved May 6, 2021, from https://www.freqtrade.io/en/stable/.

Huang, J.-Z., Huang, W., and Ni, J. (2019). Predicting bitcoin returns using high-dimensional technical indicators. The Journal of Finance and Data Science, 5(3):140–155.

Keller, A. and Scholz, M. (2019). Trading on cryptocurrency markets: Analyzing the behavior of bitcoin investors.

Köchling, G., Müller, J., and Posch, P. N. (2019). Does the introduction of futures improve the efficiency of bitcoin? Finance Research Letters, 30:367–370.

Madan, I., Saluja, S., and Zhao, A. (2015). Automated bitcoin trading via machine learning algorithms. URL: http://cs229.stanford.edu/proj2014/Isaac%20Madan, 20.

Nakamoto, S. (2008). Bitcoin: A peer-to-peer electronic cash system. URL: https://bitcoin.org/bitcoin.pdf.

Neves, F. (2021). Use Python to calculate the Sharpe ratio for a portfolio. Retrieved June 1, 2021, from https://towardsdatascience.com/calculating-sharpe-ratio-with-python-755dcb346805.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine learning in python. Journal of machine learning research, 12(Oct):2825–2830.

Pintelas, E., Livieris, I. E., Stavroyiannis, S., Kotsilieris, T., and Pintelas, P. (2020). Investigating the problem of cryptocurrency price prediction: a deep learning approach. In IFIP International Conference on Artificial Intelligence Applications and Innovations, pages 99–110. Springer.

Raschka, S. (2018). Model evaluation, model selection, and algorithm selection in machine learning. arXiv preprint arXiv:1811.12808.

Sebastião, H. and Godinho, P. (2021). Forecasting and trading cryptocurrencies with machine learning under changing market conditions. Financial Innovation, 7(1):1–30.

Siami-Namini, S., Tavakoli, N., and Namin, A. S. (2018). A comparison of arima and lstm in forecasting time series. In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 1394–1401. IEEE.

Svetnik, V., Liaw, A., Tong, C., Culberson, J. C., Sheridan, R. P., and Feuston, B. P. (2003). Random forest: a classification and regression tool for compound classification and QSAR modeling. Journal of chemical information and computer sciences, 43(6):1947–1958.

Tether (2021). Digital money for a digital age. Retrieved May 16, 2021, from https://tether.to/.

Tuwiner, J. (2021). Who accepts bitcoin? 11 major companies. Retrieved May 6, 2021, from https://www.buybitcoinworldwide.com/who-accepts-bitcoin/.

Van Rossum, G. and Drake, F. L. (2009). Python 3 Reference Manual. CreateSpace, Scotts Valley, CA.

Vo, A. and Yost-Bremm, C. (2020). A high-frequency algorithmic trading strategy for cryptocurrency. Journal of Computer Information Systems, 60(6):555–568.

Wang, L. (2005). Support vector machines: theory and applications, volume 177. Springer Science & Business Media.

Wołk, K. (2020). Advanced social media sentiment analysis for short-term cryptocurrency price prediction. Expert Systems, 37(2):e12493.

Wright, R. E. (1995). Logistic regression.

Zhou, Z.-H. (2012). Ensemble methods: foundations and algorithms. CRC press.

### Appendices

A Calculation and interpretation of the returns and technical indicators

1. Features used regarding other cryptocurrencies were past returns of Bitcoin, Ethereum, Binance Coin and Litecoin. The returns were calculated as follows:

$$
\text{Return}_p = \frac{\text{Close}_p}{\text{Open}_p} - 1,
$$

where $p$ stands for the time period, which equals 5, 15 or 30 minutes, since these are the time frames used for modeling. $\text{Return}_p$ is the return at period $p$, $\text{Close}_p$ is the close price at period $p$ and $\text{Open}_p$ is the open price at period $p$. The last three lags of the returns are included as features in the models.

2. The relative strength index (RSI), a momentum indicator, measures the magnitude of recent price changes and thereby indicates whether a stock is oversold or overbought. It is calculated as follows:

$$
\text{RSI} = 100 - \frac{100}{1 + RS_n},
$$

where $RS_n$ is the ratio between the average gain and the average loss over $n$ periods. The values 7 and 14 are taken for $n$. For a value of 7 and a time frame of 5 minutes, this means that the RSI is taken over the past 7 periods of 5 minutes, so the past 35 minutes are included in its calculation.

3. A volatility indicator used is the average true range (ATR). It is calculated as follows:

$$
TR_p = \max\left[(\text{High}_p - \text{Low}_p),\ |\text{High}_p - \text{Close}_{p-1}|,\ |\text{Low}_p - \text{Close}_{p-1}|\right],
$$

$$
ATR_n = \frac{1}{n}\sum_{i=1}^{n} TR_i,
$$

where $p$ refers to the time period, $TR_p$ is the true range of that time period and $n$ is the total number of time periods over which the average is taken. In this study, the values 7 and 14 are taken for $n$.

4. Another volatility feature used was the HighLow, which is calculated as follows:

$$
\text{HighLow}_p = \frac{\text{High}_p}{\text{Low}_p},
$$

where $\text{High}_p$ is the high price of period $p$ and $\text{Low}_p$ is the low price of period $p$. The HighLow of the last 3 periods is included in the models.

5. A volume indicator used is the on-balance volume (OBV). It is calculated as follows:

$$
OBV = OBV_{prev} + \begin{cases}
\text{volume}, & \text{if } \text{Close} > \text{Close}_{prev} \\
0, & \text{if } \text{Close} = \text{Close}_{prev} \\
-\text{volume}, & \text{if } \text{Close} < \text{Close}_{prev}
\end{cases}
$$

where $OBV$ is the current on-balance volume, $OBV_{prev}$ is the previous on-balance volume and $\text{volume}$ is the latest trading volume amount.

6. An overlap studies indicator used is the Exponential Moving Average (EMA), which is a smooth line representing the successive average of the price. The EMA is a type of Weighted Moving Average (WMA) that gives more value to recent price data. It is calculated as follows:

$$
EMA_p(n) = EMA_{p-1}(n) + \left(\frac{2}{n+1}\right)\left[\text{Close}_p - EMA_{p-1}(n)\right],
$$

where $p$ is a specific time period and $n$ is the number of time periods over which the EMA is calculated. In this study, the EMA signals over 7 and 14 time periods are included.

7. The Moving Average Convergence/Divergence (MACD) is a momentum indicator that captures the relationship of two EMA’s. It consists of three components:

$$
MACD = EMA_a - EMA_b,
$$

$$
\text{Signal line} = EMA_c(MACD),
$$

$$
\text{MACD histogram} = MACD - \text{Signal line},
$$

of which the MACD and the signal line are included as features in the models, with $a = 12$, $b = 26$ and $c = 9$.

8. The Aroon oscillator is a trend following indicator that measures the strength of a trend and the likelihood that the trend will continue. It is calculated by subtracting two different components:

$$
\text{Aroon up} = 100 \times \frac{n - \text{Periods since } n\text{-period High}}{n},
$$

$$
\text{Aroon down} = 100 \times \frac{n - \text{Periods since } n\text{-period Low}}{n},
$$

$$
\text{Aroon oscillator} = \text{Aroon up} - \text{Aroon down},
$$

where "Periods since $n$-period High" is the number of periods since the highest high price of the last $n$ periods and "Periods since $n$-period Low" is the number of periods since the lowest low price of the last $n$ periods. In this study, the value 14 is taken for $n$.

9. The Rate of Change (ROC) is a momentum indicator that measures the percentage amount that a price has changed over a given number of time periods:

$$
ROC = \frac{\text{Close}_p - \text{Close}_{p-n}}{\text{Close}_{p-n}} \times 100\%,
$$

where $\text{Close}_p$ is the close price at the current time period and $\text{Close}_{p-n}$ the close price $n$ time periods ago. The ROC is included for $n = 7$ and $n = 14$.

10. The Commodity Channel Index (CCI) is a momentum indicator that quantifies the difference between the current price and the average past price and thereby indicates whether a stock is oversold or overbought. It is calculated as follows:

$$
TP_p = \frac{\text{High}_p + \text{Low}_p + \text{Close}_p}{3},
$$

$$
CCI(n) = \frac{TP_p - SMA_n(TP)}{\sigma_n(TP_p)},
$$

where $TP_p$ is the typical price, $SMA_n$ is the simple moving average over $n$ time periods and $\sigma_n$ is the standard deviation over the last $n$ periods. In this study, the value 14 is taken for $n$.

11. A linear regression is a statistical tool that uses the least squares method to plot a straight line through prices while minimizing the distance of the trend line to the prices. The Linear Regression Slope (LRS) is the slope of this trend line. If the linear regression is given by y = a + bx, then the slope is given by:

$$
b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2},
$$

where the sum is taken over the past $n$ periods. The LRS is included for $n = 7$ and $n = 14$.
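As an illustration, a few of the indicators above can be computed in plain Python. This is a stdlib-only sketch; the thesis computed its indicators with TA-Lib (Benediktsson, 2017), whose seeding and smoothing conventions may differ slightly (e.g. Wilder smoothing for the RSI), so the numbers produced here are illustrative.

```python
def rsi(closes: list, n: int) -> float:
    """Relative strength index, using simple averages of the
    gains and losses over the last n price changes."""
    changes = [b - a for a, b in zip(closes, closes[1:])][-n:]
    avg_gain = sum(c for c in changes if c > 0) / n
    avg_loss = -sum(c for c in changes if c < 0) / n
    if avg_loss == 0:
        return 100.0  # no losses in the window
    return 100 - 100 / (1 + avg_gain / avg_loss)


def ema(closes: list, n: int) -> float:
    """Exponential moving average, seeded with the first close."""
    value = closes[0]
    k = 2 / (n + 1)
    for close in closes[1:]:
        value = value + k * (close - value)
    return value


def macd(closes: list, a: int = 12, b: int = 26) -> float:
    """MACD line: EMA over a periods minus EMA over b periods."""
    return ema(closes, a) - ema(closes, b)


def roc(closes: list, n: int) -> float:
    """Rate of change over n periods, in percent."""
    return (closes[-1] - closes[-1 - n]) / closes[-1 - n] * 100


print(roc([100, 110], 1))  # 10.0
```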

B Hyperparameters utilized for each algorithm

Table 9: Hyperparameters utilized for each algorithm and its value (or set of values if grid searched) for the 5-minute time frame. Bold values indicate optimal parameters.

| Algorithm | Hyperparameter | Grid of values |
|---|---|---|
| Logistic regression | C | [1, 0.1, 0.01, 0.001, 0.0001] |
| | Penalty | L2 |
| Random forest | N estimators | [100, 1000] |
| | Max features | [number of features, √(number of features)] |
| | Max samples | [number of observations, round(0.75 × number of observations)] |
| | Max depth | [None, 5, 10, 30] |
| | Splitting criterion | Gini |
| Support vector machine | C | [1, 10] |
| | Gamma | 1 / number of features |
| | Kernel | [radial basis function, linear] |

Table 10: Hyperparameters utilized for each algorithm and its value (or set of values if grid searched) for the 15-minute time frame. Bold values indicate optimal parameters.

| Algorithm | Hyperparameter | Grid of values |
|---|---|---|
| Logistic regression | C | [1, 0.1, 0.01, 0.001, 0.0001] |
| | Penalty | L2 |
| Random forest | N estimators | [100, 1000] |
| | Max features | [number of features, √(number of features)] |
| | Max samples | [number of observations, round(0.75 × number of observations)] |
| | Max depth | [None, 5, 10, 30] |
| | Splitting criterion | Gini |
| Support vector machine | C | [1, 10] |
| | Gamma | 1 / number of features |
| | Kernel | [radial basis function, linear] |

Table 11: Hyperparameters utilized for each algorithm and its value (or set of values if grid searched) for the 30-minute time frame. Bold values indicate optimal parameters.

| Algorithm | Hyperparameter | Grid of values |
|---|---|---|
| Logistic regression | C | [1, 0.1, 0.01, 0.001, 0.0001] |
| | Penalty | L2 |
| Random forest | N estimators | [100, 1000] |
| | Max features | [number of features, √(number of features)] |
| | Max samples | [number of observations, round(0.75 × number of observations)] |
| | Max depth | [None, 5, 10, 30] |
| | Splitting criterion | Gini |
| Support vector machine | C | [1, 10] |
| | Gamma | 1 / number of features |
| | Kernel | [radial basis function, linear] |

C Trading performance of strategies without fees taken into account

Figure 7: Return on investment on the 5-minute time frame without fees taken into account.