Predicting Profitability of Trades in Noisy and Obfuscated Grouped Time Series

submitted in partial fulfilment for the degree of master of science
Jorijn Jacob Hubert Smit

10321977

master information studies data science

faculty of science university of amsterdam

2021-08-31

Supervisor: prof. dr. E. Kanoulas, Informatics Institute, UvA
Examiner: dr. S.A. Borovkova, School of Business and Economics, VU


Predicting Profitability of Trades in Noisy and Obfuscated Grouped Time Series

J.J.H. Smit

Faculty of Science University of Amsterdam

ABSTRACT

Financial trading datasets are notorious for being difficult to acquire and for their low signal-to-noise ratio. On top of that, their data structures can be unconventional: features are obfuscated and dates are reduced to unevenly spaced indices, deviating from a standard time series. Such datasets are challenging to train and validate models on. This research attempts to tackle that challenge.

First, the noise present in the dataset is quantified by means of a principal component analysis, and an attempt is made to reduce it by training a neural network to operate as an autoencoder. Then, by combining elements from multiple existing and proven cross-validation methods, a novel cross-validation technique is presented which addresses all of the aforementioned problems of unconventional data structures. Finally, all these elements are put into a pipeline to verify their performance and test them against alternatives.

Even though the PCA makes it obvious that a lot of noise is indeed present, the autoencoder fails to transform the data and reduce its dimensions. The experiments with the cross-validation methods are more fruitful; it can be concluded that the novel cross-validation method produces better results than any of the existing ones. Optuna proves to be a great tool to verify and optimise the performance of the various pipeline components.

KEYWORDS

grouped time series, cross-validation, autoencoder, binary classification, XGBoost, neural networks, Optuna, algorithmic trading, quantitative finance

1 INTRODUCTION

In November 2020 Jane Street Capital, a quantitative trading firm specialised in exchange traded funds (ETFs) [26], announced a Kaggle competition. It presents the challenge of developing a model which is able to distinguish profitable from non-profitable trades, given limited information. The competition itself runs for three months, after which submitted models are tested on live market data for a period of six months.

Its dataset and carefully protected test set provide a great opportunity to test various theories and strategies that attempt to solve the difficulties that come with such collections of financial trades.

The rest of this introduction takes a closer look at the general problems the competition poses, breaks them down into specific research questions and concludes with an outline of the rest of the thesis.

Market makers such as Jane Street are trading desks that profit from providing liquidity for financial instruments such as stocks or bonds. They are not necessarily in the business of picking 'winners', but play both sides of the market by both buying and selling thousands of times per day. With yearly volumes in the trillions of US dollars, even tiny profits matter. The same, however, goes for losing trades. So which trades should be executed and which not? Once a position has been opened, should it be held for an hour, a day or a week? These are the questions Jane Street tries to answer, and they are at the basis of modelling financial trades in general.

In an attempt to get answers from data scientists outside their company, Jane Street released a large dataset of ∼6 GB consisting of more than 2 million historical trades. Of course, not wanting to reveal too much of their internal data (and possibly being restricted by data brokers' licenses), all the features of these trades are obfuscated. Any training of prediction models will thus have to be done 'blind'; feature_13 might turn out to be of high importance to a model, but what property it represents in reality is never revealed. The same goes for the trades themselves: we don't know what is being traded or even when, since trades only carry a unique ID and dates are reduced to a day number.

This is both a curse and a blessing. It is the cause of a lot of uncertainty about the nature of the features (e.g. do they have a time element in them? Are they fundamental or technical?), and using domain knowledge to add more specific features is made practically impossible. At the same time, this blindness forces researchers to be more objective about the results they produce, and only allows them to be interpreted statistically.

The ultimate goal of the competition is clear: develop a model which is able to (binarily) classify a potential trade as being profitable or non-profitable. The model's ability to do so is measured through a utility score, which evaluates the model's choices on multiple levels: the weight of the trade, the amount of profit generated and the risk needed to do so. A mathematical analysis of this utility score is provided in subsection 3.1.

This custom metric is also responsible for the first subproblem: the test set does not contain all information needed to calculate the utility scores. In this case, a trained model can only be evaluated on the test dataset through an online time series API. This


guarantees that the model does not peek forward in time during the prediction process. The downside of this API is that it does not provide all variables needed for a manual utility score calculation, and that it runs quite slowly: one evaluation can take anywhere between 2 and 8 hours and is limited to 5 calls per day. A local, offline method of validating trained models, which does not depend on this API's test set, is thus indispensable.

However, because the dataset is not a traditional time series, such a validation method has to meet various requirements in order to be reliable. Trades are not labelled with their underlying asset, so we can only generalise across all assets. And although the trades are ordered, no timestamps are given and thus the relative time between trades is unknown. Instead of timestamps, trades are only grouped per date. This somewhat resembles a time series, but not completely: a grouped time series. Lastly, it should be assumed that some of the features are 'rolling', meaning they contain information from preceding trades (e.g. a moving average).

All this makes the dataset deviate from a traditional time series, making it impossible to analyse it as such. The commonly used k-fold or time series cross-validation methods will therefore not suffice. Subsection 3.2 describes the necessary design features and implementation of a cross-validation method which addresses all of the above-mentioned problems and thus answers the question: what is a reliable and efficient way to validate trained models on grouped time series data containing no evenly spaced timestamps and possible rolling features?

Besides the unconventional row-wise representation of the data, the column-wise representation (the dataset's features) also poses some challenges. None of the 130 features are labelled or explained in any way, except for being arbitrarily tagged with 9 categories, which are, again, unlabelled. And although not an uncommon practice, this anonymity makes it particularly difficult to apply domain knowledge. Whereas normally one would make a selection of which features to use and engineer a new set of features, this step is now practically impossible.

However, especially given that the dataset is described as having "a very low signal-to-noise ratio, potential redundancy [and] strong feature correlation", it remains important to find a way to analyse and make a selection of the available features. Subsection 3.3 examines how to make apparent, in a strictly quantitative way, how much noise is present in the dataset and how it could be removed.

These are the two main subproblems that need to be tackled. In order to do so, a pipeline consisting of elements that solve these problems is presented, which tests and optimises various preprocessing techniques, cross-validation methods and model configurations.

2 RELATED WORK

Financial time series forecasting can be particularly difficult because it is not always guaranteed that the past is indicative of the future (compared to, for example, natural phenomena such as the ocean's tide, which lend themselves perfectly to time series forecasting). The risk of overfitting is particularly strong, increasing the need for a model which is as robust as possible. Of particular value in that regard was prof. M.L. de Prado's work, namely Advances in Financial Machine Learning [8]. His chapters on backtesting and cross-validation provide good approaches to these problems. Another good overview of the possibilities at the intersection of machine learning and finance is Dixon [10].

For the noise reduction and the use of an autoencoder, many basic principles were taken from Parviainen's work on supervised dimension reduction [19], but also from Keras' lead developer Chollet [5] and Hinton's Reducing the Dimensionality of Data with Neural Networks [15]. A good introduction to gradient boosted trees can be found in Chen's seminal work on XGBoost [4]. Finally, various (specific) ingredients for the neural network were researched: a very complete overview of gradient descent algorithms by Ruder [22], activation functions by Ramachandran [21] and the workings of batch normalisation by Ioffe and Szegedy [16]. Donate et al.'s paper [11] provides a novel way of looking at cross-validation score aggregation.

3 METHODOLOGY

Before diving into the various methods used to tackle the research questions, it helps to start with a deeper understanding of the metric used to measure the models' performance. In this particular case, the submission is scored on a custom-defined utility score. Custom metrics make for a more realistic evaluation of the model, as opposed to a general metric such as accuracy, but also increase the difficulty of training and evaluating a model before submitting it.

3.1 Utility Score

The utility score is what determines the final score of the model. It consists of three separate formulae and, since no further explanation is given for them, appears to be completely novel. However, this section will break these formulae down into components in order to make apparent that the score consists of metrics which are widely used in finance today.

First, let's take a look at the definitions. Given:

• a trade $j$,
• occurring on date $i$, with weight $w_{ij} \geq 0$, response $r_{ij}$ and action $a_{ij} \in \{0, 1\}$,
• and $n$ as the number of unique dates $i$,

we define:

$$p_i = \sum_j (w_{ij} \cdot r_{ij} \cdot a_{ij}) \tag{1}$$

$$t = \frac{\sum_i p_i}{\sqrt{\sum_i p_i^2}} \cdot \sqrt{\frac{250}{n}} \tag{2}$$

$$u = \min(\max(t, 0), 6) \cdot \sum_i p_i \tag{3}$$

where $u$ is the final measurement of utility. It can be broken down into three components:

3.1.1 Weighted Return. The simplest way to evaluate a trading strategy is by looking at its direct returns. This is what happens in formula (1). It is quite straightforward in that it calculates the weighted return (response) of all executed trades on a particular day $i$: their volume multiplied by their return. For trades that are not executed, $a_{ij} = 0$, and they are thus ignored.


3.1.2 Annualised Sharpe Ratio. To measure the performance of a portfolio on more than just its return alone, a second dimension often looked at in finance is the risk taken to generate said return. A common way to quantify such risk is by looking at the volatility (standard deviation) of the portfolio:

$$\sigma = \sqrt{\frac{\sum (r - \mu)^2}{n}}$$

The higher the volatility, the higher the risk.

Understanding that both return and volatility are important measures, in 1966 the economist William Sharpe proposed a reward-to-variability ratio for the purpose of measuring the performance of mutual funds [23]. Currently better known as the Sharpe ratio, it serves as one of the most fundamental measurements of performance for financial instruments. It is defined as the return of the instrument divided by its volatility: $S = r / \sigma$. Common values lie between 0 and 1; everything above 1 is considered good, and above 2 or 3 is excellent or carries very little risk.

Taking the weighted return $p_i$ from (1) and assuming its mean to be 0, its Sharpe ratio is then:

$$S = \frac{\sum_i p_i}{\sqrt{\frac{\sum_i (p_i - \mu_i)^2}{n}}} \tag{4}$$

For the next step, use appears to have been made of what in quantitative finance is known as the square-root-of-time rule [7]: a rule that states that any volatility can be scaled by the square root of the time ratio. It assumes that price movements are nothing more than a random process.¹ In such a process the variance is proportional to its time horizon [24, p. 109]; i.e. if at every time step $t_i$ the variance is $\sigma_i^2$, the total variance is $\sigma^2 \cdot t$ and the total volatility $\sigma \cdot \sqrt{t}$.

This square-root-of-time rule is widely used in finance to scale metrics across different time spans. In order to convert a daily volatility into an annual volatility, the daily volatility can simply be multiplied by the square root of the number of trading days in a year: $\sigma_A = \sigma_D \cdot \sqrt{250}$.

The same can be done in order to annualise the Sharpe ratio of daily returns. First the Sharpe ratio formula (4) is rewritten to formula (5) for readability, and then multiplied by the square-root-of-time ratio:

$$S_A = \frac{\sum_i p_i}{\sqrt{\sum_i p_i^2}} \cdot \sqrt{\frac{1}{n}} \tag{5}$$

$$= \frac{\sum_i p_i}{\sqrt{\sum_i p_i^2}} \cdot \sqrt{\frac{1}{n}} \cdot \sqrt{250} \tag{6}$$

$$= \frac{\sum_i p_i}{\sqrt{\sum_i p_i^2}} \cdot \sqrt{\frac{250}{n}} \tag{7}$$

This proves $S_A = t$; formula (7) equals formula (2). I.e. the component $t$ in the utility score is the annualised Sharpe ratio of all weighted returns.

¹ This is actually quite contradictory, because in assuming that a financial time series is nothing more than a random walk, we are directly opposing the goal of this competition: finding consistent patterns in such financial time series.

3.1.3 Weighted Returns Scaled by Annualised Sharpe Ratio. Bringing this all together in formula (3), the utility score $u$ thus consists of the annualised Sharpe ratio $t$, clipped between 0 and 6 and multiplied by the sum of all weighted returns ($\sum_i p_i$).

The goal is to maximise the utility score $u$. In order to do so, both the weighted returns ($p_i$) and the annualised Sharpe ratio ($t$) have to be maximised. Bear in mind that, for a given return, the higher the Sharpe ratio, the lower the volatility. The utility score thus makes perfect sense from a financial risk management perspective: generate returns as high as possible while keeping the volatility of these returns as low as possible.

This analysis helps, because it shows that there is no way around the utility score. Training and validating for a standard metric such as accuracy would make things a lot easier and could yield high returns, but would ignore their volatility completely.

3.2 Cross-validation

With the underlying idea for the utility score clear, it is now possible to write a function that evaluates the model’s prediction based on this score.
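A minimal sketch of such a function is given below, following formulae (1)-(3). It assumes the trades are available as equally long arrays of dates, weights, responses and binary actions; the exact implementation used in the experiments may differ.

```python
import numpy as np
import pandas as pd

def utility_score(date, weight, resp, action):
    """Utility score of a set of binary trading decisions, per formulae (1)-(3)."""
    df = pd.DataFrame({"date": date, "pnl": weight * resp * action})
    p = df.groupby("date")["pnl"].sum()        # daily weighted returns p_i, formula (1)
    n = len(p)                                 # number of unique dates
    t = p.sum() / np.sqrt((p ** 2).sum()) * np.sqrt(250 / n)   # annualised Sharpe ratio, formula (2)
    return min(max(t, 0.0), 6.0) * p.sum()     # clipped Sharpe times total return, formula (3)
```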

However, this is only applicable to the training data. Since the test set does not provide the weights ($w_i$) needed to calculate the utility score, manual validation of trained models on the test set is not possible. Cross-validation (CV) is then a very obvious choice, since it enables both training and testing on the same dataset. And since it does this multiple times on different parts of the dataset, it also increases the robustness of the model. This is in contrast to a classic train-validation split, which would always validate on the same last 20% or so of the given time series.

But, as stated in the introduction, the dataset in question has some peculiarities that make it difficult to use conventional validation methods. The rest of this subsection takes a look at the existing methods and their shortcomings, and presents a possible solution to all of these shortcomings. Figure 1 depicts these various existing CV methods and the possible solution (PurgedTimeSeriesGroupSplit).

3.2.1 K-fold. A k-fold cross-validation is the most basic way to perform CV; the dataset is split into $k$ folds: one fold is set aside as validation data while the other folds form the training data. The process of training and validation is run $k$ times, every time picking a different fold for the validation set. Finally the scores are aggregated somehow, usually averaged. Validation is performed across all of the available data, resulting in a more robust model.

K-fold is necessary because it prevents overfitting on a small subset of the dataset. In its simplest form however, it is not usable here. One obvious reason for this is that it does not respect the fact that the dataset is ordered.

3.2.2 Time Series. Since the given data is ordinal, we cannot validate a model on data that occurred before its training data. Figure 1 makes clear that iteration number 0 of KFold is unusable; our goal is to predict future prices, not historical ones. The first additional step on top of k-fold CV is therefore to respect the order of the folds and only pick a validation fold which occurs after the training folds: TimeSeriesSplit.

3.2.3 Group-awareness. Besides being ordered, all trades also occur in groups ('dates'). Certain patterns might occur within these groups; for example, most of the profitable trades might occur at the beginning of a group. In order for the model to be able to generalise upon such phenomena, a single group cannot be distributed across both the training and the validation data. All folds should coincide exactly with the boundaries of the groups. This is illustrated by the difference between KFold and GroupKFold in Figure 1.

3.2.4 Purging. The last step is to prevent leakage. Because some features might still hold information about historical values (e.g. a moving average), this information might ’leak’ from the validation set into the model and influence its predictions.

Say there is a feature which represents a 7 day moving average of the price. Based on this feature alone, a model could have great results predicting the direction of the price. However, if the time offset of the final live data is greater than those 7 days, that feature will be rendered useless.

To prepare the model for this, the idea is to leave a gap between the training and validation data, big enough to make the overlapping observations meaningless. De Prado calls this process "purging" [8, ch. 7.4]. A moving average of 7 days is purged by introducing a gap of 7 days.

Most of these aforementioned elements of cross-validation are supported in some way by the latest version of the Scikit-learn module model_selection, but no single class exists that combines all of them. Various open source code from unreviewed GitHub pull requests² and discussion board posts was therefore combined into the PurgedTimeSeriesGroupSplit class, which supports all of the above-mentioned features.
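The class itself was assembled from the open source code referenced above; the simplified sketch below only illustrates the underlying idea of a forward-only, group-aware split with a purge gap. The function name and the convention that group_gap is expressed in number of groups are assumptions made for illustration.

```python
import numpy as np

def purged_time_series_group_split(groups, n_splits=5, group_gap=7):
    """Yield (train_idx, val_idx) pairs that respect group boundaries,
    never validate on data preceding the training data, and purge
    `group_gap` groups between the training and validation sets."""
    unique_groups = np.unique(groups)               # the groups ('dates') are assumed ordered in time
    folds = np.array_split(unique_groups, n_splits + 1)
    for k in range(1, n_splits + 1):
        val_groups = folds[k]                       # the validation fold always lies after the training folds
        train_groups = np.concatenate(folds[:k])
        if group_gap:
            train_groups = train_groups[:-group_gap]   # purge the most recent training groups
        train_idx = np.flatnonzero(np.isin(groups, train_groups))
        val_idx = np.flatnonzero(np.isin(groups, val_groups))
        yield train_idx, val_idx
```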

3.2.5 Smooth Mean. Finally, when aggregating the CV scores attained from the various folds, it is common to take the mean. However, since in this case the dataset is time ordered, an argument could be made for weighing CV scores from folds that occur later more heavily. Donate et al. suggest that "in the forecasting domain, recent patterns should have a higher importance when compared with older ones" [11, p. 6]. They go on to define a weighted cross-validation function based on $w_i = 1/2^{n+1-i}$, and are indeed able to conclude that it performs better than both a non-weighted k-fold CV and the baseline, a 0-fold validation [p. 13].

This theory seems plausible and could apply to this particular use case as well. However, instead of using fixed weights, an exponentially weighted mean $w_i = (1 - \alpha)^i$ based on a decay factor $\alpha$ was opted for. The added advantage of this is that the decay factor can be adjusted in a granular fashion, making it possible to let the smooth mean increase or decrease the importance of more recent events. Also, setting $\alpha$ to 0 is the same as a non-weighted mean (all weights will equal 1), keeping the common linear mean open as an option as well. The final cross-validation score is calculated as:

$$cv_t = \frac{\sum_{i=0}^{t} (1 - \alpha)^i \, x_{t-i}}{\sum_{i=0}^{t} (1 - \alpha)^i}$$

Notice that the formula above results in a so-called moving window ($cv_t$ is still a vector), so only its last value is used.
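A sketch of this aggregation, assuming the per-fold scores are collected in chronological order:

```python
import numpy as np

def smooth_mean(scores, alpha=0.1):
    """Exponentially weighted mean of fold scores; more recent folds weigh more.
    With alpha = 0 all weights equal 1 and this reduces to the ordinary mean."""
    scores = np.asarray(scores, dtype=float)
    i = np.arange(len(scores))                 # i = 0 corresponds to the most recent fold
    weights = (1 - alpha) ** i
    return float(np.sum(weights * scores[::-1]) / np.sum(weights))
```

If the fold scores are held in a pandas Series, the same value should be obtained (for $\alpha > 0$) with pd.Series(scores).ewm(alpha=alpha).mean().iloc[-1], which makes the "last value of a moving window" interpretation explicit.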

² PR https://github.com/scikit-learn/scikit-learn/pull/16236 being the most notable.

Figure 1: A visual representation of the various cross-validation methods mentioned [17]. PurgedTimeSeriesGroupSplit solves all problems of the versions before it; it splits exactly on the separation of the groups and purges data between the training and validation set.


To summarise: before evaluating models on the test set, a reliable validation of the trained model needs to take place. Traditional cross-validation methods all have their shortcomings due to the unconventional data structure of this particular dataset. However, when combining some of these traditional methods into a single method, multiple hurdles can be overcome: (1) the order of folds is maintained and never tested backwards, (2) groups are never split across different folds and (3) possible rolling features are purged. On top of that, the individual results of the folds are combined by way of a smooth mean, which allows for stronger weighing of more recent folds to various degrees.

3.3 PCA and Autoencoder

Besides local validation of the utility score, another subproblem is the low signal-to-noise ratio. What is meant by that is that it is quite difficult to determine what exact part of the data carries predictive power for the model and what part could safely be ignored. A useful way of making this apparent is a principal component analysis (PCA). Its aim is to "reduce the dimensionality of highly correlated data by finding a small number of uncorrelated linear combinations that account for most of the variance of the original data" [18, pp. 214-219].

The first step in this data-rotation technique is to find the linear combination of the dataset, i.e. the vector which minimises the mean squared error (MSE) of all data points to that line. This vector, also known as the eigenvector, forms the first principal component of the dataset. For the second principal component the process is repeated, while making sure that it is orthogonal to the first principal component. The following principal component is again orthogonal to that, and so on; each principal component is thus uncorrelated with the previous one. This can be repeated until we run out of dimensions and the resulting vector turns to zero. In a common PCA this process is only repeated until the collection of principal components accounts for 95% or 99% of the variance in the original data. In essence the dataset is thus reduced in dimensions without sacrificing the truly important information ('signal') it carries.

Exactly such a PCA was performed on the Jane Street dataset. In order to compare the variance across the 130 features, they first had to be standardised ($z = (x_i - \mu)/\sigma$). The exception is feature_0, which, being a binary feature, had to be scaled by $2\sigma$ (see [14, p. 57] for more details on standardising binary features).

Figure 2 shows the results of the PCA. By reducing the 130 original dimensions to 38, it is still possible to explain 95% of the original variance. In order to reach an explained variance of 99%, only 65 components had to be calculated. These findings confirm that the amount of redundant information in the original dataset is indeed quite high.
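A sketch of this analysis, assuming the training features sit in a pandas DataFrame and that missing values are filled with 0 before fitting; the exact preprocessing used may differ.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def n_components_needed(features: pd.DataFrame, threshold=0.95):
    """Standardise the features and return how many principal components
    are needed to explain `threshold` of the total variance."""
    X = (features - features.mean()) / features.std()      # z-score the continuous features
    # the binary feature_0 is instead scaled by two standard deviations [14, p. 57]
    X["feature_0"] = (features["feature_0"] - features["feature_0"].mean()) / (2 * features["feature_0"].std())
    pca = PCA().fit(X.fillna(0))                            # 0-imputation of missing values is an assumption
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    return int(np.searchsorted(cumulative, threshold) + 1)  # smallest count reaching the threshold
```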

3.3.1 Autoencoder. The problem with a PCA is that it reduces the original data to eigenvectors only. So although it provides a way to demonstrate the presence of noise, it cannot actually remove it. For this purpose an autoencoder was trained. An autoencoder is created by training a neural network to encode its input ($x$ in Figure 3) to a layer with smaller dimensions (the so-called bottleneck, depicted as $d$) before decoding it back into the original input as best as it can ($\hat{x}$).

Figure 2: A principal component analysis for which the cumulative explained variance ratio is 95%.

Figure 3: General layout of an autoencoder, with both the encoding and decoding layers visible. [19]

After such a neural network is successfully trained by minimising the loss between $x$ and $\hat{x}$, the decoding layers are removed. The last layer of the encoder (layer $d$) then holds a more concise representation of the input data, transformed with as little loss of information as possible. The hidden layers basically learn to be an approximation of a PCA [5]. The difference is that the autoencoder actually transforms the data along the principal components instead of only extracting those principal components. The resulting output of the autoencoder can thus be used as input for another model.

This subsection tried to solve the problem of making apparent how much noise is present in the dataset and how it could be removed. Through the application of a principal component analysis the low signal-to-noise ratio could be confirmed. Even better, it could be quantified: 38 features explain 95% and 65 features explain 99% of the variance in the dataset, significantly less than the 130 original features. In order to remove this noise an autoencoder is proposed, which acts as an approximation of said PCA and transforms the original 130 features into fewer dimensions.

4 EXPERIMENTAL SETUP

4.1 Data

The data Jane Street provides consists of one single dataset: a historical collection of 2,390,491 trades, executed over the course of 500 'dates', assumed to be trading days. Each row is a trade, featuring 130 columns of features. With the exception of the first one, which is binary, all the features are floats, most of them centred around zero with a standard deviation slightly above 2.


Figure 4: These 100 binned histograms show the distribution of the various types of (transformed) weights. Leftmost shows the original weight, having a very long tail. The weight_scaled is a scaled natural logarithmic representation of the weight, already creating a somewhat fatter tail. The two histograms on the right depict weight_scaled_rank, in which the weights are not only log scaled but also ranked and thus forced into an almost uniform distribution. For both transformations a variant between 0 and 1 and between 1 and 2 was experimented with.

Every single row is a single trade, and in this regard the dataset follows the definition of tidy data: "each variable forms a column [and] each observation forms a row" [25]. Thus no transforming, pivoting, melting or reshaping of the original table was needed.

There are 14 features which have a significant amount of missing values; 15% of their trades have no value. Due to the regularity of these missing values, however, it is assumed that these features represent some windowed feature, e.g. a rolling average over the day or week. Such features are bound to have missing values at the beginning. Depending on the model, these missing values are handled differently; see the following subsection (4.2).

Besides these features, the training set also includes some additional columns that are only available for training. First of all, the resp (response) of the trade is given, expressed as the relative return of the trade. The binary target for training was simply defined as resp > 0.

On top of the response, there is an additional dimension: the trade’s weight. This variable is not given in the test set, but does play an important role in determining the utility score (see 3.1.1). Again, the next subsection describes how this feature is used by the models.

4.2 Models

As said, the final model will need to be a classifier. Since the data is in essence a good example of panel data, a logical first choice is a form of gradient boosted trees [3], of which XGBoost is a very popular and well-performing example. It is not necessarily always the best choice, however, so a comparison will be made with various neural networks as well.

Although XGBoost can be adjusted by quite a lot of parameters, only the major ones will be optimised: max_depth, learning_rate, gamma, colsample_bytree and min_child_weight.

For the neural network, a relatively simple multi-layer perceptron (MLP) is constructed. The input layer is fed the 130 features, which it passes to the first of four hidden layers. Each of these layers uses a swish activation function [21], followed by a dropout set to 50%. The final layer consists of one node with a sigmoid activation function which classifies the trade as 0 or 1. Using the training value, the loss is then calculated based on binary cross-entropy and the weights are optimised using Adaptive Moment Estimation (Adam) [22]. The parameters which are left to optimise are the number of nodes for each layer and whether the layers should use batch normalisation or not.
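A sketch of this architecture in Keras is shown below. The layer sizes are left as arguments (they are tuned with Optuna later on), and the exact position of the optional batch normalisation relative to the dropout is an assumption.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_mlp(layer_sizes, batch_norm=False, n_features=130):
    """MLP classifier: four hidden swish/dropout layers and a sigmoid output."""
    inputs = keras.Input(shape=(n_features,))
    x = inputs
    for units in layer_sizes:
        x = layers.Dense(units, activation="swish")(x)   # swish activation [21]
        if batch_norm:
            x = layers.BatchNormalization()(x)           # optional, toggled during tuning
        x = layers.Dropout(0.5)(x)                        # dropout set to 50%
    outputs = layers.Dense(1, activation="sigmoid")(x)    # binary classification of the trade
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

# model = build_mlp(layer_sizes=(618, 434, 805, 343))     # sizes of the best trial in Table 4
# model.fit(X, y, sample_weight=w)                        # sample weights as described below
```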

The missing values had to be handled differently for each model; XGBoost can handle sparse data just fine (meaning no change to the data was necessary), whereas the neural network cannot. Missing values for the neural network were therefore imputed with 0.

For both models a sample weight strategy was used: each sample (row) is also given a weight. For the XGBoost model this means that on each gradient boosting iteration the residuals are multiplied by those sample weights. In the neural network, the sample weight is used to weigh the outcome of the loss function. In both cases, the result is that samples with a higher weight become more important to the training process, because they influence the leaf/node weight adjustments to a higher degree.

Since the distribution of the original weights was very skewed, different transformations were created to try (see Figure 4). The next subsection describes how a choice was made between these weight types and, more importantly, how the remaining parameters for the models were tuned.
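The exact transformations behind Figure 4 are not spelled out, so the sketch below is only one plausible construction of the five weight variants; the log offset, min-max scaling and percentile ranking are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def make_weight_variants(weight):
    """Five sample-weight variants in the spirit of Figure 4 (illustrative only)."""
    w = pd.Series(weight, dtype=float)
    log_w = np.log1p(w)                                            # assumed: log(1 + w) so that w = 0 stays 0
    scaled = (log_w - log_w.min()) / (log_w.max() - log_w.min())   # weight_scaled, squeezed into [0, 1]
    ranked = log_w.rank(pct=True).where(w > 0, 0.0)                # weight_scaled_rank; zero-weight trades stay 0
    return pd.DataFrame({
        "weight": w,                          # original, long-tailed weight
        "weight_scaled": scaled,
        "weight_scaled_rank": ranked,
        "weight_scaled_1_2": scaled + 1,      # the 1-to-2 variants mentioned in the caption
        "weight_scaled_rank_1_2": ranked + 1,
    })
```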

4.3 Pipeline Optimisation

This section lays out the setup of the different experiments used to decide on the final pipeline. All of the results of these experiments can be found in the next section.

It has to be noted that the optimisation in general takes place towards the utility score. However, neither the local nor the test set’s utility score is fully indicative of the final performance. Especially in


the comparisons between the local and test set’s utility scores, some slight overfitting towards the test set might therefore be unwillingly introduced.

The autoencoder is the first component of the pipeline. To verify whether it is indeed able to get rid of noise and generate a more concise dataset, a comparison is made between the performance of a default XGBoost model trained with and without applying the autoencoder to the data. Two autoencoders were tried: a simple autoencoder where the only hidden layer is immediately the centre layer (so with layer sizes 130→97→130) and one with an added hidden layer (130→113→97→113→130). Both ran for a maximum of 50 epochs with early stopping, which monitored the decrease of the loss over the last 3 epochs.
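A sketch of these two configurations; the activation functions, optimiser and loss are not specified in the text and are therefore assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_autoencoder(hidden=(), bottleneck=97, n_features=130):
    """hidden=() gives 130->97->130; hidden=(113,) gives 130->113->97->113->130."""
    inputs = keras.Input(shape=(n_features,))
    x = inputs
    for units in hidden:                                   # encoder hidden layer(s)
        x = layers.Dense(units, activation="relu")(x)
    code = layers.Dense(bottleneck, activation="relu", name="bottleneck")(x)
    x = code
    for units in reversed(hidden):                         # mirrored decoder
        x = layers.Dense(units, activation="relu")(x)
    outputs = layers.Dense(n_features, activation="linear")(x)
    autoencoder = keras.Model(inputs, outputs)
    encoder = keras.Model(inputs, code)                    # used afterwards to transform the data
    autoencoder.compile(optimizer="adam", loss="mse")
    return autoencoder, encoder

# autoencoder, encoder = build_autoencoder(hidden=(113,))
# autoencoder.fit(X, X, epochs=50,
#                 callbacks=[keras.callbacks.EarlyStopping(monitor="loss", patience=3)])
```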

When it comes to the cross-validation method, all the theory from subsection 3.2 was validated. In order to do so, multiple random XGBoost models were trained and submitted to the test set API. Six of these trained models were picked in such a way that they have evenly and widely ranging utility scores: from good (1880.9) to bad (943.8). The utility score of these same models was then calculated locally using four different cross-validation methods:

• PurgedTimeSeriesGroupSplit(group_gap=31)
• PurgedTimeSeriesGroupSplit(group_gap=7)
• TimeSeriesGroupSplit()
• GroupKFold()

Results can be found in Table 1.

Besides trying out different types of cross-validation methods, once a final method was decided upon, it was further optimised with the use of Optuna [1]. Optuna is a framework which optimises hyper-parameters in a given piece of code by running the code multiple times with different values passed to the parameters and recording the final outcome. In this way it is able to calculate a correlation between each hyper-parameter and the end result. By letting the user set an objective in advance, such as minimising or maximising the result, Optuna is able to approach the optimal configuration in a much more efficient manner than, for example, a traditional grid search, which just exhausts every possible combination of arguments.

Optuna was used multiple times in this experiment to optimise the cross-validation method, the XGBoost model and the deep neural network. In the case of the CV method, the objective for Optuna was to reproduce the recorded test set scores as closely as possible locally. The objective was thus set to minimise the difference (RMSE) between the outcome of the local CV method and the recorded test set score. It ran for a total of 40 trials with a search space consisting of:

• group_gap: integer between 0 and 500
• smooth_mean_alpha: float between 0.0 and 1.0

Its results are in Table 2.

For the XGBoost model itself, only five very influential hyper-parameters were tuned in a total of 100 trials:

• max_depth: integer between 1 and 20
• learning_rate: float between 1 × 10⁻⁸ and 1.0
• gamma: float between 1 × 10⁻⁸ and 1.0
• colsample_bytree: float between 0.1 and 1
• min_child_weight: integer between 1 and 10

All other parameters were kept at their default values.
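Wired into Optuna, this search space could look roughly as follows. The log-uniform sampling for learning_rate and gamma is an assumption (only the ranges are given), min_child_weight is sampled as a float because Table 3 reports non-integer values, and cross_validated_utility is a stand-in for the purged CV pipeline from subsection 3.2, not a library call.

```python
import optuna
import xgboost as xgb

def make_objective(X, y, sample_weights, cross_validated_utility):
    """Build an Optuna objective over the XGBoost search space listed above.

    `sample_weights` maps a weight_type (0-4) to a weight vector;
    `cross_validated_utility` runs the purged CV and returns the
    smooth-mean utility score. Both are hypothetical placeholders."""
    def objective(trial):
        params = {
            "max_depth": trial.suggest_int("max_depth", 1, 20),
            "learning_rate": trial.suggest_float("learning_rate", 1e-8, 1.0, log=True),
            "gamma": trial.suggest_float("gamma", 1e-8, 1.0, log=True),
            "colsample_bytree": trial.suggest_float("colsample_bytree", 0.1, 1.0),
            "min_child_weight": trial.suggest_float("min_child_weight", 1, 10),
        }
        weight_type = trial.suggest_int("weight_type", 0, 4)   # which of the five weight variants to use
        model = xgb.XGBClassifier(**params)
        return cross_validated_utility(model, X, y, sample_weights[weight_type])
    return objective

# study = optuna.create_study(direction="maximize")            # maximise the utility score
# study.optimize(make_objective(X, y, sample_weights, cross_validated_utility), n_trials=100)
```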

Results can be found in Table 3.

Lastly, the architecture of the deep neural network was determined using Optuna. Besides an input layer and a binary output layer, a total of four hidden layers were trained with a number of nodes ranging between 65 and 1170 (a factor of 0.5 to 9 times the input layer of 130 features), combined with a binary switch for the use of batch normalisation:

• batch_norm: binary
• layer{1, 2, 3, 4}: integer between 65 and 1170

On top of that, for both the XGBoost model and the neural network model a meta hyper-parameter weight_type of integer type ranging between 0 and 4 was defined, representing one of the five sample weight types as depicted in Figure 4. This parameter determines which of these vectors is passed to either model for the sample weights.

4.4 Results

The results for both autoencoders are pretty conclusive: for neither configuration was it possible to generate a utility score which even came close to that of the non-autoencoded set. The utility scores for the autoencoded versions differed too much from the baseline and were almost always close to zero; the results are therefore not reported in a table.

It is not clear why these results were so bad; the loss reached (0.0461 after 12 epochs) seemed promising. It might have been naive to assume the autoencoder would automatically approximate a PCA. More likely, however, is that the overall problem is so precarious that every little bit of information matters, even that last 5% or 1% of variance lost in the dimension reduction.

Secondly, let's take a look at how the different types of CV methods scored. Since the length of the splits used by the various CV methods differs, and since the first component of the utility score ($p_i$) directly depends on the number of trades, we cannot compare utility scores directly. In order to still make a meaningful comparison, the scores were ranked per CV method and the correlation between ranks was compared.

It is obvious from Table 1 that Purged 31 comes closest to the expected ranking, in comparison to the other cross-validation methods that do not perform purging or maintain time series ordering. At the same time, the length of the purge gap has a clear influence on the performance of PurgedTimeSeriesGroupSplit. It is for this reason that this parameter was fully optimised with the use of Optuna, the results of which can be found in Table 2.

In this experiment the length of the splits is the same across all trials, and therefore the scale at which the training and the test set's utility scores differ is the same. This time there is thus no problem in comparing these scores directly. An RMSE between the locally validated utility score and the true test utility score was used to measure the performance of each configuration. After a total of 40 trials, a clear preference for a gap of around 130 groups (days) became evident. Even though the possible values ranged from 0 to 500, none of the smaller gaps performed well. A gap of 130 trading days is actually exactly 6 months (130/5/4.33 ≈ 6.0). This could mean that the optimisation has unwittingly exposed the gap that is present between the training and private test set! This is however just a hypothesis which cannot be tested.


Score          Purged 31   Grouped TS   Purged 7   Gr. KFold
1880.9         1           3            3          3
1748.0         2           4            5          4
1577.6         4           1            2          5
1492.2         5           2            1          2
1465.4         3           5            4          1
943.8          6           6            6          6
Spearman's ρ   0.829       0.543        0.314      0.0857

Table 1: Rankings of the utility scores from each CV method, correlated with the rank of the test set's score. Not only does Purged 31 correlate best with the target scores, it is also the only method that is properly able to identify the best scoring model.

Trial   Purge Gap   Smooth Mean α   RMSE
25      126         0.093102        988.3
24      132         0.092434        1041.0
23      171         0.019918        1065.1
21      204         0.016799        1092.6
13      347         0.386409        1117.6

Table 2: The group gap of 126 from trial 25 is optimal, since its divergence (RMSE) from the target utility scores is the smallest.

Besides the purging gap, Optuna also optimised the smooth mean's alpha parameter. The results show a preference for a small coefficient, which at the same time did not reduce all the way to 0. This confirms the findings of Donate et al. [11]: by increasing the weight of more recent splits it is possible to increase the overall performance of cross-validating time series.

Finally, the results from tuning the XGBoost and neural network models. The first obvious conclusion is that the XGBoost model strongly outperforms the neural network. It has to be noted that a possible reason for this could be the fact that the search space for the XGBoost model was much larger. With all its different arguments, it is much easier to define a search space that tries very different forms of the XGBoost model. For the neural network this was much more limited; much of the architecture (activation functions, number of layers) had to be hard-coded. On top of that, the neural network takes a lot more computational resources to train, and therefore could not be given as many trials as the XGBoost model.

As can be seen in the XGBoost model tuning results in Table 3, a higher (if not maxed out) max_depth correlates with better performance. All configurations in the top 5 have a depth value which is on the higher end of the spectrum of tried values (between 1 and 20). Another obvious conclusion is that weight_type 3 is the best performing one. This is the version of the training weights which is both scaled and ranked, and normalised between 0 and 1. The XGBoost model thus performs better if the trades with a given training weight $w = 0$ are ignored.

The other arguments tried (colsample_bytree, learning_rate, min_child_weight and gamma) are much more difficult to interpret and therefore lend themselves perfectly to a hyper-parameter optimisation framework such as Optuna. Some general observations are that the subsample ratio of columns per level is optimal around 20%, and that the learning rate benefits from being very small; it is actually always very close to the minimum of 1 × 10⁻⁸.

Finally, some conclusions can be drawn from the neural network optimisation results in Table 4. The use, or lack, of batch normalisation has no clear effect on performance. The weight type here is also less conclusive, although a slight preference for normalisation between 0 and 1 can be seen. Lastly, there is a trend of first increasing and then decreasing node sizes as layers are added.

5 CONCLUSION

When given a dataset of obfuscated grouped time series of financial trades, with the challenge of distinguishing between profitable and non-profitable trades, two big questions were asked in order to approach that challenge:

(1) What is a reliable and efficient way to validate trained models on grouped time series data containing no evenly spaced timestamps and possible rolling features?

(2) How can it be made apparent, in a strictly quantitative way, how much noise is present in the dataset and how could it be removed?

A clear and definitive answer can be formulated for the first question. Local cross-validation, with support for groups, purging and ordinality, has proven to be a reliable method to approximate the true performance of models trained on said dataset. Experimental results (Table 1) clearly show that it is superior to other cross-validation methods without grouping or purging. In addition, aggregation of the various scores per fold is better done through some form of exponential or 'smooth' mean as opposed to a linear mean.

In regards to the second question: a principal component analysis (PCA) is able to give good insight into the amount of noise that is present in a dataset. However, removing that noise is not as straightforward. Although the literature suggests that an autoencoder should be able to handle such a task, it did not in this particular experiment. This could be due to multiple reasons, one of which being that the experiment contained too many 'moving parts': variables that still had to be optimised and for which the performance was not known in advance. Another reason could be that distinguishing between noise and signal is just too complex a task for an autoencoder, or plainly not possible due to the stochastic nature of the data.

Optuna has shown itself to be of value in the process of optimising machine learning pipelines. Running 100 trials manually is a time-consuming task, and it is impossible to have such a precise intuition for variables such as the learning rate or gamma.

Finally, as a binary classifier XGBoost performed better than a neural network.

6 DISCUSSION & FUTURE WORK

The computational resources needed for Optuna formed a strong bottleneck for this research. In an ideal situation, all experiments would have been able to run for many more epochs and trials, until an obvious optimal configuration had been found. Especially the neural network tuning (Table 4) would probably have benefited from this.


Trial   colsample_bytree   gamma          learning_rate   max_depth   min_child_weight   Weight Type   Utility Score
27      0.221529           5.220882e-03   1.233561e-08    18          5.977980           3             2531.0
86      0.234081           7.156510e-04   1.524946e-08    19          5.894953           3             2347.9
72      0.167802           2.589203e-03   1.053145e-08    19          5.486295           3             2311.0
42      0.220431           5.263412e-04   1.008979e-08    18          4.708205           3             2290.5
81      0.178043           2.861433e-02   1.625215e-08    20          5.750757           3             2286.2

Table 3: Both a very high depth and a low learning rate occur across all top performing models. It is also obvious that weight type 3 provides the most optimal performance.

Trial   Batch Normalisation   Layer 1   Layer 2   Layer 3   Layer 4   Weight Type   Utility Score
7       False                 618       434       805       343       3             1725.8
4       True                  532       429       1152      451       1             1705.5
3       False                 403       159       1083      649       1             1657.0
6       True                  884       444       397       738       0             1543.4
0       False                 422       486       711       1147      4             1407.0

Table 4: An overall trend in layer size observable here is that the third layer is almost always the largest, and that the last layer tends to be the smallest. Also, it is quite evident that the results in regards to batch normalisation are inconclusive.


Another complication added by this resource dependency is that only after many trials does it become apparent whether a given range of parameters is correct. During the process of tuning the XGBoost model, both the learning_rate and the max_depth reached their minimum and maximum respectively. Broadening their ranges could have optimised the results even more.

There is also much work still to be done in regards to dimensionality reduction and the removal of noise in financial datasets in general. Even though the principal component analysis showed the obvious presence of noise in the training data, it could not be removed by an autoencoder as easily. A good first step would be to repeat this part of the experiment in a more controlled manner, e.g. with a pipeline for which the model is already proven to be optimal. This would make the autoencoder the only variable and enable it to be optimised or compared with other noise reduction techniques.

A final improvement could be made to the somewhat limited search space offered to the neural network. A more dynamic definition of its architecture, tuning not only the number of nodes but also the number of layers and letting it choose from various types of activation functions and layers, could yield more interesting results.

REFERENCES

[1] Akiba, T., Sano, S., Yanase, T., Ohta, T., and Koyama, M. (2019). Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2623–2631.

[2] Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.

[3] Chen, J. M. (2020). An introduction to machine learning for panel data: Decision trees, random forests, and other dendrological methods. Random Forests, and Other Dendrological Methods (October 23, 2020).

[4] Chen, T. and Guestrin, C. (2016). Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794.

[5] Chollet, F. (2016). Building autoencoders in Keras. https://blog.keras.io/building-autoencoders-in-keras.html. Published 2016-05-14, last accessed 2021-03-30.

[6] Craib, R., Bradway, G., Dunn, X., and Krug, J. (2017). Numeraire: A cryptographic token for coordinating machine intelligence and preventing overfitting. Retrieved, 23:2018.

[7] Danielsson, J. and Zigrand, J.-P. (2006). On time-scaling of risk and the square-root-of-time rule. Journal of Banking & Finance, 30(10):2701–2713.

[8] De Prado, M. L. (2018). Advances in Financial Machine Learning. John Wiley & Sons.

[9] de Prado, M. L. and Fabozzi, F. J. (2020). Crowdsourced investment research through tournaments. The Journal of Financial Data Science, 2(1):86–93.

[10] Dixon, M. F., Halperin, I., and Bilokon, P. (2020). Machine Learning in Finance. Springer.

[11] Donate, J. P., Cortez, P., Sánchez, G. G., and De Miguel, A. S. (2013). Time series forecasting using a weighted cross-validation evolutionary artificial neural network ensemble. Neurocomputing, 109:27–32.

[12] Fama, E. F. (1995). Random walks in stock market prices. Financial analysts journal, 51(1):75–80.

[13] Freund, Y., Schapire, R. E., et al. (1996). Experiments with a new boosting algorithm. In ICML, volume 96, pages 148–156. Citeseer.

[14] Gelman, A. and Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models, volume 1. Cambridge University Press, New York, NY, USA.

[15] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

[16] Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456. PMLR.

[17] Scikit-learn developers (2021). Visualizing cross-validation behavior in Scikit-learn. https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html#visualizing-cross-validation-behavior-in-scikit-learn. Published 2021-03-30, last accessed 2021-04-20.

[18] McNeil, A. J., Frey, R., and Embrechts, P. (2015). Quantitative Risk Management: Concepts, Techniques and Tools—Revised Edition. Princeton University Press.

[19] Parviainen, E. (2010). Deep bottleneck classifiers in supervised dimension reduction. In International Conference on Artificial Neural Networks, pages 1–10. Springer.

[20] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

[21] Ramachandran, P., Zoph, B., and Le, Q. V. (2017). Searching for activation functions.


[22] Ruder, S. (2016). An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.

[23] Sharpe, W. F. (1966). Mutual fund performance. The Journal of Business, 39(1):119–138.

[24] Shreve, S. E. (2004). Stochastic Calculus for Finance II: Continuous-time Models, volume 11. Springer Science & Business Media.

[25] Wickham, H. et al. (2014). Tidy data. Journal of Statistical Software, 59(10):1–23.

[26] Wigglesworth, R. (2021). Jane Street: the top Wall Street firm 'no one's heard of'. https://www.ft.com/content/81811f27-4a8f-4941-99b3-2762cae76542. Published 2021-01-28, last accessed 2021-03-16.

[27] Zaremba, A., Szyszka, A., Long, H., and Zawadka, D. (2020). Business sentiment and the cross-section of global equity returns. Pacific-Basin Finance Journal, 61:101329.
