
MSc Artificial Intelligence Master Thesis

Learning Representations from Quote and Trade Data on Financial Markets

by Michael van der Werve (10565256)

August 14, 2020
36 EC, January - August 2020

Supervisors: Sindy Löwe, Ivo Tops (external)
Assessor: Dr. Erik Bekkers

Abstract

This thesis explores the application of machine learning models on financial markets and the role of high-level representations to fill the out-of-sample generalization gap. We seek to define and predict anomalies and significant movement around market shocks, such as when new information is being priced in, exclusively using quote and trade data.

Evaluation is done by comparing a self-supervised Contrastive Predictive Coding method and an unsupervised variational autoencoder method against a fully supervised Binary Cross-Entropy baseline for learning price series representations on an intraday timeframe. As input, data is uniformly normalized across symbols, theoretically allowing the models to be more clearly exposed to the underlying market structure. Objectives are to predict significant movement, positive movement, and anomalies in price series.

From the Toronto Stock Exchange, 38 different symbols are taken for five months (107 trading days), containing small-cap, mid-cap, and large-cap stocks. The upstream model had access to two hours of history and was tasked to predict twelve minutes into the future.

Contrastive Predictive Coding seems to generalize best across tasks despite only performing best on the anomaly task, with consistently higher precision for the top probability scores of the downstream classifier. Representation loss is an indicative feature across all tasks. Anomalies cannot be forecast accurately; however, the significant movement objective does seem to encode useful data. Overall performance on all tasks was too low to be exploitable for trading.


Acknowledgements

I would like to thank my supervisor Sindy Löwe, who brought a lot of focus to the research and who was always ready to reply to yet another question about anything. Her optimism and continuous high-quality feedback were fundamental to the final thesis. Despite being incredibly busy at the end, she always found a way to make some time when I needed more guidance, for which I am very grateful.

Secondly, I would also like to thank my colleagues at Ruppert Trading BV. Their expertise on everything surrounding financial markets and the numerous discussions we had were paramount to the final shape of the thesis, as were the generous resources they made available for the research in terms of data and workstations. I also thank them for the countless questions they posed throughout the process, which significantly deepened my understanding as well. They were always there to support me when I encountered setbacks, bugs, or difficulties.

Lastly, I would like to thank everyone around me who has spent their precious time proofreading my thesis, discussing ideas with me, or giving me support in another way. Your help was much appreciated and helped me improve my thesis in countless ways.


Contents

1 Introduction
  1.1 Research Questions
2 Background
  2.1 Contrastive Predictive Coding
  2.2 Variational Autoencoder
3 Method
  3.1 Timeframe
  3.2 Preprocessing
  3.3 Representation Learning
    3.3.1 Encoder
    3.3.2 Loss functions
    3.3.3 Context vector
  3.4 Evaluation
    3.4.1 Labels
    3.4.2 Baseline
4 Related Work
5 Experiments
  5.1 Data
  5.2 Preprocessing
  5.3 Models
  5.4 Results
6 Discussion
  6.1 Possible Improvements
7 Conclusion
  7.1 Future work
References
A Extracted Features
B List of symbols from Toronto Stock Exchange
C Result tables

1 Introduction

Every day, millions of transactions are executed in the capital markets. Although trading used to happen mostly on the trading floor of the exchange, nowadays it happens almost exclusively digitally and is mostly fully automated.

Automated trading used to be largely fundamental, e.g., based on company numbers: if a company had a good quarter, the company should be worth more, and if it had a bad quarter, it should be worth less than anticipated. This approach naturally gave way to more quantitative research and formed into what is commonly referred to as the quantamental approach to trading, which consists of hand-crafting signals to extract information from the market environment and trading on them. This field has now been around for many years and has many participants, driving down alpha for funds exploiting these strategies. Machine learning is a natural extension to this approach: it replaces the hand-crafting of features and instead automatically extracts useful signals from the data.

It cannot be known what the real value of a company is at any given moment because there are too many unknown variables (Koller et al., 2010). This ties into the inherent asymmetry of information about companies, since they generally only release numbers every quarter, which is also when they exhibit different stock market behaviour (Chambers and Penman, 1984). Because this information cannot be known, the stock price instead experiences random drifts, also known as a random walk (Fama, 1995). Because of this random walk, the process is often simulated as a Wiener process (Wiener, 1923), a stochastic process that assumes the logarithmic returns of the stock are independent and normally distributed, making the stock price lognormally distributed. As in the real world, negative prices cannot exist under this model, and the price change magnitude depends on the price level. However, this simple model does not consider the shocks that can occur around corporate events, news, and other external influences (Hirsa and Neftci, 2013).

Although a vast majority of market trading is estimated to be done by algorithms (Bigiotti and Navarra, 2019), investor behavior still has a significant impact on the financial markets (Nagy, 2017). Investors have been known to cause sudden stock market shocks on both good and bad news (Frank and Sanati, 2018), which are then absorbed by the market in one way or another. Although the algorithms might not have had anything to do with the initial shock, they can stabilize it by actively trading during the shock.

A long-standing theory is the efficient market hypothesis (Malkiel, 1989), which states that all available information is fully reflected in the stock market price and that excess returns cannot be realized without assuming increased risk. This hypothesis also implies that returns following price shocks would not be predictable. However, recent evidence has been found opposing this hypothesis (Fakhry, 2016), essentially subdividing it into many forms and weakening it to make it applicable to real-world circumstances. Additionally, even the weaker forms of the hypothesis do not seem to hold in some markets (Sánchez-Granero et al., 2020; Hamid et al., 2017).

Most research focuses on predicting price or returns directly, but these approaches have some problems, which are explained in Section 4. Our focus will be on anomalies and significant movement. The detection of anomalies would allow trades to be entered with the knowledge of whether the price will normalize, because the stock is anomalously overbought or oversold, or whether the trend might continue. This knowledge could supplement momentum trading strategies, for example, since they rely on the continuation of the price in the same direction. Similarly, mean reversion strategies might benefit in the same way, where entering trades at anomalous points can be prevented. Significant movement indicators have a more direct mapping to trading strategies but can also be used as a supplemental indicator of a higher likelihood of a significant price move.

Many of the recent advances in machine learning have come from finding high-level representations for use in downstream classifiers (Bengio et al., 2013; Zhong et al., 2016). These approaches often surpass their predecessors in terms of performance on a wide variety of tasks and can generalize better. It has long been known that the stock market has a low signal-to-noise ratio, being very chaotic. This low ratio is one reason the dataset could greatly benefit from learned high-level representations with significantly denser information, instead of the mostly noisy raw data.


Figure 1: Representation learning versus regular learning on financial data. Regular learning: pricing stream → fully supervised training → prediction. Representation learning: pricing stream → learning representations (unsupervised, self-supervised, ...) → fully supervised training → prediction.

Machine learning models are known to fit the noise if the ratio is too low and to work better on smoother target functions (Caruana et al., 2001). Therefore, compression of this data should yield better generalization by smoothing out the search space, provided enough of the signal can be retained. This smoothing could reduce the discrepancy between in-sample and out-of-sample performance prevalent in most of the literature (Reschenhofer et al., 2020).

This thesis aims to apply the recent advances in representation learning to the financial markets by constructing high-level representations for pricing series, as displayed in Figure 1. The figure shows that instead of running fully supervised training directly on a pricing stream, a representation of that pricing stream is learned, upon which the fully supervised training and final prediction take place. If these representations exhibit a particular structure, the downstream classification tasks might be simpler than learning on the original data. Furthermore, resources could be significantly reduced if most of the information can be captured in a lower-dimensional latent space. Failure to construct a representation, indicated by a high loss term, might also be a good indication of anomalous data.

There are many ways to construct representations in a latent space. Since market data is a time-series, the latent space should exhibit a temporal locality, meaning that points close in time should also be close in the latent space. Ideally, this will allow a smoother path to be traced through the latent space, with the space itself being continuous. To achieve this, we consider three models, Contrastive Predictive Coding (CPC) (Oord et al., 2018), variational autoencoder (VAE) (Kingma and Welling, 2013), and fully-supervised training. The VAE, in combination with convolutions, should be robust towards temporal shifting of the data. Additionally, it has recently been shown that VAEs and other generative models can learn to represent price and order streams (Li et al., 2020), making them very suitable for creating the representations. Its inherent ability to generate samples could also prove useful to more accurately simulate real financial market circumstances. They have also been applied successfully alongside a news stream (Xu and Cohen, 2018).

For CPC, the theoretical case is even stronger, since it encodes information indicative of the future state into the current state. This process could effectively denoise the signal and construct features that are geometrically close in the latent space if they are also close in time. These models were chosen since they reflect the direct approach using fully supervised learning, the VAE for the approach without time-dependence, and CPC for the approach with explicit time-dependence. The tradeoffs of time dependence across a broad range of model types could be researched this way.

1.1 Research Questions

Several questions are posed to evaluate the application of representation learning in this thesis, mostly regarding the market structure around anomalous events. These events are defined as moments when new pieces of information are still being priced in, causing a momentary fluctuation in price. These points in time can be seen as anomalies, since the underlying distribution undergoes a sudden and significant change, or the point lies far outside of the estimated distribution. The idea is that this deviation should be detectable without any other external input, since the new pieces of information are being priced in from multiple sources. Because these sources are rapidly priced in, this should be visible from the pricing and volume changes of the orders alone.

RQ1. Which method of training yields the best latent encoding in terms of downstream performance and generalization? We consider encodings to be better if they enable downstream classifiers to attain better performance with less training time and better generalization.

RQ2. How accurately can stock market anomalies be detected using representation learning? Because of the low occurrence of anomalies, the learned representations should be sensitive to outliers in that their location in the latent space is significantly different from that of temporally neighboring points. This sensitivity should allow the outliers to be detected by failure to construct an accurate representation alone, with a high representation loss.

RQ3. How accurately can significant movements and anomalies be forecast on the stock market using the learned representations? This research question is the default target for profitability, as it is a direct indicator of whether or not to perform a trade and can thus be used as a trading strategy directly.

2 Background

For this thesis, an understanding of two fundamental models is needed: Contrastive Predictive Coding (CPC), with its explicit temporal dependence, and the variational autoencoder (VAE), with purely probabilistic encodings. This section will expand on the intuition, reasoning, and suitability of these models.

2.1 Contrastive Predictive Coding

Recently, CPC (Oord et al., 2018) has achieved competitive performance with the current state-of-the-art audio sequence models on a speaker classification task learned on the self-supervised representations, with supervision only on the final task. The InfoNCE loss, based on Noise Contrastive Estimation (NCE) (Gutmann and Hyvärinen, 2010), is denoted in the function below, where the density ratio f_k for step k was set to a simple log-bilinear model f_k(x_{t+k}, c_t) = exp(z_{t+k}^T W_k c_t), with x_t the sample at timestep t and c_t the encoded context at timestep t.

$$\mathcal{L}_N = -\mathbb{E}_X \left[ \log \frac{f_k(x_{t+k}, c_t)}{\sum_{x_j \in X} f_k(x_j, c_t)} \right]$$
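To make the objective concrete, the following is a minimal PyTorch sketch of InfoNCE with the log-bilinear density ratio above; the tensor shapes, the batch layout, and modelling W_k as an nn.Linear are illustrative assumptions rather than the exact implementation used later in this thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def info_nce(z_pos: torch.Tensor, c_t: torch.Tensor, W_k: nn.Linear,
             z_neg: torch.Tensor) -> torch.Tensor:
    """InfoNCE with the log-bilinear density ratio f_k(x, c) = exp(z^T W_k c).

    z_pos: (batch, dim_z)    encoding of the true future step t+k
    c_t:   (batch, dim_c)    aggregated context at step t
    W_k:   nn.Linear(dim_c, dim_z, bias=False), one projection per step k
    z_neg: (batch, N, dim_z) encodings of N negative samples
    """
    pred = W_k(c_t)                                            # (batch, dim_z)
    pos = (z_pos * pred).sum(dim=-1, keepdim=True)             # (batch, 1)
    neg = torch.bmm(z_neg, pred.unsqueeze(-1)).squeeze(-1)     # (batch, N)
    logits = torch.cat([pos, neg], dim=1)                      # (batch, 1 + N)
    # The positive pair sits at index 0, so cross-entropy over the candidates
    # equals -log( f_k(positive) / sum_j f_k(j) ), i.e. the loss above.
    target = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, target)
```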

Figure 2: Overview of the CPC model, from Oord et al. (2018), visualized with audio waves as trained on in the original paper. Input x_t is converted into a latent representation z_t using g_enc, which is temporally aggregated into c_t using g_ar. Predictions for z_{t+i} are made using c_t.

Intuitively, InfoNCE incentivizes the model to encode information indicative of its future state z_{t+k} into the current context c_t, explicitly exploiting temporal properties. Practically, the model achieves this by minimizing the cross-entropy loss of a classifier that distinguishes a positive pair from negative ones, where a positive pair (c_t, z_{t+k}) is a value paired to a value in its future, and a negative pair is a value paired with a random value (c_t, z_r) with r ≠ t + k, i.e., not in its future. As the model improves at correctly classifying the positive samples, representations of temporally nearby inputs become more similar. On the other hand, inputs from different timesteps are pushed apart in the latent space. More formally, this process is reflected by the optimization of mutual information between the current timestep's encoded state and the future encoded state (Poole et al., 2019). That is, the mutual information I(x_{t+k}, c_t) ≥ log N − L_N, where N is the number of negative samples.

Although this result does not necessarily mean that the representation has any useful structure (Bengio et al., 2013; Tschannen et al., 2019), it has been shown to work well on a variety of tasks such as audio, video, and image classification (Oord et al., 2018), with the most recent additions to the model even outperforming its supervised counterparts (Hénaff et al., 2019). Since the mutual information is related to the number of negative samples as described in the equation above (Oord et al., 2018), the lower bound of the mutual information can increase as the number of negative samples increases. Given that the number of negative samples N can be picked freely, it should intuitively be chosen as high as possible.


2.2 Variational Autoencoder

The VAE (Kingma and Welling, 2013) has successfully been applied to a wide variety of tasks, from anomaly detection (Yao et al., 2019) to image classification (Pu et al., 2016), allowing an unsupervised approach to learn the latent space with sampling. The VAE is very similar to a standard autoencoder (AE) (Ballard, 1987), with regularization to ensure generative properties on the latent space. This is done by encoding the input to a distribution which can be sampled, rather than an explicit hidden state. In contrast to the regular AE, this leads to the latent space representing a distribution, making it continuous with the ability to sample anywhere (Berthelot et al., 2018). Figure 3 shows a high-level example of how a VAE that models a Gaussian distribution works, where the input x is encoded into two vectors, µ and σ², which model the parameters of the Gaussian distribution for x. Formally, taking a probabilistic view of the VAE, the decoder is defined by p(x|z) (the probability of x given z), and the encoder is defined by q(z|x) (the encoded variable given the input x). Rather than the deterministic approach a normal AE takes, a VAE models the distribution of the input variables. A point is then sampled using Gaussian noise ε ∼ N(0, 1) under the modeled distribution, so that backpropagation can be done; as the probability distribution is intractable, this cannot be achieved without sampling. This sampled point is z = µ + σε, which is a normal sample around N(µ, σ²), after which the decoder attempts to make a reconstruction x̂ from z.

The model is evaluated using the Evidence Lower Bound (ELBO) loss, which consists of a reconstruction loss term and a distributional divergence loss term. The reconstruction loss term usually contains either the Mean Squared Error (MSE) or the Cross-Entropy (CE) loss function, being MSE(x, x̂) or CE(x, x̂) depending on whether the input is x ∈ ℝ or x ∈ [0, 1]. The regularization term is the Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951), which is the information gain obtained from using N(µ, σ²) over Gaussian noise. Intuitively, optimizing for the negative of this term ensures that the latent distribution q(z|x) is pushed to be as close as possible to N(0, 1).

$$\text{ELBO} = MSE(x, \hat{x}) - KL\left(\mathcal{N}(\mu, \sigma^2)\,\|\,\mathcal{N}(0, 1)\right)$$
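The snippet below is a minimal PyTorch sketch of the reparameterization step and the resulting training loss (the MSE reconstruction term plus the closed-form KL divergence between N(µ, σ²) and N(0, 1)); the variable names and the mean reduction are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    # z = mu + sigma * eps with eps ~ N(0, 1), so the sampling step stays
    # differentiable with respect to mu and logvar.
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps

def vae_loss(x: torch.Tensor, x_hat: torch.Tensor,
             mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    # Reconstruction term: MSE, since the inputs are real-valued features.
    recon = F.mse_loss(x_hat, x)
    # KL( N(mu, sigma^2) || N(0, 1) ) in closed form for a diagonal Gaussian.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```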

The addition of the KL divergence regularization to the loss theoretically enforces the distributional structure, converting the latent space to a distribution. This causes points that are close in the latent space to have similar decodings, and similar values of the input x to end up close together in the latent space z after encoding. Although regular AEs have good dimensionality reduction properties (Wang et al., 2016), this structure is not always present, since they do not experience the same regularization (Goodfellow et al., 2016). This lack of structure means that points close in the latent space can have vastly different decodings, which causes regular AEs to not be directly usable in a generative context. Even worse, the space itself can contain gaps where the sampled points are not meaningful.

Figure 3: Visualization of the VAE model based on a Gaussian distribution. In the image, x is the original data, z is the sampled latent representation, and x̂ is the decoded reconstruction of x from z.

Because of the explicit modeling of the distributions, this method has some remarkable advantages. Most notably, the model can be used generatively: because the distribution is defined over the latent space, points in between other points also have a meaningful mapping, giving a more continuous space.

3 Method

This section will first detail the preprocessing steps followed to convert the dataset, after which the representations can be learned. Secondly, the encoder architecture and the corresponding losses will be specified, after which the construction of the context vector is detailed. Lastly, the supervised labels will be defined, and the evaluation methods will be explained.

3.1 Timeframe

For this research, an intraday timeframe was chosen. Intraday specifically means that the horizon considered had to fall within the trading day and that no cross-day effects were taken into account. This choice has several key advantages, since generally a lot more data is available and there are potentially many more market microstructural events (Kluger and McBride, 2011), as opposed to daily trades. This abundance also means that there is potentially a lot more opportunity for temporary mispricings, which should theoretically be exploitable. In general, more trade opportunities mean that the certainty per trade can be lower since, statistically, it will be made up for in volume. Additionally, a lot more information can be extracted from this interval, since usually only the aggregated pricing information is available. Lastly, there are many known intraday temporal effects discovered through the years; some recent examples can be found in Chu et al. (2019); Khademalomoom and Narayan (2020); Renault (2017), which the model could potentially pick up on.

In contrast, this timeframe’s downside is that it generally contains a lot more noise than the larger time-horizons such as daily trades, since the moves are relatively small and market events tend to cluster. Additionally, dealing with these relatively small intervals also means that the potential gains and losses are severely constrained since the average drift over this small interval is very small. Lastly, market participants that are actively trading on this level are generally much more well-informed, potentially having access to much higher-quality sources and faster infrastructure, allowing them to act on the information faster (Ros¸u, 2019).

3.2 Preprocessing

In this thesis, level 1 quotes and trades were used. This event stream consists of three types of events: a trade event, an updated bid price, and an updated ask price. Primarily, each event contains a price and a size. There are some other properties present; however, these mostly relate to order book management and administration, and are therefore only relevant to filter out incorrect quotes, incorrect trades, and the specific exchange the quote originated from.

Because the event stream is not well-suited for machine learning, the events are first accumulated into what is commonly referred to as bars. Table 1 shows an example of the trades coming in, and Table 2 shows the state after a subsequent accumulation that is applied. It converts the non-uniform stream of events into a uniformly distributed state over time.

Timestamp   Price   Volume
...         ...     ...
09:51:00    12.54   100
09:51:27    12.55   400
09:51:54    12.53   200
09:52:03    12.55   300
09:52:05    12.55   400
...         ...     ...

Table 1: Example stream of trades.

Usually, only bars constructed using time as a base are considered, but recently Easley et al. (2016) have shown that other bar types have better statistical properties, which make them better suited for machine learning purposes. In this research, volume and dollar bars are also considered as bar delineations. A volume bar example is displayed in Table 3; in this case, the dollar bars would have the same delineations because the price does not drift enough.


Time    Open    High    Low     Close   Volume
9:51    12.54   12.55   12.53   12.53   700
9:52    12.55   12.55   12.55   12.55   700

Table 2: Time (minute) bars generated from the trades in Table 1. The last bar is still open and might change with extra trades coming in.

Volume   Open    High    Low     Close
500      12.54   12.55   12.54   12.55
500      12.53   12.55   12.53   12.55
400      12.55   12.55   12.55   12.55

Table 3: Volume bars generated from the trades in Table 1 for each 500 shares traded. The last bar is still open and will change with extra trades coming in.
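To illustrate the accumulation step, the sketch below builds time and volume bars from a trade stream with pandas; the column names and the 500-share bucket follow the toy example in Tables 1-3 and are otherwise assumptions. For simplicity it assigns whole trades to a single bar and does not enforce the minimum-length rules discussed in the following paragraphs.

```python
import pandas as pd

def time_bars(trades: pd.DataFrame, freq: str = "1min") -> pd.DataFrame:
    """OHLCV time bars from a trade stream with columns timestamp, price, volume."""
    t = trades.set_index("timestamp")
    bars = t["price"].resample(freq).ohlc()
    bars["volume"] = t["volume"].resample(freq).sum()
    return bars

def volume_bars(trades: pd.DataFrame, bucket: int = 500) -> pd.DataFrame:
    """OHLCV bars closed roughly every `bucket` shares traded (cf. Table 3)."""
    # Bar index grows by one for every `bucket` shares of cumulative volume.
    bar_id = trades["volume"].cumsum().sub(1).floordiv(bucket)
    price = trades.groupby(bar_id)["price"]
    return pd.DataFrame({
        "volume": trades.groupby(bar_id)["volume"].sum(),
        "open": price.first(),
        "high": price.max(),
        "low": price.min(),
        "close": price.last(),
    })
```

On the trades of Table 1, volume_bars reproduces the three bars of Table 3.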

Because of how these bars are constructed, extra care has to be taken when predicting them, especially in more volatile markets with more market activity. This volatility can arise in many different circumstances but usually happens when a new piece of information is priced into the market. This new information can be a piece of news, earnings, rumors, and many other factors. Regardless, the effect is that there can be significantly more trades, which can be very close temporally.

Therefore, an additional requirement on these bars, as opposed to being a pure bar of its type as defined, is that it has a minimum number of trades or quotes and cannot be broken up within the same timestamp of an event. This rule prevents the model from breaking up sequences that could not always have been captured, e.g., within the same millisecond as the last trade. Otherwise, the model could attain artificially high performance, even though the effect could not have been captured on the live markets. Naturally, the minimum length of a bar in terms of time depends on the latency available to the software's proprietor, where high-frequency traders could take the most advantage of these properties. These alterations make sure that the bar can be acted upon, depending on the available latency.

Usually, time bars can be directly obtained from a market data provider. Although that is possible, the preprocessing of the raw events was done to capture more information from the market. Appendix A shows the features that are extracted from the raw event stream, where normally only open, high, low, and close prices are available, and sometimes volume.

Determining whether a trade counts as a 'sell' or a 'buy' is done using the tick rule from De Prado (2018). Although this is not infallible, this proxy most likely closely resembles the actual trade direction, for which the data is not available. In reality, it is not always clear whether the bid or the ask price has been hit, which determines the trade direction (whether it is an actual sell or buy). The tick rule is displayed in the equation below, where p_t denotes the price at a point in time t. Intuitively, this formula classifies the trade (and subsequent trades at the same price) as a 'buy' when the price went up, and as a 'sell' when the price went down.

$$b_t = \begin{cases} 1, & \Delta p_t > 0 \\ -1, & \Delta p_t < 0 \\ b_{t-1}, & \Delta p_t = 0 \end{cases}$$
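A short numpy sketch of the tick rule, assuming the convention that trades before the first observable price change default to a buy:

```python
import numpy as np

def tick_rule(prices: np.ndarray) -> np.ndarray:
    """Classify each trade as a buy (+1) or sell (-1) from price changes alone."""
    signs = np.sign(np.diff(prices, prepend=prices[0]))  # +1, -1, or 0 per trade
    b = np.empty(len(prices))
    last = 1.0  # assumed direction until the first observable price change
    for i, s in enumerate(signs):
        if s != 0:
            last = s
        b[i] = last  # an unchanged price inherits the previous classification
    return b
```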

After this initial step, the raw event stream of unevenly spaced events has been converted into a dataset containing evenly spaced bars. In the case of time bars, these are cut off at a given interval. In the case of volume, tick, and dollar bars, these are information-driven. In more volatile markets with more trades, they will occur more often and allow for these effects to be detected better (Easley et al., 2016). Because the bars depend on trades, time-driven bars must be filled with ’empty’ bars containing no trades, no volume, and generally devoid of most features during periods of inactivity. These features were set to zero. All other features were set to the last known price, which is the close price. If these gaps were not filled, an implicit look-ahead bias would be introduced because it would not be known when the next bar is generated, since the bars would be unevenly spaced again.


After the bars have been generated, the dataset was split up into a training, validation, and testing set, where all samples in the training set have to come temporally before the validation set, and the samples in the validation set have to come before the testing set. Although this is usually not required, omitting this rule would introduce another bias that artificially inflates model performance (De Prado, 2018). Because look-ahead bias is one of the most prevalent issues in this type of research (De Prado, 2018), and because many of the features are known to be non-stationary (Horváth et al., 2014), the sequence data is normalized relative to the previous day using the trimmed mean and standard deviation. Using the trimmed mean ensures robust estimation of location even in the presence of outliers (Arce, 2005), ensuring proper normalization.
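The sketch below illustrates normalizing one trading day against the previous day's trimmed statistics; the 10% trim fraction is an assumption, since the thesis does not state the exact value used.

```python
import numpy as np

def trimmed_stats(x: np.ndarray, trim: float = 0.10):
    """Mean and standard deviation after discarding the `trim` fraction of the
    smallest and largest values, giving outlier-robust estimates."""
    x = np.sort(np.asarray(x, dtype=float))
    k = int(len(x) * trim)
    core = x[k:len(x) - k] if k > 0 else x
    return core.mean(), core.std()

def normalize_day(today: np.ndarray, previous_day: np.ndarray) -> np.ndarray:
    """Standardize today's values using yesterday's trimmed mean and standard
    deviation, so no information from the current day leaks into the scaling."""
    mu, sigma = trimmed_stats(previous_day)
    return (today - mu) / (sigma + 1e-12)
```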

An alternative approach to standardization of the data is taking the rolling trimmed mean and standard deviation within the same trading day and using these to normalize the data. A problem with this approach is that much data would be discarded at the start of the trading day, since the window is not complete yet and calculating the rolling statistics will not be very informative; however, this could be remedied by supplementing with pre-market data. Naturally, this method can be extended to incorporate multiple days from the past to increase the number of samples used for the estimates. There are three potential issues with this normalization overall. The first issue occurs when shares are issued or repurchased, growing or shrinking the pool of outstanding shares. Another potential problem is that these normalizations do not work correctly when a stock split happens (a stock split is when a company divides its shares to increase liquidity; the total market capitalization remains the same, but the number of shares outstanding is multiplied), but those will be disregarded since they are becoming much less frequent (Minnick and Raman, 2014). Lastly, although dividend payments will certainly temporarily lower the price on the next day (Barker, 1959), at which point further normalization could be applied, the amount is usually relatively small compared to the total stock price and will therefore be ignored.

Pricing data is converted into a standard logarithmic returns series r_t = log(p_t) − log(p_{t−1}), with p_t denoting the price at time t. In many cases, pricing data cannot be used directly, even when the data is standardized relative to the previous trading day, because of its non-stationarity, even within the same trading day (Horváth et al., 2014). Under normal market conditions, the drift should be constant and equal to the risk-free interest rate on average, which normalized to an intraday time interval is approximately zero. The logarithmic returns are calculated for the prices relative to their previous values. This method has two advantages: it makes the data stationary and converts the noise from multiplicative to additive. Machine learning models are mostly based on additive noise, so this should result in better performance.

Lastly, the timestamp was converted to a positional encoding (Vaswani et al., 2017). The timestamp cannot be included directly, since it is a monotonically increasing value with a cyclical structure, which means that the mean from the last day is not an informative normalization value for today. The timestamp was converted to a positional encoding using the equation below, where t_i denotes the timestamp at point in time i, and t_0 and t_N are the start-of-day and end-of-day timestamps, respectively. This process was repeated for the time within the hour and the day of the week, as these extra features are also cyclical. The bar's time span was also added, which will simply be a constant for time-driven bars but might be an insightful feature for the information-driven bars.

$$x = \sin\left(2\pi \frac{t_i - t_0}{t_N - t_0}\right), \qquad y = \cos\left(2\pi \frac{t_i - t_0}{t_N - t_0}\right)$$

Intuitively, the normalized timestamp (ranging from 0 to 1 within the hour/day/week) is converted to a point on the circle, so the start and the end of the period are closer to each other. This effect is most naturally explained for hours: the first minute of a new hour has more in common with the last minute of the previous hour. This relation is reflected in the positional encoding, but not in the 0-to-1 encoding, where they are furthest apart.
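A small numpy sketch of this cyclical encoding; calling it with hour or week boundaries instead of the start and end of the day gives the other two encodings.

```python
import numpy as np

def cyclical_encoding(t: np.ndarray, t_start: float, t_end: float):
    """Map timestamps in [t_start, t_end] onto the unit circle, so the end of
    the period lies next to its start."""
    phase = 2.0 * np.pi * (t - t_start) / (t_end - t_start)
    return np.sin(phase), np.cos(phase)
```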

There are two main reasons for this diligent preprocessing. Firstly, machine learning models converge significantly faster on data that has been properly normalized (Singh and Singh, 2019). Secondly, normalizing every sequence to itself allows training to happen across tickers, with the potential for significantly better generalization by exposing the model to the entire market. Normalizing the data based on a mean and standard deviation from the complete dataset would mean that symbols with significantly lower volumes are marginalized away, hurting generalization. Lastly, by having more data available, the model will be less likely to fit the noise, and deep learning can be applied, since that usually requires millions of examples.

Technical Analysis (TA) is relatively standard in the financial literature, where bars and their history are converted into pieces of extra information (Edwards et al., 2007). An example of this is calculating the rolling mean of closing prices and including that as an extra feature. Although it is standard, no TA functions were applied, since they only make calculations on data already seen by the neural networks. In theory, these networks are universal function approximators (Hornik et al., 1989) and should therefore be able to extract these constructed features by themselves, should they be salient enough.

3.3 Representation Learning

3.3.1 Encoder

The encoder was taken to be a set of stacked convolutional layers (Lecun et al., 1998) with a Gated Recurrent Unit (GRU) Recurrent Neural Network (RNN) (Chung et al., 2014) on top. This architecture is displayed in Figure 4. Relative to the encoder used for audio in Oord et al. (2018), the hidden state must be significantly smaller since the sequences are significantly shorter. The encoder's task is to convert the data into high-level representations in general, but since this specific data source has a very high level of noise, it is more relevant to discern the signal from the noise and essentially distill the data down to its salient features.

Figure 4: Encoder architecture (1D-convolutional encoder, autoregressor, and fully connected layer). x_{0,0} up to x_{K,T} denote the K channels, each with the same number of timesteps T.
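A minimal PyTorch sketch of an encoder in this spirit: stacked 1D convolutions followed by a GRU autoregressor. The kernel sizes, paddings, 32-dimensional hidden state, and 4-dimensional context follow Section 5.3; the ReLU activations and the absence of any other layers are assumptions.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Stacked 1D convolutions with a GRU on top, roughly following the
    hyperparameters of Section 5.3 (kernels 10, 8, 4, 4, 4; paddings
    2, 2, 2, 2, 1; 32 hidden channels; 4-dimensional context)."""

    def __init__(self, in_channels: int, hidden: int = 32, context: int = 4):
        super().__init__()
        kernels, paddings = [10, 8, 4, 4, 4], [2, 2, 2, 2, 1]
        layers, prev = [], in_channels
        for k, p in zip(kernels, paddings):
            layers += [nn.Conv1d(prev, hidden, kernel_size=k, padding=p), nn.ReLU()]
            prev = hidden
        self.conv = nn.Sequential(*layers)
        self.gru = nn.GRU(hidden, context, batch_first=True)

    def forward(self, x: torch.Tensor):
        # x: (batch, channels, timesteps)
        z = self.conv(x)                     # (batch, hidden, T')
        c, _ = self.gru(z.transpose(1, 2))   # (batch, T', context)
        return z, c
```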

By making this reduction, downstream estimators should become less prone to overfitting, and their performance and generalization should improve. Especially the latter is often amiss when applying machine learning models to financial data, since there usually is a large gap between out-of-sample and in-sample performance (De Prado, 2018). Additionally, because the dimensionality is lower than that of the original dataset, learning should become a lot faster after the initial transformation, which means more experiments can be run on mostly the same data, lowering evaluation time per model as there are significantly fewer dimensions.

3.3.2 Loss functions

InfoNCE Loss The network was trained in a fully self-supervised way using the InfoNCE loss from Contrastive Predictive Coding (CPC) as described in Oord et al. (2018). The optimization target is detailed in the equation below, where X_i is the data for sample i and c_t^i is the context for sample i at timestep t. An extra layer was added on top of the encoder, and this layer is fed N samples. The first sample is the positive sample, and the other N − 1 samples are negative samples. The positive sample is within a sampling window of k in the training sample's future, while the negative samples are not.

$$\mathcal{L}_i = -\mathbb{E}_{X_i} \left[ \log \frac{f_k(x_{t+k}^i, c_t^i)}{\sum_{x_j \in X_i} f_k(x_j, c_t^i)} \right]$$


This loss should work particularly well on a dataset with such a high noise level, because the model is actively incentivized to encode predictive information about the future state into the encoded context. This incentive should cause points close in time to be close in the latent space as well, ideally canceling out a lot of the noisiness and gaining more stability. Secondly, the representation loss indicates how well the trained model distinguished the future state from an unrelated state. Additionally, the model's ability to extract the slow features from the dataset is a desirable property for this data.

Evidence Lower Bound (ELBO) Loss A single extra layer was added to the encoder, and a decoder was constructed by inverting the encoder layers. The extra layer contains two separate linear layers that convert the context to the parameters µ and log σ². The layout of the variational autoencoder (VAE) is shown in Figure 3, where the encoder is the one detailed in Figure 4.

The ELBO loss consists of two terms, which were added to the final context vector as a single combined loss term. The intuition is that a worse representation loss indicates salient data might have been lost, potentially performing well for anomaly detection, where this anomalous data cannot be encoded in the limited latent space (Yao et al., 2019). Although the cross-entropy is often used for the reconstruction term, the Mean Squared Error (MSE) is used instead, since the data consists of continuous data points rather than probabilities. The per-sample loss is written in the formula below, where x_i denotes the original data and r_i its reconstruction.

$$\mathcal{L}_i = MSE(x_i, r_i) - KL\left(\mathcal{N}(\mu, \sigma^2)\,\|\,\mathcal{N}(0, 1)\right)$$

Binary Cross-Entropy (BCE) A single extra layer with two neurons was added with a softmax activation function, whose output models the probability of the positive class, being the label in this case. Because of the large class imbalance, the BCE was balanced by assigning the base probability of a positive sample to α, to prevent the model from being overwhelmed by the large majority of negative samples. The loss per full sample is calculated as described in the equation below.

$$\mathcal{L}_i = -\left[\alpha\, y_i \log p(y_i) + (1 - \alpha)(1 - y_i) \log(1 - p(y_i))\right]$$
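A sketch of this class-weighted loss, written to follow the per-sample formula above; p_pos is assumed to be the softmax output for the positive class.

```python
import torch

def weighted_bce(p_pos: torch.Tensor, y: torch.Tensor, alpha: float) -> torch.Tensor:
    """Weighted binary cross-entropy: positive samples are weighted by alpha
    and negative samples by 1 - alpha, as in the per-sample loss above."""
    eps = 1e-7  # numerical safety for log(0)
    loss = -(alpha * y * torch.log(p_pos + eps)
             + (1.0 - alpha) * (1.0 - y) * torch.log(1.0 - p_pos + eps))
    return loss.mean()
```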

Although this loss can be used for training and learning the representations, it cannot be included in the context vector for the downstream classifier, since it would implicitly contain information on the label. The reason for this is visualized in Figure 5. Adding this loss would create a look-ahead bias, because downstream classifiers could perfectly predict the same data again. Generally, since the label that was trained on would not have been available at the timestep t at which it is evaluated, this loss could not have been calculated online.

3.3.3 Context vector

Once the representations have been learned, the context vector, being the encoding in the latent space, can be constructed for downstream classifiers. There are many ways to do this, and for this research it was constrained to only use the pooled representations, in which the representations are average pooled over the initial sequence timesteps to retain a single context vector. This pooling process has been shown to retain most of the information in other tasks while discarding non-essential information (Boureau et al., 2010).

As a useful addition to the context vector, the loss per sample was added for the InfoNCE and ELBO loss models. This addition could allow the downstream models to learn better on the classification task, since a higher loss intuitively means that the projection into the latent space was inefficient, possibly because the input was unfamiliar, and that the current latent representation says less about the future than usual. This has been visualized in Figure 6.
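A sketch of this construction, with assumed tensor shapes: the per-timestep contexts are average pooled over time and the scalar per-sample representation loss is appended as one extra feature.

```python
import torch

def build_context_vector(contexts: torch.Tensor, sample_loss: torch.Tensor) -> torch.Tensor:
    """contexts:    (batch, T, dim) latent context per timestep
       sample_loss: (batch,)        InfoNCE or ELBO loss per sample
       returns:     (batch, dim + 1) pooled context with the loss appended."""
    pooled = contexts.mean(dim=1)                       # average pooling over time
    return torch.cat([pooled, sample_loss.unsqueeze(1)], dim=1)
```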

This addition would also allow the model to be evaluated based on the loss, where the dataset could be prefiltered to ensure the market conditions are as expected. When this is not the case, the latent representations will be unexpected, in which case machine learning models have been known to exhibit low generalization.

Figure 5: Graphic to show the temporal dependency. At timestep t = 4, x_6 is not known, and therefore the objective y = x_6 > x_4 cannot be calculated, and the loss on the model's softmax output ŷ cannot be evaluated with Cross-Entropy (CE). Green is the data that is available; red is not available yet.

Figure 6: Construction of the context vector. c_t is the encoded latent context representation of timestep t; the states are average pooled over time, and the respective loss L_r is calculated and appended, forming the context vector that is fed to the downstream classifiers.

3.4 Evaluation

This subsection will elaborate on the evaluation methods used to validate the usability of the latent representations. First, the labels that can be extracted from the sequence will be defined, then the baselines, and subsequently the analysis methods will be elaborated.

3.4.1 Labels

For the downstream classification task, several labels need to be defined. These labels are extracted from the prices and contain information known after the window has passed, but usually not at the moment itself. If they only contained past information that is already known, there would be nothing to predict, since they could be trivially calculated; by incorporating unknown information from the future, good predictions gain value. The standard movement and movement-up indicators take a price from the future and compare it to the price at the current timestep, for all timesteps, and the anomaly indicators try to find outliers under the underlying model distribution.

Figure 7: Indicator functions graphed: (a) movement in the next 10 timesteps, (b) up movement in the next 10 timesteps, (c) anomalous point within k = 10, (d) future anomalous point within k = 10. Red highlights are activations, and grey highlights are where not enough data is available. Sequence data was randomly generated.

Movement From a series of prices p_t at timestep t, this indicator detects whether an absolute price move of a given size happens, without predicting its actual direction. See the example in Figure 7a. The interpretation of T is the percentage return.

$$y_t^m = \mathbb{1}_{\left|\frac{p_{t+k}}{p_t} - 1\right| > T}(p)$$

Movement Up Another label is extracted from the future returns, indicating significant up movements. The label indicates whether the return at an offset of k timesteps exceeds a given threshold T. The threshold T should be chosen to exceed the transaction cost and the cost of crossing the spread to get a usable estimate. See the example in Figure 7b. The interpretation of T is the percentage return.

$$y_t^u = \mathbb{1}_{\frac{p_{t+k}}{p_t} - 1 > T}(p)$$

Evaluation using returns will only be done on long positions, because there are extra constraints on short positions that can introduce spurious effects into the evaluation which do not reflect real-world scenarios (Nagel, 2005). These prices do not require a logarithmic transformation, since these indicators will usually be used separately from each other in different trades. Because all trades are independent of each other in this way, there is no cumulative effect, and hence the gains of separate trades can simply be added, instead of managing a portfolio consisting of that single stock.

Anomaly This label indicates whether the current point in time is in an anomalous state relative to its surroundings.

First, we determine the shifted sample mean around each observation p_t at timestep t,

$$\bar{p}_t = \frac{1}{2k} \sum_{i=t-k}^{t+k} p_i,$$

followed by a calculation of the shifted sample variance around the point,

$$S_t^2 = \frac{1}{2k - 1} \sum_{i=t-k}^{t+k} (p_i - \bar{p}_t)^2,$$

where k denotes the window size divided by two. These results can be combined into an anomaly label

$$y_t^a = \mathbb{1}_{\frac{p_t - \bar{p}_t}{S_t^2} > T}(p),$$

where T denotes a threshold, interpretable as the number of standard deviations. See the example activations in Figure 7c.

The intuition behind the anomaly definition is that it uses future data; if there is a momentary peak, this would be defined as an anomalous state, while a continually rising price will not be anomalous (if it keeps rising). This inclusion essentially smooths out the data points and contains information not available at that moment.

Future Anomaly The anomaly approach was also expanded to allow prediction of future anomalies by shifting the labels left by k steps. This could teach the model to predict future anomalous states based on the current, normal state. The shift is shown visually in Figure 7d.

$$y_t^f = y_{t+k}^a$$

Note that these labels therefore exclusively use future information, without any overlap with the current state. This task should theoretically be an infeasible target, since it presumes to preemptively predict a future anomalous state without seeing any part of that state, while the market is still behaving normally. Unless the market is already moving to anticipate an event that has not happened yet, this should be impossible.
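The four labels can be computed directly from a price series; the numpy sketch below follows the definitions above, with the first and last k points left inactive because the required window is incomplete there.

```python
import numpy as np

def movement_labels(p: np.ndarray, k: int, T: float):
    """y^m (any significant move) and y^u (significant up move) over k bars."""
    n = len(p)
    ret = np.zeros(n)
    ret[:n - k] = p[k:] / p[:n - k] - 1.0      # forward k-step return
    y_m = np.abs(ret) > T
    y_u = ret > T
    return y_m, y_u

def anomaly_labels(p: np.ndarray, k: int, T: float):
    """y^a (anomalous now) and y^f (anomalous k steps ahead)."""
    n = len(p)
    y_a = np.zeros(n, dtype=bool)
    for t in range(k, n - k):
        w = p[t - k:t + k + 1]                 # window of 2k + 1 prices around t
        mean = w.sum() / (2 * k)               # shifted sample mean (Sec. 3.4.1)
        var = ((w - mean) ** 2).sum() / (2 * k - 1)
        y_a[t] = (p[t] - mean) / var > T
    y_f = np.zeros(n, dtype=bool)
    y_f[:n - k] = y_a[k:]                      # future anomaly: shift left by k
    return y_a, y_f
```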

3.4.2 Baseline

Since the dataset is non-public, and much of the published research body focuses on daily rather than intraday data because those datasets are open, no direct comparison to existing literature is possible. Therefore, instead of comparing outcomes from different papers, the downstream classifier's performance is measured for all different training methods. This comparison shows the actual differences between training approaches.

As a baseline for showing whether the models can extract useful information at all or if only noise is found, the baseline accuracy is the majority class, where a ’simple’ model is always presumed to predict the majority class. Apart from this, it is also compared against an exploitable baseline, which requires a precision of over 50% to be usable in a trading setting, assuming symmetric payouts. This baseline is only relevant for precision, since it is only applicable to opportunities taken. Naturally, this threshold depends on various factors, notably the cost of executing the trade, and the potential gain and loss. Depending on these factors, the threshold might be different, especially when the payoff distribution is asymmetric.

As a more standard target, a positive-movement label was added, as this target is often used and maps directly to a trading strategy. In this context, that is simply y_t^u with T = 0, i.e., any positive movement is an activation. On this label, the trading profits of a simple trading strategy will be evaluated per training method. Training on the various context vectors was done after the data was split into a training and testing set, and the hyperparameters were optimized using cross-validation on the training set. For training, rather than weighting the BCE function, the training set was balanced using the synthetic minority over-sampling technique (SMOTE) (Chawla et al., 2002). This choice was made because the latent space should have a better structure than the original space, where points in between other points are meaningful, which would not be the case in the original space.

To prove that most of the signal was captured in the latent space and that the latent space has a useful structure, the downstream classifier is a simple logistic regression as a baseline. This classifier was trained on the learned representations from the fully supervised method, the learned representations from the InfoNCE loss, and the representations obtained from the VAE. This way, the approach's merit can be interpreted by its relative performance with respect to differently trained models on the same data. The classifiers were trained for all labels, to test whether the encoded representations contained generic information or information that was highly specific to the task.
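A sketch of this downstream baseline, assuming the pooled context vectors and labels have already been split temporally into training and testing arrays; it uses imbalanced-learn's SMOTE and scikit-learn's logistic regression.

```python
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score

def downstream_baseline(X_train, y_train, X_test, y_test):
    """SMOTE-balanced logistic regression on the pooled context vectors,
    reporting precision as the primary metric."""
    X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_train, y_train)
    clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
    precision = precision_score(y_test, clf.predict(X_test), zero_division=0)
    return clf, precision
```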

Standard metrics for classification, such as accuracy, precision, and recall, were taken as performance measures. Additionally, the Phi coefficient (Cramér, 1946) was included, which expresses the relation between the assigned labels and the ground truth in a single number. These metrics were calculated both on the combined dataset of all different symbols and for each respective symbol; this distinction allowed comparison between symbols and finding potential issues in generalizability. These comparisons will show whether there are significant sector or industry differences for the approach. In general, predicting these labels is a problem where precision is significantly more important than recall and accuracy, and evaluation will be weighted proportionally towards it. From a trading perspective, if a positive prediction is made to enter a trade, there is a much higher requirement for that prediction to be right than for a negative prediction not to enter a trade: in the first case, a mistake has a cost, while in the second case there is only the potential to miss possible gains. This target also makes training more difficult, since the model needs to be incentivized to trade (not merely opt for always predicting the negative class) but still retain a high relative precision in its predictions.
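All of these metrics are available in scikit-learn; for binary labels, the Phi coefficient coincides with the Matthews correlation coefficient, so a sketch of the evaluation could look as follows (variable names assumed).

```python
from sklearn.metrics import (accuracy_score, matthews_corrcoef,
                             precision_score, recall_score)

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and the Phi coefficient (equal to the
    Matthews correlation coefficient for binary labels)."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "phi": matthews_corrcoef(y_true, y_pred),
    }
```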

4 Related Work

Unfortunately, most successful research on financial markets is performed behind closed doors and is not published but kept secret. Usually, when something is published, either the effect was already gone or no longer profitable for the original finder, or shortly after publishing, the effect was exploited by many market participants, effectively nullifying it (McLean and Pontiff, 2016). It has also been found that the time from publication until the point that an effect is no longer exploitable has substantially shortened over the years (McLean and Pontiff, 2016), further diminishing the value of published financial research.

Many papers report superior performance to traditional time series models such as Autoregressive Integrated Moving Average (ARIMA) (Box et al., 2015) using simple prediction algorithms (Egeli et al., 2003; Qian and Rasheed, 2007; Moews et al., 2019; Ticknor, 2013); however, most of these do not take into account transaction costs, either directly through the broker cost or indirectly through the spread. When these costs are factored in, the results are not usable in real-world scenarios. Additionally, many papers focus on predicting the price directly, achieving a low Mean Absolute Percentage Error (MAPE) (Patel et al., 2015; Weng et al., 2018); however, these effects are often not exploitable in a trading setting either (Reschenhofer et al., 2020).

Many studies on machine learning in financial markets focus on a daily timescale since these datasets are open and freely accessible. For this data, various approaches are taken. Some of the research focuses on constructing or optimizing a portfolio using machine learning (Yuan et al., 2020). Other research focuses on modelling the price series directly or indirectly using Long Short-Term Memory (LSTM) (Rebane et al., 2018; Nelson et al., 2017; Eapen et al., 2019) or Gated Recurrent Unit (GRU) (Dutta et al., 2020). Recently, the use of attention in combination with sequence models has become more prevalent for time series prediction, with many adapting standard Recurrent Neural Network (RNN) models with an attention layer (Kim and Kang, 2019; Zhang et al., 2019; Qin et al., 2017). Alternatively, reinforcement learning approaches have even been applied to market making with moderate reported success (Lim and Gorse, 2018; Spooner et al., 2018).

Another approach that has been taken is to predict realized variance instead (Sasaki et al., 2020; Carr et al., 2019). Although this is not directly exploitable in the same way as price, the information gained has many applications, from option pricing inputs to straight volatility trading.

Similar to our method, Zhao et al. (2018) converted daily stock market values into a market image, upon which regression is performed in an attempt to predict the returns. The fully supervised part is trained on the returns directly. They attained a lower reconstruction error than Principal Component Analysis (PCA) and reported mild success on reconstructions. Alternatively, Hu et al. (2018) take an entirely different approach by converting the market data to candlestick charts and converting these to representations using a Convolutional Neural Network (CNN), after which clustering is applied to pick a portfolio. By using the candlestick charts directly, the issue with normalization was solved at the expense of discarding possibly useful data. Both of these methods operate on the daily timescale, as opposed to the intraday timescale used for this thesis.

To the best of our knowledge, only Wu et al. (2020) has applied contrastive learning and mutual information optimization to financial time series. Their method showed the potential to advance the current state-of-the-art, further strengthening the argument for the application of Contrastive Predictive Coding (CPC) to the problem. The largest difference between their approach and ours is the timescale, with their research being on a daily timescale. Additionally, they have shown a significant improvement to their performance by applying an attention layer as well.

5 Experiments

5.1 Data

The market data provider used was Nanex (Nanex, 2020). The considered time interval was 01-04-2019 up to 30-08-2019, containing 107 trading days. The train-test split used was 01-04-2019 up to and including 31-07-2019 for the training set, and 01-08-2019 up to 30-08-2019 for the testing set. These intervals contain 85 and 22 trading days, respectively. Fifty random symbols were selected from the stock universe on the Toronto Stock Exchange (TSE). This selection included small-cap, mid-cap, and large-cap stocks. The TSE was chosen because it prevents numerous cross-exchange effects from occurring, as would happen in the United States (US) markets, where symbols are traded on multiple exchanges. Pre-market and aftermarket data was not included; the open was taken from the first executed trade in the regular market, up to the close, which was determined by the last trade in the regular market. The chosen ticker symbols are listed in Appendix B. In the interval, there were 55 dividend payments by the selected symbols, and there were no stock splits during the period.

5.2 Preprocessing

A 60-second interval was used for the time bars, which means a bar was generated for every minute. Since the TSE trading hours are 9:30 until 16:00, 390 such bars were generated for every trading day. For the information-driven bars, the division factor was set to 390 to get, on average, an equal number of volume bars as time bars. The chosen window size for the sequence was 120, so a single trading day contains approximately 270 exploitable points.

For the labels, a lookahead window of k = 12 was used, i.e., 12 bars into the future, which is one-tenth of the history. As a threshold for the movement, T = 0.5% was taken as a significant movement, and a threshold of T = 2 standard deviations was used for anomalies. Table 4 shows the fraction of the positive class for each label over the various bar types overall, and Figure 8 shows the distribution of these values across trading days. The thresholds were chosen so that the relative occurrences of the positive class are close to each other. The positive movement targets seem to be rising, with a peculiar peak around day 80, which might indicate more volatile times. However, this contradicts the intuition that more anomalies should be picked up during these volatile times, even though they remained steady in the figure.

Label         Time    Volume   Dollar
Movement      0.046   0.041    0.040
Up Movement   0.023   0.021    0.021
Anomaly       0.003   0.003    0.003

Table 4: Fraction of the positive class per label and bar type. Future anomaly has been omitted since it is equivalent to the anomaly label.


Before standardization, an Augmented Dickey-Fuller (ADF) test was performed on the intraday price series, yielding a statistic of −14.11 on average (standard deviation 6.28); since this is below the critical value of −3.43 at the 99% level, the null hypothesis of a unit root can be strongly rejected. Because of the high volume of events, a C++ extension was written for Python that performs the initial step of reducing the raw event stream to the various bar types. The extension can be found in a repository on GitHub [5].
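For reference, the per-day ADF statistics can be computed with statsmodels along the following lines. Aggregating the statistics by simple averaging over all symbol/day series is an assumption about how the reported mean and standard deviation were obtained.

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

def adf_statistics(close_series_per_day):
    """Compute the ADF test statistic for each intraday close-price series.

    close_series_per_day: iterable of 1-D numpy arrays, one per symbol/trading day.
    Returns the mean and standard deviation of the test statistics, together with
    the 1% critical value reported by statsmodels.
    """
    stats, critical_1pct = [], None
    for series in close_series_per_day:
        stat, pvalue, _, _, critical_values, _ = adfuller(series)
        stats.append(stat)
        critical_1pct = critical_values["1%"]
    return float(np.mean(stats)), float(np.std(stats)), critical_1pct
```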

5.3 Models

The base encoder used 5 layers with kernel sizes 10, 8, 4, 4, 4 and paddings 2, 2, 2, 2, 1, respectively. Only the strides were removed from the original implementation, since they consumed too much of the sequence. Additionally, the hidden state of the convolutional part was reduced to only 32 dimensions, and the autoregressive hidden state on top of the encoder was reduced to only four dimensions. The Contrastive Predictive Coding (CPC) code was adapted from Löwe et al. (2019) and can be found in a GitHub repository [6]. The variational autoencoder (VAE) latent space mapped from the four dimensions back to the 32 encoder dimensions for the µ and log σ² parameters, and the decoder was the inverse of the encoder, using deconvolution layers instead of convolution layers. The prediction step was set to 12 timesteps into the future, the same as the supervised labels' lookahead.
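A sketch of this encoder and autoregressive module is given below. The activation functions, channel handling, and the GRU choice for the autoregressive part are assumptions; the adapted code of Löwe et al. (2019) remains the authoritative implementation.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Stride-free convolutional encoder with the kernel sizes and paddings listed above."""

    def __init__(self, in_channels: int, hidden: int = 32, context: int = 4):
        super().__init__()
        kernels = [10, 8, 4, 4, 4]
        paddings = [2, 2, 2, 2, 1]
        layers, channels = [], in_channels
        for k, p in zip(kernels, paddings):
            layers += [nn.Conv1d(channels, hidden, kernel_size=k, padding=p), nn.ReLU()]
            channels = hidden
        self.conv = nn.Sequential(*layers)
        # Autoregressive context model on top of the 32-dimensional encodings.
        self.gru = nn.GRU(hidden, context, batch_first=True)

    def forward(self, x: torch.Tensor):
        # x: (batch, channels, time) -> z: (batch, time', hidden)
        z = self.conv(x).permute(0, 2, 1)
        c, _ = self.gru(z)  # c: (batch, time', context)
        return z, c
```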

These parameters were kept relatively constrained, mostly because of the low signal-to-noise ratio, so that only the most salient features are retained. Furthermore, the original parameters from Oord et al. (2018) were based on windows of 20480 timesteps with only a single channel, whereas the sequences for this problem are significantly shorter at 120 timesteps. With the original hidden sizes, the latent space would therefore be substantially larger relative to the input than in the original setting.

All models were trained for a maximum of 1000 epochs on the training set. In each epoch, a random window was extracted from each trading day per symbol, so that every symbol/trading-day combination contributed exactly one sample per epoch. This selection method ensures enough variation within a batch and no overlapping sequences within an epoch. Each sample in the dataset was seen approximately three times over the total training duration.
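The per-epoch sampling scheme can be sketched as follows; the dictionary-based data layout is a hypothetical choice for illustration.

```python
import numpy as np

def sample_epoch_windows(bars_by_key: dict, window: int = 120, rng=None):
    """Draw one random window per (symbol, trading day) combination for an epoch.

    bars_by_key maps (symbol, day) to a 2-D array of shape (num_bars, num_features).
    Returns an array of shape (num_keys, window, num_features).
    """
    rng = rng or np.random.default_rng()
    batch = []
    for bars in bars_by_key.values():
        start = rng.integers(0, len(bars) - window + 1)  # random, non-overlapping per epoch
        batch.append(bars[start:start + window])
    return np.stack(batch)
```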

Similar to Oord et al. (2018), the Adam optimizer (Kingma and Ba, 2014) was used with a learning rate of 1.5e−4 for all model versions, across all tasks. This learning rate allowed the models to learn quickly without being thrown off by anomalous samples with very high losses. The batch size was 32 samples. For the CPC task, 256 negative samples were drawn. Gradients were clipped to a maximum norm of 5 to prevent exploding gradients (Goodfellow et al., 2016). For speed reasons, negative samples were drawn using a method that has a slight probability of also selecting positive samples as negatives.

For the downstream classifiers, the Adam optimizer was also used, with a learning rate of 1e−3. Each downstream classifier was trained for 100 epochs on the pooled context vectors with a batch size of 64.
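A minimal sketch of such a downstream classifier is given below. The linear probe and the mean pooling over time are assumptions, as the exact head architecture is not restated here.

```python
import torch
import torch.nn as nn

class DownstreamClassifier(nn.Module):
    """Linear probe on pooled context vectors for a binary target."""

    def __init__(self, context_dim: int = 4):
        super().__init__()
        self.linear = nn.Linear(context_dim, 1)

    def forward(self, c: torch.Tensor) -> torch.Tensor:
        # c: (batch, time, context_dim); mean-pool over time before classifying.
        pooled = c.mean(dim=1)
        return self.linear(pooled).squeeze(-1)  # logits for BCEWithLogitsLoss

model = DownstreamClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()
```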

5.4 Results

All experiments were tracked using the Weights & Biases online service (Biewald, 2020). The loss curve for the CPC method is displayed in Figure 9a, the Evidence Lower Bound (ELBO) loss curves are in Figure 9b, and the Binary Cross-Entropy (BCE) losses are displayed in Figures 9c to 9f. Figure 9a shows that the InfoNCE loss is still decreasing, which might indicate that the upstream model could be trained even further. All other models in Figure 9, except the BCE on movement targets for time bars, seem to have converged after a few dozen epochs, after which they oscillate around the same level.

Figure 9 also shows that CPC does not exhibit any divergence between training and validation loss, indicating good generalization for all bar types. In contrast, the VAE seems to overfit on the time bars and underfit on the dollar and volume bars.

[5] https://github.com/mvdwerve/streambar

[6]


Figure 9: Training and validation loss for all trained representations on the various bar types. Panels: (a) InfoNCE, (b) ELBO, (c) BCE - movement, (d) BCE - movement up, (e) BCE - anomaly, (f) BCE - future anomaly. Loss curves show both the observed values and their rolling mean over 30 epochs.

There was no indication of posterior collapse, since the Kullback-Leibler (KL) term of the ELBO loss did not reach zero during training. Even though the training curves of the BCE movement objectives keep going down, their validation curves start to rise, clearly indicating overfitting. The BCE anomaly targets converged to the same level on every bar type. In all cases, the dollar and volume bars achieve slightly higher losses, with their curves lying strictly above the training and validation curves for the time bars. These information-driven bars might be more difficult to train on, despite having better statistical properties. This discrepancy suggests that some inherent temporal properties could not be correctly captured in the dollar and volume bars. The volume and dollar bars also seem to exhibit better generalization between the training and validation set; however, that may be attributed to a failure to learn useful features.

Figure 10 shows the Precision-Recall (PR) curves for the various tasks trained on the learned representations. Overall, CPC and the VAE achieve relatively higher precision for the estimations with the highest scores, with the exception of the downstream anomaly task using the VAE representations, which only achieved random precision. For every task, CPC strictly outperforms the VAE, indicating that explicitly modeling temporal dependence is beneficial across all of the defined downstream tasks. The worst performing task overall is the prediction of future anomalies.

Figure 10: PR curves for classification of the testing set on the various tasks using the downstream representations. Panels: (a) Movement (InfoNCE AUC 0.113, VAE AUC 0.095, BCE-Movement AUC 0.121), (b) Movement up (InfoNCE AUC 0.056, VAE AUC 0.051, BCE-Up-Movement AUC 0.063), (c) Anomaly (InfoNCE AUC 0.011, VAE AUC 0.004, BCE-Anomaly AUC 0.009), (d) Future Anomaly (InfoNCE AUC 0.004, VAE AUC 0.003, BCE-Future-Anomaly AUC 0.005). Dashed gray line indicates random precision.

Figure 11 shows the PR curve for all upstream representations evaluated on the positive target, which is a specialization of the movement up target with T = 0. The figure shows that all methods perform below an exploitable baseline of 50% precision, and that, notably, the VAE only achieves random precision. The fully supervised representations trained for the movement up downstream task seem to generalize best towards the positive task. This is expected, since the movement up labels are essentially a subset of the positive labels. Similar to the other tasks, CPC has better precision for the estimations with the highest probability, intersecting with the movement-target curve at around 30% recall. CPC also performs strictly better than the fully supervised representations trained on the anomaly targets. Similar to CPC, the fully supervised anomaly representations seem to encode useful information for the estimations with the highest scores. Together with Figure 10, this might indicate that CPC encodes more generally applicable information than the representations created by the supervised tasks. For the final classification scores, see Tables 5 to 9 in Appendix C.

To evaluate the usability of the representation loss as a feature, Figure 12 and Figure 13 show the PR curves for the representation loss from CPC and VAE as a naive classifier for the targets. When comparing Figure 12 to Figure 10, it seems that the representation loss as a standalone feature already accounts for much of the predictive power of the downstream classifier on the movement targets.
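Using the representation loss as a naive classifier amounts to treating the per-sample loss directly as a classification score; a minimal sketch using scikit-learn is given below.

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve

def pr_auc_from_loss(per_sample_loss: np.ndarray, target: np.ndarray) -> float:
    """Use the per-sample representation loss directly as a classification score."""
    precision, recall, _ = precision_recall_curve(target, per_sample_loss)
    return auc(recall, precision)
```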


Figure 11: PR curve for the positive target using learned representations from all upstream tasks (InfoNCE AUC 0.410, VAE AUC 0.388, BCE-Movement AUC 0.410, BCE-Up-Movement AUC 0.417, BCE-Anomaly AUC 0.404, BCE-Future-Anomaly AUC 0.401). Dashed gray line indicates random precision.

Despite the VAE's poor performance on the downstream anomaly task, Figure 13 shows that its representation loss is more useful for classification on the anomaly targets. Conversely, even though CPC works well on the anomaly target, its representation loss is not very indicative of it.

Figure 12: PR curves using the representation loss from CPC and VAE as a naive classifier for the movement targets on all different bar types. Panels: (a) Movement, (b) Movement up.

To apply the model in a trading setting, the top 5% of estimations were taken, and the positive target was evaluated on excess precision on a per-symbol basis, meaning the precision gained over the random-precision baseline of that specific symbol. Figure 14 reveals that this distribution varies considerably over the symbols across the different upstream methods. For the time bars, Figure 14a shows that the distribution for the representations trained on the movement tasks is the widest. Everything on the time bars has a mean of approximately zero, meaning that little extra precision was gained over random precision. The recall on these outliers is significantly lower than the globally set 5%, which also explains how the outliers can be so wide and suggests that the excess precision in those cases is mostly chance. The distribution of excess precision on the test set seems to be flatter than the corresponding distribution on the training set for all bar types. The wide distribution for CPC in Figure 14b and Figure 14c might be attributed to its loss not having fully converged yet, as can be seen in Figure 9a. Overall, this shows decent generalization of the downstream classifier on the positive target.
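The excess-precision evaluation can be sketched as follows, assuming a single global threshold at the top 5% of probability scores and the per-symbol base rate as the random-precision baseline.

```python
import numpy as np

def excess_precision_per_symbol(scores, targets, symbols, top_fraction=0.05):
    """Precision in the top `top_fraction` of scores minus each symbol's base rate.

    scores, targets and symbols are equally long 1-D arrays over all test samples.
    The score threshold is set globally; excess precision is computed per symbol.
    """
    threshold = np.quantile(scores, 1.0 - top_fraction)
    excess = {}
    for symbol in np.unique(symbols):
        mask = symbols == symbol
        selected = mask & (scores >= threshold)
        if selected.sum() == 0:
            continue
        precision = targets[selected].mean()
        base_rate = targets[mask].mean()  # random-precision baseline for this symbol
        excess[symbol] = precision - base_rate
    return excess
```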

To evaluate these decisions on profitability, an average return per trade is calculated in Basis Points (BPS). These averages are shown in Figure 15, which displays the mean and standard deviation of the return per trade hypothetically taken on the positive target.


Figure 13: PR curves using the representation loss from CPC and VAE as a naive classifier for the anomaly targets on all different bar types. Panels: (a) Anomaly, (b) Future Anomaly.

It also shows that the effects found are far too small to be exploited, with most of them falling around a mean of zero basis points. In general, the more trades that can be made, the smaller the required margin per trade; however, these predictions fall at least an order of magnitude short of being useful for trading, since they cannot overcome the costs associated with trading.
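The return-per-trade calculation can be sketched as follows. Converting the k-bar forward return of each positive prediction to basis points and averaging is an assumption about the exact procedure, in line with the description above.

```python
import numpy as np

def mean_return_bps(close: np.ndarray, predictions: np.ndarray, k: int = 12):
    """Mean and standard deviation of the hypothetical return per trade, in basis points.

    close: bar close prices; predictions: boolean array aligned with close[:-k],
    True where the classifier signals a positive move over the next k bars.
    """
    fwd_return = close[k:] / close[:-k] - 1.0
    trades = fwd_return[predictions]          # returns of the hypothetical trades
    return 10_000.0 * trades.mean(), 10_000.0 * trades.std()
```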

Figure 15a shows that the representations from CPC and the VAE yield negative profits on the testing set, despite having fairly focused excess precision. This negative profit can be attributed to the fact that most of the final precision scores remain below the 50% exploitable baseline, causing a net loss. Figure 15b shows trades on the testing set that are, on average, positive by a few BPS. There remains a large discrepancy between training and testing set performance for most representations. Figure 15c shows that all means are relatively close to each other between training and testing set, indicating better generalization on this downstream task.

Figure 16 shows the Area Under the Curve (AUC) for all tasks for classifiers not strictly trained on their upstream task. The representations from the movement and anomaly tasks seem to encode information only applicable to their own objectives and do not generalize beyond them. CPC works almost as well as the supervised method on every task, except for the anomaly target, where it achieves a better AUC than the supervised method. This suggests that CPC generalizes best across all tasks. In contrast, Figure 11 seems to indicate that the anomaly representations should work reasonably well on the movement tasks, even though this is not reflected in Figure 16a and Figure 16c.


Figure 14: Excess precision per symbol on the positive target for various bar types, evaluated on the top 5% probability scores. Panels: (a) Time, (b) Volume, (c) Dollar. Higher values are better.


Figure 15: Mean return per trade in BPS for a naive trading strategy over various bar types, with bar edges encompassing two standard deviations. Calculated by evaluating the BPS gain at every positive prediction made by the model on the positive target at the threshold for 5% recall. Panels: (a) Time, (b) Volume, (c) Dollar.


Figure 16: AUC for each downstream target using classifiers trained on every upstream representation. Panels: (a) Movement, (b) Anomalies, (c) Positive.
