
MSc Computational Science

Master Thesis

Interpretable Deep Learning for Stochastic Financial Scenario Generation

by Joop van de Heijning

August 21, 2018

Daily Supervisors: Dr. P.F.A. Tuijp and Mr. A.A. de Geus MSc
Academic Supervisor and Examiner: Prof. Dr. B.D. Kandhai
Second Assessor: Mr. I. Anagnostou MSc


Abstract

This thesis investigates whether it is possible to train a potentially useful yet interpretable deep learning model for stochastic financial scenario generation. The research introduces a fully probabilistic and autoregressive model with multivariate input and output. It is capable of generating stochastic scenarios of financial returns and changes, which, when averaged, yield multi-step forecasts with confidence intervals. The generated scenarios' main statistics are comparable to observed statistics. Adjusting and applying a well-known attribution method typically used in image classification makes the model interpretable for evaluation, exploration and understanding in time series applications.


Acknowledgements

I thank the numerous people who helped me with the thesis. Special thanks to Alex de Geus and Patrick Tuijp (in no particular order) for their extensive daily guidance, to Drona Kandhai for his academic supervision and to Ioannis Anagnostou for his comments and initial brainstorming.


Contents

1 Introduction
2 Literature Review
  2.1 Deep Learning
    2.1.1 History of Deep Learning
    2.1.2 Learning a Multilayer Perceptron
    2.1.3 Convolutional Neural Networks
    2.1.4 Theoretical Underpinnings
  2.2 Generative Deep Learning Models
  2.3 Financial Forecasting using WaveNet
  2.4 Interpretability of Deep Learning Models
  2.5 Financial Markets Predictability
3 Data
4 Model
  4.1 Architecture
    4.1.1 Causal Convolution
    4.1.2 Dilated Convolutions
    4.1.3 Output Layers
    4.1.4 Regularization
  4.2 Generation Process
5 Interpretability
  5.1 Integrated Gradients
  5.2 Image Example
  5.3 Specializing to the Model
  5.4 Caveats
6 Experiments and Results
  6.1 Generation
  6.2 Forecasting
  6.3 Interpretation
    6.3.1 Perturbation Analysis A
    6.3.2 Perturbation Analysis B
    6.3.3 Qualitative Analysis
7 Conclusion
References
A Sine Waves
  A.1 Verification Integrated Gradients with Random Sine Waves

1 Introduction

Inspired by recent advances in the generation and forecasting of time series using deep learning, and by techniques for interpreting such models, this thesis investigates whether it is possible to train a potentially useful yet interpretable deep learning model for stochastic financial scenario generation. This work contributes to the literature by:

1. Introducing an autoregressive generative model based on the deep neural network WaveNet [Van Den Oord et al., 2016a]. Trained on raw input data, the model generates stochastic equity returns and interest rate changes with main statistics comparable to observed statistics.

2. Exploring the use of the model for forecasting with confidence intervals on monthly returns or changes by averaging the model's multivariate stochastic scenarios for equities and interest rates conditioned on observed and generated data. An experiment finds t + 1 forecasts to be comparable with the naive forecast.

3. Adjusting the state-of-the-art input attribution technique Integrated Gradients, typically used in static contexts such as computer vision [Sundararajan et al., 2017], to a financial time series context to relate the model's predictions to underlying economic mechanisms. Two quantitative experiments and one qualitative example evaluate the performance of the method and find that the technique is robust and assigns attributions as objectively expected.

The algorithmic modeling approach of deep learning has recently dramatically improved the state of the art in domains such as visual object recognition and speech recognition [Krizhevsky et al., 2012, Van Den Oord et al., 2016a]. [LeCun et al., 2015] expect deep learning to continue to have successes in more fields in the near future.

The effectiveness of any learning machine is limited by the machine's (in)ability to explain its decisions and actions to human users. Explainable AI is essential to enable human users to understand, appropriately trust, and effectively manage the emerging generation of artificially intelligent partners [Gunning]. Techniques of interpretation for deep learning are also becoming increasingly popular as tools for exploration and analysis in the sciences [Montavon et al., 2018].

2 Literature Review

2.1 Deep Learning

One approach to artificial intelligence (AI) is systems extracting patterns from raw data, known as machine learning. The performance of machine learning algorithms depends heavily on the representation in features of the data they are given [Goodfellow et al., 2016]. When it is difficult to know what features should be extracted, machine learning can discover not only the mapping from representation (the description of the data in a particular way) to output but also the representation itself. This approach is known as representation learning. When it is nearly as difficult to obtain a representation as to solve the original problem, representation learning does not, at first glance, seem to help. [Goodfellow et al., 2016]

Deep learning solves this central problem in representation learning by introducing representations that are expressed in terms of other, simpler representations. Deep learning enables the computer to build complex concepts out of simpler concepts [Goodfellow et al., 2016, Zeiler and Fergus, 2014]. E.g. [Hua, 2018] shows that a WaveNet [Van Den Oord et al., 2016a] trained on sound waves learns to internally represent pitch and spectral features. In the financial setting a deep learning system might represent the concept of a strong positive return in t + 1 by combining simpler concepts such as short term patterns, which are in turn defined in terms of simple differences between two subsequent past returns.

Machine learning researchers distinguish between supervised and unsupervised learning, where in the latter case the model learns not by seeing the true labels of every observation, but e.g. by mapping to the input itself with some form of compression in between. This can be used for finding simple representations in the compression layers. Another category is that of generative networks; models which learn the distribution of the data set and are then able to generate new data from the same distribution.

2.1.1 History of Deep Learning

Broadly speaking, there have been three waves of development: deep learning known as cybernetics in the 1940s–1960s, deep learning known as connectionism in the 1980s–1990s, and the current resurgence under the name deep learning beginning in 2006 [Goodfellow et al., 2016].

Some of the earliest learning algorithms were intended to be computational models of biological learning, that is, models of how learning happens or could happen in the brain. As a result, one of the names that deep learning has gone by is artificial neural networks. The earliest predecessors of modern deep learning from the era of cybernetics were simple linear models learning a set of weights $w_1, .., w_n$ and computing their output $f(x, w) = x_1 w_1 + .. + x_n w_n$. The training algorithm used to adapt the weights of some of these models was a special case of an algorithm called stochastic gradient descent. Slightly modified versions of the stochastic gradient descent algorithm remain the dominant training algorithms for deep learning models today. Linear models have many limitations however, such as not being able to learn the XOR function. Critics who observed these flaws caused a backlash against biologically inspired learning in general. This was the first major dip in the popularity of neural networks. [Goodfellow et al., 2016]

In the 1980s, the second wave of neural network research emerged in great part via a movement called connectionism. The central idea in connectionism is that a large number of simple computational units can achieve intelligent behavior when networked together.


A key concept that arose is that of distributed representation. This is the idea where for example, instead of having nine separate neurons activate for cars, trucks and birds of colors red, green and blue, a distributed representation is used with three neurons describing the color and three neurons describing the object identity. Another major accomplishment of the connectionist movement was the successful use of back-propagation to train deep neural networks with internal representations. This algorithm is currently the dominant approach to training deep models. This second wave of neural networks research lasted until the mid-1990s. When AI research did not fulfill unrealistically ambitious expectations and other fields of machine learning started achieving good results on many tasks, popularity of neural networks declined until 2007. [Goodfellow et al., 2016]

The third wave of neural networks research popularized the use of the term "deep learning" to emphasize that researchers were now able to train deeper (longer sequences of input/output layers) neural networks than had been possible before, and to focus attention on the theoretical importance of depth. At this time, deep neural networks outperformed competing AI systems based on other machine learning technologies as well as hand-designed functionality. [Goodfellow et al., 2016]

One of the reasons for the recent success of deep learning is that today more resources are available. "Big Data" has made machine learning easier because the key burden of statistical estimation (generalizing well to new data after observing only a small amount of data) has been considerably lightened. When data is scarce, data augmentation techniques can enlarge a data set to obtain similar benefits [Hernandez-Garcia and Konig, 2018]. Another key resource is the computational ability to run much larger models [Goodfellow et al., 2016].

2.1.2 Learning a Multilayer Perceptron

This and the next subsection describe two canonical deep learning architectures, starting with the multilayer perceptron (MLP) together with the basic learning algorithm. An MLP is an example of a simple supervised deep learning model. It is just a mathematical function mapping some set of input values to output values, formed by composing many simpler functions. An MLP with $l$ layers, predicting output $\hat{y} \in \mathbb{R}^{d_l}$ associated with an input $x \in \mathbb{R}^{d_0}$, consists of the following equations:

$$h_0 \equiv x, \quad (2.1)$$
$$z_1 = W_1 h_0 + b_1, \quad (2.2)$$
$$h_1 = f_1(z_1), \quad (2.3)$$
$$\vdots$$
$$z_{l-1} = W_{l-1} h_{l-2} + b_{l-1}, \quad (2.4)$$
$$h_{l-1} = f_{l-1}(z_{l-1}), \quad (2.5)$$
$$z_l = W_l h_{l-1} + b_l, \quad (2.6)$$
$$\hat{y} = f_l(z_l), \quad (2.7)$$

with $h_i, z_i \in \mathbb{R}^{d_i}$, $W_i \in \mathbb{R}^{d_i \times d_{i-1}}$, $b_i \in \mathbb{R}^{d_i}$, and $f_i$ generally being a non-linear activation function. If one-dimensional, $\hat{y}$ often represents the probability of a certain event conditioned on $x$, with $f_l$ then a squashing function from zero to one. Figure 1 shows a visual representation of an MLP.


Figure 1: Multilayer perceptron [LeCun et al., 2015].
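A minimal NumPy sketch of the forward pass in equations 2.1-2.7; the layer sizes, the ReLU hidden activation and the sigmoid output are illustrative assumptions, not choices made in this thesis:

```python
import numpy as np

def mlp_forward(x, weights, biases):
    """Equations 2.1-2.7: h_0 = x; z_i = W_i h_{i-1} + b_i; h_i = f_i(z_i)."""
    h = x                                   # h_0 = x (equation 2.1)
    for W, b in zip(weights[:-1], biases[:-1]):
        z = W @ h + b                       # linear pre-activation (equations 2.2, 2.4)
        h = np.maximum(0.0, z)              # ReLU as the hidden non-linearity f_i
    z_l = weights[-1] @ h + biases[-1]      # final linear layer (equation 2.6)
    return 1.0 / (1.0 + np.exp(-z_l))       # sigmoid squashing f_l for a probability (2.7)

# Illustrative dimensions d_0 = 4, d_1 = 8, d_2 = 1
rng = np.random.default_rng(0)
weights = [rng.normal(size=(8, 4)), rng.normal(size=(1, 8))]
biases = [np.zeros(8), np.zeros(1)]
print(mlp_forward(rng.normal(size=4), weights, biases))  # a value in (0, 1)
```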

To train the network parameters W and b, gradient descent is applied iteratively to minimize L, in the simplest case with the following updates:

$$W^i_{t+1} = W^i_t - \alpha \frac{\partial L}{\partial W^i_t}, \quad (2.8)$$
$$b^i_{t+1} = b^i_t - \alpha \frac{\partial L}{\partial b^i_t}, \quad (2.9)$$

with L a loss function parameterized by W, e.g.

$$L(\hat{y}, y; W) = \frac{1}{2} \|\hat{y} - y\|_2^2 \quad (2.10)$$

with y the true label and α the learning rate hyper-parameter. Gradient descent optimization is usually done by calculating average gradients in mini-batches with

$$1 < \text{mini-batch size} < M, \quad (2.11)$$

with M being the total number of observations x in the data set $X \in \mathbb{R}^{M \times d_0}$; the procedure is then called stochastic gradient descent. This is not only more practical from a memory management standpoint for large data sets, but is in practice often more efficient regardless. Many enhancements to basic stochastic gradient descent have been proposed, such as Adam, which computes individual adaptive learning rates for different network parameters from estimates of first and second moments of the gradients [Kingma and Ba, 2014]. Back-propagation refers to the method of calculating the gradients in consecutive layers from top to bottom by the chain rule.
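As an illustration of updates 2.8-2.9 with mini-batches as in 2.11, the following sketch runs stochastic gradient descent on the squared loss 2.10 for a plain linear model; the synthetic data, batch size and learning rate are assumptions for demonstration only:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))           # M = 1000 observations, d_0 = 3
y = X @ np.array([1.0, -2.0, 0.5])       # targets from a known linear rule
W, b = np.zeros(3), 0.0
alpha, batch = 0.1, 32                   # learning rate and mini-batch size (1 < 32 < M)

for step in range(200):
    idx = rng.choice(len(X), size=batch, replace=False)  # sample a mini-batch
    err = X[idx] @ W + b - y[idx]                        # y_hat - y
    W -= alpha * (X[idx].T @ err) / batch                # equation 2.8, averaged gradients
    b -= alpha * err.mean()                              # equation 2.9
print(W)  # approaches [1.0, -2.0, 0.5]
```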

The central challenge in machine learning is to minimize the expected value of the loss on unobserved inputs. Typically this generalization error is estimated on a test set of examples that were collected separately from the training set. Under i.i.d. assumptions for the data the factors determining this error are:

1. The error on the training set.

2. The gap between the error on the training set and the error on the test set.

These two factors correspond to underfitting and overfitting respectively. Altering the model's capacity (its ability to fit a wide variety of functions) can control whether the model is likely to underfit or overfit [Goodfellow et al., 2016]. E.g. increasing the number of parameters increases the capacity, thereby decreasing underfitting but increasing the risk of overfitting. Increasing the number of observations in the training set makes overfitting harder with no influence on capacity. Regularization such as weight decay decreases the capacity. The loss function with weight decay becomes

$$L = \frac{1}{2} \|\hat{y} - y\|_2^2 + \frac{\beta}{2} \|W\|_2^2 \quad (2.12)$$

with β a hyper-parameter determining the strength of the regularization. To determine the hyper-parameter settings it is usual to use a separate held-out data set, the development set.

2.1.3 Convolutional Neural Networks

Convolutional neural networks (CNN) are designed to process data that come in the form of multiple arrays, for example a color image composed of three 2D arrays containing pixel intensities in the three color channels [LeCun et al., 2015]. In this work the data comes in the form of time series composed of four 1D arrays containing returns or changes of four financial variables.

In the simplest version of a CNN the equations from above stay the same as in the MLP, except that the linear equations in some or all layers are replaced by multiple convolutions with ∗ standing for the convolution operator, i.e.

$$z_i = W_i h_{i-1} + b_i \quad \rightarrow \quad z_i = W_i * h_{i-1} + b_i \quad (2.13)$$

which works as in the following one-dimensional example:

$$\begin{pmatrix} w_1 & w_2 & 0 \\ 0 & w_1 & w_2 \end{pmatrix} \begin{pmatrix} h_1 \\ h_2 \\ h_3 \end{pmatrix} + \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} = \begin{pmatrix} z_1 \\ z_2 \end{pmatrix} \quad (2.14)$$

where $[w_1\ w_2]$ is a 1x2 filter and $[z_1\ z_2]^T$ the feature map before the non-linearity.

Usually multiple filters have created multiple feature maps in the previous layer, and the filter in the example then becomes a two-dimensional filter with depth equal to the number of feature maps, e.g. $[[w_1^{(1)}\ w_2^{(1)}], [w_1^{(2)}\ w_2^{(2)}]]$, where $[w_1^{(1)}\ w_2^{(1)}]$ are the weights for the first and $[w_1^{(2)}\ w_2^{(2)}]$ for the second feature map in the case of two maps. The filter then slides over all feature maps at once with a specifiable stride (stride is one in the above example), linearly combining the results, resulting in one feature map $[z_1\ z_2]^T$ per filter.

The concept of multiple feature maps per layer is referred to as channels, similar to the RGB color channels in an image. Figure 2 gives a visual impression of how one 2x2 filter convolved over a single-channel 2D input produces one feature map.
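A sketch of the sliding-filter view of equation 2.14; the input and filter values are arbitrary:

```python
import numpy as np

def conv1d(h, w, b, stride=1):
    """Slide filter w over input h with the given stride, as in equation 2.14."""
    n = (len(h) - len(w)) // stride + 1
    return np.array([h[i * stride:i * stride + len(w)] @ w + b for i in range(n)])

h = np.array([1.0, 2.0, 3.0])       # input (h_1, h_2, h_3)
w = np.array([0.5, -1.0])           # 1x2 filter (w_1, w_2)
print(conv1d(h, w, b=0.0))          # feature map (z_1, z_2) = (-1.5, -2.0)
```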

The reason for this architecture is twofold. First, in array data such as images, local groups of values are often highly correlated, forming distinctive local motifs that are easily detected. Second, the local statistics of images and other signals are invariant to location. In other words, if a motif can appear in one part of the image, it could appear anywhere, hence the idea of units at different locations sharing the same weights and detecting the same pattern in different parts of the array. Mathematically, the filtering operation performed by a feature map is a discrete convolution, hence the name. [LeCun et al., 2015]


Figure 2: Convolutional Neural Network with a convolution over the input producing a feature map [Goodfellow et al., 2016].

Deep neural networks exploit the property that many natural signals are compositional hierarchies, in which higher-level features are obtained by composing lower-level ones. In images, local combinations of edges form motifs, motifs assemble into parts, and parts form objects [LeCun et al., 2015]. In financial time series local combinations of returns may form simple patterns assembling into more abstract patterns representing concepts of impending positive or negative returns.

The CNN architecture came under the spotlight following the breakthrough in image classification by [Krizhevsky et al., 2012]. In recent literature, trailing the success of WaveNet in sound wave generation [Van Den Oord et al., 2016a], CNNs are also becoming more popular in the domain of time series [Bai et al., 2018].

2.1.4 Theoretical Underpinnings

The universal approximation theorem [Hornik et al., 1989] guarantees that, regardless of what function it is trying to learn, a large enough feedforward network with at least one hidden layer (such as an MLP or CNN) will be able to represent this function. Still, learning the function can fail, either because the training procedure cannot find the corresponding function parameters or because overfitting leads to the wrong function being chosen. Empirically, networks with greater depth (i.e. deeper networks) generalize better for a wide variety of tasks [Goodfellow et al., 2016]. According to the no-free-lunch theorem, no search algorithm exists that on average over all possible data distributions is better than any other algorithm. Yet, when making assumptions about the distribution (such as those made in the discussion on CNNs), algorithms can be designed to perform well on that specific distribution [Wolpert, 1996].

2.2 Generative Deep Learning Models

Generative models in statistics are models that learn the true data-generating joint distribution p(x) from the data x so as to generate new data points with variations. p(x) can be learned with supervised (deep) learning by splitting p(x) into n supervised learning problems via the chain rule:

$$p(x) = \prod_{i=1}^{n} p(x_i | x_1, .., x_{i-1}). \quad (2.15)$$

Popular deep generative models include Variational Autoencoders, rooted in Bayesian inference [Kingma and Welling, 2013], and Generative Adversarial Networks (GAN), which pose the problem in a game theoretic framework [Goodfellow et al., 2014], with [Chen et al., 2018] an example of using GANs for generating time series in the space of renewable energy. Autoregressive models instead train a network modeling the conditional distribution given previous values, as in equation 2.15 for p(x) given above. [Van Den Oord et al., 2016b] model the distribution of natural images in this manner pixel by pixel, while [Van Den Oord et al., 2016a] introduce WaveNet, an autoregressive convolutional neural network for generating raw audio waveforms.

The literature is sparse on generative deep learning models reproducing the stylized facts of fat tails, close to zero autocorrelation and volatility clustering in equity and interest rate returns. This thesis intends to fill this gap by taking the autoregressive approach, extending WaveNet [Van Den Oord et al., 2016a] to make the model multivariate from input to output. [Ruder, 2017] reviews how a multi-task learning setup, such as a multivariate output, often leads to improved generalization capability in deep neural networks.

Convolutional neural networks such as WaveNet [Van Den Oord et al., 2016a], are best known for image recognition tasks [LeCun et al., 2015]. [Bai et al., 2018] suggest that it may be time to consider convolutional neural networks as the default ’go to’ architecture for sequence modeling as well.

2.3 Financial Forecasting using WaveNet

[Bao et al., 2017, Binkowski et al., 2017, Zhou et al., 2018] are examples of deep learning models making one-step ahead financial point forecasts. [Borovykh et al., 2017] extends WaveNet [Van Den Oord et al., 2016a] by bringing it into the finance domain and making the input multivariate to condition the network on past values of more than one variable, while still outputting one-step ahead single variable point forecasts. [Chen et al., 2018] discuss forecasting renewable energy production with stochastic scenario generation using GANs. This thesis extends [Borovykh et al., 2017] by introducing a WaveNet based model forecasting multi-step multivariate returns or changes with confidence intervals by generating stochastic scenarios.

2.4 Interpretability of Deep Learning Models

While deep learning has achieved state-of-the-art predictive accuracy in many domains [LeCun et al., 2015], it is also essential to verify for a given task that the high measured accuracy results from the use of a proper problem representation, and not from the exploitation of artifacts in the data. Techniques for interpreting and understanding what the model has learned are therefore an important part of a robust verification procedure, especially in highly regulated areas. Interpretation techniques are also becoming popular as a tool for exploration and analysis in the sciences. In combination with deep learning models, these techniques have been able to extract new insights from diverse complex systems [Montavon et al., 2018]. Finally, such methods can be used to provide a rationale for a model's output, helping the user to understand the strengths and weaknesses of a model and compensate for them [Sundararajan et al., 2017].

This thesis focuses on interpretability in the sense that a trained model is given and the goal is to understand what the model predicts in terms of the readily interpretable input variables [Montavon et al., 2018]. The research does not try to explain the model’s inner workings or shed light on its internal representations as in e.g. [Zeiler and Fergus, 2014].

Specifically, the question is, what makes a given data point x representative of a certain concept encoded at the output of the model. The output neuron that encodes this concept can be described as a function f (x) of the input. Then each element of x receives an attribution score determining how relevant this feature is for explaining f (x) [Montavon et al., 2018].

One technique for calculating these attributions is inspecting the gradients of f (x) with respect to x, e.g. [Baehrens et al., 2010], which are easily obtainable with back-propagation in most deep neural networks. A problem is that the method relates to the local variation of the function rather than to the function value itself [Montavon et al., 2018].

For a concrete example, consider

$$f(x) = 1 - \max(0, 1 - x), \quad (2.16)$$

with f possibly representing the probability assigned to an event conditioned on $x \in [0, \infty)$, see Figure 3. Suppose the input changes from x = 0 to x = 2 and the goal is to obtain the attribution of x = 2 to f(2). The function changes from 0 to 1, but because f becomes flat at x = 1, the gradient method gives an attribution of 0 to x = 2. [Sundararajan et al., 2017]

Figure 3: $f(x)$.

The example also illustrates the need for a baseline input: People usually do not ask what aspect of x is representative of f (x) in isolation [Sundararajan et al., 2017]; in the example we ask for the attribution of x = 2 to the change of f (2) from f (0), where x = 0 was the baseline input.


Integrated Gradients, introduced by [Sundararajan et al., 2017], improves on the gradients method by averaging the gradients obtained by scaling the input x in tiny steps from a neutral baseline value to the actual value. In the example above the gradient for $x \in [0, 1)$ is positive, leading to a positive attribution as expected for the change in f. Chapter 5 on Integrated Gradients gives a more comprehensive explanation of the method. [Ancona et al., 2018] show in a theoretical and empirical analysis that Integrated Gradients should in general perform at least as well as three other recent attribution methods in assigning correct input relevance. [Ghorbani et al., 2017] show that, in an empirical comparison with three recent attribution methods, Integrated Gradients is the least sensitive to adversarial attacks (engineering the input to fool the model).

[Kumar et al., 2017] assign attributions to previous financial returns in an autoregressive deep learning model with an older method based on [Zeiler and Fergus, 2014] in a context of one step ahead binary up or down forecasts. Inspired by [Kumar et al., 2017], this thesis brings Integrated Gradients into the domain of finance to relate the model’s predictions to underlying economic mechanisms. This work adds to the literature in general and to Integrated Gradients specifically by demonstrating the use of this method of attribution, typically used in contexts such as image classification, in the continuous domain of financial returns or changes.

2.5 Financial Markets Predictability

The discussion on whether it is possible to predict financial markets at all is still ongoing. [Malkiel, 2003] is a well-cited example of such a discussion. As long as it has not been proven impossible, I believe there is room for studies such as this one investigating forecastability.

3 Data

The data consists of equities, government yields and corporate yields for some of the major developed countries or regions. Table 1 shows these variables and their sources and start dates per country or region.

Table 1: Variables and sources per country; the end date is 2018-02-28 for all.

Australia (start date 2000-01-04)
    2 year generic government yield                     Bloomberg
    10 year generic government yield                    Bloomberg
    ICE BofAML Australia Corporate Index 7-10y yield    DataStream
    AS30 Index                                          Bloomberg

Canada (start date 1997-01-02)
    2 year generic government yield                     Bloomberg
    10 year generic government yield                    Bloomberg
    ICE BofAML Canada Corporate Index 7-10y yield       DataStream
    SPTSX Index                                         Bloomberg

Eurozone (start date 1998-04-14)
    2 year generic government yield                     Bloomberg
    10 year generic government yield                    Bloomberg
    ICE BofAML Euro Corporate Index 7-10y yield         DataStream
    ESTX Index                                          Bloomberg

Japan (start date 2000-01-04)
    2 year generic government yield                     Bloomberg
    10 year generic government yield                    Bloomberg
    ICE BofAML Japan Corporate Index 7-10y yield        DataStream
    TPX Index                                           Bloomberg

United Kingdom (start date 1998-12-22)
    2 year generic government yield                     Bloomberg
    10 year generic government yield                    Bloomberg
    Sterling Investment Grade Index 7-10y yield         Bloomberg
    ASX Index                                           Bloomberg

United States (start date 1986-10-31)
    2 year generic government yield                     Bloomberg
    10 year generic government yield                    Bloomberg
    ICE BofAML US Corporate Index 7-10y yield           DataStream
    SPX Index                                           Bloomberg

All data are daily closing values from the source and the end date for all countries is the 28th of February 2018. Table 2 shows descriptive statistics per country of monthly changes (2 year, 10 year and corporate yields) and monthly returns (equities), with start date per country according to Table 1 and end date 2018-02-28, with the first order (absolute) autocorrelations as 'Ac1' ('Acabs'). Figure 4 shows correlations and distributions of monthly returns or changes from country start dates to 2018-02-28.

Table 2: Data descriptions per country of monthly changes (2 year, 10 year and corporate yields) and monthly returns (equities), with start date per country according to Table 1 and end date 2018-02-28, and the first order (absolute) autocorrelations as 'Ac1' ('Acabs'). Mean, Median and Standard Deviation are in percentages, the other descriptions in decimals.

Name               Mean   Median  Stdev  Skew    Kurt    Min    Max    Ac1    Acabs

Australia
2y yield change    0.00    0.00   0.18   0.62    6.51   -0.01   0.01  -0.03   0.04
10y yield change  -0.02   -0.03   0.24   0.00    0.16   -0.01   0.01   0.02   0.02
crp yield change   0.00   -0.01   0.18   0.16    3.01   -0.01   0.01  -0.03   0.28
equity return      0.39    0.60   4.05  -0.70    3.42   -0.21   0.14  -0.08  -0.02

Canada
2y yield change   -0.01   -0.02   0.17   0.61    4.60   -0.01   0.01   0.09   0.12
10y yield change  -0.02   -0.03   0.21   0.02    0.11   -0.01   0.01  -0.15  -0.05
crp yield change  -0.00   -0.00   0.13   0.62    8.29   -0.01   0.01   0.37   0.36
equity return      0.52    1.05   4.82  -0.91    3.48   -0.26   0.14   0.08   0.17

Eurozone
2y yield change    0.00   -0.02   0.16   0.84    2.01   -0.00   0.01  -0.02   0.09
10y yield change  -0.02   -0.04   0.20   0.22   -0.42   -0.00   0.01   0.04   0.12
crp yield change  -0.00   -0.01   0.17   0.40    7.57   -0.01   0.01   0.31   0.28
equity return      0.17    0.65   5.58  -0.76    2.30   -0.25   0.16   0.06   0.07

Japan
2y yield change   -0.01   -0.01   0.09   0.77    3.72   -0.00   0.00  -0.14   0.14
10y yield change  -0.01   -0.02   0.10   0.90    3.17   -0.00   0.00   0.03   0.17
crp yield change  -0.00   -0.01   0.08   0.36    5.79   -0.00   0.00  -0.09   0.18
equity return      0.18    0.62   5.80  -0.58    2.65   -0.29   0.16   0.03   0.06

United Kingdom
2y yield change    0.01    0.01   0.18   1.33    9.44   -0.01   0.01   0.02   0.12
10y yield change  -0.01   -0.03   0.23  -0.14    1.64   -0.01   0.01  -0.00   0.12
crp yield change   0.00    0.01   0.22  -0.59   10.82   -0.01   0.01   0.43   0.52
equity return      0.28    0.81   4.51  -0.39    0.94   -0.14   0.13  -0.07   0.29

United States
2y yield change   -0.00   -0.02   0.17   0.26    2.16   -0.01   0.01   0.07   0.09
10y yield change  -0.01   -0.02   0.28   0.06    0.98   -0.01   0.01   0.03   0.11
crp yield change   0.00   -0.00   0.20  -0.30   16.90   -0.02   0.01   0.29   0.45
equity return      0.76    1.00   4.66  -0.38    3.01   -0.20   0.19  -0.03   0.30


Figure 4: Scatter plots and distributions of monthly changes (2 year, 10 year and corporate yields) and monthly returns (equities) for the United States from 1986-10-31 to 2018-02-28.

As in [… et al., 2016, Salamon and Bello, 2017], in order to improve the generalization capability of the network, this work artificially increases the number of observations in the data set in three ways:

• Putting observations from various countries and variations of monthly periods (in this work monthly is a period of 22 business days by default) in the same set. The country and period of the observation then condition the network, similar to how [Van Den Oord et al., 2016a] condition WaveNet on speaker ID.

• Obtaining monthly returns day-by-day with a sliding window over daily data.

• Adding the same observation again with random noise added.

The steps below demonstrate how to obtain the training, development and test sets. Table 3 shows the preprocessing parameters.

Table 3: Preprocessing parameters.

Parameter                            Value
Periods                              (11, 22, 44) days
Noise thresholds down                (.25, .125, .0625)
Noise thresholds up                  1 - noise thresholds down
Number of quantiles                  16
Number of returns per observation    18

1. Download data from source.

2. Linearly interpolate missing values, of which there are approximately two hundred per country. Most of these missing values are holidays where equity markets were closed, but up to date yields for bonds were still obtained.

3. Get changes or returns for various periods by taking the monthly (and variations of monthly) first differences of the yields and the percentage changes of the equities per country for every day in the date range. For example, for a period of 22 business days, the first value in the series records the return of the last 22 business days, and the second value records the return of the last 22 business days starting just one day later, etcetera. Taking various periods (see Table 3) gains an expanded number of data sets, each containing different period returns.

4. Standardize: Calculate the mean and standard deviation for each variable per country not using the data that will become the test set. Subtract the mean and divide by the standard deviation.

5. Calculate quantiles (Table 3) by dividing the returns or changes sorted by value in equal segments for every country, period and variable. Remember the means of every quantile for re-transformation.

6. Quantize values by assigning them the number of the quantile they belong to.

7. Add two new variables c, p to every observation indicating by an integer which country the data is from and which variation of monthly change/return is used.

8. Separate and expand the number of sets. For every variation of monthly (or variations thereof) returns obtained in step 3, these returns were calculated on every subsequent day in the set, without skipping. Now separate every one of these sets into several shorter sets where every return is followed by a return exactly one month (or variation thereof) later, which is the normal sequence of the data. E.g. the 22 business day return at day 23 will be followed by the next 22 day return at day 45.

9. Separate these series into sequences with the desired number of lags (Table 3), forming the observations of the data set, by rolling a window of the observation length over the series.

10. Add noisy time series to expand the data set even more. This work adds to the data set for every observation one noisy observation. Every noise threshold tuple (Table 3) has three levels which a random number from a uniform distribution from zero to one has to surpass from above (noise thresholds down) or from below (noise thresholds up) to subtract or add one, two or three quantization levels respectively to the return or change value. Only do this for observations that will go to the train set and do not perturb last target values in an observation.

11. Separate the data into train, development and test sets by sifting the data on the desired dates of the last return in the observation.


12. Separate the financial variables from the country and period indicator variables, but keep in the same order for correct input to the model.

13. Separate the last return from every observation sequence to serve as the true label for training and evaluation.

14. Encode every unique value in the data to one-hot format. [Mehri et al., 2017] report better results in a similar network when one-hot encoding the data. Analysis by [Buckman et al., 2018] shows that networks using one-hot encoded input data are more robust to adversarial attacks. Finally, the format lends itself to a new way of obtaining a baseline as explained in chapter 5 on Integrated Gradients.

One-hot encoding means simply transforming a scalar in $\mathbb{N}$ to a vector containing zeros and a one, i.e. $a = (0, 1, 0, 0)$ instead of $a = 2$; $a \in \{1, .., 4\}$. One observation x in the model consists of eighteen quantized returns/changes (seventeen conditions and one target) of four variables; hence the dimensionality of x without the target is $\mathbb{N}^{17 \times 4}$. There are sixteen quantization levels, thus after one-hot encoding the dimensionality of x becomes $\{0, 1\}^{17 \times 4 \times 16}$. After concatenating the four variables the dimensionality becomes $\{0, 1\}^{17 \times 64}$.
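A minimal sketch of steps 5, 6 and 14 (quantile calculation, quantization and one-hot encoding) for a single variable; the synthetic series and the helper names are illustrative and not taken from the thesis code:

```python
import numpy as np

rng = np.random.default_rng(0)
returns = rng.standard_normal(500)                   # standardized returns of one variable

n_q = 16                                             # number of quantiles (Table 3)
edges = np.quantile(returns, np.linspace(0, 1, n_q + 1)[1:-1])  # 15 inner quantile edges
labels = np.searchsorted(edges, returns)             # step 6: quantile index in {0, .., 15}
q_means = np.array([returns[labels == q].mean() for q in range(n_q)])  # for re-transformation

one_hot = np.eye(n_q)[labels]                        # step 14: shape (500, 16), rows sum to 1
recovered = q_means[one_hot.argmax(axis=1)]          # dequantize back to quantile means
```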

4 Model

The input data consists of returns $X = \{x^{(1)}, .., x^{(m)}\}$, country IDs $C = \{c^{(1)}, .., c^{(m)}\}$ and period IDs $P = \{p^{(1)}, .., p^{(m)}\}$. The observed targets are $Y = \{y^{(1)}, .., y^{(m)}\}$. Every $x$ is a sequence of multivariate past returns $(x_1, .., x_t)$, every $c$ indicates the country category, every $p$ the period category (Tables 1 and 3). Every $y$ indicates the next multivariate return $x_{t+1}$. The model estimates conditional distributions

$$p(y|x, c, p) = p(x_{t+1}|x_1, .., x_t, c, p). \quad (4.1)$$

The output layer z of the model is transformed with the softmax function [Goodfellow et al., 2016] into a discrete probability distribution with sixteen categories per variable, the number of quantiles in the data X, Y (Table 3). The softmax function normalizes its input to sum to one, as in a probability distribution, with $\hat{y}_q$ one category of $\hat{y}$:

$$\hat{y}_q = \mathrm{softmax}(z_q) = \frac{\exp(z_q)}{\sum_i \exp(z_i)}. \quad (4.2)$$

[Van Den Oord et al., 2016b] showed that, compared to e.g. mixture models, a softmax distribution works better in a similar autoregressive setting, even when the data is implicitly continuous. One of the reasons is that a categorical distribution is more flexible and can more easily model arbitrary distributions because it makes no assumptions about their shape [Van Den Oord et al., 2016a].

The cross-entropy loss function specifies the difference between the estimated distribution and the target, with v iterating over the four variables in y and q over the sixteen quantiles:

$$L(y, \hat{y}) = -\sum_{v=1}^{4} \sum_{q=1}^{16} y_{v,q} \log(\hat{y}_{v,q}). \quad (4.3)$$

Adam [Kingma and Ba, 2014], a variant of stochastic gradient descent, then iteratively trains the network parameters by averaging the gradients over mini-batches of observations.
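A sketch of equations 4.2 and 4.3 for a single observation, with random logits standing in for the network output; the shapes follow the model (four variables times sixteen quantiles):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)       # equation 4.2

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 16))        # logits: four variables x sixteen quantiles
y = np.eye(16)[[3, 7, 0, 15]]       # one-hot targets, one quantile per variable
y_hat = softmax(z)
loss = -(y * np.log(y_hat)).sum()   # cross-entropy, equation 4.3
print(loss)
```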

4.1 Architecture

The model consists of a sequence of layers, whose outputs are the inputs to the next layer. The input x has dimensionality $\{0, 1\}^{17 \times 64}$, with seventeen consecutive returns or changes in the first dimension and four variables times sixteen quantiles as the second. The architecture, with the general outline inspired by and nomenclature borrowed from [Van Den Oord et al., 2016a], is divided in three sections. The first two sections, causal convolution and dilated convolutions, process the information from the input with convolutions. The last section consists of ordinary matrix multiplications as in the multilayer perceptron, outputting $p(y|x, c, p)_v$, $v \in (1, .., 4)$, with v over the four variables in parallel. [Ruder, 2017] reports improved generalization capabilities for layouts multi-tasking in similar ways. Figure 5 gives an overview of the architecture.

Figure 5: Overview of the model architecture, with 1x1 standing for ordinary matrix multiplications.

4.1.1 Causal Convolution

The first section consists of only one layer, with a first convolution operation with filter size two and stride one. The operation winds down the 64 input channels (four variables times sixteen quantiles, the second input dimension) to a more compact representation of eight channels by setting the number of filters to that number. Because of the convolution the first dimension of x is reduced by one:

$$z^{(0)} = W^{(0)} * x, \quad z^{(0)} \in \mathbb{R}^{16 \times 8}, \quad (4.4)$$
$$a^{(0)} = z^{(0)}, \quad (4.5)$$

as in the following one-channel example with the dimension of the input x reduced from three to two in the output z:

$$\begin{pmatrix} w_1 & w_2 & 0 \\ 0 & w_1 & w_2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} z_1 \\ z_2 \end{pmatrix}. \quad (4.6)$$

4.1.2 Dilated Convolutions

The next section is a stack of four blocks, with convolutions performed four times on the input of the stack $a^{(0)}$. In this entire subsection $i \in (1, .., 4)$ counts the blocks. Because the stride is now set to two, the array sizes are halved every iteration. The one-hot encoded country and period indicators are concatenated to one vector with seven zeros and two ones, $(c^T, p^T)^T \in \{0, 1\}^9$ (six countries and three periods), and added to the convolutions as biases:

$$z^{(i)} = W^{(i)} * a^{(i-1)} + V^{(i)}(c^T, p^T)^T, \quad z^{(i)} \in \mathbb{R}^{2^{4-i} \times 8}, \quad V^{(i)}(c^T, p^T)^T \in \mathbb{R}^{8}, \quad i \in (1, .., 4) \quad (4.7)$$

with $V^{(i)}$ a learnable linear projection and the vector $V^{(i)}(c^T, p^T)^T$ broadcast over the first dimension, as in the following one-dimensional example where $b_1$ is repeatedly summed over the entire dimension:

$$\begin{pmatrix} w_1 & w_2 & 0 & 0 \\ 0 & 0 & w_1 & w_2 \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \end{pmatrix} + \begin{pmatrix} b_1 \\ b_1 \end{pmatrix} = \begin{pmatrix} z_1 \\ z_2 \end{pmatrix}. \quad (4.8)$$

The rectified linear activation function (ReLU) [Nair and Hinton, 2010] forms the non-linearity:

$$\mathrm{ReLU}(x) = \max(0, x). \quad (4.9)$$

The equation for $z^{(i)}$ including the non-linearity, with $b \equiv (c^T, p^T)^T$, then becomes:

$$z^{(i)} = \mathrm{ReLU}(W^{(i)} * a^{(i-1)} + V^{(i)} b), \quad z^{(i)} \in \mathbb{R}^{2^{4-i} \times 8}. \quad (4.10)$$

The output $a^{(i)}$ of block $i$ is the sum of $z^{(i)}$ and $r^{(i)}$. $r^{(i)}$ is formed from $a^{(i-1)}$ by keeping only the even elements in the first dimension $k$ of $a^{(i-1)}_{k,l}$, to conform to the dimension of $a^{(i)}$. The intuitive reason for these 'residual connections' $r^{(i)}$ is that by keeping values from previous layers, the network can more easily build up representations in the next layer that are at least as good as those in the previous layer, speeding up convergence in training [He et al., 2016, Van Den Oord et al., 2016a]. In initial experiments residual connections proved beneficial:

$$r^{(i)} = a^{(i-1)}_{2k}, \quad k \in (1, .., 2^{4-i}), \quad r^{(i)} \in \mathbb{R}^{2^{4-i} \times 8}, \quad (4.11)$$
$$a^{(i)} = z^{(i)} + r^{(i)}, \quad a^{(i)} \in \mathbb{R}^{2^{4-i} \times 8}. \quad (4.12)$$

Parametrized skip connections $s^{(i)}$ are computed with learnable matrix multiplications such that the number of channels is increased to 16 and only the last value of $a^{(i)}_{k,l}$ in the first dimension $k$ is used as input:

$$s^{(i)} = W_s^{(i)} a^{(i)}_{2^{4-i}}, \quad s^{(i)} \in \mathbb{R}^{16}. \quad (4.13)$$

Finally the $s^{(i)}$ are summed as the output of the full stack:

$$s = \sum_{i=1}^{4} s^{(i)}, \quad s \in \mathbb{R}^{16}. \quad (4.14)$$

The $s^{(i)}$ have a similar intuition as the residual connections, but the connections go further back [Orhan and Pitkow, 2018]. Figure 6 gives an illustration of $r^{(i)}$ and $s^{(i)}$.


Figure 6: Residual and skip connections are summed in the dilation layers section.
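A NumPy sketch of one dilation block (equations 4.7-4.13) with filter size two, stride two, the broadcast (c, p) bias, the residual from the even elements and a skip connection; the random weights follow the dimensions above but are otherwise assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def block(a_prev, W, V, b_cp, W_s):
    """One dilation block: strided conv + (c, p) bias + ReLU, residual, skip."""
    pairs = a_prev.reshape(-1, 2, a_prev.shape[1])          # filter size 2, stride 2
    z = pairs[:, 0] @ W[0] + pairs[:, 1] @ W[1] + V @ b_cp  # eq 4.7; bias broadcast (eq 4.8)
    z = np.maximum(0.0, z)                                  # ReLU non-linearity (eq 4.10)
    a = z + a_prev[1::2]                                    # residual: even (1-based) elements (eqs 4.11-4.12)
    return a, W_s @ a[-1]                                   # skip from the last position (eq 4.13)

a = rng.normal(size=(16, 8))                        # a^(0): output of the causal convolution
b_cp = np.zeros(9); b_cp[2] = 1.0; b_cp[7] = 1.0    # one-hot country and period indicators
skips = []
for i in range(4):                                  # four blocks; first dimension halves each time
    W = rng.normal(size=(2, 8, 8)) * 0.1
    V = rng.normal(size=(8, 9)) * 0.1
    W_s = rng.normal(size=(16, 8)) * 0.1
    a, s = block(a, W, V, b_cp, W_s)
    skips.append(s)
s = np.sum(skips, axis=0)                           # eq 4.14: summed skip connections, length 16
```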

4.1.3 Output Layers

This final section outputs the conditional distributions $p(x_{t+1}|x, c, p)_v$, with $v$ over the four variables. The output of all previous layers $s$ is the input for $a^{(5)}$. $z^{(5)}_v$ is a learnable matrix multiplication with no biases:

$$a^{(5)} = \mathrm{ReLU}(s), \quad a^{(5)} \in \mathbb{R}^{128}, \quad (4.15)$$
$$z^{(5)}_v = W^{(5)}_v a^{(5)}, \quad z^{(5)}_v \in \mathbb{R}^{16}, \quad v \in (1, .., 4), \quad (4.16)$$
$$\hat{y}_v = \mathrm{softmax}(z^{(5)}_v), \quad \hat{y}_v \in \mathbb{R}^{16}. \quad (4.17)$$

4.1.4 Regularization

As in [Loos et al., 2017, Bai et al., 2018], dropout proved beneficial and is implemented after $a^{(5)}$ and every $z^{(5)}_v$. Dropout randomly sets a small subset of the layer input values to zero at every iteration of stochastic gradient descent, effectively training an ensemble of neural networks and providing an inexpensive way of regularizing the model [Srivastava et al., 2014]. Weight decay as in equation 2.12 is implemented as well on every weight in the model.

4.2 Generation Process

The following steps generate a single stochastic scenario for the four variables, with $\sim$ the sampling operator:

$$\hat{x}_{t+1} \sim p(x_{t+1}|x_1, .., x_t, c, p), \quad (4.18)$$
$$\hat{x}_{t+2} \sim p(x_{t+2}|x_2, .., x_t, \hat{x}_{t+1}, c, p),$$
$$\vdots$$

Figure 7 shows $\hat{x}_{t+1}$ for equities conditioned on a standardized USA observation, compared to the (scaled) normal distribution with the same variance and zero mean.

Forecasts with confidence intervals can be obtained by generating multiple scenarios and taking the mean with (multiples of) the standard deviation of the scenario distribution [Hyndman and Athanasopoulos, 2014], with Figure 8 showing an example.
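A sketch of the sampling loop 4.18 and the confidence-interval construction; model_distribution is a hypothetical placeholder for the trained network, and the quantile means and window contents are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

def model_distribution(window, c, p):
    """Placeholder for the trained network p(x_{t+1} | x_1..x_t, c, p): returns
    a (4, 16) matrix of probabilities over the quantiles of the four variables."""
    probs = rng.random((4, 16))
    return probs / probs.sum(axis=1, keepdims=True)

def generate_scenario(window, c, p, horizon, q_means):
    """Autoregressive sampling as in steps 4.18: sample, shift the window, repeat."""
    path = []
    for _ in range(horizon):
        probs = model_distribution(window, c, p)
        nxt = np.array([rng.choice(16, p=pv) for pv in probs])  # one quantile per variable
        window = np.vstack([window[1:], nxt])                   # condition on the sample
        path.append(q_means[nxt])                               # dequantize via quantile means
    return np.array(path)                                       # shape (horizon, 4)

q_means = np.linspace(-3, 3, 16)                 # illustrative quantile means
window = rng.integers(0, 16, size=(17, 4))       # seventeen conditioning returns
paths = np.stack([generate_scenario(window, c=5, p=1, horizon=12, q_means=q_means)
                  for _ in range(1000)])         # one thousand scenarios
mean, std = paths.mean(axis=0), paths.std(axis=0)
lo95, hi95 = mean - 2 * std, mean + 2 * std      # approximate 95% confidence band
```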


Figure 7: Categorical distribution $\hat{x}_{t+1}$ for the equities variable conditioned on a standardized USA observation, compared to the (scaled) normal distribution with the same variance and zero mean.

Figure 8: Means of one thousand stochastic cumulative USA equity scenarios conditioned on a standardized USA returns observation with 67% and 95% confidence intervals.

5 Interpretability

This thesis focuses on interpretability in the sense that a trained model f(x) is given and the goal is to understand what the model predicts in terms of the input x. Specifically, the intent is to assign to each element of x an attribution score determining how relevant this input feature is for explaining f(x). See the section on interpretability in the literature review for more discussion; this chapter focuses on the technicalities.

In the model, every one of the sixteen discrete values of $\hat{y}_v$, $v \in (1, .., 4)$, represents the probability the model assigns to a quantized value being variable $v$'s next monthly return or change, while the probabilities of all possible quantized values together sum to one per variable. Every output probability in $[0, 1]$ can be seen as a function $f$ of an observation $x$. For interpretation the question is, for a given probability assignment $f$ and observation $x$, what are the attributions $a_i$ of each feature $x_i$ to $f$.

See Figure 9 for a visual example of what this work aims to achieve, obtained by the method of Integrated Gradients [Sundararajan et al., 2017] explained below. Here the given observation $x$ is the set of ordered two-dimensional values $(x_1, .., x_t)$ of two independent random sine waves (the top two rows), and the function is the output probability of the, according to the model, most likely quantized value of $x_{t+1}$ for the second sine wave (bottom row in green shade), conditioned on $x$. The attributions $a_i$ are the shaded background colors of every $x_i$, with blue positive and darker meaning stronger. As expected, the model assigns positive attributions only to features in the second sine wave, as the first sine wave is independent from the second. A positive attribution $a_i$ means that without $x_i$, $f(x)$ would be lower; a negative attribution means that without $x_i$, $f(x)$ would be higher.

Figure 9: Attributions from the two conditioned upon random independent sine waves (top two rows, positives are encircled) to t + 1 in green shade (bottom row t = 17).

5.1 Integrated Gradients

The technique of Integrated Gradients (IG) [Sundararajan et al., 2017] solves interpretability of deep learning models by defining attributions from input features to the output. If $f(x)$ estimates the probability of the presence of a certain event in $x$, then attribution $a_1$ represents the attribution of input feature $x_1$ to $f(x)$. All $a_i$ together sum to the probability score.

IG calculates attributions using only the gradients of the output with respect to the input. Using only gradients makes the method implementation invariant for any differentiable function, including deep neural networks with thousands of parameters for which direct inspection is impractical, such as the model this work introduces. It does so by averaging the gradients obtained by scaling the input x in tiny steps from a neutral baseline value to the actual value. The rest of the chapter sheds light on the technique and the importance of the baseline.

5.2 Image Example

An example from the field of image classification provides more insight into the method and the baseline. Let $f(x_1, x_2)$ be the function assigning probability to the presence of a certain bright object in a gray scale image with only two pixels $(x_1, x_2)$ representing the brightness intensity in $[0, 1]$:

$$f(x_1, x_2) = g(x_1) + h(x_2) = \frac{\log(4x_1 + 1)}{4} + \frac{x_2}{4} \quad (5.1)$$

as depicted in Figure 10. The attributions $(a_1, a_2)$ to $f(1, 1)$, calculated by evaluating the gradient at $(1, 1)$, are potentially misleading; $x_1$ clearly contributes more, approximately $.4$ against $x_2 = .25$, while its slope at $(1, 1)$

$$\frac{\partial f}{\partial x_1} = \frac{1}{4x_1 + 1} = \frac{1}{5} \quad (5.2)$$

is smaller than the slope of $x_2$:

$$\frac{\partial f}{\partial x_2} = \frac{1}{4}. \quad (5.3)$$

Figure 10: $f(x_1, x_2)$.

IG on the other hand uses a baseline far enough away from the input to not focus only on the local neighborhood of $f$, with the following justification: note that $(0, 0)$ represents a completely dark image and $f(0, 0)$ correctly assigns zero probability to the image depicting the bright object, therefore

$$f(x_1, x_2) = f(x_1, x_2) - f(0, 0). \quad (5.4)$$

Probabilities always lie between zero and one. Hence another way to look at attributions to a probability score is that they represent attributions to the increase of $f$ from a neutral baseline input representing zero probability to the actual input.

In similar but potentially more complex functions, and conditioned on the presence of a natural baseline input $x_b$ for which $f(x_b) \approx 0$, IG solves the problem of attribution using only the gradient by invoking the gradient theorem

$$f(\vec{q}) - f(\vec{p}) = \int_{\gamma[\vec{p},\vec{q}]} \nabla f(\vec{u}) \cdot d\vec{u} \quad (5.5)$$

with $\gamma$ any differentiable curve, $q$ the input and $p$ the baseline. The equation is invariant to $\gamma$, so a linear $\gamma$ works as well, by parameterizing as $x_1(t) = (q_1 - p_1)t$, $x_2(t) = (q_2 - p_2)t$ with e.g. $q = (1, 1)$, $p = (0, 0)$. The equations below explain the method with evaluations on $f(x_1, x_2)$ from the example above:

$$.65 \approx f(1,1) - f(0,0) = \int_{(x,y)=(0,0)}^{(1,1)} \nabla f(x, y) \cdot (dx, dy) \quad (5.6)$$
$$= \int_{(x,y)=(0,0)}^{(1,1)} \frac{\partial f(x,y)}{\partial x} dx + \frac{\partial f(x,y)}{\partial y} dy \quad (5.7)$$
$$= \int_{t=0}^{1} \frac{\partial f(x,y)}{\partial x} \frac{dx}{dt} dt + \frac{\partial f(x,y)}{\partial y} \frac{dy}{dt} dt \quad (5.8)$$
$$= \int_{t=0}^{1} \frac{\partial f(x,y)}{\partial x} \frac{dx}{dt} dt + \int_{t=0}^{1} \frac{\partial f(x,y)}{\partial y} \frac{dy}{dt} dt \quad (5.9)$$
$$= \int_{t=0}^{1} \frac{1}{4t+1} dt + \int_{t=0}^{1} \frac{1}{4} dt \quad (5.10)$$
$$= \frac{\log(4t+1)}{4}\Big|_{t=0}^{1} + \frac{t}{4}\Big|_{t=0}^{1} \approx .4 + .25 = .65. \quad (5.11)$$

The equations above also work with more complex functions where $\frac{\partial f(\vec{x})}{\partial x_i}$ depends on more variables than only $x_i$. For example, with

$$f(x, y) = x^2 + xy, \quad (5.12)$$

when parameterizing $\gamma$ again as $x(t) = t$, $y(t) = t$:

$$2 = f(1,1) - f(0,0) = \int_{(x,y)=(0,0)}^{(1,1)} \nabla f(x,y) \cdot (dx, dy) \quad (5.13)$$
$$= \int_{(x,y)=(0,0)}^{(1,1)} \frac{\partial f(x,y)}{\partial x} dx + \frac{\partial f(x,y)}{\partial y} dy \quad (5.14)$$
$$= \int_{t=0}^{1} \frac{\partial f(x,y)}{\partial x} \frac{dx}{dt} dt + \frac{\partial f(x,y)}{\partial y} \frac{dy}{dt} dt \quad (5.15)$$
$$= \int_{t=0}^{1} (2x + y) \frac{dx}{dt} dt + \int_{t=0}^{1} x \frac{dy}{dt} dt \quad (5.16)$$
$$= \int_{t=0}^{1} 3t\, dt + \int_{t=0}^{1} t\, dt \quad (5.17)$$
$$= \frac{3}{2}t^2\Big|_{t=0}^{1} + \frac{t^2}{2}\Big|_{t=0}^{1} = 2. \quad (5.18)$$

With linear $\gamma$ the following integral from steps 5.9 and 5.15 represents the desired attribution $a_i$ [Sundararajan et al., 2017]:

$$\int_{t=0}^{1} \frac{\partial f(\vec{x})}{\partial x_i} \frac{dx_i}{dt} dt = \frac{dx_i}{dt} \int_{t=0}^{1} \frac{\partial f(\vec{x})}{\partial x_i} dt. \quad (5.19)$$

Finally, it is possible to approximate this integral numerically by taking tiny steps from $p$ to $q$, similar to taking the infinitesimal steps of $dt$ in the integral, when $x_i(t) = (q_i - p_i)t + p_i$, $t \in [0, 1]$. The formula for attributions $a_i$ over $m$ steps then becomes the Riemann approximation

$$a_i = (q_i - p_i) \sum_{\alpha=0}^{m} \frac{\partial f(\frac{\alpha}{m}(q - p) + p)}{\partial x_i} \frac{1}{m}, \quad (5.20)$$

or when $q = \vec{1}$, $p = \vec{0}$,

$$a_i = \sum_{\alpha=0}^{m} \frac{\partial f(\frac{\alpha}{m} q)}{\partial x_i} \frac{1}{m}. \quad (5.21)$$
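A sketch of the Riemann approximation 5.20, checked against the closed-form example 5.1 where the exact attributions are approximately .4 and .25; the gradient is analytic here, whereas for a network it would come from back-propagation:

```python
import numpy as np

def grad_f(x):
    """Analytic gradient of f(x1, x2) = log(4*x1 + 1)/4 + x2/4 (equation 5.1)."""
    return np.array([1.0 / (4.0 * x[0] + 1.0), 0.25])

def integrated_gradients(grad, q, p, m=300):
    """Equation 5.20: average the gradient along the line from baseline p to input q."""
    alphas = np.arange(m + 1) / m
    avg = np.mean([grad(p + a * (q - p)) for a in alphas], axis=0)
    return (q - p) * avg

a = integrated_gradients(grad_f, q=np.ones(2), p=np.zeros(2))
print(a, a.sum())   # approximately [0.402, 0.25]; the sum approximates f(1,1) - f(0,0)
```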

5.3 Specializing to the Model

Equation 5.20 explains how to obtain attributions for generic differentiable multivariate functions with a single output variable. In this work the function is a neural network with multivariate input $x = (x_1, .., x_t, x_c, x_p)$ (writing $c, p$ as $x_c, x_p$ for notational convenience), one-hot encoded as e.g. $x_{i,v} = (0, 1, 0, .., 0)$, where $x_{i,v}$ stands for a quantized value at time $i$ for variable $v$. The output $\hat{y}_v$ consists of a discrete probability distribution for variable $v$ at $t + 1$. This distribution consists of sixteen categories, each again representing a quantized value with probability mass assigned by the network. Every single one of these masses can represent a function with single-variable output, as $f$ in equation 5.20.

Let's say for a specific input the model assigns probability mass to a specific category in one of the output distributions $\hat{y}_v$. This category can be considered as a fully differentiable function $f(x)$ of the input, from which back-propagation can obtain the partial derivatives $\frac{\partial f(x)}{\partial x_i}$, $i \in (1, .., t, c, p)$. Assume a certain baseline $x^{(b)}$ for which naturally $f(x^{(b)}) \approx 0$ can be found; then the attributions $a_i$ of the input features to $f(x)$ can be found by numerical integration as described in the image example in section 5.2 by writing

$$f(x) - f(x^{(b)}) = \int_{u=x^{(b)}}^{x} \nabla f(u) \cdot du = \quad (5.22)$$
$$\sum_{i=1}^{p} \int_{u=x^{(b)}}^{x} \frac{\partial f(u)}{\partial x_i} du_i \approx \sum_{i=1}^{p} a_i, \quad (5.23)$$

where it is ignored for now that $x_i$ is one-hot encoded.

In the more common setting of image classification one would often obtain a natural baseline by setting every pixel value to zero, generating an empty black image, as in most cases this contains no information for the model (a counter-example is if the model is trained to distinguish between night and day). In the setting of this thesis finding a natural baseline seems more challenging, as it is hard to argue that e.g. a sequence of zero returns in equities carries no meaningful information. However, taking into account that the input is one-hot encoded, a natural $x^{(b)}$ is obtained by scaling the one-hot encoded input features by zero, e.g. $0 \cdot x_i = 0 \cdot (0, 1, 0, .., 0) = (0, 0, 0, .., 0)$, making it impossible for the model to gain any information. The following conjecture underlines the importance of one-hot encoding for attribution purposes:

Conjecture 1 An IG baseline for quantizable data in any domain can be found by one-hot encoding the data and setting all vectors to zero for the baseline.

Attributions $a_i$, $i \in (1, .., t, c, p)$, for input features $(x_1, .., x_t, x_c, x_p)$ and $f(x)$ are then obtained by numerically integrating $f(\alpha x)$ over $\alpha \in [0, 1]$ as in equation 5.21. Because of the one-hot encoding, $\frac{\partial f(\frac{\alpha}{m} x)}{\partial x_i}$ becomes $\nabla^{(i)} f(\frac{\alpha}{m} x)$, a vector of partial derivatives for every element in the one-hot encoded input feature $x_i$, but the zero elements do not take part in $f(x)$. Therefore it is necessary to multiply element-wise by $x_i$ to zero out the partial derivatives of the zero elements. Finally the one-norm yields the scalar value $a_i$:

$$a_i = \Big\| x_i \odot \sum_{\alpha=0}^{m} \frac{\nabla^{(i)} f(\frac{\alpha}{m} x)}{m} \Big\|_1. \quad (5.24)$$

Initial experiments show m = 30 is sufficient and all experiments in this work use this value.
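A sketch of equation 5.24 in the one-hot setting; the toy score function f and the finite-difference gradient are illustrative stand-ins (the actual model obtains its gradients by back-propagation):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(17 * 64,)) * 0.1

def f(x_flat):
    """Toy differentiable score in (0, 1) over a flattened one-hot input; stands in
    for one softmax output category. Note f(0) is not exactly zero here, mirroring
    the softmax caveat discussed in section 5.4."""
    return 1.0 / (1.0 + np.exp(-(W @ x_flat)))

def numerical_gradient(g, x, eps=1e-5):
    grad = np.zeros_like(x)
    for j in range(len(x)):
        d = np.zeros_like(x); d[j] = eps
        grad[j] = (g(x + d) - g(x - d)) / (2 * eps)
    return grad

x = np.eye(16)[rng.integers(0, 16, size=17 * 4)].reshape(17, 4, 16)  # one-hot observation
x_flat = x.ravel()
m = 30                                                   # integration steps, as in the text
g_sum = sum(numerical_gradient(f, (a / m) * x_flat) for a in range(m + 1)) / m
a_iv = np.abs(x * g_sum.reshape(17, 4, 16)).sum(axis=2)  # eq 5.24: mask zeros, one-norm
```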

5.4 Caveats

A downside of IG is the reliance on a zero-scoring natural baseline [Kindermans et al., 2017], and the method addresses neither the interactions between the input features nor the logic employed by the network. However, the technique is state-of-the-art, easily implementable and relatively robust to adversarial examples [Ghorbani et al., 2017]. A similar method is shown to outperform older methods [Samek et al., 2017] and IG compares favorably to other recent methods [Ancona et al., 2018].

Specific to neural networks with softmax output layers (see definition 4.2) is the fact that, even for the most neutral baseline, such as setting the one-hot input vectors to zero, $f(x^{(b)})$ will not be exactly zero, as the softmax assigns positive probability mass to every category. For $f(x)$ smaller than or equal to $f(x^{(b)})$ the IG formula is not defined. In most use cases I assume people are interested in attributions for the output categories to which the network assigns the most probability mass, and then there is no issue.

A final possible issue has to do with the storing of information in biases $b$ in neural networks such as MLPs (e.g. equation 2.4) that use $b$'s in their layers, for example:

$$z^{(j)} = W^{(j)} h^{(j)} + b^{(j)} \quad (5.25)$$

with $h$ the input to layer $j$ and $W, b$ trainable weights and biases respectively. In this case, even when the input $h$ is scaled to zero, the output $z$ might be non-zero. In the network's output layer the biases might have encoded a 'prior' distribution. In that case a relatively large $f(x)$ might be equal to or smaller than $f(x^{(b)})$ even for a well-chosen $x^{(b)}$, in which case IG attributions are meaningless. In this work the issue is moot, as $V^{(i)}(c^T, p^T)^T$ substitutes for $b$ in the dilated convolutions (equation 4.7), in which $c, p$ as part of $x$ are scaled from zero to one as well. The causal convolution and output layers contain no biases at all, as equations 4.4 and 4.16 in chapter 4 on the model's architecture have shown. Appendix A, while exploring the application of IG on sine waves, demonstrates experimental results on this issue as well (Figures 20 and 22).

6 Experiments and Results

This final chapter puts the model to the test on the generation of new data, forecasting and interpretability. The experiments make use of the model with hyper-parameters (e.g. the learning rate in equation 2.8) tuned to minimize the cross entropy loss on a development set with observations whose targets are from March till September 2016. The experiments on generation and on interpretability make use of the same instantiation of the model, trained on data until March 2017 with ten thousand steps of stochastic gradient descent, which is the point where the cross entropy loss on the data between March 2017 and September 2017 starts deteriorating. The last observations, with targets from September 2017 to February 2018, are then used for experimentation. The second experiment on forecasting makes use of model instantiations trained with data up to three different end dates with the same hyper-parameters, except that they are trained with fewer iterations of stochastic gradient descent to lower the risk of overfitting. Table 4 shows the hyper-parameter values.

Table 4: Model hyper-parameter values.

Hyper-parameter                                                   Value
Number of training steps (generation and interpretation model)   10000
Number of training steps (forecasting models)                     3000
Learning rate α (steps < 3000)                                    0.001
Learning rate α (steps 3000-10000)                                0.0001
Mini-batch size                                                   1024
Weight decay scale β                                              0.001
Drop out rate                                                     0.1

6.1 Generation

If the model has learned the distribution of the data set correctly, its stochastic generations should reproduce the main statistics present in the data. First, stochastic gradient descent trains the model on data from before March 2017 until the cross entropy loss on the development set from March 2017 till September 2017 starts deteriorating. Then feeding the model the last United States observation with a period of 22 business days and target in February 2018 provides $p(x_{t+1}|x_1, .., x_t, c, p)$. Sampling from this distribution and subsequent estimations generates a single stochastic scenario, as in steps 4.18, of arbitrary length; a horizon of ten thousand steps, or approximately eight hundred years, is taken in this experiment. The experiment generates ten equivalent instances of such a single stochastic scenario and tests, relative to the data, for the presence of:

1. Skewed distributions.

2. Fat tails by comparing the measures of kurtosis between the data and the generated data.

3. Close to zero autocorrelations of the changes or returns by comparing the first order autocorrelations.


4. Volatility clustering by comparing the first order autocorrelations of the absolute changes or returns.

5. Correlation structure between the variables.

6. Mean and standard deviation.
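As referenced above, the following is a minimal sketch of the autoregressive sampling loop. The model object with a predict_proba method returning one categorical distribution over quantile bins per variable is an illustrative assumption, not the thesis implementation.

    import numpy as np

    def generate_scenario(model, history, c, p, horizon=10_000, rng=None):
        # history: array of shape (t, n_vars) with quantized observations.
        rng = rng or np.random.default_rng()
        path = list(history)
        for _ in range(horizon):
            # Distribution over quantile bins per variable, conditioned
            # on the path so far and on the conditioning variables (c, p).
            probs = model.predict_proba(np.array(path), c, p)  # (n_vars, n_bins)
            # Each variable is sampled independently of the others.
            step = np.array([rng.choice(len(pr), p=pr) for pr in probs])
            path.append(step)
        return np.array(path[len(history):])

Dequantizing each sampled bin to its quantile mean then yields the returns and changes whose statistics Table 5 compares to the observed data.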

Table 5 shows the averaged results with standard errors over ten scenarios. The first thing to notice is that the quantization process, as described in steps 5 and 6 in chapter 3 on Data, introduces distortions in the data statistics, especially in the kurtosis measure. This is due to the averaging of the values in the extreme quantiles to obtain the means for re-transformation, which moderates the outliers. Of course the model can only learn from the quantized data as its input. Taking a larger number of quantiles would lower these distortions; this study has refrained from doing so, as initial experimentation showed that it increased the risk of overfitting, since more quantiles mean fewer observations per quantile category.

Table 5: Generation statistics on ten thousand stochastic monthly steps based on the United States variables. ± is the standard error $\sigma/\sqrt{10}$ over ten observations of each statistic. Values are in decimals. Ac and Ac-abs are first order (absolute) autocorrelations. R is the Pearson correlation. Data is standardized. Qnt data is the data after (re)transformation to and from quantiles.

                 Mean  ±       Std   ±       Skew  ±       Kurt  ±       Ac    ±       Ac-Abs ±
    2 Year
      data        0.0          1.0          -0.17          1.46          0.17          0.25
      qnt data    0.0          0.98         -0.11          0.27          0.19          0.27
      model      -0.01 (0.00)  0.91 (0.00)  -0.17 (0.01)   0.73 (0.03)   0.04 (0.00)   0.18 (0.00)
    10 Year
      data        0.0          1.0           0.00          1.19          0.05          0.12
      qnt data    0.0          0.98          0.13         -0.21          0.05          0.11
      model      -0.01 (0.00)  0.96 (0.00)   0.13 (0.01)  -0.15 (0.01)   0.01 (0.00)   0.01 (0.00)
    Corporate
      data        0.0          1.0           0.88          6.32          0.12          0.22
      qnt data    0.0          0.95          0.30          0.27          0.11          0.17
      model       0.01 (0.00)  0.91 (0.00)   0.31 (0.01)   0.46 (0.01)   0.01 (0.00)   0.06 (0.00)
    Equities
      data        0.0          1.0          -0.92          4.78         -0.03          0.24
      qnt data    0.0          0.95         -0.52          0.64         -0.04          0.20
      model      -0.00 (0.00)  0.96 (0.00)  -0.53 (0.01)   0.61 (0.02)   0.02 (0.00)   0.06 (0.00)

                 R 2y-10y ±    R 2y-crp ±    R 2y-eq ±     R 10y-crp ±   R 10y-eq ±    R crp-eq ±
    All
      data        0.81          0.64          0.14          0.72          0.08         -0.22
      qnt data    0.81          0.71          0.10          0.77          0.064        -0.17
      model       0.01 (0.00)   0.01 (0.00)   0.01 (0.00)  -0.00 (0.00)   0.01 (0.00)  -0.00 (0.00)

While the first three moments of the generation fit the data well and the fourth moment reasonably, the autocorrelations show less resemblance and the inter-variable correlations show none. Further research might verify whether the correlations can be improved by adjusting the sampling process of an individual variable to take the sampling outcome of the other variables into account. Currently each variable's next step value is sampled independently from the other variables.
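For reference, the per-scenario statistics reported in Table 5 can be computed with standard tooling. This sketch uses pandas and scipy; the column layout of the scenario DataFrame is an assumption.

    import pandas as pd
    from scipy import stats

    def scenario_statistics(df):
        # df: one dequantized scenario with one column per variable
        # (2y, 10y, corporate, equities), standardized returns/changes.
        out = {}
        for col in df.columns:
            s = df[col]
            out[col] = {
                "mean": s.mean(),
                "std": s.std(),
                "skew": stats.skew(s),
                "kurt": stats.kurtosis(s),         # excess kurtosis
                "ac": s.autocorr(lag=1),           # first-order autocorrelation
                "ac_abs": s.abs().autocorr(lag=1)  # volatility clustering proxy
            }
        corr = df.corr(method="pearson")           # inter-variable structure
        return pd.DataFrame(out).T, corr

Averaging these statistics over the ten scenarios and dividing the standard deviations by $\sqrt{10}$ reproduces the means and standard errors in Table 5.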

Figure 11 visualizes the returns and the autocorrelations of a stochastic scenario of USA equities, with the stylized facts of volatility clustering (top left), skew and kurtosis (top right), zero autocorrelation (bottom left) and slow decay in absolute autocorrelation (bottom right, volatility clustering as well) visible.

Figure 11: Generated dequantized returns of one stochastic scenario of USA equities. The histogram on the right reflects the frequency over the quantile means. The bottom row shows autocorrelations (absolute on the right) from first to hundredth order.

6.2 Forecasting

This experiment compares out-of-sample forecasts $\hat{x}_{t+1}$ to the naive benchmark on both mean absolute scaled error (MASE) and root mean squared error (RMSE) for the four variables simultaneously. To check for variation in the results, the testing period is divided into three six-month periods:

A) September 2016 - February 2017

B) March 2017 - August 2017

C) September 2017 - February 2018

For every period, three models are initialized with random weights and trained with hyper-parameters as in Table 4 on the training set spanning all data until six months before the beginning of the period. The cross entropy loss on the development set containing the last six months of data (i.e. March - August 2016 for period A) is used to select the best of the three models. Only after the model is completely trained and selected are the MASE and RMSE on the periods calculated, once, to prevent look-ahead bias.

Data from all countries with monthly period is used. Periods A, B and C contain observations $(x_1, \dots, x_t, x_{t+1}, c, p)$ for six months with $t = 17$. $x_i$ is a standardized multivariate monthly return or change on a daily basis. $x_{t+1}$ falls on a date in one of the periods, is not quantized and is kept separate as the true label. Hence there are about 2400 targets (132 days per country times four variables) in every period. The dequantized mean $\hat{x}_{t+1}$ of $p(x_{t+1} \mid x_1, \dots, x_t, c, p)$ is compared to $x_{t+1}$ for calculating MASE and RMSE.
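A sketch of the error metrics, under the assumptions that the naive forecast for $x_{t+1}$ is the last observed value $x_t$ and that the scaling in MASE uses the naive errors on the same targets (consistent with the naive rows in Table 6 equalling 1.000):

    import numpy as np

    def mase(y_true, y_pred, y_naive):
        # Scale the model's absolute errors by the naive forecast's
        # absolute errors on the same targets, so the naive MASE is 1.
        return np.mean(np.abs(y_true - y_pred)) / np.mean(np.abs(y_true - y_naive))

    def rmse(y_true, y_pred):
        return np.sqrt(np.mean((y_true - y_pred) ** 2))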

Table 6 demonstrates that over all periods and variables the model's forecast errors are comparable to the naive benchmark, with some but not much variation between periods. The results might be improved by:

1. Increasing the model’s capacity to pick up predictive patterns from the training set.

2. Limiting overfitting on the training set and/or development set by adding more regularization.

3. Obtaining more (i.i.d.) training data to increase the ability to learn to distinguish between predictive and spurious patterns.

4. Alternatively, it might not be possible at all to predict better than the naive solution with the given conditioning and training data, at least in the selected periods, due to e.g. non-stationarity.

Figure 12 shows $p(x_{t+1} \mid x_1, \dots, x_t, c, p)$ for the four variables on the last observation of USA in the test set from period C.


Table 6: $t+1$ errors on monthly returns or changes for periods A, B and C on all countries.

                     Period A        Period B        Period C        Mean A,B,C
                     MASE   RMSE     MASE   RMSE     MASE   RMSE     MASE   RMSE
    2 Year
      naive          1.000  0.545    1.000  0.545    1.000  0.556    1.000  0.545
      model          0.996  0.533    1.000  0.540    0.944  0.526    0.980  0.533
    10 Year
      naive          1.000  0.724    1.000  0.724    1.000  0.700    1.000  0.716
      model          1.006  0.726    1.015  0.735    0.994  0.695    1.005  0.719
    Corporate
      naive          1.000  0.519    1.000  0.519    1.000  0.550    1.000  0.529
      model          1.043  0.527    1.033  0.529    0.966  0.531    1.014  0.529
    Equities
      naive          1.000  0.420    1.000  0.420    1.000  0.640    1.000  0.493
      model          1.039  0.434    1.032  0.430    0.989  0.639    1.020  0.501
    Mean over all variables
      naive          1.000  0.552    1.000  0.552    1.000  0.612    1.000  0.572


Figure 12: $p(x_{t+1} \mid x_1, \dots, x_t, c, p)$ for the four variables on the last observation of USA in the test set from period C, plotted alongside the normal distribution with the same mean and standard deviation for visual reference.


6.3 Interpretation

While interpretation techniques can be used to validate a model, a fundamental question is how much we can trust the interpretation itself [Ghorbani et al., 2017]. Humans can often intuitively assess the quality of given attributions in e.g. image classification by matching them with prior knowledge and experience of what is regarded as relevant [Samek et al., 2017]. For other tasks, such as those in the domain of finance, it may be difficult to determine objectively whether an explanation technique is good or not [Montavon et al., 2018]. Therefore this section first evaluates this work's interpretation technique with two systematic, objective metrics. A qualitative example then demonstrates how the interpretation method can be used for explanation, exploration and debugging.

6.3.1 Perturbation Analysis A

For attributions generated by any interpretation technique, [Samek et al., 2017] observe that the input elements to which most relevance is attributed should be those most likely to destroy the probability $f(x)$ assigned by the network to a certain classification when perturbed. The experiment inspired by [Samek et al., 2017] is as follows (a sketch of the procedure follows the list):

1. Calculate $\hat{x}_{t+1}$ and the corresponding attributions for all observations in the test set of period C.

2. Perturb the top-$K$ attributing elements by substituting a category sampled from the uniform distribution over all quantiles, with $K \in \{1, \dots, 10\}$. Record the change in the probability assigned by the model to the mode of $p(x_{t+1} \mid x_1, \dots, x_t, c, p)$.

3. Do the same for $K$ randomly picked input elements.

4. To account for randomness perform the above experiments five times and compute the means and standard deviations of the results.

5. Compare the results for the top-K and randomly picked elements.
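As referenced above, a minimal sketch of this procedure for a single variable's output distribution. model.predict_proba, returning the categorical distribution over quantile bins, is the same illustrative interface assumed in section 6.1, not the thesis code.

    import numpy as np

    def perturbation_test(model, inputs, attributions, k, n_bins, rng):
        # Drop in the probability of the mode after perturbing the k
        # highest-attributing elements versus k random elements.
        drops_topk, drops_rand = [], []
        for x, attr in zip(inputs, attributions):
            probs = model.predict_proba(x)
            mode = np.argmax(probs)
            base = probs[mode]
            idx_top = np.argsort(attr.ravel())[-k:]  # top-k by attribution
            idx_rand = rng.choice(attr.size, size=k, replace=False)
            for idx, drops in ((idx_top, drops_topk), (idx_rand, drops_rand)):
                x_pert = x.copy().ravel()
                x_pert[idx] = rng.integers(0, n_bins, size=k)  # uniform quantile
                new = model.predict_proba(x_pert.reshape(x.shape))[mode]
                drops.append(base - new)
        return np.mean(drops_topk), np.mean(drops_rand)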

Figure 13 demonstrates that, as expected, the input elements to which most relevance is attributed destroy the value of $f(x)$ the most when perturbed, with standard errors not overlapping.

6.3.2 Perturbation Analysis B

[Ghorbani et al., 2017] note that for a robust attribution technique, small perturbations in the input should lead to small perturbations in the set of attributions. The experiment inspired by [Ghorbani et al., 2017] is as follows (a sketch follows the list):

1. Calculate $\hat{x}_{t+1}$ and the corresponding attributions for all observations in the test set of period C.

2. Inject various levels of random noise into the input elements by summing the input quantiles with rounded values from a normal distribution with scale $2^K$, $K \in \{-2, \dots, 1\}$.

3. Compute and record the Spearman rank correlation between the attributions from the perturbed and non-perturbed input, with both attributions sorted in the same order by value of the non-perturbed input.


4. To account for randomness perform the above experiments three times and compute the means and standard deviations of the results.
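As referenced above, a sketch of this robustness check using scipy's spearmanr, with the same illustrative attribution and model interfaces as before:

    import numpy as np
    from scipy.stats import spearmanr

    def robustness_test(model, x, attribute_fn, k, n_bins, rng):
        # attribute_fn(model, x) returns one attribution per input element.
        attr = attribute_fn(model, x).ravel()
        # Add rounded Gaussian noise with scale 2**k to the quantile indices.
        noise = np.rint(rng.normal(scale=2.0 ** k, size=x.shape)).astype(int)
        x_pert = np.clip(x + noise, 0, n_bins - 1)
        attr_pert = attribute_fn(model, x_pert).ravel()
        # Rank correlation between perturbed and non-perturbed attributions.
        rho, _ = spearmanr(attr, attr_pert)
        return rho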

Figure 14 shows that this thesis’ attribution technique is robust to small perturbations in the input.

6.3.3 Qualitative Analysis

Through an example, this experiment aims to demonstrate the practical usefulness of the method in three use cases [Sundararajan et al., 2017]:

1. Providing a rationale for the prediction to help users understand the model.

2. Extracting insights for scientific analysis.

3. Analyzing the input-output behavior of the model to give the ability to improve it.

Figure 13: Comparisons of the values assigned to the mode of $p(x_{t+1} \mid x_1, \dots, x_t, c, p)$ after perturbing $K$ input elements. Means of five experiments; standard error bars are too small to be visible.

Figure 14: Spearman rank correlations between attributions obtained from inputs with and without perturbation by summing the input quantiles with rounded samples from a normal distribution with varying scale. Means of five experiments; standard error bars are too small to be visible.

In this experiment the last observation of the US data, with target in February 2018, is the input to the model.

Figure 15 visualizes the forecast and corresponding attributions for the equities variable. The attributions make clear that the (expectation of) monetary tightening visible in the two year yield, combined with recent positive equities returns, leads the model to forecast a slightly negative return for equities in $t+1$. The attribution from the second to last month in the corporate yield is negative, meaning that with another value for that month this prediction for equities would probably have gained a higher score. If this pattern in attributions and forecast were to come up more often, a scientist might be inclined to look for a relation between monetary tightening, recent positive equities returns and negative equities returns in $t + 1$. Finally, the attributions from the 10 year yields are close to zero, which might be cause for investigating whether or not the model is taking this variable sufficiently into account in general. Contrast this with Figure 9 in chapter 5, where it is actually expected that the first sine wave does not contain any attribution.

Figure 15: Attributions belonging to the mode of the forecast for USA equities as depicted in Figure 12 in this chapter, with positive attributions encircled.


7 Conclusion

This thesis introduced an interpretable, fully probabilistic autoregressive deep learning model outputting the multivariate $p(x_{t+1} \mid x_1, \dots, x_t, c, p)$, capable of generating multivariate financial stochastic scenarios, which when averaged can lead to multi-step ahead forecasts with confidence intervals. The mean, standard deviation and skew of the scenarios resemble the statistics of the observed data well, with reasonable results on kurtosis, some resemblance on (absolute) autocorrelations and no resemblance on inter-variable correlations. Forecasting $x_{t+1}$ was comparable to the naive forecast. Two systematic, objective measures and a qualitative example demonstrated how this work's method of interpretation can reliably provide a rationale for the prediction to help users understand the model, extract insights for scientific analysis and analyze the input-output behavior of the model to give the ability to improve it.

Further research might investigate whether a dependent sampling process improves correlation values in multi-step generations. It could then be interesting to examine insightful forms of interpretation of multi-step generations, such as taking the differences of attributions in time. Another conceivable inroad to multi-step interpretability is averaging attributions from several independent stochastic scenarios at a time, for multi-step forecasts with confidence intervals.
