
Training Neural Networks with the Unscented Kalman Filter for Time Series Prediction


Academic year: 2021



Master's Thesis

Training Neural Networks with the Unscented Kalman Filter for Time Series Prediction

Author: Kasper Nicholas
Supervisor: Mrs. J. Schijven MSc
Examiner: Prof. dr. B.D. Kandhai

A thesis submitted in partial fulfilment of the requirements for the degree of Master of Science in Computational Science

in the

Computational Science Lab Informatics Institute

Declaration of Authorship

I, Kasper Nicholas, declare that this thesis, entitled ‘Training Neural Networks with the Unscented Kalman Filter for Time Series Prediction’ and the work presented in it are my own. I confirm that:

• This work was done wholly or mainly while in candidature for a research degree at the University of Amsterdam.

• Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.

• Where I have consulted the published work of others, this is always clearly attributed.

• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.

• I have acknowledged all main sources of help.

• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Signed:

Date: 12 October 2020


Abstract

Faculty of Science Informatics Institute

Master of Science in Computational Science

Training Neural Networks with the Unscented Kalman Filter for Time Series Prediction

by Kasper Nicholas

This thesis aims to assess the unscented Kalman filter (UKF) as a training method for neural networks (NN) on time series prediction. The research is split into three components. First, the theoretical framework of NN and the UKF is derived and elaborated upon. Second, a sensitivity analysis of UKF training is performed to assess the effect of the process and observation noise covariance matrices on performance and generalization. To quantify the generalization of NN we implement the weight Hessian and the input Hessian. Third, a comparison between the UKF and stochastic gradient descent (SGD) is performed to assess the comparative efficacy of the UKF against a well-known benchmark method. The comparison is done first on synthetic data and second on the daily returns of the USD/EUR and USD/RUB foreign exchange rates. The outbreak of COVID-19 and the subsequent global economic turmoil are finally studied to assess the robustness of the UKF and SGD to unexpected macroeconomic changes. The sensitivity analysis illustrates the importance of the training data characteristics and the complexity of the NN architecture when choosing the optimal noise covariance matrices. The comparison between the UKF and SGD indicates an overall higher out-of-sample performance of the SGD method. However, our results indicate that the UKF obtains competitive performance in almost all experiments and even outperforms SGD for a small training set and a high level of noise in the synthetic data. Experiments with the foreign exchange rates indicate that SGD training produces optimal results in a high-volatility and a COVID-19 training period, and the UKF in a low-volatility period. We note that the trace of the weight and input Hessian provided little insight into the comparative generalization of the two methods. In spite of the notable difference in fitting the training data, the difference in performance between the UKF and SGD training methods is largely negligible when the uncertainty of the experiments is taken into consideration.

Acknowledgements

First and foremost I would like to thank Jacqueline Schijven for her continued feedback and guidance throughout the process of writing this thesis. Despite the majority of the internship taking place at home due to the COVID-19 measures, she was always quick to respond and eager to help. Our weekly meetings aided in tackling the countless issues that one faces when working on such a sizable project.

Secondly, I am grateful to my team at EY for giving me the opportunity to write my thesis with them. The regular chats with colleagues and general pleasant working environment proved greatly beneficial to the progress of my research.

I would also like to extend my sincerest gratitude to Drona Kandhai and Ioannis Anagnostou for our biweekly discussions and for contributing to the academic quality of the thesis project.

I finally want to thank my friends and family for their invaluable advice and support throughout the internship. You helped make the individualistic process of writing a thesis a fruitful and enjoyable experience.


Contents

Declaration of Authorship
Abstract
Acknowledgements
Contents
List of Figures
List of Tables
List of Algorithms
Abbreviations

1 Introduction

2 Literature Review
  2.1 Neural Networks and Time Series Prediction
  2.2 Problems with Gradient Descent
  2.3 Bayesian Inference Methods
  2.4 Bayesian Inference for Parameter Estimation

3 Theoretical Background
  3.1 Neural Networks
    3.1.1 Stochastic Gradient Descent
  3.2 Recursive Bayesian Estimation
    3.2.1 Bayesian Inference
      3.2.1.1 State Estimation
      3.2.1.2 Parameter Estimation
  3.3 The Kalman Filter
    3.3.1 The Unscented Kalman Filter
      3.3.1.1 The Unscented Transform
    3.3.2 The UKF for Training Feedforward Neural Networks

4 Methods
  4.1 Numerical evaluation
    4.1.1 Artificial data: the noisy sine function
    4.1.2 Real-world data: currency pairs
  4.2 Metrics
    4.2.1 Performance
    4.2.2 Generalization

5 Experiments and Results
  5.1 Noisy sine function
    5.1.1 Sensitivity analysis of the UKF
      5.1.1.1 Varying Q
      5.1.1.2 Varying R
      5.1.1.3 Adaptive R and Q
    5.1.2 The UKF compared to SGD
      5.1.2.1 Varying the noise in the data
      5.1.2.2 Expanding the training set
  5.2 Currency pairs
    5.2.1 The USD/EUR
    5.2.2 The USD/RUB

6 Discussion

7 Conclusions and Future Research
  7.1 Conclusion
  7.2 Future research

A Feedforward Neural Networks
  A.1 Backpropagation for FNN

B The Kalman Filter
  B.1 The MMSE derivation of the Kalman Filter

C Experiments and Results
  C.1 Noisy sine function
    C.1.1 Sensitivity analysis of the UKF
      C.1.1.1 Varying Q
      C.1.1.2 Varying R
    C.1.2 The UKF compared to SGD
      C.1.2.1 Varying the noise in the data

List of Figures

3.1 A feedforward neural network with a single hidden layer.
3.2 A hidden Markov model.
3.3 Simple representation of the unscented transform.
4.1 The USD/EUR (L) and USD/RUB (R) daily returns for the minimum, maximum and mean volatility periods.
4.2 The USD/EUR daily returns for the COVID-19 period.
5.1 Single UKF training instances for Q = 0.001 and Q = 1 with the noisy sine data.
5.2 The input Hessian for UKF training with different Q, c = 0.3 and K = 150.
5.3 The weight Hessian for UKF training with different Q, c = 0.3 and K = 150.
5.4 Single UKF training instances for R = 0.001 and R = 1 with the noisy sine data.
5.5 The input Hessian for UKF training with different R, c = 0.3 and K = 150.
5.6 The weight Hessian for UKF training with different R, c = 0.3 and K = 150.
5.7 The input Hessian for UKF training with adaptive R and Q on the noisy sine data.
5.8 The weight Hessian for UKF training with adaptive R and Q.
5.9 Single UKF and SGD training instances for R = 0.001 and R = 1 with the noisy sine data.
5.10 The input Hessian of the UKF and SGD for varying noise and K = 150.
5.11 The weight Hessian of the UKF and SGD for varying noise and K = 150.
5.12 The weight Hessian for single UKF and SGD training instances with the noisy sine data, K = 450 and c = 0.3.
5.13 The input Hessian of the UKF and SGD for a varying training set size and c = 0.1.
5.14 The input Hessian of the UKF and SGD for a varying training set size and c = 0.3.
5.15 The weight Hessian of the UKF and SGD for a varying training set size and c = 0.1.
5.16 The weight Hessian of the UKF and SGD for a varying training set size and c = 0.3.
5.17 The input Hessian for UKF and SGD training with the three data sets of the USD/EUR.
5.18 The weight Hessian for UKF and SGD training with the three data sets of the USD/EUR.
5.19 The input Hessian for UKF and SGD training with the three data sets of the USD/RUB.
5.20 The weight Hessian for UKF and SGD training with the three data sets of the USD/RUB.
5.21 The predictions for a single run of UKF and SGD training with the USD/EUR COVID-19 period data and the (1, 10) architecture.
5.22 The predictions for a single run of UKF and SGD training with the USD/EUR COVID-19 period data and the (2, 20) architecture.
5.23 The predictions for a single run of UKF and SGD training with the USD/EUR minimum volatility period data and the (1, 10) architecture.
A.1 The perceptron.
A.2 The sigmoid function and the step function.
A.3 Gradient descent of a cost function E for a single weight.
B.1 A schematic representation of the recursive updates for the Kalman filter.
C.1 The input Hessian for UKF training with different Q, c = 0.1 and K = 150.
C.2 The input Hessian for UKF training with different Q, c = 0.1 and K = 450.
C.3 The input Hessian for UKF training with different Q, c = 0.3 and K = 450.
C.4 The weight Hessian for UKF training with different Q, c = 0.1 and K = 150.
C.5 The weight Hessian for UKF training with different Q, c = 0.1 and K = 450.
C.6 The weight Hessian for UKF training with different Q, c = 0.3 and K = 450.
C.7 The input Hessian for UKF training with different R, c = 0.1 and K = 150.
C.8 The input Hessian for UKF training with different R, c = 0.1 and K = 450.
C.9 The input Hessian for UKF training with different R, c = 0.3 and K = 450.
C.10 The weight Hessian for UKF training with different R, c = 0.1 and K = 150.
C.11 The weight Hessian for UKF training with different R, c = 0.1 and K = 450.
C.12 The weight Hessian for UKF training with different R, c = 0.3 and K = 450.
C.13 The input Hessian of the UKF and SGD for varying noise and K = 450.

List of Tables

4.1 The standard deviation for three periods of the USD/EUR and USD/RUB daily returns.
5.1 The test results for UKF training with different Q, c = 0.3 and K = 150.
5.2 The test results for UKF training with different R, c = 0.3 and K = 150.
5.3 The test values for UKF training with adaptive R and Q on the noisy sine data.
5.4 The test results of the UKF and SGD for varying noise and K = 150.
5.5 The test results of the UKF and SGD for a varying training set size and c = 0.1.
5.6 The test results of the UKF and SGD for a varying training set size and c = 0.3.
5.7 The test results for UKF and SGD training with the three data sets of the USD/EUR.
5.8 The test results for UKF and SGD training with the three data sets of the USD/RUB.
C.1 The test results for UKF training with different Q, c = 0.1 and K = 150.
C.2 The test results for UKF training with different Q, c = 0.1 and K = 450.
C.3 The test results for UKF training with different Q, c = 0.3 and K = 450.
C.4 The test results for UKF training with different R, c = 0.1 and K = 150.
C.5 The test results for UKF training with different R, c = 0.1 and K = 450.
C.6 The test results for UKF training with different R, c = 0.3 and K = 450.
C.7 The test values of the UKF and SGD for varying noise and K = 450.

List of Algorithms

1 The Kalman filter for state estimation
2 The unscented Kalman filter for state estimation
3 The unscented Kalman filter for training neural networks

Abbreviations

BNN Bayesian Neural Networks

CNN Convolutional Neural Networks

EKF Extended Kalman Filter

FNN Feedforward Neural Networks

GRV Gaussian Random Variable

HMM Hidden Markov Model

KL Kullback-Leibler

MAE Mean Absolute Error

MLP Multi-Layer Perceptron

MMSE Minimum Mean Square Error

MSE Mean Square Error

NN Neural Networks

PF Particle Filter

RNN Recurrent Neural Networks

SGD Stochastic Gradient Descent

SMC Sequential Monte-Carlo

UKF Unscented Kalman Filter

UPF Unscented Particle Filter

1 Introduction

Neural networks (NN) are increasingly used for a variety of applications in financial risk management, including time series modelling. Although the increased predictive power of such methods for nonlinear time series has been shown extensively, the black-box nature of NN inhibits their large-scale comprehension and subsequent application. Standard NN training is equivalent to a maximum likelihood estimation of the weights. Any uncertainty in the weight estimates is lost by treating the weights as well as the input as point estimates, which generally increases the probability of overfitting during training. A proposed variety of NN that mitigates this issue is the Bayesian Neural Network (BNN), which allows for the estimation of a posterior distribution over predictions as well as posteriors over the network weights during the training process. Well-known examples of BNN training methods include the sequential Monte-Carlo (SMC) method, variational inference (VI) and the unscented Kalman filter (UKF). Contrary to the large majority of NN in the literature, which apply stochastic gradient descent (SGD) as their training method, BNN incorporate Bayesian inference methods for training the weights. Bayesian inference estimates the posterior by combining the prior and likelihood in a recursive manner. Aside from providing posteriors over the weights and the output, BNN are capable of mitigating the overfitting issue of SGD, especially for small data sets [1, 2]. Additionally, the incorporation of prior knowledge in both the weight and state estimation preserves the sequential nature of training data. However, the improvements of methods such as the UKF come at a higher computational cost than SGD. As a result, these methods saw little practical use for the better part of the last two decades.

Since then, advances in both computer software and hardware have enabled the increased application of BNN by the academic community [1, 3, 4]. One such method is the UKF, a second-order filtering technique that has been shown to function as a consistent training mechanism for NN on time series. The UKF estimates the posterior through a deterministic set of samples, but generally requires a set that is several orders of magnitude smaller than that of comparable methods such as the SMC method. This is due to the assumption that the estimated variables always follow Gaussian distributions, which greatly simplifies the posterior compared to the estimation of more complex distributions. While the Gaussian assumption cannot be said to hold for many real-world applications, the UKF has shown promising results when training neural networks on a wide range of problems [4–8]. While much of the research on the UKF focuses on its application to engineering-related topics, little has been written on the application of the UKF to time series, particularly in the context of financial markets. An important element of financial time series modelling is the ability to adjust to sudden changes in volatility. A recent example is the outbreak of COVID-19 and the resulting global economic turmoil [9, 10]. The sequential nature of the UKF allows it to adjust to changes in the historical data in a single training iteration. Moreover, several other characteristics of the UKF training method provide mitigating factors against the overfitting issue that arises when training neural network architectures of increasing complexity.

This research aims to assess the ability of the UKF to train neural networks for time series prediction. This is done first by means of a theoretical derivation and analysis of the method, and second by performing numerical experiments. The numerical experiments are initially used to assess the sensitivity of the UKF training mechanism with respect to the noise covariance terms R and Q, where R introduces noise in the output estimation and Q in the weight estimation. We then compare the UKF method with SGD in order to assess both the generalization and the corresponding out-of-sample performance of the UKF in comparison to the well-established SGD used as a benchmark for training NN. These two objectives are studied by means of two types of training data with properties important in financial time series prediction, including temporal patterns, volatility and nonlinearity [11]. First, a composite function is used to simulate sequential data by means of a perturbed sine function. Second, the USD/RUB and USD/EUR currency pairs are used to assess the training methods for nonlinear financial time series. We assess a low-volatility and a high-volatility subset for both currencies to study the effect of different data characteristics on the training methods. Additionally, the effect of COVID-19 on both currencies is assessed by training neural networks on the currency data from January 2020 until September 2020. As the generalization of neural networks is not readily quantified by standard performance metrics, we implement the input and weight Hessian recently proposed by Borovykh et al. in [12]. All numerical comparisons are done for two architectural choices in terms of the number of network weights and layers, so that we can gain a better understanding of the computational bottlenecks and computational efficacy of the different training methods.
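The exact composite function used in the experiments is specified later in the thesis; as a rough illustration of the kind of synthetic data involved, a perturbed sine series with noise level c could be generated as follows (a hypothetical form, not the thesis's own definition):

```python
import numpy as np

def noisy_sine(n_samples: int, c: float, seed: int = 0):
    """Generate a perturbed sine wave: targets are sin(x) plus Gaussian
    noise whose standard deviation is the noise level c."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 4.0 * np.pi, n_samples)
    y = np.sin(x) + c * rng.standard_normal(n_samples)
    return x, y

# Example: K = 150 samples with noise level c = 0.3, mirroring the
# parameter names used in the experiments.
x_train, y_train = noisy_sine(150, 0.3)
```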

2 Literature Review

2.1 Neural Networks and Time Series Prediction

Neural networks have increasingly become one of the most popular trends within machine learning [13]. Although the vast majority of applications involve robotics and image recognition tasks, an increasing number of studies illustrate the use of NN for time series prediction. Initial neural network architectures consisted of few nodes and one hidden layer, as proposed by Minsky in 1951 [14]. Since then, the improvement of computer hardware corresponded to a widespread use of the multilayer perceptron (MLP) [2]. Over the last 20 years different architectures were proposed for time series prediction, such as the Recurrent Neural Network (RNN) and the Convolutional Neural Network (CNN). Instead of connecting all weights between layers in a sequential manner, as is the case for standard NN, these methods allow for the removal and addition of non-sequential connections. The performance and robustness of these network types showed significant improvements [15–17]. Furthermore, both methods were shown to be capable of capturing long- and short-term patterns in time series while decreasing the tendency of the trained networks to overfit [18, 19].

Complementing the improvements in neural network architectures were novel training methods proposed to replace the standard SGD training method. While SGD only makes use of first-order information of the error function during training, a class of second-order methods emerged to incorporate more information during training for increased robustness and performance [20]. As the calculations for higher-order training methods are computationally costly, their application only became viable for MLP in the last decade [21].

In addition to its incorporation in the training method, second-order information was also proposed by Borovykh et al. in 2019 as a means to quantify the generalization of trained networks [12]. In order to assess the generalization of neural networks for time series prediction, they propose to analyze two metrics: the input Hessian and the weight Hessian. A theoretical analysis and numerical evaluation substantiate the claims that these metrics provide valuable insights into the neural networks and corresponding training methods. As opposed to SGD, Bayesian inference methods are able to incorporate an estimate of the uncertainty of the weights during training [5].
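The precise definitions of the input and weight Hessian follow Borovykh et al. [12] and are introduced later in the thesis; as a generic illustration of the quantity involved, the trace of the Hessian of a scalar function can be approximated with central finite differences (a sketch, not the thesis's implementation):

```python
import numpy as np

def hessian_trace(f, x, eps=1e-4):
    """Approximate the trace of the Hessian of scalar function f at x:
    the sum of second derivatives along each coordinate, each estimated
    with a central finite difference."""
    tr = 0.0
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        tr += (f(x + e) - 2.0 * f(x) + f(x - e)) / eps**2
    return tr

# Sanity check on f(x) = x0^2 + 3*x1^2, whose Hessian trace is 2 + 6 = 8.
tr = hessian_trace(lambda v: v[0]**2 + 3.0 * v[1]**2, np.array([0.5, -1.0]))
```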

2.2 Problems with Gradient Descent

Due to its straightforward implementation and ability to train neural networks on many different problems, gradient descent is one of the most common training mechanisms in the literature [2, 12, 22]. Especially for simple network architectures and straightforward problems it is deemed an adequate training method. However, as the number of weights and the size of the solution space increase, we observe several issues with gradient descent. As gradient descent only utilizes limited information from the first-order derivative, it has a tendency to overfit on the training data. Overfitting occurs when a neural network performs well on the training data but has decreased performance on new data. Stochastic gradient descent partially addresses this shortcoming. However, it still only takes first-order information of the gradient into account, limiting the use of information on the training problem, and therefore often requires relatively large amounts of training data. A third problem, important when training deep neural networks, is the sequential propagation of the error through the different layers, resulting in a potentially "vanishing" gradient for weights further away from the output layer.
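The vanishing-gradient effect can be made concrete with a small calculation: backpropagation multiplies one activation derivative per layer, and the sigmoid derivative is at most 0.25, so the error signal reaching early layers shrinks geometrically with depth (a minimal numeric sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # Derivative of the sigmoid: sigma(z) * (1 - sigma(z)), maximal at z = 0.
    s = sigmoid(z)
    return s * (1.0 - s)

# Best case for the gradient: every pre-activation sits at 0, where the
# derivative is exactly 0.25. Even then, ten layers scale the error
# signal by 0.25**10, i.e. below one millionth.
depth = 10
gradient_scale = sigmoid_prime(0.0) ** depth
```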

Since the rediscovery of gradient descent as a supervised learning mechanism by Rumelhart et al. in 1986, many alternative methods were proposed to improve these and other aspects of the algorithm [5, 23]. Among the more promising alternatives in terms of accuracy are a number of methods that include second-order derivative information, such as quasi-Newton, Levenberg-Marquardt and conjugate gradient techniques. The majority of these methods are prone to converging to poor local optima and require batch training for a single weight update. The poor convergence is thought to be partially due to the lack of a stochastic component in the aforementioned methods [5]. Over the last two decades several Bayesian training methods were proposed to incorporate information from higher-order derivatives, and were shown to obtain better performance than SGD for training NN [1–3, 6].


2.3 Bayesian Inference Methods

Bayesian inference is a probabilistic approach for estimating unknown probability densities over time in a recursive manner by combining measurements and prior beliefs about the system at hand. This stems from the Bayesian approach, which views probability as a degree of belief for a given situation. In contrast to the Bayesian approach, a Frequentist considers probabilities as being derived from long-run frequency distributions. One of the first Bayesian inference frameworks was the renowned Kalman filter [24]. The Kalman filter accurately estimates Gaussian random variables with linear transition models. It essentially propagates the mean and covariance of the estimated variable to produce a so-called posterior filtering distribution. Initially, the Kalman filter was mainly implemented in robotics and in guidance, navigation and control of vehicles. Later on it was often used for time series analysis and signal processing in a variety of fields [5]. Since its inception, a variety of extensions and alterations to the Kalman filter were proposed to account for more general systems. Complementing its usage as a state estimation technique, the Kalman filter has also been used for parameter estimation problems. One of the most common examples to this end is training the weights of NN. We identify four Bayesian inference methods in the literature.
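The recursive propagation of mean and covariance described above can be sketched as one predict-update cycle of the linear Kalman filter (a generic textbook form; the thesis's own derivation follows in Chapter 3 and Appendix B):

```python
import numpy as np

def kalman_step(m, P, y, A, H, Q, R):
    """One predict-update cycle of the linear Kalman filter for a state
    with mean m and covariance P, given a new measurement y.
    A, H are the linear transition and observation models;
    Q, R the process and observation noise covariances."""
    m_pred = A @ m                       # propagate the mean
    P_pred = A @ P @ A.T + Q             # propagate the covariance, add process noise
    S = H @ P_pred @ H.T + R             # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain
    m_new = m_pred + K @ (y - H @ m_pred)
    P_new = P_pred - K @ S @ K.T
    return m_new, P_new

# Scalar random-walk example: the estimate moves toward the measurement
# and the posterior covariance shrinks below the prior.
A = H = np.eye(1)
Q, R = 0.01 * np.eye(1), 0.1 * np.eye(1)
m, P = kalman_step(np.zeros(1), np.eye(1), np.array([1.0]), A, H, Q, R)
```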

The first and most applied variant of the Kalman filter is the extended Kalman filter (EKF). The EKF was proposed to allow for nonlinear transitions of the variables, while maintaining the Gaussian assumption for all system variables. It essentially approximates the estimate by linearizing the transition functions around the current estimate through first-order Taylor series approximations. This linearization only incorporates the estimated mean of the state and neglects the information of higher-order terms such as the variance. The accuracy and consistency of the algorithm often decay as a result [6]. In other words, the approximations of the EKF often result in large errors for the posterior mean and covariance of a transformed variable.

To counter the limitations of the EKF for nonlinear transition functions, the unscented Kalman filter (UKF) was developed. Based on the unscented transform, the UKF was shown to accurately propagate the first- and second-order information of Gaussian random variables (GRV) by means of a deterministic set of points [4–6]. Unlike the EKF, the UKF uses the true nonlinear transformation while approximating the distribution of the state random variable. Ever since the UKF was first derived it has replaced the EKF in most applications with GRV undergoing nonlinear transformations. In almost all cases, extensive research has shown the superiority of the UKF over the EKF at similar computational costs [25].
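The deterministic point set at the heart of the UKF can be sketched as the unscented transform in the scaled sigma-point form of Van Der Merwe (parameter choices here are illustrative; the thesis derives the transform in Section 3.3.1.1):

```python
import numpy as np

def unscented_transform(m, P, f, alpha=1e-3, beta=2.0, kappa=0.0):
    """Propagate a Gaussian with mean m and covariance P through a
    nonlinear function f using 2n+1 deterministic sigma points."""
    n = len(m)
    lam = alpha**2 * (n + kappa) - n
    L = np.linalg.cholesky((n + lam) * P)       # matrix square root
    sigmas = np.vstack([m, m + L.T, m - L.T])   # the 2n+1 sigma points
    Wm = np.full(2 * n + 1, 1.0 / (2.0 * (n + lam)))
    Wc = Wm.copy()
    Wm[0] = lam / (n + lam)
    Wc[0] = lam / (n + lam) + (1.0 - alpha**2 + beta)
    Y = np.array([f(s) for s in sigmas])        # true nonlinear mapping
    m_y = Wm @ Y                                # transformed mean
    P_y = (Wc[:, None] * (Y - m_y)).T @ (Y - m_y)  # transformed covariance
    return m_y, P_y

# For a linear map the transform is exact: mean and covariance are recovered.
m_y, P_y = unscented_transform(np.array([1.0, 2.0]),
                               np.diag([0.5, 0.2]),
                               lambda s: s)
```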


A third approximation method for obtaining the posterior filtering distribution is the sequential Monte-Carlo (SMC) method. Although this method was already applied in the field of theoretical physics in 1951 by Kahn et al. [26], its application to signal filtering and time series estimation was only proposed by Kitagawa in 1993 [27] and by Liu et al. in 1998 [28]. SMC revolves around estimating the state posterior by stochastically sampling a large set of points and propagating these. Unlike the various Kalman filter methods, the SMC allows for both nonlinearities and non-Gaussian random variables. However, in order to obtain accurate results the SMC requires a number of points that is several orders of magnitude larger than that of the UKF, making it computationally expensive in comparison.

The last category is based on the Variational Inference (VI) method, and is specifically known for its application in NN training [3, 29]. VI aims to provide a locally optimal, exact analytical solution to an approximation of the true posterior. In the context of training neural networks, VI is used to learn the network parameters such that the proposed distribution q(X) is optimal in terms of the Kullback-Leibler (KL) divergence. Moreover, as q(X) is chosen to be a parametric density, we can compute gradients of the KL divergence and thus allow for backpropagation algorithms similar to the gradient descent algorithms normally used for training NN. The first variational training method to this end was derived by Bishop in 2006 [30].
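The VI idea of descending the KL divergence over a parametric family can be illustrated in miniature with univariate Gaussians, for which KL(q || p) has a closed form (a toy sketch; actual BNN training operates on the full weight posterior):

```python
import math

def kl_gauss(mu_q, s_q, mu_p, s_p):
    """Closed-form KL(q || p) between two univariate Gaussians with
    means mu_q, mu_p and standard deviations s_q, s_p."""
    return math.log(s_p / s_q) + (s_q**2 + (mu_q - mu_p)**2) / (2.0 * s_p**2) - 0.5

# Pick a parametric q and descend the KL gradient (numeric, central
# difference) with respect to the variational mean only.
mu_q, s_q = 0.0, 1.0          # variational parameters
mu_p, s_p = 2.0, 1.0          # "true" posterior, known here for illustration
eta, eps = 0.1, 1e-5
for _ in range(200):
    g = (kl_gauss(mu_q + eps, s_q, mu_p, s_p)
         - kl_gauss(mu_q - eps, s_q, mu_p, s_p)) / (2.0 * eps)
    mu_q -= eta * g           # gradient step: mu_q converges toward mu_p
```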

2.4 Bayesian Inference for Parameter Estimation

Parameter estimation through Bayesian inference methods has been researched extensively for various applications, such as training neural networks for classification problems and time series estimation. The Kalman filter was first proposed as a second-order training mechanism for neural networks by Singhal and Wu in 1989 [31]. Since then, the EKF, UKF and SMC algorithms were shown to be effective alternatives to the commonly implemented gradient descent [1, 5, 6, 31]. Besides the increased accuracy and the additional information obtained by estimating the posterior distribution, a further benefit for NN training stems from training in an on-line manner: the data used to train the NN is only iterated over once, in a sequential manner. Conversely, training NN through gradient descent requires training the weights on the entire training data set numerous times to achieve adequate performance. One of the first Bayesian frameworks for training NN was proposed in 1992 by MacKay and further extended by Bar-Shalom et al. in 1993 [32, 33]. Gradient descent was further shown to be a degenerate form of the EKF training method by Rogers et al. in 1992 and later by De Freitas et al. in 1998 [34, 35]. After the introduction of the UKF in 2001 by Wan and Van Der Merwe, the UKF was shown to consistently outperform the EKF for NN training on a variety of problems [4, 6–8, 36]. Furthermore, Van Der Merwe extensively illustrates the high accuracy of UKF-trained NN on different nonlinear noisy time series. However, the Gaussian assumption on the parameters and the final state posterior does exclude any multi-modality that is likely to occur for neural network approximations. The SMC method addresses this issue and provides an accurate training mechanism for time series estimation with neural networks, such as the pricing of options on the Financial Times Stock Exchange 100-Share (FTSE-100) index [1]. Moreover, since its inception the computational limitations of the SMC have been greatly reduced, making it an increasingly viable candidate for NN training.

A final hybrid method of interest is the unscented particle filter (UPF) proposed by Van Der Merwe in 2004 [6]. This method incorporates the UKF to obtain the proposal distribution in the SMC algorithm, allowing it to incorporate the latest measurement in a more effective manner than the standard SMC method. Van Der Merwe shows that the UPF outperforms the UKF and SMC as a training mechanism for one-step-ahead predictions of options on the FTSE-100. The superiority of the UPF in terms of predictive accuracy is further shown for various composite time series with varying noise and nonlinearity, such as the Mackey-Glass-30 chaotic time series and the Ikeda chaotic time series [6]. While the potential of the SMC and the hybrid UPF method for training NN was substantiated over two decades ago, we observed little research on these methods since then, contrary to variational inference. The relatively low computational costs have made VI the de facto method for calculating posterior distributions in NN training [3, 29].

3 Theoretical Background

This section provides the theoretical background of neural network training with the UKF and SGD. First, we explain the calculations for forward propagation in standard NN and briefly cover NN training through SGD. Second, an overview is given of Bayesian inference for state and parameter estimation problems. Lastly, the basic Kalman filter and the complete mechanism of the UKF are derived to help understand the proposed NN training.

3.1 Neural Networks

Feedforward neural networks (FNN), the basis of all NN, can learn to perform tasks by considering examples with known inputs and desired outcomes [2]. A FNN is typically constructed with three different layers:

1. The input layer - the initial data enters the network here.

2. The hidden layers - these are layers located between the input and output layer. Computations in neural networks are done in these layers.

3. The output layer - responsible for producing an outcome for a given input.


Figure 3.1: A feedforward neural network with a single hidden layer.

In Figure 3.1 a straightforward FNN with one hidden layer is illustrated. In turn, the layers each consist of nodes. A simple explanation of a network node is given in Appendix A. For standard FNN we assume that each node is connected to all nodes in both the previous layer and the next layer. We can further differentiate two mechanisms in the learning process of FNN. Typically, an input x_n is first passed through some nonlinear function to obtain a predicted output ŷ_n for n = 1, . . . , N, with N denoting the size of the sample set. For FNN we call this process forward propagation: the network signal is propagated forwards to obtain an output. The estimated output ŷ_n is then compared to a desired outcome y_n. Second, the network adjusts according to the discrepancy between ŷ_n and y_n in order to better predict y_{n+1} in the next iteration; this is known as backpropagation. The combined process of the two methods for a given set (x_n, y_n) over n = 1, . . . , N iterations is called supervised learning. For each node in the network we require an activation function σ. A common choice for σ is the sigmoid function

σ(x) = 1 / (1 + exp(−x)).    (3.1)

The propagation of the input through the network for a node k in a layer l can now recursively be calculated with a weighted sum of the nodes in the previous layer and the activation function as

a^l_k(z^l_k) = 1 / (1 + exp(−z^l_k)),    with    z^l_k = Σ_{j=1}^{n_{l−1}} w_{j,k} · a^{l−1}_j + b^l_k,    (3.2)


where a^l_k is called the activation of a node. For the input layer this is equivalent to a^0_k = x_n. When the activation is calculated for the output layer with Eq. 3.2 we complete the process of forward propagation for an input x_n. In summary, the input signal is propagated through the respective nodes in every layer with predetermined activation functions, weights and biases.
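As an illustration, Eqs. 3.1 and 3.2 can be implemented in a few lines of NumPy. The 5-3-1 architecture, seed and input below are hypothetical choices for this sketch, not the configuration used in the experiments.

```python
import numpy as np

def sigmoid(z):
    # Eq. 3.1: logistic activation, applied elementwise
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    # Eq. 3.2: a^l = sigma(W^l a^{l-1} + b^l), starting from a^0 = x
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

# Hypothetical 5-3-1 network with random parameters
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 5)), rng.standard_normal((1, 3))]
biases = [np.zeros(3), np.zeros(1)]
y_hat = forward(np.ones(5), weights, biases)
```

Because the output node also uses the sigmoid here, the prediction is squashed into (0, 1); a linear output activation is a common alternative for regression.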

Now that we have defined forward propagation of FNN we explain the method of backpropagation, which revolves around propagating an error function backwards through the network. To this end we apply the sum-of-squares error function given by

E(θ) = (1/2) Σ_{n=1}^{N} ( y_n − ŷ_n(x_n, θ) )²,    (3.3)

where n is the sample index and the factor 1/2 is required for the subsequent derivation of gradient descent. This specific error function is commonly chosen in the literature but can be replaced with other functions [2]. Now that we have obtained an expression for the error, or cost, of our system, we can elaborate on how to minimize this function by means of gradient descent. Minimizing the cost function E(θ) is equivalent to setting the gradient of E(θ) with respect to θ to zero, indicating stationary points. However, for a given optimization problem it is likely that the cost function has multiple local optima with respect to w and b, one of which is the global optimum with a corresponding optimal set of parameters. In order to find such optima we can apply small changes to w and b in the direction for which the gradient of E(θ) is negative. This requires a complete calculation of the gradient ∇E(θ) for each sample in our training data set. It is common practice to initialize the weights and biases to some w_0 and b_0. Thereafter, for training iterations t = 1, . . . , T the weights and biases are updated according to

w_{t+1} = w_t − η∇E(w_t),    (3.4)

b_{t+1} = b_t − η∇E(b_t),    (3.5)

where the minus sign takes into account that we wish to move in the direction of the decreasing cost function until some desired accuracy is obtained. The learning parameter η determines how large the change in weights and biases is at every iteration. Too large a value for η can prevent the algorithm from converging to an optimum, whereas a very small η causes the descent to occur too slowly. For multiple weights and biases backpropagation becomes multi-dimensional with the complete gradient given by ∇E(w, b), and thus the partial derivative with respect to each parameter is required to determine any optima.

The calculation of the gradient ∇E is done by obtaining ∇E_n for each sample n = 1, . . . , N according to

∇E = (1/N) Σ_{n=1}^{N} ∇E_n.    (3.6)

The complete calculations per node for backpropagation are derived in Appendix A. We can further differentiate between on-line training, where the gradient is calculated for one sample per iteration, and batch training, where a subset of the training set is considered per iteration. If we consider on-line training, our weight and bias updates in Eqs. 3.4 and 3.5 become

w_{t+1} = w_t − η∇E_n(w_t),    (3.7)

b_{t+1} = b_t − η∇E_n(b_t).    (3.8)

3.1.1 Stochastic Gradient Descent

The lack of a stochastic component in gradient descent often causes the network to overfit on training data and thus poorly predict unseen data samples [12]. Furthermore, for large data sets it is very costly to calculate the complete gradient. Stochastic gradient descent (SGD) was thus proposed to reduce the number of computations and obtain adequate generalization for NN [22]. With SGD we randomly select a subset of the complete training set according to a batch size N_batch. The gradient ∇E is then calculated for this batch of size M = N_batch. This can be formulated as

∇E = (1/M) Σ_{m=1}^{M} ∇E_m,    (3.9)

for M < N. In the case of M = 1 this reduces to on-line training, as only one sample is considered per training iteration. SGD provides a noisy estimate of the complete gradient at a much smaller computational cost.
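The minibatch update of Eqs. 3.7 - 3.9 can be sketched as follows. The linear least-squares example problem, learning rate and batch size are illustrative choices for this sketch, not the settings of the thesis experiments.

```python
import numpy as np

def sgd_step(w, grad_fn, xs, ys, eta, batch_size, rng):
    # Estimate the gradient on a random minibatch of size M < N (Eq. 3.9)
    idx = rng.choice(len(xs), size=batch_size, replace=False)
    grad = np.mean([grad_fn(w, xs[i], ys[i]) for i in idx], axis=0)
    # Move against the noisy gradient estimate (Eqs. 3.7-3.8)
    return w - eta * grad

# Illustrative linear model with E_n = 0.5 * (y_n - w @ x_n)^2,
# so that grad E_n = -(y_n - w @ x_n) * x_n
grad_fn = lambda w, x, y: -(y - w @ x) * x
rng = np.random.default_rng(1)
xs = rng.standard_normal((100, 3))
w_true = np.array([1.0, -2.0, 0.5])
ys = xs @ w_true
w = np.zeros(3)
for _ in range(2000):
    w = sgd_step(w, grad_fn, xs, ys, eta=0.05, batch_size=8, rng=rng)
```

Since the targets here are noise-free, the minibatch iteration converges to the true parameter vector despite the randomness of each gradient estimate.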


3.2 Recursive Bayesian Estimation

3.2.1 Bayesian Inference

Bayesian inference concerns the estimation of hidden variables, be it system states or parameters a, from noisy and/or partial observations b. Rather than concerning ourselves with obtaining a point estimate, with Bayesian inference we aim to estimate the posterior probability distribution over all possible outcomes. By means of Bayes' rule the posterior can be calculated by incorporating certain prior beliefs of the system. In a system where observations are scarce or contain large amounts of noise this can significantly improve estimates of the hidden variables. The posterior according to Bayes' rule is given by

p(a|b) = p(b|a) p(a) / p(b),    (3.10)

where p(a), p(b|a) and p(b) are the prior, likelihood and normalizing constant respectively. The normalizing constant p(b) is obtained by integrating out a from the joint probability p(b, a). In complex systems where we are interested in tracking many system parameters in a continuous state space, such as the weights of a neural network, this normalizing constant is difficult to compute due to the high-dimensional nature of the integral. The posterior therefore has no tractable solution for many systems, and is often approximated numerically.

3.2.1.1 State Estimation

In order to understand Bayesian inference for a system evolving over time we wish to define a corresponding state-space representation [24]. State space models provide a general framework for the probabilistic dependence between the true state of a system x and a stochastic set of observed measurements y. For a discrete state space this is equivalent to a Hidden Markov Model (HMM) [37], as is illustrated in Fig. 3.2.

Figure 3.2: A hidden Markov model with a true state x and an observed state/measurement y evolving over time.


Let us consider the hidden system state xn with initial probability p(x0) evolving over time, where n denotes the discrete time index, as an indirect or partially observed first order Markov process according to

p(xn|x0, . . . , xn−1) = p(xn|xn−1). (3.11)

Similarly, the observation yn at time n is only dependent on the current state and thus conditionally independent of all other states given the current state as

p(yn|x0, . . . , xn) = p(yn|xn). (3.12)

The probability distribution over all states of the above HMM can then be given as

p(x_0, . . . , x_n, y_1, . . . , y_n) = p(x_0) Π_{i=1}^{n} p(y_i|x_i) p(x_i|x_{i−1}).    (3.13)

From the probability distribution for all states we generally wish to obtain an estimate of the state xn by obtaining the probability of xn given a set of observations y1, . . . , yn. This state-space model can also be viewed as a set of nonlinear equations as

xn= f (xn−1, un) (3.14)

yn= h(xn, vn), (3.15)

where u_n and v_n are the process and measurement noise respectively. In terms of the previously defined probability densities, we first note that f and the process noise distribution p(u_n) define the state transition density p(x_n|x_{n−1}). Second, the observation likelihood p(y_n|x_n) is described by h and the observation noise distribution p(v_n). In order to determine the evolution of the hidden system state we are therefore required to define:

1. The state transition function f (·) and the observation function h(·)

2. The process noise distribution p(u_n) and the observation noise distribution p(v_n)

3. The prior state distribution p(xn−1).

In a Bayesian framework the distribution of interest is the posterior density of xn. From the posterior we can then obtain some state estimate ˆxn, which in practice is often taken to be the mean of the distribution. The posterior density of the state, given a set of observations {y1, y2, . . . , yn}, is given by p(xn|y1, . . . , yn). Combined with Bayes’ rule from Eq. 3.10 we can define the posterior distribution as

p(x_n|y_1, . . . , y_n) = p(y_1, . . . , y_n|x_n) p(x_n) / p(y_1, . . . , y_n)    (3.16)
                       = p(y_n|x_n) p(x_n|y_1, . . . , y_{n−1}) / p(y_n|y_1, . . . , y_{n−1}).    (3.17)

The Bayesian recursive formula above is related to the state-space model defined in Eqs. 3.14 and 3.15 as follows. Firstly, the prior at time n is calculated by projecting the posterior at n − 1 forward in time with the probabilistic process model as

p(x_n|y_1, . . . , y_{n−1}) = ∫ p(x_n|x_{n−1}) p(x_{n−1}|y_1, . . . , y_{n−1}) dx_{n−1},    (3.18)

where p(x_n|x_{n−1}) is the state transition density. Hence the prior at n is recursively defined as the posterior from n − 1, meaning that we need only specify the initial prior distribution p(x_0). It should thus be noted that the influence of this initial prior diminishes as the posterior is calculated for new observations. After the prior is determined we need to incorporate the latest measurement by means of the observation likelihood for the updated posterior as

p(xn|y1, . . . , yn) = C · p(yn|xn)p(xn|y1, . . . , yn−1), (3.19)

where C denotes the normalizing factor as

C = ( ∫ p(y_n|x_n) p(x_n|y_1, . . . , y_{n−1}) dx_n )^{−1}.    (3.20)


The state transition density and observation likelihood can in turn be expressed in terms of the model functions and noise densities as

p(x_n|x_{n−1}) = ∫ δ(x_n − f(x_{n−1}, u_n)) p(u_n) du_n,    (3.21)

p(y_n|x_n) = ∫ δ(y_n − h(x_n, v_n)) p(v_n) dv_n,    (3.22)

where δ denotes the Dirac delta function. The derived expression for the posterior is the optimal solution, but becomes hard to compute exactly for nonlinear and/or non-Gaussian systems. Most real-world applications require approximations to the integrals in Eqs. 3.18 - 3.22 due to the large number of data points and system parameters. In the linear case with Gaussian noise terms the Kalman filter can be used to compute the optimal posterior density exactly. For nonlinear or non-Gaussian systems the UKF, SMC and VI are examples of methods that can be used to approximate the state posterior numerically.

3.2.1.2 Parameter Estimation

Rather than directly estimating the state of a system we can also estimate the parameters of a nonlinear function g(·), which takes an input x_n and returns an output ŷ_n as follows

ŷ_n = g(x_n, w),    (3.23)

where n denotes a discrete enumerating index. When we consider g(·) to be a neural network, the parameters w constitute both the weights and biases. In the context of training a neural network we aim to estimate w optimally for a training set that consists of known inputs and desired outputs {x_n, y_n}. This allows us to calculate the measurement error as

e_n = y_n − ŷ_n    (3.24)
    = y_n − g(x_n, w),    (3.25)

with the goal being to solve for the parameters w such that the expectation of the error is minimized. The state-space representation of our system can now be written as


wn+1= wn+ rn (3.26)

yn= g(xn, wn) + en, (3.27)

where w_n represents a stationary process with an identity transition matrix, driven by the process noise r_n.

3.3 The Kalman Filter

The Kalman filter is a minimum mean square error (MMSE) Bayesian inference method that fuses data from sensors or inputs to produce an accurate estimate of the true state of a system. Although it is mostly known for its application to vehicle navigation in engineering, the Kalman filter is also an efficient and accurate method for time series prediction. We consider the state of a system x_n and the observation y_n at a discrete time n. The underlying framework is based on the state-space representation of a system and evolves according to the process model and measurement model respectively defined as

x_{n+1} = F_n x_n + u_n,    (3.28)

y_n = H_n x_n + v_n,    (3.29)

with u_n and v_n denoting white, zero-mean Gaussian noise terms with known covariances Q_n and R_n. The matrices F_n and H_n are respectively the process and measurement transition matrix. When these transitions are linear functions the optimal state estimate can be computed by the standard Kalman filter proposed by Kalman in 1960 [24]. Specifically, we are able to obtain tractable solutions to Eqs. 3.18 - 3.22. However, in the nonlinear case we must extend the original Kalman filter framework to obtain accurate state estimates. One such extension is the unscented Kalman filter (UKF) proposed by Julier et al. [38]. The UKF has further been successfully applied to parameter estimation problems such as training neural networks [2]. In order to derive the UKF-based training mechanism for FNN it is important to understand the Kalman filter for a linear state-space model. The derivation of the Kalman filter is given in Appendix B with the final optimized state filter presented in Algorithm 1.


Algorithm 1: The Kalman filter for state estimation
Input: Specify Q, R and initial x_0.
Initialization, for n = 0:
    x̂_{0|0} = E[x_0]
    P_{0|0} = ||x_0 − x̂_{0|0}||²
Recursion, for n = 1, 2, . . .:
    Predict step:
        x̂_{n|n−1} = F_{n−1} x̂_{n−1|n−1}
        P_{n|n−1} = F_{n−1} P_{n−1|n−1} F_{n−1}^T + Q_{n−1}
    Filter step:
        e_n = y_n − H_n x̂_{n|n−1}
        S_n = R_n + H_n P_{n|n−1} H_n^T
        K_n = P_{n|n−1} H_n^T S_n^{−1}
        x̂_{n|n} = x̂_{n|n−1} + K_n e_n
        P_{n|n} = (I − K_n H_n) P_{n|n−1} (I − K_n H_n)^T + K_n R_n K_n^T
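A direct NumPy transcription of one predict/filter cycle of Algorithm 1 could look as follows; the scalar random-walk example at the bottom, including its noise levels and observations, is purely illustrative.

```python
import numpy as np

def kalman_step(x_hat, P, y, F, H, Q, R):
    # Predict step
    x_pred = F @ x_hat
    P_pred = F @ P @ F.T + Q
    # Filter step
    e = y - H @ x_pred                            # innovation
    S = R + H @ P_pred @ H.T                      # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)           # Kalman gain
    x_new = x_pred + K @ e
    I_KH = np.eye(len(x_hat)) - K @ H
    P_new = I_KH @ P_pred @ I_KH.T + K @ R @ K.T  # Joseph form, as in Algorithm 1
    return x_new, P_new

# Illustrative 1-D random walk observed with noise
F = H = np.eye(1)
Q, R = 0.01 * np.eye(1), 0.1 * np.eye(1)
x_hat, P = np.zeros(1), np.eye(1)
for y in (0.9, 1.1, 1.0, 0.95):
    x_hat, P = kalman_step(x_hat, P, np.array([y]), F, H, Q, R)
```

After a handful of observations near 1.0 the state estimate moves close to that value while the error covariance shrinks from its initial value.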

3.3.1 The Unscented Kalman Filter

When we wish to apply the Kalman filter in this form to nonlinear systems, however, it no longer produces accurate estimates. A nonlinear state transition or measurement model prevents the calculation of the optimal state estimate and requires an adjusted approach. The state-space model allowing for nonlinearities is formulated as

xn+1= fn(xn) + un (3.30)

yn= hn(xn) + vn, (3.31)

where f_n(·) and h_n(·) are the state transition model and the measurement model respectively. The process noise u_n and measurement noise v_n are considered additive and zero-mean in this specific formulation. As before we are interested in obtaining the posterior Gaussian distribution of the state, and thus require an estimate of the mean and the covariance. However, it is difficult to accurately obtain an estimate of the posterior by transforming the state distribution through nonlinear functions. In order to cope with the nonlinear transformation of x_n the UKF provides us with a set of sigma points in state space that adequately capture the first and second order moments, or the mean and the covariance, of a posterior distribution. Note that higher order moments can be captured through the unscented transform with a larger set of sigma points [6, 39]. This set of points is then easily propagated through any nonlinear function, and the posterior estimates can be obtained from this set of propagated points. In other words, we require a set of points that shows the same sample properties as the estimated distribution. It has been shown that the UKF can provide accurate estimates through these sigma points in practice [5, 6, 25]. In the following section we present the method of picking the set of weighted sigma points for the UKF.

3.3.1.1 The Unscented Transform

For linear state transition models the mean x̂ and error covariance matrix P_x̃ provide mathematically convenient representations of the uncertainty because, for a linear transformation T, they can easily be calculated as

E[T · ˆx] = T · E[ˆx], (3.32)

Var[T · ˜x] = T · Var[˜x]. (3.33)

For nonlinear transformations of the mean and error covariance we cannot derive such general rules. Only for an error covariance equal to zero can we calculate the transformed state estimate by simply evaluating f(x̂) for some nonlinear function f(·). Note that this is a trivial case where the mean estimate becomes a point estimate with no uncertainty. In the general case where the covariance is not equal to zero the transformed mean is generally not equal to f(x̂). Furthermore, it is not even possible to determine the mean of the transformed state distribution with only the prior mean and covariance [39]. This indicates that in this instance the transformed mean and covariance can only be approximated.

The unscented transform is a widely applied method for estimating the mean and covariance after a nonlinear transition. This is achieved by accurately encoding the information of the mean and covariance in a set of sigma points. These points represent the discretized probability distribution with the true mean and covariance. Each point is then propagated through the nonlinear function f(·), resulting in a transformed distribution with a corresponding mean and covariance. Suppose we have an estimate of the mean x̂_n and the error covariance P_n at time n. A simple scheme to estimate a set of points that incorporates these quantities is given to illustrate the mechanism. Capital letters are used to make a distinction between the different variables and the corresponding set of sigma points and weights. The sigma points can first be chosen according to

X_0 = x̂,
X_i = x̂ + ( √((n_x + k) P_n) )_i,    for i = 1, . . . , n_x,    (3.34)
X_{i+n_x} = x̂ − ( √((n_x + k) P_n) )_i,    for i = 1, . . . , n_x,

where n_x represents the dimension of x, k a tuning parameter, and ( √((n_x + k) P_n) )_i the i-th row of the matrix square root of (n_x + k) P_n. The square root is generally obtained through the Cholesky decomposition. Secondly, the sigma points are weighted according to

W_0 = k / (n_x + k),
W_i = 1 / (2(n_x + k)),    for i = 1, . . . , 2n_x,

where the weight W_i corresponds to the point X_i and we require Σ_{i=0}^{2n_x} W_i = 1. The complete set of points and weights is thus S = {W_i, X_i; i = 0, . . . , 2n_x}. The nonlinear transformation of the sigma points can then be described as follows:

1. Pass each sigma point through the nonlinear transformation to obtain the transformed points according to

Y_i = f(X_i)    for i = 0, . . . , 2n_x.    (3.35)

2. The mean estimate is then given by the weighted average of the transformed points as

ŷ = Σ_{i=0}^{2n_x} W_i Y_i.    (3.36)

3. The transformed error covariance P_ỹ is then

P_ỹ = Σ_{i=0}^{2n_x} W_i (Y_i − ŷ)(Y_i − ŷ)^T,    (3.37)

and the state-measurement error cross-covariance is calculated as

P_x̃ỹ = Σ_{i=0}^{2n_x} W_i (X_i − x̂)(Y_i − ŷ)^T.    (3.38)
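The three steps above translate directly into code. The sketch below implements the simple (unscaled) scheme of Eqs. 3.34 - 3.37, using the columns of the lower-triangular Cholesky factor as the sigma directions; the linear sanity check at the bottom is an illustrative choice.

```python
import numpy as np

def unscented_transform(x_hat, P, f, k):
    """Estimate mean and covariance of f(x) from sigma points (Eqs. 3.34-3.37)."""
    n = len(x_hat)
    L = np.linalg.cholesky((n + k) * P)  # (n+k)P = L L^T
    # 2n+1 sigma points: the mean, plus/minus the columns of L (Eq. 3.34)
    X = [x_hat] + [x_hat + L[:, i] for i in range(n)] + [x_hat - L[:, i] for i in range(n)]
    # Weights of the simple scheme, summing to one
    W = np.full(2 * n + 1, 1.0 / (2.0 * (n + k)))
    W[0] = k / (n + k)
    # Propagate each point and recombine (Eqs. 3.35-3.37)
    Y = np.array([f(p) for p in X])
    y_hat = W @ Y
    P_y = sum(w * np.outer(y - y_hat, y - y_hat) for w, y in zip(W, Y))
    return y_hat, P_y

# Sanity check: for a linear f the transform reproduces mean and covariance exactly
A = np.array([[1.0, 2.0], [0.0, 1.0]])
y_hat, P_y = unscented_transform(np.array([1.0, 2.0]), np.diag([0.5, 0.2]),
                                 lambda x: A @ x, k=1.0)
```

For a linear map the sigma-point estimates coincide with A x̂ and A P A^T, which is a convenient correctness check before applying the transform to nonlinear functions.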

Figure 3.3: The basic idea behind the unscented transform is shown for the set of sigma points before (L) and after (R) the transformation.

Figure 3.3 allows us to grasp the concept of the unscented transform for a two-dimensional state-space example. We observe that 2 sigma points are chosen for each dimension and are then passed through a nonlinear transformation to obtain a new set of points. The mean sigma point for each dimension is not visible in the image, but should be taken into account in the calculations. The distance of the i-th point from the mean is proportional to √(n_x + k). Firstly, this indicates that the spread of all sigma points increases as the number of state dimensions increases. For highly nonlinear transformations this can lead to inaccuracies. The parameter k is introduced to counter this problem, providing an extra degree of freedom to tune the second and higher order moments of the approximated state distribution. For a Gaussian state distribution the standard choice in the literature is k = 3 − n_x. Moreover, this choice results in a scaling invariance of the sigma points with respect to the dimension n_x. Deviating from this can allow for estimating non-Gaussian state distributions. One important fault of this straightforward method of calculating the weighted sigma points occurs for k < 0, when the matrix (n_x + k) P_n is no longer guaranteed to be positive semi-definite, which is a requirement of the Cholesky decomposition. To counter this issue an alteration to the unscented transform is proposed. Specifically, we apply a scaled transformation to the set of sigma points S = {W_i, X_i; i = 0, . . . , 2n_x} by

X'_i = X_0 + α(X_i − X_0),
W'_i = W_0/α² + (1 − 1/α²)    for i = 0,
W'_i = W_i/α²    for i = 1, . . . , 2n_x.    (3.39)


This scaling step can also be incorporated in the calculation of sigma points to reduce the number of required calculations. The points are still calculated according to Eq. 3.34 but with the scaling parameter k now defined as

k = α²(n_x + λ) − n_x.    (3.40)

The weights are now calculated separately for the mean and covariance as

W_0^{(m)} = k / (n_x + k),    W_0^{(c)} = k / (n_x + k) + 1 − α² + β,    (3.41)
W_i^{(m)} = W_i^{(c)} = 1 / (2(n_x + k)),    for i = 1, . . . , 2n_x.

Three new parameters are introduced to define the weighted sigma points. First, the constant α determines the spread of sigma points around the mean estimate x̂ and we require 0 ≤ α ≤ 1. Ideally, α is a small number to avoid sampling non-local effects for highly nonlinear systems. Second, the constant λ should be chosen as λ ≥ 0 to ensure we obtain a positive semi-definite covariance matrix. Because the value of λ is shown to have little effect, the general choice is λ = 0 [6]. Lastly, β is used to incorporate prior knowledge of the distribution of x, where β = 2 is optimal for a Gaussian distribution. More generally, we require β ≥ 0 for correct calculation of the points. The mean, error covariance and error cross-covariance are now calculated as:

Y_i = f(X_i)    for i = 0, . . . , 2n_x,
ŷ = Σ_{i=0}^{2n_x} W_i^{(m)} Y_i,
P_ỹ = Σ_{i=0}^{2n_x} W_i^{(c)} (Y_i − ŷ)(Y_i − ŷ)^T,
P_x̃ỹ = Σ_{i=0}^{2n_x} W_i^{(c)} (X_i − x̂)(Y_i − ŷ)^T.    (3.42)


The nonlinear Kalman gain - In order to derive the Kalman filter framework for nonlinear transition models we require a more general form of the Kalman gain. To achieve this we generalize the linear Kalman gain as

K_n = P_{n|n−1} H_n^T S_n^{−1} = T_n S_n^{−1},    (3.43)

where T_n is a yet-to-be-determined matrix that represents a more general system allowing for nonlinearities in its transition models. It can be shown that this matrix T_n is the error cross-covariance matrix of the predicted state x̂_{n|n−1} and the predicted measurement ŷ_{n|n−1}. We can derive this equivalence from the standard Kalman filter by writing out the mentioned error cross-covariance of x̂_{n|n−1} and ŷ_{n|n−1} as

P_x̃ỹ,n|n−1 = E[ (x̂_{n|n−1} − E[x̂_{n|n−1}]) (ŷ_{n|n−1} − E[ŷ_{n|n−1}])^T ]    (3.44)
            = E[ (x̂_{n|n−1} − E[x̂_{n|n−1}]) · (H_n x̂_{n|n−1} + v_n − E[H_n x̂_{n|n−1} + v_n])^T ],

where we inserted the measurement equation for ŷ_{n|n−1} from Eq. 3.29. As in the derivation of the standard Kalman filter we use the linearity of expectation to factor out H_n. We further define the error of the measurement noise as ṽ_{n|n−1} = v_n − E[v_n] and the state prediction error as x̃_{n|n−1} = x̂_n − E[x̂_n]. Lastly, we recall that the state prediction x̂_{n|n−1} is uncorrelated with the measurement noise v_n. Using this information we can rewrite the above as

P_x̃ỹ,n|n−1 = E[ (x̂_{n|n−1} − E[x̂_{n|n−1}]) · (H_n (x̂_{n|n−1} − E[x̂_{n|n−1}]) + v_n − E[v_n])^T ]
            = E[ x̃_{n|n−1} (H_n x̃_{n|n−1} + ṽ_{n|n−1})^T ]
            = E[ x̃_{n|n−1} x̃_{n|n−1}^T ] H_n^T + E[ x̃_{n|n−1} ṽ_{n|n−1}^T ]
            = E[ x̃_{n|n−1} x̃_{n|n−1}^T ] H_n^T + E[ x̃_{n|n−1} ] E[ ṽ_{n|n−1} ]^T
            = E[ x̃_{n|n−1} x̃_{n|n−1}^T ] H_n^T + 0.    (3.45)


The last expression for P_x̃ỹ,n|n−1 is thus equal to the unknown matrix T_n for a linear measurement model H_n. The nonlinear Kalman gain cannot be derived through the same assumptions, which requires us to use the initial expression of the error cross-covariance T_n = P_x̃ỹ,n|n−1 for the UKF and define K_n as

K_n = P_x̃ỹ,n|n−1 S_n^{−1} = P_x̃ỹ,n|n−1 P_ỹ,n|n−1^{−1},    (3.46)

where Sn is precisely the measurement error covariance matrix for the linear Kalman filter as

S_n = ||e_n||² = ||y_n − ŷ_{n|n−1}||² = H_n P_{n|n−1} H_n^T + R_n.    (3.47)

With the unscented transform we update the initial Kalman filter procedure in Algorithm 1 and formulate the UKF method in Algorithm 2.

3.3.2 The UKF for Training Feedforward Neural Networks

In addition to state estimation for a system with nonlinear dynamics, the UKF can also be used to estimate a set of parameters on which these system dynamics depend. We recall from Section 3.2.1.2 that parameter estimation can be understood as learning a nonlinear mapping y_n = g(x_n, w_n). Supervised training of neural networks can be placed in this context as a parameter estimation problem where the parameter vector w encompasses both the weights and the biases and the nonlinear function g(·) denotes the neural network. We further recall from Section 3.1 that supervised learning requires a set of training instances (x_n, y_n), where x_n is the input and y_n is the target output for n = 1, . . . , N. The state-space representation of a neural network is thus similar to Eqs. 3.26 and 3.27 as

w_{n+1} = w_n + r_n,    (3.48)

y_n = g(x_n, w_n) + e_n,    (3.49)


Algorithm 2: The unscented Kalman filter for state estimation
Input: Specify parameters Q, R, α, β, γ, and provide initial x_0.
Initialization, for n = 0:
    x̂_{0|0} = E[x_0]
    P_x̃,0|0 = E[(x_0 − x̂_{0|0})(x_0 − x̂_{0|0})^T]
Recursion, for n = 1, 2, . . .:
    Predict step:
        X_0^{(1)} = x̂_{n−1|n−1}
        X_i^{(1)} = x̂_{n−1|n−1} + ( √((n_x + k) P_x̃,n−1|n−1) )_i,    i = 1, . . . , n_x
        X_{i+n_x}^{(1)} = x̂_{n−1|n−1} − ( √((n_x + k) P_x̃,n−1|n−1) )_i,    i = 1, . . . , n_x
        Y_i^{(1)} = f(X_i^{(1)})    for i = 0, . . . , 2n_x
        x̂_{n|n−1} = Σ_{i=0}^{2n_x} W_i^{(m)} Y_i^{(1)}
        P_x̃,n|n−1 = Σ_{i=0}^{2n_x} W_i^{(c)} (Y_i^{(1)} − x̂_{n|n−1})(Y_i^{(1)} − x̂_{n|n−1})^T + Q_{n−1}
    Filter step:
        X_0^{(2)} = x̂_{n|n−1}
        X_i^{(2)} = x̂_{n|n−1} + ( √((n_x + k) P_x̃,n|n−1) )_i,    i = 1, . . . , n_x
        X_{i+n_x}^{(2)} = x̂_{n|n−1} − ( √((n_x + k) P_x̃,n|n−1) )_i,    i = 1, . . . , n_x
        Y_i^{(2)} = h(X_i^{(2)})    for i = 0, . . . , 2n_x
        ŷ_{n|n−1} = Σ_{i=0}^{2n_x} W_i^{(m)} Y_i^{(2)}
        P_ỹ,n|n−1 = Σ_{i=0}^{2n_x} W_i^{(c)} (Y_i^{(2)} − ŷ_{n|n−1})(Y_i^{(2)} − ŷ_{n|n−1})^T + R_n
        P_x̃ỹ,n|n−1 = Σ_{i=0}^{2n_x} W_i^{(c)} (X_i^{(2)} − x̂_{n|n−1})(Y_i^{(2)} − ŷ_{n|n−1})^T
        K_n = P_x̃ỹ,n|n−1 P_ỹ,n|n−1^{−1}
        x̂_{n|n} = x̂_{n|n−1} + K_n (y_n − ŷ_{n|n−1})
        P_x̃,n|n = P_x̃,n|n−1 − K_n P_ỹ,n|n−1 K_n^T


The process model in Eq. 3.48 specifies that the ideal state of the neural network is characterized as a stationary process corrupted by the process noise r_n, where the state of the system is given by the neural network parameters w_n. The measurement model in Eq. 3.49 represents the desired neural network output y_n evolving according to the nonlinear function g(x_n, w_n) perturbed by the measurement noise e_n. We recall from Section 3.1 that training neural networks revolves around minimizing a function of the prediction error. To allow for the representation of neural network training as a Kalman filtering problem we require the two noise terms to be Gaussian and zero-mean as

rn∼ N (0, Qn), (3.50)

en∼ N (0, Rn). (3.51)

When we apply the UKF to training neural networks the resulting model returns optimal MSE estimates due to the definition of the Kalman gain in Eq. 3.46. The error function for the sample set of size N which is to be minimized is given by

E(w) = Σ_{n=1}^{N} [y_n − g(x_n, w)]^T R_n^{−1} [y_n − g(x_n, w)],    (3.52)

which can be perceived as the MSE function weighted by the observation noise covariance R_n. The optimal Kalman gain derived in Appendix B can be shown to arrive at the same result when R_n is a constant diagonal matrix and consequently cancels out of the algorithm. We further recall that for Gaussian independent noise terms the minimum mean square and maximum a posteriori estimates coincide, as is explicitly derived in [6]. Generally, training problems with noisy data require a larger R_n than those trained on relatively noise-free data [40, 41]. Furthermore, R_n allows for non-uniform scaling of the different output elements and can be interpreted as the inverse learning rate η_n in optimization algorithms such as gradient descent, multiplied by a scaling matrix B_n [5].

The process noise covariance Q_n determines to what extent the weights are allowed to change per time step, influencing the convergence of the training and the tracking performance of the algorithm. Generally, a larger Q_n causes prior knowledge to be discarded more quickly, thus emphasizing more recent observations in the state estimate and covariance. If Q_n is set to an arbitrary diagonal matrix that decreases with respect to time we can improve the model accuracy through the "annealing" of Q_n towards zero as training progresses. For stationary data with stable temporal characteristics both Q_n and R_n can be set to some constant diagonal matrix to reduce the number of parameters that require tuning during the training process. This does imply an independence assumption on the parameters, which should be taken into account when the system is defined. When we consider a highly volatile and nonlinear time series it is advised to make both R_n and Q_n adaptive. For R_n we first consider the form proposed by Sum et al. in [42] and by Haykin in [5] for training neural networks with the extended and unscented Kalman filter as

Rn= (1 − αR)Rn−1+ αR(yn−1− ˆyn−1)2, (3.53)

where α_R is a constant that can take on values in the range (0, 1] and is not to be confused with the α of the sigma point calculation of the UKF. Setting α_R close to 1 places a larger emphasis on the error in the estimate of the previous time step, while setting α_R close to zero causes R_n to maintain a more stable level. Sum et al. specify α_R = 0.005 for training neural networks on the exchange rate of the United States dollar/Deutsche Mark (USD/DM) [42]. For the adaptive Q_n we implement the definition of Haykin in [5] for training neural networks with the UKF as

Q_n = (τ^{−1} − 1) P_w̃,n−1,    (3.54)

where τ is known as the "forgetting factor", providing approximately exponentially decaying weights on past weight values. For τ = 1 we get Q_n = 0 and the weight update is solely based on the previous weights. Using τ < 1 quickly leads to the discarding of large proportions of past measurements. An extensive analysis of τ in [43] indicates that τ = 0.997 results in a balanced Q_n which retains sufficient historical data but does not generally get stuck in local optima of the weights. It should be noted that the adaptive variant of Q_n no longer maintains equal noise terms for all weights. The mentioned methods for calculating R_n and Q_n as well as other alternatives are further explained in [5].


Algorithm 3: The unscented Kalman filter for training neural networks
Input: Specify parameters Q, R, α, β, γ, and provide initial w_0.
Initialization, for n = 0:
    ŵ_0 = E[w_0]
    P_w̃,0|0 = E[(w_0 − ŵ_{0|0})(w_0 − ŵ_{0|0})^T]
Recursion, for n = 1, 2, . . .:
    Predict step:
        ŵ_{n|n−1} = ŵ_{n−1|n−1}
        P_w̃,n|n−1 = P_w̃,n−1|n−1 + Q_{n−1}
        W_0 = ŵ_{n−1|n−1}
        W_i = ŵ_{n−1|n−1} + ( √((n_w + k) P_w̃,n−1|n−1) )_i,    i = 1, . . . , n_w
        W_{i+n_w} = ŵ_{n−1|n−1} − ( √((n_w + k) P_w̃,n−1|n−1) )_i,    i = 1, . . . , n_w
        Y_i = g(x_n, W_i)    for i = 0, . . . , 2n_w
        ŷ_{n|n−1} = Σ_{i=0}^{2n_w} W_i^{(m)} Y_i
    Filter step:
        P_ỹ,n|n−1 = Σ_{i=0}^{2n_w} W_i^{(c)} (Y_i − ŷ_{n|n−1})(Y_i − ŷ_{n|n−1})^T + R_n
        P_w̃ỹ,n|n−1 = Σ_{i=0}^{2n_w} W_i^{(c)} (W_i − ŵ_{n|n−1})(Y_i − ŷ_{n|n−1})^T
        K_n = P_w̃ỹ,n|n−1 P_ỹ,n|n−1^{−1}
        ŵ_{n|n} = ŵ_{n|n−1} + K_n (y_n − ŷ_{n|n−1})
        P_w̃,n|n = P_w̃,n|n−1 − K_n P_ỹ,n|n−1 K_n^T


Methods

4.1 Numerical evaluation

By means of a numerical evaluation we aim to provide a complementary analysis to the theoretical analysis of the UKF training method of neural networks for time series prediction. The implementation of the UKF is written in Python according to the pseudo-code described in Algorithm 3, and SGD is implemented with TensorFlow [44]. For all subsequent experiments we run 20 simulations each, shifting a rolling window of size K one discrete time step ahead for each new simulation. The constant K determines the number of training points we wish to generate for our data set. For a training set of K training points, 20 simulations and 5 inputs we require K + 25 points. The remainder of this chapter gives a description of the data and an explanation of the metrics used in our numerical experiments.

4.1.1 Artificial data: the noisy sine function

In order to assess the generalization and performance of the UKF and SGD training methods for neural networks we firstly generate data from a sine function perturbed by noise, as proposed by Borovykh et al. in [12]. The function is defined as

y_k = sin(0.1 · k) + c · ε_k,    (4.1)

with k ∈ {1, . . . , K} and ε_k ∼ N(0, 1). The constant c determines the amount of Gaussian noise added to the function. The noisy sine function displays three important properties that are present in real time series [36, 45]. The first property is seasonality and


is perhaps most clear in temperature data and to a lesser extent in financial time series. The second and third properties, volatility and nonlinearity, are other important aspects of real-world time series that we wish to take into account in our experiments. Following the choice of Borovykh et al., we take the 5 previous values (y_{t−5}, y_{t−4}, . . . , y_{t−1}) as the neural network input to predict the current value y_t. In subsequent experiments we implement the low and high-noise cases from Borovykh et al. with c = 0.1 and c = 0.3.
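The data generation and windowing described above can be sketched as follows; the seed and series length are arbitrary choices for this illustration.

```python
import numpy as np

def noisy_sine(K, c, seed=0):
    # Eq. 4.1: y_k = sin(0.1 k) + c * eps_k with eps_k ~ N(0, 1)
    rng = np.random.default_rng(seed)
    k = np.arange(1, K + 1)
    return np.sin(0.1 * k) + c * rng.standard_normal(K)

def make_windows(y, lags=5):
    # Inputs are the 5 previous values; the target is the current value
    X = np.array([y[i - lags:i] for i in range(lags, len(y))])
    t = y[lags:]
    return X, t

y = noisy_sine(K=200, c=0.1)   # low-noise case; c = 0.3 for the high-noise case
X, t = make_windows(y)
```

Each row of X holds one lagged input vector (y_{t−5}, . . . , y_{t−1}) and the corresponding entry of t holds the target y_t.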

4.1.2 Real-world data: currency pairs

After providing insights into the characteristics of the UKF training method by means of time series prediction on synthetic data we wish to analyze the predictive performance of our method on currency pairs. As financial time series such as currency pairs are known to be highly nonlinear and volatile they remain hard to predict in both the short run and the long run [11,12,46]. For our experiments we use the daily observations of the USD/EUR and USD/RUB from 1 January 2004 to 1 June 2020, obtained from the Yahoo Finance database [47]. Missing values are removed from the data set to ensure continuity in the input for the sequential UKF training method. With the cleaned data sets we then calculate the return rt as

r_t = (P_t − P_{t−1}) / P_{t−1},  (4.2)

where P_t is the observation at discrete time t. As with the artificial data we use the 5 previous values (r_{t−5}, . . . , r_{t−1}) as input to predict the current value r_t. We then identify, for each currency, two periods of six months each: the period with the maximum and the period with the minimum standard deviation. A six-month period translates to 174 data points, on which the rolling window is subsequently applied. The aim of the different periods is to assess the performance and generalization of the UKF and SGD for varying levels of volatility. Another important aspect of financial time series modelling is the ability to handle sudden turning points and/or changes in volatility in the data [46]. The recent outbreak of COVID-19 and the coinciding consequences for the global economy are therefore included in an additional period from 1 January till 1 September 2020 [9, 10]. The returns for both currencies during this period are shown in Figure 4.2. The standard deviations of all periods for the USD/EUR and the USD/RUB are presented in Table 4.1. We observe larger standard deviations for the USD/RUB than for the USD/EUR in nearly all cases. The larger volatility in the USD/RUB data is also visible in Figures 4.1 and 4.2. We note that the extremes in the returns for the USD/RUB are more than double those of the USD/EUR. Moreover, in Figure 4.1 we observe a


Figure 4.1: The USD/EUR (left) and USD/RUB (right) daily returns, plotted for the periods with the maximum, minimum and mean standard deviation over windows of 174 points.

larger difference between the minimum and maximum period of the USD/RUB than for the USD/EUR. The exact setup for each experiment is further described alongside the results in Chapter 5.
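To illustrate, the return calculation of Eq. (4.2) and the rolling-window volatility search might be implemented as follows (a sketch under our own naming; the thesis does not provide this code):

```python
import numpy as np

def returns(prices):
    """Simple returns r_t = (P_t - P_{t-1}) / P_{t-1}, Eq. (4.2)."""
    p = np.asarray(prices, dtype=float)
    return (p[1:] - p[:-1]) / p[:-1]

def extreme_vol_windows(r, window=174):
    """Rolling standard deviation over `window` points; return the
    start indices of the minimum- and maximum-volatility windows."""
    stds = np.array([r[i:i + window].std(ddof=1)
                     for i in range(len(r) - window + 1)])
    return int(stds.argmin()), int(stds.argmax())

r = returns([100.0, 110.0, 99.0])
print(r)  # [ 0.1 -0.1]
```

Applying `extreme_vol_windows` to each currency's return series yields the start of its lowest- and highest-volatility 174-day period.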

          USD/EUR                               USD/RUB
Name      Period                    Std Dev     Period                    Std Dev
Max       02/09/2008 - 01/05/2009   1.19e-2     30/10/2014 - 30/06/2015   2.51e-2
Min       25/06/2019 - 21/02/2020   2.68e-3     13/02/2004 - 13/10/2004   1.26e-3
COVID-19  01/01/2020 - 31/08/2020   4.98e-3     01/01/2020 - 31/08/2020   1.32e-2

Table 4.1: The standard deviation of the USD/EUR and USD/RUB daily returns over periods of 174 days. By means of a rolling window the periods with the minimum and maximum standard deviation are determined. The COVID-19 period is taken from 01/01/2020 up to 31/08/2020.

Figure 4.2: The returns of the USD/EUR and USD/RUB, plotted from 1 January till 1 September 2020, including the outbreak of COVID-19.


4.2 Metrics

As it is our goal to compare the performance and generalization of both methods we require metrics that can provide insights to this end. The implemented metrics are elaborated upon in the following section.

4.2.1 Performance

To assess the performance of models trained by the UKF and SGD respectively, we implement two point-wise similarity measures. The first, the mean square error (MSE), was previously defined in Section B.1 as the loss function for which the Kalman filter is derived. We recall that for a data set with K data samples the MSE is given by

MSE = (1/K) · ∑_{k=1}^{K} (y_k − ŷ_k)²,  (4.3)

The second performance metric is the mean absolute error (MAE) and is defined as

MAE = (1/K) · ∑_{k=1}^{K} |y_k − ŷ_k|,  (4.4)

Because the MAE does not square the resulting error, it penalizes outliers less than the MSE and could hence provide us with additional information. It should be noted that, contrary to SGD, the UKF training method predicts distributions rather than single point estimates. Estimating the accuracy of such predictions generally requires different metrics to properly quantify the error, one example being the KL divergence mentioned in Section 2. However, as it is our aim to compare the posterior mean of the UKF to SGD, the proposed metrics can be calculated for both methods.
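Both metrics are straightforward to compute; as a sketch (the helper names `mse` and `mae` are ours):

```python
import numpy as np

def mse(y, y_hat):
    """Mean square error, Eq. (4.3)."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    """Mean absolute error, Eq. (4.4); penalizes outliers less than the MSE."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.mean(np.abs(y - y_hat))

y_true, y_pred = [1.0, 2.0, 3.0], [1.0, 2.0, 5.0]
print(mse(y_true, y_pred))  # 1.333... = 4/3
print(mae(y_true, y_pred))  # 0.666... = 2/3
```

The single outlier (5.0 instead of 3.0) dominates the MSE but contributes only linearly to the MAE, which is exactly the difference in sensitivity described above.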

4.2.2 Generalization

In addition to quantifying the performance of the training methods through point-wise prediction errors, we wish to analyze the generalization of trained neural networks on time series. With generalization in this context we mean the robustness of a neural network with respect to the noise in the input data. In other words, we wish to express to what
