### Faculty of Economics and Business

### Amsterdam School of Economics

### Requirements for the MSc thesis in Econometrics

1. The thesis should have the nature of a scientific paper. Consequently, the thesis is divided into a number of sections and contains references. An outline can be something like the following (this is an example for an empirical thesis; for a theoretical thesis, have a look at a relevant paper from the literature):

(a) Front page (requirements see below)

(b) Statement of originality (compulsory, separate page)

(c) Introduction

(d) Theoretical background

(e) Model

(f) Data

(g) Empirical Analysis

(h) Conclusions

(i) References (compulsory)

If preferred, you can change the number and order of the sections (but the order you use should be logical) and the headings of the sections. You have a free choice in how to list your references, but be consistent. References in the text should contain the names of the authors and the year of publication, e.g. Heckman and McFadden (2013). In the case of three or more authors: list all names and the year of publication for the first reference, and use the first name with et al. and the year of publication for subsequent references. Provide page numbers.

2. As a guideline, the thesis usually contains 25-40 pages using a normal page format. All that actually matters is that your supervisor agrees with your thesis.

3. The front page should contain:

(a) The logo of the UvA, a reference to the Amsterdam School of Economics and the Faculty as in the heading of this document. This combination is provided on Blackboard (in MSc Econometrics Theses & Presentations).

(b) The title of the thesis

(c) Your name and student number

(d) Date of submission of the final version

(e) MSc in Econometrics

(f) Your track of the MSc in Econometrics

### Forecasting the state of macroeconomic activity with artificial neural networks

### A comparison of VAR models with artificial neural networks

### Nik Gabel

10548882

Date of final version: December 3, 2018

Master’s programme: Econometrics

Specialisation: Data Science and Business Analytics

Supervisor: dr. N.P.A. van Giersbergen

Second reader: dr. K.A. Lasak

Abstract

In contrast to linear models such as VARX models, artificial neural networks are capable of solving nonlinear problems. The main topic of this thesis is forecasting inflation, unemployment and GDP with a VARX model and four different artificial neural networks. The analysis is done with additional independent variables and for six different countries. The main result is that the artificial neural networks show competitive forecasting performance or even outperform the VARX model. However, poor predictability of the variables and interpolation could have a negative effect on the performance of artificial neural networks compared to a VARX model.

### Statement of Originality

This document is written by Nik Gabel, who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.

## Contents

1 Introduction

2 Literature review

3 Theoretical background

&nbsp;&nbsp;3.1 Vector Autoregression

&nbsp;&nbsp;3.2 Artificial Neural Networks

&nbsp;&nbsp;&nbsp;&nbsp;3.2.1 A brief history

&nbsp;&nbsp;&nbsp;&nbsp;3.2.2 Basic concepts

&nbsp;&nbsp;&nbsp;&nbsp;3.2.3 Architectures

4 Research Methodology

&nbsp;&nbsp;4.1 Data

&nbsp;&nbsp;&nbsp;&nbsp;4.1.1 Description

&nbsp;&nbsp;&nbsp;&nbsp;4.1.2 Preprocessing

&nbsp;&nbsp;4.2 Model design and estimation

&nbsp;&nbsp;&nbsp;&nbsp;4.2.1 Benchmark VARX model

&nbsp;&nbsp;&nbsp;&nbsp;4.2.2 Artificial neural networks

&nbsp;&nbsp;4.3 Forecasting and performance

&nbsp;&nbsp;&nbsp;&nbsp;4.3.1 Accuracy

&nbsp;&nbsp;&nbsp;&nbsp;4.3.2 The Diebold-Mariano test

5 Results

&nbsp;&nbsp;5.1 Comparison of the forecast results of the US

&nbsp;&nbsp;5.2 Comparison of the forecast results of all countries

&nbsp;&nbsp;5.3 The Diebold-Mariano test

6 Conclusion

Bibliography

Appendices

&nbsp;&nbsp;A Tables

&nbsp;&nbsp;&nbsp;&nbsp;A.2 Principal components analysis

&nbsp;&nbsp;&nbsp;&nbsp;A.3 Forecast results

&nbsp;&nbsp;B Graphs

&nbsp;&nbsp;&nbsp;&nbsp;B.1 Transformations and stationarity

&nbsp;&nbsp;&nbsp;&nbsp;B.2 Principal components analysis

### Chapter 1

## Introduction

One of the most important aspects of an investment is its rate of return. Moreover, the motive of almost every investor is to make a profit on the investment. The economy, or more specifically, economic changes, can have a big influence on those investments. Therefore, it is essential to understand the economy and its behaviour. Macroeconomic indicators such as unemployment, inflation and the Gross Domestic Product (GDP) are important in this understanding. Macroeconomics is the part of economics that focuses on the interrelation of different aspects of the economy in the generation of wealth: aspects such as the behaviour, the performance and the decision making of a country. Inflation, unemployment and GDP are key indicators of those aspects and of changes in them. The inflation rate is the rate of change in the purchasing power of consumers. A rise in the inflation rate, meaning the prices of goods and services rise, has as a consequence that the purchasing power of consumers falls. This influences investments. The national unemployment rate describes the percentage of unemployed people in the total labor force. A rise in the unemployment rate signifies a loss of wages and purchasing power. The country as a whole is therefore affected and loses input into the economy by means of produced goods or services, which could have a negative effect on investments. Lastly, GDP is the measure of all goods and services produced by a country within a given time period. It includes investments, all public and private consumption, government outlays and private inventories. This makes GDP a good measure of the overall economic activity of a nation and, consequently, it influences the rate of return of investors as well. Knowing the influence of those indicators on the rate of return makes it worthwhile to look at their forecasts and the accuracy of those forecasts.

Over the last fifty years, macroeconomic forecast methods have improved considerably. Nowadays, forecast methods can be divided into structural and non-structural approaches. The structural approach consists of forecasting by using a theoretical model of the economy as a basis. This is especially useful when forecasting macroeconomic indicators after a specific change in policy. Non-structural approaches, on the other hand, are used without choosing a specific underlying theoretical economic model. They do not rely on a theoretical model but concentrate on the properties of the given data and the statistical methods and models that provide the best forecast for the data. Among the most frequently used forecasting models in the non-structural approach are Vector Autoregressive (VAR) models. These models require relatively few assumptions about the underlying data and make it possible to add multiple series of data to the analysis. On the other hand, VAR models are linear models and therefore do not account for non-linearities in the data. They are also sensitive to specification choices, such as which series to include and how to treat cointegration. These choices create a person-specific model and cause models for the same data to differ from each other, which could lead to differences in outcomes and therefore unreliable results. Recent studies have shown that, in order to overcome these problems and to improve the forecasting of macroeconomic indicators, Artificial Neural Networks (ANN) could be a good alternative (Cook and Hall, 2017).

ANNs are computational structures developed to mimic the storage of knowledge in the biological central nervous system. These networks are able to solve nonlinear and poorly defined problems. In the last twenty years these networks have increasingly been used in the field of business and economics, which has led to many different scientific applications. The characteristics of ANNs are efficiency, robustness and adaptability, which make them a valuable tool for classification, decision support, financial analysis and credit scoring (Tkáč and Verner, 2016). These characteristics make ANNs a possibly reliable alternative to VAR models. Therefore it is worthwhile to compare the performance of ANNs and VAR models in forecasting key macroeconomic indicators.

The main topic of this thesis is to investigate whether ANNs, which can take non-linearities into account, are preferred over VAR models in forecasting inflation, unemployment and GDP. More specifically, four common ANNs, the feed-forward neural network, the convolutional neural network, the recurrent neural network and the encoder-decoder network, are compared with a VAR model when forecasting those macroeconomic variables, with extra exogenous variables. To be able to give an indication of robustness, this comparison is done for the US, the UK, Japan, Germany, France and Italy. Additionally, it is investigated whether the models differ in accuracy across forecast horizons. These forecast horizons range from one month ahead up to one year ahead. The research is done not only per individual country but also on average over all countries. Lastly, for the US, it is investigated whether there is a difference in performance of the ANNs compared to VAR models when forecasting with and without additional input variables.

This thesis is structured as follows. First, in Chapter 2, a brief overview of previous literature and its findings on this topic is given. Second, Chapter 3 explains the basic concepts of a VAR model, the basics of a neural network and the neural networks used. Chapter 4 describes the data together with the research methodology. Chapter 5 presents and discusses the results. Lastly, Chapter 6 gives a final conclusion.

### Chapter 2

## Literature review

Non-structural macroeconomic forecasting methods such as VAR models do not rely on a theoretical model but concentrate on the properties of the given data. Compared to structural models with simultaneous equations, VAR models require few assumptions about the indicators influencing a variable, but they are still limited in a few essential ways. VAR models are used to find the linear connection between multiple time series, even when the connection is non-linear. Precise knowledge of the data and adjustments to the model could uncover non-linearities, but exposing these interdependencies becomes harder as the complexity of the model increases.

To detect the non-linearities without precise knowledge of the data, the use of ANNs is proposed. A lot of research comparing linear models and ANNs has shown that ANNs outperform the linear models. Mirmirani and Cheng Li (2004) compare a VAR model with ANNs in forecasting the price of oil and conclude that the ANNs outperform the VAR model. Alon, Qi, and Sadowski (2001) do the same comparison but on aggregated retail sales, consisting of strong trends and seasonal patterns. They conclude that the ANNs are on average favorable over the traditional methods and show that the ANN is capable of finding the nonlinear trend, the seasonal pattern and the interaction between them. Binner, Bissoondeeal, Elger, Gazely, and Mullineux (2005) provide empirical evidence that the within-sample and out-of-sample forecasts of ANNs are preferred over those of the linear models. They conclude that linear models are simply a subset of non-linear models. Lastly, Adya and Collopy (1998) examined 48 applications of the forecast performance of ANNs and found that when ANNs are effectively implemented and validated, they show potential for forecasting.

Besides the comparison of ANNs with commonly used models that do not account for non-linearities, Cook and Hall (2017) also investigated different types of neural networks. A feed-forward, fully connected network consists of layers of fully connected computational nodes, wherein information flows in one direction through the network. A convolutional neural network uses convolution instead of matrix multiplication in at least one layer; the convolutional layer analyses the patterns in the given data to find the pattern which minimizes the predictive error. They also use one of the most commonly used neural networks, the long short-term memory network from the family of recurrent neural networks. By not only using the current input but also the previously evaluated data, the network is given a memory. Lastly, they used an encoder-decoder network, which belongs to the family of sequence-to-sequence networks and was created for obtaining translations. Cook and Hall (2017) focused on predicting monthly unemployment for the U.S. based only on lags of unemployment. They indeed found that the ANNs outperform the linear benchmark models at short forecast horizons, but, as the horizon increased, the performance of the ANNs and the benchmark models grew closer together. Emphasized is the encoder-decoder model, which outperforms the linear benchmark models at every forecast horizon. As they restricted themselves to the use of only unemployment of the U.S., they expect that additional input variables will increase the performance of the ANNs compared to the linear benchmark models.

### Chapter 3

## Theoretical background

To get a better understanding of VAR models and ANNs, a theoretical background is given. First, the basic VAR model is explained together with its forecasting technique. Second, the ANN is described, divided into a brief history, the basics of an ANN and the four models used. As this is the theoretical background, the models are explained in general and not specifically for the data. The specific models used in this thesis and applied to time-series data are explained in the research methodology.

### 3.1

### Vector Autoregression

The VAR model, which describes a stochastic process, is used to find linear interdependencies between multiple time series, and is among the most successful, easy and flexible models for doing so. VAR models are an extension of univariate autoregressive (AR) models, because they allow for more than one variable. A VAR(p) model, with lag length p, consists of endogenous variables, which depend on their own lagged values, lagged values of the other endogenous variables and an error term. In this thesis, the endogenous variables inflation, unemployment and GDP are used, together with a large dataset of exogenous predictors. A VAR model with exogenous variables is called a VARX model:

$$Y_t = c + A_1 Y_{t-1} + A_2 Y_{t-2} + \dots + A_p Y_{t-p} + B_1 X_{t-1} + \dots + B_q X_{t-q} + \varepsilon_t,$$

with $Y_t = (y_{1t}, y_{2t}, \dots, y_{nt})'$ an $n$-vector of variables depending on time, $X_t$ an $m$-vector of exogenous variables, $c$ a constant, $A_i$ an $(n \times n)$ coefficient matrix, $B_i$ an $(n \times m)$ coefficient matrix and $\varepsilon_t$ an $(n \times 1)$ error term, which is a zero-mean white noise vector process with time-invariant covariance matrix $\Sigma$. To be able to use this model for forecasting, the parameters of the model first have to be estimated. When the model is covariance stationary, these parameters can be estimated per separate equation by ordinary least squares. To check whether the model is covariance stationary, consider it in lag operator notation,

$$A(L)Y_t = c + B(L)X_t + \varepsilon_t,$$

with

$$A(L) = I_n - A_1 L - \dots - A_p L^p.$$

If the roots of

$$\det(I_n - A_1 z - \dots - A_p z^p) = 0$$

lie outside the complex unit circle, then the VARX(p) model is stationary with time-invariant mean, variances and autocovariances. This means the model is covariance stationary and the parameters can be estimated per individual equation. Each equation can be written separately as
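The root condition can be checked numerically via the companion matrix: the roots of the determinant polynomial lie outside the unit circle exactly when all eigenvalues of the companion matrix lie strictly inside it. A minimal sketch in NumPy; the coefficient matrices `A1`, `A2` below are illustrative, not estimates from the thesis:

```python
import numpy as np

def is_covariance_stationary(A_list):
    """Check the VARX(p) stability condition via the companion matrix.

    The roots of det(I - A1 z - ... - Ap z^p) = 0 lie outside the unit
    circle exactly when all eigenvalues of the companion matrix
    [[A1 ... Ap], [I 0]] lie strictly inside it.
    """
    n = A_list[0].shape[0]
    p = len(A_list)
    companion = np.zeros((n * p, n * p))
    companion[:n, :] = np.hstack(A_list)       # top block row: A1 ... Ap
    companion[n:, :-n] = np.eye(n * (p - 1))   # identity shift below
    return bool(np.all(np.abs(np.linalg.eigvals(companion)) < 1))

# Illustrative 2-variable VAR(2) with small coefficients: stationary
A1 = np.array([[0.5, 0.1], [0.0, 0.4]])
A2 = np.array([[0.1, 0.0], [0.05, 0.1]])
print(is_covariance_stationary([A1, A2]))      # True for these coefficients
```

A unit-root process (e.g. a random walk, $A_1 = I$) fails the check, matching the intuition that its mean and variance are not time-invariant.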

$$y_i = Z a_i + D b_i + e_i, \quad i = 1, \dots, n,$$

with $y_i$ a $(T \times 1)$ vector of the $i$th equation, $Z$ a $(T \times k)$ matrix with rows $Z_t = (1, Y_{t-1}', \dots, Y_{t-p}')$, $a_i$ a $(k \times 1)$ vector of parameters, $D$ a $(T \times m)$ matrix of exogenous regressors, $b_i$ an $(m \times 1)$ vector of parameters and $e_i$ a $(T \times 1)$ vector of errors with covariance $\sigma_i^2 I_T$. These equations all have the same explanatory variables, and each equation can be estimated separately by ordinary least squares, resulting in the coefficient matrices $\hat{A} = (\hat{a}_1, \dots, \hat{a}_n)$ and $\hat{B} = (\hat{b}_1, \dots, \hat{b}_n)$.
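The per-equation OLS estimation can be sketched as follows. This is a generic illustration, not the thesis's estimation code: `estimate_varx_ols` is a hypothetical helper and the series are simulated stand-ins for the three indicators and the exogenous predictors:

```python
import numpy as np

def estimate_varx_ols(Y, X, p, q):
    """Equation-by-equation OLS for a VARX(p, q) model.

    Y : (T, n) endogenous series, X : (T, m) exogenous series.
    Returns the stacked coefficients, one column per equation.
    """
    T = Y.shape[0]
    start = max(p, q)
    rows = []
    for t in range(start, T):
        lags_y = np.concatenate([Y[t - i] for i in range(1, p + 1)])
        lags_x = np.concatenate([X[t - i] for i in range(1, q + 1)])
        rows.append(np.concatenate([[1.0], lags_y, lags_x]))  # constant first
    Z = np.array(rows)                  # regressor matrix shared by all equations
    targets = Y[start:]                 # left-hand sides
    # Same regressors in every equation -> all n OLS fits in one lstsq call
    coeffs, *_ = np.linalg.lstsq(Z, targets, rcond=None)
    return coeffs                       # shape (1 + n*p + m*q, n)

rng = np.random.default_rng(0)
Y = rng.standard_normal((200, 3))       # toy stand-ins for the 3 indicators
X = rng.standard_normal((200, 2))       # toy exogenous predictors
coeffs = estimate_varx_ols(Y, X, p=2, q=1)
print(coeffs.shape)                     # (1 + 3*2 + 2*1, 3) = (9, 3)
```

Because every equation shares the same explanatory variables, a single least-squares solve against the multi-column target recovers all $\hat{a}_i$ and $\hat{b}_i$ at once.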

Second, after estimating the parameters, the lag length $p$ has to be selected. For the VARX(p) model to give optimal results, the lag length $p$ has to be chosen optimally. This is done by choosing $p$ such that some selection criterion is minimized. These selection criteria are of the form

$$IC(p) = \ln|\tilde{\Sigma}(p)| + c_T \, \varphi(n, p), \qquad \tilde{\Sigma}(p) = T^{-1} \sum_{t=1}^{T} \hat{\varepsilon}_t \hat{\varepsilon}_t',$$

with $\tilde{\Sigma}(p)$ the residual covariance matrix without a degrees-of-freedom correction, $c_T$ a sequence depending on the sample size $T$, and $\varphi$ a penalty function.
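The criterion can be minimized over candidate lag lengths as sketched below. The choices $c_T = \ln T / T$ (a BIC-style sequence) and $\varphi(n,p) = pn^2$ (the number of autoregressive parameters) are illustrative assumptions, not necessarily the thesis's exact criterion:

```python
import numpy as np

def select_lag_length(Y, max_p):
    """Pick p minimizing IC(p) = ln|Sigma_tilde(p)| + c_T * phi(n, p)."""
    T, n = Y.shape
    best_p, best_ic = None, np.inf
    for p in range(1, max_p + 1):
        # Regressor matrix: constant plus p lags (most recent lag first)
        Z = np.array([np.concatenate([[1.0], Y[t - p:t][::-1].ravel()])
                      for t in range(p, T)])
        targets = Y[p:]
        coeffs, *_ = np.linalg.lstsq(Z, targets, rcond=None)
        resid = targets - Z @ coeffs
        sigma = resid.T @ resid / len(resid)    # no df correction
        c_T = np.log(len(resid)) / len(resid)   # BIC-style sequence (assumed)
        phi = p * n * n                         # penalty: AR parameter count
        ic = np.log(np.linalg.det(sigma)) + c_T * phi
        if ic < best_ic:
            best_p, best_ic = p, ic
    return best_p

rng = np.random.default_rng(1)
# Simulate a VAR(1); the penalized criterion should favor a short lag
Y = np.zeros((300, 2))
for t in range(1, 300):
    Y[t] = 0.5 * Y[t - 1] + rng.standard_normal(2)
print(select_lag_length(Y, max_p=4))
```

The fit term $\ln|\tilde{\Sigma}(p)|$ always falls as $p$ grows, so the penalty $c_T\varphi(n,p)$ is what stops the criterion from always picking the largest lag.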

After estimating the parameters and selecting the lag length, the model can be used for forecasting. When forecasting macroeconomic indicators, longer forecast horizons $h$ have to be considered. This is done using the chain rule of forecasting:

$$Y_{T+h|T} = \hat{c} + \hat{A}_1 Y_{T+h-1|T} + \hat{A}_2 Y_{T+h-2|T} + \dots + \hat{A}_p Y_{T+h-p|T} + \hat{B}_1 X_{T+h-1|T} + \dots + \hat{B}_q X_{T+h-q|T},$$

where $Y_{T+j|T} = Y_{T+j}$ for $j \le 0$ (Zivot and Wang, 2006). When using the chain rule of forecasting, future values of the exogenous variables are needed. These can be estimated with a VAR model for the exogenous variables.

### 3.2

### Artificial Neural Networks

The introduction to ANNs consists of three parts. First, the emergence of the ANN. Second, the basic concepts of an ANN. Lastly, the explanation of the four models used.

3.2.1 A brief history

An artificial neural network can be seen as an interconnected assembly of simple processing elements, units or nodes. The functionality of those elements is partly based on the biological neuron. The processing ability of the network is stored in the weights, which are obtained by learning from a training set (Gurney, 2014). The creation of these ANNs started in the early 1940s, at almost the same point in time as the history of programmable electronic computers. There were breakthroughs such as the discovery that simple networks are able to compute logic or arithmetic functions, and the formulation of the Hebbian rule, carrying the fundamentals of all neural learning procedures. Around 1951 came the golden age of ANNs. Research directions such as the difference between top-down and bottom-up approaches were formed, followed by the Adaptive Linear Neuron, known as ADALINE. In 1969, it was shown that the perceptron could not represent many important problems, which decreased the popularity of and research funds for the field. This caused a silence and slow reconstruction for about 15 years, after which the field of ANNs became more popular again. In 1986 non-linearly separable problems could be solved with multilayer perceptrons, which again gave rise to the field of ANNs (Kriesel, 2007). Afterwards, an increase in the understanding of biological neurons innovated the modeling and structure of ANNs, and advancements in computing technology made training complex neural networks computationally feasible.

3.2.2 Basic concepts

In order to understand the four models used, the basic concepts of an ANN are explained. This is again divided into parts. First, the basic model. Next, the activation function. Lastly, error backpropagation, which optimizes the network.

3.2.2.1 Model

Graphical models are the models on which neural networks are based. They are represented as directed graphs instead of equations or systems of equations. In these models, information flows through the nodes in the structure from input to target output, where the nodes serve as activity. Zooming in on the graphical model shows that it can also be represented as a number of equations, of which the number and length correlate with the complexity of the model (Cook and Hall, 2017). Those equations consist of linear combinations of fixed nonlinear basis functions:

$$y(\mathbf{x}, \mathbf{w}) = f\!\left( \sum_{j=1}^{M} \omega_j \phi_j(\mathbf{x}) \right),$$

where each basis function $\phi_j(\mathbf{x})$ represents a nonlinear function of a linear combination of the inputs. These building blocks make up the basic neural network model. First start with linear combinations of the input variables:

$$a_j = \sum_{i=1}^{D} \omega_{ji}^{(1)} x_i + \omega_{j0}^{(1)},$$

consisting of $j$ outputs. Here, $(1)$ indicates the first layer of the model, $\omega_{ji}$ corresponds to the weights given in that layer, $\omega_{j0}$ is the bias and the $x_i$ are the input variables. Second, the values $a_j$ are transformed using an activation function, which is basically a transformation that tells whether the node should be activated or not,

$$z_j = h(a_j).$$

After being transformed by the activation function, $z_j$ represents the output of node $j$ in the first layer and is also referred to as a hidden unit of the neural network. Next, the information is transferred to the second layer,

$$a_k = \sum_{j=1}^{M} \omega_{kj}^{(2)} z_j + \omega_{k0}^{(2)},$$

consisting of $k$ outputs, where $(2)$ indicates the second layer. Yet again, the $\omega$ are weights and $z_j$ corresponds to the output of the previous layer. The number of layers stacked on top of each other is variable, until the output unit activations are transformed to the network outputs $y_k$. Combining these layers gives the artificial neural network:

$$y_k(\mathbf{x}, \mathbf{w}) = \sum_{j=1}^{M} \omega_{kj}^{(2)} \, h\!\left( \sum_{i=1}^{D} \omega_{ji}^{(1)} x_i + \omega_{j0}^{(1)} \right) + \omega_{k0}^{(2)}.$$

Note that, as an example, two layers of weights are used here, but the number of layers is variable. Summarized, the neural network is simply a nonlinear transformation from the input data to the output values, where the weights $\omega$ can be adjusted (Bishop, 2006).
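The combined equation corresponds to a short forward pass, sketched here in NumPy with ReLU as the hidden activation $h$ and random, purely illustrative weights:

```python
import numpy as np

def relu(a):
    """ReLU activation: passes positive values, zeroes out the rest."""
    return np.maximum(0.0, a)

def forward(x, W1, b1, W2, b2):
    """Two-layer network: y_k = sum_j w2_kj * h(sum_i w1_ji x_i + b1_j) + b2_k."""
    a = W1 @ x + b1          # first-layer pre-activations a_j
    z = relu(a)              # hidden units z_j = h(a_j)
    return W2 @ z + b2       # network outputs y_k

rng = np.random.default_rng(0)
D, M, K = 4, 8, 1            # input size, hidden units, outputs
W1, b1 = rng.standard_normal((M, D)), np.zeros(M)
W2, b2 = rng.standard_normal((K, M)), np.zeros(K)
y = forward(rng.standard_normal(D), W1, b1, W2, b2)
print(y.shape)               # (1,)
```

The biases $b_1, b_2$ play the role of $\omega_{j0}^{(1)}$ and $\omega_{k0}^{(2)}$; stacking more `W @ relu(...)` steps gives deeper networks.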

The combination of those equations is also referred to as an architecture. Architectures consist of the nodes, the connections between those nodes and the operations between those nodes. Nodes represent the model input $x_i$, the activations $a_j$ as well as the activation function outputs $z_j$ and the output $y_k$. The connections between those nodes are represented by the weights $\omega$. Together, and with one layer of computational neurons, those layers are defined as a perceptron. The perceptron is the basic architecture upon which the ANNs are built (Cook and Hall, 2017). A basic perceptron for forecasting is given in Figure 3.1.

Figure 3.1: Perceptron

The first layer represents the model input. The second layer consists of the computational neurons and the last layer is the output.

3.2.2.2 Activation Function

As explained before, an activation function is a function that is added between layers or at the end of a neural network. These functions are typically used to scale the given values between 0 and 1 or −1 and 1, which makes it possible to compare the influence of different nodes in the same layer. The activation functions can be divided into two groups, namely linear and non-linear activation functions. As the name reveals, linear activation functions only use linear transformations. Consequently, the range of the output will still be between −∞ and ∞, which means the influence of a node compared to other nodes is not revealed. Another disadvantage of linear activation functions in neural networks is that a multilayer neural network with a linear transformation function per layer can also be written as a single-layer neural network. Therefore, when using neural networks, linear activation functions are not optimal.

The most commonly used nonlinear activation function is the rectified linear unit (ReLU) (Jarrett, Kavukcuoglu, LeCun, et al., 2009). ReLU returns the input z if it is positive and 0 otherwise. The ReLU function is presented in Figure 3.2.

Figure 3.2: ReLU

Advantages of ReLU compared to linear activation functions are that it is non-linear, which means layers can be stacked on top of each other. Another advantage is that it makes the activations sparse and efficient. This is because values equal to or below 0 will stay 0 and therefore cause the node to deactivate or die out. When optimizing the neural network, fewer nodes then have influence on the output, making the network faster. This activation function is most commonly used in the hidden layers.

3.2.2.3 Error Backpropagation

Knowing the model and the activation functions of an ANN brings up the next topic: optimizing the network in an efficient way by error backpropagation. This technique sends small parts of data forward and backward through the network. In other words, error backpropagation uses gradient descent to train a multilayer perceptron. It does this by means of minimizing an error function by iteratively updating the weights in the network. This minimization can be divided into two stages. First, evaluate the derivative of the error function with respect to the weights. Second, use this evaluation to update the weights.

Start with the first stage, and specifically with a simple linear model, to explain this in more detail:

$$y_k = \sum_i \omega_{ki} x_i,$$

where $y_k$ is the estimated value and $\omega_{ki}$ the weight $i$ from input $x_i$. For this model the error function is

$$E_n = \frac{1}{2} \sum_k (y_{nk} - t_{nk})^2,$$

where $t_{nk}$ is the real value and $y_{nk}$ the estimated value. Deriving the gradient with respect to the weight $\omega_{ji}$ gives

$$\frac{\partial E_n}{\partial \omega_{ji}} = (y_{nj} - t_{nj})\, x_{ni},$$

which is the change in the error function associated with a change in the weight. Now translate this to a general neural network, where a unit looks like

$$a_j = \sum_i \omega_{ji} z_i,$$

with $z_i$ the value of the activation of unit $i$ in the previous layer and $\omega_{ji}$ the weight given to unit $i$ for going to unit $j$ in the next layer. Next, $z_j$ is the value of the activation of unit $j$:

$$z_j = h(a_j).$$

Applying the derivative of the error function with respect to the weight gives

$$\frac{\partial E_n}{\partial \omega_{ji}} = \frac{\partial E_n}{\partial a_j} \frac{\partial a_j}{\partial \omega_{ji}}.$$

Note that $\frac{\partial a_j}{\partial \omega_{ji}} = z_i$ and define $\delta_j = \frac{\partial E_n}{\partial a_j}$, obtaining

$$\frac{\partial E_n}{\partial \omega_{ji}} = \delta_j z_i.$$

This means that, to know the change in the error function caused by a change in a weight, the value of $\delta$ for the unit at the end of that weight has to be multiplied by the value $z_i$ on the other end of the weight. Therefore, calculating $\delta_j$ gives the derivative of the error function with respect to the weights. For the output units,

$$\delta_k = y_k - t_k.$$

For the hidden units, using the chain rule for partial derivatives gives the solution

$$\delta_j = \frac{\partial E_n}{\partial a_j} = \sum_k \frac{\partial E_n}{\partial a_k} \frac{\partial a_k}{\partial a_j}.$$

Knowing the impact of every weight on the error function gives the ability to minimize the error function, which optimizes the ANN (Bishop, 2006).
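The $\delta$ recursion can be sketched for a network with one hidden layer; a finite-difference check confirms that the analytic gradient matches a numerical derivative of the error function. All weights and data are random placeholders, and ReLU is assumed as the hidden activation:

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def backprop(x, t, W1, W2):
    """Gradients of E = 0.5 * sum_k (y_k - t_k)^2 for one hidden layer.

    delta_k = y_k - t_k at the output; delta_j = h'(a_j) * sum_k w_kj delta_k
    for hidden units; dE/dw_ji = delta_j * z_i on each connection.
    """
    a = W1 @ x
    z = relu(a)
    y = W2 @ z
    delta_out = y - t                         # output-unit deltas
    delta_hid = (a > 0) * (W2.T @ delta_out)  # chain rule through ReLU
    return np.outer(delta_hid, x), np.outer(delta_out, z)

rng = np.random.default_rng(2)
x, t = rng.standard_normal(3), rng.standard_normal(1)
W1, W2 = rng.standard_normal((5, 3)), rng.standard_normal((1, 5))
g1, g2 = backprop(x, t, W1, W2)

# Finite-difference check of one first-layer weight
def loss(W1_):
    return 0.5 * np.sum((W2 @ relu(W1_ @ x) - t) ** 2)

eps, (i, j) = 1e-6, (0, 0)
W1p = W1.copy(); W1p[i, j] += eps
num = (loss(W1p) - loss(W1)) / eps
print(abs(num - g1[i, j]) < 1e-3)   # analytic and numerical gradients agree
```

The numerical check is a standard way to validate a backpropagation implementation before trusting it inside a training loop.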

3.2.3 Architectures

The first architecture is a feed-forward, fully connected neural network. Second, a convolutional neural network. Third, a recurrent neural network. Lastly, an extension of the recurrent neural network, the encoder-decoder network.

3.2.3.1 Feed-forward, fully connected neural network

The first ANN is the feed-forward, fully connected neural network. This network consists of layers of computational nodes, wherein information flows in one direction through the network. Fully connected corresponds to a connection between every node in consecutive layers, where, as mentioned before, a layer corresponds to one or more perceptrons. An overview is given in Figure 3.3.

Figure 3.3: Feed-forward, fully connected network

The figure shows a generalized feed-forward, fully connected neural network. The computational nodes are represented by the circles and each hidden layer consists of three computational nodes (Cook and Hall, 2017).

As with the VAR model, a feed-forward, fully connected neural network forecasts based on lags of the output variable and lags of other variables, where output corresponds to the variable that will be predicted. Referring back to Figure 3.3, forecasting with a feed-forward neural network works as follows: the input variables $x_{t-i}$, $i = 0, 1, \dots, k$, consist of lags of the output variable together with lags of other variables. These input values are transformed by the layers in the network, as explained in Section 3.2.2.1, up until the output value $x_{t+n}$, which is the $n$-step-ahead forecast.

3.2.3.2 Convolutional neural network

The second ANN that will be used is the convolutional neural network. The data only flows in one direction, equivalent to the feed-forward neural network. The name convolutional comes from the fact that the neural network performs a mathematical operation called convolution. Convolution means that each input node is connected to only a few nodes in the next layer; therefore, it is not fully connected anymore. In other words, convolutional neural networks differ from feed-forward neural networks in that they use convolution instead of matrix multiplication in at least one layer. Convolution rests on three important ideas, namely sparse interactions, parameter sharing and equivariant representations. Sparse interactions corresponds to a node not being connected to every node in the next layer but only to a few of them. Parameter sharing means using the same parameter multiple times in the network. Normally, the weights used for calculating the output value are only used once for one specific input value, but with convolution the weights in a layer are the same for every node in that layer. This reduces the memory requirements and improves the classification performance (Yi, Ju, Yoon, and Choi, 2017). Because of parameter sharing the layers have the property named equivariance. It means that when the input changes, the output changes in the same way. Translated to time series, this means that when an event is moved over time, it will give the same result but later in time, and therefore shows independence of position (Goodfellow, Bengio, Courville, and Bengio, 2016).

For the convolutional layers to give their output to a different set of layers, for example fully connected layers, the output has to be modified. This is done with the max pooling filter. It only keeps the maximum value within a rectangular neighborhood, so the output will be the maximum value per neighborhood. Consequently, pooling makes the convolutional layer invariant to small changes in the input (Goodfellow et al., 2016).

The convolutional neural network used is an expansion of the feed-forward, fully connected neural network. First, a structure of one or more convolutional layers is stacked on top of each other. Afterwards, the network looks the same as the feed-forward, fully connected neural network, which implements the reasoning and produces the actual output (Yi et al., 2017).

Forecasting with this convolutional neural network works as follows: first the input data, which looks the same as for the feed-forward neural network, is given a score from zero to ten, indicating to what extent it matches the filter. Next, after the convolutional layer, the output is modified by the max pooling filter. This saves the maximum value per rectangle, which is given to the next layer. From there on, the neural network is the same as the feed-forward, fully connected neural network, so forecasting from that part on proceeds as for the feed-forward neural network.
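The convolution and max-pooling steps can be sketched on a univariate series. The filter below is fixed purely for illustration; in the network the filter weights are learned:

```python
import numpy as np

def conv1d(x, w):
    """Valid 1-D convolution: the same filter w slides along the input
    (parameter sharing), and each output touches only len(w) inputs
    (sparse interactions)."""
    k = len(w)
    return np.array([np.dot(x[i:i + k], w) for i in range(len(x) - k + 1)])

def max_pool(x, size):
    """Keep only the maximum per non-overlapping window, making the output
    invariant to small shifts of the input."""
    return np.array([x[i:i + size].max()
                     for i in range(0, len(x) - size + 1, size)])

x = np.array([0.0, 1.0, 2.0, 1.0, 0.0, -1.0])
w = np.array([1.0, -1.0])            # illustrative difference-style filter
feat = conv1d(x, w)                  # -> [-1, -1, 1, 1, 1]
print(max_pool(feat, 2))             # -> [-1, 1]
```

Shifting the event in `x` one step later shifts `feat` by one step as well (equivariance), while the pooled summary changes far less (invariance to small shifts).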

3.2.3.3 Recurrent neural network

The third model is a long short-term memory (LSTM) network. This type of neural network belongs to the family of recurrent neural networks, which are designed for processing sequential data. Like the convolutional network, they are especially suited to processing a grid of values. By using the state of the network, which is created by the previously processed input data, they use not only the current input when predicting but also the previously evaluated data. This gives the recurrent neural network a memory. As the name of the LSTM network reveals, the network has both a long and a short term memory. The network works as follows. The data is provided to the network by sequentially inserting it into a 1-node layer. The 1-node layer receives not only the new data but also the state of the network up until that moment in time, which is the long term memory, and the output of the previous element, which is the short term memory. The state of the network corresponds to the understanding of the data up until that element in the sequence; the previous output corresponds to the prediction given by the previous element. Together, the model can be described as an architecture with as many layers as elements in the sequence of data, where every layer receives the prediction of the previous layer and the state of the network up until the previous layer. Figure 3.4 illustrates the LSTM network with each layer unfolded. Figure 3.5 shows the same LSTM network but with every layer folded into one cell, which has a feedback loop into itself representing the long and short term memory.
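The recurrence $h_k, s_k = f(x_k, h_{k-1}, s_{k-1})$ can be illustrated with a minimal 1-node LSTM step in plain numpy. This is a sketch of the gate mechanics with toy weights, not the Keras implementation used later in the thesis:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, s_prev, W):
    """One 1-node LSTM update: x is the new datum, h_prev the short term
    memory (previous output), s_prev the long term memory (state)."""
    f = sigmoid(W["f"][0] * x + W["f"][1] * h_prev + W["f"][2])  # forget gate
    i = sigmoid(W["i"][0] * x + W["i"][1] * h_prev + W["i"][2])  # input gate
    o = sigmoid(W["o"][0] * x + W["o"][1] * h_prev + W["o"][2])  # output gate
    c = np.tanh(W["c"][0] * x + W["c"][1] * h_prev + W["c"][2])  # candidate
    s = f * s_prev + i * c        # updated long term memory
    h = o * np.tanh(s)            # updated short term memory / output
    return h, s

rng = np.random.default_rng(0)
W = {g: rng.normal(0.0, 0.05, 3) for g in ("f", "i", "o", "c")}  # toy weights
h, s = 0.0, 0.0
for x in [0.2, -0.1, 0.4]:        # the sequence is fed in element by element
    h, s = lstm_step(x, h, s, W)  # h_k, s_k = f(x_k, h_{k-1}, s_{k-1})
```

Unrolling the loop over the sequence corresponds to the unfolded layers in Figure 3.4; the single `lstm_step` with its carried `(h, s)` pair corresponds to the folded cell in Figure 3.5.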

Figure 3.4: Unfolded LSTM network

$x_k$ corresponds to the sequence of data given to the network. $f(x_k, h_{k-1}, s_{k-1})$ represents the layers, which are stacked on top of each other. $h_{k-1}$ are the short term memories and $s_{k-1}$ are the long term memories (the state of the network) (Cook and Hall, 2017).

Figure 3.5: Folded LSTM network

$x_{t-k}$ represents the sequence of data given to the network. The LSTM cell represents the stacked 1-node layers, with the long and short term memory provided by the arrow back into itself. $x_{t+n}$ corresponds to the final prediction (Cook and Hall, 2017).

The LSTM model used for forecasting the macroeconomic indicators is an LSTM cell combined with the structure of feed-forward, fully connected layers. These layers are the same as in the feed-forward, fully connected network and the convolutional network. The purpose of these extra structures is to apply reasoning to the state of the network and eventually produce the prediction (Cook and Hall, 2017). The LSTM cell first processes the input sequence as explained before. Afterwards, the LSTM cell passes the vector of forecasts to the feed-forward neural network, which processes the input vector in the same way as explained for the feed-forward, fully connected neural network, leading to one forecast value.

3.2.3.4 Encoder-decoder network

The last model for forecasting is the encoder-decoder network. This network belongs to the family of sequence-to-sequence networks, which builds on the family of recurrent neural networks. Accordingly, the encoder-decoder network is an extension of the recurrent neural network described in the previous paragraph. The architecture was originally created for translation: the network predicted translations of words while memorizing its earlier translation predictions. It starts with the encoder, which processes the input sequence and then outputs a context that summarizes it. This context is received by the decoder, which in turn generates an output sequence. Both the encoder and the decoder are LSTM architectures, which means they both process the given sequence element-wise together with the prediction of the previous element (Goodfellow et al., 2016).

The extra contribution of the decoder can be explained by forecasting over longer horizons. The encoder alone, as described in the previous paragraph, is a single-step predictor: if it has to predict two steps into the future, it skips the first step and immediately predicts the second, without considering the prediction of the first step, even though the first prediction could influence the second. This skipping can decrease accuracy when predicting over longer horizons. When the model consists of an encoder and a decoder, the forecast changes from skipping elements to iteratively extrapolating out to the forecast horizon: the encoder creates the one-step-ahead forecast, after which the decoder iteratively extrapolates it over the horizon to create the forecast for the longer horizon (Cook and Hall, 2017).

The encoder-decoder model used for forecasting the macroeconomic indicators consists of the encoder and the decoder only. The feed-forward, fully connected neural network is left out, because the decoder architecture is expected to already implement the reasoning (Cook and Hall, 2017). The final model is shown in Figure 3.6.

Figure 3.6: Encoder decoder network


The first LSTM cell corresponds to the encoder; the second LSTM cell corresponds to the decoder. The feedback is provided by the arrows around the cells (Cook and Hall, 2017).

### Chapter 4

## Research Methodology

With the basics of the forecasting models in place, the research methodology can be discussed. The main goal of this chapter is to describe the data, together with the decisions made in building the models. First, the datasets used for forecasting are described. Second, the decisions made to design and estimate the models are discussed. Last, the forecast technique and the evaluation techniques are given.

### 4.1 Data

First, the data is described; afterwards, the preparations necessary for the models to process the data correctly are explained.

4.1.1 Description

The dataset is identical to the dataset used in Kuzin, Marcellino, and Schumacher (2013), which was collected by central bank economists. It contains monthly data and quarterly GDP growth for the USA, Germany, Italy, France, Japan and the UK. The data consist of 190 monthly indicators for the USA, 113 for Germany, 150 for Italy, 167 for France, 71 for Japan and 60 for the UK. This difference in the number of available indicators is due to the lack of similarity between monthly statistics across countries. Per country, experts chose the data based on relevance and appropriateness for forecasting. The most common indicators are business surveys, consumer surveys, income, production, inventories, unemployment, interest rates, retail, manufacturing, housing sales, price indexes, wages, credit, exchange rates and stock prices. The time interval also differs per country: for the USA the data run from 1982Q1 up until 2010Q2, for the UK from 1980Q1 until 2010Q2 and for the other countries from 1991Q1 until 2010Q2.

When inflation, unemployment and GDP are available in the dataset used in Kuzin et al. (2013), those are used. Otherwise, the variables are retrieved from the Federal Reserve Bank of St. Louis (FRED), for the same period as the dataset used in Kuzin et al. (2013). For the UK, inflation is retrieved from the FRED (FREDa, 2018). For France, both inflation (FREDb, 2018) and unemployment (FREDc, 2018) are retrieved from the FRED. For Italy, only unemployment is retrieved from the FRED (FREDd, 2018). For the other countries, the variables are already available in the dataset used in Kuzin et al. (2013).

4.1.2 Preprocessing

As all the forecasting methods require stationary and well-behaved time series, preprocessing the data is required. The preprocessing steps are explained for an individual country, but are equivalent for every country.

First, GDP is disaggregated to a monthly indicator. In contrast to inflation and unemployment, GDP is released on a quarterly basis, which means the data have different sampling frequencies. Dealing with different sampling frequencies can be done in several ways: using a mixed-frequency VAR model, disaggregating GDP to a monthly indicator with interpolation, or using a monthly proxy. The essence of this thesis is not the handling of different sampling frequencies, and using a monthly proxy shifts the subject from GDP to the proxy, which is equivalent to replacing GDP with the proxy in the first place. With interpolation, the main topic remains GDP, and therefore, in this situation, interpolation is the best solution. An interesting aspect of interpolation is its effect on the measurement and forecast error of GDP. The interpolation is done with the cubic spline method, which is commonly used in economics.
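The disaggregation step can be sketched with SciPy's `CubicSpline`. The quarterly values and the placement of the knots at end-of-quarter months are assumptions for illustration only:

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Toy quarterly GDP growth, observed at months 3, 6, 9, ... (end of quarter).
quarters = np.arange(3, 25, 3)                               # 8 quarters, in months
gdp_q = np.array([0.5, 0.7, 0.4, 0.6, 0.8, 0.3, 0.5, 0.9])  # hypothetical values

spline = CubicSpline(quarters, gdp_q)   # third-degree polynomial pieces
months = np.arange(3, 25)               # monthly grid
gdp_m = spline(months)                  # interpolated monthly GDP indicator
```

By construction the spline passes through the quarterly observations, so the monthly series agrees with the published data at the end of every quarter.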

Second, the missing values are dealt with. Fortunately, all missing values occur in the beginning or end months of the time series, so the rows with missing values could be deleted without discontinuing the time series. When a column, i.e. a variable, has more than three missing values, not the row but the variable is deleted. In this way, not too many observations are lost. After cleaning, the USA has a time interval from 1982Q1 up until 2010Q1 with 188 monthly indicators, the UK from 1980Q1 up until 2009Q4 with 60 monthly indicators, France from 1991Q1 up until 2010Q2 with 166 monthly indicators, Germany from 1991Q1 up until 2010Q1 with 112 monthly indicators, Italy from 1991Q2 up until 2009Q4 with 150 monthly indicators and Japan from 1991Q1 up until 2010Q2 with 71 monthly indicators.

Next, before transforming the dataset, the data is split into the explanatory variables, which are inflation, unemployment and monthly GDP, and the predictors, which are all other variables. These datasets have to be transformed to eliminate trends, seasonality, heteroscedasticity and outliers.

First, the predictors are transformed. When a variable contains negative observations, the absolute value of the lowest observation is added to all observations of that variable, so that logarithms can be taken. Afterwards, heteroscedasticity is reduced by taking the natural logarithm of all time series, except interest rates. Kuzin et al. (2013) have used the data for nowcasting GDP, which requires the same preprocessing as in this thesis. Therefore, the transformations to eliminate trends and seasonality as specified in the data file that belongs to Kuzin et al. (2013) are used to prepare the predictor data for forecasting. Last, when an observation differs by more than six times the sample interquartile range from the sample median, it is identified as an outlier. The outliers are set equal to that boundary.
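The outlier rule can be sketched as follows, assuming (as one reading of the rule above) that outliers are clipped to the boundary median ± 6·IQR:

```python
import numpy as np

def clip_outliers(x, k=6.0):
    """Replace observations more than k interquartile ranges away from the
    sample median by the boundary median +/- k * IQR."""
    med = np.median(x)
    iqr = np.percentile(x, 75) - np.percentile(x, 25)
    return np.clip(x, med - k * iqr, med + k * iqr)

x = np.array([0.1, 0.2, 0.15, 0.18, 0.12, 9.9])   # 9.9 is a clear outlier
cleaned = clip_outliers(x)                        # 9.9 pulled down to the boundary
```

All regular observations pass through unchanged; only the extreme value is moved to the boundary.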

For the explanatory variables, the same transformations are made. First, heteroscedasticity is reduced in the same way as for the predictors. Next, stationarity of the time series is tested with the Augmented Dickey-Fuller test. The null hypothesis of this test is that a unit root is present; consequently, the alternative is that the time series is stationary. When a time series is non-stationary, the first difference is taken and tested for stationarity. When the series is still non-stationary, the second difference is taken and tested. Next, seasonality is removed with the statsmodels.tsa.seasonal package in Python 2.7, which removes seasonality by applying a convolution filter to the time series. Last, outliers are treated in the same way as for the predictors.

Per country, the descriptive statistics, the transformations and the results of the Aug-mented Dickey-Fuller test for inflation, unemployment and GDP are presented in Appendix A.1. Graphs of inflation, unemployment and GDP before and after the transformations are presented in Appendix B.1.

In order to prevent overfitting, both the explanatory variables and the predictors are split into a training set and a test set. Overfitting means fitting the data too closely, which could cause the model to fail to show the same forecasting performance when used with additional data. Splitting the data into a training set and a test set is also necessary for evaluating the performance of the models.

Next, due to the different scales of the input features, the models could give higher weights to features with higher average values, which could introduce a bias. Therefore, both datasets are normalized.

Subsequently, principal component analysis (PCA) is used on the predictor dataset. Not all predictors have a significant influence on the explanatory variables, and Stock and Watson (2004) and Banerjee, Marcellino, and Masten (2005) have shown that, because of the variation of the information content of indicators over time, selecting single predictors can be very difficult. To avoid this problem, PCA is applied to the predictor data. PCA locates the directions of maximal variance in high-dimensional data and projects the data onto a lower-dimensional subspace while retaining most of the information. The number of meaningful components retained from the indicators is determined by combining two criteria. First, the Eigenvalue-One Criterion is used, which selects all components with eigenvalues greater than one. Second, the Scree Criterion is used: every component before the last break in the graph of the plotted eigenvalues is retained. Every component before the break is assumed meaningful and every component after the break is assumed meaningless (O'Rourke, Psych, and Hatcher, 2013). The components from the criterion yielding the fewest components are chosen. Appendix A.2 shows the components used per country and Appendix B.2 shows the largest eigenvalues together with the proportion of variance per component.
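A minimal numpy sketch of PCA with the Eigenvalue-One Criterion (the Scree Criterion, being graphical, is omitted; the data are simulated for illustration):

```python
import numpy as np

def pca_components(X):
    """PCA on the correlation matrix of the predictors; keep components
    with eigenvalue > 1 (Eigenvalue-One Criterion)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)          # standardize
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
    order = np.argsort(eigvals)[::-1]                  # largest first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1.0                               # Eigenvalue-One Criterion
    return Z @ eigvecs[:, keep], eigvals

rng = np.random.default_rng(1)
common = rng.normal(size=(100, 1))                     # one common factor
X = common @ rng.normal(size=(1, 6)) + 0.3 * rng.normal(size=(100, 6))
components, eigvals = pca_components(X)                # retained component scores
```

Because the eigenvalues of a correlation matrix sum to the number of variables, retaining components with eigenvalue above one keeps exactly those directions that explain more variance than a single original variable.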

Last, the components are again normalized, due to different scales.

### 4.2 Model design and estimation

The benefit of both the VAR model and the ANN is that they concentrate on the properties of the given data and on the statistical methods and models that provide the best forecast for those data. For this flexibility to be a benefit, the right combination of parameters has to be chosen. This paragraph discusses the design and estimation of the models. As with the preprocessing steps, the design steps are explained for an individual country, but are equivalent for every country.

4.2.1 Benchmark VARX model

To let the VAR process be affected by predictors other than inflation, unemployment and GDP, the model is extended to a VARX model. In this model, the endogenous variables are predicted by lags of those endogenous variables together with exogenous variables.

Building a VARX model can be divided into several steps, starting with choosing the endogenous and exogenous variables. The endogenous variables are the macroeconomic indicators inflation, unemployment and monthly GDP. The exogenous variables are the components from the PCA. Next, the order of the model is selected with a grid search over α ∈ {1, 2, 3, 4, 5, 6}. Normally, the order of the model is selected with selection criteria such as the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC), the model is tested for serial correlation with the Portmanteau test and, subsequently, the lag order is increased by one until there is no sign of serial correlation. However, Lütkepohl (2005) indicates that as long as the model forecasts well, it is not very important whether the residuals are serially correlated. Therefore, the order of the model is chosen based on forecast performance in a grid search. The evaluation of the different order configurations is done with five rounds of cross-validation. Per round, cross-validation divides the training data into a new training set and a validation set, trains the model on the new training set and evaluates it, by use of the root mean squared error (RMSE), on the validation set. After the five rounds, the mean of the five RMSEs is compared across configurations to select the configuration with the lowest mean RMSE.
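The order selection by forecast performance can be sketched as follows. For brevity this uses a single train/validation split instead of the five cross-validation rounds, and a plain VAR without the exogenous PCA terms; the data are simulated:

```python
import numpy as np

def fit_var_ols(Y, p):
    """Estimate a VAR(p) equation by equation with OLS: Y_t on a constant
    and its own lags Y_{t-1}, ..., Y_{t-p}."""
    T, k = Y.shape
    X = np.hstack([Y[p - i - 1:T - i - 1] for i in range(p)])   # lag blocks
    X = np.hstack([np.ones((T - p, 1)), X])                     # constant
    B, *_ = np.linalg.lstsq(X, Y[p:], rcond=None)
    return B

def one_step_rmse(Y_train, Y_val, p):
    """Fit on the training part, compute one-step-ahead RMSE on the rest."""
    B = fit_var_ols(Y_train, p)
    Y = np.vstack([Y_train, Y_val])
    errs = []
    for t in range(len(Y_train), len(Y)):
        x = np.concatenate([[1.0], Y[t - p:t][::-1].ravel()])   # [1, Y_{t-1}, ...]
        errs.append(Y[t] - x @ B)
    return np.sqrt(np.mean(np.square(errs)))

rng = np.random.default_rng(2)
Y = np.zeros((120, 2))
for t in range(1, 120):                        # toy VAR(1) data
    Y[t] = np.array([[0.5, 0.1], [0.0, 0.4]]) @ Y[t - 1] + rng.normal(size=2)

scores = {p: one_step_rmse(Y[:100], Y[100:], p) for p in range(1, 7)}
best_p = min(scores, key=scores.get)           # order with the lowest RMSE
```

The thesis repeats the evaluation over five cross-validation rounds and averages the RMSEs before picking the order.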

After all steps have been completed, the model can be estimated per separate equation by ordinary least squares, which is done in Python 2.7.

When forecasting longer horizons, future values of the exogenous variables are needed. Since these future values are not known, they have to be estimated with a VAR model of their own. In building the VAR models for the components, the same steps are used as for the VARX model.

Lastly, for the US, a model is built per separate endogenous variable to compare the results with Cook and Hall (2017). These models use only lags of the explanatory variable itself, making them AR models. Again, the same steps are used as for the VARX model.

4.2.2 Artificial neural networks

As a consequence of the large number of parameters in ANNs, selecting the right values involves much trial and error: decisions are made by constantly evaluating the model under different configurations of the parameters. The parameters of the artificial neural networks consist of the activation functions, hidden layers, neurons per layer, batch size, epochs, optimization technique, learning rate and weight initialization. The designs of the different ANNs are similar in most steps. Therefore, first, the steps in designing the feed-forward neural network are explained. Afterwards, for each of the other ANNs, the differences in design compared to the feed-forward neural network are highlighted and explained.

For the feed-forward neural network, first, the activation functions, optimization technique and weight initialization are fixed, due to the quadratic increase in computational time when an extra parameter with two options is added to the grid search. For the hidden layers, the exponential linear unit (ELU) activation function is used; for the output layer no activation function is used. The optimization is done by stochastic gradient descent and the initial weights are drawn from a normal distribution with mean 0 and standard deviation 0.05. Next, the input variables are lags of the explanatory variables together with lags of the components from the PCA. The number of lags is set equal to the lags used in the VARX model. Subsequently, the optimal configuration of learning rate, hidden layers and neurons is selected with a grid search, with the batch size and epochs fixed at 8 and 100. The optimal learning rate is selected from α ∈ {0.01, 0.001, 0.0001}, the optimal number of hidden layers from β ∈ {1, 2} and the optimal number of neurons from θ ∈ {λ, 1.25 ∗ λ}, where λ is the input size. Next, with the optimal configuration for the learning rate, hidden layers and neurons, the batch size and epochs are also selected with a grid search: the optimal batch size from γ ∈ {5, 8, 16} and the optimal number of epochs from δ ∈ {50, 100, 200}. To reduce the chance of overfitting, two regularization techniques are implemented. The first is dropout: per layer, this technique ignores random neurons when training the model. The second is a kernel regularizer: this technique adds penalties to the optimization function for weights that grow too large. The evaluation of the different configurations is done with five rounds of cross-validation. Per round, cross-validation divides the training data into a new training set and a validation set, trains the model on the new training set and evaluates it, by use of the RMSE, on the validation set. After the five rounds, the mean of the five RMSEs is compared across configurations to select the configuration with the lowest mean RMSE. After the cross-validation, the model is estimated with the selected configuration. Per explanatory variable, an optimal model is selected and estimated; this is identical to the VARX model. The building and estimation of the feed-forward neural network is done with the Keras library in Python 2.7. In order to compare the performance of the feed-forward neural network with the VARX model, the prediction of future values should be done in the same way as for the VARX model. Therefore, predictions of future values of the components are needed. These values are estimated with a feed-forward neural network per individual component. In building these models, the same steps are used as for the original model.
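The fixed design choices (ELU hidden layers, linear output, N(0, 0.05) initial weights) can be illustrated with a forward pass in plain numpy. This sketches the architecture only; the thesis builds and trains the network in Keras with stochastic gradient descent:

```python
import numpy as np

def elu(z, alpha=1.0):
    """Exponential linear unit activation."""
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))

def init_layer(rng, n_in, n_out):
    """Weights drawn from N(0, 0.05), as in the text; zero biases."""
    return rng.normal(0.0, 0.05, (n_in, n_out)), np.zeros(n_out)

def forward(x, layers):
    """ELU on the hidden layers, no activation on the output layer."""
    *hidden, (W_out, b_out) = layers
    for W, b in hidden:
        x = elu(x @ W + b)
    return x @ W_out + b_out

rng = np.random.default_rng(3)
n_in = 8                                   # hypothetical input size lambda
layers = [init_layer(rng, n_in, n_in),     # one hidden layer with lambda neurons
          init_layer(rng, n_in, 1)]        # linear output: the forecast
y_hat = forward(rng.normal(size=n_in), layers)
```

Adding a second hidden layer (the β = 2 configuration) just means inserting another `init_layer` before the output pair.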

The convolutional neural network differs in two steps: the grid search for the hidden layers and the prediction of future values. The optimal number of hidden layers is selected from β ∈ {(1, 1), (1, 2)}, where the first configuration consists of one convolutional layer and one fully connected layer and the second consists of one convolutional layer and two fully connected layers. The future values of the components are estimated with a convolutional neural network per individual component.

The recurrent neural network differs in four steps: the grid search for the hidden layers, the fixed batch size during the grid search, the regularization techniques and the prediction of future values. The grid search for the optimal learning rate, hidden layers and neurons is done with a fixed batch size of 1 and 100 epochs. The batch size is set to 1 because the recurrent neural network uses the state of the network, which is created by the previous output, to predict the next value. The optimal number of hidden layers is selected from β ∈ {(1, 1), (1, 2)}, where the first configuration consists of one LSTM layer and one fully connected layer and the second consists of one LSTM layer and two fully connected layers. During the grid search for the optimal number of epochs, the batch size is also set to 1, because of the memory of the network. Next, the dropout, kernel and activity regularization techniques are used to prevent overfitting; the activity regularizer shrinks the output of the layer. Last, the future values of the components are estimated with a recurrent neural network per individual component.

The encoder decoder network differs in five steps: the hidden layers, the fixed batch size, the optimal number of neurons, the regularization techniques and the prediction of future values. First, the layers of the encoder decoder network are fixed at two LSTM layers. Therefore, the first grid search covers the optimal learning rate and the optimal number of neurons. The batch size is set to 1, due to the memory of the network. The optimal number of neurons is selected from θ ∈ {λ, 1.25 ∗ λ, 1.5 ∗ λ}, where λ is the input size. The regularization techniques are identical to those of the recurrent neural network. Last, the future values of the components are estimated with an encoder decoder network per individual component.

Lastly, for every ANN and only for the US, a model is built per separate endogenous variable. These models use only lags of the explanatory variables. The purpose of these models is to compare the results with Cook and Hall (2017) and to investigate whether the performance of the ANNs relative to VAR models differs when forecasting with and without additional input variables. Again, in building these models, the same steps are used as for the original model.

### 4.3 Forecasting and performance

This paragraph discusses the forecasting technique and the performance evaluation. First, the decisions made to evaluate the accuracy are discussed; afterwards, the comparison technique is given.

4.3.1 Accuracy

The models are estimated with the training sample and their performance is based on the forecast accuracy in the test sample. To give an indication of robustness, the accuracy of the forecasts for the US is evaluated with the Mean Absolute Error (MAE) method:

$$\mathrm{MAE}(\hat{y}_{t+h|t}) = \frac{1}{N}\sum_{n=1}^{N}\left|\hat{y}_{t+h|t;n} - y_{t+h|t;n}\right|,$$

and the RMSE method:

$$\mathrm{RMSE}(\hat{y}_{t+h|t}) = \sqrt{\frac{1}{N}\sum_{n=1}^{N}\left(\hat{y}_{t+h|t;n} - y_{t+h|t;n}\right)^{2}},$$

with $\hat{y}_{t+h|t}$ the h-step ahead forecast, $y_{t+h|t}$ the true value at t + h and N the total number of observations in the test sample (Tsay, 2005). All other countries are evaluated with the RMSE method.
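Both accuracy measures are a few lines of numpy (toy values for illustration):

```python
import numpy as np

def mae(y_hat, y):
    """Mean absolute error over the test sample."""
    return np.mean(np.abs(y_hat - y))

def rmse(y_hat, y):
    """Root mean squared error over the test sample."""
    return np.sqrt(np.mean((y_hat - y) ** 2))

y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.5, 2.0, 2.0])
# mae: (0.5 + 0 + 1) / 3 = 0.5 ; rmse: sqrt((0.25 + 0 + 1) / 3)
```

The RMSE penalizes large errors more heavily than the MAE because the errors enter squared, which is why the two measures can rank models differently.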

The target forecast horizons are 1, 3, 6, 9 and 12 months ahead. Per country, the VARX model and the ANNs are trained with the multiple output strategy, in which one model is used to predict the entire forecast sequence. When forecasting longer horizons, the chain rule of forecasting, explained in Paragraph 3.1, is used: predictions of future values are themselves used as inputs. For example, when h = 2, first the one-step-ahead value is forecasted; afterwards, that forecast is used to forecast the two-step-ahead value.
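The chain rule of forecasting can be sketched generically; the AR(1) stand-in model below is hypothetical:

```python
import numpy as np

def iterate_forecast(model, history, h):
    """Chain rule of forecasting: feed each prediction back in as the
    newest observation until horizon h is reached."""
    history = list(history)
    preds = []
    for _ in range(h):
        preds.append(model(history))   # one-step-ahead forecast
        history.append(preds[-1])      # treat it as the newest observation
    return preds

ar1 = lambda hist: 0.8 * hist[-1]              # hypothetical fitted AR(1), phi = 0.8
forecasts = iterate_forecast(ar1, [1.0], h=3)  # approx. [0.8, 0.64, 0.512]
```

Any fitted one-step model (VARX equation or ANN) can play the role of `model` here; the h = 2 forecast is built from the h = 1 forecast exactly as the text describes.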

For the ANNs, the initial weights in the training process are drawn randomly, causing the final weights, and therefore the forecast results, to differ across repeated runs. To account for this stochasticity, 20 instances of every ANN are trained and the average of the three predictions with the lowest RMSE is used as the result.

4.3.2 The Diebold-Mariano test

When comparing the predictive accuracy of two forecasts, the Diebold-Mariano (DM) test can be used. The forecast errors are defined as $e_{i,t+h} = \hat{y}_{i,t+h|t} - y_{t+h|t}$, $i = 1, 2$, with $\hat{y}_{i,t+h|t}$ the h-step ahead forecast of technique i and $y_{t+h|t}$ the true value at time t + h. The loss differential between the two forecasts is defined by $d_t = g(e_{1,t+h}) - g(e_{2,t+h})$, with $g(e_{i,t+h})$ the loss function. The null hypothesis states that the two forecasts have the same accuracy, which can only be true if $E(d_t) = 0$ for all t. Alternatively, $E(d_t) \neq 0$ states a difference in the level of accuracy of the two forecasts. When testing the null hypothesis for two h-step ahead forecasts, the test statistic is

$$DM = \frac{\bar{d}}{\sqrt{\dfrac{2\pi \hat{f}_d(0)}{n}}}, \qquad \text{with} \quad \hat{f}_d(0) = \frac{1}{2\pi}\Big(\hat{\gamma}_d(0) + 2\sum_{k=1}^{h-1}\hat{\gamma}_d(k)\Big), \qquad \bar{d} = \frac{1}{n}\sum_{t=1}^{n} d_t,$$

$\hat{\gamma}_d(k) = \widehat{\mathrm{cov}}(d_t, d_{t-k})$, n the sample size and $DM \xrightarrow{d} N(0, 1)$ (Diebold and Mariano, 2002).

However, Diebold and Mariano (2002) have shown that the DM test can become oversized when forecasting over longer horizons. Therefore, Harvey, Leybourne, and Newbold (1997) have proposed a corrected statistic:

$$HLN\text{-}DM = \sqrt{\frac{n + 1 - 2h + h(h-1)/n}{n}}\; DM,$$

which has a Student-t distribution with n − 1 degrees of freedom. This test is used for the comparison, which is done with the function dm_test in Python 2.7.
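The HLN-corrected DM statistic can be computed directly from the two forecast-error series. This sketch assumes squared-error loss and uses simulated errors; the thesis uses the dm_test function instead:

```python
import numpy as np

def hln_dm(e1, e2, h):
    """Diebold-Mariano statistic with the Harvey-Leybourne-Newbold
    small-sample correction, for squared-error loss g(e) = e^2."""
    d = e1 ** 2 - e2 ** 2                 # loss differential d_t
    n = len(d)
    d_bar = d.mean()
    # 2*pi*f_hat(0): gamma(0) + 2 * sum_{k=1}^{h-1} gamma(k)
    gamma = [np.mean((d[k:] - d_bar) * (d[:n - k] - d_bar)) for k in range(h)]
    lrv = gamma[0] + 2 * sum(gamma[1:])
    dm = d_bar / np.sqrt(lrv / n)         # the DM statistic
    correction = np.sqrt((n + 1 - 2 * h + h * (h - 1) / n) / n)
    return correction * dm                # compare with Student-t, n-1 df

rng = np.random.default_rng(4)
e1 = rng.normal(size=80)                  # errors of forecast 1
e2 = 1.5 * rng.normal(size=80)            # forecast 2 has larger errors
stat = hln_dm(e1, e2, h=3)                # negative: forecast 1 is more accurate
```

A clearly negative statistic means the first forecast has the smaller squared-error loss; the sign convention follows the definition of $d_t$ above.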

### Chapter 5

## Results

This chapter presents and evaluates the forecast performance of the models. Appendix A.3 gives, per country, the tables with RMSE ratios. Appendix B.3 presents the forecast graphs of the transformed variables, sorted per country. First, to compare the results with Cook and Hall (2017) and to investigate whether the performance of the ANNs relative to VAR models differs when forecasting with and without additional input variables, the forecast results of the US are presented and evaluated. Second, the forecast results of all other countries are analysed. Last, the results of the Diebold-Mariano test are discussed.

### 5.1 Comparison of the forecast results of the US

Tables A.8, A.9, A.10, A.11, A.12 and A.13 present the forecast performances of the neural networks against the benchmark VARX model. The performances are measured by the RMSE of the corresponding indicator relative to the RMSE of the VARX model. For every variable, the RMSE of the feed-forward neural network, convolutional neural network, recurrent neural network and encoder decoder network are divided by the RMSE of the VARX model, which gives a ratio.

h indicates the forecast horizon, which ranges from 1 month ahead up until one year ahead. Better performance corresponds to a lower RMSE. Therefore, when the ratio is less than one, the corresponding ANN performs better than the benchmark VARX model. For example, in Table A.8, the ratio of the feed-forward neural network, on variable CPI, in the US, with forecast horizon h = 1, is 0.982. This indicates that the ANN performs better than the benchmark VARX model for that forecast horizon.

In previous literature, Cook and Hall (2017) investigated these ANNs with US unemployment data, without additional input variables. Their results are shown in Table 5.1. As seen in the table, Cook and Hall (2017) concluded that the ANNs outperform the linear benchmark models at short forecast horizons, but, as the horizon increases, the performance of the ANNs deteriorates compared to the benchmark models. A notable exception is the encoder decoder model, which outperforms the linear benchmark models at every forecast horizon.

Table 5.1: Forecast performance of the ANNs, from Cook and Hall (2017), measured by the MAE of the corresponding variable relative to the MAE of the benchmark model, the US.

| Variable | Model | h=1 | h=3 | h=6 | h=9 | h=12 |
|---|---|---|---|---|---|---|
| Unemployment | Feed-forward NN | 0.563 | 0.934 | 1.070 | 1.123 | 1.208 |
| | Convolutional NN | 0.993 | 0.948 | 1.172 | 1.415 | 1.550 |
| | Recurrent NN | 0.770 | 1.007 | 1.148 | 1.317 | 1.423 |
| | Encoder decoder network | 0.326 | 0.679 | 0.740 | 0.812 | 0.861 |

For this thesis, Table 5.2 shows the results, measured in MAE, of CPI, unemployment and GDP for the US, without additional input variables, for comparison with Cook and Hall (2017). Overall, the results of the ANNs are better than the benchmark at long horizons and worse at short horizons. This is exactly the opposite of the findings of Cook and Hall (2017). For example, for unemployment at h = 1, the ratio of the feed-forward neural network of Cook and Hall (2017) is 2.07 times smaller than the ratio of the feed-forward neural network in this thesis. This indicates that their ANN performs better relative to their benchmark than the ANN in this thesis does relative to the AR model. At h = 12, the ratio of their feed-forward neural network is 1.25 times larger than that of the feed-forward neural network in this thesis, indicating that their ANN performs worse than the ANN in this thesis. Looking at the encoder decoder network, at h = 1 the ratio in Table 5.1 is 3.55 times smaller and at h = 12 it is 1.12 times smaller than the ratio in Table 5.2. This indicates that the encoder decoder network of Cook and Hall (2017) performs better, relative to their benchmark, than the encoder decoder network in this thesis relative to the AR model. Therefore, for unemployment, the ratios are clearly not the same as the ratios in Table 5.1. For h = 1 and h = 3, the table shows ratios that are always greater than one, implying that the ANNs perform worse than the benchmark AR model. For longer horizons, the ratios are smaller than one, implying that the performance of the ANNs becomes better than that of the AR model.

Another finding is that the ratios in Table 5.2 are closer to one than the ratios in Table 5.1. Moreover, the upward trend from Cook and Hall (2017) cannot be seen in the ratios of unemployment of the US in Table 5.2. The difference in results could have multiple causes. First, there is a difference in benchmark: Cook and Hall (2017) use the Survey of Professional Forecasters (SPF) as benchmark, whereas in this thesis the AR model is used. Second, there is a difference in forecast technique for longer horizons: Cook and Hall (2017) predict the longer horizons directly, whereas in this thesis the chain rule of forecasting is used. Last, the implementation of the ANNs could differ. ANNs have many parameters, and a small change in these parameters could lead to distinctively different results.

In Table 5.2, the results of CPI also show an absence of the upward trend. At h = 1, most ratios are greater than one. However, for longer horizons the ratios decrease, indicating that the ANNs improve relative to the AR model. For real GDP, at h = 1, the AR model also performs better than the ANNs, whereas, for longer horizons, the ANNs perform better than the AR model. Remarkable is the encoder decoder network: for every variable and almost every forecast horizon, it performs worse than the AR model. This is contrary to the results of Cook and Hall (2017), where it is, in every regard, the best model tested. The reason for this difference could be the implementation differences explained earlier.

Table 5.2: Forecast performance of the ANNs, without additional input variables, measured by the MAE of the corresponding variable relative to the MAE of the VARX model, the US.

| Variable | Model | h=1 | h=3 | h=6 | h=9 | h=12 |
|---|---|---|---|---|---|---|
| CPI | Feed-forward NN | 0.985 | 0.988 | 1.009 | 0.998 | 0.999 |
| | Convolutional NN | 1.034 | 0.985 | 1.006 | 0.995 | 0.997 |
| | Recurrent NN | 1.015 | 0.981 | 1.008 | 0.997 | 0.999 |
| | Encoder decoder network | 1.051 | 0.986 | 1.007 | 0.996 | 0.998 |
| Unemployment | Feed-forward NN | 1.166 | 1.070 | 0.997 | 0.962 | 0.969 |
| | Convolutional NN | 1.165 | 1.069 | 0.996 | 0.961 | 0.968 |
| | Recurrent NN | 1.050 | 1.033 | 0.998 | 0.963 | 0.969 |
| | Encoder decoder network | 1.156 | 1.067 | 0.997 | 0.961 | 0.968 |
| Real GDP | Feed-forward NN | 1.405 | 0.985 | 0.932 | 1.025 | 0.942 |
| | Convolutional NN | 1.675 | 0.984 | 0.932 | 1.018 | 0.934 |
| | Recurrent NN | 2.112 | 0.966 | 0.979 | 0.999 | 0.969 |
| | Encoder decoder network | 1.959 | 1.079 | 1.020 | 1.159 | 1.026 |

Table 5.3 shows the results for the US, with lags of inflation, unemployment, GDP and the predictors, measured by the MAE. Contrary to the models in Table 5.2, the models in Table 5.3 use additional input variables. Looking at unemployment, for h = 1, the ratio of the feed-forward neural network in Table 5.1 is 1.57 times smaller than the ratio of the feed-forward neural network in Table 5.3; for h = 12, it is 1.19 times larger. For h = 1, the ratio of the encoder-decoder network of Cook and Hall (2017) is 2.7 times smaller than the ratio of the network in Table 5.3, and for h = 12 it is 1.12 times smaller. These results indicate that the performance of the ANNs in Table 5.3 has become more similar to the results of Cook and Hall (2017) than the results of the models without additional input variables in Table 5.2. Looking across all forecast horizons and all ANNs, the results in Table 5.3 and Table 5.1 still differ from each other. However, contrary to Table 5.2, the trend found in Cook and Hall (2017) is also found in the ratios for US unemployment in Table 5.3.

The forecast graphs of unemployment in appendices B.3.1 and B.3.2 suggest a possible explanation for the absence of the trend when forecasting without additional input variables and its presence when forecasting with additional input variables. For every forecast horizon and every forecast technique, the forecast graphs in B.3.1 tend to be closer to zero than the graphs of unemployment in appendix B.3.2. This decreases the difference in forecast error between the ANN and the AR model, since more similar forecast values correspond to a smaller difference in forecast error, and a smaller difference in forecast error gives a ratio closer to one.

In Table 5.3, the results for CPI show the same trend as for unemployment: at h = 1 the ratios are smaller than one, and they grow closer to one as h increases. For real GDP, most of the ANNs perform better than the VARX model. However, at h = 1, the VARX model performs better than the ANNs, indicating that the trend found by Cook and Hall (2017) does not always hold. A plausible explanation for the high ratios is that the interpolation of the quarterly GDP data to monthly data with the cubic spline method increased the overall performance of the VARX model at short forecast horizons. The spline used in cubic spline interpolation is a third-degree polynomial, hence not linear; nevertheless, it could be that the VARX model is better at predicting the near-linear parts of the spline than the ANN is at predicting the spline as a whole.
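The quarterly-to-monthly interpolation discussed above can be sketched with SciPy's `CubicSpline`; the GDP values below are hypothetical and serve only to show the mechanics:

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Hypothetical quarterly GDP levels, observed every third month.
quarter_months = np.arange(0, 24, 3)          # months 0, 3, 6, ..., 21
gdp_quarterly = np.array([100.0, 101.2, 101.9, 102.5,
                          103.8, 104.1, 105.0, 106.2])

# Fit one piecewise third-degree polynomial through the quarterly points.
spline = CubicSpline(quarter_months, gdp_quarterly)

# Evaluate the spline at every month to obtain a monthly GDP series.
months = np.arange(0, 22)
gdp_monthly = spline(months)
```

By construction the interpolated series passes exactly through the quarterly observations, so the monthly series is smooth but contains no genuinely new information between quarters, which may favor models that track its locally smooth behavior.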

Table 5.3: Forecast performance of the ANNs, with additional input variables, measured by the MAE of the corresponding variable relative to the MAE of the VARX model, the US.

Variable       Model                    h=1    h=3    h=6    h=9    h=12

CPI            Feed-forward NN          0.949  0.913  1.000  1.011  1.005
               Convolutional NN         0.924  0.892  1.010  1.018  1.008
               Recurrent NN             0.988  0.874  1.006  1.010  0.999
               Encoder-decoder network  1.014  0.880  1.009  1.026  1.028
Unemployment   Feed-forward NN          0.882  0.988  1.008  1.024  1.011
               Convolutional NN         0.863  0.989  1.056  1.035  1.019
               Recurrent NN             0.998  0.985  1.030  1.004  0.993
               Encoder-decoder network  0.885  0.962  1.072  1.037  0.983
Real GDP       Feed-forward NN          1.435  0.875  0.946  0.852  0.972
               Convolutional NN         1.454  0.857  0.929  0.858  0.969
               Recurrent NN             1.998  1.013  0.969  1.048  1.042
               Encoder-decoder network  2.040  1.013  1.003  1.014  1.143

Comparing the results of Table 5.2 and Table 5.3 reveals the influence of additional variables as model inputs. Table 5.4 shows the results of Table 5.2 and Table 5.3 averaged over all variables and ANNs. For every forecast horizon, the ratios of the ANNs with additional input variables are lower than one, indicating that the ANNs perform better than the benchmark. The ratios without additional input variables are not always lower than one; overall, they are slightly higher than one, indicating that the benchmark model performs slightly better. In summary, the ANNs with additional input variables perform better than the ANNs without additional input variables. Therefore, the additional variables as model inputs have increased the ANNs' performance relative to the benchmark models.

Table 5.4: Forecast performance of the ANNs, measured by the MAE of the corresponding variable relative to the MAE of the benchmark model, averaged over all variables and ANNs, the US.

                                     h=1    h=3    h=6    h=9    h=12   Overall

Without additional input variables   1.156  1.012  1.003  0.999  0.993  1.033
With additional input variables      0.996  0.876  0.933  0.928  0.941  0.935
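The averaging behind Table 5.4 reduces a variable-by-model-by-horizon array of ratios to one value per horizon plus an overall mean. A minimal sketch, using randomly generated ratios in place of the actual table values:

```python
import numpy as np

# ratios[variable, model, horizon]: hypothetical relative-MAE ratios for
# 3 variables x 4 ANNs x 5 forecast horizons (h = 1, 3, 6, 9, 12).
rng = np.random.default_rng(0)
ratios = rng.uniform(0.8, 1.2, size=(3, 4, 5))

# Average over the variable axis (0) and the model axis (1), leaving one
# value per forecast horizon, as in Table 5.4.
per_horizon = ratios.mean(axis=(0, 1))  # shape (5,)

# The "Overall" column is the mean across all variables, models and horizons.
overall = ratios.mean()
```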

To give a better indication of the robustness of the results, Table 5.5 shows the forecast performance of the ANNs with additional input variables, measured by the RMSE. Comparing it with Table 5.3 shows that the results differ only slightly, and the trends found for CPI, unemployment and real GDP are still present. The difference in results is caused by the RMSE using the squared error: squaring gives large errors a disproportionately large influence relative to small errors, whereas with the MAE every error contributes in direct proportion to its size.
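The different sensitivity of the two measures to large errors can be shown with two error sequences that share the same MAE but not the same RMSE:

```python
import numpy as np

def mae(errors):
    """Mean absolute error: every error counts in direct proportion."""
    return np.mean(np.abs(errors))

def rmse(errors):
    """Root mean squared error: squaring amplifies large errors."""
    return np.sqrt(np.mean(np.square(errors)))

even   = np.array([1.0, 1.0, 1.0, 1.0])  # four equal errors
spiked = np.array([0.0, 0.0, 0.0, 4.0])  # same absolute total, one large error

print(mae(even), mae(spiked))    # 1.0 1.0 -> identical under MAE
print(rmse(even), rmse(spiked))  # 1.0 2.0 -> RMSE penalizes the spike
```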

Table 5.5: Forecast performance of the ANNs, with additional input variables, measured by the RMSE of the corresponding variable relative to the RMSE of the VARX model, the US.

Variable       Model                    h=1    h=3    h=6    h=9    h=12

CPI            Feed-forward NN          0.982  0.946  0.988  0.998  0.998
               Convolutional NN         0.924  0.892  1.010  1.018  1.008
               Recurrent NN             1.076  0.932  0.980  0.994  0.994
               Encoder-decoder network  1.094  0.937  0.981  1.007  0.998
Unemployment   Feed-forward NN          0.890  0.979  1.014  1.015  1.011
               Convolutional NN         0.863  0.989  1.056  1.035  1.019
               Recurrent NN             1.003  0.975  1.046  1.003  0.998
               Encoder-decoder network  0.892  0.951  1.089  1.029  0.969
Real GDP       Feed-forward NN          1.444  0.854  0.964  0.881  0.981
               Convolutional NN         1.454  0.857  0.929  0.858  0.969
               Recurrent NN             2.260  0.988  0.968  1.049  1.128
               Encoder-decoder network  1.979  0.953  1.003  1.021  1.189

### 5.2 Comparison of the forecast results of all countries

Looking across all countries, variables and horizons in Tables A.8, A.9, A.10, A.11, A.12 and A.13 shows that the upward trend returns multiple times. The results for CPI of the US, unemployment of the US, CPI of Japan, CPI of Germany and CPI of Italy all show the same trend. However, the other results contradict this conclusion and indicate that, for some reason, the VARX model is more competitive with the ANNs.

The ratios for real GDP at forecast horizon h = 1 are remarkable. For every country, the ratios are greater than one, indicating that the VARX model outperforms all ANNs. This is possibly caused by the use of the cubic spline method explained earlier. For longer forecast horizons, the ANNs still perform equally well as or better than the VARX model.