
Master’s Thesis

Introduction of a hybrid ARFIMA-RNN

model and comparison to other time

series models

Freek van den Honert

Student number: 10885161
Date of final version: July 15, 2018
Master's programme: Econometrics
Specialisation: Big Data Business Analytics
Supervisor: dr. K.A. Lasak
Second reader: Prof. dr. C.G.H. Diks

Abstract

Various hybrid neural network ARIMA models have been proposed for time series forecasting. This thesis introduces a new model, namely a recurrent neural network ARFIMA model, and compares its forecasting ability with those of existing models. This is done for simulated data sets as well as empirical data sets, by comparing the RMSE, MAPE and MAE and by using the Diebold-Mariano test. The proposed model seems to outperform the existing models if the data is complex enough.


Contents

1 Introduction
2 Literature Review
  2.1 Models
    2.1.1 The ARIMA and ARFIMA models
    2.1.2 The MLP Neural Network model
    2.1.3 The MLP Recurrent Neural Network model
    2.1.4 Hybrid methodology
  2.2 Performance measures and test
    2.2.1 Performance measures
    2.2.2 Diebold-Mariano test
  2.3 Relevant literature
3 The Models
  3.1 Simulated data sets
    3.1.1 ARIMA(1, 0, 1)
    3.1.2 ARIMA(1, 0, 1)-GARCH(1, 1)
    3.1.3 Bilinear (BL)
    3.1.4 Nonlinear autoregressive (NAR)
    3.1.5 NAR-GARCH(1, 1)
    3.1.6 Nonlinear moving average (NMA)
  3.2 Model setup
  3.3 Estimation
  3.4 Diebold-Mariano testing procedure
4 Data
  4.1 Lynx data set
  4.2 S&P500 data set
  4.3 Electricity data set
5 Results
  5.1 Simulated data sets
    5.1.1 ARIMA(1, 0, 1)
    5.1.2 ARIMA(1, 0, 1)-GARCH(1, 1)
    5.1.3 Bilinear
    5.1.4 Nonlinear autoregressive
    5.1.5 Nonlinear autoregressive-GARCH(1, 1)
    5.1.6 Nonlinear moving average
  5.2 Lynx results
  5.3 S&P500 results
  5.4 Electricity results
6 Conclusion
A Forecasts
References


Statement of originality

This document is written by Freek van den Honert who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document is original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


Introduction

Having some insight into what the future might bring is of vital importance to many professions. Some examples are predicting the weather for meteorologists, the volatility of stocks for option traders and the amount of traffic for traffic controllers.

What the future might bring is often assumed to depend on what the past has brought. As a result of this assumption time series models are often used to predict the future. However, finding an optimal time series model can be quite difficult.

The autoregressive integrated moving average (ARIMA) model was one of the first time series models to be introduced, and it remains one of the most important and most used time series models. The model is built on the assumption that all relations between consecutive observations are linear. This assumption makes the ARIMA model easy to interpret. However, a major drawback is that, if real world data are used, this assumption is often violated, as these data tend to have more complex relations. Another drawback of these models is the shortness of the memory, as observations further in the past are assumed to have little to no influence on the future.

A more general model, the autoregressive fractionally integrated moving average (ARFIMA) model, was introduced shortly afterwards, which solves the short memory problem. This model uses a binomial series so that observations very far in the past still have some influence.

Some nonlinear time series models have also been introduced since then, such as the bilinear model and the nonlinear autoregressive model. These models can be useful, but only for data sets with specific characteristics.

The nonlinear time series problem was addressed by the introduction of neural networks, which made their way into this field of study two decades ago. The strength of neural networks lies in the absence of assumptions: neural network models can estimate any arbitrary function. However, a drawback of neural networks is that they are prone to overfitting.

Since then several methods to reduce overfitting have been proposed. Most of these methods try to mix neural networks with ARIMA models. For instance, Zhang (2003) introduced a hybrid of a multilayer perceptron (MLP) type neural network and an ARIMA model, Aladag, Egrioglu and Kadilar (2009) introduced a hybrid of a recurrent neural network (RNN) and an ARIMA model and Chaâbane (2014) introduced a hybrid of an MLP neural network and an ARFIMA model.

The research done in this thesis focuses on a hybrid version of an ARFIMA and an RNN model, which theoretically should be able to outperform the other models mentioned. The aim of this research is to compare the proposed model with the ARIMA, ARFIMA, MLP and RNN models and a hybrid ARFIMA MLP model. To do so, data sets are simulated on which the different models are estimated, and their forecasts are compared to the forecasts produced by the proposed model. Three empirical data sets are also included so that the performance of the proposed model on empirical data can be investigated.

The contents of this research are as follows. Chapter 2 discusses the theoretical background of the different models, performance measures, a test for significant difference in forecast accuracy and relevant academic literature. Chapter 3 covers the research methodology in more detail. The empirical data sets are discussed in Chapter 4. Afterwards, the results are given in Chapter 5. Finally, the conclusion is given in Chapter 6.


Literature Review

This chapter includes some theoretical background on the different models, performance measures and a test for significant difference in accuracy, as well as a discussion of relevant academic literature. In Section 2.1.1 the ARIMA and ARFIMA models are discussed, followed by an overview of the MLP model in Section 2.1.2 and an overview of the RNN model in Section 2.1.3; the hybrid methodology is covered in Section 2.1.4. Some performance measures are discussed in Section 2.2.1 and a test for significant difference in accuracy is discussed in Section 2.2.2. Finally, Section 2.3 covers relevant academic literature.

2.1 Models

2.1.1 The ARIMA and ARFIMA models

Suppose there is a time series {y_t, t ∈ T}, where T is a set containing all time points for which y_t can be observed. An autoregressive integrated moving average model assumes that the value of a variable is a linear function of the past values of this variable. An ARIMA(p, d, q) model can thus be written as

x_t = φ_0 + φ_1 x_{t−1} + ... + φ_p x_{t−p} + a_t − θ_1 a_{t−1} − ... − θ_q a_{t−q},  t ∈ T  (2.1)

where {x_t} is the time series constructed by taking d differences of {y_t}, so x_t = ∆^d y_t with ∆y_t = y_t − y_{t−1}, a_t is the error or shock at time point t, and φ_0, ..., φ_p and θ_1, ..., θ_q are model parameters. Using B as the lag operator, so that B^d y_t = y_{t−d}, the ARIMA(p, d, q) model can also be written as

φ(B)(1 − B)^d y_t = φ_0 + θ(B) a_t  (2.2)

where φ(B) = 1 − φ_1 B − φ_2 B² − ... − φ_p B^p is the autoregressive polynomial and θ(B) = 1 − θ_1 B − θ_2 B² − ... − θ_q B^q is the moving average polynomial.

The errors or shocks, a_t, in (2.1) and (2.2) are assumed to follow a certain distribution of which the variance, denoted by σ_t², can be a constant or even a time series itself, such as in a generalised autoregressive conditional heteroskedastic GARCH(m, s) model, which corresponds to an ARMA(p, s) model with p = max{m, s} for a_t². Such a GARCH(m, s) model has the form

σ_t² = α_0 + Σ_{i=1}^{m} α_i a_{t−i}² + Σ_{j=1}^{s} β_j σ_{t−j}²

where α_0, ..., α_m and β_1, ..., β_s are model parameters.

Finding an optimal ARIMA model which fits an observed time series can be difficult. A practical approach to building these models is the Box-Jenkins procedure as described in Box and Jenkins (1970). This approach had a fundamental impact on the field of time series analysis and forecasting. The Box-Jenkins procedure consists of three steps, namely identification, estimation and testing.

In the identification step initial guesses for p, q and d are obtained. Here d denotes the integration order, which is the number of differences that need to be taken to make x_t = ∆^d y_t a stationary time series. A time series is, by definition, weakly stationary if the first two moments of the time series remain constant over time, that is

E[x_t] = μ,  Cov(x_t, x_{t−ℓ}) = γ_ℓ,  ∀t ∈ T, ℓ ∈ Z

or, in other words, the time series may not explode. The initial guess for d is obtained by looking at the graph of the time series. If the graph shows nonstationarity, d is increased by one and the graph is inspected again until it resembles a stationary time series.

The initial guesses for p and q are based on the sample autocorrelation function (ACF), which is defined as

ρ̂_ℓ = Σ_{t=ℓ+1}^{T} (x_t − x̄)(x_{t−ℓ} − x̄) / Σ_{t=1}^{T} (x_t − x̄)²

where x̄ is the sample mean of x. The sample ACF should show a clear cut-off point at ℓ = q if p = 0. If this is not the case and the sample ACF shows exponential decay, then p ≠ 0 and the sample partial autocorrelation function (PACF) should be examined as well. The sample PACF {φ̂_{ℓ,ℓ}} can be obtained by least-squares estimation of

x_t = φ_{0,ℓ} + φ_{1,ℓ} x_{t−1} + ... + φ_{ℓ,ℓ} x_{t−ℓ} + e_{ℓ,t}

and should show a clear cut-off point at ℓ = p if q = 0. If both the sample ACF and the sample PACF show exponential decay, then both p and q should be nonzero. In that case a low choice such as p = q = 1 is used as the initial guess.

These initial guesses of p, d and q make the estimation of the model possible, which is the second step of the Box-Jenkins procedure. The ARMA parameters φ_0, ..., φ_p and θ_1, ..., θ_q are usually estimated by (nonlinear) least squares on the proposed model, but maximum likelihood is also a viable choice of estimation method.

The last step of the Box-Jenkins procedure involves diagnostic testing. The residuals of the estimation step, denoted by {â_t}, should have white noise properties if the model specification is correct. The white noise properties can be tested by performing the Ljung-Box test or the Breusch-Godfrey Lagrange multiplier test.

The Box-Jenkins procedure is one of the most frequently used procedures, but it requires thorough investigation of the sample ACF, the sample PACF and the graph of the time series. Often various models pass all diagnostic tests, and the final model can then be chosen based on information criteria, which involve a trade-off between fit and parsimony. Two popular information criteria are

AIC(p, q) = c + log σ̂²_{p,q} + 2(p + q)/T*
BIC(p, q) = c + log σ̂²_{p,q} + (p + q) log T*/T*

where c is a constant, T* = T − p is the effective sample size, and σ̂²_{p,q} = (T*)^{−1} Σ_{t=p+1}^{T} â_t². The preferred model is then the model which has the lowest value of the information criterion used. The BIC favours parsimony more, as its penalty for extra parameters is more severe.

A more general model which allows for fractional d is the autoregressive fractionally integrated moving average (ARFIMA) model, first introduced by Granger and Joyeux (1980). The ARFIMA model has the same form as (2.2), but the integration order, d, is allowed to be fractional. Fractional differencing is done using the binomial series expansion, that is

(1 − B)^d = Σ_{k=0}^{∞} (d choose k)(−B)^k = 1 − dB + (d(d − 1)/2!) B² − ...

ARFIMA models are useful in modelling time series with long memory. Furthermore, the ARIMA model is a special case of the ARFIMA model where the integration order, d, is an integer.

A way to estimate an ARFIMA model based on an observed time series is by estimating the fractional d by maximum likelihood estimation of an ARFIMA(2, d, 0) model, as is done by Haslett and Raftery (1989). The data are then fractionally differenced using the estimated d. The orders p and q are found by fitting several ARMA(p, q) models with different p and q to this fractionally differenced data, of which the model with the lowest AIC is chosen as the optimal model, as proposed by Hyndman, Khandakar et al. (2007). The model parameters are finally estimated by maximum likelihood estimation of the ARFIMA(p, d, q) model on the data, as done by Haslett and Raftery.
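This estimation route is essentially what the arfima function in the R forecast package automates (fracdiff for d, then AIC-based selection of p and q); a minimal sketch, assuming an observed series y:

```r
# Sketch of the estimation route described above, using the forecast
# package's arfima() function; y is the observed series.
library(forecast)

fit <- arfima(y)            # estimate d, then select p and q by AIC
summary(fit)

fc <- forecast(fit, h = 1)  # one step ahead forecast
fc$mean
res <- residuals(fit)       # residuals, reused later by the hybrid models
```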

Forecasting can be done when an optimal model is found. Let ℓ denote the number of time points ahead that need to be forecast, the forecasting horizon, and let F_h = {x_h, x_{h−1}, ...} be a set containing all information on the process {x_t} up to time h, the forecast origin. The optimal forecast ℓ time points ahead of the forecast origin is then x̂_h(ℓ) = E[x_{h+ℓ}|F_h].

Although the ARIMA and ARFIMA models are flexible in terms of the number of lagged variables they can contain, they are limited in the sense that they can only estimate linear relations between variables at different time periods.

2.1.2 The MLP Neural Network model

A model which is capable of estimating nonlinear relations between variables is the multilayer perceptron (MLP) neural network. This type of neural network consists of an input layer, a single or multiple hidden layers and an output layer. Every layer consists of neurons which are connected to the previous layer through synapses.

To aid with understanding how an MLP works, Figure 2.1 is included.

Figure 2.1: A simple MLP with one input layer containing p neurons, one hidden layer containing L neurons and one output layer containing m neurons.

From this figure it can be seen that all synapses have weights; for instance, the synapse from the first neuron of the input layer to the first neuron of the hidden layer has weight w_{11}^h in this figure. The input of each neuron outside of the input layer is the sum of the outputs of the connected neurons of the previous layer multiplied by the respective weights of their synapses, and is denoted by u_j in the figure.

In each of the neurons outside of the input layer an activation function, denoted by σ, is applied to the input of that neuron. Some popular activation functions are the sigmoid function, σ(x) = (1 + exp(−x))^{−1}, and the rectifier function, σ(x) = max{x, 0}. The outputs of these neurons are the values of the activation functions evaluated at the inputs of the neurons, so h_j = σ(u_j) in the figure.

This is repeated for every next layer, so that the sum of the outputs of the connected neurons of the last hidden layer multiplied by the respective weights of their synapses is the input of a neuron in the output layer. The activation function used in the output layer is the linear function, σ(x) = ax, if the model is made for regression, but other activation functions are also possible, such as the sigmoid function when the model is made for binary classification.

Estimating the MLP model based on a data set involves finding the optimal weights for the synapses. This is done by training the model, and one of the most used training methods is backpropagation. This training method was introduced by Rumelhart, Hinton and Williams (1986) and is based on the derivative of the output of the MLP model. The training method involves first choosing a loss function which is to be minimised, for instance the mean squared error

E(W) = N^{−1} Σ_{i=1}^{N} (f(W, x_i) − y_i)²

where N is the number of observations in the data set, f(W, x_i) is the output of the MLP with weights W and the independent variables of observation i, and y_i is the dependent variable of observation i. The weights are initialised by random sampling from a distribution such as the uniform distribution. With these initialised weights the output of the MLP and the value of the loss function are calculated. The gradient of the loss function with respect to the weights is then used to update the weights so that the loss function will decrease. This is often done in batches and can even be reiterated several times on the same data set until the MLP performs adequately.

The nonlinear activation functions enable the MLP to capture nonlinear relations between variables. Funahashi (1989) showed that the MLP is a universal function estimator, which means that it can approximate any continuous function. The MLP model can be seen as a nonlinear ARIMA(p, 0, 0) model when there are p input neurons with values x_{t−1}, ..., x_{t−p} and the targeted output is x_t. The q is zero, as this model does not hold memory for longer than p time steps, so that there is no moving average part.

A one step ahead forecast x̂_t(1) based on an MLP is the output of the MLP with input variables x_{t−1}, ..., x_{t−p}.
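A minimal sketch of such a one-hidden-layer forecasting MLP in Keras for R (the package used in Chapter 3); the lag order p = 4 and the 8 hidden neurons are illustrative choices, not the tuned architectures reported later, and y_series stands for the observed series:

```r
# Minimal one-hidden-layer forecasting MLP in Keras for R.
library(keras)

p <- 4
lagged <- embed(y_series, p + 1)   # row t: (x_t, x_{t-1}, ..., x_{t-p})
X <- lagged[, -1, drop = FALSE]    # inputs x_{t-1}, ..., x_{t-p}
y <- lagged[, 1]                   # target x_t

model <- keras_model_sequential() %>%
  layer_dense(units = 8, activation = "sigmoid", input_shape = c(p)) %>%
  layer_dense(units = 1, activation = "linear")

model %>% compile(optimizer = "rmsprop", loss = "mse")
model %>% fit(X, y, epochs = 1000, batch_size = 400, verbose = 0)

# One step ahead forecast from the p most recent observations
# (most recent lag first, matching the training layout).
x_new <- matrix(rev(tail(y_series, p)), nrow = 1)
predict(model, x_new)
```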

2.1.3 The MLP Recurrent Neural Network model

A type of MLP model with longer memory is the recurrent neural network (RNN) model. Like the MLP model, this model consists of an input layer, a single or multiple hidden layers and an output layer. Each layer also consists of neurons that are connected to the previous layer through synapses, as seen in Figure 2.1.

The difference between the MLP model described in the last section and the RNN model is that the outputs of the neurons in the hidden layer, or states, are retained until the next time data is fed to the network. These states are then fed as input to the hidden neurons the next time data is fed to the network. Figure 2.2 is included to aid with understanding how an RNN works.

Figure 2.2: A figure of a RNN.

In this figure the circle denotes all the hidden layers, U the weights with which the input data is multiplied, s_{t−i} the states of the hidden neurons, V the weights with which the output of the hidden neurons is multiplied and fed as input to the neurons in the output layer, o_{t−i} the output of the neurons in the output layer, and W the weights with which the states of the hidden neurons are multiplied and fed as input to the neurons in the hidden layer the next time data is fed to the model.

Every neuron outside of the hidden layer in the RNN model works in the same way as in the MLP. The input of a neuron in a hidden layer is the sum of the outputs of the connected neurons in the previous layer and the previous states of the neurons in the hidden layer, multiplied by the respective weights of their synapses. The output, or state, of this neuron is the same as in the MLP model, namely the activation function evaluated at the input of the neuron. These states are then stored and fed to the neurons in the hidden layer of the RNN again as input the next time the model is evaluated. Furthermore, the RNN model reduces to an MLP model if the weights of the synapses connecting the neurons in the hidden layers with their states are equal to zero.

Estimation of the RNN is done in the same way as estimating the MLP model. The model is trained through backpropagation after the initial weights are randomly sampled. However, the first time the RNN is evaluated the neurons in the hidden layer have no states yet. Thus, the initial states of the neurons in the hidden layers are randomly sampled.

The nonlinear activation functions make the RNN model able to capture nonlinear relations between variables. As the states of the neurons in the hidden layers are fed to the model the next time it is evaluated, the RNN has long memory and can therefore be seen as a nonlinear ARIMA(p, 0, q) model when the inputs are the past p observations x_{t−1}, ..., x_{t−p} and the targeted output is the value at time t, that is x_t.

A one step ahead forecast x̂_t(1) based on the trained RNN is obtained by evaluating the model with input variables x_t, ..., x_{t−p+1} and the previous states of the neurons in the hidden layer.
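A parallel sketch for a simple RNN in Keras for R, under the same illustrative assumptions (p = 4 lags, 8 recurrent units, y_series as the observed series):

```r
# Minimal simple-RNN forecaster in Keras for R.
library(keras)

p <- 4
lagged <- embed(y_series, p + 1)
X <- array(lagged[, (p + 1):2], dim = c(nrow(lagged), p, 1))  # oldest lag first
y <- lagged[, 1]

model <- keras_model_sequential() %>%
  layer_simple_rnn(units = 8, activation = "sigmoid",
                   input_shape = c(p, 1)) %>%
  layer_dense(units = 1, activation = "linear")

model %>% compile(optimizer = "rmsprop", loss = "mse")
model %>% fit(X, y, epochs = 500, batch_size = 400, verbose = 0)

# One step ahead forecast from the p most recent observations.
x_new <- array(tail(y_series, p), dim = c(1, p, 1))
predict(model, x_new)
```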

2.1.4 Hybrid methodology

The hybrid methodology as proposed by Zhang (2003) first considers a time series {y_t, t ∈ T} that is composed of a linear autocorrelation structure and a nonlinear component, that is y_t = L_t + N_t, where L_t denotes the linear component and N_t denotes the nonlinear component. If the ARIMA or ARFIMA model captures the linear component, then the residuals of this linear model should only contain the nonlinear component. If e_t denotes the residual at time t of the linear model, then e_t = y_t − L̂_t, where L̂_t is the forecast of the linear model for time t.

By modelling e_t using an MLP or RNN, the nonlinear component N_t is also estimated. If N̂_t is the forecast of the MLP or RNN at time t, then the forecast of the hybrid model is ŷ_t = L̂_t + N̂_t.

In summary, the hybrid methodology involves first estimating the linear component of the time series with a linear time series model. Afterwards an MLP or RNN is estimated on the residuals of this linear model to capture the nonlinear component of the time series. The forecast of the hybrid is then the two separate forecasts added together.
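A compact sketch of this two-stage procedure in R; fit_nn() and predict_nn() are hypothetical stand-ins for the Keras routines sketched in the previous sections:

```r
# Two-stage hybrid sketch: linear component via ARFIMA, nonlinear
# component via a neural network fitted on the ARFIMA residuals.
# fit_nn() and predict_nn() are hypothetical helper names.
library(forecast)

linear_fit <- arfima(y)                    # stage 1: linear component L_t
e <- residuals(linear_fit)                 # should contain only N_t

nn_fit <- fit_nn(e)                        # stage 2: model the residuals

L_hat <- forecast(linear_fit, h = 1)$mean  # linear one step ahead forecast
N_hat <- predict_nn(nn_fit, e)             # nonlinear one step ahead forecast

y_hat <- as.numeric(L_hat) + N_hat         # hybrid forecast
```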

2.2 Performance measures and test

2.2.1 Performance measures

Once forecasts are made, the accuracy of these forecasts can be calculated. There are many different methods to measure accuracy. Some of the most frequently used measures of accuracy in academic literature are

MAPE = (1/N) Σ_{i=1}^{N} |(ŷ_i − y_i)/y_i|
MAE = (1/N) Σ_{i=1}^{N} |ŷ_i − y_i|
MSE = (1/N) Σ_{i=1}^{N} (ŷ_i − y_i)²

where ŷ_i denotes the forecast for time i. The MAPE measures the mean absolute percentage error and is not defined when an observation has a value of zero. The MAE measures the mean absolute error and the MSE measures the mean squared error. Unfortunately, none of the measures is optimal in all cases, as Armstrong and Collopy (1992) concluded in their measurement comparison study. They found that the optimal accuracy measure is greatly dependent on the type of data that is used.

2.2.2 Diebold-Mariano test

Accuracy measures alone are not enough to conclude whether a certain model forecasts better than another model. The difference between the outcomes of the measures for the different models might be small and due to noise. To test whether differences between metrics are significant, Diebold and Mariano (2002) introduced the Diebold-Mariano (DM) test. The test procedure is as follows. Suppose the actual values are {y_t; t = 1, ..., T} and there are two forecasts {ŷ_{1,t}; t = 1, ..., T} and {ŷ_{2,t}; t = 1, ..., T}. The forecast errors are then e_{i,t} = ŷ_{i,t} − y_t for i = 1, 2. Furthermore, let g(e_{i,t}) denote a loss function which takes the value zero when no error is made, is nonnegative and increases in size as the errors become larger, for instance g(e_{i,t}) = e_{i,t}².

If d_t = g(e_{1,t}) − g(e_{2,t}), then a test for significant difference in accuracy corresponds to testing H_0 : E[d_t] = 0 against H_a : E[d_t] ≠ 0. Diebold and Mariano show that under the null hypothesis

d̄ / √(2π f_d(0)/T) → N(0, 1)

where d̄ is the sample mean of the series {d_t; t = 1, ..., T} and f_d(0) = (1/2π) Σ_{k=−∞}^{∞} γ_d(k) is the spectral density of the loss differential at frequency zero.

The Diebold-Mariano test statistic when the forecast horizon is one step ahead is then

DM = d̄ / √(2π f̂_d(0)/T)

where f̂_d(0) = (1/2π) γ̂_d(0) and γ̂_d(k) = (1/T) Σ_{t=|k|+1}^{T} (d_t − d̄)(d_{t−|k|} − d̄) is the sample autocovariance of the loss differential at lag k.

Under H_0 the DM test statistic is asymptotically N(0, 1) distributed, so that H_0 is rejected if the DM statistic falls outside of the interval [Φ^{−1}(α/2), Φ^{−1}(1 − α/2)], where Φ^{−1}(x) is the inverse of the cumulative distribution function of the standard normal distribution and α is the significance level.

The simulations done by Diebold and Mariano show that their test is conservative when applied to short horizon forecasts, meaning that the power of the DM test is relatively low when the forecast horizon is short. A remedy for this problem is given by Harvey, Leybourne and Newbold (1997), who corrected for possible bias and used the critical values of the Student's t-distribution instead of the standard normal distribution.
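The test is available in R through the forecast package, whose dm.test function implements the Harvey, Leybourne and Newbold (1997) modification discussed above; a minimal sketch, with f1 and f2 the two forecast series and y the actual values:

```r
# Diebold-Mariano test via the forecast package.
library(forecast)

e1 <- f1 - y   # forecast errors of model 1
e2 <- f2 - y   # forecast errors of model 2

# Two-sided test of equal accuracy, one step ahead, squared-error loss.
dm.test(e1, e2, alternative = "two.sided", h = 1, power = 2)
```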

2.3 Relevant literature

Neural networks have frequently been researched as time series models. Lee, White and Granger (1993) compare a neural network test for neglected nonlinearity with several other tests. The neural network test is based on the approximating ability of neural network modelling techniques. The tests are performed on several simulated nonlinear time series and empirical economic time series. After comparison of the tests they conclude that the relative performance of the neural network test is encouraging.

Zhang, Patuwo and Hu (2001) investigate the effect of input nodes, hidden nodes and sample size on MLP modelling and forecasting behaviour in a simulation study. In their simulation study they simulate data sets from different nonlinear time series data generating processes (DGPs) and estimate MLPs with different architectures as well as several Box-Jenkins models on these data sets. They find that both the number of neurons in the input layer and the number of neurons in the hidden layer have significant effects on the neural network model building and predictive ability. They also find that neural networks are able to identify the appropriate number of lagged observations and that simple and parsimonious networks are effective. Furthermore, they conclude that MLP models are more competent than Box-Jenkins models in forecasting nonlinear time series and that the sample size has limited effect on the performance of neural networks.

Ho, Xie and Goh (2002) investigate suitable time series models for repairable system failure analysis. They compare ARIMA, MLP and RNN models with each other. They find that both the ARIMA and the RNN models outperform the MLP model in terms of lower predictive errors and higher percentages of correct reversal detection based on simulation results on a set of compressor failures. They conclude that both the ARIMA and the RNN model show promise for predicting failures of repairable systems.

A hybrid ARIMA MLP model is introduced by Zhang (2003) and its forecasting performance based on real data sets is compared to the forecasting performance of an ARIMA model and an MLP model. The empirical data sets used are a data set containing the annual number of sunspots from 1700 to 1987, a data set containing the number of trapped Canadian lynx in the Mackenzie River district of Northern Canada, and the exchange rate between the British pound and the US dollar. Zhang concludes that his empirical results suggest that the proposed hybrid model is able to outperform both the ARIMA and the MLP model used in isolation. Furthermore, Zhang finds that by fitting the ARIMA model to the data first, the MLP model suffers less from overfitting.

A hybrid ARIMA RNN model is introduced by Aladag et al. (2009) and estimated on the same lynx data set used by Zhang (2003). Aladag, Egrioglu and Kadilar conclude that their hybrid ARIMA RNN model outperforms the hybrid ARIMA MLP model proposed by Zhang, based on the one step ahead out-of-sample forecast MSE.

Chaâbane (2014) proposes a hybrid ARFIMA MLP model to predict the price of electricity. The data set that is used contains the hourly spot prices from the NordPool electricity market between October 1, 2012 and November 28, 2012. The optimal hybrid ARFIMA MLP model is compared to an MLP model, an ARFIMA model and the hybrid ARIMA MLP model as proposed by Zhang (2003). The proposed hybrid ARFIMA MLP model outperforms the other models, as it results in the lowest one step ahead forecast RMSE, MAE and MAPE of all estimated models.

The theory behind the proposed model in this thesis is as follows. As all of the existing models are special cases of the proposed model, for instance the hybrid ARIMA MLP model proposed by Zhang (2003) is a special case of the proposed model with an integer integration order, the proposed model should be at least as good as all special cases of this model. However, the freedom this model provides may lead to more overfitting.


The Models

This chapter covers the process of constructing the simulated data sets, the setups of the models, the estimation procedures of the different models and the Diebold-Mariano test procedure. Section 3.1 discusses the different DGPs that are used for the simulations. The setups of the models are given in Section 3.2 and the procedures followed to estimate these models are described in Section 3.3. Finally, Section 3.4 covers the Diebold-Mariano testing procedure that is used.

3.1 Simulated data sets

Data is needed to investigate the performance of the proposed hybrid ARFIMA RNN model. Stylised data sets of which the underlying DGP is known are an advantage when investigating the performance of a model, as the models' forecasts can then be compared to the true model forecast. Therefore, simulated data sets are used in this research.

In this entire section the ε_t are i.i.d. N(0, 1). For every data set 300 observations are sampled using the DGP and the starting values are taken as y_0 = 0 and σ_0 = √(E[σ_t²]) = √5, if appropriate. To reduce the influence of these starting values, only the last 200 sampled observations are contained in the data set that is used for estimating and testing the models.

The true model forecast for each of the data sets is constructed as ŷ_t(1) = E[y_{t+1}|F_t] using the DGP, where F_t is a set containing y_t, y_{t−1}, .... Furthermore, all models with a GARCH component have the same true model forecasts as their non-GARCH variant if ε̂ is replaced by â. The GARCH variants are included to investigate the robustness of the different models to conditionally heteroskedastic errors.


component have the same true model forecasts as their non-GARCH variant if ˆε is replaced by ˆa. The GARCH variants are included to investigate the robustness of the different models regarding serial correlation.

3.1.1 ARIMA(1, 0, 1)

The DGP used to construct this data set is

y_t = 0.7 y_{t−1} + ε_t − 0.4 ε_{t−1}

and the true model forecast for this model is

ε̂_1 = y_1
ŷ_{t−1}(1) = 0.7 y_{t−1} − 0.4 ε̂_{t−1},  t ≥ 2
ε̂_t = y_t − ŷ_{t−1}(1),  t ≥ 2

The purpose of including this DGP is to investigate how the hybrid ARFIMA RNN model copes with a simple DGP compared to the other models.
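As an illustration, the DGP and its true-forecast recursion can be simulated in R as follows (a sketch assuming ε_0 = 0 for the first observation; the seed is arbitrary):

```r
# Simulation sketch for the ARIMA(1, 0, 1) DGP: draw 300 observations,
# keep the last 200, then run the true-forecast recursion.
set.seed(1)
n_total <- 300
n_keep  <- 200

eps <- rnorm(n_total)
y <- numeric(n_total)
y[1] <- eps[1]  # y_0 = 0 and eps_0 = 0 imply y_1 = eps_1
for (t in 2:n_total) {
  y[t] <- 0.7 * y[t - 1] + eps[t] - 0.4 * eps[t - 1]
}
y <- tail(y, n_keep)  # discard the burn-in

# True model forecast recursion from Section 3.1.1.
eps_hat <- numeric(n_keep)
y_hat   <- numeric(n_keep)
eps_hat[1] <- y[1]
for (t in 2:n_keep) {
  y_hat[t]   <- 0.7 * y[t - 1] - 0.4 * eps_hat[t - 1]
  eps_hat[t] <- y[t] - y_hat[t]
}
```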

3.1.2 ARIMA(1, 0, 1)-GARCH(1, 1)

The DGP used to construct this data set is

y_t = 0.7 y_{t−1} + a_t − 0.4 a_{t−1}
a_t = σ_t ε_t
σ_t² = 1 + 0.3 a_{t−1}² + 0.5 σ_{t−1}²

3.1.3 Bilinear (BL)

The DGP used to construct this data set is

y_t = 0.7 y_{t−1} ε_{t−2} + ε_t

and the true model forecast for this model is

ε̂_1 = y_1
ε̂_2 = y_2
ŷ_{t−1}(1) = 0.7 y_{t−1} ε̂_{t−2},  t ≥ 3
ε̂_t = y_t − ŷ_{t−1}(1),  t ≥ 3

The bilinear data set simulated by this DGP has white noise properties, which means that the data is not linearly predictable. The purpose of including this DGP is to investigate how the proposed model copes with linear unpredictability.

3.1.4 Nonlinear autoregressive (NAR)

The DGP used to construct this data set is

y_t = 0.7|y_{t−1}| / (|y_{t−1}| + 2) + ε_t

and the true model forecast for this data set is

ŷ_{t−1}(1) = 0.7|y_{t−1}| / (|y_{t−1}| + 2),  t ≥ 2

The time series sampled from this DGP has nonlinear autoregressive properties and is used to investigate whether the proposed model is capable of estimating this nonlinear autoregressive component.

3.1.5 NAR-GARCH(1, 1)

The DGP used to construct this data set is

y_t = 0.7|y_{t−1}| / (|y_{t−1}| + 2) + a_t
a_t = σ_t ε_t
σ_t² = 1 + 0.3 a_{t−1}² + 0.5 σ_{t−1}²

3.1.6 Nonlinear moving average (NMA)

The DGP used to simulate this data set is

y_t = ε_t − 0.4 ε_{t−1} + 0.3 ε_{t−1} ε_{t−2}

Construction of the true model forecast for this DGP is less straightforward. First, the ε are estimated in the same way as done by Robinson (1977), that is

ε̂_1 = y_1
ε̂_2 = y_2 + 0.4 ε̂_1
ε̂_t = y_t + 0.4 ε̂_{t−1} − 0.3 ε̂_{t−1} ε̂_{t−2},  t ≥ 3

Afterwards, the true model forecast is constructed as

ŷ_t(1) = −0.4 ε̂_t + 0.3 ε̂_t ε̂_{t−1}

This DGP simulates a time series that has nonlinear moving average properties, and the purpose of including this data set is to investigate the performance of the proposed model if the data set has these properties.

3.2 Model setup

The ARFIMA and ARIMA model setups are the same as described in Section 2.1.1 as the setups of these models cannot be altered.

The neural network type models, on the other hand, can have varying model setups. The setup for these models that is used in this research contains three layers of neurons: an input layer, a hidden layer and an output layer. Only one hidden layer is used, as adding more layers significantly increases the computation time. The input and hidden layers contain varying numbers of neurons, and the procedure to obtain the optimal numbers of neurons for these layers is described in Section 3.3. The output layer only has one neuron, as x̂_t(1), the one step ahead forecast at time t, is a scalar.

The activation functions used in the neurons of the neural network type models are the sigmoid activation function, f(x) = (1 + exp(−x))^{−1}, for the neurons in the hidden layer and the linear activation function, f(x) = x, for the neuron in the output layer. Furthermore, the loss function used in the training procedure of the neural network type models is the in-sample MSE.

Initialising the weights on the synapses is done by sampling from a unif(−0.05, 0.05) distribution. The initial states of the RNN type models are also sampled from this uniform distribution.

Finally, the optimiser used to train the model is the rmsprop algorithm. This is a variant of stochastic gradient descent. The difference between normal stochastic gradient descent and rmsprop is that the learning rate for a weight of the model is divided by a running average of the magnitudes of recent gradients for the same weight.
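In symbols, the standard rmsprop update (the generic form, not spelled out in the thesis) keeps a running average v_t of squared gradients,

v_t = ρ v_{t−1} + (1 − ρ) g_t²,  w_{t+1} = w_t − η g_t / (√v_t + ε)

where g_t is the gradient of the loss with respect to weight w at step t, η is the learning rate, ρ the decay rate and ε a small constant for numerical stability.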


3.3 Estimation

As the purpose of this research is to compare the one step ahead out-of-sample forecast of the proposed hybrid model to the forecasts of other models, the data sets are split into two parts, namely an estimation set and a test set. The estimation set is used to estimate the different models and the test set is used for out-of-sample forecasting. For the simulations the estimation set contains the first 160 observations and the test set contains the last 40 observations.

To estimate the optimal ARFIMA model the procedure described in Section 2.1.1 is followed. First, the d is obtained by maximum likelihood estimation of an ARFIMA(2, d, 0) model and used to fractionally difference the data. The orders p and q are obtained by finding the model with the lowest AIC on the fractionally differenced data. The optimal model is the model estimated by maximum likelihood estimation of the ARFIMA(p, d, q) model on the original data. This optimal model is subsequently used to forecast the test set. This entire procedure is performed by the function arfima in the forecast package in R.

Finding the optimal MLP or RNN model takes more effort, as the optimal architecture, i.e. the number of neurons in the input layer and the hidden layer, also needs to be found. To do so the estimation set is split in two sets, namely a training set and a validation set. For the simulations the training set contains the first 120 observations and the validation set contains the 121st to 160th observations. The different architectures range from 1 to 5 neurons in the input layer and 1 to 10 neurons in the hidden layer, so that there are 50 different architectures. Finding the optimal architecture is then done by training 10 models of the same architecture on the training set for each architecture, so that 500 different models are estimated; the 10 models per architecture differ only in the initial weights that are sampled. Afterwards, the trained models are used to forecast the validation set, and for every model the RMSE of the forecast of the validation set is calculated. The optimal architecture is the architecture with the lowest mean RMSE over all 10 models with that architecture. The model with the lowest RMSE of all the models with the best architecture is chosen as the optimal MLP or RNN model, and its training is then resumed on the validation set to complete the estimation; a sketch of this grid search is given below. All training for the MLP models is done with a batch size of 400 observations and 1000 epochs, so that the estimation set is used 1000 times for the training of the model. For the RNN the number of epochs is set to 500. Construction and training of the different neural network type models is done using the Keras package in R.
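The grid search itself is straightforward; build_and_train() is a hypothetical wrapper around the Keras code sketched in Chapter 2 that returns a trained model's validation RMSE:

```r
# Architecture grid search sketch: 1-5 input neurons, 1-10 hidden
# neurons, 10 random initialisations per architecture (500 fits).
grid <- expand.grid(p = 1:5, h = 1:10)
grid$mean_rmse <- NA

for (i in seq_len(nrow(grid))) {
  rmses <- replicate(10, build_and_train(train, validation,
                                         p = grid$p[i],
                                         hidden = grid$h[i]))
  grid$mean_rmse[i] <- mean(rmses)
}

best <- grid[which.min(grid$mean_rmse), ]  # optimal architecture
```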

The optimal hybrid models are estimated as described in Section 2.1.4. The optimal ARFIMA model is used to forecast the entire data set and the residuals are kept. These residuals are then used as data set to find the optimal MLP or RNN model as described. Afterwards, the forecasts of the ARFIMA and MLP or RNN models are added to obtain the forecasts of the different hybrid models.

All optimal models found are used to forecast the test set and their forecasts are compared. For the simulated data sets the RMSE, MAPE and MAE of the forecasts constructed by the different models are calculated taking the true model forecast as y_i in the formulas in Section 2.2.1, so that the accuracy with respect to the true model forecast is calculated. For the empirical data sets the observations are used as y_i in these formulas.

3.4 Diebold-Mariano testing procedure

The DM test, as described in Section 2.2.2, is performed on the resulting forecasts of the different models. As the true model forecast is known for the simulated data sets the DM test procedures for the simulated and empirical data sets differ slightly.

For the simulated data sets the true model forecast, denoted by ỹ_t, is known. Therefore, the DM test can be altered slightly so that it can be used to investigate whether the forecasts of the proposed hybrid model are closer to the true model forecasts than those of other models. Let d_t = |ŷ_{t−1}^{1}(1) − ỹ_t| − |ŷ_{t−1}^{i}(1) − ỹ_t|, where ŷ_{t−1}^{1}(1) denotes the one step ahead forecast of the proposed hybrid model and ŷ_{t−1}^{i}(1) denotes the one step ahead forecast of an arbitrary other model. The DM test is then used to test H_0 : E[d_t] = 0 against H_a : E[d_t] < 0.

As the true DGP is unknown for the empirical data sets, the true model forecasts are unknown as well. Therefore, the data itself is used in the DM testing procedure. Let y_t denote the observation at time step t and let d_t = |ŷ_{t−1}^{1}(1) − y_t| − |ŷ_{t−1}^{i}(1) − y_t|. The DM test is then used to test whether the forecast error of the proposed hybrid is lower than that of other models, that is, H_0 : E[d_t] = 0 is tested against H_a : E[d_t] < 0.
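In R, this one-sided comparison can be run with dm.test using absolute-error loss (power = 1) to match the loss differential above; e_hybrid and e_other are hypothetical names for the two error series:

```r
# One-sided DM test matching the absolute-loss differential above;
# alternative = "less" states that the first method (the proposed
# hybrid) is more accurate than the second.
library(forecast)
dm.test(e_hybrid, e_other, alternative = "less", h = 1, power = 1)
```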


Data

The empirical data sets are described in this chapter. Section 4.1 covers the Lynx data set, in Section 4.2 the S&P500 data set is described and Section 4.3 covers the electricity data set.

4.1 Lynx data set

One of the most used data sets in academic research in the field of time series forecasting is the Lynx data set. This data set contains approximations of annual numbers of Canadian lynx trapped in the Mackenzie River district of North-West Canada for the period 1821 to 1934. These approximations are based on the annual numbers of fur sales. The lynx data set is one of many data sets included in the MASS package for R.

A histogram and plot of the lynx data set can be found in Figure 4.1. To counter the skewness of the data it is transformed using the base 10 logarithm, which is also done by Zhang (2003). Figure 4.1 also contains a plot and histogram of the transformed data. The plots of the data set show seasonality with a period of 10 years that should be taken into account during the estimation of the models. Some descriptive statistics can be found in Table 4.1 at the end of this chapter.

This data set is split into three subsets, one for training, another for validation and the last for testing. The training set contains the first 80 observations, the validation set contains the 81st to 100th observations and the test set contains the last 14 observations.


Figure 4.1: Plot and histogram of the lynx data set: first row, the original data; second row, the base 10 logarithmic transformation.

4.2 S&P500 data set

Another frequently researched time series is the S&P500 index price. The data set used in this research contains the weekly log returns times 100 of this index from the 11th of August 2014 until the 11th of June 2018 and can be downloaded from finance.yahoo.com. If y_t is the closing price at time t, where t ∈ N denotes the week, the weekly log returns are defined as ỹ_t = log(y_t) − log(y_{t−1}). The advantage of using log returns is that the resulting series is approximately stationary. A plot and histogram of the data set can be found in Figure 4.2 and some descriptive statistics can be found in Table 4.1. No further transformations are necessary for this data set.

For this data set the training set contains the first 120 observations, the validation set contains the 121st to 160th observations and the test set contains the last 40 observations.

Figure 4.2: Plot and histogram of the weekly log returns of the S&P500 index times 100.

4.3 Electricity data set

Finally, a data set containing hourly spot prices of electricity in euro per MWh is also included. This data set contains observations from the 1st of January 2018 until the 28th of February 2018, 1416 observations in total, and can be downloaded from www.nordpoolgroup.com.

A plot of the time series and its ACF can be found in Figure 4.3. The ACF indicates seasonality with periods of 12 and 24 hours. Analogous to the research done by Chaâbane (2014), who used the same type of data but different dates, the seasonality with a period of 24 hours is removed. This is done by, for each observation, subtracting the median of all observations of the same hour on other days. The deseasonalised time series, {ỹ_t, t ∈ 1, ..., 1416}, is thus obtained by ỹ_t = y_t − s(t), where s(t) = s(t + 24n), n ∈ N, denotes the seasonal component. Figure 4.3 also includes a plot of the deseasonalised time series and its ACF. Descriptive statistics for this time series can be found alongside those of the other empirical time series in Table 4.1.
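A sketch of this deseasonalisation in R, assuming y is the hourly price vector; note that it uses the median over all days per hour, a close stand-in for the thesis's "other days" variant:

```r
# Deseasonalisation sketch: remove the 24-hour pattern by subtracting,
# per hour of day, the median over all days.
hour_of_day <- (seq_along(y) - 1) %% 24       # 0, 1, ..., 23, 0, ...
s <- tapply(y, hour_of_day, median)           # hourly medians s(t)
y_tilde <- y - s[as.character(hour_of_day)]   # deseasonalised series
```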

The ACF of the deseasonalised time series shows slower than exponential decay. This indicates long memory, and therefore the ARFIMA model is taken as the optimal linear model. Furthermore, as this data set contains a large number of observations, which greatly increases the computation time of the RNN models, and as Chaâbane (2014) concluded that the hybrid ARFIMA MLP model outperforms the ARFIMA and MLP models used in isolation, only the ARFIMA model, the hybrid ARFIMA MLP model and the proposed model are estimated. Moreover, the optimal architecture found by Chaâbane is used for the hybrid ARFIMA MLP model, which is a four-layered MLP with 20 neurons in the first hidden layer and 1 neuron in the second hidden layer. The optimal number of input neurons is found using the standard procedure described in Section 3.3.

The RNN architectures considered for the proposed model are also different for this data set. In the estimation procedure RNNs with 3 or 4 layers are considered. Both of these RNNs have 1 to 6 neurons in the input layer and 10, 15 or 20 neurons in the first hidden layer. The second hidden layer of the 4-layered RNN contains 1 to 3 neurons, so that a total of 72 different architectures are considered.

For this data set the first 658 observations make up the training set. The validation set contains the 659th to 1316th observations and the test set contains the last 100 observations.

                          lynx       S&P500     Electricity
Number of observations    114        200        1416
Mean                      2.903664   0.182689   2.46166
Standard deviation        0.558409   1.784518   7.971328
Skewness                  -0.357238  -1.058924  1.721087
Kurtosis                  -0.773168  3.763442   4.169957
Minimum                   1.591065   -8.593784  -13.89
Maximum                   3.844539   4.786851   50.69

Table 4.1: Descriptive statistics of the transformed lynx time series, the S&P500 weekly log returns times 100 and the deseasonalised electricity time series.


Figure 4.3: Plots of the electricity data set. The first row contains the plots for the original time series and the second row contains the plots of the time series after removing the seasonality.


Results

The results obtained by performing the estimations described in Section 3.3 and forecasting using the resulting optimal models are covered in this chapter. The results of the simulations are covered in Section 5.1. Section 5.2 covers the results of the Lynx data set and the results of the S&P500 data set are covered in Section 5.3. Finally, the results of the electricity data set are given in Section 5.4.

5.1 Simulated data sets

To reduce unnecessary clutter and confusion, the figures containing the forecasts of the models for the different simulated data sets are given in the appendix.


5.1.1 ARIMA(1, 0, 1)

Model        Structure      RMSE    MAPE    MAE     DM statistic
ARFIMA       (0,0.2259,1)   0.6753  0.7750  0.5535  -2.3352
MLP          (4,8)          0.7332  1.8726  0.5861  -0.4952
RNN          (1,8)          0.6232  1.2229  0.4918  -3.4821
ARFIMA-MLP   (4,1)          0.7731  1.4907  0.6356  2.4538**
ARFIMA-RNN   (4,4)          0.7562  1.3334  0.6185  -

Table 5.1: Results for the data set simulated using the ARIMA(1, 0, 1) DGP; * and ** denote significance at α = 0.10 and α = 0.05, respectively.

The results for this data set are given in Table 5.1. The Structure column of this table indicates the optimal (p, d, q) orders of the ARFIMA model and the optimal (p, q) of the neural network type models, where p is the number of neurons in the input layer and q is the number of neurons in the hidden layer. This table shows that the proposed model has the second highest RMSE, the third lowest MAPE and the second highest MAE. The only model that is significantly less accurate than the proposed model is the hybrid ARFIMA MLP model, as the p-value of its DM statistic is less than 0.05.

5.1.2 ARIMA(1, 0, 1)-GARCH(1, 1)

Model        Structure      RMSE    MAPE    MAE     DM statistic
ARFIMA       (0,0.1745,0)   1.7234  1.1439  1.3913  -0.3216
MLP          (3,9)          1.3590  2.7209  1.1079  -3.0436
RNN          (1,9)          1.7318  4.7809  1.4247  -0.2092
ARFIMA-MLP   (4,2)          1.8337  4.0402  1.4501  1.2991
ARFIMA-RNN   (1,1)          1.7419  2.8296  1.4129  -

Table 5.2: Results for the data set simulated using the ARIMA(1, 0, 1)-GARCH(1, 1) DGP.

Table 5.2 contains the results for the data set simulated using the ARIMA(1, 0, 1)-GARCH(1, 1) DGP. For this data set the proposed model has the second highest RMSE, the third lowest MAPE and the third lowest MAE. However, no model forecasts are significantly less accurate than those of the proposed model according to the DM test.

5.1.3 Bilinear

Model        Structure      RMSE    MAPE    MAE     DM statistic
ARFIMA       (0,0.0000,2)   1.1761  3.2243  0.9067  0.1635
MLP          (3,4)          1.0984  2.9128  0.8547  0.5773
RNN          (3,2)          1.0999  2.6519  0.8675  0.7167
ARFIMA-MLP   (1,10)         1.1810  2.0405  0.9133  0.1665
ARFIMA-RNN   (1,10)         1.1090  2.1667  0.8602  -

Table 5.3: Results for the data set simulated using the BL DGP.

The results for the bilinear data set can be found in Table 5.3. For this data set the proposed model has the third lowest RMSE and the second lowest MAPE and MAE. The DM test does not reject the null hypothesis when the forecasts of the proposed model are compared to those of the other models, indicating that the proposed model is not significantly more accurate than any of the other models.

5.1.4 Nonlinear autoregressive

Model        Structure      RMSE    MAPE    MAE     DM statistic
ARFIMA       (0,0.0566,0)   0.1160  1.8640  0.0896  -1.6071
MLP          (1,10)         0.1557  1.7636  0.0907  0.3188
RNN          (1,8)          0.1460  1.2979  0.0814  -1.4307
ARFIMA-MLP   (1,9)          0.1801  1.9160  0.1162  1.6079*
ARFIMA-RNN   (1,4)          0.1527  1.0562  0.0902  -

Table 5.4: Results for the data set simulated using the NAR DGP.

Table 5.4 contains the results for the data set simulated using the nonlinear autoregressive DGP. This table shows that the proposed model has the third lowest RMSE, the lowest MAPE and the third lowest MAE for this data set. Furthermore, the DM statistic implies that the proposed model outperforms the ARFIMA-MLP model if a significance level of 0.10 is used.

5.1.5 Nonlinear autoregressive-GARCH(1, 1)

Model        Structure      RMSE    MAPE    MAE     DM statistic
ARFIMA       (0,0.0000,0)   0.1877  0.7672  0.1505  -5.1562
MLP          (2,10)         0.6630  4.1680  0.5533  2.9879**
RNN          (2,8)          0.5554  3.3088  0.5012  3.3136**
ARFIMA-MLP   (2,10)         0.6387  3.9400  0.5678  4.3649**
ARFIMA-RNN   (4,2)          0.4213  2.4113  0.3564  -

Table 5.5: Results for the data set simulated using the NAR-GARCH(1, 1) DGP.

Table 5.5 contains the results for the NAR-GARCH(1, 1) simulated data set. The proposed model has the second lowest RMSE, MAPE and MAE. The DM statistics imply that the proposed model performs significantly better than the MLP, RNN and ARFIMA-MLP models for this data set. Of all the neural network type models the proposed model seems to suffer the least from the GARCH component in this data set.

5.1.6 Nonlinear moving average

Model        Structure      RMSE    MAPE    MAE     DM statistic
ARFIMA       (0,0.0000,1)   0.2862  3.7639  0.2081  0.8901
MLP          (1,10)         0.2769  7.2463  0.2110  0.1677
RNN          (1,10)         0.2719  4.9642  0.2153  0.1354
ARFIMA-MLP   (2,1)          0.2640  4.0503  0.2061  -0.5529
ARFIMA-RNN   (3,2)          0.2697  4.9559  0.2109  -

Table 5.6: Results for the data set simulated using the NMA DGP.

The results for the data set simulated using the NMA DGP can be found in Table 5.6. This table shows that the proposed model has the second lowest RMSE, the third lowest MAPE and the third lowest MAE. The DM test does not reject the null hypothesis of equal accuracy for any of the models compared with the proposed model.

5.2 Lynx results

Figure 5.1: ACF and PACF of the deseasonalised lynx data.

As academic research of this data set often includes an ARIMA model as a potential optimal model, such a model is included in this research as well. In Section 4.1 it is shown that this data set has a seasonal component with a period of 10 years and is stationary, so that d = 0. Removing this seasonality by estimating a SARIMA(0, 0, 0)−(0, 1, 1)_10 model and taking the residuals as the data set for further estimation results in the ACF and PACF shown in Figure 5.1. The PACF and ACF indicate that the SARIMA(2, 0, 0)−(0, 1, 1)_10 model might be a good model.

After estimation of this model the Ljung-Box statistic of its residuals is 8.3035, which corresponds to a p-value of 0.3066. Therefore, there is no statistical ground to assume that the model is not correctly specified, and this SARIMA model is assumed to be the optimal ARIMA type model.
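A sketch of this fit in R with the forecast package; the Ljung-Box lag and degrees-of-freedom correction are illustrative guesses, as the thesis does not report them:

```r
# Sketch of the lynx SARIMA fit described above.
library(forecast)

y <- ts(log10(lynx), frequency = 10)  # lynx ships with base R's datasets
fit <- Arima(y, order = c(2, 0, 0), seasonal = c(0, 1, 1))

# Ljung-Box diagnostic on the residuals; lag and fitdf are illustrative.
Box.test(residuals(fit), lag = 10, type = "Ljung-Box", fitdf = 3)
```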

The ARFIMA model takes the seasonality into account in the same way as the SARIMA model does. However, this seasonality also has to be accounted for in the neural network type models. This is done by always using the 10th lag as an input variable so that one of the neurons in the input layer is allocated to this lag. It is also elected to let the number of hidden neurons vary between 1 and 12, instead of 10, in the estimation procedure of the neural networks for this data set as the models using 10 hidden neurons do not suffer from overfitting.


The accuracy of the forecasts of the different models can be found in Table 5.7. The proposed model is not able to fit the data as well as the other models, as it has the second highest RMSE and the highest MAPE and MAE, and no model has significantly lower accuracy according to the DM test. The reason for the poor performance of this model might be that the SARIMA model already forecasts reasonably well, so the inner relations of the data might be less complex. Furthermore, the ARFIMA model does not perform well, and as it is part of the proposed model, the performance of the proposed model suffers. Figure 5.2 contains a plot of the forecasts of the different estimated models.

The optimal estimated model for this data set seems to be the MLP model. However, the hybrid model proposed by Zhang (2003) is not compared in this research and has been shown to outperform the standard MLP model for this data set.

Model        Structure                RMSE    MAPE    MAE     DM statistic
SARIMA       (2, 0, 0)−(0, 1, 1)_10   0.1373  0.0331  0.0977  -1.0677
ARFIMA       (3,0.3046,2)             0.1807  0.0491  0.1446  0.0081
MLP          (2,12)                   0.1266  0.0391  0.0914  -1.8914
RNN          (2,9)                    0.1381  0.0366  0.1049  -2.4061
ARFIMA-MLP   (4,4)                    0.1675  0.0448  0.1310  -0.8619
ARFIMA-RNN   (3,4)                    0.1805  0.0506  0.1478  -

Table 5.7: Results for the lynx data set.


Figure 5.2: A plot of the forecasts of the lynx data set.

5.3 S&P500 results

Figure 5.3: ACF and PACF of the S&P500 data set.

The standard ARIMA model is often used as a benchmark model and is therefore included in the results for this data set as well. The plot of the data in Figure 4.2 indicated stationarity, which is also shown by the autocorrelation functions in Figure 5.3, so that d = 0. The latter figure also indicates that both p and q are positive, as there are no clear cut-off points. After step-by-step addition of 1 to either order of the ARIMA model until the ACF and PACF of the residuals show no significant correlations anymore, the optimal ARIMA model is found to have orders (2, 0, 2). The Ljung-Box statistic of the residuals of this optimal model is 9.0237, so that the p-value is 0.1723. Therefore, there is no statistical ground to assume that the model is not correctly specified, and this model is assumed to be the optimal ARIMA model.

The resulting performance measures are summarised in Table 5.8 and Figure 5.4 contains a plot of the forecasts of the different models. The proposed model fits this data set relatively well, as it has the third lowest RMSE, the highest MAPE and the lowest MAE. Furthermore, the DM test concludes that the proposed model significantly outperforms the ARIMA, ARFIMA and ARFIMA-MLP models. The linear models do not perform well, which might indicate that the relations between time points in the data are complex, and these can be captured by the proposed model. The best model for this data set seems to be the RNN model.

Model        Structure      RMSE    MAPE    MAE     DM statistic
ARIMA        (2, 0, 2)      2.3018  2.7262  1.5010  1.9801**
ARFIMA       (2,0.0000,2)   2.2825  3.3556  1.4365  2.4182**
MLP          (1,5)          2.1743  2.4267  1.4177  -0.3241
RNN          (4,7)          2.1751  2.1588  1.4260  -0.3105
ARFIMA-MLP   (1,1)          2.2644  3.6336  1.4023  1.8949**
ARFIMA-RNN   (3,10)         2.2188  3.7031  1.3672  -

Table 5.8: Results for the S&P500 data set.


5.4 Electricity results

The results for this data set are given in Table 5.9 and Figure 5.5 contains a plot of the forecasts of the different models. According to the MAPE and MAE measures the proposed model outperforms the other models. Furthermore, the proposed model is significantly more accurate than the ARFIMA model according to the DM test. However, the hybrid ARFIMA MLP model has a lower RMSE than the proposed model. Nonetheless, the proposed model seems to be the optimal model for this data set.

Model        Structure     RMSE    MAPE    MAE     DM statistic
ARFIMA       (1,0.3993,2)  5.2496  0.0489  2.6875  1.3931*
ARFIMA-MLP   (5,20,1)      4.9680  0.0487  2.6176  -0.9212
ARFIMA-RNN   (4,10,1)      5.0316  0.0486  2.5699  -

Table 5.9: Results for the electricity hourly spot prices data set.


Conclusion

The aim of this research was to introduce a new hybrid model of an ARFIMA and an RNN model, which theoretically should be able to outperform all existing time series models, and to compare its forecasting ability to those of the existing models. The procedure followed to do so involved first simulating several data sets with different complex relations between time steps, which allowed investigation of the ability of the proposed model to capture certain relations. Subsequently, the optimal ARFIMA, MLP and RNN models were estimated. Then, the hybrid models were estimated by using the residuals of the optimal ARFIMA model. The one step ahead out-of-sample forecasts were calculated and compared using the RMSE, MAPE and MAE performance measures. Finally, the Diebold-Mariano test was performed to check for significantly higher accuracy of the proposed model compared with the existing models. To investigate how the model copes with real data, the lynx, S&P500 and electricity data sets were researched as well.

On the ARIMA(1, 0, 1) data set, the proposed model was only able to significantly outperform the ARFIMA-MLP model and was outperformed by the other models according to the accuracy measures. The proposed model also did not benefit much from the addition of the GARCH component to the ARIMA(1, 0, 1) data set and performed on a par with the other models. The results were the same for the bilinear data set, where the proposed model, again, performed on a par with the other models.

However, on the nonlinear autoregressive data set, the proposed model was able to outperform the ARFIMA-MLP model again and the rest of the models performed roughly the same. This time around, the proposed model did benefit from the addition of a GARCH component to this data set. After this addition the proposed model significantly outperformed the MLP, RNN and ARFIMA-MLP models. Excluding the ARFIMA model, which was only a constant, the proposed model had relatively low performance measures all-round.

On the nonlinear moving average data set the model performed on a par with the other models again.

As the empirical data sets are real data, the relations between observations are assumed to be very complex, which should benefit the proposed model. However, this did not seem to be the case for the lynx data set. The SARIMA model performed relatively well, which indicates that these relations are less complex than was thought. Therefore, the proposed model did not perform well at all using any metric, and the best model for this data set was found to be the MLP model.

The S&P500 data set did seem to contain complex relations between observations, as the linear models did not perform well. The proposed model was even able to significantly outperform the ARIMA, ARFIMA and hybrid ARFIMA MLP models. Furthermore, it performed comparably to the MLP and RNN models in terms of RMSE and MAE.

Finally, the results for the electricity data set, which contains complex relations between observations and has long memory, showed that this is where the strength of the proposed model lies. The proposed model significantly outperformed the ARFIMA model. Furthermore, its forecasts were more accurate according to two out of the three accuracy measures. The proposed model only had a slightly higher RMSE than the hybrid ARFIMA MLP model.

In conclusion, the proposed model does not seem to be much of an improvement if the data set being forecast contains simple relations between observations. However, its strength lies in forecasting complex data sets with long memory. This research has shown that it performs on a par with the MLP and RNN models if the data set contains complex relations. Furthermore, the proposed model seems to outperform the existing models if the data set also has long memory. Therefore, the proposed model is a promising model for forecasting data sets with both properties.


Further research can be done by altering the estimation procedure and model setup, for instance by using two hidden layers or by letting the model have more than 10 neurons in the hidden layer, so that the model becomes more complex. Furthermore, the conventional Diebold-Mariano testing procedure can be changed, as this test has been shown to be conservative when the forecast horizon is short. This can be done by using the modifications for this test proposed by Harvey et al. (1997).


Forecasts

This chapter contains plots of the forecasts for the different simulated data sets.

Figure A.1: Forecasts for the ARIMA(1, 0, 1) data set.

Figure A.2: Forecasts for the ARIMA(1, 0, 1)-GARCH(1, 1) data set.


Figure A.3: Forecasts for the bilinear data set.

Figure A.4: Forecasts for the nonlinear autoregressive data set.


References

Aladag, C. H., Egrioglu, E., & Kadilar, C. (2009). Forecasting nonlinear time series with a hybrid methodology. Applied Mathematics Letters, 22 (9), 1467–1470.

Armstrong, J. S. & Collopy, F. (1992). Error measures for generalizing about forecasting methods: Empirical comparisons. International Journal of Forecasting, 8(1), 69–80.

Box, G. E. P. & Jenkins, G. (1970). Time Series Analysis: Forecasting and Control. Holden-Day.

Chaâbane, N. (2014). A hybrid ARFIMA and neural network model for electricity price prediction. International Journal of Electrical Power & Energy Systems, 55, 187–194.

Diebold, F. X. & Mariano, R. S. (2002). Comparing predictive accuracy. Journal of Business & Economic Statistics, 20(1), 134–144.

Funahashi, K. I. (1989). On the approximate realization of continuous mappings by neural networks. Neural Networks, 2(3), 183–192.

Granger, C. W. J. & Joyeux, R. (1980). An introduction to long-memory time series models and fractional differencing. Journal of Time Series Analysis, 1(1), 15–29.

Harvey, D., Leybourne, S., & Newbold, P. (1997). Testing the equality of prediction mean squared errors. International Journal of Forecasting, 13(2), 281–291.

Haslett, J. & Raftery, A. E. (1989). Space-time modelling with long-memory dependence: Assessing Ireland's wind power resource. Applied Statistics, 38(1), 1–50.

Ho, S. L., Xie, M., & Goh, T. N. (2002). A comparative study of neural network and Box-Jenkins ARIMA modeling in time series prediction. Computers & Industrial Engineering, 42(2-4), 371–375.

Hyndman, R. J., Khandakar, Y., et al. (2007). Automatic time series for forecasting: the forecast package for R. Number 6/07. Monash University, Department of Econometrics and Business Statistics.

Lee, T. H., White, H., & Granger, C. W. J. (1993). Testing for neglected nonlinearity in time series models: A comparison of neural network methods and alternative tests. Journal of Econometrics, 56 (3), 269–290.

Robinson, P. M. (1977). The estimation of a nonlinear moving average model. Stochastic Processes and their Applications, 5(1), 81–90.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536.

Zhang, G. P. (2003). Time series forecasting using a hybrid ARIMA and neural network model. Neurocomputing, 50, 159–175.

Zhang, G. P., Patuwo, B. E., & Hu, M. Y. (2001). A simulation study of artificial neural networks for nonlinear time-series forecasting. Computers & Operations Research, 28(4), 381–396.
