A multi-step forecasting comparison between
ARIMA and LSTM on financial time series
Submitted in partial fulfillment for the degree of
Master of Science

Amir Alnomani
10437797

Master Information Studies
Data Science
Faculty of Science
University of Amsterdam

Date of defence: 2018-07-05
Internal Supervisor: Dr Maarten Marx (UvA, FNWI, IvI)
External Supervisor: Juan Carlos Romero (Friesland Campina)
Contents
1 Introduction
2 Methodology
  2.1 ARIMA
  2.2 Recurrent Neural Networks
  2.3 Long Short-Term Memory
  2.4 Evaluation
3 Experimental Setup
  3.1 Data
  3.2 Results
  3.3 Discussion
References
4 Appendix
Abstract

This study addresses the question whether LSTM-based forecasting of time series can outperform ARIMA-based forecasting. While previous research explored single-step forecasts for this particular comparison, the experiments conducted here investigate whether those results generalize to multi-step forecasting. The root mean squared errors reported for the respective models indicate that the specific LSTM approach utilized in this paper does not outperform the ARIMA model on any of the data sets. The data sets used are the Dutch national butter price, the WTI crude oil price, the S&P 500 index and the Nikkei 225 index.
1 INTRODUCTION
Price forecasting is becoming increasingly relevant to producers within various markets. Such forecasts can be of great benefit for developing strategies and negotiating positions ahead of time. Commodity market information is sequential and partially observable, so historical prices are important for predicting future prices and can potentially be combined with more static data obtained from market snapshots, such as data about changes in certain regulations or laws that affect the prices. The existing time series forecasting methods include a variety of both linear and non-linear algorithms. The autoregressive integrated moving average (ARIMA) models and their variations such as AR, MA and ARMA, which fall in the linear model class, have been extensively researched for this purpose[5]. There have been many successful applications of this class of models on univariate financial time series, such as electricity prices in Sweden[10], tomato prices in Serbia[6] and household food retail prices[9].
A non-linear approach to time series forecasting has been to look at the effectiveness of neural networks. Earlier work compared the use of feed-forward ANN models with linear models. However, these comparisons were inconclusive given contradictory results, with some claiming that linear models produce more accurate predictions, while others favor ANNs for this task[5]. More recent studies explore the use of Recurrent Neural Networks, more specifically the Long Short-Term Memory (LSTM) variants. In contrast to traditional neural networks, which are based on the assumption that the input data are independent of each other, RNNs are able to capture sequential information by carrying results from previous computations, or states, into the next states, often referred to as memory units. These networks can be trained on variable-sized input and are able to produce variable-sized output, which makes them suitable for capturing temporal data and thus for the task of forecasting.
LSTM networks are variations of the vanilla RNN that employ a gating system to solve the vanishing and exploding gradient problems present in the vanilla versions. Within the domain of forecasting, LSTMs have been utilized for house price predictions[3] and in various studies on stock price predictions[7][2].
In the paper Forecasting Economics and Financial Time Series: ARIMA vs. LSTM[8], a comparison is made between ARIMA and LSTM models in terms of performance when forecasting financial and economic time series. Their results seem to suggest that LSTM-based algorithms produce error rates that are approximately 85% lower than those of ARIMA. However, this is based on predictions that are only one step ahead and on monthly recorded data.
Research goal. In this study, LSTM models are explored for commodity price forecasting and compared to ARIMA models. The main research question is formulated as follows: Can a Long Short-Term Memory based model that is trained on time series outperform an autoregressive integrated moving average model in the task of predicting commodity prices? To address this question, multi-step forecasts will be considered, which entails experimentation with different time intervals and leads to the related questions: Are there specific time frames in which the chosen models would be more effective, and is a model thus more accurate in short-term predictions, for periods shorter than a month, or in longer-term predictions, for several months ahead? Furthermore, are there any disadvantages of utilizing one model over the other for this particular task? If there is an increase in performance, does this improvement generalize to different types of commodity markets? In order to answer these questions, experiments will be performed by forecasting future values for four time series data sets utilizing the LSTM and ARIMA models. The data sets are of varying sizes; two are commodity price time series, and the other two are stock index data sets that were previously used in Forecasting Economics and Financial Time Series: ARIMA vs. LSTM[8].
2 METHODOLOGY

2.1 ARIMA
The theory described in this section is based on the book Time Series Analysis: Forecasting and Control[1]. A time series can be defined as a set of observations that have been obtained sequentially over time. A time series can either be continuous, where the space between any two time points contains an infinite number of other time points, or discrete, where the time points are usually recorded at equispaced intervals; data typically falls within the latter category. Within such temporal sequences, a dependency exists between adjacent elements, and this is of considerable practical interest. An observation at time t is expected to be correlated with its lagged values from time t-1 to t-p, where the window spanned by p depends on the particular time series. In the domain of time series analysis, this intrinsic characteristic is modeled using stochastic processes, which in turn makes it possible to predict future, out-of-sample values.
The process of forecasting given a discrete time series entails the use of p previous observations up to time t, where this window is referred to as the set of lagged values or simply the lag, to create predictions for l time steps ahead; this frame is called the lead time. For instance, one could be interested in predicting the sales price of an item for the next month given the sale prices of that item over the two previous months. In order to solve such a task, the underlying model needs to be able to infer the probability distribution of future observations given past values; these types of models are referred to as autoregressive.
According to the Wold representation theorem in statistics, every time series can be rewritten as the sum of two time series, one deterministic and one stochastic, on the condition that the original time series is stationary. Consequently, such a representation makes it possible to linearly model the temporal evolution of a variable. The stationarity property of a sequence implies a form of statistical equilibrium, where both the mean and the variance are constant over time. Thus, the probability distribution of a stationary time series z stays the same for all times t, making it possible to infer that distribution. Moreover, there is independence among the individual elements of such a sequence, which supports the theoretical foundation behind autoregressive models. Another argument for stationarity is the prevention of spurious causation. Let a random variable X representing some time series be utilized to predict another random variable Y with a regression model, where both variables are non-stationary and completely independent of each other in terms of a mutually causal relationship. It would still be possible for the regression model to indicate a non-existent relationship as if it existed. This can be attributed to local arbitrary trends that happen to be similar for both variables. The phenomenon of spurious causation can also occur in autoregressive models with just one random variable, if $z_t$ is interpreted as the dependent variable and the lagged values up to $z_{t-p}$ as the explanatory variables. The stationarity characteristic is therefore desirable, as it removes such arbitrary trends. The following paragraph introduces the type of model that is typically used within the domain of time series analysis for forecasting.
The concept behind autoregressive integrated moving average models has been around since the 1920s[4], and its popularity increased after the publication of Time Series Analysis: Forecasting and Control by Box & Jenkins (1970)[1]. The authors solidified the theoretical foundation of the models and laid out a three-stage methodology of identification, estimation, and verification for time series modeling. The model can combine terms from two types of stochastic processes, the first being the autoregressive process AR(p), which is defined by the following equation:

$$ z_t = c + \phi_1 z_{t-1} + \phi_2 z_{t-2} + \dots + \phi_p z_{t-p} + a_t \qquad (1) $$
The measurements in the time series are denoted by z, where t indexes the equidistant time points. The term $a_t$ represents the information that is added at each point in time in the form of white noise, and c is a constant to be determined in the model. Furthermore, each element in $\phi_1, \phi_2, \dots, \phi_p$ weights its corresponding lagged value; the lag window is indicated by p, which is also referred to as the order of the AR process. However, due to the conditions imposed on the process so that it is stationary, such as that all of the weights $\phi$ lie between -1 and 1, the equation is often encountered in the following form:

$$ \tilde{z}_t = \phi_1 \tilde{z}_{t-1} + \phi_2 \tilde{z}_{t-2} + \dots + \phi_p \tilde{z}_{t-p} + a_t \qquad (2) $$

$$ \mu = \frac{c}{1 - \sum_{n=1}^{p} \phi_n} \qquad (3) $$

$$ 1 - \sum_{n=1}^{p} \phi_n \neq 0 \qquad (4) $$
The constant term c is substituted using the relation that determines the mean $\mu$, provided that $\tilde{z}_t = z_t - \mu$ and equation (4) holds.
The second part of ARIMA is the moving average process MA(q), where the focus lies on the shocks or innovations $a_t$, and q indicates the number of past terms. As mentioned previously, the terms $a_t$ contain the new information at each instant. The elements of this set of innovations have a constant mean and variance, and each value is uncorrelated with its past or future values. According to the MA process, a time series can be expressed as a weighted sum of innovations:

$$ \tilde{z}_t = a_t - \theta_1 a_{t-1} - \theta_2 a_{t-2} - \dots - \theta_q a_{t-q} \qquad (5) $$
In this case $\tilde{z}_t = z_t - \mu$ still applies, and because all the individual components are stationary, the whole process satisfies this property. The AR and MA processes are complementary ways of representing time series, each with their own properties, and they are combined in ARMA processes as follows:

$$ \tilde{z}_t = \phi_1 \tilde{z}_{t-1} + \phi_2 \tilde{z}_{t-2} + \dots + \phi_p \tilde{z}_{t-p} + a_t - \theta_1 a_{t-1} - \theta_2 a_{t-2} - \dots - \theta_q a_{t-q} \qquad (6) $$
However, in practice, the natural systems from which time series data are obtained do not satisfy the stationarity assumption required by ARMA models; such time series typically contain trends or seasonality. Hence, different methods have been employed to transform the data so that it sufficiently satisfies the condition. The method incorporated in ARIMA models for this purpose is differencing, which is regulated by the parameter d that indicates its order. The differences are computed between consecutive measurements in the time series; this calculation stabilizes the mean and essentially removes the effect of time.

$$ y'_t = y_t - y_{t-1} \qquad (7) $$

$$ y''_t = (y_t - y_{t-1}) - (y_{t-1} - y_{t-2}) \qquad (8) $$

$$ \phantom{y''_t} = y_t - 2 y_{t-1} + y_{t-2} \qquad (9) $$
In most cases, the resulting time series will become stationary and follow a distributional form to which the ARMA models apply. However, it is possible that the first-order difference $y'_t$ is not sufficient to produce a stationary time series and that differencing of a higher order, such as $y''_t$, is required; in practice, d rarely needs to be higher than 2.
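To make the differencing operation in equations (7)-(9) concrete, the following is a minimal NumPy sketch; the series values are made-up numbers for illustration only, not data from this study.

```python
import numpy as np

# A short, made-up price series used only to illustrate equations (7)-(9).
y = np.array([10.0, 12.0, 11.5, 13.0, 14.5])

# First-order differencing, equation (7): y'_t = y_t - y_{t-1}
first_diff = np.diff(y, n=1)

# Second-order differencing, equations (8)-(9): y''_t = y_t - 2*y_{t-1} + y_{t-2}
second_diff = np.diff(y, n=2)

print(first_diff)   # [ 2.  -0.5  1.5  1.5]
print(second_diff)  # [-2.5  2.   0. ]
```

In ARIMA this transformation is applied internally according to the order d, so it would not normally be performed by hand.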
In short, the ARIMA model combines the AR process and the MA process, where the terms of the former are based directly on the previous values and the terms of the latter on previous innovations. The parameters p and q are non-negative integers that regulate the number of terms of each process, respectively. Finally, the model provides an integrated approach for transforming input time series into stationary time series, where the order of differencing is determined by the parameter d; this makes the model suitable for forecasting time series in practice.
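As an illustration of how such a model can be fitted and used for a multi-step forecast, the sketch below uses the ARIMA implementation from the statsmodels library; the order (2, 1, 0) mirrors the configuration reported later in the discussion, while the series itself is a random-walk placeholder rather than one of the data sets used here.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Placeholder series; in this study it would be e.g. the weekly butter price.
series = np.cumsum(np.random.normal(size=200)) + 100

# ARIMA(p=2, d=1, q=0): two autoregressive terms, first-order differencing,
# no moving average terms.
model = ARIMA(series, order=(2, 1, 0))
fitted = model.fit()

# Out-of-sample forecast for a lead time of 5 steps.
forecast = fitted.forecast(steps=5)
print(forecast)
```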
2.2 Recurrent Neural Networks
In comparison to feed-forward neural networks, in which information is fed only once through the nodes, Recurrent Neural Networks (RNNs) pass information back into the network, operating in a feedback loop. This makes RNNs suitable for processing sequential information over time, in contrast to feed-forward networks, which operate under the assumption of independence among the data samples and therefore do not sufficiently capture sequential dependencies that might exist within the data.
The input from two sources is combined, namely the information at the current time step and the result obtained from the hidden state which utilized the information of the previous time step, to produce the current hidden state. The process is repeated for every element of the input sequence. Each hidden state is a function of the patterns that reflect the information which has been accumulated over time. Hence, in theory, the network is capable of finding correlations between patterns that are separated by a variable number of time steps and of learning long-term dependencies. A hidden state h at time t given the input x at time t can be computed as follows:

$$ h_t = \sigma(W_{in} x_t + W_{rec} h_{t-1} + b_h) \qquad (10) $$
Here $\sigma$ represents the activation function, typically either a tanh or a sigmoid function, $W_{in}$ the conventional weights matrix for the input at the current time step, $W_{rec}$ the recurrent weights matrix applied to the adjacent hidden states through the sequence, and finally $b_h$ the bias corresponding to the hidden layer, for potentially learning shifts in the function. The initial hidden state can be specified by the user but is typically set to zero. To optimize the network, the gradients with respect to the weights are computed with Backpropagation Through Time (BPTT) on the unrolled model. The model is laid out similarly to a feed-forward neural network, such that each element of the input sequence is utilized as an input layer and each element of the output sequence as an output layer. However, the application of the chain rule when propagating the error gradient across all the layers with respect to the recurrent weight matrix, which contains the information of long-term patterns, leads to the problems known as the vanishing gradient problem and the exploding gradient problem. The consecutive multiplication of gradients that are smaller than one is sufficient to cause an exponential decay that vanishes the gradients, preventing the model from learning long-term dependencies. The exploding gradient problem is essentially the opposite behavior, where the gradients are larger than one instead of smaller; this leads to an exponential increase of the long-term components, which eventually completely overshadow the more recent short-term components. As a result, the information carried by these short-term components relative to the long-term components will progressively contribute less to the training of the model.
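The recurrence in equation (10) can be written out directly; the following is a minimal NumPy sketch with arbitrary dimensions and random placeholder weights, intended only to make the hidden-state update explicit.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_in, W_rec, b_h):
    """One recurrence step of equation (10): h_t = tanh(W_in x_t + W_rec h_prev + b_h)."""
    return np.tanh(W_in @ x_t + W_rec @ h_prev + b_h)

input_dim, hidden_dim = 1, 5          # e.g. one price value per time step
rng = np.random.default_rng(0)
W_in = rng.normal(size=(hidden_dim, input_dim))
W_rec = rng.normal(size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)              # initial hidden state, typically zero
sequence = [np.array([0.1]), np.array([0.3]), np.array([0.2])]
for x_t in sequence:                  # unrolled over the input sequence
    h = rnn_step(x_t, h, W_in, W_rec, b_h)
print(h)
```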
2.3 Long Short-Term Memory
A variation of the RNN architecture is the Long Short-Term Memory (LSTM), which employs a gating system to address the aforementioned vanishing gradient problem and exploding gradient problem that both occur in the vanilla version. An LSTM network is composed of LSTM blocks or units, also referred to as memory blocks, that can contain multiple memory cells. The gating system regulates what information is passed through the cells outside of the normal flow of the recurrent network. In addition to memory cells, the typical LSTM block has three gates: an input gate, an output gate and a forget gate. The gates are filters that determine what information is stored, read, written and erased from the cells that share the same block. Each gate corresponds to a weights matrix, and these can be learned when training the network.

$$ i_t = \sigma(W_i h_{t-1} + U_i x_t + b_i) \qquad (11) $$

$$ f_t = \sigma(W_f h_{t-1} + U_f x_t + b_f) \qquad (12) $$

$$ o_t = \sigma(W_o h_{t-1} + U_o x_t + b_o) \qquad (13) $$
The variable $i_t$ in equation (11) represents the input gate, which controls what is written to a cell; the second variable $f_t$ is the forget gate, which determines what should be erased from a cell. The last gate, in equation (13), is represented by $o_t$ for output and directs what is read from a cell. The construct is the same as in the vanilla RNN's calculation of the hidden state. To compute the hidden state at time t, a cell candidate $\hat{c}_t$ is calculated:

$$ \hat{c}_t = \tanh(W h_{t-1} + U x_t + b) \qquad (14) $$
The cell candidate is then filtered with the forget and input gates using element-wise multiplication, indicated by the circle operator:

$$ c_t = f_t \circ c_{t-1} + i_t \circ \hat{c}_t \qquad (15) $$
Finally, the hidden state $h_t$ is computed in equation (16) by filtering the cell $c_t$ with the output gate, after the application of the tanh activation function.

$$ h_t = o_t \circ \tanh(c_t) \qquad (16) $$
The vanishing gradient problem occurs when the magnitude of the gradients, which are calculated by taking the derivatives with respect to the weights and activation functions, is smaller than 1. However, due to the formulation of the calculations in the backpropagation of the LSTM, the weight derivative term in the chain rule is the identity function, resulting in a constant derivative of 1. This prevents both the vanishing gradient problem and the exploding gradient problem, since in the latter case the problem would occur when the term is above 1.
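To tie equations (11)-(16) together, the sketch below computes a single LSTM cell update in NumPy; the per-gate weight matrices are random placeholders and the dimensions are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM update following equations (11)-(16); W, U, b are dicts of per-gate parameters."""
    i_t = sigmoid(W["i"] @ h_prev + U["i"] @ x_t + b["i"])    # input gate, eq. (11)
    f_t = sigmoid(W["f"] @ h_prev + U["f"] @ x_t + b["f"])    # forget gate, eq. (12)
    o_t = sigmoid(W["o"] @ h_prev + U["o"] @ x_t + b["o"])    # output gate, eq. (13)
    c_hat = np.tanh(W["c"] @ h_prev + U["c"] @ x_t + b["c"])  # cell candidate, eq. (14)
    c_t = f_t * c_prev + i_t * c_hat                          # cell state, eq. (15)
    h_t = o_t * np.tanh(c_t)                                  # hidden state, eq. (16)
    return h_t, c_t

input_dim, hidden_dim = 1, 5
rng = np.random.default_rng(1)
W = {k: rng.normal(size=(hidden_dim, hidden_dim)) for k in "ifoc"}
U = {k: rng.normal(size=(hidden_dim, input_dim)) for k in "ifoc"}
b = {k: np.zeros(hidden_dim) for k in "ifoc"}

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
h, c = lstm_step(np.array([0.2]), h, c, W, U, b)
```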
2.4 Evaluation
To measure the performance of the models in this paper, two metrics were utilized: the root mean squared error (RMSE) and the average accuracy (AA). The RMSE is a frequently used quadratic evaluation measure in machine learning and is calculated by taking the root of the average squared difference between the predictions and the actual observations, as follows:

$$ RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \qquad (17) $$
Here n is the number of samples in the test set, y the actual observations and $\hat{y}$ the predictions. The advantage of the root mean squared error is that it penalizes relatively large errors. In practice, this is useful because large individual errors entail more significant consequences for companies that use the predictions of their models to make decisions that influence their profits. Since this metric is also used in similar papers where ARIMA models are compared to LSTM models, it is easier to determine whether some of their findings apply to the experiments performed in this paper given the same metric. The other evaluation metric that was used is the average accuracy. Instead of quadratic differences, absolute differences are computed and averaged after dividing them by the real observations:

$$ AA = \frac{1}{n} \sum_{i=1}^{n} \frac{|y_i - \hat{y}_i|}{y_i} \qquad (18) $$
This metric is more easily interpreted than the RMSE, as it expresses the average error relative to the actual observations; as such, it is used within some corporate environments to measure the performance of forecasting models.
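Read directly from equations (17) and (18), the two metrics could be computed as in the following sketch; the arrays are placeholder values, not results from this study.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, equation (17)."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def average_accuracy(y_true, y_pred):
    """Average of absolute errors relative to the observations, equation (18) as written."""
    return np.mean(np.abs(y_true - y_pred) / y_true)

y_true = np.array([4.10, 4.25, 4.40])   # placeholder observations
y_pred = np.array([4.00, 4.30, 4.55])   # placeholder forecasts
print(rmse(y_true, y_pred), average_accuracy(y_true, y_pred))
```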
3 EXPERIMENTAL SETUP

3.1 Data
The first data set was obtained from a dairy company and contains a time series of Dutch butter prices recorded weekly in the period between January 2006 and the first week of February 2018, for a total of 632 samples. The intervals are limited to a weekly period because the butter price, which is the variable to be predicted, has a low degree of variation within smaller intervals.
The second data set contains the largest number of observations compared to the other data sets in this paper, with a total of 8449 observations. It contains crude oil prices from West Texas Intermediate, recorded daily over a span from January 1986 to May 2018. This time series had a small percentage of missing values spread over the whole time period, which have been imputed with linear interpolation.
The last two data sets are the S&P 500 (GSPC) and Nikkei 225 (N225), both extracted from the Yahoo Finance website and previously used in the ARIMA-LSTM comparison in Forecasting Economics and Financial Time Series: ARIMA vs. LSTM[8]. The time series have the same monthly intervals over the period between January 1985 and June 2018, as mentioned in that paper, with a total of 402 observations. The S&P 500 is an American stock market index based on the common stocks of 500 large companies listed on the NYSE or NASDAQ. The Nikkei 225 is a stock market index for the Tokyo Stock Exchange, with its prices denominated in Japanese yen.
The evaluation in the comparison between the two algorithms was performed on the most recent 20% of each time series. The prediction intervals were chosen to be 5, 13 and 26 weeks, corresponding approximately to forecasts of 1, 3 and 6 months respectively for the butter price data set. The same intervals were utilized for the other data sets for comparison purposes. Additionally, different lag window sizes were experimented with, which represent the number of previous observations used in each data sample. In the ARIMA case, a rolling forecast was implemented starting at the test data index. The model is trained on the initial 80% of the data, creating an out-of-sample forecast for a set time interval. From the resulting list of predictions the last element, corresponding to the last week, is then saved. Subsequently, the model is retrained after the addition of the next sample from the test set to the training set, and this complete process is repeated until the end of the data set, minus the forecasting window, has been reached. This creates an expanding training window until all the data, minus the forecasting window, has been used for evaluation purposes. After the application of the models, the accumulated last elements are used to compute the accuracies and RMSE scores per observation, which are ultimately averaged to get a single score.
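A rough sketch of this expanding-window procedure for ARIMA is given below, assuming the statsmodels ARIMA implementation; the order, interval and split fraction are parameters, and data-set-specific details are omitted.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def rolling_forecast(series, order=(2, 1, 0), interval=5, train_fraction=0.8):
    """Expanding-window forecast: refit the model after each added test sample
    and keep only the last element of each multi-step forecast."""
    split = int(len(series) * train_fraction)
    history = list(series[:split])
    last_step_predictions, targets = [], []

    # Stop early enough that the full forecast window still fits in the data.
    for t in range(split, len(series) - interval + 1):
        fitted = ARIMA(history, order=order).fit()
        forecast = fitted.forecast(steps=interval)
        last_step_predictions.append(forecast[-1])   # prediction for index t + interval - 1
        targets.append(series[t + interval - 1])     # matching observation
        history.append(series[t])                    # expand the training window

    return np.array(last_step_predictions), np.array(targets)
```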
Data preparation for the LSTM model included min-max normalization to a [-1, 1] range and the creation of a lagged-values data frame. The dimensions of this data frame are n x (h + l) and thus depend on the lag h and the lead time l parameters, as well as the number of samples n. The rows in the data frame form a sliding window across the time series. The first h columns contain the explanatory sequences and the following l columns the corresponding target sequences to learn. Each sample is a version of the previous sample shifted one time step into the future, thus with the addition of the next value in the time series and without the first entry of the previous sample. Internally, most LSTM implementations do not include differencing, as ARIMA does, to make time series stationary, so in this paper it is implemented manually outside of the network. Given that LSTMs are non-linear models and can learn such relationships, they do not necessarily require stationary time series. Nonetheless, experiments were conducted to see whether differencing improved performance. The data was split so that approximately 80% was used for training and 20% for the test set. The target sequences of the training data will partially contain sequences that are used in the test set, in the temporally adjacent samples that divide the training and test sets. In the instance where both the lead time and the lag are 2, the second element of the target sequence of the last training sample will also be the first entry of the target sequence of the first test sample. Consequently, given that a rolling forecast is utilized, each test sample would always partially contain information from the most recent row of the training set. To account for this leak of information, an offset equal to the size of the lead time is taken at the split index, which creates a gap between the training and test set. In order to evaluate the model on the same period as ARIMA, the offset slightly reduces the training set.
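The scaling and the lagged-values data frame described above could be constructed as in the following sketch, where `lag` and `lead` correspond to h and l; the column names and the example series are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def make_supervised_frame(series, lag=2, lead=5):
    """Build an n x (lag + lead) frame: each row is a sliding window whose first
    `lag` columns are inputs and whose last `lead` columns are target values."""
    scaler = MinMaxScaler(feature_range=(-1, 1))
    scaled = scaler.fit_transform(np.asarray(series, dtype=float).reshape(-1, 1)).ravel()

    window = lag + lead
    rows = [scaled[i:i + window] for i in range(len(scaled) - window + 1)]
    columns = [f"x_t-{lag - j}" for j in range(lag)] + [f"y_t+{j + 1}" for j in range(lead)]
    return pd.DataFrame(rows, columns=columns), scaler

frame, scaler = make_supervised_frame(np.arange(100.0), lag=2, lead=5)
X = frame.iloc[:, :2].to_numpy()   # explanatory sequences (h columns)
y = frame.iloc[:, 2:].to_numpy()   # target sequences (l columns)
```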
Subsequently, the data frame was split column-wise, to separate the target columns from the feature columns before feeding it into the network. As mentioned before, a similar approach to testing was employed as with ARIMA, namely a rolling forecast. In ARIMA's case, after expanding the window that is made up of the training set by adding the individual test set samples, the model is completely retrained. However, with regard to the LSTM, a different approach was taken, due to the significantly larger training time that would otherwise be required: the training would have to be repeated once for each test sample, multiplied by the number of epochs. Instead of completely retraining the model, the weights from the initial training split are saved, and after adding the consecutive test sample to the training data, the model continues to train from those initial weights for two more epochs.
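A sketch of this lighter rolling forecast for the LSTM is shown below, assuming a Keras model that has already been compiled and fitted on the initial training split; instead of retraining from scratch, training simply continues from the current weights for two more epochs after each test sample is added.

```python
import numpy as np

def lstm_rolling_forecast(model, X, y, split, epochs_per_step=2, batch_size=5):
    """Expanding-window evaluation that reuses the weights from the initial fit.
    X is assumed to have shape (n, 1, lag) and y shape (n, lead)."""
    predictions = []
    for t in range(split, len(X)):
        # Predict the target sequence for the current test sample and keep the last step.
        pred = model.predict(X[t:t + 1], batch_size=1)
        predictions.append(pred[0, -1])

        # Add the test sample to the training data and continue training from the
        # current weights for a couple more epochs instead of refitting from scratch.
        model.fit(X[:t + 1], y[:t + 1], epochs=epochs_per_step,
                  batch_size=batch_size, shuffle=False, verbose=0)
    return np.array(predictions)
```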
The first reason for using the rolling forecast approach for the LSTM is to prevent attributing differences in performance to an inconsistent number of training samples, as a rolling forecast incorporates more training data into the model than a traditional train-test split. Secondly, when attempting to predict future values, the more recent data samples will typically have more predictive power than the older values in a temporal sequence. Hence, when a traditional split is used there is a relatively large time gap between the last sample of the training set and the most recent samples of the test set. Additionally, in practice the models will be trained on all of the available data, so performing an evaluation using a rolling forecast, which incorporates the more recent data, corresponds more closely to that practical scenario. Finally, when using time series in a machine learning model, k-fold cross-validation is not a valid evaluation method: if the data is split into equal parts, arbitrary shuffling of the folds would yield inconsistent configurations in which past values would be predicted using future values. Hence, an expanding window approach such as the rolling forecast can be utilized as a substitute for cross-validation.
Another method that could have been employed for this purpose is the sliding window evaluation. Initially, a subset with a fixed size would be selected that starts at the first index of the complete data set. This selected window would then be split into a training set and a test set and evaluated accordingly. Afterward, the window would be shifted across the data set, repeating the previous step. However, the sliding window size is dependent on the size of the entire data set and also determines the number of shifts that are possible. Consequently, such an approach is only usable if the data set is large enough to allow a significant number of shifts and for the subsets to be large enough to sufficiently capture the dependencies that exist within the time series. The stock index price data sets, as well as the butter price data set, are too small to satisfy these conditions. Therefore, the rolling forecast was chosen over the sliding window approach.
3.2 Results
The results can be observed in tables 1 and 2. Table 1 reports the reproduction attempt of one-step-ahead predictions as performed in Forecasting Economics and Financial Time Series: ARIMA vs. LSTM[8], with the same model parameters ARIMA(5,1,0) and two of the same data sets. The target columns report the RMSE values that were found in that paper, and the LSTM and ARIMA columns are the results obtained with the models in this paper. While the ARIMA evaluations are similar, the LSTM results are off target by a large margin, as well as having higher error rates than the ARIMA model. This trend continues across the findings in table 2, where ARIMA produces lower error rates and higher accuracies than the LSTM model for all intervals and data sets. The interval values in table 2 represent the number of time steps ahead for which the predictions are performed. Furthermore, it is evident that the relative differences between the RMSE values get larger as the interval increases. The accuracies are fairly consistent with the RMSE values in terms of the previous findings. However, the errors vary based on the average monetary values expressed in each time series, and given the relative character of the accuracy measure, it is more stable across the different data sets. Finally, table 3 reports the influence of the lag parameter on the performance of the ARIMA model in terms of RMSE.
Table 1: Reproduction attempt RMSE

Time series   LSTM         LSTM target   ARIMA          ARIMA target
GSPC          62.73733     7.814         55.33580524    55.3
N225          750.011821   105.315       739.3906823    766.45

Table 2: Performance results

Interval 5
Time series       ARIMA RMSE   ARIMA Accuracy   LSTM RMSE   LSTM Accuracy
Butter price      0.292        0.963            0.44        0.935
Crude oil price   1.222        0.986            2.609       0.968
GSPC              56.267       0.977            109.503     0.954
N225              759.312      0.962            1762.563    0.905

Interval 13
Butter price      0.474        0.941            0.989       0.825
Crude oil price   1.226        0.986            4.36        0.945
GSPC              75.24        0.969            226.937     0.904
N225              771.834      0.96             3213.036    0.833

Interval 26
Butter price      0.371        0.951            1.256       0.787
Crude oil price   1.259        0.985            6.526       0.918
GSPC              134.178      0.938            359.725     0.845
N225              850.953954   0.955            4163.072    0.796

Table 3: Lag experiment RMSE

Lag   GSPC      N225
1     57.039    758.209
2     56.267    759.312
3     58.7745   759.341
4     61.110    762.977
5     64.454    773.470
6     63.786    755.649
7     65.723    756.287
8     66.248    759.366
9     66.464    768.656

3.3 Discussion

The model parameters that were utilized to produce the results in table 2 for ARIMA were (2,1,0), where 2 is the lag, 1 represents first-order differencing and the 0 indicates the absence of moving average terms.
The lag was, however, set to 5 to produce the outcomes in table 1, in order to test whether the performance values were comparable to the experiments in Forecasting Economics and Financial Time Series: ARIMA vs. LSTM[8]. Empirical experiments with the lag parameter on two of the data sets, whose results are depicted in table 3, demonstrate that utilizing a larger lag does not significantly improve performance, with the exception of what appeared to be a slight improvement in a few outlying instances. In configurations where moving average terms were added, it could also be observed that there was no significant performance increase, at least within a parameter range of 1-3. Higher values were not feasible, as they caused invertibility errors and convergence problems.
As for the LSTM implementation, the number of epochs was set to 500, the lag to 2 and the batch size to 5 for all the data sets except the crude oil data set, for which the batch size was set to 100, given that it is a larger data set. The number of epochs corresponds to the number of training iterations using gradient descent on the complete data set. The authors of Forecasting Economics and Financial Time Series: ARIMA vs. LSTM[8] reported that there was no evidence that more than 1 epoch yielded lower error rates. When experiments were performed with the number of epochs, this indeed seemed to be the case for single-step forecasts. However, this finding did not generalize to multi-step forecasts, which require a more complex model, as larger numbers of epochs resulted in lower RMSE values. The batch size determines the size of the subsets on which the training is iteratively performed. In the experiments, higher batch sizes yielded faster training times at the cost of performance and increased the variation in the results when running the exact same model. Furthermore, the model architecture consisted of 5 LSTM units in a single LSTM layer that feeds into a dense layer. Increasing the number of layers did not significantly improve performance and resulted in longer training times. Finally, an Adam optimizer and a mean squared error loss function were utilized when fitting all the LSTM models.
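Based on the parameters listed above (a single LSTM layer with 5 units feeding into a dense output layer, a mean squared error loss and an Adam optimizer), the model could be defined roughly as in the following Keras sketch; this is a reconstruction from the description in this section, not the author's original code, and the training data here are random placeholders.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

lag, lead = 2, 5    # lag window and forecast interval used for most data sets

model = Sequential([
    LSTM(5, input_shape=(1, lag)),   # single LSTM layer with 5 units
    Dense(lead),                     # dense output layer producing the multi-step forecast
])
model.compile(loss="mean_squared_error", optimizer="adam")

# Placeholder training data with the expected shapes: (samples, 1, lag) and (samples, lead).
X_train = np.random.normal(size=(100, 1, lag)).astype("float32")
y_train = np.random.normal(size=(100, lead)).astype("float32")

# The section above reports 500 epochs and a batch size of 5; a short run is used here for illustration.
model.fit(X_train, y_train, epochs=2, batch_size=5, shuffle=False, verbose=0)
```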
As was mentioned in the results section, the ARIMA model outperformed the LSTM model in every experiment, even in the reproduction attempts in table 1. In order to explain the homogeneity of these results, the two LSTM approaches should be compared. In the architecture of the LSTM approach of Forecasting Economics and Financial Time Series: ARIMA vs. LSTM[8], 4 LSTM units were used with a single LSTM layer, as can be observed in their provided code snippet. The number of epochs was set to 1. Other parameters that were observed in the code snippet included a mean squared error loss function, an Adam optimizer, and enabled statefulness. Compared to the parameters listed earlier in this section, few differences can be perceived. The numbers of LSTM units are nearly identical, and the one extra LSTM unit cannot account for the large difference in performance. Additionally, experiments showed that a lower number of units did not result in lower error rates. Neither did the number of epochs or statefulness influence the results by such large margins. The authors mention that, for the purpose of reducing the complexity of their algorithm, a few other manipulations were performed, such as the addition of dense layers, specific batch sizes and transformations. However, the specifics behind these were not reported. Furthermore, in the description of their rolling forecast it is mentioned that the model is re-estimated after each test sample, but this is not documented in the code snippet. It is likely that the homogeneity of the results can be attributed to these unreported modifications.
Conclusion. One of the subquestions stated in the introduction section was: whether there are specific time frames in which the chosen models would be more effective, and is a model thus more accurate in short-term predictions, for periods shorter than a month, or in longer-term predictions, for several months ahead. Solely based on the results from table 2, it can be observed that ARIMA outperforms the LSTM models in all cases. Hence, the LSTM would not provide better predictions over any time window, whether short-term or long-term.
Furthermore, figure 1 depicts the RMSE and accuracy of the ARIMA model as a function of the weekly lead time on the butter price data set. As expected, the evaluations of the model get progressively worse, but the model is quite robust, as the accuracy only starts to dip beneath 90% after the lead time exceeds 100 weeks. Whether that is an acceptable threshold for determining effectiveness will depend on the conditions imposed in a practical setting and on the particular data set.

Figure 1: The evaluation of the ARIMA model on the butter price time series as lead time is increased.
As it currently stands, ARIMA has only advantages over the LSTM model: it is faster to train and predict with, and it is less complex in terms of implementation. This also generalizes to the different data sets with varying sizes and time intervals. However, the results are insufficient to conclude an affirmative answer to the central research question: Can a Long Short-Term Memory based model that is trained on time series outperform an autoregressive integrated moving average model in the task of predicting commodity prices? It is possible, however, to conclude that this particular LSTM approach does not outperform the ARIMA model for either single-step or multi-step forecasting, and that more configurations of this model should be investigated.
REFERENCES
[1] George Edward Pelham Box and Gwilym Jenkins. 1990. Time Series Analysis, Forecasting and Control. Holden-Day, Inc., San Francisco, CA, USA.
[2] K. Chen, Y. Zhou, and F. Dai. 2015. A LSTM-based method for stock returns prediction: A case study of China stock market. (Oct 2015), 2823–2824. https://doi.org/10.1109/BigData.2015.7364089
[3] Xiaochen Chen, Lai Wei, and Jiaxin Xu. 2017. House Price Prediction Using LSTM. CoRR abs/1709.08432 (2017).
[4] Jan G. De Gooijer and Rob J. Hyndman. 2006. 25 years of time series forecasting. International Journal of Forecasting (2006).
[5] Jan G. De Gooijer and Rob J. Hyndman. 2006. 25 Years of Time Series Forecasting.
[6] Dragan Ivanisevic, Beba Mutavdzic, Nebojsa Novkovic, and Natasa Vukelić. 2015. Analysis and prediction of tomato price in Serbia. 62 (01 2015), 951–962.
[7] S. Selvin, R. Vinayakumar, E. A. Gopalakrishnan, V. K. Menon, and K. P. Soman. 2017. Stock price prediction using LSTM, RNN and CNN-sliding window model. (Sept 2017), 1643–1647. https://doi.org/10.1109/ICACCI.2017.8126078
[8] Sima Siami-Namini and Akbar Siami Namin. 2018. Forecasting Economics and Financial Time Series: ARIMA vs. LSTM. CoRR abs/1803.06386 (2018).
[9] Yue Wang, XingYu Ye, and Yudan Huo. 2011. Prediction of household food retail prices based on ARIMA Model. (July 2011), 2301–2305. https://doi.org/10.1109/ICMT.2011.6002376
[10] M. Xie, C. Sandels, K. Zhu, and L. Nordström. 2013. A seasonal ARIMA model with exogenous variables for elspot electricity prices in Sweden. (May 2013), 1–4. https://doi.org/10.1109/EEM.2013.6607293
4 APPENDIX
Figure 2: The above plot depicts the Dutch butter price time series before and after differencing.

Figure 3: The above plot depicts the WTI crude oil price time series before and after differencing.

Figure 4: The above plot depicts the S&P 500 index time series before and after differencing.

Figure 5: The above plot depicts the Nikkei 225 index time series before and after differencing.