A multi-step forecasting comparison between
ARIMA and LSTM on financial time series
Submitted in partial fulfillment for the degree of
Master of Science

Amir Alnomani
10437797

Master Information Studies
Data Science
Faculty of Science
University of Amsterdam

Date of defence: 2018-07-05
Internal Supervisor: Dr Maarten Marx (UvA, FNWI, IvI)
External Supervisor: Juan Carlos Romero (Friesland Campina)
Contents
1 Introduction
2 Methodology
  2.1 ARIMA
  2.2 Recurrent Neural Networks
  2.3 Long Short-Term Memory
  2.4 Evaluation
3 Experimental Setup
  3.1 Data
  3.2 Results
  3.3 Discussion
References
4 Appendix
Abstract

This study addresses the question whether LSTM-based forecasting of time series can outperform ARIMA-based forecasting. While previous research explored single-step forecasts for this particular comparison, the experiments conducted here investigate whether those results generalize to multi-step forecasting. The root mean squared errors reported for the respective models indicate that the specific LSTM approach utilized in this paper does not outperform the ARIMA model on any of the data sets. The data sets used are the Dutch national butter price, the WTI crude oil price, the S&P 500 index and the Nikkei 225 index.
1 INTRODUCTION
Price forecasting is becoming increasingly relevant to producers within various markets. Such forecasts can be of great benefit for developing strategies and negotiating positions ahead of time. Commodity market information is sequential and partially observable, so historical prices are important for predicting future prices and can potentially be combined with more static data obtained from market snapshots, such as data about changes in certain regulations or laws that affect the prices. The existing time series forecasting methods include a variety of both linear and non-linear algorithms. The autoregressive integrated moving average (ARIMA) models and their variations such as AR, MA and ARMA, which fall in the linear model class, have been extensively researched for this purpose[5]. There have been many successful applications of this class of models on univariate financial time series, such as electricity prices in Sweden[10], tomato prices in Serbia[6] and household food retail prices[9].
A non-linear approach to time series forecasting has been to look at the effectiveness of neural networks. Earlier work compared the use of feed-forward ANN models with linear models. However, these comparisons were inconclusive given contradictory results, with some claiming that linear models produce more accurate predictions, while others favor ANNs for this task[5]. More recent studies explore the use of Recurrent Neural Networks, more specifically the Long Short-Term Memory (LSTM) variants. In contrast to traditional neural networks, which are based on the assumption that the input data are independent of each other, RNNs are able to capture sequential information by carrying results from previous computations, or states, into the next states, often referred to as memory units. These networks can be trained on variable-sized input and are able to produce variable-sized output, which makes them suitable for capturing temporal data and thus for the task of forecasting.
LSTM networks are variations of the vanilla RNN that employ a gating system to solve the vanishing and exploding gradient problems present in the vanilla versions. Within the domain of forecasting, LSTMs have been utilized for house price predictions[3] and in various studies on stock price predictions[7][2].
In the paper Forecasting Economics and Financial Time Series: ARIMA vs. LSTM[8], a comparison is made between ARIMA and LSTM models in terms of performance when forecasting financial and economic time series. Their results seem to suggest that LSTM-based algorithms produce error rates that are approximately 85% lower than those of ARIMA. However, this is based on predictions that are only one step ahead and on monthly recorded data.
Research goal. In this study, LSTM models are explored for commodity price forecasting and compared to ARIMA models. The main research question is formulated as follows: Can a Long Short-Term Memory based model that is trained on time series outperform an autoregressive integrated moving average model in the task of predicting commodity prices? To address this question, multi-step forecasts will be considered, which entails experimentation with different time intervals and leads to the related questions: Are there specific time frames in which the chosen models would be more effective, and is a model thus more accurate in short-term predictions, for periods shorter than a month, or in longer-term predictions, for several months ahead? Furthermore, are there any disadvantages of utilizing one model over the other for this particular task? If there is an increase in performance, does this improvement generalize to different types of commodity markets? In order to answer these questions, experiments will be performed by forecasting future values for four time series data sets utilizing the LSTM and ARIMA models. The data sets are of varying sizes; two are commodity price time series, and the other two are stock index data sets that were previously used in Forecasting Economics and Financial Time Series: ARIMA vs. LSTM[8].
2 METHODOLOGY

2.1 ARIMA
The theory described in this section is based on the book Time Series Analysis: Forecasting and Control[1]. A time series can be defined as a set of observations that have been obtained sequentially over time. A time series can either be continuous, where the space between any two time points contains an infinite number of other time points, or discrete, where the time points are usually recorded at equispaced intervals; data typically falls within the latter category. Within such temporal sequences, a dependency exists between adjacent elements, and this is of considerable practical interest. An observation at time t is expected to be correlated with its lagged values from time t-1 to t-p, where the window spanned by p depends on the particular time series. In the domain of time series analysis, this intrinsic characteristic is modeled using stochastic processes, which in turn makes it possible to predict future, out-of-sample values.
The process of forecasting given a discrete time series entails the use of p previous observations up to time t, where this window is referred to as the set of lagged values or simply the lag, to create predictions for l time steps ahead; this frame is called the lead time. For instance, one could be interested in predicting the sales price of an item for the next month given the sale prices of that item over the two previous months. In order to solve such a task, the underlying model needs to be able to infer the probability distribution of future observations given past values; these types of models are referred to as autoregressive.
According to the Wold representation theorem in statistics, every time series can be rewritten as the sum of two time series, one deterministic and one stochastic, on the condition that the original time series is stationary. Consequently, such a representation makes it possible to linearly model the temporal evolution of a variable. The stationarity property of a sequence implies a form of statistical equilibrium, where both the mean and the variance are constant over time. Thus, the probability distribution of a stationary time series z stays the same for all times t, making it possible to infer that distribution. Moreover, there is independence among the individual elements of such a sequence, which supports the theoretical foundation behind autoregressive models. Another argument for stationarity is the prevention of spurious causation. Let a random variable X representing some time series be utilized to predict another random variable Y with a regression model, where both variables are non-stationary and completely independent of each other in terms of a mutually causal relationship. It would still be possible for the regression model to indicate a non-existent relationship as if it existed. This can be attributed to local arbitrary trends that happen to be similar for both variables. The phenomenon of spurious causation can also occur in autoregressive models with just one random variable, if $z_t$ is interpreted as the dependent variable and the lagged values up to $z_{t-p}$ as the explanatory variables. The stationarity characteristic is therefore desirable, as it removes such arbitrary trends. The following paragraph introduces the type of model that is typically used within the domain of time series analysis for forecasting.
The concept behind autoregressive integrated moving average models has been around since the 1920s[4], and its popularity increased after the publication of Time Series Analysis: Forecasting and Control by Box & Jenkins (1970)[1]. The authors solidified the theoretical foundation of the models and laid out a three-stage methodology of identification, estimation, and verification for time series modeling. The model can combine terms from two types of stochastic processes, the first being the autoregressive process AR(p), which is defined by the following equation:

$$ z_t = c + \phi_1 z_{t-1} + \phi_2 z_{t-2} + \dots + \phi_p z_{t-p} + a_t \qquad (1) $$
The measurements in the time series are denoted by z, where t indexes the equidistant time points. The term $a_t$ represents the information that is added at each point in time in the form of white noise, and c is a constant to be determined in the model. Furthermore, each element in $\phi_1, \phi_2, \dots, \phi_p$ weights its corresponding lagged value; the lag window is indicated by p, which is also referred to as the order of the AR process. However, due to the conditions imposed on the process so that it is stationary, such as that all of the weights $\phi$ lie between -1 and 1, the equation is often encountered in the following form:

$$ \tilde{z}_t = \phi_1 \tilde{z}_{t-1} + \phi_2 \tilde{z}_{t-2} + \dots + \phi_p \tilde{z}_{t-p} + a_t \qquad (2) $$

$$ \mu = \frac{c}{1 - \sum_{n=1}^{p} \phi_n} \qquad (3) $$

$$ 1 - \sum_{n=1}^{p} \phi_n \neq 0 \qquad (4) $$
The constant term c is substituted using the relation that determines the mean $\mu$, provided that $\tilde{z}_t = z_t - \mu$ and equation (4) holds.
The second part of ARIMA is the moving average process MA(q), where the focus lies on the shocks or innovations $a_t$, and q indicates the number of past terms. As mentioned previously, the terms $a_t$ contain the new information at each instant. The elements of this set of innovations have a constant mean and variance, and each value is uncorrelated with its past or future values. According to the MA process, a time series can be expressed as a weighted sum of innovations:

$$ \tilde{z}_t = a_t - \theta_1 a_{t-1} - \theta_2 a_{t-2} - \dots - \theta_q a_{t-q} \qquad (5) $$
In this case $\tilde{z}_t = z_t - \mu$ still applies, and because all the individual components are stationary, the whole process satisfies this property. The AR and MA processes are complementary ways of representing time series, each with their own properties, and they are combined in ARMA processes as follows:

$$ \tilde{z}_t = \phi_1 \tilde{z}_{t-1} + \phi_2 \tilde{z}_{t-2} + \dots + \phi_p \tilde{z}_{t-p} + a_t - \theta_1 a_{t-1} - \theta_2 a_{t-2} - \dots - \theta_q a_{t-q} \qquad (6) $$
However, in practice, the natural systems from which time series data are obtained do not satisfy the stationarity assumption required by ARMA models; such time series typically contain trends or seasonality. Hence, different methods have been employed to transform the data so that it sufficiently satisfies the condition. The method incorporated in ARIMA models for this purpose is differencing, which is regulated by the parameter d that indicates its order. The differences are computed between consecutive measurements in the time series; this calculation stabilizes the mean and essentially removes the effect of time.

$$ y'_t = y_t - y_{t-1} \qquad (7) $$

$$ y''_t = (y_t - y_{t-1}) - (y_{t-1} - y_{t-2}) \qquad (8) $$

$$ \phantom{y''_t} = y_t - 2 y_{t-1} + y_{t-2} \qquad (9) $$
In most cases, the resulting time series will become stationary and follow a distributional form to which the ARMA models apply. However, it is possible that the first-order difference $y'_t$ is not sufficient to produce a stationary time series and that differencing of a higher order, such as $y''_t$, is required; in practice, d rarely needs to be higher than 2.
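To make the differencing operation in equations (7)-(9) concrete, the following is a minimal NumPy sketch; the series values are made-up numbers for illustration only, not data from this study.

```python
import numpy as np

# A short, made-up price series used only to illustrate equations (7)-(9).
y = np.array([10.0, 12.0, 11.5, 13.0, 14.5])

# First-order differencing, equation (7): y'_t = y_t - y_{t-1}
first_diff = np.diff(y, n=1)

# Second-order differencing, equations (8)-(9): y''_t = y_t - 2*y_{t-1} + y_{t-2}
second_diff = np.diff(y, n=2)

print(first_diff)   # [ 2.  -0.5  1.5  1.5]
print(second_diff)  # [-2.5  2.   0. ]
```

In ARIMA this transformation is applied internally according to the order d, so it would not normally be performed by hand.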
In short, the ARIMA model combines the AR process and the MA process, where the terms of the former are based directly on the previous values and the terms of the latter on previous innovations. The parameters p and q are non-negative integers that regulate the number of terms of each process, respectively. Finally, the model provides an integrated approach for transforming input time series into stationary time series, where the order of differencing is determined by the parameter d; this makes the model suitable for forecasting time series in practice.
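As an illustration of how such a model can be fitted and used for a multi-step forecast, the sketch below uses the ARIMA implementation from the statsmodels library; the order (2, 1, 0) mirrors the configuration reported later in the discussion, while the series itself is a random-walk placeholder rather than one of the data sets used here.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Placeholder series; in this study it would be e.g. the weekly butter price.
series = np.cumsum(np.random.normal(size=200)) + 100

# ARIMA(p=2, d=1, q=0): two autoregressive terms, first-order differencing,
# no moving average terms.
model = ARIMA(series, order=(2, 1, 0))
fitted = model.fit()

# Out-of-sample forecast for a lead time of 5 steps.
forecast = fitted.forecast(steps=5)
print(forecast)
```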
2.2 Recurrent Neural Networks
In comparison to feed-forward neural networks, in which information is fed only once through the nodes, Recurrent Neural Networks (RNNs) pass information back into the network, operating in a feedback loop. This makes RNNs suitable for processing sequential information over time, in contrast to feed-forward networks, which operate under the assumption of independence among the data samples and therefore do not sufficiently capture sequential dependencies that might exist within the data.
The input from two sources is combined, namely the information at the current time step and the result obtained from the hidden state which utilized the information of the previous time step, to produce the current hidden state. The process is repeated for every element of the input sequence. Each hidden state is a function of the patterns that reflect the information which has been accumulated over time. Hence, in theory, the network is capable of finding correlations between patterns that are separated by a variable number of time steps and of learning long-term dependencies. A hidden state h at time t given the input x at time t can be computed as follows:

$$ h_t = \sigma(W_{in} x_t + W_{rec} h_{t-1} + b_h) \qquad (10) $$
Here $\sigma$ represents the activation function, typically either a tanh or a sigmoid function, $W_{in}$ the conventional weights matrix for the input at the current time step, $W_{rec}$ the recurrent weights matrix applied to the adjacent hidden states through the sequence, and finally $b_h$ the bias corresponding to the hidden layer, for potentially learning shifts in the function. The initial hidden state can be specified by the user but is typically set to zero. To optimize the network, the gradients with respect to the weights are computed with Backpropagation Through Time (BPTT) on the unrolled model. The model is laid out similarly to a feed-forward neural network, such that each element of the input sequence is utilized as an input layer and each element of the output sequence as an output layer. However, the application of the chain rule when propagating the error gradient across all the layers with respect to the recurrent weight matrix, which contains the information of long-term patterns, leads to the problems known as the vanishing gradient problem and the exploding gradient problem. The consecutive multiplication of gradients that are smaller than one is sufficient to cause an exponential decay that vanishes the gradients, preventing the model from learning long-term dependencies. The exploding gradient problem is essentially the opposite behavior, where the gradients are larger than one instead of smaller; this leads to an exponential increase of the long-term components, which eventually completely overshadow the more recent short-term components. As a result, the information carried by these short-term components relative to the long-term components will progressively contribute less to the training of the model.
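The recurrence in equation (10) can be written out directly; the following is a minimal NumPy sketch with arbitrary dimensions and random placeholder weights, intended only to make the hidden-state update explicit.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_in, W_rec, b_h):
    """One recurrence step of equation (10): h_t = tanh(W_in x_t + W_rec h_prev + b_h)."""
    return np.tanh(W_in @ x_t + W_rec @ h_prev + b_h)

input_dim, hidden_dim = 1, 5          # e.g. one price value per time step
rng = np.random.default_rng(0)
W_in = rng.normal(size=(hidden_dim, input_dim))
W_rec = rng.normal(size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)              # initial hidden state, typically zero
sequence = [np.array([0.1]), np.array([0.3]), np.array([0.2])]
for x_t in sequence:                  # unrolled over the input sequence
    h = rnn_step(x_t, h, W_in, W_rec, b_h)
print(h)
```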
2.3 Long Short-Term Memory
A variation of the RNN architecture is the Long Short-Term Memory (LSTM), which employs a gating system to address the aforementioned vanishing gradient problem and exploding gradient problem that both occur in the vanilla version. An LSTM network is composed of LSTM blocks or units, also referred to as memory blocks, that can contain multiple memory cells. The gating system regulates what information is passed through the cells outside of the normal flow of the recurrent network. In addition to memory cells, the typical LSTM block has three gates: an input gate, an output gate and a forget gate. The gates are filters that determine what information is stored, read, written and erased from the cells that share the same block. Each gate corresponds to a weights matrix, and these can be learned when training the network.

$$ i_t = \sigma(W_i h_{t-1} + U_i x_t + b_i) \qquad (11) $$

$$ f_t = \sigma(W_f h_{t-1} + U_f x_t + b_f) \qquad (12) $$

$$ o_t = \sigma(W_o h_{t-1} + U_o x_t + b_o) \qquad (13) $$
The variable $i_t$ in equation (11) represents the input gate, which controls what is written to a cell; the second variable $f_t$ is the forget gate, which determines what should be erased from a cell. The last gate, in equation (13), is represented by $o_t$ for output and directs what is read from a cell. The construct is the same as in the vanilla RNN's calculation of the hidden state. To compute the hidden state at time t, a cell candidate $\hat{c}_t$ is calculated:

$$ \hat{c}_t = \tanh(W h_{t-1} + U x_t + b) \qquad (14) $$
The cell candidate is then filtered with the forget and input gates using element-wise multiplication, indicated by the circle operator:

$$ c_t = f_t \circ c_{t-1} + i_t \circ \hat{c}_t \qquad (15) $$
Finally, the hidden state $h_t$ is computed in equation (16) by filtering the cell $c_t$ with the output gate, after the application of the tanh activation function.

$$ h_t = o_t \circ \tanh(c_t) \qquad (16) $$
The vanishing gradient problem occurs when the magnitude of the gradients, which are calculated by taking the derivatives with respect to the weights and activation functions, is smaller than 1. However, due to the formulation of the calculations in the backpropagation of the LSTM, the weight derivative term in the chain rule is the identity function, resulting in a constant derivative of 1. This prevents both the vanishing gradient problem and the exploding gradient problem, since in the latter case the problem would occur when the term is above 1.
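To tie equations (11)-(16) together, the sketch below computes a single LSTM cell update in NumPy; the per-gate weight matrices are random placeholders and the dimensions are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM update following equations (11)-(16); W, U, b are dicts of per-gate parameters."""
    i_t = sigmoid(W["i"] @ h_prev + U["i"] @ x_t + b["i"])    # input gate, eq. (11)
    f_t = sigmoid(W["f"] @ h_prev + U["f"] @ x_t + b["f"])    # forget gate, eq. (12)
    o_t = sigmoid(W["o"] @ h_prev + U["o"] @ x_t + b["o"])    # output gate, eq. (13)
    c_hat = np.tanh(W["c"] @ h_prev + U["c"] @ x_t + b["c"])  # cell candidate, eq. (14)
    c_t = f_t * c_prev + i_t * c_hat                          # cell state, eq. (15)
    h_t = o_t * np.tanh(c_t)                                  # hidden state, eq. (16)
    return h_t, c_t

input_dim, hidden_dim = 1, 5
rng = np.random.default_rng(1)
W = {k: rng.normal(size=(hidden_dim, hidden_dim)) for k in "ifoc"}
U = {k: rng.normal(size=(hidden_dim, input_dim)) for k in "ifoc"}
b = {k: np.zeros(hidden_dim) for k in "ifoc"}

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
h, c = lstm_step(np.array([0.2]), h, c, W, U, b)
```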
2.4 Evaluation
To measure the performance of the models in this paper, two metrics were utilized: the root mean squared error (RMSE) and the average accuracy (AA). The RMSE is a frequently used quadratic evaluation measure in machine learning and is calculated by taking the root of the average squared difference between the predictions and the actual observations, as follows:

$$ RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \qquad (17) $$
Here n is the number of samples in the test set, y the actual observations and $\hat{y}$ the predictions. The advantage of the root mean squared error is that it penalizes relatively large errors. In practice, this is useful because large individual errors entail more significant consequences for companies that use the predictions of their models to make decisions that influence their profits. Since this metric is also used in similar papers where ARIMA models are compared to LSTM models, it is easier to determine whether some of their findings apply to the experiments performed in this paper given the same metric. The other evaluation metric that was used is the average accuracy. Instead of quadratic differences, absolute differences are computed and averaged after dividing them by the real observations:

$$ AA = \frac{1}{n} \sum_{i=1}^{n} \frac{|y_i - \hat{y}_i|}{y_i} \qquad (18) $$
This metric is more easily interpreted than the RMSE, as it expresses the average error relative to the actual observations; as such, it is used within some corporate environments to measure the performance of forecasting models.
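Read directly from equations (17) and (18), the two metrics could be computed as in the following sketch; the arrays are placeholder values, not results from this study.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, equation (17)."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def average_accuracy(y_true, y_pred):
    """Average of absolute errors relative to the observations, equation (18) as written."""
    return np.mean(np.abs(y_true - y_pred) / y_true)

y_true = np.array([4.10, 4.25, 4.40])   # placeholder observations
y_pred = np.array([4.00, 4.30, 4.55])   # placeholder forecasts
print(rmse(y_true, y_pred), average_accuracy(y_true, y_pred))
```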
3 EXPERIMENTAL SETUP

3.1 Data
The first data set was obtained from a dairy company and contains a time series of Dutch butter prices recorded weekly in the period between January 2006 and the first week of February 2018, for a total of 632 samples. The intervals are limited to a weekly period because the butter price, which is the variable to be predicted, has a low degree of variation within smaller intervals.
The second data set contains the largest number of observations compared to the other data sets in this paper, with a total of 8449 observations. It contains crude oil prices from West Texas Intermediate, recorded daily over a span from January 1986 to May 2018. This time series had a small percentage of missing values spread over the whole time period, which have been imputed with linear interpolation.
The last two data sets are the S&P 500 (GSPC) and Nikkei 225 (N225), both extracted from the Yahoo Finance website and previously used in the ARIMA-LSTM comparison in Forecasting Economics and Financial Time Series: ARIMA vs. LSTM[8]. The time series have the same monthly intervals over the period between January 1985 and June 2018, as mentioned in that paper, with a total of 402 observations. The S&P 500 is an American stock market index based on the common stocks of 500 large companies listed on the NYSE or NASDAQ. The Nikkei 225 is a stock market index for the Tokyo Stock Exchange, with its prices denominated in Japanese yen.
The evaluation in the comparison between the two algorithms was performed on the most recent 20% of each time series. The prediction intervals were chosen to be 5, 13 and 26 weeks, corresponding approximately to forecasts of 1, 3 and 6 months respectively for the butter price data set. The same intervals were utilized for the other data sets for comparison purposes. Additionally, different lag window sizes were experimented with, which represent the number of previous observations used in each data sample. In the ARIMA case, a rolling forecast was implemented starting at the test data index. The model is trained on the initial 80% of the data, creating an out-of-sample forecast for a set time interval. From the resulting list of predictions the last element, corresponding to the last week, is then saved. Subsequently, the model is retrained after the addition of the next sample from the test set to the training set, and this complete process is repeated until the end of the data set, minus the forecasting window, has been reached. This creates an expanding training window until all the data, minus the forecasting window, has been used for evaluation purposes. After the application of the models, the accumulated last elements are used to compute the accuracies and RMSE scores per observation, which are ultimately averaged to get a single score.
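A rough sketch of this expanding-window procedure for ARIMA is given below, assuming the statsmodels ARIMA implementation; the order, interval and split fraction are parameters, and data-set-specific details are omitted.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def rolling_forecast(series, order=(2, 1, 0), interval=5, train_fraction=0.8):
    """Expanding-window forecast: refit the model after each added test sample
    and keep only the last element of each multi-step forecast."""
    split = int(len(series) * train_fraction)
    history = list(series[:split])
    last_step_predictions, targets = [], []

    # Stop early enough that the full forecast window still fits in the data.
    for t in range(split, len(series) - interval + 1):
        fitted = ARIMA(history, order=order).fit()
        forecast = fitted.forecast(steps=interval)
        last_step_predictions.append(forecast[-1])   # prediction for index t + interval - 1
        targets.append(series[t + interval - 1])     # matching observation
        history.append(series[t])                    # expand the training window

    return np.array(last_step_predictions), np.array(targets)
```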
Data preparation for the LSTM model included min-max normalization to a [-1, 1] range and the creation of a lagged-values data frame. The dimensions of this data frame are n x (h + l) and thus depend on the lag h and the lead time l parameters, as well as the number of samples n. The rows in the data frame form a sliding window across the time series. The first h columns contain the explanatory sequences and the following l columns the corresponding target sequences to learn. Each sample is a version of the previous sample shifted one time step into the future, thus with the addition of the next value in the time series and without the first entry of the previous sample. Internally, most LSTM implementations do not include differencing, as ARIMA does, to make time series stationary, so in this paper it is implemented manually outside of the network. Given that LSTMs are non-linear models and can learn such relationships, they do not necessarily require stationary time series. Nonetheless, experiments were conducted to see whether differencing improved performance. The data was split so that approximately 80% was used for training and 20% for the test set. The target sequences of the training data will partially contain sequences that are used in the test set, in the temporally adjacent samples that divide the training and test sets. In the instance where both the lead time and the lag are 2, the second element of the target sequence of the last training sample will also be the first entry of the target sequence of the first test sample. Consequently, given that a rolling forecast is utilized, each test sample would always partially contain information from the most recent row of the training set. To account for this leak of information, an offset equal to the size of the lead time is taken at the split index, which creates a gap between the training and test set. In order to evaluate the model on the same period as ARIMA, the offset slightly reduces the training set.
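The scaling and the lagged-values data frame described above could be constructed as in the following sketch, where `lag` and `lead` correspond to h and l; the column names and the example series are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def make_supervised_frame(series, lag=2, lead=5):
    """Build an n x (lag + lead) frame: each row is a sliding window whose first
    `lag` columns are inputs and whose last `lead` columns are target values."""
    scaler = MinMaxScaler(feature_range=(-1, 1))
    scaled = scaler.fit_transform(np.asarray(series, dtype=float).reshape(-1, 1)).ravel()

    window = lag + lead
    rows = [scaled[i:i + window] for i in range(len(scaled) - window + 1)]
    columns = [f"x_t-{lag - j}" for j in range(lag)] + [f"y_t+{j + 1}" for j in range(lead)]
    return pd.DataFrame(rows, columns=columns), scaler

frame, scaler = make_supervised_frame(np.arange(100.0), lag=2, lead=5)
X = frame.iloc[:, :2].to_numpy()   # explanatory sequences (h columns)
y = frame.iloc[:, 2:].to_numpy()   # target sequences (l columns)
```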
Subsequently, the data frame was split column-wise, to separate the target columns from the feature columns before feeding it into the network. As mentioned before, a similar approach to testing was employed as with ARIMA, namely a rolling forecast. In ARIMA's case, after expanding the window that is made up of the training set by adding the individual test set samples, the model is completely retrained. However, with regard to the LSTM, a different approach was taken, due to the significantly larger training time that would otherwise be required: the training would have to be repeated once for each test sample, multiplied by the number of epochs. Instead of completely retraining the model, the weights from the initial training split are saved, and after adding the consecutive test sample to the training data, the model continues to train from those initial weights for two more epochs.
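A sketch of this lighter rolling forecast for the LSTM is shown below, assuming a Keras model that has already been compiled and fitted on the initial training split; instead of retraining from scratch, training simply continues from the current weights for two more epochs after each test sample is added.

```python
import numpy as np

def lstm_rolling_forecast(model, X, y, split, epochs_per_step=2, batch_size=5):
    """Expanding-window evaluation that reuses the weights from the initial fit.
    X is assumed to have shape (n, 1, lag) and y shape (n, lead)."""
    predictions = []
    for t in range(split, len(X)):
        # Predict the target sequence for the current test sample and keep the last step.
        pred = model.predict(X[t:t + 1], batch_size=1)
        predictions.append(pred[0, -1])

        # Add the test sample to the training data and continue training from the
        # current weights for a couple more epochs instead of refitting from scratch.
        model.fit(X[:t + 1], y[:t + 1], epochs=epochs_per_step,
                  batch_size=batch_size, shuffle=False, verbose=0)
    return np.array(predictions)
```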
The first reason for using the rolling forecast approach for the LSTM is to prevent attributing differences in performance to an inconsistent number of training samples, as a rolling forecast incorporates more training data into the model than a traditional train-test split. Secondly, when attempting to predict future values, the more recent data samples will typically have more predictive power than the older values in a temporal sequence. Hence, when a traditional split is used there is a relatively large time gap between the last sample of the training set and the most recent samples of the test set. Additionally, in practice the models will be trained on all of the available data, so performing an evaluation using a rolling forecast, which incorporates the more recent data, corresponds more closely to that practical scenario. Finally, when using time series in a machine learning model, k-fold cross-validation is not a valid evaluation method: if the data is split into equal parts, arbitrary shuffling of the folds would yield inconsistent configurations in which past values would be predicted using future values. Hence, an expanding window approach such as the rolling forecast can be utilized as a substitute for cross-validation.
Another method that could have been employed for this purpose is the sliding window evaluation. Initially, a subset with a fixed size would be selected that starts at the first index of the complete data set. This selected window would then be split into a training set and a test set and evaluated accordingly. Afterward, the window would be shifted across the data set, repeating the previous step. However, the sliding window size is dependent on the size of the entire data set and also determines the number of shifts that are possible. Consequently, such an approach is only usable if the data set is large enough to allow a significant number of shifts and for the subsets to be large enough to sufficiently capture the dependencies that exist within the time series. The stock index price data sets, as well as the butter price data set, are too small to satisfy these conditions. Therefore, the rolling forecast was chosen over the sliding window approach.
3.2 Results
The results can be observed in tables 1 and 2. Table 1 reports the reproduction attempt of one-step-ahead predictions as performed in Forecasting Economics and Financial Time Series: ARIMA vs. LSTM[8], with the same model parameters ARIMA(5,1,0) and two of the same data sets. The target columns report the RMSE values that were found in that paper, and the LSTM and ARIMA columns are the results obtained with the models in this paper. While the ARIMA evaluations are similar, the LSTM results are off target by a large margin, as well as having higher error rates than the ARIMA model. This trend continues across the findings in table 2, where ARIMA produces lower error rates and higher accuracies than the LSTM model for all intervals and data sets. The interval values in table 2 represent the number of time steps ahead for which the predictions are performed. Furthermore, it is evident that the relative differences between the RMSE values get larger as the interval increases. The accuracies are fairly consistent with the RMSE values in terms of the previous findings. However, the errors vary based on the average monetary values expressed in each time series, and given the relative character of the accuracy measure, it is more stable across the different data sets. Finally, table 3 reports the influence of the lag parameter on the performance of the ARIMA model in terms of RMSE.
Table 1: Reproduction attempt RMSE

Time series   LSTM         LSTM target   ARIMA          ARIMA target
GSPC          62.73733     7.814         55.33580524    55.3
N225          750.011821   105.315       739.3906823    766.45

Table 2: Performance results

Interval 5
Time series       ARIMA RMSE   ARIMA Accuracy   LSTM RMSE   LSTM Accuracy
Butter price      0.292        0.963            0.44        0.935
Crude oil price   1.222        0.986            2.609       0.968
GSPC              56.267       0.977            109.503     0.954
N225              759.312      0.962            1762.563    0.905

Interval 13
Butter price      0.474        0.941            0.989       0.825
Crude oil price   1.226        0.986            4.36        0.945
GSPC              75.24        0.969            226.937     0.904
N225              771.834      0.96             3213.036    0.833

Interval 26
Butter price      0.371        0.951            1.256       0.787
Crude oil price   1.259        0.985            6.526       0.918
GSPC              134.178      0.938            359.725     0.845
N225              850.953954   0.955            4163.072    0.796

Table 3: Lag experiment RMSE

Lag   GSPC      N225
1     57.039    758.209
2     56.267    759.312
3     58.7745   759.341
4     61.110    762.977
5     64.454    773.470
6     63.786    755.649
7     65.723    756.287
8     66.248    759.366
9     66.464    768.656

3.3 Discussion

The model parameters that were utilized to produce the results in table 2 for ARIMA were (2,1,0), where 2 is the lag, 1 represents first-order differencing and the 0 indicates the absence of moving average terms.
The lag was, however, set to 5 to produce the outcomes in table 1, in order to test whether the performance values were comparable to the experiments in Forecasting Economics and Financial Time Series: ARIMA vs. LSTM[8]. Empirical experiments with the lag parameter on two of the data sets, whose results are depicted in table 3, demonstrate that utilizing a larger lag does not significantly improve performance, with the exception of what appeared to be a slight improvement in a few outlying instances. In configurations where moving average terms were added, it could also be observed that there was no significant performance increase, at least within a parameter range of 1-3. Higher values were not feasible, as they caused invertibility errors and convergence problems.
As for the LSTM implementation, the number of epochs was set to 500, the lag to 2 and the batch size to 5 for all the data sets except the crude oil data set, for which the batch size was set to 100, given that it is a larger data set. The number of epochs corresponds to the number of training iterations using gradient descent on the complete data set. The authors of Forecasting Economics and Financial Time Series: ARIMA vs. LSTM[8] reported that there was no evidence that more than 1 epoch yielded lower error rates. When experiments were performed with the number of epochs, this indeed seemed to be the case for single-step forecasts. However, this finding did not generalize to multi-step forecasts, which require a more complex model, as larger numbers of epochs resulted in lower RMSE values. The batch size determines the size of the subsets on which the training is iteratively performed. In the experiments, higher batch sizes yielded faster training times at the cost of performance and increased the variation in the results when running the exact same model. Furthermore, the model architecture consisted of 5 LSTM units in a single LSTM layer that feeds into a dense layer. Increasing the number of layers did not significantly improve performance and resulted in longer training times. Finally, an Adam optimizer and a mean squared error loss function were utilized when fitting all the LSTM models.
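Based on the parameters listed above (a single LSTM layer with 5 units feeding into a dense output layer, a mean squared error loss and an Adam optimizer), the model could be defined roughly as in the following Keras sketch; this is a reconstruction from the description in this section, not the author's original code, and the training data here are random placeholders.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

lag, lead = 2, 5    # lag window and forecast interval used for most data sets

model = Sequential([
    LSTM(5, input_shape=(1, lag)),   # single LSTM layer with 5 units
    Dense(lead),                     # dense output layer producing the multi-step forecast
])
model.compile(loss="mean_squared_error", optimizer="adam")

# Placeholder training data with the expected shapes: (samples, 1, lag) and (samples, lead).
X_train = np.random.normal(size=(100, 1, lag)).astype("float32")
y_train = np.random.normal(size=(100, lead)).astype("float32")

# The section above reports 500 epochs and a batch size of 5; a short run is used here for illustration.
model.fit(X_train, y_train, epochs=2, batch_size=5, shuffle=False, verbose=0)
```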
As was mentioned in the results section, the ARIMA model outperformed the LSTM model in every experiment, even in the reproduction attempts in table 1. In order to explain the homogeneity of these results, the two LSTM approaches should be compared. In the architecture of the LSTM approach of Forecasting Economics and Financial Time Series: ARIMA vs. LSTM[8], 4 LSTM units were used with a single LSTM layer, as can be observed in their provided code snippet. The number of epochs was set to 1. Other parameters that were observed in the code snippet included a mean squared error loss function, an Adam optimizer, and enabled statefulness. Compared to the parameters listed earlier in this section, few differences can be perceived. The numbers of LSTM units are nearly identical, and the one extra LSTM unit cannot account for the large difference in performance. Additionally, experiments showed that a lower number of units did not result in lower error rates. Neither did the number of epochs or statefulness influence the results by such large margins. The authors mention that, for the purpose of reducing the complexity of their algorithm, a few other manipulations were performed, such as the addition of dense layers, specific batch sizes and transformations. However, the specifics behind these were not reported. Furthermore, in the description of their rolling forecast it is mentioned that the model is re-estimated after each test sample, but this is not documented in the code snippet. It is likely that the homogeneity of the results can be attributed to these unreported modifications.
Conclusion. One of the subquestions stated in the introduction section was: whether there are specific time frames in which the chosen models would be more effective, and is a model thus more accurate in short-term predictions, for periods shorter than a month, or in longer-term predictions, for several months ahead. Solely based on the results from table 2, it can be observed that ARIMA outperforms the LSTM models in all cases. Hence, the LSTM would not provide better predictions over any time window, whether short-term or long-term.
Furthermore, figure 1 depicts the RMSE and accuracy of the ARIMA model as a function of the weekly lead time on the butter price data set. As expected, the evaluations of the model get progressively worse, but the model is quite robust, as the accuracy only starts to dip beneath 90% after the lead time exceeds 100 weeks. Whether that is an acceptable threshold for determining effectiveness will depend on the conditions imposed in a practical setting and on the particular data set.

Figure 1: The evaluation of the ARIMA model on the butter price time series as lead time is increased.
As it currently stands, ARIMA has only advantages over the LSTM model: it is faster to train and predict with, and it is less complex in terms of implementation. This also generalizes to the different data sets with varying sizes and time intervals. However, the results are insufficient to conclude an affirmative answer to the central research question: Can a Long Short-Term Memory based model that is trained on time series outperform an autoregressive integrated moving average model in the task of predicting commodity prices? It is possible, however, to conclude that this particular LSTM approach does not outperform the ARIMA model for either single-step or multi-step forecasting, and that more configurations of this model should be investigated.
REFERENCES
[1] George Edward Pelham Box and Gwilym Jenkins. 1990. Time Series Analysis, Forecasting and Control. Holden-Day, Inc., San Francisco, CA, USA.
[2] K. Chen, Y. Zhou, and F. Dai. 2015. A LSTM-based method for stock returns prediction: A case study of China stock market. (Oct 2015), 2823–2824. https://doi.org/10.1109/BigData.2015.7364089
[3] Xiaochen Chen, Lai Wei, and Jiaxin Xu. 2017. House Price Prediction Using LSTM. CoRR abs/1709.08432 (2017).
[4] Jan G. De Gooijer and Rob J. Hyndman. 2006. 25 years of time series forecasting. International Journal of Forecasting (2006).
[5] Jan G. De Gooijer and Rob J. Hyndman. 2006. 25 Years of Time Series Forecasting.
[6] Dragan Ivanisevic, Beba Mutavdzic, Nebojsa Novkovic, and Natasa Vukelić. 2015. Analysis and prediction of tomato price in Serbia. 62 (01 2015), 951–962.
[7] S. Selvin, R. Vinayakumar, E. A. Gopalakrishnan, V. K. Menon, and K. P. Soman. 2017. Stock price prediction using LSTM, RNN and CNN-sliding window model. (Sept 2017), 1643–1647. https://doi.org/10.1109/ICACCI.2017.8126078
[8] Sima Siami-Namini and Akbar Siami Namin. 2018. Forecasting Economics and Financial Time Series: ARIMA vs. LSTM. CoRR abs/1803.06386 (2018).
[9] Yue Wang, XingYu Ye, and Yudan Huo. 2011. Prediction of household food retail prices based on ARIMA Model. (July 2011), 2301–2305. https://doi.org/10.1109/ICMT.2011.6002376
[10] M. Xie, C. Sandels, K. Zhu, and L. Nordström. 2013. A seasonal ARIMA model with exogenous variables for elspot electricity prices in Sweden. (May 2013), 1–4. https://doi.org/10.1109/EEM.2013.6607293
4 APPENDIX
Figure 2: The above plot depicts the Dutch butter price time series before and after differencing.

Figure 3: The above plot depicts the WTI crude oil price time series before and after differencing.

Figure 4: The above plot depicts the S&P 500 index time series before and after differencing.

Figure 5: The above plot depicts the Nikkei 225 index time series before and after differencing.