
MSc Software Engineering

Master Thesis

Automated time series model discovery

Study of symbolic regression forecasting effectiveness on M3 competition data

by

Bartol Karuza

10495495

August 30, 2018

Supervisors:

prof.dr. M.R.H. Mandjes

dr. Haralambie Leahu

J. Cremers

G. Koot

Assessor:

dr. A.M. Oprescu


0.1 Abstract

This research examines the effectiveness of symbolic regression, an application of genetic programming [1], at forecasting univariate economic time series. First I examine the forecasting accuracy on generated, well known time series patterns with added white noise. The algorithm proves effective at uncovering the original signal to a high degree and outperforms basic statistical methods on this task. Next, the symbolic regression algorithm is used to forecast the time series in the M3 competition data set. The post-sample accuracy is a low average symmetric mean absolute percentage error (sMAPE) of 11.57, 0.51 higher than that of Theta [2] on the same data set. The symbolic regression algorithm was particularly effective at evolving micro economic time series models, outperforming Theta by 0.99 sMAPE on that type of time series. This result is a positive example of using symbolic regression for time series model discovery in specific domains.


Chapter 1

Introduction

The art and science of forecasting is concerned with estimating a future state based on currently available information. Various quantitative techniques have been proposed to generate predictions based on past values, ranging from simple statistical models, such as the Holt-Winters method [3], up to more complex statistical models such as the Box-Jenkins approach to ARIMA [4]. Since the early 1990s various machine learning methods for time series forecasting have been proposed, such as various Neural Networks [5] and Support Vector Machines [6]. The most prominent competition between these forecasting techniques is the series of M-competitions [7, 8, 9], named after Spyros Makridakis. These competitions are held roughly every 10 years and invite the experts of the different schools of forecasting to make predictions on a shared data set. Using a shared data set allows all of these techniques to be compared with each other objectively. All participants were asked to make forecasts for a certain number of intervals into the future, after which the real values were compared with the forecasts, resulting in comparable average sMAPE scores between all forecasting techniques.

Genetic programming was proposed by Koza [1] as a genetic algorithm that evolves programs by scoring each program on a particular fitness. Genetic programming has been used successfully to generate circuit boards [10] and even to evolve neural network architectures [11]. Symbolic regression is an application of genetic programming that searches for the best fitting relationship, in the form of a formula, between input features and output values. Symbolic regression can be used to find an effective formula that describes the relationship between past values of a time series and a future value. Symbolic regression has been used to successfully forecast hydraulic systems [12] and stock market prices [13], among other applications.

This research explores the effectiveness of symbolic regression as a forecasting technique, by applying it to a series of artificial data sets that each exhibit a particular feature relevant to time series forecasting. Furthermore, the effectiveness of symbolic regression is explored in the context of the M-competition data sets, particularly the M3 data set. By looking at that particular data set, it


becomes possible to compare the effectiveness of symbolic regression to many other forecasting techniques.

The thesis is organized as follows: Chapter 2 gives the reader an overview of time series forecasting and genetic programming. Chapter 3 describes the experiments on which this research is based. Chapter 4 contains the experimental results. Chapter 5 discusses the implications of the results. In Chapter 6 several directions for future research are described. The thesis concludes in Chapter 7.


Chapter 2

Background

The background chapter provides contextual information about forecasting, symbolic regression, the combination of the two and a discussion of prior work.

2.1 Forecasting

Forecasting is the prediction of a future state of a system, based on the past state and possibly other values external to the system. Forecasting can be split into two categories, qualitative and quantitative. Whereas qualitative forecasting is concerned with human judgment, quantitative forecasting uses statistical and machine learning techniques to generate a forecast based on numerical and categorical data. This research is only concerned with quantitative forecasting.

2.1.1 Time series

Time series are sequences of discrete time data points with a time ordering or with a time index. One dimensional time series without a time index are typically considered to have equal intervals on the time axis, with the interval described separately from the series. Univariate time series are time series with only a single variable; multivariate time series have multiple dependent variables. The traditional components of a time series are the level, the trend, the seasonality and the noise. The level is equal to the mean of the time series and all other components are deducted from it. The trend is the direction of the time series over time. Seasonality refers to a repeating pattern in the series over a fixed number of intervals. A time series can contain more than one seasonal component, with a varying number of intervals for each of the seasonality components.


2.1.2 Signal and noise

Any point in a time series can be described as

y_t = s_t + n_t

where s is the signal and n the noise. White noise plays an important role in forecasting: a time series is white noise if its values have zero mean, constant variance and no temporal correlation. After forecasting, the remaining forecast error should be a white noise time series. If it is not, this is an indication that the model does not yet completely capture the signal and that improvement of the model is theoretically possible.

2.1.3 Auto Regression

Auto regression describes a relationship between past values of a series and future values. It assumes a correlation between those values, called autocorrelation. Autocorrelation is an important indicator that a meaningful prediction can be made from past values and distinguishes predictable time series from white noise. Auto regressive (AR) models describe the current value of a series as a linear combination of past values plus a noise component. The AR model is part of more complex statistical forecasting techniques such as ARIMA [4].
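As a minimal illustration (not part of the original experiments), a one-step-ahead AR(p) forecast can be written as a short function; the coefficients and history below are placeholder values.

```python
import numpy as np

def ar_one_step_forecast(series, coefficients, intercept=0.0):
    """One-step-ahead AR(p) forecast:
    y_t = c + phi_1 * y_{t-1} + ... + phi_p * y_{t-p} (noise term omitted)."""
    p = len(coefficients)
    recent = np.asarray(series[-p:], dtype=float)[::-1]   # y_{t-1}, y_{t-2}, ..., y_{t-p}
    return intercept + float(np.dot(coefficients, recent))

# Example with an illustrative AR(2) model.
history = [1.0, 1.2, 1.1, 1.3, 1.25]
print(ar_one_step_forecast(history, coefficients=[0.6, 0.3], intercept=0.1))
```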

2.1.4 Exponential Smoothing

Exponential smoothing is a mathematical technique for time series. It allows previous values to be incorporated in a forecast, putting more weight on the most recent previous value and reducing the weight for older values exponentially. The following example is the exponentially smoothed form of a naive (level based) forecast:

ŷ_{x+1} = α · y_x + (1 − α) · ŷ_x

This formula generates a one step ahead forecast based on previous time series values. How much weight is applied to the more recent values compared to values further in the past is controlled by the value of the α coefficient, where a higher value means more weight on recent values compared to older values. Single exponential smoothing is used to make lagging indicators such as the exponential moving average.
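A minimal sketch of this recursion in Python (the smoothing coefficient and series values are illustrative):

```python
import numpy as np

def simple_exponential_smoothing(y, alpha=0.3):
    """One-step-ahead forecasts; forecasts[t] predicts y[t] from y[0..t-1]."""
    y = np.asarray(y, dtype=float)
    forecasts = np.empty_like(y)
    forecasts[0] = y[0]                                   # initialize with the first observation
    for t in range(1, len(y)):
        forecasts[t] = alpha * y[t - 1] + (1 - alpha) * forecasts[t - 1]
    return forecasts
```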

Using the same principle, multiple components of the time series can be expressed as an exponential smoothing formula. The commonly used and effective exponential smoothing formula that combines all three components into one is called triple exponential smoothing or the Holt-Winters method [3]. This method consists of three smoothing formulas: one for the trend, one for the seasonality and one for the level.


They are all combined into a single forecasting formula (in additive form), shown below:

ŷ_{t+h|t} = l_t + h · b_t + s_{t−m+h_m^+}
l_t = α(y_t − s_{t−m}) + (1 − α)(l_{t−1} + b_{t−1})
b_t = β*(l_t − l_{t−1}) + (1 − β*)b_{t−1}
s_t = γ(y_t − l_{t−1} − b_{t−1}) + (1 − γ)s_{t−m}

where l is the level component with smoothing controlled by α, b is the trend component with smoothing controlled by β and s is the seasonality component with smoothing controlled by γ. When applying this method to actual data, a regression analysis is performed to find the most appropriate smoothing coefficients for each of the components on the particular time series. One of the limitations of the Holt-Winters method is that it only incorporates a single seasonality.
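For reference, a sketch of fitting an additive Holt-Winters model with the statsmodels library, assuming that library is available; the generated series and its parameters are purely illustrative:

```python
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Illustrative monthly series with a linear trend and a 12-month seasonality.
t = np.arange(120)
y = 0.3 * t + 10 * np.sin(2 * np.pi * t / 12) + np.random.normal(scale=2.0, size=120)

# Additive level, trend and seasonal components; alpha, beta and gamma are estimated by fit().
model = ExponentialSmoothing(y, trend="add", seasonal="add", seasonal_periods=12)
fitted = model.fit()
forecast = fitted.forecast(18)   # 18 steps ahead, matching the M3 horizon
```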

2.1.5 M-competitions

Spyros Makridakis and Michele Hibon started the M-competitions in 1982 [14] to provide an objective comparison between different forecasting techniques on a single data set. The motivation for the competition was criticism from the ARIMA community on earlier forecasting research by Makridakis and Hibon, which compared the results of the ARIMA technique unfavorably with simpler statistical methods. The criticism was that the unfavorable results were due to the inexperience of Makridakis and Hibon in using the technique. To counter that argument, the first M-competition was organized, inviting the experts of each respective forecasting technique to participate directly. From the second competition, M2 [7], the data that participants were asked to forecast would become available in real time after the competition submission deadline. The M3 competition [8] expanded the number of time series drastically, to 3003 from the 24 used in M2. In the first three competitions the results show that a combination of simple statistical models can outperform complex statistical models and machine learning models on objective error measures. The winner of the latest competition, M4 [9], used a hybrid approach called the exponential smoothing recurrent neural network, which combines hand coded statistical techniques with trained neural networks. This research uses the M3 competition data set.

2.2 Genetic Programming

Genetic programming is an evolutionary heuristic search algorithm. The search space consists of programs that are invoked with the input data. The result is compared to some desired value. The individual programs with the best fitness survive and are used to create new individuals. The programs can be actual machine instructions or some higher level representation, such as mathematical formulas in symbolic regression or control sequences for video games [15].


2.2.1 Genetic Algorithms

Genetic Programming is a special case of a genetic algorithm. Genetic algorithms were first proposed by Alan Turing [16] in 1950, under the name 'learning machine'. A genetic algorithm is a heuristic algorithm inspired by evolutionary biology. It consists of a population of individuals, a fitness function, cross-over, mutation and evolution over multiple generations. The algorithm starts with a number of individuals that make up the population. These individuals can be randomly generated or initialized to some value of known high fitness. The individuals are scored using the fitness function, the output of which determines an individual's chances to be selected for the next generation. After the first generation, each consecutive generation is made up of either unmodified individuals from the previous generation, mutated individuals, or mixed individuals created through cross-over. The algorithm continues until a set number of generations has been evolved or until a stopping criterion has been reached, such as a desired fitness level for at least one individual in a generation. The representation of the individual varies across genetic algorithms, but it can be as simple as a list of numbers, of which individual numbers are mutated and ranges from different individuals are combined to create new individuals during cross-over.
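The loop below is a minimal sketch of such a generational genetic algorithm. The callables random_individual, fitness, mutate and crossover are problem-specific placeholders, and the selection proportions mirror the values used later in this thesis.

```python
import random

def run_ga(random_individual, fitness, mutate, crossover,
           population_size=1000, generations=200, tournament_size=4,
           p_crossover=0.2, p_mutation=0.6):
    """Minimal generational GA (lower fitness is better)."""
    population = [random_individual() for _ in range(population_size)]
    for _ in range(generations):
        scored = [(fitness(ind), ind) for ind in population]

        def tournament():
            # Fittest out of a small random sample; small tournaments preserve diversity.
            return min(random.sample(scored, tournament_size), key=lambda pair: pair[0])[1]

        new_population = []
        while len(new_population) < population_size:
            r = random.random()
            if r < p_crossover:
                new_population.append(crossover(tournament(), tournament()))
            elif r < p_crossover + p_mutation:
                new_population.append(mutate(tournament()))
            else:
                new_population.append(tournament())   # unmodified reproduction
        population = new_population
    return min(population, key=fitness)
```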

2.2.2 GP representation

The original genetic programming representation of an individual is tree based [1].

Figure 2.1: Example of an individual tree

Over the years various alternative representations have been proposed, such as linear GP [17] or Cartesian GP [18]. Some promising results for time series forecasting have been published using a recurrent form of Cartesian GP [19], which could be a simple solution to the lag operator. The current research is limited to tree based GP.


2.2.3 Symbolic Regression

Symbolic regression searches the space of mathematical equations that fit a particular data set in terms of accuracy and simplicity. Each equation is made up of operators, constants and variables from the data set. Genetic programming is typically used to evolve these expressions, by recombining high fitness individuals from the initial random population. Symbolic regression is an optimization problem.

Given: a fitness function f : F → ℝ from the set F of all valid mathematical functions to a real number.

Sought: a function x_0 ∈ F such that f(x_0) ≤ f(x) for all x ∈ F.

The formulation above assumes minimization of the fitness, but maximization is also possible. For forecasting, minimization of the error metric is used.

The set of all valid mathematical functions given a limited set of operators and no constraint on the number of nodes in the function is infinitely large. A well known characteristic of genetic programming is that over the generations the average individual size starts to increase without a proportional increase in fitness [20]; this is known as bloat. Applying parsimony pressure ensures that simplicity is taken into account in addition to raw fitness.

Given: a fitness function f : F → ℝ from the set F of all valid mathematical functions to a real number, and a size function s : F → ℤ from the set F to an integer.

Sought: a function x_0 ∈ F such that f(x_0) ≤ f(x) and s(x_0) ≤ s(x) for all x ∈ F.

There are multiple approaches to balancing the size and the fitness functions, such as combining size and fitness into one weighted fitness score. Another approach is a multi-objective fitness function, selecting non-dominated candidates from the Pareto front [21]. For this experiment parsimony pressure was applied by adding a size penalty to the fitness function, controlled by a parsimony coefficient. The resulting fitness function becomes f(x) = e(x) + s(x) · ρ, where ρ is the parsimony coefficient and e(x) is the fitness error function.
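In code, this penalized fitness is a one-liner; the parsimony coefficient value shown is only illustrative:

```python
def penalized_fitness(raw_error, tree_size, parsimony_coefficient=0.001):
    """f(x) = e(x) + s(x) * rho: raw fitness error plus a node-count penalty."""
    return raw_error + tree_size * parsimony_coefficient
```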

2.2.4 Symbolic Regression for Forecasting

Symbolic regression can be used for time series forecasting, by searching the space of possible mathematical relationships between previous time series values and one or more future values. For multivariate time series forecasting, multiple variables can be provided to the algorithm as inputs or multiple complete time series.

For univariate time series analysis the forecast function can be expressed as

ŷ_n = f([y_0 .. y_{n−1}])

where f is the individual tree output by symbolic regression, ŷ_n is the forecast for the value at index n and [y_0 .. y_{n−1}] are the previous values in the series.


An important property of time series is the temporal dependency between values. Standard symbolic regression does not treat time series data differently from any other input data set. The temporal relationship is captured by the individuals themselves, if that proves effective at increasing the forecasting accuracy. This means that no modifications are required to make the basic symbolic regression algorithm compatible with time series forecasting.

The fitness function can be a simple absolute error function that is checked against each of the series and each of the individual trees in the generation.

ê = |ŷ_n − y_n|

where ê is the absolute forecast error and y_n is the actual value. The average of ê across all series that need to be forecast is the Mean Absolute Error (MAE), which is discussed in the experimental setup chapter, together with the more sophisticated metric symmetric Mean Absolute Percentage Error (sMAPE).

2.2.5 Prior work

Genetic programming has been used extensively in the field of forecasting, for example for rainfall-runoff forecasting [22], stock price prediction [13] and forecasting energy consumption [23].

Researchers have examined the ability of symbolic regression to find underlying mathematical models in dynamical systems [24, 12] with high accuracy. Specialized forms of genetic programming have been developed for particular forecasting purposes, such as long term forecasting [25] or speed of training execution [26].

Applying some of these modifications to the simple tree based genetic programming algorithm could improve forecasting accuracy further, but for the purpose of this experiment the standard tree based symbolic regression with only mathematical operators is used, to establish a baseline forecasting accuracy measurement on the M3 data set.


Chapter 3

Research Methodology

This chapter describes the experiments, the data set and the data preprocessing used in the experiment. The goal of the research is to measure the effectiveness of symbolic regression as a forecasting technique and to uncover the conditions under which symbolic regression performs better than other forecasting techniques.

The data set to which this research applies symbolic regression consists of several categories of artificial time series data created specifically for this research and the M3 forecasting competition data set, as used by Ahmed [27] and Makridakis [28].

3.1 Forecasting

This research explores the effectiveness of tree based symbolic regression for forecasting univariate time series. The regression function searches the space of all mathematical formulas that describe a relationship between past values and a single future value.

3.1.1 Technology

The symbolic regression implementation is based on gplearn, a Python based genetic programming library specialized in symbolic regression. The custom tests are implemented in Python 3.6 in an Anaconda environment. The training runs of the symbolic regression algorithm are executed in a distributed environment on Amazon Web Services, using AWS SageMaker on ml.c5.4xlarge instances.

3.1.2 Operator Set

This research has assumed a stable list of operators as input for the symbolic regression algorithm: Add, Subtract, Multiply, Divide, Square root, Inverse,


Logarithm, Absolute, Sine, Cosine and Tangent. Sine and Tangent appear frequently in successful individuals across runs, capturing periodic patterns in the data. Conditional operators and more complex operators such as a lag operator were considered, but not included for this experiment.

3.1.3 Terminal Set

Each of the previous time series values within the input window is a terminal for the symbolic regression algorithm. In addition to these values, constants in the range −1..1 are included in the terminal set.

3.1.4 Fitness

During training the fitness is evaluated using the built-in gplearn Mean Absolute Error (MAE) metric. This metric was the most performant of the built-in and custom metrics tried (such as sMAPE), which makes it well suited for the performance-critical path of calculating the fitness of many individuals during training. MAE is defined as:

MAE = (1/n) · Σ_{t=1}^{n} |ŷ_t − y_t|

where ŷ_t is the predicted one-step-ahead value at time t, y_t is the actual value and n is the total number of predictions, which equals the forecasting horizon. In all cases where a MAE for multiple time series is reported, the average MAE over all series is calculated.

The out-of-sample fitness and overall forecasting accuracy are expressed by the symmetric Mean Absolute Percentage Error (sMAPE) on the original data (after reversing the scaling and centering):

sMAPE = (100% / n) · Σ_{t=1}^{n} |F_t − A_t| / ((|A_t| + |F_t|) / 2)
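A direct implementation of this metric, as a minimal sketch:

```python
import numpy as np

def smape(forecast, actual):
    """Symmetric mean absolute percentage error, in percent, as defined above."""
    forecast = np.asarray(forecast, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return 100.0 / len(actual) * np.sum(
        np.abs(forecast - actual) / ((np.abs(actual) + np.abs(forecast)) / 2.0))
```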

3.1.5 Other Hyper Parameters

Tournament Size The tournament size parameter specifies the number of participants in each tournament for selecting a new individual's parent(s). If the tournament size equals the population size, only the highest scoring individual from the previous generation is selected for crossover or mutation. A tournament size of 1 means that a random individual is selected from the previous generation, which amounts to no survival pressure through fitness and a completely random selection of the new generation's individuals. The tournament size thus determines the diversity of each generation. A value of 4 was selected to promote a very diverse population and prevent early convergence on dominating individuals.


Population Size The population size parameter specifies the number of individuals in each generation and determines the computational cost of a generation. A value of 1000 was selected.

Crossover The crossover parameter specifies the fraction of individuals in each generation that are created using the crossover method; a value of 0.2 was selected. Higher values led to more bloat in the experiment.

Subtree Mutation Subtree mutation replaces a random subtree of an individual with a randomly generated subtree [1]. The subtree mutation parameter specifies the fraction of individuals that are created using the subtree mutation method, expressed as a floating point value between 0 and 1. For this experiment a value of 0.2 was selected.

Hoist Mutation Hoist mutation takes a subtree from a selected individual and makes that subtree the next individual, reducing the size of the resulting tree [29]. This parameter specifies the fraction of individuals that are created using the hoist mutation method, expressed as a floating point value between 0 and 1. For the experiment a value of 0.2 was selected.

Point Mutation Point mutation changes a node in a selected individual: a terminal is replaced by another terminal or an operator by another operator. This parameter specifies the fraction of individuals that are created using the point mutation method, expressed as a floating point value between 0 and 1. For the experiment a value of 0.2 was selected.

All of the above mutation and crossover methods combine to a probability of 0.8, leaving 0.2 of the next generation to be unmodified selected tournament winners.

Initialization Method The initialization method is set to a static value of 'half and half' for all tests in this research.

Initial Population Depth The initial depth parameter has considerable influence on the run time of the training runs. It consists of two integer values that together specify the range of tree depths for the generated individuals. Typical values that were effective are [2, 12], but this value was varied across the experiments.

Stopping Criteria The stopping criteria parameter causes a training run to be stopped when the fitness of the best individual is below (or above, for 'higher is better' fitness) the value of this parameter. It is set to a static value of 0.000001 for all tests in this research. Considering the stochastic nature of time series forecasting, the stopping criterion was never reached for time series with noise components.

Window Size The window size specifies the range of input values or features available to the symbolic regression algorithm. The value is set to 36 for all tests in this research.

Horizon The horizon parameter specifies the horizon up to which the forecast is made. The value is set to 18, to match [8], for all tests.
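Putting the hyper parameters above together, the estimator can be configured roughly as follows. This is a sketch based on gplearn's SymbolicRegressor; the parsimony coefficient and the number of generations are illustrative, since their exact values varied across experiments.

```python
from gplearn.genetic import SymbolicRegressor

estimator = SymbolicRegressor(
    population_size=1000,
    generations=500,                   # varied per experiment (200-1000 in the runs reported)
    tournament_size=4,                 # small tournaments keep the population diverse
    const_range=(-1.0, 1.0),           # terminal constants in the range -1..1
    function_set=('add', 'sub', 'mul', 'div', 'sqrt', 'inv',
                  'log', 'abs', 'sin', 'cos', 'tan'),
    metric='mean absolute error',      # in-sample fitness (MAE)
    parsimony_coefficient=0.001,       # size penalty; the exact value is illustrative
    p_crossover=0.2,
    p_subtree_mutation=0.2,
    p_hoist_mutation=0.2,
    p_point_mutation=0.2,              # remaining 0.2: unmodified tournament winners
    init_method='half and half',
    init_depth=(2, 12),
    stopping_criteria=0.000001,
)
# estimator.fit(X, y) expects 36 window features per row (see Section 3.2).
```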

3.1.6 Forecasting multiple horizons

There are several alternatives for forecasting multiple horizons [28], of which two were considered for this research. The first one is iterative forecasting,


where one-step-ahead forecasts are produced and each forecast is used as the most recent input for the next step's forecast. The second approach, multi-model forecasting, runs symbolic regression once for each of the horizons, creating a separate model per horizon. Limited testing of the multi-model approach showed no significant improvement over the iterative approach, and this research therefore focuses only on the iterative approach. The multi-model approach is referred to as direct in Makridakis [28].
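A sketch of the iterative approach, assuming a fitted estimator with a scikit-learn style predict() such as gplearn's SymbolicRegressor:

```python
import numpy as np

def iterative_forecast(estimator, last_window, horizon=18):
    """Multi-step forecast that feeds each one-step-ahead prediction back into the window."""
    window = list(last_window)
    forecasts = []
    for _ in range(horizon):
        next_value = float(estimator.predict(np.array([window]))[0])
        forecasts.append(next_value)
        window = window[1:] + [next_value]   # slide the window forward over the forecast
    return forecasts
```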

3.2 Data Set windowing

Symbolic regression requires a fixed number of input values, called terminals. Each of these input values is assigned a designation consisting of X and a number, ranging from 0 to n where n is the number of input terminals; an example is X12. The terminals are used directly in the symbol trees. To be able to handle time series of different lengths, I select a fixed window size ws, which is the input range of time series data points and corresponds to n. The shortest monthly time series in the M3 data set contains 68 data points. The last 18 data points are defined as the test set, so they need to be considered out of sample. For all tests the window size is fixed at ws = 36, which contains sufficient information for at least three seasons of 12 months. For training purposes the whole range of the time series is used, by creating x/y pairs of an input range of size ws and one output value. The symbolic regression estimator receives 36 values as input terminals and the 37th value is the output value that the fitness is compared against. Across each time series, this window of 36 values slides forward until the end of the time series is reached.

Figure B.1 is an example of the test window input features for one of the M3 time series, showing the window on the series in the 36 features.

The windowing creates a data set for one-step-ahead predictions. These one-step-ahead predictions can be used to generate the 18-steps-ahead forecasts of the M3 competition, by treating the predicted values of previous steps as ground truth.
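A minimal sketch of the windowing step for a single series (ws = 36):

```python
import numpy as np

def window_series(series, window_size=36):
    """Slide a fixed-size window over one series, yielding (X, y) training pairs.

    Each row of X holds window_size consecutive values (terminals X0..X35) and the
    corresponding y entry holds the next value, which the fitness is compared against.
    """
    X, y = [], []
    for start in range(len(series) - window_size):
        X.append(series[start:start + window_size])
        y.append(series[start + window_size])
    return np.array(X), np.array(y)
```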

3.3 Artificial Data

Most statistical forecasting models are based on assumptions about the nature of the time series under forecast. These assumptions make up the components of a forecasting model. In this experiment we look at the different types of components individually, by generating separate time series for each component in isolation. A symbolic regression forecast is examined on these time series as well as on the same time series with additional noise.

Using the same signal value, different variants are generated with new white noise added. The training set consists of 100 series and the test set of 20 series, for each of the different artificial data categories. Each series has 140 values. With the windowing approach at ws = 36 this results in 104 windows per series,


of 36 input features and 1 output each, totalling 10,400 training samples and 2,080 test samples per category.

3.3.1 Data Pre-processing

All artificial data time series are pre-processed to achieve zero mean and unit variance (standardization). The mean is centered by subtracting the mean of the series from the values. The unit variance is achieved by dividing each of the values by the standard deviation of the series. The resulting transformation is different per time series, and the mean and standard deviation are stored to allow reversing the transformation and mapping the forecast back to the original range.

The purpose of this transformation is to make the symbolic regression output more portable across time series of different scales.
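A sketch of the per-series standardization and its inverse:

```python
import numpy as np

def standardize(series):
    """Scale one series to zero mean and unit variance; keep mean/std to reverse it later."""
    series = np.asarray(series, dtype=float)
    mean, std = series.mean(), series.std()
    return (series - mean) / std, mean, std

def destandardize(scaled, mean, std):
    """Map values (e.g. a forecast) back to the original scale of the series."""
    return np.asarray(scaled, dtype=float) * std + mean
```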

3.3.2 Seasonality data

The first test data set is based on a generated time series with a strong seasonal component. If s is the season length, k is the total number of data points in the series, x is the time index with x ∈ [0..k], y_x is the time series value at index x and ε is white noise, then the data is generated using

y_x = (x mod s) + ε

where k = 140, s = 12 and ε is generated by multiplying a random value between −1 and 1 with a noise level constant nl = 3. The purpose of this data set is to find out how well the symbolic regression algorithm is able to extract the seasonal relationship from the data, with and without noise. One of the training series after all processing steps is shown in figure B.2 for illustration.
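The generator for this data set can be sketched as follows; the seed handling is an illustrative detail:

```python
import numpy as np

def seasonal_series(k=140, s=12, noise_level=3.0, seed=None):
    """Generate one artificial seasonal series: y_x = (x mod s) + noise."""
    rng = np.random.default_rng(seed)
    x = np.arange(k)
    noise = rng.uniform(-1.0, 1.0, size=k) * noise_level
    return (x % s) + noise
```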

This data set can be forecast by taking the average of the previous seasons within the window. In the case of ws = 36 and s = 12, this is the average of the last 3 seasons. There is no trend in the data and any deviation from the seasonal pattern comes from the noise, therefore the average of previous seasons is a good approximation for this time series.

3.3.3 Multiple Seasonality data

To complement the single seasonality data set, another seasonal generated data set is used in the experiment. This data set contains two seasonal components that are stacked, and noise is added to the combined set. In addition to the season s of the single-seasonality data set, another season q is added. The formula for generating any y_x is shown below.

y_x = (x mod s) + (x mod q) + ε

where s = 12 and q = 5.

This data set contains a repeating pattern, shown in figure B.3, after every 60 months, which is outside of the visible window of a training input sample


at ws = 36, shown in figure B.4. This makes the common strategy found by the estimator, selecting the value one season length in the past, impossible. Any formula has to capture the two separate patterns to be able to forecast accurately using just the past 36 inputs.

3.3.4 Trend

The experiment for evaluating the ability of genetic programming to find a trend uses a generated trend based data set. The data is created by applying

y_x = 0.1 · x + ε

and standardizing the resulting time series to zero mean and unit variance. The noise component ε is generated in the same way as for the seasonal time series and is set at nl = 3. An example is shown in figure B.5.

3.3.5 Damped trend

A damped trend is a trend that decreases over time, approaching 0. One of the more successful techniques from the M-series competitions, damped trend exponential smoothing [30], works on the assumption that the trend in the data will reduce over time. This artificial data set simulates this assumption, by creating a trend line whose slope reduces over time, following the equation:

y_x = γ − (x / δ) · γ + ε

where γ = 0.1 is the slope coefficient, δ = 300 is the dampening coefficient and ε is added white noise. An example series from this artificial data set is shown in figure B.6.

3.4 M3 Data Set

The M3 data set contains 3003 time series in total. Out of these, this research only considers the subset that corresponds to the same criteria as used in recent M3 method comparison studies [27, 28]: all monthly time series with more than 80 values. The reported total is 1010 time series; however, I found that using this criterion the actual number of monthly time series is 1086.

Each of these time series has a variable length, between 81 and 144 values, with a mean of 132.4 and a median of 134. The number of series, mean and median for each of the time series types are shown in table 3.1.

The unique quality of the M3 data set is the large amount of prior research into the effectiveness of various forecasting methods performed specifically on it [27] [28] [8]. All time series in the M3 data set can be considered economic time series, and any research performed on the M3 data set falls in the field of economic time series forecasting.


Type          # of series   Mean     Median
All           1086          132.44   134.0
Micro         197           126      126
Macro         309           131.5    134
Finance       131           129.97   134
Industry      334           140.02   144.0
Demographic   92            133.98   135.0
Other         23            98.09    96

Table 3.1: Monthly M3 time series per series type

A typical characteristic of monthly economic time series is seasonality with a season length of 12, which corresponds to a yearly seasonality. The time series of type macro and of type demographic typically score the lowest error rates for most forecasting methods. The macro economic and demographic time series are dominated by a strong and stable trend. The micro, industry, finance and other types are less regular and vary from stable trend time series to seasonality dominated time series to almost entirely residual noise based time series, from a classical decomposition point of view. These variations between the time series are the cause of typically high forecasting errors (average sMAPE of more than 10) for most methods researched [28].

3.4.1 Training and forecasting methods

I’ve experimented with three different approaches to training on the M3 data set, starting with a global approach. The global approach takes all time series in the M3 data set as training data and tries to find a forecasting formula that has the lowest error rate when predicting all test data.

The second approach is to train on subsets of the time series, separated by type (micro, macro, finance, demographic, industry and other), and to predict the series of that type.

The third approach is local, where each individual time series, split into several forecasting windows, is used as the training set and the resulting formula is used to predict the outcome of that time series only. The results section shows the differences in forecasting effectiveness between these three methods.

3.4.2 Pre-processing

The M3 data set is also standardized to zero mean and unit variance. In addition to standardization, the series are deseasonalized and detrended. A Box-Cox transformation was considered to match the preprocessing applied by Makridakis et al. [28], but the transformation could only be applied to a subset of the series after manual analysis. The goal of the symbolic regression forecasting method is to create automated forecasts without human interference; therefore the Box-Cox transformation was not used.


Deseasonalization and detrend

An additive model was used for the decomposition, to handle the zero and negative values resulting from the standardization.

Y(t) = T(t) + S(t) + e(t)

With the seasonal component S given at season length 12, the model only has to find the remaining patterns in e, potentially find seasonality of lengths other than the fixed 12 months, and filter out noise.
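A sketch of this decomposition using statsmodels, assuming that library is available; the example series is illustrative, and note that the moving-average trend estimate leaves NaN values at both ends of the residual:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Illustrative standardized monthly series with trend and 12-month seasonality.
t = np.arange(120)
series = pd.Series(0.02 * t + np.sin(2 * np.pi * t / 12)
                   + np.random.default_rng(0).normal(scale=0.2, size=120))

decomposition = seasonal_decompose(series, model="additive", period=12)
residual = series - decomposition.trend - decomposition.seasonal
# The symbolic regression estimator is trained on the residual; the stored trend and
# seasonal components are added back to the forecast afterwards.
```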

Separate training runs were performed with no preprocessing, detrending only, deseasonalizing only and a combination of the two. Each run was trained over 200 generations with the global training strategy. The results are shown in table 3.2.

Preprocessing   Short   Medium   Long    Average
None            27.2    27.63    30.04   28.29
Deseasonalize   28.68   29.13    31.35   29.72
Detrend         12.89   12.61    15.99   13.83
Combined        12.34   11.89    15.19   13.14

Table 3.2: Results of different preprocessing strategies

The results show that the highest accuracy is achieved by combining both preprocessing methods, with detrending having the highest individual impact.


Chapter 4

Results

This chapter presents the results of the experiment on artificial time series and the experiment on the M3 monthly time series.

4.1 Artificial Data

Table 4.1 shows the sMAPE score on the series with noise for each of the artificial data categories. The column 'original signal' shows the sMAPE score of the original generated signal compared to the signal with noise; it represents the theoretically optimal forecast if all of the original signal is recovered. The column 'linear trend' represents a linear trend forecast on the signal with noise. The 'seasonality' column represents the sMAPE score of a naive forecast on deseasonalized data at season length 12. The 'SR' column represents the actual symbolic regression result and the last column the difference with the original signal.

Category            original signal   linear trend   seasonality   SR       Diff original
seasonality         116.77            151.04         117.13        127.05   10.28
multi seasonality   61.14             155.88         92.94         74.09    12.95
trend               55.51             65.76          84.56         55.17    -0.34
damped trend        77.15             114.83         94.02         81.15    4.0

Table 4.1: Results on Artificial Data

4.1.1 Seasonality test

With the generated single-seasonality ground truth without noise, the genetic programming algorithm returns a stable result:

ŷ_x = y_{x−s}


where s is the length of the season. In the case of ws = 18 and s = 12 this is represented by the SR tree in figure A.1.

For other values of s and ws it returns the same shortcut calculation. When the data set has additional noise, starting from nl > 1.5, one notable recurring strategy of the symbolic regression algorithm is a stack of alternating sin and tan nodes, which effectively creates a low-pass filter. The high spikes with values outside the band 0-1 in previous seasons contribute to a small dip at the peak of the seasonality. If the input data were re-scaled to the range 0-1, this strategy would yield better prediction results. Figures A.2 and A.3 show two such model outputs.

The most successful run by sMAPE output a tree of depth 58 and resulted in a sMAPE score of 127, roughly 10 higher than the basic prediction of taking the season average. The model is shown in figure A.4 and the fit is shown in figure B.7.

4.1.2 Multiple Seasonality test

For the noise free data set, the genetic programming algorithm found the perfect fit for the data in the mathematical expression shown in figure A.5.

The run reached a MAE of 1.46e-14 (approaching 0) within 302 generations and terminated because the stopping criterion was reached. The resulting model fit and prediction are shown in figure B.8.

For the noisy data, the run reached the maximum number of generations and the best individual's MAE ended up at 0.91. The sMAPE score of this individual on the test data set is 74.09, an 18-point improvement over the naive benchmark sMAPE of 92.9. The individual has a tree depth of 431 and would require considerable effort to analyse in detail. The model is shown in figure A.6 and the model fit and forecast are shown in figure B.9.

4.1.3 Trend test

For the initial test, on the generated trend data without noise, the program reached a near perfect solution within 111 generations.

The resulting program is shown in figure A.7 which is a shortcut to estimate the slope based on only three past values. The sMAPE of the resulting forecast on the test data set is 2.66e-14. The model fit and prediction are shown in figure B.10.

The next result is for the noisy trend data set. After 500 generations, the most successful run reached a sMAPE of 55.20 which is considerably lower than the simple regression result of 65.76. The resulting program appears to be more resilient to noise than linear regression. The resulting forecast and model fit are shown in figure B.11. The highest fitness individual has 161 nodes and is shown in figure A.8.


4.1.4 Damped Trend test

On the damped trend data set without noise, symbolic regression found a close to perfect fit after 529 generations, after which the run exited because the stopping criterion was reached. The resulting sMAPE is 3.73e-12, approaching 0. The resulting fit is shown in figure B.12 and the model in figure A.9.

On the noisy damped trend data set the best run terminated after 1000 generations, reaching a final sMAPE of 81.15, which is 4 percentage points higher than the best possible sMAPE of 77.15 that would be achieved if the underlying signal were perfectly modelled. The resulting program has a size of 211 nodes and a depth of 33. The resulting model fit is shown in figure B.13 and the corresponding model in figure A.10.


4.2 M3 data set test

Next are the results of forecasting the M3 competition time series using the symbolic regression algorithm, per training type. For each of the results, the result for Theta [2] is shown, which was the best method in the M3 competition [8]. Because the sMAPE scores differ slightly from the ones calculated by Makridakis [28], the Theta scores shown are my re-calculations on the subset of 1086 (vs 1010 [27]) time series, using the actual forecasts posted on the M3 website.

4.2.1 Global

For the global training strategy, the best run resulted in the forecasting accuracy in table 4.2. The resulting model is shown in figure A.11.

Method   Short   Medium   Long    Average
SR       12.41   11.89    15.19   13.17
Theta    9.17    10.67    13.34   11.06

Table 4.2: Global training forecasting results

4.2.2 Per series type

When the training is performed on each of the series types individually, the results of the best run are shown in table 4.3.

Method   Series type   Short   Medium   Long    Average
SR       Micro         20.54   14.16    17.68   17.46
Theta    Micro         19.22   15.94    20.19   18.45
SR       Macro         4.71    6.64     9.73    7.02
Theta    Macro         3.99    6.02     8.95    6.32
SR       Finance       9.79    11.98    17.22   13.0
Theta    Finance       8.45    11.62    15.3    11.79
SR       Industry      11.77   14.33    16.0    14.03
Theta    Industry      9.35    12.76    14.48   12.2
SR       Demographic   4.66    5.9      7.94    6.17
Theta    Demographic   3.99    5.03     6.41    5.15
SR       Other         17.22   19.8     16.1    17.71
Theta    Other         14.93   14.66    13.98   14.52

Table 4.3: Per series type training results

The weighted average of these individual series type results, weighted by the number of series in each category, is an sMAPE of 11.47, 1.69 lower than the best global strategy run. The comparison with the global strategy and Theta for the different horizons is shown in table 4.4. An additional modification was applied to the preprocessing of the micro type series: deseasonalization was skipped.


Strategy   Short   Medium   Long    Average
Global     12.41   11.89    15.19   13.17
Per type   9.72    10.94    13.76   11.48
Theta      9.17    10.67    13.34   11.06

Table 4.4: Comparison of global and per type training

Training on each of the different series types produced a single model per type. The size of each of the models is shown in table 4.5.

Type          Length   Depth
Micro         23       10
Macro         3        1
Finance       5        2
Industry      5        2
Demographic   3        1
Other         3        1

Table 4.5: Per type model size statistics

For illustration, the finance model is shown in figure A.13 and the micro model in figure A.12.

4.2.3 Local

The last strategy, local, showed improved results for some of the series types, but not for all, as shown in table 4.6. Table 4.7 shows the resulting average for each forecasting horizon when using only locally trained models. If the per type forecasting model is used for the series of micro type and the per series local forecasting models for all other series, the resulting average scores are shown in table 4.7 in the row labeled 'Best mix'.


Training type   Series type   Short   Medium   Long    Average
Per type        Micro         20.54   14.16    17.68   17.46
Local           Micro         18.9    17.1     20.27   18.76
Theta           Micro         19.22   15.94    20.19   18.45
Per type        Macro         4.71    6.64     9.73    7.02
Local           Macro         4.49    6.64     9.76    6.96
Theta           Macro         3.99    6.02     8.95    6.32
Per type        Finance       9.79    11.98    17.22   13.0
Local           Finance       9.08    12.2     17.48   12.92
Theta           Finance       8.45    11.62    15.3    11.79
Per type        Industry      11.77   14.33    16.0    14.03
Local           Industry      11.12   13.55    15.53   13.4
Theta           Industry      9.35    12.76    14.48   12.2
Per type        Demographic   4.66    5.9      7.94    6.17
Local           Demographic   4.01    5.48     7.36    5.62
Theta           Demographic   3.99    5.03     6.41    5.15
Per type        Other         17.22   19.8     16.1    17.71
Local           Other         17.49   16.39    18.14   17.34
Theta           Other         14.93   14.66    13.98   14.52

Table 4.6: Local training results per series type

Strategy     Short   Medium   Long    Average
Global       12.41   11.89    15.19   13.17
Per type     9.72    10.94    13.76   11.48
Local        9.93    11.44    14.35   11.91
Best mix     9.93    10.90    13.87   11.57
Theta        9.17    10.67    13.35   11.06
Difference   0.76    0.23     0.52    0.51

Table 4.7: Comparison of all training strategies


Chapter 5

Discussion

For the first part of the experiment the symbolic regression algorithm was challenged with finding well known time series patterns, with and without added noise. It showed relatively accurate forecasting results even at very high noise levels. The second part of the experiment tested the forecasting accuracy of symbolic regression on the M3 data set. The outcome differs from the average of the forecasting techniques studied in Makridakis [28]. In particular, in various runs the forecasting accuracy of the symbolic regression algorithm proved to be lower at shorter horizons (1-6 months), while the same model was very accurate at longer horizons (7-18 months).

5.1 Artificial data

The results of the tests on the artificial data show that symbolic regression can be used to find effective mathematical models for data with multiple seasonality, a trend or a damped trend, even at relatively high noise levels. For these three categories the algorithm outperforms the simple statistical approximations, which are more subject to distortion by noise. In the case of single seasonality with a fixed season length, the symbolic regression algorithm was not able to outperform a seasonally adjusted naive algorithm.

On the multiple seasonality data set a very large tree was capable of filtering out noise better than simple statistical methods. During the first 350 generations the tree length increased linearly. For the remaining 150 generations the best tree was of stable length, but of changing content. Over the full 500 generations a linear decrease of the MAE score was apparent, trading computational speed for accuracy along the way, as shown in figure B.14.

5.2 Training strategies

The result of the global symbolic regression runs on the M3 data set is a consistent local optimum similar to figure 5.1 or figure 5.2.


Figure 5.1: Local optimum on global strategy example

Figure 5.2: Local optimum on global strategy example

The sMAPE over the 18 horizons is approximately 13.17, which is higher than the average of 'Naive 2' from Makridakis [28] at 12.77. The many variations between the M3 time series possibly cause the models to incur a high forecasting error on certain types of time series. The algorithm did not prove flexible enough to handle the full range of variations when trained on the full set of series in the M3 data set.

The per type and local training show improved results, with a combination of per type training for the micro type series and local training for all other time series resulting in the most accurate forecasts. To select the right training type for a new data set, one would train on the whole class of time series in addition to training on each individual series, and compare the effectiveness of the two approaches.

5.3 Diversity improvements

Diversity improving measures, such as a very low tournament size, greatly improve the effectiveness of symbolic regression. The tournament size that proved effective, 4, allows a diverse group of individuals to survive, preventing early convergence. There is no direct metric used for this convergence, other than the observation that the size of the best-fit individual remains around 5 for at least 100 generations.


5.4 Pre-processing effectiveness

Many of the individual time series in the M3 data set show much better performance if the deseasonalization pre-processing is not applied. This effect was observed by examining the resulting forecast on time series with very high error values (sMAPE > 50). An example of one of these series, detrended and deseasonalized, with an sMAPE of 51.00, is shown in figure B.15. The same series only detrended has an improved sMAPE of 14.50, shown in figure B.16.

The graph shows that the level of seasonality changes drastically from the beginning of the time series to the end. Because the deseasonalization applies an average seasonality across the whole series, it actually reduces the forecasting accuracy by overestimating the seasonality in the forecast range. Any statistical technique with exponential smoothing will outperform the symbolic regression on this particular time series. A possible solution to this problem is the inclusion of the lag operator, as discussed in the future research directions, or applying exponentially smoothed deseasonalization in the preprocessing steps. This insight led to conditionally preprocessing the series per series type, to only detrend the micro type series. The improvement in performance for that particular type was 7 sMAPE points, a large difference, and it brought the overall effectiveness of the per series type training to a lower sMAPE average than the result for any of the machine learning techniques reported by Makridakis [28].

5.5 The case of micro economic series

One notable result of the experiment is an improved forecasting accuracy on micro economic time series compared to the best statistical method, Theta. Theta’s sMAPE average for the first horizon on the micro economic series is 21.98.

The micro economic model found by using the symbolic regression algorithm scored an sMAPE of 17.68 on the average of the last six horizons (long range), two and a half points lower than the result of Theta. The improved result was achieved by forecasting using a model trained on all micro economic time series. This model can be re-written from the original tree as:

ŷ_36 = α · x_12 − β · x_30 − α · Σ[x_27 .. x_29] − α · Σ[x_31 .. x_35]

where α = 0.181511 and β = 0.319. This formula captures multiple seasonality, at lags 6 and 24, and a moving average up to lag 9. The effectiveness of the above model suggests that the micro economic time series defy a few assumptions built into the statistical forecasting methods, such as a fixed seasonal length of 12 periods and a gradually reducing importance of previous values through exponential smoothing.
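The rewritten model translates directly into code. This sketch assumes a 36-value input window indexed x_0..x_35, as used throughout the thesis:

```python
import numpy as np

ALPHA, BETA = 0.181511, 0.319

def micro_model_forecast(window):
    """One-step-ahead forecast of the evolved micro economic model (36-value window)."""
    x = np.asarray(window, dtype=float)
    return (ALPHA * x[12]
            - BETA * x[30]
            - ALPHA * x[27:30].sum()    # x27..x29
            - ALPHA * x[31:36].sum())   # x31..x35
```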

The resulting model of the symbolic regression is a simple statistical model more accurate at forecasting micro economic time series than the best in class statistical model, Theta. This is a remarkable result in itself, considering that


there is no direct human involvement in shaping the structure and content of this model. This result highlights a possible use case for symbolic regression, where existing static statistical models do not properly extract features of a time series in a particular domain. A dynamic model specific to the domain can be created using symbolic regression.


Chapter 6

Further research directions

In this chapter, follow-up research directions related to the current research are discussed.

6.1 M4 data set

This thesis has explored the effectiveness of symbolic regression on the M3 data set. This data set dates from the year 2000 and was, at the time, considered large. Computational power has increased drastically in the past 18 years, and the new data set of the M4 competition is much larger to reflect this. Many machine learning algorithms become more effective as more training data is used. The M4 competition is also the first of the M-series competitions that was won by a technique which combines machine learning with statistical methods. Future research could explore the effectiveness of symbolic regression on the new M4 data set and compare the results to the smaller M3 data set.

6.2 Prediction interval

Currently the symbolic regression algorithm creates only a single prediction output. The M4 competition requires each algorithm to provide a prediction interval, and the accuracy of the prediction interval is compared between techniques. Future research could attempt to generate accurate prediction intervals for the symbolic regression predictions.

6.3 Lag operator

Most classical auto regressive techniques for time series forecasting use some form of exponential smoothing, where recent values are given more weight than values further in the past. To express this in a formula, some form of lag operator


is required. The addition of a lag operator to the function set of symbolic regression decreases the tree size required to express a relation between a whole past input set and the prediction value in an exponential smoothing formula or any other AR model. Future research can consider including this operator and comparing its effectiveness on the M-competition data set.

6.4 Initialization and demes

Symbolic regression can be improved upon by initializing the runs with more advanced individuals that already have some level of effectiveness [31]. In the experiment naive initial populations were used, forcing the algorithm to create effective individuals from scratch.

Evolving multiple populations in parallel in separate pools of individuals, called 'demes' [32], that are connected in some way to exchange effective individuals, is another research direction that could improve forecasting accuracy. Forecasting strategies from the training runs on one category could, for example, be re-used on other categories.

6.5 Alternative representations

This research has considered the classical tree based genetic programming representation for symbolic regression. Miller et al. have shown promising results in previous research for using the Cartesian GP representation [18] for forecasting. A variant of the Cartesian GP algorithm, named Recurrent Cartesian GP [33], allows state information to be held by the model, making it particularly interesting for time series forecasting to encode the temporal dependency between input variables, similar to a lag operator. Future research could compare the effectiveness of Cartesian GP and other representations for forecasting on the M-competition data set.

6.6 Hybrid approach with other ML/Statistical techniques

The most successful technique in the M4 competition, ESRNN [9], is a combination of classical statistical techniques and deep learning models. Future research could explore the effectiveness of symbolic regression as a component in a hybrid forecasting model. This would already be partially achieved by including the previously discussed lag operator, allowing the symbolic regression formula to express exponentially smoothed models.


Chapter 7

Conclusion

Symbolic regression can be used as a technique for forecasting univariate time series. Its effectiveness had not previously been compared to other conventional techniques by the leading forecasting researchers [27, 28]. Symbolic regression shows promise on a range of artificial data sets which contain well known time series patterns, even at high noise levels. The algorithm is able to extract the original signal and filter out noise effectively, shown by a sMAPE score close to the sMAPE of the original signal compared to the signal with noise, which represents the theoretically perfect forecast.

When symbolic regression is used to forecast the M3 competition data set, the best approach for most types of series, except micro economic, was to train an individual model for each series. The micro economic series had the most accurate results for medium and long range forecasting using a model that was trained on the whole range of micro economic series.

Symbolic regression was able to improve on the forecasting accuracy of micro economic time series, compared to the best statistical method, Theta. Overall, the average forecasting accuracy for the combination of symbolic regression runs was 11.57, 0.51 sMAPE higher than Theta. The gap between SR and Theta was the largest for short range forecasting (1-6 months) with a difference of 0.76 sMAPE and the smallest for medium range forecasting (7-12 months) with a difference of 0.23 sMAPE.

Although the results of Theta and other classical statistical tools are better on average, the results of this research underline the purpose of symbolic regression as a model discovery tool, where the structure of a model is not known ahead of time. When general purpose forecasting tools do not offer accurate results for a particular domain, symbolic regression can be used to discover alternative model representations and structures that are more appropriate for that domain. The positive example of this use is the high accuracy result on the micro economic time series of the M3 data set, achieved by a brand new statistical model evolved through symbolic regression.


Bibliography

[1] J. R. Koza, "Genetic programming: On the programming of computers by means of natural selection," Cambridge, MA, 1992.

[2] V. Assimakopoulos and K. Nikolopoulos, "The theta model: a decomposition approach to forecasting," International Journal of Forecasting, vol. 16, no. 4, pp. 521–530, 2000.

[3] P. R. Winters, "Forecasting sales by exponentially weighted moving averages," Management Science, vol. 6, no. 3, pp. 324–342, 1960.

[4] D. Bartholomew, "Time series analysis forecasting and control," Journal of the Operational Research Society, vol. 22, no. 2, pp. 199–201, 1971.

[5] E. M. Azoff, Neural Network Time Series Forecasting of Financial Markets. John Wiley & Sons, Inc., 1994.

[6] K.-R. Müller, A. J. Smola, G. Rätsch, B. Schölkopf, J. Kohlmorgen, and V. Vapnik, “Predicting time series with support vector machines,” in International Conference on Artificial Neural Networks, pp. 999–1004, Springer, 1997.

[7] S. Makridakis, C. Chatfield, M. Hibon, M. Lawrence, T. Mills, K. Ord, and L. F. Simmons, “The M2-competition: A real-time judgmentally based forecasting study,” International Journal of Forecasting, vol. 9, no. 1, pp. 5–22, 1993.

[8] S. Makridakis and M. Hibon, “The M3-competition: results, conclusions and implications,” International Journal of Forecasting, vol. 16, no. 4, pp. 451–476, 2000.

[9] S. Makridakis, E. Spiliotis, and V. Assimakopoulos, “The M4 competition: Results, findings, conclusion and way forward,” International Journal of Forecasting, 2018.

[10] F. H. Bennett, J. R. Koza, J. Yu, and W. Mydlowec, “Automatic synthesis, placement, and routing of an amplifier circuit by means of genetic programming,” in International Conference on Evolvable Systems, pp. 1–10, Springer, 2000.


[11] J. R. Koza and J. P. Rice, “Genetic generation of both the weights and architecture for a neural network,” in Neural Networks, 1991. IJCNN-91-Seattle International Joint Conference on, vol. 2, pp. 397–404, IEEE, 1991.

[12] M. Quade, M. Abel, K. Shafi, R. K. Niven, and B. R. Noack, “Prediction of dynamical systems by symbolic regression,” Physical Review E, vol. 94, no. 1, p. 012214, 2016.

[13] H. Iba and T. Sasaki, “Using genetic programming to predict financial data,” in Evolutionary Computation, 1999. CEC 99. Proceedings of the 1999 Congress on, vol. 1, pp. 244–251, IEEE, 1999.

[14] S. Makridakis, A. Andersen, R. Carbone, R. Fildes, M. Hibon, R. Lewandowski, J. Newton, E. Parzen, and R. Winkler, “The accuracy of extrapolation (time series) methods: Results of a forecasting competition,” Journal of Forecasting, vol. 1, no. 2, pp. 111–153, 1982.

[15] D. G. Wilson, S. Cussat-Blanc, H. Luga, and J. F. Miller, “Evolving simple programs for playing Atari games,” arXiv preprint arXiv:1806.05695, 2018.

[16] A. M. Turing, “Computing machinery and intelligence,” in Parsing the Turing Test, pp. 23–65, Springer, 2009.

[17] W. Banzhaf, “Genetic programming for pedestrians,” in ICGA, p. 628, 1993.

[18] J. F. Miller, “An empirical study of the efficiency of learning boolean functions using a cartesian genetic programming approach,” in Proceedings of the 1st Annual Conference on Genetic and Evolutionary Computation - Volume 2, pp. 1135–1142, Morgan Kaufmann Publishers Inc., 1999.

[19] A. J. Turner and J. F. Miller, “Recurrent cartesian genetic programming of artificial neural networks,” Genetic Programming and Evolvable Machines, vol. 18, no. 2, pp. 185–212, 2017.

[20] W. B. Langdon and R. Poli, “Fitness causes bloat,” in Soft Computing in Engineering Design and Manufacturing, pp. 13–22, Springer, 1998.

[21] D. A. Van Veldhuizen and G. B. Lamont, “Evolutionary computation and convergence to a pareto front,” in Late Breaking Papers at the Genetic Programming 1998 Conference, pp. 221–228, 1998.

[22] S. T. Khu, S.-Y. Liong, V. Babovic, H. Madsen, and N. Muttil, “Genetic programming and its application in real-time runoff forecasting,” JAWRA Journal of the American Water Resources Association, vol. 37, no. 2, pp. 439–451, 2001.

[23] Y.-S. Lee and L.-I. Tong, “Forecasting energy consumption using a grey model improved by incorporating genetic programming,” Energy Conversion and Management, vol. 52, no. 1, pp. 147–152, 2011.


[24] T. McConaghy, H. Leung, and V. Varadan, “Functional reconstruction of dynamical systems from time series using genetic programming,” in 2000 26th Annual Conference of the IEEE Industrial Electronics Society (IECON 2000), vol. 3, pp. 2031–2034, Oct 2000.

[25] E. Carreno Jara, “Long memory time series forecasting by using genetic programming,” Genetic Programming and Evolvable Machines, vol. 12, pp. 429–456, Dec 2011.

[26] T. McConaghy, “FFX: Fast, scalable, deterministic symbolic regression technology,” in Genetic Programming Theory and Practice IX, pp. 235–260, Springer, 2011.

[27] N. K. Ahmed, A. F. Atiya, N. E. Gayar, and H. El-Shishiny, “An empirical comparison of machine learning models for time series forecasting,” Econometric Reviews, vol. 29, no. 5-6, pp. 594–621, 2010.

[28] S. Makridakis, E. Spiliotis, and V. Assimakopoulos, “The accuracy of machine learning (ML) forecasting methods versus statistical ones: Extending the results of the M3-competition,” tech. rep., UNIC Working Paper, 2017.

[29] K. E. Kinnear, “Fitness landscapes and difficulty in genetic programming,” in Evolutionary Computation, 1994. IEEE World Congress on Computational Intelligence, Proceedings of the First IEEE Conference on, pp. 142–147, IEEE, 1994.

[30] E. S. Gardner Jr, “Exponential smoothing: The state of the art,” Journal of forecasting, vol. 4, no. 1, pp. 1–28, 1985.

[31] R. Aler, D. Borrajo, and P. Isasi, “Using genetic programming to learn and improve control knowledge,” Artificial Intelligence, vol. 141, no. 1-2, pp. 29–56, 2002.

[32] R. Poli and J. Page, “Solving high-order boolean parity problems with smooth uniform crossover, sub-machine code GP and demes,” Genetic Programming and Evolvable Machines, vol. 1, no. 1-2, pp. 37–56, 2000.

[33] A. J. Turner and J. F. Miller, “Recurrent cartesian genetic programming,” in International Conference on Parallel Problem Solving from Nature, pp. 476–486, Springer, 2014.


Chapter 8

Acknowledgement

I would like to thank my academic supervisors, Michel Mandjes and Haralambie Leahu, for their tremendous support in navigating the early stages of the thesis, shaping the subject and scope, and pushing me to execute with rigour. My industry supervisors, Jerome Cremers and Gijs Koot from Xomnia, I would like to thank for helping me take my first steps into Data Science the right way and for providing me with fresh ideas for solving problems along the way.

I would like to thank Ana Oprescu for helping me figure out all the logistics of the master project and for letting me work on a subject outside of the immediate research areas of Software Engineering.

The management of Xomnia, and especially Ollie Dapper, I would like to thank for hosting my master project and giving me the freedom to pursue fundamental research without the expectation of immediate commercial application. My family and friends I would like to thank for their continued support and for not complaining too much about my reduced presence over the past months.


Appendix A

Models

Figure A.1: Single seasonality model


Figure A.4: Single Seasonality best model


Figure A.6: Multiple Seasonality model with noise


Figure A.8: Trend artificial data with noise model


Figure A.10: Damped trend model with noise


Figure A.12: Model of training per series type on Micro


Figure A.14: Detrended and deseasonalized high error series


Appendix B

Series


Figure B.2: Data with single seasonality


Figure B.4: Multiple seasonality input window


Figure B.6: Damped trend artificial series


Figure B.8: Multiple seasonality model fit and forecast


Figure B.10: Trend model fit and forecast


Figure B.12: Damped trend without noise fit and forecast


Figure B.14: MASE vs tree size trade-off


List of Figures

2.1 Example of an individual tree
5.1 Local optimum on global strategy example
5.2 Local optimum on global strategy example
A.1 Single seasonality model
A.2 Short Sin Tan stack
A.3 Longer Sin Tan stack
A.4 Single Seasonality best model
A.5 Multiple seasonality model without noise model
A.6 Multiple Seasonality model with noise
A.7 Trend model without noise
A.8 Trend artificial data with noise model
A.9 Damped trend model without noise
A.10 Damped trend model with noise
A.11 Global strategy M3 model
A.12 Model of training per series type on Micro
A.13 Model of training per series type on Finance
A.14 Detrended and deseasonalized high error series
A.15 Detrended high error series
B.1 3 Input windows
B.2 Data with single seasonality
B.3 Multiple seasonality pattern
B.4 Multiple seasonality input window
B.5 Trend artificial series
B.6 Damped trend artificial series
B.7 Model fit and forecast for single seasonality
B.8 Multiple seasonality model fit and forecast
B.9 Multiple seasonality model fit and forecast with noise
B.10 Trend model fit and forecast
B.11 Caption
B.12 Damped trend without noise fit and forecast
B.13 Damped trend with noise model fit and forecast


B.14 MASE vs tree size trade-off
B.15 Detrended and deseasonalized high error series
B.16 Detrended high error series


List of Tables

3.1 Monthly M3 time series per series type
3.2 Results of different preprocessing strategies
4.1 Results on Artificial Data
4.2 Global training forecasting results
4.3 Per series type training results
4.4 Comparison global and per type training
4.5 Per type model size statistics
4.6 Local training results per series type
