
MSc Artificial Intelligence

Master Thesis

Investigating the complementarity of

preprocessed features with generative

autoregressive neural networks in time series

by

Petru Neague

11844523

February 22, 2020

48 Credits March 2019-December 2019

Supervisor:

Dr E Gavves

Assessor:

Dr D Gupta

Informatics Institute


Investigating the complementarity of preprocessed features

with generative autoregressive neural networks in time series

Petru Neague

December 2019

Abstract

This paper examines whether generative autoregressive neural networks (whose main representative is Wavenet [30]) can be complemented with pre-computed information and thus produce better results in the domain of time series. Three of the features tested here belong to the de-noising category: ARMA reconstruction, a de-noising autoencoder [32] and a Fourier low-pass filter. The last feature is the Fisher vector, which, while normally applied in computer vision, seemed to have the potential to provide the network with compressed, valuable information about the entire time series. Experiments on manufactured datasets have shown with very high confidence that adding the noise-reduction features can indeed improve the accuracy of the network by between 0.25 and 0.5%. However, on two datasets taken from the real world almost no feature managed to improve accuracy, with only the Fourier filter very slightly achieving significance on one of them. The Fisher vector reliably improved the accuracy on the low-noise created dataset. In terms of additional computational cost, the autoencoder took between 1 and 1.5 times as long to construct as the training of the Wavenet itself, the cost of the Fourier filter is negligible, and the Fisher vector and ARMA took between 3 and 20 times longer than the training of the Wavenet itself. While low in gain (where gain actually existed) and high in the resources needed to construct them, ARMA and the de-noising autoencoder may be usefully employed in fields where accuracy is paramount and computing power is plentiful. The Fourier filter, in contrast, combines good results with low computational cost, which makes it valuable in situations where models can be trained twice, once with and once without the feature, in order to check whether Fourier filtering improves on the simple model. This research opens up new avenues of research into preprocessed features that can be constructively combined with autoregressive networks for time series.

1 Introduction

A time series is a sequence of data points arranged in successive order. For processes governed by the passing of time, patterns emerge that can be distinguished by proper analysis. Time series analysis represents the attempt to do exactly that: to search for structure in the noise and find the underlying signal. This allows one to predict what the future entails for one's time series, which in turn allows short- or long-term planning to reach specific goals. Since time series are present in the most varied fields, from financial markets to astrophysical processes, their analysis is extremely important for the development of society.


Machine learning and artificial intelligence have recently contributed to data analysis with new tools that facilitate a better understanding of time series. This work attempts to further improve on current techniques by combining disparate methods which have been shown to work in their specific contexts. Further background information is provided in Section 3.

2 Motivation

This paper was inspired by previous work which showed that hand-crafted features can improve the quality of neural network predictions in computer vision [24] [19]. The hypothesis of this paper is that, similarly to computer vision, predictions for time series data can be reliably improved when such features are combined with autoregressive generative models like Wavenet. The ideas for the features used came from [4] [5] [24] [19], which give an overview of which features proved useful within computer vision.

As such, this paper aims to answer the following questions:

1. Are generative autoregressive models, like Wavenets, and standard neural networks like de-noising autoencoders [32], complementary to each other?

2. Can generative autoregressive models be complemented with traditional time series models, such as ARMA?

3. Are generative autoregressive models complementary to low-pass Fourier expansions of the dataset?

4. Are generative autoregressive models complementary to soft-assignment bag-of-word features like Fisher vectors?

Even a slight improvement in the accuracy of predictions for autoregressive models (such as Wavenet) would be a welcome addition to the immense domain of time-series analysis.

3 Literature Review

This section examines the previous work this paper builds upon. Since the paper combines machine learning methods which have little in common, this part is divided into more or less self-contained subsections.

3.1 RNN & LSTM in Time Series

Prior to Wavenet, time series analysis was mostly the domain of the family of Recurrent Neural Networks (RNNs [20]) and Long Short-Term Memory networks (LSTMs [12]). RNNs (represented in Figure 1) were a 1989 improvement upon regular fully connected networks for sequence analysis because they kept information from previous data points through a recurrent connection called the "memory vector". While they could theoretically carry information from any number of previous inputs, certain shortcomings regarding the propagation of information from too far in the past emerged [3]. The most important one was the vanishing gradient in long sequences, due to the multiplicative combination of past and present features, meaning that their reach into the past was limited in practice.


Figure 1: Fully connected vs recurrent neural networks. The fully connected network creates a prediction based only on the current input, with no regard to past values of the neurons in the hidden layer. Recurrent neural networks, on the other hand, also regard the past values taken by those hidden neurons (noted here as "t-1", i.e. the previous time point).


Eight years after RNNs were created, a solution to the vanishing/exploding gradient problem was proposed: the LSTM (see Figure 2). Thanks to its innovative "gates", it allows the additive combination of past information with the present (as opposed to the multiplicative one of the RNN). This solved the gradient problem, allowing long-term relationships between data points to be taken into account. Since then, the LSTM has been widely used in time series analysis and continues to be developed and built upon, even after the invention of Wavenet [34].

3.2 Wavenet

The basis of this paper is the investigation of combining pre-calculated features with the recently invented generative model known as Wavenet [30]. Wavenet is a deep artificial neural network which was created to autoregressively analyse and generate raw audio waveforms.


Figure 2: Long Short-Term Memory network. The cell constituting the building block of the LSTM network. The main innovation was the gates (yellow cells, showing the nonlinear transformation each of them applies), which modify the information in specific ways to determine which parts should be included in the memory vector (denoted by $h_t$). The modified information which passed through the respective gates is combined with the previous output of the cell, $c_{t-1}$, through the operations shown here in red.

Wavenet became recognized as one of the leading algorithms for such purposes, outperforming conventional techniques [33]. It has also been used for other purposes, for example [6] and [27]. Several improvements upon the classical Wavenet have been proposed, increasing computational efficiency or improving the (subjectively judged) quality of the generated audio waveforms ([26] [31] [23]). For the purpose of this paper, a simplified version of the original Wavenet was used to assess the impact of introducing pre-calculated features.

3.2.1 Convolutions

Wavenet is based on convolutions, which are operations applied to two functions (say f(t) and g(t)) whose result provides information about how the shapes of the functions relate to one another. The convolution operation is shown in Equation 1; f is said to be the kernel of the convolution and g the input. In cases where the data are discrete, convolutions take the form of Equation 2. A convolution of a function of 4 data points with a kernel of size 2 is shown in Figure 3.

(6)

$f(t) \ast g(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t-\tau)\, d\tau$  (1)

$f(t) \ast g(t) = \sum_{\tau=-\infty}^{\infty} f(\tau)\, g(t-\tau)$  (2)
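For a purely numerical reading of Equation 2, a minimal NumPy check (the kernel and input values are arbitrary, chosen only for illustration):

    import numpy as np

    # Equation 2 in code: np.convolve slides the kernel over the input and
    # sums f(tau) * g(t - tau) at every valid position.
    g = np.array([1.0, 2.0, 3.0, 4.0])        # input of 4 data points, as in Figure 3
    f = np.array([0.5, 0.5])                  # kernel of size 2
    print(np.convolve(g, f, mode="valid"))    # -> [1.5 2.5 3.5]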

Convolutions are widely used in neural networks for image analysis [2]. They have performed extremely well so far and continue to be actively developed [2].

3.2.2 Causal and Dilated Convolutions

Wavenet is a type of convolutional neural network which uses "dilated" and "causal" convolutions in order to cover the large amount of data present in an audio waveform.

Where normal convolutions would cover the data points both to the left and to the right of the current point in a series, causal convolutions are meant to be used specifically in time-related problems, where "future" data points (i.e. to the right of the current point) should not be used in the prediction of the next data point. Thus, causal convolutions only cover data which precede the current data point (as seen in Figure 4). This ensures the correct order of training, such that the prediction at time t does not depend on the data present at $t' > t$, i.e.

$p(x_t \mid x_0, \ldots, x_{t-1}, x_{t+1}, \ldots) = p(x_t \mid x_0, \ldots, x_{t-1})$

Additionally, a dilated convolution is one where the kernel is applied over an area larger than its own size by skipping input values with a certain step (see Figure 5). This allows the network to quickly gather information from many data points and pass it onward in tractable time. Of course, this efficiency comes with a number of problems. Firstly, as seen in Figure 5, subsequent predictions depend on different combinations of data points, which makes the predictions less stable. Secondly, a small number of layers is used to compress a very long sequence into useful information, which can lead to loss of information along the way; this is the price paid for computational speed.
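A small sketch (in PyTorch, which is an assumption about the framework; the thesis code may differ) illustrating how left-only padding makes a dilated convolution causal, so that perturbing a "future" value leaves earlier outputs untouched:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    conv = nn.Conv1d(1, 1, kernel_size=2, dilation=4)

    def causal(x):                              # x: (batch, channels, time)
        return conv(F.pad(x, (4, 0)))           # pad (kernel - 1) * dilation steps on the left only

    x = torch.randn(1, 1, 16)
    x_future = x.clone()
    x_future[0, 0, 10] = 99.0                   # change a "future" point
    same_past = torch.allclose(causal(x)[..., :10], causal(x_future)[..., :10])
    print(same_past)                            # True: outputs before t = 10 are unaffected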

Having explained the basic principle of dilation layers and causal convolutions, the structure of the Wavenet model is given in Figure 6 [8]. The time series (in this case an audio waveform) is first passed through a causal convolution layer to expand the data into 128 channels. This expanded data is then fed to a series of residual blocks consisting of two dilated convolutions, which are run through sigmoid and tanh filters. The result of their multiplication is passed through a 1×1 convolutional layer, which, along with a residual connection, transmits the information to the next residual block. Another, separate 1×1 convolutional layer takes the outcome of the multiplication of the gates and outputs a skip connection, which is added to the pool of skip connections. (Note: in Figure 6 this is represented as a single convolution, but the output-residual and the skip convolutions are different; nonetheless, it is the best available illustration of the Wavenet concept, which is why this figure was chosen.) The sum of the skip connections from all blocks represents the output of the whole Wavenet architecture. In the original paper this output is subject to further transformations (ReLU followed by a 1×1 convolution, followed by another identical sequence, and ending with a softmax to output the probabilities of different classes) but, as will be presented, this part was slightly modified here to suit the task in question (regression, as opposed to classification).


As for the dilation factor, it is doubled from one layer to the next up to a limit and then the sequence is repeated (e.g. dilation factor per layer ∈ {1, 2, 4, ..., 512, 1, 2, 4, ..., 512}). The slightly modified version of Wavenet used in this paper is shown in Figure 10. The specific architecture details are presented in Section 4.3.
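To make the block structure concrete, here is a minimal sketch of one gated residual block with a dilated causal convolution (PyTorch is assumed; the channel sizes are the ones listed in Section 4.3, and this is an illustration of the description above, not the exact thesis implementation):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ResidualBlock(nn.Module):
        def __init__(self, channels=25, skip_channels=2000, dilation=1, kernel=2):
            super().__init__()
            self.pad = (kernel - 1) * dilation                  # left padding keeps the block causal
            self.filt = nn.Conv1d(channels, channels, kernel, dilation=dilation)
            self.gate = nn.Conv1d(channels, channels, kernel, dilation=dilation)
            self.res = nn.Conv1d(channels, channels, 1)         # 1x1 conv feeding the next block
            self.skip = nn.Conv1d(channels, skip_channels, 1)   # separate 1x1 conv for the skip pool

        def forward(self, x):                                   # x: (batch, channels, time)
            h = F.pad(x, (self.pad, 0))
            h = torch.tanh(self.filt(h)) * torch.sigmoid(self.gate(h))   # gated activation
            return x + self.res(h), self.skip(h)                # (input to next block, skip contribution)

In the modified version used here, the summed skip outputs of all blocks are passed to a single fully connected regression layer, as described in Section 4.3.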

3.3 Fisher Vector

The Fisher vector [28] is a technique used (as far as could be gathered) almost exclusively in the domain of image recognition. Studies have shown that the Fisher vector is a powerful tool that can help image classification algorithms improve their accuracy ([4] [5] [28]). The rationale for its use in time series prediction was that it could conceivably describe the shape of a series well enough to also improve the accuracy of time series regression models, which may have difficulties with noisy datasets. The way the Fisher vector works is summarily presented in the remainder of this subsection.

Let X and Y be samples of a generative model described by a probability density function p with parameter λ. The Fisher kernel (first introduced in [15]) is then defined as:

$K(X, Y) = g_\lambda^T(X)\, F_\lambda^{-1}\, g_\lambda(Y)$  (3)

where the superscript T denotes transposition, $g_\lambda(X)$ is the gradient vector of the log-likelihood and $F_\lambda$ is the Fisher information matrix:

$g_\lambda(X) = \nabla_\lambda \log p(X \mid \lambda)$  (4)

$F_\lambda = E_{X \sim p}\left[ g_\lambda(X)\, g_\lambda(X)^T \right]$  (5)

The Fisher information matrix is symmetric positive definite, so its inverse can be Cholesky decomposed as:

$F_\lambda^{-1} = L_\lambda^T L_\lambda$  (6)

Rewriting the Fisher kernel then gives:

$K(X, Y) = G_\lambda(X)^T G_\lambda(Y)$  (7)

where $G_\lambda(X)$ is the normalized Fisher vector:

$G_\lambda(X) = L_\lambda\, g_\lambda(X) = L_\lambda \nabla_\lambda \log p_\lambda(X)$  (8)

Thus, for each individual data point $x_t$ in $X = x_1, x_2, \ldots, x_T$ with $t \in 1, 2, \ldots, T$:

$G_\lambda(X) = \sum_{t=1}^{T} L_\lambda \nabla_\lambda \log p_\lambda(x_t)$  (9)

Assuming the descriptors in X are sampled independently, this method is usually used with a Gaussian Mixture Model (GMM) as the probability $p(X \mid \lambda)$. The GMM is defined as follows:

$p(x) = \sum_{i=1}^{K} \omega_i\, q(x; \mu_i, S_i)$  (10)


Here $\omega_i$ is the mixture weight of the i-th GMM component, K is the number of components in the GMM and $q(x; \mu, S)$ is the Gaussian distribution with mean $\mu$ and covariance S. A soft assignment of data points to the individual Gaussians is also introduced:

$\gamma_i(x_t) = \frac{\omega_i\, q(x_t \mid \mu_i, S_i)}{\sum_{j=1}^{K} \omega_j\, q(x_t \mid \mu_j, S_j)}$  (11)

In Ref. [15] the Fisher vector was calculated under the assumptions that each GMM component has a diagonal covariance and is located far from the others, making overlaps between the component densities unlikely. Thus

$S = \mathrm{diag}(\sigma^2)$  (12)

where $\sigma^2$ is the vector of diagonal elements of the diagonal matrix S. The diagonal assumption helps save time in the computation-intensive field of image analysis, where one image can consist of a few million data points. The computational cost of the Fisher vector is a notorious problem, with many attempts to make it more efficient (e.g. [21], [22]). For the purpose of time series analysis, however, this does not represent a problem at all. That is why it was decided to approximate the full-covariance Fisher vector, based on previous work which showed how it can be calculated to a close approximation [29].

Accordingly, the covariance matrix S is written as an eigendecomposition:

$S = U\, \mathrm{diag}(\psi^2)\, U^T$  (13)

where U is the matrix whose columns are the eigenvectors of S, and $\psi^2$ is the vector whose elements are the eigenvalues of S. Finally, putting everything together, the equations for the elements of the Fisher vector are:

$\Delta_{\mu_i} = \frac{1}{\sqrt{T \omega_i}} \sum_{t=1}^{T} \gamma_i(x_t)\, S_i^{-\frac{1}{2}} (x_t - \mu_i)$  (14)

$\Delta_{S_i} = \frac{1}{\sqrt{T \omega_i}} \sum_{t=1}^{T} \gamma_i(x_t) \left( \left( S_i^{-\frac{1}{2}} (x_t - \mu_i) \right)^2 - 1 \right)$  (15)

In Ref. [29] further approximations are made in order to improve computational speed, but since those measures were not necessary for the experiments performed here, they were not adopted; only Equations 14 and 15 were used from that paper.
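As an illustration of Equations 11, 14 and 15, a minimal sketch using scikit-learn's GaussianMixture (an assumption; the thesis code may differ). For brevity it uses diagonal component covariances and the $1/\sqrt{T\omega_i}$ normalisation as reconstructed above, whereas the thesis approximates the full-covariance case via the eigendecomposition of Equation 13:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fisher_vector(x, gmm):
        """Fisher vector of a sample x with shape (T, D) w.r.t. a fitted diagonal GMM."""
        T = x.shape[0]
        gamma = gmm.predict_proba(x)                              # soft assignments, Equation 11
        w, mu, var = gmm.weights_, gmm.means_, gmm.covariances_   # (K,), (K, D), (K, D)
        diff = (x[:, None, :] - mu[None, :, :]) / np.sqrt(var)[None, :, :]
        norm = 1.0 / np.sqrt(T * w)[:, None]
        d_mu = norm * np.einsum('tk,tkd->kd', gamma, diff)            # Equation 14
        d_var = norm * np.einsum('tk,tkd->kd', gamma, diff**2 - 1.0)  # Equation 15
        return np.concatenate([d_mu.ravel(), d_var.ravel()])

    # Example: 4 Gaussians fitted on 800 (index, value) pairs give a 2 * 4 * 2 = 16-dim vector
    pts = np.stack([np.arange(800.0), np.random.randn(800)], axis=1)
    gmm = GaussianMixture(n_components=4, covariance_type="diag").fit(pts)
    print(fisher_vector(pts, gmm).shape)                          # (16,)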

3.4 ARMA

The autoregressive moving average (ARMA) model (presented eloquently in [1]) is a composite technique for time series analysis. The autoregressive part regresses a variable on its own "lagged" values (i.e. the values it took prior to the current prediction), while the moving average part models the error term as a linear combination of error terms occurring in the past.

The autoregressive part is given by:

$X_t = c + \sum_{i=1}^{p} \varphi_i X_{t-i} + \varepsilon_t$  (16)

where $\varphi_1, \ldots, \varphi_p$ are parameters, c is a constant and the random variable $\varepsilon_t$ is white noise. As for the moving average part:

$X_t = \mu + \varepsilon_t + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i}$  (17)

where $\theta_1, \ldots, \theta_q$ are parameters of the model, $\mu$ is the expectation of $X_t$ and the $\varepsilon_i$ are error terms. Putting them together we get the ARMA model:

$X_t = c + \varepsilon_t + \sum_{i=1}^{p} \varphi_i X_{t-i} + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i}$  (18)

In order to determine the hyperparameters the ARMA model should use, it is necessary to look at the autocorrelation function (ACF) and partial autocorrelation function (PACF). These are two plots which provide information about how observations in a time series are related to each other, and thus about what orders should be used for a given series. The method for calculating the values of the ACF and PACF plots is given in [8]. It was implemented using the statsmodels library in Python.
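A sketch of the corresponding statsmodels workflow (the file name and lag counts are placeholders; the actual orders were chosen from the plots as described in Section 4.4.2):

    import numpy as np
    from statsmodels.tsa.stattools import acf, pacf
    from statsmodels.tsa.arima.model import ARIMA

    series = np.loadtxt("series.txt")                 # placeholder 1-D time series

    # Inspect how many lags show meaningful (partial) autocorrelation
    print(acf(series, nlags=40))
    print(pacf(series, nlags=40))

    # An ARMA(p, q) is an ARIMA(p, 0, q); e.g. ARMA(2, 15) as used for the created datasets
    result = ARIMA(series, order=(2, 0, 15)).fit()
    reconstruction = result.predict(start=0, end=len(series) - 1)   # in-sample reconstruction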

3.5 Fourier transform low-pass filter

One of the important methods in image as well as time series analysis is the application of a low-pass filter to a Fourier transform or a Discrete Cosine Transform (DCT) [13] [10] [19]. The methodology consists in transforming the data into frequency space and then keeping only the lowest frequencies. These represent the large-scale patterns of the data, while the higher frequencies are assumed to represent noise. Thus, by discarding the high frequencies, noise can be attenuated to a large extent. Because of its popularity in noise reduction, this method was included in the present paper as a hand-crafted feature which attempts to capture the noise-reduced pattern of the data.
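A minimal NumPy sketch of the low-pass step (the cutoff values are the ones reported in Section 4.4.1; the exact implementation in the thesis code may differ):

    import numpy as np

    def fourier_lowpass(series, cutoff):
        """Keep only the `cutoff` lowest frequency bins and transform back."""
        spectrum = np.fft.rfft(series)
        spectrum[cutoff:] = 0.0                        # discard everything above the cutoff
        return np.fft.irfft(spectrum, n=len(series))

    # cutoff ~ 20 for the created datasets, ~ 500 for AEP and DOM (Section 4.4.1)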

3.6 Denoising Autoencoder

A de-noising autoencoder is a neural network introduced in [32] to remove noise from images (for example granular noise from poor imaging conditions). It takes a number of inputs n, compresses this information into a smaller number of hidden neurons h < n (within one or more layers) and then re-creates n outputs. The loss is calculated from the difference between the n inputs and the n outputs. The key to this algorithm is that a large number of inputs has to be squeezed into a smaller number of neurons; the network is therefore forced to find the pattern in the n data points that can be used to compress the data. Since, by definition, there is no pattern in noise, the network will not encode the noise, leaving the output of the autoencoder to represent only the pattern in the original data.

De-noising autoencoders have also been used for dimensionality reduction [11], for building a robust defense against adversarial attacks [7], in time series [25] and elsewhere. Because of their powerful noise-reducing properties, a de-noising autoencoder was used in this project as a pre-computed feature to be combined with the Wavenet generative autoregressive algorithm.


Traditionally, the denoising autoencoder is trained in a supervised manner, which makes sense in the context of images: the network needs to learn what the clean object looks like in order to know what noise should be removed. In the present context, however, there is no specified object to be learned, just an error to be minimized, and the original version of the de-noised data can have small peaks and troughs throughout. Because in time series there is no ground truth available on which to train the autoencoder, it has to be trained in an unsupervised manner, minimizing the reconstruction error with respect to the original data. Since noise cannot be encoded, the best way to minimize the error is to predict the actual signal, as incorporating noise into the prediction will inevitably increase the error over the long term (when the loss is a function of the squared residuals).
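A minimal sketch of such an unsupervised autoencoder on 1-D windows (PyTorch is assumed, and the window and bottleneck sizes are illustrative; the actual architecture used is given in Figure 13):

    import torch
    import torch.nn as nn

    class WindowAutoencoder(nn.Module):
        def __init__(self, window=48, hidden=12):             # h < n forces compression
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(window, hidden), nn.ReLU())
            self.decoder = nn.Linear(hidden, window)

        def forward(self, x):
            return self.decoder(self.encoder(x))

    model = WindowAutoencoder()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    windows = torch.randn(256, 48)                            # stand-in for windows of the series

    for epoch in range(500):
        opt.zero_grad()
        loss = loss_fn(model(windows), windows)               # reconstruct the noisy input itself
        loss.backward()
        opt.step()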

4 Experimental Method

This section details the setup of the experiments. The structure opted for in this paper is to use Wavenet (presented in Section 3.2) to analyse the raw dataset and to append the hand-crafted features (the vectors in the case of Fisher vectors, and a number of the closest data points created through the de-noising measures presented in Sections 3.4, 3.5 and 3.6). The exact setup and hyperparameters used in the algorithms are presented in the rest of this section.

4.1 Datasets

Five datasets were used: three created ones, with varying amounts of noise included in them, and two time series datasets obtained from the Kaggle website [17].

The three created datasets are based on a composite sinusoidal function which was deliberately corrupted with random uniform noise to give it the appearance of a realistic dataset. The amount of noise was varied in order to assess how the methods behave both when the signal-to-noise ratio is high and when it is low. The function used to create the datasets is given in Equation 19.

$\sin(0.001 x) + \cos(0.001 x) + \sin(0.0004 x) + \mathrm{Uniform}(-n, n)$  (19)

Here x is the index counting the data points and n is the noise level, $n \in \{1.5, 2.5, 3.5\}$. These datasets, as well as the non-corrupted signal, are shown in Figure 8. Each dataset was created with 100,000 data points. The large number of points was chosen because of Wavenet's ability to include a large number of data points in its analysis (through its dilated convolutions). While the datasets Wavenet typically handles are much larger (for example the LJ Speech Dataset [14], composed of around 1,900,000,000 data points, which Wavenet uses in [9]), that scale could not be reproduced for this project as it would require much more computational power than was available. Nevertheless, 100k data points can showcase the power of Wavenet to a low error margin as well. After the datasets were created, they were rescaled to fit between -0.5 and 0.5 for easier interpretation by the neural network.
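A sketch of the dataset construction following Equation 19 (the exact rescaling step is an assumption):

    import numpy as np

    n = 2.5                                       # noise level, n in {1.5, 2.5, 3.5}
    x = np.arange(100_000)                        # 100,000 data points
    signal = np.sin(0.001 * x) + np.cos(0.001 * x) + np.sin(0.0004 * x)
    series = signal + np.random.uniform(-n, n, size=x.shape)

    # Rescale to [-0.5, 0.5] for easier interpretation by the network
    series = (series - series.min()) / (series.max() - series.min()) - 0.5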

The "real-life" datasets were taken from Kaggle.com and represent the hourly electricity consumption of the company American Electric Power (AEP) and of the company Dominion Virginia Power (DOM) over the course of about 10 years. Each has around 120,000 data points, which is about the same range as the created datasets. Visually and conceptually the two are very similar, and that is the reason both were tested here: to make sure that some variation in one did not create conditions under which an algorithm would perform better than expected.


They were chosen as testing datasets because they are very close to pure time series and because they represent data from the "real world". They are shown in Figure 9.

4.2 Metrics

Each variation of the algorithm was made to predict the next data point given the past data points; it was not used to predict more than one point into the future based on its own predictions. While the latter approach is useful when predictions are needed for medium- to long-term planning, it can create situations where one bad prediction leads to a cascade of further errors. That would make the comparison between algorithms more about how well they avoid bad outliers than about how well they predict in general. For this reason the classical setup of predicting the next data point from the previous "real" data points was chosen.

4.2.1 RMSE

The Root Mean Square Error (RMSE) was used as the main comparison metric. Each experiment was run 30 times and the RMSE was recorded each time; the mean and standard deviation of the resulting RMSEs were then computed. This gives insight into how spread out the results are, which in turn allows calculating the probability that a difference between results is significant.

After running each experiment 30 times, a mean µ and standard deviation σ of the RMSE results were calculated, both for the plain Wavenet and for Wavenet with hand-crafted features included. Assuming that the standard deviation recorded over the 30 runs approximates the real standard deviation of the RMSE results ($\sigma \approx \sigma_{real}$), the significance of the difference can be assessed with a Student's t-test between the result for the plain Wavenet and the feature-augmented Wavenet. The resulting t score is a ratio of the difference between the two groups to the variation within the groups; the larger the t score, the more the groups differ. This makes it possible to determine whether adding a hand-crafted feature to Wavenet reliably decreases the RMSE. The test statistic is given in Equation 20:

$t = \frac{\mu_{sample} - \mu_{base}}{\sqrt{\frac{\sigma_{sample}^2}{n} + \frac{\sigma_{base}^2}{n}}}$  (20)

where $\mu_{sample}$ and $\sigma_{sample}$ are the mean and standard deviation of the sample, n is the number of runs per sample (here 30) and $\mu_{base}$ is the baseline value against which the sample mean is compared. The test indicates whether the difference between $\mu_{sample}$ and $\mu_{base}$ is statistically significant, i.e. unlikely to be due to chance alone. The specific t-test performed here is the one for two independent samples with equal sample sizes but unequal variances, also called the Welch test. Calculating the p value also requires a number of degrees of freedom, computed with Equation 21:

$dof = 2n - 2$  (21)

So the goal of this paper is to reject the null hypothesis, which states: "There is no statistically significant difference between using Wavenet alone and using Wavenet with any of the features included according to Figure 10; alternatively, using Wavenet with no additional features yields accuracy at least as large as with them."


This hypothesis is safely rejected when a p value below 0.05 is found. It is implicitly assumed by this method that the losses found by the algorithms follow a normal distribution.
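A small sketch of the comparison described above (SciPy is assumed; two-sided p value with the degrees of freedom of Equation 21):

    import numpy as np
    from scipy import stats

    def p_value(mu_sample, sigma_sample, mu_base, sigma_base, n=30):
        """Equation 20 with dof = 2n - 2 (Equation 21); two-sided p value."""
        t = (mu_sample - mu_base) / np.sqrt(sigma_sample**2 / n + sigma_base**2 / n)
        return 2 * stats.t.sf(abs(t), 2 * n - 2)

    # Example with the Fourier row of Table 5: p comes out at roughly 4-5%
    print(p_value(0.0210, 0.0003, 0.0214, 0.0010))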

4.2.2 Computational speed

Assuming at least one result turns out better than the original Wavenet architecture without hand-crafted features, a comparison of how much time it takes to compute the features was also included, since in some cases (such as the Fisher vector) this takes considerably longer than in others. Even a rough understanding of complexity, obtained by monitoring the time needed to create the features (and thus to improve the accuracy by a certain amount), is a useful addition, considering that Wavenet was specifically created to process datasets much faster than previous alternatives (i.e. LSTMs).

All modelling was performed on Google Colaboratory. It was initially believed that the same amount of computational power is reserved for every user at all times, but after running the timing experiments it was found that resources fluctuate heavily from session to session. As such, all timing experiments were performed in the same session. This was feasible since none of the hand-crafted features take very long to compute.

4.3 Simple Wavenet

The original Wavenet, presented in Section 3.2 and shown in Figure 6, was slightly modified for the purposes of this paper. The last layer, meant to classify information, was removed and replaced with a simple fully connected layer which outputs a value through linear regression. Earlier layers meant to improve accuracy by providing additional nonlinear relationships were also removed, as there was no need to push the absolute result in every experiment, and with 2000 dimensions coming from the skip connections the computational resources would have been unnecessarily strained. The only addition to the algorithm is the concatenation happening before the last fully connected layer.

The hyperparameters were chosen to take advantage of Wavenet's ability to reach far into the history:

• The number of epochs used was 1000; each epoch consisted of 5 predictions of 100 consecutive points each, with the error averaged over them. The train-validation-test split was 60:20:20 in percentage terms, allowing enough data in each split for statistical representation.

• There were 12 dilation layers which were repeated once leading to a series of dilation factors: {1, 2, 4, 8, 16, ..., 1024, 2048, and then again starting from 1, 2, 4, 8, 16, ..., 1024, 2048}. In total it results in 24 layers and a reach of 4096 data points. The repeat of the convolutions was taken from the original paper of Wavenet [30].

• There were 25 residue channels. A few experiments were performed with 100 and 200 but it didn’t seem to have a meaningful difference so they were kept at 25.

• There were 2000 skip connections per layer. The reason for this relatively large number was the idea of replacing neurons with features: e.g. if 50 features were added in an experiment, then there would be 1950 skip connections.


This was done to make the comparison more viable in terms of computational time: add more neurons, or add the feature? It was later found that computing the features takes more time than the slowdown caused by adding a few more neurons; nevertheless, since a few experiments had already been performed, this architecture was kept.

• The hand-crafted features shown in Figure 10 will be further detailed in later subsections as they have their own properties and hyperparameters which need to be explained.

The features added to the Wavenet algorithm are described in Section 4.4. The results of these additions are then compared to each other and to the base Wavenet without features in Section 5. The comparison between the features is performed dataset by dataset (Sections 5.1.1, 5.1.2, 5.1.3, 5.2.1, 5.2.2), after which the results are judged for consistency across the tested datasets.

4.4 Hand-crafted Features

The features tried in this project can be split into two types:

1. De-noising features; as the name implies, these features attempt to find the signal and remove the noise from the original data. The closest data points (located behind the prediction point) were de-noised and treated as the "hand-crafted features" in Figure 10. A good amount of thought was put into how to introduce this signal into the Wavenet algorithm: whether to place it within the Wavenet superstructure or to append it to the result of the Wavenet as in Figure 10, and whether it should replace the original data or merely complement it in case the de-noising is imperfect. Eventually, the signal computed through the de-noising techniques of Sections 3.4, 3.5 and 3.6 was treated as described in Figure 10. The de-noised data were not appended to the real data inside Wavenet because inputs should be as loosely correlated as possible, so that the network does not have to untangle the correlation by itself (which is likely to lead to overfitting). At the same time, the de-noised data did not replace the real data (the real data underlying the signal were still fed to Wavenet), since Wavenet, handling thousands of data points, would not output something well correlated with the signal to which its output is appended; the fact that the signal had been processed would make this link even weaker. That is how the de-noised signal ended up being treated as a feature, as in Figure 10.

2. The Fisher vector feature; this feature represents something more abstract about a collection of data points rather than a de-noised signal. Fitting the Fisher vector inside Wavenet would not have made sense, because the Fisher vector is not part of the time series the way the other Wavenet inputs are. It was therefore appended to the output of the Wavenet, as in Figure 10.

4.4.1 Fourier Feature

The Fourier de-noising feature was created by applying the fast Fourier transform to the dataset, inspecting the frequency-space representation and deciding on a cutoff. For the created datasets the cutoff was set to around 20; a small number, but one to be expected since the dataset is sinusoidal by nature.


A comparison between the signal and its reconstruction through the low-pass Fourier filter is shown in Figure 12.

Similarly, after inspecting the AEP and DOM datasets, a cutoff of 500 seemed acceptable. Since it is unknown what these datasets would look like without noise, choosing a cutoff is like probing an object in the dark; this is one of the drawbacks of methods like the Fourier low-pass filter.

4.4.2 ARMA Feature

The ARMA feature required an inspection of the ACF and PACF plots and a decision on what choice of autoregression and moving average would be best. For the created datasets, they all looked the same so after inspecting the plots it was decided an ARMA(2,15) would be used.

Similarly, an ARMA(2,30) was used for the AEP and DOM datasets.

After the ARMA models were trained on the whole dataset, they were tasked with predicting the entire dataset, rendering the new dataset from which the features were taken.

4.4.3 Denoising-Autoencoder Feature

The specific structure of the denoising autoencoder is given in Figure 13. After training the autoencoder, a new dataset had to be created from which the signal could be drawn. It was discovered (and is to be expected) that reconstructing a data point in the first position of the autoencoder yields a slightly different result than reconstructing the same point in any other position. In order to decide which value should be used in constructing the new dataset, the autoencoder reconstructed each point in every possible position. For example, in a dataset with 100 data points, the first point can be reconstructed only in the first position, the second point in the first and second positions, and so on; the 48th point of the series can be reconstructed in any of the 48 positions of the autoencoder (i.e. encoding 1-48, 2-49, ..., 47-95, 48-96). For training, 500 epochs and a learning rate of $10^{-3}$ were used. The train-validation-test split was the same as in the regular Wavenet training: 60-20-20. The autoencoder was not trained on the test set, so that it does not overfit it and produce erroneously good predictions of the non-corrupted signal. The entire test set was reconstructed at each testing phase, and after the 500 training epochs the model parameters which performed best on the test set were chosen (since at some point overfitting started taking place and performance went down). After reconstructing each data point the maximum number of times, the reconstructions were averaged; this smooths out the differences between reconstructions. After the new dataset was created, it was used to supply the signal for the features as described in Section 4.4.
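A sketch of the position-averaged reconstruction described above (the `autoencoder` callable is a placeholder mapping a length-48 window to a length-48 reconstruction):

    import numpy as np

    def average_reconstructions(series, autoencoder, window=48):
        sums = np.zeros(len(series))
        counts = np.zeros(len(series))
        for start in range(len(series) - window + 1):
            rec = autoencoder(series[start:start + window])   # reconstruct one window
            sums[start:start + window] += rec
            counts[start:start + window] += 1
        return sums / counts                                  # average over all positions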

4.4.4 Fisher Vector Feature

Fisher vectors were originally developed for multidimensional data such as images, so fitting them to time series is not intuitive. The idea was to fit the Gaussian mixture model on the time series in a two-dimensional fashion: the x axis is the index of the time series data point and the y axis is its value. After fitting the GMM, the Fisher vector can be calculated. It was decided that each Fisher vector would represent 4 Gaussians fitted on the 800 data points of the immediate past. The idea was to have enough data points to obtain Gaussians with non-diagonal covariance which can still discern the trend from the noise, so a relatively large number of data points was selected.


At the same time, too many fitted Gaussians would impact the computing speed of the algorithm, so 4 was selected as the maximum for fast computation and the minimum for keeping the algorithm relevant. These are, of course, qualitative arguments, since no quantitative analysis exists to indicate how many data points should be used and how many Gaussians should be fitted. Each Fisher vector thus consists of the gradients with respect to the mean and the variance of each Gaussian in the mixture, in 2 dimensions (as [29] shows that the closed-form approximation of the full-covariance Fisher vector is the diagonal-covariance Fisher vector; i.e. the Gaussians are not assumed to have diagonal covariances, but the Fisher vector is). The dimension of each Fisher vector is therefore 2 (dimensions) × 4 (Gaussians in the mixture) × 2 (components per Gaussian, i.e. mean and variance) = 16, so the size of the feature is 16.

Since, in principle, a separate Fisher vector of 4 Gaussians over the past 800 points should be calculated for every point, which would be extremely slow, it was decided that a new Fisher vector would be computed every 50 data points. Every 50 consecutive data points therefore share the Fisher vector of the 800 data points preceding the leftmost point to be predicted. An illustration of this process is shown in Figure 14.
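A sketch of this windowing scheme (it reuses the hypothetical fisher_vector helper sketched in Section 3.3 and scikit-learn's GaussianMixture; the window and stride are the values quoted above):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fisher_features(series, window=800, stride=50, n_gaussians=4):
        features = {}
        for end in range(window, len(series), stride):
            # 2-D points: (index, value) pairs of the past `window` data points
            pts = np.stack([np.arange(end - window, end, dtype=float),
                            series[end - window:end]], axis=1)
            gmm = GaussianMixture(n_components=n_gaussians, covariance_type="diag").fit(pts)
            features[end] = fisher_vector(pts, gmm)   # 16-dim, shared by the next `stride` points
        return features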

5 Results

This section examines the performance, measured in RMSE, of the algorithms on each of the 5 datasets. As presented in Section 4, each experiment was run 30 times and the mean and standard deviation were recorded. This allows calculating the Student's t-test value and thus the probability that a result is statistically relevant.

After drawing the comparison between the features in terms of performance, another comparison is performed in terms of computational resources necessary. This comparison is based on the time required for the computations to be finished.

5.1 Created datasets

5.1.1 Noise = 1.5

Starting with the created dataset with noise n = 1.5, the results are shown in Table 1; "No Features" denotes Wavenet without any addition. One of the first noticeable pieces of information is the extremely low standard deviation of the loss compared to its mean for all algorithms. Even compared to most loss differences between the feature-augmented Wavenet and the plain Wavenet, the standard deviation is less than a fifth. This already hints that the results are significant. After performing the necessary calculations (Equations 20 and 21) with the values from the table, a p value can be calculated measuring the probability that the observed difference is due to chance. As can be seen, all features manage to perform better than the Wavenet with no features included, with essentially zero probability of this being due to luck.

Contrary to expectations, the signal feature seems to have fared worse than the autoencoder feature, which points towards a flaw in the conception of the chosen denoising autoencoder structure. As the autoencoder's purpose is to emulate the signal, if it beats the signal then it may be passing information about the noise of the next data point, which would render it unusable in real situations.


Feature          Number of features   Mean loss   Std of loss   p value
No Features      N/A                  0.22462     0.00033       0.0%
Signal Feature   28                   0.22231     0.00014       0.0%
ARMA             28                   0.22350     0.00017       0.0%
Autoencoder      28                   0.22158     0.00020       0.0%
Fourier          42                   0.22284     0.00028       0.0%
Fisher           N/A                  0.22432     0.00023       0.013%

Table 1: Experiment results for the created dataset with noise = 1.5.

Feature          Number of features   Mean loss   Std of loss   p value
No Features      N/A                  0.29556     0.00027       N/A
Signal Feature   14                   0.29322     0.00012       0.0%
ARMA             84                   0.29428     0.00011       0.0%
Autoencoder      28                   0.29197     0.00018       0.0%
Fourier          28                   0.29350     0.00020       0.0%
Fisher           N/A                  0.29551     0.00052       64%

Table 2: Experiment results for the created dataset with noise = 2.5.

It may be that, because of the way the new dataset is created through the denoising autoencoder (the averaging of the reconstructions of each point at each position of the autoencoder), each data point holds some information about future data points. This would allow the Wavenet to notice that and predict better than if it only contained the noiseless signal.

Apart from the autoencoder anomaly, there are no other concerning results: the signal outperforms every other feature, followed closely by ARMA, Fourier and lastly Fisher, all well below the p = 5% threshold for rejecting the null hypothesis.

5.1.2 Noise = 2.5

Looking at the results for the created dataset with noise n = 2.5 (Table 2), a pattern similar to Table 1 can be distinguished. The standard deviation of the loss over the 30 experiments is many times smaller than the difference between the means of these algorithms; only the Fisher vector feature performs very similarly to the plain Wavenet. Apart from Fisher, which seems not to have contributed anything, all other p values are rounded to zero by Python. This means that on this dataset too, most methods improve the accuracy of Wavenet. Fisher's p value essentially indicates that the network found the feature useless and did not use it in its calculations at all. The Fourier feature provides a result very similar to the signal, hinting that the filtering was very effective at reducing the noise. The fact that the difference between the Fourier feature and the signal was more pronounced in the n = 1.5 dataset may suggest that differences between the signal and the noise matter more when the noise is less pronounced.

5.1.3 Noise = 3.5

Moving on to the n = 3.5 dataset (Table 3), it is apparent that the results are similar to the n = 2.5 case. The signal feature improves the accuracy by about 0.002, or 0.5%.


Feature          Number of features   Mean loss   Std of loss   p value
No Features      N/A                  0.34647     0.00047       N/A
Signal Feature   28                   0.34457     0.00018       0.0%
ARMA             28                   0.34567     0.00034       3.4e-6%
Autoencoder      28                   0.34238     0.00026       0.0%
Fourier          98                   0.34449     0.00022       0.0%
Fisher           N/A                  0.34683     0.00123       N/A

Table 3: Experiment results for the created dataset with noise = 3.5.

Feature          Number of features   Mean loss   Std of loss   p value
No Features      N/A                  0.0164      0.0003        N/A
ARMA             28                   0.0175      0.0007        N/A
Autoencoder      28                   0.0165      0.0001        N/A
Fourier          98                   0.0172      0.0002        N/A
Fisher           N/A                  0.0178      0.0007        N/A

Table 4: Experiment results for the AEP hourly dataset.

ARMA improves the accuracy by about 0.001, or 0.25%, the autoencoder feature again performs better than the signal, and the Fourier feature achieves an improvement very similar to the signal. The difference between the Fourier and the signal features is even smaller than in the n = 2.5 case, though this interesting finding may simply be due to luck. The Wavenet with the Fisher vector obtained a worse result than the Wavenet with no features, though this difference is about half the standard deviation of the feature-augmented algorithm. A separate calculation was performed for this anomaly to determine how relevant it is; the "p value" here is 0.27%, meaning the Fisher vector is probably considered useless by Wavenet and the difference between the results is due to luck.

5.2 Real World Datasets

5.2.1 American Electric Power

For the AEP dataset (results in Table 4), the features all failed to show any positive results. This is doubly interesting because it is the first real dataset examined. The reason may be that finding the real signal here is not only as hard as in the created dataset with n = 3.5 but even harder, since the real signal is unknown: the analyst looking at the data cannot tell whether a feature resembles the signal or not. The Fisher vector remains unable to improve the loss, as in the previous datasets.

5.2.2 Dominion Virginia Power

Lastly, the DOM dataset (results in Table 5) gives similar results to the AEP dataset. The Fourier feature does appear to have made the cut, very slightly improving upon Wavenet with a p value of 4.6%, just below the 5% threshold required to reject the null hypothesis. Why this happened only with Fourier low-pass filtering and not with the other noise-reduction methods is unclear. Although the null hypothesis is formally rejected, one would need to investigate the situation further to make sure this is not a statistical fluke.


Feature          Number of features   Mean loss   Std of loss   p value
No Features      N/A                  0.0214      0.0010        N/A
ARMA             28                   0.0220      0.0007        N/A
Autoencoder      28                   0.0220      0.0007        N/A
Fourier          70                   0.0210      0.0003        4.6%
Fisher           N/A                  0.0229      0.0015        N/A

Table 5: Experiment results for the DOM hourly dataset.

Feature       n=1.5   n=2.5   n=3.5   AEP    DOM
ARMA          1300    626     2783    7739   8113
Autoencoder   761     788     746     815    844
Fourier       <1      <1      <1      <1     <1
Fisher        1219    1226    1229    1705   1550

Table 6: Feature computation duration (seconds) per dataset.

With the comparison of statistical results performed, the required computational resources should be reviewed. The time needed to compute all features on all datasets, with the hyperparameters presented in Section 4, is shown in Table 6. ARMA varies heavily, not only due to the different hyperparameters between the created and the real datasets, but even from dataset to dataset with the same hyperparameters (as for the created datasets with n = 1.5, 2.5 or 3.5). To this, the time needed for an analyst to inspect the data and the ACF and PACF graphs should technically be added, resulting in an extremely large computation time compared to the other features. The autoencoder feature reliably finishes in around 12-15 minutes with low fluctuations; this is a great characteristic, as it also needs no hyperparameters to work. Fourier low-pass filtering finishes instantly on all datasets, though one should add the time needed to select the threshold of the low-pass filter. The Fisher vector is calculated in around 20-30 minutes depending on the dataset (AEP and DOM are slightly larger than the created datasets, at around 120k data points). While its computation time is manageable and predictable, the Fisher vector has been less than effective, always leading to a worsening of the prediction power, and thus should not be employed in time series. As a benchmark, the time required to train a model was around 400-600 seconds. This means that, apart from the Fourier filter, most features add a sizeable amount of complexity to the problem. At the same time, the improvement on the created datasets ranged between 0.5 and 1.5%, while on the real-world datasets the features damaged the accuracy of the base model. This poses a resource-management problem: one needs to assess how much computational power is available and decide whether top accuracy is desirable under the present conditions.


6 Conclusion

This paper has looked into the possibility of adding pre-computed features to an already well-established algorithm in the domain of time series, to check for possible increases in performance. All of the features, with the exception of the Fisher vector, proved their effectiveness on the created datasets with noise n = 1.5 and n = 2.5; these results are significant beyond any shadow of a doubt, as shown by the p values calculated through the Welch test. The uncorrupted signal feature obtained better results than all the others apart from the de-noising autoencoder. This is unexpected and may be due to the autoencoder taking future data points into account when predicting the present data point in most cases, which was a design flaw in the algorithm. With this insight in mind, future experiments would only include the prediction of each point when it is in the last position of the autoencoder (i.e. the 48th place in this case). The benefits of the features were consistent over all the created datasets, with an improvement of between 0.001 and 0.002 in the loss for the Wavenet with features. This benefit was only observed for the de-noising features, with the Fisher vector being useful only when the noise was low, and even then less effective than the others.

On the "real world" datasets no feature managed to improve on the classical Wavenet architecture; the Fourier filter was useful for predicting data from the DOM dataset, but everything else failed. The factors leading to this result may be a combination of the one already present in the n = 3.5 dataset, where the signal is harder to recover by the individual features, and the fact that the analyst does not have access to the true signal, so one cannot determine whether a de-noising operation arrives at the ground truth or is far from the real signal.

While generally unsuccessful on the real datasets, the features have shown that under certain conditions they can bring improvements to time series analysis when coupled with generative autoregressive networks like Wavenet. The improvements on these datasets ranged between around 0.5 and 1.5%, which is the range of improvement one would expect on other types of datasets.

The computational time required for creating the features is fairly large compared to the time needed to train the model. This raises the question of whether one has the resources necessary for using features like the ones presented here. Since performance on the real-world datasets was damaged, it is advisable to continue using the classical models when time resources are scarce, and to test whether adding the features brings an improvement only when resources allow training a model both with and without them.

Future work on this topic can be justified first and foremost by the question of what other types of features might be useful for such networks. This paper shows that generative autoregressive neural networks can be enhanced by certain hand-crafted features under certain conditions in time series prediction, thus opening a whole area of research. Most of the features used here had to do with de-noising the data and inputting the signal, while the Fisher vector had to do with capturing a broader pattern of past data. De-noised signals used as features proved successful in some situations while the Fisher vector was overall unproductive; however, one could imagine other features similar in nature to the Fisher vector that provide useful information for the network to process. A future attempt to improve the algorithm might modify the autoencoder feature so that it only takes previous data into consideration, as it has the potential to be a useful tool which does not take long to create and needs no hyperparameters.


Lastly, a good next phase of research would be to look for real-world datasets that are properly addressed by the features presented in this paper, in order to show that the additions work in practical applications too and thus incentivize others to use them when conditions allow it.

The code used for this project is uploaded on GitHub [18] for future reference.

References

[1] Ratnadip Adhikari and R. K. Agrawal. An introductory study on time series modeling and forecasting. CoRR, abs/1302.6613, 2013.

[2] Neena Aloysius and M. Geetha. A review on deep convolutional neural networks. pages 0588– 0592, 04 2017.

[3] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, March 1994.

[4] Ken Chatfield, Victor Lempitsky, Andrea Vedaldi, and Andrew Zisserman. The devil is in the details: An evaluation of recent feature encoding methods. volume 2, pages 76.1–76.12, 11 2011.

[5] Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Return of the devil in the details: Delving deep into convolutional nets. 2014.

[6] Kejiang Chen, Hang Zhou, Dongdong Hou, Hanqing Zhao, Weiming Zhang, and Nenghai Yu. Provably secure steganography on generative media. CoRR, abs/1811.03732, 2018.

[7] Seung Ju Cho, Tae Joon Jun, Byungsoo Oh, and Daeyoung Kim. Dapas : Denoising autoen-coder to prevent adversarial attack in semantic segmentation, 2019.

[8] George E. P. Box, Gwilym M. Jenkins, and Gregory C. Reinsel. Time series analysis: Forecasting and control.

[9] Tomoki Hayashi, Ryuichi Yamamoto, Katsuki Inoue, Takenori Yoshimura, Shinji Watanabe, Tomoki Toda, Kazuya Takeda, Yu Zhang, and Xu Tan. Espnet-tts: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit, 2019.

[10] Indra Hermawan, Ario Husodo, Wisnu Jatmiko, Budi Wiweko, Alfred Boediman, and Beno Pradekso. Denoising noisy ecg signal based on adaptive fourier decomposition. pages 11–14, 12 2018.

[11] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

[12] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[13] Marius Ionita and H.G. Coanda. Wavelet and fourier decomposition based periodic noise removal in microscopy images. The Scientific Bulletin of Electrical Engineering Faculty, 18:68– 71, 04 2018.

(21)

[14] Keith Ito. The lj speech dataset. https://keithito.com/LJ-Speech-Dataset/, 2017.

[15] Tommi Jaakkola and David Haussler. Exploiting generative models in discriminative classifiers. In Advances in Neural Information Processing Systems 11, pages 487–493. MIT Press, 1998.

[16] Lukas S. Martak, Márius Sajgalik, and Wanda Benesova. Polyphonic note transcription of time-domain audio signal with deep wavenet architecture. 2018 25th International Conference on Systems, Signals and Image Processing (IWSSIP), pages 1–5, 2018.

[17] Rob Mulla. Hourly energy consumption datasets, https://www.kaggle.com/robikscube/hourly-energy-consumption, last accessed on 04/01/2020.

[18] Petru Neague. Github repository of code used in this project. https://github.com/pneague/Wavenet-with-precomputed-features, 2020.

[19] Jinsun Park, Yu-Wing Tai, Donghyeon Cho, and In So Kweon. A unified approach of multi-scale deep and hand-crafted features for defocus estimation. CoRR, abs/1704.08992, 2017.

[20] Barak Pearlmutter. Learning state space trajectories in recurrent neural networks. Neural Computation, 1:263–269, 06 1989.

[21] Florent Perronnin, Yan Liu, Jorge Sánchez, and Hervé Poirier. Large-scale image retrieval with compressed fisher vectors. pages 3384–3391, 06 2010.

[22] Erik Bodzsár, Bálint Daróczy, István Petrás, and András A. Benczúr. GMM based fisher vector calculation on GPGPU.

[23] Wei Ping, Kainan Peng, and Jitong Chen. Clarinet: Parallel wave generation in end-to-end text-to-speech. CoRR, abs/1807.07281, 2018.

[24] Johannes Lutz Schönberger, Hans Hardmeier, Torsten Sattler, and Marc Pollefeys. Comparative evaluation of hand-crafted and learned local features. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[25] Hongyu Shen, Daniel George, Eliu A. Huerta, and Zhizhen Zhao. Denoising gravitational waves with enhanced deep recurrent denoising auto-encoders. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019.

[26] Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, R. J. Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, and Yonghui Wu. Natural TTS synthesis by conditioning wavenet on mel spectrogram predictions. CoRR, abs/1712.05884, 2017.

[27] Sam Shleifer, Clara McCreery, and Vamsi Chitters. Incrementally improving graph wavenet performance on traffic prediction, 2019.

[28] Jorge Sánchez, Thomas Mensink, and Jakob Verbeek. Image classification with the fisher vector: Theory and practice. International Journal of Computer Vision, 105, 12 2013.

[29] Masayuki Tanaka, Akihiko Torii, and Masatoshi Okutomi. Fisher vector based on full-covariance gaussian mixture model. IPSJ Transactions on Computer Vision and Applications, 5:50–54, 2013.


[30] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. CoRR, abs/1609.03499, 2016.

[31] Aäron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C. Cobo, Florian Stimberg, Norman Casagrande, Dominik Grewe, Seb Noury, Sander Dieleman, Erich Elsen, Nal Kalchbrenner, Heiga Zen, Alex Graves, Helen King, Tom Walters, Dan Belov, and Demis Hassabis. Parallel wavenet: Fast high-fidelity speech synthesis. CoRR, abs/1711.10433, 2017.

[32] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, pages 1096–1103, New York, NY, USA, 2008. Association for Computing Machinery.

[33] Xin Wang, Jaime Lorenzo-Trueba, Shinji Takaki, Lauri Juvela, and Junichi Yamagishi. A comparison of recent waveform generation and acoustic modeling methods for neural-network-based speech synthesis. 04 2018.

[34] Yong Yu, Xiaosheng Si, Changhua Hu, and Jianxun Zhang. A review of recurrent neural networks: Lstm cells and network architectures. Neural Computation, 31:1–36, 05 2019.


Figure 3: A simple 1-dimensional convolution. Function g is the input and f is the kernel applied to it. The kernel is colour coded to show the 2 values it takes, one for the diagonal operation and one for the horizontal operation. Note that a plain convolution produces an output that is smaller than the input; this can be remedied by padding the input so that the two sizes match.
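To make the operation concrete, the following is a minimal numerical sketch of Figure 3 in NumPy; the signal and kernel values are made up for illustration, and np.convolve flips the kernel, following the mathematical definition of convolution:

```python
import numpy as np

g = np.array([1.0, 2.0, 0.0, -1.0, 3.0])  # input signal g (illustrative values)
f = np.array([0.5, 0.25])                 # two-tap kernel f (illustrative values)

# 'valid' convolution: the output is shorter than the input, as noted in the caption
out_valid = np.convolve(g, f, mode="valid")   # length 4 for a length-5 input

# zero-padding the input keeps the output the same length as the input
out_same = np.convolve(g, f, mode="same")     # length 5
print(out_valid, out_same)
```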


Figure 4: Causal Convolutions. The next prediction depends only on the past data points.

Figure 5: Dilated Convolutions. Only 4 dilated-convolution layers, each with dilation factor $2^{\text{layer number}}$, are needed to aggregate information from a receptive field of 32 past data points. If the number of layers is increased to 10, the receptive field grows to 1024; it increases exponentially with the number of layers.
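As a rough sketch of how such a stack can be assembled (assuming a Keras implementation; the filter count and kernel size below are placeholders rather than the exact values used in this project [18]):

```python
from tensorflow.keras import layers, models

def dilated_stack(n_layers, filters=25, kernel_size=2):
    """Causal Conv1D layers with exponentially growing dilation, as in Figures 4-5."""
    inp = layers.Input(shape=(None, 1))
    x = layers.Conv1D(filters, kernel_size, padding="causal")(inp)  # initial causal conv
    for layer in range(n_layers):
        x = layers.Conv1D(filters, kernel_size, padding="causal",
                          dilation_rate=2 ** layer)(x)
    return models.Model(inp, x)

# For kernel size k, the receptive field of such a stack is
# 1 + (k - 1) * (sum of all dilation factors), so it grows exponentially with depth.
model = dilated_stack(n_layers=10)
```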


Figure 7: The classical structure of a de-noising autoencoder. The original input ($x$) is artificially corrupted with the type of noise one wishes to remove ($\tilde{x}$); the autoencoder encodes the corrupted data ($h$) and tries to recreate the uncorrupted data ($x'$) with as much fidelity as possible. The reconstructed data is then compared to the original data, yielding a reconstruction error which is used to train the network.


Figure 8: Created, fictitious datasets. The X axis is the index of the data point and the Y axis its value. Top left: the initial signal from Equation 19, i.e. the original dataset from which the others were created by adding uniform random noise between -n and n. Top right: signal with n = 1.5 perturbation; Bottom left: signal with n = 2.5 perturbation; Bottom right: signal with n = 3.5 perturbation. All datasets were rescaled to fit between -0.5 and 0.5.


Figure 9: Real-life datasets taken from Kaggle.com [17]. Left: American Electric Power hourly consumption over roughly 10 years; Right: Dominion Virginia Power hourly consumption over roughly 10 years. Both were rescaled to lie between -0.5 and 0.5, like the created datasets. No outlier filtering was performed, which leaves the bulk of the DOM dataset between about -0.2 and 0.5.
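The rescaling used here and for the created datasets is plain min-max normalization shifted to be centred on zero; a one-line sketch (the helper name is hypothetical):

```python
import numpy as np

def rescale(series: np.ndarray) -> np.ndarray:
    """Min-max rescale a series to the interval [-0.5, 0.5]."""
    return (series - series.min()) / (series.max() - series.min()) - 0.5
```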


Figure 10: Wavenet architecture used in this project. It is similar to the original, but with a slightly changed structure towards the end and different hyperparameters. The waveform is transformed by the initial causal convolution, which sends its 25-dimensional output through the 12 dilation layers, repeated twice (24 layers in total), with dilation factors as shown in the figure. Each Wavenet cell contributes a skip connection; the almost 2000 skip outputs are summed, concatenated with the hand-crafted features and fed to the fully connected layer. The fully connected layer performs linear regression and outputs a value which is compared with the actual value through the RMSE, allowing backpropagation.
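A minimal sketch of the output head described in this caption, assuming a Keras-style implementation; the dimensions and wiring below are illustrative assumptions rather than the exact code of this project (which is available at [18]):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

SKIP_DIM, FEAT_DIM = 2000, 32   # assumed sizes of the summed skips and precomputed features

skip_sum = layers.Input(shape=(SKIP_DIM,), name="summed_skip_connections")
features = layers.Input(shape=(FEAT_DIM,), name="precomputed_features")

merged = layers.Concatenate()([skip_sum, features])       # concatenate hand-crafted features
prediction = layers.Dense(1, activation="linear")(merged)  # linear-regression output

head = models.Model([skip_sum, features], prediction)

def rmse(y_true, y_pred):
    return tf.sqrt(tf.reduce_mean(tf.square(y_true - y_pred)))

head.compile(optimizer="adam", loss=rmse)
```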


Figure 11: Left: Fourier-space rendering of the created dataset with noise n = 3.5. Right: after applying the low-pass filter, a comparison of the Fourier reconstruction and the original signal without corrupting noise. The apparent stretching arises only because the original uncorrupted signal was squeezed when the normalization to [-0.5, 0.5] was applied after the noise was added.
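A minimal NumPy sketch of such a low-pass filter (the cut-off fraction is a placeholder, not the value used in the experiments):

```python
import numpy as np

def fourier_lowpass(signal, keep_fraction=0.02):
    """Zero out all but the lowest-frequency Fourier components of a 1-D signal."""
    spectrum = np.fft.rfft(signal)
    cutoff = max(1, int(len(spectrum) * keep_fraction))
    spectrum[cutoff:] = 0.0                        # discard high-frequency (noise) components
    return np.fft.irfft(spectrum, n=len(signal))

# Example: a sine corrupted with uniform noise in [-3.5, 3.5], as in the created datasets
t = np.linspace(0, 20 * np.pi, 2000)
noisy = np.sin(t) + np.random.uniform(-3.5, 3.5, size=t.shape)
smoothed = fourier_lowpass(noisy)
```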

Figure 12: Left: ACF plot; Right: PACF plot for the created datasets. After inspection, it was decided that an ARMA(2,15) would be used. Ideally an ARMA(50,50) would have been fitted, but the required computation time was prohibitive (the Python statsmodels library was used for the fitting).
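The order selection and fitting can be reproduced along the following lines with statsmodels; recent versions expose ARMA(p, q) as ARIMA with d = 0, so the exact call may differ from the version used in this project:

```python
import numpy as np
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA

series = np.random.randn(5000)          # placeholder for the noisy training series

plot_acf(series, lags=60)               # inspect to choose the MA order
plot_pacf(series, lags=60)              # inspect to choose the AR order

fit = ARIMA(series, order=(2, 0, 15)).fit()                   # ARMA(2, 15)
reconstruction = fit.predict(start=0, end=len(series) - 1)    # in-sample ARMA reconstruction
```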


Figure 13: Training sequence of the denoising autoencoder used in this project. 48 data points are compressed into 32 neurons and then recreated. Then, RMSE is taken between the real and recreated data.
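A sketch of this training sequence, assuming a Keras implementation; the layer sizes follow the caption, while the activation functions, corruption level and training settings are assumptions:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

WINDOW, CODE = 48, 32      # 48 input points compressed into 32 neurons (Figure 13)

def rmse(y_true, y_pred):
    return keras.backend.sqrt(keras.backend.mean(keras.backend.square(y_true - y_pred)))

autoencoder = keras.Sequential([
    layers.Dense(CODE, activation="relu", input_shape=(WINDOW,)),   # encoder: h
    layers.Dense(WINDOW, activation="linear"),                      # decoder: reconstruction
])
autoencoder.compile(optimizer="adam", loss=rmse)

# clean: (n_windows, 48) array of uncorrupted windows (placeholder data here)
clean = np.random.uniform(-0.5, 0.5, size=(1000, WINDOW))
noisy = clean + np.random.uniform(-0.1, 0.1, size=clean.shape)   # artificial corruption
autoencoder.fit(noisy, clean, epochs=10, batch_size=64)          # learn to map noisy -> clean
```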


Figure 14: An illustration of the Fisher vector features. Every 50 data points, a new Fisher vector is calculated; it constitutes the features for the following 50 points.
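A sketch of the standard diagonal-covariance Fisher vector computation [28] on which such features can be based; the descriptor construction, GMM size and 50-point windowing below are illustrative assumptions rather than the exact procedure used in this project:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(x, gmm):
    """Fisher vector of descriptors x (n, d) w.r.t. a fitted diagonal-covariance GMM [28]."""
    n, _ = x.shape
    q = gmm.predict_proba(x)                                      # (n, K) soft assignments
    mu, sigma, pi = gmm.means_, np.sqrt(gmm.covariances_), gmm.weights_
    diff = (x[:, None, :] - mu[None, :, :]) / sigma[None, :, :]   # (n, K, d)
    g_mu = (q[:, :, None] * diff).sum(0) / (n * np.sqrt(pi)[:, None])
    g_sigma = (q[:, :, None] * (diff ** 2 - 1)).sum(0) / (n * np.sqrt(2 * pi)[:, None])
    return np.concatenate([g_mu.ravel(), g_sigma.ravel()])

# Illustrative use: split the series into 50-point blocks and compute one vector per block
series = np.random.uniform(-0.5, 0.5, size=5000)
blocks = series[: len(series) // 50 * 50].reshape(-1, 50, 1)      # scalar descriptors here
gmm = GaussianMixture(n_components=4, covariance_type="diag").fit(blocks.reshape(-1, 1))
fvs = np.stack([fisher_vector(b, gmm) for b in blocks])           # one Fisher vector per block
```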
