
MSc Information Studies
Track: Data Science
Master Thesis

LSTM probability classifier for anomaly detection

Michal Kozel
11413204
July 26, 2017

University of Amsterdam
Supervisor: Thomas Mensink
Assessor: Edwin de Jong

LSTM probability classifier for anomaly detection

Michal Kozel

University of Amsterdam

Michal.Kozel@student.uva.nl

ABSTRACT

Identifying data that do not conform to normal behaviour, so-called anomaly detection, is an important task in many domains, from medical data analysis to credit card fraud detection. At present, the amount of available data is often too large to be analyzed by human experts; therefore, automatic anomaly detection is an essential tool. The data are often captured sequentially in a given order, creating so-called time series. We describe methods that deal with such series in an on-line fashion, without using algorithms that have to act over the whole data at once. One way to achieve this is to build a predictive model and report anomalies based on the prediction errors. In this paper we present two such methods, using LSTM neural networks and Kalman filters as the predictive models. These methods are tested on gradually more complicated time series, and at the end LSTM networks are extended to handle inherently unpredictable time series by predicting complex probability distributions rather than a single value or a Gaussian. This new approach, the LSTM probability classifier, shows promising results with unpredictable series and allows efficient on-line anomaly detection that was not previously possible with similar predictive approaches.

Author Keywords

Anomaly detection; Kalman filters; LSTM neural networks;

INTRODUCTION

Anomaly detection refers to identifying data points or patterns that do not fit normal or expected behaviour. Such points are then called outliers, or anomalies. A related topic, novelty detection, also refers to identifying nonconforming patterns; those patterns, however, are then adopted and treated as normal. A comprehensive overview of anomaly and novelty detection methods can be found in [2].

Anomaly detection is an important task and is widely used in many domains. This is because it provides fast, actionable information from data that are often too large or even incomprehensible for human observers. For example, in the medical domain anomalous MRI scans can help identify tumors [15], anomalous network traffic can help uncover or prevent hacker attacks [8], and a sudden change in credit card transaction data might be a sign of credit card theft [1]. Data from many production systems in factories, or readings of aircraft sensor data, also need to be monitored for anomalies to prevent possibly disastrous events [3].

Figure 1: Sample of a series of dice rolls with every roll interleaved with 10-20 zeroes. Anomalies are shown as blue dots.

The definition of an anomaly is domain and task specific; therefore most systems are designed for one problem within a certain domain. Some common challenges are a lack of labeled data, not enough data to specify normal behaviour, evolving behaviour, or noise. An interesting case is when the anomalies are caused by malicious action. Typically, the attackers will try to make the anomaly appear normal, making the detection task even more difficult. Withdrawals from a stolen credit card will most likely be spread over a longer period of time and made in small instalments.

One of the key distinctions for detection methods is the nature of the input data. We can deal not only with univariate and multivariate data, but also with qualitative or quantitative data. The most important data property for this paper is the relation between data points in the set. Most techniques deal with data that have no relation between instances, but many times those instances can be related. For example, geographic data are spatially related and sequence data are time related. This paper focuses on sequential data that introduce temporal dependencies, so-called time-series. A formal definition is given later in the Time-series section. In such a setting, two otherwise identical data instances can be very different depending on the context in which they appear. This also introduces a new type of anomaly: change points. These are points at which the behaviour of the time-series changes, and the new behaviour is adopted as normal for a given period, until the next change point appears.

Time-series can bring one more challenge, which is the frequency of incoming data instances. In high-speed environments we cannot afford to process the whole series at once, or even work with sub-samples of the series. This rules out many anomaly detection methods, mostly those based on clustering [2, 16], and forces us to process every data instance sequentially as it arrives.

One way to approach such problems is to try to predict a possible next value in the series even before the real value is available, and base our decisions on the prediction error, as shown in [11]. Some series, however, are inherently unpredictable, yet can still have obvious anomalies. A simple example is a series of dice rolls. Such a series is clearly not predictable, yet observing a value of 4.5 would be unexpected and anomalous. Values in this series are not time dependent and are expected to have the same distribution at every timestamp. A more complicated version interleaves every dice-roll outcome with a series of zeroes of fixed or even varying length. The dice roll example with varying lengths between rolls is shown in Figure 1. In such cases, predicting a single value or some standard probability distribution would not be of much use.

The main objective of this paper is to present methods that deal with sequential data in an on-line fashion, and to extend these methods to handle gradually more complicated time-series like the dice roll example. First, the necessary background about time-series and Markov processes is introduced in the Background section, along with one standard model, the Kalman filter, and its usage for anomaly detection. Next, the Methodology section presents LSTM models and introduces a variant that deals with unpredictable series. Such series could previously not be solved by standard methods that process data sequentially, point by point, in an efficient manner. The following sections focus on experiments conducted on both generated and real data sets. Lastly, the paper concludes with future work and conclusion sections.

BACKGROUND

This section covers the related work and background theory that is necessary to understand the presented anomaly detection methods. First, we define time-series and explain what type we will use in the rest of the paper. The second fundamental part of this section concerns Markov processes, which will be used to describe our time-series models; Kalman filters will be introduced from this perspective. At the end of this section, we introduce anomaly detection methods that can be implemented using predictive models.

Time-series

A time-series can be formally defined as $X = x_1, x_2, x_3, \ldots, x_n$ where $x_i \in \mathbb{R}^m$. We will be working with one-dimensional series, as it is easier to define anomalies this way. Note that both presented methods can also work with and predict multidimensional series. Every data point is also typically associated with a time stamp. We will work with homogeneous time series, which means that the time stamps are evenly distributed over a period of time and can therefore be omitted. In non-homogeneous time-series, the time differences between data points differ, which can introduce new types of anomalies, such as a long period without any activity.

Markov process

Figure 2: Example of a Markov process. The state X at time t depends only on the state X at time t-1. From each such state we can generate measurements Z.

A Markov process refers to a process where the next state of a system depends solely on the current state. An example is shown in Figure 2. The states themselves can be hidden, but we can generate observable variables from them. In order to represent a process this way, the Markov assumption, sometimes called the complete state assumption, must hold. This assumption states that future and past data are independent if we know the current state x at time t. In such a case, the future states depend solely on the current state x, so the current state should be a complete summary of the past. This can be expressed by equation (1).

$$P(x_{t+1} \mid x_t, x_{t-1}, \ldots, x_0) = P(x_{t+1} \mid x_t) \qquad (1)$$

The Markov assumption can be violated, for example, by some unmodeled dynamics of the environment that we did not include in the state representation. Usually such variables can be included in the state representation, but sometimes incomplete states may be preferred over complex representations due to computational efficiency.

Kalman Filters

Kalman filters are estimators, and their different variants have been used in anomaly detection, as shown in [14] or [17]. The version of the Kalman filter described in [13] is also used as a benchmark for anomaly detection in radio signals [11]. More general use cases, along with a Python implementation and a detailed description of various filters, can be found in [9]. The filters try to estimate the internal state of a system based on a series of obtained measurements. Typically, this internal state is larger than the measured state, but the Kalman filter can estimate the whole internal state based on the covariances of these hidden variables. An example is an internal state represented by the position, velocity, and acceleration of a tracked object, with a measurement space containing only the position of the object. From changes in the measured positions, the Kalman filter can infer the velocity and acceleration that are kept in the internal state. This internal state can therefore be seen as an unobserved Markov process. As such, Kalman filters have exactly the same structure as Hidden Markov Models.

The whole process has two steps. As mentioned above, internal states are modeled as a Markov process, so we can estimate the following state from the current one using a transition matrix. This is called the prediction step, and can be expressed by the following equations:

Prediction:

$$x_{k+1} = A x_k + B u$$
$$P_{k+1} = A P_k A^T + Q$$

Here x represents the internal state, A is the transition matrix, and P is the process covariance matrix. B maps the optional control input u into the state space, and Q describes the process noise.

Figure 3: LSTM unit with input, output and forget gates.

After each prediction step, an update step follows. This step adjusts the state based on the actual measurement. After such a measurement is taken, we compute the residual between the measurement and our prediction, and adjust the state to lie in between those. The resulting state is computed using the so-called Kalman gain, which takes into account our confidence in the predictions and in the measurement. All state variables are represented as Gaussian distributions.

Update:

$$K = P_k H^T (H P_k H^T + R)^{-1}$$
$$x_{k+1} = x_k + K (z_k - H x_k)$$
$$P_{k+1} = (I - K H) P_k$$

K is the Kalman gain, z is the observed variable as shown in Figure 2, R is the measurement noise covariance, and H is the matrix that maps the hidden state x to the measurement space, where it is represented by the observed variable z.
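As an illustration of the two steps above, a minimal NumPy sketch of the predict and update equations might look as follows; this is a generic textbook formulation, not the exact implementation used in this paper:

```python
import numpy as np

def kf_predict(x, P, A, Q, B=None, u=None):
    # Prediction step: propagate the state and its covariance through the transition model.
    x_pred = A @ x if B is None or u is None else A @ x + B @ u
    P_pred = A @ P @ A.T + Q
    return x_pred, P_pred

def kf_update(x_pred, P_pred, z, H, R):
    # Update step: correct the prediction with the measurement via the Kalman gain.
    S = H @ P_pred @ H.T + R                    # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)         # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)       # residual-weighted correction
    P_new = (np.eye(len(x_pred)) - K @ H) @ P_pred
    return x_new, P_new
```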

So far we have only introduced linear filters, but two versions of the Kalman filter work with nonlinearities: the extended Kalman filter (EKF) and the unscented Kalman filter (UKF). The EKF uses derivatives to linearize the system and propagate the mean and covariance. UKFs, first introduced in [7], use the unscented transform. This means we pick sample points around our current state mean; these points are then propagated through a non-linear function, and the resulting transformed points are used to estimate the mean and covariance of the new state, often yielding better results than the EKF. We will use unscented filters to deal with non-linear series.
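For the unscented variant, a library such as filterpy (the library accompanying [9]) can be used. The sketch below assumes a simple value-plus-rate state and a scalar measurement; the model, noise values, and parameter choices are illustrative assumptions and should be adapted to the series at hand:

```python
import numpy as np
from filterpy.kalman import UnscentedKalmanFilter, MerweScaledSigmaPoints

def fx(x, dt):
    # Assumed state transition: value plus its rate of change (constant-velocity model)
    return np.array([x[0] + dt * x[1], x[1]])

def hx(x):
    # We only measure the value itself, not its rate of change
    return np.array([x[0]])

points = MerweScaledSigmaPoints(n=2, alpha=0.1, beta=2.0, kappa=0.0)
ukf = UnscentedKalmanFilter(dim_x=2, dim_z=1, dt=1.0, fx=fx, hx=hx, points=points)
ukf.R *= 0.5    # measurement noise (illustrative value)
ukf.Q *= 0.01   # process noise (illustrative value)

for z in [0.10, 0.25, 0.43]:      # a short stream of measurements
    ukf.predict()
    ukf.update(np.array([z]))
    # ukf.x now holds the filtered state used to compute the prediction error
```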

A disadvantage of Kalman filters is that they have to be designed to fit a particular system. Since their predictions are based on the current state and a transition matrix, we can train the transition matrix to make the model a better fit for the current environment. To achieve this, we can use an expectation maximization algorithm to obtain the values of the transition matrix. Even with this improvement, the filters still cannot capture long-term dependencies, and the predictions are limited to Gaussian distributions.
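One way to fit the transition matrix by expectation maximization is the `em` routine of the pykalman library. The following sketch is only illustrative; the library choice, the two-dimensional state, and the file name are assumptions, not the setup used in this paper:

```python
import numpy as np
from pykalman import KalmanFilter

series = np.loadtxt("series.csv")                 # hypothetical 1-D time series

kf = KalmanFilter(
    transition_matrices=np.eye(2),                # initial guess for the transition matrix A
    observation_matrices=np.array([[1.0, 0.0]]),  # H: only the first state variable is observed
    n_dim_state=2, n_dim_obs=1,
)
# EM re-estimates the transition matrix (and its noise) from the data
kf = kf.em(series, n_iter=10, em_vars=["transition_matrices", "transition_covariance"])
state_means, state_covs = kf.filter(series)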

Anomaly detection

Having defined the predictive models, we can use them to predict the next data point in the series. The error of the prediction is measured and used to mark anomalies. A similar approach was used in [11, 12]. Whether the error is big enough for the point to be considered an anomaly is determined by a threshold. We can use a threshold with a constant value, or dynamically adjust the threshold based on the current variance of the time-series. Another factor for setting or adjusting the threshold is the confidence of our model's prediction.
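A minimal sketch of this idea, with a dynamic threshold derived from recent error statistics; the window size and the multiplier k are illustrative assumptions:

```python
import numpy as np

def detect_by_prediction_error(values, predictions, window=50, k=4.0):
    # Flag a point when its prediction error exceeds the recent mean error by k recent
    # standard deviations, so the threshold adapts to the current variance of the series.
    errors = np.abs(np.asarray(values) - np.asarray(predictions))
    flags = np.zeros(len(errors), dtype=bool)
    for t in range(window, len(errors)):
        recent = errors[t - window:t]
        flags[t] = errors[t] > recent.mean() + k * recent.std()
    return flags
```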

Figure 4: Architecture of the network. Several LSTM layers are followed by several dense layers. At the end, there are two outputs for classification and regression. Categorical crossentropy is used as the loss function to optimize classification and mean squared error as the loss for regression.

METHODOLOGY

This section introduces LSTM networks as time-series predictors, and their advantages over Kalman filters. We explain their behaviour for different time-series, and at the end adjust our model to deal with unpredictable series in an on-line fashion.

LSTM neural networks

Artificial neural networks are used in a wide variety of domains, and countless versions and architectures have been invented. A detailed and comprehensive overview can be found, for example, in [5]. Examples of neural networks used specifically for time-series are presented in [4]. One notable variant is recurrent neural networks (RNNs). RNNs have recurrent connections in their layers, which means that the output of a layer in one step is kept in memory cells and is used as an input to that layer in the following step. This makes RNNs particularly suitable for time-series processing, as this recurrence allows them to learn temporal dependencies.

Feed-forward networks can also capture temporal dependencies and structures in a series, but we would need to input n values of the series at once to show the network the current context and to predict the following value. We would also need to know whether there is a pattern in the series and what the period of this pattern is. With such knowledge we could set n to be greater than the period of the pattern. This issue is solved by the recurrent connections in RNNs, which allow us to process the series one point at a time while the network remembers the current state, without explicitly providing the current context as an input. A shortcoming of RNNs is that the content of the memory cells is always fully overwritten by the new output of the corresponding layer. This limits the network so that it cannot learn longer-term dependencies. This problem was addressed by adding forget, input, and output gates to the memory cells. Such networks are called LSTMs (Long Short-Term Memory) and were proposed by Hochreiter and Schmidhuber [6]. LSTMs also learn what to store in, delete from, and read from their memory cells, which allows them to capture long-term dependencies in a time-series. An LSTM unit is shown in Figure 3.


In the context of Markov processes, we can view the content of the memory cells as the current state that summarizes all past events. This state is then combined with a new data point, producing a new inner state x and outputting a prediction of the observed variable z.

LSTM neural networks are much stronger prediction models than Kalman filters and can learn complex states and long-term dependencies. They are, however, typically used as regression models, as was done for example in [11, 12]. This allows them to correctly predict simple linear or non-linear time series and also to filter out noise, similarly to Kalman filters. They can, however, also correctly predict time series with long-term patterns, given that we have enough training data to capture such dependencies.
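As a rough sketch, such a regression LSTM could be defined in Keras as follows; the layer sizes and window length are illustrative assumptions rather than the settings used in the experiments:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

window = 20                                   # number of past points fed to the network
model = Sequential([
    LSTM(32, input_shape=(window, 1)),        # one feature per time step
    Dense(16, activation="relu"),
    Dense(1),                                 # regression output: the next value
])
model.compile(optimizer="adam", loss="mse")

# X: (samples, window, 1) sliding windows; y: (samples,) the value following each window
# model.fit(X, y, epochs=10, batch_size=64)
```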

LSTM probability classification model

Another type of series, as briefly mentioned in the introduction, is a series that is not predictable by regression. Next to the dice roll example we can also look at AWS cloud CPU usage, as shown in Figure 5. These kinds of series have some specifics and behaviour that is considered normal, yet we cannot precisely predict when the next spike in utilization will appear. This would make our predictions often very inaccurate and labeling anomalies impossible. The solution is to predict the whole probability distribution over all possible following values. If we are not sure whether or not a spike in utilization will appear, we can expect small utilization values with a certain probability, and values that are in a reasonable range for high utilization will be assigned different probabilities. All other values, not typical for the time-series, will be assigned no probability. These predicted distributions are also context dependent, so we want to assign a higher probability to a spike appearing after an unusually long period of no activity. After a long enough time with low utilization, low values should be assigned small or zero probability, and such a period will be marked as anomalous.

A model with similar behaviour can be built by changing or extending the LSTM with a classification output layer. A sample architecture is shown in Figure 4. The regression output is optimized with a mean squared error loss, while the classification output uses a crossentropy loss function. The number of layers and neurons depends on the complexity of the time-series we are trying to capture. Using more data points as an input can also help to capture longer-term dependencies.
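A hedged sketch of such a two-headed network, written with the Keras functional API; the number of layers, their sizes, and the number of classes are assumptions chosen for illustration only:

```python
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense

n_classes = 36                                     # number of bins (illustrative)
inp = Input(shape=(None, 1))                       # univariate sequence of any length
h = LSTM(64, return_sequences=True)(inp)           # LSTM layers...
h = LSTM(64)(h)
h = Dense(64, activation="relu")(h)                # ...followed by dense layers

class_out = Dense(n_classes, activation="softmax", name="classification")(h)
reg_out = Dense(1, name="regression")(h)

model = Model(inp, [class_out, reg_out])
model.compile(
    optimizer="adam",
    loss={"classification": "categorical_crossentropy", "regression": "mse"},
)
```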

Figure 5: AWS CloudWatch CPU utilization data with marked anomalies. The first anomaly is a double spike, when CPU utilization unexpectedly started to rise again. The second anomaly is 100% CPU utilization, and the last anomaly is a long period without any activity. These anomalies were hand labeled.

We need to properly categorize our data to use them with this model. The first step, as usually required for neural network training, is to normalize the data. There is no problem with series with known limit values. In some cases, like trending data, the normalization becomes more complicated, as we cannot guarantee that a new data point will fall into the range we used for normalization. This problem can be mitigated by using differences between data points rather than their absolute values. We then add some additional margin to these differences. This margin should be based on expected or possible anomaly values; it will later create classes that are almost entirely reserved for anomalies. The normalized data can be binned into n classes. Given that the last layer in the network is a softmax, it will output probabilities for each class, which is a probability distribution over possible outcomes. This distribution has no limitations like a Gaussian, and can generalize to any synthetic or real time-series with enough training data. The number of classes must be selected and will vary for different series and use cases.
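A small sketch of the binning step described above, assuming the series has already been normalized (or differenced) into roughly the [0, 1] range; the margin and the number of classes are illustrative assumptions:

```python
import numpy as np

def to_classes(series, n_classes=36, low=-0.2, high=1.2):
    # Bin every (normalized or differenced) value into one of n_classes intervals.
    # The margin below 0 and above 1 creates classes mostly reserved for anomalous values.
    edges = np.linspace(low, high, n_classes + 1)
    labels = np.clip(np.digitize(series, edges) - 1, 0, n_classes - 1)
    one_hot = np.eye(n_classes)[labels]            # targets for the softmax output
    return labels, one_hot
```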

With this model we can solve anomaly detection for unpredictable data in an on-line fashion. This is useful for fast streaming data, as a forward pass through a trained network is very efficient. The predictions can be visualized in the form of a heat-map, which gives an impression of how well the network is trained and what the characteristics of the series are. It provides more information than a simple prediction error measure, and also explains the regression output, which is always near the mean of the predicted probability distribution.
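For illustration, such a heat-map can be produced directly from the per-step softmax outputs, for example with matplotlib:

```python
import matplotlib.pyplot as plt

def plot_probability_heatmap(probs):
    # probs: array of shape (timesteps, n_classes) holding the softmax output at every step
    plt.imshow(probs.T, aspect="auto", origin="lower", cmap="viridis")
    plt.xlabel("time step")
    plt.ylabel("predicted class (binned value)")
    plt.colorbar(label="probability")
    plt.show()
```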

Detection

Detecting anomalies with an LSTM classification model can be done by setting a suitable threshold for the predicted probabilities. Data points with probabilities below the threshold should be marked as anomalies. For unpredictable series with more complicated change points, we would also need to compute the threshold dynamically; consider, for example, a switch from a 6-sided die to a 24-sided die in our toy example. After this change we need to distribute the probabilities over more possible outcomes, and the threshold should be lowered. Such series were not explored in this paper.
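A minimal sketch of this detection rule: a point is flagged when the probability the classifier assigned to the bin of the actually observed value falls below the threshold (the threshold value here is an arbitrary placeholder):

```python
import numpy as np

def flag_low_probability_points(probs, observed_classes, threshold=0.01):
    # probs: (timesteps, n_classes) softmax outputs; observed_classes: bin index of the
    # value that actually arrived at each step. A point is anomalous when the predicted
    # probability of its observed bin is below the threshold.
    p_observed = probs[np.arange(len(observed_classes)), observed_classes]
    return p_observed < threshold
```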

EXPERIMENTS ON GENERATED DATA

In this section we present experiments on different data sets. To show the properties and example detections of the implemented models, we generated a number of data series using a data generator. This was done due to the unsatisfactory anomaly labels in open data sets. Those issues are shown in the Experiments on real data section and in the appendix.

Data Generator

Our generator can produce linear or sine data series with given parameters. For linear data we can specify an initial value, slope, and number of data points. For sine data we can specify an initial value, frequency, amplitude, and also the number of generated values. We can then insert Gaussian noise with a given variance into the series, and generated series can be concatenated. The places where these series were concatenated are then marked as anomalies, or so-called change points. A different kind of anomaly, so-called outliers, can be inserted within the series with a certain probability; in that case, points farther than 4 standard deviations of the current series are inserted. The generator also allows us to repeat a generated series n times to test models on data with long-term patterns.

Figure 6: Sample series consisting of linear data with a change point every 200 points and randomly inserted outliers. The blue line shows predictions made by the Kalman filter. Blue dots show real anomalies. We can see anomalies are context dependent, and small deviations as shown in the first 200 data points would not be considered anomalies later on when the variance increases. The filter could detect all the anomalies except the last one, where it adjusted too smoothly to the change.

Figure 7: Sample series consisting of sine data with no variance. The blue line shows the prediction of the unscented Kalman filter and we can see it fits the real data well. Note that a linear Kalman filter or moving average would considerably lag behind the data. All anomalies, marked with blue dots, were detected in this case.
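A hedged re-implementation sketch of such a generator; the function and parameter names are assumptions chosen for illustration, not the original generator code:

```python
import numpy as np

def generate_series(kind="linear", n=200, start=0.0, slope=0.01,
                    freq=0.05, amplitude=1.0, noise_var=0.0):
    # Generate one segment: a line with a given slope, or a sine with given frequency/amplitude.
    t = np.arange(n)
    if kind == "linear":
        series = start + slope * t
    else:
        series = start + amplitude * np.sin(2 * np.pi * freq * t)
    if noise_var > 0:
        series = series + np.random.normal(0.0, np.sqrt(noise_var), n)
    return series

def concatenate_segments(segments, outlier_prob=0.01):
    # Concatenate segments (their boundaries become change points) and insert point
    # outliers roughly 4 standard deviations away from the current segment.
    data, labels = [], []
    for seg in segments:
        seg = seg.copy()
        lab = np.zeros(len(seg), dtype=int)
        std = seg.std() if seg.std() > 0 else 1.0
        outliers = np.random.rand(len(seg)) < outlier_prob
        seg[outliers] += 4 * std
        lab[outliers] = 1
        if data:                       # every concatenation point is a change-point anomaly
            lab[0] = 1
        data.append(seg)
        labels.append(lab)
    return np.concatenate(data), np.concatenate(labels)
```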

First, we will show experiments on linear and non-linear data series. Such data are simple to predict, as almost any prediction close to the previous point is reasonable. This makes the whole problem more a matter of noise filtering. The task becomes incomparably more difficult for data with long-term patterns or even unpredictable time-series, which will be addressed after the first two.

Linear Time Series

Both KFs and LSTM networks can predict linear data series and sufficiently smooth out the noise, as seen in Figure 6. Detection of outliers given the current variance of the series is then a simple task.

For change points, like a change of mean or slope, the problem becomes more difficult and depends on various factors. For KFs we need to tune the filter to choose how well it will adapt to changes. If the filter is too adaptive, it can easily adapt to a small change in slope without noticing the anomaly. On the other hand, if the filter is not adaptive enough, it will detect the first change of slope, but will lag behind the data for a long time and will not notice other anomalies during such a period. The adaptivity of LSTMs depends on the provided training data. A sample series with KF detection can be seen in Figure 6. Alternatively, other metrics that we are interested in, like slope or variance, can be tracked separately as different data series. The same models can then be used for such series, improving detection accuracy but making the detection less general, as we need to extract those characteristics from the series.

Non-Linear Time Series

Anomaly detection on non-linear data series shares the same characteristics as detection on linear data. When using an unscented Kalman filter to adapt to the nonlinearities, both KFs and LSTMs smooth out the noise and can easily detect outliers. Change point detection remains highly dependent on the model setting. A sample series with detection can be seen in Figure 7.

Figure 8: Sample time-series with a repeated pattern. Blue lines show the LSTM prediction. We can see that the network is capable of noticing the missing uptick in the pattern and also filters out noise. Such series are not solvable by Kalman filters, as every change of mean would be seen as an anomaly; the Kalman predictions are shown by the green line.

Figure 9: Time-series of dice rolls, shown by the red line. LSTM network predictions are shown as a heat map, assigning probabilities to 36 groups shown on the vertical axis. Regions with high probability are green, whereas regions with low probability are dark blue. Groups with numbers divisible by 6 represent the integer outcomes 0-6. We can see that those are the outcomes the network predicts. An anomaly can be seen at point eleven, where the value falls in group 21, representing a number around 3.5, which is clearly an anomaly for a dice roll.

Time Series with Patterns

Data with long-term patterns become much harder to predict, as the hidden state in the Markov process needs to be much more complex to capture this time information. A Kalman filter cannot capture such information even when the transition matrix is trained by the expectation maximization algorithm. This results in the Kalman filter reporting a lot of change-points that are part of the pattern, or not reporting anything if it adapts to the data too well.

Such data need more complex models. We can still use a simple feed-forward network if we have knowledge about the pattern period. In that case, we use the last n values in the series as the network input, given that the pattern period is lower than n. If we don't have such knowledge, the only suitable solution is LSTM networks, which proved able to solve such patterns, as seen in Figure 8.

Unpredictable Time Series

The last and most challenging type of data are series that cannot be predicted by regression. Even simple probability distributions like a Gaussian are not sufficient; therefore KFs are not enough to solve these series. Only the LSTM probability classifier can deal with these data. Classical regression models output only the mean of the real distribution, which is caused by the objective of regression: minimizing the mean squared error.

The simple toy example of dice rolls is given in the Introduction section, and the LSTM predictions can be seen in Figure 9. Another tested example had slight variance in the pattern period and proved to be solvable as well. The amount of data required to make such models work was generally above 500,000 points. The required complexity and size of the network depends on the amount of data and the complexity of the pattern. Simple dice rolls are solvable by a single-layer feed-forward network, whereas dice rolls interleaved with series of n zeroes require 1-2 LSTM layers followed by 1-5 dense layers, depending on n, which was tried up to 10. We did not manage to solve this series for larger n due to computational complexity.


                                             outliers            changepoints        training data
                                             accuracy  recall    accuracy  recall
Linear data          UKF                       96%      91%        92%      45%        10-30
                     LSTM networks regression  97%      90%        91%      44%        1K-10K
                     LSTM networks classif.    97%      90%        91%      44%        1K-10K
Non-linear data      UKF                       91%      85%        92%      38%        10-30
                     LSTM networks regression  97%      94%        91%      41%        1K-10K
                     LSTM networks classif.    97%      94%        91%      41%        1K-10K

                                             pattern anomaly     pattern change      training data
                                             accuracy  recall    accuracy  recall
Patterns data        UKF                       10%      50%        10%      25%        10-30
                     LSTM networks regression 100%     100%       100%      75%        100K
                     LSTM networks classif.   100%     100%       100%      75%        100K
Unpredictable data   UKF                       10%      75%        10%      25%        10-30
                     LSTM networks regression  10%      75%        10%      25%        100K
                     LSTM networks classif.   100%     100%        90%      75%        100K-1M

Table 1: Accuracy for different types of time series. It serves as a comparison of the proposed methods and was done on 4-10 different time-series with varying lengths (10K-100K) and anomaly types for every type of data. Series and anomaly types were picked to show differences and highlight the properties described previously in the experiments section. LSTM networks are better for data with patterns, and the classifier network is the only model capable of solving unpredictable series. The best results achieved after properly tuning the models for the specific case are shown.

Figure 10: Time series of AWS cloud CPU utilization. In the lower figure we can see the real data. The upper figure assigns probabilities to classes. We can see that after utilization goes up, high probabilities are assigned to higher values. Between points 2300 and 3000 we can see the network also assigns some probability to higher values, possibly expecting a spike in utilization. Capturing the full properties of the series would require much more training data, yet we can already see some emerging patterns.

Comparison

The previous section shows example usages, and points out the strengths and weaknesses of both methods. Detection of outliers is always solvable using a suitable method.

Change-point detection depends on the setting of the Kalman filter or on the training data provided to the LSTM. Given training data with simple linear series without a slope, the LSTM learns to predict a constant series and will detect even subtle changes in slope. Using training data with a slope, however, makes the network adapt to the change in the time-series smoothly without noticing the anomaly. Table 1 shows the results of both methods tested on different types of series with lengths between 10,000 and 1,000,000 data points. We report accuracy and recall of hitting an anomaly window. The windows start n data points before the actual anomaly and end n data points after; n is determined by the type of the anomaly and the length of the series. The windows are a more intuitive way of marking anomalies, and some problems with point anomalies are pointed out in the real data experiments and in the appendix.
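A sketch of this window-based scoring under the assumptions above (the exact scoring code used for Table 1 is not given in the paper):

```python
import numpy as np

def window_scores(detected, anomalies, n=10):
    # "Accuracy" here: fraction of detections that land within n points of some labeled
    # anomaly; recall: fraction of labeled anomalies whose window contains a detection.
    detected, anomalies = np.asarray(detected), np.asarray(anomalies)
    acc = np.mean([np.any(np.abs(anomalies - d) <= n) for d in detected]) if len(detected) else 0.0
    rec = np.mean([np.any(np.abs(detected - a) <= n) for a in anomalies]) if len(anomalies) else 0.0
    return acc, rec
```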

Conclusion

Less complicated series can be solved by Kalman filters without any training, and therefore more efficiently. Kalman filters, however, cannot identify patterns, and LSTM models are needed in such cases. Simple predictable patterns with a consistent period do not require much training data to perform well. Data requirements for unpredictable series are much higher, and grow especially for patterns with varying periods, like dice rolls interleaved with varying amounts of zeroes.

EXPERIMENTS ON REAL DATA

Data sets

The first data set is provided by Yahoo¹. It contains both real anomalies and synthetically generated data with inserted anomalies. The real part of the data is based on traffic in various Yahoo services, and anomalies were hand-labeled by experts. Each series has 1,000-2,000 data points.

The second data set is the Numenta anomaly benchmark², which also contains a combination of real and artificially generated data. It has 58 data files, each with 1,000-22,000 data points, and combines data from various sources like social media, industrial machine sensors, and CPU utilization.

Results

The model has to be adjusted and designed to fit the particular data. This makes testing on large, diverse data sets difficult, and measuring recall and accuracy yields poor results. This is caused by the wide difference between anomaly definitions in different time series, missing anomaly labels, the missing context of the whole series while processing data sequentially, and the unclear distinction between outliers and change points in the series. One example of unclear change-point specification can be seen in Figure 11, where some change-points are marked as just one anomaly and others as 25 consecutive anomalies. A part of the series containing 120 data points could either be marked with 2 change-points or as a series of 120 anomalies. Other examples are in the appendix. This makes reporting meaningful recall and accuracy of the models impossible. In the appendix we show that domain-specific knowledge would be very useful in some cases.

Another issue is the insufficient number of data points in the real data series. This makes training an LSTM network for unpredictable series a difficult task. In Figure 10 (showing the AWS CPU utilization), we can see that the network correctly assigns probabilities to the following values, but clearly cannot perfectly capture the properties of such a series, as it was only trained on 1,000 data points. Series of this type were previously shown to be the most demanding in terms of data, requiring up to 1,000,000 points even for relatively simple patterns.

¹ https://webscope.sandbox.yahoo.com/catalog.php?datatype=s&did=70

² https://numenta.com/numenta-anomaly-benchmark/

CONCLUSION

We described Kalman filters and LSTM neural networks as similar models based on the Markov process. Both models were tested and compared on different data series, and the advantages, shortcomings, and suitable use cases of both were explained. Both models are intended to be used on streams of data, receiving one value at a time. Some issues of such an approach were discussed and compared to an algorithm operating on the whole data at once. We overcame the complexity of designing a Kalman filter by training the transition matrix with the expectation maximization algorithm. This, however, still does not allow the Kalman filter to learn long-term dependencies. For such series, LSTMs are the more suitable model.

We extended the LSTM model to handle unpredictable series and showed this on examples. Training such a model requires large amounts of data that may not always be available, but the model shows promising results for difficult time-series.

We only used generated data to demonstrate the models' properties and draw the conclusions. The generator was limited to one-dimensional series, and only Gaussian noise was inserted. A more complex generator would be needed to provide more extensive tests and results.

The diverse labeling of anomalies in real data shows that anomaly detection is still a very domain-specific task, and the best results can be achieved with some domain knowledge and particular anomaly definitions, rather than with more complicated models. Sometimes just setting a fixed threshold for anomaly detection can be a good solution, as shown in Figure 13 in the appendix.

FUTURE WORK

The implemented methods were tested only on one-dimensional time-series, even though they can be extended to multiple dimensions. This is partly because the available data sets are one-dimensional, and partly because specifying anomalies in multiple dimensions is a tricky task. An interesting suggestion for a multi-dimensional time series with clear anomalies is to use images from the MNIST data set [10] and create a series of images of the same digit with an occasionally inserted different digit as an anomaly. Such a problem would be interesting to investigate. A recent paper from Subin Yi et al. [18] introduces grouped convolutional neural networks, which were specifically designed for multivariate time series and could be particularly suitable for this task.

ACKNOWLEDGEMENT

I would like to thank Thomas Mensink, Taylan Toygarlar, and Edwin de Jong, my research supervisors, for their help and guidance, and also IMC Financial Markets, where I carried out my research.


Figure 11: The first image shows a change point marked with just one point; after this the data are considered normal again. The second image marks approximately 25 consecutive points as a change-point. In the last image, 120 points are marked as anomalous, but it could just as well be 2 change-points.


REFERENCES

1. Aleskerov, E., Freisleben, B., and Rao, B. Cardwatch: a neural network based database mining system for credit card fraud detection. In Proceedings of the IEEE/IAFE 1997 Computational Intelligence for Financial Engineering (CIFEr) (Mar 1997), 220-226.

2. Chandola, V., Banerjee, A., and Kumar, V. Anomaly detection: A survey. ACM Comput. Surv. 41, 3 (July 2009), 15:1–15:58.

3. Fujimaki, R., Yairi, T., and Machida, K. An approach to spacecraft anomaly detection problem using kernel feature space. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD ’05, ACM (New York, NY, USA, 2005), 401–410.

4. Gamboa, J. C. B. Deep learning for time-series analysis. CoRR abs/1701.01887 (2017).

5. Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

6. Hochreiter, S., and Schmidhuber, J. Long short-term memory. Neural Comput. 9, 8 (Nov. 1997), 1735-1780.

7. Julier, S. J., and Uhlmann, J. K. New extension of the Kalman filter to nonlinear systems, 1997.

8. Kumar, V. Parallel and distributed computing for cybersecurity. IEEE Distributed Systems Online 6, 10 (Oct. 2005), 1–.

9. Labbe, R. Kalman and Bayesian filters in Python. https://github.com/rlabbe/Kalman-and-Bayesian-Filters-in-Python, 2014.

10. LeCun, Y., and Cortes, C. MNIST handwritten digit database.

11. O’Shea, T. J., Clancy, T. C., and McGwier, R. W. Recurrent neural radio anomaly detection. CoRR abs/1611.00301 (2016).

12. Malhotra, P., Vig, L., Shroff, G., and Agarwal, P. Long short term memory networks for anomaly detection in time series. In ESANN 2015 Proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (2015).

13. Pimentel, M. A. F., Clifton, D. A., Clifton, L., and Tarassenko, L. Review: A review of novelty detection. Signal Process. 99 (June 2014), 215-249.

14. Soule, A., Salamatian, K., and Taft, N. Combining filtering and statistical methods for anomaly detection. In Proceedings of the 5th ACM SIGCOMM Conference on Internet Measurement, IMC ’05, USENIX Association (Berkeley, CA, USA, 2005), 31–31.

15. Spence, C., Parra, L., and Sajda, P. Detection, synthesis and compression in mammographic image analysis with a hierarchical image probability model. In Proceedings of the IEEE Workshop on Mathematical Methods in Biomedical Image Analysis (MMBIA '01), IEEE Computer Society (Washington, DC, USA, 2001).

16. Syarif, I., Prugel-Bennett, A., and Wills, G. Unsupervised clustering approach for network anomaly detection. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012, 135-145.

17. Ting, J. A., Theodorou, E., and Schaal, S. A Kalman filter for robust outlier detection. In 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems (Oct 2007), 1514-1519.

18. Yi, S., Ju, J., Yoon, M., and Choi, J. Grouped convolutional neural networks for multivariate time series. CoRR abs/1703.09938 (2017).


APPENDIX

The following figures illustrate issues in the anomaly detection data sets. First, we can see a missing anomaly marker in Figure 12.

Figure 13 shows obvious spikes in the data that are not marked as anomalies. We would mark those as anomalies, since we process the data sequentially and do not have the whole context.

Another problem is depicted in Figure 11. This example shows varying lengths of change-points and no real distinction from outliers. This information is important for choosing how adaptive the model should be or how it should be trained. Lastly, Figure 14 shows a marked anomaly for which there seems to be no obvious reason.

All of this makes it pointless to report accuracy and recall without having specific information about every time-series.

Figure 12: Missing anomaly around timestamp 425

Figure 13: Earlier spikes can be considered anomalies until we see the whole data. Even after seeing the whole data, the anomalies are still unclear, as they seem to be determined by an arbitrary threshold.

Figure 14: The anomaly at the bottom has no obvious explanation. It may still be an anomaly, but we do not have enough information to detect it.
