
MSc Artificial Intelligence
Track: Machine Learning

Master Thesis

Robust Detection of Anomaly Types in Energy Consumption Data

by

Koen Keune
10003527

December 13, 2017
42 EC, February 20, 2017 - December 13, 2017

Supervisors:
dr. M.W. van Someren
ir. E. de Jong

Assessor:
dr. E. Kanoulas

Abstract

Improving the energy consumption of buildings is an ongoing problem. One way to improve the energy consumption of buildings is to detect the moments when faults occur in the data. These moments can be detected with anomaly detection techniques, which is the first step in diagnosing the problem. Several anomaly detection techniques exist and several have been applied to energy consumption data. However, none of them makes a clear distinction between the general anomaly types that can occur in the data, and they mostly focus on point anomalies.

This paper introduces different anomaly types based on the anomalies that occur in energy consumption data and presents a robust method to detect them. The anomaly types occur in the context of the outside temperature and the time in hours. They are detected with a robust method that consists of finding a regression model of the data with the context and an anomaly detection rule based on the robust residuals of the regression model. The regression model for the anomalies in the context of the outside temperature is a robust linear regression model; different robust linear regression models are compared. The regression model for the anomalies in the context of the time in hours is a nonlinear model; an artificial neural network and support vector regression are compared.

The results show that the robust linear regression model MM-estimation is the most suitable model for the detection of the anomalies in the context of the temperature. The results also show that either an artificial neural network or support vector regression could be used for the detection of the anomalies in the context of the time in hours. Furthermore, the results show that the general anomaly detection approach could be used for the detection of anomaly types in energy consumption data.


Acknowledgements

I would like to thank my supervisors: Maarten van Someren from the University of Amsterdam and Erik de Jong from E-Nolis. Maarten van Someren helped me a lot by keeping a good overview of what needed to happen at all times and by steering the thesis in the right direction. Erik de Jong helped me a lot by making sure that the thesis could be successfully finished. Furthermore, I would like to thank my former supervisor at E-Nolis, Elke Klaassen, for helping me in the early stage of the thesis by providing great feedback. I would also like to thank everyone at E-Nolis for being helpful and for giving me the opportunity to do my thesis for them. Finally, I would like to thank Evangelos Kanoulas for agreeing to be a part of the defense committee.


Contents

1 Introduction
  1.1 Anomaly detection for E-nolis
  1.2 Anomalies
  1.3 Temperature anomalies
  1.4 Time anomalies
  1.5 Organization
2 Related work
  2.1 Anomaly detection
  2.2 Energy consumption models
    2.2.1 Linear models
    2.2.2 Nonlinear models
  2.3 Robustness
    2.3.1 Robust model
    2.3.2 Robust detection
  2.4 Approach
3 The detection of temperature anomalies
  3.1 Methodology
    3.1.1 Robust linear regression models
    3.1.2 Data representation
    3.1.3 Model of the data
    3.1.4 Anomaly detection
    3.1.5 Experiments
  3.2 Results
  3.3 Discussion
4 The detection of time anomalies
  4.1 Methodology
    4.1.1 Nonlinear models
    4.1.2 Data representation
    4.1.3 Model of the data
    4.1.4 Anomaly detection
    4.1.5 Experiments
  4.2 Results
  4.3 Discussion
5 Conclusions and Discussion
  5.1 The detection of temperature anomalies
  5.2 The detection of time anomalies
  5.3 The detection of all anomaly types

1 Introduction

Residential and commercial buildings have a significant share of the total energy consumption, accounting for 40% of the total energy consumption in the European Union [Parliament and Council, 2010]. Reducing energy consumption in buildings is important to achieve the target of the European Council of a 40% reduction in greenhouse gas emissions and a minimum of a 27% improvement in energy efficiency by 2030 [Nesbit et al., 2017]. One of the ways to reduce the energy consumption in buildings is to make sure that the energy systems of buildings are working correctly. The energy consumption of (commercial) buildings is mostly due to the Heating, Ventilation, and Air Conditioning (HVAC) systems. These systems might not work properly because they are outdated, poorly maintained, or improperly controlled. This could be improved by detecting the exact moments when problems occur, which in turn helps diagnosing the problem and makes adjustments to improve the overall efficiency possible. These moments can be detected with anomaly detection techniques. There are various anomaly detection techniques; their applicability depends on the particular problem and the data. The goal of this paper is to find a robust method to detect all types of anomalies in hourly energy consumption data. The anomalies are expressed in terms of the gas or electricity variable. The detection is done with a residual based approach that detects newly defined anomaly types for two different contexts. The two contexts are a step in the right direction for the detection of all anomaly types in the consumption data.

1.1 Anomaly detection for E-nolis

This paper was written in cooperation with the company E-nolis. E-nolis advises people on how they can lower the energy consumption of a building through an analysis of the energy consumption of that building. They typically work with limited knowledge of the building. The datasets mostly consist of weather variables together with hourly gas and electricity values of a building over a year; such datasets are also used for this paper. In their analysis of the energy consumption data they try to find out, among other things, which HVAC systems the building has and how well they work. They use two different representations of the data that help them answer these types of questions. In the first representation of the data they try to find characteristics of the building by looking at how the energy consumption behaves relative to the outside temperature. In the second representation of the data they look at how the data behaves over time. These two representations define the two contexts in which the anomalies are detected in this paper.

1.2 Anomalies

Anomalies can be defined as: “patterns in data that do not conform to a well defined notion of normal behavior” [Chandola et al., 2009, p. 2]. This paper presents a method to detect different types of anomalies in energy consumption data. Anomaly types can be grouped in three general groups [Chandola et al., 2009]. The first group are point anomalies, where an individual data instance lies outside the normal data values. The second group are contextual anomalies, where a data instance has to be outside of the normal data values given a certain context. The context can be defined through contextual attributes, while the behavioral attributes define the non-contextual characteristics. Within this group data instances can be considered normal under one context and anomalous under a different context. The last group of anomalies are collective anomalies, in which a collection of data points is anomalous while the individual data points do not necessarily have to be anomalous themselves.


This paper presents two methods to detect contextual and collective anomalies for two different contexts. These two contexts are motivated by the two representations of the data E-nolis works with. The two contexts result in the detection of different anomalies. The first contextual attribute with which the anomalies are detected is the outside temperature; the second contextual attribute is the time in hours. The behavioral attribute of the anomalies in the first context is the gas consumption data, and the behavioral attributes of the anomalies in the second context are the gas and electricity consumption data. The contextual attributes are used to predict the behavioral attribute by fitting a regression line on the data. The two contexts will be explained in the next two subsections.

1.3 Temperature anomalies

The first contextual attribute with which anomalies are detected is the outside temperature. This results in the detection of so called ‘temperature anomalies’. The anomalies are detected by representing the data in two dimensions: the outside temperature and the gas consumption. The normal behavior of the data follows two assumptions. The first assumption is that there exists a linear function between the two dimensions of the form

y = ax + b    (1)

where the parameters a and b are unknown, x is the outside temperature, and y the gas consumption. The second assumption is that the data is divided by a ‘breakpoint value’. The linear function only applies to outside temperature values lower than the breakpoint value. The values above the breakpoint value follow a different linear function with a = 0. These assumptions about the data come from E-nolis and are made by people who work with energy consumption data regularly (see section 2.2.1).

This regression line should be detected despite the presence of varying degrees of anomalous data. The anomalous data sometimes forms a certain pattern. This anomalous pattern is another linear function in the data over the outside temperature values lower than the breakpoint value, and is a form of a collective anomaly. This paper introduces a robust method that can detect contextual anomalies in anomalous data in which the anomalies do not form a particular pattern, or in which the anomalies form some linear function.

Figure 1 shows an example of a dataset with the normal behavior according to the characteristics described above and with anomalous data that forms some linear function. The exact method to learn the normal behavior and detect the anomalies is described in section 3.

1.4 Time anomalies

The second contextual attribute with which anomalies are detected is the time in hours. This results in the detection of so called ‘time anomalies’. The normal behavior of the data is the expected energy consumption of a building over time. This is some nonlinear dependency between the gas or electricity consumption and the time and the weather. This paper introduces different kinds of collective anomalies that can occur within this context, and a method to detect both them and the non-collective anomalies.

The context is visualized with a 3D-plot of the data to make normal and abnormal behavior of the consumption data visible. Figure 2 shows an example of such a 3D-plot with various types of collective anomalies. The first two dimensions are the hour of the day and the day of the year.


Figure 1: Example in the data of two different linear slopes. The contextual attribute is the outside temperature and the behavioral attribute is the gas consumption. The red spots indicate the contextual anomalies; all the red spots in this example form a collective anomaly.

These two dimensions make the pattern of the consumption data visible over time, which contains a daily and a weekly pattern. The third dimension shows the consumption (gas or electricity) value and corresponds to some color on a color spectrum. High consumption values correspond to a red color, while low consumption values correspond to a blue color. The daily pattern of the data can be seen from colors close to red between the hours 09:00 and 21:00 and blue values outside those hours. The weekly pattern of the data can be seen as a blue vertical stripe after every five days. Abnormal values can be recognized in the 3D-plot as any data point that has a different color than what is expected from the normal pattern.

1.5 Organization

This paper is organized as follows. Section 2 reviews the relevant literature for anomaly detection techniques, energy consumption models, and robust regression. Section 3 describes the method to detect the temperature anomalies, presents its results, and discusses them. Section 4 does the same for the method to detect the time anomalies. Section 5 concludes the results over both parts and discusses possible future directions.


Figure 2: A 3D-plot of the electricity consumption data that also shows different types of anomalies recognized in the data. The contextual attribute is the time in hours and the behavioral attribute is the electricity consumption. The colors indicate how high the consumption values are at a particular time, with red corresponding to high values and blue corresponding to low values. The data shows, among other things, noticeably high values during the day between 09:00 and 21:00 and low values during the night between 23:00 and 07:00. The marked regions show an abnormal color pattern and thus anomalous consumption behavior.

2 Related work

Anomaly detection or fault detection in energy consumption has been widely investigated. The approach of this paper is to have a robust model of the data and to detect anomalies with some rule based on the learned model. This section summarizes the work most relevant to this approach. Firstly, the different types of anomaly detection approaches that are used on energy consumption data are reviewed, secondly energy consumption models, and lastly robust regression models.

2.1 Anomaly detection

There are various types of approaches to detect anomalies in data, and many of them have been used for anomaly detection in energy consumption data. In the literature they are also called fault detection methods, and fault detection and diagnosis (FDD) methods when they try to diagnose the problem after finding it. This section reviews the different types of approaches for anomaly detection in energy consumption data based on the categorization of anomaly detection methods by Chandola et al. [2009].

A common approach for detecting anomalies in energy consumption data is with statistical detection techniques. The basic principle of statistical anomaly detection techniques is that some stochastic model can explain the data, such that data points that have a low probability under the model must be anomalies. The techniques can be further divided in parametric techniques, in which the underlying model is assumed to be known, and non-parametric techniques, in which the model is not assumed to be known [Chandola et al., 2009, p. 33].

The most direct approach among statistical detection techniques in energy consumption is comparing the residuals. The residual is the difference between an expected value and an observed value. Norford et al. [2002] detected faults in air-handling units with residuals from two types of models. One of the models is a parametric technique that uses subsystem models that are based on various principles. These principles make use of physical properties, such as the fan laws and simple quadratic expressions for the change in system resistance, to predict the supply air static pressure for their fan/duct model [Norford et al., 2002, p. 49]. Lee et al. [2004] used a non-parametric technique by generating residuals with a neural network for regression. They estimated and detected different types of anomalies for an air-handling unit by generating residuals for different types of variables. One of the problems residual based methods face is that detection of anomalies might be inaccurate when the residuals are small. Yang et al. [2011] use a fractal correlation dimension algorithm to detect generated anomalies of an air-handling unit. Their algorithm calculates the distance between the expected value and the observed value in a nonlinear setting, and is able to detect small residuals under noise conditions better than standard linear residual based approaches.

Another approach for detecting anomalies in energy consumption data is a classification based approach. With this approach it is assumed that a classifier can distinguish between normal and anomalous classes [Chandola et al., 2009, p. 21]. For example, Lee et al. [1997] used an Artificial Neural Network (ANN) for FDD for a simulated air-handling unit. The ANN learns the normal class and eight different anomalous classes through idealized patterns. The different patterns are defined with simplified first-order equations with different types of variables. The ANN is able to classify the subsystem where a fault probably occurs, which is used to diagnose the specific cause at the subsystem level. Khan et al. [2013] had success with a decision tree to detect anomalies in energy consumption data. The data came from an office building, where for each room it is known how many people are present. Schein et al. [2006] used 28 expert rules to detect anomalies for air-handling units.

A cluster based approach is another approach to detect anomalies. Cluster based approaches assume that normal data belongs to a cluster, while anomalous data does not belong to a cluster, or belongs to a small or sparse cluster [Chandola et al., 2009, p. 30-31]. Khan et al. [2013] used two different cluster based approaches to detect anomalies in energy consumption data. One of the cluster based approaches used is a k-means approach. The other cluster based approach used is the density based DBSCAN algorithm. Khan et al. concluded that the k-means approach can detect anomalies, however it is not suitable for robust anomaly detection. DBSCAN did not have this problem, however it was unable to detect the artificial anomalies added to the data. The clustering methods had problems with detecting time related anomalies.

The last type of approach for anomaly detection in energy consumption is spectral based. Spectral based anomaly detection techniques rely on the assumption that “data can be embedded into a lower dimensional subspace in which normal instances and anomalies appear significantly different” [Chandola et al., 2009, p. 41]. For example, Wang and Xiao [2004] used principal component analysis (PCA) for FDD in air-handling units. The air-handling units have multiple variables that are correlated and can therefore be represented by a smaller number of principal components. Anomalies are residual based and are detected if the difference between some observed value and the value represented in its principal components is larger than some threshold. Wang and Xiao showed that many typical anomalies in air-handling units can be detected.


2.2 Energy consumption models

This paper uses a residual based statistical detection technique to detect anomalies. To get expected values one needs a model of the data. Various research has been done on energy consumption models. These models typically try to predict the energy consumption of buildings in order to improve the energy performance of buildings [Zhao and Magoulès, 2012]. There are different approaches to predict the energy consumption, and they differ in the complexity of the model and in how many and which parameters are used for the model. These models can be divided in linear and nonlinear models, and they correspond with the two types of models in this paper: the anomaly detection method for temperature anomalies uses a linear model of the data, while the detection method for time anomalies uses a nonlinear model of the data.

2.2.1 Linear models

A common approach is to model the energy consumption data with a linear regression model (often called the degree day method). These models are motivated by their ease of use while still being accurate. The basis of these models is similar to the approach of Fels [1986]. Fels states that the heating system of a building is generally first required when the outside temperature (T_out) drops below a certain level (the temperature breaking point τ), and for each additional degree drop in temperature a constant amount of heating fuel (f) is required

f = \alpha + \beta (\tau - T_{out})_+    (2)

where ‘+’ indicates zero if the term is negative, α represents the base level of heating fuel a building needs, and β represents the building’s effective heat-loss rate.
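As a concrete illustration, the degree day model of equation (2) can be written as a one-line function. This is a minimal sketch assuming NumPy; the parameter values in the comment are hypothetical, not from the thesis.

```python
import numpy as np

def degree_day_fuel(t_out, alpha, beta, tau=18.3):
    """Fels-style degree day model: fuel use rises linearly once the
    outside temperature drops below the breaking point tau."""
    return alpha + beta * np.maximum(tau - np.asarray(t_out, dtype=float), 0.0)

# With the hypothetical values alpha = 2.0 and beta = 0.5, an outside
# temperature of 10 degrees gives 2.0 + 0.5 * (18.3 - 10) = 6.15 units of
# fuel, while any temperature above 18.3 degrees gives the base level 2.0.
```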

Various variations of this basic linear regression model have been proposed, with varying complexity. Some of these models are based on physical properties and need to know things such as external climate conditions, building construction, operation, utility rate schedule, and which HVAC systems are present. A difference between these models is in how much information they need. Other models use historical performance and weather data instead.

Al-Homoud [2001] compared two common linear regression based models that predict energy consumption based on physical properties. The first one is the degree day method, which is similar to the model described earlier. The model sums the heating degree days for the days in which the outside temperature is lower than the temperature breaking point to calculate the expected energy consumption. The temperature breaking point is assumed to be 18.3 degrees, however the actual breaking point temperature depends on many factors. Therefore, a variable based temperature breaking point is more accurate [Al-Homoud, 2001, p. 425].

The second method is the bin method. The data is split in different energy bins with three different 8-hour shifts, to take occupied and unoccupied conditions into account. The heating energy is calculated per bin in a similar way as in the first method. The number of occurrences of each bin is then multiplied with the heating energy of the bin.

Other linear based approaches are similar. For example, Bauer and Scartezzini [1998] proposed a linear regression based method to account for heating and cooling. They allowed different slopes over different segments of the data. Lei and Hu [2009] used a parametric model for buildings in a region with hot summers and cold winters. They found that a single variable linear model based on the outside temperature is sufficient and practical as a model in that region. Lam et al. [2010] used PCA


to develop a climatic variable. The climatic variable considered the dry-bulb temperature, the wet-bulb temperature, and global solar radiation. The climatic variable was used to examine the role of a changing climate for energy consumption. Newsham and Birt [2010] used an Auto-Regressive Integrated Moving Average with eXternal inputs (ARIMAX) model to predict energy demand. They used building occupancy as an independent variable and concluded that building occupancy improves the prediction significantly.

2.2.2 Nonlinear models

Another approach for the prediction of energy consumption are nonlinear models. Nonlinear models have the advantage of being able to capture more complex relations, at the cost of having a more complex model. Ghiaus [2006] states that the residuals of the linear model ideally should have a normal distribution with zero mean, which in practice is often not the case. Furthermore, linear models depend on the assumption that there is a linear relation between the temperature and the energy consumption, which is often a simplification of the problem [Jiménez et al., 2008].

Popular approaches for nonlinear models to predict energy consumption are artificial neural networks (ANNs) and support vector machines (SVMs). There are a lot of variations for predicting energy consumption data with the help of ANNs or SVMs. They vary in what they exactly want to predict and which variables they use, similar to the variations of the linear models. For example, Kalogirou et al. [1997] used back propagation neural networks to predict the heating load of 225 buildings using several building characteristics as input. Yokoyama et al. [2009] used a back propagation neural network to predict the cooling demand of buildings, with the air temperature and relative humidity as input. Nizami and Al-Garni [1995] used a feed forward network to predict the energy consumption of a whole area, with the humidity, global solar radiation, and the population of the area as input. Wong et al. [2010] used a neural network to predict energy consumption in subtropical climates. They used as input the dry-bulb temperature, the wet-bulb temperature, solar radiation, the day type, and four building variables.

SVMs have similar variations. Dong et al. [2005] used a SVM for predicting energy consumption in tropical regions, with the dry-bulb temperature, relative humidity, and global solar radiation as input. Li et al. [2009] used a SVM to predict the cooling demand of an office building, using the outside temperature, the humidity, and solar radiation as input. Li et al. [2010] compared a SVM, a back propagation neural network, a radial basis function neural network, and a general regression neural network (GRNN) for the prediction of building energy consumption. They based their comparison on 59 residential buildings and they had knowledge of building characteristics such as various wall ratios and the heat transfer coefficient of the walls. Li et al. concluded that the SVM method and the GRNN method are feasible and effective for the prediction.

2.3 Robustness

Another aspect of this paper is to find a robust method to detect anomalies. A robust method is desirable because the goal is to have a good method over all kinds of data. A robust method is a method that is able to ignore some of the data, such that it is not affected as much by anomalous data as a non-robust method. A robust method is however hard to find, because one does not know where the anomalies are and how much of the data is anomalous. The best robust method is therefore tested and chosen based upon different energy consumption datasets.


The robust anomaly detection method consists of a robust model of the data and a robust detection method built on that model. The robust model to detect the temperature anomalies is chosen from various robust models. These models can be categorized by their breakdown point. The breakdown point is a theoretical property of a regression model and indicates the smallest proportion of observations (data points) that is needed to make the model swamped [Alma, 2011, p. 412]. A model is swamped if some non-anomalous data points appear to be anomalous due to anomalies. For example, ordinary least squares (OLS) has a breakdown point of 1/n, because one (extreme) value can move the model arbitrarily far away.

A robust method typically performs worse on data without anomalies, because the method does not know where the anomalies are and can therefore be too robust. This makes robustness typically a trade-off between the accuracy and the robustness of the method. (The term accuracy is used in this paper for what is generally called the efficiency of a model.) In this context the term accuracy tells how well the method performs on data without anomalies and the term robustness tells how well the method performs on data with anomalies.

The accuracy of a model is also used in this paper because it can be calculated easily. The accuracy of a model in this paper is calculated with the coefficient of determination (R²). It indicates how well the model fits the data by looking at the degree to which the variance of the data points around the fitted model is predicted by the model. It is calculated with

R^2 = 1 - \frac{\sum_i (y_i - f_i)^2}{\sum_i (y_i - \bar{y})^2}    (3)

where y_i is point i of the data, f_i the predicted value, and \bar{y} the average value of the data.
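Equation (3) translates directly to a few lines of NumPy. This is a minimal sketch, not code from the thesis:

```python
import numpy as np

def r_squared(y, f):
    """Coefficient of determination of predictions f for observations y (equation (3))."""
    y = np.asarray(y, dtype=float)
    f = np.asarray(f, dtype=float)
    return 1.0 - np.sum((y - f) ** 2) / np.sum((y - np.mean(y)) ** 2)
```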

2.3.1 Robust model

The robust anomaly detection method consists of a robust model of the data together with a robust anomaly detection rule. The model is some kind of regression model. There is various literature regarding robust regression, and the specifics of those models depend on the data the model tries to model. Relevant for the detection of the temperature anomalies are robust linear regression models. Alma [2011] compared the most popular robust linear regression models on simulated data. They compared the R² of M-estimation with Huber’s loss function, S-estimation, LTS-estimation, and MM-estimation on various datasets. The datasets were generated from a linear model, and anomalies were added as data points that are some number of standard deviations away from the linear model at random points of the data. They added anomalies in the x-direction and the y-direction. The datasets on which the tests were performed had different rates of anomalies in them and different dimensions. S-estimation and M-estimation produced the best results on most of the datasets. However, on some datasets LTS-estimation or MM-estimation performed better. For example, LTS-estimation performed best on datasets containing 40% anomalies in the y-direction. Susanti et al. [2014] compared M-estimation, S-estimation, and MM-estimation on maize production data in Indonesia. They found that the best model for the data is S-estimation.

Relevant for the detection of time anomalies are robust nonlinear regression models, however, such models are more complex and due to time-constraints were not examined for this paper.

Most of the regression models for energy consumption data do not emphasize the robustness of the model (see section 2.2). An exception is Ghiaus [2006], who performed robust regression on energy consumption and the outside temperature. They performed linear regression between the 1st and the 3rd quartile of the data, and they compared the relative errors with ordinary linear regression. The robust regression method had a relative error of 2-4%, while ordinary regression had a relative error of 5-10% on the same data.

2.3.2 Robust detection

The anomalies in this paper are detected with some kind of threshold based rule for the residuals. The classical rule to detect outliers from some general dataset X ⊂ R is based on whether the absolute value of the z-score exceeds some value, where the z-score is calculated by

z_i = (x_i - \bar{x}) / s    (4)

where \bar{x} is the average value of x [Rousseeuw and Hubert, 2011]. Rousseeuw and Hubert argued for a robust outlier rule

z_i = \frac{x_i - \mathrm{median}_{j=1,\dots,n}(x_j)}{\mathrm{MAD}}    (5)

where MAD is the median of all absolute deviations from the median. The robust rule makes it easier to detect outliers when the data is contaminated with anomalous (or noisy) data. This is because the median and the MAD are less affected by anomalous data than the average value \bar{x} and the standard deviation s, which increases the robust z-score of anomalous data.

2.4 Approach

This paper presents a robust anomaly detection method to detect anomalies in energy consumption data for two different contexts: the temperature and the time in hours. Different anomaly types are introduced in both contexts. The approach to detect them is by comparing the residuals of a regression model of the data for both contexts, because regression models have shown success in modeling similar data. Different regression models are compared and the best regression model is selected based on multiple datasets. The regression models that are used correspond to the models described earlier. A linear model with a breaking point is used for the detection of the temperature anomalies; different variations of this model, each containing one of the popular robust linear regression models, are compared. An ANN and a SVM based regression model are compared for the detection of the time anomalies. This paper introduces a robust detection rule similar to equation (5) to detect the anomalies.

3 The detection of temperature anomalies

Anomalies are detected for two different contexts. This section describes the method to detect the gas consumption anomalies in the context of the outside temperature (the temperature anomalies). The temperature anomalies are found in two-dimensional data of the gas consumption and the outside temperature. Section 4 describes the method to detect the gas and electricity consumption anomalies in the context of the time in hours (the time anomalies). This section firstly describes the methodology to detect the temperature anomalies, secondly the results are shown, and lastly the results are discussed.

3.1 Methodology

The method to detect the temperature anomalies is a residual based approach. A residual based approach is used because the non-anomalous data can be modeled with some parametric model. The model is based on a linear model and is a variation of equation (2). Various popular robust linear regression models are compared to find out what the best robust model for the data is; these are explained first. Secondly, the representation of the data is discussed, together with the assumptions made about the data and how the data is labeled. The implementation of the model over the data is explained next. The temperature anomalies are found with robust residuals of the found model, which is explained in detail afterwards. Lastly, the evaluation of the different models is explained.

3.1.1 Robust linear regression models

Various robust linear regression models are compared, because their performance depends on the type of data (see section 2.3.1). These models can be divided by their breakdown point. This results in two types of robust regression models: the low breakdown models and the high breakdown models. The low breakdown models have the lowest possible breakdown point of 1/n and perform well with a few anomalous data points, while the high breakdown models have a breakdown point of n/2 and can perform well on data with up to 50 percent anomalous data points. The low breakdown models that will be compared are the M-estimation models with Huber’s loss function, Tukey’s loss function, and Hampel’s loss function, together with the ordinary least squares (OLS) method. The high breakdown models that will be compared are LTS-estimation, S-estimation, and MM-estimation. This results in a comparison of seven different models.

M-estimation models M-estimation models are the standard low breakdown models. They are called M-estimation models because they are maximum likelihood type estimators. They are more resistant to point anomalies than OLS, while being similarly efficient. The accuracy is expressed in terms of the coefficient of determination (equation (3)). M-estimation models have problems with data that has point anomalies in the x-space, especially Huber’s model, which makes them have a breakdown point of 1/n.

M-estimation models try to find a solution to

\min_\beta \sum_{i=1}^{n} \rho\!\left(\frac{e_i}{s}\right) = \min_\beta \sum_{i=1}^{n} \rho\!\left(\frac{y_i - \sum_{j=0}^{p} x_{ij}\beta_j}{s}\right)    (6)

where e_i is the error of the function and s is a robust scale estimate of the errors. The scale estimate s is taken as

\hat{s} = \frac{\mathrm{MAD}}{0.6745} = \frac{\mathrm{median}\,|e_i - \mathrm{median}(e_i)|}{0.6745}    (7)

where the constant 0.6745 makes the estimate unbiased at the normal distribution.

The function ρ is some predefined function that determines how the errors should be weighted. The solution for M-estimation models is found by taking the partial derivatives with respect to β_j and setting them equal to 0

\sum_{i=1}^{n} x_{ij}\,\psi\!\left(\frac{e_i}{\hat{s}}\right) = 0, \quad \text{for } j = 0, 1, \dots, p    (8)

where ψ is the derivative of ρ and determines how much the errors should be weighted. To solve equation (8) one can use the following weight function

w(e_i) = \frac{\psi(e_i/\hat{s})}{e_i/\hat{s}}    (9)

such that equation (8) becomes

\sum_{i=1}^{n} x_{ij}\, w_i\, e_i = 0, \quad \text{for } j = 0, 1, \dots, p.    (10)

Equation (10) can then be solved with the iteratively reweighted least squares (IRLS) method. This method iteratively calculates the errors e_i, the scale s, the weights w_i, and the regression coefficients β until they converge [Susanti et al., 2014].
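The IRLS procedure of equations (6)-(10) can be sketched in a few lines. This is a minimal illustration assuming NumPy, with Huber weights as an example; production implementations (e.g. in statsmodels or R) add safeguards that are omitted here.

```python
import numpy as np

def huber_weights(u, k=1.345):
    # w(u) = psi(u)/u for Huber's psi: 1 inside [-k, k], k/|u| outside.
    u = np.where(u == 0, 1e-12, u)  # avoid division by zero
    return np.where(np.abs(u) <= k, 1.0, k / np.abs(u))

def irls(X, y, weight_fn=huber_weights, n_iter=50, tol=1e-8):
    """Iteratively reweighted least squares for an M-estimator.
    X is a 2-D design matrix, y a 1-D response vector."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # OLS starting point
    for _ in range(n_iter):
        e = y - X @ beta
        s = np.median(np.abs(e - np.median(e))) / 0.6745  # robust scale, eq. (7)
        if s == 0:
            break
        w = weight_fn(e / s)
        sw = np.sqrt(w)
        # Solve the weighted least squares problem of equation (10).
        beta_new = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```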

Huber’s function M-estimation models differ in their ρ function, which determines the weight of the errors. For example, the errors of OLS are weighted quadratically, such that extreme values have a lot of influence on the model. This makes OLS not resistant to anomalies. Huber’s loss function has a monotone ψ function. It has the property that the ρ function is convex, which always allows the computation of the optimal solution to equation (8). The function increases the weight quadratically until k and from then on it increases linearly (in contrast to OLS). It is defined as

\rho_k(x) = \begin{cases} \frac{1}{2}x^2 & \text{if } |x| \le k \\ k|x| - \frac{k^2}{2} & \text{if } |x| > k \end{cases}    (11)

where k = 1.345 for the comparison in this paper, which corresponds to 95% accuracy under normal errors [Koller and Mächler, 2016, p. 3].

Tukey’s function Tukey’s bisquare loss function has a redescending ψ function, chosen such that all errors larger than k are weighted the same (instead of the linearly increasing loss of Huber’s function). This means that an extreme point anomaly has as much influence as any other data point whose error is larger than k. The redescending ψ function results in a non-convex ρ function, which makes it possible for the algorithm to find a non-optimal solution. It is defined as

\rho_k(x) = \begin{cases} 1 - \left(1 - \left(\frac{x}{k}\right)^2\right)^3 & \text{if } |x| \le k \\ 1 & \text{if } |x| > k \end{cases}    (12)

where k = 4.685 for the comparison in this paper, which corresponds to 95% accuracy under normal errors [Koller and Mächler, 2016, p. 4].

Hampel’s function Hampel’s loss function is a smoother version of Tukey’s bisquare; it also has a redescending ψ function

\rho_{a,b,r}(x) = \begin{cases} \frac{1}{2}x^2 / C & |x| \le a \\ \left(\frac{1}{2}a^2 + a(|x| - a)\right) / C & a < |x| \le b \\ \frac{a}{2}\left(2b - a + (|x| - b)\left(1 + \frac{r - |x|}{r - b}\right)\right) / C & b < |x| \le r \\ 1 & r < |x| \end{cases}    (13)

where a = 1.5k, b = 3.5k, r = 8k, and k = 4.685 for the comparison in this paper, which corresponds to 95% accuracy under normal errors [Koller and Mächler, 2016, p. 5].

Ordinary least squares Ordinary least squares (OLS) is also a form of M-estimation, where the ρ function is

\rho(x) = x^2.    (14)
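For reference, the M-estimators above are available off the shelf, for example in statsmodels. A minimal sketch, assuming hypothetical arrays temp (outside temperature) and gas (gas consumption); the thesis does not state which implementation was used.

```python
import statsmodels.api as sm

# temp and gas are hypothetical 1-D NumPy arrays.
X = sm.add_constant(temp)
norms = {
    "huber":  sm.robust.norms.HuberT(t=1.345),
    "tukey":  sm.robust.norms.TukeyBiweight(c=4.685),
    "hampel": sm.robust.norms.Hampel(),  # statsmodels' own default a, b, c
}
fits = {name: sm.RLM(gas, X, M=norm).fit() for name, norm in norms.items()}
slopes = {name: fit.params[1] for name, fit in fits.items()}
```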

LTS-estimation LTS-estimation (Least Trimmed Squares) is a high breakdown model and is given by

\hat{\beta}_{LTS} = \arg\min_\beta Q_{LTS}(\beta) \quad \text{where} \quad Q_{LTS}(\beta) = \sum_{i=1}^{h} e_{(i)}^2    (15)

where e_{(1)}^2 \le e_{(2)}^2 \le \dots \le e_{(n)}^2 are the ordered squared residuals and h = \frac{n+p+1}{2} for a breakdown point of 1/2 [Alma, 2011, p. 413]. A larger h will result in roughly a breakdown point of \frac{n-h}{n}. This is a robust method because up to half of the data can be ignored. The parameter h has the value \lfloor\frac{n+p+1}{2}\rfloor for the comparison in this paper, which corresponds to a model with a breakdown point of 1/2. The accuracy of this method depends on the fraction of point anomalies. The accuracy is the same as OLS if the number of trimmed data points is the same as the number of point anomalies. However, the method becomes less efficient if more or fewer data points are trimmed than there are point anomalies [Alma, 2011]. Furthermore, this method requires heavy computational effort [Bellio and Ventura, 2005].

S-estimation S-estimation is a high breakdown model that minimizes the dispersion of the residuals. The final scale parameter is the standard deviation of the residuals from the fit that minimized the dispersion of the residuals, in contrast to M-estimation, which has a fixed scale estimate. The scale parameter is calculated iteratively together with the coefficient parameters, similar to M-estimation. The objective function of S-estimation is the minimization of the dispersion of the residuals

\arg\min_\beta \, s\!\left(e_1(\beta), \dots, e_n(\beta)\right)    (16)

which gives the final scale estimate

\hat{s} = s\!\left(e_1(\hat{\beta}), \dots, e_n(\hat{\beta})\right)    (17)

where the dispersion s(e_1(\beta), \dots, e_n(\beta)) is the solution of

\frac{1}{n}\sum_{i=1}^{n} \rho\!\left(\frac{e_i}{s}\right) = K.    (18)

K is a constant E_\phi[\rho] with \phi defined as the standard normal distribution, and \rho(x) should be a redescending function taken as

\rho(x) = \begin{cases} \frac{x^2}{2} - \frac{x^4}{2c^2} + \frac{x^6}{6c^4} & \text{if } |x| \le c \\ \frac{c^2}{6} & \text{if } |x| > c. \end{cases}    (19)

The parameter c is a tuning constant. For the comparison c = 1.548 and K = 0.1995, which gives a breakdown point of 1/2 [Onur and CETIN, 2011]. S-estimation has better accuracy than LTS-estimation under normal errors.

MM-estimation MM-estimation is a high breakdown model that combines the accuracy of M-estimation with the robustness of S-estimation. MM-estimation calculates the S-estimate with a form of Tukey’s bisquare function

\rho(x) = \begin{cases} \frac{c^2}{6}\left(3\left(\frac{x}{c}\right)^2 - 3\left(\frac{x}{c}\right)^4 + \left(\frac{x}{c}\right)^6\right) & \text{if } |x| \le c \\ \frac{c^2}{6} & \text{if } |x| > c. \end{cases}    (20)

The found scale parameter \hat{\sigma} is used to find the MM parameters \hat{\beta} with M-estimation

\hat{\beta} = \arg\min_\beta \sum_{i=1}^{n} \rho\!\left(\frac{y_i - x_i^T\beta}{\hat{\sigma}}\right).    (21)

Finally, the scale parameter s for the MM-estimate is calculated by solving

\frac{1}{n-p}\sum_{i=1}^{n} \rho\!\left(\frac{y_i - x_i^T\hat{\beta}}{s}\right) = 0.5.    (22)

MM-estimation has a higher accuracy under normal errors than S-estimation, while also having a breakdown point of 1/2 [Alma, 2011, p. 415].

3.1.2 Data representation

The models are trained and compared on datasets containing hourly gas consumption values over one year. The gas consumption values are split in occupied and unoccupied hours during weekdays and days in the weekend. Occupied hours are defined as the hours between 11:00 and 15:00, and unoccupied hours are defined as the hours between 23:00 and 03:00. This corresponds to four different clusters. All the data in a cluster is grouped together in one dataset. In total, the gas consumption values of eight different buildings are used, which corresponds to a total of 32 different datasets.

The data is split in occupied and unoccupied hours because building occupancy is an important variable: more energy is needed when people are in the building. The building occupancy is an unknown variable; therefore the extreme hours, during which people should certainly be in the building or certainly not be, are used.


These extremes are the hours between 11:00 and 15:00 and between 23:00 and 03:00, which is a heuristic of E-nolis. The hours outside of those extremes are not used because for E-nolis it is of particular importance to know how the data behaves at those extremes. Furthermore, these values are hard to predict with respect to the outside temperature without knowledge of the building occupancy and without a parametric model for these hours.
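The four clusters can be extracted, for example, with pandas. A minimal sketch assuming a hypothetical hourly DataFrame df with a DatetimeIndex and a "gas" column; assigning hours that wrap past midnight to a weekday or weekend is simplified here.

```python
import pandas as pd

# df is a hypothetical hourly DataFrame with a DatetimeIndex and a "gas" column.
occupied = df.between_time("11:00", "15:00")
unoccupied = df.between_time("23:00", "03:00")  # wraps around midnight

def weekend(frame):
    return frame.index.dayofweek >= 5

clusters = {
    "occupied_weekday":   occupied[~weekend(occupied)],
    "occupied_weekend":   occupied[weekend(occupied)],
    "unoccupied_weekday": unoccupied[~weekend(unoccupied)],
    "unoccupied_weekend": unoccupied[weekend(unoccupied)],
}
```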

It is assumed that the data has a linear relation between the gas consumption values and the temperature values, similar to the linear models for the prediction of energy consumption (see section 2.2.1). Lower temperatures should correspond to higher consumption values starting at some temperature value, the breakpoint value, where the heating should start. The gas consumption values should be near zero above the temperature breakpoint value. The breakpoint value is assumed to be around 18.3 degrees Celsius [Al-Homoud, 2001, p. 424].

Data labeling The data has no labels by itself. Therefore, experts labeled interesting data points that can be considered as anomalies. The experts do not have knowledge about the buildings where the datasets come from. However, they do have knowledge about how patterns in energy consumption data typically should look. The judgment whether a point in the data should be considered an anomaly is based on that knowledge. Data points are either labeled as a (contextual) point anomaly or as a (contextual) collective anomaly. The only type of collective anomaly that was detected in the data can be characterized as a collection of data points around a linear regression line with a different slope than the assumed true linear dependency. Figure 1 is an example of a labeled dataset that contains such a collective anomaly. The labeling makes the comparison between the different robust linear regression models possible.

The labeling of collective anomalies is used to compare the performance of the models between datasets with a collective anomaly and without a collective anomaly. High breakdown models are expected to perform better on datasets that contain a lot of anomalous data points, compared to low breakdown models. The percentage of datasets that contain a collective anomaly is approximately 19% (6 out of 32).

3.1.3 Model of the data

The implementation of the model is based on finding the best fit of some linear model that contains one breakpoint. The model is based on the assumptions about the data mentioned in the previous section. The model is as follows

f(x, \Omega) = f(x, \alpha, \beta_1, \beta_2) = \begin{cases} \beta_1 & \text{if } x > t \\ \alpha x + \beta_2 & \text{if } x \le t \end{cases}    (23)

where x is the outside temperature, t is the temperature breakpoint value, β₁ is the level of gas consumption above the breakpoint value, β₂ is a value chosen such that the regression line for the values below the breakpoint value connects with the horizontal line for the values above the breakpoint value, and α is the slope of the regression line for the values below the breakpoint value. The value of α for one particular breakpoint value is the slope of one of the robust linear regression models described in section 3.1.1. The best breakpoint value (t) is found with an accuracy score over the whole model. The accuracy score is the coefficient of determination (R²) and is calculated as follows

R^2 = 1 - \frac{SS_{Error}}{SS_{Total}} = 1 - \frac{\sum_{i=1}^{n}(y_i - f(x_i, \Omega))^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}    (24)

where y_i is point i of the data and x_i the input for the predicted value of point i.


Various breakpoints around the expected breakpoint value of 18.3 were tested per model to find the optimal breakpoint. The pseudo code for finding the optimal breakpoint, and in turn the optimal model, can be seen in Algorithm 1. M-estimation with Huber’s loss function is used for x > t, because that part of the data is typically less complex, and the model for that part is also less complex because only one parameter is estimated for it.

input : dataset, x-variable, y-variable, dayCluster, hoursInCluster, breakpoint
output: model to detect anomalies with

x = dataset with x-variable, dayCluster, hoursInCluster
y = dataset with y-variable, dayCluster, hoursInCluster
interval_around_breakpoint = 8
temperature_step = 0.1
rounds = interval_around_breakpoint / temperature_step
model_for_right_part = huber            // flat part above the breakpoint (slope = 0)
model_for_left_part = robust_model      // sloped part below the breakpoint
breakpoint_temp = breakpoint - interval_around_breakpoint / 2

for i in 1..rounds do
    (xR, yR) = (x, y) where x > breakpoint_temp
    (xL, yL) = (x, y) where x <= breakpoint_temp
    (model_R, accuracy_R, y_breakpoint) = model_for_right_part(xR, yR, slope = 0)
    (model_L, accuracy_L) = model_for_left_part(xL, yL, breakpoint_temp, y_breakpoint)
    total_accuracy = (length(xR) / length(x)) * accuracy_R
                   + (length(xL) / length(x)) * accuracy_L
    models[i] = (total_accuracy, model_R, model_L)
    breakpoint_temp += temperature_step
end

best_model = the entry of models with the highest total_accuracy

Algorithm 1: Algorithm to find the best model in the energy consumption data.
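For illustration, Algorithm 1 can be translated to Python roughly as follows. This is a minimal sketch assuming NumPy and the statsmodels Huber M-estimator as a stand-in for both parts, and it scores candidates with the overall R² of equation (24) rather than the per-part weighting of Algorithm 1; the parameter names are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

def fit_breakpoint_model(x, y, breakpoint=18.3, interval=8.0, step=0.1):
    """Scan candidate breakpoints; keep the piecewise fit with the highest R^2."""
    best = None
    for t in np.arange(breakpoint - interval / 2, breakpoint + interval / 2, step):
        right, left = x > t, x <= t
        if right.sum() < 2 or left.sum() < 2:
            continue
        # Flat part above the breakpoint: robust location estimate (slope = 0).
        beta1 = sm.RLM(y[right], np.ones((right.sum(), 1)),
                       M=sm.robust.norms.HuberT()).fit().params[0]
        # Sloped part below the breakpoint, forced through (t, beta1):
        # y - beta1 = alpha * (x - t), i.e. a regression without intercept.
        alpha = sm.RLM(y[left] - beta1, (x[left] - t).reshape(-1, 1),
                       M=sm.robust.norms.HuberT()).fit().params[0]
        pred = np.where(right, beta1, alpha * (x - t) + beta1)
        r2 = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
        if best is None or r2 > best[0]:
            best = (r2, t, alpha, beta1)
    return best  # (R^2, breakpoint t, slope alpha, base level beta1)
```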

3.1.4 Anomaly detection

The anomalies in the data are detected with a rule. The rule is based on the robust outlier rule of equation (5), but then for a robust regression model

z_i = \begin{cases} |r_i| / \mathrm{MAD} & \text{if } \mathrm{MAD} \ne 0 \\ |r_i| / \mathrm{MeanAD} & \text{if } \mathrm{MAD} = 0 \end{cases}    (25)

where

r_i = y_i - f(x_i, \Omega)    (26)

and MAD is the median of all absolute deviations from the median of the residuals

\mathrm{MAD} = 1.4826 \,\underset{i=1,\dots,n}{\mathrm{median}} \, |r_i - \underset{j=1,\dots,n}{\mathrm{median}}(r_j)|    (27)

where the factor 1.4826 makes it unbiased at normal errors, such that a standard threshold value for the rule can be used [Rousseeuw and Croux, 1993, p. 1273]. MeanAD is the mean of all absolute deviations from the mean of the residuals

\mathrm{MeanAD} = 1.2533 \,\underset{i=1,\dots,n}{\mathrm{mean}} \, |r_i - \underset{j=1,\dots,n}{\mathrm{mean}}(r_j)|    (28)

where the factor 1.2533 makes it unbiased at normal errors [Rousseeuw and Croux, 1993, p. 1273]. A data point y_i is determined to be anomalous if its z-score is higher than 3.5 ‘robust’ standard deviations (as advised by Iglewicz and Hoaglin [1993])

z_i > 3.5.    (29)
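Equations (25)-(29) translate directly to a small function. A minimal NumPy sketch (not code from the thesis):

```python
import numpy as np

def detect_anomalies(residuals, threshold=3.5):
    """Flag residuals whose robust z-score (equation (25)) exceeds the threshold."""
    r = np.asarray(residuals, dtype=float)
    mad = 1.4826 * np.median(np.abs(r - np.median(r)))      # equation (27)
    if mad != 0:
        z = np.abs(r) / mad
    else:
        meanad = 1.2533 * np.mean(np.abs(r - np.mean(r)))   # equation (28)
        z = np.abs(r) / meanad
    return z > threshold                                     # equation (29)
```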

3.1.5 Experiments

The optimal Ω and t for a particular robust linear regression model are determined with the accuracy score explained in section 3.1.3. The robust method to detect anomalies is explained in section 3.1.4. These two parts result in one method to detect anomalies in one time cluster of one dataset. The best robust anomaly detection method is determined as the model whose anomaly detection method has the highest average F1-score over every time cluster and all datasets. The F1-score considers both the precision and the recall of the models; it is calculated as follows

F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}    (30)

where

\text{precision} = \frac{tp}{tp + fp}    (31)

and

\text{recall} = \frac{tp}{tp + fn}    (32)

where 0 ≤ F1 ≤ 1 with 1 being the best score, tp = number of true positives, fp = number of false positives, and fn = number of false negatives. The average F1-score of the anomaly detection method with one particular regression model is calculated as follows

\bar{F_1} = \sum_{i=1}^{N_d} \sum_{j=1}^{N_t} \frac{F_1(i, j)}{N_d N_t}    (33)

where N_d is the number of datasets and N_t is the number of time clusters, which are eight and four for this experiment. The robust linear regression models that are compared are OLS, three low breakdown models, and three high breakdown models (see section 3.1.1), which corresponds to seven different models in total. It sometimes happens that a time cluster of a dataset has no labeled anomalies or no detected anomalies. In those cases the precision or recall (and therefore also the F1-score) cannot be calculated, and they are omitted from the evaluation.

The models are also compared on their average precision and average recall, to make the comparison between different F1-scores more complete. Furthermore, the models are compared on datasets that contain a collective anomaly and on datasets that do not contain a collective anomaly.

3.2 Results

The results (table 1) show the performance of the anomaly detection method with various models on eight different datasets with four different time clusters. The table shows the performance of the anomaly detection method with seven different regression models on three subsets of the data: all the datasets, the datasets that contain a collective anomaly, and the datasets that do not contain a collective anomaly. They are compared on their average F1-score, average precision, and average recall. The standard deviations of the average values are shown in the table in parentheses. M-estimation with Hampel’s loss function and MM-estimation produced the highest overall F1-score with a score of .55.

Figure 3 and figure 4 show an example of the difference between anomaly detection with M-estimation with Hampel’s loss function and with MM-estimation on a dataset that contains a collective anomaly (more specifically, the dataset shown in figure 1). The detection with M-estimation and Hampel’s loss function has a regression line between the two groups of data points and is therefore unable to detect most of the anomalies in the upper group. The detection with MM-estimation has a regression line through the lower group of data points and is therefore able to detect most of the anomalies in the upper group.

Table 1: Evaluation of the various models. The models are compared on their average F1-score, precision, and recall on all of the data, the data with a collective anomaly, and the data without a collective anomaly. The averages are followed by their standard deviation in parentheses. The highest F1-score for each part of the data is marked with an asterisk. All values are rounded to two decimals.

                            OLS        Huber      Tukey      Hampel     LTS        S          MM
All data       F1          .53 (.31)  .54 (.32)  .54 (.33)  .55 (.32)* .51 (.32)  .52 (.33)  .55 (.32)*
               Precision   .61 (.34)  .61 (.34)  .61 (.35)  .61 (.34)  .51 (.37)  .52 (.37)  .58 (.36)
               Recall      .61 (.36)  .62 (.35)  .61 (.36)  .62 (.35)  .72 (.33)  .69 (.37)  .66 (.36)
Collective     F1          .32 (.24)  .36 (.32)  .33 (.34)  .38 (.31)  .47 (.34)  .48 (.35)  .56 (.31)*
anomaly        Precision   .55 (.39)  .64 (.38)  .64 (.38)  .62 (.40)  .69 (.34)  .74 (.36)  .68 (.37)
               Recall      .41 (.46)  .43 (.45)  .40 (.46)  .42 (.44)  .57 (.39)  .59 (.39)  .59 (.40)
No collective  F1          .58 (.31)* .58 (.32)* .58 (.32)* .58 (.31)* .52 (.32)  .54 (.33)  .55 (.32)
anomaly        Precision   .62 (.33)  .60 (.34)  .60 (.34)  .61 (.34)  .47 (.36)  .47 (.36)  .56 (.36)
               Recall      .65 (.32)  .67 (.32)  .67 (.33)  .66 (.31)  .76 (.32)  .71 (.33)  .67 (.32)

Figure 3: Example of anomaly detection on a dataset that contains a collective anomaly with M-estimation with Hampel’s loss function as model. Most of the anomalous data points in the upper group of data points are incorrectly classified as non-anomalous data.


Figure 4: Example of anomaly detection on a dataset that contains a collective anomaly with MM-estimation as model. Most of the anomalous data points in the upper group of data points are correctly classified as anomalous data.

3.3 Discussion

The results in table 1 show that the anomaly detection methods with either MM-estimation or M-estimation with Hampel’s loss function have the highest overall F1-score (.55), where MM-estimation performs better on datasets with a collective anomaly (.56 versus .38) and M-estimation performs marginally better on datasets without a collective anomaly (.58 versus .55). This observation also holds when the three high breakdown models (LTS, S, and MM-estimation) are compared with the three low breakdown M-estimation models. All high breakdown models perform better than the low breakdown models on the data with a collective anomaly (the lowest F1 for high breakdown is .47 versus the highest for low breakdown of .38), whereas the low breakdown models perform slightly better on the data without a collective anomaly (an average of .58 versus an average of .53). The percentage of datasets with a collective anomaly is approximately 19%, which explains the higher influence of the datasets without a collective anomaly on the overall score.

The standard deviations of the performance scores are high relative to the range of the F1-scores, which makes it hard to generalize the results. A reason for the high variation between datasets is that some datasets only have a couple of labeled anomalies, such that one labeled anomaly influences the performance of a model significantly. The variation of the results of the different models on the same datasets is generally very low because of the similarity of the models.

The results also show that the low breakdown models have a higher precision than the high breakdown models, while having a lower recall. A reason for this could be that the low breakdown models consider more of the data in determining their model, which in turn produces a more accurate fit on data with few anomalies, which is the case for most of the data. The high breakdown models, on the other hand, have the ability to ignore more of the data. This makes them more robust in detecting all kinds of anomalies and gives them a higher recall, at the cost of the accuracy and the precision of the model.

The comparison of the performance between all of the models shows a couple of things. The M-estimation models and OLS perform similarly well, with Hampel’s loss function performing slightly better on datasets with a collective anomaly (.38 versus .36, .33, and .32). LTS and S-estimation perform similarly well (.51 versus .52). Lastly, LTS and S-estimation have a higher recall than MM-estimation, while MM-estimation has a higher precision. The reason for this might be that MM-estimation is more efficient at the cost of being less robust compared to LTS and S-estimation.

4 The detection of time anomalies

This section describes the method to detect the gas and electricity consumption anomalies in the context of the time in hours (the time anomalies). The methodology to detect the time anomalies is described first, secondly the results are shown, and lastly the results are discussed.

4.1 Methodology

The method to detect anomalies in the representation with the time component is also a residual based approach. A residual based approach is used because the data can be modeled with some nonlinear model. The data is modeled with one of the two popular nonlinear models for energy consumption data: an artificial neural network (ANN) or a support vector machine based approach (see section 2.2.2). The anomalies in the data are detected with the help of an ANN or with support vector regression (SVR); these are explained first. The representation of the data with the labeling is explained secondly. The choice of the parameters and features of the ANN and the SVR for the detection of the anomalies is explained in the following section. The detection of the anomalies with those models is explained in the next section. Lastly, the evaluation of the two models is explained.

4.1.1 Nonlinear models

Two nonlinear regression models are used to model the data with the time component. These models will be used to find different types of anomalies in the data. The models are an ANN and SVR. The two are compared because they find different models, and one of them might be more suitable for detecting anomalies than the other.

Artificial neural network An artificial neural network is a model that can learn nonlinear patterns and can be used for regression. The ANN models data through interconnected 'neurons' and is trained with backpropagation. The ANN with one hidden layer is of the form

$$f(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M} w^{(2)}_{j} \, h\!\left(\sum_{i=0}^{D} w^{(1)}_{ji} x_i\right) \qquad (34)$$

where $\mathbf{x}$ are the input parameters, $\mathbf{w}$ the weights of the parameters, $\sum_{i=0}^{D} w^{(1)}_{ji} x_i$ a linear combination of the activations of the neurons in the first layer, $\sum_{j=0}^{M} w^{(2)}_{j} h(\cdot)$ a linear combination of the activations of the neurons in the second (hidden) layer, and $h(\cdot)$ a nonlinear activation function [Bishop, 2006]. The model for the detection of anomalies has one hidden layer, with two-thirds of the number of input neurons in the hidden layer.

The weights of the ANN are updated with error backpropagation, which consists of two stages. In the first stage the derivatives of the error function are evaluated and propagated backwards. The derivatives are then used in the second stage to update the weights. The exact implementation of this varies; resilient backpropagation was used for the detection of anomalies in this paper [Riedmiller and Rprop, 1994].

The error function of ANNs is non-convex, which makes it hard to find a good solution. Repeated runs were made to make sure that a good solution was found.
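To make the setup concrete, the following is a minimal sketch of such a one-hidden-layer regression network with repeated restarts, assuming Python with scikit-learn. Note that scikit-learn's MLPRegressor trains with Adam rather than the resilient backpropagation used here, so the sketch only illustrates the architecture and the restart strategy, not the exact training procedure.

```python
# Minimal sketch of the one-hidden-layer network of equation (34) with
# repeated restarts. NOTE: MLPRegressor trains with Adam, not resilient
# backpropagation (Rprop), so only the architecture and the restart
# strategy are illustrated here.
import numpy as np
from sklearn.neural_network import MLPRegressor

def fit_ann(X_train, y_train, n_restarts=5):
    n_hidden = max(1, (2 * X_train.shape[1]) // 3)  # two-thirds of the inputs
    best_model, best_loss = None, np.inf
    for seed in range(n_restarts):  # repeated runs against local optima
        ann = MLPRegressor(hidden_layer_sizes=(n_hidden,),
                           activation="logistic",  # nonlinear h(.)
                           max_iter=2000, random_state=seed)
        ann.fit(X_train, y_train)
        if ann.loss_ < best_loss:  # keep the run with the lowest training loss
            best_model, best_loss = ann, ann.loss_
    return best_model
```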


Support vector regression Support vector regression is a regression method that is able to find nonlinear models. It is an extension of support vector machines to regression problems. It uses the 'kernel trick' such that it can perform linear regression in a constrained high dimensional feature space. SVR has the property that the objective function is convex, such that the solution can be found without having to worry about local optima. The model in the high dimensional feature space is of the form

$$f(\mathbf{x}, \mathbf{w}) = \mathbf{w}^T \phi(\mathbf{x}) + b \qquad (35)$$

where $\mathbf{x}$ are the input parameters, $\phi(\mathbf{x})$ denotes the feature-space transformation, $b$ the bias parameter, and $\mathbf{w}$ the weights of the features. The regression problem is solved by minimizing the following regularized error function

$$C \sum_{n=1}^{N} \left(\xi_n + \hat{\xi}_n\right) + \frac{1}{2} \|\mathbf{w}\|^2 \qquad (36)$$

where $C$ is a regularization parameter and $\xi_n \geq 0$ and $\hat{\xi}_n \geq 0$ are the slack variables. Any point $y_n$ to be predicted is allowed to be within $\varepsilon$ distance of the model. A point outside this 'tube' is punished with the slack variables in the error function, where $\xi_n > 0$ corresponds to a point for which $y_n > f(x_n, \mathbf{w}) + \varepsilon$ and $\hat{\xi}_n > 0$ corresponds to a point for which $y_n < f(x_n, \mathbf{w}) - \varepsilon$. The error function is solved with the help of Lagrange multipliers by optimizing the Lagrangian. The solution is expressed in terms of some kernel function [Bishop, 2006].

The kernel function that is used for the detection of anomalies is the commonly used radial basis kernel

$$k(\mathbf{x}, \mathbf{z}) = \exp\left(-\gamma \|\mathbf{x} - \mathbf{z}\|^2\right) \qquad (37)$$

where $\gamma$ is set to one divided by the number of features.

The $\varepsilon$ and $C$ are two fixed parameters and need to be found with a hyperparameter optimization. The need for a hyperparameter optimization is inherent to SVR. The exact implementation of the model is explained in section 4.1.3.
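As an illustration, the following sketch shows how the RBF-kernel SVR and the grid search over $C$ and $\varepsilon$ could be set up, assuming Python with scikit-learn; the grid values are illustrative and not the ones used in this paper.

```python
# Sketch of the RBF-kernel SVR of equations (35)-(37) with a grid search
# over C and epsilon. The grid values below are illustrative only.
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

def tune_svr(X_train, y_train):
    param_grid = {"C": [0.1, 1, 10, 100],
                  "epsilon": [0.01, 0.1, 0.5, 1.0]}
    grid = GridSearchCV(
        SVR(kernel="rbf", gamma="auto"),  # gamma = 1 / n_features, eq. (37)
        param_grid,
        scoring="neg_root_mean_squared_error",
        cv=5,
    )
    grid.fit(X_train, y_train)  # scaled features, see section 4.1.2
    return grid.best_estimator_
```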

4.1.2 Data representation

The models are trained on and compared with hourly gas and electricity consumption values over one year. The consumption values show a certain pattern over a week, with higher values during the daytime on workdays and lower values during the night and in the weekend (see figure 5). Furthermore, the gas consumption has higher values when the outside temperature is lower and no gas consumption when the outside temperature is high. The electricity consumption is generally steadier, having high values when the outside temperature is really high, and sometimes having high values when the outside temperature is low, depending on the building. This is because gas is only used for heating, while electricity is used for cooling and at times for heating too. Figure 2 shows an example of electricity consumption over a year.

Data labeling The data is labeled by an expert of energy consumption data who knows what energy consumption patterns should look like. The labeling is done with the help of 3D-plots of the data, such as the one in figure 2. Datapoints are labeled as either (contextual) point anomalies or some form of a (contextual) collective anomaly. A point anomaly is a single outlying data point, whereas the collective anomalies have a start date and an end date in which they occur. Different types of collective anomalies were identified and categorized from the vantage point of the 3D-plot.


Figure 5: Example in the data of the gas consumption over one week. The example shows high values during workdays, and low values at night and during the weekend.

The anomaly types are: point anomalies, horizontal anomalies, vertical anomalies, and shifted anomalies. They are described generally as follows

1. Point anomalies: an anomalous value that is not directly surrounded by other anomalous values.

2. Vertical anomalies: consecutive anomalous values during some period of time. For instance, when there is an unexpected vacation day during a workday.

3. Horizontal anomalies: consecutive anomalous values for a specific hour during some period of time. For instance, when the consumption values for some nights are consecutively not as low as on normal nights.

4. Shifted anomalies: values that are anomalous because they are all moved up or down a certain number of hours for some period of time. This happens in the data when a wrong correction is made for summertime/wintertime, such that all the consumption values in that time period are moved one hour up or down.

The anomalies are described formally in section 4.1.4.

4.1.3 Model of the data

Two different models were used for the representation with the time component. These two models need some parameters and predictive features. These were found by splitting the datasets into a 70% training set and a 30% test set. The models were optimized with the root-mean-square error (RMSE)

$$\text{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} \left(y_i - f(x_i, \mathbf{w})\right)^2}{n}} \qquad (38)$$

where $x_i$ is the set of input parameters for the predicted gas or electricity consumption value $y_i$ for hour $i$.
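A minimal sketch of the split and the RMSE computation, assuming Python with scikit-learn and a feature matrix X with hourly targets y; whether the split preserved the temporal order is not stated in the text and is an assumption here.

```python
# Sketch of the 70%/30% split and the RMSE of equation (38). The
# chronological (unshuffled) split is an assumption.
import numpy as np
from sklearn.model_selection import train_test_split

def split_and_score(X, y, model):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, shuffle=False)  # 70% train, 30% test
    model.fit(X_train, y_train)
    residuals = y_test - model.predict(X_test)
    rmse = np.sqrt(np.mean(residuals ** 2))  # eq. (38)
    return model, rmse
```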

The SVR needs values for the parameters $C$ and $\varepsilon$. These were found with a grid search over the error of the model for different values. The parameters were fixed across the different datasets because the optimal parameters were similar across the optimized datasets; the same holds for the ANN.

Both the ANN and the SVR need predictive features as inputs for their models. The features for both models were scaled. For the SVR the features are scaled to have zero mean and unit variance; for the ANN the features are scaled to a value between 0 and 1, where the maximum value is mapped to 1 and the minimum value to 0.

Several features were tested. A minimal set of predictive features was sought to keep the model as simple as possible. The predictive features for the SVR were (the non-scaled values the features can take are shown in parentheses):

• Outside temperature (−13.9, …, 34.4)
• Hour (1, 2, …, 24)
• Hour2 (13, 14, …, 24, 1, …, 12)
• Day of the week (1, 2, …, 7)
• Weekend (0, 1)

The ANN used the same features as the SVR, together with:

• Morning (0, 1)
• Afternoon (0, 1)
• Evening (0, 1)
• Night (0, 1)

The ‘hour2’ feature makes the distance between hour 24:00 and 01:00 one instead of 23, as in the ‘hour’ feature. This should make modeling similar values between these hours easier, and the feature was found to be predictive enough to be included in the final set of predictive features.
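A sketch of the feature construction and the two scaling schemes, assuming Python with pandas and a DataFrame with a DatetimeIndex and an outside-temperature column; the part-of-day boundaries for the morning/afternoon/evening/night features are not specified in the text and are chosen illustratively here.

```python
# Sketch of the feature construction and scaling. Assumes a pandas
# DataFrame `df` with a DatetimeIndex and an 'outside_temp' column; the
# part-of-day boundaries are illustrative, not taken from the text.
import pandas as pd

def build_features(df):
    f = pd.DataFrame(index=df.index)
    f["temp"] = df["outside_temp"]
    f["hour"] = df.index.hour + 1                    # 1, 2, ..., 24
    f["hour2"] = (f["hour"] + 11) % 24 + 1           # 13, ..., 24, 1, ..., 12
    f["dow"] = df.index.dayofweek + 1                # 1 (Mon), ..., 7 (Sun)
    f["weekend"] = (f["dow"] >= 6).astype(int)       # 0/1
    # Extra binary features, only used by the ANN:
    f["morning"] = f["hour"].between(7, 12).astype(int)
    f["afternoon"] = f["hour"].between(13, 18).astype(int)
    f["evening"] = f["hour"].between(19, 24).astype(int)
    f["night"] = f["hour"].between(1, 6).astype(int)
    return f

def scale_for_svr(f):   # zero mean, unit variance
    return (f - f.mean()) / f.std()

def scale_for_ann(f):   # min-max to [0, 1]
    return (f - f.min()) / (f.max() - f.min())
```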

4.1.4 Anomaly detection

There are different anomalies that need to be detected for the data with the time component. These anomalies are detected with different rules. The rules are based on modified z-scores, similar to equation (25)

$$z_i = |r_i| / \text{MAD} \qquad (39)$$

where

$$r_i = y_i - f(x_i, \mathbf{w}) \qquad (40)$$

MAD is calculated according to equation (27), $y_i$ corresponds to the value of the $i$-th hour of the dataset, and $x_i$ is the set of input parameters for that value.

A value $y_i$ is determined to be anomalous if the corresponding $z_i$ passes the following rule

$$z_i > \text{threshold} \qquad (41)$$

where the threshold is a value that depends on the type of data (gas or electricity) and the model (ANN or SVR).

The anomalous values are categorized into the anomaly types with the following rules:

1. Point anomaly: if $y_i$ is anomalous and neither $y_{i-1}$ nor $y_{i+1}$ is anomalous.
2. Vertical anomaly: if $y_i$ is anomalous and either $y_{i-1}$ or $y_{i+1}$ is also anomalous.
3. Horizontal anomaly: if $y_i$ is anomalous and either $y_{i-24}$ or $y_{i+24}$ is also anomalous.

The shifted anomalies are not detected because the model of the data is not robust enough for this type. If shifted anomalies occur in the data, then 50% of the data is shifted, which results in the model being influenced too much by the anomalous values to be able to predict them. One could probably make a specific rule to detect this specific type. However, such a rule would not rely on the current model of the data and would therefore deviate from the main approach of this paper just for the detection of that type.
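The detection rule and the categorization could be implemented as in the following sketch, assuming Python with NumPy. The MAD of equation (27) is assumed here to be the standard median absolute deviation of the residuals, and giving the vertical rule priority over the horizontal rule when both apply is also an assumption.

```python
# Sketch of rules (39)-(41) and the anomaly-type categorization. The MAD
# of equation (27) is assumed to be the standard median absolute
# deviation; vertical-over-horizontal priority is also an assumption.
import numpy as np

def detect_anomalies(y, y_hat, threshold):
    r = y - y_hat                                  # residuals, eq. (40)
    mad = np.median(np.abs(r - np.median(r)))      # assumed MAD, cf. eq. (27)
    z = np.abs(r) / mad                            # modified z-scores, eq. (39)
    anomalous = z > threshold                      # rule (41)

    n, types = len(y), {}
    for i in np.flatnonzero(anomalous):
        neighbor = (i > 0 and anomalous[i - 1]) or \
                   (i + 1 < n and anomalous[i + 1])
        same_hour = (i >= 24 and anomalous[i - 24]) or \
                    (i + 24 < n and anomalous[i + 24])
        if neighbor:
            types[i] = "vertical"
        elif same_hour:
            types[i] = "horizontal"
        else:
            types[i] = "point"
    return anomalous, types
```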

4.1.5 Experiments

Experiments were run on gas consumption and electricity consumption data with the anomaly detection method using either the ANN model or the SVR model, resulting in four different results. The optimal model of the data $f(\mathbf{x}, \mathbf{w})$ is determined with the RMSE described in section 4.1.3. The method to detect the anomaly types is described in section 4.1.4. This once again results in one method that can detect different types of anomalies. The optimal method is determined with the F1-score. The method for detecting anomalies with the time component uses a non-fixed threshold. The optimal threshold is the threshold that has the highest average F1-score on the data. The procedure for finding the optimal threshold is independent of the procedure for finding the optimal model of the data; therefore no training set is used for the optimal threshold.

The F1-score for one dataset is calculated according to equation (30). The average $F_1$ of one model over one consumption parameter is calculated as follows

$$\bar{F}_1 = \sum_{i=1}^{N_d} \frac{F_1(i)}{N_d} \qquad (42)$$

The F1-score for one dataset is omitted when there are no labeled anomalies or no detected anomalies. The recall of the models for the anomaly types is also calculated to know how well the method can recognize certain types. The precision of the models is calculated to show the trade-off between the precision and recall.
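A sketch of the threshold selection, assuming Python with scikit-learn and the detect_anomalies function from the previous sketch; the list-of-triples data structure and the candidate threshold range are illustrative assumptions.

```python
# Sketch of the threshold selection: scan candidates and keep the one
# with the highest average F1 of equation (42). `datasets` is assumed to
# be a list of (y, y_hat, labels) triples with boolean expert labels.
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(datasets, candidates=np.arange(2.0, 15.0, 0.5)):
    best_t, best_f1 = None, -1.0
    for t in candidates:
        scores = []
        for y, y_hat, labels in datasets:
            detected, _ = detect_anomalies(y, y_hat, t)
            if labels.any() and detected.any():  # omit otherwise, cf. 4.1.5
                scores.append(f1_score(labels, detected))
        if scores and np.mean(scores) > best_f1:
            best_t, best_f1 = t, float(np.mean(scores))
    return best_t, best_f1
```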

4.2 Results

The results in table 2 show the average F1-score of the anomaly detection method with SVR or ANN as the model of the data on gas consumption data and electricity consumption data. Table 3 shows the average recall over the various types of anomalies. Table 4 shows the average precision over all the types of anomalies, to show the trade-off of the F1-score between the recall and precision. The SVR- and ANN-based methods performed about equally well, with F1-scores of .22 and .23 on the gas consumption data and .18 and .18 on the electricity consumption data.

As an example, figure 6 shows the anomalies detected with SVR on the dataset shown in figure 2. The figure shows the true positives, false positives, true negatives, and false negatives in different colors.
