
Structured Kernel Based Modeling: An Exploration in Short-Term Load Forecasting

Marcelo Espinoza ∗, Johan A.K. Suykens, Bart De Moor

Katholieke Universiteit Leuven, ESAT/SCD-SISTA, Kasteelpark Arenberg 10, B-3001 Leuven (Heverlee), Belgium

Abstract

This paper considers an exploratory modeling strategy applied to a large scale real-life problem of power load forecasting. Different model structures are considered, including Autoregressive models with eXogenous inputs (ARX), Nonlinear Autoregressive models with eXogenous inputs (NARX), both of which are also extended to incorporate residuals that follow an Autoregressive (AR) process (AR-(N)ARX). These models are parameterized either as a full black-box model or as partially linear structures. Starting from the Least Squares Support Vector Machines (LS-SVMs) formulation for regression, relevant properties from the problem at hand can be imposed as additional constraints which can be incorporated in a straightforward way within the primal-dual optimization formulation and be solved through convex optimization. For the case study of short-term load forecasting, it is shown that a structured model can reach results comparable to the full black-box model, with the additional benefit of having interpretable coefficients. It obtains satisfactory results when the model structure and the hyperparameter selection procedure are tailored towards the application, taking into account prior knowledge or relevant properties of the problem.

Key words: Least-Squares Support Vector Machines, Fixed-Size LS-SVM, Primal Space Regression, Load Forecasting, Time Series, Model Structures.

1 Introduction

This paper addresses the issue of selecting a suitable model structure by using prior knowledge and physical insight taken from the problem under analysis.

∗ Corresponding Author. Email addresses: marcelo.espinoza@esat.kuleuven.be (Marcelo Espinoza), johan.suykens@esat.kuleuven.be (Johan A.K. Suykens).


In real-life applications, the designer can make use of different model specifications, from nonlinear black-box models to linear parametric models [1]. Following the well-known golden rule of "do not estimate what you already know", it is important to include as much information as possible about the problem into the modeling stage. This prior information may relate to several levels (the available data, the model structure, the model selection criteria), in such a way that the final model obtains good predictive accuracy because it has been "tailored" to the application at hand.

The work in this paper has been developed using the Least Squares Support Vector Machines (LS-SVMs [2]) nonlinear regression formulation as a basis technique for nonlinear system identification. LS-SVMs belong to the class of kernel methods, which is a subject with several directions studied in different fields. LS-SVMs and Support Vector Machines (SVMs) [3–5] follow the approach of a primal-dual optimization formulation, where both techniques make use of a so-called feature space in which the inputs have been transformed by means of a (possibly infinite dimensional) nonlinear mapping ϕ. This is converted to the dual space by means of Mercer's theorem and the use of a positive definite kernel, without computing the mapping ϕ explicitly. The SVM model solves a quadratic programming problem in dual space, obtaining a sparse solution [6]. The LS-SVM formulation, on the other hand, solves a linear system under a least squares cost function with equality constraints, where the sparseness property can be obtained e.g. by sequentially pruning the support value spectrum. The LS-SVM training procedure involves a selection of the kernel parameter and the regularization parameter of the cost function, which can usually be done by cross-validation or by using Bayesian techniques [7]. Other directions in kernel methods follow different approaches. In Reproducing Kernel Hilbert Spaces (RKHS) [8] the problem of function estimation is treated as a variational problem; Gaussian Processes (GP) [9,10] follow a probabilistic Bayesian setting. Although these different approaches have links with each other (e.g. for the simple case of static regression without a bias term it is well known that GP, regularization networks and LS-SVM lead to the same set of linear equations to be solved at the dual level), in general the methodologies are different. In particular, the primal-dual formulation of LS-SVM makes it easy to add additional constraints, which in the context of the present work makes it straightforward to incorporate more structure into the models, as will be illustrated throughout the paper. In addition, the LS-SVM models can be estimated in primal space directly with a sparse representation by using Nyström methods (which originated in the GP literature) for the case of large samples, which is also exploited in this paper.

In this work we consider the LS-SVM formulation for regression. However, the LS-SVM framework is more general [2]. Given that the LS-SVM formulation works with equality constraints and an L2 loss function, this optimization-based setting has been extended towards kernel versions of PCA (Principal Components Analysis) [11], FDA (Fisher's Discriminant Analysis) [12,13], CCA (Canonical Correlation Analysis) [14], PLS (Partial Least Squares), spectral clustering, subspace methods [15], recurrent networks, control, and others. In general, LS-SVMs can be seen as core formulations from which more sophisticated kernel machines can be built by using additional constraints or different loss functions for robustness or sparsity. Within the LS-SVM framework, different model structures and parameterizations can often be estimated as the solution of a set of linear equations obtained from a convex problem.

As a real-life application, we consider the problem of Short-Term Load Forecasting (STLF), which is an important area of applied quantitative research. The goal of STLF is to build a model that can be used for the prediction of the power load over the next few hours. Usually a load series shows important seasonal patterns (yearly, weekly, intra-daily patterns) that need to be taken into account in the modeling strategy [16]. This situation translates into autocorrelation patterns across the corresponding hour on previous days [17]. In addition, unlike other time series problems, in STLF accurate forecasts are required particularly for the hour of the daily peak. The accuracy of the predicted daily peak load has to be as good as possible, in order to minimize the risk of grid congestion. Typically, a linear parametric model may be preferred over a full nonlinear black-box, due to the insights that can be derived from the identified coefficients [17] in terms of temperature sensitivity and growth trends, which are useful for management and planning purposes. By considering all these aspects as prior knowledge, in this paper we illustrate an exploratory modeling methodology for model structure selection within the STLF framework, using estimation techniques based on Least Squares Support Vector Machines (LS-SVMs [2]). Previous work on short-term load forecasting using black-box models estimated with LS-SVMs only considers the case of a general NARX black-box formulation [18,19]. The work presented in this paper extends the framework towards the use of structured models which are tailored to the application under study.

This paper is organized as follows. The description of the different model structures used in this paper is presented in Section 2. The estimation techniques based on LS-SVM are described in Section 3, where the model structures are parameterized in several ways. In Section 4, the implementation using Fixed-Size LS-SVMs [2] is described, with a description of the estimation methodology for each one of the model structures and parameterizations. The application to the STLF case study is presented in Section 5, and the estimation results are discussed in Section 6.


2 Model Formulation

In this paper we consider two model structures: a (N)ARX model, which is a (Nonlinear) AutoRegression with eXogenous inputs, and an AR-(N)ARX model, which is a (Nonlinear) AutoRegression with eXogenous inputs and AutoRegressive residuals. Each one of these model structures can be formulated in terms of a linear parameterization, a full black-box, or a partially linear parameterization.

2.1 Model Structures

Consider the regression vector $z_t \in \mathbb{R}^n$ of a (N)ARX model, $z_t = [y_{t-1}; \cdots; y_{t-p}; u_t; u_{t-1}; \cdots; u_{t-q}]$, containing $p$ past values of the output $y_t \in \mathbb{R}$ and $q$ past values of the input vectors $u_t \in \mathbb{R}^{N_u}$, with $n = p + N_u q$. A (N)ARX model structure [1,20,21] can be formulated as

$$ y_t = g(z_t) + e_t \qquad (1) $$

where the error term $e_t$ is assumed to be i.i.d. with zero mean and constant variance $\sigma_e^2$. The AR-(N)ARX model structure [1,22,23] incorporates a covariance-stationary process on the error terms $e_t$,

$$ y_t = g(z_t) + e_t, \qquad e_t = \rho e_{t-\tau} + r_t \qquad (2) $$

where $\tau \geq 1$, $r_t$ is a white noise sequence with zero mean and constant variance $\sigma_r^2$, and $|\rho| < 1$. When $\tau = 1$, the noise process is AR(1), $e_t = \rho e_{t-1} + r_t$, which is often used in applied modeling work [24], as in the seminal work of Engle et al. on electricity prices [25].

2.2 Model Parameterizations

For the parameterization of the function $g(\cdot)$ from (1) or (2), the following alternatives are considered.

2.2.1 Linear Parameterization

The function $g(\cdot)$ can be parameterized as

$$ g(z_t) = \theta^T z_t \qquad (3) $$

with $\theta \in \mathbb{R}^{N_d}$. In this case the function $g$ is linear, and the corresponding model structures from (1) and (2) are an ARX and an AR-ARX, respectively. This model can be estimated as a linear regression, using e.g. Ordinary Least Squares (OLS).

2.2.2 Black-Box Parameterization

The nonlinear function $g(\cdot)$ for a NARX (1) or AR-NARX (2) structure is parameterized under a black-box formulation in primal space using LS-SVMs [2]:

$$ g(z_t) = w^T \varphi(z_t) + b \qquad (4) $$

where $b$ is a constant (bias) term, and $\varphi(\cdot): \mathbb{R}^n \to \mathbb{R}^{N_h}$ is the feature map from the input space to the so-called feature space (of dimension $N_h$, which can possibly be infinite). In the area of support vector machines and kernel-based learning this feature map is used in relation to a Mercer kernel [2,4], in such a way that the feature map is not computed explicitly.

2.2.3 Partially Linear Parameterization

In this case, some of the regressors are included as linear terms, and others are included under a nonlinear black-box term. Consider a partition of the regression vector $z_t$ as follows. Let $\mathcal{Z} = \{x : x \text{ is a component of the vector } z_t\}$. Define an arbitrary partition $\mathcal{Z} = \mathcal{Z}_A \cup \mathcal{Z}_B$ with $\mathcal{Z}_A \cap \mathcal{Z}_B = \emptyset$. Define a vector $z_{A,t} \in \mathbb{R}^{N_a}$ with regressors $x \in \mathcal{Z}_A$, and a vector $z_{B,t} \in \mathbb{R}^{N_b}$ with regressors $x \in \mathcal{Z}_B$. The original regression vector is thus partitioned as $z_t = [z_{A,t}; z_{B,t}]$. The subscript $A$ (resp. $B$) represents the subset of regressors that enters linearly (resp. nonlinearly) into the model. The nonlinear component of this Partially Linear (PL) [26] parameterization is, in turn, expressed as a black-box formulation using LS-SVMs. Thus, the nonlinear function $g(\cdot)$ for a PL-NARX (1) or a PL-AR-NARX (2) is parameterized as

$$ g(z_t) = \beta^T z_{A,t} + w^T \varphi(z_{B,t}) + b \qquad (5) $$

for a given partition $z_t = [z_{A,t}; z_{B,t}]$, where $\varphi(\cdot): \mathbb{R}^{N_b} \to \mathbb{R}^{N_h}$ is the feature map from input space to feature space.

3 Model Estimation

The different nonlinear model structures can be estimated using the LS-SVM regression framework. Starting from a given dataset $\{z_i, y_i\}_{i=1}^{N}$, the different estimation problems are presented for each of the model structures outlined in the previous section.


3.1 NARX Model: Black-box approach

For the NARX model (1), with $g(\cdot)$ parameterized as in (4), the following optimization problem with a regularized cost function is formulated (primal problem):

$$ \min_{w,b,e_i} \ \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=1}^{N} e_i^2 \qquad (6) $$
$$ \text{s.t.} \quad y_i = w^T \varphi(z_i) + b + e_i, \quad i = 1, \ldots, N, $$

where $\gamma$ is a regularization constant. The solution is formalized in the following lemma.

Lemma 1 [2,27] Given a positive definite kernel function $K: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$, with $K(z_i, z_j) = \varphi(z_i)^T \varphi(z_j)$, the solution to (6) is given by the dual problem

$$ \begin{bmatrix} 0 & \mathbf{1}^T \\ \mathbf{1} & \Omega + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix}, \qquad (7) $$

with $y = [y_1, \ldots, y_N]^T$, $\alpha = [\alpha_1, \ldots, \alpha_N]^T$, and $\Omega$ the kernel matrix with $\Omega_{ij} = K(z_i, z_j)$, $\forall i,j = 1, \ldots, N$.

The model representation in dual form is given by

$$ \hat{y}_t = \sum_{i=1}^{N} \alpha_i K(z_i, z_t) + b. \qquad (8) $$

Remark 1 [Kernel functions.] For a positive definite kernel function $K$ some common choices are: $K(x_i, x_j) = x_i^T x_j$ (linear kernel); $K(x_i, x_j) = (x_i^T x_j + c)^d$ (polynomial of degree $d$, with $c > 0$ a tuning parameter); $K(x_i, x_j) = \exp(-\|x_i - x_j\|_2^2 / \sigma^2)$ (RBF kernel), where $\sigma$ is a tuning parameter.
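As a concrete illustration of Lemma 1 and Remark 1, the following is a minimal NumPy sketch (not the authors' code; the values of $\sigma$ and $\gamma$ are placeholders) that builds the kernel matrix $\Omega$ with an RBF kernel, solves the dual system (7), and evaluates the dual model (8).

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """RBF kernel matrix with entries exp(-||a_i - b_j||^2 / sigma^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

def lssvm_narx_fit(Z, y, sigma, gamma):
    """Solve the dual system (7): [[0, 1^T], [1, Omega + I/gamma]] [b; alpha] = [0; y]."""
    N = Z.shape[0]
    Omega = rbf_kernel(Z, Z, sigma)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = Omega + np.eye(N) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[1:], sol[0]          # alpha, b

def lssvm_narx_predict(Z_train, alpha, b, Z_new, sigma):
    """Dual model (8): y_hat = sum_i alpha_i K(z_i, z) + b."""
    return rbf_kernel(Z_new, Z_train, sigma) @ alpha + b

# toy usage with synthetic data (placeholder dimensions)
rng = np.random.default_rng(0)
Z = rng.standard_normal((200, 5))
y = np.sin(Z[:, 0]) + 0.1 * rng.standard_normal(200)
alpha, b = lssvm_narx_fit(Z, y, sigma=2.0, gamma=10.0)
y_hat = lssvm_narx_predict(Z, alpha, b, Z, sigma=2.0)
```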

3.2 AR-NARX Model: Incorporating a noise model

Consider the AR-NARX model (2), with $g(\cdot)$ parameterized as in (4). With the inclusion of a noise model, the following regularized optimization problem is formulated:

$$ \min_{w,b,r_i} \ \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=\tau+1}^{N} r_i^2 \qquad (9) $$
$$ \text{s.t.} \quad \begin{cases} y_i = w^T \varphi(z_i) + b + e_i, & i = \tau+1, \ldots, N, \\ e_i = \rho e_{i-\tau} + r_i, & i = \tau+1, \ldots, N, \end{cases} $$

where $\gamma$ is a regularization constant and the noise model coefficient $\rho$ is a tuning parameter satisfying $|\rho| < 1$ (invertibility condition of the process). At this stage, the parameter $\rho$ is considered to be given. By eliminating $e_i$, the following problem is obtained:

$$ \min_{w,b,r_i} \ \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=\tau+1}^{N} r_i^2 \qquad (10) $$
$$ \text{s.t.} \quad y_i = \rho y_{i-\tau} + w^T \varphi(z_i) - \rho w^T \varphi(z_{i-\tau}) + b(1-\rho) + r_i, \quad i = \tau+1, \ldots, N. $$

The solution is formalized in the following lemma.

Lemma 2 [23] Given a positive definite kernel function $K: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$, with $K(z_i, z_j) = \varphi(z_i)^T \varphi(z_j)$, the solution to (10) is given by the dual problem

$$ \begin{bmatrix} 0 & \mathbf{1}^T \\ \mathbf{1} & \Omega^{(\rho)} + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ \tilde{y} \end{bmatrix}, \qquad (11) $$

with $\tilde{y} = [y_{1+\tau} - \rho y_1, \ldots, y_N - \rho y_{N-\tau}]^T$, $\alpha = [\alpha_1, \ldots, \alpha_{N-\tau}]^T$, and $\Omega^{(\rho)}$ the kernel matrix with entries $\Omega^{(\rho)}_{ij} = K(z_{i+\tau}, z_{j+\tau}) - \rho K(z_i, z_{j+\tau}) - \rho K(z_{i+\tau}, z_j) + \rho^2 K(z_i, z_j)$, $\forall i,j = 1, \ldots, N-\tau$.

The model can be represented in dual form as

$$ \hat{y}_t = \rho y_{t-\tau} + h(z_t) - \rho h(z_{t-\tau}), \qquad (12) $$

where $h(z_t)$ is given by

$$ h(z_t) = \sum_{i=\tau+1}^{N} \alpha_{i-\tau} [K(z_i, z_t) - \rho K(z_{i-\tau}, z_t)] + b. \qquad (13) $$

Remark 2 [Considering $\rho$ as a tuning parameter.] For the estimation of the AR-NARX and PL-AR-NARX models, we have considered the parameter $\rho$ as a tuning parameter in order to work with a feasible convex optimization problem in which Mercer's theorem can be applied and a unique solution can be obtained. The parameter $\rho$, therefore, is determined at another level (e.g. via cross-validation, together with the kernel parameters and the regularization term) to yield a good generalization performance of the model. In this way, the selected $\rho$ is the value that gives the best cross-validation performance.
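For a fixed $\rho$, the transformed kernel matrix $\Omega^{(\rho)}$ of Lemma 2 can be assembled from ordinary kernel evaluations. The following is an illustrative NumPy sketch (assuming the rbf_kernel helper from the previous example; not the authors' implementation):

```python
import numpy as np

def ar_narx_kernel_matrix(Z, tau, rho, kernel):
    """Omega^(rho)_{ij} = K(z_{i+tau}, z_{j+tau}) - rho*K(z_i, z_{j+tau})
                          - rho*K(z_{i+tau}, z_j) + rho^2*K(z_i, z_j),
       for the N - tau shifted pairs of regression vectors."""
    Zp = Z[:-tau]   # z_i,       i = 1, ..., N - tau
    Zf = Z[tau:]    # z_{i+tau}
    return (kernel(Zf, Zf) - rho * kernel(Zp, Zf)
            - rho * kernel(Zf, Zp) + rho ** 2 * kernel(Zp, Zp))

# example usage: Omega_rho together with the filtered targets
# y_tilde = y[tau:] - rho * y[:-tau] enter the same bordered system as in (7).
# Omega_rho = ar_narx_kernel_matrix(Z, tau=24, rho=-0.4,
#                                   kernel=lambda A, B: rbf_kernel(A, B, 2.0))
```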


3.3 PL-NARX Model: Considering a Partially Linear Structure

For the PL-NARX model (1), with $g(\cdot)$ parameterized as in (5), the following optimization problem with a regularized cost function is formulated:

$$ \min_{w,b,e_i,\beta} \ \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=1}^{N} e_i^2 \qquad (14) $$
$$ \text{s.t.} \quad y_i = \beta^T z_{A,i} + w^T \varphi(z_{B,i}) + b + e_i, \quad i = 1, \ldots, N, $$

where $\gamma$ is a regularization constant. The solution is given by the following lemma.

Lemma 3 [26] Given the problem (14) and a positive definite kernel function $K: \mathbb{R}^{N_b} \times \mathbb{R}^{N_b} \to \mathbb{R}$, the solution to (14) is given by the dual problem

$$ \begin{bmatrix} 0_{N_a \times N_a} & 0_{N_a \times 1} & Z^T \\ 0_{1 \times N_a} & 0 & \mathbf{1}^T \\ Z & \mathbf{1} & \Omega + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} \beta \\ b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0_{N_a \times 1} \\ 0 \\ y \end{bmatrix}, \qquad (15) $$

where $Z = [z_{A,1}^T; z_{A,2}^T; \cdots; z_{A,N}^T] \in \mathbb{R}^{N \times N_a}$ is the matrix of linear regressors, $y = [y_1; y_2; \ldots; y_N]$, and $\Omega$ is the kernel matrix with $\Omega_{ij} = K(z_{B,i}, z_{B,j}) = \varphi(z_{B,i})^T \varphi(z_{B,j})$, $\forall i,j = 1, \ldots, N$.

The estimated model in dual space becomes

$$ \hat{y}_t = \beta^T z_{A,t} + \sum_{i=1}^{N} \alpha_i K(z_{B,i}, z_{B,t}) + b. \qquad (16) $$
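The dual problem (15) is a bordered linear system in $(\beta, b, \alpha)$. A minimal sketch of how it could be assembled and solved, again assuming the rbf_kernel helper defined earlier (illustrative only, not the authors' implementation):

```python
import numpy as np

def pl_narx_fit(ZA, ZB, y, sigma, gamma):
    """Assemble and solve the PL-NARX dual system (15) for beta, b, alpha."""
    N, Na = ZA.shape
    Omega = rbf_kernel(ZB, ZB, sigma)             # kernel on the nonlinear regressors
    ones = np.ones((N, 1))
    top = np.hstack([np.zeros((Na, Na)), np.zeros((Na, 1)), ZA.T])
    mid = np.hstack([np.zeros((1, Na)),  np.zeros((1, 1)),  ones.T])
    bot = np.hstack([ZA, ones, Omega + np.eye(N) / gamma])
    A = np.vstack([top, mid, bot])
    rhs = np.concatenate([np.zeros(Na + 1), y])
    sol = np.linalg.solve(A, rhs)
    beta, b, alpha = sol[:Na], sol[Na], sol[Na + 1:]
    return beta, b, alpha
```

Predictions then follow (16): the linear part contributes $z_{A,t}^T \beta$ and the kernel part contributes the weighted kernel evaluations against the training points plus the bias.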

3.4 PL-AR-NARX Model: Combining it all

Consider now the PL-AR-NARX model, with $g(\cdot)$ parameterized as in (5). With the inclusion of a noise model, the following regularized optimization problem is formulated:

$$ \min_{w,b,r_i,\beta} \ \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=\tau+1}^{N} r_i^2 \qquad (17) $$
$$ \text{s.t.} \quad \begin{cases} y_i = \beta^T z_{A,i} + w^T \varphi(z_{B,i}) + b + e_i, & i = \tau+1, \ldots, N, \\ e_i = \rho e_{i-\tau} + r_i, & i = \tau+1, \ldots, N, \end{cases} $$

where $\gamma$ is a regularization constant and the noise model coefficient $\rho$ is a given tuning parameter satisfying $|\rho| < 1$. By eliminating $e_i$, the following problem is formulated:

$$ \min_{w,b,r_i,\beta} \ \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=\tau+1}^{N} r_i^2 \qquad (18) $$
$$ \text{s.t.} \quad y_i = \rho y_{i-\tau} + w^T \varphi(z_{B,i}) - \rho w^T \varphi(z_{B,i-\tau}) + b(1-\rho) + \beta^T (z_{A,i} - \rho z_{A,i-\tau}) + r_i, \quad i = \tau+1, \ldots, N. $$

The solution is formalized in the following lemma.

Lemma 4 Given the problem (18) and a positive definite kernel function $K: \mathbb{R}^{N_b} \times \mathbb{R}^{N_b} \to \mathbb{R}$, the solution to (18) is given by the dual problem

$$ \begin{bmatrix} 0_{N_a \times N_a} & 0_{N_a \times 1} & \tilde{Z}^T \\ 0_{1 \times N_a} & 0 & \mathbf{1}^T \\ \tilde{Z} & \mathbf{1} & \Omega^{(\rho)} + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} \beta \\ b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0_{N_a \times 1} \\ 0 \\ \tilde{y} \end{bmatrix}, \qquad (19) $$

where $\tilde{Z} = [z_{A,1+\tau}^T - \rho z_{A,1}^T; \cdots; z_{A,N}^T - \rho z_{A,N-\tau}^T] \in \mathbb{R}^{(N-\tau) \times N_a}$ is the matrix of linear regressors, $\tilde{y} = [y_{1+\tau} - \rho y_1, \ldots, y_N - \rho y_{N-\tau}]^T$, $\alpha = [\alpha_1, \ldots, \alpha_{N-\tau}]^T$, and $\Omega^{(\rho)}$ is the kernel matrix with entries $\Omega^{(\rho)}_{ij} = K(z_{B,i+\tau}, z_{B,j+\tau}) - \rho K(z_{B,i}, z_{B,j+\tau}) - \rho K(z_{B,i+\tau}, z_{B,j}) + \rho^2 K(z_{B,i}, z_{B,j})$, $\forall i,j = 1, \ldots, N-\tau$.

Proof: Consider the Lagrangian of problem (18),

$$ \mathcal{L}(w, b, r, \beta; \alpha) = \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=\tau+1}^{N} r_i^2 - \sum_{i=\tau+1}^{N} \alpha_{i-\tau} \left[ \beta^T (z_{A,i} - \rho z_{A,i-\tau}) + w^T \varphi(z_{B,i}) - \rho w^T \varphi(z_{B,i-\tau}) + b(1-\rho) + \rho y_{i-\tau} - y_i + r_i \right], $$

where $\alpha_j \in \mathbb{R}$, $j = 1, \ldots, N-\tau$, are the Lagrange multipliers. Taking the optimality conditions $\partial \mathcal{L}/\partial w = 0$, $\partial \mathcal{L}/\partial b = 0$, $\partial \mathcal{L}/\partial r_i = 0$, $\partial \mathcal{L}/\partial \alpha_j = 0$, $\partial \mathcal{L}/\partial \beta = 0$ yields

$$ w = \sum_{i=\tau+1}^{N} \alpha_{i-\tau} [\varphi(z_{B,i}) - \rho \varphi(z_{B,i-\tau})], \qquad r_i = \alpha_{i-\tau}/\gamma, \quad i = \tau+1, \ldots, N, $$
$$ 0 = \sum_{k=1}^{N-\tau} \alpha_k, \qquad 0 = \sum_{i=\tau+1}^{N} \alpha_{i-\tau} (z_{A,i} - \rho z_{A,i-\tau}), $$
$$ y_i = \rho y_{i-\tau} + w^T \varphi(z_{B,i}) - \rho w^T \varphi(z_{B,i-\tau}) + b(1-\rho) + \beta^T (z_{A,i} - \rho z_{A,i-\tau}) + r_i. \qquad (20) $$

With application of Mercer's theorem [4], $\varphi(z_{B,i})^T \varphi(z_{B,j}) = K(z_{B,i}, z_{B,j})$ with a positive definite kernel $K$, we can eliminate $w$ and $r_i$, obtaining

$$ y_i - \rho y_{i-\tau} = \sum_{k=\tau+1}^{N} \alpha_{k-\tau} \left[ K(z_{B,i}, z_{B,k}) - \rho K(z_{B,i-\tau}, z_{B,k}) - \rho K(z_{B,i}, z_{B,k-\tau}) + \rho^2 K(z_{B,i-\tau}, z_{B,k-\tau}) \right] + b(1-\rho) + \beta^T (z_{A,i} - \rho z_{A,i-\tau}) + \frac{\alpha_{i-\tau}}{\gamma}. \qquad (21) $$

Building the kernel matrix $\Omega^{(\rho)}_{ij}$ and writing the equations in matrix notation gives the final system (19). □

The estimated model in dual space becomes

$$ \hat{y}_t = \rho y_{t-\tau} + h(z_{B,t}) - \rho h(z_{B,t-\tau}) + \beta^T (z_{A,t} - \rho z_{A,t-\tau}), \qquad (22) $$

where $h(z_{B,t})$ is given by

$$ h(z_{B,t}) = \sum_{i=\tau+1}^{N} \alpha_{i-\tau} [K(z_{B,i}, z_{B,t}) - \rho K(z_{B,i-\tau}, z_{B,t})] + b. \qquad (23) $$

A summary of the different nonlinear model structures and representations is given in Table 1.

4 Implementation using Fixed-Size LS-SVM

All the models of the previous section are given in terms of the dual solution, which requires solving a linear system of size $(N - \tau + N_a) \times (N - \tau + N_a)$ (for a NARX model, $\tau = 0$, $N_a = 0$). This system is obtained with the application of Mercer's theorem, without having to compute the nonlinear mapping $\varphi$ explicitly. However, for large sample sizes this may become impractical. For such a case, it is possible to obtain a finite dimensional approximation $\hat{\varphi}(\cdot)$ of the nonlinear mapping from any kernel matrix $\Omega^{eq}$ using the Nyström technique, as studied in the context of Gaussian Processes in [28].

4.1 Nonlinear Approximation in Primal Space

For clarity of presentation, consider the generic kernel matrix $\Omega^{eq}$ with entries $\Omega^{eq}_{ij} = K^{eq}_{model}(x_i, x_j)$, where the kernel function $K^{eq}_{model}$ is defined for each of the model structures as listed in Table 2, and where the vector $x$ represents the regression vector $z$ or $z_B$, depending on the model being estimated. An approximation of the nonlinear mapping can then be obtained by means of an eigenvalue decomposition of the kernel matrix $\Omega^{eq}$.

Table 1
Summary of Nonlinear Model Structures and Representations by LS-SVMs.

NARX Model
  Primal: $\hat{y}_t = w^T \varphi(z_t) + b$
  Dual:   $\hat{y}_t = \sum_{i=1}^{N} \alpha_i K(z_i, z_t) + b$

AR-NARX Model
  Primal: $\hat{y}_t = \rho y_{t-\tau} + w^T \varphi(z_t) - \rho w^T \varphi(z_{t-\tau}) + (1-\rho) b$
  Dual:   $\hat{y}_t = \rho y_{t-\tau} + h(z_t) - \rho h(z_{t-\tau})$,
          with $h(z_t) = \sum_{i=\tau+1}^{N} \alpha_{i-\tau} [K(z_i, z_t) - \rho K(z_{i-\tau}, z_t)] + b$

PL-NARX Model
  Primal: $\hat{y}_t = \beta^T z_{A,t} + w^T \varphi(z_{B,t}) + b$
  Dual:   $\hat{y}_t = \beta^T z_{A,t} + \sum_{i=1}^{N} \alpha_i K(z_{B,i}, z_{B,t}) + b$

PL-AR-NARX Model
  Primal: $\hat{y}_t = \rho y_{t-\tau} + w^T \varphi(z_{B,t}) - \rho w^T \varphi(z_{B,t-\tau}) + b(1-\rho) + \beta^T (z_{A,t} - \rho z_{A,t-\tau})$
  Dual:   $\hat{y}_t = \rho y_{t-\tau} + h(z_{B,t}) - \rho h(z_{B,t-\tau}) + \beta^T (z_{A,t} - \rho z_{A,t-\tau})$,
          with $h(z_{B,t}) = \sum_{i=\tau+1}^{N} \alpha_{i-\tau} [K(z_{B,i}, z_{B,t}) - \rho K(z_{B,i-\tau}, z_{B,t})] + b$

Table 2
Kernel function $K^{eq}_{model}$ for the different model structures.

  $K^{eq}_{narx}(z_i, z_j) = K(z_i, z_j)$
  $K^{eq}_{ar-narx}(z_i, z_j) = K(z_{i+\tau}, z_{j+\tau}) - \rho K(z_i, z_{j+\tau}) - \rho K(z_{i+\tau}, z_j) + \rho^2 K(z_i, z_j)$
  $K^{eq}_{pl-narx}(z_i, z_j) = K(z_{B,i}, z_{B,j})$
  $K^{eq}_{pl-ar-narx}(z_i, z_j) = K(z_{B,i+\tau}, z_{B,j+\tau}) - \rho K(z_{B,i}, z_{B,j+\tau}) - \rho K(z_{B,i+\tau}, z_{B,j}) + \rho^2 K(z_{B,i}, z_{B,j})$

Given the integral equation

$$ \int K^{eq}_{model}(x, x') \, \phi_i(x) \, p(x) \, dx = \lambda_i \phi_i(x'), \qquad (24) $$

with solutions $\lambda_i$ and $\phi_i$ for a variable $x$ with probability density $p(x)$, we can write

$$ \varphi = [\sqrt{\lambda_1}\,\phi_1, \sqrt{\lambda_2}\,\phi_2, \ldots, \sqrt{\lambda_{N_h}}\,\phi_{N_h}]^T. \qquad (25) $$

Given the dataset $\{x_i, y_i\}_{i=1}^{N}$, where the vector $x_i$ represents any of the regression vectors $z_i$ or $z_{B,i}$ depending on the model being estimated, it is possible to approximate the integral by a sample average. This leads to the eigenvalue problem (Nyström approximation [28])

$$ \frac{1}{N} \sum_{k=1}^{N} K^{eq}_{model}(x_k, x_j) \, u_i(x_k) = \lambda_i^{(s)} u_i(x_j), \qquad (26) $$

where the sample eigenvalues $\lambda_i^{(s)}$ and eigenvectors $u_i$ of the kernel matrix $\Omega^{eq}$ can be used as an approximation of the eigenvalues $\lambda_i$ and eigenfunctions $\phi_i$ of the continuous problem (24), as

$$ \hat{\lambda}_i = \frac{1}{N} \lambda_i^{(s)}, \qquad \hat{\phi}_i = \sqrt{N} \, u_i. \qquad (27) $$

Based on this approximation, it is possible to compute the eigendecomposition of the kernel matrix $\Omega^{eq}$ and use its eigenvalues and eigenvectors to compute the $i$th required component of $\hat{\varphi}(x^{(v)})$ for any point $x^{(v)}$ by means of

$$ \hat{\varphi}_i(x^{(v)}) = \frac{1}{\sqrt{\lambda_i^{(s)}}} \sum_{k=1}^{N} u_{ki} \, K^{eq}_{model}(x_k, x^{(v)}), \qquad (28) $$

leading to the $M$-dimensional approximation

$$ \hat{\varphi}(x) = [\hat{\varphi}_1(x), \hat{\varphi}_2(x), \ldots, \hat{\varphi}_M(x)]^T. \qquad (29) $$

This finite dimensional approximation $\hat{\varphi}(x)$ can be used in the primal problem (6) to estimate $w$ and $b$.
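A minimal sketch of the Nyström approximation (28)-(29) under the same NumPy conventions used above: eigendecompose the kernel matrix on the selected subsample and map any point into the $M$-dimensional approximate feature space. Variable names are illustrative, not from the paper.

```python
import numpy as np

def nystrom_features(X_sub, kernel):
    """Return a function mapping points to the approximate feature map (28)-(29),
    built from the eigendecomposition of the M x M kernel matrix on the subsample."""
    Omega = kernel(X_sub, X_sub)                 # M x M kernel matrix
    lam_s, U = np.linalg.eigh(Omega)             # sample eigenvalues / eigenvectors
    keep = lam_s > 1e-12                         # drop numerically zero eigenvalues
    lam_s, U = lam_s[keep], U[:, keep]

    def phi_hat(X_new):
        # phi_hat_i(x) = (1 / sqrt(lambda_i^(s))) * sum_k u_{ki} K(x_k, x)
        return kernel(X_new, X_sub) @ U / np.sqrt(lam_s)

    return phi_hat
```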

4.2 Sparse Representation and Large Scale Problems

It is important to emphasize that the use of the entire training sample of size $N$ to compute the approximation of $\varphi$ will yield at most $N$ components, each one of which can be computed by (27) for all $x \in \{x_i\}_{i=1}^{N}$. However, for a large scale problem it has been motivated [28] to use a subsample of $M \ll N$ datapoints to compute $\hat{\varphi}$. In this case, up to $M$ components can be computed. It has been further proposed [2] to use an external criterion such as entropy maximization for an optimal selection of the subsample: given a fixed size $M$, the aim is to select the support vectors that maximize the quadratic Renyi entropy [29]

$$ H_R = -\log \int p(x)^2 dx, \qquad (30) $$

which can be approximated by using $\int \hat{p}(x)^2 dx = \frac{1}{M^2} \mathbf{1}^T \Omega^{eq} \mathbf{1}$. The use of this active selection procedure can be important for large scale problems, as it is related to the underlying density distribution of the sample. In this sense, the optimality of this selection is related to the final accuracy that can be obtained in the modeling exercise.

4.3 Selection of Support Vectors in the presence of correlated noise

For the uncorrelated case, building the small kernel matrix only requires evaluation of the kernel function at the selected points. However, the kernel matrices for the AR-NARX and PL-AR-NARX structures require evaluations of $K^{eq}_{ar-narx}$ and $K^{eq}_{pl-ar-narx}$ on past values of the selected points. This means that the selection of support vectors in this case also requires knowledge of the points located $\tau$ timesteps in the past.

4.4 Final Estimation in Primal Space

From a kernel matrix $\Omega^{eq}$ evaluated over a data subsample of fixed size $M$, the approximation $\hat{\varphi}$ is obtained [2] using (28) for each of the components. Let us denote by $\hat{\varphi}_{model}(z_t)$ the finite dimensional approximation of the feature map for a datapoint $z_t$ for a given model structure. The model can now be estimated in primal space directly by using ridge regression techniques with regularization parameter $\gamma$. That is, the problem can be solved by minimizing a regularized least squares cost function as follows (a code sketch of this primal estimation step is given after the list).

• For the NARX model in primal space, the solutions $w, b$ are obtained from

$$ \min_{w,b,e_i} \ \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=1}^{N} e_i^2 \qquad (31) $$
$$ \text{s.t.} \quad y_i = w^T \hat{\varphi}_{narx}(z_i) + b + e_i, \quad i = 1, \ldots, N. $$

With the explicit expression for $\hat{\varphi}_{narx}(z_i)$, the model is solved in primal space by eliminating $e_i$ from (31),

$$ \min_{w,b} \ \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=1}^{N} \left( y_i - w^T \hat{\varphi}_{narx}(z_i) - b \right)^2. \qquad (32) $$

• For the AR-NARX model, the solutions $w, b$ are obtained from

$$ \min_{w,b,r_i} \ \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=\tau+1}^{N} r_i^2 \qquad (33) $$
$$ \text{s.t.} \quad y_i = \rho y_{i-\tau} + w^T \hat{\varphi}_{ar-narx}(z_i) + b + r_i, \quad i = \tau+1, \ldots, N, $$

where the autocorrelation structure for the nonlinear function is embedded into the evaluation of $\hat{\varphi}_{ar-narx}(z_i)$ computed from the kernel matrix $\Omega^{eq}$. With the explicit expression for $\hat{\varphi}_{ar-narx}(z_i)$, the model is solved in primal space by eliminating $r_i$ from (33),

$$ \min_{w,b} \ \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=\tau+1}^{N} \left( y_i - \rho y_{i-\tau} - w^T \hat{\varphi}_{ar-narx}(z_i) - b \right)^2. \qquad (34) $$

• For the PL-NARX model, the solutions $\beta, w, b$ are obtained from

$$ \min_{w,b,e_i,\beta} \ \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=1}^{N} e_i^2 \qquad (35) $$
$$ \text{s.t.} \quad y_i = \beta^T z_{A,i} + w^T \hat{\varphi}_{pl-narx}(z_{B,i}) + b + e_i, \quad i = 1, \ldots, N. $$

With the explicit expression for $\hat{\varphi}_{pl-narx}(z_{B,i})$, the model is solved in primal space by eliminating $e_i$ from (35),

$$ \min_{w,b,\beta} \ \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=1}^{N} \left( y_i - \beta^T z_{A,i} - w^T \hat{\varphi}_{pl-narx}(z_{B,i}) - b \right)^2. \qquad (36) $$

• Finally, for the PL-AR-NARX model structure, the solutions $\beta, w, b$ are obtained from

$$ \min_{w,b,r_i,\beta} \ \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=\tau+1}^{N} r_i^2 \qquad (37) $$
$$ \text{s.t.} \quad y_i = \rho y_{i-\tau} + \beta^T (z_{A,i} - \rho z_{A,i-\tau}) + w^T \hat{\varphi}_{pl-ar-narx}(z_{B,i}) + b + r_i, \quad i = \tau+1, \ldots, N. $$

With the explicit expression for $\hat{\varphi}_{pl-ar-narx}(z_{B,i})$, the model is solved in primal space by eliminating $r_i$ from (37),

$$ \min_{w,b,\beta} \ \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=\tau+1}^{N} \left( y_i - \rho y_{i-\tau} - \beta^T (z_{A,i} - \rho z_{A,i-\tau}) - w^T \hat{\varphi}_{pl-ar-narx}(z_{B,i}) - b \right)^2. \qquad (38) $$
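As a sketch of the primal estimation step (written for the NARX problem (32); the PL variants simply append the linear regressors $z_A$, and the AR variants use the $\rho$-filtered targets), one can stack the approximate feature map with a constant column and solve a ridge-regression normal system. Illustrative NumPy only, assuming the nystrom_features helper sketched earlier.

```python
import numpy as np

def fit_primal_narx(Z, y, phi_hat, gamma):
    """Minimize (1/2) w'w + (gamma/2) sum_i (y_i - w' phi_hat(z_i) - b)^2  -- cf. (32)."""
    Phi = phi_hat(Z)                                   # N x M approximate feature map
    X = np.hstack([Phi, np.ones((len(y), 1))])         # append bias column
    reg = np.eye(X.shape[1]) / gamma                   # regularized normal equations
    reg[-1, -1] = 0.0                                  # bias b is left unpenalized
    theta = np.linalg.solve(X.T @ X + reg, X.T @ y)
    return theta[:-1], theta[-1]                       # w, b

def predict_primal_narx(Z_new, phi_hat, w, b):
    return phi_hat(Z_new) @ w + b
```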

5 Case Study: Short-Term Load Forecasting

In this section the practical application is described, in terms of the problem context, methodological issues and results.


5.1 Description and Objective

Short-Term Load Forecasting (STLF) is an important area of quantitative research. In order to deal with the everyday process of planning, scheduling and unit commitment [30], the need for accurate short-term forecasts has led to the development of a wide range of models based on different techniques. In recent literature, some interesting examples are related to traditional time series analysis [31–33] and neural network applications [34–38]. The main goal in load forecasting is to estimate a model that can capture the dynamics between the load and its possible explanatory variables, in order to obtain forecasts from one hour ahead up to several days ahead [30]. In the literature there is a broad consensus concerning important explanatory variables: past values of the load, weather information, calendar information, etc. It has also been shown in the literature that it is possible to incorporate a priori information concerning the seasonalities at several levels (daily, weekly, yearly, etc.) into linear models by appropriately choosing the model structure and the estimation method. Examples are stochastic seasonality (Box-Jenkins seasonal ARIMA models [39]), seasonal unit roots [16], nonparametric models with seasonal components [40], or periodic autoregressions [41,17]. Our objective is to show an exploratory data modeling methodology, where prior knowledge about the problem can be incorporated at different levels, obtaining a nonlinear model structure based on LS-SVMs which is tailored to the STLF application.

5.2 Data Description

The data consists of a 3.5-year series of hourly load values from a particular HV-LV substation in Belgium, thus containing more than 26,000 datapoints. A first linear regression containing only a linear trend is estimated for this time series, to remove any growth trend present in the sample. Finally, the series is normalized using the mean and standard deviation of the observed values, in order to scale it to zero mean and unit variance. We define our training set to consist of 18,000 datapoints, and we validate the final models using different out of sample test sets of 30 days each. Figure 1 shows an example of the load series within a week, at hourly values starting at 00:00 on Monday until 24:00 on Sunday. The daily shapes, the daily peaks and the weekly cycle are clearly visible.

5.3 Model Structures and Estimation

The model structures described previously are applied to the STLF problem with the available data. The general regression vector $z_t$ contains an autoregressive part of 48 lagged load values $y_{t-1}, \ldots, y_{t-48}$ (the past 2 days), and exogenous inputs $u_t$ containing the temperature-related variables (measuring the effect of temperature on cooling and heating requirements [25]) and calendar information in the form of dummy variables for month of the year, day of the week and hour of the day [17]. This leads to a set of 48+3+12+7+24 = 94 explanatory variables. Most of the prior information about the seasonal properties of the load is incorporated into the modeling stage in the form of the mentioned dummy variables for calendar information. In addition, it is known that the load series shows a strong correlation every 24 hours; therefore we use $\tau = 24$ for the noise correlation description in the AR-NARX and PL-AR-NARX models.

Fig. 1. Example of the load series within a week. Daily cycles are visible, as well as the weekend effects. Also visible are the intra-day phenomena, such as the peaks (morning, noon, evening) and the night hours.

The procedure for model estimation using the large-scale fixed-size LS-SVM methodology can be described in the following steps:

(1) Normalize inputs and outputs to have zero mean and unit variance.
(2) Select an initial subsample of size $M$.
(3) Choose a model structure (NARX, AR-NARX, PL-NARX, PL-AR-NARX).
(4) Build the $(M \times M)$ kernel matrix for the model structure and compute its eigendecomposition.
(5) Build the nonlinear feature map approximation for the rest of the data.
(6) Estimate the model in primal space according to the primal representation from Table 1.
(7) Build the nonlinear feature map approximation for the test data.
(8) Use the model estimates with the test data nonlinear mapping to obtain the out of sample forecasts.

5.4 Performance Assessment

The performance of the predictions obtained from any given model can be assessed by using different indicators. Denoting by $y_i$ and $\hat{y}_i$ the observed values and the predictions, respectively, in an evaluation sample $S$ of size $N_s$, the Mean Squared Error (MSE) is defined as

$$ \text{MSE} = \frac{1}{N_s} \sum_{i \in S} (y_i - \hat{y}_i)^2, \qquad (39) $$

and the Mean Absolute Percentage Error (MAPE) is defined as

$$ \text{MAPE} = \frac{1}{N_s} \sum_{i \in S} \frac{|y_i - \hat{y}_i|}{y_i}. \qquad (40) $$

The MSE is typically used within the general context of applied modeling. In this paper the MAPE is also used because it is common practice in the particular context of STLF. For this reason, in this case study the performance of the models is assessed using both indicators.
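The two indicators written out in code for clarity (a minimal NumPy sketch; for the MAPE-CV criterion over daily peaks used later, the index set $S$ is simply restricted to the peak hour of each day):

```python
import numpy as np

def mse(y, y_hat):
    """Mean Squared Error, eq. (39)."""
    return np.mean((y - y_hat) ** 2)

def mape(y, y_hat):
    """Mean Absolute Percentage Error, eq. (40), averaged over the evaluation sample."""
    return np.mean(np.abs(y - y_hat) / y)
```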

5.5 Hyperparameters Selection

The tuning of the hyperparameters $\sigma$, $\gamma$ and $\rho$ is performed by 10-fold cross-validation (CV) on the training sample. For a given kernel function (built with $\sigma$ and $\rho$) and a given regularization parameter $\gamma$, the training sample is divided into 10 parts, 9 of which are used for model estimation. The performance of the estimated model is assessed on the remaining part (which is used as a validation set). By repeating the process over the 10 parts, the cross-validation performance of the model is the average of the 10 individual performances. The whole process is repeated for different hyperparameters.

By using cross-validation, it is also possible to incorporate another element of prior knowledge into the modeling. Typically, for the training of a model using CV, the Mean Squared Error (MSE) is used as a performance measure, and therefore the hyperparameters to be used in the out of sample predictions are those that minimize the cross-validation MSE (MSE-CV). However, the predictions for the daily peak load are critical: an error in the prediction at the peak hour has a larger impact on planning and unit commitment scheduling. For this reason, we also implement the tuning in such a way that the selected parameters minimize the Mean Absolute Percentage Error under cross-validation (MAPE-CV), computed only for the hours of the daily peak loads. The set $S$ in (40) thus only contains the indices of the hours where the peak of each day occurs. Under this scheme, we look for the hyperparameters that are able to produce accurate forecasts for the daily peaks. For any model structure, the hyperparameters can then be tuned using both MSE-CV and MAPE-CV.

The size $M$ of the subsample from which the feature map approximation is computed can affect the performance of the final model. For this paper we use $M = 1000$, which accounts for 3% of the available training dataset. This size can produce very good results for this problem without requiring too much computational time. For an illustration of the effect of increasing sizes of $M$, the interested reader is referred to [18,19].

5.6 Final Assessment on Test Data

Once the hyperparameters are selected, the final models are used to compute predictions for the test sample (which is never part of the training set), where the performance is measured by both the MSE and the MAPE. The test sample starts after the last datapoint of the training sample. Usually models for short-term load forecasting are used for prediction during a certain number of days, after which the models are re-estimated with the newly available information. Typically a model might be re-estimated after no more than one week, in some cases within a few days; rarely will a model remain for more than 7 days without re-estimation. In this context, for the purposes of evaluating the models in this study, 50 different non-overlapping weeks are used to assess the performance of the models. Denote by $d$ the last day contained in the training dataset. The first test period is the week occurring immediately after the training data, covering from $d+1$ until $d+7$. The second test week begins on $d+8$ and finishes on $d+14$, the third period runs from $d+15$ until $d+21$, and so on, until the last testing period, which begins at $d+344$ and ends on $d+350$. The models' performance is assessed by using the MSE on each of the test periods, after which the mean and standard deviation are reported for each case.

6 Results

The results of the model selection methodology, the identified model structures and the final prediction results are described in this section.


6.1 (N)-ARX Models

First, a linear ARX model (3) is estimated with Ordinary Least Squares using $z_t$ with the input variables defined above. The mean performance of the linear model on the test sets is 0.047 (MSE) and 0.050 (MAPE). This model is simple to estimate, and it has a high degree of interpretability in the sense that each one of the explanatory variables obtains an estimated coefficient which reflects its importance for predicting the load. However, as is typically the case, the linear functional form neglects nonlinear effects which may be relevant. A second model is estimated, this time a full black-box NARX model ('NARX' in Table 1). The hyperparameters for this second model are tuned by minimizing the MSE-CV or the MAPE-CV. For the case of MSE-CV minimization, the mean test set results are 0.0201 (MSE) and 0.0301 (MAPE). For the case of MAPE-CV minimization, the results are 0.0202 (MSE) and 0.0310 (MAPE). It is already clear, by comparing the mean test set performances, that the NARX model shows a better performance in terms of prediction accuracy than the linear ARX model: the average MSE is reduced to less than half. This means that there are relevant nonlinear effects that are neglected by the linear structure but captured by the full black-box NARX model. However, it is interesting to check whether these effects are produced only by a subset of the input variables, or if they are due to the correlation structure. So far, it is clear that the performance of the NARX model does not substantially change when its hyperparameters are selected either by minimization of the MSE-CV or of the MAPE-CV. The quality of the ARX and NARX predictions for the first test set (immediately after the training set) is shown in Figures 2 and 3, respectively. It can be observed from the figures that the linear model generates poor predictions for the weekend days. The residuals for both models are shown in Figure 4. The difference between the ARX and NARX models is very significant.

6.2 AR-(N)ARX

The next step is to check if the NARX black-box model can be improved by imposing an autocorrelation with $\tau = 24$. The autocorrelation parameter $\rho$ is tuned by minimization of the MAPE-CV, obtaining an optimal $\rho^*$ close to $-0.4$, as shown in Figure 5. According to this result, a negative correlation in the residuals of the NARX model can be captured using the AR-NARX parameterization. The estimated AR-NARX model, however, does not show a significant improvement over the NARX model. The AR-NARX mean performance on the test sets is 0.020 (MSE) and 0.031 (MAPE).

Fig. 2. Performance of the linear model over the first test set. The observed load (full line) is compared to the prediction from the linear ARX model (thick line).

Fig. 3. Performance of the NARX model over the first test set. The observed load (full line) is compared to the prediction from the NARX model (thick line).

Fig. 4. Residuals on the first test set, for the NARX model (top panel) and the ARX model (bottom panel).

The linear ARX model is also extended with an autocorrelation process model with $\tau = 24$. The AR-ARX model is estimated using the Cochrane-Orcutt procedure for correlated residuals [42,43]. This is a classical econometric technique which iterates in order to obtain the optimal autocorrelation coefficient $\rho$, by exploiting the linear property of the model structure. This method estimates the correlation to be positive ($\hat{\rho} = 0.15$), and the mean performance on the test sets is 0.0365 (MSE) and 0.0451 (MAPE). This is an improvement over the previous ARX linear model without correlated residuals, but it is still above the levels reached by the nonlinear counterparts.

Fig. 5. MAPE-CV performance of the AR-NARX model for different values of ρ.

6.3 PL-NARX

Now we turn to the partially linear parameterizations. Several linear models have been proposed in the load forecasting literature [31,32] with good results; it is therefore interesting to check if a partially linear parameterization can improve over the linear ARX model in the direction of the full black-box NARX model. Since the seminal work on partially linear models applied to electricity prices [25], it has been common practice to separate the inputs which are past values of the load from those inputs which are calendar and temperature effects. In our notation, starting from the original regression vector $z_t$, the following partitions are defined to be used in the PL-NARX models:

• Use the past values of the load $y_{t-i}$, $i = 1, \ldots, 48$, as nonlinear regressors, and the exogenous inputs $u_t$ (calendar and temperature effects) as linear regressors. That is, $z_{A,t} = u_t$, $z_{B,t} = [y_{t-1}; y_{t-2}; \ldots; y_{t-48}]$. Using this partition, the PL-NARX model is estimated. Let us denote this model by 'PL-NARX-1'.

• Use the past values of the load $y_{t-i}$, $i = 1, \ldots, 48$, as linear regressors, and the exogenous inputs $u_t$ (calendar and temperature effects) as nonlinear regressors. In this case, $z_{A,t} = [y_{t-1}; y_{t-2}; \ldots; y_{t-48}]$, $z_{B,t} = u_t$. Let us denote by 'PL-NARX-2' the model estimated using this partition.

The performances of the models PL-NARX-1 and PL-NARX-2 differ from each other. When the tuning parameters are found by minimization of the MSE-CV, the mean performance of PL-NARX-1 is 0.0221 (MSE) and 0.0327 (MAPE) on the test sets. The mean performance of PL-NARX-2 is 0.0269 (MSE) and 0.0370 (MAPE). Under this tuning scheme, the model with linear terms on the past values of the load (PL-NARX-2) shows a lower performance than the model PL-NARX-1, which, in turn, shows a result closer to the NARX model.

Fig. 6. Improvement of the PL-NARX-2 model. The observed load (full line) is compared to the prediction from the PL-NARX-2 model tuned with MSE-CV (dashed) and the prediction from the PL-NARX-2 model tuned with MAPE-CV (thick).

However, when the tuning parameters are found by minimization of the MAPE-CV, model PL-NARX-2 improves considerably. Under this tuning scheme, the performance of PL-NARX-1 is 0.0221 (MSE) and 0.0326 (MAPE), whereas PL-NARX-2 improves to 0.0217 (MSE) and 0.0322 (MAPE). This last result is much closer to the performance of the AR-NARX model. For this model parameterization, this example shows that moving from the MSE-CV to the MAPE-CV for hyperparameter selection can improve the performance. This could be expected, given the shape of the load series to be forecasted: errors can occur at the peak hours of the day, so by selecting the parameters in such a way that the percentage errors for the daily peaks are minimized, the overall forecasting performance improves.

The improvement of the model PL-NARX-2 is shown in Figure 6, where the prediction obtained with the PL-NARX-2 model tuned with MSE-CV is compared to the prediction obtained with the PL-NARX-2 model tuned with MAPE-CV, shown for the case of the first test set. For an easier visualization, only part of the test set data is shown on the plot. By looking at the actual values (thin line) it can be observed that the model improves around the peak of each day, which is the purpose of using the MAPE performance criterion for hyperparameter tuning.

Fig. 7. MAPE-CV performance of the PL-NARX-2 model for different values of ρ.

6.4 PL-AR-NARX

Finally, it is still possible to check the effect of imposing an autocorrelation structure on both PL-NARX models. Using the partitions over the original regression vector defined above, two models are estimated, PL-AR-NARX-1 and PL-AR-NARX-2.

For the case of the PL-AR-NARX-1 model (which uses $z_{A,t} = u_t$, $z_{B,t} = [y_{t-1}; y_{t-2}; \ldots; y_{t-48}]$), there is a small improvement when testing different values of $\rho$. For this model, the optimal $\rho^*$ is found to be $-0.2$, with a mean performance on the test sets of 0.0221 (MSE) and 0.0326 (MAPE). There is no improvement over the PL-NARX-1 model.

For the case of the PL-AR-NARX-2 model (which uses $z_{A,t} = [y_{t-1}; y_{t-2}; \ldots; y_{t-48}]$, $z_{B,t} = u_t$), the optimal $\rho^*$ is found at $-0.5$, as shown in Figure 7. For this model, the mean performance on the test sets is now 0.0204 (MSE) and 0.0309 (MAPE). The PL-AR-NARX-2 model therefore improves over the PL-NARX-2 results, and the results obtained are comparable to those from the NARX model.

Table 3
Results on test sets. Mean and standard deviation (in brackets) of the performance of each model over the 50 different test sets.

                 Tuning with MSE-CV         Tuning with MAPE-CV
                 No Noise Model             No Noise Model             AR Noise Model
                 MSE        MAPE            MSE        MAPE            MSE        MAPE
  ARX            0.0470     0.0500          0.0470     0.0500          0.0365     0.0451
                 (0.0028)   (0.0008)        (0.0028)   (0.0008)        (0.0030)   (0.0008)
  NARX           0.0201     0.0307          0.0202     0.0310          0.0207     0.0314
                 (0.0012)   (0.0014)        (0.0011)   (0.0012)        (0.0012)   (0.0013)
  PL-NARX-1      0.0221     0.0327          0.0221     0.0326          0.0221     0.0326
                 (0.0012)   (0.0015)        (0.0013)   (0.0016)        (0.0016)   (0.0021)
  PL-NARX-2      0.0269     0.0370          0.0217     0.0322          0.0204     0.0309
                 (0.0013)   (0.0005)        (0.0017)   (0.0013)        (0.0016)   (0.0013)

6.5 Summary of Results

The results of the exploratory data modeling are summarized in Table 3. The mean over the 50 different test sets is reported, and the standard deviation is given in brackets. It is found that the black-box NARX model does not improve substantially when adding a noise structure. However, the partially linear structure PL-NARX-2, which is linear on the past values of the load and nonlinear on the calendar and temperature information, can improve substantially when an AR noise structure is included. The improvement is such that it can match the performance of the full black-box NARX model. The PL-NARX-1 models show a performance which is also comparable to the full black-box NARX model. The advantage of using PL-NARX parameterizations is that they provide a set of linear coefficients which are directly interpretable for the variables of interest. The results are also visualized in the boxplots depicted in Figure 8 for the MSE and MAPE performances.

Fig. 8. Boxplots of the different model performances over the 50 test sets for the MSE (top) and MAPE (bottom). The linear models (ARX and AR-ARX) are compared to the NARX, PL-NARX-1, and PL-NARX-2 models. For each of the nonlinear model types, the first box shows the performance of the model where the hyperparameters are tuned with MSE-CV; the second, when they are tuned with MAPE-CV; the third box shows the inclusion of the AR noise model for each structure.

7 Conclusion

In this paper we have used an exploratory modeling methodology for a real-life case study in short-term load forecasting. Based on the LS-SVM formulation, several model structures and parameterizations have been tested, from a linear ARX model, to a full black-box NARX model, to partially linear structures

with autocorrelated residuals. In all these cases, the solution of each model is characterized by a set of linear equations obtained from a convex problem. It is illustrated that a modeling strategy can benefit from prior knowledge of the problem under research. In this way, the obtained model is clearly more tailored towards the application, with better performance on the test sample. For the case study of short-term load forecasting, it has been shown how the modeling strategy can benefit from a tailored definition of the cross-validation criteria. By moving from the minimization of the MSE-CV to the minimization of the MAPE-CV over the daily peaks, and by including an autocorrelation with delay τ = 24, the results improve substantially for the partially linear structure. This structured model, which is linear on the past values of the load and nonlinear on the calendar and temperature information, shows a final performance on the test set which is comparable to the full black-box NARX model, and yet it retains a linear part which improves its interpretability with respect to the fully black-box model.

Acknowledgments. Work supported by grants and projects for the Research Council K.U.Leuven (GOA Ambiorics, several PhD/Postdocs & fellow grants), the Flemish Government (FWO: PhD/Postdocs grants, projects G.0211.05,G.0240.99, G.0407.02,G.0197.02,G.0141.03,G.0491.03,G.0120.03,G.0452.04, G.0499.04, ICCoS, ANMMM; AWI; IWT: PhD grants, GBOU (McKnow) Soft4s), the Belgian Federal Government (Belgian Federal Science Policy Office: IUAP V-22; PODO-II (CP/ 01/40), the EU (FP5- Quprodis; ERNSI, Eureka 2063- Impact; Eureka 2419- FLiTE) and Contracts Research / Agreements (ISMC /IPCOS, Data4s, TML, Elia, LMS, IPCOS, Mastercard). J. Suykens and B. De Moor are an associate professor and a full professor at the K.U.Leuven, Belgium, respectively. The scientific responsibility is assumed by its authors.


References

[1] L. Ljung, System Identification: Theory for the User, Prentice Hall, New Jersey, 1987.

[2] J. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, J. Vandewalle, Least Squares Support Vector Machines, World Scientific, Singapore, 2002.

[3] T. Poggio, F. Girosi, Networks for approximation and learning, Proceedings of the IEEE 78 (1990) 1481–1497.

[4] V. Vapnik, Statistical Learning Theory, Wiley, New-York, 1998.

[5] C. Williams, Prediction with Gaussian processes: from linear regression to linear prediction and beyond, in: M. Jordan (Ed.), Learning and Inference in Graphical Models, The MIT Press, Cambridge, MA, 1999, pp. 599–621.

[6] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, 2000.

[7] D. MacKay, Bayesian interpolation, Neural Computation 4 (1992) 415–447.

[8] G. Wahba, Support vector machines, reproducing kernel Hilbert spaces, and the randomized GACV, chapter 6, in: B. Schölkopf, C. Burges, A. Smola (Eds.), Advances in Kernel Methods - Support Vector Learning, MIT Press, 1998, pp. 69–87.

[9] C. Williams, Prediction with Gaussian processes: from linear regression to linear prediction and beyond, in: M. Jordan (Ed.), Learning and Inference in Graphical Models, Kluwer Academic Press, 1998, pp. 599–621.

[10] D. MacKay, Introduction to Gaussian processes, in: C. Bishop (Ed.), Neural Networks and Machine Learning, Vol. 168 of Springer NATO-ASI Series F, 1998, pp. 133–165.

[11] J. Suykens, T. Van Gestel, J. Vandewalle, B. De Moor, A support vector machine formulation to PCA analysis and its kernel version, IEEE Transactions on Neural Networks 14 (2) (2003) 447–450.

[12] J. Suykens, J. Vandewalle, Least squares support vector machine classifiers, Neural Processing Letters 9 (1999) 293–300.

[13] T. Van Gestel, J. Suykens, G. Lanckriet, A. Lambrechts, B. De Moor, J. Vandewalle, A Bayesian framework for least squares support vector machine classifiers, Gaussian processes and kernel Fisher discriminant analysis, Neural Computation 14 (2002) 1115–1147.

[14] T. Van Gestel, J. Suykens, J. De Brabanter, B. De Moor, J. Vandewalle, Kernel canonical correlation analysis and least squares support vector machines, in: G. Dorffner, H. Bischof, K. Hornik (Eds.), Proceedings International Conference on Artificial Neural Networks (ICANN 2001), Vol. 2130 of Lecture Notes in Computer Science, Springer, Vienna, Austria, 2001, pp. 381–386.


[15] I. Goethals, K. Pelckmans, J. Suykens, B. De Moor, Subspace identification of Hammerstein systems using least squares support vector machines, IEEE Transactions on Automatic Control, Special Issue: Linear vs. Nonlinear 50 (10) (2005) 1509–1519.

[16] S. Hylleberg, Modelling Seasonality, Oxford University Press, 1992.

[17] M. Espinoza, C. Joye, R. Belmans, B. De Moor, Short term load forecasting, profile identification and customer segmentation: A methodology based on periodic time series, IEEE Transactions on Power Systems 20 (3) (2005) 1622– 1630.

[18] M. Espinoza, J. Suykens, B. De Moor, Load forecasting using fixed-size least squares support vector machines, in: J. Cabestany, A. Prieto, F. Sandoval (Eds.), Proceedings of the 8th International Work-Conference on Artificial Neural Networks, Vol. 3512 of Lecture Notes in Computer Science, Springer-Verlag, 2005, pp. 1018–1026.

[19] M. Espinoza, J. Suykens, B. De Moor, Fixed-size least squares support vector machines : A large scale application in electrical load forecasting, Computational Management Science (Special Issue on Support Vector Machines) 3 (2) (2006) 113–129.

[20] J. Sjöberg, Q. Zhang, L. Ljung, A. Benveniste, B. Deylon, P. Glorennec, H. Hjalmarsson, A. Juditsky, Nonlinear black-box modelling in system identification: a unified overview, Automatica 31 (1995) 1691–1724.

[21] A. Juditsky, H. Hjalmarsson, A. Benveniste, B. Deylon, L. Ljung, J. Sjöberg, Q. Zhang, Nonlinear black-box modelling in system identification: mathematical foundations, Automatica 31 (1995) 1725–1750.

[22] R. Guidorzi, Multivariable System Identification: From Observations to Models, Bononia University Press, 2003.

[23] M. Espinoza, J. Suykens, B. De Moor, LS-SVM regression with autocorrelated errors, in: Proc. of the 14th IFAC Symposium on System Identification (SYSID), 2006, pp. 582–587.

[24] J. Hamilton, Time Series Analysis, Princeton University Press, 1994.

[25] R. Engle, C. Granger, J. Rice, A. Weiss, Semiparametric estimates of the relation between weather and electricity sales, Journal of the American Statistical Association 81 (394) (1986) 310–320.

[26] M. Espinoza, J. Suykens, B. De Moor, Kernel based partially linear models and nonlinear identification, IEEE Transactions on Automatic Control, Special Issue: Linear vs. Nonlinear 50 (10) (2005) 1602–1606.

[27] C. Saunders, A. Gammerman, V. Vovk, Ridge regression learning algorithm in dual variables, in: Proceedings 15th International Conference on Machine Learning (ICML’98), Morgan Kaufmann, San Francisco, California, 1998, pp. 515–521.


[28] C. Williams, M. Seeger, Using the Nyström method to speed up kernel machines, in: T. Leen, T. Dietterich, V. Tresp (Eds.), Proc. NIPS 2000, Vol. 13, MIT Press, 2000, pp. 682–688.

[29] M. Girolami, Orthogonal series density estimation and the kernel eigenvalue problem, Neural Computation 10 (6) (1998) 1455–1480.

[30] E. Mariani, S. Murthy, Advanced Load Dispatch for Power Systems, Advances in Industrial Control, Springer-Verlag, 1997.

[31] N. Amjady, Short-term hourly load forecasting using time-series modeling with peak load estimation capability, IEEE Transactions on Power Systems 16 (4) (2001) 798–805.

[32] R. Ramanathan, R. Engle, C. Granger, F. Vahid-Araghi, C. Brace, Short-run forecasts of electricity loads and peaks, International Journal of Forecasting (1997) 161–174.

[33] S.-J. Huang, K.-R. Shih, Short term load forecasting via ARMA model identification including non-Gaussian process considerations, IEEE Transactions on Power Systems 18 (2) (2003) 673–679.

[34] H. Steinherz, C. Pedreira, R. Castro, Neural networks for short-term load forecasting: A review and evaluation, IEEE Transactions on Power Systems 16 (1) (2001) 44–55.

[35] A. Khotanzad, R. Afkhami-Rohani, D. Maratukulam, ANNSTLF-artificial neural network short-term load forecaster-generation three, IEEE Transactions on Power Systems 13 (4) (1998) 1413–1422.

[36] K.-H. Kim, H.-S. Youn, Y.-C. Kang, Short-term load forecasting for special days in anomalous load conditions using neural networks and fuzzy inference method, IEEE Transactions on Power Systems 15 (2) (2000) 559–565.

[37] L. Mohan Saini, M. Kumar Soni, Artificial neural network-based peak load forecasting using conjugate gradient methods, IEEE Transactions on Power Systems 17 (3) (2002) 907–912.

[38] D. Fay, J. Ringwood, M. Condon, M. Kelly, 24-h electrical load data-a sequential or partitioned time series?, Neurocomputing 55 (2003) 469–498.

[39] G. Box, G. Jenkins, Time Series Analysis, Forecasting and Control, Holden-Day, San Francisco, 1970.

[40] L. Yang, R. Tschernig, Non- and semiparametric identification of seasonal nonlinear autoregression models, Econometric Theory 18 (2002) 1408–1448.

[41] P. Franses, R. Paap, Periodic Time Series Models, Oxford University Press, 2003.

[42] J. Johnston, Econometric Methods, Economics Series, McGraw-Hill, 1991.

[43] D. Cochrane, G. Orcutt, Application of least-squares regressions to relationships containing autocorrelated error terms, Journal of the American Statistical Association 44 (1949) 32–61.
