Componentwise Least Squares Support Vector Machines

K. Pelckmans^1, I. Goethals^1, J. De Brabanter^{1,2}, J.A.K. Suykens^1, and B. De Moor^1

^1 KULeuven - ESAT - SCD/SISTA, Kasteelpark Arenberg 10, 3001 Leuven (Heverlee), Belgium, {Kristiaan.Pelckmans,Johan.Suykens}@esat.kuleuven.ac.be
^2 Hogeschool KaHo Sint-Lieven (Associatie KULeuven), Departement Industrieel Ingenieur

Summary. This chapter describes componentwise Least Squares Support Vector Machines (LS-SVMs) for the estimation of additive models consisting of a sum of nonlinear components. The primal-dual derivations characterizing LS-SVMs for the estimation of the additive model result in a single set of linear equations with size growing in the number of datapoints. The derivation is elaborated for the classification as well as the regression case. Furthermore, different techniques are proposed to discover structure in the data by looking for sparse components in the model, based on dedicated regularization schemes on the one hand and fusion of the componentwise LS-SVM training with a validation criterion on the other hand.

keywords: LS-SVMs, additive models, regularization, structure detection

1 Introduction

Non-linear classification and function approximation is an important topic of interest with continuously growing research areas. Estimation techniques based on regularization and kernel methods play an important role. We mention in this context smoothing splines (Wahba, 1990), regularization networks (Poggio and Girosi, 1990), Gaussian processes (MacKay, 1992), Support Vector Machines (SVMs) (Vapnik, 1998; Cristianini and Shawe-Taylor, 2000; Schoelkopf and Smola, 2002) and many more, see e.g. (Hastie et al., 2001). SVMs and related methods have been introduced within the context of statistical learning theory and structural risk minimization. In these methods one solves convex optimization problems, typically quadratic programs. Least Squares Support Vector Machines (LS-SVMs) (Suykens and Vandewalle, 1999; Suykens et al., 2002) are reformulations of standard SVMs which lead to solving linear KKT systems for classification tasks as well as regression. In (Suykens et al., 2002) LS-SVMs have been proposed as a class of kernel machines with primal-dual formulations in relation to kernel Fisher Discriminant Analysis (FDA), Ridge Regression (RR), Partial Least Squares (PLS), Principal Component Analysis (PCA), Canonical Correlation Analysis (CCA), recurrent networks and control. The dual problems for the static regression case without bias term are closely related to Gaussian processes (MacKay, 1992), regularization networks (Poggio and Girosi, 1990) and Kriging (Cressie, 1993), while LS-SVMs rather take an optimization approach with primal-dual formulations which have been exploited towards large scale problems and in developing robust versions.

Direct estimation of high dimensional nonlinear functions using a non-parametric technique without imposing restrictions faces the problem of the curse of dimensionality. Several attempts were made to overcome this obstacle, including projection pursuit regression (Friedmann and Stuetzle, 1981) and kernel methods for dimensionality reduction (KDR) (Fukumizu et al., 2004). Additive models are very useful for approximating high dimensional nonlinear functions (Stone, 1985; Hastie and Tibshirani, 1990). These methods and their extensions have become one of the most widely used nonparametric techniques, as they offer a compromise between the somewhat conflicting requirements of flexibility, dimensionality and interpretability. Traditionally, splines are a common modeling technique (Wahba, 1990) for additive models, as e.g. in MARS (see e.g. (Hastie et al., 2001)) or in combination with ANOVA (Neter et al., 1974). Additive models were brought further to the attention of the machine learning community by e.g. (Vapnik, 1998; Gunn and Kandola, 2002). Estimation of the nonlinear components of an additive model is usually performed by the iterative backfitting algorithm (Hastie and Tibshirani, 1990) or a two-stage marginal integration based estimator (Linton and Nielsen, 1995). Although consistency of both is shown under certain conditions, important practical issues (the number of iteration steps in the former) and more theoretical issues (the pilot estimator needed for the latter procedure poses a too general estimation problem) remain open.

In this chapter we show how the primal-dual derivations characterizing LS-SVMs can be employed to formulate a straightforward solution to the estimation problem of additive models using convex optimization techniques for classification as well as regression problems. Apart from this one-shot optimal training algorithm, the chapter approaches the problem of structure detection in additive models (Hastie et al., 2001; Gunn and Kandola, 2002) by considering an appropriate regularization scheme leading to sparse components. The additive regularization (AReg) framework (Pelckmans et al., 2003) is adopted to effectively emulate these schemes based on 2-norms, 1-norms and specialized penalization terms (Antoniadis and Fan, 2001). Furthermore, a validation criterion is considered to select relevant components. Classically, exhaustive search methods (or stepwise procedures) are used, which can be written as a combinatorial optimization problem. This chapter proposes a convex relaxation to the component selection problem.

This chapter is organized as follows. Section 2 presents componentwise LS-SVM regressors and classifiers for efficient estimation of additive models and relates the result with ANOVA kernels and classical estimation procedures. Section 3 introduces the additive regularization in this context and shows how to emulate dedicated regularization schemes in order to obtain sparse components. Section 4 considers the problem of component selection based on a validation criterion. Section 5 presents a number of examples.

2 Componentwise LS-SVMs and Primal-Dual Formulations

2.1 The Additive Model Class

Given a training set defined as $\mathcal{D}_N = \{x_k, y_k\}_{k=1}^N \subset \mathbb{R}^D \times \mathbb{R}$ of size $N$ drawn i.i.d. from an unknown distribution $F_{XY}$ according to $y_k = f(x_k) + e_k$, where $f : \mathbb{R}^D \rightarrow \mathbb{R}$ is an unknown real-valued smooth function, $E[y_k \,|\, X = x_k] = f(x_k)$ and $e_1, \dots, e_N$ are uncorrelated random errors with $E[e_k \,|\, X = x_k] = 0$ and $E[(e_k)^2 \,|\, X = x_k] = \sigma_e^2 < \infty$. The $n$ data points of the validation set are denoted as $\mathcal{D}_n^{(v)} = \{x_j^{(v)}, y_j^{(v)}\}_{j=1}^n$. The following vector notations are used throughout the text: $X = (x_1, \dots, x_N) \in \mathbb{R}^{D \times N}$, $Y = (y_1, \dots, y_N)^T \in \mathbb{R}^N$, $X^{(v)} = (x_1^{(v)}, \dots, x_n^{(v)}) \in \mathbb{R}^{D \times n}$ and $Y^{(v)} = (y_1^{(v)}, \dots, y_n^{(v)})^T \in \mathbb{R}^n$. Estimation of a regression function is difficult if the dimension $D$ is large. One way to quantify this is the optimal minimax rate of convergence $N^{-2l/(2l+D)}$ for the estimation of an $l$ times differentiable regression function, which converges to zero slowly if $D$ is large compared to $l$ (Stone, 1982). A possibility to overcome the curse of dimensionality is to impose additional structure on the regression function. Although not needed in the derivation of the optimal solution, the input variables are assumed to be uncorrelated (see also concurvity (Hastie and Tibshirani, 1990)) in the applications.

Let the superscript $x^d \in \mathbb{R}$ denote the $d$-th component of an input vector $x \in \mathbb{R}^D$ for all $d = 1, \dots, D$. Let for instance each component correspond with a different dimension of the input observations. Assume that the function $f$ can be approximated arbitrarily well by a model having the following structure

$f(x) = \sum_{d=1}^{D} f_d(x^d) + b$,   (1)

where $f_d : \mathbb{R} \rightarrow \mathbb{R}$ for all $d = 1, \dots, D$ are unknown real-valued smooth functions and $b$ is an intercept term. The following vector notation is used: $X_d = (x_1^d, \dots, x_N^d) \in \mathbb{R}^{1 \times N}$ and $X^{(v)}_d = (x_1^{(v)d}, \dots, x_n^{(v)d}) \in \mathbb{R}^{1 \times n}$. The optimal rate of convergence for estimators based on this model is $N^{-2l/(2l+1)}$, which is independent of $D$ (Stone, 1985). Most state-of-the-art estimation techniques for additive models can be divided into two approaches (Hastie et al., 2001):

• Iterative approaches use an iteration where in each step part of the unknown components are fixed while optimizing the remaining components. This is motivated as:

$\hat{f}_{d_1}(x_k^{d_1}) = y_k - e_k - \sum_{d_2 \neq d_1} \hat{f}_{d_2}(x_k^{d_2})$,   (2)

for all $k = 1, \dots, N$ and $d_1 = 1, \dots, D$. Once the $D-1$ components of the second term are known, it becomes easy to estimate the left-hand side. For a large class of linear smoothers, such so-called backfitting algorithms are equivalent to a Gauss-Seidel algorithm for solving a large ($ND \times ND$) set of linear equations (Hastie et al., 2001). The backfitting algorithm (Hastie and Tibshirani, 1990) is theoretically and practically well motivated; a code sketch of this iteration is given after this list.

• Two-stage marginalization approaches construct in the first stage a general black-box pilot estimator (e.g. a Nadaraya-Watson kernel estimator) and finally estimate the additive components by marginalizing (integrating out) for each component the variation of the remaining components.
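As a rough illustration of the first approach, the generic backfitting iteration (not the specific componentwise LS-SVM estimator of this chapter) can be sketched as follows. The `smoothers` argument is a hypothetical list of univariate fit/predict objects, one per component; the snippet only mirrors the partial-residual update (2).

```python
import numpy as np

def backfit(X, y, smoothers, n_iter=20):
    """Gauss-Seidel style backfitting for an additive model y ~ sum_d f_d(x^d) + b.

    X : (N, D) inputs, y : (N,) targets,
    smoothers : list of D objects with fit(x, r) and predict(x) methods
                (hypothetical interface, one univariate smoother per component).
    """
    N, D = X.shape
    b = y.mean()                       # intercept
    F = np.zeros((N, D))               # current component fits f_d(x_k^d)
    for _ in range(n_iter):
        for d in range(D):
            # partial residual: remove all other components, cf. Eq. (2)
            r = y - b - F.sum(axis=1) + F[:, d]
            smoothers[d].fit(X[:, d], r)
            F[:, d] = smoothers[d].predict(X[:, d])
            F[:, d] -= F[:, d].mean()  # center each component for identifiability
    return b, F
```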

2.2 Componentwise Least Squares Support Vector Machine Regressors

At first, a primal-dual formulation is derived for componentwise LS-SVM regressors. The global model takes the form as in (1) for any $x_* \in \mathbb{R}^D$:

$f(x_*; w_d, b) = \sum_{d=1}^{D} f_d(x_*^d; w_d) + b = \sum_{d=1}^{D} w_d^T \varphi_d(x_*^d) + b$.   (3)

The individual components of an additive model based on LS-SVMs are written as $f_d(x^d; w_d) = w_d^T \varphi_d(x^d)$ in the primal space, where $\varphi_d : \mathbb{R} \rightarrow \mathbb{R}^{n_{\varphi_d}}$ denotes a potentially infinite ($n_{\varphi_d} = \infty$) dimensional feature map. The regularized least squares cost function is given as (Suykens et al., 2002)

$\min_{w_d, b, e_k} \mathcal{J}_\gamma(w_d, e) = \frac{1}{2} \sum_{d=1}^{D} w_d^T w_d + \frac{\gamma}{2} \sum_{k=1}^{N} e_k^2 \quad \text{s.t.} \quad \sum_{d=1}^{D} w_d^T \varphi_d(x_k^d) + b + e_k = y_k, \quad k = 1, \dots, N$.   (4)

Note that the regularization constant $\gamma$ appears here as in classical Tikhonov regularization (Tikhonov and Arsenin, 1977). The Lagrangian of the constrained optimization problem becomes

$\mathcal{L}_\gamma(w_d, b, e_k; \alpha_k) = \frac{1}{2} \sum_{d=1}^{D} w_d^T w_d + \frac{\gamma}{2} \sum_{k=1}^{N} e_k^2 - \sum_{k=1}^{N} \alpha_k \left( \sum_{d=1}^{D} w_d^T \varphi_d(x_k^d) + b + e_k - y_k \right)$.   (5)

By taking the conditions for optimality $\partial \mathcal{L}_\gamma / \partial \alpha_k = 0$, $\partial \mathcal{L}_\gamma / \partial b = 0$, $\partial \mathcal{L}_\gamma / \partial e_k = 0$ and $\partial \mathcal{L}_\gamma / \partial w_d = 0$, and application of the kernel trick $K_d(x_k^d, x_j^d) = \varphi_d(x_k^d)^T \varphi_d(x_j^d)$ with a positive definite (Mercer) kernel $K_d : \mathbb{R} \times \mathbb{R} \rightarrow \mathbb{R}$, one gets the following conditions for optimality:

$\begin{cases} y_k = \sum_{d=1}^{D} w_d^T \varphi_d(x_k^d) + b + e_k, & k = 1, \dots, N \quad (a) \\ e_k \gamma = \alpha_k, & k = 1, \dots, N \quad (b) \\ w_d = \sum_{k=1}^{N} \alpha_k \varphi_d(x_k^d), & d = 1, \dots, D \quad (c) \\ 0 = \sum_{k=1}^{N} \alpha_k. & (d) \end{cases}$   (6)


Note that condition (6.b) states that the elements of the solution vector $\alpha$ should be proportional to the errors. The dual problem is summarized in matrix notation as

$\begin{bmatrix} 0 & 1_N^T \\ 1_N & \Omega + I_N/\gamma \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ Y \end{bmatrix}$,   (7)

where $\Omega \in \mathbb{R}^{N \times N}$ with $\Omega = \sum_{d=1}^{D} \Omega_d$ and $\Omega_{kl}^d = K_d(x_k^d, x_l^d)$ for all $k, l = 1, \dots, N$, which is expressed in the dual variables $\hat{\alpha}$ instead of $\hat{w}$. A new point $x_* \in \mathbb{R}^D$ can be evaluated as

$\hat{y}_* = \hat{f}(x_*; \hat{\alpha}, \hat{b}) = \sum_{k=1}^{N} \hat{\alpha}_k \sum_{d=1}^{D} K_d(x_k^d, x_*^d) + \hat{b}$,   (8)

where $\hat{\alpha}$ and $\hat{b}$ are the solution to (7). Simulating a validation datapoint $x_j$ for all $j = 1, \dots, n$ by the $d$-th individual component gives

$\hat{y}_j^d = \hat{f}_d(x_j^d; \hat{\alpha}) = \sum_{k=1}^{N} \hat{\alpha}_k K_d(x_k^d, x_j^d)$,   (9)

which can be summarized as follows: $\hat{Y} = (\hat{y}_1, \dots, \hat{y}_N)^T \in \mathbb{R}^N$, $\hat{Y}_d = (\hat{y}_1^d, \dots, \hat{y}_N^d)^T \in \mathbb{R}^N$, $\hat{Y}^{(v)} = (\hat{y}_1^{(v)}, \dots, \hat{y}_n^{(v)})^T \in \mathbb{R}^n$ and $\hat{Y}^{(v)}_d = (\hat{y}_1^{(v)d}, \dots, \hat{y}_n^{(v)d})^T \in \mathbb{R}^n$.

Remarks:

• Note that the componentwise LS-SVM regressor can be written as a linear smoothing matrix (Suykens et al., 2002):

$\hat{Y} = S_\gamma Y$.   (10)

For notational convenience, the bias term is omitted from this description. The smoother matrix $S_\gamma \in \mathbb{R}^{N \times N}$ becomes

$S_\gamma = \Omega \left( \Omega + I_N \frac{1}{\gamma} \right)^{-1}$.   (11)

• The set of linear equations (7) corresponds with a classical LS-SVM regressor where a modified kernel is used (see also Figure 1):

$K(x_k, x_j) = \sum_{d=1}^{D} K_d(x_k^d, x_j^d)$.   (12)

Figure 1 shows the modified kernel in case a one-dimensional Radial Basis Function (RBF) kernel is used for all $D$ (in the example, $D = 2$) components. This observation implies that componentwise LS-SVMs inherit results obtained for classical LS-SVMs and kernel methods in general. From a practical point of view, the previous kernels (and a fortiori componentwise kernel models) result in the same algorithms as considered in the ANOVA kernel decompositions as in (Vapnik, 1998; Gunn and Kandola, 2002):

$K(x_k, x_j) = \sum_{d=1}^{D} K_d(x_k^d, x_j^d) + \sum_{d_1 \neq d_2} K_{d_1 d_2}\left( (x_k^{d_1}, x_k^{d_2})^T, (x_j^{d_1}, x_j^{d_2})^T \right) + \dots$,   (13)

where the componentwise LS-SVMs only consider the first term in this expansion. The described derivation bridges the gap between the estimation of additive models and the use of ANOVA kernels.

Fig. 1. The two-dimensional componentwise Radial Basis Function (RBF) kernel for componentwise LS-SVMs takes the form $K(x_k, x_l) = K^1(x_k^1, x_l^1) + K^2(x_k^2, x_l^2)$ as displayed. The standard RBF kernel takes the form $K(x_k, x_l) = \exp(-\|x_k - x_l\|_2^2 / \sigma^2)$ with $\sigma \in \mathbb{R}_0^+$ an appropriately chosen bandwidth.
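For concreteness, a minimal numerical sketch of componentwise LS-SVM regression is given below: it assembles $\Omega = \sum_d \Omega_d$ from one-dimensional RBF kernels with a common bandwidth $\sigma$ (an assumption made for simplicity), solves the dual system (7) and evaluates new points as in (8). The function names are illustrative and not taken from an existing library.

```python
import numpy as np

def rbf_1d(a, b, sigma):
    """One-dimensional RBF kernel matrix K_d(a_i, b_j) for vectors a and b."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / sigma ** 2)

def train_componentwise_lssvm(X, y, gamma, sigma):
    """Assemble Omega = sum_d Omega_d and solve the dual system (7) for (alpha, b)."""
    N, D = X.shape
    Omega = sum(rbf_1d(X[:, d], X[:, d], sigma) for d in range(D))
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = Omega + np.eye(N) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[1:], sol[0]                          # (alpha, b)

def predict(Xtr, alpha, b, Xnew, sigma):
    """Evaluate (8): y_hat(x_*) = sum_k alpha_k sum_d K_d(x_k^d, x_*^d) + b."""
    D = Xtr.shape[1]
    K = sum(rbf_1d(Xnew[:, d], Xtr[:, d], sigma) for d in range(D))
    return K @ alpha + b
```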

2.3 Componentwise Least Squares Support Vector Machine Classifiers

In the case of classification, let $y_k, y_j^{(v)} \in \{-1, 1\}$ for all $k = 1, \dots, N$ and $j = 1, \dots, n$. The analogous derivation of the componentwise LS-SVM classifier is briefly reviewed. The following model is considered for modeling the data:


$f(x) = \mathrm{sign}\left( \sum_{d=1}^{D} f_d(x^d) + b \right)$,   (14)

where again the individual components of the additive model based on LS-SVMs are given as $f_d(x^d) = w_d^T \varphi_d(x^d)$ in the primal space, where $\varphi_d : \mathbb{R} \rightarrow \mathbb{R}^{n_{\varphi_d}}$ denotes a potentially infinite ($n_{\varphi_d} = \infty$) dimensional feature map. The regularized least squares cost function is given as (Suykens and Vandewalle, 1999; Suykens et al., 2002)

$\min_{w_d, b, e_k} \mathcal{J}_\gamma(w_d, e) = \frac{1}{2} \sum_{d=1}^{D} w_d^T w_d + \frac{\gamma}{2} \sum_{k=1}^{N} e_k^2 \quad \text{s.t.} \quad y_k \left( \sum_{d=1}^{D} w_d^T \varphi_d(x_k^d) + b \right) = 1 - e_k, \quad k = 1, \dots, N$,   (15)

where $e_k$ are so-called slack variables for all $k = 1, \dots, N$. After construction of the Lagrangian and taking the conditions for optimality, one obtains the following set of linear equations (see e.g. (Suykens et al., 2002)):

$\begin{bmatrix} 0 & Y^T \\ Y & \Omega^y + I_N/\gamma \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ 1_N \end{bmatrix}$,   (16)

where $\Omega^y \in \mathbb{R}^{N \times N}$ with $\Omega^y = \sum_{d=1}^{D} \Omega^y_d$ and $\Omega^{y,d}_{kl} = y_k y_l K_d(x_k^d, x_l^d)$. New data points $x_* \in \mathbb{R}^D$ can be evaluated as

$\hat{y}_* = \mathrm{sign}\left( \sum_{k=1}^{N} \hat{\alpha}_k y_k \sum_{d=1}^{D} K_d(x_k^d, x_*^d) + \hat{b} \right)$.   (17)
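A corresponding sketch for the classifier case, again assuming one-dimensional RBF kernels with a common bandwidth, assembles $\Omega^y$ and solves (16); the decision (17) follows by taking the sign. The function names are illustrative only.

```python
import numpy as np

def rbf_1d(a, b, sigma):
    """One-dimensional RBF kernel matrix K_d(a_i, b_j)."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / sigma ** 2)

def train_componentwise_lssvm_classifier(X, y, gamma, sigma):
    """Solve (16): [[0, Y^T], [Y, Omega_y + I_N/gamma]] [b; alpha] = [0; 1_N]."""
    N, D = X.shape
    Omega = sum(rbf_1d(X[:, d], X[:, d], sigma) for d in range(D))
    Omega_y = (y[:, None] * y[None, :]) * Omega        # Omega^y_{kl} = y_k y_l K(x_k, x_l)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega_y + np.eye(N) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], np.ones(N))))
    return sol[1:], sol[0]                             # (alpha, b)

def classify(Xtr, ytr, alpha, b, Xnew, sigma):
    """Evaluate (17): sign(sum_k alpha_k y_k sum_d K_d(x_k^d, x_*^d) + b)."""
    D = Xtr.shape[1]
    K = sum(rbf_1d(Xnew[:, d], Xtr[:, d], sigma) for d in range(D))
    return np.sign(K @ (alpha * ytr) + b)
```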

In the remainder of this text, only the regression case is considered. The classification case can be derived straightforwardly along the same lines.

3 Regularizing for Sparse Components via Additive Regularization

A regularization method fixes a priori the answer to the ill-conditioned (or ill-defined) nature of the inverse problem. The classical Tikhonov regularization scheme (Tikhonov and Arsenin, 1977) states the answer in terms of the norm of the solution. The formulation of the additive regularization (AReg) framework (Pelckmans et al., 2003) made it possible to impose alternative answers to the ill-conditioning of the problem at hand. We refer to this AReg level as the substrate LS-SVM. An appropriate regularization scheme for additive models is to favor solutions using the smallest number of components to explain the data as much as possible. In this chapter, we use the somewhat relaxed condition of sparse components to select appropriate components instead of the more general problem of input (or component) selection.


3.1 Level 1: Componentwise LS-SVM Substrate

Fig. 2. Graphical representation of the additive regularization framework used for emulating other loss functions and regularization schemes. Conceptually, one differentiates between the newly specified cost function (Level 2) and the LS-SVM substrate (Level 1), while computationally both are computed simultaneously.

Using the Additive Regularization (AReg) scheme for componentwise LS-SVM regressors results in the following modified cost function:

$\min_{w_d, b, e_k} \mathcal{J}_c(w_d, e) = \frac{1}{2} \sum_{d=1}^{D} w_d^T w_d + \frac{1}{2} \sum_{k=1}^{N} (e_k - c_k)^2 \quad \text{s.t.} \quad \sum_{d=1}^{D} w_d^T \varphi_d(x_k^d) + b + e_k = y_k, \quad k = 1, \dots, N$,   (18)

where $c_k \in \mathbb{R}$ for all $k = 1, \dots, N$. Let $c = (c_1, \dots, c_N)^T \in \mathbb{R}^N$. After constructing the Lagrangian and taking the conditions for optimality, one obtains the following set of linear equations, see (Pelckmans et al., 2003):

$\begin{bmatrix} 0 & 1_N^T \\ 1_N & \Omega + I_N \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} + \begin{bmatrix} 0 \\ c \end{bmatrix} = \begin{bmatrix} 0 \\ Y \end{bmatrix}$   (19)

and $e = \alpha + c \in \mathbb{R}^N$. Given a regularization constant vector $c$, the unique solution follows immediately from this set of linear equations.
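Given a kernel matrix $\Omega = \sum_d \Omega_d$ built as in Section 2 and a candidate vector $c$, the substrate system (19) is an ordinary linear solve; a minimal sketch:

```python
import numpy as np

def lssvm_substrate(Omega, y, c):
    """Solve the AReg substrate (19) for (alpha, b) given a regularization vector c."""
    N = len(y)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = Omega + np.eye(N)
    rhs = np.concatenate(([0.0], y - c))   # move [0; c] to the right-hand side of (19)
    sol = np.linalg.solve(A, rhs)
    alpha, b = sol[1:], sol[0]
    return alpha, b, alpha + c             # e = alpha + c
```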

However, as this scheme is too general for practical implementation, $c$ should be limited in an appropriate way by imposing for example constraints corresponding with certain model assumptions or a specified cost function. Considering for a moment the conditions for optimality of the componentwise LS-SVM regressor using a regularization term as in ridge regression, one can see that equation (7) corresponds with (19) if $\gamma^{-1} \alpha = \alpha + c$ for given $\gamma$. Once an appropriate $c$ is found which satisfies the constraints, it can be plugged into the LS-SVM substrate (19). It turns out that one can omit this conceptual second stage in the computations by elimination of the variable $c$ in the constrained optimization problem (see Figure 2).

Alternatively, a measure corresponding with a (penalized) cost function can be used which fulfills the role of model selection in a broad sense. A variety of such explicit or implicit limitations can be emulated based on different criteria (see Figure 3).

Fig. 3. The Level 2 cost functions of Figure 2 on the conceptual level can take different forms based on validation performance or training error. While some will result in convex tuning procedures, others may lose this property depending on the chosen cost function on the second level.

3.2 Level 2: Emulating an L1-based Component Regularization Scheme (Convex)

We now study how to obtain sparse components by considering a dedicated regularization scheme. The LS-SVM substrate technique is used to emulate the proposed scheme, as primal-dual derivations (see e.g. Subsection 2.2) are not straightforward anymore.

Let $\hat{Y}_d \in \mathbb{R}^N$ denote the estimated training outputs of the $d$-th submodel $f_d$ as in (9). The component-based regularization scheme can be translated as the following constrained optimization problem, where the conditions for optimality of (18), as summarized in (19), are to be satisfied exactly (after elimination of $w$):


$\min_{c, \hat{Y}_d, e_k; \alpha, b} \mathcal{J}_\xi(\hat{Y}_d, e_k) = \frac{1}{2} \sum_{d=1}^{D} \|\hat{Y}_d\|_1 + \frac{\xi}{2} \sum_{k=1}^{N} e_k^2 \quad \text{s.t.} \quad \begin{cases} 1_N^T \alpha = 0, \\ \Omega \alpha + 1_N b + \alpha + c = Y, \\ \Omega_d \alpha = \hat{Y}_d, & \forall d = 1, \dots, D, \\ \alpha + c = e, \end{cases}$   (20)

where the use of the robust $L_1$ norm can be justified as in general no assumptions are imposed on the distribution of the elements of $\hat{Y}_d$. By elimination of $c$ using the equality $e = \alpha + c$, this problem can be written as follows:

$\min_{\hat{Y}_d, e_k; \alpha, b} \mathcal{J}_\xi(\hat{Y}_d, e_k) = \frac{1}{2} \sum_{d=1}^{D} \|\hat{Y}_d\|_1 + \frac{\xi}{2} \sum_{k=1}^{N} e_k^2 \quad \text{s.t.} \quad \begin{bmatrix} 0 & 1_N^T \\ 1_N & \Omega \\ 0_N & \Omega_1 \\ \vdots & \vdots \\ 0_N & \Omega_D \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} + \begin{bmatrix} 0 \\ e \\ \hat{Y}_1 \\ \vdots \\ \hat{Y}_D \end{bmatrix} = \begin{bmatrix} 0 \\ Y \\ 0_N \\ \vdots \\ 0_N \end{bmatrix}$.   (21)

This convex constrained optimization problem can be solved as a quadratic programming problem. As a consequence of the use of the $L_1$ norm, often sparse components ($\|\hat{Y}_d\|_1 = 0$) are obtained, in a similar way as sparse variables in LASSO or sparse datapoints in SVMs (Hastie et al., 2001; Vapnik, 1998). An important difference is that the estimated outputs are used for regularization purposes instead of the solution vector. It is good practice to omit sparse components on the training dataset from simulation:

$\hat{f}(x_*; \hat{\alpha}, \hat{b}) = \sum_{i=1}^{N} \hat{\alpha}_i \sum_{d \in S_D} K_d(x_i^d, x_*^d) + \hat{b}$,   (22)

where $S_D = \{d \,|\, \hat{\alpha}^T \Omega_d \hat{\alpha} \neq 0\}$.

Using the $L_2$ norm $\sum_{d=1}^{D} \|\hat{Y}_d\|_2^2$ instead leads to a much simpler optimization problem, but additional assumptions (Gaussianity) are needed on the distribution of the elements of $\hat{Y}_d$. Moreover, the component selection then has to resort to a significance test instead of the sparsity resulting from (21). A practical algorithm is proposed in Subsection 5.1 that uses an iteration of $L_2$ norm based optimizations in order to calculate the optimum of the proposed regularized cost function.
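A sketch of how (21) can be cast with a generic convex modeling tool is given below; the third-party cvxpy library is assumed to be available, and the exact block-matrix layout of (21) is not reproduced. After elimination of $c$, the constraints reduce to $1_N^T \alpha = 0$ and $\Omega \alpha + 1_N b + e = Y$, with $\hat{Y}_d = \Omega_d \alpha$ appearing only inside the objective; this is an illustration of the convex program, not a reference implementation.

```python
import numpy as np
import cvxpy as cp

def sparse_components_l1(Omega_list, y, xi):
    """Sparse components via the L1 objective of (21), after elimination of c.

    Omega_list : list of D component kernel matrices Omega_d (N x N)
    y          : (N,) training targets,  xi : trade-off parameter
    """
    N = len(y)
    alpha, b, e = cp.Variable(N), cp.Variable(), cp.Variable(N)
    Omega = sum(Omega_list)
    obj = 0.5 * sum(cp.norm1(Om @ alpha) for Om in Omega_list) \
          + 0.5 * xi * cp.sum_squares(e)
    cons = [cp.sum(alpha) == 0,
            Omega @ alpha + b + e == y]       # training conditions of (19)-(20)
    cp.Problem(cp.Minimize(obj), cons).solve()
    # components with (numerically) zero output norm are considered sparse
    sparse = [np.linalg.norm(Om @ alpha.value, 1) < 1e-6 for Om in Omega_list]
    return alpha.value, b.value, sparse
```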

3.3 Level 2 bis: Emulating a Smoothly Thresholding Penalty Function (Non-convex)

This subsection considers extensions to classical formulations towards the use of dedicated regularization schemes for sparsifying components. Consider the componentwise regularized least squares cost function defined as


$\mathcal{J}_\lambda(w_d, e) = \frac{\lambda}{2} \sum_{d=1}^{D} \ell(w_d) + \frac{1}{2} \sum_{k=1}^{N} e_k^2$,   (23)

where $\ell(w_d)$ is a penalty function and $\lambda \in \mathbb{R}_0^+$ acts as a regularization parameter. We denote $\lambda \ell(\cdot)$ by $\ell_\lambda(\cdot)$, so it may depend on $\lambda$. Examples of penalty functions include:

• The $L_p$ penalty function $\ell_\lambda^p(w_d) = \lambda \|w_d\|_p^p$ leads to bridge regression (Frank and Friedman, 1993; Fu, 1998). It is known that the $L_2$ penalty function ($p = 2$) results in ridge regression. For the $L_1$ penalty function the solution is the soft thresholding rule (Donoho and Johnstone, 1994). LASSO, as proposed by (Tibshirani, 1996; Tibshirani, 1997), is the penalized least squares estimate using the $L_1$ penalty function (see Figure 4.a).

• Let the indicator function $I_{\{x \in A\}} = 1$ if $x \in A$ for a specified set $A$ and $0$ otherwise. When the penalty function is given by $\ell_\lambda(w_d) = \lambda^2 - (\|w_d\|_1 - \lambda)^2 \, I_{\{\|w_d\|_1 < \lambda\}}$ (see Figure 4.b), the solution is a hard thresholding rule (Antoniadis, 1997).

The $L_p$ and the hard thresholding penalty functions do not simultaneously satisfy the mathematical conditions for unbiasedness, sparsity and continuity (Fan and Li, 2001). The hard thresholding penalty has a discontinuous cost surface. The only continuous cost surface (defined as the cost function associated with the solution space) with a thresholding rule in the $L_p$-family is the $L_1$ penalty function, but the resulting estimator is shifted by a constant $\lambda$. To avoid these drawbacks, (Nikolova, 1999) suggests the penalty function defined as

$\ell_\lambda^a(w_d) = \frac{\lambda a \|w_d\|_1}{1 + a \|w_d\|_1}$,   (24)

with $a \in \mathbb{R}_0^+$. This penalty function behaves quite similarly to the Smoothly Clipped Absolute Deviation (SCAD) penalty function suggested by (Fan, 1997). The smoothly thresholding penalty (STP) function (24) improves the properties of the $L_1$ penalty function and the hard thresholding penalty function (see Figure 4.c), see (Antoniadis and Fan, 2001). The unknowns $a$ and $\lambda$ act as regularization parameters. A plausible value for $a$ was derived in (Nikolova, 1999; Antoniadis and Fan, 2001) as $a = 3.7$. The transformed $L_1$ penalty function satisfies the oracle inequalities (Donoho and Johnstone, 1994). One can plug in the described semi-norm $\ell_\lambda^a(\cdot)$ to improve the component-based regularization scheme (20). Again, the additive regularization scheme is used for the emulation of this scheme:

$\min_{c, \hat{Y}_d, e_k; \alpha, b} \mathcal{J}_\lambda(\hat{Y}_d, e_k) = \frac{1}{2} \sum_{d=1}^{D} \ell_\lambda^a(\hat{Y}_d) + \frac{1}{2} \sum_{k=1}^{N} e_k^2 \quad \text{s.t.} \quad \begin{cases} 1_N^T \alpha = 0, \\ \Omega \alpha + 1_N b + \alpha + c = Y, \\ \Omega_d \alpha = \hat{Y}_d, & \forall d = 1, \dots, D, \\ \alpha + c = e, \end{cases}$   (25)

which becomes non-convex but can be solved using an iterative scheme as explained later in Subsection 5.1.

Fig. 4. Typical penalty functions: (a) the $L_p$ penalty family for $p = 2$, $1$ and $0.6$, (b) the hard thresholding penalty function and (c) the transformed $L_1$ penalty function.
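For reference, the penalties of Figure 4 are easy to write down directly; a small sketch follows, where the default $a = 3.7$ is the value quoted above and the hard thresholding penalty uses the indicator form of Subsection 3.3.

```python
import numpy as np

def l1_penalty(w, lam):
    """Soft thresholding (L1) penalty: lam * |w|."""
    return lam * np.abs(w)

def hard_threshold_penalty(w, lam):
    """Hard thresholding penalty: lam^2 - (|w| - lam)^2 * 1{|w| < lam}."""
    aw = np.abs(w)
    return lam ** 2 - (aw - lam) ** 2 * (aw < lam)

def transformed_l1_penalty(w, lam, a=3.7):
    """Smoothly thresholding (transformed L1) penalty of Eq. (24)."""
    aw = a * np.abs(w)
    return lam * aw / (1.0 + aw)
```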

4 Fusion of Componentwise LS-SVMs and Validation

This section investigates how one can tune the componentwise LS-SVMs with respect to a validation criterion in order to improve the generalization performance of the final model. As proposed in (Pelckmans et al., 2003), fusion of training and validation levels can be investigated from an optimization point of view, while conceptually they can be considered at different levels.

4.1 Fusion of Componentwise LS-SVMs and Validation for Regularization Constant Tuning

For this purpose, the fusion argument as introduced in (Pelckmans et al., 2003) is briefly revised in relation to regularization parameter tuning. The estimator of the LS-SVM regressor on the training data for a fixed value of $\gamma$ is given by (4):

Level 1: $(\hat{w}, \hat{b}) = \arg\min_{w, b, e} \mathcal{J}_\gamma(w, e)$ s.t. (4) holds,   (26)

which results in solving a linear set of equations (7) after substitution of $w$ by the Lagrange multipliers $\alpha$. Tuning the regularization parameter by using a validation criterion gives the following estimator:

Level 2: $\hat{\gamma} = \arg\min_\gamma \sum_{j=1}^{n} \left( f(x_j^{(v)}; \hat{\alpha}, \hat{b}) - y_j^{(v)} \right)^2$ with $(\hat{\alpha}, \hat{b}) = \arg\min_{\alpha, b} \mathcal{J}_\gamma$,   (27)

satisfying again (4). Using the conditions for optimality (7) and eliminating $w$ and $e$,

Fusion: $(\hat{\gamma}, \hat{\alpha}, \hat{b}) = \arg\min_{\gamma, \alpha, b} \sum_{j=1}^{n} \left( f(x_j^{(v)}; \alpha, b) - y_j^{(v)} \right)^2$ s.t. (7) holds,   (28)

which is referred to as fusion. The resulting optimization problem was noted to be non-convex. The optimal solutions $w$ or dual $\alpha$'s corresponding with a $\gamma > 0$ describe a non-convex set. To overcome this problem, a re-parameterization of the trade-off was proposed, leading to the additive regularization scheme. At the cost of overparameterizing the trade-off, convexity is obtained. To circumvent this drawback, different ways to restrict explicitly or implicitly the (effective) degrees of freedom of the regularization scheme $c \in \mathcal{A} \subset \mathbb{R}^N$ were proposed while retaining convexity (Pelckmans et al., 2003). The convex problem resulting from additive regularization is

Fusion: $(\hat{c}, \hat{\alpha}, \hat{b}) = \arg\min_{c \in \mathcal{A}, \alpha, b} \sum_{j=1}^{n} \left( f(x_j^{(v)}; \alpha, b) - y_j^{(v)} \right)^2$ s.t. (19) holds,   (29)

and can be solved efficiently as a convex constrained optimization problem if $\mathcal{A}$ is a convex set, resulting immediately in the optimal regularization trade-off and model parameters (Boyd and Vandenberghe, 2004).
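In its simplest, non-convex form, the Level 2 criterion (27) can be approximated by a plain grid search over $\gamma$, retraining the componentwise LS-SVM for every candidate; the sketch below reuses the hypothetical `train_componentwise_lssvm` and `predict` helpers from the regression sketch in Section 2 and does not implement the convex fusion (29).

```python
import numpy as np

def tune_gamma(Xtr, ytr, Xval, yval, sigma, gammas):
    """Level 2 criterion (27) by grid search: retrain (7) for every candidate gamma
    and keep the value with the smallest validation sum of squared errors."""
    best_err, best_gamma = np.inf, None
    for gamma in gammas:
        alpha, b = train_componentwise_lssvm(Xtr, ytr, gamma, sigma)
        err = np.sum((predict(Xtr, alpha, b, Xval, sigma) - yval) ** 2)
        if err < best_err:
            best_err, best_gamma = err, gamma
    return best_gamma
```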

4.2 Fusion for Component Selection using the Additive Regularization Scheme

One possible relaxed version of the component selection problem goes as follows: investigate whether it is plausible to drive the components on the validation set to zero without too large modifications of the global training solution. This is translated into the following cost function, much in the spirit of (20). Let $\Omega^{(v)}$ denote $\sum_{d=1}^{D} \Omega^{(v)}_d \in \mathbb{R}^{n \times N}$ with $\Omega^{(v)d}_{jk} = K_d(x_j^{(v)d}, x_k^d)$ for all $j = 1, \dots, n$ and $k = 1, \dots, N$.

$(\hat{c}, \hat{Y}^{(v)d}, \hat{w}_d, \hat{e}, \hat{\alpha}, \hat{b}) = \arg\min_{c, \hat{Y}_d, \hat{Y}^{(v)d}, e, \alpha, b} \frac{1}{2} \sum_{d=1}^{D} \|\hat{Y}^{(v)d}\|_1 + \frac{1}{2} \sum_{d=1}^{D} \|\hat{Y}_d\|_1 + \frac{\xi}{2} \sum_{k=1}^{N} e_k^2 \quad \text{s.t.} \quad \begin{cases} 1_N^T \alpha = 0, \\ \alpha + c = e, \\ \Omega \alpha + 1_N b + \alpha + c = Y, \\ \Omega_d \alpha = \hat{Y}_d, & \forall d = 1, \dots, D, \\ \Omega^{(v)d} \alpha = \hat{Y}^{(v)d}, & \forall d = 1, \dots, D, \end{cases}$   (30)

where the equality constraints consist of the conditions for optimality of (19) and the evaluation of the validation set on the individual components. Again, this convex problem can be solved as a quadratic programming problem.


4.3 Fusion for Component Selection using Componentwise Regularized LS-SVMs

We proceed by considering the following primal cost function for a fixed but strictly positive $\eta = (\eta_1, \dots, \eta_D)^T \in (\mathbb{R}_0^+)^D$:

Level 1: $\min_{w_d, b, e_k} \mathcal{J}_\eta(w_d, e) = \frac{1}{2} \sum_{d=1}^{D} \frac{w_d^T w_d}{\eta_d} + \frac{1}{2} \sum_{k=1}^{N} e_k^2 \quad \text{s.t.} \quad \sum_{d=1}^{D} w_d^T \varphi_d(x_k^d) + b + e_k = y_k, \quad k = 1, \dots, N$.   (31)

Note that the regularization vector appears here similarly as in the Tikhonov regularization scheme (Tikhonov and Arsenin, 1977), where each component is regularized individually. The Lagrangian of the constrained optimization problem with multipliers $\alpha^\eta \in \mathbb{R}^N$ becomes

$\mathcal{L}_\eta(w_d, b, e_k; \alpha_k) = \frac{1}{2} \sum_{d=1}^{D} \frac{w_d^T w_d}{\eta_d} + \frac{1}{2} \sum_{k=1}^{N} e_k^2 - \sum_{k=1}^{N} \alpha_k^\eta \left( \sum_{d=1}^{D} w_d^T \varphi_d(x_k^d) + b + e_k - y_k \right)$.   (32)

By taking the conditions for optimality $\partial \mathcal{L}_\eta / \partial \alpha_k = 0$, $\partial \mathcal{L}_\eta / \partial b = 0$, $\partial \mathcal{L}_\eta / \partial e_k = 0$ and $\partial \mathcal{L}_\eta / \partial w_d = 0$, one gets the following conditions for optimality:

$\begin{cases} y_k = \sum_{d=1}^{D} w_d^T \varphi_d(x_k^d) + b + e_k, & k = 1, \dots, N \quad (a) \\ e_k = \alpha_k^\eta, & k = 1, \dots, N \quad (b) \\ w_d = \eta_d \sum_{k=1}^{N} \alpha_k^\eta \varphi_d(x_k^d), & d = 1, \dots, D \quad (c) \\ 0 = \sum_{k=1}^{N} \alpha_k^\eta. & (d) \end{cases}$   (33)

The dual problem is summarized in matrix notation by application of the kernel trick:

$\begin{bmatrix} 0 & 1_N^T \\ 1_N & \Omega^\eta + I_N \end{bmatrix} \begin{bmatrix} b \\ \alpha^\eta \end{bmatrix} = \begin{bmatrix} 0 \\ Y \end{bmatrix}$,   (34)

where $\Omega^\eta \in \mathbb{R}^{N \times N}$ with $\Omega^\eta = \sum_{d=1}^{D} \eta_d \Omega_d$ and $\Omega_{kl}^d = K_d(x_k^d, x_l^d)$. A new point $x_* \in \mathbb{R}^D$ can be evaluated as

$\hat{y}_* = \hat{f}(x_*; \hat{\alpha}^\eta, \hat{b}) = \sum_{k=1}^{N} \hat{\alpha}_k^\eta \sum_{d=1}^{D} \eta_d K_d(x_k^d, x_*^d) + \hat{b}$,   (35)

where $\hat{\alpha}^\eta$ and $\hat{b}$ are the solution to (34). Simulating a training datapoint $x_k$ for all $k = 1, \dots, N$ by the $d$-th individual component gives


$\hat{y}_k^{\eta,d} = \hat{f}_d(x_k^d; \hat{\alpha}^\eta) = \eta_d \sum_{l=1}^{N} \hat{\alpha}_l^\eta K_d(x_k^d, x_l^d)$,   (36)

which can be summarized in a vector $\hat{Y}^{\eta,d} = (\hat{y}_1^{\eta,d}, \dots, \hat{y}_N^{\eta,d})^T \in \mathbb{R}^N$. As in the previous section, the validation performance is used for tuning the regularization parameters:

Level 2: $\hat{\eta} = \arg\min_\eta \sum_{j=1}^{n} \left( f(x_j^{(v)}; \hat{\alpha}^\eta, \hat{b}) - y_j^{(v)} \right)^2$ with $(\hat{\alpha}^\eta, \hat{b}) = \arg\min_{\alpha^\eta, b} \mathcal{J}_\eta$,   (37)

or, using the conditions for optimality (34) and eliminating $w$ and $e$,

Fusion: $(\hat{\eta}, \hat{\alpha}^\eta, \hat{b}) = \arg\min_{\eta, \alpha^\eta, b} \sum_{j=1}^{n} \left( f(x_j^{(v)}; \alpha^\eta, b) - y_j^{(v)} \right)^2$ s.t. (34) holds,   (38)

which is a non-convex constrained optimization problem.

Embedding this problem in the additive regularization framework will lead us to a more suitable representation allowing for the use of dedicated algorithms. By relating the conditions (19) to (34), one can view the latter within the additive regularization framework by imposing extra constraints on $c$. The bias term $b$ is omitted from the remainder of this subsection for notational convenience. The first two constraints reflect the training conditions for both schemes. As the solutions $\alpha^\eta$ and $\alpha$ do not have the same meaning (at least for model evaluation purposes, see (8) and (35)), the appropriate $c$ is determined here by enforcing the same estimation on the training data. In summary:

$\begin{cases} (\Omega + I_N)\,\alpha + c = Y \\ \left( \sum_{d=1}^{D} \eta_d \Omega_d + I_N \right) \alpha^\eta = Y \\ \left( \sum_{d=1}^{D} \eta_d \Omega_d \right) \alpha^\eta = \Omega \alpha \end{cases} \;\Rightarrow\; \begin{cases} (\Omega + I_N)\,\alpha + c = Y \\ \Omega \alpha = \left( \eta^T \otimes I_N \right) \begin{bmatrix} \Omega_1 \\ \vdots \\ \Omega_D \end{bmatrix} (\alpha + c), \end{cases}$   (39)

where the second set of equations is obtained by eliminating $\alpha^\eta$. The last equation of the right-hand side represents the set of constraints on the values $c$ for all possible values of $\eta$. The product $\otimes$ denotes $\eta^T \otimes I_N = [\eta_1 I_N, \dots, \eta_D I_N] \in \mathbb{R}^{N \times ND}$. As for the Tikhonov case, it is readily seen that the solution space of $c$ with respect to $\eta$ is non-convex; however, the constraint on $c$ is recognized as a bilinear form. The fusion problem (38) can be written as

Fusion: $(\hat{\eta}, \hat{\alpha}, \hat{c}) = \arg\min_{\eta, \alpha, c} \left\| \Omega^{(v)} \alpha - Y^{(v)} \right\|_2^2$ s.t. (39) holds,   (40)

where algorithms such as alternating least squares can be used.
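One possible, heuristic alternating least squares scheme in the spirit of (38)-(40) is sketched below: it alternates between solving (34) for $(\alpha^\eta, b)$ at fixed $\eta$, and refitting $\eta$ by linear least squares on the validation set, exploiting that the validation prediction is linear in $\eta$ for fixed $\alpha^\eta$. This is not the exact bilinearly constrained formulation (39)-(40), only an illustration of the alternating idea.

```python
import numpy as np

def alternating_eta(Omega_list, Omega_val_list, ytr, yval, n_iter=10):
    """Heuristic alternating scheme in the spirit of (38)-(40).

    Omega_list     : D training kernel matrices Omega_d (N x N)
    Omega_val_list : D validation kernel matrices Omega^(v)_d (n x N)
    """
    D, N = len(Omega_list), len(ytr)
    eta = np.ones(D)
    for _ in range(n_iter):
        # (a) for fixed eta, solve the dual system (34) for (alpha, b)
        Omega_eta = sum(e * Om for e, Om in zip(eta, Omega_list))
        A = np.zeros((N + 1, N + 1))
        A[0, 1:] = 1.0
        A[1:, 0] = 1.0
        A[1:, 1:] = Omega_eta + np.eye(N)
        sol = np.linalg.solve(A, np.concatenate(([0.0], ytr)))
        b, alpha = sol[0], sol[1:]
        # (b) for fixed (alpha, b), refit eta on the validation set:
        #     the prediction sum_d eta_d Omega^(v)_d alpha + b is linear in eta
        Phi = np.column_stack([Omv @ alpha for Omv in Omega_val_list])
        eta, *_ = np.linalg.lstsq(Phi, yval - b, rcond=None)
        eta = np.clip(eta, 0.0, None)   # keep the componentwise trade-offs nonnegative
    return eta, alpha, b
```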


5 Applications

For practical applications, the following iterative approach is used for solving non-convex cost functions such as (25). It can also be used for the efficient solution of convex optimization problems which become computationally heavy in the case of a large number of datapoints, as e.g. (21). A number of classification as well as regression problems are employed to illustrate the capabilities of the described approach. In the experiments, hyper-parameters such as the kernel parameter (taken to be constant over the components) and the regularization trade-off parameter $\gamma$ or $\xi$ were tuned using 10-fold cross-validation.

5.1 Weighted Graduated Non-Convexity Algorithm

An iterative scheme was developed based on the graduated non-convexity algorithm as proposed in (Blake, 1989; Nikolova, 1999; Antoniadis and Fan, 2001) for the optimization of non-convex cost functions. Instead of using a local gradient (or Newton) step, which can be quite involved, an adaptive weighting scheme is proposed: in every step, the relaxed cost function is optimized by using a weighted 2-norm where the weighting terms are chosen based on an initial guess for the global solution. For every symmetric loss function $\ell(|e|) : \mathbb{R}^+ \rightarrow \mathbb{R}^+$ which is monotonically increasing, there exists a bijective transformation $t : \mathbb{R} \rightarrow \mathbb{R}$ such that for every $e = y - f(x; \theta) \in \mathbb{R}$

$\ell(e) = (t(e))^2$.   (41)

The proposed algorithm for computing the solution for semi-norms employs iteratively convex relaxations of the prescribed non-convex norm. It is somewhat inspired by the simulated annealing technique for global optimization problems. The weighted version is based on the following derivation:

$\ell(e_k) = (\nu_k e_k)^2 \;\Leftrightarrow\; \nu_k = \sqrt{\frac{\ell(e_k)}{e_k^2}}$,   (42)

where the $e_k$ for all $k = 1, \dots, N$ are the residuals corresponding with the solution $\hat{\theta} = \arg\min_\theta \sum_k \ell(y_k - f(x_k; \theta))$. This is equal to the solution of the convex optimization problem $\hat{\theta} = \arg\min_\theta \sum_k (\nu_k (y_k - f(x_k; \theta)))^2$ for a set of $\nu_k$ satisfying (42). For more stable results, the gradients of the penalty function $\ell$ and of the quadratic approximation can be taken equal as well, by using an intercept parameter $\mu_k \in \mathbb{R}$ for all $k = 1, \dots, N$:

$\begin{cases} \ell(e_k) = (\nu_k e_k)^2 + \mu_k \\ \ell'(e_k) = 2 \nu_k^2 e_k \end{cases} \;\Leftrightarrow\; \begin{bmatrix} e_k^2 & 1 \\ 2 e_k & 0 \end{bmatrix} \begin{bmatrix} \nu_k^2 \\ \mu_k \end{bmatrix} = \begin{bmatrix} \ell(e_k) \\ \ell'(e_k) \end{bmatrix}$,   (43)

where $\ell'(e_k)$ denotes the derivative of $\ell$ evaluated in $e_k$, such that a minimum of $\mathcal{J}_\ell$ also minimizes the weighted equivalent (the derivatives are equal). Note that the constant intercepts $\mu_k$ are not relevant in the weighted optimization problem. Under the assumption that two consecutive relaxations $\ell^{(t)}$ and $\ell^{(t+1)}$ do not have too different global solutions, the following algorithm is a plausible practical tool:


Fig. 5. (a) Weighted $L_2$-norm (dashed) approximation $(\nu_k e_k)^2 + \mu_k$ of the $L_1$-norm (solid) $\ell(e) = |e|_1$, which follows from the linear set of equations (43) once the optimal $e_k$ are known; (b) the weighting terms $\nu_k$ for a sequence of $e_k$, $k = 1, \dots, N$, such that $(\nu_k e_k)^2 + \mu_k = |e_k|_1$ and $2\nu_k^2 e_k = \ell'(e_k) = \mathrm{sign}(e_k)$ for an appropriate $\mu_k$.

Algorithm 1 (Weighted Graduated Non-Convexity Algorithm). For the optimization of semi-norms $\ell(\cdot)$, a practical approach is based on gradually deforming a 2-norm into the specific loss function of interest. Let $\zeta$ be a strictly decreasing series $1, \zeta^{(1)}, \zeta^{(2)}, \dots, 0$. A plausible choice for the initial convex cost function is the least squares cost function $\mathcal{J}_{LS}(e) = \|e\|_2^2$.

1. Compute the solution $\theta^{(0)}$ for the $L_2$ norm $\mathcal{J}_{LS}(e) = \|e\|_2^2$ with residuals $e_k^{(0)}$;
2. Set $t = 0$ and $\nu^{(0)} = 1_N$;
3. Consider the following relaxed cost function $\mathcal{J}^{(t)}(e) = (1 - \zeta_t)\,\ell(e) + \zeta_t\,\mathcal{J}_{LS}(e)$;
4. Estimate the solution $\theta^{(t+1)}$ and corresponding residuals $e_k^{(t+1)}$ of the cost function $\mathcal{J}^{(t)}$ using the weighted approximation $\mathcal{J}_{\mathrm{approx}} = (\nu_k^{(t)} e_k)^2$ of $\mathcal{J}^{(t)}(e_k)$;
5. Reweight the residuals using the weighted approximative squared norms as derived in (43);
6. Set $t := t + 1$ and iterate steps (3)-(6) until convergence.

When iterating this scheme, most $\nu_k$ will be smaller than $1$ as the least squares cost function penalizes higher residuals (typically outliers). However, a number of residuals will have increasing weight as the least squares loss function is much lower for small residuals.
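A compact sketch of Algorithm 1 for a generic loss $\ell$ is given below; it assumes a user-supplied routine `solve_weighted_ls(nu)` returning the model parameters and residuals of the weighted least squares problem (for componentwise LS-SVMs this amounts to a reweighted version of (7)), and it uses the simpler value-matching weights (42) rather than the gradient-matching system (43).

```python
import numpy as np

def weighted_gnc(solve_weighted_ls, loss, N, n_steps=10, eps=1e-8):
    """Weighted graduated non-convexity (Algorithm 1), value-matching variant (42).

    solve_weighted_ls(nu) : hypothetical callback returning (theta, residuals) of the
                            weighted least squares problem with weights nu (length N).
    loss                  : target loss, applied elementwise to a residual vector.
    """
    nu = np.ones(N)
    theta, e = solve_weighted_ls(nu)                 # step 1: plain least squares fit
    for zeta in np.linspace(1.0, 0.0, n_steps):      # strictly decreasing series zeta_t
        relaxed = (1.0 - zeta) * loss(e) + zeta * e ** 2
        nu = np.sqrt(relaxed / (e ** 2 + eps))       # weights such that (nu_k e_k)^2 = J^(t)(e_k)
        theta, e = solve_weighted_ls(nu)             # steps 3-5: re-estimate with new weights
    return theta, e
```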

5.2 Regression examples

To illustrate the additive model estimation method, a classical example was constructed as in (Hastie and Tibshirani, 1990; Vapnik, 1998). The data were generated according to $y_k = 10\,\mathrm{sinc}(x_k^1) + 20\,(x_k^2 - 0.5)^2 + 10\,x_k^3 + 5\,x_k^4 + e_k$, where $e_k \sim \mathcal{N}(0, 1)$, $N = 100$ and the input data $X$ are randomly chosen from the interval $[0, 1]^{10}$. Because of the Gaussian nature of the noise model, only results from least squares methods are reported. The described techniques were applied on this dataset and tested on a test set. Furthermore, Table 1 reports whether the algorithm recovered the structure in the data (if so, the measure is 100%). The experiment using the smoothly thresholding penalized (STP) cost function was designed as follows: for each of the 10 components, one version was provided to the algorithm using a linear kernel and another using an RBF kernel (20 components in total). The regularization scheme was able to select the components with the appropriate kernel (a nonlinear RBF kernel for $X^1$ and $X^2$ and linear ones for $X^3$ and $X^4$), except for one spurious component (an RBF kernel was selected for the fifth component).

Fig. 6. Example of a toy dataset consisting of four input components $X^1$, $X^2$, $X^3$ and $X^4$ where only the first one is relevant to predict the output $f(x) = \mathrm{sinc}(x^1)$. A componentwise LS-SVM regressor (dashed line) has good prediction performance, while the $L_1$ penalized cost function of Subsection 3.2 also recovers the structure in the data, as the estimated components corresponding with $X^2$, $X^3$ and $X^4$ are sparse.
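For completeness, the data generation used above can be sketched as follows, assuming the normalized sinc, $\mathrm{sinc}(x) = \sin(\pi x)/(\pi x)$, which is one possible reading of the formula; the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)                      # fixed seed, for reproducibility only
N, D = 100, 10
X = rng.uniform(0.0, 1.0, size=(N, D))              # inputs drawn uniformly from [0, 1]^10
e = rng.normal(0.0, 1.0, size=N)                    # Gaussian noise, e_k ~ N(0, 1)
# only the first four of the ten components carry signal
y = 10 * np.sinc(X[:, 0]) + 20 * (X[:, 1] - 0.5) ** 2 + 10 * X[:, 2] + 5 * X[:, 3] + e
```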

5.3 Classification example

An additive model was estimated by an LS-SVM classifier based on the spam data as provided in the UCI benchmark repository, see e.g. (Hastie et al., 2001). The data consist of word frequencies from 4601 email messages, in a study to screen email for spam. A test set of size 1536 was randomly drawn from the data, leaving 3065 observations for training purposes. The inputs were preprocessed using the transformation $p(x) = \log(1 + x)$ and standardized to unit variance. Figure 7 gives the indicator functions as found using a regularization based technique to detect structure as described in Subsection 3.3.


                              Test set performance          Sparse components
Method                        L2       L1       L∞          % recovered
LS-SVMs                       0.1110   0.2582   0.8743      0%
componentwise LS-SVMs (7)     0.0603   0.1923   0.6249      0%
L1 regularization (21)        0.0624   0.1987   0.6601      100%
STP with RBF (25)             0.0608   0.1966   0.6854      100%
STP with RBF and lin (25)     0.0521   0.1817   0.5729      95%
Fusion with AReg (30)         0.0614   0.1994   0.6634      100%
Fusion with comp. reg. (40)   0.0601   0.1953   0.6791      100%

Table 1. Results on test data of numerical experiments on the Vapnik regression dataset. The sparseness is expressed as the rate of components which are selected only if the corresponding input is relevant (100% means the original structure was perfectly recovered).

6 Conclusions

This chapter describes nonlinear additive models based on LS-SVMs which are capable of handling higher dimensional data for regression as well as classification tasks. The estimation stage results from solving a set of linear equations with a size approximately equal to the number of training datapoints. Furthermore, the additive regularization framework is employed for formulating dedicated regularization schemes leading to structure detection. Finally, a fusion argument for component selection and structure detection based on training componentwise LS-SVMs and validation performance is introduced to improve the generalization abilities of the method. Advantages of using componentwise LS-SVMs include the efficient estimation of additive models with respect to classical practice, interpretability of the estimated model, opportunities towards structure detection and the connection with existing statistical techniques.

Fig. 7. Results on the spam dataset. The non-sparse components as found by application of Subsection 3.3 are shown, suggesting a number of useful indicator variables for classifying a mail message as spam or non-spam; the panels show the estimated components for the indicators "our", "remove", "hp", "!", "$" and the total number of capitals. The final classifier takes the form $f(X) = f_5(X^5) + f_7(X^7) + f_{25}(X^{25}) + f_{52}(X^{52}) + f_{53}(X^{53}) + f_{56}(X^{56})$, where 6 relevant components were selected out of the 56 provided indicators.

Acknowledgments. This research work was carried out at the ESAT laboratory of the Katholieke Universiteit Leuven. It is supported by grants from several funding agencies and sources: Research Council KU Leuven: Concerted Research Action GOA-Mefisto 666 (Mathematical Engineering), IDO (IOTA Oncology, Genetic networks), several PhD/postdoc & fellow grants; Flemish Government: Fund for Scientific Research Flanders (several PhD/postdoc grants, projects G.0407.02 (support vector machines), G.0256.97 (subspace), G.0115.01 (bio-i and microarrays), G.0240.99 (multilinear algebra), G.0197.02 (power islands), research communities ICCoS, ANMMM), AWI (Bil. Int. Collaboration Hungary/Poland), IWT (Soft4s (softsensors), STWW-Genprom (gene promotor prediction), GBOU-McKnow (Knowledge management algorithms), Eureka-Impact (MPC-control), Eureka-FLiTE (flutter modeling), several PhD grants); Belgian Federal Government: DWTC (IUAP IV-02 (1996-2001) and IUAP V-10-29 (2002-2006): Dynamical Systems and Control: Computation, Identification & Modelling), Program Sustainable Development PODO-II (CP/40: Sustainability effects of Traffic Management Systems); Direct contract research: Verhaert, Electrabel, Elia, Data4s, IPCOS. JS is an associate professor and BDM is a full professor at K.U. Leuven, Belgium.

References

Antoniadis, A. (1997). Wavelets in statistics: A review. Journal of the Italian Statistical Association (6), 97–144.

Antoniadis, A. and J. Fan (2001). Regularized wavelet approximations (with discussion). Journal of the American Statistical Association 96, 939–967.

Blake, A. (1989). Comparison of the efficiency of deterministic and stochastic algorithms for visual reconstruction. IEEE Transactions on Image Processing 11, 2–12.

Boyd, S. and L. Vandenberghe (2004). Convex Optimization. Cambridge University Press.

Cressie, N. A. C. (1993). Statistics for Spatial Data. Wiley.

Cristianini, N. and J. Shawe-Taylor (2000). An Introduction to Support Vector Machines. Cambridge University Press.

Donoho, D.L. and I.M. Johnstone (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika 81, 425–455.

Fan, J. (1997). Comments on "Wavelets in statistics: A review". Journal of the Italian Statistical Association (6), 131–138.

Fan, J. and R. Li (2001). Variable selection via nonconvex penalized likelihood and its oracle properties. Journal of the American Statistical Association 96(456), 1348–1360.

Frank, L.E. and J.H. Friedman (1993). A statistical view of some chemometric regression tools. Technometrics (35), 109–148.

Friedmann, J. H. and W. Stuetzle (1981). Projection pursuit regression. Journal of the American Statistical Association 76, 817–823.

Fu, W.J. (1998). Penalized regression: the bridge versus the lasso. Journal of Computational and Graphical Statistics (7), 397–416.

Fukumizu, K., F. R. Bach and M. I. Jordan (2004). Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research (5), 73–99.

Gunn, S. R. and J. S. Kandola (2002). Structural modelling with sparse kernels. Machine Learning 48(1), 137–163.

Hastie, T. and R. Tibshirani (1990). Generalized Additive Models. London: Chapman and Hall.

Hastie, T., R. Tibshirani and J. Friedman (2001). The Elements of Statistical Learning. Springer-Verlag, Heidelberg.

Linton, O. B. and J. P. Nielsen (1995). A kernel method for estimating structured nonparametric regression based on marginal integration. Biometrika 82, 93–100.

MacKay, D. J. C. (1992). The evidence framework applied to classification networks. Neural Computation 4, 698–714.

Neter, J., W. Wasserman and M.H. Kutner (1974). Applied Linear Statistical Models. Irwin.

Nikolova, M. (1999). Local strong homogeneity of a regularized estimator. SIAM Journal on Applied Mathematics 61, 633–658.

Pelckmans, K., J.A.K. Suykens and B. De Moor (2003). Additive regularization: Fusion of training and validation levels in kernel methods. Internal Report 03-184, ESAT-SISTA, K.U.Leuven (Leuven, Belgium). (Submitted for publication).

Poggio, T. and F. Girosi (1990). Networks for approximation and learning. Proceedings of the IEEE 78, 1481–1497.

Schoelkopf, B. and A. Smola (2002). Learning with Kernels. MIT Press.

Stone, C.J. (1982). Optimal global rates of convergence for nonparametric regression. Annals of Statistics 13, 1040–1053.

Stone, C.J. (1985). Additive regression and other nonparametric models. Annals of Statistics 13, 685–705.

Suykens, J.A.K. and J. Vandewalle (1999). Least squares support vector machine classifiers. Neural Processing Letters 9(3), 293–300.

Suykens, J.A.K., T. Van Gestel, J. De Brabanter, B. De Moor and J. Vandewalle (2002). Least Squares Support Vector Machines. World Scientific, Singapore.

Tibshirani, R.J. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society (58), 267–288.

Tibshirani, R.J. (1997). The lasso method for variable selection in the Cox model. Statistics in Medicine (16), 385–395.

Tikhonov, A.N. and V.Y. Arsenin (1977). Solution of Ill-Posed Problems. Winston, Washington DC.

Vapnik, V.N. (1998). Statistical Learning Theory. John Wiley and Sons.

Wahba, G. (1990). Spline Models for Observational Data. SIAM.
