ORDINAL LEAST SQUARES SUPPORT VECTOR MACHINES - A DISCRIMINANT ANALYSIS APPROACH

Kristiaan Pelckmans, Peter Karsmakers, Johan A.K. Suykens, Bart De Moor

K.U. Leuven, ESAT/SCD-SISTA

Kasteelpark Arenberg 10, B-3001 Leuven, Belgium

ABSTRACT

This paper explores the extension of the classical ideas behind linear discriminant analysis to the problem of ordinal regression. It is shown how this reasoning fits in a framework of Least Squares Support Vector Machines (LS-SVMs) and kernel machines, hereby allowing for a nonlinear extension. The resulting method is conceived as a practical alternative to recently proposed, computationally demanding formulations based on maximal margin.

1. INTRODUCTION

Discriminant analysis is concerned with the task of learning optimal binary classification rules from a given set of class distributions (parametric approach) or a given finite set of observations (empirical approach). Linear Discriminant Analysis (LDA) and its extensions constitute the basic methodology of the statistician [12] and the machine learner [8] in the first setting, while Support Vector Machines [2, 18] gain increasing popularity in the empirical setting due to their intuitive appeal, the proper extension to the nonlinear case using kernels, their theoretical foundations and convex formulation.

While the binary case can be extended straightforwardly towards the multi-class case using coding schemes (see e.g. [5, 10]), the special case of multi-class classification where the different class labels possess an intrinsic order is recognized as a learning task on its own, denominated as the task of ordinal regression [11]. The crucial issue discriminating this task from classical function approximation tasks is that in the case of ordinal data, the response labels possess no (given) metric (norm); e.g. the difference between the 1st and the 2nd class is not necessarily the same as the difference between the 10th and the 11th class. The statistical approach as described in [11] proceeds using a generalized linear model where the link function provides the parametric cumulative distribution of the class distributions. Recently, maximal margin methods were described for the similar task of ranking [9] or [13, 4], with successful applications to information retrieval tasks and collaborative filtering amongst others. The Gaussian process framework and Bayesian inference techniques were studied in this context in [3].

Least Squares Support Vector Machines (LS-SVMs) [16, 15] constitute a method for building a binary classification rule from observed labeled samples. The naming convention is motivated by the similarity of the derivation with $L_2$-SVMs, where additionally inequalities are substituted by equalities. Other cornerstones are the use of the kernel trick and the fact that the final solution follows from a well-conditioned linear system. The formulation of ordinal LS-SVMs (or oLS-SVMs) retains those advantages, and additionally shows that the derivation underlying LS-SVMs will not necessarily lead to a simple regression on the labels in all cases. On a theoretical account, SVMs however do differ from LS-SVMs in the sense of the distribution-free learning theory supporting the former. The dependence of the least squares approach on the Gaussian assumption, however, can be relaxed theoretically [14] and practically (using robust versions as described in [15]). The view essentially differs from the approach taken by the Gaussian process approach [3] as it does not concern modelling of the data-generation process.

This paper is organized as follows. Section 2 discusses the principles behind the discriminant analysis for ordinal data. Section 3 implements those ideas in the form of ordinal LS-SVMs, and topics such as model selection are discussed briefly.

2. AN ORDINAL CLASSIFIER

2.1. ORDINAL LINEAR DISCRIMINANT ANALYSIS

Consider a set of samples $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N \subset \mathbb{R}^d \times \{1, \dots, L\}$ sampled i.i.d. from a joint distribution $F_X \times F_Y$ with $L \in \mathbb{N}_0$. The labels $y_i$ reflect an ordering of the samples (but not a metric) such that if $y_i > y_j$ and $y_j > y_l$, then $y_i > y_l$ for all $i, j, l = 1, \dots, N$. Let $S_l$ denote the indices of the data-points with label $l$ such that $y_i = l$ for all $i \in S_l$. Assume that the samples $\{x_i\}_{i\in S_l}$ of each class are distributed as a Gaussian distribution with mean $m_l \in \mathbb{R}^d$ and variance $\Sigma_l = \Sigma \in \mathbb{R}^{d\times d}$, formally

$$x_i \sim \mathcal{N}(x_i; m_l, \Sigma_l) = \frac{1}{(2\pi)^{d/2}|\Sigma_l|^{1/2}}\exp\left(-\frac{1}{2}(x_i - m_l)^T\Sigma_l^{-1}(x_i - m_l)\right). \quad (1)$$
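To make this setting concrete, here is a minimal sketch (Python with NumPy; the class means, covariance and sample sizes are illustrative choices, not taken from the paper) that draws a toy ordinal dataset under the stated assumption of Gaussian classes sharing one covariance matrix and having means ordered along a common direction.

```python
import numpy as np

def sample_ordinal_gaussians(n_per_class=50, L=3, seed=0):
    """Draw n_per_class samples from each of L Gaussian classes with a shared covariance."""
    rng = np.random.default_rng(seed)
    Sigma = np.array([[1.0, 0.3],
                      [0.3, 1.0]])                            # shared covariance Sigma_l = Sigma
    means = [np.array([2.0 * l, 1.0 * l]) for l in range(L)]  # means ordered along one direction
    X = np.vstack([rng.multivariate_normal(m, Sigma, n_per_class) for m in means])
    y = np.repeat(np.arange(1, L + 1), n_per_class)           # ordinal labels 1, ..., L
    return X, y

X, y = sample_ordinal_gaussians()
print(X.shape, y.shape)   # (150, 2) (150,)
```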



Fig. 1. A multi-class approach via the combination of atomic pairwise classification rules (solid lines) between the classes (contour lines) in general does not lead to estimates which preserve the same order in all cases (the decision lines often cross each other).

Let the vector $M = (m_1^T, \dots, m_L^T)^T \in \mathbb{R}^{L\times d}$ contain the mean vectors. Instead of studying the optimal classifiers between each pair of distributions (LDA, [8]), we are concerned with the problem of deriving the best predictor reflecting the ordering of the labels. Therefore, the linear utility function underlying statistical ordinal regression [11] is employed: the data are projected as follows

$$U(x; w) = w^T x \triangleq u. \quad (2)$$

Since the resulting sample $\{u_i\}_{i=1}^N$ is univariate, it possesses a well-defined ordinal scale ($u_i < u_j$ or $u_i = u_j$ or $u_i > u_j$ for all $i, j = 1, \dots, N$). The use of a global linear utility function results in parallel decision planes, which is a necessary condition for obeying the ordering in all cases (see Fig. 1). Now we investigate the marginal distribution of $\{u_i\}_{i\in S_l}$ by projecting the corresponding distribution on $w$:

$$u_i \sim \mathcal{N}(u; w^T m_l, w^T\Sigma w) = \frac{1}{\sqrt{2\pi(w^T\Sigma_l w)}}\exp\left(-\frac{(u - w^T m_l)^2}{2(w^T\Sigma_l w)}\right). \quad (3)$$

Let $I$ be the indicator function such that $I(z) = 1$ if $z$ is true and zero otherwise. Given those univariate distributions, the rules deciding whether a point belongs to the $l$th class become

$$c_l(x; w, M) = I\left(v_{l-1,l} < U(x; w) \le v_{l,l+1}\right), \quad (4)$$

where $v_{l,l+1}$ is defined as the mean $\frac{1}{2}(w^T m_l + w^T m_{l+1})$ for all $l = 1, \dots, L-1$, and where we define $v_{0,1} = -\infty$ and $v_{L,L+1} = +\infty$. This rule is consistent with the given ordering whenever $v_{1,2} \le \dots \le v_{L-1,L}$. Now we turn to the task of estimating the parameter vector $w$ such that the above rules perform optimally. Hereto, we have to quantify the probability mass which violates the rules, i.e. the risk $R(w; M, \Sigma)$



Fig. 2. A schematic view of the principles behind oLDA. Assuming that the class distributions are approximately Gaussian, the marginal class distributions (dashed-dotted lines) obtained by projecting the class distributions (contour lines) on the utility function (solid line) can be computed explicitly. The risk then coincides with the overlap between the different distributions (shaded areas). The ordinal rule with minimal risk corresponds with a set of threshold rules on the utility function (dashed lines).

defined as (see e.g. [12])

$$R(w; M, \Sigma) = \frac{1}{L+1}\sum_{l=1}^{L-1}\left[\int_{u \ge v_{l,l+1}}\mathcal{N}(u; w^T m_l, w^T\Sigma w)\, du + \int_{u \le v_{l,l+1}}\mathcal{N}(u; w^T m_{l+1}, w^T\Sigma w)\, du\right], \quad (5)$$

reflecting only the overlap of two consecutive distributions. Using the definition of the cumulative distribution function of the univariate normal distribution in terms of the monotonic function $\operatorname{erf} : \mathbb{R} \to \,]0, 1[$, one gets for each $l = 1, \dots, L-1$

$$\int_{u > v_{l,l+1}}\mathcal{N}(u; w^T m_l, w^T\Sigma w)\, du = \int_{u < w^T m_l - (v_{l,l+1} - w^T m_l)}\mathcal{N}(u; w^T m_l, w^T\Sigma w)\, du = \frac{1}{2}\left(1 + \operatorname{erf}\left(\frac{v_{l,l+1} - w^T m_l}{\sqrt{2\, w^T\Sigma w}}\right)\right), \quad (6)$$

due to the symmetry of the normal distribution around the mean $w^T m_l$. Similarly, $\int_{u < v_{l,l+1}}\mathcal{N}(u; w^T m_{l+1}, w^T\Sigma w)\, du = \frac{1}{2}\left(1 + \operatorname{erf}\left(\frac{v_{l,l+1} - w^T m_{l+1}}{\sqrt{2\, w^T\Sigma w}}\right)\right)$. This expression can be simplified by using the fact that erf is symmetric, and by combining both equalities and substituting $v_{l,l+1} = \frac{1}{2}(w^T m_l + w^T m_{l+1})$, such that

$$R(w; M, \Sigma) \propto \sum_{l=1}^{L-1} -\operatorname{erf}\left(\frac{w^T(m_{l+1} - m_l)}{\sqrt{2}\,\sqrt{w^T\Sigma w}}\right), \quad (7)$$


assuming that the ordering inequalities $w^T m_1 \le \dots \le w^T m_L$ hold. It is interesting to note at this point that the above risk is also useful when the Gaussian assumption does not hold. As long as the second moment $w^T\Sigma w$ of the class distribution exists, Chebyshev's inequality can be used to bound the probability mass beyond a threshold given the variance, see e.g. [6]. Furthermore, note that the specific form of the true distribution translates into a corresponding function similar to erf which is monotonically increasing by construction.

Remark further that this risk function is invariant w.r.t. the norm of the parameter vector $\|w\|$. An optimal predictor is obtained by minimizing the theoretical risk (or the measure of overlap between the distributions of the different classes) w.r.t. the parameters $w$ of the utility function. A learning machine implementing the minimization of $R(w; M, \Sigma)$ for given values of $M$ and $\Sigma$ may be referred to as ordinal Linear Discriminant Analysis (oLDA). Note that in general, the risk minimization does not follow straightforwardly from an analytical expression or a standard convex optimization problem due to the use of the erf function.
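As a rough sketch of what an oLDA learning machine has to do, the snippet below (Python with NumPy/SciPy; the class means and covariance are illustrative placeholders) evaluates the risk of eq. (7) for a candidate direction $w$ and hands it to a generic numerical optimizer, since no closed form or standard convex program is available; the thresholds and decision rule then follow eq. (4).

```python
import numpy as np
from scipy.special import erf
from scipy.optimize import minimize

def olda_risk(w, M, Sigma):
    """Risk proportional to eq. (7): minus the sum of erf terms over consecutive class means."""
    sigma = np.sqrt(w @ Sigma @ w)
    diffs = np.diff(M, axis=0) @ w                 # w^T (m_{l+1} - m_l) for l = 1, ..., L-1
    return -np.sum(erf(diffs / (np.sqrt(2.0) * sigma)))

# Illustrative (estimated) class means and pooled within-class covariance.
M_hat = np.array([[0.0, 0.0], [1.5, 1.0], [3.0, 2.0]])
Sigma_hat = np.array([[1.0, 0.3], [0.3, 1.0]])

res = minimize(olda_risk, x0=np.ones(2), args=(M_hat, Sigma_hat))
w_star = res.x / np.linalg.norm(res.x)             # the risk is invariant to the norm of w

b = M_hat @ w_star                                 # projected class means b_l = w^T m_l
v = (b[:-1] + b[1:]) / 2.0                         # thresholds v_{l,l+1} as in eq. (4)
predict = lambda x: int(np.searchsorted(v, w_star @ x)) + 1
print("direction:", w_star, " predicted class of [1.4, 0.9]:", predict(np.array([1.4, 0.9])))
```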

2.2. THE BINARY CASE

The case of $L = 2$ is considered here in some detail. Since erf is a monotonically increasing function, the risk minimization may be implemented as follows

$$\min_w R_2(w; m_1, m_2, \Sigma) \;\Leftrightarrow\; \max_w \frac{w^T(m_2 - m_1)}{\sqrt{2\, w^T\Sigma w}}. \quad (8)$$

Since the risk as defined here is not dependent on the norm $\|w\|$, we fix $\|w\|$ such that $w^T m_1 + w_0 = -1$ and $w^T m_2 + w_0 = 1$ for a corresponding $w_0 \in \mathbb{R}$, resulting in

$$\min_{w, w_0} \sqrt{2\, w^T\Sigma w} \quad \text{s.t.}\quad w^T m_1 + w_0 = -1,\; w^T m_2 + w_0 = 1. \quad (9)$$

At this point, the following insight avoids the need for having an accurate description of the multivariate means and the covariance matrix. As one is only interested in the marginal distribution of the projection, knowledge of the variance of the univariate sets $\{w^T x_i\}_{i\in S_l}$ is sufficient for minimizing the

risk as previously. Furthermore, we substitute the sample mean and variances of those sets in the estimator: (i) the shifted marginal mean $w^T m_l + w_0$ is substituted by the sample mean $\frac{1}{|S_l|}\sum_{i\in S_l} w^T x_i + w_0$, and (ii) the variance $w^T\Sigma w$ is replaced by the sample estimate $\frac{1}{N}\left(\sum_{i\in S_1}(w^T x_i + w_0 + 1)^2 + \sum_{i\in S_2}(w^T x_i + w_0 - 1)^2\right)$. To counteract the finite sample effect of the minimum sample variance deviating too much from the true marginal variance (or overfitting), regularization is applied. Including a regularized version of the sample variance defined as $w^T(\hat\Sigma + 1/\gamma\, I_d)w$ for a fixed $\gamma > 0$ (see Regularized Discriminant Analysis [7]) yields the following modified optimization problem

$$(w^*, w_0^*) = \arg\min_{w, w_0} J_2(w, w_0) = \sum_{i\in S_1}(w^T x_i + w_0 + 1)^2 + \sum_{i\in S_2}(w^T x_i + w_0 - 1)^2 + \frac{1}{\gamma}w^T w$$
$$\text{s.t.}\quad \begin{cases} \frac{1}{|S_1|}\sum_{i\in S_1} w^T x_i + w_0 = -1 & (a)\\[2pt] \frac{1}{|S_2|}\sum_{i\in S_2} w^T x_i + w_0 = +1, & (b)\end{cases} \quad (10)$$

which establishes a connection between LDA, RDA and LS-SVMs [15], where in the latter the constraints (a) and (b) are omitted. For additional links between LS-SVMs and FDA, see e.g. [17].
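To illustrate the point that the binary problem (10) is solved by a single linear system, a small sketch follows (Python with NumPy; the data and parameter values are illustrative): the equality-constrained quadratic objective is written down and its KKT conditions are solved directly.

```python
import numpy as np

def binary_lsc(X1, X2, gamma=1.0):
    """Solve problem (10): regularized least squares with class-mean equality constraints."""
    d = X1.shape[1]
    X = np.vstack([X1, X2])
    t = np.concatenate([-np.ones(len(X1)), np.ones(len(X2))])   # targets -1 / +1
    Xa = np.hstack([X, np.ones((len(X), 1))])                   # augmented rows [x_i^T, 1]
    D = np.diag(np.r_[np.ones(d), 0.0])                         # penalize w but not w0
    H = Xa.T @ Xa + D / gamma                                   # quadratic term of J_2
    # Equality constraints (a), (b): class means of w^T x + w0 equal -1 and +1.
    A = np.vstack([np.r_[X1.mean(axis=0), 1.0], np.r_[X2.mean(axis=0), 1.0]])
    c = np.array([-1.0, 1.0])
    # KKT system of the equality-constrained quadratic program.
    KKT = np.block([[H, A.T], [A, np.zeros((2, 2))]])
    rhs = np.concatenate([Xa.T @ t, c])
    sol = np.linalg.solve(KKT, rhs)
    return sol[:d], sol[d]                                      # w, w0

rng = np.random.default_rng(1)
X1 = rng.normal([0.0, 0.0], 1.0, size=(40, 2))
X2 = rng.normal([2.0, 1.0], 1.0, size=(40, 2))
w, w0 = binary_lsc(X1, X2, gamma=10.0)
print(w, w0)
```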

2.3. ORDINAL LEAST SQUARES CLASSIFIER

Given only a set of samples $\mathcal{D}$, the accurate estimation of the class means $\{m_l\}_{l=1}^L$ and the covariance matrix $\Sigma \in \mathbb{R}^{d\times d}$ often prohibits the construction of an effective rule, especially when $d$ is high. However, this can be avoided using the fact that in the end one is only interested in the means $\{b_l \triangleq w^T m_l\}_{l=1}^L$ and the variance $w^T\Sigma w$ of the marginal distributions. Hereto, the formulation of oLDA is modified in different ways as in the previous subsection, using the sample class mean ($b_l = \frac{1}{|S_l|}\sum_{i\in S_l} w^T x_i + w_0$) and the regularized sample variance ($\frac{1}{\gamma}w^T w + \frac{1}{N}\sum_{i=1}^N (w^T x_i + w_0 - b_{l|i})^2$), where $b_{l|i}$ denotes the value $b_l$ of the class $l$ containing the $i$th sample. Risk minimization is obtained as

$$\min_{w, b}\; \frac{1}{2}\hat R_\gamma(w, b; \hat\Sigma) \propto \sum_{l=1}^{L-1}\operatorname{erf}\left(\frac{-(b_{l+1} - b_l)}{\sqrt{2\, w^T(\hat\Sigma + 1/\gamma\, I)w}}\right), \quad (11)$$

such that $b_1 \le \dots \le b_L$ holds, and where $b = (b_1, \dots, b_L)^T \in \mathbb{R}^L$. Instead of solving this problem, the following approximation is used

$$\min_{w, b}\; J_{\gamma,\lambda}(\hat\Sigma, w, b) = w^T(\gamma\hat\Sigma + I_d)w + \frac{\lambda}{2}\sum_{l=1}^{L-1}(b_{l+1} - b_l)^2, \quad (12)$$

with $\lambda > 0$ an appropriate constant (see e.g. Fig. 3). Although this approximation is only based on intuitive grounds, the resulting rule turns out to be a good approximation if the interval $b_1, \dots, b_L$ is hinged on $-1$ and $1$, similarly as in the binary case. In this case the different values of $b$ are forced towards an equidistantly spaced configuration preserving the specified ordering. Combining those ideas motivates the following ordinal classifier, making use only of least squares criteria and equality constraints.

Fig. 3. The erf-function (solid line) is convex in case the domain is restricted to the positive half plane. The ordinal least squares classifier approximates this effect in terms of an $L_2$ function (dashed line).

Definition 1 (Ordinal Least Squares Classifier) Let $\gamma > 0$ and $\lambda > 0$ be given hyper-parameters, and let the set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$ with $x_i \in \mathbb{R}^d$ and $y_i \in \{1, \dots, L\}$ be given. Then the ordinal least squares classifier is defined as follows:

$$(w^*, w_0^*, b^*) = \arg\min_{w_0, w, b}\; J_{\gamma,\lambda}(w_0, w, b) = \frac{\lambda}{2}\sum_{l=1}^{L-1}\|b_{l+1} - b_l\|^2 + \frac{1}{2}w^T w + \frac{\gamma}{2}\sum_{i=1}^N\|w^T x_i + w_0 - b_{l|i}\|^2$$
$$\text{s.t.}\quad \begin{cases} b_1 = -1,\quad b_L = 1\\[2pt] \frac{1}{|S_l|}\sum_{i\in S_l} w^T x_i + w_0 = b_l \quad \forall l = 1, \dots, L,\end{cases} \quad (13)$$

where $b_{l|i}$ denotes the mean $b_l \sim w^T m_l$ corresponding with the $l$th class containing the $i$th sample. Let $C_S \in \mathbb{R}^{N\times L}$ be defined as the class indicator matrix such that $C_{S,il} = 1$ iff $i \in S_l$ and zero otherwise. Then we can also write $b_{l|i}$ as $b_{l|i} = C_{S,i} b$.
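A small sketch (Python with NumPy, assuming labels 1 through L) of the class indicator matrix $C_S$ and how it picks out $b_{l|i}$ for every sample at once:

```python
import numpy as np

def class_indicator(y, L):
    """C_S with C_{S,il} = 1 iff sample i belongs to class l (labels assumed to be 1..L)."""
    C = np.zeros((len(y), L))
    C[np.arange(len(y)), np.asarray(y) - 1] = 1.0
    return C

# b_{l|i} for all samples at once is the matrix-vector product C_S @ b.
y = np.array([1, 1, 2, 3, 3])
b = np.array([-1.0, 0.2, 1.0])
print(class_indicator(y, 3) @ b)        # [-1.  -1.   0.2  1.   1. ]
```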

Most importantly, this learning machine does not implement a simple regression on the labels, but instead allows a different metric of the responses (as parameterized by the vector $b = (b_1, \dots, b_L)$). When the parameter $\lambda \to \infty$, it can be shown that the vector $b$ is spread equidistantly over the interval $[-1, 1]$, which effectively equals the approach of regression on the labels. When $\lambda \to 0$, the ordering constraints are in general not satisfied. A proper value of $\lambda$ then results in a vector $b$ obeying the ordering while the values of $b$ can vary reasonably flexibly over the interval $[-1, 1]$.
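The following sketch (Python with NumPy/SciPy; names, data and parameter values are illustrative) solves the primal problem (13) with a generic equality-constrained optimizer rather than the dedicated linear system derived in Section 3, simply to make the definition concrete.

```python
import numpy as np
from scipy.optimize import minimize

def fit_ordinal_lsc(X, y, gamma=1.0, lam=1.0):
    """Solve the primal problem (13) for the linear ordinal least squares classifier."""
    N, d = X.shape
    labels = np.unique(y)                  # assumed to be 1, ..., L
    L = len(labels)

    def unpack(z):
        return z[:d], z[d], z[d + 1:]      # w, w0, b

    def objective(z):
        w, w0, b = unpack(z)
        resid = X @ w + w0 - b[y - 1]      # w^T x_i + w0 - b_{l|i}
        return (lam / 2) * np.sum(np.diff(b) ** 2) + 0.5 * w @ w + (gamma / 2) * np.sum(resid ** 2)

    def eq_constraints(z):
        w, w0, b = unpack(z)
        class_means = np.array([np.mean(X[y == l] @ w) + w0 for l in labels])
        return np.concatenate([[b[0] + 1.0, b[-1] - 1.0], class_means - b])

    z0 = np.concatenate([np.zeros(d + 1), np.linspace(-1, 1, L)])
    res = minimize(objective, z0, method='SLSQP',
                   constraints={'type': 'eq', 'fun': eq_constraints})
    return unpack(res.x)

# Illustrative use on three Gaussian classes whose means increase along one direction.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([2.0 * l, 1.0 * l], 1.0, size=(30, 2)) for l in range(3)])
y = np.repeat(np.array([1, 2, 3]), 30)
w, w0, b = fit_ordinal_lsc(X, y, gamma=5.0, lam=1.0)
print("thresholds b:", b)                  # b_1 = -1, b_L = +1, interior values in between
```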

3. A KERNEL-BASED APPROACH TO ORDINAL REGRESSION TASKS

3.1. ORDINAL LEAST SQUARES SUPPORT VECTOR MACHINES

A primal-dual derivation of the ordinal least squares classifier presented in the previous subsection is given, hereby allowing for the kernel trick as standard in LS-SVMs and kernel machines [18, 15]. Hereto, we reformulate the ordinal least squares classifier as a constrained optimization problem as follows:

$$\min_{w, w_0, b, e}\; J'_{\gamma,\lambda}(w, b, e) = \frac{\lambda}{2}\sum_{l=1}^{L-1}\|b_{l+1} - b_l\|^2 + \frac{1}{2}w^T w + \frac{\gamma}{2}\sum_{i=1}^N e_i^2$$
$$\text{s.t.}\quad \begin{cases} w^T\varphi(x_i) + w_0 + e_i = b_{l|i} & \forall i = 1, \dots, N\\ b_1 = -1,\quad b_L = 1\\ \sum_{i\in S_l}\left(w^T\varphi(x_i) + w_0 - b_l\right) = 0 & \forall l = 1, \dots, L,\end{cases} \quad (14)$$

where $\varphi : \mathbb{R}^d \to \mathbb{R}^{d_\varphi}$ denotes the feature map induced by a positive definite kernel. The solution of this optimization problem can be found by solving a linear system.

Proposition 1 (Ordinal LS-SVM) Define¹

• $C_S' = C_S(:, 2:L-1) \in \mathbb{R}^{N\times(L-2)}$ and let $C_S'' = C_S(:, L) - C_S(:, 1) \in \mathbb{R}^N$.

• Let $D \in \{-1, 0, 1\}^{(L-1)\times L}$ be the first order difference matrix with $D_{l,l} = 1$, $D_{l,l+1} = -1$ for all $l = 1, \dots, L-1$ and zero elsewhere.

• Let $\Delta = D^T D \in \mathbb{R}^{L\times L}$, then $\Delta' = \Delta(:, 2:L-1) \in \mathbb{R}^{L\times(L-2)}$, and let $\Delta'' = \Delta(2:L-1, 2:L-1) \in \mathbb{R}^{(L-2)\times(L-2)}$.

• Let $I_L \in \mathbb{R}^{L\times L}$ be the identity matrix of size $L$, then $I_L' = I_L(:, 2:L-1) \in \mathbb{R}^{L\times(L-2)}$.

• Let $b'' \in \mathbb{R}^L$ be defined as the vector $b'' = (-1, 0, \dots, 0, 1)^T$, and let $b' = (b_2, \dots, b_{L-1})^T \in \mathbb{R}^{L-2}$ collect the interior thresholds.

• The vector $\delta(\lambda) \in \mathbb{R}^{L-2}$ is defined as $\delta(\lambda) = \lambda\left(\Delta(2:L-1, L) - \Delta(2:L-1, 1)\right)$.

The optimum to (14) is characterized by the following dual set of linear equations

$$A\begin{pmatrix}\alpha\\ w_0\\ b'\\ \xi\end{pmatrix} = \begin{pmatrix}C_S''\\ 0\\ \delta(\lambda)\\ b''\end{pmatrix}. \quad (15)$$

Let $1_N = (1, \dots, 1)^T \in \mathbb{R}^N$ and let $I_N \in \mathbb{R}^{N\times N}$ be the identity matrix. The symmetric positive semidefinite matrix $A \in \mathbb{R}^{(N+2L-1)\times(N+2L-1)}$ is defined as

$$A = \begin{pmatrix} \Omega + \frac{1}{\gamma}I_N & 1_N & -C_S' & \bar\Omega_N\\ 1_N^T & 0 & 0 & 1_L^T\\ -C_S'^T & 0 & -\lambda\Delta'' & -I_L'^T\\ \bar\Omega_N^T & 1_L & -I_L' & \bar\Omega \end{pmatrix}, \quad (16)$$

where $\Omega$ is the kernel matrix such that $\Omega_{ij} = \varphi(x_i)^T\varphi(x_j)$ for all $i, j = 1, \dots, N$. Let $\bar\Omega_N \in \mathbb{R}^{N\times L}$ be defined as $\bar\Omega_{N,il} = \frac{1}{|S_l|}\sum_{j\in S_l}\Omega_{ij}$, and let $\bar\Omega \in \mathbb{R}^{L\times L}$ be defined as $\bar\Omega = \bar\Omega_N^T\bar\Omega_N$.

¹The MATLAB notation $A(:, i:l)$ denotes columns $i$ to $l$ of the matrix $A$.

Proof: The Lagrangian of the constrained problem (14) becomes

$$\mathcal{L}(w, w_0, b, e; \alpha, \beta, \xi) = \frac{\lambda}{2}b^T\Delta b + \frac{1}{2}w^T w + \frac{\gamma}{2}\sum_{i=1}^N e_i^2 - \sum_{i=1}^N \alpha_i\left(w^T\varphi(x_i) + w_0 + e_i - b_{l|i}\right) - \beta_1(b_1 + 1) - \beta_L(b_L - 1) - \sum_{l=1}^L \xi_l\,\frac{1}{|S_l|}\left(\sum_{i\in S_l}\left(w^T\varphi(x_i) + w_0 - b_l\right)\right), \quad (17)$$

with multipliers $\alpha = (\alpha_1, \dots, \alpha_N)^T \in \mathbb{R}^N$, $\xi = (\xi_1, \dots, \xi_L)^T \in \mathbb{R}^L$ and $\beta_1, \beta_L \in \mathbb{R}$. The first order conditions for optimality become

$$\begin{cases}
\partial\mathcal{L}/\partial w = 0 &\rightarrow\; w = \sum_{i=1}^N \alpha_i\varphi(x_i) + \sum_{l=1}^L \xi_l\bar\varphi_l\\
\partial\mathcal{L}/\partial w_0 = 0 &\rightarrow\; \sum_{i=1}^N \alpha_i + \sum_{l=1}^L \xi_l = 0\\
\partial\mathcal{L}/\partial b_1 = 0 &\rightarrow\; \lambda\Delta_1 b = -\sum_{i\in S_1}\alpha_i + \beta_1 - \xi_1\\
\partial\mathcal{L}/\partial b_L = 0 &\rightarrow\; \lambda\Delta_L b = -\sum_{i\in S_L}\alpha_i + \beta_L - \xi_L\\
\partial\mathcal{L}/\partial b_l = 0 &\rightarrow\; \lambda\Delta_l b = -\sum_{i\in S_l}\alpha_i - \xi_l \quad (1 < l < L)\\
\partial\mathcal{L}/\partial e_i = 0 &\rightarrow\; \gamma e_i = \alpha_i \quad \forall i = 1, \dots, N\\
\partial\mathcal{L}/\partial \alpha_i = 0 &\rightarrow\; w^T\varphi(x_i) + w_0 + e_i = b_{l|i} \quad \forall i = 1, \dots, N\\
\partial\mathcal{L}/\partial \beta_1 = 0 &\rightarrow\; b_1 = -1\\
\partial\mathcal{L}/\partial \beta_L = 0 &\rightarrow\; b_L = 1\\
\partial\mathcal{L}/\partial \xi_l = 0 &\rightarrow\; w^T\bar\varphi_l + w_0 = b_l \quad \forall l = 1, \dots, L,
\end{cases} \quad (18)$$

where $\bar\varphi_l \in \mathbb{R}^{d_\varphi}$ is defined as $\bar\varphi_l = \frac{1}{|S_l|}\sum_{i\in S_l}\varphi(x_i)$ for all $l = 1, \dots, L$. Eliminating the primal variables $w$, $b_1$, $b_L$ and $e_i$, and the dual variables $\beta_1$ and $\beta_L$, yields the dual system (15), where the symmetric positive semi-definite matrix $\Omega \in \mathbb{R}^{N\times N}$ contains the inner products between every pair of data-points, $\Omega_{ij} = \Omega_{ji} = \varphi(x_i)^T\varphi(x_j)$. This can be written entirely in terms of inner products, hereby allowing for the application of the kernel trick $K(x_i, x_j) \triangleq \varphi(x_i)^T\varphi(x_j)$, where $K : \mathbb{R}^d\times\mathbb{R}^d \to \mathbb{R}$ is a proper inner product function referred to as the kernel. Note that one can also eliminate the primal variables $b$ from the dual set of equations, resulting in smaller dual systems.
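As a small aid to reading (15)-(16), the sketch below (Python with NumPy; the RBF kernel is an assumed example choice and all names are illustrative) builds the kernel blocks $\Omega$, $\bar\Omega_N$ and $\bar\Omega$ from training data, following the definitions given with (16).

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian RBF kernel K(a, b) = exp(-||a - b||^2 / (2 sigma^2)), an example kernel choice."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sq / (2.0 * sigma**2))

def kernel_blocks(X, y, sigma=1.0):
    """Omega, Omega_bar_N and Omega_bar as they appear in the dual system (15)-(16)."""
    labels = np.unique(y)                                        # assumed to be 1, ..., L
    Omega = rbf_kernel(X, X, sigma)                              # Omega_ij = K(x_i, x_j)
    Omega_bar_N = np.column_stack([Omega[:, y == l].mean(axis=1) for l in labels])
    Omega_bar = Omega_bar_N.T @ Omega_bar_N                      # L x L block
    return Omega, Omega_bar_N, Omega_bar
```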

Using the condition $w = \sum_{i=1}^N \alpha_i\varphi(x_i) + \sum_{l=1}^L \xi_l\bar\varphi_l$, and by switching sums, the estimated utility function can be evaluated as follows

$$\hat U(x^*) = \sum_{i=1}^N\left(\alpha_i + \frac{\xi_{l|i}}{|S_{l|i}|}\right)K(x_i, x^*), \quad (19)$$

where $\alpha$ and $\xi$ are the solutions to (15) and $\xi_{l|i}$ denotes the multiplier belonging to the $l$th class containing the $i$th sample. The resulting classifier rules become

$$\left\{\hat c_l(x) = I\left(\hat v_{l-1,l} < \hat U(x) + w_0 \le \hat v_{l,l+1}\right)\right\}_{l=1}^L, \quad (20)$$

where again $\hat v_{l,l+1}$ is defined as $\frac{1}{2}(\hat b_l + \hat b_{l+1})$ for all $l = 1, \dots, L-1$, and $\hat v_{0,1} = -\infty$ and $\hat v_{L,L+1} = \infty$. The use of the kernel representation is advantageous in case a linear rule does not suffice (see e.g. Fig. 4).

Fig. 4. Example of the ordinal LS-SVM on a nonlinear toy example with quadratic boundaries. The linear ordinal least squares classifier is represented by the parallel solid lines; the nonlinear oLS-SVM result is visualized by the nonlinear contours.
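A sketch of how the fitted oLS-SVM is evaluated (Python with NumPy; the RBF kernel is again only an example choice, and alpha, xi, w0 and the thresholds b are assumed to have been obtained by solving the dual system (15)): it evaluates $\hat U(x^*)$ as in (19) and applies the threshold rule (20).

```python
import numpy as np

# Example kernel choice (same RBF kernel as in the earlier sketch).
def rbf_kernel(A, B, sigma=1.0):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sq / (2.0 * sigma**2))

def predict_ordinal(Xstar, X, y, alpha, xi, w0, b, sigma=1.0):
    """Evaluate U_hat(x*) as in (19) and apply the threshold rule (20)."""
    labels = np.unique(y)                               # assumed to be 1, ..., L
    counts = np.array([(y == l).sum() for l in labels])
    coef = alpha + xi[y - 1] / counts[y - 1]            # alpha_i + xi_{l|i} / |S_l|
    U = rbf_kernel(Xstar, X, sigma) @ coef + w0         # utility values of the test points
    v = (b[:-1] + b[1:]) / 2.0                          # thresholds v_{l,l+1} = (b_l + b_{l+1}) / 2
    return np.searchsorted(v, U) + 1                    # ordinal labels in 1, ..., L

# Illustrative call; alpha, xi, w0 and b would come from solving the dual system (15).
# y_pred = predict_ordinal(X_test, X_train, y_train, alpha, xi, w0, b_hat, sigma=0.5)
```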

3.2. CONFUSION MATRIX AND ILLUSTRATIVE EXAMPLES

An important drawback of the method, at least in practice, is that multiple hyper-parameters ($\gamma$ and $\lambda$) are to be tuned. A qualitative assessment of the results of an ordinal discriminant rule can be given by the confusion matrix, a notion [12] well known in the literature on classical discriminant analysis. The confusion matrix $\Pi \in \mathbb{R}^{L\times L}$ can be defined in an ordinal setting as

$$\Pi_{kl} = \frac{1}{|S_k|}\sum_{i\in S_k} I\left(c_l(x_i) = 1\right), \quad (21)$$

where $I(z)$ equals one if $z$ is true and zero otherwise. This notion is closely related to the ROC curve in the binary case; see [1] for a discussion in the bipartite ranking problem. If the confusion matrix is not diagonally dominant, the ordinal discriminant rule is not able to reflect the proper ordering of the labels well. Fig. 5 exemplifies this tool.
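A minimal sketch (Python with NumPy; labels assumed to be 1 through L) of the confusion matrix of eq. (21) computed from true and predicted labels:

```python
import numpy as np

def ordinal_confusion(y_true, y_pred, L):
    """Pi_{kl} = fraction of class-k samples that the rule assigns to class l, as in eq. (21)."""
    Pi = np.zeros((L, L))
    for k in range(1, L + 1):
        in_k = (y_true == k)
        for l in range(1, L + 1):
            Pi[k - 1, l - 1] = np.mean(y_pred[in_k] == l) if in_k.any() else 0.0
    return Pi

# A rule is judged adequate when Pi is diagonally dominant (most mass on the diagonal).
```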



Fig. 5. Confusion matrix of an ordinal discriminant rule discriminating between 5 classes. From this matrix, it becomes clear that the rule cannot reproduce well the class labels of samples from the third class, while it clearly reproduces the class 1 labels.

          LS     LDA    LS-SVM  OR     oMM    oLS-SVM
Abalone   52.34  51.43  57.00   52.34  55.02  56.12
Auto-mpg  50.12  55.88  53.67   50.13  52.20  56.61
Housing   44.50  50.28  51.44   57.22  58.38  57.86

Table 1. Numerical performances (% accuracy) achieved on a number of UCI datasets, discretized in 8 equiprobable bins. LDA denotes a standard Linear Discriminant Analysis. The multiclass LS-SVMs are based on a one-versus-one encoding scheme, and LS gives the result of a least squares regression on the labels. OR denotes the result of ordinal regression, while oMM gives results from an ordinal maximal margin approach [13]. For brevity, only linear tools (linear kernel) are applied, indicating the comparable performance of the methods. The last method denotes the ordinal LS-SVM with linear kernel, which is equivalent to the primal ordinal Least Squares Classifier. Note that the ordering assumption may not yield optimal generalization performance, as e.g. in the case of the discretized Abalone dataset.

An alternative model selection measure of a given rule is the risk as defined in (7). This continuous measure can be advised to tune the parameter $\lambda$, as it encourages a proper ordering while avoiding the limit case of putting the thresholds $\{b_l\}_{l=1}^L$ equidistantly. It is however not recommended for tuning $\gamma$, as it does not measure the generalization ability directly. Six different methods are compared on a range of datasets taken from the UCI benchmark repository. At first, the LDA method and the multiclass LS-SVM equipped with a one-versus-one encoding scheme and a voting scheme are applied, the latter based on the LS-SVMlab toolbox. Subsequently, the classical linear technique of ordinal regression (OR) is executed based on the StatBox 4.2 toolbox. Finally, the proposed ordinal linear least squares classifier and ordinal LS-SVM are applied. The tuning of the hyper-parameters is based on a 10-fold cross-validation scheme. Table 1 reports some results on those datasets.
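The hyper-parameters are tuned by 10-fold cross-validation in the experiments; a generic sketch of such a grid search is given below (Python with NumPy; fit and predict stand for any fitting and prediction routines with the shown signatures, e.g. the helpers sketched earlier, and are assumptions of this illustration, not routines from the paper or a toolbox).

```python
import numpy as np

def crossval_accuracy(X, y, gamma, lam, fit, predict, folds=10, seed=0):
    """k-fold cross-validation accuracy of one (gamma, lambda) pair for a given fit/predict pair."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    scores = []
    for test in np.array_split(idx, folds):
        train = np.setdiff1d(idx, test)
        model = fit(X[train], y[train], gamma=gamma, lam=lam)
        scores.append(np.mean(predict(model, X[test]) == y[test]))
    return float(np.mean(scores))

# Grid search over (gamma, lambda); the pair with the best cross-validated accuracy is retained.
# grid = [(g, l) for g in (0.1, 1.0, 10.0) for l in (0.1, 1.0, 10.0)]
# best = max(grid, key=lambda p: crossval_accuracy(X, y, p[0], p[1], fit_fn, predict_fn))
```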

4. CONCLUSIONS

This paper discusses the ideas of linear discriminant analysis in the context of ordinal regression. These are implemented in a kernel-based context of ordinal LS-SVMs.²

5. REFERENCES

[1] S. Agarwal, T. Graepel, R. Herbrich, S. Har-Peled, and D. Roth. Generalization bounds for the area under the ROC curve. Journal of Machine Learning Research, 6:393–425, 2005.

[2] B. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, pages 144–152. ACM, 1992.

[3] W. Chu and Z. Ghahramani. Gaussian processes for ordinal regression. Journal of Machine Learning Research, 6:1019–1041, 2005.

[4] W. Chu and S. S. Keerthi. New approaches to support vector ordinal regression. In Proceedings of the International Conference on Machine Learning, pages 145–152, 2005.

[5] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2001.

[6] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, 1996.

[7] J. Friedman. Regularized discriminant analysis. Journal of the American Statistical Association, 84:165–175, 1989.

[8] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer-Verlag, Heidelberg, 2001.

[9] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. In Advances in Large Margin Classifiers, pages 115–132. MIT Press, Cambridge, MA, 2000.

[10] C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, 13:415–425, 2002.

[11] P. McCullagh. Regression models for ordinal data. Journal of the Royal Statistical Society B, 42:109–142, 1980.

[12] G. McLachlan. Discriminant Analysis and Statistical Pattern Recognition. Wiley Series in Probability and Mathematical Statistics. John Wiley and Sons, 1992.

[13] A. Shashua and A. Levin. Ranking with large margin principle: Two approaches. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 937–944. MIT Press, Cambridge, MA, 2003.

[14] I. Steinwart. Consistency of support vector machines and other regularized kernel machines. IEEE Transactions on Information Theory, 51:128–142, 2005.

[15] J.A.K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle. Least Squares Support Vector Machines. World Scientific, Singapore, 2002.

[16] J.A.K. Suykens and J. Vandewalle. Least squares support vector machine classifiers. Neural Processing Letters, 9(3):293–300, 1999.

[17] T. Van Gestel, J.A.K. Suykens, G. Lanckriet, A. Lambrechts, B. De Moor, and J. Vandewalle. A Bayesian framework for least squares support vector machine classifiers, Gaussian processes and kernel Fisher discriminant analysis. Neural Computation, 14(5):1115–1147, 2002.

[18] V.N. Vapnik. Statistical Learning Theory. Wiley and Sons, 1998.

2- (KP): BOF PDM/05/161, FWO grant V4.090.05N; - (SCD:) GOA

AMBioRICS, CoE EF/05/006, (FWO): G.0407.02, G.0197.02, G.0141.03, G.0491.03, G.0120.03, G.0452.04, G.0499.04, G.0211.05, G.0226.06, G.0321.06, G.0553.06, (ICCoS, ANMMM, MLDM); (IWT): GBOU (Mc-Know), Eureka-Flite2 IUAP P5/22,PODO-II,FP5-Quprodis; ERNSI; - (JS) is an associate professor and (BDM) is a full professor at K.U.Leuven Bel-gium, respectively.
