Contents lists available at SciVerse ScienceDirect

Computational Statistics and Data Analysis

journal homepage: www.elsevier.com/locate/csda

A mixed effects least squares support vector machine model for classification of longitudinal data

Jan Luts a,b,∗, Geert Molenberghs c,d, Geert Verbeke d, Sabine Van Huffel a,b, Johan A.K. Suykens a,b

a Department of Electrical Engineering (ESAT), Research Division SCD, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, B-3001 Leuven, Belgium
b IBBT-K.U. Leuven Future Health Department, Leuven, Belgium
c I-BioStat, Universiteit Hasselt, Agoralaan 1, B-3590 Diepenbeek, Belgium
d I-BioStat, Katholieke Universiteit Leuven, Kapucijnenvoer 35, B-3000 Leuven, Belgium
Article history:
Received 12 November 2010
Received in revised form 9 September 2011
Accepted 10 September 2011
Available online 18 September 2011

Keywords: Classification; Longitudinal data; Least squares; Support vector machine; Kernel method; Mixed model
Abstract
A mixed effects least squares support vector machine (LS-SVM) classifier is introduced to extend the standard LS-SVM classifier for handling longitudinal data. The mixed effects LS-SVM model contains a random intercept and allows classification of highly unbalanced data, in the sense that there is an unequal number of observations for each case at non-fixed time points. The methodology consists of a regression modeling step and a classification step based on the obtained regression estimates. Regression and classification of new cases are performed in a straightforward manner by solving a linear system. It is demonstrated that the methodology can be generalized to deal with multi-class problems and can be extended to incorporate multiple random effects. The technique is illustrated on simulated data sets and real-life problems concerning human growth.
© 2011 Elsevier B.V. All rights reserved.
1. Introduction
During the past decade, kernel-based approaches for estimation and learning problems have gained significant
interest (Schölkopf and Smola, 2001). Kernel methods are typically used in the machine learning community to deal with
all kinds of classification problems involving input vectors with a fixed dimension. For instance, the popular kernel-based technique of support vector machines (SVMs) has been used successfully in many applications (Vapnik, 1998). On the other hand, for many real-life problems the data are characterized by their non-fixed dimensional nature. For example, the analysis of longitudinal data, which consist of repeated measurements on the same units (often called cases) over a period of time, generally requires methods that can deal with unbalanced data. Within the context of longitudinal data, unbalanced means that there is an unequal number of observations for each case at non-fixed time points. While SVMs are not straightforwardly applicable to this kind of data, the classical approach in the statistics community is to analyze
longitudinal data by means of mixed effects models (Verbeke and Molenberghs, 2000). Importantly, these models are able
to take the relationship between observations from the same case into account. Estimation of the fixed and random effects is commonly based on the theory of best linear unbiased prediction (BLUP).
∗Corresponding author at: Department of Electrical Engineering (ESAT), Research Division SCD, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, B-3001 Leuven, Belgium. Tel.: +32 16 321065; fax: +32 16 321970.
E-mail address: jan.luts@esat.kuleuven.be (J. Luts).
To classify longitudinal profiles in different groups, Verbeke and Lesaffre (1996) proposed a linear mixed effects model with heterogeneity in the random effects. The normality assumption for the random effects was replaced by a mixture of normal distributions with a common covariance matrix. Maximum likelihood estimates were obtained based on the expectation-maximization algorithm. A multi-class classification rule was established by assigning a case to the component to which it had the highest posterior probability of belonging. Marshall and Barón (2000) designed a linear/quadratic
discriminant analysis method for unbalanced longitudinal data using mixed effects models. Linear models as well as models that were nonlinear with respect to the mean and the random effects were included. An iterative algorithm was proposed to estimate the class means and covariance matrices, which were subsequently used in various discriminant functions. Variance components were allowed to vary between classes. The final classification was performed based on an extended
Mahalanobis distance measure. James and Hastie (2001) extended linear discriminant analysis to a functional linear
discriminant analysis (FLDA) classifier for irregularly sampled curves. An expectation-maximization algorithm was used to fit a mixed model with splines as random and fixed effects. Final class assignment was based on posterior probabilities. Several extensions such as functional quadratic discriminant analysis (i.e. class-specific covariance matrices) and functional
regularized discriminant analysis were proposed. Wouters et al. (2007) presented a two-step approach called doubly
hierarchical discriminant analysis to classify high-dimensional longitudinal data. This approach first performs modeling of the data via a class of flexible linear mixed models. The resulting set of estimated parameters, which represented a summary of the data, was then fed to a hierarchical linear discriminant classifier. A decision tree composed of binary
classifiers was constructed and posterior probabilities were computed for class assignment. Fieuws et al. (2008) focused
on the classification of multivariate longitudinal profiles. Univariate linear, nonlinear and generalized mixed models were combined into a multivariate mixed model in order to take the correlation between the different profiles into account. Association between the different profiles was assumed to be fully captured by the association between their random effects. Parameter estimates were obtained based on combining all possible bivariate mixed models. Posterior probabilities were
calculated for class assignment. Other studies, e.g. Brown et al. (2001) and De la Cruz-Mesía et al. (2007), addressed the
problem of classification of longitudinal profiles from a Bayesian point of view. The former described a multivariate model to classify longitudinal data, that were only measured at fixed time points, and used an expectation-conditional maximization algorithm. The latter developed a semi-parametric approach for unbalanced longitudinal data and used Markov chain Monte Carlo methods. In addition, the topic of longitudinal data classification is also related to the field of functional
data analysis (González-Manteiga and Vieu, 2007; Ramsay and Silverman, 2005). Functional data can be interpreted as
an observation over a certain continuum, e.g. time. Even though these observations are discrete, they can be perceived
as values that vary smoothly over time, and therefore they are called functional observations. Within this context, James (2002) developed a technique to incorporate (multiple) functional predictors in generalized linear models. An expectation-maximization algorithm was described to fit the spline-based functional logistic regression model. Finally, Müller (2005)
showed an example of the use of a functional binomial regression model with a logistic link for classification of sparse longitudinal data.
Various approaches have been used to deal with nonlinear profiles in mixed effects models. For instance, the FLDA
approach by James and Hastie (2001) included natural cubic spline functions to model the observations of each case. Wouters et al. (2007) employed a fractional polynomial mixed model to extract features as input for discriminant analysis. Liu et al.
(2007) studied the use of kernels for regression analysis by means of linear mixed models, but longitudinal data were
not considered. More recently, explicit connections between the traditional approaches for longitudinal data fitting and
kernel machines were described (Pearce and Wand, 2009). The correspondence between reproducing kernel Hilbert space
optimization and estimated BLUPs was illustrated. Whereas the study by Pearce and Wand (2009) showed that kernel
methods can be used for regression modeling of longitudinal data, the present study focuses on classification of this type
of data using a kernel-based approach. To this end, the least squares support vector machine (LS-SVM) classifier (Suykens et al., 2002; Suykens and Vandewalle, 1999) is extended to a mixed effects LS-SVM model for classification of longitudinal profiles.
The proposed method consists of a regression modeling step and a consecutive prediction step (Section 2). This kernel method combines the regularization property of support vector machines with mixed effects models. The methodology is able to deal with nonlinearity in the data and it addresses binary as well as multi-class classification (Section 3.1). A model formulation with multiple random effects and a corresponding classification strategy are developed (Section 3.2). Model selection via parameter tuning is discussed in Section 4. The techniques are applied to various simulated and real-life problems (Section 5).
2. Random intercept LS-SVM
This section starts with a brief review of the standard LS-SVM classifier (Section 2.1). In the next sections, an LS-SVM model is developed for longitudinal data classification. The random intercept LS-SVM approach includes two steps. Firstly, a regression is done using training data for which the true class labels are known, in order to estimate the model parameters (Section 2.2). Secondly, new longitudinal profiles with unknown class labels can be classified based on the obtained model parameters from the previous step (Section 2.3). Finally, an alternative approach to this classification step is established (Section 2.4).
2.1. LS-SVM classifier
The standard LS-SVM classifier is a binary classifier and requires a data set of $N$ input vectors and corresponding class labels $\{x_i, z_i\}_{i=1}^N$, with $x_i \in \mathbb{R}^d$ the input pattern and $z_i \in \{-1, +1\}$ the class indicator. The proposed optimization problem is defined as

$$\min_{w,e,b} \; \frac{1}{2} w^T w + \frac{\gamma}{2} \sum_{i=1}^N e_i^2, \qquad (1)$$

subject to

$$z_i \left( w^T \varphi(x_i) + b \right) = 1 - e_i, \qquad i = 1, \ldots, N, \qquad (2)$$

with $e = [e_1 \ldots e_N]^T$, $\varphi(\cdot): \mathbb{R}^d \to \mathbb{R}^{d_h}$ a mapping from the input space to a high-dimensional feature space with dimension $d_h$, $w$ a vector of the same dimension as $\varphi(\cdot)$, $\gamma$ a positive regularization constant and $b$ a bias term. In contrast with SVMs, LS-SVMs employ a least squares loss function with error variables $e_i$, and equality constraints are used in place of inequality constraints. This formulation results in solving a set of linear equations, while SVMs require solving a quadratic programming problem. Whereas the primal problem in (1) is expressed in terms of the feature map, the resulting linear system in the dual space is expressed in terms of the kernel function. The method requires tuning of the regularization constant $\gamma$ and kernel parameters. The resulting LS-SVM classifier can be used to classify a test data set $\{x_j^*, z_j^*\}_{j=1}^G$. The classifier is formulated as

$$\hat{z}_j^* = \mathrm{sign}\left( \sum_{i=1}^N \hat{\beta}_i z_i K(x_j^*, x_i) + \hat{b} \right), \qquad (3)$$

in the dual space, with $x_j^*$ the $j$th test case to be classified, Lagrange multipliers $\beta_i$ and the positive definite kernel function $K(\cdot,\cdot)$ with $K(x_j^*, x_i) = \varphi(x_j^*)^T \varphi(x_i)$. In practice, a linear, polynomial or radial basis function (RBF) kernel is often used. In the remainder of this study the RBF kernel, i.e. $K(x_j^*, x_i) = \exp(-\|x_j^* - x_i\|_2^2 / \sigma^2)$, is employed.

The LS-SVM has been shown to be a core model, leading to a generic framework that allows new methods to be defined in a systematic and constructive way (De Brabanter et al., 2010, 2011; Luts et al., 2010; Suykens et al., 2010). Examples include weighted versions to incorporate robustness, sparse representations and a fixed-size approach for large-scale data. The models provide connections to parametric as well as non-parametric statistics via primal and dual formulations. The LS-SVM, having minimal underlying assumptions, is well suited to deal with high-dimensional input data. In addition, reliable numerical implementations are available (e.g. a linear system for classification and regression).
2.2. Training: random intercept LS-SVM regression
The longitudinal training data can be represented as $y_{ij}$, the observed value at time $x_{ij}$ for case $i$, $i = 1, \ldots, N$, $j = 1, \ldots, n_i$, with $N$ the number of cases, $n_i$ the number of observations for case $i$, and the total number of observations $M = \sum_{i=1}^N n_i$. The class labels are known and are denoted by $z_i \in \{-1, +1\}$ for case $i$. The training step consists of a regression modeling. Based on the theory of BLUP (Robinson, 1991), the proposed optimization problem for random intercept LS-SVM regression is formulated as

$$(\hat{w}, \hat{e}, \hat{b}, \hat{B}_0, \hat{B}_1) = \operatorname*{argmin}_{w,e,b,B_0,B_1} \; \frac{1}{2} w^T w + \frac{\lambda_1}{2} \sum_{i=1}^N b_i^2 + \frac{\lambda_2}{2} \sum_{i=1}^N \sum_{j=1}^{n_i} e_{ij}^2, \qquad (4)$$

subject to the model

$$y_{ij} = B_0 + B_1 z_i + b_i + \frac{z_i - 1}{2} w_1^T \varphi_1(x_{ij}) + \frac{z_i + 1}{2} w_2^T \varphi_2(x_{ij}) + e_{ij}, \qquad i = 1, \ldots, N, \; j = 1, \ldots, n_i, \qquad (5)$$

with weighting vector $w = [w_1^T \; w_2^T]^T$, a vector of random intercepts $b = [b_1 \ldots b_N]^T$, error variables $e = [e_{11} \ldots e_{1n_1} \, e_{21} \ldots e_{Nn_N}]^T \in \mathbb{R}^M$, positive regularization constants $\lambda_1$ and $\lambda_2$, model parameters $B_0$ and $B_1$, and mappings $\varphi_1(\cdot)$ and $\varphi_2(\cdot)$. The Lagrangian for the problem in (4)-(5) is

$$\mathcal{L}_1(w, e, b, B_0, B_1; \alpha) = \frac{1}{2} w^T w + \frac{\lambda_1}{2} \sum_{i=1}^N b_i^2 + \frac{\lambda_2}{2} \sum_{i=1}^N \sum_{j=1}^{n_i} e_{ij}^2 - \sum_{i=1}^N \sum_{j=1}^{n_i} \alpha_{ij} \left[ y_{ij} - B_0 - B_1 z_i - b_i - \frac{z_i - 1}{2} w_1^T \varphi_1(x_{ij}) - \frac{z_i + 1}{2} w_2^T \varphi_2(x_{ij}) - e_{ij} \right], \qquad (6)$$

with Lagrange multipliers $\alpha = [\alpha_{11} \ldots \alpha_{1n_1} \, \alpha_{21} \ldots \alpha_{Nn_N}]^T \in \mathbb{R}^M$. The conditions for optimality yield

$$
\begin{aligned}
\frac{\partial \mathcal{L}_1}{\partial w_1} = 0 \;&\rightarrow\; w_1 = -\sum_{i=1}^N \sum_{j=1}^{n_i} \alpha_{ij} \frac{z_i - 1}{2} \varphi_1(x_{ij}), \\
\frac{\partial \mathcal{L}_1}{\partial w_2} = 0 \;&\rightarrow\; w_2 = -\sum_{i=1}^N \sum_{j=1}^{n_i} \alpha_{ij} \frac{z_i + 1}{2} \varphi_2(x_{ij}), \\
\frac{\partial \mathcal{L}_1}{\partial e_{ij}} = 0 \;&\rightarrow\; e_{ij} = -\frac{\alpha_{ij}}{\lambda_2}, \qquad i = 1, \ldots, N, \; j = 1, \ldots, n_i, \\
\frac{\partial \mathcal{L}_1}{\partial b_i} = 0 \;&\rightarrow\; b_i = -\frac{1}{\lambda_1} \sum_{j=1}^{n_i} \alpha_{ij}, \qquad i = 1, \ldots, N, \\
\frac{\partial \mathcal{L}_1}{\partial \alpha_{ij}} = 0 \;&\rightarrow\; y_{ij} = B_0 + B_1 z_i + b_i + \frac{z_i - 1}{2} w_1^T \varphi_1(x_{ij}) + \frac{z_i + 1}{2} w_2^T \varphi_2(x_{ij}) + e_{ij}, \qquad i = 1, \ldots, N, \; j = 1, \ldots, n_i, \\
\frac{\partial \mathcal{L}_1}{\partial B_0} = 0 \;&\rightarrow\; 0 = \sum_{i=1}^N \sum_{j=1}^{n_i} \alpha_{ij}, \\
\frac{\partial \mathcal{L}_1}{\partial B_1} = 0 \;&\rightarrow\; 0 = \sum_{i=1}^N \sum_{j=1}^{n_i} \alpha_{ij} z_i.
\end{aligned}
\qquad (7)
$$

After eliminating $w_1$, $w_2$, $e_{ij}$ and $b_i$ one obtains

$$
\begin{aligned}
y_{ij} &= B_0 + B_1 z_i - \frac{1}{\lambda_1} \sum_{k=1}^{n_i} \alpha_{ik} - \sum_{k=1}^N \sum_{l=1}^{n_k} \alpha_{kl} \frac{z_k - 1}{2} \frac{z_i - 1}{2} \varphi_1(x_{kl})^T \varphi_1(x_{ij}) - \sum_{k=1}^N \sum_{l=1}^{n_k} \alpha_{kl} \frac{z_k + 1}{2} \frac{z_i + 1}{2} \varphi_2(x_{kl})^T \varphi_2(x_{ij}) - \frac{\alpha_{ij}}{\lambda_2}, \\
&\qquad\qquad i = 1, \ldots, N, \; j = 1, \ldots, n_i, \\
0 &= \sum_{i=1}^N \sum_{j=1}^{n_i} \alpha_{ij}, \qquad 0 = \sum_{i=1}^N \sum_{j=1}^{n_i} \alpha_{ij} z_i.
\end{aligned}
\qquad (8)
$$

This can be written as the linear system

$$
\begin{bmatrix}
0_{2 \times 2} & [1_M \; Z]^T \\
[1_M \; Z] & -K_1 - K_2 - \frac{\Gamma}{\lambda_1} - \frac{I_{M \times M}}{\lambda_2}
\end{bmatrix}
\begin{bmatrix} B_0 \\ B_1 \\ \alpha \end{bmatrix}
=
\begin{bmatrix} 0_{2 \times 1} \\ y \end{bmatrix},
\qquad (9)
$$

with $y = [y_{11} \ldots y_{1n_1} \, y_{21} \ldots y_{Nn_N}]^T \in \mathbb{R}^M$, $Z = [\underbrace{z_1 \ldots z_1}_{n_1} \; \underbrace{z_2 \ldots z_2}_{n_2} \ldots \underbrace{z_N \ldots z_N}_{n_N}]^T$, $x = [x_{11} \ldots x_{1n_1} \, x_{21} \ldots x_{Nn_N}]^T \in \mathbb{R}^M$, $\Gamma = \operatorname{blockdiag}(1_{n_1 \times n_1}, \ldots, 1_{n_N \times n_N})$, kernel matrix $K_1 \in \mathbb{R}^{M \times M}$ with elements $K_{1rs} = \frac{Z_r - 1}{2} \frac{Z_s - 1}{2} \varphi_1(x_s)^T \varphi_1(x_r) = \frac{Z_r - 1}{2} \frac{Z_s - 1}{2} K_1(x_s, x_r)$, and kernel matrix $K_2 \in \mathbb{R}^{M \times M}$ with elements $K_{2rs} = \frac{Z_r + 1}{2} \frac{Z_s + 1}{2} \varphi_2(x_s)^T \varphi_2(x_r) = \frac{Z_r + 1}{2} \frac{Z_s + 1}{2} K_2(x_s, x_r)$, where $K_1(\cdot,\cdot)$ and $K_2(\cdot,\cdot)$ are positive definite kernel functions. Values for the regularization constants (and kernel parameters) can be obtained via cross-validation procedures. This is further discussed in Section 4.
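To make the structure of system (9) concrete, the sketch below assembles and solves it with numpy, assuming one-dimensional time covariates $x_{ij}$ and RBF kernels in the role of $K_1$ and $K_2$. All names are illustrative, not from the paper.

```python
import numpy as np

def rbf(a, b, sigma):
    """RBF kernel for scalar time points: exp(-(a - b)^2 / sigma^2)."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / sigma ** 2)

def train_random_intercept_lssvm(x, y, Z, n_i, lam1, lam2, sigma1, sigma2):
    """Assemble and solve the dual system (9).
    x, y: all M stacked observations; Z: per-observation class label (+-1),
    repeated within each case; n_i: number of observations per case."""
    M = len(y)
    c1 = (Z - 1) / 2.0                       # selects the class -1 kernel part
    c2 = (Z + 1) / 2.0                       # selects the class +1 kernel part
    K1 = np.outer(c1, c1) * rbf(x, x, sigma1)
    K2 = np.outer(c2, c2) * rbf(x, x, sigma2)
    Gamma = np.zeros((M, M))                 # blockdiag of all-ones blocks
    s = 0
    for n in n_i:
        Gamma[s:s + n, s:s + n] = 1.0
        s += n
    A = np.zeros((M + 2, M + 2))
    A[0, 2:] = 1.0                           # constraint: sum_ij alpha_ij = 0
    A[1, 2:] = Z                             # constraint: sum_ij alpha_ij z_i = 0
    A[2:, 0] = 1.0
    A[2:, 1] = Z
    A[2:, 2:] = -K1 - K2 - Gamma / lam1 - np.eye(M) / lam2
    sol = np.linalg.solve(A, np.r_[0.0, 0.0, y])
    B0, B1, alpha = sol[0], sol[1], sol[2:]
    return B0, B1, alpha
```

The returned $\hat{\alpha}$ gives the kernel expansions of $\hat{w}_1$ and $\hat{w}_2$ used in the prediction step of Section 2.3.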
2.3. Prediction: random intercept LS-SVM classification
Now that the parameters $w_1$, $w_2$, $B_0$ and $B_1$ in (5) are estimated, these can be used to classify a longitudinal data set $\{\{x_{ij}^*, y_{ij}^*\}_{j=1}^{g_i}\}_{i=1}^G$ with unknown class labels $z_i^*$, where $G$ represents the number of cases, $g_i$ the number of observations for case $i$, and $H = \sum_{i=1}^G g_i$ the total number of observations. The proposed optimization problem for prediction is formulated as

$$(\hat{e}^*, \hat{b}^*, \hat{z}^*) = \operatorname*{argmin}_{e^*, b^*, z^*} \; \frac{\lambda_1}{2} \sum_{i=1}^G b_i^{*2} + \frac{\lambda_2}{2} \sum_{i=1}^G \sum_{j=1}^{g_i} e_{ij}^{*2}, \qquad (10)$$

subject to

$$y_{ij}^* = \hat{B}_0 + \hat{B}_1 z_i^* + b_i^* + \frac{z_i^* - 1}{2} \hat{w}_1^T \varphi_1(x_{ij}^*) + \frac{z_i^* + 1}{2} \hat{w}_2^T \varphi_2(x_{ij}^*) + e_{ij}^*, \qquad i = 1, \ldots, G, \; j = 1, \ldots, g_i, \qquad (11)$$

with random intercepts $b^* = [b_1^* \ldots b_G^*]^T$, error variables $e^* = [e_{11}^* \ldots e_{1g_1}^* \, e_{21}^* \ldots e_{Gg_G}^*]^T \in \mathbb{R}^H$, class labels $z^* = [z_1^* \ldots z_G^*]^T$, weights $\hat{w}_1 = -\sum_{i=1}^N \sum_{j=1}^{n_i} \hat{\alpha}_{ij} \frac{z_i - 1}{2} \varphi_1(x_{ij})$ and weights $\hat{w}_2 = -\sum_{i=1}^N \sum_{j=1}^{n_i} \hat{\alpha}_{ij} \frac{z_i + 1}{2} \varphi_2(x_{ij})$. The Lagrangian for the problem in (10)-(11) is

$$\mathcal{L}_2(e^*, b^*, z^*; \nu) = \frac{\lambda_1}{2} \sum_{i=1}^G b_i^{*2} + \frac{\lambda_2}{2} \sum_{i=1}^G \sum_{j=1}^{g_i} e_{ij}^{*2} - \sum_{i=1}^G \sum_{j=1}^{g_i} \nu_{ij} \left[ y_{ij}^* - \hat{B}_0 - \hat{B}_1 z_i^* - b_i^* - \frac{z_i^* - 1}{2} \hat{w}_1^T \varphi_1(x_{ij}^*) - \frac{z_i^* + 1}{2} \hat{w}_2^T \varphi_2(x_{ij}^*) - e_{ij}^* \right], \qquad (12)$$

with Lagrange multipliers $\nu = [\nu_{11} \ldots \nu_{1g_1} \, \nu_{21} \ldots \nu_{Gg_G}]^T \in \mathbb{R}^H$. The conditions for optimality result in

$$
\begin{aligned}
\frac{\partial \mathcal{L}_2}{\partial e_{ij}^*} = 0 \;&\rightarrow\; e_{ij}^* = -\frac{\nu_{ij}}{\lambda_2}, \qquad i = 1, \ldots, G, \; j = 1, \ldots, g_i, \\
\frac{\partial \mathcal{L}_2}{\partial b_i^*} = 0 \;&\rightarrow\; b_i^* = -\frac{1}{\lambda_1} \sum_{j=1}^{g_i} \nu_{ij}, \qquad i = 1, \ldots, G, \\
\frac{\partial \mathcal{L}_2}{\partial z_i^*} = 0 \;&\rightarrow\; 0 = \sum_{j=1}^{g_i} \nu_{ij} \left[ \hat{B}_1 + \frac{\hat{w}_1^T \varphi_1(x_{ij}^*)}{2} + \frac{\hat{w}_2^T \varphi_2(x_{ij}^*)}{2} \right], \qquad i = 1, \ldots, G, \\
\frac{\partial \mathcal{L}_2}{\partial \nu_{ij}} = 0 \;&\rightarrow\; y_{ij}^* = \hat{B}_0 + \hat{B}_1 z_i^* + b_i^* + \frac{z_i^* - 1}{2} \hat{w}_1^T \varphi_1(x_{ij}^*) + \frac{z_i^* + 1}{2} \hat{w}_2^T \varphi_2(x_{ij}^*) + e_{ij}^*, \qquad i = 1, \ldots, G, \; j = 1, \ldots, g_i.
\end{aligned}
\qquad (13)
$$

Elimination of $e_{ij}^*$ and $b_i^*$ gives

$$
\begin{aligned}
0 &= \sum_{j=1}^{g_i} \nu_{ij} \left[ \hat{B}_1 + \frac{\hat{w}_1^T \varphi_1(x_{ij}^*)}{2} + \frac{\hat{w}_2^T \varphi_2(x_{ij}^*)}{2} \right], \qquad i = 1, \ldots, G, \\
y_{ij}^* &= \hat{B}_0 + \hat{B}_1 z_i^* - \frac{1}{\lambda_1} \sum_{k=1}^{g_i} \nu_{ik} + \frac{z_i^* - 1}{2} \hat{w}_1^T \varphi_1(x_{ij}^*) + \frac{z_i^* + 1}{2} \hat{w}_2^T \varphi_2(x_{ij}^*) - \frac{\nu_{ij}}{\lambda_2}, \qquad i = 1, \ldots, G, \; j = 1, \ldots, g_i.
\end{aligned}
\qquad (14)
$$

The optimization problem in the dual space is formulated as

$$
\begin{bmatrix}
0_{G \times G} & A^{*T} \\
A^* & -\frac{\Theta}{\lambda_1} - \frac{I_{H \times H}}{\lambda_2}
\end{bmatrix}
\begin{bmatrix} z^* \\ \nu \end{bmatrix}
=
\begin{bmatrix} 0_G \\ y^* - \hat{B}_0 1_H + P^* \end{bmatrix},
\qquad (15)
$$

with $y^* = [y_{11}^* \ldots y_{1g_1}^* \, y_{21}^* \ldots y_{Gg_G}^*]^T \in \mathbb{R}^H$, $\Theta = \operatorname{blockdiag}(1_{g_1 \times g_1}, \ldots, 1_{g_G \times g_G})$,

$$A^* = \operatorname{blockdiag}\left( \left[ \hat{B}_1 + \tfrac{\hat{w}_1^T \varphi_1(x_{11}^*)}{2} + \tfrac{\hat{w}_2^T \varphi_2(x_{11}^*)}{2} \; \ldots \; \hat{B}_1 + \tfrac{\hat{w}_1^T \varphi_1(x_{1g_1}^*)}{2} + \tfrac{\hat{w}_2^T \varphi_2(x_{1g_1}^*)}{2} \right]^T, \ldots, \left[ \hat{B}_1 + \tfrac{\hat{w}_1^T \varphi_1(x_{G1}^*)}{2} + \tfrac{\hat{w}_2^T \varphi_2(x_{G1}^*)}{2} \; \ldots \; \hat{B}_1 + \tfrac{\hat{w}_1^T \varphi_1(x_{Gg_G}^*)}{2} + \tfrac{\hat{w}_2^T \varphi_2(x_{Gg_G}^*)}{2} \right]^T \right)$$

and

$$P^* = \left[ \tfrac{\hat{w}_1^T \varphi_1(x_{11}^*)}{2} - \tfrac{\hat{w}_2^T \varphi_2(x_{11}^*)}{2} \; \ldots \; \tfrac{\hat{w}_1^T \varphi_1(x_{1g_1}^*)}{2} - \tfrac{\hat{w}_2^T \varphi_2(x_{1g_1}^*)}{2} \; \tfrac{\hat{w}_1^T \varphi_1(x_{21}^*)}{2} - \tfrac{\hat{w}_2^T \varphi_2(x_{21}^*)}{2} \; \ldots \; \tfrac{\hat{w}_1^T \varphi_1(x_{Gg_G}^*)}{2} - \tfrac{\hat{w}_2^T \varphi_2(x_{Gg_G}^*)}{2} \right]^T.$$

The resulting longitudinal LS-SVM classifier is defined as $\mathrm{sign}(\hat{z}^*)$. Note that solving (15) with all $G$ test cases or with one test case at a time will result in identical predictions. Section 4 provides a discussion on the selection of the values for the regularization constants $\lambda_1$ and $\lambda_2$ and the kernel parameters.

2.4. An alternative prediction formulation
In this section an alternative approach to the prediction step in Section 2.3 is formulated. Instead of obtaining the unknown class labels $\hat{z}^*$ from solving (15), the cost of assigning a test case, e.g. $\{[x_{i1}^* \ldots x_{ig_i}^*]^T, [y_{i1}^* \ldots y_{ig_i}^*]^T\}$, to a specific class can be calculated. Final prediction is obtained by assigning the case to the class which is associated with the lowest cost. The problem in (10)-(11) is redefined as

$$(\hat{e}^*, \hat{b}^*) = \operatorname*{argmin}_{e^*, b^*} \; \frac{\lambda_1}{2} b_i^{*2} + \frac{\lambda_2}{2} \sum_{j=1}^{g_i} e_{ij}^{*2}, \qquad (16)$$

subject to

$$y_{ij}^* = \hat{B}_0 + \hat{B}_1 z_i^* + b_i^* + \frac{z_i^* - 1}{2} \hat{w}_1^T \varphi_1(x_{ij}^*) + \frac{z_i^* + 1}{2} \hat{w}_2^T \varphi_2(x_{ij}^*) + e_{ij}^*, \qquad j = 1, \ldots, g_i. \qquad (17)$$

This formulation assumes that the true class label $z_i^*$ is provided, while the estimates $\hat{B}_0$, $\hat{B}_1$, $\hat{w}_1$ and $\hat{w}_2$, obtained from the regression step in Section 2.2, are plugged in. The Lagrangian for this problem becomes

$$\mathcal{L}_3(e^*, b^*; \nu) = \frac{\lambda_1}{2} b_i^{*2} + \frac{\lambda_2}{2} \sum_{j=1}^{g_i} e_{ij}^{*2} - \sum_{j=1}^{g_i} \nu_{ij} \left[ y_{ij}^* - \hat{B}_0 - \hat{B}_1 z_i^* - b_i^* - \frac{z_i^* - 1}{2} \hat{w}_1^T \varphi_1(x_{ij}^*) - \frac{z_i^* + 1}{2} \hat{w}_2^T \varphi_2(x_{ij}^*) - e_{ij}^* \right], \qquad (18)$$

with Lagrange multipliers $\nu_i = [\nu_{i1} \ldots \nu_{ig_i}]^T$, and the conditions for optimality yield

$$
\begin{aligned}
\frac{\partial \mathcal{L}_3}{\partial e_{ij}^*} = 0 \;&\rightarrow\; e_{ij}^* = -\frac{\nu_{ij}}{\lambda_2}, \qquad j = 1, \ldots, g_i, \\
\frac{\partial \mathcal{L}_3}{\partial b_i^*} = 0 \;&\rightarrow\; b_i^* = -\frac{1}{\lambda_1} \sum_{j=1}^{g_i} \nu_{ij}, \\
\frac{\partial \mathcal{L}_3}{\partial \nu_{ij}} = 0 \;&\rightarrow\; y_{ij}^* = \hat{B}_0 + \hat{B}_1 z_i^* + b_i^* + \frac{z_i^* - 1}{2} \hat{w}_1^T \varphi_1(x_{ij}^*) + \frac{z_i^* + 1}{2} \hat{w}_2^T \varphi_2(x_{ij}^*) + e_{ij}^*, \qquad j = 1, \ldots, g_i.
\end{aligned}
\qquad (19)
$$

Elimination of $e_{ij}^*$ and $b_i^*$ yields

$$y_{ij}^* = \hat{B}_0 + \hat{B}_1 z_i^* - \frac{1}{\lambda_1} \sum_{k=1}^{g_i} \nu_{ik} + \frac{z_i^* - 1}{2} \hat{w}_1^T \varphi_1(x_{ij}^*) + \frac{z_i^* + 1}{2} \hat{w}_2^T \varphi_2(x_{ij}^*) - \frac{\nu_{ij}}{\lambda_2}, \qquad j = 1, \ldots, g_i. \qquad (20)$$

The solution to this optimization problem can be found by solving the linear system

$$\left[ -\frac{1_{g_i \times g_i}}{\lambda_1} - \frac{I_{g_i \times g_i}}{\lambda_2} \right] [\nu_i] = \left[ y_i^* - (\hat{B}_0 + \hat{B}_1 z_i^*) 1_{g_i} - D_i^* \right], \qquad (21)$$

with $y_i^* = [y_{i1}^* \ldots y_{ig_i}^*]^T$ and

$$D_i^* = \left[ \frac{z_i^* - 1}{2} \hat{w}_1^T \varphi_1(x_{i1}^*) + \frac{z_i^* + 1}{2} \hat{w}_2^T \varphi_2(x_{i1}^*) \; \ldots \; \frac{z_i^* - 1}{2} \hat{w}_1^T \varphi_1(x_{ig_i}^*) + \frac{z_i^* + 1}{2} \hat{w}_2^T \varphi_2(x_{ig_i}^*) \right]^T.$$

Following (19), the cost function in (16) equals $\frac{1}{2\lambda_1} \left( \sum_{j=1}^{g_i} \nu_{ij} \right)^2 + \frac{1}{2\lambda_2} \sum_{j=1}^{g_i} \nu_{ij}^2$. This cost function can be calculated for the first class, i.e. $z_i^* = -1$, and the second class, i.e. $z_i^* = 1$, by solving (21). The classification rule assigns the case $\{[x_{i1}^* \ldots x_{ig_i}^*]^T, [y_{i1}^* \ldots y_{ig_i}^*]^T\}$ to the class with the smallest cost.
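The cost-based rule just described can be sketched as follows, assuming the kernel expansions $\hat{w}_1^T \varphi_1(x_{ij}^*)$ and $\hat{w}_2^T \varphi_2(x_{ij}^*)$ have already been evaluated at the test time points and are passed in as arrays (`f1`, `f2`). Function names are illustrative, not from the paper.

```python
import numpy as np

def case_cost(y_star, f1, f2, B0, B1, z_star, lam1, lam2):
    """Cost (16) of assigning one test case to class z_star, via system (21).
    f1, f2: arrays of w1_hat^T phi1(x*_ij) and w2_hat^T phi2(x*_ij)."""
    g = len(y_star)
    # D*_i: class-selected mean contribution at each test time point
    D = (z_star - 1) / 2.0 * f1 + (z_star + 1) / 2.0 * f2
    A = -np.ones((g, g)) / lam1 - np.eye(g) / lam2
    nu = np.linalg.solve(A, y_star - (B0 + B1 * z_star) - D)
    # cost = (sum nu)^2 / (2 lam1) + sum(nu^2) / (2 lam2)
    return nu.sum() ** 2 / (2 * lam1) + (nu ** 2).sum() / (2 * lam2)

def classify_case(y_star, f1, f2, B0, B1, lam1, lam2):
    """Assign the case to the class (-1 or +1) with the smallest cost."""
    costs = {z: case_cost(y_star, f1, f2, B0, B1, z, lam1, lam2)
             for z in (-1.0, 1.0)}
    return min(costs, key=costs.get)
```

If a case's observations sit exactly on one class mean, the corresponding residual system is homogeneous and the cost for that class is zero.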
3. Extensions
In the following sections, possible extensions of the model in (5) are explored.
3.1. Multi-class classification
The alternative approach in Section 2.4 has been introduced because it allows direct generalization of the binary longitudinal LS-SVM classifier to a multi-class version. This section illustrates this extension for the development of a longitudinal LS-SVM classifier for $C$ classes. Similar to Section 2.2, assume the longitudinal training data $\{\{x_{ij}, y_{ij}\}_{j=1}^{n_i}\}_{i=1}^N$, but with known class labels $z_{1i}, \ldots, z_{Ci} \in \{-1, +1\}$ for case $i$. These class coding labels equal $-1$, except that $z_{ci} = 1$ if case $i$ belongs to class $c$. As before, the training step consists of a regression modeling and it is formulated by the optimization problem

$$(\hat{w}, \hat{e}, \hat{b}, \hat{B}_0, \hat{B}_1, \ldots, \hat{B}_C) = \operatorname*{argmin}_{w,e,b,B_0,B_1,\ldots,B_C} \; \frac{1}{2} w^T w + \frac{\lambda_1}{2} \sum_{i=1}^N b_i^2 + \frac{\lambda_2}{2} \sum_{i=1}^N \sum_{j=1}^{n_i} e_{ij}^2, \qquad (22)$$

subject to the model

$$y_{ij} = B_0 + \sum_{c=1}^C \frac{z_{ci} + 1}{2} \left( B_c + w_c^T \varphi_c(x_{ij}) \right) + b_i + e_{ij}, \qquad i = 1, \ldots, N, \; j = 1, \ldots, n_i, \qquad (23)$$

with weighting vector $w = [w_1^T \ldots w_C^T]^T$, model parameters $B_0, B_1, \ldots, B_C$, and mappings $\varphi_c(\cdot)$, $c = 1, \ldots, C$. The dual solution to this optimization problem can be obtained by solving the linear system

$$
\begin{bmatrix}
0_{(C+1) \times (C+1)} & [1_M \; Z_1 \ldots Z_C]^T \\
[1_M \; Z_1 \ldots Z_C] & -\sum_{c=1}^C K_c - \frac{\Gamma}{\lambda_1} - \frac{I_{M \times M}}{\lambda_2}
\end{bmatrix}
\begin{bmatrix} B_0 \\ B_1 \\ \vdots \\ B_C \\ \alpha \end{bmatrix}
=
\begin{bmatrix} 0_{(C+1)} \\ y \end{bmatrix},
\qquad (24)
$$

with $Z_c = \frac{1}{2} \left( [\underbrace{z_{c1} \ldots z_{c1}}_{n_1} \; \underbrace{z_{c2} \ldots z_{c2}}_{n_2} \ldots \underbrace{z_{cN} \ldots z_{cN}}_{n_N}]^T + 1_M \right)$ and kernel matrix $K_c \in \mathbb{R}^{M \times M}$ with elements $K_{crs} = Z_{cr} Z_{cs} K_c(x_s, x_r)$, where $K_c(\cdot,\cdot)$ is a positive definite kernel function. The obtained estimates for $B_0, B_1, \ldots, B_C$ and $w = [w_1^T \ldots w_C^T]^T$ are now used to predict the class for a new longitudinal data set $\{\{x_{ij}^*, y_{ij}^*\}_{j=1}^{g_i}\}_{i=1}^G$ with unknown class labels.

The approach from Section 2.4 can be used to classify the $i$th test case $\{[x_{i1}^* \ldots x_{ig_i}^*]^T, [y_{i1}^* \ldots y_{ig_i}^*]^T\}$. The multi-class extension for the problem in (16)-(17) is defined as

$$(\hat{e}^*, \hat{b}^*) = \operatorname*{argmin}_{e^*, b^*} \; \frac{\lambda_1}{2} b_i^{*2} + \frac{\lambda_2}{2} \sum_{j=1}^{g_i} e_{ij}^{*2}, \qquad (25)$$

subject to

$$y_{ij}^* = \hat{B}_0 + \sum_{c=1}^C \frac{z_{ci}^* + 1}{2} \left( \hat{B}_c + \hat{w}_c^T \varphi_c(x_{ij}^*) \right) + b_i^* + e_{ij}^*, \qquad j = 1, \ldots, g_i. \qquad (26)$$

The Lagrangian for this problem is

$$\mathcal{L}_4(e^*, b^*; \nu) = \frac{\lambda_1}{2} b_i^{*2} + \frac{\lambda_2}{2} \sum_{j=1}^{g_i} e_{ij}^{*2} - \sum_{j=1}^{g_i} \nu_{ij} \left[ y_{ij}^* - \hat{B}_0 - \sum_{c=1}^C \frac{z_{ci}^* + 1}{2} \left( \hat{B}_c + \hat{w}_c^T \varphi_c(x_{ij}^*) \right) - b_i^* - e_{ij}^* \right], \qquad (27)$$

and the conditions for optimality yield

$$
\begin{aligned}
\frac{\partial \mathcal{L}_4}{\partial e_{ij}^*} = 0 \;&\rightarrow\; e_{ij}^* = -\frac{\nu_{ij}}{\lambda_2}, \qquad j = 1, \ldots, g_i, \\
\frac{\partial \mathcal{L}_4}{\partial b_i^*} = 0 \;&\rightarrow\; b_i^* = -\frac{1}{\lambda_1} \sum_{j=1}^{g_i} \nu_{ij}, \\
\frac{\partial \mathcal{L}_4}{\partial \nu_{ij}} = 0 \;&\rightarrow\; y_{ij}^* = \hat{B}_0 + \sum_{c=1}^C \frac{z_{ci}^* + 1}{2} \left( \hat{B}_c + \hat{w}_c^T \varphi_c(x_{ij}^*) \right) + b_i^* + e_{ij}^*, \qquad j = 1, \ldots, g_i.
\end{aligned}
\qquad (28)
$$

Elimination of $e_{ij}^*$ and $b_i^*$ gives

$$y_{ij}^* = \hat{B}_0 + \sum_{c=1}^C \frac{z_{ci}^* + 1}{2} \left( \hat{B}_c + \hat{w}_c^T \varphi_c(x_{ij}^*) \right) - \frac{1}{\lambda_1} \sum_{k=1}^{g_i} \nu_{ik} - \frac{\nu_{ij}}{\lambda_2}, \qquad j = 1, \ldots, g_i. \qquad (29)$$

The dual solution for this problem can be found by solving

$$\left[ -\frac{1_{g_i \times g_i}}{\lambda_1} - \frac{I_{g_i \times g_i}}{\lambda_2} \right] [\nu_i] = \left[ y_i^* - \left( \hat{B}_0 + \sum_{c=1}^C \frac{z_{ci}^* + 1}{2} \hat{B}_c \right) 1_{g_i} - L_i^* \right], \qquad (30)$$

with $L_i^* = \left[ \sum_{c=1}^C \frac{z_{ci}^* + 1}{2} \hat{w}_c^T \varphi_c(x_{i1}^*) \; \ldots \; \sum_{c=1}^C \frac{z_{ci}^* + 1}{2} \hat{w}_c^T \varphi_c(x_{ig_i}^*) \right]^T$. The cost, i.e. $\frac{1}{2\lambda_1} \left( \sum_{j=1}^{g_i} \nu_{ij} \right)^2 + \frac{1}{2\lambda_2} \sum_{j=1}^{g_i} \nu_{ij}^2$, of assigning a test case to a specific class can be calculated by solving (30) with the corresponding class coding labels (e.g. $z_{1i}^* = 1$ and $z_{2i}^* = \ldots = z_{Ci}^* = -1$ for class 1). Finally, the classification rule assigns the $i$th test case $\{[x_{i1}^* \ldots x_{ig_i}^*]^T, [y_{i1}^* \ldots y_{ig_i}^*]^T\}$ to the class with the smallest cost.

3.2. Multiple random effects
The proposed random intercept LS-SVM classifier (Section 2.3) can be extended to allow multiple random effects. Assume the longitudinal test data set $\{\{x_{ij}^*, y_{ij}^*\}_{j=1}^{g_i}\}_{i=1}^G$ with unknown class label $z_i^*$ for case $i$. The optimization problem for prediction with $r$ random effects is formulated as

$$(\hat{e}^*, \hat{b}^*, \hat{z}^*) = \operatorname*{argmin}_{e^*, b^*, z^*} \; \frac{\lambda_3}{2} \sum_{i=1}^G b_i^{*T} R \, b_i^* + \frac{\lambda_4}{2} \sum_{i=1}^G e_i^{*T} E_i e_i^*, \qquad (31)$$

subject to

$$y_{ij}^* = \hat{B}_0 + \hat{B}_1 z_i^* + \frac{z_i^* - 1}{2} \hat{w}_1^T \varphi_1(x_{ij}^*) + \frac{z_i^* + 1}{2} \hat{w}_2^T \varphi_2(x_{ij}^*) + b_i^{*T} u_{ij}^* + e_{ij}^*, \qquad i = 1, \ldots, G, \; j = 1, \ldots, g_i, \qquad (32)$$

with positive regularization constants $\lambda_3$ and $\lambda_4$, random effect covariates $u_{ij}^* = [u_{ij1}^* \ldots u_{ijr}^*]^T$, and the covariance matrices $R^{-1}$ and $E_i^{-1}$ of the random effects $b_i^* = [b_{1i}^* \ldots b_{ri}^*]^T$ and residuals $e_i^* = [e_{i1}^* \ldots e_{ig_i}^*]^T$, respectively. The Lagrangian for this problem is

$$\mathcal{L}_5(e^*, b^*, z^*; \nu) = \frac{\lambda_3}{2} \sum_{i=1}^G b_i^{*T} R \, b_i^* + \frac{\lambda_4}{2} \sum_{i=1}^G e_i^{*T} E_i e_i^* - \sum_{i=1}^G \sum_{j=1}^{g_i} \nu_{ij} \left[ y_{ij}^* - \hat{B}_0 - \hat{B}_1 z_i^* - \frac{z_i^* - 1}{2} \hat{w}_1^T \varphi_1(x_{ij}^*) - \frac{z_i^* + 1}{2} \hat{w}_2^T \varphi_2(x_{ij}^*) - b_i^{*T} u_{ij}^* - e_{ij}^* \right]. \qquad (33)$$

The corresponding conditions for optimality yield

$$
\begin{aligned}
\frac{\partial \mathcal{L}_5}{\partial e_{ij}^*} = 0 \;&\rightarrow\; 0 = \lambda_4 \sum_{k=1}^{g_i} e_{ik}^* E_{ijk} + \nu_{ij}, \qquad i = 1, \ldots, G, \; j = 1, \ldots, g_i, \\
\frac{\partial \mathcal{L}_5}{\partial b_{ki}^*} = 0 \;&\rightarrow\; 0 = \lambda_3 \sum_{l=1}^r R_{lk} b_{li}^* + \sum_{j=1}^{g_i} \nu_{ij} u_{ijk}^*, \qquad i = 1, \ldots, G, \; k = 1, \ldots, r, \\
\frac{\partial \mathcal{L}_5}{\partial z_i^*} = 0 \;&\rightarrow\; 0 = \sum_{j=1}^{g_i} \nu_{ij} \left[ \hat{B}_1 + \frac{\hat{w}_1^T \varphi_1(x_{ij}^*)}{2} + \frac{\hat{w}_2^T \varphi_2(x_{ij}^*)}{2} \right], \qquad i = 1, \ldots, G, \\
\frac{\partial \mathcal{L}_5}{\partial \nu_{ij}} = 0 \;&\rightarrow\; y_{ij}^* = \hat{B}_0 + \hat{B}_1 z_i^* + \frac{z_i^* - 1}{2} \hat{w}_1^T \varphi_1(x_{ij}^*) + \frac{z_i^* + 1}{2} \hat{w}_2^T \varphi_2(x_{ij}^*) + \sum_{k=1}^r b_{ki}^* u_{ijk}^* + e_{ij}^*, \qquad i = 1, \ldots, G, \; j = 1, \ldots, g_i.
\end{aligned}
\qquad (34)
$$

Elimination of $e_{ij}^*$ and $b_{ki}^*$ gives

$$
\begin{aligned}
0 &= \sum_{j=1}^{g_i} \nu_{ij} \left[ \hat{B}_1 + \frac{\hat{w}_1^T \varphi_1(x_{ij}^*)}{2} + \frac{\hat{w}_2^T \varphi_2(x_{ij}^*)}{2} \right], \qquad i = 1, \ldots, G, \\
y_i^* &= \hat{B}_0 + \hat{B}_1 z_i^* + \frac{z_i^* - 1}{2} [\hat{w}_1^T \Phi_{i1}^*]^T + \frac{z_i^* + 1}{2} [\hat{w}_2^T \Phi_{i2}^*]^T - \frac{1}{\lambda_3} U_i^{*T} R^{-1} U_i^* \nu_i - \frac{1}{\lambda_4} E_i^{-1} \nu_i, \qquad i = 1, \ldots, G,
\end{aligned}
\qquad (35)
$$

with $y_i^* = [y_{i1}^* \ldots y_{ig_i}^*]^T$, $\nu_i = [\nu_{i1} \ldots \nu_{ig_i}]^T$, $U_i^* = [u_{i1}^* \ldots u_{ig_i}^*]$, $\Phi_{i1}^* = [\varphi_1(x_{i1}^*) \ldots \varphi_1(x_{ig_i}^*)]$ and $\Phi_{i2}^* = [\varphi_2(x_{i1}^*) \ldots \varphi_2(x_{ig_i}^*)]$, whereas $E_{ijk}$ and $R_{jk}$ denote the element on the $j$th row and $k$th column of the covariance matrices $E_i$ and $R$, respectively. The dual solution to this problem can be found by solving

$$
\begin{bmatrix}
0_{G \times G} & A^{*T} \\
A^* & F^*
\end{bmatrix}
\begin{bmatrix} z^* \\ \nu \end{bmatrix}
=
\begin{bmatrix} 0_G \\ y^* - \hat{B}_0 1_H + P^* \end{bmatrix},
\qquad (36)
$$

where $F^* = \operatorname{blockdiag}\left( -\frac{1}{\lambda_3} U_1^{*T} R^{-1} U_1^* - \frac{1}{\lambda_4} E_1^{-1}, \ldots, -\frac{1}{\lambda_3} U_G^{*T} R^{-1} U_G^* - \frac{1}{\lambda_4} E_G^{-1} \right)$, while $A^*$ and $P^*$ are identical to their formulations in (15). In case one assumes only a random intercept and a simple covariance structure for $E_i$, (36) simplifies to (15). The final class assignment is obtained by applying $\mathrm{sign}(\hat{z}^*)$.

4. Model selection
.4. Model selection
Selection of appropriate values for the regularization parameters
λ
1,λ
2,λ
3,λ
4and the kernel parameters is an importantissue. Moreover, the use of multiple random effects requires the estimation of the covariance matrices R−1and E−1
i . These
topics are discussed in detail in the following sections. 4.1. Random intercept LS-SVM
The typical procedure to tune regularization parameters in the context of conventional LS-SVMs is by means of cross-validation or repeated random sampling, resulting in training and cross-validation data. On the other hand, some complications arise when using these methods to tune the parameters of the mixed effects LS-SVM models. The training and the prediction step each involve two regularization parameters (i.e. $\lambda_1$, $\lambda_2$) and additional kernel parameters (e.g. $\sigma_1$, $\sigma_2$). Although one could tune the parameters of the training step and the prediction step in a separate way, we decided to tune them together. This means that one gridsearch is performed and, for a fixed combination of parameters, (9) (or (24)) and (15) (or (21), (30)) are solved. Since the main goal of this paper is classification, the preferred performance measure is classification accuracy. Therefore, the combination of tuning parameters that maximizes the classification performance (i.e. at the level of the prediction step) on the validation data (cf. cross-validation or repeated random sampling) is chosen. The advantage of this approach is the lower computational cost of tuning only two parameters (kernel parameters excluded). Performing a separate model selection for the training and the prediction step would require tuning three parameters (kernel parameters excluded). Both cross-validation and repeated random sampling involve splitting the data into a training set and a validation set. Remark that data splitting was performed in such a way that cases, instead of individual observations, were repeatedly assigned to these sets. The final resulting model selection procedure for random intercept LS-SVM classification is summarized in Algorithm 1.
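The tuning loop just described is a gridsearch wrapped around case-level data splitting. The skeleton below sketches that loop; `fit` and `predict_labels` stand in for solving (9)/(24) and (15)/(21)/(30), and all names and defaults are illustrative.

```python
import numpy as np

def case_level_split(case_ids, frac_train=2 / 3, rng=None):
    """Split by case (not by observation), as the paper prescribes."""
    rng = rng or np.random.default_rng()
    ids = rng.permutation(np.unique(case_ids))
    n_tr = int(round(frac_train * len(ids)))
    return set(ids[:n_tr]), set(ids[n_tr:])

def grid_search(cases, labels, grid, fit, predict_labels, n_splits=25, rng=None):
    """cases: dict case_id -> data; labels: dict case_id -> class label;
    grid: iterable of parameter combinations. Returns the combination with
    the best average validation classification accuracy."""
    rng = rng or np.random.default_rng(0)
    best, best_acc = None, -1.0
    for params in grid:
        accs = []
        for _ in range(n_splits):
            tr, va = case_level_split(np.array(list(cases)), rng=rng)
            model = fit({i: cases[i] for i in tr},
                        {i: labels[i] for i in tr}, params)
            pred = predict_labels(model, {i: cases[i] for i in va})
            accs.append(np.mean([pred[i] == labels[i] for i in va]))
        if np.mean(accs) > best_acc:
            best, best_acc = params, float(np.mean(accs))
    return best, best_acc
```

Splitting at the case level keeps all observations of one profile on the same side of the split, which is essential for unbiased validation of longitudinal classifiers.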
Fig. 1 further illustrates the importance of a careful model selection. The data that are used are identical to the training and validation data of the toy problem in Section 5.1.1. While Fig. 2 visualizes the optimal results, in the sense that the selected parameters maximize classification performance, Fig. 1 shows the effect of changing one parameter at a time on the estimated class means (dashed lines) and the case-specific predicted curve (red line) for the observed curve (green line). Decreasing the bandwidth of the RBF kernel results in less smooth class means and predicted curve. Increasing $\lambda_2$ strengthens this effect
Algorithm 1 (Model selection and classification strategy for random intercept LS-SVM).
1: Start a gridsearch using cross-validation or repeated random sampling on training data to obtain values for $\lambda_1$ and $\lambda_2$. This includes step 2 and step 3:
2: Perform regression on training data by solving (9) or (24) to obtain estimates for $w$, $B_0$, $B_1 (, \ldots, B_C)$.
3: Use these estimates to perform prediction on validation data by solving (15) or (21) for binary problems and (30) for multi-class problems, and select the $\lambda_1$ and $\lambda_2$ with the best classification performance.
4: Finally, use the tuned $\lambda_1$ and $\lambda_2$ and corresponding estimates $\hat{w}$, $\hat{B}_0$, $\hat{B}_1 (, \ldots, \hat{B}_C)$ to perform prediction on test data by solving (15) or (21) for binary problems and (30) for multi-class problems.
since more emphasis is put on the sum of squared residuals. An increasing $\lambda_1$ value makes the class means even less smooth and shrinks the predicted curve toward the class mean.
4.2. Multiple random effects LS-SVM
The proposed method to deal with multiple random effects assumes that the covariance matrices $R^{-1}$ and $E_i^{-1}$ are known, while in practice this is not the case. A solution to this problem can be found by first applying the random intercept LS-SVM approach from Sections 2.2 and 2.3, in combination with the model selection procedure from Section 4.1, in order to estimate the mean (i.e. $\hat{B}_0$, $\hat{B}_1$, $\hat{w}_1$, $\hat{w}_2$) for each class. Then, for all observations of each case, the corresponding residuals are computed by removing the estimated mean of the true class to which this case belongs. The residuals of all cases can then be used to estimate $R^{-1}$ and $E_i^{-1}$ using (restricted) maximum likelihood theory (Patterson and Thompson, 1971). Finally, values for the regularization constants $\lambda_3$ and $\lambda_4$ can be obtained via cross-validation or repeated random sampling of the data. Remark that, since the problem in (31) requires a separate gridsearch, one of the lambdas could have been removed by dividing one through the other. In Section 3.2 both $\lambda_3$ and $\lambda_4$ were included to have a more consistent presentation. In practice, one lambda can be fixed while a gridsearch is used for the other one. The full procedure is summarized in Algorithm 2.
Algorithm 2 (Model selection and classification strategy for LS-SVM with multiple random effects).
1: Start with a random intercept LS-SVM and proceed as described in Algorithm 1 to obtain the estimates $\hat{w}$, $\hat{B}_0$, $\hat{B}_1 (, \ldots, \hat{B}_C)$.
2: Compute the residuals for all observations of each case by removing the estimated mean resulting from the random intercept LS-SVM step, where the given class label determines which estimated mean is used.
3: Use (restricted) maximum likelihood to estimate the covariance matrices $R^{-1}$ and $E_i^{-1}$ from the residuals of all cases.
4: Use cross-validation or repeated random sampling on training data to obtain values for $\lambda_3$ and $\lambda_4$. This includes step 5:
5: Perform prediction on validation data using the multiple random effects LS-SVM by solving (36) with the obtained $\hat{w}$, $\hat{B}_0$, $\hat{B}_1 (, \ldots, \hat{B}_C)$. Select the $\lambda_3$ and $\lambda_4$ with the best performance.
6: Finally, perform prediction on test data by solving (36) with $\hat{w}$, $\hat{B}_0$, $\hat{B}_1 (, \ldots, \hat{B}_C)$ and tuned $\lambda_3$ and $\lambda_4$.

5. Examples
This section provides examples of the application of the longitudinal LS-SVM classifier on four simulation studies and two real-life problems.

5.1. Simulation studies

5.1.1. Toy problem
For each class, 50 curves were generated according to the model in (5), resulting in the unbalanced data set in the upper right panel of Fig. 2. A simple variance structure, with a standard deviation of 80, was chosen for the normally distributed residual error, and the standard deviation of the normally distributed random intercept was 400. The number of observations per curve varied between 5 and 10. This data set was randomly split 25 times into a training set (2/3 of the cases) and a validation set (1/3 of the cases). An RBF kernel was used for the longitudinal LS-SVM classifier and a gridsearch was performed in order to tune the parameters λ1 ∈ {10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 0.1, 0.5, 0.7, 1, 2, 3, 4, 5, 10, 20, 30, 40}, λ2 ∈ {10^{-2}, 0.1, 1, ..., 20, 30, 40, 80, 100, 150, 200, 300, 400, 600, 800} and σ1, σ2 ∈ {10^{-3}, 10^{-2}, 0.1, 1, ..., 25}.

Fig. 1. Model selection. Top: decreasing the bandwidth σ of the RBF kernel results in wiggly estimates for the means (dashed lines) and the case-specific predicted curve (red line). The observed curve is visualized in green. Middle: putting more weight on the squared errors by increasing the regularization parameter λ2 makes the estimated means and predicted curve more unstable. Bottom: increasing λ1 shrinks the case-specific predicted curve to the estimated mean. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 2. Toy problem. Top left: simulated mean curves (solid) and estimated mean curves (dashed) for class 1 (black) and class 2 (blue). Top right: simulated training and validation data. Middle left: simulated test data. Middle right: simulated random effects versus estimated random effects. Bottom left: estimated z*-values for test data. The first 50 cases belong to class 1 while the last 50 cases belong to class 2. Bottom right: case 43 (green) and after removing the residual error (red). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
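Data generation for this toy problem can be sketched along the following lines. The class mean functions below are hypothetical placeholders (the actual means of model (5) are not reproduced here); only the random intercept and residual error settings match the description above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical class mean functions, standing in for the means of model (5).
means = {1: lambda t: 100.0 * t, 2: lambda t: 90.0 * t + 150.0}

def simulate_class(label, n_curves=50, sd_intercept=400.0, sd_error=80.0):
    """Simulate unbalanced longitudinal curves for one class."""
    curves = []
    for _ in range(n_curves):
        g_i = rng.integers(5, 11)              # 5-10 observations per case
        t = np.sort(rng.uniform(0, 10, g_i))   # non-fixed time points
        b_i = rng.normal(0, sd_intercept)      # random intercept
        y = means[label](t) + b_i + rng.normal(0, sd_error, g_i)
        curves.append((t, y))
    return curves

data = {c: simulate_class(c) for c in (1, 2)}
n_obs = sum(len(t) for c in data for t, _ in data[c])
print(len(data[1]), len(data[2]), n_obs)
```

Each case is stored as a (time points, observations) pair, so the total number of observations varies between 500 and 1000 across runs, mirroring the unbalanced design of the experiment.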
Empirical results showed that the classification approach in (15) produced predictions identical to those of the alternative approach following (21). The combination of tuning parameters with the lowest (i.e. 2.19%) averaged balanced error rate (BER, i.e. the average of the class-specific error rates) on the validation data was: λ1 = 10^{-3}, λ2 = 300, σ1 = σ2 = 8. Fig. 2 illustrates that the estimated mean curves for class 1 and class 2 approximate the true mean curves. The LS-SVM classifier was then used to classify 50 new curves from class 1 and 50 new curves from class 2. The estimated random effects are close to the true random effects and the BER on the independent test set was 3%. Finally, Fig. 2 presents the estimated curve for case 43 after removing the residual error. The average over the squared residuals related to the prediction of y*_{ij}, i.e.

(1/G) Σ_{i=1}^{G} (1/g_i) Σ_{j=1}^{g_i} ( y*_{ij} − B̂0 − B̂1 ẑ*_i − b̂*_i − (ẑ*_i − 1/2) ŵ1^T φ1(x*_{ij}) − (ẑ*_i + 1/2) ŵ2^T φ2(x*_{ij}) )^2,

equaled 4477.5.

5.1.2. Sensitivity on tuning parameters
A training and validation data set with 50 curves for each class was generated according to the model in (5) to illustrate the influence of the tuning parameters of the longitudinal LS-SVM on the classification performance (Fig. 3). A simple variance structure, with a standard deviation of 60, was used for the normally distributed residual error, and the standard deviation of the normally distributed random intercept was 600. The number of observations per curve varied between 5 and 7. Parameter tuning based on 50 random splits resulted in the following settings: λ1 = 10^{-4}, λ2 = 300, σ1 = σ2 = 9. Fig. 3 visualizes the true mean curves and the estimated mean curves, using these settings, for class 1 and class 2. In order to study the sensitivity of the classifier's performance to the tuning parameters, a grid was created around the optimum on the validation data: λ1 ∈ {10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}}, λ2 ∈ {80, 150, 200, 300, 400, 600} and σ1, σ2 ∈ {5, 7, 8, 9, 11, 13}. Next, all 180 combinations of tuning parameters were evaluated on 100 newly generated test data sets (e.g. Fig. 3, middle left panel) with 50 curves for each class, while training was done on the original training and validation data. The histogram of all 180 averaged (i.e. over the 100 repetitions) BERs is visualized in Fig. 3. The median BER was 19.50% while all 180 models fell within a range of 1.76%. This illustrates that the classifier's performance is rather insensitive to (small) changes in the tuning parameters. The bottom row of Fig. 3 presents the BER for different combinations of tuning parameters. Clearly, setting σ1 = σ2 = 13 or λ2 = 80 results in larger BERs. Omitting these models from the histogram makes clear that all remaining models obtain BERs within a range of less than 1%.
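The balanced error rate used throughout these experiments is simply the average of the class-specific error rates, which can be computed as follows; the labels in the usage example are made up and only illustrate the metric.

```python
import numpy as np

def balanced_error_rate(y_true, y_pred):
    """BER: the average of the class-specific error rates."""
    classes = np.unique(y_true)
    errs = [np.mean(y_pred[y_true == c] != c) for c in classes]
    return float(np.mean(errs))

# Illustrative labels only (not the paper's classifier output):
y_true = np.array([1] * 50 + [2] * 50)
y_pred = y_true.copy()
y_pred[:5] = 2       # 5 of 50 class-1 cases wrong -> 10% class-1 error
y_pred[50:52] = 1    # 2 of 50 class-2 cases wrong -> 4% class-2 error
print(balanced_error_rate(y_true, y_pred))  # (0.10 + 0.04) / 2 = 0.07
```

Unlike the plain error rate, the BER weights both classes equally, which matters when the two classes contribute unequal numbers of cases.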
5.1.3. Multi-class classification
The use of the multi-class approach from Section 3.1 is illustrated for a classification problem involving three classes (Fig. 4). Fifty curves were generated for each of the classes following the model in (23) with normally distributed residual error (with standard deviation 150) and random intercept (with standard deviation 600). The number of observations per curve varied between 5 and 7.
The set with the smallest (i.e. 8.96%) averaged error rate on the validation data based on 100 times random sampling was: λ1 = 2, λ2 = 16, σ1 = σ2 = σ3 = 2. The upper left panel of Fig. 4 visualizes the estimated mean curves, which approximate the true mean curves. The multi-class classifier was then used to classify 50 new curves from each class and resulted in an overall error rate of 5.33%. The average over the squared residuals related to the prediction of y*_{ij} equaled 18308.2. The lower right panel of Fig. 4 shows cost_c^{-1} / (cost_1^{-1} + cost_2^{-1} + cost_3^{-1}) for each test case with c = 1, ..., 3. The first 50 test cases belong to class 1, the following 50 to class 2, etc.
5.1.4. Multiple random effects
To illustrate classification using models with multiple random effects, 100 curves were generated for each of the classes. To simulate the data, a normally distributed residual error with standard deviation 80 (i.e. uncorrelated errors) and three normally distributed random effects (i.e. random intercept, random slope and random quadratic effect) with covariance matrix

[400  10   5]
[ 10  40  20]
[  5  20  20]

were assumed for the model in (32). The number of observations per curve varied between 5 and 10. Test data, consisting of 100 curves for each class, were simulated with similar settings. The top left panel of Fig. 5 visualizes the mean for each of the two classes, while the upper right panel visualizes the test data. The strategy in Algorithm 2 was followed for classifier building. First, a random intercept model was constructed using the training and validation data.
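Sampling the three correlated random effects with the stated covariance matrix can be sketched as follows; the time grid is arbitrary and the fixed class mean of model (32) is omitted, so only the case-specific deviation is shown.

```python
import numpy as np

rng = np.random.default_rng(2)

# Covariance matrix of (random intercept, random slope, random quadratic
# effect), as in the simulation described above.
D = np.array([[400.0, 10.0,  5.0],
              [ 10.0, 40.0, 20.0],
              [  5.0, 20.0, 20.0]])

# Draw the random effects for 100 cases.
b = rng.multivariate_normal(np.zeros(3), D, size=100)

# Case-specific deviation b0 + b1*t + b2*t^2 for the first case
# (the fixed class mean would be added on top of this).
t = np.linspace(0, 5, 6)
deviation = b[0, 0] + b[0, 1] * t + b[0, 2] * t ** 2

# Sanity check: the empirical covariance should be close to D.
emp_cov = np.cov(b, rowvar=False)
print(emp_cov.shape, deviation.shape)
```

The quadratic term lets simulated curves bend away from the class mean, which is exactly the extra flexibility the multiple random effects model is meant to capture relative to a random intercept alone.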
A gridsearch and repeated random sampling resulted in the following parameter settings: λ1 = 0.01, λ2 = 80, σ1 = σ2 = 2. The upper left panel of Fig. 5 demonstrates that the estimated mean curves approximated the true mean curves. After removing the estimated mean, restricted maximum likelihood was used on the training data to estimate the covariance matrix of the three random effects and the residual error variance. Subsequently, a gridsearch, in combination with repeated random sampling, was used to tune the regularization constants: λ3 = 1600 and λ4 = 950. Finally, this classifier with three random effects was used on the independent test data and resulted in a BER of 9.5% (average over squared residuals for prediction of y*_{ij} = 5512.3), while an LS-SVM with only a random intercept yielded a BER of 15.5% (average over squared residuals for prediction of y*_{ij} = 5643.0). The cases that are misclassified by the LS-SVM with multiple random effects are visualized in the bottom panel of Fig. 5. It is clear that these remaining misclassifications are atypical profiles.
Fig. 3. Sensitivity analysis. Top left: simulated mean curves (solid) and estimated mean curves (dashed) for class 1 (black) and class 2 (blue). Top right: simulated training and validation data. Middle left: simulated test data. Middle right: histogram for averaged BER on test data. Bottom left: averaged BER on test data for σ1 = σ2 = 9. Bottom right: averaged BER on test data for σ1 = σ2 = 13. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

5.2. Real-life data sets
5.2.1. Berkeley growth data
The Berkeley growth data set contains 54 growth curves of girls and 39 growth curves of boys (Ramsay and Silverman, 2005). The classification problem is to predict the sex of a case based on the growth curve. Each original curve consists
of 31 observations, taken at fixed time points. This experiment reports the performance of the random intercept LS-SVM
classifier for various levels of unbalanced data (Fig. 6). Observations were deleted from each sample in a random manner in