Contents lists available at SciVerse ScienceDirect

Computational Statistics and Data Analysis

journal homepage: www.elsevier.com/locate/csda

A mixed effects least squares support vector machine model for classification of longitudinal data

Jan Luts a,b,∗, Geert Molenberghs c,d, Geert Verbeke d, Sabine Van Huffel a,b, Johan A.K. Suykens a,b

a Department of Electrical Engineering (ESAT), Research Division SCD, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, B-3001 Leuven, Belgium
b IBBT-K.U. Leuven Future Health Department, Leuven, Belgium
c I-BioStat, Universiteit Hasselt, Agoralaan 1, B-3590 Diepenbeek, Belgium
d I-BioStat, Katholieke Universiteit Leuven, Kapucijnenvoer 35, B-3000 Leuven, Belgium
Article history:
Received 12 November 2010
Received in revised form 9 September 2011
Accepted 10 September 2011
Available online 18 September 2011

Keywords: Classification; Longitudinal data; Least squares; Support vector machine; Kernel method; Mixed model
Abstract
A mixed effects least squares support vector machine (LS-SVM) classifier is introduced to extend the standard LS-SVM classifier for handling longitudinal data. The mixed effects LS-SVM model contains a random intercept and allows classification of highly unbalanced data, in the sense that there is an unequal number of observations for each case at non-fixed time points. The methodology consists of a regression modeling step and a classification step based on the obtained regression estimates. Regression and classification of new cases are performed in a straightforward manner by solving a linear system. It is demonstrated that the methodology can be generalized to deal with multi-class problems and can be extended to incorporate multiple random effects. The technique is illustrated on simulated data sets and real-life problems concerning human growth.
© 2011 Elsevier B.V. All rights reserved.
1. Introduction
During the past decade, kernel-based approaches for estimation and learning problems have gained significant
interest (Schölkopf and Smola, 2001). Kernel methods are typically used in the machine learning community to deal with
all kinds of classification problems involving input vectors with a fixed dimension. For instance, the popular kernel-based technique of support vector machines (SVMs) has been used successfully in many applications (Vapnik, 1998). On the other hand, for many real-life problems the data are characterized by their non-fixed dimensional nature. For example, the analysis of longitudinal data, which consist of repeated measurements on the same units (often called cases) over a period of time, generally requires methods that can deal with unbalanced data. Within the context of longitudinal data, unbalanced means that there is an unequal number of observations for each case at non-fixed time points. While SVMs are not straightforwardly applicable to this kind of data, the classical approach in the statistics community is to analyze
longitudinal data by means of mixed effects models (Verbeke and Molenberghs, 2000). Importantly, these models are able
to take the relationship between observations from the same case into account. Estimation of the fixed and random effects is commonly based on the theory of best linear unbiased prediction (BLUP).
∗Corresponding author at: Department of Electrical Engineering (ESAT), Research Division SCD, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, B-3001 Leuven, Belgium. Tel.: +32 16 321065; fax: +32 16 321970.
E-mail address: jan.luts@esat.kuleuven.be (J. Luts).
To classify longitudinal profiles in different groups, Verbeke and Lesaffre (1996) proposed a linear mixed effects model with heterogeneity in the random effects. The normality assumption for the random effects was replaced by a mixture of normal distributions with a common covariance matrix. Maximum likelihood estimates were obtained based on the expectation-maximization algorithm. A multi-class classification rule was established by assigning a case to the component to which it had the highest posterior probability of belonging. Marshall and Barón (2000) designed a linear/quadratic
discriminant analysis method for unbalanced longitudinal data using mixed effects models. Linear models as well as models that were nonlinear with respect to the mean and the random effects were included. An iterative algorithm was proposed to estimate the class means and covariance matrices, which were subsequently used in various discriminant functions. Variance components were allowed to vary between classes. The final classification was performed based on an extended
Mahalanobis distance measure. James and Hastie (2001) extended linear discriminant analysis to a functional linear
discriminant analysis (FLDA) classifier for irregularly sampled curves. An expectation-maximization algorithm was used to fit a mixed model with splines as random and fixed effects. Final class assignment was based on posterior probabilities. Several extensions such as functional quadratic discriminant analysis (i.e. class-specific covariance matrices) and functional
regularized discriminant analysis were proposed. Wouters et al. (2007) presented a two-step approach called doubly
hierarchical discriminant analysis to classify high-dimensional longitudinal data. This approach first performs modeling of the data via a class of flexible linear mixed models. The resulting set of estimated parameters, which represented a summary of the data, was then fed to a hierarchical linear discriminant classifier. A decision tree composed of binary
classifiers was constructed and posterior probabilities were computed for class assignment. Fieuws et al. (2008) focused
on the classification of multivariate longitudinal profiles. Univariate linear, nonlinear and generalized mixed models were combined into a multivariate mixed model in order to take the correlation between the different profiles into account. Association between the different profiles was assumed to be fully captured by the association between their random effects. Parameter estimates were obtained based on combining all possible bivariate mixed models. Posterior probabilities were
calculated for class assignment. Other studies, e.g. Brown et al. (2001) and De la Cruz-Mesía et al. (2007), addressed the
problem of classification of longitudinal profiles from a Bayesian point of view. The former described a multivariate model to classify longitudinal data, that were only measured at fixed time points, and used an expectation-conditional maximization algorithm. The latter developed a semi-parametric approach for unbalanced longitudinal data and used Markov chain Monte Carlo methods. In addition, the topic of longitudinal data classification is also related to the field of functional
data analysis (González-Manteiga and Vieu, 2007; Ramsay and Silverman, 2005). Functional data can be interpreted as
an observation over a certain continuum, e.g. time. Even though these observations are discrete, they can be perceived
as values that vary smoothly over time, and therefore they are called functional observations. Within this context, James (2002) developed a technique to incorporate (multiple) functional predictors in generalized linear models. An expectation-maximization algorithm was described to fit the spline-based functional logistic regression model. Finally, Müller (2005)
showed an example of the use of a functional binomial regression model with a logistic link for classification of sparse longitudinal data.
Various approaches have been used to deal with nonlinear profiles in mixed effects models. For instance, the FLDA
approach by James and Hastie (2001) included natural cubic spline functions to model the observations of each case. Wouters et al. (2007) employed a fractional polynomial mixed model to extract features as input for discriminant analysis. Liu et al.
(2007) studied the use of kernels for regression analysis by means of linear mixed models, but longitudinal data were
not considered. More recently, explicit connections between the traditional approaches for longitudinal data fitting and
kernel machines were described (Pearce and Wand, 2009). The correspondence between reproducing kernel Hilbert space
optimization and estimated BLUPs was illustrated. Whereas the study by Pearce and Wand (2009) showed that kernel
methods can be used for regression modeling of longitudinal data, the present study focuses on classification of this type
of data using a kernel-based approach. To this end, the least squares support vector machine (LS-SVM) classifier (Suykens et al., 2002; Suykens and Vandewalle, 1999) is extended to a mixed effects LS-SVM model for classification of longitudinal profiles.
The proposed method consists of a regression modeling step and a consecutive prediction step (Section 2). This kernel method combines the regularization property of support vector machines with mixed effects models. The methodology is able to deal with nonlinearity in the data and it addresses binary as well as multi-class classification (Section 3.1). A model formulation with multiple random effects and a corresponding classification strategy are developed (Section 3.2). Model selection via parameter tuning is discussed in Section 4. The techniques are applied to various simulated and real-life problems (Section 5).
2. Random intercept LS-SVM
This section starts with a brief review of the standard LS-SVM classifier (Section 2.1). In the next sections, an LS-SVM model is developed for longitudinal data classification. The random intercept LS-SVM approach includes two steps. Firstly, a regression is done using training data for which the true class labels are known, in order to estimate the model parameters (Section 2.2). Secondly, new longitudinal profiles with unknown class labels can be classified based on the obtained model parameters from the previous step (Section 2.3). Finally, an alternative approach to this classification step is established (Section 2.4).
2.1. LS-SVM classifier
The standard LS-SVM classifier is a binary classifier and requires a data set of $N$ input vectors and corresponding class labels $\{x_i, z_i\}_{i=1}^N$, with $x_i \in \mathbb{R}^d$ the input pattern and $z_i \in \{-1, +1\}$ the class indicator. The proposed optimization problem is defined as

$$\min_{w,e,b} \; \frac{1}{2} w^T w + \frac{\gamma}{2} \sum_{i=1}^N e_i^2, \qquad (1)$$

subject to

$$z_i \left( w^T \varphi(x_i) + b \right) = 1 - e_i, \qquad i = 1, \ldots, N, \qquad (2)$$

with $e = [e_1 \ldots e_N]^T$, $\varphi(\cdot): \mathbb{R}^d \to \mathbb{R}^{d_h}$ a mapping from the input space to a high-dimensional feature space with dimension $d_h$, $w$ a vector of the same dimension as $\varphi(\cdot)$, $\gamma$ a positive regularization constant and $b$ a bias term. In contrast with SVMs, LS-SVMs employ a least squares loss function with error variables $e_i$, and equality constraints are used in place of inequality constraints. This formulation results in solving a set of linear equations, while SVMs require solving a quadratic programming problem. Whereas the primal problem in (1) is expressed in terms of the feature map, the resulting linear system in the dual space is expressed in terms of the kernel function. The method requires tuning of the regularization constant $\gamma$ and kernel parameters. The resulting LS-SVM classifier can be used to classify a test data set $\{x_j^*, z_j^*\}_{j=1}^G$. The classifier is formulated as

$$\hat{z}_j^* = \mathrm{sign}\left( \sum_{i=1}^N \hat{\beta}_i z_i K(x_j^*, x_i) + \hat{b} \right), \qquad (3)$$

in the dual space, with $x_j^*$ the $j$th test case to be classified, Lagrange multipliers $\beta_i$ and the positive definite kernel function $K(\cdot,\cdot)$ with $K(x_j^*, x_i) = \varphi(x_j^*)^T \varphi(x_i)$. In practice, a linear, polynomial or radial basis function (RBF) kernel is often used. In the remainder of this study the RBF kernel, i.e. $K(x_j^*, x_i) = \exp(-\|x_j^* - x_i\|_2^2 / \sigma^2)$, is employed.

The LS-SVM has been shown to be a core model, leading to a generic framework that allows new methods to be defined in a systematic and constructive way (De Brabanter et al., 2010, 2011; Luts et al., 2010; Suykens et al., 2010). Examples include weighted versions to incorporate robustness, sparse representations and a fixed-size approach for large-scale data. The models provide connections to parametric as well as non-parametric statistics via primal and dual formulations. The LS-SVM, having minimal underlying assumptions, is well suited to deal with high-dimensional input data. In addition, reliable numerical implementations are available (e.g. a linear system for classification and regression).
2.2. Training: random intercept LS-SVM regression
The longitudinal training data can be represented as $y_{ij}$, the observed value at time $x_{ij}$ for case $i$, $i = 1, \ldots, N$, $j = 1, \ldots, n_i$, with $N$ the number of cases, $n_i$ the number of observations for case $i$, and the total number of observations $M = \sum_{i=1}^N n_i$. The class labels are known and are denoted by $z_i \in \{-1, +1\}$ for case $i$. The training step consists of a regression modeling. Based on the theory of BLUP (Robinson, 1991), the proposed optimization problem for random intercept LS-SVM regression is formulated as

$$(\hat{w}, \hat{e}, \hat{b}, \hat{B}_0, \hat{B}_1) = \operatorname*{argmin}_{w,e,b,B_0,B_1} \; \frac{1}{2} w^T w + \frac{\lambda_1}{2} \sum_{i=1}^N b_i^2 + \frac{\lambda_2}{2} \sum_{i=1}^N \sum_{j=1}^{n_i} e_{ij}^2, \qquad (4)$$

subject to the model

$$y_{ij} = B_0 + B_1 z_i + b_i + \frac{z_i - 1}{2} w_1^T \varphi_1(x_{ij}) + \frac{z_i + 1}{2} w_2^T \varphi_2(x_{ij}) + e_{ij}, \qquad i = 1, \ldots, N, \; j = 1, \ldots, n_i, \qquad (5)$$

with weighting vector $w = [w_1^T \; w_2^T]^T$, a vector of random intercepts $b = [b_1 \ldots b_N]^T$, error variables $e = [e_{11} \ldots e_{1n_1} \, e_{21} \ldots e_{Nn_N}]^T \in \mathbb{R}^M$, positive regularization constants $\lambda_1$ and $\lambda_2$, model parameters $B_0$ and $B_1$, and mappings $\varphi_1(\cdot)$ and $\varphi_2(\cdot)$. The Lagrangian for the problem in (4)-(5) is

$$\mathcal{L}_1(w, e, b, B_0, B_1; \alpha) = \frac{1}{2} w^T w + \frac{\lambda_1}{2} \sum_{i=1}^N b_i^2 + \frac{\lambda_2}{2} \sum_{i=1}^N \sum_{j=1}^{n_i} e_{ij}^2 - \sum_{i=1}^N \sum_{j=1}^{n_i} \alpha_{ij} \left[ y_{ij} - B_0 - B_1 z_i - b_i - \frac{z_i - 1}{2} w_1^T \varphi_1(x_{ij}) - \frac{z_i + 1}{2} w_2^T \varphi_2(x_{ij}) - e_{ij} \right], \qquad (6)$$

with Lagrange multipliers $\alpha = [\alpha_{11} \ldots \alpha_{1n_1} \, \alpha_{21} \ldots \alpha_{Nn_N}]^T \in \mathbb{R}^M$. The conditions for optimality yield

$$
\begin{aligned}
\frac{\partial \mathcal{L}_1}{\partial w_1} = 0 \;&\rightarrow\; w_1 = -\sum_{i=1}^N \sum_{j=1}^{n_i} \alpha_{ij} \frac{z_i - 1}{2} \varphi_1(x_{ij}), \\
\frac{\partial \mathcal{L}_1}{\partial w_2} = 0 \;&\rightarrow\; w_2 = -\sum_{i=1}^N \sum_{j=1}^{n_i} \alpha_{ij} \frac{z_i + 1}{2} \varphi_2(x_{ij}), \\
\frac{\partial \mathcal{L}_1}{\partial e_{ij}} = 0 \;&\rightarrow\; e_{ij} = -\frac{\alpha_{ij}}{\lambda_2}, \qquad i = 1, \ldots, N, \; j = 1, \ldots, n_i, \\
\frac{\partial \mathcal{L}_1}{\partial b_i} = 0 \;&\rightarrow\; b_i = -\frac{1}{\lambda_1} \sum_{j=1}^{n_i} \alpha_{ij}, \qquad i = 1, \ldots, N, \\
\frac{\partial \mathcal{L}_1}{\partial \alpha_{ij}} = 0 \;&\rightarrow\; y_{ij} = B_0 + B_1 z_i + b_i + \frac{z_i - 1}{2} w_1^T \varphi_1(x_{ij}) + \frac{z_i + 1}{2} w_2^T \varphi_2(x_{ij}) + e_{ij}, \qquad i = 1, \ldots, N, \; j = 1, \ldots, n_i, \\
\frac{\partial \mathcal{L}_1}{\partial B_0} = 0 \;&\rightarrow\; 0 = \sum_{i=1}^N \sum_{j=1}^{n_i} \alpha_{ij}, \\
\frac{\partial \mathcal{L}_1}{\partial B_1} = 0 \;&\rightarrow\; 0 = \sum_{i=1}^N \sum_{j=1}^{n_i} \alpha_{ij} z_i.
\end{aligned}
\qquad (7)
$$

After eliminating $w_1$, $w_2$, $e_{ij}$ and $b_i$ one obtains

$$
\begin{aligned}
y_{ij} &= B_0 + B_1 z_i - \frac{1}{\lambda_1} \sum_{k=1}^{n_i} \alpha_{ik} - \sum_{k=1}^N \sum_{l=1}^{n_k} \alpha_{kl} \frac{z_k - 1}{2} \frac{z_i - 1}{2} \varphi_1(x_{kl})^T \varphi_1(x_{ij}) - \sum_{k=1}^N \sum_{l=1}^{n_k} \alpha_{kl} \frac{z_k + 1}{2} \frac{z_i + 1}{2} \varphi_2(x_{kl})^T \varphi_2(x_{ij}) - \frac{\alpha_{ij}}{\lambda_2}, \\
&\qquad\qquad i = 1, \ldots, N, \; j = 1, \ldots, n_i, \\
0 &= \sum_{i=1}^N \sum_{j=1}^{n_i} \alpha_{ij}, \qquad 0 = \sum_{i=1}^N \sum_{j=1}^{n_i} \alpha_{ij} z_i.
\end{aligned}
\qquad (8)
$$

This can be written as the linear system

$$
\begin{bmatrix}
0_{2 \times 2} & [1_M \; Z]^T \\
[1_M \; Z] & -K_1 - K_2 - \frac{\Gamma}{\lambda_1} - \frac{I_{M \times M}}{\lambda_2}
\end{bmatrix}
\begin{bmatrix} B_0 \\ B_1 \\ \alpha \end{bmatrix}
=
\begin{bmatrix} 0_{2 \times 1} \\ y \end{bmatrix},
\qquad (9)
$$

with $y = [y_{11} \ldots y_{1n_1} \, y_{21} \ldots y_{Nn_N}]^T \in \mathbb{R}^M$, $Z = [\underbrace{z_1 \ldots z_1}_{n_1} \; \underbrace{z_2 \ldots z_2}_{n_2} \ldots \underbrace{z_N \ldots z_N}_{n_N}]^T$, $x = [x_{11} \ldots x_{1n_1} \, x_{21} \ldots x_{Nn_N}]^T \in \mathbb{R}^M$, $\Gamma = \operatorname{blockdiag}(1_{n_1 \times n_1}, \ldots, 1_{n_N \times n_N})$, kernel matrix $K_1 \in \mathbb{R}^{M \times M}$ with elements $K_{1rs} = \frac{Z_r - 1}{2} \frac{Z_s - 1}{2} \varphi_1(x_s)^T \varphi_1(x_r) = \frac{Z_r - 1}{2} \frac{Z_s - 1}{2} K_1(x_s, x_r)$, and kernel matrix $K_2 \in \mathbb{R}^{M \times M}$ with elements $K_{2rs} = \frac{Z_r + 1}{2} \frac{Z_s + 1}{2} \varphi_2(x_s)^T \varphi_2(x_r) = \frac{Z_r + 1}{2} \frac{Z_s + 1}{2} K_2(x_s, x_r)$, where $K_1(\cdot,\cdot)$ and $K_2(\cdot,\cdot)$ are positive definite kernel functions. Values for the regularization constants (and kernel parameters) can be obtained via cross-validation procedures. This is further discussed in Section 4.
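To make the structure of system (9) concrete, the sketch below assembles and solves it with numpy, assuming one-dimensional time covariates $x_{ij}$ and RBF kernels in the role of $K_1$ and $K_2$. All names are illustrative, not from the paper.

```python
import numpy as np

def rbf(a, b, sigma):
    """RBF kernel for scalar time points: exp(-(a - b)^2 / sigma^2)."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / sigma ** 2)

def train_random_intercept_lssvm(x, y, Z, n_i, lam1, lam2, sigma1, sigma2):
    """Assemble and solve the dual system (9).
    x, y: all M stacked observations; Z: per-observation class label (+-1),
    repeated within each case; n_i: number of observations per case."""
    M = len(y)
    c1 = (Z - 1) / 2.0                       # selects the class -1 kernel part
    c2 = (Z + 1) / 2.0                       # selects the class +1 kernel part
    K1 = np.outer(c1, c1) * rbf(x, x, sigma1)
    K2 = np.outer(c2, c2) * rbf(x, x, sigma2)
    Gamma = np.zeros((M, M))                 # blockdiag of all-ones blocks
    s = 0
    for n in n_i:
        Gamma[s:s + n, s:s + n] = 1.0
        s += n
    A = np.zeros((M + 2, M + 2))
    A[0, 2:] = 1.0                           # constraint: sum_ij alpha_ij = 0
    A[1, 2:] = Z                             # constraint: sum_ij alpha_ij z_i = 0
    A[2:, 0] = 1.0
    A[2:, 1] = Z
    A[2:, 2:] = -K1 - K2 - Gamma / lam1 - np.eye(M) / lam2
    sol = np.linalg.solve(A, np.r_[0.0, 0.0, y])
    B0, B1, alpha = sol[0], sol[1], sol[2:]
    return B0, B1, alpha
```

The returned $\hat{\alpha}$ gives the kernel expansions of $\hat{w}_1$ and $\hat{w}_2$ used in the prediction step of Section 2.3.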
2.3. Prediction: random intercept LS-SVM classification
Now that the parameters $w_1$, $w_2$, $B_0$ and $B_1$ in (5) are estimated, these can be used to classify a longitudinal data set $\{\{x_{ij}^*, y_{ij}^*\}_{j=1}^{g_i}\}_{i=1}^G$ with unknown class labels $z_i^*$, where $G$ represents the number of cases, $g_i$ the number of observations for case $i$, and $H = \sum_{i=1}^G g_i$ the total number of observations. The proposed optimization problem for prediction is formulated as

$$(\hat{e}^*, \hat{b}^*, \hat{z}^*) = \operatorname*{argmin}_{e^*, b^*, z^*} \; \frac{\lambda_1}{2} \sum_{i=1}^G b_i^{*2} + \frac{\lambda_2}{2} \sum_{i=1}^G \sum_{j=1}^{g_i} e_{ij}^{*2}, \qquad (10)$$

subject to

$$y_{ij}^* = \hat{B}_0 + \hat{B}_1 z_i^* + b_i^* + \frac{z_i^* - 1}{2} \hat{w}_1^T \varphi_1(x_{ij}^*) + \frac{z_i^* + 1}{2} \hat{w}_2^T \varphi_2(x_{ij}^*) + e_{ij}^*, \qquad i = 1, \ldots, G, \; j = 1, \ldots, g_i, \qquad (11)$$

with random intercepts $b^* = [b_1^* \ldots b_G^*]^T$, error variables $e^* = [e_{11}^* \ldots e_{1g_1}^* \, e_{21}^* \ldots e_{Gg_G}^*]^T \in \mathbb{R}^H$, class labels $z^* = [z_1^* \ldots z_G^*]^T$, weights $\hat{w}_1 = -\sum_{i=1}^N \sum_{j=1}^{n_i} \hat{\alpha}_{ij} \frac{z_i - 1}{2} \varphi_1(x_{ij})$ and weights $\hat{w}_2 = -\sum_{i=1}^N \sum_{j=1}^{n_i} \hat{\alpha}_{ij} \frac{z_i + 1}{2} \varphi_2(x_{ij})$. The Lagrangian for the problem in (10)-(11) is

$$\mathcal{L}_2(e^*, b^*, z^*; \nu) = \frac{\lambda_1}{2} \sum_{i=1}^G b_i^{*2} + \frac{\lambda_2}{2} \sum_{i=1}^G \sum_{j=1}^{g_i} e_{ij}^{*2} - \sum_{i=1}^G \sum_{j=1}^{g_i} \nu_{ij} \left[ y_{ij}^* - \hat{B}_0 - \hat{B}_1 z_i^* - b_i^* - \frac{z_i^* - 1}{2} \hat{w}_1^T \varphi_1(x_{ij}^*) - \frac{z_i^* + 1}{2} \hat{w}_2^T \varphi_2(x_{ij}^*) - e_{ij}^* \right], \qquad (12)$$

with Lagrange multipliers $\nu = [\nu_{11} \ldots \nu_{1g_1} \, \nu_{21} \ldots \nu_{Gg_G}]^T \in \mathbb{R}^H$. The conditions for optimality result in

$$
\begin{aligned}
\frac{\partial \mathcal{L}_2}{\partial e_{ij}^*} = 0 \;&\rightarrow\; e_{ij}^* = -\frac{\nu_{ij}}{\lambda_2}, \qquad i = 1, \ldots, G, \; j = 1, \ldots, g_i, \\
\frac{\partial \mathcal{L}_2}{\partial b_i^*} = 0 \;&\rightarrow\; b_i^* = -\frac{1}{\lambda_1} \sum_{j=1}^{g_i} \nu_{ij}, \qquad i = 1, \ldots, G, \\
\frac{\partial \mathcal{L}_2}{\partial z_i^*} = 0 \;&\rightarrow\; 0 = \sum_{j=1}^{g_i} \nu_{ij} \left[ \hat{B}_1 + \frac{\hat{w}_1^T \varphi_1(x_{ij}^*)}{2} + \frac{\hat{w}_2^T \varphi_2(x_{ij}^*)}{2} \right], \qquad i = 1, \ldots, G, \\
\frac{\partial \mathcal{L}_2}{\partial \nu_{ij}} = 0 \;&\rightarrow\; y_{ij}^* = \hat{B}_0 + \hat{B}_1 z_i^* + b_i^* + \frac{z_i^* - 1}{2} \hat{w}_1^T \varphi_1(x_{ij}^*) + \frac{z_i^* + 1}{2} \hat{w}_2^T \varphi_2(x_{ij}^*) + e_{ij}^*, \qquad i = 1, \ldots, G, \; j = 1, \ldots, g_i.
\end{aligned}
\qquad (13)
$$

Elimination of $e_{ij}^*$ and $b_i^*$ gives

$$
\begin{aligned}
0 &= \sum_{j=1}^{g_i} \nu_{ij} \left[ \hat{B}_1 + \frac{\hat{w}_1^T \varphi_1(x_{ij}^*)}{2} + \frac{\hat{w}_2^T \varphi_2(x_{ij}^*)}{2} \right], \qquad i = 1, \ldots, G, \\
y_{ij}^* &= \hat{B}_0 + \hat{B}_1 z_i^* - \frac{1}{\lambda_1} \sum_{k=1}^{g_i} \nu_{ik} + \frac{z_i^* - 1}{2} \hat{w}_1^T \varphi_1(x_{ij}^*) + \frac{z_i^* + 1}{2} \hat{w}_2^T \varphi_2(x_{ij}^*) - \frac{\nu_{ij}}{\lambda_2}, \qquad i = 1, \ldots, G, \; j = 1, \ldots, g_i.
\end{aligned}
\qquad (14)
$$

The optimization problem in the dual space is formulated as

$$
\begin{bmatrix}
0_{G \times G} & A^{*T} \\
A^* & -\frac{\Theta}{\lambda_1} - \frac{I_{H \times H}}{\lambda_2}
\end{bmatrix}
\begin{bmatrix} z^* \\ \nu \end{bmatrix}
=
\begin{bmatrix} 0_G \\ y^* - \hat{B}_0 1_H + P^* \end{bmatrix},
\qquad (15)
$$

with $y^* = [y_{11}^* \ldots y_{1g_1}^* \, y_{21}^* \ldots y_{Gg_G}^*]^T \in \mathbb{R}^H$, $\Theta = \operatorname{blockdiag}(1_{g_1 \times g_1}, \ldots, 1_{g_G \times g_G})$,

$$A^* = \operatorname{blockdiag}\left( \left[ \hat{B}_1 + \tfrac{\hat{w}_1^T \varphi_1(x_{11}^*)}{2} + \tfrac{\hat{w}_2^T \varphi_2(x_{11}^*)}{2} \; \ldots \; \hat{B}_1 + \tfrac{\hat{w}_1^T \varphi_1(x_{1g_1}^*)}{2} + \tfrac{\hat{w}_2^T \varphi_2(x_{1g_1}^*)}{2} \right]^T, \ldots, \left[ \hat{B}_1 + \tfrac{\hat{w}_1^T \varphi_1(x_{G1}^*)}{2} + \tfrac{\hat{w}_2^T \varphi_2(x_{G1}^*)}{2} \; \ldots \; \hat{B}_1 + \tfrac{\hat{w}_1^T \varphi_1(x_{Gg_G}^*)}{2} + \tfrac{\hat{w}_2^T \varphi_2(x_{Gg_G}^*)}{2} \right]^T \right)$$

and

$$P^* = \left[ \tfrac{\hat{w}_1^T \varphi_1(x_{11}^*)}{2} - \tfrac{\hat{w}_2^T \varphi_2(x_{11}^*)}{2} \; \ldots \; \tfrac{\hat{w}_1^T \varphi_1(x_{1g_1}^*)}{2} - \tfrac{\hat{w}_2^T \varphi_2(x_{1g_1}^*)}{2} \; \tfrac{\hat{w}_1^T \varphi_1(x_{21}^*)}{2} - \tfrac{\hat{w}_2^T \varphi_2(x_{21}^*)}{2} \; \ldots \; \tfrac{\hat{w}_1^T \varphi_1(x_{Gg_G}^*)}{2} - \tfrac{\hat{w}_2^T \varphi_2(x_{Gg_G}^*)}{2} \right]^T.$$

The resulting longitudinal LS-SVM classifier is defined as $\mathrm{sign}(\hat{z}^*)$. Note that solving (15) with all $G$ test cases or with one test case at a time will result in identical predictions. Section 4 provides a discussion on the selection of the values for the regularization constants $\lambda_1$ and $\lambda_2$ and the kernel parameters.

2.4. An alternative prediction formulation
In this section an alternative approach to the prediction step in Section 2.3 is formulated. Instead of obtaining the unknown class labels $\hat{z}^*$ from solving (15), the cost of assigning a test case, e.g. $\{[x_{i1}^* \ldots x_{ig_i}^*]^T, [y_{i1}^* \ldots y_{ig_i}^*]^T\}$, to a specific class can be calculated. Final prediction is obtained by assigning the case to the class which is associated with the lowest cost. The problem in (10)-(11) is redefined as

$$(\hat{e}^*, \hat{b}^*) = \operatorname*{argmin}_{e^*, b^*} \; \frac{\lambda_1}{2} b_i^{*2} + \frac{\lambda_2}{2} \sum_{j=1}^{g_i} e_{ij}^{*2}, \qquad (16)$$

subject to

$$y_{ij}^* = \hat{B}_0 + \hat{B}_1 z_i^* + b_i^* + \frac{z_i^* - 1}{2} \hat{w}_1^T \varphi_1(x_{ij}^*) + \frac{z_i^* + 1}{2} \hat{w}_2^T \varphi_2(x_{ij}^*) + e_{ij}^*, \qquad j = 1, \ldots, g_i. \qquad (17)$$

This formulation assumes that the true class label $z_i^*$ is provided, while the estimates $\hat{B}_0$, $\hat{B}_1$, $\hat{w}_1$ and $\hat{w}_2$, obtained from the regression step in Section 2.2, are plugged in. The Lagrangian for this problem becomes

$$\mathcal{L}_3(e^*, b^*; \nu) = \frac{\lambda_1}{2} b_i^{*2} + \frac{\lambda_2}{2} \sum_{j=1}^{g_i} e_{ij}^{*2} - \sum_{j=1}^{g_i} \nu_{ij} \left[ y_{ij}^* - \hat{B}_0 - \hat{B}_1 z_i^* - b_i^* - \frac{z_i^* - 1}{2} \hat{w}_1^T \varphi_1(x_{ij}^*) - \frac{z_i^* + 1}{2} \hat{w}_2^T \varphi_2(x_{ij}^*) - e_{ij}^* \right], \qquad (18)$$

with Lagrange multipliers $\nu_i = [\nu_{i1} \ldots \nu_{ig_i}]^T$, and the conditions for optimality yield

$$
\begin{aligned}
\frac{\partial \mathcal{L}_3}{\partial e_{ij}^*} = 0 \;&\rightarrow\; e_{ij}^* = -\frac{\nu_{ij}}{\lambda_2}, \qquad j = 1, \ldots, g_i, \\
\frac{\partial \mathcal{L}_3}{\partial b_i^*} = 0 \;&\rightarrow\; b_i^* = -\frac{1}{\lambda_1} \sum_{j=1}^{g_i} \nu_{ij}, \\
\frac{\partial \mathcal{L}_3}{\partial \nu_{ij}} = 0 \;&\rightarrow\; y_{ij}^* = \hat{B}_0 + \hat{B}_1 z_i^* + b_i^* + \frac{z_i^* - 1}{2} \hat{w}_1^T \varphi_1(x_{ij}^*) + \frac{z_i^* + 1}{2} \hat{w}_2^T \varphi_2(x_{ij}^*) + e_{ij}^*, \qquad j = 1, \ldots, g_i.
\end{aligned}
\qquad (19)
$$

Elimination of $e_{ij}^*$ and $b_i^*$ yields

$$y_{ij}^* = \hat{B}_0 + \hat{B}_1 z_i^* - \frac{1}{\lambda_1} \sum_{k=1}^{g_i} \nu_{ik} + \frac{z_i^* - 1}{2} \hat{w}_1^T \varphi_1(x_{ij}^*) + \frac{z_i^* + 1}{2} \hat{w}_2^T \varphi_2(x_{ij}^*) - \frac{\nu_{ij}}{\lambda_2}, \qquad j = 1, \ldots, g_i. \qquad (20)$$

The solution to this optimization problem can be found by solving the linear system

$$\left[ -\frac{1_{g_i \times g_i}}{\lambda_1} - \frac{I_{g_i \times g_i}}{\lambda_2} \right] [\nu_i] = \left[ y_i^* - (\hat{B}_0 + \hat{B}_1 z_i^*) 1_{g_i} - D_i^* \right], \qquad (21)$$

with $y_i^* = [y_{i1}^* \ldots y_{ig_i}^*]^T$ and

$$D_i^* = \left[ \frac{z_i^* - 1}{2} \hat{w}_1^T \varphi_1(x_{i1}^*) + \frac{z_i^* + 1}{2} \hat{w}_2^T \varphi_2(x_{i1}^*) \; \ldots \; \frac{z_i^* - 1}{2} \hat{w}_1^T \varphi_1(x_{ig_i}^*) + \frac{z_i^* + 1}{2} \hat{w}_2^T \varphi_2(x_{ig_i}^*) \right]^T.$$

Following (19), the cost function in (16) equals $\frac{1}{2\lambda_1} \left( \sum_{j=1}^{g_i} \nu_{ij} \right)^2 + \frac{1}{2\lambda_2} \sum_{j=1}^{g_i} \nu_{ij}^2$. This cost function can be calculated for the first class, i.e. $z_i^* = -1$, and the second class, i.e. $z_i^* = 1$, by solving (21). The classification rule assigns the case $\{[x_{i1}^* \ldots x_{ig_i}^*]^T, [y_{i1}^* \ldots y_{ig_i}^*]^T\}$ to the class with the smallest cost.
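The cost-based rule just described can be sketched as follows, assuming the kernel expansions $\hat{w}_1^T \varphi_1(x_{ij}^*)$ and $\hat{w}_2^T \varphi_2(x_{ij}^*)$ have already been evaluated at the test time points and are passed in as arrays (`f1`, `f2`). Function names are illustrative, not from the paper.

```python
import numpy as np

def case_cost(y_star, f1, f2, B0, B1, z_star, lam1, lam2):
    """Cost (16) of assigning one test case to class z_star, via system (21).
    f1, f2: arrays of w1_hat^T phi1(x*_ij) and w2_hat^T phi2(x*_ij)."""
    g = len(y_star)
    # D*_i: class-selected mean contribution at each test time point
    D = (z_star - 1) / 2.0 * f1 + (z_star + 1) / 2.0 * f2
    A = -np.ones((g, g)) / lam1 - np.eye(g) / lam2
    nu = np.linalg.solve(A, y_star - (B0 + B1 * z_star) - D)
    # cost = (sum nu)^2 / (2 lam1) + sum(nu^2) / (2 lam2)
    return nu.sum() ** 2 / (2 * lam1) + (nu ** 2).sum() / (2 * lam2)

def classify_case(y_star, f1, f2, B0, B1, lam1, lam2):
    """Assign the case to the class (-1 or +1) with the smallest cost."""
    costs = {z: case_cost(y_star, f1, f2, B0, B1, z, lam1, lam2)
             for z in (-1.0, 1.0)}
    return min(costs, key=costs.get)
```

If a case's observations sit exactly on one class mean, the corresponding residual system is homogeneous and the cost for that class is zero.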
3. Extensions
In the following sections, possible extensions of the model in (5) are explored.
3.1. Multi-class classification
The alternative approach in Section 2.4 has been introduced because it allows direct generalization of the binary longitudinal LS-SVM classifier to a multi-class version. This section illustrates this extension for the development of a longitudinal LS-SVM classifier for $C$ classes. Similar to Section 2.2, assume the longitudinal training data $\{\{x_{ij}, y_{ij}\}_{j=1}^{n_i}\}_{i=1}^N$, but with known class labels $z_{1i}, \ldots, z_{Ci} \in \{-1, +1\}$ for case $i$. These class coding labels equal $-1$, except that $z_{ci} = 1$ if case $i$ belongs to class $c$. As before, the training step consists of a regression modeling and it is formulated by the optimization problem

$$(\hat{w}, \hat{e}, \hat{b}, \hat{B}_0, \hat{B}_1, \ldots, \hat{B}_C) = \operatorname*{argmin}_{w,e,b,B_0,B_1,\ldots,B_C} \; \frac{1}{2} w^T w + \frac{\lambda_1}{2} \sum_{i=1}^N b_i^2 + \frac{\lambda_2}{2} \sum_{i=1}^N \sum_{j=1}^{n_i} e_{ij}^2, \qquad (22)$$

subject to the model

$$y_{ij} = B_0 + \sum_{c=1}^C \frac{z_{ci} + 1}{2} \left( B_c + w_c^T \varphi_c(x_{ij}) \right) + b_i + e_{ij}, \qquad i = 1, \ldots, N, \; j = 1, \ldots, n_i, \qquad (23)$$

with weighting vector $w = [w_1^T \ldots w_C^T]^T$, model parameters $B_0, B_1, \ldots, B_C$, and mappings $\varphi_c(\cdot)$, $c = 1, \ldots, C$. The dual solution to this optimization problem can be obtained by solving the linear system

$$
\begin{bmatrix}
0_{(C+1) \times (C+1)} & [1_M \; Z_1 \ldots Z_C]^T \\
[1_M \; Z_1 \ldots Z_C] & -\sum_{c=1}^C K_c - \frac{\Gamma}{\lambda_1} - \frac{I_{M \times M}}{\lambda_2}
\end{bmatrix}
\begin{bmatrix} B_0 \\ B_1 \\ \vdots \\ B_C \\ \alpha \end{bmatrix}
=
\begin{bmatrix} 0_{(C+1)} \\ y \end{bmatrix},
\qquad (24)
$$

with $Z_c = \frac{1}{2} \left( [\underbrace{z_{c1} \ldots z_{c1}}_{n_1} \; \underbrace{z_{c2} \ldots z_{c2}}_{n_2} \ldots \underbrace{z_{cN} \ldots z_{cN}}_{n_N}]^T + 1_M \right)$ and kernel matrix $K_c \in \mathbb{R}^{M \times M}$ with elements $K_{crs} = Z_{cr} Z_{cs} K_c(x_s, x_r)$, where $K_c(\cdot,\cdot)$ is a positive definite kernel function. The obtained estimates for $B_0, B_1, \ldots, B_C$ and $w = [w_1^T \ldots w_C^T]^T$ are now used to predict the class for a new longitudinal data set $\{\{x_{ij}^*, y_{ij}^*\}_{j=1}^{g_i}\}_{i=1}^G$ with unknown class labels.

The approach from Section 2.4 can be used to classify the $i$th test case $\{[x_{i1}^* \ldots x_{ig_i}^*]^T, [y_{i1}^* \ldots y_{ig_i}^*]^T\}$. The multi-class extension for the problem in (16)-(17) is defined as

$$(\hat{e}^*, \hat{b}^*) = \operatorname*{argmin}_{e^*, b^*} \; \frac{\lambda_1}{2} b_i^{*2} + \frac{\lambda_2}{2} \sum_{j=1}^{g_i} e_{ij}^{*2}, \qquad (25)$$

subject to

$$y_{ij}^* = \hat{B}_0 + \sum_{c=1}^C \frac{z_{ci}^* + 1}{2} \left( \hat{B}_c + \hat{w}_c^T \varphi_c(x_{ij}^*) \right) + b_i^* + e_{ij}^*, \qquad j = 1, \ldots, g_i. \qquad (26)$$

The Lagrangian for this problem is

$$\mathcal{L}_4(e^*, b^*; \nu) = \frac{\lambda_1}{2} b_i^{*2} + \frac{\lambda_2}{2} \sum_{j=1}^{g_i} e_{ij}^{*2} - \sum_{j=1}^{g_i} \nu_{ij} \left[ y_{ij}^* - \hat{B}_0 - \sum_{c=1}^C \frac{z_{ci}^* + 1}{2} \left( \hat{B}_c + \hat{w}_c^T \varphi_c(x_{ij}^*) \right) - b_i^* - e_{ij}^* \right], \qquad (27)$$

and the conditions for optimality yield

$$
\begin{aligned}
\frac{\partial \mathcal{L}_4}{\partial e_{ij}^*} = 0 \;&\rightarrow\; e_{ij}^* = -\frac{\nu_{ij}}{\lambda_2}, \qquad j = 1, \ldots, g_i, \\
\frac{\partial \mathcal{L}_4}{\partial b_i^*} = 0 \;&\rightarrow\; b_i^* = -\frac{1}{\lambda_1} \sum_{j=1}^{g_i} \nu_{ij}, \\
\frac{\partial \mathcal{L}_4}{\partial \nu_{ij}} = 0 \;&\rightarrow\; y_{ij}^* = \hat{B}_0 + \sum_{c=1}^C \frac{z_{ci}^* + 1}{2} \left( \hat{B}_c + \hat{w}_c^T \varphi_c(x_{ij}^*) \right) + b_i^* + e_{ij}^*, \qquad j = 1, \ldots, g_i.
\end{aligned}
\qquad (28)
$$

Elimination of $e_{ij}^*$ and $b_i^*$ gives

$$y_{ij}^* = \hat{B}_0 + \sum_{c=1}^C \frac{z_{ci}^* + 1}{2} \left( \hat{B}_c + \hat{w}_c^T \varphi_c(x_{ij}^*) \right) - \frac{1}{\lambda_1} \sum_{k=1}^{g_i} \nu_{ik} - \frac{\nu_{ij}}{\lambda_2}, \qquad j = 1, \ldots, g_i. \qquad (29)$$

The dual solution for this problem can be found by solving

$$\left[ -\frac{1_{g_i \times g_i}}{\lambda_1} - \frac{I_{g_i \times g_i}}{\lambda_2} \right] [\nu_i] = \left[ y_i^* - \left( \hat{B}_0 + \sum_{c=1}^C \frac{z_{ci}^* + 1}{2} \hat{B}_c \right) 1_{g_i} - L_i^* \right], \qquad (30)$$

with $L_i^* = \left[ \sum_{c=1}^C \frac{z_{ci}^* + 1}{2} \hat{w}_c^T \varphi_c(x_{i1}^*) \; \ldots \; \sum_{c=1}^C \frac{z_{ci}^* + 1}{2} \hat{w}_c^T \varphi_c(x_{ig_i}^*) \right]^T$. The cost, i.e. $\frac{1}{2\lambda_1} \left( \sum_{j=1}^{g_i} \nu_{ij} \right)^2 + \frac{1}{2\lambda_2} \sum_{j=1}^{g_i} \nu_{ij}^2$, of assigning a test case to a specific class can be calculated by solving (30) with the corresponding class coding labels (e.g. $z_{1i}^* = 1$ and $z_{2i}^* = \ldots = z_{Ci}^* = -1$ for class 1). Finally, the classification rule assigns the $i$th test case $\{[x_{i1}^* \ldots x_{ig_i}^*]^T, [y_{i1}^* \ldots y_{ig_i}^*]^T\}$ to the class with the smallest cost.

3.2. Multiple random effects
The proposed random intercept LS-SVM classifier (Section 2.3) can be extended to allow multiple random effects. Assume the longitudinal test data set $\{\{x_{ij}^*, y_{ij}^*\}_{j=1}^{g_i}\}_{i=1}^G$ with unknown class label $z_i^*$ for case $i$. The optimization problem for prediction with $r$ random effects is formulated as

$$(\hat{e}^*, \hat{b}^*, \hat{z}^*) = \operatorname*{argmin}_{e^*, b^*, z^*} \; \frac{\lambda_3}{2} \sum_{i=1}^G b_i^{*T} R \, b_i^* + \frac{\lambda_4}{2} \sum_{i=1}^G e_i^{*T} E_i e_i^*, \qquad (31)$$

subject to

$$y_{ij}^* = \hat{B}_0 + \hat{B}_1 z_i^* + \frac{z_i^* - 1}{2} \hat{w}_1^T \varphi_1(x_{ij}^*) + \frac{z_i^* + 1}{2} \hat{w}_2^T \varphi_2(x_{ij}^*) + b_i^{*T} u_{ij}^* + e_{ij}^*, \qquad i = 1, \ldots, G, \; j = 1, \ldots, g_i, \qquad (32)$$

with positive regularization constants $\lambda_3$ and $\lambda_4$, random effect covariates $u_{ij}^* = [u_{ij1}^* \ldots u_{ijr}^*]^T$, and the covariance matrices $R^{-1}$ and $E_i^{-1}$ of the random effects $b_i^* = [b_{1i}^* \ldots b_{ri}^*]^T$ and residuals $e_i^* = [e_{i1}^* \ldots e_{ig_i}^*]^T$, respectively. The Lagrangian for this problem is

$$\mathcal{L}_5(e^*, b^*, z^*; \nu) = \frac{\lambda_3}{2} \sum_{i=1}^G b_i^{*T} R \, b_i^* + \frac{\lambda_4}{2} \sum_{i=1}^G e_i^{*T} E_i e_i^* - \sum_{i=1}^G \sum_{j=1}^{g_i} \nu_{ij} \left[ y_{ij}^* - \hat{B}_0 - \hat{B}_1 z_i^* - \frac{z_i^* - 1}{2} \hat{w}_1^T \varphi_1(x_{ij}^*) - \frac{z_i^* + 1}{2} \hat{w}_2^T \varphi_2(x_{ij}^*) - b_i^{*T} u_{ij}^* - e_{ij}^* \right]. \qquad (33)$$

The corresponding conditions for optimality yield

$$
\begin{aligned}
\frac{\partial \mathcal{L}_5}{\partial e_{ij}^*} = 0 \;&\rightarrow\; 0 = \lambda_4 \sum_{k=1}^{g_i} e_{ik}^* E_{ijk} + \nu_{ij}, \qquad i = 1, \ldots, G, \; j = 1, \ldots, g_i, \\
\frac{\partial \mathcal{L}_5}{\partial b_{ki}^*} = 0 \;&\rightarrow\; 0 = \lambda_3 \sum_{l=1}^r R_{lk} b_{li}^* + \sum_{j=1}^{g_i} \nu_{ij} u_{ijk}^*, \qquad i = 1, \ldots, G, \; k = 1, \ldots, r, \\
\frac{\partial \mathcal{L}_5}{\partial z_i^*} = 0 \;&\rightarrow\; 0 = \sum_{j=1}^{g_i} \nu_{ij} \left[ \hat{B}_1 + \frac{\hat{w}_1^T \varphi_1(x_{ij}^*)}{2} + \frac{\hat{w}_2^T \varphi_2(x_{ij}^*)}{2} \right], \qquad i = 1, \ldots, G, \\
\frac{\partial \mathcal{L}_5}{\partial \nu_{ij}} = 0 \;&\rightarrow\; y_{ij}^* = \hat{B}_0 + \hat{B}_1 z_i^* + \frac{z_i^* - 1}{2} \hat{w}_1^T \varphi_1(x_{ij}^*) + \frac{z_i^* + 1}{2} \hat{w}_2^T \varphi_2(x_{ij}^*) + \sum_{k=1}^r b_{ki}^* u_{ijk}^* + e_{ij}^*, \qquad i = 1, \ldots, G, \; j = 1, \ldots, g_i.
\end{aligned}
\qquad (34)
$$

Elimination of $e_{ij}^*$ and $b_{ki}^*$ gives

$$
\begin{aligned}
0 &= \sum_{j=1}^{g_i} \nu_{ij} \left[ \hat{B}_1 + \frac{\hat{w}_1^T \varphi_1(x_{ij}^*)}{2} + \frac{\hat{w}_2^T \varphi_2(x_{ij}^*)}{2} \right], \qquad i = 1, \ldots, G, \\
y_i^* &= \hat{B}_0 + \hat{B}_1 z_i^* + \frac{z_i^* - 1}{2} [\hat{w}_1^T \Phi_{i1}^*]^T + \frac{z_i^* + 1}{2} [\hat{w}_2^T \Phi_{i2}^*]^T - \frac{1}{\lambda_3} U_i^{*T} R^{-1} U_i^* \nu_i - \frac{1}{\lambda_4} E_i^{-1} \nu_i, \qquad i = 1, \ldots, G,
\end{aligned}
\qquad (35)
$$

with $y_i^* = [y_{i1}^* \ldots y_{ig_i}^*]^T$, $\nu_i = [\nu_{i1} \ldots \nu_{ig_i}]^T$, $U_i^* = [u_{i1}^* \ldots u_{ig_i}^*]$, $\Phi_{i1}^* = [\varphi_1(x_{i1}^*) \ldots \varphi_1(x_{ig_i}^*)]$ and $\Phi_{i2}^* = [\varphi_2(x_{i1}^*) \ldots \varphi_2(x_{ig_i}^*)]$, whereas $E_{ijk}$ and $R_{jk}$ denote the element on the $j$th row and $k$th column of the covariance matrices $E_i$ and $R$, respectively. The dual solution to this problem can be found by solving

$$
\begin{bmatrix}
0_{G \times G} & A^{*T} \\
A^* & F^*
\end{bmatrix}
\begin{bmatrix} z^* \\ \nu \end{bmatrix}
=
\begin{bmatrix} 0_G \\ y^* - \hat{B}_0 1_H + P^* \end{bmatrix},
\qquad (36)
$$

where $F^* = \operatorname{blockdiag}\left( -\frac{1}{\lambda_3} U_1^{*T} R^{-1} U_1^* - \frac{1}{\lambda_4} E_1^{-1}, \ldots, -\frac{1}{\lambda_3} U_G^{*T} R^{-1} U_G^* - \frac{1}{\lambda_4} E_G^{-1} \right)$, while $A^*$ and $P^*$ are identical to their formulations in (15). In case one assumes only a random intercept and a simple covariance structure for $E_i$, (36) simplifies to (15). The final class assignment is obtained by applying $\mathrm{sign}(\hat{z}^*)$.

4. Model selection
.4. Model selection
Selection of appropriate values for the regularization parameters
λ
1,λ
2,λ
3,λ
4and the kernel parameters is an importantissue. Moreover, the use of multiple random effects requires the estimation of the covariance matrices R−1and E−1
i . These
topics are discussed in detail in the following sections. 4.1. Random intercept LS-SVM
The typical procedure to tune regularization parameters in the context of conventional LS-SVMs is by means of cross-validation or repeated random sampling, resulting in training and cross-validation data. On the other hand, some complications arise when using these methods to tune the parameters of the mixed effects LS-SVM models. The training and the prediction step each involve two regularization parameters (i.e. $\lambda_1$, $\lambda_2$) and additional kernel parameters (e.g. $\sigma_1$, $\sigma_2$). Although one could tune the parameters of the training step and the prediction step in a separate way, we decided to tune them together. This means that one gridsearch is performed and, for a fixed combination of parameters, (9) (or (24)) and (15) (or (21), (30)) are solved. Since the main goal of this paper is classification, the preferred performance measure is classification accuracy. Therefore, the combination of tuning parameters that maximizes the classification performance (i.e. at the level of the prediction step) on the validation data (cf. cross-validation or repeated random sampling) is chosen. The advantage of this approach is the lower computational cost of tuning only two parameters (kernel parameters excluded). Performing a separate model selection for the training and the prediction step would require tuning three parameters (kernel parameters excluded). Both cross-validation and repeated random sampling involve splitting the data into a training set and a validation set. Remark that data splitting was performed in such a way that cases, instead of individual observations, were repeatedly assigned to these sets. The final resulting model selection procedure for random intercept LS-SVM classification is summarized in Algorithm 1.
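The tuning loop just described is a gridsearch wrapped around case-level data splitting. The skeleton below sketches that loop; `fit` and `predict_labels` stand in for solving (9)/(24) and (15)/(21)/(30), and all names and defaults are illustrative.

```python
import numpy as np

def case_level_split(case_ids, frac_train=2 / 3, rng=None):
    """Split by case (not by observation), as the paper prescribes."""
    rng = rng or np.random.default_rng()
    ids = rng.permutation(np.unique(case_ids))
    n_tr = int(round(frac_train * len(ids)))
    return set(ids[:n_tr]), set(ids[n_tr:])

def grid_search(cases, labels, grid, fit, predict_labels, n_splits=25, rng=None):
    """cases: dict case_id -> data; labels: dict case_id -> class label;
    grid: iterable of parameter combinations. Returns the combination with
    the best average validation classification accuracy."""
    rng = rng or np.random.default_rng(0)
    best, best_acc = None, -1.0
    for params in grid:
        accs = []
        for _ in range(n_splits):
            tr, va = case_level_split(np.array(list(cases)), rng=rng)
            model = fit({i: cases[i] for i in tr},
                        {i: labels[i] for i in tr}, params)
            pred = predict_labels(model, {i: cases[i] for i in va})
            accs.append(np.mean([pred[i] == labels[i] for i in va]))
        if np.mean(accs) > best_acc:
            best, best_acc = params, float(np.mean(accs))
    return best, best_acc
```

Splitting at the case level keeps all observations of one profile on the same side of the split, which is essential for unbiased validation of longitudinal classifiers.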
Fig. 1 further illustrates the importance of a careful model selection. The data that are used are identical to the training and validation data of the toy problem in Section 5.1.1. While Fig. 2 visualizes the optimal results, in the sense that the selected parameters maximize classification performance, Fig. 1 shows the effect of changing one parameter at a time on the estimated class means (dashed lines) and the case-specific predicted curve (red line) for the observed curve (green line). Decreasing the bandwidth of the RBF kernel results in less smooth class means and predicted curve. Increasing $\lambda_2$ strengthens this effect
Algorithm 1 (Model selection and classification strategy for random intercept LS-SVM).
1: Start a gridsearch using cross-validation or repeated random sampling on training data to obtain values for $\lambda_1$ and $\lambda_2$. This includes step 2 and step 3:
2: Perform regression on training data by solving (9) or (24) to obtain estimates for $w$, $B_0$, $B_1 (, \ldots, B_C)$.
3: Use these estimates to perform prediction on validation data by solving (15) or (21) for binary problems and (30) for multi-class problems, and select the $\lambda_1$ and $\lambda_2$ with the best classification performance.
4: Finally, use the tuned $\lambda_1$ and $\lambda_2$ and corresponding estimates $\hat{w}$, $\hat{B}_0$, $\hat{B}_1 (, \ldots, \hat{B}_C)$ to perform prediction on test data by solving (15) or (21) for binary problems and (30) for multi-class problems.
since more emphasis is put on the sum of squared residuals. An increasing $\lambda_1$ value makes the class means even less smooth and shrinks the predicted curve toward the class mean.
4.2. Multiple random effects LS-SVM
The proposed method to deal with multiple random effects assumes that the covariance matrices $R^{-1}$ and $E_i^{-1}$ are known, while in practice this is not the case. A solution to this problem can be found by first applying the random intercept LS-SVM approach from Sections 2.2 and 2.3, in combination with the model selection procedure from Section 4.1, in order to estimate the mean (i.e. $\hat{B}_0$, $\hat{B}_1$, $\hat{w}_1$, $\hat{w}_2$) for each class. Then, for all observations of each case, the corresponding residuals are computed by removing the estimated mean of the true class to which this case belongs. The residuals of all cases can then be used to estimate $R^{-1}$ and $E_i^{-1}$ using (restricted) maximum likelihood theory (Patterson and Thompson, 1971). Finally, values for the regularization constants $\lambda_3$ and $\lambda_4$ can be obtained via cross-validation or repeated random sampling of the data. Remark that, since the problem in (31) requires a separate gridsearch, one of the lambdas could have been removed by dividing one through the other. In Section 3.2 both $\lambda_3$ and $\lambda_4$ were included to have a more consistent presentation. In practice, one lambda can be fixed while a gridsearch is used for the other one. The full procedure is summarized in Algorithm 2.
Algorithm 2 (Model selection and classification strategy for LS-SVM with multiple random effects).
1: Start with a random intercept LS-SVM and proceed as described in Algorithm 1 to obtain the estimates $\hat{w}$, $\hat{B}_0$, $\hat{B}_1 (, \ldots, \hat{B}_C)$.
2: Compute the residuals for all observations of each case by removing the estimated mean resulting from the random intercept LS-SVM step, where the given class label determines which estimated mean is used.
3: Use (restricted) maximum likelihood to estimate the covariance matrices $R^{-1}$ and $E_i^{-1}$ from the residuals of all cases.
4: Use cross-validation or repeated random sampling on training data to obtain values for $\lambda_3$ and $\lambda_4$. This includes step 5:
5: Perform prediction on validation data using the multiple random effects LS-SVM by solving (36) with the obtained $\hat{w}$, $\hat{B}_0$, $\hat{B}_1 (, \ldots, \hat{B}_C)$. Select the $\lambda_3$ and $\lambda_4$ with the best performance.
6: Finally, perform prediction on test data by solving (36) with $\hat{w}$, $\hat{B}_0$, $\hat{B}_1 (, \ldots, \hat{B}_C)$ and tuned $\lambda_3$ and $\lambda_4$.

5. Examples
This section provides examples of the application of the longitudinal LS-SVM classifier on four simulation studies and two real-life problems.

5.1. Simulation studies

5.1.1. Toy problem
For each class, 50 curves were generated according to the model in (5), resulting in the unbalanced data set in the upper right panel of Fig. 2. A simple variance structure, with a standard deviation of 80, was chosen for the normally distributed residual error, and the standard deviation of the normally distributed random intercept was 400. The number of observations per curve varied between 5 and 10. This data set was randomly split 25 times into a training set (2/3 of the cases) and a validation set (1/3 of the cases). An RBF kernel was used for the longitudinal LS-SVM classifier and a gridsearch was performed in order to tune the parameters λ1 ∈ {10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 0.1, 0.5, 0.7, 1, 2, 3, 4, 5, 10, 20, 30, 40}, λ2 ∈ {10^{-2}, 0.1, 1, ..., 20, 30, 40, 80, 100, 150, 200, 300, 400, 600, 800} and σ1, σ2 ∈ {10^{-3}, 10^{-2}, 0.1, 1, ..., 25}.

Fig. 1. Model selection. Top: decreasing the bandwidth σ of the RBF kernel results in wiggly estimates for the means (dashed lines) and the case-specific predicted curve (red line). The observed curve is visualized in green. Middle: putting more weight on the squared errors by increasing the regularization parameter λ2 makes the estimated means and predicted curve more unstable. Bottom: increasing λ1 shrinks the case-specific predicted curve to the estimated mean. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 2. Toy problem. Top left: simulated mean curves (solid) and estimated mean curves (dashed) for class 1 (black) and class 2 (blue). Top right: simulated training and validation data. Middle left: simulated test data. Middle right: simulated random effects versus estimated random effects. Bottom left: estimated z*-values for test data. The first 50 cases belong to class 1 while the last 50 cases belong to class 2. Bottom right: case 43 (green) and after removing the residual error (red). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
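Data generation for this toy problem can be sketched along the following lines. The class mean functions below are hypothetical placeholders (the actual means of model (5) are not reproduced here); only the random intercept and residual error settings match the description above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical class mean functions, standing in for the means of model (5).
means = {1: lambda t: 100.0 * t, 2: lambda t: 90.0 * t + 150.0}

def simulate_class(label, n_curves=50, sd_intercept=400.0, sd_error=80.0):
    """Simulate unbalanced longitudinal curves for one class."""
    curves = []
    for _ in range(n_curves):
        g_i = rng.integers(5, 11)              # 5-10 observations per case
        t = np.sort(rng.uniform(0, 10, g_i))   # non-fixed time points
        b_i = rng.normal(0, sd_intercept)      # random intercept
        y = means[label](t) + b_i + rng.normal(0, sd_error, g_i)
        curves.append((t, y))
    return curves

data = {c: simulate_class(c) for c in (1, 2)}
n_obs = sum(len(t) for c in data for t, _ in data[c])
print(len(data[1]), len(data[2]), n_obs)
```

Each case is stored as a (time points, observations) pair, so the total number of observations varies between 500 and 1000 across runs, mirroring the unbalanced design of the experiment.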
Empirical results showed that the classification approach in (15) produced predictions identical to those of the alternative approach following (21). The combination of tuning parameters with the lowest (i.e. 2.19%) averaged balanced error rate (BER, i.e. the average of the class-specific error rates) on the validation data was: λ1 = 10^{-3}, λ2 = 300, σ1 = σ2 = 8. Fig. 2 illustrates that the estimated mean curves for class 1 and class 2 approximate the true mean curves. The LS-SVM classifier was then used to classify 50 new curves from class 1 and 50 new curves from class 2. The estimated random effects are close to the true random effects and the BER on the independent test set was 3%. Finally, Fig. 2 presents the estimated curve for case 43 after removing the residual error. The average over the squared residuals related to the prediction of y*_{ij}, i.e.

(1/G) Σ_{i=1}^{G} (1/g_i) Σ_{j=1}^{g_i} ( y*_{ij} − B̂0 − B̂1 ẑ*_i − b̂*_i − (ẑ*_i − 1/2) ŵ1^T φ1(x*_{ij}) − (ẑ*_i + 1/2) ŵ2^T φ2(x*_{ij}) )^2,

equaled 4477.5.

5.1.2. Sensitivity on tuning parameters
A training and validation data set with 50 curves for each class was generated according to the model in (5) to illustrate the influence of the tuning parameters of the longitudinal LS-SVM on the classification performance (Fig. 3). A simple variance structure, with a standard deviation of 60, was used for the normally distributed residual error, and the standard deviation of the normally distributed random intercept was 600. The number of observations per curve varied between 5 and 7. Parameter tuning based on 50 random splits resulted in the following settings: λ1 = 10^{-4}, λ2 = 300, σ1 = σ2 = 9. Fig. 3 visualizes the true mean curves and the estimated mean curves, using these settings, for class 1 and class 2. In order to study the sensitivity of the classifier's performance to the tuning parameters, a grid was created around the optimum on the validation data: λ1 ∈ {10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}}, λ2 ∈ {80, 150, 200, 300, 400, 600} and σ1, σ2 ∈ {5, 7, 8, 9, 11, 13}. Next, all 180 combinations of tuning parameters were evaluated on 100 newly generated test data sets (e.g. Fig. 3, middle left panel) with 50 curves for each class, while training was done on the original training and validation data. The histogram of all 180 averaged (i.e. over the 100 repetitions) BERs is visualized in Fig. 3. The median BER was 19.50% while all 180 models fell within a range of 1.76%. This illustrates that the classifier's performance is rather insensitive to (small) changes in the tuning parameters. The bottom row of Fig. 3 presents the BER for different combinations of tuning parameters. Clearly, setting σ1 = σ2 = 13 or λ2 = 80 results in larger BERs. Omitting these models from the histogram makes clear that all remaining models obtain BERs within a range of less than 1%.
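The balanced error rate used throughout these experiments is simply the average of the class-specific error rates, which can be computed as follows; the labels in the usage example are made up and only illustrate the metric.

```python
import numpy as np

def balanced_error_rate(y_true, y_pred):
    """BER: the average of the class-specific error rates."""
    classes = np.unique(y_true)
    errs = [np.mean(y_pred[y_true == c] != c) for c in classes]
    return float(np.mean(errs))

# Illustrative labels only (not the paper's classifier output):
y_true = np.array([1] * 50 + [2] * 50)
y_pred = y_true.copy()
y_pred[:5] = 2       # 5 of 50 class-1 cases wrong -> 10% class-1 error
y_pred[50:52] = 1    # 2 of 50 class-2 cases wrong -> 4% class-2 error
print(balanced_error_rate(y_true, y_pred))  # (0.10 + 0.04) / 2 = 0.07
```

Unlike the plain error rate, the BER weights both classes equally, which matters when the two classes contribute unequal numbers of cases.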
5.1.3. Multi-class classification
The use of the multi-class approach from Section 3.1 is illustrated for a classification problem involving three classes (Fig. 4). Fifty curves were generated for each of the classes following the model in (23) with normally distributed residual error (with standard deviation 150) and random intercept (with standard deviation 600). The number of observations per curve varied between 5 and 7.
The set with the smallest (i.e. 8.96%) averaged error rate on the validation data based on 100 times random sampling was: λ1 = 2, λ2 = 16, σ1 = σ2 = σ3 = 2. The upper left panel of Fig. 4 visualizes the estimated mean curves, which approximate the true mean curves. The multi-class classifier was then used to classify 50 new curves from each class and resulted in an overall error rate of 5.33%. The average over the squared residuals related to the prediction of y*_{ij} equaled 18308.2. The lower right panel of Fig. 4 shows cost_c^{-1} / (cost_1^{-1} + cost_2^{-1} + cost_3^{-1}) for each test case with c = 1, ..., 3. The first 50 test cases belong to class 1, the following 50 to class 2, etc.
5.1.4. Multiple random effects
To illustrate classification using models with multiple random effects, 100 curves were generated for each of the classes. To simulate the data, a normally distributed residual error with standard deviation 80 (i.e. uncorrelated errors) and three normally distributed random effects (i.e. random intercept, random slope and random quadratic effect) with covariance matrix

[400  10   5]
[ 10  40  20]
[  5  20  20]

were assumed for the model in (32). The number of observations per curve varied between 5 and 10. Test data, consisting of 100 curves for each class, were simulated with similar settings. The top left panel of Fig. 5 visualizes the mean for each of the two classes, while the upper right panel visualizes the test data. The strategy in Algorithm 2 was followed for classifier building. First, a random intercept model was constructed using the training and validation data.
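Sampling the three correlated random effects with the stated covariance matrix can be sketched as follows; the time grid is arbitrary and the fixed class mean of model (32) is omitted, so only the case-specific deviation is shown.

```python
import numpy as np

rng = np.random.default_rng(2)

# Covariance matrix of (random intercept, random slope, random quadratic
# effect), as in the simulation described above.
D = np.array([[400.0, 10.0,  5.0],
              [ 10.0, 40.0, 20.0],
              [  5.0, 20.0, 20.0]])

# Draw the random effects for 100 cases.
b = rng.multivariate_normal(np.zeros(3), D, size=100)

# Case-specific deviation b0 + b1*t + b2*t^2 for the first case
# (the fixed class mean would be added on top of this).
t = np.linspace(0, 5, 6)
deviation = b[0, 0] + b[0, 1] * t + b[0, 2] * t ** 2

# Sanity check: the empirical covariance should be close to D.
emp_cov = np.cov(b, rowvar=False)
print(emp_cov.shape, deviation.shape)
```

The quadratic term lets simulated curves bend away from the class mean, which is exactly the extra flexibility the multiple random effects model is meant to capture relative to a random intercept alone.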
A gridsearch and repeated random sampling resulted in the following parameter settings: λ1 = 0.01, λ2 = 80, σ1 = σ2 = 2. The upper left panel of Fig. 5 demonstrates that the estimated mean curves approximated the true mean curves. After removing the estimated mean, restricted maximum likelihood was used on the training data to estimate the covariance matrix of the three random effects and the residual error variance. Subsequently, a gridsearch, in combination with repeated random sampling, was used to tune the regularization constants: λ3 = 1600 and λ4 = 950. Finally, this classifier with three random effects was used on the independent test data and resulted in a BER of 9.5% (average over squared residuals for prediction of y*_{ij} = 5512.3), while an LS-SVM with only a random intercept yielded a BER of 15.5% (average over squared residuals for prediction of y*_{ij} = 5643.0). The cases that are misclassified by the LS-SVM with multiple random effects are visualized in the bottom panel of Fig. 5. It is clear that these remaining misclassifications are atypical profiles.
Fig. 3. Sensitivity analysis. Top left: simulated mean curves (solid) and estimated mean curves (dashed) for class 1 (black) and class 2 (blue). Top right: simulated training and validation data. Middle left: simulated test data. Middle right: histogram for averaged BER on test data. Bottom left: averaged BER on test data for σ1 = σ2 = 9. Bottom right: averaged BER on test data for σ1 = σ2 = 13. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

5.2. Real-life data sets
5.2.1. Berkeley growth data
The Berkeley growth data set contains 54 growth curves of girls and 39 growth curves of boys (Ramsay and Silverman, 2005). The classification problem is to predict the sex of a case based on the growth curve. Each original curve consists
of 31 observations, taken at fixed time points. This experiment reports the performance of the random intercept LS-SVM
classifier for various levels of unbalanced data (Fig. 6). Observations were deleted from each sample in a random manner in