Citation/Reference: Van Belle V., Van Calster B., Van Huffel S., Suykens J.A.K., Lisboa P., "Explaining support vector machines: a color based nomogram", PLOS ONE, vol. 11, no. 10, Oct. 2016, pp. 1-33
Archived version: final publisher's version / pdf
Published version: http://dx.doi.org/10.1371/journal.pone.0164568
Journal homepage: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0164568
IR: https://lirias.kuleuven.be/handle/123456789/552325
Explaining Support Vector Machines: A Color Based Nomogram
Vanya Van Belle 1,2*, Ben Van Calster 3*, Sabine Van Huffel 1,2, Johan A. K. Suykens 1,2, Paulo Lisboa 4

1 Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, KU Leuven, Leuven, Belgium, 2 iMinds Medical IT, Leuven, Belgium, 3 Department of Development and Regeneration, KU Leuven, Leuven, Belgium, 4 Department of Applied Mathematics, Liverpool John Moores University, Liverpool, United Kingdom
* vanya.vanbelle@esat.kuleuven.be (VVB); ben.vancalster@kuleuven.be (BVC)
Abstract
Problem setting
Support vector machines (SVMs) are very popular tools for classification, regression and other problems. Thanks to the large choice of kernels they can be applied with, a wide variety of data can be analysed using these tools. Machine learning owes its popularity to the good performance of the resulting models. However, interpreting the models is far from obvious, especially when non-linear kernels are used, and hence the methods are used as black boxes. As a consequence, the use of SVMs is less supported in areas where interpretability is important and where people are held responsible for the decisions made by models.
Objective
In this work, we investigate whether SVMs using linear, polynomial and RBF kernels can be explained such that interpretations for model-based decisions can be provided. We further indicate when SVMs can be explained and in which situations interpretation of SVMs is (hitherto) not possible. Here, explainability is defined as the ability to produce the final decision as a sum of contributions that each depend on a single input variable or at most two input variables.
Results
Our experiments on simulated and real-life data show that the explainability of an SVM depends on the chosen parameter values (degree of the polynomial kernel, width of the RBF kernel and regularization constant). When several combinations of parameter values yield the same cross-validation performance, combinations with a lower polynomial degree or a larger kernel width have a higher chance of being explainable.
Citation: Van Belle V, Van Calster B, Van Huffel S, Suykens JAK, Lisboa P (2016) Explaining Support Vector Machines: A Color Based Nomogram. PLoS ONE 11(10): e0164568. doi:10.1371/journal.pone.0164568
Editor: Santosh Patnaik, Roswell Park Cancer Institute, UNITED STATES
Received: February 1, 2016 Accepted: September 27, 2016 Published: October 10, 2016
Copyright: © 2016 Van Belle et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability Statement: All data is available within the paper and repositories listed. The first two datasets used in the paper are available from the UCI Machine Learning Repository. The Iris dataset is accessible from: http://archive.ics.uci.edu/ml/datasets/Iris. The Pima dataset from: http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes. The credit risk dataset is available from http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/.
Funding: V. Van Belle is a postdoctoral fellow of the Research Foundation Flanders (FWO). This research was supported by: Center of Excellence
Conclusions
This work summarizes SVM classifiers obtained with linear, polynomial and RBF kernels in a single plot. Linear and polynomial kernels up to the second degree are represented exactly. For other kernels an indication of the reliability of the approximation is presented.
The complete methodology is available as an R package and two apps and a movie are provided to illustrate the possibilities offered by the method.
Introduction
Support vector machines (SVMs) have proven to be good classifiers in all kinds of domains, including text classification [1], handwritten digit recognition [2], face recognition [3] and bioinformatics [4], among many others. Thanks to the large variety of possible kernels, the application areas of SVMs are widespread. However, although these methods generalize well to unseen data, decisions made based on non-linear SVM predictions are difficult to explain and as such the models are treated as black boxes. For clinical applications, information on how the risk of disease is estimated from the inputs is crucial to decide upon the optimal treatment strategy and to inform patients. Being able to discuss this information with patients might enable them to change their behaviour, life style or therapy compliance. Interpretation is especially important for validation of the model inferences by subject area experts. The fact that machine learning techniques have not found their way into clinical practice may very well be related to this lack of information.
Offering interpretation to SVMs is a topic of research with different perspectives [5]. Identification of prototypes [6] (interpretability in dual space) offers an interpretation closely related to how doctors work: based on experience from previous patients (the prototypes) a decision is made for the current patient. A second view on interpretability (interpretability in the input space) intends to offer insights into how each input variable influences the decision. Some researchers worked on a combination of both [7] and identified prototypes dividing the input space into Voronoi sections, within which a linear decision boundary is created, offering interpretation w.r.t. the effect of the inputs in a local way. Other approaches try to visualize the decision boundary in a two-dimensional plane [8], using techniques related to self-organizing maps [9]. The current work attempts to offer a global interpretation in the input space.
The literature describes several methods to extract rules from the SVM model (see [10, 11] and references therein) in order to provide some interpretation of the decisions obtained from SVM classifiers. However, these rules do not always yield user-friendly results, and when inputs are present in several rules, identifying how the decision will change depending on the value of an input is not straightforward.
Several authors have therefore tried to open the black box by attempting to visualize the effect of individual inputs on the output of the SVM. In [12], Principal Component Analysis is used on the kernel matrix. Biplots are used to visualize along which principal components the class separability is the largest. To visualize which original inputs contribute the most to the classifier, pseudosamples with only one input differing from zero are used to mark trajectories within the plane spanned by the two principal components identified before. Those inputs with the largest trajectories along the direction of largest class separability are the most important inputs. Although this approach makes it possible to visualize which inputs are most relevant, it cannot indicate how the output of the classifier (i.e. the latent variable or the estimated probability) would change if the value of one input were changed.
(CoE): PFV/10/002 (OPTEC); iMinds Medical Information Technologies; Belgian Federal Science Policy Office: IUAP P7/19 (DYSCO, 'Dynamical systems, control and optimization', 2012-2017); European Research Council: ERC Advanced Grant (339804) BIOTENSORS. This paper reflects only the authors' views and the Union is not liable for any use that may be made of the contained information. JS acknowledges support of ERC AdG A-DATADRIVE-B, FWO G.0377.12, G.088114N. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
A second method to visualize and interpret SVMs was proposed by [13] for support vector regression. They propose to multiply the input matrix containing the inputs of all support vectors with the Lagrange multipliers to get the impact of each input. This approach is again able to identify the most important inputs, but is not able to indicate how the output of the SVM changes with changing inputs.
Other work consists in visualizing the discrimination of data cohorts by means of projections guided by paths through the data (tours) [14-16]. Although these methods offer additional insights, they do not quantify the impact of each feature on the prediction, which is the goal of the current work.
Standard statistical methods such as linear and logistic regression offer the advantage that they are interpretable in the sense that it is clear how a change in the value of one input variable will affect the predicted outcome. To further clarify the impact of the input variables, visualization techniques such as nomograms [17] can be used (see Fig 1). In short, a nomogram represents a linear model $\hat{y} = \sum_{p=1}^{d} w^{(p)} x^{(p)} + b$, with $x^{(p)}$ the p-th input and $w^{(p)}$ the corresponding weight, by means of lines, the length of which is related to the range of $w^{(p)} x^{(p)}$ observed in the training data. For each input value the contribution to the predicted outcome can instantly be read off from the plot. See Section Logistic regression models for more information. A straightforward extension of this technique to SVMs is not possible due to the fact that SVMs are mainly used in combination with flexible kernels that cannot be decomposed into additive terms, each accounting for one single input variable. A possible extension
Fig 1. Visualization of the logistic regression model for the Pima dataset by means of a nomogram.
The contribution of each input variable $x^{(p)}$ ($f^{(p)} = w^{(p)} x^{(p)}$) to the linear predictor is shifted and rescaled such that each contribution has a minimal value of zero and the maximal value of all contributions is 100. Each input variable is represented by means of a scale and the value of the contribution can be found by drawing a vertical line from the input variable value to the points scale on top of the plot. Adding the contributions of all input variables results in the total points. These can be transformed into a risk estimate by drawing a vertical line from the total points scale to the risk scale. The importance of the inputs is represented by means of the length of the scales: variables with longer scales have a larger impact on the risk prediction.
doi:10.1371/journal.pone.0164568.g001
of nomograms towards support vector machines [18] therefore focuses on the use of decomposable kernels [19]. The most restrictive way of applying this approach is to define a kernel as the addition of subkernels that each depend on one single input. The use of a localized radial basis function kernel in [20] is only one example. The original work of [18] to represent SVMs by means of nomograms is less restrictive in the sense that kernels including interactions between two inputs are allowed. The idea behind these approaches is that by using a decomposable kernel, the latent variable of the SVM can be expressed as a sum of terms, each depending on one input. As such, the SVM becomes a generalized additive model and can be visualized by means of a nomogram. In [18] non-linearities are visualized by drawing two-dimensional curves instead of straight lines in the nomogram, such that non-linearities can be represented more easily than when using a line. They also allow for interactions between two inputs, but, as with standard nomograms, these can only be represented after categorization of one of the two involved inputs.
In contrast with the approaches found in the literature, this work does not intend to adapt the kernel nor the SVM model formulation. This work takes the first steps in answering the question whether existing SVMs in combination with generally used kernels can be explained and visualized, in which circumstances this is possible and to which extent. Instead of adapting the kernel, the nomogram representation is altered to easily allow for non-linear and two-way interaction effects. This is achieved by replacing the lines by color bars, with colors offering the same interpretation as the length of the lines in nomograms. It is indicated for which kernels and kernel parameters the representation by means of this color based nomogram is exact. In cases where the visualization is only approximate, additional graphs indicate why the approximation is not sufficient and how this might be solved. The current approach is related to the work in [21, 22], where a Taylor expansion of the RBF kernel is used to extract interpretable and visualizable components from an SVM with RBF kernel. In this work, the expansion is indicated for linear, polynomial and RBF kernels. Additionally, the expansion is used to visualize the working of an existing SVM, whereas in the previous work a new model was created after feature selection by means of iterative $\ell_1$ regularization of a parametric model with the different components as inputs.
The remainder of this work is structured as follows. First, a short introduction to SVM classification is given. It is shown how a nomogram is built for logistic regression models and how an alternative color based nomogram for logistic regression was used in [23]. Next, it is explained how to reformulate the SVM classifier in the same framework. Experiments on artificial data illustrate the approach and indicate possible problems and solutions. Finally, real-life datasets are used to illustrate the applicability on real examples. The work concludes with information on the available software and a discussion on the strengths and weaknesses of the study.
Methods
This section clarifies how an SVM can be explained by means of a color based nomogram. For generality, we start with a brief summary of an SVM classifier, followed by an introduction on the use of a nomogram to visualize logistic regression models.
In the remainder of this work, $x_i^{(p)m}$ will indicate the m-th power of the p-th input variable of the i-th observation $x_i$.
SVM classifier
Suppose a dataset $\mathcal{D} = \{x_i, y_i\}_{i=1}^{N}$ is a set of N observations with input variables $x_i \in \mathbb{R}^d$ and class labels $y_i \in \{-1, 1\}$. The SVM classifier as defined by Vapnik [24] is formulated as

$$
\begin{aligned}
\min_{w, b, \xi} \quad & \frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i \\
\text{subject to} \quad & y_i \left( w^T \varphi(x_i) + b \right) \geq 1 - \xi_i, \quad \forall i = 1, \ldots, N \\
& \xi_i \geq 0, \quad \forall i = 1, \ldots, N.
\end{aligned}
\tag{1}
$$
To facilitate the classification, a feature map $\varphi(\cdot)$ is used to transform the inputs into a higher dimensional feature space. The coefficients in this higher dimensional feature space are denoted by $w \in \mathbb{R}^{n_\varphi}$. The trade-off between a smooth decision boundary and correct classification of the training data is made by means of the strictly positive regularization constant C.
The dual formulation of the problem stated in Eq (1) is found by defining the Lagrangian and characterizing the saddle point, and results in:

$$
\begin{aligned}
\min_{\alpha} \quad & \frac{1}{2} \sum_{i,j=1}^{N} y_i y_j \varphi(x_i)^T \varphi(x_j) \alpha_i \alpha_j - \sum_{i=1}^{N} \alpha_i \\
\text{subject to} \quad & \sum_{i=1}^{N} \alpha_i y_i = 0 \\
& 0 \leq \alpha_i \leq C, \quad \forall i = 1, \ldots, N.
\end{aligned}
\tag{2}
$$
The power of SVMs lies in the fact that the feature map does not need to be defined explicitly. An appropriate choice of a kernel function K(x, z) that, for any two points x and z, can be expressed as

$$
K(x, z) = \varphi(x)^T \varphi(z),
$$

makes it possible to use an implicit feature map.
A class label for a new point x can then be predicted as

$$
\hat{y} = \operatorname{sign}\left( \sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b \right).
$$

Here $\ell = \sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b$ is called the latent variable. In order to obtain probabilities, the sign(·) function can be replaced by a function h(·). In this work the latent variable will be converted into a risk estimate by using it as a single input in a logistic regression model with two parameters. This approach is known as Platt's rule [25, 26].
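The two steps above (computing the latent variable from the dual solution, then mapping it to a probability) can be sketched in Python. This is not the authors' R implementation; it is a minimal illustration assuming scikit-learn, whose `SVC` stores $\alpha_i y_i$ in `dual_coef_`. For simplicity the two-parameter logistic model is fit on the training outputs, whereas Platt's original rule uses held-out targets.

```python
# Sketch: recover ell(x) = sum_i alpha_i y_i K(x_i, x) + b from the dual
# solution of a fitted SVC, then apply a 2-parameter logistic model (Platt).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

def latent(clf, X_new):
    # RBF kernel between new points and the support vectors
    d2 = ((X_new[:, None, :] - clf.support_vectors_[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-gamma * d2)
    # dual_coef_ holds alpha_i * y_i for each support vector
    return K @ clf.dual_coef_.ravel() + clf.intercept_[0]

ell = latent(clf, X)
assert np.allclose(ell, clf.decision_function(X))

# Platt's rule: logistic regression with ell as the single input
platt = LogisticRegression().fit(ell.reshape(-1, 1), y)
p = platt.predict_proba(ell.reshape(-1, 1))[:, 1]
```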
Visualization of risk prediction models
Logistic regression models. In statistics, regression models can be visualized using nomograms [17]. More recently, color plots have been proposed [23] to represent contributions to the linear predictor (here $\sum_{p=1}^{d} w^{(p)} x^{(p)} + b$) depending on one or, by extension, maximally two input variables. The nomogram for logistic regression (LR) builds on the fact that the model in its most basic form can be written as

$$
\hat{p} = h\left( \sum_{p=1}^{d} w^{(p)} x^{(p)} + b \right), \tag{3}
$$

where h(·) is a link function (here the sigmoid function) transforming the linear predictor into a probability, $w^{(p)}$ is the coefficient corresponding to the p-th input variable $x^{(p)}$ and b is a constant.
The contribution of each input variable $x^{(p)}$ to the linear predictor can thus be visualized by plotting $f^{(p)}(x^{(p)}) = w^{(p)} x^{(p)}$. In fact, for nomograms these terms are rescaled to start from 0 up to a maximum of 100 points. Doing so makes clear that the range of the contributions is important. A wide range of the contributions for one input variable indicates that changing the value of this input can have a large impact on the linear predictor and as such on the risk estimate.
Fig 1 clarifies this approach for a logistic regression model trained on the Pima Indian diabetes dataset from the UCI repository [27]. The training data as provided in the R package MASS [28, 29] was used to train the logistic regression model. The nomogram was generated using the rms package [28, 30]. To obtain the risk estimate for an observation, the points corresponding to each input variable are obtained by drawing a vertical line from this value up to the points scale on top of the plot. These points are added to obtain the total points, which are converted to a risk by means of the bottom two scales.
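The rescaling to a 0-100 point scale can be sketched numerically. This is an illustration with stand-in data and coefficients, not the rms package's implementation: each contribution $w^{(p)} x^{(p)}$ is shifted so its minimum over the training data is zero, and everything is scaled so the widest contribution spans 100 points.

```python
# Sketch of the nomogram point computation on stand-in data/coefficients.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # stand-in training data
w = np.array([1.2, -0.4, 0.8])           # stand-in LR coefficients

contrib = X * w                           # f_p(x) = w_p * x_p per observation
shifted = contrib - contrib.min(axis=0)   # each contribution starts at 0
scale = 100.0 / shifted.max()             # widest contribution spans 100 points
points = shifted * scale

assert np.isclose(points.min(), 0.0) and np.isclose(points.max(), 100.0)
```

The per-input spread of `points` then mirrors the scale lengths in Fig 1: inputs with a wider point range matter more.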
A similar approach using the methods proposed in [23] is illustrated in Fig 2. Instead of scales, color bars are used, the color of which indicates the contribution of the input variable value. In this case, the contributions are shifted to make sure that the minimal contribution of each input is zero. The contributions are not rescaled. The importance of the inputs is clear from the color: the more intense the red color within the color bar, the more impact this input has (similar to the length of the scales in the nomogram). To obtain a risk estimate for an observation, the procedure is as follows. For each input, find the color corresponding to the input's value. This color is converted to a point by means of the color legend at the right.
Fig 2. Visualization of the logistic regression model for the Pima dataset by means of a color plot or color based nomogram. The contribution of each input variable $x^{(p)}$ ($f^{(p)} = w^{(p)} x^{(p)}$) to the linear predictor is shifted such that each contribution has a minimal value of zero. To obtain a risk estimate for an observation, the color corresponding to the input's value needs to be indicated. This color is converted to a point by means of the color legend at the right. Repeating this for each input and summing the resulting points yields the score. This score is then converted into the risk estimate by means of the bottom most color bar. The importance of the inputs is represented by means of the redness of the color: variables with a higher intensity in red have a larger impact on the risk prediction.
doi:10.1371/journal.pone.0164568.g002
Repeating this for each input and summing the resulting points yields the score. This score is then converted into the risk estimate by means of the bottom most color bar. A more detailed explanation of how this color based nomogram is constructed from the risk prediction model is given in S1 Text.
From both approaches (nomogram and color-based plot) it is easily concluded that glucose, the pedigree function and bmi are the most influential inputs.
Support vector classifiers. Whether or not an SVM classifier can be interpreted in the same way as explained above and represented by similar graphs, depends on the choice of the kernel. A derivation for the linear, polynomial and RBF kernel is given here.
When using a linear kernel $K_{lin}(x, z) = \sum_{p=1}^{d} x^{(p)} z^{(p)}$, the extension of the nomogram to an SVM is easily made. The predicted risk is found as:

$$
\hat{y} = h\left( \sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b \right) = h\left( \sum_{i=1}^{N} \alpha_i y_i \sum_{p=1}^{d} x_i^{(p)} x^{(p)} + b \right),
$$

such that the contribution of the p-th input variable to the linear predictor is defined as

$$
f^{(p)} = \sum_{i=1}^{N} \alpha_i y_i x_i^{(p)} x^{(p)}.
$$
This expansion makes it possible to visualize an SVM model with a linear kernel using plots of the type presented in Fig 2. Each contribution $f^{(p)}$ is then represented by a color bar. The points that are allocated to the value of an input are read off by means of the color legend. The score is obtained by addition of all points. The function h(·) converting this score to a risk estimate is visualized by another color bar at the bottom of the graph. Examples of this type of representation for SVM models are given in Section Results.
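The exactness of the linear-kernel expansion is easy to check numerically. The sketch below (assuming scikit-learn; not part of the paper's software) verifies that the per-input contributions $f^{(p)}$ sum to the SVM latent variable:

```python
# Check: per-input contributions f_p = sum_i alpha_i y_i x_i^(p) x^(p)
# add up (with the bias) to the latent variable of a linear-kernel SVM.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=150, n_features=5, random_state=1)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

coef = clf.dual_coef_.ravel()             # alpha_i * y_i for support vectors
sv = clf.support_vectors_

# coef @ sv = sum_i alpha_i y_i x_i, so f has one column per input p
f = X * (coef @ sv)

ell = f.sum(axis=1) + clf.intercept_[0]
assert np.allclose(ell, clf.decision_function(X))
```

Each column of `f` is exactly the quantity a color bar in the plot encodes for that input.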
This approach can also be extended to other additive kernels [31] and ANOVA kernels [19, 24], in which kernels are expressed as a sum of subkernels, each of which depends on a restricted set of input variables. In cases where no more than two inputs are involved in each subkernel, the representation will be exact. Visualization of two-way interaction effects is done by the use of color plots instead of color bars. Examples of this approach are given in Section Results.
For the polynomial kernel $K_{poly}(x, z) = (a x^T z + c)^\delta$, with δ a positive integer, an expansion of the latent variable is found by use of the multinomial theorem [32]:

$$
\left( x^{(1)} + \cdots + x^{(d)} \right)^\delta = \sum_{k_1 + \cdots + k_d = \delta} \binom{\delta}{k_1, \ldots, k_d} \prod_{1 \leq p \leq d} x^{(p)k_p}. \tag{4}
$$
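Eq (4) can be verified numerically for small d and δ. A brute-force sketch (illustrative only) enumerates all exponent tuples and compares the expansion with the direct power:

```python
# Numerical check of the multinomial theorem (Eq 4) for d = 3, delta = 4.
from math import factorial
from itertools import product

x = [0.7, -1.3, 2.1]
delta = 4

total = 0.0
for ks in product(range(delta + 1), repeat=len(x)):
    if sum(ks) != delta:
        continue                          # keep only k_1 + ... + k_d = delta
    coef = factorial(delta)
    term = 1.0
    for xp, kp in zip(x, ks):
        coef //= factorial(kp)            # multinomial coefficient
        term *= xp ** kp
    total += coef * term

assert abs(total - sum(x) ** delta) < 1e-9
```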
The latent variable of the SVM classifier can then be written as:

$$
\begin{aligned}
\ell &= \sum_{i=1}^{N} \alpha_i y_i K_{poly}(x_i, x) + b
= \sum_{i=1}^{N} \alpha_i y_i \left( a x_i^T x + c \right)^\delta + b
= \sum_{i=1}^{N} \alpha_i y_i \left( a \sum_{p=1}^{d} x_i^{(p)} x^{(p)} + c \right)^{\delta} + b \\
&= \sum_{i=1}^{N} \alpha_i y_i \Bigg[ c^\delta + \sum_{p=1}^{d} a^\delta x_i^{(p)\delta} x^{(p)\delta}
+ \sum_{p=1}^{d} \sum_{\substack{k_p + k_c = \delta \\ k_p, k_c \neq \delta}} \binom{\delta}{k_p, k_c} a^{k_p} x_i^{(p)k_p} x^{(p)k_p} c^{k_c} \\
&\qquad + \sum_{p=1}^{d} \sum_{q \neq p} \sum_{\substack{k_p + k_q = \delta \\ k_p, k_q \neq \delta}} \binom{\delta}{k_p, k_q} \left( a x_i^{(p)} x^{(p)} \right)^{k_p} \left( a x_i^{(q)} x^{(q)} \right)^{k_q} \\
&\qquad + \sum_{p=1}^{d} \sum_{q \neq p} \sum_{\substack{k_p + k_q + k_c = \delta \\ k_p, k_q, k_c \neq \delta \\ k_p + k_q \neq \delta,\ k_p + k_c \neq \delta,\ k_q + k_c \neq \delta}} \binom{\delta}{k_p, k_q, k_c} \left( a x_i^{(p)} x^{(p)} \right)^{k_p} \left( a x_i^{(q)} x^{(q)} \right)^{k_q} c^{k_c} + \Delta \Bigg] + b \\
&= \sum_{i=1}^{N} \alpha_i y_i \left[ b_0 + \sum_{p=1}^{d} g^{(p)}\left( x_i^{(p)}, x^{(p)} \right) + \sum_{p=1}^{d} \sum_{q \neq p} g^{(p,q)}\left( x_i^{(p,q)}, x^{(p,q)} \right) + \Delta \right] + b \\
&= \sum_{p=1}^{d} f^{(p)}\left( x^{(p)} \right) + \sum_{p=1}^{d} \sum_{q \neq p} f^{(p,q)}\left( x^{(p,q)} \right) + b + \Delta\ell.
\end{aligned}
$$
Here, we define $f^{(p)}$ as the functional form of the p-th input $x^{(p)}$, i.e. the contribution to the latent variable that is solely attributed to $x^{(p)}$. In analogy, $f^{(p,q)}$ is defined as the contribution to the latent variable that is attributed to the combination of inputs $x^{(p)}$ and $x^{(q)}$. The derivation above shows that for each a, c and δ, an SVM classifier with a polynomial kernel can be expanded in main contributions $f^{(p)}$, contributions $f^{(p,q)}$ involving two input variables and a rest term Δℓ, including all contributions involving a combination of more than two input variables. From the equations, it can be seen that whenever d or δ is not higher than 2, the expansion of the polynomial kernel is exact, i.e. Δℓ = 0. S2 Text indicates how the terms $f^{(p)}$ and $f^{(p,q)}$ for this polynomial kernel can be calculated.
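The exactness for δ = 2 can be illustrated directly: a degree-2 polynomial kernel value splits into a constant, one-input terms and pairwise terms with no remainder. A small numerical sketch (illustrative, with arbitrary a, c and random points):

```python
# Check that a degree-2 polynomial kernel decomposes exactly (Delta = 0)
# into a constant, main effects and pairwise interaction effects.
import numpy as np

rng = np.random.default_rng(2)
x, z = rng.normal(size=5), rng.normal(size=5)
a, c = 0.5, 1.0

u = x * z                                  # u_p = x^(p) z^(p)
k_direct = (a * u.sum() + c) ** 2

const = c ** 2
mains = (a ** 2 * u ** 2 + 2 * a * c * u).sum()           # one-input terms
pairs = a ** 2 * (np.outer(u, u).sum() - (u ** 2).sum())  # q != p cross terms

assert np.isclose(k_direct, const + mains + pairs)
```

For δ = 3 and d > 2 the same bookkeeping leaves a non-zero Δ containing three-input products, which is exactly what the rest term Δℓ collects.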
When using the popular Radial Basis Function (RBF) kernel, the extension is based on a similar approach. The RBF kernel is defined as

$$
K_{RBF}(x, z) = \exp\left( -\frac{1}{\sigma^2} \|x - z\|_2^2 \right) = \exp\left( -\gamma \|x - z\|_2^2 \right),
$$

with $\sigma^2 = \frac{1}{\gamma}$ the kernel width. Using the Taylor expansion of the exponential function, this kernel can be written as

$$
K_{RBF}(x, z) = \sum_{n=0}^{\infty} \frac{(-1)^n \gamma^n \left( \|x - z\|_2^2 \right)^n}{n!}.
$$
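The Taylor expansion above is easy to sanity-check numerically: a truncated series converges to the kernel value. An illustrative sketch with arbitrary points:

```python
# Check the Taylor expansion of the RBF kernel: the truncated series
# sum_n (-1)^n gamma^n r^n / n! approaches exp(-gamma * r), r = ||x - z||^2.
import numpy as np
from math import factorial

x = np.array([0.3, -0.8, 1.1])
z = np.array([1.0, 0.2, -0.4])
gamma = 0.5

r = ((x - z) ** 2).sum()
series = sum((-1) ** n * gamma ** n * r ** n / factorial(n) for n in range(40))

assert np.isclose(series, np.exp(-gamma * r))
```

Note that for large γ·r the alternating series needs many terms, which foreshadows why small kernel widths (large γ) make the truncated expansion less reliable.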
Application of the multinomial theorem results in

$$
K_{RBF}(x, z) = \sum_{n=0}^{\infty} \frac{(-1)^n \gamma^n}{n!} \sum_{k_1 + \cdots + k_d = n} \binom{n}{k_1, \ldots, k_d} \prod_{1 \leq p \leq d} \left( x^{(p)} - z^{(p)} \right)^{2 k_p}. \tag{5}
$$
The question whether SVM classifiers using the RBF kernel can be visualized and explained as in Figs 1 and 2 is now reduced to the question whether we can write Eq (5) as the addition of terms only depending on one input variable, or by extension also including terms depending on two input variables. To achieve this, Eq (5) is written as:
$$
\begin{aligned}
K_{RBF}(x, z) &= \sum_{n=0}^{\infty} \frac{(-1)^n \gamma^n}{n!} \Bigg[ \sum_{p=1}^{d} \left( x^{(p)} - z^{(p)} \right)^{2n} \\
&\qquad + \sum_{p=1}^{d} \sum_{q \neq p} \sum_{\substack{k_p + k_q = n \\ k_p, k_q \neq n}} \binom{n}{k_p, k_q} \left( x^{(p)} - z^{(p)} \right)^{2 k_p} \left( x^{(q)} - z^{(q)} \right)^{2 k_q} \Bigg] + \Delta \quad (6) \\
&= \sum_{p=1}^{d} g^{(p)}\left( x^{(p)}, z^{(p)} \right) + \sum_{p=1}^{d} \sum_{q \neq p} g^{(p,q)}\left( x^{(p,q)}, z^{(p,q)} \right) + \Delta. \quad (7)
\end{aligned}
$$
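For d = 2 inputs every term in Eq (6) involves at most two variables, so the regrouped (truncated) series should reproduce the kernel value with Δ = 0. An illustrative sketch, handling the constant n = 0 term separately so the one-input terms are not double-counted:

```python
# For d = 2, the regrouped series of Eq (6) reproduces the RBF kernel:
# terms in one input plus genuine cross terms (k_p, k_q != n), Delta = 0.
import numpy as np
from math import comb, factorial

x = np.array([0.4, -1.2])
z = np.array([0.9, 0.3])
gamma, n_max = 0.7, 40

r = (x - z) ** 2                          # r_p = (x^(p) - z^(p))^2
total = 1.0                               # n = 0 term of the Taylor series
for n in range(1, n_max):
    coef = (-1) ** n * gamma ** n / factorial(n)
    univ = (r ** n).sum()                 # terms in a single input
    cross = sum(comb(n, k) * r[0] ** k * r[1] ** (n - k)
                for k in range(1, n))     # k_p, k_q != n
    total += coef * (univ + cross)

assert np.isclose(total, np.exp(-gamma * r.sum()))
```

With d > 2 the analogous check leaves a discrepancy equal to Δ, the part of the kernel the two-way color plots cannot show.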
The latent variable can then be written as:

$$
\begin{aligned}
\ell &= \sum_{i=1}^{N} \alpha_i y_i K_{RBF}(x_i, x) + b \quad (8) \\
&= \sum_{i=1}^{N} \alpha_i y_i \left[ \sum_{p=1}^{d} g^{(p)}\left( x_i^{(p)}, x^{(p)} \right) + \sum_{p=1}^{d} \sum_{q \neq p} g^{(p,q)}\left( x_i^{(p,q)}, x^{(p,q)} \right) + \Delta \right] + b \quad (9) \\
&= \sum_{p=1}^{d} f^{(p)}\left( x^{(p)} \right) + \sum_{p=1}^{d} \sum_{q \neq p} f^{(p,q)}\left( x^{(p,q)} \right) + b + \Delta\ell.
\end{aligned}
$$