BIOINFORMATICS
Vol. 00 no. 00 2007, Pages 1–2

Supplement to:
Kernel-based data fusion for gene prioritization
Tijl De Bie$^{a,b}$, Léon-Charles Tranchevent$^{c}$, Liesbeth M. M. van Oeffelen$^{c}$, Yves Moreau$^{c}$
$^{a}$Dept. of Engineering Mathematics, University of Bristol, University Walk, BS8 1TR, Bristol, UK
$^{b}$OKP Research Group, Katholieke Universiteit Leuven, Tiensestraat 102, 3000 Leuven, Belgium
$^{c}$ESAT-SCD, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, 3001 Leuven, Belgium
PROOF OF THEOREM 1

Generalizing the problem
It is convenient to consider here a slightly more general algorithm. In particular, we consider the optimization problem

$$\max_K \max_{M,\mathbf{w},\boldsymbol{\xi}} \; p(M,\boldsymbol{\xi}) = M - \frac{1}{n\nu}\mathbf{1}'\boldsymbol{\xi}, \qquad (1)$$

$$\text{s.t.} \quad \mathbf{w}'\mathbf{w} \le 1, \quad \mathbf{x}_i'\mathbf{w} \ge M - \xi_i \;(\forall i), \quad \xi_i \ge 0 \;(\forall i),$$

$$K \in \left\{ \sum_j \mu_j (K_j/\beta_j) \;:\; \boldsymbol{\mu}'\mathbf{1} = 1, \; \boldsymbol{\mu} \ge 0 \right\}.$$
The difference with the problem introduced in the paper is that slack variables $\xi_i$ are used here, which allow small mistakes for individual data objects. These mistakes are penalized more strongly for small values of $\nu$, and for $\nu \to 0$ the simpler optimization problem explained in the main part of the paper is recovered. Using duality theory, this problem can be shown to be equivalent to
$$\min_{t,\boldsymbol{\alpha}} \; t \quad \text{s.t.} \quad \frac{1}{n\nu} \ge \alpha_i \ge 0 \;(\forall i), \quad \mathbf{1}'\boldsymbol{\alpha} = 1, \quad t \ge \boldsymbol{\alpha}'(K_j/\beta_j)\boldsymbol{\alpha} \;(\forall j).$$
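This dual is a quadratically constrained linear program. As an illustration only (not part of the original supplement), the following sketch solves it with cvxpy; `K_list`, `beta` and `nu` are hypothetical inputs holding the kernel matrices $K_j$, the constants $\beta_j$ and the parameter $\nu$.

```python
import numpy as np
import cvxpy as cp

def solve_dual(K_list, beta, nu):
    """Sketch: min_{t,alpha} t  s.t.  0 <= alpha_i <= 1/(n*nu),
    1'alpha = 1, and t >= alpha'(K_j/beta_j)alpha for all j."""
    n = K_list[0].shape[0]
    alpha = cp.Variable(n)
    t = cp.Variable()
    constraints = [alpha >= 0, alpha <= 1.0 / (n * nu), cp.sum(alpha) == 1]
    for Kj, bj in zip(K_list, beta):
        # kernel matrices are positive semidefinite, so each quadratic
        # constraint is convex and the problem remains a convex QCLP
        constraints.append(t >= cp.quad_form(alpha, Kj / bj))
    cp.Problem(cp.Minimize(t), constraints).solve()
    return alpha.value, t.value
```

In practice one may need to symmetrize the $K_j$ and add a small ridge so that the positive semidefiniteness check on the quadratic forms passes numerically.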
For any value of the margin $M$, optimization problem (1) minimizes

$$\mathbf{1}'\boldsymbol{\xi} = \sum_{i=1}^n \max(M - f(\mathbf{x}_i), 0) = n\gamma\,\hat{E}_X(g_{M,\gamma}(\mathbf{x}'\mathbf{w})) = n\gamma\,\hat{E}_X(g_{M,\gamma}(f(\mathbf{x})))$$

for $f$ belonging to the function class $\mathcal{F}_K$ defined as:

$$\mathcal{F}_K = \left\{ f : \mathbf{x} \mapsto \sum_{i=1}^n \alpha_i k(\mathbf{x}, \mathbf{x}_i) \Big/ \sqrt{\boldsymbol{\alpha}'K\boldsymbol{\alpha}} \; \text{ with } k \in \mathcal{K} \right\}.$$
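As an illustration (not from the original text), evaluating a member of $\mathcal{F}_K$ at a new point $\mathbf{x}$ only requires the kernel evaluations against the $n$ sample objects; `k_vec`, `alpha` and `K` below are hypothetical inputs.

```python
import numpy as np

def f_value(k_vec, alpha, K):
    """f(x) = sum_i alpha_i k(x, x_i) / sqrt(alpha' K alpha),
    where k_vec[i] = k(x, x_i) and K is the kernel matrix on the sample."""
    return (k_vec @ alpha) / np.sqrt(alpha @ K @ alpha)
```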
A more general theorem

We prove a more general theorem, from which Theorem 1 follows immediately. First we give a few definitions. Given values $M$ and $\gamma$, define the function $\phi_{M,\gamma}$ as:
$$\phi_{M,\gamma}(a) = \min\left(\max\left(\frac{M-a}{\gamma},\, 0\right),\, 1\right) = \begin{cases} 1 & \text{if } a \le M - \gamma, \\ \frac{M-a}{\gamma} & \text{if } M - \gamma < a \le M, \\ 0 & \text{if } M \le a. \end{cases}$$
Furthermore, define $g_{M,\gamma}(a) = \max\left(\frac{M-a}{\gamma},\, 0\right)$. Then, with $I$ the indicator function:

$$g_{M,\gamma}(a) \ge \phi_{M,\gamma}(a) \ge I(a \le M - \gamma).$$
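The following short sketch (an illustration only, not part of the supplement) implements $\phi_{M,\gamma}$ and $g_{M,\gamma}$ and checks the sandwich inequality numerically on a grid:

```python
import numpy as np

def phi(a, M, gamma):
    # phi_{M,gamma}(a) = min(max((M - a) / gamma, 0), 1): a clipped ramp
    return np.clip((M - a) / gamma, 0.0, 1.0)

def g(a, M, gamma):
    # g_{M,gamma}(a) = max((M - a) / gamma, 0): the unclipped hinge
    return np.maximum((M - a) / gamma, 0.0)

a = np.linspace(-2.0, 2.0, 1001)
M, gamma = 0.5, 0.3
indicator = (a <= M - gamma).astype(float)
assert np.all(g(a, M, gamma) >= phi(a, M, gamma))
assert np.all(phi(a, M, gamma) >= indicator)
```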
THEOREM 2. Let $X$ be a set of $n$ objects (genes) $\mathbf{x}_i$ sampled i.i.d. from an unknown distribution $\mathcal{D}$. Let $\lambda(K_j)$ denote the largest eigenvalue of $K_j$. Then, for any $M, \gamma \in \mathbb{R}_+$ and for any $\delta \in (0,1)$, with probability at least $1 - \delta$ the following holds for all functions $f \in \mathcal{F}$:

$$P_{\mathcal{D}}(f(\mathbf{x}) \le M - \gamma) \le \hat{E}_X(\phi_{M,\gamma}(f(\mathbf{x}))) + \frac{4}{n\gamma}\sqrt{\min\left(n \max_j \frac{\lambda(K_j)}{\beta_j},\; \sum_{j=1}^m \frac{\operatorname{trace}(K_j)}{\beta_j}\right)} + \sqrt{\frac{2}{n}\ln\frac{2}{\delta}}.$$
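To make the bound concrete, the following sketch (an illustration, not from the paper) evaluates its right-hand side given the kernel matrices and the empirical term; `emp_phi` stands for $\hat{E}_X(\phi_{M,\gamma}(f(\mathbf{x})))$ and is a hypothetical input.

```python
import numpy as np

def theorem2_rhs(emp_phi, K_list, beta, gamma, delta):
    """Right-hand side of the Theorem 2 bound; K_list and beta hold
    the kernel matrices K_j and the constants beta_j."""
    n = K_list[0].shape[0]
    eig_term = n * max(np.linalg.eigvalsh(Kj).max() / bj
                       for Kj, bj in zip(K_list, beta))
    trace_term = sum(np.trace(Kj) / bj for Kj, bj in zip(K_list, beta))
    complexity = 4.0 / (n * gamma) * np.sqrt(min(eig_term, trace_term))
    return emp_phi + complexity + np.sqrt(2.0 / n * np.log(2.0 / delta))
```

The `min` simply takes the better of the two complexity terms, mirroring the statement of the theorem.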
Proof of Theorem 2
The proof has the same structure as the Rademacher complexity proofs in Bartlett and Mendelson (2002); Shawe-Taylor and Cristianini (2004); Lanckriet et al. (2004). For any $M$ and $\gamma$:
$$P_{\mathcal{D}}(f(\mathbf{x}) \le M - \gamma) = E_{\mathcal{D}}(I(f(\mathbf{x}) \le M - \gamma)) \le E_{\mathcal{D}}(\phi_{M,\gamma}(f(\mathbf{x})))$$

$$\le \hat{E}_X(\phi_{M,\gamma}(f(\mathbf{x}))) + \sup_{f \in \mathcal{F}}\left(E_{\mathcal{D}}(\phi_{M,\gamma}(f(\mathbf{x}))) - \hat{E}_X(\phi_{M,\gamma}(f(\mathbf{x})))\right).$$
We now make use of the fact that $\hat{E}_X(\phi_{M,\gamma}(f(\mathbf{x})))$ is close to its expectation, as shown by McDiarmid's inequality for functions with bounded differences. With probability $1 - \frac{\delta}{2}$ over $X$:
$$\sup_{f \in \mathcal{F}}\left(E_{\mathcal{D}}(\phi_{M,\gamma}(f(\mathbf{x}))) - \hat{E}_X(\phi_{M,\gamma}(f(\mathbf{x})))\right) \le E_{X \sim \mathcal{D}^n} \sup_{f \in \mathcal{F}}\left(E_{\mathcal{D}}(\phi_{M,\gamma}(f(\mathbf{x}))) - \hat{E}_X(\phi_{M,\gamma}(f(\mathbf{x})))\right) + \sqrt{\frac{1}{2n}\ln\frac{2}{\delta}}.$$

Now consider a new sample $Z$ of $n$ data objects, sampled i.i.d. from $\mathcal{D}$. By linearity, $E_{\mathcal{D}}(\phi_{M,\gamma}(f(\mathbf{x}))) = E_{Z \sim \mathcal{D}^n}\left(\hat{E}_Z(\phi_{M,\gamma}(f(\mathbf{z})))\right)$. Furthermore, the supremum of an expectation is smaller than or equal to the expectation of the supremum, such that:
$$E_{X \sim \mathcal{D}^n} \sup_{f \in \mathcal{F}}\left(E_{\mathcal{D}}(\phi_{M,\gamma}(f(\mathbf{x}))) - \hat{E}_X(\phi_{M,\gamma}(f(\mathbf{x})))\right)$$

$$\le E_{X,Z \sim \mathcal{D}^n} \sup_{f \in \mathcal{F}}\left(\hat{E}_Z(\phi_{M,\gamma}(f(\mathbf{z}))) - \hat{E}_X(\phi_{M,\gamma}(f(\mathbf{x})))\right)$$

$$= \frac{1}{n} E_{X,Z \sim \mathcal{D}^n,\boldsymbol{\sigma}} \sup_{f \in \mathcal{F}} \sum_{i=1}^n \sigma_i\left(\phi_{M,\gamma}(f(\mathbf{z}_i)) - \phi_{M,\gamma}(f(\mathbf{x}_i))\right)$$

$$\le \frac{2}{n} E_{X \sim \mathcal{D}^n,\boldsymbol{\sigma}} \sup_{f \in \mathcal{F}} \left|\sum_{i=1}^n \sigma_i\,\phi_{M,\gamma}(f(\mathbf{x}_i))\right|,$$
where we take $\boldsymbol{\sigma} \in \{-1,1\}^n$ to be a so-called Rademacher random variable with a uniform distribution. Finally, we can use McDiarmid's inequality again to show that, with probability $1 - \frac{\delta}{2}$ over $X$, this is upper bounded by $\frac{2}{n} E_{\boldsymbol{\sigma}} \sup_{f \in \mathcal{F}} \left|\sum_{i=1}^n \sigma_i\,\phi_{M,\gamma}(f(\mathbf{x}_i))\right| + \sqrt{\frac{1}{2n}\ln\frac{2}{\delta}}$. The first term in this expression is twice the so-called empirical Rademacher complexity of the function class $\mathcal{H}_{M,\gamma} = \{h = \phi_{M,\gamma} \circ f \text{ with } f \in \mathcal{F}\}$:
$$\hat{R}_X(\mathcal{H}_{M,\gamma}) = \frac{1}{n} E_{\boldsymbol{\sigma}} \sup_{h \in \mathcal{H}_{M,\gamma}} \left|\sum_{i=1}^n \sigma_i\,h(\mathbf{x}_i)\right|.$$
Since $\exists a : \phi_{M,\gamma}(a) = 0$ and $\phi_{M,\gamma}$ is an $L$-Lipschitz function with $L = \frac{1}{\gamma}$, we can invoke Lemma 3 from Bartlett et al. (2002) to obtain that $\hat{R}_X(\mathcal{H}_{M,\gamma}) \le \frac{2}{\gamma}\hat{R}_X(\mathcal{F})$. Hence it suffices to bound $\hat{R}_X(\mathcal{F})$:
$$n\hat{R}_X(\mathcal{F}) = E_{\boldsymbol{\sigma}} \sup_{f \in \mathcal{F}} \left|\sum_{i=1}^n \sigma_i f(\mathbf{x}_i)\right| = E_{\boldsymbol{\sigma}} \sup_{k \in \mathcal{K}} \sup_{\boldsymbol{\alpha}} \left|\frac{\boldsymbol{\sigma}'K\boldsymbol{\alpha}}{\sqrt{\boldsymbol{\alpha}'K\boldsymbol{\alpha}}}\right|$$

$$= E_{\boldsymbol{\sigma}} \sup_{k \in \mathcal{K}} \sup_{\boldsymbol{\alpha}} \left|\left(\sqrt{K}\boldsymbol{\sigma}\right)'\left(\frac{\sqrt{K}\boldsymbol{\alpha}}{\sqrt{\boldsymbol{\alpha}'K\boldsymbol{\alpha}}}\right)\right|$$

$$\le E_{\boldsymbol{\sigma}} \sup_{k \in \mathcal{K}} \sqrt{\boldsymbol{\sigma}'K\boldsymbol{\sigma}} \le \sqrt{E_{\boldsymbol{\sigma}} \sup_{k \in \mathcal{K}} \boldsymbol{\sigma}'K\boldsymbol{\sigma}},$$

where the first inequality is the Cauchy–Schwarz inequality (the second factor in the inner product has unit norm) and the second follows from Jensen's inequality.
The value of $E_{\boldsymbol{\sigma}} \sup_{k \in \mathcal{K}} \boldsymbol{\sigma}'K\boldsymbol{\sigma}$ can be upper bounded in two ways. From $\boldsymbol{\sigma}'K_j\boldsymbol{\sigma} \le \boldsymbol{\sigma}'\boldsymbol{\sigma}\,\lambda(K_j) = n\lambda(K_j)$, it is clear that $n \max_j \lambda(K_j)/\beta_j$ is an upper bound. Alternatively, observe that

$$E_{\boldsymbol{\sigma}} \sup_{k \in \mathcal{K}} \boldsymbol{\sigma}'K\boldsymbol{\sigma} \le E_{\boldsymbol{\sigma}}\,\boldsymbol{\sigma}'\left(\sum_{j=1}^m K_j/\beta_j\right)\boldsymbol{\sigma} = \sum_{j=1}^m \operatorname{trace}(K_j)/\beta_j.$$
Putting all pieces together completes the proof. $\Box$
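As a numerical sanity check on this last step (an illustrative sketch, not part of the proof), one can estimate $E_{\boldsymbol{\sigma}} \sup_{k \in \mathcal{K}} \boldsymbol{\sigma}'K\boldsymbol{\sigma}$ by Monte Carlo, noting that $\boldsymbol{\sigma}'K\boldsymbol{\sigma}$ is linear in $\boldsymbol{\mu}$, so the supremum over the convex hull is attained at one of the vertices $K_j/\beta_j$, and compare the estimate against both upper bounds:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 3
K_list = []
for _ in range(m):
    A = rng.standard_normal((n, n))
    K_list.append(A @ A.T)  # a random positive semidefinite "kernel" matrix
beta = np.ones(m)

# the sup over the convex hull is attained at a vertex K_j / beta_j
signs = rng.choice([-1.0, 1.0], size=(2000, n))
estimate = np.mean([max(s @ (Kj / bj) @ s for Kj, bj in zip(K_list, beta))
                    for s in signs])

eig_bound = n * max(np.linalg.eigvalsh(Kj).max() / bj
                    for Kj, bj in zip(K_list, beta))
trace_bound = sum(np.trace(Kj) / bj for Kj, bj in zip(K_list, beta))
print(estimate <= eig_bound, estimate <= trace_bound)  # expect: True True
```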
REFERENCES
Bartlett, P., Bousquet, O., and Mendelson, S. (2002). Localized Rademacher complexities. In Proceedings of the 15th Annual Conference on Computational Learning Theory (COLT 2002), pages 44–58.
Bartlett, P. L. and Mendelson, S. (2002). Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research, 3, 463–482.
Lanckriet, G. R. G., Cristianini, N., Bartlett, P., El Ghaoui, L., and Jordan, M. I. (2004). Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5, 27–72.
Shawe-Taylor, J. and Cristianini, N. (2004). Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, U.K.