BIOINFORMATICS
Vol. 00 no. 00 2007, Pages 1–2

Supplement to:
Kernel-based data fusion for gene prioritization
Tijl De Bie$^{a,b}$, Léon-Charles Tranchevent$^{c}$, Liesbeth M. M. van Oeffelen$^{c}$, Yves Moreau$^{c}$
$^{a}$Dept. of Engineering Mathematics, University of Bristol, University Walk, BS8 1TR, Bristol, UK
$^{b}$OKP Research Group, Katholieke Universiteit Leuven, Tiensestraat 102, 3000 Leuven, Belgium
$^{c}$ESAT-SCD, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, 3001 Leuven, Belgium
PROOF OF THEOREM 1

Generalizing the problem
It is convenient to consider here a slightly more general algorithm. In particular, we consider the optimization problem

$$\max_K \max_{M,\mathbf{w},\boldsymbol{\xi}} \; p(M,\boldsymbol{\xi}) = M - \frac{1}{n\nu}\mathbf{1}'\boldsymbol{\xi}, \qquad (1)$$

$$\text{s.t.} \quad \mathbf{w}'\mathbf{w} \le 1, \quad \mathbf{x}_i'\mathbf{w} \ge M - \xi_i \;(\forall i), \quad \xi_i \ge 0 \;(\forall i),$$

$$K \in \left\{ \sum_j \mu_j (K_j/\beta_j) \;:\; \boldsymbol{\mu}'\mathbf{1} = 1, \; \boldsymbol{\mu} \ge 0 \right\}.$$
The difference with the problem introduced in the paper is that slack variables $\xi_i$ are used here, which allow small mistakes for individual data objects. These mistakes are penalized more strongly for small values of $\nu$, and for $\nu \to 0$ the simpler optimization problem explained in the main part of the paper is recovered. Using duality theory, this problem can be shown to be equivalent to
$$\min_{t,\boldsymbol{\alpha}} \; t \quad \text{s.t.} \quad \frac{1}{n\nu} \ge \alpha_i \ge 0 \;(\forall i), \quad \mathbf{1}'\boldsymbol{\alpha} = 1, \quad t \ge \boldsymbol{\alpha}'(K_j/\beta_j)\boldsymbol{\alpha} \;(\forall j).$$
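This dual is a quadratically constrained linear program. As an illustration only (not part of the original supplement), the following sketch solves it with cvxpy; `K_list`, `beta` and `nu` are hypothetical inputs holding the kernel matrices $K_j$, the constants $\beta_j$ and the parameter $\nu$.

```python
import numpy as np
import cvxpy as cp

def solve_dual(K_list, beta, nu):
    """Sketch: min_{t,alpha} t  s.t.  0 <= alpha_i <= 1/(n*nu),
    1'alpha = 1, and t >= alpha'(K_j/beta_j)alpha for all j."""
    n = K_list[0].shape[0]
    alpha = cp.Variable(n)
    t = cp.Variable()
    constraints = [alpha >= 0, alpha <= 1.0 / (n * nu), cp.sum(alpha) == 1]
    for Kj, bj in zip(K_list, beta):
        # kernel matrices are positive semidefinite, so each quadratic
        # constraint is convex and the problem remains a convex QCLP
        constraints.append(t >= cp.quad_form(alpha, Kj / bj))
    cp.Problem(cp.Minimize(t), constraints).solve()
    return alpha.value, t.value
```

In practice one may need to symmetrize the $K_j$ and add a small ridge so that the positive semidefiniteness check on the quadratic forms passes numerically.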
For any value of the margin $M$, optimization problem (1) minimizes

$$\mathbf{1}'\boldsymbol{\xi} = \sum_{i=1}^n \max(M - f(\mathbf{x}_i), 0) = n\gamma\,\hat{E}_X(g_{M,\gamma}(\mathbf{x}'\mathbf{w})) = n\gamma\,\hat{E}_X(g_{M,\gamma}(f(\mathbf{x})))$$

for $f$ belonging to the function class $\mathcal{F}_K$ defined as:

$$\mathcal{F}_K = \left\{ f : \mathbf{x} \mapsto \sum_{i=1}^n \alpha_i k(\mathbf{x}, \mathbf{x}_i) \Big/ \sqrt{\boldsymbol{\alpha}'K\boldsymbol{\alpha}} \; \text{ with } k \in \mathcal{K} \right\}.$$
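As an illustration (not from the original text), evaluating a member of $\mathcal{F}_K$ at a new point $\mathbf{x}$ only requires the kernel evaluations against the $n$ sample objects; `k_vec`, `alpha` and `K` below are hypothetical inputs.

```python
import numpy as np

def f_value(k_vec, alpha, K):
    """f(x) = sum_i alpha_i k(x, x_i) / sqrt(alpha' K alpha),
    where k_vec[i] = k(x, x_i) and K is the kernel matrix on the sample."""
    return (k_vec @ alpha) / np.sqrt(alpha @ K @ alpha)
```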
A more general theorem

We prove a more general theorem, from which Theorem 1 follows immediately. First we give a few definitions. Given values $M$ and $\gamma$, define the function $\phi_{M,\gamma}$ as:
$$\phi_{M,\gamma}(a) = \min\left(\max\left(\frac{M-a}{\gamma},\, 0\right),\, 1\right) = \begin{cases} 1 & \text{if } a \le M - \gamma, \\ \frac{M-a}{\gamma} & \text{if } M - \gamma < a \le M, \\ 0 & \text{if } M \le a. \end{cases}$$
Furthermore, define $g_{M,\gamma}(a) = \max\left(\frac{M-a}{\gamma},\, 0\right)$. Then, with $I$ the indicator function:

$$g_{M,\gamma}(a) \ge \phi_{M,\gamma}(a) \ge I(a \le M - \gamma).$$
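The following short sketch (an illustration only, not part of the supplement) implements $\phi_{M,\gamma}$ and $g_{M,\gamma}$ and checks the sandwich inequality numerically on a grid:

```python
import numpy as np

def phi(a, M, gamma):
    # phi_{M,gamma}(a) = min(max((M - a) / gamma, 0), 1): a clipped ramp
    return np.clip((M - a) / gamma, 0.0, 1.0)

def g(a, M, gamma):
    # g_{M,gamma}(a) = max((M - a) / gamma, 0): the unclipped hinge
    return np.maximum((M - a) / gamma, 0.0)

a = np.linspace(-2.0, 2.0, 1001)
M, gamma = 0.5, 0.3
indicator = (a <= M - gamma).astype(float)
assert np.all(g(a, M, gamma) >= phi(a, M, gamma))
assert np.all(phi(a, M, gamma) >= indicator)
```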
THEOREM 2. Let $X$ be a set of $n$ objects (genes) $\mathbf{x}_i$ sampled i.i.d. from an unknown distribution $\mathcal{D}$. Let $\lambda(K_j)$ denote the largest eigenvalue of $K_j$. Then, for any $M, \gamma \in \mathbb{R}_+$ and for any $\delta \in (0,1)$, with probability at least $1 - \delta$ the following holds for all functions $f \in \mathcal{F}$:

$$P_{\mathcal{D}}(f(\mathbf{x}) \le M - \gamma) \le \hat{E}_X(\phi_{M,\gamma}(f(\mathbf{x}))) + \frac{4}{n\gamma}\sqrt{\min\left(n \max_j \frac{\lambda(K_j)}{\beta_j},\; \sum_{j=1}^m \frac{\operatorname{trace}(K_j)}{\beta_j}\right)} + \sqrt{\frac{2}{n}\ln\frac{2}{\delta}}.$$
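To make the bound concrete, the following sketch (an illustration, not from the paper) evaluates its right-hand side given the kernel matrices and the empirical term; `emp_phi` stands for $\hat{E}_X(\phi_{M,\gamma}(f(\mathbf{x})))$ and is a hypothetical input.

```python
import numpy as np

def theorem2_rhs(emp_phi, K_list, beta, gamma, delta):
    """Right-hand side of the Theorem 2 bound; K_list and beta hold
    the kernel matrices K_j and the constants beta_j."""
    n = K_list[0].shape[0]
    eig_term = n * max(np.linalg.eigvalsh(Kj).max() / bj
                       for Kj, bj in zip(K_list, beta))
    trace_term = sum(np.trace(Kj) / bj for Kj, bj in zip(K_list, beta))
    complexity = 4.0 / (n * gamma) * np.sqrt(min(eig_term, trace_term))
    return emp_phi + complexity + np.sqrt(2.0 / n * np.log(2.0 / delta))
```

The `min` simply takes the better of the two complexity terms, mirroring the statement of the theorem.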
Proof of Theorem 2
The proof has the same structure as the Rademacher complexity proofs in Bartlett and Mendelson (2002); Shawe-Taylor and Cristianini (2004); Lanckriet et al. (2004). For any $M$ and $\gamma$:
$$P_{\mathcal{D}}(f(\mathbf{x}) \le M - \gamma) = E_{\mathcal{D}}(I(f(\mathbf{x}) \le M - \gamma)) \le E_{\mathcal{D}}(\phi_{M,\gamma}(f(\mathbf{x})))$$

$$\le \hat{E}_X(\phi_{M,\gamma}(f(\mathbf{x}))) + \sup_{f \in \mathcal{F}}\left(E_{\mathcal{D}}(\phi_{M,\gamma}(f(\mathbf{x}))) - \hat{E}_X(\phi_{M,\gamma}(f(\mathbf{x})))\right).$$
We now make use of the fact that $\hat{E}_X(\phi_{M,\gamma}(f(\mathbf{x})))$ is close to its expectation, as shown by McDiarmid's inequality for functions with bounded differences. With probability $1 - \frac{\delta}{2}$ over $X$:
$$\sup_{f \in \mathcal{F}}\left(E_{\mathcal{D}}(\phi_{M,\gamma}(f(\mathbf{x}))) - \hat{E}_X(\phi_{M,\gamma}(f(\mathbf{x})))\right) \le E_{X \sim \mathcal{D}^n} \sup_{f \in \mathcal{F}}\left(E_{\mathcal{D}}(\phi_{M,\gamma}(f(\mathbf{x}))) - \hat{E}_X(\phi_{M,\gamma}(f(\mathbf{x})))\right) + \sqrt{\frac{1}{2n}\ln\frac{2}{\delta}}.$$

Now consider a new sample $Z$ of $n$ data objects, sampled i.i.d. from $\mathcal{D}$. By linearity, $E_{\mathcal{D}}(\phi_{M,\gamma}(f(\mathbf{x}))) = E_{Z \sim \mathcal{D}^n}\left(\hat{E}_Z(\phi_{M,\gamma}(f(\mathbf{z})))\right)$. Furthermore, the supremum of an expectation is smaller than or equal to the expectation of the supremum, such that:
$$E_{X \sim \mathcal{D}^n} \sup_{f \in \mathcal{F}}\left(E_{\mathcal{D}}(\phi_{M,\gamma}(f(\mathbf{x}))) - \hat{E}_X(\phi_{M,\gamma}(f(\mathbf{x})))\right)$$

$$\le E_{X,Z \sim \mathcal{D}^n} \sup_{f \in \mathcal{F}}\left(\hat{E}_Z(\phi_{M,\gamma}(f(\mathbf{z}))) - \hat{E}_X(\phi_{M,\gamma}(f(\mathbf{x})))\right)$$

$$= \frac{1}{n} E_{X,Z \sim \mathcal{D}^n,\boldsymbol{\sigma}} \sup_{f \in \mathcal{F}} \sum_{i=1}^n \sigma_i\left(\phi_{M,\gamma}(f(\mathbf{z}_i)) - \phi_{M,\gamma}(f(\mathbf{x}_i))\right)$$

$$\le \frac{2}{n} E_{X \sim \mathcal{D}^n,\boldsymbol{\sigma}} \sup_{f \in \mathcal{F}} \left|\sum_{i=1}^n \sigma_i\,\phi_{M,\gamma}(f(\mathbf{x}_i))\right|,$$
where we take $\boldsymbol{\sigma} \in \{-1,1\}^n$ to be a so-called Rademacher random variable with a uniform distribution. Finally, we can use McDiarmid's inequality again to show that, with probability $1 - \frac{\delta}{2}$ over $X$, this is upper bounded by $\frac{2}{n} E_{\boldsymbol{\sigma}} \sup_{f \in \mathcal{F}} \left|\sum_{i=1}^n \sigma_i\,\phi_{M,\gamma}(f(\mathbf{x}_i))\right| + \sqrt{\frac{1}{2n}\ln\frac{2}{\delta}}$. The first term in this expression is twice the so-called empirical Rademacher complexity of the function class $\mathcal{H}_{M,\gamma} = \{h = \phi_{M,\gamma} \circ f \text{ with } f \in \mathcal{F}\}$:
$$\hat{R}_X(\mathcal{H}_{M,\gamma}) = \frac{1}{n} E_{\boldsymbol{\sigma}} \sup_{h \in \mathcal{H}_{M,\gamma}} \left|\sum_{i=1}^n \sigma_i\,h(\mathbf{x}_i)\right|.$$
Since $\exists a : \phi_{M,\gamma}(a) = 0$ and $\phi_{M,\gamma}$ is an $L$-Lipschitz function with $L = \frac{1}{\gamma}$, we can invoke Lemma 3 from Bartlett et al. (2002) to obtain that $\hat{R}_X(\mathcal{H}_{M,\gamma}) \le \frac{2}{\gamma}\hat{R}_X(\mathcal{F})$. Hence it suffices to bound $\hat{R}_X(\mathcal{F})$:
$$n\hat{R}_X(\mathcal{F}) = E_{\boldsymbol{\sigma}} \sup_{f \in \mathcal{F}} \left|\sum_{i=1}^n \sigma_i f(\mathbf{x}_i)\right| = E_{\boldsymbol{\sigma}} \sup_{k \in \mathcal{K}} \sup_{\boldsymbol{\alpha}} \left|\frac{\boldsymbol{\sigma}'K\boldsymbol{\alpha}}{\sqrt{\boldsymbol{\alpha}'K\boldsymbol{\alpha}}}\right|$$

$$= E_{\boldsymbol{\sigma}} \sup_{k \in \mathcal{K}} \sup_{\boldsymbol{\alpha}} \left|\left(\sqrt{K}\boldsymbol{\sigma}\right)'\left(\frac{\sqrt{K}\boldsymbol{\alpha}}{\sqrt{\boldsymbol{\alpha}'K\boldsymbol{\alpha}}}\right)\right|$$

$$\le E_{\boldsymbol{\sigma}} \sup_{k \in \mathcal{K}} \sqrt{\boldsymbol{\sigma}'K\boldsymbol{\sigma}} \le \sqrt{E_{\boldsymbol{\sigma}} \sup_{k \in \mathcal{K}} \boldsymbol{\sigma}'K\boldsymbol{\sigma}},$$

where the first inequality is the Cauchy–Schwarz inequality (the second factor in the inner product has unit norm) and the second follows from Jensen's inequality.
The value of $E_{\boldsymbol{\sigma}} \sup_{k \in \mathcal{K}} \boldsymbol{\sigma}'K\boldsymbol{\sigma}$ can be upper bounded in two ways. From $\boldsymbol{\sigma}'K_j\boldsymbol{\sigma} \le \boldsymbol{\sigma}'\boldsymbol{\sigma}\,\lambda(K_j) = n\lambda(K_j)$, it is clear that $n \max_j \lambda(K_j)/\beta_j$ is an upper bound. Alternatively, observe that

$$E_{\boldsymbol{\sigma}} \sup_{k \in \mathcal{K}} \boldsymbol{\sigma}'K\boldsymbol{\sigma} \le E_{\boldsymbol{\sigma}}\,\boldsymbol{\sigma}'\left(\sum_{j=1}^m K_j/\beta_j\right)\boldsymbol{\sigma} = \sum_{j=1}^m \operatorname{trace}(K_j)/\beta_j.$$
Putting all pieces together completes the proof. $\Box$
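As a numerical sanity check on this last step (an illustrative sketch, not part of the proof), one can estimate $E_{\boldsymbol{\sigma}} \sup_{k \in \mathcal{K}} \boldsymbol{\sigma}'K\boldsymbol{\sigma}$ by Monte Carlo, noting that $\boldsymbol{\sigma}'K\boldsymbol{\sigma}$ is linear in $\boldsymbol{\mu}$, so the supremum over the convex hull is attained at one of the vertices $K_j/\beta_j$, and compare the estimate against both upper bounds:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 3
K_list = []
for _ in range(m):
    A = rng.standard_normal((n, n))
    K_list.append(A @ A.T)  # a random positive semidefinite "kernel" matrix
beta = np.ones(m)

# the sup over the convex hull is attained at a vertex K_j / beta_j
signs = rng.choice([-1.0, 1.0], size=(2000, n))
estimate = np.mean([max(s @ (Kj / bj) @ s for Kj, bj in zip(K_list, beta))
                    for s in signs])

eig_bound = n * max(np.linalg.eigvalsh(Kj).max() / bj
                    for Kj, bj in zip(K_list, beta))
trace_bound = sum(np.trace(Kj) / bj for Kj, bj in zip(K_list, beta))
print(estimate <= eig_bound, estimate <= trace_bound)  # expect: True True
```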
REFERENCES
Bartlett, P., Bousquet, O., and Mendelson, S. (2002). Localized Rademacher complexities. In Proceedings of the 15th Annual Conference on Computational Learning Theory (COLT 2002), pages 44–58.
Bartlett, P. L. and Mendelson, S. (2002). Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research, 3, 463–482.
Lanckriet, G. R. G., Cristianini, N., Bartlett, P., El Ghaoui, L., and Jordan, M. I. (2004). Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5, 27–72.
Shawe-Taylor, J. and Cristianini, N. (2004). Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, U.K.