The posterior distribution for which we obtain samples via the Gibbs sampling procedure can be written as:

(1)

GIBBS SAMPLING

The posterior distribution for which we obtain samples via the Gibbs sampling procedure can be written as:

) ,

| , ,

( g c θ X ψ p

Here, g is the vector of binary gene labels, c the vector of binary condition labels, θ the vector of model parameters, ψ the total set of hyperparameters (κ, ν, s, ϕ, ξ

g0

, ξ

g1

, ξ

c0

, ξ

c1

), X the expression data matrix (with rows referring to genes, and columns to conditions), Β(ξ

g1

, ξ

g0

) the Beta prior distribution on the probability that a gene belongs to the bicluster, Β(ξ

c1

, ξ

c0

) the Beta prior distribution on the probability that a condition belongs to the bicluster,

Derivation of full conditional Bernouilli distribution for gene and condition labels

1. Assume that the prior probability that a gene (condition) label equals one is given by a Beta distribution:

)

, and )

(

_g₁ _g₀

B ξ ξ B ( ξ

_c₁

, ξ

_c₀

2. The probability distribution of a collection g of n binary gene labels is then found by integrating out over this (conjugate) Beta prior distribution:

) (

) , , ( )

1 (

) , , ( ) , (

) , , ( )

| (

) , (

1 0

1 1 0 1

0 1

0 1 1

0 1 0

1

1 1

n n

dx x

B x

x

dx x

B x g Bern

dx x

B x p p

g g

g g n

i

i g g g

g

+ + Γ

+ Γ

− +

= Γ

−

=

∫

∫ ∏

∫

−

=

ξ ξ

ξ ξ ξ ξ ξ

ξ

g g

g

| g

g g

n refers to the total number of genes and ║g║

1

to the one norm of the current (binary) gene label vector (number of genes currently in the bicluster).

3. The derivation for the length m vector of binary condition labels c is analogous, and leads to:

) (

) ) (

, (

1 0

1 1 0 1

0

1

m

p m

c c

c

Γ + +

+ Γ

− +

= Γ

ξ ξ

ξ ξ ξ

ξ ^c ^c

|

c

(2)

4. Now consider the full conditional distributions of the gene labels. The conditional for gene label i is modeled by a Bernouilli distribution with parameter α

i

, that gives the probability that the label equals one:

∏

≠

=

_i _i _g _g _ij _ij ⁿ

i k

x p x

p g

p ( 1 , |

₁

,

₀

)

^bcl

( )

^bgd^,⁰

( )

1 g ξ ξ

In the above, X

k.

indicates the k-th row in the expression matrix X.

∏

≠=

=

≠

=

i k k

k c

j c

j

n

k

k i

g g i i

g g i

i i

i

g N p

g p

g p g

N p

g p g

p

j

j 1 | 0 1

|

1 0

1

0 1

) , ,

| (

) , ,

| ( ) , , 1

| ( ) ,

| , 1 (

) )

, , ,

| , (

) , , , , ,

| 1 ( ) , , , ,

| 1 (

θ c X

g θ c g X

θ c g X ψ

θ c g X

k.

ξ

i.

ξ ξ ξ

ξ ξ

≠

=

= =

=

ⁱ ⁱ ^g ^g ⁱ ⁱ ^g ^g

i

g p g

p ( X , 1 , g | c , θ , ξ

₁

, ξ

₀

) ( X , 1 , g | c , θ , ξ

₁

, ξ

₀

α

≠i g1 g0

1 5. Similarly, an expression for 1-α

i

can be derived:

∏

≠=

=

≠

=

−

n

i k k

k c

j

ij c

j

ij g

g i i

i i

i

g p

x p x

p g

N p g p

j

j | 0 1

0 , bgd 1

|

1 , bgd 0

1

, ) ( ) ( ) ( | , , )

| , 0 1 (

) , , , ,

| 0 (

θ c X

g

ψ θ c g X

ξ

k.

ξ α

1 ow, the ratio of those two expressions has a simple form and an elegant rpretation, as discussed in some detail in the main text:

6. N te in

= ∏

l

b

( )

) ,

| , 1

(

^ij

c

x

g p

p g ξ ξ

α

∏

=

≠

1

|

1 , 0 bgd

1

( )

cj

j

ij g

g i i

i

p x

∏

+

≠

=

l b

1 _i 1 ^c

(

_ij

)

g

g p x

ξ

≠ =

− +

= =

−

1

| bgd,1 0 1

1 0 | 1

) ( ) ,

| , 0 1 (

j

c

j ij

i g

c g j g i i i

x p n

g p

g g

ξ

ξ α ξ

s for the Bernouilli parameters of the condition label conditionals can be The last equation is a direct consequence of the result described under 2.

7. The odd

obtained similarly:

(3)

≠

= =

=

_j _j _j _i _c _c

j

p ( c 1 | X , c , g , θ , ψ ) p ( c 1 | X , c , g , θ , ξ

₁

, ξ

₀

) β

∏

≠=

=

≠

≠=

≠

=

^m

i k k

k g

i

ij g

i

ij c c

c j j

i k k

k j

c c j j

c p

x p x

p c

M p M

i

i | 0 1

1 , bgd 1

| l b 0

1

1 0

1

) , ,

| ( ) ( )

( )

,

| , 1 1 (

θ g X

c

_.k

.k .j

ξ ξ

∏

≠

=

= =

m

c c j

j c

c i

j

c p

c p c

p

M c

p p

c

p

₁ ₀ ₁ ₀

) , ,

| ( ) , , 1

| ( ) ,

| , 1 (

) , , ,

| , 1 , ( )

, , ,

| , (

) , , ,

| , 1 , (

θ g X

c

θ g c X

ξ ξ

ξ ξ ξ

ξ

ξ ξ

≠j c1 c0

1

∏

= =

=

≠

=

_k

g i

ij g

i

ij c

c j

j

p x p x p c

c

M p

_i _| _i ₀

0 , bgd 1

|

0 , bgd 0

1

, ) ( ) ( ) ( | , ,

| , 0 1 (

g X

c ξ ξ

_.k

≠=

=

≠

m

i k k c

c g g j

j

1 0

1 0 1

) )

, , , , , , , ,

| 0

θ ψ

θ g c

X ξ ξ ξ ξ

=

−

_j

p( c

1 β

∏

≠

=

≠

+

= +

1 , bgd l

b

0 1 1

1 0 , 0 bgd

1

) ( )

(

) (

ij ij

c

j c

i

ij c

c j j

j

x p x

p

x p c

ξ ξ

∏

=

≠ =

=

≠

−

− =

0

| bgd,0 1

0

|

1 , bgd 1

| l b 0

1

) ( )

(

) ( )

( )

,

| , 0 (

) ,

| 1

i i

g

i ij

g

i ij

j

n g i

ij g

i

ij c c

c j

x p x

m p

x p x

p c

p

c

c ξ ξ ξ ξ β

The extra factor is a result of the asymmetrical nature of the patterns we are

= (

_j

= 1 ,

j

p c c

β

looking for (‘striped’ biclusters rather than constant-value biclusters).

Derivation of full conditional Bernouilli distribution for model parameters ue to th

D e conjugacy, the full conditional distributions for the model are in the same form

mal distribution for the mean.

bined with a normal likelihood, the Inverse χ

²

prior on σ

²

gives rise to an Inverse χ

²

posterior on σ

²

.

This interpretation that directly motivates the parameterization of the prior, as it allows to f the corresponding formulas, as the priors.

1. One can easily derive that for the me t of the normal likelihood with the normal prior (conditioned on σ) leads to a po an, the produc

sterior nor

2. Similarly, com

interpret both κ and ν as a number of prior observations (sometimes referred to as

pseudocounts) associated with the prior. For a derivation o

(4)

we refer the reader to standard textbooks on Bayesian statistics (Gelman A.B. et al., 2004).

Note on the use of two background models

Strictly speaking there are two background models rather than one single model, as shown in figure REF. If condition label j is zero, the entire j th column is captured by one background model (bgd,0). On the other hand, if the j th column label equals one, the background model that describes the expression of those background genes should only be constructed with the background genes (bgd,1). The difference between those models only shows up in the derivations for the full conditional distributions of the condition labels (see above).

In most cases this distinction is primarily a technical issue and the difference between bgd,0 and bgd,1 only matters when very large biclusters occur.

SCORING MEASURES

Biclustering results were scored with module recovery and bicluster relevance scores as described in (Prelic et al., 2006). The module recovery score indicates how well the gene content of the ‘ideal’ modules is on average reflected in the (best matching bicluster of the) bicluster results:

2 1

2 1 )

,

( ( , )

1 2

1

1 1

1 2 2 2

1 max )

,

( G G

G G M M

M S

M C

G G C M

MR

∪

= ∑ ∩

∈ ∈

In the above, M

1

indicates the set of biclusters or modules that are considered ‘ideal’ or

‘real’, M

2

indicates the set of biclusters that are really returned by the algorithm. Hence, module recovery runs over all ‘ideal’ biclusters and averages out the match scores (in terms of gene content G) of the best matching biclusters from the obtained biclustering results.

The bicluster relevance score S

BR

is obtained by interchanging the roles of the ‘ideal’ and reported biclusters. It is related to the relevance of the set of modules in the output. In other words, it penalizes ‘incorrect’ modules in the output.

REFERENCES

Gelman A.B. et al. (2004) Bayesian data analysis. Chapman & Hall/CRC.

(5)

Prelic,A. et al. (2006) A systematic comparison and evaluation of biclustering methods

for gene expression data. Bioinformatics., 22, 1122-1129.

The posterior distribution for which we obtain samples via the Gibbs sampling procedure can be written as:

GIBBS SAMPLING