
Academic year: 2021


Missing values management

One of the problems customarily encountered when analyzing gene expression data (especially with microarrays) is the occurrence of missing values.

To deal with them, we use a methodology suggested by Kaufman and Rousseeuw (Kaufman, L. and Rousseeuw, P.J. (1990) Finding groups in data: an introduction to cluster analysis. Wiley, New York; see pages 14-15). The mathematical description of this method is given below. In general, missing values have to be taken into consideration in Step 1 and Step 2 when calculating Euclidean distances and mean expression profiles.

Suppose that $A=\{g_i=(g_i^1, g_i^2, \ldots, g_i^E),\ i=1,\ldots,Q\}$ is a set of $Q$ gene expression profiles $g_i$, where $E$ is the number of measurements for each gene (the subscript indicates the gene number; the superscript indicates the measurement number). Suppose that the measurement numbers of the missing values for gene expression profile $g_i$ are given by the set $P_i=\{p_{i,m}\}_{m=1,\ldots,M_i}$, where $M_i$ is the number of missing values in $g_i$. For example, suppose that $E=7$ and $g_1=\{1, 3, -9, *, 5, *, 0\}$ (* indicates a missing value); then $P_1=\{4,6\}$ ($p_{1,1}=4$; $p_{1,2}=6$; $M_1=2$).
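As a minimal sketch of this notation, the set $P_i$ can be read directly off a profile. The code assumes, purely for illustration, that a missing measurement is stored as Python's `None`; the helper name `missing_indices` is hypothetical and not part of the original description.

```python
MISSING = None  # assumption: a missing measurement (*) is stored as None


def missing_indices(profile):
    """Return P_i, the 1-based measurement numbers that are missing."""
    return {j for j, value in enumerate(profile, start=1) if value is MISSING}


# Profile g_1 from the example above (E = 7).
g1 = [1, 3, -9, MISSING, 5, MISSING, 0]
P1 = missing_indices(g1)
print(P1, len(P1))  # {4, 6} 2
```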

If we want to calculate the Euclidean distance $d(g_k,g_l)$ between $g_k$ and $g_l$, we have to take their missing values into account. Suppose that $\#(P_k \cup P_l) < E$; otherwise $d(g_k,g_l)$ is undefined ($\#$ denotes the number of elements in a given set). We define $d(g_k,g_l)$ as:

$$d(g_k,g_l)=\sqrt{\frac{E}{E-\#(P_k \cup P_l)}\sum_{i \notin P_k \cup P_l}\left(g_k^i-g_l^i\right)^2}$$
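The distance definition can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: it assumes missing measurements are stored as `None`, and the function name `distance` is hypothetical. The sum runs only over measurements observed in both profiles, and the factor $E/(E-\#(P_k \cup P_l))$ rescales that sum to the full number of measurements.

```python
import math


def distance(gk, gl):
    """Euclidean distance between two profiles, rescaled for missing values.

    Missing measurements are assumed to be stored as None. Returns None
    when the two profiles share no observed measurement, in which case
    the distance is undefined.
    """
    if len(gk) != len(gl):
        raise ValueError("profiles must have the same number of measurements")
    E = len(gk)
    # Keep only measurements observed in both profiles (i not in P_k U P_l).
    observed = [(a, b) for a, b in zip(gk, gl) if a is not None and b is not None]
    if not observed:
        return None
    scale = E / len(observed)  # E / (E - #(P_k U P_l))
    return math.sqrt(scale * sum((a - b) ** 2 for a, b in observed))


g1 = [1, None, None, -7, 9, 0, -1]
g2 = [None, 2, None, 5, 1, None, None]
print(round(distance(g1, g2), 2))  # 26.98
```

Returning `None` for the undefined case is a design choice of this sketch; raising an exception would be an equally reasonable alternative.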

For example, if $E=7$, $g_1=\{1, *, *, -7, 9, 0, -1\}$ and $g_2=\{*, 2, *, 5, 1, *, *\}$, then $P_1=\{2,3\}$, $P_2=\{1,3,6,7\}$, $P_1 \cup P_2=\{1,2,3,6,7\}$ and $\#(P_1 \cup P_2)=5$. The distance $d(g_1,g_2)$ is given by:

$$d(g_1,g_2)=\sqrt{\frac{7}{7-5}\left[(-7-5)^2+(9-1)^2\right]}=26.98$$

If we want to calculate the mean expression profile $g_{av}$ of $A$, we also have to take the missing values into account. The $j$-th measurement of $g_{av}$, written $g_{av}^j$, is defined as follows (note that $* \cdot 0 = 0$):

$$g_{av}^j=\begin{cases}\dfrac{1}{N(j)}\displaystyle\sum_{p=1}^{Q}\left(g_p^j \cdot D(p,j)\right) & \text{if } N(j)\neq 0\\[2ex] \text{missing} & \text{if } N(j)=0\end{cases}$$

where

$$D(p,j)=\begin{cases}0 & \text{if } j\in P_p\\ 1 & \text{if } j\notin P_p\end{cases}$$

and

$$N(j)=\sum_{p=1}^{Q}D(p,j)$$

For example, if $E=7$ and $A=\{g_1,g_2,g_3\}$ ($Q=3$), where

$g_1=\{1, *, *, -7, 9, 0, -1\}$

$g_2=\{*, 2, *, 5, 1, *, *\}$

$g_3=\{2, 3, *, -9, *, 6, *\}$

then

$$g_{av}=\left\{\frac{1+2}{2},\ \frac{2+3}{2},\ *,\ \frac{-7+5-9}{3},\ \frac{9+1}{2},\ \frac{0+6}{2},\ \frac{-1}{1}\right\}=\{1.5,\ 2.5,\ *,\ -3.667,\ 5,\ 3,\ -1\}$$

Note that the approach described above removes the need to replace the missing values with fictitious values (e.g., by using interpolation).
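The mean expression profile can be sketched in the same way. Again this is only an illustration under stated assumptions: missing measurements are stored as `None`, and the function name `mean_profile` is hypothetical. For each measurement $j$, the code averages over the genes for which that measurement is observed, which corresponds to dividing by $N(j)$, and it marks the result as missing when no gene has an observed value.

```python
def mean_profile(profiles):
    """Mean expression profile of a list of equal-length profiles.

    Missing measurements are assumed to be stored as None. The j-th entry
    of the result is the mean over the observed values of measurement j,
    or None if measurement j is missing in every profile (N(j) = 0).
    """
    E = len(profiles[0])
    if any(len(g) != E for g in profiles):
        raise ValueError("profiles must have the same number of measurements")
    result = []
    for j in range(E):
        observed = [g[j] for g in profiles if g[j] is not None]
        # len(observed) plays the role of N(j).
        result.append(sum(observed) / len(observed) if observed else None)
    return result


g1 = [1, None, None, -7, 9, 0, -1]
g2 = [None, 2, None, 5, 1, None, None]
g3 = [2, 3, None, -9, None, 6, None]
g_av = mean_profile([g1, g2, g3])
print([None if v is None else round(v, 3) for v in g_av])
# [1.5, 2.5, None, -3.667, 5.0, 3.0, -1.0]
```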
