IE72: Case studies bio-i system integration

http://www.esat.kuleuven.ac.be/sista/GGS/ie72.html

Bart De Moor
ESAT-SCD, K.U.Leuven
Kasteelpark Arenberg 10, B-3001 Leuven, Belgium
T: +32-(0)16 321709
F: +32(0)16 321970
E: bart.demoor@esat.kuleuven.ac.be

(2)

T: +32-(0)16-321709 W: http://www.kuleuven.ac.be/sista

Contents of the course

1. 27/02: Learn to swim in a sea of data, Bart De Moor, ESAT-SCD, KUL
2. 20/02: Turn data into knowledge, Katja Scheiser, Phase1
3. 01/03: Title, Wim Van Criekinge, DevGen
4. 04/03: The construction of pairwise distance trees, Y. Van de Peer, UG
5. 11/03: From genome to vaccine, Joelle Thonnard, GlaxoSmithKline
6. Title, Mark Lambrecht, CMPG, KUL
7. 21/03: Tutorial on micro-array analysis, Kathleen Marchal, ESAT-SCD, KUL
8. 22/03: Pacs Systems, Erwin Bellon, ESAT-PSI, KUL
9. 26/03: Cropdesign and bio-informatics, Koen Bruynseels, Cropdesign
10. 22/03: The VIB micro-array facility, Paul Van Hummelen, VIB

What to study for this lesson?

-The slides

-The article entitled:

Available on the course website

http://www.esat.kuleuven.ac.be/sista/GGS/ie72.html

-Notes made by you !!

Context

- Technological breakthroughs in
  - Data acquisition: (soft)sensors
  - Computer capacity: speed / memory / databases
  - Software: algorithms!
- Data mining = learn to swim in a sea of data…

Motivating examples

- I: Customer intelligence
- II: Bio-informatics

Example area I: Customer intelligence

- Motto: “know your customer and adapt your company to this knowledge”
- Purpose:
  - Profitability (retention, loyalty, up-sell, personalised offers)
  - Cost cutting (bad debt, fraud risk, optimized offers)
- Techniques:
  - Connect to multiple inbound data sources
  - Capture, fuse, summarize, store all this data
  - Analyze individuals and groups
  - Decide on optimal customer interaction
  - Multi-channel outbound steering

… diverse range….

- Mobile phone loyalty & fraud detection
- Credit card user profile and fraud detection
- Money laundering ‘best practices’
- Websites: user profile identification and site customization
- Company collective loyalty programs
- E-government applications
- …..

Score & Target

Analyze & Model

Measure response

Contact

HCI Profiling Platform

Build Customer Profiles

GSM calls

Actual GSM use

Database

Customer info Subscription type

Top-ups

Prepay/credit card

Risk management

Use Customer Profiles

Segmentation/clustering Predictive modeling

Classification

example case

Datamining: Platform, GUI, Reporting

DATA
- Oracle
- Foxpro
- Sybase
- ...

MINING
- advanced mathematics
- Filters, Neural Nets, Clustering
- Lookup Tables
- Triggering mechanism
- ...

Dynamic Libraries ODBC-DAO

SQL

WWW Access TCP/IP link

FDL Environment

[Diagram labels: HCI Control Center; HCI Mining Workbench; HCI Scenario Builder; Static & dynamic data; Profiling (Key indicators, Synthesis, Accumulation, ...); Segmentation; clustering/segmentation; Predictive Modeling; Association; Scenario; Decision support; Outbound; Export; Deploy]

Courtesy of www.data4s.com


Closed loop solution suite

Example area II: Bio-informatics

- DNA: deoxyribonucleic acid
- The double helix = linear polymer
- Backbone = phosphodiester bonds
- 4 nucleotide bases: code with 4 ‘bits’
  A: Adenine, C: Cytosine, G: Guanine, T: Thymine
- Length: 3 billion (humans)
- Complementarity: H-bonds: A-T; G-C
- mRNA
  - Single-stranded RNA with U(racil) instead of Thymine
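The pairing rules above (A-T, G-C, and U replacing T in mRNA) can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and `to_mrna` assumes the input is the coding strand, so that the mRNA has the same sequence with U in place of T.

```python
# Watson-Crick pairing from the slide: A-T and G-C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(dna: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(PAIR[base] for base in dna)

def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read in its own 5'->3' direction."""
    return complement(dna)[::-1]

def to_mrna(coding_strand: str) -> str:
    """Write the coding strand as mRNA by replacing T with U."""
    return coding_strand.replace("T", "U")

print(complement("ATGC"))          # TACG
print(reverse_complement("ATGC"))  # GCAT
print(to_mrna("ATGC"))             # AUGC
```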

Some history and impact

-1865: Mendel
-1953: Watson & Crick, Nobel prize 1962
-1965: Restriction enzymes: ‘DNA scissors’
-1973: Cohen and Boyer, gene transfer in bacteria
-1982: Insulin produced by transgenic bacteria
-1983: PCR (Polymerase Chain Reaction): DNA amplification
-1985: First plant GMO: Bt tobacco
-1991: First transgenic animal: Herman the bull
-1994: First GMO tomato on the market
-1996: Microarrays
-1997: Dolly is cloned
-1998: Transgenic rabbit (Pompe’s (orphan) disease)
-2000: Human Genome

Protein

- Linear polymer of 20 different kinds of amino acids
- Each amino acid is represented by a triplet of bases (codon)
- Linked by peptide bonds
- Complex 3-D structure (folding)
- ‘Workhorses’ for functionalities of the cell
- Computational folding and docking problems

20 amino acids

Ala  A  Alanine
Arg  R  Arginine
Asn  N  Asparagine
Asp  D  Aspartic acid
Cys  C  Cysteine
Gln  Q  Glutamine
Glu  E  Glutamic acid
Gly  G  Glycine
His  H  Histidine
Ile  I  Isoleucine
Leu  L  Leucine
Lys  K  Lysine
Met  M  Methionine
Phe  F  Phenylalanine
Pro  P  Proline
Ser  S  Serine
Thr  T  Threonine
Trp  W  Tryptophan
Tyr  Y  Tyrosine
Val  V  Valine

64 Codons

1st  2nd position                                                    3rd
     U               C               A               G
U    UUU Phe         UCU Ser         UAU Tyr         UGU Cys         U
U    UUC Phe         UCC Ser         UAC Tyr         UGC Cys         C
U    UUA Leu         UCA Ser         UAA stop        UGA stop        A
U    UUG Leu         UCG Ser         UAG stop        UGG Trp         G
C    CUU Leu         CCU Pro         CAU His         CGU Arg         U
C    CUC Leu         CCC Pro         CAC His         CGC Arg         C
C    CUA Leu         CCA Pro         CAA Gln         CGA Arg         A
C    CUG Leu         CCG Pro         CAG Gln         CGG Arg         G
A    AUU Ile         ACU Thr         AAU Asn         AGU Ser         U
A    AUC Ile         ACC Thr         AAC Asn         AGC Ser         C
A    AUA Ile         ACA Thr         AAA Lys         AGA Arg         A
A    AUG Met, start  ACG Thr         AAG Lys         AGG Arg         G
G    GUU Val         GCU Ala         GAU Asp         GGU Gly         U
G    GUC Val         GCC Ala         GAC Asp         GGC Gly         C
G    GUA Val         GCA Ala         GAA Glu         GGA Gly         A
G    GUG Val         GCG Ala         GAG Glu         GGG Gly         G

64 Codons

- Triplets of nucleotide bases
- Position 1, 2 and 3 have specific functions
- Several amino acids form a protein
- Error-correcting features
  - Lowering the chance that a mutation in DNA/RNA changes the amino acid
  - Similar amino acids have similar codons
- Redundancy: one amino acid has several codons
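As an illustration of how the 64 codons map onto amino acids, the sketch below builds the standard codon table and translates a short mRNA string. The function names and the example sequence are hypothetical; the table itself is the standard genetic code, written with one-letter amino acid codes and `*` for stop.

```python
from itertools import product

BASES = "UCAG"
# One-letter amino acid codes in standard table order (U, C, A, G at each
# of the three positions); "*" marks the three stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(mrna: str) -> str:
    """Translate codon by codon from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

def codons_for(aa: str) -> list:
    """List every codon that encodes the given amino acid (redundancy)."""
    return sorted(c for c, a in CODON_TABLE.items() if a == aa)

print(translate("AUGGCUUAA"))  # MA  (Met, Ala, then stop)
print(codons_for("L"))         # the six Leu codons
```

The redundancy bullet shows up directly: Leu has six codons, while Trp and Met have one each.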

Central dogma (Francis Crick, 1958)

- Genetic code is universal among species:
  - DNA → mRNA → codons → amino acids → proteins
- …Except for a number of variations
  - Retrovirus (reverse transcription RNA → DNA!)
  - Example: AIDS virus (‘genomic RNA’, as life started with)
- DNA has exons/introns (junk DNA)
  - Leftovers from evolution
  - In transcription to mRNA, introns are removed

‘Cracked’ genomes

Genomes….

Group        Species                    Genes   Genome (Mbase)
Phages       Bacteriophage MS2          4       0.003560
Viruses      HIV Type 2                 9       0.009671
Bacteria     Haemophilus influenzae     1760    1.83
Archaea      Methanococcus jannaschii   1735    1.74
Fungi        Saccharomyces cerevisiae   5800    12.1
Protoctista  Oxytricha similis          12000   600
Arthropoda   Drosophila melanogaster    12000   165
Nematoda     Caenorhabditis elegans     14000   100
Mollusca     Loligo pealii              35000   2700
Plantae      Arabidopsis thaliana       25000   70-145
Chordata     Homo sapiens               70000   3000

The genome….

>gi|7242486|emb|X90381.2|ATM4GENE Arabidopsis thaliana MYB102 gene

ACACATTAAATCTTATATGCTAAAACTAGGTCTCGTTTTAGGGATGTTTATAACCATCTTTGAGATTATTGATGCATGGTTATTG GTTAGAAAAAATATACGCTTGTTTTTCTTTCCTAGGTTGATTGACTCATACATGTGTTTCATTGAGGAAGGAACTTAACAAAAC TGCACTTTTTTCAACGTCACAGCTACTTTAAAAGTGATCAAAGTATATCAAGAAAGCTTAATATAAAGACATTTGTTTCAAGGT TTCGTAAGTGCACAATATCAAGAAGACAAAAATGACTAATTTTGTTTTCAGGAAGCATATATATTACACGAACACAAATCTATT TTTGTAATCAACACCGACCATGGTTCGATTACACACATTAAATCTTATATGCTAAAACTAGGTCTCGTTTTAGGGATGTTTATAA CCATCTTTGAGATTATTGATGCATGGTTATTGGTTAGAAAAAATATACGCTTGTTTTTCTTTCCTAGGTTGATTGACTCATACAT GTGTTTCATTGAGGAAGGAACTTAACAAAACTGCACTTTTTTCAACGTCACAGCTACTTTAAAAGTGATCAAAGTATATCAAG AAAGCTTAATATAAAGACATTTGTTTCAAGGTTTCGTAAGTGCACAATATCAAGAAGACAAAAATGACTAATTTTGTTTTCAG GAAGCATATATATTACACGAACACAAATCTATTTTTGTAATCAACACCGACCATGGTTCGATTACACACATTAAATCTTATATGC TAAAACTAGGTCTCGTTTTAGGGATGTTTATAACCATCTTTGAGATTATTGATGCATGGTTATTGGTTAGAAAAAATATACGCTT GTTTTTCTTTCCTAGGTTGATTGACTCATACATGTGTTTCATTGAGGAAGGAACTTAACAAAACTGCACTTTTTTCAACGTCAC AGCTACTTTAAAAGTGATCAAAGTATATCAAGAAAGCTTAATATAAAGACATTTGTTTCAAGGTTTCGTAAGTGCACAATATCA AGAAGACAAAAATGACTAATTTTGTTTTCAGGAAGCATATATATTACACGAACACAAATCTATTTTTGTAATCAACACCGACCA TGGTTCGATTAACACATTAAATCTTATATGCTAAAACTAGGTCTCGTTTTAGGGATGTTTATAACCATCTTTGAGATTATTGATG CATGGTTATTGGTTAGAAAAAATATACGCTTGTTTTTCTTTCCTAGGTTGATTGACTCATACATGTGTTTCATTGAGGAAGGAA CTTAACAAAACTGCACTTTTTTCAACGTCACAGCTACTTTAAAAGTGATCAAAGTATATCAAGAAAGCTTAATATAAAGACATT TGTTTCAAGGTTTCGTAAGTGCACAATATCAAGAAG

-Linear sequence of 4 nucleotide bases

-Genes, regulatory elements, promoter sequences, exons, introns, junk DNA

- Exponential growth of available data
  - 5.5 million sequences
  - 4.3 billion bases (= letters)

[Figure: Database growth: number of sequences versus date (Jun-82 to Sep-98)]

[Figure: Database growth: number of nucleotides versus date (Jun-82 to Sep-98)]


Microarrays

Genome-wide robotized monitoring of gene activities by measurement of the levels of RNA transcripts (mRNA or total RNA)

Massively parallel
Fully automated
Standardizable
High/Mega Throughput Screening (HTS/MTS)

Micro-array

- Measure genome-wide expression levels of RNA transcripts
- Applications:
  - Diagnosis: RNA expression differs in healthy and pathological cells
  - Disease models: etiology and pathogenesis
  - Genotyping
  - Response to treatment
  - ...

Algorithms / techniques

- Principal and independent component analysis
- Canonical correlation
- Clustering
- Optimization
- Dynamic programming/EM/HMM….
- System identification
- Classification
  - Diagnosis
  - Risk assessment
- Discriminant analysis
- Logistic regression
- Neural networks
- SVM
- Self-organizing maps
- Bayesian networks
  - Incorporating a priori’s
- Pattern recognition
- Global/local alignment
  - Database search for similarities
  - HMM
- Motif extraction/detection

Rule based

Inputs: Cash Transaction Value, Cash Transaction Frequency

- If the total value of transactions is > 10.000 Euro: suspicious transaction
- If the number of cash transactions < 10.000 Euro is > 3: suspicious transaction
- Otherwise: regular transaction

RULE BASED BUSINESS LOGIC:
- Interpretability
- Use of lists (black list, gray list)
- BUT expert knowledge needed!
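A minimal sketch of such rule-based logic, assuming one reading of the slide: a customer is flagged when the total transaction value exceeds 10.000 Euro, or when more than three cash transactions each stay below 10.000 Euro. The function name, constants and the way the two rules are combined are illustrative assumptions, not the logic of any actual product.

```python
VALUE_LIMIT_EUR = 10_000      # value threshold taken from the slide
MAX_SMALL_CASH_COUNT = 3      # frequency threshold taken from the slide

def classify(cash_transactions):
    """Label a customer's list of cash amounts (in Euro).

    Assumed combination of the two rules: either one is enough
    to mark the activity as suspicious.
    """
    total = sum(cash_transactions)
    small = sum(1 for amount in cash_transactions if amount < VALUE_LIMIT_EUR)
    if total > VALUE_LIMIT_EUR or small > MAX_SMALL_CASH_COUNT:
        return "suspicious"
    return "regular"

print(classify([2000, 3000]))            # regular
print(classify([6000, 5000]))            # suspicious (total value rule)
print(classify([100, 100, 100, 100]))    # suspicious (frequency rule)
```

The rules are easy to read and audit, which is the interpretability advantage named above; the thresholds themselves still have to come from an expert.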

Neural nets

Inputs: Cash Transaction Frequency, Cash Transaction Value

CTV + CTF + CCP + … → Decision

Classification from examples

Data driven

Clustering – density modeling

Call frequency, Average call duration

User or activity grouping
Fast and robust learning
Statistics: outlier detection,…
Abnormal = fraud

Case 1: Oncology

- Cancer = genetic disorder
  - Mutations cause transformation of a normal cell into a tumor cell
  - Mutations also induce changes in expression levels of non-mutated genes!
  - Disturbed expression levels determine the phenotype of the tumor
- Measurement of all expression levels will therefore be of great benefit to determine and to understand the real behavior of tumor cells

Microarrays in oncology

- Ingredients: Tissue bank, Medical files, Medical expertise, Biotech know-how, Microarray facility, Statistical expertise, Algorithms, Software development
- By screening banks of tumors with the microarray, we obtain data to build statistical models for
  - Diagnosis
  - Staging
  - Prognosis
  - Choice of therapy
  - Follow-up of therapy
- Microarrays can also be used to identify genes implicated in cancer and to understand the mechanisms of oncogenesis

MIT Leukemia data set

- 2 classes: ALL (class 1) and AML (class 2)
- training set of 38 samples (27 ALL and 11 AML)
- independent test set of 34 samples
- ~7000 gene expression levels for each sample = 7000x38 matrix for training set
- Goal:
  - Which of the 7000 genes are relevant?
  - Classification into class 1 and class 2
- Data preprocessing (rescaling, thresholding, log-transformation, normalization)

Principal component analysis

- Principal components = eigenvectors of the covariance matrix = SVD of the mean-corrected data matrix

$$\Sigma = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu)(x_n - \mu)^T$$

- Dimensional reduction: project on the dominant eigenspace
- Results in 2 dimensions:

[Figure: samples projected on the first two principal components; * = training, O = independent; Class 1/ALL, Class 2/AML]
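To make the projection step concrete, here is a small pure-Python sketch that centers the data, forms the covariance matrix and finds its dominant eigenvector by power iteration. Power iteration is a stand-in chosen to keep the example dependency-free (the slides speak of eigenvectors and the SVD); the function name and the toy data are hypothetical, and the start vector is assumed not to be orthogonal to the dominant eigenvector.

```python
def pca_first_component(X, iters=200):
    """Return the dominant principal direction and the projected scores."""
    n, d = len(X), len(X[0])
    mu = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - mu[j] for j in range(d)] for row in X]   # mean-corrected data
    # Covariance matrix: average outer product of the centered samples.
    S = [[sum(r[i] * r[j] for r in Xc) / n for j in range(d)] for i in range(d)]
    v = [1.0] * d                                            # start vector
    for _ in range(iters):
        w = [sum(S[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    scores = [sum(r[j] * v[j] for j in range(d)) for r in Xc]
    return v, scores

# Toy data lying exactly on the line y = 2x, so the first component
# must point along (1, 2).
X = [[-2.0, -4.0], [-1.0, -2.0], [0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]
v, scores = pca_first_component(X)
print(v)
print(scores)
```

For the leukemia data the same idea applies with roughly 7000 dimensions, where a proper SVD routine would be the practical choice.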

Fisher linear discriminant analysis

Further reduction to one dimension after PCA: $y = w^T x$

- The vector $w$ is chosen so that:
  - the within-class spread is small
  - the class separation is large
- and is given by:

$$w = S_w^{-1} (\mu_1 - \mu_2)$$

where

$$S_w = \sum_{n \in C_1} (x_n - \mu_1)(x_n - \mu_1)^T + \sum_{n \in C_2} (x_n - \mu_2)(x_n - \mu_2)^T$$

- Threshold $w_o = w^T \mu$: $y > w_o \Rightarrow$ Class 1 / $y < w_o \Rightarrow$ Class 2
- Linear Discriminant analysis after PCA in 5 dimensions gives only 2

[Figure: samples projected on the Fisher discriminant direction; Class 1/ALL, Class 2/AML]
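The following sketch works through the Fisher discriminant for two-dimensional samples, using the closed-form inverse of a 2x2 matrix. Two choices are assumptions made for illustration: the direction is oriented from the class 2 mean toward the class 1 mean, so that projections above the threshold indicate class 1, and the threshold is placed at the projection of the overall sample mean. Function names and data are hypothetical.

```python
def mean(rows):
    return [sum(r[j] for r in rows) / len(rows) for j in range(2)]

def scatter(rows, mu):
    """2x2 scatter matrix: sum of outer products of centered samples."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mu[0], r[1] - mu[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher(c1, c2):
    """Return the discriminant direction w and the threshold w0."""
    m1, m2 = mean(c1), mean(c2)
    s1, s2 = scatter(c1, m1), scatter(c2, m2)
    # Within-class scatter S_w = S_1 + S_2, written as [[a, b], [c, d]].
    a, b = s1[0][0] + s2[0][0], s1[0][1] + s2[0][1]
    c, d = s1[1][0] + s2[1][0], s1[1][1] + s2[1][1]
    det = a * d - b * c
    diff = [m1[0] - m2[0], m1[1] - m2[1]]
    # w = S_w^{-1} (m1 - m2), using the closed-form 2x2 inverse.
    w = [(d * diff[0] - b * diff[1]) / det, (-c * diff[0] + a * diff[1]) / det]
    mu = mean(c1 + c2)          # assumption: threshold at the overall mean
    w0 = w[0] * mu[0] + w[1] * mu[1]
    return w, w0

def predict(x, w, w0):
    return 1 if w[0] * x[0] + w[1] * x[1] > w0 else 2

class1 = [[2.0, 2.0], [3.0, 2.0], [2.0, 3.0], [3.0, 3.0]]
class2 = [[-2.0, -2.0], [-3.0, -2.0], [-2.0, -3.0], [-3.0, -3.0]]
w, w0 = fisher(class1, class2)
print(w, w0)
print(predict([2.5, 1.0], w, w0), predict([-1.0, -2.0], w, w0))
```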

K-means clustering

- Assign each sample, arbitrarily, to one of N clusters
- Iterate
  - Calculate the mean vector for each cluster, $\mu_i$, $i = 1 \dots N$
  - Re-assign each sample $x_j$, $j = 1 \dots \#\text{Samples}$, to the k-th cluster with the nearest mean vector $\mu_k$:

$$\| x_j - \mu_k \| \le \| x_j - \mu_i \|, \quad \forall i$$

- K-means in 5 dimensions with N=2 clusters

[Figure: K-means result; * = Cluster 1, O = Cluster 2; Class 1/ALL, Class 2/AML; cluster means]
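The iteration described on this slide can be sketched as follows. To keep the result reproducible, the "arbitrary" initial assignment is taken to be round-robin (sample j goes to cluster j mod N), which is an assumption of this example; the function name and toy data are hypothetical as well.

```python
def kmeans(samples, n_clusters, max_iter=100):
    """Plain K-means: alternate mean computation and re-assignment."""
    dim = len(samples[0])
    # Arbitrary start: round-robin assignment (an illustrative choice).
    labels = [j % n_clusters for j in range(len(samples))]
    means = [[0.0] * dim for _ in range(n_clusters)]
    for _ in range(max_iter):
        # Step 1: mean vector of every cluster.
        for i in range(n_clusters):
            members = [s for s, l in zip(samples, labels) if l == i]
            if members:  # keep the previous mean if a cluster runs empty
                means[i] = [sum(m[k] for m in members) / len(members)
                            for k in range(dim)]
        # Step 2: re-assign each sample to the cluster with the nearest mean.
        new_labels = [
            min(range(n_clusters),
                key=lambda i: sum((s[k] - means[i][k]) ** 2 for k in range(dim)))
            for s in samples
        ]
        if new_labels == labels:   # converged: no sample changed cluster
            break
        labels = new_labels
    return labels, means

samples = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, means = kmeans(samples, n_clusters=2)
print(labels)   # [0, 0, 0, 1, 1, 1]
print(means)
```

K-means only guarantees a local optimum, so in practice several random starts are usually compared.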

Neural nets: Nonlinear classifiers

Clustering

ROC curve

Plots the percentage of correct predictions for ALL samples versus the percentage of incorrect predictions for AML samples, for varying values of the threshold

number of genes   % area ROC training   % area ROC prospective
20                1                     1
15                1                     1
10                1                     99.29
5                 1                     98.57
4                 1                     98.57
3                 1                     97.50
2                 98.32                 98.21
1                 93.60                 71.07
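The area under an ROC curve can be computed without drawing the curve: it equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score, counting ties as one half. The sketch below uses that identity; the function name and the scores are made up for illustration and are not the leukemia results.

```python
def roc_auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5    # ties count as half a correct ordering
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical classifier outputs for ALL (positive) and AML (negative).
all_scores = [0.9, 0.8, 0.4]
aml_scores = [0.7, 0.3, 0.2]
print(roc_auc(all_scores, aml_scores))   # 8 of 9 pairs ordered correctly
```

An area of 1 means that some threshold separates the two classes perfectly; 0.5 corresponds to random scoring.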

Least squares support vector machines

- Nonlinearly transform the data to a higher dimensional space
- In this space, classes are linearly separable
- Vapnik’s original SVM:
  - Convex quadratic optimization problem
  - dimension = # data points (huge!)
- LS-SVM
  - Only need to solve a LARGE set of linear equations
  - Iterative (Jacobi, Gauss-Seidel, SOR, block SOR,…)
  - Current research: advice required from numerical analysts!
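A toy LS-SVM classifier shows what "solving a set of linear equations" means here. The sketch assumes the usual LS-SVM formulation, in which the bias b and the multipliers alpha solve [[0, y^T], [y, Omega + I/gamma]] [b; alpha] = [0; 1] with Omega_ij = y_i y_j K(x_i, x_j). For four points it uses direct Gaussian elimination rather than the iterative solvers named above; the RBF kernel, the values of gamma and sigma, and all function names are illustrative assumptions.

```python
import math

def rbf(a, b, sigma=1.0):
    d2 = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def solve(A, b):
    """Gaussian elimination with partial pivoting (fine for tiny systems)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def lssvm_train(X, y, gamma=10.0):
    """Build and solve the LS-SVM linear system; returns (b, alpha)."""
    n = len(X)
    A = [[0.0] + [float(v) for v in y]]
    for i in range(n):
        row = [float(y[i])]
        for j in range(n):
            k = y[i] * y[j] * rbf(X[i], X[j])
            row.append(k + (1.0 / gamma if i == j else 0.0))
        A.append(row)
    sol = solve(A, [0.0] + [1.0] * n)
    return sol[0], sol[1:]

def lssvm_predict(x, X, y, b, alpha):
    s = b + sum(a * yi * rbf(x, xi) for a, yi, xi in zip(alpha, y, X))
    return 1 if s >= 0 else -1

X = [[0.0, 0.0], [0.0, 1.0], [3.0, 3.0], [3.0, 4.0]]
y = [-1, -1, 1, 1]
b, alpha = lssvm_train(X, y)
print([lssvm_predict(x, X, y, b, alpha) for x in X])
```

The system has one unknown per data point plus the bias, which is why large data sets call for the iterative methods listed on the slide.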

Bayesian networks

P(A|B) = P(B|A) · P(A) / P(B), e.g.: A = model, B = data

[Diagram labels: Bayesian network; Knowledge Engineering (1): Literature, Experts; Machine learning (2): Observations; Inference (3): Evidences; Classification: c_i ?; Probability prediction: P(C_i) = ?]

Bayesian networks & expert knowledge

- What is a Bayesian network?
  • Set of variables (with finite number of states)
  • Set of directed edges
  • Parameters = conditional probability tables

[Diagram nodes: Age, Parity, Pathology, Locularity, Multi, Loc-solid, Color score, CA 125]

• P(Path=benign | Age<40, Parity=0) = 0.738
• P(Path=benign | Age<40, Parity=1) = 0.8252
• P(Path=benign | Age<40, Parity=2) = 0.8603
• …

A BN structure, together with fully specified conditional probability tables, defines an overall distribution over the variables in the network.
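A small numerical sketch of how conditional probability tables combine into other probabilities. The three conditional entries are the ones quoted on the slide; the prior over parity is invented for illustration, parity is restricted to the three listed values, and parity is assumed to be independent of age (no edge between the two parent nodes). None of these assumptions come from the ovarian tumor model itself.

```python
# Entries quoted on the slide: P(Path=benign | Age<40, Parity=k).
P_BENIGN_GIVEN_YOUNG = {0: 0.738, 1: 0.8252, 2: 0.8603}

# Hypothetical prior P(Parity=k); these numbers are NOT from the slides.
P_PARITY = {0: 0.5, 1: 0.3, 2: 0.2}

def p_benign_given_young():
    """Marginalize parity out: sum_k P(Parity=k) * P(benign | Age<40, Parity=k)."""
    return sum(P_PARITY[k] * P_BENIGN_GIVEN_YOUNG[k] for k in P_PARITY)

def p_parity_given_benign_young(k):
    """Bayes' rule: P(Parity=k | benign, Age<40)."""
    return P_PARITY[k] * P_BENIGN_GIVEN_YOUNG[k] / p_benign_given_young()

print(p_benign_given_young())
print([p_parity_given_benign_young(k) for k in P_PARITY])
```

The second function is an instance of inference: evidence about the pathology updates the belief about a parent variable.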

Real world application: Ovarian Cancer

The goal in this problem is to classify an ovarian tumor pre-operatively. A lot of prior knowledge (doctors, literature,…) is available, as well as 300 cases.

[Diagram nodes: Age, Pregnancy, Genetic, Pathology, Locularity, Solid, Papillat, M-Solid, Color, Meno, Bilateral, RIndex, CA 125, Ascites, microarray]

(6) Annotated Bayesian networks II.

[Figure: Misclassification rate (axis from 0.1 to 0.4) for a multilayer perceptron with a non-informative prior, a Bayesian network with a non-informative prior, a Bayesian network with an informative prior, and a multilayer perceptron with an informative prior]

Case 2: Finding regulatory sequences

- Cluster genes from microarray expression data to build clusters of coexpressed genes
- Coexpressed genes may share regulatory mechanisms
- Most regulatory sequences are found in the upstream region of the genes (up to 2 kb in A. thaliana)
- Motifs that are statistically overrepresented in the upstream regions are candidate regulatory sequences
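As a very crude first look at overrepresentation, one can count in how many upstream sequences each fixed-length word occurs. The sketch below does only that: it has no background model and no statistical test, so it is a simplification and not the motif-finding method of the course. The sequences are invented toy data and the function name is hypothetical.

```python
from collections import Counter

def kmer_presence(seqs, k):
    """For every k-mer, count the number of sequences that contain it."""
    counts = Counter()
    for s in seqs:
        # A set, so that each sequence contributes at most once per k-mer.
        counts.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return counts

# Hypothetical upstream regions of three coexpressed genes.
upstream = ["TTACACGTGAA", "GGCACGTGTCC", "ACACGTGGGTA"]
counts = kmer_presence(upstream, k=6)
print(counts.most_common(3))
```

A word shared by every sequence in a cluster becomes a candidate; whether it is truly overrepresented still has to be judged against a background model.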

Motifs are hidden in background

Need: Motif model and background model
Models: HMM
Algorithms: EM and Gibbs sampling (stochastic EM)

Microarray A. thaliana data set

Adaptive quality-based clustering of gene profiles

Normalised expression profiles

Motif finding

[Diagram labels: start; Clustering; EMBL; Blast; Gibbs sampler; example motif CACGTG; example IDs A1234, Z4321]

[Diagram labels: start; Clustering; EMBL; Blast; Gibbs sampler; SVM; reclustering; Gene Cards (P53, etc…); example motif CACGTG; example IDs A1234, Z4321; Keyword extraction from Medline (Jones et al.): tumor 0.9, suppressor 0.6, phospho- 0.4]

http://sphinx.rug.ac.be:8080/PlantCARE/index.htm

[Diagram labels: raw MA data; MA Db; Preprocessing Algorithms; Cluster Algorithms; Validation of clustering; Functional validation; External Gene Info Db; Pubmed; Text mining; Motif finding algorithms; External Motif Info Db; Validation of motifs; Public Dbs on the web; Bayesian Network Inference; aMAZE]

Business System

[Diagram labels: Software; Diagnostics companies; Microarray manufacturers; Application Toolkit + Consulting; Platform; LIMS; High-throughput genomics; Data; Record; Sample; Patient; Symptoms; Physician; Result; Web interface; Diagnosis, etc...; Administration]

Example: HIV

- Under pressure of antiretroviral drugs, HIV mutates → drug resistance and treatment failure
- Predict optimal drug cocktail given the sequence of 2000 kb of the viral species infecting the patient
- Query large database of 35000 previous patients with corresponding resistance profiles
- Correlation analysis

WWW Information retrieval

Database category       Data content               Examples
1. Literature database  Bibliographic citations    MEDLINE (1971)
                        On-line journals
2. Factual database     Nucleic acid sequences     GenBank (1982), EMBL (1982), DDBJ (1984)
                        Amino acid sequences       PIR (1968), PRF (1979), SWISS-PROT (1986)
                        3D molecular structures    PDB (1971), CSD (1965)
3. Knowledge base       Motif libraries            PROSITE (1988)
                        Molecular classifications  SCOP (1994)
                        Biochemical pathways       KEGG (1995)


Data world Textual world

Expression-based clustering

Gene-as-document clustering

?

Information retrieval

Search Indexing

Documents

Indices

Query

Result list

Experts

Knowledge engineers

Bayesian network modeling

ABN
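The indexing and search boxes of this diagram can be illustrated with a tiny inverted index: each term points to the set of documents that contain it, and a query returns the documents containing all query terms. This is a generic sketch, not the retrieval system of the course; the documents, identifiers and function names are hypothetical.

```python
from collections import defaultdict

def build_index(documents):
    """Indexing step: map every lower-cased term to the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Search step: return the sorted ids of documents containing all query terms."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)

# Hypothetical document collection.
documents = {
    "d1": "ovarian tumor papillation and locularity",
    "d2": "tumor suppressor gene expression",
    "d3": "microarray gene expression in ovarian tumor",
}
index = build_index(documents)
print(search(index, "ovarian tumor"))      # ['d1', 'd3']
print(search(index, "gene expression"))    # ['d2', 'd3']
```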

Information Retrieval from Local Drives

Document Meta Data

Clustering Algorithms

Matching Algorithms

Search Algorithms

Updating Mechanisms

Unprocessed Documents

Data Mining Techniques

Documents

Users

User Profile Data

Information Retrieval from External Sources

(57)

Type and object of the query The annotated Bayesian network

Query Result list

Query’= f(query, A(variable), A(group), A(model) )

= f(papillation,A(Locularity),A(Morphology),A(Ovarian cancer) )

Perspectives

-….
- Human genome diversity
  - Single nucleotide polymorphism (SNP)
- Drug development: study gene expression in disease, model systems (cf. Alzheimer mouse), pathogens, response to drug treatment
- Functional genomics and transgenesis (identification of regulatory mechanisms, modeling of genetic networks)
- Disease management (diagnosis, prognosis, …)
  - Oncology, AIDS, Alzheimer, ….
- Pharmacogenomics (drug tailoring by genotyping)

The data flood (Help, they are coming!)

- 10,000 to 100,000 data points per experiment
- Technology will spread, cost will drop (cf. Bio-CD-player, Universite de Namur)
- Data explosion: Mega Throughput Screening

[Timeline 2000-2005, 2010, 2020: Biotech & Pharma R&D; Medical R&D; Routine Diagnostics; GP or At Home Follow-up]

[Diagram labels: Application Development; Interface; Communication; Data Acquisition; Experiment Design; Data Mining; Application Platform; WWW; USER; Bioinf. tools; LIMS]

Abstracts of scientific publications (Pubmed): text mining

Microarray data (MIAME standard):
- Experimental data
- Public data sets (SMD)

Preprocessing algorithms:
- Lowess fit
- ANOVA
- ...

Cluster and classification algorithms:
- Percolation clustering
- AQBC
- Gene Shaving
- Bayesian clustering
- K-means, K-medoids
- Metaclustering...

Functional validation of the clustering

Results of the microarray analysis

Algorithms for finding DNA motifs:
- Gibbs sampling
- String search

Information on known motifs (TRANSFAC)

Validation of the motifs:
- Known motifs?
- Phylogenetic footprinting

Results of the motif analysis

Gene information from public databases (SGD, SWISSPROT, MIPS...)

Inference of genetic networks:
- Bayesian networks

Information on pathways in yeast (aMAZE)

Figure 1: Representation of the information flow between the integrated algorithms. Microarray data, both public data (Stanford Microarray Database) and data generated in the IDO-project, will be stored in the system according to MIAME standards. The data can be analyzed using preprocessing algorithms. The results are stored in the system or used in various cluster algorithms. The results of multiple cluster analyses can be compared (metaclustering) and stored in the system.

Challenges

- Datamining = integration of
  - Data acquisition: (soft)sensors
  - Computer capacity
  - Software:
    - Statistics
    - Algorithms
    - GUI

(63)

VIB concept


Basic research Society program

Technology transfer
