IE72: Case studies bio-i system integration
http://www.esat.kuleuven.ac.be/sista/GGS/ie72.html
Bart De Moor
ESAT-SCD, K.U.Leuven
Kasteelpark Arenberg 10, B-3001 Leuven, Belgium
T: +32-(0)16 321709
F: +32(0)16 321970
E: bart.demoor@esat.kuleuven.ac.be
T: +32-(0)16-321709 W: http://www.kuleuven.ac.be/sista
Contents of the course
1. 27/02: Learn to swim in a sea of data, Bart De Moor, ESAT-SCD, KUL
2. 20/02: Turn data into knowledge, Katja Scheiser, Phase1
3. 01/03: Title, Wim Van Criekinge, DevGen
4. 04/03: The construction of pairwise distance trees, Y. Van de Peer, UG
5. 11/03: From genome to vaccine, Joelle Thonnard, GlaxoSmithKline
6. Title, Mark Lambrecht, CMPG, KUL
7. 21/03: Tutorial on micro-array analysis, Kathleen Marchal, ESAT-SCD, KUL
8. 22/03: Pacs Systems, Erwin Bellon, ESAT-PSI, KUL
9. 26/03: Cropdesign and bio-informatics, Koen Bruynseels, Cropdesign
10. 22/03: The VIB micro-array facility, Paul Van Hummelen, VIB
What to study for this lesson?
-The slides
-The article entitled:
Available on the course website
http://www.esat.kuleuven.ac.be/sista/GGS/ie72.html
-Notes made by you !!
Context
Technological breakthroughs in
Data acquisition – (soft) sensors
Computer capacity: speed / memory / databases
Software: algorithms!
Data mining =
Learn to swim in a sea of data…
Motivating examples
I: Customer intelligence
II: Bio-informatics
Example area I: Customer intelligence
Motto:
“know your customer and adapt your company to this knowledge”
Purpose:
Profitability (retention, loyalty, up-sell, personalised offers)
Cost cutting (bad debt, fraud risk, optimized offers)
Techniques:
Connect to multiple inbound data sources
Capture, fuse, summarize, store all this data
Analyze individuals and groups
Decide on the optimal customer interaction
Multi-channel outbound steering
… diverse range…
Mobile phone loyalty & fraud detection
Credit card user profile and fraud detection
Money laundering ‘best practices’
Websites: user profile identification and site customization
Company collective loyalty programs
E-government applications
…
HCI Profiling Platform
[Figure: cycle of Analyze & Model, Score & Target, Contact and Measure response. Build Customer Profiles from a database (GSM calls, actual GSM use, customer info, subscription type, top-ups, prepay/credit card). Use Customer Profiles for risk management, segmentation/clustering, predictive modeling and classification (example case).]
Datamining: Platform, GUI, Reporting
[Figure: platform architecture. DATA: Oracle, Foxpro, Sybase, ..., reached through dynamic libraries, ODBC-DAO, SQL, WWW access and a TCP/IP link. MINING: advanced mathematics (filters, neural nets, clustering), lookup tables, triggering mechanism, ..., in the FDL environment. HCI Mining Control Center, HCI Scenario Builder and Mining Workbench: static & dynamic data, profiling (key indicators, synthesis, accumulation, ...), clustering/segmentation, predictive modeling, association, scenario decision support, outbound export, deploy.]
Courtesy of www.data4s.com
Closed loop solution suite
Example area II: Bio-informatics
- DNA: deoxyribonucleic acid
- The double helix = linear polymer
- Backbone = phosphodiester bonds
- 4 nucleotide bases: a code with a 4-letter alphabet (2 bits per base)
A: Adenine C: Cytosine G: Guanine T: Thymine
- Length: 3 billion bases (humans)
- Complementarity: H-bonds: A-T; G-C
- mRNA: single-stranded RNA with U(racil) instead of Thymine
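The pairing rules above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the sequences are assumed to contain only the letters A, C, G and T.

```python
# Watson-Crick pairing as listed on the slide: A-T and G-C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(dna: str) -> str:
    """Return the complementary strand, read in the opposite direction."""
    return "".join(COMPLEMENT[base] for base in reversed(dna.upper()))


def transcribe(coding_strand: str) -> str:
    """Return the mRNA for a coding strand: same letters, with U instead of T."""
    return coding_strand.upper().replace("T", "U")


def bits_needed(length: int) -> int:
    """A 4-letter alphabet carries 2 bits per base."""
    return 2 * length


print(reverse_complement("ATGC"))
print(transcribe("ATGGCT"))
print(bits_needed(3_000_000_000))
```

With 2 bits per base, a 3 billion base genome fits in 6 billion bits, which is 750 MB.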
Some history and impact
-1865: Mendel
-1953: Watson & Crick, Nobel prize 1962
-1965: Restriction enzymes: ‘DNA scissors’
-1973: Cohen and Boyer, gene transfer in bacteria
-1975: PCR (Polymerase Chain Reaction): DNA cloning
-1982: Insulin produced by transgenic bacteria
-1985: First plant GMO: Bt tobacco
-1991: First transgenic animal: Herman the bull
-1994: First GMO tomato on the market
-1996: Microarrays
-1997: Dolly is cloned
-1998: Transgenic rabbit (Pompe’s (orphan) disease)
-2000: Human Genome
Protein
- Linear polymer of 20 different kinds of amino acids
- Each amino acid is represented by a triplet of bases (codon)
- Linked by peptide bonds
- Complex 3-D structure (folding)
- ‘Workhorses’ for functionalities of the cell
- Computational folding and docking problems
20 amino acids
Ala  A  Alanine
Arg  R  Arginine
Asn  N  Asparagine
Asp  D  Aspartic acid
Cys  C  Cysteine
Gln  Q  Glutamine
Glu  E  Glutamic acid
Gly  G  Glycine
His  H  Histidine
Ile  I  Isoleucine
Leu  L  Leucine
Lys  K  Lysine
Met  M  Methionine
Phe  F  Phenylalanine
Pro  P  Proline
Ser  S  Serine
Thr  T  Threonine
Trp  W  Tryptophan
Tyr  Y  Tyrosine
Val  V  Valine
64 Codons
1st        2nd position                                               3rd
position   U            C            A            G                  position
U          UUU Phe      UCU Ser      UAU Tyr      UGU Cys            U
           UUC Phe      UCC Ser      UAC Tyr      UGC Cys            C
           UUA Leu      UCA Ser      UAA stop     UGA stop           A
           UUG Leu      UCG Ser      UAG stop     UGG Trp            G
C          CUU Leu      CCU Pro      CAU His      CGU Arg            U
           CUC Leu      CCC Pro      CAC His      CGC Arg            C
           CUA Leu      CCA Pro      CAA Gln      CGA Arg            A
           CUG Leu      CCG Pro      CAG Gln      CGG Arg            G
A          AUU Ile      ACU Thr      AAU Asn      AGU Ser            U
           AUC Ile      ACC Thr      AAC Asn      AGC Ser            C
           AUA Ile      ACA Thr      AAA Lys      AGA Arg            A
           AUG Met, start  ACG Thr   AAG Lys      AGG Arg            G
G          GUU Val      GCU Ala      GAU Asp      GGU Gly            U
           GUC Val      GCC Ala      GAC Asp      GGC Gly            C
           GUA Val      GCA Ala      GAA Glu      GGA Gly            A
           GUG Val      GCG Ala      GAG Glu      GGG Gly            G
64 Codons
-Triplets of nucleotide bases
-Position 1, 2 and 3 have specific functions
-Several amino acids form a protein
-Error-correcting features
-Lowering chance that mutation in (D)(R)NA changes amino acid
-Similar amino acids have similar codons
-Redundancy: One amino acid has several codons
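The redundancy and error-correcting properties listed above can be checked directly on the standard genetic code. The sketch below builds the 64-codon table from a compact string of one-letter amino acid codes ('*' marks a stop codon), translates an mRNA, and counts how often a single-base substitution at each codon position leaves the amino acid unchanged. The helper names are illustrative only.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# One-letter amino acid codes for all 64 codons, ordered by 1st, 2nd, 3rd base
# in U, C, A, G order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(mrna: str) -> str:
    """Translate an mRNA string codon by codon until a stop codon is met."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Redundancy: number of codons per amino acid (or stop signal).
degeneracy = Counter(CODON_TABLE.values())


def synonymous_fraction(position: int) -> float:
    """Fraction of single-base substitutions at a codon position (0, 1 or 2)
    that leave the encoded amino acid (or stop signal) unchanged."""
    same = total = 0
    for codon, amino_acid in CODON_TABLE.items():
        for base in BASES:
            if base == codon[position]:
                continue
            mutant = codon[:position] + base + codon[position + 1:]
            total += 1
            same += CODON_TABLE[mutant] == amino_acid
    return same / total


print(translate("AUGGCGUAA"))
print(degeneracy["L"], degeneracy["M"], degeneracy["W"])
print([round(synonymous_fraction(p), 3) for p in range(3)])
```

Most silent substitutions fall on the third codon position, which is one way to see the "lowering chance that a mutation changes the amino acid" bullet.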
Central dogma (Francis Crick, 1958)
-Genetic code is universal among species:
-DNA → mRNA → codons → amino acids → proteins
-…Except for a number of variations
-Retrovirus (reverse transcription RNA → DNA !)
-Example: AIDS virus (‘genomic RNA’ as life started with)
-DNA has exons/introns (junk DNA)
-Left overs from evolution
-In transcription to mRNA, introns are removed
‘Cracked’ genomes
Genomes…
Group        Species                     Genes   Genome (Mbase)
Phages       Bacteriophage MS2               4   0.003560
Viruses      HIV Type 2                      9   0.009671
Bacteria     Haemophilus influenzae       1760   1.83
Archaea      Methanococcus jannaschii     1735   1.74
Fungi        Saccharomyces cerevisiae     5800   12.1
Protoctista  Oxytricha similis           12000   600
Arthropoda   Drosophila melanogaster     12000   165
Nematoda     Caenorhabditis elegans      14000   100
Mollusca     Loligo pealii               35000   2700
Plantae      Arabidopsis thaliana        25000   70-145
Chordata     Homo sapiens                70000   3000
The genome…
>gi|7242486|emb|X90381.2|ATM4GENE Arabidopsis thaliana MYB102 gene
ACACATTAAATCTTATATGCTAAAACTAGGTCTCGTTTTAGGGATGTTTATAACCATCTTTGAGATTATTGATGCATGGTTATTG GTTAGAAAAAATATACGCTTGTTTTTCTTTCCTAGGTTGATTGACTCATACATGTGTTTCATTGAGGAAGGAACTTAACAAAAC TGCACTTTTTTCAACGTCACAGCTACTTTAAAAGTGATCAAAGTATATCAAGAAAGCTTAATATAAAGACATTTGTTTCAAGGT TTCGTAAGTGCACAATATCAAGAAGACAAAAATGACTAATTTTGTTTTCAGGAAGCATATATATTACACGAACACAAATCTATT TTTGTAATCAACACCGACCATGGTTCGATTACACACATTAAATCTTATATGCTAAAACTAGGTCTCGTTTTAGGGATGTTTATAA CCATCTTTGAGATTATTGATGCATGGTTATTGGTTAGAAAAAATATACGCTTGTTTTTCTTTCCTAGGTTGATTGACTCATACAT GTGTTTCATTGAGGAAGGAACTTAACAAAACTGCACTTTTTTCAACGTCACAGCTACTTTAAAAGTGATCAAAGTATATCAAG AAAGCTTAATATAAAGACATTTGTTTCAAGGTTTCGTAAGTGCACAATATCAAGAAGACAAAAATGACTAATTTTGTTTTCAG GAAGCATATATATTACACGAACACAAATCTATTTTTGTAATCAACACCGACCATGGTTCGATTACACACATTAAATCTTATATGC TAAAACTAGGTCTCGTTTTAGGGATGTTTATAACCATCTTTGAGATTATTGATGCATGGTTATTGGTTAGAAAAAATATACGCTT GTTTTTCTTTCCTAGGTTGATTGACTCATACATGTGTTTCATTGAGGAAGGAACTTAACAAAACTGCACTTTTTTCAACGTCAC AGCTACTTTAAAAGTGATCAAAGTATATCAAGAAAGCTTAATATAAAGACATTTGTTTCAAGGTTTCGTAAGTGCACAATATCA AGAAGACAAAAATGACTAATTTTGTTTTCAGGAAGCATATATATTACACGAACACAAATCTATTTTTGTAATCAACACCGACCA TGGTTCGATTAACACATTAAATCTTATATGCTAAAACTAGGTCTCGTTTTAGGGATGTTTATAACCATCTTTGAGATTATTGATG CATGGTTATTGGTTAGAAAAAATATACGCTTGTTTTTCTTTCCTAGGTTGATTGACTCATACATGTGTTTCATTGAGGAAGGAA CTTAACAAAACTGCACTTTTTTCAACGTCACAGCTACTTTAAAAGTGATCAAAGTATATCAAGAAAGCTTAATATAAAGACATT TGTTTCAAGGTTTCGTAAGTGCACAATATCAAGAAG
-Linear sequence of 4 nucleotide bases
-Genes, regulatory elements, promoter sequences, exons, introns, junk DNA
Exponential growth of available data
5.5 million sequences
4.3 billion bases (=letters)
[Figure: Database growth, number of sequences versus date (Jun-82 to Sep-98)]
[Figure: Database growth, number of nucleotides versus date (Jun-82 to Sep-98)]
Microarrays
Genome-wide robotized monitoring of gene activities by measurement of the levels of RNA transcripts (mRNA or total RNA)
Massively parallel
Fully automated
Standardizable
High/Mega Throughput Screening (HTS/MTS)
Micro-array
Measure genome-wide expression levels of RNA transcripts
Applications:
Diagnosis: RNA expression differs in healthy and pathological cells
Disease models: etiology and pathogenesis
Genotyping
Response to treatment
...
Algorithms / techniques
Principal and independent component analysis
Canonical correlation
Clustering
Optimization
Dynamic programming/EM/HMM…
System identification
Classification
Diagnosis
Risk assessment
Discriminant analysis
Logistic regression
Neural networks
SVM
Self-organizing maps
Bayesian networks
Incorporating prior knowledge
Pattern recognition
Global/local alignment
Database search for similarities
HMM
Motif extraction/detection
Rule based
[Figure: Cash Transaction Value versus Cash Transaction Frequency, with regions for suspicious transactions and regular transactions]
If the total value of transactions is > 10.000 Euro: suspicious transactions
If the number of cash transactions < 10.000 Euro is > 3: suspicious transactions
Otherwise: regular transaction
RULE BASED BUSINESS LOGIC:
Interpretability
Use of lists (black list, gray list)
BUT expert knowledge needed!
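A minimal sketch of such rule-based logic is given below. The two thresholds come from the slide; everything else is an assumption made for illustration: the function name, the idea that the amounts belong to one customer over one observation period, and the optional black list lookup.

```python
VALUE_LIMIT_EUR = 10_000      # value threshold quoted on the slide
MAX_SMALL_TRANSACTIONS = 3    # frequency threshold quoted on the slide


def classify_customer(cash_amounts, customer_id=None, black_list=frozenset()):
    """Label one customer's cash transactions as 'suspicious' or 'regular'.

    cash_amounts: amounts in euro for one observation period (assumption).
    black_list: hypothetical set of customer ids that are always flagged.
    """
    if customer_id in black_list:
        return "suspicious"
    total_value = sum(cash_amounts)
    small_count = sum(1 for amount in cash_amounts if amount < VALUE_LIMIT_EUR)
    if total_value > VALUE_LIMIT_EUR:
        return "suspicious"
    if small_count > MAX_SMALL_TRANSACTIONS:
        return "suspicious"
    return "regular"


print(classify_customer([2000, 3000]))
print(classify_customer([6000, 5000]))
print(classify_customer([100, 100, 100, 100]))
```

Every decision can be traced back to a named rule, which is the interpretability advantage. The thresholds themselves have to come from an expert.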
Neural nets
[Figure: Cash Transaction Value versus Cash Transaction Frequency]
CTV + CTF + CCP + … → Decision
Classification from examples
Data driven
Clustering – density modeling
[Figure: Call frequency versus Average call duration]
User or activity grouping
Fast and robust learning
Statistics: outlier detection,…
Abnormal = fraud
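As a very reduced illustration of "abnormal = fraud", the sketch below flags users whose call frequency lies far from the population mean, using a z-score. This is a stand-in for a real density model, not the method behind the slide; the threshold of 2 standard deviations and the example numbers are assumptions.

```python
from statistics import mean, pstdev


def flag_outliers(values, z_threshold=2.0):
    """Return the indices of values lying more than z_threshold population
    standard deviations away from the mean."""
    centre = mean(values)
    spread = pstdev(values)
    if spread == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - centre) / spread > z_threshold]


# Hypothetical calls per day for ten users; the last one behaves abnormally.
calls_per_day = [10, 11, 9, 10, 12, 10, 11, 9, 10, 60]
print(flag_outliers(calls_per_day))
```

With several features (call frequency, average call duration, ...) the same idea carries over by measuring the distance to the nearest cluster mean instead of to a single mean.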
Case 1: Oncology
Cancer = genetic disorder
Mutations cause transformation of a normal cell into a tumor cell
Mutations also induce changes in expression levels of non-mutated genes !
Disturbed expression levels determine the phenotype of the tumor
Measurement of all expression levels will
therefore be of great benefit to determine and to
understand the real behavior of tumor cells
Microarrays in oncology
Ingredients: Tissue bank, Medical files, Medical expertise, Biotech know-how, Microarray facility, Statistical expertise,
Algorithms, Software development
By screening banks of tumors with the microarray, we obtain data to build statistical models for
Diagnosis
Staging
Prognosis
Choice of therapy
Follow-up of therapy
Microarrays can also be used to identify genes implicated in cancer and to understand the mechanisms of oncogenesis
MIT Leukemia data set
- 2 classes: ALL (class 1) and AML (class 2)
- training set of 38 samples (27 ALL and 11 AML)
- independent test set of 34 samples
- ~7000 gene expression levels for each sample = 7000x38 matrix for training set
- Goal:
- Which of the 7000 genes are relevant ?
- Classification into class 1 and class 2
- Data preprocessing (rescaling, thresholding, log-transformation, normalization)
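The preprocessing steps named above can be sketched as follows for a genes x samples matrix stored as a list of rows. The floor and ceiling values and the choice to normalize each gene to zero mean and unit variance are assumptions for this example; the slide does not specify them.

```python
import math
from statistics import mean, pstdev


def preprocess(expression, floor=100.0, ceiling=16000.0):
    """Threshold, log-transform and normalize a genes x samples matrix.

    expression: list of rows, one row of raw intensities per gene.
    floor, ceiling: hypothetical thresholding limits (not from the slide).
    """
    processed = []
    for gene in expression:
        # Thresholding: clip very low and very high intensities.
        clipped = [min(max(value, floor), ceiling) for value in gene]
        # Log-transformation: compress the dynamic range.
        logged = [math.log10(value) for value in clipped]
        # Normalization: zero mean and unit variance per gene.
        centre, spread = mean(logged), pstdev(logged)
        if spread == 0:
            processed.append([0.0] * len(logged))
        else:
            processed.append([(value - centre) / spread for value in logged])
    return processed


raw = [[50.0, 1000.0, 20000.0], [200.0, 200.0, 200.0]]
for row in preprocess(raw):
    print([round(value, 3) for value in row])
```

A gene that is constant across all samples, like the second row, carries no information for classification and ends up as all zeros.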
Principal component analysis
Principal components = eigenvectors of the covariance matrix = SVD of the mean-corrected data matrix
Covariance matrix (up to normalization): $\sum_n (x_n - \mu)(x_n - \mu)^T$
Dimensional reduction: project on the dominant eigenspace
Results in 2 dimensions:
[Figure: samples projected on the first two principal components; * = training, O = independent; Class 1/ALL, Class 2/AML]
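A small pure-Python sketch of this projection is given below. It finds the dominant eigenvectors of the covariance matrix by power iteration with deflation, a simple substitute chosen here to avoid external libraries; in practice an SVD routine would be used. The function names and the toy data are assumptions.

```python
import math


def mean_center(X):
    """Subtract the mean vector from every sample (row) of X."""
    n, d = len(X), len(X[0])
    mu = [sum(row[j] for row in X) / n for j in range(d)]
    return [[row[j] - mu[j] for j in range(d)] for row in X]


def covariance(Xc):
    """Covariance matrix of mean-centred samples."""
    n, d = len(Xc), len(Xc[0])
    return [[sum(r[i] * r[j] for r in Xc) / n for j in range(d)] for i in range(d)]


def dominant_eigenvector(C, iterations=500):
    """Power iteration; the start vector is assumed not to be orthogonal
    to the dominant eigenvector."""
    d = len(C)
    v = [float(i + 1) for i in range(d)]
    for _ in range(iterations):
        w = [sum(C[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v


def principal_components(X, k):
    """Return the first k principal directions and the projected samples."""
    Xc = mean_center(X)
    C = covariance(Xc)
    d = len(C)
    components = []
    for _ in range(k):
        v = dominant_eigenvector(C)
        eigenvalue = sum(v[i] * sum(C[i][j] * v[j] for j in range(d)) for i in range(d))
        components.append(v)
        # Deflation: remove the direction just found.
        C = [[C[i][j] - eigenvalue * v[i] * v[j] for j in range(d)] for i in range(d)]
    scores = [[sum(r[j] * v[j] for j in range(d)) for v in components] for r in Xc]
    return components, scores


X = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9]]
components, scores = principal_components(X, 2)
print(components[0])
```

For the leukemia data the rows would be the 38 training samples and the columns the roughly 7000 gene expression levels, so that each sample is reduced to a handful of scores.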
Fisher linear discriminant analysis
Further reduction to one dimension after PCA: $y = w^T x$
• The vector w is chosen so that:
– the within-class spread is small
– the class separation is large
• and is given by: $w \propto S_w^{-1} (\mu_2 - \mu_1)$, where $S_w = \sum_{n \in C_1} (x_n - \mu_1)(x_n - \mu_1)^T + \sum_{n \in C_2} (x_n - \mu_2)(x_n - \mu_2)^T$
• Threshold $w_o$: $y > w_o$: Class 1 / $y < w_o$: Class 2, where $w_o = w^T \mu$
• Linear Discriminant analysis after PCA in 5 dimensions gives only 2
[Figure: samples after linear discriminant analysis; Class 1/ALL, Class 2/AML]
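The sketch below works the Fisher discriminant out for two-dimensional data, so that the within-class scatter matrix can be inverted by hand. Two choices are assumptions of this example: the direction, which is only defined up to sign and scale, is oriented as $S_w^{-1}(\mu_1 - \mu_2)$ so that larger projections mean class 1, and the threshold is taken as the projection of the midpoint between the two class means.

```python
def mean2(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def scatter2(points, mu):
    """2x2 scatter matrix: sum of (x - mu)(x - mu)^T over the points."""
    sxx = sum((p[0] - mu[0]) ** 2 for p in points)
    syy = sum((p[1] - mu[1]) ** 2 for p in points)
    sxy = sum((p[0] - mu[0]) * (p[1] - mu[1]) for p in points)
    return [[sxx, sxy], [sxy, syy]]


def fisher_discriminant(class1, class2):
    """Return (w, w0): projection direction and decision threshold."""
    mu1, mu2 = mean2(class1), mean2(class2)
    s1, s2 = scatter2(class1, mu1), scatter2(class2, mu2)
    a, b = s1[0][0] + s2[0][0], s1[0][1] + s2[0][1]
    c, d = s1[1][0] + s2[1][0], s1[1][1] + s2[1][1]
    det = a * d - b * c  # assumed non-zero (within-class scatter invertible)
    dx, dy = mu1[0] - mu2[0], mu1[1] - mu2[1]
    # w = S_w^{-1} (mu1 - mu2), using the explicit inverse of a 2x2 matrix.
    w = ((d * dx - b * dy) / det, (-c * dx + a * dy) / det)
    # Threshold: projection of the midpoint between the class means (assumption).
    w0 = w[0] * (mu1[0] + mu2[0]) / 2 + w[1] * (mu1[1] + mu2[1]) / 2
    return w, w0


def classify(x, w, w0):
    """Class 1 if the projection y = w^T x exceeds the threshold, else class 2."""
    y = w[0] * x[0] + w[1] * x[1]
    return 1 if y > w0 else 2


class1 = [(4.0, 5.0), (5.0, 5.0), (5.0, 6.0), (6.0, 6.0)]
class2 = [(0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0)]
w, w0 = fisher_discriminant(class1, class2)
print(w, w0)
```

In more than two dimensions the explicit inverse would be replaced by a linear solve, which is why the dimension is first reduced with PCA.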
K-means clustering
Assign each sample, arbitrarily, to one of N clusters
Iterate
Calculate the mean vector $\mu_i$ for each cluster, i=1…N
Re-assign each sample $x_j$, j=1…#Samples, to the k-th cluster with the nearest mean vector $\mu_k$: $\|x_j - \mu_k\| \le \|x_j - \mu_i\|, \ \forall i$
K-means in 5 dimensions with N=2 clusters
[Figure: clustering result; * = Cluster 1, O = Cluster 2; Class 1/ALL, Class 2/AML; cluster means]
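The iteration above translates almost line by line into code. In this sketch the arbitrary initial assignment is simply "sample j goes to cluster j mod N", and an empty cluster is re-seeded with one of the samples; both are choices made for this example only.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def kmeans(samples, n_clusters, max_iter=100):
    """Return (labels, means) after iterating until the labels stop changing."""
    # Arbitrary initial assignment of each sample to one of the clusters.
    labels = [j % n_clusters for j in range(len(samples))]
    means = []
    for _ in range(max_iter):
        # Calculate the mean vector of each cluster.
        means = []
        for i in range(n_clusters):
            members = [x for x, label in zip(samples, labels) if label == i]
            if not members:
                members = [samples[i]]  # re-seed an empty cluster
            means.append([sum(column) / len(members) for column in zip(*members)])
        # Re-assign each sample to the cluster with the nearest mean vector.
        new_labels = [
            min(range(n_clusters), key=lambda i: squared_distance(x, means[i]))
            for x in samples
        ]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, means


samples = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, means = kmeans(samples, 2)
print(labels)
print(means)
```

The result depends on the initial assignment, so in practice the algorithm is usually restarted several times.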
Neural nets: Nonlinear classifiers
Clustering
ROC curve
Plots the percentage of correct predictions for ALL samples versus the percentage of incorrect predictions for AML samples, for varying values of the threshold
number of genes   % area ROC training   % area ROC prospective
20                1                     1
15                1                     1
10                1                     99.29
5                 1                     98.57
4                 1                     98.57
3                 1                     97.50
2                 98.32                 98.21
1                 93.60                 71.07
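The area under the ROC curve reported in the table can be computed as sketched below: sweep the decision threshold over the classifier outputs, record the fraction of ALL samples predicted correctly against the fraction of AML samples predicted incorrectly, and integrate with the trapezoidal rule. The scores in the example are invented.

```python
def roc_points(scores, is_positive):
    """ROC points (false positive rate, true positive rate) for all thresholds.

    scores: classifier outputs, higher means 'more likely ALL'.
    is_positive: True for ALL samples, False for AML samples.
    """
    n_pos = sum(1 for flag in is_positive if flag)
    n_neg = len(is_positive) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, flag in zip(scores, is_positive) if s >= threshold and flag)
        fp = sum(1 for s, flag in zip(scores, is_positive) if s >= threshold and not flag)
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal integration of the ROC curve."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Invented outputs for two ALL and two AML samples.
perfect = area_under_curve(roc_points([0.9, 0.8, 0.3, 0.1], [True, True, False, False]))
mixed = area_under_curve(roc_points([0.9, 0.4, 0.6, 0.1], [True, True, False, False]))
print(perfect, mixed)
```

An area of 1 means that some threshold separates the two classes perfectly; an area of 0.5 corresponds to random guessing.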
Least squares support vector machines
-Nonlinearly transform the data to a higher dimensional space
-In this space, classes are linearly separable
-Vapnik’s original SVM:
-Convex quadratic optimization problem,
-dimension = # data points (huge !)
-LS-SVM
- Only need to solve LARGE set of linear equations
- Iterative (Jacobi, Gauss-Seidel, SOR, block SOR,…)
- Current research: Advice required from numerical analysis!
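To make "only need to solve a set of linear equations" concrete, the sketch below trains an LS-SVM classifier in the usual formulation: with labels $y_i = \pm 1$, kernel $K$ and regularization constant $\gamma$, solve the system with first row $[0, y^T]$ and remaining rows $[y, \Omega + I/\gamma]$ for $(b, \alpha)$ with right-hand side $(0, 1, \dots, 1)$, where $\Omega_{ij} = y_i y_j K(x_i, x_j)$. The RBF kernel, the parameter values and the direct Gaussian elimination (instead of the iterative solvers needed for large problems) are assumptions of this toy example.

```python
import math


def rbf(a, b, sigma):
    """Gaussian (RBF) kernel."""
    return math.exp(-sum((p - q) ** 2 for p, q in zip(a, b)) / (2 * sigma ** 2))


def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def train(X, y, gamma=100.0, sigma=0.5):
    """Return (b, alpha) of the LS-SVM classifier."""
    n = len(X)
    A = [[0.0] + [float(label) for label in y]]
    for i in range(n):
        row = [float(y[i])]
        for j in range(n):
            omega = y[i] * y[j] * rbf(X[i], X[j], sigma)
            row.append(omega + (1.0 / gamma if i == j else 0.0))
        A.append(row)
    solution = solve(A, [0.0] + [1.0] * n)
    return solution[0], solution[1:]


def predict(x, X, y, b, alpha, sigma=0.5):
    """Sign of sum_i alpha_i y_i K(x, x_i) + b."""
    value = sum(a * label * rbf(x, xi, sigma) for a, label, xi in zip(alpha, y, X)) + b
    return 1 if value >= 0 else -1


# XOR-like data: not linearly separable in the input space.
X = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
y = [1, 1, -1, -1]
b, alpha = train(X, y)
print(predict((0.1, 0.1), X, y, b, alpha), predict((0.9, 0.1), X, y, b, alpha))
```

The linear system has one unknown per training sample plus the bias, so for large data sets the direct solve above becomes the bottleneck, which is where the iterative methods named on the slide come in.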
Bayesian networks
P(A|B) = P(B|A) . P(A) / P(B)
e.g.: A = model, B = data
[Figure: a Bayesian network is built by knowledge engineering (1) from literature and experts and by machine learning (2) from observations; inference (3) on the evidences gives classification ($c_i$?) and probability prediction ($P(C_i)$ = ?)]
Bayesian networks & expert knowledge
- What is a Bayesian network?
•Set of variables (with finite number of states)
•Set of directed edges
•Parameters = conditional probability tables
[Figure: example network with nodes Age, Parity, Pathology, Locularity, Multi, Loc-solid, Color score, CA 125]
•P(Path=benign|Age<40,Parity=0) = 0.738
•P(Path=benign|Age<40,Parity=1) = 0.8252
•P(Path=benign|Age<40,Parity=2) = 0.8603
• …
A BN structure, together with fully specified conditional probability tables, defines an overall distribution over the variables in the network.
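The sketch below shows how conditional probability tables combine into probabilities of interest. The three conditional probabilities are taken from the slide. The prior over Parity is invented, and treating Parity as independent of Age is an assumption of this example, not a statement about the real network.

```python
# From the slide: P(Pathology = benign | Age < 40, Parity = k).
P_BENIGN_GIVEN_PARITY = {0: 0.738, 1: 0.8252, 2: 0.8603}

# Hypothetical prior P(Parity = k), assumed independent of Age for this sketch.
P_PARITY = {0: 0.5, 1: 0.3, 2: 0.2}


def prob_benign():
    """P(benign | Age < 40), obtained by summing out Parity."""
    return sum(P_PARITY[k] * P_BENIGN_GIVEN_PARITY[k] for k in P_PARITY)


def parity_posterior():
    """P(Parity = k | benign, Age < 40) by Bayes' rule."""
    evidence = prob_benign()
    return {k: P_PARITY[k] * P_BENIGN_GIVEN_PARITY[k] / evidence for k in P_PARITY}


print(round(prob_benign(), 5))
print({k: round(p, 4) for k, p in parity_posterior().items()})
```

The second function is Bayes' rule P(A|B) = P(B|A) . P(A) / P(B), with A the parity and B the observation that the tumor is benign.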
Real world application: Ovarian Cancer
The goal in this problem is to classify an ovarian tumor pre-operatively. A lot of prior knowledge (doctors, literature,…) is available, together with 300 cases.
[Figure: Bayesian networks over the variables Age, Pregnancy, Genetic, Pathology, Locularity, Solid, Papillat, M-Solid, Color, Meno, Bilateral, RIndex, CA 125, Ascites and microarray]
(6) Annotated Bayesian networks II.
[Figure: misclassification rate (axis from 0.1 to 0.4) for four models: multilayer perceptron with a non-informative prior, Bayesian network with a non-informative prior, Bayesian network with an informative prior, multilayer perceptron with an informative prior]
Case 2: Finding regulatory sequences
Cluster genes from microarray expression data to build clusters of coexpressed genes
Coexpressed genes may share regulatory mechanisms
Most regulatory sequences are found in the upstream region of the genes (up to 2kb in A. thaliana )
Motifs that are statistically overrepresented in the
upstream regions are candidate regulatory
sequences
Motifs are hidden in background
Need: Motif model and background model
Models: HMM
Algorithms: EM and Gibbs sampling (stochastic EM)
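The interplay of a motif model and a background model can be illustrated with a much simpler setting than an HMM trained by EM or Gibbs sampling: a fixed position-specific probability matrix for the motif, a uniform zero-order background, and a log-odds score for every window of an upstream sequence. All numbers and names below are assumptions for illustration; the consensus CACGTG is used only as an example motif.

```python
import math

# Assumed background model: all four bases equally likely.
BACKGROUND = {base: 0.25 for base in "ACGT"}


def consensus_motif(consensus, p_match=0.85):
    """Motif model: per position, the consensus base has probability p_match
    and the three other bases share the remainder (hypothetical values)."""
    p_other = (1.0 - p_match) / 3.0
    return [
        {base: (p_match if base == c else p_other) for base in "ACGT"}
        for c in consensus
    ]


def log_odds(window, motif, background=BACKGROUND):
    """log2 of P(window | motif model) / P(window | background model)."""
    return sum(
        math.log2(column[base] / background[base])
        for column, base in zip(motif, window)
    )


def best_site(sequence, motif, background=BACKGROUND):
    """Return (score, position) of the best-scoring window in the sequence."""
    width = len(motif)
    return max(
        (log_odds(sequence[i:i + width], motif, background), i)
        for i in range(len(sequence) - width + 1)
    )


score, position = best_site("TTTACACGTGATT", consensus_motif("CACGTG"))
print(position, round(score, 2))
```

A Gibbs sampler alternates between steps of this kind (scoring candidate sites with the current motif model) and re-estimating the motif model from the sampled sites.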
Microarray A. thaliana data set
Adaptive quality-based clustering of gene profiles
[Figure: normalised expression profiles]
Motif finding
[Figure: workflow with the components clustering (gene IDs such as A1234, Z4321), EMBL, Blast, upstream sequences from the start, Gibbs sampler (motif such as CACGTG), keyword extraction from Medline (e.g. tumor 0.9, suppressor 0.6, phospho- 0.4; Jones et al.), Gene Cards (P53, etc…) and SVM reclustering]
http://sphinx.rug.ac.be:8080/PlantCARE/index.htm
[Figure: analysis pipeline with the components Pubmed text mining, raw MA data, MA Db, preprocessing algorithms, cluster algorithms, functional validation and validation of clustering (external gene info Db), public Dbs on the web, motif finding algorithms, validation of motifs (external motif info Db), Bayesian network inference, aMAZE]
Business System
[Figure: platform (LIMS, high-throughput genomics, application toolkit + consulting) serving software and diagnostics companies and microarray manufacturers; a patient with symptoms sees a physician, sample and record go through administration, and data and results (diagnosis, etc...) are returned through a web interface]
Example: HIV
-Under pressure of antiretroviral drugs, HIV mutates → drug resistance and treatment failure
- Predict the optimal drug cocktail given the sequence of 2000 bases of the viral species infecting the patient
- Query a large database of 35000 previous patients with corresponding resistance profiles
- Correlation analysis
WWW Information retrieval
Database category        Data content                 Examples
1. Literature database   Bibliographic citations      MEDLINE (1971)
                         On-line journals
2. Factual database      Nucleic acid sequences       GenBank (1982), EMBL (1982), DDBJ (1984)
                         Amino acid sequences         PIR (1968), PRF (1979), SWISS-PROT (1986)
                         3D molecular structures      PDB (1971), CSD (1965)
3. Knowledge base        Motif libraries              PROSITE (1988)
                         Molecular classifications    SCOP (1994)
                         Biochemical pathways         KEGG (1995)
Information retrieval
[Figure: data world (expression-based clustering) versus textual world (gene-as-document clustering); information retrieval: documents are indexed into indices, a query is searched to give a result list; experts and knowledge engineers do the Bayesian network modeling (ABN)]
[Figure: information retrieval from local drives and from external sources; document meta data, user profile data, unprocessed documents, documents and users are linked by clustering, matching and search algorithms, updating mechanisms and data mining techniques]
Type and object of the query; the annotated Bayesian network
Query → Result list
Query’= f(query, A(variable), A(group), A(model) )
= f(papillation,A(Locularity),A(Morphology),A(Ovarian cancer) )
Perspectives
-….
Human genome diversity
Single nucleotide polymorphism (SNP)
Drug development: study gene expression in disease, model systems (cf. Alzheimer mouse), pathogens, response to drug treatment
Functional genomics and transgenesis (identification of regulatory mechanisms, modeling of genetic networks)
Disease management (diagnosis, prognosis, …)
Oncology, AIDS, Alzheimer, ….
Pharmacogenomics (drug tailoring by genotyping)
The data flood (Help, they are coming !)
10,000 to 100,000 data points per experiment
Technology will spread, cost will drop (cf. Bio-CD-player, Université de Namur)
Data explosion: Mega Throughput Screening
[Figure: timeline over 2000-2005, 2010, 2020, with the stages Biotech & Pharma R&D, Medical R&D, Routine Diagnostics, GP or At Home Follow-up]
[Figure: Application Development on top of an Application Platform (interface, communication, data acquisition, experiment design, data mining), connected to the WWW, the user, bioinformatics tools and LIMS]
Abstracts of scientific publications (Pubmed): text mining
Microarray data (MIAME standard):
- Experimental data
- Public data sets (SMD)
Preprocessing algorithms:
- Lowess fit
- ANOVA
- ...
Cluster and classification algorithms:
- Percolation clustering
- AQBC
- Gene shaving
- Bayesian clustering
- K-means, K-medoids
- Metaclustering...
Results of the microarray analysis
Functional validation of the clustering
Gene information from public databases (SGD, SWISSPROT, MIPS...)
Algorithms for finding DNA motifs:
- Gibbs sampling
- String search
Information on known motifs (TRANSFAC)
Validation of the motifs:
- Known motifs ?
- Phylogenetic footprinting
Results of the motif analysis
Inference of genetic networks:
- Bayesian networks
Information on pathways in yeast (aMAZE)
Figure 1 : Representation of the information flow between the integrated algorithms. Microarray data, both public data (Stanford Microarray Database) and data generated in the IDO-project will be stored in the system according to MIAME standards. The data can be analyzed using preprocessing algorithms. The results are stored in the system or used in various cluster algorithms. The results of multiple cluster analyses can be compared (metaclustering) and stored in the system.