Exercises: Motif Detection
Adapted 09/11/2006
This exercise aims at 1) identifying regulatory motifs in coregulated genes using orthologous information (phylogenetic footprinting); 2) detecting novel targets by genome-wide screening.
Model organism: S. typhimurium
[Figure: exercise workflow. Steps: Sequence; Identify homologs; Select intergenics; Phylogenetic footprinting; Genome-wide screening. Tools: Blast; INCLUsive; Motif Sampler / RSA; Motif Scanner / RSA.]
MANUAL selection of the intergenic sequences
Start from the given protein sequence:
>gi|3850275|gb|AAC72071.1| cytcbb3 cytochrome c oxidase CytN subunit [Azospirillum brasilense]
MTSATLTPGAALGSQRVSENVRYYEDAVRLFVIAAVFWGVVGFLAGVFIALQLAFPALNLGLEWTSFGRLRPVHTSAVIF AFGGNVLFATSLYSVQRTSRQFLFGGEGLAKFVFWNYNIFIVLAALSYVLGYTQGKEYAEPEWILDLYLTVIWVLYAIQF VGTVMTRKESHIYVANWFFMAFILTVAILHIGNNVNVPVSLTGMKSYPFVSGVQSAMVQWWYGHNAVGFFLTAGFLGIMY YFVPKRAERPVYSYRLSIVHFWTLIFLYIWAGPHHLHYTALPDWAQTLGMTFSVMLWMPSWGGMINGIMTLSGAWDKLRT DPVLRFLVTSVAFYGMSTFEGPLMSVKPVNALSHYTDWTIGHVHSGALGWVAFISFGAIYYLVPVLWKRSQLYSLRLVSY HFWTATIGIVLYITAMWVSGIMQGLMWRAYDNLGFLQYSFVETVAAMHPFYVIRALGGVLFLAGALIMVYNLWRTAKGDV RIEKPYASAPHKAAVGAA
This protein is also a terminal oxidase belonging to the same protein family as in the previous
exercise. We will now identify close homologs of this protein. The chance that the mechanism
of transcriptional regulation is conserved is higher in close homologs than in distantly related
homologs. Terminal oxidases of this subfamily occur in bacteria only. They are involved in
reduction of O2 at extremely low O2 levels (nearly anaerobic conditions). The enzymes differ
from the classical terminal oxidases (that also occur in eukaryotes) by their extremely high
affinity for O2. It can therefore be expected that these proteins are needed as the O2 level in
the cell drops. Their genes will be switched on upon decreasing O2 levels. Let's find out if we
can find a motif that is involved in this transcriptional regulation.
Search for regulatory motifs in the promoter sequence of this protein by phylogenetic footprinting.
1) Select orthologs of this sequence by a Blast search (as in the previous exercise)
Because we have tested the blast searches in the previous exercises we will not repeat them here. Instead suppose you have selected by Blast search the following genes: see text file (motifdetection/motifs.xls sheet2).
2) Select for each ortholog the corresponding intergenic sequence
Go to the protein file for a given Blast hit (e.g. NP_43566.1)
Select from the protein file the gene name of the protein and the GenBank accession number where the DNA sequence of this gene can be found.
Go to the GenBank file:
In this case: NC_003037
Download the entry for this accession number and find the gene in the entry
Note the start and stop positions of the sequence encoding the protein of interest
Check whether another gene is located upstream of the gene of interest
Delineate the intergenic sequence based on the sequence positions
Put the intergenic sequence in FASTA format
Note that this manual download of sequences is a tedious task. Therefore bioinformaticians have
made tools to automate the process.
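The delineation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: coordinates are 1-based and inclusive as in a GenBank feature table, the genome is available as a plain string, and the function names (intergenic_region, to_fasta) as well as the toy genome are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def intergenic_region(genome, gene, upstream_gene, strand="+"):
    """Cut out the region between a gene and its upstream neighbour.

    gene and upstream_gene are (low, high) coordinate pairs,
    1-based and inclusive. For a gene on the minus strand the
    upstream neighbour lies at higher coordinates and the region
    is reverse complemented so that it reads towards the gene.
    """
    gene_low, gene_high = gene
    up_low, up_high = upstream_gene
    if strand == "+":
        # positions up_high+1 .. gene_low-1
        return genome[up_high:gene_low - 1]
    # positions gene_high+1 .. up_low-1, read on the other strand
    return reverse_complement(genome[gene_high:up_low - 1])


def to_fasta(name, seq, width=60):
    """Format a sequence as a FASTA record with fixed line width."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# Toy genome: gene 1..9, intergenic 10..14, gene 15..23
toy_genome = "ATGAAATAGCCGTTATGCCCTAA"
region = intergenic_region(toy_genome, gene=(15, 23), upstream_gene=(1, 9))
print(to_fasta("toy_intergenic", region))
```

In a real entry the two coordinate pairs would be the start and stop positions noted from the GenBank file.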
Automatic download of the intergenic sequences
Accession number    Gene
NZ_AAAF01000001     Rpal_p_1301
NZ_AAAV01000169     Saro_p_3044
AB024290            cytN
AL672112            fixN
NC_004041           fixN
U90521              fixN
NC_003317           BMEI1564
NC_003062           AGR_C_2835
NZ_AAAE01000158     Rsph_p_4089
NC_002696           CC1401
NZ_AAAN01000093     Mmc10458
NZ_AABA01000127     Pflu2484
NZ_AAAD01000083     Avin_p_2153
NC_003037           fixN3
NC_003112           NMB1725
NC_004347           ccoN
NC_004459           VV12620
NZ_AAAT01000001     Mdeg_p_0148
NZ_AABE01000011     Chut0193
AB025342            ORF19
1) Use NCBI
Search for the entry in the nucleotide database (using the gene accession number).
Find the gene in the file and look up its start and stop positions. Do the same for the gene upstream of
the gene of interest. Download the intergenic sequence using the subselect option. Save the FASTA
file.
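For readers who prefer a scripted route, the same subselection can be requested through the NCBI E-utilities EFetch service. The sketch below only builds the request URL and does not contact the server; the helper name efetch_fasta_url and the coordinates in the usage line are hypothetical placeholders, not values taken from the entry.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession, start, stop, minus_strand=False):
    """Build an EFetch URL for a subsequence in FASTA format.

    start and stop are 1-based positions in the nucleotide entry;
    strand=2 asks for the reverse complement.
    """
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
        "seq_start": start,
        "seq_stop": stop,
        "strand": 2 if minus_strand else 1,
    }
    return EFETCH + "?" + urlencode(params)


# Placeholder coordinates: replace them with the positions you noted
print(efetch_fasta_url("NC_003037", 1000, 1400))
```

Opening the printed URL in a browser should return the selected region as a FASTA record, which can then be appended to the dataset.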
Use the obtained dataset with intergenic sequences in FASTA format.
http://www.esat.kuleuven.ac.be/~thijs/Work/MotifSampler.html
1. Use the Motif Sampler to find motifs in this dataset
The Motif Sampler works with an email address. Results pages will be sent to your email address.
This exercise can only be performed if you have access to an email account. If you do not have a yahoo, student or hotmail address that you can access from this computer room, try this exercise at home.
Because many students will perform the exercise simultaneously the server might become overloaded. Do not push the calculate button in vain.
1. Try different parameter settings and describe how they influence the outcome of the algorithm:
o Choose an appropriate background model or a background model from a completely different species. What do you observe?
o Alter the motif length
o Describe and explain the different scores
o Is the retrieved motif informative? How can you derive this from the result?
2. Run the algorithm twice with the same parameter settings (using a motif length of 6) and compare the results. Explain your observations.
Input form of Motif Sampler
Results Page
Interpret the matrix file
#INCLUSive Motif Model v1.0
#
#ID = box_1_1_TTGAynTnGATCAA
#Score = 80.1843
#W = 14
#Consensus = TTGAynTnGATCAA
0.00547926 0.00395095 0.0039585 0.986611
0.00547926 0.00395095 0.0039585 0.986611
0.0809511 0.00395095 0.909621 0.00547737
0.986613 0.00395095 0.0039585 0.00547737
0.00547926 0.38131 0.0039585 0.609252
0.307367 0.305838 0.230374 0.156421
0.156423 0.00395095 0.0039585 0.835668
0.609254 0.0794228 0.305846 0.00547737
0.0809511 0.154895 0.683205 0.0809492
0.684726 0.154895 0.154902 0.00547737
0.00547926 0.00395095 0.0039585 0.986611
0.00547926 0.985085 0.0039585 0.00547737
0.986613 0.00395095 0.0039585 0.00547737
0.986613 0.00395095 0.0039585 0.00547737
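One possible way to interpret the matrix is sketched below. It assumes that every group of four numbers gives the probabilities of A, C, G and T at one motif position; this column order is inferred from the #Consensus line and is not stated in the file. The function names are illustrative. The information content per position (2 bits minus the entropy) indicates how conserved each position is.

```python
import math

MATRIX_TEXT = """
0.00547926 0.00395095 0.0039585 0.986611
0.00547926 0.00395095 0.0039585 0.986611
0.0809511 0.00395095 0.909621 0.00547737
0.986613 0.00395095 0.0039585 0.00547737
0.00547926 0.38131 0.0039585 0.609252
0.307367 0.305838 0.230374 0.156421
0.156423 0.00395095 0.0039585 0.835668
0.609254 0.0794228 0.305846 0.00547737
0.0809511 0.154895 0.683205 0.0809492
0.684726 0.154895 0.154902 0.00547737
0.00547926 0.00395095 0.0039585 0.986611
0.00547926 0.985085 0.0039585 0.00547737
0.986613 0.00395095 0.0039585 0.00547737
0.986613 0.00395095 0.0039585 0.00547737
"""


def parse_matrix(text, width=4):
    """Split a whitespace-separated list of numbers into rows of `width`."""
    values = [float(x) for x in text.split()]
    return [values[i:i + width] for i in range(0, len(values), width)]


def information_content(row):
    """Information content in bits: 2 minus the entropy of the position."""
    return 2.0 + sum(p * math.log2(p) for p in row if p > 0)


def best_base_consensus(matrix, alphabet="ACGT"):
    """Most probable base at every position (assumed column order ACGT)."""
    return "".join(alphabet[row.index(max(row))] for row in matrix)


matrix = parse_matrix(MATRIX_TEXT)
print(best_base_consensus(matrix))
for position, row in enumerate(matrix, start=1):
    print(position, round(information_content(row), 2))
```

Positions with an information content close to 2 bits are strongly conserved, while positions close to 0 bits correspond to the degenerate "n" symbols in the consensus.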
Use the word counting method (RSA tools http://embnet.cifn.unam.mx/~jvanheld/rsa-tools/) to retrieve possible motifs (DNA patterns that are overrepresented).
First try the oligo analysis; analyse 1 strand only.
o Try different parameter settings (motif lengths, background models)
o Compare the results with those of the Motif Sampler, i.e. are both methods consistent with each other?
As you might have noticed, the FNR motif consists of 2 conserved parts separated by a linker (non-conserved part). Try the dyad analysis. Is it easier to retrieve the right motif?
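The core of the word counting idea can be illustrated with a short sketch. It only counts oligonucleotide occurrences, on one strand or with each word grouped with its reverse complement; it does not compare the counts with a background model, which is the part the RSA tool adds. The function name count_oligos and the toy sequences are illustrative.

```python
from collections import Counter


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def count_oligos(sequences, k=6, both_strands=False):
    """Count all words of length k in a list of sequences.

    With both_strands=True each word is merged with its reverse
    complement and stored under the alphabetically first of the two.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if set(word) - set("ACGT"):
                continue  # skip words containing ambiguous bases
            if both_strands:
                word = min(word, reverse_complement(word))
            counts[word] += 1
    return counts


toy = ["CCTTGATATAGATCAAGG", "AATTGACGTCGATCAATT"]
for word, n in count_oligos(toy, k=4).most_common(5):
    print(word, n)
```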
Results page of RSA tool
Statistical parameters
The P-value (column occ_P) represents the probability to observe at least 13 occurrences when expecting 1.22. For CACGTG, it is of the order of 10^-9. However, this P-value might be misleading, because in this analysis we considered 2080 possible patterns. Indeed, there are 4096 possible hexanucleotides, but we regrouped each of them with its reverse complement, resulting in 2080 distinct patterns. Thus, we performed a simultaneous test of 2080 hypotheses.
The E-value (column occ_E) provides a more reliable and intuitive statistic than the P-value. The E-value is simply obtained by multiplying the P-value by the number of distinct patterns. It represents the number of patterns with the same level of over-representation which would be expected by chance alone. For CACGTG, the E-value is of the order of 10^-6, indicating that, if we would submit random sequences to the program, such a level of over-representation would be expected once every 1,000,000 trials.
An even more intuitive statistic is provided by the significance index (column occ_sig), which is the minus log transform (in base 10) of the E-value. The higher values are associated with the most significant patterns. On average, a significance higher than 0 would be expected by chance alone once per trial (sequence set), a score higher than 1 once every ten trials, a score higher than 2 once every 100 trials, and a score higher than 6 once every 10^6 trials.
The default parameters were chosen to return no more than one pattern per sequence set (threshold
on significance is 0). With these settings, the analysis of a small regulon typically returns half a
dozen hexanucleotides, among the 2080 possibilities. In addition, these hexanucleotides are
related to each other, and can be assembled to provide a more refined description of the predicted binding sites, as discussed below.
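The relation between the three statistics can be reproduced with a small calculation. The sketch below assumes a Poisson model for the number of occurrences, which may differ from the exact statistic used by the RSA tool, so the numbers should only be expected to agree in order of magnitude. The function names are illustrative.

```python
import math


def distinct_patterns(k):
    """Number of k-mers when each word is grouped with its reverse complement.

    Words equal to their own reverse complement (only possible for
    even k) are counted once, all other words come in pairs.
    """
    total = 4 ** k
    palindromes = 4 ** (k // 2) if k % 2 == 0 else 0
    return (total - palindromes) // 2 + palindromes


def poisson_tail(observed, expected, terms=200):
    """P(X >= observed) for a Poisson variable with mean `expected`."""
    term = math.exp(-expected) * expected ** observed / math.factorial(observed)
    total = 0.0
    for i in range(observed, observed + terms):
        total += term
        term *= expected / (i + 1)  # next Poisson probability
    return total


n_patterns = distinct_patterns(6)          # 2080 for hexanucleotides
p_value = poisson_tail(13, 1.22)           # at least 13 when expecting 1.22
e_value = p_value * n_patterns             # correction for multiple testing
significance = -math.log10(e_value)        # occ_sig
print(n_patterns, p_value, e_value, significance)
```

The count of 2080 follows from 4096 hexanucleotides, of which 64 are their own reverse complement: (4096 - 64) / 2 + 64 = 2080.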
To display the location of these patterns in your sequence data:
Click Pattern matching: it searches for the exact positions of the oligonucleotides in your input data
Click Feature map
If you have detected a motif you have discovered the regulatory motif that is important for the O2 dependent expression of the respective respiratory genes. This motif is known as the FNR regulatory motif and is recognized by the O2 sensor FNR.
Profile matching
By using the set of coregulated genes, you could construct a motif model describing the FNR binding site. Use this motif model to derive new targets e.g. in the genome of S. typhimurium. We will first screen a complete genome using a string based representation and subsequently the probabilistic representation generated by Gibbs Sampling.
Set of intergenic sequences of the complete S. typhimurium genome:
NC_003197intergenics.txt (size: 1472966).
Set of intergenic sequences of the complete E. coli genome:
NC_000913 intergenics.txt (size: 1472966).
FNR motif model as retrieved by the MotifSampler but in format compatible with the MotifScanner: matrixFNR.txt
Use RSA tools:
1. Represent motif by its consensus: how many hits do you retrieve? (derive the consensus
from the output of the Motif Sampler)
2. Represent motif by a regular expression: how many hits do you retrieve?
Explain the observation
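The difference between the two string-based representations can be illustrated as follows. The sketch translates a degenerate consensus written with IUPAC symbols into a regular expression and reports the matches; the function names and the two toy sequences are illustrative, and only one strand is searched.

```python
import re

# IUPAC nucleotide codes and the bases they stand for
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Translate a degenerate consensus into a regular expression."""
    return "".join(IUPAC[symbol] for symbol in consensus.upper())


def find_hits(pattern, sequence):
    """Return (1-based start, matched site) pairs, overlaps included."""
    regex = re.compile("(?=(" + pattern + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence.upper())]


pattern = consensus_to_regex("TTGAynTnGATCAA")
for seq in ["CCTTGATATAGATCAAGG", "AATTGACGTCGATCAATT"]:
    print(find_hits("TTGATATAGATCAA", seq), find_hits(pattern, seq))
```

A fixed consensus string only matches sites identical to it, whereas the regular expression also accepts sites that vary at the degenerate positions.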
Use MotifScanner
1. Represent motif by PSSM (matrixFNR.txt): how many hits do you retrieve?
o Run the algorithm and save the results as a text file
o Open the text file in Excel
o Remove the 2 upper heading lines
o What are the distinct results displayed by the algorithm?
o Compare the hits with the outcome of the RSA tool, what do you observe?
o How will you select the most promising candidates?
o Try to put a threshold, how would you proceed?
o How would you be able to increase the specificity of the screening procedure?
o Compare the targets of the screening of the Salmonella genome with those of the E. coli genome
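To get a feeling for what a threshold on a matrix score does, the following sketch scans a sequence with a position-specific matrix and keeps the windows whose log-odds score reaches the threshold. It is a simplification: it assumes a uniform single-nucleotide background and the column order A, C, G, T, whereas the MotifScanner uses its own background model and scoring. The toy matrix and the function names are hypothetical.

```python
import math


def log_odds_score(site, matrix, background=None, alphabet="ACGT"):
    """Sum of log2(motif probability / background probability) over a site."""
    background = background or {base: 0.25 for base in alphabet}
    score = 0.0
    for base, row in zip(site, matrix):
        score += math.log2(row[alphabet.index(base)] / background[base])
    return score


def scan(sequence, matrix, threshold, alphabet="ACGT"):
    """Return (start, stop, score, site) for every window above threshold.

    Positions are 1-based and inclusive; only one strand is scanned.
    """
    sequence = sequence.upper()
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        site = sequence[start:start + width]
        if set(site) - set(alphabet):
            continue  # skip windows with ambiguous bases
        score = log_odds_score(site, matrix, alphabet=alphabet)
        if score >= threshold:
            hits.append((start + 1, start + width, score, site))
    return hits


# Toy 4-position matrix favouring ATGC (rows: positions, columns: A C G T)
toy_matrix = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.7, 0.1, 0.1],
]
for threshold in (0.0, -10.0):
    print(threshold, scan("CATGCA", toy_matrix, threshold))
```

Lowering the threshold increases the number of hits and with it the number of false positives, which is the trade-off to keep in mind when selecting candidates.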
[Table: MotifScanner output columns: start, stop, score, consensus hit]