Exercises: Motif Detection

Adapted 09/11/2006

This exercise aims at (1) identifying regulatory motifs in coregulated genes using orthologous information (phylogenetic footprinting) and (2) detecting novel targets by genome-wide screening.

Model organism: S. typhimurium.

[Figure: workflow overview. Steps: Sequence; Identify homologs; Select intergenics; Phylogenetic footprinting; Genome-wide screening. Tools: Blast; INCLUSive Motif Sampler / RSA; INCLUSive Motif Scanner / RSA.]

MANUAL selection of the intergenic sequences

Start from the given protein sequence:

>gi|3850275|gb|AAC72071.1| cytcbb3 cytochrome c oxidase CytN subunit [Azospirillum brasilense]

MTSATLTPGAALGSQRVSENVRYYEDAVRLFVIAAVFWGVVGFLAGVFIALQLAFPALNLGLEWTSFGRLRPVHTSAVIF AFGGNVLFATSLYSVQRTSRQFLFGGEGLAKFVFWNYNIFIVLAALSYVLGYTQGKEYAEPEWILDLYLTVIWVLYAIQF VGTVMTRKESHIYVANWFFMAFILTVAILHIGNNVNVPVSLTGMKSYPFVSGVQSAMVQWWYGHNAVGFFLTAGFLGIMY YFVPKRAERPVYSYRLSIVHFWTLIFLYIWAGPHHLHYTALPDWAQTLGMTFSVMLWMPSWGGMINGIMTLSGAWDKLRT DPVLRFLVTSVAFYGMSTFEGPLMSVKPVNALSHYTDWTIGHVHSGALGWVAFISFGAIYYLVPVLWKRSQLYSLRLVSY HFWTATIGIVLYITAMWVSGIMQGLMWRAYDNLGFLQYSFVETVAAMHPFYVIRALGGVLFLAGALIMVYNLWRTAKGDV RIEKPYASAPHKAAVGAA

This protein is also a terminal oxidase belonging to the same protein family as in the previous exercise. We will now identify close homologs of this protein. The chance that the mechanism of transcriptional regulation is conserved is higher in close homologs than in distantly related homologs. Terminal oxidases of this subfamily occur in bacteria only. They are involved in the reduction of O2 at extremely low O2 levels (nearly anaerobic conditions). The enzymes differ from the classical terminal oxidases (which also occur in eukaryotes) by their extremely high affinity for O2. It can therefore be expected that these proteins are needed as the O2 level in the cell drops, and that their genes are switched on upon decreasing O2 levels. Let's find out whether we can find a motif that is involved in this transcriptional regulation.

Search for regulatory motifs in the promoter sequence of the gene encoding this protein by phylogenetic footprinting.

1) Select orthologs of this sequence by a Blast search (as in the previous exercise)

Because the Blast searches were practised in the previous exercises, we will not repeat them here. Instead, suppose that a Blast search has selected the following genes: see the file motifdetection/motifs.xls (sheet2).

2) Select for each ortholog the corresponding intergenic sequence

o Go to the protein file for a given Blast hit (e.g. NP_43566.1).
o Select from the protein file the gene name of the protein and the GenBank accession number under which the DNA sequence of this gene can be found.
o Go to the GenBank file, in this case NC_003037.
o Download the entry for this accession number and find the gene in the entry.
o Note the start and stop positions of the sequence encoding the protein of interest.
o Check whether another gene is located upstream of the gene of interest.
o Delineate the intergenic sequence based on these sequence positions.
o Put the intergenic sequence in FASTA format.
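The delineation step can be sketched in Python. This is a minimal sketch, assuming 1-based inclusive coordinates as used in GenBank feature tables; the function names and the toy genome are hypothetical and only illustrate the bookkeeping. For a gene on the complementary strand, the upstream region lies after the end coordinate of the gene and must be reverse-complemented.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[b] for b in reversed(seq.upper()))


def upstream_intergenic(genome, gene_start, gene_end, strand,
                        neighbour_start, neighbour_end):
    """Extract the intergenic region upstream of a gene.

    Coordinates are 1-based and inclusive (GenBank convention).
    strand is +1 for the forward strand, -1 for the complementary strand.
    The neighbour is the gene located upstream of the gene of interest.
    """
    if strand == 1:
        # Bases between the end of the neighbour and the start of the gene.
        return genome[neighbour_end:gene_start - 1]
    # On the complementary strand, upstream lies to the right of the gene.
    return reverse_complement(genome[gene_end:neighbour_start - 1])


def to_fasta(name, seq, width=60):
    """Format a sequence as a FASTA record with a fixed line width."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# Toy example (hypothetical coordinates): neighbour at 1..4, gene at 11..16.
toy_genome = "AAAA" + "CCGGTT" + "ATGAAA"
region = upstream_intergenic(toy_genome, 11, 16, 1, 1, 4)
print(to_fasta("toy_intergenic", region))
```

Doing this by hand for twenty orthologs makes it easy to introduce off-by-one errors, which is one reason to check each extracted region against the annotated start codon.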

Note that this manual download of sequences is a tedious task. Bioinformaticians have therefore made tools to automate the process.

Automatic download of the intergenic sequences

Accession number    Gene identifier
NZ_AAAF01000001     Rpal_p_1301
NZ_AAAV01000169     Saro_p_3044
AB024290            ^cytN
AL672112            ^fixN
NC_004041           ^fixN
U90521              ^fixN
NC_003317           BMEI1564
NC_003062           AGR_C_2835
NZ_AAAE01000158     Rsph_p_4089
NC_002696           CC1401
NZ_AAAN01000093     Mmc10458
NZ_AABA01000127     Pflu2484
NZ_AAAD01000083     Avin_p_2153
NC_003037           fixN3
NC_003112           NMB1725
NC_004347           ^ccoN
NC_004459           VV12620
NZ_AAAT01000001     Mdeg_p_0148
NZ_AABE01000011     Chut0193
AB025342            ORF19

1) Use NCBI

Search for the entry in the nucleotide database (using the accession number of the gene). Find the gene in the file and look up its start and stop positions. Do the same for the gene upstream of the gene of interest. Download the intergenic sequence using the subselect option. Save the FASTA file.
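The same subselection can also be requested programmatically. The sketch below only builds a request URL for the NCBI E-utilities `efetch` service and does not contact the server; the helper name is hypothetical, and the coordinates in the usage line are placeholders, not the real positions of any gene in the list above.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession, seq_start, seq_stop, minus_strand=False):
    """Build an efetch URL that returns a sub-sequence in FASTA format.

    seq_start and seq_stop are 1-based coordinates on the entry;
    strand is 1 for the plus strand and 2 for the minus strand.
    """
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
        "seq_start": seq_start,
        "seq_stop": seq_stop,
        "strand": 2 if minus_strand else 1,
    }
    return EFETCH + "?" + urlencode(params)


# Placeholder coordinates, for illustration only.
url = efetch_fasta_url("NC_003037", 100, 400)
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client. When looping over many accessions, keep the request rate low so as not to overload the server.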

Use the obtained dataset of intergenic sequences in FASTA format.

1. Use the Motif Sampler (http://www.esat.kuleuven.ac.be/~thijs/Work/MotifSampler.html) to find motifs in this dataset

The Motif Sampler works with an email address: the results pages will be sent to your email address.

This exercise can only be performed if you have access to an email account. If you do not have a Yahoo, student or Hotmail address that you can access from this computer room, try this exercise at home.

Because many students will perform the exercise simultaneously, the server might become overloaded. Do not press the calculate button unnecessarily.

1. Try different parameter settings and describe how they influence the outcome of the algorithm:

o Choose an appropriate background model, or a background model from a completely different species. What do you observe?

o Alter the motif length.

o Describe and explain the different scores.

o Is the retrieved motif informative? How can you derive this from the result?

2. Run the algorithm twice with the same parameter settings (using a motif length of 6) and compare the results. Explain your observations.
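The Motif Sampler relies on Gibbs sampling, a stochastic procedure, so two runs with identical settings need not return the same sites. The toy site sampler below illustrates this. It is a sketch under strong assumptions (one site per sequence, a uniform background, hypothetical toy sequences) and is not the INCLUSive implementation.

```python
import random


def gibbs_sample(seqs, w, iters=200, seed=None, pseudo=0.5):
    """Toy Gibbs site sampler: returns one 0-based site start per sequence."""
    rng = random.Random(seed)
    # Start from random site positions.
    pos = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(iters):
        for i, s in enumerate(seqs):
            # Build position-specific counts from all other sequences.
            counts = [{b: pseudo for b in "ACGT"} for _ in range(w)]
            for j, t in enumerate(seqs):
                if j != i:
                    for k in range(w):
                        counts[k][t[pos[j] + k]] += 1
            total = len(seqs) - 1 + 4 * pseudo
            # Weight every candidate site by its motif/background ratio.
            weights = []
            for p in range(len(s) - w + 1):
                ratio = 1.0
                for k in range(w):
                    ratio *= (counts[k][s[p + k]] / total) / 0.25
                weights.append(ratio)
            # Sample the new site position instead of taking the maximum.
            pos[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return pos


# Hypothetical toy sequences (A, C, G, T only).
toy_seqs = ["ACGTTTGACATGC", "GGTTGACACCATA",
            "CATTTGACAGGCT", "TTGACATTACGGA"]
run_a = gibbs_sample(toy_seqs, 6, seed=1)
run_b = gibbs_sample(toy_seqs, 6, seed=2)
print(run_a, run_b)
```

With a fixed seed the run is reproducible; without one, each run follows a different random path, and short motifs in particular may converge to different solutions.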

[Figure: input form of the Motif Sampler]

[Figure: results page]

Interpret the matrix file

#INCLUSive Motif Model v1.0
#
#ID = box_1_1_TTGAynTnGATCAA
#Score = 80.1843
#W = 14
#Consensus = TTGAynTnGATCAA
0.00547926 0.00395095 0.0039585 0.986611
0.00547926 0.00395095 0.0039585 0.986611
0.0809511 0.00395095 0.909621 0.00547737
0.986613 0.00395095 0.0039585 0.00547737
0.00547926 0.38131 0.0039585 0.609252
0.307367 0.305838 0.230374 0.156421
0.156423 0.00395095 0.0039585 0.835668
0.609254 0.0794228 0.305846 0.00547737
0.0809511 0.154895 0.683205 0.0809492
0.684726 0.154895 0.154902 0.00547737
0.00547926 0.00395095 0.0039585 0.986611
0.00547926 0.985085 0.0039585 0.00547737
0.986613 0.00395095 0.0039585 0.00547737
0.986613 0.00395095 0.0039585 0.00547737
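One way to interpret the matrix is to read it as 14 positions with four probabilities each and to compute the information content per position. The sketch below assumes that the columns are in the order A, C, G, T (which agrees with the consensus in the header) and measures information against a uniform background. Both are assumptions, not statements about the file format.

```python
import math

MATRIX_TEXT = """
0.00547926 0.00395095 0.0039585 0.986611
0.00547926 0.00395095 0.0039585 0.986611
0.0809511 0.00395095 0.909621 0.00547737
0.986613 0.00395095 0.0039585 0.00547737
0.00547926 0.38131 0.0039585 0.609252
0.307367 0.305838 0.230374 0.156421
0.156423 0.00395095 0.0039585 0.835668
0.609254 0.0794228 0.305846 0.00547737
0.0809511 0.154895 0.683205 0.0809492
0.684726 0.154895 0.154902 0.00547737
0.00547926 0.00395095 0.0039585 0.986611
0.00547926 0.985085 0.0039585 0.00547737
0.986613 0.00395095 0.0039585 0.00547737
0.986613 0.00395095 0.0039585 0.00547737
"""


def parse_matrix(text):
    """Parse whitespace-separated rows into a list of probability rows."""
    return [[float(x) for x in line.split()]
            for line in text.strip().splitlines()]


def information_content(row):
    """Information content in bits relative to a uniform background."""
    return 2.0 + sum(p * math.log2(p) for p in row if p > 0)


rows = parse_matrix(MATRIX_TEXT)
# Most probable base per position (assumed column order: A, C, G, T).
top_bases = "".join("ACGT"[row.index(max(row))] for row in rows)
bits = [information_content(row) for row in rows]

for i, (row, ic) in enumerate(zip(rows, bits), start=1):
    print(i, "ACGT"[row.index(max(row))], round(ic, 2))
```

Positions close to 2 bits are almost invariant, whereas positions close to 0 bits (such as position 6 here) carry little information. This corresponds to the degenerate symbols in the consensus.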

• Use the word counting method (RSA tools, http://embnet.cifn.unam.mx/~jvanheld/rsa-tools/) to retrieve possible motifs (DNA patterns that are overrepresented).

• First try oligo-analysis and analyse a single strand only.

o Try different parameter settings (motif lengths, background models)

o Compare the result with that of the Motif Sampler, i.e. are both methods consistent with each other?

• As you might have noticed, the FNR motif consists of two conserved parts separated by a linker (a non-conserved part). Try the dyad analysis. Is it easier to retrieve the right motif?

[Figure: results page of the RSA tool]

Statistical parameters

The P-value (column occ_P) represents the probability of observing at least 13 occurrences when expecting 1.22. For CACGTG, it is of the order of 10^-9. However, this P-value might be misleading, because in this analysis we considered 2080 possible patterns. Indeed, there are 4096 possible hexanucleotides, but we regrouped each of them with its reverse complement, resulting in 2080 distinct patterns. Thus, we performed a simultaneous test of 2080 hypotheses.
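The number 2080 can be verified by enumeration: of the 4096 hexanucleotides, 64 are their own reverse complement, and the remaining 4032 form 2016 pairs. A short check in Python (the function names are only illustrative):

```python
from itertools import product


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq))


def count_strand_insensitive_patterns(k):
    """Count k-mers after merging each word with its reverse complement."""
    seen = set()
    for letters in product("ACGT", repeat=k):
        word = "".join(letters)
        # Keep one representative per (word, reverse complement) pair.
        seen.add(min(word, reverse_complement(word)))
    return len(seen)


print(4 ** 6)                                # 4096
print(count_strand_insensitive_patterns(6))  # 2080
```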

The E-value (column occ_E) provides a more reliable and intuitive statistic than the P-value. The E-value is simply obtained by multiplying the P-value by the number of distinct patterns. It represents the number of patterns with the same level of over-representation that would be expected by chance alone. For CACGTG, the E-value is of the order of 10^-6, indicating that, if we submitted random sequences to the program, such a level of over-representation would be expected once every 1,000,000 trials.

An even more intuitive statistic is provided by the significance index (column occ_sig), which is the negative logarithm (base 10) of the E-value. Higher values are associated with more significant patterns. On average, a significance higher than 0 would be expected by chance alone once per trial (sequence set), a score higher than 1 once every ten trials, a score higher than 2 once every 100 trials, and a score higher than 6 once every 10^6 trials.
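The relation between the three statistics can be followed with a small calculation. The sketch below assumes a Poisson distribution for the number of occurrences; this is an approximation chosen for illustration, and the program itself may use a different distribution (for instance a binomial). The values 13, 1.22 and 2080 are taken from the text above.

```python
import math


def poisson_tail(k, lam, terms=200):
    """P(X >= k) for a Poisson variable with mean lam, by direct summation."""
    term = math.exp(-lam) * lam ** k / math.factorial(k)  # P(X = k)
    total = 0.0
    for i in range(k, k + terms):
        total += term
        term *= lam / (i + 1)  # P(X = i + 1) from P(X = i)
    return total


observed, expected, n_patterns = 13, 1.22, 2080

p_value = poisson_tail(observed, expected)  # occ_P (Poisson approximation)
e_value = p_value * n_patterns              # occ_E = P-value x number of tests
sig = -math.log10(e_value)                  # occ_sig = -log10(E-value)

print(p_value, e_value, sig)
```

Under this approximation, the P-value comes out of the order of 10^-9 and the E-value of the order of 10^-6, in line with the values quoted for CACGTG.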

The default parameters were chosen to return no more than one pattern per sequence set by chance alone (the threshold on significance is 0). With these settings, the analysis of a small regulon typically returns half a dozen hexanucleotides among the 2080 possibilities. In addition, these hexanucleotides are related to each other and can be assembled to provide a more refined description of the predicted binding sites, as discussed below.

To display the location of these patterns in your sequence data:

1. Click Pattern matching. It searches for the exact positions of the oligonucleotides in your input data.
2. Click Feature map.

If you have detected a motif, you have discovered the regulatory motif that is important for the O2-dependent expression of the respective respiratory genes. This motif is known as the FNR regulatory motif and is recognized by the O2 sensor FNR.

Profile matching

By using the set of coregulated genes, you could construct a motif model describing the FNR binding site. Use this motif model to derive new targets e.g. in the genome of S. typhimurium. We will first screen a complete genome using a string based representation and subsequently the probabilistic representation generated by Gibbs Sampling.

Set of intergenic sequences of the complete S. typhimurium genome:

NC_003197intergenics.txt (size: 1472966).

Set of intergenic sequences of the complete E. coli genome:

NC_000913 intergenics.txt (size: 1472966).

FNR motif model as retrieved by the MotifSampler but in format compatible with the MotifScanner: matrixFNR.txt

• Use RSA tools:

1. Represent the motif by its consensus: how many hits do you retrieve? (Derive the consensus from the output of the Motif Sampler.)

2. Represent motif by a regular expression: how many hits do you retrieve?

Explain the observation
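The difference between the two string-based representations can be illustrated on a toy sequence. In the sketch below, the exact string is the most probable base at each position of the Motif Sampler matrix, and the regular expression is derived from the degenerate consensus TTGAynTnGATCAA using IUPAC codes. The toy sequence and the function names are hypothetical, and only the forward strand is searched.

```python
import re

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(consensus):
    """Translate a degenerate IUPAC consensus into a regular expression."""
    return "".join(IUPAC[c] for c in consensus.upper())


def find_hits(pattern, seq):
    """Return 0-based start positions of all matches, overlaps included."""
    lookahead = re.compile("(?=(" + pattern + "))")
    return [m.start() for m in lookahead.finditer(seq.upper())]


# Toy sequence with one perfect site and one degenerate site.
toy_seq = "GG" + "TTGATATAGATCAA" + "CC" + "TTGACGTTGATCAA" + "GG"

exact_hits = find_hits("TTGATATAGATCAA", toy_seq)
regex = iupac_to_regex("TTGAynTnGATCAA")
regex_hits = find_hits(regex, toy_seq)

print(regex)
print(exact_hits, regex_hits)
```

The exact string only retrieves sites identical to it, while the regular expression also accepts sites that vary at the degenerate positions, at the price of more chance matches in a complete genome.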

• Use MotifScanner

1. Represent motif by PSSM (matrixFNR.txt): how many hits do you retrieve?

o Run the algorithm and save the results as a text file.

o Open the text file in Excel.

o Remove the two upper heading lines.

o What are the distinct results displayed by the algorithm?

o Compare the hits with the outcome of the RSA tool. What do you observe?

o How will you select the most promising candidates?

o Try to set a threshold. How would you proceed?

o How would you be able to increase the specificity of the screening procedure?

o Compare the targets of the screening of the Salmonella genome with those of the E. coli genome.
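The effect of a score threshold can be explored with a simple position-specific scoring scheme. The sketch below scores every window of a sequence with the log-odds of the FNR matrix against a uniform background. This is an assumption made for brevity: MotifScanner uses its own background model and scoring, so this is not its algorithm. The column order A, C, G, T, the toy sequence and the thresholds are also assumptions.

```python
import math

PWM = [
    [0.00547926, 0.00395095, 0.0039585, 0.986611],
    [0.00547926, 0.00395095, 0.0039585, 0.986611],
    [0.0809511, 0.00395095, 0.909621, 0.00547737],
    [0.986613, 0.00395095, 0.0039585, 0.00547737],
    [0.00547926, 0.38131, 0.0039585, 0.609252],
    [0.307367, 0.305838, 0.230374, 0.156421],
    [0.156423, 0.00395095, 0.0039585, 0.835668],
    [0.609254, 0.0794228, 0.305846, 0.00547737],
    [0.0809511, 0.154895, 0.683205, 0.0809492],
    [0.684726, 0.154895, 0.154902, 0.00547737],
    [0.00547926, 0.00395095, 0.0039585, 0.986611],
    [0.00547926, 0.985085, 0.0039585, 0.00547737],
    [0.986613, 0.00395095, 0.0039585, 0.00547737],
    [0.986613, 0.00395095, 0.0039585, 0.00547737],
]
INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def log_odds_score(window, pwm, background=0.25):
    """Sum of log2(motif probability / background probability)."""
    return sum(math.log2(row[INDEX[b]] / background)
               for row, b in zip(pwm, window))


def scan(seq, pwm, threshold):
    """Return (start, stop, score, site) for windows scoring >= threshold.

    start and stop are 1-based and inclusive; forward strand only.
    """
    w = len(pwm)
    hits = []
    for start in range(len(seq) - w + 1):
        window = seq[start:start + w]
        score = log_odds_score(window, pwm)
        if score >= threshold:
            hits.append((start + 1, start + w, round(score, 2), window))
    return hits


# Toy sequence with one strong and one weaker site (hypothetical).
toy_seq = "GG" + "TTGATATAGATCAA" + "CC" + "TTGACGTTGATCAA" + "GG"
strict = scan(toy_seq, PWM, 20.0)
lenient = scan(toy_seq, PWM, 10.0)
for hit in lenient:
    print(hit)
```

Raising the threshold keeps only the sites closest to the model (higher specificity) and lowering it also retrieves weaker sites (higher sensitivity). Requiring a hit to be conserved in both the Salmonella and the E. coli genome is another way to increase specificity.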

[Figure: MotifScanner output with the columns start, stop, score and consensus hit]

Do you find new targets? Search for the potential function of the best hits. Are these functions related to respiration or energy conversion?

As a validation, select a few targets that occur in both the E. coli and the Salmonella genome (if the motif is conserved in both species, the chance that it is a true positive is higher).