Methylation in Nucleosome Positioning

(1)

CpG Methylation in

Nucleosome Positioning

Thesis

submitted in partial fulfillment of the requirements for the degree of

Master of Science in Theoretical Physics Author: Student ID : Supervisor: 2nd corrector: Renger Zoonen 1066102 Helmut Schiessel John van Noort

(2)

Abstract

DNA methylation is the phenomenon where a methyl group attaches to the nu-cleotide cytosine or adenine. We study CpG methylation and specifically its role in nucleosome positioning. Methylation changes the mechanical properties of the CpG step and therefore the mechanical affinity of DNA to form nucleosomes. We use a rigid base pair model to emulate DNA where the mechanical and geometric properties of a base pair step depend on which base pairs it is made of. A 147 base pair long piece of DNA is forced into the shape known for wrapped DNA inside a nucleosome by attaching it to the nucleosome core proteins at 28 positions. Monte Carlo simulations then allow us to measure various quantities like the average energy of a given sequence or the occurrence rate of certain base pair triplets. The latter is used to inform a trinucleotide Markov model that generates free energy landscapes. From these landscapes we calculate free energy it takes to place a nucleosome at any position close to transcription start site of genes. On average human genes attract nucleosomes around their transcription start sites. We find in this thesis that methy-lation turns these nucleosome attractive regions into nucleosome expelling regions. We also study the role of entropy in nucleosome positioning by comparing energy landscapes to free energy landscapes. These comparisons show that at higher tem-peratures entropy becomes non-negligible. We propose a model that could generate energy landscapes at much lower calculational cost, a tool that can be useful in studying the role of entropy in nucleosome positioning on a genome wide scale.

(3)

1 Introduction

If one would stretch out all DNA molecules contained in an average human body this would cover the distance to Pluto and back twice. Or to stay closer to home, it would cover about 70 round trips to the sun. An individual human cell alone contains about 2 meters of DNA, DNA that is all contained within the cell nucleus which typically has a diameter of only 10 micrometer. In order to fit DNA into the cell nucleus it has to be compacted. Nucleosomes are nature’s basic solution to this packaging problem. A nucleosome consists of an octamer of histone proteins with 147 base pairs of DNA wrapped around it in a 13₄ superhelical turn. The diameter of a nucleosome is about 10 nm so DNA is strongly deformed when forced into this shape. Some pieces of DNA will more readily bend into shape than others due to their mechanical properties. DNA is a double-stranded polymer of a sugar called deoxyribose, a phosphate group and a nucleotide which is made of one of four nucleobases, adenine (A), thymine (T), cytosine (C) or guanine (G). Depending on which nucleobases the sequence is made of, its mechanical properties will vary. Some sequences have higher affinity to form nucleosomes than others. This way sequence dependent mechanical properties could influence nucleosome positioning, an idea explored by Segal et al. [1].

The main research topic of this thesis is how CpG methylation affects nucleosome positioning. CpG methylation is the phenomenon where a methyl group is attached to the cytosines of a CpG step. A CpG step on one strand will be attached to a CG step on the other strand and usually either both or none of the cytosines are methylated. In human DNA between 60 and 80% of CpG steps is methylated [6]. Methylation influences the geometric and mechanical properties of the base pair step. A methylated CpG step has higher roll and considerably higher stiffness than the unmethylated CpG step. Combined with the fact that about 70% of CpG steps in human DNA is methylated, it is very likely that methylation will play a large role in determining where nucleosomes will be placed.

Studying the phenomenon of methylation in general is motivated by its diverse role in a lot of biological processes. Biologists generally consider methylation to be an gene expression inhibitor [2]. The bulky methyl group blocks transcription proteins moving along the DNA. Methylation also seems to play an important role in can-cer. Methylation patterns are usually maintained upon cell division by the action of enzymes, so-called methyltransferase. Tumor cells generally exhibit very low methy-lation levels [3] and pass these low levels on to daughter cells. Atherosclerosis comes with global hypomethylation and hypermethylation at specific regions. Since these methylation abnormalities show up before actual symptoms of the disease, this could

(5)

serve as a tool for early diagnosis [4]. Methylation seems to play a role in aging, or is at least correlated with age. Older people show lower global methylation levels [5]. In none of these examples it is fully understood what role methylation plays but it is a promising field of research.

(6)

2 Methods

2.1 Nucleosome Model

At the very basis of this thesis’ subject, is a nucleosome model introduced by Eslami et al. [7]. In this model a piece of double-stranded DNA consisting of 147 base pairs is forced into a super-helical structure. The DNA is held into this position by being attached to 28 fixed points that can be seen in Figure 1, indicated by the little red dots.

Individual base pairs are treated as rigid entities. In this so-called Rigid Base Pair Model, the elastic energy of two connected base pairs is given by:

Eel = (q − q0)TQ(q − q0)

where q is a six-dimensional vector. Its entries describe the position and orientation of the second base pair in the reference frame of the first base pair, with q0 being the equilibrium position and orientation, and Q being a six-by-six stiffness matrix. The equilibrium position q0 _{and the stiffness Q between two base pairs depends on}

which base pairs make up the base pair step. Every possible base pair combination has its own equilibrium and its own stiffness. These parameters can be found in literature and are either obtained by all-atom molecular dynamics (MD) simulations [8] or from experimental work [9].

The total elastic energy put into wrapping a 147 base pair long piece of DNA around the histone core, is the sum of the elastic energies of all 146 base pair steps. This energy is sequence dependent, but only takes nearest neighbour interactions into account.

2.2 Energy landscapes from a Monte Carlo model

With this nucleosome model as a tool, we can now build a Monte Carlo model that lets a piece of DNA relaxate into our predetermined super-helical shape [7]. The idea of Monte Carlo simulations is straightforward. In our case, we propose a change in the position/orientation of a single base pair, we calculate the new elastic energies with its neighbors, compare it to the old energy. If the elastic energy goes down

(7)

Figure 1: Schematic depiction of the nucleosome model introduced by Eslami et al. [7]. The DNA is held in position by attaching it to the nucleosome core at 28 positions as indicated by the red dots.

by this proposed change, we accept it. If the energy goes up by dE, we accept the change with probability

Paccept = e − dE

kB T

where T is the simulation temperature. Higher simulation temperatures allow for more changes that increase the total elastic energy.

If we would just let this simulation run for a while, the DNA would completely relax. To prevent this, we attach it to the histone core. The trick is that at certain positions when a change in position/orientation is proposed, the exact opposite change is applied to its neighbor. This will ensure that position and orientation of the mid-plane between the two base pairs doesn’t change. This condition is enforced at 28 positions, emulating the effect of the DNA being attached to the binding sites on the nucleosome core proteins. The position and orientation of the base pairs are determined by the DNA conformation in the nucleosome crystal structure [10]. Usually when running these simulations the total elastic energy will rapidly decrease at the beginning. This indicates that the initial state was quite frustrated. This can be seen in figure 2 where we track the elastic energy of a piece of DNA (in this case a sequence of 147 cytosines, a C-tract) during a number of Monte Carlo steps. After this initial relaxation, we track the energy over an extended period of time. The average energy over many Monte Carlo steps gives us a good estimate of the total

(8)

Figure 2: Elastic energy of a 147bp long C-tract during a 2,000 steps long Monte Carlo simulation. The initial state is quite frustrated and it takes some time to relaxate. Lower temperatures generally lead to longer relaxation time. To be safe we start measuring the energy after 5,000 Monte Carlo steps.

elastic energy needed to force a certain 147 bp long tract of DNA into a nucleosome at a certain simulation temperature. To get to the ground state energy, one has to subtract the thermal energy, 1₂kBT for each degree of freedom of the piece of DNA.

We won’t do this because we are not interested in absolute values of the energy, but rather in energy differences.

Next we want to know what the elastic energy would be of positioning a nucleosome on actual genes. We are particularly interested in nucleosome positioning close to transcription start sites (TSS). Nucleosomes can hinder transcription of genes, and when they are positioned at the TSS can inhibit gene expression. We make the arbitrary choice of looking at the 1,000 base pairs before until 1,000 base pair after the TSS. We place a nucleosome at position -1,000, so that it covers base pair -1,073 to -926 relative to the TSS. We use the Monte-Carlo model to get the elastic energy needed for this particular 147 base pair long piece of DNA to form a nucleosome. We do this for all 2,000 positions to get an energy landscape such as can be seen in figure

(9)

5. Lower values in this landscape means it takes less energy to position a nucleosome at that position, and hence a higher likelihood a nucleosome will actually be found at that position.

2.3 Markov Chain Approximation

Our goal is to study nucleosome affinity around transcription start sites genome wide. Mammalian genomes typically consists of several 104 _genes. _{Creating an}

energy landscape for one gene using the Monte Carlo approach takes a couple of hours. Creating an energy landscape for all 104 _{genes using this method, would}

literally require years of computation time.

Now in order to get around this issue of exploding computation times, Tompitak et. al. [11] suggested a method of re-purposing the model of Segal et al. to calculate these energy landscapes at computation times several orders of magnitude smaller than when taking the straightforward Monte-Carlo approach, bringing down the computation times from years to minutes.

Lets define a probability distribution on the space of all possible sequences P(S). If this probability distribution is normalized, one could think of this probability as the likelihood that of all possible sequences the sequence S is occupied by a nucleosome. The logarithm of such a probability distribution scales linearly with free energy of those sequences. Segal et al. treat a DNA sequence as a Markov chain of order 1. They assign a probability to a 147 base pair long sequence according to

P (S) = P (S1) 147

Y

n=2

P (Sn|Sn−1)

P (S), the probability of the whole wrapping sequence is a product of probabilities of the individual nucleotides. Given there is a Thymine nucleobase at position 4 of the sequence, P (A5|T4) is the probability it will be followed by an Adenine nucleobase.

The probabilities P (A5|T4), P (T5|T4), P (C5|T4) and P (G5|T4) must add up to 1

because the fourth base pair will always be followed by either an A, T, C or G. This dinucleotide model can be generalized to a trinucleotide model very easily. Instead of looking at just the previous position to determine the probability, one now looks at the previous two positions. For example, given an A at position 3, a T at position 4, what probability do we have of finding a G at position 5?

(10)

We want to go back from probabilities P (S) to energies as the goal is to calculate energy landscapes efficiently. Statistical physics tells us we can calculate the free energy straight from the (Boltzmann) probability according to:

F (S) = −kBT ∗ log(P (S))

which is the free energy of the 147 base pair long sequence S wrapped into a nucleo-some. This energy is calculated by doing 145 multiplications, which takes a fraction of a millisecond to calculate. To create an energy landscape we have to calculate this free energy 2,000 times. So we went from a couple of hours to create an energy landscape to a fraction of a second. When bench marked on the first chromosome of S. cerevisiae at 100K, energies as calculated by the trinucleotide model scale one on one with energies as calculated by the Monte Carlo model, with a RMSD of 0.85kBT

.

2.4 Mutation Monte Carlo simulation

Segal et al. got their dinucleotide probability distributions from analyzing an ensem-ble of sequences with high affinities for the nucleosome (in vitro). Instead of using an experimental ensemble as input for their bio-informatics model, it is also possible to generate this data using a variation of the Monte Carlo model we have seen before. This approach is used in the study by Tompitak [11].

In the Mutation Monte Carlo (MMC) method one does not only change the orienta-tion and posiorienta-tions of the 147 base pair, another change that can be proposed to the 147 base pair long DNA is the mutation of a single base pair into another one. The elastic energies with both neighbors has to be recalculated, and the energy difference determines the acceptance ratio. Now if we let the Mutation Monte Carlo simulate run for a long time, it will by itself generate sequences that have high affinities for the nucleosome. We specifically want to know how often certain trinucleotides happen at certain positions. So that is what we track. During the simulation we simply keep track of how often every possible trinucleotide occurs at each position. That gives us 145 ∗ 4 ∗ 4 ∗ 4 data points. These frequencies are then converted into probabilities that we use as input for the trinucleotide model.

(11)

2.5 Mutation Monte Carlo with CpG Methylation

Methylation can happen to either cytosines or adenines. In mammalian DNA methy-lation is almost exclusively found in CpG dinucleotides. When this happens, the cytosines are usually methylated on both strands. Translating this to our rigid base pair model, this results in a change of the equilibrium values and the stiffness pa-rameters of the CG step itself and of the base pair steps prior to and after the CG step. When we methylate the cytosines in the tetranucleotide x1− C − G − x2 to

x1− Cm− Gm− x2, it changes the equilibrium values and stiffness of all 3 base pair

steps in this tetranucleotide. In this notation, Gm means that the cytosine on the

other strand is methylated.

So far in the Mutation Monte Carlo simulation we have proposed two different types of changes, namely a change in orientation/position of an individual base pair or a mutation of a single base pair into another base pair. To these two changes we add a third possible change. We split up our simulation in two steps. The first step is basically the unchanged Mutation Monte Carlo simulation. Then at every second step we select a random base pair. When this base pair happens to be a CG step, we propose to change it to a methylated CG step and vice verse. As always, we calculate the change in energy this change would come with from which we then calculate the acceptance ratio and accept or reject the change accordingly.

2.5.1 Parameter set

In order to calculate the change in energy CpG methylation comes with, we have to recalculate the elastic energies of 3 base pair steps. We need one stiffness matrix and equilibrium vector for the base pair step Cm− Gm, five for the the base pair step

prior to the CpG step, i.e. x − Cm where x can be A, T, C, G or Gm, and five for

the base pair step after the CpG step Gm− x where x can be A, T, C, G or Cm. It

seems that CpG methylation introduces 11 new possible base pair steps. However we counted the Gm − Cm step twice so instead of 11 we have 10 new possible base

pair steps. These parameters can be found in Appendix A and were provided by the Molecular Modelling and Bioinformatics group of IRB Barcelona. If we compare the parameters for methylated CpG to the Olson parameters, it turns out the methylated CpG step has higher intrinsic roll than its unmethylated counterpart, 0.2236 rad vs. 0.0958 rad respectively. Also the roll-roll stiffness is higher for methylated CpG. Methylated CpG has a roll-roll stiffness of 147.01 kbT /rad2 whereas non-methylated

(12)

2.6 Markov with Methylation

Adapting the Markov model to take CpG methylation into account is straightforward. We treat Cm and Gmjust like any other base pair. We just have to make sure that the

sequence we want to calculate the energy landscape of, doesn’t include any invalid base pair steps. A methylated cytosine must always be followed by a guanine that has the cytosine on the other strand methylated.

Methylated CpG has considerably higher stiffness than unmethylated CpG, it will take more energy to bend it. As a result, methylation levels will be low during the Mutation Monte Carlo simulation. Most changes proposing methylation will not get accepted because too much energy is needed to make this change. Even if we run the MMC for a long time, certain triplets won’t occur very often. Especially triplets that contain multiple methylated CpG steps such as Gm−Cm−Gm will almost never

occur. Tompitak ran his MMC for 107 _{Monte Carlo steps to acquire good statistics.}

In order to acquire accurate statistics of the triplets containing a methylated CpG step, I ran MMC for 109 Monte Carlo steps. Fortunately one can easily run multiple simulations simultaneously, and combine the results afterwards.

When comparing the free energy landscapes we created to the ones found in studies by Tompitak, one might notice that the absolute value of our landscape seems to differ slightly from Tompitaks’. This is due to a slight difference in the way the model is programmed. Since we are looking at triplets, only 145 overlapping triplets fit within a 147 base pair long piece of DNA. Our probability is therefore a product of 145 terms. However, Tompitak also looks at the dinucleotide and mono-nucleotide at the beginning of the 147 base pairs. We choose not to do this because firstly it complicates programming, and secondly the first three base pairs in the nucleosome model are essentially free DNA. The DNA is attached to the nucleosome core for the first time between the third and fourth base pair (see figure 1). Since the first three base-pairs are not forced into a certain shape, it won’t add extra information to use the probabilities of the starting mono-nucleotide and dinucleotide.

2.7 An Alternative to the Trinucleotide Markov Model

The Trinucleotide Markov Model seems to have issues. In our group it became apparent that the model seems to work only at specific simulation temperatures. There are indications that failure of this model might have to do with the switch to free energy. In the Markov Model we calculate the free energy from a probability,

(13)

whereas in the Monte Carlo model we work with pure elastic energy. The difference between energy and free energy is entropy, scaling with temperature. It might be the case that at higher temperatures entropy starts playing a non-negligible role. In order to get around this issue with entropy I came up with another model to create energy landscapes. This time it won’t be free energy landscapes, but we will be using the elastic energy that we use in the Monte Carlo simulations thus getting rid of the entropic contribution.

This model is a variation on the Mutation Monte Carlo model and is inspired by the trinucleotide model. As said before, in the Monte Carlo simulation the total elastic energy of the 147 base pair long bend piece of DNA, is a sum of the elastic energies of the 146 base pair steps. Instead of tracking the total energy, one might as well measure the energies at base pair step level. There are 146 base pair steps. These base pair steps connect two base pairs. If we don’t take CpG methylation into account, there are 16 possible combinations of base pairs forming a base pair step. In a first attempt at building this model, I tracked the elastic energy for these 16 possible combinations. One base pair step is fixed, while for the rest of the base pairs the Mutation Monte Carlo model is running as usual. The elastic energy of the fixed base pair is tracked over a certain number of Monte Carlo steps. This procedure gives us the value of one entry in an array of values for all possible dinucleotides at all possible positions. To get back to the full energy of a given 147 base pair long sequence, it is simply a matter of adding all the energies for the base pair steps the sequence is made of.

Creating a trinucleotide model is the obvious next step. As with the Markov model, there is probably a lot to win in terms of accuracy when stepping up from a dinu-cleotide to a trinudinu-cleotide model. Instead of just looking at the two base pairs that form the base pair step, we also inform our model which base pair is positioned prior to the dinucleotide. It is important to note that we still only measure the elastic energy of one base pair step. To clarify by example: say we have the trinucleotide ATG, we will only measure the elastic energy between T and G. However, this en-ergy will implicitly depend on nucleotides further away. It therefore makes sense to further split up the elastic energy for the base pair step TG depending on whether it is preceded by an A, T, C or G.

A 147 base pair long piece of DNA is composed of 145 overlapping trinucleotides. So we will end up generating an array of size 145 ∗ 4 ∗ 4 ∗ 4. The total elastic energy of a 147 base pair long piece of DNA will be the sum of 145 contributions that can be looked up in this array.

(14)

Figure 3: plot of CG and AA+AT+TT frequency of occurrences during a Monte Carlo simulation of 109 steps at 100K using the Olson Parameters.

3 Results

With all our tools described in the previous chapter, we are now ready to do some actual work and produce results. The results in this chapter are twofold. During the first part of this project I was mainly interested in the phenomenon of methylation. This constitutes the first part of this chapter. During the final couple of months of the project research shifted a bit towards exploring problems related to the trinu-cleotide Markov model. In the second part of this chapter we will show how energy landscapes generated by the model described in section 2.7 compare to energy land-scapes generated by the trinucleotide Markov model and energy landland-scapes generated by the slow Monte Carlo simulations that is at the base of the other models.

(15)

Figure 4: The fraction of CG that is methylated along the 147bp during MMC. Nowhere along the 147 the methylation percentage gets above 6%, which comes to show that the methylated CG really doesn’t fit in well in a nucleosome.

3.1 Methylation

3.1.1 Frequencyplots

In figure 3 we see how often certain base pairs occur in the Monte Carlo simulation at all 147 positions along the nucleosome using the Olson parameters [9] at 100K. The nucleosome dyad is at bp 0. One might object that the frequency of CG steps is too high as for real genomes. In reality only 1% of all base pair steps is a CpG step. There is however no reason to demand an accurate CG content prediction from our model. First off, this is modeled around a nucleosome. In vivo not all DNA will be formed into nucleosomes. Second, biologist believe that CG content is low for reasons other than CGs mechanical properties.

(16)

3.1.2 CG vs. CmGm

Figure 4 shows what fraction of CpG steps is methylated during a Mutation Monte Carlo simulation along the 147 base pairs forming a nucleosome. The fact that this fraction always stays below 6% shows that the methylated CpG step does not fit in easily into the shape of a nucleosome. This is an immediate result of the mechanical properties of this base pair step. It has increased roll and higher stiffness than any other base pair step.

The MMC simulation was ran to produce triplet frequencies/probabilities. When comparing triplets containing unmethylated CpG to triplets containing methylated CpG, it turns out that at every position the triplet X − C − G occurs more frequently than its methylated counterpart X − Cm− Gm, where X can be A, T, C, G or Gm.

This shows that in the trinucleotide Markov model methylating a CpG step will always increase the free energy.

3.1.3 Nucleosome Occupancy

Section 2.3 describes how I calculate free energy landscapes of genes around the TSS. Figure 5 shows such a landscape for the human gene SEMA3F without methylation and a landscape for the same gene with every CpG step methylated. The landscapes in figure 5 are exemplary of what landscapes look like for a lot of human genes. Close to the transcription start site, CG content is high. When all the CpG steps are methylated this results in a peak in free energy around the TSS. When nothing is methylated on the other hand, the free energy is generally low close to the TSS. Figure 6 exhibits this same behaviour very well. It shows the average energy land-scape over all human genes both when fully methylated and with nothing methylated. The dip in free energy around the TSS turns into a peak upon methylating all CpG steps.

The next step is to investigate how methylation affects nucleosome occupancy. For each base pair in the interval -1000 to +1000 we calculated the nucleosome occupancy by summing over all the Boltzmann probabilities of all 147 nucleosome positions that result into that base pair being covered by the nucleosome. The normalized nucle-osome occupancy (NNO) is the logarithm of this occupancy after it is normalized to one. In figure 7 we see free energy landscapes for a human genome, the genome of a zebra-fish and the genome of a mouse both non-methylated and methylated. Figure 8 shows the corresponding normalized nucleosome occupancies. In all three

(17)

Figure 5: Energy landscape of a the human gene SEMA3F without methylation (blue) and fully methylated (green) at 100K using the Olson Parameters. Methyla-tion creates a very strong nucleosome expelling region around the TSS. Next to this, local minima occur at different positions, likely resulting in different positioning of nucleosomes.

studied organisms methylation seems to function as a switch, turning a nucleosome attractive region into a nucleosome expelling region.

(18)

Figure 6: Average energy landscape of human at 100K when unmethylated vs. when fully methylated. Full methylation switches the energy landscape completely.

(19)

(a) (b)

Figure 7: Average energy landscapes for three different organisms when (a) not methylated and when (b) fully methylated. For all three organisms methylation flips the energy landscape on its head.

(a) (b)

Figure 8: Normalized nucleosome occupancy for three different organisms when (a) not methylated and when (b) fully methylated. For all organisms methylation turns the nucleosome attractive region into a nucleosome expelling region.

(20)

Figure 9: Average energy landscape around the TSS of human genes at different percentages of methylation.

Figure 10: Normalized Nucleosome Occupancy around the TSS of human genes at different percentages of methylation. 30% seems to be the tipping point, where the TSS becomes nucleosome expelling instead of nucleosome attracting.

(21)

Figure 11: Same as in figure 9, but for percentages close to the tipping point.

Figure 12: Normalized Nucleosome Occupancy for the energy landscapes in figure 11

(22)

Figure 13: NNO graph of human genome around TSS next to a graph with the number of CpG steps covered by a nucleosome extending 73 base pairs towards both directions, positioned n base pairs away from the TSS. The average CpG content and NNO have a very similar shape.

3.1.4 CG content

In Figure 6 the two energy landscapes are almost a mirror image of each other. The only difference between those energy landscapes is that one has all CpG steps methylated. Methylated CpG has quite different mechanical properties from non-methylated CpG, non-methylated CpG being much stiffer than the non-non-methylated ver-sion. So how can methylating all CpG steps result in a energy landscape that is almost a mirror image of the landscape for non-methylated DNA? The likely answer is that for some reason the shape of these energy landscapes is a direct result of the CG content. So if we change the properties of the CpG step by methylating, the whole energy landscape flips.

As can be seen in figure 13 the graph of CpG content and the NNO of the human genome show remarkable similarities. On the left is the normalized nucleosome oc-cupancy, on the right is the number of CpG steps covered by a nucleosome extending 73 base pairs towards both directions, with the dyad positioned n base pairs away from the transcription start site.

Figure 9 and figure 11 show average energy landscapes for the human genome around the transcription start sites for different amounts of methylation. Figure 10 and 12

(23)

show the NNOs for the same percentages of methylation. At 25% the CG content dependency is still visible, at 35% the dependency has switched sign. So the methy-lation fraction can be used as a sort of control knob to determine the strength of the relation between the average energy landscape and CG content. Right around 30% methylation the CG dominance seems to cancel out. What is left is the influence of other base pairs on the energy landscape.

(24)

3.2 Investigating the Markov Model

In this section I will show some exploratory work I did to find out more about the nature of problems that arise when using the Markov Model at higher temperatures. At 100K the Markov model does a very good job. At 300K however something peculiar seems to happen.

3.2.1 Accuracy of the Markov Model

Figure 14 shows energy landscapes of the same human gene (TMEM176A) accord-ing to different models, all at 100K. The blue graph is the elastic energy from the Monte Carlo simulation using the hybrid parameter set [12]. The green curve is the energy landscape according to the trinucleotide Markov Model, also using the hy-brid parameter set. The red curve is a landscape from the Markov Model but now with the Olson parameter set [9] at its base. The differences between Markov and Monte Carlo using hybrid parameters (green and blue) are much smaller than the differences when using a different parameter set (red vs blue/green). All the graphs are made to have the same average energy over the 2000 base pairs by shifting the whole landscape up/down. This is something we can do because we are for the most part interested in relative energy differences.

It is clear that at 100K the Markov model does a pretty good job recreating the ‘true’ energy landscapes. Figure 15 is a smoothed version of Figure 14 where the 10 bp oscillations are removed. In this graph it is easy to see that the blue MC and green Markov energy landscapes differ less than 1kBT from each other at most

positions. Figure 16 shows the same landscapes zoomed in on 100 bp to allow a closer look at the 10 bp oscillations showing that both the position and the amplitude of these oscillations are recreated almost perfectly. I did the same analysis at lower temperature. At 30K the Markov model performs as well as at 100K.

At 300K however the Markov model doesn’t do so well. Figure 17, 18 and 19 re-spectively show a full energy landscape, a smoothed and a zoomed in version of energy landscapes according to the Monte Carlo simulations (blue) and the trin-ucleotide Markov model (green) at 300K. So far our hypothesis has been that at higher temperatures the free energy gets dominated by the entropic contribution. When looking at the zoomed in version of the energy landscape it becomes clear that the 10bp oscillations have more or less the same amplitude in MC simulations as in the Markov model. This suggest that these oscillations are driven by energetics

(25)

Figure 14: Energy landscapes of the human gene TMEM176A according to different models at a simulation temperature of 100K. Blue is the energy landscape straight from the Monte Carlo simulation as described in section 2.2 using the hybrid param-eter set. Green is the landscape from the Markov model, everything else constant. Red is a landscape from the Markov Model but now with the Olson parameter set. At 100K the Markov model does a very good job recreating the energy landscape from the Monte Carlo model.

(26)

Figure 15: The same landscapes as figure 14 but now with the 10bp oscillations smoothed out using a Gaussian kernel. Now it is easy to see that the Markov model follows the Monte Carlo landscape pretty well, whereas the differences when using a different parameter set are much larger.

and not by entropy.

If the free energy gets dominated by the entropic contribution above a certain tem-perature, it is however unexpected that the free energy still looks so much like the energy from the MC simulation, something that clearly shows in figure 18. It seems that energy differences at larger scales are exaggerated by the Markov model. Would we scale down the energy landscape by a factor of 3, the two landscapes would again look remarkably similar.

(27)

Figure 16: The same landscapes as figure 14 but now zoomed on 100bp. Again green and blue very close whereas there are some deviations to the red curves.

Figure 17: Monte Carlo (green) and Markov (blue) energy landscape at 300K of the gene TMEM176A.

(28)

Figure 18: The same energy landscapes as in figure 17 but with the 10bp oscillation smoothed out using a Gaussian kernel. Here it can be seen clearly that the energy difference at larger scales are exaggerated by the Markov Model.

(29)

Figure 19: The 10bp oscillations have more or less the same amplitude in MC and Markov at 300K. This zoom is taken at a position where MC and Markov don’t differ too much, and is only made to show that the oscillations have about the same amplitude. This suggests that the oscillations are energy driven and are not affected by entropy.

(30)

Figure 20: Energy landscape of the human gene TMEM176A as calculated by the Monte Carlo model (blue), as calculated by fdMMC (green) and according to ftMMC (red) at 100K using the Olson parametrization.

3.2.2 Results from ftMMC

In section 2.7 we proposed an alternative to Tompitaks trinucleotide Markov model, the fixed trinucleotide Mutation Monte Carlo model (ftMMC). Unfortunately this model seems to be doing much worse in creating energy landscapes than the Markov model. This can be seen in figure 20 where we compare an energy landscape generated by the ftMMC model (red) to energy landscapes generated by the Monte Carlo model (blue). This figure doesn’t show the same energy landscape generated by the Markov Model but at this temperature this landscape just looks extremely similar to the blue Monte Carlo landscape, same oscillations and following trends very well. The energy landscape generated by ftMMC on the other hand is completely off. It places the 10bp oscillations at the right position, but gets the amplitude wrong and when it comes to larger trends in the energy landscape, there is absolutely no resemblance. The ftMMC model fixes a trinucleotide and tracks the energy of one of the base pair steps in this trinucleotide, and sums over the energies for all base pair steps to get to the full energy. When we use the plain Monte Carlo model to calculate energies, we fix every base pair and then sum over all base pair step energies. The ftMMC model could be changed to fix tetranucleotides, or even longer polynucleotides, eventually

(31)

keeping all base pairs fixed in which case we are back to the plain Monte Carlo model. So if there is a difference in outcome between ftMMC and MC, it has to be due to long range dependencies.

However, the trinucleotide Markov model also looks at trinucleotides and does a good job generating energy landscapes (at low temperatures). This model looks at how often trinucleotides occur at certain positions during a Mutation Monte Carlo simulation. If the results of our ftMMC model are valid, it seems we can conclude that these frequencies used by the Markov model implicitly carry long range information that the base pair step energies used in ftMMC do not.

Another thing to notice is that the energy landscapes produced by ftMMC have a lower absolute value than the ones generated by plain MC. This indicates that a base pair step energy can relaxate much further when nucleotides further down the line are allowed to mutate. Again suggesting that these long range influences are not negligible.

To test the hypothesis that long range dependencies are not negligible I decided to measure these effects directly in the Monte Carlo simulation. First I ran the MMC simulation as usual to measure the average base pair step energies at all 146 positions. Next I kept the first N base pairs fixed to random nucleobases, N being a random number between 20 and 126, and measured the base pair step energies from the Nth up to the N+20th base pair. I then subtracted the value these base pair step energies usually take when all nucleobases are allowed to mutate. This step is repeated 1,000 times to acquire accurate statistics. Average results for these 1,000 measurements are shown in Figure 21 This graph shows how much higher the base pair step energy will on average be if base pairs are fixed up to x base pairs prior to the base pair in question. It shows that correlation extends out to 3 base pair steps, at the 4th base pair step frustration is almost exactly zero. This suggests that in order to get a good approximation we need a fixed hexanucleotide MMC model.

The number of polynucleotides whose energies we need to measure goes up expo-nentially with the number of nucleotides we fix. In order to acquire accurate mea-surements (standard deviation less than 0.01 kBT ) we need to run the simulation

one million MC steps for every fixed polynucleotide, taking about 5 minutes. Figure 21 shows there is next to no correlation further than 3 base pairs out. So we could cut computation times by a factor of 10 by fixing ten polynucleotides instead of one along the 147 base pairs that make up a nucleosome. We could also run multiple simulations simultaneously which execute less MC steps each, and recombine the re-sults afterwards. If we want to measure all energies for all possible hexanucleotides,

(32)

Figure 21: A measure of how much a base pair step will be frustrated at position N+x where N is the position up to which point all base pairs are fixed. x=1 is the position of the base pair step just after the final fixed base pair. This graph is an average over 1,000 measurements each with a random value for N and base pairs randomly fixed.

this would still result in a total computation time of a couple of months which, as unfortunate as it is, are not available.

Figure 20 shows that the trinucleotide model is a slight improvement over the din-ucleotide model. Figure 22 shows the difference between the energy landscape as calculated by ftMMC and fdMMC. Even though the landscapes themselves don’t even remotely look like the energy landscape they aim to approximate, changing from fdMMC to ftMMC is clearly a move in the right direction. Since the difference between the ftMMC landscape and fdMMC landscapes looks so much like the actual landscape, it seems that the shape of this particular landscape is mainly an effect of the neighbors of the dinucleotide and much less so of the dinucleotide itself. These results give full confidence that this model would be able to approximate energy landscapes accurately when fixing longer polynucleotides.

(33)

Figure 22: In blue the landscape we try to approximate, in green the difference between ftMMC and fdMMC multiplied by 2 (assuming the nucleotide on the other side of the dinucleotide will have the same effect as the one we are now taking into account), and moved up by 180 kBT for better visuals. From this it is clear that the

(34)

4 Discussion

Biologist in general consider methylation to be a gene expression inhibitor [2]. We showed that this inhibition certainly doesn’t come from the mechanical properties of the methylated CpG step, since in all organisms we studied methylation of the promoter areas created a very strong nucleosome expelling effect which would result in increased gene expression. This finding however doesn’t necessarily conflict with knowledge from biology, since biologist attribute this gene expression inhibition to other reasons than mechanical properties of DNA. The quite bulky methyl molecule attached to cytosine blocks the machinery that reads out DNA. Gene expression and nucleosome positioning are not a one on one mapping of each other. There are many factors influencing gene expression, nucleosome positioning being only one of them. The result that energy landscapes of promoter areas switch upon methylation looks interesting. It is however not clear that this switching happens in reality. The results show that on average more than 30% of CpG steps need to be methylated for a nucleosome attracting region to become a nucleosome expelling region. As can be seen in figure 13 CG content is really high close to the transcription start site. This is because about 60-70% of human genes have a CpG island covering the TSS [13]. These CpG islands usually have very low methylation levels with even lower levels of methylation in coding areas. So even though global methylation levels in human DNA lay around 70%, the figure is probably much lower than 30% around transcription start sites. According to our results this generally is not enough to turn a landscape on its head.

I showed that it is possible to generate energy landscapes by fixing polynucleotides in the Mutation Monte Carlo simulation. So far I have thought of this as an alternative to the trinucleotide Markov model. However these two models really do two different things. One generates energy landscapes, the other free energy landscapes. Free energy eventually determines where nucleosomes go. However, if we want to study the role of entropy in nucleosome positioning on a genome wide scale, it probably will be very useful to have a model both for creating energy landscapes and free energy landscapes.

In none of our current models the free energy of the ’free’ DNA prior and after the nucleosome is taken into account. The suspicion is that taking this into account will alter the free energy landscapes significantly. When DNA is allowed to move freely, entropy can take much higher values to the point where it might dominate the free energy landscapes. Entropy wants to be high and will therefore favor less stiff DNA.

(35)

This can be seen as a tug-of-war between energy and entropy both competing for the less stiff parts of the DNA. Entropy keeps the flexible parts free of nucleosomes whereas energetics put the nucleosomes right at those flexible parts. Would the free energy of the free DNA be incorporated in the current models, results might change drastically.

(36)

5 Conclusion

The main research topic of this thesis has been to investigate the influence of lation on nucleosome positioning in promoter areas. Results clearly show that methy-lation affects nucleosome positioning drastically and expels them under all circum-stances. However, we can’t jump to the conclusion that methylation therefore en-hances gene expression. Firstly methylation levels are usually much lower in gene coding areas than in non-coding DNA. Secondly biologist agree that methylation inhibits gene expression for reasons other than the mechanical properties of a methy-lated CG step.

I proposed a new model to calculate energy landscapes. It measures the base pair step energy during a Mutation Monte Carlo simulation while keeping its near neighbors fixed. Results suggest that one could use these measurements to generate accurate energy landscapes given that the six nearest base pairs are fixed. Unfortunately there wasn’t enough time available to actually generate this data, measuring base pair step energies for all possible hexanucleotides at all possible positions would require several months of computation time. This model could be very useful if one wants to study the role of entropy in nucleosome positioning on a genome wide scale.

(37)

Acknowledgements

I would like to express gratitude towards Prof. dr. Helmut Schiessel for having the patience to be my supervisor, for the interesting and fruitful conversations we had and for offering the opportunity to work on such interesting subjects. Advice offered by Martijn Zuiddam helped me a lot in acquiring and working with the necessary genetic data. Special appreciation goes out to Behrouz Eslami for his helpfulness and kindness. Conversations we had gave direction to - and inspiration for - a lot of the research done in this thesis. I wish you all the best.

Finally I would like to extend my thanks to Federica Battistini from the Molecular Modelling and Bioinformatics group of IRB Barcelona for providing me with the rigid base pair parameters of the methylated CpG step, without which most of this research would not have been possible.

References

[1] Segal E, Fondufe-Mittendorf Y, Chen L, Th˚astr¨om A, Field Y, Moore IK, et al (2006). “A genomic code for nucleosome positioning”. Nature. 442: 772–778. doi: 10.1038/nature04979 PMID: 16862119

[2] Brenet F (January 2011),“DNA methylation of the first exon is tightly linked to transcriptional silencing”. PLoS One. 6(1):e14524. doi: 10.1371/jour-nal.pone.0014524. PMID: 21267076

[3] Wang YP, Lei QY (May 2018). “Metabolic recoding of epigenetics in can-cer”. Cancer Communications. 38 (1): 25. doi:10.1186/s40880-018-0302-3. PMID 29784032.

[4] Lund G, Andersson L, Lauria M, Lindholm M, Fraga MF, Villar-Garea A, Ballestar E, Esteller M, Zaina S (July 2004). “DNA methylation poly-morphisms precede any histological sign of atherosclerosis in mice lacking apolipoprotein E”. The Journal of Biological Chemistry. 279 (28): 29147–54. doi:10.1074/jbc.m403618200. PMID 15131116.

[5] Gonzalo S (August 2010). “Epigenetic alterations in aging”. Journal of Ap-plied Physiology. 109 (2): 586–97. doi:10.1152/japplphysiol.00238.2010. PMC 2928596. PMID 20448029.

(38)

[6] Ehrlich M; Gama Sosa MA; Huang L-H.; Midgett RM; Kuo KC; McCune RA; Gehrke C (April 1982). “Amount and distribution of 5-methylcytosine in human DNA from different types of tissues or cells”. Nucleic Acids Research. 10 (8): 2709–2721. doi:10.1093/nar/10.8.2709. PMC 320645. PMID 7079182.

[7] Eslami-Mossallam B, Schram RD, Tompitak M, van Noort J, Schies-sel H (June 2016) “Multiplexing Genetic and Nucleosome Position-ing Codes: A Computational Approach”. PLoS ONE 11(6):e0156905. doi:10.1371/journal.pone.0156905

[8] Lankas F, Sponer J, Langowski J, Cheatham TE III (2003). “DNA basepair step deformability inferred from molecular dynamics simulation”. Biophys J. 85: 2872–2883. doi: 10.1016/S0006-3495(03)74710-9 PMID: 14581192

[9] Olson WK, Gorin AA, Lu XJ, Hock LM, Zhurkin VB (1998). “DNA sequence-dependent deformability deduced from protein-DNA crystal complexes”. Proc Natl Acad Sci USA.; 95: 11163–11168. doi: 10.1073/ pnas.95.19.11163 PMID: 9736707

[10] Luger K, M¨ader AW, Richmond RK, Sargent DF, Richmond TJ. (1997) “Crystal structure of the nucleosome core particle at 2.8 ˚A resolution”. Nature. 1997; 389: 251–260. doi: 10.1038/38444 PMID: 9305837

[11] Tompitak M, Vaillant C, Schiessel H (Februari 2017). “Genomes of Multicel-lular Organisms Have Evolved to Attract Nucleosomes to Promoter Regions”. Biophys. J. 112.3 505-511 [79]. doi:10.1016/j.bpj.2016.12.041

[12] Becker NB, Wolff L, Everaers R. (2006) “Indirect readout: detection of optimized subsequences and calculation of relative binding affinities using different DNA elastic potentials”. Nucl Acids Res. 34: 5638–5649. doi: 10.1093/nar/gkl683 PMID: 17038333

[13] Bird AP (March 1985). “CpG-rich islands and the function of DNA methyla-tion”. Nature. 321 (6067): 209–13. doi:10.1038/321209a0. PMID: 2423876

(39)

Appendix A: Parameterset

Methylating the CpG step will affect the geometric and mechanical properties of 10 base pair steps. Both the equilibrium values and stiffness matrices of these 10 base pair steps are given here. These parameters were provided by the Molecular Modelling and Bioinformatics group of IRB Barcelona. All parameters are in units of ˚angstr¨om, radians and kBT and are given in the order shift, slide, rise, tilt, roll

and twist.

A.1: Equilibrium values

A − Cm = 0.5980 −0.8530 3.4030 0.0105 −0.0476 0.5659 T − Cm =0.9900 −0.0200 3.3400 0.0663 −0.0192 0.6685 C − Cm =0.8100 −0.4700 3.3950 0.0616 −0.0005 0.6247 G − Cm =0.6450 −0.5700 3.4150 0.0340 −0.0428 0.6065 Gm− A =−0.9250 −0.0650 3.3400 −0.0620 −0.0179 0.6624 Gm− T = −0.5900 −0.9500 3.4400 −0.0070 −0.0471 0.5655 Gm− C = −0.5450 −0.7050 3.4600 −0.0327 −0.0362 0.6048 Gm− G = −0.8500 −0.4400 3.3900 −0.0628 0.0017 0.6301 Cm− Gm =0.0740 −0.0060 3.0140 0.0063 0.2236 0.4677 Gm− Cm =0.0000 −1.0700 3.5000 0.0000 −0.0501 0.5431

A.2: Stiffness Matrices

A − Cm =         2.8882 −0.8659 0.2735 −0.7737 1.9343 −1.7409 −0.8659 8.0686 5.6734 0.4836 −0.8704 −15.9580 0.2735 5.6734 20.1125 −0.7737 11.7025 −16.4416 −0.7737 0.4836 −0.7737 288.1516 11.0828 −11.0828 1.9343 −0.8704 11.7025 11.0828 188.4068 55.4138 −1.7409 −15.9580 −16.4416 −11.0828 55.4138 338.0240        

(40)

T − Cm =         4.0411 −1.6424 −1.3150 −5.7062 −2.1277 −2.6113 −1.6424 7.2449 2.5168 3.4817 −5.0292 −26.0164 −1.3150 2.5185 20.2307 35.0109 4.0620 −14.1204 −5.7062 3.4817 35.0109 321.3999 11.0828 −33.2483 −2.1277 −5.0292 4.0620 11.0828 155.1586 49.8724 −2.6113 −26.0164 −14.1204 −33.2483 49.8724 371.2723         C − Cm =         3.9600 −1.7319 −1.1056 −9.0912 −3.2883 −10.0584 −1.7319 5.5907 2.7852 4.3522 0.0967 −20.6971 −1.1056 2.7852 19.5994 33.7536 5.4161 −17.8923 −9.0912 4.3522 33.7536 354.6481 5.5414 −49.8724 −3.2883 0.0967 5.4161 5.5414 144.0758 22.1655 −10.0584 −20.6971 −17.8923 −49.8724 22.1655 315.8585         G − Cm =         3.6917 −1.8315 −0.5672 −8.4142 −1.0639 −3.5785 −1.8315 7.9100 5.5366 3.8686 6.3832 −17.8923 −0.5672 5.5366 21.4444 −0.9672 16.9252 −17.4087 −8.4142 3.8686 −0.9672 282.6102 −22.1655 −33.2483 −1.0639 6.3832 16.9252 −22.1655 171.7827 22.1655 −3.5785 −17.8923 −17.4087 −33.2483 22.1655 321.3999         Gm− A =         3.7507 1.6104 1.2373 −5.5128 3.2883 4.7390 1.6104 7.1554 2.6316 −4.0620 −4.2555 −25.4361 1.2373 2.6316 20.1210 −35.0109 3.6752 −16.4416 −5.5128 −4.0620 −35.0109 310.3171 −11.0828 33.2483 3.2883 −4.2555 3.6752 −11.0828 160.6999 44.3310 4.7390 −25.4361 −16.4416 33.2483 44.3310 349.1068         Gm− T =         3.3676 0.3899 −0.5722 −1.9343 −0.6770 1.6442 0.3899 7.6922 5.7645 1.0639 −3.4817 −15.7646 −0.5722 5.7645 20.1378 1.6442 11.5091 −17.6989 −1.9343 1.0639 1.6442 282.6102 −16.6241 5.5414 −0.6770 −3.4817 11.5091 −16.6241 205.0310 60.9551 1.6442 −15.7646 −17.6989 5.5414 60.9551 326.9413        

(41)

Gm− C =         3.7541 1.7606 0.3832 −9.2847 1.1606 2.5146 1.7606 7.2854 5.6092 −3.9653 4.7390 −17.3120 0.3832 5.6092 20.9920 1.2573 16.2482 −17.3120 −9.2847 −3.9653 1.2573 277.0689 22.1655 33.2483 1.1606 4.7390 16.2482 22.1655 177.3241 27.7069 2.5146 −17.3120 −17.3120 33.2483 27.7069 299.2344         Gm− G =         3.8588 1.4838 0.8035 −9.3814 3.8686 9.7682 1.4838 5.1720 2.5928 −3.2883 0.0000 −20.6971 0.8035 2.5928 18.7317 −30.8522 4.2555 −18.6660 −9.3814 −3.2883 −30.8522 338.0240 −5.5414 49.8724 3.8686 0.0000 4.2555 −5.5414 149.6172 27.7069 9.7682 −20.6971 −18.6660 49.8724 27.7069 315.8585         Cm− Gm =         3.5621 −0.2708 −0.1210 −5.1971 0.3887 −0.7830 −0.2708 6.0147 0.2463 −0.4544 0.3176 −9.0132 −0.1210 0.2462 16.7807 −0.6283 −20.2088 −24.7654 −5.1971 −0.4544 −0.6283 219.5396 3.8105 3.3835 0.3887 0.3176 −20.2088 3.8105 147.0062 61.0432 −0.7830 −9.0246 −24.7768 3.3835 61.0432 163.9726         Gm− Cm =         4.2180 0.0550 −0.0355 −4.9228 −0.5252 −0.0851 0.0550 8.1280 5.4380 −0.1112 −5.6646 −23.4515 −0.0355 5.4380 21.8654 0.1045 16.4145 −16.6873 −4.9228 −0.1112 0.1054 303.5566 0.3879 0.1662 −0.5252 −5.6646 16.4135 0.3879 284.1618 68.3252 −0.0851 −23.4515 −16.6882 0.1662 68.3252 336.0291        

Methylation in Nucleosome Positioning

CpG Methylation in

Nucleosome Positioning

Thesis

Abstract

Contents

1

Introduction

2

Methods

2.1

Nucleosome Model

2.2

Energy landscapes from a Monte Carlo model

2.3

Markov Chain Approximation

2.4

Mutation Monte Carlo simulation

2.5

Mutation Monte Carlo with CpG Methylation

2.6

Markov with Methylation

2.7

An Alternative to the Trinucleotide Markov Model

3

Results

3.1

Methylation

3.2

Investigating the Markov Model

4

Discussion

5

Conclusion

Acknowledgements

References

Appendix A: Parameterset

A.1: Equilibrium values

A.2: Stiffness Matrices