Sequential dependencies of Watson-Crick-to-Hoogsteen Transition

(1)

Sequential dependencies of

Watson-Crick-to-Hoogsteen Transition

Jasper den Duijf

1

_{, Jocelyne Vreede}

2 12_{University of Amsterdam}

ABSTRACT

The nucleotides in DNA can form pairs in two different sets of stable configurations, namely the Watson-Crick pairing or the Hoogsteen pairing. The Watson-Crick-to-Hoogsteen transition between these states can take place over the course of few microseconds and can be studied through various computational models. However, these models are computationally expensive, especially when simulating larger, more complex structures. It is hypothesised that every structure and every DNA sequence might influence the Watson-Crick-to-Hoogsteen-transition differently. So sequence dependency might determine characteristics of the transition and.

In this study the use of Transition Path Sampling (TPS)(1) takes a central role to analyse the Watson-Crick-to-Hoogsteen transition. Techniques to generate initial paths as well as the model itself are explained. Furthermore, the scope of this study is aimed at the nucleotide sequence in DNA and its effect on the Watson-Crick-to-Hoogsteen transition. In this research the stability of DNA structures consisting of five base pairs is confirmed by in silico experiments. By placing an Adenine - Thymine (AT) rolling base pair between two flanking base pairs on either side, 256 different 5bp strings can be generated. 15 of these have been shown to be stable in both states and to allow for transitions between those states. Then a systematical comparison is made between these transitions. The findings addressing the sequence dependence of the transitions as well the characteristics of these transitions are given below. The results reported in this work can support the future research into the Watson-Crick-to-Hoogsteen transition.

INTRODUCTION

Deoxyribonucleic acids (DNA) can be found in every kind of living organism. In 1953 Francis Crick and Frederick Watson unravelled the exact lay-out of these molecules(3). DNA most commonly occurs in its double stranded form (dsDNA). dsDNA consists of two nucleotide chains, which are held together by hydrogen bonds. Each chain consists of nucleotides, which in turn consists of a five-carbon sugar group, a phosphate group and a heterogenic base. The

phosphate and sugar groups of these nucleotides form the backbone of the DNA. The hetrogenic base can take four different forms in DNA: adenine (A), guanine (G), cytosine (C) and thymine (T). Each base forms a base pair with an complementary base in the opposite DNA strand. Adenine pairs with thymine (AT) and cytosine with guanine (CG). The base pairs are held together through hydrogen bonds; a weak electrostatic interaction between an elecotronegative atom and a hydrogen atom bound to another electronegative atom(2). Watson and Crick formalized a standard numbering convention for all atoms in a nucleotide chain. They observed that in AT base pairs the hydrogen bonds are formed between the N6 and N3 atoms of adenine with the O4 and N1 in the thymine base. Consequently, this is called a Watson-Crick (WC) state.

Due to the linear nature of DNA, its sequence can easily be encoded as text. As example (GTATG)2 denotes the order of nucleotides in the first stand: guanine, thymine, adenine, thymine and guanine. As each base has exactly one complementary base that it bonds with, the sequence captures the complete atomistic structure of the DNA molecule. However, the spatial arrangement of the atoms in a molecule, also known as the molecule’s configuration, can vary. Different configurations can be aggregated to sets. Sets in which all configurations meet certain conditions are called states. For example the WC state entails all configurations in which the the N 6_A−O4_T and N 3_A−N 1_T hydrogen bonds are formed. A state is considered stable, if a molecule over a prolonged time after entering the state keeps meeting the state conditions.

In 1963 Karst Hoogsteen showed that another stable state of DNA exists(10). In this state two hydrogen bonds were formed between the N6 and N3 atoms of adenine with the O4 and N7 in the thymine base. Most of the AT base pairs in DNA were revealed to spend several hundred µs in the Hoogsteen (HG) configuration(5). Infrequently the transition between Hoogsteen and Watson-Crick state - the Hoogsteen-to-Watson-Crick transition or Hoogsteen transition - occurs where the molecule changes from one of the states to the other. The base pair in which one of the bases rotates is called the rolling base pair. Due to its infrequent occurrence this transition is classified as a rare event

Recent research of Nikolova showed that a chemical reaction was influenced by the configuration of one of the base pairs (8). In this paper a certain reaction occurred around the glycosidic bond, which is the covalent bond between a base and sugar group, only when it was rotated to a certain degree to the HG state. This showed the relevance of the Hoogsteen configuration. Other important features of the HG state have been shown by Zhou in 2015 (5), showing that in certain DNA sequences the Hoogsteen configuration played a role in the

(2)

replication process of DNA.

It is worth noting that the base pair is wider in a configuration in the HG state (10.1 ˚A) than one in the WC state (8.2 ˚A) when measured from the glycosidic bonds of both bases. This alters the directionality of the backbone, which in turn was shown to affect the interaction of the DNA with other proteins. Different transition mechanisms were observed in different DNA structures(5).

Computer simulations can be used to recreate DNA configurations and transition mechanisms that occur in empirical experiments. The research of Vreede (7) is build upon the the DNA sequence of Nikolova (8) and used it to study the mechanisms of the transition and stability of the states through the use of rate constants. Molecular Dynamics (MD) simulations make it possible to analyse multiple configurations over time. In discrete time steps, the configuration of a set of molecules gets calculated based on the previous configuration. MD simulations permit analysis on small time and space scales (picoseconds and nanometers). For larger systems or longer time frames MD falls short, however. As rare events usually require large time frames to be observed, other methods have been developed to efficiently sample them. These methods include TPS, TIS, umbrella sampling and steered MD. These methods provide a useful tool to study such rare events

The aim of this paper is twofold. Firstly, to analyse the methods and techniques commonly used in molecular simulations. Simulations are computational expensive and a proper understanding upfront will help researchers to make a choice in the techniques they use. The properties of those techniques differ and can have an influence on the sampled transition mechanisms. Methods and parameter settings are compared in such a way that future work can build on the results of this paper.

The second aim is to using the previously mentioned techniques but varying the observed sequence of DNA using a systematic approach. Different mechanisms might be observed for different sequences. As the neighbouring base pairs might have an influence on the rotating base pair. different sequences might display different properties. This paper compares the Hoogsteen transition for a number of sequences. The systematic search through all combinations of sequences is a difficult task, especially for longer sequences as the numbers of combinations grow exponentially with the length of the sequence. Therefore, this paper will start with an analysis of the DNA chain with the lowest number of nucleotides that still allows for stable HG and WC states and then starts to explore all possible combinations. The output of simulations based on different sequences can then be compared.

This introduces the central research question in this paper: How do different nucleotide sequences affect the Watson-Crick-to-Hoogsteen Transition?

This paper answers this question by reporting different transition mechanisms and by showing in which sequences these mechanisms appear. It reports findings of 16 different sequences and a total of 100 ms of simulations. It discusses, explains and compares certain simulation techniques, as well the results of these simulations.

In the theory section a more in depth view about the molecular simulations and models (MD simulation metadynamics, TPS)

Figure 1. An example of a AT base pair. Noted are the three parts: the phosphate di-ester groups with 3’,5’ linkages and the deoxyribose form the backbone of the strand (grey) and the base adenine (purine, red) and thymine (pyrimidine, blue). The dotted lines illustrate the hydrogen bonds. This base pair is currently in a Watson-Crick configuration.

Figure 2. An example of a AT base pair in Hoogsteen configuration. The adenine is in a spatial position allowing the N7 of the adenine to form a hydrogen bond with the N3 of the thymine, as depicted by the lowest dotted line.

is given. In the methods & materials section the technical implementations like used libraries and parameters. In the results & discussion section the results of the research are shown comprehensible and these data are interpreted. Eventually in the conclusion section the insight of this works will be summarised.

THEORY

MD simulations

Molecular Dynamics (MD) makes it possible to simulate how a system develops over time on the atomistic level. MD requires an initial configuration, with the positions and velocities of all particles. The structure of the DNA defines the initial configuration of the DNA molecules. The initial configurations used in this work were generated by the Web 3DNA 2.0 tool (4). To equilibrate the system positions and initial velocity of the molecules in the solution were randomly sampled from a random distribution that takes into account the initial temperature.

Further a mathematical description of how the particles interact with each other is required. For this purpose so called force fields are being defined. A force field describes all interactions between atoms/particles and has been normalised

(3)

by physical observations. It describes the potentials of the bonds, valence angles, dihedral angles, van der Waals contributions, electrostatic contributions and other chemico-physical interactions and parameters. Having the force field as well as the position of all particles in the system, any atom’s potential can be calculated. Taking the derivative of these potentials, gives the forces acting on an atom at a given time point. The forces can then be used to update the velocities of the particles, which in turn are used to update their positions. This way the positions and velocities of the whole system at the next time step can be calculated simply by following Newtons laws of motion. This way, the system can be iteratively simulated over time. As velocities and positions are only updated once per time step, choosing smaller step sizes will yield a closer approximation of the system. The configuration at a certain time step is called a frame or snapshot. Multiple frames in temporal order are the intermediate steps of a trajectory.

The MD simulations are done in an upfront defined space. The computational assumption made is that the simulated space is surrounded only by regions which share the exact same conditions. With these periodic boundaries only a smaller space needs to be simulated, which can reduce computational time.

As the initial configuration might be unnatural high forces due to closely spaced atoms, an energy minimisation step needs to be performed. Energy minimisation repositions atoms to decrease their potential. Position restraints might apply, penalising certain atoms in order to keep their position fixed. Position restraint is most often used to favour the repositioning of the solution atoms over the repositioning of atoms in the in the solute.

Metadynamics

As said before a state is stable when over time the system is expected to remain in a certain state. If the initial frame of an MD simulation is in a stable state, there is a high chance that the that the system will remain in this state over the course of the whole simulation. This is due to the high free energy barriers that separate stable states like a multidimensional border. A higher barrier makes it less likely to cross from one state to another. Configurations on the free energy barrier have a higher chance to go to the nearest stable state than to cross the barrier to the other stable state. The highest point on the free energy barrier is called the transition state. On both side of this transition state lies a barrier region and a stable state, and a configuration is prone to go to this stable state over time.

To generate trajectories that have frames in both stable states, the metadynamics method is used. The premise is that during a MD simulations a biased potential is added to the force field. These Gaussian hills guide the configuration in a specified direction. If the proper potential is added to the force field, it is more likely that during a simulation both stable states are reached. It was shown (7) that a biased potential on the dihedral of the glycosidic bond can cause this transition.

Rotation axis

Figure 3. An adenine base and a piece of the backbone. In this figure the dotted line depicts the axis around which the base is rotated and the green dot the center of mass of the weight.

Committor binary search

In the previous section it has been mentioned that a configuration in a stable state has a high probability of remaining in the stable state and a lower probability of crossing the free energy barrier. However, for configurations near the transition state the gap between those probability is smaller. Using a configuration close to the transition state as the initial configuration of a MD simulation, allows reaching both states. Such a configuration close to the transition state is called a transition configuration. The chance of the system ending up an a given state is called the committor of that state. Committor analysis is frequently used to study these configurations close to the transition state.

So it is computational easier to find trajectories from the transition configuration to the stable state. By reversing the velocities in every frame and reversing the order of frames, a trajectory starting in a transition configuration and ending in a stable state can be converted to a trajectory starting in a stable state and ending in the transition configuration. As the energy and impulses are preserved, the trajectory is physically valid. By generating two trajectories from the transition configuration to both stable states, inverting one and joining them together, a valid trajectory connecting the two states can be generated.

The problem of this method is that the transition state is not easily defined. Both the free energy barrier and transition state are multidimensional and no direct formula is known. However, the committor binary search can be used to find a transition configuration. It establishes a one dimensional collective variable (CV) that influences the initial configuration of the MD simulation. Depending on the value of this CV, the initial configuration changes as does the probability to reach one of the stable states. This CV has an interval between two values with a high probability to reach each state. If a good CV is chosen, there is a value on the interval where reaching both states is equally likely; this CV value is linked to a set of configurations high up the barrier close to the transition state.

The committor binary search uses an algorithm that is similar to the binary search algorithm. It starts with a given interval of the CV. Taking the values at the end points of the interval, it

(4)

generates a set of initial configurations to run MD simulations until a stable state is reached. The paths generated by these simulations are labelled according to the stable state most of the paths end in. Then the value at the center of the interval is selected to generate new initial configurations and paths. Again this value is labelled according to the state in which most paths end. One of the borders of the interval is subsequently moved to coincide with the middle point. Of the two borders, the one which shares the same most common final state as the middle value is the one that will be replaced. This way the interval will continue to span a range that has a CV corresponding to one state on the one end and a CV corresponding to the other on the other. By repeating this process of halving the interval, eventually a configuration will be found where both stable states can be reached from one configuration, and these two trajectories can be joined to one trajectory between the stable states. (See figure 6).

For the Watson-Crick-to-Hoogsteen transition we define a rotation angle τ . The rotation takes place around the axis that goes through the atom in the sugar group C1’ and the centre of mass of the base (see figure 3). All atoms in the base rotate around that axis with angle τ . The choice of this axis ensures that the length of the glycosidic bond remains the same and the centre of mass remains in the same position.

Stability Testing

In order to confirm that a certain state is stable, an MD simulation with a duration of 200 ns was run with an initial configuration in the stable state. However, before testing whether the WC and HG state are stable whether dsDNA in general remain stable in our simulation and does not denature. In this case the stability means that the DNA backbones are held together near the original configuration. To assess this stability the number of hydrogen bonds was counted at any time step. This number of hydrogen bonds should stay near close to 2 hydrogen bonds per AT pair and 3 hydrogen bondS per GC pair.

Both metadynamics and committor binary search are capable of generating an unphysical trajectory with at least some frame in the HG state. However, if one frame with an initial configuration in the HG state is needed, there is an easier way. The positions of a nucleotide with an adenine base from a previously found configuration in the HG state can be taken. Then with the use of ordinary least squares (OLS), the rigid transformation of this nucleotide is found where the positions of the atoms of the phosphate and sugar group are as close to the original positions. Replacing the original atom positions with the new found positions gives a configuration in the HG state or close to the HG state.

Within the DNA pi-stacking can occur. This is an interaction between the bases as their aromatic rings are stacked. This attraction can take place when two purines, or two pyrimidines are aligned with each other. The structural integrity of a DNA can benefit from pi-stacking.

Transition Path Sampling

Transition path sampling (TPS) is a method for sampling transition paths. TPS can generate a large number of

decorrelated paths in a computationally efficient way. The method uses as input one trajectory connecting the two defined stable states. On this trajectory, a so called shooting move is performed. In this shooting move a frame of the path is selected, and its the velocities are randomized. Then a new sub-trajectory, starting from this frame is generated and attached to the original path. However, to this deterministic process is a stochastic thermostat added. The thermostat adds a perturbation to the system, so different trajectories from the same starting configuration are generated.

It repeats the process, generating for each iteration a trial trajectory aimed at one of the states. If the trial reaches the state, the trial is added as a branch to the solution space. A new snapshot can be selected from the successfully added trial. Two trials to the different states can be added to each other - similar as in the Committor binary search - forming a transition path between the states. When these two transition paths do not share any frames, they are uncorrelated.

To limit the amount of time an MD simulation can run, a threshold is chosen, and any trials exceeding this threshold will be stopped and marked as rejected. The trial is accepted when it reaches its aimed state. The percentage of accepted trials is called the acceptance probability. The acceptance probability, and the number of uncorrelated paths are both indicators how effective the TPS has been.

METHODS & MATERIALS

Software

All MD simulations were performed with the GROMACS engine. The program is employing the parmsbc1 force field and the TIP3P water model form the AMBER14 package. This package is specialised for MD simulation on proteins and nucleic acids. It can be combined with the PLUMED library in order to perform metadynamics.

Most other functions are written in Python. The libraries used are the OpenPathSampling (OPS) library for TPS. This library integrates the GROMACS engine. For analyses are both OPS as the library MDtraj used.

All the code, including an instructions manual about all experiments, can be found on Github1_.

Parameter settings

A number of parameters of the MD simulation were taken from previous research(7). Certain settings were changed as empirical tests showed promising results. During the energy minimisation, the tolerance was increased to relocate atoms with a potential higher than 100 KJ/mol/nm. The length of the MD simulations depend on the type of experiment, but the time step size was held steady on 2 fs. The configuration was stored every 1,000 time steps.

During the metadynamics, the heights of the Gaussian hills was set to 0.05, the widths to 0.35 and the hills were added every picosecond.

(5)

Analysis

In this paper the configuration is expressed by a number of parameters. In the rolling base pair several distances were measured between the bases: the distance between N6 and O4 (dBS), N3 and N1 (dW C), and N3 and N7 (dHG), respectively in adenine and thymine. The d_HGand d_{W C}can be combined together into one parameter λ = arctan(d_HG,dWC). This parameter was used as a reference to show the position of the base pair.

When drawing a line between the center of mass of the rolling base and the centre of mass of the sugar group, and antoher line between the centre of mass of the sugar group and the centre of mass of the four adjacent nocleotides, the angle between these two lines is captured as the parameter

θ.Itisusedtodef inehowf artherollingbasehasmovedf romtheotherbases.Inpreviousresearchθ was used to distinguish inner and outer paths.

Proof of stability of the WC state

Figure 4. Number of hydrogen bonds formed in three-base pair DNA molecules over the course of a 200,000 ns molecular simulation. All base pairs start in the WC state. The histograms show the count of frames with specific numbers of hydrogen bonds. The bottom left corner of each plot lists the sequence of the DNA. Blue bars: Count for three-base pair molecule Orange bars: Count for three-base pair molecule, extended by one GC pair on either side. On the x-axis are the number of intact hydrogen bonds, on the y-axis the frequency.

Proof of stability of the HG state

Figure 5. Number of hydrogen bonds formed in three-base pair DNA molecules over the course of a 200,000 ns molecular simulation. The base pair in the middle starts the simulation close to the HG state. The histograms show the count of frames with specific numbers of hydrogen bonds. The bottom left corner of each plot lists the sequence of the DNA. Blue bars: Count for three-base pair molecule Orange bars: Count for three-three-base pair molecule, extended by one GC pair on either side. On the x-axis are the number of intact hydrogen bonds, on the y-axis the frequency.

Committor binary search example

Figure 6. Committor binary search: At every angle of the initial configuration ten simulations were performed. The differently colored area of the bars indicate the number of simulations ending up in either of the states, or none at all. Bars with more than ten results indicate, that no definitive result existed after the first set of simulations and more runs had to be performed. The green arrow represent the algorithms pattern of narrowing the search interval. The inset depicts the rotation angle region between 125 and 130 for a better visual inspection. The x-axis shows the rotation angle of the adenine in degrees.

Table 1. Characteristics of the trajectories generated by the metadynamics simulations.

Sequence Start of the transition (ps) Length path (ps) Transition type GAAAG 3,846 10 Inner GAACG 6,402 362 Inner GAAGG 5,432 68 Inner GAATG 8,248 2,126 Inner GCAAG 8,382 8 Inner GCACG 6,964 3,766 Inner GCAGG 5,812 122 Inner GGAAG 9,978 6 Inner GGACG 10,688 5,226 Inner GGAGG 9,472 2,566 Inner GGATG 7,354 4,042 Outer GTAAG 6,606 394 Inner GTACG 7,142 20 Inner GTAGG 17,852 10 Inner GTATG 7,146 42 Inner

(6)

Table 2. Characteristics of the Transition Path Sampling Sequence Number of trials Average path length (ps) Acceptance probability GAAAG 498 1,379 0.291 GAACG 499 289 0.319 GAAGG 742 394 0.319 GAATG 685 1,336 0.309 GCAAG 747 9 0.424 GCACG 243 2,811 0.218 GCAGG 506 609 0.328 GGAAG 746 6 0.475 GGACG 200 2,837 0.180 GGAGG 746 273 0.293 GGATG 250 529 0.224 GTAAG 746 273 0.293 GTACG 496 1,727 0.288 GTAGG 249 14 0.393 GTATG 748 142 0.315

(7)

TPS results: Distances of hydrogen bonds

Figure 7. Each subplot of this figure represents an overview of the different sequences and illustrates the transition reducing the configuration into two variables. The histogram show how often a path crosses a certain combination of hydrogen bond distances, the darker the colour the more frequently the grid point was visited. Linear interpolation is used to complete the configuration between the frames.

RESULTS

Stability

Theoretically, the shortest DNA molecule in which a Hoogsteen transition could occur consists of one base pair. However, as this work was aimed at identifying the effect of neighbouring base pairs on this transition the 1 base pair DNA was disregarded. Also note that, while not directly proven in our research, it appears unlikely, that the hydrogen bonds between the two nucleotides would remain stable, without neighboring nucleotides.

In this study, the sequence of three base pairs have the AT base pair fixed in the middle position linked to a base pair on both sides. Through alternating these neighbouring bases, 16 different sequence combinations were produced. In all of these configurations, the middle base pair is the rolling base pair undergoing the transition.

Each of these sequences is extended with one additional GC pair to either side, producing a five base pairs molecule. GC base pairs were chosen to flank this sequence, as they reinforce the structure by adding six additional hydrogen bonds compared to the four provided by an additional AT pair. The guanine of these reinforcing GC pairs lie on the

same strand as the adenine of the rolling base pair. As a consequence these flanking base pairs do not contribute to an increase in the number of possible sequences. The general form of notation is (GXAYG)2where the base on the placeholder X and Y can vary.

In order to test DNA molecule with different sequences for their structural stability, two MD simulations are run for 200 ns for each of the molecules. During the MD simulation the number of hydrogen bonds between the three central nucleotides are counted. The hydrogen bonds of the two flanking GC pairs are excluded from this calculation to be comparable with three letter sequences. Stability does not mean that over the course of the simulation the number of hydrogen bonds has to remain the same. However, if the number of hydrogen bonds remains close to the maximum number of hydrogen bonds, this indicates that the configuration is stable and the chains are kept together. The results are shown in two figures, these differ in initial configuration. Figure 4 shows the simulation results when initialising the system in a WC state, whereas figure 5 shows the results for a system starting in the HG configuration. The most noticeable is these graphs is that the simulation without

(8)

TPS results: Opening base angles and arctan values

Figure 8. Each subplot of this figure represents an overview of the different sequences and illustrates the transition reducing the configuration into two variables. The histogram show how often the path takes a certain parameter value for both θ and λ as described in the last section of the methods. The darker grid points indicates more often frames of trajectories meet these parameter values. No linear interpolation has been used.

the reinforcing CG pairs have more frequently lower numbers of hydrogen bonds. or all of these three-base pair sequences eventually all hydrogen bonds break an the DNA denatures. These DNA chains were therefore not stable enough to be used in this work. The 5 base pair DNA doublestrands showed a higher level of stability. During the simulation nearly all of those remained connected by at least half of their maximally possible number of hydrogen bonds.

In the case of the five base pair sequences starting in the HG state, fifteen out of sixteen sequences were found to be stable. During the 200 ns simulation time, the number of hydrogen bonds never drops below half of the maximum bonds. Further, the full set of hydrogen bonds remained formed in more than 50% of all frames. The simulations of the HG configuration show less definitive results then these of the WC configurations. As the first simulation for (GGACG)2, (GGAGG)₂ and (GCATG)₂ indicated instability, these runs were repeated. However, even in this second simulation, the (GCATG)₂ DNA was still unstable. Therefore this sequence was removed from further results.

The most important finding is that the minimal sequence length for our research was 5 base pairs, as the three base pairs

are all too unstable to remain connected over the course of longer simulations. As a consequence, all further experiments are based on five base pairs with GC as the first and last base pair.

Another finding is that the WC configurations stay relatively stable during the simulation”, while the HG configurations do not. This might be caused by the initial configuration, as the rigid transformation by OLS minimisation used to generate these structures does not guarantee to produce fully formed hydrogen bonds. This created HG configuration is therefore from the beginning of the simulation less stable than the initial WC configuration.

TPS can only work with two stable states. If one of the states of interest is not stable, the method fails. As the results indicated that the (GCATG)₂ molecule is not stable, it was excluded from the further experiments, which relied on TPS. However, the findings do not disprove the possibility of this molecule also possessing a stable HG state. Even though it is theoretically the least stable sequence, as it does not have any pi-stacking, there are two arguments for the presence of a stable HG state in (GCATG)2. First of all, for (GTACG)₂ the symmetrical counterpart the HG state is stable, which suggests that (GCATG)2 should also have a

(9)

possible transition, even though it has not been found yet. Secondly, for two other of the sequences the stable HG state has only been found in the second simulation. This indicates that a single simulation is not sufficient to disprove the existence of the stable state. Further computational resources should be allocated to finding or disproving -a st-able HG st-ate for (GCATG)2, in the future. This work however, will contain only the incomplete set of 15 sequences.

Generating an initial configuration

As written in the theory section, in order to run a TPS simulation an initial trajectory is required. This initial trajectory could have been produced via different methods, e.g. metadynamics of committor binary search. As a large number of trajectory is needed to gain a better understanding of the neighbourhood dependency of the WC-HG transition, the most efficient and reliable method for generating such trajectories should be selected.

Figure 6 shows the results of the committor binary search on (GTATG)2. It mainly serves as a proof of concept.To obtain the initial configuration, the standard WC configuration was modified. All neighbouring base pairs were bent away with 15◦to avoid collisions with the rolling base pair. Simulations were started in the WC state (a rotation of 0◦) and were restricted to a length of 20 ps. Ten trajectories form a sample for each turning angle. Eventually, the configuration with a rotation of 128.3◦was found to display three subpaths to the HG state as well as one subpath to the WC state. Combining these allowed for the generation of three transition paths. All simulation combined were a little over 160 times 20 ps, resulting in 3.2 ns summed MD simulations to find a transition path.

The metadynamics results are given in the chapter results discussion, see table 1. In a simulation with a duration of 20 ns a initial transition for TPS was found after 7,146 ps, with a path length of 42 ps.

Both methods generated a transition path. In the preliminary test, metadynamics required the most computational time. However, it is difficult to draw conclusions about the computational expenses and reliability of these methods after performing only one test. In both cases the initial trajectories are results of successful experiments. However, using committor binary search to find a trajectory on (GCACG)2, the trajectories almost never reached any of the stable states and was eventually terminated after 4 ns of unsuccessful sampling. With metadynamics, each of the 15 combinations was successfully simulated, see 4. Each simulation had a duration of 20 ns. The sequences (GGAAG)₂and (GTACG)₂ required two and three simulations, respectively, to find a transition. In terms of reliability and computational time, both methods have been proven to work. However, they might not be perfectly reliable and the computational time might show some outliers.

Another property of the transition paths found is that the committor binary search tended to produce inner paths as the initial configuration always had a small θ value. Metadynamics might show a preference for inner paths, but produced outer paths as well (see ).

Transition Path Sampling

With metadynamics the initial paths have been created for the 15 of the 16 possible sequences. The properties of these paths are found in table 1. The table shows a variety of paths as well as their lengths. The type of transition was labelled based both visual inspection and the calculated theta. These paths were then used for TPS.

The aim of the project was to calculate 750 trials for each sequence. Unfortunately, due to hardware problems as well a lack of computational time, for certain sequences less trials per sequence were performed. In table the number of simulated trials are listed together with the acceptance probability and the average path length. note that three sequences display an acceptance probability below 0.25. Due to this low number, to few paths were generated to obtain significant results. These paths are thdrefore removed from the remainder of the analyses.

Based on the distribution of the paths two groups can be distinguished. The group of sequences generating long paths is defined as a group with heavy-right tails and an average path length longer than 500 ns, as opposed to the group with short paths. The sequences in the first group include (GAAAG)2, (GAATG)₂, (GCAGG)₂, (GGAGG)₂ and (GGACG)₂, while the remainder of the sequences falls into the second group. Figures 7 and 88 show path density plots. Path density plots show how many paths cross a certain combination of parameters. It can be used to gain an understanding of the mechanism during the transition. Darker colours indicate that a bin was more often visited. The linear colour scale is normalised to the number of accepted paths, so similar coloured bins in different subplots share the same percentage of accepted paths, but not necessarily absolute counts. These density plots both use interpolation between the frames, but also normalises so that each trajectory only contributes once per bin.

Figure 7 shows the length of the characteristic WC and HG hydrogen bonds. MD simulations already showed that the stable WC state lies around (WC bond length = 0.3 nm, HG bond length = 0.5 nm) and the stable HG state lies around (0.5 nm, 0.3 nm). In the density plots only the path between these two states are shown. A number of patterns are recognisable when comparing each of the sequences:

(GAAAG)₂displays a elliptical spot around (0.6 nm, 0.6 nm). The ellipse is tilted, indicating that the two hydrogen bond distances are positively correlated. The witdh is 0.1 nm in the direction of both parameters. This pattern is referenced by the name mechanism 1.

(GAACG)₂ shows a high density connecting (0.3 nm, 0.5 nm) and (0.5 nm, 0.3 nm). The highest density lies around (0.35 nm, 0.5 nm). This pattern is referenced by the name mechanism 2.

(GAAGG)₂ displays a high density around the point (0.45 nm, 0.3 nm). It is flattened with a spread of 0.1 nm on the WC distance, but less than 0.5 nm on the HG distance. This pattern is referenced by the name mechanism 3.

(GAATG)2displays a amorphous shape lying within intervals [0.6 nm, 0.75 nm] and [0.4 nm, 0.6 nm] for the WC and HG distances, respectively. This pattern is referenced by the name mechanism 4.

(10)

mechanism 1. These are no exact copies, but the description as previously given matches the pattern in the respective subplots.

(GCAGG)2displays mechanism 1, but also has a an additional region of lower density aroung (0.7 nm, 0.45 nm). This second region is called mechanism 5, so the sequence shows two mechanism as well as a bridge between them.

(GGAAG)₂displays mechanism 2.

(GGACG)2displays mechanism 1. But even more paths cross the bins around the HG distance of 0.8 nm and between 0.5 and 0.65 for the WC distances. These bins form a spot called mechanism 6.

(GGAGG)2 forms a combination of mechanism 1 and a slightly less dense mechanism 3.

(GGATG)₂ has a high density similar like mechanism 1. (GTAAG)2shows mechanism 5.

(GTACG)₂ is spread over the complete area. The created pattern is on both parameters on an interval over 0.3 nm wide. The highest density is around (0.6 nm, 0.6 nm) like for mechanism 1, but it does not show the same elliptical shape, thus it is classified as a different pattern called mechanism 7. (GTAGG)2 is classified as mechanism 2, as it shows the pattern of a line between both states with the correct density. At last (GTATG)₂ has a dense ellipse at (0.6 nm, 0.4 nm), referenced as mechanism 8. Above that is a less intense ellipse yet classifiable as mechanism 1.

Examining the subplots has resulted in seven distinguishable types of patterns. In the next section these patterns will be linked to a molecular occurrence and sequences with the same type of mechanisms will be analysed.

Sequence Mechanism Average path length

Pi-stacking

GAAAG Mechanism 1 Long Full

GAACG Mechanism 2 Short One-sided

GAAGG Mechanism 3 Short Full

GAATG Mechanism 4 Long One-sided

GCAAG Mechanism 2 Short One-sided

GCACG Mechanism 1 Short No

GCAGG Mechanism 1 + 5 Long One-sided GGAAG Mechanism 2 + 6 Short Full GGACG Mechanism 1 + 6 Long One-sided

GGAGG Mechanism 1 + 3 Long Full

GGATG Mechanism 1 Short One-sided

GTAAG Mechanism 5 Short One-sided

GTACG Mechanism 7 Short No

GTAGG Mechanism 2 Short One-sided

GTATG Mechanism 1 + 8 Short No

Table 3. Table used to determine sequential dependencies. In the second column are the described mechanism, in the third column the division where a long path length is on average longer than 500 ns. The fourth column implies how many purines are theoretically in the sequence and thus if pi-stacking is possible.

Now a set of mechanisms is defined, the aim of this study is to see if the corresponding sequences match. The sequential dependencies this research focuses on are symmetrics, pi-stacking and direct neighbours.

The hypothesis of direct neighbours sequential is that a certain base directly next to the rolling base results is a

specific mechanism. To test this, the sequences with a certain direct neighbour should show the same mechanisms.

Secondly the hypothesis exists that the pi-stacking in the DNA influences the mechanisms in the DNA. The sequences where the purines are lined up might act differently than the ones where this does not happen, due to an extra bond between the bases. Then the observed sequences fall in three categories; five purines in a row, three purines in a row or no pi-stacking.

Another hypothesis about a symmetric couple of sequences ((GXAYG)2 and (GYAXG)2) is that they show similar patterns and mechanisms. This can confirm a certain sequential effect. This hypothesis can be tested by checking whether or not both sequences of a couple result in the same mechanisms.

See table 3 for a summary. With the revealed mechanism and several possible divisions, cross checking is possible. The main focus lies on mechanism 1 and 2, as more than two sequences revealed these mechanisms and makes confirming and rejecting the hypothesis slightly easier. The first hypothesis states that a single neighbour might cause a certain mechanism. Looking at Mechanism 1, all the neighbouring bases are represented in the set of corresponding sequences. Also mechanism 2 is seen in path density plots generated by sequences with four different neighbouring bases. It is not possible to point out one neighbour responsible for the mechanism.

The second hypothesis is regarding the pi-stacking. Mechanism 1 is seen in plots belonging to five purines, three purines and none purines in a row. So independent of the amount of pi-stacking in the neighbourhood, this mechanism can still take place. Mechanism 2 only appears if at least three purines are rowed up in the sequences.

Eventually symmetrical couples show in a few cases the same mechanism. Both (GCAGG)2and (GCAAG)2are categorised as the same mechanism as their symmetrical counterparts. However, the other four symmetrical couples are not classified so that both sequences show the same mechanisms.

There is another appearance noteworthy, all symmetrical sequences - (GXAXG)2 where both neighbours next to the rolling base are the same - are categorised as mechanism 1.

After the mechanisms are described and controlled, none of the previous hypotheses are completely confirmed. No certain sequential property results in a certain mechanism as far as the data shows. However, this does not mean the hypothesis can easily be disapproved. A few augmentation points require discussion.

First of all, there are a few issues with the data. The data sets are not all the same length. The number of accepted paths varies between 36 and 314. More simulation would be preferable. The shorter paths have more representation, as they require less computational time. Ideally a larger sample could give better insight and significant results. A new, larger batch of paths are available at the end of this year, generated by the same methods, for improved research. This study can be used for indications and reverence material.

The variance in the number of accepted paths between sequences also results in a slight misleading aspect in the path density plots. The mechanisms are selected based on the

(11)

intensity of the plots. However, the same amount of path does not reflect the same colour level on the colour scale, as the scale is normalised based on the grid point with the maximum number of paths crossing for the subplots sequence. As such, it is possible that for two similar sequences that at only one a mechanism is visible even when the same number of paths are passing a certain configuration. This is only because another more frequent mechanism can overshadow in this way. The decision is made to show the subplots this way to focus on the most common mechanisms for each sequence.

Another problem with the data set is that there is an inconsistency in the method used to recreate it. The dataset is based on two different TPS simulations, one with 250 and one with 500 trials. Unfortunately, not all these simulations were finished by the end of the study, due to both computational time and hardware failures. This might be possible that the results for different sequences come from one or both simulations. This might cause a difference, as both simulations start based on the same initial path and thus might cause biased sampling. All TPS accepted paths are only included in the study after the first uncorrelated path found, which counter effects the biased sampling.

Secondly, there are problems with the methods to classify different mechanisms. These are currently based on the visual assessment of the researcher and are thus not objective. There is no exact definition established of these methods, rather a crude estimation of the HG and WC distances. Unfortunately, other possible classifying methods would require more density plots, especially when at least eight different mechanisms exist.

But even if there is a good match based on the density plots, this does not necessarily mean that all the correlating paths are indeed showing the same mechanisms. Manually checking the configurations shows that paths initially classified as the same still behave differently. This is either because of two reasons: it only shares parts of the transition, so part of their paths cross. This is especially possible with the longer transitions, they often have a configuration that is also visible in the shorter transitions. However, it can also be that two different configurations still share the same HC and WC distances. Both can be solved by adding more variables to the classification. For example the base opening angle has not been used yet in the classification, but can be crucial to classify inner and outer paths. The former can be solved by adding the path length to the classification process.

Now these results are named and described, the correlation between the sequences and the TPS results is sought. However, a connection between the path length and the neighbouring base is not found. Every neighbour generates longer and shorter paths, and a specific rule can not be found. In a similar way no sequences with a lot of pi-stacking does not have a certain outcome on the type of length of the transition.

Certain relations are found through the process between the initial trajectories and the results of the TPS. For example there is a small positive correlation between the length of the initial input and the average path length of the TPS. In previous work(7) is shown that the choice of initial path might influence the type of outer path, as inner paths generate both inner paths and outer paths, but outer paths never inner

paths. However, the TPS of (GGATG)₂ starts with an initial outer path and generates both inner and outer paths. For two simulations with initial inner paths only inner paths are generated, even with an relatively high acceptance probability (0.393 and 0.475).

All previous notions of inner and outer paths are calculated by analysing the distribution of the value of θ in every frame gained from all the TPS uncorrelated paths, as shown in figure ??. The dividing line is located around a base opening angle of 34◦, where the frequency reaches the lowest local minimum. During this paper most labelling of outer paths were done through visualising and human observing. However, this plot raises another definition: all outer paths have at least one frame where the base opening angle is bigger than 34◦. This corresponds with all previous labelled outer paths.

The straight line patterns seem all to be inner path rotations with a very short average path. The ellipse patterns are all characteristic for longer average paths. Contrary to intuition however, the frames and configurations in the ellipse are not parts of outer paths. This is also visible by combining figures 7 and 8, the densities of the ellipse correspond with a dense part with low θ value.

Most important knowledge taken from these plots is that different mechanics are distinguishable, but that no direct relation can be drawn between the mechanics and the sequences based only on this data. More TPS trials are necessary for significant results. The procedure on TPS and especially the choice of initial trajectories might have a large influence for the output of the TPS, which should be taken into consideration in future work.

CONCLUSION

In this paper the Watson-Crick-to-Hoogsteen transition is studied in small DNA structures through the use of MD simulations, committor binary search, metadynamics and transition path sampling. The smallest stable base pair sequences are simulated in order to alternate the neighbours of the rolling base and thus get systematically comparable results for 16 sequences. The main research question central of this study is: What are the sequential dependencies on the Watson-Crick-to-Hoogsteen Transition?

The first part is analysing and understanding the different models used to study the transition. The molecular dynamics simulations are used to test the structural stability of the DNA sequences. Two methods to alter the configuration to find a configuration in a HG state are discussed. The metadynamics simulations required a few parameter changes due to the short length of the DNA, but resulted in initial transfiguration. In addition, the committor binary search has also been capable to generate these paths. Eventually the transition path sampling is used to map the transitions and are density path plots created to analyse these results.

It is shown that at least five base pairs are required to simulate the stable states and the transition, leading to 15 different yet comparable sequences. The simulations did not find the HG state to be stable for the sequence (GCATG)₂. That others were stable might be due to pi-stacking, or the sequence need more simulations. The committor binary search was successful in one case for the sequence (GTATG)2 by

(12)

TPS results: Length of trials

Figure 9.

selecting the rotating angle τ , where a rotation of 128.3◦ lead to an initial path. However the metadynamics simulation turned out to be more reliable and at least even computational expensive, so it was used to generate the 15 initial trajectories. A combination of metadynamics and transition path sampling proves to be the fastest way to generate multiple trajectories of the transition.

The study include over 100 ms of transition path sampling results for 15 different sequences. The sequences can be divided into two groups; with average long transition paths and average short ones. Then at least eight different mechanisms can be distinguished based on the distances of the hydrogen bonds. Researching the correlation between sequences and the output of the simulations, a few indications can be made to the sequential dependencies on the transition. A direct effect of the sequence on the transition however can not be made based on only this data. There is clearly a distinction possible in the types of Watson-Crick-to-Hoogsteen transition and this research shows that dependencies might be found through the used methods with more computational power. Hopefully this study will be used to continue the research into the Watson-Crick-to-Hoogsteen transition.

ACKNOWLEDGEMENTS

This research is commissioned by the Computational Chemistry Group.

REFERENCES

1. Bolhuis, Peter Chandler, David Dellago, Christoph Geissler, Phillip. (2002). Transition Path Sampling: Throwing Ropes over Rough Mountain Passes, in the Dark. Annual review of physical chemistry. 53. 291-318.

2. Creeth J.M., Gulland J.M. and Jordan D.O. (1947) Deoxypentose nucleic acids, part III. Viscosity and streaming birefringence of solutions of the sodium salt of the deoxypentose nucleic acid of calf thymus. J. Chem. Soc. 214, 1141–1145

3. Watson, J., Crick, F. (1953) Molecular structure of nucleic acids: a structure for deoxyribose Nucleic Acid. Nature, 171, 737 – 738.

4. Li, Shuxiang and Olson, Wilma K and Lu, Xiang-Jun (2019) Web 3DNA 2.0 for the analysis, visualization, and modeling of 3D nucleic acid structures. Nucleic Acids Research, 47(2), 26 - 34.

5. Zhou H, Hintze BJ, Kimsey IJ, et al. (2015) New insights into Hoogsteen base pairs in DNA duplexes from a structure-based survey. Nucleic Acids Research, 43(7) 3420-3433.

6. Lamm E, Harman O, Veigl SJ (2020) Before Watson and Crick in 1953 came Friedrich Miescher in 1869. Genetics, 2, 291 – 296.

7. Vreede J, Perez de Alba Ortiz, et al. (2019) Atomistic insight into the kinetic pathways for Watson–Crick to Hoogsteen transitions in DNA Nucleic Acids Research, 47(21) 11069–11076.

8. Nikolova, Evgenia N, et al. (2011) Transient Hoogsteen base pairs in canonical duplex DNA Nature, 470(7335) 498-502

9. Duan Y, et al. (2003) A point-charge force field for molecular mechanics simulations of proteins based on condensed-phase quantum mechanical calculations Computational Chemistry 24, 1999 – 2012.

10. K. Hoogsteen (1963) The Crystal and molecular structure of a hydrogen-bonded complex between 1-Methylthymine and 9-Methyladenine Acta Crystallographica, 16, 907 – 916.

11. Yang, C., Kim,E. and Pak,Y. (2015) Free energy landscape and transition pathways from Watson–Crick to Hoogsteen base pairing in free duplex DNA. Nucleic Acids Research, 43, 7769–7778.