Analysis of generative reversible neural network for sampling of chemical transition states

(1)

Analysis of generative

reversible neural network for

sampling of chemical

transition states

(2)

Layout: typeset by the author using LATEX.

(3)

Analysis of generative reversible neural

network for sampling of chemical transition

states

Ivan Simsic-Babic 12205079

Bachelor thesis Credits: 18 EC

Bachelor Kunstmatige Intelligentie

University of Amsterdam Faculty of Science Science Park 904 1098 XH Amsterdam Supervisor Dr. S. van Splunter Informatics Institute Faculty of Science University of Amsterdam Science Park 904 1098 XH Amsterdam Jan, 2021

(4)

1 Abstract

The sampling of rare events in computational chemistry is a challenge due to the vast amount of simulation times required to sample a rare event without the use of biased algorithms to simulate molecular dynamics. Recently, software was developed which specializes in the sampling of a certain rare chemical event, namely the transition state. Finding the transition state of a chemical reaction and the circumstances under which it occurs is important as this provides insight into the conditions that need to be met in order for a certain reaction to occur. This thesis analyzes a reversible generative neural network designed to sample stable protein states and aims to lay out the steps in order to use it to sample transition states of a certain chemical reaction. The reversible generative neural network in question is designed in a way that the algorithm trains by combining standard Machine Learning approaches with statistical mechanics. This approach is an efficient way of implementing molecular dynamics because it incorporates the laws of physics which dictate the measures of stability of certain molecular configurations. The aforementioned software will be used to generate a small amount of transition state samples, on which the neural network will be trained in order to sample transition states, instead of the stable protein states it was originally designed to generate. This thesis will investigate the nature of the neural network and the adjustments that should be made in order to use it for this novel idea. Furthermore, it investigates the computations that need to be done on the simulates transition state data in order to make it compatible with the neural network.

2 Introduction

Artificial Intelligence (AI) is capable of performing versatile feats because of its self improving na-ture and the many different ways in which it has been implemented.

One of the lesser known applications of AI is Molecular Dynamics (MD), a computer simulation method used to analyze the properties of chemical systems without needing to perform physical ex-periments. Inside a MD simulation, an atom is not physically present yet its properties are modeled to emulate its real-life counterpart’s actions. Its shape is translated into x, y, and z coordinates, and its charges are represented as sets of repellent and attractive forces that interact with other particles in the simulation.

When all modeled properties are put together, large molecules in environments with hundreds of other atoms can be accurately modeled, and information from them extracted that is useful and otherwise hard to learn from physical world experiments. Logically, the vast amount of data in-volved with performing a single simulation also poses a problem. Simulations generally require high computation times even for small systems, and increasing the number of particles involved in a simulation will exponentially increase the amount of time and memory needed to accurately model a certain process.

In a MD simulation, a chemical process is stripped down to its underlying mathematical com-ponents, which are modeled into a computer. A single simulation contains an enormous amount of data, which is near impossible for a human to manually effectively analyze. Depending on the specific type of simulation, the time step may be a millionth of a second or less. In order to man-ually analyze the simulation of a single physical second, multiple days of work may be necessary. Even then, it is difficult for a human researcher to recognize the underlying patterns that dictate

(5)

parts of chemical reactions from spatial coordinates and velocities alone. Many properties of a chemical reaction are better understood by looking at angles between certain groups of atoms, or bonds between specific parts of a molecule when considering large molecules such as proteins. These properties are ones that are not easily described using just x, y, z coordinates, but must instead be extracted from these coordinates and other information contained within a simulation. When comprehending the vastness and complexity of data contained in a single MD simulation, it becomes clear how AI can help in the field of MD. One of AI’s many strengths lies in its ability to analyze and reproduce mathematical data. When translated into mathematical form in a computer, a large and complex process such as a chemical reaction lends itself well to the powerful analytic capabilities of AI. This project discusses the concrete ways in which AI can be applied to a certain MD problem. This goal of this project is to explore the enabling of modifying an existing reversible generative neural network which can sample stable protein states in order to sample transition states in certain chemical reactions. The neural network to be modified is called a Boltzmann Generator (BG)1_.

This is a reversible neural network which learns to transform a certain simple prior probability dis-tribution (e.g. the Gaussian disdis-tribution) to the probability disdis-tribution of the configurations of a molecule. It is the latter distribution which is learned and perfected as the neural network is trained. In its original function, the neural network learns to sample stable protein states. This refers to certain configurations of the protein molecule which are low in potential energy, and thus exist in a state of stability. The neural network is initially trained on a few metastable states of said protein, after which the network parameters are perfected to approximate the Boltzmann distri-bution of molecule configurations as closely as possible. It is this latter part of training in which the energy of a generated sample is used in the Boltzmann distribution to determine its proba-bility; this probability is then used to reweigh the neural network parameters. It is important to understand that there are two different kinds of training applied in the BG algorithm. The first kind is maximum likelihood estimation and is used to keep the algorithm from sampling the same lowest-energy configuration repeatedly. The second is Kullback-Leibler (KL) training, which is a certain formula used to measure the energy of a system. In physics, a ’system’ refers to a portion of the universe that has been chosen for studying the changes that take place within it in response to varying conditions. A system may refer to anything as simple as a few atoms or something as large and complex as all the particles that make up a planet.

This project, instead of sampling stable states of a large protein molecule, attempts to provide the necessary steps and information to sample the transition states of a certain chemical reaction. Namely, the transition between two stable states of the alanine dipeptide (ADP) molecule. The transition state refers to a configuration of the molecule that exists on a microscopically small scale of time, which makes it hard to observe in real experiments. Because the transition state must be passed when a chemical reaction happens, it acts as a bottleneck for the reaction. For that reason, finding out what the different possible transition states are and what causes them to appear is valuable knowledge in the field of chemistry.

An important difference between the sampling of stable molecule configurations and unstable transition state is the potential energy of the system. A stable molecule configuration is stable be-cause of the low potential energy it holds, meaning it will not tend towards a reaction if left alone;

(6)

for this reason, a stable state has a higher probability of existing. Conversely, a transition state is unstable and holds a high amount of potential energy, meaning it does tend towards a reaction in order to become stable and decrease its amount of potential energy. This also means that transition states have a low probability of existing. Not every state with a high potential energy is a transition state, however.

For these reasons, the functionality of the algorithm must be altered in order to not prioritize the lowest-energy, probable molecule configurations. Rather, the algorithm must find molecule con-figurations which are high in energy and low in probability, but all share the attribute that they are transition states in a certain reaction.

The challenges in this project are as follows; the right training and initialization data must be collected and converted to a suitable format for the BG algorithm. This means analyzing the BG, collecting training data and performing the necessary transformations and alterations on the data in order to set the stage for transition state sampling. In order to collect the training and initialization data for the BG algorithm the OpenPathSampling (OPS)2 _{software is used. OPS is}

a Python library that facilitates path sampling algorithms, this software is used to generate the initial ADP transition state data. After collecting this data the data must be converted and certain computations made in order to use the BG. This process, OPS, and the inner workings of the BG will be discussed in more detail in the sections below.

3 Detailed problem description

For the project at hand the transition of the ADP molecule from its C7eqto the alphaR state is

modeled using the OPS Python library. The specific workings of OPS are discussed in more detail in a later section. OPS is a piece of software which has been specialized to model chemical transition reactions. A transition in this context refers to a molecule going from one stable state to another.

The transition in question (C7eq to alphaR) is marked by changes in two dihedral angles, φ

and ψ. A dihedral angle refers to the angle between the first and last atom in a group of four interconnected atoms. A transition state in this reaction is marked by its high potential energy and its tendency to fall into one of either two aforementioned states with equal probability. Any state that satisfies this rule is a valid transition state. A certain state in this project refers to the specific spatial configuration of the molecule.

With the OPS software simulations of the ADP molecule transitioning from C7eqto the alphaR

state can be run. From these simulations the transition states can be extracted by performing a type of analysis that measures the probability of a certain snapshot falling into either stable state. If that probability is around 0.5 for both states, the selected snapshot is a transition state. This implies, correctly, that there are multiple states which can be an acting transition state to a singular transition (C7eqto alphaR). A common way to visualize the stable states and the transitions that

occur between them is by modeling them onto a Potential Energy Surface. Figure 1 displays this, representing the two stable states as the areas circled in orange, and the transition paths between them represented by the colored dashed lines.

(7)

Figure 1: Visual representation of the stable states and the energy barriers that separate them.3

When seen this way, one could argue that the transition states are being drawn from the tran-sition state surface.

Figure 2: Transition paths mapped against dihedral angle values4

The graph above demonstrates how paths move from one stable state to the next. What these ’paths’ are, and other specifics regarding the software, OPS, are explained in the section below. Note that here, the stable states are shown in a graph that represents the values of the ψ and φ dihedrals, which define the alphaRand C7eq states. These states are visually represented in Figure

3.

3_{(Jung et al., 2019)} 4_{(Swenson et al., 2019)}

(8)

Figure 3: ADP stable states5

Figure 4:Visual representation of BG functionality 6

The BG is a generative reversible neural network which can sample stable protein states. It is trained by two metrics, the second of which can be augmented with an additional metric. The first one is a form of maximum likelihood that tries to learn the spatial configuration of metastable

5_{(Strodel & Wales), 2008} 6_{(Noé et al., 2019)}

(9)

protein states. This is important to ensure that the second part of training does not get stuck repeatedly sampling the single lowest energy configuration. The second form of training attempts to learn the potential energy of configured states, in order to be able to sample the stable, low-energy protein states. The second form of training is referred to as training by low-energy, this is because here the Boltzmann function of the spatial configuration space of a molecule (BPTI protein) is approximated. It makes sense that this second metric of learning would be used to specify the exact Boltzmann distribution because it uses measures of potential energy within a system to correct itself. The functionality of the BG is represented visually in Figure 4.

3.1 Research Question

Knowing all of the above, it is reasonable to think the BG could be modified in a way to work with the data provided to it by OPS, in order to sample transition states. The first part of BG training is to be initialized by a few known transition states, then the second part is used to perfect the Boltzmann distribution for the ADP molecule. Finally, samples from the transition state surface can be drawn. The research question is to explore the possibility to adapt the approach of the original BG article7 _{to identify potential transition states for the ADP molecule. To explore the}

main research question, the following subquestions need to be answered: - How can OPS be used to create transition paths of the ADP molecule?

- How can transition states be extracted from these paths and edited for use with BG?

- Which additional operations need to be performed on these transition states to be able to train the BG?

4 Background

This section discusses OPS, the Boltzmann generator, and TICA.

4.1 OpenPathSampling and ADP molecule

OPS is a Python library designed to simulate certain chemical processes and thereby capture rare chemical events. It contains multiple different types of algorithms to simulate different chemical events and attributes, but the focus lies on simulating chemical transitions. Such a simulation describes a Markov chain model, moving from one state to the next. The reason OPS is useful is because it is optimized in a way that certain rare events are easily found and modeled.

CONCEPTUAL

A chemical transition refers to the change from one stable chemical state to a different stable state.

(10)

Figure 5: Reaction energy diagram for a simple substitution reaction8

This transition can entail a chemical reaction as it is commonly known, i.e. the reaction mod-eled above where the start and end states comprise of different molecules; it can also refer to the change from a certain spatial molecular configuration to a different one, without the chemical com-position of the molecule itself changing. In either case, the transition from one stable state to the next contains an unstable, high-energy chemical state which is known as the transition state. The transition states of a chemical reaction have very short lifetimes, only a few femtoseconds9_{, before}

it changes into the end state. As the transition state holds a high potential energy it is an infre-quent occurrence, since the laws of entropy state that physical systems tend towards states of lower energy. Despite this, the transition state is a crucial step to any chemical reaction and must always be crossed in order for a reaction to happen. For this reason, it holds significance as it represents the bottleneck for a certain chemical reaction to occur. Finding out what causes a chemical system to enter a transition state is tantamount to finding out what the underlying cause is of the chemical reaction the transition state is a part of.

Molecular Dynamics (MD) simulation is a concept that has existed since the early 1950s and has since been developed further for many different applications10_.

The idea behind MD is that by accurately representing the forces at work behind chemical pro-cesses, these processes can be modeled in a computer. This reduces the need for physical chemical experiments and allows scientists to closely monitor chemical reactions without being hindered by linearity of time, limits to physical perception of the reaction, or budget restraints. In order for a simulation to be accurate, the simulation program must contain information regarding the different atoms and their respective connections within a chemical system. This means information such as the weights and charges of, and the bonds between atoms are required in order to run an accurate simulation. By computing all active repellent and attractive forces per unit of time, the MD simu-lation becomes a complex, multidimensional mathematical process that approximates the workings of chemistry in the physical world.

A number of challenges exist with accurately modeling certain chemical reactions, however.

8_{https://www.masterorganicchemistry.com/2010/11/03/whats-a-transition-state/} 9_{(Schramm, 2011)}

(11)

First of all, a vast amount of memory is required in order to contain all the relevant information in a molecular system per unit of time11_{. For example, the initial transition path simulations that}

were run for this project in order to generate training/initialization data required information about 1651 different atoms (22 belonging to ADP, the rest to the solvent water molecules around the ADP molecule) to model the transition. On top of the information about which atoms are present and how they are connected (in MD this is provided through what is known as a topology file) the MD simulation also needs information about the interacting forces, and needs to compute these forces for every atom per unit of time of the simulation. There are different ways of modeling these forces. Most of the specific models are specialized in order for usage with a certain type of molecule, such as a protein or a ligand. These models are provided in files known as force fields.

Secondly, the orders of magnitude between individual time steps within a simulation and the span of time in which a targeted chemical reaction occurs can differ greatly, resulting in long computa-tion times for even small chemical processes. For example, the dissociacomputa-tion of a weak acid in water might occur with a half-life of around 1 ms, whereas the elementary steps of molecular motions in water occur in 1 fs12_{. In order to simulate this process, which would include the acid molecule and}

several hundred water molecules, 1 fs of real time might correspond to 1 second of computing time, meaning that with rudimentary MD methods finding one example of acid dissociation would take an exceedingly long amount of time.

For the aforementioned reasons it is important to make the MD process more efficient wherever it is possible. As stated before, the occurrence of transition states is a rare thing and so with rudi-mentary MD methods finding them can take an exceedingly long amount of time. The algorithms present in OPS attempt to bypass this problem by focusing directly on the transition state. This can greatly reduce the simulation time because the transition state represents the bottleneck that determines whether the chemical event in question occurs or not. The algorithm central to the functionality of OPS is called Transition Path Sampling (TPS)12_{, and is based upon statistical}

mechanics. TPS makes use of a technique called Importance sampling in order to prioritize these rare events. In order to sample a trajectory, a Monte Carlo calculation performs a random walk through configuration space of a certain system. Here, configuration space refers to the possible spatial configurations and momentums a certain chemical system can be found in. A configuration x is visited in proportion to its probability p(x), this relationship between a configuration and its probability is utilized in order to prioritize events that otherwise would take too long to occur in a normal MD simulation. This biased way of sampling from a configuration space is central to Importance sampling and this property is what is exploited in TPS.

Importance sampling can also be applied to entire trajectories as opposed to single configura-tions. If all trajectories belonging to a certain subset of the trajectory space, e.g. all trajectories that are 1 ps long, are sampled, only a small subset of these trajectories contains a transition event. Such events can be seen as the transition between certain states A and B, e.g. the C7eqand alphaR

states of the ADP molecule. These states are characterized by their respective population opera-tors, hA(χ)and hB(χ). χ refers to a certain point in phase space; meaning a combination between

configuration and momentum space. This combination allows a certain point to describe both the spatial configuration of a molecule and the kinetic energies it holds in a certain moment. When χis within region A, hA(χ) = 1, otherwise hA(χ) = 0. hB(χ) is similarly defined. A trajectory

is defined as a chronological sequence of phase space points, a transition path is defined as such a

(12)

trajectory that crosses over from one stable state into another. The statistical weight for such a transition path connecting states A and B is hA(χ0)ρ[χ(t)]hB(), where ρ[χ(t)] is the unconstrained distribution functional for trajectories. All information listed about TPS above can be analyzed in greater detail in the TPS paper12_{, TPS is then done by carrying out a random walk in trajectory}

space, using the statistical distribution hA(χ0)ρ[χ(t)]hB() as a bias in importance sampling, in order to prioritize the sampling of transition path trajectories. After an initial trajectory has been found, new trajectories are generated by initializing a point in the initial trajectory with a slightly altered momentum, and running the simulation from there on. This alteration of a point in an initial trajectory is referred to as a ’shooting move’, and is an integral part of the efficiency of TPS. The importance of OPS for this project is twofold; firstly the collection of training/initialization data of the BG, secondly it provides the tools to check if the results of the BG are valid transition states.

The transition from the C7eqto the alphaRADP state is the transition in question for this project.

These stable states of the ADP molecule are defined by the values of two dihedral angles, namely φand ψ. A dihedral angle is a chemical term which denotes the angle between the first and fourth atom in a sequence of four interconnected atoms within a molecule.

The OPS software is used to generate several MD trajectories that move from one of these stable states to the other. Afterwards, an OPS native analysis tool called the ’committor analysis’ is used to determine where in a given trajectory the transition state is, and extract it from the trajectory in order to use it as training data for the BG. The committor is a certain probabilistic attribute which refers to the probability that a trajectory will end in a certain state (being either C7eq or

alphaR) when started from a certain snapshot with a new, randomized velocity. If the probability

is around 0.5, this means that the analyzed state is a transition state because it exists right between the two stable states.

Several transition paths are simulated and their results saved, then after performing committor analysis the transition states within these paths are extracted and used to train and initialize the BG. After the BG is trained and samples are drawn, these samples can then be analyzed with OPS to determine whether the samples also represent transition states.

PRACTICE In order to successfully run OPS, several files are needed that describe certain information about the molecule to be modeled. Most important is the Protein Data Bank (PDB) file. The PDB file describes the topology of a molecular system, i.e. the molecules which are present in this system, and the atoms which comprise the molecules. The file describes the bonds between atoms and additional information which is useful for modeling the system. For more information on

(13)

PDB files, visit https://www.rcsb.org/. The PDB file is needed so that OPS knows which chemicals are present and how they connect to one another, in order to properly run simulations.

The second set of files are force field files. The force field files contain parameters for the forces that drive the interactions between particles that are modeled in OPS, and in MD in general. Parame-ters that are included in these files describe, among others, the energy between covalently bonded atoms, the energy of twisting certain bonds (torsions), etc.

Once these files have been obtained, the program can be run. In order to run properly, OPS requires you to set up an engine before running simulations. An engine takes the aforementioned PDB and force field files as input, and sets several parameters that will decide the behavior of the simulation. These parameters are initially defined for an OpenMM engine, which is then used as input for OPS to set up its own native engine. OpenMM is a commonly used toolkit for molecular simulation, and provides a basis for the functionality of many MD applications, including OPS and the BG. To learn more about OpenMM, go to http://docs.openmm.org/. The parameters that are passed to the engine include presets regarding the behavior of water, error tolerances, and other specifics. When running an OPS simulation, the ensemble to which transition paths belong must also be defined. In short, an ensemble defines a set of rules that transition paths must adhere to, e.g. that the transition starts in one state and ends in another, or the temperature that the transition paths are to be simulated with.

After defining an engine, the integrator must also be defined. An integrator refers to a certain im-plementation of a numerical integration scheme. i.e. a mathematical approximation to an ordinary differential equation. This is important because accurate approximations make the MD process faster and effective at low cost of accuracy. Having a reversible integrator is integral to proper usage in OPS, OPS cannot be done effectively if certain processes that happen in simulation cannot be reversed as transition paths are generated by moving back and forth from certain points in a trajectory.

The OpenMM engine and integrator are then used to initialize the OPS engine, which is saved as an object. After this, state definitions are provided in order to define transitions. The specifics to these definitions are provided in the Experimental setup section. Once the states have been defined, the next step is to set up the reaction network and move scheme. Simply said, the reaction network defines which transition is to be modeled (state A → state B), and the move scheme refers to the type of shooting moves which will be used in running the simulation. Once these have been defined, OPS is ready to run an initial trajectory, and from there on sample multiple transition paths as was described above.

When a simulation has been run, it is saved as a data object describing the transition paths that were modeled. This object describes all the paths that were sampled during simulation. A path is referred to in OPS as a Trajectory, consisting of multiple snapshots in temporal order. A snapshot describes the state of a certain molecule at a certain time, and contains information about it such as the 3D coordinates of all constituent atoms, and their velocities.

(14)

4.2 Boltzmann Generator

The original Boltzmann Generator algorithm was originally used to sample stable states of the BPTI protein13_{. The BG works by learning a transformation between a simple prior distribution}

and a probability distribution representing different molecular configurations.

The generator itself is a reversible neural network, this is important because the algorithm needs to be able to go back and forth between the prior distribution and the Boltzmann distribution of the molecular system that is being learned. The significance of transforming to and from the prior distribution lies in the Boltzmann distribution thereby being directly related to a simple measure of probability. This transformation between a simple probability distribution and a complex prob-ability is done by using the Normalizing Flows (NF) method.

Figure 6: Normalizing Flows14

Figure 6 visually showcases the step by step process of repeatedly transforming a simple prob-ability distribution to an approximation of the complex distribution that is to be learned. There are two key linear algebra concepts that are essential to understanding the workings of NF. Firstly, the Jacobian Matrix, and its determinant.

When looking at the formulae used in the BG to perform ML, and then KL training, they have one term in common. ML formula: JM L= Ex h 1 2kFzx(x)k 2 − logRxz(x) i

KL formula: JKL= Ez [u(Fzx(z)) − logRzx(z)]

The second term in both formulae refers to the determinant of the Jacobian matrix of the transformation from z (sample from prior distribution) to x (sample of probability distribution of molecular configuration). The Jacobian matrix is the matrix of all first-order partial derivatives of a function that maps from an input vector z to an output vector x. The determinant of this matrix is a measure of how much the multiplication by this matrix locally expands or contracts space.

13_{(Noé et al., 2019)}

(15)

Figure 7: Jacobian matrix15

Figure 7 showcases this. Let J be the Jacobian matrix: 2 0 0 2

When transforming from Z to X, the determinant of the matrix showcases the increase in area. This property of the determinant is important in calculating the weights needed to perfect the transformation from a simple probability distribution to a more complex one, as different parts of the simple prior distribution need to be scaled differently in order to generate a more complexly shaped probability distribution.

The second key concept in NF is the change of variable theorem. In mathematics, this theorem refers to the practice of replacing variables by functions of other variables, in order to simplify a problem. In this case, a random variable z must be replaced by a different variable, namely the transformation from x to z, Fxz(x), in order to train the algorithm. Here, the determinant again

plays an important role. p(x) = p(fxz(x))

det ∂fxz(x) ∂x = p(z) det_∂x∂z

The formulae above showcase the usage of the change of variable theorem in order to define the probability of a generated sample x. Its probability is defined by the probability of the trans-formation Fxz, which maps from distribution p(x) to distribution p(z), and is the inverse of the

transformation Fzx. The determinant of the Jacobian matrix of the Fxz is important in order to

preserve the property that all probability distributions must have an area of 1.

The Boltzmann distribution also describes a probabilistic phenomenon, namely the probability of encountering a particle at a certain energy level for a certain temperature. The graph below

15_{https://www.youtube.com/watch?v=i7LjDvsLWCgab}

(16)

illustrates the Boltzmann distribution at different temperatures.

Figure 8: Boltzmann Distribution for various temperatures16

The different colored lines represent different temperatures of a certain molecular system. The y-axis shows the number of particles and the x-axis shows the speed of particles in the system. The area below the curve is equal to the number of particles present. The probabilistic element lies in this curve representing the proportion of particles within a system that carry a certain amount of kinetic energy (speed). By increasing the temperature, this proportion also increases. Because the peak of the Boltzmann distribution shifts to the right when the temperature increases, this means there is a higher chance of finding particles at a higher energy level. Higher temperatures correspond with higher overall kinetic energy. When the particles in a system move more they are more prone to collisions with other particles, this makes it easier for particles to react with one another. It is for this reason that systems with higher energies are more unstable than those with lower energies; a system at low energy will have fewer interactions happening with its constituent particles and therefore is more stable.

The formula for the Boltzmann distribution is given by: pi∝ _kTei where pi is the probability of

the system being in state i, i is the energy of that state, and a constant kT of the distribution is the product of Boltzmann’s constant k and thermodynamic temperature T. Here, a ’state’ refers to both the spatial configuration of all atoms in a molecular system, and the way the energy is distributed throughout By utilizing this relationship between energy of a system and the probability of a certain configuration, the BG is able to sample the specific configurations that hold low energy.

The BG primarily makes use of two different error metrics by which it reweights itself. In training, z refers to a random vector sampled from a Gaussian prior distribution, and x refers to a proposal configuration, and is defined as x = Fzx(z).

4.2.1 Training by example

Training by example is important so as to ensure the BG does not simply repeatedly sample the one most stable metastable protein state. Training by example is done by initializing the BG with a

16_{https://en.wikipedia.org/wiki/Maxwell-Boltzmann}

(17)

few ’valid’ configurations x that have been acquired from short initial MD simulations/experimental structures. These configurations are transformed into latent space via z = Fxz(x). It is also this

phenomenon which ensures the low-energy configurations lie close together in the latent space rep-resentation. Maximizing their likelihood in the Gaussian distribution corresponds to minimizing the loss function:

JM L= Ex h 1 2kFzx(x)k 2 − logRxz(x) i

The first term in JML, 1/2||Fxz(x)||2, is the energy of a harmonic oscillator corresponding to

the Gaussian prior distribution, the second term refers to the determinant of the Jacobian matrix. 4.2.2 Training by energy

Training by energy is the key to the functionality of the BG. The BG measures the energy of a system and makes use of this measure to approximate the Boltzmann distribution of the system, as described above. The BG uses this training by generating a configuration from a proposal distribution px(x), and measuring the difference between this distribution and the Boltzmann

dis-tribution whose statistical weights e−u(x) are known. A natural measure of this difference is the

relative entropy, kor Kullback-Leibler (KL) divergence. The KL divergence is computed as follows: JKL= Ez [u(Fzx(z)) − logRzx(z)]

here, u(Fzx(z))refers to the energy of the generated configuration, and the second term once afain

describes the determinant of the BG’s Jacobian matrix.

4.3 TICA

The third training method provides an exceptionally useful tool for this project. In order to use it, one must perform time-structure independent component analysis (TICA)17_{on certain features of}

a MD simulation. This data is then used in the training of the algorithm as a reaction coordinate. In chemistry, a reaction coordinate is a one-dimensional coordinate which represents progress along a reaction pathway. In molecular dynamics simulations, a reaction coordinate is called a collective variable. In the ADP example, the collective values of both relevant dihedral angles conjoined into one variable would be a reaction coordinate that describes how the ADP transition goes. TICA finds the independent components of a piece of data by maximizing their statistical independence. In other words, it tries to separate the components from a dataset that are least influenced by each other. A famous example of independent component analysis (of which TICA is a specific application) is the cocktail party problem. In the cocktail party problem, a listener is trying to discern what someone is saying to them while there are many other speakers present in the same room. The sounds of the different speakers speaking simultaneously make it hard to single out any single person’s speech, thereby making it impossible to understand what anyone is saying. An effective application of independent component analysis would take the convoluted audio of multiple people speaking at the same time and output all different independent components of the audio; in other words, the individual speakers’ voices would be extracted, and their speech would become intelligible. The reason this is useful is because in MD simulations there can be enormous amounts of data relating to a single simulation, and it is a challenge to try to extract the relevant underlying causes of a reaction from the many data that are present. TICA provides a way to isolate the

(18)

components that are largely responsible for the process that is modeled in a certain simulation. By identifying and extracting these components they can be used within the BG algorithm to aid it in properly sampling ADP states that are contingent on the values of these components. The TICA process works by featurizing a MD simulation, and then running the TICA analysis on the featurized components.

5 Methods

The process of this project was as follows:

1. Download and install all necessary software for data generation and running of BG algorithm. OPS software, along with Openmm package, jupyter notebooks on BG and associated data files from original paper link. Additionally, several Python packages must be downloaded to perform necessary transformations/extractions on the generated training data.

2. Research BG algorithm to see which data are necessary to train the BG in order to extract the same data in proper format/transform into proper format from OPS simulations. 3. Run OPS software in order to gain sufficient amount of transition paths for ADP from

C7eqstate to alphaR state

4. Use OPS to select the appropriate frames that fall within the transition state boundaries from these trajectories using committor analysis. Collect these frames into one file and save them in NumPy array format, this format is used in the BG algorithm and constitutes the training data for the ML section of BG training.

5. Extract Z-matrix (Internal coordinate representation) that is used in BG to better work with dihedral angles, and insert this information into the appropriate BG files. The Z-matrices are important for both the BG and this project as the values of two ADP dihedral angles are essen-tial to defining the modeled transition. Write new PDB file that does not contain information on solvent molecules and convert trajectory information to allow TICA transformations. 6. Perform TICA transformations on the collected simulation data of transition states, the output

of these transforms give us a set of data that are useful in determining a reaction coordinate the BG corrects itself on.

6 Experimental Setup

This section describes in more detail the specific steps taken to do this project. This project was fully done in Linux.

6.1 Installation of necessary software

(19)

6.2 Analysis of BG algorithm

This step looks at the BG algorithm, specifically the training data the algorithm utilizes and the format thereof. This is done to find out the right data formats in which the new training data must be collected. After downloading the BG files from the link listed above, the notebook for running the algorithm can be found under Boltzmann/deep_boltzmann/deep_boltzmann/BG_BPTI. This folder contains the necessary training data to run the BG, and the notebook that collects the data and trains the algorithm.

The training data is split under two folders; ’data’ and ’tica_data’. ’data’ contains the training data that is used to perform the initial ML training on the spatial structure of a molecule. ’tica_data’ contains the TICA extracted information which can be used as a reaction coordinate to guide the algorithm. This part is not strictly necessary but highly recommended for efficiency and effective-ness of the algorithm.

The original training data is a NumPy array of dimensionality (150000, 2676). The first di-mension refers to the amount of training samples present. The second didi-mension refers to the x, y and z coordinates of every atom of the original molecule, which contains 892 atoms. This means that the initial training is done on a set of 3D-coordinates of the molecule that the algorithm is trained on. According to the original implementation of the BG, these samples are of metastable states of the molecule, and are meant to prime the algorithm for finding similar (albeit more stable) configurations.

Following the same logic, in order to use the algorithm effectively for the purpose listed in this thesis, the training data must consist of a certain amount of samples describing the 3D coordinates of transition states of the ADP molecule. The following steps describing this process are discussed in detail in sections 7.3-7.6. In order to connect the 3D coordinate data to the configuration of a molecule a PDB file must also be provided, describing the topology of the molecule at hand. The proper force field files must also be provided for the BG to correctly calculate the energy of a given molecule. Because the BG makes use of a mixed coordinate system, the internal coordinate representation for the ADP molecule must also be generated and provided to the BG for proper functionality.

6.3 OPS training data generation

Path sampling methods make use of molecular dynamics, hence the first step in setting up a transi-tion path simulatransi-tion is to set up a molecular dynamics engine. In this instance, OpenMM is used as the underlying engine. Setting up the engine means setting several parameters which dictate what molecules are present and how they are built, how the forces in a simulation interact, and which forces are present. The first necessary files are the ones that define the topology of the molecule of interest, and the files that determine which force fields will be used in the engine. The topology file is provided in the OPS tutorial notebooks which can be found online, and describes the structure of the ADP molecule, along with a large number of water molecules in order to simulate the behavior of ADP in water.

There are many different force field files to pick from, which all have different values for param-eters of the force field (e.g. force constants, equilibrium bond lengths and angles, charges). For this simulation the ’amber96.xml’ and ’tip3p.xml’ files are used. This engine runs in the NVT ensemble,

(20)

with T set at 300K. Note that normally an engine would be instantiated at higher temperatures in order to sample an initial transition path trajectory, and then equilibrated to simulate realistic molecular dynamics. Next, an integrator for the engine must be chosen. Here, the Velocity Verlet with Velocity Randomization (VVVR) integrator is used, which simulates Langevin dynamics. This is a reversible integrator. As mentioned before, reversibility of the integrator is essential to properly performing OPS simulations.

Now that the appropriate parameters, topology, and force field files have been chosen, an OpenMM engine can be instantiated. Then, a native OPS engine is made which uses the OpenMM engine along with all its parameters as input. Extra parameters such as the number of steps per frame, and the maximum number of frames per simulation can be chosen here. For this experiment, the maximum number of frames is set to 2000, the and the number of steps per frame is set to 10. The integrator is instantiated with 300K as mentioned above, then finally the engine is instantiated by combining all the aforementioned objects and parameters into one.

Now that the OPS engine has been properly defined, the ADP states between which a transition must be simulated must be defined. As mentioned earlier, these states are defined by the dihedral angles φ and ψ. These are angles along the protein backbone, defined for residue i as involving the following atoms: φi: C(i1), N(i), C (i) α , C(i) ψi : N(i), C (i) α , C(i), N(i+1)

To make use of the indices contained within the dihedral angle definitions above, the aforemen-tioned topology file is necessary, as it contains indices for every constituent atom. To make use of these angles they must be created as ’collective variables’ i.e. a single variable that contains the values of multiple other variables. To do this, a certain OPS function is used to combine the indices in the topology file along with an operation to compute dihedral angles.

The stable state definitions are as follows: C7eq : −180 ≤ φ ≤ 0 , 100 ≤ ψ ≤ 200

alphaR: −180 ≤ φ ≤ 0 , −100 ≤ ψ ≤ 0

This information is saved into two collective variables that define the states, once again using an OPS function to collect and save the information. The reaction network and move schemes are set up using the state definitions above. The reaction network is from C7eq to alphaR, the move

scheme is a one way shooting mover with a uniform shooting point selection.

The next step is the generation of the first trajectory. This can be one of the most challenging parts of transition path sampling. For this specific simulation, it can easily be done by running a simulation under high temperatures to get the first path going from one state to the next, then equilibrating said path into the lower temperature that will be used for the actual path sampling. Once an initial path has been made, running the actual transition path sampling simulation is simple.

The initial trajectory is saved as an object by using another OPS function. This initial tra-jectory, the move scheme, and initial conditions (initial trajectory and all of the parameters that applied to it, e.g. force fields, integrators, etc.) are supplied to the sampler, which is an OPS function that, when run, samples as many transition paths as is requested of it by the user. The

(21)

collection of these paths is then saved as a single simulation object, and can be freely recovered and analyzed.

6.4 OPS trajectory transition state extraction

Now that a sufficient amount of transition paths has been simulated, a series of analysis tools need to be run on the simulation data in order to extract the transition states from the simulation. These tools are run over the course of several notebooks provided in the OPS tutorial.

Firstly, in order to simplify the process of extracting transition states, the snapshots that do not exist in any of the aforementioned stable states are extracted and saved. This is because a transition state is a state that by definition cannot exist within a stable molecule state. In order to do this, the simulation data must be loaded and the state definitions selected from the storage data. Luckily, the OPS simulation storage object contains all information relevant to the simula-tion. Here, the state definitions can easily be accessed by accessing the ’volume’ attribute of the storage and indexing by the state name, like so:

stateA = storage.volumes[’C_7eq’] stateB = storage.volumes[’alpha_R’]

The simulation snapshots that do not belong to either state are selected and saved to a new object.

Now that the snapshots not belonging to either state have been saved, this data can be loaded and analyzed by running a committor simulation. As explained before, the committor is an at-tribute of a snapshot that determines its probability of moving into a certain stable state when initialized with a random velocity. Again, the stable states are defined so that the analysis can be run. The specific commands that are run in this analysis are discussed in detail in the Appendix section. The analysis calculates the committor value for every snapshot in the provided file. Note that performing this analysis is heavy on memory and will require a lot of time to run when done on a moderately large simulation file.

The final step is to select the snapshots that describe transition states. This is done by per-forming ’ShootingPointAnalysis’, an OPS native analysis tool. This analysis is done by loading the previously saved file and providing the analysis object with the stable state definitions. The results of this analysis are saved to an object, and from this object the transition state snapshots can be extracted by selecting the snapshots with a certain committor value. A transition state is defined as having a committor value of approximately 0.5. The limits of the transition state definition can be freely chosen, for this project every snapshot that had a committor value between 0.45 and 0.55 was considered to be a transition state.

The transition state snapshots are extracted and saved to an array. As the snapshots are OPS simulation objects they contain much data and cannot be used raw as training data for the BG. For this reason, the x,y,z coordinates must still be extracted from the selected snapshots. These

(22)

coordinates must then be saved in a NumPy array format of dimensionality (n_samples, 66). The number 66 refers to the x, y, and z coordinates of the 22 ADP atoms. Now, the x, y, and z coordinates of the ADP transition state frames are saved in the same format that is used in the original BG implementation.

6.5 Data transformations and manipulations

In order to properly run, the BG needs some information that cannot be directly extracted from OPS. Three minor alterations need to be done, then finally the TICA transformations which are discussed in the section below.

Firstly, the ADP PDB file must be altered to only contain information regarding the ADP molecule. For proper simulation, the water molecules listed in the original file are important. For the BG, however, only the structure of the ADP molecule is needed. The ’PDB-tools’ package is necessary to perform this transformation.

After installing the package, change directory to the directory containing the PDB file. Run the following command: python PDB_delresname.py -< option > <PDB file> > <output file>. This command deletes all atoms belonging to the specified residue. <option> must be substituted by the name of the residue to be removed, in this case that residue is water and its residue name is HOH. <PDB file> must be substituted by the name of the PDB file, in the case of this thesis the name of the file is ’AD_initial_frame.PDB’. <output file> is the name of the file the altered PDB information is written to. If given a name that does not exist in the given directory, a new PDB file is written.

Secondly, the internal coordinate matrix of ADP must be defined. This is done using the ’zmattools’ package. In order to extract the internal coordinate matrix, change directory to the location of the PDB file without the solvent water molecules and run zmat generate <PDB file> where <PDB file> is replaced by the name of the PDB file without water molecules. Running this command generates a full internal coordinate matrix in the terminal window, this matrix must be copied somewhere so that its parts can be inserted properly into a BG data file. Navigate to the Boltzmann/deep_boltzmann/deep_boltzmann/models folder. Here, a file describing the internal coordinates of amino acids called ’proteins.py’ can be found. An entry must be made for each residue in the ADP molecule, namely ’ACE’, ’ALA’, and ’NME’. For each of these residues, the internal coordinate matrix entry corresponding only to atoms from that residue must be added. In other words, if one of the entries in the generated internal coordinate matrix contains atoms 1-4, and each of these atoms is contained in a single residue, this entry must be added to the proteins.py file under the name of the corresponding residue.

After successfully inserting the internal coordinate matrix, the final data alteration must be done. In order to perform the TICA transformations detailed below, the OPS trajectory files must be converted to a different format. As they are generated and saved by OPS, the transition path simulations are saved as .nc files. This is a format that pyemma cannot work with, so it must be transformed into one that pyemma can process in order to perform TICA transformations. The mdtraj library contains a function called ’mdconvert’, which can transform between several different MD file formats. Several different MD formats can be used to perform TICA transformations with,

(23)

for this project the ’.xtc’ file extension was chosen. Because the saved simulation file is a collection of several transition paths, they must be looped over and converted into ’.xtc’ format and saved one by one. When this is done, all necessary data transformations except for TICA have been completed.

6.6 TICA transformations

As mentioned above, the TICA transformations can be done on the trajectory data when it exists in a compatible format, like ’.xtc’. In order to extract the most useful data for the purpose of this project, the trajectory data must be featurized before the TICA transformations are computed. Pyemma allows for featurization of trajectories if the appropriate PDB file is provided. Once again, the PDB file with no water is necessary to perform the featurization and subsequent TICA transfor-mations well. Several features can be selected for the featurization process, but because it is already known that the dihedral angles define the transition that is to be analyzed, the featurizer object that describes these should be selected. First, the featurizer object is created using the pyemma fea-turizer function. This feafea-turizer object contains a function called ’add_backbone_torsions()’. This function must be performed on the featurizer object as it allows for the TICA transformations to be performed specifically on the ψ and φ angles of the ADP molecule. The code to do this is as follows: torsions_feat = pyemma.coordinates.featurizer(PDB)

torsions_feat.add_backbone_torsions(cossin=True, periodic=False) torsions_data = pyemma.coordinates.load(files, features=torsions_feat)

After the backbone torsions have been added to the featurizer, the trajectory data must be loaded once more, this time with the featurizer object passed as an argument. This data is then passed on to the pyemma TICA function like so: tica = pyemma.coordinates.tica(torsions_data, lag=5). Because TICA measures the influence of a certain feature on a Markov MD simulation relative to a certain lag time, this lag time must also be provided. For this project, the optimal lag time has been found to be 0.5 ns. After performing the TICA transformation, its output can be collected from the object and saved for usage with the BG. The BG requires three files pertaining to the TICA output in its original implementation, though additional files can be used for experimentation. The three necessary files are the TICA mean, the TICA eigenvalues and the torsion indexes. The first two files can be directly extracted from the TICA output in the form of NumPy matrices, the last file needs to be manually computed by providing the indexes of the atoms that comprise the φ and ψ angles.

7 Results

This section describes the results and conclusions pertaining to each part of the project.

7.1 OPS

The OPS data was collected by closely following the instructions listed in the notebook tutorial files and provided by the creators of OPS. By simply running the notebooks involved in generating

(24)

ADP transition state paths and doing the committor analysis, 59 transition state frames could be isolated. By running the notebooks again, this time with a longer simulation, the total amount of transition state frames was brought up to 232. OPS was chosen because it is a versatile tool specialized for modeling the very phenomenon that is studied in this thesis, and because the BG works using the same underlying MD engine, namely OpenMM. The OPS transition state data is necessary to train the BG because its first training section works by approximating known molecule configurations.

In the original implementation, 150.000 sample configurations were provided for initial training. Compared to this, 232 may seem like too small a number to accurately train the BG on. However, it must be kept in mind that the original BG implementation was used to sample states of the BPTI protein, a molecule that is 892 atoms long. As a molecule grows in length, its structure grows exponentially more complex, and there exist far more variations in the amount of spatial configurations it can have. For reference, the structure of a water molecule would be easy to learn for an algorithm as there are very few ways in which one water molecule could differ from another. A water molecule only contains 3 atoms, meaning the ADP molecule has a little over 7 times more atoms than a water molecule. ADP is a far more complex molecule than water, but the BPTI protein has over 40 times as many atoms as the ADP molecule. For these reasons it is assumed in this thesis that the BG could learn to approximate the structure of ADP by using an amount of samples a few orders of magnitude below the amount used to learn the BPTI structure. Additionally, the algorithm would only prove to be useful for the purpose of this project if the amount of samples needed to train it would not exceed a few thousand. Because ADP is a relatively small molecule, there are not that many configurations it can take, conversely meaning there also isn’t an enormous amount of transition states it can be in. If there is a very large amount of transition state data needed to initialize the algorithm to sample other transition states, it will most likely not provide any new information and it would be more efficient to simply run OPS until most transition states have been found.

OPS is useful for this project for one additional reason, namely the testing of the BG sampled transition states. By resolvating (reintroducing water molecules in their environment) the given samples and running committor analysis on them, it can be checked whether the given samples actually are transition states or not.

7.2 BG data

Not many alterations were made to the BG as it is not necessary for the scope of this project. A few noteworthy changes were made, however. Most notably, to the proteins.py file that contains the definitions of the internal coordinate matrices of every amino acid and some code to interpret those matrices relative to a given molecule. In this file, the internal coordinates of alanine are provided, but not of the other two residues present in ADP. Furthermore, the way that amino acids connect together is different from how alanine dipeptide is connected to the other ADP residues, meaning that different internal coordinates are needed. Providing the internal coordinates for the molecule is important as internal coordinates describe the positions of groups of four atoms within a molecule relative to each other; a dihedral angle is the angle between the first and fourth atom in such a group, since the ADP transition depends on the values of two dihedral angles, having a good inter-nal coordinate representation can be useful. The interinter-nal coordinate matrices in the altered version of proteins.py were generated using an application with very little online documentation, however,

(25)

and so its correctness cannot be certain until the experiment is done and results are collected and analyzed. The internal coordinate matrix entries can be manually checked to make sense, and this was also done for this project.

Another alteration to the native BG files is the aforementioned code that uses the internal matrices to generate coordinates for the BG. In the original implementation, the code was badly suited for flexibly deciding which residues’ internal coordinates would be used and which would not (not every part of the molecule needs to necessarily be described with internal coordinates in order for the BG to produce correct results), this was fixed with a small list comprehension.

7.3 TICA transformations

The TICA results provide interesting insights into the role of the angles φ and ψ in the modeled ADP state transition. The TICA transformation was performed on a set of 2000 ADP trajectories, including transition and non-transition states. A lag time of 0.5ns was found to be

optimal, according to the pyemma TICA tutorial’s own documentation.

Figure 9: VAMP 2 scores for various TICA features

Figure 9 shows the VAMP2 scores of several different features of the ADP molecule. The VAMP2 score is a metric that measures the kinetic variance within the simulation that a certain feature is responsible for. Evidently, the most relevant feature is the backbone torsions. The extracted backbone torsions are described by the angles φ and ψ that the project focuses on. The TICA information can be provided to the BG as a reaction coordinate in KL training. This can be thought of as the TICA information providing the algorithm with a direction to look to when trying to find the right parameters to correctly model the ADP behavior with, as the angles described by the TICA information are so relevant to the ultimate goal of the BG. It makes sense that the TICA data generated for this project would be effective for the BG as it was derived from simulations of the very process the BG is meant to sample from, namely the transition of ADP from its C7eqto

the alphaR state.

7.4 Final conclusions

Besides the results specific to the different software/analyses used in the thesis, concepts explored in this thesis such as the likelihood of certain molecular configurations, and the independent com-ponent analysis of molecular interactions, provide powerful tools for computational chemistry. By combining these concepts with powerful algorithms such as the BG, certain molecular configurations can be studied more closely and generated more easily than by performing physical experiments.

(26)

Keeping in mind all of the above results, this project finds that using the BG to generate transition state samples of the ADP C7eq to alphaR state transition is possible with some minor alterations

to the algorithm. The original algorithm works with the same underlying engine as OPS, and it learns from spatial configurations that are possible to make with OPS. Its energy training section applies itself well to the problem at hand and with some minor changes to the energy loss function the high energy transition states could be prioritized over the low energy stable states that the BG is currently best suited for. Furthermore, the TICA data derived from the ADP simulations has been shown to describe the very attributes that are crucial to proper transition state sampling accurately, and provides the final argument for the reasoning of this conclusion.

8 Discussion

The next step for this project is to apply the information learned to run the algorithm and sample the transition states as intended. A way to check the correctness of the results is provided through the committor analysis tool of OPS which can be easily used. As mentioned before, inaccuracies in the OPS sampling are very unlikely, but it is possible that the internal coordinate representation of ADP is somehow faulty or that the untested parts of the algorithm lend themselves badly to the matter at hand. In the future, if the implementation discussed in this thesis proves successful, the BG algorithm could be modified for many more purposes. Its mixed training that incorporates both spatial structure and energy of a molecular system can be a versatile tool capable of analyzing many different kinds of molecule interactions and providing new insights into them by efficiently and accurately modeling their behavior. Furthermore, the aforementioned concepts of the likelihood of certain molecular configurations, and the independent component analysis of molecular interactions, could provide interesting new ways to design novel drugs in the future. As for the scope of this thesis, all required theoretical changes to the BG algorithm and the data format of OPS simulations have been researched.

9 Bibliography

Berne, B. J. (1999). Molecular Dynamics in Systems with Multiple Time Scales: Reference System Propagator Algorithms. Molecules, 22(10), 20.

https://doi.org/10.3390/molecules22101782

Bolhuis, P. G., Chandler, D., Dellago, C., Geissler, P. L. (2002). Transition path sampling: Throwing ropes over rough mountain passes, in the dark. Annual Review of Physical Chemistry, 53, 291–318.

https://doi.org/10.1146/annurev.physchem.53.082301.113146

Jung, H., Covino, R., Hummer, G. (2019). Artificial intelligence assists discovery of reaction coordinates and mechanisms from molecular dynamics simulations. ArXiv.

Noé, F., Olsson, S., Köhler, J., Wu, H. (2019). Boltzmann generators: Sampling equilibrium states of many-body systems with deep learning. Science, 365(6457).

(27)

https://doi.org/10.1126/science.aaw1147

Pérez-Hernández, G., Noé, F. (2016). Hierarchical Time-Lagged Independent Component Anal-ysis: Computing Slow Modes and Reaction Coordinates for Large Molecular Systems. Journal of Chemical Theory and Computation, 12(12), 6118–6129.

https://doi.org/10.1021/acs.jctc.6b00738

Schramm, V. L. (2011). Enzymatic Transition States, Transition-State Analogs, Dynamics, Thermodynamics, and Lifetimes. Journal of Chemical Theory and Computation, 176(3), 1–44. https://doi.org/10.1146/annurev-biochem-061809-100742.Enzymatic

Swenson, D. W. H., Prinz, J. H., Noe, F., Chodera, J. D., Bolhuis, P. G. (2019). OpenPath-Sampling: A Python Framework for Path Sampling Simulations. 1. Basics. Journal of Chemical Theory and Computation.

https://doi.org/10.1021/acs.jctc.8b00626

10 Appendix

This section follows section 5.1 in describing the installation of the software used in this thesis. Run the following commands to download and install both OPS and Openmm: . conda install -c conda-forge -c omnia openmm conda install -c conda-forge openpathsampling Down-loading and installing through anaconda is not strictly necessary but highly recommended. Openmm is a high performance toolkit for molecular simulation. It is used in both OPS and BG, in order to compute things such as the force fields that drive interatomic interactions.

https://zenodo.org/record/3242635#.X3hrZtZS8W8 notebooks and associated files to run BG.

http://ftp.mi.fu-berlin.de/pub/solsson/BG_BPTI.tar.gz link for notebook containing in-stance of BG from initialization to training. This notebook and all relevant files it imports are the ones that will need to be altered to get the algorithm working properly in the end. Also, all data this notebook requires are the data that need to be provided from OPS in order to get the results this project requires. Side note: to get the algorithm working properly, two files are provided in the second link. In the files from the first link, in the deep_boltzmann folder two library files are found that need to be overwritten by the files with same name from the second link. Subsequently, the library must be reinstalled for proper functionality of the BG.

Following the essential BG and OPS files, four additional Python packages should be installed in order to perform necessary data transformations. ’mdtraj’ is a Python package meant for an-alyzation and manipulation of MD trajectories. It can be installed through the following com-mand: conda install -c conda-forge mdtraj. ’PDB-tools’ is a package for modifying PDB files and can be installed by running pip install PDB-tools in a terminal window. The third pack-age, ’zmattools’ can generate an internal coordinate representation of a molecule by supplying it with the appropriate PDB file. In order to install this package, clone the following repository:

(28)

https://github.com/SGenheden/zmattools and change the active directory to the downloaded di-rectory and run python setup.py install

Finally, ’pyemma’ must be downloaded. This is a Python library for the analysis of Markov mod-els of molecular dynamics. This library is needed to perform the TICA transformations on the ADP simulation data. The library can be downloaded by running the following command: conda install -c conda-forge pyemma