
University of Amsterdam

Masters Thesis

Machine learning-assisted committor prediction based on molecular simulation data

Author:

Martin Frassek

Supervisor:

Arjun Wadhawan

A thesis submitted in partial fulfilment of the requirements for the degree of Master of Science in Computational Science

in the

Computational Chemistry Group, Van’t Hoff Institute for Molecular Sciences


Declaration of Authorship

I, Martin Frassek, declare that this thesis, entitled ‘Machine learning-assisted committor prediction based on molecular simulation data’ and the work presented in it are my own. I confirm that:

• This work was done wholly or mainly while in candidature for a research degree at the University of Amsterdam.

• Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.

• Where I have consulted the published work of others, this is always clearly attributed.

• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.

• I have acknowledged all main sources of help.

• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Signed:

Date: December 27, 2020


UNIVERSITY OF AMSTERDAM

Abstract

Faculty of Science

Van’t Hoff Institute for Molecular Sciences

Master of Science in Computational Science

Machine learning-assisted committor prediction based on molecular simulation data

by Martin Frassek

Molecular simulations are becoming increasingly popular to glean in silico atomistic information about physical and chemical reactions that lies beyond the spatio-temporal limits of laboratory experiments. In such a molecular simulation (e.g. Molecular Dynamics), a reaction coordinate (RC) is the principal feature to determine the progress along the reaction from one state to another. A good RC can be useful to generate good statistics by using enhanced sampling or to estimate accurate rates of transition. In general, an RC could be described by the Cartesian coordinates of all the particles in a system. However, the interpretability of such a representation is low. A more practical approach is to describe the RC by some low-dimensional molecular descriptors called order parameters (OPs). The challenge lies in finding the right combination of these OPs that can be used to construct the RC. We try to tackle this challenge using machine learning. In this work, a novel machine learning model was developed to approximate the RC of different molecular systems. This model is an extended variant of the autoencoder and was therefore named the extended autoencoder model (EAE). It functions by encoding the inputs and mapping them onto a lower-dimensional latent space. This encoded information is subsequently used for the reconstruction of the input as well as the prediction of the RC. As a consequence, the latent space is optimized for both reconstruction and prediction. Nucleation data of methane hydrates served as a test case for the model. On this dataset we showed that the EAE model can effectively extract novel insights about the underlying mechanism of a reaction, make reliable predictions about the committor of a given configuration, and potentially even generate new paths representative of a reaction.


Acknowledgments

First I want to thank Prof. Peter Bolhuis for giving me the opportunity to write my thesis in his group, for his academic input, and for his open and human style of supervision. I want to thank Dr. Jocelyne Vreede for taking over the role of my second assessor and for her "Biomolecular Simulations" course that sparked my interest in molecular simulations in the first place.

My daily supervisor, Arjun, I want to give special thanks for all his advice and help and for his relaxed but persistent attitude that did a great job in keeping my motivation and productivity up through all these months of thesis writing.

I want to thank the path group for all the nice meetings, the feedback and ideas I took away from them, and all the glimpses into other people’s projects.

I want to thank my study mates for the amazing time I had with them in Amsterdam and all the things I learned from them.

And lastly, I want to thank all of my friends and my family for the support they have given me over the past year.


Contents

Declaration of Authorship
Abstract
Acknowledgments
Contents
List of Figures
List of Tables
Abbreviations

1 Introduction
  1.1 Molecular Simulation
  1.2 TPS and TIS
  1.3 Order parameters
  1.4 Reaction coordinate and committor
  1.5 Fundamentals of machine learning
  1.6 Autoencoders
  1.7 Methane hydrate nucleation
  1.8 Aim of this work

2 Methods
  2.1 The extended autoencoder model
  2.2 Available data
    2.2.1 Toy datasets
      2.2.1.1 Double-well potential
      2.2.1.2 Z-potential
    2.2.2 Nucleation data
      2.2.2.1 TIS data
      2.2.2.2 TPS data
  2.3 Data processing
    2.3.1 Data import
    2.3.2 Pipeline
      2.3.2.1 Outlier removal
      2.3.2.2 Normalization
      2.3.2.3 Reduction of input dimensions
      2.3.2.4 Snapshot binning
      2.3.2.5 p_B-Approximation
      2.3.2.6 Packing of Tensorflow datasets
  2.4 Data characterization and visualization
    2.4.1 Input importance measure
    2.4.2 Visualizing model outputs
      2.4.2.1 Plotting committor predictions
      2.4.2.2 Plotting input reconstruction
      2.4.2.3 Mapping paths onto the latent space
      2.4.2.4 Mapping paths onto the committor space
      2.4.2.5 Mapping latent paths onto the configurational space
    2.4.3 Visualizing the ground truth
  2.5 Libraries
  2.6 Workflow

3 Experiments and Results
  3.1 Double-well potential
  3.2 Z-Potential
  3.3 Methane hydrate nucleation
    3.3.1 Non-linear encoder
    3.3.2 Linear encoder

4 Discussion
  4.1 Double-well potential
  4.2 Z-Potential
  4.3 Methane hydrate nucleation

5 Conclusion and future work

A Nucleation data characterization
  A.1 Input distribution
  A.2 p_B distribution

B Parameter settings tested
  B.1 Activation functions
  B.2 Outlier cutoffs
  B.3 Evaluated labels and weights
  B.4 Loss functions
  B.5 Bottleneck size

C Additional results
  C.1 Full dimensional dataset


List of Figures

1.1 TPS procedure
1.2 TIS procedure
1.3 Importance of the right choice of reaction coordinate
1.4 Structure of a perceptron
1.5 Common non-linear activation functions
1.6 Structure of a neural network
1.7 Structure of an autoencoder
1.8 Common gas hydrate structures
2.1 Structure of the extended autoencoder
2.2 Double-well potential energy surface
2.3 Double-well potential dataset
2.4 Z-potential energy surface
2.5 Z-potential dataset
2.6 Data processing pipeline
2.7 Workflow of this project
3.1 Double-well potential: Ground truth vs. prediction
3.2 Double-well potential: Ground truth vs. prediction, x1/x2 plot
3.3 Double-well potential: Reconstruction of inputs
3.4 Double-well potential: Path generation from latent space
3.5 Z-potential: Ground truth vs. prediction
3.6 Z-potential: Ground truth vs. prediction, x1/x2 plot
3.7 Z-potential: Reconstruction of inputs
3.8 Z-potential: Path generation from latent space
3.9 Nucleation data, non-linear encoder: Ground truth vs. prediction
3.10 Nucleation data, non-linear encoder: Reconstruction of inputs
3.11 Nucleation data, non-linear encoder: Paths mapped onto committor space
3.12 Nucleation data, non-linear encoder: Path generation from latent space
3.13 Nucleation data, linear encoder: Ground truth vs. prediction
3.14 Nucleation data, linear encoder: Reconstruction of inputs
3.15 Nucleation data, linear encoder: Paths mapped onto committor space
3.16 Nucleation data, linear encoder: Path generation from latent space
3.17 Nucleation data, linear encoder: Input importance measure
A.1 Distribution of input values
A.2 Distribution of approximated p_B values
A.3 Fraction of p_B = 0 points for different numbers of dimensions
B.1 Effect of different activation functions on the committor prediction
B.2 Effect of the activation function on the input reconstruction
B.3 Effect of the outlier cutoff on the ground truth
B.4 Effect of the dataset composition on the committor prediction
B.5 Effect of the loss function on the committor prediction
B.6 Loss of the trained model dependent on the bottleneck size
C.1 Full-dimensional nucleation data, non-linear encoder: Prediction
C.2 Full-dimensional nucleation data, linear encoder: Prediction
C.3 Shooting data, ground truth


List of Tables

2.1 Default EAE parameters
2.2 Order parameters of the methane hydrate nucleation dataset


Abbreviations

AE    AutoEncoder
CV    Collective Variable
EAE   Extended AutoEncoder
FES   Free Energy Surface
MCG   Mutually Coordinated Guests
MD    Molecular Dynamics
ML    Machine Learning
NN    Neural Network
OP    Order Parameter
PES   Potential Energy Surface
RC    Reaction Coordinate
ReLU  Rectified Linear Units
RPE   Reweighted Path Ensemble
sI    structure I
sII   structure II
TIS   Transition Interface Sampling
TP    Transition Path
TPS   Transition Path Sampling
UvA   Universiteit van Amsterdam


Chapter 1

Introduction

In the past decades, molecular simulations have opened the door to modeling chemical processes on the molecular level, while machine learning has provided the tools to derive deeper insights from these simulations. Nonetheless, many challenges in understanding chemical reactions still persist today. One of these remaining challenges, and the one we will focus on, is identifying a one-dimensional variable that optimally describes the progress of a reaction: the reaction coordinate (RC).

In this chapter, we will dive into the basics of molecular simulations and the need for enhanced simulations, like transition path sampling (TPS) and transition interface sampling (TIS). We will discuss the details of TPS and TIS and the concepts of the RC and committor. We will then cover the fundamentals of machine learning and explore our specific architecture, namely the autoencoder. Finally, we will briefly discuss the dataset that was used to test our machine learning model and elaborate on the aim of this thesis to use our model for learning a system’s underlying RC.

1.1 Molecular Simulation

Atoms and molecules are the basic building blocks of the natural world. From the air we breathe to the DNA in our cells, everything is composed of them. With these particles constituting such a fundamental part of our surroundings, a microscopic understanding of the processes (such as reactions) that involve atoms and molecules is crucial. While direct observation is encumbered by their small scale, molecular systems can be replicated and examined in silico, in molecular simulations.

All molecular simulations begin with an initial frame for which the positions of all particles must be known. Different schemes can then be applied to simulate the system’s progression through time. Molecular Dynamics (MD) constitutes one such approach. In MD, the positions as well as the momenta of the particles are known. Further, force fields are defined which describe the force exerted by particles onto each other. With these prerequisites given, Newton’s equations of motion can be solved to calculate the particles’ positions and momenta for the subsequent time steps [1].

While MD is a powerful tool to gain new insights about a system, it is computationally limited to small system sizes (∼10^3 to 10^5 molecules) and time scales (∼10^-9 to 10^-6 s).

As a consequence, MD is ill-suited to capture events that occur only rarely [1].

One example of such rare events is the transition between metastable states of a system. When a system encounters a metastable state, i.e. a local free energy minimum, it is likely to spend a prolonged period of time there, as it can only escape by moving through conformations at higher free energy levels [2,3]. Once it has escaped from this state, the system will continue to pass through this high-energy region of the configurational space before either returning to the state or entering another one.

When simulating such dynamics, the simulation is dominated by the time the system spends in the metastable states. Therefore, large amounts of computational resources need to be expended to observe even a few transitions. Brute-force MD is therefore unsuited for efficient sampling of rare events. However, other methods that allow computationally efficient rare event sampling have been developed. These include umbrella sampling [3], metadynamics [4], local elevation [5], nudged elastic band [6], milestoning [7], transition path sampling (TPS) [8], transition interface sampling (TIS) [9] and others [10]. Of these methods, we will dive more deeply into TPS and TIS and use them to sample rare events. This information will further be used to find the RC of these simulations via machine learning.

1.2 TPS and TIS

TPS and TIS are trajectory-based sampling algorithms that perform Monte Carlo sampling in the path space. At their core, both TPS and TIS make use of MD to generate paths. However, they have more complex schemes to select which regions of the configurational space to visit and thereby allow for more efficient sampling of transition paths. TPS generates decorrelated, natural transition paths from already existing ones (Figure 1.1) [11]. The three key ingredients to perform TPS are a) the definition of different (meta)stable states of the system between which the transition needs to be sampled, b) an initial trajectory between these states from which sampling can be initiated and c) a shooting scheme. In practice, defining the stable states requires a certain degree of knowledge about the system. However, it is simpler than defining the rare event mechanism itself. The complexity of the problem at hand is therefore reduced when applying TPS [12].


Once the states are defined, an initial transition path is required. If no path is available and obtaining one via prolonged MD is unfeasible, other methods, like metadynamics, can be used. While these methods might produce unnatural transition paths, after sufficiently long relaxation, TPS will likely be able to generate natural paths nonetheless. However, it is worth noting that the closer the initial path resembles a natural path, the shorter this relaxation time is expected to be [1]. If previous TPS runs exist, a path generated in these can also serve as an initial trajectory.

A common shooting scheme is the standard one-way shooting move, although more elaborate shooting moves like aimless shooting, spring shooting, or others can be chosen as well [13,14]. During the shooting move, a snapshot from an already existing transition path is selected. This snapshot is modified and then used to generate a new path by integrating Newton’s equations of motion forward or backward in time. This newly generated subpath then replaces the respective part of the original path, beginning at the shooting point. If the newly generated path connects the states of interest, a probability is assigned to determine whether it will be accepted. If the path does not connect the states of interest, it is rejected immediately [1].

Figure 1.1: TPS procedure for sampling pathways in a two-dimensional toy system. The blue line in the left image represents an unnatural path. Generation of a natural transition path from an unnatural one is done via shooting moves.

Figure 1.1 depicts the progressive application of TPS to obtain an uncorrelated, natural path from an unnatural, initial path. Initially, an unnatural interpolation from state A to state B is given. After the first shooting point is chosen and a new partial trajectory is generated from there, the path still partially consists of the original, unnatural path. As the second shooting move only replaces a part of the remaining unnatural path, the newly generated path is still not fully decorrelated. Only once the third shooting move causes all remainders of the original path to be replaced is the resulting path fully decorrelated from the initial path. This new path represents the natural channel between the two states significantly better than the initial, unnatural path and can serve as a natural, initial path for further TPS runs.


TIS also belongs to the family of path sampling algorithms. It iteratively generates paths that extend progressively further from one state to the other, thereby sampling the complete ensemble of paths using various interfaces along the transition path between the states (Figure 1.2). For TIS, at first, the metastable states of the system need to be defined. Further, a set of non-intersecting interfaces λ_0, ..., λ_n lying between these states has to be selected. The position of each interface is chosen such that a path starting from λ_i has a sufficiently high probability of crossing λ_{i+1} to allow efficient sampling [1]. This probability is recommended to be approximately 20% [15].

As a path crossing one interface can be used for the efficient generation of paths crossing the subsequent interface, this method can progressively generate paths crossing increasing numbers of interfaces and eventually reaching the other state. Therefore, the full path ensemble is sampled [1]. This path ensemble, however, is imbalanced, as paths belonging to the lower interfaces are underrepresented, while paths belonging to the higher interfaces are overrepresented. Therefore a reweighting scheme needs to be applied to remove this bias, yielding the reweighted path ensemble (RPE) [16].

Figure 1.2: TIS procedure for sampling pathways in a two-dimensional toy system. Generation of a natural transition path is achieved via progressive interface crossing.

Figure 1.2 shows an exemplary TIS run. The initial path solely crosses λ_1 before returning to state A. By applying the shooting move to this path, a new path can be generated, which passes through λ_2. Via two further applications of this procedure, a path crossing the interfaces λ_3 and λ_4 and reaching state B is successfully generated.

1.3 Order parameters

Independent of the method used to generate the transition paths, a representation needs to be chosen to depict the state of the system at any given time point and allow a distinction between metastable and transition states. For this representation, a set of variables needs to be chosen to describe the system. These variables are called order parameters (OPs) or collective variables.


While it is possible to describe the system in terms of the positions of all its particles, this approach is cumbersome and leads to low interpretability. Therefore the Cartesian coordinates are usually not chosen as OPs [17]. Instead, more high-level descriptions, calculated as functions of the Cartesian coordinates, are often selected as OPs [17]. While higher-level OPs cannot capture the complete information contained in all particle positions, a good set of OPs is chosen such that most of the relevant information is retained nonetheless. With a poor set of OPs, gaining meaningful insights is encumbered significantly.

1.4 Reaction coordinate and committor

The ultimate goal of rare event sampling is gaining insight into the underlying reaction mechanism. As for the OPs, low-level descriptions can lack interpretability, meaning that more high-level descriptors are needed. One such descriptor is the reaction coordinate (RC), which captures the progress of a reaction in a single variable [13], making the RC both concise and potentially highly informative.

The choice of RC is not trivial, however, and a poorly chosen RC can be misleading when trying to understand the reaction mechanism (Figure 1.3). Therefore special heed should be paid to identifying a good RC.

Figure 1.3: Illustration of the effect of reaction coordinate choice on the free energy and path sampling: (Left) Good RC - the barrier predicted by the RC coincides with the true barrier of the potential energy surface. (Right) Poor RC - the points on the barrier predicted by the RC show varying behavior dependent on dimensions not captured in the RC.

One such good RC, in fact the optimal RC, is defined by the so-called committor. The committor of a given point is defined as the probability that a path beginning at said point reaches a given state before any other state. The committor for state B is called p_B. To approximate the p_B of a point, a large number of paths needs to be generated via shooting moves. The fraction of paths ending up in state B then gives p_B. The more accurate the approximation of p_B should be, the more paths need to be generated. Determining the committor is therefore a costly procedure. Especially when attempting to cover the complete range of values in a larger system, calculating the committor for all given points is computationally infeasible.

Alternatively, the RC can also be approximated by a linear or non-linear combination of the OPs used to describe the system. This means that, to construct the RC, the most relevant OPs as well as the relationships between them need to be identified and prioritized over less relevant ones. By describing the RC in such terms it becomes highly interpretable. Apart from directly yielding a better understanding of the reaction, the RC can also be used for highly efficient generation of new transition paths [18]. This is enabled by the fact that the RC can be used to determine the position of the transition barrier in the configurational space. TPS runs selecting shooting points close to this barrier will have an optimal acceptance probability, as their chance of connecting two different states is maximized [11].

While the generation of additional paths seems unnecessary when the RC is already known exactly, the same process can be applied using an imperfect RC to generate paths more efficiently [18]. This way the RC can be improved iteratively. By generating more transition paths, a closer approximation of the RC can be found, which in turn allows for a more efficient generation of transition paths. This process can then be continued until a sufficiently close approximation of the RC is found.

Identifying the optimal combination of OPs to capture the RC seems to be a task well-suited for machine learning (ML) models. Indeed, several ML approaches to approximate the p_B have been put forward in recent years. These approaches include the use of the likelihood maximization method to identify the best approximation of the RC as a linear combination of OPs [13], training a neural network (NN) to predict the p_B based on a given configuration and finding the combination of OPs that most closely matches the RC via a genetic algorithm [17], iteratively generating new paths based on a given approximation of the RC and utilizing these paths to refine the approximation further [18], the use of the encoder part of an autoencoder to learn a linear mapping from the input dimensions to the RC [19], as well as others [20].

Based on these precedents, it seemed promising to design a new ML model tailored to RC prediction based on TPS and TIS data. Before elaborating on this approach, however, we will first introduce the key concepts of ML.


1.5 Fundamentals of machine learning

The term ML was originally put forward in 1959 by Arthur Samuel [21]. ML focuses on computational models learning through experience and improving their own performance on tasks they were not explicitly programmed to perform. ML models are considered universal approximators: with an adequate architecture and sufficiently thorough training they could potentially learn any task.

One of the oldest concepts in machine learning is the perceptron introduced by Rosenblatt (Figure 1.4) [22]. It is a single processing unit that was originally meant to explain the process of perception in higher organisms [22].

Figure 1.4: Structure of a perceptron. Inputs x_1, x_2, and x_3 are weighted respectively by w_1, w_2, and w_3 and summed together to calculate the weighted sum y. The activation function a is then applied to y to obtain the perceptron’s output z.

The perceptron is a linear classifier, which can learn to separate instances of a dataset into two distinct categories. If the two categories are linearly separable, the perceptron is guaranteed to converge. If the dividing surface is non-linear, however, no convergence occurs [23].

The perceptron can take in several inputs, which can, in turn, be associated with different weights to calculate a weighted sum according to formula 1.1, where n represents the number of inputs, x_i the ith input, and w_i the weight associated with x_i.

y = \sum_{i=1}^{n} w_i x_i    (1.1)

The result of formula 1.1 is subsequently passed to an activation function to calculate the final output of the perceptron. The purpose of an activation function is to map the incoming signal to an appropriate output range. This can be done using either linear or non-linear activation functions. Linear activation functions simply multiply the input with a constant factor, while non-linear activation functions consist of more complex mappings. Commonly used non-linear activation functions include tanh, sigmoid and Rectified Linear Units (ReLU) (Formulae 1.2 - 1.4 as well as figure 1.5).

\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}    (1.2)

\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}    (1.3)

\mathrm{ReLU}(x) = \max(0, x)    (1.4)

Figure 1.5: Plot of the common non-linear activation functions tanh, sigmoid and ReLU.
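To make formulae 1.1 - 1.4 concrete, the following minimal NumPy sketch (illustrative only, not part of the thesis code) implements a perceptron forward pass with the three activation functions:

```python
import numpy as np

def tanh(y):
    return np.tanh(y)                     # formula 1.2

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))       # formula 1.3

def relu(y):
    return np.maximum(0.0, y)             # formula 1.4

def perceptron(x, w, activation=sigmoid):
    """Weighted sum of the inputs (formula 1.1) followed by an activation."""
    y = np.dot(w, x)
    return activation(y)

# Example with three inputs and three weights, as in Figure 1.4.
x = np.array([0.2, -1.0, 0.5])
w = np.array([0.7, 0.1, -0.3])
print(perceptron(x, w))
```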

To train the perceptron, a loss function is employed to measure how much the model’s prediction deviates from the desired outcome. During training, the model attempts to find parameter settings that minimize the loss, thereby optimizing its predictions. A variety of loss functions exist, including absolute error, squared error, binary log likelihood, binomial log likelihood, difference of logs, logarithmic absolute error (Formulae B.1 - B.6), and many others. As different loss functions act differently for a given input, the choice of loss function can have an impact on the performance of the model [23]. Loss functions should therefore be chosen carefully.

While a single-layer perceptron can be a powerful tool to learn simple patterns, its inability to learn non-linear dividing patterns makes it unsuited when more complex relationships exist in the dataset. In such cases, a multi-layer perceptron may yield better results.

Such multi-layer perceptrons, also known as neural networks (NNs), can learn significantly more complex relationships than the single-layer perceptron. As for the perceptron, they were originally inspired by the structure of the brain. However, the resemblance of modern NNs with the brain is only minimal [24]. Nowadays, NNs are one of the most commonly used models in ML and artificial intelligence. Figure 1.6 depicts an exemplary NN.


Figure 1.6: Structure of a neural network. Each circle represents a single neuron. This architecture has one input layer, three hidden layers, and one output layer. Nodes of subsequent layers are densely connected.

NNs generally consist of an input layer, a freely selectable number of hidden layers, and an output layer. The layers can consist of an arbitrary number of nodes, where each node resembles a single-layer perceptron. Additionally, a bias, which does not depend on the previous layers, can be present.

Which information is processed by a node is determined by its connections. In simple, linear NNs, a node is only connected to the nodes in the layer immediately preceding it. In more complex NNs, like residual NNs and recurrent NNs, nodes can also receive input from layers further back or even ahead of the node [25,26].

Two layers are densely connected when all nodes of one layer are connected to all nodes of the other. When fewer connections are present, the layers are sparsely connected. Each connection between nodes is weighted. The weight determines how much importance is assigned to the input received through this connection when calculating the node’s output. Depending on the type of NN, the weights of the connections can be independent of each other, or specific connections can share the same weights.

For NNs, the choice of activation function can play an important role in the quality of its predictions. If all nodes have linear activation functions, the model will be limited to learning a linear representation of the relationship between input and output. The hidden layers are effectively collapsed into one. If non-linear activation functions are chosen, however, more complex relationships can be learned.

As for the perceptron, NNs are trained with the help of a loss function. During training, the NN’s weights are updated via backpropagation to minimize the loss, thereby yielding an optimized model.

Different variants of NNs exist which are suited for solving a variety of problems. These include convolutional neural networks, autoencoders (AEs), and others [23]. Despite their different purposes and implementation, these variants still share the basic structure of connected layers and perceptrons as basic units.

NNs have been applied for many different tasks in a variety of fields, including speech recognition, product recommendations, image recognition, drug discovery, stock market trading, fraud detection, and many others.

1.6 Autoencoders

One special class of NNs is the autoencoder (AE). An AE has the same number of output as input dimensions and aims to minimize the difference between these inputs and outputs, a task that would be trivial if all hidden layers contained at least as many nodes as the input layer. AEs, however, are further characterized by a bottleneck layer, where the number of nodes is greatly reduced compared to the input layer (Figure 1.7).

Figure 1.7: Structure of an autoencoder. Each circle represents a single neuron. The input and the output layer as well as two of the three hidden layers consist of four nodes each. Only the central hidden layer consists of a single node, representing the bottleneck. The layers leading up to the bottleneck form the encoder, while the layers leading away from the bottleneck form the decoder.

While the presence of the bottleneck complicates perfect reconstruction of the input data, it allows for the acquisition of insights at the same time. As all information needs to pass through the bottleneck, the model is pressed to learn a mapping from the high-dimensional configurational space to a low-dimensional latent space and back. Therefore, an AE can conceptually be split into two parts. The layers leading up to the bottleneck encode the input by mapping it onto the latent space. This part of the model is therefore called the encoder. The layers following the bottleneck decode the previously encoded information by mapping from the latent space back to the original space. Consequently, this part of the model is called the decoder.


To ensure optimal information retention, the encoder needs to prioritize the input with the highest information density and identify complex relationships between the inputs. The latent space, therefore, contains highly relevant information and can be used to gain new insight into the system.

Further, as the decoder provides an efficient mapping from latent to configurational space, it can be used generatively. Provided with a point on the latent space, the decoder will reconstruct a fully dimensional output from it. Any point on the latent space can be chosen. Thereby novel instances of the dataset can be generated that adhere to the rules learned by the autoencoder. For molecular systems, these generated points can be used as starting points for simulations. Thereby AEs can facilitate the exploration of new regions of the configurational space [27].

1.7 Methane hydrate nucleation

Apart from selecting a model, a system needs to be selected to test the performance of the model. As a realistic application for such testing, we chose the process of methane hydrate nucleation, which was recently modeled using TPS and TIS sampling methodology [15]. Methane hydrate is a gas hydrate, consisting of methane and water molecules. This ice-like solid forms preferentially at high pressure and low temperature [11,28].

Generally, in gas hydrates, the water molecules form cage-like structures in which the gas molecules are captured. These cages can have different geometries and arrange into different larger structures. The three predominant cage types in gas hydrates are 5^12 (twelve pentagonal faces), 5^12 6^2 and 5^12 6^4 (twelve pentagonal and two or four hexagonal faces, respectively). Structure I (sI) is composed of 5^12 and 5^12 6^2 cages in a 1:3 ratio. Structure II (sII) on the other hand is made up of 5^12 and 5^12 6^4 cages in a 2:1 ratio [29]. Naturally, methane hydrate occurs mainly in sI. It can, however, also occur in sII as well as in an amorphous state [11,29,30]. Figure 1.8 depicts the different cage types and structures.

Figure 1.8: Common gas hydrate structures: (Left) 5^12 and 5^12 6^2 cages assembling in a 1:3 ratio to form structure I. (Right) 5^12 and 5^12 6^4 cages assembling in a 2:1 ratio to form structure II.


Previous computational investigations into methane hydrate nucleation have identified a temperature-dependent shift in the reaction mechanism. At strong undercooling (270 K), the solution nucleates into the amorphous state, while at higher temperatures (285 K) the crystalline sI prevails [11]. The RC of the amorphous nucleation was shown to mainly depend on the size of the nucleus. The RC of the crystalline nucleation, however, depended both on nucleus size and structural elements [11].

1.8 Aim of this work

To our knowledge, no model exists that can map configurations onto and back from a latent space, and at the same time make predictions about the committor of a given configuration. We think, however, that these two tasks are highly compatible, as both of them rely on filtering for the most relevant information from the input. Combining an AE and a predictor NN to share some of their architecture therefore seems feasible. In this work, we aim to construct such a hybrid model as a proof of concept.

After proving that this extended autoencoder (EAE) is functional and feasible, we plan to use it to identify the RC of methane hydrate nucleation. Further, a close investigation of the latent space is planned, to see which insights can be gained from the mappings between configurational space and latent space and vice versa. Lastly, the feasibility of the decoder for the generation of transition paths will be tested.


Chapter 2

Methods

2.1 The extended autoencoder model

Previous works in the field of ML have proven that NNs can facilitate the identification of a system’s reaction mechanism. Further, AEs are known to learn an efficient mapping from the configurational space onto the latent space and back.

Therefore a model based on AEs was designed in this work: the extended autoencoder (EAE) model (Figure 2.1). Like a normal AE, the EAE has a bottleneck and is forced to learn an efficient encoding and decoding of the incoming data to minimize its loss. However, it not only passes the information from the bottleneck to a decoder for reconstruction. The bottleneck is also connected to a committor decoder, which learns to predict the committor based on the encoded information. Consequently, the encoder needs to find an efficient encoding for information relevant both for the reconstruction and for the committor prediction.

All three parts of the EAE (encoder, reconstruction decoder, and committor decoder) are NNs. The NNs each consist of an input layer, several densely connected hidden layers, and an output layer. For the encoder, the size of the input layer is equal to the number of dimensions used, while the output layer size is determined by the number of bottleneck nodes. The reconstruction decoder uses an input layer of the size of the bottleneck. Its output layer size corresponds to the size of the encoder’s input layer. The input layer of the committor decoder is equal in size to the bottleneck as well, while its output layer is of size one.

Both the reconstruction decoder and the committor decoder are regression models trained via supervised learning using a loss function. While the output of the encoder is passed to both of these models, they have no direct influence on each other.

Due to the modular build of the model, six different (sub-)models can be accessed: the complete extended autoencoder (encoder, reconstruction decoder, committor decoder), the base autoencoder (encoder + reconstruction decoder), the committor predictor (encoder + committor decoder), as well as the basic models of encoder, reconstruction decoder, and committor decoder. Each of these models can be prompted for predictions based on a given input.

Figure 2.1: Structure of the extended autoencoder. As for a regular autoencoder, the model consists of an encoder and a decoder. Further, the output of the bottleneck is fed into another model to make predictions.

The default model parameters used in this work are listed in Table 2.1. These settings were chosen after rigorous testing on the available data as they yielded the best predictions (Appendix B).

Table 2.1: Default EAE parameters

Parameter              Encoder            Recon. decoder     Comm. decoder
# input nodes          # dimensions       1                  1
# output nodes         1                  # dimensions       1
layer type             dense              dense              dense
# hidden layers        4                  4                  4
# hidden layer nodes   4 × # dimensions   4 × # dimensions   4 × # dimensions
act. f. hidden layers  linear             tanh               sigmoid
act. f. output layer   linear             linear             sigmoid
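To make the architecture in Table 2.1 concrete, the following is a minimal tf.keras sketch of how such an extended autoencoder could be assembled. The function name, the choice of optimizer, and the mean-squared-error losses are illustrative assumptions, not the exact settings used in this work (Appendix B lists the loss functions that were evaluated).

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_eae(n_dims, n_bottleneck=1, n_hidden_layers=4):
    n_hidden_nodes = 4 * n_dims

    # Encoder: linear hidden layers, linear bottleneck output (Table 2.1).
    inputs = layers.Input(shape=(n_dims,), name="snapshot")
    h = inputs
    for _ in range(n_hidden_layers):
        h = layers.Dense(n_hidden_nodes, activation="linear")(h)
    bottleneck = layers.Dense(n_bottleneck, activation="linear",
                              name="bottleneck")(h)

    # Reconstruction decoder: tanh hidden layers, linear output of size n_dims.
    r = bottleneck
    for _ in range(n_hidden_layers):
        r = layers.Dense(n_hidden_nodes, activation="tanh")(r)
    reconstruction = layers.Dense(n_dims, activation="linear",
                                  name="reconstruction")(r)

    # Committor decoder: sigmoid hidden layers, single sigmoid output (p_B).
    c = bottleneck
    for _ in range(n_hidden_layers):
        c = layers.Dense(n_hidden_nodes, activation="sigmoid")(c)
    committor = layers.Dense(1, activation="sigmoid", name="committor")(c)

    eae = Model(inputs=inputs, outputs=[reconstruction, committor])
    # Optimizer and losses are placeholders for illustration only.
    eae.compile(optimizer="adam",
                loss={"reconstruction": "mse", "committor": "mse"})
    return eae
```

Because all three parts share the bottleneck tensor, sub-models such as the committor predictor can be exposed by building an additional Model on the same layers, e.g. Model(inputs, committor).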


Conceptually, the encoder maps information from the configurational space onto a lower-dimensional latent space. The dimensionality of this latent space corresponds to the number of bottleneck nodes. The latent space can be accessed by returning the output of the encoder for a given input.

Conversely, the reconstruction decoder can map from the latent space back onto the configurational space. Any point with a matching dimensionality can be converted into a corresponding full-dimensional point by the decoder. This allows a generative usage of the reconstruction decoder, even for points on the latent space that were not in the original dataset.

2.2 Available data

Three different datasets were used to develop and test the EAE model. Initially two toy datasets based on the double-well and Z-potential were generated as simple test cases before moving to the more complex methane hydrate nucleation dataset produced in [15] and [11].

2.2.1 Toy datasets

2.2.1.1 Double-well potential

For a first validation of the model, a dataset was generated using a simple double-well potential. Its potential energy surface (PES) consisted of two Gaussian potentials distributed along the x_1-axis, and exponentially growing barriers when moving away from the center of the landscape (Figure 2.2). The two states A and B were defined to be circular and centered around the bottom of either of the wells.


Figure 2.2: Two-dimensional energy surface of the double-well potential. Contour lines every 0.1 k_BT. Red: State A. Blue: State B.

The 2D double-well potential was defined by

V(x_1, x_2) = x_1^6 + x_2^6 - 0.7\left(e^{-7.5((x_1 - 0.5)^2 + x_2^2)} + e^{-7.5((x_1 + 0.5)^2 + x_2^2)}\right).

The states were chosen such that

State A: \sqrt{(x_1 - 0.5)^2 + x_2^2} \le 0.2
State B: \sqrt{(x_1 + 0.5)^2 + x_2^2} \le 0.2

The two states were separated by a barrier with a height of 0.5 k_BT.
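For reference, a minimal NumPy sketch of this potential and the two state definitions (function names are illustrative):

```python
import numpy as np

def double_well(x1, x2):
    """Double-well potential energy surface used for the toy dataset."""
    return (x1**6 + x2**6
            - 0.7 * (np.exp(-7.5 * ((x1 - 0.5)**2 + x2**2))
                     + np.exp(-7.5 * ((x1 + 0.5)**2 + x2**2))))

def in_state_A(x1, x2):
    """Circular state A centered on the right well."""
    return np.sqrt((x1 - 0.5)**2 + x2**2) <= 0.2

def in_state_B(x1, x2):
    """Circular state B centered on the left well."""
    return np.sqrt((x1 + 0.5)**2 + x2**2) <= 0.2
```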

An MD trajectory of 10,000,000 time steps was generated on the double-well PES utilizing the Langevin BAOAB integrator from the openpathsampling toy engine at dt = 0.02, temperature = 0.1, and gamma = 2.5. Eight additional dimensions were generated from random noise uniformly distributed between -1 and 1.

Paths were cut upon entry or exit of either of the states and all snapshots lying within the states were discarded. What remained was a set of disjointed paths. All paths in this set were assigned to one of four classes: AA, AB, BA, or BB paths. The union of these four classes constituted the reweighted path ensemble (RPE). In total 28,745 AA, 520 AB, 520 BA, and 30,114 BB paths were generated.


Figure 2.3 shows the configurational density map as well as the approximated p_B values of the double-well toy dataset.

Figure 2.3: Visualization of all points along the x_1 and x_2 dimensions occupied by instances of the double-well potential dataset. (Left) Configurational density. (Right) Approximated p_B values.

2.2.1.2 Z-potential

As a more complex test case, a dataset based on the Z-potential introduced in [16] was generated and used (Figure 2.4). One inherent feature of this PES is its non-linear RC [31]. This fact made the calculation of the committor and RC more challenging and thereby provided a sensible next step in the validation of the model.

Figure 2.4: Two-dimensional energy surface of the Z-potential. Contour lines every 0.5 k_BT. Red: State A. Blue: State B.


The Z-potential follows the formula

V(x_1, x_2) = \frac{x_1^4 + x_2^4}{20480} - 3 e^{-0.01(x_1 + 5)^2 - 0.2(x_2 + 5)^2} - 3 e^{-0.01(x_1 - 5)^2 - 0.2(x_2 - 5)^2} + \frac{5 e^{-0.2(x_1 + 3(x_2 - 3))^2}}{1 + e^{-x_1 - 3}} + \frac{5 e^{-0.2(x_1 + 3(x_2 + 3))^2}}{1 + e^{x_1 - 3}} + 3 e^{-0.01(x_1^2 + x_2^2)}.

In accordance with [31], the states A and B were defined as ellipsoids around the minima following the formula

(x_1, x_2) \;|\; (x_1 - x_{1m})^2/16 + (x_2 - x_{2m})^2 < 0.25

where (x_{1m}, x_{2m}) are positioned at (-7.2, -5.1) for state A and at (7.2, 5.1) for state B. The barrier between these two states had a height of ∼4.5 k_BT.

Further, a harmonic oscillator was defined as

V(x_1, x_2, ..., x_n) = \sum_i (x_i)^2
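A minimal NumPy sketch of the Z-potential and of the harmonic oscillator applied to the additional dimensions (function names are illustrative, not taken from the thesis code):

```python
import numpy as np

def z_potential(x1, x2):
    """Z-shaped potential energy surface with a non-linear reaction coordinate."""
    return ((x1**4 + x2**4) / 20480.0
            - 3.0 * np.exp(-0.01 * (x1 + 5)**2 - 0.2 * (x2 + 5)**2)
            - 3.0 * np.exp(-0.01 * (x1 - 5)**2 - 0.2 * (x2 - 5)**2)
            + 5.0 * np.exp(-0.2 * (x1 + 3 * (x2 - 3))**2) / (1.0 + np.exp(-x1 - 3))
            + 5.0 * np.exp(-0.2 * (x1 + 3 * (x2 + 3))**2) / (1.0 + np.exp(x1 - 3))
            + 3.0 * np.exp(-0.01 * (x1**2 + x2**2)))

def harmonic_oscillator(extra_dims):
    """Harmonic potential for the additional noise dimensions."""
    return np.sum(np.asarray(extra_dims)**2)
```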

An MD trajectory of 32,000,000 time steps was generated on the Z-PES utilizing the Langevin BAOAB integrator at dt = 0.02, temperature = 1.0 and gamma = 1.0. Eight further dimensions were added through a harmonic oscillator, centered around the origin. As for the double-well dataset, the resulting path was split into the sets of AA, AB, BA or BB paths. In total 207,563 AA, 582 AB, 582 BA, and 210,139 BB paths were generated.

Figure 2.5 depicts the configurational density as well as the approximated p_B values of the Z-potential dataset.

Figure 2.5: Visualization of all points along the x_1 and x_2 dimensions occupied by instances of the Z-potential dataset. (Left) Configurational density. (Right) Approximated p_B values.


2.2.2 Nucleation data

After the model proved functional on the toy datasets, it was tested against methane hydrate nucleation data. The liquid state was defined as state A, while the solid state was defined as state B, with a barrier of ∼57 k_BT separating these two states [15]. Both TIS and TPS were used to produce the dataset. AA and AB paths were generated via TIS, while additional AB paths were generated via TPS [11,15]. No BB or BA paths were added to the dataset. In total, the dataset consisted of 13,787 AA paths and 222 AB paths, amounting to a total of 7,038,943 snapshots.

22 OPs were selected to describe the system. These OPs and their interpretation are listed in Table 2.2.

Table 2.2: Order parameters of the methane hydrate nucleation dataset

OP            Interpretation
MCG           # nuclear methanes
N_{s,2}       # nuclear methanes with ≤ 2 nuclear methanes within 9 Å
N_{s,3}       # nuclear methanes with ≤ 3 nuclear methanes within 9 Å
N_{s,4}       # nuclear methanes with ≤ 4 nuclear methanes within 9 Å
N_{c,2}       # core methanes: MCG - N_{s,2}
N_{c,3}       # core methanes: MCG - N_{s,3}
N_{c,4}       # core methanes: MCG - N_{s,4}
N_{w,2}       # waters with ≥ 2 nuclear methanes within 6 Å
N_{w,3}       # waters with ≥ 3 nuclear methanes within 6 Å
N_{w,4}       # waters with ≥ 4 nuclear methanes within 6 Å
N_{sw,2-3}    # surface waters: N_{w,2} - N_{w,3}
N_{sw,3-4}    # surface waters: N_{w,3} - N_{w,4}
5^12          # cages with 12 pentagonal surfaces
5^12 6^2      # cages with 12 pentagonal and 2 hexagonal surfaces
5^12 6^3      # cages with 12 pentagonal and 3 hexagonal surfaces
5^12 6^4      # cages with 12 pentagonal and 4 hexagonal surfaces
4^1 5^10 6^2  # cages with 1 square, 10 pentagonal, and 2 hexagonal surfaces
4^1 5^10 6^3  # cages with 1 square, 10 pentagonal, and 3 hexagonal surfaces
4^1 5^10 6^4  # cages with 1 square, 10 pentagonal, and 4 hexagonal surfaces
CR            Ratio of 5^12 6^2 and 5^12 cages
R_g           Radius of gyration of the nucleus
F4            Mean cosine of HO··OH torsion angles for water pairs ≤ 3.5 Å apart


2.2.2.1 TIS data

The TIS dataset was taken from [15]. The TIS was performed at T = 280 K and P = 500 bar, on a cubic box containing 2,944 water and 512 methane molecules. For the water molecules, the Tip4P/ICE force field was used, while the methanes were modeled via united atom Lennard-Jones interactions. The number of mutually coordinated guests (MCG), as an indicator of the nucleus size, was chosen as the OP for interface definition. Interfaces were placed at MCG = 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90 and 100. Paths belonging to the ten highest interfaces were used in this work, while the others were disregarded.

2.2.2.2 TPS data

The TPS data was produced as described in [11], at T = 280 K with spring shooting moves, a spring constant of 0.1, and a frame shift of 200. 153 AB paths were generated this way and added to the dataset used in this work.

2.3 Data processing

The three datasets were stored in their crude form in files. To become accessible and usable to the model, several processing steps had to be performed to prepare the data for its final use.

2.3.1 Data import

As a first step, the raw data was imported. For the toy datasets, paths and the corresponding labels were read from files. Since the dataset already represented the RPE, all paths were assigned the same weight.

For the methane hydrate nucleation data, paths were also read from the raw TIS and TPS files. The label of each path was chosen based on the states of its first and last snapshot. Paths starting or ending outside of these state definitions were removed from the dataset. AA paths were associated with the label 0 and AB paths with the label 1. Since the distribution of different paths among the TIS and TPS data was biased, different weights were used to unbias the data. By applying a reweighting process [16] to the TIS data, each interface was assigned an RPE weight. These RPE weights corrected for the over- or under-representation of paths belonging to the given interface when compared to an unbiased path ensemble. Further, each TIS path was associated with a Monte Carlo weight which corrected for the undersampling of that specific path and which was provided with the path data. A path’s final weight was calculated by reading its respective Monte Carlo weight from the raw TIS files and multiplying it with the RPE weight associated with the path’s interface. Paths from the TPS raw files were extracted without associated weights. Instead, all TPS paths were assigned a weight based on Formula 2.1

w_{TPS} = \frac{w_{MCG100}}{n_{MCG100} + n_{TPS}}    (2.1)

where w_{TPS} is the weight assigned to a TPS path, w_{MCG100} the RPE weight associated with the highest TIS interface, n_{MCG100} the number of paths in the highest TIS interface, and n_{TPS} the number of TPS paths. The weights of all TIS paths belonging to the highest interface were subsequently updated according to Formula 2.2

w_{new} = \frac{w_{old}}{n_{MCG100} + n_{TPS}}    (2.2)

where w_{new} is the updated weight and w_{old} the original weight.

After this initial import, all three datasets were treated equally. Each path was decomposed into a list of snapshots. The snapshots inherited the label from their parent paths, while the weight of each snapshot was calculated by dividing the weight of the parent path by the number of snapshots it contained. All snapshots and their corresponding labels and weights were shuffled to obtain a randomly ordered dataset. This dataset was then split into training, validation, and test set, according to a 6:3:1 split ratio.
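A minimal sketch of this decomposition and split, with illustrative function names and an assumed fixed random seed:

```python
import numpy as np

def paths_to_snapshots(paths, labels, weights):
    """Flatten paths into snapshots that inherit the path label and a per-snapshot weight."""
    snapshots, snap_labels, snap_weights = [], [], []
    for path, label, weight in zip(paths, labels, weights):
        per_snapshot_weight = weight / len(path)
        for frame in path:
            snapshots.append(frame)
            snap_labels.append(label)
            snap_weights.append(per_snapshot_weight)
    return np.array(snapshots), np.array(snap_labels), np.array(snap_weights)

def shuffle_and_split(x, y, w, ratios=(0.6, 0.3, 0.1), seed=0):
    """Shuffle and split into training, validation, and test sets (6:3:1)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    x, y, w = x[order], y[order], w[order]
    n_train = int(ratios[0] * len(x))
    n_val = n_train + int(ratios[1] * len(x))
    return ((x[:n_train], y[:n_train], w[:n_train]),
            (x[n_train:n_val], y[n_train:n_val], w[n_train:n_val]),
            (x[n_val:], y[n_val:], w[n_val:]))
```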

2.3.2 Pipeline

As the datasets had limited usability in their crude form, further processing steps followed. Each of these steps was executed by a different function. Due to the modularity and unified interfaces of these functions, it was possible to assemble them into a variety of different pipelines. The pipeline used during this work as well as its components are described below. A schema of the pipeline is shown in figure 2.6.


Figure 2.6: Steps taken by the data processing pipeline.

2.3.2.1 Outlier removal

As a first step, outliers were removed. For this, the values at the 2nd and 98th percentile of each input dimension were chosen as lower and upper bound, respectively. The complete dataset was revisited, and values falling below the lower bound or above the upper bound were respectively increased or decreased to match it. As a consequence, the values of each dimension were capped by its lower and upper bound.
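A minimal NumPy sketch of this percentile capping (illustrative, assuming the data is an array of shape (n_snapshots, n_dims)):

```python
import numpy as np

def cap_outliers(data, lower_pct=2, upper_pct=98):
    """Clip every input dimension to its 2nd and 98th percentile."""
    lower = np.percentile(data, lower_pct, axis=0)
    upper = np.percentile(data, upper_pct, axis=0)
    return np.clip(data, lower, upper)
```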

2.3.2.2 Normalization

Having input variables that span different ranges can impede the performance of a NN. Normalization was therefore applied to bring all input dimensions to a comparable order of magnitude. The data of each dimension was normalized according to Formula 2.3

x_i^{norm} = \frac{x_i - \mu_i}{\sigma_i}    (2.3)

where i represents the current dimension, x_i^{norm} the normalized value of dimension i for the current data point, x_i the original value of dimension i for the current data point, \mu_i the mean value along dimension i, and \sigma_i the standard deviation along dimension i.

After the normalization, the data for each dimension was centered around 0 with a standard deviation of 1.

2.3.2.3 Reduction of input dimensions

For runs where only a subset of dimensions present in the data was of interest, the data columns representing the other dimensions were removed to save memory and increase performance speeds.


2.3.2.4 Snapshot binning

As the values of the input dimensions were lying on a continuous spectrum, most positions on the hyper-cube were only visited by a maximum of one path. For a reliable estimate of a p_B value, however, a higher number of paths passing through a point was needed. Therefore the points were binned. For the binning, a resolution r was chosen. The value along each dimension of each point was then mapped onto one of r discrete values.

This mapping consisted of shifting, rescaling, and rounding. First, the minimal value along each dimension was subtracted from the respective values of all points. As a consequence, the range of values was shifted to begin at 0. Next, the values along all dimensions were divided by the span between their respective minima and maxima and multiplied with r - 1, thereby rescaling the range of values to lie between 0 and r - 1. Lastly, to obtain the closest integers, values were increased by 0.5 and rounded down. Formula 2.4 describes the binning process for a given data point

x_i^{bin} = \left\lfloor \frac{x_i - \min_i}{\max_i - \min_i}\,(r - 1) + 0.5 \right\rfloor    (2.4)

where i represents the current dimension, x_i^{bin} the rescaled value of dimension i of the current point, x_i the original value of dimension i of the current point, \min_i the minimal value of dimension i, \max_i the maximal value of dimension i, and r the resolution.

Through numpy’s broadcasting, these operations were performed efficiently on the whole set of data points simultaneously.
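A minimal NumPy sketch of formula 2.4, assuming the data is an array of shape (n_snapshots, n_dims) so that broadcasting handles all points and dimensions at once:

```python
import numpy as np

def bin_snapshots(data, resolution):
    """Map each dimension onto `resolution` discrete values (formula 2.4)."""
    minima = data.min(axis=0)
    maxima = data.max(axis=0)
    scaled = (data - minima) / (maxima - minima) * (resolution - 1)
    return np.floor(scaled + 0.5).astype(int)
```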

2.3.2.5 p_B-Approximation

Binning the snapshots allowed for an approximation of p_B for each of the respective bins. The p_B values were approximated with the help of two hash tables and the list of binned snapshots. The first hash table was used to store the product of the label and weight, the weighted label, associated with the snapshots, while the second hash table stored the weights alone. The snapshots themselves served as the keys of the hash tables.

The algorithm iterated over all snapshots. If the current snapshot did not occur in the hash tables yet, it was added and the associated weighted label or weight stored as a respective value. If the snapshot already existed in the hash tables, the weighted label and weight were added to the values already stored.

After all snapshots were visited, the two hash tables stored the sum of all weighted labels and the sum of all weights associated with each binned snapshot, respectively. To obtain the final p_B value associated with a snapshot, its sum of weighted labels was divided by the sum of weights associated with this point, and the result served as an approximation of the p_B. The p_B for a given bin was approximated following formula 2.5

p_B = \frac{\sum_{i=1}^{n} l_i w_i}{\sum_{i=1}^{n} w_i}, \qquad l_i = \begin{cases} 1, & \text{for AB paths} \\ 0, & \text{for AA paths} \end{cases}    (2.5)

where n is the number of paths within the bin, l_i is the label of the ith path within the bin, and w_i is the weight of the ith path within the bin.

By using hash tables, the size of the auxiliary data structures never exceeded the size of the list of snapshots. This allowed a memory-efficient approximation of the p_B values even for high-dimensional data.
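In Python, the two hash tables can be realized as dictionaries keyed by the binned snapshot. The following is an illustrative sketch of formula 2.5, not the exact implementation used in this work:

```python
from collections import defaultdict

def approximate_pb(binned_snapshots, labels, weights):
    """Approximate p_B per bin as the weighted mean label (formula 2.5)."""
    weighted_label_sum = defaultdict(float)   # first hash table
    weight_sum = defaultdict(float)           # second hash table
    for snapshot, label, weight in zip(binned_snapshots, labels, weights):
        key = tuple(snapshot)                 # binned snapshot serves as key
        weighted_label_sum[key] += label * weight
        weight_sum[key] += weight
    return {key: weighted_label_sum[key] / weight_sum[key] for key in weight_sum}
```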

2.3.2.6 Packing of Tensorflow datasets

Finally, the snapshots and the approximated p_B values were packed together into Tensorflow dataset objects, distributed into batches of 64 instances each, and passed to the model for training and validation.
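A minimal sketch of this packing step using tf.data, with the batch size of 64 stated above (the shuffling call is an illustrative assumption):

```python
import tensorflow as tf

def pack_dataset(snapshots, pb_values, batch_size=64):
    """Pack snapshots and approximated p_B values into a batched Tensorflow dataset."""
    dataset = tf.data.Dataset.from_tensor_slices((snapshots, pb_values))
    return dataset.shuffle(len(snapshots)).batch(batch_size)
```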

2.4 Data characterization and visualization

2.4.1 Input importance measure

Often non-linear machine learning models are hard to interpret and seen as black boxes. Therefore, several strategies have been developed to extract insights about the relevance and relationships of the input. One such approach is to measure the contribution of each input dimension to the overall quality of the model.

Methods like holdback input randomization [32], input permutation [33], input perturbance [34], and input averaging [35] have been put forward. These methods share that they act on an already trained model and modify the dataset along one dimension to estimate its importance. While the fact that no retraining of the model is necessary can be time-saving, these methods tend to overestimate the importance of dimensions that are highly correlated with others [36].

As the methane hydrate nucleation dataset used in this work possessed many strongly correlated dimensions, these methods were deemed unsuitable. However, as all relevant information passes through the bottleneck, the problem of finding the input importance of the complete model can be reduced to finding the input importance of the encoder alone. It has been shown before that AEs with only linear activation functions in the encoder but non-linear activation functions in the decoder can still learn complex patterns. At the same time, such a linear encoder allows a simple importance measure by assessing the linear contribution of each input to the encoder’s output [19].

We implemented an importance measure for linear encoders based on this principle. The method takes the linear contributions of the inputs to the encoder’s output and visualizes them for a simple assessment of input importance.
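The following sketch illustrates how such a measure could be computed for a Keras encoder built from dense layers with linear activations; it collapses the layer kernels into a single weight matrix. It is an illustrative implementation, not necessarily the one used in this work.

```python
import numpy as np

def linear_input_importance(encoder):
    """Collapse a linear Keras encoder into one weight vector per bottleneck node.

    For purely linear layers the mapping is x -> x @ W1 @ W2 @ ... (+ constant),
    so the product of the kernels gives each input's linear contribution.
    """
    kernels = [layer.get_weights()[0] for layer in encoder.layers
               if layer.get_weights()]          # skip layers without weights
    combined = kernels[0]
    for kernel in kernels[1:]:
        combined = combined @ kernel
    return np.abs(combined)                     # shape: (n_inputs, n_bottleneck)
```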

2.4.2 Visualizing model outputs

After the model was tuned and trained, it was possible to gain new insights from it. The raw outputs of the model, however, lacked interpretability. Therefore, methods were written to allow a visual inspection of the model outputs.

2.4.2.1 Plotting committor predictions

To visualize the committor prediction, a set of pairwise heat maps was generated, covering all pairs of dimensions that can be selected from the dataset. Representative snapshots were generated for each of the heat maps. For each point on the heat map, the representative snapshot consisted of fixed values for the heat map’s x and y dimension, given by the position of the point on the grid, as well as representative values for the other dimensions. The representative values of these other dimensions were calculated as the mean values of all points with the given values for the dimensions of interest, as depicted in formula 2.6

\bar{x}_i = \frac{\sum_{j=1}^{n} x_{ij}}{n}    (2.6)

where \bar{x}_i is the calculated value for dimension i and n is the number of points that match the current values of the dimensions of interest.

To generate the heat maps, each grid point's representative snapshot was passed through the committor predictor, and the predicted committor was taken as the value displayed in the heat map.
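The sketch below illustrates how one such pairwise heat map could be assembled; the function name, its arguments, and the grid resolution are hypothetical, and `predictor` is assumed to be a trained Keras model mapping a snapshot to a single committor value.

```python
import numpy as np

def committor_heat_map(snapshots, predictor, dim_x, dim_y, n_grid=25):
    """Hypothetical sketch of one pairwise committor heat map.

    For each grid cell, a representative snapshot is built: the values of
    `dim_x`/`dim_y` are fixed by the grid position, and the remaining
    dimensions are set to the mean of all snapshots in that cell (formula 2.6).
    """
    x_edges = np.linspace(snapshots[:, dim_x].min(), snapshots[:, dim_x].max(), n_grid + 1)
    y_edges = np.linspace(snapshots[:, dim_y].min(), snapshots[:, dim_y].max(), n_grid + 1)
    heat_map = np.full((n_grid, n_grid), np.nan)

    for i in range(n_grid):
        for j in range(n_grid):
            in_cell = (
                (snapshots[:, dim_x] >= x_edges[i]) & (snapshots[:, dim_x] < x_edges[i + 1])
                & (snapshots[:, dim_y] >= y_edges[j]) & (snapshots[:, dim_y] < y_edges[j + 1])
            )
            if not in_cell.any():
                continue
            # Representative values for the other dimensions (formula 2.6),
            # with the two dimensions of interest fixed to the cell centers.
            representative = snapshots[in_cell].mean(axis=0)
            representative[dim_x] = 0.5 * (x_edges[i] + x_edges[i + 1])
            representative[dim_y] = 0.5 * (y_edges[j] + y_edges[j + 1])
            heat_map[j, i] = predictor.predict(representative[None, :], verbose=0)[0, 0]
    return heat_map
```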

2.4.2.2 Plotting input reconstruction

To visualize the reconstruction, a set of scatter plots was generated. Each scatter plot displayed the input along one dimension on the x-axis and the corresponding prediction of the base autoencoder along the y-axis.

As before, representative snapshots were used: the value of the dimension of interest was defined by the position on the x-axis of the scatter plot, while the values for the other dimensions were calculated via formula 2.6.

2.4.2.3 Mapping paths onto the latent space

By giving a snapshot to the encoder, it was possible to map it onto the latent space. Similarly, complete paths could be mapped onto the latent space by encoding all of their snapshots. To gain insights into the representation of the configurational space on the latent space, a method was written that selected paths from the dataset, mapped them onto the latent space, and plotted these latent paths.
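A minimal sketch of this path-to-latent-space mapping is shown below; it assumes a trained Keras `encoder` with a one-dimensional latent space and paths stored as arrays of snapshots, and the helper name is illustrative.

```python
import matplotlib.pyplot as plt

def plot_latent_paths(encoder, paths):
    """Hypothetical helper: encode whole paths and plot their latent traces.

    `paths` is a list of (n_frames, n_dims) arrays; `encoder` is assumed to map
    each snapshot onto a single latent value.
    """
    for path in paths:
        # One latent value per snapshot along the path.
        latent = encoder.predict(path, verbose=0).ravel()
        plt.plot(range(len(latent)), latent, alpha=0.7)
    plt.xlabel("frame")
    plt.ylabel("latent value")
    plt.show()
```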

2.4.2.4 Mapping paths onto the committor space

Like the encoder, the committor predictor allowed not only the mapping of single snapshots to the committor but also the mapping of whole paths onto the committor space. A method was written to select paths and plot the development of the committor over their course.

2.4.2.5 Mapping latent paths onto the configurational space

Mappings were not only possible from the configurational space; the reconstruction decoder also allowed a mapping from the latent space back to the configurational space. By defining any path on the latent space, a corresponding path in the configurational space could be reconstructed. A method was written to generate a representative latent path and plot the resulting reconstructed path.
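A short sketch of this direction of mapping is given below; the helper name, the latent range, and the number of frames are assumptions for illustration, and `reconstruction_decoder` is assumed to be the trained decoder part of the EAE.

```python
import numpy as np

def generate_path_from_latent(reconstruction_decoder, latent_min=-2.0, latent_max=2.0, n_frames=100):
    """Hypothetical helper: decode a straight line through a 1D latent space.

    The latent range and number of frames are illustrative choices; the decoder
    is assumed to map a latent value back to a full configuration.
    """
    # Evenly spaced latent values define the latent path.
    latent_path = np.linspace(latent_min, latent_max, n_frames).reshape(-1, 1)
    # Decoding every latent point yields a generated path through the
    # configurational space.
    return reconstruction_decoder.predict(latent_path, verbose=0)
```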

2.4.3 Visualizing the ground truth

To allow for a visual comparison between prediction and ground truth, 2D heat maps for each pair of dimensions were generated from the original dataset. Similar to the $p_B$ approximation, a weighted mean associated with each grid point was calculated from the labels and weights. Here, however, the snapshots were binned along only two dimensions at a time.
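Such a weighted 2D binning can be sketched with two weighted histograms, as shown below; the helper name and grid size are illustrative assumptions.

```python
import numpy as np

def ground_truth_heat_map(snapshots, labels, weights, dim_x, dim_y, n_grid=25):
    """Hypothetical sketch of the 2D ground-truth binning.

    The weighted mean of the path labels is computed per 2D bin via two
    weighted histograms, mirroring the p_B approximation in two dimensions.
    """
    x, y = snapshots[:, dim_x], snapshots[:, dim_y]
    # Sum of label*weight per bin, and sum of weights per bin.
    num, x_edges, y_edges = np.histogram2d(x, y, bins=n_grid, weights=labels * weights)
    den, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges], weights=weights)
    with np.errstate(invalid="ignore", divide="ignore"):
        # Weighted mean label (approximated p_B) per grid cell; empty cells are NaN.
        return num / den
```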

2.5 Libraries

In this work, custom code was written using IPython [37], Matplotlib [38], Numpy [39], Openpathsampling [40,41], Plotly [42], Scikit-learn [43] and Tensorflow [44].


2.6 Workflow

The model and methods were developed iteratively over the course of this project. To indicate the steps of this project, we depicted our generalized workflow in figure 2.7.

Figure 2.7: Workflow of this project. The model was developed iteratively, first on the double-well potential dataset, then on the Z-potential dataset, and finally on the methane hydrate nucleation dataset. After finalizing the model, experiments were


Chapter 3

Experiments and Results

3.1 Double-well potential

As proof of concept, the model was tested on the double-well potential toy dataset. In figure 3.1, the ground truth and the respective predictions by the model are displayed in pairwise heat maps for five dimensions of the dataset (x1 - x5). Additionally, figure 3.2 shows a close-up view of the x1/x2 plot for both ground truth and prediction.


Figure 3.1: Double-well potential: Pairwise heat maps for the dimensions x1 - x5. (Top) Ground truth, $p_B$ values approximated from the double-well potential dataset. (Bottom) Prediction made by the model after training.


Figure 3.2: Double-well potential: x1/x2 heat map. (Left) Ground truth, $p_B$ values approximated from the double-well potential dataset. (Right) Prediction made by the model after training.

In the first column, the ground truth shows a clear split of the approximated $p_B$ values along the x1 axis. Low $p_B$ values are associated with negative values for x1 and high $p_B$ values with positive ones. A vertical dividing surface is located at x1 = 0 for all subfigures in this column. The predictions by the model closely resemble the ground truth in all heat maps of the first column. The model also captures the split between low and high committor along the x1 axis.

For the second, third, and fourth column, the heat maps of the ground truth appear to be dominated by random noise, with a large fraction of values lying close to an approximated $p_B$ of 0.5. This behavior is captured less clearly by the model. While its predictions are also dominated by values close to 0.5, there are also regions with higher or lower predicted $p_B$ values. However, no clear trend appears in the predicted heat maps for these columns either.

When closely inspecting the x1/x2 plots, it can be seen that the model predicts the committor with high accuracy. The position of the dividing surface, as well as the colors on the two sides of the barrier, are closely matched between ground truth and prediction.

Apart from its predictions, the model's reconstruction was also assessed. Figure 3.3 depicts the output of the reconstruction decoder for given inputs along the dimensions x1 - x5.


Figure 3.3: Double-well potential: Output of the reconstruction decoder of the trained model, based on given inputs for the dimensions x1 - x5.

A good reconstruction of the input values is signified by a diagonal line, indicating that low input values yield low output values and high input values yield high output values. Such a diagonal line is only present for x1. The reconstruction line for all other input dimensions resembles a horizontal line, indicating poor reconstruction.

By selecting an exemplary path on the latent space and passing it to the reconstruction decoder, it was possible to convert this path on the latent space into a path on the configurational space. Figure 3.4 depicts how the path develops along all five dimensions (left), as well as how the x1 and x2 components develop with respect to the underlying double-well potential (right).

Figure 3.4: Double-well potential: (Left) Conversion of a representative latent space path into a representative path moving through the configurational space; values of x1 - x5 at different frames of the latent path. (Right) x1 and x2 components of the representative path projected onto the underlying double-well potential.

As a clear trend, the x1 values start low and increase together with the latent space values. The values of x2 start around 0.5 and drop to -0.5 as the latent path progresses. The other dimensions all stay roughly constant around values of 0 as the path progresses, indicating that they are unaffected by the change in the latent variable.


In the overlay with the double-well potential energy surface, the first point of the generated path lies left of state A; the next point, however, already lies within state A. From there the path moves in an almost straight line towards state B, passing over the barrier at its lowest point. The path then moves through state B and ends to the right of this state.

3.2 Z-Potential

After testing the model on the simplistic double-well potential data, a dataset generated on the Z-potential was chosen as the next test case. A visual comparison of ground truth and predictions by the model is depicted in figures 3.5 and 3.6.


Figure 3.5: Z-potential: Pairwise heat maps for the dimensions x1 - x5. (Top) Ground truth, $p_B$ values approximated from the Z-potential dataset. (Bottom) Prediction made by the model after training.


Figure 3.6: Z-potential: x1/x2 heat map. (Left) Ground truth, $p_B$ values approximated from the Z-potential dataset. (Right) Prediction made by the model after training.

The x1/x2 plot of the ground truth shows the characteristic Z-shape that is associated with the Z-potential. The lower region of the plot shows values below 0.5, and the higher region shows values above 0.5. At the barrier, a set of black dots is present, indicating a region with an approximated $p_B$ of around 0.5. The prediction somewhat deviates from the ground truth. While it also shows a clear split between the lower and upper regions, this split is flattened out compared to the ground truth. Further, only a transition from green to yellow, but no black dividing region, is present.

For the other heat maps in the first column, the ground truth shows a clear split along the x1 dimension. The prediction captures the transition from low to high $p_B$ values, but again no black dividing region can be observed. Similarly, for the second column, a clear split along the x2 dimension can be seen in the ground truth, which is also captured by the model's prediction. The remaining two columns are dominated by random noise. In the ground truth, the majority of points are black, indicating an approximated $p_B$ close to 0.5. In the prediction, this behavior is not captured. Instead, both higher and lower values are being predicted for the heat maps in these columns.

To test the reconstruction of inputs, the input and output values of the AE part of the model were again compared. Figure 3.7 depicts this comparison.


Figure 3.7: Z-potential: Output of the reconstruction decoder of the trained model, based on given inputs for the dimensions x1 - x5.

For the reconstruction of inputs, x1 shows a very clear diagonal line, indicating a good reconstruction. x2 shows a stepwise pattern. Input values below 0 yield a constant, low output, while input values above 0 yield a constant, high output. In between, a swift transition occurs. While this captures some behavior of the data, the reconstruction of these values is poor. The other dimensions show mainly horizontal lines, meaning that the input value for a given dimension does not influence the corresponding output value.

As for the double-well potential dataset, an exemplary latent path was chosen and converted into a path on the configurational space using the reconstruction decoder. Figure 3.8 depicts how the generated path develops along x1 - x5 (left), as well as only along the x1 and x2 dimensions (right).

Figure 3.8: Z-potential: (Left) Conversion of a representative latent space path into a representative path moving through the configurational space; values of x1 - x5 at different frames of the latent path. (Right) x1 and x2 components of the representative path projected onto the underlying Z-potential.

When mapping the latent path onto the configurational space, mainly the x1 and x2 values change as the path progresses, while the other dimensions stay mostly unchanged. Initially, the x1 values start low, while the x2 values begin somewhat higher but also in the negative region. As the path progresses, at first the x1 values rapidly increase, while the x2 values fall slightly. Then simultaneously, the x1 values decrease, while the x2 values quickly increase. Lastly, the x1 values once more increase rapidly, while the x2 values decrease moderately. At the same time, x3, x4, and x5 experience only minimal changes in values.

In the overlay of the x1 and x2 dimensions with the potential energy surface, a Z-shaped path is apparent. The path begins left of and below state A and then moves past the state before turning and passing over the barrier. The path then turns again and moves towards state B. In its final frame, the generated path leaves state B and ends to the right of it.

3.3 Methane hydrate nucleation

After showing the feasibility of the EAE model, it was tested on the methane hydrate nucleation dataset. While the dataset consisted of 22-dimensional data, in the course of this work only a subset of eight representative dimensions was utilized (MCG, $5^{12}$, $5^{12}6^2$, $5^{12}6^4$, CR, $R_g$, $F_4$, $N_{w,3}$). Preliminary results for the full-dimensional dataset can be found in Appendix C.

To compare the effect of different activation functions in the encoder on the different tasks executed by the model, two variants of the model will be discussed below. In the non-linear encoder variant, the activation functions of all nodes in the encoder are non-linear, while the linear encoder variant uses only linear activation functions in its encoder.

3.3.1 Non-linear encoder

A comparison of the ground truth and the predictions made by the EAE with a non-linear encoder is depicted in figure 3.9.


Figure 3.9: Nucleation data, non-linear encoder: Pairwise heat maps for eight representative dimensions. (Top) Ground truth, $p_B$ values approximated from the nucleation dataset.
