
Literature Thesis

Transferability of Molecular Machine Learning Models

by

Bart J. de Mooij

21 May 2021

Student number 11217278

Research institute: Van 't Hoff Institute for Molecular Sciences

Supervisor/examiner: Dr. Ir. B. Ensing

Research group:

Second examiner:


Abstract

In recent years, more and more computational chemists have been exploring and investigating the applications and developments of machine learning in chemistry due to its promising results. Machine learning is able to reduce the computational cost significantly, while maintaining the accuracy of the model. Consequently, a multitude of molecular machine learning models have already been made. However, these models often focus on specific applications and fail to describe related problems and molecular systems. This would require a high number of different models to describe every single chemical application, which is both inefficient and unfeasible. Therefore, it is important to improve current methods in order to obtain generic and transferable models. This literature research provides an overview of current ”state-of-the-art” machine learning methods and investigates the associated challenges to achieve generic and transferable models for molecular simulations. This study concludes that these challenges concern the generalization of current machine learning models, efficient data handling and the transferability of intensive properties. Additional challenges occur when machine learning methods are applied to molecular dynamics and coarse grained simulations, such as improving enhanced sampling techniques and extending to larger molecular systems. Further research is required to solve these challenges, which could include incorporating active learning into current models.


Contents

List of Abbreviations

1 Introduction
2 Molecular Simulations
   2.1 Ensembles
   2.2 Molecular Dynamics
   2.3 Coarse Grained Modelling
3 Neural Networks
   3.1 Artificial Neural Networks
   3.2 Kernel based Machine Learning
   3.3 Graph Neural Networks
   3.4 Training of Neural Networks
   3.5 Data Sets for Molecular Machine Learning
   3.6 Active Learning
4 Transferability
5 Current Molecular Machine Learning Methods
   5.1 Graph Neural Network based Models
   5.2 Gaussian Approximation Potential
   5.3 ANI
   5.4 SchNet
   5.5 VAMPnets
   5.6 BAND NN
6 Machine Learning for Molecular Dynamics
7 Machine Learning for Coarse Grained Modelling
8 Challenges for Transferable Molecular Machine Learning
9 Conclusion and Outlook
10 Acknowledgements


List of Abbreviations

AL Active learning
AM1 Austin method 1
ANI Accurate neural network engine for molecular energies
BAND NN Bonds, angles, nonbonded interactions, and dihedrals neural network
CC Coupled cluster
CCSDT Coupled cluster single double triple
CG Coarse-grained
DFT Density functional theory
DFTB Density functional tight binding
DimeNet Directional message passing neural network
FF Feed forward
GAP Gaussian approximation potential
GNN Graph neural network
HOMO Highest occupied molecular orbital
LJ Lennard-Jones
LLS Linear least squares
LUMO Lowest unoccupied molecular orbital
MAE Mean absolute error
MC Monte Carlo
MD Molecular dynamics
MSE Mean square error
NN Neural network
OQMD Open quantum materials database
PES Potential energy surface
PM6 Parameterization method 6
QBC Query by committee
RMSE Root mean square error
RNN Recurrent neural network
SOAP Smooth overlap of atomic positions
SRV State-free reversible VAMPnets
VAMPnets Variational approach for Markov processes using neural networks

1 Introduction

In recent years, an enormous growth in data-driven approaches for modelling molecules has been observed; these approaches accelerate and enable a wide range of chemical applications, including chemical discovery [1], enlarging and analysing molecular databases [2] as well as designing molecules for specific applications [3]. Therefore, various machine learning (ML) methods are increasingly being developed for these applications and their desired use. In addition, this could potentially allow computational chemists to overcome current challenges in the near future.

Previously, quantum chemical calculations were often used to determine the properties and potential energy surface (PES) of the chemical system. However, these methods suffer from a trade-off between the accuracy and computational cost of the simulation [4]. Density functional theory (DFT) is one of the most widely used methods for quantum chemical computations as it gives high accuracy results for a relatively low computational cost due to approximations made for solving the Schrödinger equation [5]. In this way, DFT is able to examine the electronic structure of a many-body system, such as atoms and molecules. Nevertheless, these calculations still fall short in simulating large molecular systems as well as longer time scales, which are often observed in biological systems.

The increased computational cost of quantum chemical calculations makes it, in many cases, not viable or even impossible to simulate larger systems or longer time scales. As a result, approximate methods are often used to allow for faster simulations, while compromising the accuracy of the simulation. An example of this is a semi-empirical method, which still uses an ab initio framework but is supplemented with approximations and parameters from empirical data [6, 7]. Another example is a molecular mechanics simulation, which is fully empirical and completely excludes quantum mechanics from its computations. This is realized by using a database of compounds from which a force field, including certain parameters and functions, is obtained [8–10]. Although these force fields are less accurate compared to quantum chemical methods, they are frequently used in practice due to their considerably increased efficiency [11, 12].

Currently, ML methods are being developed to overcome the problems associated with quantum chemical, semi-empirical and molecular mechanics methods. Behler and Parrinello were the first to develop a molecular representation, called symmetry functions, for ML in molecular simulation in 2007 [13]. This opened the way to a new, rapidly advancing research area within computational chemistry. ML is able to extract complex patterns and links from a large data set and can, subsequently, predict properties of molecules [14]. These methods have already been widely applied in other areas, such as image recognition [15], speech recognition [16] and traffic prediction [17], and are now also used in molecular simulations. In ML, a machine is trained on a data set of molecules with known properties already obtained from experiments or physics driven computations in order to learn complex patterns and links observed in the data [18]. As a result of training, ML is not required to solve first principle equations itself as long as it is provided with a sufficient number of examples to extract the underlying patterns from. After that, it uses this knowledge to predict the (unknown) properties of other molecules outside the trained data set. These predictions are only as good as the data given to the model for training. Consequently, if the given training data is obtained from quantum chemical computations, the obtained outcomes of ML are of similar accuracy, while the computational cost is significantly reduced.

Several types of ML are being developed for use in molecular simulations, which can be divided into kernel based [14,19–23] and neural network (NN) based methods [24–31]. For the desired application, a choice between both methods needs to be made based on their advantages and disadvantages. The inspiration for developing NNs came from biological neural networks within the animal brain [32]. Artificial NNs are based on a collection of nodes, which are connected to other nodes in the system via non-linear functions of the sum of their inputs. These connections contain associated weights that are being optimized during an initial learning process. These methods often need a large amount of training data and, thus, require relatively large databases. Databases with accurate data for molecules can be based on empirical results or on previously obtained quantum chemical data, such as the QM9 database [33] or the Open Catalyst 2020 Dataset [34]. Another type of method is kernel based ML, such as the support vector machine and Gaussian approximation potentials (GAP), which only requires a kernel specified by a user. Nevertheless, these kernel based methods can also be seen as special types of NNs with a fixed first layer. The kernel is a similarity function over pairs of data points in a raw representation. It uses the so-called ”kernel trick”, which means that kernel methods are not required to compute the coordinates of all the data, but only the inner product between images of all pairs of data [35]. Consequently, these methods can operate in high-dimensional space, which is very common for larger molecular systems. On the other hand, kernel based methods, such as support vector machines, require a linearly increasing number of parameters with the size of the input. Therefore, kernel based methods can struggle when they are applied to large data sets [36].

These ML models can be used to describe the chemical system by obtaining its PES, a topic for which an increasing amount of research has been performed in recent years [13, 20, 37–48]. The PES can be used to describe chemical interactions as well as to determine all properties of the system using appropriate computer simulations [24]. The PES is defined as a function that gives the potential energy of the system based on its atomic coordinates. The size of the set of atomic coordinates depends on the size of the system and, thus, the PES for large molecular systems can be given by a high-dimensional and complicated function. The input of the atomic coordinates should take all relevant degrees of freedom into account. The degrees of freedom need to be written in suitable coordinates, such as Cartesian or internal coordinates, depending on the application [49, 50]. Examples of such internal coordinates are bond lengths, angles and dihedrals. Other types of representations can be obtained by performing permutations, rotations or translations on them. However, the ability of ML models to predict intensive properties is still far behind compared to predicting the energies of the molecular system. One of the main reasons for this deficiency is their limited transferability.

Transferability of a ML model refers to its ability to apply an obtained model to other related problems outside the used training set. This is an important requirement for a ML model as the goal of ML in molecular simulations is to accurately predict unknown properties of chemical systems, such as the PES. These related problems include larger molecular systems as well, which are usually either too expensive or impossible to obtain using quantum chemical calculations. Unfortunately, this has proven difficult for chemical applications and even high-end ML algorithms, such as SchNet [27], suffer from these challenges.

Subsequently, ML methods can be applied to molecular dynamics (MD) and coarse grained (CG) models [51,52] as well as Monte Carlo (MC) simulations [53]. MD simulations are often used to explore the conformational space by analysing the physical movement of atoms and molecules obtained from Newton’s equations of motion [54]. Therefore, this is often the method of choice to explore the dynamics of large molecular systems, such as those often observed in biology. CG models use a simplified representation, where a combination of atoms is clustered together and simulated as one ”particle” (or ”bead”). This enables more efficient simulations of complex molecular systems. In addition, using CG allows the application of MD to larger molecular systems as well as longer timescales than would otherwise be possible with an atomic resolution. Nevertheless, the conclusions drawn from this simplified representation are still required to be in line with higher resolution models. MC stochastically produces the states of a molecular system according to an appropriate distribution by the use of random numbers. In contrast to MD, MC is not able to obtain the dynamics of a system as only static calculations are possible. Although this might seem like a disadvantage, it enables MC to easily bypass energy barriers and generate new conformations. Therefore, it is usually the preferred method in applications where the dynamics are not essential or not of main interest, such as adsorption [55].

The increased number of computational chemists exploring and investigating the applications and developments of ML displays the acknowledged importance and potential these methods contain. Consequently, an overview of current methods can quickly become outdated due to the fast pace of development in the field. As a result, it is more valuable to identify which challenges need to be solved in order to arrive at a generic, transferable ML model for molecular simulations. Additionally, the PES is a basic component in MD [56] and MC [54] simulations, which are widely used in chemistry but also in related sciences [57]. Therefore, fast and accurate computations of the PES are beneficial for many chemical simulations as well as potentially allowing accurate simulations for large molecular systems of more than thousands of atoms and longer simulation times. It is essential for ML models to be transferable in order to extend their application to these larger molecular systems, which are too expensive to calculate with quantum chemistry. A CG representation can be incorporated in these ML models in order to help in acquiring accurate predictions for these larger, extended molecular systems. In addition, transferability is required to accurately predict unknown properties, such as the PES, of molecular systems that were not present in the training data set.

The following research question will be investigated in this literature research.

”What challenges need to be solved to arrive at a generic transferable ma-chine learning model for molecular simulations?”

This report will explain the theoretical background of molecular simulation techniques as well as different types of NNs, including their training, data sets and active learning (AL). Then, transferability of ML models will be explained after which an overview of current ”state-of-the-art” molecular ML methods is provided. Subsequently, the transferability of ML methods for MD and CG modelling will be investigated. After that, the challenges accompanying the development of a transferable molecular ML model will be discussed. Lastly, a conclusion will be made based on the performed literature research together with an outlook for further research.

2 Molecular Simulations

2.1 Ensembles

Molecular simulations are performed within a certain ensemble. An ensemble is a large collection of configurations of the molecular system, each representing a possible state of the system, over which averages are taken. This ensemble should be chosen before performing the simulation and the choice depends on both the goal and the method used in the simulation. Several ensembles exist and the four most common ones are the micro-canonical (NVE) ensemble, the canonical (NVT) ensemble, the isothermal-isobaric (NPT) ensemble and the grand-canonical (µVT) ensemble. For example, the micro-canonical ensemble keeps the number of atoms N, volume V and energy E constant during the simulation, whereas the number of atoms N, pressure P and temperature T remain fixed for the isothermal-isobaric ensemble. The micro-canonical ensemble is commonly used in MD, whereas the grand-canonical ensemble is often chosen in MC. The reason for this is that the grand-canonical ensemble uses a fixed chemical potential µ rather than a constant number of atoms, which is required for MC simulations that remove or add atoms to the system at random. Choosing an ensemble allows measuring a specific property and how this property alone affects the molecular system.

2.2 Molecular Dynamics

As its name suggests, MD follows the dynamics of a molecular system in time. MD generates information of the system at the microscopic level, i.e. atomic positions and velocities. The movement of atoms and molecules in the molecular system is determined by numerically solving Newton’s equations of motion at every time step of the simulation. Subsequently, macroscopic properties, such as pressure or temperature, are obtained by applying statistical mechanics. MD simulations consist of five steps as shown below after which these steps are more elaborately explained.

1. Assign positions and velocities of particles
2. Compute forces on all particles
3. Integrate equations of motion
4. Measure properties
5. Repeat steps 2-4 until stopped

At first, initial positions and velocities of a molecular system are provided to the machine. For example, in the case of a crystalline solid, the initial positions can be defined as the atom positions observed within a unit cell of the crystal and, subsequently, this cell is repeated in all directions until the desired dimension is achieved. The initial velocities of the molecular system are given to the machine by assuming a three-dimensional Maxwell-Boltzmann distribution [58]. This is performed by taking a random number from a Gaussian distribution shown below.

f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}},   (1)

where f(x) is the probability density function, σ the standard deviation, µ the mean and x the sampled value for a given dimension (i.e. x, y or z). Subsequently, the obtained random number is multiplied by the root mean square velocity given by

v_{rms} = \sqrt{\frac{2 k_B T}{m}},   (2)

where k_B is the Boltzmann constant, T the temperature and m the mass. This is performed for all three dimensions and at the end the total momentum of the molecular system should be zero.
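As an illustration, this initialization step can be sketched in a few lines of Python. This is a minimal sketch assuming reduced units; the function name and the example system are hypothetical, and the scaling factor simply follows equation 2 above.

```python
import numpy as np

def initialize_velocities(masses, T, kB=1.0, seed=0):
    """Assign initial velocities as described above: draw a standard Gaussian
    number per dimension (Eq. 1), scale it by the velocity factor of Eq. 2,
    and remove the net momentum so the total momentum is zero."""
    rng = np.random.default_rng(seed)
    n = len(masses)
    # Standard-normal samples for the x, y and z components of every atom.
    v = rng.standard_normal((n, 3))
    # Scale each atom's components by sqrt(2 k_B T / m), as in Eq. 2.
    v *= np.sqrt(2.0 * kB * T / masses)[:, None]
    # Remove the centre-of-mass motion: the total momentum must be zero.
    p_total = (masses[:, None] * v).sum(axis=0)
    v -= p_total / masses.sum()
    return v

# Example: 100 identical particles (reduced units) at T = 1.2.
masses = np.ones(100)
velocities = initialize_velocities(masses, T=1.2)
print(np.abs((masses[:, None] * velocities).sum(axis=0)).max())  # ~1e-15
```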

In order to compute the forces on all particles, a cutoff radius is defined and only interactions between particles within that radius are considered, resulting in increased computational efficiency. The use of a cutoff radius is supported by the fact that interaction contributions of particles further away will be small. There are several types of interactions between particles in a molecular system, namely bonded and non-bonded interactions. Bonded interactions always act on a relatively short range within the cutoff radius, as otherwise the bond between the two atoms would break. Non-bonded interactions consist of van der Waals (VDW) and electrostatic interactions. These two contributions are combined in the following potential, consisting of a Lennard-Jones (LJ) term and an electrostatic term.

V(r_{ij}) = \sum_{i<j}\left[4\epsilon\left(\left(\frac{\sigma}{r_{ij}}\right)^{12} - \left(\frac{\sigma}{r_{ij}}\right)^{6}\right) + \frac{q_i q_j}{d\, r_{ij}}\right],   (3)

where V is the potential energy, r_{ij} the distance between two interacting particles, ε the depth of the potential well, σ the distance at which the particle-particle potential energy is zero, q_i and q_j the charges of the respective atoms and d the dielectric constant. In equation 3, the first component (the r^{-12} and r^{-6} terms) describes the VDW interactions and the second component (the q_i q_j term) the electrostatic interactions. The VDW interactions decay faster than electrostatic interactions, causing the VDW forces to be negligible at the cutoff distance. However, as electrostatic interactions decay slower, the interaction remains significant at the cutoff distance. This effect introduces an error, which can be reduced by, for example, applying Particle Mesh Ewald to long-ranged electrostatic interactions. Once the corrected potential energy is obtained, the forces on the atoms can be computed as the negative gradient of the potential energy, as shown below.

F_i = -\frac{\partial U(r^N)}{\partial r_i},   (4)

where F_i is the force on atom i, U(r^N) the potential energy depending on all atomic positions in the system and r_i the position of atom i.

After obtaining the forces from the positions, new positions and velocities can be computed by integrating Newton’s equations of motion, which is shown below.


F_i = m_i a_i = m_i \frac{\partial^2 x_i}{\partial t^2},   (5)

where a_i and x_i are the acceleration and position of atom i, respectively. In order to obtain the velocity, the acceleration is first computed using a_i = F_i/m_i. Subsequently, this acceleration is used to calculate the difference in velocity between two time steps by ∆v_i = a_i ∆t, where ∆t is the time difference between the two subsequent states. The velocity of the particles is simply obtained by adding the velocity difference to the current velocity. Then, the new atomic positions of the system at the next time step are obtained by ∆x_i = v_i ∆t and x_{i,new} = x_{i,current} + ∆x_i.
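Steps 2-5 of the MD procedure can then be combined into a simple integration loop. The sketch below assumes reduced units and uses the explicit ∆v = a∆t, ∆x = v∆t updates described above, together with a plain Lennard-Jones force routine (no electrostatics, no neighbour lists); production codes typically use more stable integrators such as velocity Verlet.

```python
import numpy as np

def lj_forces(x, sigma=1.0, eps=1.0, rcut=2.5):
    """Pairwise Lennard-Jones forces with a plain cutoff (step 2).
    O(N^2) double loop, kept simple for clarity."""
    n = len(x)
    f = np.zeros_like(x)
    for i in range(n):
        for j in range(i + 1, n):
            rij = x[i] - x[j]
            r = np.linalg.norm(rij)
            if r < rcut:
                # -dV/dr of 4*eps*((sigma/r)^12 - (sigma/r)^6), projected on rij
                fmag = 24 * eps * (2 * (sigma / r) ** 12 - (sigma / r) ** 6) / r ** 2
                f[i] += fmag * rij
                f[j] -= fmag * rij
    return f

def run_md(x, v, masses, dt=0.005, n_steps=1000):
    """Steps 2-5 of the MD loop using the simple explicit updates from the text:
    a = F/m, v = v + a*dt, x = x + v*dt."""
    temperatures = []
    for step in range(n_steps):
        forces = lj_forces(x)               # step 2: forces on all particles
        a = forces / masses[:, None]        # a_i = F_i / m_i
        v = v + a * dt                      # step 3: integrate velocities ...
        x = x + v * dt                      # ... and positions
        # step 4: measure properties, here the instantaneous temperature (k_B = 1)
        temperatures.append((masses[:, None] * v ** 2).sum() / (3 * len(x)))
    return x, v, temperatures

# Example: 27 particles on a small cubic lattice (reduced units).
grid = np.arange(3) * 1.2
x0 = np.array([[i, j, k] for i in grid for j in grid for k in grid], dtype=float)
m = np.ones(len(x0))
v0 = np.zeros_like(x0)
x_final, v_final, temps = run_md(x0, v0, m, n_steps=100)
```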

At the end of each MD iteration, the properties can be measured. In order to determine the macroscopic thermodynamic properties, the molecular system should be ergodic. Ergodicity states that a point of a moving system will eventually visit all possible states of the system in a uniform and random manner. As a consequence of ergodicity, the average of a large number of configurations of a molecular system is equal to the average over time for the same system, implying that the results of MC and MD simulations are the same. The computation of forces, integration of the equations of motion and measurement of properties are repeated until a stopping condition is reached. This stopping condition is usually when either all accessible states of the system have been visited or when the probability distribution of a certain quantity no longer changes (i.e. converges).

2.3 Coarse Grained Modelling

Large molecular systems, such as biomolecules, are often too large for modelling by traditional all atom models due to their limited efficiency and available computing power [59]. Even specifically designed supercomputers for atomistic MD are usually only applicable to relatively small systems and fast processes [60]. Therefore, CG models are often used to study these longer length and time scales. CG models use a simplified representation of the molecular system by splitting up the system into beads (i.e. subcomponents), consisting of multiple atoms, e.g. a phenyl group is represented by one bead. Therefore, CG models increase the efficiency of simulations by linking high accuracy quantum chemistry with a macroscopic model. This resolution reduction is performed in order to reduce the number of degrees of freedom of the system. However, the outcomes obtained from a CG model should be in line with higher resolution models, and CG models are often made with a specific goal in mind (e.g. reproducing thermodynamics). The choice of beads is often not straightforward, especially in highly fragmented molecular systems. Although a CG model is beneficial for the computational cost of simulations, it introduces an error as multiple atomistic molecular structures can be represented by the same CG structure. As a consequence, issues related to thermodynamic representability, consistency and transferability occur [61]. Moreover, certain properties that highly depend on the small-scale, atomistic configuration will be lost when performing CG [62]. Therefore, understanding of the problem at hand is important in order to capture as much of the underlying chemistry and physics into the CG representation as possible. Nevertheless, it is still possible to obtain accurate thermodynamics by matching the CG model with average forces. In addition, these thermodynamics are consistent when a linear combination of atomistic coordinates is used to obtain the CG coordinates and the equilibrium distribution of both configurations is equal. However, these distributions are based on a certain thermodynamic state, limiting the transferability of intensive properties, such as density and material hardness. Also, consistency of thermodynamic properties can be influenced during the process of making a macroscale representation [63].

The two general implementation methods for CG are the top-down and bottom-up approaches, used to reproduce macroscale properties or more detailed features of a molecular system, respectively [64]. The top-down approach uses atomistic simulations in order to obtain an energy function from which target properties can be predicted, such as electron density and diffusivity [63]. The bottom-up approach requires the mapping of an atomistic representation into larger beads as well as a model describing the physics of the CG system in order to maintain the properties of the real molecular system. These two approaches are visualized in figure 1, where a CG model is obtained from a macroscale representation via the top-down approach and from quantum mechanical or atomistic (i.e. molecular mechanics) representations via the bottom-up approach. Whereas the top-down approach uses target properties to optimize the potential, the bottom-up approach works the opposite way, i.e. it uses the potential to predict target properties. CG models are often combined with force fields, such as the MARTINI force field [65]. The efficiency, accuracy and transferability of different CG models depend on the resolution, parameterization and potential energy function used.
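A bottom-up mapping of an atomistic configuration onto beads, as mentioned above, can be written as a linear (mass-weighted) combination of the atomistic coordinates. The sketch below is a minimal illustration; the bead assignment is a hypothetical example, not taken from any specific CG model.

```python
import numpy as np

def map_to_beads(positions, masses, bead_assignment):
    """Bottom-up CG mapping: each bead coordinate is a mass-weighted linear
    combination (centre of mass) of the atomistic coordinates assigned to it."""
    n_beads = max(bead_assignment) + 1
    bead_pos = np.zeros((n_beads, 3))
    bead_mass = np.zeros(n_beads)
    for atom, bead in enumerate(bead_assignment):
        bead_pos[bead] += masses[atom] * positions[atom]
        bead_mass[bead] += masses[atom]
    return bead_pos / bead_mass[:, None], bead_mass

# Hypothetical example: 6 atoms mapped onto 2 beads of 3 atoms each.
positions = np.random.rand(6, 3)
masses = np.array([12.0, 1.0, 1.0, 12.0, 16.0, 1.0])
cg_positions, cg_masses = map_to_beads(positions, masses, [0, 0, 0, 1, 1, 1])
print(cg_positions, cg_masses)
```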

3 Neural Networks

3.1 Artificial Neural Networks

Numerous types of NNs have already been developed, applicable for various purposes. An example of a NN is an artificial NN and its simplest type is also referred to as a feed forward (FF) NN. A FF NN consists of nodes arranged in layers, namely an input layer, one hidden layer and an output layer, in which information only moves in the forward direction (i.e. from input to output). Additionally, there are no cycles or loops present in this simplistic NN. The NN is defined as a functional relation between the input and the output of the system, which are often the atomic configuration and the PES, respectively [24]. In order to give the atomic configuration as input, it should be written in an appropriate and suitable set of coordinates G = {G_i}. In between the input and output layer, there is a hidden layer. It is called a hidden layer because it is not directly visible to the outside from the system’s input and output. Additional hidden layers can be incorporated into the NN to obtain a deep or multilayered NN. Each of the hidden layers in the NN consists of a certain number of nodes and these numbers can differ between the different hidden layers. In contrast to quantum chemical computations, these hidden layers contain no physical meaning. Nevertheless, they are essential in a NN as they define its functional form. In addition, the hidden layers give the NN flexibility, allowing it to correctly fit the supplied data. The nodes of a layer are interconnected to the nodes of an adjacent layer by links (i.e. edges). Each link multiplies its input by a fitting parameter (i.e. weight) before feeding it into the next node. A function is applied to the sum over the input values for each node to obtain the resulting value of that node.

Figure 1: Schematic overview of simulation methods based on their respective time and length scales, as well as the two routes to obtain a CG representation: bottom-up and top-down.

An example of such a NN with 3 nodes in the input layer, two hidden layers of 5 nodes each and an output layer with one node is schematically shown in figure 2. In this example, only several weights w are shown for clarity, but it should be taken into account that every arrow in the figure has its own corresponding weight associated with it. The weights are indexed such that the first number corresponds to the layer it leaves, with the first hidden layer being 1. The second number corresponds to the node of the layer it leaves and the last number to the node of the layer it enters. The NN in this example visualizes the functional relationship of the three input coordinates G_1, G_2 and G_3 with the potential energy U. The molecular system is given in a suitable representation to the input layer, where each node corresponds to one degree of freedom. Oftentimes, such a representation is given by internal coordinates to ensure that translations and rotations do not affect the properties of the system. Therefore, the NN shown in figure 2 is an example of a simple, small chemical system with three degrees of freedom. On the other hand, a large chemical system and, thus, many degrees of freedom, results in a high-dimensional PES and a more complex NN.

Figure 2: An example of a three-dimensional FF NN, consisting of an input layer with three nodes, two hidden layers with five nodes each and an output layer with one node.

All the nodes in the first hidden layer are determined by the input coordinates and their weights. Consequently, each node m in the first hidden layer can be calculated by first taking the weighted sum s of the input coordinates, which is mathematically given as

s_{0,m} = \sum_{i=1}^{N} G_i w_{0,i,m},   (6)

where N is the number of nodes in the input layer. Additionally, a bias can be added to the system (which is not shown in the example of figure 2). The purpose of adding a bias b_{0,m} is to help the model fit better to the input data and is similar to adding an intercept to a linear equation. As a result, equation 6 becomes

s_{0,m} = b_{0,m} + \sum_{i=1}^{N} G_i w_{0,i,m}.   (7)

Subsequently, a nonlinear, differentiable function f_{n,m} is applied to this sum, enabling the network to fit any continuous function to arbitrary accuracy [66]. After generalizing this to a multilayer NN, the following basis function is obtained for the respective node in layer n:

h_{n,m} = f_{n,m}\left(s_{n,m}\right) = f_{n,m}\left[b_{n,m} + \sum_{i=1}^{N_{n-1}} h_{n-1,i}\, w_{n-1,i,m}\right],   (8)

where N_{n-1} is the number of nodes in layer n − 1 and h_{n-1,i} becomes G_i for n = 1, i.e. for calculating nodes in the first hidden layer. Whenever the calculated layer is the output layer (e.g. the potential energy U), equation 8 becomes

U = f_{n,m}\left(s_{n,m}\right) = f_{n,m}\left[b_{n,m} + \sum_{i=1}^{N_{n-1}} h_{n-1,i}\, w_{n-1,i,m}\right],   (9)

where n becomes the number of layers present in the NN, which would be 4 for the example in figure 2.

The notation for a NN will be given as the number of nodes in the respective layers followed by letters denoting the activation functions. These activation functions can be sigmoidal (s), hyperbolic tangent (t), linear (l) or Gaussian (g). As a result, the NN in figure 2 is denoted as {3 − 5 − 5 − 1 stl} if the first hidden layer is obtained by a sigmoid and the second by a hyperbolic tangent function. A linear function is usually used in the output layer to enable a range of output values without any constraints. The shapes of these functions are, in respective order (i.e. [s, t, l, g]),

f(x) = \frac{1}{1 + e^{-x}},   (10)

f(x) = \tanh(x),   (11)

f(x) = x,   (12)

f(x) = e^{-\alpha x^2}.   (13)

These activation functions can be used, but other possibilities include cosines [67] and exponential functions [68], which are often the choice to fit periodicity. In addition, a combination of or adjustment to the aforementioned activation functions is also a possibility.
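To illustrate how equations 7-9 are evaluated in practice, the sketch below implements the forward pass of the {3 − 5 − 5 − 1 stl} example of figure 2 in NumPy. The weight and bias values are random placeholders; in a real model they would be fitted during training (section 3.4).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(G, weights, biases):
    """Forward pass of the {3-5-5-1 stl} network of figure 2 using Eq. 8:
    h_n = f_n(b_n + W_n h_{n-1}), with a sigmoid (s) in the first hidden layer,
    tanh (t) in the second and a linear (l) output layer (Eq. 9)."""
    h = sigmoid(biases[0] + weights[0] @ G)   # first hidden layer, Eqs. 7-8
    h = np.tanh(biases[1] + weights[1] @ h)   # second hidden layer
    U = biases[2] + weights[2] @ h            # linear output: potential energy U
    return U

# Random weights/biases for the 3-5-5-1 architecture (training would fit these).
rng = np.random.default_rng(0)
weights = [rng.standard_normal((5, 3)),
           rng.standard_normal((5, 5)),
           rng.standard_normal((1, 5))]
biases = [rng.standard_normal(5), rng.standard_normal(5), rng.standard_normal(1)]
G = np.array([1.2, 0.8, 2.1])                 # three input coordinates G_1..G_3
print(forward(G, weights, biases))            # predicted potential energy U
```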

Although FF NNs are commonly used NNs in molecular ML, recurrent NNs (RNNs) have previously been used as well [69–71]. In contrast to FF NNs, RNNs are able to represent loops in the NN, enabling nodes to move information back into a previous node before it moves forward again for additional processing. This means that information of nodes can be passed to nodes of previous layers and the current layer, including itself. The recursions can occur for a certain number of iterations and are especially important when information from previously obtained iterations can affect the information of the current iteration. However, the additional processing also causes RNNs to be slower and can make training difficult compared to FF NNs. Besides the introduction of loops into the NN, RNNs are similar to FF NNs. In addition, other types of NNs used for chemical applications include graph NNs (GNNs), explained in section 3.3, and support vector machines [36]. Besides simply transferring information, NNs are also concerned with performing operations on the output of one node before transferring it to the next node. An example of this is a convolutional NN, which introduces convolutional layers that are able to apply a filter before optimizing the weights of the next layer. In this way, a convolutional layer reduces the number of parameters to be learned in later layers. This enables the NN to grow deeper, while requiring fewer parameters.

3.2 Kernel based Machine Learning

Kernel based methods only require a fixed nonlinear function K, also known as the kernel. The kernel aims to describe the degree of similarity between atomic environments in the chemical system and the kernel’s arguments. Whereas a NN needs to learn a representation by itself, a kernel in kernel based ML is specified by the user. This allows for applying already known chemical knowledge to the machine as well as giving it a specific and easier to understand representation. However, as the kernel remains fixed during learning, it is not possible for the machine to control nor influence the representation even if another representation would be more suitable. Nevertheless, this characteristic could also be preferred as a machine could otherwise remove certain parts of the representation that are useless for machines, but help considerably for human understanding. It should be noted that a trade-off between complexity and interpretability should often be made. For example, kernel methods are able to take more complicated features into account, but the meaning of the outcome can be hard to interpret [72].

Kernels are usually considered to be covariance functions and, thus, are required to be symmetric and positive definite [37]. The symmetry condition is mathematically shown in the following equation:

K(q, q') = K(q', q),   (14)

where the arguments q and q' are vectors describing the atomic environments. The positive definite condition is mathematically shown as

\sum_m \sum_n \alpha_m \alpha_n K(q^{(m)}, q^{(n)}) > 0,   (15)

where α is any nonzero vector of coefficients.

Kernel functions can be represented in different forms. For example, one based on a hyperbolic tangent activation function for a NN with two hidden layers containing infinitely many nodes is given by [73]

K_{NN}(q, q') \sim C - |q - q'|^2,   (16)

where C is a constant. NNs with additional hidden layers, i.e. more than two, can be obtained using the same equation, while performing a nonlinear transformation on the input coordinates [74]. The kernel can also be represented by different similarity functions between q and q', such as the Gaussian function

K(q, q') = \exp\left(-\frac{\|q - q'\|^2}{2\sigma^2}\right),   (17)

where σ is a constant and \|q - q'\| is the norm of the difference between the atomic environments q and q'.

A representation of the neighbourhood (i.e. nearby atoms) can be given by atomic neighbour density functions. In contrast to using a delta function for the atoms, they can be smeared out by using the Gaussian function of equation 17. This results in the following density function for the neighbourhood of an atom:

\rho_i(q) = \sum_j \exp\left(-\frac{\|q - q_j\|^2}{2\sigma^2}\right) f_{cut}(|q_j|),   (18)

where the sum runs over the neighbouring atoms j and f_{cut} is a cutoff function that smoothly cuts the interactions off at a certain cutoff distance. This results in translational and permutational invariance as well as a continuous function. However, the obtained representation is not yet rotationally invariant and an option to solve this is by using Behler-Parrinello symmetry functions [13].

A smooth overlap of atomic positions (SOAP) kernel function is often used [20, 74, 75]. This function can be obtained by directly constructing the kernel from the atomic density functions. By taking the densities of two different atoms and, thus, two different neighbouring environments, the overlap integral

S(\rho_i, \rho_{i'}) = \int \rho_i(q)\, \rho_{i'}(q)\, dq   (19)

is obtained. Subsequently, this overlap integral can be made rotationally invariant by squaring it and applying all possible rotations to it, which results in

K(\rho_i, \rho_{i'}) = \int \left|S(\rho_i, \hat{R}\rho_{i'})\right|^2 d\hat{R} = \int \left| \int \rho_i(q)\, \rho_{i'}(\hat{R}q)\, dq \right|^2 d\hat{R},   (20)

where \hat{R} represents all possible rotations in three dimensions. However, this is rather difficult to calculate for each of the basis functions as three spatial and three rotational integrals would need to be solved. Fortunately, the neighbour density functions can be expanded into spherical harmonics and, thus, be represented as

\rho_i(q) = \sum_{nlm} c^{(i)}_{nlm}\, g_n(q)\, Y_{lm}(\hat{q}),   (21)

where c^{(i)}_{nlm} are expansion coefficients, g_n a radial basis set and Y_{lm} spherical harmonics. The use of spherical harmonics can show that this integrated overlap is exactly equal to the dot product of the respective power spectra p. Consequently, the SOAP kernel becomes

K(\rho_i, \rho_{i'}) = \sum_{n,n',l} p^{(i)}_{nn'l}\, p^{(i')}_{nn'l} = \mathbf{p}^{(i)} \cdot \mathbf{p}^{(i')},   (22)

where p_{nn'l} = c^{\dagger}_{nl} \cdot c_{n'l}. Then, the kernel matrices can be obtained by taking a small integer power of this, which is mathematically shown as

K_{ij} = |K(\rho_i, \rho_{i'})|^{\zeta},   (23)

where ζ is a small positive integer.

An advantage of using a kernel is that the input coordinates can be mapped to feature space without a high computational cost due to the kernel trick. As a result of this trick, the input coordinates (e.g. distances and angles) can be represented as inner products for which explicit mapping is unnecessary. This mapping would otherwise be expensive, especially in high-dimensional systems, which are often observed in chemistry.
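As a minimal illustration of kernel based ML, the sketch below combines the Gaussian kernel of equation 17 with kernel ridge regression to fit and predict a property from fixed descriptor vectors. The descriptors and target values are synthetic toy data; an actual GAP model additionally relies on SOAP descriptors and sparsification, which are not reproduced here.

```python
import numpy as np

def gaussian_kernel(Q1, Q2, sigma=1.0):
    """Kernel matrix K(q, q') = exp(-||q - q'||^2 / (2 sigma^2)) (Eq. 17)
    between two sets of descriptor vectors."""
    d2 = ((Q1[:, None, :] - Q2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_kernel_ridge(Q_train, y_train, sigma=1.0, lam=1e-6):
    """Fit regression weights alpha of a kernel ridge model: (K + lam*I) alpha = y."""
    K = gaussian_kernel(Q_train, Q_train, sigma)
    return np.linalg.solve(K + lam * np.eye(len(y_train)), y_train)

def predict(Q_test, Q_train, alpha, sigma=1.0):
    """The predicted property is a kernel-weighted sum over the training points."""
    return gaussian_kernel(Q_test, Q_train, sigma) @ alpha

# Hypothetical toy data: 50 descriptors of dimension 8 with known target values.
rng = np.random.default_rng(1)
Q_train = rng.standard_normal((50, 8))
y_train = np.sin(Q_train.sum(axis=1))
alpha = fit_kernel_ridge(Q_train, y_train)
print(predict(Q_train[:3], Q_train, alpha))  # reproduces the first three targets
```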

3.3 Graph Neural Networks

A special type of NN, especially useful to represent molecular structures, is the GNN. A graph is a data structure consisting of nodes and edges connecting these nodes. A graph G is defined as G = (V, E) with nodes v_i ∈ V and edges e_i ∈ E. These edges can be directed if there is a directional correlation between two nodes and undirected if not. Graphs can represent molecular systems, where the nodes correspond to atoms and the edges to their respective interactions. They are commonly represented by an adjacency matrix A of dimension (n × n) for n nodes, complemented by an (n × f) feature matrix if each node has f features. Issues can arise with the analysis of a graph, especially for the complex graphs observed in molecular systems. The reason for this is that their form is not fixed, i.e. they can have different numbers of unordered nodes as well as changing numbers of neighbouring nodes. In addition, the nodes in a graph are dependent on one another, resulting in complications for ML algorithms, such as convolutional NNs.

A GNN is a type of NN especially developed for the interpretation of graph structured data [76]. GNNs are permutation equivariant networks, meaning that permutations on the input data will result in the same permutations on the output data [77]. Besides permutational equivariance, translational and rotational equivariance are of importance in molecular systems. These types of equivariance result in an equivalent translation and rotation in the output as was applied to the input, respectively. It should be noted that equivariance is not the same as invariance. Whereas invariance means that there is no variance at all, equivariance means that the variations are in an equivalent proportion. For example, an equivariant atomic configuration implies that the configuration varies equally with the distortion rather than that these two configurations are exactly the same.

The graph is given by an atomic configuration of a molecular system for which information of its local environments, i.e. atoms, can be obtained by a GNN. GNNs obtain locality information via a recursive aggregation scheme of the neighbouring environment, also known as message passing [78, 79]. Each node (i.e. atom) aggregates the feature vectors of the connected nodes to obtain a new feature vector, where an aggregation is performed by a NN. In order to obtain the new feature vector, neighbouring nodes of the neighbouring nodes can be taken into account as well. This aggregation can continue for k iterations, resulting in a transformed feature vector for the node. The number of iterations used depends on how much structural information of the local environment is desired to be taken into account. Subsequently, the total representation of the graph can be achieved by several pooling schemes [80–82]. For extensive properties, this pooling scheme can be the summation of its local node representations.
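A minimal sketch of this message passing scheme is shown below: each node repeatedly aggregates the feature vectors of its neighbours, updates its own representation through a small learned transformation, and the node representations are finally sum-pooled for an extensive property. The graph, features and weights are hypothetical toy values, not taken from any published model.

```python
import numpy as np

def message_passing(features, adjacency, W, k=3):
    """k rounds of the aggregation scheme described above: every node (atom)
    sums the feature vectors of its neighbours and combines them with its own
    features through a learned transformation followed by a nonlinearity."""
    h = features
    for _ in range(k):
        neighbour_sum = adjacency @ h          # aggregate connected nodes
        h = np.tanh((h + neighbour_sum) @ W)   # update each node's feature vector
    return h

def readout_extensive(h):
    """Sum pooling over node representations, suitable for extensive properties."""
    return h.sum(axis=0)

# Hypothetical toy graph: 5 atoms with 4-dimensional features and random weights.
rng = np.random.default_rng(2)
adjacency = np.array([[0, 1, 0, 0, 1],
                      [1, 0, 1, 0, 0],
                      [0, 1, 0, 1, 0],
                      [0, 0, 1, 0, 1],
                      [1, 0, 0, 1, 0]], dtype=float)
features = rng.standard_normal((5, 4))
W = rng.standard_normal((4, 4))
print(readout_extensive(message_passing(features, adjacency, W)))
```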

An example of such an aggregation scheme for the atomic configuration of caffeine together with the first three layers of a GNN for the red coloured atom N are shown in figure 3. In this case, the red coloured atom N is chosen to be the one for which a new feature vector will be obtained. Its first neighbouring atoms are shown by the orange colour. Subsequently, the neighbours of these atoms are shown in yellow unless they have already been assigned a colour. Then, this process is continued and several layers (three in this case) are constructed. This procedure can be performed for all the atoms in the molecule to obtain their respective feature vectors. After that, a combination of all these feature vectors can be used to obtain the complete representation of the graph.

3.4 Training of Neural Networks

In NNs for molecular simulation, supervised learning is regularly used to train the algorithm as this method is effective for classification problems, i.e. to predict discrete values. This means that all data is labelled during training of the NN and the desired outcome that the model should give is known in advance. In molecular simulations, this outcome is often the potential energy or another property of the system, which is usually obtained via quantum chemical calculations, such as DFT. A ML model is made by examining many examples and, subsequently, trying to minimize the loss or error function, corresponding to the predicted property. The accuracy of the model depends on the values of the weights in the NN. In general, the weights are initialized randomly and then iteratively optimized to reduce the loss function as much as possible. Each of these iterations is referred to as an epoch. A loss function can take several shapes, in which a difference between the true and computed values is calculated, but the mean square error (MSE) is commonly used:

MSE = \frac{1}{N} \sum_{i=1}^{N} \left[y_{i,ref} - y_{i,NN}\right]^2,   (24)

where y_{i,ref} is the target value obtained, for example, from DFT, and y_{i,NN} is the value obtained using the NN for property y. In the case of computing the PES of the system, y will be given by the energy. Other loss functions that can quantify the fitting accuracy are the root mean square error (RMSE) and mean absolute error (MAE) functions. The RMSE is simply the square root of the MSE, whereas the MAE is given by

MAE = \frac{1}{N} \sum_{i=1}^{N} |y_{i,ref} - y_{i,NN}|.   (25)

Figure 3: The atomic configuration of a caffeine molecule (a) together with an example of the first three layers of a GNN for the atom N (b), where the colours correspond to the iterations of the aggregation scheme.
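For reference, the loss functions of equations 24 and 25 can be written directly in NumPy; the reference and predicted values below are purely illustrative.

```python
import numpy as np

def mse(y_ref, y_nn):
    """Mean square error of Eq. 24."""
    return np.mean((y_ref - y_nn) ** 2)

def rmse(y_ref, y_nn):
    """Root mean square error: the square root of the MSE."""
    return np.sqrt(mse(y_ref, y_nn))

def mae(y_ref, y_nn):
    """Mean absolute error of Eq. 25."""
    return np.mean(np.abs(y_ref - y_nn))

# Example: reference energies vs. (hypothetical) NN predictions.
y_ref = np.array([-76.42, -75.98, -76.10])
y_nn = np.array([-76.40, -76.02, -76.07])
print(mse(y_ref, y_nn), rmse(y_ref, y_nn), mae(y_ref, y_nn))
```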

A wide range of algorithms exists that minimize the chosen loss function in order to optimize the weights in a NN. One of the most popular methods for this in ML is back propagation [83, 84]. An output is obtained from the input data and the weights in the NN. Subsequently, information about the observed error in the output compared to the target value is obtained using a loss function. Then, this error is backpropagated through the network to adjust the biases and weights in order to minimize the error. This process is continued until the biases and weights have converged such that the output accurately matches the target outcome. Several methods, such as gradient descent [85, 86] or simulated annealing [87, 88], exist to determine the direction in which the error decreases. The method should be chosen carefully as, for example, gradient descent is prone to ending up in a local minimum rather than the global minimum, which is sub-optimal [89].

Not all available data is used for training the model as some part of the data should be reserved for testing the model’s quality. Testing of the model is essential to verify the model’s ability to also predict properties of molecular systems outside the training data set. In addition, these tests can help find the optimum number of hidden layers and nodes as well as determine the best choice of the activation functions used in the NN. When using too few nodes in the hidden layers, the resulting model is prone to underfitting. Fortunately, underfitting is easily identified as it leads to large fitting errors during training. On the other hand, using too many nodes in the hidden layers can result in overfitting due to the increased flexibility of the system [90]. This overfitting results in misleadingly accurate outcomes for fitting chemical systems in the training data. As a result, test data outside of the training set is necessary to determine the quality of the model. Both under- and overfitting of the model should be avoided in ML as they result in the generation of inaccurate predictions. Examples of fitting a model are visualized in figure 4, showing that under- and overfitting describe the training points either too poorly or too accurately. Therefore, a sufficient number of nodes should be present in the hidden layers to prevent underfitting, while not overfitting on the training data. In general, this is realized by taking the smallest number of nodes for which accurate results are obtained for the training data. In addition, the data set needs to be sufficiently large to correctly train the model and prevent overfitting. Besides the number of nodes and data set size, the splitting up of the original training data into new training and validation sets can be used to combat under- and overfitting as well. This method is, for example, used in deep learning and is useful to determine when to stop training, which is also known as early stopping [91]. By using a validation set, the performance after each iteration can be measured. The model will underfit when only a few iterations are performed, but will, generally, be improved as the number of iterations increases. However, at a certain point, additional iterations will cause overfitting on the training data. Therefore, the early stopping method should ensure minimization of the validation set error by taking the corresponding number of iterations. The early stopping method is shown in figure 5 [92].
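The early stopping procedure can be sketched as follows for a simple linear-in-parameters model fitted by gradient descent; the polynomial toy data and the patience criterion are assumptions made for illustration and are not part of any specific molecular ML code.

```python
import numpy as np

def train_with_early_stopping(X_tr, y_tr, X_val, y_val,
                              lr=0.05, max_epochs=2000, patience=50):
    """Gradient-descent fit with early stopping: keep the weights at the lowest
    validation MSE and stop once the validation error has not improved for
    `patience` epochs (cf. figure 5)."""
    rng = np.random.default_rng(0)
    w = rng.standard_normal(X_tr.shape[1]) * 0.01
    best_w, best_err, since_best = w.copy(), np.inf, 0
    for epoch in range(max_epochs):
        grad = 2 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)   # gradient of the MSE
        w -= lr * grad
        val_err = np.mean((X_val @ w - y_val) ** 2)          # validation-set MSE
        if val_err < best_err:
            best_w, best_err, since_best = w.copy(), val_err, 0
        else:
            since_best += 1
            if since_best >= patience:                        # stop: overfitting begins
                break
    return best_w, best_err

# Toy data: noisy polynomial features, split into training and validation sets.
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 60)
X = np.vander(x, 6)                                 # degree-5 polynomial features
y = np.cos(1.5 * np.pi * x) + 0.1 * rng.standard_normal(60)
w, err = train_with_early_stopping(X[:40], y[:40], X[40:], y[40:])
print(err)
```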

Other regularization techniques can be used in conjunction with early stopping to further combat overfitting, such as adding a penalty term and dropout. Regularization can introduce an additional penalty term in the loss function, which is able to limit excessive weight fluctuations of nodes [93]. If this additional penalty term is omitted, the weight of a node can take an extremely high positive or negative value, requiring the next node to compensate for this value by taking an extreme value with an opposite sign. Large weights are generally a sign of overfitting as nodes usually only require such high weights in order to exactly fit higher order polynomials to the training data points. Therefore, including a regularization technique in the learning algorithm is beneficial for better generalization, causing the model to also be applicable to data outside the training set. Another technique is dropout, where several nodes, including their edges, are randomly dropped out (i.e. turned off) for one epoch during training. Dropout is an effective and computationally cheap method to reduce overfitting of the model. The idea behind this method is that whenever every node contributes comparably, turning one or two off will only result in a relatively small error. On the other hand, if some of the nodes contribute significantly more than others, then dropping these nodes will give high errors. As a consequence, the NN needs to adapt and correct the previous missteps, making the model more robust to overfitting.

Figure 4: An example of fitting polynomials of degrees 1, 4 and 15 to a set of data points taken from the true function, cos(\frac{3}{2}\pi x), including noise, resulting in an (a) underfitted, (b) good fit and (c) overfitted model, respectively.

Figure 5: An example of a graph where the errors of the model on the training and validation sets are plotted against the number of iterations [92]. Furthermore, this plot shows where the optimum number of performed iterations would be according to the early stopping method.
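Both regularization techniques discussed above can be expressed compactly. The sketch below adds an L2 penalty term to the MSE loss and applies a dropout mask to a hidden-layer output; the penalty strength and dropout fraction are arbitrary example values.

```python
import numpy as np

def l2_regularized_mse(y_ref, y_nn, weights, lam=1e-3):
    """MSE loss with an added penalty term lam * sum(w^2) that discourages
    extreme weight values."""
    return np.mean((y_ref - y_nn) ** 2) + lam * sum((w ** 2).sum() for w in weights)

def dropout(h, p=0.2, rng=None):
    """Randomly switch off a fraction p of the nodes in a hidden-layer output
    during one training epoch; surviving activations are rescaled by 1/(1-p)."""
    rng = np.random.default_rng(0) if rng is None else rng
    mask = rng.random(h.shape) >= p
    return h * mask / (1.0 - p)

# Example: penalized loss for toy predictions and a dropped-out hidden layer.
weights = [np.array([[0.5, -1.2], [2.0, 0.3]])]
print(l2_regularized_mse(np.array([1.0, 2.0]), np.array([0.9, 2.2]), weights))
print(dropout(np.ones(10)))
```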

3.5 Data Sets for Molecular Machine Learning

The generated molecular database GDB-17 [94] is one of the largest databases currently available. It contains 166.4 billion small molecules of at most 17 atoms besides hydrogen (H), consisting of the elements carbon (C), nitrogen (N), oxygen (O), sulfur (S) and halogens. Oftentimes, databases of quantum chemical calculations of molecular systems are used for making a ML model as they provide accurate predictions of the desired properties. These databases can be a subset of the GDB-17 database on which quantum chemical calculations are performed, or obtained from different sources. The GDB-17 database excludes unstable and nonsynthesizable molecules. Consequently, different sources are required when unstable atomic configurations as well as larger molecular systems are required for training. Representations of unstable structures (i.e. non-stationary points) in the training data are, for example, important for accurately describing the PES; omitting them could even result in holes in the PES [90].

Several quantum chemical databases already exist, mainly for smaller molecular systems. The QM9 database [33] consists of the geometric, energetic, electronic and thermodynamic properties of 134 thousand stable small organic molecules computed using DFT. These small molecules are made from the elements C, H, O, N and fluorine (F) with a maximum of nine heavy atoms (i.e. C, O, N and F) per molecule. Other data sets based on quantum chemical calculations include ANI-1x [95] and QM7-X [96], which also contain non-equilibrium structures. Another example is the ISO17 database [26], which contains short MD trajectories of 129 molecules from the QM9 database. Nevertheless, these databases only consist of small molecules and, thus, models trained on these databases need to be transferable in order to be used on larger systems. Otherwise, a large number of highly expensive quantum chemical calculations needs to be performed on larger chemical systems to create an adequate training set. Furthermore, these computations can even be impossible to perform as is, for example, the case for large biomolecular systems.

As quantum chemical calculations are not always feasible, data sets obtained from different computational methods are sometimes required. These alternative data sets include the MD17 data sets [97], which are based on force field MD simulations of 150 thousand to nearly 1 million conformational geometries with 0.5 femtosecond time steps. In the case of large molecular systems, multiple MD simulations could be performed and, subsequently, be used to develop new data sets. This has already resulted in several data sets, such as ones for monomeric soluble proteins [98], cyclodextrins [99] and more [100,101]. Although these computations are less accurate, they allow the extension of ML models to a wider range of applications. Furthermore, they can also be used for other molecular simulation techniques, such as MC [102]. Likewise, data obtained from MC simulations can be used for training a ML model [103]. An example of this is MC tree search, which is able to create new molecular systems that are similar to, but slightly different from, the systems already present in the database [104–106]. Another alternative would be to develop data sets based on experimental data [107, 108], but these are often not sufficiently available.

The Open Catalyst Project is a collaboration between Facebook AI Research and the Department of Chemical Engineering of the Carnegie Mellon University. In 2020, this project released the Open Catalyst Dataset [34], containing over 250 million DFT computations to describe 1.2 million molecular relaxations of randomly selected catalysts and adsorbates. This database has been made widely available in order to combat climate change and can be used to develop ML methods for the discovery of new electrocatalysts. An important challenge in the field of climate change is to store power obtained from sun, wind and hydro for longer time periods as these sources account for an increasing amount of the world’s electricity production [109]. This storage is essential for efficient use as the generation peaks of renewable energy do not coincide with the peaks of energy demand and consumption [110]. The storage as well as possibly converting these types of energy to other fuels requires the development of new catalysts. Currently used electrocatalysts still contain expensive metals, making them unaffordable for the large application scale required in the (near) future. Therefore, it is especially important to find economically viable and scalable solutions to this problem [111]. Moreover, these catalysts should be durable (i.e. stable), efficient and selective. Besides providing a data set, this project helps researchers by exploring and comprehending the underlying physics and chemistry in order to accelerate the energy transition. In addition to making their database, code and baseline models available to the public, this project even includes a leaderboard and discussion forum to stimulate research groups to investigate these important challenges.

Besides the Open Catalyst Project and the already mentioned databases, other data sets exist for several applications in electrocatalysis that are worth mentioning. These include, among others, ones for intermetallic surfaces with small adsorbates [112, 113], three or more metal alloys [114], fcc(111) transition metal surfaces [115] and CO2 electroreduction catalysts [116]. In addition, multiple data sets have already been developed containing structural and thermodynamic properties of materials. An example of a data set for inorganic compounds is the Materials Project [117], which contains 70 thousand structures. Another one is the open quantum materials database (OQMD) [118], containing properties of over 800 thousand materials computed by DFT. Additionally, the continuously growing database AFLOW [119] contains the crystal structure properties of over 3.5 million materials, including intermetallic surfaces, inorganic compounds and alloys. As a last example, the Matbench set [120] contains bulk properties of inorganic benchmark materials.

The amount of training data required depends on the complexity of the molecular system and the chosen NN. In general, low-dimensional chemical problems require less training data than high-dimensional molecular systems. Moreover, it can actually be preferable to use a smaller training set for low-dimensional systems to prevent overfitting. Nevertheless, the training set should be sufficiently large to capture all the underlying chemistry required for making a trustworthy model.

3.6 Active Learning

AL is a type of ML where the machine can request additional training data whenever it is not able to confidently predict an outcome. As a result, the machine keeps learning ”on the fly” in order to obtain more accurate and transferable models. For example, when a molecular system related to the problem cannot be accurately predicted, it is provided some new data to extend the model’s predictive capabilities. Consequently, the model is able to predict properties of more related problems, improving the transferability to other molecular systems. Additional training data is only acquired when the model’s predictions are uncertain, by performing additional quantum chemical calculations on several molecular systems [121]. The ”on the fly” learning makes AL models more efficient in handling data and, thus, they generally require less training data compared to regular ML models. The reason for the efficient data handling in AL is that the model is first initialized on a small, randomly chosen subset of the training data. Subsequently, the model is tested on the remaining data and only when it fails to accurately describe one of the molecular systems is that system added to its training data. Therefore, AL often results in only requiring a small fraction of the data compared to standard ML methods [121]. In case a database for certain molecular systems is not readily available, a small amount of randomly sampled data across many of those systems can suffice. However, this requires more computations to derive a desired model. As a consequence of AL, the number of expensive quantum chemical calculations to be performed can be minimized, saving computational resources, time and money. In addition, the efficiently generated data can allow the development of models on molecular systems of which little data is available.

The ML model is provided with labelled training data. Whenever the model is not able to accurately predict a desired property of a molecular system, it sends a query for that system. There are several strategies for AL, and query by committee (QBC) [122], which is especially useful for molecular systems, is one of them. In QBC AL, multiple models are trained on the provided labelled data and, subsequently, the obtained models (i.e. the committee) are used to decide on the output for a molecular system outside the training set. The molecular systems on which the committee members disagree the most are added to the training set, usually by performing quantum chemical calculations on them. The idea behind this is that whenever a high variance is observed in the outputs of the models, there should be at least one model in the committee that deviates considerably from the ground truth. The variance measures the spread of the outcomes around their mean value, as it is given by the expected value of the squared deviation from the mean

\mathrm{Var}(Y) = \mathbb{E}\big[(Y - \mu)^{2}\big], \qquad (26)

where Y is the outcome of a model and µ the mean outcome of all models. Therefore, if the variance is high, the outcomes of the models differ. However, this should not be the case if they correctly describe the underlying chemistry, as all models are trained on the same data. The molecular systems with the highest disagreement are chosen for additional learning in order to avoid repetitious, expensive quantum chemical calculations. Nevertheless, the variance can exceed a certain threshold, meaning that the given molecular system is very poorly described by the current models. This implies that a different model should be used to obtain accurate information for that system. A reason for this could be that the molecular system is not sufficiently related to the systems used for training. In such cases it is better to use a different model, as too much new quantum chemical training data would be required for the current models, making them inefficient and expensive. In addition, this threshold prevents models from learning indefinitely, as could otherwise happen due to the extremely large chemical space.
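
A minimal sketch of this QBC selection criterion, assuming a committee of already trained models with a hypothetical predict method, is given below; structures whose committee variance exceeds the threshold are flagged as out of scope for the current models instead of being queried.

```python
import numpy as np

# Query-by-committee selection following Eq. (26): compute the committee
# variance per candidate structure, query the highest-disagreement structures,
# and flag structures above a hard variance threshold as out of scope.
def qbc_select(committee, candidates, n_query=5, var_threshold=1.0):
    preds = np.array([[model.predict(x) for x in candidates] for model in committee])
    mean = preds.mean(axis=0)                        # mu in Eq. (26)
    var = ((preds - mean) ** 2).mean(axis=0)         # Var(Y) = E[(Y - mu)^2]

    out_of_scope = np.where(var > var_threshold)[0]  # better handled by a different model
    in_scope = np.where(var <= var_threshold)[0]
    query = in_scope[np.argsort(var[in_scope])[-n_query:]]   # highest disagreement
    return query, out_of_scope
```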

4 Transferability

The ability of a ML model to be applied to different but related problems is referred to as the transferability of the model. This implies that the learned weights of a NN obtained on a certain task are applied to another, new task. In this way, what has already been learned during one task can be used to improve generalization to other tasks. This allows researchers to make accurate predictions on systems for which only little data is available. Additionally, it allows models to be built from related problems rather than being trained from scratch again, saving both time and resources. This includes the extension to molecular systems of increased size.
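
A minimal sketch of such weight transfer, assuming a small feed-forward PyTorch network with arbitrary layer sizes, could look as follows: the transferred layers are frozen and only a new output layer is retrained on the (small) data set of the related task.

```python
import torch
import torch.nn as nn

# Sketch of weight transfer between related tasks (hypothetical layer sizes):
# reuse the feature layers learned on the source task, freeze them, and train
# only a fresh output layer on the target-task data.
source_net = nn.Sequential(
    nn.Linear(128, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, 1),
)
# ... assume source_net has already been trained on the source task here ...

target_net = source_net                      # start from the learned weights
for param in target_net.parameters():
    param.requires_grad = False              # freeze all transferred weights
target_net[4] = nn.Linear(64, 1)             # new, trainable output layer

optimizer = torch.optim.Adam(target_net[4].parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
# Fine-tuning loop over the small target-task data set would go here, e.g.:
# for x, y in target_loader:
#     optimizer.zero_grad()
#     loss = loss_fn(target_net(x), y)
#     loss.backward()
#     optimizer.step()
```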

Reserving a part of the data set for testing is essential to make sure that the model works. However, transferability of the model outside the original database cannot always be guaranteed for molecular systems, especially if they are qualitatively different from the training set [123]. This issue might lead to the development of models that are only applicable to specific problems and, hence, highly non-transferable. An example of such a non-transferable, specific model is the model used to determine how the interface properties control the equilibrium shape of core-shell Fe-Ag nanoparticles [124]. Furthermore, it has been shown that errors occur when predicting a metal/metal interface when the model was only trained on the separate metal surfaces without their interactions [123]. As a result, the transferability of a model is limited and the desired interactions and structures should be included in the training set in order for the model to predict them accurately. Therefore, current ML methods are specific to certain elements or alloys and are not yet able to include many different types of metals in their crystal structures. Furthermore, these specificity issues are also observed in other subsets of computational chemistry [125–129].

Another common issue reducing the transferability of a model is the introduction of more nodes into the hidden layers of a NN. This increases the complexity and flexibility of the model, allowing a better fit of the training data and its underlying structure-property relations [130, 131]. However, a more complex model is not automatically better, nor does it necessarily improve the loss function for unlearned molecular systems. Moreover, more flexible models usually result in less transferable models. It is useful to reduce the dimensionality of the chemical system whenever a limited number of training structures is available, as this allows the model to describe interactions between atoms outside the training data. However, learning of a lower-dimensional chemical system is prone to saturation for larger data sets, as such models are not sufficiently flexible to describe differences between atoms or elements [132]. Consequently, this is likely to limit ML models to a certain number of elements, especially considering the vast number of elements in the periodic table.

A shift in the representation of molecular systems towards more closely describing fundamental principles, including locality and multiple interaction ranges (i.e. short-ranged and long-ranged (electrostatic) interactions), generally results in higher transferability. Previously, many representations used global descriptions of a molecular system [133–135], whereas recent NNs use additive models. These additive models are often atom-centred and lead to more accurate and transferable models for extensive properties [136]. Not only is additivity useful for transferability, it is even required when a data set consists of systems of varying molecular size. The transformation from a global to an atom-centred representation is usually uncomplicated. The breaking down of a global representation into a sum of individual atomic contributions is often referred to as the locality of a model. This locality improves the transferability of the model, as the local parts can be summed or averaged in order to make predictions about a global description of the molecular system. Consequently, predictions on larger and more complex systems outside the training data become possible. In this way, the AlphaML model could accurately predict the polarizability of small organic molecules based on a relatively small training set as well as extend its predictions to larger molecules [129]. Another example is the fourth generation high-dimensional NN potential [39], which is able to predict global charge distributions as well as energies by taking an atom-centred approach and separating short- and long-ranged interactions. In addition, including long-ranged features, such as electrostatic interactions, in the model allows for more accurate descriptions of intermediate-ranged interactions as well. Subsequently, assigning relative importance to the different local and non-local terms could ideally allow extrapolation towards larger molecular systems while training only on smaller systems, which has recently been shown for molecular dipoles [137]. Another frequently calculated property in chemistry, the electron density, is also especially suitable for description as a collection of local contributions, leading to good transferability over a wide range of chemical space [138, 139]. In general, if a property can be predicted by a local model, this results in increased accuracy and transferability.
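
A minimal sketch of such an atom-centred additive model (with a hypothetical local descriptor as input and arbitrary layer sizes) could look as follows.

```python
import torch
import torch.nn as nn

# Sketch of an atom-centred additive model: each atom contributes a scalar
# predicted from its local environment descriptor (a hypothetical featurizer,
# e.g. symmetry functions or SOAP vectors), and the global extensive property
# is the sum of these contributions.  Because the network only ever sees local
# environments, the same weights can be applied to molecules of any size.
class AdditiveModel(nn.Module):
    def __init__(self, n_features=64):
        super().__init__()
        self.atomic_net = nn.Sequential(
            nn.Linear(n_features, 32), nn.Tanh(),
            nn.Linear(32, 1),
        )

    def forward(self, descriptors):            # (n_atoms, n_features) for one molecule
        atomic_contributions = self.atomic_net(descriptors)   # (n_atoms, 1)
        return atomic_contributions.sum()      # extensive property: sum over atoms
        # an intensive property would instead need e.g. a mean or a global descriptor
```

The same trained atomic_net can thus be evaluated for a molecule with any number of atoms, which is exactly the property that makes atom-centred potentials transferable to larger systems.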

Unfortunately, it is not always possible to separate global properties into local parts, as is the case, for example, for the polarizability of conjugated hydrocarbons [140]. In these cases, long-ranged representations could offer a solution to the transferability issue, as they combine a long-ranged character with decomposability into additive local parts, such as the long-distance equivariant feature representations [141]. Consequently, these can be used to accurately predict larger molecular systems as well. Another possible strategy would be for the model to learn an alternative, stand-in target property with a more localized nature, which can then be transformed or manipulated to obtain a prediction of the desired property. In general, the prediction of intensive properties also leads to transferability issues. Examples of intensive properties are temperatures and material elasticity, hardness and ductility, but also the gap between the highest occupied and lowest unoccupied molecular orbital energies (HOMO-LUMO gap) [142–144]. The reason for these issues is that intensive properties are harder to describe by a sum of localized contributions, because they are independent of the amount of matter or the size of the molecular system. Therefore, the prediction of intensive properties might still require global descriptors, while local descriptors are necessary to describe local interactions and are preferred to obtain generic, transferable models. Consequently, ML models made to predict intensive properties are usually not transferable to molecular systems of sizes different from the training data.

One of the main limitations on the accuracy and transferability of a molecular ML model is the quality and breadth of the available training data. A reason for this is that the predictions of a ML model can only be as accurate as the provided training data. In addition, obtaining more accurate training data is expensive, as this usually implies performing quantum chemical calculations or MD simulations. As discussed in section 3.5, some databases are already available, especially for small organic molecules. However, many computationally interesting molecular systems are still not present in these databases. In addition, the expensive computations required for these databases imply that the accuracy of a ML model remains limited, unless a model based on more affordable molecular systems is transferable enough to be applied to these systems as well.

5 Current Molecular Machine Learning Methods

Currently, several promising methods have been developed for molecular ML. These methods address all sorts of applications within chemistry, but predicting the PES is the most common. These molecular ML methods have their own advantages and shortcomings, and researchers are still improving them as well as applying them to other molecular systems. Promising molecular ML methods in terms of transferability are listed below.

5.1 Graph Neural Network based Models

The first idea of GNNs, discussed in section 3.3, originates from 1997, when recursive NNs were introduced to process data presented as a graph [145]. In 2009, the first GNNs based on recurrent NNs and FF NNs were developed [69, 146], allowing cyclic graphs to be handled. Afterwards, these types of networks have been applied to various fields of science and engineering, including molecular modelling [147–150]. GNNs are especially useful for obtaining the conformation of a molecular system [147], which is important for finding chemical and physical properties, such as bond formation, interactions and reactions [151].

An advantage of these GNNs compared to other types of NNs is that they are automatically invariant w.r.t. translation, rotation and permutation of atomic coordinates. Consequently, it is unnecessary to compute computationally expensive spherical harmonics, which are used in other equivariant methods [152, 153]. In addition, a new E(n) equivariant GNN was recently developed [149]. This method improved upon previous GNNs [154, 155] by extending the equivariance from three-dimensional to higher-dimensional spaces as well as enabling the propagation of node features among nodes. The scaling to higher-dimensional spaces is especially important for the application of GNNs to larger and more complex systems, such as biomolecules. Therefore, these GNNs possess good transferability and, as a result, this method is potentially useful for modelling larger molecular systems. As computations on larger molecular systems have not yet been performed due to its novelty, the transferability still needs to be confirmed in practice. In addition, the ability to propagate node features to different nodes is essential for describing molecular systems, because interactions between atoms are not correctly incorporated into the model without this feature.
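
To illustrate why message passing over interatomic distances with sum aggregation yields the invariances mentioned above, a minimal sketch (with hypothetical layer sizes and edge-list format; not the E(n) equivariant GNN of Ref. [149]) could look as follows.

```python
import torch
import torch.nn as nn

# Minimal message-passing sketch: node features are updated from a sum of
# messages computed from neighbour features and interatomic distances.  Using
# distances (not raw coordinates) gives translation/rotation invariance, and
# summing over neighbours gives permutation invariance.
class MessagePassingLayer(nn.Module):
    def __init__(self, n_features=32):
        super().__init__()
        self.message_net = nn.Sequential(nn.Linear(2 * n_features + 1, n_features), nn.Tanh())
        self.update_net = nn.Sequential(nn.Linear(2 * n_features, n_features), nn.Tanh())

    def forward(self, h, positions, edges):
        # h: (n_atoms, n_features); positions: (n_atoms, 3);
        # edges: list of directed (i, j) pairs, meaning atom j is a neighbour of atom i
        messages = torch.zeros_like(h)
        for i, j in edges:
            d_ij = (positions[i] - positions[j]).norm().reshape(1)   # invariant input
            m_ij = self.message_net(torch.cat([h[i], h[j], d_ij]))
            messages[i] = messages[i] + m_ij        # sum aggregation: permutation invariant
        return self.update_net(torch.cat([h, messages], dim=-1))
```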

Another GNN method developed for molecular simulations is the directional message passing NN (DimeNet) [156], which is able to predict forces and properties of molecular systems. In addition, DimeNet is twice continuously differentiable w.r.t. atomic coordinates, making this method especially suitable for MD simulations. Being a GNN, DimeNet is automatically invariant w.r.t. translation, rotation and permutation of atomic coordinates. Standard GNNs only use pairwise distances together with the previous atom embeddings to compute properties, such as the energy of the system; in DimeNet these pairwise distances are represented by spherical Bessel functions. Although using only pairwise distances suffices to achieve the desired invariance, it also makes it more difficult to model energies correctly. The reason for this is that energy contributions from angles and torsions cannot be obtained directly from distances, and accounting for this issue is not straightforward. Therefore, DimeNet uses directional message passing, describing interactions between atoms in coordinate space by edges. Only interactions between atoms within a certain cutoff distance are used, in order to prevent overfitting and increase the efficiency of the model. Subsequently, the directions of the messages to and from atoms, as well as their angles and torsions, can be utilized to take angle and torsion energy terms into account. DimeNet has been shown to achieve accurate results when tested on the MD17 [97] and QM9 [33] databases.
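
The following toy sketch only illustrates the geometric quantities that directional message passing operates on, namely the distances and the angles between pairs of directed edges within a cutoff; the actual DimeNet additionally expands these quantities in spherical Bessel functions and spherical harmonics and uses a far more efficient neighbour search.

```python
import numpy as np

# Toy illustration of directional message passing inputs: directed edges
# (j -> i) within a cutoff, plus, for every incoming edge (k -> j) feeding an
# outgoing edge (j -> i), both distances and the angle at atom j.
def edges_and_angles(positions, cutoff=5.0):
    n = len(positions)
    edges = [(j, i) for i in range(n) for j in range(n)
             if i != j and np.linalg.norm(positions[i] - positions[j]) < cutoff]
    triplets = []
    for j, i in edges:                       # outgoing edge j -> i
        for k, j2 in edges:                  # incoming edge k -> j
            if j2 == j and k != i:
                v_ji = positions[i] - positions[j]
                v_jk = positions[k] - positions[j]
                cos_a = v_ji @ v_jk / (np.linalg.norm(v_ji) * np.linalg.norm(v_jk))
                angle = np.arccos(np.clip(cos_a, -1.0, 1.0))
                triplets.append((k, j, i, np.linalg.norm(v_ji),
                                 np.linalg.norm(v_jk), angle))
    return edges, triplets
```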

5.2 Gaussian Approximation Potential

GAP is a form of kernel based ML and was first published in 2010 [20]. A Gaussian process is a stochastic process that places Gaussian probability distributions over functions given all observations and subsequently predicts the probability of the next observation. The average of these probabilities yields the same solution as would have been obtained by a linear least squares (LLS) method. However, in contrast to LLS solutions, the GAP method also gives the covariance. Consequently, this method agrees with the values of the data points, but provides a range of solutions with associated probabilities for the points not present in the data set. These Gaussian kernels are popular because the prediction error goes to zero as the number of observations (i.e. data points) N goes to infinity, as long as the functions are reasonably regular and the number of basis functions scales with an order of N. Therefore, it is possible to obtain a function arbitrarily close to the target function.
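
A minimal sketch of Gaussian process regression with a Gaussian (squared-exponential) kernel is given below; it illustrates that the prediction returns both a mean and a covariance, which an ordinary least-squares fit does not. The function names and hyperparameters are illustrative, not the actual GAP implementation.

```python
import numpy as np

# Gaussian process regression sketch with a Gaussian (squared-exponential)
# kernel: the posterior at new points has both a mean and a covariance.
def rbf_kernel(X1, X2, length_scale=1.0):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale**2)

def gp_predict(X_train, y_train, X_test, noise=1e-6, length_scale=1.0):
    K = rbf_kernel(X_train, X_train, length_scale) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test, length_scale)
    K_ss = rbf_kernel(X_test, X_test, length_scale)
    K_inv = np.linalg.inv(K)                        # a solver would be preferred in practice
    mean = K_s.T @ K_inv @ y_train                  # posterior mean
    cov = K_ss - K_s.T @ K_inv @ K_s                # posterior covariance (uncertainty)
    return mean, cov
```

In GAP the kernel is not a simple Gaussian over Cartesian coordinates but, for example, the SOAP kernel discussed in section 3.2, while the algebra of the prediction remains the same.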

GAP can use the SOAP kernel as previously described in section 3.2. This combination has already been shown to be successful in obtaining interatomic potentials
