Necessity of Code Generators in Electronic-Structure Theories and their Infrastructure

N/A
N/A
Protected

Academic year: 2021

Share "Necessity of Code Generators in Electronic-Structure Theories and their Infrastructure"

Copied!
40
0
0

Bezig met laden.... (Bekijk nu de volledige tekst)

Hele tekst


MSc Chemistry

Molecular Sciences

Literature Thesis

Necessity of Code Generators in

Electronic-Structure Theories and their Infrastructure

by

Stan Papadopoulos

10722718

12 EC

June 2019 – August 2019

Supervisor/Examiner: Prof. dr. L. Visscher

Examiner: Dr. D. Dubbeldam

Amsterdam Institute for Molecules,

Medicines and Systems


At the source of every error which is blamed on the computer you will find at least two human errors, including the error of blaming it on the computer.


Abstract

Code generators have been successfully applied to a plethora of second quantization methods, both single- and multi-reference, and have been used to generate serial as well as parallel code. Their infrastructures consist of three components: the derivation of working equations, the transformation of those raw equations into more computationally efficient ones, and finally the implementation of the optimised equations in computer code. Generated code generally performs somewhat worse than hand-written code, but its timings remain competitive; in one instance, generated code was even 20% faster than the hand-written equivalent, which underlines the usefulness and necessity of these tools. Another prime example of their necessity is that a code generator was used to produce code for the analytic nuclear gradients of CASPT2, which humans had not been able to derive by hand due to the sheer complexity.

Code generation for DFT has focused on the XC functionals and their corresponding derivatives. A symbolic algebra package, such as Mathematica, is used to obtain the derivatives, which are then converted to code. The downside of this procedure is that the resulting functions might be numerically unstable. This, together with the fact that fairly complete XC libraries are readily available, seems to make XC code generators unnecessary, which is also reflected in their very limited availability.


Contents

Abstract i

1 Introduction 1

2 Deriving Working Equations for Second Quantization Methods 3

2.1 Algebraic Approach: Normal Ordering and Wick’s Theorem . . . 4

2.2 Diagrammatic Approach . . . 6

3 Second Quantization Methods 11

3.1 Configuration Interaction . . . 11

3.2 Coupled-Cluster . . . 12

4 Code Generators for Second Quantization Methods 14

4.1 Tensor Contraction Engine . . . 14

4.2 SMITH . . . 18

4.3 ORCA-AGE . . . 20

5 Density Functional Theory 25

6 Code Generators for Density Functional Theory 26

6.1 dfauto . . . 26

6.2 Sałek and Hesselman . . . 27

7 Discussion 29

8 Conclusion 31

9 Outlook 32

10 Acknowledgements 33


1 Introduction

Solving complex equations is a constant issue in the natural sciences. The solution of such equations might involve a substantial number of terms, making derivations by hand prone to human error. To apply those solutions and perform feasible calculations for the target systems, a computer code has to be created. Doing this by hand likewise leads to a long development time, because of the many terms as well as a debugging phase to correct small errors such as sign inconsistencies. Naturally, some scientists believed there must be an easier, less error-prone way to create a computer program containing working equations.1–3 Working equations are those that consist only of mathematical operations that can be directly implemented in a computer program without further algebraic manipulation, i.e. (matrix) multiplications, addition/subtraction of numbers/matrices, etc.

Research on this topic was being conducted as early as the 1960s, and the first breakthrough came from the Dutch particle physicists and 1999 Physics Nobel Laureates Veltman and ’t Hooft. Their program SCHOONSCHIP was developed to calculate the quadrupole moment of the W boson, whose expression contains on the order of 50,000 terms before simplifications.1,4,5 Clearly, a mistake is bound to occur while working out this enormous number of terms by hand. Furthermore, SCHOONSCHIP is capable of computing more than just that property; in fact, it is designed to be much more general. One of its capabilities is to evaluate all expressions of the form:

$$(A_1 + B_1 + \ldots)(A_2 + B_2 + \ldots)\cdots + (AA_1 + BB_1 + \ldots)\cdots, \qquad (1.1)$$

where $A_1$, $B_1$, etc., could be products of numbers, algebraic symbols, vectors, functions, etc.1 So, the input for this program is a general algebraic expression, and the output could be obtained on punched cards and incorporated directly into a Fortran-compatible numerical program. That was the first instance (to the author’s knowledge) of a code generator: a program capable of (1) interpreting and evaluating symbolic algebraic equations, (2) consecutively converting these evaluated expressions into working equations, and (3) casting those into a computer program in a certain programming language, typically C or Fortran for performance reasons.
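The same three-step workflow of expanding a symbolic product and emitting numerical source code can be sketched with a modern general-purpose tool. The snippet below uses sympy as an illustrative stand-in for SCHOONSCHIP (the routine name `poly` is an arbitrary choice, not anything from the original program), expanding an expression of the form of Eq. 1.1 and casting it into a Fortran routine:

```python
import sympy as sp
from sympy.utilities.codegen import codegen

# step (1)/(2): symbolic expansion of a product of sums, as in Eq. (1.1)
A1, B1, A2, B2 = sp.symbols('A1 B1 A2 B2')
expr = sp.expand((A1 + B1) * (A2 + B2))   # A1*A2 + A1*B2 + A2*B1 + B1*B2

# step (3): cast the evaluated expression into Fortran source code
results = codegen(('poly', expr), language='F95')
fname, fcode = results[0]                 # ('poly.f90', '...generated source...')
```

The generated file contains a plain Fortran function evaluating the expanded polynomial, which is exactly the kind of output SCHOONSCHIP delivered on punched cards.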

Since SCHOONSCHIP, multiple such generators, otherwise known as computer algebra systems, have been developed to manipulate mathematical expressions and alleviate the problem of long implementation times. The most well-known examples are Maple and Mathematica, which are general-purpose computer algebra systems containing a plethora of features, such as differentiation, matrix multiplication, C-code generation and much more.6,7 Depending on which (electronic-structure) method is targeted, the code generator has to focus on specific aspects, but what they should all have in common is an interpreter for abstract equations; a module for simplifying the equations, for instance by combining common terms; a memory manager, which may for example shuffle pieces of code to minimise storage costs; and an implementation module that produces the code.

Besides their use in physics, code generators have also made their way into electronic-structure theory to assist in several stages of method development.2,3,8–15 Electron correlation theories, such as Coupled-Cluster (CC), Configuration Interaction (CI), or Many-Body Perturbation Theory (MBPT), are hierarchical methods that can be systematically improved, with each step adding extra complexity, to eventually converge to the exact solution of the Schrödinger equation within a specific basis set expansion.16

The development of the aforementioned methods involves three steps according to Hirata:16 derivation, transformation and implementation, of which the first two are essentially algebraic manipulations. The derivation procedure is the process in which methods expressed in second quantization (Section 2) are cast into mathematical equations as sums of tensor products. Brute-force derivation generally does not result in equations that are optimised for computer codes, so the equations are transformed into a more efficient scheme, for instance by factorising common terms. Lastly, the implementation step is self-explanatory. This is clearly a very systematic procedure of method development, and a code generator, as described in the previous paragraphs, can nicely be applied to these methods.

In Density Functional Theory (DFT), equations are not expressed in the language of second quantization, so deriving equations using Wick’s theorem (Section 2.1), diagrammatic techniques (Section 2.2) or similar methods is not possible. Furthermore, DFT is not a hierarchical method, unlike electron correlation methods, so there are no additional equations to solve beyond optimising the density. Rather, the main problem in Kohn-Sham (KS) DFT (the chemistry standard) is finding a reasonable approximation to the exchange-correlation (XC) functional $E_{xc}[\rho]$, which depends on the electron density $\rho(\mathbf{r})$ and possibly on derivatives of that density.9,11,12 To minimise the energy, we also need the XC potential, which is the functional derivative of $E_{xc}$ with respect to $\rho(\mathbf{r})$. In the quest for improved XC functionals, obtaining that potential is vital, and screening would become much simpler if these potentials could be generated automatically.12 Work on this subject was already carried out in the late 90s by Knowles and has more recently been applied to arbitrary-order density functional response theory.17
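As a minimal sketch of that procedure, consider the LDA exchange functional, whose potential reduces to an ordinary derivative of the energy density because there is no gradient dependence. A symbolic package can differentiate the energy density and emit C source; this is only an illustration of the idea, not code from any actual generator:

```python
import sympy as sp

rho = sp.symbols('rho', positive=True)

# Dirac/LDA exchange energy density: e_x(rho) = -Cx * rho**(4/3)
Cx = sp.Rational(3, 4) * (3 / sp.pi) ** sp.Rational(1, 3)
ex = -Cx * rho ** sp.Rational(4, 3)

# the XC "potential": for a gradient-free functional this is just d(e_x)/d(rho)
vx = sp.diff(ex, rho)          # equals -(3/pi)**(1/3) * rho**(1/3)

# cast the derivative into C source, in the spirit of dfauto-style generators
vx_code = sp.ccode(vx)
```

For GGA or meta-GGA functionals the same idea applies, but the functional derivative then involves derivatives with respect to the density gradient as well, and the automatically generated expressions are exactly where the numerical-stability concerns mentioned above arise.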

As is clear from the previous paragraphs, no one-size-fits-all code generator exists, since wave function and density-based theories need different functionalities; thus there will be at least two separate classes of generators, one based on second quantization and the other on (functional) derivatives. In this thesis, an overview will be provided of the most relevant examples of code generators, with additional background information and/or theory where necessary, discussing their methodology, before turning an eye to the future and what more can be accomplished in the field of symbolic algebra in quantum chemistry and code generators.


2 Deriving Working Equations for Second Quantization Methods

To be able to explain the derivation of working equations (and the second quantization methods), the formalism of second quantization must be introduced, since the code generators rely heavily on its algebraic properties. First of all, second quantization is not an additional level of quantization, but can be seen as a more elegant way of reformulating quantum mechanics.18 Within this formalism, wave functions are no longer represented as Slater determinants, but as a series of creation and annihilation operators working on a vacuum state. Moreover, operators can also be constructed using those same ladder operators.ii The definition of these operators follows from their effect on a wave function $|\mathbf{k}\rangle$, which is now represented as an occupation-number vector

$$|\mathbf{k}\rangle = |k_1, k_2, \ldots, k_m\rangle, \qquad k_p = \begin{cases} 1 & \phi_p \text{ occupied} \\ 0 & \phi_p \text{ unoccupied} \end{cases} \qquad (2.1)$$

where $\phi_p$ refers to a specific orbital.18 When a creation operator $a_p^\dagger$ works on a wave function, two situations can occur:

$$a_p^\dagger |k_1, k_2, \ldots, 0_p, \ldots, k_m\rangle = \prod_{q=1}^{p-1} (-1)^{k_q}\, |k_1, k_2, \ldots, 1_p, \ldots, k_m\rangle \qquad (2.2)$$
$$a_p^\dagger |k_1, k_2, \ldots, 1_p, \ldots, k_m\rangle = 0 \qquad (2.3)$$

Either a new wave function is generated with the orbital $\phi_p$ now occupied (Eq. 2.2), or the product vanishes (Eq. 2.3), because an orbital would appear twice. The adjoint of the creation operator, the annihilation operator, has the opposite effect.18 Besides the wave function, we can also express the Hamiltonian and other operators in the language of second quantization. Finally, we can establish the anticommutation relations between the operators. An anticommutator is defined as

$$\{A, B\} = AB + BA, \qquad (2.4)$$

where $A$ and $B$ may be any operators. Substituting the ladder operators into Equation 2.4 gives us three distinct relations:18
$$\{a_p^\dagger, a_q^\dagger\} = 0 \qquad (2.5)$$
$$\{a_p, a_q\} = 0 \qquad (2.6)$$
$$\{a_p^\dagger, a_q\} = \delta_{pq} \qquad (2.7)$$

where $\delta_{pq}$ is the Kronecker delta. These simple anticommutation relations constitute the fundamental properties of the ladder operators, and from them we can derive all other algebraic properties of the second quantization formalism, including the antisymmetry of the wave function.18

With these relations (Eq. 2.5–2.7), it should now become clear why code generators could be useful in second-quantized methods. If we have a string of creation and annihilation operators, we can systematically apply those relations to evaluate all of the expressions. However, this algebraic approach is not the sole way through which we can obtain working equations. A diagrammatic approach, which helps to visualise the problem, has also been developed and applied to quantum chemistry.2,3,10 In the next sections (2.1 and 2.2) both methods will be discussed.

ii Ladder operator is a collective term that can refer to both the creation and the annihilation operator.

2.1 Algebraic Approach: Normal Ordering and Wick’s Theorem

When working out equations, it is helpful to know which terms have a zero contribution, so we can neglect those terms, leading to more efficient computer code. This requires normal-ordered operators, meaning that all annihilation operators are positioned to the right of the creation operators.19 The vacuum expectation value of such a string of operators is 0, which leads to a simpler description. As mentioned in the previous section, we can use the anticommutation relations to evaluate the contributing terms, which is achieved by normal-ordering all operators. Suppose we have a string of creation and annihilation operators

$$A = a_p a_q^\dagger a_r a_s^\dagger \qquad (2.8)$$

to which we apply the anticommutation relations to reorder the ladder operators:
$$\begin{aligned} A &= a_p a_q^\dagger a_r a_s^\dagger \\ &= \delta_{pq}\, a_r a_s^\dagger - a_q^\dagger a_p a_r a_s^\dagger \\ &= \delta_{pq}\delta_{rs} - \delta_{pq}\, a_s^\dagger a_r - \delta_{rs}\, a_q^\dagger a_p + a_q^\dagger a_p a_s^\dagger a_r \\ &= \delta_{pq}\delta_{rs} - \delta_{pq}\, a_s^\dagger a_r - \delta_{rs}\, a_q^\dagger a_p + \delta_{ps}\, a_q^\dagger a_r - a_q^\dagger a_s^\dagger a_p a_r. \end{aligned} \qquad (2.9)$$

If the expectation value of $A$ is evaluated in the vacuum state $|\text{vac}\rangle$, the expression becomes:
$$\langle\text{vac}|A|\text{vac}\rangle = \delta_{pq}\delta_{rs} - \delta_{pq}\langle\text{vac}|a_s^\dagger a_r|\text{vac}\rangle - \delta_{rs}\langle\text{vac}|a_q^\dagger a_p|\text{vac}\rangle + \delta_{ps}\langle\text{vac}|a_q^\dagger a_r|\text{vac}\rangle - \langle\text{vac}|a_q^\dagger a_s^\dagger a_p a_r|\text{vac}\rangle = \delta_{pq}\delta_{rs}, \qquad (2.10)$$

where we assume the vacuum state to be normalised. Thus, only those terms that contain no second-quantized operators contribute.19 This procedure can naturally be extended to non-vacuum reference states; the only difference is a lengthier string of operators, and thus more mathematical manipulation.
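This bookkeeping is exactly what symbolic packages automate. sympy's secondquant module, for instance, implements Wick's theorem (Section 2.1); declaring all indices above the Fermi level makes its Fermi vacuum coincide with the true vacuum used here, so the fully contracted part of $a_p a_q^\dagger a_r a_s^\dagger$ reproduces Eq. 2.10. This is a sketch using a general-purpose library, not a production electronic-structure generator:

```python
from sympy import symbols, KroneckerDelta, simplify
from sympy.physics.secondquant import F, Fd, wicks

# all indices above the Fermi level => the algebra of the true vacuum
p, q, r, s = symbols('p q r s', above_fermi=True)

# the operator string of Eq. (2.8): a_p a_q^dagger a_r a_s^dagger
expr = F(p) * Fd(q) * F(r) * Fd(s)

# keep only fully contracted terms = the vacuum expectation value, Eq. (2.10)
vev = wicks(expr, keep_only_fully_contracted=True)
```

The result equals $\delta_{pq}\delta_{rs}$, in agreement with Eq. 2.10; dropping the `keep_only_fully_contracted` flag yields the full normal-ordered expansion of Section 2.1.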

Perhaps a more pragmatic method of reducing the length of an operator string is Wick's theorem, which achieves this through so-called "contractions".19 From Equation 2.9 it is clear that a string of creation/annihilation operators can be written as a linear combination of normal-ordered strings, most of which contain a reduced number of operators multiplied by Kronecker delta functions. The reduced terms may be regarded as arising from those contractions, which can be defined as

$$\overline{AB} \equiv AB - \{AB\}, \qquad (2.11)$$
where $A$ and $B$ are operators, the overline denotes the contraction, and the braces denote the normal-ordered form of the operator string within them. We can now define four cases for the ladder operators where the result is either zero


or nonzero:
$$\overline{a_p a_q} = a_p a_q - \{a_p a_q\} = a_p a_q - a_p a_q = 0 \qquad (2.12)$$
$$\overline{a_p^\dagger a_q^\dagger} = a_p^\dagger a_q^\dagger - \{a_p^\dagger a_q^\dagger\} = a_p^\dagger a_q^\dagger - a_p^\dagger a_q^\dagger = 0 \qquad (2.13)$$
$$\overline{a_p^\dagger a_q} = a_p^\dagger a_q - \{a_p^\dagger a_q\} = a_p^\dagger a_q - a_p^\dagger a_q = 0 \qquad (2.14)$$
$$\overline{a_p a_q^\dagger} = a_p a_q^\dagger - \{a_p a_q^\dagger\} = a_p a_q^\dagger + a_q^\dagger a_p = \delta_{pq}. \qquad (2.15)$$
As we can see, in the first three cases (Eq. 2.12–2.14) the contractions are 0, because the string is already normal-ordered; only in the last case (Eq. 2.15) do we obtain a nonzero result, by applying the anticommutation relation of Eq. 2.7, since this string is not normal-ordered.

Wick's theorem can also be formulated for an arbitrary string of operators:19,20
$$ABC \ldots XYZ = \{ABC \ldots XYZ\} + \sum_{\text{singles}} \{\overline{ABC \ldots XYZ}\} + \sum_{\text{doubles}} \{\overline{ABC \ldots XYZ}\} + \ldots, \qquad (2.16)$$
where the overlines indicate the single and double contractions taken within the otherwise normal-ordered string,

and singles and doubles refer to the number of contractions. If we look at our previous example (Eq. 2.9), where we evaluated an operator string with anticommutation relations, we will see that we obtain exactly the same result when applying Wick's theorem to the operator $A$ as follows:

$$A = \{a_p a_q^\dagger a_r a_s^\dagger\} + \{a_p^\bullet a_q^{\dagger\bullet} a_r a_s^\dagger\} + \{a_p^\bullet a_q^\dagger a_r a_s^{\dagger\bullet}\} + \{a_p a_q^\dagger a_r^\bullet a_s^{\dagger\bullet}\} + \{a_p^\bullet a_q^{\dagger\bullet} a_r^{\bullet\bullet} a_s^{\dagger\bullet\bullet}\}, \qquad (2.17)$$
where matching bullets mark a contracted pair and we preserve only the nonzero contractions. Using the anticommutation rules and the definition of a contraction, the result is:

$$\begin{aligned} A &= \{a_p a_q^\dagger a_r a_s^\dagger\} + \delta_{pq}\{a_r a_s^\dagger\} + \delta_{ps}\{a_q^\dagger a_r\} + \delta_{rs}\{a_p a_q^\dagger\} + \delta_{pq}\delta_{rs} \\ &= -a_q^\dagger a_s^\dagger a_p a_r - \delta_{pq}\, a_s^\dagger a_r + \delta_{ps}\, a_q^\dagger a_r - \delta_{rs}\, a_q^\dagger a_p + \delta_{pq}\delta_{rs}, \end{aligned} \qquad (2.18)$$

(2.18)

which is exactly the same as in Eq. 2.9. Once more, this is further simplified by the fact that any matrix element can be written as an expectation value (with respect to some reference), in which case we only need to keep the fully contracted terms (see Eq. 2.10), as the others are necessarily 0.
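The operator identity of Eqs. 2.9/2.18 can also be checked numerically by representing the ladder operators as explicit matrices on a small Fock space (a Jordan-Wigner construction). This sketch serves only as a verification device; it is not how code generators work internally:

```python
import numpy as np

def annihilators(n_modes):
    """Jordan-Wigner matrices for a_0 .. a_{n-1} on a 2**n-dimensional Fock space."""
    I2 = np.eye(2)
    Z = np.diag([1.0, -1.0])                 # carries the (-1)**k_q phase string
    a1 = np.array([[0.0, 1.0], [0.0, 0.0]])  # single-mode annihilator
    ops = []
    for p in range(n_modes):
        factors = [Z] * p + [a1] + [I2] * (n_modes - p - 1)
        M = factors[0]
        for f in factors[1:]:
            M = np.kron(M, f)
        ops.append(M)
    return ops

a_p, a_q, a_r, a_s = annihilators(4)         # four distinct indices p, q, r, s
dag = lambda M: M.conj().T

# the anticommutation relations, Eqs. (2.5)-(2.7)
assert np.allclose(a_p @ a_q + a_q @ a_p, 0)
assert np.allclose(a_p @ dag(a_q) + dag(a_q) @ a_p, 0)          # p != q
assert np.allclose(a_p @ dag(a_p) + dag(a_p) @ a_p, np.eye(16))

# Eq. (2.9) with all indices distinct: every Kronecker delta vanishes,
# so a_p a_q^+ a_r a_s^+ must equal -a_q^+ a_s^+ a_p a_r
lhs = a_p @ dag(a_q) @ a_r @ dag(a_s)
rhs = -dag(a_q) @ dag(a_s) @ a_p @ a_r
assert np.allclose(lhs, rhs)
```

A code generator effectively performs the same rearrangements symbolically, for arbitrary index labels, which is why the delta-function bookkeeping of Wick's theorem is so amenable to automation.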

However, when working out the equations of electron correlation methods, we do not only encounter single strings of creation and annihilation operators; we also come across products of strings. Wick's theorem can be extended to take those into account through a more general definition:19

$$\{ABC \ldots\}\{XYZ \ldots\} = \{ABC \ldots XYZ \ldots\} + \sum_{\text{singles}} \{\overline{ABC \ldots XYZ \ldots}\} + \sum_{\text{doubles}} \{\overline{ABC \ldots XYZ \ldots}\} + \ldots, \qquad (2.19)$$


where the only difference is that the contractions are between the normal-ordered strings instead of within them.19 We are now fully equipped to tackle all algebraic derivations.

2.2 Diagrammatic Approach

The second method of deriving working equations is the diagrammatic approach. Initially devised by Goldstone, this method is based on Feynman diagrams for many-body perturbation theory and resulted in the formulation of the linked-cluster theorem and the concept of size-extensivity.16,21–23 The first computer implementation in quantum chemistry using the diagrammatic approach was reported by Čížek in 1966, who developed a Coupled-Cluster Doubles (CCD) code; the method was further popularised by Kucharski and Bartlett, who demonstrated that it could produce the equations much faster than Wick's theorem.19,24,25

Since this approach makes use of diagrams, the best way to explain it is to show some examples (from Coupled-Cluster theory). This will be done by first introducing the diagrams themselves, together with some terminology and rules, after which it is shown how to derive some terms.

The first thing to know is the particle-hole formalism. In the previous section a vacuum state was mentioned as a reference state, but in electron correlation methods a different reference is usually employed, namely $|\Phi_0\rangle$, otherwise known as the Fermi vacuum and usually taken to be the Hartree-Fock (HF) Slater determinant.19 The occupied one-electron states in this reference are known as hole states and the unoccupied ones as particle states.19 With this, the first diagrammatic convention can be introduced: downward-directed lines refer to hole states and upward-directed lines to particle states, see below:

Figure 1. Types of directed lines: (a) hole lines; (b) particle lines; (c) singly-excited determinant $\Phi_i^a$.19

Next, operators such as the one- and two-electron parts of the normal-ordered Hamiltonian are represented by horizontal interaction lines together with hole and particle lines depicting the creation and annihilation strings.19 Different types of operators are represented by different interaction lines: solid lines are cluster operators and dashed lines are components of the electronic Hamiltonian. Furthermore, the directed lines (Fig. 1) can only originate from vertices on the interaction line, with a vertex representing the action of the operator on the electrons. Thus, one-electron diagrams have one vertex, two-electron diagrams have two vertices, and so on; only two directed lines can attach to one vertex, one incoming and one outgoing.

We can now have a look at the diagrammatic representations of the one-electron part of the Hamiltonian, $\hat{F}_N$, and the two-electron part, $\hat{V}_N$. The one-electron part, whose interaction line is capped with an "X", may be written as four fragments, each pertaining to a different block of the Fock matrix, of which two are diagonal and two off-diagonal:

Figure 2. Diagrammatic representation of the one-particle operator $\hat{F}_N$ of the normal-ordered Hamiltonian, broken up into its 4 components, where $i$ and $j$ refer to occupied orbitals and $a$ and $b$ to virtual orbitals. The numbers below the diagrams denote the excitation level of the diagram.19

The two-electron operator may be split up into several diagrams in the same manner as the one-electron fragment, the only difference being an additional vertex, connected by a dashed interaction line to the first, and with it two more directed lines, hence more possible diagrams.

Figure 3. Diagrammatic representation of the two-particle operator $\hat{V}_N$ of the normal-ordered Hamiltonian, broken up into its 9 components. The numbers below the diagrams denote the excitation level of the diagram. $\langle pq||rs\rangle$ are the antisymmetrised two-electron integrals.19


Figure 4. Diagrammatic representation of the 3 lowest excitation operators $\hat{T}_n$. The numbers to the right of the diagrams denote the excitation level of the diagram. The interaction line is at the bottom of each diagram, represented by a solid horizontal line.19

Finally, we introduce the diagrams for the cluster operators, with the solid interaction lines at the bottom (see Figure 4).19 This is important for the interpretation of these diagrams, since we declare that the reference determinant $|\Phi_0\rangle$ is at the bottom, represented as empty space. No lines may extend below the interaction line of the cluster operator. The top represents the target determinant (the bra state). In the energy equation, the bra and ket states are both the reference determinant, so no lines may extend above or below the diagram. Such diagrams are called closed, and thus only closed diagrams contribute to the energy. When evaluating amplitude equations, e.g. $\langle\Phi_i^a|\hat{H}_T|\Phi_0\rangle = 0$ for the singles amplitudes, the target is the singly excited determinant, so the diagrams are not closed. Directed lines that extend above or below the highest or lowest operator interaction lines are classified as external lines.

We are now ready to construct diagrams for the Coupled-Cluster equations. Let us look at a contributing term to the CCSD energy equation
$$\langle\Phi_0|(\hat{H}_N \hat{T}_1)_c|\Phi_0\rangle \qquad (2.20)$$

where the subscript $c$ indicates that the two operators are connected according to Wick's theorem (at least one contraction between the two strings). There is a $\hat{T}_1$ operator on the right, so to connect the right state to the left state we must choose a diagram from $\hat{H}_N$ that has an excitation level of $-1$, so that the overall excitation level is 0. Moreover, because the combined diagram must be closed, we can only use the one-electron diagrams. Only the third diagram of $\hat{F}_N$ (Figure 2) fits these criteria, so the resulting diagram is:

$$\langle\Phi_0|(\hat{F}_N \hat{T}_1)_c|\Phi_0\rangle = \text{[diagram]} \qquad (2.21)$$

To interpret this diagram algebraically, we must use the following rules:

• All directed lines must be labelled: hole lines carry the indices $i, j, k, \ldots$ and particle lines $a, b, c, \ldots$.19 The diagram in Eq. 2.21 will therefore be labelled with $i$ on its hole line and $a$ on its particle line.

• Each interaction line contributes an integral or amplitude to the expression. For the Fock matrix elements the rule is $\langle \text{out}|\hat{f}|\text{in}\rangle$, where "out" is the outward-directed line at the vertex and "in" the incoming line.19 For Eq. 2.21, the Fock matrix element is $\langle i|\hat{f}|a\rangle$, otherwise written as $f_{ia}$. For the two-electron operator $\hat{V}_N$ the rule is $\langle \text{left-out}, \text{right-out}||\text{left-in}, \text{right-in}\rangle$. The $T$-amplitudes $t^{abc\ldots}_{ijk\ldots}$ are labelled from left to right according to the lines in the diagram.

• The summation runs over all internal indices, which are all indices belonging to directed lines beginning and ending at an interaction line.19 The other lines are the external ones.

• The sign is determined by the formula $(-1)^{h+l}$, where $h$ is the number of hole lines and $l$ the number of loops.19 In this context, a loop is a route along a series of directed lines that either returns to its beginning or begins at one external line and ends at another.

• Prefactors enter in two separate ways. First, for each pair of equivalent lines, i.e. lines beginning at the same interaction line and ending at the same interaction line, a prefactor of $\frac{1}{2}$ multiplies the expression.19 Second, for equivalent vertices: if the same type of excitation operators connect to the same $\hat{V}_N$ line in the same manner, a prefactor of $\frac{1}{n!}$ multiplies the expression for $n$ equivalent vertices.

• For every unique external pair of hole or particle lines, a permutation operator $P_{pq}$ is factored in to ensure the antisymmetry of the wave function.19

There are a few more rules, but these are the ones we need to evaluate the energy expression (Eq. 2.20). As a reminder, the labelled diagram looks as follows:

$$\langle\Phi_0|(\hat{F}_N \hat{T}_1)_c|\Phi_0\rangle = \text{[labelled diagram]} \qquad (2.22)$$

Using the rules above, we already established that the Fock matrix element is $f_{ia}$; all indices are internal, so the sum runs over both $i$ and $a$; there is only 1 loop and 1 hole line, so the sign is positive; and finally, there are no equivalent lines or vertices, so there is no prefactor. The expression is therefore

$$\langle\Phi_0|(\hat{F}_N \hat{T}_1)_c|\Phi_0\rangle = \sum_{ia} f_{ia}\, t_i^a. \qquad (2.23)$$

To make the reader more accustomed to this approach, we will look at one final example, namely the term

$$\langle\Phi_0|(\hat{H}_N \hat{T}_2)_c|\Phi_0\rangle. \qquad (2.24)$$

Hopefully, it is now already clear that we can only create a closed diagram with the two-electron part of the Hamiltonian in combination with the $\hat{T}_2$ operator. Furthermore, we can only use the diagram with excitation level $-2$ to connect the bra and ket states, so the diagram will be:

$$\langle\Phi_0|(\hat{V}_N \hat{T}_2)_c|\Phi_0\rangle = \text{[diagram]} \qquad (2.25)$$



and if we follow the rules we arrive at the algebraic interpretation
$$\langle\Phi_0|(\hat{V}_N \hat{T}_2)_c|\Phi_0\rangle = \frac{1}{4}\sum_{ijab} \langle ij||ab\rangle\, t_{ij}^{ab}. \qquad (2.26)$$

Using these diagrams, it should also become apparent that no cluster operator above $\hat{T}_2$ can contribute to the energy expression, since the corresponding diagrams of those operators contain more than two vertices, making it impossible to create a closed diagram when the Hamiltonian contains at most two vertices.
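Working equations such as Eqs. 2.23 and 2.26 are, in the end, plain tensor contractions, which is why generated code typically reduces to contraction calls like numpy's einsum. The sketch below uses random arrays as stand-ins for the actual Fock matrix elements, antisymmetrised integrals and amplitudes, with illustrative dimensions:

```python
import numpy as np

n_occ, n_vir = 3, 4
rng = np.random.default_rng(0)
f_ov = rng.normal(size=(n_occ, n_vir))                 # f_ia
t1 = rng.normal(size=(n_occ, n_vir))                   # t_i^a
oovv = rng.normal(size=(n_occ, n_occ, n_vir, n_vir))   # <ij||ab>
t2 = rng.normal(size=(n_occ, n_occ, n_vir, n_vir))     # t_ij^ab

# Eq. (2.23): sum_ia f_ia t_i^a
e_singles = np.einsum('ia,ia->', f_ov, t1)

# Eq. (2.26): (1/4) sum_ijab <ij||ab> t_ij^ab
e_doubles = 0.25 * np.einsum('ijab,ijab->', oovv, t2)
```

A generator's job is precisely to decide which of these contractions to emit, with what prefactors and signs, and in which factorised order.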

To conclude this section, we note that, just as with Wick's theorem, this approach is very systematic, and it should hence be possible (and it is) to convert the creation and subsequent interpretation of these diagrams into code.


3 Second Quantization Methods

In this section two prominent methods will be discussed to show the reader that the derivation of their respective working equations can be performed using the approaches discussed in Section 2. There are, of course, more methods to which those approaches can be applied, but discussing them all is not relevant for this thesis.

3.1 Configuration Interaction

Configuration Interaction is one of the post-HF or correlated methods. By electron correlation we roughly mean that the movement of each electron is coupled to that of the others, as they try to evade each other as much as possible. This effect is not fully accounted for in HF, and CI tries to incorporate this correlation by constructing an electronic wave function as a linear combination of Slater determinants:

$$|\Psi_{CI}\rangle = (1 + \hat{C})|\Phi_0\rangle, \qquad (3.1)$$
$$\hat{C} = \hat{C}_1 + \hat{C}_2 + \ldots = \sum_{ia} c_i^a E_{ia} + \frac{1}{4}\sum_{ijab} c_{ij}^{ab} E_{jb} E_{ia} + \ldots, \qquad (3.2)$$
$$E_{pq} = a_{p\alpha}^\dagger a_{q\alpha} + a_{p\beta}^\dagger a_{q\beta}, \qquad (3.3)$$

where $\hat{C}$ is the CI operator, which can be truncated at a certain excitation level;v $c^{abc\ldots}_{ijk\ldots}$ are the linear coefficients that need to be determined variationally; and $E_{pq}$ are the singlet excitation operators used in non-relativistic theories.18,26 To determine the coefficients and the energy we can formulate the eigenvalue problem
$$\hat{H}|\Psi_{CI}\rangle = E_{CI}|\Psi_{CI}\rangle, \qquad (3.4)$$

with the non-relativistic Hamiltonian defined as

$$\hat{H} = \sum_{pq} h_{pq} E_{pq} + \frac{1}{2}\sum_{pqrs} (pq|rs)\left(E_{pq}E_{rs} - \delta_{qr}E_{ps}\right) \qquad (3.5)$$
$$h_{pq} = \langle\psi_p|\hat{h}|\psi_q\rangle; \qquad \hat{h} = \hat{T} + \hat{V}_{en} \qquad (3.6)$$
$$(pq|rs) = \iint \psi_p^*(\mathbf{r}_1)\psi_q(\mathbf{r}_1)\psi_r^*(\mathbf{r}_2)\psi_s(\mathbf{r}_2)\, r_{12}^{-1}\, d\mathbf{r}_1 d\mathbf{r}_2 \qquad (3.7)$$

There are multiple ways of approximately solving this problem (Eq. 3.4), but since the main focus of this report is code generators, only the most applicable method will be discussed, which is called direct CI and follows the same approach as solving the Coupled-Cluster equations.27,28 Within this approach, the eigenvalue problem of Eq. 3.4 is decomposed into several equations that determine the energy and the coefficients using projection techniques:

$$\langle\Phi_0|\hat{H}|\Psi_{CI}\rangle = E; \qquad \langle\Phi_0|\Psi_{CI}\rangle = 1 \qquad (3.8)$$

$$\langle\Phi^{ab\ldots}_{ij\ldots}|\hat{H}|\Psi_{CI}\rangle = E\, c^{ab\ldots}_{ij\ldots} \qquad (3.9)$$

v When the number of CI operators reaches the number of electrons, we arrive at the exact wave function within the given basis set.


By projecting onto the reference wave function we obtain an expression for the energy as an expectation value (Eq. 3.8), and the same can be done with excited determinants or configuration state functions (CSFs) to obtain amplitude equations (Eq. 3.9). CSFs are spin-adapted functions, i.e. eigenfunctions of both the total and projected spin operators, whereas Slater determinants are only eigenfunctions of the projected spin.18 A CSF is created by taking a linear combination of Slater determinants, thereby providing a more compact representation of the electronic system while also imposing the correct spin symmetry. The downside is that this basis representation is more complicated, since each CSF consists of multiple determinants.

With the Hamiltonian and the CI wave function expressed in terms of second quantization operators, an iterative scheme can be devised to solve Equations 3.8 and 3.9, where Wick's theorem can be used to evaluate all contributing terms, since all equations can be written as expectation values over strings of second-quantized operators.26 Furthermore, these equations are not restricted to single-reference states, but also apply to multi-reference ones.
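The essence of direct CI, namely never building the Hamiltonian matrix explicitly but only computing its action $\sigma = \mathbf{H}\mathbf{c}$ on a coefficient vector, can be caricatured in a few lines. Here a random symmetric matrix stands in for the CI matrix and a shifted power iteration stands in for the Davidson-type solvers used in practice; everything about this sketch is schematic:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
H = rng.normal(size=(n, n))
H = 0.5 * (H + H.T)                     # stand-in for the (symmetric) CI matrix

def sigma(c):
    # A real direct-CI code assembles H @ c on the fly from integrals and
    # coupling coefficients, without ever storing H itself.
    return H @ c

shift = np.linalg.norm(H, 2) + 1.0      # makes the ground state dominate the iteration
c = rng.normal(size=n)
c /= np.linalg.norm(c)
for _ in range(5000):                   # crude fixed-point loop; real codes use Davidson
    c = sigma(c) - shift * c
    c /= np.linalg.norm(c)
energy = c @ sigma(c)                   # Rayleigh quotient = ground-state energy
```

The `sigma` routine is exactly the place where generated code ends up: the working equations of Sections 2 and 3 define the contractions that build $\sigma$ from the integrals.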

3.2 Coupled-Cluster

Coupled-Cluster, like CI, is a correlated method, but rather than a linear ansatz, CC utilizes an exponential ansatz:18,26

$$|\Psi_{CC}\rangle = e^{\hat{T}}|\Phi_0\rangle, \qquad (3.10)$$

where $\hat{T}$ is the cluster operator, which can be decomposed into multiple excitation operators
$$\hat{T} = \hat{T}_1 + \hat{T}_2 + \cdots = \sum_{ia} t_i^a E_{ia} + \frac{1}{4}\sum_{ijab} t_{ij}^{ab} E_{jb} E_{ia} + \ldots. \qquad (3.11)$$

If we compare CI to CC by expanding the exponentiated cluster operator and collecting all terms of the same excitation level, we arrive at the following equalities:18,26

$$\hat{C}_1 = \hat{T}_1 \qquad (3.12)$$
$$\hat{C}_2 = \hat{T}_2 + \tfrac{1}{2}\hat{T}_1^2 \qquad (3.13)$$
$$\hat{C}_3 = \hat{T}_3 + \hat{T}_1\hat{T}_2 + \tfrac{1}{6}\hat{T}_1^3 \qquad (3.14)$$
$$\vdots \qquad (3.15)$$

where we note that the two models are fairly similar; in the limit of the excitation level being extended up to the number of electrons, they are in fact equal.18 Furthermore, we note that the main advantage of CC over CI is the disconnected excitations, such as $\frac{1}{2}\hat{T}_1^2$, that are included at lower levels of truncation in CC. Moreover, the CC wave function is size-extensive regardless of the truncation level. Size-extensive means that if a calculation is performed on a compound system AB, where A and B are non-interacting systems, the final energy is equal to the sum of the energies of the subsystems obtained in individual calculations.
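Because the cluster operators commute with one another, the relations 3.12–3.14 can be verified by treating them as scalars and tracking the excitation rank through powers of an auxiliary variable. A small sympy sketch (the auxiliary variable $x$ is an illustrative device, not part of the CC formalism):

```python
import sympy as sp

t1, t2, t3, x = sp.symbols('t1 t2 t3 x')

# track excitation rank via powers of x: T1 ~ x, T2 ~ x^2, T3 ~ x^3
T = t1 * x + t2 * x**2 + t3 * x**3
expansion = sp.series(sp.exp(T), x, 0, 4).removeO()

# collect terms of a given total excitation level
C2 = sp.expand(expansion.coeff(x, 2))   # Eq. (3.13): t2 + t1**2/2
C3 = sp.expand(expansion.coeff(x, 3))   # Eq. (3.14): t3 + t1*t2 + t1**3/6
```

The same collection carries over to the operator algebra precisely because the $\hat{T}_n$ commute among themselves.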

To solve the Schrödinger equation with the exponential ansatz of CC, projection techniques are used, just as with direct CI; however, this can be done in two ways. The first is the unlinked approach, where the reference state or excited determinants are left-projected onto the Schrödinger equation:

$$\langle\Phi_0|\hat{H}|\Psi_{CC}\rangle = E; \qquad \langle\Phi_0|\Psi_{CC}\rangle = 1 \qquad (3.16)$$
$$\langle\Phi^{ab\ldots}_{ij\ldots}|\hat{H}|\Psi_{CC}\rangle = E\,\langle\Phi^{ab\ldots}_{ij\ldots}|e^{\hat{T}}|\Phi_0\rangle \qquad (3.17)$$

These equations have the downside that the amplitude equations (Eq. 3.17) are energy-dependent, and the method itself is only order-by-order size-extensive, due to fortunate cancellation of terms that individually violate size-extensivity. The second, more convenient approach is the linked formulation, in which the Schrödinger equation is first multiplied by the additional operator $e^{-\hat{T}}$ and then projected:

$$\langle\Phi_0|e^{-\hat{T}}\hat{H}e^{\hat{T}}|\Phi_0\rangle = E, \qquad (3.18)$$
$$\langle\Phi^{ab\ldots}_{ij\ldots}|e^{-\hat{T}}\hat{H}e^{\hat{T}}|\Phi_0\rangle = 0, \qquad (3.19)$$

where algebraic evaluation of the above expressions is simplified by the Baker-Campbell-Hausdorff (BCH) expansion of the similarity-transformed Hamiltonian $\hat{H}_T$:
$$\hat{H}_T = e^{-\hat{T}}\hat{H}e^{\hat{T}} = \hat{H} + \left[\hat{H}, \hat{T}\right] + \frac{1}{2!}\left[\left[\hat{H}, \hat{T}\right], \hat{T}\right] + \frac{1}{3!}\left[\left[\left[\hat{H}, \hat{T}\right], \hat{T}\right], \hat{T}\right] + \frac{1}{4!}\left[\left[\left[\left[\hat{H}, \hat{T}\right], \hat{T}\right], \hat{T}\right], \hat{T}\right] + \ldots \qquad (3.20)$$

which naturally truncates after the quadruply nested commutator.vii At this point, a suitable iterative scheme can be composed, either diagrammatically, using Wick's theorem, or with commutation rules.

vii One can explain this using Wick's theorem, which requires the Hamiltonian to have at least one contraction with each cluster operator on its right.19 A diagrammatic explanation would be that the Hamiltonian provides at most 4 directed lines to connect to the cluster operators.
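The exact termination of a BCH series can be mimicked numerically: for any matrix $T$ with $T^3 = 0$, each $k$-fold nested commutator is a sum of terms $T^j H T^{k-j}$, all of which vanish once $k > 4$, so the series stops at the quadruply nested commutator. The analogy is structural only; in CC the truncation follows from $\hat{H}$ containing at most two-electron operators, not from nilpotency. A small numpy check:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
H = rng.normal(size=(n, n))
T = np.triu(rng.normal(size=(n, n)), k=1)   # strictly upper triangular => T^3 = 0

I = np.eye(n)
expT = I + T + T @ T / 2                    # exact: the exponential series ends at T^2
expmT = I - T + T @ T / 2

H_bar = expmT @ H @ expT                    # similarity-transformed "Hamiltonian"

# BCH sum, Eq. (3.20), terminated at the quadruply nested commutator
comm = lambda A, B: A @ B - B @ A
bch, nested, factorial = H.copy(), H, 1.0
for k in range(1, 5):
    nested = comm(nested, T)
    factorial *= k
    bch = bch + nested / factorial

assert np.allclose(H_bar, bch)              # the series has truncated exactly
```

This is the property that makes the linked CC equations (Eqs. 3.18 and 3.19) finite polynomials in the amplitudes and hence directly implementable by a code generator.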


4 Code Generators for Second Quantization Methods

This section discusses several code generators applied to second quantization methods, with the goal of introducing the reader to the structure of a generator, so that he/she might get a feel for the important aspects and complications in creating such a program. Although the field is still relatively young, a substantial amount of work has already been performed in quantum chemistry over the last decade.29 Kállay and Surján10 and Hirata14 were among the first to show, with their automated higher-order CC implementations, that generated programs could compete with hand-optimised codes in serial and massively-parallel environments, respectively. Hanrath et al.30,31 laid the foundation for implementation strategies regarding tensor contractions with their arbitrary-order CC program, which has been extended to multiple methods: CC-F12,15 a specialised type of CC that converges faster to the basis set limit;26 relativistic CC;32 open-shell CC;33,34 and excited-state CC.35 These are all single-reference methods, meaning that a single Slater determinant acts as the reference upon which correlation is added. Multi-reference (MR) methods, in which the reference consists of multiple determinants, are where automated implementations might really have an impact, since their equations are more complex than the single-reference counterparts. Hanauer and Köhn produced an MRCC code36 and Saitow et al. generated an MRCI code with Density Matrix Renormalization Group reference states.37 Besides CC and CI, perturbation theories have also been implemented with the help of code generators: Shiozaki implemented Complete Active Space Second-order Perturbation Theory (CASPT2) in his SMITH3 program, which Krupička calls the most general tool to date.15,29,38 The most recent code generator in electronic-structure theory is ORCA-AGE, which looks quite promising due to its generality and proven performance.26

As became clear from the previous paragraph, elaborating on each generator and its strategies would be unfeasible, so instead three are selected that are deemed most relevant for this thesis, namely the Tensor Contraction Engine, SMITH and ORCA-AGE.14,15,26

4.1 Tensor Contraction Engine

One of the major code generators for producing code for second-quantized methods - specifically CC, CI and MBPT - is the Tensor Contraction Engine14,39,40 (TCE) from Hirata, which took its design philosophy from a CC code generator by Janssen and Schaefer.8,14 This engine can be regarded as a compiler for a general-purpose language, meaning that input is read by the engine, transformed and optimised, before executable code is ultimately produced. The program builds on techniques developed by others,8,10,41-44 and claims that its applicability is broadened and its capabilities in equation analysis and program optimisation are enhanced so much that it is able to compete with hand-optimised code.14

Looking more closely into the inner structure of the TCE, the program can be split up into several modules, as is depicted in Figure 5 below.


Figure 5. Schematic representation of the architecture within the Tensor Contraction Engine.40

First, we have the input, which consists of the tensor expressions. Given a definition of a many-electron theory model expressed as a quasi-vacuum expectation value (for instance Eq. 3.8 or 3.18), the engine produces working equations using Wick's theorem.14,40 It is therefore able to generate CC and CI hierarchies of arbitrary order, but Hirata only went up to the quadruple excitations, most likely because quintuple excitations generally do not contribute significantly and/or due to limitations of computer power at the time (2003). These raw working equations are then simplified by merging equivalent terms, erasing vanishing terms (see Section 2.1) and deleting unlinked terms. The TCE Language Parser then puts the working equations directly into code without modifications. However, the equations can be simplified further by means of strength reduction (Eq. 4.1), which determines the optimal ordering of the contractions; factorisation (Eq. 4.2); and intermediate reuse, to minimise memory consumption (Eq. 4.3).14,45

X = ABCD = ((BC)A)D (4.1)

X = AB + AC = A(B + C) (4.2)

X = AB + C(AB) = Y + CY ; Y = AB (4.3)
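The payoff of strength reduction (Eq. 4.1) can be illustrated with an ordinary matrix chain; the following sketch, with illustrative names and dimensions of my own choosing (not from the TCE), estimates the multiply-add count of two parenthesisations of the same product X = ABCD.

```python
# Illustrative sketch: flop counts of two parenthesisations of X = ABCD,
# showing why strength reduction (choosing the contraction order) matters.
def matmul_flops(shapes, order):
    """Multiply-add count for combining matrices (given as (rows, cols)
    shapes) in the pairwise order encoded as a nested tuple of leaf indices."""
    def cost(node):
        if isinstance(node, int):            # leaf: an input matrix
            return 0, shapes[node]
        fl, (m, k) = cost(node[0])
        fr, (k2, n) = cost(node[1])
        assert k == k2, "inner dimensions must match"
        return fl + fr + m * k * n, (m, n)
    return cost(order)[0]

# A (10x1000), B (1000x5), C (5x1000), D (1000x10)
shapes = [(10, 1000), (1000, 5), (5, 1000), (1000, 10)]
left_to_right = (((0, 1), 2), 3)             # ((AB)C)D: small intermediates
inner_first = ((0, (1, 2)), 3)               # (A(BC))D: a 1000x1000 intermediate
print(matmul_flops(shapes, left_to_right))   # cheap order
print(matmul_flops(shapes, inner_first))     # expensive order
```

The two orders give the same X but differ by nearly two orders of magnitude in cost, which is exactly the kind of decision the strength-reduction module automates.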

Those three actions are executed with the goal of expressing all terms as binary tensor contractions and additions, since those are the only operations the BLAS library, which provides the routines for performing this algebra, can handle. Looking at a more specific example, we can directly evaluate the effect of these operations. Suppose we have the contraction

$S^{ab}_{ij} = \sum_{cdefkl} A^{ac}_{ik}\, B^{be}_{fl}\, C^{df}_{jk}\, D^{cd}_{el}$ (4.4)

where the scaling would be approximately $O(N^{10})$. By recasting this contraction into a series of pairwise contractions as shown in Figure 6a, the scaling is reduced to $O(N^6)$.39 While this approach


has achieved the minimal number of floating-point operations (flops), in practice the program would need to allocate several intermediate and/or temporary tensors, which may exhaust the available memory or disk space.

(a) Formula sequence:

$I1^{bcdf} = \sum_{el} B^{be}_{fl} D^{cd}_{el}; \quad I2^{bcjk} = \sum_{df} I1^{bcdf} C^{df}_{jk}; \quad S^{ab}_{ij} = \sum_{ck} I2^{bcjk} A^{ac}_{ik}$

(b) Direct implementation (unfused code):

I1 = 0; I2 = 0; S = 0
for b,c,d,e,f,l:  I1[b,c,d,f] += B[b,e,f,l] * D[c,d,e,l]
for b,c,d,f,j,k:  I2[b,c,j,k] += I1[b,c,d,f] * C[d,f,j,k]
for a,b,c,i,j,k:  S[a,b,i,j]  += I2[b,c,j,k] * A[a,c,i,k]

(c) Memory-reduced implementation (fused):

S = 0
for b,c:
    I2f = 0
    for d,f:
        I1f = 0
        for e,l:    I1f += B[b,e,f,l] * D[c,d,e,l]
        for j,k:    I2f[j,k] += I1f * C[d,f,j,k]
    for a,i,j,k:    S[a,b,i,j] += I2f[j,k] * A[a,c,i,k]

Figure 6. Example demonstrating reduced scaling and memory minimisation.39,40
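The equivalence of the direct contraction (Eq. 4.4) and the pairwise sequence of Figure 6a can be checked numerically; a small sketch using numpy.einsum on random tensors (tiny arbitrary dimensions, not TCE code):

```python
import numpy as np

# Verify that the pairwise sequence of Figure 6a reproduces the direct
# contraction of Eq. 4.4 on small random tensors.
rng = np.random.default_rng(0)
n = 3
A = rng.standard_normal((n,) * 4)   # A[a,c,i,k]
B = rng.standard_normal((n,) * 4)   # B[b,e,f,l]
C = rng.standard_normal((n,) * 4)   # C[d,f,j,k]
D = rng.standard_normal((n,) * 4)   # D[c,d,e,l]

# direct O(N^10) contraction over c,d,e,f,k,l
S_direct = np.einsum('acik,befl,dfjk,cdel->abij', A, B, C, D)

# pairwise O(N^6) sequence with intermediates I1 and I2
I1 = np.einsum('befl,cdel->bcdf', B, D)
I2 = np.einsum('bcdf,dfjk->bcjk', I1, C)
S_seq = np.einsum('bcjk,acik->abij', I2, A)

assert np.allclose(S_direct, S_seq)
```

Both routes give the same rank-4 result; only the flop count and intermediate storage differ.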

When analysing Figure 6b, one can immediately tell that the memory requirement could be brought down by loop fusion, since there are multiple common indices (Fig. 6c).46 The algorithm selectively moves loops so that temporary tensor slices can be generated, thereby making it redundant to evaluate the tensor in its entirety.40 Note that in Figure 6c, I1f is now a scalar and I2f is a 2D array, where both were previously rank-4 tensors. This raises the question of what the optimal loop-fused structure would be: frankly, there is no clear answer, since it is heavily dependent on the architecture of the machine on which the calculation is performed, and even on the system you are studying. If your cluster has little memory, a highly loop-fused code might be the solution, where redundant operations could be mixed in to further increase fusing, if the user is willing to pay for this in computational cost. On the other hand, if memory is no issue, the user will likely favour a more direct implementation. This is why the algorithm does not select just a single "best" loop fusion, but rather a set of options, both to allow for a certain degree of flexibility and because the loop fusion structure is important for subsequent optimisations. One more way to save memory is to exploit the permutation symmetry of the indices, i.e. $t^{ab}_{ij} = -t^{ba}_{ij}$, so instead of storing the full tensor $t^{ab}_{ij}$, it suffices to store $t^{a<b}_{i<j}$. This symmetry can then also be used to reduce the flop count by not having to sum over all indices:

$\sum_{cd} V^{ab}_{cd}\, t^{cd}_{ij} = 2 \sum_{c<d} V^{ab}_{cd}\, t^{cd}_{ij}$ (4.5)
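Equation 4.5 can be checked numerically: for tensors antisymmetric in the contracted index pair, the (c,d) and (d,c) terms of the product are equal and the diagonal vanishes, so the restricted sum doubled recovers the full sum. A sketch with random antisymmetrised tensors (illustrative only):

```python
import numpy as np

# Numerical check of Eq. 4.5: full sum over c,d versus twice the sum over c<d,
# for V and t antisymmetric in the contracted pair (c,d).
rng = np.random.default_rng(1)
n = 4
V = rng.standard_normal((n,) * 4)        # V[a,b,c,d] ~ V^{ab}_{cd}
t = rng.standard_normal((n,) * 4)        # t[c,d,i,j] ~ t^{cd}_{ij}
V = V - V.transpose(0, 1, 3, 2)          # antisymmetrise in (c,d)
t = t - t.transpose(1, 0, 2, 3)          # antisymmetrise in (c,d)

full = np.einsum('abcd,cdij->abij', V, t)

restricted = np.zeros_like(full)
for c in range(n):
    for d in range(c + 1, n):            # restricted triangle c < d
        restricted += 2 * np.einsum('ab,ij->abij', V[:, :, c, d], t[c, d])

assert np.allclose(full, restricted)
```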

The output of the Loop Fuser is passed on to the Abstract Syntax Tree (AST) Generator; the AST is a representation of all the Input/Output (I/O) operations, individual loops and more.40 Regular, general-purpose compilers also make use of an AST, but since the goal here is electronic-structure methods, certain transformations can be performed which are not possible in a more general context. The TCE includes the following optimisation procedures at the AST stage:

• Data distribution and partitioning: In order for the TCE to generate parallel code it needs to know how to distribute its data. This module figures out a data distribution that allows for minimal communication. Because the distribution infrastructure affects the memory usage, it is coupled to the memory minimisation routines.47

• Space-time transformation: It was mentioned in a previous paragraph that the loop fusing could be adjusted depending on the system. This transformation module systematically searches for the best trade-off between storage space and recomputation to improve the performance.40 In the case of too little storage space, the memory minimisation module is invoked once more to apply more loop fusion.48

• Data-locality optimisation: The final stage is to make effective use of the CPU's cache memory by blocking or tiling tensors by spin, spatial and permutation symmetry (Fig. 7).40 The idea is to bring in a piece of data and (re)use it as much as possible before removing it from cache. When opting for a disk-based calculation, the main memory can be regarded as the cache, and the data then has to be moved between the disk and the main memory. To obtain the best performance, the data-locality optimisation routines determine the optimal tile sizes.49,50

(a)
do i
  do j
    if (i > j)
      if X(ij) is nonzero
        arithmetic

(b)
do I(i1-i2)
  do J(j1-j2)
    if (I > J)
      if X(IJ) is nonzero
        do i
          do j
            arithmetic

Figure 7. (a) Example where the elements of a matrix are analysed to determine the nonzero components to be used in the arithmetic. (b) By tiling the matrix, the peak memory usage can be controlled by changing the size of the tiles at run time.14

Having performed every possible optimisation, code can finally be generated. Everything discussed about the procedures of the TCE was relatively general, in the sense that it would work for any electronic-structure package. However, every program has its own input, way of obtaining integrals and parallelisation scheme. This means that it is non-trivial to interface the generated code with an existing program. Nonetheless, this only requires small modifications to the code generation module, so it is not a really time-consuming issue.

Now that the inner workings of the TCE have been discussed, let us look at the actual performance. Unfortunately, no comparison could be found between timings of a hand-written code and the generated one; only the scaling and parallel efficiency were mentioned. The claim of being able to compete with hand-optimised code can therefore not be verified by the author of this report, hence we look at the reported speed-up due to parallelisation, as shown in Table 1. While these numbers do not look great on paper, keep in mind that this work was one of, if not the, first generator(s) applied to quantum chemistry. The fact that they were able to achieve this feat is impressive in itself. The article further breaks down the timings of the transformation and CCSD iteration steps into the parallel work and the communication. The authors observe that the parallel work was scaling well (80% efficiency), but that the bottleneck was the communication, for which they did not have an explanation.40 Considering the many factors playing a role in this process, such as data distribution, tiling, etc., this is not particularly odd. No further papers could be found in which a solution is provided for this optimisation problem within the TCE.

Table 1. Calculation of ethylene in C1 symmetry at the CCSD(T) level. The walltime is in seconds and the speed-up factor is noted between brackets.40

                     Number of processors
Algorithm step       4             8             16
Transformation       94.1 (1.0)    83.5 (1.1)    90.0 (1.0)
CCSD iteration       23.5 (1.0)    13.5 (1.7)    10.0 (2.4)
(T) correction       1210 (1.0)    593.9 (2.0)   287.5 (4.2)
Overall CCSD(T)      1574 (1.0)    835.1 (1.9)   495.5 (3.2)

4.2 SMITH

Code generators can also be used to produce variations on the regular methods, such as ground-state Coupled-Cluster, which is what led to the development of the SMITH program and its successors, SMITH2 (unpublished) and SMITH3.15,29,38 It was initially designed for CC-R12, in which the interelectronic (r12) degrees of freedom are included explicitly in the electronic wave function to reduce the errors arising from basis set incompleteness.38,51,52 This inclusion gives rise to another equation, called the geminal amplitude equation, which adds more complexity; in addition to the regular CC equations, SMITH has to produce the geminal equation automatically as well. The problem that CC-R12 presents for symbolic algebra programs is the permutation symmetry of intermediates. Kállay and Surján made the observation that by restricting the permissible computational sequences, intermediates could be produced with a known permutation symmetry,10 but the CC-R12 method cannot be handled by that procedure. Moreover, three additional algebraic transformations need to be executed besides the ones already discussed in Section 4.1, but these are highly systematic, so they can also be taken care of by the code generator.38

The algorithm to produce the working equations is different from the TCE's, because SMITH uses antisymmetrised Goldstone diagrams in accordance with Bartlett's rules,53 which produce only topologically distinct terms.38 According to Shiozaki, this is the reason that the derivation is faster in SMITH. In this formalism, operators are diagrammatically represented by interaction lines and vertices. A vertex inside SMITH is represented as a class object containing the vertex type and strings of creation and/or annihilation operators. Interaction operators do not have a predefined type, to allow for a more general setup. The creation and annihilation operators of the same type are contracted between all excitation vertices and an interaction vertex, producing only the unique fragments by rigorously permuting the connectable lines. Once the distinct diagrams have been generated, the rules as formulated by Bartlett can be used to determine the prefactors and the index permutation operators. To accommodate CC-R12, a new rule was established, which states that when a creation/annihilation operator in an interaction vertex of unknown type is contracted with an annihilation/creation operator, the tensor index of the interaction vertex inherits the type of the other vertex.

After the derivation comes the transformation to convert the raw equations into efficient computational sequences, which is done by strength reduction, factorisation and common subexpression elimination.38 The strength reduction determines the best sequential binary contraction order for each multiple tensor product, for instance A(B(CD)) for ABCD. All n!/2 unique sequential binary contraction orders for an n-tuple product are analysed, and the least expensive order is found by first comparing the peak operation cost and then the peak memory cost. The fact that only sequential orders are considered is not a problem for conventional CC, but it turns out to have a significant effect on CC-R12: the operation cost can be reduced by an order of magnitude if non-sequential orders are considered. SMITH has a built-in work-around for this problem, namely that before strength reduction, special types of intermediates are introduced to transform contractions that are expected to be important.
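The n!/2 count of unique sequential orders follows because a sequential order folds tensors in one at a time, and the innermost pair commutes; a small enumeration sketch (illustrative, not SMITH code) makes this concrete:

```python
from itertools import permutations

# Sketch: enumerate the n!/2 unique *sequential* binary contraction orders of
# an n-tuple tensor product, e.g. ((CD)B)A; the first two tensors commute
# inside the innermost contraction, which halves the n! orderings.
def sequential_orders(tensors):
    seen = set()
    for perm in permutations(tensors):
        key = (frozenset(perm[:2]),) + perm[2:]   # (X,Y,...) == (Y,X,...)
        if key not in seen:
            seen.add(key)
            yield perm

orders = list(sequential_orders(('A', 'B', 'C', 'D')))
print(len(orders))   # 4!/2 = 12 unique sequential orders
```

A real strength-reduction pass would attach a cost model to each order and keep the cheapest; only the enumeration is shown here.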

The second step of the transformation procedure is the factorisation, where new intermediates are defined as a sum of a single tensor expression and/or binary tensor contractions (AB + AC = A(B + C)).38 It was not mentioned how the algorithm exactly determines the intermediates, but since Hirata (of the TCE) was also part of the SMITH development, the algorithm is likely similar to the one in the TCE.

The third and final algebraic transformation is the common subexpression elimination, in which an arbitrary string of tensor contractions is reformulated using intermediates that contain only binary tensor contractions (see Eq. 4.3 for an example).38 If a certain intermediate appears more than once, it only has to be calculated once; it is then stored and reused when required.

Regarding the code generator aspect, not much is documented other than the use of a tiling algorithm.54 Tensors are blocked or tiled by type of particle, spin, and point-group symmetry, which helps the data layout and contraction algorithms take advantage of the index-permutation symmetry. The size of the tiles is not fixed at the moment of code generation, but can be changed at run time to accommodate the physical memory. Moreover, using tiles can reduce the overhead, since spin, point-group and index-permutation symmetries can be exploited at the tile level as opposed to the individual-element level. Furthermore, each tile is labelled in row-major order and the location of the first element within a tile is stored in a hash table with the tile number as the hash key. No reports could be found of the generator's performance compared to hand-optimised code.
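The tile-number-to-offset bookkeeping described above can be sketched for a square matrix; the function below is an illustration of the idea, not SMITH's actual data layout:

```python
# Sketch of SMITH-style tile bookkeeping: tiles are numbered row-major and a
# hash table maps each tile number to the offset of its first element in the
# packed storage. Names and layout here are illustrative assumptions.
def build_tile_offsets(dim, tile):
    """Map row-major tile number -> offset into the packed 2D tensor."""
    ntile = dim // tile                 # assume dim is divisible by tile
    offsets = {}
    off = 0
    for ti in range(ntile):
        for tj in range(ntile):
            offsets[ti * ntile + tj] = off
            off += tile * tile          # each stored tile holds tile*tile elements
    return offsets

offsets = build_tile_offsets(dim=8, tile=2)
print(offsets[0], offsets[5])           # offsets of tiles 0 and 5
```

Because the tile size enters only through this table, it can indeed be chosen at run time without regenerating the code.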

For certain theories, humans have been unable to manually program the working equations.29 The latest iteration of SMITH, SMITH3, has been used to implement analytic nuclear gradients for the CASPT2 method in the BAGEL code.29,55 Besides non-relativistic CASPT2, SMITH3 has also been used to produce 4-component relativistic CASPT2, as well as 4-component relativistic and non-relativistic multi-reference CI.56

4.3 ORCA-AGE

ORCA-AGE is one of, if not the, most recent code generators applied to quantum chemistry. Developed as part of the ORCA suite, one of its objectives is to provide the possibility of quickly implementing and testing a diverse pool of wave function methods for energy and property calculations.26 It is thus meant as a general-purpose tool and should be able to handle a large variety of electronic-structure problems. Furthermore, the generated code is not designed for massively parallel architectures, as the TCE's is, likely because only a limited number of researchers have access to those kinds of machines. It therefore focuses on the common cluster infrastructures on which ORCA is most employed. This fact leads to changes in the design strategy, as there is no mass communication to handle, for instance. As an example, Hanrath et al. put a lot of emphasis on global optimisation techniques, but because AGE has a different target, it can exploit/reuse the optimised algorithms from ORCA.31

The toolchain can be divided into three major components, as shown by the colour coding in Figure 8: the equation generator, which needs a wave function Ansatz and either a Restricted Hartree-Fock (RHF), Restricted Open-Shell Hartree-Fock (ROHF), Unrestricted Hartree-Fock (UHF) or Complete Active Space SCF (CASSCF) reference state; the factorisation component, which processes the raw equations; and finally the actual code generator.26 These three components communicate with each other through human-readable text files on disk, so the developers are able to check in between steps and possibly make adjustments. The documentation also states that the first two modules are completely independent of the quantum chemistry package, which makes the toolchain a good candidate for producing code for other packages as well. The three steps will now be elaborated on in the following paragraphs.


Figure 9. Wave function Ansatz input for the generation of the closed-shell CISD equations. "o" refers to the reference state; the bra and the ket states are defined by a list of [factor, excitation operator, corresponding amplitude, integral, and other tensors].26

The equation generator starts with the choice of wave function Ansatz, which is parsed as input in the manner shown in Figure 9. The input is then interpreted by the matelem module, which can produce several types of equations:26

• Projected energy: $E = \langle\Phi_0|\hat{H}|\Psi\rangle$;

• Residual equations: $\sigma^{a..d}_{i..l} = \langle\tilde{\Psi}^{a..d}_{i..l}|\hat{H}|\Psi\rangle$, where $\langle\tilde{\Psi}^{a..d}_{i..l}|$ is a contravariant CSF;

• One-particle densities: $\Gamma^q_p = \langle\tilde{\Psi}|E_{pq}|\Psi\rangle$;

• One-particle spin-density: $\rho^q_p = \langle\tilde{\Psi}|S_{pq}|\Psi\rangle$, with $S_{pq} = a^\dagger_{q\alpha}a_{p\alpha} - a^\dagger_{q\beta}a_{p\beta}$;

• Denominators for either the CI or CC amplitudes over a 0-th order Hamiltonian (Møller-Plesset, Dyall or full): $\Delta^{a..d}_{i..l} = \langle\tilde{\Psi}^{a..d}_{i..l}|\hat{H}_0|\Psi\rangle$;

• Pair correlation energy: $\epsilon_{ij} = \sum_{ab} C^{ab}_{ij}\,\langle\Phi_0|\hat{H}|\Psi^{ab}_{ij}\rangle$;

• Orbital gradient: $k_{pq} = \langle\tilde{\Psi}|[\hat{H}, E_{pq} - E_{qp}]|\Psi\rangle$;

• Derivatives with respect to amplitudes, integrals or equations containing integral derivatives;

• Methods with 2 sets of amplitudes, such as Equation-of-Motion CC or the CC lambda equations.

What is interesting to note here is that the CC equations are evaluated in the unlinked formalism, which, as explained in Section 3.2, is not the most ideal approach for computer implementations. The developers' reasoning is that they prefer the generality of the commutator approach over using the BCH expansion or diagrammatic techniques. However, the commutator approach results in the generation of disconnected terms, which do not contribute. These are therefore removed, so ultimately the output from the gen+process module is completely equivalent to the linked formalism.

The next step is to factorise more terms to reduce the computational cost. The first part is the canonicalizer, which reduces the number of terms by exploiting tensor symmetry to combine terms (see Fig. 8 for an example). Krupička does not elaborate on the algorithms behind this canonicalizer, so no comments can be made about it. Further simplifications can be made by introducing intermediates, of which AGE defines two types: the first resembles the $\tau$-amplitude from CCSD, $\tau^{ab}_{ij} = t^{ab}_{ij} + t^a_i t^b_j$, and the second is to eliminate common factors, as shown below.

$y = aw_1 + aw_2 + aw_3 = a(w_1 + w_2 + w_3); \quad u = w_1 + w_2 + w_3; \quad y = au$ (4.6)

With these definitions for intermediates, strength reduction can be applied, followed by finding summation intermediates and combining similar contractions (again, an example is shown in Fig. 8). To determine a near-optimal solution for the factorisation, a cost model is used to assess the storage and computational demands of a set of contractions. Just as with the TCE, multiple solutions are analysed, and the main factor that determines the outcome is the storage requirement. An estimate of the computational cost is obtained by evaluating the length of the contraction indices and is represented by the flop count.26 The factorisation is executed through a recursive algorithm in which contractions are checked for mathematical equivalence; if they are identical, they are replaced by a new intermediate. This is a term-wise optimisation, so it is not the best solution for the full set of equations; finding that is called the global optimisation problem, which Engels-Putzka and Hanrath have researched.31 The global optimisation strategy was deemed not worth the effort by their benchmarks and estimates, which showed that at most a 20% speed-up could be gained from it.26

Now that the equations are in an approximately operation-minimal form, indices of intermediates and contractions are rearranged to reduce the I/O cost and peak memory usage, which is taken care of by the shuffler.26 This is realised by comparing all possible index permutations, which are screened according to the cost model; the permutation that results in the most efficient processing ends up in the code. The memoryopt module is then able to group independent sets of contractions, which, as the name implies, leads to optimal usage of memory, since tensors can be discarded after their last usage or stored for later (re)use.

At this point the toolchain has performed all its possible optimisations and actual C++ code can be generated. Each contraction is encapsulated in its own function, in the order that was previously determined, to allow for easy manual tuning.26 Within the contraction functions no more algebraic optimisations are performed, but the code generator does determine the most favourable order of the loops over the summation indices. Furthermore, no loop fusion is implemented as of yet, which could decrease the I/O penalty; Krupička acknowledges that data access is the slowest operation, so loop fusion should be a prime target for new features. Elaborating on data access, handling rank-4 tensors is typically the most difficult problem, so ORCA stores these and other rank-4 intermediates as a list of matrices, where only 2 indices can be used for the matrix multiplication. The final problem is how to execute the tensor contraction, which preferably is done as a matrix multiplication, since highly-optimised BLAS routines can then be used. For this, a pattern matching algorithm is applied to the contractions, which chooses between BLAS functions and ORCA's in-house optimised routines.
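The idea of mapping a tensor contraction onto a single matrix multiplication can be sketched generically: grouping the external and summation indices of a rank-4 contraction turns it into one GEMM call. This is an illustration of the principle, not ORCA's pattern matcher:

```python
import numpy as np

# Sketch: the contraction C[p,q,r,s] = sum_{t,u} A[p,q,t,u] B[t,u,r,s]
# recast as one matrix multiplication by grouping (p,q), (t,u) and (r,s).
rng = np.random.default_rng(2)
n = 3
A = rng.standard_normal((n, n, n, n))   # A[p,q,t,u]
B = rng.standard_normal((n, n, n, n))   # B[t,u,r,s]

# reference: explicit contraction over the summation indices t, u
ref = np.einsum('pqtu,turs->pqrs', A, B)

# GEMM form: (pq x tu) @ (tu x rs), reshaped back to rank 4
gemm = (A.reshape(n * n, n * n) @ B.reshape(n * n, n * n)).reshape(n, n, n, n)

assert np.allclose(ref, gemm)
```

The reshape is free for contiguous row-major storage, which is why a generator tries hard to match contractions onto this pattern before falling back to hand-rolled loops.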

Having gone through all the steps that make up ORCA-AGE, we can have a look at the performance. Multiple theories have been benchmarked and compared to the original hand-written code of ORCA, using naphthalene as a test system.


Figure 10. RHF CISD benchmark calculation on naphthalene with different basis sets, comparing the native and AGE implementations.26

Figure 11. RHF CCSD benchmark calculation on naphthalene (66 correlated electrons) with different basis sets, comparing the native and AGE implementations.26

As can be seen from Figures 10 and 11, the ORCA-AGE code performs worse than the hand-optimised one, but considering this is only an initial benchmark, with more optimisations coming in the near future, this result is by no means bad. These bar plots also show which contribution causes the slow-down, namely the contractions with 2 external labels.26 Through manual inspection the developers observed that the factorisation for these terms was non-optimal, which they deemed acceptable given the lack of a global optimisation strategy. Moreover, now that the problem has been located, manual modifications can be made, which is a major benefit of this generator. Overall, they observed that the ORCA-AGE codes were in general approximately 30% slower than the native code. However, there was one case, UHF CCSD, where the automatically generated code was 20% faster than the original code.

To sum up, ORCA-AGE is a modular toolchain that has been used to produce a plethora of electronic-structure methods with an already decent level of sophistication, but also with many more improvements possible, such as loop fusion and parallelisation. The hand-written code is in most cases only 30% faster.


5 Density Functional Theory

Moving on from wave function theories, we now discuss code generators applied to density functional theory (DFT), a highly successful method over the last half-century.57,58 This success likely arises from its preferable scaling of $O(n^3)$ when compared to wave function methods such as CCSD(T) ($O(n^7)$), where $n$ is the number of orbitals; this scaling also allows larger systems to be studied.

Looking at the theory, DFT is mostly used within the Kohn-Sham scheme, which makes use of a non-interacting electronic system to calculate the density of the interacting system.57,59 Using this scheme, the Hamiltonian can be partitioned into several components, where atomic unitsxi are used:

$\left[-\frac{\nabla^2}{2} + v_{ext}(\mathbf{r}) + v_{Hartree}[\rho(\mathbf{r})] + v_{xc}[\rho(\mathbf{r})]\right]\psi_i(\mathbf{r}) = \varepsilon_i\,\psi_i(\mathbf{r}).$ (5.1)

The first term on the left-hand side represents the kinetic energy of the electrons; the second term is the external potential, usually from a set of Coulombic point charges; the third term is called the Hartree potential, which incorporates the classical electrostatic repulsion between electrons,57

$v_{Hartree}[\rho(\mathbf{r})] = \int d\mathbf{r}'\, \frac{\rho(\mathbf{r}')}{|\mathbf{r} - \mathbf{r}'|}.$ (5.2)

The final term is the exchange-correlation (XC) potential

$v_{xc}[\rho(\mathbf{r})] = \frac{\delta E_{xc}[\rho]}{\delta\rho(\mathbf{r})},$ (5.3)

where $E_{xc}[\rho]$ is the XC energy functional and $\rho$ is the electronic density

$\rho(\mathbf{r}) = \sum_i^{occ} |\psi_i(\mathbf{r})|^2$ (5.4)

with the sum being over all occupied orbitals.

The main problem in DFT is the XC energy functional, which describes all non-trivial many-body effects and for which the exact form is not known;57 it is typically taken as an integral over space of a function of the density and its derivatives:12

$E_{xc}[\rho] = \int d\mathbf{r}\, f(\rho(\mathbf{r}), \nabla\rho(\mathbf{r}), \nabla^2\rho(\mathbf{r}))$ (5.5)

Over the years many functionals have been introduced, typically grouped in families: the local density approximation (LDA), which does not include any density derivatives; the generalised gradient approximation (GGA), which adds the density gradient; meta-GGAs, which include the Laplacian of the density as well; and hybrid functionals, which combine multiple functionals to arrive at a correct result, hence making them more empirical.57 For response properties it might sometimes be necessary to include higher-order derivatives,17,57 so there is a lot of flexibility in the formulation of the XC energy functional, after which the XC potential needs to be evaluated as well. The calculation of these should all be relatively straightforward using basic functional analysis and the chain rule for functional derivatives. Thus, the XC functional is the prime target for code generators.

xi $e^2 = \hbar = m_e = 1$


6 Code Generators for Density Functional Theory

The groundwork for code generators in DFT was likely laid by Jemmer and Knowles in 1997, who developed a symbolic algebra program in Mathematica to evaluate functional derivatives of the XC functional.9,12 They demonstrated that they could obtain the functional derivative of a general function by using the calculus of variations:60,61

$\frac{\delta E_{xc}[\rho]}{\delta\rho} = \frac{\partial f}{\partial\rho} - \nabla\cdot\left(\frac{\partial f}{\partial\nabla\rho}\right) + \nabla^2\left(\frac{\partial f}{\partial\nabla^2\rho}\right)$ (6.1)

This expression should be rather straightforward to evaluate for simple definitions of $E_{xc}[\rho(\mathbf{r})]$, such as9

$E_{xc}[\rho] = \int d\mathbf{r}\, \frac{\nabla^2\rho}{\rho^{1/3}},$ (6.2)

of which the functional derivative is

$\frac{\delta E_{xc}[\rho]}{\delta\rho} = \frac{4|\nabla\rho|^2}{9\rho^{7/3}} - \frac{2\nabla^2\rho}{3\rho^{4/3}}.$ (6.3)
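The route from Eq. 6.2 to Eq. 6.3 via Eq. 6.1 can be reproduced with any modern symbolic package; the following is a one-dimensional sketch in sympy (standing in for the original Mathematica program), where the gradient becomes $d/dx$ and $|\nabla\rho|^2$ becomes $\rho'^2$:

```python
import sympy as sp

# 1D check of Eq. 6.1 applied to the example f = rho''/rho^(1/3) of Eq. 6.2.
x = sp.symbols('x')
rho = sp.Function('rho', positive=True)(x)
p, p1, p2 = sp.symbols('p p1 p2')          # stand-ins for rho, rho', rho''

f = p2 / p**sp.Rational(1, 3)

subs = {p: rho, p1: sp.diff(rho, x), p2: sp.diff(rho, x, 2)}
dE = (sp.diff(f, p).subs(subs)             # df/drho
      - sp.diff(sp.diff(f, p1).subs(subs), x)       # -d/dx (df/drho')
      + sp.diff(sp.diff(f, p2).subs(subs), x, 2))   # +d2/dx2 (df/drho'')

# Eq. 6.3 in 1D: 4 rho'^2 / (9 rho^{7/3}) - 2 rho'' / (3 rho^{4/3})
expected = (sp.Rational(4, 9) * sp.diff(rho, x)**2 / rho**sp.Rational(7, 3)
            - sp.Rational(2, 3) * sp.diff(rho, x, 2) / rho**sp.Rational(4, 3))
assert sp.simplify(dE - expected) == 0
```

This is exactly the kind of mechanical-but-error-prone derivation that the symbolic algebra programs discussed in this section were built to automate.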

However, increasingly complex functionals make manual derivations rather tedious, hence the creation of this symbolic algebra program. Besides the functional as depicted in Eq. 5.5, another formulation is possible in which the functional integrand depends on the density and two dimensionless parameters: $x(\rho, \nabla\rho)$, the dimensionless scalar density gradient; and $y(\rho, \nabla\rho, \nabla^2\rho)$, the dimensionless Laplacian of the density.9 They then formulate the XC functional as

$E_{xc}[\rho] = \int d\mathbf{r}\, \rho^{4/3}\, g(x, y).$ (6.4)

In their article they showed that they could obtain the functional derivative of this expression as well.9

This Mathematica implementation, however, did not contain the functionality to produce code that could be embedded into other Kohn-Sham programs. It was merely designed for obtaining explicit expressions of the functional derivatives.

6.1 dfauto

The first real code generator applied to DFT was reported by Strange, Manby and Knowles in 2001. They formulate the general XC functional as11

$E_{xc}[\rho] = \int d\mathbf{r}\, K(\rho, \sigma, \upsilon, \tau),$ (6.5)

where $\sigma = \nabla\rho$, $\upsilon = \nabla^2\rho$, and $\tau$ is the kinetic energy density, defined as

$\tau = \sum_{i\in occ} |\nabla\psi_i|^2.$ (6.6)


Eq. 6.5 differs from Eq. 5.5 only in the kinetic energy density dependency. To evaluate the XC potential, the authors state that the functional derivative can and should be avoided, and that the XC contribution, $f^{xc}$, should instead be obtained from the derivatives with respect to the AO density matrix elements $\gamma_{ab}$:11

$f^{xc}_{ab} = \frac{\partial E_{xc}[\rho,\tau]}{\partial\gamma_{ab}} = \int d\mathbf{r}\, \frac{\partial K}{\partial\gamma_{ab}}$ (6.7)

where $a$ and $b$ are AO indices. The AO density matrix comes from the fact that the orbitals $\psi_i$ are expanded in an AO basis.11 Using the chain rule in Eq. 6.7, they arrive at

$f^{xc}_{ab} = \sum_{\xi\in\{\rho,\sigma,\upsilon,\tau\}} \int d\mathbf{r}\, \frac{\partial\xi}{\partial\gamma_{ab}}\, v_\xi; \qquad v_\xi = \frac{\partial K}{\partial\xi},$ (6.8)

where the $\partial\xi/\partial\gamma_{ab}$ are easily evaluated in terms of AOs and their gradients.11 The integration of Eq. 6.8 is performed on a numerical quadrature grid by evaluating the orbitals, their gradients, $K$ and $v_\xi$ at each grid point.

To treat a general N-electron system, the spin densities $\rho_\alpha$ and $\rho_\beta$ have to be treated independently.11 $K$ is partitioned into a spin-summed term (the exchange component) and a coupling term between the spins, which is present in correlation functionals:

$K = \sum_{s\in\{\alpha,\beta\}} g(\rho_s, \rho_{\bar s}, \sigma_{ss}, \sigma_{s\bar s}, \sigma_{\bar s\bar s}, \upsilon_s, \upsilon_{\bar s}, \tau_s, \tau_{\bar s}) + f(\rho_\alpha, \rho_\beta, \sigma_{\alpha\alpha}, \sigma_{\alpha\beta}, \sigma_{\beta\beta}, \upsilon_\alpha, \upsilon_\beta, \tau_\alpha, \tau_\beta),$ (6.9)

where¯s is the opposite spin of s. To generate the Fortran code of a functional, the user has to provide the definition off and g as a Maplexiiiexpression.11Using these spin-indexed quantities is not optimal,

so the program converts theα, β form to one which makes use of the total density, ρ ≡ ρc = ρα+ ρβ,

and the spin density,ρo = ρα− ρβ, where the subscriptc stands for closed-shell, and o for open-shell.

Similar to the density, the other quantities can be converted to their total- and open-shell analogues. Naturally, for computational efficiency, a switch is built in to prevent the calculation of unnecessary components, such as open-shell quantities in a closed-shell calculation.
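The variable change itself is elementary. The sketch below (function names hypothetical, and taking each $\sigma_{ss'}$ as the dot product $\nabla\rho_s \cdot \nabla\rho_{s'}$) shows the conversion from spin-indexed to closed/open-shell quantities and the inverse map for the densities:

```python
def to_closed_open(rho_a, rho_b, sig_aa, sig_ab, sig_bb):
    """Convert spin-indexed quantities to total ('closed', c) and
    spin ('open', o) combinations."""
    rho_c = rho_a + rho_b
    rho_o = rho_a - rho_b
    # From grad(rho_c) = grad(rho_a) + grad(rho_b) and
    #      grad(rho_o) = grad(rho_a) - grad(rho_b),
    # with sigma_{ss'} = grad(rho_s) . grad(rho_s'):
    sig_cc = sig_aa + 2.0 * sig_ab + sig_bb
    sig_co = sig_aa - sig_bb
    sig_oo = sig_aa - 2.0 * sig_ab + sig_bb
    return rho_c, rho_o, sig_cc, sig_co, sig_oo

def to_alpha_beta(rho_c, rho_o):
    """Inverse map for the densities alone."""
    return 0.5 * (rho_c + rho_o), 0.5 * (rho_c - rho_o)
```

For a closed-shell system ($\rho_\alpha = \rho_\beta$) all open-shell quantities vanish, which is exactly what the built-in switch exploits.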

When a functional is generated, its value is tested on a 90-point grid for a hydrogen atom ($\rho_\alpha = e^{-2r}/\pi$). Should a singularity occur when $\rho_s \to 0$, which can happen in the dimensionless spin density gradients for instance, the user is prompted to insert the limit when one of the densities goes to 0.
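The hydrogen-atom test density is convenient because its properties are known exactly; for instance, it integrates to one electron, which makes it easy to validate a radial grid before trusting functional values on it. A minimal sketch (the grid parameters here are illustrative, not dfauto's actual 90-point grid):

```python
import math

def rho_hydrogen(r):
    # 1s density of the hydrogen atom: rho(r) = exp(-2r) / pi (atomic units)
    return math.exp(-2.0 * r) / math.pi

# Crude 90-point radial midpoint grid on [0, 9] bohr (illustrative parameters).
n, h = 90, 0.1
points = [(h * (i + 0.5), h) for i in range(n)]

# Electron count: integral of rho over all space = int 4*pi*r^2 * rho(r) dr = 1
n_elec = sum(w * 4.0 * math.pi * r * r * rho_hydrogen(r) for r, w in points)
```

A generated functional evaluated on such a grid can then be compared against reference values, and any $\rho_s \to 0$ singularity shows up at the grid points far from the nucleus.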

6.2 Sałek and Hesselman

The code generator from Sałek and Hesselman was reported in 2007 and was meant to be a modular, unified DFT framework in which the functional derivatives can easily be verified for correctness.62 It differs from dfauto in several aspects: this generator is capable of producing higher-order derivatives, up to fourth order; it does not require Maple, an external commercial software package; and finally, it simplifies the generated expressions.

The methodology, however, is quite the same. Maxima,63 a computer algebra system, is employed to convert the XC energy kernel, $K$ (Eq. 6.9), and all of its derivatives from their algebraic form into one suited for numerical calculations, given the definition of $K$ as input.62 Maxima also contains routines to


optimise the expression by common-factor elimination. The equations are then processed and converted into C code. This program already has multiple predefined functionals implemented, so the user is also able to create their own functionals by mixing existing ones.

While it appears to be rather easy to obtain functionals this way, there is a downside, namely that the generated expression might contain singularities.62In other words, it might be numerically unstable. This is why generated code should only be used after rigorous testing. Their program also provides routines that check the energy and the consistency of the derivatives through the finite difference method.
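Such a finite-difference consistency check is simple to state: compare the generated analytic derivative against a central difference of the kernel itself. A minimal sketch (function names hypothetical), using the Dirac exchange kernel $K = -C_x\rho^{4/3}$ as a test case:

```python
import math

def check_derivative(K, dK, rho, h=1e-5, tol=1e-6):
    """Finite-difference consistency check, in the spirit of the generator's
    built-in routines: compare the analytic dK/drho against a central difference."""
    fd = (K(rho + h) - K(rho - h)) / (2.0 * h)
    return abs(fd - dK(rho)) < tol

# Dirac exchange as the test kernel and its analytic derivative.
CX = 0.75 * (3.0 / math.pi) ** (1.0 / 3.0)
K  = lambda rho: -CX * rho ** (4.0 / 3.0)
dK = lambda rho: -(4.0 / 3.0) * CX * rho ** (1.0 / 3.0)
```

Note that such a check probes a single point; the numerical instabilities mentioned above typically appear only in limiting regimes (e.g. $\rho \to 0$), so the test points must be chosen to cover those as well.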


7 Discussion

It turns out that code generators have already been successfully applied to multiple second quantization methods, such as various types of Coupled-Cluster, Configuration Interaction and perturbation theories. Three of those generators have been examined in detail to see what they consist of. The first component is the derivation of working equations, which can be done either through an algebraic approach or a diagrammatic one. The algebraic approach is used more frequently, both via Wick's theorem and via the commutator approach. This is most likely because the diagrammatic approach involves many rules for construction and interpretation, whereas the algebraic approach is plain mathematics, which presumably makes it much easier to implement. Surely, the algebraic approach should also be easier to verify, given that there are intermediate stages in the derivation, which the diagrammatic approach lacks. However, Shiozaki states38 that the diagrammatic approach produces the equations faster, as it only creates topologically different terms; this would be essential for more complex methods, such as CC-R12. While this might be true, it would not necessarily mean that diagrams should be used in favour of the algebraic approach. No timings were provided for the derivation procedures, making it impossible to draw a definitive conclusion. It will depend on the user to decide whether the extra time spent in derivation is worth it. It is also important to keep in mind that the derivation of the raw equations has zero effect on the performance of the generated code. Moreover, when higher-order excitation levels are included, it quickly becomes too unwieldy to evaluate all topologically non-equivalent diagrams.64,65

Next is the transformation step, in which the raw working equations are cast into a computationally efficient set of contractions and which consists of strength reduction, factorisation and intermediate reuse.
This is where the developer has to decide on the target architecture, since the optimisation procedure will be influenced by this choice. The TCE was designed for massively parallel computer systems, so memory issues are most likely less of a problem, and they can be reduced further by using permutational symmetry. This means that more emphasis can be put on minimising the flop count, which in turn means a lower degree of loop fusion. The fact that the TCE was meant for those high-performance architectures also means that a large part of the scientific community will not be able to make use of it, since not everyone has access to such machines. With this in mind, ORCA-AGE was specifically designed for the common compute infrastructure on which it is mostly employed.26 Besides choosing a target architecture, it is

also important for developers to choose which optimisation strategy they want to pursue: a local- or a global-minimum operation cost. Obviously, a global-minimum strategy is preferred, but it is more difficult to implement. In the TCE, a direct-descent algorithm analyses different configurations and locates local minima. Several trials are run and the minimum over all local minima is taken; however, this does not guarantee finding the global minimum. ORCA-AGE opted for a local-minimum search, which by their estimate would result in at most a 30% performance reduction, deemed not worth the trouble. This might be true for smaller calculations, but 30% quickly adds up: compare a 7-day calculation to a 10-day one. Most scientists would find this significant, so the speed-up should be well worth the effort.
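The stakes of factorisation can be illustrated with a toy cost model for a chained contraction; all dimensions below are hypothetical. For matrices $A$ ($m \times n$), $B$ ($n \times p$) and $C$ ($p \times q$), the two parenthesisations $(AB)C$ and $A(BC)$ give the same result but very different flop counts:

```python
def matmul_cost(m, n, p):
    """Multiply-add count for contracting an (m x n) with an (n x p) matrix."""
    return m * n * p

# Chained contraction A(m x n) . B(n x p) . C(p x q), two evaluation orders.
m, n, p, q = 100, 1000, 10, 1000
left_first  = matmul_cost(m, n, p) + matmul_cost(m, p, q)  # (A B) C
right_first = matmul_cost(n, p, q) + matmul_cost(m, n, q)  # A (B C)
```

Here the left-first order needs 2 million operations against 110 million for the right-first order, a factor of 55. Real tensor contractions in CC working equations pose the same ordering problem in many more dimensions, which is exactly what the local- and global-minimum searches explore.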

The actual code generation is the final step, in which no further algebraic optimisation should be performed. The only optimisations that should occur are, for instance, changing the order of contractions to reduce memory cost. Additionally, one should implement loop tiling, since this can be used to create
