
Automatic Differentiation and Tensor Networks

Finite periodic PEPS

Thesis submitted for the Master’s degree in

Theoretical Physics

by

Stijn Kleijweg

10995137

August 2020

Supervisor: Dr. Philippe Corboz

Reader: Dr. Edan Lerner


Abstract

The field of tensor networks comprises a set of very successful techniques for numerically analyzing strongly correlated condensed matter systems. A tensor network is an ansatz for the wave function whose accuracy and polynomial scaling in the system size are guaranteed by the area law of entanglement entropy scaling for local Hamiltonians. Although well established in one dimension, the generalization to higher dimensions is not straightforward. The technique of automatic differentiation, recently applied to tensor networks for the first time, is a useful tool to make the implementation of tensor network algorithms easier, as well as to increase their accuracy. The use of automatic differentiation with tensor networks is investigated and newly applied to two-dimensional finite periodic systems. A new contraction technique, based on the Corner Transfer Matrix Renormalization Group (successful for infinite systems), is explored in the context of finite periodic systems. Given a tensor network ansatz, the tensor elements need to be optimized to find the ground state. Using the exact gradient delivered by automatic differentiation directly from the algorithm, the tensor network is variationally optimized. Benchmark results are obtained for the spin 1/2 Heisenberg model on periodic square lattices of linear sizes L = 4, 6, . . . , 16. While contracting periodic tensor networks is computationally more expensive than networks with open boundary conditions, the results obtained at smaller bond dimensions compare well to the Quantum Monte Carlo values. This agreement lends support to the further application of automatic differentiation to different tensor network algorithms, as well as to the analysis of different Hamiltonians on finite periodic two-dimensional lattices using the presented scheme.


Samenvatting (Summary)

The field of tensor networks consists of a set of successful techniques for numerically analyzing strongly correlated condensed matter systems. A tensor network is an ansatz for the wave function, whose accuracy and polynomial scaling with the system size are guaranteed by the area law for the entanglement entropy of local Hamiltonians. The tensor network algorithm is the established method in one dimension, but it does not generalize straightforwardly to higher dimensions. The technique of automatic differentiation has recently been applied to tensor networks for the first time and has proven to be an effective way to simplify the writing of tensor network algorithms and to increase their accuracy. The use of automatic differentiation in combination with tensor networks is investigated and applied for the first time to two-dimensional finite periodic systems. A new technique for contracting the tensor network, based on the Corner Transfer Matrix Renormalization Group (successful for infinite systems), is explored in the context of finite periodic systems. The energy of the tensor network is subject to a variational principle: to find the ground state, the tensor elements in the tensor network ansatz must be optimized. Automatic differentiation is used to obtain an exact gradient directly from the algorithm. Benchmark results are obtained for the spin 1/2 Heisenberg model on square periodic lattices of linear sizes L = 4, 6, . . . , 16. Although contracting periodic tensor networks requires more computation time than for systems with open boundary conditions, the results for smaller bond dimensions deviate little from the Quantum Monte Carlo values. This supports the further development of the use of automatic differentiation with different tensor network algorithms. In addition, the combination of the contraction and optimization methods investigated here can be used to analyze different Hamiltonians on finite periodic two-dimensional lattices.


Acknowledgments

I want to thank my supervisor Philippe Corboz for many questions answered and his guidance along the way of this project. I also want to thank Boris Ponsioen and Schelto Crone for always being willing to answer questions or have a discussion. Finally, I would like to give credit to Schelto for the idea of adding a bias to the energy when the bra-ket symmetry is broken.


Contents

1 Introduction
2 Theory
   2.1 Diagrammatic notation
   2.2 Tensor networks theory
   2.3 Contracting tensor networks
   2.4 Operators and optimization
   2.5 Automatic differentiation
   2.6 U(1) symmetric PEPS
3 Methods
   3.1 Implementation
   3.2 Exact contraction
   3.3 Approximate contraction
   3.4 Methodological findings
4 Results
   4.1 Exact contraction
   4.2 Approximate contraction
5 Discussion
   5.1 Exact contraction
   5.2 Approximate contraction
Conclusion
Bibliography


Chapter 1

Introduction

The quantum many-body problem is enormously rich and lies at the heart of many open problems, not only in condensed matter but also in high energy physics. The goal of understanding how macroscopic phases arise from microscopic particles and their interactions, where the macroscopic order often has new characteristics not present in the underlying microscopic physics (also called emergence), becomes especially difficult when faced with strong interactions. Many systems of theoretical, experimental and even technological interest have such interactions, with perhaps the most famous being the Hubbard Hamiltonian. Even as the most basic model of strongly interacting electrons on a lattice, it is relevant to the study of high temperature superconductivity, thus touching on all three of these aspects. Strong interactions prohibit the use of perturbation theory, making numerics an appealing avenue toward progress.

A successful method across many fields is Monte Carlo simulation, which relies on a strategic sampling of the parameter space to obtain very accurate results. While Quantum Monte Carlo is successful for certain spin systems, for fermionic and frustrated spin systems it suffers from the 'negative sign problem': essentially, some matrix elements based on the Hamiltonian that need to be interpreted as probabilities are negative. This motivates the alternative method of tensor networks, which have the additional benefit of a theoretical guarantee of sub-exponential (usually polynomial) scaling of the computational cost with the system size when targeting low energy eigenstates. This theoretical guarantee is based on the area law of entanglement entropy [1,2], which states that ground states of local Hamiltonians are much less entangled than most states in the many-body Hilbert space. Tensor networks obey this area law by construction. While the generalization to higher spatial dimensions is not straightforward, tensor network methods have been increasingly successful, leading to new insights on e.g. the two-dimensional Hubbard model [3].

Tensor network algorithms have expanded to cover many aspects of interest of condensed matter systems, such as excited states, finite temperature calculations, and topological states. Here the focus lies on ground state calculations. A tensor network (TN), then, is an ansatz¹ for the wave function. Given a TN ansatz with some geometry (here a periodic square lattice), the initial tensor elements are usually chosen randomly, or otherwise obtained using some other TN algorithm. The elements then need to be optimized to find the ground state.

¹ An ansatz is an estimated solution to a problem, of which the structure is fixed but the exact parameters still need to be determined; it can be seen as a trial solution. Ansatz is a German word that roughly translates as 'approach'.


A few different methods exist; here the TN is optimized directly by minimizing its energy expectation value. This requires a gradient, which is obtained by automatic differentiation [4].

The method of automatic differentiation is widely used in machine learning to optimize ("train") the model of a data set. Recently, it was applied to TNs for the first time [5]. Automatic differentiation delivers an exact gradient directly from the algorithm, by tracking the sequence of operations used from the input to the output and multiplying the predefined derivatives of each step together according to the chain rule. The gradient thus obtained can then be used to optimize the TN such that it yields the lowest energy. This optimization is variational, since the TN represents a state in the Hilbert space, and if this state is not the ground state, then its energy will be higher than the ground state energy (this argument holds so long as the bra-ket symmetry $\langle\Psi| = (|\Psi\rangle)^\dagger$ is not broken in the algorithm). The power of using automatic differentiation with tensor networks lies in the possibility of obtaining an exact derivative through the entire tensor network contraction, as well as in the relative ease of implementing the algorithms compared to previous variational methods, where the gradient was derived manually [6,7]. The recent benchmark of an infinite two-dimensional tensor network variationally optimized using a gradient from automatic differentiation resulted in the best energies yet [5].

The use of a fully-fledged two-dimensional tensor network requires some motivation, especially for finite systems. For smaller open two-dimensional systems the Density Matrix Renormalization Group (DMRG) can be used. The DMRG, the successful and established method for one-dimensional systems, was originally developed by White [8,9] as a generalization of earlier numerical Renormalization Group approaches [10]. For smaller two-dimensional lattices with open boundary conditions the one-dimensional TN (a Matrix Product State) might cover the lattice as a snake, or by wrapping around a cylinder, and these methods lead to good results. However, the loops introduced into the TN by periodic bonds prohibit the use of the DMRG. Instead, a proper two-dimensional TN is employed, which goes by the name of Projected Entangled-Pair State (PEPS) [2,11–13]. These are not restricted to small lattices, unlike the DMRG. However, the contraction of two-dimensional TNs is computationally so expensive that an approximate contraction scheme is required for all but the smallest lattices.

In this work a new approach is developed, based on the Corner Transfer Matrix Renormalization Group (CTMRG), originally created by Nishino and Okunishi [14,15] as an extension of the DMRG to two-dimensional classical models using corner transfer matrices. It was later generalized to infinite quantum systems [16], and is very successful for infinite (open) systems, especially when combined with variational optimization (rather than imaginary time evolution) [5,6]. In the finite case, the translation invariance afforded by the periodic boundary conditions simplifies the optimization, since then a small unit cell or even a single tensor can be used to cover the lattice, rather than a distinct tensor per site.

Periodic boundary conditions are used to eliminate boundary effects in finite systems, but little research is available on the accuracy and the best contraction approach for finite periodic TNs. Thus, the first goal of this work is to develop and benchmark the new approach described in the previous paragraph. In addition, a second, somewhat separate goal is to further investigate the use of automatic differentiation with TN algorithms, and approximate


contractions in particular, motivated by its success in Ref. [5]. The new addition of automatic differentiation facilitates variational optimization of the TN, which has been shown to be more accurate than the alternative of imaginary time evolution [6,7]. These two goals combine to form a third: to develop a complete TN algorithm for finite periodic systems, which can then be used for any Hamiltonian to obtain complementary data to that for infinite systems, for which more established methods exist already [6,7].

Tensor network algorithms are founded on a diagrammatic or graphical notation, which will be introduced in the Theory in Chapter 2. Then the connection to entanglement is explained and demonstrated to result in a theoretical guarantee of the efficacy of TNs. After an introduction to the approximate contraction of TNs and their optimization, the method of automatic differentiation is described, including some context and the foundational theory. The Theory Chapter is concluded with an outline of the use of U(1) symmetry with PEPSs. These theoretical ideas lead to the implementation of the TN algorithm, which is described in Chapter 3 on Methods. Using the spin 1/2 Heisenberg model as a benchmark, finite periodic PEPSs of different linear sizes L = 4, 6, . . . , 16 are contracted using the new CTMRG-inspired method, except for the smallest size, which is contracted exactly. The TNs are then optimized variationally with a gradient obtained by automatic differentiation. The Methods Chapter is concluded with some methodological findings, on both the TN and automatic differentiation sides. The results are shown and analyzed in Chapter 4, followed by a more in-depth discussion in Chapter 5. Finally, a summary, conclusion, and outlook are given in the final Chapter.


Chapter 2

Theory

The motivation of tensor networks (TNs) and a demonstration of the intrinsic area law require an introduction to the graphical notation used in the field, which forms the first section of this Chapter. Then the theoretical foundations of TNs are introduced in Section 2.2. A TN algorithm to find the ground state of some quantum many-body system broadly consists of two parts: a choice of the geometry of the TN, including a contraction approach if it cannot be contracted exactly, and an optimization scheme. These form the next two Sections (2.3 and 2.4) of this Chapter, where the section on contracting TNs contains some methodological aspects such as the use of matrix decompositions. Then the variational optimization of a TN is introduced, which requires the evaluation of operators from the TN state. The penultimate Section 2.5 on the theory and practice of automatic differentiation is followed by a brief introduction to the implementation of global U(1) symmetry at the level of the PEPS.

2.1 Diagrammatic notation

The field of tensor networks is based on a diagrammatic or graphical notation. A tensor is represented by a shape, with the number of legs equal to the rank of the tensor, or equivalently the number of indices in the usual tensor notation. So a vector $v_i$ will be drawn as a shape with a single leg, a matrix $M_{ab}$ as a shape with two legs, etcetera; see Fig. 2.1 below. Different shapes can be used to indicate the role or type of a tensor, such as using a triangle to denote an isometry, a special matrix that will be discussed in more detail in Section 2.3, or a larger tensor to denote an environment tensor in a TN algorithm. Shapes can also be used to keep track of the indices without labeling them, like in the case of a triangular isometry: it is easily differentiated from its transpose.

[Figure 2.1: Diagrammatic notation of a vector $v_i$ (a shape with one leg) and a matrix $M_{ab}$ (a shape with two legs).]

Summed indices and traces are easily represented using the TN notation; consider the matrix-vector product $u_i = M_{ij} v_j$ and the trace $M_{ii}$ in Fig. 2.2.

Figure 2.2: Diagrammatic notation of a matrix-vector product and a trace.

A more complicated example like

$$c = \sum_{i,j,k,l} A_{ijk} B_{il} C_{kl} D_j \tag{2.1}$$

is easily visualized as a TN, revealing at a glance that it is a contraction between a rank-3 tensor, two matrices, and a vector, where the remaining indices of the matrices are also contracted. While the order in which the indices i, j, k, l are contracted does not matter for the value of the resulting scalar, in algorithms it does matter, because different orders involve a different total number of scalar multiplications and thus a different computational cost. More details on this follow in Section 2.3.
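To make the cost bookkeeping concrete, here is a minimal NumPy sketch of the contraction in Eq. (2.1); the leg dimensions are arbitrary placeholders, and the explicit pairwise order shown is just one of several possibilities.

```python
import numpy as np

# Small tensors realizing Eq. (2.1): c = sum_{ijkl} A_ijk B_il C_kl D_j.
# The dimensions are arbitrary placeholders.
di, dj, dk, dl = 3, 4, 5, 6
A = np.random.rand(di, dj, dk)
B = np.random.rand(di, dl)
C = np.random.rand(dk, dl)
D = np.random.rand(dj)

# Full contraction in one call; einsum picks some pairwise order internally.
c = np.einsum('ijk,il,kl,j->', A, B, C, D)

# The same value via an explicit pairwise order: B with C over l, then A, then D.
BC = np.einsum('il,kl->ik', B, C)      # ~ di*dk*dl scalar multiplications
ABC = np.einsum('ijk,ik->j', A, BC)    # ~ di*dj*dk scalar multiplications
c2 = ABC @ D                           # ~ dj scalar multiplications
assert np.allclose(c, c2)
```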

2.2 Tensor networks theory

A general many-body wave function may be written as a sum over tensor products of states in the local Hilbert space. Consider a chain of spins 1/2, which have a local Hilbert space dimension of two: spin up or down. The many-body wave function of a chain of N spins is completely specified by

$$|\Psi\rangle = \sum_{i_1 i_2 \ldots i_N} C_{i_1 i_2 \ldots i_N} \, |i_1\rangle \otimes |i_2\rangle \otimes \ldots \otimes |i_N\rangle , \tag{2.2}$$

where each index $i_j$ runs over the local Hilbert space of site j. The coefficients of each basis state in the sum can be neatly organized in the rank-N tensor $C_{i_1 i_2 \ldots i_N}$, which contains $\dim(i_j)^N = 2^N$ elements, with $\dim(i_j)$ the number of values the index $i_j$ can take, in this case equal to the local Hilbert space dimension.

[Figure: the contraction of Eq. (2.1) drawn as a tensor network: the rank-3 tensor A, the matrices B and C, and the vector D, contracted over the indices i, j, k, l to give the scalar c.]

The infamous exponential growth of the many-body Hilbert space with the system size is visible here in the number of coefficients needed to


specify a many-body state (which is equal to the total number of states, i.e. the dimension of the Hilbert space).

Given a choice of basis, the tensor $C_{i_1 i_2 \ldots i_N}$ completely specifies the many-body wave function. Consider the example of a two-site anti-ferromagnetic Heisenberg chain. Its ground state is given by the singlet state. Abbreviating the tensor product state as $|\!\uparrow\rangle_1 \otimes |\!\downarrow\rangle_2 := |\!\uparrow\downarrow\rangle$, the singlet is given by $(|\!\uparrow\downarrow\rangle - |\!\downarrow\uparrow\rangle)/\sqrt{2}$. In this case the sum in Eq. 2.2 has two non-zero terms, since the basis states $|\!\uparrow\uparrow\rangle$ and $|\!\downarrow\downarrow\rangle$ do not contribute. The tensor is a matrix $C_{i_1 i_2}$ with entries

$$C = \begin{pmatrix} 0 & 1/\sqrt{2} \\ -1/\sqrt{2} & 0 \end{pmatrix},$$

where the rows correspond to $i_1 = \uparrow, \downarrow$ and the columns to $i_2 = \uparrow, \downarrow$. Thus, given a choice of basis, the tensor $C_{i_1 i_2}$ completely specifies the wave function. This generalizes trivially to higher-dimensional systems and various types of lattices, since the tensor is defined through the tensor product of the local Hilbert spaces at each site.
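As a small illustration of this bookkeeping, the singlet coefficients can be written out and the state reassembled explicitly; a minimal NumPy sketch, assuming the basis ordering (up, down) on both sites.

```python
import numpy as np

# The two-site singlet of Eq. (2.2) as an explicit coefficient tensor C_{i1 i2},
# with basis ordering (up, down) assumed for both indices.
C = np.array([[0.0,             1.0 / np.sqrt(2)],
              [-1.0 / np.sqrt(2), 0.0           ]])

# Reassemble |Psi> = sum_{i1 i2} C_{i1 i2} |i1> (x) |i2> and check normalization.
up, down = np.array([1.0, 0.0]), np.array([0.0, 1.0])
basis = [up, down]
psi = sum(C[i, j] * np.kron(basis[i], basis[j])
          for i in range(2) for j in range(2))
assert np.isclose(np.linalg.norm(psi), 1.0)
```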

Combining this tensor representation of the wave function with the graphical notation introduced in the previous section, the wave function can be denoted by the diagram in Fig. 2.4. The idea of TNs is to decompose the tensor $C_{i_1 i_2 \ldots i_N}$ into multiple tensors that are all contracted together, that is, a tensor network. Different choices of decomposition are possible: more interconnected TNs can be used to achieve a different entanglement scaling than the area law mentioned in the Introduction, as will be discussed in detail in the next Section. In Fig. 2.4 the common Matrix Product State (MPS) is shown. Its name comes from the fact that for each coefficient of the wave function in Eq. 2.2, obtained by fixing (choosing a value for) all the physical legs, what remains is a product of N − 2 matrices (and 2 vectors at the endpoints). The original legs, which correspond to sites in the system (labeled $i_j$), are referred to as physical legs; the new ones introduced in the decomposition into a TN are called auxiliary legs, or auxiliary bonds. Their dimension is usually referred to as the bond dimension and labeled D. The bond dimension D plays a key role in TN algorithms, as will be explained below.

The MPS plays an important role in the history of TNs: it was realized (some time after the initial development) to be the underlying ansatz of the DMRG. In addition to its role in algorithms, the MPS originated as an analytic tool used to study many-body systems, specifically spin chains. The first MPS was introduced by Affleck, Kennedy, Lieb and Tasaki in

Figure 2.4: The tensor containing the many-body wave function coefficients and its decomposition into a Matrix Product State. The newly introduced legs between the tensors on the right are called auxiliary legs or bonds, and their dimension D plays a key role in TN algorithms.


1988, when they exactly constructed the ground state of the AKLT Hamiltonian (a Heisenberg-like Hamiltonian), which was the first example in support of Haldane's conjecture that chains of integer spins have an energy gap between the ground state and the first excited state [11,17]. Analytic research using TNs still continues, for example in Ref. [18], where a quantum critical ground state is exactly constructed as a TN for the first time.

Area law of entanglement entropy

Wilson already realized that the locality of interactions in many physical Hamiltonians of interest has implications for the type of ground state allowed, and exploited this fact in his Numerical Renormalization Group (NRG) [2,10]. The NRG was generalized to any one-dimensional quantum system with the invention of the Density Matrix Renormalization Group by White in 1992 [8,9]. It has become the standard technique for finding ground states of strongly correlated Hamiltonians. The connection between the NRG and the DMRG, as well as techniques for time evolution, excited states, and finite-temperature calculations, were developed later, once it was realized that the DMRG is based on an underlying MPS ansatz [2].

The success of modern TN algorithms is based on the efficient representation of many-body ground states they provide: due to the locality of interactions these states are special, and can be targeted directly without worrying about the exponentially large Hilbert space. Following insights from quantum information theory, it could be made precise how these ground states are special: they are much less entangled than most states in the Hilbert space. The entanglement is commonly quantified by the entanglement entropy, which might be expected to scale extensively (with the system volume), and for most states in the Hilbert space this is true. But low-energy eigenstates of local Hamiltonians are much less entangled, and obey an area law of entanglement entropy [1]. Intuitively, local interactions lead to mostly local entanglement, across the surface between the subsystems of interest. While the area law is proven in one dimension for Hamiltonians with an energy gap between the ground state and first excited state, in higher dimensions a proof is lacking. However, it is observed numerically and there are supporting theoretical arguments as well [2].

The generalization of the MPS to two dimensions is called the Projected Entangled-Pair State (PEPS) [2,11–13], after the way it was initially constructed. A detailed look into the PEPS is instructive to see the area law built into TNs in action, and to compare it to earlier attempts at covering a two-dimensional lattice with an MPS, which were made in view of the success of the DMRG. This latter method is still used successfully, but is restricted to smaller two-dimensional systems with open boundary conditions.

A demonstration of the area law in the PEPS starts with the entanglement entropy, which is defined as

$$S = -\mathrm{Tr}\left(\hat{\rho}_A \log \hat{\rho}_A\right) , \tag{2.3}$$

where $\hat{\rho}_A$ is the reduced density matrix of part A of the system, obtained by tracing out the other part (say B) from the full density matrix (using $\hat{\rho}_B$ would result in the same value for S). Two different coverings of a 5 × 5 lattice with open boundary conditions are shown in Fig. 2.5. Focusing on the PEPS

Figure 2.5: A snake MPS (a) and a PEPS (b) for the same 5 × 5 two-dimensional system, with part of the system shaded in red. The snake MPS in (a) has on average a single bond per side of the boundary, whereas the PEPS in (b) has a number that scales with the size of the boundary, thus admitting an area law for the entanglement entropy.

on the right, the shaded area is part A of the system, for which the reduced density matrix is

$$\hat{\rho}_A = \sum_k p_k |\psi_k\rangle \langle\psi_k| , \tag{2.4}$$

where the $|\psi_k\rangle$ are states in the reduced Hilbert space, and the $p_k$ are the weights (probabilities) of each state in the sum. A product state will have only one non-zero $p_k$, of value 1, while a maximally entangled state has all of them non-zero with equal probability. Following the argument in Ref. [11], consider the number of terms in the sum over k. This is given by the total dimension of the tensor obtained by contracting the tensors in the shaded area for a given wave function coefficient, that is, with the physical indices fixed. Thus, for four sides of the shaded area of length L (L = 3 in the figure), the number of values k can take is $D^{4L}$. Using that

$$S = -\sum_k p_k \log p_k , \tag{2.5}$$

and that a maximally entangled state will thus have $p_k = \frac{1}{D^{4L}}\ \forall k$:

$$S \le -\sum_k \frac{1}{D^{4L}} \log \frac{1}{D^{4L}} \le -\left(D^{4L}\right) \frac{1}{D^{4L}} \left(\log 1 - \log D^{4L}\right) \le 4L \log D , \tag{2.6}$$

which is an upper bound version of the area law of entanglement entropy scaling, intrinsic to the PEPS.

Upon closer inspection, the PEPS's area law only holds for D > 1, since otherwise the entanglement is zero, demonstrating that a PEPS with bond dimension 1 represents a product state. For D > 1, however, the entanglement the TN can capture scales with the linear size, which is nothing less than the area in two dimensions. As D increases, the amount of entanglement captured increases, and in theory once D becomes exponentially large the


entire Hilbert space is accessible. In practice this is of course impossible on a classical computer, but D does provide a parameter which systematically improves the accuracy of the TN compared to the exact ground state. Thus, in TN algorithms, the computational scaling in D of the algorithm is important, since higher values of D lead to better results.

At this point it is useful to discuss one-dimensional TN coverings of two-dimensional systems, such as the snake MPS on the left in Fig. 2.5. It is clear that the number of bonds across the boundary between subsystems is constant in L for the snake MPS, so not enough entanglement is captured to represent area law states. A workaround is to allow the bond dimension to grow exponentially with the system size, which permits enough entanglement to "pass" through the four bonds across the boundary, and this is precisely the reason why the snake and cylinder coverings used in DMRG for two-dimensional systems are limited to small systems (with a maximum width of roughly 16). Another complication is that the DMRG is based on separating the MPS into two parts by cutting a single auxiliary bond, something that cannot be done if loops are present in the TN. Thus, it cannot be used for (even small) periodic two-dimensional systems.

The area law of entanglement scaling can be violated, for example by critical states, resulting in a logarithmic correction. The TNs used to treat such systems can be adjusted accordingly; as seen in the example above, the entanglement scaling depends on the geometry and interconnectivity of the TN. One example is the Multiscale Entanglement Renormalization Ansatz (MERA), which represents a d-dimensional system with a (d + 1)-dimensional TN with increased connectivity (compared to e.g. the MPS) and more than one tensor per physical site. Without going into details, the MERA allows a more efficient representation of states with a large amount of entanglement. An interesting side note is that the connection between geometry and entanglement made explicit by TNs is of broader interest, for example in gravity research, where the geometry corresponds to the curvature of space [11, see also their Ref. 51].

2.3 Contracting tensor networks

An important aspect of implementing tensor network algorithms is the manipulation of TNs: contracting them and performing matrix decompositions on tensors are not entirely trivial operations. The contraction of a TN comes with a computational cost directly related to the total number of scalar multiplications. This number scales with the index dimensions of the tensors involved in the contraction, for example in the matrix multiplication AB in Fig. 2.6: the total number of scalar multiplications involved is simply the product of

Figure 2.6: Two rectangular matrices multiplied together at a computational cost of dim i × dim j × dim k = 12. To gain some intuition for this fact, observe that the total number of scalar multiplications in the matrix product on the right is equal to 12.


the dimensions of all the legs involved in the contraction, meaning all open (uncontracted) and contracted legs attached to the tensors. It happens that different orders of contracting the tensors in a larger network come with different computational costs. Importantly, the cheapest order can always be constructed as a series of pairwise contractions [19]. Consider the example in Fig. 2.7, where all the legs have dimension p. The contraction orders A(BC) and C(AB) have leading cost $O(p^4)$ while B(AC) has leading cost $O(p^3)$.

Finding the cheapest order of pairwise multiplications can often be done by inspection, for simple enough TNs. For larger ones the Netcon function [19] can be employed to find it.

Figure 2.7: A small tensor network with all leg dimensions equal to p. The cheapest contraction order is B(AC) with leading cost $O(p^3)$, while the orders A(BC) and C(AB) have a leading cost that is a factor p more expensive.
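As an illustration of how the pairwise order changes the leading cost, the sketch below uses a hypothetical three-tensor network with the same cost pattern as described for Fig. 2.7: matrices A and C share a leg, and a rank-3 tensor B with one open leg connects to both. This connectivity is an assumption for the example, not necessarily the one drawn in the figure.

```python
import numpy as np

p = 20  # all leg dimensions equal, as in Fig. 2.7
A = np.random.rand(p, p)      # A_ij
C = np.random.rand(p, p)      # C_jk
B = np.random.rand(p, p, p)   # B_ikl, with one open leg l

# Cheap order B(AC): contract A with C first (~ p^3), then with B (~ p^3).
AC = A @ C                                  # M_ik
r1 = np.einsum('ik,ikl->l', AC, B)

# Expensive order A(BC): contracting B with C first involves four distinct legs
# (i, j, k, l), so the leading cost is ~ p^4.
BC = np.einsum('ikl,jk->ijl', B, C)
r2 = np.einsum('ij,ijl->l', A, BC)
assert np.allclose(r1, r2)

# np.einsum_path reports the pairwise order and cost NumPy itself would choose.
print(np.einsum_path('ij,ikl,jk->l', A, B, C, optimize='optimal')[1])
```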

Contracting a PEPS is computationally expensive: starting, say, in a corner of the PEPS and adding tensors one by one, the total number of legs on the large intermediate tensor keeps growing, meaning the computational cost of each next contraction keeps growing. Some sets of ordered pairwise contractions are cheaper than others, but even the best ones are too expensive for all but the smallest PEPS. In fact, the contraction of two PEPSs together (required for the evaluation of expectation values, as explained in the next Section) can be shown to be exponentially hard [11]. Fortunately, approximate methods have been developed, which are very accurate.

Approximate contraction schemes are based on truncated matrix decompositions, which reduce the dimension of the tensor legs while keeping the most important part. Usually the SVD is used for this purpose, although the eigendecomposition, which is intimately related to the SVD, can also be used. A contraction scheme then consists of a tensor contraction (or update) followed by a truncation using the SVD, which is iterated until the desired system size is reached, or until convergence for infinite systems. The main methods go by the names of Tensor Renormalization Group (TRG) [20], Corner Transfer Matrix Renormalization Group (CTMRG) [14–16], and a recent one called Core-Tensor Renormalization Group (CTRG) [21]. What all these methods have in common is the renormalizing action of using the SVD to truncate tensor dimensions, which can be interpreted as keeping only the most important degrees of freedom. A variation of the CTMRG used to contract finite periodic systems will be described in detail in the Methods section.

To understand how to truncate growing tensor dimensions back to acceptable values, two things are required: the reshaping of tensors, that is, the merging or splitting of tensor legs to change the rank, and the idea of using the Singular Value Decomposition (SVD) to truncate a matrix (or tensor) index dimension. First the reshaping: consider the example of placing the rows


of a matrix next to each other:

$$\begin{pmatrix} 51 & 4 & 8 \\ 73 & 6 & 3 \\ 32 & 9 & 8 \end{pmatrix} \;\; \text{``=''} \;\; \begin{pmatrix} 51 & 4 & 8 & 73 & 6 & 3 & 32 & 9 & 8 \end{pmatrix} . \tag{2.7}$$

Obviously a simple operation like adding a scalar will produce the same result whether it is done directly on the matrix, or on the vector after which the vector is "unreshaped" back to the matrix, so long, of course, as the reshaping happens the same way in both directions. The elements in the vector correspond to those in the matrix, except that stepping the first (row) index of the matrix translates to a shift by the row length in the vector. For example, in Python [22] index notation, for element 3: A[1, 2] → v[1 ∗ dim i + 2] = v[3 + 2] = v[5], with i being the first index (row) of the matrix. This idea can of course be generalized to tensors of higher rank, and is used to perform matrix decompositions on tensors; see Fig. 2.8 for the graphical equivalent. This reshaping is also important because it allows libraries that work with tensors to reshape them to matrices 'under the hood' and perform operations like tensor contractions using existing optimized routines for matrices.

Figure 2.8: The graphical notation corresponding to reshaping a tensor into a matrix; here the three legs on the right are merged.

Then the use of the SVD to truncate a tensor leg. The SVD is a generalization of the eigendecomposition to rectangular matrices. For an m × n matrix it reads $A = U S V^\dagger$, where U (of size m × k) and $V^\dagger$ (of size k × n) have orthonormal vectors in their columns and rows respectively, with k = min(m, n). They satisfy $U^\dagger U = \mathbb{1}_k$ and $V^\dagger V = \mathbb{1}_k$, and are also called isometries. The diagonal matrix S contains the singular values (which are real and non-negative), and the SVD can be sorted such that they appear in decreasing order along the diagonal. The concept of using the SVD to decrease the dimension of a tensor leg is based on truncating the singular values, cutting the smallest ones from S and adjusting the matrices U, $V^\dagger$ accordingly. See also the image of the corresponding TN in Fig. 2.9. It can be shown that the new product $U S V^\dagger$ is the best approximation to the original matrix (using smaller matrices of lower matrix rank), in the sense that it minimizes the Frobenius norm of the difference matrix [2]. The Frobenius norm of a matrix is perhaps the most basic matrix norm, consisting simply of the square root of the sum of all the elements squared. A useful concept is the truncation error, which is the fraction of the sum of the cut singular values over the total sum. The truncation error can be used as a measure of the accuracy of the approximation when projecting a tensor down to a smaller dimension on one or more legs.

The truncated isometries $U'$ and $V'^\dagger$ are now approximations to the identity when contracting their truncated legs, e.g. $U' U'^\dagger \approx \mathbb{1}_m$. These can be used to project the corresponding legs of the original matrix to a smaller dimension. This procedure lies at the foundation of approximate contraction schemes for TNs: reshape a tensor to a matrix with the leg to be projected to a smaller space on one side and all the other legs on the other, then perform


Figure 2.9: The Singular Value Decomposition (SVD) of a matrix A, with the index dimensions written next to the legs. The contracted indices of U and V† have dimension k ≤ min(m, n), where equality is the default SVD; it can be truncated by cutting the smallest singular values, decreasing the dimension of the matrix S and that of the corresponding legs on U and V†. These truncated isometries can then be used to project the corresponding legs of A to a smaller space, as also indicated by the triangular shape of the tensors.

an SVD on the resulting matrix, truncate the relevant isometry and use it to project the desired leg.
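A minimal NumPy sketch of this reshape, SVD, truncate, and project sequence; the rank-4 tensor, its bond dimension D, and the kept dimension chi are arbitrary placeholders.

```python
import numpy as np

D, chi = 8, 4
T = np.random.rand(D, D, D, D)              # some rank-4 tensor
M = T.reshape(D, D**3)                      # leg 0 on one side, the other legs merged
U, S, Vh = np.linalg.svd(M, full_matrices=False)
trunc_err = S[chi:].sum() / S.sum()         # fraction of discarded singular weight
U_trunc = U[:, :chi]                        # U_trunc.T @ U_trunc = I_chi exactly,
                                            # U_trunc @ U_trunc.T ≈ I_D approximately

# Project the first leg of T down from dimension D to chi.
T_proj = np.einsum('ia,ijkl->ajkl', U_trunc, T)
print(T_proj.shape, trunc_err)
```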

Finally, sometimes the eigendecomposition is used rather than the SVD, for reasons that include accuracy and computation time (with AD). This works because the eigendecomposition of the relevant matrix multiplied with its complex conjugate (or transpose, since TNs most often work with real numbers) gives $M M^T = U \Lambda U^T$; substituting the SVD for M,

$$U \Lambda U^T = U S V^T (U S V^T)^T = U S V^T V S U^T = U S^2 U^T , \tag{2.8}$$

with the identification $\Lambda = S^2$. Thus an eigendecomposition of the matrix multiplied with its transpose yields the same isometry U, which can then be truncated and used as a projector.
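A short numerical check of Eq. (2.8), using a random matrix: the eigenvalues of $M M^T$ coincide with the squared singular values of M.

```python
import numpy as np

# Eigenvectors of M M^T match the left singular vectors of M (up to sign and
# ordering), and the eigenvalues equal the squared singular values.
M = np.random.rand(6, 9)
U_svd, S, _ = np.linalg.svd(M, full_matrices=False)
evals, U_eig = np.linalg.eigh(M @ M.T)      # eigh returns ascending eigenvalues
assert np.allclose(np.sort(evals)[::-1], S**2)
```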

2.4 Operators and optimization

Besides representing wave functions, a key aspect of tensor network algorithms, and indeed of quantum physics, is the calculation of expectation values of operators. The translation from the usual bra-ket notation to TNs is quite straightforward, and an intuitive notation for evaluating an operator locally comes for free. The expectation value of an operator is denoted $\langle \hat{O} \rangle = \langle\Psi| \hat{O} |\Psi\rangle$ or, for an unnormalized state, $\langle \hat{O} \rangle = \langle\Psi| \hat{O} |\Psi\rangle / \langle\Psi|\Psi\rangle$. Usual Hamiltonians of interest consist of a sum of local terms, for example nearest-neighbor interactions. Local order parameters are also measured using local operators. Evaluating such local operators from a TN state is achieved as in Fig. 2.10. The ket is obtained from the MPS by taking the complex conjugate, though usually real numbers suffice, in which case it is simply the transpose (in fact, because of this symmetry there is no real difference between the bra and ket in a TN algorithm). Then the operator is contracted with the bra and ket layers at the relevant position. This method of obtaining expectation values of operators generalizes to operators that span multiple sites, to next-nearest-neighbor interactions, and of course to single-site operators that measure a local order parameter.

Physical Hamiltonians often consist of a sum over local interactions; in the case of only nearest-neighbor interactions these terms are sometimes also called bond Hamiltonians, since they live on the bond between two sites. Evaluating the total Hamiltonian then means evaluating the bond Hamiltonians and


Figure 2.10: The evaluation of the bond energy between the second and third sites using the nearest-neighbor Hamiltonian. The bra is a five-site MPS with open boundary conditions, where the ket is obtained by taking the complex conjugate. The evaluation of the energy is divided by the norm to ensure normalization.

summing the results; the example in Fig. 2.10 would have $E = \langle \hat{H} \rangle = \sum_{\text{bonds}} \langle \hat{H}_b \rangle$, where $\hat{H}_b$ is the bond Hamiltonian. The bond Hamiltonian tensor is obtained by reshaping (splitting) each leg of the bond Hamiltonian matrix (obtained from the operators given a basis choice) into two legs of dimension $\sqrt{m}$, where the bond Hamiltonian is an m × m matrix and $\sqrt{m}$ is also the dimension of the local Hilbert space. At the level of the algorithm the bond Hamiltonian tensor is often simply referred to as the Hamiltonian.
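The sketch below puts Fig. 2.10 and the reshape just described together for a small random five-site MPS; at this size the full state vector can simply be built, so the example only illustrates the bookkeeping, not an efficient contraction. The tensors and the bond operator are random placeholders, and the index convention (bra indices first, then ket indices) is an assumption.

```python
import numpy as np

d, D = 2, 3
A1 = np.random.rand(d, D)
A2 = np.random.rand(D, d, D)
A3 = np.random.rand(D, d, D)
A4 = np.random.rand(D, d, D)
A5 = np.random.rand(D, d)

# Contract the open-boundary MPS into the full five-site state vector.
psi = np.einsum('ia,ajb,bkc,cld,dm->ijklm', A1, A2, A3, A4, A5)

# A placeholder symmetric bond Hamiltonian matrix, split into a rank-4 tensor:
# each m = d^2 leg becomes two legs of dimension d, ordered (bra2, bra3, ket2, ket3).
h_mat = np.random.rand(d**2, d**2)
h_mat = 0.5 * (h_mat + h_mat.T)
h = h_mat.reshape(d, d, d, d)

# <psi| h_{23} |psi> / <psi|psi>, with real tensors so the bra is just psi itself.
num = np.einsum('ijklm,jknp,inplm->', psi, h, psi)
norm = np.einsum('ijklm,ijklm->', psi, psi)
bond_energy = num / norm
```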

An important concept is that of imposing symmetries of the model on the TN, in particular translation invariance. If the ground state of the model under study is translation invariant (possibly with a finite unit cell), then there is no reason the TN should not be translation invariant, since the physical legs correspond to physical sites. This translation invariance can thus be imposed on the TN, simplifying algorithms. In particular (for a finite system), if the MPS in Fig. 2.10 had legs connecting the sites at the edges (see Fig. 2.11), all five tensors could be the same one. Then, the energy could be obtained by measuring a single bond Hamiltonian and simply multiplying by the number of bonds (5).

With a means to evaluate the energy of a TN state in hand, E can be used to judge how good the state is, meaning how close it is to the ground state. The energy of a TN state is subject to a variational principle,

$$E_0 \le E = \frac{\langle\Psi| \hat{H} |\Psi\rangle}{\langle\Psi|\Psi\rangle} , \tag{2.9}$$

where $E_0$ is the exact ground state energy. The inequality follows by definition: if the state $|\Psi\rangle$ is not the ground state, then it must have a higher energy. Thus energies obtained for the same system using different TN methods can be compared quite simply: the lowest energy wins, since it is closest to the ground state (even if the exact ground state energy is not known).

Given a TN state, the tensor elements need to be optimized to find the best wave function with the lowest energy. The initial elements are usually chosen randomly, or sometimes obtained from a different TN algorithm, if available or relevant. The two main ways of executing the optimization (minimization) of the energy are imaginary time evolution and variational optimization. In imaginary time evolution, the operator $e^{-\tau \hat{H}}$ is applied to the TN to project it onto the ground state, using [11]

$$|E_0\rangle = \lim_{\tau \to \infty} \frac{e^{-\tau \hat{H}} |\Psi(0)\rangle}{\sqrt{|\langle\Psi(\tau)|\Psi(\tau)\rangle|}} , \qquad |\Psi(\tau)\rangle = e^{-\tau \hat{H}} |\Psi(0)\rangle , \tag{2.10}$$


Figure 2.11: A periodic MPS, with every tensor the same.

where $|E_0\rangle$ is the exact ground state and the denominator serves as the normalization constant. Of course, in practice the infinite limit means that the imaginary time evolution operator is applied to the initial state $|\Psi(0)\rangle$ for a large enough $\tau$ to reach convergence.

More recently, variational optimization was shown to result in significantly better energies than imaginary time evolution algorithms [6,7]. It is based on directly minimizing the energy expectation value, given the variational principle of Eq. 2.9, and employing some numerical (iterative) optimizer. To implement this method a gradient is required, which can be obtained in different ways. In principle it is conceivable to use a finite-difference numerical gradient, but this is restricted to approaches with very few variational parameters [5, and their 36, 37]. In the successful approaches in [6,7] it is derived manually by performing a sum over all the different relative positions of the Hamiltonian to the tensor being optimized (while the others are fixed). A method recently first applied to TNs is automatic differentiation, which delivers an exact gradient at the same computational cost as the algorithm to be differentiated [5]. It is discussed in detail in the next Section.

2.5 Automatic differentiation

Automatic differentiation (AD) is a method that tracks all the operations in an algorithm or a single function, and uses predefined derivatives for these to obtain an overall derivative using the chain rule. Importantly, it achieves this at a computational cost that is at most the same as that of the original algorithm [4,5]. AD was first developed for a range of applications in computational mathematics, among which optimization problems [4, and references therein]. Today, it is perhaps best known for its use in machine learning, where it is used to optimize ("train" in the machine learning community) a model of a data set.

The idea of AD is to track every operation between the input $\theta$ and the output $\mathcal{L}$, where in principle both can be anything from a scalar to a high-rank tensor, though in optimization problems the output is a scalar. Then the chain rule is used to obtain the overall derivative:

$$\frac{\partial \mathcal{L}}{\partial \theta} = \frac{\partial \mathcal{L}}{\partial T_n} \frac{\partial T_n}{\partial T_{n-1}} \cdots \frac{\partial T_2}{\partial T_1} \frac{\partial T_1}{\partial \theta} , \tag{2.11}$$

where Tn is the n-th intermediate tensor. A derivative can be defined for the typical operations, such as scalar and matrix multiplications and tensor index permutations. The derivatives corresponding to the steps in the algorithm or function that AD is applied to constitute the factors in the chain rule in Eq. 2.11. Tracking these and multiplying them together is the ‘automatic’ part of AD. Although it might sound like it, there is (of course) no magic happening: ‘automatic’ merely refers to the fact that the user need not worry about finding a derivative for the algorithm.

A central concept in AD is the computation graph. On the computation graph, vertices represent data points (tensors) and edges represent operations.


The computation graph is directed, with arrows pointing to the outcome of operations, all the way from the input to the output; see Fig. 2.12. To understand how the computation graph is used to multiply together the factors in the chain rule, it is useful to define the adjoint,

$$\bar{T} := \frac{\partial \mathcal{L}}{\partial T} , \tag{2.12}$$

that is, the adjoint of a variable is the derivative of the output with respect to that variable. This definition is used to iterate through the computation graph, traversing it according to [5]

$$\bar{T}_i = \bar{T}_{i+1} \frac{\partial T_{i+1}}{\partial T_i} , \tag{2.13}$$

where i = n − 1, n − 2, . . . , 1. Thus all the derivatives up to vertex i in the computation graph are accumulated. The end points are given by $\bar{T}_n = \bar{\mathcal{L}}\,\frac{\partial \mathcal{L}}{\partial T_n}$, where $\bar{\mathcal{L}} = 1$ by definition, and $\bar{\theta} = \bar{T}_1 \frac{\partial T_1}{\partial \theta}$, where of course $\bar{\theta} = \frac{\partial \mathcal{L}}{\partial \theta}$ is the desired derivative. See also the example with two intermediate tensors in Fig. 2.12.

This procedure is the so-called 'reverse mode' AD, where the computation graph is traversed backward, that is, the first factors in the chain rule to be multiplied together are those closest to the output $\mathcal{L}$. Forward mode AD is also possible, but the total number of scalar multiplications is smaller when starting at the end with the smaller number of elements, in this case the output, which is a scalar. Thus, for optimization problems, it is computationally cheaper to employ reverse mode AD (also the standard in machine learning applications). For most algorithms the computation graph will be considerably more complex than the example in Fig. 2.12, making it difficult to visualize, let alone to use the chain rule for a manual derivation; this is part of the power of AD. Of course a computer has no trouble keeping track of all the operations in an algorithm. An important point is that the graph might contain branches: a single tensor might produce multiple tensors through some operation. A relevant example is a matrix decomposition. Branches are easy to handle in terms of the computation graph and the chain rule: simply add the contributions from the previous adjoints during the backward pass.
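A minimal PyTorch example of reverse-mode AD (assuming a reasonably recent PyTorch with the torch.linalg module): a scalar output is built from a few tensor operations, including a matrix decomposition, and backward() accumulates the adjoints of Eq. (2.13) into the input's .grad field. This is only a toy function, not the thesis algorithm.

```python
import torch

theta = torch.rand(4, 4, dtype=torch.float64, requires_grad=True)

T1 = theta @ theta.t()            # intermediate tensor (symmetric matrix)
T2 = torch.linalg.eigvalsh(T1)    # eigenvalues; AD passes through the decomposition
loss = T2.sum()                   # scalar output L

loss.backward()                   # reverse-mode pass fills theta.grad with dL/dtheta
print(theta.grad.shape)
```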

A practical aspect is that a program commonly has more than a single input (tensor). Input parameters, or vertices in the computation graph that

Figure 2.12: The backpropagation of the derivative through the computation graph, for an example with input θ, two intermediate tensors $T_1$ and $T_2$, and output $\mathcal{L}$: $\bar{T}_2 = \bar{\mathcal{L}}\,\frac{\partial \mathcal{L}}{\partial T_2}$, $\bar{T}_1 = \bar{T}_2 \frac{\partial T_2}{\partial T_1}$, $\bar{\theta} = \bar{T}_1 \frac{\partial T_1}{\partial \theta}$.


do not have any arrows pointing to them, are called leaves (though this term seems to be largely PyTorch [23] specific). One or more leaves will participate in the chain rule, but not necessarily all of them. An example of a leaf with fixed elements (without gradient tracking enabled, in the terminology) is the Hamiltonian in a TN algorithm, which is an input, but not subject to the optimization. Another useful tool is that intermediate tensors can be "detached" from the computation graph, turning them into leaves that do not get a gradient in the backward pass. Finally, while the computational cost scales at most like that of the original algorithm, memory can be a concern. When simply doing a function evaluation, memory might be recycled within a single call, but AD requires the entire computation graph for the backward pass. A way around this problem is checkpointing: storing only every so-many tensors in the computation graph and recomputing the missing links during the backward pass.
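A sketch of checkpointing with torch.utils.checkpoint: the intermediate tensors inside the wrapped function are not stored during the forward pass and are recomputed during the backward pass, trading compute for memory. The wrapped contraction is a placeholder, not the thesis algorithm.

```python
import torch
from torch.utils.checkpoint import checkpoint

def block(a):
    # Intermediates created here are recomputed, not stored, when checkpointed.
    t = torch.einsum('ij,jk->ik', a, a)
    return torch.einsum('ij,ij->', t, t)

a = torch.rand(50, 50, dtype=torch.float64, requires_grad=True)
loss = checkpoint(block, a)   # forward without saving intermediates
loss.backward()               # block is re-run here to rebuild them
```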

An issue that arises with AD is at which level to define the derivative. Ultimately, in a lower level routine, an operation like a matrix decomposition consists of many simple operations like scalar multiplications. While the derivative could be implemented at that level (and it used to be [24]), it is worthwhile to derive a higher-level derivative, for the matrix decomposition as a whole. Reasons include increased numerical stability [25], decreased computational cost [4], and better integration with existing code [26]. One method to derive such higher-level derivatives simply differentiates the expression that defines the operation, after which some manipulations are required (including using the adjoint definition Eq. 2.12) to find the adjoint of that operation's input in terms of the adjoints of its output; see Ref. [24] for a recipe and examples.

A second method to derive higher-level derivatives first rewrites the original equation of the operation to obtain the adjoint relation (different from the definition of the adjoint in Eq. 2.12), which is then used to find the derivative; see Refs. [25,27] for details. An example of a higher-level derivative is that for the eigendecomposition, $A = U \Lambda U^T$, with Λ a diagonal matrix containing the eigenvalues and U the matrix with the corresponding eigenvectors in its columns. In the computation graph, the operation of eigendecomposing a matrix leads to the creation of two 'child' vertices. Given their adjoints, the adjoint of the original matrix A is given by [5,24,26]

$$\bar{A} = U \left[ \bar{\Lambda} + F \circ \left( U^T \bar{U} - \bar{U}^T U \right)/2 \right] U^T , \tag{2.14}$$

where $F_{ij} = 1/(\lambda_j - \lambda_i)$ and $\circ$ denotes the Hadamard product (an element-wise matrix product). This equation allows passing through the eigendecomposition on the computation graph during the backpropagation, thus exactly differentiating it. The eigenvalue difference in the denominator might cause problems in the case of degenerate eigenvalues, which is discussed in more detail in Chapter 3 on Methods. Besides making the implementation of TN algorithms with AD easier than deriving the gradient manually, the addition of this derivative of the matrix decomposition is new compared to the previous methods of variational optimization [6,7].
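A direct transcription of Eq. (2.14) as a plain NumPy function, given the eigendecomposition and the adjoints of its outputs; setting the diagonal of F to zero is an assumption (the formula leaves it undefined), and degenerate eigenvalues would still make F ill-defined, as discussed in Chapter 3.

```python
import numpy as np

def eigh_adjoint(U, lam, U_bar, lam_bar):
    """Adjoint of a symmetric matrix A = U diag(lam) U^T, per Eq. (2.14)."""
    diff = lam[None, :] - lam[:, None]          # lambda_j - lambda_i
    F = np.zeros_like(diff)
    off = ~np.eye(len(lam), dtype=bool)
    F[off] = 1.0 / diff[off]                    # F_ii set to zero by convention
    inner = np.diag(lam_bar) + F * (U.T @ U_bar - U_bar.T @ U) / 2.0
    return U @ inner @ U.T
```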

2.6 U(1) symmetric PEPS

Many Hamiltonians possess various symmetries, which can be implemented at the level of TNs. Focusing here on global U(1) symmetry for the PEPS, the


goal is to specify states using fewer parameters by exploiting the symmetry, thus permitting the use of larger bond dimensions and leading to higher accuracy. What follows is an outline of the ideas used to implement the symmetry at the TN level, based on Refs. [28,29], which contain much more detail and also demonstrate that this approach is correct.

The symmetry of the Hamiltonian can be seen from its commutation with all the elements of the symmetry group $\mathcal{G}$, that is,

$$[\hat{H}, q] = 0 \quad \forall q \in \mathcal{G} , \tag{2.15}$$

where $[A, B] = AB - BA$ is the commutator. The eigenstates of $\hat{H}$ then correspond to irreducible representations of $\mathcal{G}$. The labels of these correspond to commuting quantities in the system, also called good quantum numbers. In this context they are also called charges or symmetry sectors (afforded by the group $\mathcal{G}$).

For U(1) the operator that commutes with the Hamiltonian (and generates a representation of U(1)) is

$$g_{U(1)} = \sum_i \sigma^z_i , \tag{2.16}$$

where i runs over the sites in the system. Thus the charges under this group correspond to expectation values of the operator $\sigma^z$. For each symmetry group, rules can be defined for calculating with the charges, based on the resulting label when taking a tensor product of two irreducible representations. For the U(1) group these rules are the same as those of integer addition, with the inverse charges taking a minus sign.

Where TN algorithms are concerned, finite groups are easier to implement than continuous ones. In the case of U(1), finite subgroups $\mathbb{Z}_q$ are used. The integer labels of the charges can then take the values 0, 1, . . . , q − 1. Inverse charges are defined modulo q: for example, using $\mathbb{Z}_{36}$ a charge of −1 becomes 35, since (1 + 35) mod 36 = 0. The charges can be defined at the level of the tensors by assigning them to each tensor leg. The symmetry then allows only non-zero elements that respect it, resulting in a block-diagonal structure when the tensor is reshaped into a matrix. This sparse structure is used to implement operations like tensor contraction and matrix decomposition at the level of these symmetry blocks, thus allowing all the usual TN operations for symmetric tensors while exploiting the sparse structure of the tensors for computational gains. Of course some extra overhead arises as well, but it is no match for the decreased computational cost at the level of the TN operations, which allows higher bond dimensions to be reached.
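A small NumPy sketch of the charge bookkeeping for a rank-3 tensor: each leg carries a list of $\mathbb{Z}_q$ charges, outgoing legs enter with a minus sign (a sign convention assumed here), and only elements whose charges add up to zero modulo q are allowed. The charge assignments below are arbitrary placeholders.

```python
import numpy as np

q = 36
charges_leg1 = np.array([0, 1, 35])      # "incoming" leg
charges_leg2 = np.array([0, 1, 35])      # "incoming" leg
charges_leg3 = np.array([0, 1, 2, 35])   # "outgoing" leg, enters with a minus sign

total = (charges_leg1[:, None, None]
         + charges_leg2[None, :, None]
         - charges_leg3[None, None, :]) % q
mask = (total == 0)                      # True where a non-zero element is allowed

T = np.random.rand(3, 3, 4) * mask       # a U(1)-symmetric tensor: zeros elsewhere
```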


Chapter 3

Methods

Finite periodic square lattice PEPSs are investigated using a new periodic contraction approach inspired by the Corner Transfer Matrix Renormalization Group (CTMRG) and a variational optimization strategy where the gradient is supplied by automatic differentiation. Periodic boundary conditions are used to eliminate boundary effects in finite systems, which are investigated as a complement to the data on infinite systems in, for example, Refs. [5,6], and because little research is available on contraction approaches for finite periodic tensor networks (TNs). As an additional benefit, periodic boundaries make the system translation invariant, which means that a reduced number of tensors can be used, simplifying the algorithm. The entire scheme is benchmarked using the Heisenberg Hamiltonian, which is a strongly correlated spin Hamiltonian without a sign problem, thus a good test case for a TN algorithm, while at the same time the results can be compared to Quantum Monte Carlo data.

The Heisenberg Hamiltonian is given by

$$\hat{H} = \sum_{\langle i,j \rangle} \hat{S}^x_i \hat{S}^x_j + \hat{S}^y_i \hat{S}^y_j + \hat{S}^z_i \hat{S}^z_j = \sum_{\langle i,j \rangle} \vec{S}_i \cdot \vec{S}_j , \tag{3.1}$$

where the angular brackets indicate a sum over nearest neighbors and the second equality is an abbreviated form. Since the sign of each term is positive, the energy is minimized when spins anti-align, leading to anti-ferromagnetic order. The exact ground state consists of a superposition of many such states, but this symmetry is spontaneously broken in the thermodynamic limit, leading to a non-zero magnetization. For finite systems this symmetry will not be broken and a vanishing magnetization is expected, though in practice the finite bond dimension of the TN might induce a breaking of the symmetry, leading to a finite magnetization.
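For concreteness, the two-site (bond) term of Eq. (3.1) for spin 1/2 can be built directly from the spin operators; a minimal NumPy sketch in the (up, down) product basis.

```python
import numpy as np

# Spin-1/2 operators S^a = sigma^a / 2.
sx = 0.5 * np.array([[0, 1], [1, 0]])
sy = 0.5 * np.array([[0, -1j], [1j, 0]])
sz = 0.5 * np.array([[1, 0], [0, -1]])

# Two-site bond term S_i . S_j as a 4x4 matrix; the sy (x) sy part is real-valued.
H_bond = (np.kron(sx, sx) + np.kron(sy, sy) + np.kron(sz, sz)).real

# Sanity check: the lowest eigenvalue, -3/4, is the singlet energy of a single bond.
assert np.isclose(np.linalg.eigvalsh(H_bond).min(), -0.75)
```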

Automatic differentiation (AD) is widely used in machine learning to optimize ("train") models of data sets. The models usually consist of a neural network of some form, which consists of tensors. Machine learning libraries are thus a useful tool for the implementation of an AD optimized TN algorithm, since many (if not all) of the required tensor operations are already implemented, including their derivatives (or 'backwards'). In the present work, PyTorch [23] was used. An interesting benefit is that machine learning libraries usually come ready to run code on the Graphics Processing Unit (GPU) instead of the usual Central Processing Unit (CPU), which can lead to significant computational speed-ups. However, GPUs commonly built into laptops and desktops normally work at single float precision, with double precision calculations being very slow. TN algorithms require double precision, so the


hardware required is somewhat specialized. All the results in the next Section are obtained on CPUs.

As a side note, the scheme implemented here is of course completely deterministic and has little to do with machine learning, besides the common utilization of the generic tool of AD. Thus, it is conceivable to implement AD in some other preferred language or library for writing TN algorithms instead of using a machine learning toolkit. This is exactly what is done in Ref. [30], including a detailed description and some generalizations to complex tensor elements.

Another programming detail concerns a few tools for dealing with the additional complexity of finding the cheapest contraction orders for periodic PEPSs. NCON [31] allows writing a sequence of pairwise contractions for multiple tensors in a single line by labeling all the legs (where contracted legs are labeled the same). This reduces the effort of copying a set of pairwise contractions for a larger TN from a diagram on paper. Due to the increased number of tensor legs in periodic systems, determining the cheapest contraction order by inspection can be a difficult task, especially for contractions like that of a corner update with its complex conjugate or the final contraction of the two-site reduced density matrix (see the next Section for details on these steps). A very useful tool is Netcon [19], which requires the definition of the tensors and the legs to be contracted (in MATLAB [32] code) to find the cheapest contraction order. A similar feature is included in TensorTrace [33], which allows creating the TN diagrams visually before generating MATLAB [32], Python [22] or Julia [34] code that executes the contraction in the cheapest order.

3.1 Implementation

Having decided on the TN geometry of a square lattice periodic PEPS, a unit cell can be chosen in conjunction with the Hamiltonian under study. In general, different choices have to be tried and the results compared, because the periodicity of the ground state is usually not known; for the Heisenberg model, however, the ground state is known to have a checkerboard structure. A useful trick is to perform a sublattice rotation on one of the sublattices, such that the unit cell consists of a single tensor, further simplifying the algorithm. Such a sublattice rotation can be achieved by multiplying an operator with the Hamiltonian matrix, but an alternative is to contract a spin flip operator with both legs on one side of the two-site Hamiltonian tensor used in the algorithm, as in Fig. 3.1. This option has a nice visual TN interpretation of flipping the spin before it meets the Hamiltonian. The spin flip operator is

$$U = e^{\frac{i\pi}{2}\sigma_y} = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix} \qquad (3.2)$$

in the spin up/down basis. The Hamiltonian tensor is obtained by reshaping (splitting) each index of the bond Hamiltonian matrix into two. A useful index labeling convention when storing the tensor in the program is indicated in Fig. 3.1: start at the top, left to right, then the bottom, left to right.
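Continuing the sketch above, the sublattice rotation amounts to contracting U of Eq. 3.2 with both physical legs on one side of the bond Hamiltonian and reshaping the result into the 4-leg tensor of Fig. 3.1. Which of the two sites carries the flip, and whether the row or the column index of the matrix becomes the 'top' pair of legs, are conventions; the choice below is an assumption made purely for illustration.

    # Spin flip operator of Eq. 3.2, acting on one sublattice.
    U = torch.tensor([[0.0, 1.0], [-1.0, 0.0]], dtype=torch.float64)

    # Apply U to both physical legs of the second site of the bond
    # (transposed on one side with respect to the other, as in Fig. 3.1),
    # then split each matrix index into two legs: (1, 2) on top, (3, 4) below.
    flip = torch.kron(torch.eye(2, dtype=torch.float64), U)
    H_rot = flip @ H_bond @ flip.t()
    H_tensor = H_rot.reshape(2, 2, 2, 2)   # legs ordered as (1, 2, 3, 4)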

Having motivated the use of a single tensor copied all over the PEPS (see Fig. 3.2), the schematic form of the program or function that will be differentiated using AD is E = f(a), where f is the TN algorithm.


Figure 3.1: Hamiltonian with the index labeling convention indicated by the numbers at the legs. The spin flip matrix U is attached on both legs on one side of the Hamiltonian, effectively flipping spins on one sublattice. The shape of U in the diagram indicates its orientation, that is, it should be transposed on one side of the Hamiltonian with respect to the other.

The tensor will also simply be indicated by a, with the indices omitted. For treating the Heisenberg model with this TN approach, it is sufficient for the tensor elements to be real rather than complex, turning the complex conjugate into a transpose (which is used, for example, when using the ket represented by the TN to create the bra). The AD approach broadly consists of three stages; the machine learning terms for these (also used in the PyTorch [23] documentation) are the forward, the backward and the optimization, where the first two refer to the direction in which the computation graph is traversed. Thus, the forward is the function to be differentiated, meaning the entire TN algorithm that contracts the TN consisting of some initial tensor(s) and outputs the energy. In optimization terms the energy is the cost to be minimized, which in machine learning libraries is also called the ‘loss’.

During the first step, the forward, the computation graph is created by tracking all the operations and intermediate tensors. The specific contraction schemes used for the smallest linear size L = 4 PEPS and the larger L > 4 PEPSs are described in more detail in the next Sections. Then, in the backward step, or backpropagation, the computation graph is traversed in the opposite direction to obtain the gradient. Finally, this gradient is used by an optimizer to update the initial tensor’s elements (which are the variational parameters) to minimize the energy.

The optimizer used here is the common Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) optimizer, a quasi-Newton optimizer that uses a gradient-based estimation of the Hessian (matrix of second derivatives) to find minima. The limited-memory version does not store the full Hessian, but rather an implicit approximation based on vector products. The method converges when the norm of the gradient is sufficiently small (compared to a tolerance) or when a better value of the cost (energy) cannot be found. The Python [22] scientific library SciPy’s [35] implementation of this optimizer offers better control over tolerances and other parameters than the one built into PyTorch [23]. The SciPy optimizer can be interfaced with PyTorch using, for example, code as in Ref. [36].
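A minimal sketch of such an interface is shown below (the actual code used in this work follows Ref. [36]); the toy cost function merely stands in for the full TN contraction, and the tensor shape is illustrative.

    import numpy as np
    import torch
    from scipy.optimize import minimize

    torch.set_default_dtype(torch.float64)

    shape = (3, 3, 3, 3, 2)   # a small PEPS tensor a: four auxiliary legs, one physical

    def cost_and_grad(x):
        """Return the energy and its gradient for the flat parameter vector x."""
        a = torch.tensor(x.reshape(shape), requires_grad=True)
        energy = (a * a).sum()          # placeholder for the full TN contraction
        energy.backward()               # AD fills in a.grad
        return energy.item(), a.grad.numpy().ravel()

    x0 = np.random.rand(int(np.prod(shape)))
    res = minimize(cost_and_grad, x0, jac=True, method='L-BFGS-B',
                   options={'gtol': 1e-10, 'maxiter': 500})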

3.2 Exact contraction

The L = 4 PEPS is small enough to be contracted exactly. This also makes it a good practice case for AD optimized PEPS, since no derivatives of linear algebra decompositions like the SVD and eigendecomposition are needed.



Figure 3.2: The L = 4 translation invariant PEPS with a single tensor. The auxiliary bond dimension D is indicated, as well as the physical index dimension, which is equal to the local Hilbert space dimension of 2 for the spin 1/2 Heisenberg model. The periodic bonds are not completely drawn to reduce clutter, except for the example at the top (dashed).

Some additional lattice symmetries of the model are easy to implement at the level of the TN. The Heisenberg Hamiltonian in Eq. 3.1 on the periodic square lattice is not only translation invariant, but also reflection invariant in both spatial directions, since the sites are indistinguishable to the anti-ferromagnetic nearest neighbor interaction. Finally, the Hamiltonian being isotropic (the directions x, y, z are indistinguishable), the PEPS should also be invariant under rotations. These symmetries will also simplify the approximate contraction algorithm in the next Section. Implementing these symmetries in the TN is a matter of enforcing them on the tensor a. For example, permuting opposite auxiliary legs should leave the tensor invariant if it is reflection symmetric, where tensor index permutation is simply the generalization of the matrix transpose. In a similar way as for matrices, a symmetric tensor can be created by adding the relevant permutations together:

$$a^{\mathrm{sym}}_{i_1 i_2 i_3 i_4 i_5} = \left( a_{i_1 i_2 i_3 i_4 i_5} + a_{i_1 i_4 i_3 i_2 i_5} \right) / 2, \qquad (3.3)$$

for a reflection symmetry in one direction (note the exchange of the indices 2 and 4). Doing the same for the other reflection symmetry and the rotation symmetry, each time acting on the outcome of the previous step, a fully symmetric tensor is obtained.
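In code, these symmetrizations are averages over index permutations. The sketch below assumes the auxiliary legs are stored in the order (up, left, down, right) with the physical leg last; the actual leg ordering used in the code of this work may differ.

    def symmetrize(a):
        """Average a 5-leg tensor over the lattice symmetries.

        Assumed leg order: (up, left, down, right, physical). The two
        reflections are applied first, then the rotation, as in the text."""
        a = (a + a.permute(0, 3, 2, 1, 4)) / 2   # left-right reflection
        a = (a + a.permute(2, 1, 0, 3, 4)) / 2   # up-down reflection
        a = (a + a.permute(1, 2, 3, 0, 4)) / 2   # 90-degree rotation of the auxiliary legs
        return a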

On the programming side, a useful trick is to do this procedure once, starting from a tensor with all different elements. Finding the unique elements in the fully symmetric tensor created from that initial tensor, a vector containing only those elements can be used in the algorithm, together with a fixed ‘inverse’ which acts as a look-up table for creating a tensor from that vector. The optimization can then happen on the reduced set of elements in the vector, while the first step in the TN algorithm is to recreate the symmetric tensor from the vector and the inverse, after which the usual contractions can take place.
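A possible realization of this trick with torch.unique, reusing the symmetrize sketch above (the variable names are illustrative):

    # Build the reduced parameter vector and the fixed look-up table once.
    D = 3
    a0 = symmetrize(torch.rand(D, D, D, D, 2, dtype=torch.float64))
    vec, inverse = torch.unique(a0.flatten(), return_inverse=True)
    vec = vec.clone().requires_grad_(True)   # the variational parameters

    # First step inside the TN algorithm: rebuild the full symmetric tensor.
    # The 'inverse' stays fixed; gradients flow back to vec through the indexing.
    a = vec[inverse].reshape(a0.shape)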

Due to the translation and rotation invariance, calculating the energy of this periodic PEPS is relatively straight-forward: since all bonds are equivalent, only a single Hamiltonian term, that is, a single bond energy, needs to be evaluated (see also Fig. 3.2). For this small PEPS the cheapest contraction order works within the single (ket/bra) layer first. This is followed by a partial contraction with its conjugate over the physical legs, leaving two adjacent physical legs open, which can be contracted with a Hamiltonian in between to evaluate the energy. This object is also called the two-site reduced density matrix (since density matrices are outer products between a ket and a bra), and it can be used to find the expectation value of local operators as well, for example by contracting one site with the operator $\hat{S}^x$ to find the magnetization in that direction, while simply tracing over the second site. Tracing over both sites (simply contracting the remaining physical legs) results in the norm $\langle\Psi|\Psi\rangle$. The leading computational cost to obtain the two-site reduced density matrix scales in the bond dimension as $O(D^8)$. The most expensive step that results in this leading scaling comes from contracting two tensors of 2 × 2 physical sites to create half the ket layer.
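Given the two-site reduced density matrix, these measurements are single einsum calls. In the sketch below the legs of rho are assumed to be ordered as (ket site 1, ket site 2, bra site 1, bra site 2), and H_tensor and Sx refer to the earlier sketches; a random placeholder stands in for the actual contraction result.

    # Placeholder for the two-site reduced density matrix with legs (k1, k2, b1, b2);
    # in the algorithm it results from the contraction described above.
    rho = torch.rand(2, 2, 2, 2, dtype=torch.float64)

    norm   = torch.einsum('ijij->', rho)                          # <Psi|Psi>
    e_bond = torch.einsum('ijkl,klij->', rho, H_tensor) / norm    # bond energy
    mx     = torch.einsum('ijkj,ki->', rho, Sx) / norm            # <Sx> on site 1, site 2 traced out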

U(1) contraction

Since the spin flip operator used in the sublattice rotation does not respect the U(1) symmetry, it is not straight-forward to use both these tools together. Thus, a checkerboard AB unit cell is required to implement the L = 4 PEPS with U(1) symmetric tensors. The required adaptation of the code is quite simple: add a second tensor in the correct places in the contraction of the PEPS, and simply add this as a second parameter (with gradient tracking) to the computation graph. The optimizer takes care of minimizing the energy with respect to both tensors. Of course, with a larger unit cell the four different contributions from the Hamiltonian need to be summed to obtain the energy given to the optimizer (dividing by two then gives the energy per site). The checkerboard pattern and these different contributions are indicated in a top-down view in Fig. 3.3.

Of course, the charge sectors in the tensors need to be chosen. While they might be chosen dynamically in a more sophisticated algorithm, they can also be chosen manually. The latter is done here, since the goal was merely to demonstrate that exploiting U(1) symmetry indeed leads to speed-ups, also in the present finite periodic PEPS and AD approach. To select the charge sectors manually, consider that the ground state would like to have $\langle \hat{S}^z \rangle = 0$, so a sensible choice is a symmetric arrangement around zero. The bond dimension D corresponds to the number of charges on a leg. For example, for D = 5 a choice is to have, on each auxiliary leg, 3 charges in the 0 sector, and 1 charge at each of +1 and -1. Of course, for the physical legs the charge sectors are dictated by the local Hilbert space, being +1 and -1 for the Heisenberg model.
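How these sectors are stored depends entirely on the U(1)-symmetric tensor implementation, which is not detailed here; purely as an illustration of the bookkeeping for D = 5:

    # Manually chosen U(1) charge sectors (illustrative bookkeeping only; the
    # concrete data structure depends on the symmetric-tensor implementation).
    aux_charges  = [0, 0, 0, +1, -1]   # three states with charge 0, one each with +1 and -1
    phys_charges = [+1, -1]            # fixed by the local spin-1/2 Hilbert space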

Figure 3.3: A top-down view of the L = 4 PEPS, meaning the physical legs are not visible. The four different pairs of nearest neighbors are indicated with the blue rectangles. To optimize the energy correctly all these contributions should be taken into account. The periodic bonds are not fully drawn to reduce clutter.


3.3 Approximate contraction

For larger PEPSs it is cheaper to contract the a tensor with its transpose first, before contracting physical sites together, see Fig. 3.4. This approach is possible since the wave function itself is not needed in the algorithm, only expressions involving the bra and ket together like $\langle\Psi|\hat{H}|\Psi\rangle$.

After creating the A tensor, some approximate contraction scheme is needed for any PEPS with roughly L > 4. A new periodic approach is developed, based on the Corner Transfer Matrix Renormalization Group (CTMRG) [14–16], which is successful for infinite PEPS [5, 6]. The idea is to contract corners and half-rows iteratively while truncating the growing bond dimensions, to create an environment around two physical sites that represents the rest of the larger PEPS. The corners and half-rows are grown from the initial tensor A, until the desired system size is reached. Since they are grown from the ‘inside’, meaning tensors are added from the side of the measurement site, the periodic bonds (at the outer edges of the corner) are pushed further away from the measurement site with each growth step of the corner and half-row (also called an edge tensor). The hope is that the periodic bonds’ dimensions can then remain relatively small, since their presence makes all the TN operations considerably more expensive than (for example) for the original CTMRG. The measurement site is where the surrounding environment is contracted with the original single layer a tensors to evaluate local operators and the norm $\langle\Psi|\Psi\rangle$, see Fig. 3.5.


Figure 3.4: The creation of the double-layer A tensor from the a tensor. The thicker bonds on the A tensor indicate the merging of corresponding indices on the a tensors into a single index of dimension $D^2$.
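In code, the construction of Fig. 3.4 is a single contraction followed by a reshape. The sketch below assumes the leg order (up, left, down, right, physical) used in the earlier sketches; since the tensor elements are real, the bra layer is simply a itself.

    def double_layer(a):
        """Contract the ket and bra copies of a over the physical leg and merge
        the corresponding auxiliary legs into single legs of dimension D^2."""
        D = a.shape[0]
        A = torch.einsum('uldrp,ULDRp->uUlLdDrR', a, a)   # ket layer and bra layer
        return A.reshape(D * D, D * D, D * D, D * D)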

The tensors that constitute the environment are obtained as follows. The same lattice symmetries are employed as in the L = 4 case, meaning that only a single corner (C) and edge tensor (T) are required to cover the entire environment. See Fig. 3.6 for a visual description of the procedure. After growing a corner using two edge tensors and an A tensor, the increased bond dimensions somehow need to be limited to keep the entire contraction from getting too expensive computationally. The best approximation of a tensor using smaller bond dimensions is achieved using a truncated projector obtained in an SVD, as detailed in the Theory in Section 2.3. Thus, the new, larger corner tensor $C'$ is reshaped into a matrix, with the leg to be projected on one side (one matrix index) and all the other legs on the other. Then an SVD is applied to this matrix, the $D^2$ smallest singular values are cut, and the relevant projector (with the chosen orientation of the $C'$ matrix this is $V^\dagger$) is truncated accordingly, discarding the rows in the matrix corresponding to the cut singular values. This truncated projector is then applied to $C'$. The same projector can be used for the second $\chi D^2$ leg, because of the reflection and rotation symmetry of the lattice. The same procedure is used to project the periodic legs of dimension $\chi' D^2$ back to dimension $\chi'$. The edge tensors are updated in a
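The truncation step just described can be sketched as follows, assuming (purely for illustration) that the leg to be projected is stored as the last index of the grown corner and that chi_keep is the dimension to be kept; since the tensors are real, $V^\dagger$ is just the transpose of V.

    def truncate_corner_leg(Cp, chi_keep):
        """Project the last leg of the grown corner C' down to dimension chi_keep."""
        legs = Cp.shape
        M = Cp.reshape(-1, legs[-1])                 # (all other legs) x (leg to project)
        U, S, Vh = torch.linalg.svd(M, full_matrices=False)
        P = Vh[:chi_keep, :]                         # truncated projector: rows of V^T
                                                     # belonging to the largest singular values
        C_new = (M @ P.t()).reshape(*legs[:-1], chi_keep)
        return C_new, P                              # P can be reused on symmetry-related legs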
