Exploring and Reducing Gradient Bias in Discrete Flow-Based Models


MSc Artificial Intelligence

Master Thesis


by

Alexandra Lindt

12230642

February 14, 2021

48 ECTS, February 2020 – February 2021 Supervisor: Assessor:


Discrete flow-based models are a recently proposed class of generative models that learn invertible transformations for discrete random variables. Specific instances are Integer Discrete Flows [19] for ordinal discrete random variables and Discrete Flows [51] for binary and categorical discrete random variables. Due to their discrete nature and exact likelihood objective, discrete flow-based models can be used in a straightforward manner for lossless compression. Unfortunately, they also come with a drawback: each of their building blocks contains a quantization operation. To enable backpropagation, the models have to make use of gradient estimators that introduce bias into the optimization and effectively harm performance. In this thesis, we analyse the gradient bias in Integer Discrete Flows and Discrete Flows. To improve the performance of Integer Discrete Flows for lossless compression, we introduce alternative quantization methods and stochastic layers to the model. Unfortunately, none of our modifications leads to improved performance. Finally, we introduce a new flow-based generative model for binary discrete random variables that we call the Discrete Denoising Flow. This discrete flow-based model can be trained without introducing gradient bias. We demonstrate that the Discrete Denoising Flow is capable of modelling high-dimensional binary data distributions with a good lossless compression rate, thereby significantly outperforming the Discrete Flow.

Acknowledgements

A special thanks to my supervisor Emiel who provided me with great input, feedback and guidance throughout the process. I am very grateful for this experience. I would also like to thank the UvA-Bosch Delta Lab for providing financial support for this project. Another big thank you goes to my fellow students for the many interesting thesis-related discussions both face-to-face and via video chat. Last but not least, I would like to thank my friends and family who have been a tremendous support along my educational journey.


1 Introduction
  1.1 Hypotheses
  1.2 Contributions
2 Background
  2.1 Flow-based Models
    2.1.1 Normalizing Flows
    2.1.2 Discrete Flow-based Models
  2.2 Gradient Estimators
    2.2.1 Reparameterization
    2.2.2 REINFORCE
    2.2.3 Gumbel-Softmax / Concrete
    2.2.4 REBAR
    2.2.5 RELAX
3 Integer Discrete Flows
  3.1 Analysis of Gradient Bias
  3.2 Methods
    3.2.1 Alternative Rounding Functions
    3.2.2 Sampling Layers
  3.3 Experiments
    3.3.1 Reproducing Gradient Bias for 2D Data Distribution
    3.3.2 Fine-tuning with FDA gradients
    3.3.3 Alternative Rounding Functions
    3.3.4 Sampling Layers
4 Discrete Flows
  4.1 Gradient Bias in Discrete Flows
  4.2 Method: Discrete Denoising Flow
  4.3 Experiments
5 Related Work
6 Conclusion
  6.1 Conclusion
  6.2 Future Work
Bibliography


List of Figures

Figure 1: IDF: Gradient bias per training epoch
Figure 2: IDF: Gradient bias per coupling layer
Figure 3: Smooth approximation of the hard rounding function
Figure 4: Original IDF coupling layer
Figure 5: Gumbel-Softmax/Concrete sampling layer
Figure 6: REBAR or RELAX sampling layer
Figure 7: Data distributions for IDF experiments
Figure 8: Performance of IDF and RealNVP for different numbers of coupling layers
Figure 9: Mean performance of IDF with alternative rounding functions
Figure 10: Performance of Gumbel-Softmax/Concrete sampling layer
Figure 11: Performance of IDF with Gumbel-Softmax/Concrete sampling layer at different positions
Figure 12: Performance of IDF with multiple Gumbel-Softmax/Concrete sampling layers
Figure 13: Performance of REBAR and RELAX sampling layer
Figure 14: Performance and gradient variance of IDF with REBAR or RELAX sampling layer at different positions
Figure 15: DF: Gradient bias per training epoch
Figure 16: Binarized MNIST Samples
Figure 17: Performance (in BPD) of Discrete Flow vs. Discrete Denoising Flow
Figure 18: Samples from Discrete Flow vs. Discrete Denoising Flow
Figure 19: Performance (BPD) of final Discrete Denoising Flow
Figure 20: Samples from final Discrete Denoising Flow

List of Tables

Table 1: Summary: Alternative Rounding Functions
Table 2: Summary: Stochastic Sampling Layers
Table 3: Standard Deviation of performance of IDF with alternative rounding functions
Table 4: Structure of final Discrete Denoising Flow

Acronyms

BPD Bits Per Dimension

DF Discrete Flow

FDA Finite Difference Approximation

IDF Integer Discrete Flow

RealNVP Real-valued non-volume preserving

STE Straight Through Estimator

1 Introduction

Flow-based generative models [8,41] have been intensively researched in recent years. Like all generative models, they have a wide range of applications, ranging from image generation [23] to audio synthesis [36] and general density estimation [38]. However, compared to other generative models, flow-based models are advantageous in that they allow for exact likelihood optimization and efficient sampling.

Most research into flow-based generative models has focused on modelling continuous data distributions [38]. In this case, discretely stored data, such as audio or image data, must be dequantized before it can be modelled. Two recent publications [19, 51] explore flow-based generative modelling of discrete distributions. Tran et al. [51] introduced two architectures for modelling distributions of discrete categorical data. Hoogeboom et al. [19] introduced the Integer Discrete Flow (IDF), a flow-based generative model for ordinal discrete data. Besides being able to model discrete data distributions without data dequantization, these models also have an information-theoretic interpretation and can be utilized for lossless compression.

Unfortunately, discrete flow-based models come with a drawback. In the IDF, each layer contains a rounding operation. While optimizing the model using the backpropagation algorithm [29, 43], the gradient of this rounding operation is estimated with the straight-through estimator [4]. This introduces bias into the optimization, which presumably harms the model's performance. The Discrete Flow suffers from a similar optimization problem, because each of the layers of the model uses the Gumbel-Softmax [21] or Concrete [30] gradient estimator, which also introduces bias.

1.1 Hypotheses

In this project, we aim to improve the optimization of discrete flow-based models to enhance their performance. To this end, we test a variety of hypotheses through experiments. Specifically, our hypotheses are

1.1. Gradient bias is affecting optimization of Integer Discrete Flows.

1.2. The finite difference gradient approximation can be used to fine-tune Integer Discrete Flows.


1.3. The gradient bias in Integer Discrete Flows can be mitigated by using a different rounding operation in the coupling layer.

1.4. The gradient bias in Integer Discrete Flows can be mitigated by introducing stochastic layers which utilize the Gumbel-Softmax / Concrete gradient estimator.

1.5. The gradient bias in Integer Discrete Flows can be mitigated by introducing stochastic layers which utilize the unbiased REBAR or RELAX gradient estimator.

2.1. Gradient bias is affecting optimization of Discrete Flows.

2.2. There is an alternative way of training a discrete flow-based model on binary or categorical data, which does not involve gradient bias.

where the first group of hypotheses regards the Integer Discrete Flow and the second group the Discrete Flow.

1.2 Contributions

In an initial analysis, we investigate the gradient bias in Integer Discrete Flows. For this purpose, we compare the gradient obtained by backpropagation with an unbiased finite difference gradient approximation [53,55]. We show that gradient bias exists in IDFs and that it builds up as more coupling layers are stacked. We further investigate whether the first-order gradient approximation can be used directly for the optimization of the model. However, we find no meaningful improvements when approximating the gradient with the finite difference approximation during training.

Next, we investigate whether gradient bias in Integer Discrete Flows can be mitigated by replacing the rounding operation with straight-through gradients by an alternative quantization method. For this purpose, we propose three alternative rounding schemes. Unfortunately, we find that none of them leads to improved performance of the IDF.

Based on the finding that gradient bias in Integer Discrete Flows builds up over multiple coupling layers, we attempt to mitigate the bias by introducing stochastic layers into the model. We test whether the Gumbel-Softmax / Concrete estimator [21, 30], the REBAR estimator [52] or the RELAX estimator [13] can be utilized in this context. However, we find that this is not the case, since the estimators introduce too much variance into the optimization.


We repeat the gradient bias analysis with finite difference approximation, as performed for the IDF, for the Discrete Flow. We find that the gradients in the Discrete Flow are also biased; however, the bias arises differently than in Integer Discrete Flows. Finally, we present an alternative discrete flow-based model for binary data, which we call the Discrete Denoising Flow. Unlike the Discrete Flow, our model can be trained without introducing any gradient bias. It further comes with the pleasant side effect that its training is computationally very efficient, because the coupling layers are not all trained at once but one after the other. We show that the Discrete Denoising Flow outperforms the Discrete Flow on modelling binary data.

2 Background

This chapter provides the necessary background information on concepts used in this thesis. Section 2.1 gives a general introduction to flow-based generative models, as well as a description of the specific instances of flow-based generative models that are utilized in this thesis. These instances comprise normalizing flows with real-valued non-volume preserving (Real NVP) transformations [41] as well as the discrete flow-based models Integer Discrete Flows [19] and Discrete Flows [51]. Section 2.2 describes a variety of gradient estimators that we later employ for introducing stochastic layers to the Integer Discrete Flow.

2.1 Flow-based Models

Flow-based generative models [25, 37] are a family of generative models that consist of a sequence of parameterized invertible functions. Compared to other approaches in generative modelling such as Variational Autoencoders [22], Generative Adversarial Networks [12] or autoregressive models like PixelCNN [34], flow-based generative models enable both exact likelihood optimization and efficient sampling. Due to these desirable properties, they have been intensively researched in recent years. For an extensive overview of the field, we refer to the paper by Papamakarios et al. [38].

2.1.1 Normalizing Flows

The fundamental idea of flow-based modelling is to express a complex probability distribution as a transformation of a simple probability distribution.

Let us consider the multivariate continuous random variables X and Z with corresponding domains $\mathcal{X}$ and $\mathcal{Z}$. Using an invertible and differentiable transformation $T : \mathcal{Z} \to \mathcal{X}$, the probability distribution $p_X(\cdot)$ of X can be written in terms of the probability distribution $p_Z(\cdot)$ of Z as

$$p_X(x) = p_Z(z) \left| \det \frac{\partial T(z)}{\partial z} \right|^{-1} \quad \text{with } z = T^{-1}(x) \tag{1}$$

using the change of variables formula [42]. In this formula, the Jacobian determinant $\det(\partial T(z) / \partial z)$ is a normalizing term which ensures that $p_X(\cdot)$ is a valid probability distribution. The distribution $p_Z(\cdot)$ is referred to as the base distribution and the transformation T is referred to as a flow.
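As a minimal sketch of formula 1, consider a one-dimensional affine flow T(z) = a z + b with a standard normal base distribution. The flow, the function names and the parameter values are illustrative assumptions and do not appear in the thesis; the result can be compared with the closed-form density of a normal distribution with mean b and standard deviation |a|.

```python
import math


def standard_normal_pdf(z):
    """Base density p_Z: standard normal."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def affine_flow_density(x, a, b):
    """Density of X = T(Z) = a * Z + b computed with formula 1."""
    z = (x - b) / a          # z = T^{-1}(x)
    jac_det = a              # dT/dz of the 1D affine map
    return standard_normal_pdf(z) / abs(jac_det)


def normal_pdf(x, mean, std):
    """Closed-form reference: density of N(mean, std^2)."""
    u = (x - mean) / std
    return math.exp(-0.5 * u * u) / (std * math.sqrt(2.0 * math.pi))


if __name__ == "__main__":
    print(affine_flow_density(1.3, a=2.0, b=0.5))
    print(normal_pdf(1.3, mean=0.5, std=2.0))
```

Both printed values should agree up to floating point error, since dividing by |a| is exactly the volume correction that the Jacobian determinant provides.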

The composition T1◦ T2 of two invertible and differentiable functions T1 and T2 is again invertible and differentiable. Its inverse and Jacobian determinant are defined as

$$(T_1 \circ T_2)^{-1} = T_2^{-1} \circ T_1^{-1} \tag{2}$$

$$\det \frac{\partial (T_1 \circ T_2)(z)}{\partial z} = \det \left. \frac{\partial T_1(u)}{\partial u} \right|_{u = T_2(z)} \cdot \det \frac{\partial T_2(z)}{\partial z} \tag{3}$$

Accordingly, it is possible to create a complex invertible and differentiable transformation as a sequence of simpler invertible and differentiable transformations. Note that a single application of formula 1 using a composite transformation can be seen as a repeated application of formula 1 for the sequence of transformations that make up the composite transformation. Consequently, deep learning literature widely refers to compositions of invertible transformations as a normalizing flow or normalizing flows [19,25].

When implementing a normalizing flow model, an invertible and differentiable transformation is modeled using a neural network. In theory, the transformation(s) can be arbitrarily complex. However, to efficiently optimize normalizing flows and sample from them in practice, the transformations must be constructed in such a way that both their Jacobian determinant and their inverse can be efficiently computed.

2.1.1.1 Real NVP

Real-valued non-volume preserving (Real NVP) [8] transformations are a class of bijective functions that allow us to build normalizing flows that can be efficiently optimized. The Real NVP normalizing flow consists of multiple affine coupling layers, each of which is a simple bijection that is easy to invert and whose Jacobian determinant can be efficiently computed. The affine coupling layers are embedded into a multi-scale architecture.

Affine coupling layer. Every affine coupling layer models a simple bijection in which one part of the input stays the same and the other part is updated. This partitioning is implemented using a binary mask. In the overall architecture, two different kinds of masks are used: checkerboard masks, which divide the grid-like input in an alternating manner, and channel-wise masks, which divide the input by channels. Assuming without loss of generality that the mask partitions an input $x \in \mathbb{R}^D$ into $x_{1:d}$ and $x_{d+1:D}$, the output $z \in \mathbb{R}^D$ is calculated as

$$z_{1:d} = x_{1:d} \tag{4}$$

$$z_{d+1:D} = x_{d+1:D} \odot \exp(s(x_{1:d})) + t(x_{1:d}) \tag{5}$$

where s and t are arbitrarily complex functions and $\odot$ denotes element-wise multiplication. The transformation has two crucial characteristics. First, its Jacobian is a lower triangular matrix, whose determinant can be efficiently calculated as the product of its diagonal terms.

$$J = \frac{\partial z}{\partial x^T} = \begin{pmatrix} I_d & 0 \\ \frac{\partial z_{d+1:D}}{\partial x_{1:d}^T} & \mathrm{diag}(\exp[s(x_{1:d})]) \end{pmatrix} \tag{6}$$

$$\det(J) = \exp\Big( \sum_j s(x_{1:d})_j \Big) \tag{7}$$

Secondly, the transformation is easily invertible, which means that sampling from the model after or during training is as computationally efficient as performing inference.

$$x_{1:d} = z_{1:d}, \qquad x_{d+1:D} = (z_{d+1:D} - t(z_{1:d})) \odot \exp(-s(z_{1:d})) \tag{8}$$
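The forward pass (formulas 4 and 5), the log-determinant (formula 7) and the inverse (formula 8) can be sketched in a few lines. The functions `toy_s` and `toy_t` below are hypothetical stand-ins for the scale and translation networks; any deterministic function of the unchanged part would serve, which is the point of the construction.

```python
import math


def toy_s(x_fixed, n_out):
    """Hypothetical stand-in for the scale network s."""
    return [0.5 * math.tanh(sum(x_fixed) + j) for j in range(n_out)]


def toy_t(x_fixed, n_out):
    """Hypothetical stand-in for the translation network t."""
    return [math.sin(sum(x_fixed)) + j for j in range(n_out)]


def coupling_forward(x, d):
    """Formulas 4 and 5; also returns log|det J| from formula 7."""
    x_fixed, x_upd = x[:d], x[d:]
    s = toy_s(x_fixed, len(x_upd))
    t = toy_t(x_fixed, len(x_upd))
    z_upd = [xi * math.exp(si) + ti for xi, si, ti in zip(x_upd, s, t)]
    log_det = sum(s)  # log of the product of the diagonal terms exp(s_j)
    return x_fixed + z_upd, log_det


def coupling_inverse(z, d):
    """Formula 8; s and t are only evaluated, never inverted."""
    z_fixed, z_upd = z[:d], z[d:]
    s = toy_s(z_fixed, len(z_upd))
    t = toy_t(z_fixed, len(z_upd))
    x_upd = [(zi - ti) * math.exp(-si) for zi, si, ti in zip(z_upd, s, t)]
    return z_fixed + x_upd


if __name__ == "__main__":
    x = [0.3, -1.2, 2.0, 0.7]
    z, log_det = coupling_forward(x, d=2)
    print(z, log_det)
    print(coupling_inverse(z, d=2))  # recovers x up to rounding error
```

Because the first d components pass through unchanged, the inverse can recompute exactly the same s and t that the forward pass used.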

It is important to notice that neither computing the Jacobian determinant nor computing the inverse of the coupling layer requires the derivative of s or t, which is why they can be arbitrarily complex. For the experiments in the Real NVP paper, s and t are modeled with a deep residual network [16].

Architecture. The overall Real NVP model can be considered to consist of several building blocks. Each of these blocks consists of three affine coupling layers with alternating checkerboard masking, followed by a squeezing operation and three more affine coupling layers with alternating channel-wise masking. The alternating masks ensure that each input component is updated. The two different types of masking are necessary to make sure that the input is partitioned differently in different layers. Finally, the last building block consists of four more affine coupling layers with alternating checkerboard masking. To save computational and memory costs, the complete D-dimensional input is not propagated through all layers. Instead, part of the input is factored out, i.e. made part of the final transformed vector, at regular intervals.

2.1.2 Discrete Flow-based Models

Both Hoogeboom et al. [19] and Tran et al. [51] recently explored flow-based models that operate directly in the discrete space and therefore do not require data (de-)quantization. Considering two discrete random variables X and Z with discrete domains $\mathcal{X}$ and $\mathcal{Z}$ and an invertible transformation $T : \mathcal{Z} \to \mathcal{X}$, the change of variables formula for continuous random variables given in formula 1 simplifies to

$$p_X(x) = p_Z(z) \quad \text{with } z = T^{-1}(x) \tag{9}$$

Note that normalizing with the Jacobian determinant is not necessary anymore. In continuous space, the Jacobian determinant corrects for a change in volume. Discrete distributions, however, have no volume because they only have support on a discrete set of points. Consequently, the transformation T is not required to have any more properties than being efficiently invertible.

As pointed out by Papamakarios et al. [37], discrete flow-based models can only permute probabilities in the probability tensor that represents the distribution of a random variable. However, van den Berg et al. [5] show that the number of possible values that the discrete random variable can take on plays a critical role in the flexibility of discrete flow-based models. By expanding the variable's domain, the modelling capacity of the discrete flow-based model increases. Note that in the IDF, the variable's domain is infinitely large by default, since it is defined as the integers.
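The permutation view of formula 9 can be made concrete with a small sketch. The helper `pushforward_pmf` and the cyclic shift used as bijection are hypothetical names and choices for illustration only.

```python
def pushforward_pmf(p_z, T):
    """Formula 9: p_X(x) = p_Z(T^{-1}(x)) for a bijection T on {0, ..., K-1}."""
    K = len(p_z)
    if sorted(T(z) for z in range(K)) != list(range(K)):
        raise ValueError("T must be a bijection on {0, ..., K-1}")
    p_x = [0.0] * K
    for z in range(K):
        p_x[T(z)] = p_z[z]  # probability mass is moved, never rescaled
    return p_x


if __name__ == "__main__":
    p_z = [0.1, 0.2, 0.3, 0.4]
    p_x = pushforward_pmf(p_z, lambda z: (z + 1) % 4)
    print(p_x)  # the same four masses in a different order
```

No Jacobian term appears: the multiset of probabilities is unchanged, so a discrete flow on a fixed finite domain cannot create or flatten modes, only relocate them.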

Information-theoretic interpretation. It is important to notice that, besides functioning as generative models, discrete flow-based models also have an information-theoretic interpretation. Following Shannon's source coding theorem [10], given a data distribution $p_X$ the optimal code length for a symbol x is $-\log p_X(x)$ and the expected code length is lower-bounded by the distribution entropy. Formally,

$$\mathbb{E}_{x \sim p_X}[l(c(x))] \approx \mathbb{E}_{x \sim p_X}[-\log p_{\mathrm{model}}(x)] \geq H(p_X) \tag{10}$$

where $p_{\mathrm{model}}(x)$ is the statistical model used by the encoder, c(x) denotes the codeword of x, $l(\cdot)$ is the length and $H(p_X)$ denotes the distribution entropy. The encoded message is chosen according to the statistical model, such that $l(c(x)) \approx -\log p_{\mathrm{model}}(x)$. If the encoder is optimal, maximizing the model log-likelihood is equivalent to minimizing the average codeword length, i.e. the average number of bits required per codeword.


Since discrete flow-based models can directly optimize the log-likelihood of discretely distributed data, they are implicitly minimizing the number of bits per dimension (BPD) of the input data. In contrast to continuous flow-based models, discrete flow-based models do not have to quantize the latent space to get a discrete latent representation z for an input x. Therefore, no errors occur when reconstructing x from z and they effectively perform lossless compression.
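The bound in formula 10 can be checked numerically on a toy alphabet. The sketch below assumes base-2 logarithms so that code lengths are measured in bits; the function names and the example distributions are illustrative.

```python
import math


def entropy_bits(p):
    """Entropy H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def expected_code_length_bits(p_data, p_model):
    """Expected ideal code length E_{x ~ p_data}[-log2 p_model(x)]."""
    return -sum(pd * math.log2(pm) for pd, pm in zip(p_data, p_model) if pd > 0)


if __name__ == "__main__":
    p_data = [0.5, 0.25, 0.25]
    uniform_model = [1 / 3, 1 / 3, 1 / 3]
    print(entropy_bits(p_data))                              # lower bound
    print(expected_code_length_bits(p_data, uniform_model))  # mismatched model
    print(expected_code_length_bits(p_data, p_data))         # perfect model
```

A mismatched model pays a penalty above the entropy, and the gap closes only when the model matches the data distribution, which is why maximizing the log-likelihood is the same as improving the compression rate.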

2.1.2.1 Integer Discrete Flows

Integer Discrete Flows [19] model distributions over ordinal discrete data. This means the domains of the random variables X and Z from formula 9 are defined as the integer values $\mathbb{Z}^D$. Similar to the Real NVP normalizing flow, the Integer Discrete Flow is composed of several easily invertible bipartite coupling layers. These so-called additive coupling layers are again embedded into a multilayer architecture that uses binary masks and factors out parts of the input vector at regular intervals.

Additive coupling layer. Using the fact that the integers form a group under addition, the authors define the bijection $T^{-1} : \mathbb{Z}^D \to \mathbb{Z}^D$ as

$$z_{1:d} = x_{1:d}, \qquad z_{d+1:D} = x_{d+1:D} + \lfloor t_\theta(x_{1:d}) \rceil \tag{11}$$

which can be inverted to obtain $T : \mathbb{Z}^D \to \mathbb{Z}^D$ as

$$x_{1:d} = z_{1:d}, \qquad x_{d+1:D} = z_{d+1:D} - \lfloor t_\theta(z_{1:d}) \rceil \tag{12}$$

The translation $t_\theta(\cdot)$ is modeled by a neural network with parameters θ. It has to be rounded to the nearest integer to ensure that T and $T^{-1}$ map to the discrete space $\mathbb{Z}^D$. When θ is optimized using backpropagation [29, 43], the gradient has to be backpropagated through the rounding operation. This is problematic because the rounding operation is a step function and consequently its gradient is 0 almost everywhere. Hoogeboom et al. work around this problem using the straight-through estimator (STE) [4]. The STE approximates the gradient of the rounding function with the identity matrix, which is equivalent to simply disregarding the rounding operation during the backward pass. As a consequence, the IDF's gradients are biased.

For continuous flows, one can simply stack more coupling layers to enable them to model a more complex distribution. This is not possible with the IDF because, beyond a certain number of stacked layers, the performance of the model decreases as more layers are added. The authors attribute this effect to gradient bias, which is assumed to increase the more layers are stacked. The authors' approach to minimizing the IDF's gradient bias is to make the neural networks in the coupling layers bigger and therefore more expressive. Consequently, fewer coupling layers are necessary, which means fewer rounding operations are involved. Another tactic to reduce gradient bias is to set the split index d (see formulas 11 and 12) to $d = \frac{3}{4} D$ instead of the usual $d = \frac{D}{2}$. Thereby, the rounding operation is applied to fewer dimensions.
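The following sketch illustrates formulas 11 and 12 and the mismatch that the straight-through estimator introduces. The function `toy_translation` is a hypothetical stand-in for the network $t_\theta$, and Python's built-in `round` (which rounds halves to the nearest even integer) is assumed as the rounding operation.

```python
def toy_translation(x_fixed, n_out):
    """Hypothetical real-valued stand-in for the translation network."""
    return [0.37 * sum(x_fixed) + 0.5 * j for j in range(n_out)]


def idf_forward(x, d):
    """Formula 11: integers in, integers out."""
    t = toy_translation(x[:d], len(x) - d)
    return x[:d] + [xi + round(ti) for xi, ti in zip(x[d:], t)]


def idf_inverse(z, d):
    """Formula 12: subtracts the same rounded translation."""
    t = toy_translation(z[:d], len(z) - d)
    return z[:d] + [zi - round(ti) for zi, ti in zip(z[d:], t)]


def round_finite_difference(t, eps=1e-4):
    """Local slope of rounding, which is 0 away from the jump points."""
    return (round(t + eps) - round(t - eps)) / (2 * eps)


def round_ste_gradient(t):
    """Straight-through estimator: treats rounding as the identity."""
    return 1.0


if __name__ == "__main__":
    x = [3, -1, 4, 7]
    z = idf_forward(x, d=2)
    print(z, idf_inverse(z, d=2))
    print(round_finite_difference(0.3), round_ste_gradient(0.3))
```

The layer is exactly invertible on the integers, while the slope used in the backward pass (1) differs from the local slope of the rounding function (0); this discrepancy is the source of the bias discussed above.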

2.1.2.2 Discrete Flows

Tran et al. introduced Discrete Flows for non-ordinal data [51]. In this case, the domains of the random variables X and Z from formula 9 are defined as D-dimensional vectors of a finite number of K possible values, i.e. $\mathcal{X} = \mathcal{Z} = \{0, \dots, K-1\}^D$, where the values $\{0, \dots, K-1\}$ are represented as one-hot vectors. Note that in the following, we refer to the bipartite version rather than the autoregressive version of Discrete Flows. This version of Discrete Flows consists of several easily invertible coupling layers that each utilize a modulo location-scale transform.

Modulo location-scale transform. The transformation $T^{-1} : \{0, 1, \dots, K-1\}^D \to \{0, 1, \dots, K-1\}^D$ is defined as

$$z_{1:d} = x_{1:d}, \qquad z_{d+1:D} = \big( s_{\theta_1}(x_{1:d}) \circ x_{d+1:D} + t_{\theta_2}(x_{1:d}) \big) \bmod K \tag{13}$$

where ◦ denotes element-wise multiplication. Every element of the scale $s_{\theta_1}(\cdot)$ and the translation $t_{\theta_2}(\cdot)$ can take on values in $\{0, \dots, K-1\}$. This transformation is only invertible if each element of $s_{\theta_1}(x_{1:d})$ is coprime to K, which can be achieved by choosing K as a prime number (together with a non-zero scale), masking scale values that are not coprime to a given K, or setting the scale to 1. Tran et al. chose the latter for all but one of their experiments.
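The coprimality condition can be illustrated with a small sketch of the update in formula 13 and its inverse. The function names are hypothetical, and the modular inverse relies on the three-argument `pow` with a negative exponent, which requires Python 3.8 or later.

```python
import math


def modular_forward(x_upd, scale, shift, K):
    """Updated part of formula 13: z = (s * x + t) mod K, element-wise."""
    return [(s * x + t) % K for x, s, t in zip(x_upd, scale, shift)]


def modular_inverse(z_upd, scale, shift, K):
    """Inverse update: x = s^{-1} * (z - t) mod K, if s is coprime to K."""
    out = []
    for z, s, t in zip(z_upd, scale, shift):
        if math.gcd(s, K) != 1:
            raise ValueError("scale must be coprime to K to be invertible")
        s_inv = pow(s, -1, K)  # modular multiplicative inverse
        out.append((s_inv * (z - t)) % K)
    return out


if __name__ == "__main__":
    K = 5
    x, scale, shift = [0, 3, 4], [2, 3, 4], [1, 0, 4]
    z = modular_forward(x, scale, shift, K)
    print(z, modular_inverse(z, scale, shift, K))
    # With K = 4 and scale 2, two different inputs collide:
    print(modular_forward([0], [2], [0], 4), modular_forward([2], [2], [0], 4))
```

The last line shows why a scale that shares a factor with K breaks invertibility: distinct inputs map to the same output, so no inverse exists.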

The scale $s_{\theta_1}(\cdot)$ and the translation $t_{\theta_2}(\cdot)$ are modeled by a neural network with parameters $\theta_{1,2}$. However, since both $s_{\theta_1}(x_{1:d})$ and $t_{\theta_2}(x_{1:d})$ have to be discrete vectors in $\{0, \dots, K-1\}^{D-d}$, the authors utilize the Gumbel-Softmax [21] or Concrete [30] estimator. The network outputs two vectors of size (D − d) × K, each of which serves as logits for obtaining (D − d) one-hot vectors of length K. The detailed functionality of the Gumbel-Softmax/Concrete gradient estimator is described in section 2.2.3. Note that the network output corresponds to the value of the intermediate random variable z in formula 31. Tran et al. fix the gradient estimator's temperature parameter τ to 0.1.

In comparison to the Integer Discrete Flow, the Discrete Flow considers only a finite number of classes instead of a countably infinite number of classes. The classes considered by the Discrete Flow have no order, which is achieved by representing them as one-hot vectors. Finally, instead of a rounding operation, the Discrete Flow utilizes the Gumbel-Softmax/Concrete gradient estimator to make the neural network parameters $\theta_{1,2}$ trainable by backpropagation.

2.2 Gradient Estimators

Gradient-based optimization lies at the heart of modern-day machine learning. Arguably the most important milestone in the field was the backpropagation algorithm [29,43], which enabled the training of neural networks. This algorithm calculates the exact gradients of an objective function with respect to the parameters of a neural network and updates the parameters accordingly. There are, however, some tasks in machine learning that require calculating the gradient of a function's expected value rather than the gradient of the function itself. One such task is variational inference, where it is not possible to optimize the likelihood directly and therefore the evidence lower bound (ELBO) is optimized instead [22]. Other examples are reinforcement learning tasks that optimize the expectation of a discrete reward, such as policy-gradient or actor-critic methods [46].

Formal setup. Let $x \sim p_\theta(x)$ be a D-dimensional random variable with a probability distribution parameterized by θ. The goal is to find parameters θ that maximize the expectation of a function f(x), i.e.

$$\mathbb{E}_{x \sim p_\theta(x)}[f(x)] \tag{14}$$

For optimizing this objective with (stochastic) gradient descent, it is necessary to calculate its gradient w.r.t. θ

$$\nabla_\theta \mathbb{E}_{x \sim p_\theta(x)}[f(x)] \tag{15}$$

Typically, both the objective given in formula 14 and its gradient w.r.t. the parameters θ given in formula 15 are intractable. Therefore, a variety of estimators for the gradient in formula 15 have been introduced [13,21,22,30,46,52]. Two potential weaknesses of gradient estimators are variance and bias: high variance of gradient estimates causes slow or even inconsistent optimization, while biased gradient estimates are likely to lead to convergence to a local optimum.

2.2.1 Reparameterization

One way of computing the gradient in formula 15 is to make use of the reparameterization trick [22]. That is, x is no longer sampled directly from the probability distribution $p_\theta(x)$; instead, the sampling process of x comprises first sampling a noise variable $\epsilon$ and then applying a deterministic function $g_\theta$ parameterized by θ to it.

$$x = g_\theta(\epsilon) \quad \text{with } \epsilon \sim p(\epsilon) \tag{16}$$

The stochasticity in sampling x now comes from the noise variable $\epsilon$, while $g_\theta$ is defined to be differentiable w.r.t. its parameters θ. This allows us to rewrite the gradient in formula 15 as

$$\nabla_\theta \mathbb{E}_{x \sim p_\theta(x)}[f(x)] = \nabla_\theta \mathbb{E}_{\epsilon \sim p(\epsilon)}[f(g_\theta(\epsilon))] \tag{17}$$

$$= \mathbb{E}_{\epsilon \sim p(\epsilon)}[\nabla_\theta f(g_\theta(\epsilon))] \tag{18}$$

As demonstrated by Kingma et al. [22], the expectation in formula 18 can then be estimated with a single sample. This leads to the reparameterization gradient estimator

$$g_{\mathrm{REPARAM}} = \nabla_\theta f(g_\theta(\epsilon)) \quad \text{with } \epsilon \sim p(\epsilon) \tag{19}$$

The gradient estimates obtained with reparameterization are unbiased. However, the reparameterization trick can only be applied if $p_\theta(x)$ can be reparameterized, $g_\theta(\epsilon)$ is differentiable w.r.t. θ, and f(x) is differentiable w.r.t. x. Due to the latter, the reparameterization trick is not applicable for discrete random variables x.
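A minimal sketch of the estimator in formula 19, assuming a Gaussian x = μ + σε with ε drawn from a standard normal and the toy objective f(x) = x². These choices and all names are illustrative; for this setup the true gradient of the expectation w.r.t. μ is 2μ, which gives a reference value.

```python
import random


def reparam_grad_mu(mu, sigma, eps):
    """Single-sample estimate of d/dmu E[f(x)] for f(x) = x**2.

    x = g_theta(eps) = mu + sigma * eps, so d f(g_theta(eps)) / d mu = 2 * x.
    """
    x = mu + sigma * eps
    return 2.0 * x


def average_estimate(mu, sigma, n, seed=0):
    """Average of n single-sample estimates; approaches 2 * mu."""
    rng = random.Random(seed)
    total = sum(reparam_grad_mu(mu, sigma, rng.gauss(0.0, 1.0)) for _ in range(n))
    return total / n


if __name__ == "__main__":
    print(average_estimate(mu=1.5, sigma=0.5, n=20000))  # close to 3.0
```

Each single-sample estimate uses the derivative of f at the sampled point, which is exactly the information that is unavailable when x is discrete.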

2.2.2 REINFORCE

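The REINFORCE, or score-function, estimator [46, 54] makes use of the simple log-derivative trick

$$\nabla_\theta p_\theta(x) = p_\theta(x) \nabla_\theta \log p_\theta(x) \tag{20}$$

Using this property, the gradient given in formula 15 can be reformulated as

$$\nabla_\theta \mathbb{E}_{x \sim p_\theta(x)}[f(x)] = \nabla_\theta \int f(x) p_\theta(x) \, dx \tag{21}$$

$$= \int f(x) \nabla_\theta p_\theta(x) \, dx \tag{22}$$

$$= \int f(x) p_\theta(x) \nabla_\theta \log p_\theta(x) \, dx \tag{23}$$

$$= \mathbb{E}_{x \sim p_\theta(x)}[f(x) \nabla_\theta \log p_\theta(x)] \tag{24}$$

Note that for a discrete distribution $p_\theta(x)$ one could simply replace the integration with a sum. The standard REINFORCE estimator is defined as the single-sample estimator of the expectation in formula 24, i.e.

$$g_{\mathrm{REINFORCE}} = f(x) \nabla_\theta \log p_\theta(x) \quad \text{with } x \sim p_\theta(x) \tag{25}$$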

Since the expected value of the REINFORCE gradient estimator is equal to the true gradient, it is an unbiased estimator. At the same time, the estimator's variance is very high, which impedes optimization. The high variance of the estimator can be attributed to the fact that it does not use any information about how f depends on x but only the final result f(x).

It is important to notice that the only requirement for using the REINFORCE estimator is knowing $p_\theta(x)$ and being able to easily sample from it. In contrast to the reparameterization estimator, f does not need to be differentiable w.r.t. x. Consequently, REINFORCE can be used if x is a discrete random variable.
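As an illustration of the single-sample REINFORCE estimator, the sketch below assumes a Bernoulli variable with parameter θ and the toy objective f(x) = (x − 0.45)². Both are hypothetical choices for which the true gradient, f(1) − f(0), is available in closed form.

```python
import random


def f(x):
    """Toy objective; only its values f(0) and f(1) are ever needed."""
    return (x - 0.45) ** 2


def score(x, theta):
    """d/dtheta log p_theta(x) for x ~ Bernoulli(theta)."""
    return 1.0 / theta if x == 1 else -1.0 / (1.0 - theta)


def reinforce_sample(theta, rng):
    """Single-sample REINFORCE estimate f(x) * score(x)."""
    x = 1 if rng.random() < theta else 0
    return f(x) * score(x, theta)


def reinforce_mean(theta, n, seed=0):
    """Average of n single-sample estimates."""
    rng = random.Random(seed)
    return sum(reinforce_sample(theta, rng) for _ in range(n)) / n


def true_gradient():
    """E[f] = theta * f(1) + (1 - theta) * f(0), so the gradient is f(1) - f(0)."""
    return f(1) - f(0)


if __name__ == "__main__":
    print(reinforce_mean(theta=0.3, n=50000), true_gradient())
```

The average converges to the true gradient without ever differentiating f, but individual estimates vary widely around it, which is the variance problem described above.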

2.2.2.1 Variance Reduction with Control Variates

A popular approach to reducing REINFORCE's high variance is to introduce a control variate [27], which can be a random variable or a constant. To effectively reduce variance, the control variate should be closely related to the estimator, and its expectation must be efficiently computable in closed form so that the introduced bias can be accounted for. Formally, given a control variate c(x), the REINFORCE estimator given in formula 24 can be extended without introducing bias as

$$\nabla_\theta \mathbb{E}_{x \sim p_\theta(x)}[f(x)] \tag{26}$$

$$= \nabla_\theta \mathbb{E}_{x \sim p_\theta(x)}[f(x)] - \nabla_\theta \mathbb{E}_{x \sim p_\theta(x)}[c(x)] + \nabla_\theta \mathbb{E}_{x \sim p_\theta(x)}[c(x)] \tag{27}$$

$$= \nabla_\theta \mathbb{E}_{x \sim p_\theta(x)}[f(x) - c(x)] + \nabla_\theta \mathbb{E}_{x \sim p_\theta(x)}[c(x)] \tag{28}$$

$$= \mathbb{E}_{x \sim p_\theta(x)}[(f(x) - c(x)) \nabla_\theta \log p_\theta(x)] + \nabla_\theta \mathbb{E}_{x \sim p_\theta(x)}[c(x)] \tag{29}$$

If the control variate c does not depend on the random variable x, equation 29 further simplifies because

$$\nabla_\theta \mathbb{E}_{x \sim p_\theta(x)}[c] = \nabla_\theta \int p_\theta(x) \, c \, dx = c \, \nabla_\theta \int p_\theta(x) \, dx = 0 \tag{30}$$

Most approaches to reducing the variance of the REINFORCE estimator have focused on the development of appropriate control variates [2,13,33].
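The effect of a constant control variate can be verified exactly for a Bernoulli variable by enumerating both outcomes. The objective f, the parameter θ = 0.3 and the baseline value c = 0.25 are assumptions chosen for illustration.

```python
def f(x):
    """Toy objective, the same kind of function as in a REINFORCE example."""
    return (x - 0.45) ** 2


def score(x, theta):
    """d/dtheta log p_theta(x) for x ~ Bernoulli(theta)."""
    return 1.0 / theta if x == 1 else -1.0 / (1.0 - theta)


def estimator_moments(theta, c):
    """Exact mean and variance of (f(x) - c) * score(x) under Bernoulli(theta)."""
    probs = {1: theta, 0: 1.0 - theta}
    values = {x: (f(x) - c) * score(x, theta) for x in probs}
    mean = sum(probs[x] * values[x] for x in probs)
    second_moment = sum(probs[x] * values[x] ** 2 for x in probs)
    return mean, second_moment - mean ** 2


if __name__ == "__main__":
    print(estimator_moments(theta=0.3, c=0.0))   # plain REINFORCE
    print(estimator_moments(theta=0.3, c=0.25))  # with a constant baseline
```

Both calls return the same mean, in line with formula 30, while the variance with the baseline is much smaller because f(x) − c is close to zero for both outcomes.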

2.2.3 Gumbel-Softmax / Concrete

In the case that x is a discrete categorical or Bernoulli random variable, the gradient from formula 15 can be estimated with the Gumbel-Softmax [21], or Concrete [30], estimator. This estimator makes use of the well-known Gumbel-Max trick [14, 31], which allows sampling from a categorical distribution by simply adding noise to the log probabilities of each category and then choosing the category with the maximum resulting value. Formally, if we want to sample a D-dimensional one-hot vector x from a categorical distribution with class probabilities $\theta_i$ for $i \in \{1, \dots, D\}$, this sampling process can be written as

$$z_i = g_i + \log \theta_i \quad \text{with } g_i \sim \mathrm{Gumbel}(0, 1) \tag{31}$$

$$x = H(z) = \mathrm{one\_hot}\big( \arg\max_i [z_i] \big) \tag{32}$$

where the standard Gumbel noise $g_i$ is reparameterized using inverse transform sampling [45], i.e.

$$g_i = -\log(-\log(u_i)) \quad \text{with } u_i \sim \mathrm{Uniform}(0, 1) \tag{33}$$

Since the argmax operation is not differentiable, it is not possible to backpropagate through the hard threshold function H in formula 32. To enable the optimization of the parameters θ, both Jang et al. [21] and Maddison et al. [30] propose to replace the gradient of the categorical sample x by the gradient of a differentiable approximation of x, referred to as $\tilde{x}$. This approximation replaces the hard threshold function H by the softmax function $S_\tau$ with a temperature parameter τ.

$$\tilde{x} = S_\tau(z) \tag{34}$$

$$\tilde{x}_i = \frac{\exp(z_i / \tau)}{\sum_{d=1}^{D} \exp(z_d / \tau)} \tag{35}$$

The distribution $p_{\theta,\tau}(\tilde{x})$ is referred to as the Gumbel-Softmax or Concrete distribution. As the temperature τ approaches 0, $p_{\theta,\tau}(\tilde{x})$ becomes the categorical distribution $p_\theta(x)$. There is a trade-off between small temperatures, where samples are close to those of $p_\theta(x)$ but the variance of the gradients is large, and large temperatures, where samples are smooth but the variance of the gradients is small. In practice, τ starts at a high temperature and is annealed to a small but non-zero temperature.
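The sampling path of formulas 31 to 35 can be sketched as follows. The function names, the temperature value and the class probabilities are illustrative assumptions; the softmax subtracts the maximum before exponentiating, which leaves formula 35 unchanged but avoids overflow at small temperatures.

```python
import math
import random


def sample_gumbel(rng):
    """Formula 33: standard Gumbel noise via inverse transform sampling."""
    u = rng.random()
    while u == 0.0:  # random() may return 0.0, where log is undefined
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_max_and_softmax(theta, tau, rng):
    """Returns the hard one-hot sample (formula 32) and its relaxation (formula 35)."""
    z = [sample_gumbel(rng) + math.log(p) for p in theta]      # formula 31
    k = max(range(len(z)), key=z.__getitem__)                  # argmax
    x_hard = [1 if i == k else 0 for i in range(len(z))]
    z_max = max(z)
    e = [math.exp((zi - z_max) / tau) for zi in z]             # stable softmax
    total = sum(e)
    x_soft = [ei / total for ei in e]
    return x_hard, x_soft


if __name__ == "__main__":
    rng = random.Random(0)
    for tau in (5.0, 0.5, 0.1):
        print(tau, gumbel_max_and_softmax([0.2, 0.5, 0.3], tau, rng))
```

Lower temperatures push the relaxed sample towards the one-hot vector, while the hard sample itself follows the categorical distribution regardless of τ.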

Formally, the Gumbel-Softmax or Concrete estimator approximates the gradient given in formula 15 for a categorically distributed x as

$$\nabla_\theta \mathbb{E}_{x \sim p_\theta(x)}[f(x)] = \nabla_\theta \mathbb{E}_{z \sim p_\theta(z)}[f(H(z))] \tag{36}$$

$$\approx \nabla_\theta \mathbb{E}_{z \sim p_\theta(z)}[f(S_\tau(z))] \tag{37}$$

$$= \mathbb{E}_{u \sim \mathrm{Uniform}(0,1)}[\nabla_\theta f(S_\tau(z_\theta(u)))] \tag{38}$$

utilizing the intermediate variable z and the functions $H(\cdot)$ and $S_\tau(\cdot)$ introduced in formulas 31, 32, and 34 respectively. Due to the approximation in the step from formula 36 to formula 37, the estimator is biased. Looking at formula 38, we can see a resemblance to the reparameterization gradient given in formula 18. Intuitively, the Gumbel-Softmax/Concrete estimator can be thought of as a reparameterization for a distribution that is smoothly annealed to the categorical distribution.

2.2.4 REBAR

The REBAR gradient estimator [52] for categorical or Bernoulli random variables combines the Gumbel-Softmax/Concrete and the REINFORCE estimator. The fundamental idea is to use REINFORCE with the differentiable approximation of a categorical sample introduced in Gumbel-Softmax/Concrete as control variate. Since REINFORCE is unbiased and the expectation of the control variate is added back, REBAR is an unbiased estimator.

Building upon Gumbel-Softmax/Concrete, REBAR makes use of the reparameterized standard Gumbel noise sample g and standard uniform sample u from formula 33, the intermediate random variable z from formula 31, the categorical random variable x from formula 32 and its differentiable approximation $\tilde{x}$ from formula 35 as well as the hard threshold function H and its differentiable approximation $S_\tau$ which were introduced in formulas 32 and 34 respectively. Recall that

g = −log(− log(u)) with u∼ U(0, 1) (39)

z = G(θ, u) = g + log(θ) (40)

x = H(z) (41)

˜x = Sτ(z) (42)

REBAR adds one more stochastic variable, which we refer to as $\tilde{z}$. This variable expresses the value of z given x, i.e. which values of z could have led to a given value of x. Given a one-hot categorical sample x with $x_k = 1$, $\tilde{z}$ is defined as

$$\tilde{z}_i = \begin{cases} -\log(-\log(v_k)) & \text{if } i = k \\ -\log\left(-\frac{\log v_i}{\theta_i} - \log(v_k)\right) & \text{otherwise} \end{cases} \tag{43}$$

$$\text{with } v \sim \mathrm{Uniform}(0, 1) \tag{44}$$

where v is another standard uniform variable.
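The conditional sampling in formula 43 can be sketched as follows. This is a minimal illustration that assumes normalized class probabilities θ and uses hypothetical helper names; it checks the property that makes $\tilde{z}$ useful, namely that applying the hard threshold H to $\tilde{z}$ always recovers the given class k.

```python
import math
import random


def uniform_open(rng):
    # Uniform sample clamped away from 0 and 1 so all logarithms stay finite
    # (the clamping constant is an assumption of this sketch).
    return min(max(rng.random(), 1e-12), 1.0 - 1e-12)


def sample_z_tilde(theta, k, rng):
    # Formula 43: sample z~ from p(z | x) for a one-hot x with x_k = 1.
    # theta is assumed to be a list of normalized class probabilities.
    v = [uniform_open(rng) for _ in theta]
    z_tilde = []
    for i, theta_i in enumerate(theta):
        if i == k:
            z_tilde.append(-math.log(-math.log(v[k])))
        else:
            z_tilde.append(-math.log(-math.log(v[i]) / theta_i - math.log(v[k])))
    return z_tilde


rng = random.Random(0)
theta = [0.2, 0.5, 0.3]  # example class probabilities (assumption)
z_tilde = sample_z_tilde(theta, k=2, rng=rng)
# The largest entry of z~ sits at index k, so H(z~) reproduces x.
print(z_tilde, max(range(len(z_tilde)), key=lambda i: z_tilde[i]))
```

The entry for i ≠ k subtracts a strictly positive term inside the outer logarithm, which is why it can never exceed the entry at index k.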

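As mentioned above, the goal of the REBAR estimator is to have the differentiable approximation $\tilde{x}$ of the categorical sample x as our control variate. The key insight in the REBAR paper is that one can perform a conditional marginalization over z given x to formulate this control variate. Specifically,

$$\nabla_\theta \mathbb{E}_{p_\theta(z)}[f(S_\tau(z))] \tag{45}$$

$$= \mathbb{E}_{p_\theta(z)}\left[f(S_\tau(z)) \nabla_\theta \log p_\theta(z)\right] \tag{46}$$

$$= \mathbb{E}_{p_\theta(x)}\left[\mathbb{E}_{p_\theta(z|x)}\left[f(S_\tau(z)) \nabla_\theta \log p_\theta(z)\right]\right] \tag{47}$$

$$= \mathbb{E}_{p_\theta(x)}\left[\mathbb{E}_{p_\theta(z|x)}\left[f(S_\tau(z)) \nabla_\theta \left(\log p_\theta(z|x) + \log p_\theta(x)\right)\right]\right] \tag{48}$$

$$= \mathbb{E}_{p_\theta(x)}\left[\nabla_\theta \mathbb{E}_{p_\theta(z|x)}\left[f(S_\tau(z))\right]\right] + \mathbb{E}_{p_\theta(x)}\left[\mathbb{E}_{p_\theta(z|x)}\left[f(S_\tau(z))\right] \nabla_\theta \log p_\theta(x)\right] \tag{49}$$

$$= \mathbb{E}_{p_\theta(x)}\left[\mathbb{E}_{p(v)}\left[\nabla_\theta f(S_\tau(\tilde{z}))\right]\right] + \mathbb{E}_{p_\theta(x)}\left[\mathbb{E}_{p(v)}\left[f(S_\tau(\tilde{z}))\right] \nabla_\theta \log p_\theta(x)\right] \tag{50}$$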

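where the step from 47 to 48 exploits that if z is sampled as $z \sim p_\theta(z|x)$, x is determined as x = H(z) and hence

$$p_\theta(z)\,\mathbb{1}(x = H(z)) = p_\theta(z)\,p_\theta(x|z) = p_\theta(z|x)\,p_\theta(x) \tag{51}$$

where $\mathbb{1}(\cdot)$ denotes the indicator function. When going from 49 to 50, a reparameterization is performed using the definition of $\tilde{z}$ given in formula 43. Using formula 50 as well as the reparameterization

$$\mathbb{E}_{p_\theta(x)}[f(x)] = \mathbb{E}_{p(u)}[f(H(G(\theta, u)))] = \mathbb{E}_{p(u)}[f(H(z))] \tag{52}$$

it is possible to rewrite the expectation from formula 15 as

$$\nabla_\theta \mathbb{E}_{p_\theta(x)}[f(x)] \tag{53}$$

$$= \nabla_\theta \mathbb{E}_{p_\theta(x)}[f(x)] - \eta \nabla_\theta \mathbb{E}_{p_\theta(z)}[f(S_\tau(z))] + \eta \nabla_\theta \mathbb{E}_{p_\theta(z)}[f(S_\tau(z))] \tag{54}$$

$$= \mathbb{E}_{p(u,v)}\left[\left[f(H(z)) - \eta f(S_\tau(\tilde{z}))\right] \nabla_\theta \log p_\theta(x) + \eta \nabla_\theta f(S_\tau(z)) - \eta \nabla_\theta f(S_\tau(\tilde{z}))\right] \tag{55}$$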

where η is the scaling parameter of the control variate. The REBAR gradient estimator is the single sample estimator of the expectation in formula 55, i.e.

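$$g_{\mathrm{REBAR}} = \left[f(H(z)) - \eta f(S_\tau(\tilde{z}))\right] \nabla_\theta \log p_\theta(x) + \eta \nabla_\theta f(S_\tau(z)) - \eta \nabla_\theta f(S_\tau(\tilde{z})) \tag{56}$$

with $u, v \sim \mathrm{Uniform}(0, 1)$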

The scaling parameter η as well as the temperature parameter τ are treated as additional hyperparameters and optimized to minimize the variance of the REBAR estimator in formula 56.

The biggest drawbacks of the REBAR estimator are its high implementation complexity as well as its runtime. While the aforementioned estimators all work with a single forward pass per parameter update, the REBAR estimator requires three forward passes to obtain $f(H(z))$, $f(S_\tau(z))$ and $f(S_\tau(\tilde{z}))$ for one parameter update. It is also important to note that the REBAR estimator can only be applied if f(·) can be evaluated on non-discrete inputs like $S_\tau(z)$ and $S_\tau(\tilde{z})$.

2.2.5 RELAX

The RELAX gradient estimator [13] is a successor to the REBAR estimator. It addresses the drawbacks of REBAR by generalizing it to arbitrary baseline functions.

Recall that for REBAR, the baseline function is $f(S_\tau(\cdot))$. As already mentioned, this is a drawback of REBAR because it means that f must be evaluated on non-discrete values, which is not possible for every task. As an example, one might think of tasks from reinforcement learning, where the return function f is defined only on discrete inputs. REBAR's second drawback is that multiple evaluations of f are necessary for a single parameter update. RELAX addresses both of these drawbacks by introducing a baseline function which is learned. Specifically, this baseline function is a neural network $c_\phi$ with parameters φ. The RELAX gradient estimator is equivalent to the REBAR estimator given in formula 56, but with $c_\phi$ as baseline function:

$$g_{\mathrm{RELAX}} = \left[f(H(z)) - c_\phi(\tilde{z})\right] \nabla_\theta \log p_\theta(x) + \nabla_\theta c_\phi(z) - \nabla_\theta c_\phi(\tilde{z}) \tag{57}$$

As discussed in section 2.2.2.1, $c_\phi$ should be closely related to f to be an effective control variate. Therefore, $c_\phi$ is trained in parallel to f using the L2 loss function

$$\left(f(x) - c_\phi(x)\right)^2 \tag{58}$$

The authors illustrate RELAX’s ability to perform well for a variety of reinforcement learning toy examples. They also conduct an experiment involving a task in which both REBAR and RELAX could be applied. However, on the basis of this experiment, it is not clear whether one of the two estimators is generally preferable to the other.

3 Integer Discrete Flows

In this chapter, we present our approaches to mitigating gradient bias in Integer Discrete Flows. In a preliminary analysis in section 3.1, we measure the gradient bias and show that it accumulates over multiple stacked coupling layers. In section 3.2, we present our approaches to reducing the gradient bias. These include replacing the rounding operation in the original IDF coupling layer and introducing stochastic sampling layers to the IDF. We evaluate the effectiveness of our methods in the experiment section 3.3.

3.1 Analysis of Gradient Bias

In this initial analysis, we compare the Integer Discrete Flow gradients obtained by backpropagation against an unbiased first-order gradient approximation.

Following Viera [53], we use a two-sided finite difference approximation, also referred to as the central difference [55]. Given an objective function f(x, θ), an input $x \in \mathbb{R}^D$ and model parameters $\theta \in \mathbb{R}^N$, this approximation is given as

$$\nabla_{\theta_i} f(x, \theta) \approx \frac{f(x, \theta_i + \epsilon, \theta_{/i}) - f(x, \theta_i - \epsilon, \theta_{/i})}{2\epsilon} \tag{59}$$

where $\theta_{/i}$ denotes the vector of model parameters θ excluding the i-th parameter. The parameter ε is the step size of the approximation. Theoretically, this approximation becomes more accurate the smaller ε is chosen. However, we set ε = 0.01 to avoid errors related to the finite precision of floating point arithmetic [9]. As apparent from formula 59, two forward passes are necessary to obtain the finite difference gradient approximation for a single model parameter $\theta_i$. Since this is very costly, we can only compute the approximation for models with a small number of parameters.
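The central difference from formula 59 can be sketched in a few lines. The objective below is a hypothetical smooth toy function standing in for the model objective with the data batch x held fixed; it is not the IDF objective.

```python
def central_difference_gradient(f, theta, eps=0.01):
    # Formula 59: perturb one parameter at a time by +eps and -eps.
    # This costs two evaluations of f per parameter.
    grad = []
    for i in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[i] += eps
        minus[i] -= eps
        grad.append((f(plus) - f(minus)) / (2.0 * eps))
    return grad


def toy_objective(theta):
    # Hypothetical objective with analytic gradient
    # [2 * theta_0 + 3 * theta_1, 3 * theta_0].
    return theta[0] ** 2 + 3.0 * theta[0] * theta[1]


# At theta = [1, 2] the analytic gradient is [8, 3]; the approximation
# should agree up to floating point error.
print(central_difference_gradient(toy_objective, [1.0, 2.0]))
```

The loop makes the cost explicit: the number of forward passes grows linearly with the number of parameters, which is why the approximation is only feasible for small models.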

To measure the difference between the backpropagation gradients and finite difference approximation, we use three metrics: The cosine similarity [47] gives the similarity of the orientation of two vectors, where a cosine similarity of 1 corresponds to a matching orientation and −1 to an opposite orientation. The mean relative difference [53] is a scale-free error measure and the sign agreement [53] is the percentage of vector dimensions with the same sign.
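A minimal sketch of the three metrics is given below. The cosine similarity and sign agreement follow directly from their descriptions; the exact formula for the mean relative difference is not stated in this section, so the element-wise definition used here (absolute difference divided by the larger magnitude) is an assumption.

```python
import math


def cosine_similarity(a, b):
    # 1 for matching orientation, -1 for opposite orientation.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def sign_agreement(a, b):
    # Percentage of vector dimensions whose entries have the same sign.
    same = sum(1 for x, y in zip(a, b) if (x > 0) == (y > 0) and (x < 0) == (y < 0))
    return 100.0 * same / len(a)


def mean_relative_difference(a, b, tiny=1e-12):
    # Assumed definition: |a_i - b_i| / max(|a_i|, |b_i|), averaged over i.
    # 'tiny' guards against division by zero when both entries are zero.
    total = sum(abs(x - y) / max(abs(x), abs(y), tiny) for x, y in zip(a, b))
    return total / len(a)


backprop_grad = [0.5, -1.0, 2.0]  # hypothetical gradient values
fd_grad = [0.4, -1.1, -0.1]
print(cosine_similarity(backprop_grad, fd_grad))
print(sign_agreement(backprop_grad, fd_grad))
print(mean_relative_difference(backprop_grad, fd_grad))
```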

As a reference, we measure the difference between gradient and finite difference approximation not only for the IDF model but also for its continuous counterpart, the RealNVP flow. For comparability, we omit the scaling parameter in the RealNVP coupling layer so that both models have the same number of parameters. As base distributions for the models, we choose the logistic distribution and the discretized logistic distribution respectively. We train both models on a simple two-dimensional moon-shaped data distribution. After each training epoch, we calculate the backpropagation gradient as well as the finite difference gradient approximation from formula 59. To ensure consistency and meaningfulness of our results, we use a data batch x of all individual samples of the input distribution for this calculation.

Gradient bias over training epochs. Figure 1 shows the difference between gradients obtained by backpropagation and finite difference approximation for the IDF and RealNVP model over 40 training epochs. As expected, the cosine similarity for the unbiased continuous RealNVP flow is 1 throughout the training process, which means the gradient is consistent with its approximation. This is supported by a 100% sign agreement and near-zero mean relative difference. We attribute the reduced sign agreement and an increased mean relative difference in the first epochs to the fact that the parameter updates are coarser and therefore gradient and approximation are more likely to diverge in magnitude. Through all three metrics, it is evident that the IDF gradients deviate more from their unbiased approximation. The sign agreement is consistently between 85% and 95% and the mean relative difference between 0.2 and 0.35. The cosine similarity is 1 for the first eight epochs but afterwards fluctuates between 0.75 and 0.95. These results clearly show that the Integer Discrete Flow gradients are biased.

Figure 1: Difference between finite difference approximation and backpropagation gradients for the IDF and RealNVP model over 40 training epochs. Measured in (a) cosine similarity, (b) sign agreement and (c) mean relative difference. This graphic shows mean and standard deviation over five runs.

Gradient bias over coupling layers. As a second part of this analysis, we take a closer look at how the gradient bias differs between the coupling layers. For this purpose, we calculate the difference between finite difference approximation and backpropagation gradients per coupling layer and average over training epochs. Figure 2 shows our findings. Note that each model has 40 coupling layers numbered from 0 to 39, where 0 is the coupling layer operating directly on the input. As evident throughout all three metrics, the gradient bias is higher in the earlier layers of the IDF model. We conclude from this result that the gradient bias adds up during backpropagation through multiple layers, giving the earlier layers a less informative gradient than the layers located further back.

Figure 2: Difference between gradients obtained by finite difference approximation and gradients obtained by backpropagation for the IDF and RealNVP model per coupling layer. Layer 0 is the coupling layer that operates directly on the input x. Measured in (a) cosine similarity, (b) sign agreement and (c) mean relative difference. Averaged over 40 training epochs. This graphic shows mean and standard deviation over five runs.

3.2 Methods

In this section, we present two approaches to mitigating the gradient bias in Integer Discrete Flows. As a first approach, we present three alternative rounding functions for the IDF coupling layer in section 3.2.1. For each of these functions, we outline why it introduces less gradient bias to the IDF than the original round-to-nearest operation with straight-through gradients. In section 3.2.2 we describe our second approach, which is to introduce stochastic sampling layers to the Integer Discrete Flow. Within these sampling layers, we can use gradient estimators that are less biased than the straight-through estimator in the original IDF coupling layer.

3.2.1 Alternative Rounding Functions

Recall from section 2.1.2.1 that the original IDF coupling layer requires a discrete translation. This translation is modelled by a neural net and a subsequent rounding operation. The IDF uses the round-to-nearest or hard rounding function for integers, i.e.

$$\lfloor x \rceil = \begin{cases} \lfloor x \rfloor & \text{if } \lfloor x \rfloor \le x \le \lfloor x \rfloor + \tfrac{1}{2} \\ \lfloor x \rfloor + 1 & \text{if } \lfloor x \rfloor + \tfrac{1}{2} < x \le \lfloor x \rfloor + 1 \end{cases} \tag{60}$$

where $\lfloor x \rfloor$ is the largest integer less than or equal to x. In the original IDF model, the gradient of this rounding function is estimated by the straight-through gradient estimator [4]. This estimator causes the rounding function to be disregarded during backpropagation, resulting in a mismatch between forward and backward pass referred to as gradient bias.

As an approach to reduce the gradient bias caused by the hard rounding operation with straight through gradients, we propose in this section three alternative methods for discretizing the translation in the IDF coupling layer. To provide an overview and facilitate comparison, table 1 summarizes the key characteristics of all methods.

| Method | Formal definition | Grad. estimator |
|---|---|---|
| Hard rounding | f(x) = ⌊x⌋ if ⌊x⌋ ≤ x ≤ ⌊x⌋ + 1/2; f(x) = ⌊x⌋ + 1 if ⌊x⌋ + 1/2 < x ≤ ⌊x⌋ + 1 | Straight-through |
| Stochastic rounding | f(x) = ⌊x⌋ with prob. 1 − (x − ⌊x⌋); f(x) = ⌊x⌋ + 1 with prob. x − ⌊x⌋ | Straight-through |
| Smooth rounding | f(x) = ⌊x⌋ + (1/2) · tanh(αr) / tanh(α/2) + 1/2, where r = x − ⌊x⌋ − 1/2 | - |
| Noisy rounding | f(x) = x + u with u ∼ Uniform(−1/2, 1/2) | - |

Table 1: Characteristics of the hard rounding function used in the original Integer Discrete Flow and the three alternative rounding functions noisy rounding, smooth rounding, and stochastic rounding, that we propose for its replacement.

Stochastic rounding. The first method we propose for replacing the hard rounding function in the coupling layer is stochastic rounding [15]. This rounding operation is defined for integers as

$$\mathrm{st\_round}(x) = \begin{cases} \lfloor x \rfloor & \text{with probability } 1 - (x - \lfloor x \rfloor) \\ \lfloor x \rfloor + 1 & \text{with probability } x - \lfloor x \rfloor \end{cases} \tag{61}$$

That is, the probability of rounding x to $\lfloor x \rfloor$ is proportional to the proximity of x to $\lfloor x \rfloor$. However, in order to perform a lossless compression, the Integer Discrete Flow needs to be deterministic. Therefore, we apply stochastic rounding only during training and use the round-to-nearest function at test time.

Stochastic rounding is considered an unbiased rounding scheme, because its expected value is equal to the identity. Formally,

$$\mathbb{E}_{p_x(x)}\left[\mathrm{st\_round}(x)\right] = x \tag{62}$$
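The unbiasedness stated in formula 62 can be checked empirically with a short sketch. The function below implements formula 61 with the Python standard library; the sample size and the test value 2.3 are arbitrary choices for illustration.

```python
import math
import random


def st_round(x, rng):
    # Formula 61: round up with probability equal to the fractional part
    # of x, otherwise round down.
    lower = math.floor(x)
    return lower + 1 if rng.random() < (x - lower) else lower


rng = random.Random(0)
x = 2.3
samples = [st_round(x, rng) for _ in range(100000)]
# Every sample is 2 or 3, and the sample mean should be close to 2.3.
print(sorted(set(samples)), sum(samples) / len(samples))
```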

As explained in section 2.1.2.1, the straight-through gradient estimator can be interpreted as replacing the gradient of a function with the gradient of the identity function. Since the expected value of the stochastic rounding function is equal to the identity function, we expect that approximating the gradient of stochastic rounding with the straight-through estimator will result in less gradient bias than it does for the original round-to-nearest function. However, since we apply nonlinear functions in the form of coupling layers or our objective function on top of the results of the stochastic rounding operation, our optimization will still be somewhat biased. Formally, $\mathbb{E}_{p_x(x)}\left[f(\mathrm{st\_round}(x))\right]$ does not necessarily equal f(x) for nonlinear functions f.

To the best of our knowledge, there is no work comparing bias due to rounding to bias due to stochastic rounding in neural networks. Nevertheless, stochastic rounding has been found effective in problems similar to ours, namely training models with fixed-point number representations of low precision [15] and training models with binary discrete layers [48].

Soft rounding. The second alternative rounding function that we propose is the soft rounding function. This function is a smooth approximation which we anneal to hard rounding over the course of training.

We use the smooth rounding function

$$\mathrm{sm\_round}_\alpha(x) = \lfloor x \rfloor + \frac{1}{2} \frac{\tanh(\alpha r)}{\tanh(\alpha/2)} + \frac{1}{2}, \quad \text{where } r = x - \lfloor x \rfloor - \frac{1}{2} \tag{63}$$

This function was introduced and proven to be differentiable everywhere by Agustsson et al. [1]. The parameter $\alpha \in [0, \infty)$ determines how closely the function approximates the hard rounding function. In the limits of α, $\mathrm{sm\_round}_\alpha(\cdot)$ approaches the identity function and the hard rounding function, specifically

$$\lim_{\alpha \to 0} \left[\mathrm{sm\_round}_\alpha(x)\right] = x, \qquad \lim_{\alpha \to \infty} \left[\mathrm{sm\_round}_\alpha(x)\right] = \lfloor x \rceil \tag{64}$$

Figure 3 visualizes $\mathrm{sm\_round}_\alpha(\cdot)$ for different values of α. In practice, we anneal α from 1 to 20 over the course of training. At test time, we use hard rounding.

It is important to notice that this implies that we operate in continuous space during training and in discrete space at test time. By doing so, we can train the IDF with the true gradients obtained by backpropagation and have no need of employing a gradient estimator. In principle, switching between continuous space at training and discrete space at test time is not problematic. We only need to ensure that the base function of our model yields meaningful (i.e. non-zero) values for continuous inputs. Fortunately, this is the case for the discretized logistic distribution as defined in the original IDF paper by Hoogeboom et al. [19]. However, note that our base function here could not be a discrete probability distribution with probability mass on a discrete set of points.

Figure 3: Smooth approximation $\mathrm{sm\_round}_\alpha(\cdot)$ of the hard rounding function from formula 63 for different values of α.
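To make the two limits of the smooth rounding function concrete, the following sketch evaluates formula 63 for a very small and a very large α. The linear annealing schedule is an assumption for illustration only; the text states that α is annealed from 1 to 20 but does not specify the shape of the schedule.

```python
import math


def sm_round(x, alpha):
    # Formula 63; alpha must be strictly positive to avoid division by zero.
    lower = math.floor(x)
    r = x - lower - 0.5
    return lower + 0.5 * math.tanh(alpha * r) / math.tanh(alpha / 2.0) + 0.5


def alpha_schedule(step, total_steps, start=1.0, end=20.0):
    # Hypothetical linear annealing of alpha from 'start' to 'end'.
    return start + (end - start) * step / total_steps


for alpha in (1e-4, 1.0, 5.0, 20.0, 200.0):
    print(alpha, sm_round(2.3, alpha), sm_round(2.7, alpha))
```

For α close to 0 the outputs stay near the inputs 2.3 and 2.7, and for large α they approach the hard-rounded values 2 and 3.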

Noisy rounding. A popular strategy from the field of network quantization is to use the addition of uniform noise during training as an approximation to rounding at test time [1,3]. Intuitively, the model is trained to be robust towards small changes in the parameters that will be rounded at test time. Adding uniform noise is a fully differentiable operation, therefore we do not need a gradient estimator with this approach. Formalized, the third function that we propose as a replacement for the round-to-nearest operation is

$$\mathrm{noisy\_round}(x) = x + u \quad \text{with } u \sim \mathrm{Uniform}\left(-\tfrac{1}{2}, \tfrac{1}{2}\right)$$

Since the IDF must be deterministic to perform lossless compression, we use noisy_round(x) when training and then apply the hard rounding function at test time. Similar to the smooth rounding, this means that we are operating in continuous space during training and in discrete space at test time. As we explained in the previous subsection, this again implies that our base function has to be an augmented discrete distribution that gives meaningful values for continuous inputs.
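The switch between noisy rounding at training time and hard rounding at test time can be sketched as follows. The `training` flag and the function `discretize` are hypothetical names introduced for this illustration; ties in the hard rounding are resolved downwards to match formula 60.

```python
import math
import random


def noisy_round(x, rng):
    # Training-time surrogate: add uniform noise from [-1/2, 1/2].
    return x + rng.uniform(-0.5, 0.5)


def hard_round(x):
    # Round-to-nearest with ties rounded down, as in formula 60.
    return math.ceil(x - 0.5)


def discretize(x, training, rng):
    # Hypothetical wrapper: stochastic and continuous while training,
    # deterministic and integer-valued at test time.
    return noisy_round(x, rng) if training else hard_round(x)


rng = random.Random(0)
print(discretize(1.7, training=True, rng=rng))
print(discretize(1.7, training=False, rng=rng))
```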


3.2.2 Sampling Layers

As a second approach to counteract the gradient bias in Integer Discrete Flows, we propose to add stochastic sampling layers to the model. These layers are additive coupling layers in which the discrete translation is obtained by sampling rather than rounding. For reference, figure 4 shows the architecture of the original additive coupling layer in the IDF as described in formulas 11 and 12. The input $x \in \mathbb{Z}^D$ is split into two parts, the second of which is modified by the discrete translation $\lfloor t \rceil \in \mathbb{Z}^{(D-d)}$. This translation is modelled by a neural network followed by a rounding operation. As previously mentioned, the gradient of the rounding operation is estimated using the biased straight-through estimator [4].

Figure 4: Structure of the original additive coupling layer in IDFs. The input x is split into two parts, one of which is altered in the coupling layer using a discrete translation $\lfloor t \rceil$. This translation is modelled by a neural network with parameters θ followed by a discrete rounding operation.

In contrast to the regular IDF coupling layer, the neural network in our stochastic sampling layers models the continuous parameters of a discrete distribution. From this discrete distribution, we then sample a discrete translation. Because of this modification, we can use gradient estimators that are less biased than the straight-through estimator.

In the following, we introduce three versions of the stochastic sampling layer: One that employs the Gumbel-Softmax/Concrete estimator [21,30], one that uses the REBAR estimator [52] and one that uses the RELAX [13] estimator. Dependent on the gradient estimator, the neural network in the layer can model a categorical or a discretized logistic distribution for sampling the discrete translation. To give an overview, table 2 summarizes the key characteristics of all layers.

| Layer | Bias | Translation distribution |
|---|---|---|
| Original IDF coupling layer | Yes | - |
| Sampling layer w/ Gumbel-Softmax/Concrete | Yes, but decreases over training | Categorical |
| Sampling layer w/ REBAR | No | Categorical or discretized logistic |
| Sampling layer w/ RELAX | No | Categorical or discretized logistic |

Table 2: Characteristics of the original IDF coupling layer and the stochastic sampling layer with Gumbel-Softmax/Concrete, REBAR or RELAX gradient estimator. 'Bias' shows whether the layer introduces gradient bias. 'Translation Distribution' shows which distribution(s) over translations can be used in the layer.

1. Sampling Layer with Gumbel-Softmax/Concrete Estimator

As explained in detail in section 2.2.3, the Gumbel-Softmax/Concrete gradient estimator enables us to backpropagate through a categorical sampling operation. The estimator uses a continuous approximation of the discrete categorical sampling operation and anneals it to discrete sampling during training. Consequently, the bias introduced by the Gumbel-Softmax/Concrete estimator becomes smaller as training progresses. This is an advantage over the straight-through estimator in the regular IDF coupling layer.

To utilize the Gumbel-Softmax/Concrete estimator in a stochastic sampling layer, the discrete translation must be sampled from a categorical distribution. The resulting sampling layer is depicted in figure 5. The neural network (NN) with parameters θ models the continuous event probabilities $p \in \mathbb{R}^{(D-d) \times K}$ of (D − d) i.i.d. categorical distributions with K categories, denoted $\mathrm{Categorical}_K(p)$. Each of the K categories represents an integer value. We obtain the discrete translation $t \in \mathbb{Z}^{(D-d)}$ by sampling from $\mathrm{Categorical}_K(p)$. Note that the number of categories K is an additional hyperparameter that we need to choose.

Figure 5: Sampling Layer with Gumbel-Softmax/Concrete Estimator. The neural network (NN) with parameters θ outputs the continuous parameters p of the categorical distribution with K categories, $\mathrm{Categorical}_K(p)$, from which we then sample the discrete translation t.

An Integer Discrete Flow that includes a sampling layer is a stochastic model. This is fine at training time. However, we want the trained Integer Discrete Flow to perform a lossless compression. To this end, the sampling layer has to be a deterministic layer at evaluation time. Therefore, we choose the most probable translation t instead of drawing a sample. Formally, the layer is defined at evaluation time as the bijection $T : \mathcal{Z} \to \mathcal{X}$ between domains $\mathcal{Z} = \mathcal{X} = \mathbb{Z}^D$. The inverse $T^{-1}$, which represents the forward pass, is given as

$$\begin{aligned} z_{1:d} &= x_{1:d} \\ z_{d+1:D} &= x_{d+1:D} + t \quad \text{where } t = \arg\max(\mathrm{NN}_\theta(x_{1:d})) \end{aligned} \tag{65}$$

where $\mathrm{NN}_\theta$ represents the neural network with parameters θ. This operation is inverted to T for inference and sampling from the trained IDF model as

$$\begin{aligned} x_{1:d} &= z_{1:d} \\ x_{d+1:D} &= z_{d+1:D} - t \quad \text{where } t = \arg\max(\mathrm{NN}_\theta(z_{1:d})) \end{aligned} \tag{66}$$
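The deterministic evaluation-time layer from formulas 65 and 66 can be sketched with plain Python lists. The function `toy_logits` is a hypothetical stand-in for the neural network, and the mapping from category index c to the integer c − K // 2 is an assumption, since the text only states that each category represents an integer value.

```python
def toy_logits(x_first, n_out, num_categories):
    # Hypothetical replacement for NN_theta: returns one row of K scores per
    # dimension of the second part, depending only on the first part.
    rows = []
    for j in range(n_out):
        target = (sum(x_first) + j) % 3 - 1  # preferred translation in {-1, 0, 1}
        rows.append([-abs((c - num_categories // 2) - target)
                     for c in range(num_categories)])
    return rows


def most_probable_translation(logits, offset):
    # argmax over the categories, mapped to integers via c - offset (assumption).
    return [max(range(len(row)), key=lambda c: row[c]) - offset for row in logits]


def forward(x, d, num_categories):
    # Formula 65: z_{1:d} = x_{1:d}, z_{d+1:D} = x_{d+1:D} + t.
    x1, x2 = x[:d], x[d:]
    t = most_probable_translation(toy_logits(x1, len(x2), num_categories),
                                  num_categories // 2)
    return x1 + [a + b for a, b in zip(x2, t)]


def inverse(z, d, num_categories):
    # Formula 66: the translation is recomputed from the unchanged first part.
    z1, z2 = z[:d], z[d:]
    t = most_probable_translation(toy_logits(z1, len(z2), num_categories),
                                  num_categories // 2)
    return z1 + [a - b for a, b in zip(z2, t)]


x = [3, 7]
z = forward(x, d=1, num_categories=5)
print(z, inverse(z, d=1, num_categories=5))
```

Because the first part passes through unchanged, the inverse sees the same network input as the forward pass and therefore recovers exactly the same translation, which is what makes the layer usable for lossless compression.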

2. Sampling Layer with REBAR or RELAX Estimator

As described in 2.2.4 and 2.2.5, both REBAR and RELAX were originally introduced for categorical random variables. Therefore, in the same way that we just described for the Gumbel-Softmax/Concrete estimator, both can be used to sample the discrete translation in the sampling layer from a categorical distribution. However, since the IDF is a model for ordinal discrete data, it is preferable to use an ordinal distribution for sampling the discrete translation. For this purpose, we define the REBAR and RELAX gradient estimator for discretized logistic random variables.

Defining REBAR and RELAX for discretized logistic random variables.

We can write the sampling process for a discrete random variable x with a discretized logistic distribution with location µ and scale σ as

$$z = \sigma \cdot l + \mu \quad \text{with } l \sim \mathrm{Logistic}(0, 1) \tag{67}$$

$$x = H(z) = \mathrm{round}[z] \tag{68}$$

where the value of x is obtained from the value of the continuous random variable z by hard rounding. Further, the standard logistic random variable l can be reparameterized via inverse transform sampling [45] as

$$l = \log\left(\frac{u}{1 - u}\right) \quad \text{with } u \sim \mathrm{Uniform}(0, 1) \tag{69}$$

Another random variable $\tilde{z}$ represents the value of z for a given value of x. To sample a value for $\tilde{z}$, we perform inverse transform sampling [45] from all values that could have been rounded to a given value of x. Using the cumulative distribution function of the logistic distribution

$$\mathrm{LogisticCDF}(x \mid \mu, \sigma) = \frac{1}{1 + e^{-(x - \mu)/\sigma}} \tag{70}$$

as well as its inverse

$$\mathrm{LogisticCDF}^{-1}(x \mid \mu, \sigma) = \sigma \cdot \log\left(\frac{x}{1 - x}\right) + \mu \tag{71}$$

the sampling process of $\tilde{z}$ can be formalized as

$$\begin{aligned} \mathrm{min} &= \mathrm{LogisticCDF}\left(x - \tfrac{1}{2} \mid \mu, \sigma\right), \qquad \mathrm{max} = \mathrm{LogisticCDF}\left(x + \tfrac{1}{2} \mid \mu, \sigma\right) \\ \tilde{z} &= \mathrm{LogisticCDF}^{-1}(v \mid \mu, \sigma) \quad \text{with } v \sim \mathrm{Uniform}(\mathrm{min}, \mathrm{max}) \end{aligned} \tag{72}$$
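The sampling process in formulas 67 to 72 can be sketched as follows. This is an illustration under stated assumptions: the parameter values and the clamping constant are arbitrary, and ties in the hard rounding are resolved downwards as in formula 60. The sketch verifies the defining property of $\tilde{z}$, namely that hard rounding it returns the given x.

```python
import math
import random


def logistic_cdf(x, mu, sigma):
    # Formula 70.
    return 1.0 / (1.0 + math.exp(-(x - mu) / sigma))


def logistic_icdf(p, mu, sigma):
    # Formula 71.
    return sigma * math.log(p / (1.0 - p)) + mu


def hard_round(z):
    # Round-to-nearest with ties rounded down, as in formula 60.
    return math.ceil(z - 0.5)


def sample_x_and_z_tilde(mu, sigma, rng):
    # Formulas 67 and 69: reparameterized logistic sample z.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # clamp is an assumption
    z = sigma * math.log(u / (1.0 - u)) + mu
    # Formula 68: discretized logistic sample x.
    x = hard_round(z)
    # Formula 72: sample z~ from the interval of values that round to x.
    low = logistic_cdf(x - 0.5, mu, sigma)
    high = logistic_cdf(x + 0.5, mu, sigma)
    v = rng.uniform(low, high)
    z_tilde = logistic_icdf(v, mu, sigma)
    return z, x, z_tilde


rng = random.Random(0)
for _ in range(3):
    z, x, z_tilde = sample_x_and_z_tilde(mu=0.3, sigma=2.0, rng=rng)
    print(round(z, 3), x, round(z_tilde, 3), hard_round(z_tilde) == x)
```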

Combining the defined random variables with the REBAR estimator for a categorical random variable from formula 56, we define the REBAR gradient estimator for a discretized logistic random variable x as

$$g_{\mathrm{REBAR}}(f; \mu, \sigma) = \left[f(x) - \eta f(\tilde{z})\right] \nabla_{[\mu,\sigma]} \log p(x) + \eta \nabla_{[\mu,\sigma]} f(z) - \eta \nabla_{[\mu,\sigma]} f(\tilde{z}) \tag{73}$$

Analogously, combining the defined random variables with the RELAX estimator for a categorical random variable from formula 57, we obtain the RELAX gradient estimator for a discretized logistic random variable x:

$$g_{\mathrm{RELAX}}(f; \mu, \sigma) = \left[f(H(z)) - c_\phi(\tilde{z})\right] \nabla_{[\mu,\sigma]} \log p(x) + \nabla_{[\mu,\sigma]} c_\phi(z) - \nabla_{[\mu,\sigma]} c_\phi(\tilde{z}) \tag{74}$$

with $u, v \sim \mathrm{Uniform}(0, 1)$

Note that for both estimators, f denotes our overall objective function.

Figure 6: Two versions of the sampling layer with REBAR or RELAX estimator for discretized logistic random variables. The neural network (NN) outputs the parameters µ, σ of the discretized logistic distribution DL(µ, σ) from which we then sample the discrete translation t. Layer architecture version (a) is more similar to the original IDF coupling layer. However, since we find that this layer architecture cannot prevent the accumulation of gradient bias, we introduce version (b). We proceed to using only version (b) in our experiments.

Defining REBAR or RELAX sampling layer architecture. When drafting the architecture of the sampling layer with REBAR or RELAX estimator for discretized logistic random variables, we first consider an architecture that closely resembles the original IDF coupling layer. This architecture is displayed in figure 6 (a). However, when writing the derivative of the overall objective function f w.r.t. the layer input $x = [x_{1:d}, x_{d+1:D}]$ for this architecture, we obtain

$$\frac{\partial f}{\partial x_{1:d}} = \underbrace{\frac{\partial f}{\partial x_{1:d}}}_{\text{biased}} + g_{\mathrm{REBAR/RELAX}}(f; \mu, \sigma) \cdot \frac{\partial [\mu, \sigma]}{\partial x_{1:d}} \tag{75}$$

$$\frac{\partial f}{\partial x_{d+1:D}} = \underbrace{\frac{\partial f}{\partial x_{d+1:D}}}_{\text{biased}} \tag{76}$$

That is, the derivative $\frac{\partial f}{\partial x_{d+1:D}}$ is the same as in the original IDF coupling layer. However, since our goal is to prevent the buildup of gradient bias, we want to avoid backpropagating biased gradients. Therefore, we change the structure of our sampling layer to the version shown in figure 6 (b). With this structural adjustment, the gradients change to

$$\frac{\partial f}{\partial x_{1:d}} = \frac{\partial f}{\partial x_{1:d}} + \frac{\partial f}{\partial [\mu, \sigma]} \frac{\partial [\mu, \sigma]}{\partial x_{1:d}} \tag{77}$$

$$= \underbrace{\frac{\partial f}{\partial x_{1:d}}}_{\text{biased}} + g_{\mathrm{REBAR/RELAX}}(f_{xt}; \mu, \sigma) \cdot \frac{\partial [\mu, \sigma]}{\partial x_{1:d}} \tag{78}$$

$$\frac{\partial f}{\partial x_{d+1:D}} = \frac{\partial f}{\partial [\mu]} \frac{\partial [\mu]}{\partial x_{d+1:D}} = \frac{\partial f}{\partial [\mu]} = g_{\mathrm{REBAR/RELAX}}(f_{xt}; \mu) \tag{79}$$

Consequently, less gradient bias is backpropagated with version (b) of the architecture. We proceed to using only version (b) in our experiments.


As already discussed for the sampling layer with Gumbel-Softmax/Concrete estimator, we want the trained Integer Discrete Flow to perform lossless compression. To transform the sampling layer with REBAR or RELAX estimator into a deterministic layer for evaluation, we choose the most probable translation instead of drawing a sample. The most probable translation is the location of the discretized logistic distribution, in our case this is $(x_{d+1:D} + \mu)$ (cf. figure 6 (b)). Consequently, the layer is defined at evaluation time as the bijection $T : \mathcal{Z} \to \mathcal{X}$ between domains $\mathcal{Z} = \mathcal{X} = \mathbb{Z}^D$. The inverse $T^{-1}$, which represents the forward pass, is given as

$$\begin{aligned} z_{1:d} &= x_{1:d} \\ z_{d+1:D} &= x_{d+1:D} + \mu \quad \text{where } \mu, \sigma = \mathrm{NN}_\theta(x_{1:d}) \end{aligned} \tag{80}$$

where $\mathrm{NN}_\theta$ represents the neural network with parameters θ. This operation is inverted to T for inference and sampling from the trained IDF model as

$$\begin{aligned} x_{1:d} &= z_{1:d} \\ x_{d+1:D} &= z_{d+1:D} - \mu \quad \text{where } \mu, \sigma = \mathrm{NN}_\theta(z_{1:d}) \end{aligned}$$


3.3 Experiments

This section examines the performance of the methods for reducing gradient bias in Integer Discrete Flows presented in section 3.2. First, we show in section 3.3.1 that the negative effect of gradient bias on the performance of the IDF emerges also in the case of a two-dimensional input distribution. Inspired by our initial analysis in section 3.1, we investigate in section 3.3.2 whether the finite difference gradient approximation can be used directly to improve the performance of an Integer Discrete Flow model by fine-tuning it. Afterwards, in section 3.3.3 we test the alternative rounding functions for the IDF coupling layer introduced in method section 3.2.1. Finally, section 3.3.4 explores how the stochastic sampling layers proposed in method section 3.2.2 affect the performance of Integer Discrete Flows. To this end, we first look into sampling layers with Gumbel-Softmax gradient estimates in section 3.3.4.1 and then proceed to sampling layers with REBAR and RELAX gradient estimates in section 3.3.4.2. For all of the experiments, we train the original IDF as well as the IDF with our modifications on two-dimensional data distributions. Since we do not find a significant improvement in performance for any of our methods, we do not evaluate them on larger datasets.

Datasets. To conduct the experiments in this section, we use the three discrete two-dimensional data distributions shown in figure 7. The distribution in figure 7 (a) consists of two binomial distributions and the distribution in figure 7 (b) is a discrete moon-shaped distribution. We find that both are relatively easy to model for a flow-based model, with the two binomials distribution shown in (a) being even easier to model than the moon distribution shown in (b). For this reason, we use these distributions in experiments with flow-based models that only have a single coupling layer.

The third distribution shown in figure 7 (c) is a mixture of eight discretized Gaussian distributions arranged in a circular pattern. This distribution is more difficult to model for a flow-based generative model than the other two distributions. Therefore, we use it to compare flow-based models with a larger number of coupling layers. Besides the distributions themselves, figure 7 also shows the corresponding distribution entropies measured in bits per dimension. Recall from section 2.1.2.1 that the distribution entropy is the lower bound on the performance that a flow-based model can achieve when modelling the distribution.
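As a small worked example of the entropy bound in bits per dimension (BPD), the sketch below computes the entropy of a hypothetical discrete distribution and divides it by the number of dimensions. The uniform toy distribution is an assumption for illustration and is not one of the distributions in figure 7.

```python
import math


def entropy_bpd(probabilities, num_dims):
    # Shannon entropy in bits, divided by the number of data dimensions.
    # Outcomes with zero probability contribute nothing to the sum.
    entropy_bits = -sum(p * math.log2(p) for p in probabilities if p > 0.0)
    return entropy_bits / num_dims


# Hypothetical 2D distribution that is uniform over a 4 x 4 grid of integer
# points: 16 equally likely outcomes give 4 bits, i.e. 2 bits per dimension.
uniform_grid = [1.0 / 16.0] * 16
print(entropy_bpd(uniform_grid, num_dims=2))
```

A model evaluated on samples from such a distribution cannot reach an average code length below this value.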

3.3.1 Reproducing Gradient Bias for 2D Data Distribution

Hoogeboom et al. show in the original IDF paper that adding more coupling layers to the IDF after a certain point hurts performance [19]. They further illustrate that this is not the case for the IDF's continuous counterpart, the RealNVP flow. However, for this experiment, they trained both models on high-dimensional image data. In the following, we show that gradient bias occurs similarly for two-dimensional input data. For this purpose, we train both the IDF and the continuous RealNVP flow with different numbers of coupling layers on a two-dimensional data distribution.

Figure 7: Two-dimensional data distributions (a), (b) and (c) and associated bits-per-dimension (BPD) entropy values used for the experiments in this section.

Without loss of generalizability of our results, we omit the scaling parameter in the original RealNVP coupling layer so that the two models have the same number of parameters and thus become more comparable. For both models, the neural network in the coupling layer consists of two linear layers of each 128 hidden units. We choose a discretized Logistic distribution as base distribution for the IDF and a Logistic dis-tribution for the continuous RealNVP model. The input data disdis-tribution is the eight Gaussians distribution shown in figure 7 (c). Both models are trained with 5, 10, 20, 40, 60, 80, 100 and 120 coupling layers until convergence. The results are displayed in figure 8 as mean and standard deviation over 5 runs.

Figure 8 shows the same trend as reported by Hoogeboom et al. for the models trained on high-dimensional image data. Up to 20 coupling layers, the IDF performs better than the RealNVP model. For 40 and 60 coupling layers, the IDF performance improves only slightly, so that the IDF performs worse than the RealNVP flow. Above 80 layers, the IDF’s performance even decreases as more layers are added. In contrast, the RealNVP model improves as more layers are stacked until its performance reaches a BPD close to the distribution entropy. From this, we conclude that for a two-dimensional input distribution the gradient bias occurs in the same way as reported by Hoogeboom et al. for image data.

Figure 8: Performance (in BPD) of IDF and RealNVP flow for different numbers of coupling layers. Both models were trained on a two-dimensional data distribution. This graphic shows mean and standard deviation over five runs.

3.3.2 Fine-tuning with FDA gradients

Inspired by our preliminary analysis in section 3.1, in which we compared the backpropagation gradient with an unbiased first-order gradient approximation, we test in this experiment whether the finite-difference approximation (FDA) can be used directly for approximating the gradient during training. However, since the FDA is expensive to compute, it is infeasible to train the IDF entirely with FDA gradients. Instead, we use the FDA to fine-tune a pre-trained IDF.

For this experiment, we use a 40-layer IDF and, for reference, an equally sized RealNVP flow. As in the previous experiment, we omit the scaling parameter in the RealNVP coupling layer for comparability. We again choose the discretized logistic and the logistic distribution as base distributions for the models and the eight Gaussians distribution shown in figure 7 (c) as input data distribution. Since the FDA is costly to compute, we use a small neural network with one hidden layer of 20 hidden units in each coupling layer. This results in a total of 2440 parameters for each model. Recall from formula 59 that two forward passes are necessary to obtain the finite-difference gradient approximation for a single model parameter and data batch. Therefore, even for this small model size, running a single training epoch with FDA gradients takes about three days on an Nvidia GeForce GTX 1080 graphics card. We fine-tune the pre-trained models for two epochs. We repeat the experiment for four approximation step sizes (cf. formula 59), namely 0.01, 0.001, 0.0001 and 0.00001.
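The cost argument can be illustrated with a short sketch. Formula 59 is not reproduced in this section, so the code assumes a central difference, which is consistent with the two forward passes per parameter stated above. The layer layout (one input and one output dimension per coupling network, with biases) is also an assumption. It reproduces the stated total of 2440 parameters. All function names and the toy loss are hypothetical.

```python
def count_coupling_net_params(in_dim=1, hidden=20, out_dim=1):
    """Weights and biases of a network with one hidden layer (assumed layout)."""
    return (in_dim * hidden + hidden) + (hidden * out_dim + out_dim)


def central_difference_gradient(loss_fn, params, eps):
    """Finite-difference gradient; returns the gradient and the number of
    loss evaluations (forward passes) that were needed."""
    grad = []
    evaluations = 0
    for i in range(len(params)):
        plus = list(params)
        minus = list(params)
        plus[i] += eps
        minus[i] -= eps
        # Two forward passes per parameter.
        grad.append((loss_fn(plus) - loss_fn(minus)) / (2 * eps))
        evaluations += 2
    return grad, evaluations


params_per_layer = count_coupling_net_params()
total_params = 40 * params_per_layer
forward_passes_per_batch = 2 * total_params


def toy_loss(p):
    # Hypothetical quadratic loss with known gradient 2 * p.
    return sum(v * v for v in p)


grad, n_evals = central_difference_gradient(toy_loss, [1.0, -2.0, 0.5], eps=0.01)
print(total_params, forward_passes_per_batch, grad, n_evals)
```

Under these assumptions, one gradient estimate for a single batch requires 4880 forward passes through the model. A single backward pass with backpropagation yields all gradient components at once, which explains the long runtime of the FDA approach.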

Regardless of the chosen approximation step size, we find no performance improvements for either the IDF or the RealNVP model. However, since the experiment only tests a small number of cases, we cannot infer from this result whether FDA gradients are unsuitable for model fine-tuning in general.

3.3.3 Alternative Rounding Functions

In this section, we test whether gradient bias can be mitigated by replacing the hard rounding operation in the IDF coupling layer with the alternative rounding functions introduced in section 3.2.1. These functions comprise stochastic rounding, smooth rounding and noisy rounding. For our experiment, we train the IDF model with each of the three rounding functions for different numbers of coupling layers. As a reference to these modified IDF versions, we use the results from experiment 3.3.1.

For comparability, the IDF used in this experiment has the same structure as the IDF in experiment 3.3.1: The neural network in the coupling layer consists of two linear layers with 128 hidden units each, and the base distribution is a discretized logistic distribution. We also use the same input data distribution, namely the eight Gaussians distribution shown in figure 7 (c). For this experiment, we replace the original rounding operation in all IDF coupling layers with either stochastic rounding, noisy rounding, or smooth rounding. Each of these three IDF versions is trained with 5, 10, 20, 40, 60, 80, 100 and 120 coupling layers until convergence. The average BPD over five runs per method and number of coupling layers is shown in figure 9. For clarity, the corresponding standard deviations are listed in table 3.
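As an illustration of one of the three alternatives, the sketch below shows stochastic rounding in its common form: a value is rounded up with a probability equal to its fractional part, so that the result equals the input in expectation. The exact definition in section 3.2.1 is not reproduced here and may differ in detail. The function name and the sample size are assumptions.

```python
import math
import random


def stochastic_round(x, rng):
    """Round x down or up at random so that the expected result equals x."""
    lower = math.floor(x)
    frac = x - lower
    # Round up with probability equal to the fractional part.
    return lower + (1 if rng.random() < frac else 0)


rng = random.Random(0)  # fixed seed for reproducibility
samples = [stochastic_round(2.3, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
print(sorted(set(samples)), mean)
```

Each sample is either 2 or 3, and the sample mean lies close to 2.3. Hard rounding, in contrast, would map every input of 2.3 to 2.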

It is evident from figure 9 that none of our proposed alternative rounding functions significantly improves the performance of the IDF. Only stochastic rounding makes the IDF perform slightly better for 100 or more coupling layers. However, despite this slight improvement, the IDF still performs significantly worse than the continuous RealNVP model for 100 or more coupling layers. With smooth rounding and noisy rounding, the IDF always performs worse than with its regular rounding operation. We conclude from this that the true gradients of the smooth rounding function are more harmful to the optimization of the IDF than the bias introduced by using the hard rounding function with straight-through gradients.