Sampling from constrained domains

Carla Groenland

June 27, 2014

Bachelor thesis (18EC)
Supervisor: Taco Cohen

Bachelor's programme in Artificial Intelligence (Bacheloropleiding Kunstmatige Intelligentie)


Abstract

In this bachelor thesis various sampling methods for a particular constrained domain are evaluated. When applying the Expectation Maximization (EM) algorithm in order to learn representations of SO(3), an integral is approximated using the Monte Carlo method. For this purpose, the need arises to sample from an unnormalized density on SO(3), which is a constrained domain when embedded in Euclidean space. Various sampling methods have been implemented and evaluated based on their approximations of the EM objective. A basic importance sampling algorithm appears to work best; however, this method has the disadvantage that it does not produce samples from the required distribution and that a large number of samples is required for a good approximation. On the other hand, the evaluated Markov chain Monte Carlo methods are able to give reasonable approximations using a small number of samples, but require much more time for generating samples and give worse approximations. Furthermore, a new technique for lifting constraints is proposed. It is derived under which conditions samples from a density on an easier domain can be transformed to samples from the required density. This technique is applied to the problem of sampling from SO(3) by proving for a certain case that samples can be transformed from R^4 to S^3. The theory is also shown to work experimentally by applying the same sampling method to sample from both domains and comparing their estimates of an EM objective.

Title: Sampling from constrained domains
Author: Carla Groenland, 10208429

Supervisor: Taco Cohen, Machine Learning Group

Second reader: dr. Leo Dorst, Intelligent Systems Lab Amsterdam
Date: June 27, 2014

University of Amsterdam


Contents

1 Introduction
  1.1 Literature review
  1.2 Research question
  1.3 Method and overview
  1.4 Acknowledgements

2 Context
  2.1 Lie groups and irreducible representations
  2.2 Learning representations of Lie groups
    2.2.1 Toroidal Subgroup Analysis
    2.2.2 Computational framework for SO(3)

3 Sampling methods
  3.1 Definition of sampling methods
  3.2 Existing sampling methods
    3.2.1 Markov chains
    3.2.2 Metropolis algorithm
    3.2.3 Hybrid Monte Carlo
    3.2.4 Spherical HMC
  3.3 Investigating new sampling methods
    3.3.1 Conditions for changing domains
    3.3.2 Projection sampling
    3.3.3 Manifold sampling
    3.3.4 Variations on Gibbs sampling

4 Experiments
  4.1 Implemented sampling methods
    4.1.1 Existing methods
    4.1.2 Discretized Gibbs sampling on R^4
  4.2 Evaluation method
    4.2.1 Problems with the normalization constant
    4.2.2 Problems with HMC
  4.3 Results

5 Conclusion


1 Introduction

Research on sampling methods is conducted in a wide range of fields including artificial intelligence, econometrics, physics, statistics and computational chemistry. The necessity for sampling methods arises most frequently in artificial intelligence when a computationally intractable integral is approximated using the Monte Carlo method, that is,

∫ f(z) p(z) dz ≈ (1/L) Σ_{i=1}^{L} f(x^{(i)}),

where x^{(1)}, . . . , x^{(L)} are sampled from the probability density function (pdf) p. Many formulas, for example expectations, can be written in the form ∫ f(z) p(z) dz. Monte Carlo methods tend to outperform numeric integration when approximating integrals on high-dimensional domains [3, Chapter 11].
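To make the approximation concrete, here is a minimal Python sketch (not part of the original thesis; the density p and the function f are arbitrary illustrative choices) of the basic Monte Carlo estimator:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(z):
    return z ** 2                      # under p = N(0, 1) the exact expectation is 1

L = 100_000
x = rng.standard_normal(L)             # x^(1), ..., x^(L) sampled from p
estimate = np.mean(f(x))               # (1/L) * sum_i f(x^(i))
print(estimate)                        # close to 1 for large L
```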

Which sampling method works best is usually highly dependent on the specifics of the problem, such as the dimensionality of the domain or whether the gradient of the pdf is known. In this thesis, several sampling methods are evaluated on a particular problem in computer vision which arose when learning representations over the Lie group SO(3) in the attempt to continue the research initiated by Cohen and Welling [7]. An integral of the form ∫ f(z) p(z) dz is encountered in the EM algorithm and this leads to the problem of sampling from SO(3), which is an instance of a manifold (it is a Lie group) containing the rotations of three-dimensional space. The domain is constrained when embedded in Euclidean space and many sampling methods are incapable of handling such constraints.

1.1 Literature review

Much research has been conducted on Markov chain Monte Carlo (MCMC) sampling methods, where a Markov chain is constructed which has the desired distribution as its stationary distribution [19]. The first proposed MCMC method was the Metropolis algorithm [13] which was created in 1953 and which was improved by Hastings in 1970 [10]. Gibbs sampling [9] is a special case of the Metropolis-Hastings algorithm which works with the univariate conditional distributions. An existing variation on this method updates blocks of variables simultaneously instead of updating all variables separately [11].

Hamiltonian Monte Carlo (also called Hybrid Monte Carlo or HMC) is a class of MCMC methods which alternate between Metropolis updates (using a state proposal computed according to Hamiltonian dynamics) and updates of auxiliary momentum variables. The equations of Hamiltonian dynamics are difficult to solve analytically and are therefore approximated by discretizing time. The leapfrog method is commonly used for this approximation [14]. The HMC methods improve the convergence rate of the Markov chains due to their ability to propose long-distance moves which retain a high acceptance probability.

Several approaches to handling constraints by using an adjusted HMC sampling method have been proposed. Neal [14] discusses a modified HMC method which enables the sampler to remain within prescribed boundaries due to an additional term in the potential energy which tends to infinity as the sampler approaches the boundaries. Spherical HMC [12] is able to handle constraints of the form ‖x‖_q ≤ C by transforming such a domain into a sphere S^n and sampling from this by splitting the Hamiltonian dynamics in such a way that both parts can be solved analytically.

Finally, the possibilities of applying HMC sampling on manifolds and Hilbert spaces have recently been investigated [5, 2]. Much research focuses on defining the Hamiltonian on such domains and explicitly adjusting the standard HMC algorithm for use on certain types of manifold or Hilbert space.

1.2 Research question

The main goal of the project is to support the research on learning representations of SO(3) by implementing a sampling method which can be used to approximate a certain expectation encountered in the EM algorithm. The research question is “Which sampling method performs best when integrated in the EM algorithm used for the Lie group SO(3)? ”. The question is divided into three parts:

1. What kind of domain do we need to sample from and which restrictions do we need to take into account? Which existing methods can be applied naturally?

2. Which new methods can be used for this task? What are their properties and can these be generalized to other domains?

3. Which of the new methods and existing methods performs best?

In particular, the feasibility of facilitating sampling from difficult domains by transform-ing samples taken from more agreeable domains is also examined.

1.3 Method and overview

In the chapter Context, the question What kind of domain do we need to sample from and which restrictions do we need to take into account? will be addressed by explaining the computational framework the sampling methods need to be integrated in and how the elements of SO(3) can be parametrized in this framework. Furthermore, the concepts “Lie group” and “fully reduced representation” will be explained and motivation for the research described in this thesis will be given by explaining why learning representations over SO(3) is promising for research in computer vision and machine learning.

In the chapter Sampling methods, the following topics are discussed.

• Definition of sampling. The intuition behind sampling and the precise mathematical definition are given.

• Existing sampling methods. Several basic methods are explained as well as the more recent Spherical HMC method [12] in order to respond to the question Which existing methods can be applied naturally?. The examination of why existing methods work assists the investigation of new sampling methods.

• Investigating new sampling methods. The results of the theoretical research conducted in order to address Which new methods can be used for this task? What are their properties and can these be generalized to other domains? are outlined. Failed attempts are stated as well as the origin of their failure. General conditions are derived for sampling from difficult domains by transforming samples taken from more agreeable domains. It is shown how this approach enables more existing sampling methods to be applied to the task at hand.

In order to answer Which of the new methods and existing methods performs best?, various sampling methods are implemented and evaluated. In the chapter Experiments, the details of the implementation of these sampling methods and the experiments conducted to evaluate the methods are given and the results are discussed.

1.4 Acknowledgements

I would like to thank my supervisor Taco Cohen for answering my questions, motivating me with his enthusiasm and for all his helpful suggestions. I am thankful to dr. Leo Dorst for agreeing to be the second reader and to dr. Shiwei Lan for answering my questions about Spherical HMC and sending me the required adjustments I had to make. Finally, I would like to thank Josha Box, Wieger Hinderks and dr. Bert van Es for proofreading several theoretical parts of the thesis and for the discussions I have had with them.


2 Context

In this chapter, the approach described in [7] will be explained, together with the required background needed to understand the approach and how to extend it. This is important for understanding the context of the sampling problem, which gives a purpose for the research described in this thesis and also helps understand what needs to be done.

First of all, the notions Lie group and irreducible representation will be explained based on [8]. After that, the approach of [7] will be outlined and extensions of the approach will be discussed, where certain definitions from [3] are added for completeness.

2.1 Lie groups and irreducible representations

Group theory is a large field of study in mathematics. Before a Lie group can be introduced, the notions of a group and a differentiable manifold (a notion from differential geometry) have to be defined first.

Definition 2.1. A group G is a set of elements together with an operation ◦ : G×G → G and an element e ∈ G such that

• (x ◦ y) ◦ z = x ◦ (y ◦ z) for all x, y, z ∈ G;
• x ◦ e = x = e ◦ x for all x ∈ G;
• for all x ∈ G, there exists an x^{-1} ∈ G such that x ◦ x^{-1} = e = x^{-1} ◦ x.

It follows directly that this inverse x−1 is unique. Instead of x ◦ y, notations such as x + y, xy or x · y are also used.

Example 2.2. Many well-known sets form a group, for example the invertible real (n × n)-matrices, the orthogonal transformations and the rotations in 2D. The real numbers also form a group under addition and R\{0} forms a group under multiplication.

The definition of a differentiable manifold is a bit more involved: it requires first introducing the notions of topology, open set, homeomorphism, second countable and Hausdorff space. The intuition behind the definition is that a manifold is locally a Euclidean space, meaning that when you zoom in enough and adjust the space slightly, you see R^n for a certain n ∈ N. This means that R^n is an example of a manifold (to be precise, when R^n is taken with the Euclidean topology). It follows that the n × n-matrices form a manifold, since M_{n×n}(R) ≅ R^{n²} by putting all matrix elements in a vector (if one defines a topology on M_{n×n}(R) using this identification). A Lie group is a combination of these two notions.


Definition 2.3. A Lie group is a group that is also a differentiable manifold, such that the group operations are given by smooth maps.

The definition above means that the map (x, y) 7→ xy−1 must be C∞ with respect to the topology on the Lie group as differentiable manifold.

Example 2.4. Using the regular value theorem it follows that the set of invertible matrices GLn(R) is also a manifold. Similar techniques show that the special orthogonal

group SO(n) is a manifold as well. This gives us the first examples of Lie groups: SO(n) and GLn(R) form Lie groups for all n ∈ N. The group operation is given by matrix

multiplication and one needs to verify that taking the inverse and multiplication are smooth operations in the topology induced on these manifolds as subspaces of Mn×n(R).

Since Lie groups have a topology defined on them, compact Lie groups can be defined. Compact sets have nice properties (such as being bounded) making them more tractable for computations.

A group representation is a mathematical notion which is abbreviated to a representation in this thesis.

Definition 2.5. A representation of a group (G, ◦_G) over a field k is a k-vector space V together with a group homomorphism ρ : G → GL(V), that is,

ρ(g) ◦ ρ(h) = ρ(g ◦_G h) (for each g, h ∈ G)

and for each group element g ∈ G, ρ(g) : V → V is a linear and invertible map.

Example 2.6. A possible choice of field is k = R and V = Rn can then be chosen as vector space over k. In this case GL(V ) = GLn(R) simply consists of the invertible

matrices. The composition ◦ is given by matrix multiplication.

Seeing the elements of GL(V) as (invertible) matrices, a representation can be seen as a linear assignment of matrices to group elements (‘linear’ when ‘+’ is seen as multiplication in the group).

One important property that a representation can have is called irreducibility.

Definition 2.7. A non-zero representation (ρ, V) of G is called irreducible if V has no non-trivial subrepresentations, that is, if W is a subspace of V such that ρ(g)(W) ⊂ W for all g ∈ G, then W = V or W = 0.

Example 2.8. A vector space V has only 0 and itself as subspace if V is one-dimensional, which implies that all one-dimensional representations are irreducible.

The word irreducibility is used often in mathematics to denote elements which are in some sense primitive. An irreducible representation cannot be decomposed into smaller elements in some sense and a logical next question for a mathematician is when a representation can be decomposed into irreducible representations (and whether there exists a unique way to do so). If this is possible, the representation is called completely reducible, fully reducible or semisimple.


Definition 2.9. A completely reducible representation of G is the direct sum of irreducible representations of G.

In language of matrices, the matrices ρ(g) have a block diagonal form (with each block corresponding to an irreducible representation) if a representation is completely reducible.

The last notion of representation theory that will be required is that of an isomorphism of representations. This explains when two representations can be seen as equivalent.

Definition 2.10. Let (V_1, ρ_1) and (V_2, ρ_2) be two representations of a group G. An isomorphism of representations is an invertible, linear map φ : V_1 → V_2 such that φ(ρ_1(g)v) = ρ_2(g)φ(v) for all g ∈ G, v ∈ V_1.

This definition is equivalent to saying that the representations are related by a change of basis. Namely, let φ : V1 → V2 be an isomorphism of G-representations, then φ is a

linear map such that φρ1(g) = ρ2(g)φ or φρ1(g)φ−1= ρ2(g) for all g ∈ G.

If a subspace U of a vector space V and its orthogonal complement U⊥ are both invariant under G then we can write

W ρ(g) W^{-1} = [ ρ_U(g), 0; 0, ρ_{U⊥}(g) ]

for a certain matrix W which performs a basis change. The map corresponding to W is then an isomorphism of representations between ρ and ρ_U ⊕ ρ_{U⊥}. If a representation is

isomorphic to a completely reducible representation, then we will also call it completely reducible. All group representations of compact groups have this property.

2.2 Learning representations of Lie groups

The computational framework for learning fully reduced representations of SO(2) is explained in [6] and [7]. In this section the motivation for learning representations over SO(2), the chosen approach and the prospects of extending the approach to SO(3) are summarized. Finally, the computational framework in which the sampling methods will be implemented is described and certain formulas that will become important later on are derived.

When a picture is rotated, the picture still depicts the same thing. Therefore, it is very useful in computer vision and machine learning to be able to compute features of an image which are invariant under rotation or other symmetries. In order to be able to speak of equivalence of images and individual properties of an image in a language, two insights of Felix Klein and Hermann Weyl are applied. The motivation behind learning irreducible representations of Lie groups is based on these two principles.

Klein tried to apply group theory on geometry by considering two figures equivalent if they are related by an element of the Euclidean group (which contains rigid body motions). This can be extended by exchanging the Euclidean group with any other group of transformations, for example SO(2), the rotations in 2D. The chosen group is


referred to as the symmetry group (because it describes the symmetries of a system) and will be taken to be a Lie group.

Besides features which are invariant (or covariant) under certain transformations, it is also useful to be able to put labels on them. The features need to have separate or independent meanings with respect to the other features. Weyl's principle helps in finding such independent, elementary components; his principle is that “the elementary components of a system are the irreducible representations of the symmetry group of the system” [7, page 3].

Hence, in order to learn the elementary components of a system the symmetry group working on the system needs to be known and the irreducible representations of this group need to be learned. A first step in this direction has been taken by proposing a probabilistic framework for learning fully reduced representations of the Lie group SO(2). This Lie group is a good starting point, since it is both commutative and compact, which makes it easy to work with. Since it is compact, all representations will be fully reducible, that is, they can be decomposed into irreducible representations. Furthermore, all irreducible representations are of complex dimension one (and hence real dimension two) due to the commutativity of SO(2). These irreducible representations are uniquely given by the weight of the representation, which is the number n ∈ N for which

R(θ) = exp( nθ [ 0, −1; 1, 0 ] ).
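As a quick numerical check (a sketch added here, with arbitrary values of n and θ), the matrix exponential indeed produces the rotation by the angle nθ:

```python
import numpy as np
from scipy.linalg import expm

n, theta = 3, 0.7
A = np.array([[0.0, -1.0],
              [1.0,  0.0]])
R = expm(n * theta * A)                                    # exp(n*theta*[[0,-1],[1,0]])
R_expected = np.array([[np.cos(n * theta), -np.sin(n * theta)],
                       [np.sin(n * theta),  np.cos(n * theta)]])
print(np.allclose(R, R_expected))                          # True
```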

2.2.1 Toroidal Subgroup Analysis

All finite-dimensional irreducible representations of SO(2)^J for J ∈ N are given by a (tensor) product of irreducible representations of SO(2). The algorithm for learning irreducible representations of SO(2), called Toroidal Subgroup Analysis (TSA), is able to learn representations of compact commutative subgroups of SO(n) for n ∈ N, called toroidal subgroups. All elements of such a toroidal group can be written in the form ρ_φ = W R(φ) W^t for W an orthogonal matrix and R(φ) a block-diagonal matrix with blocks of the form

R(φ_j) = [ cos(φ_j), −sin(φ_j); sin(φ_j), cos(φ_j) ]

on the diagonal. Such a toroidal group can hence be parametrized by φ = (φ1, . . . , φJ)

for some J ∈ N and the name “toroidal” arises from the fact that all these components are periodic. This also gives an identification with SO(2)J.

The probabilistic model relates each data pair (x, y) by a transformation with additional Gaussian noise, that is,

p(y|x, φ) = N(y | ρ_φ x, σ²), or y = ρ_φ x + ε for ε ∼ N(0, σ²).

The components φ_j are assumed to be marginally independent and von-Mises distributed. The Bessel functions are the canonical solutions y(x) of Bessel's differential equation

x² d²y/dx² + x dy/dx − (x² + α²) y = 0,

where α ∈ C is called the order of the Bessel function. Let I0 be the modified Bessel

function of order 0, then

p(φ_j) = (1/(2π I_0(κ_j))) exp(κ_j cos(φ_j − µ_j)),

where the mean µ and the precision κ are parameters of the von-Mises distribution. The von-Mises distribution is an example of an exponential family.
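The density is straightforward to evaluate numerically; the sketch below (illustrative parameter values, using SciPy's modified Bessel function of order 0) checks that it integrates to one:

```python
import numpy as np
from scipy.special import i0             # modified Bessel function of order 0

def von_mises_pdf(phi, mu, kappa):
    # p(phi) = exp(kappa * cos(phi - mu)) / (2 * pi * I0(kappa))
    return np.exp(kappa * np.cos(phi - mu)) / (2 * np.pi * i0(kappa))

phi = np.linspace(0.0, 2.0 * np.pi, 2000)
p = von_mises_pdf(phi, mu=np.pi, kappa=2.0)
print(np.trapz(p, phi))                   # approximately 1
```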

Definition 2.11. The exponential family of distributions over x, given parameters η, is defined to be the set of distributions of the form

p(x|η) = h(x) g(η) exp{η^t u(x)}.

In this definition, η are called the natural parameters of the distribution. The “coefficient” g(η) ensures normalization. The von-Mises distribution can be expressed in natural parameters as

p(φ_j) = (1/(2π I_0(‖η_j‖))) exp(η_j^t T(φ_j))

for T (φj) = (cos(φj), sin(φj))t the sufficient statistics.

Definition 2.12. A statistic T(X) is sufficient for underlying parameter θ if P(X = x | T(X) = t, θ) = P(X = x | T(X) = t).

A very important result shown in [7] is that the posterior p(φ|x, y) is again a product of von-Mises distributions, called a conjugacy relation.

Definition 2.13. A family F of prior distributions p(θ) is conjugate to a likelihood p(D|θ) if the resulting posterior p(θ|D) ∈ F .

The advantage of such a conjugacy relation is that the model can be updated with data iteratively, so that newly acquired data can be incorporated as it arrives. A similar conjugacy relation can be found for the coupled model, where the data is related by a transformation of the form

ρ_φ = W diag( R(ω_1 φ), . . . , R(ω_J φ) ) W^t

for ω1, . . . , ωJ the weights of the irreducible representations mentioned previously. In

this case, the generalized von-Mises is the right prior. The other model is incorrect in representing 2D image rotations due to the fact that all components of φ can separately


turn the invariant subspaces (they are “uncoupled”), but the model appears to be easier to learn.

In both models, φ are latent variables and the orthogonal matrix W is what needs to be learned (along with the weights ωi for the uncoupled model). The learning of these

parameters is done by maximizing the marginal likelihood. This can be done using the expectation maximization (EM) algorithm, where in the first step

Q(θ|θ^t) = E_{φ|x,y,θ^t}[log p_θ(y, φ|x)] = ∫ [log p_θ(y, φ|x)] p_{θ^t}(φ|x, y) dφ

is computed and in the second step θt+1 = argmaxθQ(θ|θt) is chosen. Explicit formulas

for Q(θ|θt) and the gradient are found for the TSA model. This is unique: it can be

expected that such explicit formulas may not be found for other Lie groups and that approximation techniques are required.

After learning the matrix W and computing u = Wtx and v = Wty, the manifold distance between x and y can be computed as

d(x, y) = min_φ ‖y − ρ_φ x‖² = Σ_j ‖v_j − R(µ_j) u_j‖²

for µj the mean of p(φj|x, y, κj = 0). This gives the distance between the orbits of x

and y under the group {ρφ}φ, such that elements from the same orbit have zero distance

to each other. Such a distance can make the idea of “rotated pictures should be seen as the same picture” more concrete: all elements in the same orbit (that is, those that are related by a certain type of transformation, for example a rotation) can be seen as equivalent. In a machine learning task, this can be useful for labeling an entire orbit of elements at once. Experiments have shown that manifold distance is able to filter out variation by image rotation almost completely in a nearest neighbour classification task.

2.2.2 Computational framework for SO(3)

The approach described in the previous section can be extended to more Lie groups. However, non-commutative groups have worse tractability properties. Certain formulas that could be found in closed form when learning distributions over representations of toroidal subgroups, may be hard to compute analytically in the case of non-commutative Lie groups. Furthermore, the domain may be very high-dimensional, such that most numerical methods do not work well in general.

An example of a formula that is hard to compute analytically can already be found for the Lie group SO(3). The set-up for learning representations can be done in a similar fashion as has been done for SO(2) and a formula of the form

E_{φ|x,y}[log p(y, φ|x)] = ∫ [log p(y, φ|x)] p(φ|x, y) dφ

is encountered again in the EM algorithm. For toroidal subgroups, the integral above could be solved analytically, but the integral for SO(3) appears to be more difficult. A


standard solution for approximating such integrals, which also generalizes well to higher dimensional domains, is called Monte Carlo, where φ(i) ∼ p(φ|x, y) are sampled for i = 1, . . . , L and the integral is approximated as

(1/L) Σ_{i=1}^{L} log p(y, φ^{(i)}|x).

For each data pair (x, y), samples φ(i) ∼ p(φ|x, y) are required. The Monte Carlo method

and sampling methods will be discussed in the next chapter.

A computational disadvantage of the non-commutativity of SO(3) is that SO(3) has irreducible representations of various dimensions. Since SO(3) is compact, all representations can be written as a direct sum of irreducible representations again, but now all these irreducible representations may have any (complex) dimension 2l + 1 for l ∈ N. A derivation of the irreducible representations of SO(3) can be found in [18, Chapter 8].

Recently, a computationally efficient method for working with these irreducible rep-resentations has been proposed [16]. Let V = L2(S2), the Hilbert space of the square-integrable functions on S2. Define a representation ρ : SO(3) → GL(V ) by

(ρ(g)(f ))(x) = f (g−1x), x ∈ S2, g ∈ SO(3). We need to verify that ρ is a homomorphism:

(ρ(gh)(f))(x) = f((gh)^{-1}x) = f(h^{-1}g^{-1}x) = (ρ(h)f)(g^{-1}x) = (ρ(g)(ρ(h)f))(x) for all x ∈ S² and hence

ρ(gh)f = ρ(g)ρ(h)f

for all f ∈ V , which gives ρ(gh) = ρ(g)ρ(h) for all g, h ∈ SO(3) as desired. All irreducible representations can be found as subrepresentations of this representation by picking subspaces El generated by 2l + 1 real or complex spherical harmonics functions, which

are simply well-chosen elements of V (but with quite complicated formulas).

The elements of SO(3) can be parametrized by Euler angles. The Euler angles represent rotation angles for rotations about one of the three axes in R^3. Various conventions exist, but in this thesis the (3,2,3) or (z, y, z) convention will be applied where (α, β, γ) stands for the element

[ cos γ, −sin γ, 0; sin γ, cos γ, 0; 0, 0, 1 ] · [ cos β, 0, −sin β; 0, 1, 0; sin β, 0, cos β ] · [ cos α, −sin α, 0; sin α, cos α, 0; 0, 0, 1 ] ∈ SO(3).

It is shown in [16] that the matrix form of elements of SO(3) in the basis given by the lth real spherical harmonics can be written as

[ρ|_{E_l}((α, β, γ))] = X_l(α) J_l X_l(β) J_l X_l(γ)

for J_l a fixed matrix and X_l(φ) a matrix consisting mainly of zeros, but with

(cos(−lφ), cos((−l + 1)φ), . . . , cos(lφ))

on the diagonal and

(sin(−lφ), sin((−l + 1)φ), . . . , sin(lφ))

on the anti-diagonal.

In the computational framework in which the sampling methods are integrated, the conventions above are used. In this case, φ = (α, β, γ) ∈ [0, 2π] × [0, π] × [0, 2π] is the form in which the elements need to be sampled and the density takes the form

p(φ|x, y) ∝ exp( −(1/σ²) (y^t W) R(φ) (W^t x) )

where R(φ) = ρ(φ) is φ in some representation. The symbol “∝” stands for “proportional to”, where a ∝ b if there exists a fixed constant C such that a = Cb. The function p depends on the data, the matrix W that is to be learned and the representation, where for this thesis the dimensions of the irreducible representations which make up the representation ρ are prescribed. In a further version of the model, it would be desirable to learn these dimensions or even the Lie group the representations are taken over as well.

It is only possible to compute an unnormalized version of p(φ|x, y) (since computation of the normalization constant would be expensive) which implies that computations with values from p can become too large, causing computation errors due to overflow. When probabilities need to be multiplied often, underflow or overflow can appear easily. For this reason, the code needs to work with log probabilities instead.

The sampling domain can also be converted to S^3. Mathematically speaking, there exists a double cover f : S^3 → SO(3). This means that for every element φ ∈ SO(3) exactly two elements in S^3 exist which are mapped to φ under f, in such a way that in a small area around either of these elements, f is a homeomorphism (a continuous bijection with a continuous inverse). Such a map is, however, hard to construct explicitly. Based on [1] and [15], formulas for converting unit quaternions to Euler angles are derived here, which will be useful later on. Here S^3 ⊂ R^4 is seen as the set of unit quaternions.

Assume a unit quaternion is written as q = w + ix + jy + kz, then the corresponding rotation matrix is

[ 1 − 2y² − 2z², 2xy − 2zw, 2xz + 2yw; 2xy + 2zw, 1 − 2x² − 2z², 2yz − 2xw; 2xz − 2yw, 2yz + 2xw, 1 − 2x² − 2y² ].

With the abbreviations c = cos and s = sin,

[ cφ, −sφ, 0; sφ, cφ, 0; 0, 0, 1 ] · [ cψ, 0, −sψ; 0, 1, 0; sψ, 0, cψ ] · [ cξ, −sξ, 0; sξ, cξ, 0; 0, 0, 1 ] = [ ·, ·, −sψ cφ; ·, ·, −sψ sφ; sψ cξ, −sψ sξ, cψ ]

with expressions like cφ cξ − sφ cψ sξ on the dots. This shows that a problem arises when ψ ∈ {0, π}, since then this matrix takes the form

[ a_1, b_1, 0; a_2, b_2, 0; 0, 0, 1 ]

and many choices of φ, ξ will give this particular matrix. This is called gimbal lock. For ψ ∉ {0, π}, the solutions are

φ = tan^{-1}(c_2/c_1) = atan2(2yz − 2xw, 2xz + 2yw),
ψ = cos^{-1}(c_3) = cos^{-1}(1 − 2x² − 2y²),
ξ = tan^{-1}(b_3/(−a_3)) = atan2(2yz + 2xw, −2xz + 2yw).

The values (φ, ψ, ξ) correspond to (γ, β, α).
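The conversion translates directly into code. The sketch below (the function name and the returned order are choices made here; it assumes a unit quaternion away from the gimbal-lock cases ψ ∈ {0, π}) implements the formulas above:

```python
import numpy as np

def quaternion_to_euler_zyz(w, x, y, z):
    """Convert a unit quaternion (w, x, y, z) to Euler angles (alpha, beta, gamma)
    in the (z, y, z) convention, using
      phi = atan2(2yz - 2xw, 2xz + 2yw),  psi = arccos(1 - 2x^2 - 2y^2),
      xi  = atan2(2yz + 2xw, -2xz + 2yw), with (phi, psi, xi) = (gamma, beta, alpha)."""
    phi = np.arctan2(2 * y * z - 2 * x * w, 2 * x * z + 2 * y * w)     # gamma
    psi = np.arccos(np.clip(1 - 2 * x ** 2 - 2 * y ** 2, -1.0, 1.0))   # beta
    xi = np.arctan2(2 * y * z + 2 * x * w, -2 * x * z + 2 * y * w)     # alpha
    return xi, psi, phi                                                 # (alpha, beta, gamma)

q = np.array([0.3, 0.5, 0.4, np.sqrt(1 - 0.3**2 - 0.5**2 - 0.4**2)])   # unit quaternion (w, x, y, z)
print(quaternion_to_euler_zyz(*q))
```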

The derivative of these conversion formulas will also be required later on, when the need arises to compute the gradient of

U (w, x, y, z) = − log[p(f (w, x, y, z))]

for f the conversion formula and p the density defined in Euler angles. The chain rule gives U'(θ) = [−log p]'(f(θ)) Df(θ). Using Wolfram Alpha the partial derivatives can be computed, which together form the total derivative

Df(w, x, y, z) = [ −z/(w² + z²), y/(x² + y²), −x/(x² + y²), w/(w² + z²); 0, 4x/√(1 − (1 − 2x² − 2y²)²), 4y/√(1 − (1 − 2x² − 2y²)²), 0; −z/(w² + z²), −y/(x² + y²), x/(x² + y²), w/(w² + z²) ],

where the rows correspond to the outputs (α, β, γ) = (ξ, ψ, φ) and the columns to (w, x, y, z).

The derivative of −log p can also be found reasonably easily. This is a function

φ ↦ −V^t R(φ) U

and has derivative −V^t (dR(φ)/dφ) U. Recall that

R(φ) = ⊕_{i=1}^{r} R_{l_i}(α, β, γ) = ⊕_{i=1}^{r} X_{l_i}(α) J_{l_i} X_{l_i}(β) J_{l_i} X_{l_i}(γ)

for certain l_1, . . . , l_r ∈ N. The direct sum notation implies that R(φ) is a block-diagonal matrix with blocks R_{l_1}(φ), . . . , R_{l_r}(φ), such that

dR(φ)/dφ = diag( dR_{l_1}(φ)/dφ, dR_{l_2}(φ)/dφ, . . . , dR_{l_r}(φ)/dφ ).

A derivative dR_l(φ)/dφ is again given by the matrix of partial derivatives, where for example

dR_l(φ)/dα = (dX_l(α)/dα) J_l X_l(β) J_l X_l(γ),

and dX_l(α)/dα follows directly from the cosines and sines that make up X_l(α).


3 Sampling methods

Many applications, such as those using the EM algorithm, require the computation of an integral of the form

E_p[f(z)] = ∫ f(z) p(z) dz

for some function f and some probability distribution p. The computation of E_p[f(z)]

may be intractable, such that an approximation method is required. A possible method for this purpose is Monte Carlo.

Definition 3.1. If z_1, . . . , z_n ∼ p i.i.d., then μ̂_n = (1/n) Σ_{i=1}^{n} f(z_i) is a (basic) Monte Carlo estimator of E_p[f(z)], where z ∼ p.

This requires a method for getting samples z1, . . . , zn∼ p. The meaning of “z1, . . . , zn∼

p” will be given in the next section and various methods for gaining such samples will be discussed in this chapter.

Remark 3.2. In this thesis, sampling methods for the particular problem explained in the previous chapter are evaluated. The need of a sampling method is, however, a standard problem which arises in many fields of study and this chapter will therefore discuss sampling methods in a general framework.

Remark 3.3. The Monte Carlo method is supported by mathematical theory. The following properties hold.

• E[μ̂_n] = E[f(z)], that is, μ̂_n is an unbiased estimator.

• Assuming that the variance of f(z) is finite, μ̂_n is a consistent estimator, since for all ε > 0,

P(|μ̂_n − E[f(z)]| < ε) → 1 (as n → ∞).

This follows from the law of large numbers.

• The variance σ²(μ̂_n) = (1/n) σ²(f(z)) → 0. The mean squared error is defined as the squared bias plus the variance. Since μ̂_n is unbiased, the mean squared error equals the variance and goes to zero at rate 1/n, so the root mean squared error decreases at the rate 1/√n.

• The probability distribution does not need to be normalized, since if q(z) = (1/Z) p(z) for Z the normalization constant, then p ∝ q.

A well-known application of Monte Carlo is that it can be used to approximate π, known under the name “Buffon’s needle problem”.
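For completeness, a small sketch of this estimate (assuming a needle whose length equals the spacing of the lines; the crossing probability is then 2/π, so π can be recovered from the observed crossing frequency):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
length, spacing = 1.0, 1.0                              # needle length l and line distance d, l <= d

center = rng.uniform(0.0, spacing / 2, size=n)          # distance of the needle centre to the nearest line
angle = rng.uniform(0.0, np.pi / 2, size=n)             # acute angle between needle and lines
hits = np.sum(center <= (length / 2) * np.sin(angle))   # needle crosses a line

print(2 * length * n / (spacing * hits))                # P(cross) = 2l/(pi*d), so this approximates pi
```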


3.1 Definition of sampling methods

In this section the intuition behind sampling methods and the precise mathematical definition is given based on [17]. The formal definition is needed in Section 3.3, but Section 3.2 only requires some intuition.

The intuition behind sampling methods is that points need to be sampled “according to” the given density p.

Example 3.4. The way sampling methods ought to work becomes more clear when p is a probability density function for a discrete distribution. Assume the domain Ω consists of two elements, x and y. In this case, p is called a probability mass function. Assume p(x) = 0.2 and p(y) = 0.8, such that y gets four times the mass of x. A sampling method which samples “according to” the distribution, should return y four times as often as x as well (approximately).

In the continuous case, the idea of sampling becomes a bit more complicated, since in theory every point has zero chance of being sampled (when the sampling domain is equipped with the usual measure). Still, the intuition holds that the larger the value that p assigns to a point, the more often it ought to be picked.

In order to understand the mathematical definition of sampling, some measure theory needs to be introduced first. Mathematicians have founded probability theory formally based on measure theory. Start with a set Ω, called the sample space and define a σ-algebra F on this.

Definition 3.5. A collection F of subsets of a sample space Ω is called a σ-field or a σ-algebra, if ∅ ∈ F and F is closed under countable unions and taking complements.

Next, define a measure ν : F → [0, ∞] which has ν(∅) = 0 and is countably additive, that is, for all disjoint A_i ∈ F,

ν( ∪_{i=1}^{∞} A_i ) = Σ_{i=1}^{∞} ν(A_i).

Definition 3.6. The triple (Ω, F , ν) is called a measure space.

A simple example is R, with the Borel σ-algebra B (which consists of countable unions of intervals) and the Lebesgue measure (which assigns each interval its length and is defined appropriately when taking unions of intervals).

If ν(Ω) = 1, the measure ν is called a probability measure. When the measure in the triple of a measure space is left out the remaining pair is called a measurable space. Another important definition is that of a measurable function.

Definition 3.7. A function f : Ω1 → Ω2 between measurable spaces (Ωi, Fi) is called

measurable if f−1(A) ∈ F1 for all A ∈ F2.

Continuous functions and indicator functions 1A for A measurable are examples of

measurable functions, and “continuous” combinations of measurable functions give measurable functions again.


In probability theory, a measurable function X is called a random element. If it also takes values in (R, B) or (R^k, B^k), then it is called a random variable or random k-vector. For a probability measure P and a random vector X, we call P_X = P ∘ X^{-1} the

distribution of X. The notation P ◦ X−1 stands for the induced measure by X and is a measure on Rk defined by

P ◦ X−1(B) = P (X ∈ B) = P (X−1(B)) for B ∈ B.

Now that measures are defined, integrals and the Radon-Nikodym derivative can also be defined. Let λ, ν be two measures on (Ω, F ). We call λ absolutely continuous with respect to ν if ν(A) = 0 =⇒ λ(A) = 0 for all A ∈ F . When ν is also σ-finite (Ω is the union of countably many sets with finite ν-measure), then there exists a nonnegative (Borel) function f on Ω such that

λ(A) = ∫_A f dν, A ∈ F,

unique almost everywhere (a.e.) with respect to ν (any other function with the same properties may only take different values on “negligible” sets to which ν assigns zero measure) according to the Radon-Nikodym theorem. The function f is called the Radon-Nikodym derivative or density of λ with respect to ν and denoted by dλ/dν. If ∫ f dν = 1 and f ≥ 0 a.e. ν, then λ is a probability measure and f is called its probability density function (pdf) with respect to ν. The definition above can be seen as a generalization of the usual definition of a pdf f by

P(a ≤ X ≤ b) = P(X ∈ [a, b]) = ∫_a^b f(x) dx.

It is also possible to define a product measure in the most obvious way (when working with σ-finite measures). This gives the possibility of defining a joint pdf with respect to a product measure on Bk for a random k-vector (X1, . . . , Xk). The i-th marginal pdf

can be found by integrating over the other variables:

f_i(x) = ∫_{R^{k−1}} f(x_1, . . . , x_{i−1}, x, x_{i+1}, . . . , x_k) dν_1 ⋯ dν_{i−1} dν_{i+1} ⋯ dν_k.

We call X1, . . . , Xn independent random variables if

f (x1, . . . , xn) = f1(x1) · · · fn(xn).

Definition 3.8. Given a measure space (Ω, F , P ) and a distribution PX, we call X1, . . . , Xn

a sample of PX of length n if X1, . . . , Xn are independent random variables and have

distribution PX.

Similarly, a sample X from a density f is a random variable which has f as probability density, that is, f is the Radon-Nikodym derivative of P_X with respect to the Lebesgue measure. The symbol “∼” usually denotes an equivalence relation. In statistics it is used for the equivalence relation “has the same distribution as”. Hence, X ∼ N(0, 1) means that the random variable has the same density function as N(0, 1) (where N(0, 1) is seen as a random variable). Note that the definition above presupposes that X_1, . . . , X_n are measurable functions to R^m for some

m ∈ N, but a similar definition could also be given for manifolds in general. We usually call their realizations Xi(ω) samples as well. In the context of sampling the functions

Xi are taken to be the identity.

3.2 Existing sampling methods

In this section, an overview will be given of the basic sampling methods and a state-of-the-art method called Spherical HMC will be explained.

One of the easiest sampling methods is called rejection sampling. Given an unnormalized density function p(z), define a proposal distribution q(z) from which samples can be drawn and find a constant K such that Kq(z) ≥ p(z) for all z. The algorithm works by sampling x ∼ q(z) and accepting x according to the difference between q and p in order to compensate for the fact that the wrong distribution is being sampled from. The probability that x is accepted is p(x)/(Kq(x)), which is between 0 and 1 by definition of K. Accepting according to a probability can be achieved by sampling u_x ∼ Uniform[0, Kq(x)] and accepting x if p(x) ≥ u_x.
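A minimal sketch of rejection sampling (the unnormalized target on [0, 1], the uniform proposal and the constant K are illustrative choices made here):

```python
import numpy as np

rng = np.random.default_rng(2)

def p_unnorm(z):
    return np.exp(np.sin(6 * z))          # unnormalized target density on [0, 1]

K = np.e                                  # with the uniform proposal q(z) = 1, K*q(z) >= p(z) holds since p <= e

def rejection_sample(n):
    samples = []
    while len(samples) < n:
        x = rng.uniform(0.0, 1.0)         # sample x ~ q
        u = rng.uniform(0.0, K * 1.0)     # u_x ~ Uniform[0, K*q(x)]
        if p_unnorm(x) >= u:              # accept x with probability p(x) / (K*q(x))
            samples.append(x)
    return np.array(samples)

samples = rejection_sample(5_000)
```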

A problem with this method is that it requires a good proposal distribution. If the constant K needs to be chosen large, which happens if q and p differ a lot, then this will cause many samples to be rejected.

Importance sampling also requires a proposal distribution q(z) for p(z), this time with the property that q(z) = 0 =⇒ p(z) = 0. Note that

E[f(z)] = ∫ f(x) p(x) dx = ∫ f(x) (p(x)/q(x)) q(x) dx ≈ (1/n) Σ_{i=1}^{n} f(z_i) p(z_i)/q(z_i)

gives an approximation of the integral using samples z_i ∼ q. The importance weights p(z_i)/q(z_i) compensate for sampling from the wrong distribution. As with rejection sampling,

it is important that the distribution q(z) is close to the distribution p(z). Samples for which p(x)/q(x) is small will contribute little to the expectation and lead to a bad estimation of the expectation of f . With a slight change, this sampling technique is also able to handle unnormalized distributions, see [3, page 533].
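A sketch of the importance sampling estimate (a standard normal target, a wider normal proposal and f(z) = z² are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)

def p(z):                                 # target density: standard normal
    return np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)

def q(z):                                 # proposal density: normal with standard deviation 2
    return np.exp(-0.5 * (z / 2) ** 2) / (2 * np.sqrt(2 * np.pi))

def f(z):
    return z ** 2                         # E_p[f(z)] = 1 exactly

n = 100_000
z = 2.0 * rng.standard_normal(n)          # samples z_i ~ q
weights = p(z) / q(z)                     # importance weights p(z_i)/q(z_i)
print(np.mean(f(z) * weights))            # close to 1
```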

3.2.1 Markov chains

A lot of sampling methods simulate a Markov chain which has the desired distribution as its so-called stationary distribution. This introduction to Markov chains is based on [19] and [3, Chapter 11.2.1]. Let Xt denote the value of a random variable X at time t.

We call X a Markov process if

P(X_{t+1} = s_j | X_0 = s_{i_0}, . . . , X_t = s_{i_t}) = P(X_{t+1} = s_j | X_t = s_{i_t}),

that is, if the next state only depends on the current state. The sequence (X_0, . . . , X_n) is then called a Markov chain and P(i → j) = P(X_{t+1} = s_j | X_t = s_i) are called

transition probabilities. Let πj(t) = P (Xt = sj) denote the probability that the chain

is in state j at time t. Initialize the vector π(0). The Chapman-Kolmogorov equation describes the other values by

π_i(t + 1) = Σ_k P(k → i) π_k(t).

Putting P(i → j) as the (i, j)th entry of a matrix P, it follows that π(t) = π(0)P^t. A distribution π* is called stationary if π* = π*P. The word ‘stationary’ in the definition can be explained by the fact that π(t) = π* implies π(t + 1) = π* when π* is a stationary distribution.
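A tiny numerical illustration (the three-state transition matrix is an arbitrary example) of the relation π(t) = π(0)P^t and the fixed-point property of the stationary distribution:

```python
import numpy as np

P = np.array([[0.5, 0.4, 0.1],            # transition matrix, P[i, j] = P(i -> j); rows sum to 1
              [0.2, 0.5, 0.3],
              [0.1, 0.3, 0.6]])

pi = np.array([1.0, 0.0, 0.0])            # initial distribution pi(0)
for _ in range(200):                      # iterate pi(t+1) = pi(t) P
    pi = pi @ P

print(pi)                                 # (approximate) stationary distribution pi*
print(np.allclose(pi, pi @ P))            # pi* = pi* P, so True
```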

Another desirable property is that the convergence π(t) → π* is independent of the chosen initialization π(0). A sufficient condition for a unique stationary distribution is the existence of a t such that every entry of P^t is strictly greater than zero, which is called ergodicity. For finite Markov chains, this is equivalent to irreducibility (meaning every point can be reached) and aperiodicity (meaning that there is no fixed period needed to walk from a point to itself, that is, the t for which P^t(i → i) > 0 have greatest common divisor 1).

We want to construct Markov chains which have a certain probability distribution as their unique stationary distribution. A manner for verifying that this stationary distribution is the desired distribution, is by means of the detailed balance property. If the equation

P(j → k) π*_j = P(k → j) π*_k

holds for all k and j, then this gives a sufficient condition for π∗ being the stationary distribution.

3.2.2 Metropolis algorithm

The first proposed MCMC method is the Metropolis algorithm [13] in 1953, which was improved by Hastings in 1970 [10]. The method is able to draw samples from an unnormalized probability distribution p and requires the programmer to think of a good jumping distribution q(z|z_t) which represents the probability of going from z_t to a certain state z (when combined with a certain acceptance probability). This jumping distribution q has to be symmetric for the original Metropolis algorithm and Hastings improved the algorithm by adjusting it in such a way that this symmetry is no longer required.

The algorithm works as follows.

• Start with initial value z(0) such that p(z(0)) > 0.

• For t = 1, . . . , k, sample a candidate point z* from q(z|z^{(t−1)}) and accept this with probability

α(z^{(t−1)}, z*) = min( p(z*) q(z^{(t−1)}|z*) / [ p(z^{(t−1)}) q(z*|z^{(t−1)}) ], 1 ).


• When the stationary distribution has been reached, say at t = k, let (z(k+1), . . . , z(k+n))

be the required samples from p.
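A compact sketch of the algorithm for a one-dimensional unnormalized density (the Gaussian jumping distribution and the target used here are illustrative; since this proposal is symmetric, the q-ratio in the acceptance probability cancels and the original Metropolis form is recovered):

```python
import numpy as np

rng = np.random.default_rng(4)

def p_unnorm(z):                                           # unnormalized target density
    return np.exp(-0.5 * (z - 1.0) ** 2) + 0.5 * np.exp(-0.5 * (z + 2.0) ** 2)

def metropolis(n_steps, z0=0.0, step=1.0):
    chain, z = [z0], z0
    for _ in range(n_steps):
        z_star = z + step * rng.standard_normal()          # candidate from a symmetric jumping distribution
        alpha = min(p_unnorm(z_star) / p_unnorm(z), 1.0)   # acceptance probability
        if rng.uniform() < alpha:
            z = z_star
        chain.append(z)
    return np.array(chain)

chain = metropolis(20_000)
samples = chain[5_000:]                                    # discard burn-in before stationarity
```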

In order to check the existence of a stationary distribution, the detailed balance condition needs to be verified. Inspection gives

P(j → k) = q(s_k|s_j) α(s_j, s_k) and π*_k = p(s_k), and

q(z*|z^{(t−1)}) p(z^{(t−1)}) α(z^{(t−1)}, z*)
= q(z*|z^{(t−1)}) p(z^{(t−1)}) min( p(z*) q(z^{(t−1)}|z*) / [ p(z^{(t−1)}) q(z*|z^{(t−1)}) ], 1 )
= min( p(z*) q(z^{(t−1)}|z*), p(z^{(t−1)}) q(z*|z^{(t−1)}) )
= p(z*) q(z^{(t−1)}|z*) min( 1, p(z^{(t−1)}) q(z*|z^{(t−1)}) / [ p(z*) q(z^{(t−1)}|z*) ] )
= p(z*) q(z^{(t−1)}|z*) α(z*, z^{(t−1)}).

This shows that the desired probability distribution p is the stationary distribution of the Markov chain that is created. However, it can be hard to know whether the stationary distribution has been reached and a lot of samples may be rejected if the jumping distribution is badly chosen. In the latter case, the chain is said to be poorly mixing.

The Gibbs sampler [9] is a special case of this Metropolis-Hastings algorithm which works with the univariate conditional distributions (the distribution when all of the random variables but one are assigned fixed values). At each step, a coefficient zi∗ of the candidate point is sampled from

p(z_i | z_1 = z*_1, . . . , z_{i−1} = z*_{i−1}, z_{i+1} = z^{(t)}_{i+1}, . . . , z_n = z^{(t)}_n)

for i = 1, . . . , n and the resulting candidate is always accepted (α(z(t), z∗) = 1).

3.2.3 Hybrid Monte Carlo

Hybrid Monte Carlo or Hamiltonian Monte Carlo (HMC) methods are a class of MCMC methods originating from physics. An intuitive picture is sketched in [14]: imagine a ball rolling over a landscape made up of hills. The ball has kinetic energy (from its rolling speed) and potential energy (from its height on the landscape). Due to the addition of the kinetic energy to the model, the ball is able to roll down a hill and up to the top of the next. A problem with Metropolis-Hastings is that either many candidates are rejected or very small steps in the domain are made. HMC methods ought to solve this issue of Metropolis-Hastings.

Assume p is the density function that needs to be sampled from. Let θ denote the position variable which has potential energy U(θ) = −log p(θ) and introduce an auxiliary momentum variable v with kinetic energy K(v) = ‖v‖²/2. The momentum variables correspond to the rate of change of θ.


The Hamiltonian is defined by H(θ, v) = U(θ) + K(v). The Hamiltonian equations

dθ_i/dt = ∂H/∂v_i,    dv_i/dt = −∂H/∂θ_i

describe how θ and v change over time. These equations define a mapping T_s : (θ(t), v(t)) ↦ (θ(t + s), v(t + s)).

An important property of Hamiltonian dynamics is that Ts is invertible (the inverse can

be obtained by negating the time derivatives in the Hamiltonian equations) which shows that the dynamics are reversible. Other important properties are that the dynamics leave the Hamiltonian invariant, that Ts is volume-preserving and that the Hamiltonian

dynamics are symplectic. These properties are maintained when the dynamics are approximated, which is crucial for implementation since time needs to be discretized. More information on these properties can be found in [14, Section 2.2].

In order to approximate the dynamics, a stepsize ε > 0 for time is introduced. The leapfrog method usually outperforms other approximation methods such as Euler's method [14, page 8]. The updates are given by

v_i(t + ε/2) = v_i(t) − (ε/2) ∂U/∂θ_i(θ(t)),
θ_i(t + ε) = θ_i(t) + ε v_i(t + ε/2),
v_i(t + ε) = v_i(t + ε/2) − (ε/2) ∂U/∂θ_i(θ(t + ε)).

These updates are used in the HMC algorithm to generate proposals. Let q(θ, v) = (1/Z_p) exp(−H(θ, v)). The HMC algorithm consists of two steps, both of which leave q invariant:

• Draw new momentum variables vi randomly from N (0, 1).

• Perform a Metropolis update for the variables (θ, v). A proposal is generated by performing the updates of the leapfrog method (given above) L times. The acceptance probability for a proposal (θ∗, v∗) is

min{1, exp(−H(θ∗, v∗) + H(θ, v))}.

A proof that the detailed balance condition holds for HMC can be found in [14, page 13]. The stepsize ε or the number of leapfrog steps L is sometimes chosen randomly within a small interval to ensure εL is not constantly equal to a periodicity in the density function.
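A minimal HMC sketch for an unconstrained target (a standard normal, chosen only for illustration), following the leapfrog updates and the Metropolis acceptance step described above:

```python
import numpy as np

rng = np.random.default_rng(5)

def U(theta):                                              # potential energy -log p(theta) for a standard normal
    return 0.5 * np.sum(theta ** 2)

def grad_U(theta):
    return theta

def hmc_step(theta, eps=0.1, L=20):
    v = rng.standard_normal(theta.shape)                   # fresh momentum v ~ N(0, 1)
    theta_new, v_new = theta.copy(), v.copy()
    v_new -= 0.5 * eps * grad_U(theta_new)                 # initial half step for the momentum
    for _ in range(L):
        theta_new += eps * v_new                           # full position step
        v_new -= eps * grad_U(theta_new)                   # full momentum step
    v_new += 0.5 * eps * grad_U(theta_new)                 # correct the last update back to a half step
    H_old = U(theta) + 0.5 * np.sum(v ** 2)
    H_new = U(theta_new) + 0.5 * np.sum(v_new ** 2)
    if rng.uniform() < np.exp(H_old - H_new):              # accept with min{1, exp(-H_new + H_old)}
        return theta_new
    return theta

theta = np.zeros(3)
samples = []
for _ in range(5_000):
    theta = hmc_step(theta)
    samples.append(theta.copy())
```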


3.2.4 Spherical HMC

Spherical HMC is an HMC method which is proposed in [12]. Like any HMC method, it works by simulating Hamiltonian dynamics. The momentum variables, potential energy, kinetic energy and Hamiltonian are defined the same as for the regular HMC algorithm. Spherical HMC is aimed at sampling from distributions defined on constrained domains, where the constraints may be of the form ‖z‖_p ≤ C for 1 ≤ p ≤ ∞. Useful variants are

‖z‖_1 = |z_1| + . . . + |z_n|,   ‖z‖_2 = √( Σ_{i=1}^{n} |z_i|² )   and   ‖z‖_∞ = max{ |z_1|, . . . , |z_n| }.

For example, the domain [0, 2π] × [0, π] × [0, 2π] of Euler angles can be transformed to [−1, 1]³, where the domain can be written as {z : ‖z‖_∞ ≤ 1}.

The idea of Lan et al. [12] is to map such a constrained domain to a ball {θ ∈ R^m : ‖θ‖ ≤ 1}. By defining θ_{m+1} = √(1 − ‖θ‖²) and θ̃ = (θ, θ_{m+1}), the ball can be mapped to the m-sphere S^m by θ ↦ θ̃. This gives a function f : S^m → Ω for Ω the original constrained domain. Spherical HMC samples from S^m and transforms these samples to samples from Ω according to this function (although the authors do not use this terminology). By splitting the Hamiltonian dynamics on S^m, analytical solutions for the updates can be found. Given the current position θ̃, an update is made in a similar fashion as HMC, where only the generation of a proposal is done differently. The proposal is generated in Spherical HMC by running the Hamiltonian dynamics according to the solutions that were found analytically. The process of proposal generation requires the gradient of U (as a function on the ball). The algorithm that can be used for Spherical HMC with U defined on a sphere (as will be required later on) is outlined in Algorithm 1, where the adjustments made were recommended by Shiwei Lan, one of the designers of the Spherical HMC algorithm.
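The two elementary transformations in this construction are easy to write down. The sketch below shows an affine rescaling of the Euler-angle box to [−1, 1]³ (the specific rescaling is an assumption made here; the text only states that such a transformation exists) and the augmentation of a point of the unit ball to a point on the sphere. The intermediate map from the cube to the ball is part of [12] and not spelled out in this chapter, so it is omitted:

```python
import numpy as np

def euler_to_cube(phi):
    # Rescale (alpha, beta, gamma) in [0, 2pi] x [0, pi] x [0, 2pi] affinely to [-1, 1]^3.
    alpha, beta, gamma = phi
    return np.array([alpha / np.pi - 1.0, 2.0 * beta / np.pi - 1.0, gamma / np.pi - 1.0])

def ball_to_sphere(theta):
    # Augment theta in the unit ball of R^m with theta_{m+1} = sqrt(1 - ||theta||^2).
    extra = np.sqrt(max(0.0, 1.0 - float(np.dot(theta, theta))))
    return np.append(theta, extra)

print(euler_to_cube(np.array([np.pi, np.pi / 2, np.pi / 3])))   # a point of [-1, 1]^3
print(ball_to_sphere(np.array([0.3, -0.2, 0.5])))               # a point on the sphere S^3
```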

3.3 Investigating new sampling methods

In this section, the investigated sampling methods are summarized. All methods are aimed at sampling from the probability density p(φ|x, y) defined in Chapter 2 and the samples can be taken from either [0, 2π] × [0, π] × [0, 2π] or S^3. Everything in this section is original and based on the ideas of my supervisor or based on my own ideas. The main idea proposed in this section is to sample from a different domain and transform those samples to the desired domain. Let M be a space on which the pdf p that needs to be sampled from is defined and N a space on which sampling is more easily done than on M. The idea is to define a function f : N → M and a pdf q on N such that X ∼ q =⇒ f ∘ X =: f(X) ∼ p.

Why is measure theory introduced?

The need for a more formal approach to this setting arose during an attempt to prove X ∼ q =⇒ f(X) ∼ p for two different cases. In the first case, M = S^3, N = R^4 and f is projection onto the sphere (which is actually ill-defined at the origin, but this can be


Algorithm 1 Spherical HMC (for distributions defined on a sphere S^D)

Initialize θ̃^(1).
for i = 1, . . . , N do
    Sample a momentum variable ṽ^(1) ∼ N(0, I_{D+1})
    Project to the tangent space: ṽ^(1) = ṽ^(1) − θ̃^(1) (θ̃^(1))^T ṽ^(1)
    Calculate H(θ̃^(1), ṽ^(1)) = U(θ̃^(1)) + K(ṽ^(1))
    for l = 1, . . . , L do
        ṽ^(l+1/2) = ṽ^(l) − (ε/2) [ I_D − θ^(l)(θ^(l))^T, −θ^(l) θ^(l)_{D+1}; −θ^(l)_{D+1} (θ^(l))^T, ‖θ^(l)‖² ] ∇U(θ̃^(l))
        θ̃^(l+1) = θ̃^(l) cos(ε‖ṽ^(l+1/2)‖) + (ṽ^(l+1/2)/‖ṽ^(l+1/2)‖) sin(ε‖ṽ^(l+1/2)‖)
        ṽ^(l+1/2) = −θ̃^(l) ‖ṽ^(l+1/2)‖ sin(ε‖ṽ^(l+1/2)‖) + ṽ^(l+1/2) cos(ε‖ṽ^(l+1/2)‖)
        ṽ^(l+1) = ṽ^(l+1/2) − (ε/2) [ I_D − θ^(l+1)(θ^(l+1))^T, −θ^(l+1) θ^(l+1)_{D+1}; −θ^(l+1)_{D+1} (θ^(l+1))^T, ‖θ^(l+1)‖² ] ∇U(θ̃^(l+1))
    end for
    Calculate H(θ̃^(L+1), ṽ^(L+1)) = U(θ̃^(L+1)) + K(ṽ^(L+1))
    Calculate the acceptance probability α = exp{−H(θ̃^(L+1), ṽ^(L+1)) + H(θ̃^(1), ṽ^(1))}
    Sample u ∼ U[0, 1] uniformly
    if u ≤ α then
        θ̃^(1) = θ̃^(L+1)
    end if
    Save the acquired sample θ̃^(1)
end for


overlooked). A pdf p on M is given and a pdf q needs to be defined on R^4. Intuitively, q should spread the density of p over R^4. However, in order to have a finite integral for q, the spreading should be controlled. Therefore, a “spreading function” r can be introduced and q can be defined as p(f(x)) r(‖x‖). In order to verify the condition X ∼ q =⇒ f(X) ∼ p, it might seem most logical to verify that

∫_{f^{-1}({x})} q(y) dy ∝ p(x),

namely that for each point x ∈ S^3, an amount proportionate to p(x) is spread over R^4. In the second context, M = SO(3), N = S^3 and f : N → M is a double cover (a two-to-one map with some nice properties). Since every point in M has exactly two points in N which are mapped to it, it seems that surely q(x) = ½ p(f(x)) must work. However, ∫_{f^{-1}({x})} q(y) dy = 0 in this case. It is not immediately trivial which condition it is that makes both cases work, although both are intuitively sound methods. In such cases, it is easier to know what it is exactly that needs to be proven and for such reasons, I chose to introduce more mathematical definitions. Since such conditions were hard to find in literature (except for the well-known change-of-variables formula), I introduce them here in a general context for arbitrary measurable domains. In particular, I merely derive the conditions by unraveling the definitions and these may be well-known in other contexts, although they have not been applied yet in the context of sampling to my knowledge.

Why is change-of-variables insufficient?

In the most basic form, change-of-variables can be stated as

∫_M p d(f_*µ) = ∫_N (p ∘ f) dµ

for p a measurable function on M and f∗(µ) the pushforward of µ which is defined as

(f∗(µ))(B) = µ(f−1(B)) (of course f : N → M needs to be measurable). However, this

formula is rather limited, since only the functions p and f and one of the measures can be varied. In the context described above, the function q does not necessarily have the form q = p ◦ f and the measures on M and N do not have to be connected through f .

3.3.1 Conditions for changing domains

First of all, the function f needs to be measurable to ensure that f(X) is a random element again. In this setting, measures ν on M and λ on N are required. Remember that “p is a pdf” also means that p ≥ 0 a.e. ν and ∫_M p dν = 1. The latter property can be reduced to ∫_M p dν < ∞ for most purposes, since sampling algorithms usually work with unnormalized densities as well. The function q needs to be a pdf as well, hence it is required that q ≥ 0 a.e. λ and ∫_N q dλ = 1 (or finite). Saying that X ∼ q =⇒ f(X) ∼ p means that if

P(X^{-1}(A)) = ∫_A q dλ for all A ∈ F_N,


then also

P((X^{-1} ∘ f^{-1})(B)) = ∫_B p dν for all B ∈ F_M,

where P is a measure on the domain of X. Since f is measurable, we have f^{-1}(B) ∈ F_N for all B ∈ F_M (this is the definition of measurable). Call Y = f ∘ X and p_Y the pdf corresponding to Y. Then

∫_B p_Y dν = P(Y^{-1}(B)) = P((X^{-1} ∘ f^{-1})(B)) = ∫_{f^{-1}(B)} q dλ.

It remains to show p_Y = p, and since p_Y is defined (almost uniquely) by the condition ∫_B p_Y dν = P(Y^{-1}(B)), it is sufficient to show ∫_B p dν = ∫_B p_Y dν for all B ∈ F_M. This

leaves the following conditions.

• The functions f and q are measurable. This usually follows easily since most functions encountered in practice, such as indicator functions on measurable sets and continuous functions, are measurable.

• The integral ∫_N q dλ is finite.

• The function q ≥ 0 on all sets with non-zero λ-measure.

• For all B ∈ F_M, ∫_B p dν = ∫_{f^{-1}(B)} q dλ.

This last condition is easiest understood in the case of M = R, where it can be seen as “the cumulative distribution functions are the same”. It is sufficient to check this condition on a set which generates the σ-algebra by a theorem in measure theory, for example for all intervals or open sets when M = R.

Example 3.9. Let us look at a simple example. Take M = [0, 1] and N = [0, 1] ∪ [2, 3] and define f by f(x) = 1_{[0,1]}(x) x + 1_{[2,3]}(x)(x − 2), which is a combination of measurable functions and hence measurable again. Define p : [0, 1] → R : x ↦ 1 and q(x) = ½ (p ∘ f)(x). Obviously, q ≥ 0 holds everywhere. Give both M and N the Lebesgue measure µ. Then

∫_N q dµ = ∫_N ½ (p ∘ f) dµ = 2 ∫_{[0,1]} ½ p dµ = 1

and for all B ∈ (B ∩ [0, 1]), we have

∫_{f^{-1}(B)} q dµ = ∫_B q dµ + ∫_{B+2} q dµ = ∫_B p dµ.

This shows that we are allowed to sample from (N, µ, q) and translate these samples to (M, µ, p) using f .
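Example 3.9 can also be checked empirically: sampling uniformly from N = [0, 1] ∪ [2, 3] corresponds exactly to the density q = ½(p ∘ f) of the example, and applying f should give (approximately) uniform samples on M = [0, 1]. A small sketch:

```python
import numpy as np

rng = np.random.default_rng(6)

def sample_q(n):
    # q is the uniform density 1/2 on N = [0, 1] u [2, 3]
    u = rng.uniform(0.0, 1.0, size=n)
    shift = 2.0 * rng.integers(0, 2, size=n)      # choose [0, 1] or [2, 3] with probability 1/2 each
    return u + shift

def f(x):
    # f(x) = x on [0, 1] and f(x) = x - 2 on [2, 3]
    return np.where(x <= 1.0, x, x - 2.0)

y = f(sample_q(100_000))
hist, _ = np.histogram(y, bins=10, range=(0.0, 1.0), density=True)
print(hist)                                       # every bin close to 1, i.e. f(X) ~ Uniform[0, 1]
```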


In the example above, f is called a covering map. The example below gives another example of a case where the properties are easily verified.

Example 3.10. Let f : S¹ → S¹ be the double cover given by z ↦ z² and p a pdf on S¹ and q = ½(p ∘ f). Since f is continuous, it is also measurable. Furthermore,

½ ∫_{[0,1]} p(f(e^{2πiφ})) dφ = ½ ∫_{[0,1]} p((e^{2πiφ})²) dφ = ½ ∫_{[0,1]} p(e^{4πiφ}) dφ = ∫_{[0,1]} p(e^{2πiψ}) dψ

using the change-of-coordinates formula for ψ = 2φ. A similar result holds for all subsets of [0, 1].

It can be expected that if f : N → M is a map such that each point in M has n inverses, then q = (1/n)(p ∘ f) will be a pdf on N satisfying X ∼ q =⇒ f(X) ∼ p. However, if only “each point in M has n inverses” is assumed, then the following situation may occur. If N has a measure λ such that λ({x}) = 1 for a certain x ∈ N and λ(N \ {x}) = 0, then the property “∫_B p dν = ∫_{f^{-1}(B)} q dλ” will obviously fail (in this case, the point x will always get all probability mass). The main idea is that the measures on M and N may be very different and that these also decide whether X ∼ q =⇒ f(X) ∼ p holds or not.

In order to be able to use Spherical HMC to sample from SO(3), a new probability density function $q$ on $S^3$ should be defined, along with a function $f : S^3 \to$ Euler angles with the property $X \sim q \implies f(X) \sim p$. The chosen conversion formula is given in Section 2.2, and $q = \tfrac{1}{2}(p \circ f)$ is taken, where in practice $p$ is unnormalized and hence the half can be dropped. The intuition behind the choice of $f$ and $q$ is that there exists a double cover $S^3 \to SO(3)$. The chosen $f$ is written out such that the rotation represented by a unit quaternion $x \in S^3$ is exactly the same as the rotation represented by $f(x)$, and $q$ is taken to be $\tfrac{1}{2}(p \circ f)$ based on the previous two examples. In order to show the property $\int_A p\,d\nu = \int_{f^{-1}(A)} q\,d\lambda$, the change-of-variables formula
$$\int_{f(U)} p(v)\,dv = \int_U (p \circ f)(u)\,|Df|\,du$$
can often be very useful. In this case, the formula cannot be applied directly, since the Euler angles are a subset of $\mathbb{R}^3$ and $S^3 \subset \mathbb{R}^4$. However, $S^3$ is a three-dimensional manifold, and the problem might be solved by parameterizing $S^3$ with three variables, such that hopefully the new $f$ will have $|Df| = \tfrac{1}{2}$ (in fact, since unnormalized densities are used, $|Df|$ constant is sufficient). Since the particular sampling domain $S^3 \subset \mathbb{R}^4$ is required, such derivations would not help here, and proving the property $X \sim q \implies f(X) \sim p$ will not follow from the change-of-variables formula in a trivial way. The conditions can, however, be verified quite easily for the other case for which they were derived, as will be done in the next section.
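To make the construction concrete, the sketch below builds $q$ on $S^3$ from an unnormalized density $p$ on Euler angles. The quaternion-to-Euler conversion shown here uses a common roll–pitch–yaw convention and is only an assumption for illustration; the thesis uses the specific conversion of Section 2.2, which may differ. The factor $\tfrac{1}{2}$ is dropped because the densities are unnormalized anyway.

```python
import numpy as np

def quaternion_to_euler(x):
    """Map a unit quaternion (w, i, j, k) to Euler angles.

    Assumed convention: roll-pitch-yaw; substitute the conversion of
    Section 2.2 to match the thesis exactly.
    """
    w, i, j, k = x / np.linalg.norm(x)      # guard against numerical drift
    roll = np.arctan2(2 * (w * i + j * k), 1 - 2 * (i ** 2 + j ** 2))
    pitch = np.arcsin(np.clip(2 * (w * j - k * i), -1.0, 1.0))
    yaw = np.arctan2(2 * (w * k + i * j), 1 - 2 * (j ** 2 + k ** 2))
    return np.array([roll, pitch, yaw])

def make_q(p_unnormalized):
    """Unnormalized density on S^3 induced by an unnormalized density p on
    Euler angles, i.e. q = p o f (the 1/2 from the double cover is dropped
    since it does not affect an unnormalized density)."""
    return lambda x: p_unnormalized(quaternion_to_euler(x))
```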

3.3.2 Projection sampling

One of the first approaches to sampling from a restricted area is to sample from an unrestricted area and project the acquired sample to the restricted area. Let p be


a probability density function which needs to be sampled from, and R the restricted domain on which p is defined. The idea is to define an area U , a (measurable) function π : U → R and a probability density function q on U such that q is easier to sample from and x ∼ q implies π(x) ∼ p. This would give the possibility of sampling from p by sampling from q and “projecting” these samples from U to R.

For spheres $S^n := \{x \in \mathbb{R}^{n+1} \mid \|x\| = 1\}$, an obvious projection function
$$\pi : \mathbb{R}^{n+1} \setminus \{0\} \to S^n : x \mapsto \frac{x}{\|x\|}$$
exists. The problem then is to define a probability density function $q$ on $\mathbb{R}^{n+1}$. An easy way to fulfill the property $x \sim q \implies \pi(x) \sim p$ is by setting $q = p \circ \pi$. However, this leads to the problem that $\int q(x)\,dx$ is infinite.

A possible way to solve this is to "spread the density" over $\mathbb{R}^{n+1}$ in a controlled manner.

Definition 3.11. Let $p$ be a pdf on $S^n$ and let $r : \mathbb{R}_{>0} \to \mathbb{R}_{\geq 0}$ be a pdf such that $Z = \int_{\mathbb{R}_{>0}} r(\rho)\rho^n\,d\rho < \infty$. Define the pdf associated to $p$ and $r$ as
$$q : \mathbb{R}^{n+1} \setminus \{0\} \to \mathbb{R}_{\geq 0} : x \mapsto \frac{1}{Z}\, r(\|x\|)\, p\!\left(\frac{x}{\|x\|}\right).$$

As mentioned before, proving the validity of this setting was one of the main reasons to introduce more formal definitions. The remainder of this subsection is used to verify the conditions that were found in the previous section and to discuss the condition that $\int_{\mathbb{R}_{>0}} r(\rho)\rho^n\,d\rho < \infty$, which may seem somewhat counterintuitive at first. This condition arises from the fact that the density is not spread in a "straight" manner: if a segment $S$ of the sphere $S^n$ is taken and all points in $\mathbb{R}^{n+1}$ that project onto $S$ are visualized, the region takes the form of a fan instead of a rectangle.
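A minimal sketch of this construction is given below (illustrative only; the choice of $r$ as a Gamma density is an assumption, any pdf on $\mathbb{R}_{>0}$ with a finite $n$-th moment works). Following Remark 3.16 below, $q$ is kept unnormalized, so $Z$ never has to be computed.

```python
import numpy as np
from scipy import stats

def make_projected_log_density(log_p_sphere, shape=2.0, scale=1.0):
    """Unnormalized log-density q on R^{n+1} \\ {0} as in Definition 3.11.

    log_p_sphere: (possibly unnormalized) log-density on S^n.
    The radial pdf r is taken to be Gamma(shape, scale); every moment of a
    Gamma distribution is finite, so  int r(rho) rho^n drho < inf  holds.
    """
    radial = stats.gamma(a=shape, scale=scale)

    def log_q(x):
        rho = np.linalg.norm(x)
        if rho == 0.0:
            return -np.inf
        return radial.logpdf(rho) + log_p_sphere(x / rho)

    return log_q
```

Samples drawn from this $q$ with any off-the-shelf sampler on $\mathbb{R}^{n+1}$ can then be projected to $S^n$ by $x \mapsto x/\|x\|$; Theorem 3.15 below states that the projected samples follow $p$.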

Remark 3.12. In order to prove $\int_{\mathbb{R}_{>0}} r(\rho)\rho^n\,d\rho < \infty$ it is sufficient to show that there exist constants $C, M > 0$ such that $r(\rho) \leq C\rho^{-(n+2)}$ for all $\rho \geq M$, since then
$$\int_M^\infty r(\rho)\rho^n\,d\rho \leq C\int_M^\infty \rho^{-2}\,d\rho = C\big[-\rho^{-1}\big]_M^\infty = \frac{C}{M} < \infty$$
and
$$\int_0^M \rho^n r(\rho)\,d\rho \leq \int_{\mathbb{R}_{>0}} M^n r(\rho)\,d\rho = M^n.$$
To see that not every density function has $\int_{\mathbb{R}_{>0}} r(\rho)\rho^n\,d\rho < \infty$, the following example shows a non-negative $C^\infty$ function $f$ with integral 1 for which the property above already fails for $n = 1$, that is, $\int_{\mathbb{R}_{>0}} f(\rho)\rho\,d\rho = \infty$.

Example 3.13. Construct a $C^\infty$ function $f_1 : \mathbb{R}_{>0} \to \mathbb{R}_{\geq 0}$ with the property that $f_1$ is at least $\tfrac{1}{2}$ on an interval $B_1$ of length at least $\tfrac{1}{2}$, is zero outside $[2, 3]$ and has an integral of $\tfrac{1}{2}$. Such a function can be constructed using the so-called bump functions, which are also used in proving the existence of partitions of unity. Define
$$f_n(x) = \frac{1}{2^{n-1}}\, f_1(x - 2^n + 2),$$
a rescaled and translated version of $f_1$, such that $f_n$ is at least $\tfrac{1}{2^n}$ on an interval $B_n$ of length at least $\tfrac{1}{2}$, is zero outside $[2^n, 2^n + 1]$ and has an integral of $\tfrac{1}{2^n}$. Now $g_N(x) := \sum_{n=1}^N f_n(x)$ is $C^\infty$ as well, and the pointwise limit $g(x) = \lim_{N \to \infty} g_N(x)$ obviously exists. In order to see that the convergence is uniform, let $\|f\|_\infty := \sup_{x \in \mathbb{R}_{>0}} |f(x)|$, which we allow to take the value $\infty$. It is easy to see that this satisfies the triangle inequality, and hence
$$\|g_N - g\|_\infty = \Big\|\sum_{n=N}^{\infty} f_n\Big\|_\infty \leq \sum_{n=N}^{\infty} \|f_n\|_\infty \leq \|f_1\|_\infty \sum_{n=N}^{\infty} 2^{-(n-1)} = \|f_1\|_\infty\, 2^{-(N-2)},$$
where $\|f_1\|_\infty < \infty$ since $f_1$ is a continuous function with compact support. Hence $\{g_N\}_{N \in \mathbb{N}}$ converges uniformly to $g$, so that $g$ is continuous; in fact $g$ is $C^\infty$ as well, since the supports of the $f_n$ are pairwise disjoint, so that around every point $g$ coincides with at most one $f_n$. Furthermore, since all $f_i \geq 0$,
$$\int_{\mathbb{R}_{>0}} g(\rho)\rho\,d\rho = \int_{\mathbb{R}_{>0}} \sum_{n=1}^{\infty} f_n(\rho)\rho\,d\rho \geq \sum_{n=1}^{N} \int_{[2^n,\, 2^n+1]} f_n(\rho)\rho\,d\rho$$
for all $N \in \mathbb{N}$, but
$$\int_{[2^n,\, 2^n+1]} f_n(\rho)\rho\,d\rho \geq \int_{B_n} \frac{1}{2^n}\,\rho\,d\rho \geq \int_{B_n} \frac{1}{2^n}\, 2^n\,d\rho \geq \frac{1}{2}$$
for all $n \in \mathbb{N}$, hence $\sum_{n=1}^{N} \int_{[2^n,\, 2^n+1]} f_n(\rho)\rho\,d\rho \to \infty$.
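The divergence in this example is easy to see numerically. The sketch below uses one concrete choice of $f_1$ (a smooth plateau supported in $[2,3]$; this particular shape is an assumption, not the thesis's construction) and checks that every term $\int f_n(\rho)\rho\,d\rho$ is at least $\tfrac{1}{2}$, so that the partial sums grow without bound:

```python
import numpy as np
from scipy import integrate

def smooth_step(t):
    """C-infinity step: 0 for t <= 0, 1 for t >= 1."""
    t = np.clip(t, 1e-12, 1.0 - 1e-12)
    a, b = np.exp(-1.0 / t), np.exp(-1.0 / (1.0 - t))
    return a / (a + b)

def plateau(x):
    """Smooth bump supported in [2, 3] and equal to 1 on [2.2, 2.8]."""
    return smooth_step((x - 2.0) / 0.2) * smooth_step((3.0 - x) / 0.2)

# Scale the plateau so that its integral is exactly 1/2; the resulting height
# (about 0.6) stays above 1/2 on [2.2, 2.8], an interval of length 0.6 >= 1/2.
mass, _ = integrate.quad(plateau, 2.0, 3.0)
f1 = lambda x: (0.5 / mass) * plateau(x)

def f_n(n):
    return lambda x: f1(x - 2.0 ** n + 2.0) / 2.0 ** (n - 1)

terms = [integrate.quad(lambda r: f_n(n)(r) * r, 2.0 ** n, 2.0 ** n + 1.0)[0]
         for n in range(1, 12)]
print(np.cumsum(terms))   # each term is >= 1/2, so the partial sums keep growing
```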

The following two theorems show that the function $q$ of Definition 3.11 is indeed a probability density function and that $X \sim q \implies f(X) \sim p$ holds.

Theorem 3.14. Let $p, q, r$ be given as in Definition 3.11. Then $q$ has an integral of 1.

Proof. Let $\{\phi_1, \ldots, \phi_n, \rho\}$ denote the spherical coordinates for $\mathbb{R}^{n+1} \setminus \{0\}$ (see [4]), where $\rho = \|x\|$ for $x \in \mathbb{R}^{n+1}$, and let
$$f : [0, \pi]^{n-1} \times [0, 2\pi) \times \mathbb{R}_{>0} \to \mathbb{R}^{n+1} \setminus \{0\}$$
be the change-of-coordinates map. Write $\Omega = (\phi_1, \ldots, \phi_n)$ and
$$|\det(Df)| = \rho^n \sin^{n-1}(\phi_1) \cdots \sin(\phi_{n-1}) =: \rho^n g(\Omega),$$
then
$$\int_{\mathbb{R}^{n+1}} q(x)\,dx = \frac{1}{Z} \int_{\mathbb{R}_{>0}} \left( \int_\Omega q(f(\Omega, \rho))\, \rho^n g(\Omega)\, d\Omega \right) d\rho = \frac{1}{Z} \int_{\mathbb{R}_{>0}} r(\rho)\rho^n\,d\rho \int_\Omega p(f(\Omega, 1))\, g(\Omega)\, d\Omega = \frac{1}{Z} \int_{\mathbb{R}_{>0}} r(\rho)\rho^n\,d\rho \int_{S^n} p(x)\,dx = 1$$
(the first step applies the change-of-variables formula and uses that 0 may be removed from the domain without changing the value of the integral; the second step fills in the definition of $q$ and uses that constants may be taken out of the integral; the third step applies the change-of-variables formula again). Since the Lebesgue integral is equal to the Riemann integral in cases like these, the result follows.


Theorem 3.15. Let $p, q, r$ be given as in Definition 3.11. Then $z \sim q$ implies $\frac{z}{\|z\|} \sim p$.

Proof. Define $\pi : \mathbb{R}^{n+1} \setminus \{0\} \to S^n : x \mapsto \frac{x}{\|x\|}$ and let $f : \mathbb{R}_{>0} \times \Omega \to \mathbb{R}^{n+1} \setminus \{0\}$ be the change-of-coordinates map to spherical coordinates. For Borel sets $B$,
$$\int_{\pi^{-1}(B)} q\,d\mu = \int_{\pi^{-1}(B)} q(x)\,dx = \int_{f^{-1}(\pi^{-1}(B))} |\det(Df)|\, q(f(y))\,dy = \frac{1}{Z} \int_{\mathbb{R}_{>0} \times B'} p(f(1, \omega))\, r(\rho)\, g(\omega)\, \rho^n\,d\rho\,d\omega = \frac{1}{Z} \int_{B'} p(f(1, \omega))\, g(\omega)\,d\omega \int_{\mathbb{R}_{>0}} r(\rho)\rho^n\,d\rho,$$
where it is used that the Lebesgue integral is equal to the Riemann integral and the change-of-coordinates formula is applied. Also, for each $x \in S^n$ we have $f^{-1}(\pi^{-1}(\{x\})) = \mathbb{R}_{>0} \times \{\omega_x\}$ for a single vector $\omega_x \in \Omega$, where $\omega_x \neq \omega_y$ whenever $x \neq y$. This shows $f^{-1}(\pi^{-1}(B)) = \mathbb{R}_{>0} \times B'$ for $f(1, B') = B$. Hence
$$\int_{\pi^{-1}(B)} q\,d\mu = \int_{B'} p(f(1, \omega))\, g(\omega)\,d\omega = \int_B p(x)\,dx.$$

Remark 3.16. In practice, unnormalized probability density functions $p$ and $q$ are used. It is not required to compute the exact value of $\int_{\mathbb{R}_{>0}} r(\rho)\rho^n\,d\rho$ (it only needs to be verified that it is finite), since two densities that differ only by a constant factor $C > 0$ describe the same distribution.
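Theorems 3.14 and 3.15 together with Remark 3.16 suggest a simple end-to-end check, sketched below for the circle ($n = 1$): target the unnormalized $q$ on $\mathbb{R}^2$ built from a von Mises density on $S^1$ and a Gamma radial part with a plain random-walk Metropolis sampler, project the samples, and compare the projected angles with directly drawn von Mises samples. All concrete choices ($\kappa$, the Gamma parameters, the step size, the number of iterations) are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
kappa = 2.0
radial = stats.gamma(a=3.0, scale=0.5)     # r(rho); any pdf with finite 1st moment

def log_q(x):                              # unnormalized log q on R^2 \ {0}
    rho = np.linalg.norm(x)
    if rho == 0.0:
        return -np.inf
    theta = np.arctan2(x[1], x[0])
    return radial.logpdf(rho) + kappa * np.cos(theta)   # log r(rho) + log p(x/rho)

# Random-walk Metropolis in R^2 targeting q.
x, chain = np.array([1.0, 0.0]), []
for _ in range(50_000):
    proposal = x + 0.3 * rng.standard_normal(2)
    if np.log(rng.uniform()) < log_q(proposal) - log_q(x):
        x = proposal
    chain.append(x)
chain = np.array(chain[10_000:])           # discard burn-in

angles = np.arctan2(chain[:, 1], chain[:, 0])         # project onto S^1
direct = rng.vonmises(0.0, kappa, size=len(angles))   # reference samples from p
print(stats.ks_2samp(angles, direct))      # the two samples should look alike
```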

3.3.3 Manifold sampling

If the sampling domain is a $k$-dimensional manifold $M$, then there exist charts $(U_\alpha, \phi_\alpha)_{\alpha \in A}$ which cover the manifold, with $\phi_\alpha : U_\alpha \to \mathbb{R}^k$. If there is a small number of charts, these charts may be used to sample from $\mathbb{R}^k$ instead of from the manifold, which may be an easier task. This does, however, introduce a lot of additional computation for switching between the manifold and $\mathbb{R}^k$.

A tool which comes in handy for manifolds is the so-called partition of unity. There exists a partition of unity $\{\rho_\alpha\}$ subordinate to each open cover $\{U_\alpha\}$, which satisfies the following three conditions:

• $\rho_\alpha : M \to [0, 1]$ is $C^\infty$;

• $\sum_{\alpha \in A} \rho_\alpha(x) = 1$ for all $x \in M$;

• $\mathrm{supp}(\rho_\alpha) \subset U_\alpha$.

Using these functions, the sampling problem can in principle be "transported" to $\mathbb{R}^k$. If $p : M \to \mathbb{R}_{\geq 0}$ is a density defined on the manifold, then a function on $\mathbb{R}^k$ can be defined by (assuming $A = \{1, \ldots, n\}$ is finite)
$$q(x) := \sum_{i=1}^n \mathbb{1}_{\phi_i(U_i)}(x)\, \rho_i\big(\phi_i^{-1}(x)\big)\, p\big(\phi_i^{-1}(x)\big),$$
that is, for all $i$ for which $\phi_i^{-1}(x)$ exists ($\phi_i$ is injective by definition), compute a weighted probability and sum those. The advantage of using a partition of unity is that $q$ approaches zero when $x$ approaches the boundary of $\bigcup_i \phi_i(U_i)$. It might do so, however, with such a sharp slope that sampling methods do not appreciate this in practice at all.
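As a small illustration of this construction, the sketch below transports a density from $S^1$ to $\mathbb{R}$ using two angle charts and the partition of unity $\rho_1(\theta) = (1 + \cos\theta)/2$, $\rho_2(\theta) = (1 - \cos\theta)/2$; both the charts and the partition of unity are merely one possible choice.

```python
import numpy as np

def p(theta):
    """(Unnormalized) density on S^1, written as a function of the angle."""
    return np.exp(2.0 * np.cos(theta))

# Partition of unity subordinate to the two charts: rho1 vanishes at the
# point removed by chart 1 (angle pi), rho2 at the point removed by chart 2
# (angle 0); rho1 + rho2 = 1 everywhere.
rho1 = lambda theta: (1.0 + np.cos(theta)) / 2.0
rho2 = lambda theta: (1.0 - np.cos(theta)) / 2.0

def q(x):
    """Density on R obtained by transporting p through the two charts.

    Chart 1 maps S^1 minus (-1, 0) onto (-pi, pi); chart 2 maps S^1 minus
    (1, 0) onto (0, 2*pi). Both chart inverses send x to the point with
    angle x, so p can simply be evaluated at x.
    """
    total = 0.0
    if -np.pi < x < np.pi:          # x lies in the image of chart 1
        total += rho1(x) * p(x)
    if 0.0 < x < 2.0 * np.pi:       # x lies in the image of chart 2
        total += rho2(x) * p(x)
    return total
```

The support of this $q$ is $(-\pi, 2\pi)$ and $q$ indeed goes to zero smoothly at both endpoints, while its total integral equals that of $p$ because $\rho_1 + \rho_2 = 1$.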

A problem with proving that the charts can be used to transport the sampling problem is that for every $x \in \mathbb{R}^k$ there may exist zero or several $y \in M$ such that $\phi_i(y) = x$ for some $i \in \{1, \ldots, n\}$ (although at most one per chart). Such a $y \in M$ can also lie in the intersection of multiple charts, say $y \in U_i \cap U_j$. In such a case it can happen that $\phi_i(y) \neq \phi_j(y)$, so that $y$ has two points to which it would like to give its density. The latter problem is easily solved with the partition of unity: each point $y \in M$ gives $\rho_j(y)p(y)$ to $\phi_j(y)$, and the partition of unity spreads the density nicely ($C^\infty$) over $\mathbb{R}^k$. It can be expected that the $q$ defined above is indeed a density with the same integral as $p$ (although this would still require a formal proof). The main problem is in defining a manner of conversion $f$ such that $X \sim q$ gives $f(X) \sim p$. As noted before, each $x \in \mathbb{R}^k$ may have multiple $y \in M$ that have given part of their density to $x$. The most obvious conversion method would be to select all $y \in M$ such that $\phi_i(y) = x$ for some $i \in \{1, \ldots, n\}$ and to define $f(x)$ by a random selection procedure which chooses one of these $y$ based on the amount of density they have given to $x$.

Besides the fact that this will be complicated to prove, such a method would have a tremendous amount of computational overhead and is therefore not suited for practice. The advantage is that it provides a manner of sampling from $\mathbb{R}^k$ for $k$ the dimension of the manifold, which is as good as it can get. A (non-trivial) theorem in mathematics called the Whitney embedding theorem states that any smooth real $m$-dimensional manifold can be smoothly embedded in $\mathbb{R}^{2m}$ (if $m > 0$). For some manifolds (such as spheres) a lower-dimensional embedding can easily be found as well.

For this thesis, it would have been perfect to find a sampling method which is tuned to sampling from exponential families defined on (compact) Lie groups (or manifolds in general). It appears difficult to do this from a mathematical point of view (using the definition of a manifold).

3.3.4 Variations on Gibbs sampling

If the sampling domain is $S^n$, then the domain may be reduced to a lower-dimensional $S^m$ using an approach similar to Gibbs sampling. Let a probability density $p(z) = p(z_1, \ldots, z_{n+1})$ defined on $S^n \subset \mathbb{R}^{n+1}$ be given. The idea is to intersect $S^n$ with an $(m+1)$-plane (an $(m+1)$-dimensional plane) of the form $\{z \in \mathbb{R}^{n+1} : z_1 = x_1, \ldots, z_{n-m} = x_{n-m}\}$, where $x_1, \ldots, x_{n-m}$ are values of a previous sample. The intersection will be an $m$-sphere in the generic case (almost always), but may be a single point in the case that the plane is contained in the tangent plane at a point of $S^n$. The idea of sampling blocks of variables instead of a single one is already present in the literature [11]. Only reduction to $S^0$ or $S^1$ will be discussed here.
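For $m = 1$ such a block update moves the last two coordinates along the circle that remains after fixing the others. The sketch below updates the angle on that circle with a symmetric random-walk Metropolis step, which leaves the conditional distribution invariant; the Metropolis inner step and all tuning constants are illustrative choices, not the thesis's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

def block_update(z, log_p, step=0.5):
    """One S^1-block update on S^n: keep z_1, ..., z_{n-1} fixed and move the
    last two coordinates along the circle of remaining radius.

    log_p: (unnormalized) log-density on S^n, evaluated at points of R^{n+1}.
    """
    rho = np.sqrt(max(1.0 - np.sum(z[:-2] ** 2), 0.0))
    if rho == 0.0:                         # degenerate intersection: one point
        return z
    theta = np.arctan2(z[-1], z[-2])
    theta_new = theta + step * rng.standard_normal()
    z_new = z.copy()
    z_new[-2], z_new[-1] = rho * np.cos(theta_new), rho * np.sin(theta_new)
    # Accept or reject; the arc-length element on the circle is constant in
    # the angle, so the conditional density of the angle is proportional to p.
    if np.log(rng.uniform()) < log_p(z_new) - log_p(z):
        return z_new
    return z
```

Cycling this update over different pairs of coordinates keeps every move exactly on the sphere.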

For $m = 0$, the Gibbs sampling method would sample a single coordinate $z_i$ at a time; given the other coordinates, $z_i$ can only take the two values $\pm\sqrt{1 - \sum_{j \neq i} z_j^2}$ (the intersection is an $S^0$), and one of these two values is chosen with probability proportional to $p$.

