Bachelor Informatica

Posterior approximation with variational inference for simulated Fermi gamma-ray data

Auke Schuringa

June 18, 2020

Supervisor(s): dr. Christoph Weniger, dr. Patrick Forré

Informatica
University of Amsterdam


Abstract

Probability estimation is an important part of many scientific processes. One way of calculating probabilities is Bayes' theorem. This theorem is, however, not usable when a likelihood is hard or intractable to evaluate. To estimate a probability distribution in that case, machine learning can be used in the form of likelihood-free or variational inference. A neural network can be trained to estimate the posterior distribution of parameters using simulations from a model that takes these parameters as input. Using, for example, Sequential Ratio Estimation (SRE), the posterior of the parameters can be approximated for all simulations the model can produce. For problems in which the posterior should be approximated for a single sample of data, this is wasteful. A training algorithm like Automatic Posterior Transform (APT) can then be used in combination with a method such as a Mixture Density Network (MDN) to approximate the posterior for a single sample of data. This thesis explores the details of these methods and uses them to approximate the probability distribution of the Fermi gamma-ray flux of point-source emissions from a single simulated observation that includes background.


Contents

1 Introduction

2 Background
2.1 Artificial Neural Network
2.1.1 Activation function
2.1.2 Training
2.1.3 Convolutional layers
2.2 Probability estimation
2.2.1 Variational inference
2.2.2 Mixture Density Network
2.3 Training algorithms
2.3.1 Sequential Ratio Estimation
2.3.2 Automatic Posterior Transform
2.4 Cosmic gamma-rays
2.4.1 Background

3 Implementation
3.1 Network
3.2 Model
3.3 Training algorithm

4 Experiments
4.1 One hidden layer
4.2 Two hidden layers
4.3 Three hidden layers
4.4 SRE and MDN posterior distributions

5 Conclusion
5.1 Discussion

6 Ethical aspects
6.1 Issues
6.2 Guidelines


CHAPTER 1

Introduction

In many fields of science, probabilities play an important role. One can think of weather forecasts [1], where the probability of a certain type of weather in the next day or hours needs to be evaluated, or of economic models [2], where probabilities play an important role in forecasting economic situations such as crises, or whether a financial stock will go up or down based on past movement. Probabilities are also used in biology, for example for the spread of a disease [3] and with that the probability that an organism might be infected by another organism. Other examples are forecasting consumer behavior [4] [5], in which many factors may play a role, such as the state of the economy, the overall mood of society, and societal issues or changes. Next to the examples stated here, many other situations and models exist in which probabilities play an important role.

This thesis works with the luminosity of simulated, and thus semi-realistic, Fermi gamma-ray data. Being able to predict the overall luminosity in measurements provides the ability to notice spikes, and with that keep track of events. When a spike is measured, something was happening, and research can be done into what event occurred. This can lead to new discoveries in the case of unexpected events, or to the confirmation of existing theories should an event occur that was expected according to the theories at the time.

Probabilities can be evaluated using a posterior distribution, which gives the probabilities over a range of values of a parameter. A variety of methods exist to calculate posterior distributions. One of the most important is Bayes' theorem, which calculates a probability, and with that a posterior distribution, from related probabilities. Using known and measured data, one can calculate a posterior distribution that is not directly measured. This allows a wide range of possibilities for calculating a probability in complex situations in which the posterior cannot be directly evaluated.

A problem with calculating the posterior using Bayes' theorem is that all probabilities or likelihoods for the involved events need to be known. This is often not the case, which makes Bayes' theorem intractable to calculate. A possible solution is using artificial neural networks and machine learning to approximate the posterior distribution. This can be done using variational inference, or likelihood-free inference, where the posterior is approximated using a limited number of simulations.

One situation in which variational inference can be used is approximating a posterior from measurements of events, for example from space, since these measurements often contain noise and can be complex. This thesis looks into the use of variational inference to approximate a posterior distribution for a parameter of the luminosity in simulated Fermi gamma-ray data.


CHAPTER 2

Background

The project uses an artificial neural network with convolutional and fully connected (hidden) layers to approximate the posterior distribution of a parameter for the luminosity of a single sampled simulation of cosmic gamma-ray data. To understand the methods and data used, some background knowledge is required, which is explained in this chapter.

2.1 Artificial Neural Network

A neural network is based on how a brain works. A brain consists of neurons which fire signals at each other. Together the neurons form a network which takes input and produces output. The output depends on the connections that have been formed between neurons; different connections can yield different types of output. A developing neural network can form new connections or change existing ones, and with that change the output it gives for certain input.

The perceptron was a first implementation of this idea in a technical sense [6]. With the perceptron, an attempt was made to understand perception in living organisms, especially organisms like humans. A number of problems of the time were considered: how information is sensed or detected, in what form information is stored, and how this stored information affects recognition and behavior in organisms. A theory of the perceptron was formed based on a number of assumptions: no two organisms have the exact same connections in the nervous system, and these connections are mostly random at birth. The system of connections can change over time, so the probability that an excitation in one neuron causes excitation in another can change. As a network of connections is often exposed to the same sort of input, paths form between neurons that are often excited together. The creation of such a path can be affected through positive and negative reinforcement. Finally, a sense for similarities arises in the neural network because certain neurons fire more for certain inputs than others; if a certain set of neurons often fires for certain inputs, then those inputs are found to have similarities.

Overall, the perceptron can, with the above information, be seen as a binary classifier with

$$f(\vec{x}) = \begin{cases} 1 & \text{if } \vec{x} \cdot \vec{w} + b > 0, \\ 0 & \text{otherwise,} \end{cases} \tag{2.1}$$

where $\vec{x}$ is the input, $\vec{w}$ are the weights, $b$ is the bias applied to the dot product of the input and weights, and the output is 1 or 0. By multiplying the input $\vec{x}$ with the weights $\vec{w}$, the input transforms into a number that is based on the weights of the connections between neurons. If the calculated value $\vec{x} \cdot \vec{w}$, taken together with the bias $b$, is positive, the neuron outputs a 1, and otherwise a 0.
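As a small illustration, the classifier of formula 2.1 can be written directly in NumPy; the inputs, weights and bias below are made-up values, not anything taken from the thesis.

```python
import numpy as np

def perceptron(x, w, b):
    """Binary classifier from formula 2.1: fires (1) when the weighted
    input plus bias is positive, stays silent (0) otherwise."""
    return 1 if np.dot(x, w) + b > 0 else 0

# Illustrative values: two inputs with hand-picked weights and bias.
x = np.array([0.5, -1.0])
w = np.array([2.0, 1.0])
b = 0.3
print(perceptron(x, w, b))  # 1, since 0.5*2.0 + (-1.0)*1.0 + 0.3 = 0.3 > 0
```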

A modern version of artificial neural networks is similar to the perceptron. The neurons in artificial neural networks are called artificial neurons or nodes. The network consists of layers of connected nodes, with each node having incoming connections from nodes in the previous layer and outgoing connections to nodes in the next layer [7]. An example of this is shown in figure 2.1. The figure shows an input layer, hidden layers and an output layer. The input layer consists of nodes taking the input data, which is processed and sent on to the first hidden layer. The hidden layers exchange data with each other and perform calculations on that data as it passes through. The data eventually arrives at the output layer and can then be read from the output neurons or nodes.

Figure 2.1: An example of an artificial neural network with one input layer, three hidden layers and an output layer. In this example, the input layer has three nodes, every hidden layer has four nodes and is fully connected to its neighboring layers, and there are two nodes in the output layer.

Every node in the network takes the data from the nodes connected to it and performs a calculation similar to the one in formula 2.1:

$$f(\vec{x}) = \vec{x} \cdot \vec{w} + b, \tag{2.2}$$

which takes the input $\vec{x}$ from the previous nodes, multiplies it with weights $\vec{w}$ and adds a bias $b$, after which $f(\vec{x})$ is sent to the nodes in the next layer that this node is connected to.

Biological neurons do not function exactly like artificial neurons or nodes do according to formula 2.2. Neurons either fire or do not fire, so they produce an output of either 1 or 0, as calculated in formula 2.1 for the perceptron discussed earlier. Neurons can fire at a higher or lower rate, while the number of times a node fires in an artificial neural network is limited to how often the layer containing the node is used in a calculation. Since usually all layers in a network are used sequentially each time the network receives input, the rate at which nodes fire is limited to the rate at which the network processes data. The output of formula 2.2 is not 1 or 0, but can be higher than 1. Compared to biological neurons, a node with output higher than 1 can be seen as firing multiple times, thus increasing the intensity of the output and of the input to the next layer. Therefore formula 2.2 can still be compared reasonably well to biological neurons.

2.1.1 Activation function

Besides the rate at which neurons fire, there is another important property of biological neurons: they cannot fire 'negatively', but only at $f(x) \geq 0$. This means the output $f(x)$ of a node needs to be limited so that it does not fire a negative value. Functions that limit the output of a node are called activation functions, and are based on the biological properties of neurons [8]. An important activation function is the rectified linear unit (ReLU) [9], which has formula

$$f(x) = \begin{cases} 0 & \text{if } x < 0, \\ x & \text{otherwise,} \end{cases} \tag{2.3}$$

with $x$ being an output value of a previous node. The formula sets negative values $x < 0$ to 0, and with that makes sure that the input data to the next nodes is never negative. Formula 2.3 is also shown in figure 2.2.

Figure 2.2: This plot shows the ReLU activation function from formula 2.3. All values $f(x)$ for $x < 0$ are 0, and all other values are equal to $x$.

Another important property of biological neurons is that they have a limit for the rate at which they can fire. With the ReLU activation function, the theoretical maximum of the firing rate is at infinity, which is not biologically possible. A way to solve this is by using the sigmoid [10] activation function

$$f(x) = \frac{1}{1 + e^{-x}} \tag{2.4}$$

instead of the ReLU activation function 2.3. The sigmoid activation function again limits the output of the node to positive values, but also limits it to values lower than 1. A plot of formula 2.4 is shown in figure 2.3.

Figure 2.3: This plot shows the sigmoid activation function from formula 2.4. The function rises from $y = 0$ to $y = 1$ as $x$ increases from negative to positive, and crosses $f(x) = 0.5$ at $x = 0$.

Next to the previous activation functions, which work on a single variable, there is also the softmax function [11]

$$f_i(\vec{x}) = \frac{e^{x_i}}{\sum_j e^{x_j}}, \tag{2.5}$$

with $\vec{x}$ the full output of a layer and $x_i$ the $i$-th value of $\vec{x}$. The softmax function calculates a normalized value and can be useful when using mixing coefficients.

Many other types of activation functions exist as well. Some limit the output $f(x)$ to positive values only, while others allow negative values, or negative values only up to a certain point. Other recent methods look into using existing activation functions like the sigmoid in different ways, for example the sigmoid-weighted linear unit, which resembles formula 2.2 combined with aspects of the sigmoid function 2.4 [12] [13].
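The three activation functions above are each only a few lines of code; a minimal NumPy sketch follows. The max-subtraction in the softmax is a standard numerical-stability trick, not something prescribed by the thesis.

```python
import numpy as np

def relu(x):
    """Formula 2.3: clamp negative inputs to zero."""
    return np.maximum(0.0, x)

def sigmoid(x):
    """Formula 2.4: squash inputs into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    """Formula 2.5: normalize a vector so its values sum to 1."""
    e = np.exp(x - np.max(x))  # max subtraction avoids overflow
    return e / e.sum()

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))      # [0. 0. 3.]
print(sigmoid(x))   # ~[0.119 0.5 0.953]
print(softmax(x))   # normalized, sums to 1
```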


2.1.2 Training

An initial network is untrained and has weights $\vec{w}$ that may be zero or randomly initialized; together with any other trainable variables in the network, these are called the parameters $\phi$ of the network. To give the weights meaningful values they need to be trained. For this, parameter regression can be used: a way of finding the parameter values that best match a certain situation. In this case we want the output of the network to be a certain value, so parameter regression changes the parameters until the output of the network comes close to the output it should have.

As data moves from the input nodes to the output nodes of the network, calculations are performed on it using the parameters of the network. These calculations and the parameters used in them are recorded, so the full computation is known at the output nodes of the network.

The output of the network can be compared to an expected or 'true' output, and a so-called loss can be calculated with a function: the loss function, often denoted $L$. The loss function shows how far the results from the network are off from the desired results. In the ideal situation, the result is the same as the expected result, meaning $L = 0$. To approach this, the parameters of the network need to be changed so that $L$ approaches 0; this is also known as training the network.

The derivative of the loss function $L$ can be used to minimize the loss, and with that adjust the parameters $\phi$ for which the loss is calculated so that they yield a smaller loss next time; this is called backpropagation. The derivative of the loss function gives the direction and slope of the loss surface. To minimize, the 'direction' of the negative derivative should be followed to approach a smaller loss:

$$\Delta\phi_0 = -\nabla L(\phi_0), \tag{2.6}$$

with $\phi$ the parameters of the network, like the weights $\vec{w}$, $\phi_0$ the parameters for which the loss function is calculated, and $L(\phi_0)$ the loss function for network parameters $\phi_0$. It is unclear, however, what the loss function looks like further down the slope: it might reach $\nabla L(\phi_0) = 0$ after a small change in $\phi$, and have a nonzero slope again after that. If the change in $\phi$, $\nabla L(\phi_0)$, is too large, it will step over the values of $\phi$ that minimize the loss. To make sure this does not happen, a learning rate $\gamma$ is used. The learning rate is multiplied with the gradient, so a controlled descent can be made along the gradient of the loss function:

$$\Delta\phi_0 = -\gamma \nabla L(\phi_0), \tag{2.7}$$

which then gives the values of the next parameters $\phi_1$ as

$$\phi_1 = \phi_0 + \Delta\phi_0 = \phi_0 - \gamma \nabla L(\phi_0). \tag{2.8}$$

The loss calculated with network parameters $\phi_1$ will then be lower the next time the same input is given to the network. If the value of $\gamma$ is chosen too low, the network trains very slowly per iteration, while if it is too high, the network might not be able to correctly approach the $\phi$ for which the loss function is minimized.
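As a minimal sketch of formulas 2.6 to 2.8, the following uses PyTorch autograd to take a few gradient-descent steps on a toy quadratic loss; the loss function and the values of $\phi$ and $\gamma$ are stand-ins, not the thesis setup.

```python
import torch

phi = torch.tensor([2.0, -3.0], requires_grad=True)
gamma = 0.1  # the learning rate from formula 2.7

def loss_fn(phi):
    return (phi ** 2).sum()  # toy loss with its minimum at phi = 0

for step in range(3):
    loss = loss_fn(phi)
    loss.backward()                # backpropagation: compute grad L(phi_0)
    with torch.no_grad():
        phi -= gamma * phi.grad    # formula 2.8: phi_1 = phi_0 - gamma * grad
    phi.grad.zero_()               # clear the gradient for the next iteration
    print(step, loss.item())       # the printed loss decreases every step
```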

To train the network to accept a variety of input values, and not just one specific input value, the input differs for every iteration of the training algorithm. This can cause the loss to not always be smaller after an iteration, since the changing input plays a role in the loss function alongside the parameters $\phi$. Overall, however, as the network is trained on the input data and the output is compared to the expected output, the network will decrease the loss and with that improve its results.

To perform gradient descent on the parameters $\phi$ of the network, the function resulting from the loss function must be differentiable, as shown in formula 2.6. Because the network has multiple layers of nodes and activation functions, the final values coming out of the output layer are a complex combination of the formulas used in the previous layers. For example, for a network consisting of a node, a sigmoid function, a node and a ReLU function, the output is of the form

$$q = f_{n,1}(x) = x \cdot w_1 + b_1,$$

$$r = f_s(q) = \frac{1}{1 + e^{-q}},$$

$$s = f_{n,2}(r) = r \cdot w_2 + b_2,$$

$$f_r(s) = \begin{cases} 0 & \text{if } s < 0, \\ s & \text{otherwise,} \end{cases}$$

$$\implies f_r(x) = \begin{cases} 0 & \text{if } f_{n,2}(f_s(f_{n,1}(x))) < 0, \\ f_{n,2}(f_s(f_{n,1}(x))) & \text{otherwise,} \end{cases}$$

$$\implies f_r(x) = \begin{cases} 0 & \text{if } \frac{1}{1 + e^{-(x \cdot w_1 + b_1)}} \cdot w_2 + b_2 < 0, \\ \frac{1}{1 + e^{-(x \cdot w_1 + b_1)}} \cdot w_2 + b_2 & \text{otherwise,} \end{cases}$$

with $f_r(x)$ being the function at the output of the network. These functions are not the only ones used in artificial neural networks; there is, for example, a large number of activation functions that can be used. The complicated functions that arise need to be differentiated. Fortunately, with the chain rule

$$(f \circ g)' = (f' \circ g) \cdot g' \tag{2.9}$$

the derivative of $f_r(x)$ can easily be calculated, if the loss function $L(\phi)$ itself is differentiable. In this case $\nabla f_r(x) = f_r'(x)$ would be

$$(f_r \circ f_{n,2} \circ f_s \circ f_{n,1})' = (f_r' \circ f_{n,2} \circ f_s \circ f_{n,1}) \cdot (f_{n,2}' \circ f_s \circ f_{n,1}) \cdot (f_s' \circ f_{n,1}) \cdot f_{n,1}'.$$

This means that for the full calculation to be differentiable, the individual formulas used for calculating the results should be differentiable. Each of the previously discussed functions from artificial neural networks has a calculable derivative, so a derivative of the loss can be calculated, provided that the loss function itself is differentiable.
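PyTorch's autograd applies exactly this chain rule. A sketch of the composed example above follows, with made-up weight values (chosen so the ReLU is in its positive region, where the derivative is nonzero).

```python
import torch

# The composition f_r(f_{n,2}(f_s(f_{n,1}(x)))) from the example above;
# w1, b1, w2, b2 are illustrative values, not taken from the thesis.
w1, b1, w2, b2 = 1.5, 0.2, 0.8, 0.1

x = torch.tensor(0.7, requires_grad=True)
q = x * w1 + b1        # f_{n,1}: node
r = torch.sigmoid(q)   # f_s: sigmoid
s = r * w2 + b2        # f_{n,2}: node
out = torch.relu(s)    # f_r: ReLU
out.backward()         # chain rule (formula 2.9) applied automatically
print(x.grad)          # f_r'(x): the product of the four factors, ~0.208
```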

2.1.3 Convolutional layers

Any sort of data can be put into an artificial neural network, but it can be difficult for the network to learn certain patterns from certain data. One type of data for which this is difficult is images. The number of input nodes in a network can be made equal to the number of pixels in the images, so that every pixel in the image is taken through the network. It can then take long to find relations between different pixels in the image, and it can take a large amount of computing resources due to the large number of required nodes.

Solutions for this problem are again found in biological processes. One solution based on this is the cognitron [14], which uses biological processes to create a network that performs visual pattern recognition. The network can be trained to improve its recognition. The cognitron works by taking input from a certain region of the input layer, the input image. Through learning and training, the regions that are extracted change, and eventually regions are extracted that are significant for producing the output the network is trained for [15]. The cognitron is then able to recognize patterns, for example handwritten digits [16].

Modern convolutional layers work in the same way, using convolution or cross-correlation, as shown in the formula

$$C_i = (A \star B)_i = \sum_j A_j \cdot B_{i+j}, \tag{2.10}$$

which can be seen as a vector $A$, the kernel, sliding across a vector $B$, the image or source. At each point $i$ in $B$, the values in $A$ are multiplied with the respective values around $B_i$ in $B$ and summed to give $C_i$. An example is shown in figure 2.4.


Figure 2.4: An example of cross-correlation of a source with a kernel: kernel $A = (1, 0, -1)$ slides across all values of source $B = (1, 2, 0, -1, 1, 0, 1)$ while the result $C = (-2, 1, 3, -1, -1, 0, 0)$ is calculated.

Another form of convolution is pooling, of which max pooling is used the most in artificial neural networks. With max pooling the input is divided into chunks, and the output for each chunk is the maximum of the values inside that chunk.

Cross-correlation can extract edges from images. There are virtually unlimited possibilities for the kernel; the right kernels can, for example, detect edges in certain directions. With one convolutional layer simple edges can be detected, and with multiple layers more complicated patterns can be found in the data. The output of the convolutional layers is then provided to the hidden layers of the network. The data going into the hidden layers thus already has features extracted, which the hidden layers can easily use for further calculations. This makes it easier for the overall artificial neural network to find patterns in data and adjust the parameters $\phi$ to minimize the loss function $L(\phi)$.
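A direct implementation of formula 2.10 reproduces the numbers in figure 2.4; the zero padding and the centering of the kernel are assumptions made here to match that figure.

```python
import numpy as np

def cross_correlate(kernel, source):
    """1-D cross-correlation as in formula 2.10, zero-padded so the result
    has the same length as the source, with the kernel centered."""
    pad = len(kernel) // 2
    padded = np.pad(source, pad)  # zeros on both ends
    return np.array([
        np.dot(kernel, padded[i:i + len(kernel)])
        for i in range(len(source))
    ])

# The edge-detecting example from figure 2.4.
A = np.array([1, 0, -1])                # kernel
B = np.array([1, 2, 0, -1, 1, 0, 1])    # source
print(cross_correlate(A, B))            # [-2  1  3 -1 -1  0  0]
```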

2.2 Probability estimation

A probability of a certain event can be calculated if other probabilities are known. This can be done with, for example, Bayes' theorem

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}, \tag{2.11}$$

which calculates a probability using probabilities related to the event that needs to be evaluated. One of the probabilities involved, for example $P(B)$, might involve a set of events $A_i$ that make up $A$, with $P(A) = \sum_i P(A_i)$ and corresponding probabilities $P(B|A_i)$. To then calculate $P(B)$, a sum is needed over all events $A_i$ in $A$:

$$P(B) = \sum_i P(B|A_i) \cdot P(A_i).$$

Using this sum, the probability distribution can be calculated for one of the events $A_i$ in $A$ as

$$P(A_i|B) = \frac{P(B|A_i) \cdot P(A_i)}{\sum_j P(B|A_j) \cdot P(A_j)}. \tag{2.12}$$

The sum required to evaluate Bayes' theorem is not a problem if the number of events $A_i$ involved is limited to a not too large number. As an example, take shoe sizes $S_i$ and genders $G_j$. In this example, the only information available is the probability $P(G_j|S_i)$ that someone with a certain shoe size has a certain gender, and the probability $P(S_i)$ of having a certain shoe size in general. The probability that needs to be evaluated, $P(S_i|G_j)$, is that of a shoe size belonging with a certain gender. To get the probability $P(G_j)$ of someone being a certain gender, the sum $\sum_i P(G_j|S_i) \cdot P(S_i)$ can be calculated. The full calculation for a shoe size belonging to a certain gender is then

$$P(S_i|G_j) = \frac{P(G_j|S_i) \cdot P(S_i)}{\sum_k P(G_j|S_k) \cdot P(S_k)}.$$


The previous example works because the number of shoe sizes is limited to a small number. If instead of the half-integer shoe sizes a continuous distribution is chosen, $P(G_j)$ can still be evaluated, as long as this continuous distribution can be evaluated.
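A few lines of NumPy make the shoe-size example concrete; all probability values below are invented purely for illustration.

```python
import numpy as np

# Four shoe sizes with made-up probabilities P(S_i), summing to 1.
p_size = np.array([0.2, 0.3, 0.3, 0.2])
# Made-up P(G|S_i) for one gender G, per shoe size.
p_gender_given_size = np.array([0.8, 0.6, 0.3, 0.1])

# P(G) = sum_i P(G|S_i) P(S_i): the denominator of Bayes' theorem.
p_gender = np.sum(p_gender_given_size * p_size)

# Formula 2.12: the posterior P(S_i|G) over shoe sizes.
p_size_given_gender = p_gender_given_size * p_size / p_gender
print(p_size_given_gender, p_size_given_gender.sum())  # sums to 1
```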

2.2.1 Variational inference

There are cases in which a sum or integral is hard, or even intractable, to evaluate. If this is the case for $P(B)$ in formula 2.11, and with that in formula 2.12, then that formula is hard or intractable to evaluate. An example is images with a circle and noise, as shown in figures 2.5a, 2.5b and 2.5c.

Figure 2.5: Images with circles with the same radius and noise distribution, but with different positions and noise samples.

Each circle in figure 2.5 has the same radius, but a different position. The images also contain noise from a certain distribution, and each sample drawn from this noise distribution is different. If the probability of a circle in a certain image having a certain radius needs to be evaluated, one naturally looks at Bayes' theorem in formula 2.12. The number of different possibilities for the images with the circle, however, is very large, practically unlimited, which makes evaluation of the sum in formula 2.12 intractable.

In the case of such an intractable sum or integral, the classic Bayes' theorem cannot be used to evaluate the probability or distribution, and variational inference needs to be used. This is also known as likelihood-free inference, since the probability distribution is calculated without knowing the exact likelihood of all outputs of the model from which the images in figure 2.5 were sampled.

Besides the images sampled from the model in figure 2.5, images with circles of different radii and different noise distributions can be sampled as well. This is shown in figure 2.6.

Variational inference works by looking at a limited number of images, and approaches the probability distribution as the number of processed images approaches infinity. The probability distribution calculated by the artificial neural network is also known as the posterior distribution. It is formed from the prior distribution, from which the parameters are sampled and supplied to a model that simulates images of circles and noise; these images are then provided to the network to train with.

The output of the network should therefore be the probability that a certain radius belongs to the circle in a sampled image. The mean of the posterior distribution calculated for an image with a circle should approach the radius with which the model created the circle, and the variance around the mean should gradually go down towards zero as the network is trained.

2.2.2 Mixture Density Network

A mixture density network attempts to learn the posterior distribution by training the network to approximate a mean $\mu$, variance $\sigma^2$ and mixing coefficients $\alpha$ [17]. The $\mu$ and $\sigma$ can be used

Figure 2.6: Images of circles with different radii and noise distributions. The circle in image 2.6a is small, with the same noise distribution as used in the images in 2.5. The circle in image 2.6b is large, partially visible and has little noise. The final circle, in image 2.6c, has a lot of noise and is harder to recognize, but still visible.

in the normal distribution

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}, \tag{2.13}$$

with $x$ being the parameter value for which the probability is calculated that it belongs to the input data of the network. The network can calculate multiple distributions by having output nodes for multiple means $\mu_i$ and variances $\sigma_i^2$, plus the mixing coefficients $\alpha_i$ with which the distributions are added up to form the final distribution. The final distribution is then calculated by multiplying the normal distribution of each $\mu_i$ and $\sigma_i$ with its respective coefficient $\alpha_i$, leading to

$$f(x) = \sum_i \alpha_i \frac{1}{\sigma_i\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x - \mu_i}{\sigma_i}\right)^2} \tag{2.14}$$

over all $i$. The coefficients $\alpha_i$ need to add up to 1, which can be achieved by using the softmax activation function shown in formula 2.5 [17].
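A sketch of what such an MDN head could look like in PyTorch follows; the log-parameterization of $\sigma$ and the layer shapes are assumptions for illustration, not the thesis implementation.

```python
import math
import torch
import torch.nn as nn

class MDNHead(nn.Module):
    """Maps base-network features to the means, variances and mixing
    coefficients of a mixture of normals (formulas 2.13 and 2.14)."""
    def __init__(self, in_features, n_components):
        super().__init__()
        self.mu = nn.Linear(in_features, n_components)
        self.log_sigma = nn.Linear(in_features, n_components)  # log keeps sigma > 0
        self.alpha_logits = nn.Linear(in_features, n_components)

    def forward(self, h):
        mu = self.mu(h)
        sigma = self.log_sigma(h).exp()
        alpha = torch.softmax(self.alpha_logits(h), dim=-1)  # formula 2.5: sums to 1
        return mu, sigma, alpha

def mixture_density(x, mu, sigma, alpha):
    """Evaluate the mixture of normals (formula 2.14) at parameter value x."""
    normal = torch.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    return (alpha * normal).sum(dim=-1)

head = MDNHead(in_features=128, n_components=1)
mu, sigma, alpha = head(torch.randn(4, 128))  # a batch of 4 feature vectors
```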

2.3 Training algorithms

To train an artificial neural network to approximate the output it should have, a training algorithm is needed. The training algorithm takes care of the sampling of model parameters, generating simulations using a model and these parameters, calculating losses between the output of the network and the expected output, and updating the network using these losses. A training algorithm takes a prior for the distribution of parameters, from which parameters are then sampled. Some algorithms change the prior while running and others leave the prior as is. After the algorithm has finished running, the network can calculate the posterior for parameters for simulations.

2.3.1 Sequential Ratio Estimation

One way of approaching the posterior for a parameter in input data, for example the radius of a circle in an image with noise, is a contrastive learning approach [18], also sometimes known as sequential ratio estimation (SRE) [19]. SRE requires as input both the simulation from the model and the parameter for which the probability should be calculated. This way the probability can be approximated for any parameter and any simulation, and with that the posterior distribution for a parameter can be approximated for any input. Note that the possible values for the parameters might be limited to a certain range.


A naive approach to training a neural network would be to sample a parameter $\theta$, create a simulation $x_{\text{sim}}$ using the model and the parameter, feed both through the network, and require the output to be 1. An output of 1 would mean the parameter $\theta$ was used to create simulation $x_{\text{sim}}$, and 0 would mean it was not. If one would only train on outputting 1, however, the neural network would simply train until the output is 1, no matter what input is given.

To train a neural network without that problem, contrastive learning is required. Contrastive learning means that the artificial neural network is trained both on input, consisting of a parameter and a simulation, for which the parameter belongs with the simulation (probability 1), and on input for which it does not (probability 0). This way the network learns to not only output 1, but to output 1 if the simulation was created with the parameter, and 0 if this was not the case. The overall training algorithm [18] is shown in algorithm 1.

Algorithm 1: The algorithm for sequential ratio estimation.
Result: The converged network $d_\phi$.

while not converged do
    for b in (0 ... B) do
        sample parameter $\theta_b$ from prior $p(\theta)$
        sample parameter $\theta'_b$
        sample simulation $x_b$ with $\theta_b$
    $L = 0$
    for b in (0 ... B) do
        $L = L + l(d_\phi(x_b, \theta_b), 1) + l(d_\phi(x_b, \theta'_b), 0)$
    $L = L / B$
    optimize $\phi$ using $L$

In algorithm 1, $B$ is the batch size and $d_\phi$ is the artificial neural network, with $\phi$ being the parameters, like the weights, of the network. $l(x, y)$ is the per-sample loss function, taking as input the network output $x = d_\phi(x_b, \theta)$ and the probability 1 or 0 depending on whether $x_b$ was simulated with $\theta$ or not. $L$ is the mean loss, which is backpropagated through the network, after which $\phi$ of the network is updated.

The loss $l(x, y)$ calculated for the combination of network output and expected outcome is some function that calculates the loss between the two. A function that can be used in this case is the binary cross-entropy (BCE) loss

$$l(x, y) = -\left(y \cdot \log(x) + (1 - y) \cdot \log(1 - x)\right), \tag{2.15}$$

which is similar to the cross-entropy, but with 1 or 0 as probability, hence the 'binary' in the name. For $y = 1$ the BCE is $l(x, 1) = -\log(x)$, while for $y = 0$ the BCE is $l(x, 0) = -\log(1 - x)$. When the posterior only needs to be approximated for a single simulation, the SRE algorithm is wasteful, since it does not limit the calculations to only that single simulation, but trains to estimate probabilities for all simulations.
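A sketch of one SRE iteration (algorithm 1) in PyTorch follows; `network`, `simulate` and `prior_sample` are hypothetical stand-ins for the classifier $d_\phi$, the simulator and the prior, and `nn.BCELoss` averages over the batch, matching the $L / B$ step.

```python
import torch
import torch.nn as nn

def sre_iteration(network, simulate, prior_sample, batch_size, optimizer):
    """One iteration of contrastive training: matching (x, theta) pairs are
    labeled 1, mismatched (x, theta') pairs are labeled 0."""
    bce = nn.BCELoss()  # formula 2.15, averaged over the batch
    theta = prior_sample(batch_size)        # theta_b ~ p(theta)
    theta_alt = prior_sample(batch_size)    # theta'_b, the contrastive draw
    x = simulate(theta)                     # x_b simulated with theta_b

    ones = torch.ones(batch_size, 1)
    zeros = torch.zeros(batch_size, 1)
    loss = bce(network(x, theta), ones) + bce(network(x, theta_alt), zeros)

    optimizer.zero_grad()
    loss.backward()     # backpropagate the mean loss
    optimizer.step()    # update phi
    return loss.item()
```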

2.3.2 Automatic Posterior Transform

Automatic posterior transform (APT) [20] updates the prior after every iteration of the algorithm, according to a proposal posterior calculated at that time for the simulation for which the posterior distribution should be approximated. This means the algorithm only approximates the posterior for a single simulation, $x_{\text{true}}$. By using a proposal posterior as prior, values of a parameter with higher probability are more likely to be used in the next round of the algorithm. Since the distribution for only a single simulation is approximated, and parameters $\theta$ are sampled according to a proposal posterior, the posterior for a single simulation is approximated faster with APT than with SRE.

APT starts with a prior distribution for $\theta$ and a simulation $x_{\text{true}}$ for which the posterior needs to be approximated. In every round $r$ of $R$, a batch of $N$ parameters $\theta_i$ is sampled from the prior, and $N$ simulations $x_i$ are sampled from the model with the respective parameters. $q_{F(x,\phi)}$ is the proposal posterior for simulation $x$, and $q_{F(x,\phi)}(\theta)$ the probability that $x$ was created with $\theta$. This probability is then multiplied with the probability of $\theta$ under the proposed new prior $\tilde{p}(\theta)$ for $x_{\text{true}}$, divided by the probability under the current prior $p(\theta)$ for $x_{\text{true}}$, and finally normalized by the integral $Z(x, \phi)$. The formula for this is

$$\tilde{q}_{x,\phi}(\theta) = q_{F(x,\phi)}(\theta) \frac{\tilde{p}(\theta)}{p(\theta)} \frac{1}{Z(x,\phi)} \tag{2.16}$$

with

$$Z(x,\phi) = \int_{\theta'} q_{F(x,\phi)}(\theta') \frac{\tilde{p}(\theta')}{p(\theta')}, \tag{2.17}$$

where the integral is over the parameter values $\theta'$ from the prior.

The loss function $L$ is then calculated by summing the negative log $-\log \tilde{q}_{x_i,\phi}(\theta_i)$ over all sampled pairs $(\theta_i, x_i)$ and backpropagated, after which $\phi$ of the network is updated. A new prior is then taken by calculating the current posterior of the network, using the new $\phi$ and the $x_{\text{true}}$ for which the posterior distribution needs to be approximated. In the next round, this prior is used for sampling $\theta$. The full algorithm [20] is shown in algorithm 2.

Algorithm 2: The algorithm for automatic posterior transform.
Result: The approximated posterior $\tilde{p}_R(\theta) = q_{F(x_{\text{true}},\phi)}(\theta)$.

for r in (1 ... R) do
    for j in (1 ... N) do
        sample parameter $\theta_{r,j}$ from prior $\tilde{p}_r(\theta)$
        sample simulation $x_{r,j}$ with $\theta_{r,j}$
    $L = 0$
    for i in (1 ... r) do
        for j in (1 ... N) do
            $L = L - \log \tilde{q}_{x_{i,j},\phi}(\theta_{i,j})$
    optimize $\phi$ using $L$
    $\tilde{p}_{r+1}(\theta) = q_{F(x_{\text{true}},\phi)}(\theta)$

The integral in formula 2.17 is problematic, because it cannot be evaluated in many cases. To solve this, all sampled $\theta$ can be seen as the atomic proposals $\Theta$, which are then used instead of formula 2.17 in formula 2.16, like

$$\tilde{q}_{x,\phi}(\theta) = \frac{q_{F(x,\phi)}(\theta) / p(\theta)}{\sum_{\theta' \in \Theta} q_{F(x,\phi)}(\theta') / p(\theta')}, \tag{2.18}$$

which calculates the probability $q_{F(x,\phi)}(\theta)$ for the pair $(x_i, \theta_i) = (x, \theta)$ and divides it by the probability of $\theta$ according to the prior. To then normalize, the same is done for all $\theta' \in \Theta$, but with the simulation $x$ of the $\theta$ the formula works on.

After all rounds $R$ have been completed, the final prior $\tilde{p}_R(\theta)$ is the approximated posterior distribution $q_{F(x_{\text{true}},\phi)}(\theta)$ for $x_{\text{true}}$.
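A sketch of this atomic loss in PyTorch: `log_q` stands in for the network's $\log q_{F(x,\phi)}(\theta)$ and `log_prior` for $\log p(\theta)$; both names and the batch shapes are assumptions, not the thesis code.

```python
import torch

def apt_atomic_loss(log_q, log_prior, theta_batch, x_batch):
    """Atomic APT loss (formulas 2.16 and 2.18). theta_batch holds N
    parameters and x_batch the N matching simulations; the whole batch
    serves as the set of atomic proposals Theta."""
    N = theta_batch.shape[0]
    # log ratios q(theta_j | x_i) / p(theta_j) for all pairs (i, j)
    log_ratio = torch.stack([
        log_q(x_batch[i], theta_batch) - log_prior(theta_batch)
        for i in range(N)
    ])                                            # shape (N, N)
    # formula 2.18: normalize over the atoms for each simulation x_i
    log_q_tilde = log_ratio - torch.logsumexp(log_ratio, dim=1, keepdim=True)
    # minimize -log q~ for the matching pairs, which sit on the diagonal
    return -log_q_tilde.diagonal().mean()
```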

2.4 Cosmic gamma-rays

Gamma radiation is electromagnetic radiation, and comes from a variety of sources, both natural and man-made. The natural sources can be the decay of particles on Earth, or events in outer space that create high-energy gamma-rays [21], which can be measurable on Earth if the output of the event is aimed in the right direction. These gamma-rays are called cosmic gamma-rays, and come from gamma-ray objects and events like solar flares, pulsars [22], magnetars [23] and galaxies [24]. To study these objects it is important to measure cosmic gamma-ray radiation and evaluate the measurements.


In 2008 the Fermi gamma-ray space telescope was launched to perform measurements of cosmic gamma-ray radiation. The main instrument on the satellite is the Large Area Telescope, which makes the measurements [25]. The data is used to perform experiments and studies on cosmic events producing gamma-ray radiation.

2.4.1 Background

Measurements of the Large Area Telescope were used to create images of the cosmic gamma-ray radiation. A map of the background Fermi gamma-ray radiation has been created as well [26]. Events measured by the space telescope with energies between 667 MeV and 158 GeV were divided into 15 logarithmic energy bins, and taken together to form the background shown in figure 2.7.

Figure 2.7: The Fermi gamma-ray background that will be used in the experiments. This is a 200 by 200 pixel image, created using measurements with the Fermi gamma-ray space telescope.


CHAPTER 3

Implementation

A convolutional neural network, a model for simulations and a training algorithm were implemented to perform experiments with, in an attempt to approximate posterior distributions for parameters.

3.1 Network

An implementation of the theory was created with PyTorch [27] to model a convolutional neural network. The network can be split into two parts: the base network, and the network that calculates the posterior distribution from the momentary output of the base network. The base network is a convolutional neural network with two convolutional layers, each followed by maxpooling, and a number of hidden layers. The number of nodes in the hidden layers is always at least the number of outputs from the final convolutional layer, which makes sure that all output of the convolutional part of the network can be fully processed.

The hidden layers are each created using formula 2.2, and the convolutional layers are modeled after the LeNet network [28], which uses convolutional and maxpooling layers. The network that calculates the posterior distribution is the mixture density network (MDN), which takes the output of the base network as input. A mixture density network can output multiple densities, but in this case only a single density was used, because this is enough to approximate the posterior of the parameter the model takes. The full network is shown in figure 3.1; a sketch of a comparable architecture is given below.
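The following PyTorch sketch mirrors the described structure (a LeNet-style convolutional part, fully connected layers at least as wide as the convolutional output, and a single-component MDN head); the channel counts and kernel sizes are assumptions, not the thesis configuration.

```python
import torch
import torch.nn as nn

class PosteriorNet(nn.Module):
    """Sketch: convolutional part, fully connected part and MDN head,
    assuming a 70x70 single-channel input image."""
    def __init__(self, n_hidden_layers=2, hidden_size=None):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
        )
        with torch.no_grad():
            n_features = self.conv(torch.zeros(1, 1, 70, 70)).shape[1]
        # Hidden layers at least as wide as the convolutional output,
        # as described in the text above.
        width = hidden_size or n_features
        layers, size = [], n_features
        for _ in range(n_hidden_layers):
            layers += [nn.Linear(size, width), nn.ReLU()]
            size = width
        self.fc = nn.Sequential(*layers)
        self.mdn = nn.Linear(size, 3)  # one mean, one log-variance, one logit

    def forward(self, x):
        h = self.fc(self.conv(x))
        mu, log_var, _ = self.mdn(h).unbind(dim=-1)
        alpha = torch.ones_like(mu)  # softmax over a single logit is always 1
        return mu, log_var.exp(), alpha
```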

3.2 Model

A model was used that creates semi-realistic samples of Fermi gamma-ray data, modeled after what could be measured by the Fermi gamma-ray space telescope. For this, the model uses the gamma-ray background from figure 2.7.

For every pixel in the background a value is sampled from a normal distribution, for which a luminosity is evaluated using

$$L(j) = 10^{j \cdot k - 1.5}, \tag{3.1}$$

with $L(j)$ the calculated luminosity, $j$ the flux of the signal sampled from the normal distribution over the pixels in the image, and $k$ the parameter for the strength of the calculated luminosity. In this case, $k$ is the value for which the posterior distribution should be approximated; this is thus a way of finding the strength of the luminosity from the measurements. An example of this data is shown in figure 3.2a. The calculated points are added to the background, which is shown in figure 3.2b.

Next, a convolution is done over the image in figure 3.2b, to make the data resemble an actual measurement more. This causes the points to be more spread out over neighboring pixels, and causes pixels that are high in intensity to be more spread out as well. For example, the background in figure 3.2b is less clearly visible than the background in figure 3.2c. This is because there is a single sample in figure 3.2b with very high intensity, and thus the other pixels in that image are shown as less bright.


Figure 3.1: The network that was implemented for the experiments. It contains convolutional, maxpooling, activation and hidden (fully connected) layers; input enters the network on the left and exits on the right. The network can be split into a convolutional part, which takes the image as input, a fully connected part, which processes the output of the convolutional layers, and an MDN part, which calculates the means $\mu_i$, variances $\sigma_i^2$ and mixing coefficients $\alpha_i$. In the convolutional part, layers 1 and 4 are convolutional layers, layers 2 and 5 are ReLU activation functions, and layers 3 and 6 are maxpooling layers. In the fully connected part, layer 7 is the input layer, 9 is the output layer and 8 is a number of fully connected layers; every layer in this part is followed by a ReLU activation function, not explicitly shown in the figure. The MDN consists of a single layer which performs the calculations for the means, variances and mixing coefficients.


Finally, the image in figure 3.2c is noised by sampling a normal distribution over every pixel of the image. These samples are added as the noise; the result of this last step is shown in figure 3.2d.

Overall, the model creates a sample in four steps, each of which is shown in figure 3.2, where the value k = 1.3 was taken for formula 3.1. The steps have been outlined previously, but in short they are

• sampling a normal distribution over the image and calculating the luminosity with formula 3.1,
• adding the background data from figure 2.7,
• convolving the image, and
• sampling and adding noise from a normal distribution over the image.

A minimal sketch of these four steps is shown after this list.
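The sketch below assumes a Gaussian point-spread function for the convolution step and standard-normal flux and noise; the thesis does not specify these exact distributions or the `scipy` filter, so they are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def simulate(k, background, rng):
    """The four model steps for one semi-realistic gamma-ray sample."""
    j = rng.normal(size=background.shape)       # step 1a: flux per pixel
    luminosity = 10.0 ** (j * k - 1.5)          # step 1b: formula 3.1
    image = luminosity + background             # step 2: add the background
    image = gaussian_filter(image, sigma=1.0)   # step 3: convolve (assumed PSF)
    image += rng.normal(size=image.shape)       # step 4: add Gaussian noise
    return image

rng = np.random.default_rng(0)
background = np.zeros((70, 70))  # stand-in for the figure 2.7 background map
x_true = simulate(k=1.3, background=background, rng=rng)
```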

During the experiments it was found that the full image size of 200 by 200 pixels required a large network, so it was decided to use a smaller version of the image, and background, instead. The final size experimented with was 70 by 70 pixels. The images from figure 3.2, there 200 by 200 pixels, are shown again at 70 by 70 pixels in figure 3.3.

3.3 Training algorithm

To train the network, a training algorithm needs to be used; in this case, the previously discussed automatic posterior transform was used. This algorithm takes a new prior from a proposal posterior and uses these priors to sample values of $k$ for formula 3.1 when calculating the simulations with the model. This way, the posterior is approximated faster than with other methods, like SRE.


Figure 3.2: The steps taken in the model to create a Fermi gamma-ray sample based on measurements by the Fermi gamma-ray space telescope. Figure 3.2a shows the flux data sampled from the normal distribution and evaluated with formula 3.1, figure 3.2b has the background added, figure 3.2c is the convolved image, and figure 3.2d is the noised, final sample from the model.


Sequential ratio estimation was implemented as well, and successfully used to approximate the posterior for the earlier example with a circle, but it could not successfully be used to approximate the posterior for $k$.

Finally, the masked autoregressive flow (MAF) [29] algorithm was implemented, which uses masked layers [30] in the network, but this implementation could not be finished due to limited time. Therefore no further experiments could be done using MAF, and no more time was spent on it during experimentation.


Figure 3.3: The same four steps as in figure 3.2, with the difference that the images in figure 3.2 are 200 by 200 pixels and these are 70 by 70 pixels in size.


CHAPTER 4

Experiments

A number of experiments were performed. The base network from figure 3.1 was used, together with the previously discussed model and training algorithm. The automatic posterior transform algorithm was used on simulated $x_{\text{true}}$ for parameter $\theta = k = \mu_{\text{true}} = 1.3$ in formula 3.1. The output of the base network was put through a mixture density network. The posterior is simple enough that it can be approximated by a single distribution, which means that the number of means $\mu_i$ and variances $\sigma_i^2$ from the MDN is 1, and the number of mixing coefficients $\alpha_i$ is 1 as well, with $\alpha = 1$, because they are normalized with the softmax function shown in formula 2.5. An experiment was performed varying the number of hidden layers in the base network over 1, 2 and 3. The results are shown and discussed below for each number of layers.

4.1 One hidden layer

For one hidden layer with $k = 1.3$ in formula 3.1, the simulations in figure 4.1 were used to approximate the means and variances using the MDN. The means and variances are shown in figures 4.2 and 4.3 respectively.

Looking at the plots of the means over the number of iterations in figure 4.2, it can clearly be seen that not all graphs follow the same pattern while learning. The plot of $\mu_{1,6}$ increases in value fast in the beginning and stays high, much higher than $\mu_{\text{true}} = 1.3$. It also does not seem to progress back down to $\mu_{\text{true}} = 1.3$, but stays at just under $\mu_{1,6} = 1.6$. Looking at image 4.1f for $x_{\text{true},1,6}$, a clear difference with the other images in figure 4.1 can be seen, which could explain why $\mu_{1,6}$ is consistently too high. The background in figure 4.1f is much less bright than in the other images, which indicates that the one bright point in that image is much brighter than the bright points in the other images. The probability of having a point with such luminosity is small but not zero, since a normal distribution is used to sample the data from which the luminosities are calculated. A point of such luminosity is much more likely for $k > 1.3$, so the network might find image $x_{\text{true},1,6}$ more likely to be simulated with $k > 1.3$. $\mu_{1,5}$ in figure 4.2 is also somewhat too high, but seems to be going back down to $\mu_{\text{true}} = k = 1.3$ as the number of iterations increases.

The other means $\mu_{1,1}$, $\mu_{1,2}$, $\mu_{1,3}$ and $\mu_{1,4}$ slowly start to approximate $\mu_{\text{true}} = 1.3$, but at a slow and very unstable rate. While clearly going up overall, the plots are very unstable, which could indicate that the network with one hidden layer is not suitable for reliably approximating the posterior distribution of $k$ in luminosity function 3.1.

The variances $\sigma^2_{1,1}$, $\sigma^2_{1,2}$, $\sigma^2_{1,3}$, $\sigma^2_{1,4}$, $\sigma^2_{1,5}$ and $\sigma^2_{1,6}$ in figure 4.3 all go down, although slowly, towards $\sigma^2_{\text{true}} = 0$ as the iterations progress, which is as expected. Initially, variance $\sigma^2_{1,1}$ progresses the fastest and $\sigma^2_{1,3}$ the slowest. Comparing their respective images $x_{\text{true},1,1}$ and $x_{\text{true},1,3}$ in figure 4.1, it is clear that $x_{\text{true},1,3}$ in 4.1c has more clearly visible points than $x_{\text{true},1,1}$ in 4.1a, which might indicate that the trained network sees the number of points as an important pattern in determining the $\sigma^2$ for the input images $x_{\text{true}}$.


Figure 4.1: The $x_{\text{true},1,i}$ images used for the experiments whose results are shown in figures 4.2 and 4.3. The value $k = 1.3$ in formula 3.1 is used. The images are all 70 by 70 pixels in size.

4.2 Two hidden layers

For two hidden layers with $k = 1.3$ in formula 3.1, the simulations in figure 4.4 were used to approximate the means and variances using the MDN. The means and variances are shown in figures 4.5 and 4.6 respectively.

The plots of the approximated means $\mu_{2,i}$ in figure 4.5 look very similar, and approximate $\mu_{\text{true}} = 1.3$ well after around 850 iterations. Compared to those in figure 4.2, there are smaller differences between the plots in how fast they approximate $\mu_{\text{true}}$ and in the shape of the curve. The plot of $\mu_{2,4}$ approximates $\mu_{\text{true}}$ the fastest, as it is initially visibly higher than all other plots. The plot of $\mu_{2,1}$ approximates $\mu_{\text{true}}$ the slowest, but does approximate it well at 1000 iterations. Looking at $x_{\text{true},2,4}$ in figure 4.4d for $\mu_{2,4}$ and $x_{\text{true},2,1}$ in figure 4.4a for $\mu_{2,1}$, one difference is that $x_{\text{true},2,4}$ has more clearly visible points than $x_{\text{true},2,1}$; however, compared to the number of visible points in the other images $x_{\text{true},2,i}$ in figure 4.4, there is no clear significant difference in the number of points. On the other hand, comparing $x_{\text{true},2,1}$ to the other images, it clearly has the fewest visible points.

The approximated mean $\mu_{2,3}$ sometimes drops far below $\mu_{\text{true}}$, especially around 400 iterations, after which it always goes back up again. There is no clear difference between the image $x_{\text{true},2,3}$ and the other images in figure 4.4, so it is unclear why this happens.

Like in figure 4.3, the variances $\sigma^2_{2,i}$ in figure 4.6 go down gradually as the number of iterations increases, which is as expected.


Figure 4.2: The means $\mu_{1,i}$ approximated for the respective simulations $x_{\text{true},1,i}$ in figure 4.1, plotted over 1000 iterations, with the reference line $\mu_{\text{true}} = 1.3$ also shown. The number of hidden layers used in the base network was one.

4.3 Three hidden layers

For three hidden layers with $k = 1.3$ in formula 3.1, the simulations in figure 4.7 were used to approximate the means and variances using the MDN. The means and variances are shown in figures 4.8 and 4.9 respectively.

The approximated means for three hidden layers are shown in figure 4.8, and show some interesting plots. The plot of mean $\mu_{3,2}$ for $x_{\text{true},3,2}$ in figure 4.7b initially goes up fast to $\mu_{3,2} = 1.6$ and gradually goes down as the number of iterations increases. Within 1000 iterations it does not go down all the way to $\mu_{\text{true}} = 1.3$. The plot of mean $\mu_{3,4}$ for $x_{\text{true},3,4}$ in figure 4.7d seems to do the same, although less fast initially. That plot remains above $\mu_{\text{true}}$ on average, but does cross below $\mu_{\text{true}}$, indicating it might go down further as the number of iterations increases, which is not clearly shown in the graph. An explanation can be seen in the images $x_{\text{true},3,2}$ and $x_{\text{true},3,4}$ in figures 4.7b and 4.7d respectively. These images clearly have points with higher luminosity than the points in the other figures in 4.7, with $x_{\text{true},3,2}$ having the point with the highest luminosity.

The other means $\mu_{3,i}$ correctly approximate $\mu_{\text{true}} = 1.3$ at 1000 iterations, which is as expected.

The variances $\sigma^2_{3,i}$ for the images $x_{\text{true},3,i}$ in figure 4.7 are shown in figure 4.9. The variances $\sigma^2_{3,1}$, $\sigma^2_{3,2}$, $\sigma^2_{3,3}$, $\sigma^2_{3,4}$, $\sigma^2_{3,5}$ and $\sigma^2_{3,6}$ correctly go down over the number of iterations. What stands out is that variance $\sigma^2_{3,6}$ stays higher than the other variances plotted in figure 4.9. Looking at the image $x_{\text{true},3,6}$ in figure 4.7f, there is no clear reason for this when comparing the image to the others.


Figure 4.3: The variances $\sigma^2_{1,i}$ approximated for the respective simulations $x_{\text{true},1,i}$ in figure 4.1, plotted over 1000 iterations. The number of hidden layers used in the base network was one.

4.4 SRE and MDN posterior distributions

The final experiment was to compare the final forms of the posteriors for sequential ratio estimation and the mixture density network. The posterior from the MDN approximation for the image $x_{\text{true},2,2}$ in figure 4.4b, with $\mu_{2,2} = 1.37$ and $\sigma^2_{2,2} = 0.15$, is shown in figure 4.10. There were problems with training a posterior using SRE, where the network would simply train to approximate the probability value 0.5 for the contrasting pairs of input instead of the true posterior, so this was not correct. The posterior distribution shown in figure 4.10 for $k$ in the luminosity function 3.1 for $x_{\text{true},2,2}$ is very close to $\mu_{\text{true}} = 1.3$.


Figure 4.4: The $x_{\text{true},2,i}$ images used for the experiments whose results are shown in figures 4.5 and 4.6.


Figure 4.5: The means $\mu_{2,i}$ approximated for the respective simulations $x_{\text{true},2,i}$ in figure 4.4, plotted over 1000 iterations, with the reference line $\mu_{\text{true}} = 1.3$ also shown. The number of hidden layers used in the base network was two.

Figure 4.6: The variances $\sigma^2_{2,i}$ approximated for the respective simulations $x_{\text{true},2,i}$ in figure 4.4, plotted over 1000 iterations. The number of hidden layers used in the base network was two.


Figure 4.7: The $x_{\text{true},3,i}$ images used for the experiments whose results are shown in figures 4.8 and 4.9.


Figure 4.8: The means $\mu_{3,i}$ approximated for the respective simulations $x_{\text{true},3,i}$ in figure 4.7, plotted over 1000 iterations, with the reference line $\mu_{\text{true}} = 1.3$ also shown. The number of hidden layers used in the base network was three.

Figure 4.9: The variances $\sigma^2_{3,i}$ approximated for the respective simulations $x_{\text{true},3,i}$ in figure 4.7, plotted over 1000 iterations. The number of hidden layers used in the base network was three.


[Figure 4.10: a plot titled 'Posterior approximated with MDN and SRE', showing the MDN posterior density $f(x)$ over $x$, with the reference value $\mu_{\text{true}} = 1.3$ marked.]


CHAPTER 5

Conclusion

The plotted means $\mu_{i,j}$ in figures 4.2, 4.5 and 4.8, and the variances $\sigma^2_{i,j}$ in figures 4.3, 4.6 and 4.9, show that an artificial neural network can successfully be used to approximate the posterior distribution of the variable $k$ in the luminosity function 3.1. Comparing all networks with each other, the plots of the means $\mu_{2,i}$ in figure 4.5 for two hidden layers are the most consistent and approximate $\mu_{\text{true}} = 1.3$ the best, compared to the networks with one and three hidden layers.

Therefore it can be concluded, based on these experiments, that the network from figure 3.1 with two hidden layers is best suited to approximate the luminosity for simulated images with background and point sources of Fermi gamma-ray radiation.

5.1 Discussion

A number of remarks can be made about the results, which might affect the conclusion if more research is done looking closer into these remarks and possible problems.

The limited ways in which the structure of the network was varied across the experiments create the problem that the conclusion that two hidden layers are best for approximating the distribution of $k$ might be false: a network with one or three hidden layers, but with different convolutional layers, different numbers of nodes or different activation functions, might perform better than the network with two hidden layers.

Another problem is the limited number of samples the experiments were performed with. The number of samples used to experiment with networks of one, two and three hidden layers was six, which may have allowed a possible outlier to have a large effect on the result of the experiment. For example, this is the case for the plots of $\mu_{3,2}$, $\sigma^2_{3,6}$, $\mu_{1,6}$ and $\mu_{1,5}$. If more samples were used, possible outliers might have had a smaller effect on the overall result.

Future experiments should look into quantifying the instability of the measured data over the number of iterations, and see if any correlation can be found between the instability and the number of hidden layers, number of nodes in the hidden layers, or other parameters of the network.

The number of iterations was limited to 1000; if more were used, interesting results might have shown at higher iteration counts. For example, the conclusion that the network with two hidden layers performs better than the one with three hidden layers is based only on the first 1000 iterations. It is possible that in later iterations the means $\mu_{3,i}$ for three hidden layers converge more stably towards $\mu_{\text{true}} = 1.3$ than the means $\mu_{2,i}$ for two hidden layers.

Further research should also be done into how SRE can be used to train this network, because in the experiment the output of the network for both contrastive pairs would go to 0.5. The network should instead output closer to 1 if the pair contains the parameter with which the data $x_{\text{true}}$ was simulated, and closer to 0 if the parameter was not used to simulate the data.


SRE did perform well on the example of the circle in the noisy image in figure 2.5, with the radius as the parameter.

Overfitting was not considered in the experiments, and it is unclear what its effects on the results would be. More research should be done into the effects of overfitting when training neural networks to approximate a posterior distribution, to see how the results would be affected.

Finally, when plots were opposites of each other, for example means that progressed only slowly towards $\mu_{\text{true}}$ versus means that progressed fast, or means that went higher than $\mu_{\text{true}}$, the corresponding images $x_{\text{true},i,j}$ were compared and predictions were made about how the network might interpret these images to cause the observed difference in the means. More research should be done into how exactly the trained network extracts details from the images and how these details are handled in order to calculate a posterior distribution.


CHAPTER 6

Ethical aspects

There is much to discuss when it comes to the ethical aspects and moral behavior of artificial neural networks [31]. However, not many of those subjects involve the analysis of Fermi gamma-ray data. Therefore, this chapter discusses the more general case of artificial neural networks and ethical and moral boundaries.

6.1 Issues

Artificial intelligence, and with that artificial neural networks, is being used more and more in the world of today [32], especially with the increase in information produced by digital trends like the use of many small devices for small tasks in the Internet of Things [33]. Artificial intelligence is also used in image recognition, for example by Google and Facebook, to recognize objects, organisms and faces. However, algorithms can make mistakes that create outcry for strongly crossing certain ethical boundaries. An example of this is a tweet in which a person was identified by Google Photos as a gorilla, which was widely criticized. There can be a number of reasons why a network falsely classifies a person: a lack of training on certain data sets, a lack of certain data sets themselves, or not enough testing of the trained network to identify morally wrong misclassifications.

Ethically problematic classifications have recently been removed. For example, Google AI had the gender labels male and female, and these have been removed to prevent bias [34]. In the case of the misclassification of a person as a gorilla, Google simply removed the gorilla classification [35]. It can be doubted whether this is a good solution to the problem, or simply a ‘quick’ fix for a bad algorithm or a badly trained network.

6.2 Guidelines

Politicians and companies have been working on guidelines for ethics in artificial intelligence. Ethical committees have been formed together with commercial companies, but there are doubts as to how much commercial companies should be involved in the decision making around ethical rules and guidelines for artificial intelligence, especially when it touches all corners of society, since companies will seek profit when working on this. While companies should be involved, they should not direct the process [36]. The decision making should be publicly funded and transparent to the public.

The European Union has introduced a report with seven principles for ethics in ‘trustworthy AI’ [37]. These especially concern autonomous systems. They should

• enable and support human rights, instead of limiting them,
• be secure and reliable and able to deal with encountered errors,
• give citizens full control over their data, and this data should not be used against citizens,
• ensure traceability,
• avoid unfair bias and ensure diversity and non-discrimination,
• positively enhance social change and sustainability, and
• have mechanisms and protocols to ensure someone is responsible for the outcomes of artificial intelligence.

Overall, efforts are being made to create guidelines and rules around artificial intelligence and its ethical and moral boundaries. It can be difficult to follow these guidelines and make sure they are fully respected, and errors will occur. How companies and politicians handle these errors will be important for how artificial intelligence is used in society in the future.


Bibliography

[1] A. H. Murphy and R. L. Winkler, “Probability forecasting in meteorology,” Journal of the American Statistical Association, vol. 79, no. 387, pp. 489–500, 1984.

[2] R. L. Fleurence and C. S. Hollenbeak, “Rates and probabilities in economic modelling,” Pharmacoeconomics, vol. 25, no. 1, pp. 3–6, 2007.

[3] M. E. Newman, “Spread of epidemic disease on networks,” Physical review E, vol. 66, no. 1, p. 016128, 2002.

[4] R. East, K. Hammond, and W. Lomax, “Measuring the impact of positive and negative word of mouth on brand purchase probability,” International journal of research in marketing, vol. 25, no. 3, pp. 215–224, 2008.

[5] F. T. Juster, “Consumer buying intentions and purchase probability: An experiment in survey design,” Journal of the American Statistical Association, vol. 61, no. 315, pp. 658–696, 1966.

[6] F. Rosenblatt, “The perceptron: A probabilistic model for information storage and organization in the brain.,” Psychological Review, vol. 65, no. 6, pp. 386–408, 1958.

[7] C. W. Dawson and R. Wilby, “An artificial neural network approach to rainfall-runoff modelling,” Hydrological Sciences Journal, vol. 43, pp. 47–66, Feb. 1998.

[8] F. Azam, Biologically inspired modular neural networks. PhD thesis, Virginia Tech, 2000.

[9] A. F. Agarap, “Deep learning using rectified linear units (relu),” 2018.

[10] C. Nwankpa, W. Ijomah, A. Gachagan, and S. Marshall, “Activation functions: Comparison of trends in practice and research for deep learning,” 2018.

[11] J. S. Bridle, “Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition,” in Neurocomputing, pp. 227–236, Springer Berlin Heidelberg, 1990.

[12] S. Elfwing, E. Uchibe, and K. Doya, “Sigmoid-weighted linear units for neural network function approximation in reinforcement learning,” 2017.

[13] M. Tanaka, “Weighted sigmoid gate unit for an activation function of deep neural network,” 2018.

[14] K. Fukushima, “Neocognitron: A hierarchical neural network capable of visual pattern recognition,” Neural Networks, vol. 1, pp. 119–130, Jan. 1988.

[15] K. Fukushima, “Neocognitron,” Scholarpedia, vol. 2, no. 1, p. 1717, 2007.

[16] K. Fukushima, “Neocognitron for handwritten digit recognition,” Neurocomputing, vol. 51, pp. 161–180, Apr. 2003.


[18] J. Hermans, V. Begy, and G. Louppe, “Likelihood-free mcmc with amortized approximate ratio estimators,” 2019.

[19] C. Durkan, I. Murray, and G. Papamakarios, “On contrastive learning for likelihood-free inference,” 2020.

[20] D. S. Greenberg, M. Nonnenmacher, and J. H. Macke, “Automatic posterior transformation for likelihood-free inference,” 2019.

[21] T. K. Gaisser, “Origin of cosmic radiation,” in AIP Conference Proceedings, vol. 558, pp. 27–42, American Institute of Physics, 2001.

[22] R. W. Romani and I.-A. Yadigaroglu, “Gamma-ray pulsars: Emission zones and viewing geometries,” The Astrophysical Journal, vol. 438, p. 314, Jan 1995.

[23] K. Ioka, “Magnetic deformation of magnetars for the giant flares of the soft gamma-ray repeaters,” Monthly Notices of the Royal Astronomical Society, vol. 327, no. 2, pp. 639–662, 2001.

[24] G. Bignami and W. Hermsen, “Galactic gamma-ray sources,” Annual Review of Astronomy and Astrophysics, vol. 21, no. 1, pp. 67–108, 1983.

[25] W. B. Atwood, A. A. Abdo, M. Ackermann, et al., “The large area telescope on the fermi gamma-ray space telescope mission,” The Astrophysical Journal, vol. 697, pp. 1071–1102, May 2009.

(41)

[26] K. N. Abazajian, S. Horiuchi, M. Kaplinghat, R. E. Keeley, and O. Macias, “Strong constraints on thermal relic dark matter from fermi-lat observations of the galactic center,” 2020.

[27] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, high-performance deep learning library,” 2019.

[28] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[29] G. Papamakarios, T. Pavlakou, and I. Murray, “Masked autoregressive flow for density estimation,” 2017.

[30] M. Germain, K. Gregor, I. Murray, and H. Larochelle, “Made: Masked autoencoder for distribution estimation,” 2015.

[31] N. Bostrom and E. Yudkowsky, “The ethics of artificial intelligence,” The Cambridge handbook of artificial intelligence, vol. 1, pp. 316–334, 2014.

[32] Y. Lu, “Artificial intelligence: a survey on evolution, models, applications and future trends,” Journal of Management Analytics, vol. 6, no. 1, pp. 1–29, 2019.

[33] Z. Allam and Z. A. Dhunny, “On big data, artificial intelligence and smart cities,” Cities, vol. 89, pp. 80–91, 2019.

[34] K. Lyons, “Google ai tool will no longer use gendered labels like ‘woman’ or ‘man’ in photos of people,” The Verge.

[35] J. Vincent, “Google ‘fixed’ its racist algorithm by removing gorillas from its image-labeling tech,” The Verge.

[36] Y. Benkler, “Don’t let industry write the rules for ai,” Nature, vol. 569, no. 7754, pp. 161–162, 2019.

[37] L. Floridi, “Establishing the rules for building trustworthy ai,” Nature Machine Intelligence, vol. 1, no. 6, pp. 261–262, 2019.
