
MSc Physics & Astronomy

GRAPPA

Master Thesis

Generating glitches similar to glitches observed by LIGO and Virgo

using Wasserstein Generative Adversarial Network with Gradient Penalty

by

Kerwin Buijsman

10528865
60 EC
December 2019 – December 2020

Supervisor:

Dr Sarah Caudill

Second examiner:

Dr Christoph Weniger

Nikhef: Virgo


Abstract

During the coalescence of two compact objects, for example a black hole and a neutron star, gravitational waves are emitted. The existence of gravitational waves was predicted by Einstein, based on his theory of general relativity, about 100 years ago. After decades of research, the first measurement of a gravitational wave was made in 2015 by the LIGO-Virgo collaboration. This was a new milestone for multi-messenger astronomy (and for other fields as well), because we can now also measure ripples in spacetime, which provides new insights. The morphology of a gravitational wave signal depends strongly on the nature of the coalescing objects.

Because the detectable effect of a gravitational wave is extremely small, measurements are made with very sensitive interferometers. During observation, these interferometers also detect different types of noise. The observed noise limits the sensitivity of the interferometers and the quality of potential gravitational wave detections. The most problematic type of noise is transient noise, also called glitches. Glitches are observed a few times per hour and can mimic a gravitational wave signal, which makes the detection of gravitational waves more complex. We want to understand and model the morphologies of these glitches as well as possible in order to improve the quality of the data.

In this thesis, we first briefly introduce the theory behind gravitational waves. Then we describe the different kinds of noise and give a brief background on deep learning and the tools related to our research. In the main part we first create a data set for each glitch class. These data sets contain the time series of glitches observed during the O1 and O2 runs. We removed as much noise as possible in order to obtain only 'clean' (free of other noise) time series of glitches. Next, we train a WGAN-GP for each glitch class in order to obtain a generator which can generate glitches similar to the observed 'clean' glitches. This approach is motivated by the fact that glitches are too complex to model analytically due to their complex morphologies. A WGAN-GP is a deep learning model which contains two neural networks: a generator and a critic. The critic tries to distinguish generated data from real data, and the generator tries to fool the critic by generating data which the critic finds difficult to distinguish from the real data.


Contents

1 Introduction
1.1 History of gravitational wave astronomy
1.2 Mathematical derivation of GW from GR
1.2.1 Basic principles of GR
1.2.2 GW from GR
1.3 Interferometric detectors

2 Detector characterization
2.1 Power Spectral Density (PSD)
2.2 Noise
2.2.1 Seismic noise
2.2.2 Gravity gradient noise
2.2.3 Thermal noise
2.2.4 Quantum noise
2.2.5 Instrumental noise
2.3 Data conditioning
2.3.1 Whitening
2.3.2 Bandpass and notch filter
2.4 Glitches
2.4.1 Blip
2.4.2 Koi fish
2.4.3 Scattered light
2.4.4 Whistle

3 Deep Learning
3.1 Introduction to neural networks
3.1.1 Biological neuron vs artificial neuron
3.1.2 Feedforward neural network
3.1.3 Activation functions
3.1.4 Batch normalization
3.1.5 Dropout
3.1.6 Loss functions
3.1.7 Backpropagation
3.1.8 Optimizers
3.2 Different types of deep neural networks
3.2.1 Convolutional neural network
3.2.2 Transposed convolution layer vs upsampling
3.2.3 Generative Adversarial Networks
3.2.4 Wasserstein GAN (WGAN)

4 Generating GW detector glitches with a WGAN-GP
4.1 Modelling glitches
4.2 Data set
4.2.1 Gravity Spy
4.2.2 BayesWave
4.2.3 Improving data sets
4.3 Model details
4.3.1 Architectures
4.3.2 Training

5 Results
5.1 Blip glitches
5.2 Koi fish
5.3 Scattered Light
5.4 Whistle

6 Discussion, conclusion and future work
6.1 Discussion
6.2 Conclusion
6.3 Future work

A Appendix sinusoid
A.1 Dataset
A.2 Architecture
A.3 Training details


Chapter 1

Introduction

1.1 History of gravitational wave astronomy

The concept of gravitational waves (GW) is based on the theory of general relativity (GR) introduced by Einstein [19]. The theory of GR is an extension of Einstein's earlier special theory of relativity [18], although the first publication mentioning GW had already appeared in 1906, by Poincaré [62]. GR describes gravity as a manifestation of the curvature of spacetime caused by mass; the basis of the theory consists of the Einstein (field) equations. Shortly after the publication of GR, Einstein predicted the existence of GW, analogous to electromagnetic waves. Two years later, by solving the field equations, Einstein had shown the existence of three different types of GW (longitudinal-longitudinal, transverse-longitudinal and transverse-transverse). In 1936, three years after he moved to the United States, Einstein and Nathan Rosen wanted to publish a paper concluding that gravitational waves do not exist. After a reviewer pointed out some errors, they published the paper in 1937 with modifications, concluding that GW do in fact exist [20]. Einstein also mentioned in his paper that GW are not measurable on Earth due to their extremely small effect on spacetime.

Nevertheless, a variety of physicists dived into experiments in which they tried to detect GW. In 1969 Joseph Weber claimed that he had measured GW using the 'Weber bar' [77], but this was most likely just noise. In 1984 MIT and Caltech signed an agreement for the design and construction of the Laser Interferometer Gravitational-Wave Observatory (LIGO), with Ronald Drever, Kip Thorne and Rainer Weiss as project leaders. This turned out to be a huge success: 31 years later, on 14 September 2015, LIGO made the first detection of a GW [1]. This GW was produced by a binary black hole merger [1].

1.2 Mathematical derivation of GW from GR

1.2.1 Basic principles of GR

The geometry of spacetime is described mathematically by GR, and the basic principles of GW can be derived from the theory of GR. In this section we give a brief overview of some important formulas of GR. The following natural units and notation are used: $c = 1$, $G = 1$, $\partial_\alpha g_{\beta\gamma} = \frac{\partial g_{\beta\gamma}}{\partial x^\alpha}$. The metric is an important object in GR: it converts coordinate differences into physical distances. For example, the squared distance in spherical coordinates can be written as

$$ds^2 = dr^2 + r^2 d\theta^2 + r^2 \sin^2\theta\, d\psi^2 \equiv \sum_{i,j=1}^{3} g_{ij}\, dx^i dx^j \quad (1.1)$$


In GR, $g_{\mu\nu}$ describes the geometry of spacetime and is called the metric. The metric makes the line element invariant under transformations of the coordinate system. The general definition of the line element is

$$ds^2 = g_{\mu\nu}\, dx^\mu dx^\nu \quad (1.2)$$

with $g_{\mu\nu} = g_{\mu\nu}(t, \mathbf{x})$.

The curvature of spacetime is described by the Riemann tensor

$$R^\rho{}_{\sigma\mu\nu} = \partial_\mu \Gamma^\rho_{\nu\sigma} - \partial_\nu \Gamma^\rho_{\mu\sigma} + \Gamma^\rho_{\mu\lambda}\Gamma^\lambda_{\nu\sigma} - \Gamma^\rho_{\nu\lambda}\Gamma^\lambda_{\mu\sigma} \quad (1.3)$$

with the Christoffel symbol $\Gamma^\lambda_{\mu\nu}$ defined as

$$\Gamma^\lambda_{\mu\nu} = \frac{1}{2} g^{\lambda\sigma}\left(\partial_\mu g_{\nu\sigma} + \partial_\nu g_{\sigma\mu} - \partial_\sigma g_{\mu\nu}\right) \quad (1.4)$$

The Einstein Field Equation (EFE) is one of the most important equations in GR. It comprises ten equations and relates the local spacetime curvature (described by the metric) to the local energy and momentum (described by $T_{\mu\nu}$):

$$R_{\mu\nu} - \frac{1}{2} R\, g_{\mu\nu} = 8\pi T_{\mu\nu} \quad (1.5)$$

Here we neglect the cosmological constant $\Lambda$.

1.2.2 GW from GR

GW can be derived from GR by solving the EFE. The coordinate system we use to solve the EFE has the components

$$g_{\mu\nu} = \eta_{\mu\nu} + h_{\mu\nu} \quad (1.6)$$

where $|h_{\mu\nu}| \ll 1$ and $\eta_{\mu\nu}$ represents the Minkowski metric of flat spacetime. Because $|h_{\mu\nu}|$ is very small, we can linearize the EFE; this is an essential part of the derivation. In order to solve the EFE, we need expressions for $R_{\mu\nu}$ and $R$. First we plug the metric of Equation 1.6 into Equation 1.4 to obtain the Christoffel symbols.

$$\Gamma^\lambda_{\mu\nu} = \frac{1}{2}\left(\eta^{\lambda\sigma} + h^{\lambda\sigma}\right)\left(\partial_\mu(\eta_{\nu\sigma} + h_{\nu\sigma}) + \partial_\nu(\eta_{\sigma\mu} + h_{\sigma\mu}) - \partial_\sigma(\eta_{\mu\nu} + h_{\mu\nu})\right) \quad (1.7)$$

Using $\partial_\sigma \eta_{\mu\nu} = 0$ and neglecting higher orders of $h$, $\mathcal{O}(h^2)$, this results in

$$\Gamma^\lambda_{\mu\nu} = \frac{1}{2}\eta^{\lambda\sigma}\left(\partial_\mu h_{\nu\sigma} + \partial_\nu h_{\mu\sigma} - \partial_\sigma h_{\nu\mu}\right) \quad (1.8)$$

Neglecting $\mathcal{O}(h^2)$ is possible because we took $|h_{\mu\nu}| \ll 1$. Now we plug the obtained Christoffel symbols into the definition of the Riemann tensor, as shown in Equation 1.9. Here we can neglect the last two terms of the Riemann tensor, because those would only produce terms of $\mathcal{O}(h^2)$:

$$\begin{aligned}
R_{\tau\sigma\mu\nu} &= g_{\tau\rho} R^\rho{}_{\sigma\mu\nu} = g_{\tau\rho}\left(\partial_\mu \Gamma^\rho_{\nu\sigma} - \partial_\nu \Gamma^\rho_{\mu\sigma} + \Gamma^\rho_{\mu\lambda}\Gamma^\lambda_{\nu\sigma} - \Gamma^\rho_{\nu\lambda}\Gamma^\lambda_{\mu\sigma}\right) \\
&= \eta_{\tau\rho}\left[\partial_\mu\left(\frac{1}{2}\eta^{\rho\alpha}\left(\partial_\nu h_{\sigma\alpha} + \partial_\sigma h_{\nu\alpha} - \partial_\alpha h_{\sigma\nu}\right)\right) - \partial_\nu\left(\frac{1}{2}\eta^{\rho\beta}\left(\partial_\mu h_{\sigma\beta} + \partial_\sigma h_{\mu\beta} - \partial_\beta h_{\sigma\mu}\right)\right)\right] \\
&= \partial_\mu\left(\frac{1}{2}\left(\partial_\nu h_{\sigma\tau} + \partial_\sigma h_{\nu\tau} - \partial_\tau h_{\sigma\nu}\right)\right) - \partial_\nu\left(\frac{1}{2}\left(\partial_\mu h_{\sigma\tau} + \partial_\sigma h_{\mu\tau} - \partial_\tau h_{\sigma\mu}\right)\right) \\
&= \frac{1}{2}\left(\partial_\mu\partial_\sigma h_{\nu\tau} - \partial_\mu\partial_\tau h_{\sigma\nu} - \partial_\nu\partial_\sigma h_{\mu\tau} + \partial_\nu\partial_\tau h_{\sigma\mu}\right)
\end{aligned} \quad (1.9)$$


Now we calculate the Ricci tensor by contracting the Riemann tensor:

$$R_{\sigma\nu} = \eta^{\mu\tau} R_{\tau\sigma\mu\nu} = \eta^{\mu\tau}\,\frac{1}{2}\left(\partial_\mu\partial_\sigma h_{\nu\tau} - \partial_\mu\partial_\tau h_{\sigma\nu} - \partial_\nu\partial_\sigma h_{\mu\tau} + \partial_\nu\partial_\tau h_{\sigma\mu}\right) = \frac{1}{2}\left(\partial^\tau\partial_\sigma h_{\nu\tau} - \partial^\tau\partial_\tau h_{\sigma\nu} - \partial_\nu\partial_\sigma \eta^{\mu\tau} h_{\mu\tau} + \partial_\nu\partial^\tau h_{\tau\sigma}\right) \quad (1.10)$$

Using the d'Alembert operator $\Box = \partial^\tau\partial_\tau$ and $\eta^{\mu\tau} h_{\mu\tau} = h$, we can write the obtained Ricci tensor as

$$R_{\sigma\nu} = \frac{1}{2}\left(\partial^\tau\partial_\sigma h_{\nu\tau} - \Box h_{\sigma\nu} - \partial_\nu\partial_\sigma h + \partial_\nu\partial^\tau h_{\tau\sigma}\right) \quad (1.11)$$

Now we only need the Ricci scalar in order to start solving the EFE for our coordinate system. We calculate it by contracting the Ricci tensor:

$$R = \eta^{\sigma\nu} R_{\sigma\nu} = \frac{1}{2}\eta^{\sigma\nu}\left(\partial^\tau\partial_\sigma h_{\nu\tau} - \Box h_{\sigma\nu} - \partial_\nu\partial_\sigma h + \partial_\nu\partial^\tau h_{\tau\sigma}\right) = \partial^\tau\partial^\nu h_{\nu\tau} - \Box h \quad (1.12)$$

At this point we can solve the EFE by plugging the obtained Ricci scalar and tensor (Equations 1.12 and 1.11) into the EFE (Equation 1.5):

$$\begin{aligned}
8\pi T_{\sigma\nu} &= R_{\sigma\nu} - \frac{1}{2} R\, g_{\sigma\nu} \\
&= \frac{1}{2}\left(\partial^\tau\partial_\sigma h_{\nu\tau} - \Box h_{\sigma\nu} - \partial_\nu\partial_\sigma h + \partial_\nu\partial^\tau h_{\tau\sigma}\right) - \frac{1}{2}\left(\partial^\tau\partial^\nu h_{\nu\tau} - \Box h\right)\left(\eta_{\sigma\nu} + h_{\sigma\nu}\right) \\
&= \frac{1}{2}\left(\partial^\tau\partial_\sigma h_{\nu\tau} - \Box h_{\sigma\nu} + \partial_\nu\partial^\tau h_{\tau\sigma} - \eta_{\sigma\nu}\,\partial^\tau\partial^\nu h_{\nu\tau}\right)
\end{aligned} \quad (1.13)$$

We define the trace-reversed amplitude as $\bar{h}_{\mu\nu} = h_{\mu\nu} - \frac{1}{2}\eta_{\mu\nu} h$. Substituting $h_{\mu\nu}$ with $\bar{h}_{\mu\nu} + \frac{1}{2}\eta_{\mu\nu} h$ and neglecting $\mathcal{O}(h^2)$, we can write Equation 1.13 as

$$16\pi T_{\sigma\nu} = \partial^\tau\partial_\sigma \bar{h}_{\nu\tau} - \Box \bar{h}_{\sigma\nu} + \partial_\nu\partial^\tau \bar{h}_{\tau\sigma} - \eta_{\sigma\nu}\,\partial^\tau\partial^\nu \bar{h}_{\nu\tau} \quad (1.14)$$

At this point we are still free to choose a gauge. We set $\partial^\mu \bar{h}_{\mu\nu} = 0$, which is known as the Lorenz gauge condition. Under this condition the wave components are orthogonal to the direction of propagation, and Equation 1.14 simplifies to

$$16\pi T_{\sigma\nu} = -\Box \bar{h}_{\sigma\nu} \quad (1.15)$$

Far away from the source, for example at Earth, or in free space, we can set the energy-momentum tensor $T_{\mu\nu}$ to zero. In Minkowski space the d'Alembert operator can be written as $\Box = \frac{1}{c^2}\frac{\partial^2}{\partial t^2} - \nabla^2$. Our linearized EFE can then be written as

$$\left(-\frac{\partial^2}{\partial t^2} + c^2\nabla^2\right)\bar{h}_{\sigma\nu} = 0 \quad (1.16)$$

The linearized EFE now has the form of a wave equation. It also implies that a travelling GW has a velocity equal to the speed of light. The standard solution of Equation 1.16 is

$$\bar{h}_{\sigma\nu} = A_{\sigma\nu}\, e^{i k_\alpha x^\alpha} \quad (1.17)$$

We can simplify the metric further if $\bar{h}$ is trace-free ($\bar{h} = h$), which is possible for a specific gauge. Because we already chose the Lorenz gauge condition, we use a subgauge of the harmonic gauge called the transverse traceless (TT) gauge. In the TT gauge, $A_{\sigma\nu}$ can be expressed as

$$A^{(TT)}_{\sigma\nu} = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & A^{(TT)}_{xx} & A^{(TT)}_{xy} & 0 \\ 0 & A^{(TT)}_{xy} & -A^{(TT)}_{xx} & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix} \quad (1.18)$$

Combining Equation 1.17, Equation 1.18 and the fact that $\bar{h}$ is trace-free, we can describe a GW propagating along the z-axis by

$$h(t, z) = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & h_+ & h_\times & 0 \\ 0 & h_\times & -h_+ & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix} \cos\left(\omega(t - z) + \phi_0\right) \quad (1.19)$$

Here $\omega$ is the angular frequency and $\phi_0$ the initial phase; $h_+$ refers to $A_{xx}$ and $h_\times$ refers to $A_{xy}$. Figure 1.1 illustrates the effects of $h_+$ and $h_\times$.

Figure 1.1: The effects of the $+$ and $\times$ polarizations of a GW traveling along the z-axis with $\omega = 2\pi/T$ [46].

1.3 Interferometric detectors

Interferometers are used in a broad range of fields in science and engineering. Most interferometers, including those used to detect GW, are based on the Michelson interferometer, which was introduced by Albert Michelson and first used in the "Michelson-Morley experiment" in 1887 [50]. Interferometers are designed to make measurements using interferometry. Figure 1.2 shows a schematic setup of a Michelson interferometer.

The strength of a GW is characterised by the dimensionless quantity $h$, called the strain. We can describe $h$ by

$$h = \frac{2\Delta L}{L} \quad (1.20)$$

where $\Delta L$ is the change in length and $L$ the length of the detector arm. One of the most challenging problems in detecting GW is that the theoretical magnitude of the strain is of order $10^{-21}$ or smaller [68]. For example, a binary star system produces a strain of $h \sim 10^{-21}$. Even if the arms of a standard Michelson interferometer had a length $L$ of 1 AU, the observable change in arm length $\Delta L$ would only be about one atomic diameter.
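As a quick order-of-magnitude check of Equation 1.20 (with round, illustrative numbers): for $h \sim 10^{-21}$ and $L = 1\,\mathrm{AU} \approx 1.5\times 10^{11}$ m,

$$\Delta L = \frac{h L}{2} = \frac{10^{-21} \times 1.5\times 10^{11}\,\mathrm{m}}{2} \approx 7.5\times 10^{-11}\,\mathrm{m},$$

which is indeed comparable to an atomic diameter ($\sim 10^{-10}$ m). For a 4 km arm the same strain gives $\Delta L = 2\times 10^{-18}$ m, hundreds of times smaller than a proton radius.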

At the moment a few ground-based interferometric detectors are operating: advanced LIGO Hanford and Livingston [21], advanced Virgo [26], KAGRA [22] and GEO [25]. Advanced LIGO and Virgo are located in the US and Italy and have been detecting GW signals since 2015 and 2017 respectively. Advanced LIGO started operating with the first observation run (O1), advanced Virgo joined during the second observation run (O2), and KAGRA was supposed to join during O3 but got delayed. The arm lengths of advanced LIGO and Virgo are 4 km and 3 km respectively. Recently, the Japanese detector KAGRA, which has 3 km arms, started to operate. The GEO detector is only used for the development and testing of advanced techniques, because its arms (600 m) are significantly shorter than those of the other operating interferometers. Several projects for future interferometric detectors are ongoing: the LIGO-India detector, which will not be operational for the next few years [23]; the Einstein Telescope [31], which will be a third-generation detector; and the evolved Laser Interferometer Space Antenna (LISA) [32], which will be an interferometer operating in space.

Figure 1.2: A schematic setup of a Michelson interferometer [6]. The laser sends a beam of monochromatic light to the beamsplitter, which splits it into two beams of equal power, perpendicular to each other. The two end mirrors reflect the incoming beams, which then recombine at the beamsplitter. After recombination, the beam ends at the photodetector (PD).

Figure 1.2 shows a simplified version of an interferometer used to detect GW. A change in the length between the end mirrors and the beamsplitter results in a change of the light power detected by the photodetector. The interferometer becomes more sensitive by increasing the arm lengths (the distance between the beamsplitter and the end mirror). This is key to the problem that the strain of a GW is of order $10^{-21}$ or lower; therefore the arm lengths of LIGO and Virgo are 4 km and 3 km long respectively. Another difference between the interferometers for measuring GW and the original Michelson interferometer is that the current interferometers use laser light instead of a point source. The first prototype that used laser light was built in 1971 and improved the accuracy by a factor of about one million compared to the original Michelson interferometer [53]. Lasers are well suited for interferometers because they emit light with a relatively constant phase and a narrow emission spectrum. This results in the clear interference pattern which the interferometer uses for detection.

The main limitations on the sensitivity of an interferometer are the length of the arms and the laser power. The current detectors make use of Fabry-Perot (FP) cavities and power recycling to increase sensitivity [21]. An FP cavity is created by adding an additional mirror at the beginning of the arm. This additional mirror ensures that the incoming beam bounces multiple times before it recombines at the beamsplitter. In LIGO, the FP cavities are 4 km long and the laser beam bounces about 300 times between the mirrors before recombination. This increases both the number of photons in the arms and the distance travelled by the photons, making the LIGO detector significantly more sensitive.

Power recycling mirrors are implemented in the current detectors to increase the effective laser power. A recycling mirror has one transparent side and is usually placed between the laser and the beamsplitter. The light from the laser propagates through the recycling mirror and reaches the beamsplitter. At the beamsplitter, part of the light is reflected back towards the laser source; part of the light is also reflected towards the laser source during recombination. Power recycling mirrors reflect this light back into the interferometer, which increases the effective laser power in the interferometer.

In advanced LIGO and Virgo, signal recycling mirrors are also implemented in the interferometers. Signal recycling mirrors were introduced by [49] and operate on the same principle as power recycling mirrors; the difference is that a signal recycling mirror is placed between the beamsplitter and the photodetector.


Chapter 2

Detector characterization

The data obtained by the LIGO and Virgo detectors contain many noise transients, also called glitches, which have an environmental or instrumental origin. Examples of transient noise sources are fluctuations of the laser power and thunderstorms [56]. Glitches are in general short-duration transient noise and can mimic or mask a GW. Although the exact origin of many observed glitches is currently still unknown, a significant fraction of observed glitches do have a known origin [24], [27], [29].

A detector is characterized by a sensitivity curve, which we can define by summing all the observed noise contributions. The sensitivity curve indicates the minimum directly detectable GW signal for a certain detector. GW signals which lie below the sensitivity curve are covered by the noise; some of these signals can still be measured using, for example, the matched filtering algorithm. The sensitivity curve is constructed from the ratio of the detector's noise power spectral density (PSD) $P_n(f)$, explained in Section 2.1, to the sky- and polarization-averaged detector response to a GW, $R(f)$ [75]. The characteristic strain noise amplitude $h_n(f)$, also called the sensitivity curve, is defined as [75]

$$h_n(f) \equiv \sqrt{f\, S_n(f)}, \quad \text{with} \quad S_n(f) \equiv P_n(f)/R(f) \quad (2.1)$$

where $f$ is the frequency. For Gaussian and stationary noise we can use $S_n(f) \equiv P_n(f)$ [52]. In general, the approximation that the observed noise is Gaussian and stationary can be made: the noise varies over relatively long time scales, but behaves stationary over the time windows used to search for GW signals. The noise is also Gaussian, except in the presence of a glitch.

2.1 Power Spectral Density (PSD)

The PSD describes the distribution of the noise power over frequency. The PSD of a detector characterizes the corresponding sensitivity curve. It is common to visualize $\sqrt{S_n(f)}$, called the amplitude spectral density (ASD), in order to obtain units that are easier to compare with the data in the time domain.
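As a concrete illustration, the PSD and ASD of a strain time series can be estimated with Welch's method. The sketch below uses scipy with placeholder data and an illustrative sampling rate (not values taken from the thesis):

```python
import numpy as np
from scipy.signal import welch

fs = 4096                                      # sample rate in Hz (illustrative)
rng = np.random.default_rng(0)
strain = rng.standard_normal(fs * 64)          # stand-in for 64 s of detector strain data

# Welch's method: average the periodograms of overlapping, windowed segments.
f, psd = welch(strain, fs=fs, nperseg=4 * fs)  # 4 s segments -> 0.25 Hz resolution
asd = np.sqrt(psd)                             # amplitude spectral density, in 1/sqrt(Hz)
```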

Figure 2.1 shows the sensitivities of the detectors IPTA, LISA, GEO, advanced Virgo, KAGRA, advanced LIGO and the Einstein Telescope (ET). It also shows the characteristic strain of different GW sources at the corresponding frequencies. Note that the most effective frequency range depends on the detector: for aLIGO and aVirgo it is of the order of $10^2$ Hz, while for LISA it is of the order of $10^{-2}$ Hz. Because of this, aLIGO and aVirgo are suited to detect GW originating from compact binary inspirals, whereas LISA is suited to detect GW originating from massive binaries or extreme mass ratio inspirals. We can also see that the future detector ET will be more sensitive to compact binary inspirals than the detectors of the currently operating network.


Figure 2.1: The noise amplitude in frequency space for different detectors. This figure was created using http://gwplotter.com, which is introduced in [52].

2.2 Noise

The total noise of the detectors has contributions from a variety of noise sources. In this section we briefly describe the most significant ones. Figure 2.2 shows the contribution of each noise source for the advanced LIGO and advanced Virgo detectors.

(a) Advanced Virgo (b) Advanced LIGO

Figure 2.2: The total noise and the contribution of each noise source for a detector in nominal mode [47], [21].

2.2.1 Seismic noise

Seismic noise is caused by external factors which cannot be prevented; some examples are ground motion, heavy storms and human activity. Seismic noise is one of the main factors limiting the sensitivity of the detectors at relatively low frequencies (a few Hz) [16]. However, seismic noise is also problematic at higher frequencies, where it is regularly the cause of glitches observed by the detector [48].


It is very important to minimize seismic noise. The first factor related to the amount of observed seismic noise is the geographic location of the detector: building a detector close to densely populated areas results in relatively higher detected seismic noise. This was taken into account when choosing the geographic locations of LIGO and Virgo (and the other detectors). Seismic noise is further minimized by isolating optical components of the detector with suspension systems. Here a pendulum system is introduced, from which the input and end mirrors are hung. The reason to use a pendulum system is that, above the pendulum resonance, the transfer of horizontal motion from the suspension to the pendulum mass decreases with the inverse square of the frequency. In addition to the pendulum system, springs are used to suppress seismic noise in the vertical direction. Together these measures can theoretically reduce the detected seismic noise down to a few Hz; Figure 2.2b shows that seismic noise above 11 Hz is negligible for advanced LIGO.

2.2.2 Gravity gradient noise

Gravity gradient noise, also called Newtonian noise, is caused by fluctuating gravitational forces on the test masses of the interferometer. These fluctuating forces are the result of density perturbations in the Earth near the detector, caused by seismic waves. The effect of these seismic waves on the test masses of an interferometer is shown in Figure 2.3.

Figure 2.3: Illustration of a fluctuating gravitational force on a suspended mass. The fluctuating gravitational force is the result of a surface wave [61].

The effect of the fluctuating gravitational force on a test mass of an interferometer can be described by

$$\tilde{x}(\omega) = \frac{4\pi G \rho}{\omega^2}\, \beta(\omega)\, \tilde{W}(\omega) \quad (2.2)$$

Here $\tilde{x}(\omega)$ represents the rms motion of the test masses, $G$ Newton's constant, $\rho$ the density of the Earth, $\omega = 2\pi f$ the angular frequency of the fluctuating gravitational forces, $\beta(\omega)$ a reduced transfer function, and $\tilde{W}(\omega)$ the displacement rms-averaged over three-dimensional directions [41]. Gravity gradient noise is problematic at low frequencies (up to $\sim 20$ Hz) [41], as is also shown in Figure 2.2b.

It is very difficult to reduce gravity gradient noise. There are two main proposals for reducing it in (future) interferometers. The first approach is to track the motion of nearby masses; the contribution of a fluctuating gravitational force can then be calculated and subtracted from the motion of the real test masses. However, this is very complex, and quite a few challenges must be overcome to achieve a precise measurement of this contribution. The second proposal is to build detectors at locations with reduced gravity gradient noise. Good examples are the Kamioka Gravitational Wave Detector (KAGRA), which has been built underground and recently started to operate [22], and eLISA, which will be built in space and is planned to start operating this decade [32]. eLISA will focus on low frequencies, between 0.1 mHz and 1 Hz, and will most likely have very few problems with gravity gradient and seismic noise.

2.2.3 Thermal noise

Thermal noise is the result of fluctuations of atoms in the mirrors and their suspensions in the interferometer; it can therefore be divided into two classes, mirror and pendulum thermal noise. These fluctuations are driven by thermal energy, and the total thermal noise increases with every reflection of the laser beam off the mirrors. Of all noise sources, thermal noise contributes the most to the total noise at low frequencies, as also shown in Figure 2.2. The fluctuations are most problematic at the resonant frequencies of the pendulum systems and the mirrors (a few Hz and a few kHz respectively). The spectral density of the thermal motion of a mass in a harmonic oscillator can be described by

$$x^2(\omega) = \frac{4 k_B T\, \omega_0^2\, \phi(\omega)}{\omega m \left[\left(\omega_0^2 - \omega^2\right)^2 + \omega_0^4\, \phi^2(\omega)\right]} \quad (2.3)$$

where $k_B$ represents Boltzmann's constant, $T$ the temperature, $\omega$ the angular frequency, $\omega_0$ the angular resonant frequency, $\phi(\omega)$ the loss angle of the oscillator at $\omega_0$, and $m$ the mass [70]. In our case the springs and pendulum connected to a mass can be described as a harmonic oscillator.

Thermal energy causes another problem: it can change the index of refraction. The beamsplitter absorbs a very small amount of the light propagating through it, which slightly raises its temperature and thereby changes the index of refraction.

Thermal noise can be reduced in a few different ways. (1) Using materials with a high Q-factor. The Q-factor of a material can be described by

$$Q = \frac{f_r}{\Delta f} \quad (2.4)$$

where $\Delta f$ represents the bandwidth around the resonant frequency and $f_r$ the resonant frequency [10]. From Equation 2.4 it is clear that a higher Q-factor results in a smaller bandwidth in which the vibrations from thermal energy are problematic. (2) Not measuring at the resonant frequencies of the mirrors and the pendulum systems. (3) Decreasing the temperature: from Equation 2.3 we see that a decrease of the temperature leads to a decrease of the thermal motion.

The GW detector KAGRA is the first detector to reduce thermal noise using cooled sapphire mirrors and cooled sapphire test masses [22]; this is also called cryogenic operation of a detector. Here the sapphire is cooled to a temperature of 20 K [22]. Additionally, these mirrors cause very little thermal lensing due to the thermal conductivity of the cooled sapphire. Through this, the KAGRA interferometer has reduced thermal fluctuations compared to room-temperature detectors [22].

2.2.4 Quantum noise

Figure 2.2 shows that quantum noise is the most significant noise source at higher frequencies (above $\sim 100$ Hz and $\sim 300$ Hz for advanced LIGO and Virgo respectively). Quantum noise can be divided into two categories: photon shot noise and radiation pressure noise. The total quantum noise of a standard Michelson interferometer can be calculated as

$$h_{tot}(\Omega) = \sqrt{h_{shot}^2(\Omega) + h_{rad}^2(\Omega)} \quad (2.5)$$

where $h_{shot}$ and $h_{rad}$ are given in Equations 2.6 and 2.7. The shot noise and the radiation pressure noise are discussed below.

Photon shot noise

The photons of the laser beam in the interferometer obey the rules of quantum mechanics. Photon shot noise is caused by quantum fluctuations, which randomly fluctuate the intensity of the laser beam. The relationship between the detector strain amplitude and the contribution of the shot noise can be described by

$$h_{shot}(\Omega) = \frac{1}{L}\sqrt{\frac{\hbar c \lambda}{2\pi P}} \quad (2.6)$$

where $L$ represents the length of the arm, $\hbar$ the reduced Planck constant, $\lambda$ the wavelength and $P$ the laser power [34].

The noise from quantum fluctuations decreases as the laser power increases, as is clear from Equation 2.6: the interference signal becomes smoother in the presence of more photons. For this reason, the power recycling mirrors which increase the effective laser power also reduce the photon shot noise. Because Equation 2.6 is frequency-independent, reducing the shot noise improves the detector sensitivity at all frequencies.
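To get a feel for the scale of Equation 2.6, evaluating it with round, purely illustrative numbers ($L = 4$ km, $\lambda = 1064$ nm, $P = 100$ W; these are not values quoted in the thesis) gives

$$h_{shot} \approx \frac{1}{4\times 10^{3}\,\mathrm{m}}\sqrt{\frac{(1.05\times 10^{-34})(3\times 10^{8})(1.064\times 10^{-6})}{2\pi \times 100}} \approx 1.8\times 10^{-21}\ \mathrm{Hz}^{-1/2}$$

for a bare Michelson interferometer; the FP cavities and power recycling described in Section 1.3 push the actual shot-noise floor far below this.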

Radiation pressure noise

Increasing the (effective) laser power to improve the sensitivity of an interferometer also results in a significant amount of radiation pressure noise. Radiation pressure noise arises from quantum fluctuations of the internal fields; these fluctuations induce radiation pressure fluctuations which slightly change the positions of the mirrors. Due to the Heisenberg uncertainty principle, the disturbance of the mirrors caused by radiation pressure is unknown and varies: the uncertainty principle prevents an accurate measurement of the momentum of the photons. The momentum transferred by photons to the mirrors can also mask a GW [69]. For a standard Michelson interferometer, the relationship between the detector strain amplitude and the contribution of the radiation pressure noise can be described by

$$h_{rad}(\Omega) = \frac{1}{m\Omega^2 L}\sqrt{\frac{8\pi\hbar P}{c\lambda}} \quad (2.7)$$

where $m$ represents the mass of the mirrors, $\Omega$ the sideband frequency with respect to the carrier frequency, $L$ the length of the arm, $\hbar$ the reduced Planck constant, $P$ the laser power and $\lambda$ the corresponding wavelength [34]. Equation 2.7 shows that the radiation pressure noise scales as $\Omega^{-2}$: in contrast to the shot noise it is frequency-dependent, and it is more significant at lower frequencies. We can also see from Equation 2.7 that an increase of the effective power leads to an increase of the radiation pressure noise.

The standard quantum limit (SQL)

The standard quantum limit is a widely used term for the minimum quantum noise. There exists an optimal value of the input power which gives the lowest possible quantum noise; the SQL can be calculated by minimizing the total quantum noise with respect to the laser power. For a Michelson interferometer the SQL can be described by

$$h_{SQL}(\Omega) = \sqrt{\frac{4\hbar}{m\Omega^2 L^2}} \quad (2.8)$$

with the corresponding optimal power

$$P_{SQL} = \frac{c\lambda m\Omega^2}{4\pi} \quad (2.9)$$

[9]. Different techniques have been proposed to go beyond the SQL [40]; at the moment, the observing detectors do not yet go beyond it. It is very important to reduce the quantum noise as much as possible, because the sensitivity of the current detectors at higher frequencies is limited by quantum noise, as is also visible in Figure 2.2b.

Quantum noise can also be reduced by injecting squeezed states of light [28]. The difference between a vacuum state and a squeezed state is that a squeezed state has a reduced uncertainty in the amplitude or phase quadrature. This technique can be applied because the phase and the amplitude are not measured at exactly the same time. Moreover, the Heisenberg uncertainty principle only sets a minimum on the product of the variances of the quadratures, not on the variances individually. Therefore, the injected squeezed light has a reduced uncertainty in the amplitude quadrature relative to the phase quadrature, or vice versa.

2.2.5 Instrumental noise

In this subsection we briefly discuss the most significant instrumental noise sources during the O1 and O2 runs. For this we investigate the ASD, which is equal to the square root of the PSD, to identify the different instrumental noise sources. Figure 2.4 shows the ASD of the observing detectors at their best performance during the O2 run. The shape of the ASD of Hanford and Livingston at their best performance during O1 is very similar to the shape during O2.

Figure 2.4: Amplitude spectral density of the total strain noise during O2 (at best performance). The red line represents Hanford, blue Livingston and violet Virgo [3].

Figure 2.4 also shows the characteristic shape of the total noise described in the paragraphs above. On top of this shape there are sharp peaks at certain frequencies. These sharp peaks are mainly caused by instrumental noise and can have a negative influence on some GW searches, especially near the peaks [3]. All of the most prominent peaks have an instrumental origin. The most significant instrumental noise contributions correspond to the highest peaks and are:

• Power line harmonics, around 50n Hz for Virgo and around 60n Hz for Hanford and Livingston. Imperfections in the magnetic coupling and electromagnetic shielding of the mirror suspensions lead to upconversion of low-frequency seismic noise. (O1 and O2)

• Thermally excited beam splitter and mirror suspensions, also known as violin modes. At Hanford and Livingston the violin modes of the mirror suspensions are at about 500 Hz, with harmonics up to 2000 Hz; for Virgo they are at about 300 Hz, with harmonics up to 1800 Hz. The violin modes of the beam splitter suspensions are at about 280 Hz for Virgo and at about 300, 600 and 900 Hz for Hanford and Livingston. (O1 and O2)

• Calibration lines, inserted during the run by moving the end mirrors.

These calibration lines were injected to monitor the behavior of a detector; a detector can operate with the same performance without them. The calibration lines correspond to:

– Hanford O1 and O2: 1083.7, 331.9000, 37.3000, 36.7000 and 35.9000 Hz
– Livingston O1: 1083.1000, 331.3000, 35.3, 34.7 and 33.7 Hz
– Livingston O2: 1083.1000, 331.3000, 23.9, 23.3 and 22.7 Hz
– Virgo: 1111.001, 358.3, 357.801, 357.3, 356.302, 355.8, 87.1, 63.3, 62.801, 62.3, 60.8, 36.3, 16.8, 16.301 and 15.8 Hz

• In Hanford during O2, environmental disturbances were observed at 99.976, 35.7628, 35.7624, 35.7065, 35.7048, 29.8019 and 28.6099 Hz.

• In Virgo, Output Mode Cleaner (OMC) dither lines around 10 Hz. Like the calibration lines, these are a result of the control system; here the dither lines come from the coupling of the OMC with the laser beam. Because the dither and calibration lines are similar to sinusoids, their harmonics are not significant [5]. During O1, OMC dither lines were also present in Livingston and Hanford.

2.3 Data conditioning

When a detector measures a transient signal, the signal is mostly hidden in the observed noise. Data conditioning can remove correlations in the noise and filter out frequency ranges which are not relevant for the measured transient signal. This can result in a transient signal which is no longer hidden in the noise.

2.3.1 Whitening

The purpose of whitening is to remove all correlations in the noise, resulting in a delta-correlated data sequence [13]. Figure 2.4 shows that the amount of noise fluctuation is frequency dependent; data with this behaviour is called "colored" data. Whitening is applied by dividing the data by the noise amplitude spectrum. This is done in the Fourier domain and can be described by

$$d_w(f) \propto d(f) / \left[S_n(f)\right]^{1/2} \quad (2.10)$$

where $d_w(f)$ represents the whitened data, $d(f)$ the observed data in the Fourier domain and $S_n(f)$ the PSD. In other words, the whitening procedure normalizes the power at every frequency. The resulting time series is expressed as deviations $\sigma$ away from the mean. If the original data is Gaussian and stationary, the corresponding whitened data will have zero mean and unit variance.
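A minimal numpy/scipy sketch of Equation 2.10 (the PSD estimate and the data are placeholders, not the thesis's actual pipeline):

```python
import numpy as np
from scipy.signal import welch
from scipy.interpolate import interp1d

def whiten(strain, fs):
    """Divide the data by the noise amplitude spectrum in the Fourier domain (Eq. 2.10)."""
    freqs = np.fft.rfftfreq(len(strain), d=1.0 / fs)
    # Estimate the PSD from the data itself; in practice a separate noise stretch is used.
    f_psd, psd = welch(strain, fs=fs, nperseg=4 * fs)
    asd = interp1d(f_psd, np.sqrt(psd), bounds_error=False, fill_value=np.inf)(freqs)
    white_fd = np.fft.rfft(strain) / asd
    white = np.fft.irfft(white_fd, n=len(strain))
    return white * np.sqrt(2.0 / fs)  # rescale so stationary Gaussian noise gets unit variance

fs = 4096
data = np.random.default_rng(1).standard_normal(fs * 16)
print(np.std(whiten(data, fs)))       # close to 1, as expected for whitened Gaussian noise
```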

2.3.2 Bandpass and notch filter

The goal of a bandpass filter is to select a certain frequency range: for a given range, the filter suppresses the power outside the range and keeps the power inside it. This makes it possible to focus on a specific frequency band. It is common to combine a bandpass filter with a series of notch (band-reject) filters, whose goal is to remove the strong instrumental spectral lines. For example, notch filters can be used to remove the power line harmonics shown in Figure 2.4.
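A minimal scipy sketch of such a filter chain; the passband and notch frequencies are illustrative choices (e.g. a 20–500 Hz band with 60 Hz power-line notches for a LIGO-like spectrum):

```python
import numpy as np
from scipy.signal import butter, filtfilt, iirnotch

fs = 4096                                    # sample rate in Hz (illustrative)
data = np.random.default_rng(2).standard_normal(fs * 8)

# Butterworth bandpass: keep 20-500 Hz, suppress the power outside this range.
b_bp, a_bp = butter(4, [20, 500], btype="bandpass", fs=fs)
filtered = filtfilt(b_bp, a_bp, data)        # filtfilt gives zero-phase filtering

# Notch out the 60 Hz power line and its first two harmonics.
for f0 in (60.0, 120.0, 180.0):
    b_n, a_n = iirnotch(f0, Q=30.0, fs=fs)
    filtered = filtfilt(b_n, a_n, filtered)
```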


2.4 Glitches

Glitches are problematic for GW astronomy for several reasons, the most important being: (1) They reduce the sensitivity of matched filter searches by increasing the rate of random coincidences, since glitches can mimic the waveforms of compact binary coalescences (CBC). The glitch classes blip, koi fish and scattered light are among the most troublesome for the PyCBC search [15]. (2) The overall noise becomes non-Gaussian in the presence of a glitch. (3) The data quality decreases due to the presence of glitches. In this section we briefly describe the blip, koi fish, scattered light and whistle glitch classes.

2.4.1 Blip

Blip glitches are in general among the most problematic glitches. They are characterized by a very short duration (about 10 ms), a large frequency bandwidth (of the order of 100 Hz) and one (sometimes two) sharp peaks. Advanced LIGO observed on average two blip glitches per hour of observing during the first and second observing runs [7]. Blip glitches are one of the noise sources limiting GW searches for high-mass binary black hole mergers [2]. Figure 2.5 shows the time-frequency representation of a blip glitch.

Figure 2.5: Time-frequency representation of a blip glitch observed by Livingston at GPS time 1177037747.186. Source: Gravity Spy database [79].

Blip glitches are problematic for high-mass binary black hole mergers because their shape is characterized by a sharp peak in a very short time window. Figure 2.6 shows the similarity between a CBC waveform template and a blip glitch detected in the observed data.

Figure 2.6: Time series of a CBC waveform template and filtered data around a blip glitch. Both have been filtered with bandpass and notch filters [15].


As Figure 2.6 shows, blip glitches can mimic the GW of a high-mass binary black hole merger. It is very unlikely that blip glitches have an astrophysical origin, because the glitches are uncorrelated in time between the detectors. In [7], a correlation was found between four subsets of blip glitches and environmental sensor or detector auxiliary channels. The four subsets corresponded to the channels: power recycling control signals, laser intensity stabilisation, computer errors and low relative humidity inside the building [7]. In that work, the trigger times of blip glitches observed by advanced LIGO during O1 and O2 were investigated; only a small fraction (a few percent) of the blip glitches observed during O2 were correlated with an auxiliary channel.

It is very hard to identify the sources of blip glitches. One reason is that there are many potential physical mechanisms which can produce a blip glitch, and sensors are unable to record these processes at the required bandwidth. Unfortunately, the origin of the majority of the blip glitches observed during O1 and O2 remains unknown.

2.4.2 Koi fish

Glitches classified as koi fish have some morphological similarities to blip glitches. Figure 2.7 shows a time-frequency representation of a koi fish glitch; the similarities with a blip glitch can be seen by comparing Figure 2.7 with Figure 2.5.

Figure 2.7: Time-frequency representation of a koi fish glitch observed by Livingston at GPS time 1174030978.397. Source: Gravity Spy database [79].

The main difference between the morphologies is that a koi fish glitch contains a noticeable ringdown. Koi fish glitches are, just like blip glitches, problematic for high-mass binary black hole mergers; again the reason is the very small time window in which the glitch happens and the sharp peak. The templates which trigger on koi fish glitches also most likely have maximal (aligned and anti-aligned) spins [15]. The name koi fish was given to this glitch class because the time-frequency representation of a koi fish glitch can look like a real koi fish.

2.4.3 Scattered light

Figure 2.8 shows the clear arch-like morphology of a scattered light glitch. The multiple arches in the time-frequency representation are caused by multiple reflections of the scattered light between the scatterer and the test mass.


Figure 2.8: Time-frequency representation of a scattered light glitch observed by Livingston at GPS time 1166339307.688. Source: Gravity Spy database [79].

The waveform templates which pick up the most scattered light triggers are characterized by a short duration and a highly aligned spin [15]. These templates can match the morphology of a scattered light glitch due to a 'hang up' effect [39]. In general, the parameters of the templates with significant responses to scattered light triggers vary a lot. This is probably due to the fact that scattered light glitches have differing morphologies, caused by the large number of potential scattered light noise sources in the detector.

Scattered light glitches are caused by low-frequency motion that makes the test masses move. The surfaces of the test masses have very tiny imperfections; it is impossible to create perfect test masses at the atomic scale. Due to these imperfections, a small amount of light scatters out of the laser beam at the test masses. The scattered light is sometimes reflected back to the test masses by surfaces which have a large relative motion compared to the test masses. This scattered light recombines with the laser beam, which results in transient noise in the observed data. The amplitude and upper frequency of this transient noise depend on the amount of scattered light which recombines and on the corresponding relative motion.

The scattered light which recombines with the laser beam has an additional phase due to the change of path length compared to the original beam. The resulting phase noise can be described by

$$h_{ph}(f) = A\, \frac{\lambda}{8\pi L}\, \mathcal{F}\left[\sin \delta\phi(t)\right] \quad (2.11)$$

with

$$\phi(t) = \phi_0 + \delta\phi_{sc}(t) = \frac{4\pi}{\lambda}\left|x_0 + \delta x_{sc}(t)\right| \quad (2.12)$$

where $A$ represents the fraction of the field which recombines with the laser beam, $\lambda$ the wavelength of the laser beam, $L$ the length of the arms, $\mathcal{F}$ the Fourier transform, $x_0$ the static path corresponding to the static phase $\phi_0$, and $\delta x_{sc}(t)$ the time-dependent displacement of the surface which scatters the light [72].

Scattered light causes fringes in the observed data if the path length changes rapidly. The peak frequency of the arches observed in many scattered light glitches (also shown in Figure 2.8) gives information about the fringe frequency and the velocity of the scatterer. The relationship between them can be described by

$$f_{fringe}(t) = \frac{2 n\, v_{sc}(t)}{\lambda} \quad (2.13)$$

where $n$ represents the number of reflections of the scattered light before recombination, $v_{sc}$ the velocity of the scatterer and $\lambda$ the wavelength of the laser beam [72].
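For a feel of the numbers in Equation 2.13 (illustrative values, not taken from the thesis): with $\lambda = 1064$ nm, a single reflection ($n = 1$) and a scatterer moving at $v_{sc} = 10\ \mu\mathrm{m/s}$,

$$f_{fringe} = \frac{2 \times 1 \times 10^{-5}\,\mathrm{m/s}}{1.064\times 10^{-6}\,\mathrm{m}} \approx 19\ \mathrm{Hz},$$

so even micrometer-per-second scatterer motion is enough to place fringe arches inside the detection band.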


2.4.4 Whistle

This glitch class is named whistle (or Radio Frequency Beat Notes) because the detector noise at the trigger of a whistle glitch sounds like a whistle. Figure 2.9 shows a time-frequency representation of a whistle glitch. The energy distribution of a whistle glitch is characterized by a 'w' shape (shown in Figure 2.9) or a 'v' shape.

Figure 2.9: Time-frequency representation of a whistle glitch observed by Livingston at GPS time 1177442556.471. Source: Gravity Spy database [79].

Whistles are caused by radio frequency beats: radio frequency signals interfering and beating with the Voltage Controlled Oscillators (VCOs) [30]. At LIGO, for example, VCOs are placed in control loops throughout the detector. It is difficult to pinpoint the origin of a beat note, because radio frequency pickup can happen anywhere along the wiring. More information about the sources of whistle glitches can be found in [30].


Chapter 3

Deep Learning

3.1 Introduction to neural networks

3.1.1 Biological neuron vs artificial neuron

The general idea of an artificial neural network (ANN) is based on the biological neural network, which plays an important role in the brain. The brain contains a large number of connected neurons, which are used to transmit large amounts of information. Depending on the type of information, different neurons are activated; the set of neurons activated for a specific type of transmitted information is called a neural circuit. Biological neurons are connected by electrical and chemical synapses. The artificial neuron is a mathematical function which tries to mimic the biological neuron; we can compare the two side by side:

Figure 3.1: Left: Schematic figure of the biological neuron. Right: Schematic overview of an artificial neuron. [74]

The soma, dendrites, synapses and axon of the biological neuron are modeled by, respectively, the node, inputs, weights and output of the artificial neuron. The artificial neuron is the basic building block of every artificial neural network. The basic workflow of an artificial neuron consists of the following steps (a code sketch follows after this list):

• Each input $x$ of the neuron is multiplied by a specific weight $w$. Because each neuron has its own weights, we can define a vector $\mathbf{w} = (w_1, w_2, \ldots, w_n)$, with $n$ the number of inputs; similarly, we can write the inputs as the vector $\mathbf{x} = (x_1, x_2, \ldots, x_n)$.

• The weighted inputs are summed and a new parameter, the bias, is added. The bias is an extra learnable parameter which modifies the output. This step can be written mathematically as $\sum_{j=1}^{n} w_j x_j + b$, where $x$ represents the input, $w$ the weight and $b$ the bias.

• The activation (transfer) function $F$ is applied to the output of the previous step. The activation function gives the output of a neuron a non-linear character; a full description of activation functions can be found in Section 3.1.3. The whole process can be described as

$$y_i = F\left(\sum_{j=1}^{n} w_j x_j + b\right) \quad (3.1)$$

Here $y_i$ is the output of neuron $i$ with input $x$, weights $w$, bias $b$ and activation function $F$. Equation 3.1 shows that the bias shifts the value to which the activation function is applied; in other words, the bias shifts the activation function. The role of the bias is equivalent to the role of a constant in a non-linear function.
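A minimal numpy sketch of Equation 3.1; the input, weights and bias below are arbitrary illustrative values:

```python
import numpy as np

def neuron(x, w, b, activation):
    """A single artificial neuron: weighted sum of inputs plus bias, then activation (Eq. 3.1)."""
    return activation(np.dot(w, x) + b)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])    # inputs
w = np.array([0.1, 0.4, -0.2])    # one weight per input
b = 0.05                          # the bias shifts the activation input
print(neuron(x, w, b, sigmoid))   # output between 0 and 1
```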

3.1.2 Feedforward neural network

The most common and basic example of a neural network is the feedforward neural network, also called the multilayer perceptron (MLP). The signal in a feedforward neural network can only travel forwards. The primary goal of an MLP is classification, although there are other applications in which MLPs succeed, for example modelling and controlling dynamic systems [55]. An MLP consists of three kinds of layers: the input layer, at least one hidden layer, and the output layer. Each layer consists of one or more artificial neurons, also called nodes. Figure 3.2 shows an example of an MLP architecture.

Figure 3.2: Example of a typical MLP architecture with one hidden layer. The circles represent the neurons and the lines between the neurons represent the corresponding connections. [58]

Figure 3.2 shows that a neuron is connected to every neuron in the next and previous layers, where these exist. A connection between two neurons means that the output of one neuron is one of the input values of the other. Figure 3.2 also shows that the number of neurons can differ per layer. Every neuron in the MLP applies Equation 3.1; the only difference between the computations of the neurons is that they can use different activation functions, described in the next subsection.

In general, a neural network is called a deep neural network if its architecture consists of more than one hidden layer.
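As a sketch of how these layers compose, a forward pass through a small MLP in numpy (the layer sizes and random weights are arbitrary; in practice the weights are learned):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Input layer (3 features) -> hidden layer (4 neurons) -> output layer (1 neuron).
W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)
W2, b2 = rng.standard_normal((1, 4)), np.zeros(1)

def mlp_forward(x):
    h = relu(W1 @ x + b1)         # every hidden neuron applies Eq. 3.1 with ReLU
    return sigmoid(W2 @ h + b2)   # the output neuron applies Eq. 3.1 with sigmoid

print(mlp_forward(np.array([0.5, -1.2, 3.0])))  # probability-like output in (0, 1)
```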

3.1.3 Activation functions

The activation function plays an important role in a neural network: applying activation functions gives the network the ability to solve non-linear problems. To achieve this, the activation functions must be continuous and non-linear. An activation function also has to be differentiable, because the backpropagation algorithm, explained in Section 3.1.7, calculates derivatives. A neural network with only linear activation functions can solve only linear problems. An activation function converts the net input value of a neuron into its output value. It is very common to use one specific activation function per layer, applied to each neuron within that layer. The input of the activation function is the weighted sum of the neuron's inputs plus the corresponding bias; the activation function is the last step in the mathematical operation of a neuron. A few of the most common activation functions are:

Sigmoid

The sigmoid function is defined as

$$f(x) = \frac{1}{1 + e^{-x}} \quad (3.2)$$

By definition, the sigmoid always gives a value between 0 and 1. Sigmoid is a very common activation function for the last layer of a neural network used for (binary) classification, since it gives a probability between 0 and 1 as output.

Tanh

Tanh, also named hyperbolic tangent, is an activation function similar to the sigmoid. The main difference is that the range of tanh is $-1$ to $1$, while the range of the sigmoid is $0$ to $1$. Tanh is defined as

$$f(x) = \tanh(x) = \frac{2}{1 + e^{-2x}} - 1 \quad (3.3)$$

ReLU

Rectified linear units (ReLU) were introduced by [63], and ReLU is one of the most commonly used activation functions. It has a threshold at zero: it outputs zero for values smaller than zero, and the identity for values greater than zero. The main reason why many architectures choose ReLU over, for example, tanh is that ReLUs are non-saturating: a neural network using ReLU can train several times faster than one using tanh [45]. With ReLU, a neuron learns when there is at least one positive input to the ReLU. ReLU is defined as [54]:

$$R(z) = \max\{0, z\} = \begin{cases} z, & \text{if } z \geq 0 \\ 0, & \text{otherwise} \end{cases} \quad (3.4)$$

Leaky ReLU

The main downside of the ReLU activation function is that all negative values immediately become zero, which reduces the ability of a model to train or fit properly. The Leaky ReLU activation function attempts to solve this problem by multiplying the negative values by a small number. Leaky ReLU is defined as

$$R(z) = \max\{0.01 z,\, z\} = \begin{cases} z, & \text{if } z \geq 0 \\ 0.01 z, & \text{otherwise} \end{cases} \quad (3.5)$$
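The four activation functions above in numpy form, as a small self-contained sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))        # Eq. 3.2, range (0, 1)

def tanh(z):
    return np.tanh(z)                      # Eq. 3.3, range (-1, 1)

def relu(z):
    return np.maximum(0.0, z)              # Eq. 3.4, zero for z < 0

def leaky_relu(z, alpha=0.01):
    return np.where(z >= 0, z, alpha * z)  # Eq. 3.5, small slope for z < 0

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (sigmoid, tanh, relu, leaky_relu):
    print(f.__name__, f(z))
```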


3.1.4 Batch normalization

To improve the learning speed of a neural network, normalization of the input values is widely used: the input data is normalized to a distribution with zero mean and unit standard deviation. The same technique can also be applied to the hidden layers of a neural network; this is called batch normalization. Batch normalization makes use of the mean and variance of a mini-batch: every mini-batch is normalized at every layer using the input from the previous layer.

Batch normalization was introduced in 2015 [43]. Using batch normalization can improve the stability of a network; it can also remove the need for Dropout (explained in Section 3.1.5) and make it possible to use higher learning rates [43]. By implementing batch normalization in a neural network for image classification, [43] showed that the same accuracy could be reached with a factor of 14 fewer training steps. The idea behind batch normalization is to stabilize the input distribution of a layer and reduce the internal covariate shift. During the training of a neural network, the weights are updated after every (mini-)batch. Updating a layer changes the input distribution of the next layer; this effect is known as the internal covariate shift. It slows down the learning of the neural network, because the intermediate layers are constantly adapting to different inputs.

However, [67] showed that, viewed from an optimization point of view, batch normalization might not reduce the internal covariate shift, and that the connection between the internal covariate shift and the performance of batch normalization is very weak. The exact reasons for the improvement of a neural network due to batch normalization are not yet crystal clear, but it has been shown many times that batch normalization can increase the training speed and accuracy of a neural network. Other explanations are that batch normalization improves conditioning [65] or increases the stability of concurrent updates [35].

Batch normalization is achieved by adding an additional layer before the activation layer. In [43], the algorithm for batch normalization is described as:

Algorithm 1: Batch normalization
Input: values of $x$ over a mini-batch $\mathcal{Z} = \{x_{1\ldots m}\}$; parameters to be learned: $\gamma$, $\beta$.
Output: $\{y_i = \mathrm{BN}_{\gamma,\beta}(x_i)\}$

$$\mu_{\mathcal{Z}} \leftarrow \frac{1}{m}\sum_{i=1}^{m} x_i \qquad \text{(mini-batch mean)}$$

$$\sigma^2_{\mathcal{Z}} \leftarrow \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_{\mathcal{Z}}\right)^2 \qquad \text{(mini-batch variance)}$$

$$\hat{x}_i \leftarrow \frac{x_i - \mu_{\mathcal{Z}}}{\sqrt{\sigma^2_{\mathcal{Z}} + \epsilon}} \qquad \text{(normalize)}$$

$$y_i \leftarrow \gamma\,\hat{x}_i + \beta \equiv \mathrm{BN}_{\gamma,\beta}(x_i) \qquad \text{(scale and shift)}$$

Here $\epsilon$ is a very small constant to prevent division by zero, and $\mu_{\mathcal{Z}}$ and $\sigma_{\mathcal{Z}}$ represent the mean and standard deviation of the mini-batch. The scale and shift factors $\gamma$ and $\beta$ are treated as normal parameters which are learned by backpropagation. As a result of the normalization, $\hat{x}_i$ has zero mean and unit variance. The scale and shift parameters are introduced in order to preserve network capacity: during training they can change the zero-mean, unit-variance input distribution into another distribution which works better for the layer. This is the strength of batch normalization: it can change the input distribution into one that is more favourable to learn on.
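A direct numpy transcription of Algorithm 1 (training-mode forward pass only; $\gamma$ and $\beta$ are set to their usual initial values instead of being learned):

```python
import numpy as np

def batch_norm_forward(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch normalization of a mini-batch x of shape (m,), following Algorithm 1."""
    mu = x.mean()                           # mini-batch mean
    var = x.var()                           # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize to zero mean, unit variance
    return gamma * x_hat + beta             # scale and shift

batch = np.array([2.0, 4.0, 6.0, 8.0])
y = batch_norm_forward(batch)
print(y.mean(), y.std())                    # approximately 0 and 1
```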

3.1.5 Dropout

Dropout was introduced by [73], whose title, "Dropout: A Simple Way to Prevent Neural Networks from Overfitting", is a good summary of its purpose. With dropout, the neural network randomly drops out neurons for every batch given to the network. Dropping out a neuron means that all its connections disappear: the neuron receives no input and gives no output, and after the iteration the weights of the dropped neurons are not updated. Dropout is only applied during training; during testing, all neurons of the network are used. [73] shows that dropout can also speed up the training time.
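A sketch of dropout applied to a layer's activations in numpy. This is the "inverted dropout" variant, which rescales at training time so that no scaling is needed at test time; the drop probability is an arbitrary illustrative choice:

```python
import numpy as np

def dropout(activations, p_drop=0.5, rng=np.random.default_rng(0)):
    """Inverted dropout: zero each neuron with probability p_drop (training only)."""
    mask = rng.random(activations.shape) >= p_drop  # True = neuron is kept
    # Rescale the kept activations so the expected output matches test time.
    return activations * mask / (1.0 - p_drop)

h = np.ones(8)        # stand-in for the outputs of a hidden layer
print(dropout(h))     # roughly half of the entries zeroed, the rest scaled up
```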

3.1.6 Loss functions

Loss functions play an important role in optimizing (deep) neural networks. During training, a neural network is optimized by minimizing some function; this function is called the loss function and can differ between neural networks. There is a variety of loss functions, and some are better suited for certain models than others. In this section we describe the loss functions which are related to our model.

Information theory and (binary) cross entropy

Before we describe the loss functions, we first introduce a small and very basic part of information theory in order to give a better understanding of the background.

Self-information, also called Shannon information, is defined as [35]

$$I(x) = -\log P(x) \qquad (3.6)$$

Equation 3.6 measures the amount of information regarding the event X = x with probability P(x). The more likely an event, the less information content is measured. The intuitive explanation is: information about an unlikely event is more valuable than the same information coming from a likely event. We cannot extract any information about uncertainty from Equation 3.6. Next, we focus on entropy, which is a measurement of the uncertainty of a variable [12].

Shannon entropy measures the average information content of a probability distribution. It calculates the expected information content of an event drawn from that distribution. It is defined as [35]:

$$H(x) = \mathbb{E}_{x\sim P}[I(x)] = -\mathbb{E}_{x\sim P}[\log P(x)] = -\sum_{x\in X} P(x)\log P(x) \qquad (3.7)$$

X represents the set of possible values x of the discrete random variable X. Shannon entropy is also denoted as H(P). The units of entropy are named bits and nats for log base 2 and base e respectively. The Shannon entropy is basically a summation of the self-information over all the values of X, weighted by their corresponding probabilities. The summation sign is replaced by an integral when using continuous variables instead of discrete variables. Since we are dealing with discrete variables in our problem, we will focus on the discrete case.

The cross entropy between probability distributions P and Q is defined as:

$$H(P, Q) = -\mathbb{E}_{x\sim P}[\log Q(x)] = -\sum_{x\in X} P(x)\log Q(x) \qquad (3.8)$$

where P(x) represents the probability of x in P and Q(x) the probability of x in Q. In a binary classification problem, it is common to use the binary cross entropy (BCE) loss function. The neural network outputs a value between 0 and 1 which represents the probability of the input x being labeled as 1. In that case there are only two classes (fake and real) and Equation 3.8 can be written as

$$H(P, Q) = -\left(P(x_0)\log Q(x_0) + P(x_1)\log Q(x_1)\right) \qquad (3.9)$$

where x0 represents a sample from class 0 and x1 a sample from class 1. For example, P can represent the target distribution and Q the approximation of this distribution. Equation 3.9 is the definition of the BCE.
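The following sketch evaluates Equations 3.7 and 3.9 numerically. The helper names are ours, and natural logarithms are used, so the results are in nats.

import numpy as np

def shannon_entropy(p):
    # Shannon entropy H(P) of Equation 3.7; terms with p = 0
    # contribute zero by the usual convention 0 log 0 = 0.
    p = np.asarray(p, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log(p[mask]))

def binary_cross_entropy(y, q, eps=1e-12):
    # BCE of Equation 3.9 between target labels y in {0, 1} and
    # predicted probabilities q, averaged over the samples.
    q = np.clip(q, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y * np.log(q) + (1 - y) * np.log(1 - q))

print(shannon_entropy([0.5, 0.5]))  # log 2 ~ 0.693, maximal for two outcomes
y = np.array([1, 0, 1, 1])
q = np.array([0.9, 0.2, 0.7, 0.6])
print(binary_cross_entropy(y, q))   # small when q matches the labels well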

Kullback-Leibler (KL) Divergence

The KL divergence, also known as relative entropy, is a mathematical expression which calculates the difference between the cross entropy of P and Q and the entropy of distribution P. It can be expressed as:

$$D_{KL}(P\|Q) = H(P, Q) - H(P) = \mathbb{E}_{x\sim P}\left[\log\frac{P(x)}{Q(x)}\right] = \mathbb{E}_{x\sim P}[\log P(x) - \log Q(x)] \qquad (3.10)$$

The KL divergence expresses the dissimilarity between two distributions P and Q. It is non-symmetric and non-negative. Due to this asymmetry, the distance measured by the KL divergence cannot be interpreted as a true distance measure. We can also express the cross entropy in terms of the KL divergence by rewriting Equation 3.10:

$$H(P, Q) = H(P) + D_{KL}(P\|Q) \qquad (3.11)$$

For the purpose of training a neural network, the cross entropy is often minimized. Minimizing the cross entropy with respect to distribution Q can also be expressed as minimizing the KL divergence. The reason is that H(P) only depends on distribution P.

Jensen-Shannon (JS) Divergence

The Jensen-Shannon (JS) divergence is another measure of the dissimilarity between two distributions, built from the KL divergence. The main differences are that the JS divergence is symmetric and the KL divergence is asymmetric. Also, contrary to the KL divergence, a distance metric can be obtained by taking the square root of the JS divergence. The JS divergence is defined as [78]

$$D_{JS}(P\|Q) = \frac{1}{2}D_{KL}(P\|M) + \frac{1}{2}D_{KL}(Q\|M), \quad \text{with } M = \frac{1}{2}(P + Q) \qquad (3.12)$$
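Both divergences are straightforward to evaluate for discrete distributions. The sketch below (helper names are ours) illustrates the asymmetry of Equation 3.10 and the symmetry of Equation 3.12:

import numpy as np

def kl_divergence(p, q):
    # D_KL(P||Q) of Equation 3.10 in nats; assumes q > 0 wherever p > 0.
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js_divergence(p, q):
    # JS divergence of Equation 3.12, built from two KL terms
    # to the mixture M = (P + Q) / 2.
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.8, 0.15, 0.05])
print(kl_divergence(p, q), kl_divergence(q, p))  # asymmetric: values differ
print(js_divergence(p, q), js_divergence(q, p))  # symmetric: values match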

Wasserstein distance

The Wasserstein distance metric, also known as the earth mover's distance, defines a ‘distance’ between two (stochastic) distributions and arises from the theory of Optimal Transport. The idea of this theory is that a distribution can be described by probability mass, and changing the (shape of the) distribution is done by moving mass. The Wasserstein distance is the minimum cost of moving mass in order to match a given distribution with a target distribution. Figure 3.3 illustrates the idea of Optimal Transport.

Figure 3.3: Left: the distance between two distributions based on their coordinates (the standard way). Right: the distance between two distributions based on the theory of Optimal Transport; search for the best transport map T which can transport the blue distribution into the red distribution. Here T represents the energy needed for the transport. [66]

Figure 3.3 shows that, from the theory of Optimal Transport, the difference between two distributions depends on the value of the transport map T and is independent of the difference in coordinates. A smaller T indicates more similarity between the two distributions. The Wasserstein distance can be described as the value of the best transport map T. By minimizing the total Wasserstein distance it is possible to transform a prior distribution into an approximation of the real or target distribution. The p-th Wasserstein distance is defined as [71]

$$W_p(P, Q) = \left( \inf_{\mu \in \Gamma(P, Q)} \int \rho(x, y)^p \, d\mu(x, y) \right)^{1/p} \qquad (3.13)$$

where P and Q represent probability measures on M with $\int \rho(x, y)^p \, dP(x) < \infty$ for all y ∈ M. Here we have the metric space (M, ρ) with ρ(x, y) the distance function on the set M. Γ(P, Q) represents all the joint distributions with marginals P and Q, and µ(x, y) describes the amount of mass to move from location x to location y while maintaining the marginal constraints x ∼ P and y ∼ Q. When M is separable, the Wasserstein distance can also be defined as [71]

$$W_1(P, Q) = \sup_{\|f\|_L \leq 1} \mathbb{E}_{x\sim P}[f(x)] - \mathbb{E}_{x\sim Q}[f(x)] \qquad (3.14)$$

where $\|f\|_L$ represents the Lipschitz semi-norm, defined as $\|f\|_L = \sup_{x \neq y} |f(x) - f(y)| / \rho(x, y)$.
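For one-dimensional samples, the W1 distance of Equation 3.13 (p = 1) can be evaluated directly from the empirical CDFs, for example with SciPy; this is also the quantity that Wasserstein GANs estimate through the dual form of Equation 3.14. A small sketch, assuming scipy is available and using toy Gaussian samples of our own choosing:

import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Two one-dimensional empirical distributions: unit-width Gaussians whose
# means differ by 2, so the W1 distance between them is approximately 2.
samples_p = rng.normal(loc=0.0, scale=1.0, size=10_000)
samples_q = rng.normal(loc=2.0, scale=1.0, size=10_000)

# SciPy computes the 1-D W1 distance from the empirical CDFs of the samples.
print(wasserstein_distance(samples_p, samples_q))  # ~2.0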


3.1.7 Backpropagation

In the next section the theory behind optimizers is explained. Those optimizers use the backpropagation algorithm as a tool to achieve their goal: minimizing the loss function. The backpropagation algorithm calculates the gradient of the loss function with respect to the weights and biases. The chain rule plays an important role in the computation of those gradients. Figure 3.4 illustrates the backward pass, which is the core of backpropagation.

Figure 3.4: Left: a neuron mapping inputs x and y onto output z. Right: basic example of a backward pass. The gradient of a loss function L with respect to the output z is given by dL/dz. The gradients of L with respect to the inputs x and y, dL/dx and dL/dy, are calculated by applying the chain rule.

Backpropagation calculates all the gradients with respect to a loss function very efficiently. The optimizers then use those computed gradients to train the neural network by updating the weights.
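As a minimal numerical illustration of the backward pass in Figure 3.4 (the quadratic loss attached to z is our own choice):

# Backward pass for the toy graph z = x * y followed by a loss L = z**2.
x, y = 3.0, -2.0

# forward pass
z = x * y     # z = -6
L = z ** 2    # L = 36

# backward pass: apply the chain rule from the loss back to the inputs
dL_dz = 2 * z       # dL/dz = -12
dL_dx = dL_dz * y   # dL/dx = dL/dz * dz/dx, with dz/dx = y  ->  24
dL_dy = dL_dz * x   # dL/dy = dL/dz * dz/dy, with dz/dy = x  -> -36
print(dL_dx, dL_dy)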

3.1.8 Optimizers

Every neural network uses a certain optimizer to minimize the loss function. A loss function is minimized by using the gradients obtained from the backpropagation algorithm: the weights of the neural network are updated by subtracting a small part of the corresponding gradients from the current weights. Taking too small or too large steps results in a bad learning process. The optimizer takes care of this problem and tries to achieve the best learning process possible.

The optimizers can be divided into two categories: optimizers using standard stochastic gradient descent (SGD) and optimizers using adaptive learning rate strategies. Optimizers using adaptive learning rate strategies were introduced more recently than optimizers based on standard SGD. The main difference between them is that SGD updates every individual parameter with the same learning rate, whereas optimizers using adaptive learning rate strategies update parameters with different learning rates. This can benefit the learning process of the model, because the importance of the information in the gradient can differ between parameters. Using adaptive learning rate strategies, more important parameter updates can take bigger steps than less important parameter updates. We will briefly describe the optimizers which are important to our research. All the optimizers described below use adaptive learning rate strategies.

RMSProp

The idea behind Root Mean Square Propagation (RMSProp) was proposed in 2012 during a lecture by G. Hinton [76]. RMSProp was introduced in order to solve the vanishing gradients problem. The proposition was to normalize the gradient with a moving average of squared gradients, resulting in an increased step size for small gradients and a decreased step size for large gradients. This keeps the step size more balanced and avoids exploding or vanishing gradients. Each update is done using [14]

$$v_t = \alpha v_{t-1} + (1 - \alpha)(\nabla f)^2 \qquad (3.15)$$

$$\theta_t = \theta_{t-1} - \frac{\epsilon\, \nabla f(\theta_{t-1})}{\sqrt{v_t} + \lambda} \qquad (3.16)$$

where vt is the exponential average of squared gradients, ∇f is the gradient at time t along each parameter, α is the decay rate with 0 < α < 1, ε is the initial learning rate and λ is a damping factor. It is common to use λ = 1e-10 and α = 0.9, but these values can be tuned. Parameter θt is obtained by subtracting the learning step, given by the second term of Equation 3.16, from the previous value θt−1.
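A single RMSProp update of Equations 3.15-3.16 can be sketched as follows; the function name and the toy minimization loop are only illustrative:

import numpy as np

def rmsprop_step(theta, grad, v, lr=1e-3, alpha=0.9, lam=1e-10):
    # One RMSProp update: normalize the gradient by a moving average
    # of its square, shrinking large steps and boosting small ones.
    v = alpha * v + (1.0 - alpha) * grad ** 2       # Equation 3.15
    theta = theta - lr * grad / (np.sqrt(v) + lam)  # Equation 3.16
    return theta, v

theta, v = np.array([5.0]), np.zeros(1)
for _ in range(1000):
    grad = 2 * theta  # gradient of the toy loss f(theta) = theta**2
    theta, v = rmsprop_step(theta, grad, v, lr=0.05)
print(theta)  # close to the minimum at 0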

ADAM

Adaptive Moment Estimation (ADAM) is a method for stochastic optimization, introduced in 2014 by [44]. It has recently become very popular for deep learning problems. Using the ADAM algorithm, each update is done with

$$v_t = \beta_1 v_{t-1} + (1 - \beta_1)\nabla f \qquad (3.17)$$

$$s_t = \beta_2 s_{t-1} + (1 - \beta_2)(\nabla f)^2 \qquad (3.18)$$

$$\theta_t = \theta_{t-1} - \frac{\epsilon\, v_t}{\sqrt{s_t} + \lambda} \qquad (3.19)$$

where vt is the exponential average of gradients along the parameters, st is the exponential average of squared gradients along the parameters, β1 and β2 are hyperparameters, ∇f is the gradient at time t along each parameter, ε is the initial learning rate and λ is a damping factor.

The algorithm of ADAM is partly similar to the algorithm of RMSProp. ADAM also calculates the exponential (decaying) average of the past squared gradients and uses it in the calculation of the step size of each parameter in an update. Additionally, ADAM also uses the exponential (decaying) average of the past gradients in this calculation of the step size. A significant difference is that RMSProp with momentum generates its updates using a momentum on the rescaled gradient, whereas ADAM updates are estimated using a running average of the first and second moment of the gradient [44]. More details can be found in [44].
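For comparison with the RMSProp sketch above, a single ADAM update following Equations 3.17-3.19 can be sketched as below; the bias-correction terms of the full algorithm in [44] are omitted for brevity, and the names are illustrative:

import numpy as np

def adam_step(theta, grad, v, s, lr=1e-3, beta1=0.9, beta2=0.999, lam=1e-8):
    # One ADAM update: keep exponential averages of both the gradient
    # (first moment) and its square (second moment).
    v = beta1 * v + (1.0 - beta1) * grad         # first moment, Eq. 3.17
    s = beta2 * s + (1.0 - beta2) * grad ** 2    # second moment, Eq. 3.18
    theta = theta - lr * v / (np.sqrt(s) + lam)  # Eq. 3.19
    return theta, v, s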

AdaMax

AdaMax is an extension of ADAM, also introduced by [44]. In this algorithm, the L2 norm used in the ADAM algorithm is replaced with an L∞ norm. This means that the square in Equation 3.18 is replaced with an infinity term. After some calculations, which are shown in detail in [44], Equation 3.18 can be written as

$$s_t = \max(\beta_2 \cdot s_{t-1}, |\nabla f|) \qquad (3.20)$$

In general, AdaMax is more numerically stable and can handle noise in the gradient updates better than the original ADAM algorithm. This motivates the use of AdaMax over ADAM in some studies.
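In code, the only change with respect to the ADAM sketch above is the second-moment rule; a sketch with an illustrative helper name:

import numpy as np

def adamax_second_moment(s_prev, grad, beta2=0.999):
    # The L-infinity update of Equation 3.20 that AdaMax substitutes
    # for Equation 3.18; the rest of the update follows ADAM.
    return np.maximum(beta2 * s_prev, np.abs(grad))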

3.2 Different types of deep neural networks

There are different types of deep neural networks. The most basic type is the MLP, which was described earlier in this chapter. Another example is a Recurrent Neural Network (RNN), which contains loops: the output of a layer is saved and fed back as input in a next learning step. RNNs are suited for problems related to text and audio. In this section we will focus on convolutional neural networks and generative adversarial networks, which are the most suitable for our problem.
