
MSc Physics and Astronomy

Track: Gravitation and Astroparticle Physics (GRAPPA)

Master’s Thesis

Machine Learning for Gravitational Lensing

Gaussian Processes on Graphics Processors:

Searching for Substructure in Strong Lenses

by

Konstantin Karchev

12286834

August 2020

60 ECTS September 2019 - July 2020

Supervisor:

Dr Christoph Weniger

Second Examiner:

Dr Shin’ichiro Ando


Abstract

A new framework for strong gravitational lensing analysis based on variational inference is presented, aimed at the discovery and statistical characterisation of dark matter substructure, whose abundance serves as an indicator of the nature of dark matter. It leverages automatic differentiation and graphics processing support to efficiently handle images such as may be produced by the next generation of instruments. Specific emphasis is placed on the efficient treatment of dark matter substructure and inference of the parameters of the subhalo mass function (SHMF) with a possible low-mass cutoff, from which the mass of warm dark matter particles can be constrained. A new model for the source light based on multiple layers of Gaussian processes (GPs) is developed. Correlations within the projected pixel areas are modelled using approximate "window functions", allowing sub-pixel reconstruction of the sources in multiple-image systems. The variances of the layers are optimised variationally, enabling straightforward statistical treatment of the source uncertainties and their marginalisation. The framework is applied to observations of the lensing system SDSSJ0252+0039, obtaining sensitivity down to 10^8.8 M⊙ for NFW subhaloes (10^7.7 M⊙ for the widely used pseudo-Jaffe density profile), similar to previous studies. A fit to a mock of an observation made with the upcoming Extremely Large Telescope (ELT) is also presented, demonstrating the efficient statistical treatment of source irregularities. The expected improvement in substructure mass sensitivity is at least half an order of magnitude over Hubble images (for a conservative ELT specification of 12.5 mas/px and comparable signal-to-noise).

Acknowledgements

I would like to extend my gratitude first and foremost to Dr Christoph Weniger for his encouragement, understanding, and tireless supervision and for coming up with way more ideas than can fit in a thesis or a 30-minute presentation; to Adam Coogan and the undark group for their contributions in code and discussions; to the boys at the office for maintaining a healthy work-play balance; to the rest of GRAPPA for a brilliant research environment even with limited live contacts; to the Amsterdam Science Talent Scholarship for providing the incentive to come to UvA; to the scholars for unforgettable experiences; and to the lowlands for the mathematical horizon and the sunsets.

This research made use of the SURFsara and DAS5 computing clusters and the coffee (and hot chocolate) machine at the Nikhef spectrum.


[Cover figure: convergence (2-comp. Gaussian), foreground light, source plane, image plane, source layers (σ₁ = 0.01 px, σ₂ = 1 px, σ₃ = 5 px), total source light, final image reconstruction.]


Contents

1 Introduction
  1.1 Dark matter
    1.1.1 Substructure and dark matter models
  1.2 (Strong) Gravitational lensing
    1.2.1 Thin lens formalism
  1.3 Previous work and motivation

2 Model Overview
  2.1 Variational Inference
  2.2 Main lens
  2.3 Substructure
    2.3.1 Smoothed substructure
  2.4 (Analytic) Sources
  2.5 Instrument

3 Source Modelling Using Gaussian Processes
  3.1 Overview of Gaussian processes
  3.2 Windowing
    3.2.1 Windowed covariance
  3.3 Decomposition and reparametrisation
    3.3.1 Windowed transmission matrices
  3.4 Variational hyperparameter optimisation and sampling
  3.5 Summary: the unit prior windowed Gaussian process source

4 Results
  4.1 Mock ELT observation
    4.1.1 Reconstruction
    4.1.2 Subhalo mass sensitivity
  4.2 Application to real data: the SLACS lens SDSSJ0252+0039

5 Discussion, future prospects, and conclusion

Appendices
  A.1 Lensing potential
  A.2 Elliptical coordinates
  A.3 Fast substructure calculation
  A.4 Some useful identities
  A.5 Decomposition derivation
  A.6 Windowing normalisation derivation
  A.7 Sampling using guided source parameters
  A.8 An alternative approach to a variational GP
  A.9 Posteriors


“We demand rigidly defined areas of doubt and uncertainty!”

Vroomfondel to Deep Thought


Chapter 1

Introduction

1.1 Dark matter

Dark matter (DM), as a hypothetical component of the Universe which interacts with particles of the Standard Model (SM: the "light sector") predominantly gravitationally and only very weakly (if at all) through other means, is now so well established a concept in modern science that its history can fill a 70-page review article (Bertone & Hooper, 2018). Initial hints of its existence cropped up on various scales of astronomical observations: from the rotation curves of galaxies (Roberts, 1966; Rubin & Ford, 1970) to the kinematics of clusters (Zwicky, 1937), while conclusive evidence of its "non-baryonic" character was provided on the grandest scale by the cosmic microwave background (CMB) (Planck Collaboration, 2016), the cosmic abundance of light elements (Cyburt, 2004), and by the lucky occurrence of the famous Bullet cluster (Markevitch et al., 2004).

Understanding the nature of dark matter is a primary quest in modern physics. While theoretical efforts are directed at the channels for its production at early cosmic times (e.g. from a decaying scalar field (Farrar & Peebles, 2004)) and have yielded numerous candidates, numerical simulations examine the structures it forms and its gravitational interaction with known particles (Springel et al., 2005; Schaye et al., 2015; Vogelsberger et al., 2014) (and, in recent years, self-interactions (Robertson et al., 2019)). Meanwhile, a plethora of techniques has been devised to detect and characterise dark matter, which fall into three broad categories.

Direct detection methods rely on observing the results of interactions between DM candidates and "ordinary matter" in an experimental setup, which requires perfect modelling of all other effects in the experiment. Out of numerous projects, most (with the notable exceptions of DAMA/LIBRA (Bernabei et al., 2010) and XENON1T (XENON Collaboration, 2020)) have presented null results.

Indirect detection techniques, instead, are aimed at the discovery of celestial signals, either in the form of radiation (usually gamma rays) or particles, originating from the decay or annihilation of DM particles. Here, similarly, one needs to account for all possible "ordinary" explanations of a signal before claiming detection. Notable examples include the galactic centre GeV excess¹ (Abazajian & Kaplinghat, 2012) and the 3.5 keV line (Boyarsky et al., 2014).

Finally, astrophysical methods rely on the basic characteristic of dark matter (and the one that is best understood): its gravitational interaction. Since the discovery of gravitational waves, there has been hope of detecting the imprint of DM clusters around compact objects in merger signals (Kavanagh et al., 2020). More commonly, one relies on derivative observables obtained from simulations, like the density profile of galaxies², or the abundance of DM structures at different mass scales. The latter has been probed using observations of the "Ly-α forest" in the spectra of high-redshift galaxies and quasars (Viel et al., 2005), by looking for perturbations in the motions of stars in "stellar streams" around the Milky Way (Banik et al., 2019), and by using gravitational lensing.

¹ which has more or less been refuted (Bartels et al., 2018), with only some speculations remaining (Leane & Slatyer, 2019)


1.1.1 Substructure and dark matter models

The main observable of interest in gravitational lensing searches for dark matter is the abundance of DM substructure³, because it is a major differentiating factor between different models for the nature of dark matter.

According to the most basic cold dark matter (CDM) model, cosmic structure was formed hierarchically, with the smallest bound objects emerging the earliest from primordial density perturbations (Peebles, 1982). The general predictions of the CDM model are the scale invariance of the power spectrum below and its strong suppression above a scale of ∼ 1 Mpc, corresponding to the largest structures that have had time to form already. The implications of this model for the abundance of DM structures were developed analytically⁴ by Press & Schechter (1974) and have since been improved upon by countless numerical simulations.

CDM models predict a subhalo mass function (SHMF), i.e. a number density per mass interval, that can be approximated as (Giocoli et al., 2010)

\frac{\mathrm{d}n(M \,|\, M_0, \mathrm{CDM})}{\mathrm{d}\ln(M/M_0)} \propto \left(\frac{M}{M_0}\right)^{-\alpha} \exp\!\left[-\beta \left(\frac{M}{M_0}\right)^{3}\right]. \qquad (1.1)

Here M is the subhalo mass, and M₀ is the mass of its "parent" structure. The overall normalization, which is indicative of the total amount of substructure, and the parameter β, which controls the high-mass cutoff, evolve in time; they, and even the exponent 3, vary in different parts of the cosmic web, are influenced by the presence of baryons, and also depend on the exact definition of "subhaloes" and the particular numerical setup. The slope α, however, is consistent through time, space, and local environment and has a value⁵ α ≈ 0.9 (Comparat et al., 2017; Bullock & Boylan-Kolchin, 2017; Despali & Vegetti, 2017; Richings et al., 2020).

Given that the total mass of (1.1) (integrated from M = 0) is nearly divergent, it is obvious that it needs to be modified by another suppression at the low-mass end.⁶ The most straightforward way to physically motivate this is by assuming that dark matter consists of particles that were formed thermally in the early universe (hence the reference to temperature in the name) (Steigman, 1979). This would imply a certain thermal velocity, which would prevent haloes smaller than a free-streaming scale⁷ from collapsing, effectively "erasing" small-scale perturbations.

For thermal relic dark matter the free-streaming scale depends only on the particle mass, which determines the time of decoupling and the speed of the particles at the moment when structure formation began (the moment of "matter-radiation equality"). A lower mass would imply later decoupling and higher speeds (i.e. remaining relativistic), which would lead to the "erasure" of larger structures. To quantify this effect one usually calculates the half-mode mass M_hm at which the initial power spectrum is suppressed by a factor of two with respect to CDM. Possibly the most widely considered candidates for dark matter are the (supersymmetric) weakly interacting massive particles (WIMPs). In order to account for the observed DM

³ roughly defined as more or less bound overdensities of dark matter with subgalactic size and embedded within larger structures; also called (sub)haloes: a term we will be (unsuccessfully) trying to avoid so as to also avoid the long discussions on the topic of what exactly a "halo" is (see e.g. Despali et al. (2016) and appendix A.10).
⁴ Instead of dark matter, Press & Schechter considered simply a "cold 'gas' in a Friedmann cosmology", and their predictions were tested by a "1000-body" simulation.
⁵ The analytical treatment of Press & Schechter derives a slope α = −1/2 from a scale-invariant initial power spectrum. However, in the non-linear regime of structure formation the mass function is modified, and so the quoted value is appropriate for later times. It is also close to the special value α = 1 for which different mass scales contribute similar total mass (per logarithmic bin).
⁶ Indeed, the observation (or not) of the Milky Way's supposed "missing satellites" (Klypin et al., 1999; Bullock & Boylan-Kolchin, 2017) seemed to provide observational support for such a suppression. Nowadays, arguments invoking the complicated physics of baryons and completeness corrections are preferred (Kim et al., 2018).
⁷ A rough estimate for the minimal size R a region of density ρ filled with particles with speeds v needs to have in order to collapse can be derived by considering that it is gravitationally bound only if GρR³/R ≳ v².


abundance, while having weak-interaction-scale couplings, WIMPs require a mass around 100 GeV/c² (Jungman et al., 1996). Their half-mode mass has been estimated to be around⁸ 10⁻⁶ M⊙ (Green et al., 2004), i.e. WIMPs are essentially indistinguishable from CDM. Other particle species can have a much lower mass and still behave as cold dark matter if they were produced non-thermally: such is the case with the axion (Peccei & Quinn, 1977a,b; Weinberg, 1978; Wilczek, 1978), which, for a mass on the order of µeV/c² required to explain the strong CP problem, has a half-mode mass of around 10⁻¹⁵ M⊙ (Nambu & Sasaki, 1990), so it too acts as CDM.

On the other end of the spectrum from CDM lies hot dark matter, as an example of which the Standard Model neutrino was once proposed. However, with a mass less than 0.23 eV/c² (Planck Collaboration, 2016), neutrinos are unable to collapse into haloes at all, and so they cannot be the dark matter. Finally, DM particles of keV-scale mass are termed warm (WDM) and are still consistent with observations. Largely independently of the details of the model, the scaling of the half-mode mass with the particle mass m_X is approximately (Bullock & Boylan-Kolchin, 2017; Bode et al., 2001)

M_{\mathrm{hm}} \approx 2 \times 10^{10}\, M_\odot \cdot \left(\frac{m_X c^2}{1\ \mathrm{keV}}\right)^{-10/3}. \qquad (1.2)

One candidate for WDM that is well motivated from a particle-physics viewpoint is⁹ the right-handed (sterile) neutrino (see Boyarsky et al. (2019) for a comprehensive review). A variety of mechanisms for its production beyond thermal freeze-out are possible, but they all lead to similar predictions for the mass power spectrum after replacing the DM mass in (1.2) with an equivalent thermal mass m̄_X: for example, a resonantly produced sterile neutrino model consistent with the 3.5 keV line has m̄_X ≈ 2 keV/c² (Bullock & Boylan-Kolchin, 2017; Horiuchi et al., 2016).

Numerical simulations (Schneider et al., 2012; Lovell et al., 2014; Lovell, 2020) show that the suppression of small-scale power due to free streaming modifies the subhalo mass function as

\frac{n(M \,|\, M_0, M_{\mathrm{hm}})}{n(M \,|\, M_0, \mathrm{CDM})} = \left[1 + \left(a\, \frac{M_{\mathrm{hm}}}{M}\right)^{b}\right]^{c}, \qquad (1.3)

where, depending on the exact setup, halo identification algorithms, and the concrete mass definition, a, b, c can take on a plethora of fitted values. Qualitatively, the suppression is a power law for small masses and negligible above the free-streaming scale, as anticipated.¹⁰ This relation forms the bridge between observations of substructure and DM model parameters. The latest constraints, coming mainly from the Lyman-α forest and stellar streams, disfavour thermal-relic (equivalent) masses below around 5 keV/c² (Banik et al., 2019; Palanque-Delabrouille et al., 2020). Similar constraints can also be obtained from gravitational lensing.
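As an aside, relations (1.2) and (1.3) are simple enough to evaluate directly; the following sketch does so with placeholder fit coefficients a, b, c (which, as noted above, depend on the simulation setup and halo definitions) and is meant only as an illustration of the scalings.

```python
import numpy as np

def half_mode_mass(m_x_keV):
    """Half-mode mass in solar masses for a thermal relic of mass m_x (in keV/c^2), eq. (1.2)."""
    return 2e10 * m_x_keV ** (-10 / 3)

def shmf_suppression(M, M_hm, a=1.0, b=1.0, c=-1.3):
    """WDM-to-CDM subhalo mass function ratio, eq. (1.3); a, b, c are placeholder fit values."""
    return (1 + (a * M_hm / M) ** b) ** c

# A 5 keV/c^2 thermal relic has M_hm ~ 10^8 Msun, so ~10^7 Msun subhaloes are heavily suppressed:
print(half_mode_mass(5.0), shmf_suppression(1e7, half_mode_mass(5.0)))
```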

⁸ Application of (1.2) yields a result even lower than that. The discrepancy is due to an additional (elastic) interaction mechanism that Green et al. consider.
⁹ are?
¹⁰ In fact, we will not be using this expression but will be mimicking its effects by a "smoothing" procedure described in section 2.3.1.


1.2 (Strong) Gravitational lensing

The discovery of gravity as an influence on the motion of massive objects arguably marks the conception of modern science[citation needed]. Cavendish around 1784 (according to Will (1988)) and Laplace (1796) calculated the angular deflection of light passing by a star due to Newtonian gravity,¹¹ reasoning, based on the equivalence principle, that gravity should also affect phenomena which exhibit non-corpuscular (wave) properties.

In modern terms, gravitational lensing is the phenomenon of "bending" of light paths by massive objects and is thought to be a consequence of the proposition that light always follows perfectly "straight" (geodesic) paths in spacetime¹² (Einstein, 1916). It was among the first predictions of the general theory of relativity to be experimentally verified (Dyson et al., 1920).

Gravitational lensing has found application in a wide range of cosmological problems. As can be inferred from the name, it is of great aid in observations of the most distant objects in the Universe, like the once most distant known galaxy EGSY8p7 (Zitrin et al., 2015). It is an important component in the analysis of the cosmic microwave background (CMB) (Planck Collaboration, 2018) and, perhaps, the only alternative to kinematics for measuring the mass of distant galaxies and clusters (proposed first by Zwicky (1937)). Meanwhile, the effect known as microlensing has been used to detect objects as small as planets (Tsapras, 2018) and to look for mountain-mass primordial black holes (Niikura et al., 2019). Finally, the time delay induced by lensing is being used to provide an estimate of the Hubble constant that is straightforward and independent of both local and CMB measurements (Wong et al., 2020).

1.2.1 Thin lens formalism

The general theory of relativity (GR) allows one to solve for the propagation of light through arbitrary spacetimes, but that can quickly become arbitrarily difficult. Usually, therefore, when considering "ordinary" non-extreme systems like galaxies and clusters, one adopts a thin-lens, weak-gravity approximation which assumes that:

• Mass densities are low enough, and relativistic effects can be ignored, so that the Newtonian approximation to GR can be used; i.e. the metric can be fully described by a scalar gravitational potential 𝜓.

• Consequently, all deflections are small, and angular measures can be treated as a Cartesian coordinate system.

• All the lensing mass is located in a single plane: the image plane¹³. Thus, it can be described by a surface mass density Σ(ξ⃗), related to the more realistic 3D density ρ by an integration perpendicular to the image plane: Σ(ξ⃗) ≡ ∫_{−∞}^{∞} ρ(ξ_x, ξ_y, z) dz.

• Similarly, all the light is emitted from a source plane and is described by a surface brightness β(x⃗).

Since the equations governing the propagation of light in vacuum (Maxwell's equations) are linear, one can extend a lensing model beyond a single pair of image-source planes, raytracing to multiple sources. Furthermore, in a true raytracing fashion, one can trace through several lensing planes, taking care to use the proper distance rescalings (see below), to build arbitrarily complex scenarios.

Under these assumptions, the image plane and source plane coordinates of a light ray are related by the simple lens (or raytracing) equation (Meneghetti, 2016):

\vec{x} = \vec{\xi} - \vec{\alpha}(\vec{\xi}) \qquad (1.4)

¹¹ and Michell (1784) speculated about the existence of "dark stars"
¹² The same reasoning and calculation apply to any massless particle or phenomenon, like gravitational waves; see, e.g., Singer et al. (2019) for (a refutation of) a speculative detection of gravitational lensing of gravitational waves.
¹³ so called because it is the one closest to the observer, and thus forms the final image


[Figure 1.1: thin-lens geometry, showing the source plane (x⃗), the image plane (ξ⃗, ξ⃗′), the deflection angles α⃗′ and α⃗, the lensing mass M, the observer, and the distances D_L, D_LS, D_S.]

Figure 1.1. Thin lens geometry. The lensing mass M located at ξ⃗′ bends the light ray (thick line) emanating from the point x⃗, so that to the observer it looks like it is coming from the direction of ξ⃗. General relativity predicts the deflection angle α⃗′ as viewed from the image plane, based on the mass M and the distance ξ⃗ − ξ⃗′. It then has to be rescaled by D_LS/D_S to obtain the displacement field α⃗ (as viewed by the observer) for use in (1.4). Angles are all assumed to be small enough that they can be used for Euclidean calculations. The dashed line is the optical axis, perpendicular to the planes, and connects the origins of the coordinate systems for each plane.

with α⃗ the (reduced) displacement field, which is determined by the projected potential:¹⁴

\Psi(\vec{\xi}) \equiv \int_{-\infty}^{\infty} \psi(\xi_x, \xi_y, z)\, \mathrm{d}z \quad\Longrightarrow\quad \vec{\alpha}(\vec{\xi}) = \frac{D_{LS}}{D_S}\, \frac{2}{c^2}\, \frac{\vec{\nabla}_{\vec{\xi}}\, \Psi(\vec{\xi})}{D_L}, \qquad (1.5)

where the gradient is taken in the image plane and should have dimensions of inverse length, whence the introduction of the observer–lens distance D_L. The additional factor D_LS/D_S is geometric (as illustrated in figure 1.1) and motivates the "reduced" qualification.

Since lensing deals with angles, we need to ensure that all distances considered are in fact angular diameter distances. In the flat cosmology implied by Planck Collaboration (2016), the redshift dependence of the angular diameter distances is (Hogg, 1999)

D_L = \frac{D_C(z_{\mathrm{lens}})}{1 + z_{\mathrm{lens}}}, \qquad D_S = \frac{D_C(z_{\mathrm{src}})}{1 + z_{\mathrm{src}}}, \qquad D_{LS} = \frac{D_C(z_{\mathrm{src}}) - D_C(z_{\mathrm{lens}})}{1 + z_{\mathrm{src}}}, \qquad (1.6)

with D_C the comoving distance. In practice, we use the routines from the astropy package (Astropy Collaboration, 2013, 2018).
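For concreteness, the distances in (1.6) can be obtained directly from astropy (which, as stated, is what is used in practice); the specific Planck15 cosmology and the redshifts below (those of the mock setup in figure 2.5) are assumptions for illustration only.

```python
from astropy.cosmology import Planck15 as cosmo   # assumed cosmology, for illustration
import astropy.units as u

z_lens, z_src = 0.2, 1.0                          # redshifts used for figure 2.5

D_L = cosmo.angular_diameter_distance(z_lens)               # observer-lens
D_S = cosmo.angular_diameter_distance(z_src)                # observer-source
D_LS = cosmo.angular_diameter_distance_z1z2(z_lens, z_src)  # lens-source

print(D_L.to(u.Mpc), D_S.to(u.Mpc), D_LS.to(u.Mpc))
```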

From a computational standpoint it is more convenient to represent (1.5) as an integral. This is accomplished through the use of the Poisson equation¹⁵ ∇²ψ = 4πGρ and by setting appropriate boundary conditions (Meneghetti, 2016; see also appendix A.1):

\vec{\alpha}(\vec{\xi}) = \frac{D_{LS}}{D_S}\, \frac{4G}{c^2 D_L} \int_{\text{image plane}} \Sigma(\vec{\xi}\,')\, \frac{\vec{\xi} - \vec{\xi}\,'}{|\vec{\xi} - \vec{\xi}\,'|^2}\, \mathrm{d}^2\!\left(D_L \vec{\xi}\,'\right). \qquad (1.7)

In fact, this equation can be the starting point for a derivation in the reverse direction: it can be interpreted as a properly rescaled sum of the well-known contribution 4GM/(bc²) (which Dyson et al. (1920) confirmed) from infinitesimal point masses M = Σ D_L² d²ξ⃗′. Thus, lensing can be seen to be linear in the lensing mass distribution, which allows arbitrary massive components to be effortlessly superposed.

The expression (1.7) is simplified by introducing the critical surface density for lensing, Σ_cr, and the closely related convergence, κ:

\Sigma_{\mathrm{cr}} \equiv \frac{c^2}{4\pi G}\, \frac{D_S}{D_L D_{LS}}, \qquad \kappa(\vec{\xi}) = \frac{\Sigma(\vec{\xi})}{\Sigma_{\mathrm{cr}}}. \qquad (1.8)

¹⁴ The factor of two here is the only difference between the Newtonian treatment of centuries past and GR.
¹⁵ also known to school children as Newton's law of universal gravity


The former is a bound above which a lens can form multiple images. Its prefactor depends only on the relative strength of gravity (G) to the speed of light (c), and the distance ratio dictates that more distant sources are more easily lensed (for a fixed lens distance), since the deflections required are smaller. The multiple-image condition is equivalently stated as κ > 1 (see Meneghetti (2016, section 2.6) for discussion and illustration).
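The critical surface density (1.8) follows directly from the distances computed above; a small sketch (same assumed cosmology and illustrative redshifts as before):

```python
import numpy as np
import astropy.units as u
from astropy.constants import c, G
from astropy.cosmology import Planck15 as cosmo    # assumed cosmology, for illustration

z_lens, z_src = 0.2, 1.0
D_L = cosmo.angular_diameter_distance(z_lens)
D_S = cosmo.angular_diameter_distance(z_src)
D_LS = cosmo.angular_diameter_distance_z1z2(z_lens, z_src)

# Critical surface density for lensing, eq. (1.8)
Sigma_cr = c**2 / (4 * np.pi * G) * D_S / (D_L * D_LS)
print(Sigma_cr.to(u.Msun / u.kpc**2))
```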

The lens equation can be viewed as a coordinate transformation and then conveniently characterised through its Jacobian, which can be shown (in appendix A.1) to take the form

\frac{\mathrm{d}\vec{x}}{\mathrm{d}\vec{\xi}} = \begin{pmatrix} 1-\kappa & 0 \\ 0 & 1-\kappa \end{pmatrix} - \begin{pmatrix} \gamma_1 & \gamma_2 \\ \gamma_2 & -\gamma_1 \end{pmatrix}, \qquad (1.9)

conveniently split into an isotropic and a shear part. The (inverse) Jacobian defines the magnification

\mathcal{M} \equiv \left|\frac{\mathrm{d}\vec{\xi}}{\mathrm{d}\vec{x}}\right| = \left[(1-\kappa)^2 - \left(\gamma_1^2 + \gamma_2^2\right)\right]^{-1}, \qquad (1.10)

which can be interpreted as the overall scaling of the area of small patches of the source plane after lensing. The magnification can take any real value, with negative magnification signifying a flipped image. It can also become extremely large (positive or negative) where the argument in brackets is close to zero. The regions where this occurs are called caustics and critical lines in the source and image planes, respectively. The magnitude of magnification differentiates strong from weak lensing: for the former 𝜅 and 𝛾 are comparable to (and/or even exceed) unity, and so sources are strongly magnified.

It is important to note, however, that lensing does not create or destroy photons (and thus conserves energy¹⁶). Thus, the amount of light arriving at the observer is proportional to the area in the image plane that the source subtends, which effectively acts as a collecting area for light that is then simply rerouted just as with conventional lenses. This means that lensing conserves surface brightness around infinitesimal points, and thus the surface brightness registered by the observer (in the absence of other sources) is

\beta(\vec{\xi}) = \beta\!\left(\vec{x}(\vec{\xi})\right). \qquad (1.11)

It is obvious from this expression that the lens and the source are entirely degenerate: for every deflection field there is a surface brightness configuration that can reproduce any given observation¹⁷. The reverse is also sometimes true, such as in the case of the "mass-sheet degeneracy": in general the spatial scale of the source cannot be conclusively determined, since a sheet of constant mass density could be magnifying it uniformly by an arbitrary amount (Kuijken, 2003).
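The forward mapping of (1.4) and (1.11) is straightforward to vectorise; the toy sketch below (a circular Gaussian source behind a singular-isothermal-sphere deflector, both stand-ins chosen for brevity) illustrates the pattern: evaluate the deflection on the image-plane grid, ray-trace, and sample the source brightness at the resulting positions.

```python
import torch

# Image-plane pixel grid (angles in arcsec); grid size and extent are illustrative.
grid = torch.linspace(-2.5, 2.5, 100)
xi = torch.stack(torch.meshgrid(grid, grid, indexing="ij"), dim=-1)   # shape (100, 100, 2)

def alpha_sis(xi, theta_E=1.0):
    """Deflection of a singular isothermal sphere: magnitude theta_E, directed radially."""
    r = xi.norm(dim=-1, keepdim=True).clamp(min=1e-12)
    return theta_E * xi / r

def beta_source(x, sigma=0.2):
    """Stand-in source: a circular Gaussian surface-brightness profile."""
    return torch.exp(-0.5 * (x ** 2).sum(-1) / sigma ** 2)

x = xi - alpha_sis(xi)      # lens equation (1.4)
image = beta_source(x)      # surface-brightness conservation (1.11): sample the source at x
```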

¹⁶ at least for stationary (non-rotating) spacetimes. The interested reader is referred to the effect of superradiance (Brito et al., 2015) as an example of extraction of energy by "very strong gravitational lensing".
¹⁷ For all we know, there could be sources that naturally look like Einstein rings. What's there to stop them? Find out in section 2.4.


1.3 Previous work and motivation

Gravitational lensing is a powerful tool for studying dark matter because it directly encodes its gravitational field (which is its defining characteristic) into signatures in electromagnetic images. Substructure research naturally targets galaxy-scale lenses, preferentially in two lens-source configurations: multiply (quadruply) imaged point sources¹⁸ and (again multiply imaged) extended arcs or Einstein rings (occurring near the critical lines).

In the former case, the ratios between fluxes of the source projections are used, together with strong assumptions about a "macroscopic" (smooth) lens configuration, to assess the marginalised statistical likelihood of the presence of substructure with a given low-mass cutoff. The latest constraints derived by this method (Gilman et al., 2020; Hsueh et al., 2020) place an upper limit M_hm ≲ 10^7.5 M⊙, similar to constraints from other methods.

On the other hand, techniques analysing extended sources usually aim at localising and characterising individual subhaloes, whose (non-)detection allows conclusions on the mass function to be drawn based on the statistical power of the technique used (Vegetti et al., 2010; Vegetti et al., 2012; Hezaveh et al., 2016; Ritondale et al., 2019). An alternative is to add a "correction" potential on top of a smooth extended lens and from it extract information about the DM model (Birrer et al., 2017; Bayer et al., 2018). These studies employ a "classical" approach to statistical testing by utilising Monte Carlo (MC) simulations and comparing their outcome (possibly through the use of a test statistic) to the observation. Although statistically straightforward and robust to correlations, this approach is extremely time consuming, necessitating the exploration of a very high-dimensional parameter space, especially if any sort of "pixelated" lens or source model is used.

While there are currently on the order of a hundred available observations of galaxy-galaxy lenses suitable for substructure detection (mostly contained within the SLACS (Bolton et al., 2006; Shu et al., 2017) and BELLS (Brownstein et al., 2012; Shu et al., 2016) optical surveys¹⁹), instruments of the (near) future are expected to deliver hundreds of thousands (Collett, 2015). The main contributors are expected to be the Large Synoptic Survey Telescope²⁰ (LSST, Ivezić et al. (2019)) and the Euclid²¹ space telescope (Laureijs et al., 2011). The latter will in fact provide imagery of the necessary resolution and, more importantly, seeing to be directly usable for substructure modelling. On the other hand, the expected point spread function (PSF) of the LSST will be on the order of the Einstein radius for the majority of the discovered lenses, and so follow-up will still be required. In that respect, the James Webb Space Telescope and the Extremely Large Telescope (ELT) will improve on the imaging capabilities of the Hubble Space Telescope (HST), with the latter expected to produce 10″ × 10″ diffraction-limited images (the resolution is expected to be at least 2 mas (ELT Instrumentation), compared to Hubble's 50 mas (ACS Handbook) and Euclid's 100 mas (Euclid VIS Instrument)).

Even the identification of candidate lenses among such large volumes of data is a challenging task for which new methods need to be developed²², let alone their analysis and utilisation. The main drawback of existing methods is that, with growing image sizes (which are necessary to improve constraints), the parameter space of any pixelisation necessarily grows as well and rapidly increases the time required for full MC sampling.

¹⁸ See Meneghetti (2016, chapter 2.6) for an explanation and demonstration of this possibility.
¹⁹ These surveys identify candidates in large spectroscopic surveys (the SDSS and BOSS) by looking for objects that seem to combine light with distinct redshifts. Follow-up observations have been performed with the Hubble Space Telescope to obtain the final deep, high-resolution images in SLACS and BELLS. JVAS/CLASS (Myers et al., 2003) is an older radio survey that also provides suitable images. Other collections have also been produced (More et al., 2012; Huang et al., 2020), for which high enough quality imagery is generally still not available.
²⁰ LSST now apparently stands for the "Legacy Survey of Space and Time", while the observatory has been renamed after Vera C. Rubin (116th Congress, 2019) and the telescope itself: the Simonyi Survey Telescope...
²¹ On the topic of naming: the author finds the name of one of the original mission designs, the Dark Universe Explorer (DUNE), far more exciting than "Euclid", and abbreviating the other, the Spectroscopic All Sky Cosmic Explorer, as SPACE instead of SASyCE a missed opportunity. But then again, Euclid's books are more influential than Frank Herbert and manga... or are they?



The alternative approach, enabled partly by recent advances in computer science like automatic differentiation and easy massive parallelisation on graphics and tensor²³ processing units (GPUs and TPUs), and finding almost ubiquitous application in all areas of life and research, is to use an artificial deep convolutional neural network. Indeed, Hezaveh et al. (2017) demonstrated a speedup of ten million times, reducing the runtime for the extraction of basic smooth lens parameters from days to hundredths of a second without sacrificing any accuracy, while Morningstar et al. (2019) and Chianese et al. (2020) used specific neural network variants to improve the modelling of sources, and Brehmer et al. (2019) and Diaz Rivero & Dvorkin (2020) attempted application to substructure directly, albeit with varying degrees of success and/or realism of the setup. The common aspect of all neural network solutions is the necessity for a preparatory training stage that closely resembles the MC stage in traditional methods, in that a mock observation is simulated, a likelihood is assigned to it, and the parameters of the model, now embodied in the weights, etc. of the network, are optimised. For example, Brehmer et al. (2019) used a training set of similar size (10⁶) to the MC samples of Gilman et al. (2020). A neural network, however, "remembers" "how" to "solve the problem" during training, and can subsequently quickly apply the solution to new examples (within its "area of expertise").

This thesis presents a different method targeted at the outputs of the new generations of surveys: we aim to be able to process effortlessly images of resolution 400 × 400 pixels in a matter of hours²⁴. Its design goal is to combine the merits of the two "camps": to exploit auto-differentiation for quick likelihood-based fitting, while maintaining utmost flexibility (i.e. permitting arbitrary lens and source components (and instrument specifications) to be defined and combined) and allowing for full statistical treatment using arbitrary parameter priors and posterior approximation distributions. To achieve this, we adopt the variational approach, which is discussed in the next chapter along with the model components. Significant attention is devoted to the source modelling, described in detail in chapter 3, followed by the presentation of two applications: to the SLACS image of SDSSJ0252+0039 and to a mock ELT image in chapter 4. Finally, various subtleties and the future development plan for the framework are discussed in chapter 5.

²³ At this point the author is obliged to express his appalment at the terminology for arrays in machine learning. The logic goes: numbers ("scalars"), "vectors", "matrices", ... "higher-dimensional tensors". Well, one is allowed to think that only if one is a Bachelor student, or if one's tensor's components transform properly.


Chapter 2

Model Overview

The main endeavour in modelling strong gravitational lenses is disentangling the effects of the lens from the shape of the source, which, as mentioned, are entirely degenerate. In principle, if the source is perfectly well understood, a single observation allows the determination of the projected mass density by inversion of the equations in section 1.2.1. However, this requires that elements of the source can be uniquely identified, i.e. that one can construct the map ξ⃗ ↦ x⃗, and that this map covers the whole image plane.

Realistically, partial information about the lens can be extracted straight from the observed image. In galaxy-galaxy lenses, which are the class of interest for substructure studies, the main lensing DM halo is always populated by an elliptical galaxy, from which an initial estimate of the lensing mass distribution can be inferred (see e.g. Bolton et al. (2006)). An Einstein ring can further constrain the total lens mass (inside the ring).

Furthermore, it is of great aid to identify multiple projections of the same source: in so-called multiple-image systems. One can then expect that a macroscopic¹ mass distribution accounts for mapping the multiple images to the same location in the source plane, while subhaloes only introduce variations in their vicinity and thus in only one of the projections. Thus, if substructure is present in a multiple-image lens modelled using only a smooth lens component, its signature in the source plane will be that pixels that map close to each other but come from different images will have different values (see figure 2.2).

Therefore, the most important ingredient of lensing models for substructure inference is a component which evaluates the probability that such variations in the source plane are due to the presence of substructure and not intrinsic variation of the source. This has been achieved in multiple ways (Vegetti & Koopmans, 2009; Birrer et al., 2017; Brehmer et al., 2019; Chianese et al., 2020) that rely either on a full Bayesian treatment using Monte Carlo methods or on the use of neural networks in various forms.

We propose a new framework based on variational inference. Broadly, it consists of an arbitrary number of lenses that transform the regular pixel grid of an image into positions in the source plane. An arbitrary number of sources is then "asked" to provide surface brightnesses at those locations, which are passed through an instrument response to produce the final data. A schematic representation is shown in figure 2.1.

[Figure 2.1: schematic of the model pipeline — lens parameters Θ_lens produce displacement fields α⃗ acting on the pixel grid 𝔭⃗; source parameters Θ_src produce surface brightnesses f at the ray-traced positions p⃗; instrument parameters Θ_inst turn these into the data d.]

Figure 2.1. A graphical representation of the model. See footnote ⁴ for details on notation.

¹ Since no light from the source is received "between" the two images, one can only assume the lensing configuration there. The usual assumption is a high degree of smoothness, and this is why the "macroscopic" component is referred to as "smooth". It also accounts for most of the mass, so it is also called "main".


[Figure 2.2: four panels — "image", "reconstruction from 1", "reconstruction from 2", and "excess variance of combined reconstruction" — with image-plane coordinates ξ_x, ξ_y and source-plane coordinates x_x, x_y in arcsec.]

Figure 2.2. Illustration of the signature of a subhalo on the source plane representation of the observation. The setup imitates a main halo of mass 1.6 × 10¹¹ M⊙ within the Einstein radius (red circle in the "image" panel) multiply-lensing a simple Sérsic source. A subhalo of mass 10⁹ M⊙ is placed at the location of the red cross and causes a local disturbance in the image. Given perfect knowledge of the smooth component of the lens, but ignorant of the presence of the subhalo, one can reproduce the source like in "reconstruction from 1" using the unperturbed image: enclosed in a black line. Alternatively, one can do the same with the other source projection and obtain "reconstruction from 2". By themselves both these source reconstructions may be plausible: the dark stripe in 2 may be due to a dust lane, for example. However, one can unequivocally infer the presence of the subhalo by examining the difference between the two. In a more statistically well-defined manner, one can calculate the source plane local variance of pixel values assigned projected locations, as depicted in the rightmost panel. Each set of pixels (from each of the source projections) will lead to a certain intrinsic variance due to source properties and gradients (when necessarily averaging over an extended area), but considering both sets will lead to additional variance introduced wherever there is a systematic mismatch (i.e. a difference of the means).

2.1 Variational Inference

Variational inference (VI) is an approximate inference technique alternative to sampling methods for solving intractable Bayesian problems. It defines a proposal distribution q(Θ) for the posterior of the parameters of interest Θ given data d₀ and evaluates how well it approximates the true posterior p(Θ|d₀), typically via the evidence lower bound (ELBO):

\mathrm{ELBO}\left[p(\mathbf{d}_0, \boldsymbol{\Theta}),\, q_{\boldsymbol{\Phi}}(\boldsymbol{\Theta})\right] = \mathbb{E}_{q_{\boldsymbol{\Phi}}(\boldsymbol{\Theta})}\left[\ln p(\mathbf{d}_0, \boldsymbol{\Theta}) - \ln q_{\boldsymbol{\Phi}}(\boldsymbol{\Theta})\right], \qquad (2.1)

where p(d₀, Θ) = p(d₀|Θ) p(Θ) is the model likelihood, and the expectation is taken over the proposal. When the ELBO is added to the Kullback-Leibler (KL) divergence, an always positive quantity, one obtains the (logarithm of the) model evidence ln p(d₀), which justifies the name. Thus, maximising the ELBO minimises the KL divergence, which indicates that the proposal is brought "closer in shape"² to the true posterior (Bishop, 2006).

The optimisation is typically done with respect to a set of guide parameters³ Φ that are usually the parameters of the approximating probability density functions: for example a guiding mean and variance of a normal distribution. Usually the guides are simple analytic functions (we will only be using delta and normal distributions), so both sampling from them and differentiating (2.1) with respect to their parameters (by using the reparametrisation trick for the expectation (Kingma & Welling, 2013)) are straightforward.

² Various subtleties exist. For one, the ELBO is not symmetric in its arguments (as is not the KL divergence). Maximisation of (2.1) means seeking such q(Θ) that for all drawn Θ the likelihood is as big as possible. This means that this "forward ELBO" is "mode seeking", i.e. tries to fit within modes, giving more constrained results than the true posterior! Maximising the "reverse ELBO" that has its arguments swapped (thus requiring samples weighted by the likelihood) leads to a "mass covering" approximation.
³ It is also common to refer to parameters of the model as latent variables over which one can place priors, and/or to the guide (and other) parameters as hyperparameters.


The minimisation procedure is as follows⁴: values for Θ are drawn from the guide, using the current hyperparameters Φ; the priors p(Θ) are computed; the likelihood is computed by running the generative model, which produces observations d; these are conditioned on the actual observations to produce the likelihood p(d₀|Θ) = p(d₀|d); the ELBO and its gradient with respect to Φ are computed; and an optimisation step is taken. We use the probabilistic programming package pyro (Bingham et al., 2018), which uses the automatic gradient capabilities of torch (Paszke et al., 2019), and optimize via the Adam optimizer (Kingma & Ba, 2014).
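Schematically, the loop just described looks as follows in pyro; the toy model below (a single parameter and a constant "image") is only a placeholder for the actual lensing simulator, and the automatic normal guide stands in for the hand-specified guides used in the thesis.

```python
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.infer.autoguide import AutoNormal
from pyro.optim import Adam

def model(obs, noise=0.1):
    theta = pyro.sample("theta", dist.Normal(0.0, 1.0))              # prior p(theta)
    d = theta * torch.ones(16)                                        # stand-in generative model d(theta)
    pyro.sample("data", dist.Normal(d, noise).to_event(1), obs=obs)   # condition on the observation d0

guide = AutoNormal(model)                                    # normal guide q_Phi(theta)
svi = SVI(model, guide, Adam({"lr": 1e-2}), loss=Trace_ELBO())

d0 = 0.5 * torch.ones(16)
for step in range(1000):
    svi.step(d0)          # draw from the guide, evaluate the ELBO, and take an Adam step on Phi
```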

2.2 Main lens

For the smooth ("main") component of the lens, we consider a singular power law ellipsoid (SPLE), which has a surface density

\frac{\Sigma(\vec{\xi})}{\Sigma_{\mathrm{cr}}} = \kappa(\vec{\xi}) = \frac{3-\gamma}{2} \left(\frac{\theta_E}{R(\vec{\xi})}\right)^{\gamma-1} \qquad (2.2)

parametrised by the 3D density exponent γ (or "slope") and the Einstein radius θ_E, while R is an elliptical radial coordinate as described in appendix A.2. The SPLE displacement field has a closed-form expression in the complex notation, in which α = α_x + iα_y (O'Riordan et al., 2020; Tessore & Metcalf, 2015):

\alpha_{\mathrm{SPLE}}(R, \phi) = \theta_E\, \frac{2\sqrt{q}}{1+q} \left(\frac{\theta_E}{R(\vec{\xi})}\right)^{\gamma-2} {}_2F_1\!\left(1, \tfrac{\gamma-1}{2}; \tfrac{5-\gamma}{2}; -\tfrac{1-q}{1+q}\, \mathrm{e}^{2\mathrm{i}\phi}\right) \mathrm{e}^{\mathrm{i}\phi}. \qquad (2.3)

Here 𝜙 is the elliptical angular coordinate (and not the ellipse position angle 𝜑!), and ₂F₁ is the hypergeometric function. In the circular (q = 1) isothermal (γ = 2) case this reduces to α = θ_E e^{i𝜙}, as expected. The function ₂F₁, however, is not implemented in torch, let alone in an autodifferentiable fashion, so we instead use the procedure described in full in Chianese et al. (2020, Appendix A), which relies on pre-tabulating the angular part of (2.3) (through an alternative formulation) and interpolating at runtime. This results in very fast code with negligible output degradation.
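The tabulate-and-interpolate pattern itself is simple; the sketch below shows a generic, autodiff-friendly 1D version in torch (the tabulated function is an arbitrary stand-in, not the actual angular part of (2.3), for which the procedure of Chianese et al. (2020, Appendix A) is followed).

```python
import math
import torch

# Pre-tabulate an expensive angular function once on a regular grid...
phi_grid = torch.linspace(0.0, 2 * math.pi, 1001)
table = torch.exp(torch.cos(phi_grid))        # stand-in for the angular factor of (2.3)
step = phi_grid[1] - phi_grid[0]

def interp(phi):
    """...and replace runtime evaluations by piecewise-linear, differentiable lookups."""
    idx = torch.clamp((phi / step).long(), 0, len(phi_grid) - 2)
    frac = (phi - phi_grid[idx]) / step
    return (1 - frac) * table[idx] + frac * table[idx + 1]
```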

[Figure 2.3: graph of the smooth lens model — the parameters (γ₁, γ₂) = Θ_ext.shear and (x₀, y₀, 𝜑, q, θ_E, γ) = Θ_SPLE feed the external shear and SPLE components, each producing a displacement-field layer α⃗ evaluated at the pixel positions 𝔭⃗.]

Figure 2.3. Graphical representation of the smooth lens model, which consists of a singular power law ellipsoid and external shear, producing two of the displacement field layers. There are eight parameters in total, which are processed deterministically through these components, so they carry all the stochasticity.

⁴ We illustrate the components of VI in graphs as follows: parameters in Θ are filled rectangles, while other inputs are empty rectangles; circles represent intermediary variables, and distributions and stochastic functions are enclosed in a dashed border. In a fitting process these are not used to draw from but only to provide the necessary probabilities. For parameters not pointed to from a distribution, the probability is provided by a (not depicted) prior. Drawing is done from the guide, and all (model) parameters are implicitly drawn. This can also be indicated explicitly with a dashed arrow. Intermediary variables pointed to by a dashed arrow are conditioned on data. Probabilities of elements pointed to by dashed arrows are calculated for the value substituted (be it from the guide or from observed data⁵).
⁵ In fact, conditioning on data is the same as augmenting the guide with a non-trainable delta guide set at the observed values.


An additional component introduced for model flexibility is the external shear, which contributes a displacement field "layer"

\vec{\alpha}_{\text{ext.\ shear}}(\vec{\xi}) = \begin{pmatrix} \gamma_1 & \gamma_2 \\ \gamma_2 & -\gamma_1 \end{pmatrix} \vec{\xi} \qquad (2.4)

in analogy with (1.9), and represents the combined weak lensing effect of all unmodelled lenses along the light path (assumed constant across the image). As discussed in section 1.2.1, an overall magnification (external convergence) is degenerate with scaling the source plane, so it is not modelled explicitly.

Overall, the main lens model has eight trainable parameters, illustrated in figure 2.3.

2.3 Substructure

The substructure model consists of a (fixed) number N_sub of subhaloes located at positions 𝔭⃗_sub. Each is assumed to have the NFW 3D mass density profile (Navarro et al., 1997), which has recently been confirmed appropriate for DM haloes as light as an Earth mass (Wang et al., 2019). This profile, however, has the well-known drawback of having divergent total mass, which is certainly not the case in dense environments like galaxies and clusters, where tidal stripping and baryonic effects are expected to "truncate" DM haloes (Tormen et al., 1998). This effect is usually overlooked in kinematic studies, which only require calculations of the enclosed mass, but can have a significant impact on direct detection and lensing observables that contain integrals along the full line of sight. Therefore, we use the modification due to Baltz et al. (2009), which allows modelling the truncation in a computationally "well-behaved" and easy to implement way:

\rho(r) = \frac{\rho_{\mathrm{NFW}}(r)}{1 + (r/r_t)^2} \quad\Longrightarrow\quad \rho(s) = \frac{\rho_s}{s (1+s)^2 \left[1 + (s/\tau)^2\right]}, \qquad (2.5)

\text{with} \quad s \equiv \frac{r}{r_s} \quad\text{and}\quad \tau \equiv \frac{r_t}{r_s}, \qquad (2.6)

where the scale density ρ_s and radius r_s are the parameters of the NFW profile, and r_t is the truncation radius. Note that, since the distances appear only as ratios, we can use angular quantities straightforwardly, e.g. by computing an angular scale radius θ_s, etc.

Although this formulation is useful for computing the lensing properties of the halo (see below), it is more common to parametrise the halo by its "virial" mass and concentration, related as usual to the scale density and radius by

M_{200} \equiv 200\, \rho_{\mathrm{cr}}(z_{\mathrm{lens}}) \cdot \frac{4\pi}{3} \left(c_{200} r_s\right)^3, \qquad (2.7)

\rho_s = 200\, \rho_{\mathrm{cr}}(z_{\mathrm{lens}}) \cdot \frac{1}{3}\, \frac{c_{200}^3}{\ln(1 + c_{200}) - c_{200}/(1 + c_{200})}. \qquad (2.8)
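Equations (2.7)–(2.8) amount to a small unit conversion; a sketch using astropy's critical density (the Planck15 cosmology and the example values are assumptions for illustration only):

```python
import numpy as np
import astropy.units as u
from astropy.cosmology import Planck15 as cosmo     # assumed cosmology, for illustration

def nfw_scale_parameters(M200_Msun, c200=15.0, z_lens=0.2):
    """Convert (M200, c200) into the NFW scale radius (kpc) and density (Msun/kpc^3), eqs. (2.7)-(2.8)."""
    rho_cr = cosmo.critical_density(z_lens).to(u.Msun / u.kpc**3).value   # Msun / kpc^3
    r200 = (M200_Msun / (200 * rho_cr * 4 * np.pi / 3)) ** (1 / 3)        # kpc, from (2.7)
    r_s = r200 / c200
    rho_s = 200 * rho_cr / 3 * c200**3 / (np.log(1 + c200) - c200 / (1 + c200))  # (2.8)
    return r_s, rho_s

# e.g. a 10^9 Msun subhalo at the fixed concentration c200 = 15:
r_s, rho_s = nfw_scale_parameters(1e9)
```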

The (axially symmetric, i.e. s = |ξ⃗ − 𝔭⃗_sub|/θ_s) displacement field corresponding to the truncated NFW profile is, in its full glory (Baltz et al., 2009, Appendix A):

\vec{\alpha}_{\mathrm{NFW}}(\vec{\xi}) = \frac{4 \rho_s r_s}{\Sigma_{\mathrm{cr}}}\, \frac{\vec{\xi} - \vec{\mathfrak{p}}_{\mathrm{sub}}}{s^2}\, \frac{\tau^2}{\left(\tau^2 + 1\right)^2} \left[ \left(\tau^2 + 2s^2 - 1\right) \frac{\cos^{-1}(1/s)}{\sqrt{s^2 - 1}} + \pi\tau + \sqrt{s^2 + \tau^2} \left( \frac{\tau^2 - 1}{\tau} \ln\!\frac{s}{\sqrt{\tau^2 + s^2} + \tau} - \pi \right) \right]. \qquad (2.9)


Each subhalo is thus defined by five parameters. The concentration of NFW haloes is the object of extensive studies, with numerous mass-concentration(-redshift) relations motivated by numerical simulations having been proposed. We fix it to c₂₀₀ = 15, which is representative of the range of concentrations appropriate for the masses (and other relevant details like redshift and host halo) that we are considering and consistent⁶ with Hellwing et al. (2016) and Richings et al. (2020). Furthermore, we fix τ = 6, consistently with Gilman et al. (2020), and provide some discussion on this choice⁷ in appendix A.10. In any case, given that the aim of this work is (making steps towards) detecting substructure, characterisation beyond its mass properties is beyond its scope⁸. Thus, there are just two (vectors of) trainable parameters pertaining to substructure: {M₂₀₀, 𝔭⃗_sub} (the latter being a 2D position).

The expression (2.9) presents a perfectly usable way of calculating the displacement field of a single NFW halo. However, applying it to substructure requires O(n × N_sub) calculations, which, especially in the backward pass, where the derivatives are extremely non-trivial, quickly becomes a burden. Instead of direct calculation, we apply a similar strategy to the one used for (2.3): we first precompute the displacement field on a "mesh" of masses m_mesh and pixel-subhalo displacements 𝔭⃗_mesh, then encode {M₂₀₀, 𝔭⃗_sub} onto the mesh, convolve the resulting weights with the precomputed displacement, and finally interpolate from 𝔭⃗_mesh to the pixel positions 𝔭⃗. Despite the overhead of this procedure, an efficient implementation of the convolution using the Fast Fourier Transform (FFT) reduces the complexity to O(N_{m_mesh} N_{𝔭⃗_mesh} ln N_{𝔭⃗_mesh}) (with N_· the size of the respective vector), which, crucially, is independent of both the image size and the subhalo number. The full procedure is detailed in appendix A.3.
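The mesh encoding itself is deferred to appendix A.3; the FFT step that gives the favourable scaling is, in essence, just a standard Fourier-space (circular) convolution, sketched here with torch.fft and placeholder arrays:

```python
import torch

def fft_convolve2d(weights, kernel):
    """Circular 2D convolution of a weight map with a precomputed kernel via the FFT."""
    W = torch.fft.rfft2(weights)
    K = torch.fft.rfft2(kernel)
    return torch.fft.irfft2(W * K, s=weights.shape[-2:])

# weights: subhalo masses encoded onto the position mesh; kernel: a tabulated displacement map.
weights = torch.zeros(256, 256)
weights[100, 120] = 1e9                 # a single 10^9 Msun subhalo, for illustration
kernel = torch.randn(256, 256)          # placeholder for the precomputed displacement field
alpha_mesh = fft_convolve2d(weights, kernel)
```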

2.3.1 Smoothed substructure

Ideally, we would like to model substructure separately from the main lens in order to reduce correlations and aid inference. However, an obvious correlation exists in that substructure contains a non-negligible part of the total mass of the system (within the Einstein radius), which, as discussed, is its best constrained property, and so we need to ensure that it corresponds to only one model parameter. This can naively be achieved by properly normalising the subhalo mass function at each step, so that a certain ratio between the total substructure and main lens masses is maintained⁹. However, if the cutoff is varied at the same time, this would require changing the number of simulated subhaloes between iterations, which obviously breaks the downhill optimisation procedure.

Instead of fully removing or adding subhaloes in response to a varying low-mass cutoff and/or normalisation, we keep the mass function fixed at the CDM prediction (1.1) and mask out the lensing signature of those subhaloes that are supposed to be suppressed, by smoothing the displacement field they generate. In fact, such a treatment is faithful to the reasoning that low-mass subhaloes are absent because the particles that should make them up are/have been free streaming and are therefore more diluted throughout the main halo. Furthermore, since the smoothed subhaloes still contain significant mass, the fitting process can distribute them so as to act as an additional smooth component. Thus, in essence, the smoothing of substructure alleviates two problems: it adds flexibility to the main lens model and models the low-mass suppression.

⁶ Concentration is expected to increase, logically (yet for subhaloes less obviously), with time, so the results of Richings et al. (2020) at z = 0 are upper limit estimates for the haloes we are modelling. On the other hand, we want to test first the most favourable case, and the most concentrated haloes will produce the most pronounced signatures (see the extended discussion in appendix A.10).
⁷ and on whether anyone can say if it's reasonable and whether it matters
⁸ Still, there has been some work suggesting that examining the profile of subhaloes through lensing is indeed possible (Vegetti & Vogelsberger, 2014), and Gilman et al. (2020) show that the concentration-mass relation is modified for WDM models in a way that can be used to constrain them.
⁹ In the literature this is referred to as the DM substructure mass fraction f_sub and is used by many authors as an alternative way to check for consistency with CDM and/or distinguish WDM models (e.g. Vegetti et al., 2010; …).


[Figure 2.4: two panels — top: smoothing kernel size σ_s (arcsec) versus log₁₀ M₂₀₀/M⊙ for "no smoothing", m_s = 10⁸ M⊙, and "all smoothed"; bottom: survival count N(> M₂₀₀) compared with the prediction of (1.1).]

Figure 2.4. Top: the size of the smoothing kernel from (2.11) for the three levels in figure 2.5 (solid lines) and the size of the Einstein radius (dashed line) corresponding to the total mass of subhaloes we label with the virial mass M₂₀₀. Bottom: survival count, i.e. number of subhaloes bigger than M₂₀₀ in the sample for figure 2.5. The histogram is the particular realisation, and because it is accumulated it is to be expected that it deviates slightly from the analytic expression (the dashed line; derived from (1.1)) in the highest mass bins. The shading represents the amount of smoothing with respect to the Einstein radius, i.e. the ratio of the dashed line and the line for m_s = 10⁸ M⊙ from the upper panel.

[Figure 2.5: a 3 × 4 grid of panels — columns: "no smoothing", m_s = 10⁸ M⊙, "all smoothed", "equivalent smooth lens"; rows: α_x and α_y (in mas) and the convergence (in units of 10⁻⁴).]

Figure 2.5. Smoothing the displacement field of substructure. The two top rows show its components over a 10″ × 10″ area, and the bottom row shows the convergence (κ ≡ ∇⃗ · α⃗). The first three columns to the left depict the effect of increasing the smoothing mass from a vanishing value (no smoothing) in the leftmost column, to smoothing all subhaloes in the third column. The rightmost column shows for comparison the smooth field calculated from a surface density Σ ∝ n M̄ with n the projected number density of subhaloes (which in this example is a normal distribution with spread 2″) and M̄ the mean of the subhalo mass function (we extend the SHMF to 10⁵ M⊙, which leads to M̄ ≈ 1.7 × 10⁶ M⊙). The bottom right panel depicts the subhaloes as circles the size of their respective Einstein radii (for the total mass of the truncated NFW profile). Calculations were done for z_lens = 0.2 and z_src = 1 with 10 000 subhaloes drawn from the CDM SHMF (1.1) with slope α = 0.9 as usual and for a main halo M₀ = 10¹⁰ M⊙. Note the small scale of the displacement and convergence.



In practice, since the suppression in (1.3) is a function of the subhalo mass only, we can implement the smoothing by simply convolving the precomputed displacement fields with a point spread function (we, traditionally, chose a Gaussian). That is, we modify (A.3.3) by

\vec{\alpha}\!\left(\vec{\mathfrak{p}}_{\mathrm{mesh},j}, m_{\mathrm{mesh},i}, \vec{0}\right) \;\to\; \mathrm{conv}_j\!\left[\vec{\alpha}\!\left(\vec{\mathfrak{p}}_{\mathrm{mesh},j}, m_{\mathrm{mesh},i}, \vec{0}\right),\, \mathcal{G}\!\left(\vec{\mathfrak{p}}_{\mathrm{mesh},j},\, \sigma_{\mathrm{s}}^2\!\left(m_{\mathrm{mesh},i}\right)\right)\right], \qquad (2.10)

\text{with} \quad \sigma_{\mathrm{s}}\!\left(m_{\mathrm{mesh},i}\right) = S\!\left(\frac{-\ln\left(m_{\mathrm{mesh},i}/m_{\mathrm{s}}\right)}{\Delta\ln M_{200}}\right) \left(\sigma_{\mathrm{s},0} + n_{\mathrm{s}}\, \theta_E(M_\tau)\right), \qquad (2.11)

where S(x) = 1/(1 + e⁻ˣ) is the sigmoid function, which effects a (differentiable) transition between smoothing below m_s and no modification above it. The size of the smoothing kernel is at least σ_s,0, with an additional component proportional to the Einstein radius for the total mass M_τ of the truncated halo ((A.4) in Baltz et al. (2009)). We set n_s = 3. The transition occurs over a range of masses Δln M₂₀₀, which we set to 0.5 dex. Where effective, this smoothing acts akin to a low-pass filter, erasing sharp variations in the displacement field, which occur close to the centres of the subhaloes, while leaving the slowly varying far field essentially intact¹⁰. Since for spherically symmetric mass distributions the lensing depends only on the enclosed mass (Meneghetti, 2016, (2.13)), this does indeed mimic a "dilution" of the subhaloes. Figure 2.5 demonstrates the smoothing process.

In this framework, m_s acts as a proxy for M_hm, so it assumes the role of the parameter of interest in the study.
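The mass-dependent kernel size (2.11) is essentially a one-liner; a sketch (angular quantities are in whatever unit θ_E is supplied in, and the default transition width corresponds to the quoted 0.5 dex):

```python
import math
import torch

def smoothing_sigma(m200, m_s, sigma_s0, theta_E_tau, n_s=3.0, delta_ln_M=0.5 * math.log(10)):
    """Smoothing kernel size of eq. (2.11): ~(sigma_s0 + n_s * theta_E) below m_s, ~0 above it."""
    # m200 is a tensor of subhalo masses; theta_E_tau the Einstein radius for the total truncated mass.
    S = torch.sigmoid(-torch.log(m200 / m_s) / delta_ln_M)
    return S * (sigma_s0 + n_s * theta_E_tau)
```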

The full routine of simulating substructure is illustrated in figure 2.6. It starts by drawing masses from the SHMF (1.1) and positions from a 2D spatial distribution matched to the location and Einstein radius of the main halo¹¹. These parameters are then encoded onto the mesh of precomputed and smoothed displacements, which are then convolved and interpolated to produce the final displacement field of substructure.

[Figure 2.6: graph of the substructure calculation — the main-lens parameters Θ_main inform the CDM SHMF and the spatial prior p(𝔭⃗_sub | Θ_main) (a normal distribution centred on (x₀, y₀) with spread set by θ_E), from which the masses M₂₀₀ and positions 𝔭⃗_sub are drawn; together with the smoothing mass m_s they are encoded onto the mesh, convolved with the precomputed displacements α⃗(𝔭⃗_mesh), and interpolated to give α⃗.]

Figure 2.6. Graphical representation of the substructure calculation. The true input parameters are the masses M₂₀₀ and positions 𝔭⃗_sub, while the SHMF and the position distribution act as priors, which are also influenced by the main halo, with which the substructure is aligned. We always use the CDM mass function to "draw" subhalo masses and, as described in the text, substitute the low mass cutoff with a procedure of smoothing the displacement field of suppressed subhaloes. This effect is regulated by the parameter m_s, which acts as a proxy for the half-mode mass.

¹⁰ For this it is important that the discretised smoothing kernel is properly normalised so that it performs a true weighted mean operation.
¹¹ In future iterations of the framework this aspect will be improved to model more realistic subhalo distributions like an NFW/Einasto profile.


2.4 (Analytic) Sources

A source is a stochastic function that produces the surface brightnesses at input locations: f = β_{Θ_src}(p⃗ = 𝔭⃗ − α⃗_{Θ_lens}). It is usually the source model that plays the main regularising role if any non-trivial (non-parametric) sources are considered.

There are two general types of sources. Non-parametric sources usually have an explicitly regularised parameter representing the surface brightness in each pixel (Vegetti & Koopmans, 2009), or use a scheme of encoding the source into a lower-dimensional space (Birrer et al. (2017) use shapelets, and Chianese et al. (2020) a variational autoencoder). We will present shortly a source model akin to the former, motivated by the need for a clear and straightforward Bayesian interpretation of its predictions.

On the other hand, parametric sources regularise implicitly, by assuming a definite “true” surface brightness as a functional expression dependent on a set of parameters. The probability of attributing variations to the source is then purely set by the observational uncertainty. Parametric models are usually not flexible enough to describe real surface brightness distributions with e.g. dust lanes in spiral galaxies and unusual morphologies; nevertheless, they offer an efficient way to obtain an estimate of the “look” (i.e. general shape and size) of the source. Often using them is even a necessity since some source models, like the one we present in the next chapter, are “too free” (too weakly regularising) to force the lens to the correct parameter values if it is not close already.

We will be using the Sérsic profile (Sérsic, 1963), which due to its flexibility is able to model efficiently both early- and late-type galaxies (up to details of the spiral structure). Its surface brightness distribution, with 𝑅( ®𝑥) the elliptical radial coordinate from(A.2.1), is given by 𝛽Sérsic( ®𝑥) = 𝛽𝑒exp ( −𝑏𝑛 "  𝑅( ®𝑥) 𝑅𝑒 1/𝑛 − 1 # ) , (2.12)

where 𝛽𝑒 is the surface brightness at the half-light radius 𝑅𝑒. In order to ensure this normalisation, 𝑏𝑛 is related to the Sérsic index 𝑛 by an implicit transcendental equation in terms of the complete and lower incomplete gamma functions; we use a series expansion valid over a broad range of indices 𝑛 (Ciotti & Bertin, 1999):

$b_n :\; 2\gamma(2n, b_n) = \Gamma(2n) \;\Longrightarrow\; b_n \approx 2n - \frac{1}{3} + \frac{4}{405} n^{-1} + \frac{46}{25515} n^{-2} + \mathcal{O}\!\left(n^{-3}\right)$ . (2.13)
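For illustration, equations (2.12) and (2.13) translate directly into code. The following sketch assumes the elliptical radius 𝑅 has already been evaluated on the source-plane grid; the function names are ours, not the framework's.

```python
import torch

def sersic_b(n: float) -> float:
    """Series expansion (2.13) of Ciotti & Bertin (1999) for b_n."""
    return 2 * n - 1 / 3 + 4 / (405 * n) + 46 / (25515 * n**2)

def sersic_brightness(R: torch.Tensor, beta_e: float, R_e: float, n: float) -> torch.Tensor:
    """Equation (2.12): surface brightness at elliptical radius R,
    with beta(R_e) = beta_e by construction."""
    return beta_e * torch.exp(-sersic_b(n) * ((R / R_e) ** (1 / n) - 1))
```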

The Sérsic source, thus, has seven parameters: the four elliptic ones and {𝛽𝑒, 𝑅𝑒, 𝑛}.

2.5 Instrument

Digital images usually represent data measured in "digital numbers", which are a rescaled (in some well-defined way depending on the specifics of an analog-to-digital converter) measurement of the number of electrons that hit the detector. These, then, sans the ones contributed by dark current and readout noise, can be translated into a total photon count via the quantum efficiency of the sensor. These photons imply a total radiant energy, corresponding to the spectral energy density integrated over wavelength after taking into account the filter used (and, again, the wavelength dependence of the efficiency). This spectrum was obtained over an extended period of exposure and ultimately came from a restricted area of the sky, described by the point spread function of the instrument and the pixel shape and size. This, finally, leads to the (spectral) intensity, a quantity describing the power coming from the region around a specific direction in the sky, per unit solid angle (per unit wavelength interval), which we also refer to as surface brightness. If the filter is broad-band, the observed data represents an average over the spectral range.


Fortunately, all those transformations are linear in the incoming spectral intensity, and/or the data is usually calibrated (corrected for noise and other non-linear effects in the conversions, efficiency, optical throughput, etc.), so from our viewpoint the transformation amounts simply to a redefinition of the units of the data. Since an estimate of the variance (the so-called observational uncertainty or data error) is provided as well, we can simply treat the image data as giving us directly the surface brightness¹² that the model predicts.

Thus, the instrument only serves to convolve the output surface brightness of the rest of the model in order to produce the "model fluxes", d, which are the final data generated by the model. Technically, an integral over the whole image plane should be performed for every pixel. However, we approximate it by a numerical convolution from the pixel grid onto the pixel grid, i.e. we compute the "input fluxes" f on the same grid as the output and numerically convolve:

$\mathbf{d} = \int \beta(\vec{\xi})\, \mathrm{PSF}(\vec{\xi}, \vec{\mathfrak{p}})\, \mathrm{d}\vec{\xi} \;\approx\; \operatorname{conv}_i\!\left( \mathbf{f}_i, \mathrm{PSF}(\vec{\mathfrak{p}}_i) \right)$ . (2.14)
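As a sketch of the discrete side of equation (2.14), assuming a pixelised PSF with odd sides (illustrative only, not necessarily how the framework implements the convolution):

```python
import torch
import torch.nn.functional as F

def apply_psf(flux: torch.Tensor, psf: torch.Tensor) -> torch.Tensor:
    """flux: (H, W) model surface brightness on the pixel grid;
    psf: (h, w) kernel with odd sides, normalised to sum to one."""
    weight = psf.flip(-2, -1)[None, None]             # conv2d is a cross-correlation
    pad = (psf.shape[-2] // 2, psf.shape[-1] // 2)    # keep the image size
    return F.conv2d(flux[None, None], weight, padding=pad)[0, 0]
```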

The final step is to model the data uncertainty (associated to the instrument). We assume it is Gaussian with diagonal covariance:

$p(\mathbf{d}_0 \mid \boldsymbol{\Theta}) = \mathcal{G}\!\left( \mathbf{d}_0 - \mathbf{d}(\boldsymbol{\Theta}),\; \varepsilon_0^2 \mathbf{I} \right)$ . (2.15)
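In code, the corresponding log-likelihood is a one-liner (a sketch; eps0 may be a scalar or a per-pixel noise map, broadcast against the model prediction):

```python
import torch

def log_likelihood(d0: torch.Tensor, d: torch.Tensor, eps0: torch.Tensor) -> torch.Tensor:
    """log p(d0 | Theta) for the diagonal Gaussian of equation (2.15)."""
    return torch.distributions.Normal(d, eps0).log_prob(d0).sum()
```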

¹² We may also call it "flux" (and label it "𝑓"...). Based on the above reasoning, this is only slightly incorrect, since we might as well multiply by the constant angular size of pixels, assuming, of course, the pixel grid is regular when


Chapter 3

Source Modelling Using Gaussian Processes

3.1 Overview of Gaussian processes

The notion of a process is an extension of that of a distribution to an infinite number of dimensions, i.e. a process describes probabilities of (real) functions 𝑓( ®𝑥) (defined on an arbitrary vector space, usually ℝⁿ) instead of finite-dimensional (real) vectors f. It is, thus, a useful concept in supervised learning, which aims exactly at learning a function from a (finite) set of training examples. As a probabilistic tool, however, a process is further able to learn an uncertainty from the data.

The Gaussianity of a Gaussian process (GP) comes from the assumption that it describes normally distributed functions. More formally, a Gaussian process is a collection of (an infinite number of) random variables, any finite collection of which is jointly normally distributed, i.e. f = {𝑓( ®𝑥₁), 𝑓( ®𝑥₂), ...} ∼ N, where N indicates a multivariate normal distribution. Thus, a GP acts as a prior which, when presented with training data, provides the posterior probability for any other set of function values.

A Gaussian process is entirely defined by a covariance function that is an extension of the covariance matrix of Gaussian distributions and simply returns the covariance between function values evaluated at two given locations:

$k_{\boldsymbol{\Theta}_\text{GP}}(\vec{x}_1, \vec{x}_2) = \operatorname{cov}\!\left( f(\vec{x}_1), f(\vec{x}_2) \right)$ . (3.1)

Note that the covariance function is assumed to be a function of space only, i.e. independent of the function values. In general, it can be parametrised by a set of (learnable) hyperparameters, which are labelled 𝚯GP.

A variety of covariance functions have been proposed and used¹, but we will restrict ourselves to the most common one, the Gaussian radial basis function (RBF), with a parameter 𝛼₀ setting the overall variance and a correlation kernel Σ. It is given by

$k(\vec{x}_1, \vec{x}_2) = \alpha_0^2 \exp\!\left( -\frac{ (\vec{x}_1 - \vec{x}_2)^T \Sigma^{-1} (\vec{x}_1 - \vec{x}_2) }{2} \right)$ . (3.2)
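For illustration, the RBF covariance (3.2) between two sets of 2D locations can be evaluated as below; the function and variable names are ours and not the framework's.

```python
import torch

def rbf_kernel(x1: torch.Tensor, x2: torch.Tensor,
               alpha0: float, Sigma: torch.Tensor) -> torch.Tensor:
    """Covariance matrix K with K[i, j] = k(x1[i], x2[j]) as in (3.2);
    x1: (N, 2), x2: (M, 2), Sigma: (2, 2)."""
    diff = x1[:, None, :] - x2[None, :, :]                       # (N, M, 2)
    quad = torch.einsum('nmi,ij,nmj->nm', diff, torch.linalg.inv(Sigma), diff)
    return alpha0**2 * torch.exp(-0.5 * quad)
```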

Usually, a Gaussian process is used to infer the distribution of finite-sized vectors, which are arguably infinitely easier to deal with computationally than (infinite-dimensional) functions.

A GP is first conditioned on a set of data d₀ ∼ N(f₀, S₀), with S₀ the observational covariance of the data, which is associated to some locations ®p. Then it is used to make predictions at (possibly different) locations ®p∗, using (Rasmussen & Williams, 2006, p. 16)

$\mathbf{f} \sim \mathrm{GP}_{\boldsymbol{\Theta}_{\mathrm{GP}}}\!\big( \mathbf{f} \mid \{\vec{\mathbf{p}}, \mathbf{d}_0, \mathbf{S}_0\}, \vec{\mathbf{p}}_* \big) \sim \mathcal{N}\Big( \boldsymbol{\mu}(\vec{\mathbf{p}}_*) = K_{\boldsymbol{\Theta}_{\mathrm{GP}}}(\vec{\mathbf{p}}_*, \vec{\mathbf{p}}) \big[ K_{\boldsymbol{\Theta}_{\mathrm{GP}}}(\vec{\mathbf{p}}, \vec{\mathbf{p}}) + \mathbf{S}_0 \big]^{-1} \mathbf{d}_0 ,\; \mathbf{V}(\vec{\mathbf{p}}_*) = K_{\boldsymbol{\Theta}_{\mathrm{GP}}}(\vec{\mathbf{p}}_*, \vec{\mathbf{p}}_*) - K_{\boldsymbol{\Theta}_{\mathrm{GP}}}(\vec{\mathbf{p}}_*, \vec{\mathbf{p}}) \big[ K_{\boldsymbol{\Theta}_{\mathrm{GP}}}(\vec{\mathbf{p}}, \vec{\mathbf{p}}) + \mathbf{S}_0 \big]^{-1} K_{\boldsymbol{\Theta}_{\mathrm{GP}}}(\vec{\mathbf{p}}, \vec{\mathbf{p}}_*) \Big)$ . (3.3)

Here 𝛍 and V are, respectively, the mean and covariance of the prediction, and K is the covariance matrix: $\big[ K_{\boldsymbol{\Theta}_{\mathrm{GP}}}(\vec{\mathbf{a}}, \vec{\mathbf{b}}) \big]_{ij} \equiv k_{\boldsymbol{\Theta}_{\mathrm{GP}}}(\vec{a}_i, \vec{b}_j)$.
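Equation (3.3) amounts to two linear solves against the same factorisation of the conditioned covariance. A minimal sketch (assuming any covariance function with the signature of the rbf_kernel above; again illustrative, not the framework's code):

```python
import torch

def gp_predict(kernel, p, p_star, d0, S0):
    """Predictive mean and covariance of equation (3.3).
    kernel(X1, X2) -> covariance matrix; S0: (N, N) data covariance,
    e.g. torch.diag(eps0**2)."""
    K = kernel(p, p) + S0
    K_star = kernel(p_star, p)                        # (M, N)
    L = torch.linalg.cholesky(K)                      # avoid an explicit inverse
    sol = torch.cholesky_solve(torch.cat([d0[:, None], K_star.T], dim=1), L)
    mu = K_star @ sol[:, 0]                           # predictive mean
    V = kernel(p_star, p_star) - K_star @ sol[:, 1:]  # predictive covariance
    return mu, V

# Usage with the RBF sketch above, hyperparameters fixed:
# mu, V = gp_predict(lambda a, b: rbf_kernel(a, b, alpha0, Sigma), p, p_star, d0, S0)
```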


3.2 Windowing

A particular complication arising in the modelling of multiply-lensed systems is that it is not enough to assign the observed data to zero-sized points in the source plane, as this would disregard the correlations arising from the overlap of the light-collecting areas of different pixels (see figure 3.1). To fully account for the (projected) pixel shapes and sizes, one must introduce a window function that is non-zero only within the light-collecting area in the source plane. Doing this exactly is infeasible even if the projected pixels are assumed to be polygonal (which is not necessarily true²), as it would involve numerous intersection checks and code branching points, which are extremely inefficient on GPUs.

Instead, we opt for a simplified treatment that approximates the window function by a Gaussian located at the pixel centre and aligned with the axes ®𝑎 and ®𝑏 of the pixel (depicted as $\overrightarrow{AA'}$ and $\overrightarrow{BB'}$ in figure 3.1)³.

To fix the various scalings of the window function, labelled 𝑤ᵢ( ®𝑥) for the 𝑖-th pixel, we require that

$\int \mathrm{d}\vec{x}\, w_i(\vec{x}) = 1$ , (3.4)

$\int \mathrm{d}\vec{x}\, w_i^2(\vec{x}) = A_i^{-1}$ . (3.5)

The first requirement simply means that the window function acts as a "contribution density" and is satisfied by any properly normalised Gaussian function. The second condition, where 𝐴ᵢ is the projected pixel area, or an estimate thereof, demands that the window function is similar in size to the pixel and ensures that the variance of the pixel value scales inversely with the light-collecting area of the pixel. For the full derivation, which follows from the conservation of surface brightness, see appendix A.6. It is also shown that these conditions are satisfied when

$w_i(\vec{x}) = \mathcal{G}\!\left( \vec{x} - \vec{p}_i, \boldsymbol{\Sigma}_i \right)$ (3.6)

with

$\boldsymbol{\Sigma}_i = \frac{ \vec{a}_i \vec{a}_i^T + \vec{b}_i \vec{b}_i^T }{ 4\pi }$ . (3.7)
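For illustration, the window covariances (3.7) follow directly from the projected pixel axes; the sketch below assumes the axes are stacked as (N, 2) arrays and uses names of our choosing.

```python
import math
import torch

def window_covariances(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Equation (3.7): Sigma_i = (a_i a_i^T + b_i b_i^T) / (4 pi),
    for projected pixel axes a, b of shape (N, 2); returns (N, 2, 2)."""
    outer = lambda v: v[:, :, None] * v[:, None, :]   # batched outer products
    return (outer(a) + outer(b)) / (4 * math.pi)
```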

Figure 3.1. The projection of two pixels onto the source plane (a). A full treatment should consider their exact shapes and use a window function that is nonzero only inside the shaded areas. Instead, the projections are approximated by ellipses defined by (the projection of) the pixel centres 𝑂ᵢ and the vectors $\overrightarrow{AA'}$ and $\overrightarrow{BB'}$, as shown in (b). The window functions are Gaussians with covariances derived from those ellipses (see the text for the exact definition) and are depicted in (c).

² Indeed, pixels containing critical lines can be infinitely distorted, cut in pieces, etc.

³ In the actual implementation only the pixel centres were projected to the source plane, which made it impossible to use the midsegments as illustrated. Instead, the axes ®𝑎 and ®𝑏 were calculated as half the distance between the two adjacent pixel centres along the respective axes of the grid.
