
MSc Physics and Astronomy

Theoretical physics

Master Thesis

Exploring applications of machine learning to jet discrimination

by

Wouter Christiaan Schreuder

Student nr. 10889515

University of Amsterdam & Vrije Universiteit Amsterdam

Size 60 EC.

Conducted between 22-04-2020 and 09-03-2021

Supervisor/Examiner: dr. W.J. Waalewijn

Examiner: dr. J. Rojo


Abstract

Improving our understanding of emitted QCD radiation (jets) in high-energy particle collisions is of vital importance in modern particle physics. Sophisticated data-analysis tools can be used to find patterns that are difficult to predict from first principles. Since machine learning tools have outperformed conventional data analyses in other fields, this study searches for new ways to analyse jet observables in high-energy particle collisions with machine learning tools. We simulated the data of Higgs-boson, W-boson, quark- and gluon-jets with PYTHIA. The observables pT, mass, ECF, N-subjettiness, pT-ratio, subjet angle, number of constituents, number of charged particles, and number of b-quarks were analysed with a logistic regression, a neural network, a random forest, and a conventional approach. We found that the random forest outperforms the neural network, the logistic regression and the conventional approach.


Contents

1 Introduction
2 Jet Physics
   2.1 Jet algorithms
   2.2 Jet observables
      2.2.1 Energy-correlation function
      2.2.2 N-subjettiness
      2.2.3 Subjet-angle and transverse momentum ratio
   2.3 Jet discrimination
      2.3.1 ROC-curve
   2.4 Jet simulation
3 Machine Learning
   3.1 Reinforcement learning
   3.2 Unsupervised learning
      3.2.1 Clustering
   3.3 Supervised machine learning
      3.3.1 Bias-variance trade-off
      3.3.2 Minimizing cost function
      3.3.3 Logistic regression
      3.3.4 Neural Networks
      3.3.5 Backward propagation
      3.3.6 Hyperoptimization
      3.3.7 Decision tree
4 Data Simulation
   4.1 Event generation
   4.2 Event selection
      4.2.1 Higgs vs QCD
      4.2.2 W-boson vs QCD
      4.2.3 Quark vs gluon
   4.3 Jet observables
5 Results
   5.1 Random forest
   5.2 Neural network
   5.3 Logistic regression
   5.4 Classical analysis
   5.5 Comparison
6 Summary & outlook
   6.1 Acknowledgements
A Code documentation
   A.1 Pythia
      A.1.1 Used packages
      A.1.2 General settings
      A.1.3 Event settings
   A.2 FastJet
      A.2.1 Used packages
   A.3 Machine Learning
      A.3.1 Used packages
B Dimensional reduction
C Additional figures & tables
   C.1 Distributions Higgs vs QCD
   C.2 Distributions quark vs gluon


Chapter 1

Introduction

The Standard Model has, since its formulation in the 1970s, been the paradigm of particle physics. Although the model encapsulates the effects of quantum mechanics and special relativity, it fails to provide a comprehensive understanding of gravity. Besides that, it is descriptive but not self-explanatory, in the sense that it is unknown how the values of certain parameters emerge, and what mechanism fixes the number of elementary particles. There has been extensive empirical research in experimental physics for new particles to add to the Standard Model, with the discovery of the Higgs boson as the most recent success. The goal of the Standard Model is to form an irreducible formulation of every physical interaction. To be an irreducible formulation, two criteria need to be fulfilled: first, the interactions it describes should not be reducible to more basic interactions; second, the particles it describes should not be reducible to more elementary particles.

The first criterion is partially met: there are four fundamental forces proposed that are currently not reducible to more basic interactions, namely the strong nuclear force, the weak nuclear force, electromagnetism and gravity. This does not mean that they are in principle irreducible. Grand unification theories suggest that they originate from a single force, but no irrefutable evidence for this has yet been found. Three of these forces (gravity excluded) are currently represented in the Standard Model. This is done by the force-carrying particles, categorized as bosons. The five known bosons are the gluon, photon, W-boson, Z-boson and Higgs boson. The Higgs boson is actually not a force carrier, but does explain how particles can obtain their mass [1]. The only thing they all have in common is their integer spin (0 for the Higgs, 1 for the others). The photon and gluon represent respectively electromagnetism and the strong nuclear force, and are the only massless particles in the model. The W- and Z-boson both carry the weak nuclear force, although they differ in that the W-boson carries an electric charge of ±1, while the Z-boson is neutral.

The second criterion is met in the sense that there are no forms of matter extractable for empirical research that are not describable with the 12 fermions, and that no experimentally verifiable substructure has ever been found for these fermions. The restraint in this acknowledgement comes from a variety of open questions that the Standard Model cannot yet answer. These open questions are mainly about how the structure of the universe relates to the postulates of the Standard Model, and about phenomena in the universe that cannot be observed in an isolated empirical setting. An example of the former is the matter-antimatter asymmetry, and an example of the latter is the elementary content of black holes. Substructures have also been proposed, most notably string theory, but none of them has yet been verified.

The matter particles all have in common that they have half-integer spin. They are further categorized by the type of interactions that they undergo, their charge and their mass. While all of them have weak and electromagnetic interactions, only half of them have strong interactions. The 6 fermions which do undergo strong interactions are quarks, and the 6 which do not are leptons. Furthermore, the categorization on charge splits the leptons into 3 electron-like particles (with charge -1) and 3 neutrinos (charge 0), while the quarks are split into 3 up-like (charge +2/3) and 3 down-like (charge -1/3). The last distinction, based on their mass, classifies them into 3 generations, each of them containing two quarks (one up-like and one down-like) and two leptons (one electron-like and one neutrino). The masses of the particles grow by an order of magnitude for each generation, known as the mass hierarchy, which causes the particles of the third and second generation to decay to first-generation particles. This explains why all commonly observed matter is composed of first-generation particles. As stated before, there is not yet an explanation for how these parameter values change over the different generations.


Another difference between bosons and fermions is that each fermion has a corresponding antiparticle, with equal mass but opposite quantum numbers, while bosons do not have a corresponding antiparticle.

Figure 1.1: Standard Model

QFT The mathematical framework that translates the postulates of the Standard Model into empirical predictions is quantum field theory (QFT). It does so by formulating the cross-section (probability amplitude squared) of the transition of a certain initial state to a certain final state, initiated by an interaction of one of the fundamental forces. Since these forces of the Standard Model are irreducible, it would in theory not be possible to unify them in a single description. Therefore a summation of multiple QFTs is necessary to get a comprehensive description of particle physics. Before getting into a more detailed description of those theories, it is first necessary to outline the general boundary conditions of QFT. Just as in special relativity, energy and momentum need to be conserved between the initial and final state, and the same holds for electric charge and color charge. Also, we know that the particles in the initial and final state need to satisfy the energy-momentum relation:

E^2 = (|\vec{p}\,| c)^2 + (m_0 c^2)^2 \qquad (1.1)

We therefore know that the initial- and final-state particles are on the mass shell. The term "mass-shell" refers to the surface of possible solutions of (1.1) in a plot with m, |\vec{p}| and E as axes. This relationship is derived from the Euler-Lagrange equations, so this criterion could alternatively be formulated as the requirement that all initial- and final-state particles satisfy the equations of motion. But besides final- and initial-state particles, QFT postulates the existence of virtual particles. These particles are the intermediate states between the initial and final states, also known as "propagators", and do not need to satisfy the equations of motion. They are therefore allowed to be off-shell. These virtual particles can have different masses than their real counterparts, but cannot actually be observed, since their existence is limited by the uncertainty principle. The probability amplitude for their existence depends on how far off-shell they are.

The first field-theoretical description of the Standard Model particles explained the interaction between photons and electrically charged particles, and was therefore called quantum electrodynamics (QED). This explanation of electromagnetism was historically the first QFT that made accurate predictions, and led to the construction of the famous Feynman diagrams. It served as a template for all subsequent QFTs. Subsequently, the weak force, mediated by W- and Z-bosons, was described in quantum flavour dynamics [2]. The electroweak theory unified both the weak nuclear force and the electromagnetic force into a single formulation. It showed that both forces are just emergent phenomena of the same force at low energy scales. The two forces would merge at an energy of 195.5 GeV [3], which was approximately the energy density of the universe at 10^-12 s after the Big Bang [4]. Technically, the earlier statement that the four fundamental forces are irreducible was a deliberate inaccuracy to improve readability. Electroweak theory differs from just a combination of the two mentioned field theories, which together would form a redundant description of the physics.

Besides the usual quantum numbers like spin and electric charge, quarks and gluons have another quantum number that plays an important role in their physical properties, called color. This number is based on what kind of color charge a system or particle possesses. This color charge is a mathematical representation of the kinds of strong interactions that they undergo. There are 6 different color charges, namely red, blue, and green for ordinary particles, and anti-red, anti-blue, and anti-green for their antiparticles. There is no absolute configuration for which quark corresponds to which color, in the sense that you cannot state that a charm quark is always blue, for example. It is merely a useful metaphor to visualize the strong interactions between them. All the different color configurations for both quarks and gluons are visualized in figure 1.2. Quarks carry only one color charge, whereas gluons carry both a color and an anti-color charge. This would suggest that there are 9 types of gluons, but since the red-antired, blue-antiblue, and green-antigreen combinations are colorless, these combinations can only appear in superpositions. Since only two of them are linearly independent, there are 8 instead of 9 gluons. Color-charged particles cannot be observed directly, but their existence is derivable from certain composite structures that they form. This phenomenon is called color confinement, and is not yet understood from an analytical point of view [5].

Figure 1.2: quarks and gluons

These composite structures are restricted to colorless combinations, called hadrons, which can be either all three anti-colors (forming anti-baryons), all three normal colors (forming baryons), or a color with its corresponding anti-color (forming mesons). A notable exception to this are the recently discovered pentaquarks [6], which are composed of 5 quarks. The process where quarks and gluons cluster together is called hadronization. Since color-charged particles like quarks and gluons cannot be observed directly (note: theoretically these particles would be observable above the Hagedorn temperature of 1.7 × 10^12 K), there is need for a theory that translates the physics of these particles via hadronization to experimental observations. The problem is that hadronization itself is not yet fully understood from a theoretical point of view. Because it is a non-perturbative process, it is not possible to calculate it from first principles. This non-perturbativeness arises from the divergence of the coupling constant of the QCD Lagrangian in the low-energy regime, which will be explained in more detail later on. The QCD Lagrangian is formulated as

\mathcal{L}_{QCD} = \bar{\psi}_i \left( i (\gamma^\mu D_\mu)_{ij} - m\,\delta_{ij} \right) \psi_j - \frac{1}{4} G^\alpha_{\mu\nu} G^{\mu\nu}_\alpha , \qquad (1.2)

G^\alpha_{\mu\nu} = \partial_\mu A^\alpha_\nu - \partial_\nu A^\alpha_\mu + g f^{\alpha bc} A^b_\mu A^c_\nu , \qquad (1.3)

where \psi are the quark fields, G^\alpha_{\mu\nu} is the gluon field strength tensor, in which A^\alpha_\nu denotes the gluon fields, with \nu as spacetime index and \alpha as color index, and f^{\alpha bc} and g are respectively the SU(3) structure constants and the coupling constant. The term "coupling constant" implies that it is a constant, while it is in fact dependent on the energy scale. The function that describes how the coupling constant varies with the energy scale \mu is known as the beta function and is defined as \beta(g) \equiv \partial g / \partial \log(\mu). For the QCD coupling constant, it is at one loop given by:

\beta(g) = -\left( 11 - \frac{n_s}{6} - \frac{2 n_f}{3} \right) \frac{g^3}{16\pi^2} \qquad (1.4)

where n_f is the number of flavours, and n_s the number of scalar coloured bosons. Because the number of coloured scalar bosons is zero in the Standard Model, we only have to deal with the number of flavours. Each flavour corresponds to a different type of quark, so n_f = 6. It is easy to see that for every n_f ≤ 16, β(g) ≤ 0.


Since this derivative of the coupling with respect to the energy is always negative in the current Standard Model, one can see from here that the coupling constant becomes weaker at high energy scales, and stronger at low energy scales. Mathematically, it decreases as

\alpha_s(\mu) \equiv \frac{g_s^2(\mu)}{4\pi} \approx \frac{1}{\beta_0 \ln\left( \mu^2 / \Lambda^2 \right)} \qquad (1.5)

where Λ = 218 ± 24 MeV, also known as the QCD scale. This phenomenon gives rise to two properties that require further explanation: asymptotic freedom, and the previously mentioned divergence at low energy scales.
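As a concrete illustration of (1.5), the following minimal Python sketch evaluates the one-loop running coupling at a few scales. The normalization β₀ = (33 − 2n_f)/(12π) and the inputs n_f = 5 and Λ = 218 MeV are assumptions made here for illustration only.

```python
import numpy as np

def alpha_s_one_loop(mu_gev, n_f=5, lambda_qcd_gev=0.218):
    """One-loop running coupling of eq. (1.5), assuming beta_0 = (33 - 2 n_f) / (12 pi)."""
    beta_0 = (33.0 - 2.0 * n_f) / (12.0 * np.pi)
    return 1.0 / (beta_0 * np.log(mu_gev**2 / lambda_qcd_gev**2))

# Asymptotic freedom: the coupling shrinks as the energy scale grows,
# and blows up as mu approaches the QCD scale Lambda.
for mu in [1.0, 10.0, 91.2, 1000.0]:
    print(f"mu = {mu:7.1f} GeV  ->  alpha_s ~ {alpha_s_one_loop(mu):.3f}")
```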

Asymptotic freedom means that the interaction strength between particles gets asymptotically weaker as the energy scale increases and the corresponding length scale decreases [7]. A visualization of this can be found in figure C.1. This plays a crucial role in QCD, because it allows us to perform perturbative calculations at small distances and high energies. Since these are the circumstances under which elementary particles get produced, it makes it possible to express the particle decay processes analytically.

At low energies the coupling diverges, implying that if the spatial separation between two quarks increases, the strength of the gluon-mediated interaction diverges [7]. This implies that an ever-increasing amount of energy is required to separate them, which makes it at some point energetically favourable for a quark to create a new antiquark to cancel the strong interactions. This process is the driving force behind hadronization. A notable exception to this process is the top quark. Due to its large mass, it has a mean lifetime as short as 5 × 10^-25 s, causing it to decay before it can hadronize. As mentioned before, no model has yet been found that gives an analytical explanation of this process from fundamental theory. Instead there are phenomenological models that do not contradict fundamental theory, but are also not derivable from it. These models merely aim to parametrize a set of variables in such a way that it is consistent with empirical observations. One such model, also used in PYTHIA, is the Lund string model [8]. In this model, gluons are depicted as strings between a quark and an anti-quark. As energy is provided to the system, the strings get stretched out, until the point where the tension is so high that it is energetically favourable to create a new quark anti-quark pair. Despite similarities in jargon, this model has no connection to string theory.

Jets Despite the above-mentioned limitations, it is still possible to make mathematical predictions about processes in the low-energy regime. This is caused by a factor in the QCD splitting-function cross-sections, related to the relative orientation of the four-momenta. When measured in the rest frame of the decaying particle, the divergences in the cross-section for QCD decays occur when the squared sum of the four-momenta of the decay products is zero. For example, for the splitting of a quark into a quark and a gluon [9]:

\frac{i t^a \gamma^\mu (\slashed{p}_1 + \slashed{p}_2)}{(p_1 + p_2)^2} \qquad (1.6)

It is easy to see that divergences occur when:

(p_1 + p_2)^2 = m_1^2 + m_2^2 + 2 E_1 E_2 - 2 |\vec{p}_1||\vec{p}_2| \cos\theta \approx 0 \qquad (1.7)

In the high-energy limit, the contribution of the masses to the energy is negligibly small compared to the contribution of the momenta. Neglecting the masses, so that m_i^2 = 0 and E = |\vec{p}|, (p_1 + p_2)^2 reduces to 2E_1E_2 - 2E_1E_2\cos(\theta) = 2E_1E_2(1 - \cos(\theta)), so that (1.6) reduces to:

\frac{i t^a \gamma^\mu (\slashed{p}_1 + \slashed{p}_2)}{2 E_1 E_2 (1 - \cos(\theta))} \qquad (1.8)

It is easy to see that here the divergences occur when E_1 or E_2 = 0, which is known as the infrared (or soft) limit, and when θ = 0, known as the collinear limit. Please note that other QCD decays might have slightly different cross-sections, but that these divergences are a general phenomenon in them. More theoretically, this divergence can be understood as a propagator that lives on shell, and whose existence therefore is no longer limited by the uncertainty principle. Actually, the term "propagator" is somewhat misleading, since it has become an ordinary particle. So there are two possibilities in a 1 → 2 decay process to preserve momentum conservation for the propagator/decay particle: either the decay products are completely aligned with each other, or one of the two has a three-momentum that approaches zero. This implies that above a certain momentum threshold all the decay products of a particle end up in a narrow geometrical region. When this procedure is repeated multiple times, you can imagine a whole spray of particles that all originate from the same ancestor. This is called a jet.

p_{ancestor}^2 = p_{Jet}^2 = \left( \sum_{i \in Jet} p_i \right)^2


Please note that this relation is just an approximation, since it neglects particle masses and the fact that propagators in these decay chains can be off-shell, although their probability amplitude is then much smaller. In jet physics, this narrow cone of hadrons (called the jet) is analysed in order to gain knowledge of the elementary particles that originally caused it to form. In order to do so, we first need a clear definition of a jet. Secondly, there is need for measurable observables that can tell us which kind of elementary particle produces which kind of jet. The jet definition should be infrared and collinear safe, which means that it should be invariant under those emissions. Although jet observables do not by definition need to be infrared and collinear safe, it is in many cases preferable.
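To make the soft and collinear limits of (1.7) and (1.8) tangible, the short numerical sketch below (with invented energies and helper names, assuming numpy) evaluates (p1 + p2)^2 for two massless four-momenta as the opening angle or the second energy goes to zero.

```python
import numpy as np

def four_momentum(energy, theta, phi=0.0):
    """Massless four-momentum (E, px, py, pz) with polar angle theta to the z-axis."""
    return np.array([energy,
                     energy * np.sin(theta) * np.cos(phi),
                     energy * np.sin(theta) * np.sin(phi),
                     energy * np.cos(theta)])

def mass_squared(p):
    """Minkowski norm E^2 - |p|^2 of a four-momentum (or of a sum of four-momenta)."""
    return p[0]**2 - np.sum(p[1:]**2)

p1 = four_momentum(50.0, 0.0)              # hard parton along the z-axis
for theta2 in [1.0, 0.1, 0.01, 0.001]:     # collinear limit: theta -> 0
    p2 = four_momentum(30.0, theta2)
    print(f"theta = {theta2:6.3f}      ->  (p1+p2)^2 = {mass_squared(p1 + p2):10.4f} GeV^2")

for e2 in [10.0, 1.0, 0.1, 0.01]:          # soft limit: E2 -> 0 at a fixed angle
    p2 = four_momentum(e2, 0.5)
    print(f"E2 = {e2:6.2f} GeV  ->  (p1+p2)^2 = {mass_squared(p1 + p2):10.4f} GeV^2")
```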

Particle colliders The previously mentioned experimental searches for new particles are carried out with high-energy particle colliders. In these colliders, beams of particles collide with each other at high energies, while a cylindrically shaped detector with bins of a specific width absorbs the QCD radiation emitted by these collisions and measures its transverse momentum, defined in the rest frame of the collision as:

p_T = \sqrt{p_x^2 + p_y^2} \qquad (1.10)

This QCD radiation consists of hadrons, bosons and leptons. The way this radiation is analyzed will be explained in more detail in Chapter 2. The detector absorptions can be visualized in polar coordinates by a graph where the pseudorapidity, given by

\eta = \frac{1}{2} \log\left( \frac{|\vec{p}\,| + p_z}{|\vec{p}\,| - p_z} \right) = -\log\left( \tan\frac{\theta}{2} \right) \qquad (1.11)

is displayed on the x-axis, and the azimuthal angle φ is on the y-axis. Note that it is implicitly assumed that the beam is along the z-axis. This pseudorapidity is derived from the original definition of the rapidity, y = \frac{1}{2}\log\left( \frac{E + p_z}{E - p_z} \right). At the energy scales of the collider (or for massless particles in general) E ≈ |\vec{p}|, making the rapidity and pseudorapidity converge to each other. The reason pseudorapidity is preferred over the rapidity is that it has a direct connection to the angle θ between the particle's trajectory and the beam axis. The pseudorapidity is thus a measure of how strongly the trajectory of a particle diverges from the beam axis. If a particle is perpendicular to the beam axis η = 0, and if it is parallel to the beam η = ∞. Most detectors detect particles up to η = 3. Each point on this plane relates to a specific bin of the detector. This format shows the geometrical distance between the components in a straightforward way. It is important to note that an R = 1 surface at η = 2.5 is in Cartesian coordinates larger than an R = 1 surface at η = 0. For this reason the width of the bins in a detector varies with respect to their position relative to the collision [10].

Figure 1.3: An image of a detector

Each bin has a certain threshold for the amount of energy it can measure. For example, in the LHC the detector bins only measure energy deposits above 3.5 GeV for hadrons and 2.5 GeV for photons. These two thresholds have slightly evolved (mainly increased) over time [10]. This experimental setup makes the measurements inherently infrared and collinear safe: the infrared safety is provided by the energy threshold, and the collinear safety is provided by the width of the bins. Because most processes of scientific interest occur at a high energy scale, there needs to be sufficient kinetic energy in the colliding beams. This causes the beam of emitted particles to narrow down. Qualitatively, this can best be understood from the assumption made in the derivation of the collinear limit: that the energy contribution of the masses can be neglected compared to the three-momenta. The more this assumption holds, the more collinear the jet will be.
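As a quick numerical check of (1.11) and its relation to the beam angle, the following sketch (assuming only numpy, with an invented momentum) compares the pseudorapidity, the −log tan(θ/2) form, and the rapidity of a massless particle.

```python
import numpy as np

def pseudorapidity(px, py, pz):
    """Pseudorapidity eta = 0.5 * log((|p| + pz) / (|p| - pz)), eq. (1.11)."""
    p = np.sqrt(px**2 + py**2 + pz**2)
    return 0.5 * np.log((p + pz) / (p - pz))

def rapidity(energy, pz):
    """Rapidity y = 0.5 * log((E + pz) / (E - pz))."""
    return 0.5 * np.log((energy + pz) / (energy - pz))

# For a (nearly) massless particle E ~ |p|, so rapidity and pseudorapidity coincide,
# and eta is simply -log(tan(theta/2)) with theta the angle to the beam (z) axis.
px, py, pz = 3.0, 4.0, 12.0                  # |p| = 13 GeV, invented for illustration
theta = np.arccos(pz / np.sqrt(px**2 + py**2 + pz**2))
print(pseudorapidity(px, py, pz))            # ~1.61
print(-np.log(np.tan(theta / 2.0)))          # same value
print(rapidity(13.0, pz))                    # equal to eta for a massless particle (E = |p|)
```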

Although we lack an analytic understanding of the hadronization process, we do have a field-theoretical description of the decay processes that occur right after the collision, when asymptotic freedom still holds. This enables us to predict some of the features that the final jet needs to possess if it originates from a certain particle. For example, we know that the most dominant decay process of the Higgs boson [11] is H⁰ → bb̄. This makes it reasonable to assume that in the jet of a Higgs boson with the right momentum two smaller subjets will occur: one with b as ancestor and the other one with b̄ as ancestor. If the Higgs jet is too boosted, both subjets will overlap due to the collinear limit. On the other hand, when the Higgs is not boosted enough, both subjets will diverge from each other, and appear as two separate jets. But to conserve momentum in the whole collision, there is also need for another jet, originating from another particle, pointing in the opposite direction. So to detect the Higgs jet, we need to find a substructure that discriminates it from other jets. This other jet will most likely consist of QCD radiation. Since the most copiously produced particles in proton-proton collisions are gluons and up- and down-quarks, it would be scientifically relevant to find a way to discriminate them from other jets. Another important reason is that they often appear in the decay chains of other particles, like the Higgs boson. Because of the different color charges, gluons interact and hadronize differently [12], causing gluon jets to have more constituents, become wider, and have a more uniform energy distribution over their constituents, while quark jets are more likely to be narrow jets with fewer constituents that carry a less equal fraction of the total energy.

One of the main questions of jet physics is thus how to deduce knowledge of the collisions from these absorption spectra of the detector. Improving our understanding of the structure of the QCD radiation emitted by particle collisions is therefore an important subject in high-energy particle physics. To this end, this study will investigate possibilities to apply machine learning to jet physics. The particle data of the collisions will be simulated with PYTHIA, and the jet clustering will be done by FastJet. Chapter 2 will give an outline of all physics- and experiment-related issues in the area of jet research. Chapter 3 will give a general overview of the different forms of machine learning, and their advantages and disadvantages. Chapter 4 will be dedicated to the setup of the simulations, and in Chapter 5 the results will be shown and interpreted. Chapter 6 will give a summary and an outlook for further research.


Chapter 2

Jet Physics

For scientific convenience, it would be best if there was just one collision at a time, with a single produced particle that goes straight into the detector. But life is not made to be easy. As mentioned in Chapter 1, the particles decay and hadronize before being measured by the detector. Besides that, a couple of processes occur that provide challenging barriers to interpreting the detector spectra. Some of these processes are related to the technical features of the collider, and some of them are of a more theoretical nature. Besides that, there are also computational problems related to the large amounts of data that need to be digested. All processes that occur between the moment where two particles are launched towards each other and the data analysis can be categorized into 5 different stages: pre-collision, collision, decay, hadronization, and measurement. Please note that this categorization is made to provide a qualitative understanding of the scattering process, and could be disputed for not resembling the actual physics. Each stage provides us with different challenges. This section aims to give an overview of all these challenges, and how to overcome them.

Let's start with the pre-collision stage. First of all, there are not just two incoming particles but two bunches of particles. This excess of collisions is caused by the fact that the cross-sections for the formation or decay of particles of interest in the energy range of the collider (14 TeV) are so small that an extremely large number of collisions is needed to generate an interesting signal. The only contamination that already occurs in the pre-collision stage is the initial state radiation. This is radiation emitted by the partons before they interact with each other. This gives rise to two complications: the colliding particles lose momentum, and the radiation is a form of contamination in the detector that needs to be filtered out.

In the second stage, the collision of the beams, there are four possible scenarios for each particle collision. In the first scenario, the protons miss each other and there is no collision at all. In the second scenario, nothing happens in terms of particle creation/annihilation, and the particles just scatter off each other. This scattering causes them to diverge from the beam axis, and thus end up in the detector. This form of contamination is known as elastic scattering. Most of it will be found in the large pseudorapidity (1.11) limit. This can be explained by momentum transfer: particles in the large pseudorapidity limit have a small angle θ with respect to the beam axis, so the relative change in momentum (i.e. the momentum transfer in the collision) must have been quite small. This small momentum transfer makes the probability amplitude for the creation of new particles negligibly small, making the particles just scatter off. The third scenario that can happen is that new particles are created in the collision, but not the particles of interest. Which kinds of particles are of interest and which are not depends of course on the type of research. The QCD radiation produced in these collisions contaminates the signal of the particles of interest. All contamination originating from the excess in collisions is collectively called pile-up. In the fourth scenario, a particle of interest is produced in the collision, and this is the source of the signal. In case of scenario three or four, there is also a possibility that the remnants of the colliding particles interact with each other as well. Multiple events could thus be generated from a single collision, a phenomenon known as multiple parton interactions. This can be another form of contamination.

The third stage is the decay stage. This decay can be subdivided into two categories: the decay processes where the particle only loses momentum by emission of gluons or photons, and the decays where a particle splits up into truly different particles. The former is an example of final state radiation. Final state radiation is a collective term for all gluon or photon emissions (and their offspring) that are not due to particle annihilation, and occur after or during the collision. Note that the latter category only holds for particles with a short lifetime, and does not necessarily occur for all particles. An up- or down-quark will not decay into other particles. This decay into truly different particles has implications for the geometric distribution of the signal in the detector. Due to our understanding of the cross-sections for particle decays and the notion of the soft and collinear limits, we can forecast the effects of this process on the signal. This theoretical understanding also makes it (to some extent) possible to identify b-quarks that decayed or hadronized before reaching the detector [13]. Besides theoretical understanding, there is also an important technical feature of the collider that provides important information for this identification. Due to a magnetic field in the collider, the trajectory of charged particles in the event gets curved. Particles with a large enough lifetime travel a measurable distance from the collision point before they decay, giving rise to displaced tracks of their offspring which can form secondary vertices [13]. Most particles other than b-quarks are unfit for these kinds of measurements, because their lifetime is either too short, making the displacement of the trajectories of their offspring too small to measure, or too long, preventing them from decaying before reaching the detector. The fourth stage, the hadronization, is more problematic from a theoretical point of view. As mentioned in Chapter 1, we lack fundamental understanding of how and why this process occurs. The only way we can incorporate these effects is with advanced models that mimic experimental observations. The fifth stage, the measurement, is a purely technical story. The two features of the calorimeter cells that influence the resolution of the spectra are their width and their energy threshold. The width of the bins determines the size of the pixels in the rapidity-phi plane, and the energy threshold is a cut that particles need to pass in order to be measured. It can thus be seen as a practical way of removing soft radiation.

All the contamination described above leads to two related issues: an amount of data that is so large that even the most powerful supercomputer cannot fully digest it, and a huge amount of contamination that blurs the signal. The blurring of the signal is not necessarily proportional to the amount of contamination, but is related to the distribution of the contamination. If all contamination were uniformly spread out over the detector, it would not cause any problem in the signal detection, since the peak of the signal would still be clearly visible in the detector spectra. However, small fluctuations in the distribution cause larger peaks for large amounts of contamination than for small amounts of contamination, so it is not completely uncorrelated. The computational issues are a more fundamental problem.

2.1 Jet algorithms

Before getting into observables useful for discriminating jets, we first need a clear definition of how we cluster particles into jets. The crucial factor here is to find an expression for the likelihood that two particles originate from the same mother particle. For this purpose there is a wide variety of jet-clustering algorithms, which are all based on the geometrical positions and, except for the Cambridge/Aachen algorithm, the transverse momenta of the particles. The difference between these algorithms lies in their choice of parameter values. The criteria that each clustering algorithm needs to satisfy are that it should be invariant under soft or collinear emissions, and that it should be as insensitive as possible to all the contamination. This latter criterion is not just a practical feature needed because we cannot model or filter the garbage out. Also from a theoretical point of view, it would make no sense if a jet of the order of hundreds of GeV were to change radically just because a 1 GeV gluon is added to the system. In these algorithms [14], the starting point is always a list of all the particles that ended up in the detector. For each particle, its transverse momentum and position in the detector are stored in the list. The algorithm checks which combination of two particles has the lowest value of d_ij

d_{ij} = \min(p_{T,i}^{2\alpha}, p_{T,j}^{2\alpha})\, \Delta R_{ij}^2 \qquad (2.1)

in which the geometrical distance between two particles is defined as

\Delta R_{ij} = \sqrt{(\Delta\eta_{ij})^2 + (\Delta\phi_{ij})^2} \qquad (2.2)

and subsequently removes the two particles from the list and places a recombined object back into the list. The most widely used recombination scheme for this is the "E-scheme", which simply sums the components of the two four-vectors. This form of recombination provides infrared safety to the algorithm: the addition of a soft particle will have a much lower impact on the recombined object than the addition of a hard particle. Each step thus makes the list one element shorter. This process is repeated until either

\min(p_{T,i}^{2\alpha}, p_{T,j}^{2\alpha})\, \Delta R_{ij}^2 > p_{T,j}^{2\alpha} R^2 \qquad (2.3)

or

\min(p_{T,i}^{2\alpha}, p_{T,j}^{2\alpha})\, \Delta R_{ij}^2 > p_{T,i}^{2\alpha} R^2 \qquad (2.4)


Figure 2.1: Three different clusterings of the same event, with (a) the kt, (b) the Cambridge-Aachen, and (c) the anti-kt algorithm [15]

in which R is the predetermined jet radius, and α is a free parameter that weights the momentum. When this point is reached, all the particles that were merged together into the object p_{T,i/j}^{2α}R² that yielded the minimum form the jet, and are removed from the list. This process is repeated until the list is empty. It is important to notice that the ΔR_ij term provides us with collinear safety: as the angle θ between two particles goes to zero, ΔR_ij will also go to zero, and the particles will be merged into a combined object. If there are no further restrictions on the jet definition, events where soft radiation is emitted will always appear to have a lot of jets. Most of these jets will consist of very few particles with very low momenta, and are of no scientific relevance. To bypass this, it is common to place a minimum pT threshold in the jet definition. This makes sure that only jets with a combined momentum above that threshold are regarded as jets. All the particles in jets below that threshold are not considered. A toy implementation of this clustering procedure is sketched below.
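The following Python sketch implements the sequential-recombination logic of (2.1)-(2.4) in a deliberately simplified form: it uses the equivalent convention in which d_ij is divided by R² and compared against a beam distance p_T^{2α}, and it recombines objects with a pT-weighted average rather than the full E-scheme. It is an illustration only; the actual clustering in this thesis is done with FastJet.

```python
import numpy as np

def delta_r2(a, b):
    """Squared geometrical distance (2.2) in the (eta, phi) plane."""
    dphi = abs(a["phi"] - b["phi"])
    dphi = min(dphi, 2.0 * np.pi - dphi)            # wrap the azimuthal angle
    return (a["eta"] - b["eta"]) ** 2 + dphi ** 2

def d_ij(a, b, alpha, R):
    """Inter-particle distance (2.1); alpha = 1 (kt), 0 (Cambridge-Aachen), -1 (anti-kt)."""
    return min(a["pt"] ** (2 * alpha), b["pt"] ** (2 * alpha)) * delta_r2(a, b) / R ** 2

def d_beam(a, alpha):
    """Beam distance; comparing d_ij against it plays the role of the stopping rules (2.3)/(2.4)."""
    return a["pt"] ** (2 * alpha)

def cluster(particles, alpha=-1, R=0.4, pt_min=20.0):
    """Toy sequential recombination; real analyses use FastJet instead."""
    objs = [dict(p) for p in particles]
    jets = []
    while objs:
        pairs = [(d_ij(objs[i], objs[j], alpha, R), i, j)
                 for i in range(len(objs)) for j in range(i + 1, len(objs))]
        beams = [(d_beam(objs[i], alpha), i, None) for i in range(len(objs))]
        d, i, j = min(pairs + beams, key=lambda t: t[0])
        if j is None:                      # closest to the beam: promote the object to a jet
            jet = objs.pop(i)
            if jet["pt"] >= pt_min:        # minimum-pT cut on the final jets
                jets.append(jet)
        else:                              # recombine i and j (simplified scheme:
            a, b = objs[i], objs[j]        # pT-weighted position, summed pT)
            w = a["pt"] + b["pt"]
            merged = {"pt": w,
                      "eta": (a["pt"] * a["eta"] + b["pt"] * b["eta"]) / w,
                      "phi": (a["pt"] * a["phi"] + b["pt"] * b["phi"]) / w}
            objs.pop(j)                    # remove the higher index first
            objs.pop(i)
            objs.append(merged)
    return jets

event = [{"pt": 100.0, "eta": 0.10, "phi": 0.20},   # two hard, nearby particles
         {"pt": 80.0,  "eta": 0.15, "phi": 0.25},
         {"pt": 1.0,   "eta": 2.00, "phi": 3.00}]   # a soft, far-away particle
print(cluster(event))                               # one anti-kt jet with pT ~ 180
```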

The difference between the algorithms lies in the value of the parameter α. For the kt-algorithm α = 1, for the Cambridge-Aachen algorithm α = 0, and for the anti-kt algorithm α = −1. For α = 1, the particles with low pT are clustered earlier than the particles with high pT in the jet clustering process. For α = 0, the only variable is ΔR_ij, and there is no discrimination on momenta in the clustering sequence. α = −1 favours particles with high pT. Although the clustering process is iterated until every particle is assigned to a jet, it still matters for the shape and areas of the jets whether you start clustering with soft particles, hard particles, or the particles that are closest to each other. To see this, it is good to take a look at a practical example. Let's consider an event with a few hard particles and a background of many soft radiation particles, and cluster it with the three different algorithms with the same R-value. As can be seen in figures 2.1a, 2.1b, and 2.1c from [15], the number of jets and the assignment of the hard particles to each jet is essentially invariant under the choice of the algorithm. What does change for each algorithm is the assignment of the soft radiation. The anti-kt algorithm provides cone-like shapes for each jet, while the jets of the kt- and Cambridge-Aachen algorithms have a much more amorphous shape. This is straightforwardly explainable if we take a closer look at the clustering process. Consider the case where particle 1, with transverse momentum p_{T,1}, is the hardest particle in an event with a uniform soft-radiation background. The first step in the anti-kt algorithm would be to cluster this particle with its nearest neighbour: since 1/p²_{T,1} is the smallest momentum factor, for comparable ΔR²_ij the product ΔR²_{1j}/p²_{T,1} gives the lowest value of d_ij, and is thus the first step in the clustering process. The new combined object has an even higher transverse momentum than the original hard particle, and will thus presumably be reclustered again with its second nearest neighbour in the next step of the clustering. This second nearest neighbour will be slightly farther away, since the position of the combined object is the weighted average over its constituents. This process repeats itself until the transverse momentum ratio between the hardest particle and the second hardest particle p_{T,2} is equal to the ratio between the squared distance to the nearest neighbour of the combined object and that of the second hardest particle, i.e. ΔR²_{1j}/ΔR²_{2j} = p²_{T,1}/p²_{T,2}. Since the variation in the transverse momenta is much higher than the variation in distances, it will take a lot of steps to reach that point. This is clearly visible in the y = 2, φ = 5 region of plot 2.1c. The green jet with the hardest particle has a perfect cone shape, at the cost of the shape of the pink jet that has much softer constituents. Only in cases where the momenta are approximately equal will the jet areas have a boundary somewhere in the middle, as can be seen in the y = 2, φ = 3 region of plot 2.1c. In contrast, the kt-algorithm will start by clustering the softest particle with its nearest neighbour. The new combined object will have a higher transverse momentum, so the next step will cluster the second softest particle with its nearest neighbour. The Cambridge-Aachen algorithm will start by clustering the nearest neighbours, and since the new combined object has an intermediate position between its constituents, the next step will presumably cluster different particles. This explains the amorphous shape of their jets. Previous studies have shown that anti-kt is the best algorithm for jet reconstruction, while Cambridge-Aachen has shown a better performance in detecting jet substructure [16].

2.2 Jet observables

For the observables that are assigned to jets, you can make a distinction between global observables and substructure observables. The global observables, which are related to the jet as a whole, are the most straightforward, and are derived from the summation of the four-vectors over all the jet's constituents. A few examples are mass and transverse momentum. The jet's mass is thus not a summation of the masses of the constituents, but (\sum_{i \in Jet} p_i)^2. The substructure observables are related to how a jet is composed. The following section will give an outline of these substructure observables. B-tagging is an observable that does not fall into either of these categories.
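As a small illustration of such a global observable, the sketch below (with invented constituent four-vectors) computes the jet mass and jet pT from the summed constituent four-vectors; note that the mass is the square root of the summed invariant.

```python
import numpy as np

def jet_mass(four_momenta):
    """Invariant jet mass from the summed (E, px, py, pz) four-vectors of the constituents."""
    total = np.sum(four_momenta, axis=0)
    m2 = total[0]**2 - np.sum(total[1:]**2)
    return np.sqrt(max(m2, 0.0))

def jet_pt(four_momenta):
    """Transverse momentum of the jet as a whole, eq. (1.10) applied to the summed vector."""
    total = np.sum(four_momenta, axis=0)
    return np.hypot(total[1], total[2])

# Two massless constituents: individually m = 0, but the pair has a non-zero invariant mass.
constituents = np.array([[50.0, 50.0, 0.0, 0.0],
                         [50.0, 0.0, 50.0, 0.0]])
print(jet_mass(constituents))   # ~70.7, not 0 + 0
print(jet_pt(constituents))     # ~70.7
```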

2.2.1 Energy-correlation function

The original energy correlation function [17] is defined as

ECF(N, \beta) = \sum_{i_1 < i_2 < \dots < i_N \in J} \left( \prod_{a=1}^{N} E_{i_a} \right) \left( \prod_{b=1}^{N-1} \prod_{c=b+1}^{N} \theta_{i_b, i_c} \right)^{\beta} \qquad (2.5)

Please note that this observable is both infrared and collinear safe for every β > 0, since a splitting where either θ_{i_b,i_c} → 0 or E_{i_a} → 0 will not change the value of the observable. This formulation is mostly used for e+e− collisions; for hadron colliders it is more natural to substitute the energy by the transverse momentum, and the angle θ_{i_b,i_c} by the Euclidean distance ΔR_{i_b,i_c}. The ECF tells something about how the pT of the jet is distributed among its constituents, and how the constituents are distributed in angle over the jet. An ECF(N, β) shows a sharp decrease if a system has N−1 or fewer subjets, while N or more subjets will yield a high value. In general, at a fixed number of particles and fixed jet pT, the ECF is maximized when each constituent has an equal pT fraction and the constituents are maximally separated over the jet area. To see this, consider a jet of 100 GeV with just two particles. To maximize the θ_{i_b,i_c} term, the particles need to be maximally separated, thus having a distance of 2R to each other. For the pT term, it would be most favourable if the momentum is equally distributed, since 50 × 50 > (50 + n)(50 − n). If we now add a third soft particle whose pT ≈ 0, this won't change the value of ECF(2, β), regardless of its geometrical position. But ECF(3, β) will drop to zero, because the product over the momenta now runs over all three particles, instead of a sum over the 3 possible combinations of two particles. Since we study 1-prong versus 2-prong jets that originate from hadron collisions, we will use the ECF(2, β) with transverse momentum instead of energy. Our choice for β is determined by the other substructure observables. Since we want the substructure observables to be comparable to each other, it is useful for them to have the same weight for the distance between particles. Because the N-subjettiness weights the distance by a power of 2, we choose β = 2 for the ECF. We thus define our ECF as:

ECF(2, 2) = \sum_{i < j \in J} p_{T,i}\, p_{T,j}\, \Delta R_{i,j}^2 \qquad (2.6)
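A direct transcription of (2.6) into code could look like the following sketch (constituents given as invented (pT, η, φ) triples); it also illustrates the infrared safety mentioned above, since adding a very soft particle barely changes the value.

```python
import numpy as np

def delta_r(eta1, phi1, eta2, phi2):
    """Geometrical distance (2.2) in the (eta, phi) plane."""
    dphi = abs(phi1 - phi2)
    dphi = min(dphi, 2.0 * np.pi - dphi)
    return np.sqrt((eta1 - eta2)**2 + dphi**2)

def ecf_2_2(constituents):
    """ECF(2, 2) of eq. (2.6): sum over all pairs of pT_i * pT_j * DeltaR_ij^2."""
    total = 0.0
    for i in range(len(constituents)):
        for j in range(i + 1, len(constituents)):
            pt_i, eta_i, phi_i = constituents[i]
            pt_j, eta_j, phi_j = constituents[j]
            total += pt_i * pt_j * delta_r(eta_i, phi_i, eta_j, phi_j)**2
    return total

# Two hard, well-separated prongs give a large ECF(2,2); adding a soft particle
# barely changes it, which reflects the infrared safety discussed above.
two_prong = [(50.0, 0.0, 0.0), (50.0, 0.8, 0.0)]
print(ecf_2_2(two_prong))                              # 50 * 50 * 0.8^2 = 1600
print(ecf_2_2(two_prong + [(0.01, 0.4, 0.1)]))         # almost identical
```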

2.2.2 N-subjettiness

The formal definition of the N-subjettiness [18] is:

\tau_N = \frac{1}{d_0} \sum_k p_{T,k} \min(\Delta R_{1,k}, \Delta R_{2,k}, \dots, \Delta R_{N,k}) \qquad (2.7)

d_0 = \sum_k p_{T,k} R_0 \qquad (2.8)

in which R_0 is the radius of the original jet, and ΔR_{N,k} is the geometrical distance between a particle k and a candidate subjet N. The N-subjettiness is an expression for the degree to which a jet is composed of N subjets. For similar reasons as for the ECF, this is also an infrared- and collinear-safe observable. For τ_N ≈ 0 all particles are perfectly aligned with the candidate subjet directions, and the system therefore has N or fewer subjets. For τ_N ≫ 0, there are lots of particles distributed far away from the candidate subjet axes, and you could therefore conclude that the jet has at least N+1 subjets. To know whether a jet has exactly N subjets, both τ_{N−1} ≫ 0 and τ_N ≈ 0 need to be satisfied. However, in this study we will focus on the two- versus one-prong structure. By setting N = 2, we can measure whether a jet has fewer than three subjets. The ECF subsequently forms a good indicator of whether it has one or two subjets, making the subjettiness ratio unnecessary. The choice of the candidate subjet axes is a sensitive one. One could optimize this process by minimizing τ_N over all possible subjet directions, but this is computationally intensive. A more practical approach is to recluster a jet and force it to return exactly N exclusive subjets. The term exclusive means that the clustering stops when the desired number of subjets is found, instead of when all particles are assigned to a jet. It can thus well be that some particles of the original jet are not assigned to any of the subjets. Earlier studies [18] suggest that this approach has only a small cost in generality.
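A minimal sketch of (2.7)-(2.8), assuming the candidate subjet axes are already given (e.g. from an exclusive reclustering), could look as follows; the constituents and axes are invented for illustration.

```python
import numpy as np

def tau_n(constituents, subjet_axes, jet_radius):
    """N-subjettiness of eqs. (2.7)-(2.8) for given candidate subjet axes.

    constituents: list of (pt, eta, phi); subjet_axes: list of (eta, phi).
    """
    def dr(eta1, phi1, eta2, phi2):
        dphi = abs(phi1 - phi2)
        dphi = min(dphi, 2.0 * np.pi - dphi)
        return np.sqrt((eta1 - eta2)**2 + dphi**2)

    numerator = sum(pt * min(dr(eta, phi, ax_eta, ax_phi)
                             for ax_eta, ax_phi in subjet_axes)
                    for pt, eta, phi in constituents)
    d0 = sum(pt for pt, _, _ in constituents) * jet_radius   # normalization (2.8)
    return numerator / d0

# A clean two-prong configuration: tau_2 is small when both prongs sit on an axis,
# while the same jet evaluated with a single axis gives a much larger value.
constituents = [(60.0, 0.0, 0.0), (40.0, 0.6, 0.0), (1.0, 0.3, 0.2)]
print(tau_n(constituents, [(0.0, 0.0), (0.6, 0.0)], jet_radius=0.8))   # close to 0
print(tau_n(constituents, [(0.25, 0.0)], jet_radius=0.8))              # clearly larger
```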

2.2.3 Subjet-angle and transverse momentum ratio

The number of subjets that appear in an event depends on the setting of the parameters. As stated in the introduction, the pT of a jet has to be within a certain margin to recover the same number of subjets as the number of decay particles of the particle that originated the jet: if a jet is too boosted, the subjets overlap and appear like a single one; if it is not boosted enough, the subjets are too far apart and appear to be different jets. It is therefore useful to measure the angle between the subjets as an indicator of how much overlap/separation there is between the subjets. If the angle is large enough, it is an indicator that the N-subjettiness is indeed based on two truly different subjets, while a small angle would suggest that the N-subjettiness is based on a single jet. Another observable with similar properties and purposes is the momentum fraction of the hardest subjet with respect to the momentum of the jet. If this is close to one, it indicates that the second subjet just consists of soft radiation.

Although other observables might seem very straightforward, they still deserve some remarks on how sensitive they are to contamination, infrared and collinear emissions, and the choice of clustering algorithm. Mass, energy and transverse momentum are examples of observables that are largely invariant under these changes. The jet area is a previously mentioned example that depends sensitively on the clustering algorithm. Other observables, like the number of constituents and the number of charged particles, are sensitive to the contamination, the clustering algorithm and the infrared and collinear emissions. This does not mean that they cannot be useful for jet discrimination, but there must be more restraint in the interpretation of their discriminatory power than for other observables.

2.3 Jet discrimination

Our understanding of how jets form, and of how we can measure their properties, leads us to another question: how can we use those properties to discriminate different types of jets, and how do we express the performance of this discrimination? The next section will be dedicated to these two questions, and forms the bridge between jet physics and machine learning.

The first question can be answered by looking at the distributions of these observables. Which observable is the most appropriate depends in general on the type of jets to be discriminated. The main criterion for how suitable an observable is for discrimination is how much the distribution functions of the two different jet types overlap in that observable. It is worth noting that the most suitable discrimination observable also varies within a certain jet type. For example, if a jet has a remarkably low or high value of a specific observable (so that it is in the tail of the distribution function that does not lie within the distribution function of the other jet type), that observable might be more useful than the observable with the least overlap between the distribution functions. The amount of overlap between the distributions can be visualized in a receiver operating characteristic curve. This is a 2-dimensional plot in which the percentage that is included at a specific cut of distribution 1 is on the x-axis, and the included percentage of distribution 2 is on the y-axis.

2.3.1 ROC-curve

For every binary classification problem, you could arbitrarily classify the elements of one class in the test data as "positive" (P) and the elements of the other class as "negative" (N). Subsequently, you could classify the predictions into true positives (TP, a positive element that got predicted as positive), true negatives (TN, a negative element that got predicted as negative), false positives (FP, a negative element that got predicted as positive), and false negatives (FN, a positive element that got predicted as negative). In principle, the ultimate goal is always to maximize TP and TN and to reduce FP and FN as much as possible. To what extent this is achieved is expressed by the accuracy:

\text{accuracy} = \frac{TP + TN}{P + N} \qquad (2.9)

But this result can be misleading for a couple of reasons. If there is an imbalance in the data between the two classes, the accuracy of the model will be high if it predicts all the test data to belong to the class that is most copious in the data. If the ratio between P and N is 1:9, the model will have an accuracy of 90% if it predicts that all cases are negative. Also, the accuracy does not distinguish between the importance of a true negative result and a true positive result. Depending on the classification that you need to perform, this can be very important. For example, when you want to classify corona patients (defined as positive) and healthy people (defined as negative), you might care more about the true positive rate than about the true negative rate or the false positive rate. A more precise expression of the performance of a classification is given by the four following equations:

TPR = \frac{TP}{TP + FN} \qquad (2.10)

FPR = \frac{FP}{FP + TN} \qquad (2.11)

TNR = \frac{TN}{TN + FP} \qquad (2.12)

FNR = \frac{FN}{FN + TP} \qquad (2.13)
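The quantities of (2.9)-(2.13) follow directly from the four confusion-matrix counts; the toy sketch below also reproduces the imbalance pitfall described above, where a 1:9 ratio and an all-negative prediction give 90% accuracy but a TPR of zero.

```python
def classification_rates(y_true, y_pred):
    """Confusion-matrix counts and the rates of eqs. (2.9)-(2.13) for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {"accuracy": (tp + tn) / len(y_true),
            "TPR": tp / (tp + fn),
            "FPR": fp / (fp + tn),
            "TNR": tn / (tn + fp),
            "FNR": fn / (fn + tp)}

# One positive among nine negatives, and a classifier that always says "negative":
# 90% accuracy, but the single signal case is missed entirely (TPR = 0).
y_true = [1] + [0] * 9
y_pred = [0] * 10
print(classification_rates(y_true, y_pred))
```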

It is easy to see that the denominators of these ratios make them insensitive to the positive/negative ratio of the dataset. Since they are all linearly dependent, we can express the performance with only two of them: TPR and FPR. Since these rates depend on the parameter values of the classification tool, it is possible to display them in a single plot as a function of the parameter settings. This is the more formal definition of the receiver operating characteristic (ROC) curve. ROC curves are a powerful tool in statistical analysis, and they are widely used outside physics. By convention, the TPR is shown on the y-axis and the FPR on the x-axis. Each point in the plot refers to a certain choice of parameters. The range of relevant parameter values lies between the point where the entire dataset is predicted to be negative, in which case TPR = FPR = 0, and the point where everything is classified as positive, in which case TPR = FPR = 1. Each adjustment of the parameters in between aims to maximize the increase in TPR and to minimize the increase in FPR, so the shape of the graph should bend as far away from the diagonal as possible. One way to express the effectiveness of a ROC curve is with the area under the curve (AUC). As the name says, this is the integral over the ROC curve. For a random guess, this area would be 0.5. Since the classification of positive and negative is arbitrary, it is |AUC − 0.5| that you care about. Please note that in the above explanation it is assumed that there is a continuous mathematical distribution function. Since there often is not, it is important to mention an intermediate step. To convert empirical data into a ROC curve, it is first necessary to plot the data in a histogram, where the height of the bins represents the frequency. This might seem a boring step between mathematics and reality, but there is an important fallacy hidden in it.

There is a subtle mathematical problem in multidimensional empirical distribution functions. For a one-dimensional distribution function, there cannot be much discussion on how the ROC curve is related to the cut on the axis of the distribution function. It is obvious that the point (0,0) relates to a cut where nothing is included, and (1,1) to a cut where everything is included. All the points in between relate to a cut one step further along the distribution axis than the previous point (implicitly assuming that the signal/background ratio only increases or decreases as a function of that dimension). For multidimensional distribution functions, this is not obvious at all, since there are multiple directions (i.e. axes of the distribution function) in which you could step. Each "path" you take through the distribution function can give you a different one-dimensional distribution, and hence a different ROC curve. So to get an optimal ROC curve, you need to find out which path to take. One way to cope with this problem is to make a one-dimensional list in which all the bins (the bars in the empirical distribution) are sorted with respect to their ratio of signal versus background content, and subsequently plot the ROC curve while going down through the list. In this way, you know that you get an optimal curve, since each next bin on the list has the largest available ratio between signal and background. Note that this bin ordering can also be necessary in the one-dimensional case, although it often is not.

The problem is that with finite statistics you could make your bins infinitely small, so that each bin contains 1 signal data point, 1 background data point, or nothing at all. If you would sort those bins based on their signal versus background content, you would get a long queue of bins containing only signal and thereafter a queue of bins with only background or nothing at all. In this way you could make a ROC curve where full separation is achieved, but it would be purely mathematical cheating. So when plotting a ROC curve from a multidimensional distribution function, it is important to incorporate statistical uncertainties, i.e. to have enough signal and background content in each bin.
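A sketch of the bin-ordering procedure described above could look as follows; it assumes the signal and background histograms share the same binning and that each bin already contains enough statistics.

```python
import numpy as np

def roc_from_binned_counts(signal_counts, background_counts):
    """Build the 'optimal' ROC curve from per-bin signal and background counts.

    Bins are visited in order of decreasing signal-to-background ratio, as
    described above; each step adds one more bin to the accepted region.
    """
    s = np.asarray(signal_counts, dtype=float)
    b = np.asarray(background_counts, dtype=float)
    order = np.argsort(-(s / np.maximum(b, 1e-12)))            # most signal-like bins first
    tpr = np.concatenate([[0.0], np.cumsum(s[order]) / s.sum()])
    fpr = np.concatenate([[0.0], np.cumsum(b[order]) / b.sum()])
    auc = np.sum(np.diff(fpr) * 0.5 * (tpr[1:] + tpr[:-1]))    # trapezoidal area under the curve
    return fpr, tpr, auc

# Toy 1D histograms with 5 bins each; in practice the bins may come from a
# multidimensional histogram flattened into a single list.
signal = [50, 30, 15, 4, 1]
background = [2, 8, 20, 30, 40]
fpr, tpr, auc = roc_from_binned_counts(signal, background)
print("AUC =", round(auc, 3))    # ~0.92 for this toy example
```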

If a conventional program would base the discrimination on all the available variables, it would in theory be impossible for a machine learning program to outperform it. The problem is that this would require such extraordinary computational power that, for every practical purpose, only a few variables can be selected for the calculation. This rapid rise in computational power occurs because it scales exponentially with the number of input dimensions. For N dimensions that are each divided into 100 bins, you would need a factor 100^(N−1) more computational power than for one dimension. One important advantage of machine learning compared to conventional programs is that in machine learning it is possible to scale the needed computational power linearly with the number of input dimensions, instead of exponentially. Because of this, the increase in computational power would only be a factor N for machine learning tools. It is in this selection of variables to use for the computation that the machine learning algorithm can provide a significant advantage compared to a conventional program. This will be explained in more detail in Chapter 3.

2.4 Jet simulation

A reasonable approach for machine learning-based jet discrimination would be to train the algorithms on real-world input data. As mentioned before, the problem is that the amount of data produced by collisions at the LHC is so large that this would be practically impossible. To give a sense of the scale of the data that is produced, this section gives a brief summary of what is actually happening at the LHC. The following specifications are obtained from [19]. Note that these specifications may have evolved over time, but not in a way that solves the problem. The colliding bunches of protons each carry $1.15 \times 10^{11}$ protons. Due to the relatively small volume of the protons with respect to the entire volume of the beam, most of them will not collide with another proton. Each time the bunches collide, there will be approximately 50 proton-proton interactions. Around half of them are elastic scatterings, where the protons simply scatter off, resulting in pile-up, and the other half are inelastic scatterings, where new particles are created in an event. For each event, the LHC captures around 1 megabyte, so that corresponds to about 25 MB per bunch crossing. But in the LHC there are not just two bunches of protons that collide, but 2808 of them, all separated by equal distances along the ring. Since the protons travel close to the speed of light, they make 11245 revolutions per second in the detector. Each second, there is thus $11245 \times 2808 \times 25\,\mathrm{MB} = 789399$ gigabyte that needs to be stored. The laptop used for this thesis has 120 GB of storage, so it can hold $\frac{120\,\mathrm{GB}}{789399\,\mathrm{GB/s}} \approx 0.00015$ seconds of LHC data, which is unsuitable for every practical purpose. The only saving grace is that more than 99% of this data is uninteresting anyway.

Although it is hard to model all the contamination from particle collisions at such a large scale, we have a proper fundamental understanding of jet formation. The kinematics and cross-sections for particle formation and decay are derivable from first principles, and the effects of hadronization can be modeled with the Lund string model. This is why we have chosen PYTHIA to simulate all the events. In PYTHIA, each event consists of just one proton colliding with one other proton. The events that are generated by these collisions can be fixed with commands listed in [20]. Each command can switch a certain scattering process on or off, thereby drastically reducing the number of collisions (and the amount of data) that need to be generated to produce an interesting event. Hadronization, initial state radiation, final state radiation, and multiple parton interactions can all be modeled with PYTHIA, but since each event consists of just two protons, the elastic scattering is left out.
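The data-rate estimate above can be reproduced with a few lines of arithmetic; the numbers are the ones quoted in the text, and the snippet is only meant as a back-of-the-envelope check:

```python
# Back-of-the-envelope check of the LHC data-rate estimate quoted above.
bunches = 2808             # proton bunches per beam
revolutions_per_s = 11245  # revolutions of each bunch per second
mb_per_crossing = 25       # ~1 MB per event, ~25 recorded events per bunch crossing

gb_per_second = bunches * revolutions_per_s * mb_per_crossing / 1000
print(gb_per_second)        # ~789399 GB of data per second
print(120 / gb_per_second)  # a 120 GB disk holds ~0.00015 s of LHC data
```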


Chapter 3

Machine Learning

The range of applications of machine learning is rapidly expanding. Its ability to digest large amounts of data in a short period of time and to recognize unexpected patterns or correlations makes it useful for circumventing limitations of conventional approaches. The majority of the content of this chapter is retrieved from [21]. Broadly stated, the field of machine learning can be divided into three subcategories: supervised machine learning, unsupervised machine learning, and reinforcement learning. All of them have in common that their algorithms ask one simple question: how should the parameters be changed to map a set of input values to a desired set of output values? Supervised machine learning differs from the other two in that the algorithm is first given a set of training data for which the true output values are known, and optimizes the configuration of its parameters based on the difference between its predictions and those true values. After the configuration of the parameters is optimized, the algorithm can be applied to real-world data. By contrast, for unsupervised and reinforcement learning there is no training data. Which form of machine learning is preferable depends on the problem at hand. Since the tools used for this research are all supervised machine learning tools, that field receives the most comprehensive discussion. The section on unsupervised machine learning introduces some relevant concepts and relates to jet clustering. The section on reinforcement learning can be regarded as background for the interested reader and is not necessary for understanding the data-analyses of this study.

3.1 Reinforcement learning

The goal of reinforcement learning is for the algorithm to autonomously take decisions and make adaptations in order to maximize the chance of successfully performing a task. In order to do so, the algorithm is in constant interaction with its environment. The key difference between reinforcement learning and (un)supervised learning is that in reinforcement learning the output values or decisions of the algorithm influence the values of the upcoming input data. A trivial example of this is a thermostat. When the thermostat is set at 20 degrees (the desired output value) and measures a temperature of 18 degrees (the input value), it will turn up the heating (an adjustment of the parameters). The way reinforcement learning differentiates itself from the other two machine learning fields is by a continuous feedback loop that adapts its parameters to a dynamic environment. This constant adaptation to the changing environment implies that the parameters will never converge to a final optimal configuration. This feedback loop is more formally known as the reward/loss function, and is based on the evaluation of the results of each possible action in previous situations. This gives an indication of how likely a specific decision in the current situation is to lead to a successful outcome. In the example of the thermostat, the agent has three options: lowering the heating, leaving the heating constant, and increasing the heating. The former two will never achieve the desired output, and the latter has a large probability (assuming the windows are closed) of achieving it, so the agent will choose to increase the heating. In more sophisticated tools, a randomization factor is also incorporated in the decision function, which causes the program to sometimes take unexplored paths. This implies that the reward/loss component is sometimes neglected in the decision process. In some cases, like board games, the environment might be a second version of the algorithm itself. By letting the algorithm compete against itself, the learning rate is significantly improved. In this way, the chess program AlphaZero was able to beat the best conventional chess program after just 4 hours of training on 5000 tensor processing units (TPUs) [22]. Note that the final version of AlphaZero ran on only 4 TPUs, so it is not simply a matter of increasing computational power. This example is completely unrelated to any data-analysis in this study, but it illustrates the immense power of machine learning: with no more input than the rules of the game and four hours of training, the machine learning tool beat thousands of years of human practice.
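The thermostat example can be captured in a deliberately over-simplified sketch: a small reward table per action, updated from experience, plus a randomization (exploration) factor. Everything here, including the reward definition and the numerical settings, is an illustrative assumption rather than a description of any algorithm used in this thesis.

```python
import random

# Toy thermostat agent: three possible actions that change the heating level.
actions = {"lower": -1.0, "keep": 0.0, "raise": +1.0}
value = {a: 0.0 for a in actions}       # running reward estimate per action
target, temperature = 20.0, 18.0
epsilon, lr = 0.1, 0.5                  # exploration probability and learning rate

for _ in range(50):
    # Mostly exploit the action with the best estimated reward,
    # but occasionally explore a random one (the randomization factor).
    if random.random() < epsilon:
        action = random.choice(list(actions))
    else:
        action = max(value, key=value.get)

    temperature += actions[action]                   # the decision changes the environment
    reward = -abs(target - temperature)              # closer to the target = higher reward
    value[action] += lr * (reward - value[action])   # update the reward/loss estimate
```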

3.2 Unsupervised learning

Where supervised and reinforcement learning are mainly focused on the prediction of a value or label, unsupervised machine learning aims to find hidden patterns and structures in the input data. In contrast to supervised learning, there is no set of training data. The most important applications of unsupervised machine learning are clustering and dimensional reduction. A further explanation of dimensional reduction is given in the appendix, but it is not necessary for understanding the data-analyses in this study.

3.2.1 Clustering

In clustering, the algorithm aims to group unlabeled data into clusters using a distance or similarity measure. A set of $N$ data points with $p$ features each can be written as $\{x_n\}_{n=1}^{N}$ with $x_n = (x_{n,1}, x_{n,2}, \ldots, x_{n,p})$, and each cluster centre as $\mu_k = (\mu_{k,1}, \mu_{k,2}, \ldots, \mu_{k,p})$. The algorithm optimizes itself by minimizing the cost function. The concept of the cost function plays a crucial role in machine learning, and measures the quality of the model's predictions. For cluster analysis, the cost function of a clustering is defined as:

$$C(x;\mu) = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\,(x_n - \mu_k)^2 \qquad (3.1)$$

where $(x_n - \mu_k)^2$ is the squared Euclidean distance between a point and a cluster centre, and $r_{nk} = 1$ if the $n$-th point belongs to the $k$-th cluster and $r_{nk} = 0$ otherwise. Since clustering is exclusive, we need to impose that each point belongs to exactly one cluster. Note that this shows great similarities with jet clustering. The minimization works by choosing $K$ random points as cluster centres and subsequently taking the derivative of the cost function with respect to each cluster centre. Each derivative shows in which direction the coordinates of the corresponding cluster centre should be changed in order to lower the cost function. Each cluster centre is adjusted in that direction, and the new cost function and the new corresponding derivatives are calculated. After enough iterations, the cost function converges to a (local) minimum.
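A minimal implementation of this iterative minimization is sketched below, using the standard k-means update in which setting the derivative of Eq. (3.1) with respect to a cluster centre to zero moves that centre to the mean of its assigned points. The function name and the random initialization are illustrative choices, not the clustering code used elsewhere in this thesis.

```python
import numpy as np

def kmeans(x, k, n_iter=100, seed=0):
    """Minimize the clustering cost of Eq. (3.1); x has shape (N, p)."""
    rng = np.random.default_rng(seed)
    centres = x[rng.choice(len(x), size=k, replace=False)]  # k random points as initial centres
    for _ in range(n_iter):
        # Hard assignment: r_nk = 1 for the nearest centre, 0 otherwise.
        d2 = ((x[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Setting dC/dmu_k = 0 moves each centre to the mean of its assigned points;
        # an empty cluster keeps its previous centre.
        centres = np.array([x[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
                            for j in range(k)])
    return centres, labels
```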

3.3 Supervised machine learning

In supervised machine learning, an algorithm learns a function that maps a vector of inputs to a vector of outputs. In order to do so, it needs a set of labeled data. “Labeled” in this context means that the outcome the machine needs to predict is known. To train and validate the algorithm, the data is split into a training set and a validation (or test) set, after which only the input and output values of the training set are given to the algorithm. The accuracy of the model is tested after the training by applying the algorithm to the validation set, of which the desired output values are known to us but unknown to the algorithm. Subsequently, the predicted values are compared to the real values to determine the accuracy. The goal for the algorithm in the training phase is to adjust its parameters in such a way that it minimizes the mean of the squared differences between the real values and the predicted values, which is a more general formulation of the cost function:

$$C(Y; f(X;\Theta)) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - f(x_i;\Theta)\right)^2 \qquad (3.2)$$

Figure 3.1: (a) Error for samples outside the training data for a fixed model complexity [21]. (b) Bias and variance for a fixed number of data points [21].

where $y$ is the desired output value, $x$ is the input data, $i$ runs over the training data points, and $\Theta$ are the model parameters. By fitting the model to the training data, we aim to discover a certain underlying law, with which we can make predictions for validation data outside the training set. If the training data were perfectly clean, in the sense that its output values perfectly correspond to the underlying law we want to discover, it would be sufficient to adjust the parameters until the cost function is zero. However, in the real world there is often noise involved in the training data. The fact that the data is contaminated with noise implies that there is an irreducible error margin in the model, called the Bayes error [23]. This makes it tricky to fit the model perfectly to the training data, since this would mean that its noise gets incorporated in the underlying law. If such a model is tested on the validation set, there would be a huge error. Since the goal is to predict output values of unlabeled data, we only care about the error margin of the validation set. This phenomenon of increasing the validation error by reducing the training error is known as “overfitting”. It is essentially the machine-learning analogue of students who pass their courses by just learning old exams: they perform very well on the training exercises, but are unable to apply their knowledge to the real world. More philosophically, it can be seen as an example of Goodhart’s law: “whenever a measurement becomes a target, it ceases to be a good measurement”. To counter this effect, a regularization term is added to the cost function:

$$C(Y; f(X;\Theta)) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - f(x_i;\Theta)\right)^2 + \lambda\,\theta^{T}\theta \qquad (3.3)$$

Here, $\lambda$ is the regularization parameter. Note that due to this regularization term, a model with less accurate predictions but small parameter values is, in the minimization of the cost function, to some extent favoured over a model with more accurate predictions but large parameter values. This formula also illustrates why it is important to have balanced training data: if one output value is over-represented in the training data, the algorithm tends to minimize the cost function by simply mapping all input data to that output value, making the algorithm completely useless.
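For a simple linear model $f(x;\theta) = x\,\theta$, Eq. (3.3) can be written down and minimized directly; the sketch below does this with a closed-form ridge-style solution. The toy data, the value of $\lambda$, and the variable names are illustrative assumptions.

```python
import numpy as np

def cost(theta, x, y, lam):
    """Regularized squared-error cost of Eq. (3.3) for a linear model f(x; theta) = x @ theta."""
    residual = y - x @ theta
    return residual @ residual / len(y) + lam * (theta @ theta)

rng = np.random.default_rng(1)
x = rng.normal(size=(200, 5))
theta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = x @ theta_true + rng.normal(scale=0.3, size=200)   # noisy training data

# Minimizing Eq. (3.3) analytically for this model gives the ridge solution below;
# a larger lam favours small parameter values over a perfect fit to the noisy data.
lam = 0.1
theta_hat = np.linalg.solve(x.T @ x + len(y) * lam * np.eye(5), x.T @ y)
print(cost(theta_hat, x, y, lam))
```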

3.3.1 Bias-variance trade-off

As stated above, the goal of the training is not to reduce the error on the training set $E_{\mathrm{train}}$, but to decrease the error on the validation set $E_{\mathrm{val}}$. How representative a training set is depends both on the number of data points and on their dimensionality. The smaller the number of data points in the training set, the more likely it is to misrepresent reality. Also, as the number of dimensions increases, the probability distributions of the training data are more prone to deviate from the real probability distributions. Note that these effects are fundamentally different from the noise: the noise is an intrinsic property of the data, while these are caused by statistical fluctuations that always occur in experimental studies where samples are used to represent larger groups. These statistical fluctuations can in principle be removed by increasing the number of data points, so that $E_{\mathrm{train}} = E_{\mathrm{val}}$, as shown in figure 3.1a (note that $E_{\mathrm{val}} = E_{\mathrm{out}}$). This convergence limit is known as the bias of the model, while the difference between the bias and $E_{\mathrm{val}}$ is known as the variance, which can thus be seen as the loss in performance caused by the lack of training data. The $E_{\mathrm{val}}$ we want to minimize can be expressed as $E_{\mathrm{val}} = \mathrm{bias} + \mathrm{variance}$. A fundamental problem in machine learning is that the variance grows with the model complexity, while the bias decreases with the model complexity, as shown in figure 3.1b. A simple model has a low variance, so it takes only a small training set to converge to the bias, but it also has a high bias. A very complex model has a bias close to the Bayes error, but a high variance, so it takes a very large amount of training data to get there. Simple models thus provide bad solutions that are easy to find, while complex models provide good solutions that are hard to find. Assuming that the amount of training data and computational power are limited (implying that the variance will never go to zero), we are thus forced to make a trade-off in our model complexity to minimize $E_{\mathrm{val}}$. This is known as the bias-variance trade-off. The variance expresses the degree to which the optimal configuration of a model overfits, given its complexity and the amount of training data.
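The trade-off can be made visible with a toy experiment that scans the model complexity (here the degree of a polynomial fit) and compares the training and validation errors. The data, the noise level, and the chosen degrees are arbitrary; this is a sketch of the behaviour in figure 3.1, not a reproduction of it.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=200)
y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=200)   # the noise sets the Bayes error

x_train, y_train = x[:60], y[:60]    # deliberately small training set
x_val, y_val = x[60:], y[60:]

for degree in (1, 3, 7, 12):         # increasing model complexity
    coeffs = np.polyfit(x_train, y_train, degree)
    e_train = np.mean((y_train - np.polyval(coeffs, x_train)) ** 2)
    e_val = np.mean((y_val - np.polyval(coeffs, x_val)) ** 2)
    print(degree, round(e_train, 3), round(e_val, 3))   # E_train keeps falling; E_val eventually does not
```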


3.3.2 Minimizing cost function

When the underlying law that needs to be discovered is an ordinary mathematical function (e.g. a polynomial or a wavefunction), the minimization of the cost function can be done analytically. The critical reader might have noticed that the clustering problem could also have been solved analytically. However, for more complex models in realistic applications this is not possible. As already mentioned in section 3.2.1, the most common minimization strategy in those situations is the gradient descent method.

One way to find a minimum is to start at a random initial point $p_{\mathrm{init}}$ in the plot of the cost function, calculate the gradient over the full parameter space at that position, and step in the direction of steepest descent to the next position, $p_{\mathrm{init}+1}(\theta_{t+1}) = p_{\mathrm{init}}(\theta_t) - \eta \nabla p_{\mathrm{init}}(\theta_t)$, in which $\eta$ is the learning rate. Iterating this process from each new position, it will converge to a local minimum. A two-dimensional example of this process is visualized in figure 3.2, where the black line denotes the parameter values. Generally, each parameter in a gradient descent process evolves as:

$$\theta_{t+1} = \theta_t - v_t \qquad (3.4)$$

where $v_t$ is the step size, in this case given by:

$$v_t = \eta \nabla_{\theta} E(\theta) \qquad (3.5)$$

The right choice for $\eta$ depends on the shape of the cost function, the starting position, the parameters, and the amount of available computational power. A small $\eta$ tends to converge more precisely to the minimum, but takes more steps, thereby increasing the computational cost. A large $\eta$ converges faster to the region of the minimum, but at the price of two risks: overshooting the minimum, or oscillating endlessly between the two slopes around the minimum. However, if there are multiple local minima and you aim to converge to the absolute minimum, it can be useful to overshoot local minima. For this to succeed, it is necessary that the local minima lie in a narrow region of the parameter space and the absolute minimum in a broad region. Another problem is that $\eta$ treats all parameters uniformly. This can be disadvantageous when some parameters need small corrections and other parameters need large corrections. To bypass these limitations, two more sophisticated tools have been developed.

Figure 3.2: Plot of the cost function over a 2-dimensional parameter space [21]
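A minimal sketch of the update rule of Eqs. (3.4) and (3.5) on a one-parameter toy cost function is given below; the cost function, starting point, and learning rate are arbitrary illustrations.

```python
def cost(theta):          # toy cost function with its minimum at theta = 2
    return (theta - 2.0) ** 2

def grad(theta):          # analytic gradient of the toy cost function
    return 2.0 * (theta - 2.0)

theta, eta = -3.0, 0.1    # starting point and learning rate
for _ in range(100):
    v = eta * grad(theta)   # Eq. (3.5): step size from the gradient
    theta = theta - v       # Eq. (3.4): step against the gradient
print(theta, cost(theta))   # converges towards 2; a much larger eta would overshoot or oscillate
```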

One of them is stochastic gradient descent (SGD). In SGD the data is divided into $N$ minibatches. Each minibatch uses the same cost function, but since they are all evaluated on a different subset of the data, each minibatch has its own plot of the cost function. For each step, the gradient of one such plot is calculated, and the parameters are adjusted according to that gradient. One loop over all the minibatches is called an epoch. The total step over one epoch is thus $v_t = \eta \sum_{i=1}^{N} \nabla_{\theta} E_{\mathrm{MB}_i}(\theta)$, where $E_{\mathrm{MB}_i}$ is the cost function of the $i$-th minibatch, so each epoch corresponds to adjustments of order $\eta \times N$. Although the step size $\eta$ of the parameters is still the same, the stochasticity works as a regularization for each parameter. If a parameter has a very steep slope in the cost function of the entire dataset, the gradients of the minibatches will most likely all point in the same direction, and after one epoch the parameter will have changed by roughly $\eta \times N$ in this direction. Conversely, if a parameter has little to no slope in the cost plot of the entire dataset, the sum over the gradients of all the minibatches will most likely be approximately zero. Besides that, there are two other important benefits: splitting the plot of the cost function over $N$ plots helps avoid getting stuck in a local minimum, and because one epoch corresponds to $N$ adjustments of the parameters instead of 1, it reduces the required computational power by a factor $\frac{1}{N}$. For further improvement, a momentum term is added to the
