
Machine learning in the Xenon experiment

Gijs Leguijt

June 22, 2018

Student number 11000279

Assignment Report Bachelor Project Physics and Astronomy

Size 15 EC

Conducted between April 3, 2018 and June 22, 2018

Institute Nikhef

Faculty Faculteit der Natuurwetenschappen, Wiskunde en Informatica

University Universiteit van Amsterdam

Supervisor Prof. dr. P. Decowski

Second Examiner Dr. H. Snoek

Abstract

The XENON1T-experiment tries to make a direct detection of a dark matter particle scattering off a xenon atom. As the cross-section of this interaction is very low, the chance that a dark matter particle scatters twice is negligible. This thesis evaluates the performance of machine learning algorithms that aim to remove the multiple-scatter signals from the data. The three algorithms used are a support vector machine, a random forest classifier and a multi-layered perceptron. All three algorithms achieved higher accuracy than the current analysis, but as the support vector machine took several times longer to process the large amount of data, only the random forest classifier and the perceptron are recommended for implementation.

Contents

1 Samenvatting
2 Introduction
3 Dark matter
   3.1 Proof for dark matter
   3.2 Composition of dark matter
4 The XENON1T-detector
   4.1 Current data-analysis
      4.1.1 Largest S2
      4.1.2 Largest other S2
      4.1.3 50%-width
      4.1.4 Goodness of fit
5 Machine learning
   5.1 Overfitting
   5.2 Decision tree
   5.3 Random forest classifier
   5.4 Support vector machine
   5.5 Multi-layered perceptron
   5.6 Stochastic gradient descent
   5.7 ROC-curve
6 Results
   6.1 Data preparation
   6.2 Improving current cuts
   6.3 Comparing different algorithms
7 Conclusion

1 Samenvatting

There is almost five times as much dark matter as all normal mass. We know that dark matter exists, and yet we have never seen a dark-matter particle. To detect these particles nonetheless, large detectors are being built, but even with large detectors the chance that a dark-matter particle collides is small; most of the signal is background. Since the chance of a collision is so small, all particles that collide twice in the detector are therefore not dark-matter particles.

It is therefore important to filter away all particles that collide multiple times. This is currently done without machine learning, but with machine learning it could be done more efficiently. This thesis tries to improve the current analysis by applying machine learning, and uses several algorithms to that end. It turns out that all tested algorithms are indeed better able to discriminate between particles that collide once and particles that collide multiple times.

2 Introduction

On March 28, 2018, researchers announced the discovery of a galaxy with a very low concentration of dark matter [1]. This absence, counter-intuitively, was said to prove the existence of dark matter. The details of this paradox will be explained later; for now it is sufficient to know that this claim made the news worldwide, which shows that dark matter is a big topic in modern-day physics.

All around the globe, experiments are trying to find dark matter, which would contain over five times as much mass as all matter known to us [2]! But if there is that much mass in the form of dark matter, why has no detector found it yet? The problem is that dark matter hardly interacts with any matter known to us. Compare it, for example, with neutrinos: they also pass through most matter without any interaction, and only very rarely leave a trail in the huge neutrino detectors.

Since no detector has ever found a dark matter particle, there are still multiple theories about what its individual constituents would be. Currently, most attention is given to the WIMP-model (Weakly Interacting Massive Particle) [3], which is also the target of the XENON1T-detector, situated in a mine in Italy. XENON1T consists of roughly 1000 kg of liquid xenon [4] in which all interactions are closely monitored. Although a lot of work has been put into reducing background noise, to the point that this detector has the lowest background level in its category [4], most of the signal still consists of interactions that did not involve any dark matter particle. As a result, a lot of data needs to be filtered, in particular particles that scattered multiple times within the detector. Since the cross-section of dark matter particles is very low, of the order of $10^{-40}\,\mathrm{cm}^2$ or below, where the precise value of the upper bound depends on the WIMP-mass [5], the chance that a particle scatters multiple times within the detector is negligible. Discriminating between single and double scatters is currently done using cuts in 2D-parameter spaces [6].

Of course dark matter is not the only field that receives major attention: machine learning is one of the other hot topics in science. One can occasionally find a self-driving car on the road in some countries [7], the world's best Go-player was beaten by a computer [8], and, maybe less known, machine learning is making its way into big research experiments as well [9]. Whereas the human mind is not naturally capable of working in more than three dimensions, there is no such limitation for computers. A machine can operate in high-dimensional spaces, seeking correlations between many variables at once. As a result, letting a computer find its own way of splitting the data may yield better results than telling the computer where the splits have to be made [10].

On one side there is an experiment searching for an extremely rare signal in a huge amount of data, which is like trying to find a needle in a haystack; on the other side there is a powerful data-analysis tool that has classification as one of its main attributes. The aim of this thesis is to merge the two sides, trying to improve the current splitting of the data, such that XENON1T becomes even more sensitive.

More specifically, the goal of this work is twofold. First, the effect of adding machine learning to the analysis is evaluated. Subsequently, three different learning algorithms are compared with each other: the Random Forest Classifier (RFC), the Support Vector Machine (SVM) and the Multi-Layered Perceptron (MLP).

3 Dark matter

3.1 Proof for dark matter

As stated in the introduction, there are multiple experiments searching for dark matter. But if no individual particle has been found yet, what is the motivation for these searches? Strong hints for its existence are found in the rotation curves of galaxies [11]. The speed of orbiting bodies is governed by the mass enclosed in the orbit and the distance to the centre of mass. For a typical galaxy one would predict the lower curve in figure 1 when only considering the visible matter. However, observations yield the upper curve, which differs significantly from the expectation.


Figure 1: From [11]. The upper line shows the measured rotation curve of the galaxy NGC 3198. The two lower curves represent the contributions to the curve due to the disk and due to the dark matter halo. In the absence of the halo, the observed plateau would not occur, and the velocity of stars would decrease with increasing distance to the centre, after an initial increase up to roughly 5 kpc due to the accumulation of mass near the centre.

One way of explaining these high velocities far from the centre of the galaxy is to introduce additional mass in the galaxy. Since only the effect of its gravity is visible, this additional mass is called dark matter. Looking at clusters of galaxies instead of individual systems, one would need even more dark matter to explain the high velocities encountered in these clusters.

Independently of the rotation curves, the existence of dark matter can also be deduced from the Cosmic Microwave Background (CMB) [12]. The shape of the CMB spectrum has been measured with great accuracy by increasingly sensitive satellites [13]. Inhomogeneities in the spectrum result from a balance between gravity and photon pressure [12]. Since dark matter hardly couples to photons, it creates gravitational potential wells instead of starting to oscillate due to radiation pressure, which leaves a specific signature on the CMB. Current theories on the history of the universe can successfully reproduce the CMB spectrum, and the resulting bound on the amount of dark matter is consistent with the galaxy rotation curves [14].

Although there are multiple observations that can be explained by dark matter, they are all related to gravitational effects. Therefore, there have been attempts to explain the observations with new theories for gravity acting on large scales [15]. This is where the discovery of the dark matter-less galaxy comes into play: because this galaxy shows the expected rotation curve, due to the absence of dark matter, theories with different gravitational laws face difficulties trying to explain both the galaxies with dark matter and those without.

Another discovery in favour of the dark matter theories is the Bullet Cluster [16], see figure 2, in which two galaxy clusters collide. In the absence of dark matter, most mass would be contained in the interstellar gas. When the clusters collided, the stars were hardly affected and moved through, while the gas was slowed down much more heavily. As a result, most of the mass, and thus most of the gravitational effect, should have been split off from the stars. Only if there is something else, something that carries most of the mass in the cluster and that hardly interacts, would most of the gravity stay co-moving with the stars, which is indeed the case in figure 2. The fact that the gas does not contain most of the mass is a strong argument for the existence of dark matter.


Figure 2: From [16]. The Bullet Cluster is powerful evidence for the existence of dark matter. As the two galaxy clusters collided, the individual galaxies moved through, due to the vast distances between them. However, the interstellar gas clouds slowed down, as their higher density resulted in more collisions. Since most ordinary mass is contained in the gas, one would expect that most of the gravity would have stayed with the gas, separated from the galaxies. Observations show otherwise: the gas clouds, shown in red, got separated from the stars, while the gravitational effect, shown in blue, did not slow down. Therefore, something else must carry most of the mass, something that hardly interacts, such that it would not slow down. Dark matter fits this observation perfectly.

3.2 Composition of dark matter

Since the standard model contains no dark matter, modifications or new theories have to be developed to describe these particles. Dark matter candidates are generally divided into two categories: MAssive Compact Halo Objects (MACHOs) and Weakly Interacting Massive Particles (WIMPs). The first category is partly populated by less exotic objects, like black holes and neutron stars, while there is still a lot of speculation about which particles or objects make up the WIMPs.

Since the discovery that neutrinos have mass, they have been proposed as dark matter candidates [17]. However, neutrinos alone are not sufficient to explain the observed effects of dark matter [18]. Supersymmetry is beyond the scope of this thesis; however, it produces a whole zoo of possible WIMPs, such as neutralinos, sneutrinos and gravitinos [14]. As WIMPs are considered the most promising dark matter category, they form the focus of the XENON-experiment.

4 The XENON1T-detector

The XENON1T-detector is the third in a series of xenon experiments and the predecessor of XENONnT, which will be an even bigger detector [19]. It aims to find dark matter by detecting the light produced by a WIMP colliding with a xenon atom. To this end the detector contains a target mass of 1000 kg of xenon, partly in a gaseous state and partly liquid. The light is recorded by Photo-Multiplier Tubes (PMTs) situated both at the top and the bottom of the detector. The array of 248 PMTs can reconstruct the (x, y)-position of the collision by analysing the yield of the individual PMTs. In principle the difference in light travel time could be used to retrieve the z-position, but the speed of light in comparison to the size of the experiment makes this practically impossible. Instead, the z-position is reconstructed using the fact that both liquid and gaseous xenon are present.

In a collision, a xenon atom is ionised, and in response it forms an excited Xe$_2$-molecule. The individual xenon atoms are transparent to the light that is emitted when this molecule de-excites, and this light is the first signal seen from a collision (S1). The electrons that resulted from the ionisation are accelerated upwards by an electric field into the gaseous xenon, where they receive sufficient energy to ionise the gas, which creates a second signal (S2). The higher in the detector the collision took place, the smaller the distance between the interaction site and the gaseous phase, resulting in a smaller time interval between the S1 and the S2. Therefore, the z-position can be reconstructed by comparing the arrival times of both signals, which is shown in the left panel of figure 3; the right-hand side shows the XENON1T-detector itself.

Figure 3: From [20]. A view into the XENON1T-detector, where the left panel shows the creation of a signal in the detector. As a particle moves through the liquid xenon, it collides with an atom, resulting in an ionisation. Upon de-excitation, light is emitted, which is recorded as the S1-signal. Due to the collision, there are free electrons in the liquid xenon that get accelerated upwards by an electric field. When the electrons reach the gaseous xenon, they have enough energy to ionise the xenon, resulting in a second signal, the S2. The right panel shows a more detailed view of the detector.

Apart from xenon being transparent to the emitted light, it also has a large cross-section, which increases the probability of finding dark matter. This also makes it useful as a screening substance: in addition to the 1 t used for detection, over 2 t of xenon is used solely for shielding purposes. To further reduce background signals, the detector is placed in a big water tank in the Laboratori Nazionali del Gran Sasso, which is situated in a mine in Italy [4].

4.1 Current data-analysis

As mentioned earlier, the shielding with xenon, water and 3600 m of rock is not sufficient to remove all background noise. For example, even though all components of the detector have been selected for low radioactivity, there are still neutrons and electrons in the detector that originate from decaying isotopes. Neutrons that scatter only once in the detector are hard to distinguish from WIMPs; however, neutrons are likely to scatter multiple times before leaving the detector. To distinguish between multiple- and single-scatter events, cuts are made on the following parameters (or features, in machine-learning jargon): the ‘largest S2’, ‘largest other S2’, ‘50% width’ and the ‘Goodness of Fit’ (GoF) [6].

4.1.1 Largest S2

The S2 signal originates from the accelerated electrons that ionise the gaseous xenon. Its size depends both on the energy of the incident particle and on the z-coordinate of the collision. A higher-energy collision within the liquid xenon results in a larger number of liberated electrons, and therefore more ionisation in the gaseous phase. Because the electrons are accelerated on their way up, their ability to ionise the gas is affected by the z-coordinate of the collision: collisions lower in the detector enable more acceleration and result in higher S2 peaks. The largest S2 signal is the highest peak, labelled S2, that is recorded by the PMTs during the event. It is used as the reference against which the other features are plotted.

4.1.2 Largest other S2

If a particle were to scatter twice in the detector, the second collision would be less energetic because of the energy lost in the first. Therefore, the second-largest S2 peak can be coupled to the second collision. Even in single-scatter events a second S2 can be measured, because of noise or electrons delayed by collisions; however, these accidental S2’s are in general smaller than S2’s originating from a second collision.

Currently this cut is implemented as eq. (1) [6], in which ‘largest other S2’ is abbreviated to LOS2 and S2 corresponds to the largest S2 signal.

Figure 4: The current cut in the (Largest S2, Largest other S2)-parameter space. The cut tries to discriminate between single- and multi-scatter events. Even in single-scatters a second S2 signal can be observed, due to the delay of some of the electrons; however, in multi-scatters this second signal is often relatively large, allowing discrimination between the two cases. Points that lie above the blue line, eq. (1), are filtered away, as they are not likely to be single-scatter events.

In figure 4 the (Largest S2, Largest other S2)-parameter space is shown, including the current cut: eq.(1). Data-points above the line are removed, while the events below the line are kept for further analysis.

4.1.3 50%-width

If the second-largest S2 is well separated from the largest S2, it can be identified by the ‘largest other S2’-cut. However, if the time-separation between the two peaks is small enough for the distributions to overlap, the peaks can be classified as a single S2, see figure 5.

Although the second case in figure 5 is classified as one peak, its shape is different from that of a genuine single peak. The 50%-width depends on the shape of the curve, and can be used to discriminate between single and double scatters in case 2 of figure 5.

Figure 5: Schematic view of two consecutive S2-signals. The left panel shows a case that can be identified by the ‘largest other S2’-cut; however, this cut fails for the right panel. As the time between the two signals is short, there is some overlap between the peaks, so that the detector sees only one peak. As this “one peak” has a different shape than a genuine single peak, it can be identified by its half-width, since the half-width depends on the peak-shape.

Another parameter that influences the width of the signal is the height at which the collision occurred. Electrons originating from lower collisions take longer to reach the PMTs at the top of the detector, and consequently diffuse further apart, creating a broader signal. To discriminate between the two effects on the width of the signal, the actual width is divided by the expected width [6],

$$\Delta t_{\mathrm{expected}}(t) = \sqrt{\frac{4.0325\,D\,t}{v^2} + w_0^2}, \qquad (2)$$

where $D$ is the diffusion constant in the detector, 22.8 cm²/s, $v$ the drift velocity of the electrons in the detector, 1.44 mm/µs, and $w_0 = 348.6$ ns an offset that takes into account that the pulse is formed by multiple electron-signals. The dependence on the position in the detector is reflected in the appearance of $t$, the time between collision and detection of the signal: as the field inside the detector is known, the time translates into a distance covered, and therefore into the vertical position.

Figure 6 includes the lines

$$\frac{\Delta t_{50\%}(t)}{\Delta t_{\mathrm{expected}}(t)} < \sqrt{\frac{\mathrm{ppf}\!\left(1 - 10^{-5},\; S2/\mathrm{scg}\right)}{S2/\mathrm{scg} - 1}}, \qquad (3)$$

$$\frac{\Delta t_{50\%}(t)}{\Delta t_{\mathrm{expected}}(t)} > \sqrt{\frac{\mathrm{ppf}\!\left(10^{-5},\; S2/\mathrm{scg}\right)}{S2/\mathrm{scg} - 1}}, \qquad (4)$$

the current cut in this parameter space, with scg a dimensionless constant, 21.3, and ppf the percent-point function (the inverse of the cumulative distribution function). Only points that lie between the curves (3) and (4) are kept.


Figure 6: The 50%-width cut consists of two lines: eqs. (3) and (4). As the width of the signal also depends on the travel time of the electrons, the width is divided by the expected width, eq. (2), resulting in the single-scatter events being centred around a normalised width of 1. Points that differ too much from the expected width, and therefore lie outside the blue lines, are filtered away.
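Eqs. (2)–(4) translate directly into code. The sketch below is illustrative only: it assumes the percent-point function is that of a chi-squared distribution (the text does not name the distribution), uses scipy, and converts the constants of eq. (2) to cm and ns; the input arrays are hypothetical.

```python
import numpy as np
from scipy.stats import chi2  # assumption: ppf of a chi-squared distribution

# Constants from eqs. (2)-(4), converted to cm and ns
D = 22.8e-9    # diffusion constant: 22.8 cm^2/s = 22.8e-9 cm^2/ns
V = 1.44e-4    # drift velocity: 1.44 mm/us = 1.44e-4 cm/ns
W0 = 348.6     # offset, in ns
SCG = 21.3     # dimensionless constant of eqs. (3) and (4)

def expected_width(t):
    """Eq. (2): expected 50%-width (ns) for drift time t (ns)."""
    return np.sqrt(4.0325 * D * t / V**2 + W0**2)

def width_cut(width, t, s2):
    """Keep events whose normalised width lies between the bounds
    of eqs. (3) and (4)."""
    ratio = width / expected_width(t)
    ndf = s2 / SCG
    lower = np.sqrt(chi2.ppf(1e-5, ndf) / (ndf - 1))
    upper = np.sqrt(chi2.ppf(1 - 1e-5, ndf) / (ndf - 1))
    return (ratio > lower) & (ratio < upper)
```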

4.1.4 Goodness of fit

In the previous two cuts the peaks were separated in time; the GoF-cut uses the spatial distribution of the peaks instead. Figure 7 shows the signal of two peaks in the PMTs situated at the top of the detector. Because the electrons are accelerated in the vertical direction, the resulting shapes are roughly circular.

If the signal were a single-scatter, there would be a single circular shape; however, the distortion by a second collision results in a different pattern. To quantify how well a signal resembles a single-scatter, it is compared with simulated patterns, where the differences between all PMT signals and the best fit are summed. As a result, a higher value of the GoF actually means the pattern did not fit the simulated patterns well, and “badness of fit” might have been a more accurate name.


Figure 7: From [6]. A top-view of the detector, showing the individual PMTs. The event shown is a double-scatter event, as it consists of two separated regions of high signal. The purple cross marks the reconstructed position. To identify single- and multi-scatters, the measured signals are compared to simulated patterns, corresponding to the same reconstructed position. The Goodness of Fit (GoF) is defined as the difference of each individual PMT-signal to the simulation, meaning a large GoF corresponds to a bad fit.

Figure 8: Whereas the previous cuts were based on time-separation, the current GoF-cut tries to discriminate between single-scatters and multi-scatters based on space-separation. The yield of all PMTs is compared to simulated single-scatter patterns, where a better fit corresponds to a lower GoF. The resulting cut, eq.(5), removes all points that lie above the blue line.


The result of the GoF-cut is plotted in figure 8, with the cut being:

$$\mathrm{GoF} < 0.0404 \cdot S2 + 594 \cdot S2^{0.0737} - 686. \qquad (5)$$

Since a lower GoF means a better fit, only the points that lie below the curve are being considered.
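As a minimal sketch, eq. (5) becomes a one-line selection; gof and s2 are assumed to be numbers or numpy arrays of equal length.

```python
def gof_cut(gof, s2):
    # Eq. (5): keep events whose goodness of fit lies below the bound
    return gof < 0.0404 * s2 + 594.0 * s2**0.0737 - 686.0
```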

5 Machine learning

As mentioned in the introduction, machine learning is about letting the computer find the best way to classify the data, instead of imposing how the cuts have to be made. To illustrate the difference with conventional methods, a cartoon example is used. This example is a classification problem: such problems consist of data with features (or variables) that are used to predict the class of a certain event.

If a computer had to discriminate between Superman and Spiderman using conventional methods, it would need characteristics of both. Its input could, for example, look like table 1, where the values of specific features have been pre-defined. This would also work with continuous variables (like the amount of red in a picture), but for the sake of the argument only binary features have been listed.

Feature    Superman    Spiderman
Cape           +1           -1
Mouth          +1           -1
Gloves         -1           +1
...

Table 1: Cartoon classification problem. A value of +1 means the feature is present, while -1 means the feature is absent in that class.

The machine-learning approach is different in the sense that the feature boundaries are not imposed: the machine would still be told to look at the presence of a cape, gloves, the amount of red, etc., but the second and third column of table 1 would be left empty. Instead, a dataset is used to let the algorithm complete the table by itself.

The above situation is an example of supervised learning. Machine learning is also usable for unsupervised problems, in which case the number of columns would not be specified, or, with deep learning, for problems without a pre-defined number of rows. However, in this study the number of classes is known (events are either single-scatters or multi-scatters), and therefore supervised learning is sufficient.

The data set that is used to fill the columns is referred to as the training-set [21]: the algorithm gets access to both the features and the classes of its data-points in order to fit a function in feature-space. Afterwards it can assign classes to new points based on their feature-values. To reduce the risk of overfitting, the performance of the trained algorithm is tested on another set, the validation-set. Although both features and classes of this set are again specified, only the features are shown to the machine. The outcome is then compared to the known classes of the data-points to test the performance of the algorithm.

To avoid overfitting, under-fitting or some other problem in the classification, parameters can be tuned and the cycle repeated in order to improve performance on the validation-set. However, by repeating these steps, there is a risk of overfitting on the combined training + validation-set. To measure the final performance of the algorithm, it is therefore tested on a test-set, after which the algorithmic parameters cannot be changed without biasing the test. Figure 9 shows the process of optimising the algorithm and the use of the different data-sets.

Figure 9: The process of training an algorithm consists of multiple steps. At first, different parameters are imposed on the algorithm, which are then used to make the best cut in the training-set. As these cuts could be subject to overfitting or have other problems, the performance of the algorithm is evaluated on the validation-set. To achieve better results on the validation-set, the imposed parameters can be changed, going into a new cycle of training and evaluation. When the desired accuracy on the validation-set is reached, an independent test is needed to check the performance of the algorithm. Therefore, it is tested on a new set, the test-set, which gives the final accuracy of the classifier.

The number of points in each data-set depends on the specific problem, for example on the desired accuracy and the amount of available data. To use the data as efficiently as possible, one should keep the test-set at the minimum size that fulfils the accuracy requirements and use as much data as possible in the training-set. Adding more data to the training-set improves the ability to fit functions in the parameter space, and also lowers the chance of overfitting [21].

5.1 Overfitting

Even though models with higher complexity are able to achieve better results on the training-set, they are not, generally speaking, better models. This paradox is illustrated in figure 10.


Figure 10: Example of overfitting. The blue and red points are points on the curve y = 2x + 3 (green line) with added noise. The blue points (training-set) have been used to fit a 10th-order polynomial, which fits them reasonably well. However, the red points (validation-set) do not match the fitted curve at all, so the fit has to be adjusted.

Even though the 10th-order polynomial fits the training-set better than a linear model would have done, it is clear that the model is not able to predict unseen data with reasonable accuracy. If the red points had lain within x ∈ [0, 8], the results would have been similar, but less pronounced.
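A minimal numpy sketch reproducing the idea of figure 10, with invented noise: an over-complex model is fitted to noisy points on y = 2x + 3 and then evaluated on unseen points.

```python
import numpy as np

rng = np.random.default_rng(0)
truth = lambda x: 2 * x + 3                     # noiseless truth of figure 10

x_fit = np.linspace(0, 8, 11)                   # points used for the fit
y_fit = truth(x_fit) + rng.normal(0, 1, x_fit.size)
x_new = np.linspace(8.5, 10, 4)                 # unseen points

coeffs = np.polyfit(x_fit, y_fit, deg=10)       # deliberately over-complex model
print(np.abs(np.polyval(coeffs, x_new) - truth(x_new)))  # large errors off the fitted range
```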


5.2 Decision tree

The algorithms used to find patterns within the training-set are called ‘classifiers’. The first one considered is the decision tree. A decision tree uses consecutive cuts in feature-space to split the parameter region into rectangular decision regions. In these decision regions points are assigned to a class by a ‘majority vote’: all points get the same class as the one that is most abundant in that part of the parameter space [22]. To clarify the decision-tree procedure, weekend-days and midweek-days will be classified based on the average temperature and the number of people that visited an (unspecified) beach that day. Figure 11 shows both the feature-space (left panel) and the decision tree (right panel), including the decision regions.

Figure 11: Example of a classification problem. The goal is to identify weekend- and midweek-days based on the temperature and the number of visitors on a beach. The left panel shows the feature-space, with fake data. Green dots correspond to weekend-days, while the midweek is represented by red dots. The colored regions show how the decision tree has divided the parameter space, where the green regions are predicted to contain green dots. The cuts have been shown with black lines, where the broadest line corresponds to the first cut. The right panel shows the decision tree that corresponds to this splitting of data. The left branch of each cut corresponds to “yes”, while the right branch is “no”.

On weekend-days more people have time to visit the beach, and therefore a high number of visitors is a good indicator of a weekend-day. After this first cut is made, the bottom rectangle is further divided. Although at high temperatures the beach can be crowded even midweek, this is not the case at lower temperatures; as a result, the second cut identifies days with low temperatures as part of the weekend. This argument does not hold for low numbers of visitors, but in the absence of additional cuts these days are also seen as weekend-days because of the majority vote.
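A minimal sketch of such a tree with scikit-learn; the beach data and the labelling rule below are invented for illustration, in the spirit of figure 11.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
temperature = rng.uniform(0, 25, 300)
visitors = rng.uniform(0, 1200, 300)
# Hypothetical rule: busy days, and cold quiet days, are weekends (+1)
weekend = np.where((visitors > 800) | (temperature < 7), +1, -1)

X = np.column_stack([temperature, visitors])
tree = DecisionTreeClassifier(max_depth=2).fit(X, weekend)
print(tree.predict([[20.0, 1000.0], [15.0, 400.0]]))   # expect [+1, -1]
```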

5.3 Random forest classifier

Although every cut is designed to construct the optimal split at that step, it is possible that better results can be obtained by starting with a less optimal cut that enables a better split in the subsequent step. Another problem with decision trees is that different starting cuts can result in very different final trees. Even though these different trees can reach similar prediction accuracies, an experiment like XENON1T needs algorithms that are reproducible. Finally, it is easy to overfit the data by adding too many cuts [23].

To counter these problems, a Random Forest Classifier (RFC) is used: instead of one tree, multiple trees are used to predict the class of the data-points. The final outcome is the weighted average of the individual predictions, where the weight is the fraction of points that resulted in the majority vote [24]. The number of trees used to create the forest is a parameter imposed on the algorithm; using more trees results in better accuracy, but also in more computation time.

To make sure different trees are created, only a randomly picked fraction of the features is available for each cut. For example, using the features of the current data-analysis, the first cut might be allowed only on the ‘largest other S2’ or the ‘50%-width’, while the second cut is only allowed in the ‘largest other S2’- and ‘GoF’-spaces. The best cut differs depending on the features that may be used, and in this way different random trees are created [25].

To evaluate the best possible cut, the function

$$\Phi = C_1 \sum_{i:\, y_i = +1} |y_i - h(x_i)| \;+\; C_2 \sum_{i:\, y_i = -1} |y_i - h(x_i)|, \qquad (6)$$

which counts the number of misclassifications, is minimised. Here $x_i$ is a vector with all feature-values of the $i$-th data-point and $y_i$ is the class to which it belongs (either +1 or -1); $h(x_i)$ is the prediction, depending on how the cut is made. For the first cut in the above example it would be of the form

$$h(x_i) = \begin{cases} +1 & N_{\mathrm{visitors}} > 800 \\ -1 & N_{\mathrm{visitors}} \leq 800, \end{cases} \qquad (7)$$


where the value 800 would be the number to be optimised to find the best cut.

$C_1$ and $C_2$ in eq. (6) are class-weights, determining the effect of misclassifications. Higher values for these parameters mean a higher penalty, allowing less violation in the training-set. In unbalanced data-sets (sets that have a different number of points for each class) the sums in eq. (6) can differ heavily in size. To make sure that misclassifications of the less abundant class are also considered, the class-weights are chosen inversely proportional to the number of data-points [26].

5.4 Support vector machine

Although decision trees are usable in multi-dimensional problems, the possible cuts are limited by the way they are constructed. For example, a diagonal line in figure 11 would be difficult to realise with decision trees. Support Vector Machines (SVMs), originally constructed by Vapnik [27], are classifiers that are able to fit differently shaped curves in feature-space, dividing it into decision regions.

The working principle of an SVM is to map the feature-space (which can be multi-dimensional) into a higher-dimensional space (the computational space). In this richer space a separating hyperplane is constructed (a hyperplane is the higher-dimensional equivalent of a plane in 3D and a straight line in 2D), which divides the data-points into different classes [28]. Hereafter, the map is inverted, so that the decision region is transformed back to the feature-space.

Even though the separating hyperplane is linear, the decision regions can be a higher-order polynomial, a Gaussian, or another shaped function, depending on the map between feature and computational space [28]:

$$x_i \rightarrow \phi(x_i) = \left(\phi_1(x_i),\, \phi_2(x_i),\, \ldots,\, \phi_n(x_i)\right), \qquad (8)$$

where $\phi(x_i)$ is the map from feature to computational space.

A hyperplane that separates the classes without failure would have to fulfil:

$$w \cdot x_i + b \geq +1 \quad \forall\, x_i \in \text{Class 1}, \qquad (9)$$

$$w \cdot x_i + b \leq -1 \quad \forall\, x_i \in \text{Class 2}, \qquad (10)$$

where the hyperplane is specified by $w$ and $b$ [28].

If the ≥ or ≤ holds with equality, the point lies exactly on the margin and is called a support vector. The further the absolute value of the outcome differs from 1, the larger the distance to the separating plane. As the support vectors lie closest to the boundary, they predominantly determine its shape [29].

Combining eqs. (9) and (10), in the perfect case the following equation would hold:

$$y_i \left(w \cdot x_i + b\right) \geq 1, \qquad (11)$$

for all data-points $i$. However, if the separation on the training-set is not flawless, positive predictions get multiplied by negative true values and vice versa, such that eq. (11) becomes

$$y_i \left(w \cdot x_i + b\right) \geq 1 - \xi_i \qquad (12)$$

instead, where $\xi_i$ is a measure of the misclassification on the training-set and

$$\xi_i \geq 0. \qquad (13)$$

The optimum separation is reached when the distance of all points to the boundary is maximised, while the number of misclassifications, and their size, is minimised. To find this hyperplane, the following function has to be minimised under the constraints (12) and (13) [28]:

$$\Phi(w, \Xi) = \frac{1}{2}|w|^2 + C \sum_i \xi_i. \qquad (14)$$

$C$ in eq. (14) again determines the effect of misclassifications. Alternatively, to include different weightings for the classes, the following function can be minimised:

$$\Phi(w, \Xi) = \frac{1}{2}|w|^2 + C_1 \sum_{i:\, y_i = +1} \xi_i \;+\; C_2 \sum_{i:\, y_i = -1} \xi_i, \qquad (15)$$

which is more suitable for our problem.

To evaluate the discrimination power of the support vector machine, figure 11 has been remade with an SVM, see figure 12. Instead of the consecutive cuts, the decision boundary is now a smooth function; however, the positions of the green and red areas are still roughly equal.


Figure 12: The same data as in figure 11, re-classified using a Support Vector Machine (SVM). Comparing the two figures, the green and red areas are similar in position; however, the SVM resulted in a smoother decision boundary.
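For comparison, a sketch of the same toy problem with scikit-learn's SVC (rbf-kernel), using the same invented data as in the decision-tree sketch; the features are rescaled first, which matters for an rbf-kernel and anticipates the normalisation remark in section 5.6.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(1)
temperature = rng.uniform(0, 25, 300)
visitors = rng.uniform(0, 1200, 300)
weekend = np.where((visitors > 800) | (temperature < 7), +1, -1)  # invented rule

X = np.column_stack([temperature, visitors])
# Rescaling puts both features on a similar scale before the rbf-kernel
svm = make_pipeline(StandardScaler(), SVC(kernel='rbf')).fit(X, weekend)
print(svm.predict([[20.0, 1000.0], [15.0, 400.0]]))
```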

5.5 Multi-layered perceptron

Although computers are able to complete a large number of computations per unit of time, there are still problems that humans can do better or faster, for example face recognition. Therefore, algorithms have been proposed that try to mimic the human brain, in an attempt to get better results.

Figure 13: From [30]. A schematic view of a multi-layered perceptron, two-layered in this case. The input layer is a vector with all the feature-values, and the output layer yields a vector whose components equal the probability of belonging to the corresponding class. The white circles are the individual neurons, and the connecting lines all have an associated weight.


Instead of one powerful processing unit, Multi-Layered Perceptrons (MLPs) consist of many simple processors, in analogy to the neurons that make up our brain. A schematic view of an MLP is shown in figure 13.

The left column is the input layer, which is processed by the hidden layers in the middle before it reaches the output layer. The neurons (represented by white circles) are connected by lines that correspond to a weight $w$, where $w_{ij}^{(l)}$ is the weight from the $i$-th neuron in the $l$-th layer to the $j$-th neuron in the $(l+1)$-th layer. The output of this $j$-th neuron depends on the input of all neurons in the previous layer [31]:

$$x_j^{(l+1)} = f\!\left( \sum_{i=1}^{N} w_{ij}^{(l)} x_i^{(l)} + \theta_j^{(l)} \right), \qquad (16)$$

where $N$ is the number of neurons in the $l$-th layer and $\theta_j^{(l)}$ is an offset corresponding to the neuron on the left-hand side of the equation. The function $f$ is a non-linear activation function; the non-linearity ensures that stacking layers does not yield trivial results. Although there are multiple choices for the activation function, a common form is the Rectified Linear Unit (ReLU) [32]:

$$\mathrm{ReLU}(x) = \max(0, x). \qquad (17)$$

As in eq. (6), the difference between prediction and truth is minimised. However, the parameters to be optimised are now $w_{ij}^{(l)}$ and $\theta_j^{(l)}$, for all $i$, $j$ and $l$. For our problem, with four features, an algorithm using three hidden layers of ten neurons each would have $30 + 40 + 100 + 100 + 20 = 290$ parameters to optimise (the 30 hidden neurons yield 30 offsets $\theta_j^{(l)}$, and there are 260 connections with corresponding weights).
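A minimal numpy sketch of the forward pass of eqs. (16) and (17) through two hidden layers, with random stand-in weights and offsets (in practice these are the parameters the training optimises).

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)            # eq. (17)

def layer(x, w, theta):
    return relu(w.T @ x + theta)         # eq. (16) for one layer

rng = np.random.default_rng(2)
x = rng.normal(size=4)                   # the four input features
w1, t1 = rng.normal(size=(4, 10)), rng.normal(size=10)   # stand-in parameters
w2, t2 = rng.normal(size=(10, 10)), rng.normal(size=10)
hidden = layer(layer(x, w1, t1), w2, t2)
print(hidden.shape)                      # (10,)
```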

5.6 Stochastic gradient descent

Solving the optimisation problems of eqs. (6) and (15) exactly is not easily done; therefore, a numerical approach is used. From a chosen starting point, the gradient of $\Phi$ is computed, after which a step is made towards a lower value of $\Phi$. By repeating these steps, lower and lower values of $\Phi$ are reached, until eventually a minimum is found. This minimum can be either global or local, so multiple starting positions can be tried in succession, lowering the chance of ending in a local minimum.

The above method is called gradient descent. However, since computing the gradient using all available data can be computationally demanding, a subset of the data can be used for each step, in which case gradient descent reduces to Stochastic Gradient Descent (SGD). As the individual steps in SGD are less demanding than in full gradient descent, it can converge in less computation time. However, as less data is used for each step, more steps may be needed to reach the minimum, so a balance between the number of steps and the time per step is needed.

A similar balance applies to the size of the steps taken. If SGD is used to optimise the $w_{ij}^{(l)}$ of the MLP, it takes the following form:

$$w_{ij,t}^{(l)} = w_{ij,t-1}^{(l)} - \eta\, \Phi'\!\left(w_{ij,t-1}^{(l)},\, x_i,\, y_i\right), \qquad (18)$$

where the consecutive steps are separated by time steps $t$, and $\Phi'$ indicates the partial derivative of $\Phi$ with respect to $w_{ij,t-1}^{(l)}$ [33].

The step-size is represented by $\eta$. Higher values of $\eta$ enable coverage of a larger region of the parameter space in fewer steps; however, this comes with the risk of never finding the minimum of $\Phi$, as shown in figure 14.

Figure 14: Stochastic Gradient Descent (SGD) is a method to find the minimum of a function. Full gradient descent calculates the full derivative; however, this can be computationally intensive for larger data-sets. A solution is to use a fraction of the data to determine the direction of each step: SGD. The figure shows the effect of the step-size on the effectiveness of the algorithm, where the density of blue lines symbolises the gradient and the minimum of the function lies within the inner circle. Small steps, yellow arrows, eventually reach the minimum, but it takes many steps to get to the end-point. However, if the step-size is too large, red arrows, there is a risk of overshooting, and the minimum might never be found.

To make sure a single $\eta$ is compatible with all dimensions in feature-space, the feature-values should be of similar size. This creates the need to normalise all features, such that neither the yellow nor the red arrows in figure 14 apply to the algorithm.
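A toy illustration of SGD, assuming invented data and a hinge-style penalty (cf. eq. (12)) on a linear classifier; one randomly drawn point is used per step, which is what makes the descent 'stochastic'.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 4))                     # 1000 points, 4 normalised features
y = np.sign(X @ np.array([1.0, -2.0, 0.5, 0.3]))   # hypothetical true classes (+1/-1)

w = np.zeros(4)
eta = 0.1                                          # step-size eta of eq. (18)
for step in range(5000):
    i = rng.integers(len(X))                       # a single random point per step
    if y[i] * (X[i] @ w) < 1:                      # misclassified or inside the margin
        w += eta * y[i] * X[i]                     # step against the gradient
print(w)
```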

5.7 ROC-curve

After the data has been split and the algorithms chosen and optimised, the performance of the machine-learning code still needs to be evaluated. The number of correctly classified points in the test-set is not a fair representation of how well the algorithm performed. Imagine a data-set with 10,000 multiple-scatters and only a hundred single-scatters: a classifier that predicts every data-point to be a multi-scatter would reach an accuracy of 99%, even though no single-scatter is correctly classified.

ROC-curves, or Receiver Operating Characteristic curves, are a different way of measuring the performance, countering the above problem [34]. A ROC-curve is a plot of the true positive rate (tpr) against the false positive rate (fpr), which are defined as

$$\mathrm{tpr} = \frac{\text{correctly classified positives}}{\text{total positives}}, \qquad (19)$$

$$\mathrm{fpr} = \frac{\text{incorrectly classified negatives}}{\text{total negatives}}, \qquad (20)$$

respectively, where the term ‘positive’ corresponds to one class (the single-scatters) and ‘negative’ to the other (the multiple-scatters).

As follows from eqs. (19) and (20), both values are confined between 0 and 1. Figure 15 shows three different curves: one with perfect classification (purple curve), one with decent classification (blue curve) and one that did not succeed in separating the classes (green curve).

Figure 15: Adjusted from [35]. ROC-curves are a measure of the performance of an algorithm, where the true positive rate, eq. (19), is plotted against the false positive rate, eq. (20). In the perfect case no points would be misclassified, and the purple line would be obtained. In general this is not possible, and ROC-curves of working algorithms look more like the blue line. If an algorithm fails to discriminate between the classes, it is completely random whether it classifies a point correctly: getting 20% of the positive points correct would mean misclassifying 20% of the negative points as well, and the ROC-curve looks like the green curve.


Classifiers predict that a point belongs to a class with a certain probability. By changing the threshold on the certainty needed for a point to be classified as positive, a tpr can be assigned to every fpr. The blue curve in figure 15 is the typical form of a ROC-curve: lowering the threshold increases the number of correctly classified positives, while also increasing the fpr. Beyond a certain threshold, all positives have been identified, and increasing the fpr even further does not yield an increase in correctly classified points.

The green curve, corresponding to the diagonal line, signals a problem in the classification. For every percent of correctly classified positives, the same percentage of negatives is misclassified, meaning the classification is completely random: predicting 50% of the one class correctly results in 50% of the other class being misclassified. Therefore, counter-intuitively, the line that looks like perfect correlation actually corresponds to no correlation between the algorithm's predictions and the data.
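A minimal sketch of how such a curve is obtained from predicted probabilities, using scikit-learn's roc_curve on invented labels and scores: every threshold yields one (fpr, tpr) pair.

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])     # 1 = single-scatter ('positive')
scores = np.array([0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05])  # predicted probabilities
fpr, tpr, thresholds = roc_curve(y_true, scores)
print(np.column_stack([fpr, tpr]))
```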

6 Results

6.1 Data preparation

The data used in this thesis was simulated by M. Vonk [6], who created 500,000 S2-signals with a Monte-Carlo program. Because S1-signals are only used for time-measurements, they were not needed in the simulation, as the time of collision is directly output by the simulation. Although the data contained more features, only the four mentioned earlier were used: the ‘largest S2’, the ‘largest other S2’, the ‘50%-width’ normalised by the expected width of eq. (2), and the ‘goodness of fit’. For a detailed explanation of the data-simulation I refer to his work [6].

The data was split with 10%, 20% and 70% for the test-set, validation-set and training-set respectively. The 350,000 points in the training-set determined the normalisation: all features, including those of the test- and validation-sets, were divided by the highest value of that feature in the training-set, after which the mean value of the feature in the training-set was subtracted. This resulted in the training-set having a mean equal to zero for all features and a difference between maximum and minimum of approximately one. The same applies to the other two sets, but only approximately.

As evident in figures 4, 6 and 8, the multiple-scatters are more abundant. The single-scatter events make up about 10% of the data-points, which has to be accounted for in the training of the algorithms.
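A sketch of the split and normalisation described above, on stand-in data; the array shapes and the labelling rule are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
features = rng.normal(size=(1000, 4)) + 5.0          # stand-in for the four features
labels = np.sign(features[:, 1] - 5.0)               # stand-in single/multi labels
data = np.column_stack([features, labels])

rest, test = train_test_split(data, test_size=0.10)  # 10% test-set
train, val = train_test_split(rest, test_size=2 / 9) # 70% / 20% of the total

# Divide by the training-set maximum, then subtract the training-set mean;
# the same constants are applied to all three sets.
scale = train[:, :4].max(axis=0)
mean = (train[:, :4] / scale).mean(axis=0)
for s in (train, val, test):
    s[:, :4] = s[:, :4] / scale - mean
```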


6.2 Improving current cuts

The current data-analysis has a true positive rate of 0.928 with a corresponding false positive rate of 0.163. To improve on these cuts, the fpr needs to be lowered while the tpr remains unchanged. Figure 16 shows the performance of a support vector machine on the test-set; the current cuts are marked with the red dot, while the black dot is the corresponding point on the ROC-curve.

The parameters imposed on the SVM are:

Parameter      Value       Explanation
class_weight   'balanced'  weights inversely proportional to the number of points per class
kernel         'rbf'       radial basis function
probability    True        to retrieve the confidence in the classification
C              1e2         weight of misclassifications
gamma          0.4         determines the region of influence of the data-points

Table 2: Parameters imposed on the support vector machine. The package used is scikit-learn [24].
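Table 2 maps directly onto the constructor of scikit-learn's SVC; a sketch on stand-in data (in the analysis, the arrays would hold the normalised features and their classes).

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X_train, y_train = rng.normal(size=(200, 4)), rng.choice([-1, 1], 200)  # stand-ins

svm = SVC(class_weight='balanced', kernel='rbf', probability=True,
          C=1e2, gamma=0.4)
svm.fit(X_train, y_train)
scores = svm.predict_proba(X_train)[:, 1]   # probabilities used to trace the ROC-curve
```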

Figure 16: The ROC-curve corresponding to the SVM (rbf-kernel), with parameters as in table 2. The red dot shows the accuracy of the current cut (fpr = 0.163); the black dot is the accuracy of the machine cut (fpr = 0.085), corresponding to the same tpr. As the fpr of the machine cut is about a factor of two smaller, the background is reduced by the same factor, meaning the machine algorithm did considerably better than the current data-cuts.


After it became clear that adding machine learning to the analysis yielded better results than the current method, there were two hypotheses:

1. The two-dimensional cuts were not optimal and could be improved.

2. The multi-dimensional approach of the machine algorithm enabled better classification.

To check both hypotheses, a support vector machine was used to mimic the two-dimensional cuts. The algorithm had access to no features other than the ‘Largest S2’ and the feature considered, making it a purely 2D-problem.

[Figure 17, top row: the ‘Largest other S2’ cut, with the ROC-curve (current cut fpr = 0.198, machine cut fpr = 0.160) next to the decision boundary in the (Largest S2, Largest other S2)-plane. Middle row: the 50%-width cut, with the ROC-curve (current cut fpr = 0.817, machine cut fpr = 0.775) next to the decision boundary in the (Largest S2, 50% width / expected width)-plane.]


[Figure 17, bottom row: the Goodness-of-fit cut, with the ROC-curve (current cut fpr = 0.896, machine cut fpr = 0.888) next to the decision boundary in the (Largest S2, Goodness of fit)-plane.]

Figure 17: To check whether the improvement of figure 16 was due to the high-dimensional analysis or due to the 2D-cuts not being optimal, an SVM with the same parameters was used to classify the scatter events based on only two features. The left column shows the ROC-curves of these algorithms, including both the current accuracy and the machine accuracy. The SVM got better results in all three cases; however, this is not sufficient to explain the factor of two encountered in figure 16. The right column shows the decision boundaries that correspond to the machine cuts. Even though a larger region of the parameter spaces is green, the fpr of these cuts is lower. This is because the green areas that lie outside the current cuts have a low density of data, while the red areas inside the cuts remove a large number of points.

Figure 17 shows the results of these tests, where the same parameters were used as in figure 16, to make sure the results could be compared. The left column shows the ROC-curves, while the corresponding decision boundaries are plotted on the right.

As can be seen in figure 17, changing the two-dimensional cuts results in better performance. Although the machine cuts use larger regions in feature-space, they still remove more multiple-scatters. This can be seen by looking at figures 4, 6 and 8, which show that the density of data in the ‘extra’ regions is small compared to the parts that are removed relative to the current cuts.

However, even though the first hypothesis holds, the improved cuts cannot explain the factor-of-two difference in fpr seen in figure 16, meaning both hypotheses are valid. The fact that machine learning is indeed able to reach better classification than the current analysis justifies comparing the support vector machine with different algorithms, in an attempt to improve the cuts further.


6.3 Comparing different algorithms

Going through the same cycle of training and validation the following parameters were chosen for the RFC- and MLP-classifiers:

Parameter          Value       Explanation
class_weight       'balanced'  weights inversely proportional to the number of points per class
n_estimators       100         number of trees that form the forest
min_samples_leaf   6           minimum number of data-points left in a rectangle
max_features       2           number of features available for each cut

Table 3: Parameters for the random forest classifier from the scikit-learn package [24].

Parameter           Value         Explanation
hidden_layer_sizes  (10, 10, 10)  three layers, all with 10 neurons
activation          'relu'        using a ReLU as activation function
solver              'sgd'         using stochastic gradient descent
alpha               1e-4          similar to the C-parameter of other classifiers

Table 4: Parameters of the multi-layered perceptron [24].
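Tables 3 and 4 likewise map onto scikit-learn's RandomForestClassifier and MLPClassifier; a sketch on stand-in data (note that MLPClassifier takes no class-weight option, matching table 4).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(6)
X, y = rng.normal(size=(200, 4)), rng.choice([-1, 1], 200)   # stand-in data

rfc = RandomForestClassifier(n_estimators=100, class_weight='balanced',
                             min_samples_leaf=6, max_features=2).fit(X, y)
mlp = MLPClassifier(hidden_layer_sizes=(10, 10, 10), activation='relu',
                    solver='sgd', alpha=1e-4, max_iter=500).fit(X, y)
```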

[Figure 18, left panel: ROC-curve of the RFC, machine cut fpr = 0.079 versus current cut fpr = 0.163. Right panel: ROC-curve of the MLP, machine cut fpr = 0.084 versus current cut fpr = 0.163.]

Figure 18: To try to achieve better results, different machine algorithms were tested. The random forest classifier and the multi-layered perceptron both got slightly better results, with the random forest performing best. However, the main difference with the SVM was in run-time: the SVM took multiple hours to train, while these two algorithms finished the training phase within 15 minutes.


The results of both classifiers were similar to those of the support vector machine, see figure 18. However, where the training of the SVM on the full training-set took multiple hours, both the RFC and the MLP finished training within 15 minutes.

7 Conclusion

As is evident from the previous chapters, adding machine learning to the current analysis enables better results: a larger number of multi-scatters can be removed without affecting how many single-scatters are identified.

All three algorithms, the RFC, SVM and MLP, yield a comparable improvement in accuracy, with the number of multiple-scatter events that survive the data-cuts roughly halved. However, as the support vector machine faced difficulties with the large amount of data, I would not recommend that algorithm for the XENON-experiment, where even larger data-sets can be expected.

Even though the random forest classifier achieved slightly better classification than the multi-layered perceptron, I would consider both for implementation. It is also possible that a combination of the two would yield even better results, just as the RFC is made up of multiple decision trees.

While the machine classifiers are able to achieve better results, they lack the transparency of the current cuts, which all have a well-founded physical justification. If for this reason machine learning is not chosen as the data-splitter, a second look could be taken at the current two-dimensional cuts: slightly adjusting these, to avoid small regions with a high multiple-scatter density, might yield minor improvements in accuracy even if no machine learning is used.

References

[1] P. Van Dokkum, et al., Nature 555, 629 (2018).
[2] P. A. Ade, et al., Astronomy & Astrophysics 594, A13 (2016).
[3] G. Bertone, Nature 468, 389 (2010).
[4] E. Aprile, et al., The European Physical Journal C 77, 881 (2017).
[5] E. Aprile, et al., Physical Review Letters 111, 021301 (2013).
[6] M. Vonk, Master thesis (2018).
[7] M. Bojarski, et al., arXiv preprint arXiv:1604.07316 (2016).
[8] D. Silver, et al., Nature 529, 484 (2016).
[10] F. Sebastiani, ACM Computing Surveys (CSUR) 34, 1 (2002).
[11] T. S. van Albada, K. Begeman, R. Sancisi, J. Bahcall, Dark Matter in the Universe (World Scientific, 2004), pp. 7–23.
[12] W. Hu, S. Dodelson, Annual Review of Astronomy and Astrophysics 40, 171 (2002).
[13] E. Hivon, et al., The Astrophysical Journal 567, 2 (2002).
[14] G. Bertone, D. Hooper, J. Silk, Physics Reports 405, 279 (2005).
[15] E. Verlinde, Journal of High Energy Physics 2011, 29 (2011).
[16] M. Markevitch, arXiv preprint astro-ph/0511345 (2005).
[17] L. Bergström, Nuclear Physics B - Proceedings Supplements 138, 123 (2005).
[18] L. Bergström, Reports on Progress in Physics 63, 793 (2000).
[19] XENON collaboration, G. Plante, et al., Columbia University (2016).
[20] J. Aalbers, PhD thesis (2018).
[21] W.-L. Chao, Disp. Ee. Ntu. Edu. Tw (2011).
[22] J. R. Quinlan, Machine Learning 1, 81 (1986).
[23] L. Breiman, et al., The Annals of Statistics 26, 801 (1998).
[24] F. Pedregosa, et al., Journal of Machine Learning Research 12, 2825 (2011).
[25] G. Biau, E. Scornet, Test 25, 197 (2016).
[26] J. J. Rodriguez, L. I. Kuncheva, C. J. Alonso, IEEE Transactions on Pattern Analysis and Machine Intelligence 28, 1619 (2006).
[27] V. Vapnik, A. Y. Lerner, Avtomat. i Telemekh 24, 774 (1963).
[28] E. Osuna, R. Freund, F. Girosi (1997).
[29] C.-C. Chang, C.-J. Lin, ACM Transactions on Intelligent Systems and Technology (TIST) 2, 27 (2011).
[30] M. W. Gardner, S. Dorling, Atmospheric Environment 32, 2627 (1998).
[31] A. K. Jain, J. Mao, K. M. Mohiuddin, Computer 29, 31 (1996).
[32] L. Noriega, School of Computing, Staffordshire University (2005).
[33] T. Zhang, Proceedings of the Twenty-First International Conference on Machine Learning (ACM, 2004), p. 116.
[34] T. Fawcett, Pattern Recognition Letters 27, 861 (2006).
