
Master Thesis

CUDA-Based Trigger System For The XENON Experiment

by

Alper Topcuoglu 6470742

July 7, 2016

54 + 6 ECTS

The research was carried out between January 2014 and June 2016

Supervisors:

Prof. dr. M.P. Decowski
P.A. Breur, MSc

Examiners: Prof. dr. M.P. Decowski, Prof. dr. ir. E.N. Koffeman


Abstract

XENON100 is a dark matter detector operated by the XENON collaboration; its successor, XENON1T, is currently being finalized. This report documents the workings and implementation of a trigger algorithm developed for XENON1T that runs on GPUs. The purpose of this thesis is to discover whether the CUDA parallel computing platform (the NVIDIA GPU architecture that supports general-purpose computing) is suitable for triggering on dark matter signals.

Three triggers are compared on their trigger efficiency and their processing speed: SiFTrig (Simple Fast Trigger), XerawDP (XENON Raw Data Processor), and PaX (HitCluster). SiFTrig is developed for CUDA, XerawDP is the trigger algorithm used for XENON100, and PaX (Processor for Analyzing XENON) is the new data processor developed for XENON1T.

FaX (Fake XENON) provides simulated dark matter signals by simulating a XENON detector. Emulated incoming dark matter particles are passed as input instructions (called a test) in order to simulate dark matter detector signals. PFTest (Peak Finding Test) determines the trigger efficiency of all three trigger algorithms for 7 different tests. The results are quantified in an accuracy classification (ACC) score. For 4 out of 7 tests, PaX has a higher ACC score than XerawDP. SiFTrig has the lowest ACC for every test.

The processing time of each trigger is measured as a function of the number of events processed. It has been measured for 100, 200, 400 and 800 events, with each event containing 40 × 10³ samples. For 800 events, SiFTrig is 55× faster than PaX.

The processing rate of SiFTrig is 1.27 kHz.

CUDA performs well on massively parallel problems, but SiFTrig is not suitable as a trigger algorithm. Further research may lead to more complex parallel algorithms that increase the trigger efficiency while keeping processing speeds high. The disadvantage is that programming for CUDA becomes complex as the algorithm becomes more advanced.


Acknowledgements

I would like to thank every one of my supervisors: Sander Breur, Jelle Aalbers, Christopher Tunnell, and my professor Patrick Decowski. My greatest gratitude is for the patience they have had with me. I would like to thank Sander for his guidance, and Jelle and Chris for their technical support.

I would like to thank Patrick Decowski for giving me the opportunity to work on this special experiment which I believe in time will lead to the detection of Dark Matter.


1. Introduction
2. Dark Matter
   2.1. Evidence For Dark Matter
        2.1.1. Rotation Curves
        2.1.2. Gravitational Lensing
        2.1.3. Cosmic Microwave Background (CMB)
   2.2. Dark matter candidates
        2.2.1. Characteristics
        2.2.2. WIMPs
   2.3. Detection
        2.3.1. Direct detection
3. The XENON Experiments
   3.1. The XENON detector: a dual-phase LXe TPC
   3.2. XENON100
   3.3. XENON1T
4. GPGPU and CUDA Programming Model
   4.1. General Purpose Graphics Processing Unit (GPGPU)
   4.2. Compute Unified Device Architecture (CUDA)
   4.3. CUDA Thread Hierarchy
   4.4. CUDA Fast Fourier Transform
        4.4.1. Filtering: FFT Convolution
        4.4.2. Filtering: Convolution - Overlap-add method
   4.5. Limitations of CUDA
5. FaX and PFTest
   5.1. Fake XENON Experiment (FaX)
        5.1.1. Simulating an S2 event
   5.2. Peak Finding Test (PFTest)
        5.2.1. Confusion Matrix
6. Trigger Setup
   6.1. Simple Fast Trigger (SiFTrig)
        6.1.1. Implementation
   6.2. XENON Raw Data Processor (XeRawDP)
   6.3. Processor for Analyzing XENON (PaX)
7. Trigger Efficiency Analysis
   7.1. PFTest Measurements
        7.1.1. Summary
   7.2. Process Time
8. Conclusion
   8.1. Recommendations
A. Appendix
   A.1. Lifetime of an MLC SSD
        A.1.1. Introduction
        A.1.2. Degradation of floating gates (MLC vs SLC)
        A.1.3. Method (writing data to the SSD)
        A.1.4. Analysis
        A.1.5. Conclusion
   A.2. PFTest Instructions details


Introduction

Several observations show strong evidence that a large portion of the matter in the universe consists of dark matter. However, as of yet, no dark matter particle has been detected.

A method for detecting dark matter is to record the interaction of dark matter as it scatters off nuclei. The interaction probability is extremely low, and only upper limits on the interaction cross section are known.

The XENON collaboration has led the field with the most sensitive measurements, setting the record for the lowest dark matter limits in direct detection. These limits were set by the XENON100 detector, which is stationed at Gran Sasso. The follow-up to XENON100 is called XENON1T and is expected to set limits two orders of magnitude lower.

XENON100 uses a trigger algorithm to detect which signals in the detector should be recorded for analysis. The trigger limits the rate of recorded signals, as not every signal is a dark matter event; dark matter events occur at a low rate. The requirement for the XENON1T trigger is to process events at a rate of 1 kHz. To achieve this rate, either a more advanced algorithm has to be developed, or the computing power has to be increased. GPUs (Graphics Processing Units) can provide this computing power using a highly parallel, sophisticated algorithm.

The outline of this thesis is as follows. First, a description of the observational evidence indicating the existence of dark matter is given, examining possible particle candidates and their properties. Chapter 3 discusses detection principles and the XENON detector. Chapter 4 gives theoretical background on GPGPU (general-purpose computing on GPUs) and CUDA (the GPU development platform). Chapter 5 describes the software which simulates the XENON detector and generates dark matter signals. The main interest of this thesis is building and testing a new trigger algorithm running on GPUs, called SiFTrig. It is tested for its trigger efficiency and its processing speed. The results are discussed in Chapter 7. In conclusion, Chapter 8 presents a recommendation on whether it is viable to use a trigger developed for GPUs.


Dark Matter

Observations show that the largest part of the Universe is composed of (electromagnetically) invisible matter, referred to as dark matter and dark energy. Cosmology provides models describing the state of the universe and explaining the nature of dark matter and dark energy. Combining the Friedmann equation and the energy conservation equation gives [1]:

\Omega = \Omega_R \left(\frac{a_0}{a}\right)^4 + \Omega_M \left(\frac{a_0}{a}\right)^3 + \Omega_\Lambda \quad (2.1)

where

\Omega = \frac{\rho}{\rho_{\rm crit}}, \qquad \rho_{\rm crit} = \frac{3H^2}{8\pi G_N}, \qquad H = \frac{\dot{a}}{a} \quad (2.2)

where ρ is the energy density, ρ_crit is the critical energy density, defined as the energy density at which the universe has a flat geometry, a(t) is the scale factor, which describes the expansion of the universe, and G_N is Newton's constant. Equation 2.1 is known as the ΛCDM model (Lambda Cold Dark Matter). The term cold indicates that dark matter particles have low kinetic energy and non-relativistic velocities. The ΛCDM model describes a flat universe with Ω ∼ 1, with a baryon density Ω_b ∼ 0.04, a non-baryonic dark matter density Ω_c ∼ 0.23 and a dark energy density Ω_Λ ∼ 0.73 [1]. Updated numbers are presented in figure 2.1.


Figure 2.1.: The matter-energy distribution of the Universe, as derived from the ΛCDM model. Percentages are rounded. Figure taken from Pelssers (2015).

To explain the missing mass seen in observations, dark matter must be massive. The WIMP hypothesis proposes particles which are massive, neutral and stable as dark matter candidates. WIMPs interact with Standard Model particles through collisions, which allows for direct and indirect detection. Direct detection experiments, like the XENON detector, use the scattering process to find dark matter particles.

2.1. Evidence For Dark Matter

Observational evidence for dark matter consists of unexpected gravitational effects on galaxies and stars that are best explained by the presence of additional, unseen matter. Here, some of the most convincing evidence for the existence of dark matter, at different scales, is presented.

2.1.1. Rotation Curves

The most recognized measurements indicating the existence of dark matter are the rotation curves of spiral galaxies. A spiral galaxy is made up of a circular disc of gas and stars rotating around a common center. Assuming Newtonian gravity, the distribution of the mass in the galaxy can be determined once the orbital velocities of the stars and gas clouds have been measured. Equating the centripetal force with the gravitational force for a body of mass m moving in a circular orbit of radius R with velocity V around an object of mass M gives equation 2.3:

\frac{mV^2}{R} = \frac{GMm}{R^2} \quad \Rightarrow \quad V = \sqrt{\frac{GM}{R}} \quad (2.3)

For orbits around a centrally concentrated mass, e.g. bodies orbiting the Sun, this equation gives RV² = GM = constant. Thus the orbital velocities of the planets decline with their distance from the Sun according to the Keplerian law V ∝ R^{-1/2}. The mass of a spiral galaxy is approximated by describing it as a solid, rotating disk.

Figure 2.2.: Rotation curve of the NGC6503 galaxy. Measured data points and a curve through the points representing a model which incorporates the disk, gas and dark matter halo. Figure taken from Plante (2012).

The dashed curve in figure 2.2 shows the curve for a disk and behaves as expected. The figure also shows the measured rotational velocity of the NGC6503 galaxy, which does not obey the Keplerian law. The curve is flat for increasing distance, and stays flat for distances beyond the radius of the visible galaxy. This lack of decline of the rotational velocity is observed for several other spiral galaxies [2]. The discrepancy between the measured and predicted rotation curves can be solved by adding a dark matter halo component. The halo contains dark matter and extends beyond the visible radius of the galaxy. The mass is distributed to larger radii than the visible part of the galaxy, meaning the galaxy is more massive than assumed from the visible matter alone. Figure 2.2 displays a fit of a model which includes a dark matter halo alongside the disk and gas, and it fits the data very well.

2.1.2. Gravitational Lensing

Another indication of the existence of dark matter is presented by gravitational lensing measurements.

Photons passing through the gravitational field of a large mass will be bent, a process similar to a converging lens (the amount of bending decreases further away from the object). An object on the same line of sight as the observer will be focused into a ring (known as the Einstein ring) by an intervening spherical mass. This is shown in figure 2.3.

Figure 2.3.: Light coming from an object positioned behind a more massive body is deflected towards the observer, who will observe a number of images of the source under several angles, forming an Einstein ring. Figure taken from Schulz (2011).


Non-spherical objects, or objects not exactly on the line of sight, produce arcs and arclets instead. Background galaxies focused by intervening clusters cause giant arcs and arclets, which have been observed. The radius of an arc is then used to find the central mass concentration in the cluster, and the arclets provide information on its distribution [3]. This effect is shown schematically in figure 2.4.

Figure 2.4.: Top panel: Shapes of cluster members are shown in yellow and background galaxies in white. Lower panel: Same scenario, now including gravitational lensing effects. In between a 3d representation is shown of the positions of the clusters and galaxies, relative to the observer. Figure taken from Wikipedia [4].

Another observation indicating the existence of dark matter is the Bullet Cluster. The Bullet Cluster is the result of two galaxy clusters which collided about 150 million years ago. The mass distributions of the two galaxy clusters were determined from gravitational lensing. These mass distributions were not equal to that of the hot plasma observed in the X-ray regime. The main part of the baryonic matter is plasma, and thus the discrepancy shown in figure 2.5 can only be explained by dark matter shifting the center of gravity because of its collisionless behavior.

Figure 2.5.: Bullet Cluster 1E0657-558. The mass distribution inferred from gravitational lensing is shown in green, while the blue/yellow part is the X-ray emission from visible matter. Figure taken from Schulz (2011).

2.1.3. Cosmic Microwave Background (CMB)

The Cosmic Microwave Background (CMB) provides one of the most compelling arguments for the existence of dark matter. The CMB is a uniform, isotropic background in the radio band. After its discovery, the search for fluctuations began.

After inflation, the universe contained dark matter, baryonic matter and photons. Due to quantum-mechanical fluctuations, small density fluctuations occurred, creating dense regions attracting baryonic and dark matter. These regions heat up, and the baryonic matter builds up radiation pressure. Pushed out by this radiation pressure, the baryonic matter 'escapes' from the dense region. Dark matter keeps being attracted into the dense region, as it does not feel radiation pressure.

These processes cause regions in the universe to expand and cool down. The temperature decrease is such that protons and electrons form neutral hydrogen (recombination). The competition between radiation pressure and gravitational collapse leads to the expectation of density fluctuations in the form of small oscillations. The characteristic scale of these oscillations therefore leaves a typical imprint in the temperature of the Cosmic Microwave Background (CMB) photons. Dense regions are warmer, so the photons measured from those regions have more energy. Figure 2.6 shows a sky map of the CMB temperature.

Figure 2.6.: Map depicting the temperature anisotropies of the CMB as observed by Planck. Figure taken from Hogenbirk (2014).

2.2. Dark matter candidates

Though there is an overwhelming amount of evidence for the existence of dark matter, it is unknown what dark matter is made of. Several theories describe the nature of dark matter particles. One of the most popular models for the dark matter particle is the Weakly Interacting Massive Particle (WIMP).


2.2.1. Characteristics

The observations imply that dark matter particles have to have several characteristics. The most important ones are that the particles need to be stable, and their behavior during the formation of large structures (hot or cold).

The particle has to be stable because there is evidence for dark matter from observations on a broad time scale. Considering the evidence from the CMB, a stable dark matter particle must have been around since shortly after the Big Bang. The existence of a dark matter particle that does not decay does not exclude the existence of dark matter particles which do decay. Therefore there are theories which propose multiple dark matter particles, which all decay to the lightest dark matter particle.

During the formation of galaxies, the dynamics of the dark matter particle influences the structure formation. If the dark matter particles are hot, they damp out fluctuations in density. If the dark matter particles are cold, they do not damp out the fluctuations, which leads to more structure during structure formation [5]. Figure 2.7 illustrates how the gravitational density would look depending on whether the dark matter particle is cold, warm or hot.

Figure 2.7.: Cold dark matter produces the most dense structures, whereas hot dark matter produces the least dense structures. Figure taken from Baudis (2015) [6].


2.2.2. WIMPs

Any particle that is massive, stable and neutral, and does not interact electromagnetically, can be considered as a dark matter particle. The best dark matter particle candidate to date is the Weakly Interacting Massive Particle (WIMP), due to the WIMP miracle.

Right after the Big Bang, the universe is thought to have consisted of a dense plasma, in which particles react to form other particles, and vice versa. This process is in thermal equilibrium, meaning the same number of specific particles are created and annihilated. This means that WIMPs, in thermal equilibrium in the early universe, were formed and annihilated into Standard Model particles at equal rates. Due to the expansion of the universe, the density of the plasma decreases, and some reactions occur less often or not at all, which is called decoupling. Due to decoupling, the WIMP density started to decrease faster than that of Standard Model particles, because the WIMP production process stopped. The particles that prevailed are known as thermal relics, for which the relic density Ω_x, with x being the type of particle, can be computed. The relic density relates to the interaction cross section of the particles. An estimate for the relic density is [1]

\Omega_X h^2 \sim \frac{3 \times 10^{-27}\,\mathrm{cm^3\,s^{-1}}}{\langle \sigma v \rangle} \quad (2.4)

where ⟨σv⟩ is the thermal average of the total annihilation cross section of the particle multiplied by its velocity. Figure 2.8 illustrates this effect for different values of ⟨σv⟩.
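As a rough numerical cross-check of equation 2.4 (the weak-scale value of ⟨σv⟩ used here is the commonly quoted benchmark, not a number taken from this thesis): inserting ⟨σv⟩ ≈ 3 × 10⁻²⁶ cm³ s⁻¹ gives

\Omega_X h^2 \sim \frac{3 \times 10^{-27}\,\mathrm{cm^3\,s^{-1}}}{3 \times 10^{-26}\,\mathrm{cm^3\,s^{-1}}} \approx 0.1,

which is close to the observed cold dark matter density. This is the numerical content of the WIMP miracle discussed below.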


Figure 2.8.: The number density of a particle species over time, excluding universe expansion effects. Curves are shown for various values of ⟨σ_A v⟩, the average value of the annihilation cross section × velocity. Figure taken from Hogenbirk (2014).

Observations like the CMB provide an estimate of the value of Ω for dark matter. Cosmology then implies that the annihilation cross section of a thermally created particle has to be of the same order as that of the weak nuclear force. Additionally, a dark matter particle with a mass in the GeV-TeV range gives the correct relic density from cosmology [7]. This is what is known as the WIMP miracle, making the WIMP the best dark matter particle candidate.

2.3. Detection

The main challenge is now to find the dark matter particle. There are several experiments which try to find a sign of the dark matter particle using various techniques. These experiments can be divided into three categories based on the interaction type that is observed: direct detection, indirect detection, and dark matter particle production at colliders. An overview is shown in figure 2.9.

Figure 2.9.: Three dark matter (DM) interaction processes with standard model (SM) particles. Time is shown on the horizontal axis. Figure taken from Pelssers (2015).

Direct detection looks for the energy (momentum) transfer from the dark matter particle to a standard matter particle through a collision. Indirect detection looks for standard matter particles (e.g. mono-energetic gamma rays) originating from co-annihilations of dark matter particles in regions where a high density of dark matter particles is expected. The production of dark matter particles can be observed as missing energy-momentum when an interaction between standard matter particles takes place. Accelerator experiments try to discover missing energy-momentum in the final state of collisions between standard matter particles, carried away by dark matter particles.

The results of these experiments give a constraint on the interaction cross-section, setting upper limits for scattering and annihilation cross-sections.

2.3.1. Direct detection

The most promising way to detect dark matter particles is by observing their collisions with standard matter particles. With every collision, a small amount of recoil energy is deposited in the medium off which the particle scatters. Sensitive detectors are able to detect these energies. This type of experiment is called direct detection.

For the XENON experiment, xenon is chosen as the detector medium. The XENON experiments have set the most stringent limits on dark matter, although they were surpassed by LUX in 2013. Since the event rate is low, background suppression is essential. Therefore the detector is located deep underground, where the cosmic ray flux is reduced. Aside from physically blocking background, it is possible to filter out background by determining whether the collision was with an electron (electronic recoil) or an atomic nucleus (nuclear recoil). For dark matter signals nuclear recoil events are predicted, while background events are most likely electronic recoils. The upper limit set by the XENON experiment is shown in figure 2.10.

Figure 2.10.: Upper limits on the spin-independent elastic WIMP-nucleon scattering cross section. The new XENON100 curve is the limit set by XENON100 with a second analysis. The limit set by LUX in 2013 is also shown. Figure taken from Danish (2014).


The XENON Experiments

The XENON project has designed a detector which uses liquid xenon (LXe) as target medium in a time projection chamber. XENON10 was the first detector operated by the XENON collaboration, in 2006-2007. It was located in Gran Sasso, Italy, where it profited from 3600 meters water equivalent (m.w.e.) of rock overburden, shielding the detector against background from cosmic radiation [8].

In 2009, the XENON100 detector was set up and used, also at the Gran Sasso Underground Laboratory (LNGS). It used the same dual-phase LXe time projection chamber principle as XENON10, except that XENON100 is a larger detector: it had a 100 kg LXe target instead of 10 kg. It also had improved capabilities for filtering out background. The first live run of measurements lasted 100 days, with a follow-up run of 225 days. The analysis of the second run set the most stringent limits for high WIMP masses in direct dark matter searches.

XENON1T is the follow-up detector to XENON100, with one ton of LXe as target. In total, it uses more than 3 tons of LXe throughout the whole detector, and the background has once again been further reduced. Its objective is to test WIMP-nucleon scattering cross sections down to σ_SI = 2 × 10⁻⁴⁷ cm² for WIMPs with a mass of M_X = 50 GeV.

In case a dark matter particle is not found, the detector can easily be enlarged. The XENONnT experiment will be scaled up to 6 tons of LXe [9].


3.1. The XENON detector: a dual-phase LXe TPC

The XENON detector is a dual-phase LXe time projection chamber. The LXe is held within a steel container with an inner cylindrical chamber. Within this chamber there is a liquid-gas interface, which is sustained below the top of the chamber by pressurizing it with a constant flow of xenon gas.

LXe is a good target due to its scintillation and ionization response to incoming charged particles, neutrons or γ-rays. The measurement of these two excitation channels is done with the TPC, as displayed in figure 3.1.

Figure 3.1.: A schematic sketch of the main concepts of the XENON two-phase liquid-gas time projection chamber (TPC). Also included is a sketch of a waveform containing an S2 and S1 occurrence. Figure taken from Biasini (2014).

The prompt scintillation light, referred to as the S1, is detected by two arrays of photo-multiplier tubes (PMTs). One array is positioned at the bottom of the TPC, submerged in the LXe, while the other array sits at the top, surrounded by gaseous xenon (GXe). A large portion of the S1 signal is observed by the bottom array because of the high internal reflection of the scintillation light at the liquid-gas interface. The electrons released during the collision of an incident particle are collected by electric fields. Between the cathode grid, located at the bottom of the TPC, and the grounded gate, located below the liquid-gas interface, an electric field of 530 V/cm is applied which drifts the electrons upwards. The electrons are extracted and accelerated by an extraction field of 10 kV/cm between the grounded gate and the anode grid, placed slightly above the liquid-gas interface. Within the GXe the proportional scintillation signal is produced, known as S2. The same PMTs detect the S2 signal, making it possible to reconstruct the position of the interaction vertex. As both S1 and S2 signals are measured, the time difference between the signals can be converted into the z-position of the interaction vertex within the TPC, due to the constant drift velocity of the electrons. The (x, y) position is determined from the hit pattern of the S2 signal. The accuracy of the positions is in the mm regime.
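As a worked example of this depth reconstruction (the drift velocity used here is only an illustrative order of magnitude, not a value taken from this thesis): with a constant drift velocity v_d, the depth follows directly from the S1-S2 time difference,

z = v_d\,(t_{S2} - t_{S1}), \qquad \text{e.g.}\ v_d \approx 2\ \mathrm{mm/\mu s},\ \ t_{S2} - t_{S1} = 50\ \mathrm{\mu s} \ \Rightarrow\ z \approx 10\ \mathrm{cm}.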

Position reconstruction enables fiducializing the target, utilizing the self-shielding of LXe for background reduction. Natural radioactivity results in background events occurring at the edges of the TPC, but these cannot penetrate into the fiducial volume and are excluded. In addition, γ-ray induced electronic recoil background events can be distinguished from nuclear recoil WIMP events by the different S2/S1 ratios of the two interaction types. As a result, XENON detectors attain extremely low background conditions.

3.2. XENON100

The XENON100 TPC is cylindrical, with a radius of 15.3 cm and a height of 30.5 cm, enclosing 62 kg of LXe [10]. The walls of the TPC are constructed of polytetrafluoroethylene, due to its good reflective properties for light with a wavelength of 175 nm. It is a good electrical insulator as well. 80 PMTs are housed in the bottom array and 98 PMTs in the top array. The top array is ordered in circles, as this is most effective for the position reconstruction algorithms. A picture of the bottom and top arrays is shown in figure 3.2.


Figure 3.2.: Left: Top array of Hamamatsu R8520-06-Al PMTs, aligned in circles to improve reconstruction of the event position. Right: Bottom array PMTs, grouped as closely as possible to achieve the highest possible light collection, which is required for a low detector threshold. Figure taken from Aprile (2012).

Three field grids are installed within the TPC. At the bottom a cathode grid is placed, and at the top there are an anode grid and a grounded gate grid. Field-shaping electrodes are fixed to the TPC, providing a homogeneous drift field in the LXe and an extraction field above the gate grid. The PMTs are protected from the high voltages of the cathode and anode by an additional pair of grounded meshes, located 5 mm above the bottom array and 15 mm beneath the top array. The continuous flow of GXe at the top keeps the liquid-gas interface at a constant level, above the grounded gate grid and a few mm beneath the anode grid. The outer LXe acts as a self-shield and active veto, with 64 PMTs installed on the outer walls of the TPC.

A total mass of 161 kg of xenon is placed within the TPC and has to be cooled to a temperature of −91 °C (the LXe temperature) [11]. Cooling to this temperature requires a stable cryogenic system. A Pulse Tube Refrigerator (PTR) is used in combination with a helium compressor to reach low temperatures. The PTR consists of a compressor, a cold head and a reservoir. The cold head is connected to the LXe system through a cold finger. Between the cold head and the cold finger there are electric heaters to keep the cold finger at a steady temperature. This is controlled using two PT100 temperature sensors. Figure 3.3 shows a drawing of the cooling system.


Figure 3.3.: Drawing of XENON100's cryogenic system. The cooling is supplied by a 200 W PTR installed outside the shield, attached to the primary cryostat by a double-walled vacuum-insulated pipe. LXe is drawn out of the bottom of the detector, purified in gaseous form, and returned as xenon gas into the diving bell. Figure is not to scale. Figure taken from Aprile (2012).

To prevent heat loss, the LXe is held within a vacuum cryostat. The materials of the equipment located close to the detector have to meet high standards of radio-purity. This prevents radioactive contamination and reduces the background. Materials with unavoidable contamination are placed outside the passive shield. Natural radioactivity (like γ-rays) is shielded by materials with a high proton number, such as lead, while neutrons are moderated and captured by water and polyethylene. The cryostat is enclosed by oxygen-free high thermal conductivity (OFHC) copper (5 cm), surrounded by polyethylene (20 cm), which is enclosed by lead (20 cm), with the outermost shield consisting of low-radioactivity lead (5 cm). Outside of these shields there are water tanks and polyethylene layers as additional passive shielding. The space within the copper shield is purged with high-purity nitrogen to prevent the diffusion of radon, which is emitted from the rocks surrounding the detector in Gran Sasso. Figure 3.4 shows a schematic of the shielding of the XENON100 detector.

Figure 3.4.: A sketch of the XENON100 detector contained within a passive shield consisting of copper, polyethylene, lead and water containers. Figure taken from Aprile (2012).

Nonetheless, impurities exist within the TPC. Highly electronegative gases and water molecules reduce the transmission of light and charge within the LXe, reducing the signal yield (and thus increasing the detector threshold). These impurities can be removed by constantly circulating the xenon through a dedicated gas system. The gas system is also used for taking samples, filling the detector and recuperating the xenon.

A run of 225 days in 2012 resulted in the best limit for spin-independent WIMP-nucleus scattering at high WIMP masses. Due to the shielding, along with the S2/S1 discrimination in the TPC shown in figure 3.5, XENON100 had the lowest background of any dark matter search experiment at that moment.


Figure 3.5.: The event distribution in the log10 S2/S1 parameter space after applying analysis cuts, the fiducial volume cut, and subtraction of the electronic recoil band mean. Black dots are observed events; the red and gray histogram displays the position of nuclear recoil events from a neutron calibration source. With a rejection rate of 99.75%, events above the green dashed line are identified as electronic recoils. Figure taken from Stolzenburg (2014).

Due to the low background, the TPC has a fiducial volume of 34 kg [1]. The distribution of events within the fiducial volume is shown in figure 3.6.

Figure 3.6.: The distribution of events within the TPC. Events rejected by the discrimination parameter space cut are shown in grey. Figure taken from Stolzenburg (2014).


An initial analysis found two events (WIMP candidates) in the signal region, with 1.0 ± 0.2 background events expected. The two events can be seen in figure 3.6. The result in terms of WIMP-nucleus scattering, using a more refined profile likelihood analysis, is presented in figure 3.7.

Figure 3.7.: Exclusion limits from multiple direct dark matter detection searches. XENON100's result from the 225-day run is shown in light blue. The lowest limit is set by LUX (2013), shown as the solid green line. The solid blue line is the projected limit for XENON1T (2017). The DAMA results have been withdrawn. Figure taken from Pelssers (2015).

3.3. XENON1T

To further improve the sensitivity, the follow-up to the XENON100 experiment, XENON1T, has been constructed. XENON1T is based on techniques similar to XENON100, but scales the LXe up to about 3 tons. The total active mass will be ≈ 1 ton, and the background will be reduced by a factor of ≈ 100. A two-year measurement will be run, in which it will reach the goal of σ_SI = 2 × 10⁻⁴⁷ cm² [11].

Much like XENON100, XENON1T is designed as a dual-phase TPC, 1 m in diameter and 1 m in height. The TPC is equipped with two arrays: the top array consists of 121 PMTs while the bottom array consists of 127 three-inch PMTs (Hamamatsu R11410). An artist's impression of the experiment is shown in figure 3.8.

Figure 3.8.: An artist's impression of the XENON1T dark matter experiment. The TPC is shown on the left side, inside a water tank (serving as Cherenkov muon veto) containing the cryostat. The right side shows a service building which holds the gas purification systems and DAQ (Data Acquisition System). Figure taken from Stolzenburg (2014).

Due to the increase of the active volume to ≈ 1 ton of LXe, improvements in LXe purity and gas handling are necessary. In addition, electron drifts of ≈ 1 m have to be realized, while stable LXe temperatures have to be assured. This is realized by improving the cooling and purification systems.

Additionally, the background is further reduced. The liquid xenon target and the TPC are enclosed in a stainless-steel cryostat that is immersed in the middle of a 9.6 m diameter water tank. The ultra-pure water (UPW) serves as a passive shield and concurrently acts as an active Cherenkov veto detector against residual cosmic muons at the underground site of the experiment, in hall B at the Laboratori Nazionali del Gran Sasso (LNGS) in Italy. Materials with high radioactive contamination have to be strictly avoided. The screening of materials has been improved, yielding a reduction in the ²²²Rn contamination.


Future improvements in sensitivity can be achieved by upgrading the XENON1T experiment, after which it will be renamed XENONnT. About six tons of liquid xenon will be used, while most hardware items of XENON1T, like the cryostat and cryogenic systems, water tank, infrastructure building, xenon storage vessel as well as the gas handling system, will be re-used.


GPGPU and CUDA Programming Model

This thesis investigates whether building a GPU-based trigger system for the XENON experiment would yield a major increase in computational speed compared to the CPU-based methods used by the XENON100 trigger, and thus whether GPUs are suitable as trigger systems.

This chapter provides an overview of GPU hardware, and goes in depth on CUDA, its architecture and its programming model.

4.1. General Purpose Graphics Processing Unit (GPGPU)

A Central Processing Unit (CPU) is built to process serially, while a Graphics Processing Unit (GPU) processes in parallel. A CPU executes instructions sequentially, meaning instructions are fetched and executed one by one. A GPU is a stream processor: it executes a function (kernel) on several sets of input data (streams) at the same time. The elements of the input data are passed to and processed by the kernel independently for every stream (meaning that there is no dependency among the elements).

A CPU contains a control unit and one or more arithmetic logic units (ALUs). The ALUs perform arithmetic and logic operations and are controlled by the control unit. Modern CPUs have complex ALUs. A GPU contains many more ALUs, but they are less sophisticated. Because the cores are less complex and there is no large control unit, there is more chip area available for a large number of cores. This difference is shown schematically in figure 4.1.

Figure 4.1.: On the left, the layout of a CPU's ALUs and control unit is shown. On the right, the ALU layout of a GPU is shown. It has smaller ALUs, grouped into one or multiple streaming multiprocessors. Figure taken from Henriksson (2015).

Due to the stream processors, computations can be executed in parallel. This results in speed improvements, as a GPU has over a billion transistors [12] performing parallel computations. Programs that can exploit this parallel architecture can outperform their CPU counterparts. Figure 4.2 shows the performance of single- and double-precision computations for NVIDIA GPUs and Intel CPUs.


Figure 4.2.: The theoretical peak performance over time for single- and double-precision computations on NVIDIA GPUs and Intel CPUs. The most important feature is that the performance gap grows over time. Figure taken from Galloy (2013) [13].

Developers started mapping various data-parallel algorithms onto GPUs by expressing them through graphics APIs such as DirectX. This is how researchers discovered general-purpose computing on GPUs (GPGPU).

As GPUs were designed for graphics computations, GPGPU required programming skill and a deep understanding of the graphics API and the GPU architecture. In view of that fact, NVIDIA released its Compute Unified Device Architecture (CUDA).

It is important to realize that not all algorithms will run more efficiently on a GPU, and therefore a GPU is not meant to replace a CPU. The best candidates for GPU programming are algorithms that repeat the same calculation for every element in the data, where the calculations can be executed independently from each other.


4.2. Compute Unified Device Architecture (CUDA)

CUDA is an extension of the C language developed by NVIDIA. It has its own compiler (nvcc) which adds support for the API calls used by CUDA to execute functions on the GPU. It furthermore gives the ability to execute kernels, modify GPU RAM, and use libraries with predefined functions. It supports non-blocking GPU execution, meaning that while code is being executed on the GPU, the CPU is available for processing.

A CUDA program typically starts by copying memory from the host (CPU RAM) to the device (GPU RAM), then launches a kernel (containing the function to process the data), and finally copies the memory back from the device to the host.
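A minimal sketch of this host-device workflow is given below. It is only an illustration of the copy-launch-copy pattern described above; the kernel, array size and scale factor are hypothetical and not part of the thesis code.

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Illustrative kernel: every thread scales one array element.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h = (float *)malloc(bytes);               // host (CPU RAM) buffer
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, bytes);                           // device (GPU RAM) buffer
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice); // host -> device copy

    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);     // launch the kernel on the GPU
    cudaDeviceSynchronize();                         // wait for the non-blocking launch

    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost); // device -> host copy
    printf("h[0] = %f\n", h[0]);

    cudaFree(d);
    free(h);
    return 0;
}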

Figure 4.3 shows schematically how the CUDA programming model groups threads into blocks, and blocks into grids. Although figure 4.3 illustrates a two-dimensional representation, both grids and blocks can be extended to a third dimension. Each thread maps to a processing unit and every block maps to a streaming multiprocessor. Threads within a block can communicate with each other, and shared memory is used to decrease memory loads.

Figure 4.3.: The CUDA model and its mapping onto the architecture [12]. Threads are grouped into blocks, and blocks are mapped onto the multiprocessors, which execute the threads concurrently. Figure taken from Henriksson (2015).


When updating the CUDA architecture, NVIDIA does not make every new feature available for every previously released GPU. Instead, NVIDIA classifies GPUs by their compute capability; depending on the compute capability, different functions are available.

4.3. CUDA Thread Hierarchy

Methods and functions which are supposed to run on the device are called kernels. Kernels are executed on multiple threads at the same time to take advantage of data parallelism.

Threads are created automatically when a kernel is launched. The programmer chooses the appropriate number of threads, and the thread count is passed along with the kernel launch. The grid contains the entire collection of threads, as shown in figure 4.4.

Figure 4.4.: Grid of blocks containing threads [14]. As an example, Block(1,1) is shown containing 12 threads. Image taken from Tse (2006).

A grid consists of blocks. A block is an array of threads executing the same kernel, and the threads can cooperate to obtain a result. Each block has a specific block identifier. The threads within a block are able to communicate with each other, with each block having its own shared memory in which the threads can share data. To make sure all threads have finished before advancing to the next part of the program, the threads can be synchronized. Syncthreads is mostly used inside a kernel to make sure the reading and writing of shared data happens at the right time. Threads in different blocks cannot interact with each other. Each block can contain a maximum of 512 threads. A grid can contain multiple blocks, with a maximum of 65535 × 65535 × 1 blocks [14].
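The kernel sketch below illustrates these concepts: the global thread index built from the block and thread identifiers, shared memory visible to one block, and synchronization with __syncthreads(). The per-block sum it computes is purely illustrative and not part of SiFTrig; it assumes a block size of 256 threads.

// Sum the input in chunks of one block; one partial sum is produced per block.
__global__ void blockSum(const float *input, float *perBlockSum, int n) {
    __shared__ float cache[256];                          // shared memory of this block

    int global = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    cache[threadIdx.x] = (global < n) ? input[global] : 0.0f;
    __syncthreads();                                      // all writes to cache are done

    // Tree reduction inside the block; only threads of the same block cooperate.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();                                  // wait before the next stride
    }

    if (threadIdx.x == 0)
        perBlockSum[blockIdx.x] = cache[0];               // one result per block
}

// Launched, for example, as: blockSum<<<numBlocks, 256>>>(d_in, d_out, n);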

Once a kernel launches, the grid and block structure is created. Each Streaming Multiprocessor (SM) can execute up to eight blocks. A Streaming Multiprocessor Controller (SMC) manages the distribution of the blocks over the SMs. If a graphics card has more SMs on board, more blocks can be executed concurrently (see figure 4.5).

Figure 4.5.: A GPU with more multiprocessors on board will execute a kernel grid in less time than one with fewer multiprocessors. Figure taken from Tse (2012).


Hardware limitations cause each SM to execute up to 768 threads at the same time. Combined with the number of SMs, the maximum number of concurrent threads can be determined: if a graphics card has 16 SMs, 12288 threads can be executed concurrently.

4.4. CUDA Fast Fourier Transform

NVIDIA developed a special library for time-to-frequency domain conversion. The CUDA Fast Fourier Transform (cuFFT) library is a framework containing optimized algorithms for signal processing. It can transform complex and real 1-, 2- and 3-dimensional arrays. Computations can be performed in both single and double precision. The maximum number of elements a one-dimensional array can contain is 128 million. Depending on the size of the input array, the best algorithm is chosen by the framework. The most common algorithm is the Cooley-Tukey algorithm, which works best if the input signal is an array of size 2^a × 3^b × 5^c × 7^d, which decomposes into radix-2, radix-3, radix-5 or radix-7 form. It is possible to perform a batched execution, in which the input array is cut into smaller arrays. The smaller arrays are then transformed in parallel. This is not necessarily faster than computing the transform of a single (large) array, due to drawbacks in the algorithms.

When computing an FFT, a plan is defined, setting the type of transformation which will be computed. There is the option of a basic plan, but a more advanced plan can also be set up, in which multiple transforms are configured. This is called cufftPlanMany.

The basic plan sets the number of dimensions the FFT will operate in, and the data type. With cufftPlanMany, batched computations can be set up, data with different offsets can be used, and output in different layouts can be computed. The output data layout depends on the type of data that is used as input.

When computing an FFT between real and complex data types, the output consists of N/2 + 1 complex Fourier coefficients (indices 0 to N/2), where N is the size of the input array. Table 4.1 shows the data layout for the different plans.


Transform type        Input: size, type           Output: size, type
Complex to complex    N, cufftComplex             N, cufftComplex
Complex to real       N/2 + 1, cufftComplex       N, cufftReal
Real to complex       N, cufftReal                N/2 + 1, cufftComplex

Table 4.1.: cuFFT data layout [15]. The difference in data type and size is shown for three different transform types.

The library is based on the FFTW library for fast FFT computations on a CPU. In order to maintain compatibility, cuFFT works similarly to FFTW and has a compatibility mode option.
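A minimal sketch of such a transform with cuFFT is shown below. It assumes the waveforms already reside in GPU memory; the function name and the batched one-dimensional real-to-complex plan are illustrative, not the thesis implementation.

#include <cufft.h>

// Forward real-to-complex transform of BATCH waveforms of N samples each.
// Following the data layout of Table 4.1, every waveform of N real samples
// yields N/2 + 1 complex Fourier coefficients.
void forwardFFT(cufftReal *d_signal, cufftComplex *d_spectrum, int N, int BATCH) {
    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_R2C, BATCH);    // 1D real-to-complex plan, batched
    cufftExecR2C(plan, d_signal, d_spectrum);   // execute all transforms in parallel
    cufftDestroy(plan);
}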

The Cooley-Tukey algorithm has a relative error growth rate of log2N where N is the transform size.

4.4.1. Filtering: FFT Convolution

FFT convolution uses the principle that multiplication of two signals in the frequency domain is equivalent to convolution of the two signals in the time domain. The signal representing the filter will typically already be given in the frequency domain.

A convolution is computed as follows. The input signal is converted to the frequency domain by an FFT, multiplied by the frequency response of the filter, and converted back to the time domain by an inverse FFT. The output consists of a real and an imaginary part. Figure 4.6 shows an example of this process. In (a) a filter is shown in the time domain; it is converted into the frequency domain, shown in (b) and (c). In (d) the same is done for the input signal, with the FFT result shown in (e) and (f). The spectra are multiplied, resulting in (h) and (i). The inverse FFT of (h) and (i) results in (g).


Figure 4.6.: Schematic of the principle of FFT convolution. Filter (a) and input signal (d) are converted from the time to the frequency domain, into (b), (c) and (e), (f), which are then multiplied to form the filtered signal in the frequency domain (h). (h) is then converted by an inverse Fourier transform into the desired filtered signal (g) (in the time domain). Figure taken from [16].

The FFT output has the same length as the output signal. The output length is the filter length plus the signal length minus 1. For instance, if the filter has a length of 127 points and the input signal 128 points, then the output signal has a length of 254, and an FFT of 256 points (the next power of two) is computed. When performing the multiplication between the filter and the input signal in the frequency domain, the filter is padded with 129 zeros and the input signal with 128 zeros, so that both have a length of 256. If a signal has 600 samples, then the most practical solution is to add 424 zeros to get a length of 1024, which the algorithm handles most efficiently.
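The sketch below illustrates these two steps: choosing the padded FFT length, and the point-wise multiplication of the two padded spectra on the GPU (the step between the forward and inverse FFT). The helper and kernel names are illustrative only.

#include <cufft.h>

// The linear convolution of a signal of length N with a filter of length M has
// N + M - 1 samples; the FFT length is rounded up to the next power of two.
int fftLength(int signalLen, int filterLen) {
    int needed = signalLen + filterLen - 1;   // e.g. 128 + 127 - 1 = 254
    int n = 1;
    while (n < needed) n *= 2;                // -> 256
    return n;
}

// Point-wise complex multiplication of the two spectra, one thread per coefficient.
__global__ void multiplySpectra(const cufftComplex *a, const cufftComplex *b,
                                cufftComplex *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i].x = a[i].x * b[i].x - a[i].y * b[i].y;   // real part
        out[i].y = a[i].x * b[i].y + a[i].y * b[i].x;   // imaginary part
    }
}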

The cuFFT API is modeled after FFTW, the most efficient CPU FFT library. This whole explanation therefore also applies when programming for cuFFT.


4.4.2. Filtering: Convolution - Overlap-add method

By decomposing the input signal into smaller components, it is possible to make use of the cufftPlanMany option. When the signal is decomposed into smaller components, the convolution is performed using the overlap-add method. The overlap-add method consists of three parts: decomposing a signal into smaller components, processing the smaller components, and then recombining the smaller parts into an output signal. This is shown in figure 4.7.

Figure 4.7.: Schematic of the FFT convolution overlap-add method. Input signal (a) will be convolved with filter (b). Signal (a) is broken into three pieces (c), (d), (e), and each is convolved using the FFT convolution method, shown in (f), (g) and (h). The output signal (i) is found by adding the overlapping output segments. Figure taken from [16].


Consider a signal of length N convolved with a signal of length M; the length of the output signal is then N + M − 1. For example, if signal (a) consists of 300 samples and (b) of 101 samples, the output is 400 samples long. Thus when a signal with N samples is filtered, it is expanded by M − 1 points (to the right). Zeros are appended to the signal from sample 300 to 399.

Figures (c), (d) and (e) show how the signal is decomposed into segments, each containing 100 samples of the original signal. To each segment, 100 zeros are appended on the right. Every segment is convolved with the filter. This can be computed in the time domain, but also by an FFT convolution, as shown in (f), (g) and (h). Every input segment is 100 samples and the filter is 101 samples, so each output segment is 200 samples.

With cuFFT the convolution of these segments can be done in parallel by performing the convolution in batch.

The final part is to add them together. The output segments overlap each other; these overlapping segments are added up to give the output signal (i).

The overlap-add method gives exactly the same output as a single FFT convolution or a time-domain convolution. The drawback is that it is complex to keep track of where these segments overlap, and which elements of the samples have to be added together.
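The CPU sketch below illustrates only the overlap-add bookkeeping; the per-segment convolution is written as a direct time-domain loop, whereas a cuFFT implementation would replace it with the batched FFT convolution described above. The names and the segment length are illustrative.

#include <vector>

// Convolve 'signal' with 'filter' segment by segment and add the overlapping tails.
std::vector<float> overlapAdd(const std::vector<float> &signal,
                              const std::vector<float> &filter, size_t segLen) {
    const size_t M = filter.size();
    std::vector<float> out(signal.size() + M - 1, 0.0f);    // full convolution length

    for (size_t start = 0; start < signal.size(); start += segLen) {
        size_t end = start + segLen;
        if (end > signal.size()) end = signal.size();

        // Convolve one segment with the filter; its result spans segLen + M - 1
        // samples and is accumulated at the segment's offset in the output.
        for (size_t i = start; i < end; ++i)
            for (size_t j = 0; j < M; ++j)
                out[i + j] += signal[i] * filter[j];         // overlapping parts simply add up
    }
    return out;
}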

4.5. Limitations of CUDA

Even when a problem lends itself well to parallel computing, the limitations of CUDA may still make it inefficient to program it for a GPU.

First, the double-precision floating-point support deviates from the IEEE specification. This leads to CPU and GPU programs not giving exactly the same result.

Second, branch divergence in a kernel leads to performance loss: if-statements slow down the processing.

Third, the learning curve for writing efficient CUDA code is steep. It takes quite some time to teach someone CUDA (which is cost-inefficient), so it might be more effective to optimize the CPU code instead.


Fourth, the bus bandwidth and latency between the CPU and GPU might be a bottleneck. If the kernel runs faster than the CPU and GPU can communicate, the communication becomes the bottleneck.

Last, CUDA is proprietary closed-source software. CUDA only works on NVIDIA graphics cards, which means that NVIDIA has a monopoly position when it comes to CUDA.


FaX and PFTest

To test the trigger efficiency of a trigger, one cannot use measurements from XENON100, since that data has already been triggered on. Such data can only be used to check whether another trigger resembles the XENON100 trigger, i.e. triggers on the same signals.

Therefore simulated signals are necessary in order to compare triggers. These signals are provided by FaX (Fake XENON Experiment), software which emulates the behavior of a XENON TPC. The main task of FaX is to compute waveforms, so they can be provided as input for PFTest (Peak Finding Test).

PFTest is a test which runs the signals through a trigger algorithm to determine its efficiency. It assigns a status to each simulated peak, indicating whether the peak has been found or missed.

This chapter discusses FaX and PFTest in further depth. The focus is on simulating S2 events, as the S2 peak is what the XENON100 trigger triggers on.

5.1. Fake XENON Experiment (FaX)

FaX (a Fake XENON experiment) is a waveform simulator for XENON TPCs. FaX predicts how a XENON TPC (and its electronics) reacts to an energy deposition somewhere in the detector. Given some initial input/conditions (e.g. a Monte Carlo GEANT4 simulation), it computes the photon and electron yields, the photon detection times, and the summed waveform. Figure 5.1 shows schematically how the simulation is computed. Each part will be briefly discussed.


Figure 5.1.: A schematic showing the structure of FaX. Given some initial input, it performs several computations to emulate the XENON TPC, with waveforms as output which can be passed on to PFTest.

To run FaX it is necessary to provide a source specification, i.e. to specify the incoming particle. There are several parameters which define a collision; these parameters are shown in Table 5.1. For the simulation of the waveform it is necessary to set the recoil type, the position of the collision within the detector (x, y, z), the number of photons excited by scintillation, the number of electrons produced by ionization, and the time at which the occurrence happens.

Input Parameter   Definition
InstructionId     Index of instruction
Recoil Type       Electronic recoil or nuclear recoil
x                 Position of the collision (x-axis)
y                 Position of the collision (y-axis)
z                 Depth (z-axis)
S1 photons        Number of γ's (S1 signal)
S2 electrons      Number of electrons producing an S2 signal
t                 Time of InstructionId

Table 5.1.: FaX source specification parameters

These parameters can be provided by GEANT4, when trying to emulate an actual (hypothetical) particle, e.g. a WIMP. The parameters can also be provided manually, to simulate specific waveforms. For this thesis, the GEANT4 simulation is skipped, and the specifications are provided manually. In this way, specific tests can be designed to test the triggers on waveforms on which triggers are known to have a hard time. The design of each test is discussed in chapter 7.

Before the yield calculations, several quantities have to be computed, taking into account:

• Interactions for the S1 signal
• The particle number density in the gas (ideal gas law)
• The electric field in the gas
• The reduced electric field in the gas
• Loading real noise data
• Determining a sensible length of a PMT pulse to simulate

After the yield calculations are computed, the electron and γ yields are provided, after which the simulation of the S2 signals starts.

5.1.1. Simulating an S2 event

As XENON100 triggers on S2 signals, it is important to determine what characterizes an S2 signal, and how these characteristics are simulated. Figure 5.2 shows a typical event which consists of both an S1 and an S2 peak.

The two main features which characterize an S2 signal are its amplitude and its frequency components. This does not mean that an S1 and an S2 can be distinguished based on these two features alone. For a single-electron S2 signal, i.e. an S2 signal for which the ionization produces only one electron, the amplitude of the peak is low: it is comparable to the amplitude of a typical S1 peak [1]. The frequency components of a single-electron S2 signal are also comparable to those of an S1 signal.

As a typical S2 peak is broader, the time interval of an S2 signal is larger than that of an S1 signal. For a typical S2 signal a width of 1 µs is expected. A peak with a width of less than 1 µs is considered not to be an S2 signal and will not be identified as one.


Figure 5.2.: The summed waveform representing a typical event. The main S1 and S2 peak candidates are shown. The S2 peak candidate has a larger peak than the S1 peak candidate.

The reason why the amplitude and the width of the peaks are characteristic of an S2 signal becomes clear from the yield calculations that FaX computes. The yield calculations are divided into two parts: the ionization, producing electrons, and the proportional scintillation, producing S2 photons.

First the average drift time is calculated, where the electrons drift from the liquid xenon to the gaseous xenon due to an applied electric field. While drifting, some electrons get absorbed; the number of electrons absorbed during the drift is computed. For the electrons that make it to the region where the liquid and gas meet (the ELR region), the arrival times are calculated.

Once the electrons cross over into the gaseous xenon, proportional scintillation occurs due to the electrons colliding with the gaseous xenon. The simulation calculates how many photons are produced by each electron. The photon production is computed from a random Poisson distribution depending on the gap length elr-gas-gap-length and on s2-secondary-sc-gain-density. The photon production times are then determined, taking into account the singlet/triplet excimer decay times. Then the PMT pulse current can be computed, and these values can be passed on to PaX in order to create the event.


PaX is the Processor for Analyzing XENON and builds the waveform pulse by pulse. There is the possibility to add Gaussian noise or real noise samples to the waveform. The built waveform can be further analyzed by PaX or used in the PFTest (Peak Finding Test).

Listing 5.1 shows an example of instructions (source specifications) for FaX. An instruction gives the parameters necessary to simulate the event. Instruction 0 is an electronic recoil, with the occurrence happening at a random position in the x and y fields, where the z-plane is determined by the depth. 10 S1 photons and 25 S2 electrons will be simulated.

fax_instructions = \
    "instruction,recoil_type,x,y,depth,s1_photons,s2_electrons,t\n" + \
    "0,ER,random,random,0,10,25,0\n"

Listing 5.1: An Example Of Instructions for FaX

Figure 5.3 is a plot of the event which has been simulated by FaX. Both the S1 and S2 peak are clearly identifiable, as given in the instructions. The plot contains the raw waveform, the filtered waveform, and a baseline (computed by PaX). The heights of the peaks have been marked, and the width of the peak is marked in blue for the S1 peak and in green for the S2 peak.

Figure 5.3.: Plot of an event simulated by FaX. Both the raw and the filtered waveform are shown, as well as a baseline. The height is determined for the S1 and S2 peaks.


5.2. Peak Finding Test (PFTest)

Peak Finding Test’s (PFTest) objective is to test the accuracy of a peakfinder using simulated waveforms. A trigger can be incorporated into the peakfinder, so PFTest will be used for testing triggers.

PFTest consists of four stages, which are:

1. Prepare
2. Run
3. Aggregate
4. Analyze

The Prepare stage simulates waveforms (using FaX) and saves them to a file which can be processed by PaX. Information on which events contain which peaks (the simulator truth) is written to a separate (CSV) file.

The Run stage runs PaX for different configurations (setups), i.e. different peak-finding algorithms. The different configurations are different trigger setups, which will be discussed in chapter 6.

The Aggregate stage compares PaX's results to the simulator truth file. A status is assigned to each simulated peak, which is either found or missed. The status is written to an HDF file, which also contains the simulator truth information for each simulated peak (such as the number of photons in the peak) and, if a peak is found, the properties reported by the processor (such as area and width).

In the Analyze stage the HDF file with the aggregated results is loaded. By default, a histogram of the peak-finding efficiency is plotted. The histogram shows the fraction of peaks found versus the number of electrons or photons. Alternatively, the fraction can be plotted against the depth (z) within the detector at which the interaction occurs. These plots show at which energy levels the triggers fail or excel.

5.2.1. Confusion Matrix

To determine which trigger performs best, a score is given to each trigger in every test. The score is based on a confusion matrix.


A confusion matrix is a table used to describe the performance of a classification model when test data containing truth values is available.

The four entries in the table are true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The triggers need to trigger on S2 events. A TP means the trigger triggers on a peak that actually is a S2 event. A TN means there is a peak which is not a S2 event, and the trigger does not trigger on it. A FP means the trigger triggers on a peak which is not a S2 event. Finally, a FN means a peak is found, but the trigger does not label it a S2 event, even though it is one.

From these four numbers, two variables are deduced. The most important one is the accuracy (ACC), which states how often the classifier is correct. It is defined in equation 5.1.

ACC = \frac{TP + TN}{TP + TN + FP + FN} \qquad (5.1)

The second variable is the True Positive Rate (TPR), equation 5.2. The TPR counts how many S2 events were simulated and how often they were correctly predicted.

TPR = \frac{TP}{TP + FN} \qquad (5.2)
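As a concrete example, the two scores can be computed directly from the four counts. The numbers below are made up for illustration only and are not taken from any of the tests.

#include <cstdio>

int main() {
    // Made-up counts for a hypothetical test run
    const double TP = 80.0, TN = 15.0, FP = 3.0, FN = 2.0;

    const double ACC = (TP + TN) / (TP + TN + FP + FN);  // equation 5.1
    const double TPR = TP / (TP + FN);                    // equation 5.2

    std::printf("ACC = %.3f, TPR = %.3f\n", ACC, TPR);
    return 0;
}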

6. Trigger Setup

When scaling from XENON100 to XENON1T, similar processing resources are expected, although there are more events to process. The expected storage requirement is 50 TB/year for the DM search and 500 TB/year for calibration [17]. The expected data flow during the DM search is 10 MB/s, and 50 MB/s when calibrating, while the maximum set by the DAQ is 300 MB/s. Due to this limit, the trigger has to be able to process at a rate of 1 kHz. The processing power can be scaled by utilizing faster (expensive) CPUs, though an alternative is to use (cheap) GPUs.

Since GPUs only perform optimally when the algorithm is suitable for parallel computation, the question is whether a trigger algorithm can be developed which runs optimally on GPU hardware and has a high trigger efficiency. Based on the two main features that characterize a S2 signal, an algorithm has been developed in the CUDA language, named Simple Fast Trigger (SiFTrig). Using PFTest, the trigger efficiency of SiFTrig will be compared to the XENON100 trigger, the XENON Raw Data Processor (XeRawDP), and PaX's trigger, HitCluster.

First the setup of the triggers will be described, whereas in the next chapter the triggers will actually be tested.

6.1. Simple Fast Trigger (SiFTrig)

SiFTrig determines whether a signal contains a S2 peak in three steps: smoothing the signal, a threshold check on the amplitude, and a threshold check on the width of the signal above the amplitude threshold.


Smoothing a signal is done by applying a filter. There are two techniques, namely FFT convolution and the overlap-add method. Using cuFFT this can be computed optimally.
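The FFT and inverse-FFT calls that appear in the implementation below (listing 6.3) can be wrapped as in the following sketch. The wrapper names and the use of an in-place complex-to-complex transform are assumptions for illustration; SiFTrig's actual wrappers may differ.

#include <cufft.h>

// Sketch of a forward FFT wrapper: one 1D complex-to-complex transform,
// executed in place on a device array of N samples.
void signalFFT(cufftComplex* devData, int N) {
    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);
    cufftExecC2C(plan, devData, devData, CUFFT_FORWARD);
    cufftDestroy(plan);
}

// The inverse transform only differs in the direction flag. cuFFT does not
// normalize, which is why a 1/N scale is applied in the pointwise kernel.
void signalIFFT(cufftComplex* devData, int N) {
    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);
    cufftExecC2C(plan, devData, devData, CUFFT_INVERSE);
    cufftDestroy(plan);
}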

Second, the algorithm checks if the signal passes the threshold (amplitude) level. A single-electron S2 peak has a height of around 19 pe [1], at which the threshold is set. This can cause noise to be mislabeled as S2s, since noise with an amplitude higher than 19 pe is present in XENON100 measurements. This computation is also performed as an FFT convolution, again making use of cuFFT.

If a signal passes the amplitude threshold, it also has to pass a minimum-width threshold. The algorithm checks when the signal rises above the threshold amplitude value and when it drops below it again. The width threshold is set at 0.9 µs, so if the crossing is very short, e.g. 0.25 µs, the peak is neglected. The width-threshold check is again computed using cuFFT.

6.1.1. Implementation

Each step of the algorithm uses cuFFT, making the code highly optimized for NVIDIA CUDA hardware. The technical implementation of how this is programmed will be discussed here.

Within a CUDA program it is important to keep track of data that resides on the host (CPU RAM) and on the device (GPU RAM). When computing on a GPU, data has to be copied from CPU RAM to GPU RAM. This can become a bottleneck for the process time, since copying data takes time. This inefficiency occurs again when copying the data from GPU RAM back to CPU RAM.

The algorithm starts by reading in a sample file containing waveforms, and a file representing the filter signal. This sample file is built by FaX.

Consider that the number of samples in the signal is N; then log2(N) is computed. Optimizing an FFT computation means the input data has to have a length that is a power of two, 2^n. This is because of the implementation of cuFFT.
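A host-side sketch of this padding step is shown below. The zero-padding to the next power of two is an assumption based on the statement above, not necessarily the exact strategy used in SiFTrig.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Zero-pad a signal up to the next power of two, since cuFFT performs
// best on power-of-two transform sizes. Assumes the signal is non-empty.
std::vector<float> padToPowerOfTwo(const std::vector<float>& signal) {
    const std::size_t N = signal.size();
    const std::size_t exponent =
        static_cast<std::size_t>(std::ceil(std::log2(static_cast<double>(N))));
    std::vector<float> padded(static_cast<std::size_t>(1) << exponent, 0.0f);
    std::copy(signal.begin(), signal.end(), padded.begin());
    return padded;
}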

Due to the CUDA block architecture, the algorithm decides how many blocks it will allocate. The maximum number of threads per block is 1024, so the number of blocks is determined by N/1024.
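This block calculation can be sketched as follows; the rounding up is an assumption so that a sample count that is not a multiple of 1024 is still fully covered.

// Number of CUDA blocks needed to cover N samples with at most
// 1024 threads per block, rounded up.
inline int numBlocks(int N, int block_size = 1024) {
    return (N + block_size - 1) / block_size;
}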


These tasks were performed on the CPU; the following ones will be performed by the GPU. First, a timer is started to measure the process time.

// create timers
cudaEvent_t start, end;
cudaEventCreate(&start);
cudaEventCreate(&end);
float elapsedTime;
cudaEventRecord(start, 0);

Listing 6.1: Creating a timer in CUDA

Data will be transferred from CPU RAM to GPU RAM, for which memory has to be allocated on the device. After allocation, the data is copied to the device.

cudaMalloc((void**)&devSignal, sizeN);
cudaMalloc((void**)&devFilter, sizeN);
cudaMalloc((void**)&devFilterEdge, sizeN);

cudaMemcpy(devSignal, hostSignal, sizeN, cudaMemcpyHostToDevice);
cudaMemcpy(devFilter, hostFilter, sizeN, cudaMemcpyHostToDevice);

Listing 6.2: Allocating memory and copying data in CUDA

Then functions are called to perform the FFT convolution:

signalFFT(devSignal, N);
signalFFT(devFilter, N);
ComplexPointwiseMulAndScale<<<n_blocks, block_size>>>(devFilter, devSignal, N);
signalIFFT(devSignal, N);

Listing 6.3: Running cuFFT computations

This computation is explained in 4.4.1.
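For reference, a pointwise complex multiply-and-scale kernel of this kind typically looks like the sketch below. The in/out convention (writing the result back into the signal array) is an assumption; the actual kernel used by SiFTrig is the one described in 4.4.1 and may differ in detail.

#include <cuComplex.h>

// Sketch: pointwise complex multiplication of the filter and signal spectra,
// scaled by 1/N to compensate for cuFFT's unnormalized inverse transform.
// The result is written back into the signal array.
__global__ void ComplexPointwiseMulAndScale(const cuFloatComplex* filter,
                                            cuFloatComplex* signal,
                                            int N) {
    const float scale = 1.0f / static_cast<float>(N);
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N;
         i += blockDim.x * gridDim.x) {
        const cuFloatComplex prod = cuCmulf(filter[i], signal[i]);
        signal[i] = make_cuFloatComplex(prod.x * scale, prod.y * scale);
    }
}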

After the smoothing has been applied to the signal, a threshold value is set and the signal is checked against it.

float thresholdValue = 0.6; // 19 pe
threshold<<<n_blocks, block_size>>>(devSignal, thresholdValue, N);

Listing 6.4: Applying the amplitude threshold in CUDA
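The body of the threshold kernel is not listed here; a plausible minimal version, which simply zeroes all samples below the threshold, is sketched below. The zeroing behaviour is an assumption, not necessarily what SiFTrig does.

// Sketch: zero every sample below the amplitude threshold, leaving only the
// parts of the waveform that can belong to a S2 peak candidate.
__global__ void threshold(float* devSignal, float thresholdValue, int N) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N && devSignal[i] < thresholdValue)
        devSignal[i] = 0.0f;
}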


Applying a threshold value is similar to applying a filter, so another FFT convolution is computed there. This makes good use of the cuFFT library, taking advantage of its optimized code. This is repeated for the second (width) threshold.

As the program ends, the timer has to stop.

cudaEventRecord(end, 0);
cudaEventSynchronize(end);
cudaEventElapsedTime(&elapsedTime, start, end);

Listing 6.5: Stopping the timer in CUDA

6.2. XENON Raw Data Processor (XeRawDP)

XeRawDP operates in two stages: it first searches for S2-like peaks, and then searches for S1-like peaks in between the S2 peak candidates, although not beyond a S2 peak that surpasses a 50 mV threshold. This condition is set because the S1 signal is expected to precede a S2 signal if both the S1 and the S2 candidate arise from the same energy deposit. S1 peak candidates that occur due to any of the many photoelectrons that follow a S2, such as PMT after-pulses and single-electron S2s, are avoided. The first step in the peak-finding algorithm is applying a digital filter to smooth out the waveform, which filters out the high-frequency components. The filter applied is a raised-cosine low-pass filter with a cut-off frequency of 3 MHz.

The second step of the algorithm is to find where the signal exceeds a threshold of 10 mV for at least 0.6 µs, a time interval broad enough to contain at least a single S2 peak. Then the average of the 0.21 µs of the waveform preceding and following the interval is taken; this must not exceed 5% of the maximum of the peak. For long after-pulsing tails (which follow large S2 peaks) the interval above threshold will most likely contain multiple S2 peaks.
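A sketch of such an interval search is shown below. It illustrates the idea only and is not XeRawDP's actual code; assuming a 100 MHz digitizer (10 ns samples), 0.6 µs corresponds to 60 samples.

#include <utility>
#include <vector>

// Sketch: find stretches in which the filtered waveform stays above a voltage
// threshold for at least a minimum number of samples. The default values are
// illustrative, not XeRawDP's configuration constants.
std::vector<std::pair<int, int>> intervalsAboveThreshold(
        const std::vector<float>& waveform,
        float threshold_mV = 10.0f,
        int min_width_samples = 60) {
    std::vector<std::pair<int, int>> intervals;
    const int n = static_cast<int>(waveform.size());
    int start = -1;                                    // -1: not inside an interval
    for (int i = 0; i < n; ++i) {
        if (waveform[i] > threshold_mV) {
            if (start < 0) start = i;                  // interval opens
        } else if (start >= 0) {
            if (i - start >= min_width_samples)        // long enough to keep?
                intervals.emplace_back(start, i);
            start = -1;                                // interval closes
        }
    }
    if (start >= 0 && n - start >= min_width_samples)  // interval reaching the end
        intervals.emplace_back(start, n);
    return intervals;
}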

The third step is to search for the multiple S2-like peaks within the found interval. Again potential peaks are searched for by using a threshold, and the sample values are then traced until they drop to 0.5% of the maximum of the peak or until the slope of the signal flips sign. This defines the left and right boundaries. If the peak found has a Full Width at Half Maximum (FWHM) larger than 0.35 µs, it is identified as a valid S2 peak candidate, and the position and data points for its boundaries are saved. This repeats itself until the stopping condition has been met. Figure 6.1 shows a visualization of the recursive peak search.

Figure 6.1.: A visualization of the S2 peak finding algorithm for XeRawDP. The grey lines are the interval bounds. In the first step, the position of a S2 peak candidate is shown at the largest sample. The green dotted lines are found by applying the slope-change and threshold conditions. If the peak meets all requirements set, it is accepted as a S2 peak. In the second step, the algorithm looks to the left of the first peak and discovers a second S2 peak. The peak further to the left is skipped due to the minimum-width condition. The third step shows the last S2 peak being found. Figure taken from Plante.


When the search for large S2 peaks is done, the algorithm continues by searching for small S2 peaks, defined as peaks ranging from a single-electron S2 up to tens of pe.

Another filter is applied, with a higher frequency cut-off, to identify intervals where the average height is larger than what is expected for a single-electron S2 signal. The filtered waveform has to be larger than 1 mV for longer than 0.4 µs, and the average of the 0.1 µs preceding and following the interval must not surpass 5%. Another condition is that the ratio of the interval maximum to the width must be larger than 0.1 mV/ns. Only the 32 largest small S2 peaks are saved.

The algorithm will continue on to identify S1 peak candidates.

6.3. Processor for Analyzing XENON (PaX)

PaX identifies peaks by looking for hits in pulses.

SiFTrig and XeRawDP trigger on the summed waveform, which is the summation of the pulses of every PMT. Instead of triggering on the summed waveform, PaX does an analysis per pulse (per PMT) to decide if it triggers.

Considering an incoming pulse, the first step is to determine a baseline. The baseline is computed by taking the mean of the first initial_baseline_samples samples in the pulse. initial_baseline_samples is a number which can be set by the user, but it defaults to 5.
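A minimal sketch of this baseline computation is given below. PaX is written in Python; the C++ version here is purely illustrative.

#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch: the baseline is the mean of the first initial_baseline_samples
// samples of the pulse (5 by default).
float computeBaseline(const std::vector<float>& pulse,
                      std::size_t initial_baseline_samples = 5) {
    const std::size_t n = std::min(initial_baseline_samples, pulse.size());
    if (n == 0) return 0.0f;
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum += pulse[i];
    return sum / static_cast<float>(n);
}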

PaX then has two threshold levels, high and low. A peak candidate starts when the signal passes the low threshold and ends when it drops below the low threshold again. While the signal is above the low threshold, it also has to pass another threshold, the high threshold.

The thresholds can be set on three different quantities. These are computed per pulse, and the highest one (in absolute value) is used.

The first quantity determined is the height-over-noise threshold. This threshold operates on the height above the baseline/noise level. The noise level in each pulse is computed as $\langle (w - \text{baseline})^2 \rangle^{0.5}$, with the average running only over samples smaller than the baseline. This quantity can be used for both the low and the high threshold.
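The noise-level computation can be sketched as follows (again an illustrative C++ version of what is described above, not PaX's actual Python code).

#include <cmath>
#include <vector>

// Sketch: noise level = sqrt(mean((w - baseline)^2)), where the average runs
// only over samples that lie below the baseline.
float noiseLevel(const std::vector<float>& pulse, float baseline) {
    double sum_sq = 0.0;
    int count = 0;
    for (float w : pulse) {
        if (w < baseline) {
            const double d = w - baseline;
            sum_sq += d * d;
            ++count;
        }
    }
    return count > 0 ? static_cast<float>(std::sqrt(sum_sq / count)) : 0.0f;
}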


The second quantity determined is the absolute ADC (digitizer) counts above baseline. If no measurable noise is present in the waveform, which happens for pulses that are constant over the initial_baseline_samples, the noise level is zero; the thresholds should not be allowed to fall to zero in that case, which would make PaX crash. This quantity too can be set for both the low and the high threshold.

The third quantity determined is the minimum height over the high or low threshold. This quantity is determined in order to have fewer problems with upward and downward fluctuations, especially in larger pulses. However, one large downward fluctuation will cause a sensitivity loss throughout the entire pulse. This threshold operates on the height above the baseline / the height below the baseline of the lowest sample in the pulse.

Figure 6.2.: Top two plots from the left: largest S1 and S2 peak. Top two plots from the right: PMT detection, bottom and top. Middle: the event showing the found S1 and S2 peaks. Bottom: the red dots represent each PMT that found a hit.
