Track Reconstruction with Cellular Automaton: a Performance Engineering Case-study in HEP


MSc Computational Science

Master Thesis

Track Reconstruction with Cellular Automaton:

a Performance Engineering Case-study in HEP

by

Julius Roeder

10044868

22nd June 2018

November 2017 - June 2018

Supervisors:

Dr. Ana Varbanescu

Dr. Roel Aaij

Prof. Dr. Gerhard Raven

Assessor:

Dr. Clemens Grelck


A B S T R A C T

The Large Hadron Collider (LHC) collides protons. Each collision event results in a shower of charged particles, which travel through the detector and are measured by its sensors. A particle interacting with the sensors results in a hit. As a particle travels through several sensors, its individual hits form a track. Saving all collision events to disk is prohibitively expensive. Therefore, the most interesting events need to be selected in real time. Reconstructed tracks are needed for this selection.

The current CPU algorithm can process between 7500 and 12000 events/s on all virtual cores of an Intel Xeon E5-2630, which is only an indication of the order of magnitude that has been achieved on current hardware. With the LHCb upgrade, the current track reconstruction procedure cannot cope and the upgrade will need at least twice the computational capacity to be able to run several additional algorithms. Thus, the current processing rate is too slow.

In this work we investigate whether Cellular Automata (CA) can be a reliable alternative for fast track reconstruction. Additionally, we investigate the problem from a computational point of view (on top of the physics analysis).

We implement a CA algorithm in C++ and perform multiple rounds of optimisations based on bottleneck analysis. The speedup from the initial C++ version to the final version is 208 times. Our final CA algorithm can process 518 events/s, including data preprocessing. An important objective of this work is to devise a model for predicting the runtime of an event. We model the runtime using both a statistical model and an analytical model. Our models are able to provide accurate performance prediction for individual events and individual parts of the algorithm.


A C K N O W L E D G M E N T

I would first like to thank my thesis supervisors Ana Varbanescu, Roel Aaij and Gerhard Raven. The door to Roel’s and Gerhard’s office was always open whenever I had questions about physics and the LHCb experiment/detector. Ana’s advice steered me in the right direction and helped me be a better researcher. I would also like to thank them for reviewing my thesis and providing me with comments.

Secondly, I would like to thank the members of the defence committee for agreeing to read my thesis on short notice.

I would also like to acknowledge the Beauty parlour team at Nikhef for explaining high energy physics to me during the many coffee breaks.

Finally, I want to express my gratitude to Helena for unfailing support and her valuable comments on this thesis.


C O N T E N T S

List of Figures
List of Tables

1 Introduction
  1.1 Research Question and Approach
  1.2 Structure
2 Background and Related Work
  2.1 The Detector
  2.2 Track Reconstruction Algorithms
  2.3 Cellular Automata
  2.4 Cellular Automata in High Energy Physics
3 Data and Methodology
  3.1 Data
  3.2 Performance Engineering
  3.3 Performance Modelling
4 Cellular Automaton Design for Track Reconstruction in the Upgraded LHCb Experiment
  4.1 Make Doublets
  4.2 Finding the Neighbours
  4.3 Evolving the Cellular Automaton
  4.4 Extracting All Possible Tracks
  4.5 Removing Clones and Ghost Tracks
5 Implementation and Performance Engineering
  5.1 Python
  5.2 Original C++
  5.3 Parameter Exploration
6 Performance Modelling
  6.1 Theoretical Roofline Model
  6.2 Statistical Model
  6.3 Analytical Model
7 Conclusion
  7.1 Physics Efficiency
  7.2 Performance Development
  7.3 Performance Modelling
  7.4 Future Work
Bibliography


L I S T O F F I G U R E S

Figure 1: A toy event with 50 tracks, comparing the loose hits with the formed tracks.

Figure 2: An average event with 2000 hits.

Figure 3: Side view of the LHCb detector [1].

Figure 4: Sketch of the VELO detector [2].

Figure 5: Oscillating Blinker repeating every two periods.

Figure 6: Oscillating Beacon repeating every two periods.

Figure 7: Distribution of event sizes for Minimum Bias and BsPhiPhi events.

Figure 8: Comparison of an average minimum bias event and a large minimum bias event.

Figure 9: Comparison of an average minimum bias event and a large minimum bias event represented in 2D, where each circle represents one sensor (i.e. the z coordinate of a hit) and the position of a hit on the circle represents its x and y coordinates.

Figure 10: Base Roofline model representing the peak performance of the machine used for all our measurements.

Figure 11: Three tracks in 2D.

Figure 12: Three tracks in 2D with all doublets.

Figure 13: 3D representation of the steepness of two doublets, where the dashed line represents a doublet that is too steep.

Figure 14: 3D representation of the Φ difference, where the dashed line represents a doublet with a too high Φ difference.

Figure 15: Three tracks in 2D with all allowed doublets.

Figure 16: Examples of doublets and their neighbours.

Figure 17: Three tracks in 2D including all doublets with at least one neighbour.

Figure 18: Examples of doublets and their neighbours, where the thickness of a doublet represents its state.

Figure 19: Exponential model for predicting runtime based on the number of hits in an event.

Figure 20: Detailed analysis of the six main parts of the Cellular Automata (CA) algorithm.

Figure 21: Runtime vs Number of Hits for the original C++ version.

Figure 22: Exponential model for the unoptimized C++ version. The data is a sample; the runtime has an exponential relationship with the number of hits. The green dotted line shows the exponential fit.

Figure 23: Roofline analysis for the original C++ version.

Figure 24: Slopes in the dx/dz and dy/dz dimension between hits of simulated tracks.

Figure 25: Number of doublets created for the original C++ version and after the azimuthal angle cut optimisation.

Figure 26: Frequency of the Φ difference between hits of Monte Carlo tracks across 10000 minimum bias events.

Figure 27: Runtime vs Number of Hits for the original C++ version.

Figure 28: Roofline analysis for the Azimuthal Angle Cut (AAC) Optimisation.

Figure 29: Runtime vs Number of Hits for the Hit Object Simplification optimisation.

Figure 31: Runtime vs Number of Hits for the Outlier Elimination Optimisation.

Figure 32: Runtime vs Number of Hits for the Outlier Optimisation and the Area Calculation optimisation versions.

Figure 33: Roofline Analysis - Including the Area Calculation Optimisation.

Figure 34: Roofline Analysis - Including the 2D Grid and Binary Search Optimisation.

Figure 35: Impact of two physics parameters on runtime and physics efficiency.

Figure 36: Impact of two physics parameters on runtime and physics efficiency.

Figure 37: Roofline Analysis - Including the Optimal Physics Performance Results.

Figure 38: Final Roofline model including the theoretical measures.

Figure 39: Fits of the statistical models and the underlying data.

Figure 40: Distribution of the errors of the quadratic model.

Figure 41: Fit of the exponential model for predicting P1.

Figure 42: Prediction error of the making doublets phase.

Figure 43: Prediction error for the sum of doublets linear model.

Figure 44: Error for predicting the runtime of the finding neighbours phase, when using an over-estimation for the complexity of the binary search.

Figure 45: P1 and P2 percentage per number of doublets.

Figure 46: Error for predicting the runtime of the finding neighbours phase for the final model.

Figure 47: Runtime for extracting tracks vs the number of hits in an event.

Figure 48: Time taken for the short track removal phase.

Figure 49: Predicting the number of tracks in an event based on the number of hits.

Figure 50: Error of the full analytical model for predicting the runtime of an event, clearly showing the under-estimation for larger events.

Figure 51: Events/s of the different versions.


L I S T O F T A B L E S

Table 1: CPU counters used for the GFLOP estimation of the Cellular Automaton for an Intel Kaby Lake.

Table 2: Comparing the physics performance (%) and clone rates (%) for the three different algorithms across 30 test events, where the first number is the reconstruction efficiency and the second number is the clone rate.

Table 3: Comparing mean, minimum and maximum time (seconds) taken to reconstruct one event for each of the three methods.

Table 4: Comparing the physics performances (%) and clone rates (%) for the current sequential version, the GPU implementation and the CA algorithm, where the first number is the reconstruction efficiency and the second number is the clone rate. The ghost rate (%) for each algorithm is given in the last row.

Table 5: C++ performance for the GPU Search-by-triplet method and the CA method.

Table 6: Comparing the physics performances (%) and clone rates (%) for the original implementation and the AAC-optimized CA algorithm.

Table 7: Profiling results of the Azimuthal Angle Cut Optimisation.

Table 8: Comparing the physics performance (%) and clone rates (%) for the original implementation, the AAC CA algorithm and the Hit Object Simplification.

Table 9: Profiling results for the Outlier Elimination Optimisation.

Table 10: Profiling results of the 2D Grid and Binary Search Optimisation.

Table 11: Comparing the physics performances (%) and clone rates (%) for the original implementation and the final CA algorithm.

Table 12: Theoretical Roofline model measurements by phase of the algorithm, FLOPs and bytes summed over 1000 events.

Table 13: Average theoretical Roofline model measurements by phase of the algorithm.

Table 14: Percentage error for the three statistical models.

Table 15: Basic operations considered and their abbreviations.

Table 16: Comparing the physics performances (%) and clone rates (%) for the GPU implementation and the final CA algorithm across 10000 minimum bias events.

Table 17: Comparing the physics performances (%) and clone rates (%) for the GPU implementation and the final CA algorithm across 10000 BsPhiPhi events.

Table 18: Performance results over the seven optimisation rounds.


1

I N T R O D U C T I O N

The LHC is the world’s largest particle accelerator and collides protons. Collisions take place at four points along the 27km ring, where experiments are located [3]. Our work focuses on the Large Hadron Collider beauty (LHCb) experiment, which investigates the slight difference between matter and antimatter. Each collision results in a shower of charged particles (hereinafter referred to as "particles"), which travel through the detector, being measured by sensors. A particle interacting with one sensor results in a hit in that specific sensor; every hit is characterised by the x, y and z coordinates of the interaction between the particle and the sensor. We define an event as a collection of all the hits from all particles formed and registered by the detector in a given bunch crossing, where a bunch crossing (hereinafter referred to as "collision") consists of on average 5 proton-proton interactions. An example of a small event is presented in Figure 1a.

As a particle travels through several sensors, its individual hits (i.e., interactions with each sensor) form a track. Thus, we define a track as a sequence of hits from the same particle in consecutive sensors. Figure 1b shows a small event with 50 tracks.

(a) Toy event with hits from 50 tracks. Each point is one hit in one of the sensors. In total the detector has 52 sensors.

(b) Toy event with 50 tracks.

Figure 1: A toy event with 50 tracks, comparing the loose hits with the formed tracks.

The LHCb experiment will be upgraded to increase the rate at which event-data can be read from the detector from 1 MHz to 40 MHz. At a nominal event size of 100 kBytes, the amount of raw data generated will reach 4 TB/s. Considering that only a small percentage of events is interesting for further analysis, saving the data of all events to disk would be impractical and prohibitively expensive. Instead, selecting only the most interesting events in real time is required. Reconstructed tracks are needed for this selection, making track reconstruction a necessary processing step for filtering of the event-data generated by the detector [4].

The reconstruction of tracks is a challenging problem for two main reasons. First, the hit density in an actual event is very high, as shown in figure 2, and, depending on the algorithm used, this raises different challenges. For example, the computation time of the Denby-Peterson algorithm [5, 6] increases with the third power of the density. A naive track-following algorithm, on the other hand, easily loses its trail in a high-density event [7]. Second, the reconstruction has to be done very fast within limited computing resources [7].

Figure 2: An average event with 2000 hits.

The detector forwards the collision data to the event-builder server farm; however, the bandwidth of one event-builder server is too low to handle the data stream. Therefore the data stream is split into 8800 links and the readout of the various sensors proceeds in parallel, where up to 48 links feed part of the data into one event-builder server. The data of one event is thus distributed across multiple servers. For reconstruction the data needs to be combined into one event, hence one server in the event builder is designated to build the full event and all other servers forward their data to the designated server. Once the whole event is assembled, it is forwarded to the High Level Trigger (HLT) farm. The HLT farm processes the data and, among other things, reconstructs the tracks. The allocated budget for the HLT farm allows for approximately 1000 CPU servers [4].

With the LHCb upgrade, the current track reconstruction procedure cannot cope: the rate of collisions read from the experiment will see a 40-fold increase, requiring a processing rate of 40000 events/s per HLT server [4]. The currently available sequential version can only reconstruct between 7500 and 12000 events per second per server [8], which is only an indication of the order of magnitude that has been achieved on current hardware.

An alternative to using a CPU based reconstruction would be a graphics processing unit (GPU) based reconstruction. Using GPUs might be more cost-effective as they could be housed in the same chassis as some of the servers and therefore process more events/s per Swiss franc (CHF). A current suggestion is to include GPUs in the 500 event builder servers, thus one server would need to process 80000 events/s [4]. The GPU track reconstruction based on the search-by-triplet method can process approximately 61000 events per second per server [9].

On top of that, according to Roel Aaij and Gerhard Raven, "Efficient selection of events requires more information than that provided by these algorithms. At least twice the computational capacity will be required to be able to run several additional algorithms. The quoted processing rate is therefore still too low" (personal communication, 11.06.2018).

1.1 r e s e a r c h q u e s t i o n a n d a p p r o a c h

In this work we investigate whether Cellular Automata (CA) can be a reliable alternative for fast track reconstruction. CA methods have a proven track record in High Energy Physics (HEP) [10, 11, 12, 13, 14], and have the potential to work well on GPUs as they only access local data, i.e. minimize the access distance of the data in the memory [14].

Therefore our research question is: Is it feasible to use Cellular Automata for high-performance track reconstruction in the LHCb experiment?

We first design the base algorithm, implement it, and, following a systematic performance engineering process, refine the algorithm. All refinements and improvements are tested in practice, using simulated data. The improvements are both of algorithmic and implementation nature. In comparison to previous research on CA, our work also investigates the problem from a computational point of view (on top of the physics analysis). Thus, we provide a thorough performance analysis and aim to build parametric performance models to predict the runtime per event.

1.2 s t r u c t u r e

In chapter 2 we explain the detector used in the LHCb experiment as well as the sub-detector (Vertex Locator (VELO)) that is the focus of this work. Furthermore, we give a brief introduction into track reconstruction algorithms and CAs. We also review previous work on CA in HEP, where we distinguish between two different CA methods. Chapter 3 describes our data and methodology. Firstly, we cover the different data types and visualize the data for better insight. We then detail the tools used in the performance engineering process. Lastly, we explain the dataset used for the performance modelling. Chapter 4 explains the design of our CA algorithm, which is split into 5 sections: one section per algorithmic step. Additionally, we also explain the algorithmic refinements devised throughout the performance engineering process. Chapter 5 presents the six optimisations of the performance engineering process and the respective performance. In chapter 6 we present our theoretical Roofline analysis, the statistical model and the analytical model. Finally, in chapter 7 we deliver the concluding remarks and present the results of the current prototype.


2

B A C K G R O U N D A N D R E L A T E D W O R K

This chapter explores the detector architecture in more detail and gives a short overview of the different reconstruction algorithms and their potential weaknesses. Lastly, we give an introduction to Cellular Automata and their applications.

2.1 t h e d e t e c t o r

The LHCb detector consists of a series of subdetectors, shown in figure 3, which detect particles created by collisions between protons [15].

Figure 3: Side view of the LHCb detector [1].

The point where the collisions occur is located inside the VELO subdetector, shown at the left edge of figure 3. Each collision results in a shower of new particles, which travel through the detector, being measured by the sensors in the subdetectors. In this work we only focus on the VELO subdetector. The VELO consists of a series of sensors shown in figure 4, where each vertical line represents a sensor. The coordinate system of the VELO has x, y and z coordinates. Each sensor corresponds to one z coordinate. A sensor consists of two modules. The coordinates x and y represent the position at which a particle passed through the modules in the sensor (figure 4).


Figure 4: Sketch of the VELO detector [2].

2.2 t r a c k r e c o n s t r u c t i o n a l g o r i t h m s

The reconstruction of tracks and vertices is an important part of the data analysis chain of HEP experiments. Tracks are first reconstructed in the VELO and then used to find their common point(s) of origin, which are called primary vertices. These tracks can then be used as a starting point for reconstruction of tracks in subsequent sub-detectors [16].

The LHCb experiment distinguishes two main types of tracks [4]:

• Velo tracks: tracks with hits in the VELO only.

• Long tracks: tracks with hits both in the VELO and in the tracking detectors downstream of the experiment’s dipole magnet.

Long tracks can be divided into several subtypes; the categorisation depends on simulation information not available in real data, namely the physics process which produces these particles. The subtypes include: long>5GeV, long strange, long strange>5GeV, long fromb and long fromb>5GeV. The ">5GeV" suffix indicates that these particles carry more than 5 gigaelectronvolts (GeV); "strange" and "fromb" indicate certain properties of the particle. Long tracks are of general interest to the LHCb experiment, but the "fromb" tracks are the most important. High reconstruction efficiency for the "strange" tracks is non-essential.

Several indicators can be used to estimate how well an algorithm reconstructs tracks. Firstly, the reconstruction efficiency is the percentage of particle tracks that were found in comparison to the particle tracks that can be found. The number of particle tracks that can be found is known as we work with simulated data. Therefore, the reconstruction efficiency is defined by Schiller [16] in equation 1, where F is the set of particles found and R is the set of particles that can be found. # is the operator that returns the number of elements in the set.

e = \frac{\#(F \cap R)}{\#R} \qquad (1)
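As a small illustration of equation (1), the efficiency can be computed directly from two sets of particle keys. The sketch below assumes that found and reconstructible tracks have already been matched to Monte Carlo particles and are identified by integer keys; the function name and types are illustrative only.

#include <algorithm>
#include <iterator>
#include <set>

// Reconstruction efficiency e = #(F ∩ R) / #R, cf. equation (1).
// `found` holds the particle keys of the tracks the algorithm found (F),
// `reconstructible` the keys of all particles that can be found (R).
double reconstruction_efficiency(const std::set<int>& found,
                                 const std::set<int>& reconstructible)
{
    if (reconstructible.empty()) return 0.0;  // avoid division by zero

    std::set<int> both;
    std::set_intersection(found.begin(), found.end(),
                          reconstructible.begin(), reconstructible.end(),
                          std::inserter(both, both.begin()));
    return static_cast<double>(both.size()) / reconstructible.size();
}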

Other important performance indicators are the ghost and clone rates. The ghost rate indicates the number of tracks constructed that do not correspond to a particle passing through the detector. They may, for example, consist of segments belonging to several particles, noise and measurements left over from the previous bunch-crossing (spillover). The clone rate indicates the number of reconstructed tracks that duplicate the same particle. Both are sources of false-positive tracks [16].

Two other indicators are the purity and the collection efficiency. Purity is the ratio of the number of correct hits assigned to a track to the total number of hits assigned to that track. The collection efficiency is a measure of how efficiently the hits produced by a particle are used, i.e. an algorithm that only uses half of the hits produced will most likely be worse than an algorithm using all of the hits from a particle [16].

The performance of the tracking algorithms is important for real time filtering and the quality of physics analysis. Events that contain more interesting tracks need to be reconstructed accurately enough to follow the full track throughout the whole detector [16, 17]. A basic selection of interesting events can be done based on a set of independent combinations of selection criteria [18]:

1. A single track that does not originate from any primary vertex (proton-proton collision) and has sufficient momentum and transverse momentum.

2. Two tracks that do not originate from any primary vertex, but form a vertex together and together have sufficient invariant mass and momentum.

3. The event contains so-called muons.

Reconstruction has three steps: track finding, track fitting and quality testing. We focus on track finding, also known as pattern recognition (i.e., the assignment of hits to track candidates) [17].

Track finding techniques can be grouped into two categories: local and global. The local or sequential methods include track road, track following [17] and progressive track finding [19]. The global or parallel methods include Template matching [20], Hough transform [21], Legendre transform [22], Hopfield network [5], Elastic net [23], and CA [12, 13, 24, 25].

Local Methods

Local or sequential methods first generate an initial track segment (seed) as a starting point. These segments consist of two or three hits. Seeds are then extended, for example by extrapolation. Lastly, a quality criterion selects good tracks [17, 7].

Seeds can be constructed in different ways. One can construct a seed between two neighbouring sensors; the advantage of such short segments is that the number of fake seeds is low, but the angular precision is also low. One can also construct a seed between two distant sensors. The number of choices is then much higher than in the first version, implying a higher rate of fake seeds; at the same time, the angular precision can be better [7].

Track road computes an approximate track and its tolerance from the seed, and picks up hits within the tolerance road. Track following extrapolates the seed and picks up the next hit. This is repeated until the last sensor. Progressive track finding is a refined version of track following, where the extrapolation and hit selection are based on track statistics [17].

Local methods suffer from seeding bias (i.e., the seeding method has a large impact). All hits are considered one after another; once a hit is used, it will not be considered again, even if it would have been more suitable as part of a different track [7].

Global Methods

Unlike the local methods, global methods consider all hits at once. The result should be independent of the starting point. Thus, they do not suffer from seeding bias.


Template matching can be used if the number of possible patterns is restricted. For each pattern a template (mask) can be made, and the hits of an event are compared against the stored templates. Template matching does not scale very well, as the number of templates increases quickly with granularity and the number of computations increases with the number of patterns [7].

Transforms, such as the Hough transform, are used in image analysis, computer vision and image processing. Hough transforms come with several limitations, such as the lack of competition between track candidates [26]. Additionally, reconstructing events with many hits and tracks is often not feasible due to long computation times [27].

Neural network techniques, such as the Hopfield network, come with different flaws. The Denby-Peterson method, a version of the Hopfield network, does not take into account an explicit track model, thus ignoring valuable information. Additionally, the computing time increases with the third power of the track density. A method referred to as Cellular Automaton, resembling a discrete Denby-Peterson net, has been used successfully to circumvent several shortcomings of the Denby-Peterson method [7].

2.3 c e l l u l a r a u t o m a t a

CA were discovered by Stanislaw Ulam and John von Neumann [28] in the 1940s and gained more attention throughout the 1970s with Conway's Game of Life. A simple 2D CA is based on an orthogonal grid of square cells (figure 5). Each cell in the grid is either alive or dead (in figures 5 and 6 black cells are alive). The Game of Life follows four simple rules that update the status (i.e., dead or alive) of a cell depending on the status of its 8 neighbours [29]. As described in [29], the rules of the Game of Life are:

1. In the case of under-population, cells with fewer than two alive neighbours die.

2. In the case of over-population, cells with more than three alive neighbours die.

3. In the case of reproduction, any dead cell with exactly three alive neighbours becomes alive.

4. Any alive cell with two or three alive neighbours stays alive.

Based on these rules one can create intriguing evolving patterns such as the glider gun. Simpler patterns include the blinker and the beacon, seen in figures 5 [29] and 6 [30] respectively. Both patterns alternate between two configurations and repeat thereafter.

(a) First Blinker position. (b) Second Blinker position.


(a) First Beacon position. (b) Second Beacon position.

Figure 6: Oscillating Beacon repeating every two periods.
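To make the four rules concrete, the following sketch performs one synchronous Game of Life update on a small fixed-size grid; the grid size and helper names are purely illustrative.

#include <array>

constexpr int N = 8;  // illustrative grid size
using Grid = std::array<std::array<bool, N>, N>;

// Count the live cells among the 8 neighbours of (r, c); cells outside the
// grid are treated as dead.
int live_neighbours(const Grid& g, int r, int c)
{
    int n = 0;
    for (int dr = -1; dr <= 1; ++dr)
        for (int dc = -1; dc <= 1; ++dc) {
            if (dr == 0 && dc == 0) continue;
            const int rr = r + dr, cc = c + dc;
            if (rr >= 0 && rr < N && cc >= 0 && cc < N && g[rr][cc]) ++n;
        }
    return n;
}

// One synchronous update applying the four rules listed above.
Grid step(const Grid& g)
{
    Grid next{};
    for (int r = 0; r < N; ++r)
        for (int c = 0; c < N; ++c) {
            const int n = live_neighbours(g, r, c);
            next[r][c] = g[r][c] ? (n == 2 || n == 3)  // rules 1, 2 and 4
                                 : (n == 3);           // rule 3: reproduction
        }
    return next;
}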

CA have been applied successfully in diverse areas such as lava flow modelling [31] and urban growth simulations [32]. Cellular Automata, in different variations, have been used in several high energy physics experiments for track and vertex reconstruction [10, 11, 12, 13, 14].

2.4 c e l l u l a r a u t o m a t a i n h i g h e n e r g y p h y s i c s

Let us formalize the CA discussion from the previous section for HEP. In general a CA needs the following (a minimal data layout reflecting this is sketched after the list):

• Cells: cells can either consist of individual hits (space-point model) or of segments of hits (segment-based model). Segments consist of pairs of hits and are referred to as doublets.

• Neighbours: in the space-point model, neighbours are points that are physically close and can form a track. In the segment model, neighbours are doublets that share one space-point.

• Rules: the rules are very similar to the Game of Life rules in the space-point model. In the segment model the state of each cell is initialised to one. A doublet/cell has neighbours on the left- and right-hand side, so we need to pick a direction in which to evolve the CA. If one chooses to evolve the CA from left to right, a cell increases its state if a left neighbour has the same state.

• Time: time evolves discretely. In every iteration all states are evolved simultaneously, until no more cells exist that have the same state as their left neighbour. After the CA has converged, the tracks have to be collected.
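As mentioned above, a minimal data layout for the segment-based variant could look as follows. The field and type names are our own illustration and do not correspond to the exact structures of any of the cited implementations.

#include <vector>

// A hit measured by one sensor of the detector.
struct Hit {
    float x, y, z;   // position of the particle/sensor interaction
    int   sensor;    // sensor index (i.e. the discretised z position)
};

// A doublet is a pair of hits on consecutive sensors; it is one cell of the
// segment-based CA. Its state starts at one and grows during the evolution.
struct Doublet {
    int hit_first;                      // hit on the earlier sensor
    int hit_second;                     // hit on the later sensor
    int state = 1;                      // CA state, initialised to one
    std::vector<int> left_neighbours;   // doublets whose second hit equals hit_first
};

// The automaton for one event is then just the hits plus the cells.
struct CellularAutomaton {
    std::vector<Hit>     hits;
    std::vector<Doublet> cells;
};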

The best features of CA based track reconstruction algorithms are that they are local and parallel. They do not involve exhaustive combinatorial searches. The highly structured information reduces the amount of data. Lastly, the actual computations are very simple [12, 11, 14, 13].

Space-point based CAs

The use of the space-point based method is rather rare; it was mostly explored until 1995. The main usage was cleaning the data and restoring missed hits.

The first use of Cellular Automata for data processing of multi-wire proportional chambers (a detector type able to detect thousands of particles) is recorded in the 1993 paper by Glazov et al. [13]. The paper used cells similar to the Game of Life, each cell representing one hit in the detector. Cells stay alive or die depending on the number of neighbours. Additionally, new phantom hits are created in cases where the neighbours indicate a missed hit. The approach proved to be very successful for cleaning the data and restoring missed hits. The algorithm converges quickly and cleans the event of 60-70% of the noise hits.


Casolino and Picozza [24] extended the research by Glazov et al. to not only reduce noise but also reconstruct tracks, using a space-point based CA for the CERN Proton Synchrotron. They concluded that the CA based algorithm is well suited for both particle identification and noise filtering, and that, due to its simplicity, it could be appropriate for on-line applications (i.e., be used for identification and filtering while the experiment is ongoing).

Segment-based CAs

In 1995, research switched towards the use of segment/doublet based CA algorithms as they were more convenient [25]. The convenience comes from the additional directional information that doublets provide. Most of these follow the same structure:

• Make segments

• Run the CA

• Extract tracks

All of the performance indicators in the paragraphs below are for the hardware and experiments at a given time, thus a fair comparison between previous performance and the performance achieved in this work cannot be made.

Bussa et al. [25, 33] were among the first to use a segment-based approach to reconstruct tracks in the DISTO Spectrometer, achieving a high efficiency and a speed of 1000 events/s. The studies [11, 12, 34] are slightly more detailed on their exact CA implementation for tracking.

The events of the NEMO-2 experiment are less complex than those of the current LHCb experiment. The events consist of 5 frames (sensor layers), each with a pair of planes of 32 horizontal and 32 vertical cells. The first step is to make all doublets, i.e. all possible connections between two sensor layers. Due to possible inefficiencies in the detector, it is also necessary to make all possible doublets between a layer and the next-but-one layer, hereinafter referred to as long doublets. Each doublet's status is initialised to 1. Then the CA is evolved, where each cell increments its status if the state of its neighbour is the same as its current status. Neighbours must share a common point. When the CA converges the tracks are extracted, starting with the cell with the highest state. The track extraction also introduces angular cuts between segments. This leads to good track reconstruction and a factor of 5 speedup over the previously used method. The resulting algorithm can process 350 events/s [12].

Abt et al. [11] studied the usage of CA based track and vertex reconstruction algorithms in the HERA-B experiment. They first focused on the vertex reconstruction [11], following a similar approach as Kisel et al. [12]. The first step is again to construct all possible doublets between neighbouring layers, including long doublets. After that, the evolution is run. Two segments are only considered to be neighbours if they share a point; however, in contrast to [12] the breaking angle has to be small enough. The breaking angle is defined as the angle between the two doublets; particles typically continue on a smooth trajectory, thus the direction of a doublet is unlikely to change and the angle between the two doublets has to be small enough. The track extraction (backwards pass) is the same as in Kisel et al. [12]. Lastly, the paper introduces a quality estimator that favours long and straight tracks to remove ghost tracks. This implementation achieved a reconstruction efficiency between 83% and 98.4%. The clone and ghost rates are 2.8% and 24.7% respectively. The reconstruction takes between 6 and 150 ms, depending on the number of segments created (1000 to 10000), on a 450 MHz Pentium 2 [11].

Abt et al. in [34] focused on full track reconstruction in the HERA-B experiment. The methodology is the same as for the previous paper. During the track extraction, if more than one suitable neighbour is found the track is split in two, therefore creating all possible tracks. The clone and ghost removal mechanism works as follows:


1. The longest and straightest track candidate is picked first and all hits belonging to that track are marked as used.

2. If the fraction of used hits for the next track candidates exceeds some predefined level, the track is discarded.

3. All hits from accepted tracks are marked as used.

The method achieves an efficiency of 97%, a ghost rate of 14% and a clone rate of 2%. One event is reconstructed in 240 ms on a 500 MHz Pentium 3. However, this also includes additional steps [34].

The method by Kisel [35] is very similar to the previous papers. He applies some additional cuts to limit the number of doublets created based on the geometrical acceptance. The doublets are extrapolated to find neighbours, which are used for the CA evolution. It achieves an efficiency between 89.46% and 99.45% [35].

The latest work we review is the 2007 PhD dissertation by Schiller [16], focusing on track reconstruction in the outer tracker of the LHCb experiment using Cellular Automata. Schiller focused on the current LHCb detector, while we work on its upgrade. In addition, his algorithm targets track segments in a sub-detector downstream of the magnet, the so-called "outer tracker". The CA algorithm follows the same idea as the previously mentioned papers, however the method was altered and extended due to the specific geometry of the outer tracker. The outer tracker consists of three stations, each with four layers. The first and last layers of each station are vertical, however the two middle layers are slightly tilted. This geometry results in the use of two CAs, one for the vertical projection and one for the horizontal projection, where the former can be used to confirm the latter. That also means that the neighbours are worked out in both projections. The new method achieves high efficiencies and outperforms the standard algorithm in two different samples. However, it produces more ghost tracks. Lastly, the CA algorithm is a factor of 2 slower than the standard algorithm, taking 259 ms per event on an AMD Opteron CPU (2.4 GHz). Additionally, it is not well behaved for high occupancy events, i.e. events with more than 10k measurements in the outer tracker.

Overall, the previous work has shown that CA can be used successfully to reconstruct tracks in a wide range of HEP experiments. Therefore, we expect that segment based CA can be used for track reconstruction in the LHCb experiment. Additionally, due to the wider use of segment-based CA we also use a segment-based approach, similar to the approaches in [16, 35, 34, 11, 12].


3

D A T A A N D M E T H O D O L O G Y

In this chapter we briefly discuss the data, the performance engineering process and the performance modelling.

3.1 d a t a

At this point in time, event data is generated by Monte Carlo simulations, as the upgraded phase of the experiment will only start collecting data in 2020/2021. The data generated by simulation is sufficiently realistic for our purposes. For the Python version we only have 30 events available in the right format. We examine a total of 20000 events for the C++ version.

We can distinguish between different types of events for the C++ version. Minimum bias events are events that arise from a bunch-crossing without selection criteria which could bias their properties. We will use 10000 minimum bias events. Moreover, we also test the performance of our final algorithm using 10000 BsPhiPhi events. Bs→φφ decays are of interest to the experiment; the probability of such a decay is 1.87 × 10⁻⁵ [36]. These events are slightly more difficult to compute due to, among other things, having a higher number of hits on average, as seen in figure 7. Additionally, the tails of the distribution are enhanced, and these events are, on average, larger (i.e., contain more hits).

Figure 7: Distribution of event sizes for Minimum Bias and BsPhiPhi Events.

To illustrate the difference between an average and a large event, an average event is shown in figure 8a, and a much busier event with more than 7100 hits in figure 8b.


(a) All hits for an average minimum bias event with a total of 2000 hits.

(b) All hits for a large minimum bias event with a total of 7100 hits.

Figure 8: Comparison of an average minimum bias event and a large minimum bias event.

In the 3D representation one can see the different tracks represented by different colours, which makes it relatively easy to spot hits that belong to the same track. The two representations in figures 9a and 9b make it even easier to see the relation between different hits. The centre of the circle is the interaction region and each circle represents a sensor, thus each circle represents the z coordinate of a hit. The position on the circle is a representation of the x & y coordinates. Some tracks in the 2D representations, especially in the large event, do not originate in the centre, indicating noise hits.

(a) All hits for an average minimum bias event with a total of 2000 hits represented in 2D.

(b) All hits for a large minimum bias event with a total of 7100 hits represented in 2D.

Figure 9: Comparison of an average minimum bias event and a large minimum bias event represented in 2D, where each circle represents one sensor (i.e. the z coordinate of a hit) and the position of a hit on the circle represents its x and y coordinates.


3.2 p e r f o r m a n c e e n g i n e e r i n g

We first implement the CA algorithm in Python and then port the algorithm to C++.

Over several optimisation rounds we refine the algorithm and the data structures. To identify bottlenecks we use Intel VTune Amplifier (https://software.intel.com/en-us/intel-vtune-amplifier-xe), an application for software performance analysis. It uses hardware counters to provide a wide range of analyses. We mostly use its Hotspot analysis, which gives a detailed overview of the time spent in each function and sub-function. All VTune profiling is done with 1000 events and each event is processed five times (i.e. five iterations over all 1000 events).

Next to using the Intel tool, we also perform simple in-code profiling. We measure parts of the algorithm using wall-clock time (std::chrono::high_resolution_clock, with the duration in double precision). Secondly, we count the number of iterations, doublets created and neighbours found to demonstrate how specific optimisations decrease the combinatorics or to explain outliers. The additional profiling is also done with 1000 events and five iterations.
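The wall-clock measurement itself follows the usual std::chrono pattern; a minimal sketch, with the timed phase left as a placeholder:

#include <chrono>
#include <iostream>

int main()
{
    using clock = std::chrono::high_resolution_clock;

    const auto start = clock::now();
    // ... run one phase of the algorithm here, e.g. making doublets ...
    const auto stop = clock::now();

    // Duration of the phase in seconds, with double precision.
    const std::chrono::duration<double> elapsed = stop - start;
    std::cout << "phase took " << elapsed.count() << " s\n";
    return 0;
}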

We use two benchmarks to determine and monitor performance over each optimisation iteration: runtime and events per second. Both results are based on the time taken to process 1000 events. Each experiment is executed 20 times and we use the minimum time taken. All experiments were executed on the same machine and compiled with GCC 5.5.0. All versions are compiled with the following C++ flags: -O3, -std=c++11, -mavx, -msse3 and -fopenmp.

The other important performance metric is the so-called physics performance, which determines the track-finding efficiency and fake rate using simulated information. We used the EventAnalyzer tool by Daniel Campora (https://gitlab.cern.ch/dcampora/EventAnalyzer) to calculate the physics efficiencies achieved by the different versions.

Besides the aforementioned methods we also use Roofline analysis to analyse each version. The Roofline model, proposed by Williams, Waterman and Patterson [37], is a visual model that provides insight on the performance bound. It combines floating-point performance, operational intensity and memory performance in a 2D graph. Operational intensity is defined as FLOPs/byte, where byte refers to the amount of data read from and written to main memory (DRAM). Figure 10 shows the model, on a log-log scale, for a 4.2 GHz Intel i7-7700K. The x-axis is FLOPs per byte and the y-axis is the attainable GFLOPs per second [37].

Figure 10: Base Roofline Model representing the peak performance of the machine used for all our measurements.

The peak floating-point performance of a computer can be represented by a horizontal line (blue in figure 10). The red line represents the maximum floating-point performance that the memory system of the given computer can support. For a certain kernel we can find the GFLOPs/s and the FLOPs/byte, thus a point in the Roofline model. If the point is under the diagonal line, the performance is memory bound. If the point is under the horizontal line it is compute bound. The maximum achievable performance is bound by the two lines, the rooflines [37].
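The bound itself is simply the minimum of the two rooflines; the sketch below evaluates it for a given operational intensity. The peak numbers are placeholders, not the measured ceilings of our machine.

#include <algorithm>
#include <iostream>

// Attainable performance (GFLOP/s) for a kernel with the given operational
// intensity (FLOPs per byte of DRAM traffic), following the Roofline model.
double attainable_gflops(double flops_per_byte,
                         double peak_gflops,
                         double peak_bandwidth_gb_per_s)
{
    return std::min(peak_gflops, peak_bandwidth_gb_per_s * flops_per_byte);
}

int main()
{
    const double peak_gflops = 60.0;  // placeholder compute ceiling
    const double peak_bw     = 30.0;  // placeholder DRAM bandwidth in GB/s
    std::cout << attainable_gflops(0.5, peak_gflops, peak_bw) << " GFLOP/s\n";
    return 0;
}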

We used the same method as in Williams et al. [37] to establish the peak performance of the machine used. The method uses a variety of micro-kernels to establish the peak performance. We ran the tool using only one core, with the following C++ flags: -O3 -msse3 -axSSE3 -xSSE3 -fno-alias -fno-fnalias -DERT_INTEL. The kernels were compiled using icc.

For the measured performance we used Intel's Performance Counter Monitor (PCM) to find the FLOPs and DRAM accesses, similar to Ofenbeck et al. [38]. Within Intel's PCM API we used the following six CPU counters for the GFLOP performance estimation of the Cellular Automaton.

Event Mask Mnemonic                          Description
FP_ARITH_INST_RETIRED.SCALAR_DOUBLE          scalar double
FP_ARITH_INST_RETIRED.SCALAR_SINGLE          scalar single
FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE     128-bit packed double
FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE     128-bit packed single
FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE     256-bit packed double
FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE     256-bit packed single

Table 1: CPU counters used for the GFLOP estimation of the Cellular Automaton on an Intel Kaby Lake.

For the theoretical model we use manual FLOP and byte counting. We first determine the number of FLOPs and bytes per iteration and then count the number of iterations for 1000 events to obtain a precise theoretical estimate. The theoretical byte counts include both cache and DRAM accesses. To obtain a realistic number of DRAM read and write accesses, we used Intel PCM's L2 and L3 cache hit ratios. The results of the theoretical Roofline model are presented in chapter 6.

3.3 p e r f o r m a n c e m o d e l l i n g

The additional performance modelling consists of two main parts. We record the runtime five times for all 10000 events, thus resulting in a dataset of 50000 measurements. Firstly, we implement three simple statistical models trained on 70% of the data and then tested on the other 30% of the data. We used Mean Absolute Error (MAE) and Mean Absolute Deviation Percent (MADP) to determine the accuracy of the model.
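For reference, the two error metrics can be computed as follows. The MADP definition used here (summed absolute error relative to the summed actual values, in percent) is one common variant and may differ in detail from the exact formula behind our reported numbers.

#include <cmath>
#include <cstddef>
#include <vector>

// Mean Absolute Error of predicted vs. measured runtimes.
// Assumes both vectors have the same, non-zero length.
double mae(const std::vector<double>& actual, const std::vector<double>& pred)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < actual.size(); ++i)
        sum += std::abs(actual[i] - pred[i]);
    return sum / static_cast<double>(actual.size());
}

// Mean Absolute Deviation Percent (one common definition, see text above).
double madp(const std::vector<double>& actual, const std::vector<double>& pred)
{
    double err = 0.0, tot = 0.0;
    for (std::size_t i = 0; i < actual.size(); ++i) {
        err += std::abs(actual[i] - pred[i]);
        tot += std::abs(actual[i]);
    }
    return 100.0 * err / tot;
}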

We then build a detailed analytical model for the final C++ version, where each part of the algorithm is modelled individually. Some parts are based on the number of hits of an event and the probable number of iterations, other parts are more statistical due to the recursive or non-deterministic nature of the algorithm. The dataset is the same as for the statistical model, however due to the detail of the modelling, some parts are re-measured to obtain more detailed data (per function and operation).


4

C E L L U L A R A U T O M A T O N D E S I G N F O R T R A C K R E C O N S T R U C T I O N I N T H E U P G R A D E D L H C B E X P E R I M E N T

In this chapter we lay out the design of the Cellular Automaton used for track reconstruction. Each step of the algorithm is discussed. The algorithm is divided into five distinct parts. Each part includes refinements that were identified in the performance engineering phase detailed in chapter 5. Each refinement is detailed in its own subsection; the same subsection titles are used in chapter 5 to describe the corresponding performance engineering work. The algorithm is a segment-based model, similar to the approaches explored in the literature review [12, 11, 34, 35, 16]. The main differences are additions and refinements introduced to reduce the combinatorics.

Parts one and two prepare the data for the Cellular Automaton evolution (lines 1 and 2 in algorithm 1). Part three evolves the Cellular Automaton (line 3 in algorithm 1). Parts four and five extract and clean the track candidates (lines 4 and 5 in algorithm 1).

Algorithm 1 Cellular Automaton

1: Make all possible Doublets
2: Find Neighbours of Doublets
3: Run Cellular Automaton
4: Extract All Possible Tracks
5: Remove Clone and Ghost Tracks

We explain each part of the algorithm based on a 2D example shown in figure 11. Each dot represents a hit. Each vertical line represents a sensor.


4.1 m a k e d o u b l e t s

The cells of the CA are not individual hits, but doublets. Doublets are pairs of hits from two consecutive sensors [12]. Creating doublets between all of the hits in figure 11 results in figure 12, where each coloured segment represents one doublet.

Figure 12: Three tracks in 2D with all doublets.

Not every doublet is important; we are only interested in the doublets that make sense from a physics point of view. Hence, depending on the detector architecture one can introduce different constraints, such as limiting the slope of the doublet or requiring that it points towards the region where collisions occur [16]. The first constraint we introduce is that a doublet has to have an acceptable slope in the dx/dz and dy/dz directions. Figure 13 shows two doublets, where the dashed line is a doublet that would be too steep in the dy/dz dimension.


Figure 13: 3D representation of the steepness of two doublets, where the dashed line represents a doublet that is too steep.

Azimuthal Angle Cut

Controlling for the steepness of doublets does not prevent the formation of doublets that do not point to the beam line or interaction region when extrapolated. The two hits of a doublet need to have roughly the same azimuthal angle Φ, where Φ = arctan(y/x). The dashed doublet in figure 14 shows a doublet with a large Φ difference between the two hits. It is unlikely to correspond to a real particle track.

Figure 14: 3D representation of the Φ difference, where the dashed line represents a doublet with a too high Φ difference.


Thus, we reduce the number of doublets created by controlling for the slopes and the difference in Φ, the Azimuthal Angle Cut (AAC). In the 2D case, this might result in figure 15, which displays a reduced number of doublets. Reducing the number of fake doublets is crucial as each fake doublet increases the number of combinations of two doublets that need to be considered later.

Figure 15: Three tracks in 2D with all allowed doublets.
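A sketch of the resulting doublet filter is given below. The threshold values are placeholders rather than the tuned cuts of our implementation, and atan2 is used to compute Φ robustly in all quadrants.

#include <cmath>

struct Hit { float x, y, z; };

constexpr float kPi          = 3.14159265f;
constexpr float kMaxSlope    = 0.7f;  // placeholder |dx/dz|, |dy/dz| limit
constexpr float kMaxDeltaPhi = 0.3f;  // placeholder Φ difference limit (rad)

// Returns true if the hit pair (a, b) on consecutive sensors is kept as a doublet.
bool accept_doublet(const Hit& a, const Hit& b)
{
    const float dz = b.z - a.z;
    if (dz == 0.0f) return false;  // degenerate pair on the same z plane

    // Slope cut in both transverse directions (figure 13).
    if (std::abs((b.x - a.x) / dz) > kMaxSlope) return false;
    if (std::abs((b.y - a.y) / dz) > kMaxSlope) return false;

    // Azimuthal Angle Cut: both hits must have a similar Φ (figure 14).
    float dphi = std::abs(std::atan2(a.y, a.x) - std::atan2(b.y, b.x));
    if (dphi > kPi) dphi = 2.0f * kPi - dphi;  // wrap around ±π
    return dphi <= kMaxDeltaPhi;
}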

Each formed doublet is one cell in the CA. The status of a cell depends on the status of its neighbour cell. Thus, the next step is to find the neighbours of each doublet.

4.2 f i n d i n g t h e n e i g h b o u r s

The two basic steps for finding neighbours are: 1) check whether the neighbouring doublets share a point; 2) test whether the breaking angle is small enough [12, 11].

In figure 16a the thick green line and the thick orange line represent a doublet and its left neighbour. A second example can be seen in figure 16b, where the blue doublet has two neighbours. As the particles in the VELO travel, to a good approximation, in a straight line, any neighbours which do not result in a sufficiently straight line (i.e., if the breaking angle between the doublets is too large) are ignored [11]. The turquoise doublet is therefore excluded.

Cutting on the angle of the neighbours and only plotting the doublets that have reasonable neighbours results in figure 17. The red and brown doublets are kept for illustrative purposes.


(a) Three tracks in 2D with a doublet (thick green) and its neighbouring doublet (thick orange).

(b) Three tracks in 2D with a doublet (thick blue) and its two neighbouring doublets (thick green and turquoise).

Figure 16: Examples of doublets and their neighbours.

Figure 17: Three tracks in 2D including all doublets with at least one neighbour.

2D Grid & Binary Search and Azimuthal Angle Cut

Testing each potential neighbour for a shared point is cumbersome. Hence, we switch to a sorted 2D grid that can be searched for shared hits. Moreover, we also introduced the Azimuthal Angle Cut between the 1st hit of the first doublet and the 2nd hit of the second doublet (i.e., an azimuthal cut between points one and three along the line formed by the two doublets).
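A sketch of the lookup idea: the hits of one sensor are kept sorted by Φ, and a binary search returns only the candidates inside a Φ window. The exact grid layout of our implementation differs, and the wrap-around at ±π is ignored here for brevity.

#include <algorithm>
#include <cstddef>
#include <vector>

// Hits of one sensor, kept sorted by their azimuthal angle Φ.
struct SensorHits {
    std::vector<float> phi;    // sorted ascending
    std::vector<int>   index;  // hit index matching each phi entry
};

// Return the hit indices on this sensor whose Φ lies within
// [phi_ref - window, phi_ref + window], using binary search.
std::vector<int> candidates_in_phi_window(const SensorHits& s,
                                          float phi_ref, float window)
{
    const auto lo = std::lower_bound(s.phi.begin(), s.phi.end(), phi_ref - window);
    const auto hi = std::upper_bound(s.phi.begin(), s.phi.end(), phi_ref + window);

    std::vector<int> result;
    for (auto it = lo; it != hi; ++it)
        result.push_back(s.index[static_cast<std::size_t>(it - s.phi.begin())]);
    return result;
}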

Area Calculation

Estimating whether two doublets form a straight line can be done in two ways: 1) calculating the angle between the two lines, or 2) calculating the area of the triangle spanned by the three points. The latter option is less computationally intense and can be vectorised easily. We define vectorisation as applying the same operation simultaneously to several pieces of data. The actual outcome of the computation is not used, only the relative outcome with respect to a certain cutoff. Thus, one could either test whether the angle is close to 180° or whether the area is close to zero. Calculating the area of a triangle in a 3D coordinate system is computationally intense. As we only use the area in a comparison, we can also use any monotonically increasing function of the area. In this case it is less compute-intense to use the squared magnitude of the cross product, i.e. (2 · Area)², which is an intermediate step in calculating the actual area. Using the intermediate result saves a square root and a division, while not impacting the comparison.

Calculating the area of a triangle in 3D can best be approached using vectors. Using the 3 points, we create two vectors with the same starting point (formulas 2 and 3). We then calculate the orthogonal vector using the cross product (formulas 4 to 8). The magnitude of the orthogonal vector equals 2 times the area of the triangle (formula 9) [39].
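A sketch of this straightness test via the cross product (cf. formulas 2 to 9 below); the cutoff is a placeholder value.

struct Hit { float x, y, z; };

// Squared magnitude of the cross product of the two edge vectors, i.e.
// (2 * Area)^2 of the triangle (p1, p2, p3). It is monotonically increasing
// in the true area, so it can be compared against a cutoff directly, saving
// the square root and the division of formula (9).
float cross_norm_squared(const Hit& p1, const Hit& p2, const Hit& p3)
{
    const float ux = p2.x - p1.x, uy = p2.y - p1.y, uz = p2.z - p1.z;
    const float vx = p3.x - p1.x, vy = p3.y - p1.y, vz = p3.z - p1.z;

    const float w1 = uy * vz - vy * uz;
    const float w2 = ux * vz - vx * uz;
    const float w3 = ux * vy - vx * uy;

    return w1 * w1 + w2 * w2 + w3 * w3;
}

// Two doublets (p1, p2) and (p2, p3) form a sufficiently straight line if the
// area proxy stays below a tuned cutoff (placeholder default here).
bool straight_enough(const Hit& p1, const Hit& p2, const Hit& p3,
                     float cutoff = 1e-3f)
{
    return cross_norm_squared(p1, p2, p3) < cutoff;
}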

\overrightarrow{P_1P_2} = \langle b_x - a_x,\; b_y - a_y,\; b_z - a_z \rangle \qquad (2)

\overrightarrow{P_1P_3} = \langle c_x - a_x,\; c_y - a_y,\; c_z - a_z \rangle \qquad (3)

\vec{u} \times \vec{v} = \overrightarrow{P_1P_2} \times \overrightarrow{P_1P_3} \qquad (4)

\vec{u} \times \vec{v} = \begin{vmatrix} \vec{r}_1 & \vec{r}_2 & \vec{r}_3 \\ u_1 & u_2 & u_3 \\ v_1 & v_2 & v_3 \end{vmatrix} \qquad (5)

which simplifies to:

\vec{u} \times \vec{v} = \langle u_2 v_3 - v_2 u_3,\; u_1 v_3 - v_1 u_3,\; u_1 v_2 - v_1 u_2 \rangle \qquad (7)

\vec{u} \times \vec{v} = \langle w_1, w_2, w_3 \rangle \qquad (8)

\mathrm{Area} = \tfrac{1}{2}\, \lVert \vec{u} \times \vec{v} \rVert = \tfrac{1}{2}\sqrt{w_1^2 + w_2^2 + w_3^2} \qquad (9)

4.3 e v o l v i n g t h e c e l l u l a r a u t o m a t o n

The third step is the Cellular Automaton evolution, where each doublet is considered to be a cell with a status. The status of a cell changes depending on the status of a neighbouring cell. If the status of a doublet and a neighbour are the same, the status of the doublet is incremented by 1 [12, 11, 34, 35, 16].

As the left-most doublet does not have a neighbour, its status remains one. All doublets review their respective neighbours at the same time and update their status. If figure 17 represents the CA at the beginning of the evolution, then figure 18a represents it after the first iteration. The thickness of each line represents its status.

After the first iteration the three left-most doublets still have the same status (i.e., 1). Every other doublet has the status 2. With each iteration the status of a doublet increases, given that a left neighbour has the same status, until no doublet has the same status as its left neighbour. The final state is represented in figure 18b.


(a) Cellular Automaton after first iteration. (b) Cellular Automaton after the final iteration.

Figure 18: Examples of doublets and their neighbours, where the thickness of a doublet represents its state.
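A sketch of the synchronous evolution rule, assuming each cell stores the indices of its left neighbours; this is a simplified stand-in for the actual implementation.

#include <cstddef>
#include <vector>

struct Cell {
    int state = 1;
    std::vector<int> left_neighbours;  // indices into the cell vector
};

// Evolve the automaton until no cell has a left neighbour with the same
// state. All cells are updated simultaneously: the increments of one
// iteration are computed from the states of the previous iteration.
void evolve(std::vector<Cell>& cells)
{
    bool changed = true;
    while (changed) {
        changed = false;
        std::vector<int> next_state(cells.size());

        for (std::size_t i = 0; i < cells.size(); ++i) {
            next_state[i] = cells[i].state;
            for (const int n : cells[i].left_neighbours)
                if (cells[static_cast<std::size_t>(n)].state == cells[i].state) {
                    ++next_state[i];  // increment at most once per iteration
                    changed = true;
                    break;
                }
        }
        for (std::size_t i = 0; i < cells.size(); ++i)
            cells[i].state = next_state[i];
    }
}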

4.4 e x t r a c t i n g a l l p o s s i b l e t r a c k s

The fourth step is to collect the most feasible tracks. For all right-most doublets, it is checked whether the status of any left neighbour is exactly one less; the left neighbour with a state of one less is added to the track. The new doublet then analyses its own neighbours. More doublets are picked up recursively until no more left neighbours exist. After creating all tracks that start in the right-most layer, the process is repeated for the next sensor, thereby creating tracks that did not reach the final sensor. If there are two left neighbours that meet the condition then the track is split in two and two tracks are created [34]. By creating every track from every sensor, a lot of clone and ghost tracks are created.
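A sketch of this backward pass, in which a track is split whenever more than one left neighbour has the required state; calling it for every right-most doublet, and then for doublets on earlier sensors, yields the full candidate list. The structure and function names are illustrative.

#include <cstddef>
#include <vector>

struct Cell {
    int state;
    std::vector<int> left_neighbours;  // indices of neighbouring cells
};

using Track = std::vector<int>;        // cell indices, ordered right to left

// Recursively extend the track ending at cell `c`, following every left
// neighbour whose state is exactly one less than the state of `c`.
void extend(const std::vector<Cell>& cells, int c, Track track,
            std::vector<Track>& out)
{
    track.push_back(c);
    bool extended = false;
    for (const int n : cells[static_cast<std::size_t>(c)].left_neighbours)
        if (cells[static_cast<std::size_t>(n)].state ==
            cells[static_cast<std::size_t>(c)].state - 1) {
            extend(cells, n, track, out);  // the track is copied: split here
            extended = true;
        }
    if (!extended)
        out.push_back(track);              // no continuation: track is complete
}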

Outlier Elimination

We decided not to create two tracks if two neighbours meet the status condition. Instead, if several neighbours with the right status are found, the algorithm picks the neighbour that forms the straightest line. This might not always lead to the extraction of the best track candidate; however, the impact seems to be negligible and it reduces the number of runtime outliers.

4.5 r e m o v i n g c l o n e s a n d g h o s t t r a c k s

The last step is to clean the candidate tracks by removing short, clone and ghost tracks. All tracks are sorted by length and straightness. One track after another is added to the final list of tracks and its hits are marked as used. A track is not added to the final list if a given percentage of its hits has already been used in previous tracks [34].
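A sketch of this clean-up, assuming a track is a list of hit indices. The sort below only uses the track length (the straightness tie-break is omitted) and the allowed fraction of shared hits is a placeholder.

#include <algorithm>
#include <cstddef>
#include <vector>

using Track = std::vector<int>;  // hit indices of one candidate track

// Keep a candidate only if at most `max_shared` of its hits were already used
// by previously accepted (longer) tracks.
std::vector<Track> remove_clones_and_ghosts(std::vector<Track> candidates,
                                            std::size_t n_hits,
                                            double max_shared = 0.3)  // placeholder
{
    std::sort(candidates.begin(), candidates.end(),
              [](const Track& a, const Track& b) { return a.size() > b.size(); });

    std::vector<bool> used(n_hits, false);
    std::vector<Track> accepted;

    for (const Track& t : candidates) {
        std::size_t shared = 0;
        for (const int h : t)
            if (used[static_cast<std::size_t>(h)]) ++shared;

        if (static_cast<double>(shared) <= max_shared * static_cast<double>(t.size())) {
            accepted.push_back(t);
            for (const int h : t) used[static_cast<std::size_t>(h)] = true;
        }
    }
    return accepted;
}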


5

I M P L E M E N T A T I O N A N D P E R F O R M A N C E E N G I N E E R I N G

This chapter details the different implementations of the CA algorithm, the performance and efficiency achieved as well as the performance engineering process.

5.1 p y t h o n

The first CA based prototype focused on the physics performance (i.e., the accuracy of the reconstruction). To ease the prototyping process, we implemented the first version in Python. The prototype is a straightforward implementation of the CA algorithm described in the previous chapter and does not employ any of the optimisations described in the subsections. The method performed on par with the classical and Depth First Traversal (DFS) methods, as seen in table 2.

                      Classical Method   DFS Method      CA Method
velo                  92.44 (2.99)       92.08 (18.95)   92.58 (3.70)
long                  98.19 (4.77)       97.18 (19.62)   97.53 (6.91)
long>5GeV             98.54 (4.77)       97.74 (20.79)   98.08 (6.69)
long strange          96.81 (3.83)       97.00 (17.79)   97.06 (8.64)
long strange>5GeV     97.82 (2.27)       98.41 (8.13)    97.02 (7.93)
long fromb            96.99 (3.93)       96.49 (18.87)   98.14 (5.43)
long fromb>5GeV       97.01 (3.65)       96.15 (18.94)   98.62 (3.99)
Ghosts                6.20               16.83           6.48

Table 2: Comparing the physics performance (%) and clone rates (%) for the three different algorithms across 30 test events, where the first number is the reconstruction efficiency and the second number (in parentheses) is the clone rate.

The Python implementation is slow and compares poorly to the other two methods (table 3). It only reconstructs 0.016 events per second. Nevertheless, we analysed its performance to be able to compare the Python and C++ versions.

Classical Method DFS Method CA Method

Mean (seconds) 2.67 13.93 61.12

Min (seconds) 0.06 0.43 2.14

Max (seconds) 9.07 60.23 248.46

Table 3: Comparing mean, minimum and maximum time (seconds) taken to reconstruct one event for each one of the three methods.


As seen in figure 19, the runtime has an exponential relationship with the number of hits in an event. At 4500 hits the runtime is approximately 250 seconds. Large events can reach 7000+ hits, implying runtimes of more than 3100 seconds. The exponential increase in runtime demonstrates that the combinatorics grow rapidly with each added hit. This is probably caused by the first or second step of the algorithm: either too many fake doublets are created (i.e., hits that do not belong together are combined) or too many neighbours are found for each doublet.

Figure 19: Exponential model for predicting runtime based on the number of hits in an event.

Figure 20 shows that more than 90% of the time is spent in the neighbour-searching phase. Two functions within the neighbour search are the "check tolerance" and "shared point" functions; they contribute 29% and 23% of the total time respectively (not shown in the graph). The "check tolerance" function determines whether two doublets are consistent with a single straight line, while the "shared point" function determines whether two doublets share a common point. The high percentage of runtime spent in the neighbour-finding function indicates that the current algorithm creates too many useless doublets, all of which have to be tested during neighbour finding.


5.2 Original C++

The first C++ version is a direct port from Python and thus still suffers from the same algorithmic flaws. The physics efficiencies are worse than those of the current CPU implementation [40], but similar to or better than those of the current GPU implementation [9].

                     CPU Method    GPU Method    CA Method
velo                 92.0 (2.3)    88.4 (1.06)   93.2 (4.05)
long                 99.1 (2.0)    96.6 (1.14)   97.6 (8.03)
long>5GeV            99.6 (1.7)    98.3 (1.04)   97.5 (8.52)
long strange         98.3 (1.4)    86.5 (0.82)   95.6 (4.90)
long strange>5GeV    98.4 (1.0)    90.5 (0.74)   94.6 (4.50)
long fromb           99.4 (1.9)    96.2 (1.23)   96.7 (7.97)
long fromb>5GeV      99.6 (1.8)    98.4 (1.48)   97.4 (9.57)
Ghosts               2.5           2.9           8.2

Table 4: Comparing the physics performance (%) and clone rates (%) for the current sequential version, the GPU implementation and the CA algorithm; the first number is the reconstruction efficiency and the number in parentheses is the clone rate. The ghost rate (%) for each algorithm is given in the last row.

The unoptimized C++ version is roughly 206 times faster than the Python version, processing on average 2.6 events/s (Table 5).

GPU Method CA Method

Mean (sec) 1.66×10⁻⁵ 0.38

Events / Second 61000 2.60

Table 5: C++ performance for the GPU Search-by-triplet method and CA method

A closer look at the runtime per event is shown in figure 21 and reveals a few extreme outliers. Identifying and eliminating outliers is crucial, as edge cases could stall whole machines. Examining the event that took almost 100 seconds, we find that it contains only 2703 hits. Similarly sized events are processed on average in 0.34 seconds.

The outlier events do not create more doublets or neighbours than the average event. The bottleneck is the track extraction phase, accounting for more than 99% of the runtime, whereas other events spend only 11% of their time in the same phase. Investigating the track extraction phase shows that some events contain highly connected doublets. The average number of neighbours per doublet is not different; however, the distribution is: fewer doublets have one neighbour and more doublets have a high number of neighbours. As we recursively extract all feasible tracks, a single highly connected starting doublet can generate more than 1.5 million tracks.
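As a rough illustration (the real branching factors vary per doublet and per event): if a chain of $d$ doublets offered $k$ feasible left neighbours at every step, recursive splitting would produce on the order of $k^d$ track candidates from a single starting doublet, for example

\[
k^{d} = 3^{13} = 1\,594\,323 \approx 1.5 \times 10^{6}.
\]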

Introducing more cuts to reduce the number of doublets and neighbours produced should reduce the number and impact of outliers and lead to fewer combinations that need to be considered.


Figure 21: Runtime vs Number of Hits for the original C++ version.

Examining a sub-sample of data points, excluding outliers, reveals a similar exponential trend as in the Python version (figure 22).

Figure 22: Exponential model for the unoptimized C++ version. The data is a sample, excluding outliers, and shows an exponential relationship between runtime and the number of hits. The green dotted line shows the exponential fit.

The current unoptimized version processes 2.6 events per second and suffers from similar bottlenecks as the Python version: 61% of the time is spent searching for neighbours, with 34% of the total runtime going into calculating the angle between two doublets (i.e., the breaking angle). The second hotspot is the track extraction, taking 33% of the time including outliers.

The Roofline analysis (figure 23) shows that the implementation is compute bound. The high FLOPs/s demonstrates that the current implementation performs many computations. Thus, the first optimisation focuses on reducing the combinatorics (computations) by introducing the Φ cut (i.e., the Azimuthal Angle Cut (AAC)) discussed in chapter 4. Fewer computations should decrease the runtime and thereby improve performance.


Figure 23: Roofline analysis for the original C++ version.

Azimuthal Angle Cut

Up to this point, the most complex cut applied while making doublets was the dx/dz and dy/dz slope cut. We chose dx/dz < 0.7 and dy/dz < 0.7 because, as figure 24 shows, most slopes of actual tracks are smaller than 0.7.

(a) Slope in the dx/dz dimension between hits in simulated tracks across 10000 Minimum Bias Events.

(b) Slope in the dy/dz dimension between hits in simulated tracks across 10000 Minimum Bias Events.

Figure 24: Slopes in the dx/dz and dy/dz dimension between hits of simulated tracks.

In addition to the slope cut, we introduce the Φ cut for the doublet formation and the neighbour finding phase. This reduces the average number of doublets formed from 157k to 24k and the average number of neighbours from 7582 to 1.01. The distribution of the number of doublets also changes accordingly (figure 25).


Figure 25: Number of doublets created for the original C++ version and after the azimuthal angle cut optimisation.

Figure 26 shows the distribution of the Φ difference in the simulated tracks. The distribution is centred around 0 with very small tails; we choose |∆Φ| < 0.2 as the initial cut-off. We will revisit the cut-off later on.

Figure 26: Frequency of the Φ difference between hits of Monte Carlo tracks across 10000 minimum bias events.
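To make the cuts concrete, a hedged sketch of a doublet-formation test is shown below. The hit layout, the helper names and the use of absolute values are assumptions for this sketch (the absolute values are motivated by the symmetric distributions in figures 24 and 26); the cut values 0.7 and 0.2 are the ones quoted above.

```cpp
#include <cmath>

// Illustrative hit record; the actual hit layout in the thesis code differs.
struct HitXYZ {
    float x, y, z;
};

// Wrap a phi difference into [-pi, pi] before applying the cut.
inline float wrapPhi(float d) {
    const float kPi = 3.14159265f;
    while (d >  kPi) d -= 2.0f * kPi;
    while (d < -kPi) d += 2.0f * kPi;
    return d;
}

// A doublet (h1 on the earlier sensor, h2 on the later one) is kept only if
// its slopes and its azimuthal-angle difference pass the cuts quoted in the
// text (0.7 for the slopes, 0.2 for |dPhi|).
inline bool passesDoubletCuts(const HitXYZ& h1, const HitXYZ& h2,
                              float slopeCut = 0.7f, float phiCut = 0.2f) {
    const float dz = h2.z - h1.z;
    if (dz == 0.0f) return false;  // hits on the same sensor plane
    const float dxdz = (h2.x - h1.x) / dz;
    const float dydz = (h2.y - h1.y) / dz;
    const float dphi = wrapPhi(std::atan2(h2.y, h2.x) - std::atan2(h1.y, h1.x));
    return std::fabs(dxdz) < slopeCut &&
           std::fabs(dydz) < slopeCut &&
           std::fabs(dphi) < phiCut;
}
```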

The AAC cut has a relatively small effect on most of the track efficiencies. Unfortunately, the efficiency for long strange tracks decreased by 5% (Table 6). Additionally, the ghost rate decreased to 2.3%.


                     Original C++   AAC Opt.
velo                 93.2 (4.05)    93.2 (4.33)
long                 97.6 (8.03)    97.2 (8.70)
long>5GeV            97.5 (8.52)    97.9 (8.91)
long strange         95.6 (4.90)    90.1 (10.01)
long strange>5GeV    94.6 (4.50)    92.0 (8.57)
long fromb           96.7 (7.97)    96.6 (9.42)
long fromb>5GeV      97.4 (9.57)    97.4 (10.40)
Ghosts               8.2            2.3

Table 6: Comparing the physics performances (%) and clone rates (%) for the original implementation and the AAC optimized CA algorithm.

Examining the runtime per event a bit closer (Figure 27), we notice that outliers remain an issue. As before, this is caused by highly connected doublets.

Figure 27: Runtime vs Number of Hits after the Azimuthal Angle Cut optimisation.

Examining the profiling output in depth (table 7) shows that the majority of the time is spent in the track extraction step. A track is extended with new hit objects one after another until the full track is found. The hit creation is inefficient, as too many complex hit objects are created; additionally, these objects are copied several times. Items 1 to 3 and 5 in table 7 are related to the hit object. Secondly, the neighbour finding still takes a long time. Lastly, a lot of time is spent in the Φ calculation (not shown in table 7).

Function Module CPU Time

Hit Constructor CellularAutomata 147.623s

operator new libstdc++.so.6 134.439s

GI libc.so.6 67.976s

findNeighbours CellularAutomata 51.266s

std::vector<Hit, std::allocator<Hit>>::operator= CellularAutomata 17.321s

[Others] 209.549s

Table 7: Profiling results of the Azimuthal Angle Cut Optimization

The AAC optimisation caused a 2.55 times speed-up. The Roofline analysis in figure 28 shows a reduction in FLOPs that was not fully matched by the reduction in runtime. The original version carried out 225 GFLOP and read and wrote 316 GB to RAM; after the first optimisation this was reduced to 17 GFLOP and 43 GB. The 13 times reduction in FLOPs and 7 times reduction in DRAM memory accesses was offset by the creation of many inefficient hit objects that require a lot of data copying. Not only did the GFLOPs/s decrease, but the FLOPs/byte also dropped to 0.41, confirming suboptimal data structures.
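As a quick cross-check of the reported arithmetic intensity, dividing the counted work by the measured memory traffic gives

\[
\frac{17\ \text{GFLOP}}{43\ \text{GB}} \approx 0.40\ \text{FLOP/byte},
\]

in line with the 0.41 FLOPs/byte read off the Roofline analysis (the small difference is rounding).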

Figure 28: Roofline analysis for the AAC Optimisation.

Hit Object Simplification

Three improvements are introduced in this version:

1. Moving the arctan calculation upfront, so that it is only done once per hit.

2. Simplifying the hit objects, reducing the information needed from three floats and three integers to a single integer.

3. Excluding long doublets.

The probability that a particle is not registered by a sensor is 1% [2]. Thus, we only consider doublets between a sensor and the next sensor and ignore long doublets, reducing the number of doublets to an average of 12k per event.
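A hedged sketch of the simplified layout is shown below: hit attributes live in flat per-event arrays, Φ is computed once per hit up front, and a "hit" inside a doublet is just an integer index into these arrays. The field names, the structure-of-arrays layout and the integer widths are assumptions for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Structure-of-arrays event store: a hit is referenced by a single integer
// index instead of carrying three floats and three integers around.
struct EventHits {
    std::vector<float> x, y, z;
    std::vector<float> phi;             // filled once, before doublet making
    std::vector<std::uint16_t> sensor;  // sensor number of each hit

    void computePhi() {
        phi.resize(x.size());
        for (std::size_t i = 0; i < x.size(); ++i)
            phi[i] = std::atan2(y[i], x[i]);  // arctan done once per hit, not per doublet
    }
};

// A doublet now stores only two hit indices; "long" doublets (skipping a
// sensor) are not formed, so hit1 and hit2 always sit on adjacent sensors.
struct SimpleDoublet {
    std::uint32_t hit1;
    std::uint32_t hit2;
};
```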

These three changes led to a 21 times speedup over the previous version, resulting in 167 events per second. Due to the exclusion of long doublets, the physics efficiency decreased slightly and the clone rates increased (Table 8).

                     Original C++   AAC Opt.       Hit Obj. Simp.
velo                 93.2 (4.05)    93.2 (4.33)    92.9 (5.81)
long                 97.6 (8.03)    97.2 (8.70)    97.0 (11.68)
long>5GeV            97.5 (8.52)    97.9 (8.91)    97.8 (11.89)
long strange         95.6 (4.90)    90.1 (10.01)   89.9 (13.01)
long strange>5GeV    94.6 (4.50)    92.0 (8.57)    91.9 (11.46)
long fromb           96.7 (7.97)    96.6 (9.42)    96.3 (12.35)
long fromb>5GeV      97.4 (9.57)    97.4 (10.40)   97.3 (13.72)
Ghosts               8.2            2.3            1.4

Table 8: Comparing the physics performance (%) and clone rates (%) for the original implementation, the AAC CA algorithm and the Hit Object Simplification.


We still observe outliers in figure 29. The left-most outlier is the same event as in the previous optimisation (figure 27) and is still 56 times slower than the average event.

Figure 29: Runtime vs Number of Hits for the Hit Object Simplification optimisation.

The three optimisations reduced the number of operations and the DRAM traffic to 1 GFLOP and 0.77 GB respectively. Again, the large decrease in FLOPs was not reflected in the GFLOPs/s, due to the tolerance checking. The 55 times reduction in bytes read and written increased the FLOPs/byte to 1.39.

Figure 30: Roofline analysis for the Hit Object Simplification optimisation.

While the current profiling indicates that the next step should be improving the neighbour finding phase, we first focus on reducing the number of outliers, because a sufficient number of outliers could halt the server farm.

Outlier Elimination

The third optimisation step implemented a few minor changes that decreased the physics efficiency by 1% at most. Firstly, any calls to vector.size() were moved out of the loops. Secondly, we switched to a lighter track extraction. During the original track extraction, if a track encounters two feasible neighbours it splits into two tracks. Mostly this is not an issue, but some events contain highly connected doublets that have several neighbours; one doublet in such an event can generate up to 1.5 million tracks. Instead of splitting the tracks, we only extract the most feasible track by always selecting the doublet that leads to the straightest track. Extracting only the most feasible track decreases the runtime, especially for outlier events.

These two changes increased the throughput to 198 events/s. At this point most outliers have been removed successfully (Figure 31).

Figure 31: Runtime vs Number of Hits for the Outlier Elimination Optimisation.

The performance with respect to GFLOPs/s and FLOPs/byte is only slightly different from the previous version. The VTune Amplifier results (Table 9) show that we should next focus on improving the neighbour finding function, in particular the checkTolerance function.

Function Module CPU Time

findNeighbours CellularAutomata 6.808s

operator new libstdc++.so.6 3.984s

GI libc.so.6 2.384s

checkTolerance CellularAutomata 1.868s

removeClonesAndGhosts CellularAutomata 1.664s

[Others] 1.664s

Table 9: Profiling results for the Outlier Elimination Optimisation

Area Calculation

Searching for neighbouring doublets consists of three tests:

1. Do the two doublets share a point?

2. Is the Φ difference between the first hit of the left doublet and the second hit of the right doublet small enough (i.e., is the angle with respect to the beam line roughly the same for the first and third hit)?

3. Do the two doublets form a relatively straight line?

Testing whether two doublets form a straight line can be done by calculating the angle between the two lines, which takes between 24 and 26 floating point operations. Alternatively, we can calculate the area spanned by the three points, which takes 20 floating point operations and can easily be vectorised [41].
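A hedged sketch of this straightness test for three consecutive hits is given below. The thesis does not spell out the exact formula, so the sketch assumes the standard cross-product expression, whose squared magnitude equals (2·Area)²: 6 subtractions for the two edge vectors, 9 operations for the cross product and 5 for the squared norm, i.e. 20 floating point operations in total. The cut value of 10 quoted in the next paragraph is applied to this quantity here.

```cpp
#include <array>

// Straightness test for three hits p1, p2, p3 (each {x, y, z}): the squared
// magnitude of the cross product of the two edge vectors equals (2*Area)^2 of
// the triangle they span. No trigonometry is needed, unlike the breaking-angle
// test it replaces.
inline float twoAreaSquared(const std::array<float, 3>& p1,
                            const std::array<float, 3>& p2,
                            const std::array<float, 3>& p3) {
    const float ax = p2[0] - p1[0], ay = p2[1] - p1[1], az = p2[2] - p1[2];
    const float bx = p3[0] - p1[0], by = p3[1] - p1[1], bz = p3[2] - p1[2];
    const float cx = ay * bz - az * by;
    const float cy = az * bx - ax * bz;
    const float cz = ax * by - ay * bx;
    return cx * cx + cy * cy + cz * cz;  // = (2 * triangle area)^2
}

// Usage sketch: two doublets sharing the middle hit are accepted as
// neighbours only if the three hits are nearly collinear.
inline bool formsStraightLine(const std::array<float, 3>& p1,
                              const std::array<float, 3>& p2,
                              const std::array<float, 3>& p3,
                              float cut = 10.0f) {  // cut value quoted in the text
    return twoAreaSquared(p1, p2, p3) < cut;
}
```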

A cut-off of 10 for the 2·Area² leads to 283 events per second, a 1.43 times speedup. The cut-off of 10 is slightly more lenient than the previously used cut-off for the angle, thus the switch had a positive impact on the physics efficiency, increasing the efficiency for long strange and long strange>5GeV tracks by 1.4% and 2.6% to 91.3% and 94.6% respectively.

Figure 32 compares the time taken to find neighbours for the third and fourth optimisation steps, showing that switching to the 2·Area² method made a positive impact.

Figure 32: Runtime vs Number of Hits for the Outlier Elimination optimisation and the Area Calculation optimisation.

The new method for testing the straightness of two doublets takes 4.3% of the total runtime, whereas the old version took 10% of the total runtime.

The fourth optimisation achieved 0.58 GFLOPs/s and 2.32 FLOPs/byte. The impact of the slightly wider cut-off is also reflected in the increase of the FLOP count and memory traffic to 2 GFLOP and 0.88 GB respectively. The aim of the last optimisation is to simplify the data structure of the doublet vector, thus boosting both GFLOPs/s and FLOPs/byte.

Figure 33: Roofline Analysis - Including the Area Calculation Optimisation.

2D Grid & Binary Search

The fifth optimisation introduces two main changes. Firstly, it simplifies the data structure. Secondly, it introduces a binary search for the neighbour finding phase.

The CA data structure was a triple-nested vector. We simplified this structure to an array of 50 vectors, where each vector contains all the doublets starting in a given sensor.
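A hedged sketch of the simplified container, together with a binary search over doublets sorted by the Φ of their first hit, is shown below. The exact sort key, the field names and the helper function are assumptions for this sketch, since the thesis does not spell out the final layout here.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Doublets are grouped per sensor: one vector per sensor, held in a fixed
// array of 50 vectors, replacing the earlier triple-nested vector.
struct SensorDoublet {
    std::uint32_t hit1, hit2;
    float phi1;  // phi of the first hit, used as the sort/search key
};

using DoubletStore = std::array<std::vector<SensorDoublet>, 50>;

// If each per-sensor vector is kept sorted by phi1, the candidate neighbours
// whose phi lies within +/- window of a given value can be located with a
// binary search instead of a linear scan.
// (Phi wrap-around at +/- pi is ignored in this sketch.)
inline std::pair<std::size_t, std::size_t>
phiRange(const std::vector<SensorDoublet>& doublets, float phi, float window) {
    auto lo = std::lower_bound(doublets.begin(), doublets.end(), phi - window,
                               [](const SensorDoublet& d, float v) { return d.phi1 < v; });
    auto hi = std::upper_bound(doublets.begin(), doublets.end(), phi + window,
                               [](float v, const SensorDoublet& d) { return v < d.phi1; });
    return {static_cast<std::size_t>(lo - doublets.begin()),
            static_cast<std::size_t>(hi - doublets.begin())};
}
```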
