
Feasibility of detecting di-Higgs interactions at the LHC:

Exploring the analysis

Emanuel Hoogeveen 31 August 2017

Supervisor: Dr J. Rojo

Abstract

In this paper I explore the analyses applied to determine whether detecting di-Higgs interactions at the Large Hadron Collider is feasible. I go through the cut-based analysis phase in detail and suggest some extensions, then describe a neural network-based multivariate analysis that dynamically splits the dataset for maximum coverage. Finally, I propose a combination procedure that greatly enhances predictive stability and modestly improves overall performance.


Contents

1 Introduction

2 Data and basic tools
  2.1 3 million versus 30 million discrepancy

3 Cut-based analysis
  3.1 Jet reconstruction
  3.2 b-tagging
  3.3 Event categories
  3.4 Jet pairing
  3.5 Ambiguities

4 Multivariate analysis
  4.1 Input parameters
  4.2 Neural network architecture
  4.3 Subset splitting
  4.4 Sample selection
  4.5 On signal significance
  4.6 Training procedure

5 Combining results
  5.1 Mean, median and other options
  5.2 Target signal significance
  5.3 Combining categories

6 Discussion of results
  6.1 Event predictability
  6.2 Performance: 3M versus 30M
  6.3 MVA enhancements
  6.4 Improved combination
  6.5 Usefulness of results

7 Conclusion

8 Acknowledgments


1 Introduction

The discovery of the Higgs boson at the Large Hadron Collider (LHC) in 2012 completed an important step in verifying the Standard Model of particle physics, our best model of elementary particle interactions at microscopic scales to date. However, knowing its mass alone is not enough to show that the discovered boson is indeed a component of the Higgs field responsible for electroweak symmetry breaking. One way to check that the discovered boson is indeed the Higgs predicted by the Standard Model is to verify the relationship

m_h^2 = 2 \lambda_{SM} v^2, \qquad (1)

where the Higgs mass is on the left, and the self-coupling on the right can be measured by its contribution to the double Higgs production process [15]. As such, an important question becomes whether the LHC - either in its current form or with the planned high luminosity upgrade - is sensitive enough to this process that such a measurement could be performed. This question has already received significant scrutiny, see e.g. [13].
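For reference, plugging the measured Higgs mass and the electroweak vacuum expectation value into equation (1) fixes the value the self-coupling must take in the Standard Model:

\lambda_{SM} = \frac{m_h^2}{2v^2} \approx \frac{(125\ \text{GeV})^2}{2\,(246\ \text{GeV})^2} \approx 0.13,

so a measurement of double Higgs production directly tests whether the observed self-coupling is consistent with this prediction.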

As a starting point in exploring this topic, I attempted to replicate the cut-based analysis described in [13]. In doing so I encountered at several points an ambiguity in the way jet selection was to be performed. Some of these ambiguities I was able to resolve with the gracious assistance of Dr Nathan Hartland, but others led me to parametrize the analysis in a number of ways. Following the cut-based analysis, I implemented a multivariate analysis based on training neural networks using the Stochastic Gradient Descent method, with a focus on avoiding overfitting the training data. Finally, to reduce the large variance in results observed between neural networks trained on different (overlapping) subsets of the same data, I experimented with different ways of combining their predictions, with promising results.

The remainder of this paper is organized as follows. In the following section I briefly describe the data used for this report. In section 3 I discuss the cut-based analysis, the additional parameters I introduced and how they affect the number of accepted signal and background events. Section 4 describes the neural network based multivariate analysis and the fine-tuning process. Section 5 details the combination procedure, and section 6 discusses the results.

2 Data and basic tools

For the purposes of this thesis I was supplied with a number of data files, containing modeled collision events similar to those used in Behr et al [13]. In chronological order, the files supplied contained

1. 1 million di-Higgs events, representing the signal events for the analyses.

2. 3 million QCD 4b background events and 3 million QCD 2b2j background events.

3. 30 million QCD 4b background events and 30 million QCD 2b2j background events.

While these event files do not represent the full spectrum of background events that might contribute to the analysis (in particular, Behr et al [13] also found a large contribution from the 4j background), the size of the 30M files proved prohibitively large, making off-site transfer difficult. However, note that the value of including any background sets beyond the 4b background is still in dispute, because these sets overlap once the parton shower has been applied. Thus, only including the 4b background will give an overly optimistic estimate of the achievable signal significance, whereas including other backgrounds will give an overly conservative estimate. In addition to the event files, I was supplied with a program to interpret them [16], yielding a vector of constituent partons for each event. Beyond simply reading the event files, this program also applied the parton shower to the di-Higgs events (the other event files came pre-showered).

2.1 3 million versus 30 million discrepancy

I requested the 30 million event files because the cut-based analysis proved “too effective”, cutting all but a few hundred background events in the boosted category. This left little for the multivariate analysis to train on, reducing its predictive value. However, in applying the cut-based analysis to the 30 million event files I encountered two unexplained discrepancies.

1. Their normalization does not appear to match the normalization of the 3 million event files. The total weights of the 3 million event files appear to match the event cross sections reported in Behr et al [13], but the 30 million event files deviate significantly. I decided to resolve this discrepancy by renormalizing the 30 million event files to their 3 million event equivalents.

       4b       2b2j
3M     5.98e-1  1.15e2
30M    8.57e-1  2.61e2

Table 1: The weight per event for the different files, in arbitrary units.

2. The results of the cut-based analysis differ greatly, preserving relatively fewer 4b events and relatively more 2b2j events. I investigated this difference by randomly sampling 10% subsets of the 30 million event files, but the likelihood of obtaining such a difference by chance appears to be essentially zero.

                              3M (4b)  30M (4b)  3M (2b2j)  30M (2b2j)
Passed weight (Boosted)       3.95e1   1.73e1    1.20e2     2.96e2
Passed weight (Intermediate)  7.99e1   5.47e1    2.41e2     7.83e2
Passed weight (Resolved)      2.48e3   1.58e3    4.31e3     9.94e3

Table 2: The total passed weight in each category (as defined in section 3).

I’m not sure how to explain these differences. Perhaps the 3 million event files are older, and the differences represent an evolution of the models used. Perhaps there is a problem with my analysis - though this would not explain the initial difference in per-event weight. While worrying, these discrepancies do not qualitatively impact the following analysis.

3 Cut-based analysis

Rather than feeding the event data to a multivariate analysis directly, we first perform a series of binary cuts to remove events that can easily be attributed to background. To do this I followed the example of Behr et al [13]. These cuts are very permissive (or "loose") in order to preserve a large fraction of the signal for the multivariate analysis. In the following subsections I will briefly describe the procedure; for a more verbose description, see Behr et al.


3.1 Jet reconstruction

For each input event, we perform the following procedure:

1. Cluster the constituents into jets using the anti-kt algorithm with radius R.

2. Cut jets with transverse momentum pT < pTmin.

3. Cut jets with absolute pseudorapidity |η| > ηmax.

We do this three times with different parameters, clustering the constituents into Large-R jets, Small-R jets and Small-R subjets respectively:

                 R    pTmin    ηmax
Large-R jets     1.0  200 GeV  2.0
Small-R jets     0.4  40 GeV   2.5
Small-R subjets  0.3  50 GeV   2.5

Table 3: The clustering and jet definition settings for the three types of jets.

We then recluster the Large-R jets using the Cambridge-Aachen algorithm with R = 1.0 and apply the BDRS mass drop tagger (with default parameters µ = 0.67, ycut = 0.09), keeping only those jets on which tagging is successful. Finally, we ghost-associate each Small-R subjet with a Large-R jet.
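As an illustration, the clustering and cut step can be sketched with the FastJet Python bindings (the analysis itself extends the C++ framework of [16]; this is a minimal sketch, not the code actually used):

```python
import fastjet as fj

def cluster_and_cut(constituents, R, pt_min, eta_max):
    """Cluster an event's constituents with anti-kt, then apply the cuts."""
    jet_def = fj.JetDefinition(fj.antikt_algorithm, R)
    cs = fj.ClusterSequence(constituents, jet_def)
    jets = fj.sorted_by_pt(cs.inclusive_jets(pt_min))    # pT cut
    return [j for j in jets if abs(j.eta()) < eta_max]   # pseudorapidity cut

# The three collections of Table 3 (momenta in GeV):
# large_r_jets    = cluster_and_cut(constituents, 1.0, 200.0, 2.0)
# small_r_jets    = cluster_and_cut(constituents, 0.4,  40.0, 2.5)
# small_r_subjets = cluster_and_cut(constituents, 0.3,  50.0, 2.5)
```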

3.2 b-tagging

We perform b-tagging on the Small-R jets and subjets, using the following procedure:

1. Select the constituents with pT ≥ 15 GeV.

2. If any of the constituents are tagged as bottom quarks, return the true positive rate 0.8.

3. If any of the constituents are tagged as charm quarks, return the false positive rate 0.1.

4. If any of the constituents are tagged as light quarks, return the false positive rate 0.01.

5. Otherwise, return zero.

If a nonzero answer is returned, the jet is said to be b-tagged with that probability. To tag a Large-R jet we multiply the results of its hardest two (highest pT) associated subjets. For example, a Large-R jet whose hardest two subjets both contain b-quarks will be tagged with probability 0.8^2 = 0.64. Jets that fail b-tagging are cut.
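In pseudocode, the tagging weights look as follows (a sketch of the rules above; truth_flavour() is an assumed helper returning the truth label of a constituent, and subjet association is taken as given):

```python
def btag_weight(jet, pt_cut=15.0):
    """Probability that a Small-R (sub)jet is b-tagged."""
    flavours = {truth_flavour(c) for c in jet.constituents()
                if c.pt() >= pt_cut}
    if "b" in flavours:
        return 0.8   # true positive rate for jets containing b-quarks
    if "c" in flavours:
        return 0.1   # false positive rate for charm
    if "l" in flavours:
        return 0.01  # false positive rate for light quarks
    return 0.0       # untagged: the jet is cut

def btag_weight_large(subjets):
    """Tag a Large-R jet via its two hardest associated subjets."""
    hardest = sorted(subjets, key=lambda j: j.pt(), reverse=True)[:2]
    return btag_weight(hardest[0]) * btag_weight(hardest[1])
```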

3.3 Event categories

Based on the results of jet clustering, we define the following three categories:

1. Boosted: Events containing 2 Large-R jets.

2. Intermediate: Events containing 1 Large-R jet and 2 Small-R jets, separated from the Large-R jet by an angular distance of ∆R ≥ 1.2.

3. Resolved: Events containing 4 Small-R jets.


These three categories all contain two Higgs candidates: each Large-R jet forms a candidate on its own, and each pair of Small-R jets may be combined to form a candidate. For a set of jets to pass a category, we require that the invariant mass of each Large-R jet and Small-R dijet falls within a window around the Higgs mass:

|m_{h,j} - 125 GeV| < 40 GeV, \qquad j = 1, 2. \qquad (2)

If an event passes a category, the weight passed will be the total weight multiplied by its b-tagging probability. For example, an event with two Large-R jets containing 4 b-quarks between them that passes the Boosted category will pass with 0.8^4 = 40.96% of its total weight. The remaining event weight is then passed to the next category if the event meets its requirements. Factoring the b-tagging probability into the passed weight is a way to get more mileage out of the limited number of Monte Carlo samples. For experimental data, b-tagging has a binary outcome and events that pass one category will not be considered by the next.

3.4 Jet pairing

In order to find the most suitable set of Higgs candidates from the available jets, we look for the pairing that minimizes the difference in invariant mass. For the Boosted category this simply means choosing the two Large-R jets with the most similar mass. For the Intermediate category, every Large-R jet has an associated set of suitably separated Small-R jets to pair up with; in this case we check every Large-R jet against every pair of associated Small-R jets. Finally, for the Resolved category we minimize the mass difference between two pairs of jets, which may be any combination of four Small-R jets.
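For the Resolved category, the search can be sketched directly (jets are assumed to support four-momentum addition and an invariant mass method m(), as FastJet's PseudoJet does; the parametrized variations of section 3.5 are omitted):

```python
from itertools import combinations

def best_resolved_pairing(jets):
    """Pick two disjoint dijet pairs minimizing the invariant mass difference."""
    best, best_diff = None, float("inf")
    for pair1 in combinations(range(len(jets)), 2):
        rest = [i for i in range(len(jets)) if i not in pair1]
        for pair2 in combinations(rest, 2):
            m1 = (jets[pair1[0]] + jets[pair1[1]]).m()  # dijet invariant mass
            m2 = (jets[pair2[0]] + jets[pair2[1]]).m()
            if abs(m1 - m2) < best_diff:
                best_diff, best = abs(m1 - m2), (pair1, pair2)
    return best  # indices of the two Higgs candidate dijets
```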

3.5 Ambiguities

While the preceding description may at first glance appear to be comprehensive, in practice several additional questions emerge:

1. If a Large-R jet is matched to more than two subjets and one or both of the hardest two fail b-tagging, should we check the others?

2. Should events with more than two Large-R jets be eligible for the Boosted category?

3. If so, do we check only the hardest two against the 80 GeV mass window, or do we keep going until we have two valid jets?

4. What if more than two jets are valid? Do we allow softer jets to participate in finding the pair with the smallest mass difference?

Similar questions may be raised regarding the Small-R jets and their use in the Intermediate and Resolved categories. Note that Behr et al [13] make a similar point in their description of b-tagging, where they say “We attempt to b-tag only the four (two) hardest small-R jets in the resolved (intermediate) category”, which I take to mean that all other (sub)jets will have zero tag weight, effectively excluding them from the analysis. To examine the effects of these constraints, I introduced a number of parameters:

1. Keep zero weight (sub)jets: Whether to discard zero weight (untagged) Small-R jets and subjets, or keep them until the end. Zero weight jets never pass the analysis (as they would just add noise to the MVA), but they might still be selected over other jets during jet pairing, causing us to cut an event that otherwise would have passed. I found that discarding zero weight Small-R subjets immediately gave slightly better results, but keeping zero weight Small-R jets until the end performed substantially better than the alternative.

2. Number of (sub)jets to tag: How many Small-R (sub)jets to consider for tagging. If we discard zero weight (sub)jets immediately, this determines how many additional jets may take their places. I found that only considering the first two Small-R subjets performs substantially better than considering more; in contrast, considering all Small-R jets performs much better than only considering two.

3. Number of jets to pair: How many Large-R and Small-R jets should participate in jet pairing. For each category there may be more available jets than the minimum required, so (if we accept this at all) we must decide how many to use. I found that in all cases, allowing as many jets as possible to participate in pairing yielded the best results.

4. Jet pairing tolerance: Once we have a valid pair of Higgs candidates for an event, we know it will pass the category. However, we may still look at other jet pairs to see if their mass difference is smaller. This parameter determines how many other jets we look at once we've found a valid pairing. I found that looking at all pairings produced the best results in all but one case: in the Resolved category, a tolerance of 0 performed best.

In addition, one might consider whether to accept events with more than the requisite number of Large-R or Small-R jets at all. For the Intermediate category, Behr et al [13] chose to include only events with exactly 1 Large-R jet. Because there are three types of jets and three categories, a table containing all configurations of parameters would be too large to include here. Instead I will give two possible parametrizations: one (“Stock”) which I believe best matches the analysis used in Behr et al, and one (“Best”) which gives the best signal significance. The results for the “Stock” parametrization do not match the results reported in Behr et al exactly, but the signal significances for the 3 million event files are very similar (0.7, 0.4 and 0.4, as opposed to the reported 0.5, 0.4 and 0.4, for the three categories respectively).

Parameter                                      “Stock”  “Best”
Number of Large-R jets (Boosted)               ≥ 2      ≥ 2
Number of Large-R jets (Intermediate)          = 1      ≥ 1
Number of Small-R jets (Intermediate)          ≥ 2      ≥ 2
Number of Small-R jets (Resolved)              ≥ 4      ≥ 4
Keep zero weight Small-R subjets               true     false
Keep zero weight Small-R jets                  true     true
Number of subjets to tag                       2        2
Number of Large-R jets to pair (Boosted)       2        ∞
Large-R pairing tolerance (Boosted)            0        ∞
Number of Large-R jets to pair (Intermediate)  1        ∞
Large-R pairing tolerance (Intermediate)       0        ∞
Number of Small-R jets to tag (Intermediate)   2        ∞
Number of Small-R jets to pair (Intermediate)  2        ∞
Small-R pairing tolerance (Intermediate)       0        ∞
Number of Small-R jets to tag (Resolved)       4        ∞
Number of Small-R jets to pair (Resolved)      4        ∞
Small-R pairing tolerance (Resolved)           0        0

Table 4: Two parametrizations for the cut-based analysis, where “Stock” represents the settings used in Behr et al [13] and “Best” gives the best signal significance.


Resolved category

              Cross-section [fb]                   S/B                S/√B
              hh4b     total bkg  4b      2b2j     tot      4b        tot      4b
3M   “Stock”  4.89e-1  5.55e3     2.30e3  3.25e3   8.80e-5  2.13e-4   3.59e-1  5.58e-1
     “Best”   7.15e-1  6.95e3     2.44e3  4.51e3   1.03e-4  2.93e-4   4.70e-1  7.93e-1
30M  “Stock”  4.89e-1  5.89e3     1.16e3  4.73e3   8.31e-5  4.23e-4   3.49e-1  7.87e-1
     “Best”   7.15e-1  8.54e3     1.35e3  7.19e3   8.38e-5  5.31e-4   4.24e-1  1.07e0

Intermediate category

              Cross-section [fb]                   S/B                S/√B
              hh4b     total bkg  4b      2b2j     tot      4b        tot      4b
3M   “Stock”  9.66e-2  1.83e2     5.86e1  1.24e2   5.29e-4  1.65e-3   3.92e-1  6.91e-1
     “Best”   3.06e-1  4.42e2     9.57e1  3.46e2   6.93e-4  3.20e-3   7.98e-1  1.71e0
30M  “Stock”  9.66e-2  2.66e2     2.42e1  2.42e2   3.63e-4  3.99e-3   3.24e-1  1.07e0
     “Best”   3.06e-1  7.09e2     4.65e1  6.63e2   4.32e-4  6.59e-3   6.29e-1  2.46e0

Boosted category

              Cross-section [fb]                   S/B                S/√B
              hh4b     total bkg  4b      2b2j     tot      4b        tot      4b
3M   “Stock”  1.37e-1  1.06e2     3.60e1  7.02e1   1.29e-3  3.80e-3   7.27e-1  1.25e0
     “Best”   1.41e-1  1.06e2     3.60e1  7.02e1   1.33e-3  3.93e-3   7.52e-1  1.29e0
30M  “Stock”  1.37e-1  2.16e2     1.17e1  2.05e2   6.32e-4  1.17e-2   5.09e-1  2.19e0
     “Best”   1.41e-1  2.20e2     1.18e1  2.09e2   6.42e-4  1.19e-2   5.22e-1  2.25e0

Table 5: The results of the different parametrizations, in a similar style to table 4 of Behr et al [13]. Included are the results for both the 3 million event files and the 30 million event files.

4 Multivariate analysis

Following the cut-based analysis, Behr et al [13] apply a multivariate analysis to improve the signal significance. In the same vein, I implemented a neural network optimizer based on the Stochastic Gradient Descent approach. While the main goal was simply to optimize performance, a secondary goal in this project was to assess and minimize training bias. Much of the work described in the following sections reflects this goal.

4.1 Input parameters

For each event that passes one of the categories in the cut-based analysis, we calculate a number of variables describing the selected jets and pass them as input to the neural network. Events from the three categories (Boosted, Intermediate, Resolved) are analyzed separately, and the number of variables depends on the number of Large-R jets involved. Following Behr et al [13], I selected 13 common variables shared by all categories, and an additional 4 subjet structure variables for each Large-R jet. This brings the total number of variables to 13 for the Resolved category, 17 for the Intermediate category and 21 for the Boosted category. For a detailed description of these variables, see Behr et al.

4.1.1 ZCA whitening

Many of the parameters are correlated in some way with each other; this makes them less useful for training a neural network. By treating the input parameters for each event as the row of a matrix, we can decorrelate the parameters by transforming the data to make its covariance matrix equal to the identity matrix. This procedure is called “whitening”, and two common choices of methods are ZCA whitening and PCA whitening [18]. I found that ZCA whitening greatly enhanced neural network performance, but did not investigate alternative choices.
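A minimal numpy sketch of ZCA whitening, assuming the events are stacked as the rows of a matrix X:

```python
import numpy as np

def zca_whiten(X, eps=1e-8):
    """Decorrelate the input parameters so their covariance is the identity."""
    Xc = X - X.mean(axis=0)             # center each parameter
    cov = np.cov(Xc, rowvar=False)      # covariance of the parameters
    evals, evecs = np.linalg.eigh(cov)  # symmetric eigendecomposition
    # ZCA transform: U diag(1/sqrt(lambda)) U^T; eps guards tiny eigenvalues
    W = evecs @ np.diag(1.0 / np.sqrt(evals + eps)) @ evecs.T
    return Xc @ W
```

Unlike PCA whitening, ZCA keeps the whitened parameters as close as possible to the originals, which makes the transformed inputs easier to relate back to the physical variables.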


4.1.2 Event weight and output

In addition to the input parameters, each event also has a weight and a known output value classifying it as either signal or background. The event outputs are compared to the neural network predictions to calculate gradients to adjust the network parameters. The event weights are used to select points as described below, and in calculation of the signal significance.

4.2 Neural network architecture

The simplest type of neural network uses a sequential architecture where each layer of neurons or “hidden units” is fully connected to the next via some (nonlinear) transfer function. I found no reason to deviate from this type; indeed, I found no advantage to using more than a single hidden layer. The simple architecture I used can be summarized as follows:

1. An input layer containing a number of units equal to the number of input parameters: 13, 17 or 21 for the Resolved, Intermediate and Boosted category respectively.

2. A transfer function applied to each unit. I chose the hyperbolic tangent function as preferred in the literature (see e.g. [17]).

3. A layer of hidden units. Testing using the Boosted category, I found that a layer containing 12 units was most effective.

4. A transfer function applied to each unit. Here I used either the standard sigmoid function or the hyperbolic tangent function depending on the chosen criterion, see below.

5. An output layer containing only a single unit, representing the model’s classification of an event.

4.2.1 Criterion

In addition to the basic architecture, a criterion is needed to determine the accuracy of the model. Here I compared the Binary Cross Entropy (BCE), a measure of distribution similarity that is often preferred for classifier networks, to the Mean Squared Error (MSE) as commonly used in regression analyses. Since the BCE uses the natural logarithm function, which is not defined for negative numbers unless one allows complex results, I used the standard sigmoid function 1/(1 + e^{-x}) for the final transfer layer when training with this criterion. When training with the MSE criterion I instead used the symmetrical hyperbolic tangent function. I found that while both criteria performed similarly, the BCE criterion appeared to generate more consistent results whereas the MSE criterion appeared to perform better when combining the results of multiple networks (see section 5). These are purely qualitative statements, however: due to time constraints I was not able to perform a quantitative comparison.
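In PyTorch-style pseudocode, one plausible reading of this architecture and the criterion-dependent final transfer is the following (the project used its own implementation, so treat the layer ordering as my interpretation of the list above):

```python
import torch.nn as nn

def build_network(n_inputs, n_hidden=12, criterion="MSE"):
    # Sigmoid output for BCE (targets 0/1); tanh output for MSE (targets -1/+1)
    final = nn.Sigmoid() if criterion == "BCE" else nn.Tanh()
    return nn.Sequential(
        nn.Tanh(),                      # transfer applied to the input units
        nn.Linear(n_inputs, n_hidden),  # single hidden layer of 12 units
        nn.Tanh(),                      # hidden transfer
        nn.Linear(n_hidden, 1),         # single output unit
        final,
    )

# loss_fn = nn.BCELoss() if criterion == "BCE" else nn.MSELoss()
```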

4.2.2 A note on output values

When assigning a numeric value to signal and background events, perhaps the obvious choices are 0 (for background) and 1 (for signal). On the other hand, the hyperbolic tangent function ranges from -1 to 1, so I opted to use -1 to denote background events when using the MSE criterion. This will be important to keep in mind once we introduce a cutoff value to improve signal significance.


4.3 Subset splitting

To minimize the impact of overfitting the data, we split the total sets of signal and background events into non-overlapping subsets. Each event is randomly assigned to a subset by choosing a random permutation of the whole set and splitting the permutation into equal parts. Each event is then assigned an index corresponding to its subset such that we can identify how it was used. We treat signal and background events as separate sets, so that the number of signal and background events in each subset is consistent modulo rounding error. Total event weight may vary between subsets, however.
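A sketch of this assignment, applied separately to the signal and background sets:

```python
import numpy as np

def assign_subsets(n_events, n_units, rng=None):
    """Assign each event a subset index via a random permutation
    split into (nearly) equal parts."""
    rng = rng or np.random.default_rng()
    index = np.empty(n_events, dtype=int)
    for unit, part in enumerate(np.array_split(rng.permutation(n_events), n_units)):
        index[part] = unit
    return index
```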

4.3.1 Three subsets

The simplest option is to split the total set into three subsets: a training set, a test set and a cross verification set.

1. The training set is used to train the neural network and to decide when to stop training.

2. The test set is used to evaluate network performance, to decide which iteration of the network to use as the final result.

3. The cross verification set is used to give an unbiased verification of the network's performance.

Since the subsets are independent we can use each subset for all three roles, generating three neural networks with three corresponding cross verification sets. Then we can evaluate their joint performance by applying them to their cross verification sets, thus yielding a single prediction for each event and allowing us to evaluate the signal significance for the total set.

4.3.2 Subsets as units

We might also consider splitting the total set into more than three parts, assigning a certain number of “units” to each role. Thus in the previous example we had 1 train unit, 1 test unit and 1 cross unit; instead we might choose 3 train units, 1 test unit and 2 cross units, splitting the set into 6 parts. Then we can combine any 3 units to form a training set, take another unit to be the test set, and apply the results to 2 cross sets. For simplicity we may choose consecutive units modulo the total number of units and treat the cross sets independently, giving us the following scheme:

Pass  Training set  Test set  Model  Cross set 1              Cross set 2
1     1 ∪ 2 ∪ 3     4         α      −  −  −  −  α5 −        −  −  −  −  −  α6
2     2 ∪ 3 ∪ 4     5         β      −  −  −  −  α5 β6       β1 −  −  −  −  α6
3     3 ∪ 4 ∪ 5     6         γ      γ1 −  −  −  α5 β6       β1 γ2 −  −  −  α6
4     4 ∪ 5 ∪ 6     1         δ      γ1 δ2 −  −  α5 β6       β1 γ2 δ3 −  −  α6
5     5 ∪ 6 ∪ 1     2         ε      γ1 δ2 ε3 −  α5 β6       β1 γ2 δ3 ε4 −  α6
6     6 ∪ 1 ∪ 2     3         ζ      γ1 δ2 ε3 ζ4 α5 β6       β1 γ2 δ3 ε4 ζ5 α6

Table 6: This table shows how each subset of the two cross sets is populated in the given example. Each model fills in one sixth of each cross set.

By adjusting the number of units used, we can determine the fraction of events that are used to generate predictions for each point. Given n events, to obtain the best performance one could thus conceivably set the number of training and test units to n − 1 and generate n models to populate a single cross set (but doing so would be prohibitively expensive). At the other extreme, given a surplus of events one could set the number of training and test units to 1 and increase the number of cross sets. For the purposes of this report I used the latter tactic to reduce the memory usage and runtime of the Resolved analysis, as follows:

Category      Background  Training  Test  Cross
Boosted       both        1         1     1
Intermediate  both        1         1     1
Resolved      both        1         1     20
Boosted       4b          1         1     1
Intermediate  4b          1         1     1
Resolved      4b          1         1     10

Table 7: Number of units used for the present analysis.

I used the same number of units for both the “Stock” and “Best” parametrizations.

4.4 Sample selection

To improve efficiency, especially on highly parallel hardware like a graphics processing unit, neural networks are usually trained in so-called minibatches of samples. Doing so may also be advantageous for model performance because much of the random variance in each gradient will be smoothed away. This randomness may also be helpful in avoiding or leaving local minima, however, so a balance must be struck between efficiency and model performance. For the Boosted category, I found that a batch size of 1024 gave good results. Due to time constraints I was unable to test the other categories as extensively, and used 1024 for every run.

4.4.1 Event weighting

In selecting samples for each batch we must take event weight into account. To select events for minibatches in a properly weighted way we apply Walker’s alias method [21] to generate alias tables U and K in advance. During training, we then generate uniformly distributed random numbers to choose indices from these tables. The total weight of the background events is much greater than the total weight of the signal events, so to avoid biasing our models toward only caring about background events, we normalize the two sets separately. In my experiments I found that attributing roughly equal weight to the signal and background sets performed well, but even moderately large deviations from this balance did not greatly impact performance.
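A sketch of the method, with a one-time O(n) table construction followed by O(1) draws (variable names are illustrative):

```python
import numpy as np

def build_alias(weights):
    """Precompute Walker alias tables U (thresholds) and K (aliases)."""
    n = len(weights)
    U = np.asarray(weights, dtype=float) * n / np.sum(weights)
    K = np.arange(n)
    small = [i for i in range(n) if U[i] < 1.0]
    large = [i for i in range(n) if U[i] >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        K[s] = l                # the overflow of cell s is taken from cell l
        U[l] -= 1.0 - U[s]      # remove the mass donated to cell s
        (small if U[l] < 1.0 else large).append(l)
    return U, K

def draw(U, K, size, rng):
    """Draw weighted event indices in O(1) per sample."""
    i = rng.integers(0, len(U), size)
    return np.where(rng.random(size) < U[i], i, K[i])
```

One table pair is built for the signal set and one for the background set, each normalized to its own total weight, so that the desired fraction of signal events (the signal bias of Table 8) can be enforced when filling each minibatch.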

4.5 On signal significance

To distinguish between models that perform well and models that perform poorly we use the signal significance. The approximation usually used for this quantity is S_1 = n_S/\sqrt{n_B}, the number of signal events over the square root of the number of background events, but this equation has the unfortunate property of overestimating the signal significance when the number of background events is low (< 100), even going to infinity if n_B = 0. To moderate the signal significance calculated for very low numbers of background events, I instead used the formula derived in [14],

S_2 = \sqrt{2 \left[ (n_S + n_B) \ln(1 + n_S/n_B) - n_S \right]},

which reduces to S_1 in the limit n_S ≪ n_B.


4.5.1 Cutoff point

The predictions generated by a neural network using the architecture described previously aren’t binary; rather, they are on a spectrum corresponding to the network’s certainty that an event is either signal or background. As such we must choose a cutoff point: predictions above this point are considered to be signal and are kept, predictions below it are considered to be background and are discarded. One can then check the retained events against the known output values to see how many signal and background events were actually kept. Since signal events are more likely to have a high predicted value than background events (assuming the network has been trained correctly), we expect that raising the cutoff point will cut background events at a faster rate than signal events, and thus the signal significance will be increased. Put the cutoff point too high however and almost all the signal events will be discarded as well, reducing the usefulness of the remaining data. Unfortunately there is no clear answer to the question of how much signal to sacrifice for a higher signal significance.

4.5.2 Ideal cutoff points

As we raise the cutoff point, the signal significance will reach a maximum whenever we discard a group of background events, and a minimum whenever we discard a group of signal events. We can get an idea of the landscape by sampling the signal significance at a number of different values for the cutoff point, but this risks missing significant maxima, in particular near the upper limit of the range. Instead, we can calculate the peaks exactly in the following way:

0. As a preprocessing step, copy the event weights to two new buffers, one in which all the background weights are set to zero (the signal weights), and one in which all the signal weights are set to zero (the background weights).

1. Sort the model predictions in descending order and store the sorted indices.

2. Index the signal weights by the sorted indices and calculate the cumulative sum of the resulting buffer. Do the same for the background weights.

3. (optional) Select the coordinates for which the signal weight is non-zero (if the signal weight is zero, the signal significance can only go down) and do the same for the corresponding elements of each cumulative sum.

4. To guard against identical predictions, remove each coordinate that matches the next (and the corresponding cumulative sum elements).

5. Calculate the signal significances for the remaining elements.

6. (optional) Scan through the signal significances in reverse, removing any signal significances and corresponding coordinates that do not exceed the highest significance seen so far.

Including the optional steps, this procedure gives a list of all the unique peaks in signal signifi-cance, sorted in descending order, and the corresponding passed signal and background weights, sorted in ascending order. The remaining list of predictions, also sorted in descending order, represents the cutoff points corresponding to each peak. Note that these peaks are likely specific to the dataset used, and the signal significance at these exact cutoff points will be lower for other sets.
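A vectorized sketch of this procedure (sig_fn is the significance formula of section 4.5, applied elementwise; it should tolerate very small background weights):

```python
import numpy as np

def significance_peaks(preds, weights, is_signal, sig_fn):
    """All local maxima of the significance as the cutoff point is raised."""
    # Step 0: separate signal and background weight buffers
    w_sig = np.where(is_signal, weights, 0.0)
    w_bkg = np.where(is_signal, 0.0, weights)
    # Step 1: sort the predictions in descending order
    order = np.argsort(-preds)
    cuts = preds[order]
    # Step 2: cumulative passed weights at every candidate cutoff
    s, b = np.cumsum(w_sig[order]), np.cumsum(w_bkg[order])
    # Steps 3-4: keep cutoffs whose last passed event carries signal weight,
    # and drop any entry whose prediction matches the next one
    keep = (w_sig[order] > 0) & np.append(cuts[:-1] != cuts[1:], True)
    cuts, s, b = cuts[keep], s[keep], b[keep]
    # Step 5: significance at every remaining cutoff
    sig = sig_fn(s, b)
    # Step 6: reverse scan, keeping only strict improvements
    run_max = np.maximum.accumulate(sig[::-1])[::-1]  # max over sig[i:]
    peak = sig > np.append(run_max[1:], -np.inf)      # beats everything after it
    return cuts[peak], sig[peak], s[peak], b[peak]
```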


4.5.3 Adjusted signal significance

We now have a list of all the cutoff points of interest, but the highest cutoff points may still let through too few signal events to be useful. To address this problem I adjusted calculation of the signal significance in two ways:

1. We remove all cutoff points that retain less than 10% of the signal events (this may be too strict for sets containing many signal events, like the Resolved category).

2. We multiply the remaining signal significances with the natural logarithm of the number of signal events.

These adjustments are somewhat arbitrary - in particular, one might consider the natural logarithm a peculiar choice. A more natural choice might be the square root of the number of signal events as a measure of the statistical accuracy, but I found that this pushes the results toward signal significances that are too low to be useful. The natural logarithm tends to avoid extremely low amounts of signal events while still favoring high signal significance.
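As a sketch, the adjustment applied when ranking candidate cutoff points (names illustrative):

```python
import numpy as np

def adjusted_significance(sig, n_signal, n_signal_total):
    """Veto cutoffs keeping <10% of the signal; weight the rest by ln(n_signal)."""
    keep = n_signal >= 0.1 * n_signal_total
    return np.where(keep, sig * np.log(np.maximum(n_signal, 1.0)), -np.inf)
```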

4.6 Training procedure

As mentioned previously, there are two sets involved in training the neural network: the training set and the test set. In each iteration, we

1. Select a minibatch of points from the training set.

2. Perform a “forward propagation” step where the network evaluates the input.

3. Compute the gradients of the loss function associated with the criterion using the model’s output and the target values.

4. Perform a "backward propagation" step to calculate the neural network's gradients.

5. Update the neural network's internal state using the learning rate.

For the Boosted category, I found that a relatively high learning rate of 1.0 worked best. Following this, we

1. Calculate the signal significance of the model’s predictions for the training set.

2. If this signal significance is better than the previous best, we update the best training significance and iteration counter.

3. Calculate the signal significance of the model’s predictions for the test set.

4. If this signal significance is better than the previous best, we calculate the geometric mean of the training significance and the test significance (to lower the odds of selecting statistical outliers).

5. If this geometric mean is better than the previous best, we update the stored significances and save the current state of the network.

6. Otherwise, we compare the iteration to the training iteration counter. If too many itera-tions have passed without an improvement to the training significance, we stop.


In addition, we set a lower limit on the number of iterations to perform. This “warm up” period is simply the number of iterations needed for all points in the set to have been seen by the network at least once (albeit ignoring their individual weights). If the best iteration found during the process falls within the warm up period we repeat the process. Once the process completes successfully, we restore the saved state of the neural network and apply it to its cross sets as described previously.

Parameter                Value
Criterion                Mean squared error
Number of hidden layers  1
Number of hidden units   12
Minibatch size           1024
Learning rate            1.0
Signal bias              0.5

Table 8: A summary of the parameters used to define and train the neural networks. Here the signal bias is the fraction of weight devoted to signal data, used by Walker's alias method. These parameters were tuned on the Boosted category and may not be ideal for the other categories.
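Putting the pieces together, the loop can be sketched as follows (PyTorch-style; sample_batch and the two significance callbacks are assumed wrappers around the alias sampling and the adjusted significance described earlier, and the patience and warm-up constants are illustrative):

```python
import copy
import torch

def train_network(model, loss_fn, sample_batch, train_significance,
                  test_significance, lr=1.0, patience=5000, warmup=1000):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    best_train = best_geo = float("-inf")
    last_improvement, best_state, it = 0, None, 0
    while True:
        it += 1
        x, target = sample_batch()
        opt.zero_grad()
        loss = loss_fn(model(x), target)  # forward propagation and criterion
        loss.backward()                   # backward propagation: gradients
        opt.step()                        # update weights with the learning rate
        s_train = train_significance(model)
        if s_train > best_train:          # track the best training significance
            best_train, last_improvement = s_train, it
        s_test = test_significance(model)
        geo = (s_train * s_test) ** 0.5   # geometric mean guard against outliers
        if geo > best_geo:
            best_geo = geo
            best_state = copy.deepcopy(model.state_dict())
        if it > warmup and it - last_improvement > patience:
            break                         # no training improvement for too long
    model.load_state_dict(best_state)     # restore the best saved iteration
    return model
```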

5 Combining results

Using the scheme from the previous section, we obtain three (or more) models, each trained on one third (or less) of the points. However, the points for each (training, test, cross) set are chosen at random, and from the perspective of an event being predicted, it doesn’t matter what points the model trained on, so long as the event being predicted wasn’t included. Thus, what if we trained many models on many different selections of points, saving each set of models and its predictions? Then instead of simply looking at the average model performance, we could actually combine the predictions for each event to arrive at a master prediction.

5.1 Mean, median and other options

A priori it isn’t clear what combination of models would yield the best results. Perhaps simply taking the mean of the predictions works well, or perhaps we want something more robust against outliers like the median. Perhaps determining the mode of each set of predictions could yield an even better result - but if the data is multi-modal, perhaps the main mode doesn’t tend to represent the best predictions. One might also consider more complex combinations that take the actual data into account, such as linear regression or even training a neural network on the predictions. For these combinations the same problems arise as before, however: to avoid overfitting, the set must be split into multiple parts, models applied only to their cross sets. I tried the following options:

1. The mean of the predictions for each event.

2. The median of the predictions for each event.

3. The minimum/maximum of the predictions for each event (chosen based on the mean or median).

4. The mode of the predictions for each event.

5. A mix of linear combinations of the models (trained on random subsets).


The mean and median can be seen as two extremes, with the mean giving equal weight to each prediction and the median giving weight only to the center prediction. Based on this, the last option seemed promising, essentially calculating an optimal middle ground between the mean and median. However, incorporating the event weights presented a challenge, and I found that many predictions ended up with negative weights, nearly canceling out their neighbors. In the end, none of the more complex options were able to outperform the mean and the median, usually performing substantially worse.
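For the simple combinations, a sketch (one row per model; np.nan marks the events a model did not cover because they were in its training or test set):

```python
import numpy as np

def combine_predictions(preds, method="median"):
    """preds: (n_models, n_events); returns one master prediction per event."""
    if method == "mean":
        return np.nanmean(preds, axis=0)
    return np.nanmedian(preds, axis=0)
```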

5.2 Target signal significance

As mentioned before, it isn't clear how to choose the optimal cutoff point: we want to maximize the signal significance, but without sacrificing too many signal events in the process. This led me to use a somewhat arbitrary adjusted signal significance in training the neural networks. However, to present the final results I have taken a different approach. Since the maximum achievable signal significance is often more than adequate, instead let us target specific values for the signal significance and maximize the number of signal events that is retained. This problem has a unique solution for each signal significance, with a unique cutoff point for each category. The following tables represent combinations of 1000 sets of predictions.

Boosted category (“Stock”, both backgrounds)

        Signal events          Signal significance      Cutoff point
Target  Mean  Median  Raw      Mean  Median  Raw        Mean   Median  Raw         % failed
0       410   410     410      0.5   0.5     0.5        -1.00  -1.00   -1.00       0.00
1       370   370     325±80   1.1   1.0     1.4±0.9    0.14   0.16    0.30±0.34   0.00
2       138   146     144±57   6.4   6.4     3.6±1.2    0.91   0.92    0.91±0.06   0.00
3       138   146     122±55   6.4   6.4     4.4±1.0    0.91   0.92    0.93±0.05   0.40
4       138   146     105±50   6.4   6.4     5.1±0.8    0.91   0.92    0.95±0.03   3.40
5       138   146     87±53    6.4   6.4     5.7±0.7    0.91   0.92    0.95±0.03   14.40
6       138   146     63±48    6.4   6.4     6.7±0.7    0.91   0.92    0.97±0.02   43.60
7       16    −       45±30    7.9   −       7.6±0.7    0.98   −       0.98±0.01   70.40
best    16    11      48±41    7.9   6.6     6.4±1.4    0.98   0.98    0.97±0.02

Intermediate category (“Stock”, both backgrounds)

        Signal events          Signal significance      Cutoff point
Target  Mean  Median  Raw      Mean  Median  Raw        Mean   Median  Raw         % failed
0       290   290     290      0.3   0.3     0.3        -1.00  -1.00   -1.00       0.00
1       274   274     265±8    1.0   1.0     1.1±0.1    -0.37  -0.39   -0.26±0.15  0.00
2       203   204     175±29   2.0   2.0     2.2±0.3    0.51   0.53    0.65±0.12   0.00
3       129   129     107±31   4.4   4.3     3.8±0.6    0.80   0.83    0.86±0.06   0.20
4       129   129     82±36    4.4   4.3     4.4±0.5    0.80   0.83    0.90±0.05   3.70
5       124   112     48±27    5.4   5.2     5.6±0.5    0.81   0.86    0.94±0.03   27.80
6       43    43      40±18    6.2   6.2     6.4±0.4    0.93   0.94    0.95±0.02   61.90
best    43    43      37±24    6.2   6.2     5.7±1.0    0.93   0.94    0.95±0.03

Resolved category (“Stock”, both backgrounds)

        Signal events          Signal significance      Cutoff point
Target  Mean  Median  Raw      Mean  Median  Raw        Mean   Median  Raw         % failed
0       1467  1467    1467     0.3   0.3     0.3        -1.00  -1.00   -1.00       0.00
1       1231  1232    1181±42  1.0   1.0     1.0±0.1    -0.30  -0.33   -0.22±0.12  0.00
2       14    18      130±172  2.8   4.1     2.5±0.5    0.93   0.95    0.92±0.09   13.40
3       13    18      60±74    3.2   4.1     3.3±0.4    0.94   0.95    0.95±0.03   52.20
4       −     18      23±25    −     4.1     4.3±0.3    −      0.95    0.97±0.01   91.80
best    7     8       108±210  3.7   4.4     2.9±0.8    0.94   0.95    0.93±0.11

Table 9: Combined model results (mean and median) compared with the raw results for various target signal significances, using the “Stock” configuration and including both backgrounds. The raw results include the variance, given as the size of a single standard deviation. The rightmost column gives the percentage of models that failed to meet the target significance and were excluded; a dash marks combinations that did not reach the target. The last row gives the best achievable significance, including all models.


One caveat to keep in mind for these results is that these cutoff points are in reality extremely specific, and represent peaks corresponding to the datasets used. In practice it would be better to choose cutoff points that are in between two extrema, or that perform well in cross validation; I did not do this due to time constraints. As is, these results should be considered as the theoretical limits for the models used.

5.3 Combining categories

Finally, we can apply the same procedure to all three categories at once. For each target signal significance, there is a combination of cutoff points that yields the largest number of signal events. Unfortunately I don’t know of a way to find the best combination in less than O(nmk) time (where n, m and k are the number of unique peaks in signal significance for each category). Even though the lists of signal significance peaks for each category are sorted, this property is not preserved for their combination. One observation we can use is that if the combination of two categories has a worse signal significance than the third, increasing the amount of signal contributed by the third will improve the overall signal significance (so long as doing so does not reduce the signal significance of the third category below that of the other two). This gives us an early stopping criterion: once increasing the amount of signal contributed by a category no longer increases the overall signal significance, adding more will just decrease the overall significance further. Using this observation I was able to generate Table 13 and Table 14, which represent the final results for this project.
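Schematically, the search looks as follows (a brute-force sketch; the early-stopping refinement described above is omitted for clarity):

```python
from itertools import product

def combine_categories(categories, target, sig_fn):
    """categories: per-category (signal, background) arrays at each cutoff peak.
    Returns the peak indices maximizing total signal at the target significance."""
    best_signal, best_choice = -1.0, None
    for choice in product(*(range(len(s)) for s, _ in categories)):
        s_tot = sum(cat[0][i] for cat, i in zip(categories, choice))
        b_tot = sum(cat[1][i] for cat, i in zip(categories, choice))
        if sig_fn(s_tot, b_tot) >= target and s_tot > best_signal:
            best_signal, best_choice = s_tot, choice
    return best_choice
```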

Boosted category (“Best”, both backgrounds)

        Signal events          Signal significance      Cutoff point
Target  Mean  Median  Raw      Mean  Median  Raw        Mean   Median  Raw         % failed
0       424   424     424      0.5   0.5     0.5        -1.00  -1.00   -1.00       0.00
1       379   380     352±61   1.1   1.1     1.2±0.7    0.13   0.15    0.21±0.30   0.00
2       132   139     140±59   6.1   6.1     3.7±1.2    0.91   0.92    0.91±0.06   0.00
3       132   139     120±57   6.1   6.1     4.3±0.9    0.91   0.92    0.93±0.05   0.80
4       132   139     102±52   6.1   6.1     4.9±0.7    0.91   0.92    0.94±0.03   4.60
5       132   139     84±57    6.1   6.1     5.6±0.6    0.91   0.92    0.95±0.03   23.00
6       132   139     57±48    6.1   6.1     6.6±0.6    0.91   0.92    0.97±0.02   56.00
best    13    10      51±48    7.2   6.4     5.9±1.3    0.97   0.98    0.97±0.03

Intermediate category (“Best”, both backgrounds)

        Signal events          Signal significance      Cutoff point
Target  Mean  Median  Raw      Mean  Median  Raw        Mean   Median  Raw         % failed
0       918   918     918      0.6   0.6     0.6        -1.00  -1.00   -1.00       0.00
1       895   894     880±10   1.0   1.0     1.0±0.02   -0.67  -0.70   -0.66±0.08  0.00
2       33    39      94±88    5.2   5.7     3.1±0.9    0.93   0.95    0.94±0.05   0.90
3       33    39      67±69    5.2   5.7     3.8±0.8    0.93   0.95    0.95±0.04   10.30
4       33    39      45±45    5.2   5.7     4.7±0.7    0.93   0.95    0.96±0.02   38.40
5       33    39      35±27    5.2   5.7     5.6±0.6    0.93   0.95    0.97±0.02   68.20
best    17    21      30±34    5.8   6.5     4.5±1.2    0.95   0.96    0.97±0.02

Resolved category (“Best”, both backgrounds)

        Signal events          Signal significance      Cutoff point
Target  Mean  Median  Raw      Mean  Median  Raw        Mean   Median  Raw         % failed
0       2146  2146    2146     0.4   0.4     0.4        -1.00  -1.00   -1.00       0.00
1       1900  1896    1827±22  1.0   1.0     1.0±0.01   -0.59  -0.62   -0.52±0.04  0.00
2       1170  1175    592±414  2.0   2.0     2.1±0.2    0.36   0.38    0.69±0.20   12.90
best    9     2       268±352  2.7   2.5     2.5±0.5    0.90   0.93    0.84±0.17

Table 10: Combined model results (mean and median) compared with the raw results for various target signal significances, using the “Best” configuration and including both backgrounds. The raw results include the variance, given as the size of a single standard deviation. The rightmost column gives the percentage of models that failed to meet the target significance and were excluded. The last row gives the best achievable significance, including all models.


6 Discussion of results

As can be seen from the result tables of the last section, the “Best” configuration doesn’t perform as well as expected when both backgrounds are taken into account. It only shows its potential for a signal significance of 3 or lower when all the categories are combined, yielding a much larger number of signal events. On the other hand, it performs very well if only the 4b background is considered, reaching a somewhat higher maximum signal significance but more importantly, retaining a much larger number of signal events when compared at the same target significance.

6.1 Event predictability

This suggests that the “Best” configuration with both backgrounds includes so many different kinds of events that it becomes more difficult for the neural networks to separate them. Due to time constraints I was not able to extend my experiments with the parametrization to take the MVA into account as well, but this seems like a logical next step.

Boosted category (“Stock”, 4b background)

        Signal events          Signal significance       Cutoff point
Target  Mean  Median  Raw      Mean  Median  Raw         Mean   Median  Raw          % failed
2       410   410     410      2.2   2.2     2.2         -0.99  -1.00   -1.00        0.00
3       409   409     407±1    3.0   3.0     3.0±0.01    -0.75  -0.81   -0.79±0.04   0.00
4       400   400     394±2    4.0   4.0     4.0±0.01    -0.32  -0.37   -0.29±0.09   0.00
5       386   385     369±5    5.0   5.0     5.0±0.03    0.05   0.07    0.23±0.10    0.00
6       363   363     334±11   6.0   6.2     6.0±0.1     0.36   0.40    0.57±0.09    0.00
7       333   330     277±42   7.0   7.0     7.1±0.2     0.57   0.64    0.79±0.07    2.40
8       303   301     163±89   8.2   8.2     8.3±0.5     0.70   0.76    0.92±0.06    48.60
best    221   200     152±85   8.9   8.7     8.3±0.9     0.87   0.92    0.93±0.06

Intermediate category (“Stock”, 4b background)

        Signal events          Signal significance       Cutoff point
Target  Mean  Median  Raw      Mean  Median  Raw         Mean   Median  Raw          % failed
1       290   290     290      1.1   1.1     1.1         -1.00  -1.00   -1.00        0.00
2       280   280     275±2    2.0   2.0     2.0±0.004   -0.43  -0.47   -0.39±0.08   0.00
3       240   241     229±8    3.0   3.0     3.0±0.02    0.31   0.34    0.42±0.10    0.00
4       207   208     174±25   4.0   4.1     4.1±0.1     0.58   0.61    0.74±0.08    0.00
5       190   187     82±50    5.0   5.0     5.4±0.6     0.67   0.71    0.92±0.06    10.20
6       81    77      39±25    6.0   6.0     6.6±0.6     0.92   0.93    0.96±0.02    39.00
7       33    28      34±16    7.4   7.2     7.5±0.5     0.96   0.97    0.97±0.02    71.00
best    30    28      41±37    7.5   7.2     6.4±1.1     0.96   0.97    0.96±0.04

Resolved category (“Stock”, 4b background)

        Signal events          Signal significance       Cutoff point
Target  Mean  Median  Raw      Mean  Median  Raw         Mean   Median  Raw          % failed
0       1467  1467    1467     0.8   0.8     0.8         -1.00  -1.00   -1.00        0.00
1       1448  1448    1441±1   1.0   1.0     1.0±0.0002  -0.92  -0.94   -0.93±0.005  0.00
2       1335  1336    1312±5   2.0   2.0     2.0±0.001   -0.58  -0.61   -0.54±0.03   0.00
3       1203  1201    1165±9   3.0   3.0     3.0±0.002   -0.11  -0.11   -0.02±0.04   0.00
4       1035  1037    966±22   4.0   4.0     4.0±0.01    0.32   0.33    0.44±0.05    0.00
5       771   771     559±130  5.0   5.0     5.0±0.1     0.66   0.69    0.81±0.06    9.20
6       359   357     169±111  6.0   6.0     6.2±0.2     0.87   0.89    0.94±0.03    89.80
7       250   259     54±36    7.1   7.0     7.4±0.004   0.90   0.91    0.97±0.01    99.80
best    169   189     313±152  7.4   7.3     5.5±0.4     0.92   0.93    0.90±0.05

Table 11: Combined model results (mean and median) compared with the raw results for various target signal significances, using the “Stock” configuration and including just the 4b background. The raw results include the variance, given as the size of a single standard deviation. The rightmost column gives the percentage of models that failed to meet the target significance and were excluded. The last row gives the best achievable significance, including all models.


Boosted category (“Best”, 4b background)

        Signal events          Signal significance       Cutoff point
Target  Mean  Median  Raw      Mean  Median  Raw         Mean   Median  Raw          % failed
2       424   424     424      2.2   2.2     2.2         -1.00  -1.00   -1.00        0.00
3       423   423     421±0.4  3.0   3.0     3.0±0.005   -0.79  -0.85   -0.83±0.03   0.00
4       415   415     410±2    4.0   4.0     4.0±0.01    -0.37  -0.43   -0.38±0.08   0.00
5       400   400     387±5    5.0   5.0     5.0±0.02    -0.01  -0.01   0.13±0.10    0.00
6       380   380     353±10   6.0   6.0     6.0±0.1     0.29   0.33    0.50±0.09    0.00
7       345   345     303±29   7.0   7.1     7.1±0.1     0.55   0.60    0.73±0.08    0.50
8       316   312     210±81   8.1   8.0     8.2±0.4     0.68   0.74    0.89±0.07    32.30
9       235   65      86±62    9.0   9.2     9.6±0.7     0.86   0.98    0.97±0.03    77.00
best    70    65      157±88   9.7   9.2     8.6±1.0     0.97   0.98    0.92±0.06

Intermediate category (“Best”, 4b background)

        Signal events          Signal significance       Cutoff point
Target  Mean  Median  Raw      Mean  Median  Raw         Mean   Median  Raw          % failed
2       918   918     918      2.5   2.5     2.5         -1.00  -1.00   -1.00        0.00
3       915   914     913±1    3.0   3.0     3.0±0.002   -0.85  -0.88   -0.88±0.02   0.00
4       898   897     889±3    4.0   4.0     4.0±0.003   -0.61  -0.64   -0.59±0.07   0.00
5       862   862     843±8    5.0   5.0     5.0±0.01    -0.27  -0.29   -0.21±0.10   0.00
6       805   808     769±18   6.0   6.0     6.0±0.01    0.06   0.05    0.17±0.12    0.00
7       729   728     668±42   7.1   7.1     7.0±0.03    0.33   0.36    0.48±0.12    0.00
8       655   655     528±96   8.0   8.0     8.1±0.1     0.51   0.53    0.70±0.10    6.20
9       511   499     347±131  9.0   9.1     9.1±0.2     0.71   0.75    0.84±0.08    57.10
best    254   267     319±128  9.8   9.7     8.9±0.6     0.88   0.89    0.86±0.07

Resolved category (“Best”, 4b background)

        Signal events           Signal significance       Cutoff point
Target  Mean  Median  Raw       Mean  Median  Raw         Mean   Median  Raw          % failed
1       2146  2146    2146      1.1   1.1     1.1         -1.00  -1.00   -1.00        0.00
2       2036  2036    2008±6    2.0   2.0     2.0±0.0003  -0.78  -0.80   -0.77±0.02   0.00
3       1894  1894    1842±11   3.0   3.0     3.0±0.001   -0.49  -0.52   -0.44±0.04   0.00
4       1727  1726    1628±24   4.0   4.0     4.0±0.003   -0.17  -0.18   -0.04±0.06   0.00
5       1473  1474    1287±71   5.0   5.0     5.0±0.01    0.20   0.21    0.39±0.07    1.00
6       1046  1048    662±173   6.0   6.0     6.0±0.04    0.56   0.58    0.76±0.06    74.10
7       599   610     −         7.1   7.0     −           0.76   0.78    −            100.00
best    459   459     672±176   7.1   7.2     5.8±0.3     0.80   0.82    0.75±0.07

Table 12: Combined model results (mean and median) compared with the raw results for various target signal significances, using the “Best” configuration and including just the 4b background. The raw results include the variance, given as the size of a single standard deviation. The rightmost column gives the percentage of models that failed to meet the target significance and were excluded; a dash marks entries where all models failed. The last row gives the best achievable significance, including all models.

Perhaps one of the parameters greatly impacts the predictability of the events without adding much to the signal significance, or perhaps the Intermediate and Resolved categories could be split to separate the "easy" events from the hard ones. Being able to utilize the extra signal events retained with these parameters without adversely affecting model performance could potentially enhance performance greatly.

6.2 Performance: 3M versus 30M

When examining the results for just the 4b background, it's important to keep in mind the large, unexplained discrepancy that was observed between the 3 million event files and the 30 million event files. While I normalized the 30 million event files to the same weight, the number of 4b events retained was much smaller than expected for the 30 million event files, and the number of 2b2j events retained much larger. Taken together, the signal significance after the cut-based analysis was relatively unaffected, but it seems likely that the performance of the models focusing only on the 4b background was enhanced. Which of the two sets of event files more accurately represents reality is unclear to me.


Mean (“Stock”, both backgrounds)

        Signal events            Signal significance       Cutoff point
Target  Combined  B    I    R     Combined  B    I    R    B      I      R
0       2167      410  290  1467  0.5       0.5  0.3  0.3  -1.00  -1.00  -1.00
1       2069      408  287  1374  1.0       0.7  0.5  0.7  -0.78  -0.76  -0.77
2       1766      387  272  1106  2.0       0.8  1.2  1.5  -0.12  -0.33  0.03
3       371       138  220  14    3.0       6.4  1.8  2.8  0.91   0.38   0.93
4       331       138  180  14    4.3       6.4  2.5  2.8  0.91   0.63   0.93
5       293       138  141  14    5.1       6.4  2.6  2.8  0.91   0.77   0.93
6       281       138  129  14    7.6       6.4  4.4  2.8  0.91   0.80   0.93
7       281       138  129  14    7.6       6.4  4.4  2.8  0.91   0.80   0.93
8       277       138  126  14    8.1       6.4  4.8  2.8  0.91   0.81   0.93
best    256       126  118  13    8.7       6.5  5.4  3.2  0.92   0.83   0.94

Median (“Stock”, both backgrounds)

        Signal events            Signal significance       Cutoff point
Target  Combined  B    I    R     Combined  B    I    R    B      I      R
0       2167      410  290  1467  0.5       0.5  0.3  0.3  -1.00  -1.00  -1.00
1       2069      408  287  1375  1.0       0.7  0.5  0.7  -0.83  -0.80  -0.80
2       1764      387  272  1105  2.0       0.8  1.2  1.5  -0.13  -0.35  0.04
3       385       146  222  18    3.0       6.4  1.8  4.1  0.92   0.39   0.95
4       343       146  179  18    4.4       6.4  2.4  4.1  0.92   0.67   0.95
5       311       146  148  18    5.1       6.4  2.6  4.1  0.92   0.78   0.95
6       292       146  129  18    7.8       6.4  4.3  4.1  0.92   0.83   0.95
7       292       146  129  18    7.8       6.4  4.3  4.1  0.92   0.83   0.95
8       289       146  125  18    8.1       6.4  4.6  4.1  0.92   0.83   0.95
9       107       46   43   18    9.8       6.4  6.2  4.1  0.97   0.94   0.95
best    107       46   43   18    9.9       6.4  6.2  4.1  0.97   0.94   0.95

Mean (“Best”, both backgrounds)

        Signal events            Signal significance       Cutoff point
Target  Combined  B    I    R     Combined  B    I    R    B      I      R
0       3489      424  918  2146  0.7       0.5  0.6  0.4  -1.00  -1.00  -1.00
1       3403      424  915  2063  1.0       0.5  0.8  0.7  -1.00  -0.90  -0.86
2       2883      419  895  1569  2.1       0.8  1.0  1.7  -0.66  -0.66  -0.06
3       1835      373  569  893   3.0       1.1  1.7  2.3  0.19   0.52   0.54
4       174       132  33   9     7.6       6.1  5.2  2.7  0.91   0.93   0.90
5       174       132  33   9     7.6       6.1  5.2  2.7  0.91   0.93   0.90
6       174       132  33   9     7.6       6.1  5.2  2.7  0.91   0.93   0.90
7       174       132  33   9     7.6       6.1  5.2  2.7  0.91   0.93   0.90
best    169       127  33   9     7.9       6.3  5.2  2.7  0.91   0.93   0.90

Median (“Best”, both backgrounds)

        Signal events            Signal significance       Cutoff point
Target  Combined  B    I    R     Combined  B    I    R    B      I      R
0       3489      424  918  2146  0.7       0.5  0.6  0.4  -1.00  -1.00  -1.00
1       3394      424  918  2052  1.0       0.5  0.6  0.7  -1.00  -1.00  -0.87
2       2882      419  893  1570  2.1       0.8  1.0  1.7  -0.71  -0.70  -0.06
3       1826      374  543  908   3.0       1.1  1.7  2.3  0.21   0.60   0.56
4       180       139  39   2     7.5       6.1  5.7  2.5  0.92   0.95   0.93
5       180       139  39   2     7.5       6.1  5.7  2.5  0.92   0.95   0.93
6       180       139  39   2     7.5       6.1  5.7  2.5  0.92   0.95   0.93
7       180       139  39   2     7.5       6.1  5.7  2.5  0.92   0.95   0.93
8       33        10   21   2     8.7       6.4  6.5  2.5  0.98   0.96   0.93
best    33        10   21   2     8.7       6.4  6.5  2.5  0.98   0.96   0.93

Table 13: Combined results using the mean and median predictions for both configurations, including both backgrounds. B, I and R represent the contributions from the Boosted, Intermediate and Resolved categories respectively. A target of 0 is included for reference, representing the results of the cut-based analysis.


6.3 MVA enhancements

The MVA could be enhanced in several ways. Firstly, for this analysis I used ZCA whitening as described, but did not explore other options. Kessy et al. [18] suggest two optimized variants of ZCA and PCA whitening, which might enhance the networks’ learning ability. Secondly, the criterion used during the training process does not truly match the signal significance, and my notion of an adjusted signal significance may not be ideal. In particular, striving for a particular target signal significance and optimizing the number of signal events retained might perform better, though care would have to be taken to prefer higher signal significance until the target is reached. There are also various advanced neural network training algorithms to explore such as Adam [19] or Fista [12]. Finally, an alternative to neural networks was recently proposed, dubbed gcForest [22] and said to require less fine tuning.

6.4 Improved combination

Another area in which performance might be enhanced is in combining the results. While the mean and median performed well, they are hardly the most advanced combinations one might consider. One option would be to see if weighting the predictions using a bell curve or similar improves results. For a Gaussian distribution centered on the middle prediction, letting the standard deviation go to infinity yields the mean, whereas letting it go to zero yields the median (however, the Gaussian is hardly the only distribution function with this property). Another option would be to train a neural network on the predictions, though care would have to be taken to avoid overfitting. Another simpler option that I did not explore is to save the predictions for the test set as well, and use the models’ performance on those to set the weight of their predictions for the cross set. Either way, it seems likely that more can be done in this area.

6.5 Usefulness of results

Finally, as mentioned I used the exact peaks of the signal significance as cutoff points, which makes these cutoff points less useful in practice. For maximum utility, cutoff points should be selected that optimize the signal significance and the number of signal events retained for other datasets as well. In addition, in this report I did not take pileup into account. Behr et al [13] report that pileup can be subtracted in a preprocessing step and does not qualitatively affect further analysis; however, quantities such as the number of retained signal events may be affected.


Signal events Signal significance Cutoff point

Target Combined B I R Combined B I R B I R

0 2167 410 290 1467 1.1 2.2 1.1 0.8 -1.00 -1.00 -1.00 3 2020 408 286 1326 3.0 3.1 1.6 2.1 -0.73 -0.66 -0.55 4 1922 407 283 1233 4.0 3.4 1.9 2.8 -0.60 -0.52 -0.21 5 1809 405 272 1132 5.0 3.6 2.4 3.5 -0.53 -0.18 0.10 6 1672 396 268 1007 6.0 4.3 2.5 4.2 -0.18 -0.10 0.37 7 1518 385 248 885 7.0 5.1 2.9 4.8 0.06 0.22 0.55 8 1314 374 198 742 8.0 5.6 4.7 5.1 0.24 0.63 0.69 9 1141 358 202 581 9.0 6.4 4.6 5.5 0.40 0.60 0.79 10 916 358 200 357 10.0 6.4 4.7 6.1 0.40 0.62 0.87 11 829 298 186 346 11.0 8.5 5.0 6.2 0.72 0.68 0.88 12 675 303 119 253 12.0 8.2 5.4 7.0 0.70 0.87 0.90 13 472 221 81 170 13.0 8.9 6.0 7.4 0.87 0.92 0.92 best 462 221 73 169 13.0 8.9 6.2 7.4 0.87 0.93 0.92

Median (“Stock”, 4b background)

Signal events Signal significance Cutoff point

Target Combined B I R Combined B I R B I R

0 2167 410 290 1467 1.1 2.2 1.1 0.8 -1.00 -1.00 -1.00 3 2019 408 287 1324 3.0 3.1 1.6 2.1 -0.80 -0.74 -0.57 4 1920 406 282 1233 4.0 3.5 1.9 2.8 -0.62 -0.55 -0.22 5 1810 404 276 1131 5.0 3.7 2.2 3.5 -0.52 -0.31 0.11 6 1672 395 271 1006 6.0 4.5 2.4 4.2 -0.15 -0.18 0.39 7 1519 385 249 885 7.0 5.1 2.9 4.8 0.07 0.24 0.57 8 1318 363 204 751 8.0 6.2 4.3 5.1 0.40 0.63 0.70 9 1143 363 208 571 9.0 6.2 4.1 5.7 0.40 0.61 0.81 10 915 354 192 370 10.0 6.5 4.9 6.0 0.49 0.69 0.88 11 791 297 160 334 11.0 8.4 5.0 6.2 0.77 0.80 0.89 12 707 294 160 253 12.0 8.5 5.0 7.1 0.78 0.80 0.91 best 466 200 77 189 12.6 8.7 6.0 7.3 0.92 0.93 0.93

Mean (“Best”, 4b background)

               Signal events            Signal significance       Cutoff point
Target   Combined    B    I     R    Combined    B    I    R       B      I      R
0            3489  424  918  2146        1.7   2.2  2.5  1.1   -1.00  -1.00  -1.00
5            3180  420  906  1854        5.0   3.6  3.7  3.2   -0.59  -0.70  -0.41
6            3057  420  893  1744        6.0   3.6  4.2  3.9   -0.59  -0.54  -0.20
7            2905  414  861  1629        7.0   4.1  5.1  4.5   -0.34  -0.26  -0.01
8            2706  404  842  1460        8.0   4.9  5.4  5.1   -0.07  -0.13   0.22
9            2452  397  799  1256        9.0   5.3  6.2  5.5    0.05   0.08   0.42
10           2209  380  727  1102       10.0   6.1  7.2  5.9    0.29   0.34   0.53
11           1929  365  716   848       11.0   6.7  7.4  6.3    0.42   0.37   0.67
12           1741  365  655   720       12.0   6.7  8.0  6.7    0.42   0.51   0.72
13           1537  339  599   599       13.0   7.4  8.7  7.1    0.58   0.60   0.76
best         1085  225  401   459       13.5   9.2  9.7  7.1    0.87   0.81   0.80

Median (“Best”, 4b background)

               Signal events            Signal significance       Cutoff point
Target   Combined    B    I     R    Combined    B    I    R       B      I      R
0            3489  424  918  2146        1.7   2.2  2.5  1.1   -1.00  -1.00  -1.00
5            3178  420  905  1854        5.0   3.7  3.7  3.2   -0.65  -0.73  -0.43
6            3057  420  898  1740        6.0   3.6  4.0  3.9   -0.66  -0.64  -0.21
7            2904  412  861  1631        7.0   4.2  5.1  4.5   -0.34  -0.28  -0.01
8            2707  404  856  1446        8.0   4.9  5.2  5.1   -0.08  -0.25   0.24
9            2455  396  799  1260        9.0   5.3  6.2  5.5    0.08   0.09   0.43
10           2204  376  722  1106       10.0   6.2  7.3  5.9    0.38   0.37   0.55
11           1925  368  709   849       11.0   6.6  7.4  6.2    0.45   0.41   0.69
12           1744  368  662   714       12.0   6.6  7.9  6.8    0.45   0.52   0.74
13           1520  309  603   608       13.0   8.7  8.5  7.1    0.75   0.62   0.78
best         1152  298  395   459       13.9   8.8  9.6  7.2    0.78   0.83   0.82

Table 14: Combined results using the mean and median predictions for both configurations, including only the 4b background. B, I and R denote the contributions from the Boosted, Intermediate and Resolved categories respectively. Due to the brute-force nature of these combinations, only the higher target significances are included. A target of 0 is included for reference, representing the results of the cut-based analysis.


7 Conclusion

In this paper I examined the analysis of di-Higgs events in detail. I started out by reevaluating the cut-based analysis proposed in Behr et al. [13], finding new variables that greatly enhance signal acceptance; I then described a neural network-based multivariate analysis that dynamically separates the dataset into three different roles. Finally, I showed how combining the results of multiple, separately trained neural networks can greatly stabilize and somewhat enhance the models’ predictive capability. As described in the previous section, many aspects could still be refined, but the present analysis should offer a systematic path toward even better models. In closing, one more point of data: using only the 4b background, a signal significance of 5 seems just out of reach at the current LHC, but aiming for a detection-level signal significance of 3 we might be able to keep upwards of 200 signal events. After the upcoming high-luminosity upgrade, even with both backgrounds included the discovery-level signal significance of 5 appears to be within reach.

Note: The source code used for this project is available on request.

8 Acknowledgments

I would like to express my appreciation to Dr Nathan Hartland for providing the basic framework used in this project, for supplying the large event files, and for his patience in answering my many questions. I’d also like to thank Dr Juan Rojo for his patience with the delays I encountered and for allowing me to finish this project regardless.

The primary software libraries used in this project were as follows:

1. For the cut-based analysis: I extended a “Basic HH4b analysis” [16] that relies on HepMC [1], FastJet [5], FastJet Contrib [6] and PYTHIA 8.2 [7] to function.

2. For the neural network analysis: Torch [8] with the unsup package [2] (only for ZCA whitening), Torch for Windows [10], using the nn [11], cutorch [3] and cunn [9] packages.

3. For combining the predictions: The ‘double-double’ type [20] and the Eigen C++ template library [4].

References

[1] HepMC, Jun 2012. Version 2.06.09. URL: http://hepmc.web.cern.ch/hepmc/.

[2] UNSUP: Some unsupervised learning modules using Torch, Feb 2016. Commit 1d4632e716dc3c82feecc7dd4b22549df442859f, included in [8]. URL: https://github.com/koraykv/unsup.

[3] cutorch: A CUDA backend for Torch7, Apr 2017. Commit 8312fd1896b73a598e9951cf9d5f25677cae452c, included in [10]. URL: https://github.com/torch/cutorch.

[4] Eigen, Jun 2017. Version 3.3.4. URL: http://eigen.tuxfamily.org.

[5] FastJet, Jul 2017. Version 3.3.0. URL: http://fastjet.fr.

[6] FastJet Contrib, Jan 2017. Version 1.026. URL: http://fastjet.hepforge.org/contrib/.

[7] PYTHIA 8.2, Apr 2017. Version 8.226. URL: http://home.thep.lu.se/~torbjorn/Pythia.html.

[8] Torch: A scientific computing framework for LuaJIT, Jul 2017. Commit 5beb83c46e91abd273c192a3fa782b62217072a6. URL: https://github.com/torch/torch7.

[9] Torch: CUDA backend for the Neural Network Package, Apr 2017. Commit 536f41ad8044ec61afaab9045ab8c84a4137514b, included in [10]. URL: https://github.com/torch/cunn.

[10] Torch installation in a self-contained folder for Windows with MSVC, Apr 2017. Commit ed2b0f48a9f3b4aa47ec5fab5abcabcedac4f97d. URL: https://github.com/BTNC/distro-win.

[11] Torch: Neural Network Package, Apr 2017. Commit 97df28724a3000d88362295e747bd3c0cea813fc, included in [10]. URL: https://github.com/torch/nn.

[12] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, Mar 2009. doi:10.1137/080716542.

[13] J.K. Behr, D. Bortoletto, J.A. Frost, N.P. Hartland, C. Issever, and J. Rojo. Boosting Higgs pair production in the bb̄bb̄ final state with multivariate techniques. The European Physical Journal C, 76:386, Jul 2016. doi:10.1140/epjc/s10052-016-4215-5.

[14] S. Bityukov and N.V. Krasnikov. New physics discovery potential in future experiments. Modern Physics Letters A, 13:3235–3249, Dec 1998. doi:10.1142/S0217732398003442.

[15] R.S. Gupta, H. Rzehak, and J.D. Wells. How well do we need to measure the Higgs boson mass and self-coupling? Physical Review D, 88:055024, Sep 2013. doi:10.1103/PhysRevD.88.055024.

[16] N. Hartland. Basic HH4b analysis, Mar 2017. Provided via email for personal use.

[17] B.L. Kalman and S.C. Kwasny. Why tanh: Choosing a sigmoidal function. In [Proceedings 1992] IJCNN International Joint Conference on Neural Networks, volume 4, pages 578–581, Jun 1992. doi:10.1109/IJCNN.1992.227257.

[18] A. Kessy, A. Lewin, and K. Strimmer. Optimal whitening and decorrelation. The American Statistician, to appear, Dec 2015. doi:10.1080/00031305.2016.1277159.

[19] D.P. Kingma and J. Ba. Adam: A method for stochastic optimization. Computing Research Repository, Dec 2014. URL: http://arxiv.org/abs/1412.6980.

[20] D. Pfeffer. The ‘double-double’ type, Feb 2016. Version 1.1.3. URL: https://www.codeproject.com/Articles/884606/The-double-double-type.

[21] A.J. Walker. An efficient method for generating discrete random variables with general distributions. ACM Transactions on Mathematical Software, 3:253–256, Sep 1977. doi:10.1145/355744.355749.

[22] Z. Zhou and J. Feng. Deep forest: Towards an alternative to deep neural networks. Computing Research Repository, Feb 2017. URL: http://arxiv.org/abs/1702.08835.
