
A three-dimensional profile of nuclear structure functions with neural networks

Report Bachelorproject Physics and Astronomy (15 EC)

Author: Marco Bout (10745696)

Daily Supervisor: Rabah Abdul Khalek
Supervisor: Dr. Juan Rojo

Second Examiner: Prof. Dr. Piet J. Mulders

Institute: Nikhef - Nationaal Instituut voor Subatomaire Fysica
University: Universiteit van Amsterdam - Faculteit der Natuurwetenschappen, Wiskunde en Informatica

Project performed from the third of April to the first of August 2018

Abstract

By using Deep Inelastic Scattering (DIS) experiments, the partonic structure of a nucleus can be probed. In order to determine the nuclear effects on the nuclear structure, the dependence on the momentum fraction (Bjorken) x, the energy scale $Q^2$ and the nuclear mass number A needs to be examined. Since this dependence is non-perturbative, the structure function cannot be computed analytically and needs to be fitted numerically. This report describes artificial neural networks and the way they can be used to perform this fit. After an explanation of how neural networks function, an example fit is shown to validate the neural net written for this project. This is followed by several fits of the nuclear structure function, along with a discussion of the results. While this report does not provide an accurate representation of the data, it does provide a guide to the inner workings of an artificial neural network, and with the help of library functions such results can be obtained.


Popular Science Summary (translated from Dutch)

In the early sixties, Quantum ElectroDynamics (QED) produced increasingly precise, experimentally verifiable descriptions of electromagnetic interactions. For the strong nuclear force such a description was still lacking.

A model that became more and more accepted was the "quark" model of Gell-Mann. This model postulated that a nucleon consists of three "quarks". At first this was seen as a purely mathematical model, but later Deep Inelastic Scattering (DIS) experiments showed that this model did not reflect the whole picture.

The data show that nucleons consist of the three postulated quarks and the gluons that hold them together, but at higher energies also of virtual quarks. By using DIS, information can be obtained about the interactions inside the nucleons, and in this way a picture can be sketched of the nuclear structure of the particles.

However, since the structure depends on variables that cannot be computed perturbatively, the structure has to be approximated numerically. One way to do that is by using Artificial Neural Networks (ANNs), a machine learning algorithm that does not require the degrees of freedom and the number of parameters to be fixed in advance.

By showing that an ANN can approximate a simple function, the concept of the ANN is validated. After this, an attempt is made to approximate the structure function with the ANN, and the results are discussed.


Contents

1 Introduction
  1.1 Motivation for study
  1.2 Theoretical Background
    1.2.1 Quantum ChromoDynamics (QCD)
    1.2.2 Deep Inelastic Scattering (DIS)
    1.2.3 Parton Distribution Functions (PDFs)
2 Neural Net
  2.1 What is an Artificial Neural Network?
  2.2 Technical details
    2.2.1 Forward propagation
    2.2.2 Cost calculation
    2.2.3 Error Calculation
    2.2.4 Weight modification
3 Results
  3.1 Validation of the Neural Net
  3.2 Applying the Neural Net to the structure function
4 Conclusion


1 Introduction

1.1 Motivation for study

The main motivation behind this project is to learn how to write an Artificial Neural Network and to use it to produce a fit of a nuclear structure function. This fit can then be used to infer the nuclear effects at the level of the structure function.

One of the benefits of using a neural network is that it provides a parametrization whose shape is free from a theoretical bias. Normally when parametrizing a function, its shape and degrees of freedom need to be determined at the start. However, in a neural network these are decided by the network itself.

Another benefit is that it produces a fast parametrization that can be evaluated on many kinds of machines. By writing my own neural network I can show how it functions, how it is used, and give some insight into the viability of using artificial neural networks as a way of fitting a nuclear structure function.

1.2 Theoretical Background

1.2.1 Quantum ChromoDynamics (QCD)

In the early sixties, Quantum ElectroDynamics (QED) produced increasingly precise predictions for the electromagnetic interactions of leptons, which could be verified experimentally. However, no equally successful description of the strong nuclear force existed yet.

An approach that was gaining traction and providing experimentally verifiable predictions was the quark model of Gell-Mann [1]. The model postulates that a nucleon consists of three "quarks". At first this was thought of as simply a mathematical model by most particle physicists, including its creator.

However, later Deep Inelastic Scattering (DIS) experiments provided unexpected results. Whereas at low energies the cross sections were characterized by baryon resonance production, the behaviour at large energies seemed to indicate that the nucleon consists of three non-interacting "partons" [2].

This was later expanded into Quantum ChromoDynamics (QCD), which states that a quark carries one of three "colors" and that the strong interaction has "gluons" as its charge carriers. A fundamental difference between QED and QCD is that QED's charge carrier (the photon) is electrically neutral, whereas QCD's charge carrier (the gluon) itself carries color charge.


1.2.2 Deep Inelastic Scattering (DIS)

Deep Inelastic Scattering (DIS) is a process in which a lepton (an electron, muon or neutrino; leptons do not undergo strong interactions) interacts with a hadron (a baryon or meson; hadrons consist of quarks held together by the strong nuclear force) by exchanging a gauge boson.

In this process the lepton is deflected off the hadron (scattering). Due to the interaction, a quark is separated from the hadron. However, because QCD does not allow an isolated color charge (color confinement), new quarks and thereby new hadrons form, while the target absorbs kinetic energy from the lepton (inelastic).

Due to the high energy of the lepton, it has a short wavelength, which allows it to penetrate deep into the hadron.

Figure 1: A lepton $l$ with momentum $k_1$ scatters off a nucleon $N$, exchanging an electroweak gauge boson $V$, resulting in a lepton $l'$ with momentum $k_2 = k_1 - q$ and an ensemble of hadrons $X$: $k_1 + p_1 \to k_2 + p_2$. Here $p_1$ is the momentum of the incoming nucleon and $p_2$ the momentum of the outgoing hadrons [3].

By measuring the scattering cross-sections, the nuclear structure function $F_2$ can be determined as a function of the variables $x$ (Bjorken $x$) and $Q^2 = -q^2$.


1.2.3 Parton Distribution Functions (PDFs)

The name "parton" was proposed by Richard Feynman in 1969 as a generic description for any particle constituent within the proton, neutron and other hadrons. These particles are referred to today as quarks and gluons.

However, experimental data from DIS experiments indicate that at high energies baryons like protons do not contain only the three quarks that give the particle its properties (valence quarks), but also a sea of virtual quark-antiquark pairs generated by the gluons holding the particle together [4].

Because of this, a nucleus with a high momentum can be seen as a stream of partons with momenta collinear to the nucleus. Not considering the spin directions, the momentum distribution of the partons is called the Parton Distribution Function (PDF).

The PDF is the probability density of finding a parton with fraction $x$ of the momentum at energy scale $Q^2 = -q^2$. DIS experiments have shown that at low $x$ the number of partons goes up and at high $x$ it goes down. Furthermore, at low $Q^2$ the valence quarks become more dominant, while at high $Q^2$ the number of sea quarks increases more and more, each carrying a low fraction $x$ of the momentum [5].

The dependence on $x$ is determined by non-perturbative dynamics and therefore cannot be calculated perturbatively. The dependence on $Q^2$, however, can be computed using QCD perturbation theory up to any given order. This dependence is determined by a series of integro-differential equations, known as the Dokshitzer-Gribov-Lipatov-Altarelli-Parisi (DGLAP) evolution equations:

$$Q^2 \frac{\partial}{\partial Q^2} f_i(x, Q^2) = \sum_j P_{ij}\!\left(x, \alpha_s(Q^2)\right) \otimes f_j(x, Q^2) \qquad (1)$$

Here, $P_{ij}(x, \alpha_s(Q^2))$ are the Altarelli-Parisi splitting functions, which can be computed in perturbation theory in the following way:

$$P_{ij}\!\left(x, \alpha_s(Q^2)\right) = \sum_{n=0} \left(\frac{\alpha_s(Q^2)}{2\pi}\right)^{n+1} P_{ij}^{(n)}(x) \qquad (2)$$

Furthermore, $\otimes$ denotes the convolution, where:

$$f(x) \otimes g(x) \equiv \int_x^1 \frac{dy}{y}\, f(y)\, g\!\left(\frac{x}{y}\right) \qquad (3)$$
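To make the convolution of equation (3) concrete, the sketch below evaluates it numerically with a simple trapezoidal rule. The functions f and g are smooth toy placeholders (not actual PDFs or splitting functions), and the quadrature is purely illustrative.

```python
import numpy as np

def mellin_convolution(f, g, x, n_points=2000):
    """Numerically evaluate (f ⊗ g)(x) = ∫_x^1 (dy/y) f(y) g(x/y)
    from equation (3) with a simple trapezoidal rule."""
    y = np.linspace(x, 1.0, n_points)
    integrand = f(y) * g(x / y) / y
    return np.trapz(integrand, y)

# Toy example with smooth placeholder functions (not real PDFs or
# splitting functions):
f = lambda y: y * (1.0 - y) ** 3
g = lambda z: 1.0 - z
print(mellin_convolution(f, g, x=0.1))
```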


2 Neural Net

2.1 What is an Artificial Neural Network?

An Artificial Neural Network (ANN) is a machine learning algorithm inspired by the function of biological neurons in a brain. An ANN consists of an input layer, an output layer and at least one hidden layer in between. Each layer consists of several "neurons" that each contain a value.

Figure 2: A graphical representation of a neural network with two hidden layers, each containing 4 nodes [6].

As can be seen in figure 2, each neuron in a layer is connected to every neuron in the next layer. These connections have weights that modify how much the current neuron's value influences the next neuron's value. Furthermore, each neuron has a corresponding bias that shifts the weighted value of the neuron being calculated.

An ANN functions analogously to a function, with at least one input and one output value. The input and output values can be represented in vector form. As an example, consider the function $f(x, y, z) = [a, b]$.

In this example, the input variables would be x, y and z, whereas the output values would be a and b. The neural net behaves like the function f: it takes the input vector and returns an output vector.


2.2 Technical details

The ANN algorithm described by this report focuses on four steps [7]. The first step is feeding the input arguments into the neural net and allowing the net to propagate these values forward through its layers, calculating the output values.

The second step is to calculate the ”cost” of the output values, which is a measure of how close the output values generated by the neural network are to experimentally deduced values for the same input variables.

Thirdly, the algorithm calculates how much each weight and bias used in the forward propagation of the net influenced the error of the output values.

Finally, the network modifies each weight and bias by a tiny fraction proportional to their influence on the output error. This process is then repeated until the cost of the output has dropped to a satisfactory level.

The following subsections look at each step in more depth.

2.2.1 Forward propagation

For a neural net with one input variable $X$, one hidden layer with two neurons $n_1$ and $n_2$ and one output value $O$, the neural network can be represented mathematically in the following way:

$$n_1 = w_{11} X + b_{11}$$
$$n_2 = w_{12} X + b_{12}$$
$$O = (w_{21}\, n_1 + b_{21}) + (w_{22}\, n_2 + b_{22}) \qquad (4)$$

Here $w_{ln}$ is the weight and $b_{ln}$ the bias at layer $l$ for neuron $n$. These equations can also be represented using linear algebra:

$$n_l = w_l \cdot n_{l-1} + b_l \qquad (5)$$

Here the weights are represented by a two-dimensional matrix and the neurons and biases by a vector. This process is then repeated through however many layers are present in the neural network, to eventually result in an output vector.

This, however, would only allow a linear transformation of the input variables. To introduce some non-linearity, the neurons can be "activated": a non-linear "activation function" is applied to the value of each neuron:


$$n_l = \psi\!\left(w_l \cdot n_{l-1} + b_l\right) \qquad (6)$$

For clarity, the non-activated vector of neurons in layer $l$ will be written as $z_l$, and the activated vector will be written as $a_l \equiv \psi(z_l) = \psi(w_l \cdot a_{l-1} + b_l)$.

An activation function in the final (output) layer can be considered if the user desires an output within a specific range. For example, when the user desires to classify the data, the activation function can turn a continuous result into a discrete one.
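As an illustration of equations (4)-(6), a minimal NumPy sketch of the forward pass could look as follows. This is not the exact code used for the fits in this report; the weights in the example are random, untrained values, and the layer sizes simply match the small example network above.

```python
import numpy as np

def forward(x, weights, biases, activation=np.tanh):
    """Propagate an input vector through the network layer by layer,
    following a_l = psi(w_l · a_{l-1} + b_l) of equation (6).
    Returns the list of activated values a_l of every layer."""
    a = x
    activations = [a]
    for w, b in zip(weights, biases):
        z = w @ a + b     # un-activated neuron values z_l (equation 5)
        a = activation(z) # activated values a_l; the output-layer activation
                          # could be changed or dropped if a different output
                          # range is desired
        activations.append(a)
    return activations

# Example: one input, one hidden layer with two neurons, one output,
# using random untrained parameters.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(2, 1)), rng.normal(size=(1, 2))]
biases = [rng.normal(size=2), rng.normal(size=1)]
print(forward(np.array([0.5]), weights, biases)[-1])
```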

2.2.2 Cost calculation

In this step, the output values are compared with experimentally derived solutions for the same input variables. The cost is calculated using a "cost function" (or "loss function"). A cost function can take multiple shapes, depending on what the user wants the neural network to do. For this report the cost function used was the chi-squared:

$$C_i = \chi_i^2 = \left(\frac{O_i - E_i}{\sigma_i}\right)^2 \qquad (7)$$

where $O_i$ is the experimentally observed value and $E_i$ is the expected value calculated by the neural net for datapoint $i$, with $\sigma_i$ being the experimental error for value $O_i$.
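In code, the per-datapoint cost of equation (7) and its average over all datapoints (the quantity quoted with the figures in section 3) might be computed as in this sketch; the numbers are purely illustrative.

```python
import numpy as np

def chi2_cost(observed, expected, sigma):
    """Per-datapoint chi-squared of equation (7) and its average over all
    datapoints."""
    chi2_per_point = ((observed - expected) / sigma) ** 2
    return chi2_per_point, chi2_per_point.mean()

# Illustrative numbers only:
observed = np.array([1.02, 0.95, 1.10])   # experimental values O_i
expected = np.array([1.00, 1.00, 1.00])   # values produced by the net E_i
sigma = np.array([0.05, 0.05, 0.05])      # experimental errors sigma_i
per_point, average = chi2_cost(observed, expected, sigma)
print(per_point, average)
```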

The value of $\chi^2$ can be interpreted as an estimate of the quality of the fit produced by the neural net. For a value of $\chi^2 < 1$ the fit deviates from the observed datapoint by less than its experimental error. With such a value of $\chi^2$ the neural net is treating the datapoints as absolute while ignoring the error, thereby providing a fit of the given datapoints and not of the underlying relation. This is called "overfitting".

On the other hand, a value of $\chi^2 > 2$ means that the fitted value is more than $\sqrt{2}$ (about 1.4) standard deviations away from the observed datapoint, meaning that this value is unlikely to accurately represent the observed data. Assuming the experiment was conducted successfully, the results given by the neural network would be very unlikely to be measured in the experiment. This is called "underfitting".

2.2.3 Error Calculation

As mentioned in the introduction of section 2.2, the error of a specific weight or bias refers to how much that weight or bias influenced the cost of the neural net's output. This relation can be written as the change of the cost with respect to a specific weight or bias, respectively $\partial C_i/\partial w_l$ and $\partial C_i/\partial b_l$.

However, differentiating with respect to every weight matrix and bias vector directly results in many operations. Can this process be optimized? The cost is influenced by the output of an activated layer, which is influenced by the un-activated output of this layer, which in turn is influenced by the weights and biases of the layer. Therefore the relations can be expanded using the chain rule:

$$\frac{\partial C_i}{\partial w_l} = \frac{\partial C_i}{\partial a_i^l}\frac{\partial a_i^l}{\partial w_l} = \frac{\partial C_i}{\partial a_i^l}\frac{\partial a_i^l}{\partial z_i^l}\frac{\partial z_i^l}{\partial w_l}, \qquad \frac{\partial C_i}{\partial b_l} = \frac{\partial C_i}{\partial a_i^l}\frac{\partial a_i^l}{\partial b_l} = \frac{\partial C_i}{\partial a_i^l}\frac{\partial a_i^l}{\partial z_i^l}\frac{\partial z_i^l}{\partial b_l} \qquad (8)$$

Combined with equation (6), the last two parts of the expanded relation can be solved analytically:

$$\frac{\partial a_i^l}{\partial z_i^l}\frac{\partial z_i^l}{\partial w_l} = \psi'(z_i^l)\cdot\frac{\partial z_i^l}{\partial w_l} = \psi'(z_i^l)\cdot a_i^{l-1}, \qquad \frac{\partial a_i^l}{\partial z_i^l}\frac{\partial z_i^l}{\partial b_l} = \psi'(z_i^l)\cdot\frac{\partial z_i^l}{\partial b_l} = \psi'(z_i^l) \qquad (9)$$

The resulting equations now only depend on one partial derivative, $\partial C_i/\partial a_i^l$:

$$\frac{\partial C_i}{\partial w_l} = \frac{\partial C_i}{\partial a_i^l}\frac{\partial a_i^l}{\partial z_i^l}\frac{\partial z_i^l}{\partial w_l} = \frac{\partial C_i}{\partial a_i^l}\cdot\psi'(z_i^l)\cdot a_i^{l-1}, \qquad \frac{\partial C_i}{\partial b_l} = \frac{\partial C_i}{\partial a_i^l}\frac{\partial a_i^l}{\partial z_i^l}\frac{\partial z_i^l}{\partial b_l} = \frac{\partial C_i}{\partial a_i^l}\cdot\psi'(z_i^l) \qquad (10)$$

In the output layer, however, this remaining derivative can also be solved analytically. The cost function as given by equation (7) depends only on one variable: the activated output of the final layer, where $O_i = a_i^{l=L}$. Therefore, in the final layer:

$$\frac{\partial C_i}{\partial a_i^l} = \frac{\partial}{\partial a_i^l}\frac{(O_i - E_i)^2}{\sigma_i^2} = 2\,\frac{O_i - E_i}{\sigma_i^2} \qquad (11)$$

The change of the cost with respect to the weights and biases of the output layer can thus be calculated completely analytically, depending only on $z_i^L$ and $a_i^L$, which are computed numerically by the neural net. The error of the output layer is therefore fully determined; what remains is the error of the earlier layers.


Can the error in a layer perhaps be defined as an operation on the error of the next layer? Let us define the error of the neurons in layer $l$ for datapoint $i$ as $\delta_i^l$:

$$\delta_i^l \equiv \frac{\partial C_i}{\partial a_i^l}\frac{\partial a_i^l}{\partial z_i^l} = \frac{\partial C_i}{\partial a_i^l}\cdot\psi'(z_i^l) = \frac{\partial C_i}{\partial z_i^l} \qquad (12)$$

Furthermore, let us apply the chain rule in order to express this error in terms of the error of the next layer:

$$\delta_i^l = \frac{\partial C_i}{\partial z_i^l} = \frac{\partial C_i}{\partial z_i^{l+1}}\frac{\partial z_i^{l+1}}{\partial z_i^l} = \delta_i^{l+1}\cdot\frac{\partial z_i^{l+1}}{\partial z_i^l} \qquad (13)$$

Notice however that the following is true:

$$z_i^{l+1} = w_{l+1}\cdot\psi(z_i^l) + b_{l+1} \qquad (14)$$

Therefore:

$$\delta_i^l = \delta_i^{l+1}\cdot\frac{\partial z_i^{l+1}}{\partial z_i^l} = \left(\delta_i^{l+1}\cdot w_{l+1}\right)\odot\psi'(z_i^l) \qquad (15)$$

Now we can calculate the error in the final layer and use that value to calculate the error in the previous layers by propagating backwards through each layer. This is why the algorithm is called the 'backpropagation algorithm'.
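Put together, equations (10)-(15) can be sketched in a few lines of NumPy. This is a minimal illustration assuming a hyperbolic tangent activation, not the exact implementation used for the fits in this report.

```python
import numpy as np

def tanh_prime(z):
    """Derivative of the hyperbolic tangent activation."""
    return 1.0 - np.tanh(z) ** 2

def backprop(zs, activations, weights, dC_daL, psi_prime=tanh_prime):
    """Gradients of the cost with respect to every weight and bias,
    following equations (10)-(15).

    zs          : un-activated layer values z_l from the forward pass
    activations : activated values a_l (activations[0] is the input)
    weights     : weight matrices w_l
    dC_daL      : derivative of the cost with respect to the final output
    """
    n_layers = len(weights)
    grad_w = [None] * n_layers
    grad_b = [None] * n_layers

    # Error in the output layer: delta_L = dC/da_L * psi'(z_L)   (eq. 12)
    delta = dC_daL * psi_prime(zs[-1])
    grad_w[-1] = np.outer(delta, activations[-2])   # dC/dw_L   (eq. 10)
    grad_b[-1] = delta                              # dC/db_L   (eq. 10)

    # Propagate the error backwards through the hidden layers    (eq. 15)
    for l in range(n_layers - 2, -1, -1):
        delta = (weights[l + 1].T @ delta) * psi_prime(zs[l])
        grad_w[l] = np.outer(delta, activations[l])
        grad_b[l] = delta
    return grad_w, grad_b

# Tiny usage example (one input, two hidden neurons, one output) with
# random, untrained parameters and a tanh-activated output:
rng = np.random.default_rng(0)
weights = [rng.normal(size=(2, 1)), rng.normal(size=(1, 2))]
biases = [rng.normal(size=2), rng.normal(size=1)]
a = np.array([0.5])
zs, activations = [], [a]
for w, b in zip(weights, biases):
    z = w @ a + b
    a = np.tanh(z)
    zs.append(z)
    activations.append(a)
dC_daL = 2.0 * (a - 0.8)   # e.g. the derivative of a squared-error cost
grad_w, grad_b = backprop(zs, activations, weights, dC_daL)
print(grad_w[0].shape, grad_w[1].shape)
```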

2.2.4 Weight modification

After the neural net has provided a result and a measure of how far off the result is with respect to the experimental values, its weights and biases must be updated using this information. An algorithm to perform this is the gradient descent algorithm [8]. This algorithm treats the cost function as a surface in the parameter space of the weights and biases (visualized in three dimensions in figure 3).

It can be reasoned that there is a value of the weights and biases where the cost is lowest: the minimum. The gradient of the cost function with respect to the weights and biases determines in which direction the cost increases or decreases. By changing the weights and biases by a small step against this gradient, the cost decreases and the minimum can be approached. However, if a step is too large, the minimum can be overshot. On the other hand, if a step is too small, the descent might get stuck in a small pit of the cost surface: a local minimum.


Figure 3: An image to illustrate the parameter space in gradient descent. Note the visible local minima and maxima [9].

Since the size of the steps determines how fast the ANN 'learns', the value that sets the step size is called the 'learning rate' ($\eta$). By subtracting the gradient (multiplied by the learning rate) from the weights and biases, these parameters are moved closer to the values that minimize the cost, using the following updates for $n$ datapoints:

$$w_l = w_l - \frac{\eta}{n}\sum_{i=1}^{n} a_i^{l-1}\cdot\delta_i^l, \qquad b_l = b_l - \frac{\eta}{n}\sum_{i=1}^{n}\delta_i^l \qquad (16)$$
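In code, the update of equation (16) could be sketched as follows, assuming the per-datapoint gradients have already been obtained from backpropagation (for example with the sketch in the previous subsection).

```python
import numpy as np

def gradient_descent_step(weights, biases, per_point_gradients, learning_rate):
    """Apply the update of equation (16): subtract the learning rate times the
    gradient, averaged over the n datapoints in the batch.

    per_point_gradients is a list of (grad_w, grad_b) pairs, one pair per
    datapoint, e.g. as returned by the backprop sketch above."""
    n = len(per_point_gradients)
    for l in range(len(weights)):
        weights[l] -= learning_rate / n * sum(g_w[l] for g_w, _ in per_point_gradients)
        biases[l] -= learning_rate / n * sum(g_b[l] for _, g_b in per_point_gradients)
    return weights, biases

# Example with a single dummy gradient entry (shapes for a 1-2-1 network):
weights = [np.zeros((2, 1)), np.zeros((1, 2))]
biases = [np.zeros(2), np.zeros(1)]
dummy = [([np.ones((2, 1)), np.ones((1, 2))], [np.ones(2), np.ones(1)])]
gradient_descent_step(weights, biases, dummy, learning_rate=0.01)
print(weights[0])
```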


3 Results

3.1 Validation of the Neural Net

In order to validate the created neural network, it can be tested on a simple problem with a known solution. An example of that is trying to fit a smeared sine function.

First we state that there is a function that depends on $x$ like a sine function: $f(x) = \sin(x)$. This function represents a possible but unknown relation in physics. Then noise is added to the result, $y_{\mathrm{exp}} = \sin(x) + \mathrm{noise}(\sigma)$, representing the experimental data that might be measured when testing the physical relation. By letting the noise be normally distributed with standard deviation $\sigma$, the experimental uncertainty is simulated.

Now that there is simulated experimental data, the same inputs (x) can be fed into a neural network. If the neural network is functioning properly, the output values should look similar to a sine function, with an error of $1 \leq \chi^2 \leq 2$.

In order to validate the neural net written by me, I have trained the net on the smeared sine data multiple times while varying the number of datapoints, the initial learning rate, the value of $\sigma$ and the percentage of points used for training and testing of the network.
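For reference, the pseudo-data for such a validation run could be generated as in the sketch below; the 250 points, the range 0 to 2π and σ = 0.3 match the fits shown in figures 4 and 5, while the 80% training split is only an illustrative choice (the report varied this percentage).

```python
import numpy as np

# Generate the smeared-sine pseudo-data: y_exp = sin(x) + Gaussian noise.
rng = np.random.default_rng(42)
n_points, sigma = 250, 0.3                 # values used for figures 4 and 5
x = np.linspace(0.0, 2.0 * np.pi, n_points)
y_exp = np.sin(x) + rng.normal(scale=sigma, size=n_points)

# Random split into training and test points; the 80% fraction is only an
# illustrative choice.
train_mask = rng.random(n_points) < 0.8
x_train, y_train = x[train_mask], y_exp[train_mask]
x_test, y_test = x[~train_mask], y_exp[~train_mask]
print(len(x_train), len(x_test))
```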


Figure 4: A fit of a smeared sine wave for a neural net with one hidden layer and one node. The neural net was fed 250 points between 0 and $2\pi$ with a $\sigma$ value of 0.3. The lower graph shows the value of $\chi^2$ against the number of epochs the net had to learn. The value of $\chi^2$ is the average over all datapoints.

As can be seen in figure 4, the neural net generates a shape that seems somewhat similar to the desired sine wave. It is, however, flat outside of the center of the function (around $x = \pi$). Since this neural net only possessed a single node in a single hidden layer, the only non-linearity that the net can produce is in the shape of the used activation function (in this case the hyperbolic tangent).


Figure 5: A fit of a smeared sine wave for a neural net with one hidden layer and two nodes. The neural net was fed 250 points between 0 and $2\pi$ with a $\sigma$ value of 0.3. The lower graph shows the value of $\chi^2$ against the number of epochs the net had to learn. Note: the neural net was set to stop training at $\chi^2 = 1.2$, since this was close enough to illustrate its purpose while the net came close to a local minimum. The value of $\chi^2$ is the average over all datapoints.

In figure 5 the neural net was fed the same number of datapoints within the same range as in figure 4; however, this net consisted of two nodes. It can be seen from the figure that not only was the neural net more successful in representing the data (as shown by the low value of $\chi^2$), but it also reached this point in less time than the previous neural net ran in total.

3.2 Applying the Neural Net to the structure function

Next, a set of 535 datapoints was fed into a neural net. Each of these contained an experimental measurement of a nuclear structure function $F_2^A(x, Q^2)$, along with the corresponding kinematics. The values of $x$ and $Q^2$ were set as the input values of the neural net, as seen in equation (17):

$$F_2^A(x, A, Q^2) = \mathrm{ANN}(x, Q^2) \qquad (17)$$

This was done for different values of $A$ (the mass number of the nucleus), in order to study the nuclear effects on the structure of the nucleon.

Since the nuclear structure functions are always greater than 0, the activation function used by the neural net was set to the sigmoid function instead of the hyperbolic tangent, where:

$$\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}} \qquad (18)$$

The sigmoid function accepts any real input and returns a value between 0 and 1. This function was chosen because the smaller spread of output values it produces decreases the training time of the neural net.
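A minimal sketch of equation (18), illustrating that the output stays strictly between 0 and 1 for any real input:

```python
import numpy as np

def sigmoid(x):
    """Logistic (sigmoid) activation of equation (18)."""
    return 1.0 / (1.0 + np.exp(-x))

# The output stays strictly between 0 and 1 even for very large negative or
# positive inputs, matching the requirement that F2 is always positive.
print(sigmoid(np.array([-10.0, 0.0, 10.0])))
```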


Figure 6: Fit of $F_2^A(x, Q^2)$ for each ID value. This neural network consisted of 2 hidden layers with 75 nodes each and was run with a constant learning rate of 1.0E-8. In the bottom graph it can be seen that the value of $\chi^2$ plateaued at around 160. The value of $\chi^2$ is the average over all datapoints. The variations in the y-axis ($F_2^A$) reflect changes at the PDF level.

As can be seen in figure 6, the fit given by the neural network is still quite far off from the actual data. However, because the decrease in $\chi^2$ per epoch became small and kept shrinking every epoch, running the net for longer would not have brought it much closer to the desired fit.

This can be attributed to multiple causes. First of all, this slowdown is indicative of the gradient descent algorithm nearing a local minimum. To escape this minimum, the learning rate would have to be increased such that the descent "bounces out" of the minimum. However, I did not get the variation of the learning rate to work correctly in my program.

Another reason for the slowdown could be that the network does not have enough neurons or layers to correctly reproduce the shape given by the experimental data. Increasing this number does not necessarily improve the result, as can be seen in figure 7.


Figure 7: A fit for a neural net with 3 hidden layers, the first two with 75 neurons and the third with 100 neurons, made with a learning rate of 1E-8. The value of $\chi^2$ is the average over all datapoints. The variations in the y-axis ($F_2^A$) reflect changes at the PDF level.

A good way of interpreting how good a fit made by a neural net is, is to show the difference between the experimental and the neural net value of $F_2^A$ in units of the experimental standard deviation. A value of 2, for example, means that the neural net predicted the value about two standard deviations away from the data. An example of this can be seen in figure 8, made from the same data as figure 6.

Figure 8: A fit to show the difference between experimental and neural net calculated values in terms of experimental standard deviations. The better trained the neural net is, the closer the values should be to zero.

The graph in figure 8 already shows a clear trend towards a straight line; however, especially at the end, the values are many standard deviations away from the experimental values. The cause of this can be seen when looking at the end of figure 6.

In order to check whether the fit of the structure function can be improved by changing the variables fed into the function, I have replaced the dependence on the value of $Q^2$ by the value of $A$, as seen in equation (19). The fits can be seen in figures 9 and 10.


Figure 9: Fit of $F_2^A(x, A)$ for each ID value. This net was run with two hidden layers of 75 nodes with a learning rate of 1E-8. Note that when looking at the end of the lower graph it can be seen that the gradient descent algorithm has encountered a local minimum. The value of $\chi^2$ is the average over all datapoints.


Figure 10: The spread belonging to figure 9. This spread is still similar to that shown in figure 7.


4 Conclusion

While this report might not immediately answer whether or not the fitting of nuclear structure functions can be improved by artificial neural networks, because in all results a large spread was present between the experimental and the calculated values of $F_2^A$, it does show that a bachelor student, with no prior knowledge of either machine learning algorithms or the process of fitting structure functions, can still write a working neural net and understand its functionality.

A point to be made, however, is that a neural net need not be built from the ground up. While I can recommend doing so at least once in order to truly understand how the algorithm works, there already exist many different implementations openly accessible in libraries for different programming languages.
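As one example of such a library, a fit with a comparable architecture could be set up in a few lines with scikit-learn's MLPRegressor; the hyperparameters below are illustrative choices, and the random arrays merely stand in for the actual (x, Q^2) datapoints and measured F_2^A values.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Illustrative stand-in data: X would hold the (x, Q^2) input pairs and
# y the measured structure-function values F2^A.
rng = np.random.default_rng(0)
X = rng.random((535, 2))
y = rng.random(535)

# Two hidden layers of 75 nodes with a logistic (sigmoid) activation,
# mirroring the architecture of section 3.2; the learning rate and the
# number of iterations are illustrative.
net = MLPRegressor(hidden_layer_sizes=(75, 75), activation="logistic",
                   learning_rate_init=1e-3, max_iter=2000)
net.fit(X, y)
prediction = net.predict(X)
print(prediction[:5])
```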

Furthermore, I have used the values of $x$ and $Q^2$ or $A$ as input variables for my neural network. A different avenue of research could be to generate a separate neural net for every value of $A$ one wishes to test.

This approach would decrease the computational time, but would require the user to maintain several neural nets where previously only one was needed. Furthermore, it might lead to a loss of generality (each neural net would have a slightly different dependence on $x$ and $Q^2$). However, if one wants to produce a more accurate fit for a single value of $A$, this would perhaps be preferable.

5 Discussion

When looking at the fits and trying to determine why they are still far from the desired value of $\chi^2 = 1$, three things come to mind.

Firstly, progressive data IDs correspond to increasing nuclear mass values of A. As can be seen from the fits of figures 6, 7 and 9, each section shows a parabolic-like shape of different width. The neural net is trying to fit each of these shapes using a single set of calculations, meaning that the width of one shape might be interfering with the width of another one for the calculations done by the neural net. Perhaps by adding more hidden layers or, as mentioned in the conclusion, maintaining separate neural networks for every value of A these shapes can be more accurately fitted.

Secondly, at high values of $A$ (visible in the figures as a high data ID) the value of $F_2^A$ is a lot higher than at lower values of $A$. Since there are far fewer datapoints in this range, the neural net does not have much data to learn these results from. Furthermore, these results will have a high value of $\chi^2$, meaning they might influence the neural net disproportionately.

Lastly, most neural network implementations include a function that modifies the learning rate. By increasing the learning rate, local minima can be escaped and the number of epochs required for the net to provide an accurate fit would decrease. Lowering the learning rate would prevent the neural net from just bouncing around a minimum without approaching it, and would provide a more precise adjustment of the weights and biases when the cost is low.

The code written by me did not include this modification of the learning rate. I did try adding it; however, I did not manage to find a correct implementation within the timeframe I had for completing this project. To another researcher attempting to follow this report, I can advise looking into the use of a function that automatically modifies the learning rate, to see if this produces a more accurate fit of the data.
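As an indication of what such a function could look like, the sketch below implements a simple inverse-time decay of the learning rate; this is only one of many possible schedules, and not the scheme that was attempted for this report.

```python
def decayed_learning_rate(initial_rate, epoch, decay=1e-4):
    """A simple inverse-time decay: start with a comparatively large learning
    rate to move quickly early on, then shrink it so the descent can settle
    into a minimum instead of bouncing around it."""
    return initial_rate / (1.0 + decay * epoch)

# Example: the learning rate after 0, 10 000 and 100 000 epochs.
for epoch in (0, 10_000, 100_000):
    print(epoch, decayed_learning_rate(1e-6, epoch))
```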

References

[1] J.-M. Richard. “An introduction to the quark model”. In: ArXiv e-prints (May 2012). arXiv: 1205.4326 [hep-ph].

[2] G. Ecker. “Quantum Chromodynamics”. In: ArXiv High Energy Physics - Phenomenology e-prints (Apr. 2006). eprint: hep-ph/0604165.

[3] J. Blümlein. “The theory of deeply inelastic scattering”. In: Progress in Particle and Nuclear Physics 69 (Mar. 2013), pp. 28–84. doi: 10.1016/j.ppnp.2012.09.006. arXiv: 1208.6087 [hep-ph].

[4] J. Feltesse. “Introduction to Parton Distribution Functions”. In: Scholarpedia 5.11 (2010), p. 10160. Revision #91386. doi: 10.4249/scholarpedia.10160.

[5] J. Gao, L. Harland-Lang, and J. Rojo. “The structure of the proton in the LHC precision era”. In: Physics Reports 742 (May 2018), pp. 1–121. doi: 10.1016/j.physrep.2018.03.002. arXiv: 1709.04922 [hep-ph].

[6] Luke Dormehl. What is an artificial neural network? Here’s everything you need to know. 2018. url: https://www.digitaltrends.com/cool-tech/what-is-an-artificial-neural-network/.

[7] Michael Nielsen. Neural Networks and Deep Learning. Online, 2017.

[8] Niklas Donges. Gradient Descent in a Nutshell. 2018. url: https://machinelearning-blog.com/2018/02/28/gradient-descent/#more-435.

[9] Kasperfred. Why Do I Sometimes Get Better Accuracy With A Higher Learning Rate Gradient Descent? 2017. url: https://steemkr.com/programming/@kasperfred/why-do-i-sometimes-get-better-accuracy-with-a-higher-learning-rate.
