
UNIVERSITY OF AMSTERDAM

The parallelization of the finite-element model, a feasibility study

by

Zazo Meijs

A thesis submitted in partial fulfillment for the degree of Bachelor of Science

of the
Faculteit der Natuurwetenschappen, Wiskunde en Informatica, Universiteit van Amsterdam


The intersection between computational sciences and gpu architecture allows us to create new algorithms; to accelerate, to advance, to make easy the work for all industries

Jen-Hsun Huang (CEO and co-founder of NVIDIA)

Abstract

The parallelization of the finite-element model, a feasibility study — by Zazo Meijs

We have looked at the feasibility and possible advantages of parallelization of the finite-element model using a CUDA C implementation. We specifically looked at whether we could implement this for an existing bruise model. We have tested the speed gain from parallelization using multiple benchmarks, finding a speed improvement of up to a factor 22(4) for the GPU accelerated implementation compared to the sequential implementation in C. Furthermore, we benchmarked the effect of efficient memory allocation, finding a slowdown of a factor 8.26(28) for any application that wishes to make intermediate calculations on the CPU. We looked at integration of the model within Python, concluding that integration within a higher tier language is possible but only ensures a speed increase for large systems which undergo many time iterations. As C has a steep learning curve, we conclude that it is unwise to write any application in which users would need to change the C code to use the model. For actual implementation we recommend using GPU libraries like PyCUDA and the 'MATLAB Parallel Computing Toolbox' which are now being developed. Alternatively, the whole model could be rewritten in C, including the reading in of pictures, visualization of the results and a Graphic User Interface.

Contents

Abstract
1 Introduction
2 The model
  2.1 Biological background
  2.2 Finite-Element Model
  2.3 Time evolution of Hemoglobin and Bilirubin
    2.3.1 Darcy's law
    2.3.2 The first law of Fick
    2.3.3 Michaelis-Menten kinetics
    2.3.4 The used time step function
    2.3.5 Bilirubin
  2.4 Further implications
    2.4.1 Diagonal neighbours in the 2D case
    2.4.2 Used variables
3 Parallelization
  3.1 Peculiarities of C
  3.2 Graphic Processing Unit
  3.3 CUDA
4 Benchmarks
  4.1 System
  4.2 Time/Size Benchmark
  4.3 Memory allocation
  4.4 Python Integrated
  4.5 Errors
5 Results
  5.1 C compared to higher tier languages
  5.2 Size/Time Benchmarks
  5.3 Memory allocation
  5.4 Python Integrated
6 Discussion
  6.1 Differences for time evolution formula
  6.2 Benchmarks
    6.2.1 C compared to higher tier languages
    6.2.2 Time/Size Benchmarks
    6.2.3 Memory allocation
    6.2.4 Python integration
  6.3 Pitfalls
    6.3.1 Limitation on the amount of kernels
    6.3.2 Memory allocation
    6.3.3 Indexing bug
7 Conclusion and Recommendations
Acknowledgements
A 3D modelling on the graphics card
B Simple Benchmarks
  B.1 Python Benchmark
  B.2 C/CUDA Benchmark
C Complete GPU accelerated model


Dedicated to time, for in its endless flow all comes to pass


Chapter 1

Introduction

Driven by the insatiable market demand for real-time, high-definition 3D graphics, the programmable GPU (graphics processing unit) has evolved into a highly efficient processor. Over the last decade the theoretical output of GPUs has overtaken that of comparable CPUs [1]. As we wish to explore larger and more complex systems, it becomes necessary to harness this increased capability. With the introduction of CUDA by NVIDIA, the world's biggest GPU manufacturer, this has now become relatively easy.

The supervisor of this project, Maurice Aalders, previously supervised a PhD student who studied the time evolution of bruises using a finite-element model [2]. The model determines the age of bruises in cases of child abuse. Within the model there are many person-dependent variables. To determine the age of the bruise, the model takes as input two pictures taken a known time apart and tries multiple sets of variables to get the best fit. It then uses these variables to calculate the time of injury. As the precision we try to achieve increases, the computation time steeply ramps up because of the many iterations. This project therefore aims to give a clear analysis of whether using the GPU can lessen the time consumed by this model.

Within this thesis we will first explain the model in more detail, including the origin of the formulae used to govern the time development of the bruise (chapter 2). Then we will explain what it means to write a parallel program, how it is done and how we can apply this to the finite-element model (chapter 3). We want to test the speed of the parallel implementation compared to a sequential implementation; to this purpose we have used a set of benchmarks which are explained in chapter 4. The results of these benchmarks are displayed in chapter 5. In chapter 6 we will discuss these results, our deviations from the original model and some of the pitfalls and limitations of the parallel implementation. In the last chapter we convey our conclusions and give some recommendations for possible follow-up work (chapter 7).


Chapter 2

The model

In this chapter we will explain the three-dimensional model of bruise evolution for improved age determination. This model is used as an example for possible parallelization. It is based on the article published in 2010 by Barbara Stam, Martin J.C. van Gemert, T. G. van Leeuwen and Maurice C.G. Aalders [2].

Our implementation of the model only looks at the evolution of an in vivo blood pool (a bruise) for one set of fixed variables. With this we can give a general description of the time evolution of bruises. The actual application of the model is age determination, for example in cases of child abuse. This application is described in section 2.4.2.

2.1 Biological background

This section will give a rudimentary idea of how blood pools under the skin (bruises) behave, and which important biological concepts need to be taken into account for the model to be in accordance with physical reality.

Our information about blood and skin is closely based on the textbook by Martini, Nath and Bartholomew [3]. This book is also used for the Bachelor of Medicine of the University of Amsterdam. Blood is composed of both plasma and formed elements. The percentage of formed elements is often referred to as the hematocrit and normally lies between 37 and 54 percent. It also differs between men and women, as men on average have almost 10% more red blood cells per volume. For the observation of bruises the most important substrates are hemoglobin and bilirubin. Hemoglobin is the strongest contributor to how we perceive blood.

For the skin we only need to take the upper part into account, as the methods used to image the bruises have a limited depth of detection. A schematic representation of the skin is given in figure 2.1.

Figure 2.1: A schematic representation of the upper part of the skin

2.2 Finite-Element Model

To numerically compute the evolution of bruises we use a finite-element model. This means we split the area of interest into many small blocks which are assumed to be uniform within themselves. We will refer to these blocks as the elements of the sample. Any change of the complete system is then modeled as an interaction between these small elements. For our model we subdivide the skin into three layers. Each layer is subdivided into 100 by 100 elements, so the whole system consists of 30 000 elements. This subdivision is shown in figure 2.2. As we need to keep track of both hemoglobin and bilirubin, we are in essence keeping track of the sample twice: once where each element keeps track of the amount of hemoglobin within its volume, and once for the amount of bilirubin. Within finite-element models more precise interactions can often only be observed by using many elements, but as the number of elements increases, so does the computation time, as we will show in chapter 5.

Figure 2.2: For the finite-element model we subdivide the skin into three layers. Each layer is subdivided into 100 by 100 elements, so the whole system consists of 30 000 elements.
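To illustrate this subdivision, the sketch below shows one way such a sample could be laid out in memory. It is an assumption for illustration only, not the layout used in the thesis code; all names are hypothetical.

```c
/* A minimal sketch (assumed layout, not the thesis code): one flat array per
 * substance, with element (x, y, z) stored at index x + NX*(y + NY*z). */
#define NX 100
#define NY 100
#define NZ 3
#define N_ELEMENTS (NX * NY * NZ)   /* 30 000 elements per substance */

static int idx(int x, int y, int z)
{
    return x + NX * (y + NY * z);
}

double hemoglobin[N_ELEMENTS];      /* amount of hemoglobin per element          */
double bilirubin[N_ELEMENTS];       /* tracked separately: 60 000 values in total */
```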


2.3 Time evolution of Hemoglobin and Bilirubin

To come to the mathematical description of the time evolution of hemoglobin and bilirubin, we use approximations of standard convection and diffusion laws. In this section we introduce the equations used and try to substantiate the resulting time evolution. First we derive a formula for hemoglobin and then use this formula to substantiate the formula used for the time evolution of bilirubin. As our model is based on the results of B. Stam et al. [2], we will use the same form where possible. Even then our result differs slightly; these deviations are discussed in section 6.1.

2.3.1 Darcy’s law

Darcy's law was originally proposed by Darcy in 1856 as the result of a filtration experiment. In our model we use Darcy's law to describe the pressure-driven time evolution of our concentrations. In essence, Darcy's law states that the rate at which water filters through a porous surface is proportional to both the area of filtration and the pressure difference, yet inversely proportional to the thickness of the filter [4]. In equation form, Darcy's law is expressed as:

v = -\frac{1}{\mu\phi}\,A_0 \cdot \nabla p \qquad (2.1)

where µ is the thickness of the filter, A_0 is the area of filtration, φ is the porosity of the medium, ∇p is the pressure gradient and v (m/s) is the speed with which the fluid moves through the medium.

Darcy's law only holds for slow viscous flow [5]. But as we look at in vivo bleeding, where the whole cavity is filled with fluid, we expect the pressure gradient to always be small, so we made the assumption that we can use Darcy's law to describe the pressure-driven diffusion. For our model we redefine some parts of the original equation. As our whole sample is what we see as our filter, we define the thickness µ as the distance ∆x between two elements, or in other words the maximum length of medium the blood has to pass through in one time step. As we want the actual amount displaced in a time period, we write v as the total amount displaced divided by time, v = ∆C/∆t, where ∆C is the change in concentration. We replace the place-dependent porosity φ with a fixed constant K, in the form K = 1/φ. Finally we make the assumption that the gradient is linear and can be expressed as the difference between the pressures of the two interacting points, i.e. ∇p = ∆p. With these changes we can write:


\frac{\Delta C}{\Delta t} = -\frac{K A_0\,\Delta p}{\Delta x} \qquad (2.2)

2.3.2 The First law of Fick

In the theory of diffusion, Fick's first law is very similar to Darcy's law as explained in section 2.3.1. It states that the flux is proportional to the gradient of the transported quantity [6]. So the basic equation reads:

J = D_c\,\nabla C \qquad (2.3)

where J is the resulting diffusion flux, D_c is the diffusion constant and ∇C is the gradient of the concentration of the quantity. As with Darcy's law we want to know the displaced amount in a time step ∆t, so we need to redefine J as J = \frac{\Delta C}{A_0\,\Delta t}, with A_0 the area of diffusion. For the gradient of the concentration we once again assume linearity, giving us, for a defined time step ∆t,

\nabla C = \frac{\delta C}{\delta r} = \frac{C_i - C_{i+a}}{\Delta r} \qquad (2.4)

where C_i is the concentration in the element for which we are calculating the change and C_{i+a} is the concentration in a neighbouring element. Within the finite-element model we define ∆r as the distance between the centres of the two elements. Putting these changes together we get an expression for the concentration-driven diffusion for a time step ∆t:

\frac{\Delta C}{\Delta t} = D_c\,\frac{(C_i - C_{i+a})\,A_0}{\Delta r} \qquad (2.5)

2.3.3 Michaelis-Menten kinetics

Michaelis-Menten kinetics describe the reaction speed between substrates (S) and enzymes (E). They can be derived using the assumption that the total amount of substrate within the system is constant, be it in its original state, as enzyme bound to substrate (ES), or as part of the final product (P) [7]. If we assume that the reaction between substrate and enzyme is reversible, but that the created product does not react with the enzyme to return to substrate form, we can write:

E + S \underset{k_r}{\overset{k_f}{\rightleftharpoons}} ES \overset{k_{cat}}{\longrightarrow} E + P \qquad (2.6)

where we define the dimensionless rate constants k_f, k_r and k_{cat} for the respective reactions.

It is clear that the enzyme only facilitates the reaction but is not gained or lost as a result of it, so the total amount of enzyme in free plus combined form is constant and will be denoted by [E]_0. Furthermore, the assumption that the system does not lose nor gain anything from the environment gives clear constraints on the derivatives of all involved concentrations:

[E]_0 = [E] + [ES] \qquad (2.7)

\frac{d[E]}{dt} = -k_f[E][S] + k_r[ES] + k_{cat}[ES] \qquad (2.8)

\frac{d[S]}{dt} = -k_f[E][S] + k_r[ES] \qquad (2.9)

\frac{d[ES]}{dt} = k_f[E][S] - k_r[ES] - k_{cat}[ES] \qquad (2.10)

\frac{d[P]}{dt} = k_{cat}[ES] \qquad (2.11)

In their original 1913 article, Michaelis and Menten assume that the substrate reaction rate at any time is proportional to the concentration of the substrate-enzyme complex, and that the concentration of this complex at any instant is determined by the concentrations of enzyme and substrate [8]. This is shown in equation 2.12. Next we use equation 2.7 and simplify to get an expression for [ES]:

k_r[ES] = k_f[E][S] \qquad (2.12)

\Rightarrow [ES] = \frac{k_f}{k_r}\,([E]_0 - [ES])[S] \qquad (2.13)

\Rightarrow [ES] = \frac{k_f[E]_0[S]}{k_r + k_f[S]} \qquad (2.14)

We assume that the amount of substrate in the bound state ES is at any time negligible compared to the original amount of substrate. Instead of calculating the decrease in [S] we will calculate the increase in [P]. Often the speed of conversion is not limited by the amount of substrate but by the small amount of enzyme, so we define an enzyme-dependent maximum speed at which the reaction can occur as V_{max} = k_{cat}[E]_0.

\frac{d[P]}{dt} = k_{cat}[ES] \qquad (2.15)

= k_{cat}\,\frac{k_f[E]_0[S]}{k_r + k_f[S]} \qquad (2.16)

= \frac{V_{max}[S]}{\frac{k_r}{k_f} + [S]} \qquad (2.17)

where we define k^* = k_r / k_f.

2.3.4 The used time step function

In this section we combine the derived expressions for the pressure-driven diffusion (Darcy's law, section 2.3.1), the concentration-driven diffusion (the first law of Fick, section 2.3.2) and the change in concentration by enzymatic reactions (Michaelis-Menten kinetics, section 2.3.3).

This gives us the function:

\frac{\Delta C_{i,j,z}}{\Delta t} = C_{i,j,z\pm1}\,\frac{K A \Delta p}{\Delta x} - \sum_{\text{neighbours}} D_{Hb}\,\frac{(C_i - C_{i+a})\,A}{\Delta r} - \frac{V_{max}[S]}{k^* + [S]} \qquad (2.18)

2.3.5 Bilirubin

The processes used for bilirubin are very similar to those for hemoglobin. We neglect the pressure-driven part. The diffusion constant D_c is both substrate and medium dependent, so we now use the in vivo diffusion constant for bilirubin, which we denote as D_Bi. The enzymatic reaction as derived in section 2.3.3 gives the amount of hemoglobin conversion. Following Stam et al. we assume that all hemoglobin is converted into 4 parts of bilirubin, so for the bilirubin term we add 4 times the amount of hemoglobin subtracted in the corresponding element. The clearance of bilirubin happens through the lymphatic system; following Stam et al. we model this process as zero order reaction kinetics, with τ_Bi the suppressing variable.

\frac{\Delta C_{i,j,z}}{\Delta t} = -\sum_{\text{neighbours}} D_{Bi}\,\frac{(C_i - C_{i+a})\,A}{\Delta r} + 4\,\frac{V_{max}\,[C]_{Hb}}{k^* + [C]_{Hb}} - \frac{[C]_{Bi}}{\tau_{Bi}} \qquad (2.19)
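To make the update rule concrete, the sketch below shows how equations 2.18 and 2.19 could translate into a per-element update in C. It is an illustration under assumed names, an assumed parameter struct and an assumed explicit Euler step, not the thesis code.

```c
#include <stddef.h>

/* Assumed parameter bundle for this sketch (values would come from table 2.1). */
typedef struct {
    double K, A, dp, dx;        /* pressure-driven term        */
    double D_hb, D_bi, dr;      /* diffusion terms             */
    double V_max, k_star;       /* Michaelis-Menten conversion */
    double tau_bi, dt;          /* bilirubin clearance, step   */
} Params;

/* Update one element i from the concentrations at the start of the time step. */
void update_element(const double *hb, const double *bi,
                    double *hb_new, double *bi_new,
                    size_t i, size_t above, size_t below,
                    const size_t *neigh, int n_neigh, const Params *p)
{
    double d_hb = 0.0, d_bi = 0.0;

    /* pressure-driven term from the vertically adjacent elements (eq. 2.18) */
    d_hb += (hb[above] + hb[below]) * p->K * p->A * p->dp / p->dx;

    /* concentration-driven diffusion over all considered neighbours (eqs. 2.18, 2.19) */
    for (int a = 0; a < n_neigh; a++) {
        d_hb -= p->D_hb * (hb[i] - hb[neigh[a]]) * p->A / p->dr;
        d_bi -= p->D_bi * (bi[i] - bi[neigh[a]]) * p->A / p->dr;
    }

    /* enzymatic conversion: hemoglobin removed, four parts of bilirubin added */
    double conv = p->V_max * hb[i] / (p->k_star + hb[i]);
    d_hb -= conv;
    d_bi += 4.0 * conv;

    /* clearance of bilirubin by the lymphatic system */
    d_bi -= bi[i] / p->tau_bi;

    hb_new[i] = hb[i] + d_hb * p->dt;
    bi_new[i] = bi[i] + d_bi * p->dt;
}
```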


2.4 Further implications

2.4.1 Diagonal neighbours in the 2D case

Which neighbouring elements to take into account for each computational step is an ongoing trade-off between efficiency and precision. In this section we explain why we found it necessary to include the diagonal neighbours in the horizontal plane.

Figure 2.3: The investigated 2D case: the concentration starts inside the inner circle of radius a and after a time ∆t has spread to the outer circle of radius 3a.

For our approach we treat the horizontal plane as an independent 2D case. The case we will investigate is shown graphically in figure 2.3. We start with all of our concentration C within the inner circle with radius a. After a time ∆t the concentration has spread to the outer circle with radius 3a. We assume that the concentration is evenly distributed within the circle; this gives us an expression for the concentration per unit area, which we denote by c. As can easily be seen, the cross neighbours are almost completely full. For further calculation we assume the cross neighbours to be completely filled, introducing an error where we allocate slightly more weight to the cross neighbours than in reality. This means that taking into account only the cross neighbours we will have 5 filled squares. In equations 2.20, 2.21 and 2.22 we calculate the amount of concentration which is within these five blocks after the diffusion has taken place.

\text{Concentration per unit area after } \Delta t: \quad c = \frac{C}{A} = \frac{C}{\pi r^2} = \frac{C}{9 a^2 \pi} \qquad (2.20)

\text{Area of the centre element plus the cross neighbours} = 5 \cdot (2a)^2 = 20 a^2 \qquad (2.21)

\text{Concentration taken into account} = 20 a^2 \cdot c = 20 a^2 \cdot \frac{C}{9 a^2 \pi} \approx 0.71\,C \qquad (2.22)

We see that in this simplified two-dimensional case we only keep track of approximately 71% of our concentration. This effectively means we miss over a fourth of the displaced concentration. Based on this simplified case we have chosen to take the diagonal neighbours within the plane into account for our implementation.

2.4.2 Used variables

Many factors used are still unknown for blood in our chosen setting. For the age determination, two measurements a known time apart will be taken. The variables will be varied within the expected biological range and the first measurement will be advanced in time by the model. The set of variables which gives the highest correlation between the model result and the second measurement is then used to calculate back how long ago the bruise was inflicted. For testing purposes we use one average value per parameter. The expected biological ranges and the average values are shown in table 2.1; they are equal to the values used by Stam et al., although all units were converted to SI units.

Param. | Name | Range | Average value
K | Hydraulic conductivity | 1.4 · 10^-12 m^4/Ns | 1.4 · 10^-12 m^4/Ns
∆p | Pressure difference | 2.6 · 10^2 N/m^2 | 2.6 · 10^2 N/m^2
D_Hb | Diffusivity hemoglobin | 2.8 · 10^-13 − 2.8 · 10^-11 m^2/s | 2.8 · 10^-12 m^2/s
D_Bi | Diffusivity bilirubin | 1.1 · 10^-12 − 1.1 · 10^-10 m^2/s | 1.1 · 10^-11 m^2/s
K_m | Michaelis constant | 0.24 · 10^-6 mol/L | 0.24 · 10^-6 mol/L
V_max | Speed of conversion | 0.94 mol/(s · g) | 0.94 mol/(s · g)
[HO] | Concentration of HO | 1.0 · 10^-4 − 1.0 · 10^-2 g/L | 5 · 10^-3 g/L
τ_Bi | Clearance of bilirubin | 1.8 · 10^5 − 1.4 · 10^6 s | 5.4 · 10^5 s

Table 2.1: The used parameters with their expected biological ranges and the average values used for testing (values from Stam et al. [2], converted to SI units).


Chapter 3

Parallelization

Within this chapter we will introduce what it means to write a parallel program and why this is useful. To this end we first introduce C, to which we will add CUDA commands to use the GPU.

3.1 Peculiarities of C

C is a "low level" language, meaning that it works with much the same objects as the computer itself does [9]. As C gives much more control over the process, it is often able to achieve higher speeds than "high level" languages like Python and Matlab. The great disadvantage of C is that, compared to higher languages, it is much more important to understand what the program is doing at every step, as mistakes often compile and run without complaint but do not give the correct answer.

C does not allow an entire array to be returned from a function. However, you can save the array at a specified place in memory and then create a pointer to this array. The pointer acts as a name that tells the program where it can read the elements of the array or where it should replace them.
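A minimal sketch of this pointer-passing pattern is given below; the function and variable names are illustrative and not taken from the thesis code.

```c
#include <stdio.h>

/* The caller hands the function pointers to its arrays; the function writes
 * its results straight into the caller's memory instead of returning an array. */
void add_one(const double *in, double *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = in[i] + 1.0;
}

int main(void)
{
    double beg[4] = {0.0, 1.0, 2.0, 3.0};
    double end[4];

    add_one(beg, end, 4);   /* the array names decay to pointers here */
    printf("%f\n", end[3]); /* prints 4.000000 */
    return 0;
}
```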

3.2 Graphic Processing Unit

In 1965 Moore observed the doubling of the number of transistors on a chip every year. In 1975 he published a follow-up paper in which he speculated that this trend would slow down to doubling every two years before the end of the decade [10]. This doubling is now better known as Moore's Law. Contrary to Moore's expectation, it has persisted to this day (see figure 3.1). Yet even though transistors keep getting smaller, the increase in clock speed has almost leveled off since 2002. This slowing down is the result of increasing heat production per unit area: this heat cannot be drained effectively by air currents, making it necessary for the chips to work at slower settings. As the clock speeds of individual processors level off, one possible solution is to scale up not their individual processing power but the number of processors. This is precisely the way Graphics Processing Units (GPUs) have evolved: a larger chip with many small processors which independently work on small problems. Each of these processors takes longer to finish a code snippet than CPU cores do, but as there are many more of them they can achieve a much higher total amount of data processing. This is only useful, however, if you can write your code in such a way that these independent results can be used for your problem.

Figure 3.1: Transistors per chip and clock speeds of these chips over the last 50 years

A very fitting analogy is that of a farmer who needs to plough his field with either 4 oxen or 20 thousand chickens. In practice it will obviously be easier to plough the field with the oxen. But if the field is very large, the oxen will need an awful lot of time to plough the whole expanse, so we might want to think about a way to use the chickens. If we can harness them together so that they all work on a different part at any time, we can lead this group over the field. Any one chicken will still do much less than an ox, but the 20 thousand together will output much more work.

This is referred to as latency versus throughput. The latency is the time needed to finish any one computation, while the throughput is how much total information can be computed per unit time. For most of our everyday use latency is the more important factor: we want our mail program to open quickly and do not care whether it has also already prepared 2000 spam mails. Yet the gaming industry has been pushing for ever more throughput to compute the graphics of open worlds in which many distinct objects operate independently. As a result, GPUs now have a much higher throughput than comparable CPUs. For our application of the finite-element model we look at 60 thousand to 25 million elements, on each of which multiple computations are applied for each time step. Although each time step must follow sequentially on the previous one, within a time step each element only needs the values of the elements at the start of that step to calculate its own time evolution. As we do not care how soon any single element is done, but only how long it takes until they are all done, this becomes a throughput problem.

3.3 CUDA

CUDA, originally an acronym for Compute Unified Device Architecture [11], provides a way to do regular computations on the GPU. To this end it provides a hierarchy of thread groups, shared memories, and barrier synchronization [1]. An important piece of terminology is that the CPU is referred to as the 'host', while the GPU on which the calculations are done is called the 'device'. In our case the C implementation gains a new set of commands which work with the GPU. The most important part is the ability to launch kernels; the launch command is shown in equation 3.1.

The "<<< , >>>" part tells the compiler that this function should be run on the GPU. It is possible to define multiple different kernels within one code; in this example Kernel is also the name of the launched kernel. The kernel is launched as a set of blocks. Every block can be made up of up to either 512 or 1024 threads, depending on the model of graphics card. The number of blocks per launch is limited to 65535. Both blocks and threads can also be launched in three dimensional form using dim3(Nx, Ny, Nz) instead of just a number; a number N is read the same as dim3(N, 1, 1). The limit on threads applies to all threads within the block, e.g. dim3(10, 5, 5) = 250 is allowed but dim3(20, 10, 10) = 2000 is not. The arrays given as arguments to the kernel launch are not actual lists but pointers to arrays stored on the device. Notice that, as the CPU and GPU do not communicate, a failed kernel launch does not produce an error; it simply makes no changes to the arrays.

Kernel <<< Blocks, Threads >>> (Array1, Array2) \qquad (3.1)

Within the launched kernel every thread knows both which thread it is within its block and what the number of its block is. Using these facts it can calculate a thread-specific number i = blockIdx.x * blockDim.x + threadIdx.x, where blockDim.x gives the size of the blocks. Using this thread-specific number each thread can calculate the new value of the i-th element of the array. As these threads are run in parallel they can efficiently calculate the new values for long arrays. The kernel launch takes as long as the slowest thread.
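To make this concrete, below is a minimal, self-contained sketch of such a kernel and its launch. It is not the thesis code: the kernel, names and sizes are assumptions, and error checking is omitted.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Each thread computes its global index and updates one element;
// the bounds check guards the last, possibly partial, block.
__global__ void addOne(const double *in, double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] + 1.0;
}

int main(void)
{
    const int n = 60000;                       // e.g. the 60k-element sample
    size_t bytes = n * sizeof(double);

    double *h = (double *)malloc(bytes);
    for (int i = 0; i < n; i++) h[i] = i;

    double *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h, bytes, cudaMemcpyHostToDevice);

    int threads = 512;                         // threads per block
    int blocks = (n + threads - 1) / threads;  // enough blocks to cover n
    addOne<<<blocks, threads>>>(d_in, d_out, n);

    cudaMemcpy(h, d_out, bytes, cudaMemcpyDeviceToHost);
    printf("last element: %f\n", h[n - 1]);

    cudaFree(d_in); cudaFree(d_out); free(h);
    return 0;
}
```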


Chapter 4

Benchmarks

To be able to give conclusive results about the speed increase, we wrote multiple benchmarks to obtain quantitative indicators of the improvement gained by using GPU acceleration.

4.1 System

All benchmark calculations were done on the same machine, approximately 2 minutes after restarting, with nothing but the benchmarked program running. Measurements which we compared with each other were always done one after another without changing any settings. The details of the relevant hardware and software are shown in table 4.1.

CPU | Intel Core i7-470HQ
GPU | GeForce GTX 860M
C compiler | Visual Studio 2013
Python | Python XY V2.70

Table 4.1: Specifications of the machine used for the benchmarks

4.2 Time/Size Benchmark

Firstly we needed to know how the parallel implementation reacted to increasing the size of our system. As explained in chapter 2 we track two substances, hemoglobin and bilirubin, so we need to keep track of twice as many elements as the sample is divided into. Table 4.2 gives a general overview of different sample sizes, to better understand what the number of elements actually represents.

X-direction | Y-direction | Height | Number of elements
100 | 100 | 3 | 60 · 10^3
100 | 100 | 100 | 2 · 10^6
300 | 300 | 100 | 18 · 10^6
250 | 250 | 250 | ∼ 15.6 · 10^6

Table 4.2: A general overview of different sampling sizes, to better understand what the number of elements actually represents. Remember that, as we keep track of both hemoglobin and bilirubin, we need to keep track of twice as many elements as the system is divided into.

The benchmark is kept as simple as possible; a flowchart can be found in figure 4.1. The code for the GPU accelerated version can be found in appendix B.2 and is included for reference. The sequential implementation differs in that, instead of calculating the time evolution on the GPU, it is done in a for loop which loops through all elements. In essence the benchmark does exactly what the final model needs to do, but with generic operations instead of the physical effects. We benchmark the two implementations both with an increasing number of elements and with an increasing number of time iterations for a fixed number of elements. All results are shown in section 5.2.

Figure 4.1: Only the part within the time iteration loop is gpu accelerated

4.3 Memory allocation

The GPU has its own individual memory. Any calculation done by the GPU on an array needs the array to be present in the memory of the GPU. For this, CUDA introduces a new command, cudaMemcpy, which can copy memory between host and device (GPU) memory.

Naively, the most straightforward implementation would be to send the list to the device, let the kernel change the values according to the defined computations and then return the list to the host. This works, but as the GPU has its own memory, time is spent copying the information back and forth between CPU (host) and GPU (device). Online, on non peer-reviewed sites, we found the advice to instead use the cudaMemcpy 'DeviceToDevice' capability. The flowchart in figure 4.2 shows the two memory allocation schemes which we compared. The comparison was done for a list of 500 thousand elements for different numbers of time iterations.

Figure 4.2: Flowchart of the two used benchmarks to validate the increase of efficiency by using cudaMemcpy ’Device to Device’
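For concreteness, the sketch below shows how the two schemes could look in code. It is an illustration only, not the thesis benchmark; the stand-in kernel, names and launch configuration are assumptions.

```cuda
#include <cuda_runtime.h>

// A trivial stand-in for the real time-step kernel.
__global__ void step(const double *in, double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0;
}

// Scheme 1 ("Dev to Host"): copy the result back to the host every iteration.
void run_via_host(double *h, double *d_in, double *d_out, int n, int iterations)
{
    size_t bytes = n * sizeof(double);
    for (int t = 0; t < iterations; t++) {
        cudaMemcpy(d_in, h, bytes, cudaMemcpyHostToDevice);
        step<<<(n + 511) / 512, 512>>>(d_in, d_out, n);
        cudaMemcpy(h, d_out, bytes, cudaMemcpyDeviceToHost);
    }
}

// Scheme 2 ("Dev to Dev"): keep the data on the device, copy device-to-device
// between iterations, and copy back to the host only once at the end.
void run_on_device(double *h, double *d_in, double *d_out, int n, int iterations)
{
    size_t bytes = n * sizeof(double);
    cudaMemcpy(d_in, h, bytes, cudaMemcpyHostToDevice);
    for (int t = 0; t < iterations; t++) {
        step<<<(n + 511) / 512, 512>>>(d_in, d_out, n);
        cudaMemcpy(d_in, d_out, bytes, cudaMemcpyDeviceToDevice);
    }
    cudaMemcpy(h, d_out, bytes, cudaMemcpyDeviceToHost);
}
```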

4.4 Python Integrated

As previously explained, C is a language which asks the user to understand the underlying structure in order to make even small adjustments. The group has multiple short-term interns and students who need to use the models and be able to make such small adjustments. To lower the required understanding of the GPU acceleration, it was proposed to write it as an executable which is called from a higher tier language. The benchmark has the GPU acceleration nested in Python because of previous experience with this language, although an actual implementation would most likely be in Matlab. The flowchart for this process is shown in figure 4.3. As we could find no direct way of communication between the higher tier language and the executable, we opted to have all information stored in a temporary text file. The information was stored using file printing statements instead of a binary encoding, so that the files could easily be used for troubleshooting. Another advantage of having the GPU code nested in a higher tier language is easier visualisation and use of the resulting sample after all time iterations.

Figure 4.3: Flowchart of the Python-nested GPU acceleration

4.5 Errors

The computation time taken often showed considerable drift. To be able to give more trustworthy results every benchmark has been done 10 times and the shown results are the average of these ten runs.

For the error, the Excel internal STDEVP function has been used, which calculates the standard deviation of a population (dX) using:

dX = \sqrt{\frac{\sum (x - \bar{x})^2}{n}} \qquad (4.1)
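For reference, a small C helper computing the same population standard deviation could look as follows; this is an illustrative sketch, not code used for the thesis (which used Excel).

```c
#include <math.h>

/* Population standard deviation as in equation (4.1). */
double stdevp(const double *x, int n)
{
    double mean = 0.0, sum_sq = 0.0;
    for (int i = 0; i < n; i++)
        mean += x[i];
    mean /= n;
    for (int i = 0; i < n; i++)
        sum_sq += (x[i] - mean) * (x[i] - mean);
    return sqrt(sum_sq / n);
}
```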


Chapter 5

Results

5.1 C compared to higher tier languages

C itself is already much faster than higher tier languages, so to be able to say anything about the effects of parallelization, most benchmarks compare GPU accelerated C code with sequential C. But to give an idea of how this compares to a higher tier language, we have also compared sequential C to Python, using in both cases the sequential implementation of the benchmark in section 4.2. The result is shown in figure 5.1.

Figure 5.1: Benchmark comparison between the sequential implementations in Python and C; details of the benchmark are explained in section 4.2


5.2 Size/Time Benchmarks

First we needed to know how the parallel implementation reacted to increasing the size of our system. As explained in chapter 2 we track two substances, hemoglobin and bilirubin, so we need to keep track of twice as many elements as the sample is divided into.

Figure 5.2: Benchmark for different numbers of elements, using the benchmark as explained in section 4.2

The results of the benchmark as described in section 4.2 are shown in figure 5.2. Notice that the error bars for the GPU accelerated version are much bigger than for its sequential counterpart. For the sequential version it seems that up to ten thousand elements the process is instantaneous; of course this is not true. What happens is that the process takes less than one millisecond, which automatically gets rounded down to zero by the time library.

We want to know what the effect of the number of time iterations is. To show the difference we have chosen to look at two different array lengths. The results can be seen in figure 5.3. Consistent with the first benchmark, the GPU accelerated model takes a factor ten longer for only one time iteration.

Figure 5.3: Benchmark for different numbers of iterations, using the benchmark as explained in section 4.2; a) is for 60k elements and b) for 10^6 elements

The actual model will need to take into account more than just one computation per element. To model this in the benchmark we took the easy route of keeping the same computation but performing the computational part on each element 100 times, effectively increasing the computational work per element per time step a hundredfold. The time dependent benchmark for this variation is shown in figure 5.4. Notice that the last point for the sequential benchmark is missing: the expected time is about 6 hours, so doing this 10 times would take too much computation time. We show the factor increase of the computation time for 100 times the computation per element in figure 5.5.


Figure 5.4: Benchmark for different numbers of iterations, using the benchmark as explained in section 4.2

Figure 5.5: The factor increase in computation time for 100 times the computation per element (cf. figure 5.4)

5.3 Memory allocation

Our results for the memory allocation benchmark, as explained in section 4.3, are shown in figure 5.6. Notice that for one iteration there is no difference, as both implementations follow the same scheme. The 100k iterations point for the 'Dev to Host' variation is not included because of the computation time required.


Figure 5.6: Benchmark for different numbers of iterations, using the benchmark as explained in section 4.3

5.4 Python Integrated

The results of the last presented benchmark, as explained in section 4.4, are shown in figure 5.7. Notice how, even for a comparatively small list of 60k elements, the model executed from a Python script takes almost 500 ms even for just one iteration. Notice also that the Python integrated version is compared to a Python implementation and not to C.

Figure 5.7: Benchmark for different numbers of iterations, using the benchmark as explained in section 4.4


Chapter 6

Discussion

In this chapter we will describe and discuss the implications of the results of sections 2.3.4 and 2.3.5 and chapter 5. Furthermore we describe some of the more important 'pitfalls' which we encountered while writing the GPU accelerated implementation.

6.1 Differences for time evolution formula

There are some key differences between the time evolution formulae derived in chapter 2 and those set forth by B. Stam [2] in 2010. First we look at the time evolution formula for hemoglobin; the differences between the two forms are discussed below.

\frac{\Delta C_{i,j,z}}{\Delta t} = C_{i,j,z\pm1}\,\frac{K A \Delta p}{\Delta x} - \sum_{\text{neighbours}} D_c\,\frac{(C_i - C_{i+a})\,A}{\Delta r} - \frac{V_{max}\,[C]}{k^* + [C]} \qquad \text{(Derived, 6.1)}

\frac{\Delta C_{i,j,z}}{\Delta t} = \frac{C_{i,j,z\pm1}}{V_{z+1}}\,\frac{K A \Delta p}{\Delta x} - \sum_{\text{neighbours}} D_c\,\frac{(C_i - C_{i+a})\,A}{\Delta r} - \frac{V_{max}\,[C]}{K_m^* + [C]}\,MW_{Hb} \qquad \text{(B. Stam, 6.2)}

From left to right, the first dissimilarity we find is the factor 1/V_{z+1}. This term is present in the formula as proposed by B. Stam, but when it was brought into the simulation a discrepancy occurred between the absolute sizes of the three different contributions. This happened because the V_{z+1} term was extremely small for our sample, so the pressure-driven term blew up. As this term gave problems and did not follow from our own derivation, it was dropped from our time evolution.

The second discrepancy is more a matter of convention: we define what is often referred to as the Michaelis constant differently. Where we use k^* = k_r / k_f, they used K_m^* = k^* \cdot [HO], where [HO] is the concentration of heme oxygenase, which acts as the catalyst for the enzymatic conversion. As this only implies different values for k_f and k_r, we have in practice used the same numerical value as Stam et al. for the Michaelis constant. The third difference is the MW_{Hb} term, which is the molecular weight of hemoglobin and which in our implementation is absorbed into V_{max}, giving a limit to the maximum speed at which conversion takes place. So the numerical value we use for V_{max} effectively incorporates this difference, and we use the same value as Stam et al. in the actual implementation.

6.2 Benchmarks

6.2.1 C compared to higher tier languages

We see that the C implementation is consistently much faster than the Python variant. We take the average of the speed increase for every point, excluding the factor for 1 time iteration because of its more than 100% error, and find that the C implementation is a factor 88(17) faster than the Python implementation. It is well known that C is much faster than higher tier languages, and most of them therefore offer C-backed extensions which can greatly enhance the speed of simple for loops. According to the literature, NumPy can get you to speeds about 3-5 times slower than C, while Cython, a Python language extension that compiles directly to C, can in specific cases be nearly equal to C [12]. Although C offers higher speeds it is much less accessible, which effectively means that a C implementation will take longer to program and be much more inconvenient for later alterations by different people. For reference, see the two codes added as appendices B.1 and B.2.

6.2.2 Time/Size Benchmarks

We see that for one time iteration with little computation per step (figure 5.2), the sequential implementation of the model is many times faster than its GPU accelerated counterpart. This is mostly because initializing the CUDA library takes a considerable time, independent of what kind of calculations follow.

As the number of time iterations increases, the contribution of the time evolution to the total time taken increases. As the time taken by the sequential implementation increases faster, we can conclude that the sequential implementation takes longer to execute each time iteration. When we look at the time dependent benchmark for 60k elements, we see that the GPU accelerated implementation is faster for 10k or more time iterations. But even for 100k time iterations the GPU accelerated implementation is only a factor 3.35(33) faster. If we increase the number of elements to a million, we see that the GPU accelerated implementation overtakes the sequential variant much earlier: already at 100 time iterations their execution takes approximately the same time, and at 100k time iterations the GPU accelerated version is 4.78(4) times as fast. An actual model will have to do more than one calculation: for our example there are already 3 independent contributions, and multiple neighbours can be taken into account. The results of the measurements for 100 times the calculation and 10^6 elements are shown in figure 5.4. For 1k time iterations the GPU accelerated implementation is 21.5(4) times faster, and for 10k iterations it is 22(4) times faster. As we can see, the effect of increasing the amount of calculation differs between the sequential and the GPU accelerated implementation; this increase of computation time is visualized in figure 5.5. We calculate the average increase for the sequential implementation using the last three results, which gives an average increase of a factor 101(8). For the GPU accelerated version we only use the last two results, and as these are still slightly increasing, this factor could be a bit higher for even larger systems; we found an increase of a factor 21.9(9). The increase for the sequential implementation is approximately equal to the increase in computational work within each step, while for the GPU accelerated implementation this increase is a factor 4.6(4) smaller. As the parallel model scales well with increasing the amount of computation within each kernel, it has been suggested to further optimize the code by having each thread compute the results for multiple elements. Regrettably, testing this optimization fell outside the scope of this project.
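The suggested optimization of letting each thread handle several elements is commonly written as a grid-stride loop; a minimal sketch is shown below. This was not implemented or tested in this project, and the kernel body is only a stand-in.

```cuda
// Grid-stride loop: each thread starts at its global index and then jumps
// forward by the total number of launched threads until the array is covered,
// so one launch can process far more elements than threads.
__global__ void stepStride(const double *in, double *out, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
    {
        out[i] = in[i] + 1.0;
    }
}
```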

6.2.3 Memory allocation

The results presented in figure 5.6 clearly show that the 'Dev to Host' implementation takes much more time when larger numbers of memory transfers between device and host are needed. For a thousand time iterations the 'Dev to Host' implementation already takes noticeably longer, and for 10 thousand iterations it takes a factor 8.26(28) longer. For a hundred thousand iterations the 'Dev to Host' implementation took too long to do the necessary ten runs. The steep increase of the calculation time means that this implementation is unsuited for any program which uses intermediate results. One example comes from the bruise model by Stam et al., where the total amounts of both hemoglobin and bilirubin are tracked over time. As the information is kept on the GPU, this kind of evaluation is much harder to do. One possibility is to only do it occasionally, for example every 100 time steps; this would come at some loss of speed, but much less. A second possibility would be to let a kernel on the GPU calculate the intermediate results and save them in a second array on the GPU; after all time iterations are done, this array would also be copied to the host and yield all intermediate results. As this scheme would greatly increase the complexity of the kernel code, we would advise against trying to implement it without more expertise on the subject.

6.2.4 Python integration

The results presented in figure 5.7 show two important characteristics. Firstly, we see that for up to a hundred time iterations the time taken to resolve the system hardly increases. This is because the GPU implementation already has a high base initialization time, which is further increased by the time spent reading and writing the results to a .txt file. Using the first two points we found a base time of 344(20) ms. This base time depends on the number of elements, but for reasons of time this dependence has not been characterized. From this base time we can conclude that using an integrated GPU acceleration will only be faster for a large number of time iterations. For a thousand time iterations the integrated version is 5.58(15) times faster; for ten thousand iterations this difference has grown to a factor 7.14(14). Notice that this difference is relative to the simple Python implementation, which was shown in section 6.2.1 to be a factor 88(17) slower than a sequential C implementation. From this we conclude that using a GPU accelerated integration would only be useful for large systems which undergo many time iterations.

6.3 Pitfalls

For the possibility that someone will continue to develop this specific model in CUDA assisted C code, we have added a handful of problems that arose on our path and strongly impeded progress. These warnings come in addition to all other comments within this thesis and are not a complete overview of previously mentioned points.

6.3.1 Limitation on the amount of kernels

A CUDA kernel launch is hardware limited in both the number of blocks and the number of threads within each block. The number of threads per block is limited to 512 or 1024 depending on the graphics card, meaning that at this time any model that has to run on multiple cards will be limited to 512. The hardware restriction on the number of blocks is 65535, which is 2^{16} − 1. On the newest GPU architectures the block limitation is dimension dependent, so the theoretical maximum number of launched threads would be 65535^3 · 1024 ≈ 28.8 · 10^{16}. But during this project we have not been able to work with multiple dimensions, so our maximum was 65535 · 512 ≈ 33.6 · 10^6. It is useful to note that overstepping either of these limits does not return an error: the kernel launch will simply fail and the C code will just continue.
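One way to respect these limits is to check the launch configuration on the host before launching. The helper below is an illustrative sketch assuming the 512-thread and 65535-block limits discussed above; it is not part of the thesis code.

```c
#include <stdio.h>

/* Returns the number of blocks for a 1D launch of n elements, or -1 if the
 * launch would exceed the assumed hardware limit of 65535 blocks. */
long long blocks_for(long long n, int threads_per_block)
{
    long long blocks = (n + threads_per_block - 1) / threads_per_block;
    if (blocks > 65535) {
        fprintf(stderr, "launch of %lld blocks exceeds the 65535-block limit\n",
                blocks);
        return -1;  /* fall back to a multi-dimensional grid or a strided kernel */
    }
    return blocks;
}
```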

6.3.2 Memory allocation

Memory allocation is the bane of all C programming, and it is no different in this project. We use the CUDA command cudaMemcpy, which copies a specified amount of memory from one place to another. Consequently, it is essential that the source and destination are of the exact same size. In practice this means that it is not allowed to initialize the arrays using open initialization; all lengths have to be allocated explicitly with malloc (memory allocation) and cudaMalloc (GPU memory allocation) respectively.

double *system = (double *)malloc(Elements * sizeof(double));
double *devsystem;
cudaMalloc((void **)&devsystem, Elements * sizeof(double));
cudaMemcpy(devsystem, system, Elements * sizeof(double), cudaMemcpyHostToDevice);

Here Elements is the number of elements in the array, system is the array on the CPU and devsystem is the array on the GPU. We define the arrays as data type double, giving us a higher precision floating point number. Using one asterisk declares the array with a dimensionality of one, a flat list. During this project we were not able to launch kernels working on three dimensional arrays; we got error messages of the form: argument of type "int ***" is incompatible with parameter of type "const int ***", which we were not able to resolve.

6.3.3 Indexing bug

One bug that we found in our CUDA implementation, but have not been able to remedy, concerns the indexing of the arrays used within the kernel. As previously explained in section 3.3, each thread first computes its own number i and then computes the change for the i-th element of the array. But for short arrays our kernel changed nothing on the elements in the array. After some experimenting we found out that the result list had an internal offset: the result list only had elements starting from a certain offset. Below is an example for an array of length 150, for which we found an offset of 192.


__global__ void addKernel(const int *sample, int *result)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < LengthArray)
        result[i + 192] = sample[i];   /* here we add the offset */
}

As you might notice, this offset is only needed for the result list; the sample list does start at 0. The sample list is passed as a non-mutable array, as can be seen from the const in the signature. We believe that hard-coding this offset enables the model to do what it is supposed to, but we have not been able to find out what the origin of the offset is. The offsets found for different array sizes can be found in table 6.1. We also give the difference between the number of elements and the offset, because we believed this could explain the origin of the inconsistency, but we could not find any underlying logic.

Elements (#) | Offset | Difference
150 | 192 | 42
296 | 330 | 34
486 | 512 | 26
2 400 | 2 432 | 32
15 000 | 15 104 | 104
60 000 | 60 032 | 32

Table 6.1: The offsets found for different array sizes, and the difference between the number of elements and the offset.


Chapter 7

Conclusion and Recommendations

Using multiple benchmarks we have shown that the GPU accelerated version is faster than the sequential C code. The biggest effect we measured was 21.5(4) times as fast, for 10^6 elements with a high computational load for each element. This effect is even more pronounced taking into account that the C implementation is already faster than higher tier implementations: we tested this with a comparison between Python and C, finding a factor 88(17), although we expect this factor to be smaller for correctly optimized code.

We have shown that the finite-element model can be accelerated using GPU algorithms. But we have not managed to implement all physical aspects of the bruise model, so we were not able to compare the GPU accelerated results to the existing LabVIEW model. This brings us to one of the most pronounced disadvantages: the inaccessibility of C. As far as programming languages go, C has a large amount of syntax and very unhelpful error messages. As the bruise model should be accessible without prior experience with C, both to short-term students and to medical specialists at a more advanced stage, we need to ascertain that all functionality of the model can be accessed without changing the actual code. For this we describe the two possible approaches which we believe to be best suited.

The first is using the 'MATLAB Parallel Computing Toolbox' to make a direct integration within Matlab [13]. This toolbox is still very new, so we do not know whether it will live up to its promises, but the documentation talks about a built-in command to make for loops run in parallel. This is in essence the same as we have done with the GPU accelerated implementation within C. It will likely not reach the same speeds as a C GPU accelerated implementation would, but it might very well be more than enough, without the need to learn the inner workings of C and instead using the already present experience with Matlab.

The second is recreating the whole model within a C framework, including the reading in of the bruise images, visualization of the results and a Graphic User Interface (GUI). The GUI in particular should be quite extensive to account for all variations that could be encountered without having to change the actual code. This extensive GUI would also set limitations on what the program can solve and on its speed. Most notably, it would make writing the model vastly more complex. We would therefore only advise trying this approach if the programmer has extensive knowledge of, and experience with, C programming. The extra complications due to the GPU acceleration would limit the intermediate results but should not pose an insurmountable challenge.


Acknowledgements

I would like to thank Maurice Aalders and Richelle Hoveling, not just for helping me during this Bachelor project but first of all for giving me the chance to do it. Learning to program in C had been something I wanted to do for the last two years. Furthermore, being able to do an extended programming project gave me valuable insight into whether this is a direction I want to follow in my master. And even when the project did not go as fast as we had hoped and expected, instead of rebuking me they were there to help and to give the project a more practical direction.

I am also grateful to Ivo van Vulpen, who accepted being my second corrector even if it meant travelling all the way to the AMC for my 9 o'clock presentation, where he was amply on time and in good spirits to give support. I am greatly indebted to David Luebke and John Owens, who have created a very comprehensive and complete open Udacity course [14], which acted as the basis for my exploration of parallel processing. I would also like to thank the other students in what we dubbed the coolest AMC room; you guys kept it a bit more interesting in that dismal basement. I would like to specifically name Jeltje van Esch, a master student in our room who really took care that everything was going well and often had time to give advice.

References

[1] John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. Scalable parallel programming with CUDA. Queue, 6(2):40-53, March 2008. ISSN 1542-7730.

[2] B. Stam, M. J. C. van Gemert, T. G. van Leeuwen, and M. C. G. Aalders. Three dimensional modeling of bruise evolution for improved age determination. Medical and Biological Engineering and Computing, 48, 2010.

[3] Martini, Nath, and Bartholomew. Fundamentals of Anatomy & Physiology. Pearson, 1988.

[4] Mouaouia Firdaous, Jean-Luc Guermond, and Patrick Le Quéré. Nonlinear corrections to Darcy's law at low Reynolds numbers. Journal of Fluid Mechanics, pages 331-350, 7 1997.

[5] Michael Poreh and Chaim Elata. An analytical derivation of Darcy's law. In Israel Journal of Technology, volume 4, page 214, 1966.

[6] Yuriy Povstenko. Linear Fractional Diffusion-Wave Equation for Scientists and Engineers. Springer, 2015.

[7] Stanley Ainsworth. Steady-State Enzyme Kinetics, pages 43-73. Macmillan Education UK, 1977.

[8] Kenneth A. Johnson and Roger S. Goody. The original Michaelis constant: translation of the 1913 Michaelis-Menten paper. Biochemistry, 50(39):8264-8269, 2011.

[9] Brian W. Kernighan, Dennis M. Ritchie, and Per Ejeklint. The C Programming Language, volume 2. Prentice-Hall, Englewood Cliffs, 1988.

[10] Gordon E. Moore. Cramming more components onto integrated circuits. Readings in Computer Architecture, 56, 2000 (original 1975).

[11] CUDA Nvidia. NVIDIA CUDA C programming guide. Nvidia Corporation, 120(18):8, 2011.

[12] Stefan Behnel, Robert Bradshaw, Craig Citro, Lisandro Dalcin, Dag Sverre Seljebotn, and Kurt Smith. Cython: the best of both worlds. Computing in Science & Engineering, 2011.

[13] Matlab Parallel Computing Toolbox documentation, retrieved 2016-07-08. URL http://nl.mathworks.com/discovery/matlab-gpu.html.

[14] NVIDIA. Intro to parallel programming, last checked: 06-07-2016. URL https://www.udacity.com/course/intro-to-parallel-programming--cs344.


Appendix A

3D modelling on the graphics card

Driven by the huge demand from the gaming industry, the growth in the capacity of graphics cards (GPUs) has been enormous. As can be seen in figure A.1, the theoretical computational power of graphics cards is already many times larger than that of the central processors (CPUs) on the market. For gamers this development is great: it lets them enjoy gigantic open worlds in which hundreds of things can happen on screen in parallel.

Figure A.1: Theoretical computational power of NVIDIA GPUs (green) compared to Intel CPUs (blue)

But if these graphics cards possess so much power, why will every salesperson tell you that, outside of gaming, you mainly need a strong CPU? This is because most computers, by their origin, plough through problems serially. Conventional models first solve the first step, then the second, then the third, and so on. And as long as the number of calculations is small, the CPU goes through them so quickly that this happens almost instantaneously. But for so-called Finite Element Models (FEMs) we divide the studied area into thousands of small blocks that all interact with each other. The smaller we make the blocks, the more computational power we need to describe the system.

With the development of GPU acceleration modules, it becomes possible for semi-laymen to perform calculations using the gigantic computing power that graphics cards offer. In essence the graphics card closely resembles the regular processor, since in terms of hardware a GPU is nothing but a whole lot of small processors. The fact that each individual processor is smaller does impose an important requirement on these so-called GPU accelerated programs: it is necessary to divide your problem into very small calculations. For the finite element model it is fairly easy to divide the problem: each block gets its own process in which it looks at all the blocks around it and, based on the properties of its neighbours, determines its new state. The application of such a model is very clear for the spreading of dye in water, or similarly of blood under the skin.

In this project CUDA, a GPU acceleration implementation in C, was used to model the development of subcutaneous bleedings (bruises) on the graphics card. The speed increase compared to the familiar regular implementations is enormous. Programming languages such as Python and Matlab, which for a system of 10 million blocks really have to loop through every single block, logically take longer than CUDA, which in one step launches 10 million small calculations on the graphics card, where they are solved thousands at a time. But every good study is critical of its own results. The CUDA implementation is in C: one of the most elementary programming languages we have, but thereby also one of the most user-unfriendly languages in use. Partly for this reason, GPU implementations are currently being developed in almost all well-known languages. For smaller models the hope is focused on the development of PyCUDA (Python), jCUDA (Java), KappaCUDA (Perl/Ruby) and a whole series of other implementations.

But what does this mean for anyone interested in this subject? Should you wait and see what is coming? No, now is your chance: jump on board of parallel computation. Since 2012 the first GPU driven supercomputer, with the impressive name Titan, has been running. This giant reaches 10 petaFLOPS (10^15), or a thousand million million calculations per second. Another giant is already on its way: Summit should be operational in 2018 and push Titan off its throne. Now is the moment to find the challenge that until recently lay out of reach of serial computers, but that, by cleverly rewriting it in parallel form, can finally be overcome. Who knows, perhaps this will finally yield the solutions for flow simulations, protein folding, or the modelling of bruises.


Appendix B

Simple Benchmarks

B.1 Python Benchmark

import time
from copy import deepcopy


def Main(Numbertotal, Numtime):
    start_time = time.time()
    # Make two lists
    Listbeg, Listend = [], []
    for i in range(Numbertotal):
        Listbeg.append(i)
        Listend.append(0)
    # For every time step the end values change once
    for t in range(Numtime):
        for i in range(len(Listend)):
            Listend[i] = Listbeg[i] + 1
        Listbeg = deepcopy(Listend)
    # Prints the time taken since starting
    print("time taken: %f seconds" % (time.time() - start_time))

Listing B.1: Python implementation simple benchmark


B.2 C/CUDA Benchmark

#include <stdio.h>
#include <string.h>
#include <time.h>
#include <stdlib.h>

#define arraySize 10   // number of elements ("Size" in the original listing)
#define Num_time  10   // number of time steps (value not given in the original listing)

int main()
{
    double time_start = clock();

    int *Listbeg = (int *)malloc(arraySize * sizeof(int));
    int *Listend = (int *)malloc(arraySize * sizeof(int));

    for (int i = 0; i < arraySize; i++) {
        Listbeg[i] = i;
        Listend[i] = 0;
    }

    // For every time step the end values change once
    for (int t = 0; t < Num_time; t++) {
        for (int n = 0; n < arraySize; n++) {
            Listend[n] = Listbeg[n] + 1;
        }
        memcpy(Listbeg, Listend, arraySize * sizeof(int));
    }

    free(Listbeg);
    free(Listend);

    double time_taken = clock() - time_start;
    printf("time taken: %f seconds\n", time_taken / CLOCKS_PER_SEC);
    return 0;
}
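The listing above covers only the plain c half of this benchmark. For reference, a minimal sketch of what the gpu accelerated counterpart could look like is given below, performing the same element-wise +1 update on the device; the kernel name, helper function and launch configuration are illustrative and not taken from the thesis code.

#include <cuda_runtime.h>

__global__ void incrementKernel(const int *Listbeg, int *Listend, const int arraySize)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
    if (i < arraySize) {
        Listend[i] = Listbeg[i] + 1;
    }
}

// Host-side time loop, assuming dev_beg and dev_end were allocated with
// cudaMalloc and filled with cudaMemcpy beforehand (illustrative names).
void runBenchmark(int *dev_beg, int *dev_end, int arraySize, int Num_time)
{
    const int block_size = 256;
    const int num_blocks = (arraySize + block_size - 1) / block_size;
    for (int t = 0; t < Num_time; t++) {
        incrementKernel<<<num_blocks, block_size>>>(dev_beg, dev_end, arraySize);
        cudaDeviceSynchronize();                              // wait for the step to finish
        cudaMemcpy(dev_beg, dev_end, arraySize * sizeof(int),
                   cudaMemcpyDeviceToDevice);                 // new state becomes old state
    }
}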


Appendix C

Complete gpu accelerated model

#include "stdio.h"
#include "conio.h"
#include "time.h"
#include "stdlib.h"
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

// Variables ----------------------------------------------------------------
#define R_blood 10.          // mm     starting diameter blood pool
#define K_h 0.000018         // m^4/N  Hydraulic conductivity
#define d_p 260.             // N/m^2  Pressure difference
#define D_hb 0.00036         // m^2/s  Diffusion constant Hemoglobin
#define D_bi 0.000144        // m^2/s  Diffusion constant Bilirubin
#define K_m 0.00000024       // M
#define M_hb 65000           // g/mol  Molecular weight Hb
#define M_bi 584             // g/mol  Molecular weight Bilirubin
#define V_max 0.000000944    // mol/(s g)
#define C_ho 0.005           // g/L
#define t_bi 540000          // s      Clearance time of Bilirubin
#define x_t 0.000004         // m      Dermal thickness top layer
#define x_b 0.000006         // m      Dermal thickness bottom layer
#define x_s 0.00001          // m      Thickness subcutaneous layer
#define s_h 0.05             // m      Horizontal size of the square
// End Variables ------------------------------------------------------------

// Calculated constants -------------------------------------------------------
#define Dx_h (s_h / SIZE)            // horizontal distance between centres
#define Dx_s (1.414 * Dx_h)          // diagonal distance between centres
#define Dx_1_2 ((x_t + x_b) / 2)     // vertical distance between layer 1 and 2
#define Dx_2_3 ((x_b + x_s) / 2)     // vertical distance between layer 2 and 3
#define A_h_1 (Dx_h * x_t)           // Area horizontal layer 1
#define A_h_2 (Dx_h * x_b)           // Area horizontal layer 2
#define A_h_3 (Dx_h * x_s)           // Area horizontal layer 3
#define A_v (Dx_h * Dx_h)            // horizontal and vertical different?
#define A_s_1 ((A_h_1) / 2)
#define A_s_2 ((A_h_2) / 2)
#define A_s_3 ((A_h_3) / 2)
#define V_z_1_2 (Dx_h * Dx_h * x_b)
#define V_z_2_3 (Dx_h * Dx_h * x_s)
// End Derived constants ------------------------------------------------------

// Both arraySize and Num_time are defined in Main()
#define SIZE 100
#define block_size 100   // Size of the launched blocks
#define D_t 1800         // (s) amount of seconds each time step takes

cudaError_t addWithCuda(double *a, double *c, unsigned int size, const int Num_time);
// Weird problem where i is which number in list c... not a clue what's going on :P

__global__ void addKernel(const double *a, double *c, const int arraySize)
{
    int n_t = threadIdx.x;
    int n_b = blockIdx.x;
    int size_block = blockDim.x;
    int i, j;
    i = n_b * size_block + n_t;
    j = i + 60032;   // I don't have a clue where this factor comes from

    double now, delta;
    now = a[i];

    if (i < arraySize) {   // Just to grab some stragglers before they ruin absolutely everything
        if (i % 2 == 0) {
            if (i % (2 * SIZE) == 0 || i % (2 * SIZE) == 2 * (SIZE - 1)) {
                c[j] = 0.0;
            } else {
                // LAYER 1 Hb
                if ((2 * SIZE + 1) < i && i < (arraySize - 2 * 2 * SIZE * SIZE - 2 * SIZE - 2)) {
                    delta = (a[i + 2 * SIZE * SIZE]) * (K_h * d_p * A_v / Dx_1_2) +
                            D_hb * (1 / 6.) * (A_v / Dx_h) * (
                                a[i - 2] - now +
                                a[i + 2] - now +
                                a[i + 2 * SIZE] - now +
                                a[i - 2 * SIZE] - now
                            ) +
                            D_hb * (1 / 6.) * (A_v / Dx_1_2) *
                                (a[i + 2 * SIZE * SIZE] - now) +
                            D_hb * (1 / 4.) * (A_s_1 / Dx_s) * (
                                a[i - 2 * SIZE - 2] - now +
                                a[i - 2 * SIZE + 2] - now +
                                a[i + 2 * SIZE - 2] - now +
                                a[i + 2 * SIZE + 2] - now) -
                            (V_max * now * C_ho * M_hb) / (M_hb * K_m + now);
                    c[j] = now + delta * D_t;
                }
                // LAYER 2 Hb
                else if ((1 * 2 * SIZE * SIZE + (2 * SIZE + 1)) < i && i < (arraySize - 1 * 2 * SIZE * SIZE - 2 * SIZE - 2)) {
                    delta = (a[i + 2 * SIZE * SIZE]) * (K_h * d_p * A_v / Dx_2_3) -
                            (a[i]) * (K_h * d_p * A_v / Dx_1_2) +
                            D_hb * (1 / 6.) * (A_v / Dx_h) * (
                                a[i - 2] - now +
                                a[i + 2] - now +
                                a[i + 2 * SIZE] - now +
                                a[i - 2 * SIZE] - now
                            ) +
                            D_hb * (1 / 6.) * (
                                (A_v / Dx_1_2) * (a[i - 2 * SIZE * SIZE] - now) +
                                (A_v / Dx_2_3) * (a[i + 2 * SIZE * SIZE] - now)
                            ) +
                            D_hb * (1 / 4.) * (A_s_2 / Dx_s) * (
                                a[i - 2 * SIZE - 2] - now +
                                a[i - 2 * SIZE + 2] - now +
                                a[i + 2 * SIZE - 2] - now +
                                a[i + 2 * SIZE + 2] - now) -
                            (V_max * now * C_ho * M_hb) / (M_hb * K_m + now);
                    c[j] = now + delta * D_t;
                }
                // LAYER 3 Hb
                else if ((2 * 2 * SIZE * SIZE + (2 * SIZE + 1)) < i && i < (arraySize - 2 * SIZE - 2)) {
                    delta = (-1) * (a[i]) * (K_h * d_p * A_v / Dx_2_3) +
                            D_hb * (1 / 6.) * (A_v / Dx_h) * (
                                a[i - 2] - now +
                                a[i + 2] - now +
                                a[i + 2 * SIZE] - now +
                                a[i - 2 * SIZE] - now) +
                            D_hb * (1 / 6.) * (A_v / Dx_2_3) *
                                (a[i - 2 * SIZE * SIZE] - now) +
                            D_hb * (1 / 4.) * (A_s_3 / Dx_s) * (
                                a[i - 2 * SIZE - 2] - now +
                                a[i - 2 * SIZE + 2] - now +
                                a[i + 2 * SIZE - 2] - now +
                                a[i + 2 * SIZE + 2] - now) -
                            (V_max * now * C_ho * M_hb) / (M_hb * K_m + now);
                    c[j] = now + delta * D_t;
                } else {
                    c[j] = 0.0;
                }
            }
        } else {
            // The sides
            if (i % SIZE == 1 || i % SIZE == (2 * SIZE - 1)) {
                c[j] = 0.0;
            } else {
                // Bilirubin LAYER 1
                if ((2 * SIZE + 1) < i && i < (arraySize - 2 * 2 * SIZE * SIZE - 2 * SIZE - 2)) {
                    delta = D_bi * (1 / 6.) * (A_v / Dx_h) * (
                                a[i - 2] - now +
                                a[i + 2] - now +
                                a[i + 2 * SIZE] - now +
                                a[i - 2 * SIZE] - now
                            ) +
                            D_bi * (1 / 6.) * (A_v / Dx_1_2) *
                                (a[i + 2 * SIZE * SIZE] - now) +
                            D_bi * (1 / 4.) * (A_s_1 / Dx_s) * (
                                a[i - 2 * SIZE - 2] - now +
                                a[i - 2 * SIZE + 2] - now +
                                a[i + 2 * SIZE - 2] - now +
                                a[i + 2 * SIZE + 2] - now) +
                            4 * (V_max * a[i - 1] * C_ho * M_hb) / (M_bi * K_m + a[i - 1]);
                    c[j] = now + delta * D_t;
                }
                // Bilirubin LAYER 2
                else if ((2 * SIZE * SIZE + (2 * SIZE + 1)) < i && i < (arraySize - 2 * SIZE * SIZE - 2 * SIZE - 2)) {
                    delta = D_bi * (1 / 6.) * (A_v / Dx_h) * (
                                a[i - 2] - now +
                                a[i + 2] - now +
                                a[i + 2 * SIZE] - now +
                                a[i - 2 * SIZE] - now
                            ) +
                            D_bi * (1 / 6.) * (A_v / Dx_2_3) *
                                (a[i + 2 * SIZE * SIZE] - now) +
                            D_bi * (1 / 6.) * (A_v / Dx_2_3) *
                                (a[i - 2 * SIZE * SIZE] - now) +
                            D_bi * (1 / 4.) * (A_s_3 / Dx_s) * (
                                a[i - 2 * SIZE - 2] - now +
                                a[i - 2 * SIZE + 2] - now +
                                a[i + 2 * SIZE - 2] - now +
                                a[i + 2 * SIZE + 2] - now) +
                            4 * (V_max * a[i - 1] * C_ho * M_bi) / (M_bi * K_m + a[i - 1])
                            - now / t_bi;
                    c[j] = now + delta * D_t;
                }
                // Bilirubin LAYER 3
                else if ((2 * 2 * SIZE * SIZE + (2 * SIZE + 1)) < i && i < (arraySize - 2 * SIZE - 2)) {
                    delta = D_bi * (1 / 6.) * (A_v / Dx_h) * (
                                a[i - 2] - now +
                                a[i + 2] - now +
                                a[i + 2 * SIZE] - now +
                                a[i - 2 * SIZE] - now
                            ) +
                            D_bi * (1 / 6.) * (A_v / Dx_2_3) *
                                (a[i - 2 * SIZE * SIZE] - now) +
                            D_bi * (1 / 4.) * (A_s_3 / Dx_s) * (
                                a[i - 2 * SIZE - 2] - now +
                                a[i - 2 * SIZE + 2] - now +
                                a[i + 2 * SIZE - 2] - now +
                                a[i + 2 * SIZE + 2] - now) +
                            4 * (V_max * a[i - 1] * C_ho * M_bi) / (M_bi * K_m + a[i - 1])
                            - now / t_bi;
                    c[j] = now + delta * D_t;
                } else {
                    c[j] = 0.0;
                }
            }
        }
    }
    if (c[j] <= 0) {
        c[j] = 0.0;
    }
}

int action(const int arraySize, const int Num_time)
{
    double time_start;
    time_start = clock();

    double *a = (double *)malloc(arraySize * sizeof(double));
    for (int i = 0; i < arraySize; ++i) {
        a[i] = StarttingValue();
    }
    double *c = (double *)malloc(arraySize * sizeof(double));

    addWithCuda(a, c, arraySize, Num_time);

    free(a);
    free(c);

    double time_end;
    time_end = clock();
    printf("\n total amount of threads was: %d \n amount of time steps was %d \n time taken was: %f seconds \n",
           arraySize, Num_time, (time_end - time_start) / CLOCKS_PER_SEC);
    return 0;
}

cudaError_t addWithCuda(double *a, double *c, unsigned int size, const int Num_time)
{
    double *dev_a = 0;
    double *dev_c = 0;
    cudaError_t cudaStatus;

    // Choose which GPU to run on, change this on a multi-GPU system.
    cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed! Do you have a CUDA-capable GPU installed?");
        goto Error;
    }

    // Allocate GPU buffers for the input and output vectors.
    cudaStatus = cudaMalloc((void **)&dev_c, size * sizeof(double));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }
    cudaStatus = cudaMalloc((void **)&dev_a, size * sizeof(double));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }
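The listing is cut off at this point. A minimal sketch of the remaining host-side steps, following the standard CUDA pattern (copy the input to the device, launch the kernel once per time step, copy the result back, release the buffers), is given below; the variable names follow the listing above, but the code is a generic sketch rather than the thesis implementation, and it does not account for the j = i + 60032 offset used inside addKernel.

    // Copy the initial state to the GPU.
    cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(double), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    // One kernel launch per time step; the output of each step becomes the input of the next.
    for (int t = 0; t < Num_time; t++) {
        addKernel<<<(size + block_size - 1) / block_size, block_size>>>(dev_a, dev_c, size);
        cudaStatus = cudaDeviceSynchronize();
        if (cudaStatus != cudaSuccess) {
            fprintf(stderr, "addKernel launch failed!");
            goto Error;
        }
        cudaMemcpy(dev_a, dev_c, size * sizeof(double), cudaMemcpyDeviceToDevice);
    }

    // Copy the final state back to the host.
    cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(double), cudaMemcpyDeviceToHost);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

Error:
    cudaFree(dev_a);
    cudaFree(dev_c);
    return cudaStatus;
}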
