MT3DMS, A Modular Three-Dimensional
Multispecies Transport Model
User Guide to the Massively Parallel Processing (MPP) Package and PETSc (PET) Package
1203355-003
© Deltares, 2011
Jarno Verkaik, Arthur van Dam, Aris Lourens
1203355-003-BGS-0001, 5 September 2011, final
Contents
1 Introduction
2 Numerical implementation
  2.1 Motivations for distributed parallel computing
  2.2 Partitioning
  2.3 Communication of data
  2.4 GCG implicit solver
  2.5 PETSc implicit solver
  2.6 Supported features
3 Program design
4 Input instructions
5 Test problems
  5.1 Examples
  5.2 Large application model
1 Introduction
Preserving good groundwater and surface water quality is of great importance for water consumption, agriculture, and the environment. Groundwater and surface water bodies that provide us with clean water are threatened by a large range of contaminants, e.g. nutrients, pesticides, or heavy metals. Several national and international policy programs exist to prevent this from happening, for example KRW, WB21, or Natura2000. In order to support decision makers, accurate and fast simulation models are required for predicting contaminant transport. This calls for very detailed models, which inherently require very large computing times.
MT3DMS is a widely used, public-domain code for modeling contaminant transport in groundwater. Over the last two decades a large number of MT3DMS transport models have been developed at Deltares, becoming more and more detailed and hence requiring more and more computing time. To speed up MT3DMS, a distributed memory parallelization approach is applied in this project that allows the code to run on high performance computers in a massively parallel way, i.e., using a very large number of processors. This report describes how this is realized.
The focus in this approach was on distributed memory parallelization of the Eulerian solvers, such as the TVD scheme, and not on the Eulerian-Lagrangian solvers (MOC), which require different kinds of parallelization techniques. Furthermore, parallelization of the implicit solver required some special attention, although this is of less importance for models where the contaminant transport is assumed to be advection dominated. From a software point of view, the MT3DMS code was changed at some points (e.g. the GCG solver), although these changes were kept as minimal as possible. Two new modules/packages were added: the MPP package (massively parallel processing) and the PET package (interface to the PETSc solver toolkit).
The layout of this report is as follows. Chapter 2 describes the numerical implementation, followed by a brief program design in Chapter 3 and input instructions in Chapter 4. In Chapter 5, results are presented for the test problems and a large application model.
2 Numerical implementation
Section 2.1 first gives some motivations for the chosen parallelization approach, followed by the partitioning strategy in Section 2.2 and the data communication mechanism in Section 2.3. Sections 2.4 and 2.5 describe the parallelization approach for the GCG solver and the PETSc solver, respectively. Section 2.6 concludes with a list of supported features.
2.1 Motivations for distributed parallel computing
Figure 2.1: shared-memory (a) versus distributed memory (b).
This report will not go into detail on the wide spectrum of existing parallelization techniques. For an introduction, the reader is referred to further readings on parallel computing, e.g. Grama et al. (2003).
A commonly used distinction between parallelization approaches is based on memory organization: shared memory versus distributed memory. In a shared-memory (or global-memory) computer architecture (Figure 2.1a), each processor can access each memory location of the entire memory system. In a distributed-memory (or local-memory) architecture, each processor has its own local memory, as illustrated in Figure 2.1b. A processor can only access its own local memory, and can only obtain data from another processor through communication over the interconnection network (the network that connects the processors with each other).
Shared memory parallelization is usually achieved by the OpenMP standard, and distributed memory by a Message Passing Interface (MPI) implementation, e.g. MPICH (Gropp et al. 2009). The shared memory approach usually involves implicit (or data) parallelization, meaning that computations are parallelized by setting compiler directives, e.g. for DO-loops in the code. The distributed memory approach usually involves explicit parallelization, meaning that the problem is physically distributed over different processes. Some advantages of distributed memory parallelization are:
Divide-and-conquer is a natural approach for splitting a large problem into smaller sub-problems;
Memory size is not limited to that of a single machine;
A very large number of processors can be used with good scalability (massively parallel processing);
Upgrading hardware (clusters/supercomputers) with more processors is usually cheaper than for shared memory machines.
Some downsides are:
Speed of the network (interconnect) can be a bottleneck for the performance (latency);
Coding can be hard, since communication between processes must be programmed explicitly;
Convergence rates of the implicit solver can be an issue, especially for dispersion-dominated groundwater flow.
Distributed memory parallelization is closely related to so-called domain decomposition methods, see e.g. www.ddm.org. Furthermore, the distributed memory approach does not exclude shared memory. Since the two coding approaches are largely complementary, both methods can be combined in one single code. As a hybrid approach, for example, one could think of distributed MPI processes¹ where, within each MPI process, the sub-problem is solved on a multi-core processor using OpenMP.

1. From now on we will speak of communication between MPI processes instead of processors, because more than one MPI process can run on a single processor.
2.2 Partitioning

Figure 2.2: Example showing how a grid of NCOL = 8 columns and NROW = 6 rows (NLAY = 1) is partitioned for four processes P0 to P3. The shaded cells indicate the send and receive cells for overlap 1 and overlap 2; for each cell both the global and the local node number are shown.
In the distributed approach, the MT3DMS grid is divided over the number of available processes, say $M = M_c M_r$, where $M_c$ and $M_r$ are the number of partitions in column and row direction, respectively. Each partition has the same number of layers, since the assumption is that for large applications the number of rows and columns is much greater than the number of layers. The blocks are numbered from the upper left ($P_0$, the Master process) to the lower right ($P_{M-1}$; the Slave processes). The partitioning is done such that a) each process has
more or less the same number of cells (work load) and b) the total length of the interfaces between the processes is minimal. In this way each process has about the same amount of work and the communication of data between the processes is minimal. The MPP package supports three options for partitioning:
1 Basic partitioning algorithm, uniform in column and row direction;
2 Basic partitioning algorithm, correcting for active cells in column and row direction;
3 User defined partitioning.
These options are explained in more detail below. First, a non-overlapping partition is determined.
Basic partitioning algorithm: uniform

For example, consider an MT3DMS computational grid with $N_c = 8$ columns, $N_r = 6$ rows and $N_l = 1$ layer, as shown in Figure 2.2. Given four processes ($M = 4$), the figure shows the resulting partitions for the case $M_c = M_r = 2$. The algorithm used to determine $M_c$ and $M_r$ is very straightforward. First, all candidate options (dividers) for $M_c$ and $M_r$ are determined such that $M = M_c M_r$ for the given $M$. For this example there are three dividers $M_c \times M_r$: 1) 1 × 4, 2) 2 × 2, and 3) 4 × 1. Second, the optimal block size $b$ is computed as $b = \mathrm{int}\!\left(\sqrt{N/M}\right)$, where $N = N_c N_r$; for this example $b = 3$. The smallest dimension, $N_r$ in this case, is divided by $b$, resulting in the optimal number of blocks in row direction ($M_r = 2$). Then the divider closest to this optimum is selected, hence $M_c = M_r = 2$. In this method each block $p$ has (assuming evenly divisible dimensions for simplicity) $N_c^p = N_c / M_c$ columns and $N_r^p = N_r / M_r$ rows.
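The divider selection described above can be summarized in a few lines. The following is a minimal, illustrative sketch in C, with names of our own choosing; it is not taken from the MT3DMS source and simply follows the block-size rule described above.

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Illustrative only: choose the number of partitions in column (mc) and row (mr)
       direction for m processes on an ncol x nrow grid, following the uniform strategy. */
    static void choose_divider(int ncol, int nrow, int m, int *mc, int *mr)
    {
        int b = (int)sqrt((double)(ncol * nrow) / (double)m);   /* optimal block size */
        if (b < 1) b = 1;
        int target = (nrow <= ncol) ? nrow / b : ncol / b;      /* blocks along smallest dim */
        int best = 1;
        for (int d = 1; d <= m; d++) {                          /* all dividers of m */
            if (m % d != 0) continue;
            if (abs(d - target) < abs(best - target)) best = d; /* closest to the optimum */
        }
        if (nrow <= ncol) { *mr = best; *mc = m / best; }
        else              { *mc = best; *mr = m / best; }
    }

    int main(void)
    {
        int mc, mr;
        choose_divider(8, 6, 4, &mc, &mr);      /* the Figure 2.2 example */
        printf("Mc = %d, Mr = %d\n", mc, mr);   /* expected: Mc = 2, Mr = 2 */
        return 0;
    }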
Basic partitioning algorithm, correction for active cells

Given the $M_c$ and $M_r$ determined by the basic partitioning algorithm above, the user can automatically optimize the block dimensions $N_c^p$ and $N_r^p$, and hence the load balance, by specifying an integer grid with a weight for each cell (> 0 or 0). For example, consider Figure 2.2 for the case that a weight grid is used having all ones, except in the last two columns. This example corresponds to $N_{act} = 36$ active cells, where the last two columns contain inactive cells only. The cumulative row sums are then 6, 12, 18, 24, 30, 36, 36, 36 (one entry per column) and the cumulative column sums are 6, 12, 18, 24, 30, 36 (one entry per row). For determining $N_c^p$, an active-cell block size is introduced as $b_{act} = \mathrm{int}(N_{act}/M_c)$, which equals 18 for this example. For each column-direction index $j = 1, \ldots, M_c$, the entry in the row-sum array that is closest to $j\,b_{act}$ is searched and taken as the end column of block $j$. For this example this is the third column, shifting the vertical interface one column to the left. The row direction is treated in a similar way.
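For illustration, the end-column search over the cumulative sums can be sketched as follows. This is illustrative C with hypothetical names, not the MPP1PARTBAL code itself.

    #include <stdio.h>
    #include <stdlib.h>

    /* Illustrative sketch: given cumulative active-cell sums per column (cum[0..ncol-1],
       with cum[ncol-1] = Nact) and mc column blocks, return the end column of each block
       as the column whose cumulative sum is closest to j * bact, bact = Nact / mc. */
    static void end_columns(const int *cum, int ncol, int mc, int *endcol)
    {
        int bact = cum[ncol - 1] / mc;               /* active-cell block size */
        for (int j = 1; j <= mc; j++) {
            int target = j * bact, best = 0;
            for (int i = 1; i < ncol; i++)
                if (abs(cum[i] - target) < abs(cum[best] - target)) best = i;
            endcol[j - 1] = best + 1;                /* 1-based column number */
        }
        endcol[mc - 1] = ncol;                       /* last block always ends at NCOL */
    }

    int main(void)
    {
        /* Figure 2.2 example: the last two of the eight columns are inactive. */
        int cum[8] = { 6, 12, 18, 24, 30, 36, 36, 36 }, endcol[2];
        end_columns(cum, 8, 2, endcol);
        printf("end columns: %d %d\n", endcol[0], endcol[1]);   /* expected: 3 8 */
        return 0;
    }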
User defined partitioning
Instead of automatic partitioning, the user can specify the non-overlapping partitions entirely by input. This is done by specifying $M_c$ and $M_r$, and the starting columns ($M_c$ values) and starting rows ($M_r$ values) of the partitions.

Amount of overlap

To couple the resulting blocks with each other, an amount of overlap has to be introduced. Because of the third-order stencil that is used for the TVD advection scheme, an overlap of two so-called ghost cells is added, see the shaded cells in Figure 2.2. Each non-overlapping partition $p$ has $N_p = N_c^p N_r^p N_l$ grid cells, not including ghost cells. Introducing the overlap of two results in overlapping partitions $\bar{p}$ having $\bar{N}_p = (N_c^p + 2\delta_E + 2\delta_W)(N_r^p + 2\delta_N + 2\delta_S)\,N_l$ cells, where $\delta_E = 1$ if the block has an eastern neighbour and $\delta_E = 0$ if not, etc. The overlapping dimensions for block $p$ are denoted by $\bar{N}_c^p = N_c^p + 2\delta_E + 2\delta_W$ for the number of columns and $\bar{N}_r^p = N_r^p + 2\delta_N + 2\delta_S$ for the number of rows, so that the total number of nodes is $\bar{N}_p = \bar{N}_c^p \bar{N}_r^p N_l$. Each process is responsible for the computation of the cells in $p$. To do this in parallel, each process needs to synchronize the concentrations in the ghost cells. The following section describes in more detail how this is accomplished.

2.3 Communication of data
Communication of data can be local or global. Local communication (LC) only involves the exchange of data between neighbouring processes. For MT3DMS, these data are concentrations for the TVD scheme and search directions for the implicit solver. Global communication (GC) involves the exchange of data for all processes (to-all, one-to-all, all-to-one). This is necessary for synchronizing transport time step, mass budgets, scaling factors and Euclidean norms in the implicit solver. In general, local communication is less expensive than global communication, especially for a large number of processors.
Local communication
Local communication is illustrated in Figure 2.2. Before computing the new concentration subject to advection, all processes first have to update their concentrations in the “receive” ghost cells (dark gray shaded). This is done by sending the concentrations for the “send” ghost cells that lie in the non-overlapping partition (light gray shaded). For example, Master process P0 sends to:
• P1 (east):
  – Overlap 1: cells 4, 12, 20
  – Overlap 2: cells 3, 11, 19
• P2 (south):
  – Overlap 1: cells 17, 18, 19, 20
  – Overlap 2: cells 9, 10, 11, 12
• P3 (South-East):
  – Overlap 1: cell 20
  – Overlap 2: cells 11, 12, 19
At the same time process P1 sends concentrations to P0 (West), P2 (South-East), P3 (South), process P2 sends to P0 (North), P1 (North-West), P3 (East), and process P3 sends to P0 (North-West), P1 (North) and P2 (West). After sending concentrations, P0 receives concentrations from three neighbours:
• P1 (east):
  – Overlap 1: cells 5, 13, 21
  – Overlap 2: cells 6, 14, 22
• P2 (south):
  – Overlap 1: cells 25, 26, 27, 28
  – Overlap 2: cells 33, 34, 35, 36
• P3 (South-East):
  – Overlap 1: cell 29
  – Overlap 2: cells 30, 37, 38
At the same time process P1 receives concentration from P0 (West), P2 (South-East), P3 (South), process P2 receives from P0 (North), P1 (North-West), P3 (East), and process P3 receives from P0 (North-West), P1 (North) and P2 (West).
Local communication is done with MPI using non-blocking send and non-blocking receive.
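A minimal sketch of such a non-blocking exchange is given below. It is illustrative C only, not the MPP1LXCH routine itself: the function name, arguments and neighbour lists are our own, and the packing of the ghost-cell concentrations into contiguous buffers is assumed to have been done beforehand.

    #include <mpi.h>
    #include <stdlib.h>

    /* Illustrative sketch: non-blocking ghost-cell exchange with nneigh neighbouring
       processes. sendbuf[k] and recvbuf[k] are pre-packed contiguous buffers of
       count[k] values exchanged with process rank neighbour[k]. */
    static void exchange_ghosts(double **sendbuf, double **recvbuf, const int *count,
                                const int *neighbour, int nneigh, MPI_Comm comm)
    {
        MPI_Request *req = malloc(2 * (size_t)nneigh * sizeof *req);
        int nreq = 0;
        for (int k = 0; k < nneigh; k++)      /* post the non-blocking receives first */
            MPI_Irecv(recvbuf[k], count[k], MPI_DOUBLE, neighbour[k], 0, comm, &req[nreq++]);
        for (int k = 0; k < nneigh; k++)      /* then the matching non-blocking sends */
            MPI_Isend(sendbuf[k], count[k], MPI_DOUBLE, neighbour[k], 0, comm, &req[nreq++]);
        MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);   /* wait for all transfers */
        free(req);
        /* the received ghost-cell concentrations can now be unpacked into CNEW */
    }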
Global communication
Global communication is necessary to synchronize parameters that need to be evaluated over all processes, such as the minimal transport time step. For this example, each process computes the transport time step satisfying the Courant stability constraint for its partition, sends this time step to all other processes (all-to-all communication), and each process computes the minimum and uses this value as the current transport time step ($\min(x_p)$, $p = 0, \ldots, M-1$). The all-to-all summation ($\sum_{p=0}^{M-1} x_p$) is used for the total mass and for the inner products of vectors needed by the implicit solver. The all-to-all maximum ($\max(x_p)$, $p = 0, \ldots, M-1$) is used in the implicit solver for the scaling and closure parameters. Furthermore, observation point data is gathered and written by the Master process P0 as a user option (all-to-one communication).
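For illustration, the corresponding MPI collectives could look as follows. This is a sketch in C; the MT3DMS code itself uses the Fortran wrappers MTMPI_ALLMINR, MTMPI_ALLSUMR and MTMPI_ALLMAXR listed in Table 3.7, and the variable names below are ours.

    #include <mpi.h>

    /* Illustrative sketch of the three global (all-to-all) reductions used:
       minimum of the local transport time step, sum of the local mass,
       and maximum of a local scaling/closure parameter. */
    void global_exchange(double dt_local, double mass_local, double scale_local,
                         double *dt, double *mass, double *scale, MPI_Comm comm)
    {
        MPI_Allreduce(&dt_local,    dt,    1, MPI_DOUBLE, MPI_MIN, comm);
        MPI_Allreduce(&mass_local,  mass,  1, MPI_DOUBLE, MPI_SUM, comm);
        MPI_Allreduce(&scale_local, scale, 1, MPI_DOUBLE, MPI_MAX, comm);
    }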
2.4 GCG implicit solver
The Generalized Conjugate Gradient (GCG) implicit solver is used by default in MT3DMS for solving the linear algebraic system $Au = b$, where $A$ is a general symmetric or non-symmetric coefficient matrix that is real and non-singular and has a sparse structure containing 7 or 19 diagonals (full dispersion tensor). The vector $u$ is the concentration solution vector to be solved for, and $b$ is the right-hand side vector. The GCG solver is parallelized by a so-called Schwarz domain decomposition approach, see e.g. Saad (2000). Partitioning results in a reordering of coefficients; for the example of $M = 6$ processes, with $M_c = 2$ and $M_r = 3$, the reordered system reads
$$
\begin{pmatrix}
A_{0,0} & A^{E}_{0,1} & A^{S}_{0,2} & A^{SE}_{0,3} & & \\
A^{W}_{1,0} & A_{1,1} & A^{SW}_{1,2} & A^{S}_{1,3} & & \\
A^{N}_{2,0} & A^{NE}_{2,1} & A_{2,2} & A^{E}_{2,3} & A^{S}_{2,4} & A^{SE}_{2,5} \\
A^{NW}_{3,0} & A^{N}_{3,1} & A^{W}_{3,2} & A_{3,3} & A^{SW}_{3,4} & A^{S}_{3,5} \\
 & & A^{N}_{4,2} & A^{NE}_{4,3} & A_{4,4} & A^{E}_{4,5} \\
 & & A^{NW}_{5,2} & A^{N}_{5,3} & A^{W}_{5,4} & A_{5,5}
\end{pmatrix}
\begin{pmatrix} u_0 \\ u_1 \\ u_2 \\ u_3 \\ u_4 \\ u_5 \end{pmatrix}
=
\begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \end{pmatrix},
\qquad (1)
$$

where $A_{i,i}$ is the interior coefficient matrix for block $i$ and $A^{*}_{i,j}$, $i \neq j$, are the connection matrices, with the superscript $*$ denoting the interface. For example, $A^{E}_{0,1}$ means that block 0 has an Eastern interface to block 1. For the case of a 7-point stencil (no full dispersion tensor), the connection matrices with superscript SE, SW, NE and NW are all equal to zero. The vectors $u_i$ and $b_i$ are the computed concentration vector and the right-hand side vector on block $i$, respectively. In the additive Schwarz method, the (left) preconditioned system $Q^{-1} A u = Q^{-1} b$ is solved by choosing as parallel preconditioner $Q$ the block diagonal of $A$ (block-Jacobi). This means that the subdomain solutions, i.e. solutions of the form $Q_{i,i} v_i = A_{i,i} v_i = w_i$, can be computed entirely in parallel. The resulting GCG-Schwarz method with block-Jacobi preconditioning uses the GCG preconditioners (Jacobi, SSOR, MIC) to solve the subdomain problem $A_{i,i} v_i = w_i$ with only one preconditioning step, and hence inaccurately (Brakkee et al., 1998). A derivation of the GCG algorithm is given by Jea and Young (1983)³.

3. Here, the original paper is referred to and not the MT3DMS manual, since the manual contains errors in the presented algorithms.

Parallelization of the GCG solver only involves a few basic operations:

Applying a mask array for computing inner (vector-vector) products,
$$
\langle x, x \rangle_{\mathrm{mask}} = \sum_{i=1}^{\bar{N}_p} m_i\, x_i^2,
\qquad
m_i = \begin{cases} 1, & i \in p, \\ 0, & \text{else}, \end{cases}
$$
where $\bar{N}_p$ is the number of nodes including the ghost cells and $p$ is the non-overlapping partition.
Local communication (LC): exchange of vectors for the ghost cells.
Global communication (GC): global sum and global maximum over all processes.
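A masked inner product followed by a global sum could be sketched as follows. This is illustrative C only; in the MT3DMS code this functionality is provided, roughly, by MPP1MASKN in combination with MPP1GXCHIP, and the function below is not part of that code.

    #include <mpi.h>

    /* Illustrative sketch: local inner product over the cells the process owns
       (mask[i] = 1 inside the non-overlapping partition, 0 for ghost cells),
       followed by a global sum so that every process obtains the same value. */
    double masked_dot(const double *x, const double *y, const int *mask,
                      int n_with_ghosts, MPI_Comm comm)
    {
        double local = 0.0, global = 0.0;
        for (int i = 0; i < n_with_ghosts; i++)
            if (mask[i]) local += x[i] * y[i];
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);   /* GC: sum */
        return global;
    }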
The GCG algorithm (CG and Lanczos/ORTHOMIN) is given below in pseudo-code for each process, with the MPP modifications highlighted on the right (GC = global communication, LC = local communication; δ is the preconditioned residual, δ̃ its shadow counterpart in the Lanczos/ORTHOMIN branch, p and q are search directions, and ε is the maximum concentration change):

    α := max_i |u_i^(0)|                                   GC: maximum of α
    Scale:  b := α⁻¹ b ;   u^(0) := α⁻¹ u^(0)
    ‖b‖² := <b, b>_mask
    r^(0) := b - A u^(0)                                   LC of r^(0)
    ‖r^(0)‖² := <r^(0), r^(0)>_mask                        GC: sum of ‖b‖² and ‖r^(0)‖²
    ‖b‖ := sqrt(‖b‖²) ;   ‖r^(0)‖ := sqrt(‖r^(0)‖²)
    Stopping test:  ‖r^(0)‖ / ‖b‖ ≤ min(1E-6, CCLOSE)
    Solve Q δ^(0) = r^(0)                                  LC of δ^(0)
    if (A non-symmetric) then                              (Lanczos/ORTHOMIN)
        p^(0) := δ^(0) ;   q^(0) := δ^(0) ;   δ̃^(0) := δ^(0) ;   n := 0
        while ε > CCLOSE do
            ρ_n := <δ^(n), δ̃^(n)>_mask                     GC: sum of ρ_n
            if n > 0 then
                β_n := ρ_n / ρ_{n-1}
                p^(n) := δ^(n) + β_n p^(n-1)
                q^(n) := δ̃^(n) + β_n q^(n-1)
            end if
            Solve Q η = A p^(n) ;   Solve Qᵀ ξ = q^(n)     LC of η and ξ
            σ_n := <q^(n), η>_mask                         GC: sum of σ_n
            α_n := ρ_n / σ_n
            u^(n+1) := u^(n) + α_n p^(n)
            δ^(n+1) := δ^(n) - α_n η
            δ̃^(n+1) := δ̃^(n) - α_n Aᵀ ξ
            ε := max_i |u_i^(n+1) - u_i^(n)|  (masked)     GC: maximum of ε
            n := n + 1
        end do
    else if (A is symmetric positive definite) then        (CG)
        p^(0) := δ^(0) ;   n := 0
        while ε > CCLOSE do
            if n > 0 then
                ρ_n := <r^(n), δ^(n)>_mask                 GC: sum of ρ_n
                β_n := ρ_n / ρ_{n-1}
                p^(n) := δ^(n) + β_n p^(n-1)
            end if
            σ_n := <δ^(n), p^(n)>_mask                     GC: sum of σ_n
            Solve Q η = A p^(n)                            LC of η
            γ_n := <p^(n), η>_mask                         GC: sum of γ_n
            α_n := σ_n / γ_n
            u^(n+1) := u^(n) + α_n p^(n)
            δ^(n+1) := δ^(n) - α_n η
            ε := max_i |u_i^(n+1) - u_i^(n)|  (masked)     GC: maximum of ε
            n := n + 1
        end do
    end if
    Unscale:  b := α b ;   u^(n+1) := α u^(n+1)

The current implementation of the GCG solver is inefficient for the case that $A$ is a diagonal matrix and the flow is inherently explicit. In this case $A$ can easily be inverted and the new concentration entries of $u^{(1)}$ can be computed directly by $u_i^{(1)} = b_i / a_{i,i}$, $i = 1, \ldots, N_p$, for all non-zero diagonal entries $a_{i,i}$. This can significantly save computing time, since it avoids unnecessary initialization overhead for the preconditioning.

2.5 PETSc implicit solver
Instead of the parallel GCG solver, a parallel PETSc solver can be selected if this is desired (Balay et al., 2008). PETSc stands for Portable, Extensible Toolkit for Scientific Computation, and this toolkit contains a wide range of parallel solver packages. PETSc has been widely used for more than two decades now for a large number of application fields.
For MT3DMS, an interface to PETSc 3.1 has been constructed, using the KSP (Krylov Subspace Methods) package. For the case that $A$ is symmetric positive definite the CG method is used; otherwise the commonly used GMRES method (Generalized Minimal Residual method, Saad, 2000). In PETSc these methods are set by KSPCG and KSPGMRES, respectively. By default, the GMRES method is restarted after 100 inner iterations. The parallel preconditioner $Q$ is chosen block-Jacobi, similar to the parallel GCG solver. The subdomain solve $A_{i,i} v_i = w_i$ is obtained by an incomplete LU factorization with zero fill-in (ILU(0)). Similar to the parallel GCG solver, the subdomain solution is obtained by preconditioning only.
The general set-up for PETSc is as follows:
1. Initialize PETSc variables;
2. Build the linear system;
3. Set the solver settings;
4. Solve the linear system.
A minimal sketch of these steps is given at the end of this section. Special attention was needed for:
Stopping test. The GCG stop criterion is based on the maximum norm of the difference in solution, whereas PETSc is by default based on the 2-norm of the residual. Although the PETSc manual states that the stop criterion can be customized through KSPSetConvergenceTest, it did not seem to be possible to reproduce the GCG stopping test. Hence, PETSc converges if $\|r^{(k)}\|_2 \le \max(\mathrm{CCLOSE} \cdot \|b\|_2,\, 10^{-50})$ and diverges if $\|r^{(k)}\|_2 \ge 10^{5}\, \|b\|_2$.
Node numbering. PETSc requires that the linear system is filled according to Eq. (1), meaning that besides the MT3DMS node numbering $n_1$ (referred to as the global node number in Figure 2.2) and the local partition node numbering $n_2$ (referred to as the local node number in Figure 2.2), a third node numbering $n_3$ is introduced by $n_3 = n_2 + \sum_{i=0}^{p-1} N_i$ for partition $p$, where $N_i$ is the number of non-overlapping nodes of partition $i$.

Dry or inactive cells. Cells that are inactive are treated such that the corresponding matrix row $i$ of $A$ only has a diagonal entry, set to $-1$, and $u_i = b_i = u_i^{(0)}$ for the inactive cell $i$. For each flow time step, cells can become dry or inactive as determined by the Flow Model Interface package. If this happens, the matrix connectivities change and have to be re-evaluated.
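As announced above, the four set-up steps can be illustrated with a small stand-alone KSP example. This is a hedged sketch only: it is written against a recent PETSc release (call signatures differ slightly from the PETSc 3.1 interface used by the PET package), the values of n_local and cclose are placeholders, and the matrix is filled with placeholder (identity) coefficients instead of the transport coefficients assembled by PET1PETSETCOEF.

    #include <petscksp.h>

    int main(int argc, char **argv)
    {
        Mat A; Vec u, b; KSP ksp; PC pc;
        PetscInt n_local = 100, istart, iend, i;   /* hypothetical number of owned nodes */
        PetscReal cclose = 1.0e-6;                 /* CCLOSE from the PET input file */

        PetscInitialize(&argc, &argv, NULL, NULL);

        /* distributed sparse matrix; preallocation of up to 7 nonzeros per row */
        MatCreateAIJ(PETSC_COMM_WORLD, n_local, n_local, PETSC_DETERMINE,
                     PETSC_DETERMINE, 7, NULL, 7, NULL, &A);
        MatGetOwnershipRange(A, &istart, &iend);
        for (i = istart; i < iend; i++)            /* placeholder coefficients only */
            MatSetValue(A, i, i, 1.0, INSERT_VALUES);
        MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
        MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

        VecCreateMPI(PETSC_COMM_WORLD, n_local, PETSC_DETERMINE, &b);
        VecDuplicate(b, &u);
        VecSet(b, 1.0);                            /* placeholder right-hand side */
        VecSet(u, 0.0);                            /* initial concentration guess */

        KSPCreate(PETSC_COMM_WORLD, &ksp);
        KSPSetOperators(ksp, A, A);
        KSPSetType(ksp, KSPGMRES);                 /* or KSPCG when A is SPD */
        KSPGMRESSetRestart(ksp, 100);              /* restart after 100 inner iterations */
        KSPGetPC(ksp, &pc);
        PCSetType(pc, PCBJACOBI);                  /* block-Jacobi; ILU(0) per block is
                                                      the PETSc default sub-preconditioner */
        KSPSetTolerances(ksp, cclose, 1.0e-50, 1.0e5, 1000);  /* rtol, abstol, dtol, maxits */
        KSPSetInitialGuessNonzero(ksp, PETSC_TRUE);

        KSPSolve(ksp, b, u);                       /* the new concentrations end up in u */

        KSPDestroy(&ksp); MatDestroy(&A); VecDestroy(&u); VecDestroy(&b);
        PetscFinalize();
        return 0;
    }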
2.6 Supported features

Table 2.1 shows, for each MT3DMS package, whether it is supported in combination with the MPP package.
Table 2.1 : Supported and not supported packages for MPP.
Package Supported Restrictions
Basic Transport (BTN) package Y
Advection (ADV) package Y particle tracking MOC, MMOC or HMOC not supported
Dispersion (DSP) package Y
Sink and Source Mixing (SSM) package Y recirculation well option not supported
Chemical Reaction (RCT) Package Y
Generalized Conjugate Gradient (GCG) package Y
Flow Model Interface Package (LKMT3) Y Multi-node Well (MNW) package not supported
Transport Observation (TOB) package N
3 Program design
The MT3DMS v5.30 code was modified to support distributed memory parallelization (MPP package) and the parallel PETSc solvers (PET package). Code adjustments were kept as minimal as possible, to retain the modularity and to allow the code to be merged with newer MT3DMS versions. Table 3.1 summarizes the code changes, the parts of the code they involve, and the specific reasons for changing them. In Tables 3.2 to 3.7 the new subroutines and functions are described. Figures 3.1 to 3.4 show flow charts of the main program and the implicit solvers. Some code adjustments (CA) require further explanation; see Table 3.1 for the numbering.
CA 1-6 (MPP initialization)
These code adjustments are made for the initialization of the MPP package. The basic idea behind them is that all MPI processes execute until the point where the grid dimensions are read (BTN5DF), then “rewind” the program, after which the partitioning takes place. Figure 3.1 shows how this is accomplished. First, all MPI processes do exactly the same: returning from MPP1INI1 they have the flag MPPTYP set to MPPINI1, marking the beginning of MPP phase 1. In this phase, the program files are first opened in BTN5OPEN and the unit numbers are stored in a temporary array (the LIST output file is labelled negative, since this file needs to be removed later on). Then, the partitioning options are read from INMPP in MPP1INI2. After reading the grid dimensions NCOL and NROW in BTN5DF, all processes call MPP1INI3, where they rewind INMPP and close all other files. Besides closing, the Master process deletes the LIST file. After all processes return with the logical MPPREST being true, this is the end of MPP initialization phase 1 and the start of MPP phase 2 (MPPTYP = MPPINI2). All processes return in the main program, continue at label 100, and enter the partitioning subroutine MPP1PART, where they determine their partition and their own NCOL and NROW. In BTN5DF these are set by calling the subroutine MPP1SETRCL, right after the (global) NCOL and NROW have been read.
Although this way of initialization may seem a bit cumbersome, it is flexible when the code is extended with a more complex partitioning algorithm, such as MeTiS (see Karypis and Kumar, 1998). In that case, a possibility is that MPP phase 1 is executed by the Master process only, which performs the partitioning through calls to the MeTiS library and writes a partition data file for each process. After this, all processes continue with MPP phase 2, and each process reads its MeTiS partition data file in MPP1PART.
CA 24, 38-40 (file readers)
Table 3.1: Code adjustments of the MT3DMS code v5.30.

Nr. | Code adjustment | Subroutine(s) / Function(s) | Reason
1 | MPPREST logical | main program | flag introduced for MPP initialization
2 | subroutine call added to MPP1PART | main program | partitioning of the grid in row and column direction
3 | subroutine call added to MPP1INI1 | main program | MPI initialize, set flags, etc.
4 | subroutine call added to MPP1LXCHINI | main program | initialize local communication data structures
5 | subroutine call added to MPP1INI2 | main program | read data from input files
6 | subroutine call added to MPP1INI3 | main program | close and rewind input files
7 | INMPP and INPET added | main program | packages MPP and PET added
8 | subroutine call added to PET1AL | main program | allocate space for data arrays needed by the PET package
9 | check for GCG solver | main program | changed into a general check for an implicit solver (the PETSc solver can also be selected)
10 | subroutine call added to MPP1RP | main program | MPP package read and prepare: only write to IOUT
11 | subroutine call added to PET1RP | main program | PETSc package read and prepare
12 | subroutine call added to MPP1LXCH | main program | local exchange of concentrations CNEW
13 | subroutine call added to PET1AP | main program | solve PETSc linear system for obtaining new concentrations CNEW
14 | subroutine call added to MT_MPIFINALIZE | main program | finalize MPI processes
15 | WRITESTO logical added | ADV5AL, BTN5RP, BTN5AD, BTN5OT, main program | to assure that only Master process P0 writes to standard output
16 | MPP check added (MPP1CHECK) | ADV5AL, BTN5AL, FMI5AL, HSS5AL, RCT5AL, SSM5AL, TOB5AL, GCG5AP | check for unsupported MPP features
17 | nodal mask applied to RMASIO by call to MPP1MASK | ADV5SV, SADV5F, ADV5BD, SADV5U, BTN5BD, DSP5BD, RCT5BD, SSM5BD | only account for cells for which the process is responsible
18 | nodal mask applied to TMASS by call to MPP1MASK | BTN5AD | only account for cells for which the process is responsible
19 | IF-statement splitting | SADV5M, SADV5F, SADV5U, SADV5Q, SAVD5F, SSM5FM, SSM5BD | changed to solve boundary check error for the Intel compiler
20 | MPP and PET packages added | BTN5OPEN | add MPP and PET packages to the name file
21 | subroutine call added to MPP1STORELUN | BTN5OPEN | MPP phase 1: store unit numbers
22 | subroutine call added to MPP1FNAME | BTN5OPEN, BTN5RP | MPP phase 2: append file name with process ID (LIST, UCN, CNF, OBS optional)
23 | subroutine call added to MPP1SETRCL | BTN5DF | MPP phase 2: set local number of columns and rows
24 | point read changed by call to new subroutine CHKLOC | BTN5RP, READPS, READGS, SSM5RP, TOB5RP | read clipped point data: check if used and renumber coordinates
25 | subroutine call added to MPP1READOBS | BTN5RP | MPP phase 2: store observation global node numbers (optional merge option)
26 | subroutine call added to MPP1WRTOBSHDR | BTN5RP | MPP phase 2: write global node numbers of observation nodes (optional merge option)
27 | subroutine call added to MPP1GXCHTIME | BTN5AD | global exchange to compute minimum of transport time step DTRANS
28 | subroutine call added to MPP1GXCHMASS1 | BTN5AD | global exchange to compute sum of mass TMASS
29 | subroutine call added to MPP1GXCHMASS2; changed code order | BTN5BD | combined global exchange to compute sum of masses TMASS and RMASIO
30 | subroutine call added to MPP1GXCHOBS | BTN5OT | Master process gathers observation point data from Slaves and writes to output file (optional merge option)
31 | subroutine call added to PET1ICBUNDCHG | FMI5RP | store flag indicating that ICBUND has changed (PET package only)
32 | diagonal check added: convergence check and call to DIAGMATCHK | GCG5AP | check for diagonal matrix and, if true, directly compute CNEW and return
33 | subroutine calls added to MPP1GXCHMAX | GCG5AP | global maximum of SCALE or CHANGE
34 | subroutine calls added to MPP1MASKN | GCG5AP | skip summation for cells in overlap
35 | subroutine calls added to MPP1LXCH | GCG5AP | local exchange of vectors (search direction)
36 | subroutine calls added to MPP1GXCHIP | GCG5AP | global sum of inner products (vector-vector)
37 | dimension check skipped for FMI package | READHQ, READDS, READPS | check for grid dimensions skipped in case of MPP
38 | real array read changed by call to new subroutine RDARRAYR | READHQ, READDS, RARRAY | clipped read of real arrays
39 | integer array read changed by call to new subroutine RDARRAYI | READDS, IARRAY | clipped read of integer arrays
40 | IOUT > 0 added as a condition to write
Table 3.2: New subroutines and functions created for the MPP package (file mt_mpp1.for).
Name Description
MPP1CHECK subroutine: check for features that are not supported with MPP
MPP1FNAME subroutine: append output file name with process rank ID
MPP1GNODE2RANK function: get process rank ID for the process that is responsible for a MT3DMS global node number
MPP1GXCHDIAG subroutine: global sum of integer flag indicating if a matrix is diagonal; used for GCG and PETSc solver
MPP1GXCHIP subroutine: global sum over all processes used for computing inner products in GCG solver
MPP1GXCHMASS1 subroutine: global sum of mass over all processes used for TMASS
MPP1GXCHMASS2 subroutine: global sum of masses TMASS and RMASIO over all processes
MPP1GXCHMAX subroutine: global maximum over all processes used for SCALE and CHANGE in GCG solver
MPP1GXCHOBS subroutine: Master process gathers observation nodal data from Slave processes and writes to IOBS
MPP1GXCHTIME subroutine: global minimum over all processes used for DTRANS
MPP1INI1 subroutine: initialize MPI, set MPPTYP integer flag, set WRITESTO logical flag, initialize arrays to store unit numbers
MPP1INI2 subroutine: read data from INMPP, INADV for MIXELM, INGCG for NCRS and INPET for NCRS
MPP1INI3 subroutine: first, Slave processes rewind INMPP, INADV, INGCG, INPET and close all other files; second, the Master process does the same but also removes the LIST file
MPP1LOADEST subroutine: estimate work load based on simple counting of active cells
MPP1LXCH subroutine: local communication using non-blocking send and receive
MPP1LXCHINI subroutine: initialize local communication data structures: exchange partners, ranks, interface types, nodes
MPP1MASK subroutine: determine from its local coordinates if the process is responsible for this node
MPP1MASKN subroutine: determine from its local node number if the process is responsible for this node
MPP1PART subroutine: main partitioning subroutine to partition the grid in row and column direction
MPP1PARTBAL subroutine: partitioning algorithm by load balancing with a pointer grid defined by the user, using row and column sums
MPP1PARTDEF subroutine: default partitioning algorithm, uniform in row and column direction
MPP1PARTUSER subroutine: partitioning where the user has specified the blocks
MPP1RP subroutine: estimate work load; write partition information to output file IOUT
MPP1SETRCL subroutine: set new NLAY, NROW and NCOL; set clipping area parameters for clipped data read
MPP1STORELUN subroutine: store unit numbers that are to be closed or rewound later on
MPP1STOREOBS subroutine: store observation nodes' global MT3DMS node numbers and responsible process rank
MPP1WRTOBSHDR subroutine: write file header for observation points in IOBS
MPP1XPIDX subroutine: determine local communication index arrays
Table 3.3 : New FORTRAN subroutines and functions created for the PET package (file mt_pet1.for).
Name Description
PET1AL subroutine: allocate space for data arrays
PET1AP subroutine: solve linear system using PETSc library to obtain new concentrations
PET1G2LNODE function: determine global MT3DMS node number from local node number
PET1ICBUNDCHG subroutine: check if ICBUND has changed and hence the grid cell connectivities
PET1L2GNODE function: determine local node number from global MT3DMS node number
PET1L2PETGNODE function: determine global PETSc node number from local node number
PET1PETIDX subroutine: determine connectivity indices for the linear system
PET1RP subroutine: read and prepare data
Table 3.4 : New C routines created for the PET package (file mt_pet1.c).
Name Description
PET1PETCOEFINI PETSc initialize, allocate PETSc matrix, solution vector and right-hand side vector
PET1PETCOEFINI_ fortran interface wrapper for PET1PETCOEFINI
PET1PETCOEFINI__ fortran interface wrapper for PET1PETCOEFINI
PET1PETSETCOEF fill PETSc matrix, initial solution vector and right-hand side vector; assemble system
PET1PETSETCOEF_ fortran interface wrapper for PET1PETSETCOEF
PET1PETSETCOEF__ fortran interface wrapper for PET1PETSETCOEF
PET1PETSOLSET set PETSc solver settings (CG or GMRES, block-Jacobi, ILU(0), stop criterion, output)
PET1PETSOLSET_ fortran interface wrapper for PET1PETSOLSET
PET1PETSOLSET__ fortran interface wrapper for PET1PETSOLSET
PET1PETSOLVE PETSc solve linear system; error handling
PET1PETSOLVE_ fortran interface wrapper for PET1PETSOLVE
Table 3.5: Subroutines and functions added to the MT3DMS utilities (mt_utl5.for).
Name Description
CFN_FINDWORD subroutine: find word in string array
CFN_LENGTH function: wrapper for CFN_LENGTH2
CFN_LENGTH2 function: determine string length
CFN_S_TRIM subroutine: wrapper for CFN_TRIM2
CFN_S_TRIM2 subroutine: trim a string
CFN_UPCASE subroutine: wrapper for CFN_UPCASE2
CFN_UPCASE2 subroutine: make a string uppercase
CHKLOC subroutine: read clipped point data and check if used and renumber coordinates
CHKNOTSUPPORTED subroutine: check for supported packages for clipping read
DEFRDSZ subroutine: define clipping area parameters
GETIRCL subroutine: get coordinates from MT3DMS global node number
RDARRAYI subroutine: integer array reader suitable for clipping data
RDARRAYR subroutine: real array reader suitable for clipping read
Table 3.6 : Subroutine added to the GCG5 package (mt_gcg5.for).
Name Description
DIAGMATCHK subroutine: check if the coefficient matrix only has diagonal entries; if so, compute cnew(n) = rhs(n) / a(n,1)
Table 3.7 : MPI wrapper routines needed by the MPP and PETSc package (mt_mpp1_mpi.for).
Name Description
MTMPI_ALLMAXI subroutine: wrapper for MPI_ALLREDUCE to maximize integer values over all processes
MTMPI_ALLMAXR subroutine: wrapper for MPI_ALLREDUCE to maximize real values over all processes
MTMPI_ALLMINR subroutine: wrapper for MPI_ALLREDUCE to minimize real values over all processes
MTMPI_ALLSUMR subroutine: wrapper for MPI_ALLREDUCE to sum real values over all processes
MTMPI_BARRIER subroutine: wrapper for MPI_BARRIER to block until all processes reach this subroutine
MTMPI_COMM_RANK function: wrapper for MPI_COMM_RANK to determine own process rank ID
MTMPI_COMM_SIZE function: wrapper for MPI_COMM_SIZE to determine the number of processes
MTMPI_ERROR subroutine: MPI error message handling
MTMPI_FINALIZE subroutine: wrapper for MPI_FINALIZE to finalize MPI
MTMPI_GATHERVI subroutine: wrapper for MPI_GATHERV to gather integer arrays from all processes to one process
MTMPI_GATHERVR subroutine: wrapper for MPI_GATHERV to gather real arrays from all processes to one process
MTMPI_INIT subroutine: call MPI_INIT; zero communication buffers; get NRPROC and MYRANK
MTMPI_IRECVI subroutine: wrapper for MPI_IRECV for non-blocking receive of an integer array
MTMPI_IRECVR subroutine: wrapper for MPI_IRECV for non-blocking receive of a real array
MTMPI_ISENDI subroutine: wrapper for MPI_ISEND for non-blocking send of an integer array
MTMPI_ISENDR subroutine: wrapper for MPI_ISEND for non-blocking send of a real array
Figure 3.1: Flow chart of the main program, part 1: the allocation (AL) and read-and-prepare (RP) procedures of the packages, together with the MPP initialization (MPP1INI1, MPP1INI2, MPP1INI3), partitioning (MPP1PART, MPP1LXCHINI) and the MPPTYP flag values MPPSER, MPPINI1 and MPPINI2.
Figure 3.2: Flow chart of the main program, part 2: the stress period, flow time step, transport time step and outer iteration loops, including the local exchange of concentrations (MPP1LXCH) and the calls to the implicit solvers (GCG5AP, PET1AP).
Figure 3.3: Flow chart of the parallel GCG solver (GCG5AP): scaling, initial residual and stopping test, followed by the CG branch (A symmetric positive definite) or the Lanczos/ORTHOMIN branch (A non-symmetric), with the local exchanges and global sums/maxima that are carried out when MPP is active, and the direct computation of u when A is a diagonal matrix.
Figure 3.4: Flow chart of the PETSc solver (PET1AP): check for a diagonal matrix, determination of the connection indices (PET1PETIDX), allocation of the PETSc matrix and vectors (PET1PETCOEFINI), filling (PET1PETSETCOEF), solver settings (PET1PETSOLSET), solve (PET1PETSOLVE), and local exchange of u.
4 Input instructions
Input for the Massively Parallel Processing (MPP) package is read on unit INMPP = 5, which is preset in the main program. The input file is required only if the MPP package is used in the simulation. For each simulation, the input instructions are given in Table 4.1.
Table 4.1: Input instructions for the MPP package.

Massively Parallel Processing (MPP) Package

ID | Variable Name | Format | Explanation
G1 | MERGEOBS | free | flag indicating whether the Master process should merge the observation points. MERGEOBS = 0: each process writes its own observation file; MERGEOBS = 1: the Master process writes the observation file
G2 | PARTOPT | free | flag indicating the partition method in column and row direction. PARTOPT = 0: default algorithm assuming constant load; PARTOPT = 1: user-defined blocks; PARTOPT = 2: load balancing by a user-specified integer grid
(Enter G2a if PARTOPT = 1 or 2)
G2a | NRPROC_NCOL, NRPROC_NROW | free | number of partitions in column and row direction
(Enter G2b and G2c if PARTOPT = 1)
G2b | NRPROC_COL | free | array of length NRPROC_NCOL with the starting columns of the partitions
G2c | NRPROC_ROW | free | array of length NRPROC_NROW with the starting rows of the partitions
(Enter G2d if PARTOPT = 2)
G2d | LOADPTR | iarray | array used for load balancing in row and column direction; only entries different from 0 are used
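Purely as an illustration (the values below are hypothetical and the exact free-format record layout should be checked against the reader), an MPP input file for a user-defined 2 × 2 partitioning of the 8 × 6 grid of Figure 2.2 could contain the records G1, G2, G2a, G2b and G2c as:

    1
    1
    2 2
    1 5
    1 4

Here the Master process merges the observation output (MERGEOBS = 1), PARTOPT = 1 selects user-defined blocks, the two column blocks start at columns 1 and 5, and the two row blocks start at rows 1 and 4.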
Input for the Portable, Extensible Toolkit for Scientific Computation (PET) package is read on unit INPET = 6, which is preset in the main program. The input file is needed only if the PET package is used in the simulation. If the PET package is used, the MPP package must be used too. For each simulation, the input instructions are given in Table 4.2.
Table 4.2: Input instructions for the PET package.

Portable, Extensible Toolkit for Scientific Computation (PET) Package

ID | Variable Name | Format | Explanation
H1 | MXITER, ITER1, NCRS | free | maximum number of outer iterations; maximum number of inner iterations; flag for the treatment of the dispersion cross terms. NCRS = 0: dispersion cross terms lumped to the right-hand side; NCRS = 1: full dispersion tensor
H2 | CCLOSE | free | convergence criterion: 2-norm(residual) / 2-norm(right-hand side)
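Again purely as an illustration (hypothetical values), a PET input file consists of the two records H1 and H2, for example:

    50 30 0
    1.0E-6

i.e. at most 50 outer and 30 inner iterations, dispersion cross terms lumped to the right-hand side (NCRS = 0), and a relative convergence criterion of 1.0E-6.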
Note that in order to use the MPP and PET packages, MPI must be installed on your system. To use the PET package, PETSc must be installed as well.
5 Test problems
5.1 Examples
The parallel code was tested for correctness on the supported examples in the MT3DMS v5.30 distribution (21-02-2010). The examples were taken from the directory “examples\mf96mt3d\”, where MODFLOW-96 was used to compute the groundwater flow. For this, MODFLOW-96 with the link-MT3D interface (25-05-2003) was compiled with the Intel compiler under Linux. The examples were tested for 1, 2, 3, 4 and 8 processes. As a reference, the examples were run with a compiled clean version of MT3DMS v5.30.
Examples 4, 6 and 8 were not tested because they all use MOC advection, which is not supported. All tests gave results in the parallel case similar to those of the serial case. For one specific example, we will go into more detail.
Example 1: One-dimensional transport in a uniform flow field
This test case models one-dimensional solute transport involving advection, dispersion and some simple chemical reactions in a steady-state uniform flow field. The full problem specification can be found in Sec. 7.1 of Zheng, 1999. Here, advection was solved by the parallelized TVD scheme.
Test case 1a (advection only) has no difference (exactly zero) between serial and parallel runs, both using the GCG and PETSc solver. This is easily explained by the fact that no iterative solvers are invoked when dispersion is absent. Also, the parallel advection solver (ULTIMATE) is explicit in time and, therefore, produces results across multiple partitions that are identical to the results from a serial run.
Test case 1b (advection and dispersion) has no significant difference (up to machine precision) between serial and parallel runs, both using the GCG and PETSc solver. This is achieved by setting the tolerance for the iterative solvers to machine precision (e.g., CCLOSE=1E-16). For less strict values, serial and parallel results are identical up to that tolerance, as is to be expected.
Test case 1c (advection, dispersion and sorption) shows equality between runs up to machine precision, as for test case 1b. Figure 5.1a shows the numerical solutions obtained with the PETSc solver at Tend=2000 s for serial and parallel runs. The results for any number of partitions coincide.
Test case 1d (advection, dispersion, sorption and decay) shows the same behavior. Figure 5.1b shows the numerical solutions with the parallel GCG solver at Tend=2000 s for serial and parallel runs. Again, all results coincide.
Figure 5.1: Comparison of the calculated concentrations with the analytical solutions for the one-dimensional test problem. The analytical solutions are shown in solid lines with the numerical solutions in different symbols.
a) case 1c, solved by PETSc on 1, 2, 4 and 8 processes. b) case 1d, solved by GCG on 1, 2, 4, and 8 processes.
5.2 Large application model
Figure 5.2: Area of interest for the BeNeKempen model and model dimensions: 567 columns over 57 km, 389 rows over 39 km, 20 layers; the markers indicate the zinc smelters.
In cooperation with the Dutch knowledge centers Deltares and Alterra, several large contaminant transport models have been developed. The largest regional transport model, and the one using the most computer resources, is the recently developed so-called BeNeKempen model for the “Aktief Bodembeheer de Kempen” environmental program. This 3D transboundary model consists of ~5 million grid cells with a horizontal resolution of 100 × 100 m, covering an area of ~2200 km2, about 1/20 of the area of the Netherlands (Figure 5.2). This MT3DMS model simulates groundwater transport of cadmium and zinc originating from historical emissions of four zinc smelters in the vicinity of the Dutch-Belgian border. The typical simulation time used is 90 years.
The performance of the parallel code is benchmarked on the Dutch national compute cluster SARA-LISA⁴, with the following specifications:
• 512 nodes, each node 8 or 12 cores (4480 cores in total);
• Intel Xeon quad-core (416 nodes), Intel Xeon six-core (96 nodes);
• 2 GB internal memory per core;
• fast Infiniband interconnect, 1600 MB/s, latency < 6 μs;
and on the Deltares in-house H4 cluster:
• 221 nodes, each node 2 or 4 cores (448 cores in total);
• AMD dual-core (220 x-nodes), AMD quad-core (1 y-node);
• 4 GB internal memory, 250 GB local storage;
• slow 1 Gigabit Ethernet interconnect.
4. This work was sponsored by the Stichting Nationale Computerfaciliteiten (National Computing Facilities Foundation, NCF) for the use of supercomputer facilities, with financial support from the Nederlandse Organisatie voor Wetenschappelijk Onderzoek (Netherlands Organisation for Scientific Research, NWO).
(Charts of Figure 5.3: wall-clock time [hours] and speedup versus the number of processes, 1 to 32, for the BeNeKempen model on the Deltares H4 and SARA LISA clusters; the speedup panel includes the ideal case. See the caption below.)
The BeNeKempen model was run with up to 32 processes on both machines, with the MPP package selected in combination with the parallel GCG solver. The parallel GCG solver was selected because the PETSc solver introduces a large amount of initialization overhead when the flow is advection dominated, which is the assumption for the BeNeKempen model. True wall-clock times were measured, i.e. the time from start to end of the simulation as directly experienced by the user. The results are shown in Figure 5.3. A speedup of more than a factor 20 was obtained with 32 processes on the Dutch national compute cluster SARA-LISA, reducing the computing time from ~10 hours to ~30 minutes. The speedup on the H4 cluster decreases beyond 16 processes, which is probably caused by the slow Ethernet interconnect. Overall, the benchmarks show that the parallel MT3DMS code can significantly reduce large computing times.
Exactly the same results were reproduced in the parallel case as in the serial case; see Figure 5.4 for a typical observation point. Figure 5.5 shows how the default partitioning is done for the case of 16 blocks (4 × 4). Clearly, the load balance is not optimal: the 12 outer blocks do not carry the maximum load, because they contain a large number of inactive cells.
(Chart of Figure 5.4: cadmium concentration [kg/m3] versus time [days] at BeNeKempen observation point (6, 241, 67); see the caption below.)
Figure 5.4: Typical observation point showing a cadmium concentration time series for 90 years. Results are shown for the GCG solver, the parallel GCG solver (2, 4, 8, 16 processors), the PETSc solver (16 processors) and the case of no implicit solver for 16 processors (diagonal check activated).
Figure 5.3: Timing results for the parallel MT3DMS code (parallel GCG) on the SARA LISA cluster and the Deltares H4 cluster. Left: measured wall-clock time; right: speedup.
Figure 5.5: Computed cadmium concentration for layer = 1 and T = 90 years, showing the partitioning for the case of 16 processes (upper left block: P0; lower right block: P15).