MT3DMS, A Modular Three-Dimensional
Multispecies Transport Model
User Guide to the Massively Parallel Processing (MPP) Package and PETSc (PET) Package
1203355-003
© Deltares, 2011
Jarno Verkaik, Arthur van Dam, Aris Lourens
1203355-003-BGS-0001, 5 September 2011, final
Contents
1 Introduction
2 Numerical implementation
  2.1 Motivations for distributed parallel computing
  2.2 Partitioning
  2.3 Communication of data
  2.4 GCG implicit solver
  2.5 PETSc implicit solver
  2.6 Supported features
3 Program design
4 Input instructions
5 Test problems
  5.1 Examples
  5.2 Large application model
1 Introduction
Preserving good groundwater and surface water quality is of great importance for water consumption, agriculture, and the environment. Groundwater and surface water bodies that provide us with clean water are threatened by a large range of contaminants, e.g. nutrients, pesticides, or heavy metals. Several national and international policy programs exist to prevent this from happening, for example KRW, WB21, or Natura2000. In order to support decision makers, accurate and fast simulation models are required for predicting contaminant transport. This calls for very detailed models, which inherently require very large computing times.
MT3DMS is a widely used, public-domain code for modeling contaminant transport in groundwater. Over the last two decades a large number of MT3DMS transport models have been developed at Deltares, becoming more and more detailed and hence requiring more and more computing time. To speed up MT3DMS, a distributed memory parallelization approach is applied in this project that allows the code to run on high performance computers in a massively parallel way, i.e., using a very large number of processors. This report describes how this is realized.
The focus in this approach was on distributed memory parallelization of the Eulerian solvers, such as the TVD scheme, and not on the Eulerian-Lagrangian solvers (MOC), which require different kinds of parallelization techniques. Furthermore, parallelization of the implicit solver required some special attention, although this is of less importance for models where the contaminant transport is assumed to be advection dominated. From a software point of view, the MT3DMS code was changed at some points (e.g. the GCG solver), although these changes were kept as minimal as possible. Two new modules/packages were added: the MPP package (massively parallel processing) and the PET package (interface to the PETSc solver toolkit).
The layout of this report is as follows. Chapter 2 describes the numerical implementation, followed by a brief program design in Chapter 3 and input instructions in Chapter 4. In Chapter 5, results are presented for the test problems and a large application model.
2 Numerical implementation
Section 2.1 first gives some motivations for the chosen parallelization approach, followed by the partitioning strategy in Section 2.2 and the data communication mechanism in Section 2.3. Sections 2.4 and 2.5 describe the parallelization approach for the GCG solver and the PETSc solver, respectively. Section 2.6 concludes with a list of supported features.
2.1 Motivations for distributed parallel computing
Figure 2.1: shared-memory (a) versus distributed memory (b).
This report will not go into detail on the wide spectrum of existing parallelization techniques. For an introduction, the reader is referred to further readings on parallel computing, e.g. Grama et al. (2003).
A commonly used distinction between parallelization approaches is based on memory organization: shared memory versus distributed memory. In a shared-memory (or global-memory) computer architecture (Figure 2.1a), each processor can access each memory location of the entire memory system. In a distributed-memory (or local-memory) architecture, each processor has its own local memory, as illustrated in Figure 2.1b. A processor can only access its own local memory, and can only obtain data from another processor through communication over the interconnection network (the network that connects the processors with each other).
Shared memory parallelization is usually achieved by the OpenMP standard, and distributed memory by a Message Passing Interface (MPI) implementation, e.g. MPICH (Gropp et al. 2009). The shared memory approach usually involves implicit (or data) parallelization, meaning that computations are parallelized by setting compiler directives, e.g. for DO-loops in the code. The distributed memory approach usually involves explicit parallelization, meaning that the problem is physically distributed over different processes. Some advantages of distributed memory parallelization are:
Divide-and-conquer is a natural approach for splitting a large problem into smaller sub-problems;
Memory size is not limited to that of a single machine;
A very large number of processors can be used with good scalability (massively parallel processing);
Upgrading hardware (clusters/supercomputers) with more processors is usually cheaper than for shared memory machines.
Some downsides are:
Speed of the network (interconnect) can be a bottleneck for the performance (latency);
Coding can be hard, since communication between processes must be programmed explicitly;
Convergence rates of the implicit solver can be an issue, especially for dispersion-dominated groundwater flow.
Distributed memory parallelization is closely related to so-called domain decomposition methods, see e.g. www.ddm.org. Furthermore, the distributed memory approach does not exclude shared memory. Since the two coding approaches are largely complementary, both methods can be combined in one single code. As a hybrid approach, for example, one could think of distributed MPI processes¹ where, within each MPI process, the sub-problem is solved on a multi-core processor using OpenMP.

1. From now on we will speak of communication between MPI processes instead of processors, because more than one MPI process can run on a single processor.
2.2 Partitioning

Figure 2.2: Example showing how a grid of NCOL = 8 columns and NROW = 6 rows (NLAY = 1) is partitioned for four processes P0 to P3. The shaded cells indicate the send and receive cells for overlap 1 and overlap 2; for each cell both the global and the local node number are shown.
In the distributed approach, the MT3DMS grid is divided over the number of available processes, say $M = M_c M_r$, where $M_c$ and $M_r$ are the number of partitions in column and row direction, respectively. Each partition has the same number of layers, since the assumption is that for large applications the number of rows and columns is much greater than the number of layers. The blocks are numbered from the upper left ($P_0$, the Master process) to the lower right ($P_{M-1}$; the Slave processes). The partitioning is done such that a) each process has
more or less the same number of cells (work load) and b) the total length of the interfaces between the processes is minimal. In this way each process has about the same amount of work and the communication of data between the processes is minimal. The MPP package supports three options for partitioning:
1 Basic partitioning algorithm, uniform in column and row direction;
2 Basic partitioning algorithm, correcting for active cells in column and row direction;
3 User defined partitioning.
These options are explained in more detail below. First, a non-overlapping partition is determined.
Basic partitioning algorithm: uniform

For example, consider an MT3DMS computational grid with $N_c = 8$ columns, $N_r = 6$ rows and $N_l = 1$ layer, as shown in Figure 2.2. Given four processes ($M = 4$), the figure shows the resulting partitions for the case $M_c = M_r = 2$. The algorithm used to determine $M_c$ and $M_r$ is very straightforward. First, all candidate options (dividers) for $M_c$ and $M_r$ are determined such that $M = M_c M_r$ for the given $M$. For this example there are three dividers $M_c \times M_r$: 1) 1 × 4, 2) 2 × 2, and 3) 4 × 1. Second, the optimal block size $b$ is computed as $b = \mathrm{int}\!\left(\sqrt{N/M}\right)$, where $N = N_c N_r$; for this example $b = 3$. The smallest dimension, $N_r$ in this case, is divided by $b$, resulting in the optimal number of blocks in row direction ($M_r = 2$). Then the divider closest to this optimum is selected, hence $M_c = M_r = 2$. In this method each block $p$ has (assuming evenly divisible dimensions for simplicity) $N_c^p = N_c / M_c$ columns and $N_r^p = N_r / M_r$ rows.
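The divider selection described above can be summarized in a few lines. The following is a minimal, illustrative sketch in C, with names of our own choosing; it is not taken from the MT3DMS source and simply follows the block-size rule described above.

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Illustrative only: choose the number of partitions in column (mc) and row (mr)
       direction for m processes on an ncol x nrow grid, following the uniform strategy. */
    static void choose_divider(int ncol, int nrow, int m, int *mc, int *mr)
    {
        int b = (int)sqrt((double)(ncol * nrow) / (double)m);   /* optimal block size */
        if (b < 1) b = 1;
        int target = (nrow <= ncol) ? nrow / b : ncol / b;      /* blocks along smallest dim */
        int best = 1;
        for (int d = 1; d <= m; d++) {                          /* all dividers of m */
            if (m % d != 0) continue;
            if (abs(d - target) < abs(best - target)) best = d; /* closest to the optimum */
        }
        if (nrow <= ncol) { *mr = best; *mc = m / best; }
        else              { *mc = best; *mr = m / best; }
    }

    int main(void)
    {
        int mc, mr;
        choose_divider(8, 6, 4, &mc, &mr);      /* the Figure 2.2 example */
        printf("Mc = %d, Mr = %d\n", mc, mr);   /* expected: Mc = 2, Mr = 2 */
        return 0;
    }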
Basic partitioning algorithm, correction for active cells

Given the $M_c$ and $M_r$ determined by the basic partitioning algorithm above, the user can automatically optimize the block dimensions $N_c^p$ and $N_r^p$, and hence the load balance, by specifying an integer grid with a weight for each cell (> 0 or 0). For example, consider Figure 2.2 for the case that a weight grid is used having all ones, except in the last two columns. This example corresponds to $N_{act} = 36$ active cells, where the last two columns contain inactive cells only. The cumulative row sums are then 6, 12, 18, 24, 30, 36, 36, 36 (one entry per column) and the cumulative column sums are 6, 12, 18, 24, 30, 36 (one entry per row). For determining $N_c^p$, an active-cell block size is introduced as $b_{act} = \mathrm{int}(N_{act}/M_c)$, which equals 18 for this example. For each column-direction index $j = 1, \ldots, M_c$, the entry in the row-sum array that is closest to $j\,b_{act}$ is searched and taken as the end column of block $j$. For this example this is the third column, shifting the vertical interface one column to the left. The row direction is treated in a similar way.
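For illustration, the end-column search over the cumulative sums can be sketched as follows. This is illustrative C with hypothetical names, not the MPP1PARTBAL code itself.

    #include <stdio.h>
    #include <stdlib.h>

    /* Illustrative sketch: given cumulative active-cell sums per column (cum[0..ncol-1],
       with cum[ncol-1] = Nact) and mc column blocks, return the end column of each block
       as the column whose cumulative sum is closest to j * bact, bact = Nact / mc. */
    static void end_columns(const int *cum, int ncol, int mc, int *endcol)
    {
        int bact = cum[ncol - 1] / mc;               /* active-cell block size */
        for (int j = 1; j <= mc; j++) {
            int target = j * bact, best = 0;
            for (int i = 1; i < ncol; i++)
                if (abs(cum[i] - target) < abs(cum[best] - target)) best = i;
            endcol[j - 1] = best + 1;                /* 1-based column number */
        }
        endcol[mc - 1] = ncol;                       /* last block always ends at NCOL */
    }

    int main(void)
    {
        /* Figure 2.2 example: the last two of the eight columns are inactive. */
        int cum[8] = { 6, 12, 18, 24, 30, 36, 36, 36 }, endcol[2];
        end_columns(cum, 8, 2, endcol);
        printf("end columns: %d %d\n", endcol[0], endcol[1]);   /* expected: 3 8 */
        return 0;
    }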
User defined partitioning
Instead of automatic partitioning, the user can specify the non-overlapping partitions entirely by input. This is done by specifying $M_c$ and $M_r$, and the starting columns ($M_c$ values) and starting rows ($M_r$ values) of the partitions.

Amount of overlap

To couple the resulting blocks with each other, an amount of overlap has to be introduced. Because of the third-order stencil that is used for the TVD advection scheme, an overlap of two so-called ghost cells is added, see the shaded cells in Figure 2.2. Each non-overlapping partition $p$ has $N_p = N_c^p N_r^p N_l$ grid cells, not including ghost cells. Introducing the overlap of two results in overlapping partitions $\bar{p}$ having $\bar{N}_p = (N_c^p + 2\delta_E + 2\delta_W)(N_r^p + 2\delta_N + 2\delta_S)\,N_l$ cells, where $\delta_E = 1$ if the block has an eastern neighbour and $\delta_E = 0$ if not, etc. The overlapping dimensions for block $p$ are denoted by $\bar{N}_c^p = N_c^p + 2\delta_E + 2\delta_W$ for the number of columns and $\bar{N}_r^p = N_r^p + 2\delta_N + 2\delta_S$ for the number of rows, so that the total number of nodes is $\bar{N}_p = \bar{N}_c^p \bar{N}_r^p N_l$. Each process is responsible for the computation of the cells in $p$. To do this in parallel, each process needs to synchronize the concentrations in the ghost cells. The following section describes in more detail how this is accomplished.

2.3 Communication of data
Communication of data can be local or global. Local communication (LC) only involves the exchange of data between neighbouring processes. For MT3DMS, these data are concentrations for the TVD scheme and search directions for the implicit solver. Global communication (GC) involves the exchange of data for all processes (to-all, one-to-all, all-to-one). This is necessary for synchronizing transport time step, mass budgets, scaling factors and Euclidean norms in the implicit solver. In general, local communication is less expensive than global communication, especially for a large number of processors.
Local communication
Local communication is illustrated in Figure 2.2. Before computing the new concentration subject to advection, all processes first have to update their concentrations in the “receive” ghost cells (dark gray shaded). This is done by sending the concentrations for the “send” ghost cells that lie in the non-overlapping partition (light gray shaded). For example, Master process P0 sends to:
• P1 (east):
  – Overlap 1: cells 4, 12, 20
  – Overlap 2: cells 3, 11, 19
• P2 (south):
  – Overlap 1: cells 17, 18, 19, 20
  – Overlap 2: cells 9, 10, 11, 12
• P3 (South-East):
  – Overlap 1: cell 20
  – Overlap 2: cells 11, 12, 19
At the same time process P1 sends concentrations to P0 (West), P2 (South-East), P3 (South), process P2 sends to P0 (North), P1 (North-West), P3 (East), and process P3 sends to P0 (North-West), P1 (North) and P2 (West). After sending concentrations, P0 receives concentrations from three neighbours:
• P1 (east):
  – Overlap 1: cells 5, 13, 21
  – Overlap 2: cells 6, 14, 22
• P2 (south):
  – Overlap 1: cells 25, 26, 27, 28
  – Overlap 2: cells 33, 34, 35, 36
• P3 (South-East):
  – Overlap 1: cell 29
  – Overlap 2: cells 30, 37, 38
At the same time process P1 receives concentration from P0 (West), P2 (South-East), P3 (South), process P2 receives from P0 (North), P1 (North-West), P3 (East), and process P3 receives from P0 (North-West), P1 (North) and P2 (West).
Local communication is done with MPI using non-blocking send and non-blocking receive.
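A minimal sketch of such a non-blocking exchange is given below. It is illustrative C only, not the MPP1LXCH routine itself: the function name, arguments and neighbour lists are our own, and the packing of the ghost-cell concentrations into contiguous buffers is assumed to have been done beforehand.

    #include <mpi.h>
    #include <stdlib.h>

    /* Illustrative sketch: non-blocking ghost-cell exchange with nneigh neighbouring
       processes. sendbuf[k] and recvbuf[k] are pre-packed contiguous buffers of
       count[k] values exchanged with process rank neighbour[k]. */
    static void exchange_ghosts(double **sendbuf, double **recvbuf, const int *count,
                                const int *neighbour, int nneigh, MPI_Comm comm)
    {
        MPI_Request *req = malloc(2 * (size_t)nneigh * sizeof *req);
        int nreq = 0;
        for (int k = 0; k < nneigh; k++)      /* post the non-blocking receives first */
            MPI_Irecv(recvbuf[k], count[k], MPI_DOUBLE, neighbour[k], 0, comm, &req[nreq++]);
        for (int k = 0; k < nneigh; k++)      /* then the matching non-blocking sends */
            MPI_Isend(sendbuf[k], count[k], MPI_DOUBLE, neighbour[k], 0, comm, &req[nreq++]);
        MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);   /* wait for all transfers */
        free(req);
        /* the received ghost-cell concentrations can now be unpacked into CNEW */
    }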
Global communication
Global communication is necessary to synchronize parameters that need to be evaluated over all processes, such as the minimal transport time step. For this example, each process computes the transport time step satisfying the Courant stability constraint for its partition, sends this time step to all other processes (all-to-all communication), and each process computes the minimum and uses this value as the current transport time step ($\min(x_p)$, $p = 0, \ldots, M-1$). The all-to-all summation ($\sum_{p=0}^{M-1} x_p$) is used for the total mass and for the inner products of vectors needed by the implicit solver. The all-to-all maximum ($\max(x_p)$, $p = 0, \ldots, M-1$) is used in the implicit solver for the scaling and closure parameters. Furthermore, observation point data is gathered and written by the Master process P0 as a user option (all-to-one communication).
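For illustration, the corresponding MPI collectives could look as follows. This is a sketch in C; the MT3DMS code itself uses the Fortran wrappers MTMPI_ALLMINR, MTMPI_ALLSUMR and MTMPI_ALLMAXR listed in Table 3.7, and the variable names below are ours.

    #include <mpi.h>

    /* Illustrative sketch of the three global (all-to-all) reductions used:
       minimum of the local transport time step, sum of the local mass,
       and maximum of a local scaling/closure parameter. */
    void global_exchange(double dt_local, double mass_local, double scale_local,
                         double *dt, double *mass, double *scale, MPI_Comm comm)
    {
        MPI_Allreduce(&dt_local,    dt,    1, MPI_DOUBLE, MPI_MIN, comm);
        MPI_Allreduce(&mass_local,  mass,  1, MPI_DOUBLE, MPI_SUM, comm);
        MPI_Allreduce(&scale_local, scale, 1, MPI_DOUBLE, MPI_MAX, comm);
    }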
2.4 GCG implicit solver
The Generalized Conjugate Gradient (GCG) implicit solver is used by default in MT3DMS for solving the linear algebraic system $Au = b$, where $A$ is a general symmetric or non-symmetric coefficient matrix that is real and non-singular and has a sparse structure containing 7 or 19 diagonals (full dispersion tensor). The vector $u$ is the concentration solution vector to be solved for, and $b$ is the right-hand side vector. The GCG solver is parallelized by a so-called Schwarz domain decomposition approach, see e.g. Saad (2000). Partitioning results in a reordering of coefficients; for the example of $M = 6$ processes, with $M_c = 2$ and $M_r = 3$, the reordered system reads
$$
\begin{pmatrix}
A_{0,0} & A^{E}_{0,1} & A^{S}_{0,2} & A^{SE}_{0,3} & & \\
A^{W}_{1,0} & A_{1,1} & A^{SW}_{1,2} & A^{S}_{1,3} & & \\
A^{N}_{2,0} & A^{NE}_{2,1} & A_{2,2} & A^{E}_{2,3} & A^{S}_{2,4} & A^{SE}_{2,5} \\
A^{NW}_{3,0} & A^{N}_{3,1} & A^{W}_{3,2} & A_{3,3} & A^{SW}_{3,4} & A^{S}_{3,5} \\
 & & A^{N}_{4,2} & A^{NE}_{4,3} & A_{4,4} & A^{E}_{4,5} \\
 & & A^{NW}_{5,2} & A^{N}_{5,3} & A^{W}_{5,4} & A_{5,5}
\end{pmatrix}
\begin{pmatrix} u_0 \\ u_1 \\ u_2 \\ u_3 \\ u_4 \\ u_5 \end{pmatrix}
=
\begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \end{pmatrix},
\qquad (1)
$$

where $A_{i,i}$ is the interior coefficient matrix for block $i$ and $A^{*}_{i,j}$, $i \neq j$, are the connection matrices, with the superscript $*$ denoting the interface. For example, $A^{E}_{0,1}$ means that block 0 has an Eastern interface to block 1. For the case of a 7-point stencil (no full dispersion tensor), the connection matrices with superscript SE, SW, NE and NW are all equal to zero. The vectors $u_i$ and $b_i$ are the computed concentration vector and the right-hand side vector on block $i$, respectively. In the additive Schwarz method, the (left) preconditioned system $Q^{-1} A u = Q^{-1} b$ is solved by choosing as parallel preconditioner $Q$ the block diagonal of $A$ (block-Jacobi). This means that the subdomain solutions, i.e. solutions of the form $Q_{i,i} v_i = A_{i,i} v_i = w_i$, can be computed entirely in parallel. The resulting GCG-Schwarz method with block-Jacobi preconditioning uses the GCG preconditioners (Jacobi, SSOR, MIC) to solve the subdomain problem $A_{i,i} v_i = w_i$ with only one preconditioning step, and hence inaccurately (Brakkee et al., 1998). A derivation of the GCG algorithm is given by Jea and Young (1983)³.

3. Here, the original paper is referred to and not the MT3DMS manual, since the manual contains errors in the presented algorithms.

Parallelization of the GCG solver only involves a few basic operations:

Applying a mask array for computing inner (vector-vector) products,
$$
\langle x, x \rangle_{\mathrm{mask}} = \sum_{i=1}^{\bar{N}_p} m_i\, x_i^2,
\qquad
m_i = \begin{cases} 1, & i \in p, \\ 0, & \text{else}, \end{cases}
$$
where $\bar{N}_p$ is the number of nodes including the ghost cells and $p$ is the non-overlapping partition.
Local communication (LC): exchange of vectors for the ghost cells.
Global communication (GC): global sum and global maximum over all processes.
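A masked inner product followed by a global sum could be sketched as follows. This is illustrative C only; in the MT3DMS code this functionality is provided, roughly, by MPP1MASKN in combination with MPP1GXCHIP, and the function below is not part of that code.

    #include <mpi.h>

    /* Illustrative sketch: local inner product over the cells the process owns
       (mask[i] = 1 inside the non-overlapping partition, 0 for ghost cells),
       followed by a global sum so that every process obtains the same value. */
    double masked_dot(const double *x, const double *y, const int *mask,
                      int n_with_ghosts, MPI_Comm comm)
    {
        double local = 0.0, global = 0.0;
        for (int i = 0; i < n_with_ghosts; i++)
            if (mask[i]) local += x[i] * y[i];
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);   /* GC: sum */
        return global;
    }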
The GCG algorithm (CG and Lanczos/ORTHOMIN) is given below in pseudo-code for each process, with the MPP modifications highlighted on the right (GC = global communication, LC = local communication; δ is the preconditioned residual, δ̃ its shadow counterpart in the Lanczos/ORTHOMIN branch, p and q are search directions, and ε is the maximum concentration change):

    α := max_i |u_i^(0)|                                   GC: maximum of α
    Scale:  b := α⁻¹ b ;   u^(0) := α⁻¹ u^(0)
    ‖b‖² := <b, b>_mask
    r^(0) := b - A u^(0)                                   LC of r^(0)
    ‖r^(0)‖² := <r^(0), r^(0)>_mask                        GC: sum of ‖b‖² and ‖r^(0)‖²
    ‖b‖ := sqrt(‖b‖²) ;   ‖r^(0)‖ := sqrt(‖r^(0)‖²)
    Stopping test:  ‖r^(0)‖ / ‖b‖ ≤ min(1E-6, CCLOSE)
    Solve Q δ^(0) = r^(0)                                  LC of δ^(0)
    if (A non-symmetric) then                              (Lanczos/ORTHOMIN)
        p^(0) := δ^(0) ;   q^(0) := δ^(0) ;   δ̃^(0) := δ^(0) ;   n := 0
        while ε > CCLOSE do
            ρ_n := <δ^(n), δ̃^(n)>_mask                     GC: sum of ρ_n
            if n > 0 then
                β_n := ρ_n / ρ_{n-1}
                p^(n) := δ^(n) + β_n p^(n-1)
                q^(n) := δ̃^(n) + β_n q^(n-1)
            end if
            Solve Q η = A p^(n) ;   Solve Qᵀ ξ = q^(n)     LC of η and ξ
            σ_n := <q^(n), η>_mask                         GC: sum of σ_n
            α_n := ρ_n / σ_n
            u^(n+1) := u^(n) + α_n p^(n)
            δ^(n+1) := δ^(n) - α_n η
            δ̃^(n+1) := δ̃^(n) - α_n Aᵀ ξ
            ε := max_i |u_i^(n+1) - u_i^(n)|  (masked)     GC: maximum of ε
            n := n + 1
        end do
    else if (A is symmetric positive definite) then        (CG)
        p^(0) := δ^(0) ;   n := 0
        while ε > CCLOSE do
            if n > 0 then
                ρ_n := <r^(n), δ^(n)>_mask                 GC: sum of ρ_n
                β_n := ρ_n / ρ_{n-1}
                p^(n) := δ^(n) + β_n p^(n-1)
            end if
            σ_n := <δ^(n), p^(n)>_mask                     GC: sum of σ_n
            Solve Q η = A p^(n)                            LC of η
            γ_n := <p^(n), η>_mask                         GC: sum of γ_n
            α_n := σ_n / γ_n
            u^(n+1) := u^(n) + α_n p^(n)
            δ^(n+1) := δ^(n) - α_n η
            ε := max_i |u_i^(n+1) - u_i^(n)|  (masked)     GC: maximum of ε
            n := n + 1
        end do
    end if
    Unscale:  b := α b ;   u^(n+1) := α u^(n+1)

The current implementation of the GCG solver is inefficient for the case that $A$ is a diagonal matrix and the flow is inherently explicit. In this case $A$ can easily be inverted and the new concentration entries of $u^{(1)}$ can be computed directly by $u_i^{(1)} = b_i / a_{i,i}$, $i = 1, \ldots, N_p$, for all non-zero diagonal entries $a_{i,i}$. This can significantly save computing time, since it avoids unnecessary initialization overhead for the preconditioning.

2.5 PETSc implicit solver
Instead of the parallel GCG solver, a parallel PETSc solver can be selected if this is desired (Balay et al., 2008). PETSc stands for Portable, Extensible Toolkit for Scientific Computation, and this toolkit contains a wide range of parallel solver packages. PETSc has been widely used for more than two decades now for a large number of application fields.
For MT3DMS, an interface to PETSc 3.1 has been constructed, using the KSP (Krylov Subspace Methods) package. For the case that $A$ is symmetric positive definite the CG method is used; otherwise the commonly used GMRES method (Generalized Minimal Residual method, Saad, 2000). In PETSc these methods are set by KSPCG and KSPGMRES, respectively. By default, the GMRES method is restarted after 100 inner iterations. The parallel preconditioner $Q$ is chosen block-Jacobi, similar to the parallel GCG solver. The subdomain solve $A_{i,i} v_i = w_i$ is obtained by an incomplete LU factorization with zero fill-in (ILU(0)). Similar to the parallel GCG solver, the subdomain solution is obtained by preconditioning only.
The general set-up for PETSc is as follows:
1. Initialize PETSc variables;
2. Build the linear system;
3. Set the solver settings;
4. Solve the linear system.
A minimal sketch of these steps is given at the end of this section. Special attention was needed for:
Stopping test. The GCG stop criterion is based on the maximum norm of the difference in solution, whereas PETSc is by default based on the 2-norm of the residual. Although the PETSc manual states that the stop criterion can be customized through KSPSetConvergenceTest, it did not seem to be possible to reproduce the GCG stopping test. Hence, PETSc converges if $\|r^{(k)}\|_2 \le \max(\mathrm{CCLOSE} \cdot \|b\|_2,\, 10^{-50})$ and diverges if $\|r^{(k)}\|_2 \ge 10^{5}\, \|b\|_2$.
Node numbering. PETSc requires that the linear system is filled according to Eq. (1), meaning that besides the MT3DMS node numbering $n_1$ (referred to as the global node number in Figure 2.2) and the local partition node numbering $n_2$ (referred to as the local node number in Figure 2.2), a third node numbering $n_3$ is introduced by $n_3 = n_2 + \sum_{i=0}^{p-1} N_i$ for partition $p$, where $N_i$ is the number of non-overlapping nodes of partition $i$.

Dry or inactive cells. Cells that are inactive are treated such that the corresponding matrix row $i$ of $A$ only has a diagonal entry, set to $-1$, and $u_i = b_i = u_i^{(0)}$ for the inactive cell $i$. For each flow time step, cells can become dry or inactive as determined by the Flow Model Interface package. If this happens, the matrix connectivities change and have to be re-evaluated.
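As announced above, the four set-up steps can be illustrated with a small stand-alone KSP example. This is a hedged sketch only: it is written against a recent PETSc release (call signatures differ slightly from the PETSc 3.1 interface used by the PET package), the values of n_local and cclose are placeholders, and the matrix is filled with placeholder (identity) coefficients instead of the transport coefficients assembled by PET1PETSETCOEF.

    #include <petscksp.h>

    int main(int argc, char **argv)
    {
        Mat A; Vec u, b; KSP ksp; PC pc;
        PetscInt n_local = 100, istart, iend, i;   /* hypothetical number of owned nodes */
        PetscReal cclose = 1.0e-6;                 /* CCLOSE from the PET input file */

        PetscInitialize(&argc, &argv, NULL, NULL);

        /* distributed sparse matrix; preallocation of up to 7 nonzeros per row */
        MatCreateAIJ(PETSC_COMM_WORLD, n_local, n_local, PETSC_DETERMINE,
                     PETSC_DETERMINE, 7, NULL, 7, NULL, &A);
        MatGetOwnershipRange(A, &istart, &iend);
        for (i = istart; i < iend; i++)            /* placeholder coefficients only */
            MatSetValue(A, i, i, 1.0, INSERT_VALUES);
        MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
        MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

        VecCreateMPI(PETSC_COMM_WORLD, n_local, PETSC_DETERMINE, &b);
        VecDuplicate(b, &u);
        VecSet(b, 1.0);                            /* placeholder right-hand side */
        VecSet(u, 0.0);                            /* initial concentration guess */

        KSPCreate(PETSC_COMM_WORLD, &ksp);
        KSPSetOperators(ksp, A, A);
        KSPSetType(ksp, KSPGMRES);                 /* or KSPCG when A is SPD */
        KSPGMRESSetRestart(ksp, 100);              /* restart after 100 inner iterations */
        KSPGetPC(ksp, &pc);
        PCSetType(pc, PCBJACOBI);                  /* block-Jacobi; ILU(0) per block is
                                                      the PETSc default sub-preconditioner */
        KSPSetTolerances(ksp, cclose, 1.0e-50, 1.0e5, 1000);  /* rtol, abstol, dtol, maxits */
        KSPSetInitialGuessNonzero(ksp, PETSC_TRUE);

        KSPSolve(ksp, b, u);                       /* the new concentrations end up in u */

        KSPDestroy(&ksp); MatDestroy(&A); VecDestroy(&u); VecDestroy(&b);
        PetscFinalize();
        return 0;
    }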
2.6 Supported features

Table 2.1 shows, for each MT3DMS package, whether it is supported in combination with the MPP package.
Table 2.1 : Supported and not supported packages for MPP.
Package Supported Restrictions
Basic Transport (BTN) package Y
Advection (ADV) package Y particle tracking MOC, MMOC or HMOC not supported
Dispersion (DSP) package Y
Sink and Source Mixing (SSM) package Y recirculation well option not supported
Chemical Reaction (RCT) Package Y
Generalized Conjugate Gradient (GCG) package Y
Flow Model Interface Package (LKMT3) Y Multi-node Well (MNW) package not supported
Transport Observation (TOB) package N
3 Program design
The MT3DMS v5.30 code was modified to support distributed memory parallelization (MPP package) and the parallel PETSc solvers (PET package). Code adjustments were kept as minimal as possible, to retain the modularity and to allow the code to be merged with newer MT3DMS versions. Table 3.1 summarizes the code changes, the parts of the code they involve, and the specific reasons for changing them. In Tables 3.2 to 3.7 the new subroutines and functions are described. Figures 3.1 to 3.4 show flow charts of the main program and the implicit solvers. Some code adjustments (CA) require further explanation; see Table 3.1 for the numbering.
CA 1-6 (MPP initialization)
These code adjustments are made for the initialization of the MPP package. The basic idea behind them is that all MPI processes execute until the point where the grid dimensions are read (BTN5DF), then “rewind” the program, after which the partitioning takes place. Figure 3.1 shows how this is accomplished. First, all MPI processes do exactly the same: returning from MPP1INI1 they have the flag MPPTYP set to MPPINI1, marking the beginning of MPP phase 1. In this phase, the program files are first opened in BTN5OPEN and the unit numbers are stored in a temporary array (the LIST output file is labelled negative, since this file needs to be removed later on). Then, the partitioning options are read from INMPP in MPP1INI2. After reading the grid dimensions NCOL and NROW in BTN5DF, all processes call MPP1INI3, where they rewind INMPP and close all other files. Besides closing, the Master process deletes the LIST file. After all processes return with the logical MPPREST being true, this is the end of MPP initialization phase 1 and the start of MPP phase 2 (MPPTYP = MPPINI2). All processes return in the main program, continue at label 100, and enter the partitioning subroutine MPP1PART, where they determine their partition and their own NCOL and NROW. In BTN5DF these are set by calling the subroutine MPP1SETRCL, right after the (global) NCOL and NROW have been read.
Although this way of initialization may seem a bit cumbersome, it is flexible when the code is extended with a more complex partitioning algorithm, such as MeTiS (see Karypis and Kumar, 1998). In that case, a possibility is that MPP phase 1 is executed by the Master process only, which performs the partitioning through calls to the MeTiS library and writes a partition data file for each process. After this, all processes continue with MPP phase 2, and each process reads its MeTiS partition data file in MPP1PART.
CA 24, 38-40 (file readers)
Table 3.1: Code adjustments of the MT3DMS code v5.30.

Nr. | Code adjustment | Subroutine(s) / Function(s) | Reason
1 | MPPREST logical | main program | flag introduced for MPP initialization
2 | subroutine call added to MPP1PART | main program | partitioning of the grid in row and column direction
3 | subroutine call added to MPP1INI1 | main program | MPI initialize, set flags, etc.
4 | subroutine call added to MPP1LXCHINI | main program | initialize local communication data structures
5 | subroutine call added to MPP1INI2 | main program | read data from input files
6 | subroutine call added to MPP1INI3 | main program | close and rewind input files
7 | INMPP and INPET added | main program | packages MPP and PET added
8 | subroutine call added to PET1AL | main program | allocate space for data arrays needed by the PET package
9 | check for GCG solver | main program | changed into a general check for an implicit solver (the PETSc solver can also be selected)
10 | subroutine call added to MPP1RP | main program | MPP package read and prepare: only write to IOUT
11 | subroutine call added to PET1RP | main program | PETSc package read and prepare
12 | subroutine call added to MPP1LXCH | main program | local exchange of concentrations CNEW
13 | subroutine call added to PET1AP | main program | solve PETSc linear system for obtaining new concentrations CNEW
14 | subroutine call added to MT_MPIFINALIZE | main program | finalize MPI processes
15 | WRITESTO logical added | ADV5AL, BTN5RP, BTN5AD, BTN5OT, main program | to assure that only Master process P0 writes to standard output
16 | MPP check added (MPP1CHECK) | ADV5AL, BTN5AL, FMI5AL, HSS5AL, RCT5AL, SSM5AL, TOB5AL, GCG5AP | check for unsupported MPP features
17 | nodal mask applied to RMASIO by call to MPP1MASK | ADV5SV, SADV5F, ADV5BD, SADV5U, BTN5BD, DSP5BD, RCT5BD, SSM5BD | only account for cells for which the process is responsible
18 | nodal mask applied to TMASS by call to MPP1MASK | BTN5AD | only account for cells for which the process is responsible
19 | IF-statement splitting | SADV5M, SADV5F, SADV5U, SADV5Q, SAVD5F, SSM5FM, SSM5BD | changed to solve boundary check error for the Intel compiler
20 | MPP and PET packages added | BTN5OPEN | add MPP and PET packages to the name file
21 | subroutine call added to MPP1STORELUN | BTN5OPEN | MPP phase 1: store unit numbers
22 | subroutine call added to MPP1FNAME | BTN5OPEN, BTN5RP | MPP phase 2: append file name with process ID (LIST, UCN, CNF, OBS optional)
23 | subroutine call added to MPP1SETRCL | BTN5DF | MPP phase 2: set local number of columns and rows
24 | point read changed by call to new subroutine CHKLOC | BTN5RP, READPS, READGS, SSM5RP, TOB5RP | read clipped point data: check if used and renumber coordinates
25 | subroutine call added to MPP1READOBS | BTN5RP | MPP phase 2: store observation global node numbers (optional merge option)
26 | subroutine call added to MPP1WRTOBSHDR | BTN5RP | MPP phase 2: write global node numbers of observation nodes (optional merge option)
27 | subroutine call added to MPP1GXCHTIME | BTN5AD | global exchange to compute minimum of transport time step DTRANS
28 | subroutine call added to MPP1GXCHMASS1 | BTN5AD | global exchange to compute sum of mass TMASS
29 | subroutine call added to MPP1GXCHMASS2; changed code order | BTN5BD | combined global exchange to compute sum of masses TMASS and RMASIO
30 | subroutine call added to MPP1GXCHOBS | BTN5OT | Master process gathers observation point data from Slaves and writes to output file (optional merge option)
31 | subroutine call added to PET1ICBUNDCHG | FMI5RP | store flag indicating that ICBUND has changed (PET package only)
32 | diagonal check added: convergence check and call to DIAGMATCHK | GCG5AP | check for diagonal matrix and, if true, directly compute CNEW and return
33 | subroutine calls added to MPP1GXCHMAX | GCG5AP | global maximum of SCALE or CHANGE
34 | subroutine calls added to MPP1MASKN | GCG5AP | skip summation for cells in overlap
35 | subroutine calls added to MPP1LXCH | GCG5AP | local exchange of vectors (search direction)
36 | subroutine calls added to MPP1GXCHIP | GCG5AP | global sum of inner products (vector-vector)
37 | dimension check skipped for FMI package | READHQ, READDS, READPS | check for grid dimensions skipped in case of MPP
38 | real array read changed by call to new subroutine RDARRAYR | READHQ, READDS, RARRAY | clipped read of real arrays
39 | integer array read changed by call to new subroutine RDARRAYI | READDS, IARRAY | clipped read of integer arrays
40 | IOUT > 0 added as a condition to write
Table 3.2: New subroutines and functions created for the MPP package (file mt_mpp1.for).
Name Description
MPP1CHECK subroutine: check for features that are not supported with MPP
MPP1FNAME subroutine: append output file name with process rank ID
MPP1GNODE2RANK function: get process rank ID for the process that is responsible for a MT3DMS global node number
MPP1GXCHDIAG subroutine: global sum of integer flag indicating if a matrix is diagonal; used for GCG and PETSc solver
MPP1GXCHIP subroutine: global sum over all processes used for computing inner products in GCG solver
MPP1GXCHMASS1 subroutine: global sum of mass over all processes used for TMASS
MPP1GXCHMASS2 subroutine: global sum of masses TMASS and RMASIO over all processes
MPP1GXCHMAX subroutine: global maximum over all processes used for SCALE and CHANGE in GCG solver
MPP1GXCHOBS subroutine: Master process gathers observation nodal data from Slave processes and writes to IOBS
MPP1GXCHTIME subroutine: global minimum over all processes used for DTRANS
MPP1INI1 subroutine: initialize MPI, set MPPTYP integer flag, set WRITESTO logical flag, initialize arrays to store unit numbers
MPP1INI2 subroutine: read data from INMPP, INADV for MIXELM, INGCG for NCRS and INPET for NCRS
MPP1INI3 subroutine: first, Slave processes rewind INMPP, INADV, INGCG, INPET and close all other files; second, the Master process does the same but also removes the LIST file
MPP1LOADEST subroutine: estimate work load based on simple counting of active cells
MPP1LXCH subroutine: local communication using non-blocking send and receive
MPP1LXCHINI subroutine: initialize local communication data structures: exchange partners, ranks, interface types, nodes
MPP1MASK subroutine: determine from its local coordinates if the process is responsible for this node
MPP1MASKN subroutine: determine from its local node number if the process is responsible for this node
MPP1PART subroutine: main partitioning subroutine to partition the grid in row and column direction
MPP1PARTBAL subroutine: partitioning algorithm by load balancing with a pointer grid defined by the user, using row and column sums
MPP1PARTDEF subroutine: default partitioning algorithm, uniform in row and column direction
MPP1PARTUSER subroutine: partitioning where the user has specified the blocks
MPP1RP subroutine: estimate work load; write partition information to output file IOUT
MPP1SETRCL subroutine: set new NLAY, NROW and NCOL; set clipping area parameters for clipped data read
MPP1STORELUN subroutine: store unit numbers that are to be closed or rewound later on
MPP1STOREOBS subroutine: store observation nodes' global MT3DMS node numbers and responsible process rank
MPP1WRTOBSHDR subroutine: write file header for observation points in IOBS
MPP1XPIDX subroutine: determine local communication index arrays
Table 3.3 : New FORTRAN subroutines and functions created for the PET package (file mt_pet1.for).
Name Description
PET1AL subroutine: allocate space for data arrays
PET1AP subroutine: solve linear system using PETSc library to obtain new concentrations
PET1G2LNODE function: determine global MT3DMS node number from local node number
PET1ICBUNDCHG subroutine: check if ICBUND has changed and hence the grid cell connectivities
PET1L2GNODE function: determine local node number from global MT3DMS node number
PET1L2PETGNODE function: determine global PETSc node number from local node number
PET1PETIDX subroutine: determine connectivity indices for the linear system
PET1RP subroutine: read and prepare data
Table 3.4 : New C routines created for the PET package (file mt_pet1.c).
Name Description
PET1PETCOEFINI PETSc initialize, allocate PETSc matrix, solution vector and right-hand side vector
PET1PETCOEFINI_ fortran interface wrapper for PET1PETCOEFINI
PET1PETCOEFINI__ fortran interface wrapper for PET1PETCOEFINI
PET1PETSETCOEF fill PETSc matrix, initial solution vector and right-hand side vector; assemble system
PET1PETSETCOEF_ fortran interface wrapper for PET1PETSETCOEF
PET1PETSETCOEF__ fortran interface wrapper for PET1PETSETCOEF
PET1PETSOLSET set PETSc solver settings (CG or GMRES, block-Jacobi, ILU(0), stop criterion, output)
PET1PETSOLSET_ fortran interface wrapper for PET1PETSOLSET
PET1PETSOLSET__ fortran interface wrapper for PET1PETSOLSET
PET1PETSOLVE PETSc solve linear system; error handling
PET1PETSOLVE_ fortran interface wrapper for PET1PETSOLVE
Table 3.5: Subroutines and functions added to the MT3DMS utilities (mt_utl5.for).
Name Description
CFN_FINDWORD subroutine: find word in string array
CFN_LENGTH function: wrapper for CFN_LENGTH2
CFN_LENGTH2 function: determine string length
CFN_S_TRIM subroutine: wrapper for CFN_TRIM2
CFN_S_TRIM2 subroutine: trim a string
CFN_UPCASE subroutine: wrapper for CFN_UPCASE2
CFN_UPCASE2 subroutine: make a string uppercase
CHKLOC subroutine: read clipped point data and check if used and renumber coordinates
CHKNOTSUPPORTED subroutine: check for supported packages for clipping read
DEFRDSZ subroutine: define clipping area parameters
GETIRCL subroutine: get coordinates from MT3DMS global node number
RDARRAYI subroutine: integer array reader suitable for clipping data
RDARRAYR subroutine: real array reader suitable for clipping read
Table 3.6 : Subroutine added to the GCG5 package (mt_gcg5.for).
Name Description
DIAGMATCHK subroutine: check if the coefficient matrix only has diagonal entries; if so, compute cnew(n) = rhs(n) / a(n,1)
Table 3.7 : MPI wrapper routines needed by the MPP and PETSc package (mt_mpp1_mpi.for).
Name Description
MTMPI_ALLMAXI subroutine: wrapper for MPI_ALLREDUCE to maximize integer values over all processes
MTMPI_ALLMAXR subroutine: wrapper for MPI_ALLREDUCE to maximize real values over all processes
MTMPI_ALLMINR subroutine: wrapper for MPI_ALLREDUCE to minimize real values over all processes
MTMPI_ALLSUMR subroutine: wrapper for MPI_ALLREDUCE to sum real values over all processes
MTMPI_BARRIER subroutine: wrapper for MPI_BARRIER to block until all processes reach this subroutine
MTMPI_COMM_RANK function: wrapper for MPI_COMM_RANK to determine own process rank ID
MTMPI_COMM_SIZE function: wrapper for MPI_COMM_SIZE to determine the number of processes
MTMPI_ERROR subroutine: MPI error message handling
MTMPI_FINALIZE subroutine: wrapper for MPI_FINALIZE to finalize MPI
MTMPI_GATHERVI subroutine: wrapper for MPI_GATHERV to gather integer arrays from all processes to one process
MTMPI_GATHERVR subroutine: wrapper for MPI_GATHERV to gather real arrays from all processes to one process
MTMPI_INIT subroutine: call MPI_INIT; zero communication buffers; get NRPROC and MYRANK
MTMPI_IRECVI subroutine: wrapper for MPI_IRECV for non-blocking receive of an integer array
MTMPI_IRECVR subroutine: wrapper for MPI_IRECV for non-blocking receive of a real array
MTMPI_ISENDI subroutine: wrapper for MPI_ISEND for non-blocking send of an integer array
MTMPI_ISENDR subroutine: wrapper for MPI_ISEND for non-blocking send of a real array
Figure 3.1: Flow chart of the main program, part 1: the allocation (AL) and read-and-prepare (RP) procedures of the packages, together with the MPP initialization (MPP1INI1, MPP1INI2, MPP1INI3), partitioning (MPP1PART, MPP1LXCHINI) and the MPPTYP flag values MPPSER, MPPINI1 and MPPINI2.
Figure 3.2: Flow chart of the main program, part 2: the stress period, flow time step, transport time step and outer iteration loops, including the local exchange of concentrations (MPP1LXCH) and the calls to the implicit solvers (GCG5AP, PET1AP).
Figure 3.3: Flow chart of the parallel GCG solver (GCG5AP): scaling, initial residual and stopping test, followed by the CG branch (A symmetric positive definite) or the Lanczos/ORTHOMIN branch (A non-symmetric), with the local exchanges and global sums/maxima that are carried out when MPP is active, and the direct computation of u when A is a diagonal matrix.
Figure 3.4: Flow chart of the PETSc solver (PET1AP): check for a diagonal matrix, determination of the connection indices (PET1PETIDX), allocation of the PETSc matrix and vectors (PET1PETCOEFINI), filling (PET1PETSETCOEF), solver settings (PET1PETSOLSET), solve (PET1PETSOLVE), and local exchange of u.
4 Input instructions
Input for the Massively Parallel Processing (MPP) package is read on unit INMPP = 5, which is preset in the main program. The input file is required only if the MPP package is used in the simulation. For each simulation, the input instructions are given in Table 4.1.
Table 4.1: Input instructions for the MPP package.

Massively Parallel Processing (MPP) Package

ID | Variable Name | Format | Explanation
G1 | MERGEOBS | free | flag indicating whether the Master process should merge the observation points. MERGEOBS = 0: each process writes its own observation file; MERGEOBS = 1: the Master process writes the observation file
G2 | PARTOPT | free | flag indicating the partition method in column and row direction. PARTOPT = 0: default algorithm assuming constant load; PARTOPT = 1: user-defined blocks; PARTOPT = 2: load balancing by a user-specified integer grid
(Enter G2a if PARTOPT = 1 or 2)
G2a | NRPROC_NCOL, NRPROC_NROW | free | number of partitions in column and row direction
(Enter G2b and G2c if PARTOPT = 1)
G2b | NRPROC_COL | free | array of length NRPROC_NCOL with the starting columns of the partitions
G2c | NRPROC_ROW | free | array of length NRPROC_NROW with the starting rows of the partitions
(Enter G2d if PARTOPT = 2)
G2d | LOADPTR | iarray | array used for load balancing in row and column direction; only entries different from 0 are used
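Purely as an illustration (the values below are hypothetical and the exact free-format record layout should be checked against the reader), an MPP input file for a user-defined 2 × 2 partitioning of the 8 × 6 grid of Figure 2.2 could contain the records G1, G2, G2a, G2b and G2c as:

    1
    1
    2 2
    1 5
    1 4

Here the Master process merges the observation output (MERGEOBS = 1), PARTOPT = 1 selects user-defined blocks, the two column blocks start at columns 1 and 5, and the two row blocks start at rows 1 and 4.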
Input for the Portable, Extensible Toolkit for Scientific Computation (PET) package is read on unit INPET = 6, which is preset in the main program. The input file is needed only if the PET package is used in the simulation. If the PET package is used, the MPP package must be used too. For each simulation, the input instructions are given in Table 4.2.
Table 4.2: Input instructions for the PET package.

Portable, Extensible Toolkit for Scientific Computation (PET) Package

ID | Variable Name | Format | Explanation
H1 | MXITER, ITER1, NCRS | free | maximum number of outer iterations; maximum number of inner iterations; flag for the treatment of the dispersion cross terms. NCRS = 0: dispersion cross terms lumped to the right-hand side; NCRS = 1: full dispersion tensor
H2 | CCLOSE | free | convergence criterion: 2-norm(residual) / 2-norm(right-hand side)
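Again purely as an illustration (hypothetical values), a PET input file consists of the two records H1 and H2, for example:

    50 30 0
    1.0E-6

i.e. at most 50 outer and 30 inner iterations, dispersion cross terms lumped to the right-hand side (NCRS = 0), and a relative convergence criterion of 1.0E-6.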
Note that in order to use the MPP and PET packages, MPI must be installed on your system. To use the PET package, PETSc must be installed as well.
5 Test problems
5.1 Examples
The parallel code was tested for correctness on the supported examples in the MT3DMS v5.30 distribution (21-02-2010). The examples were taken from the directory “examples\mf96mt3d\”, where MODFLOW-96 was used to compute the groundwater flow. For this, MODFLOW-96 with the link-MT3D interface (25-05-2003) was compiled with the Intel compiler under Linux. The examples were tested for 1, 2, 3, 4 and 8 processes. As a reference, the examples were run with a compiled clean version of MT3DMS v5.30.
Examples 4, 6 and 8 were not tested because they all use MOC advection, which is not supported. All tests gave results in the parallel case similar to those of the serial case. For one specific example, we will go into more detail.
Example 1: One-dimensional transport in a uniform flow field
This test case models one-dimensional solute transport involving advection, dispersion and some simple chemical reactions in a steady-state uniform flow field. The full problem specification can be found in Sec. 7.1 of Zheng, 1999. Here, advection was solved by the parallelized TVD scheme.
Test case 1a (advection only) has no difference (exactly zero) between serial and parallel runs, both using the GCG and PETSc solver. This is easily explained by the fact that no iterative solvers are invoked when dispersion is absent. Also, the parallel advection solver (ULTIMATE) is explicit in time and, therefore, produces results across multiple partitions that are identical to the results from a serial run.
Test case 1b (advection and dispersion) has no significant difference (up to machine precision) between serial and parallel runs, both using the GCG and PETSc solver. This is achieved by setting the tolerance for the iterative solvers to machine precision (e.g., CCLOSE=1E-16). For less strict values, serial and parallel results are identical up to that tolerance, as is to be expected.
Test case 1c (advection, dispersion and sorption) shows equality between runs up to machine precision, as for test case 1b. Figure 5.1a shows the numerical solutions obtained with the PETSc solver at Tend=2000 s for serial and parallel runs. The results for any number of partitions coincide.
Test case 1d (advection, dispersion, sorption and decay) shows the same behavior. Figure 5.1b shows the numerical solutions with the parallel GCG solver at Tend=2000 s for serial and parallel runs. Again, all results coincide.
Figure 5.1: Comparison of the calculated concentrations with the analytical solutions for the one-dimensional test problem. The analytical solutions are shown in solid lines with the numerical solutions in different symbols.
a) case 1c, solved by PETSc on 1, 2, 4 and 8 processes. b) case 1d, solved by GCG on 1, 2, 4, and 8 processes.
5.2 Large application model
Figure 5.2: Area of interest for the BeNeKempen model and model dimensions: 567 columns over 57 km, 389 rows over 39 km, 20 layers; the markers indicate the zinc smelters.
In cooperation with the Dutch knowledge centers Deltares and Alterra, several large contaminant transport models have been developed. The largest regional transport model, and the one using the most computer resources, is the recently developed so-called BeNeKempen model for the “Aktief Bodembeheer de Kempen” environmental program. This 3D transboundary model consists of ~5 million grid cells with a horizontal resolution of 100 × 100 m, covering an area of ~2200 km2, about 1/20 of the area of the Netherlands (Figure 5.2). This MT3DMS model simulates groundwater transport of cadmium and zinc originating from historical emissions of four zinc smelters in the vicinity of the Dutch-Belgian border. The typical simulation time used is 90 years.
The performance of the parallel code is benchmarked on the Dutch national compute cluster SARA-LISA⁴, with the following specifications:
• 512 nodes, each node 8 or 12 cores (4480 cores in total);
• Intel Xeon quad-core (416 nodes), Intel Xeon six-core (96 nodes);
• 2 GB internal memory per core;
• fast Infiniband interconnect, 1600 MB/s, latency < 6 μs;
and on the Deltares in-house H4 cluster:
• 221 nodes, each node 2 or 4 cores (448 cores in total);
• AMD dual-core (220 x-nodes), AMD quad-core (1 y-node);
• 4 GB internal memory, 250 GB local storage;
• slow 1 Gigabit Ethernet interconnect.
4. This work was sponsored by the Stichting Nationale Computerfaciliteiten (National Computing Facilities Foundation, NCF) for the use of supercomputer facilities, with financial support from the Nederlandse Organisatie voor Wetenschappelijk Onderzoek (Netherlands Organisation for Scientific Research, NWO).
(Charts of Figure 5.3: wall-clock time [hours] and speedup versus the number of processes, 1 to 32, for the BeNeKempen model on the Deltares H4 and SARA LISA clusters; the speedup panel includes the ideal case. See the caption below.)
The BeNeKempen model was run with up to 32 processes on both machines, with the MPP package selected in combination with the parallel GCG solver. The parallel GCG solver was selected because the PETSc solver introduces a large amount of initialization overhead when the flow is advection dominated, which is the assumption for the BeNeKempen model. True wall-clock times were measured, i.e. the time from start to end of the simulation as directly experienced by the user. The results are shown in Figure 5.3. A speedup of more than a factor 20 was obtained with 32 processes on the Dutch national compute cluster SARA-LISA, reducing the computing time from ~10 hours to ~30 minutes. The speedup on the H4 cluster decreases beyond 16 processes, which is probably caused by the slow Ethernet interconnect. Overall, the benchmarks show that the parallel MT3DMS code can significantly reduce large computing times.
Exactly the same results were reproduced in the parallel case as in the serial case; see Figure 5.4 for a typical observation point. Figure 5.5 shows how the default partitioning is done for the case of 16 blocks (4 × 4). Clearly, the load balance is not optimal: the 12 outer blocks do not carry the maximum load, because they contain a large number of inactive cells.
(Chart of Figure 5.4: cadmium concentration [kg/m3] versus time [days] at BeNeKempen observation point (6, 241, 67); see the caption below.)
Figure 5.4: Typical observation point showing a cadmium concentration time series for 90 years. Results are shown for the GCG solver, the parallel GCG solver (2, 4, 8, 16 processors), the PETSc solver (16 processors) and the case of no implicit solver for 16 processors (diagonal check activated).
Figure 5.3: Timing results for the parallel MT3DMS code (parallel GCG) on the SARA LISA cluster and the Deltares H4 cluster. Left: measured wall-clock time; right: speedup.
Figure 5.5: Computed cadmium concentration for layer = 1 and T = 90 years, showing the partitioning for the case of 16 processes (upper left block: P0; lower right block: P15).