Design and evaluation of communication latency hiding/reduction techniques for message-passing environments



This manuscript has been reproduced from the microfilm master. UMI films the text directly from the original or copy submitted. Thus, some thesis and dissertation copies are in typewriter face, while others may be from any type of computer printer.

The quality of this reproduction is dependent upon the quality of the copy submitted. Broken or indistinct print, colored or poor quality illustrations and photographs, print bleedthrough, substandard margins, and improper alignment can adversely affect reproduction.

In the unlikely event that the author did not send UMI a complete manuscript and there are missing pages, these will be noted. Also, if unauthorized copyright material had to be removed, a note will indicate the deletion.

Oversize materials (e.g., maps, drawings, charts) are reproduced by sectioning the original, beginning at the upper left-hand corner and continuing from left to right in equal sections with small overlaps.

Photographs included in the original manuscript have been reproduced xerographically in this copy. Higher quality 6" x 9" black and white photographic prints are available for any photographs or illustrations appearing in this copy for an additional charge. Contact UMI directly to order.

Bell & Howell Information and Learning
300 North Zeeb Road, Ann Arbor, MI 48106-1346 USA
800-521-0000


Design and Evaluation of Communication Latency Hiding/Reduction Techniques for Message-Passing Environments

by

Ahmad Afsahi
B.Sc., Shiraz University, Iran, 1985
M.Sc., Sharif University of Technology, Iran, 1988

A Dissertation Submitted in Partial Fulfillment of the Requirements of the Degree of

DOCTOR OF PHILOSOPHY

in the Department of Electrical and Computer Engineering

We accept this dissertation as conforming to the required standard

Dr. N. J. Dimopoulos, Supervisor (Department of Electrical and Computer Engineering)

Dr. K. F. Li, Departmental Member (Department of Electrical and Computer Engineering)

Dr. V. K. Bhargava, Departmental Member (Department of Electrical and Computer Engineering)

Dr. D. M. Miller, Outside Member (Department of Computer Science)

Dr. J. Duato, External Examiner (Department of Information Systems and Computer Architecture, Technical University of Valencia)

© Ahmad Afsahi, 2000
University of Victoria

All rights reserved. This dissertation may not be reproduced in whole or in part, by photocopying or other means, without the permission of the author.


Supervisor: Dr. Nikitas J. Dimopoulos

Abstract

With the availability of fast microprocessors and small-scale multiprocessors, inter-node communication has become an increasingly important factor that limits the performance of parallel computers. Essentially, message-passing parallel computers require extremely short communication latency such that message transmissions have minimal impact on the overall computation time. This thesis concentrates on issues regarding hardware communication latency in single-hop reconfigurable networks, and software communication latency regardless of the type of network.

The first contribution of this thesis is the design and evaluation of two different categories of prediction techniques for message-passing systems. This thesis utilizes the communications locality property of message-passing parallel applications to devise a number of heuristics that can be used to predict the target of subsequent communication requests, and to predict the next consumable message at the receiving ends of communications.

Specifically, I propose two sets of predictors: Cycle-based predictors, which are purely dynamic predictors, and Tag-based predictors, which are static/dynamic predictors. The performance of the proposed predictors, especially Better-cycle2 and Tag-bettercycle2, is very good on the application benchmarks studied in this thesis. The proposed predictors could be easily implemented on the network interface due to their simple algorithms and low memory requirements.

As the second contribution of this thesis, I show that the majority of reconfiguration delays in single-hop reconfigurable networks can be hidden by using one of the proposed high hit ratio predictors. The proposed predictors can be used to establish a communication pathway between a source and a destination in such networks before this pathway is to be used.

This thesis' third contribution is the analysis of a broadcasting algorithm that utilizes latency hiding and reconfiguration in the network to speed up the broadcasting operation. The analysis brings up closed formulations that yield the termination time of the algorithms.

The fourth contribution of this thesis is a new total exchange algorithm for single-hop reconfigurable networks. I conjecture that this algorithm ensures a better termination time than what can be achieved by either the direct or the standard exchange algorithm.

The fifth contribution of this thesis is the use and evaluation of the proposed predictors to predict the next consumable message at the receiving ends of communications. This thesis contributes by claiming that these message predictors can be efficiently used to drain the network and cache the incoming messages even if the corresponding receive calls have not been posted yet. This way, there is no need to copy the early arriving messages into a temporary buffer. The performance of the proposed predictors, Single-cycle, Tag-cycle2 and Tag-bettercycle2, on the parallel applications is quite promising and suggests that prediction has the potential to eliminate most of the remaining message copies.

Examiners:

Dr. N. J. Dimopoulos, Supervisor (Department of Electrical and Computer Engineering)

Dr. K. F. Li, Departmental Member (Department of Electrical and Computer Engineering)

Dr. V. K. Bhargava, Departmental Member (Department of Electrical and Computer Engineering)

Dr. D. M. Miller, Outside Member (Department of Computer Science)

Dr. J. Duato, External Examiner (Department of Information Systems and Computer Architecture, Technical University of Valencia)


Table of Contents

Abstract ... ii
Table of Contents ... iv
List of Figures ... vii
List of Tables ... xi
Trademarks ... xii
Glossary ... xiii
Acknowledgments ... xvi

Chapter 1

Introduction ... 1

1.1 Communications Locality and Prediction Techniques ... 5
1.2 Using the Proposed Predictors at the Send Side ... 8
1.3 Redundant Message Copying in Software Messaging Layers ... 9
1.4 Collective Communications ... 10
1.5 Thesis Contributions ... 11

Chapter 2

Application Benchmarks and Experimental Methodology ... 15

2.1 Parallel Benchmarks ... 15
2.1.1 NPB: NAS Parallel Benchmarks Suite ... 16
2.1.1.1 CG ... 16
2.1.1.2 MG ... 17
2.1.1.3 LU ... 17
2.1.1.4 BT and SP ... 17
2.1.2 PSTSWM ... 18
2.1.3 QCDMPI ... 18
2.2 Applications' Communication Primitives ... 19
2.2.1 MPI_Send ... 20
2.2.2 MPI_Isend ... 20
2.2.3 MPI_Sendrecv_replace ... 20
2.2.4 MPI_Recv ... 20
2.2.5 MPI_Irecv ... 21
2.2.6 MPI_Wait ... 21
2.2.7 MPI_Waitall ... 21
2.3 Experimental Methodology ... 21

Chapter 3

Design and Evaluation of Latency Hiding/Reduction Message Destination Predictors ... 22

3.1 Introduction ... 23
3.1.1 Message Switching Layers ... 24


3.2 Communication Frequency and Message Destination Distribution ... 30
3.3 Communication Locality and Caching ... 35
3.3.1 The LRU, FIFO and LFU Heuristics ... 38
3.4 Message Destination Predictors ... 43
3.4.1 The Single-cycle Predictor ... 46
3.4.2 The Single-cycle2 Predictor ... 48
3.4.3 The Better-cycle and Better-cycle2 Predictors ... 49
3.4.4 The Tagging Predictor ... 53
3.4.5 The Tag-cycle and Tag-cycle2 Predictors ... 54
3.4.6 The Tag-bettercycle and Tag-bettercycle2 Predictors ... 56
3.5 Predictors' Comparison ... 57
3.5.1 Predictor's Memory Requirements ... 59
3.6 Using Message Predictors ... 60
3.7 Summary ... 61

Chapter 4 Reconfiguration Time Enhancements Using Predictors ... 63

4.1 Distribution of Message Sizes ... 64
4.2 Inter-send Computation Times ... 64
4.3 Total Reconfiguration Time Enhancement ... 71
4.4 Predictors' Effect on the Receive Side ... 79
4.5 Summary ... 81

Chapter 5 Collective Communications on a Reconfigurable Interconnection Network ... 84

5.1 Introduction ... 84
5.2 Communication Modeling for Broadcasting, Multi-broadcasting ... 88
5.3 Broadcasting and Multi-broadcasting ... 90
5.3.1 Broadcasting ... 90
5.3.1.1 Analysis of the Greedy Algorithm ... 92
5.3.1.2 Grouping Schema ... 101
5.3.2 Multi-broadcasting ... 102
5.4 Communication Modeling for other Collective Communications ... 103
5.5 Scattering ... 103
5.6 Multinode Broadcasting ... 105
5.7 Total Exchange ... 108
5.8 Summary ... 112

Chapter 6 Efficient Communication Using Message Prediction for Clusters of Multiprocessors ... 114

6.1 Introduction ... 115


6.2 Motivation and Related Work ... 117
6.3 Using Message Predictions ... 122
6.4 Experimental Methodology ... 123
6.5 Receiver-side Locality Estimation ... 123
6.5.1 Communication Locality ... 125
6.5.2 The LRU, FIFO and LFU Heuristics ... 127
6.6 Message Predictors ... 129
6.6.1 The Tagging Predictor ... 129
6.6.2 The Single-cycle Predictor ... 130
6.6.3 The Tag-cycle2 Predictor ... 130
6.6.4 The Tag-bettercycle2 Predictor ... 131
6.7 Message Predictors' Comparison ... 132
6.7.1 Predictor's Memory Requirements ... 132
6.8 Summary ... 134

Chapter 7

Conclusions and Directions for Future Research ... 136

7.1 Future Research ... 138

Bibliography ... 141


List of Figures

Figure 1.1: A generic parallel computer ... 2
Figure 3.1: RON(k, N), a massively parallel computer interconnected by a complete free-space optical interconnection network ... 27
Figure 3.2: Number of send calls per process in the applications under different system sizes ... 32
Figure 3.3: Number of message destinations per process in the applications under different system sizes ... 34
Figure 3.4: Distribution of message destinations in the applications when N = 64 ... 36
Figure 3.5: Distribution of message destinations in the applications for process zero, when N = 64 ... 37
Figure 3.6: Comparison of the LRU, FIFO, and LFU heuristics when N = 64 ... 39
Figure 3.7: Effects of the scalability of the LRU, FIFO, and LFU heuristics on the BT, SP and CG applications ... 40
Figure 3.8: Effects of the scalability of the LRU, FIFO, and LFU heuristics on the MG and LU applications ... 41
Figure 3.9: Effects of the scalability of the LRU, FIFO, and LFU heuristics on the PSTSWM and QCDMPI applications ... 42
Figure 3.10: Operation of the Single-cycle predictor on a sample request sequence ... 47
Figure 3.11: Effect of the Single-cycle predictor on the applications ... 48
Figure 3.12: Comparison of the performance of the Single-cycle predictor with the LRU, LFU, and FIFO heuristics on the applications under single-port modeling when N = 64 ... 48

Figure 3.13: Operation of the Single-cycle2 predictor on the sample request sequence ... 49
Figure 3.14: Effect of the Single-cycle2 predictor on the applications ... 49
Figure 3.15: State diagram of the Better-cycle predictor ... 50
Figure 3.16: Operation of the Better-cycle predictor on the sample request sequence ... 51
Figure 3.17: Effect of the Better-cycle predictor on the applications ... 52
Figure 3.18: Operation of the Better-cycle2 predictor on the sample request sequence ... 52
Figure 3.19: Effect of the Better-cycle2 predictor on the applications


Figure 3.20: Effects of the Tagging predictor on the applications ... 54
Figure 3.21: Effects of the Tag-cycle predictor on the applications ... 55
Figure 3.22: Effects of the Tag-cycle2 predictor on the applications ... 56
Figure 3.23: Effects of the Tag-bettercycle predictor on the applications ... 56
Figure 3.24: Effects of the Tag-bettercycle2 predictor on the applications ... 57
Figure 3.25: Comparison of the performance of the predictors proposed in this chapter when the number of processes is 64, 32 (36 for BT and SP), and 16 ... 58
Figure 4.1: Distribution of message sizes of the applications when N = 4 ... 65
Figure 4.2: Distribution of message sizes of the applications when N = 9 for BT and SP, and 8 for CG, MG, LU, PSTSWM, and QCDMPI ... 66
Figure 4.3: Distribution of message sizes of the applications when N = 16 ... 67
Figure 4.4: Distribution of message sizes of the BT, SP, PSTSWM, and QCDMPI applications when N = 25 ... 68
Figure 4.5: Cumulative distribution function of the inter-send computation times for node zero of the application benchmarks when the number of processors is 16 for CG, MG, and LU, and 25 for BT, SP, QCDMPI, and PSTSWM ... 69
Figure 4.6: Percentage of the inter-send computation times for different benchmarks that are more than 5, 10, and 25 microseconds when N = 4, 8 or 9, 16, and 25 ... 72
Figure 4.7: Different scenarios for message transmission in a multicomputer with a reconfigurable optical interconnect: (a) when the message_transfer_delay is less than the inter_send time, and the available time is larger than the reconfiguration_delay; (b) when the message_transfer_delay is less than the inter_send time, and the available time is less than the reconfiguration_delay; (c) when the message_transfer_delay is larger than the inter_send time ... 73
Figure 4.8: Average ratio of the total reconfiguration time after hiding over the total original reconfiguration time for different benchmarks with the current generation and a 10 times faster CPU when d = 1, 5, 10, and 25 microseconds; A class for NPB, 4 nodes (shorter bars are better) ... 75
Figure 4.9: Average ratio of the total reconfiguration time after hiding over the total original reconfiguration time for different benchmarks with the current generation and a 10 times faster CPU when d = 1, 5, 10, and 25 microseconds; A class for NPB, 9 nodes for BT and SP, 8 nodes for other applications (shorter bars are better) ... 76


Figure 4.10: Average ratio of the total reconfiguration time after hiding over the total original reconfiguration time for different benchmarks with the current generation and a 10 times faster CPU when d = 1, 5, 10, and 25 microseconds; A class for NPB, 16 nodes (shorter bars are better) ... 77
Figure 4.11: Average ratio of the total reconfiguration time after hiding over the total original reconfiguration time for different benchmarks with the current generation and a 10 times faster CPU when d = 1, 5, 10, and 25 microseconds; A class for NPB, 25 nodes (shorter bars are better) ... 78
Figure 4.12: Summary of the average ratio of the total reconfiguration time after hiding over the total original reconfiguration time with the current generation and a 10 times faster CPU when applying the Tag-bettercycle2 predictor on the benchmarks with d = 25 microseconds, A class for NPB, and under different system sizes ... 80
Figure 4.13: Heuristics effects on the receiving side ... 81
Figure 4.14: Average percentage of the times the receive calls are issued before the corresponding send calls ... 82
Figure 5.1: Some collective communication operations ... 87
Figure 5.2: Latency hiding broadcasting algorithm for RON(k, N), N = 41, k = 2, d = 1 ... 92
Figure 5.3: First and second generation trees. The numbers underneath each tree denote the number of trees having the same height. These trees are rooted at nodes that were at the same level in the first generation tree ... 94
Figure 5.4: Sequential tree algorithm ... 104
Figure 5.5: Spanning binomial tree algorithm ... 105
Figure 5.6: Multinode broadcasting on an 8-node RON(k, N) under single-port modeling ... 106
Figure 5.7: Multinode broadcasting on a 9-node RON(k, N) under 2-port modeling ... 107
Figure 5.8: Total exchange on an 8-node RON(k, N) under single-port modeling ... 108
Figure 5.9: Total exchange on a 9-node RON(k, N) under 2-port modeling ... 110
Figure 6.1: Data transfers in a traditional messaging layer ... 119
Figure 6.2: Number of receive calls in the applications under different system sizes ... 124
Figure 6.3: Number of unique message identifiers in the applications under different system sizes


Figure 6.4: Distribution of the unique message identifiers for process zero in the applications ... 127
Figure 6.5: Effects of the LRU, FIFO, and LFU heuristics on the applications ... 128
Figure 6.6: Effects of the Tagging predictor on the applications ... 130
Figure 6.7: Effects of the Single-cycle predictor on the applications ... 131
Figure 6.8: Effects of the Tag-cycle2 predictor on the applications ... 131
Figure 6.9: Effects of the Tag-bettercycle2 predictor on the applications ... 132


List of Tables

Table 3.1: Memory requirements (in bytes) of the predictors when N = 64 ... 59
Table 4.1: Minimum inter-send computation times (microseconds) in NAS Parallel Benchmarks, PSTSWM, and QCDMPI when N = 4, 8, 9, 16, and 25 ... 70
Table 4.2: Communication to computation ratio of the applications ... 83
Table 5.1: Broadcasting time, k = 2, d = 1 ... 99
Table 5.2: Broadcasting time, k = 4, d = 3 ... 100
Table 5.3: Broadcasting time, d = 3 ... 101
Table 5.4: Multi-broadcasting time, k = 4, d = 3, M = 10 ... 103
Table 5.5: Total exchange time, N = 1024, single-port ... 112
Table 5.6: Total exchange time, N = 1024, k = 3 ... 112
Table 6.1: Memory requirements (in 6-tuple sets) for the predictors when N = 64 for CG, and N = 49 for BT, SP, and PSTSWM ... 134


Trademarks

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Trademarks and registered trademarks used in this work, where the author was aware of them, are listed below. All other trademarks are the property of their respective owners.

IBM SP2 is a registered trademark of International Business Machines Corp.

IBM Deep Blue is a registered trademark of International Business Machines Corp.

IBM P2SC CPU is a registered trademark of International Business Machines Corp.

IBM Vulcan Switch is a registered trademark of International Business Machines Corp.

Myrinet is a registered trademark of Myricom.

ServerNet is a registered trademark of the Tandem Division of Compaq.

SGI Origin 2000 is a registered trademark of Silicon Graphics, Inc.

SGI Spider Switch is a registered trademark of Silicon Graphics, Inc.


Glossary

AM Active Messages
ASCI Accelerated Strategic Computing Initiative program
BIP Basic Interface for Parallelism
BT Block Tridiagonal Application Benchmark
CA Communication Assist
CDF Cumulative Distribution Function
CGH Computer Generated Holograms
CIC Computing, Information and Communications Project
CG Conjugate Gradient Application Benchmark
CLUMP Cluster of Multiprocessors
COW Cluster of Workstations
DM Deformable Mirrors
DSM Distributed Shared-Memory Multiprocessor
EP Embarrassingly Parallel Application Benchmark
FIFO First-in-first-out
FM Fast Messages
FT 3-D Fast-Fourier Transform Application Benchmark
HPF High Performance Fortran
IS Integer Sort Application Benchmark
LAM/MPI Local Area Multicomputer/Message Passing Interface
LAN Local Area Networks


LIFO Last-in-first-out
LRU Least Recently Used
LU Lower-Upper Diagonal Application Benchmark
MG Multigrid Application Benchmark
MIMD Multiple Instructions Multiple Data
MPI Message Passing Interface
MPICH A Portable Implementation of MPI
MPP Massively Parallel Processors systems
NI Network Interface
NOW Networks of Workstations
NPB NAS Parallel Benchmarks
ORPC(k) Optically Reconfigurable Parallel Computer
OPS Optical Passive Stars
P2SC Power2-Super Microprocessor
POPS Partitioned Optical Passive Stars
PM A High-Performance Communication Library
PSTSWM Parallel Spectral Transform Shallow Water Model
PVM Parallel Virtual Machine
QCDMPI Quantum Chromodynamics with Message Passing Interface
RON(k, N) Reconfigurable Optical Network
RMA Remote Memory Access
SAN System Area Networks


SHRIMP Scalable High-Performance Really Inexpensive Multiprocessor
SP Scalar Pentadiagonal Application Benchmark
SPMD Single Program Multiple Data
TLB Translation Lookaside Buffer
U-Net A User-Level Network Interface Architecture
VCC Virtual Circuit Caching
VCSEL Vertical Cavity Surface Emitting Laser
VIA Virtual Interface Architecture
VMMC-2 Virtual Memory-Mapped Communications


Acknowledgments

I would like to express my deepest appreciation to my supervisor, Dr. Nikitas J. Dimopoulos, for his thoughtful suggestions that shaped and improved my ideas. I am very grateful to Nikitas for providing me with his valuable guidance, encouragement, support, criticism, patience, and kindness from the first day I came to Victoria.

I would like to thank the members of my dissertation committee. I wish to thank Dr. Kin F. Li, Dr. Vijay K. Bhargava, and Dr. D. Michael Miller for their support and suggestions. I am very grateful to Dr. José Duato for his kind acceptance to be the external examiner of this dissertation, and for his brilliant suggestions.

I am greatly indebted to my wife, Azita Gerami, for her continuous support and encouragement. Without her understanding, I would not have finished my dissertation. I would like to express my gratitude to my parents, who always encouraged me to pursue a Ph.D.

I want to thank all my friends and graduate fellows, especially the fellow researchers at LAPIS including André Schoorl, Nicolaos P. Kourounakis, Shahadat Khan, Mohamed Watheq El-Kharashi, Stephen W. Neville, Rafael Parra Hernandez, Caedmon Somers, Jon Kanie, and Eric Laxdal, who have made my stay so much fun.

I would like to thank the department's system and office staff for their continuous cooperation. I am thankful to Vicky Smith, Lynne Barrett, Maureen Denning, and Moneca Bracken.

Special thanks to Dr. Murray Campbell at the IBM T. J. Watson Research Center for his kind cooperation and help in accessing the IBM Deep Blue, and to the staff of the computer center at the University of Victoria for access to the University IBM SP2.

My dissertation research was supported by grants from the Natural Sciences and Engineering Research Council (NSERC) of Canada, and the University of Victoria.


and to my parents, Abbas Afsahi, and Khodsieh Salkouie, for their support and


Chapter 1

Introduction

Research in the area of advanced computer architecture has been primarily focused on how to improve the performance of computers in order to solve computationally intensive problems [32, 62, 69]. Some of these problems are called grand challenges. A grand challenge is a fundamental problem in science or engineering that has a broad economic and/or scientific impact: coupled fields, geophysical and astrophysical fluid dynamics (GAFD) turbulence, modeling the global climate system, formation of the large scale universe, global optimization algorithms for macromolecular modeling, petroleum exploration, aerodynamic simulations, and ocean circulation are just a few to mention.

The performance of processors is doubling every eighteen months [62]. However, there is always a demand for more computing power. To solve grand challenge problems, computer systems at the teraflop (10^12 floating-point operations per second) and petaflop (10^15 floating-point operations per second) performance levels are needed.

Processors are becoming very complex and only a few companies are designing new processors. Therefore, it is not cost-effective to build high performance computers just by using custom-design high performance processors. The trend is to design parallel computers using commodity processors to achieve teraflop and petaflop performance. For instance, two major projects to develop high performance supercomputers in the USA are the federal program in Computing, Information and Communications (CIC) at the national coordination office [98], and the Department of Energy Accelerated Strategic Computing Initiative (ASCI) program, including the Intel/Sandia Option Red, the IBM/Lawrence Livermore National Laboratory Blue Pacific, and the SGI/Los Alamos National Laboratory Blue Mountain [39].


often called Massively Parallel Processor (MPP) systems, are only used for grand challenges and parallel scientific applications. Even for applications requiring lower computing power, parallel computing is a cost-effective solution. These days, many high performance parallel computing systems are being used in network and commercial applications such as data warehousing, internet servers, and digital libraries.

Parallel processing is at the heart of such powerful computers. Although parallelism appears at different levels in a single processor system, such as lookahead, pipelining, superscalarity, speculative execution, vectorization, interleaving, overlapping, multiplicity, time sharing, multitasking, multiprogramming, and multithreading, it is the parallel processing and parallel computing among different processors which brings us such levels of performance.

Basically, a parallel computer is a "collection of processing elements that communicate and cooperate to solve large problems fast" [9]. In other words, a parallel computer, whether message-passing or distributed shared-memory (DSM), is a collection of complete computers, including processor and memory, that communicate through a general-purpose, high-performance, scalable interconnection network using a communication assist (CA) and/or a network interface (NI) [32], as shown in Figure 1.1.

[Figure 1.1: A generic parallel computer. Each node contains a processor (P), cache ($), memory, and a communication assist/network interface, and the nodes communicate through an interconnection network.]


Message-passing multicomputers, among all known parallel architectures, are the best suited to achieve such computing performance levels. Message-passing multicomputers are characterized by the distribution of memory among a number of computing nodes that communicate with each other by exchanging messages through their interconnection networks. Each node has its own processor, local memory, and communication assist/network interface. All local memories are private and are accessible only by the local processors. The wide acceptance of message-passing multiprocessor systems has been proven by the introduction of the Message Passing Interface (MPI) standard [92, 93]. Currently, in addition to vendor implementations of MPI on commercial machines, there are many freely available MPI implementations including MPICH [57] and LAM/MPI [78].

Recently, Networks of Workstations (NOW) [11], Clusters of Workstations (COW), and Clusters of Multiprocessors (CLUMP) [87] have been proposed to build inexpensive parallel computers, however, often at a lower performance level compared to MPP systems. The development of high-performance switches especially for building cost-effective interconnects known as System Area Networks (SAN) [23, 67, 113, 54] has motivated the suitability of networks of workstations/multiprocessors as an inexpensive high-performance computing platform. System area networks such as the Myricom Myrinet [23], the IBM Vulcan switch in the IBM SP2 machine [113], the Tandem ServerNet [67], and the Spider switch in the SGI Origin 2000 machine [54] are a new generation of networks that falls between memory buses and commercial local area networks (LANs).

Parallel processing, whether MPP, DSM, NOW, COW, or CLUMP, puts tremendous pressure on the interconnection networks and the memory hierarchy subsystems. As the communication overhead is one of the most important factors affecting the performance of parallel computers [76, 69, 43], there has been a growing interest in the design of interconnection networks. In this respect, various types of interconnection networks, such as complete networks, hypercubes, meshes, rings, tori, irregular switch-based, stack-graphs, and hypermeshes, have been proposed and some of them have been implemented [46, 124, 108]. Meanwhile, many routing algorithms [47, 56, 12] have been proposed for such networks.


between processors is very critical to obtaining high performance. In essence, parallel computers require extremely short communication latencies such that network transactions have minimal impact on the overall computation time. Communication hardware latency, communication software latency, and the user environment (multiprogramming, multiuser) are the major factors affecting the performance of parallel computer systems. This thesis concentrates on issues regarding hardware communication latency in electronic networks and reconfigurable optical networks, and software communication latency (regardless of the type of network).

In this thesis, I propose a number of techniques to achieve efficient communications in message-passing systems. This thesis makes five contributions:

• The first contribution of this thesis (Chapter 3) is the design and evaluation of two different categories of prediction techniques for message-passing systems. Specifically, I use these predictors to predict the target of communication messages in parallel applications.

• As the second contribution of this thesis (Chapter 4), I show that the majority of reconfiguration delays in reconfigurable networks can be hidden by using one of the high hit ratio predictors proposed in Chapter 3.

• The third contribution of this thesis (Chapter 5) is the analysis of a latency hiding broadcasting algorithm on single-hop reconfigurable networks under single-port and k-port modeling, which brings up closed formulations that yield the termination time.

• As the fourth contribution of this thesis (Chapter 5), I propose a new total exchange algorithm in single-hop reconfigurable networks under single-port and k-port modeling.

• Finally, the fifth contribution (Chapter 6) is the use and evaluation of the predictors proposed in Chapter 3 to predict the next consumable message at the receiving ends of message-passing systems (regardless of the type of network). I argue that these message predictors can be efficiently used to drain the network and cache the incoming messages even if the corresponding receive calls have not been posted yet.

Chapter 2 introduces the parallel applications used in this thesis. Chapter 7 concludes this dissertation and gives directions for future research. Appendix A describes how timing disturbances have been removed from the timing profiles of the parallel applications used in this thesis.

The rest of this chapter is organized as follows. In Section 1.1, I explain the communication locality in message-passing parallel applications and discuss different latency hiding techniques for parallel computer systems. In Section 1.2, I discuss the advantages of using prediction techniques at the send side of communications in reconfigurable optical interconnection networks, and in circuit-switched and wormhole-routed electronic interconnection networks. In Section 1.3, I describe the issues related to the messaging layer and software communication overhead in message-passing systems, and how prediction can help eliminate redundant message copying operations. I give an introduction to the issues regarding collective communications in Section 1.4. Finally, I summarize the contributions of this thesis in Section 1.5.

1.1 Communications Locality and Prediction Techniques

In this thesis, I am interested in the message-passing model of parallelism, as message-passing parallel computers scale much better than shared-memory parallel computers. Communication properties of message-passing parallel applications can be categorized by the spatial, temporal, and volume attributes of the communications [30, 75, 68]. The temporal attribute of communications in parallel applications characterizes the rate of message generation and the rate of computations in the applications. The volume of communications is characterized by the number of messages and the distribution of message sizes in the applications.


The spatial attribute of communications is characterized by the distribution of message destinations. Point-to-point communication patterns may be repetitive in message-passing applications, as most parallel algorithms consist of a number of computation and communication phases. Several researchers have worked to find or use the communications locality properties of parallel applications [30, 75, 68, 36, 37].

By message destination communication locality, I mean that if a certain source-destination pair has been used, it will be re-used with high probability by a portion of code that is "near" the place that was used earlier, and that it will be re-used in the near future. By message reception communication locality, I mean that if a certain message reception call has been used, it will be re-used with high probability by a portion of code that is "near" the place that was used earlier, and that it will be re-used in the near future.

Traditionally, one approach to deal with communication latency is to tolerate the latency: that is, hide the latency from the processor's critical path by overlapping it with other high latency events, or hide it with computations. The processor is then free to do other useful tasks.

Three approaches can be used to tolerate latency in shared-memory and message-passing systems [32]: proceeding past communication in the same thread, multithreading, and precommunication. The first approach, proceeding past communication in the same thread in message-passing systems, is to make communication messages asynchronous and proceed past them either to other asynchronous communication messages, or to the computation in the same thread. This approach is usually used by parallel algorithm designers. Some of the applications studied in this thesis use this type of latency tolerance by using nonblocking asynchronous MPI calls.

In multithreading, a thread issuing a communication operation suspends itself and lets another thread run. This approach is used for the other threads too. It is hoped that when the first thread is rescheduled, its communication operations have concluded. Multithreading can be done in software or hardware. Software multithreading is very expensive. Some hardware multithreading research architectures for message-passing systems, such as the J-Machine [35] and the M-Machine [52], have been reported.


In precommunication, communication operations are pulled up from the place where communications naturally occur in the program, so that they are partially or entirely completed before the data is needed. This can be done in software, by inserting a precommunication operation, or in hardware, by predicting the subsequent communication operations and issuing them early.

Precommunication is common in receiver-initiated communications (that is, in shared-memory systems) where communication commences when data is needed, such as in a read operation. In software-controlled prefetching, the programmer or the compiler decides when and what to prefetch by analyzing the program and then inserting prefetch instructions before the actual data request in the program [95]. In hardware-controlled prefetching, dedicated hardware is used to predict the future accesses of sharing patterns and coherence activities by looking at their observed behavior [96, 77, 73, 133, 34, 107]. Thus, there is no need to add instructions to the program. These techniques assume that memory accesses and coherence activities in the near future will follow past patterns. Then, the hardware prefetches the data based on its prediction.

In sender-initiated systems (that is, in message-passing systems), it is usually difficult to perform the communication operation earlier at the send side and thus hide the latency. This is because message communication is naturally initiated to transfer the data when the data is produced. However, messages may arrive at the receiver earlier than they are needed, which leads to a precommunication for the receive side of the communication.

As far as the author is aware, no precommunication technique has been proposed for message-passing systems. Prediction techniques can be used to predict the subsequent message destinations and message reception calls in message-passing systems. This thesis, for the first time, proposes and evaluates two categories of pattern-based predictors, namely Cycle-based predictors and Tag-based predictors, for message-passing systems. These predictors can be used dynamically (at the send side or receive side of communications) at the communication assist or network interface, with or without the help of a programmer or the compiler.
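To make the idea concrete, the following is a minimal sketch of a cycle-based destination predictor in the spirit of the Cycle-based family mentioned above. The exact Single-cycle and Better-cycle algorithms are defined in Chapter 3; the structure, constants, and names used here (cycle_predictor, MAX_CYCLE, the sample trace) are illustrative assumptions only. The sketch records the destination sequence until the first destination reappears, then treats the recorded sequence as a cycle and replays it to predict the next target; a misprediction restarts the recording.

    /* Illustrative sketch of a cycle-based destination predictor; not the
     * thesis' exact Single-cycle/Better-cycle algorithms (see Chapter 3). */
    #include <stdio.h>

    #define MAX_CYCLE 64

    typedef struct {
        int cycle[MAX_CYCLE]; /* recorded destination sequence          */
        int len;              /* length of the recorded sequence        */
        int pos;              /* replay position once a cycle is closed */
        int closed;           /* 1 after the cycle head has re-appeared */
    } cycle_predictor;

    static void cp_init(cycle_predictor *p) { p->len = 0; p->pos = 0; p->closed = 0; }

    /* Predicted destination of the next send, or -1 if no prediction yet. */
    static int cp_predict(const cycle_predictor *p)
    {
        return p->closed ? p->cycle[p->pos] : -1;
    }

    /* Update the predictor with the destination that was actually requested. */
    static void cp_update(cycle_predictor *p, int dest)
    {
        if (p->closed) {
            if (p->cycle[p->pos] == dest) {      /* hit: keep replaying the cycle */
                p->pos = (p->pos + 1) % p->len;
                return;
            }
            cp_init(p);                          /* miss: start a new recording   */
        }
        if (p->len > 0 && dest == p->cycle[0]) { /* cycle head seen again: close  */
            p->closed = 1;
            p->pos = 1 % p->len;
            return;
        }
        if (p->len < MAX_CYCLE)
            p->cycle[p->len++] = dest;
    }

    int main(void)
    {
        /* A repetitive destination sequence, e.g. a stencil-like exchange. */
        int trace[] = {1, 2, 3, 1, 2, 3, 1, 2, 3};
        cycle_predictor p;
        int hits = 0, predictions = 0;
        cp_init(&p);
        for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++) {
            int guess = cp_predict(&p);
            if (guess != -1) { predictions++; hits += (guess == trace[i]); }
            cp_update(&p, trace[i]);
        }
        printf("hits: %d out of %d predictions\n", hits, predictions);
        return 0;
    }

On a trace such as 1, 2, 3, 1, 2, 3, ..., the sketch predicts every destination correctly once the first cycle has been observed, which is the behavior that gives such predictors their high hit ratios on repetitive communication patterns.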


1.2 Using the Proposed Predictors at the Send Side

In the following, I explain how message destination prediction can be helpful in hiding the reconfiguration delay in single-hop and multi-hop reconfigurable optical interconnection networks, and in hiding the path setup time in circuit-switched electronic networks. I also describe the benefit of message destination prediction techniques in reducing the latency of communications in current commercial wormhole-routed networks.

The interconnection network plays a key role in the performance of message-passing parallel computers. A message is sent from a source to a destination through the interconnection network. High communication bandwidth and low communication latency are essential for efficient communication between a source and a destination. However, communication latency is the most important factor affecting the performance of message-passing parallel computers. In this thesis, I am interested in hiding and reducing the communication latency. Two categories of interconnection networks exist: electronic interconnection networks, and optical interconnection networks. I have developed prediction techniques that can be applied to both electronic and optical interconnection networks.

The proposed predictors can be used to set up paths in advance in electronic networks using either circuit switching or wave switching. In circuit switching, the routing header flit progresses toward the message destination and reserves physical links. Wave switching is a hybrid switching technique for high performance routers in electronic interconnection networks. Wave switching combines wormhole switching and circuit switching in the same router architecture to reduce the fixed overhead of communication latency by exploiting communication locality. Hence, it is possible to hide the hardware communication latency by using message destination predictions to pre-establish physical circuits in circuit switching and wave switching networks.

The predictors can even be useful to reduce communication latency in current commercial networks. For example, Myrinet networks [23] have a relatively long routing time compared with the link transmission time. Predictors would allow sending the message header in advance for the predicted message destination. When the data becomes available, it can be directly transmitted through the network if the prediction was correct, thus reducing latency significantly. In case of a mis-prediction, a message tail is forwarded to tear the path down. Obviously, null messages must be discarded at the destination.

Optics is ideally suited for implementing interconnection networks because of its superior characteristics over electronic interconnects, such as higher bandwidth, greater number of fan-ins and fan-outs, higher interconnection densities, less signal crosstalk, freedom from planar constraints as it can easily exploit the third spatial dimension which dramatically increases the available communication bandwidth, lower signal and clock skew, lower power dissipation, inherent parallelism, immunity from electromagnetic interference and ground loops, and suitability for reconfigurable interconnects [100, 51, 74, 19, 50, 129, 82, 19].

Future massively parallel computers might benefit from using reconfigurable optical interconnection networks. Currently, there are some problems with the optical interconnect technology. Signal attenuation, optical element alignment, slow conversion between electronics and photonics and vice versa, and high reconfiguration delay are some disadvantages of optics, which are mostly due to its relatively immature technology. However, this technology is maturing fast. As an example, Lucent's WaveStar LambdaRouter [86] relies on an array of hundreds of electrically configurable microscopic mirrors fabricated on a single substrate so that an individual wavelength can be passed to any of 256 input and output fibers.

As stated above, the reconfiguration delay in reconfigurable optical interconnection networks is currently very high. The proposed message destination predictors can be efficiently used to hide the reconfiguration delay in single-hop and multi-hop reconfigurable optical interconnection networks, concurrently with the computations [127, 84].

1.3 Redundant Message Copying in Software Messaging Layers

The communication software overhead currently dominates the communication time in clusters of workstations/multiprocessors. Crossing protection boundaries several times between the user space and the kernel space, passing through several protocol layers, and involving a number of memory copying operations are three different sources of software communication cost.


Several researchers are working to minimize the cost of crossing protection boundaries, and to use simple protocol layers, by utilizing user-level messaging techniques such as Active Messages (AM) [125], Fast Messages (FM) [102], Virtual Memory-Mapped Communications (VMMC-2) [48], U-Net [126], LAPI [110], Basic Interface for Parallelism (BIP) [105], Virtual Interface Architecture (VIA) [49], and PM [121].

A significant portion of the software communication overhead belongs to a number of message copying operations. Ideally, message protocols should copy the message directly from the send buffer in its user space to the receive buffer at the destination without any intermediate buffering. However, applications at the send side do not know the final receive buffer addresses and, hence, the communication subsystems at the receiving end still copy messages into a temporary buffer.

Several research groups have tried to avoid memory copying [79, 14, 106, 119, 118]. They have been able to remove the extra memory copying operations between the application user buffer space and the network interface at the send side. However, they haven't been able to remove the memory copying at the receiver side. They may achieve zero-copy messaging at the receiver side only when the receive call is already posted, a rendez-vous type communication is used for large messages, or the destination buffer address is already known through an extra communication (pre-communication). However, the predictors proposed in this dissertation can be efficiently used to predict the next message reception calls and thus move the corresponding incoming messages to a place near the CPU, such as a staging cache.
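As a reading aid for how such reception predictions could be exploited, the sketch below shows an arrival handler that deposits an early-arriving message directly into a staging buffer when the message matches the predicted next receive, and falls back to the usual temporary-buffer copy otherwise. The data structures and function names (msg_header, staging_slot, predict_next_recv_tag) are assumptions made for this illustration, not the thesis' mechanism or any library's API.

    /* Illustrative sketch only: steering an early-arriving message using a
     * reception prediction.  All names here are hypothetical. */
    #include <string.h>
    #include <stdio.h>

    #define SLOT_BYTES 4096

    typedef struct { int src; int tag; int len; const char *payload; } msg_header;

    typedef struct {
        int  tag;               /* message identifier the predictor expects next */
        int  valid;             /* payload already staged for that identifier?   */
        char data[SLOT_BYTES];  /* staging buffer close to the CPU               */
    } staging_slot;

    /* Placeholder for a reception predictor such as those evaluated in Chapter 6. */
    static int predict_next_recv_tag(void) { return 42; }

    /* Called when a message arrives before its receive call has been posted. */
    static void on_message_arrival(const msg_header *m, staging_slot *slot,
                                   char *temp_buffer)
    {
        if (m->tag == slot->tag && m->len <= SLOT_BYTES) {
            memcpy(slot->data, m->payload, (size_t)m->len);  /* stage near the CPU  */
            slot->valid = 1;
        } else {
            memcpy(temp_buffer, m->payload, (size_t)m->len); /* usual fallback copy */
        }
    }

    int main(void)
    {
        staging_slot slot = { .tag = predict_next_recv_tag(), .valid = 0 };
        char temp[SLOT_BYTES];
        msg_header early = { .src = 3, .tag = 42, .len = 5, .payload = "data" };

        on_message_arrival(&early, &slot, temp);
        printf("predicted receive staged: %s\n", slot.valid ? "yes" : "no");
        return 0;
    }

When the prediction is correct, the later receive call only has to consume the already-staged data instead of triggering another copy out of a temporary buffer.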

1.4 Collective Communications

Communication operations may be either point-to-point, which involve a single source and a single destination, or collective, in which more than two processes participate. Collective communications are common basic patterns of interprocessor communication that are frequently used as building blocks in a variety of parallel algorithms. Proper implementation of these basic communication operations is key to the performance of parallel computers. Therefore, there has been a great deal of interest in their design and in the study of their performance. Excellent surveys on collective communication algorithms can be found in [90, 53, 61].

Collective communication operations can be used for data movement, process control, or global operations. Data movement operations include broadcasting, multicasting, scattering, gathering, multinode broadcasting, and total exchange. Barrier synchronization is a type of process control. Global operations include reduction and scan. The growing interest in collective communications is evident from their inclusion in the Message Passing Interface (MPI) [93, 92].

1.5 Thesis Contributions

In Chapter 2, I describe the applications used in this thesis along with the point-to-point communication primitives that they use. I explain the experimental methodology used to collect the communication traces of the applications.

In Chapter 3, I introduce a complete interconnection network using free-space reconfigurable optical interconnects for message-passing parallel machines. A computing node in this parallel machine configures its communication link(s) to reach its destination node(s). Then it sends its message(s) over the established link(s).

I characterize some communication properties of the parallel applications by presenting their communication frequency and message destination distributions. I define the concept of communication locality in message-passing parallel applications, and of caching in reconfigurable networks. I present evidence, using the classical memory hierarchy heuristics LRU, LFU, and FIFO, that there exists message destination communication locality in message-passing parallel applications.
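The caching experiments referred to above can be pictured with a small sketch: a window of the most recently used destinations is treated as a set of already-configured channels, and each send is checked against it. The window size and the sample trace below are illustrative assumptions; the actual heuristics, window sizes, and results are those reported in Chapter 3.

    /* Illustrative sketch: an LRU window of recently used message destinations,
     * standing in for a set of already-configured channels. */
    #include <stdio.h>

    #define WINDOW 4   /* number of "cached" (already configured) channels */

    static int recent[WINDOW];
    static int used = 0;

    /* Returns 1 on a hit (destination already in the window), 0 on a miss,
     * and updates the window with LRU replacement either way. */
    static int lru_access(int dest)
    {
        int i, j, hit = 0;
        for (i = 0; i < used; i++) {
            if (recent[i] == dest) { hit = 1; break; }
        }
        if (!hit && used < WINDOW) i = used++;   /* fill an empty slot   */
        else if (!hit) i = WINDOW - 1;           /* evict the LRU entry  */
        for (j = i; j > 0; j--)                  /* move dest to MRU end */
            recent[j] = recent[j - 1];
        recent[0] = dest;
        return hit;
    }

    int main(void)
    {
        int trace[] = {1, 2, 3, 1, 2, 3, 5, 1, 2, 3};
        int n = (int)(sizeof trace / sizeof trace[0]);
        int hits = 0;
        for (int i = 0; i < n; i++)
            hits += lru_access(trace[i]);
        printf("LRU(%d) hits: %d out of %d sends\n", WINDOW, hits, n);
        return 0;
    }

A high hit ratio under such a window is exactly the message destination communication locality that the LRU, LFU, and FIFO experiments of Chapter 3 measure.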

The first contribution of this thesis (Chapter 3) is the design and evaluation (in terms of hit ratio) of two different categories of hardware/software communication latency hiding predictors for such reconfigurable message-passing environments. I have utilized the message destination locality property of message-passing parallel applications to devise a number of heuristics that can be used to predict the target of subsequent communication calls. This technique can be applied directly to reconfigurable interconnects to hide the communication latency by reconfiguring the communication network concurrently with the computation.

Specifically, I propose two sets of message destination predictors: Cycle-based predictors, which are purely dynamic predictors, and Tag-based predictors, which are static/dynamic predictors. In the Cycle-based predictors, Single-cycle, Single-cycle2, Better-cycle, and Better-cycle2, predictions are done dynamically at the network interface without any help from the programmer or compiler. In the Tag-based predictors, Tagging, Tag-cycle, Tag-cycle2, Tag-bettercycle, and Tag-bettercycle2, predictions are done dynamically at the network interface as well, but they require an interface to pass some information from the program to the network interface. This can be done with the help of a programmer or the compiler by inserting instructions in the program such as pre-connect (tag) (or pre-receive (tag) as in Chapter 6). The performance of the proposed predictors, Better-cycle2 and Tag-bettercycle2, is very high and proves that they have the potential to hide the hardware communication latency in reconfigurable networks. The memory requirements of the predictors are very low, which makes them very attractive for implementation on the communication assist or network interface.

In order to efficiently use the predictors proposed in Chapter 3 to hide the hardware latency of the reconfigurable interconnects, enough lead time should exist such that the reconfiguration of the interconnect can be completed before the communication request arrives. In Chapter 4, I present the pure execution times of the computation phases of the parallel applications on the IBM Deep Blue machine at the IBM T. J. Watson Research Center, using its high-performance switch and under the user space mode.

As the second contribution of this thesis, Chapter 4 states that, by comparing the inter-send computation times of these parallel benchmarks with some specific reconfiguration times, most of the time we are able to fully utilize these computation times for the concurrent reconfiguration of the interconnect when we know, in advance, the next target using one of the proposed high hit ratio target prediction algorithms introduced in Chapter 3. I present the performance enhancements of the proposed predictors on the application benchmarks for the total reconfiguration time. Finally, I show that by applying the predictors at the send side, applications at the receiver side would also benefit, as messages arrive earlier than before.

As the third contribution of this thesis (Chapter 5), I present and analyze a broadcasting algorithm that utilizes latency hiding and reconfiguration in the network to speed up the broadcasting operation under single-port and k-port modeling. In this algorithm, the reconfiguration phase of some of the nodes is overlapped with the message transmission phase of the other nodes, which ultimately reduces the broadcasting time. The analysis brings up a closed formulation that yields the termination time of the algorithm.

The fourth contribution of this thesis (Chapter 5) is a combined total exchange algorithm based on a combination of the direct [109, 120] and standard exchange [71, 24] algorithms. This ensures a better termination time than that which can be achieved by either of the two algorithms. Also, known algorithms [20, 40] for scattering and all-to-all broadcasting have been adapted to the network.

In Chapter 6, I present the frequency and distributions of the receive communication calls in the applications. I present evidence that there exists message reception communication locality in message-passing parallel applications. As I stated earlier, the communication subsystems at the receiving end still copy early arriving messages unnecessarily into a temporary buffer. As far as the author is aware, no prediction techniques have been proposed to remove this unnecessary message copying.

I use the predictors introduced in Chapter 3 to predict the next consumable message, and thus establish the existence of message reception communication locality. As the fifth contribution of this thesis, Chapter 6 argues that these message predictors can be efficiently used to drain the network and cache the incoming messages even if the corresponding receive calls have not been posted yet. This way, there is no need to unnecessarily copy the early arriving messages into a temporary buffer.

The performance of the proposed predictors, Single-cycle, Tag-cycle2 and Tag-bettercycle2, in terms of hit ratio on the parallel applications is quite promising and suggests that prediction has the potential to eliminate most of the remaining message copies.


Moreover, the memory requirements of these predictors are very low, making them easy to implement. Finally, I discuss ways in which these predictions could be used to drastically reduce the latency due to message copying.


Chapter 2

Application Benchmarks and Experimental Methodology

In Section 2.1, I describe the applications used in this thesis. I explain the various point-to-point message-passing primitives of the applications in Section 2.2. I discuss the experimental methodology in Section 2.3.

2.1 Parallel Benchmarks

This thesis (except Chapter 5) studies the computation and communication characteristics of actual parallel applications. For these studies, I have used some well-known parallel benchmarks from the NAS Parallel Benchmarks (NPB) suite [13], the Parallel Spectral Transform Shallow Water Model (PSTSWM) parallel application [125], and the pure Quantum Chromo Dynamics Monte Carlo Simulation Code with MPI (QCDMPI) parallel application [65]. Although the results presented in this thesis are for the above parallel applications, these applications have been widely used as benchmarks representing the computations in scientific and engineering parallel applications.

I used the MPI [92] implementation of the NAS benchmarks, version 2.3, the PSTSWM, version 6.2, and the QCDMPI, version 1.4, and ran them on several IBM SP2 machines. I chose the IBM SP2 as it is a message-passing parallel machine, so the chosen parallel applications map directly onto it. I used different system sizes and problem sizes of the applications in this study. NPB 2.3 comes with five problem sizes for each benchmark: small class "S", workstation class "W", large class "A", and larger classes "B" and "C". Due to access limitations in the use of the IBM Deep Blue machine at the IBM T. J. Watson Research Center, and space limitations in using the University of Victoria IBM SP2, I was able to experiment with only the "W" and "A" classes, and the results included in this thesis represent these classes.


2.1.1 NPB: NAS Parallel Benchmarks Suite

The NAS Parallel Benchmarks (NPB) [13] have been developed at the NASA Ames Research Center to study the performance of massively parallel processor systems and networks of workstations. The NAS Parallel Benchmarks are a set of eight benchmark problems, each of which focuses on some important aspect of highly parallel supercomputing for aerophysics applications. The NPB are a set of implementations of the NAS Parallel Benchmarks based on Fortran 77 and the MPI message-passing interface standard, and are not tied to any specific system.

The NPB consists of five "kernels" and three "simulated computational fluid dynamics (CFD) applications". The three simulated CFD application benchmarks, lower-upper diagonal (LU), scalar pentadiagonal (SP), and block tridiagonal (BT), are intended to accurately represent the principal computational and data movement requirements of modern CFD applications. The kernels, conjugate gradient (CG), multigrid (MG), embarrassingly parallel (EP), 3-D fast-Fourier transform (FT), and integer sort (IS), are relatively compact problems, each of which emphasizes a particular type of numerical computation. I am interested in the point-to-point patterns of the LU, BT, and SP applications, and the CG and MG kernels. The EP, FT, and IS kernels are not suitable for this study. EP and FT use only collective communication operations, while each node in the IS kernel always communicates with a specific node.

2.1.1.1 CG

The conjugate gradient kernel, CG, tests the performance of the system for unstructured grid computations, which by their nature require irregular long distance communications, a challenge for all kinds of parallel computers. Essentially, it requires computing a sparse matrix-vector product. The inverse power method is used to find an estimate of the largest eigenvalue of a symmetric positive-definite sparse matrix with a random pattern of non-zeros. This code requires a power-of-two number of processors.


2.1.1.2 MG

The second kernel benchmark is a simplified multigrid kernel, MG, which solves a 3-D Poisson PDE. Four iterations of the V-cycle multigrid algorithm are used to obtain an approximate solution u to the discrete Poisson problem ∇²u = v on a 256 x 256 x 256 grid with periodic boundary conditions. This code is a good test of both short and long distance highly structured communication. This code requires a power-of-two number of processors. The partitioning of the grid onto processors occurs such that the grid is successively halved, starting with the z dimension, then the y dimension, and then the x dimension, and repeating until all power-of-two processors are assigned.

2.1.1.3 LU

The lower-upper diagonal benchmark, LU, employs a symmetric successive over-relaxation (SSOR) numerical scheme to solve a regular-sparse block 5 x 5 lower and upper triangular system. A 2-D partitioning of the grid onto processors is obtained by halving the grid repeatedly in the first two dimensions, alternately x and then y, until all power-of-two processors are assigned, resulting in vertical pencil-like grid partitions on the individual processors. The ordering of the point-based operations constituting the SSOR procedure proceeds on diagonals which progressively sweep from one corner of a given z plane to the opposite corner of the same z plane, thereupon proceeding to the next z plane. Communication of partition boundary data occurs after completion of computation on all diagonals that contact an adjacent partition. LU is very sensitive to the small-message communication performance of an MPI implementation; it is the only benchmark in the NPB 2.3 suite that sends large numbers of very small (40 byte) messages.
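The diagonal ordering can be pictured with the following sketch (a generic wavefront loop in C, not the benchmark code; the update inside the loop is only a placeholder for the actual SSOR relaxation):

    /* Sweep one k-plane along diagonals i + j = d, so that every point is  */
    /* visited only after its west and south neighbours, which lie on the   */
    /* previous diagonal, have been updated.                                */
    void plane_wavefront(int nx, int ny, double u[nx][ny])
    {
        for (int d = 0; d <= nx + ny - 2; d++) {
            for (int i = 0; i < nx; i++) {
                int j = d - i;
                if (j < 0 || j >= ny) continue;
                double west  = (i > 0) ? u[i - 1][j] : 0.0;
                double south = (j > 0) ? u[i][j - 1] : 0.0;
                /* placeholder relaxation; the real SSOR step is more involved */
                u[i][j] = 0.5 * u[i][j] + 0.25 * (west + south);
            }
        }
    }

Communication with neighbouring partitions is inserted after the diagonals that touch a partition boundary, which is why LU generates so many small messages.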

2.1.1.4 BT and SP

The BT and SP algorithms have a similar structure: each solves three sets of uncoupled systems of equations, first in the x, then in the y, and finally in the z direction. In the block tridiagonal benchmark, BT, multiple independent systems of non-diagonally dominant, block tridiagonal equations with a 5 x 5 block size are solved. In the scalar pentadiagonal benchmark, SP, multiple independent systems of scalar pentadiagonal equations with a 5 x 5 block size are solved. Both BT and SP codes require a square number of processors. These codes have been written so that if a given parallel platform only permits a power-of-two number of processors to be assigned to a job, then unneeded processors are deemed inactive and are ignored during computation, but are counted when determining Mflop/s rates.

2.1.2 PSTSWM

The Parallel Spectral Transform Shallow Water Model (PSTSWM) application [125] was developed by Worley at Oak Ridge National Laboratory and Foster at Argonne National Laboratory. PSTSWM is a message-passing benchmark code and parallel algorithm testbed that solves the nonlinear shallow water equations on a rotating sphere using the spectral transform method. PSTSWM was developed to evaluate parallel algorithms for the spectral transform method as it is used in global atmospheric circulation models. Multiple parallel algorithms are embedded in the code and can be selected at run-time, as can the problem size, the number of processors, and the data decomposition. PSTSWM is written in Fortran 77 with VMS extensions and a small number of C preprocessor directives. I used the MPI implementation of PSTSWM with the default input sizes.

2.1.3 QCDMPI

Pure Quantum Chromo Dynamics Monte Carlo Simulation Code with MPI (QCDMPI) [65], written by Hioki at Tezukayama University, is a pure Quantum Chromo Dynamics (QCD) simulation code with MPI calls. It is a powerful tool for analyzing the non-perturbative aspects of QCD. The program can be applied to QCD in any number of dimensions, such as 3-dimensional QCD, in which the color and/or quark confinement mechanisms are obtained. QCDMPI runs on any number of processors, and any dimensional partitioning of the system can be applied.


2.2 Applications' Communication Primitives

As stated earlier, I am only interested in the patterns of the point-to-point communications between pair-wise nodes in the above applications, as discussed in Chapter 3, Chapter 4, and Chapter 6 of this thesis. Efficient algorithms for collective communications are presented in Chapter 5. These applications use synchronous and asynchronous MPI send and receive primitives [92]. I briefly explain these communication primitives here.

An MPI program consists of autonomous processes, executing their own code, in a multiple instructions multiple data (MIMD) style. Note that all parallel applications studied in this thesis use a single program multiple data (SPMD) style. Processes are identified according to their relative rank in a group, that is, by consecutive integers in the range 0 to groupsize - 1. If the group consists of all processes, then the processes are ranked from 0 to N - 1, where N is the total number of processes in the application.

The processes communicate via calls to MPI communication primitives. The basic point-to-point communication operations are send and receive. There are two general types of point-to-point communication operations in MPI: blocking and nonblocking. Blocking send or receive calls will not return until the parameters of the call can be safely modified. That is, in the case of a send call, the message envelope has been created and the message has either been sent out or been buffered into a system buffer; in the case of a receive call, the message has been received into the receive buffer. Note that the message envelope consists of a fixed number of fields (source, dest, tag, comm) and is used to distinguish messages and selectively receive them. Nonblocking communication operations just post or start the operation; the application programmer must therefore explicitly complete the communication later at some point in the program using one of the various completion calls in MPI, such as MPI_Wait or MPI_Waitall.
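As a small illustration of envelope matching (the calls are standard MPI, in C; the buffer size, tag value, and peer rank are arbitrary choices for this sketch):

    #include <mpi.h>

    void selective_receive(int peer)
    {
        double buf[64];
        MPI_Status status;

        /* Matches only a message from `peer` that carries tag 7. */
        MPI_Recv(buf, 64, MPI_DOUBLE, peer, 7, MPI_COMM_WORLD, &status);

        /* Wildcards relax the envelope: accept any source and any tag. */
        MPI_Recv(buf, 64, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
    }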

There are four communication modes in MPI: standard, buffered, synchronous, and ready. These correspond to four different types of send operations. In the synchronous mode send call, the call will not finish until a matching receive call has been issued and has begun reception of the message. In the buffered mode send call, the send call is local (in contrast to the other communication modes, where the send calls are nonlocal) and does not wait for the receive call to be posted; it buffers the data when the receive has not yet been posted. In the ready mode send call, the receive call must have been posted earlier. In the standard mode, it is up to the system either to buffer the data or to send it as in the synchronous mode. Note that the standard mode is the only mode for receive calls.
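The four send modes correspond to distinct MPI calls; the following C sketch (illustrative only, with arbitrary message parameters) shows all four, including the user buffer that must be attached before a buffered-mode send:

    #include <mpi.h>
    #include <stdlib.h>

    void send_modes(double *data, int count, int dest)
    {
        int   bufsize = (int)(count * sizeof(double)) + MPI_BSEND_OVERHEAD;
        void *scratch = malloc(bufsize);
        MPI_Buffer_attach(scratch, bufsize);        /* needed by MPI_Bsend */

        MPI_Send (data, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD); /* standard    */
        MPI_Bsend(data, count, MPI_DOUBLE, dest, 1, MPI_COMM_WORLD); /* buffered    */
        MPI_Ssend(data, count, MPI_DOUBLE, dest, 2, MPI_COMM_WORLD); /* synchronous */

        /* Ready mode: the matching receive must already be posted at dest. */
        MPI_Rsend(data, count, MPI_DOUBLE, dest, 3, MPI_COMM_WORLD);

        MPI_Buffer_detach(&scratch, &bufsize);
        free(scratch);
    }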

2.2.1 MPI_Send

MPI_Send (buf, count, datatype, dest, tag, comm) [92] is a standard blocking send call whose behavior is a combination of the buffered and synchronous modes and depends on the implementation. When the call finishes, the send buffer can be reused. In the buffered mode, data is written from the send buffer to a system buffer and the call returns. In the synchronous mode, the call waits for the receive to be posted and then returns. The LU, MG, CG, and PSTSWM applications use this type of send call.
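A minimal sketch of the blocking standard send between two ranks (the message size and tag are arbitrary choices for illustration):

    #include <mpi.h>

    void blocking_exchange(int rank)
    {
        double msg[100];
        MPI_Status status;

        if (rank == 0) {
            MPI_Send(msg, 100, MPI_DOUBLE, 1, 42, MPI_COMM_WORLD);
            /* On return, msg may be reused, whether the library buffered */
            /* the message or synchronized with the receiver.             */
        } else if (rank == 1) {
            MPI_Recv(msg, 100, MPI_DOUBLE, 0, 42, MPI_COMM_WORLD, &status);
        }
    }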

2.2.2 MPI_Isend

MPI_Isend (buf, count, datatype, dest, tag, comm, request) [92] is a standard nonblocking send call. It returns immediately; therefore, the send buffer cannot be reused until the operation has been completed. It can be implemented in the buffered or synchronous mode. It requires another call, MPI_Wait or MPI_Waitall, to complete the operation. These completion calls are explained later in Section 2.2.6 and Section 2.2.7, respectively. BT and SP use this type of send call.
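The nonblocking pattern can be sketched as follows (a simplification of the kind of boundary exchange BT and SP perform, not taken from the benchmark source; buffer names and tags are mine):

    #include <mpi.h>

    void nonblocking_sends(double *north, double *south, int count,
                           int up, int down)
    {
        MPI_Request req[2];
        MPI_Status  stat[2];

        MPI_Isend(north, count, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(south, count, MPI_DOUBLE, down, 1, MPI_COMM_WORLD, &req[1]);

        /* ... computation that does not touch north or south ... */

        MPI_Waitall(2, req, stat);  /* only now may the send buffers be reused */
    }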

2.2.3 MPI_Sendrecv_replace

MPI_Sendrecv_replace (buf, count, datatype, dest, sendtag, source, recvtag, comm, status) [92] combines in one call the sending of a message and the receiving of another message into the same buffer. QCDMPI uses this type of communication call.
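A typical use is a shift along a ring of processes, sketched below (this mirrors the general style of a boundary exchange; it is not taken from QCDMPI):

    #include <mpi.h>

    void ring_shift(double *buf, int count, int rank, int size)
    {
        int right = (rank + 1) % size;          /* destination */
        int left  = (rank - 1 + size) % size;   /* source      */
        MPI_Status status;

        /* Send buf to the right neighbour and overwrite it with the data */
        /* received from the left neighbour, using a single call.         */
        MPI_Sendrecv_replace(buf, count, MPI_DOUBLE, right, 0,
                             left, 0, MPI_COMM_WORLD, &status);
    }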

2.2.4 MPI_Recv

MPI_Recv (buf, count, datatype, source, tag, comm, status) [92] is a standard blocking receive call. When it returns, the data is available in the destination buffer. LU and PSTSWM use this type of receive call.
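The status argument can be inspected after the call returns; the sketch below (with an arbitrary datatype and wildcards chosen for illustration) shows how the actual source, tag, and message length are recovered:

    #include <mpi.h>

    void receive_and_inspect(double *buf, int maxcount)
    {
        MPI_Status status;
        int count;

        MPI_Recv(buf, maxcount, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_DOUBLE, &count);
        /* status.MPI_SOURCE and status.MPI_TAG identify the matched message; */
        /* count is the number of elements actually received.                 */
    }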
