
Contents lists available at ScienceDirect

Parallel Computing

journal homepage: www.elsevier.com/locate/parco

Improving performance of sparse matrix dense matrix multiplication on large-scale parallel systems

Seher Acer, Oguz Selvitopi, Cevdet Aykanat

Computer Engineering Department, Bilkent University, 06800 Ankara, Turkey

Article info

Article history:
Received 30 March 2015
Revised 25 August 2016
Accepted 5 October 2016
Available online 6 October 2016

Keywords:
Irregular applications
Sparse matrices
Sparse matrix dense matrix multiplication
Load balancing
Communication volume balancing
Matrix partitioning
Graph partitioning
Hypergraph partitioning
Recursive bipartitioning
Combinatorial scientific computing

Abstract

We propose a comprehensive and generic framework to minimize multiple and different volume-based communication cost metrics for sparse matrix dense matrix multiplication (SpMM). SpMM is an important kernel that finds application in computational linear algebra and big data analytics. On distributed memory systems, this kernel is usually characterized with its high communication volume requirements. Our approach targets irregularly sparse matrices and is based on both graph and hypergraph partitioning models that rely on the widely adopted recursive bipartitioning paradigm. The proposed models are lightweight, portable (can be realized using any graph and hypergraph partitioning tool) and can simultaneously optimize different cost metrics besides total volume, such as maximum send/receive volume, maximum sum of send and receive volumes, etc., in a single partitioning phase. They allow one to define and optimize as many custom volume-based metrics as desired through a flexible formulation. The experiments on a wide range of about a thousand matrices show that the proposed models drastically reduce the maximum communication volume compared to the standard partitioning models that only address the minimization of total volume. The improvements obtained on volume-based partition quality metrics using our models are validated with parallel SpMM as well as parallel multi-source BFS experiments on two large-scale systems. For parallel SpMM, compared to the standard partitioning models, our graph and hypergraph partitioning models respectively achieve reductions of 14% and 22% in runtime, on average. Compared to the state-of-the-art partitioner UMPa, our graph model is overall 14.5× faster and achieves an average improvement of 19% in the partition quality on instances that are bounded by maximum volume. For parallel BFS, we show on graphs with more than a billion edges that the scalability can significantly be improved with our models compared to a recently proposed two-dimensional partitioning model.

© 2016 Elsevier B.V. All rights reserved.

1. Introduction

Sparse matrix kernels form the computational basis of many scientific and engineering applications. An important kernel is the sparse matrix dense matrix multiplication (SpMM) of the form Y = AX, where A is a sparse matrix, and X and Y are dense matrices.

Corresponding author. Fax: +90 312 266 4047.

E-mail addresses: acer@cs.bilkent.edu.tr (S. Acer), reha@cs.bilkent.edu.tr (O. Selvitopi), aykanat@cs.bilkent.edu.tr (C. Aykanat).

http://dx.doi.org/10.1016/j.parco.2016.10.001


SpMM is already a common operation in computational linear algebra, usually utilized repeatedly within the context of block iterative methods. The practical benefits of block methods have been emphasized in several studies. These studies either focus on the block versions of certain solvers (i.e., conjugate gradient variants) which address multiple linear systems [1–4], or the block methods for eigenvalue problems, such as block Lanczos [5] and block Arnoldi [6]. The column dimension of X and Y in block methods is usually very small compared to that of A [7].

Along with other sparse matrix kernels, SpMM is also used in the emerging field of big data analytics. Graph algorithms are ubiquitous in big data analytics. Many graph analysis approaches such as centrality measures [8] rely on shortest path computations and use breadth-first search (BFS) as a building block. As indicated in several recent studies [9–14], processing each level in BFS is actually equivalent to a sparse matrix vector "multiplication". Graph algorithms often necessitate BFS from multiple sources. In this case, processing each level becomes equivalent to multiplication of a sparse matrix with another sparse (the SpGEMM kernel [15]) or dense matrix. For a typical small world network [16], matrix X is sparse at the beginning of BFS; however, it usually gets denser as BFS proceeds. Even in cases when it remains sparse, the changing pattern of this matrix throughout the BFS levels and the related sparse bookkeeping overhead make it plausible to store it as a dense matrix if there is memory available.

SpMM is provided in Intel MKL [17] and Nvidia cuSPARSE [18] libraries for multi-/many-core and GPU architectures. To optimize SpMM on distributed memory architectures for sparse matrices with irregular sparsity patterns, one needs to take communication bottlenecks into account. Communication bottlenecks are usually summarized by latency (message start-up) and bandwidth (message transfer) costs. The latency cost is proportional to the number of messages, while the bandwidth cost is proportional to the number of words communicated, i.e., communication volume. These costs are usually addressed in the literature with intelligent graph and hypergraph partitioning models that can exploit irregular patterns quite well [19–24]. Most of these models focus on improving the performance of parallel sparse matrix vector multiplication. Although one can utilize them for SpMM as well, SpMM necessitates the use of new models tailored to this kernel since it is specifically characterized with its high communication volume requirements because of the increased column dimensions of dense X and Y matrices. In this regard, the bandwidth cost becomes critical for overall performance, while the latency cost becomes negligible with increased average message size. Therefore, to get the best performance out of SpMM, it is vital to address communication cost metrics that are centered around volume, such as maximum send volume, maximum receive volume, etc.

1.1. Related work on multiple communication cost metrics

Total communication volume is the most widely optimized communication cost metric for improving the performance of sparse matrix operations on distributed memory systems [21,22,25–27]. There are a few works that consider communication cost metrics other than total volume [28–33]. In an early work, Uçar and Aykanat [29] proposed hypergraph partitioning models to optimize two different cost metrics simultaneously. This work is a two-phase approach, where the partitioning in the first phase is followed by a latter phase in which they minimize the total number of messages and achieve a balance on the communication volumes of processors. In a related work, Uçar and Aykanat [28] adapted the mentioned model for two-dimensional fine-grain partitioning. A very recent work by Selvitopi and Aykanat aims to reduce the latency overhead in two-dimensional jagged and checkerboard partitioning [34].

Bisseling and Meesen [30] proposed a greedy heuristic for balancing the communication loads of processors. This method is also a two-phase approach, in which the partitioning in the first phase is followed by a redistribution of communication tasks in the second phase. While doing so, they try to minimize the maximum send and receive volumes of processors while respecting the total volume obtained in the first phase.

The two-phase approaches have the flexibility of working with already existing partitions. However, since the first phase is oblivious to the cost metrics addressed in the second phase, they can get stuck in local optima. To remedy this issue, Deveci et al. [32] recently proposed a hypergraph partitioner called UMPa, which is capable of handling multiple cost metrics in a single partitioning phase. They consider various metrics such as maximum send volume, total number of messages, maximum number of messages, etc., and propose a different gain computation algorithm specific to each of these metrics. At the center of their approach are the move-based iterative improvement heuristics which make use of directed hypergraphs. These heuristics consist of a number of refinement passes. Each pass is reported to introduce an O(VK²)-time overhead, where V is the number of vertices in the hypergraph (number of rows/columns of A) and K is the number of parts/processors. They also report that the slowdown of UMPa with respect to the native hypergraph partitioner PaToH increases with increasing K due to this quadratic complexity.

1.2. Contributions

In this study, we propose a comprehensive and generic one-phase framework to minimize multiple volume-based communication cost metrics for improving the performance of SpMM on distributed memory systems. Our framework relies on the widely adopted recursive bipartitioning paradigm utilized in the context of graph and hypergraph partitioning. Total volume can already be effectively minimized with existing partitioners [21,22,25]. We focus on the other important volume-based metrics besides total volume, such as maximum send/receive volume, maximum sum of send and receive volumes, etc. The proposed model associates additional weights with boundary vertices to keep track of the volume loads of processors during recursive bipartitioning. The minimization objectives associated with these loads are treated as constraints in order to make use of a readily available partitioner. Achieving a balance on these weights of boundary vertices through these constraints enables the minimization of the target volume-based metrics. We also extend our model by proposing two practical enhancements to handle these constraints in partitioners more efficiently.

Our framework is unique and flexible in the sense that it handles multiple volume-based metrics through the same formulation in a generic manner. This framework also allows the optimization of any custom metric defined on send/receive volumes. Our algorithms are computationally lightweight: they only introduce an extra O(nnz(A)) time to each recursive bipartitioning level, where nnz(A) is the number of nonzeros in matrix A. To the best of our knowledge, this is the first portable one-phase method that can easily be integrated into any state-of-the-art graph and hypergraph partitioner. Our work is also the first that addresses multiple volume-based metrics in the graph partitioning context.

Another important aspect is the simultaneous handling of multiple cost metrics. This feature is crucial as overall communication cost is simultaneously determined by multiple factors, and the target parallel application may demand optimization of different cost metrics simultaneously for good performance (SpMM and multi-source BFS in our case). In this regard, Uçar and Aykanat [28,29] accommodate this feature for two metrics, whereas Deveci et al. [32], although addressing multiple metrics, do not handle them in a completely simultaneous manner since some of the metrics may not be minimized in certain cases. Our models, in contrast, can optimize all target metrics simultaneously by assigning equal importance to each of them in the feasible search space. In addition, the proposed framework allows one to define and optimize as many volume-based metrics as desired.

For experiments, the proposed partitioning models for graphs and hypergraphs are realized using the widely adopted partitioners Metis [22] and PaToH [21], respectively. We have tested the proposed models for 128, 256, 512 and 1024 processors on a dataset of 964 matrices containing instances from different domains. We achieve average improvements of up to 61% and 78% in maximum communication volume for the graph and hypergraph models, respectively, in the categories of matrices for which maximum volume is most critical. Compared to the state-of-the-art partitioner UMPa, our graph model achieves an overall improvement of 5% in the partition quality 14.5× faster, and our hypergraph model achieves an overall improvement of 11% in the partition quality 3.4× faster. Our average improvements for the instances that are bounded by maximum volume are even higher: 19% for the proposed graph model and 24% for the proposed hypergraph model.

We test the validity of the proposed models for both parallel SpMM and multi-source BFS kernels on the large-scale HPC systems Cray XC40 and Lenovo NeXtScale, respectively. For parallel SpMM, compared to the standard partitioning models, our graph and hypergraph partitioning models respectively lead to reductions of 14% and 22% in runtime, on average. For parallel BFS, we show on graphs with more than a billion edges that the scalability can significantly be improved with our models compared to a recently proposed two-dimensional partitioning model [12] for the parallelization of this kernel on distributed systems.

The rest of the paper is organized as follows. Section 2 gives background for partitioning sparse matrices via graph and hypergraph models. Section 3 defines the problems regarding minimization of volume-based cost metrics. The proposed graph and hypergraph partitioning models to address these problems are described in Section 4. Section 5 proposes two practical extensions to these models. Section 6 gives experimental results for the investigated partitioning schemes and parallel runtimes. Section 7 concludes.

2. Background

2.1. One-dimensional sparse matrix partitioning

Consider the parallelization of sparse matrix dense matrix multiplication (SpMM) of the form Y = AX, where A is an n × n sparse matrix, and X and Y are n × s dense matrices. Assume that A is permuted into a K-way block structure of the form

$$A_{BL} = \begin{bmatrix} C_1 & \cdots & C_K \end{bmatrix} = \begin{bmatrix} R_1 \\ \vdots \\ R_K \end{bmatrix} = \begin{bmatrix} A_{11} & \cdots & A_{1K} \\ \vdots & \ddots & \vdots \\ A_{K1} & \cdots & A_{KK} \end{bmatrix}, \quad (1)$$

for rowwise or columnwise partitioning, where K is the number of processors in the parallel system. Processor P_k owns row stripe R_k = [A_k1 ··· A_kK] for rowwise partitioning, whereas it owns column stripe C_k = [A_1k^T ··· A_Kk^T]^T for columnwise partitioning. We focus on rowwise partitioning in this work; however, all described models apply to columnwise partitioning as well.

We use R_k and A_k interchangeably throughout the paper as we only consider rowwise partitioning.

In both block iterative methods and BFS-like computations, SpMM is performed repeatedly with the same input matrix A and changing X-matrix elements. The input matrix X of the next iteration is obtained from the output matrix Y of the current iteration via element-wise linear matrix operations. We focus on the case where the rowwise partitions of the input and output dense matrices are conformable to avoid redundant communication during these linear operations. Hence, a partition of A naturally induces partition [Y_1^T ... Y_K^T]^T on the rows of Y, which is in turn used to induce a conformable partition [X_1^T ... X_K^T]^T on the rows of X. In this regard, the row and column permutations mentioned in (1) should be conformable.

Fig. 1. Row-parallel Y = AX with K = 4 processors, n = 16 and s = 3.

A nonzero column segment is defined as the nonzeros of a column in a specific submatrix block. For example, in Fig. 1, there are two nonzero column segments in A_14, which belong to columns 13 and 15. In row-parallel Y = AX, P_k owns row stripes A_k and X_k of the input matrices, and is responsible for computing the respective row stripe Y_k = A_k X of the output matrix. P_k can perform the computations regarding diagonal block A_kk locally using its own portion X_k without requiring any communication, where A_kl is called a diagonal block if k = l, and an off-diagonal block otherwise. Since P_k owns only X_k, it needs the remaining X-matrix rows that correspond to nonzero column segments in the off-diagonal blocks of A_k. Hence, the respective rows must be sent to P_k by their owners in a pre-communication phase prior to the SpMM computations. Specifically, to perform the multiplication regarding off-diagonal block A_kl, P_k needs to receive the respective X-matrix rows from P_l. For example, in Fig. 1 for P_3, since there exists a nonzero column segment in A_34, P_3 needs to receive the corresponding three elements of row 14 of X from P_4. In a similar manner, it needs to receive the elements of X-matrix rows 2, 3 from P_1 and 5, 7 from P_2.
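The pre-communication analysis described above can be sketched in a few lines. This is an illustrative reconstruction under our own naming (the function, the toy matrix and the partition below are hypothetical, not taken from the paper): given a rowwise partition, a processor must receive row j of X from row j's owner whenever one of its rows has a nonzero in a column j owned by another processor.

```python
# Sketch of the pre-communication phase for row-parallel Y = AX:
# P_k must receive row j of X whenever a row i it owns has a nonzero a_ij
# with column j owned by another processor (an off-diagonal column segment).

def rows_to_receive(rows, part, k):
    """X-matrix row indices P_k must receive, grouped by owning processor."""
    need = {}
    for i, cols in enumerate(rows):
        if part[i] != k:
            continue
        for j in cols:                  # nonzero a_ij in row i
            owner = part[j]
            if owner != k:              # off-diagonal block -> communication
                need.setdefault(owner, set()).add(j)
    return need

# 6x6 toy pattern: column indices of nonzeros per row (made up for this sketch)
rows = [[0, 3], [1, 4], [2, 5], [0, 3], [1, 4], [2, 5]]
part = [0, 0, 0, 1, 1, 1]              # rows 0-2 on P_0, rows 3-5 on P_1
print(rows_to_receive(rows, part, 0))  # P_0 needs X rows 3, 4, 5 from P_1
```

Each received row carries s words, matching the s-elements-per-row accounting used in Fig. 1.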

2.2. Graph and hypergraph partitioning problems

A graph G = (V, E) consists of a set V of vertices and a set E of edges. Each edge e_ij connects a pair of distinct vertices v_i and v_j. A cost c_ij is associated with each edge e_ij. Adj(v_i) denotes the neighbors of v_i, i.e., Adj(v_i) = {v_j : e_ij ∈ E}.

A hypergraph H = (V, N) consists of a set V of vertices and a set N of nets. Each net n_j connects a subset of vertices denoted as Pins(n_j). A cost c_j is associated with each net n_j. Nets(v_i) denotes the set of nets that connect v_i. In both graph and hypergraph, multiple weights w^1(v_i), ..., w^C(v_i) are associated with each vertex v_i, where w^c(v_i) denotes the cth weight associated with v_i.

Π(G) = {V_1, ..., V_K} and Π(H) = {V_1, ..., V_K} are called K-way partitions of G and H if the parts are mutually disjoint and mutually exhaustive. In Π(G), an edge e_ij is said to be cut if vertices v_i and v_j are in different parts, and uncut otherwise. The cutsize of Π(G) is defined as Σ_{e_ij ∈ E_cut} c_ij, where E_cut ⊆ E denotes the set of cut edges. In Π(H), the connectivity set Λ(n_j) of net n_j consists of the parts that are connected by that net, i.e., Λ(n_j) = {V_k : Pins(n_j) ∩ V_k ≠ ∅}. The number of parts connected by n_j is denoted by λ(n_j) = |Λ(n_j)|. A net n_j is said to be cut if it connects more than one part, i.e., λ(n_j) > 1, and uncut otherwise. The cutsize of Π(H) is defined as Σ_{n_j ∈ N} c_j (λ(n_j) − 1). A vertex v_i in Π(G) or Π(H) is said to be a boundary vertex if it is connected by at least one cut edge or cut net.

The weight W^c(V_k) of part V_k is defined as the sum of the cth weights of the vertices in V_k. A partition Π(G) or Π(H) is said to be balanced if

$$W^c(V_k) \le W^c_{avg}\,(1 + \epsilon^c), \quad \forall k \in \{1,\ldots,K\} \text{ and } \forall c \in \{1,\ldots,C\}, \quad (2)$$

where W^c_{avg} = Σ_k W^c(V_k)/K, and ε^c is the predetermined imbalance value for the cth weight.

The K-way multi-constraint graph/hypergraph partitioning problem [35,36] is then defined as finding a K-way partition such that the cutsize is minimized while the balance constraint (2) is maintained. Note that for C = 1, this reduces to the well-studied standard partitioning problem. Both graph and hypergraph partitioning problems are NP-hard [37,38].
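The two cutsize definitions above translate directly into code. The following is a minimal sketch of our own (the edge/net lists and the partition are made up for illustration): an edge contributes c_ij when its endpoints land in different parts, and a net contributes c_j(λ(n_j) − 1).

```python
# Cutsize under the two models of Section 2.2. An edge contributes its cost
# c_ij when cut; a net contributes c_j * (lambda(n_j) - 1), where lambda is
# the number of distinct parts touched by its pins.

def graph_cutsize(edges, part):
    return sum(c for (i, j, c) in edges if part[i] != part[j])

def hypergraph_cutsize(nets, part):
    total = 0
    for pins, c in nets:
        lam = len({part[v] for v in pins})   # connectivity lambda(n_j)
        total += c * (lam - 1)               # uncut nets (lam == 1) add 0
    return total

part = [0, 0, 1, 1]
edges = [(0, 1, 2), (1, 2, 2), (2, 3, 2)]    # only edge (1, 2) is cut
nets = [([0, 1, 2], 1), ([2, 3], 1)]         # first net spans both parts
print(graph_cutsize(edges, part))        # -> 2
print(hypergraph_cutsize(nets, part))    # -> 1
```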

2.3. Sparse matrix partitioning models

In this section, we describe how to obtain a one-dimensional rowwise partitioning of matrix A for row-parallel Y = AX using graph and hypergraph partitioning models. These models are the extensions of the standard models used for sparse matrix vector multiplication [21,22,39–41].

In the graph and hypergraph partitioning models, matrix A is represented as an undirected graph G = (V, E) and a hypergraph H = (V, N). In both, there exists a vertex v_i ∈ V for each row i of A, where v_i signifies the computational task of multiplying row i of A with X to obtain row i of Y. So, in both models, a single (C = 1) weight of s times the number of nonzeros in row i of A is associated with v_i to encode the load of this computational task. For example, in Fig. 1, w^1(v_5) = 4 × 3 = 12.

In G, each nonzero a_ij or a_ji (or both) of A is represented by an edge e_ij ∈ E. The cost of edge e_ij is assigned as c_ij = 2s for each edge e_ij with a_ij ≠ 0 and a_ji ≠ 0, whereas it is assigned as c_ij = s for each edge e_ij with either a_ij ≠ 0 or a_ji ≠ 0, but not both. In H, each column j of A is represented by a net n_j ∈ N, which connects the vertices that correspond to the rows that contain a nonzero in column j, i.e., Pins(n_j) = {v_i : a_ij ≠ 0}. The cost of net n_j is assigned as c_j = s for each net in N.

In a K-way partition Π(G) or Π(H), without loss of generality, we assume that the rows corresponding to the vertices in part V_k are assigned to processor P_k. In Π(G), each cut edge e_ij, where v_i ∈ V_k and v_j ∈ V_ℓ, necessitates c_ij units of communication between processors P_k and P_ℓ. Here, P_ℓ sends row j of X to P_k if a_ij ≠ 0, and P_k sends row i of X to P_ℓ if a_ji ≠ 0. In Π(H), each cut net n_j necessitates c_j (λ(n_j) − 1) units of communication between the processors that correspond to the parts in Λ(n_j), where the owner of row j of X sends it to the remaining processors in Λ(n_j). Hereinafter, Λ(n_j) is interchangeably used to refer to parts and processors because of the identical vertex part to processor assignment.

Through these formulations, the problem of obtaining a good row partitioning of A becomes equivalent to the graph and hypergraph partitioning problems in which the objective of minimizing cutsize relates to minimizing total communication volume, while the constraint of maintaining balance on part weights ((2) with C = 1) corresponds to balancing the computational loads of processors. The objective of the hypergraph partitioning problem is an exact measure of total volume, whereas the objective of the graph partitioning problem is an approximation [21].
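The column-net construction just described can be illustrated with a short sketch. It is our own reconstruction (function name and toy matrix are hypothetical): one vertex per row with weight s × nnz(row), one net per column whose pins are the rows with a nonzero in that column, and net cost s.

```python
# Column-net hypergraph of a sparse matrix for row-parallel Y = AX:
# vertex v_i per row i with weight s * nnz(row i); net n_j per column j
# with Pins(n_j) = {v_i : a_ij != 0} and cost c_j = s.

def column_net_hypergraph(rows, n_cols, s):
    weights = [s * len(cols) for cols in rows]      # computational loads
    pins = [[] for _ in range(n_cols)]
    for i, cols in enumerate(rows):
        for j in cols:
            pins[j].append(i)                       # row i is a pin of net j
    costs = [s] * n_cols                            # each cut adds s per extra part
    return weights, pins, costs

rows = [[0, 1], [1, 2], [0, 2, 3], [3]]             # 4x4 pattern, by row (made up)
weights, pins, costs = column_net_hypergraph(rows, 4, s=3)
print(weights)   # [6, 6, 9, 3]
print(pins)      # [[0, 2], [0, 1], [1, 2], [2, 3]]
```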

3. Problem definition

Assume that matrix A is distributed among K processors for the parallel SpMM operation as described in Section 2.1. Let σ(P_k, P_ℓ) be the amount of data sent from processor P_k to P_ℓ in terms of X-matrix elements. This is equal to s times the number of X-matrix rows that are owned by P_k and needed by P_ℓ, which is also equal to s times the number of nonzero column segments in off-diagonal block A_ℓk. Since X_k is owned by P_k and the computations on A_kk require no communication, σ(P_k, P_k) = 0. We use the function ncs(·) to denote the number of nonzero column segments in a given block of the matrix. ncs(A_kℓ) is defined to be the number of nonzero column segments in A_kℓ if k ≠ ℓ, and 0 otherwise. This is extended to a row stripe R_k and a column stripe C_k, where ncs(R_k) = Σ_ℓ ncs(A_kℓ) and ncs(C_k) = Σ_ℓ ncs(A_ℓk). Finally, for the whole matrix, ncs(A_BL) = Σ_k ncs(R_k) = Σ_k ncs(C_k). For example, in Fig. 1, ncs(A_42) = 2, ncs(R_3) = 5, ncs(C_3) = 4 and ncs(A_BL) = 21. The send and receive volumes of P_k are defined as follows:

SV(P_k), send volume of P_k: the total number of X-matrix elements sent from P_k to other processors. That is, SV(P_k) = Σ_ℓ σ(P_k, P_ℓ). This is equal to s × ncs(C_k).

RV(P_k), receive volume of P_k: the total number of X-matrix elements received by P_k from other processors. That is, RV(P_k) = Σ_ℓ σ(P_ℓ, P_k). This is equal to s × ncs(R_k).

Note that the total volume of communication is equal to Σ_k SV(P_k) = Σ_k RV(P_k). This is also equal to s times the total number of nonzero column segments in all off-diagonal blocks, i.e., s × ncs(A_BL).

In this study, we extend the sparse matrix partitioning problem, in which the only objective is to minimize the total communication volume, by introducing four more minimization objectives which are defined on the following metrics:

1. max_k SV(P_k): maximum send volume of processors (equivalent to maximum s × ncs(C_k)),
2. max_k RV(P_k): maximum receive volume of processors (equivalent to maximum s × ncs(R_k)),
3. max_k (SV(P_k) + RV(P_k)): maximum sum of send and receive volumes of processors (equivalent to maximum s × (ncs(C_k) + ncs(R_k))),
4. max_k max{SV(P_k), RV(P_k)}: maximum of maximum of send and receive volumes of processors (equivalent to maximum s × max{ncs(C_k), ncs(R_k)}).

Under the objective of minimizing the total communication volume, minimizing one of these volume-based metrics (e.g., max_k SV(P_k)) relates to minimizing the imbalance on the respective quantity (e.g., imbalance on the SV(P_k) values). For instance, the imbalance on the SV(P_k) values is defined as

$$\frac{\max_k SV(P_k)}{\sum_k SV(P_k)/K},$$

where the expression in the denominator denotes the average send volume of processors.

A parallel application may necessitate one or more of these metrics to be minimized. These metrics are considered besides total volume since their minimization is plausible only when total volume is also minimized, as mentioned above. Hereinafter, these metrics except total volume are referred to as volume-based metrics.
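All four metrics follow mechanically from ncs(·) and a rowwise partition. The sketch below is our own illustration (not the Fig. 1 instance; the toy pattern is made up): it counts nonzero column segments per off-diagonal block, derives SV and RV, and evaluates the four maxima.

```python
# Volume metrics of Section 3 from a rowwise partition: ncs(A_kl) counts the
# distinct nonzero columns of an off-diagonal block; RV(P_k) = s * ncs(R_k)
# and SV(P_k) = s * ncs(C_k).

def volume_metrics(rows, part, K, s):
    seg = [[set() for _ in range(K)] for _ in range(K)]  # seg[k][l]: cols of A_kl
    for i, cols in enumerate(rows):
        k = part[i]
        for j in cols:
            l = part[j]
            if l != k:
                seg[k][l].add(j)            # one segment per distinct column
    RV = [s * sum(len(seg[k][l]) for l in range(K)) for k in range(K)]
    SV = [s * sum(len(seg[l][k]) for l in range(K)) for k in range(K)]
    return {
        "total": sum(SV),
        "max_send": max(SV),
        "max_recv": max(RV),
        "max_sum": max(sv + rv for sv, rv in zip(SV, RV)),
        "max_max": max(max(sv, rv) for sv, rv in zip(SV, RV)),
    }

rows = [[0, 2], [1, 3], [0, 2], [3]]        # 4x4 toy pattern, by row
part = [0, 0, 1, 1]                         # rows 0-1 on P_0, rows 2-3 on P_1
print(volume_metrics(rows, part, K=2, s=3))
```

Note that "total" equals Σ_k SV(P_k) = Σ_k RV(P_k) = s × ncs(A_BL), as stated above.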


Fig. 2. The state of the RB tree prior to bipartitioning G_1^2 and the corresponding sparse matrix. Among the edges and nonzeros, only the external (cut) edges of V_1^2 and their corresponding nonzeros are shown.

4. Models for minimizing multiple volume-based metrics

This section describes the proposed graph and hypergraph partitioning models for addressing the volume-based cost metrics defined in the previous section. Our models have the capability of addressing a single one, a combination, or all of these metrics simultaneously in a single phase. Moreover, they have the flexibility of handling custom volume-based metrics other than the four already defined. Our approach relies on the widely adopted recursive bipartitioning (RB) framework utilized in a breadth-first manner and can be realized by any graph and hypergraph partitioning tool.

4.1. Recursive bipartitioning

In the RB paradigm, the initial graph/hypergraph is partitioned into two subgraphs/subhypergraphs. These two subgraphs/subhypergraphs are further bipartitioned recursively until K parts are obtained. This process forms a full binary tree, which we refer to as an RB tree, with lg₂K levels, where K is a power of 2. Without loss of generality, graphs and hypergraphs at level r of the RB tree are numbered from left to right and denoted as G_0^r, ..., G_{2^r−1}^r and H_0^r, ..., H_{2^r−1}^r, respectively. From bipartition Π(G_k^r) = {V_{2k}^{r+1}, V_{2k+1}^{r+1}} of graph G_k^r = (V_k^r, E_k^r), two vertex-induced subgraphs G_{2k}^{r+1} = (V_{2k}^{r+1}, E_{2k}^{r+1}) and G_{2k+1}^{r+1} = (V_{2k+1}^{r+1}, E_{2k+1}^{r+1}) are formed. All cut edges in Π(G_k^r) are excluded from the newly formed subgraphs. From bipartition Π(H_k^r) = {V_{2k}^{r+1}, V_{2k+1}^{r+1}} of hypergraph H_k^r = (V_k^r, N_k^r), two vertex-induced subhypergraphs are formed similarly. All cut nets in Π(H_k^r) are split to correctly encode the cutsize metric [21].

4.2. Graph model

Consider the use of the RB paradigm for partitioning the standard graph representation G = (V, E) of A for row-parallel Y = AX to obtain a K-way partition. We assume that the RB proceeds in a breadth-first manner and the RB process is at level r prior to bipartitioning the kth graph G_k^r. Observe that the RB process up to this bipartitioning already induces a K′-way partition Π′(G) = {V_0^{r+1}, ..., V_{2k−1}^{r+1}, V_k^r, ..., V_{2^r−1}^r}. Π′(G) contains 2k vertex parts from level r+1 and 2^r − k vertex parts from level r, making K′ = 2^r + k. After bipartitioning G_k^r, a (K′+1)-way partition Π″(G) is obtained which contains V_{2k}^{r+1} and V_{2k+1}^{r+1} instead of V_k^r. For example, in Fig. 2, the RB process is at level r = 2 prior to bipartitioning G_1^2 = (V_1^2, E_1^2), so the current state of the RB induces a five-way partition Π′(G) = {V_0^3, V_1^3, V_1^2, V_2^2, V_3^2}. Bipartitioning G_1^2 induces a six-way partition Π″(G) = {V_0^3, V_1^3, V_2^3, V_3^3, V_2^2, V_3^2}. P_k^r denotes the group of processors which are responsible for performing the tasks represented by the vertices in V_k^r. The send and receive volume definitions SV(P_k) and RV(P_k) of the individual processor P_k are easily extended to SV(P_k^r) and RV(P_k^r) for the processor group P_k^r.

We first formulate the send volume of the processor group P_k^r to all other processor groups corresponding to the vertex parts in Π′(G). Let the connectivity set Con(v_i) of vertex v_i ∈ V_k^r denote the subset of parts in Π′(G) − {V_k^r} in which v_i has at least one neighbor. That is,

Con(v_i) = {V_ℓ^t ∈ Π′(G) : Adj(v_i) ∩ V_ℓ^t ≠ ∅} − {V_k^r},

where t is either r or r+1. Vertex v_i is boundary if Con(v_i) ≠ ∅, and once v_i becomes boundary, it remains boundary in all further bipartitionings. For example, in Fig. 2, Con(v_9) = {V_1^3, V_2^2, V_3^2}. Con(v_i) signifies the communication operations due to v_i, where P_k^r sends row i of X to the processor groups that correspond to the parts in Con(v_i). The send load associated with v_i is denoted by sl(v_i) and is equal to

sl(v_i) = s × |Con(v_i)|.

The total send volume of P_k^r is then equal to the sum of the send loads of all vertices in V_k^r, i.e., SV(P_k^r) = Σ_{v_i ∈ V_k^r} sl(v_i). In Fig. 2, the total send volume of P_1^2 is equal to sl(v_7) + sl(v_8) + sl(v_9) + sl(v_10) = 3s + 2s + 3s + s = 9s. Therefore, during the bipartitioning of G_k^r, minimizing

max{ Σ_{v_i ∈ V_{2k}^{r+1}} sl(v_i), Σ_{v_i ∈ V_{2k+1}^{r+1}} sl(v_i) }

is equivalent to minimizing the maximum send volume of the two processor groups P_{2k}^{r+1} and P_{2k+1}^{r+1} to the other processor groups that correspond to the vertex parts in Π′(G).

In a similar manner, we formulate the receive volume of the processor group P_k^r from all other processor groups corresponding to the vertex parts in Π′(G). Observe that for each boundary vertex v_j ∈ V_ℓ^t that has at least one neighbor in V_k^r, P_k^r needs to receive the corresponding row j of X from P_ℓ^t. For instance, in Fig. 2, since v_5 ∈ V_1^3 has two neighbors in V_1^2, P_1^2 needs to receive the corresponding fifth row of X from P_1^3. Hence, P_k^r receives a subset of X-matrix rows whose cardinality is equal to the number of vertices in V − V_k^r that have at least one neighbor in V_k^r, i.e., |{v_j ∈ V − V_k^r : v_i ∈ V_k^r and e_ji ∈ E}|. The size of this set for V_1^2 in Fig. 2 is equal to 10. Note that each such v_j contributes s words to the receive volume of P_k^r. This quantity can be captured by evenly distributing it among v_j's neighbors in V_k^r. In other words, a vertex v_j ∈ V_ℓ^t that has at least one neighbor in V_k^r contributes s/|Adj(v_j) ∩ V_k^r| to the receive load of each vertex v_i ∈ Adj(v_j) ∩ V_k^r. The receive load of v_i, denoted by rl(v_i), is given by considering all neighbors of v_i that are not in V_k^r, that is,

rl(v_i) = Σ_{e_ji ∈ E and v_j ∉ V_k^r} s / |Adj(v_j) ∩ V_k^r|.

The total receive volume of P_k^r is then equal to the sum of the receive loads of all vertices in V_k^r, i.e., RV(P_k^r) = Σ_{v_i ∈ V_k^r} rl(v_i). In Fig. 2, the vertices v_11, v_12, v_15 and v_16 respectively contribute s/3, s/2, s and s to the receive load of v_8, which makes rl(v_8) = 17s/6. The total receive volume of P_1^2 is equal to rl(v_7) + rl(v_8) + rl(v_9) + rl(v_10) = 3s + 17s/6 + 10s/3 + 5s/6 = 10s. Note that this is also equal to s times the number of neighboring vertices of V_1^2 in V − V_1^2. Therefore, during the bipartitioning of G_k^r, minimizing

max{ Σ_{v_i ∈ V_{2k}^{r+1}} rl(v_i), Σ_{v_i ∈ V_{2k+1}^{r+1}} rl(v_i) }

is equivalent to minimizing the maximum receive volume of the two processor groups P_{2k}^{r+1} and P_{2k+1}^{r+1} from the other processor groups that correspond to the vertex parts in Π′(G).

Although these two formulations correctly encapsulate the send/receive volume loads of P_{2k}^{r+1} and P_{2k+1}^{r+1} to/from all other processor groups in Π′(G), they overlook the send/receive volume loads between these two processor groups. Our approach tries to refrain from this small deviation by immediately utilizing the newly generated partition information while computing the volume loads in the upcoming bipartitionings. That is, the computation of send/receive loads for bipartitioning G_k^r utilizes the most recent K′-way partition information, i.e., Π′(G). This deviation becomes negligible with the increasing number of subgraphs in the latter levels of the RB tree. The encapsulation of the send/receive volumes between P_{2k}^{r+1} and P_{2k+1}^{r+1} during the bipartitioning of G_k^r would necessitate implementing a new partitioning tool.

Algorithm 1 presents the computation of the send and receive loads of the vertices in G_k^r prior to its bipartitioning. As its inputs, the algorithm needs the original graph G = (V, E), the graph G_k^r = (V_k^r, E_k^r), and the up-to-date partition information of the vertices, which is stored in the part array of size V = |V|. To compute the send load of a vertex v_i ∈ V_k^r, it is necessary to find the set of parts in which v_i has at least one neighbor. For this purpose, for each v_j ∉ V_k^r in Adj(v_i), Con(v_i) is updated with the part that v_j is currently in (lines 2–4). The Adj(·) lists are the adjacency lists of the vertices in the original graph G. Next, the send load of v_i, sl(v_i), is simply set to s times the size of Con(v_i) (line 5). To compute the receive load of v_i ∈ V_k^r, it is necessary to visit the neighbors of v_i that are not in V_k^r. For each such neighbor v_j, the receive load of v_i, rl(v_i), is updated by adding v_i's share of the receive load due to v_j, which is equal to s/|Adj(v_j) ∩ V_k^r| (lines 6–8). Observe that only the boundary vertices in V_k^r will have nonzero volume loads at the end of this process.
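The pseudocode of Algorithm 1 is not reproduced in this extraction, so the following is our own sketch of the steps just described (names, data layout, and the toy graph are our assumptions, not the authors' code): Con(v_i) is built from the external neighbors via the global part array, sl(v_i) = s|Con(v_i)|, and each external neighbor v_j adds s/|Adj(v_j) ∩ V_k^r| to rl(v_i).

```python
# Sketch of GRAPH-COMPUTE-VOLUME-LOADS (Algorithm 1), reconstructed from the
# description: Con(v_i) collects the external parts of v_i's neighbors
# (lines 2-4), sl(v_i) = s * |Con(v_i)| (line 5), and each external neighbor
# v_j contributes s / |Adj(v_j) & V_k^r| to rl(v_i) (lines 6-8).
# adj holds the adjacency lists of the ORIGINAL graph G.

def compute_volume_loads(adj, part, Vk, s):
    in_Vk = set(Vk)
    sl, rl = {}, {}
    for vi in Vk:
        con = {part[vj] for vj in adj[vi] if vj not in in_Vk}  # Con(v_i)
        sl[vi] = s * len(con)                                  # send load
        rl[vi] = sum(s / len([u for u in adj[vj] if u in in_Vk])
                     for vj in adj[vi] if vj not in in_Vk)     # receive load
    return sl, rl

# Toy graph: vertices 0, 1 form the current part; 2 and 3 are external
# and currently sit in parts 7 and 8, respectively.
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
part = [5, 5, 7, 8]
sl, rl = compute_volume_loads(adj, part, Vk=[0, 1], s=1)
print(sl, rl)
```

In this toy instance, Σ rl(v_i) = 2s, which equals s times the number of external vertices adjacent to the part, as the formulation requires.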

Algorithm 1 GRAPH-COMPUTE-VOLUME-LOADS.

Algorithm 2 GRAPH-PARTITION.

Algorithm 2 presents the overall partitioning process to obtain a K-way partition utilizing breadth-first RB. For each level r of the RB tree, the graphs in this level are bipartitioned from left to right, G_0^r to G_{2^r−1}^r (lines 3–4). Prior to the bipartitioning of G_k^r, the send load and the receive load of each vertex in G_k^r are computed with GRAPH-COMPUTE-VOLUME-LOADS (line 5). Recall that in the original sparse matrix partitioning with the graph model, each vertex v_i has a single weight w^1(v_i), which represents the computational load associated with it. To address the minimization of maximum send/receive volume, we associate an extra weight with each vertex. Specifically, to minimize the maximum send volume, the send load of v_i is assigned as its second weight, i.e., w^2(v_i) = sl(v_i). In a similar manner, to minimize the maximum receive volume, the receive load of v_i is assigned as its second weight, i.e., w^2(v_i) = rl(v_i). Observe that only the boundary vertices have nonzero second weights. Next, G_k^r is bipartitioned to obtain Π(G_k^r) = {V_{2k}^{r+1}, V_{2k+1}^{r+1}} using multi-constraint partitioning to handle the multiple vertex weights (line 7). Then, two new subgraphs G_{2k}^{r+1} and G_{2k+1}^{r+1} are formed from G_k^r using Π(G_k^r) (line 8).

In partitioning, minimizing the imbalance on the second part weights corresponds to minimizing the imbalance on send (receive) volume if these weights are set to send (receive) loads. In other words, under the objective of minimizing total volume in this bipartitioning, minimizing

max{W^2(V_{2k}^{r+1}), W^2(V_{2k+1}^{r+1})},

whose lower bound is (W^2(V_{2k}^{r+1}) + W^2(V_{2k+1}^{r+1}))/2, relates to minimizing max{SV(P_{2k}^{r+1}), SV(P_{2k+1}^{r+1})} (respectively, max{RV(P_{2k}^{r+1}), RV(P_{2k+1}^{r+1})}) if the second weights are set to send (receive) loads. The part array is then updated after each bipartitioning to keep track of the most up-to-date partition information of all vertices (line 9). Finally, the resulting K-way partition information is returned in the part array (line 10). Note that in the final K-way partition, processor group P_k^{lg₂K} denotes the individual processor P_k, for 0 ≤ k ≤ K − 1.
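The breadth-first driver of Algorithm 2 can be paraphrased as below. This is a schematic reconstruction of ours, not the authors' code: bisect() stands in for a call to a multi-constraint bipartitioner such as Metis, and here it merely splits a vertex list in half so the level-by-level bookkeeping of the part array can be demonstrated.

```python
# Schematic of GRAPH-PARTITION (Algorithm 2): breadth-first recursive
# bipartitioning. At each level the current parts are split left to right,
# and the 'part' array is refreshed after every bisection so that the next
# volume-load computation sees the most recent K'-way partition.

import math

def bisect(vertices, weights):
    """Placeholder bipartitioner: even split (a real multi-constraint
    partitioner would balance all weight vectors while minimizing cutsize)."""
    half = len(vertices) // 2
    return vertices[:half], vertices[half:]

def rb_partition(V, weights, K):
    part = [0] * len(V)                      # up-to-date part of every vertex
    parts = [list(V)]                        # level-r parts, left to right
    for r in range(int(math.log2(K))):
        next_parts = []
        for k, Vk in enumerate(parts):
            # Algorithm 2 would call GRAPH-COMPUTE-VOLUME-LOADS here (line 5)
            left, right = bisect(Vk, weights)            # line 7
            next_parts += [left, right]                  # line 8
            for v in left:                               # line 9: update part
                part[v] = 2 * k
            for v in right:
                part[v] = 2 * k + 1
        parts = next_parts
    return part                                          # line 10

print(rb_partition(range(8), [1] * 8, K=4))
```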

In order to efficiently maintain the send and receive loads of the vertices, we make use of the RB paradigm in a breadth-first order. Since these loads are not known in advance and depend on the current state of the partitioning, it is crucial to act proactively by avoiding high imbalances on them. Compare this to the computational loads of the vertices, which are known in advance and remain the same for each vertex throughout the partitioning. Hence, utilizing a breadth-first or a depth-first RB does not affect the quality of the obtained partition in terms of computational load. We prefer a breadth-first RB to a depth-first RB for minimizing volume-based metrics since operating on the parts that are at the same level of the RB tree (in order to compute send/receive loads) prevents possible deviations from the target objective(s) by quickly adapting the current available partition to the changes that occur in the send/receive volume loads of the vertices.

The described methodology addresses the minimization of max_k SV(P_k) or max_k RV(P_k) separately. After computing the send and receive loads, we can also easily minimize max_k (SV(P_k) + RV(P_k)) by associating the second weight of each vertex with the sum of its send and receive loads, i.e., w^2(v_i) = sl(v_i) + rl(v_i). For the minimization of max_k max{SV(P_k), RV(P_k)}, either
