Parallel Computing
Improving performance of sparse matrix dense matrix multiplication on large-scale parallel systems
Seher Acer, Oguz Selvitopi, Cevdet Aykanat
Computer Engineering Department, Bilkent University, 06800 Ankara, Turkey
Article info
Article history:
Received 30 March 2015
Revised 25 August 2016
Accepted 5 October 2016
Available online 6 October 2016

Keywords:
Irregular applications
Sparse matrices
Sparse matrix dense matrix multiplication
Load balancing
Communication volume balancing
Matrix partitioning
Graph partitioning
Hypergraph partitioning
Recursive bipartitioning
Combinatorial scientific computing
Abstract
We propose a comprehensive and generic framework to minimize multiple and different volume-based communication cost metrics for sparse matrix dense matrix multiplication (SpMM). SpMM is an important kernel that finds application in computational linear algebra and big data analytics. On distributed memory systems, this kernel is usually characterized with its high communication volume requirements. Our approach targets irregularly sparse matrices and is based on both graph and hypergraph partitioning models that rely on the widely adopted recursive bipartitioning paradigm. The proposed models are lightweight, portable (can be realized using any graph and hypergraph partitioning tool) and can simultaneously optimize different cost metrics besides total volume, such as maximum send/receive volume, maximum sum of send and receive volumes, etc., in a single partitioning phase. They allow one to define and optimize as many custom volume-based metrics as desired through a flexible formulation. The experiments on a wide range of about a thousand matrices show that the proposed models drastically reduce the maximum communication volume compared to the standard partitioning models that only address the minimization of total volume. The improvements obtained on volume-based partition quality metrics using our models are validated with parallel SpMM as well as parallel multi-source BFS experiments on two large-scale systems. For parallel SpMM, compared to the standard partitioning models, our graph and hypergraph partitioning models respectively achieve reductions of 14% and 22% in runtime, on average. Compared to the state-of-the-art partitioner UMPa, our graph model is overall 14.5× faster and achieves an average improvement of 19% in the partition quality on instances that are bounded by maximum volume. For parallel BFS, we show on graphs with more than a billion edges that the scalability can significantly be improved with our models compared to a recently proposed two-dimensional partitioning model.
© 2016 Elsevier B.V. All rights reserved.
1. Introduction
Sparse matrix kernels form the computational basis of many scientific and engineering applications. An important kernel is the sparse matrix dense matrix multiplication (SpMM) of the form Y = AX, where A is a sparse matrix, and X and Y are dense matrices.
∗ Corresponding author. Fax: +90 312 266 4047.
E-mail addresses: acer@cs.bilkent.edu.tr (S. Acer), reha@cs.bilkent.edu.tr (O. Selvitopi), aykanat@cs.bilkent.edu.tr (C. Aykanat).
http://dx.doi.org/10.1016/j.parco.2016.10.001
0167-8191/© 2016 Elsevier B.V. All rights reserved.
SpMM is already a common operation in computational linear algebra, usually utilized repeatedly within the context of block iterative methods. The practical benefits of block methods have been emphasized in several studies. These studies either focus on the block versions of certain solvers (i.e., conjugate gradient variants) which address multiple linear systems [1–4], or the block methods for eigenvalue problems, such as block Lanczos [5] and block Arnoldi [6]. The column dimension of X and Y in block methods is usually very small compared to that of A [7].
Along with other sparse matrix kernels, SpMM is also used in the emerging field of big data analytics. Graph algorithms are ubiquitous in big data analytics. Many graph analysis approaches such as centrality measures [8] rely on shortest path computations and use breadth-first search (BFS) as a building block. As indicated in several recent studies [9–14], processing each level in BFS is actually equivalent to a sparse matrix vector "multiplication". Graph algorithms often necessitate BFS from multiple sources. In this case, processing each level becomes equivalent to multiplication of a sparse matrix with another sparse (the SpGEMM kernel [15]) or dense matrix. For a typical small world network [16], matrix X is sparse at the beginning of BFS; however, it usually gets denser as BFS proceeds. Even in cases when it remains sparse, the changing pattern of this matrix throughout the BFS levels and the related sparse bookkeeping overhead make it plausible to store it as a dense matrix if there is memory available.
SpMM is provided in the Intel MKL [17] and Nvidia cuSPARSE [18] libraries for multi-/many-core and GPU architectures. To optimize SpMM on distributed memory architectures for sparse matrices with irregular sparsity patterns, one needs to take communication bottlenecks into account. Communication bottlenecks are usually summarized by latency (message start-up) and bandwidth (message transfer) costs. The latency cost is proportional to the number of messages, while the bandwidth cost is proportional to the number of words communicated, i.e., communication volume. These costs are usually addressed in the literature with intelligent graph and hypergraph partitioning models that can exploit irregular patterns quite well [19–24]. Most of these models focus on improving the performance of parallel sparse matrix vector multiplication. Although one can utilize them for SpMM as well, SpMM necessitates the use of new models tailored to this kernel since it is specifically characterized with its high communication volume requirements because of the increased column dimensions of the dense X and Y matrices. In this regard, the bandwidth cost becomes critical for overall performance, while the latency cost becomes negligible with increased average message size. Therefore, to get the best performance out of SpMM, it is vital to address communication cost metrics that are centered around volume, such as maximum send volume, maximum receive volume, etc.
1.1. Related work on multiple communication cost metrics
Total communication volume is the most widely optimized communication cost metric for improving the performance of sparse matrix operations on distributed memory systems [21,22,25–27]. There are a few works that consider communication cost metrics other than total volume [28–33]. In an early work, Uçar and Aykanat [29] proposed hypergraph partitioning models to optimize two different cost metrics simultaneously. This work is a two-phase approach, where the partitioning in the first phase is followed by a latter phase in which they minimize the total number of messages and achieve a balance on the communication volumes of processors. In a related work, Uçar and Aykanat [28] adapted the mentioned model for two-dimensional fine-grain partitioning. A very recent work by Selvitopi and Aykanat aims to reduce the latency overhead in two-dimensional jagged and checkerboard partitioning [34].
Bisseling and Meesen [30] proposed a greedy heuristic for balancing the communication loads of processors. This method is also a two-phase approach, in which the partitioning in the first phase is followed by a redistribution of communication tasks in the second phase. While doing so, they try to minimize the maximum send and receive volumes of processors while respecting the total volume obtained in the first phase.
The two-phase approaches have the flexibility of working with already existing partitions. However, since the first phase is oblivious to the cost metrics addressed in the second phase, they can get stuck in local optima. To remedy this issue, Deveci et al. [32] recently proposed a hypergraph partitioner called UMPa, which is capable of handling multiple cost metrics in a single partitioning phase. They consider various metrics such as maximum send volume, total number of messages, maximum number of messages, etc., and propose a different gain computation algorithm specific to each of these metrics. In the center of their approach are the move-based iterative improvement heuristics which make use of directed hypergraphs. These heuristics consist of a number of refinement passes. Their approach is reported to introduce an O(VK²)-time overhead to each pass, where V is the number of vertices in the hypergraph (number of rows/columns of A) and K is the number of parts/processors. They also report that the slowdown of UMPa increases with increasing K with respect to the native hypergraph partitioner PaToH due to this quadratic complexity.
1.2. Contributions
In this study, we propose a comprehensive and generic one-phase framework to minimize multiple volume-based communication cost metrics for improving the performance of SpMM on distributed memory systems. Our framework relies on the widely adopted recursive bipartitioning paradigm utilized in the context of graph and hypergraph partitioning. Total volume can already be effectively minimized with existing partitioners [21,22,25]. We focus on the other important volume-based metrics besides total volume, such as maximum send/receive volume, maximum sum of send and receive volumes, etc. The proposed model associates additional weights with boundary vertices to keep track of the volume loads of processors during recursive bipartitioning. The minimization objectives associated with these loads are treated as constraints in order to make use of a readily available partitioner. Achieving a balance on these weights of boundary vertices through these constraints enables the minimization of the target volume-based metrics. We also extend our model by proposing two practical enhancements to handle these constraints in partitioners more efficiently.
Our framework is unique and flexible in the sense that it handles multiple volume-based metrics through the same formulation in a generic manner. This framework also allows the optimization of any custom metric defined on send/receive volumes. Our algorithms are computationally lightweight: they only introduce an extra O(nnz(A)) time to each recursive bipartitioning level, where nnz(A) is the number of nonzeros in matrix A. To the best of our knowledge, this is the first portable one-phase method that can easily be integrated into any state-of-the-art graph and hypergraph partitioner. Our work is also the first that addresses multiple volume-based metrics in the graph partitioning context.
Another important aspect is the simultaneous handling of multiple cost metrics. This feature is crucial as overall communication cost is simultaneously determined by multiple factors and the target parallel application may demand optimization of different cost metrics simultaneously for good performance (SpMM and multi-source BFS in our case). In this regard, Uçar and Aykanat [28,29] accommodate this feature for two metrics, whereas Deveci et al. [32], although addressing multiple metrics, do not handle them in a completely simultaneous manner since some of the metrics may not be minimized in certain cases. Our models, in contrast, can optimize all target metrics simultaneously by assigning equal importance to each of them in the feasible search space. In addition, the proposed framework allows one to define and optimize as many volume-based metrics as desired.
For the experiments, the proposed partitioning models for graphs and hypergraphs are realized using the widely adopted partitioners Metis [22] and PaToH [21], respectively. We have tested the proposed models for 128, 256, 512 and 1024 processors on a dataset of 964 matrices containing instances from different domains. We achieve average improvements of up to 61% and 78% in maximum communication volume for the graph and hypergraph models, respectively, in the categories of matrices for which maximum volume is most critical. Compared to the state-of-the-art partitioner UMPa, our graph model achieves an overall improvement of 5% in partition quality while being 14.5× faster, and our hypergraph model achieves an overall improvement of 11% in partition quality while being 3.4× faster. Our average improvements for the instances that are bounded by maximum volume are even higher: 19% for the proposed graph model and 24% for the proposed hypergraph model.
We test the validity of the proposed models for both parallel SpMM and multi-source BFS kernels on the large-scale HPC systems Cray XC40 and Lenovo NeXtScale, respectively. For parallel SpMM, compared to the standard partitioning models, our graph and hypergraph partitioning models respectively lead to reductions of 14% and 22% in runtime, on average. For parallel BFS, we show on graphs with more than a billion edges that the scalability can significantly be improved with our models compared to a recently proposed two-dimensional partitioning model [12] for the parallelization of this kernel on distributed systems.
The rest of the paper is organized as follows. Section 2 gives background for partitioning sparse matrices via graph and hypergraph models. Section 3 defines the problems regarding the minimization of volume-based cost metrics. The proposed graph and hypergraph partitioning models to address these problems are described in Section 4. Section 5 proposes two practical extensions to these models. Section 6 gives experimental results for the investigated partitioning schemes and parallel runtimes. Section 7 concludes.
2. Background
2.1. One-dimensional sparse matrix partitioning
Consider the parallelization of sparse matrix dense matrix multiplication (SpMM) of the form Y = AX, where A is an n × n sparse matrix, and X and Y are n × s dense matrices. Assume that A is permuted into a K-way block structure of the form

A_BL = \begin{bmatrix} C_1 & \cdots & C_K \end{bmatrix} = \begin{bmatrix} R_1 \\ \vdots \\ R_K \end{bmatrix} = \begin{bmatrix} A_{11} & \cdots & A_{1K} \\ \vdots & \ddots & \vdots \\ A_{K1} & \cdots & A_{KK} \end{bmatrix},   (1)

for rowwise or columnwise partitioning, where K is the number of processors in the parallel system. Processor P_k owns row stripe R_k = [A_k1 · · · A_kK] for rowwise partitioning, whereas it owns column stripe C_k = [A_1k^T · · · A_Kk^T]^T for columnwise partitioning. We focus on rowwise partitioning in this work; however, all described models apply to columnwise partitioning as well.
We use R_k and A_k interchangeably throughout the paper as we only consider rowwise partitioning.
In both block iterative methods and BFS-like computations, SpMM is performed repeatedly with the same input matrix A and changing X-matrix elements. The input matrix X of the next iteration is obtained from the output matrix Y of the current iteration via element-wise linear matrix operations. We focus on the case where the rowwise partitions of the input and output dense matrices are conformable to avoid redundant communication during these linear operations. Hence, a partition of A naturally induces partition [Y_1^T ... Y_K^T]^T on the rows of Y, which is in turn used to induce a conformable partition [X_1^T ... X_K^T]^T on the rows of X. In this regard, the row and column permutations mentioned in (1) should be conformable.
Fig. 1. Row-parallel Y = AX with K = 4 processors, n = 16 and s = 3.

A nonzero column segment is defined as the nonzeros of a column in a specific submatrix block. For example, in Fig. 1 there are two nonzero column segments in A_14, which belong to columns 13 and 15. In row-parallel Y = AX, P_k owns row stripes A_k and X_k of the input matrices, and is responsible for computing the respective row stripe Y_k = A_k X of the output matrix. P_k can perform the computations regarding diagonal block A_kk locally using its own portion X_k without requiring any communication, where A_kl is called a diagonal block if k = l, and an off-diagonal block otherwise. Since P_k owns only X_k, it needs the remaining X-matrix rows that correspond to nonzero column segments in the off-diagonal blocks of A_k. Hence, the respective rows must be sent to P_k by their owners in a pre-communication phase prior to the SpMM computations. Specifically, to perform the multiplication regarding off-diagonal block A_kl, P_k needs to receive the respective X-matrix rows from P_l. For example, in Fig. 1 for P_3, since there exists a nonzero column segment in A_34, P_3 needs to receive the corresponding three elements in row 14 of X from P_4. In a similar manner, it needs to receive the elements of X-matrix rows 2, 3 from P_1 and 5, 7 from P_2.
2.2. Graph and hypergraph partitioning problems
A graph G = (V, E) consists of a set V of vertices and a set E of edges. Each edge e_ij connects a pair of distinct vertices v_i and v_j. A cost c_ij is associated with each edge e_ij. Adj(v_i) denotes the neighbors of v_i, i.e., Adj(v_i) = {v_j : e_ij ∈ E}. A hypergraph H = (V, N) consists of a set V of vertices and a set N of nets. Each net n_j connects a subset of vertices denoted as Pins(n_j). A cost c_j is associated with each net n_j. Nets(v_i) denotes the set of nets that connect v_i. In both graph and hypergraph, multiple weights w^1(v_i), ..., w^C(v_i) are associated with each vertex v_i, where w^c(v_i) denotes the cth weight associated with v_i.

Π(G) = {V_1, ..., V_K} and Π(H) = {V_1, ..., V_K} are called K-way partitions of G and H if the parts are mutually disjoint and mutually exhaustive. In Π(G), an edge e_ij is said to be cut if vertices v_i and v_j are in different parts, and uncut otherwise. The cutsize of Π(G) is defined as Σ_{e_ij ∈ E_E} c_ij, where E_E ⊆ E denotes the set of cut edges. In Π(H), the connectivity set Λ(n_j) of net n_j consists of the parts that are connected by that net, i.e., Λ(n_j) = {V_k : Pins(n_j) ∩ V_k ≠ ∅}. The number of parts connected by n_j is denoted by λ(n_j) = |Λ(n_j)|. A net n_j is said to be cut if it connects more than one part, i.e., λ(n_j) > 1, and uncut otherwise. The cutsize of Π(H) is defined as Σ_{n_j ∈ N} c_j (λ(n_j) − 1). A vertex v_i in Π(G) or Π(H) is said to be a boundary vertex if it is connected by at least one cut edge or cut net. The weight W^c(V_k) of part V_k is defined as the sum of the cth weights of the vertices in V_k. A partition Π(G) or Π(H) is said to be balanced if

W^c(V_k) ≤ W^c_avg (1 + ε^c),  for k ∈ {1, ..., K} and c ∈ {1, ..., C},   (2)

where W^c_avg = Σ_k W^c(V_k) / K, and ε^c is the predetermined imbalance value for the cth weight.
The K-way multi-constraint graph/hypergraph partitioning problem [35,36] is then defined as finding a K-way partition such that the cutsize is minimized while the balance constraint (2) is maintained. Note that for C = 1, this reduces to the well-studied standard partitioning problem. Both the graph and hypergraph partitioning problems are NP-hard [37,38].
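The balance condition (2) is straightforward to evaluate for a given partition. The following Python sketch is our own illustration (function name and data layout are assumptions, not from the paper): it accumulates the C part weights W^c(V_k) and checks every part against W^c_avg(1 + ε^c).

```python
def is_balanced(weights, part, K, eps):
    """Check the multi-constraint balance condition (2):
    W^c(V_k) <= W^c_avg * (1 + eps[c]) for every part k and weight c.
    weights[i] is the tuple of C weights of vertex v_i; part[i] is the
    part id of v_i."""
    C = len(eps)
    W = [[0.0] * C for _ in range(K)]       # W[k][c] = W^c(V_k)
    for i, k in enumerate(part):
        for c in range(C):
            W[k][c] += weights[i][c]
    for c in range(C):
        avg = sum(W[k][c] for k in range(K)) / K
        if any(W[k][c] > avg * (1 + eps[c]) for k in range(K)):
            return False
    return True
```

With C = 1 this is the standard single-constraint balance check; the proposed models later add a second weight per vertex and check both constraints at once.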
2.3. Sparse matrix partitioning models
In this section, we describe how to obtain a one-dimensional rowwise partitioning of matrix A for row-parallel Y = AX using graph and hypergraph partitioning models. These models are extensions of the standard models used for sparse matrix vector multiplication [21,22,39–41].
In the graph and hypergraph partitioning models, matrix A is represented as an undirected graph G = (V, E) and a hypergraph H = (V, N), respectively. In both, there exists a vertex v_i ∈ V for each row i of A, where v_i signifies the computational task of multiplying row i of A with X to obtain row i of Y. So, in both models, a single (C = 1) weight of s times the number of nonzeros in row i of A is associated with v_i to encode the load of this computational task. For example, in Fig. 1, w^1(v_5) = 4 × 3 = 12. In G, each nonzero a_ij or a_ji (or both) of A is represented by an edge e_ij ∈ E. The cost of edge e_ij is assigned as c_ij = 2s for each edge e_ij with a_ij ≠ 0 and a_ji ≠ 0, whereas it is assigned as c_ij = s for each edge e_ij with either a_ij ≠ 0 or a_ji ≠ 0, but not both. In H, each column j of A is represented by a net n_j ∈ N, which connects the vertices that correspond to the rows that contain a nonzero in column j, i.e., Pins(n_j) = {v_i : a_ij ≠ 0}. The cost of net n_j is assigned as c_j = s for each net in N.

In a K-way partition Π(G) or Π(H), without loss of generality, we assume that the rows corresponding to the vertices in part V_k are assigned to processor P_k. In Π(G), each cut edge e_ij, where v_i ∈ V_k and v_j ∈ V_ℓ, necessitates c_ij units of communication between processors P_k and P_ℓ. Here, P_ℓ sends row j of X to P_k if a_ij ≠ 0, and P_k sends row i of X to P_ℓ if a_ji ≠ 0. In Π(H), each cut net n_j necessitates c_j(λ(n_j) − 1) units of communication between the processors that correspond to the parts in Λ(n_j), where the owner of row j of X sends it to the remaining processors in Λ(n_j). Hereinafter, Λ(n_j) is interchangeably used to refer to parts and processors because of the identical vertex part to processor assignment.

Through these formulations, the problem of obtaining a good row partitioning of A becomes equivalent to the graph and hypergraph partitioning problems in which the objective of minimizing cutsize relates to minimizing total communication volume, while the constraint of maintaining balance on part weights ((2) with C = 1) corresponds to balancing the computational loads of processors. The objective of the hypergraph partitioning problem is an exact measure of total volume, whereas the objective of the graph partitioning problem is an approximation [21].
3. Problem definition
Assume that matrix A is distributed among K processors for the parallel SpMM operation as described in Section 2.1. Let σ(P_k, P_ℓ) be the amount of data sent from processor P_k to P_ℓ in terms of X-matrix elements. This is equal to s times the number of X-matrix rows that are owned by P_k and needed by P_ℓ, which is also equal to s times the number of nonzero column segments in off-diagonal block A_ℓk. Since X_k is owned by P_k and the computations on A_kk require no communication, σ(P_k, P_k) = 0.

We use the function ncs(·) to denote the number of nonzero column segments in a given block of the matrix. ncs(A_kℓ) is defined to be the number of nonzero column segments in A_kℓ if k ≠ ℓ, and 0 otherwise. This is extended to a row stripe R_k and a column stripe C_k, where ncs(R_k) = Σ_ℓ ncs(A_kℓ) and ncs(C_k) = Σ_ℓ ncs(A_ℓk). Finally, for the whole matrix, ncs(A_BL) = Σ_k ncs(R_k) = Σ_k ncs(C_k). For example, in Fig. 1, ncs(A_42) = 2, ncs(R_3) = 5, ncs(C_3) = 4 and ncs(A_BL) = 21. The send and receive volumes of P_k are defined as follows:

• SV(P_k), send volume of P_k: the total number of X-matrix elements sent from P_k to other processors. That is, SV(P_k) = Σ_ℓ σ(P_k, P_ℓ). This is equal to s × ncs(C_k).
• RV(P_k), receive volume of P_k: the total number of X-matrix elements received by P_k from other processors. That is, RV(P_k) = Σ_ℓ σ(P_ℓ, P_k). This is equal to s × ncs(R_k).

Note that the total volume of communication is equal to Σ_k SV(P_k) = Σ_k RV(P_k). This is also equal to s times the total number of nonzero column segments in all off-diagonal blocks, i.e., s × ncs(A_BL).
In this study, we extend the sparse matrix partitioning problem, in which the only objective is to minimize the total communication volume, by introducing four more minimization objectives which are defined on the following metrics:

1. max_k SV(P_k): maximum send volume of processors (equivalent to maximum s × ncs(C_k)),
2. max_k RV(P_k): maximum receive volume of processors (equivalent to maximum s × ncs(R_k)),
3. max_k (SV(P_k) + RV(P_k)): maximum sum of send and receive volumes of processors (equivalent to maximum s × (ncs(C_k) + ncs(R_k))),
4. max_k max{SV(P_k), RV(P_k)}: maximum of maximum of send and receive volumes of processors (equivalent to maximum s × max{ncs(C_k), ncs(R_k)}).
Under the objective of minimizing the total communication volume, minimizing one of these volume-based metrics (e.g., max_k SV(P_k)) relates to minimizing the imbalance on the respective quantity (e.g., the imbalance on SV(P_k) values). For instance, the imbalance on SV(P_k) values is defined as

max_k SV(P_k) / (Σ_k SV(P_k) / K).

Here, the expression in the denominator denotes the average send volume of processors.
A parallel application may necessitate one or more of these metrics to be minimized. These metrics are considered besides total volume since their minimization is plausible only when total volume is also minimized, as mentioned above. Hereinafter, these metrics except total volume are referred to as volume-based metrics.
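All four metrics can be computed directly from the nonzero column segments of a partition. The sketch below is our own illustration (names and the coordinate-list input are assumptions): it enumerates segments as (column, receiving part) pairs, charges s words to the sender and the receiver of each segment, and reports the total together with the four volume-based metrics.

```python
def volume_metrics(nonzeros, part, K, s):
    """Send/receive volumes from nonzero column segments: a segment is a
    (column j, part k) pair such that some a_ij != 0 with part[i] = k and
    k != part[j]. Then SV(P_k) = s * ncs(C_k) and RV(P_k) = s * ncs(R_k)."""
    segs = set()
    for i, j in nonzeros:
        if part[i] != part[j]:          # nonzero in an off-diagonal block
            segs.add((j, part[i]))      # one segment per (column, receiver)
    SV = [0] * K
    RV = [0] * K
    for j, k in segs:
        SV[part[j]] += s                # owner of row j of X sends s words
        RV[k] += s                      # part k receives s words
    return {
        "total": sum(SV),
        "max_send": max(SV),
        "max_recv": max(RV),
        "max_sum": max(sv + rv for sv, rv in zip(SV, RV)),
        "max_max": max(max(sv, rv) for sv, rv in zip(SV, RV)),
    }
```

The imbalance on, e.g., send volumes is then `metrics["max_send"] / (metrics["total"] / K)`, matching the ratio given above.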
Fig. 2. The state of the RB tree prior to bipartitioning G_1^2 and the corresponding sparse matrix. Among the edges and nonzeros, only the external (cut) edges of V_1^2 and their corresponding nonzeros are shown.
4. Models for minimizing multiple volume-based metrics
This section describes the proposed graph and hypergraph partitioning models for addressing the volume-based cost metrics defined in the previous section. Our models have the capability of addressing a single one, a combination, or all of these metrics simultaneously in a single phase. Moreover, they have the flexibility of handling custom volume-based metrics other than the four metrics already defined. Our approach relies on the widely adopted recursive bipartitioning (RB) framework utilized in a breadth-first manner and can be realized by any graph and hypergraph partitioning tool.
4.1. Recursive bipartitioning
In the RB paradigm, the initial graph/hypergraph is partitioned into two subgraphs/subhypergraphs. These two subgraphs/subhypergraphs are further bipartitioned recursively until K parts are obtained. This process forms a full binary tree, which we refer to as an RB tree, with lg2(K) levels, where K is a power of 2. Without loss of generality, graphs and hypergraphs at level r of the RB tree are numbered from left to right and denoted as G_0^r, ..., G_{2^r−1}^r and H_0^r, ..., H_{2^r−1}^r, respectively. From bipartition Π(G_k^r) = {V_{2k}^{r+1}, V_{2k+1}^{r+1}} of graph G_k^r = (V_k^r, E_k^r), two vertex-induced subgraphs G_{2k}^{r+1} = (V_{2k}^{r+1}, E_{2k}^{r+1}) and G_{2k+1}^{r+1} = (V_{2k+1}^{r+1}, E_{2k+1}^{r+1}) are formed. All cut edges in Π(G_k^r) are excluded from the newly formed subgraphs. From bipartition Π(H_k^r) = {V_{2k}^{r+1}, V_{2k+1}^{r+1}} of hypergraph H_k^r = (V_k^r, N_k^r), two vertex-induced subhypergraphs are formed similarly. All cut nets in Π(H_k^r) are split to correctly encode the cutsize metric [21].

4.2. Graph model
Consider the use of the RB paradigm for partitioning the standard graph representation G = (V, E) of A for row-parallel Y = AX to obtain a K-way partition. We assume that the RB proceeds in a breadth-first manner and that the RB process is at level r prior to bipartitioning the kth graph G_k^r. Observe that the RB process up to this bipartitioning already induces a K′-way partition

Π(G) = {V_0^{r+1}, ..., V_{2k−1}^{r+1}, V_k^r, ..., V_{2^r−1}^r}.

Π(G) contains 2k vertex parts from level r+1 and 2^r − k vertex parts from level r, making K′ = 2^r + k. After bipartitioning G_k^r, a (K′+1)-way partition Π′(G) is obtained, which contains V_{2k}^{r+1} and V_{2k+1}^{r+1} instead of V_k^r. For example, in Fig. 2, the RB process is at level r = 2 prior to bipartitioning G_1^2 = (V_1^2, E_1^2), so the current state of the RB induces a five-way partition Π(G) = {V_0^3, V_1^3, V_1^2, V_2^2, V_3^2}. Bipartitioning G_1^2 induces a six-way partition Π′(G) = {V_0^3, V_1^3, V_2^3, V_3^3, V_2^2, V_3^2}. P_k^r denotes the group of processors which are responsible for performing the tasks represented by the vertices in V_k^r. The send and receive volume definitions SV(P_k) and RV(P_k) of an individual processor P_k are easily extended to SV(P_k^r) and RV(P_k^r) for processor group P_k^r.

We first formulate the send volume of the processor group P_k^r to all other processor groups corresponding to the vertex parts in Π(G). Let the connectivity set Con(v_i) of vertex v_i ∈ V_k^r denote the subset of parts in Π(G) − {V_k^r} in which v_i has at least one neighbor. That is,

Con(v_i) = {V_t ∈ Π(G) : Adj(v_i) ∩ V_t ≠ ∅} − {V_k^r},

where t is either r or r+1. Vertex v_i is boundary if Con(v_i) ≠ ∅, and once v_i becomes boundary, it remains boundary in all further bipartitionings. For example, in Fig. 2, Con(v_9) = {V_1^3, V_2^2, V_3^2}. Con(v_i) signifies the communication operations due to v_i, where P_k^r sends row i of X to the processor groups that correspond to the parts in Con(v_i). The send load associated with v_i is denoted by sl(v_i) and is equal to

sl(v_i) = s × |Con(v_i)|.

The total send volume of P_k^r is then equal to the sum of the send loads of all vertices in V_k^r, i.e., SV(P_k^r) = Σ_{v_i ∈ V_k^r} sl(v_i). In Fig. 2, the total send volume of P_1^2 is equal to sl(v_7) + sl(v_8) + sl(v_9) + sl(v_10) = 3s + 2s + 3s + s = 9s. Therefore, during bipartitioning G_k^r, minimizing

max { Σ_{v_i ∈ V_{2k}^{r+1}} sl(v_i), Σ_{v_i ∈ V_{2k+1}^{r+1}} sl(v_i) }

is equivalent to minimizing the maximum send volume of the two processor groups P_{2k}^{r+1} and P_{2k+1}^{r+1} to the other processor groups that correspond to the vertex parts in Π(G).
In a similar manner, we formulate the receive volume of the processor group P_k^r from all other processor groups corresponding to the vertex parts in Π(G). Observe that for each boundary vertex v_j ∈ V_l^t that has at least one neighbor in V_k^r, P_k^r needs to receive the corresponding row j of X from P_l^t. For instance, in Fig. 2, since v_5 ∈ V_1^3 has two neighbors in V_1^2, P_1^2 needs to receive the corresponding fifth row of X from P_1^3. Hence, P_k^r receives a subset of X-matrix rows whose cardinality is equal to the number of vertices in V − V_k^r that have at least one neighbor in V_k^r, i.e., |{v_j ∈ V − V_k^r : v_i ∈ V_k^r and e_ji ∈ E}|. The size of this set for V_1^2 in Fig. 2 is equal to 10. Note that each such v_j contributes s words to the receive volume of P_k^r. This quantity can be captured by evenly distributing it among v_j's neighbors in V_k^r. In other words, a vertex v_j ∈ V_l^t that has at least one neighbor in V_k^r contributes s / |Adj(v_j) ∩ V_k^r| to the receive load of each vertex v_i ∈ Adj(v_j) ∩ V_k^r. The receive load of v_i, denoted by rl(v_i), is given by considering all neighbors of v_i that are not in V_k^r, that is,

rl(v_i) = Σ_{e_ji ∈ E and v_j ∈ V_l^t} s / |Adj(v_j) ∩ V_k^r|.

The total receive volume of P_k^r is then equal to the sum of the receive loads of all vertices in V_k^r, i.e., RV(P_k^r) = Σ_{v_i ∈ V_k^r} rl(v_i). In Fig. 2, the vertices v_11, v_12, v_15 and v_16 respectively contribute s/3, s/2, s and s to the receive load of v_8, which makes rl(v_8) = 17s/6. The total receive volume of P_1^2 is equal to rl(v_7) + rl(v_8) + rl(v_9) + rl(v_10) = 3s + 17s/6 + 10s/3 + 5s/6 = 10s. Note that this is also equal to s times the number of neighboring vertices of V_1^2 in V − V_1^2. Therefore, during bipartitioning G_k^r, minimizing

max { Σ_{v_i ∈ V_{2k}^{r+1}} rl(v_i), Σ_{v_i ∈ V_{2k+1}^{r+1}} rl(v_i) }

is equivalent to minimizing the maximum receive volume of the two processor groups P_{2k}^{r+1} and P_{2k+1}^{r+1} from the other processor groups that correspond to the vertex parts in Π(G).
Although these two formulations correctly encapsulate the send/receive volume loads of P_{2k}^{r+1} and P_{2k+1}^{r+1} to/from all other processor groups in Π(G), they overlook the send/receive volume loads between these two processor groups. Our approach tries to refrain from this small deviation by immediately utilizing the newly generated partition information while computing volume loads in the upcoming bipartitionings. That is, the computation of send/receive loads for bipartitioning G_k^r utilizes the most recent K′-way partition information, i.e., Π(G). This deviation becomes negligible with the increasing number of subgraphs in the latter levels of the RB tree. The encapsulation of send/receive volumes between P_{2k}^{r+1} and P_{2k+1}^{r+1} during bipartitioning G_k^r necessitates implementing a new partitioning tool.
Algorithm 1 presents the computation of the send and receive loads of the vertices in G_k^r prior to its bipartitioning. As its inputs, the algorithm needs the original graph G = (V, E), the graph G_k^r = (V_k^r, E_k^r), and the up-to-date partition information of vertices, which is stored in the part array of size V = |V|. To compute the send load of a vertex v_i ∈ V_k^r, it is necessary to find the set of parts in which v_i has at least one neighbor. For this purpose, for each v_j ∉ V_k^r in Adj(v_i), Con(v_i) is updated with the part that v_j is currently in (lines 2–4). The Adj(·) lists are the adjacency lists of the vertices in the original graph G. Next, the send load of v_i, sl(v_i), is simply set to s times the size of Con(v_i) (line 5). To compute the receive load of v_i ∈ V_k^r, it is necessary to visit the neighbors of v_i that are not in V_k^r. For each such neighbor v_j, the receive load of v_i, rl(v_i), is updated by adding v_i's share of the receive load due to v_j, which is equal to s / |Adj(v_j) ∩ V_k^r| (lines 6–8). Observe that only the boundary vertices in V_k^r will have nonzero volume loads at the end of this process.
Algorithm 1 GRAPH-COMPUTE-VOLUME-LOADS.

Algorithm 2 GRAPH-PARTITION.

Algorithm 2 presents the overall partitioning process to obtain a K-way partition utilizing breadth-first RB. For each level r of the RB tree, the graphs in this level are bipartitioned from left to right, G_0^r to G_{2^r−1}^r (lines 3–4). Prior to the bipartitioning of G_k^r, the send load and the receive load of each vertex in G_k^r are computed with GRAPH-COMPUTE-VOLUME-LOADS (line 5). Recall that in the original sparse matrix partitioning with the graph model, each vertex v_i has a single weight w^1(v_i), which represents the computational load associated with it. To address the minimization of maximum send/receive volume, we associate an extra weight with each vertex. Specifically, to minimize the maximum send volume, the send load of v_i is assigned as its second weight, i.e., w^2(v_i) = sl(v_i). In a similar manner, to minimize the maximum receive volume, the receive load of v_i is assigned as its second weight, i.e., w^2(v_i) = rl(v_i). Observe that only the boundary vertices have nonzero second weights. Next, G_k^r is bipartitioned to obtain Π(G_k^r) = {V_{2k}^{r+1}, V_{2k+1}^{r+1}} using multi-constraint partitioning to handle the multiple vertex weights (line 7). Then, two new subgraphs G_{2k}^{r+1} and G_{2k+1}^{r+1} are formed from G_k^r using Π(G_k^r) (line 8). In partitioning, minimizing the imbalance on the second part weights corresponds to minimizing the imbalance on send (receive) volume if these weights are set to send (receive) loads. In other words, under the objective of minimizing total volume in this bipartitioning, minimizing

max{W^2(V_{2k}^{r+1}), W^2(V_{2k+1}^{r+1})} / ((W^2(V_{2k}^{r+1}) + W^2(V_{2k+1}^{r+1})) / 2)

relates to minimizing max{SV(P_{2k}^{r+1}), SV(P_{2k+1}^{r+1})} (respectively, max{RV(P_{2k}^{r+1}), RV(P_{2k+1}^{r+1})}) if the second weights are set to send (receive) loads. The part array is then updated after each bipartitioning to keep track of the most up-to-date partition information of all vertices (line 9). Finally, the resulting K-way partition information is returned in the part array (line 10). Note that in the final K-way partition, processor group P_k^{lg2 K} denotes the individual processor P_k, for 0 ≤ k ≤ K − 1.
In order to efficiently maintain the send and receive loads of vertices, we make use of the RB paradigm in a breadth-first order. Since these loads are not known in advance and depend on the current state of the partitioning, it is crucial to act proactively by avoiding high imbalances on them. Compare this to the computational loads of vertices, which are known in advance and remain the same for each vertex throughout the partitioning. Hence, utilizing a breadth-first or a depth-first RB does not affect the quality of the obtained partition in terms of computational load. We prefer a breadth-first RB to a depth-first RB for minimizing volume-based metrics since operating on the parts that are at the same level of the RB tree (in order to compute send/receive loads) prevents possible deviations from the target objective(s) by quickly adapting the current available partition to the changes that occur in the send/receive volume loads of vertices.
The described methodology addresses the minimization of max_k SV(P_k) or max_k RV(P_k) separately. After computing the send and receive loads, we can also easily minimize max_k (SV(P_k) + RV(P_k)) by associating the second weight of each vertex with the sum of its send and receive loads, i.e., w^2(v_i) = sl(v_i) + rl(v_i).