Parallel Computing
Improving performance of sparse matrix dense matrix multiplication on large-scale parallel systems
Seher Acer, Oguz Selvitopi, Cevdet Aykanat
Computer Engineering Department, Bilkent University, 06800 Ankara, Turkey
Article info
Article history:
Received 30 March 2015
Revised 25 August 2016
Accepted 5 October 2016
Available online 6 October 2016

Keywords:
Irregular applications
Sparse matrices
Sparse matrix dense matrix multiplication
Load balancing
Communication volume balancing
Matrix partitioning
Graph partitioning
Hypergraph partitioning
Recursive bipartitioning
Combinatorial scientific computing
Abstract
We propose a comprehensive and generic framework to minimize multiple and different volume-based communication cost metrics for sparse matrix dense matrix multiplication (SpMM). SpMM is an important kernel that finds application in computational linear algebra and big data analytics. On distributed memory systems, this kernel is usually characterized with its high communication volume requirements. Our approach targets irregularly sparse matrices and is based on both graph and hypergraph partitioning models that rely on the widely adopted recursive bipartitioning paradigm. The proposed models are lightweight, portable (can be realized using any graph and hypergraph partitioning tool) and can simultaneously optimize different cost metrics besides total volume, such as maximum send/receive volume, maximum sum of send and receive volumes, etc., in a single partitioning phase. They allow one to define and optimize as many custom volume-based metrics as desired through a flexible formulation. The experiments on a wide range of about a thousand matrices show that the proposed models drastically reduce the maximum communication volume compared to the standard partitioning models that only address the minimization of total volume. The improvements obtained on volume-based partition quality metrics using our models are validated with parallel SpMM as well as parallel multi-source BFS experiments on two large-scale systems. For parallel SpMM, compared to the standard partitioning models, our graph and hypergraph partitioning models respectively achieve reductions of 14% and 22% in runtime, on average. Compared to the state-of-the-art partitioner UMPa, our graph model is overall 14.5× faster and achieves an average improvement of 19% in the partition quality on instances that are bounded by maximum volume. For parallel BFS, we show on graphs with more than a billion edges that the scalability can significantly be improved with our models compared to a recently proposed two-dimensional partitioning model.
© 2016 Elsevier B.V. All rights reserved.
1. Introduction
Sparse matrix kernels form the computational basis of many scientific and engineering applications. An important kernel is the sparse matrix dense matrix multiplication (SpMM) of the form Y = AX, where A is a sparse matrix, and X and Y are dense matrices.
∗ Corresponding author. Fax: +90 312 266 4047.
E-mail addresses: acer@cs.bilkent.edu.tr (S. Acer), reha@cs.bilkent.edu.tr (O. Selvitopi), aykanat@cs.bilkent.edu.tr (C. Aykanat).
http://dx.doi.org/10.1016/j.parco.2016.10.001
0167-8191/© 2016 Elsevier B.V. All rights reserved.
SpMM is already a common operation in computational linear algebra, usually utilized repeatedly within the context of block iterative methods. The practical benefits of block methods have been emphasized in several studies. These studies either focus on the block versions of certain solvers (i.e., conjugate gradient variants) which address multiple linear systems [1–4], or the block methods for eigenvalue problems, such as block Lanczos [5] and block Arnoldi [6]. The column dimension of X and Y in block methods is usually very small compared to that of A [7].
Along with other sparse matrix kernels, SpMM is also used in the emerging field of big data analytics. Graph algorithms are ubiquitous in big data analytics. Many graph analysis approaches such as centrality measures [8] rely on shortest path computations and use breadth-first search (BFS) as a building block. As indicated in several recent studies [9–14], processing each level in BFS is actually equivalent to a sparse matrix vector "multiplication". Graph algorithms often necessitate BFS from multiple sources. In this case, processing each level becomes equivalent to multiplication of a sparse matrix with another sparse (the SpGEMM kernel [15]) or dense matrix. For a typical small world network [16], matrix X is sparse at the beginning of BFS; however, it usually gets denser as BFS proceeds. Even in cases when it remains sparse, the changing pattern of this matrix throughout the BFS levels and the related sparse bookkeeping overhead make it plausible to store it as a dense matrix if there is memory available.
SpMM is provided in the Intel MKL [17] and Nvidia cuSPARSE [18] libraries for multi-/many-core and GPU architectures. To optimize SpMM on distributed memory architectures for sparse matrices with irregular sparsity patterns, one needs to take communication bottlenecks into account. Communication bottlenecks are usually summarized by latency (message start-up) and bandwidth (message transfer) costs. The latency cost is proportional to the number of messages, while the bandwidth cost is proportional to the number of words communicated, i.e., communication volume. These costs are usually addressed in the literature with intelligent graph and hypergraph partitioning models that can exploit irregular patterns quite well [19–24]. Most of these models focus on improving the performance of parallel sparse matrix vector multiplication. Although one can utilize them for SpMM as well, SpMM necessitates the use of new models tailored to this kernel since it is specifically characterized with its high communication volume requirements because of the increased column dimensions of the dense X and Y matrices. In this regard, the bandwidth cost becomes critical for overall performance, while the latency cost becomes negligible with increased average message size. Therefore, to get the best performance out of SpMM, it is vital to address communication cost metrics that are centered around volume, such as maximum send volume, maximum receive volume, etc.
1.1. Related work on multiple communication cost metrics
Total communication volume is the most widely optimized communication cost metric for improving the performance of sparse matrix operations on distributed memory systems [21,22,25–27]. There are a few works that consider communication cost metrics other than total volume [28–33]. In an early work, Uçar and Aykanat [29] proposed hypergraph partitioning models to optimize two different cost metrics simultaneously. This work is a two-phase approach, where the partitioning in the first phase is followed by a latter phase in which they minimize the total number of messages and achieve a balance on the communication volumes of processors. In a related work, Uçar and Aykanat [28] adapted the mentioned model for two-dimensional fine-grain partitioning. A very recent work by Selvitopi and Aykanat aims to reduce the latency overhead in two-dimensional jagged and checkerboard partitioning [34].
Bisseling and Meesen [30] proposed a greedy heuristic for balancing the communication loads of processors. This method is also a two-phase approach, in which the partitioning in the first phase is followed by a redistribution of communication tasks in the second phase. While doing so, they try to minimize the maximum send and receive volumes of processors while respecting the total volume obtained in the first phase.
The two-phase approaches have the flexibility of working with already existing partitions. However, since the first phase is oblivious to the cost metrics addressed in the second phase, they can get stuck in local optima. To remedy this issue, Deveci et al. [32] recently proposed a hypergraph partitioner called UMPa, which is capable of handling multiple cost metrics in a single partitioning phase. They consider various metrics such as maximum send volume, total number of messages, maximum number of messages, etc., and propose a different gain computation algorithm specific to each of these metrics. In the center of their approach are the move-based iterative improvement heuristics which make use of directed hypergraphs. These heuristics consist of a number of refinement passes. Their approach is reported to introduce an O(VK²)-time overhead to each pass, where V is the number of vertices in the hypergraph (number of rows/columns of A) and K is the number of parts/processors. They also report that the slowdown of UMPa increases with increasing K with respect to the native hypergraph partitioner PaToH due to this quadratic complexity.
1.2. Contributions
In this study, we propose a comprehensive and generic one-phase framework to minimize multiple volume-based communication cost metrics for improving the performance of SpMM on distributed memory systems. Our framework relies on the widely adopted recursive bipartitioning paradigm utilized in the context of graph and hypergraph partitioning. Total volume can already be effectively minimized with existing partitioners [21,22,25]. We focus on the other important volume-based metrics besides total volume, such as maximum send/receive volume, maximum sum of send and receive volumes, etc. The proposed model associates additional weights with boundary vertices to keep track of the volume loads of processors during recursive bipartitioning. The minimization objectives associated with these loads are treated as constraints in order to make use of a readily available partitioner. Achieving a balance on these weights of boundary vertices through these constraints enables the minimization of the target volume-based metrics. We also extend our model by proposing two practical enhancements to handle these constraints in partitioners more efficiently.
Our framework is unique and flexible in the sense that it handles multiple volume-based metrics through the same formulation in a generic manner. This framework also allows the optimization of any custom metric defined on send/receive volumes. Our algorithms are computationally lightweight: they only introduce an extra O(nnz(A)) time to each recursive bipartitioning level, where nnz(A) is the number of nonzeros in matrix A. To the best of our knowledge, this is the first portable one-phase method that can easily be integrated into any state-of-the-art graph and hypergraph partitioner. Our work is also the first that addresses multiple volume-based metrics in the graph partitioning context.
Another important aspect is the simultaneous handling of multiple cost metrics. This feature is crucial as overall communication cost is simultaneously determined by multiple factors and the target parallel application may demand optimization of different cost metrics simultaneously for good performance (SpMM and multi-source BFS in our case). In this regard, Uçar and Aykanat [28,29] accommodate this feature for two metrics, whereas Deveci et al. [32], although addressing multiple metrics, do not handle them in a completely simultaneous manner since some of the metrics may not be minimized in certain cases. Our models, in contrast, can optimize all target metrics simultaneously by assigning equal importance to each of them in the feasible search space. In addition, the proposed framework allows one to define and optimize as many volume-based metrics as desired.
For the experiments, the proposed partitioning models for graphs and hypergraphs are realized using the widely adopted partitioners Metis [22] and PaToH [21], respectively. We have tested the proposed models for 128, 256, 512 and 1024 processors on a dataset of 964 matrices containing instances from different domains. We achieve average improvements of up to 61% and 78% in maximum communication volume for the graph and hypergraph models, respectively, in the categories of matrices for which maximum volume is most critical. Compared to the state-of-the-art partitioner UMPa, our graph model achieves an overall improvement of 5% in partition quality while being 14.5× faster, and our hypergraph model achieves an overall improvement of 11% in partition quality while being 3.4× faster. Our average improvements for the instances that are bounded by maximum volume are even higher: 19% for the proposed graph model and 24% for the proposed hypergraph model.
We test the validity of the proposed models for both parallel SpMM and multi-source BFS kernels on the large-scale HPC systems Cray XC40 and Lenovo NeXtScale, respectively. For parallel SpMM, compared to the standard partitioning models, our graph and hypergraph partitioning models respectively lead to reductions of 14% and 22% in runtime, on average. For parallel BFS, we show on graphs with more than a billion edges that the scalability can significantly be improved with our models compared to a recently proposed two-dimensional partitioning model [12] for the parallelization of this kernel on distributed systems.
The rest of the paper is organized as follows. Section 2 gives background for partitioning sparse matrices via graph and hypergraph models. Section 3 defines the problems regarding the minimization of volume-based cost metrics. The proposed graph and hypergraph partitioning models to address these problems are described in Section 4. Section 5 proposes two practical extensions to these models. Section 6 gives experimental results for the investigated partitioning schemes and parallel runtimes. Section 7 concludes.
2. Background
2.1. One-dimensional sparse matrix partitioning
Consider the parallelization of sparse matrix dense matrix multiplication (SpMM) of the form Y = AX, where A is an n × n sparse matrix, and X and Y are n × s dense matrices. Assume that A is permuted into a K-way block structure of the form

A_BL = \begin{bmatrix} C_1 & \cdots & C_K \end{bmatrix} = \begin{bmatrix} R_1 \\ \vdots \\ R_K \end{bmatrix} = \begin{bmatrix} A_{11} & \cdots & A_{1K} \\ \vdots & \ddots & \vdots \\ A_{K1} & \cdots & A_{KK} \end{bmatrix},   (1)

for rowwise or columnwise partitioning, where K is the number of processors in the parallel system. Processor P_k owns row stripe R_k = [A_k1 · · · A_kK] for rowwise partitioning, whereas it owns column stripe C_k = [A_1k^T · · · A_Kk^T]^T for columnwise partitioning. We focus on rowwise partitioning in this work; however, all described models apply to columnwise partitioning as well.
We use R_k and A_k interchangeably throughout the paper as we only consider rowwise partitioning.
In both block iterative methods and BFS-like computations, SpMM is performed repeatedly with the same input matrix A and changing X-matrix elements. The input matrix X of the next iteration is obtained from the output matrix Y of the current iteration via element-wise linear matrix operations. We focus on the case where the rowwise partitions of the input and output dense matrices are conformable to avoid redundant communication during these linear operations. Hence, a partition of A naturally induces partition [Y_1^T ... Y_K^T]^T on the rows of Y, which is in turn used to induce a conformable partition [X_1^T ... X_K^T]^T on the rows of X. In this regard, the row and column permutations mentioned in (1) should be conformable.
Fig. 1. Row-parallel Y = AX with K = 4 processors, n = 16 and s = 3.

A nonzero column segment is defined as the nonzeros of a column in a specific submatrix block. For example, in Fig. 1 there are two nonzero column segments in A_14, which belong to columns 13 and 15. In row-parallel Y = AX, P_k owns row stripes A_k and X_k of the input matrices, and is responsible for computing the respective row stripe Y_k = A_k X of the output matrix. P_k can perform the computations regarding diagonal block A_kk locally using its own portion X_k without requiring any communication, where A_kl is called a diagonal block if k = l, and an off-diagonal block otherwise. Since P_k owns only X_k, it needs the remaining X-matrix rows that correspond to nonzero column segments in the off-diagonal blocks of A_k. Hence, the respective rows must be sent to P_k by their owners in a pre-communication phase prior to the SpMM computations. Specifically, to perform the multiplication regarding off-diagonal block A_kl, P_k needs to receive the respective X-matrix rows from P_l. For example, in Fig. 1 for P_3, since there exists a nonzero column segment in A_34, P_3 needs to receive the corresponding three elements in row 14 of X from P_4. In a similar manner, it needs to receive the elements of X-matrix rows 2, 3 from P_1 and 5, 7 from P_2.
2.2. Graph and hypergraph partitioning problems
A graph G = (V, E) consists of a set V of vertices and a set E of edges. Each edge e_ij connects a pair of distinct vertices v_i and v_j. A cost c_ij is associated with each edge e_ij. Adj(v_i) denotes the neighbors of v_i, i.e., Adj(v_i) = {v_j : e_ij ∈ E}. A hypergraph H = (V, N) consists of a set V of vertices and a set N of nets. Each net n_j connects a subset of vertices denoted as Pins(n_j). A cost c_j is associated with each net n_j. Nets(v_i) denotes the set of nets that connect v_i. In both graph and hypergraph, multiple weights w^1(v_i), ..., w^C(v_i) are associated with each vertex v_i, where w^c(v_i) denotes the cth weight associated with v_i.

Π(G) = {V_1, ..., V_K} and Π(H) = {V_1, ..., V_K} are called K-way partitions of G and H if the parts are mutually disjoint and mutually exhaustive. In Π(G), an edge e_ij is said to be cut if vertices v_i and v_j are in different parts, and uncut otherwise. The cutsize of Π(G) is defined as Σ_{e_ij ∈ E_E} c_ij, where E_E ⊆ E denotes the set of cut edges. In Π(H), the connectivity set Λ(n_j) of net n_j consists of the parts that are connected by that net, i.e., Λ(n_j) = {V_k : Pins(n_j) ∩ V_k ≠ ∅}. The number of parts connected by n_j is denoted by λ(n_j) = |Λ(n_j)|. A net n_j is said to be cut if it connects more than one part, i.e., λ(n_j) > 1, and uncut otherwise. The cutsize of Π(H) is defined as Σ_{n_j ∈ N} c_j (λ(n_j) − 1). A vertex v_i in Π(G) or Π(H) is said to be a boundary vertex if it is connected by at least one cut edge or cut net. The weight W^c(V_k) of part V_k is defined as the sum of the cth weights of the vertices in V_k. A partition Π(G) or Π(H) is said to be balanced if

W^c(V_k) ≤ W^c_avg (1 + ε^c),  for k ∈ {1, ..., K} and c ∈ {1, ..., C},   (2)

where W^c_avg = Σ_k W^c(V_k) / K, and ε^c is the predetermined imbalance value for the cth weight.
The K-way multi-constraint graph/hypergraph partitioning problem [35,36] is then defined as finding a K-way partition such that the cutsize is minimized while the balance constraint (2) is maintained. Note that for C = 1, this reduces to the well-studied standard partitioning problem. Both the graph and hypergraph partitioning problems are NP-hard [37,38].
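The balance condition (2) is straightforward to evaluate for a given partition. The following Python sketch is our own illustration (function name and data layout are assumptions, not from the paper): it accumulates the C part weights W^c(V_k) and checks every part against W^c_avg(1 + ε^c).

```python
def is_balanced(weights, part, K, eps):
    """Check the multi-constraint balance condition (2):
    W^c(V_k) <= W^c_avg * (1 + eps[c]) for every part k and weight c.
    weights[i] is the tuple of C weights of vertex v_i; part[i] is the
    part id of v_i."""
    C = len(eps)
    W = [[0.0] * C for _ in range(K)]       # W[k][c] = W^c(V_k)
    for i, k in enumerate(part):
        for c in range(C):
            W[k][c] += weights[i][c]
    for c in range(C):
        avg = sum(W[k][c] for k in range(K)) / K
        if any(W[k][c] > avg * (1 + eps[c]) for k in range(K)):
            return False
    return True
```

With C = 1 this is the standard single-constraint balance check; the proposed models later add a second weight per vertex and check both constraints at once.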
2.3. Sparse matrix partitioning models
In this section, we describe how to obtain a one-dimensional rowwise partitioning of matrix A for row-parallel Y = AX using graph and hypergraph partitioning models. These models are extensions of the standard models used for sparse matrix vector multiplication [21,22,39–41].
In the graph and hypergraph partitioning models, matrix A is represented as an undirected graph G = (V, E) and a hypergraph H = (V, N), respectively. In both, there exists a vertex v_i ∈ V for each row i of A, where v_i signifies the computational task of multiplying row i of A with X to obtain row i of Y. So, in both models, a single (C = 1) weight of s times the number of nonzeros in row i of A is associated with v_i to encode the load of this computational task. For example, in Fig. 1, w^1(v_5) = 4 × 3 = 12. In G, each nonzero a_ij or a_ji (or both) of A is represented by an edge e_ij ∈ E. The cost of edge e_ij is assigned as c_ij = 2s for each edge e_ij with a_ij ≠ 0 and a_ji ≠ 0, whereas it is assigned as c_ij = s for each edge e_ij with either a_ij ≠ 0 or a_ji ≠ 0, but not both. In H, each column j of A is represented by a net n_j ∈ N, which connects the vertices that correspond to the rows that contain a nonzero in column j, i.e., Pins(n_j) = {v_i : a_ij ≠ 0}. The cost of net n_j is assigned as c_j = s for each net in N.

In a K-way partition Π(G) or Π(H), without loss of generality, we assume that the rows corresponding to the vertices in part V_k are assigned to processor P_k. In Π(G), each cut edge e_ij, where v_i ∈ V_k and v_j ∈ V_ℓ, necessitates c_ij units of communication between processors P_k and P_ℓ. Here, P_ℓ sends row j of X to P_k if a_ij ≠ 0, and P_k sends row i of X to P_ℓ if a_ji ≠ 0. In Π(H), each cut net n_j necessitates c_j(λ(n_j) − 1) units of communication between the processors that correspond to the parts in Λ(n_j), where the owner of row j of X sends it to the remaining processors in Λ(n_j). Hereinafter, Λ(n_j) is interchangeably used to refer to parts and processors because of the identical vertex part to processor assignment.

Through these formulations, the problem of obtaining a good row partitioning of A becomes equivalent to the graph and hypergraph partitioning problems in which the objective of minimizing cutsize relates to minimizing total communication volume, while the constraint of maintaining balance on part weights ((2) with C = 1) corresponds to balancing the computational loads of processors. The objective of the hypergraph partitioning problem is an exact measure of total volume, whereas the objective of the graph partitioning problem is an approximation [21].
3. Problem definition
Assume that matrix A is distributed among K processors for the parallel SpMM operation as described in Section 2.1. Let σ(P_k, P_ℓ) be the amount of data sent from processor P_k to P_ℓ in terms of X-matrix elements. This is equal to s times the number of X-matrix rows that are owned by P_k and needed by P_ℓ, which is also equal to s times the number of nonzero column segments in off-diagonal block A_ℓk. Since X_k is owned by P_k and the computations on A_kk require no communication, σ(P_k, P_k) = 0.

We use the function ncs(·) to denote the number of nonzero column segments in a given block of the matrix. ncs(A_kℓ) is defined to be the number of nonzero column segments in A_kℓ if k ≠ ℓ, and 0 otherwise. This is extended to a row stripe R_k and a column stripe C_k, where ncs(R_k) = Σ_ℓ ncs(A_kℓ) and ncs(C_k) = Σ_ℓ ncs(A_ℓk). Finally, for the whole matrix, ncs(A_BL) = Σ_k ncs(R_k) = Σ_k ncs(C_k). For example, in Fig. 1, ncs(A_42) = 2, ncs(R_3) = 5, ncs(C_3) = 4 and ncs(A_BL) = 21. The send and receive volumes of P_k are defined as follows:

• SV(P_k), send volume of P_k: the total number of X-matrix elements sent from P_k to other processors. That is, SV(P_k) = Σ_ℓ σ(P_k, P_ℓ). This is equal to s × ncs(C_k).
• RV(P_k), receive volume of P_k: the total number of X-matrix elements received by P_k from other processors. That is, RV(P_k) = Σ_ℓ σ(P_ℓ, P_k). This is equal to s × ncs(R_k).

Note that the total volume of communication is equal to Σ_k SV(P_k) = Σ_k RV(P_k). This is also equal to s times the total number of nonzero column segments in all off-diagonal blocks, i.e., s × ncs(A_BL).
In this study, we extend the sparse matrix partitioning problem, in which the only objective is to minimize the total communication volume, by introducing four more minimization objectives which are defined on the following metrics:

1. max_k SV(P_k): maximum send volume of processors (equivalent to maximum s × ncs(C_k)),
2. max_k RV(P_k): maximum receive volume of processors (equivalent to maximum s × ncs(R_k)),
3. max_k (SV(P_k) + RV(P_k)): maximum sum of send and receive volumes of processors (equivalent to maximum s × (ncs(C_k) + ncs(R_k))),
4. max_k max{SV(P_k), RV(P_k)}: maximum of maximum of send and receive volumes of processors (equivalent to maximum s × max{ncs(C_k), ncs(R_k)}).
Under the objective of minimizing the total communication volume, minimizing one of these volume-based metrics (e.g., max_k SV(P_k)) relates to minimizing the imbalance on the respective quantity (e.g., the imbalance on SV(P_k) values). For instance, the imbalance on SV(P_k) values is defined as

max_k SV(P_k) / (Σ_k SV(P_k) / K).

Here, the expression in the denominator denotes the average send volume of processors.
A parallel application may necessitate one or more of these metrics to be minimized. These metrics are considered besides total volume since their minimization is plausible only when total volume is also minimized, as mentioned above. Hereinafter, these metrics except total volume are referred to as volume-based metrics.
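All four metrics can be computed directly from the nonzero column segments of a partition. The sketch below is our own illustration (names and the coordinate-list input are assumptions): it enumerates segments as (column, receiving part) pairs, charges s words to the sender and the receiver of each segment, and reports the total together with the four volume-based metrics.

```python
def volume_metrics(nonzeros, part, K, s):
    """Send/receive volumes from nonzero column segments: a segment is a
    (column j, part k) pair such that some a_ij != 0 with part[i] = k and
    k != part[j]. Then SV(P_k) = s * ncs(C_k) and RV(P_k) = s * ncs(R_k)."""
    segs = set()
    for i, j in nonzeros:
        if part[i] != part[j]:          # nonzero in an off-diagonal block
            segs.add((j, part[i]))      # one segment per (column, receiver)
    SV = [0] * K
    RV = [0] * K
    for j, k in segs:
        SV[part[j]] += s                # owner of row j of X sends s words
        RV[k] += s                      # part k receives s words
    return {
        "total": sum(SV),
        "max_send": max(SV),
        "max_recv": max(RV),
        "max_sum": max(sv + rv for sv, rv in zip(SV, RV)),
        "max_max": max(max(sv, rv) for sv, rv in zip(SV, RV)),
    }
```

The imbalance on, e.g., send volumes is then `metrics["max_send"] / (metrics["total"] / K)`, matching the ratio given above.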
Fig. 2. The state of the RB tree prior to bipartitioning G_1^2 and the corresponding sparse matrix. Among the edges and nonzeros, only the external (cut) edges of V_1^2 and their corresponding nonzeros are shown.
4. Models for minimizing multiple volume-based metrics
This section describes the proposed graph and hypergraph partitioning models for addressing the volume-based cost metrics defined in the previous section. Our models have the capability of addressing a single one, a combination, or all of these metrics simultaneously in a single phase. Moreover, they have the flexibility of handling custom volume-based metrics other than the four metrics already defined. Our approach relies on the widely adopted recursive bipartitioning (RB) framework utilized in a breadth-first manner and can be realized by any graph and hypergraph partitioning tool.
4.1. Recursive bipartitioning
In the RB paradigm, the initial graph/hypergraph is partitioned into two subgraphs/subhypergraphs. These two subgraphs/subhypergraphs are further bipartitioned recursively until K parts are obtained. This process forms a full binary tree, which we refer to as an RB tree, with lg2(K) levels, where K is a power of 2. Without loss of generality, graphs and hypergraphs at level r of the RB tree are numbered from left to right and denoted as G_0^r, ..., G_{2^r−1}^r and H_0^r, ..., H_{2^r−1}^r, respectively. From bipartition Π(G_k^r) = {V_{2k}^{r+1}, V_{2k+1}^{r+1}} of graph G_k^r = (V_k^r, E_k^r), two vertex-induced subgraphs G_{2k}^{r+1} = (V_{2k}^{r+1}, E_{2k}^{r+1}) and G_{2k+1}^{r+1} = (V_{2k+1}^{r+1}, E_{2k+1}^{r+1}) are formed. All cut edges in Π(G_k^r) are excluded from the newly formed subgraphs. From bipartition Π(H_k^r) = {V_{2k}^{r+1}, V_{2k+1}^{r+1}} of hypergraph H_k^r = (V_k^r, N_k^r), two vertex-induced subhypergraphs are formed similarly. All cut nets in Π(H_k^r) are split to correctly encode the cutsize metric [21].

4.2. Graph model
Consider the use of the RB paradigm for partitioning the standard graph representation G = (V, E) of A for row-parallel Y = AX to obtain a K-way partition. We assume that the RB proceeds in a breadth-first manner and that the RB process is at level r prior to bipartitioning the kth graph G_k^r. Observe that the RB process up to this bipartitioning already induces a K′-way partition

Π(G) = {V_0^{r+1}, ..., V_{2k−1}^{r+1}, V_k^r, ..., V_{2^r−1}^r}.

Π(G) contains 2k vertex parts from level r+1 and 2^r − k vertex parts from level r, making K′ = 2^r + k. After bipartitioning G_k^r, a (K′+1)-way partition Π′(G) is obtained, which contains V_{2k}^{r+1} and V_{2k+1}^{r+1} instead of V_k^r. For example, in Fig. 2, the RB process is at level r = 2 prior to bipartitioning G_1^2 = (V_1^2, E_1^2), so the current state of the RB induces a five-way partition Π(G) = {V_0^3, V_1^3, V_1^2, V_2^2, V_3^2}. Bipartitioning G_1^2 induces a six-way partition Π′(G) = {V_0^3, V_1^3, V_2^3, V_3^3, V_2^2, V_3^2}. P_k^r denotes the group of processors which are responsible for performing the tasks represented by the vertices in V_k^r. The send and receive volume definitions SV(P_k) and RV(P_k) of an individual processor P_k are easily extended to SV(P_k^r) and RV(P_k^r) for processor group P_k^r.

We first formulate the send volume of the processor group P_k^r to all other processor groups corresponding to the vertex parts in Π(G). Let the connectivity set Con(v_i) of vertex v_i ∈ V_k^r denote the subset of parts in Π(G) − {V_k^r} in which v_i has at least one neighbor. That is,

Con(v_i) = {V_t ∈ Π(G) : Adj(v_i) ∩ V_t ≠ ∅} − {V_k^r},

where t is either r or r+1. Vertex v_i is boundary if Con(v_i) ≠ ∅, and once v_i becomes boundary, it remains boundary in all further bipartitionings. For example, in Fig. 2, Con(v_9) = {V_1^3, V_2^2, V_3^2}. Con(v_i) signifies the communication operations due to v_i, where P_k^r sends row i of X to the processor groups that correspond to the parts in Con(v_i). The send load associated with v_i is denoted by sl(v_i) and is equal to

sl(v_i) = s × |Con(v_i)|.

The total send volume of P_k^r is then equal to the sum of the send loads of all vertices in V_k^r, i.e., SV(P_k^r) = Σ_{v_i ∈ V_k^r} sl(v_i). In Fig. 2, the total send volume of P_1^2 is equal to sl(v_7) + sl(v_8) + sl(v_9) + sl(v_10) = 3s + 2s + 3s + s = 9s. Therefore, during bipartitioning G_k^r, minimizing

max { Σ_{v_i ∈ V_{2k}^{r+1}} sl(v_i), Σ_{v_i ∈ V_{2k+1}^{r+1}} sl(v_i) }

is equivalent to minimizing the maximum send volume of the two processor groups P_{2k}^{r+1} and P_{2k+1}^{r+1} to the other processor groups that correspond to the vertex parts in Π(G).
In a similar manner, we formulate the receive volume of the processor group P_k^r from all other processor groups corresponding to the vertex parts in Π(G). Observe that for each boundary vertex v_j ∈ V_l^t that has at least one neighbor in V_k^r, P_k^r needs to receive the corresponding row j of X from P_l^t. For instance, in Fig. 2, since v_5 ∈ V_1^3 has two neighbors in V_1^2, P_1^2 needs to receive the corresponding fifth row of X from P_1^3. Hence, P_k^r receives a subset of X-matrix rows whose cardinality is equal to the number of vertices in V − V_k^r that have at least one neighbor in V_k^r, i.e., |{v_j ∈ V − V_k^r : v_i ∈ V_k^r and e_ji ∈ E}|. The size of this set for V_1^2 in Fig. 2 is equal to 10. Note that each such v_j contributes s words to the receive volume of P_k^r. This quantity can be captured by evenly distributing it among v_j's neighbors in V_k^r. In other words, a vertex v_j ∈ V_l^t that has at least one neighbor in V_k^r contributes s / |Adj(v_j) ∩ V_k^r| to the receive load of each vertex v_i ∈ Adj(v_j) ∩ V_k^r. The receive load of v_i, denoted by rl(v_i), is given by considering all neighbors of v_i that are not in V_k^r, that is,

rl(v_i) = Σ_{e_ji ∈ E and v_j ∈ V_l^t} s / |Adj(v_j) ∩ V_k^r|.

The total receive volume of P_k^r is then equal to the sum of the receive loads of all vertices in V_k^r, i.e., RV(P_k^r) = Σ_{v_i ∈ V_k^r} rl(v_i). In Fig. 2, the vertices v_11, v_12, v_15 and v_16 respectively contribute s/3, s/2, s and s to the receive load of v_8, which makes rl(v_8) = 17s/6. The total receive volume of P_1^2 is equal to rl(v_7) + rl(v_8) + rl(v_9) + rl(v_10) = 3s + 17s/6 + 10s/3 + 5s/6 = 10s. Note that this is also equal to s times the number of neighboring vertices of V_1^2 in V − V_1^2. Therefore, during bipartitioning G_k^r, minimizing

max { Σ_{v_i ∈ V_{2k}^{r+1}} rl(v_i), Σ_{v_i ∈ V_{2k+1}^{r+1}} rl(v_i) }

is equivalent to minimizing the maximum receive volume of the two processor groups P_{2k}^{r+1} and P_{2k+1}^{r+1} from the other processor groups that correspond to the vertex parts in Π(G).
Although these two formulations correctly encapsulate the send/receive volume loads of P_{2k}^{r+1} and P_{2k+1}^{r+1} to/from all other processor groups in Π(G), they overlook the send/receive volume loads between these two processor groups. Our approach tries to refrain from this small deviation by immediately utilizing the newly generated partition information while computing volume loads in the upcoming bipartitionings. That is, the computation of send/receive loads for bipartitioning G_k^r utilizes the most recent K′-way partition information, i.e., Π(G). This deviation becomes negligible with the increasing number of subgraphs in the latter levels of the RB tree. The encapsulation of send/receive volumes between P_{2k}^{r+1} and P_{2k+1}^{r+1} during bipartitioning G_k^r necessitates implementing a new partitioning tool.
Algorithm 1 presents the computation of the send and receive loads of the vertices in G_k^r prior to its bipartitioning. As its inputs, the algorithm needs the original graph G = (V, E), the graph G_k^r = (V_k^r, E_k^r), and the up-to-date partition information of vertices, which is stored in the part array of size V = |V|. To compute the send load of a vertex v_i ∈ V_k^r, it is necessary to find the set of parts in which v_i has at least one neighbor. For this purpose, for each v_j ∉ V_k^r in Adj(v_i), Con(v_i) is updated with the part that v_j is currently in (lines 2–4). The Adj(·) lists are the adjacency lists of the vertices in the original graph G. Next, the send load of v_i, sl(v_i), is simply set to s times the size of Con(v_i) (line 5). To compute the receive load of v_i ∈ V_k^r, it is necessary to visit the neighbors of v_i that are not in V_k^r. For each such neighbor v_j, the receive load of v_i, rl(v_i), is updated by adding v_i's share of the receive load due to v_j, which is equal to s / |Adj(v_j) ∩ V_k^r| (lines 6–8). Observe that only the boundary vertices in V_k^r will have nonzero volume loads at the end of this process.
Algorithm 1 GRAPH-COMPUTE-VOLUME-LOADS.

Algorithm 2 GRAPH-PARTITION.

Algorithm 2 presents the overall partitioning process to obtain a K-way partition utilizing breadth-first RB. For each level r of the RB tree, the graphs in this level are bipartitioned from left to right, G_0^r to G_{2^r−1}^r (lines 3–4). Prior to the bipartitioning of G_k^r, the send load and the receive load of each vertex in G_k^r are computed with GRAPH-COMPUTE-VOLUME-LOADS (line 5). Recall that in the original sparse matrix partitioning with the graph model, each vertex v_i has a single weight w^1(v_i), which represents the computational load associated with it. To address the minimization of maximum send/receive volume, we associate an extra weight with each vertex. Specifically, to minimize the maximum send volume, the send load of v_i is assigned as its second weight, i.e., w^2(v_i) = sl(v_i). In a similar manner, to minimize the maximum receive volume, the receive load of v_i is assigned as its second weight, i.e., w^2(v_i) = rl(v_i). Observe that only the boundary vertices have nonzero second weights. Next, G_k^r is bipartitioned to obtain Π(G_k^r) = {V_{2k}^{r+1}, V_{2k+1}^{r+1}} using multi-constraint partitioning to handle the multiple vertex weights (line 7). Then, two new subgraphs G_{2k}^{r+1} and G_{2k+1}^{r+1} are formed from G_k^r using Π(G_k^r) (line 8). In partitioning, minimizing the imbalance on the second part weights corresponds to minimizing the imbalance on send (receive) volume if these weights are set to send (receive) loads. In other words, under the objective of minimizing total volume in this bipartitioning, minimizing

max{W^2(V_{2k}^{r+1}), W^2(V_{2k+1}^{r+1})} / ((W^2(V_{2k}^{r+1}) + W^2(V_{2k+1}^{r+1})) / 2)

relates to minimizing max{SV(P_{2k}^{r+1}), SV(P_{2k+1}^{r+1})} (respectively, max{RV(P_{2k}^{r+1}), RV(P_{2k+1}^{r+1})}) if the second weights are set to send (receive) loads. The part array is then updated after each bipartitioning to keep track of the most up-to-date partition information of all vertices (line 9). Finally, the resulting K-way partition information is returned in the part array (line 10). Note that in the final K-way partition, processor group P_k^{lg2 K} denotes the individual processor P_k, for 0 ≤ k ≤ K − 1.
In order to efficiently maintain the send and receive loads of vertices, we make use of the RB paradigm in a breadth-first order. Since these loads are not known in advance and depend on the current state of the partitioning, it is crucial to act proactively by avoiding high imbalances on them. Compare this to the computational loads of vertices, which are known in advance and remain the same for each vertex throughout the partitioning. Hence, utilizing a breadth-first or a depth-first RB does not affect the quality of the obtained partition in terms of computational load. We prefer a breadth-first RB to a depth-first RB for minimizing volume-based metrics since operating on the parts that are at the same level of the RB tree (in order to compute send/receive loads) prevents possible deviations from the target objective(s) by quickly adapting the current available partition to the changes that occur in the send/receive volume loads of vertices.
The described methodology addresses the minimization of max_k SV(P_k) or max_k RV(P_k) separately. After computing the send and receive loads, we can also easily minimize max_k (SV(P_k) + RV(P_k)) by associating the second weight of each vertex with the sum of its send and receive loads, i.e., w^2(v_i) = sl(v_i) + rl(v_i).