
Contents lists available at ScienceDirect

Digital Signal Processing

www.elsevier.com/locate/dsp

A novel distributed anomaly detection algorithm based on support vector machines

Tolga Ergen a,∗, Suleyman S. Kozat b

a Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
b Department of Electrical and Electronics Engineering, Bilkent University, Ankara, Turkey

Article history:
Available online 8 January 2020

Keywords:
Anomaly detection
Distributed learning
Support vector machine
Gradient based training

Abstract:
In this paper, we study anomaly detection in a distributed network of nodes and introduce a novel algorithm based on Support Vector Machines (SVMs). We first reformulate the conventional SVM optimization problem for a distributed network of nodes. We then directly train the parameters of this SVM architecture in its primal form using a gradient based algorithm in a fully distributed manner, i.e., each node in our network is allowed to communicate only with its neighboring nodes in order to train the parameters. Therefore, we not only obtain a high performing anomaly detection algorithm thanks to the strong modeling capabilities of SVMs, but also achieve significantly reduced communication load and computational complexity due to our fully distributed and efficient gradient based training. Here, we provide a training algorithm in a supervised framework; however, we also provide the extensions of our implementation to an unsupervised framework. We illustrate the performance gains achieved by our algorithm via several benchmark real life and synthetic experiments.

© 2020 Elsevier Inc. All rights reserved.

1. Introduction

Anomaly detection has been extensively studied in the current literature due to its various applications such as healthcare, fraud detection, network monitoring, cybersecurity, and military surveillance [1–5]. Particularly, in this paper, we study the anomaly detection problem in a distributed framework, where we have a network of K nodes equipped with processing capabilities. Each node in our network observes a data sequence and aims to decide whether the observations are anomalous or not. Here, each node has a set of neighboring nodes and can share information with its neighbors in order to enhance the detection performance.

Various other anomaly detection algorithms have also been introduced in order to enhance the detection performance. As an example, Fisher kernel and generative models are introduced to gain performance improvements, especially for time series data [6–9]. However, the main drawback of the Fisher kernel model is that it requires the inversion of the Fisher information matrix, which has a high computational complexity [6,7]. On the other hand, in order to obtain an adequate performance from a generative model such as a Hidden Markov Model (HMM), one should carefully select its

This paper is in part supported by Tubitak Project No: 117E153.

* Corresponding author.

E-mail addresses: ergen@stanford.edu (T. Ergen), kozat@ee.bilkent.edu.tr (S.S. Kozat).

structural parameters, e.g., the number of states and topology of the model [8,9]. Furthermore, the type of training algorithm also has considerable effects on the performance of generative models, which limits their usage in real life applications [9]. Thus, neural network based approaches, especially those based on Recurrent Neural Networks (RNNs), are introduced thanks to their inherent memory structure that can store "time" or "state" information and strong modeling capabilities [1,10]. Since the basic RNN architecture does not have control structures (gates) to regulate the amount of information to be stored [11,12], a more advanced RNN architecture with several control structures, i.e., the Long Short Term Memory (LSTM) network, is usually employed among RNNs [12,13]. However, such neural network based approaches do not have a proper objective criterion for anomaly detection tasks, especially in the absence of data labels [1,14]. Hence, they first predict a sequence from its past samples and then determine whether the sequence is an anomaly or not based on the prediction error, i.e., an anomaly is an event which cannot be predicted from the past nominal data [1]. Thus, they require a probabilistic model for the prediction error and a threshold on the probabilistic model to detect anomalies, which results in challenging optimization problems and restricts their performance accordingly [1,14,15]. Furthermore, one needs a considerable amount of computational power and time to adequately train such networks due to their highly complex and nonlinear structure, which also makes them susceptible to overfitting problems. In order to handle large scale anomaly detection problems,

https://doi.org/10.1016/j.dsp.2020.102657
1051-2004/© 2020 Elsevier Inc. All rights reserved.


distributed anomaly detection techniques have also been extensively studied, especially in the IoT and Deep Tech literatures [16–19]. However, the main drawback of these approaches, especially the deep ones, is that they suffer from high computational complexity and require careful tuning of several parameters to achieve good generalization performance [17,20].

In the current literature, a common and widely used approach for anomaly detection is to find a decision function that defines the model of normality [1]. In this approach, one first defines a certain decision function and then optimizes the parameters of this function with respect to a predefined objective criterion, e.g., the Support Vector Machines (SVM) algorithm [21,22]. One remarkable property of the SVM is that its ability to learn can be independent of the feature space dimensionality, so that it generalizes well even in the presence of many features [23]. Due to its generalization capabilities, there exist several anomaly detection methods based on SVM, e.g., [23–27] and the other references therein. In these studies, SVM and its variants are employed for anomaly detection tasks. In addition to these, recently, applications of SVMs along with other learning techniques have also attracted significant attention [8,9,20,28–31]. As an example, in [8,9], the authors employ a Hidden Markov Model (HMM) to extract sequential information from the data and then apply the SVM algorithm for classification. Moreover, in [20,28,29], the authors use the SVM architecture along with different neural network architectures. Thus, they achieve effective optimization of the network parameters with respect to the well defined objective function of SVMs and also enjoy the good generalization capabilities of SVMs at the output layer.

Support Vector Machines (SVMs) are generally employed for anomaly detection thanks to their aforementioned high performance in several real life scenarios [20,21,32–35]. In the conventional applications, SVMs are used in centralized frameworks, where all observations are available and processed together at a certain node (or a processing unit), i.e., known as the central node. Here, the aim is to find the maximum margin hyperplane that separates normal data sequences from anomalies based on the centrally available observations. A common and effective approach to train an SVM is to solve it as a quadratic optimization problem in its dual form, where the dimensionality of the problem is in the order of the cardinality of the training data. After solving this optimization problem, the resulting separating hyperplane only depends on a subset of the training data, i.e., known as Support Vectors (SVs). Thus, for training an SVM, all the observed data is gathered at a central node to find the desired SVs in a centralized approach [1,21,32]. However, such centralized approaches require high storage capacity and computational power at the central node [5,36]. Moreover, each node in the network must communicate with the central node regardless of its distance to the central node to obtain SVs, which causes a high communication burden in the network. Furthermore, these approaches are prone to system failures due to the single-point-of-failure at the central node [37–39].

To this end, distributed approaches are introduced to train SVMs [37–39]. In these approaches, SVs are obtained locally from the observed training data at each node, i.e., each node processes only its observations to obtain SVs. Then, each node exchanges these SVs with its neighbors and also with a central node in order to further enhance their performance. However, since these approaches require exchanges of multiple SVs between neighboring nodes, they still cause a considerable communication load, especially in a network that has a large number of observations at each node. Moreover, they still need a central node to merge their local SVs, which might induce the same problems as centralized approaches. Due to the high communication load of the conventional distributed approaches, especially when the amount of data is considerably large, parallelized designs of SVM are introduced as an alternative approach [40–43]. In a parallel design, each node trains a partial SVM using a small subset of training data and then shares its partial SVM with a central node. Then, the central node merges these partial SVMs to obtain a global SVM for the whole network.

Although this approach mitigates the communication load problem in the network, it still requires a significant amount of communication with the central node. Additionally, since it requires the existence of a central node, it might suffer from the problems associated with centralized approaches. Furthermore, both distributed and parallelized designs rely on complex optimization problems, e.g., solving a quadratic optimization problem in the dual form, to train an SVM, which significantly increases their computational complexity [44–46]. Additionally, there exist several recent studies on more specific applications of distributed SVMs. As an example, in [47], the authors introduce a communication efficient method by using both primal and dual representations of the SVM problem. In another study, a new step size choice is introduced so that the authors achieve a faster convergence rate for distributed SVMs [48]. In [49], the authors propose a distributed method to train semi-supervised SVMs, where they relax the original problem. Additionally, in [50], a new distributed approach for imbalanced datasets is introduced. As seen in these examples, most of the recent studies either focus on gaining improvements for specific applications or aim to reduce the optimality gap in training by proposing highly complicated and technical approaches. Thus, it becomes almost impossible to implement these methods on real systems. However, since in this paper we aim to obtain a generic and efficient approach that can be straightforwardly implemented in real-life applications, we introduce a new distributed SVM based approach as detailed in the following.

In order to resolve the issues associated with the existing approaches, we introduce a novel anomaly detection approach, where we train SVMs in a fully distributed and efficient manner. Particularly, rather than finding SVs, we directly find the parameters of the separating hyperplane in the primal form of an SVM optimization problem. For training, we use a gradient based algorithm, e.g., [44], so that the dimensionality of our problem is in the order of the data size, i.e., the number of features in a data sequence, unlike the quadratic programming based approaches, where the dimensionality of the problem is in the order of the number of training samples [45,46]. Thus, we significantly reduce the computational complexity compared to the conventional methods [37–43], especially when the number of training samples is prohibitively large, e.g., big data applications [36]. Additionally, unlike these approaches, we locally train the parameters of an SVM at each node and then combine these parameters only with the parameters of the neighboring nodes to obtain a final SVM based anomaly detector. Thus, we eliminate the need for a central node and avoid the associated drawbacks. Moreover, since our SVM training approach is based on directly finding the hyperplane parameters in the primal form, we only require exchanges of hyperplane parameters, instead of multiple SVs, between neighbors, which further reduces the communication load in the network. In addition to this, thanks to our effective information exchange protocol between neighboring nodes, our approach achieves a significantly high performance while requiring considerably less computational and communication resources compared to the conventional methods. Here, we work in a supervised framework, where all the data samples are properly labeled; however, thanks to the generic structure of our distributed training approach, we also extend the original algorithm to an unsupervised framework. Through our experiments, we illustrate the performance gains achieved by our approach on several benchmark real life and synthetic datasets.

The organization of this paper is as follows. In Section 2, we describe the anomaly detection problem in a distributed network of nodes and then provide our novel SVM based approach for this problem. Through a set of experiments involving synthetic and real life data, we demonstrate the performance gains achieved by the introduced algorithm with respect to the conventional methods in Section 3. We then finalize the paper with concluding remarks in Section 4.

2. Distributed anomaly detection based on SVMs

In this paper, all vectors (and matrices) are represented by boldface lower (and upper) case letters. For a given vector $\mathbf{v}$ (or matrix $\mathbf{V}$), $\mathbf{v}^T$ (or $\mathbf{V}^T$) represents its transpose and $\|\mathbf{v}\| = \sqrt{\mathbf{v}^T \mathbf{v}}$ is its $\ell_2$-norm. Additionally, $\mathbf{0}$ and $\mathbf{1}$ represent a vector of all zeros and ones, respectively. Here, the sizes of these notations are understood from the context.

We consider a network of $K$ nodes, where each node $k$ observes a data sequence $\{\mathbf{x}_{ki}\}_{i=1}^{n_k}$, $\mathbf{x}_{ki} \in \mathbb{R}^m$. Here, we assume that the data sequence at each node $k$ belongs to a certain class with the corresponding labels $\{y_{ki}\}_{i=1}^{n_k}$, where $y_{ki} \in \{-1, 1\}$. In our framework, $y_{ki} = 1$ ($-1$) represents the nominal (anomalous) data, where the anomalous data samples are inherently the minority. This framework models a wide range of engineering applications in real life [1]. As an example, in a computer network anomaly detection task [1], we receive a set of features for a connection to a certain computer network, i.e., denoted as $\mathbf{x}_{ki}$ for the $i$th connection at the $k$th computer network. Then, our goal is to determine whether this connection is an anomaly, e.g., a web based network attack, or not based on the received features of the connection.

In this framework, we particularly aim to find a decision function $f(\cdot)$ that separates anomalies from normal data samples, which is defined as

$$f(\mathbf{x}_{ki}) = \begin{cases} 1 & \text{if } \mathbf{x}_{ki} \text{ is normal} \\ -1 & \text{otherwise.} \end{cases}$$

For this problem, one can use the SVM algorithm [32,51] due to its strong modeling capabilities in real life tasks. The SVM algorithm basically finds the maximum margin separating hyperplane between two classes in order to detect anomalous data sequences [32,51]. Assuming that all the data sequences at each node $k$ are centrally available at a certain processing unit, we can solve the following SVM optimization problem in our network of nodes to obtain the maximum margin separating hyperplane [32,51,52]

$$\min_{\mathbf{w},\, b,\, \{\xi_{ki}\}} \ \frac{1}{2}\|\mathbf{w}\|^2 + \lambda \sum_{k=1}^{K} \sum_{i=1}^{n_k} \xi_{ki} \tag{1}$$

$$\text{subject to: } y_{ki}\left(\mathbf{w}^T \mathbf{x}_{ki} + b\right) \geq 1 - \xi_{ki}, \quad \forall k, i \tag{2}$$

$$\xi_{ki} \geq 0, \quad \forall k, i, \tag{3}$$

where $\mathbf{w} \in \mathbb{R}^m$ and $b \in \mathbb{R}$ are the global parameters of the maximum margin separating hyperplane, $\lambda \in \mathbb{R}^+$ is a trade-off parameter and $\xi_{ki} \in \mathbb{R}$ is the slack variable to incur a cost for the misclassification of the $i$th instance at the node $k$. We emphasize that in (1), (2), and (3), we use the same hyperplane parameters for each data instance at different nodes since we are able to gather all the observations in our network at a central node to process them together. Then, we use the following decision function to detect the anomalies at each node $k$

$$f(\mathbf{x}_{ki}) = \operatorname{sgn}\left(\mathbf{w}^T \mathbf{x}_{ki} + b\right), \tag{4}$$

where the $\operatorname{sgn}(\cdot)$ function outputs the sign of its input argument.

One can conventionally solve (1), (2), and (3) using a quadratic programming based approach in the dual form of the problem [32,51]. However, such an approach has high computational complexity since the order of the problem depends on the number of training samples in the dual form, which can be significantly high especially for applications involving big data [36,45,46]. Thus, in order to employ an efficient gradient based training approach, we reformulate (1), (2), and (3) as an unconstrained optimization problem in its primal form as follows

$$\min_{\mathbf{w},\, b} \ \frac{1}{2}\|\mathbf{w}\|^2 + \lambda \sum_{k=1}^{K} \sum_{i=1}^{n_k} \max\left\{0,\, 1 - y_{ki}\left(\mathbf{w}^T \mathbf{x}_{ki} + b\right)\right\}. \tag{5}$$

To further simplify the notation in (5), we use the augmented parameter vector $\mathbf{v} \triangleq [b\ \mathbf{w}^T]^T$ and the augmented data sequence $\bar{\mathbf{x}}_{ki} \triangleq [1\ \mathbf{x}_{ki}^T]^T$. Then, we obtain the following compact optimization problem for the augmented parameter vector $\mathbf{v}$

$$\min_{\mathbf{v}} \ F(\mathbf{v}), \tag{6}$$

where

$$F(\mathbf{v}) \triangleq \frac{1}{2} \sum_{j=2}^{m+1} v_j^2 + \lambda \sum_{k=1}^{K} \sum_{i=1}^{n_k} \max\left\{0,\, 1 - y_{ki} \mathbf{v}^T \bar{\mathbf{x}}_{ki}\right\} \tag{7}$$

is our objective function to be minimized and $v_j$ represents the $j$th element of the vector $\mathbf{v}$. For (6), we use the following gradient based update [53]

$$\mathbf{v}_{t+1} = \mathbf{v}_t - \mu \nabla_{\mathbf{v}} F(\mathbf{v})\big|_{\mathbf{v} = \mathbf{v}_t}, \tag{8}$$

where $\mu$ is the learning rate and the subscript $t$ represents the iteration index. Note that $F(\cdot)$ is not a differentiable function due to the $\max\{0, x\}$ term; thus, we use a subgradient method as in [44], i.e., treating the derivative of $\max\{0, x\}$ at $x = 0$ as $0$. We then use the following decision function

$$f(\mathbf{x}_{ki}) = \operatorname{sgn}\left(\mathbf{v}^T \bar{\mathbf{x}}_{ki}\right). \tag{9}$$
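As a concrete illustration of the centralized update (8) and the decision rule (9), the following minimal plain-Python sketch trains the primal SVM of (5)–(7) by subgradient descent on the hinge loss, with the bias excluded from the regularizer as in (7). This is a sketch under illustrative assumptions (function names, step size, and iteration count are our choices), not the authors' implementation.

```python
def train_svm_primal(X, y, lam=1.0, mu=0.01, T=500):
    """Subgradient training of the primal SVM objective (5)-(8).

    X: list of feature vectors, y: labels in {-1, +1}.
    Works on the augmented vector v = [b, w1, ..., wm] and
    x_bar = [1, x], so the bias is the first component.
    """
    m = len(X[0])
    v = [0.0] * (m + 1)                       # v_1 = 0
    for _ in range(T):
        # regularization subgradient: [0, w] (the bias is not penalized)
        grad = [0.0] + v[1:]
        for xi, yi in zip(X, y):
            x_bar = [1.0] + list(xi)
            margin = yi * sum(vj * xj for vj, xj in zip(v, x_bar))
            if 1.0 - margin > 0.0:            # hinge loss is active
                for j in range(m + 1):
                    grad[j] -= lam * yi * x_bar[j]
        # gradient step (8)
        v = [vj - mu * gj for vj, gj in zip(v, grad)]
    return v

def predict(v, x):
    """Decision function (9): sgn(v^T x_bar)."""
    s = v[0] + sum(vj * xj for vj, xj in zip(v[1:], x))
    return 1 if s >= 0 else -1
```

On a linearly separable toy set, a few hundred iterations suffice for the hyperplane to classify all training points correctly.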

Notice that (6), (7), and (8) cause a significant communication load by gathering all the information in our network at a certain central processing unit. Moreover, such centralized approaches require a significant amount of computational resources and high storage capacity at the central processing unit. Furthermore, a failure at the central processing unit leads to the failure of the whole system in a centralized scenario. Hence, we introduce a novel distributed implementation to find the maximum margin separating hyperplane at each node $k$. Particularly, we consider a scenario where each node $k$ has a set of neighboring nodes connected to itself and can exchange information only with these neighbors, i.e., also called the neighborhood of the node $k$ and denoted as $\mathcal{N}_k$, as shown in Fig. 1. Here, each node $k$ aims to find a decision function based on the observations revealed to itself and the information obtained from its neighbors. Thus, we eliminate the problems associated with the centralized approaches by completely removing the need for a central node in the network.

In our distributed framework, we have a different set of hyperplane parameters at each node, i.e., denoted as $\{\mathbf{v}_k\}_{k=1}^{K}$. With this change, we reformulate our optimization problem as

$$\min_{\{\mathbf{v}_k\}} \ \frac{1}{2} \sum_{k=1}^{K} \sum_{j=2}^{m+1} v_{kj}^2 + \lambda K \sum_{k=1}^{K} \sum_{i=1}^{n_k} \max\left\{0,\, 1 - y_{ki} \mathbf{v}_k^T \bar{\mathbf{x}}_{ki}\right\}, \tag{10}$$

where $v_{kj}$ represents the $j$th element of the vector $\mathbf{v}_k$. Due to the term that sums the norms of all the hyperplane parameters at different nodes, we also introduce a scalar factor $K$ in (10). We then update the parameters of each node $k$ using the distributed gradient based algorithm [54,55] as follows

Fig. 1. An example distributed network of nodes, where the gray region represents the neighborhood of the node k.

$$\mathbf{v}_{k,t+1} = \boldsymbol{\phi}_{k,t} - \mu \nabla_{\boldsymbol{\phi}} F(\boldsymbol{\phi})\big|_{\boldsymbol{\phi} = \boldsymbol{\phi}_{k,t}}, \tag{11}$$

where the intermediate parameter $\boldsymbol{\phi}_{k,t}$, i.e., also known as the local estimate for the node $k$, is computed as

$$\boldsymbol{\phi}_{k,t} = \sum_{j \in \mathcal{N}_k} c_{k,j} \mathbf{v}_{j,t},$$

where the $c_{k,j}$'s represent the combination weights of the node $k$ and are normalized such that

$$\sum_{j=1}^{K} c_{k,j} = 1.$$

For a certain network topology, we can determine the combination weights using different rules such as the Metropolis, uniform, or adaptive rules depending on the application requirements [56].
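The Metropolis rule named above can be sketched as follows. The paper only names the rule and cites [56], so this is an assumption-laden illustration using the common form $c_{k,j} = 1/\max(|\mathcal{N}_k|, |\mathcal{N}_j|)$ for neighbors $j \neq k$, with the remaining mass placed on $c_{k,k}$ so that each row satisfies the normalization constraint.

```python
def metropolis_weights(neighbors):
    """Metropolis combination weights for a given topology.

    neighbors[k] is the neighborhood N_k of node k, including k itself.
    For j != k in N_k:  c_{k,j} = 1 / max(|N_k|, |N_j|);
    c_{k,k} absorbs the rest so that each row sums to one.
    """
    K = len(neighbors)
    C = [[0.0] * K for _ in range(K)]
    for k in range(K):
        for j in neighbors[k]:
            if j != k:
                C[k][j] = 1.0 / max(len(neighbors[k]), len(neighbors[j]))
        C[k][k] = 1.0 - sum(C[k])     # row-stochastic normalization
    return C
```

For the four-node square topology used later in the experiments (each $|\mathcal{N}_k| = 3$), this rule coincides with the uniform rule of (17), giving every weight the value $1/3$.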

In order to perform the update in (11), we calculate the gradient of the objective function in (10) for the node $k$ as

$$\nabla_{\mathbf{v}} F(\mathbf{v}_k) = \left[0\ \ v_{k2}\ \ldots\ v_{k(m+1)}\right]^T + \lambda K \sum_{i=1}^{n_k} \nabla_{\mathbf{v}} G_{ki}(\mathbf{v}), \tag{12}$$

where we define $G_{ki}(\mathbf{v}) \triangleq \max\{0,\, 1 - y_{ki} \mathbf{v}_k^T \bar{\mathbf{x}}_{ki}\}$, which has the following gradient

$$\nabla_{\mathbf{v}} G_{ki}(\mathbf{v}) = \begin{cases} -y_{ki} \bar{\mathbf{x}}_{ki} & \text{if } 1 - y_{ki} \mathbf{v}_k^T \bar{\mathbf{x}}_{ki} > 0 \\ \mathbf{0} & \text{otherwise} \end{cases} \tag{13}$$

using the subgradient method [44]. After the update of the parameters, we use (9) to detect the anomalies in the observed data sequences. We also present the complete algorithm as a pseudocode below.

Remark 1. In this paper, we study the anomaly detection problem in a supervised framework, where all the data labels are present. However, the data labels might be too costly to obtain or might not even be present in certain real life scenarios, i.e., also known as the unsupervised framework [1]. Since our distributed training approach is generic, it can also be employed in such a framework. In an unsupervised scenario, the SVM structure is trained as a one-class SVM, yielding the following centralized optimization problem [21]

$$\min_{\mathbf{w},\, b,\, \{\xi_{ki}\}} \ \frac{1}{2}\|\mathbf{w}\|^2 + \lambda \sum_{k=1}^{K} \sum_{i=1}^{n_k} \xi_{ki} - b \tag{14}$$

$$\text{subject to: } \mathbf{w}^T \mathbf{x}_{ki} \geq b - \xi_{ki}, \quad \forall k, i \tag{15}$$

$$\xi_{ki} \geq 0, \quad \forall k, i. \tag{16}$$

Note that we do not use the data labels in (15), unlike (2). For the unsupervised implementation, we first reformulate the centralized optimization problem in a fully distributed and unconstrained setting as in (10). We then apply the distributed training approach in (11), (12), and (13) based on the problem definition in (14), (15), and (16). This yields the unsupervised version of our algorithm.
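For completeness, here is a hypothetical sketch of how the one-class problem (14)–(16) could be trained with the same subgradient idea, after rewriting it in the unconstrained hinge form $\tfrac{1}{2}\|\mathbf{w}\|^2 - b + \lambda \sum_{k,i} \max\{0,\, b - \mathbf{w}^T \mathbf{x}_{ki}\}$. This rewriting and all hyperparameters are our illustrative assumptions, not the authors' implementation.

```python
def train_one_class_primal(X, lam=1.0, mu=0.01, T=500):
    """Hypothetical subgradient sketch for the one-class SVM (14)-(16),
    in the unconstrained form (1/2)||w||^2 - b + lam*sum max{0, b - w^T x}.
    """
    m = len(X[0])
    w = [0.0] * m
    b = 0.0
    for _ in range(T):
        gw = list(w)                      # d/dw of (1/2)||w||^2
        gb = -1.0                         # d/db of -b
        for xi in X:
            # hinge is active when the point falls below the threshold b
            if b - sum(wj * xj for wj, xj in zip(w, xi)) > 0.0:
                gw = [gwj - lam * xj for gwj, xj in zip(gw, xi)]
                gb += lam
        w = [wj - mu * gwj for wj, gwj in zip(w, gw)]
        b -= mu * gb
    return w, b
```

A point $\mathbf{x}$ would then be flagged as anomalous when $\mathbf{w}^T \mathbf{x} - b < 0$, i.e., when it falls on the origin side of the learned threshold.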

Algorithm 1 A distributed anomaly detection algorithm based on SVMs.

1: Set $\lambda$, $\mu$ and $T$
2: Initialization: $\mathbf{v}_{k,1} = \mathbf{0}$, $\forall k \in \{1, 2, \ldots, K\}$
3: for $t = 1 : T$ do
4:     for $k = 1 : K$ do
5:         Calculate $c_{k,j}$, $\forall j \in \mathcal{N}_k$
6:         Local estimate: $\boldsymbol{\phi}_{k,t} = \sum_{j \in \mathcal{N}_k} c_{k,j} \mathbf{v}_{j,t}$
7:         Compute the gradient according to (12)
8:         Update: $\mathbf{v}_{k,t+1} = \boldsymbol{\phi}_{k,t} - \mu \nabla_{\boldsymbol{\phi}} F(\boldsymbol{\phi}_{k,t})$
9:     end for
10: end for
11: SVM decision function: $f(\mathbf{x}_{ki}) = \operatorname{sgn}(\mathbf{v}_{k,T+1}^T \bar{\mathbf{x}}_{ki})$, $\forall k, i$
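The steps of Algorithm 1 can be sketched end to end in plain Python. The topology, combination weights, and hyperparameters below are placeholders; the gradient includes the $\lambda K$ factor of (12), and each node updates from its local estimate exactly as in lines 6–8.

```python
def distributed_svm(data, neighbors, C, lam=1.0, mu=0.01, T=300):
    """Sketch of Algorithm 1.

    data[k] = (Xk, yk): the local dataset of node k.
    neighbors[k] = N_k (including k itself); C[k][j] = c_{k,j}.
    Each node keeps an augmented vector v_k = [b, w] and exchanges
    only these parameters with its neighbors, as in (10)-(13).
    """
    K = len(data)
    m = len(data[0][0][0])
    v = [[0.0] * (m + 1) for _ in range(K)]          # v_{k,1} = 0
    for _ in range(T):
        # line 6: local estimates phi_k = sum_{j in N_k} c_{k,j} v_j
        phi = [[sum(C[k][j] * v[j][d] for j in neighbors[k])
                for d in range(m + 1)] for k in range(K)]
        for k in range(K):
            Xk, yk = data[k]
            grad = [0.0] + phi[k][1:]                # bias not regularized
            for xi, yi in zip(Xk, yk):
                x_bar = [1.0] + list(xi)
                margin = yi * sum(p * x for p, x in zip(phi[k], x_bar))
                if 1.0 - margin > 0.0:               # hinge active, (13)
                    for d in range(m + 1):
                        grad[d] -= lam * K * yi * x_bar[d]   # lambda*K of (12)
            # line 8: v_{k,t+1} = phi_{k,t} - mu * grad
            v[k] = [p - mu * g for p, g in zip(phi[k], grad)]
    return v
```

With the four-node topology of Section 3 and uniform weights, every node converges to a hyperplane that separates its local classes while agreeing with its neighbors.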

3. Simulations

We evaluate the performance of the introduced distributed SVM approach on several benchmark real and synthetic datasets. For performance comparisons, we also include centralized and local approaches in our simulations. For the centralized approach, we directly gather all the observed training data at a central node and process them together to train an SVM. On the other hand, for the local approach, we train an SVM at each node using only the training data revealed to that node. Throughout this section, we denote the distributed, centralized, and local SVM based approaches as "DSVM", "CSVM", and "LSVM", respectively. In addition to this, we use a square network of four nodes, i.e., $K = 4$, where each node $k$ is connected to the two nodes at the closest vertices of our square network and employs the following combination weight rule

$$c_{k,j} = \begin{cases} 1/(K-1) & \text{if } j \in \mathcal{N}_k \\ 0 & \text{otherwise.} \end{cases} \tag{17}$$

Note that in (17), $K - 1$ corresponds to the cardinality of $\mathcal{N}_k$, i.e., the number of neighbors of the node $k$. Moreover, the performance of a certain distributed algorithm may vary based on the selected combination weight rule [56]. However, in our experiments, we observe that these performance variations are negligible and do not change the ranking of the algorithms. Thus, we select a certain combination rule, i.e., the uniform rule in (17), and then use it in all our experiments.

3.1. Synthetic data

In order to generate a synthetic anomaly detection dataset, we use two multivariate Gaussian distributions with different means. Particularly, we select the mean as $[-1\ 1]^T$ for the normal distribution and $[1\ 1]^T$ for the anomalous distribution. We then obtain the normal and anomalous data instances by sampling these distributions. We use the same covariance matrix for both distributions, i.e., $[2\ 0;\ 0\ 2]$. Here, we select the parameters in our approach so that all the algorithms maximize their detection performance (see the next paragraph for the details of our performance criterion) on the training samples at a fixed number of iterations. In this setup, we select $\mu = 0.0001$, $\lambda_1 = \lambda_2 = \lambda_3 = 1$, where $\lambda_1$, $\lambda_2$, and $\lambda_3$ represent the trade-off parameters for CSVM, DSVM, and LSVM, respectively. Furthermore, we use the 2/3 portion of the samples for training and the remaining 1/3 for test.

Fig. 2. The ROC curves of the algorithms for a synthetic dataset.
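The synthetic setup described above can be reproduced with a short sketch. We assume the normal mean is $[-1, 1]$ and the anomalous mean is $[1, 1]$, with shared covariance $2I$ (so each coordinate has standard deviation $\sqrt{2}$); the anomaly ratio and seed are illustrative parameters, and the train/test split is omitted for brevity.

```python
import random

def make_synthetic(n, anomaly_ratio=0.5, seed=0):
    """Two Gaussians as in Section 3.1: assumed mean [-1, 1] for the
    normal class (label +1) and [1, 1] for anomalies (label -1),
    both with covariance 2*I, i.e., per-coordinate std sqrt(2)."""
    rng = random.Random(seed)
    std = 2.0 ** 0.5
    X, y = [], []
    for _ in range(n):
        if rng.random() < anomaly_ratio:
            X.append([rng.gauss(1.0, std), rng.gauss(1.0, std)])
            y.append(-1)                  # anomalous sample
        else:
            X.append([rng.gauss(-1.0, std), rng.gauss(1.0, std)])
            y.append(1)                   # normal sample
    return X, y
```

The samples can then be split 2/3 / 1/3 and distributed across the four nodes as in the experiments.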

As a performance metric, we measure the area under the ROC curve [57]. For a classification task, we plot the true positive rate as a function of the false positive rate in a ROC curve and denote the area under this curve as the Area Under Curve (AUC) score, i.e., a widely accepted performance metric for anomaly detection tasks [57]. In Fig. 2, we observe that DSVM achieves almost the same AUC score as CSVM while significantly outperforming LSVM thanks to its highly efficient and effective information sharing protocol. More specifically, LSVM achieves a low AUC score in the absence of communication with neighbors, so that it provides almost the same performance as a completely random predictor that achieves 0.5 as its AUC score. However, with our distributed training approach, the same node turns out to be an adequate anomaly detector with an AUC score of 0.8324. Moreover, in Fig. 3, we examine a two dimensional visualization of the SVMs trained by each algorithm. As in the previous case, CSVM and DSVM provide almost the same training performance. However, LSVM achieves a considerably different hyperplane than CSVM. Based on these experiments, we conclude that our distributed approach provides a highly superior performance using only the local information, so that it almost achieves the same performance as CSVM, which has access to all the data samples in the network.
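The AUC score used throughout this section can be computed without plotting the curve, via the rank (Mann-Whitney) statistic, which equals the area under the ROC curve: the probability that a randomly chosen positive is scored above a randomly chosen negative. A minimal sketch:

```python
def auc_score(scores, labels):
    """Area under the ROC curve via the rank (Mann-Whitney) statistic.

    scores: real-valued decision values (e.g., v^T x_bar from (9));
    labels: ground truth in {-1, +1}.  Ties count as half a win.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == -1]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect ranking yields 1.0 and a constant (random-like) scorer yields 0.5, matching the random-predictor baseline mentioned above.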

3.2. Real life data

Fig. 3. Two dimensional visualizations of the separating hyperplane for each algorithm.

In this subsection, we compare the performances of the algorithms on real life datasets. We first consider the Iris dataset [58], where we have 150 data samples for 3 specific types of an Iris plant and 4 different features for each sample, i.e., sepal length, sepal width, petal length, and petal width. In this scenario, we randomly select a flower type and then aim to detect the samples of the selected type among all the flower types. In addition to this, since we have 50 samples for each type, we randomly undersample the selected flower type in order to make it suitable for our anomaly detection framework. We then randomly distribute these samples to our network of four nodes, as we do for all the experiments in the remainder of this section. In this setup, we select $\mu = 0.0001$, $\lambda_1 = 8$, $\lambda_2 = 8$, and $\lambda_3 = 0.5$. We also choose a 10% anomaly to normal ratio. Furthermore, we use the 2/3 portion of the samples for the training phase and the remaining 1/3 for the test phase of each experiment in this section. In Table 1, we illustrate the AUC performances of the algorithms. Since CSVM has access to all the samples, it achieves the highest AUC score among all the algorithms. However, even though DSVM has access only to the samples revealed to the corresponding node and the neighboring estimates of the SVM parameters, it provides a remarkable AUC score improvement compared to LSVM and obtains a close performance to CSVM.

We also examine the performances of the algorithms on the well known Thyroid dataset [58]. In this dataset, we have 215 data samples of patients and our aim is to determine whether a patient's thyroid state is hypothyroid or not based on the provided 5 medical features of each patient. Particularly, we have 3 classes of patients, i.e., normal, hyperthyroid, and hypothyroid. In our anomaly detection framework, since the hypothyroid class is the minority, we declare this class as an anomaly and regard the other classes as normal. Here, we choose $\mu = 0.0001$ and $\lambda_1 = \lambda_2 = \lambda_3 = 2$. We also choose a 16% anomaly to normal ratio. In Table 1, CSVM again achieves the highest AUC score as expected. However, DSVM provides almost the same AUC score as CSVM. Additionally, when we compare the performances of DSVM and LSVM, we can clearly observe the performance improvement provided by our effective distributed approach.

Table 1
AUC scores of the algorithms for the real life datasets.

Algorithms   Iris     Thyroid   Occupancy   Smtp     Mammography
CSVM         0.9930   0.9982    0.8842      0.7140   0.9398
DSVM         0.9920   0.9969    0.8833      0.7140   0.9395
LSVM         0.9850   0.9870    0.7845      0.6980   0.9239

Table 2
AUC scores of the unsupervised algorithms for the real datasets.

Algorithms   Iris     Thyroid   Occupancy   Smtp     Mammography
CSVM         0.9820   0.8880    0.8220      0.5605   0.9205
DSVM         0.9820   0.8868    0.8220      0.5605   0.9205
LSVM         0.9780   0.8535    0.6670      0.5580   0.9163

Table 3
AUC scores for the PR curves.

Algorithms   Iris     Thyroid   Occupancy   Smtp     Mammography
CSVM         0.9993   0.9998    0.8071      0.9514   0.9922
DSVM         0.9992   0.9995    0.8070      0.9514   0.9921
LSVM         0.9964   0.9977    0.6863      0.9501   0.9898

As another real life scenario, we evaluate the performances of the algorithms on the Occupancy dataset [59]. Here, we have 5 features of a room, i.e., temperature, humidity, light, carbon dioxide level, and humidity ratio. Our aim is to determine whether a room is occupied or not based on these features. Since this dataset is inherently suitable for our anomaly detection framework, we do not perform further preprocessing for this dataset. In this case, we use $\mu = 0.0001$, $\lambda_1 = 8$, $\lambda_2 = 8$, and $\lambda_3 = 1$. We also use a 10% anomaly to normal ratio. As seen in Table 1, DSVM again achieves almost the same AUC score as CSVM while significantly outperforming LSVM thanks to its efficient and effective distributed structure.

We also make performance comparisons using the Smtp dataset [60], where we have 3 numerical features for a computer network connection, which are the duration of a connection, the number of bytes from a source to a destination, and the number of bytes from a destination to a source. Based on this information, our goal is to detect anomalous network connections. As in the previous case, we do not perform further preprocessing for the dataset thanks to its suitable structure for our anomaly detection framework. For this dataset, we choose $\mu = 0.00001$, $\lambda_1 = 0.5$, $\lambda_2 = 0.5$, and $\lambda_3 = 2$. We also select a 10% anomaly to normal ratio. We observe that DSVM achieves the same AUC score as CSVM, as illustrated in Table 1.

Finally, we perform an experiment on another well known anomaly detection dataset, i.e., the Mammography dataset [60]. In this dataset, we aim to detect calcifications based on the provided 6 medical features. Since this dataset is an anomaly detection dataset, we directly use it in our framework. For this case, we select $\mu = 0.00001$, $\lambda_1 = 2$, $\lambda_2 = 2$, and $\lambda_3 = 8$. We also choose a 10% anomaly to normal ratio. As in the previous experiments, DSVM achieves a very close AUC score to the AUC score of CSVM in Table 1. Moreover, both CSVM and DSVM outperform LSVM.

In order to justify the high performance of SVM compared to the state of the art approaches, we also perform an experiment for comparison. In this experiment, since neural network based anomaly detection algorithms have strong modeling capabilities as discussed above, we include the LSTM based algorithm in [61] as the state-of-the-art anomaly detection algorithm. In addition to this, we also include Support Vector Data Description (SVDD) [22], which aims to train a predefined decision function with respect to a certain objective criterion as SVM does. Since neural network based approaches require large and less skewed datasets to be adequately trained, we use the Occupancy dataset. In this experiment, SVM outperforms both SVDD and the LSTM based method, which justifies SVM as a strong baseline for anomaly detection tasks. We note that the SVM, SVDD, and LSTM based methods achieve 0.8842, 0.6715, and 0.7444 as AUC scores, respectively.

Since our approach is generic in the sense that it can be employed in an unsupervised framework as well as a supervised framework, we also perform the same experiments for all the real datasets in an unsupervised setting. In Table 2, we provide the AUC scores for each case. Here, we observe that our approach still provides a considerable performance improvement with respect to LSVM and achieves a comparable performance with respect to CSVM.

3.3. Performance evaluation analysis and different metrics

In order to evaluate the performance of an anomaly detection algorithm, we use the area under the ROC curve in the previous experiments. Even though this is a widely accepted performance metric [57], it might present an overly optimistic view of an anomaly detection algorithm's performance, especially when there is a large skew in class distributions [62,63]. For such cases, Precision-Recall (PR) curves are used as a performance metric [62,63]. This metric can accentuate performance differences between algorithms that are not distinguishable by ROC based metrics. Hence, we also evaluate the performances of the algorithms using this metric.

We first evaluate the performances on the synthetic dataset. Our distributed approach provides a highly superior performance using only the local information, so that it almost achieves the same performance as CSVM. We note that a higher AUC score for a PR curve indicates better performance.

We also compute the area under the PR curve scores for each real dataset. In Table 3, we observe that DSVM provides almost the same AUC scores as CSVM. Additionally, when we compare the performances of DSVM and LSVM, we can clearly observe the performance improvement provided by our effective distributed approach, as in the ROC metric case (see Table 1).

In an anomaly detection framework, we have two classes and one of them is inherently the minority. Thus, in our experiments, we either use datasets that are suitable for this framework or preprocess certain well-known classification datasets in order to make them suitable. Specifically, we have more than two classes for the Iris and Thyroid datasets. Thus, we first randomly select a certain class and then undersample that class to make it the minority with a certain anomaly-to-normal ratio. With this procedure, we obtain anomaly detection datasets from well-known classification datasets with balanced class distributions. Since the original datasets have balanced class distributions, we do not suffer from skewness between different classes in this case. The remaining datasets, namely Occupancy, Smtp, and Mammography, are already in a suitable form for our anomaly detection framework. However, some of these datasets have a quite low anomaly-to-normal ratio. Hence, we also undersample these datasets to achieve reasonable anomaly-to-normal ratios, e.g., around 10%, so that we significantly mitigate the possible misleading effects of skewness.
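The preprocessing step described above can be sketched as follows. This is an illustrative implementation, not the authors' exact script; the function name, the ±1 label convention, and the 10% default ratio are assumptions for the sketch:

```python
# Sketch of the described preprocessing: pick one class of a balanced
# classification dataset as the anomaly class and undersample it to a
# target anomaly-to-normal ratio (e.g., ~10%).
import numpy as np

def make_anomaly_dataset(X, y, anomaly_class, ratio=0.1, seed=0):
    """Keep all 'normal' samples; subsample `anomaly_class` to `ratio`."""
    rng = np.random.default_rng(seed)
    normal_idx = np.flatnonzero(y != anomaly_class)
    anom_idx = np.flatnonzero(y == anomaly_class)
    n_keep = max(1, int(ratio * len(normal_idx)))
    keep = rng.choice(anom_idx, size=min(n_keep, len(anom_idx)), replace=False)
    idx = np.concatenate([normal_idx, keep])
    rng.shuffle(idx)
    # Binary labels: +1 anomaly, -1 normal (the usual SVM convention).
    return X[idx], np.where(y[idx] == anomaly_class, 1, -1)

# Toy usage with a balanced 3-class dataset (50 samples per class):
X = np.arange(300, dtype=float).reshape(150, 2)
y = np.repeat([0, 1, 2], 50)
Xa, ya = make_anomaly_dataset(X, y, anomaly_class=2, ratio=0.1)
print((ya == 1).sum(), (ya == -1).sum())   # prints: 10 100
```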

We also emphasize that dataset skewness might cause overly optimistic results, particularly when one uses ROC curve based evaluation metrics, as we discussed in the previous paragraphs. Therefore, we also use another metric, i.e., the PR curve, to evaluate the performances. Since this metric is introduced to remove the misleading effects of skewness during the evaluation process, it can be used to check for skewness in our experiments. As shown in Fig. 4 and Table 3, all the PR curve based results are consistent with the ROC curve based ones (see Table 1), which validates our claims in the previous paragraph.

Fig. 4. The PR curves of the algorithms for a synthetic dataset.

3.4. Computational and communication cost analysis

An SVM update costs O(mn). Therefore, in our case, a centralized approach (CSVM) has O(mn) complexity, since it updates all the parameters at a single node. Similarly, a local SVM (LSVM) at each node k has O(mn_k) complexity. On the other hand, our distributed method (DSVM) costs O(m Σ_{i∈N_k} n_i) for each node k. Thus, we observe that the computational complexity of DSVM is very similar to that of LSVM and significantly lower than that of CSVM, especially for sparsely connected topologies with many nodes. Considering the fixed topology in our experiments, where |N_k| = 3 and n_k = n/4, we obtain that the complexities of CSVM, DSVM, and LSVM are O(mn), O(3mn/4), and O(mn/4), respectively, where m and n change for each dataset.

Since CSVM needs to transfer all the data samples to the central node, its communication cost is O(mn). We note that even though we do not consider the distance of each node to the central node in this calculation, CSVM might require communication between nodes that are far away from each other, so that its application might be impractical. On the other hand, the communication cost of DSVM is O(m Σ_{i∈N_k\{k}} n_i), and the communication distances in this case are significantly shorter (since nodes are allowed to communicate only with their neighbors). Thus, in our experiments, the communication costs of CSVM and DSVM are O(mn) and O(mn/2), respectively.
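The cost figures for the fixed experimental topology can be checked with a short calculation. The sketch below assumes, as stated above, 4 nodes with |N_k| = 3 (a node's neighborhood including itself) and n_k = n/4; the function name and the sample values of m and n are illustrative:

```python
# Minimal numeric check of the computational and communication costs
# for the fixed topology: 4 nodes, |N_k| = 3 (including node k), n_k = n/4.
def costs(m, n, num_nodes=4, neighborhood=3):
    n_k = n // num_nodes
    csvm_comp = m * n                         # all updates at one node
    dsvm_comp = m * neighborhood * n_k        # sum over i in N_k
    lsvm_comp = m * n_k                       # local data only
    csvm_comm = m * n                         # ship all samples to the center
    dsvm_comm = m * (neighborhood - 1) * n_k  # exchange with N_k \ {k}
    return csvm_comp, dsvm_comp, lsvm_comp, csvm_comm, dsvm_comm

print(costs(m=10, n=1000))   # prints: (10000, 7500, 2500, 10000, 5000)
```

The returned tuple reproduces the O(mn), O(3mn/4), O(mn/4) computational costs and the O(mn) versus O(mn/2) communication costs stated above.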

4. Conclusion

We studied the anomaly detection problem in a distributed network of nodes and introduced a novel SVM based approach, which works in a fully distributed manner. In particular, we first reformulated the SVM optimization problem for a distributed network of nodes. Based on this formulation, we then provided the updates to train the SVM architecture. Unlike the existing distributed approaches, here, we directly solve the SVM optimization problem in the primal form using an efficient gradient based algorithm [54,55] in order to obtain an estimate of the separating hyperplane parameters. We then let only the neighboring nodes share their estimates with each other to obtain a final estimate of the separating hyperplane parameters. Thus, we achieve a high performing anomaly detection algorithm over a distributed network of nodes. More importantly, we achieve this performance with low computational complexity and communication load. Even though we work in a supervised framework, we also extend our algorithm to the unsupervised framework thanks to its generic structure. In our simulations involving both synthetic and real data, the introduced distributed approach demonstrates significant performance improvements with respect to the local performances of the nodes, so that each node in our network achieves almost the same performance as a central node that has access to all the data samples in the network.
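The two-step scheme summarized above (a local primal gradient step followed by sharing estimates with neighbors only) can be sketched as follows. This is not the authors' exact update rule but a minimal illustration of the idea: each node takes a sub-gradient step on the regularized hinge loss, then averages its hyperplane estimate with those of its neighbors. The function names, learning rate, and the ring topology are assumptions for the sketch:

```python
# Hedged sketch of distributed primal SVM training: local sub-gradient
# steps on the hinge loss, then averaging estimates over each neighborhood.
import numpy as np

def local_subgradient_step(w, b, X, y, lam, lr):
    """One sub-gradient step on lam/2 * ||w||^2 + mean hinge loss."""
    margins = y * (X @ w + b)
    active = margins < 1                     # violated margin constraints
    if active.any():
        gw = lam * w - (y[active][:, None] * X[active]).mean(axis=0)
        gb = -y[active].mean()
    else:
        gw, gb = lam * w, 0.0
    return w - lr * gw, b - lr * gb

def distributed_svm(data, neighbors, lam=0.1, lr=0.1, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    K, d = len(data), data[0][0].shape[1]
    W = 0.01 * rng.normal(size=(K, d))       # one hyperplane per node
    B = np.zeros(K)
    for _ in range(iters):
        for k, (X, y) in enumerate(data):    # step 1: local gradient step
            W[k], B[k] = local_subgradient_step(W[k], B[k], X, y, lam, lr)
        # Step 2: each node averages only over its neighborhood N_k.
        W = np.array([W[neighbors[k]].mean(axis=0) for k in range(K)])
        B = np.array([B[neighbors[k]].mean() for k in range(K)])
    return W, B

# Toy usage: 4 nodes in a ring, each with its own separable local data.
rng = np.random.default_rng(1)
def node_data(n=50):
    Xp = rng.normal(2.0, 0.5, size=(n, 2))
    Xn = rng.normal(-2.0, 0.5, size=(n, 2))
    return np.vstack([Xp, Xn]), np.concatenate([np.ones(n), -np.ones(n)])
data = [node_data() for _ in range(4)]
neighbors = {k: [k, (k - 1) % 4, (k + 1) % 4] for k in range(4)}
W, B = distributed_svm(data, neighbors)
acc = np.mean([(np.sign(X @ W[k] + B[k]) == y).mean()
               for k, (X, y) in enumerate(data)])
print(f"mean training accuracy: {acc:.2f}")
```

Because every node averages only with its immediate neighbors, each step exchanges just the O(d) hyperplane parameters rather than raw samples, which is the source of the low communication load discussed above.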

Declaration of competing interest

No potential conflict of interest exists.

Appendix A. Supplementary material

Supplementary material related to this article can be found online at https://doi.org/10.1016/j.dsp.2020.102657.

References

[1] V. Chandola, A. Banerjee, V. Kumar, Anomaly detection: a survey, ACM Comput. Surv. 41 (3) (2009) 15:1–15:58, https://doi.org/10.1145/1541880.1541882.

[2] Y. Rajabzadeh, A.H. Rezaie, H. Amindavar, A dynamic modeling approach for anomaly detection using stochastic differential equations, Digit. Signal Process. 54 (2016) 1–11, https://doi.org/10.1016/j.dsp.2016.03.006.

[3] N. Chávez, C. Guillén, Radar detection in the moments space of the scattered signal parameters, Digit. Signal Process. 83 (2018) 359–366, https://doi.org/10.1016/j.dsp.2018.08.013.

[4] M. Thottan, C. Ji, Anomaly detection in IP networks, IEEE Trans. Signal Process. 51 (8) (2003) 2191–2204, https://doi.org/10.1109/TSP.2003.814797.

[5] C. O'Reilly, A. Gluhak, M.A. Imran, Distributed anomaly detection using minimum volume elliptical principal component analysis, IEEE Trans. Knowl. Data Eng. 28 (9) (2016) 2320–2333, https://doi.org/10.1109/TKDE.2016.2555804.

[6] J. Zhao, L. Itti, Classifying time series using local descriptors with hybrid sampling, IEEE Trans. Knowl. Data Eng. 28 (3) (2016) 623–637, https://doi.org/10.1109/TKDE.2015.2492558.

[7] R. Venkatesan, A. Plastino, Fisher information framework for time series modeling, Physica A: Stat. Mech. Appl. 480 (2017) 22–38, https://doi.org/10.1016/j.physa.2017.02.076.

[8] K.T. Abou-Moustafa, M. Cheriet, C.Y. Suen, Classification of time-series data using a generative/discriminative hybrid, in: Ninth International Workshop on Frontiers in Handwriting Recognition, 2004, pp. 51–56, https://doi.org/10.1109/IWFHR.2004.26.

[9] K.T. Abou-Moustafa, A generative-discriminative framework for time-series data classification, Ph.D. thesis, Concordia University, 2003.

[10] H. Debar, M. Becker, D. Siboni, A neural network component for an intrusion detection system, in: Proceedings of the 1992 IEEE Computer Society Symposium on Research in Security and Privacy, 1992, pp. 240–250, https://doi.org/10.1109/RISP.1992.213257.

[11] Y. Bengio, P. Simard, P. Frasconi, Learning long-term dependencies with gradient descent is difficult, IEEE Trans. Neural Netw. 5 (2) (1994) 157–166, https://doi.org/10.1109/72.279181.

[12] K. Greff, R.K. Srivastava, J. Koutník, B.R. Steunebrink, J. Schmidhuber, LSTM: a search space odyssey, IEEE Trans. Neural Netw. Learn. Syst. 28 (10) (2017) 2222–2232, https://doi.org/10.1109/TNNLS.2016.2582924.

[13] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (8) (1997) 1735–1780, https://doi.org/10.1162/neco.1997.9.8.1735.

[14] R. Kozma, et al., Anomaly detection by neural network models and statistical time series analysis, in: 1994 IEEE International Conference on Neural Networks, IEEE World Congress on Computational Intelligence, vol. 5, 1994, pp. 3207–3210, https://doi.org/10.1109/ICNN.1994.374748.

[15] C. Bishop, Novelty detection and neural network validation, IEE Proc., Vis. Image Signal Process. 141 (1994) 217–222.

[16] N.K. Thanigaivelan, E. Nigussie, R.K. Kanth, S. Virtanen, J. Isoaho, Distributed internal anomaly detection system for internet-of-things, in: 2016 13th IEEE Annual Consumer Communications Networking Conference (CCNC), 2016, pp. 319–320, https://doi.org/10.1109/CCNC.2016.7444797.
