
Contents lists available at ScienceDirect

Digital Signal Processing

www.elsevier.com/locate/dsp

A novel distributed anomaly detection algorithm based on support vector machines

Tolga Ergen a,∗, Suleyman S. Kozat b

a Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
b Department of Electrical and Electronics Engineering, Bilkent University, Ankara, Turkey

Article history:
Available online 8 January 2020

Keywords:
Anomaly detection
Distributed learning
Support vector machine
Gradient based training

Abstract:
In this paper, we study anomaly detection in a distributed network of nodes and introduce a novel algorithm based on Support Vector Machines (SVMs). We first reformulate the conventional SVM optimization problem for a distributed network of nodes. We then directly train the parameters of this SVM architecture in its primal form using a gradient based algorithm in a fully distributed manner, i.e., each node in our network is allowed to communicate only with its neighboring nodes in order to train the parameters. Therefore, we not only obtain a high performing anomaly detection algorithm thanks to the strong modeling capabilities of SVMs, but also achieve significantly reduced communication load and computational complexity due to our fully distributed and efficient gradient based training. Here, we provide a training algorithm in a supervised framework; however, we also provide the extensions of our implementation to an unsupervised framework. We illustrate the performance gains achieved by our algorithm via several benchmark real life and synthetic experiments.

© 2020 Elsevier Inc. All rights reserved.

1. Introduction

Anomaly detection has been extensively studied in the current literature due to its various applications such as healthcare, fraud detection, network monitoring, cybersecurity, and military surveillance [1–5]. Particularly, in this paper, we study the anomaly detection problem in a distributed framework, where we have a network of K nodes equipped with processing capabilities. Each node in our network observes a data sequence and aims to decide whether the observations are anomalous or not. Here, each node has a set of neighboring nodes and can share information with its neighbors in order to enhance the detection performance.

Various other anomaly detection algorithms have also been introduced in order to enhance the detection performance. As an example, Fisher kernel and generative models are introduced to gain performance improvements, especially for time series data [6–9]. However, the main drawback of the Fisher kernel model is that it requires the inversion of the Fisher information matrix, which has a high computational complexity [6,7]. On the other hand, in order to obtain an adequate performance from a generative model such as a Hidden Markov Model (HMM), one should carefully select its

This paper is in part supported by Tubitak Project No: 117E153.

* Corresponding author.

E-mail addresses: ergen@stanford.edu (T. Ergen), kozat@ee.bilkent.edu.tr (S.S. Kozat).

structural parameters, e.g., the number of states and topology of the model [8,9]. Furthermore, the type of training algorithm also has considerable effects on the performance of generative models, which limits their usage in real life applications [9]. Thus, neural network based approaches, especially those based on Recurrent Neural Networks (RNNs), are introduced thanks to their inherent memory structure that can store "time" or "state" information and strong modeling capabilities [1,10]. Since the basic RNN architecture does not have control structures (gates) to regulate the amount of information to be stored [11,12], a more advanced RNN architecture with several control structures, i.e., the Long Short Term Memory (LSTM) network, is usually employed among RNNs [12,13]. However, such neural network based approaches do not have a proper objective criterion for anomaly detection tasks, especially in the absence of data labels [1,14]. Hence, they first predict a sequence from its past samples and then determine whether the sequence is an anomaly or not based on the prediction error, i.e., an anomaly is an event which cannot be predicted from the past nominal data [1]. Thus, they require a probabilistic model for the prediction error and a threshold on the probabilistic model to detect anomalies, which results in challenging optimization problems and restricts their performance accordingly [1,14,15]. Furthermore, one needs a considerable amount of computational power and time to adequately train such networks due to their highly complex and nonlinear structure, which also makes them susceptible to overfitting problems. In order to handle large scale anomaly detection problems,

https://doi.org/10.1016/j.dsp.2020.102657
1051-2004/© 2020 Elsevier Inc. All rights reserved.


distributed anomaly detection techniques have also been extensively studied, especially in the IoT and Deep Tech literatures [16–19]. However, the main drawback of these approaches, especially the deep ones, is that they suffer from high computational complexity and require careful tuning of several parameters to achieve good generalization performance [17,20].

In the current literature, a common and widely used approach for anomaly detection is to find a decision function that defines the model of normality [1]. In this approach, one first defines a certain decision function and then optimizes the parameters of this function with respect to a predefined objective criterion, e.g., the Support Vector Machines (SVM) algorithm [21,22]. One remarkable property of the SVM is that its ability to learn can be independent of the feature space dimensionality, so that it generalizes well even in the presence of many features [23]. Due to its generalization capabilities, there exist several anomaly detection methods based on SVM, e.g., [23–27] and the other references therein. In these studies, SVM and its variants are employed for anomaly detection tasks. In addition to these, recently, applications of SVMs along with other learning techniques have also attracted significant attention [8,9,20,28–31]. As an example, in [8,9], the authors employ a Hidden Markov Model (HMM) to extract sequential information from the data and then apply the SVM algorithm for classification. Moreover, in [20,28,29], the authors use the SVM architecture along with different neural network architectures. Thus, they achieve effective optimization of the network parameters with respect to the well defined objective function of SVMs and also enjoy the good generalization capabilities of SVMs at the output layer.

Support Vector Machines (SVMs) are generally employed for anomaly detection thanks to their aforementioned high performance in several real life scenarios [20,21,32–35]. In the conventional applications, SVMs are used in centralized frameworks, where all observations are available and processed together at a certain node (or a processing unit), i.e., known as the central node. Here, the aim is to find the maximum margin hyperplane that separates normal data sequences from anomalies based on the centrally available observations. A common and effective approach to train an SVM is to solve it as a quadratic optimization problem in its dual form, where the dimensionality of the problem is in the order of the cardinality of the training data. After solving this optimization problem, the resulting separating hyperplane only depends on a subset of the training data, i.e., known as Support Vectors (SVs). Thus, for training an SVM, all the observed data is gathered at a central node to find the desired SVs in a centralized approach [1,21,32]. However, such centralized approaches require high storage capacity and computational power at the central node [5,36]. Moreover, each node in the network must communicate with the central node regardless of its distance to the central node to obtain SVs, which causes a high communication burden in the network. Furthermore, these approaches are prone to system failures due to the single-point-of-failure at the central node [37–39].

To this end, distributed approaches are introduced to train SVMs [37–39]. In these approaches, SVs are obtained locally from the observed training data at each node, i.e., each node processes only its observations to obtain SVs. Then, each node exchanges these SVs with its neighbors and also with a central node in order to further enhance their performance. However, since these approaches require exchanges of multiple SVs between neighboring nodes, they still cause a considerable communication load, especially in a network that has a large number of observations at each node. Moreover, they still need a central node to merge their local SVs, which might induce the same problems as centralized approaches. Due to the high communication load of the conventional distributed approaches, especially when the amount of data is considerably large, parallelized designs of SVM are introduced as an alternative approach [40–43]. In a parallel design, each node trains a partial SVM using a small subset of training data and then shares its partial SVM with a central node. Then, the central node merges these partial SVMs to obtain a global SVM for the whole network.

Although this approach mitigates the communication load problem in the network, it still requires a significant amount of communication with the central node. Additionally, since it requires the existence of a central node, it might suffer from the problems associated with centralized approaches. Furthermore, both distributed and parallelized designs rely on complex optimization problems, e.g., solving a quadratic optimization problem in the dual form, to train an SVM, which significantly increases their computational complexity [44–46]. Additionally, there exist several recent studies on more specific applications of distributed SVMs. As an example, in [47], the authors introduce a communication efficient method by using both primal and dual representations of the SVM problem. In another study, a new step size choice is introduced so that the authors achieve a faster convergence rate for distributed SVMs [48]. In [49], the authors propose a distributed method to train semi-supervised SVMs, where they relax the original problem. Additionally, in [50], a new distributed approach for imbalanced datasets is introduced. As seen in these examples, most of the recent studies either focus on gaining improvements for specific applications or aim to reduce the optimality gap in training by proposing highly complicated and technical approaches. Thus, it becomes almost impossible to implement these methods on real systems. However, since in this paper we aim to obtain a generic and efficient approach that can be straightforwardly implemented in real-life applications, we introduce a new distributed SVM based approach as detailed in the following.

In order to resolve the issues associated with the existing approaches, we introduce a novel anomaly detection approach, where we train SVMs in a fully distributed and efficient manner. Particularly, rather than finding SVs, we directly find the parameters of the separating hyperplane in the primal form of an SVM optimization problem. For training, we use a gradient based algorithm, e.g., [44], so that the dimensionality of our problem is in the order of the data size, i.e., the number of features in a data sequence, unlike the quadratic programming based approaches, where the dimensionality of the problem is in the order of the number of training samples [45,46]. Thus, we significantly reduce the computational complexity compared to the conventional methods [37–43], especially when the number of training samples is prohibitively large, e.g., big data applications [36]. Additionally, unlike these approaches, we locally train the parameters of an SVM at each node and then combine these parameters only with the parameters of the neighboring nodes to obtain a final SVM based anomaly detector. Thus, we eliminate the need for a central node and avoid the associated drawbacks. Moreover, since our SVM training approach is based on directly finding the hyperplane parameters in the primal form, we only require exchanges of hyperplane parameters, instead of multiple SVs, between neighbors, which further reduces the communication load in the network. In addition to this, thanks to our effective information exchange protocol between neighboring nodes, our approach achieves a significantly high performance while requiring considerably less computational and communication resources compared to the conventional methods. Here, we work in a supervised framework, where all the data samples are properly labeled; however, thanks to the generic structure of our distributed training approach, we also extend the original algorithm to an unsupervised framework. Through our experiments, we illustrate the performance gains achieved by our approach on several benchmark real life and synthetic datasets.

The organization of this paper is as follows. In Section 2, we describe the anomaly detection problem in a distributed network of nodes and then provide our novel SVM based approach for this problem. Through a set of experiments involving synthetic and real life data, we demonstrate the performance gains achieved by the introduced algorithm with respect to the conventional methods in Section 3. We then finalize the paper with concluding remarks in Section 4.

2. Distributed anomaly detection based on SVMs

In this paper, all vectors (and matrices) are represented by boldface lower (and upper) case letters. For a given vector $\mathbf{v}$ (or matrix $\mathbf{V}$), $\mathbf{v}^T$ (or $\mathbf{V}^T$) represents its transpose and $\|\mathbf{v}\| = \sqrt{\mathbf{v}^T \mathbf{v}}$ is its $\ell_2$-norm. Additionally, $\mathbf{0}$ and $\mathbf{1}$ represent a vector of all zeros and ones, respectively. Here, the sizes of these notations are understood from the context.

We consider a network of $K$ nodes, where each node $k$ observes a data sequence $\{\mathbf{x}_{ki}\}_{i=1}^{n_k}$, $\mathbf{x}_{ki} \in \mathbb{R}^m$. Here, we assume that the data sequence at each node $k$ belongs to a certain class with the corresponding labels $\{y_{ki}\}_{i=1}^{n_k}$, where $y_{ki} \in \{-1, 1\}$. In our framework, $y_{ki} = 1$ ($-1$) represents the nominal (anomalous) data, where the anomalous data samples are inherently the minority. This framework models a wide range of engineering applications in real life [1]. As an example, in a computer network anomaly detection task [1], we receive a set of features for a connection to a certain computer network, i.e., denoted as $\mathbf{x}_{ki}$ for the $i$th connection at the $k$th computer network. Then, our goal is to determine whether this connection is an anomaly, e.g., a web based network attack, or not based on the received features of the connection.

In this framework, we particularly aim to find a decision function $f(\cdot)$ that separates anomalies from normal data samples, which is defined as

$$f(\mathbf{x}_{ki}) = \begin{cases} 1 & \text{if } \mathbf{x}_{ki} \text{ is normal} \\ -1 & \text{otherwise.} \end{cases}$$

For this problem, one can use the SVM algorithm [32,51] due to its strong modeling capabilities in real life tasks. The SVM algorithm basically finds the maximum margin separating hyperplane between two classes in order to detect anomalous data sequences [32,51]. Assuming that all the data sequences at each node $k$ are centrally available at a certain processing unit, we can solve the following SVM optimization problem in our network of nodes to obtain the maximum margin separating hyperplane [32,51,52]

$$\min_{\mathbf{w},\, b,\, \{\xi_{ki}\}} \ \frac{1}{2}\|\mathbf{w}\|^2 + \lambda \sum_{k=1}^{K} \sum_{i=1}^{n_k} \xi_{ki} \tag{1}$$

$$\text{subject to: } y_{ki}\left(\mathbf{w}^T \mathbf{x}_{ki} + b\right) \geq 1 - \xi_{ki}, \quad \forall k, i \tag{2}$$

$$\xi_{ki} \geq 0, \quad \forall k, i, \tag{3}$$

where $\mathbf{w} \in \mathbb{R}^m$ and $b \in \mathbb{R}$ are the global parameters of the maximum margin separating hyperplane, $\lambda \in \mathbb{R}^+$ is a trade-off parameter and $\xi_{ki} \in \mathbb{R}$ is the slack variable to incur a cost for the misclassification of the $i$th instance at the node $k$. We emphasize that in (1), (2), and (3), we use the same hyperplane parameters for each data instance at different nodes since we are able to gather all the observations in our network at a central node to process them together. Then, we use the following decision function to detect the anomalies at each node $k$

$$f(\mathbf{x}_{ki}) = \operatorname{sgn}\left(\mathbf{w}^T \mathbf{x}_{ki} + b\right), \tag{4}$$

where the $\operatorname{sgn}(\cdot)$ function outputs the sign of its input argument.

One can conventionally solve (1), (2), and (3) using a quadratic programming based approach in the dual form of the problem [32,51]. However, such an approach has high computational complexity since the order of the problem depends on the number of training samples in the dual form, which can be significantly high especially for applications involving big data [36,45,46]. Thus, in order to employ an efficient gradient based training approach, we reformulate (1), (2), and (3) as an unconstrained optimization problem in its primal form as follows

$$\min_{\mathbf{w},\, b} \ \frac{1}{2}\|\mathbf{w}\|^2 + \lambda \sum_{k=1}^{K} \sum_{i=1}^{n_k} \max\left\{0,\, 1 - y_{ki}\left(\mathbf{w}^T \mathbf{x}_{ki} + b\right)\right\}. \tag{5}$$

To further simplify the notation in (5), we use the augmented parameter vector $\mathbf{v} \triangleq [b\ \mathbf{w}^T]^T$ and the augmented data sequence $\bar{\mathbf{x}}_{ki} \triangleq [1\ \mathbf{x}_{ki}^T]^T$. Then, we obtain the following compact optimization problem for the augmented parameter vector $\mathbf{v}$

$$\min_{\mathbf{v}} \ F(\mathbf{v}), \tag{6}$$

where

$$F(\mathbf{v}) \triangleq \frac{1}{2} \sum_{j=2}^{m+1} v_j^2 + \lambda \sum_{k=1}^{K} \sum_{i=1}^{n_k} \max\left\{0,\, 1 - y_{ki} \mathbf{v}^T \bar{\mathbf{x}}_{ki}\right\} \tag{7}$$

is our objective function to be minimized and $v_j$ represents the $j$th element of the vector $\mathbf{v}$. For (6), we use the following gradient based update [53]

$$\mathbf{v}_{t+1} = \mathbf{v}_t - \mu \nabla_{\mathbf{v}} F(\mathbf{v})\big|_{\mathbf{v} = \mathbf{v}_t}, \tag{8}$$

where $\mu$ is the learning rate and the subscript $t$ represents the iteration index. Note that $F(\cdot)$ is not a differentiable function due to the $\max\{0, x\}$ term; thus, we use a subgradient method as in [44], i.e., treating the derivative of $\max\{0, x\}$ at $x = 0$ as $0$. We then use the following decision function

$$f(\mathbf{x}_{ki}) = \operatorname{sgn}\left(\mathbf{v}^T \bar{\mathbf{x}}_{ki}\right). \tag{9}$$
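As a concrete illustration of the centralized update (8) and the decision rule (9), the following minimal plain-Python sketch trains the primal SVM of (5)–(7) by subgradient descent on the hinge loss, with the bias excluded from the regularizer as in (7). This is a sketch under illustrative assumptions (function names, step size, and iteration count are our choices), not the authors' implementation.

```python
def train_svm_primal(X, y, lam=1.0, mu=0.01, T=500):
    """Subgradient training of the primal SVM objective (5)-(8).

    X: list of feature vectors, y: labels in {-1, +1}.
    Works on the augmented vector v = [b, w1, ..., wm] and
    x_bar = [1, x], so the bias is the first component.
    """
    m = len(X[0])
    v = [0.0] * (m + 1)                       # v_1 = 0
    for _ in range(T):
        # regularization subgradient: [0, w] (the bias is not penalized)
        grad = [0.0] + v[1:]
        for xi, yi in zip(X, y):
            x_bar = [1.0] + list(xi)
            margin = yi * sum(vj * xj for vj, xj in zip(v, x_bar))
            if 1.0 - margin > 0.0:            # hinge loss is active
                for j in range(m + 1):
                    grad[j] -= lam * yi * x_bar[j]
        # gradient step (8)
        v = [vj - mu * gj for vj, gj in zip(v, grad)]
    return v

def predict(v, x):
    """Decision function (9): sgn(v^T x_bar)."""
    s = v[0] + sum(vj * xj for vj, xj in zip(v[1:], x))
    return 1 if s >= 0 else -1
```

On a linearly separable toy set, a few hundred iterations suffice for the hyperplane to classify all training points correctly.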

Notice that (6), (7), and (8) cause a significant communication load by gathering all the information in our network at a certain central processing unit. Moreover, such centralized approaches require a significant amount of computational resources and high storage capacity at the central processing unit. Furthermore, a failure at the central processing unit leads to the failure of the whole system in a centralized scenario. Hence, we introduce a novel distributed implementation to find the maximum margin separating hyperplane at each node $k$. Particularly, we consider a scenario where each node $k$ has a set of neighboring nodes connected to itself and can exchange information only with these neighbors, i.e., also called the neighborhood of the node $k$ and denoted as $\mathcal{N}_k$, as shown in Fig. 1. Here, each node $k$ aims to find a decision function based on the observations revealed to itself and the information obtained from its neighbors. Thus, we eliminate the problems associated with the centralized approaches by completely removing the need for a central node in the network.

In our distributed framework, we have a different set of hyperplane parameters at each node, i.e., denoted as $\{\mathbf{v}_k\}_{k=1}^{K}$. With this change, we reformulate our optimization problem as

$$\min_{\{\mathbf{v}_k\}} \ \frac{1}{2} \sum_{k=1}^{K} \sum_{j=2}^{m+1} v_{kj}^2 + \lambda K \sum_{k=1}^{K} \sum_{i=1}^{n_k} \max\left\{0,\, 1 - y_{ki} \mathbf{v}_k^T \bar{\mathbf{x}}_{ki}\right\}, \tag{10}$$

where $v_{kj}$ represents the $j$th element of the vector $\mathbf{v}_k$. Due to the term that sums the norms of all the hyperplane parameters at different nodes, we also introduce a scalar factor $K$ in (10). We then update the parameters of each node $k$ using the distributed gradient based algorithm [54,55] as follows

Fig. 1. An example distributed network of nodes, where the gray region represents the neighborhood of the node k.

$$\mathbf{v}_{k,t+1} = \boldsymbol{\phi}_{k,t} - \mu \nabla_{\boldsymbol{\phi}} F(\boldsymbol{\phi})\big|_{\boldsymbol{\phi} = \boldsymbol{\phi}_{k,t}}, \tag{11}$$

where the intermediate parameter $\boldsymbol{\phi}_{k,t}$, i.e., also known as the local estimate for the node $k$, is computed as

$$\boldsymbol{\phi}_{k,t} = \sum_{j \in \mathcal{N}_k} c_{k,j} \mathbf{v}_{j,t},$$

where the $c_{k,j}$'s represent the combination weights of the node $k$ and are normalized such that

$$\sum_{j=1}^{K} c_{k,j} = 1.$$

For a certain network topology, we can determine the combination weights using different rules such as the Metropolis, uniform, or adaptive rules depending on the application requirements [56].
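The Metropolis rule named above can be sketched as follows. The paper only names the rule and cites [56], so this is an assumption-laden illustration using the common form $c_{k,j} = 1/\max(|\mathcal{N}_k|, |\mathcal{N}_j|)$ for neighbors $j \neq k$, with the remaining mass placed on $c_{k,k}$ so that each row satisfies the normalization constraint.

```python
def metropolis_weights(neighbors):
    """Metropolis combination weights for a given topology.

    neighbors[k] is the neighborhood N_k of node k, including k itself.
    For j != k in N_k:  c_{k,j} = 1 / max(|N_k|, |N_j|);
    c_{k,k} absorbs the rest so that each row sums to one.
    """
    K = len(neighbors)
    C = [[0.0] * K for _ in range(K)]
    for k in range(K):
        for j in neighbors[k]:
            if j != k:
                C[k][j] = 1.0 / max(len(neighbors[k]), len(neighbors[j]))
        C[k][k] = 1.0 - sum(C[k])     # row-stochastic normalization
    return C
```

For the four-node square topology used later in the experiments (each $|\mathcal{N}_k| = 3$), this rule coincides with the uniform rule of (17), giving every weight the value $1/3$.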

In order to perform the update in (11), we calculate the gradient of the objective function in (10) for the node $k$ as

$$\nabla_{\mathbf{v}} F(\mathbf{v}_k) = \left[0\ \ v_{k2}\ \ldots\ v_{k(m+1)}\right]^T + \lambda K \sum_{i=1}^{n_k} \nabla_{\mathbf{v}} G_{ki}(\mathbf{v}), \tag{12}$$

where we define $G_{ki}(\mathbf{v}) \triangleq \max\{0,\, 1 - y_{ki} \mathbf{v}_k^T \bar{\mathbf{x}}_{ki}\}$, which has the following gradient

$$\nabla_{\mathbf{v}} G_{ki}(\mathbf{v}) = \begin{cases} -y_{ki} \bar{\mathbf{x}}_{ki} & \text{if } 1 - y_{ki} \mathbf{v}_k^T \bar{\mathbf{x}}_{ki} > 0 \\ \mathbf{0} & \text{otherwise} \end{cases} \tag{13}$$

using the subgradient method [44]. After the update of the parameters, we use (9) to detect the anomalies in the observed data sequences. We also present the complete algorithm as a pseudocode below.

Remark 1. In this paper, we study the anomaly detection problem in a supervised framework, where all the data labels are present. However, the data labels might be too costly to obtain or might not even be present in certain real life scenarios, i.e., also known as the unsupervised framework [1]. Since our distributed training approach is generic, it can also be employed in such a framework. In an unsupervised scenario, the SVM structure is trained as a one-class SVM, yielding the following centralized optimization problem [21]

$$\min_{\mathbf{w},\, b,\, \{\xi_{ki}\}} \ \frac{1}{2}\|\mathbf{w}\|^2 + \lambda \sum_{k=1}^{K} \sum_{i=1}^{n_k} \xi_{ki} - b \tag{14}$$

$$\text{subject to: } \mathbf{w}^T \mathbf{x}_{ki} \geq b - \xi_{ki}, \quad \forall k, i \tag{15}$$

$$\xi_{ki} \geq 0, \quad \forall k, i. \tag{16}$$

Note that we do not use the data labels in (15), unlike (2). For the unsupervised implementation, we first reformulate the centralized optimization problem in a fully distributed and unconstrained setting as in (10). We then apply the distributed training approach in (11), (12), and (13) based on the problem definition in (14), (15), and (16). This yields the unsupervised version of our algorithm.
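For completeness, here is a hypothetical sketch of how the one-class problem (14)–(16) could be trained with the same subgradient idea, after rewriting it in the unconstrained hinge form $\tfrac{1}{2}\|\mathbf{w}\|^2 - b + \lambda \sum_{k,i} \max\{0,\, b - \mathbf{w}^T \mathbf{x}_{ki}\}$. This rewriting and all hyperparameters are our illustrative assumptions, not the authors' implementation.

```python
def train_one_class_primal(X, lam=1.0, mu=0.01, T=500):
    """Hypothetical subgradient sketch for the one-class SVM (14)-(16),
    in the unconstrained form (1/2)||w||^2 - b + lam*sum max{0, b - w^T x}.
    """
    m = len(X[0])
    w = [0.0] * m
    b = 0.0
    for _ in range(T):
        gw = list(w)                      # d/dw of (1/2)||w||^2
        gb = -1.0                         # d/db of -b
        for xi in X:
            # hinge is active when the point falls below the threshold b
            if b - sum(wj * xj for wj, xj in zip(w, xi)) > 0.0:
                gw = [gwj - lam * xj for gwj, xj in zip(gw, xi)]
                gb += lam
        w = [wj - mu * gwj for wj, gwj in zip(w, gw)]
        b -= mu * gb
    return w, b
```

A point $\mathbf{x}$ would then be flagged as anomalous when $\mathbf{w}^T \mathbf{x} - b < 0$, i.e., when it falls on the origin side of the learned threshold.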

Algorithm 1 A distributed anomaly detection algorithm based on SVMs.

1: Set $\lambda$, $\mu$ and $T$
2: Initialization: $\mathbf{v}_{k,1} = \mathbf{0}$, $\forall k \in \{1, 2, \ldots, K\}$
3: for $t = 1 : T$ do
4:     for $k = 1 : K$ do
5:         Calculate $c_{k,j}$, $\forall j \in \mathcal{N}_k$
6:         Local estimate: $\boldsymbol{\phi}_{k,t} = \sum_{j \in \mathcal{N}_k} c_{k,j} \mathbf{v}_{j,t}$
7:         Compute the gradient according to (12)
8:         Update: $\mathbf{v}_{k,t+1} = \boldsymbol{\phi}_{k,t} - \mu \nabla_{\boldsymbol{\phi}} F(\boldsymbol{\phi}_{k,t})$
9:     end for
10: end for
11: SVM decision function: $f(\mathbf{x}_{ki}) = \operatorname{sgn}(\mathbf{v}_{k,T+1}^T \bar{\mathbf{x}}_{ki})$, $\forall k, i$
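The steps of Algorithm 1 can be sketched end to end in plain Python. The topology, combination weights, and hyperparameters below are placeholders; the gradient includes the $\lambda K$ factor of (12), and each node updates from its local estimate exactly as in lines 6–8.

```python
def distributed_svm(data, neighbors, C, lam=1.0, mu=0.01, T=300):
    """Sketch of Algorithm 1.

    data[k] = (Xk, yk): the local dataset of node k.
    neighbors[k] = N_k (including k itself); C[k][j] = c_{k,j}.
    Each node keeps an augmented vector v_k = [b, w] and exchanges
    only these parameters with its neighbors, as in (10)-(13).
    """
    K = len(data)
    m = len(data[0][0][0])
    v = [[0.0] * (m + 1) for _ in range(K)]          # v_{k,1} = 0
    for _ in range(T):
        # line 6: local estimates phi_k = sum_{j in N_k} c_{k,j} v_j
        phi = [[sum(C[k][j] * v[j][d] for j in neighbors[k])
                for d in range(m + 1)] for k in range(K)]
        for k in range(K):
            Xk, yk = data[k]
            grad = [0.0] + phi[k][1:]                # bias not regularized
            for xi, yi in zip(Xk, yk):
                x_bar = [1.0] + list(xi)
                margin = yi * sum(p * x for p, x in zip(phi[k], x_bar))
                if 1.0 - margin > 0.0:               # hinge active, (13)
                    for d in range(m + 1):
                        grad[d] -= lam * K * yi * x_bar[d]   # lambda*K of (12)
            # line 8: v_{k,t+1} = phi_{k,t} - mu * grad
            v[k] = [p - mu * g for p, g in zip(phi[k], grad)]
    return v
```

With the four-node topology of Section 3 and uniform weights, every node converges to a hyperplane that separates its local classes while agreeing with its neighbors.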

3. Simulations

We evaluate the performance of the introduced distributed SVM approach on several benchmark real and synthetic datasets. For performance comparisons, we also include centralized and local approaches in our simulations. For the centralized approach, we directly gather all the observed training data at a central node and process them together to train an SVM. On the other hand, for the local approach, we train an SVM at each node using only the training data revealed to that node. Throughout this section, we denote the distributed, centralized, and local SVM based approaches as "DSVM", "CSVM", and "LSVM", respectively. In addition to this, we use a square network of four nodes, i.e., $K = 4$, where each node $k$ is connected to the two nodes at the closest vertices of our square network and employs the following combination weight rule

$$c_{k,j} = \begin{cases} 1/(K-1) & \text{if } j \in \mathcal{N}_k \\ 0 & \text{otherwise.} \end{cases} \tag{17}$$

Note that in (17), $K - 1$ corresponds to the cardinality of $\mathcal{N}_k$, i.e., the number of neighbors of the node $k$. Moreover, the performance of a certain distributed algorithm may vary based on the selected combination weight rule [56]. However, in our experiments, we observe that these performance variations are negligible and do not change the ranking of the algorithms. Thus, we select a certain combination rule, i.e., the uniform rule in (17), and then use it in all our experiments.

3.1. Synthetic data

In order to generate a synthetic anomaly detection dataset, we use two multivariate Gaussian distributions with different means. Particularly, we select the mean as $[-1\ 1]^T$ for the normal distribution and $[1\ 1]^T$ for the anomalous distribution. We then obtain the normal and anomalous data instances by sampling these distributions. We use the same covariance matrix for both distributions, i.e., $[2\ 0;\ 0\ 2]$. Here, we select the parameters in our approach so that all the algorithms maximize their detection performance (see the next paragraph for the details of our performance criterion) on the training samples at a fixed number of iterations. In this setup, we select $\mu = 0.0001$, $\lambda_1 = \lambda_2 = \lambda_3 = 1$, where $\lambda_1$, $\lambda_2$, and $\lambda_3$ represent the trade-off parameters for CSVM, DSVM, and LSVM, respectively. Furthermore, we use the 2/3 portion of the samples for training and the remaining 1/3 for test.

Fig. 2. The ROC curves of the algorithms for a synthetic dataset.
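The synthetic setup described above can be reproduced with a short sketch. We assume the normal mean is $[-1, 1]$ and the anomalous mean is $[1, 1]$, with shared covariance $2I$ (so each coordinate has standard deviation $\sqrt{2}$); the anomaly ratio and seed are illustrative parameters, and the train/test split is omitted for brevity.

```python
import random

def make_synthetic(n, anomaly_ratio=0.5, seed=0):
    """Two Gaussians as in Section 3.1: assumed mean [-1, 1] for the
    normal class (label +1) and [1, 1] for anomalies (label -1),
    both with covariance 2*I, i.e., per-coordinate std sqrt(2)."""
    rng = random.Random(seed)
    std = 2.0 ** 0.5
    X, y = [], []
    for _ in range(n):
        if rng.random() < anomaly_ratio:
            X.append([rng.gauss(1.0, std), rng.gauss(1.0, std)])
            y.append(-1)                  # anomalous sample
        else:
            X.append([rng.gauss(-1.0, std), rng.gauss(1.0, std)])
            y.append(1)                   # normal sample
    return X, y
```

The samples can then be split 2/3 / 1/3 and distributed across the four nodes as in the experiments.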

As a performance metric, we measure the area under the ROC curve [57]. For a classification task, we plot the true positive rate as a function of the false positive rate in a ROC curve and denote the area under this curve as the Area Under Curve (AUC) score, i.e., a widely accepted performance metric for anomaly detection tasks [57]. In Fig. 2, we observe that DSVM achieves almost the same AUC score as CSVM while significantly outperforming LSVM thanks to its highly efficient and effective information sharing protocol. More specifically, LSVM achieves a low AUC score in the absence of communication with neighbors, so that it provides almost the same performance as a completely random predictor that achieves 0.5 as its AUC score. However, with our distributed training approach, the same node turns out to be an adequate anomaly detector with an AUC score of 0.8324. Moreover, in Fig. 3, we examine a two dimensional visualization of the SVMs trained by each algorithm. As in the previous case, CSVM and DSVM provide almost the same training performance. However, LSVM achieves a considerably different hyperplane than CSVM. Based on these experiments, we conclude that our distributed approach provides a highly superior performance using only the local information, so that it almost achieves the same performance as CSVM, which has access to all the data samples in the network.
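The AUC score used throughout this section can be computed without plotting the curve, via the rank (Mann-Whitney) statistic, which equals the area under the ROC curve: the probability that a randomly chosen positive is scored above a randomly chosen negative. A minimal sketch:

```python
def auc_score(scores, labels):
    """Area under the ROC curve via the rank (Mann-Whitney) statistic.

    scores: real-valued decision values (e.g., v^T x_bar from (9));
    labels: ground truth in {-1, +1}.  Ties count as half a win.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == -1]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect ranking yields 1.0 and a constant (random-like) scorer yields 0.5, matching the random-predictor baseline mentioned above.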

3.2. Real life data

Fig. 3. Two dimensional visualizations of the separating hyperplane for each algorithm.

In this subsection, we compare the performances of the algorithms on real life datasets. We first consider the Iris dataset [58], where we have 150 data samples for 3 specific types of an Iris plant and 4 different features for each sample, i.e., sepal length, sepal width, petal length, and petal width. In this scenario, we randomly select a flower type and then aim to detect the samples of the selected type among all the flower types. In addition to this, since we have 50 samples for each type, we randomly undersample the selected flower type in order to make it suitable for our anomaly detection framework. We then randomly distribute these samples to our network of four nodes, as we do for all the experiments in the remainder of this section. In this setup, we select $\mu = 0.0001$, $\lambda_1 = 8$, $\lambda_2 = 8$, and $\lambda_3 = 0.5$. We also choose a 10% anomaly to normal ratio. Furthermore, we use the 2/3 portion of the samples for the training phase and the remaining 1/3 for the test phase of each experiment in this section. In Table 1, we illustrate the AUC performances of the algorithms. Since CSVM has access to all the samples, it achieves the highest AUC score among all the algorithms. However, even though DSVM has access only to the samples revealed to the corresponding node and the neighboring estimates of the SVM parameters, it provides a remarkable AUC score improvement compared to LSVM and obtains a close performance to CSVM.

We also examine the performances of the algorithms on the well known Thyroid dataset [58]. In this dataset, we have 215 data samples of patients and our aim is to determine whether a patient's thyroid state is hypothyroid or not based on the provided 5 medical features of each patient. Particularly, we have 3 classes of patients, i.e., normal, hyperthyroid, and hypothyroid. In our anomaly detection framework, since the hypothyroid class is the minority, we declare this class as an anomaly and regard the other classes as normal. Here, we choose $\mu = 0.0001$ and $\lambda_1 = \lambda_2 = \lambda_3 = 2$. We also choose a 16% anomaly to normal ratio. In Table 1, CSVM again achieves the highest AUC score as expected. However, DSVM provides almost the same AUC score as CSVM. Additionally, when we compare the performances of DSVM and LSVM, we can clearly observe the performance improvement provided by our effective distributed approach.

Table 1
AUC scores of the algorithms for the real life datasets.

Algorithms   Iris     Thyroid   Occupancy   Smtp     Mammography
CSVM         0.9930   0.9982    0.8842      0.7140   0.9398
DSVM         0.9920   0.9969    0.8833      0.7140   0.9395
LSVM         0.9850   0.9870    0.7845      0.6980   0.9239

Table 2
AUC scores of the unsupervised algorithms for the real datasets.

Algorithms   Iris     Thyroid   Occupancy   Smtp     Mammography
CSVM         0.9820   0.8880    0.8220      0.5605   0.9205
DSVM         0.9820   0.8868    0.8220      0.5605   0.9205
LSVM         0.9780   0.8535    0.6670      0.5580   0.9163

Table 3
AUC scores for the PR curves.

Algorithms   Iris     Thyroid   Occupancy   Smtp     Mammography
CSVM         0.9993   0.9998    0.8071      0.9514   0.9922
DSVM         0.9992   0.9995    0.8070      0.9514   0.9921
LSVM         0.9964   0.9977    0.6863      0.9501   0.9898

As another real life scenario, we evaluate the performances of the algorithms on the Occupancy dataset [59]. Here, we have 5 features of a room, i.e., temperature, humidity, light, carbon dioxide level, and humidity ratio. Our aim is to determine whether a room is occupied or not based on these features. Since this dataset is inherently suitable for our anomaly detection framework, we do not perform further preprocessing for this dataset. In this case, we use $\mu = 0.0001$, $\lambda_1 = 8$, $\lambda_2 = 8$, and $\lambda_3 = 1$. We also use a 10% anomaly to normal ratio. As seen in Table 1, DSVM again achieves almost the same AUC score as CSVM while significantly outperforming LSVM thanks to its efficient and effective distributed structure.

We also make performance comparisons using the Smtp dataset [60], where we have 3 numerical features for a computer network connection, which are the duration of a connection, the number of bytes from a source to a destination, and the number of bytes from a destination to a source. Based on this information, our goal is to detect anomalous network connections. As in the previous case, we do not perform further preprocessing for the dataset thanks to its suitable structure for our anomaly detection framework. For this dataset, we choose $\mu = 0.00001$, $\lambda_1 = 0.5$, $\lambda_2 = 0.5$, and $\lambda_3 = 2$. We also select a 10% anomaly to normal ratio. We observe that DSVM achieves the same AUC score as CSVM, as illustrated in Table 1.

Finally, we perform an experiment on another well known anomaly detection dataset, i.e., the Mammography dataset [60]. In this dataset, we aim to detect calcifications based on the provided 6 medical features. Since this dataset is an anomaly detection dataset, we directly use it in our framework. For this case, we select $\mu = 0.00001$, $\lambda_1 = 2$, $\lambda_2 = 2$, and $\lambda_3 = 8$. We also choose a 10% anomaly to normal ratio. As in the previous experiments, DSVM achieves a very close AUC score to the AUC score of CSVM in Table 1. Moreover, both CSVM and DSVM outperform LSVM.

In order to justify the high performance of SVM compared to the state of the art approaches, we also perform an experiment for comparison. In this experiment, since neural network based anomaly detection algorithms have strong modeling capabilities as discussed above, we include the LSTM based algorithm in [61] as the state-of-the-art anomaly detection algorithm. In addition to this, we also include Support Vector Data Description (SVDD) [22], which aims to train a predefined decision function with respect to a certain objective criterion as SVM does. Since neural network based approaches require large and less skewed datasets to be adequately trained, we use the Occupancy dataset. In this experiment, SVM outperforms both SVDD and the LSTM based method, which justifies SVM as a strong baseline for anomaly detection tasks. We note that the SVM, SVDD, and LSTM based methods achieve 0.8842, 0.6715, and 0.7444 as AUC scores, respectively.

Since our approach is generic in the sense that it can be employed in an unsupervised framework as well as a supervised framework, we also perform the same experiments for all the real datasets in an unsupervised setting. In Table 2, we provide the AUC scores for each case. Here, we observe that our approach still provides a considerable performance improvement with respect to LSVM and achieves a comparable performance with respect to CSVM.

3.3. Performance evaluation analysis and different metrics

In order to evaluate the performance of an anomaly detection algorithm, we use the area under the ROC curve in the previous experiments. Even though this is a widely accepted performance metric [57], it might present an overly optimistic view of an anomaly detection algorithm's performance, especially when there is a large skew in class distributions [62,63]. For such cases, Precision-Recall (PR) curves are used as a performance metric [62,63]. This metric can accentuate performance differences between algorithms that are not distinguishable by ROC based metrics. Hence, we also evaluate the performances of the algorithms using this metric.

We first evaluate the performances on the synthetic dataset. Our distributed approach provides a highly superior performance using only the local information, so that it almost achieves the same performance as CSVM. We note that a higher AUC score for a PR curve indicates better performance.

We also compute the area under the PR curve scores for each real dataset. In Table 3, we observe that DSVM provides almost the same AUC scores as CSVM. Additionally, when we compare the performances of DSVM and LSVM, we can clearly observe the performance improvement provided by our effective distributed approach, as in the ROC metric case (see Table 1).

In an anomaly detection framework, we have two classes and one of them is inherently the minority. Thus, in our experiments, we either use datasets that are suitable for this framework or preprocess certain well-known classification datasets in order to make them suitable. Specifically, we have more than two classes for the Iris and Thyroid datasets. Thus, we first randomly select a certain class and then undersample that class to make it the minority with a certain anomaly-to-normal ratio. With this procedure, we obtain anomaly detection datasets from well-known classification datasets with balanced class distributions. Since the original datasets have balanced class distributions, we do not suffer from skewness between different classes in this case. The remaining datasets, namely Occupancy, Smtp, and Mammography, are already in a suitable form for our anomaly detection framework. However, some of these datasets have a quite low anomaly-to-normal ratio. Hence, we also undersample these datasets to achieve reasonable anomaly-to-normal ratios, e.g., around 10%, so that we significantly mitigate the possible misleading effects of skewness.
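The preprocessing step described above can be sketched as follows. This is an illustrative implementation, not the authors' exact script; the function name, the ±1 label convention, and the 10% default ratio are assumptions for the sketch:

```python
# Sketch of the described preprocessing: pick one class of a balanced
# classification dataset as the anomaly class and undersample it to a
# target anomaly-to-normal ratio (e.g., ~10%).
import numpy as np

def make_anomaly_dataset(X, y, anomaly_class, ratio=0.1, seed=0):
    """Keep all 'normal' samples; subsample `anomaly_class` to `ratio`."""
    rng = np.random.default_rng(seed)
    normal_idx = np.flatnonzero(y != anomaly_class)
    anom_idx = np.flatnonzero(y == anomaly_class)
    n_keep = max(1, int(ratio * len(normal_idx)))
    keep = rng.choice(anom_idx, size=min(n_keep, len(anom_idx)), replace=False)
    idx = np.concatenate([normal_idx, keep])
    rng.shuffle(idx)
    # Binary labels: +1 anomaly, -1 normal (the usual SVM convention).
    return X[idx], np.where(y[idx] == anomaly_class, 1, -1)

# Toy usage with a balanced 3-class dataset (50 samples per class):
X = np.arange(300, dtype=float).reshape(150, 2)
y = np.repeat([0, 1, 2], 50)
Xa, ya = make_anomaly_dataset(X, y, anomaly_class=2, ratio=0.1)
print((ya == 1).sum(), (ya == -1).sum())   # prints: 10 100
```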

We also emphasize that dataset skewness might cause overly optimistic results, particularly when one uses ROC curve based evaluation metrics, as we discussed in the previous paragraphs. Therefore, we also use another metric, i.e., the PR curve, to evaluate the performances. Since this metric is introduced to remove the misleading effects of skewness during the evaluation process, it can be used to check for skewness in our experiments. As shown in Fig. 4 and Table 3, all the PR curve based results are consistent with the ROC curve based ones (see Table 1), which validates our claims in the previous paragraph.

Fig. 4. The PR curves of the algorithms for a synthetic dataset.

3.4. Computational and communication cost analysis

An SVM update costs O(mn). Therefore, in our case, a centralized approach (CSVM) has O(mn) complexity, since it updates all the parameters at a single node. Similarly, a local SVM (LSVM) at each node k has O(mn_k) complexity. On the other hand, our distributed method (DSVM) costs O(m Σ_{i∈N_k} n_i) for each node k. Thus, we observe that the computational complexity of DSVM is very similar to that of LSVM and significantly lower than that of CSVM, especially for sparsely connected topologies with many nodes. Considering the fixed topology in our experiments, where |N_k| = 3 and n_k = n/4, we obtain that the complexities of CSVM, DSVM, and LSVM are O(mn), O(3mn/4), and O(mn/4), respectively, where m and n change for each dataset.

Since CSVM needs to transfer all the data samples to the central node, its communication cost is O(mn). We note that even though we do not consider the distance of each node to the central node in this calculation, CSVM might require communication between nodes that are far away from each other, so that its application might be impractical. On the other hand, the communication cost of DSVM is O(m Σ_{i∈N_k\{k}} n_i), and the communication distances in this case are significantly shorter (since nodes are allowed to communicate only with their neighbors). Thus, in our experiments, the communication costs of CSVM and DSVM are O(mn) and O(mn/2), respectively.
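The cost figures for the fixed experimental topology can be checked with a short calculation. The sketch below assumes, as stated above, 4 nodes with |N_k| = 3 (a node's neighborhood including itself) and n_k = n/4; the function name and the sample values of m and n are illustrative:

```python
# Minimal numeric check of the computational and communication costs
# for the fixed topology: 4 nodes, |N_k| = 3 (including node k), n_k = n/4.
def costs(m, n, num_nodes=4, neighborhood=3):
    n_k = n // num_nodes
    csvm_comp = m * n                         # all updates at one node
    dsvm_comp = m * neighborhood * n_k        # sum over i in N_k
    lsvm_comp = m * n_k                       # local data only
    csvm_comm = m * n                         # ship all samples to the center
    dsvm_comm = m * (neighborhood - 1) * n_k  # exchange with N_k \ {k}
    return csvm_comp, dsvm_comp, lsvm_comp, csvm_comm, dsvm_comm

print(costs(m=10, n=1000))   # prints: (10000, 7500, 2500, 10000, 5000)
```

The returned tuple reproduces the O(mn), O(3mn/4), O(mn/4) computational costs and the O(mn) versus O(mn/2) communication costs stated above.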

4. Conclusion

We studied the anomaly detection problem in a distributed network of nodes and introduced a novel SVM based approach, which works in a fully distributed manner. In particular, we first reformulated the SVM optimization problem for a distributed network of nodes. Based on this formulation, we then provided the updates to train the SVM architecture. Unlike the existing distributed approaches, here, we directly solve the SVM optimization problem in the primal form using an efficient gradient based algorithm [54,55] in order to obtain an estimate of the separating hyperplane parameters. We then let only the neighboring nodes share their estimates with each other to obtain a final estimate of the separating hyperplane parameters. Thus, we achieve a high performing anomaly detection algorithm over a distributed network of nodes. More importantly, we achieve this performance with low computational complexity and communication load. Even though we work in a supervised framework, we also extend our algorithm to the unsupervised framework thanks to its generic structure. In our simulations involving both synthetic and real data, the introduced distributed approach demonstrates significant performance improvements with respect to the local performances of the nodes, so that each node in our network achieves almost the same performance as a central node that has access to all the data samples in the network.
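The two-step scheme summarized above (a local primal gradient step followed by sharing estimates with neighbors only) can be sketched as follows. This is not the authors' exact update rule but a minimal illustration of the idea: each node takes a sub-gradient step on the regularized hinge loss, then averages its hyperplane estimate with those of its neighbors. The function names, learning rate, and the ring topology are assumptions for the sketch:

```python
# Hedged sketch of distributed primal SVM training: local sub-gradient
# steps on the hinge loss, then averaging estimates over each neighborhood.
import numpy as np

def local_subgradient_step(w, b, X, y, lam, lr):
    """One sub-gradient step on lam/2 * ||w||^2 + mean hinge loss."""
    margins = y * (X @ w + b)
    active = margins < 1                     # violated margin constraints
    if active.any():
        gw = lam * w - (y[active][:, None] * X[active]).mean(axis=0)
        gb = -y[active].mean()
    else:
        gw, gb = lam * w, 0.0
    return w - lr * gw, b - lr * gb

def distributed_svm(data, neighbors, lam=0.1, lr=0.1, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    K, d = len(data), data[0][0].shape[1]
    W = 0.01 * rng.normal(size=(K, d))       # one hyperplane per node
    B = np.zeros(K)
    for _ in range(iters):
        for k, (X, y) in enumerate(data):    # step 1: local gradient step
            W[k], B[k] = local_subgradient_step(W[k], B[k], X, y, lam, lr)
        # Step 2: each node averages only over its neighborhood N_k.
        W = np.array([W[neighbors[k]].mean(axis=0) for k in range(K)])
        B = np.array([B[neighbors[k]].mean() for k in range(K)])
    return W, B

# Toy usage: 4 nodes in a ring, each with its own separable local data.
rng = np.random.default_rng(1)
def node_data(n=50):
    Xp = rng.normal(2.0, 0.5, size=(n, 2))
    Xn = rng.normal(-2.0, 0.5, size=(n, 2))
    return np.vstack([Xp, Xn]), np.concatenate([np.ones(n), -np.ones(n)])
data = [node_data() for _ in range(4)]
neighbors = {k: [k, (k - 1) % 4, (k + 1) % 4] for k in range(4)}
W, B = distributed_svm(data, neighbors)
acc = np.mean([(np.sign(X @ W[k] + B[k]) == y).mean()
               for k, (X, y) in enumerate(data)])
print(f"mean training accuracy: {acc:.2f}")
```

Because every node averages only with its immediate neighbors, each step exchanges just the O(d) hyperplane parameters rather than raw samples, which is the source of the low communication load discussed above.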

Declaration of competing interest

No potential conflict of interest exists.

Appendix A. Supplementary material

Supplementary material related to this article can be found online at https://doi.org/10.1016/j.dsp.2020.102657.

References

[1] V. Chandola, A. Banerjee, V. Kumar, Anomaly detection: a survey, ACM Comput. Surv. 41 (3) (2009) 15:1–15:58, https://doi.org/10.1145/1541880.1541882.

[2] Y. Rajabzadeh, A.H. Rezaie, H. Amindavar, A dynamic modeling approach for anomaly detection using stochastic differential equations, Digit. Signal Process. 54 (2016) 1–11, https://doi.org/10.1016/j.dsp.2016.03.006.

[3] N. Chávez, C. Guillén, Radar detection in the moments space of the scattered signal parameters, Digit. Signal Process. 83 (2018) 359–366, https://doi.org/10.1016/j.dsp.2018.08.013.

[4] M. Thottan, C. Ji, Anomaly detection in IP networks, IEEE Trans. Signal Process. 51 (8) (2003) 2191–2204, https://doi.org/10.1109/TSP.2003.814797.

[5] C. O'Reilly, A. Gluhak, M.A. Imran, Distributed anomaly detection using minimum volume elliptical principal component analysis, IEEE Trans. Knowl. Data Eng. 28 (9) (2016) 2320–2333, https://doi.org/10.1109/TKDE.2016.2555804.

[6] J. Zhao, L. Itti, Classifying time series using local descriptors with hybrid sampling, IEEE Trans. Knowl. Data Eng. 28 (3) (2016) 623–637, https://doi.org/10.1109/TKDE.2015.2492558.

[7] R. Venkatesan, A. Plastino, Fisher information framework for time series modeling, Physica A: Stat. Mech. Appl. 480 (2017) 22–38, https://doi.org/10.1016/j.physa.2017.02.076.

[8] K.T. Abou-Moustafa, M. Cheriet, C.Y. Suen, Classification of time-series data using a generative/discriminative hybrid, in: Ninth International Workshop on Frontiers in Handwriting Recognition, 2004, pp. 51–56, https://doi.org/10.1109/IWFHR.2004.26.

[9] K.T. Abou-Moustafa, A generative-discriminative framework for time-series data classification, Ph.D. thesis, Concordia University, 2003.

[10] H. Debar, M. Becker, D. Siboni, A neural network component for an intrusion detection system, in: Proceedings of the 1992 IEEE Computer Society Symposium on Research in Security and Privacy, 1992, pp. 240–250, https://doi.org/10.1109/RISP.1992.213257.

[11] Y. Bengio, P. Simard, P. Frasconi, Learning long-term dependencies with gradient descent is difficult, IEEE Trans. Neural Netw. 5 (2) (1994) 157–166, https://doi.org/10.1109/72.279181.

[12] K. Greff, R.K. Srivastava, J. Koutník, B.R. Steunebrink, J. Schmidhuber, LSTM: a search space odyssey, IEEE Trans. Neural Netw. Learn. Syst. 28 (10) (2017) 2222–2232, https://doi.org/10.1109/TNNLS.2016.2582924.

[13] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (8) (1997) 1735–1780, https://doi.org/10.1162/neco.1997.9.8.1735.

[14] R. Kozma, et al., Anomaly detection by neural network models and statistical time series analysis, in: 1994 IEEE International Conference on Neural Networks, IEEE World Congress on Computational Intelligence, vol. 5, 1994, pp. 3207–3210, https://doi.org/10.1109/ICNN.1994.374748.

[15] C. Bishop, Novelty detection and neural network validation, IEE Proc., Vis. Image Signal Process. 141 (1994) 217–222.

[16] N.K. Thanigaivelan, E. Nigussie, R.K. Kanth, S. Virtanen, J. Isoaho, Distributed internal anomaly detection system for internet-of-things, in: 2016 13th IEEE Annual Consumer Communications Networking Conference (CCNC), 2016, pp. 319–320, https://doi.org/10.1109/CCNC.2016.7444797.
