Contents lists available at ScienceDirect

Energy and Buildings

journal homepage: www.elsevier.com/locate/enbuild
Unsupervised energy prediction in a Smart Grid context using reinforcement cross-building transfer learning

Elena Mocanu∗, Phuong H. Nguyen, Wil L. Kling¹, Madeleine Gibescu

Department of Electrical Engineering, Eindhoven University of Technology, 5600 MB Eindhoven, The Netherlands
A R T I C L E   I N F O

Article history:
Received 15 June 2015
Received in revised form 11 December 2015
Accepted 22 January 2016
Available online 16 February 2016

Keywords:
Building energy prediction
Reinforcement learning
Transfer learning
Deep Belief Networks
Machine learning
A B S T R A C T

In a future Smart Grid context, increasing challenges in managing the stochastic local energy supply and demand are expected. This increases the need for more accurate energy prediction methods in order to support further complex decision-making processes. Although many methods aiming to predict energy consumption exist, all of them require labelled data, such as historical or simulated data. Still, such datasets are not always available under the emerging Smart Grid transition and complex user behaviour. Our approach goes beyond the state-of-the-art energy prediction methods in that it does not require labelled data. Firstly, two reinforcement learning algorithms are investigated in order to model the building energy consumption. Secondly, as the main theoretical contribution, a Deep Belief Network (DBN) is incorporated into each of these algorithms, making them suitable for continuous states. Thirdly, the proposed methods yield a cross-building transfer that can target new behaviour of existing buildings (due to changes in their structure or installations), as well as completely new types of buildings. The methods are developed in the MATLAB® environment and tested on a real database recorded over seven years, with hourly resolution. Experimental results demonstrate that the energy prediction accuracy in terms of RMSE has been significantly improved in 91.42% of the cases after using a DBN for automatically extracting high-level features from the unlabelled data, compared to the equivalent methods without the DBN pre-processing.
© 2016 Elsevier B.V. All rights reserved.
1. Introduction
Prediction of energy consumption as a function of time plays an essential role in the current transition to future energy systems. Within the new context of so-called Smart Grids, the energy consumption of buildings can be regarded as a nonlinear time series, depending on many complex factors. The variability introduced by the growing penetration of wind and solar generation sources only strengthens the role of accurate prediction methods [1]. Prediction forms an integral part of the efficient planning and operation of the whole Smart Grid.

On the one hand, advanced energy prediction methods should be easily expandable to various levels of data aggregation at all time scales [2]. On the other hand, they have to automatically adapt their decision strategies to the dynamic behavior of active consumers (e.g. new and smart(er) buildings) [3]. Applications of these new methods should facilitate the transition from the traditional single-tariff grid to time-of-use (TOU) and real-time pricing. The effects will be felt by all players in the grid, from transmission (TSO) and distribution system operators (DSO) to the end-user, including resource assessment and analysis of energy efficiency improvements, flexible demand response (DR), and other continuous projections in planning studies. The joint consideration of decisions regarding new renewable generation, TSO development, and demand-side management (DSM) programs in an integrated fashion requires demand forecasts. Consequently, these will require changes in the way the data are collected and analyzed [4].

∗ Corresponding author.
E-mail address: e.mocanu@tue.nl (E. Mocanu).
¹ Deceased March 14, 2015.
Prior studies have shown that by using statistical methods, more recently inspired by supervised machine learning techniques, such as Support Vector Machines [5,6], Artificial Neural Networks [7,8], Autoregressive models [9], Conditional Random Fields [10], or Hidden Markov Models [11], one can improve the accuracy of energy prediction significantly. On the other hand, there are many methods based on physical principles, involving a large number of building parameters, to calculate thermal dynamics and energy behavior at the building level. Moreover, to shape the evolution of future building systems, there are also some hybrid approaches which combine some of the above models to optimize predictive performance, such as [12–16]. Interested readers are referred to [9,14,17,18] for a more comprehensive discussion on the application of energy demand management.

Nomenclature

α       learning rate, ∀α ∈ [0, 1]
γ       discount factor, ∀γ ∈ (0, 1)
E[·]    expected value operator
A       the set of actions, ∀a ∈ A
D       dataset
R       the reward function
S       the set of states, ∀s ∈ S
T       transition probability matrix
h       vector collecting all the hidden units, h_j ∈ {0, 1}
v       vector collecting all the visible units, v_i ∈ ℝ
W_vh    matrix of all weights connecting v and h
E       total energy function in the RBM model
k       the number of hidden layers
M       building energy consumption model
p, P    probability value/vector
Q       the quality matrix
t       time
Z       normalization function
Although they remain at the forefront of academic and applied research, all these methods require labeled data able to faithfully reproduce the energy consumption of buildings. In the remainder of this paper we refer to the labeled data as the historical (known) data of the analyzed building. Usually the lack of historical data can be compensated by simulated data. Still, both historical and simulated data are employed in these forecasting methods in a non-adaptable way, without considering the future events or changes which can occur in the Smart Grid.

A stronger motivation for this paper is given by the not too well exploited fact that sometimes there are no historical consumption data available for a particular building. From the machine learning perspective this is a typical unsupervised learning problem. One of the most used methods of unsupervised learning, reinforcement learning (RL), was introduced in the power system area to solve stochastic optimal control problems [19]. RL methods are used in a wide range of applications, such as system control [20], playing games or, more recently, transfer learning [19,21]. The advantage of combining reinforcement learning and transfer learning approaches is straightforward. Hence, we want to transfer knowledge from a global to a local perspective, to encode the uncertainty of the building energy demand.

Owing to the curse of dimensionality, these methods fail in high dimensions. More recently, there has been a revival of interest in combining Deep Learning with reinforcement learning. Therein, Restricted Boltzmann Machines were proven to provide a value function estimation [22] or a policy estimation [23]. More than that, Mnih et al. [24] successfully combined deep neural networks and Q-learning to create a deep Q-network which learned control policies in a range of different environments.
In this paper, we comprehensively explore and extend two reinforcement learning (RL) methods to predict the energy consumption at the building level using unlabelled historical data, namely State-Action-Reward-State-Action (SARSA) [25] and Q-learning [26]. Because in their original form both methods cannot handle continuous state spaces well, this paper contributes theoretically by extending them with a Deep Belief Network [27] for continuous state estimation and automatic feature extraction in a unified framework. Our proposed RL methods are appropriate when we do not have historical or simulated data, but want to estimate the impact of changes in the Smart Grid, such as the appearance of a building or several buildings in a certain area, or, more commonly, a change in energy consumption due to building renovation. In this paper, we show the applicability and efficiency of our proposed method in three different situations:
1. In the case of a new type of building being connected to the Smart Grid, thus transferring knowledge from a commercial building to a residential building. Specifically, in Section 6.2.1, four different types of residential buildings were analyzed.
2. In the case of a renovated building, thus transferring knowledge from a non-electric heat building to a building with electric heating.
3. Additionally, we propose experiments to highlight the importance of external factors, such as price information, for the estimation of building energy consumption. In Section 6.2.2, transfer learning is applied from a building under a static tariff to a building with a time-of-use tariff.
To the best of our knowledge, this is the first time that energy prediction is performed without using any information about the target building, such as historical data, energy price, physical parameters of the building, meteorological conditions, or information about user behavior.

The paper is organized as follows. In Section 2 we explain the rationale underlying our approach. Section 3 presents the mathematical modeling of the reinforcement learning approaches. Section 4 describes the novel method to estimate continuous states in reinforcement learning using Deep Belief Networks. The experimental setup and results are illustrated in Sections 5 and 6, respectively. The paper concludes with a discussion and future work.
2. Problem formulation
In this paper, we propose a method to solve the unsupervised energy prediction problem with cross-building transfer by using machine learning time series prediction techniques. In its most general statement, the proposed reinforcement and transfer learning setup is depicted in Fig. 1. Given the unevenly distributed building energy values over time, special attention is first given to the question: How to estimate a continuous state space? The idea is to find a lower-dimensional representation of the energy consumption data that preserves the pairwise distances as well as possible.

Fig. 1. The unsupervised learning setup explores and extends reinforcement and transfer learning by including a Deep Belief Network for continuous state estimation.
More formally, the energy prediction using unlabeled data problem presented in this paper is divided into three different sub-problems, namely:

1. Continuous state estimation problem: Given a dataset, D : ℝ → S, find a confined state space representation S₁.
2. Reinforcement learning problem: Given a building model M₁ = ⟨S₁, A₁, T·(·,·), R₁⟩, find an optimal policy π₁*.
3. Transfer learning problem: Given a model M₁ = ⟨S₁, A₁, T·(·,·), R₁⟩, a reasonable π₁*, and M₂ = ⟨S₂, A₂, T·(·,·), R₂⟩, find a good π₂.
The proposed solution is presented in Section 4, where a new method to estimate continuous states in reinforcement learning using Deep Belief Networks is detailed. Further, this state estimation method is integrated into the SARSA and Q-learning algorithms in order to improve the prediction accuracy.
3. Reinforcement learning
Reinforcement learning [28] is a field of machine learning inspired by psychology, which studies how artificial agents can perform actions in an environment to achieve a specific goal. Practically, the agent has to control a dynamic system by choosing actions in a sequential fashion. The dynamic system, also known as the environment, is characterized by states, its dynamics, and a function that describes the states' evolution given the actions chosen by the agent. After it executes an action, the agent moves to a new state, where it receives a reward (a scalar value) which informs it how far it is from the goal (the final state). To achieve the goal, the agent has to learn a strategy to select actions, dubbed a policy in the literature, in such a way that the expected sum of rewards is maximized over time. Besides that, a state of the system captures all the information required to predict the evolution of the system into the next state, given an agent action. It is also assumed that the agent can perceive the state of the environment without error, and that it makes its current decision based on this information. There are two different categories of RL algorithms: (i) online RL, which are interaction-based algorithms, such as Q-learning [26], SARSA [25], or Policy Gradient, and (ii) offline RL, like Least-Squares Policy Iteration or fitted Q-iteration. For a more comprehensive discussion of RL algorithms we refer to [29]. In the remainder of this paper we will refer only to online RL.
An RL problem can be formalized using Markov decision processes (MDPs). MDPs are defined by a 4-tuple ⟨S, A, T·(·,·), R·(·,·)⟩, where S is a set of states, ∀s ∈ S, A is a set of actions, ∀a ∈ A, T : S × A × S → [0, 1] is the transition function given by the probability that by choosing action a in state s at time t the system will arrive at state s′ at time t + 1, such that T_a(s, s′) = p(s_{t+1} = s′ | s_t = s, a_t = a), and R : S × A × S → ℝ is the reward function, where R_a(s, s′) is the immediate reward (or expected immediate reward) received by the agent after it performs the transition from state s to state s′. An important property of MDPs is the Markov property, which makes the assumption that the state transitions depend only on the last state of the system, and are independent of any previous environment states or agent actions, i.e. p(s_{t+1} = s′, r_{t+1} = r | s_t, a_t) for all s′, r, s_t, and a_t. MDP theory does not assume that S or A are finite, but the traditional algorithms make this assumption. In general, MDPs can be solved using linear or dynamic programming. The interested reader is referred to [30] for a more comprehensive discussion of MDPs. Furthermore, in the real world, the state transition probabilities T·(·,·) and the rewards R·(·,·) are unknown, and the state space S or the action space A might be continuous. Thus, RL represents a natural extension and generalization of MDPs for such situations, where the tasks are too large or too ill-defined to be solved using optimal-control theory [25].
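As a concrete illustration of the formalism above, the following sketch solves a tiny MDP by value iteration, the dynamic-programming approach mentioned above. The two-state, two-action MDP and all its numbers are hypothetical, not taken from the paper:

```python
import numpy as np

# Hypothetical MDP: T[a, s, s'] transition probabilities, R[a, s, s'] rewards.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],    # action 0
              [[0.5, 0.5], [0.4, 0.6]]])   # action 1
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [1.0, 0.0]]])
gamma = 0.9  # discount factor

V = np.zeros(2)
for _ in range(500):
    # Bellman backup: Q(a, s) = sum_s' T[a,s,s'] (R[a,s,s'] + gamma V(s'))
    Q = (T * (R + gamma * V)).sum(axis=2)
    V = Q.max(axis=0)          # V(s) = max_a Q(a, s)
policy = Q.argmax(axis=0)      # greedy policy w.r.t. the converged values
print(V, policy)
```

With known T and R this converges to the optimal value function; the point of RL, as noted above, is precisely that T and R are usually unknown.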
3.1. Q-learning
First, the Q-learning algorithm [26] is recommended as a standard solution in RL where the rules are often stochastic. This algorithm therefore has a function which calculates the Quality of a state-action combination, defined by Q : S × A → ℝ. Before learning has started, the Q matrix returns an initial value. Then, each time the agent selects an action, it observes a reward and a new state, both of which may depend on the previous state and the selected action. The action-value function of a fixed policy π with value function V^π : S → ℝ is

Q^π(s, a) = r(s, a) + γ Σ_{s′} p(s′ | s, a) V^π(s′),  ∀s ∈ S, a ∈ A    (1)

The value of a state-action pair, Q^π(s, a), represents the expected outcome when an agent starts from s, executes a and then follows the policy π afterwards, such that V^π(x) = Q^π(x, π(x)), with the corresponding Bellman equation

Q*(s, a) = r(s, a) + γ Σ_{s′} p(s′ | s, a) max_b Q*(s′, b)    (2)

where the discount factor γ ∈ [0, 1] trades off the importance of immediate and future rewards. Thus, the optimal values are obtained for ∀s ∈ S as V*(s) = max_a Q*(s, a) and π*(s) = argmax_a Q*(s, a). The value of a state-action pair is given by the formal expectation, E, of the expected total return r_t, such that Q^π(s, a) = E(r_t | s_t = s, a_t = a). The off-policy Q-learning algorithm has the update rule defined by

Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α_t [r_{t+1} + γ max_a Q_t(s_{t+1}, a) − Q_t(s_t, a_t)]    (3)

where r_{t+1} is the reward observed after performing a_t in s_t, and where α_t(s, a), with all α ∈ [0, 1], is the learning rate, which may be the same for all pairs. The Q-learning algorithm has problems with large numbers of continuous states and discrete actions. Usually, it needs function approximations, e.g. neural networks, to associate triplets like (state, action, Q-value). Exploration of one MDP can be done under the Markov assumption, taking into account just the current state and action, but because in the real world we have Partially Observable MDPs, we may obtain better results if an arbitrary number k of history states and actions (s_{t−k}, a_{t−k}, . . ., s_{t−1}, a_{t−1}) is considered [31], to clearly identify a triplet ⟨s_t, a_t, Q_t⟩ at time t.
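The tabular form of the update in Eq. (3) can be sketched in a few lines. The learning rate and discount factor below match those later used in Section 6.1.2; the exploration rate, the transitions, and the reward are hypothetical stand-ins, not the building model from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3
alpha, gamma = 0.4, 0.99   # values used in Section 6.1.2
eps = 0.1                  # assumed epsilon-greedy exploration rate
Q = np.zeros((n_states, n_actions))

s = 0
for t in range(2000):
    # epsilon-greedy action selection
    a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
    s_next = int(rng.integers(n_states))   # hypothetical transition
    r = float(s_next == n_states - 1)      # hypothetical reward
    # off-policy update, Eq. (3): bootstrap on max_a' Q(s', a')
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next
print(Q.argmax(axis=1))  # greedy policy per state
```

The `max` in the update is what makes Q-learning off-policy: it evaluates the greedy policy regardless of which action the agent actually takes next.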
3.2. SARSA

An interesting variation of Q-learning is the State-Action-Reward-State-Action (SARSA) algorithm [25], which aims at using Q-learning as part of a Policy Iteration mechanism. The major difference between SARSA and Q-learning is that in SARSA the maximum reward for the next state is not necessarily used for updating the Q-values. Therefore, the core of the SARSA algorithm is a simple value iteration update. The information required for the update is a tuple (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}), and the update is defined by

Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α_t [r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t)]    (4)

where r_{t+1} is the reward and α_t(s, a) is the learning rate. In practice, Q-learning and SARSA are the same if we use a greedy policy (i.e. the agent always chooses the best action), but they differ when the ε-greedy policy is used, which favors more random exploration.

In traditional reinforcement learning algorithms, only MDPs with finite states and actions are considered. However, building energy consumption can take nearly arbitrary real values, resulting in a very large number of states in the MDP. Because energy consumption can be seen as a time series problem, a prior discretization of the state space is not very useful. So, we try to find algorithms that work well with large (or continuous) state spaces, as is shown next.
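For comparison, an on-policy SARSA step (Eq. (4)) differs from the Q-learning update only in bootstrapping on the action actually selected next. A minimal sketch, again with a hypothetical environment and an assumed ε value:

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 5, 3
alpha, gamma, eps = 0.4, 0.99, 0.1   # eps is an assumed exploration rate

Q = np.zeros((n_states, n_actions))

def choose(s):
    # The same epsilon-greedy policy is used for acting and for updating.
    return int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())

s, a = 0, choose(0)
for t in range(2000):
    s_next = int(rng.integers(n_states))   # hypothetical transition
    r = float(s_next == n_states - 1)      # hypothetical reward
    a_next = choose(s_next)
    # on-policy update, Eq. (4): bootstrap on Q(s', a'), not on max_a' Q(s', a')
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
    s, a = s_next, a_next
print(Q.argmax(axis=1))
```

Under a purely greedy policy (ε = 0), `a_next` always equals the argmax and the two updates coincide, exactly as stated above.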
4. State estimation via Deep Belief Networks

Deep architectures [27] have shown very good results in different applications, such as non-linear dimensionality reduction [32], image recognition [33], video sequences, or motion-capture data [34]. A comprehensive analysis of dimensionality reduction and deep architectures can be found in [35]. Overall, Deep Belief Networks (DBNs) can be a way to naturally decompose the problem into sub-problems associated with different levels of abstraction.
4.1. Restricted Boltzmann Machines
DBNs are composed of several Restricted Boltzmann Machines (RBMs) stacked on top of each other [36]. An RBM is a stochastic recurrent neural network that consists of a layer of visible units, v, and a layer of binary hidden units, h. The total energy of the joint configuration of the visible and hidden units (v, h) is given by:

E(v, h) = − Σ_{i,j} v_i h_j W_{ij} − Σ_i v_i a_i − Σ_j h_j b_j    (5)

where i represents the indices of the visible layer, j those of the hidden layer, and W_{ij} denotes the weight connection between the ith visible and jth hidden unit. Further, v_i and h_j denote the states of the ith visible and jth hidden unit, respectively, while a_i and b_j represent the biases of the visible and hidden layers. The first term, Σ_{i,j} v_i h_j W_{ij}, represents the energy between the hidden and visible units with their associated weights. The second term, Σ_i v_i a_i, represents the energy in the visible layer, while the third term represents the energy in the hidden layer. The RBM defines a joint probability over the hidden and visible layers, p(v, h):

p(v, h) = e^{−E(v,h)} / Z    (6)

where Z is the partition function, obtained by summing the energy of all possible (v, h) configurations, Z = Σ_{v,h} e^{−E(v,h)}. To determine the probability of a data point represented by a state v, the marginal probability is used, summing out the states of the hidden layer, such that p(v) = Σ_h p(v, h).

The above equation can be used for any given input to calculate the probability of either the visible or the hidden configuration being activated. These values are further used to perform inference, in order to determine the conditional probabilities in the model. To maximize the likelihood of the model, the gradient of the log-likelihood with respect to the weights must be calculated. The gradient of the first term, after some algebraic manipulations, can be written as

∂ log(Σ_h exp(−E(v, h))) / ∂W_{ij} = v_i · p(h_j = 1 | v)    (7)

However, computing the gradient of the second term is intractable.
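Because the model expectation is intractable, it is standard practice to approximate it with one step of Gibbs sampling (contrastive divergence, CD-1), alternating the RBM's sigmoid conditionals between the layers. A minimal numpy sketch, with hypothetical dimensions and random binary data in place of the energy dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid, lr = 6, 4, 0.1
W = 0.01 * rng.standard_normal((n_vis, n_hid))
a, b = np.zeros(n_vis), np.zeros(n_hid)   # visible / hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

X = (rng.random((100, n_vis)) < 0.5).astype(float)  # hypothetical binary data

for epoch in range(20):
    for v0 in X:
        ph0 = sigmoid(b + v0 @ W)                   # p(h_j = 1 | v)
        h0 = (rng.random(n_hid) < ph0).astype(float)
        pv1 = sigmoid(a + W @ h0)                   # p(v_i = 1 | h)
        v1 = (rng.random(n_vis) < pv1).astype(float)
        ph1 = sigmoid(b + v1 @ W)
        # CD-1: <v h> under the data minus <v h> after one Gibbs step
        W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
        a += lr * (v0 - v1)
        b += lr * (ph0 - ph1)
print(W.shape)
```

The one-step reconstruction statistics stand in for the intractable model expectation; running the chain to equilibrium would recover the exact gradient.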
Fig. 2. A general Deep Belief Network structure with three hidden layers. The top two layers have undirected connections and form an associative memory, where • denotes binary neurons and ◦ represents real values.
The inference of the hidden and visible layers in an RBM can be done according to the following formulas:

p(h_j = 1 | v) = σ(b_j + Σ_i v_i W_{ij})    (8)

p(v_i = 1 | h) = σ(a_i + Σ_j h_j W_{ij})    (9)

where σ(·) represents the sigmoid function. Moreover, to learn an RBM we can use the following learning rule, which performs stochastic steepest ascent in the log probability of the training data [37]:

∂ log(p(v, h)) / ∂W_{ij} = ⟨v_i h_j⟩_0 − ⟨v_i h_j⟩_∞    (10)

where ⟨·⟩_0 denotes the expectations under the data distribution (p_0), and ⟨·⟩_∞ denotes the expectations under the model distribution.

4.2. Deep Belief Networks

Overall, a Deep Belief Network [27] is given by an arbitrary number of RBMs stacked on top of each other. This yields a combination of a partially directed and partially undirected graphical model. Therein, the joint distribution between the visible layer x (input vector) and the l hidden layers h^k is defined as follows:

p(x, h^1, . . ., h^l) = ( Π_{k=0}^{l−2} P(h^k | h^{k+1}) ) P(h^{l−1}, h^l)    (11)

where x = h^0, P(h^k | h^{k+1}) is the conditional distribution for the visible units conditioned on the hidden units of the RBM at level k + 1, and P(h^{l−1}, h^l) is the visible-hidden joint distribution in the top-level RBM. An example of a DBN with 3 hidden layers (i.e. h^1, h^2, and h^3) is depicted in Fig. 2.
The top-level RBM in a DBN acts as a complementary prior for the bottom-level directed sigmoid likelihood function. A DBN can be trained in a greedy unsupervised way, by training each of its RBMs separately, in a bottom-to-top fashion, and using the hidden layer of one RBM as the input layer for the next [38]. Furthermore, the DBN can be used to project our initial states, acquired from the environment, to another state space with binary values, by fixing the initial states in the bottom layer of the model and inferring the top hidden layer from them. In the end, the top hidden layer can be directly incorporated into the SARSA or Q-learning algorithms, as described in Algorithm 1.
Algorithm 1. RL extension including a DBN for state estimation.

 1: %% DBN for state estimation
 2: Initialize DBN
 3: Initialize training set X with the states
 4: for each RBM k in DBN
 5:   repeat (training epoch)
 6:     for each training instance x ∈ X
 7:       Set RBM_k^visible = x
 8:       Run Markov chain in RBM_k
 9:       Get statistics for RBM_k
10:       Update weights for RBM_k
11:     end for
12:   until convergence
13:   for each training instance x ∈ X
14:     Set RBM_k^visible = x
15:     Infer RBM_k^hidden
16:     Replace x in X with RBM_k^hidden
17:   end for
18: end for
19: %% Use the last computed X as states for RL(·)
20: %% RL(1): SARSA algorithm
21: Initialize Q(s, a) arbitrarily, where s ∈ X
22: repeat (for each episode)
23:   Initialize s
24:   Choose a from s using the policy derived from Q
25:   repeat (for each step of the episode)
26:     Take action a, observe r, s′
27:     Choose a′ from s′ using the policy derived from Q
28:     Q(s, a) ← Q(s, a) + α[r + γQ(s′, a′) − Q(s, a)]
29:     s ← s′
30:     a ← a′
31:   until s is terminal
32: until Q optimal
33: %% RL(2): Q-learning algorithm
34: Initialize Q(s, a) arbitrarily, where s ∈ X
35: repeat (for each episode)
36:   Initialize s
37:   repeat (for each step of the episode)
38:     Choose a from s using the policy derived from Q
39:     Take action a, observe r, s′
40:     Q(s, a) ← Q(s, a) + α[r + γ max_{a′} Q(s′, a′) − Q(s, a)]
41:     s ← s′
42:   until s is terminal
43: until Q optimal
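To make the control flow of Algorithm 1 concrete, the sketch below chains a greedy layer-wise CD-1 loop (lines 1-19) with a Q-learning loop over the inferred binary codes (lines 33-43). The input data, the environment's transitions and rewards, and all hyper-parameters except α and γ are hypothetical stand-ins, not the paper's building data:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def train_rbm(X, n_hid, epochs=10, lr=0.1):
    """One RBM trained with CD-1 (short Markov chain), as in lines 5-12."""
    n_vis = X.shape[1]
    W = 0.01 * rng.standard_normal((n_vis, n_hid))
    a, b = np.zeros(n_vis), np.zeros(n_hid)
    for _ in range(epochs):
        for v0 in X:
            ph0 = sigmoid(b + v0 @ W)
            h0 = (rng.random(n_hid) < ph0).astype(float)
            pv1 = sigmoid(a + W @ h0)
            v1 = (rng.random(n_vis) < pv1).astype(float)
            ph1 = sigmoid(b + v1 @ W)
            W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
            a += lr * (v0 - v1)
            b += lr * (ph0 - ph1)
    return W, b

# Lines 1-19: greedy layer-wise training; X is replaced by hidden codes per layer.
X = (rng.random((50, 8)) < 0.4).astype(float)   # hypothetical binarized states
for n_hid in (6, 4):                            # two stacked RBMs
    W, b = train_rbm(X, n_hid)
    X = (sigmoid(b + X @ W) > 0.5).astype(float)  # lines 13-17: infer hidden codes

# Lines 33-43: Q-learning, with the distinct binary codes as discrete states.
idx = {c: i for i, c in enumerate(sorted({tuple(x) for x in X}))}
n_states, n_actions, alpha, gamma = len(idx), 3, 0.4, 0.99
Q = np.zeros((n_states, n_actions))
s = 0
for t in range(1000):
    a_t = int(rng.integers(n_actions)) if rng.random() < 0.1 else int(Q[s].argmax())
    s_next = int(rng.integers(n_states))   # hypothetical environment transition
    r = float(s_next == 0)                 # hypothetical reward
    Q[s, a_t] += alpha * (r + gamma * Q[s_next].max() - Q[s, a_t])
    s = s_next
print(n_states, Q.shape)
```

The key structural point survives the simplifications: the RL loop never sees the raw continuous states, only the top-layer codes produced by the stacked RBMs.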
Now that we have considered the problem of state estimation and have incorporated all three sub-problems into a unified approach, we look into the experimental validation.
5. Dataset characteristics
The proposed solution is experimentally evaluated using a dataset recorded over seven years, more exactly between January 6, 2007 and January 31, 2014. The load profiles, including different residential and commercial buildings, are made available by the Baltimore Gas and Electric Company [39]. For every type of building analyzed, the available historical load data in kWh represent an average building profile per hour. Overall, there are five different building profiles, as presented in Table 1.
For a more comprehensive view of the datasets used in this paper, we show in Fig. 3 the hourly evolution of the electrical energy consumption for a General Service (G) dataset, including Commercial, Industrial & Lighting, and for a residential building with non-electric heat (R), over different time horizons. Moreover, some general characteristics of the entire dataset are graphically depicted in Fig. 4. In all experiments the data were separated into training and testing datasets. More precisely, the data collected from 1st June 2007 until 1st January 2013 (2041 days) were used in the learning phase, and the remaining data, between 1st January 2013 and 31st January 2014 (396 days), were used to evaluate the performance of the methods. The metrics used to assess the quality of the different buildings' energy consumption predictions are described next.

Table 1
Building types in datasets.

Residential
  R          Residential (non-electric heat)
  R (ToU)    Residential Time-of-Use (non-electric heat)
  RH         Residential (electric heat)
  RH (ToU)   Residential Time-of-Use (electric heat)

Commercial
  G          General Service (<60 kW), Commercial, Industrial & Lighting

Fig. 3. Electrical energy consumption for a Commercial, Industrial & Lighting (G) dataset and for a residential building with non-electric heat (R) over different time horizons.
5.1. Metrics for prediction assessment
As we mentioned earlier, the goal is to achieve good generalization by making accurate predictions for new building energy consumption data. Firstly, some quantitative insight into the generalization performance of our approach is obtained using the root mean square error, defined by

RMSE = √( (1/N) Σ_{i=1}^{N} (v_i − v̂_i)² )

where N represents the number of multi-step predictions within a specified time horizon, v_i represents the real value at time step i, and v̂_i represents the model-estimated value at the same time step. Then, by using the Pearson product-moment correlation coefficient (R), insight is given into the degree of linear dependence between the real and the predicted values. Hence

R(u, v) = E[(u − μ_u)(v − μ_v)] / (σ_u σ_v)

where E[·] is the expected value operator and σ_u and σ_v are the standard deviations. The correlation coefficient may take any value within the range [−1, 1]. The sign of the correlation coefficient defines the direction of the relationship, either positive or negative. Finally, we perform the Kolmogorov–Smirnov test [40] in order to gain insight into the statistical significance of our results. The Kolmogorov–Smirnov test has the advantage of making no assumption about the distribution of the data. This elaborate statistical test is not a typical metric used in the analysis of prediction accuracy, but it is imposed by the fact that the learning and testing procedures use different building types. Hence, exceeding the statistical significance level, p < 0.05, would be expected and validates that the data come from different probability distribution functions.

Fig. 4. General characteristics of all datasets: a box-plot with the exact values for mean and standard deviation encoded in it.

Fig. 5. (Left) The RMSE values observed for different RBM configurations in the DBN architecture, with varying numbers of hidden neurons, as a function of training epochs. (Right) Performance metrics for the chosen RBM configuration with 10 hidden neurons.
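The three assessment criteria above can be computed directly with SciPy; a sketch on hypothetical prediction vectors (the vector length 168 mirrors a one-week hourly horizon, but the data are random):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
v_real = rng.random(168)                           # hypothetical real consumption
v_pred = v_real + 0.1 * rng.standard_normal(168)   # hypothetical model output

# Root mean square error over the N multi-step predictions
rmse = np.sqrt(np.mean((v_real - v_pred) ** 2))

# Pearson product-moment correlation coefficient R with its p-value
r, p_pearson = stats.pearsonr(v_real, v_pred)

# Two-sample Kolmogorov-Smirnov test: distribution-free check of whether
# the two samples come from the same distribution
ks_stat, p_ks = stats.ks_2samp(v_real, v_pred)

print(round(rmse, 3), round(r, 3))
```

Note the direction of the KS test in this context: a small p_ks supports the claim that the two samples stem from different distributions, which is the expected outcome for cross-building transfer.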
6. Empirical results

To assess the performance of our extended reinforcement and transfer learning approaches presented in Section 4, we have designed five different scenarios. These are selected to cover various multi-step predictions at different resolutions, and are summarized in Table 2. Before going into a deeper analysis of the numerical results, we first present some details of the implementation.
6.1. Implementation details
The implementation has been done in two parts. Firstly, a DBN is implemented; secondly, the RL algorithms use the DBN in their implementations for continuous state estimation, as shown next.
6.1.1. Continuous state estimation using DBN
We implemented the DBN in MATLAB® from scratch, using the mathematical details described in Section 4. In order to obtain a good prediction, we carefully investigated the choice of the optimal number of hidden units in our DBN configuration with respect to the RMSE evolution; see Fig. 5.
Table 2
Summary of the experiments.

              Notation   Time horizon   Resolution
Scenario 1    S1         1 h            1 h average
Scenario 2    S2         1 day          1 h average
Scenario 3    S3         1 week         1 h average
Scenario 4    S4         1 month        1 h average
Scenario 5    S5         1 year         1 week average
Thus, the number of hidden neurons was set to 10 and the learning rate was 10⁻³. The momentum was set to 0.5 and the weight decay to 0.0002. We trained the model for 20 epochs, but as can be seen in Fig. 5 (right), the model converged after approximately 4 epochs. More details about the optimal choice of the parameters can be found in [41].
6.1.2. SARSA and Q-learning
We implemented SARSA and Q-learning in MATLAB® using the mathematical details described in Section 3. In both cases the learning rate was set to 0.4 and the discount factor to 0.99. Both parameters have a direct influence on the performance of the two algorithms.
The choice of these parameters was made after a thorough examination of the RMSE outcome, as shown for example in Fig. 6. Overall, the learning rate determines to what extent the newly acquired information overrides the old information, and the discount factor determines the importance of future rewards. For example, γ = 0 will make the agent "opportunistic" by only considering current rewards, while a discount factor approaching 1 will make it strive for a long-term high reward.
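The role of the discount factor can be seen on a toy reward sequence (a hypothetical example, not the tariff data): with γ = 0 only the immediate reward counts, while γ close to 1 weights a distant large reward almost fully.

```python
import numpy as np

# A hypothetical episode: a small immediate reward, a large reward 4 steps later.
rewards = np.array([1.0, 0.0, 0.0, 0.0, 10.0])

def discounted_return(r, gamma):
    # G = sum_t gamma^t * r_t
    return float(np.sum(gamma ** np.arange(len(r)) * r))

print(discounted_return(rewards, 0.0))    # 1.0   -> "opportunistic" agent
print(discounted_return(rewards, 0.99))   # ~10.6 -> long-term reward dominates
```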
6.2. Numerical results

In this section, we test and illustrate the two unsupervised learning approaches using the dataset described in Section 5. Different scenarios, as summarized in Table 2, have been created to assess the performance of the proposed models.

Fig. 6. Analysis of RMSE values obtained for different α values in the exploration step, for different scenarios. This involves the prediction of the G dataset (Commercial, Industrial & Lighting consumption, General Service (<60 kW)).
E. Mocanu et al. / Energy and Buildings 116 (2016) 646–655

Table 3
Using the Commercial, General Service (G) (<60 kW) dataset to predict residential energy consumption (R, R (ToU), RH and RH (ToU) values) using SARSA, Q-learning, SARSA with DBN extension and Q-learning with DBN extension.

     Method            G                        R                        R (ToU)                  RH                       RH (ToU)
                       RMSE  R     p-Val        RMSE  R     p-Val        RMSE  R     p-Val        RMSE  R     p-Val        RMSE  R     p-Val
S1   SARSA             0.18  0.90  5.6e−10      0.02  0.99  2.8e−23      0.10  0.93  3.9e−12      0.36  0.91  1.2e−10      0.42  0.88  4.2e−09
     Q-learning        0.22  0.86  2.2e−08      0.02  0.99  2.8e−23      0.04  0.99  2.5e−21      0.34  0.92  3.3e−11      0.34  0.92  3.3e−11
     SARSA+DBN         0.04  0.01  1.7e−05      0.02  0.99  3.4e−20      0.06  0.98  4.7e−14      0.04  0.99  1.2e−23      0.04  0.01  7.8e−26
     Q-learning+DBN    0.01  0.99  6.9e−38      0.03  0.97  7.1e−16      0.09  0.94  2.1e−12      0.04  0.99  1.1e−23      0.02  0.99  5.3e−33
S2   SARSA             0.65  0.36  0.0097       0.75  0.52  1.3e−04      0.47  0.42  0.0029       1.23  0.55  4.3e−05      1.20  0.65  3.5e−07
     Q-learning        1.09  0.17  0.2460       0.98  0.30  0.0358       0.40  0.81  1.4e−12      1.28  0.52  1.4e−04      1.55  0.41  0.0038
     SARSA+DBN         0.38  0.65  1.3e−05      0.37  0.98  0.39         0.37  0.79  0.0014       0.46  0.81  0.0023       0.47  0.67  2.6e−05
     Q-learning+DBN    0.33  0.84  6.2e−14      0.37  0.08  0.5539       0.29  0.74  1.7e−09      0.41  0.50  2.3e−04      0.66  0.26  0.0671
S3   SARSA             1.27  0.12  0.0877       1.73  0.21  0.0026       1.36  0.27  1.1e−04      1.59  0.29  2.6e−05      1.33  0.26  2.3e−04
     Q-learning        1.39  0.09  0.2052       1.10  0.24  7.6e−04      0.83  0.23  0.0012       1.47  0.25  3.1e−04      1.61  0.06  0.3589
     SARSA+DBN         0.69  0.90  1.4e−05      1.31  0.99  0.88         0.55  0.98  0.1623       1.33  0.99  0.7896       1.18  0.99  0.5244
     Q-learning+DBN    0.62  0.38  4.8e−08      0.98  0.11  0.0978       0.58  0.30  2.1e−05      1.26  0.12  0.0950       1.30  0.03  0.5932
S4   SARSA             1.55  0.09  0.0128       3.70  −0.25 1.3e−11      2.39  0.07  0.0361       2.05  0.07  0.0397       1.89  0.16  1.0e−05
     Q-learning        1.41  0.15  6.1e−05      1.24  0.07  0.0404       1.14  −0.04 0.2000       1.67  0.08  0.0309       1.71  0.02  0.4621
     SARSA+DBN         1.14  0.98  2.2e−04      1.45  0.99  0.29         1.17  0.98  0.0025       1.33  0.99  8.51e−05     1.21  0.99  0.1347
     Q-learning+DBN    0.98  0.34  2.5e−20      1.40  0.01  0.8960       0.87  −0.13 5.2e−04      1.52  0.11  0.0022       1.55  0.17  3.2e−06
S5   SARSA             1.01  −0.08 0.5419       2.61  −0.15 0.2484       2.04  −0.20 0.1298       2.16  −0.31 0.0197       1.95  −0.29 0.0276
     Q-learning        0.72  0.30  0.0208       2.28  −0.20 0.1334       1.81  −0.20 0.1267       1.83  −0.33 0.0125       1.59  −0.08 0.5542
     SARSA+DBN         0.05  0.65  1.4e−08      0.08  0.66  5.3e−09      0.10  0.74  6.4e−12      0.11  0.89  6.02e−22     0.24  0.48  8.7e−05
     Q-learning+DBN    0.03  0.02  0.8245       0.02  0.37  0.0031       0.02  0.37  0.0028       0.03  0.22  0.0873       0.03  0.06  0.6315
Fig. 7. Overview of the errors obtained, where (a) using the G dataset we predict R, R (ToU), RH and RH (ToU) values, (b) using R we predict RH, and (c) using R (ToU) we predict RH (ToU). Four methods are used: SARSA, Q-learning, SARSA with DBN extension and Q-learning with DBN extension.
6.2.1. Commercial to residential transfer

In this set of experiments, we use Commercial, Industrial & Lighting data to train the DBN model. Furthermore, we use the trained DBN model to predict four different types of unseen residential building consumption, such as residential with and without electric heat, and residential electric consumption with TOU pricing, as shown in Table 3 and Fig. 7. The analysis of the different types of residential buildings advances the insight into the generalization capabilities of our proposed method and studies its robustness by testing its behaviour on different probability distributions (see Fig. 4).
6.2.2. Residential to residential transfer

In these experiments we learn one type of residential building energy demand profile and transfer it to another type of residential building with different characteristics. More exactly, we used to train the learning algorithm: (i) a residential building profile without electric heat (R), and (ii) a residential building with electric heat (RH). The prediction results of these two building models can be seen in Tables 4 and 5.
InTables3–5,theRMSEvaluesshowagoodagreementbetween therealvaluesandthemodelestimatedvalues.Inaddition,the confidenceinourresultsisformallydeterminednotjust bythe
Table4
Predictionofresidentialbuildingwithelectricheatconsumptionusingdata col-lectedfromaresidentialwithnon-electricheatbuilding.
Methods RMSE R p-Value
Scenario1
SARSA 0.42 0.88 4.2e−09 Q-learning 0.44 0.87 1.1e−08 SARSAwithDBN 0.42 0.88 5.8e−09 Q-learningwithDBN 0.03 0.99 7.2e−27
Scenario2
SARSA 2.15 −0.18 0.2175 Q-learning 1.93 −0.10 0.4802 SARSAwithDBN 1.25 0.61 3.7e−06 Q-learningwithDBN 0.5 0.64 9.2e−07
Scenario3 SARSA 2.63 −0.27 8.6e−05 Q-learning 2.57 −0.18 0.0094 SARSAwithDBN 2.67 0.13 0.06 Q-learningwithDBN 0.69 0.09 0.1863 Scenario4 SARSA 2.23 0.04 0.2504 Q-learning 2.14 0.11 0.0015 SARSAwithDBN 1.97 −0.09 0.01 Q-learningwithDBN 0.71 −0.10 0.0072 Scenario5 SARSA 0.74 0.62 2.8e−07 Q-learning 0.57 0.62 2.1e−07 SARSAwithDBN 0.03 0.43 4.8e−04 Q-learningwithDBN 0.02 0.51 0.0259
RMSE values, but also by the correlation coefficient and the number of steps predicted into the future. For example, if there is just one step ahead, as in Scenario 1, then the Pearson correlation coefficient needs to be very close to 1 or −1 in order to be considered statistically significant. However, in the case of Scenarios 3 and 4, where the prediction is made over 168 and 672 future steps, a coefficient close to 0 can still be considered highly significant. More discussion about the robustness of the correlation coefficient can be found in [42]. Still, the inaccuracy was reflected in a negative correlation coefficient in 24% of the experiments when we used the simple form of the SARSA and Q-learning methods. By contrast, our two improved approaches, SARSA with DBN extension and Q-learning with DBN extension, show a negative correlation in just 4% of the cases. Overall, the Kolmogorov–Smirnov test in most cases confirms that the data do indeed come from different distributions. This is partially due to the unique characteristics of this dataset, given by the presence of a highly non-linear profile shape and large outlier values, as seen in Fig. 4. All of these observations give a strong argument for employing a more comprehensive examination of the distributions used in the transfer learning. Nevertheless, the results presented in Tables 3–5 demonstrate that the energy prediction accuracy in terms of RMSE significantly improves in 91.42% of the cases after using a DBN for automatically computing high-level features from the unlabelled data, as compared to the situation when the counterpart RL methods are used without any DBN extension.

Table 5
Prediction of residential building consumption with electric heat using data collected from a residential building with non-electric heat, both with ToU pricing.

Methods               RMSE    R       p-Value

Scenario 1
SARSA                 0.50    0.83    1.8e−07
Q-learning            0.16    0.99    1.7e−25
SARSA with DBN        0.28    0.94    6.1e−13
Q-learning with DBN   0.24    0.99    2.0e−21

Scenario 2
SARSA                 1.69    0.33    0.0200
Q-learning            0.91    0.83    3.0e−13
SARSA with DBN        1.42    0.55    4.09e−05
Q-learning with DBN   1.18    0.77    1.0e−10

Scenario 3
SARSA                 2.69    −0.11   0.1205
Q-learning            1.65    0.17    0.0031
SARSA with DBN        1.98    0.27    1.2e−04
Q-learning with DBN   1.55    0.21    0.0167

Scenario 4
SARSA                 2.45    −0.01   0.9477
Q-learning            1.62    0.17    3.7e−06
SARSA with DBN        2.38    0.24    4.7e−11
Q-learning with DBN   1.60    0.28    3.3e−14

Scenario 5
SARSA                 0.67    0.19    0.0014
Q-learning            0.41    0.47    2.0e−04
SARSA with DBN        0.03    0.34    0.006
Q-learning with DBN   0.02    0.42    6.4e−04
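The two significance measures used above, the Pearson correlation coefficient with its p-value and the two-sample Kolmogorov–Smirnov test, can be illustrated with a minimal sketch. The data below are synthetic stand-ins (a gamma-distributed consumption trace and a noisy prediction over a 168-step horizon, the length of Scenario 3), not the paper's dataset:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins for the paper's series: 'observed' hourly consumption
# and a 'predicted' trace over a 168-step horizon (Scenario 3 length).
observed = rng.gamma(shape=2.0, scale=1.5, size=168)
predicted = observed + rng.normal(0.0, 1.0, size=168)

# Pearson correlation coefficient and its two-sided p-value: with more steps
# predicted into the future (a larger sample), a coefficient closer to 0 can
# still be statistically significant.
r, p_r = stats.pearsonr(observed, predicted)

# Two-sample Kolmogorov-Smirnov test: checks whether observed and predicted
# values are drawn from the same underlying distribution.
ks_stat, p_ks = stats.ks_2samp(observed, predicted)

print(f"Pearson r = {r:.2f}, p-value = {p_r:.1e}")
print(f"KS statistic = {ks_stat:.2f}, p-value = {p_ks:.1e}")
```

A small KS p-value here would lead to rejecting the hypothesis of a common distribution, which is the behaviour reported for most of the cross-building experiments.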
Notably, the proposed approach is also suitable when we have access to historical data. In this respect, the results obtained in the first column of Table 3 are expected to be equivalent to the results obtained with supervised learning methods, such as ANN or SVM. Nevertheless, the RMSE obtained using the Q-learning algorithm with the DBN extension for long-term forecasting of building energy consumption (Scenario 5) is more than 90% lower than that of Q-learning without the DBN extension in all the experiments. For example, in Table 4 the RMSE is 0.02 if we use Q-learning with DBN versus 0.57 for Q-learning without DBN, yielding a 96.5% improvement in RMSE.
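The quoted 96.5% figure follows from the relative reduction in RMSE; a small sketch, with a generic `rmse` helper added here for completeness (not taken from the paper):

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Relative improvement quoted in the text: in Table 4, Scenario 5, the RMSE
# drops from 0.57 (Q-learning) to 0.02 (Q-learning with DBN).
improvement = (0.57 - 0.02) / 0.57 * 100
print(f"{improvement:.1f}% improvement")  # -> 96.5% improvement
```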
7. Discussion and conclusion
In this paper, a new paradigm for building energy prediction has been introduced, which does not require historical data from the specific building under scrutiny. In a unified approach, we can successfully learn a building model by including a generalization of the state space domain, and then transfer it across other buildings. The contribution is two-fold. First, we present a Deep Belief Network for automatic feature extraction; second, we extend two standard reinforcement learning algorithms, namely the State-Action-Reward-State-Action (SARSA) algorithm and the Q-learning algorithm, to perform knowledge transfer between domains (building models) by incorporating the states estimated with the Deep Belief Network. The newly proposed machine learning methods for energy prediction are evaluated over different time horizons with different time resolutions using real data. Notably, it can be observed that, as the prediction horizon increases, the SARSA and Q-learning extensions that include a DBN for state estimation are more robust, and their prediction error is approximately 20 times lower than that of their unextended versions. The strength of this method is given by the DBN's generalization capabilities over the underlying state space of a new building, and by the robustness conferred by invariance in the state representation. However, further in-depth investigation can be done at different Smart Grid levels in order to help the transition to the future energy system.
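The combination summarized above can be sketched in miniature. The snippet below is an illustrative outline, not the authors' implementation: a tabular Q-learning update over discretised states, where a simple binning function stands in for the DBN that, in the paper, maps raw continuous measurements to states. All hyper-parameter values and the binning rule are assumptions for the example:

```python
import random
from collections import defaultdict

def extract_state(consumption_kwh, bin_width=0.5):
    """Stand-in for the DBN feature extractor: bin a continuous measurement."""
    return int(consumption_kwh // bin_width)

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # assumed learning hyper-parameters
ACTIONS = range(10)                      # assumed discretised prediction levels

Q = defaultdict(float)                   # Q(s, a) table, default value 0.0

def choose_action(state):
    """Epsilon-greedy action selection over the Q-table."""
    if random.random() < EPSILON:
        return random.choice(list(ACTIONS))
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state):
    """Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```

Because the learned Q-table is indexed by the extracted states rather than by raw measurements, the same table can in principle be reused on another building whose measurements map into the same state space, which is the intuition behind the cross-building transfer.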
Acknowledgements
This research has been funded by NL Enterprise Agency RVO.nl under the TKI Switch2SmartGrids project of the Dutch Top Sector Energy.
References
[1] L. Yang, H. Yan, J.C. Lam, Thermal comfort and building energy consumption implications – a review, Appl. Energy 115 (2014) 164–173, http://dx.doi.org/10.1016/j.apenergy.2013.10.062.
[2] E.A. Bakirtzis, C.K. Simoglou, P.N. Biskas, D.P. Labridis, A.G. Bakirtzis, Comparison of advanced power system operations models for large-scale renewable integration, Electr. Power Syst. Res. 128 (2015) 90–99, http://dx.doi.org/10.1016/j.epsr.2015.06.025.
[3] A. Costa, M.M. Keane, J.I. Torrens, E. Corry, Building operation and energy performance: monitoring, analysis and optimisation toolkit, Appl. Energy 101 (2013) 310–316, http://dx.doi.org/10.1016/j.apenergy.2011.10.037.
[4] M. Simoes, R. Roche, E. Kyriakides, S. Suryanarayanan, B. Blunier, K. McBee, P. Nguyen, P. Ribeiro, A. Miraoui, A comparison of smart grid technologies and progresses in Europe and the U.S., IEEE Trans. Ind. Appl. 48 (4) (2012) 1154–1162, http://dx.doi.org/10.1109/TIA.2012.2199730.
[5] X. Li, D. Gong, L. Li, C. Sun, Next day load forecasting using SVM, in: J. Wang, X.-F. Liao, Z. Yi (Eds.), Advances in Neural Networks – ISNN 2005, Vol. 3498 of Lecture Notes in Computer Science, 2005.
[6] W.-C. Hong, Electric load forecasting by support vector model, Appl. Math. Model. 33 (5) (2009) 2444–2454.
[7] S. Wong, K.K. Wan, T.N. Lam, Artificial neural networks for energy analysis of office buildings with daylighting, Appl. Energy 87 (2) (2010) 551–557.
[8] S.A. Kalogirou, Artificial neural networks in energy applications in buildings, Int. J. Low-Carbon Technol. 1 (3) (2006) 201–216.
[9] T. Mestekemper, G. Kauermann, M.S. Smith, A comparison of periodic autoregressive and dynamic factor models in intraday energy demand forecasting, Int. J. Forecast. 29 (1) (2013) 1–12.
[10] M. Wytock, J.Z. Kolter, Large-scale probabilistic forecasting in energy systems using sparse Gaussian conditional random fields, in: Proceedings of the 52nd IEEE Conference on Decision and Control, CDC 2013, December 10–13, 2013, Firenze, Italy, 2013, pp. 1019–1024.
[11] E. Mocanu, P.H. Nguyen, M. Gibescu, W. Kling, Comparison of machine learning methods for estimating energy consumption in buildings, in: Proceedings of the 13th International Conference on Probabilistic Methods Applied to Power Systems (PMAPS), July 10–13, Durham, UK, 2014.
[12] M. Aydinalp-Koksal, V.I. Ugursal, Comparison of neural network, conditional demand analysis, and engineering approaches for modeling end-use energy consumption in the residential sector, Appl. Energy 85 (4) (2008) 271–296.
[13] L. Xuemei, D. Lixing, L. Jinhu, X. Gang, L. Jibin, A novel hybrid approach of KPCA and SVM for building cooling load prediction, in: Knowledge Discovery and Data Mining, Third International Conference on WKDD '10, 2010.
[14] L. Suganthi, A.A. Samuel, Energy models for demand forecasting – a review, Renew. Sustain. Energy Rev. 16 (2) (2012) 1223–1240.
[15] M. Krarti, Energy Audit of Building Systems: An Engineering Approach, Mechanical and Aerospace Engineering Series, 2nd ed., Taylor & Francis, 2012.
[16] A.I. Dounis, Artificial intelligence for energy conservation in buildings, Adv. Build. Energy Res. 4 (1) (2010) 267–299.
[17] A. Foucquier, S. Robert, F. Suard, L. Stéphan, A. Jay, State of the art in building modelling and energy performances prediction: a review, Renew. Sustain. Energy Rev. 23 (2013) 272–288.
[18] H.-x. Zhao, F. Magoulès, A review on the prediction of building energy consumption, Renew. Sustain. Energy Rev. 16 (6) (2012) 3586–3592.
[19] D. Ernst, M. Glavic, F. Capitanescu, L. Wehenkel, Reinforcement learning versus model predictive control: a comparison on a power system problem, IEEE Trans. Syst. Man Cybern. B: Cybern. 39 (2) (2009) 517–529.
[20] R. Crites, A. Barto, Improving elevator performance using reinforcement learning, in: Advances in Neural Information Processing Systems, vol. 8, MIT Press, 1996, pp. 1017–1023.
[21] H. Ammar, D. Mocanu, M. Taylor, K. Driessens, K. Tuyls, G. Weiss, Automatically mapped transfer between reinforcement learning tasks via three-way restricted Boltzmann machines, in: Machine Learning and Knowledge Discovery in Databases, Vol. 8189 of Lecture Notes in Computer Science, 2013, pp. 449–464.
[22] B. Sallans, G.E. Hinton, Reinforcement learning with factored states and actions, J. Mach. Learn. Res. 5 (2004) 1063–1088.
[23] N. Heess, D. Silver, Y.W. Teh, Actor-critic reinforcement learning with energy-based policies, in: JMLR Workshop and Conference Proceedings: EWRL 2012, 2012.
[24] V. Mnih, K. Kavukcuoglu, D. Silver, A.A. Rusu, J. Veness, M.G. Bellemare, A. Graves, M. Riedmiller, A.K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis, Human-level control through deep reinforcement learning, Nature 518 (7540) (2015) 529–533.
[25] R.S. Sutton, A.G. Barto, Introduction to Reinforcement Learning, 1st ed., MIT Press, Cambridge, MA, USA, 1998.
[26] C.J.C.H. Watkins, P. Dayan, Technical note: Q-learning, Mach. Learn. 8 (3–4) (1992) 279–292.
[27] Y. Bengio, Learning deep architectures for AI, Found. Trends Mach. Learn. 2 (1) (2009) 1–127. Also published as a book, Now Publishers, 2009.
[28] M. Wiering, M. van Otterlo, Reinforcement Learning: State-of-the-Art, Springer, 2012.
[29] L. Busoniu, D. Ernst, B. De Schutter, R. Babuska, Approximate reinforcement learning: an overview, in: 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 2011, pp. 1–8.
[30] M.L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1st ed., John Wiley & Sons, Inc., New York, NY, USA, 1994.
[31] M. Castronovo, F. Maes, R. Fonteneau, D. Ernst, Learning exploration/exploitation strategies for single trajectory reinforcement learning, in: EWRL, Vol. 24 of JMLR Proceedings, 2012, pp. 1–10.
[32] G. Hinton, R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313 (5786) (2006) 504–507.
[33] G.E. Hinton, S. Osindero, Y.-W. Teh, A fast learning algorithm for deep belief nets, Neural Comput. 18 (7) (2006) 1527–1554.
[34] G.W. Taylor, G.E. Hinton, S.T. Roweis, Two distributed-state models for generating high-dimensional time series, J. Mach. Learn. Res. 12 (2011) 1025–1068.
[35] L.J. van der Maaten, E.O. Postma, H.J. van den Herik, Dimensionality reduction: a comparative review, J. Mach. Learn. Res. 10 (1–41) (2009) 66–71.
[36] G.E. Hinton, S. Osindero, A fast learning algorithm for deep belief nets, Neural Comput. 18 (2006).
[37] G.E. Hinton, Training products of experts by minimizing contrastive divergence, Neural Comput. 14 (8) (2002) 1771–1800.
[38] R. Salakhutdinov, Learning deep Boltzmann machines using adaptive MCMC, in: Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, 2010, pp. 943–950.
[39] B.G.E. Company, http://www.supplier.bge.com (last visited 17.10.15).
[40] F.J. Massey, The Kolmogorov–Smirnov test for goodness of fit, J. Am. Stat. Assoc. 46 (253) (1951) 68–78.
[41] G.E. Hinton, A practical guide to training restricted Boltzmann machines, in: Neural Networks: Tricks of the Trade, Lecture Notes in Computer Science, 2nd ed., Springer, 2012, pp. 599–619.
[42] S.J. Devlin, R. Gnanadesikan, J.R. Kettenring, Robust estimation and outlier detection with correlation coefficients, Biometrika 62 (3) (1975) 531–545.