Contents lists available at ScienceDirect

Energy and Buildings

journal homepage: www.elsevier.com/locate/enbuild
Unsupervised energy prediction in a Smart Grid context using reinforcement cross-building transfer learning

Elena Mocanu∗, Phuong H. Nguyen, Wil L. Kling¹, Madeleine Gibescu

Department of Electrical Engineering, Eindhoven University of Technology, 5600 MB Eindhoven, The Netherlands
A R T I C L E   I N F O

Article history:
Received 15 June 2015
Received in revised form 11 December 2015
Accepted 22 January 2016
Available online 16 February 2016

Keywords:
Building energy prediction
Reinforcement learning
Transfer learning
Deep Belief Networks
Machine learning
A B S T R A C T

In a future Smart Grid context, increasing challenges in managing the stochastic local energy supply and demand are expected. This increases the need for more accurate energy prediction methods in order to support further complex decision-making processes. Although many methods aiming to predict energy consumption exist, all of them require labelled data, such as historical or simulated data. Still, such datasets are not always available under the emerging Smart Grid transition and complex user behaviour. Our approach goes beyond the state-of-the-art energy prediction methods in that it does not require labelled data. Firstly, two reinforcement learning algorithms are investigated in order to model the building energy consumption. Secondly, as the main theoretical contribution, a Deep Belief Network (DBN) is incorporated into each of these algorithms, making them suitable for continuous states. Thirdly, the proposed methods yield a cross-building transfer that can target new behaviour of existing buildings (due to changes in their structure or installations), as well as completely new types of buildings. The methods are developed in the MATLAB® environment and tested on a real database recorded over seven years, with hourly resolution. Experimental results demonstrate that the energy prediction accuracy in terms of RMSE has been significantly improved in 91.42% of the cases after using a DBN for automatically extracting high-level features from the unlabelled data, compared to the equivalent methods without the DBN pre-processing.
© 2016 Elsevier B.V. All rights reserved.
1. Introduction
Prediction of energy consumption as a function of time plays an essential role in the current transition to future energy systems. Within the new context of so-called Smart Grids, the energy consumption of buildings can be regarded as a nonlinear time series, depending on many complex factors. The variability introduced by the growing penetration of wind and solar generation sources only strengthens the role of accurate prediction methods [1]. Prediction forms an integral part of the efficient planning and operation of the whole Smart Grid.

On the one hand, advanced energy prediction methods should be easily expandable to various levels of data aggregation at all time scales [2]. On the other hand, they have to automatically adapt their decision strategies to the dynamic behavior of active consumers (e.g. new and smart(er) buildings) [3]. Applications of these new methods should facilitate the transition from the traditional single-tariff grid to time-of-use (TOU) and real-time pricing. The effects will be felt by all players in the grid, from transmission (TSO) and distribution system operators (DSO) to the end-user, including resource assessment and analysis of energy efficiency improvements, flexible demand response (DR), and other continuous projections in planning studies. The joint consideration of decisions regarding new renewable generation, TSO development, and demand-side management (DSM) programs in an integrated fashion requires demand forecasts. Consequently, these will require changes in the way the data are collected and analyzed [4].

∗ Corresponding author.
E-mail address: e.mocanu@tue.nl (E. Mocanu).
¹ Deceased March 14, 2015.
Prior studies have shown that by using statistical methods, more recently inspired by supervised machine learning techniques, such as Support Vector Machines [5,6], Artificial Neural Networks [7,8], Autoregressive models [9], Conditional Random Fields [10], or Hidden Markov Models [11], one can improve the accuracy of energy prediction significantly. On the other hand, there are many methods based on physical principles, involving a large number of building parameters, to calculate thermal dynamics and energy behavior at the building level. Moreover, to shape the evolution of future building systems, there are also some hybrid approaches which combine some of the above models to optimize predictive performance, such as [12–16]. Interested readers are referred to [9,14,17,18] for a more comprehensive discussion on the application of energy demand management.

Nomenclature

α       learning rate, ∀α ∈ [0, 1]
γ       discount factor, ∀γ ∈ (0, 1)
E[·]    expected value operator
A       the set of actions, ∀a ∈ A
D       dataset
R       the reward function
S       the set of states, ∀s ∈ S
T       transition probability matrix
h       vector collecting all the hidden units, h_j ∈ {0, 1}
v       vector collecting all the visible units, v_i ∈ ℝ
W_vh    matrix of all weights connecting v and h
E       total energy function in the RBM model
k       the number of hidden layers
M       building energy consumption model
p, P    probability value/vector
Q       the quality matrix
t       time
Z       normalization function
Although they remain at the forefront of academic and applied research, all these methods require labeled data able to faithfully reproduce the energy consumption of buildings. In the remainder of this paper we refer to the labeled data as the historical (known) data of the analyzed building. Usually the lack of historical data can be compensated by simulated data. Still, both historical and simulated data are employed in these forecasting methods in a non-adaptable way, without considering the future events or changes which can occur in the Smart Grid.

A stronger motivation for this paper is given by the not too well exploited fact that sometimes there are no historical consumption data available for a particular building. From the machine learning perspective this is a typical unsupervised learning problem. One of the most used methods of unsupervised learning, reinforcement learning (RL), was introduced in the power system area to solve stochastic optimal control problems [19]. RL methods are used in a wide range of applications, such as system control [20], playing games or, more recently, transfer learning [19,21]. The advantage of combining reinforcement learning and transfer learning approaches is straightforward. Hence, we want to transfer knowledge from a global to a local perspective, to encode the uncertainty of the building energy demand.

Owing to the curse of dimensionality, these methods fail in high dimensions. More recently, there has been a revival of interest in combining Deep Learning with reinforcement learning. Therein, Restricted Boltzmann Machines were proven to provide a value function estimation [22] or a policy estimation [23]. More than that, Mnih et al. [24] successfully combined deep neural networks and Q-learning to create a deep Q-network which learned control policies in a range of different environments.
In this paper, we comprehensively explore and extend two reinforcement learning (RL) methods to predict the energy consumption at the building level using unlabelled historical data, namely State-Action-Reward-State-Action (SARSA) [25] and Q-learning [26]. Because in their original form both methods cannot handle continuous state spaces well, this paper contributes theoretically by extending them with a Deep Belief Network [27] for continuous state estimation and automatic feature extraction in a unified framework. Our proposed RL methods are appropriate when we do not have historical or simulated data, but want to estimate the impact of changes in the Smart Grid, such as the appearance of a building or several buildings in a certain area, or, more commonly, a change in energy consumption due to building renovation. In this paper, we show the applicability and efficiency of our proposed method in three different situations:
1. In the case of a new type of building being connected to the Smart Grid, thus transferring knowledge from a commercial building to a residential building. Specifically, in Section 6.2.1, four different types of residential buildings were analyzed.
2. In the case of a renovated building, thus transferring knowledge from a non-electric heat building to a building with electric heating.
3. Additionally, we propose experiments to highlight the importance of external factors, such as price information, for the estimation of building energy consumption. In Section 6.2.2, transfer learning is applied from a building under a static tariff to a building with a time-of-use tariff.
To the best of our knowledge, this is the first time that energy prediction is performed without using any information about the target building, such as historical data, energy price, physical parameters of the building, meteorological conditions, or information about user behavior.

The paper is organized as follows. In Section 2 we explain the rationale underlying our approach. Section 3 presents the mathematical modeling of the reinforcement learning approaches. Section 4 describes the novel method to estimate continuous states in reinforcement learning using Deep Belief Networks. The experimental setup and results are illustrated in Sections 5 and 6, respectively. The paper concludes with a discussion and future work.
2. Problem formulation
In this paper, we propose a method to solve the unsupervised energy prediction problem with cross-building transfer by using machine learning time series prediction techniques. In its most general statement, the proposed reinforcement and transfer learning setup is depicted in Fig. 1. Given the unevenly distributed building energy values over time, special attention is first given to the question: How to estimate a continuous state space? The idea is to find a lower-dimensional representation of the energy consumption data that preserves the pairwise distances as well as possible.

Fig. 1. The unsupervised learning setup explores and extends reinforcement and transfer learning by including a Deep Belief Network for continuous state estimation.
More formally, the energy prediction using unlabeled data problem presented in this paper is divided into three different sub-problems, namely:

1. Continuous state estimation problem: Given a dataset, D : ℝ → S, find a confined state space representation S₁.
2. Reinforcement learning problem: Given a building model M₁ = ⟨S₁, A₁, T·(·,·), R₁⟩, find an optimal policy π₁*.
3. Transfer learning problem: Given a model M₁ = ⟨S₁, A₁, T·(·,·), R₁⟩, a reasonable π₁*, and M₂ = ⟨S₂, A₂, T·(·,·), R₂⟩, find a good π₂.
The proposed solution is presented in Section 4, where a new method to estimate continuous states in reinforcement learning using Deep Belief Networks is detailed. Further, this state estimation method is integrated into the SARSA and Q-learning algorithms in order to improve the prediction accuracy.
3. Reinforcement learning
Reinforcement learning [28] is a field of machine learning inspired by psychology, which studies how artificial agents can perform actions in an environment to achieve a specific goal. Practically, the agent has to control a dynamic system by choosing actions in a sequential fashion. The dynamic system, also known as the environment, is characterized by states, its dynamics, and a function that describes the states' evolution given the actions chosen by the agent. After it executes an action, the agent moves to a new state, where it receives a reward (a scalar value) which informs it how far it is from the goal (the final state). To achieve the goal, the agent has to learn a strategy to select actions, dubbed a policy in the literature, in such a way that the expected sum of rewards is maximized over time. Besides that, a state of the system captures all the information required to predict the evolution of the system into the next state, given an agent action. It is also assumed that the agent can perceive the state of the environment without error, and that it makes its current decision based on this information. There are two different categories of RL algorithms: (i) online RL, which are interaction-based algorithms, such as Q-learning [26], SARSA [25], or Policy Gradient, and (ii) offline RL, like Least-Squares Policy Iteration or fitted Q-iteration. For a more comprehensive discussion of RL algorithms we refer to [29]. In the remainder of this paper we will refer only to online RL.
An RL problem can be formalized using Markov decision processes (MDPs). MDPs are defined by a 4-tuple ⟨S, A, T·(·,·), R·(·,·)⟩, where S is a set of states, ∀s ∈ S, A is a set of actions, ∀a ∈ A, T : S × A × S → [0, 1] is the transition function given by the probability that by choosing action a in state s at time t the system will arrive at state s′ at time t + 1, such that T_a(s, s′) = p(s_{t+1} = s′ | s_t = s, a_t = a), and R : S × A × S → ℝ is the reward function, where R_a(s, s′) is the immediate reward (or expected immediate reward) received by the agent after it performs the transition from state s to state s′. An important property of MDPs is the Markov property, which makes the assumption that the state transitions depend only on the last state of the system, and are independent of any previous environment states or agent actions, i.e. p(s_{t+1} = s′, r_{t+1} = r | s_t, a_t) for all s′, r, s_t, and a_t. MDP theory does not assume that S or A are finite, but the traditional algorithms make this assumption. In general, MDPs can be solved using linear or dynamic programming. The interested reader is referred to [30] for a more comprehensive discussion of MDPs. Furthermore, in the real world, the state transition probabilities T·(·,·) and the rewards R·(·,·) are unknown, and the state space S or the action space A might be continuous. Thus, RL represents a natural extension and generalization of MDPs for such situations, where the tasks are too large or too ill-defined to be solved using optimal-control theory [25].
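As a concrete illustration of the formalism above, the following sketch solves a tiny MDP by value iteration, the dynamic-programming approach mentioned above. The two-state, two-action MDP and all its numbers are hypothetical, not taken from the paper:

```python
import numpy as np

# Hypothetical MDP: T[a, s, s'] transition probabilities, R[a, s, s'] rewards.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],    # action 0
              [[0.5, 0.5], [0.4, 0.6]]])   # action 1
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [1.0, 0.0]]])
gamma = 0.9  # discount factor

V = np.zeros(2)
for _ in range(500):
    # Bellman backup: Q(a, s) = sum_s' T[a,s,s'] (R[a,s,s'] + gamma V(s'))
    Q = (T * (R + gamma * V)).sum(axis=2)
    V = Q.max(axis=0)          # V(s) = max_a Q(a, s)
policy = Q.argmax(axis=0)      # greedy policy w.r.t. the converged values
print(V, policy)
```

With known T and R this converges to the optimal value function; the point of RL, as noted above, is precisely that T and R are usually unknown.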
3.1. Q-learning
First, the Q-learning algorithm [26] is recommended as a standard solution in RL where the rules are often stochastic. This algorithm therefore has a function which calculates the Quality of a state-action combination, defined by Q : S × A → ℝ. Before learning has started, the Q matrix returns an initial value. Then, each time the agent selects an action, it observes a reward and a new state, both of which may depend on the previous state and the selected action. The action-value function of a fixed policy π with value function V^π : S → ℝ is

Q^π(s, a) = r(s, a) + γ Σ_{s′} p(s′ | s, a) V^π(s′),  ∀s ∈ S, a ∈ A    (1)

The value of a state-action pair, Q^π(s, a), represents the expected outcome when an agent starts from s, executes a and then follows the policy π afterwards, such that V^π(x) = Q^π(x, π(x)), with the corresponding Bellman equation

Q*(s, a) = r(s, a) + γ Σ_{s′} p(s′ | s, a) max_b Q*(s′, b)    (2)

where the discount factor γ ∈ [0, 1] trades off the importance of immediate and future rewards. Thus, the optimal values are obtained for ∀s ∈ S as V*(s) = max_a Q*(s, a) and π*(s) = argmax_a Q*(s, a). The value of a state-action pair is given by the formal expectation, E, of the expected total return r_t, such that Q^π(s, a) = E(r_t | s_t = s, a_t = a). The off-policy Q-learning algorithm has the update rule defined by

Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α_t [r_{t+1} + γ max_a Q_t(s_{t+1}, a) − Q_t(s_t, a_t)]    (3)

where r_{t+1} is the reward observed after performing a_t in s_t, and where α_t(s, a), with all α ∈ [0, 1], is the learning rate, which may be the same for all pairs. The Q-learning algorithm has problems with large numbers of continuous states and discrete actions. Usually, it needs function approximations, e.g. neural networks, to associate triplets like (state, action, Q-value). Exploration of one MDP can be done under the Markov assumption, taking into account just the current state and action, but because in the real world we have Partially Observable MDPs, we may obtain better results if an arbitrary number k of history states and actions (s_{t−k}, a_{t−k}, . . ., s_{t−1}, a_{t−1}) is considered [31], to clearly identify a triplet ⟨s_t, a_t, Q_t⟩ at time t.
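The tabular form of the update in Eq. (3) can be sketched in a few lines. The learning rate and discount factor below match those later used in Section 6.1.2; the exploration rate, the transitions, and the reward are hypothetical stand-ins, not the building model from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3
alpha, gamma = 0.4, 0.99   # values used in Section 6.1.2
eps = 0.1                  # assumed epsilon-greedy exploration rate
Q = np.zeros((n_states, n_actions))

s = 0
for t in range(2000):
    # epsilon-greedy action selection
    a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
    s_next = int(rng.integers(n_states))   # hypothetical transition
    r = float(s_next == n_states - 1)      # hypothetical reward
    # off-policy update, Eq. (3): bootstrap on max_a' Q(s', a')
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next
print(Q.argmax(axis=1))  # greedy policy per state
```

The `max` in the update is what makes Q-learning off-policy: it evaluates the greedy policy regardless of which action the agent actually takes next.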
3.2. SARSA

An interesting variation of Q-learning is the State-Action-Reward-State-Action (SARSA) algorithm [25], which aims at using Q-learning as part of a Policy Iteration mechanism. The major difference between SARSA and Q-learning is that in SARSA the maximum reward for the next state is not necessarily used for updating the Q-values. Therefore, the core of the SARSA algorithm is a simple value iteration update. The information required for the update is a tuple (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}), and the update is defined by

Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α_t [r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t)]    (4)

where r_{t+1} is the reward and α_t(s, a) is the learning rate. In practice, Q-learning and SARSA are the same if we use a greedy policy (i.e. the agent always chooses the best action), but they differ when the ε-greedy policy is used, which favors more random exploration.

In traditional reinforcement learning algorithms, only MDPs with finite states and actions are considered. However, building energy consumption can take nearly arbitrary real values, resulting in a very large number of states in the MDP. Because energy consumption can be seen as a time series problem, a prior discretization of the state space is not very useful. So, we try to find algorithms that work well with large (or continuous) state spaces, as is shown next.
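For comparison, an on-policy SARSA step (Eq. (4)) differs from the Q-learning update only in bootstrapping on the action actually selected next. A minimal sketch, again with a hypothetical environment and an assumed ε value:

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 5, 3
alpha, gamma, eps = 0.4, 0.99, 0.1   # eps is an assumed exploration rate

Q = np.zeros((n_states, n_actions))

def choose(s):
    # The same epsilon-greedy policy is used for acting and for updating.
    return int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())

s, a = 0, choose(0)
for t in range(2000):
    s_next = int(rng.integers(n_states))   # hypothetical transition
    r = float(s_next == n_states - 1)      # hypothetical reward
    a_next = choose(s_next)
    # on-policy update, Eq. (4): bootstrap on Q(s', a'), not on max_a' Q(s', a')
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
    s, a = s_next, a_next
print(Q.argmax(axis=1))
```

Under a purely greedy policy (ε = 0), `a_next` always equals the argmax and the two updates coincide, exactly as stated above.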
4. State estimation via Deep Belief Networks

Deep architectures [27] have shown very good results in different applications, such as non-linear dimensionality reduction [32], image recognition [33], video sequences, or motion-capture data [34]. A comprehensive analysis of dimensionality reduction and deep architectures can be found in [35]. Overall, Deep Belief Networks (DBNs) can be a way to naturally decompose the problem into sub-problems associated with different levels of abstraction.
4.1. Restricted Boltzmann Machines
DBNs are composed of several Restricted Boltzmann Machines (RBMs) stacked on top of each other [36]. An RBM is a stochastic recurrent neural network that consists of a layer of visible units, v, and a layer of binary hidden units, h. The total energy of the joint configuration of the visible and hidden units (v, h) is given by:

E(v, h) = − Σ_{i,j} v_i h_j W_{ij} − Σ_i v_i a_i − Σ_j h_j b_j    (5)

where i represents the indices of the visible layer, j those of the hidden layer, and W_{ij} denotes the weight connection between the ith visible and jth hidden unit. Further, v_i and h_j denote the states of the ith visible and jth hidden unit, respectively, while a_i and b_j represent the biases of the visible and hidden layers. The first term, Σ_{i,j} v_i h_j W_{ij}, represents the energy between the hidden and visible units with their associated weights. The second term, Σ_i v_i a_i, represents the energy in the visible layer, while the third term represents the energy in the hidden layer. The RBM defines a joint probability over the hidden and visible layers, p(v, h):

p(v, h) = e^{−E(v,h)} / Z    (6)

where Z is the partition function, obtained by summing the energy of all possible (v, h) configurations, Z = Σ_{v,h} e^{−E(v,h)}. To determine the probability of a data point represented by a state v, the marginal probability is used, summing out the states of the hidden layer, such that p(v) = Σ_h p(v, h).

The above equation can be used for any given input to calculate the probability of either the visible or the hidden configuration being activated. These values are further used to perform inference, in order to determine the conditional probabilities in the model. To maximize the likelihood of the model, the gradient of the log-likelihood with respect to the weights must be calculated. The gradient of the first term, after some algebraic manipulations, can be written as

∂ log(Σ_h exp(−E(v, h))) / ∂W_{ij} = v_i · p(h_j = 1 | v)    (7)

However, computing the gradient of the second term is intractable.
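Because the model expectation is intractable, it is standard practice to approximate it with one step of Gibbs sampling (contrastive divergence, CD-1), alternating the RBM's sigmoid conditionals between the layers. A minimal numpy sketch, with hypothetical dimensions and random binary data in place of the energy dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid, lr = 6, 4, 0.1
W = 0.01 * rng.standard_normal((n_vis, n_hid))
a, b = np.zeros(n_vis), np.zeros(n_hid)   # visible / hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

X = (rng.random((100, n_vis)) < 0.5).astype(float)  # hypothetical binary data

for epoch in range(20):
    for v0 in X:
        ph0 = sigmoid(b + v0 @ W)                   # p(h_j = 1 | v)
        h0 = (rng.random(n_hid) < ph0).astype(float)
        pv1 = sigmoid(a + W @ h0)                   # p(v_i = 1 | h)
        v1 = (rng.random(n_vis) < pv1).astype(float)
        ph1 = sigmoid(b + v1 @ W)
        # CD-1: <v h> under the data minus <v h> after one Gibbs step
        W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
        a += lr * (v0 - v1)
        b += lr * (ph0 - ph1)
print(W.shape)
```

The one-step reconstruction statistics stand in for the intractable model expectation; running the chain to equilibrium would recover the exact gradient.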
Fig. 2. A general Deep Belief Network structure with three hidden layers. The top two layers have undirected connections and form an associative memory, where • denotes binary neurons and ◦ represents real values.
The inference of the hidden and visible layers in an RBM can be done according to the following formulas:

p(h_j = 1 | v) = σ(b_j + Σ_i v_i W_{ij})    (8)

p(v_i = 1 | h) = σ(a_i + Σ_j h_j W_{ij})    (9)

where σ(·) represents the sigmoid function. Moreover, to learn an RBM we can use the following learning rule, which performs stochastic steepest ascent in the log probability of the training data [37]:

∂ log(p(v, h)) / ∂W_{ij} = ⟨v_i h_j⟩_0 − ⟨v_i h_j⟩_∞    (10)

where ⟨·⟩_0 denotes the expectations under the data distribution (p_0), and ⟨·⟩_∞ denotes the expectations under the model distribution.

4.2. Deep Belief Networks

Overall, a Deep Belief Network [27] is given by an arbitrary number of RBMs stacked on top of each other. This yields a combination of a partially directed and partially undirected graphical model. Therein, the joint distribution between the visible layer x (input vector) and the l hidden layers h^k is defined as follows:

p(x, h^1, . . ., h^l) = ( Π_{k=0}^{l−2} P(h^k | h^{k+1}) ) P(h^{l−1}, h^l)    (11)

where x = h^0, P(h^k | h^{k+1}) is the conditional distribution for the visible units conditioned on the hidden units of the RBM at level k + 1, and P(h^{l−1}, h^l) is the visible-hidden joint distribution in the top-level RBM. An example of a DBN with 3 hidden layers (i.e. h^1, h^2, and h^3) is depicted in Fig. 2.
The top-level RBM in a DBN acts as a complementary prior for the bottom-level directed sigmoid likelihood function. A DBN can be trained in a greedy unsupervised way, by training each of its RBMs separately, in a bottom-to-top fashion, and using the hidden layer of one RBM as the input layer for the next [38]. Furthermore, the DBN can be used to project our initial states, acquired from the environment, to another state space with binary values, by fixing the initial states in the bottom layer of the model and inferring the top hidden layer from them. In the end, the top hidden layer can be directly incorporated into the SARSA or Q-learning algorithms, as described in Algorithm 1.
Algorithm 1. RL extension including a DBN for state estimation.

 1: %% DBN for state estimation
 2: Initialize DBN
 3: Initialize training set X with the states
 4: for each RBM k in DBN
 5:   repeat (training epoch)
 6:     for each training instance x ∈ X
 7:       Set RBM_k^visible = x
 8:       Run Markov chain in RBM_k
 9:       Get statistics for RBM_k
10:       Update weights for RBM_k
11:     end for
12:   until convergence
13:   for each training instance x ∈ X
14:     Set RBM_k^visible = x
15:     Infer RBM_k^hidden
16:     Replace x in X with RBM_k^hidden
17:   end for
18: end for
19: %% Use the last computed X as states for RL(·)
20: %% RL(1): SARSA algorithm
21: Initialize Q(s, a) arbitrarily, where s ∈ X
22: repeat (for each episode)
23:   Initialize s
24:   Choose a from s using the policy derived from Q
25:   repeat (for each step of the episode)
26:     Take action a, observe r, s′
27:     Choose a′ from s′ using the policy derived from Q
28:     Q(s, a) ← Q(s, a) + α[r + γQ(s′, a′) − Q(s, a)]
29:     s ← s′
30:     a ← a′
31:   until s is terminal
32: until Q optimal
33: %% RL(2): Q-learning algorithm
34: Initialize Q(s, a) arbitrarily, where s ∈ X
35: repeat (for each episode)
36:   Initialize s
37:   repeat (for each step of the episode)
38:     Choose a from s using the policy derived from Q
39:     Take action a, observe r, s′
40:     Q(s, a) ← Q(s, a) + α[r + γ max_{a′} Q(s′, a′) − Q(s, a)]
41:     s ← s′
42:   until s is terminal
43: until Q optimal
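To make the control flow of Algorithm 1 concrete, the sketch below chains a greedy layer-wise CD-1 loop (lines 1-19) with a Q-learning loop over the inferred binary codes (lines 33-43). The input data, the environment's transitions and rewards, and all hyper-parameters except α and γ are hypothetical stand-ins, not the paper's building data:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def train_rbm(X, n_hid, epochs=10, lr=0.1):
    """One RBM trained with CD-1 (short Markov chain), as in lines 5-12."""
    n_vis = X.shape[1]
    W = 0.01 * rng.standard_normal((n_vis, n_hid))
    a, b = np.zeros(n_vis), np.zeros(n_hid)
    for _ in range(epochs):
        for v0 in X:
            ph0 = sigmoid(b + v0 @ W)
            h0 = (rng.random(n_hid) < ph0).astype(float)
            pv1 = sigmoid(a + W @ h0)
            v1 = (rng.random(n_vis) < pv1).astype(float)
            ph1 = sigmoid(b + v1 @ W)
            W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
            a += lr * (v0 - v1)
            b += lr * (ph0 - ph1)
    return W, b

# Lines 1-19: greedy layer-wise training; X is replaced by hidden codes per layer.
X = (rng.random((50, 8)) < 0.4).astype(float)   # hypothetical binarized states
for n_hid in (6, 4):                            # two stacked RBMs
    W, b = train_rbm(X, n_hid)
    X = (sigmoid(b + X @ W) > 0.5).astype(float)  # lines 13-17: infer hidden codes

# Lines 33-43: Q-learning, with the distinct binary codes as discrete states.
idx = {c: i for i, c in enumerate(sorted({tuple(x) for x in X}))}
n_states, n_actions, alpha, gamma = len(idx), 3, 0.4, 0.99
Q = np.zeros((n_states, n_actions))
s = 0
for t in range(1000):
    a_t = int(rng.integers(n_actions)) if rng.random() < 0.1 else int(Q[s].argmax())
    s_next = int(rng.integers(n_states))   # hypothetical environment transition
    r = float(s_next == 0)                 # hypothetical reward
    Q[s, a_t] += alpha * (r + gamma * Q[s_next].max() - Q[s, a_t])
    s = s_next
print(n_states, Q.shape)
```

The key structural point survives the simplifications: the RL loop never sees the raw continuous states, only the top-layer codes produced by the stacked RBMs.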
Now that we have considered the problem of state estimation and have incorporated all three sub-problems into a unified approach, we look into the experimental validation.
5. Dataset characteristics
The proposed solution is experimentally evaluated using a dataset recorded over seven years, more exactly between January 6, 2007 and January 31, 2014. The load profiles, including different residential and commercial buildings, are made available by the Baltimore Gas and Electric Company [39]. For every type of building analyzed, the available historical load data in kWh represent an average building profile per hour. Overall, there are five different building profiles, as presented in Table 1.
For a more comprehensive view of the datasets used in this paper, we show in Fig. 3 the hourly evolution of the electrical energy consumption for a General Service (G) dataset, including Commercial, Industrial & Lighting, and for a residential building with non-electric heat (R), over different time horizons. Moreover, some general characteristics of the entire dataset are graphically depicted in Fig. 4. In all experiments the data were separated into training and testing datasets. More precisely, the data collected from 1st June 2007 until 1st January 2013 (2041 days) were used in the learning phase, and the remaining data, between 1st January 2013 and 31st January 2014 (396 days), were used to evaluate the performance of the methods. The metrics used to assess the quality of the different buildings' energy consumption predictions are described next.

Table 1
Building types in datasets.

Residential
  R          Residential (non-electric heat)
  R (ToU)    Residential Time-of-Use (non-electric heat)
  RH         Residential (electric heat)
  RH (ToU)   Residential Time-of-Use (electric heat)

Commercial
  G          General Service (<60 kW), Commercial, Industrial & Lighting

Fig. 3. Electrical energy consumption for a Commercial, Industrial & Lighting (G) dataset and for a residential building with non-electric heat (R) over different time horizons.
5.1. Metrics for prediction assessment
As we mentioned earlier, the goal is to achieve good generalization by making accurate predictions for new building energy consumption data. Firstly, some quantitative insight into the generalization performance of our approach is obtained using the root mean square error, defined by

RMSE = √( (1/N) Σ_{i=1}^{N} (v_i − v̂_i)² )

where N represents the number of multi-step predictions within a specified time horizon, v_i represents the real value at time step i, and v̂_i represents the model-estimated value at the same time step. Then, by using the Pearson product-moment correlation coefficient (R), insight is given into the degree of linear dependence between the real and the predicted values. Hence

R(u, v) = E[(u − μ_u)(v − μ_v)] / (σ_u σ_v)

where E[·] is the expected value operator and σ_u and σ_v are the standard deviations. The correlation coefficient may take any value within the range [−1, 1]. The sign of the correlation coefficient defines the direction of the relationship, either positive or negative. Finally, we perform the Kolmogorov–Smirnov test [40] in order to gain insight into the statistical significance of our results. The Kolmogorov–Smirnov test has the advantage of making no assumption about the distribution of the data. This elaborate statistical test is not a typical metric used in the analysis of prediction accuracy, but it is imposed by the fact that the learning and testing procedures use different building types. Hence, exceeding the statistical significance level, p < 0.05, would be expected and validates that the data come from different probability distribution functions.

Fig. 4. General characteristics of all datasets: a box-plot with the exact values for mean and standard deviation encoded in it.

Fig. 5. (Left) The RMSE values observed for different RBM configurations in the DBN architecture, with varying numbers of hidden neurons, as a function of training epochs. (Right) Performance metrics for the chosen RBM configuration with 10 hidden neurons.
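The three assessment criteria above can be computed directly with SciPy; a sketch on hypothetical prediction vectors (the vector length 168 mirrors a one-week hourly horizon, but the data are random):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
v_real = rng.random(168)                           # hypothetical real consumption
v_pred = v_real + 0.1 * rng.standard_normal(168)   # hypothetical model output

# Root mean square error over the N multi-step predictions
rmse = np.sqrt(np.mean((v_real - v_pred) ** 2))

# Pearson product-moment correlation coefficient R with its p-value
r, p_pearson = stats.pearsonr(v_real, v_pred)

# Two-sample Kolmogorov-Smirnov test: distribution-free check of whether
# the two samples come from the same distribution
ks_stat, p_ks = stats.ks_2samp(v_real, v_pred)

print(round(rmse, 3), round(r, 3))
```

Note the direction of the KS test in this context: a small p_ks supports the claim that the two samples stem from different distributions, which is the expected outcome for cross-building transfer.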
6. Empirical results

To assess the performance of our extended reinforcement and transfer learning approaches presented in Section 4, we have designed five different scenarios. These are selected to cover various multi-step predictions at different resolutions, and are summarized in Table 2. Before going into a deeper analysis of the numerical results, we first present some details of the implementation.
6.1. Implementation details
The implementation has been done in two parts. Firstly, a DBN is implemented; secondly, the RL algorithms use the DBN in their implementations for continuous state estimation, as shown next.
6.1.1. Continuous state estimation using DBN
We implemented the DBN in MATLAB® from scratch, using the mathematical details described in Section 4. In order to obtain a good prediction, we carefully investigated the choice of the optimal number of hidden units in our DBN configuration with respect to the RMSE evolution; see Fig. 5.
Table 2
Summary of the experiments.

              Notation   Time horizon   Resolution
Scenario 1    S1         1 h            1 h average
Scenario 2    S2         1 day          1 h average
Scenario 3    S3         1 week         1 h average
Scenario 4    S4         1 month        1 h average
Scenario 5    S5         1 year         1 week average
Thus, the number of hidden neurons was set to 10 and the learning rate was 10⁻³. The momentum was set to 0.5 and the weight decay to 0.0002. We trained the model for 20 epochs, but as can be seen in Fig. 5 (right), the model converged after approximately 4 epochs. More details about the optimal choice of the parameters can be found in [41].
6.1.2. SARSA and Q-learning
We implemented SARSA and Q-learning in MATLAB® using the mathematical details described in Section 3. In both cases the learning rate was set to 0.4 and the discount factor to 0.99. Both parameters have a direct influence on the performance of the two algorithms.
The choice of these parameters was made after a thorough examination of the RMSE outcome, as shown for example in Fig. 6. Overall, the learning rate determines to what extent the newly acquired information overrides the old information, and the discount factor determines the importance of future rewards. For example, γ = 0 will make the agent "opportunistic" by only considering current rewards, while a discount factor approaching 1 will make it strive for a long-term high reward.
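The role of the discount factor can be seen on a toy reward sequence (a hypothetical example, not the tariff data): with γ = 0 only the immediate reward counts, while γ close to 1 weights a distant large reward almost fully.

```python
import numpy as np

# A hypothetical episode: a small immediate reward, a large reward 4 steps later.
rewards = np.array([1.0, 0.0, 0.0, 0.0, 10.0])

def discounted_return(r, gamma):
    # G = sum_t gamma^t * r_t
    return float(np.sum(gamma ** np.arange(len(r)) * r))

print(discounted_return(rewards, 0.0))    # 1.0   -> "opportunistic" agent
print(discounted_return(rewards, 0.99))   # ~10.6 -> long-term reward dominates
```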
6.2. Numerical results

In this section, we test and illustrate the two unsupervised learning approaches using the dataset described in Section 5. Different scenarios, as summarized in Table 2, have been created to assess the performance of the proposed models.

Fig. 6. Analysis of RMSE values obtained for different α values in the exploration step, for different scenarios. This involves the prediction of the G dataset (Commercial, Industrial & Lighting consumption, General Service (<60 kW)).
E. Mocanu et al. / Energy and Buildings 116 (2016) 646–655

Table 3
Using the Commercial, General Service (G) (<60 kW) dataset to predict residential energy consumption (R, R (ToU), RH and RH (ToU) values) using SARSA, Q-learning, SARSA with DBN extension and Q-learning with DBN extension.

     Method            G                        R                        R (ToU)                  RH                       RH (ToU)
                       RMSE  R     p-Val        RMSE  R     p-Val        RMSE  R     p-Val        RMSE  R     p-Val        RMSE  R     p-Val
S1   SARSA             0.18  0.90  5.6e−10      0.02  0.99  2.8e−23      0.10  0.93  3.9e−12      0.36  0.91  1.2e−10      0.42  0.88  4.2e−09
     Q-learning        0.22  0.86  2.2e−08      0.02  0.99  2.8e−23      0.04  0.99  2.5e−21      0.34  0.92  3.3e−11      0.34  0.92  3.3e−11
     SARSA+DBN         0.04  0.01  1.7e−05      0.02  0.99  3.4e−20      0.06  0.98  4.7e−14      0.04  0.99  1.2e−23      0.04  0.01  7.8e−26
     Q-learning+DBN    0.01  0.99  6.9e−38      0.03  0.97  7.1e−16      0.09  0.94  2.1e−12      0.04  0.99  1.1e−23      0.02  0.99  5.3e−33
S2   SARSA             0.65  0.36  0.0097       0.75  0.52  1.3e−04      0.47  0.42  0.0029       1.23  0.55  4.3e−05      1.20  0.65  3.5e−07
     Q-learning        1.09  0.17  0.2460       0.98  0.30  0.0358       0.40  0.81  1.4e−12      1.28  0.52  1.4e−04      1.55  0.41  0.0038
     SARSA+DBN         0.38  0.65  1.3e−05      0.37  0.98  0.39         0.37  0.79  0.0014       0.46  0.81  0.0023       0.47  0.67  2.6e−05
     Q-learning+DBN    0.33  0.84  6.2e−14      0.37  0.08  0.5539       0.29  0.74  1.7e−09      0.41  0.50  2.3e−04      0.66  0.26  0.0671
S3   SARSA             1.27  0.12  0.0877       1.73  0.21  0.0026       1.36  0.27  1.1e−04      1.59  0.29  2.6e−05      1.33  0.26  2.3e−04
     Q-learning        1.39  0.09  0.2052       1.10  0.24  7.6e−04      0.83  0.23  0.0012       1.47  0.25  3.1e−04      1.61  0.06  0.3589
     SARSA+DBN         0.69  0.90  1.4e−05      1.31  0.99  0.88         0.55  0.98  0.1623       1.33  0.99  0.7896       1.18  0.99  0.5244
     Q-learning+DBN    0.62  0.38  4.8e−08      0.98  0.11  0.0978       0.58  0.30  2.1e−05      1.26  0.12  0.0950       1.30  0.03  0.5932
S4   SARSA             1.55  0.09  0.0128       3.70  −0.25 1.3e−11      2.39  0.07  0.0361       2.05  0.07  0.0397       1.89  0.16  1.0e−05
     Q-learning        1.41  0.15  6.1e−05      1.24  0.07  0.0404       1.14  −0.04 0.2000       1.67  0.08  0.0309       1.71  0.02  0.4621
     SARSA+DBN         1.14  0.98  2.2e−04      1.45  0.99  0.29         1.17  0.98  0.0025       1.33  0.99  8.51e−05     1.21  0.99  0.1347
     Q-learning+DBN    0.98  0.34  2.5e−20      1.40  0.01  0.8960       0.87  −0.13 5.2e−04      1.52  0.11  0.0022       1.55  0.17  3.2e−06
S5   SARSA             1.01  −0.08 0.5419       2.61  −0.15 0.2484       2.04  −0.20 0.1298       2.16  −0.31 0.0197       1.95  −0.29 0.0276
     Q-learning        0.72  0.30  0.0208       2.28  −0.20 0.1334       1.81  −0.20 0.1267       1.83  −0.33 0.0125       1.59  −0.08 0.5542
     SARSA+DBN         0.05  0.65  1.4e−08      0.08  0.66  5.3e−09      0.10  0.74  6.4e−12      0.11  0.89  6.02e−22     0.24  0.48  8.7e−05
     Q-learning+DBN    0.03  0.02  0.8245       0.02  0.37  0.0031       0.02  0.37  0.0028       0.03  0.22  0.0873       0.03  0.06  0.6315
Fig. 7. Overview of the errors obtained, where (a) using the G dataset we predict R, R (ToU), RH and RH (ToU) values, (b) using R we predict RH, and (c) using R (ToU) we predict RH (ToU). Four methods are used: SARSA, Q-learning, SARSA with DBN extension and Q-learning with DBN extension.
6.2.1. Commercial to residential transfer

In this set of experiments, we use Commercial, Industrial & Lighting data to train the DBN model. Furthermore, we use the trained DBN model to predict four different types of unseen residential building consumption, such as residential with and without electric heat, and residential electric consumption with TOU pricing, as shown in Table 3 and Fig. 7. The analysis of the different types of residential buildings advances the insight into the generalization capabilities of our proposed method and studies its robustness by testing its behaviour on different probability distributions (see Fig. 4).
6.2.2. Residential to residential transfer

In these experiments we learn one type of residential building energy demand profile and transfer it to another type of residential building with different characteristics. More exactly, we used to train the learning algorithm: (i) a residential building profile without electric heat (R), and (ii) a residential building with electric heat (RH). The prediction results of these two building models can be seen in Tables 4 and 5.
InTables3–5,theRMSEvaluesshowagoodagreementbetween therealvaluesandthemodelestimatedvalues.Inaddition,the confidenceinourresultsisformallydeterminednotjust bythe
Table4
Predictionofresidentialbuildingwithelectricheatconsumptionusingdata col-lectedfromaresidentialwithnon-electricheatbuilding.
Methods RMSE R p-Value
Scenario1
SARSA 0.42 0.88 4.2e−09 Q-learning 0.44 0.87 1.1e−08 SARSAwithDBN 0.42 0.88 5.8e−09 Q-learningwithDBN 0.03 0.99 7.2e−27
Scenario2
SARSA 2.15 −0.18 0.2175 Q-learning 1.93 −0.10 0.4802 SARSAwithDBN 1.25 0.61 3.7e−06 Q-learningwithDBN 0.5 0.64 9.2e−07
Scenario3 SARSA 2.63 −0.27 8.6e−05 Q-learning 2.57 −0.18 0.0094 SARSAwithDBN 2.67 0.13 0.06 Q-learningwithDBN 0.69 0.09 0.1863 Scenario4 SARSA 2.23 0.04 0.2504 Q-learning 2.14 0.11 0.0015 SARSAwithDBN 1.97 −0.09 0.01 Q-learningwithDBN 0.71 −0.10 0.0072 Scenario5 SARSA 0.74 0.62 2.8e−07 Q-learning 0.57 0.62 2.1e−07 SARSAwithDBN 0.03 0.43 4.8e−04 Q-learningwithDBN 0.02 0.51 0.0259
RMSE values, but also by the correlation coefficient and the number of steps predicted into the future. For example, if there is just one step ahead, as in Scenario 1, then the Pearson correlation coefficient needs to be very close to 1 or −1 in order to be considered statistically significant. However, in the case of Scenarios 3 and 4, where the prediction is made over 168 and 672 future steps, a coefficient close to 0 can still be considered highly significant. More discussion about the robustness of the correlation coefficient can be found in [42]. Still, the inaccuracy was reflected in a negative correlation coefficient in 24% of the experiments when we used the simple form of the SARSA and Q-learning methods. By contrast, our two improved approaches, SARSA with DBN extension and Q-learning with DBN extension, show a negative correlation in just 4% of the cases. Overall, the Kolmogorov–Smirnov test in most cases confirms that the data do indeed come from different distributions. This is partially due to the unique characteristics of this dataset, given by the presence of a highly non-linear profile shape and large outlier values, as seen in Fig. 4. All of these observations give a strong argument for employing a more comprehensive examination of the distributions used in the transfer learning. Nevertheless, the results presented in Tables 3–5 demonstrate that the energy prediction accuracy in terms of RMSE significantly improves in 91.42% of the cases after using a DBN for automatically computing high-level features from the unlabelled data, as compared to the situation when the counterpart RL methods are used without any DBN extension.

Table 5
Prediction of residential building consumption with electric heat using data collected from a residential building with non-electric heat, both with ToU pricing.

Methods               RMSE    R       p-Value

Scenario 1
SARSA                 0.50    0.83    1.8e−07
Q-learning            0.16    0.99    1.7e−25
SARSA with DBN        0.28    0.94    6.1e−13
Q-learning with DBN   0.24    0.99    2.0e−21

Scenario 2
SARSA                 1.69    0.33    0.0200
Q-learning            0.91    0.83    3.0e−13
SARSA with DBN        1.42    0.55    4.09e−05
Q-learning with DBN   1.18    0.77    1.0e−10

Scenario 3
SARSA                 2.69    −0.11   0.1205
Q-learning            1.65    0.17    0.0031
SARSA with DBN        1.98    0.27    1.2e−04
Q-learning with DBN   1.55    0.21    0.0167

Scenario 4
SARSA                 2.45    −0.01   0.9477
Q-learning            1.62    0.17    3.7e−06
SARSA with DBN        2.38    0.24    4.7e−11
Q-learning with DBN   1.60    0.28    3.3e−14

Scenario 5
SARSA                 0.67    0.19    0.0014
Q-learning            0.41    0.47    2.0e−04
SARSA with DBN        0.03    0.34    0.006
Q-learning with DBN   0.02    0.42    6.4e−04
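The two significance measures used above, the Pearson correlation coefficient with its p-value and the two-sample Kolmogorov–Smirnov test, can be illustrated with a minimal sketch. The data below are synthetic stand-ins (a gamma-distributed consumption trace and a noisy prediction over a 168-step horizon, the length of Scenario 3), not the paper's dataset:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins for the paper's series: 'observed' hourly consumption
# and a 'predicted' trace over a 168-step horizon (Scenario 3 length).
observed = rng.gamma(shape=2.0, scale=1.5, size=168)
predicted = observed + rng.normal(0.0, 1.0, size=168)

# Pearson correlation coefficient and its two-sided p-value: with more steps
# predicted into the future (a larger sample), a coefficient closer to 0 can
# still be statistically significant.
r, p_r = stats.pearsonr(observed, predicted)

# Two-sample Kolmogorov-Smirnov test: checks whether observed and predicted
# values are drawn from the same underlying distribution.
ks_stat, p_ks = stats.ks_2samp(observed, predicted)

print(f"Pearson r = {r:.2f}, p-value = {p_r:.1e}")
print(f"KS statistic = {ks_stat:.2f}, p-value = {p_ks:.1e}")
```

A small KS p-value here would lead to rejecting the hypothesis of a common distribution, which is the behaviour reported for most of the cross-building experiments.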
Notably, the proposed approach is also suitable when we have access to historical data. In this respect, the results obtained in the first column of Table 3 are expected to be equivalent to the results obtained with supervised learning methods, such as ANN or SVM. Nevertheless, the RMSE obtained using the Q-learning algorithm with the DBN extension for long-term forecasting of building energy consumption (Scenario 5) is more than 90% lower than that of Q-learning without the DBN extension in all the experiments. For example, in Table 4 the RMSE is 0.02 if we use Q-learning with DBN versus 0.57 for Q-learning without DBN, yielding a 96.5% improvement in RMSE.
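The quoted 96.5% figure follows from the relative reduction in RMSE; a small sketch, with a generic `rmse` helper added here for completeness (not taken from the paper):

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Relative improvement quoted in the text: in Table 4, Scenario 5, the RMSE
# drops from 0.57 (Q-learning) to 0.02 (Q-learning with DBN).
improvement = (0.57 - 0.02) / 0.57 * 100
print(f"{improvement:.1f}% improvement")  # -> 96.5% improvement
```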
7. Discussion and conclusion
In this paper, a new paradigm for building energy prediction has been introduced, which does not require historical data from the specific building under scrutiny. In a unified approach, we can successfully learn a building model by including a generalization of the state space domain, and then transfer it across other buildings. The contribution is two-fold. First, we present a Deep Belief Network for automatic feature extraction; second, we extend two standard reinforcement learning algorithms, namely the State-Action-Reward-State-Action (SARSA) algorithm and the Q-learning algorithm, to perform knowledge transfer between domains (building models) by incorporating the states estimated with the Deep Belief Network. The newly proposed machine learning methods for energy prediction are evaluated over different time horizons with different time resolutions using real data. Notably, it can be observed that, as the prediction horizon increases, the SARSA and Q-learning extensions that include a DBN for state estimation are more robust, and their prediction error is approximately 20 times lower than that of their unextended versions. The strength of this method is given by the DBN's generalization capabilities over the underlying state space of a new building, and by the robustness conferred by invariance in the state representation. However, further in-depth investigation can be done at different Smart Grid levels in order to help the transition to the future energy system.
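The combination summarized above can be sketched in miniature. The snippet below is an illustrative outline, not the authors' implementation: a tabular Q-learning update over discretised states, where a simple binning function stands in for the DBN that, in the paper, maps raw continuous measurements to states. All hyper-parameter values and the binning rule are assumptions for the example:

```python
import random
from collections import defaultdict

def extract_state(consumption_kwh, bin_width=0.5):
    """Stand-in for the DBN feature extractor: bin a continuous measurement."""
    return int(consumption_kwh // bin_width)

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # assumed learning hyper-parameters
ACTIONS = range(10)                      # assumed discretised prediction levels

Q = defaultdict(float)                   # Q(s, a) table, default value 0.0

def choose_action(state):
    """Epsilon-greedy action selection over the Q-table."""
    if random.random() < EPSILON:
        return random.choice(list(ACTIONS))
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state):
    """Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```

Because the learned Q-table is indexed by the extracted states rather than by raw measurements, the same table can in principle be reused on another building whose measurements map into the same state space, which is the intuition behind the cross-building transfer.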
Acknowledgements
This research has been funded by NL Enterprise Agency RVO.nl under the TKI Switch2SmartGrids project of the Dutch Top Sector Energy.
References
[1] L. Yang, H. Yan, J.C. Lam, Thermal comfort and building energy consumption implications – a review, Appl. Energy 115 (2014) 164–173, http://dx.doi.org/10.1016/j.apenergy.2013.10.062.
[2] E.A. Bakirtzis, C.K. Simoglou, P.N. Biskas, D.P. Labridis, A.G. Bakirtzis, Comparison of advanced power system operations models for large-scale renewable integration, Electr. Power Syst. Res. 128 (2015) 90–99, http://dx.doi.org/10.1016/j.epsr.2015.06.025.
[3] A. Costa, M.M. Keane, J.I. Torrens, E. Corry, Building operation and energy performance: monitoring, analysis and optimisation toolkit, Appl. Energy 101 (2013) 310–316, http://dx.doi.org/10.1016/j.apenergy.2011.10.037.
[4] M. Simoes, R. Roche, E. Kyriakides, S. Suryanarayanan, B. Blunier, K. McBee, P. Nguyen, P. Ribeiro, A. Miraoui, A comparison of smart grid technologies and progresses in Europe and the U.S., IEEE Trans. Ind. Appl. 48 (4) (2012) 1154–1162, http://dx.doi.org/10.1109/TIA.2012.2199730.
[5] X. Li, D. Gong, L. Li, C. Sun, Next day load forecasting using SVM, in: J. Wang, X.-F. Liao, Z. Yi (Eds.), Advances in Neural Networks – ISNN 2005, Vol. 3498 of Lecture Notes in Computer Science, 2005.
[6] W.-C. Hong, Electric load forecasting by support vector model, Appl. Math. Model. 33 (5) (2009) 2444–2454.
[7] S. Wong, K.K. Wan, T.N. Lam, Artificial neural networks for energy analysis of office buildings with daylighting, Appl. Energy 87 (2) (2010) 551–557.
[8] S.A. Kalogirou, Artificial neural networks in energy applications in buildings, Int. J. Low-Carbon Technol. 1 (3) (2006) 201–216.
[9] T. Mestekemper, G. Kauermann, M.S. Smith, A comparison of periodic autoregressive and dynamic factor models in intraday energy demand forecasting, Int. J. Forecast. 29 (1) (2013) 1–12.
[10] M. Wytock, J.Z. Kolter, Large-scale probabilistic forecasting in energy systems using sparse Gaussian conditional random fields, in: Proceedings of the 52nd IEEE Conference on Decision and Control, CDC 2013, December 10–13, 2013, Firenze, Italy, 2013, pp. 1019–1024.
[11] E. Mocanu, P.H. Nguyen, M. Gibescu, W. Kling, Comparison of machine learning methods for estimating energy consumption in buildings, in: Proceedings of the 13th International Conference on Probabilistic Methods Applied to Power Systems (PMAPS), July 10–13, Durham, UK, 2014.
[12] M. Aydinalp-Koksal, V.I. Ugursal, Comparison of neural network, conditional demand analysis, and engineering approaches for modeling end-use energy consumption in the residential sector, Appl. Energy 85 (4) (2008) 271–296.
[13] L. Xuemei, D. Lixing, L. Jinhu, X. Gang, L. Jibin, A novel hybrid approach of KPCA and SVM for building cooling load prediction, in: Knowledge Discovery and Data Mining, Third International Conference on WKDD '10, 2010.
[14] L. Suganthi, A.A. Samuel, Energy models for demand forecasting – a review, Renew. Sustain. Energy Rev. 16 (2) (2012) 1223–1240.
[15] M. Krarti, Energy Audit of Building Systems: An Engineering Approach, Mechanical and Aerospace Engineering Series, 2nd ed., Taylor & Francis, 2012.
[16] A.I. Dounis, Artificial intelligence for energy conservation in buildings, Adv. Build. Energy Res. 4 (1) (2010) 267–299.
[17] A. Foucquier, S. Robert, F. Suard, L. Stéphan, A. Jay, State of the art in building modelling and energy performances prediction: a review, Renew. Sustain. Energy Rev. 23 (2013) 272–288.
[18] H.-x. Zhao, F. Magoulès, A review on the prediction of building energy consumption, Renew. Sustain. Energy Rev. 16 (6) (2012) 3586–3592.
[19] D. Ernst, M. Glavic, F. Capitanescu, L. Wehenkel, Reinforcement learning versus model predictive control: a comparison on a power system problem, IEEE Trans. Syst. Man Cybern. B: Cybern. 39 (2) (2009) 517–529.
[20] R. Crites, A. Barto, Improving elevator performance using reinforcement learning, in: Advances in Neural Information Processing Systems, vol. 8, MIT Press, 1996, pp. 1017–1023.
[21] H. Ammar, D. Mocanu, M. Taylor, K. Driessens, K. Tuyls, G. Weiss, Automatically mapped transfer between reinforcement learning tasks via three-way restricted Boltzmann machines, in: Machine Learning and Knowledge Discovery in Databases, Vol. 8189 of Lecture Notes in Computer Science, 2013, pp. 449–464.
[22] B. Sallans, G.E. Hinton, Reinforcement learning with factored states and actions, J. Mach. Learn. Res. 5 (2004) 1063–1088.
[23] N. Heess, D. Silver, Y.W. Teh, Actor-critic reinforcement learning with energy-based policies, in: JMLR Workshop and Conference Proceedings: EWRL 2012, 2012.
[24] V. Mnih, K. Kavukcuoglu, D. Silver, A.A. Rusu, J. Veness, M.G. Bellemare, A. Graves, M. Riedmiller, A.K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis, Human-level control through deep reinforcement learning, Nature 518 (7540) (2015) 529–533.
[25] R.S. Sutton, A.G. Barto, Introduction to Reinforcement Learning, 1st ed., MIT Press, Cambridge, MA, USA, 1998.
[26] C.J.C.H. Watkins, P. Dayan, Technical note: Q-learning, Mach. Learn. 8 (3–4) (1992) 279–292.
[27] Y. Bengio, Learning deep architectures for AI, Found. Trends Mach. Learn. 2 (1) (2009) 1–127. Also published as a book, Now Publishers, 2009.
[28] M. Wiering, M. van Otterlo, Reinforcement Learning: State-of-the-Art, Springer, 2012.
[29] L. Busoniu, D. Ernst, B. De Schutter, R. Babuska, Approximate reinforcement learning: an overview, in: 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 2011, pp. 1–8.
[30] M.L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1st ed., John Wiley & Sons, Inc., New York, NY, USA, 1994.
[31] M. Castronovo, F. Maes, R. Fonteneau, D. Ernst, Learning exploration/exploitation strategies for single trajectory reinforcement learning, in: EWRL, Vol. 24 of JMLR Proceedings, 2012, pp. 1–10.
[32] G. Hinton, R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313 (5786) (2006) 504–507.
[33] G.E. Hinton, S. Osindero, Y.-W. Teh, A fast learning algorithm for deep belief nets, Neural Comput. 18 (7) (2006) 1527–1554.
[34] G.W. Taylor, G.E. Hinton, S.T. Roweis, Two distributed-state models for generating high-dimensional time series, J. Mach. Learn. Res. 12 (2011) 1025–1068.
[35] L.J. van der Maaten, E.O. Postma, H.J. van den Herik, Dimensionality reduction: a comparative review, J. Mach. Learn. Res. 10 (1–41) (2009) 66–71.
[36] G.E. Hinton, S. Osindero, A fast learning algorithm for deep belief nets, Neural Comput. 18 (2006).
[37] G.E. Hinton, Training products of experts by minimizing contrastive divergence, Neural Comput. 14 (8) (2002) 1771–1800.
[38] R. Salakhutdinov, Learning deep Boltzmann machines using adaptive MCMC, in: Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, 2010, pp. 943–950.
[39] B.G.E. Company, http://www.supplier.bge.com (last visited 17.10.15).
[40] F.J. Massey, The Kolmogorov–Smirnov test for goodness of fit, J. Am. Stat. Assoc. 46 (253) (1951) 68–78.
[41] G.E. Hinton, A practical guide to training restricted Boltzmann machines, in: Neural Networks: Tricks of the Trade, Lecture Notes in Computer Science, 2nd ed., Springer, 2012, pp. 599–619.
[42] S.J. Devlin, R. Gnanadesikan, J.R. Kettenring, Robust estimation and outlier detection with correlation coefficients, Biometrika 62 (3) (1975) 531–545.