
Master’s Thesis Operations Research

Scheduling a heatpump under stochastic power supply and demand

A.F. Konst (s2157187), January 15, 2016


The Paris Agreement of December 12, 2015,¹ states that countries must stimulate the use of renewable energy. However, problems arise if this renewable energy is added to the current grid, since renewable energy is often highly volatile and uncertain.

For this reason, so-called smart grids are introduced. The idea of smart grids is to shift demand from periods of high demand to periods of low demand. Smart grids rely on the law of large numbers, so these systems are only effective at a societal level. Therefore, smart grids are not implemented that easily. This paper considers a smart solution for the efficient use of renewable energy on a household level. To be precise, it considers the combination of a photovoltaic system and a heat pump which stores the produced electricity in the form of heat. The aim is to schedule the heat pump, i.e., to determine when to switch the heat pump on or off, in such a way that it uses less electricity from the grid and more renewable energy from the photovoltaic system. The proposed algorithm, the simulation algorithm, takes the future into account and therefore deals with the stochasticity in both demand and supply for energy. It relies heavily on the so-called approximated dynamic programming (ADP) algorithm, which schedules the heat pump under the assumption that future supply and demand for energy are known. The ADP algorithm gives solutions that are only slightly worse than the optimal solution in a deterministic setting, and the simulation algorithm outperforms a greedy algorithm, saving a household €239.46 on the energy bill for the months June to December of one year.

1“Framework Convention on Climate Change”. United Nations FCCC Int. United Nations. 12 December 2015.


Master’s Thesis Operations Research

Supervisors:
dr. W. Romeijnders (University of Groningen)
drs. M.J. Konsman (TNO)
W.E. Wijbrandi, MSc (TNO)


1. Introduction

The world’s stock of fossil fuels is decreasing rapidly, and the use of these fuels causes pollution that negatively influences the earth and its inhabitants. This problem motivates researchers to develop more eco-friendly sources of power, the so-called renewable energy resources. Examples of renewables are wind turbines for wind power, hydropower plants for hydropower, and photovoltaic systems for solar power.

Many of the renewable energy resources depend on weather conditions and as a result they have a volatile power output that causes problems in the grid. In the grid, demand must always equal supply. Whenever supply exceeds demand, the grid will be overloaded and will shut down.

Conversely, it is forbidden by law for an energy supplier to not meet demand. A solution to the volatility in power output is to store surplus energy and use this energy later on. However, storing energy is inefficient and expensive. Another idea is to adjust the demand instead of the supply. This is the idea behind the so-called smart grids: create flexibility by shifting demand for devices such as electric vehicles and washing machines to a later point in time. These smart grids rely on the law of large numbers, i.e., the more households that are involved, the more stable the demand for electricity and the lower the probability of extreme values. However, integrating such a system in society takes time. Therefore, there is room and a need for a smart system alongside the upcoming smart grid system that works on a household level.

To decrease the volatility in the grid caused by all renewable energy produced by households, the German government has introduced a subsidy that is awarded to citizens who directly consume energy produced by themselves. Furthermore, the price for selling residual self-produced energy to the grid is lower than the price for purchasing electricity from the grid, such that storing energy on the grid is not beneficial. Hence, a single household could lower its energy bill by installing a renewable device and by optimizing the way it uses its electric devices.

Since storing energy on the grid is not beneficial, one could look for other methods to store energy. The largest energy demand in a household is the demand for heat. Heat, in contrast to electricity, can be stored efficiently and can therefore be used as a storage medium for energy. In contrast to the smart grid, the flexibility does not lie in shifting demand, but in storing the energy in the form of heat, which can be consumed later on.

To create heat, a so-called heat pump (HP) is used. A heat pump is an electric device that extracts heat from outside air into a water heating circuit. In general, a heat pump has an efficiency between 300% and 400%. An efficiency of 300% means that for each electric Joule of input, the heat pump creates three Joules of heat as output. This phenomenon is captured in the coefficient of performance (COP). A heat pump has a minimum period of time it must be on to run at its full potential (i.e., to reach the maximum level of COP) and a minimum period of time it must be off to cool down properly and avoid wear.
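The COP arithmetic above can be made concrete with a small sketch. This is not code from the thesis: the function name, the efficiency bonus value, and the numbers are illustrative assumptions, mirroring the (1 + λ)·β boost that the thesis later applies once the pump has run for its minimum on-time.

```python
# Illustrative sketch (not from the thesis): heat delivered by a heat pump
# for a given electrical input and COP. The lam bonus models the higher
# efficiency reached after the pump has been on for mu_on time units.

def heat_output(W: float, cop: float, warmed_up: bool, lam: float = 0.1) -> float:
    """Heat (in Joules) produced from W Joules of electricity.

    cop: coefficient of performance, e.g. 3.0 means 300% efficiency.
    warmed_up: True once the pump has run long enough for the COP
               to increase by lam * 100%.
    """
    effective_cop = cop * (1 + lam) if warmed_up else cop
    return effective_cop * W

print(heat_output(1.0, 3.0, warmed_up=False))  # 3.0 Joules of heat
print(heat_output(1.0, 3.0, warmed_up=True))   # 3.3 Joules of heat
```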

The problem discussed in this paper considers a household that has installed both a solar photovoltaic system (PV) and an air-to-water heat pump (HP). The aim is to efficiently schedule the use of the heat pump, i.e., to determine when to switch the HP on or off, using the least possible amount of electricity from the grid whilst ensuring the demand for heat is satisfied.

This paper discusses five different algorithms. First, a greedy approach based on the PowerMatcher of TNO, which turns the heat pump on as soon as solar energy is available. However, it does not take into account that the heat pump must be on for a minimum period of time. Therefore, situations arise where the heat pump is on even if there is no solar energy available.

Hence, the future supply of solar energy must be taken into account. The main algorithm introduced in this paper, called the simulation algorithm, schedules the heat pump while taking the uncertain future into account. The simulation algorithm is based on an algorithm that assumes all future data is known, which will therefore be referred to as the deterministic algorithm or the Approximated Dynamic Programming (ADP) algorithm. The simulation algorithm in the stochastic setting will be compared to a Markov Decision Process (MDP) based algorithm and the greedy algorithm. The underlying ADP algorithm will be compared to an optimal algorithm in a deterministic setting, and to the greedy algorithm as well, since the greedy algorithm is a real-time algorithm and therefore suitable for both the stochastic and the deterministic setting.

First, a literature review will be provided in Section 2. In Section 3, the problem will be defined, followed by a description of the deterministic algorithm, or ADP algorithm, in Section 4. In Section 5, the simulation algorithm will be defined, and in Section 6 the results of both the ADP algorithm and the simulation algorithm will be discussed. Section 7 concludes the paper.

2. Literature review

Cao, Hasan, and Siren (2014) introduce the combination of a heat pump and renewable energy, and state that research is needed to bring this system to its full potential. However, they introduce this system in an office building, which has a demand distribution that differs from a residential building. The research done on a residential level is limited to the paper of Vialetto, Noro, and Rokni (2015). However, they make use of solid oxide fuel cells instead of a PV system. Solid oxide fuel cells make use of natural gas and are therefore not dependent on weather conditions. Furthermore, the purpose of our paper is to make use of renewable energy in an efficient way, so the paper of Vialetto et al. (2015) cannot be applied here. Franco and Fantozzi (2016) investigated the mutual interaction of a PV system and a heat pump through an experimental analysis. In their conclusion they state:

“The promotion of self-consumption strategies and in general the use of heat pumps in connection with PV plants represents surely the new frontier for the future development of systems based on renewable energy.”

To the best of our knowledge, there are no papers about the decision-making algorithm for a problem with a heat pump and a renewable energy source dependent on the weather. In the literature about smart grids, it is stated by Bakker (2011) that smart grid solutions can roughly be divided into two groups: agent-based market systems, such as the PowerMatcher by TNO researcher Kok (2013), and discrete mathematical optimizations, such as Triana by Bakker (2011).

The idea of the PowerMatcher is to let every household place a bid on a fictional market. An agent collects all the bids, observes the supply of renewable energy and returns a price for which demand equals supply. All households with a bid lower than the price returned by the agent will not consume electricity, and all others do. The strength of the PowerMatcher lies in its scalability and its simplicity. It makes use of the fact that the demand for electricity of a whole country does not contain huge outliers, is therefore balanced and can be forecasted well, i.e., the PowerMatcher relies on the law of large numbers. The demand of a single household does have these huge outliers, and differences in demand from one minute to the next can be very large. Such an outlier of a single household will, in turn, not be noticeable on a bigger scale.
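The market-clearing idea described above can be sketched in a few lines. This is a toy illustration, not the actual PowerMatcher: the function name is hypothetical, each household is assumed to demand exactly one unit, and supply is a fixed number of units.

```python
# Toy sketch of agent-based market clearing (not the real PowerMatcher):
# each household submits a bid (the price it is willing to pay); the agent
# returns the highest price at which demand does not exceed supply.

def clearing_price(bids, supply_units):
    """Households bidding strictly above the returned price consume now;
    the others wait. Assumes one unit of demand per household."""
    ranked = sorted(bids, reverse=True)
    if supply_units >= len(ranked):
        return 0.0                      # enough supply for everyone
    return ranked[supply_units]         # price of the first rejected bid

bids = [0.30, 0.10, 0.25, 0.05, 0.20]
price = clearing_price(bids, supply_units=3)
print(price)                            # 0.10: only the three highest bids clear
```

With three units of supply, exactly the three households bidding above 0.10 consume, so demand equals supply at the returned price.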

If the PowerMatcher were to operate in a single household with a HP, as in our problem, it would only look at the current demand for heat in real time, disregarding future demand. Furthermore, it would turn the HP on whenever solar energy is available, disregarding the possibility that this supply could be only a brief peak: a cloud could appear in the next minute, leading to the use of electricity from the grid, which is not the idea of an eco-friendly smart system. The PowerMatcher only operates in real time, and this causes problems on the scale of a household, where demand and supply are very unpredictable. The algorithm introduced in this paper is compared to the performance of the PowerMatcher. The decisions the PowerMatcher makes on a household level will be referred to as the greedy algorithm.

Triana does take future demand and supply for energy into account by the use of dynamic programming. Based on a forecast of supply, it sends temporary prices for energy for the next time period to all households in the Triana system. Households optimize their consumption based on these prices and return the result to Triana. Triana again sends temporary prices for the next time period, based on forecasts of supply and the information from the households, and this continues until demand equals supply at every point in time in the next time period. Although Triana does take future demand and supply into account, it relies on the law of large numbers and is therefore not directly applicable to our problem.

A smart grid system that does take the future into account and in fact works on a smaller scale is the system discussed by Ströhle, Gerding, de Weerdt, Stein, and Robu (2014). The problem discussed there makes a schedule for different devices that are currently demanding electricity. This schedule is based on scenarios, and devices that are frequently scheduled at the beginning of these different scenarios will be turned on, while the others have to wait. For the scheduling of the deterministic scenarios, Ströhle et al. (2014) use a greedy algorithm to speed up the overall algorithm. Note that within each scenario the problem is deterministic, so the scheduling algorithm will be referred to as a deterministic algorithm. This idea of scenarios will be used in this paper, referred to as the simulation algorithm, and the deterministic algorithm will be replaced by dynamic programming as introduced by Bellman (1957). The choice for dynamic programming is based on the structure of the problem, since the problem contains decision epochs and the number of decision epochs can easily be limited.

3. Problem Formulation

We introduce a household having a heat pump (HP), a buffer for warm water, and a solar photovoltaic system (PV). Time is discretized, and at each point in time t a decision has to be made on whether to turn the HP on or off. At time t, the household has a stochastic demand for warm water in the form of tap water and central heating, denoted by TWt and CHt, respectively. Furthermore, besides the HP, there are other electric devices that demand electricity; the amount of demanded electricity is denoted by the random variable Et. The supply of renewable energy from the PV is stochastic as well and is denoted by PVt for each time period t.

The HP uses a buffer to store hot water. The temperature in this buffer is limited by a maximum temperature Tmax and a minimum temperature Tmin, both expressed in Joules.

The HP cannot be turned off when it has been on for less than µon time units, and cannot be turned on when it has been off for less than µoff time units. When the HP is on, it requires W amount of electric power. The coefficient of performance (COP), denoted by β, is the ratio of the amount of heat provided (in Joules) to the power W required. Furthermore, after the HP has been on for µon time units, indicated by the indicator variable yt, the HP runs at a higher efficiency, increasing the COP by λ × 100%. Let xt be a binary variable that is 1 if the HP is on in period t and 0 otherwise. The temperature of the buffer at time t + 1 is defined as

Tt+1 = (1 − c)·Tt − TWt − CHt + (1 + λ·yt) · β · W · xt,   (1)

with

yt = 1 if Σ_{i=t−µon}^{t−1} xi = µon, and yt = 0 otherwise,

and 0 ≤ c ≤ 1 is the cooling parameter, denoting the adjustment of the temperature in the buffer to the temperature of its environment. The cooling relationship is assumed to be linear since the buffer is well insulated. The simulation algorithm, to be defined in Section 5, will determine the values of xt for t ∈ {1, . . . , H}, with H denoting the fixed time horizon.
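The buffer dynamics of equation (1) can be sketched directly. The parameter values below (cooling rate, λ, β, W) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Sketch of the buffer update in equation (1). All parameter values are
# illustrative assumptions, not values from the thesis.

def next_temperature(T, TW, CH, x, y, c=0.01, lam=0.1, beta=3.0, W=1.0):
    """T_{t+1} = (1 - c)*T_t - TW_t - CH_t + (1 + lam*y_t) * beta * W * x_t."""
    return (1 - c) * T - TW - CH + (1 + lam * y) * beta * W * x

T = 100.0
# HP off (x=0): the buffer only cools and serves demand.
print(next_temperature(T, TW=1.0, CH=2.0, x=0, y=0))  # 96.0
# HP on and warmed up (x=1, y=1): heat is added at the boosted COP.
print(next_temperature(T, TW=1.0, CH=2.0, x=1, y=1))  # 99.3
```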

To calculate the associated costs per time unit, we introduce three different prices for electricity, as are currently active in Germany. The price of purchasing electricity from the grid is denoted by pd and the price for selling electricity to the grid by ps. Remarkably, ps < pd, so self-produced electricity is worth less than electricity produced by energy companies. By setting ps lower than pd, the German government stimulates the direct usage of self-produced electricity rather than storing the electricity on the grid. To stimulate this even further, a subsidy, pu, is introduced for every amount of self-produced electricity that is consumed immediately. The prices pu, ps and pd are per Joule × time unit. Another cost associated with the problem is the cost of wear and tear from turning the HP on and off. Since the number of times the HP is turned on equals the number of times it is turned off, a fixed wear cost w ≥ 0 is added each time the HP is turned on. The profit at time t equals

Πt(xt) = pu · min{PVt, Et + W·xt} + ps · [PVt − Et − W·xt]+ − pd · [Et + W·xt − PVt]+ − xt(1 − xt−1)·w,   (2)

with pu, ps, pd > 0 and ps < pd. The first term in (2) describes the profit associated with the self-produced electricity that is immediately used, i.e., the sum of the electricity used by other devices Et and the power used by the HP, restricted by the total amount of self-produced energy available at time t. The second term describes the amount of self-produced energy that is not consumed and is sold to the grid. The third term constitutes the costs in the case that the self-produced energy does not cover demand, i.e., the shortage that must be purchased from the grid. Lastly, whenever xt = 1 and xt−1 = 0, i.e., when the HP is turned on, a fixed wear cost is incurred.
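The term-by-term structure of the profit in (2) can be sketched as follows. The prices and wear cost below are made-up values; only the structure (subsidy on directly used solar, sale of surplus, purchase of shortage, one-off wear cost) follows the text.

```python
# Sketch of the per-period profit of equation (2); all prices and the
# wear cost are hypothetical illustration values.

def profit(PV, E, x, x_prev, W=1.0, p_u=0.05, p_s=0.10, p_d=0.25, w=0.02):
    consumed = E + W * x                     # total household consumption
    subsidy = p_u * min(PV, consumed)        # self-produced power used directly
    surplus = max(PV - consumed, 0.0)        # sold to the grid at p_s
    shortage = max(consumed - PV, 0.0)       # bought from the grid at p_d
    wear = w if (x == 1 and x_prev == 0) else 0.0
    return subsidy + p_s * surplus - p_d * shortage - wear

# Sunny period, HP just switched on: subsidy on all 1.5 units consumed,
# the remaining 1.5 units of solar sold, plus a one-off wear cost.
print(round(profit(PV=3.0, E=0.5, x=1, x_prev=0), 4))  # 0.205
```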

4. Deterministic Algorithm

For the deterministic version of the problem, it is assumed that all stochastic variables TWt, CHt, Et and PVt are known in advance. For each moment in time, a decision has to be made on whether to turn the HP on or off, i.e., determine the value of xt for all t, while taking the constraints on the buffer and the HP into account as described in Section 3.

4.1. Approximated Dynamic Programming

The restrictions on the buffer (the temperature should be between Tmin and Tmax) and on the HP (at least µon time units on and at least µoff time units off) limit the number of feasible solutions. To fully exploit this advantage, we make use of dynamic programming. To speed up the algorithm, not all possible solutions are taken into account, which is why our heuristic is denoted as the Approximated Dynamic Programming (ADP) algorithm. Dynamic programming is a method that divides the problem into subproblems. Our problem is divided into decision epochs: moments in time at which the algorithm must determine whether to put the HP on or off. The heuristic selection of which decision epochs to evaluate is further explained in Section 4.1.3.

The decision epochs are visualized in Figure 2. Each circle denotes a moment in time t ∈ {1, . . . , H}, and therefore a decision epoch, since the value of xt must be determined at that point. Each decision epoch carries information about the current state of the problem. A state is a unique characterization of all information of the modelled problem that is relevant to the decision. A state in our problem is defined by the pair (T, t) ∈ S of the temperature in the buffer, T, and the point in time, t, in the state space S.


Due to the constraints, in many states (T, t) there is only one feasible value for xt, and therefore no decision needs to be made. Let S1 and S0 be the sets of all states for which xt must equal 1 or 0, respectively. The extended definition of these sets can be found in Section 4.1.6. To decrease the number of decision epochs, the ADP algorithm only considers decision epochs with values for t that satisfy (3), i.e., the states the proposed algorithm encounters are such that in the previous period the HP was on and the constraint that the HP must be on for at least µon time units is satisfied. This decreases the number of decision epochs rapidly, since it is assumed that the HP is off for the greater part of the day.

Σ_{i=t−µon}^{t−1} xi ≥ µon   (3)

4.1.1. Intuition

Figure 2 serves to give an intuition of what the ADP algorithm does. Assume that the ADP algorithm is at the decision epoch with t = t0 and T = T0. Moreover, assume that the current state (T0, t0) ∉ S1 ∪ S0 and that (3) holds.

Figure 2: Possible scenario of a subproblem of the ADP algorithm with µon = 4 and µoff = 3. Each circle indicates a decision epoch at which the HP is on or off. It is assumed that the HP has been on in the previous period; if it is left off entirely, the red dotted line indicates the point where the temperature in the buffer would drop below the minimum temperature, denoted as the critical point or τmax. The point τmin is the first point at which the HP can be turned on again, taking µoff into account. If the HP is left on in the next period, the algorithm repeats itself at t1. Otherwise, the HP will be turned off and turned on again at τ ∈ {τmin, . . . , τmax}, which will be the point with the most consecutive surplus solar energy, indicated by the blue rectangle. Hence, the algorithm will be called upon again at either t1 or t12.

At t0 the ADP has two choices: either leave the HP on (xt1 = 1) or turn the HP off (xt1 = 0). In the first case, the next decision epoch the ADP encounters is at t1. Since (3) holds for (T0, t0) and xt1 = 1 in this case, (3) holds for this decision epoch as well, i.e., the constraint that the HP must be on for at least µon time units is already satisfied and does not restrict the decision on xt1. In the second case, the HP is turned off. Since (3) should always hold, the ADP algorithm determines the next point in time at which the HP will be turned on again. This point will be denoted by τ. For all points in time between t0 and τ − 1, the HP will be off, and for all points in time between τ and τ + µon − 1 the HP has no choice but to be on, since the HP must be on for at least µon time units. Therefore, the next decision epoch will be at τ + µon, again with (3) satisfied. At each new decision epoch the algorithm will again choose between leaving the HP on for one more time unit or turning it off. Therefore, the ADP algorithm can be seen as making jumps through time, moving from one decision epoch to the next while always ensuring (3) holds. Of course, with every jump, the temperature in the buffer must be updated according to (1).

4.1.2. Value function

To distinguish between the two cases of either moving to t + 1 or to τ + µon, the value function introduced by Bellman (1957) is used, which is given by

V(t) = Πt(1) + V(t + 1),   if xt = 1,

V(t) = Σ_{i=t}^{τ−1} Πi(0) + Σ_{j=τ}^{τ+µon−1} Πj(1) + V(τ + µon),   if xt = 0,   (4)

where Πt(xt) is defined in (2). Hence, the value function consists of the observed profit when taking a certain action plus the future profit. The value of xt which gives the highest value of V(t) will be chosen. Note that V(t + 1) and V(τ + µon) can only be obtained by recursively calling the algorithm. The last value, which is associated with the termination of the algorithm, will be discussed in Section 4.1.5.
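The shape of the recursion in (4) can be sketched as below. This is a deliberately simplified illustration, not the thesis's Algorithm 1: the per-period profit `Pi` and the selection of τ are stubs, and the temperature state and feasibility sets are omitted entirely.

```python
# Minimal sketch of the recursion in equation (4). Pi stands in for the
# profit of equation (2) and choose_tau for the heuristic of Section
# 4.1.3; both are stubs, and the buffer state is omitted for brevity.

MU_ON = 4   # minimum number of on-periods (illustrative)
H = 12      # time horizon (illustrative)

def Pi(t, x):                  # stub per-period profit
    return 1.0 if x == 1 else 0.0

def choose_tau(t):             # stub for the heuristic selection of tau
    return min(t + 3, H)

def V(t):
    if t > H:
        return 0.0
    # Case x_t = 1: stay on, next decision epoch at t + 1.
    stay_on = Pi(t, 1) + V(t + 1)
    # Case x_t = 0: off until tau, forced on for MU_ON units, resume after.
    tau = choose_tau(t)
    off_until_tau = sum(Pi(i, 0) for i in range(t, tau))
    forced_on = sum(Pi(j, 1) for j in range(tau, min(tau + MU_ON, H + 1)))
    turn_off = off_until_tau + forced_on + V(tau + MU_ON)
    return max(stay_on, turn_off)

print(V(1))  # 12.0: with this stub profit, staying on every period wins
```

With the stub profit rewarding every on-period equally, the recursion simply keeps the HP on for all H periods; the real algorithm uses the profit (2), so the two branches genuinely compete.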

4.1.3. Determination of τ

The point τ is the point at which the ADP algorithm will turn the HP on again after xt = 0. To determine its value, first a feasible set of points in time is created to make sure that τ does not cause any constraint violation. The first feasible value for τ is τmin = t + µoff, since the HP must be off for at least µoff time units. Furthermore, τ is also bounded by a maximum value, denoted by τmax, which is based on the constraint that the temperature in the buffer must stay above Tmin. When the HP is left off, the temperature in the buffer decreases over time due to demand and cooling, and will therefore at some point drop below the minimum temperature. Let τmax be referred to as the critical point. The critical point is the last point in time for which leaving the HP off from t to τmax will not cause T to drop below Tmin. Hence, the HP must have been on at least once before τmax to cover demand and not violate the minimum temperature. In Figure 2 the critical point is visualized as the red dotted line. The calculation of τmax is stated in (5), where Tt is known at the current time t. The parameter φ in (5) is initially zero and, only in the case of a considerably large amount of demand (i.e., TWτmax+1 + CHτmax+1 > β · W), will τmax be updated with the value of φ. The parameter φ indicates how many time units earlier the critical point τmax should lie in the case that the amount of demand after τmax is so great that the buffer must contain more heat beforehand. This will be further explained in Section 4.1.4. The parameter φ is initially zero to calculate τmax, after which φ is calculated as described in Section 4.1.4 and, finally, τmax is updated with this value of φ.

τmax = max{ j ∈ {t, . . . , H} : Tj > Tmin } − [φ]+,   (5)

with

Tj+1 = (1 − c)Tj − TWj − CHj,   ∀j ∈ {t, . . . , H}.

However, not all values between τmin and τmax are feasible values for τ, due to the constraint that the temperature is not allowed to exceed Tmax. Let M ⊂ {τmin, . . . , τmax} contain the values of τ that would violate the maximum temperature restriction, i.e., turning the HP on at i ∈ M would cause one of the points j ∈ {i, . . . , i + µon} to have a temperature Tj > Tmax. The temperature in the buffer decreases over time, so only values of τ at the beginning of the interval are found in M. Hence, M eliminates lower values of τ and effectively creates a new τmin. The point τ can only be chosen from {τmin, . . . , τmax}\M.

Dynamic programming is an approach where all possible decisions are evaluated according to the value function in (4). This means that a true dynamic program would evaluate the value functions belonging to all τ ∈ {τmin, . . . , τmax}\M, which are all feasible points to turn the HP on again. However, to speed up the algorithm, τ is heuristically selected beforehand from {τmin, . . . , τmax}\M. The chosen τ is not necessarily the optimal τ, i.e., τ is not necessarily the point which gives the greatest value according to the value function.

The point τ is chosen based on the available amount of renewable energy. That is, τ is selected such that the HP will be on during the period of length µon in {τmin, . . . , τmax} which has the greatest amount of surplus solar energy after the electricity demand of other devices, Et, is subtracted. In this way, in the short run the HP uses as much solar energy as possible. The calculation of τ equals

τ = arg max_i { Σ_{j=i}^{i+µon−1} [PVj − Ej]+ : i ∈ {τmin, . . . , τmax}\M }.   (6)

Now, if the HP is decided to be off for the first period, xt, . . . , xτ−1 = 0 and xτ, . . . , xτ+µon−1 = 1. The next decision epoch will be at τ + µon, for which, given the values of xt, . . . , xτ+µon−1, (3) holds. To make things clear, consider Figure 2 again. The values above the time units represent the surplus solar energy available, which is PVt − Et for t ∈ {t0, . . . , t16}. The maximum sum of surplus solar energy over four consecutive time units is indicated by the blue rectangle. Hence, τ will be at t9 and the next decision epoch at t12, which is τ + µon.
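The window-selection heuristic of (6) can be sketched as below. The data values are made up for illustration; the function simply scans all feasible start points and picks the µon-length window with the largest total surplus.

```python
# Sketch of the heuristic in equation (6): among feasible start points,
# pick the one whose mu_on-length window has the most surplus solar
# energy. The pv and e series are illustrative, not real data.

MU_ON = 4

def select_tau(pv, e, tau_min, tau_max, excluded=frozenset()):
    feasible = [i for i in range(tau_min, tau_max + 1) if i not in excluded]
    def surplus(i):
        # Total positive surplus over the window {i, ..., i + MU_ON - 1}.
        return sum(max(pv[j] - e[j], 0.0) for j in range(i, i + MU_ON))
    return max(feasible, key=surplus)

pv = [0, 0, 1, 2, 4, 5, 5, 3, 1, 0, 0, 0, 0]   # solar forecast
e  = [1] * 13                                   # other electricity demand
print(select_tau(pv, e, tau_min=2, tau_max=8))  # 4: window 4..7 has surplus 13
```

The set `excluded` plays the role of M, removing start points that would overheat the buffer.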

4.1.4. Considerable large amount of demand

Whenever TWτmax+1 + CHτmax+1 > β · W, i.e., whenever the demand after the critical point τmax is too large to be covered by the HP in one time period, the critical point τmax must be shifted backwards in time. Let φ̄ be the number of consecutive periods with a demand higher than β · W, and let φ be the number of periods the HP needs to be on to cover the demand in {τmax + 1, . . . , τmax + φ̄}. Of course, there could be usable heat left in the buffer at τmax, i.e., Tτmax − Tmin, where Tτmax is calculated by (1) with xt, . . . , xτmax = 0. This heat Tτmax − Tmin can be subtracted from the demand in {τmax + 1, . . . , τmax + φ̄}. The number of time periods needed to cover this demand is φ, calculated as given in (7).

φ = ( Σ_{k=τmax+1}^{τmax+φ̄} (TWk + CHk) − [Tτmax − Tmin] ) / (β · W) − φ̄,   (7)

where the last φ̄ indicates that the HP can be on during the period {τmax + 1, . . . , τmax + φ̄} as well. This value of φ is used to update the value of τmax in (5), which was initially calculated using φ = 0.
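A small numeric sketch of the φ calculation in (7) follows. The numbers are invented, and rounding the ratio up to a whole number of periods is an assumption made here for the sketch (φ counts discrete on-periods).

```python
# Sketch of equation (7): extra on-periods phi needed when demand just
# after the critical point exceeds beta*W per period. Numbers are
# illustrative; rounding up to whole periods is an assumption.

import math

def compute_phi(demand, leftover_heat, beta=3.0, W=1.0):
    """demand: list of TW_k + CH_k over the phi_bar high-demand periods;
    leftover_heat: usable heat left in the buffer at tau_max."""
    phi_bar = len(demand)
    needed = sum(demand) - leftover_heat
    return math.ceil(needed / (beta * W)) - phi_bar

# Two periods each demanding 5 > beta*W = 3, with 2 units of usable heat
# left: 10 - 2 = 8 heat units need ceil(8/3) = 3 on-periods, 2 of which
# fall inside the high-demand window itself, leaving phi = 1.
print(compute_phi([5.0, 5.0], leftover_heat=2.0))  # 1
```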

4.1.5. Termination

The termination of the algorithm depends on the critical point τmax, described in Section 4.1.3. Termination follows the rule that for each V(t) with τmax > H, all of (xt, . . . , xH) will be equal to 0 and V(t) = Σ_{i=t}^{H} Πi(0). Hence, if the critical point τmax lies beyond the time horizon H, the HP will be off up to and including H.

4.1.6. Special cases

The sets S1 and S0 contain the states for which the HP must be turned on and off, respectively. States that are in S1 violate one of the following constraints:

1. The HP has been on for less than µon time units, i.e., (3).

2. The temperature in the buffer will drop below Tmin when leaving the HP off, i.e., (1 − c)T − TWt − CHt < Tmin.

3. The temperature in the buffer will drop below Tmin when turning the HP off for at least µoff time units, i.e., the critical point τmax lies before τmin.

In the ADP algorithm, states that violate the first constraint will never be encountered, since the next decision epoch is either t + 1 or τ + µon, with (3) satisfied in both cases. States that are in S0 violate one of the following constraints:

1. The HP has been off for less than µoff time units, i.e., Σ_{i=t−µoff}^{t−1} xi > 0.

2. The temperature in the buffer will exceed Tmax when leaving the HP on, i.e., (1 − c)T − TWt − CHt + β · (1 + λ) · W > Tmax.

Again, since the ADP algorithm chooses τmin = t + µoff, this state, for which xt must be zero, will never be encountered. The ADP algorithm is listed in Algorithm 1, making use of Algorithms 2 and 3 in Appendix B, for which the initial state (T, t) must be given by the user. However, the temperature is still not guaranteed to always stay above Tmin, for example, whenever the demand is greater than β · W for such a long period that even a full tank is emptied. In these rare cases, the buffer can make use of an emergency gas supply to save what can be saved.


4.2. Optimal Algorithm

The ADP does not necessarily give the optimal solution, since not all possible values of τ are evaluated, as discussed in Section 4.1.3. Therefore, to find the optimal solution, one replaces Algorithm 3 in Algorithm 1 by Algorithm 4 in Appendix B. The optimal algorithm compares the value function for all possible values τ ∈ {τmin, . . . , τmax}\M. With this algorithm, the quality of the solutions of the ADP is tested.

4.3. Greedy Algorithm

The greedy algorithm is based on the behaviour of the PowerMatcher, introduced by TNO researcher Kok (2013) and described in Section 2. In general, when the constraints are not violated, the heat pump is turned on as soon as a proportion ρ of the power needed is available as solar energy, i.e., PVt−1 − Et−1 > ρ · W. The algorithm can be found in Algorithm 5 in Appendix B.
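The greedy rule is a one-liner, sketched below. The value of ρ is a hypothetical choice; only the decision rule itself follows the text.

```python
# Sketch of the greedy rule: switch the HP on as soon as a fraction rho
# of its power draw W is available as surplus solar energy, ignoring
# future supply. The rho value is an illustrative assumption.

def greedy_decision(pv_prev, e_prev, rho=0.8, W=1.0):
    """Return 1 (on) if PV_{t-1} - E_{t-1} > rho * W, else 0 (off)."""
    return 1 if pv_prev - e_prev > rho * W else 0

print(greedy_decision(pv_prev=1.5, e_prev=0.5))  # 1: surplus 1.0 > 0.8
print(greedy_decision(pv_prev=0.6, e_prev=0.5))  # 0: surplus only ~0.1
```

Because the rule looks only at the previous period, a brief solar peak is enough to trigger the minimum on-time, which is exactly the weakness described in Section 2.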

5. Stochastic Approach

In practice, the values of TWt, CHt, Et and PVt are unknown before time t. So, in contrast to the deterministic algorithm, these parameters are considered stochastic. To make a decision on xt, several methods for dealing with this stochasticity are listed by Spall (2003): machine (reinforcement) learning, simulation-based optimization, Markov decision processes, and stochastic programming (e.g., multi-stage recourse models). To choose the appropriate approach, one must take into account that the algorithm should have a low computational complexity, since it must run inside a household. Furthermore, it would be beneficial if the algorithm could process updated forecasts without rerunning the entire program. In Section 5.1 the stochastic approaches are briefly summarized. Readers who are familiar with these approaches can continue reading at Section 5.2.

5.1. Evaluation of Approaches

A Markov Decision Process (MDP), as described by Wiering and Van Otterlo (2012), is defined by a number of states, a set of actions, a transition matrix, and a reward function, which will be thoroughly explained in Section 5.3. Though an MDP is able to optimize the expected value and is therefore an optimal approach for some stochastic problems, the number of states is computationally limited and all states are independent of previous states, i.e., it only matters which state the problem is in, not how it got there. In our problem, the state space is continuous, since T is continuous, and the states are dependent: e.g., when a shower has been taken earlier that day, the probability that another shower is taken is very small. Therefore, to make the MDP approach computationally feasible, information about the problem must be omitted. As a consequence, it is not known whether the MDP optimizes the expected value of the actual problem. Furthermore, if new forecasts on solar energy become available, the algorithm must start over entirely.


Reinforcement learning, as described by Wiering and Van Otterlo (2012), is a special case of a Markov Decision Process in which actions are evaluated after their effects have been observed. However, in our problem the effect of an action is known, so this method does not seem suitable.

Klein Haneveld and Van der Vlerk (2012) discuss so-called recourse models, which, in contrast to an MDP, are time dependent: one knows the previous states and can therefore make a suitable decision. Moreover, a recourse model evaluates the expected value of an action and therefore makes a good decision under uncertainty. However, the number of states grows exponentially with the number of stages, so the time complexity is much worse than that of an MDP.

Simulation is a method in which a number of scenarios are generated, deterministically solved and evaluated. Though the method does not automatically optimize the expected value, the application by Ströhle et al. (2014), as already stated in Section 2, gives promising results and is computationally simple. Therefore, we choose this approach to deal with the stochasticity. The quality of the solutions is tested against the deterministic algorithm, to evaluate the costs associated with the stochasticity, and against an MDP algorithm.

5.2. Simulation

To deal with the stochasticity of the problem, the method stated in Ströhle et al. (2014) will be used, referred to as the simulation algorithm. The quality of its solutions is compared to a Markov Decision Process, which is expected to give better solutions, though at a higher computational complexity. At each point in time t, the simulation algorithm takes a number of randomly chosen scenarios, for each of which the ADP schedules the HP. The simulation algorithm then evaluates the solutions of the different scenarios to make a decision on xt. First, an explanation of how the scenarios are created is given, followed by a description of the simulation algorithm.

5.2.1. Scenarios

A scenario, denoted by U, is defined as U = (TW_{t,...,H}, CH_{t,...,H}, E_{t,...,H}, PV_{t,...,H}) and therefore gives values for the stochastic variables in the time range from t to H. The values of these stochastic variables will be based on actual data. Let K = {1, ..., 365} represent the days of the year for which data is available. A scenario U, based on data from two days kd, ks ∈ K, is denoted by

U_k = (TW^{kd}_{t,...,H}, CH^{kd}_{t,...,H}, E^{kd}_{t,...,H}, PV^{ks}_{t,...,H}),   (8)

with TW^{kd}_{t,...,H}, CH^{kd}_{t,...,H}, E^{kd}_{t,...,H} and PV^{ks}_{t,...,H} being the observations on days kd and ks, in range t to H, of TW, CH, E and PV, respectively. Assume the current day is k0. To include seasonal effects, kd will be randomly drawn from K_D = {k0 − m, ..., k0 + m}, m ∈ N, and ks will be randomly drawn from K_S = {k0 − n, ..., k0 + n}, n ∈ N. Hence, for day k0, a scenario U_k is built by randomly selecting kd ∈ K_D ⊆ K and ks ∈ K_S ⊆ K.

To create more scenarios, one draws N_d days from K_D and N_s days from K_S; all possible combinations of these draws are listed in the set K, with k = (kd, ks) ∈ K, creating a total of N_d · N_s scenarios. Note that K_D and K_S only depend on k0 and not on the current point in time t. Hence, observations about demand and supply earlier that day are not taken into account.
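As an illustration, the scenario construction can be sketched in Python; the data layout (`demand_data[k]` holding the day-k series of TW, CH and E, and `pv_data[k]` the PV series) is a hypothetical choice, not the thesis's implementation:

```python
import itertools
import random

def build_scenarios(k0, m, n, N_d, N_s, demand_data, pv_data, t, H):
    """Build N_d * N_s scenarios for day k0 over the time range t..H.

    demand_data[k] = (TW, CH, E) per-period series of day k, and
    pv_data[k] = PV series of day k (hypothetical data layout)."""
    K_D = list(range(k0 - m, k0 + m + 1))    # seasonal window for demand days
    K_S = list(range(k0 - n, k0 + n + 1))    # seasonal window for supply days
    demand_days = random.sample(K_D, N_d)    # N_d draws of k_d
    supply_days = random.sample(K_S, N_s)    # N_s draws of k_s
    scenarios = []
    for kd, ks in itertools.product(demand_days, supply_days):
        tw, ch, e = (series[t:H + 1] for series in demand_data[kd])
        scenarios.append({"TW": tw, "CH": ch, "E": e,
                          "PV": pv_data[ks][t:H + 1]})
    return scenarios
```

Note that `random.sample` draws without replacement, so every pair (kd, ks) in the resulting set is distinct.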

5.2.2. Simulation algorithm

At each point in time t of day k0, N_d days are randomly chosen from K_D and N_s days from K_S, creating the set K, and for each k = (kd, ks) ∈ K a scenario U_k is created. For each scenario U_k, the ADP algorithm as described in Algorithm 1 gives values for xt, ..., xH. If in γ% of the scenarios xt = 0, the HP is turned off or left off; otherwise the HP is turned on or left on.
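The decision rule itself reduces to a vote over the scenario schedules; a minimal sketch, where `solve_adp` is a hypothetical stand-in for Algorithm 1:

```python
def decide_x_t(scenarios, solve_adp, gamma):
    """Vote-based decision rule: solve_adp returns the schedule
    x_t, ..., x_H for one scenario; the HP is turned (or left) off iff
    at least a fraction gamma of the scenario schedules set x_t = 0,
    otherwise it is turned (or left) on."""
    off_votes = sum(1 for u in scenarios if solve_adp(u)[0] == 0)
    return 0 if off_votes >= gamma * len(scenarios) else 1
```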

There are states (T, t) in which there is no choice for xt, since constraints will be or have been violated. Let these states again be in the sets S1 and S0 for xt = 1 and xt = 0, respectively. States in S1 violate at least one of the following constraints:

1. The HP has been on for less than µon time units, i.e., ∑_{i=t−µon}^{t−1} x_i < µon.

2. The current temperature is below Tmin.

Since the problem contains stochasticity, it could happen that there is a great amount of demand while little or no demand was expected and accounted for. Therefore, the temperature in the buffer may drop below the minimum allowed temperature, in which case the HP must be turned on immediately. States in S0 violate at least one of the following constraints:

1. The HP has been off for less than µoff time units, i.e., ∑_{i=t−µoff}^{t−1} x_i > 0.

2. The current temperature is above Tmax.

Again, since the problem contains stochasticity, it could happen that there is little or no demand at all while a greater amount was expected and accounted for. Therefore, the temperature in the buffer may exceed the maximum allowed temperature, in which case the HP must be turned off immediately.
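The membership tests for S1 and S0 can be sketched as follows, assuming the past decisions are kept in a list `x_hist` (a hypothetical bookkeeping choice) and checking the minimum-run constraints before the temperature bounds:

```python
def forced_action(T, x_hist, T_min, T_max, mu_on, mu_off):
    """Return 1 if the state lies in S1 (HP forced on), 0 if in S0
    (HP forced off), or None if x_t is a free choice.
    x_hist holds the past decisions x_1, ..., x_{t-1}."""
    run = 0
    for x in reversed(x_hist):           # length of the current on/off run
        if x == x_hist[-1]:
            run += 1
        else:
            break
    if x_hist and x_hist[-1] == 1 and run < mu_on:
        return 1                         # minimum on-time not yet reached
    if x_hist and x_hist[-1] == 0 and run < mu_off:
        return 0                         # minimum off-time not yet reached
    if T < T_min:
        return 1                         # buffer too cold: switch on now
    if T > T_max:
        return 0                         # buffer too hot: switch off now
    return None
```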

The simulation algorithm can be found in Algorithm 6 in Appendix B. It describes how the value of xt is calculated. To determine all values x1, ..., xH that constitute a solution, one has to call the simulation algorithm iteratively, updating (T, t) with each iteration.

5.3. Markov Decision Process

A Markov Decision Process is defined as a discrete stochastic dynamic program and is used when decisions have to be made under uncertainty that will influence future decisions and rewards. These decisions have to be made at certain points in time. At such a decision epoch, the system is in a certain state. Moving from one state to another depends on the decision made and on a transition probability. Each state has its own reward function, including current rewards and future rewards, regarding a certain decision, or what we call an action. Hence, a Markov Decision Process contains a state space S, an action space A, a transition matrix M and a reward function R. Each element will be thoroughly discussed in Sections 5.3.1 to 5.3.4.

5.3.1. States

A state, denoted by s ∈ S, is a unique characterization of all aspects of the modelled problem that are relevant for the decision. The state space S is a finite set {s1, ..., sN}, with N ∈ N the number of different states in which the problem can appear. In our problem, the elements that define a state are the temperature of the buffer, T, the number of time periods the HP has been on, Non, or off, Noff, and the current point in time, t. For simplicity, the temperature of the buffer and the time periods are discretized, so that the formal definition of a state is given by

s = {T, Non, Noff, t},   (9)

with T ∈ {Tmin, ..., Tmax}, Non ∈ {0, ..., µon}, Noff ∈ {0, ..., µoff} and t ∈ {1, ..., H}.

Note that Non > 0 ⇔ Noff = 0 and Noff > 0 ⇔ Non = 0, since the HP cannot at the same time be on for a strictly positive number of periods and off for a strictly positive number of periods.
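A small sketch of enumerating this discretized state space, assuming a given temperature grid `T_grid`:

```python
from itertools import product

def enumerate_states(T_grid, mu_on, mu_off, H):
    """Enumerate the discretized state space of (9), keeping only states
    that satisfy Non > 0 <=> Noff = 0 (exactly one counter positive)."""
    states = []
    for T, n_on, n_off, t in product(T_grid, range(mu_on + 1),
                                     range(mu_off + 1), range(1, H + 1)):
        if (n_on > 0) == (n_off > 0):
            continue                     # skip both-zero and both-positive
        states.append((T, n_on, n_off, t))
    return states
```

The resulting state count is |T_grid| · (µon + µoff) · H, which makes the computational limitation mentioned in Section 5.1 concrete.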

5.3.2. Actions

The future state of the system can be influenced by actions. The set of actions A is defined as the finite set {a1, ..., aL}, where L ∈ N is the number of actions. For each state s ∈ S there is a subset A(s) ⊆ A of feasible actions, since not every action can be taken in every state (for example, if the HP has not been on for at least µon time units, the action must be "on" for this period). The action space in our problem is the finite set A = {0, 1}, indicating either switching the HP off (a = 0) or on (a = 1).

5.3.3. Transition matrix

When taking action a ∈ A in a state s ∈ S, the system makes a transition to a new state s′ ∈ S, based on a probability distribution over the set of possible transitions. Hence, the transition matrix M is defined as M : S × A × S → [0, 1], i.e., it gives the probability of moving from state s to state s′ by taking action a. Note that 0 ≤ M(s, a, s′) ≤ 1 and ∑_{s′∈S} M(s, a, s′) = 1 for all feasible combinations of state s and action a. Note also that the probabilities are not influenced by states previous to state s or by previous actions, i.e., the Markov property holds:

P(s_{t+1} | s_t, a_t, s_{t−1}, a_{t−1}, ...) = P(s_{t+1} | s_t, a_t) = M(s_t, a_t, s_{t+1}).

For our problem, the probabilities are classified into two cases:

M(s, a, s′) ≥ 0   if t_s + 1 = t_{s′} and
    N^on_{s′} = 0, N^off_{s′} > 0 and T_s ≥ T_{s′},   if a = 0,
    N^on_{s′} > 0 and N^off_{s′} = 0,                 if a = 1;
M(s, a, s′) = 0   otherwise.

For the first case, if all conditions hold and the action is "off", the transition probability is equal to the probability of observing a demand of size T_s − T_{s′}. Note that this probability can be zero if a demand of this size is never observed. If the action is "on", the probability is equal to that of observing a demand of size T_s − T_{s′} + β · W · (1 + λ[N_on − µ_on + 1]^+), where the positive-part term is zero except when N_on = µ_on. Hence the probabilities are:

M(s, 0, s′) = P(T_s − T_{s′} = CH(t_s) + TW_{t_s}),
M(s, 1, s′) = P(T_s − T_{s′} = CH(t_s) + TW_{t_s} − β · W · (1 + λ[N_on − µ_on + 1]^+)).

Note that, since T is discretized, both CH(t_s) and TW_{t_s} are discretized as well, such that these probabilities are not zero. Hence, there is a possibility that the amount of demand is too small to make a significant change in T, in which case T does not change. Accordingly, the cooling parameter is disregarded.
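These transition probabilities can be estimated from data; a sketch under the assumption that `demand_obs[t_s]` holds historical, already discretized, demand observations for period t_s:

```python
from collections import Counter

def transition_prob(dT, a, t_s, demand_obs, beta, W, lam, N_on, mu_on):
    """Estimate M(s, a, s') for a temperature drop dT = T_s - T_{s'} from
    the empirical distribution of discretized demand CH(t_s) + TW_{t_s}."""
    counts = Counter(demand_obs[t_s])
    if a == 1:
        # while the HP runs, the drop equals demand minus the HP's heat input
        dT += beta * W * (1 + lam * max(N_on - mu_on + 1, 0))
    return counts[dT] / len(demand_obs[t_s])
```

A `Counter` returns 0 for unseen values, which matches the remark that the probability is zero if a demand of that size is never observed.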

5.3.4. Reward Function

The reward function is defined as R : S × A → ℝ and therefore specifies a reward for taking some action in a certain state. The reward function is used to value a state, so that later on an algorithm can compare actions between states and arrive at an optimal strategy. In our problem, the reward functions, based on (2), are stated as follows:

R_on(s) = p_u · min{F^{PV}_{t_s}, F^{E}_{t_s} + W} + p_s [F^{PV}_{t_s} − F^{E}_{t_s} − W]^+ − p_d [F^{PV}_{t_s} − F^{E}_{t_s} − W]^−,

R_off(s) = p_u · min{F^{PV}_{t_s}, F^{E}_{t_s}} + p_s [F^{PV}_{t_s} − F^{E}_{t_s}]^+ − p_d [F^{PV}_{t_s} − F^{E}_{t_s}]^−,

with F^{PV}_{t_s} denoting the forecast at time t_s, in terms of expected value, of the available solar energy, and F^{E}_{t_s} the expected demand for electricity of the other devices at time t_s. The state space, the action space, the transition matrix and the reward function together define a Markov Decision Process, formally defined as follows, cited from Wiering and Van Otterlo (2012).

Definition. A Markov decision process is a tuple ⟨S, A, M, R⟩ in which S is a finite set of states, A a finite set of actions, M a transition function defined as M : S × A × S → [0, 1], and R a reward function defined as R : S × A → ℝ.
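A sketch of evaluating R_on and R_off for given forecasts; where the extracted formulas are ambiguous, it assumes p_s is paid for the surplus fed into the grid and p_d for the shortage drawn from it:

```python
def state_rewards(F_pv, F_e, W, p_u, p_s, p_d):
    """Evaluate (R_on(s), R_off(s)) for forecasts F_pv (solar supply) and
    F_e (other electricity demand) at time t_s, with [x]^+ = max(x, 0)
    and [x]^- = max(-x, 0)."""
    def reward(load):
        surplus = F_pv - load
        return (p_u * min(F_pv, load)        # self-consumed solar energy
                + p_s * max(surplus, 0.0)    # surplus fed into the grid
                - p_d * max(-surplus, 0.0))  # shortage bought from the grid
    return reward(F_e + W), reward(F_e)      # "on" adds the HP load W
```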

5.3.5. Algorithm

One of the most efficient algorithms to calculate an optimal policy for a Markov decision process is the value iteration algorithm, introduced by Bellman (1957). It is based on dynamic programming and makes use of the so-called value function V(s), with the optimal policy satisfying:

V*(s) = max_{a∈A(s)} ∑_{s′∈S} M(s, a, s′) [R(s, a, s′) + V*(s′)].   (10)

Equation (10) gives the expected reward of undertaking a certain action. Let i ∈ N be an iteration index and let (11) be the updating rule for the value function.

V_{i+1}(s) = max_{a∈A(s)} ∑_{s′∈S} M(s, a, s′) [R(s, a, s′) + V_i(s′)].   (11)

Starting with a value function V_0 over all states and applying (11), in the limit on the number of iterations, lim_{i→∞} V_i(s) = V*(s). In general, the algorithm stops when ||V_{i+1}(s) − V_i(s)|| < ε.

The value iteration algorithm is illustrated in Algorithm 7 in Appendix B. Based on the final V(s), at each state s one chooses the action that maximizes the expected value, see (10), which leads to an optimal policy for all states. In our problem, states are updated according to (1) and the rule t = t + 1. Then, the state defined in the MDP that comes closest to the current state is chosen and an action is taken according to its policy.
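A generic sketch of the value iteration loop of (11), with hypothetical dict-based encodings of M and R:

```python
def value_iteration(states, actions, M, R, eps=1e-6):
    """Value iteration following (11): M[(s, a)] maps successor states to
    probabilities, R[(s, a, s2)] gives the reward, and actions(s) returns
    the feasible action set A(s). Stops once the sup-norm change of the
    value function drops below eps."""
    V = {s: 0.0 for s in states}
    while True:
        V_new = {
            s: max(
                sum(p * (R[(s, a, s2)] + V[s2])
                    for s2, p in M[(s, a)].items())
                for a in actions(s)
            )
            for s in states
        }
        if max(abs(V_new[s] - V[s]) for s in states) < eps:
            return V_new
        V = V_new
```

Since t is part of the state, values propagate backwards from the end of the horizon, so the undiscounted update still converges in finitely many sweeps.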

6. Numerical Results

The performance of both the deterministic ADP algorithm and the stochastic simulation algorithm is evaluated on instances created from the available data. The data consist of per-minute observations on PVt, Et, TWt and CHt in the months June to December, comprising 214 days. The time t is discretized in steps of 5 minutes.

An instance is dependent on the time horizon H, i.e., the number of decision epochs, and is created by splitting the 214 days of data into periods of length H. Let HADP be the number


