Master’s Thesis Operations Research

Scheduling a heatpump under stochastic power supply and demand

A.F. Konst, (s2157187) January 15, 2016

Abstract

The Paris Agreement of December 12, 2015^{1}, states that countries must stimulate the
use of renewable energy. However, problems arise if this renewable energy is added
to the current grid, since the renewable energy is often highly volatile and uncertain.

For this reason, so-called smart grids have been introduced. The idea of smart grids is to shift demand from periods of high demand to periods of low demand. Smart grids rely on the law of large numbers, so these systems are only applicable on a society level and are therefore not easily implemented. This paper considers a smart solution for the efficient use of renewable energy on a household level. To be precise, it considers the combination of a photovoltaic system and a heat pump that stores the produced electricity in the form of heat. The aim is to schedule the heat pump, i.e., to determine when to switch the heat pump on or off, in such a way that it uses less electricity from the grid and more renewable energy from the photovoltaic system. The proposed algorithm, the simulation algorithm, takes the future into account and therefore deals with the stochasticity in both demand and supply for energy. It relies heavily on the so-called approximated dynamic programming (ADP) algorithm, which schedules the heat pump under the assumption that future supply and demand for energy are known. The ADP algorithm gives solutions that are only slightly worse than the optimal solution in a deterministic setting, and the simulation algorithm outperforms a greedy algorithm, saving a household €239.46 on the energy bill for the months June to December of one year.

1“Framework Convention on Climate Change”. United Nations FCCC Int. United Nations. 12 December 2015.

Supervisors

dr. W. Romeijnders, University of Groningen
drs. M.J. Konsman, TNO
W.E. Wijbrandi, MSc, TNO

1. Introduction

The world’s stock of fossil fuels is decreasing rapidly, and the use of these fuels causes pollution that negatively affects the earth and its inhabitants. This problem motivates researchers to develop more eco-friendly sources of power, the so-called renewable energy resources. Examples of renewables are wind turbines for wind power, hydropumps for hydropower and photovoltaic systems for solar power.

Many of the renewable energy resources depend on weather conditions and as a result they have a volatile power output that causes problems in the grid. In the grid, demand must always equal supply. Whenever supply exceeds demand, the grid will be overloaded and will shut down.

Vice versa, it is forbidden by law for an energy supplier not to meet demand. A solution to the volatility in power output is to store surplus energy and use it later on. However, storing energy is inefficient and expensive. Another idea is to adjust the demand instead of the supply. This is the idea behind the so-called smart grids: create flexibility by shifting demand of devices such as electric vehicles and washing machines to a later point in time. These smart grids rely on the law of large numbers, i.e., the more households are involved, the more stable the demand for electricity and the lower the probability of extreme values. However, integrating such a system in society takes time. Therefore, there is room and a need for a smart system alongside the upcoming smart grid system that works on a household level.

To decrease the volatility in the grid caused by all renewable energy produced by households, the German government has introduced a subsidy that is awarded to citizens who directly consume energy they produced themselves. Furthermore, the price for selling residual self-produced energy to the grid is lower than the price for purchasing electricity from the grid, so that storing energy on the grid is not beneficial. Hence, a single household can lower its energy bill by installing a renewable device and by optimizing the way it uses its electric devices.

Since storing energy on the grid is not beneficial, one could look for other methods to store energy. The largest energy demand in a household is the demand for heat. Heat, in contrast to electricity, can be stored efficiently and can therefore be used to store energy. In contrast to the smart grid, the flexibility does not lie in shifting demand, but in storing the energy in the form of heat, which can be consumed later on.

To create heat, a so-called heat pump (HP) is used. A heat pump is an electric device that extracts heat from outside air into a water heating circuit. In general, a heat pump has an efficiency between 300% and 400%. An efficiency of 300% means that for each electric Joule of input, the heat pump creates three Joules of heat as output. This ratio is captured in the coefficient of performance (COP). A heat pump has a minimum period of time it must be on to run at its full potential (i.e., to reach the maximum level of COP) and a minimum period of time it must be off to cool down properly and avoid wear.

The problem discussed in this paper considers a household that has installed both a solar photovoltaic system (PV) and an air-to-water heat pump (HP). The aim is to efficiently schedule the use of the heat pump, i.e., to determine when to switch the HP on or off, using the least possible amount of electricity from the grid whilst ensuring demand for heat is satisfied.

This paper discusses five different algorithms. First, a greedy approach based on the PowerMatcher of TNO, which turns the heat pump on as soon as solar energy is available. However, it does not take into account that the heat pump must be on for a minimum period of time. Therefore, situations arise where the heat pump is on even when there is no solar energy available.

Hence, the future supply of solar energy must be taken into account. The main algorithm introduced in this paper, called the simulation algorithm, schedules the heat pump while taking the uncertain future into account. The simulation algorithm is based on an algorithm that assumes all future data is known and will therefore be referred to as the deterministic algorithm or the Approximated Dynamic Programming (ADP) algorithm. The simulation algorithm in the stochastic setting will be compared to a Markov Decision Process (MDP) based algorithm and the greedy algorithm. The underlying ADP algorithm will be compared to an optimal algorithm in a deterministic setting and to the greedy algorithm as well, since the greedy algorithm is a real-time algorithm and therefore suitable for both the stochastic and the deterministic setting.

First, a literature review will be provided in Section 2. In Section 3, the problem will be defined, followed by a description of the deterministic algorithm or ADP algorithm in Section 4. In Section 5, the simulation algorithm will be defined and in Section 6 the results of both the ADP algorithm and the simulation algorithm will be discussed. Section 7 concludes the paper.

2. Literature review

Cao, Hasan, and Siren (2014) introduce the combination of a heat pump and renewable energy, and state that research is needed to get this system to its full potential. However, they introduce this system in an office building, which has a demand distribution that differs from that of a residential building. The research done on a residential level is limited to the paper by Vialetto, Noro, and Rokni (2015). However, they make use of solid oxide fuel cells instead of a PV system. Solid oxide fuel cells make use of natural gas and are therefore not dependent on weather conditions. Furthermore, since the purpose of our paper is to make use of renewable energy in an efficient way, the paper by Vialetto et al. (2015) cannot be used. Franco and Fantozzi (2016) did investigate the perspective of a mutual interaction of a PV system and a heat pump by an experimental analysis. In their conclusion they state:

“The promotion of self-consumption strategies and in general the use of heat pumps in connection with PV plants represents surely the new frontier for the future development of systems based on renewable energy.”

To the best of our knowledge, there are no papers about the decision-making algorithm for a problem with a heat pump and a renewable energy source dependent on the weather. In the literature about smart grids, it is stated by Bakker (2011) that smart grid solutions can roughly be divided into two groups: agent-based market systems, such as the PowerMatcher by TNO researcher Kok (2013), and discrete mathematical optimizations, such as Triana by Bakker (2011).

The idea of the PowerMatcher is to let every household place a bid on a fictional market. An agent collects all the bids, observes the supply of renewable energy and returns a price for which demand equals supply. All households with a bid lower than the price returned by the agent will not consume electricity, and all others do. The strength of the PowerMatcher lies in its scalability and its simplicity. It makes use of the fact that the demand for electricity of a whole country does not contain huge outliers, is therefore balanced and can be forecasted well, i.e., the PowerMatcher relies on the law of large numbers. The demand of a single household does have these huge outliers, and differences in demand from one minute to the next can be very large. Such an outlier of a single household will, in turn, not be noticeable on a bigger scale.

If the PowerMatcher were to operate in a single household with a HP, as in our problem, it would only look at the real-time demand for heat, disregarding future demand. Furthermore, it would turn the HP on whenever solar energy is available, disregarding the possibility that this supply of energy is only a peak and a cloud could appear in the next minute, leading to the use of electricity from the grid, which is not the idea of an eco-friendly smart system. The PowerMatcher only operates in real time, and this causes problems on the scale of a household, where demand and supply are very unpredictable. The algorithm introduced in this paper is compared to the performance of the PowerMatcher. The decisions the PowerMatcher makes on a household level will be referred to as the greedy algorithm.

Triana does take future demand and supply for energy into account by the use of dynamic programming. Based on a forecast of supply, it sends temporary prices for energy for the next time period to all households in the Triana system. Households optimize their consumption based on these prices and return this to Triana. Triana again sends, based on forecasts of supply and the information from the households, temporary prices for energy for the next time period to all households, and this continues until demand equals supply at every point in time in the next time period. Although Triana does take future demand and supply into account, it relies on the law of large numbers and is therefore not directly applicable to our problem.

A smart grid system that does take the future into account and in fact works on a smaller scale is the system discussed by Ströhle, Gerding, de Weerdt, Stein, and Robu (2014). The problem that is discussed makes a schedule for different devices that are currently demanding electricity. This schedule is based on scenarios, and devices that are frequently scheduled at the beginning of these different scenarios will be turned on while others have to wait. For the scheduling of the deterministic scenarios, Ströhle et al. (2014) use a greedy algorithm to speed up the actual algorithm. Note that for the different scenarios the problem is deterministic and the scheduling algorithm will be referred to as a deterministic algorithm. This idea of scenarios will be used in this paper, referred to as the simulation algorithm, and the deterministic algorithm will be replaced by dynamic programming as introduced by Bellman (1957). The choice for dynamic programming is based on the structure of the problem, since the problem contains decision epochs and the number of decision epochs can easily be limited.

3. Problem Formulation

We introduce a household having a heat pump (HP), a buffer for warm water, and a solar photovoltaic system (PV). Time is considered to be discretized and at each point in time, t, a decision has to be made on whether to turn the HP on or off. At time t, the household has a stochastic demand for warm water in the form of tap water and central heating, denoted by TW_{t} and CH_{t}, respectively. Furthermore, besides the HP, there are other electric devices that demand electricity; the amount of demanded electricity is denoted by the random variable E_{t}. The supply of renewable energy from the PV is stochastic as well and will be denoted by PV_{t} for each time period t.

The HP uses a buffer to store heated water. The temperature in this buffer is limited by a maximum temperature T_{max} and a minimum temperature T_{min}, both expressed in Joules. The HP cannot be turned off when it has been on for less than µ_{on} time units and cannot be turned on when it has been off for less than µ_{off} time units. When the HP is on, it requires W amount of electric power. The coefficient of performance (COP) represents the ratio between the amount of heat in Joules provided and the power W required, and is denoted by β. Furthermore, after the HP has been on for µ_{on} time units, indicated by the indicator variable y_{t}, the HP runs with higher efficiency, for which the COP increases by λ × 100%. Let x_{t} be a binary variable that is 1 if the HP is on in period t and 0 otherwise. The temperature of the buffer at time t + 1 is defined as

T_{t+1} = (1 − c)T_{t} − TW_{t} − CH_{t} + (1 + λy_{t}) · β · W · x_{t},  (1)
with

y_{t} = 1 if ∑_{i=t−µ_{on}}^{t−1} x_{i} = µ_{on}, and y_{t} = 0 otherwise,

and 0 ≤ c ≤ 1 is the cooling parameter denoting the adjustment of the temperature in the buffer to the temperature of its environment. The cooling relationship is assumed to be linear since the buffer is well insulated. The simulation algorithm, to be defined in Section 5, will determine the values of x_{t} for t ∈ {1, . . . , H}, with H denoting the fixed time horizon.
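As an illustration, the buffer dynamics in (1) and the indicator y_{t} can be sketched in a few lines of Python. All function names and parameter values below are assumptions for illustration only, not taken from the implementation accompanying this thesis.

```python
def y_indicator(x_history, mu_on):
    """y_t: 1 if the HP has been on for the last mu_on time units, else 0."""
    if len(x_history) < mu_on:
        return 0
    return int(sum(x_history[-mu_on:]) == mu_on)

def next_temperature(T, tw, ch, x, y, c, beta, W, lam):
    """Buffer update (1): cooling, minus heat demand, plus HP output,
    where the COP beta is boosted by lam * 100% once y = 1."""
    return (1 - c) * T - tw - ch + (1 + lam * y) * beta * W * x
```

Note that the COP boost only applies after µ_{on} consecutive on-periods, which is exactly what `y_indicator` checks.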

To calculate the associated costs per time unit, we introduce three different prices for electricity, as are currently active in Germany. The price of purchasing electricity from the grid is denoted by p_{d} and the price for selling electricity to the grid is p_{s}. Remarkably, p_{s} < p_{d}, so self-produced electricity is worth less than electricity produced by energy companies. By setting p_{s} lower than p_{d}, the German government stimulates the direct usage of self-produced electricity rather than storing the electricity on the grid. To stimulate this even further, a subsidy, p_{u}, is introduced for every amount of self-produced electricity that is consumed immediately. The prices p_{u}, p_{s} and p_{d} are per Joule × time unit. Another cost associated with the problem is the cost of wear and tear from turning the HP on and off. Since the number of times the HP is turned on equals the number of times the HP is turned off, w ≥ 0 is a fixed wear cost incurred when the HP is turned on. The profit at time t equals

Π_{t}(x_{t}) = p_{u} min{PV_{t}, E_{t} + W · x_{t}} + p_{s}[PV_{t} − E_{t} − W · x_{t}]^{+}

− p_{d}[PV_{t} − E_{t} − W · x_{t}]^{−} − x_{t}(1 − x_{t−1})w,  (2)
with p_{u}, p_{s}, p_{d} > 0 and p_{s} < p_{d}. The first term in (2) describes the profit associated with the self-produced electricity that is immediately used, i.e., the sum of the electricity used by other devices E_{t} and the power used by the HP, restricted by the total amount of self-produced energy that is available at time t. The second term describes the amount of self-produced energy that is not consumed and sold to the grid. The third term constitutes the costs in the case that the self-produced energy does not cover demand; it is the shortage that must be purchased from the grid. Lastly, whenever x_{t} = 1 and x_{t−1} = 0, i.e., when the HP is turned on, a fixed wear cost is incurred.
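A direct transcription of (2) may clarify the four terms. The price and power values in the usage below are placeholders, and the wear cost w is entered as a cost (subtracted), consistent with w ≥ 0 and the description above.

```python
def profit(pv, e, x, x_prev, W, p_u, p_s, p_d, w):
    """One-period profit (2): subsidised self-consumption, sales of the
    surplus, purchases of the shortage, and a wear cost on switch-on."""
    demand = e + W * x                 # electricity demanded, HP included
    self_used = min(pv, demand)        # immediately consumed own energy
    surplus = max(pv - demand, 0.0)    # [PV - E - W x]^+, sold to the grid
    shortage = max(demand - pv, 0.0)   # [PV - E - W x]^-, purchased
    wear = w if (x == 1 and x_prev == 0) else 0.0
    return p_u * self_used + p_s * surplus - p_d * shortage - wear
```

Here the identity [z]^{−} = max(−z, 0) is used to express the shortage term, so surplus and shortage are never both positive.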

4. Deterministic Algorithm

For the deterministic version of the problem, it is assumed that all stochastic variables TW_{t}, CH_{t}, E_{t} and PV_{t} are known in advance. For each moment in time, a decision has to be made on whether to turn the HP on or off, i.e., to determine the value of x_{t} for all t, while taking the constraints on the buffer and the HP into account as described in Section 3.

4.1. Approximated Dynamic Programming

The restrictions on the buffer (the temperature should be between T_{min} and T_{max}) and on the HP (at least µ_{on} time units on and at least µ_{off} time units off) limit the number of possible solutions. To fully exploit this advantage, we make use of dynamic programming. To speed up the algorithm, not all possible solutions will be taken into account, which is why our heuristic will be referred to as the Approximated Dynamic Programming (ADP) algorithm. Dynamic programming is a method that divides the problem into subproblems. Our problem is divided into decision epochs, which are moments in time at which the algorithm must determine whether to turn the HP on or off. The heuristic selection of which decision epochs to evaluate and which not is further explained in Section 4.1.3.

The decision epochs are visualized in Figure 2. Each circle denotes a moment in time t ∈ {1, . . . , H} and therefore a decision epoch, since the value of x_{t} must be determined at that point. Let each decision epoch have information about the current state of the problem. A state is a unique characterization of all information of the modelled problem that is relevant at that moment. A state in our problem is defined by (T, t) ∈ S, which is a pair of the temperature in the buffer, T, and the point in time t in the state space S.

Due to the constraints, in many states (T, t) there is only one feasible value for x_{t} and therefore no decision needs to be made. Let S^{1} and S^{0} be the sets of all states for which x_{t} must be equal to 1 or 0, respectively. The extended definition of these sets can be found in Section 4.1.6. To decrease the number of decision epochs, the ADP algorithm will only consider decision epochs with values for t that satisfy (3), i.e., the states the proposed algorithm encounters are such that in the previous period the HP was on and the constraint that the HP must be on for at least µ_{on} time units is satisfied. This decreases the number of decision epochs rapidly, since it is assumed that the HP is off for the greater part of the day.

∑_{i=t−µ_{on}}^{t−1} x_{i} ≥ µ_{on}  (3)

4.1.1. Intuition

Figure 2 serves to give an intuition of what the ADP algorithm does. Assume that the ADP algorithm is at the decision epoch with t = t_{0} and T = T_{0}. Moreover, assume that the current state (T_{0}, t_{0}) ∉ S^{1} ∪ S^{0} and that (3) holds.

Figure 2: Possible scenario of a subproblem of the ADP algorithm with µ_{on} = 4 and µ_{off} = 3. Each circle indicates a decision epoch at which the HP is on or off. It is assumed that the HP has been on in the previous period and, if left off entirely, the red dotted line indicates the point where the temperature in the buffer would drop below the minimum temperature, denoted as the critical point or τ_{max}. The point τ_{min} is the first point where the HP is able to be turned on, taking into account µ_{off}. If the HP is left on in the next period, the algorithm repeats itself at t_{1}. Otherwise, the HP will be turned off and turned on again at τ ∈ {τ_{min}, . . . , τ_{max}}, which will be the point with the most consecutive surplus solar energy, indicated by the blue rectangle. Hence, the algorithm will be called upon again at either t_{1} or t_{12}.

At t_{0} the ADP has two choices: either leave the HP on (x_{t_1} = 1) or turn the HP off (x_{t_1} = 0). In the first case, the next decision epoch the ADP encounters is at t_{1}. Since (3) holds for (T_{0}, t_{0}) and x_{t_1} = 1 in this case, (3) holds for this decision epoch as well, i.e., the constraint that the HP must be on for at least µ_{on} time units is already satisfied and does not restrict the decision at t_{1}. In the second case, the HP is turned off. Since (3) should always hold, the ADP algorithm determines the next point in time at which the HP will be turned on again. This point will be denoted by τ. For all points in time between t_{0} and τ − 1, the HP will be off, and for all points in time between τ and τ + µ_{on} − 1 the HP has no other choice than to be on, since the HP must be on for at least µ_{on} time units. Therefore, the next decision epoch will be at τ + µ_{on}, with (3) again satisfied. At each new decision epoch the algorithm will again choose between leaving the HP on one more time unit or turning it off. Therefore, the ADP algorithm can be seen as making jumps through time, moving from one decision epoch to the other while always making sure (3) holds. Of course, with every jump, the temperature in the buffer must be updated according to (1).

4.1.2. Value function

To distinguish between the two cases of either moving to t + 1 or to τ + µ_{on}, the value function
introduced by Bellman (1957) is used, which is given by

V(t) = Π_{t}(1) + V(t + 1), if x_{t} = 1,

V(t) = ∑_{i=t}^{τ−1} Π_{i}(0) + ∑_{j=τ}^{τ+µ_{on}−1} Π_{j}(1) + V(τ + µ_{on}), if x_{t} = 0,  (4)

where Π_{t}(x_{t}) is defined in (2). Hence, the value function consists of the observed profit when taking a certain action plus the future profit. The value of x_{t} that gives the highest value of V(t) will be chosen. Note that V(t + 1) and V(τ + µ_{on}) can only be obtained by recursively calling the algorithm. The last value, which is associated with the termination of the algorithm, will be discussed in Section 4.1.5.
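The recursion in (4) can be sketched as follows. This is a toy transcription and not Algorithm 1 of Appendix B: `tau_of` stands in for the heuristic choice of τ from Section 4.1.3, and when τ + µ_{on} − 1 would pass the horizon the HP simply stays off, mimicking the termination rule of Section 4.1.5.

```python
def value(t, H, tau_of, mu_on, profit_on, profit_off):
    """V(t) of (4): either keep the HP on one more period (x_t = 1),
    or switch off until tau and back on for mu_on periods (x_t = 0)."""
    if t > H:
        return 0.0
    # Case x_t = 1: observed profit now plus the value of the next epoch.
    v_on = profit_on(t) + value(t + 1, H, tau_of, mu_on, profit_on, profit_off)
    # Case x_t = 0: off during {t, ..., tau-1}, on during {tau, ..., tau+mu_on-1}.
    tau = tau_of(t)
    if tau + mu_on - 1 > H:
        # Termination (Section 4.1.5): stay off until the horizon H.
        v_off = sum(profit_off(i) for i in range(t, H + 1))
    else:
        v_off = (sum(profit_off(i) for i in range(t, tau))
                 + sum(profit_on(j) for j in range(tau, tau + mu_on))
                 + value(tau + mu_on, H, tau_of, mu_on, profit_on, profit_off))
    return max(v_on, v_off)
```

Without memoization this recursion branches at every epoch, so it is only suitable for small toy horizons; the point is the structure of the two cases, not the performance.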

4.1.3. Determination of τ

The point τ is the point at which the ADP algorithm will turn the HP on again after x_{t} = 0. To determine its value, first a feasible set of points in time is created to make sure that τ does not cause any constraint violation. The first feasible value for τ is τ_{min} = t + µ_{off}, since the HP must be off for at least µ_{off} time units. Furthermore, τ is also bounded by a maximum value, denoted by τ_{max}, which is based on the constraint that the temperature in the buffer must stay above T_{min}. When the HP is left off, the temperature in the buffer will decrease over time due to demand and cooling, and will therefore at some point drop below the minimum temperature. Let τ_{max} be referred to as the critical point. The critical point is the last point in time for which leaving the HP off from t to τ_{max} will not cause T to drop below T_{min}. Hence, the HP must have been on at least once before τ_{max} to cover demand and not violate the minimum temperature. In Figure 2 the critical point is visualized as the red dotted line. The calculation of τ_{max} is stated in (5), where T_{t} is known at the current time t. The parameter φ in (5) will initially be zero and, only in the case of a considerably large amount of demand (i.e., TW_{τ_{max}+1} + CH_{τ_{max}+1} > β · W), will τ_{max} be updated with the value of φ. The parameter φ indicates how many time units earlier in time the critical point τ_{max} should lie in the case that the amount of demand after τ_{max} is so great that the buffer must contain more heat beforehand. This will be further explained in Section 4.1.4. The parameter φ is initially zero to calculate τ_{max}, after which φ is calculated as described in Section 4.1.4 and finally τ_{max} is updated with this value of φ.

τ_{max} = max{ j ∈ {t, . . . , H} : T_{j} > T_{min} } − [φ]^{+},  (5)

with

T_{j+1} = (1 − c)T_{j} − TW_{j} − CH_{j}, ∀j ∈ {t, . . . , H}.

However, not all values between τ_{min} and τ_{max} are feasible values for τ, due to the constraint that the temperature is not allowed to exceed T_{max}. Let M ⊂ {τ_{min}, . . . , τ_{max}} contain the values for τ that would violate the maximum temperature restriction, i.e., turning the HP on at i ∈ M would cause one of the points j ∈ {i, . . . , i + µ_{on}} to have a temperature T_{j} > T_{max}. The temperature in the buffer decreases over time, so only values of τ at the beginning of the interval are to be found in M. Hence, M eliminates lower values of τ and therefore effectively creates a new τ_{min}. The point τ can only be chosen from {τ_{min}, . . . , τ_{max}}\M.
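A sketch of the computation of τ_{max} in (5), under the reading that [φ]^{+} shifts the critical point a number of periods earlier in time; the function and argument names are illustrative assumptions, and `tw` and `ch` are forecast demand series indexed by time.

```python
def critical_point(T_t, t, H, tw, ch, c, T_min, phi=0):
    """tau_max: the last j for which leaving the HP off from t onwards
    keeps the buffer above T_min, shifted [phi]^+ periods earlier."""
    T = T_t
    tau_max = t
    for j in range(t, H + 1):
        if T <= T_min:        # the buffer would drop below T_min at j
            break
        tau_max = j
        T = (1 - c) * T - tw[j] - ch[j]   # HP off: cooling and demand only
    return tau_max - max(phi, 0)
```

The loop simulates the buffer with the HP off, exactly the auxiliary recursion T_{j+1} = (1 − c)T_{j} − TW_{j} − CH_{j} given with (5).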

Dynamic programming is an approach where all possible decisions are evaluated according to the value function in (4). This means that a true dynamic program would evaluate the value functions belonging to all τ ∈ {τ_{min}, . . . , τ_{max}}\M, which are all feasible points to turn the HP on again. However, to speed up the algorithm, τ is heuristically selected beforehand from {τ_{min}, . . . , τ_{max}}\M. The τ that is chosen is not necessarily the optimal τ, i.e., τ is not necessarily the point that gives the greatest value according to the value function. The point τ is chosen based on the available amount of renewable energy. That is, τ is selected such that the HP will be on during the period of length µ_{on} in {τ_{min}, . . . , τ_{max}} that has the greatest amount of surplus solar energy when the electricity of other devices, E_{t}, is subtracted. In this way, in the short run the HP uses as much solar energy as possible. The calculation of τ equals

τ = arg max_{i} { ∑_{j=i}^{i+µ_{on}−1} [PV_{j} − E_{j}]^{+} : i ∈ {τ_{min}, . . . , τ_{max}}\M }.  (6)

Now, if it is decided that the HP is off for the first period, x_{t}, . . . , x_{τ−1} = 0 and x_{τ}, . . . , x_{τ+µ_{on}−1} = 1. The next decision epoch will be at τ + µ_{on}, for which, given the values of x_{t}, . . . , x_{τ+µ_{on}−1}, (3) holds. To make things clear, take another look at Figure 2. The values above the time units represent the surplus solar energy available, which is PV_{t} − E_{t} for t ∈ {t_{0}, . . . , t_{16}}. The maximum sum of surplus solar energy over four consecutive time units is indicated by the blue rectangle. Hence, τ will be at t_{9} and the next decision epoch at t_{12}, which is τ + µ_{on}.
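The heuristic choice of τ in (6) amounts to a sliding-window maximisation. The sketch below assumes `pv` and `e` are lists of forecast values indexed by time; all names are illustrative.

```python
def choose_tau(pv, e, tau_min, tau_max, mu_on, M=()):
    """Restart point (6): the feasible i in {tau_min, ..., tau_max}\\M whose
    mu_on-period window has the most surplus solar energy [PV_j - E_j]^+."""
    feasible = [i for i in range(tau_min, tau_max + 1) if i not in M]
    def window_surplus(i):
        return sum(max(pv[j] - e[j], 0.0) for j in range(i, i + mu_on))
    return max(feasible, key=window_surplus)
```

In the Figure 2 example this picks the start of the blue rectangle: the window with the largest total surplus, skipping any start in M.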

4.1.4. Considerable large amount of demand

Whenever TW_{τ_{max}+1} + CH_{τ_{max}+1} > β · W, i.e., whenever the demand after the critical point τ_{max} is too much to be covered by the HP in one time period, the critical point τ_{max} must be shifted backwards in time. Let φ̄ be the number of consecutive periods with a demand higher than β · W and let φ be the number of periods the HP needs to be on to cover the demand in {τ_{max}+1, . . . , τ_{max}+φ̄}. Of course, there could be usable heat left in the buffer at τ_{max}, i.e., T_{τ_{max}} − T_{min}, where T_{τ_{max}} is calculated by (1) with x_{t}, . . . , x_{τ_{max}} = 0. This heat T_{τ_{max}} − T_{min} can be subtracted from the demand in {τ_{max}+1, . . . , τ_{max}+φ̄}. The number of time periods needed to cover this demand is φ, calculated as given in (7).

φ = ( ∑_{k=τ_{max}+1}^{τ_{max}+φ̄} (TW_{k} + CH_{k}) − [T_{τ_{max}} − T_{min}] ) / (β · W) − φ̄,  (7)

where the last φ̄ indicates that the HP can be on in the period {τ_{max}+1, . . . , τ_{max}+φ̄} as well. This value of φ will be used to update the value of τ_{max} in (5), which was initially calculated using φ = 0.
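Equation (7) can be transcribed directly. Here `T_tau_max` denotes the buffer temperature at the critical point (computed with the HP off), and whether φ should be rounded up to an integer number of periods is left open, as in the text; the names are assumptions for illustration.

```python
def phi_shift(tw, ch, tau_max, phi_bar, T_tau_max, T_min, beta, W):
    """phi of (7): on-periods needed to pre-heat for the phi_bar consecutive
    high-demand periods after tau_max, net of the usable heat left in the
    buffer and of the phi_bar periods the HP can also run during the demand."""
    demand = sum(tw[k] + ch[k]
                 for k in range(tau_max + 1, tau_max + phi_bar + 1))
    return (demand - (T_tau_max - T_min)) / (beta * W) - phi_bar
```

A negative result means the residual heat plus running during the demand itself already suffices, which is why only [φ]^{+} enters (5).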

4.1.5. Termination

The termination of the algorithm depends on the critical point τ_{max} described in Section 4.1.3. The termination follows the rule that for each V(t) with τ_{max} > H, all of (x_{t}, . . . , x_{H}) will be equal to 0 and V(t) = ∑_{i=t}^{H} Π_{i}(0). Hence, if the critical point τ_{max} lies beyond the time horizon H, the HP will be off up to and including H.

4.1.6. Special cases

The sets S^{1} and S^{0} contain the states for which the HP must be turned on and turned off, respectively. A state is in S^{1} if turning the HP off would violate one of the following constraints:

1. The HP has been on for less than µ_{on} time units, i.e., (3).

2. The temperature in the buffer will drop below T_{min} when leaving the HP off, i.e., (1 − c)T − TW_{t} − CH_{t} < T_{min}.

3. The temperature in the buffer will drop below T_{min} when turning the HP off for at least µ_{off} time units, i.e., the critical point τ_{max} lies before τ_{min}.

In the ADP algorithm, the states that violate the first constraint will never be encountered, since the next decision epoch is either t + 1 or τ + µ_{on}, with (3) satisfied in both cases. A state is in S^{0} if turning the HP on would violate one of the following constraints:

1. The HP has been off for less than µ_{off} time units, i.e., ∑_{i=t−µ_{off}}^{t−1} x_{i} > 0.

2. The temperature in the buffer will exceed T_{max} when leaving the HP on, i.e., (1 − c)T − TW_{t} − CH_{t} + β · (1 + λ) · W > T_{max}.

Again, since the ADP algorithm chooses τ_{min} = t + µ_{off}, this state, for which x_{t} must be zero, will never be encountered. The ADP algorithm is listed in Algorithm 1, making use of Algorithms 2 and 3 in Appendix B, for which the initial state (T, t) must be given by the user. However, the temperature is still not guaranteed to always be above T_{min}, for example, whenever the demand is greater than β · W for such a long period that even a full buffer is emptied. In these rare cases, the buffer can make use of an emergency gas supply to save what can be saved.

4.2. Optimal Algorithm

The ADP does not necessarily give the optimal solution, since not all possible values for τ are evaluated, as discussed in Section 4.1.3. Therefore, to find the optimal solution, one replaces Algorithm 3 in Algorithm 1 by Algorithm 4 in Appendix B. The optimal algorithm compares the value of the value function for all possible values τ ∈ {τ_{min}, . . . , τ_{max}}\M. With this algorithm, the quality of the solutions of the ADP will be tested.

4.3. Greedy Algorithm

The greedy algorithm is based on the behaviour of the PowerMatcher, introduced by TNO researcher Kok (2013) and described in Section 2. In general, when the constraints are not violated, the heat pump is turned on immediately when a proportion ρ of the power needed is available as solar energy, i.e., PV_{t−1} − E_{t−1} > ρ · W. The algorithm can be found in Algorithm 5 in Appendix B.
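Stripped of the constraint handling in Algorithm 5, the greedy switching rule reduces to a one-line threshold test; this is a sketch, and the parameter names are assumptions.

```python
def greedy_decision(pv_prev, e_prev, rho, W):
    """PowerMatcher-style rule: turn the HP on as soon as a fraction rho of
    its power W was available as surplus solar energy in the last period.
    The mu_on, mu_off, T_min and T_max constraints are checked elsewhere."""
    return int(pv_prev - e_prev > rho * W)
```

Because the rule only looks one period back, it can be triggered by a short solar peak, which is exactly the shortcoming discussed in Sections 1 and 2.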

5. Stochastic Approach

In practice, the values of TW_{t}, CH_{t}, E_{t} and PV_{t} are unknown before time t. So, in contrast to the deterministic algorithm, these parameters are considered stochastic. To make a decision on x_{t}, several methods of dealing with this stochasticity are listed by Spall (2003): machine (reinforcement) learning, simulation-based optimization, Markov decision processes and stochastic programming (e.g., multi-stage recourse models). To choose the appropriate approach, one must take into account that the algorithm should have a low computational complexity, since it must run inside a household. Furthermore, it would be beneficial if the algorithm could process updates in forecasts without rerunning the entire program. In Section 5.1 the stochastic approaches are briefly summarized. Readers who are familiar with these approaches can continue reading at Section 5.2.

5.1. Evaluation of Approaches

A Markov Decision Process (MDP), as described by Wiering and Van Otterlo (2012), is defined by a number of states, a set of actions, a transition matrix, and a reward function, which will be thoroughly explained in Section 5.3. Though an MDP is able to optimize the expected value and is therefore an optimal approach for some stochastic problems, the number of states is computationally limited and all states are independent of previous states, i.e., it only matters in which state the problem is and not how it got there. In our problem, the state space is continuous, since T is continuous, and the states are dependent, e.g., when a shower has been taken earlier that day, the probability that another shower is taken is very small. Therefore, to make the MDP approach computationally feasible, information about the problem must be omitted. As a consequence, it is not known whether the MDP optimizes the expected value of the actual problem.

Furthermore, if new forecasts on solar energy become available, the algorithm must start over entirely.

Reinforcement learning, as described by Wiering and Van Otterlo (2012), is a special case of a Markov Decision Process in which an action is evaluated after its effect has been observed. However, in our problem the effect of an action is known, which is why this method does not seem suitable.

Klein Haneveld and Van der Vlerk (2012) discuss the so-called recourse models, which are, in contrast to an MDP, time dependent, meaning that previous states are known and a suitable decision can be based on them. Moreover, a recourse model evaluates the expected value of an action and therefore makes a good decision under uncertainty. However, the number of states grows exponentially with the number of stages, for which the time complexity is much worse than that of an MDP.

Simulation is a method in which a number of scenarios are given that are deterministically solved and evaluated. Though the method does not automatically optimize the expected value, the application by Ströhle et al. (2014), as already stated in Section 2, gives promising results and is computationally simple. Therefore, we choose this approach to deal with the stochasticity. The quality of the solutions is tested against the deterministic algorithm, to evaluate the costs associated with the stochasticity, and against an MDP algorithm.

5.2. Simulation

To deal with the stochasticity of the problem, the method stated in Ströhle et al. (2014) will
be used, referred to as the simulation algorithm. The quality of its solutions is compared to
those of a Markov Decision Process, which is expected to give better solutions, though at a
higher computational complexity. At each point in time t, the simulation algorithm takes a number of
randomly chosen scenarios, for each of which the ADP schedules the HP. The simulation algorithm
then evaluates the solutions of the different scenarios to make a decision on x_{t}. First, an
explanation of how the scenarios are created is given, followed by a description of the
simulation algorithm.

5.2.1. Scenarios

A scenario, denoted by U, is defined as U = (TW_{{t,...,H}}, CH_{{t,...,H}}, E_{{t,...,H}}, PV_{{t,...,H}}) and
therefore gives values for the stochastic variables in the time range from t to H. The values of
these stochastic variables will be based on actual data. Let K = {1, . . . , 365} represent the days of
the year for which data is available. A scenario U, based on data from two days k_{d}, k_{s} ∈ K, is
denoted by

U^{k} = (TW_{{t,...,H}}^{k_{d}}, CH_{{t,...,H}}^{k_{d}}, E_{{t,...,H}}^{k_{d}}, PV_{{t,...,H}}^{k_{s}}), (8)

with TW_{{t,...,H}}^{k_{d}}, CH_{{t,...,H}}^{k_{d}}, E_{{t,...,H}}^{k_{d}} and PV_{{t,...,H}}^{k_{s}} being the observations on days k_{d} and k_{s}, in
the range t to H, of TW_{{t,...,H}}, CH_{{t,...,H}}, E_{{t,...,H}} and PV_{{t,...,H}}, respectively. Assume the current
day is k_{0}. To include seasonal effects, k_{d} will be randomly drawn from K_{D} = {k_{0} − m, . . . , k_{0} +
m}, m ∈ N, and k_{s} will be randomly drawn from K_{S} = {k_{0} − n, . . . , k_{0} + n}, n ∈ N. Hence,
for day k_{0}, a scenario U^{k} is built by randomly selecting k_{d} ∈ K_{D} ⊆ K and k_{s} ∈ K_{S} ⊆ K.

To create more scenarios, one draws N_{d} days from K_{D} and N_{s} days from K_{S}; all possible
combinations of these draws are listed in the set K, with k = (k_{d}, k_{s}) ∈ K, creating a total
of N_{d} · N_{s} scenarios. Note that K_{D} and K_{S} only depend on k_{0} and not on the current point
in time t. Hence, observations about demand and supply earlier that day are not taken into
account.
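As an illustration, the scenario construction above could be sketched as follows. The data layout (a mapping from day k to per-period observations) and all function names are assumptions of this sketch, not part of the thesis implementation:

```python
import itertools
import random

def build_scenarios(k0, data, m, n, N_d, N_s, t, H):
    """Draw N_d demand days and N_s supply days around day k0 and combine
    them into N_d * N_s scenarios, as in (8). `data[k]` is assumed to hold
    per-period observations for day k, e.g. data[k] = {"TW": [...], ...}."""
    K_D = list(range(k0 - m, k0 + m + 1))  # seasonal window for demand days
    K_S = list(range(k0 - n, k0 + n + 1))  # seasonal window for supply days
    demand_days = random.sample(K_D, N_d)
    supply_days = random.sample(K_S, N_s)
    scenarios = []
    for k_d, k_s in itertools.product(demand_days, supply_days):
        scenarios.append({
            "TW": data[k_d]["TW"][t:H + 1],  # tap-water demand from day k_d
            "CH": data[k_d]["CH"][t:H + 1],  # central-heating demand from k_d
            "E":  data[k_d]["E"][t:H + 1],   # other electricity demand from k_d
            "PV": data[k_s]["PV"][t:H + 1],  # solar supply from day k_s
        })
    return scenarios
```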

5.2.2. Simulation algorithm

At each point in time t of day k_{0}, N_{d} days are randomly drawn from K_{D} and N_{s} days
from K_{S}, creating the set K, for which a scenario U^{k} with k = (k_{d}, k_{s}) ∈ K is
created. For each scenario U^{k}, the ADP algorithm as described in Algorithm 1 gives values
for x_{t}, . . . , x_{t+H}. If in γ% of the scenarios x_{t} = 0, the HP will be turned off or left off; otherwise, the
HP will be turned on or left on.
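This decision rule can be sketched as below; `adp_schedule` is a stand-in for Algorithm 1 and is assumed to return the schedule x_t, . . . , x_{t+H} for a given scenario:

```python
def decide_x_t(scenarios, adp_schedule, gamma):
    """Majority-vote decision rule of the simulation algorithm: switch the
    HP off only if at least gamma% of the scenarios schedule x_t = 0.
    `adp_schedule(U)` is assumed to return the list x_t, ..., x_{t+H}."""
    off_votes = sum(1 for U in scenarios if adp_schedule(U)[0] == 0)
    fraction_off = 100.0 * off_votes / len(scenarios)
    return 0 if fraction_off >= gamma else 1
```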

There are states (T, t) in which there is no choice for x_{t}, since constraints would otherwise
be or have already been violated. Let these states again be in the sets S^{1} and S^{0} for x_{t} = 1 and x_{t} = 0, respectively.

States that are in S^{1} violate at least one of the following constraints:

1. The HP has been on for less than µ_{on} time units, i.e.,

∑_{i=t−µ_{on}}^{t−1} x_{i} < µ_{on}.

2. The current temperature is below T_{min}.

Since the problem contains stochasticity, it could happen that there is a great amount of demand
whilst little or no demand was expected and accounted for. Therefore, the temperature in the
buffer could drop below the minimum allowed temperature, for which the HP must be turned on
immediately. States that are in S^{0} violate at least one of the following constraints:

1. The HP has been off for less than µ_{off} time units, i.e.,

∑_{i=t−µ_{off}}^{t−1} x_{i} > 0.

2. The current temperature is above T_{max}.

Again, since the problem contains stochasticity, it could happen that there is little or no demand at all whilst a greater amount was expected and accounted for. Therefore, the temperature in the buffer could exceed the maximum allowed temperature, for which the HP must be turned off immediately.
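The forced-state checks for S^{1} and S^{0} could be sketched as follows; all names are illustrative, and `x_history` is assumed to hold the decisions x_1, . . . , x_{t−1}:

```python
def forced_action(x_history, T, T_min, T_max, mu_on, mu_off):
    """Return 1 if the HP must be (kept) on, 0 if it must be (kept) off,
    and None if the decision is free. Names are illustrative."""
    recent_on = x_history[-mu_on:]    # last mu_on decisions
    recent_off = x_history[-mu_off:]  # last mu_off decisions
    # S^1: HP is on but has been on for fewer than mu_on periods.
    if x_history and x_history[-1] == 1 and sum(recent_on) < mu_on:
        return 1
    # S^1: buffer temperature below the minimum.
    if T < T_min:
        return 1
    # S^0: HP is off but was still on within the last mu_off periods.
    if x_history and x_history[-1] == 0 and sum(recent_off) > 0:
        return 0
    # S^0: buffer temperature above the maximum.
    if T > T_max:
        return 0
    return None
```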

The simulation algorithm can be found in Algorithm 6 in Appendix B. It describes how the
value of x_{t} is calculated. To determine all values for x_{1}, . . . , x_{H} that constitute a solution, one
has to iteratively call the simulation algorithm and with each iteration update (T, t).

5.3. Markov Decision Process

A Markov Decision Process is defined as a discrete stochastic dynamic program and is used when decisions have to be made under uncertainty, where these decisions influence future decisions and rewards. Decisions have to be made at certain points in time. At such a decision epoch, the system is in a certain state. Moving from one state to another depends on the decision made and on a transition probability. Each state has its own reward function, including current and future rewards, regarding a certain decision, or what we call an action. Hence, a Markov Decision Process contains a state space S, an action space A, a transition matrix M and a reward function R. Each element will be thoroughly discussed in Sections 5.3.1 to 5.3.4.

5.3.1. States

A state, denoted by s ∈ S, is a unique characterization of all that is relevant to the problem
that is modelled. The state space S is a finite set of states {s^{1}, . . . , s^{N}}, with N ∈ N
being the number of different states in which the problem can appear. In our problem, the
elements that define a state are the temperature of the buffer, T, the number of time periods
the HP has been on, N_{on}, or off, N_{off}, and the current point in time, t. For simplicity, the
temperature of the buffer and the time periods are discretized, for which the formal definition
of a state is given by

s = {T, N_{on}, N_{off}, t}, (9)

with

T ∈ {T_{min}, . . . , T_{max}},
N_{on} ∈ {0, . . . , µ_{on}},
N_{off} ∈ {0, . . . , µ_{off}},
t ∈ {1, . . . , H}.

Note that N_{on} > 0 ⇔ N_{off}= 0 and N_{off}> 0 ⇔ N_{on}= 0 since the HP cannot be both a strictly
positive number of periods on and a strictly positive number of periods off at the same time.
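The discretized state space (9), together with the invariant above, could be enumerated as in the following sketch (names and the integer temperature grid are illustrative assumptions):

```python
import itertools

def enumerate_states(T_min, T_max, mu_on, mu_off, H):
    """Enumerate all states s = (T, N_on, N_off, t) of (9), keeping only
    those satisfying N_on > 0 <=> N_off = 0. Temperatures are taken on an
    integer grid for this sketch."""
    states = []
    for T, N_on, N_off, t in itertools.product(
            range(T_min, T_max + 1),
            range(mu_on + 1),
            range(mu_off + 1),
            range(1, H + 1)):
        # The invariant from Section 5.3.1: exactly one of the counters
        # is strictly positive.
        if (N_on > 0) == (N_off == 0):
            states.append((T, N_on, N_off, t))
    return states
```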

5.3.2. Actions

The future state of the system can be influenced by actions. The set of actions A is defined as
the finite set {a^{1}, . . . , a^{L}}, where L ∈ N is the number of actions. For each state s ∈ S there
is a subset A(s) ⊆ A of feasible actions, since not all actions can be taken in every state
(for example, if the HP has not yet been on for µ_{on} time units, the action must be “on”
for this period). The action space in our problem is the finite set A = {0, 1}, indicating either switching the HP off (a = 0) or on (a = 1).

5.3.3. Transition matrix

When taking action a ∈ A in a state s ∈ S, the system makes a transition to a new state s′ ∈ S,
based on a probability distribution over the set of possible transitions. Hence, the transition
matrix M is defined over S × A × S → [0, 1], i.e., it gives the probability of moving from state
s to state s′ by taking action a. Note that 0 ≤ M(s, a, s′) ≤ 1 and ∑_{s′∈S} M(s, a, s′) = 1 for all
feasible combinations of states s and actions a. Note that the probabilities are not influenced by states previous to state s or by previous actions, i.e., the Markov property holds, given by

P (s_{t+1}|s_{t}, a_{t}, s_{t−1}, a_{t−1}. . .) = P (s_{t+1}|s_{t}, a_{t}) = M (s_{t}, a_{t}, s_{t+1}).

For our problem, the probabilities are classified into two cases:

M(s, a, s′) ≥ 0 if t_{s} + 1 = t_{s′} and
    N_{on}^{s′} = 0, if a = 0,
    N_{on}^{s′} > 0, if a = 1,
    N_{off}^{s′} > 0, if a = 0,
    N_{off}^{s′} = 0, if a = 1,
    T_{s} ≥ T_{s′}, if a = 0;
M(s, a, s′) = 0 otherwise.

For the first case, if all conditions hold and the action is “off”, the transition probability is
equal to the probability of observing a demand of size T_{s} − T_{s′}. Note that this probability can be
zero if a demand of this size is never observed. If the action is “on”, the probability is equal to
that of observing a demand of size T_{s} − T_{s′} + β · W · (1 + λ[N_{on} − µ_{on} + 1]^{+}), where the last maximum
value is equal to zero except when N_{on} = µ_{on}. Hence, the probabilities are:

M(s, 0, s′) = P(T_{s} − T_{s′} = CH_{t_{s}} + TW_{t_{s}}),
M(s, 1, s′) = P(T_{s} − T_{s′} = CH_{t_{s}} + TW_{t_{s}} − β · W · (1 + λ[N_{on} − µ_{on} + 1]^{+})).

Note that, since T is discretized, both CH_{t_{s}} and TW_{t_{s}} are discretized as well, such that the probability will not be zero. Hence, there is a possibility that the amount of demand is too small to make a significant change in T, and T will therefore not change. Accordingly, the cooling parameter is disregarded.
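One way to obtain these probabilities in practice is to estimate the distribution of the discretized demand empirically from historical observations; the sketch below makes that assumption, and all names are illustrative:

```python
from collections import Counter

def demand_distribution(observed_drops):
    """Estimate P(T_s - T_s' = d) from discretized historical observations
    of the per-period temperature drop caused by demand (HP off)."""
    counts = Counter(observed_drops)
    total = sum(counts.values())
    return {d: c / total for d, c in counts.items()}

def transition_prob_off(T_s, T_sprime, demand_dist):
    """M(s, 0, s'): probability that demand equals the temperature drop;
    zero if a drop of this size was never observed."""
    return demand_dist.get(T_s - T_sprime, 0.0)
```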

5.3.4. Reward Function

The reward function is defined as R : S × A → R, and therefore specifies a reward for taking some action in a certain state. The reward function is used to value a state, so that later on an algorithm can choose between states to arrive at an optimal strategy. In our problem, based on (2), the reward functions are stated as follows:

R_{on}(s) = p_{u} min{F_{t_{s}}^{PV}, F_{t_{s}}^{E} + W} + p_{s} [F_{t_{s}}^{PV} − F_{t_{s}}^{E} − W]^{+} − p_{d} [F_{t_{s}}^{PV} − F_{t_{s}}^{E} − W]^{−},

R_{off}(s) = p_{u} min{F_{t_{s}}^{PV}, F_{t_{s}}^{E}} + p_{s} [F_{t_{s}}^{PV} − F_{t_{s}}^{E}]^{+} − p_{d} [F_{t_{s}}^{PV} − F_{t_{s}}^{E}]^{−},

with F_{t_{s}}^{PV} denoting the forecast, in terms of expected value, of the available solar
energy at time t_{s}, and F_{t_{s}}^{E} the expected value of the demand for electricity of the other devices at time t_{s}.
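The two reward expressions differ only in whether the HP's consumption W is added to the demand, so they can be sketched as one function. Parameter names follow the thesis notation; the function itself is an illustrative sketch:

```python
def reward(F_PV, F_E, a, W, p_u, p_s, p_d):
    """R(s, a) as in the equations above: the HP's consumption W is added
    to the other demand only when the action is "on" (a = 1). p_u, p_s and
    p_d are the prices for self-used, sold and drawn electricity."""
    demand = F_E + W if a == 1 else F_E
    surplus = F_PV - demand
    used = p_u * min(F_PV, demand)      # self-consumed solar energy
    sold = p_s * max(surplus, 0.0)      # surplus sold to the grid: [.]^+
    drawn = p_d * max(-surplus, 0.0)    # shortage drawn from the grid: [.]^-
    return used + sold - drawn
```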
The state space, the action space, the transition matrix and the reward function together define
a Markov Decision Process; a formal definition, cited from Wiering and Van Otterlo (2012), is the following.

Definition. A Markov decision process is a tuple hS, A, M, Ri in which S is a finite set of states, A a finite set of actions, M a transition function defined as M : S × A × S → [0, 1] and R a reward function defined as R : S × A → R.

5.3.5. Algorithm

One of the most efficient algorithms to calculate an optimal policy for a Markov decision process is the value iteration algorithm introduced by Bellman (1957). It is based on dynamic programming and makes use of the so-called value function V(s), with the optimal policy satisfying:

V^{∗}(s) = max_{a∈A} ∑_{s′∈S} M(s, a, s′) [R(s, a, s′) + V^{∗}(s′)]. (10)

Equation (10) gives the expected reward of undertaking a certain action. Let i ∈ N be an index for the number of iterations and let (11) be the updating rule for the value function.

V_{i+1}(s) = max_{a∈A} ∑_{s′∈S} M(s, a, s′) [R(s, a, s′) + V_{i}(s′)]. (11)

Starting with a value function V_{0} over all states and applying (11), in the limit on the number of
iterations, lim_{i→∞} V_{i}(s) = V^{∗}(s). In general, the algorithm stops when ||V_{i+1}(s) − V_{i}(s)|| < ε.

The value iteration algorithm is illustrated in Algorithm 7 in Appendix B. Based on the final V(s), at each state s one chooses the action that maximizes the expected value, see (10), which leads to an optimal policy for all states. In our problem, states are updated according to (1) and the rule t = t + 1. Then, the state defined in the MDP that comes closest to the current state is chosen and an action is taken according to its policy.
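A minimal sketch of value iteration with the stopping criterion above, assuming M and R are given as functions and the state and action sets are small enough to enumerate (in our problem t increases with every transition, so the iteration converges without discounting):

```python
def value_iteration(states, actions, M, R, eps=1e-6):
    """Iterate the update rule (11) until the sup-norm change in the value
    function drops below eps, then return the final value function.
    M(s, a, s') and R(s, a, s') are assumed given as Python functions."""
    V = {s: 0.0 for s in states}  # V_0: arbitrary starting value function
    while True:
        V_new = {
            s: max(
                sum(M(s, a, sp) * (R(s, a, sp) + V[sp]) for sp in states)
                for a in actions)
            for s in states
        }
        # Stopping criterion: ||V_{i+1} - V_i|| < eps (sup-norm)
        if max(abs(V_new[s] - V[s]) for s in states) < eps:
            return V_new
        V = V_new
```

From the returned V one recovers the optimal policy by taking, in each state, the action that attains the maximum in (10).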

6. Numerical Results

The performance of both the deterministic ADP algorithm and the stochastic simulation algo-
rithm will be evaluated on instances that are created based on available data. The data consist
of per-minute observations on PV_{t}, E_{t}, TW_{t} and CH_{t} in the months June to December, which
comprise 214 days. The time t is discretized in steps of 5 minutes.

An instance is dependent on the time horizon H, i.e., the number of decision epochs, and
is created by splitting the 214 days of data into periods of length H. Let H_{ADP} be the number