
MSc Artificial Intelligence

Master Thesis

Energy- and Cost-Efficient Pumping Station Control

by

Timon Kanters

10322833

October 2015

42 ECTS

January 2015 – October 2015

Supervisor Examiner Assessor


Abstract

With renewable energy becoming more common, energy prices fluctuate more depending on environmental factors such as the weather. Consuming energy without taking volatile prices into consideration can not only become expensive, but may also increase the peak load, which requires energy providers to generate additional energy using less environmentally friendly methods. In the Netherlands, pumping stations that maintain the water levels of polder canals are large energy consumers, but the controller software currently used in the industry does not take energy availability into account.

In this thesis, we investigate whether existing AI planning techniques have the potential to improve upon current pumping station controllers used in practice. To facilitate our controller and its evaluation, we develop a lightweight but realistic polder system simulator that models a real-world environment. We propose using an online planning method (UCT) that considers energy prices and coordination between pumping stations to improve the cost-efficiency of pumping station control policies.

Through evaluating several aspects and adaptations of UCT, we provide a UCT configuration suitable for the pumping station control problem and show how UCT can be made scalable to support large numbers of pumping stations. An empirical comparison with the current pumping station control algorithms indicates that substantial cost, and thus peak load, reduction could be attained.


Contents

1 Introduction 3
1.1 Problem Description . . . 3
1.2 Thesis Contributions . . . 4
1.3 Collaborating Parties . . . 5

2 The Current Approach To Pumping Station Control 6
2.1 Polder System Vereenigde Raaksmaats- en Niedorperkoggeboezem . . . 6
2.2 Current State-of-the-Art: ControlNEXT . . . 7
2.3 Energy Markets . . . 8

3 Decision Making Through Artificial Intelligence 10
3.1 Formal Decision Process Definition . . . 10
3.2 Monte Carlo Tree Search . . . 11
3.2.1 Upper Confidence Bounds for Trees . . . 12
3.2.2 Predicate-Average Sampling Technique . . . 14
3.2.3 Factored Value Partially Observable Monte Carlo Planning . . . 14

4 Polder System Simulation 16
4.1 The Polder System Model . . . 16
4.1.1 State Definition . . . 17
4.1.2 Action Definition . . . 18
4.2 State Transitions . . . 18
4.2.1 Effects of Actions . . . 18
4.2.2 Water Flow . . . 19
4.2.3 Transitions of Rain and Energy Prices . . . 21
4.3 Model Realism Evaluation . . . 24
4.3.1 Realism of Real Simulator . . . 24
4.3.2 Realism of Planning Simulator . . . 27

5 UCT Based Pumping Station Controllers 28
5.1 Action Quality and Rewards . . . 28
5.1.1 Penalty for Pumping . . . 29
5.1.2 Penalty for Water Levels . . . 30
5.1.3 Penalty for Flooding . . . 30
5.2 UCT Implementation . . . 31
5.2.1 Rollout Policies . . . 31

6 Controller Evaluation 33
6.1 Experimental Setup . . . 33
6.2 UCT Configurations . . . 34
6.2.1 Number of Planning Simulations . . . 34
6.2.2 Search Depth . . . 35
6.2.3 UCT Tree Binning . . . 35
6.2.4 UCB Constant . . . 36
6.2.5 Discount Factor . . . 36
6.3 Rollout Policies . . . 37
6.4 Scalability . . . 37
6.5 Baseline Comparison . . . 39

7 Discussion 41
7.1 Conclusion . . . 41
7.2 Future Work . . . 42


Chapter 1

Introduction

In this thesis, we describe the research done into controlling pumping stations in polder systems. We discuss the main difficulties, the current approach to controlling pumping stations, and both successful and unsuccessful improvements.

1.1 Problem Description

The Netherlands has a close connection to water. A large part of the land lies below sea level and a great deal of it has been reclaimed from the sea or other waters. These areas are called polders and are protected by dikes. Though safe from the sea, they still contain a lot of water that hydrates the land for agriculture and prevents the polders from deteriorating due to peat dryness. This does not come without risk, however, as over time the water level will rise due to external events such as rain. To prevent flooding or the land becoming unsuited for farming, it is crucial that the polder's waters remain at a desired level by removing excess water.

Using pumping stations, the polder's water is transferred into canals (or boezem in Dutch) that allow the water to be released into the sea or large lakes, such as the IJsselmeer, through the end drain. These pumping stations rely on controllers to determine when they start pumping water and at which capacity. These controllers may be human, software, or a combination of both where software advises humans. In this thesis, we investigate software controllers and propose our own with the goal of improving the current state-of-the-art. Our main task when controlling the pumping stations is to keep the water levels within acceptable margins while minimising the energy cost required to do so. As pumping stations require a lot of power and energy prices can differ greatly even within small time periods, intelligently choosing when to enable the pumps allows us to operate much more cost-efficiently. Doing so requires us to reason about future energy prices, as well as the short and long term effects of actions, which involve the flow of water and the effects of rain.


1.2 Thesis Contributions

In our search for a good controller, we focus on new Monte Carlo Tree Search (MCTS) [Kocsis and Szepesvári, 2006] based solutions. MCTS is a sample-based planning method that has proven to be effective in solving complicated problems that require thinking far ahead [Fern and Lewis, 2011, Finnsson and Björnsson, 2008, Lee et al., 2009, Sheppard, 2002, Enzenberger et al., 2010]. It works by observing the effects of actions in a simulator of the problem, allowing it to determine which actions seem promising without actually performing them in the real world. By trying multiple actions in succession in the simulator, it is capable of estimating the long term quality of actions. As MCTS has many aspects that can be optimised for specific problems, we discuss several variants of it. These variants, though similar in nature, have differences that greatly affect their performance. Our main MCTS contribution and research concern the modifications required for MCTS to fit the problem, alternative rollout policies, and decentralisation. Furthermore, we propose pumping station controllers that operate more cost-efficiently than the current state-of-the-art.

As sample-based planning methods require a simulator to estimate action quality, we research and construct a simulation model for polder systems that is abstract enough to perform the many simulations required by sample-based planning, but realistic enough to give an accurate indication of the effects of actions. This simulation model includes pumping behaviour, water flow, energy prices and rainfall.

Throughout this thesis, we look into the following research questions:

• How can a polder system be modelled to support MCTS?

• How can future water levels be predicted?

• How can future energy prices be predicted?

• What adjustments or implementation details does MCTS require to control pumping stations?

• How do different rollout policies affect the performance of MCTS in the pumping station control task?

• How can MCTS deal with many pumping stations to control and how does coordination affect performance?

By finding the answers to these questions, we are able to answer our main research question: Can artificial intelligence methods, such as MCTS, improve upon current pumping station controllers?

In order to broaden the audience of this thesis project's accomplishments, we have compiled the key information from this report into a paper, which has been submitted to the computational sustainability track of the thirtieth AAAI conference. We hope that our promising findings will lead to reduced energy costs and peak load in practice, as well as show the potential of artificial intelligence in these fields and inspire others to perform similar research.


1.3 Collaborating Parties

During this thesis project, we have received assistance in the form of information and insight from the company Nelen & Schuurmans (N&S) and the water board Hoogheemraadschap Hollands Noorderkwartier (HHNK). N&S specialises in water management solutions and came into contact with the University of Amsterdam (UvA) for that purpose. Through this relation, the subject for this thesis was conceptualised and, as such, N&S plays a big role in it. Over the course of this project, we had several meetings with N&S to discuss requirements and progress.

One of the parties that N&S supports is HHNK, a water board responsible for many water related tasks in Noord-Holland. As the authority in the field, HHNK has in-depth knowledge of the polders and canals in the area. They provided us with the data required to model the environment and broached subjects that improved simulation realism.


Chapter 2

The Current Approach To Pumping Station Control

Figure 2.1: An overview of the VRNK polder system with main polders indicated by coloured areas and their pumping stations by numbers.

In order to find a good way of controlling the pumping stations and simulating the environment, we first look into the details of the environment, the current controller in place and the energy markets. In this chapter we start by discussing our target polder system. We target a real-world polder system to ensure that our controller is evaluated through realistic experiments, and so that it may later be used during a pilot in practice. We then explain how the pumping stations in the polder system are controlled by N&S. As pumps are great energy consumers, we finally discuss the available energy markets. These markets have large differences that directly affect their applicability and potential cost savings.

2.1 Polder System Vereenigde Raaksmaats- en Niedorperkoggeboezem

Though the environment could practically be any polder system, in this thesis we focus on the Vereenigde Raaksmaats- en Niedorperkoggeboezem (VRNK) polder system. This polder system lies to the northeast of the Dutch city Heerhugowaard and can be seen in Figure 2.1. The VRNK system has been proposed by N&S as suitable for experimentation in the form of testing new controllers, and has a large potential for energy cost saving as indicated by research commissioned by HHNK and executed by N&S [Nelen & Schuurmans, 2013].

Figure 2.2: Dikes (grey) separating a canal (blue) from the lower polders (green).

The VRNK polder system has a water surface of 163 ha and discharges in the north. It contains 20 polders with a typical surface between 300 ha and 700 ha. We consider the pumping stations of seven polders to be controllable by our method; these polders have been selected for real-world pilots by N&S as they have the largest pumping capacity and thus the highest expected cost reduction [Nelen & Schuurmans, 2013]. As mentioned in Section 1.1, pumping stations are used to transport water from polders to canals that allow the water to be discharged. Pumping stations are required because the water levels in the canals are much higher than those in the polders, as illustrated in Figure 2.2. The target water level of the canals is generally -0.6 mNAP1, while target levels differ greatly amongst polders. The range that water levels may deviate from their target varies as well, but is generally from 0.1 m below target to 0.4-0.6 m above target.

2.2 Current State-of-the-Art: ControlNEXT

Currently, the inlet to and discharge from the VRNK canals is fully automated using the controller software ControlNEXT [Deltares, 2015a]. ControlNEXT is based on expert knowledge and the results of a repeatedly running hydrological simulation model, and has proven itself in practice through its use by various companies. As such, it forms an excellent baseline for the results found in this thesis. We have implemented our own adaptation of ControlNEXT, suitable for our simulation, based on the specifications provided by N&S.

ControlNEXT bases its pumping advice on the expected amount of excess water in the future. This gives a good indication of how much pumping capacity is needed and over what time span. The end drain, which lowers the water level in the canals by pumping water out of them, follows the advice given by these calculations. However, the end drain is only advised to pump when the water levels in the canal are high enough, so that draining water for future expectations does not cause the levels to drop too low before they rise again. The pumping stations for polders work on a simpler model and pump based on the current water levels. ControlNEXT does not reason about energy prices or interactions between pumping stations, however, which offers great potential for improvement. The pseudocode for this approach can be seen in Algorithm 2.1.

1 mNAP stands for 'meter boven Normaal Amsterdams Peil', or 'meters above Amsterdam Ordnance Datum', and is used as a unit for water level height.


action = {}
for each pumping station
    if station is connected to a polder
        if polderLevel > polderPumpLevel
            action ← 1
        else
            action ← 0
    else if station is connected to a drain
        if current level high enough
            nearFuturePower = getPumpingPowerNeededOver8Hours()
            distantFuturePower = getPumpingPowerNeededOver24Hours()
            action ← max(nearFuturePower, distantFuturePower)
return action

Algorithm 2.1: The pseudocode for ControlNEXT.
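For readers who prefer runnable code, the rule above can be sketched in Python. The function and field names (`control_next_action`, `polder_level`, `power_needed_8h`, and so on) are our own inventions, and the 8- and 24-hour look-aheads are passed in as precomputed numbers rather than derived from a hydrological model as in the real ControlNEXT.

```python
def control_next_action(stations):
    """Sketch of the ControlNEXT rule: polder stations pump on a simple
    level threshold, while the end drain follows look-ahead advice."""
    action = {}
    for station in stations:
        if station["kind"] == "polder":
            # Pump only while the polder level exceeds the pump threshold.
            above = station["polder_level"] > station["pump_level"]
            action[station["name"]] = 1 if above else 0
        elif station["kind"] == "drain":
            if station["canal_level"] >= station["min_drain_level"]:
                # Placeholder look-aheads; the real controller derives these
                # from hydrological simulations over 8 and 24 hours.
                near = station["power_needed_8h"]
                distant = station["power_needed_24h"]
                action[station["name"]] = max(near, distant)
            else:
                action[station["name"]] = 0
    return action
```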

2.3 Energy Markets

The pumping stations that manage the water levels in polder systems are great energy consumers. In a year, the typical amount of energy used is between 20,000 kWh and 120,000 kWh per pumping station. To produce the energy required by the pumping stations, water boards rely on purchasing energy through one of the available markets. Like many European electricity markets, the Dutch electricity market consists of three options: the Day-Ahead Auction (Ea), the Intraday market (Er) and the imbalance market (Ei).

The Day-Ahead Auction is held one day before delivery. The Intraday market allows purchases at hourly intervals as well as freely definable block orders up to 5 minutes prior to delivery. While these two markets are two-sided in that they settle demand and supply between different electricity traders, the imbalance market typically employs a one-sided auction where up- or downward regulation power (or reserve capacity) is offered to the transmission system operator as the single buyer. When using the imbalance market, the buyer makes himself available to purchase energy up to a certain amount. However, when this energy is delivered and at what exact cost is not known in advance. All of these markets are characterised by stochastic price developments, with decreasing transaction volume (vEa > vEr > vEi), increasing average prices (µEa ≤ µEr ≤ µEi) and increasing price volatility (σEa < σEr < σEi).

Based on advice from N&S and their research [Nelen & Schuurmans, 2013], we looked into using the imbalance market first. The prices of the imbalance market fluctuate a lot, and pumping when energy is cheap could save a lot of costs. However, pumping at the desired moments might not be an easy task; the inability to predict when energy will be delivered or how much it will cost makes the task of planning much harder and unreliable. As such, we deem the imbalance market unfit for our controller in practice. The Day-Ahead Auction is also not a viable option, as the required energy is not accurately known a full day in advance, leaving us with the Intraday market. This market allows us to purchase energy at a known price within a sufficiently small time frame. Unfortunately, the availability of historical data for the Intraday market is very limited; we were only able to find hourly data for the last 6 months. APX, the company behind the Intraday market, offers up to 4 years worth of transaction data. However, this data only contains energy prices at random and often large intervals. For proper planning and experiment sampling (e.g., we want to sample different seasons), we require data that describes the energy price at least on an hourly basis and with a greater time span than 6 months.

As such, we turn to the historical price data of the imbalance market for our experimental results. The data provided by TenneT [TenneT, 2015] gives us roughly 10 years of history to work with. Though the energy prices of the imbalance market do not follow the same patterns as those of the Intraday market, they are suitable for our experiments, as they fulfil the purpose of illustrating the differences between pumping station controllers with realistically changing energy prices. We manipulate the data slightly, however: since the imbalance market occasionally has negative energy prices when there is an excess amount and the Intraday market does not, we set negative prices to 0. As the imbalance market is notoriously difficult to predict, we expect results to generalise to other market price signals, given that they may be more predictable and thus less challenging to plan for.
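The clipping of negative prices is a one-line transformation; this minimal sketch assumes `prices` is a plain list of historical prices:

```python
def clip_negative_prices(prices):
    # The imbalance market occasionally has negative prices; the Intraday
    # market does not, so negative values are set to 0.
    return [max(0.0, p) for p in prices]
```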


Chapter 3

Decision Making Through Artificial Intelligence

Computer decision making is often done by following a set of rules predefined by domain experts. While these heuristic methods can be effective, they are often limited by their static behaviour. As problems grow more complex, so does the potential for more intelligent methods to improve upon heuristic methods. In the past few decades, a lot of research has been done on decision making through artificial intelligence. We describe the most relevant related work, on which we base our work, in this chapter.

3.1 Formal Decision Process Definition

In order to allow artificial intelligence methods to provide solutions to our problem, we first define it in a suitable model. The technique that our controllers are based on (MCTS) treats its environment as a Markov decision process (MDP) [Puterman, 1994]. An MDP is similar to a Markov chain [Norris, 1998], which defines the probabilities of states occurring after a given state, with the addition of actions and rewards. Such MDPs can be expressed as tuples ⟨S, A, T, R, µ⟩ where

• S is an infinite set of states;

• A is a finite set of actions;

• T : S×A×S → [0, 1] is a transition function which specifies the probability of a next state occurring for each state and action;

• R : S × A × S → R, is a reward function which specifies the immediate reward for each state, action and next state;

• µ : S → [0, 1] is a probability distribution over initial states.

The state in an MDP describes what all variables in the environment are currently set to (e.g., the water levels). The exact definition is specified in Section 4.1.1. Actions, defined in Section 4.1.2, are the options that the decision maker (also known as the agent, or controller in our case) can select to execute.


Through the transition function, the current state changes based on the action which was selected. This is described in Section 4.2. After each such transition, the controller receives a reward through the reward function (Section 5.1), which indicates the quality of the transition.

In the real world, our task is an infinite horizon problem, which runs indefinitely. Since we cannot run experiments for an infinite amount of time, infinite horizon problems often have their rewards discounted such that rewards received later on are valued less than those received early on. In this case, experiments run until the rewards are discounted so much that they barely affect the cumulative reward any more. We do not use this in our model though, as our experiments must already run for a very long time before differences between controllers become apparent and the problem often gets more challenging later on, making it important for the later rewards not to be neglected. Instead, we manually set a large horizon for our experiments (two simulated weeks) and observe the cumulative reward that the controller received at each time step, or decision epoch, up to that point. In order to prevent the controller from optimising towards this horizon, it is not aware of it.
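The two reward accumulation schemes discussed above differ only in the discount applied; as a small illustration (with made-up reward values):

```python
def cumulative_reward(rewards, discount=1.0):
    # Sum of rewards r_t weighted by discount^t; discount = 1.0 gives the
    # plain cumulative reward tracked up to the manual horizon, while a
    # discount < 1 down-weights rewards received later on.
    return sum((discount ** t) * r for t, r in enumerate(rewards))
```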

A full MDP model specifies both the probabilities of transitions P(s′|s, a), as well as the corresponding immediate rewards R(s, a, s′) that specify the task. MCTS methods, however, are sample-based planning methods that do not need access to a full MDP model. Instead, they only need a generative model, or simulator, G from which transitions and rewards can be sampled: s′, r ∼ G(s, a). Such a generative model is generally easier to obtain or to learn compared to the full representation.

When MCTS is applied in the real world, it uses the simulator Gp for planning. During this thesis project, however, we evaluate the proposed method in simulation. This requires us to use two different simulators: the real simulator Gr as a stand-in for the real world, and the planning simulator Gp as the generative model given to the agent for planning purposes. In many cases, these two simulators can be identical, but we will need to treat them differently in some aspects, as discussed later.
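A generative model in this sense only needs to expose sampled transitions. The following Python sketch shows the interface with deliberately trivial, made-up dynamics (uniform rain, a fixed pumping effect, and a reward penalising deviation from a zero target level), not the simulators developed in this thesis.

```python
import random


class GenerativeModel:
    """Minimal G(s, a) -> (s', r) interface for sample-based planning."""

    def sample_transition(self, state, action):
        raise NotImplementedError


class ToyPolderModel(GenerativeModel):
    # Illustrative dynamics only: rain raises the water level, pumping
    # lowers it, and the reward penalises deviation from a target of 0.
    def sample_transition(self, state, action):
        rain = random.uniform(0.0, 0.02)
        next_state = state + rain - 0.05 * action
        reward = -abs(next_state)
        return next_state, reward
```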

3.2 Monte Carlo Tree Search

When it comes to decision making, artificial intelligence offers many different methods. How these methods work often differs greatly, causing each method to have its own strengths and weaknesses. The problem we are dealing with requires a method that is effective even with large state spaces and actions that affect states far into the future. The latter is especially important, because the immediate negative reward for pumping often only pays off over a longer time span.

In a problem such as ours, it is infeasible to search through all possible actions throughout the future. A good way to avoid this while still getting good results is by sampling. Sampling methods explore only those parts of the possibilities that they believe hold potential for a good course of action. One such method that has become popular over the last few years is Monte Carlo Tree Search (MCTS) [Kocsis and Szepesvári, 2006]. It has repeatedly proven capable of handling large state spaces and thinking far ahead in practice, thanks to its broad applicability and customisability [Browne et al., 2012]. For example, in 9 × 9 Go, MCTS based solutions have achieved dan (master) level [Lee et al., 2009] and even beat a top human professional [Enzenberger et al., 2010]. For these reasons, MCTS lends itself well to our task and we use it as the base for our solution.

Figure 3.1: The MCTS process where circles are states, lines are actions and the square indicates the rollout result [Baier and Drake, 2010].

MCTS is a planning algorithm that works by sampling the effects of actions and creating a tree structure based on the results. These actions and their effects are simulated in a separate environment from the one where the action is ultimately performed. This allows MCTS to search for the best action without performing bad actions in the process. MCTS uses two different policies for selecting actions while searching: the tree policy and the rollout policy. Starting at the current state, MCTS uses the tree policy to select an action to sample. Using the resulting state, it moves through the tree until it reaches a leaf node. At a leaf node, where the resulting state is not yet present in the tree, the resulting state is added as a node. After this, a rollout phase is performed where the rollout policy selects actions in a fast manner to determine the cumulative reward at a fixed number of decision epochs in the future. This reward is then stored in the preceding tree nodes to indicate their quality. The process repeats until a time or simulation limit has been reached. When finished, the action with the highest node value is the one selected to actually perform.

3.2.1 Upper Confidence Bounds for Trees

The tree policy is an important factor in the performance of MCTS; sampling actions that are not profitable in the long run wastes computational resources. The goal is thus to find the actions with the most potential as soon as possible, while not neglecting good actions. One of the most successful ways of doing this is through Upper Confidence Bounds for Trees (UCT) [Kocsis and Szepesvári, 2006]. Using the algorithm UCB1 [Auer et al., 2002] (where UCB stands for Upper Confidence Bounds), actions are selected based on their optimistic potential value. As actions perform worse, their upper confidence bound drops and actions that have higher bounds are selected instead.
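The UCB1 score itself is a one-line formula: the empirical mean value of an action plus an exploration bonus based on how often it has been tried relative to its parent node. A small Python rendering (the constant `c` corresponds to the UCB constant evaluated in Chapter 6):

```python
import math


def ucb_value(mean_value, parent_visits, action_visits, c=1.0):
    # UCB1: empirical mean plus an exploration bonus that shrinks as the
    # action is tried more often relative to its parent node.
    return mean_value + c * math.sqrt(math.log(parent_visits) / action_visits)
```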

The pseudocode for the UCT adaptation of MCTS that we use is shown in Algorithm 3.1. This algorithm starts by calling selectAction for the current state. selectAction keeps running simulations from the root of the tree by calling simulate, which recursively selects actions using the UCB value given by ucbValue. After simulate expands the tree, it initiates the rollout phase by calling rollout, which runs recursively until the search depth has been reached.


% The main function for starting the search for the best action
function selectAction(state)
    % Construct the base MCTS tree
    searchTree = new SearchTree(state)

    % Simulate performing actions until the limit has been reached
    numSimulations = 0
    while numSimulations < SIMULATION_LIMIT
        simulate(state, searchTree, 0)
        numSimulations += MAX_DEPTH

    % Return the action that was estimated to be best
    return searchTree.getBestAction()

% The function for simulating an action in the tree phase
function simulate(state, searchTree, depth)
    % Stop simulating once the search depth has been reached
    if depth >= MAX_DEPTH
        return 0

    % If a leaf node has been reached, make note of it and start
    % the rollout phase
    if searchTree.numVisits == 0 && depth > 0
        searchTree.numVisits += 1
        return rollout(state, depth)

    % Make sure the current node contains its available actions
    if !searchTree.isInitialised
        searchTree.initialiseChildren

    % Use UCB to select the best action to sample
    action = argmax(searchTree.children -> ucbValue(searchTree, child))

    % Sample the effects of the selected action
    [nextState, reward] = SIMULATOR.sampleTransition(state, action)

    % Expand the tree
    nextSearchTree = action.addChild(nextState)

    % Recursively estimate the expected discounted return
    expectedReward = (reward + DISCOUNT_FACTOR
        * simulate(nextState, nextSearchTree, depth + 1))

    % Update the number of visits and the reward in the current
    % tree node
    searchTree.numVisits += 1
    action.numVisits += 1
    action.value += ((expectedReward - action.value)
        / action.numVisits)

    % Return the expected discounted rewards from this point
    return expectedReward

% The function for simulating an action in the rollout phase
function rollout(state, depth)
    % Stop simulating once the search depth has been reached
    if depth >= MAX_DEPTH
        return 0

    % Use the rollout policy to select the best action to sample
    action = ROLLOUT_POLICY.selectAction(state)

    % Sample the effects of the selected action
    [nextState, reward] = SIMULATOR.sampleTransition(state, action)

    % Recursively calculate and return the expected discounted reward
    return reward + DISCOUNT_FACTOR * rollout(nextState, depth + 1)

% The function for calculating the UCB value for an action at a
% certain point in the search tree
function ucbValue(searchTree, action)
    return (action.value + UCB_CONSTANT
        * sqrt(log(searchTree.numVisits) / action.numVisits))

Algorithm 3.1: The pseudocode for UCT.

3.2.2 Predicate-Average Sampling Technique

In UCT, it is important that the actions suggested by the rollout policy lead to an accurate estimation of the expected cumulative reward, so that it can properly evaluate the quality of actions. As a rollout policy that perfectly reflects the actions actually selected by UCT would be ideal, we look into a method that adjusts the rollout policy based on which actions are deemed good by UCT. This method is called the Predicate-Average Sampling Technique (PAST) [Finnsson and Björnsson, 2010].

PAST is based on the history heuristic, which assumes that values for actions in the past serve as an indicator for future action values. As such, PAST keeps track of the average values for actions for each predicate that is true for the current state. Predicates indicate properties of states, e.g., you could have a predicate for a certain water level or energy price that is true for states with the corresponding water level or energy price. When selecting rollout actions in the future, PAST uses these average values to change the distribution of actions chosen by the rollout policy.
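A minimal sketch of such predicate-average bookkeeping is given below; the class, its Gibbs-style action selection and the temperature parameter are illustrative choices of ours, not the exact formulation of [Finnsson and Björnsson, 2010].

```python
import math
import random
from collections import defaultdict


class PastStatistics:
    """Illustrative PAST-style statistics: average action values are kept
    per state predicate and used to bias the rollout policy towards
    historically good actions."""

    def __init__(self, temperature=1.0):
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)
        self.temperature = temperature

    def update(self, predicates, action, value):
        # Record the observed value under every predicate true in the state.
        for p in predicates:
            self.sums[(p, action)] += value
            self.counts[(p, action)] += 1

    def average(self, predicates, action):
        vals = [self.sums[(p, action)] / self.counts[(p, action)]
                for p in predicates if self.counts[(p, action)] > 0]
        return sum(vals) / len(vals) if vals else 0.0

    def select(self, predicates, actions):
        # Gibbs/softmax distribution over the predicate-average values.
        weights = [math.exp(self.average(predicates, a) / self.temperature)
                   for a in actions]
        return random.choices(actions, weights=weights)[0]
```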

3.2.3 Factored Value Partially Observable Monte Carlo Planning

The number of actions available to the controller grows exponentially with the number of pumping stations it controls. It may therefore start performing worse as the number of pumping stations grows. One way of dealing with this is to treat the task as a multi-agent problem where each pumping station has its own controller. This allows the number of actions to grow linearly instead of exponentially, but at the cost of limited cooperation.

We investigate this possibility by implementing the Factored Statistics variant of Factored value partially observable Monte Carlo planning (FV-POMCP) [Amato and Oliehoek, 2015]. As we do not utilise the partial observability property of this method, it works similarly to UCT. There is a big difference in how values are stored, however. While UCT stores one value per global action, FV-POMCP stores a value per joint action available to the subsets of pumping stations. As the controller must select one global action to perform at each decision epoch, we must use the separate action values to find a single optimal action.


In order to do this, we use the subsets of pumping stations and their action values to construct a coordination graph [Guestrin et al., 2001, Nair et al., 2005]. Coordination graphs represent how the actions of one node affect others, where nodes reflect pumping stations in our situation. These coordination graphs can then be used in an algorithm such as variable elimination [Guestrin et al., 2001] or max-sum [Kok and Vlassis, 2006] to find the action that maximises the sum of values across subsets.
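As an illustration of how such a maximisation works, the sketch below runs variable elimination on the simplest case: a chain-shaped coordination graph over binary pump actions. The edge payoff tables `q[i][(a_i, a_j)]` stand in for the per-subset action values that the Factored Statistics variant stores; all names and numbers are illustrative assumptions, not the thesis implementation.

```python
from itertools import product

def eliminate_chain(n_pumps, q):
    """Variable elimination along a chain; returns (best value, best joint action)."""
    actions = (0, 1)
    # msg[a] = best value of the eliminated prefix given the current pump takes a
    msg = {a: 0.0 for a in actions}
    back = []  # back[i][a_next] = best action for pump i given pump i+1's action
    for i in range(n_pumps - 1):
        new_msg, choice = {}, {}
        for a_next in actions:
            best_a, best_v = max(((a, msg[a] + q[i][(a, a_next)])
                                  for a in actions), key=lambda t: t[1])
            new_msg[a_next], choice[a_next] = best_v, best_a
        msg = new_msg
        back.append(choice)
    a_last = max(actions, key=lambda a: msg[a])
    joint = [a_last]
    for choice in reversed(back):  # backtrack to recover the full joint action
        joint.append(choice[joint[-1]])
    joint.reverse()
    return msg[a_last], tuple(joint)

def brute_force(n_pumps, q):
    """Sanity check: enumerate all joint actions explicitly."""
    return max((sum(q[i][(a[i], a[i + 1])] for i in range(n_pumps - 1)), a)
               for a in product((0, 1), repeat=n_pumps))
```

The elimination runs in time linear in the number of pumps, whereas the brute-force check is exponential; general graphs need the full variable elimination or max-sum machinery cited above.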


Chapter 4

Polder System Simulation

Facing a problem with a large state space and actions with long-term effects, a sample-based approach such as UCT is very promising. However, as sample-based techniques often require a large number of samples that indicate the effects of actions, it is important that the simulation runs fast enough to facilitate this. Current water level simulators focus on accuracy more than speed and are thus not fit for our purposes. In this chapter we propose our own simulator, which offers a more suitable trade-off between speed and accuracy.

4.1

The Polder System Model

In order to simulate the water in the polders and canals between them, we first represent them in a model. The canals are modelled as a graph with canal parts as nodes. Each canal part can have a polder or an end drain connected to it, where polders can pump water into the canal part and end drains pump water out of the canal part. Canal parts and polders both have a number of fixed properties:

• a target (or goal) water level Lg in mNAP;

• a bottom water level margin Lb in mNAP;

• a top water level margin Lt in mNAP;

• a maximum water level Lm in mNAP (exceeding this causes flooding);

• a surface area A in m².

Polders and end drains have one additional fixed property:

• a pumping capacity Sc in m³/s.

As discussed in Section 2.1, we focus on the VRNK polder system during this thesis. As such, we have modelled the VRNK polder system in the described structure. Illustrations of the polder system and our graph model of it can be seen in Figure 4.1.


Figure 4.1: Illustrations of the VRNK polder system: on the left an overview with main polders indicated by coloured areas and pumping stations by numbers, and on the right the modelled VRNK polder system with lines as canal parts, circles as controllable polder pumping stations, triangles as uncontrollable polder pumping stations and a square as end drain.

4.1.1

State Definition

Apart from fixed properties, canal parts and polders also have variable properties. These are defined in the state, which was mentioned earlier in Section 3.1. The state also contains variable properties for the energy prices and weather. We formally define the state as

s = ⟨Lc, HCe(t), HRf(t)⟩ (4.1)

where

• Lc is a vector containing the current water level for each canal part and polder in mNAP;

• HCe is the history of energy prices Ce in euros up to decision epoch t;

• HRf is the history of rainfall Rf in m up to decision epoch t.

The histories are used by our simulator for state transitions. Using these histories, our simulator can determine which energy prices may occur in future states, or how much rain may fall during future transitions. This is explained further in Section 4.2.3. While not an issue in our simulations during this thesis project, in practice these histories would grow indefinitely as time goes by. However, in real-world usage of our proposed method, only the planning simulator is needed, which does not require a full history of energy prices or rainfall. We can therefore limit the histories so that only the ten most recent observations are stored.
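A bounded history of this kind maps naturally onto a fixed-length buffer. The sketch below shows one way this state could be represented; the class and field names are our own, not taken from the thesis implementation.

```python
from collections import deque

class State:
    """Minimal state s = (Lc, H_Ce, H_Rf) with bounded observation histories."""

    def __init__(self, levels, history_len=10):
        self.levels = dict(levels)                       # water level per part, mNAP
        self.energy_prices = deque(maxlen=history_len)   # H_Ce, euros/MWh
        self.rainfall = deque(maxlen=history_len)        # H_Rf, metres

    def observe(self, price, rain):
        """Record the latest observations; entries older than ten epochs drop off."""
        self.energy_prices.append(price)
        self.rainfall.append(rain)
```

The `deque(maxlen=...)` discards the oldest entry automatically, so the memory footprint stays constant no matter how long the controller runs.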

The initial state for the problem is generated before the first decision epoch and changes during each transition to the next decision epoch. These transitions are further described in Section 4.2.


4.1.2

Action Definition

At each decision epoch, the controller must provide an action to take. This action consists of a setting for each pumping station that the controller manages. We formally define an action as

a = ⟨a1, ..., aN⟩ (4.2)

where

• an is the selected action value for controllable pumping station n;

• N is the number of controllable pumping stations.

The possible values for an may differ per controller. For ControlNEXT (Section 2.2), the possible action values are such that

an ∈ [0, 1] (4.3)

where the value of an linearly scales the pumping station's capacity to determine the pumping capacity for the transition. This is further explained in Section 4.2.1.

The action space used by ControlNEXT is infinitely large, and therefore cannot be used for our implementation of UCT. In order to have a smaller and more manageable action space, we define the available action values per pumping station as

an ∈ {0, 1}. (4.4)

In words, these action values allow UCT to either turn a pump off or on. Though this reduced action space does limit UCT's potential precision, N&S indicated that this would be sufficient for our research, which the results of our empirical evaluation confirm. This leaves us with an action space of size 2^N. If in practice it proves useful to have additional options, this action space can easily be extended to include intermediate values.
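The joint action space of Equations 4.2 and 4.4 can be enumerated directly; a minimal sketch (function name ours) that also shows the extension to intermediate values:

```python
from itertools import product

def joint_actions(n_pumps, values=(0, 1)):
    """All joint actions a = (a_1, ..., a_N); there are len(values)**n_pumps of them."""
    return list(product(values, repeat=n_pumps))
```

With the binary values of Equation 4.4 this yields the 2^N actions UCT searches over; passing e.g. `values=(0, 0.5, 1)` gives the extended action space mentioned above.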

4.2

State Transitions

Having our model structure, states and actions defined, we can look into the transitions of states within the model. The state transitions to a new state each decision epoch. Some of the resulting state’s values can be influenced by the action taken, while others are independent.

4.2.1

Effects of Actions

Each pumping station with an action value an > 0 will either raise or lower water levels. The water levels are adjusted by

Dl = an Sc T / A (4.5)

where


• Dl is the water level difference in m;

• an is the selected action value for controllable pumping station n;

• Sc is the pumping station capacity in m³/s;

• T is the amount of time spent pumping in s (the decision epoch length);

• A is the water area's surface in m² (i.e., either a polder or canal part).

In case the pumping station is connected to a polder, the water level in the polder lowers and the water level in the connected canal part rises. In case the pumping station is an end drain and therefore not connected to a polder, the water level in the connected canal part lowers.
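The level update of Equation 4.5 and its application to a polder pump can be sketched as follows; the function names and the numbers in the usage example are illustrative, not taken from the VRNK model.

```python
def level_change(a_n, capacity, epoch_s, area):
    """Water level difference D_l = a_n * S_c * T / A (Equation 4.5), in m."""
    return a_n * capacity * epoch_s / area

def apply_polder_pump(levels, areas, polder, canal, a_n, capacity, epoch_s=900.0):
    """A polder pump lowers the polder's level and raises the connected canal part's."""
    levels[polder] -= level_change(a_n, capacity, epoch_s, areas[polder])
    levels[canal] += level_change(a_n, capacity, epoch_s, areas[canal])
```

Because the polder's surface area is typically much larger than the canal part's, the same pumped volume lowers the polder only slightly while raising the canal noticeably.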

4.2.2

Water Flow

After the water level changes due to the effects of the action, water from one canal part can flow to its neighbours. This is modelled using the Gauckler–Manning formula (GMF) [Manning et al., 1891], which estimates the velocity of liquid in an open channel. The formula states

Gv = (1/Gc) Gr^(2/3) Gs^(1/2) (4.6)

where

• Gv is the cross-sectional average velocity in m/s;

• Gc is the Gauckler–Manning coefficient in s/m^(1/3);

• Gr is the hydraulic radius in m;

• Gs is the slope of the hydraulic grade line, calculated as the height/length ratio.

Figure 4.2: The cross-sectional area of an open channel with a blue line indicating the wetted perimeter.

The hydraulic radius Gr is defined as

Gr = Ga / Gp (4.7)

where

• Ga is the cross-sectional area of flow in m²;

• Gp is the wetted perimeter in m.

Gauckler-Manning Formula Implementation

In order to be used in the graph structure that the canal is modelled in, we calculate velocity Gv for each connection l^c between two canal parts. Using this velocity, the amount of water transferred between the canal parts in l^c, which is what our simulation ultimately needs, can be calculated by

l^c_t = l^c_d L̄w Gv (4.8)

where

• l^c_t is the amount of water transferred between the canal parts in l^c;

• l^c_d is the height difference between the two canal parts in l^c in m;

• L̄w is the average canal width in m;

• Gv is the cross-sectional average velocity in m/s.

To calculate Gv according to the GMF, we must first find the values for coefficient Gc, hydraulic radius Gr and slope Gs. The Gauckler–Manning coefficient Gc indicates the resistance in the canal that the water flows through. Though an accurate value for this requires field inspection, there are references for estimating it. N&S and research such as [Mott and Wagenaar, 2009, Te Chow, 1959, Edwards, 2000] suggest a value of Gc ≈ 0.03 for clean natural canals. However, due to the results of our experiments (Section 4.3.1), we deviate slightly from this in favour of a value for smoother man-made canals: Gc = 0.015.

The hydraulic radius Gr can be inferred through the assumed average canal width L̄w and depth L̄d, and the current depth Ld. For the purpose of abstraction, we make the assumption that the canals keep the same width L̄w across different depths. We then find Gr as

Gr = (Ld L̄w) / (2Ld + L̄w). (4.9)

While we manually set L̄w and L̄d using real-world data, the value of Ld changes over time, and thus must be calculated. Given the average depth L̄d for a canal part when the water is at its target level Lg, the current depth Ld can be determined by knowing the current water level Lc through

Ld = L̄d + (Lc − Lg). (4.10)

The slope of the hydraulic grade line Gs is calculated between the centre points of the two canal parts in connection l^c. This is done through

Gs = l^c_d / l^c_l (4.11)

where

• l^c_d is the height difference in m between the two canal parts in l^c;

• l^c_l is the length in m between the centres of the canal parts in l^c.

Given the surface, width and number of connections of two adjacent canal parts, the length between canal parts can be calculated as

l^c_l = Σ_{l ∈ l^c} A(l) / (L̄w La(l)) (4.12)

where

• l^c_l is the length in m between the centres of the canal parts in l^c;

• l^c is the set containing the two canal parts;

• A(l) is the surface of canal part l's area in m²;

• L̄w is the average canal width in m;

• La(l) is the number of connections of canal part l.


Decision Epochs and Water Flow

In our simulation, we must calculate the water levels for each decision epoch. We define a decision epoch's length as 15 minutes, as this corresponds to the interval at which pumping stations can be turned on or off. However, 15 minutes is too coarse a granularity to apply the GMF in our model. To remedy this, each decision epoch is subdivided into multiple GMF steps that each calculate intermediate water levels. This allows water movement to span multiple canal parts in one decision epoch. As additional GMF steps come at the cost of additional computation, we aim for a good balance between realism and performance. In our experiments, we found ten GMF steps per decision epoch to be suitable.
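The combination of Equations 4.6, 4.8 and 4.9 with the ten sub-steps per decision epoch can be sketched as below. For brevity this sketch holds the flow depth fixed during an epoch instead of recomputing it via Equation 4.10, and the names and the numbers in the test are illustrative assumptions.

```python
import math

def gmf_velocity(depth, width, slope, gc=0.015):
    """Gauckler-Manning velocity G_v = (1/G_c) G_r^(2/3) G_s^(1/2) (Eq. 4.6)."""
    if slope <= 0 or depth <= 0:
        return 0.0
    gr = depth * width / (2 * depth + width)  # hydraulic radius, Eq. 4.9
    return (1.0 / gc) * gr ** (2.0 / 3.0) * math.sqrt(slope)

def flow_epoch(levels, areas, a, b, depth, width, length,
               epoch_s=900.0, substeps=10, gc=0.015):
    """One decision epoch of flow between canal parts a and b in GMF sub-steps."""
    dt = epoch_s / substeps
    for _ in range(substeps):
        hi, lo = (a, b) if levels[a] >= levels[b] else (b, a)
        diff = levels[hi] - levels[lo]      # height difference l^c_d
        v = gmf_velocity(depth, width, diff / length, gc)
        volume = diff * width * v * dt      # Eq. 4.8 flow rate x sub-step length
        levels[hi] -= volume / areas[hi]
        levels[lo] += volume / areas[lo]
```

Recomputing the velocity at each sub-step lets the flow slow down as the levels equalise, which a single 15-minute step would miss.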

4.2.3

Transitions of Rain and Energy Prices

Apart from the deterministic effects of pumping station actions and water flow, state transitions are influenced by rainfall and energy price development. Before specifying how these affect the next state, we look into how they affect each other. It is imaginable that there could be correlations between these [Panagopoulos et al., 2015], which we investigate.

Correlation Between Rain and Energy Prices

In order to determine the correlation between rainfall and energy prices, we use the sample cross covariance function as described in [Box et al., 1994]. This function is an estimate of the covariance between two data sets, D1t and D2t, at lags k = 0, ±1, ±2, .... For data pairs (D11, D21), ..., (D1T, D2T), an estimate of the lag k cross-covariance is

cD1D2(k) = (1/T) Σ_{t=1}^{T−k} (D1t − D̄1)(D2,t+k − D̄2) for k ≥ 0;
cD1D2(k) = (1/T) Σ_{t=1}^{T+k} (D2t − D̄2)(D1,t−k − D̄1) for k ≤ 0 (4.13)

where D̄1 and D̄2 are the sample means of the data sets.
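The estimator of Equation 4.13 translates directly into code; a pure-Python sketch (function name ours):

```python
def cross_covariance(d1, d2, k):
    """Sample cross-covariance c_{D1 D2}(k) of Equation 4.13 (0-indexed data)."""
    T = len(d1)
    m1 = sum(d1) / T
    m2 = sum(d2) / T
    if k >= 0:
        terms = ((d1[t] - m1) * (d2[t + k] - m2) for t in range(T - k))
    else:
        terms = ((d2[t] - m2) * (d1[t - k] - m1) for t in range(T + k))
    return sum(terms) / T
```

Note the symmetry c_{D1D2}(k) = c_{D2D1}(−k), which is a quick sanity check on the implementation.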

In Figure 4.3, a cross correlation evaluation between the data sets is illustrated. Between -1 and 12 hours of lag, the sampled cross correlation is consistently above the error confidence bound shown in blue. As a correlation seems to exist, the historical rain and energy data used in the simulation should always be selected from the same time frame, such that the same starting point P0 is used for D1,P0 and D2,P0.

Given that a correlation exists, we can look into the practical results of this by measuring the average energy prices across different rain levels. Figure 4.4 illustrates that higher rain levels indeed correlate with a higher energy price. We can also observe that the prices are highest around 1 or 2 hours of lag, and decrease as lag increases.


Figure 4.3: The cross correlation between rain and energy prices.

Figure 4.4: The mean average energy prices across levels of rain.

Rain and Energy Price Simulators

As correlations between rain and energy prices have been shown to exist, we propose a method of simulation that preserves these correlations. However, we will need to make a distinction between the real simulator Gr and the planning simulator Gp as defined in Section 3.1.

For the real simulator we propose to take a historical data approach to simulation. In particular, we use synchronised historical data in the form of time series. When the initial state of the real simulator is generated, a random point in this data is selected as the first rainfall level and energy price (this point is the same in both the rainfall data and the energy price data). At each state transition, the next point in the data is used to determine the rainfall and energy price for the next state. Though this limits the simulated rain and energy prices to those observed in our data set, the amount of data is more than enough to cover the time span of the experiments that we run. The advantage of this approach is that it respects their correlation, which could be important: ignoring the correlation may lead to estimations that are too optimistic, since high prices and rain are likely correlated.


For the planning simulator it is not possible to use such an approach. As real transitions bear uncertainty, it is important that the controller’s planning reflects this. Here, we take different approaches for rainfall and energy price simulation.

Rain in the Planning Simulator

For the model of rainfall, we propose the utilisation of external tools to circumvent a prediction problem. When a UCT-based controller is deployed in practice, it could have access to commercially available weather forecasts that predict the amount of rain that will fall in the area of interest in the coming hours. We expect these forecasts to be much more accurate than any learned model of rainfall transitions, up to the point that we can argue that the agent knows what the future rainfall will be. Therefore, in our planning simulator the sampled rain is (roughly) equal to the prediction. In the current simulations, we use historical data from Meteobase [STOWA, 2015] for this, but for real-world usage, forecast services like Buienradar [Buienradar, 2015] may be used. Of course, forecasts are not flawless. Taking this into account, we add Gaussian noise with a standard deviation of a third of the prediction value. Unfortunately, we were not able to find reliable research which specifies the amount of forecast error. The amount of noise that we use is therefore an estimation which can be improved once additional information becomes available.

Energy Prices in the Planning Simulator

For energy price data, no prediction like that for rain is available. Instead, we look into alternative ways to simulate future energy prices. A relatively straightforward method of predicting energy prices is through a Markov chain [Norris, 1998], which maps the probabilities of states occurring after a state, as mentioned in Section 3.1. As a simple baseline predictor, we implement a first-order Markov chain to map energy prices to probabilities of following energy prices. Using ten years of historical energy price data, we calculate the probabilities of energy price transitions for our Markov chain. In order to deal with the continuous energy prices, we bin energy prices together in bins of 1 euro/MWh.
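A minimal sketch of such a binned first-order Markov chain is shown below. Class and function names, the fallback for unseen bins, and the tiny history in the example are our own illustrative choices.

```python
import random
from collections import defaultdict

def price_bin(price, bin_size=1.0):
    """Discretise a continuous price into a 1 euro/MWh bin."""
    return int(price // bin_size)

class PriceMarkovChain:
    """First-order Markov chain over binned energy prices."""

    def __init__(self, history):
        # successors[b] lists every bin observed directly after bin b, so
        # random.choice over it samples proportionally to transition counts.
        self.successors = defaultdict(list)
        bins = [price_bin(p) for p in history]
        for cur, nxt in zip(bins, bins[1:]):
            self.successors[cur].append(nxt)

    def sample_next(self, price):
        """Sample a successor bin; fall back to the same bin for unseen prices."""
        options = self.successors.get(price_bin(price))
        if not options:
            return float(price_bin(price))
        return float(random.choice(options))
```

Storing successor lists rather than normalised probabilities keeps the code short; with ten years of quarter-hourly data the memory cost is still negligible.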

For a more advanced method of simulating energy prices by predicting those in the future, we use a technique for inferring beliefs over scenarios [Walraven and Spaan, 2014]. This technique attempts to find the best fits of the last few observed energy prices in the available historical data. The scenarios Σ in the historical data that match the observed data best are then used for the energy price transition, where a scenario σ is randomly selected based on its probability. To prevent this technique from being reduced to a look-up of an exact match, we split our available data into a training set and a test set, where the training set is used by the technique to predict energy prices and the test set is used by the real simulator. As we have ten years' worth of historical data, we use nine years for the training set and the remaining year for the test set. We build upon this technique by utilising the correlation between rainfall and energy prices to improve its performance in our problem; we select scenarios not only based on the energy price fit, but on the best fit of both the energy prices and rainfall.

The pseudocode for the scenario-based approach is shown in Algorithm 4.1, where

• ε is a small non-zero constant to prevent division by 0;

• Wr is the weight for rain, used to scale the values of rainfall to be more comparable to energy prices, chosen based on empirical results;

• σCe is the history of energy prices Ce in euros in the scenario σ;

• HCe is the history of energy prices Ce in euros as observed by the controller;

• σRf is the history of rainfall Rf in m in the scenario σ;

• HRf is the history of rainfall Rf in m as observed by the controller;

• n is the number of most similar scenarios to be returned.

invErr = {}

% Calculate the inverse error for each scenario
for each σ ∈ Σ
    invErr[σ] = 1 / (ε + Σ_i ((σCe(i) − HCe(i))² + Wr (σRf(i) − HRf(i))²))
end for

% Sort, trim and normalise the results to be left with the n best
% matches and their probabilities
sort invErr
delete invErr[index ≥ n]
normalise invErr

return invErr

Algorithm 4.1: The pseudocode for inferring beliefs over scenarios.
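A runnable Python version of Algorithm 4.1 could look as follows. The default rain weight `wr=0.1` is a placeholder (the thesis only states it was chosen empirically), and the scenario representation is our own simplification.

```python
def scenario_beliefs(scenarios, h_ce, h_rf, n, wr=0.1, eps=1e-9):
    """Weight scenarios by the inverse squared error of their energy-price and
    rainfall fit against the observed histories, keep the n best, normalise."""
    inv_err = {}
    for name, (s_ce, s_rf) in scenarios.items():
        err = sum((ce - hce) ** 2 + wr * (rf - hrf) ** 2
                  for ce, hce, rf, hrf in zip(s_ce, h_ce, s_rf, h_rf))
        inv_err[name] = 1.0 / (eps + err)
    # keep the n closest scenarios and turn their weights into probabilities
    best = sorted(inv_err.items(), key=lambda kv: kv[1], reverse=True)[:n]
    total = sum(v for _, v in best)
    return {name: v / total for name, v in best}
```

The returned probabilities can then be used to randomly select a scenario σ for the next energy price transition, as described above.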

4.3

Model Realism Evaluation

In Chapter 6, we report on experiments regarding the performance of our proposed techniques. In order to assess the implications for real-world deployment, it is important to evaluate the accuracy of the real simulator Gr. In addition, we evaluate the performance of the planning simulator Gp to get an idea of the prediction accuracy, and thus how much our controller can utilise these predictions in the planning phase. We first focus on the water transitions, after which we look into the energy price planning transitions.

4.3.1

Realism of Real Simulator

The real simulator Gr involves calculations for water flow and pumping, but follows historical data for rain and energy price transitions. As historical data is realistic by definition, we focus on the water levels while evaluating the realism of Gr.


In order to evaluate the realism of our real simulator Gr, we use one of the state-of-the-art simulators as baseline Gb: SOBEK by Deltares [Deltares, 2015b]. SOBEK is a physics-based modelling suite used by water authorities and water management consultancies. Using the VRNK model that is used in practice by HHNK, we can compare the water behaviour of Gr and Gb. In our experiment, we run a typical simulation in SOBEK where the water levels start at their target and are raised by heavy rain. About 16 hours into the simulation, the water levels reach a point where the pumping stations activate, which causes the water to flow through the canals. Using the same rainfall and pumping station actions, we mimic this simulation in our own simulator.

Figure 4.5: The mean and standard deviation of water level differences between Gr and Gb with Gauckler–Manning coefficients Gc = 0.015 (left) and Gc = 0.03 (right).

We determine the accuracy at each decision epoch t by using the difference of water levels Ge(t) = Gr(t) − Gb(t) as predicted by simulators Gr and Gb. Figure 4.5 shows the mean and standard deviation of the water level difference of the different canal parts. The first 16 hours do not involve pumping and only contain a slight rain build-up. After this first period, pumps activate, which allows differences in water levels between the simulations to become more apparent.

As discussed in Section 4.2.2 and [Mott and Wagenaar, 2009, Te Chow, 1959, Edwards, 2000], typical Gauckler–Manning coefficient references suggest Gc ≈ 0.03 for natural canals. However, we find that a lower value that is typically used for smoother man-made canals, such as Gc = 0.015, gives us a much more realistic water flow. Comparing the left and right plots of Figure 4.5, we see that the mean errors of both coefficients are similar, but using Gc = 0.03 gives a much higher standard deviation than Gc = 0.015, indicating that there is a lot more simulation error there. A more detailed representation of the effects of Gc on the standard deviation of error is shown in Figure 4.6, which plots the mean of the standard deviation values across the performed experiments per coefficient value:

f(x) = (1/h) Σ_{t=1}^{h} σ(Gr,x(t) − Gb,x(t)). (4.14)

Here we see that the simulation error is lowest around Gc = [0.009, 0.015], and we therefore use Gc = 0.015 as the Gauckler–Manning coefficient in our simulation and experiments.

Figure 4.6: The mean standard deviation of error across different Gc values.

A few examples of water levels of individual canal parts are seen in Figure 4.7. The left plot shows the water levels close to the end drain. The end drain activates three times, causing the water levels to drop sharply and rise again while the pump is off. The middle plot relates to water levels in the south of the VRNK polder system, where pumping stations of polders activate twice. These first two plots are examples of areas where Gr has results very similar to Gb. Figure 4.5 shows that there are also parts where higher error in prediction occurs. One of these canal parts is seen in the right plot. This shows the water levels of a branching canal in the centre of the VRNK polder system and illustrates that our model's accuracy in this area can still be improved. Before using our simulation in a real-world pilot, looking into these model discrepancies together with N&S or HHNK can likely reduce these water level differences.

Figure 4.7: Example water levels from different canal parts in simulator Gb (dashed blue) and Gr (solid red).

In conclusion, our simulation Gr generally keeps water levels close to those of Gb. The error seen in Figure 4.5, which ranges from 0.05 m below the baseline to 0.1 m above it, could be considered a risk. Though the potential risk is already lowered by our simulation tending to show water levels higher than they actually are, this error could still be considered a liability. We can minimise the risk of this error by simply adjusting the acceptable water levels. By placing the top target water level 0.05 m lower and the bottom target water level 0.1 m higher, the controller will take less risk when water levels are near their target boundaries, preventing damages from simulation error. As such, we deem our simulation sufficiently accurate for our purposes.


4.3.2

Realism of Planning Simulator

The planning simulator Gp differs from the real simulator Gr in that energy prices no longer transition based on historical data. In this section we evaluate our energy price transitions, which are based on inferring beliefs over scenarios [Walraven and Spaan, 2014] and the correlation with rainfall. As we focus on the imbalance market for purchasing energy throughout this thesis, which is notorious for being hard to make predictions for, there is no proven baseline available to us. We therefore compare our scenario-based and Markov chain transitions to a transition function that selects a random energy price from the historical data at each transition.

In our evaluation, we select 10,000 random points in the historical energy price data, and compare the 10 following energy prices with the energy prices that our transition functions predicted. These 10 following energy prices correspond with 10 decision epochs, and therefore 2.5 hours. This allows us to find the error (i.e., the difference between our prediction and the actual future energy prices) of our transition functions over a time span of two and a half hours.

Figure 4.8: The mean prediction error of different energy price transition functions.

Figure 4.8 shows the results of our evaluation according to the mean of Ec(t) = |Pe(t) − Ce(t)|, where t is the decision epoch, Ec is the error, Pe is the predicted energy price and Ce is the actual energy price. We see that selecting a random energy price has a mean error of €60. The Markov chain approach is able to slightly reduce this error early on, but converges to the same error as the random transitions a few decision epochs later. The scenario-based simulation is capable of making even better predictions early on, and only slowly loses prediction accuracy. This result implies that our controller might be able to exploit our energy price transition function especially early on when attempting to select the most cost-efficient moments to pump, and that it faces a larger uncertainty over a longer time span.

As these tests were run on data from the imbalance market, we expect predictions to be more accurate if better historical data for the Intraday market becomes available. This should further improve the potential cost savings of our controller.


Chapter 5

UCT Based Pumping Station Controllers

In the previous chapters we described the task at hand and our lightweight simulator. In this chapter, we focus on the controllers that can solve the task of controlling pumping stations by utilising our simulator. The controller is responsible for selecting an action (Section 4.1.2) for the pumping stations at each decision epoch. It should consider not only which actions are good immediately, but also which are likely to yield a good result over a longer time span. Here, we discuss how actions are rewarded and which controllers we consider for our task.

5.1

Action Quality and Rewards

In order for a controller to find a good action, we must first define what a good action is. This is done through specifying the reward function. The immediate quality of an action is based on the cost of the pumping stations that were on and the resulting state. The penalties for pumping costs, water levels and overflow combined determine the total reward

r = −Σ_n Pc(n) − Σ_l Pl(l) − Σ_l Po(l) (5.1)

where

• r is the reward in euros;

• Pc(n) is the pumping cost in euros for pumping station n;

• Pl(l) is the water level penalty in euros for canal part or polder l;

• Po(l) is the flooding penalty in euros for canal part or polder l.


5.1.1

Penalty for Pumping

The pumping cost of activated pumps depends on multiple things. Firstly, and most obviously, it depends on the current energy price. Secondly, it depends on the pumping capacity, which differs across pumping stations. This capacity cannot be directly translated to an amount of energy, however. We therefore estimate the energy required for one decision epoch of pumping through an alternative method. Using a year's worth of data from HHNK, we have found a correlation between the energy consumption relative to the surface of the pump's polder and the average water level difference between each side of that pump (the pumping height). This is shown in Figure 5.1, where each point represents a pumping station and the green line represents the fit that we use for the simulation. On the scale of one decision epoch, this gives us an energy requirement per pumping station of

Er(n) = Ec Dl(n) A(n) (5.2)

where

• Er(n) is the required energy in MWh for pumping station n;

• Ec is a constant with value 8.1 × 10^−6;

• Dl(n) is the current water level difference in m for pumping station n;

• A(n) is the area's surface in m² for pumping station n.

Note that Dl(n) is the water level difference between each side of the pump at the current decision epoch, rather than the yearly average difference as used in Figure 5.1.

Using this energy requirement, we can determine the full pumping cost per pumping station through

Pc(n) = an Er(n) Ce (5.3)

where

• Pc(n) is the reward penalty in euros for pumping station n;

• an is the selected action value for pumping station n from the range [0, 1];

• Er(n) is the required energy in MWh for pumping station n;

• Ce is the energy price per MWh in euros.
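Equations 5.2 and 5.3 can be sketched directly; the function names and the numbers in the test are our own illustrative choices.

```python
def required_energy(level_diff, area, e_const=8.1e-6):
    """E_r(n) = E_c * D_l(n) * A(n) (Equation 5.2), in MWh."""
    return e_const * level_diff * area

def pumping_cost(a_n, level_diff, area, energy_price):
    """P_c(n) = a_n * E_r(n) * C_e (Equation 5.3), in euros."""
    return a_n * required_energy(level_diff, area) * energy_price
```

An inactive pump (a_n = 0) thus incurs no cost, while a pump working against a large height difference over a large polder surface is the most expensive to run.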

Figure 5.1: The relative energy price per pumping station where each point represents a pumping station.


5.1.2

Penalty for Water Levels

Water levels deviating from the target levels incur costs. Each polder and canal part has a bottom and a top target water level. When the current water level is outside of this target range, a penalty is given according to

Pl(l) = |Lt(l) − Lc(l)|² A(l) Ct,  if Lc(l) > Lt(l);
Pl(l) = |Lb(l) − Lc(l)|² A(l) Cb,  if Lc(l) < Lb(l);
Pl(l) = 0,  otherwise (5.4)

where

• Pl(l) is the reward penalty in euros for canal part or polder l;

• Lt(l) is the top target level in mNAP for canal part or polder l;

• Lb(l) is the bottom target level in mNAP for canal part or polder l;

• Lc(l) is the current water level in mNAP for canal part or polder l;

• A(l) is the area's surface in m² for canal part or polder l;

• Ct is the cost per (squared) m above target per m² surface in euros;

• Cb is the cost per (squared) m below target per m² surface in euros.

The number of metres off target is squared in order to further penalise larger deviations. The costs Ct and Cb have been set to 1 in consultation with N&S. This is a rough estimation of the true cost based on real-world scenarios, but fulfils its purpose for our experiments. For real-world usage, more in-depth research can be done into these costs to improve real-world performance.
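The piecewise penalty of Equation 5.4 can be sketched as follows; the function name and the example levels are our own.

```python
def level_penalty(current, top, bottom, area, c_top=1.0, c_bottom=1.0):
    """Water level penalty of Equation 5.4; deviation from target is squared."""
    if current > top:
        return (current - top) ** 2 * area * c_top
    if current < bottom:
        return (bottom - current) ** 2 * area * c_bottom
    return 0.0
```

Because the deviation is squared, being 0.2 m off target costs four times as much per m² as being 0.1 m off, which pushes the controller to intervene before levels drift far.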

5.1.3

Penalty for Flooding

Finally, there is an extra cost for flooding, which adds

Po(l) = O(l)Co (5.5)

where

• Po(l) is the reward penalty in euros for canal part or polder l;

• O(l) is the amount of overflowed water in m³ for canal part or polder l;

• Co is the cost per m³ of overflowed water in euros.

As advised by N&S, the damages reported during a flooding of Texel, a Dutch isle, are roughly representative of those in the VRNK polder system. We were therefore able to use the data from [Nationaal Watertraineeship, 2015] for our estimation of cost Co. Dividing the total costs of the disaster by the amount of water that overflowed gives us Co ≈ 3.

While discussing the formula for this penalty with N&S, interesting ethical questions arose. In the rare event of a flood, substituting O(l) for O(l)², such that a separately squared flooding cost is returned per canal part or polder l, would cause the controller to favour flooding many areas slightly rather than a few areas heavily. Using √O(l), however, would do the opposite, leading more water to a few areas rather than spreading it over many areas. Which scenario is preferred could not be decided at the time, and may require more research from N&S or HHNK. As such, we make no explicit distinction in our reward function and neither square nor take the square root of the amount of overflowed water.
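The trade-off can be made explicit by parameterising the exponent on O(l); exponent 1 is the neutral choice made here, with Co ≈ 3 euros per m³. The function names and example volumes below are illustrative.

```python
def flood_penalty(overflow_m3, c_o=3.0, exponent=1.0):
    """P_o(l) = O(l)**exponent * C_o for a single canal part or polder."""
    return overflow_m3 ** exponent * c_o

def total_flood_penalty(overflows, exponent):
    """Sum the penalty over all flooded areas for a given exponent."""
    return sum(flood_penalty(o, exponent=exponent) for o in overflows)
```

With exponent 2, flooding two areas with 50 m³ each is cheaper than one area with 100 m³, so the controller would spread a flood; with exponent 0.5 the opposite holds, and with exponent 1 the controller is indifferent, which is exactly the neutrality argued for above.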

5.2

UCT Implementation

In order to effectively apply UCT to the pumping station domain, we make a number of domain-specific design choices.

Firstly, we introduce binning of state variables, applied only within the UCT search tree. As we are dealing with transitions that contain continuous and stochastic elements, the probability of ending up in exactly the same state may be very small. To still allow UCT to exploit the tree structure it has built, we group the water levels and energy prices of UCT tree nodes into bins of 1 cm and €10 respectively. These values were selected based on the results of our experiments (Section 6.2.3).
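The binning step can be sketched as a mapping from a continuous state to a discrete node key. The function name and signature below are illustrative (not from the thesis implementation); only the bin sizes of 1 cm and €10 come from the text.

```python
def bin_state(water_levels, energy_price, level_bin=0.01, price_bin=10.0):
    """Map a continuous state onto a discrete key for UCT tree nodes.

    Water levels (mNAP) are grouped in 1 cm bins and the energy price
    in 10-euro bins, so that stochastic transitions can land in the
    same tree node and UCT can reuse the subtree it built there.
    """
    binned_levels = tuple(int(round(level / level_bin)) for level in water_levels)
    binned_price = int(energy_price // price_bin)
    return binned_levels, binned_price
```

Two states whose levels differ by less than a bin width (and whose prices fall in the same €10 band) then share a tree node.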

Secondly, we consider a number of alternative rollout policies.

5.2.1

Rollout Policies

UCT relies heavily on its ability to perform quick value estimations of a given state. A large part of this is the rollout policy, described in Section 3.2. Rollout policies should require minimal computation for fast value estimation, and should be representative of actions that the controller might perform in the future.

When no domain knowledge is available, it is standard practice to use a rollout policy that selects random actions. The downside is that UCT will then value actions highly when they lead to a good outcome under a random policy taking over later on. In other words, it does not necessarily optimise towards states that are good when more sensible policies take over.

Using domain knowledge, it may be possible to construct better rollout policies. One of our considered rollout policies is ControlNEXT itself. Inspired by ControlNEXT, which looks at future states that follow after doing nothing, we also consider a do nothing rollout policy which keeps all pumping stations inactive. Finally, we consider a rollout policy that learns from UCT's sampling to determine whether actions look promising in the future: the Predicate-Average Sampling Technique (PAST) [Finnsson and Björnsson, 2010].
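The two simplest rollout policies above can be sketched in a few lines. The encoding of a joint action as one setting per pump, and the convention that 0 means "pump off", are assumptions for illustration.

```python
import random

def random_rollout_action(pump_actions):
    """Baseline rollout: pick a random joint action.

    `pump_actions` is a sequence with, per pumping station, the tuple
    of settings available to it.
    """
    return tuple(random.choice(actions) for actions in pump_actions)

def do_nothing_rollout_action(pump_actions):
    """'Do nothing' rollout: keep every pumping station inactive,
    mirroring the ControlNEXT-inspired policy described above.
    Assumes setting 0 encodes 'pump off'."""
    return tuple(0 for _ in pump_actions)
```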

PAST Implementation

Predicate-Average Sampling Technique (PAST), as described in Section 3.2.2, selects actions based on the average rewards they have received since the start of the program. The average rewards are stored per predicate, which corresponds to one particular feature of the state in which the action was taken (e.g., a certain water level or energy price). The action distribution used in PAST is determined by the predicate with the highest estimated value. We find this too optimistic in our situation, however: a predicate for low energy prices might be selected rather than a predicate indicating imminent flooding. Instead of using different predicates for individual state features, we therefore define predicates over all state features.

5.2.2

FV-POMCP Implementation

As mentioned in Section 4.1.2, the number of actions available to UCT grows exponentially with the number of pumping stations it controls. Though UCT is still very much capable of handling the 7 pumping stations used in our experimental problem, its performance may decline rapidly with larger numbers of pumping stations. As a solution, we look into decentralisation by utilising the Factored Statistics variant of Factored-value partially observable Monte Carlo planning (FV-POMCP), described in Section 3.2.3.

FV-POMCP divides the problem into multiple sub-problems that are easier to solve. We define these sub-problems as collections of a pumping station combined with its adjacent pumping stations; as such, each sub-problem overlaps with one or more others. Whereas UCT stores the expected value of global actions, which contain the action value for each controllable pumping station, FV-POMCP stores expected values for each joint action, which contains only the action values for the pumping stations in the corresponding sub-problem. As our sub-problems overlap, we use max-sum [Kok and Vlassis, 2006] to merge all joint actions into a global action, which is then performed. Though this makes the controller far more scalable, it comes at the cost of reduced coordination between pumping stations. How this affects performance is discussed in Section 6.4.
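The merging step optimises the sum of the factored value estimates over the overlapping sub-problems. The sketch below brute-forces that objective, which is only feasible for small numbers of pumps; the thesis uses max-sum to optimise the same objective efficiently on the factor graph. The data layout (`factors` mapping a scope of pump indices to a table of local joint-action values) is an assumption for illustration.

```python
from itertools import product

def best_global_action(n_pumps, factors, per_pump_actions=(0, 1)):
    """Choose the global action maximising the sum of factored values.

    `factors` maps a tuple of pump indices (a sub-problem: a pump and
    its neighbours) to a dict from local joint actions to estimated
    values. Brute force over all global actions stands in for max-sum
    here, purely for illustration.
    """
    best, best_val = None, float("-inf")
    for global_action in product(per_pump_actions, repeat=n_pumps):
        val = sum(
            q.get(tuple(global_action[i] for i in scope), float("-inf"))
            for scope, q in factors.items()
        )
        if val > best_val:
            best, best_val = global_action, val
    return best
```

Note how pump 1 below appears in both scopes, so its chosen setting must trade off the two overlapping sub-problems rather than being picked per factor in isolation.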


Chapter 6

Controller Evaluation

Having our simulator and controllers, we can empirically evaluate the performance of our controller. First, we look into several configurations of UCT to determine the ones most suitable for our problem. We then analyse how different rollout policies affect UCT's performance, and how well it scales compared to a decentralised solution such as FV-POMCP. Finally, we compare our proposed controller to current industry baselines.

6.1

Experimental Setup

In order to evaluate the quality of our controllers, we run several experiments with them and observe the costs (i.e., negative rewards) they incur. To compare their performance over time, we plot the cumulative costs at each decision epoch throughout the experiments.

One requirement of a pumping station controller is to perform well over a long time span. To accommodate this, we let each experiment last two (simulated) weeks, such that the horizon is h = 1344 decision epochs (one decision every 15 minutes). To ensure that the controller's task is challenging, we make the experiments more difficult in two ways.

Firstly, the initial state is generated such that all water levels are near their top target, by selecting them uniformly at random from the range

Lc ∈ [Lt − 0.2, Lt] (6.1)

where

• Lc is the current water level in mNAP;

• Lt is the top target level in mNAP.

Secondly, we select a point in the historical rain data where rain is just starting. This results in a scenario where the water levels are high (sometimes dangerously so) and still rising. We generate 8 such initial states and report the mean cumulative costs over the experiments run with these initial states.
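The initial-state sampling of Equation 6.1 amounts to drawing each level uniformly from a 20 cm band below its top target. The helper name and the `seed` argument below are illustrative assumptions.

```python
import random

def sample_initial_levels(top_targets, margin=0.2, seed=None):
    """Sample a challenging initial state per Eq. 6.1: each water level
    is drawn uniformly from [Lt - 0.2, Lt] (mNAP), i.e. just below its
    top target level Lt."""
    rng = random.Random(seed)
    return [rng.uniform(lt - margin, lt) for lt in top_targets]
```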


UCT has a number of parameters that can be set. Based on the experiments below, we found Up = 65,000 planning simulations (which take about 5 seconds on a typical workstation) and a search depth of Ud = 8 hours (which equates to 32 decision epochs) to be good settings for our problem. Furthermore, we use the do nothing rollout policy, in which no pump is enabled, because it has the best performance; this is discussed in more detail in Section 6.3.

Since we run a very large number of experiments, we gratefully make use of SURFsara's cluster computer Lisa [SURFsara, 2015]. As scheduling tasks on Lisa's 32 GB and 64 GB RAM machines does not require any special requests, and these machines are sufficient for our experiments, we work within those restrictions.

6.2

UCT Configurations

UCT has a number of parameters that can greatly affect its performance. In this section we investigate which settings are most suitable for our problem. The parameters we investigate are the number of planning simulations Up, the search depth Ud, the bin size Ub, the UCB constant Uc and the discount factor Uγ.

6.2.1

Number of Planning Simulations

Between decision epochs, UCT may plan until a certain resource is exhausted. In our implementation, this resource is the number of planning simulations Up that it may perform; i.e., UCT may sample the effects of Up actions before having to decide which action to perform in the real simulator. We empirically investigate the effects of a wide range of values for Up. We start at Up ≈ 8,000, which leads to decent performance that is not improved until the available planning simulations are raised to Up ≈ 65,000. After this, the performance gain from increasing the number of planning simulations seems to stall; even at Up ≈ 200,000 the performance is still the same.

Figure 6.1: The mean cost of different numbers of planning simulations Up over … [figure: cost in euros against simulated time in days (0–14), for Up = 8,000; 50,000; 65,000; 200,000]
