
Master’s Thesis

Reinforcement Learning versus Heuristics in Queue Optimization: A Comparison

Author

Guus Smit

Supervisor & first assessor

dr. Nicky van Foreest

Second assessor

dr. Stuart Zhu

August 26, 2018

Abstract


Contents

1 Introduction

2 Literature

3 Methods
  3.1 Description of the polling system
    3.1.1 Systems with limited buffer size
  3.2 Economic Production Quantity model
  3.3 Heuristic for N queues
  3.4 Reinforcement learning
    3.4.1 Key concepts
    3.4.2 Q-learning
    3.4.3 Function approximation
    3.4.4 Linear function approximation
    3.4.5 Feature vector design
    3.4.6 Deep reinforcement learning
    3.4.7 Neural network architecture
    3.4.8 Finding the least-squares fit from a batch of experiences

4 Experiments and results
  4.1 Single queue system
    4.1.1 Optimal policy as a function of discounting
    4.1.2 Convergence of Q-learning
    4.1.3 Convergence using tabular Least-Squares Policy Iteration
  4.2 Dual queue system
  4.3 Multi-queue systems
    4.3.1 System of 4 queues
    4.3.2 Clusters of queues
    4.3.3 Cyclic system of queues

5 Conclusions
  5.1 Summary of results
  5.2 Concluding remarks


1 Introduction

Queues are ubiquitous in day-to-day operations in many industries and arise naturally whenever demand for a service cannot be immediately satisfied. Traffic intersections, elevators, maintenance operations, computer and telecommunications, emergency room scheduling and production systems are just some of the many examples of systems with queues [1]. The dynamics of these systems differ in many ways, but a central feature is that there is a server that needs to attend the queues in order to process demand.

Consider the operations of airport security, with various locations at which passengers may arrive. At each location, multiple security officers are working to process the passengers. An operations manager may relocate security officers, send them home early or call in extra officers that are on stand-by. Long queues are undesirable, but calling in too many officers results in high costs. Relocating too often results in officers being in transit for a large fraction of the time, not processing passengers. What decisions should the operations manager make in order to be cost-efficient?

The field of polling systems arose to address such problems. The first paper in this field, by Mack [2], was published in 1957 and considers maintenance on N machines by a single worker who takes a cyclic path along those machines. An overview of applications and recent endeavours within this field is given by Boon et al. [1].

We will analyze a discrete time variant of a polling system with N queues and a single server that can be turned on or off and visit the queues in any desired order, incurring switching costs while doing so. The goal of the algorithms is to minimize the cost rate of a given system. In particular, we will focus on how reinforcement learning (RL) can be applied to this problem. We will analyze different queueing systems of increasing model complexity and benchmark how well algorithms based on reinforcement learning perform with respect to heuristics and, if available, the optimal solution.

The reason we investigate reinforcement learning on this problem is two-fold. Firstly, reinforcement learning has in the past been used extensively for OR problems such as traffic light control [3][4][5][6] and scheduling [7][8], but it has not been applied to the polling systems of this paper. Secondly, recent advances in the field of deep reinforcement learning are revolutionizing the way we can solve large state-space problems, among them our polling systems.

An important case for reinforcement learning is the recent progress that has been made in computer chess, Go and shogi. Decades of research and expert knowledge from grandmasters have gone into finding domain-specific ways of efficiently exploring the tree of possible moves in order to solve these games. For all three games, the reigning world-champion computer programs were recently convincingly defeated by AlphaZero, relying only on self-play reinforcement learning [9]. Another recent breakthrough is the human-level control that was reached with reinforcement learning on Atari games, observing only pixel images of the game state [10].


Section 3.1 will first describe the characteristics of the polling systems that are investigated in this paper. Section 3.2 details the Economic Production Quantity (EPQ) model, which gives an accurate approximate solution of our system in some settings. Furthermore, the EPQ model is used in the design of the heuristic, which will be covered immediately afterwards in Section 3.3.

After detailing the heuristic, we will briefly cover the fundamentals of reinforcement learning required to understand the rest of our methods. We will cover the basic Q-learning algorithm and how it can be extended using function approximation. We will go into the details of linear function approximation and how it is applied in our research, and subsequently move on to deep reinforcement learning. We will then also detail the design of our neural network. Lastly in this section, we will describe how the least-squares fit over a batch of ‘experiences’ can be obtained directly for tabular and linear learning, which is an important advantage of these methods.

From there, we move on to our experiments and findings. We will first cover the findings of tabular Q-learning in a single queue system, after which we investigate the performance of linear function approximation and deep learning for the more complex dual queue system. Our results indicate that linear function approximation and tabular learning are inadequate to apply in complex settings. The remainder of our experiments focuses on large multi-queue problems where we compare deep reinforcement learning to heuristics. We find that deep reinforcement learning can outperform the heuristic in some settings.

Lastly, we give a quick summary of our experiments and findings, and present our conclusions.

2 Literature

There is a large body of research dedicated to queueing theory and a separate large body of research for reinforcement learning, but the overlap of these two areas is minimal. As mentioned in the introduction, some research has been done with regard to reinforcement learning in traffic light control. However, this is a very specific subset of queueing problems and this paper builds on advances in reinforcement learning that were not available when most of these papers were written.

In conclusion, we have found no research that investigates the performance of reinforcement learning in the settings that we apply it to. Furthermore, the body of knowledge of reinforcement learning is for a large part adequately represented by the introductory textbook by Sutton and Barto[11], of which some important concepts will be covered in the following section.


3 Methods

In this section we will explain our methods. We begin in Section 3.1 by explaining the dynamics of the polling systems that we will investigate. Section 3.2 details a deterministic continuous time equivalent for a single queue to build intuition of the problem at hand. We use this model to design our heuristic solution to this problem. Then, in Section 3.4, we will explain the reinforcement learning techniques that we used.

3.1 Description of the polling system

The polling system that we will analyze will feature the following cost processes:

• Setup costs K to activate the server. Switching the server off costs nothing.

• Switching costs k_ij to move from queue i to queue j.

• Operating costs o (in units of cost per unit time), paid for each time unit the server is active.

• Holding costs h_i (in units of cost per unit time per job), paid for each time unit a job spends waiting in queue i.

We model our system in discrete time, with time steps of size ∆t. Our polling system consists of a single server that can either be idle or attending a queue. To go from an idle state to attending a queue, setup costs K are incurred. To switch from queue i to j, switching costs k_ij are incurred. Each time step the server is attending a queue, operating costs ∆t × o are incurred. Each time step that a job spends waiting in queue i, holding costs ∆t × h_i are paid for that job.

The jobs are identical and all require a deterministic service time s = 1. Jobs in queue i arrive according to a Poisson process with rate λ_i. We set the discrete time step equal to the service time of a job, ∆t = s = 1, such that the server always processes exactly one job during one time step.

We let N be the total number of queues of the system. We require that Σ_i λ_i < 1, such that the server can always process more jobs per unit time than will arrive in the system. Note that there is no limit on the length of the queues (i.e., ‘unlimited buffer size’). In principle, the queues can grow indefinitely, if for example the server is never active.

The entire state of the system can be described by (1) the server location ∈ {0, 1, ..., N}, where 0 indicates the server is idle, and (2) the number of jobs per queue ∈ ℕ^N. Further note that the system’s future is entirely described by the present: the state satisfies the Markov property.


Possible actions

A total of N +1 unique actions can be performed on this environment: N for attending each of the separate queues, and one ‘idling’ action. For example, for a system with 2 queues, the server can either attend queue 1, queue 2 or do nothing, resulting in a total of 3 possible actions.

To further clarify: choosing the same action repeatedly means no setup or switching costs need to be paid. When the server is currently idle and we choose the action ‘attend queue 1’, setup costs K are incurred.

Freak events

We want our algorithms to be robust, i.e., they should perform well in scenarios that do not occur frequently. In order to test this, we introduce a feature in the polling system called ‘freak events’. Each time unit, there is a small probability of 0.1% that a large number of jobs arrives at once in one of the queues. Such an event is dubbed a freak event. How many jobs qualify as a ‘large number’ depends on other environment parameters.

Freak events are a realistic property of many queueing systems. Consider for example the freak events of a touring bus arriving at a restaurant or a large number of phone calls being made at once when an accident occurs. These freak events are therefore not only a useful tool to test the robustness of the algorithms, but are also justified by realistic scenarios.

The goal is to minimize the costs per unit time of the polling system. For example, if the holding costs were 0, we would simply let the server sit idle indefinitely. With operating costs 0, we should never deactivate the server.
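To make these dynamics concrete, the sketch below simulates one time step of the unlimited-buffer polling system in Python. The class and method names (PollingEnv, step) are illustrative and not taken from the thesis, and the size of a freak arrival is left as a parameter, since the text only states that it depends on the other environment parameters.

```python
import math
import random


def poisson(lam):
    """Draw from a Poisson distribution (Knuth's method; fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        k += 1
        p *= random.random()
        if p <= threshold:
            return k - 1


class PollingEnv:
    """Minimal sketch of the discrete-time polling system with unlimited buffers.

    Action 0 = idle, action i (1..N) = attend queue i.
    """

    def __init__(self, arrival_rates, holding_costs, switch_costs,
                 setup_cost=20.0, operating_cost=5.0,
                 freak_prob=0.001, freak_size=30):
        self.lam = arrival_rates        # lambda_i for each queue
        self.h = holding_costs          # h_i for each queue
        self.k = switch_costs           # k[i][j]: cost of moving from queue i+1 to queue j+1
        self.K = setup_cost
        self.o = operating_cost
        self.freak_prob = freak_prob
        self.freak_size = freak_size    # assumed; the thesis leaves this parameter-dependent
        self.queues = [0] * len(arrival_rates)
        self.location = 0               # 0 = idle, i = at queue i

    def step(self, action):
        """Advance one time step (Delta t = s = 1) and return the cost incurred."""
        cost = 0.0

        # Setup cost when activating, switching cost when changing queues.
        if action != 0 and self.location == 0:
            cost += self.K
        elif action != 0 and action != self.location:
            cost += self.k[self.location - 1][action - 1]
        self.location = action

        # Operating cost, and exactly one job is processed per active time step.
        if self.location != 0:
            cost += self.o
            q = self.location - 1
            self.queues[q] = max(0, self.queues[q] - 1)

        # Poisson arrivals, plus a rare 'freak event' at each queue.
        for i, lam in enumerate(self.lam):
            self.queues[i] += poisson(lam)
            if random.random() < self.freak_prob:
                self.queues[i] += self.freak_size

        # Holding costs for every job still waiting.
        cost += sum(h * q for h, q in zip(self.h, self.queues))
        return cost
```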

In the next section, we cover a variation of this system that caps the amount of jobs that a queue can hold.

3.1.1 Systems with limited buffer size

Here we propose a system with limited ‘buffer size’, i.e. a limit on the number of jobs that a queue can hold. Whenever a queue is full, no new jobs can arrive in the system. Note that no ‘freak events’ occur in this system.

A great advantage of analyzing such a system is that it has a limited state space. For a system with N = 2 queues and a buffer size of M = 19, the system can be in only (N + 1) × (M + 1)^N = 1200 different states, where N + 1 is the number of locations the server can be at and M + 1 is the number of states a queue can be in (including the empty state).


With such a small state space, we can compute exactly the long-run rewards that result from choosing any action in any state. We will see later that reinforcement learning algorithms try to estimate these long-run rewards.

The reason we are thus interested in these limited buffer size systems is that they allow us to benchmark the performance of our reinforcement learning algorithms.

Comparing solutions of unlimited and limited buffer systems

When comparing solutions found in unlimited or limited buffer systems, we need to proceed with caution. It is possible that in the limited buffer system, it is optimal to never turn the server on when the holding costs of a full queue are sufficiently low. This solution translates poorly to the unlimited buffer system, since never turning the server on would generally cause the holding costs to become extremely high. We are therefore only interested in solutions of the limited buffer systems where the buffer capacity is not reached. However, it turns out that in many cases, the buffer capacity will be reached, unless we explicitly penalize a full queue. Why?

Imagine an unlimited buffer size system with a single queue and an optimal policy to turn the server on with a queue length of 18. In other words, the server idles until the queue grows to a length of 18 jobs, then switches on to process all jobs and switches back off again. This cycle repeats.

Now suppose we have the same system, but with a buffer size of 19. It turns out that it is better to just let the queue sit at full capacity of 19 jobs than to ever turn the server on, even though the optimal length to turn the server on is less than the buffer capacity.

To prevent these situations from arising, a relatively high cost is paid each time unit that all queues are at maximum capacity. Enforcing this penalty in the previous example would mean that the server would no longer let the queue sit at full capacity. So, the server would definitely turn on at 19 jobs. However, since it is even better to turn the server on at 18 jobs than at 19 jobs, we will find as our optimal solution that the server should turn on at 18 jobs.

In other words, such a penalty ensures that the optimal solution found using a limited buffer size matches the solution found with unlimited buffer size. We will enforce this penalty throughout our analysis of limited size buffer systems.

Having described the systems we will analyze, we will now move on to a deterministic continuous time equivalent of our problem for single queue systems.

3.2 Economic Production Quantity model

In this section we will cover the classical Economic Production Quantity (EPQ) model, with which the queueing system that we analyze has similarities. For this deterministic continuous time model with one queue we can find an analytical solution, and it turns out this solution is very accurate for our stochastic discrete time model with one queue. This helps build an intuitive understanding of the problem. Furthermore, we will use this model in the design of our heuristic for N queues.


The EPQ model features a single server that has to satisfy a constant rate of demand. A penalty is paid for each time unit a unit of demand is not being satisfied, called the holding cost h (in units of cost per unit time per unit demand). To satisfy the demand, a server has to be switched on, for which a setup cost K has to be paid. Furthermore, each time unit the server is on, an operational cost o has to be paid (in units of cost per unit time). The demand rate d and the service rate (or production rate) p > d are constant over time, and we let Q be the queue length at which the server is activated (incurring setup cost K). See Figure 1 for an overview.

Figure 1: The queue length over time, showing one cycle of length Q/d + Q/(p − d). This cycle repeats itself: at first, the queue grows to size Q at rate d, after which it is cleared by the server at rate p − d. While the server is active, costs o are paid each time unit. When the queue is empty, the server is deactivated.

The variable we control is Q, and the goal is to find Q such that the cost per unit time (cost rate) is minimized. In other words: how long should the queue become before we activate the server? The cost rate function C can be found as
$$C = \text{avg. queue length} \times \text{holding cost} + \frac{\text{operating cost per cycle}}{\text{cycle time}} + \frac{\text{setup cost per cycle}}{\text{cycle time}},$$
which results in
$$C = \tfrac{1}{2} Q h + \frac{\frac{Q}{p-d}\, o}{\frac{Q}{d} + \frac{Q}{p-d}} + \frac{K}{\frac{Q}{d} + \frac{Q}{p-d}}$$
and simplifies to
$$C = \tfrac{1}{2} Q h + \frac{d o}{p} + \frac{d K (p - d)}{p Q}. \qquad (1)$$

To find the Q that minimizes this equation, we differentiate with respect to Q to find
$$\frac{\partial C}{\partial Q} = \frac{K d (d - p)}{p Q^2} + \frac{h}{2}.$$

Setting this equation equal to 0 and solving for Q gives us the queue length that minimizes the cost rate as
$$Q^* = \sqrt{\frac{2 K d (p - d)}{h p}}. \qquad (2)$$
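As a quick sanity check of Equation 2, we can plug in the single-queue parameters that are used later in Section 4.1 (K = 20, h = 0.5, d = λ = 0.3 and p = 1, since the service time is one time unit):
$$Q^* = \sqrt{\frac{2 \cdot 20 \cdot 0.3 \cdot 0.7}{0.5 \cdot 1}} = \sqrt{16.8} \approx 4.1,$$
so the server should be activated at a queue length of roughly four jobs, which matches the optimal policy found for the stochastic single queue in Section 4.1.1.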

Note that Q∗ seems to be independent of the operating cost o. This is partly true.

In this analysis, we have restricted ourselves to the case where the server alternates its activation. When we relax this assumption, it may be more cost efficient to never switch the server off, depending on the operating costs. To prevent this situation from arising, we require that the optimal cost rate under alternating activation be less than the cost rate of simply leaving the server turned on at all times.

In the next section, we will cover the heuristic for N queues that utilizes the equation for Q∗ (Equation 2).

3.3 Heuristic for N queues

To benchmark the reinforcement learning solutions against a heuristic, we compare the cost rates found by either technique. For this, we use a heuristic with parameters α, β ≥ 0. The parameters are optimized by a grid search. This section details the workings of the heuristic. Note that this heuristic is only applied to unlimited buffer systems, since our limited buffer size systems are chosen to be sufficiently small that optimal solutions can be found.

Recall that Equation 2 gives us the optimal queue length at which the server should activate for the deterministic continuous time queue, and note that Q* is different for each queue, since each queue has its own holding costs, switching costs and arrival rate. Further recall that the switching costs to move to any of the queues depend on the current location of the server, so Q* also depends on the server location. For each queue, we divide the queue length by Q* to get the relative queue lengths.

When the server is idling and there is a queue with relative length higher than parameter α, the server will attend that queue. It will then keep handling jobs from the queue it is attending until it is empty. Parameter α determines the threshold for the server to become active from an idling position. When α = 0, the server will immediately become active at any queue length, and at α = ∞, the server will never process a queue.

Then, when the server has just processed a queue, it will either move to the longest relative queue or go back to idling. When the sum of relative queue lengths is less than parameter β, the server will go back to idling. Parameter β therefore determines the threshold at which the server deactivates. Note that when β = 0, the server will always stay active, and at β = ∞ the server will always go back to idling after processing a queue.


Algorithm 1 Heuristic for N queues with unlimited buffer size
Define parameters α and β
Read the current state S from the environment, giving the server’s location and each queue’s length
for each queue do
    Calculate Q* for this queue from Equation 2
    Relative queue length of this queue ← queue length / Q*
end
if server is idle then
    if max relative queue length > α then
        return visit queue with longest relative queue
    else
        return stay idle
    end
else
    if current queue is not empty then
        return stay at current queue
    else
        if Σ relative queue lengths > β then
            return visit queue with longest relative queue
        else
            return let server go idle
        end
    end
end
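A compact Python sketch of Algorithm 1 is given below. The helper q_star applies Equation 2; exactly which cost plays the role of K for each queue is not pinned down above, so the sketch assumes it is the cost of reaching that queue from the current server location (the setup cost when idle, the switching cost otherwise).

```python
import math


def q_star(reach_cost, arrival_rate, holding_cost, service_rate=1.0):
    """Equation 2, with the setup cost replaced by the (assumed) cost of
    reaching this queue from the current server location."""
    d, p = arrival_rate, service_rate
    return math.sqrt(2.0 * reach_cost * d * (p - d) / (holding_cost * p))


def heuristic_action(location, queues, lam, h, k, K, alpha, beta):
    """Return 0 to idle, or i (1-based) to attend queue i, following Algorithm 1."""
    relative = []
    for i, q in enumerate(queues):
        if location == 0:
            reach_cost = K                   # server must first be set up
        elif location == i + 1:
            reach_cost = K                   # own queue; reuse K only to keep Q* finite
        else:
            reach_cost = k[location - 1][i]  # switching cost to queue i+1
        relative.append(q / q_star(reach_cost, lam[i], h[i]))

    longest = max(range(len(relative)), key=lambda i: relative[i]) + 1

    if location == 0:                        # server is idle
        return longest if max(relative) > alpha else 0
    if queues[location - 1] > 0:             # keep serving the current queue until empty
        return location
    # Current queue just emptied: move to the longest relative queue or go idle.
    return longest if sum(relative) > beta else 0
```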

3.4 Reinforcement learning

In this section we will explain the reinforcement learning techniques used in this paper. We will first cover some key concepts within reinforcement learning necessary to understand our techniques. We will then go into these techniques, starting with the basic Q-learning algorithm.

The Q-learning algorithm is fundamental for understanding linear function approximation (or ‘linear learning’) and deep reinforcement learning (often referred to as ‘deep learning’ in this paper), which are covered in Sections 3.4.4 and onward. We will then also present the details of our implementation of linear and deep learning. Lastly, in Section 3.4.8 we show an important property of tabular and linear learning that can greatly improve the performance of these methods.

3.4.1 Key concepts

Reinforcement learning consists of an agent interacting with an environment, where the agent is tasked with learning which actions yield the most long-run reward. In this setup, the environment is described by a Markov Decision Process (MDP). An important characteristic of the MDP is that, given the current state, the future of the system is independent of the history of states the agent has visited. In practice, this means that the current state contains all the information anyone would minimally need to make optimal decisions.


The agent chooses its actions in accordance with policy π. Formally, the policy is a mapping from states to actions. By choosing action A_t, the environment responds by giving the agent its immediate reward R_{t+1} and its successor state S_{t+1}.¹ The future reward is estimated from the agent’s knowledge of previous interactions with the successor state. In general, the reward and successor state are random variables. All transition probabilities, rewards, states and actions are modelled in the MDP. It is therefore a complete description of the environment.

Figure 2: The agent–environment interaction in a Markov decision process (figure reproduced from Sutton and Barto [11]): the agent takes action A_t, receives an immediate reward R_{t+1} and ends up in the successor state S_{t+1}.

Furthermore, a very important concept is that of the action-value function Q(s, a). This function describes how much reward the agent expects to get in the long run when it chooses action a in state s.

What is meant by the long run here? If the environment has no ‘final state’ or ‘terminal state’, the expected reward in the long run may be infinite, since the agent can just keep accumulating reward indefinitely.

The final key concept solves this problem: the discount factor γ ∈ [0, 1]. The reward an agent receives consists of the immediate reward and the future reward. We multiply the future reward by, for example, a factor γ = 0.99. By doing this, the long-run reward remains bounded.

To illustrate this, consider a very simple environment where the agent gets unit reward of 1 each time step. If γ = 1, no discounting takes place. The expected long-run reward when choosing action a (the only possible action) when in state s (the only state of this system) will then be ∞, or Q(s, a) = ∞.

However, with γ < 1, the expected long-run reward will be
$$Q(s, a) = R_0 + \gamma R_1 + \gamma^2 R_2 + \dots = 1 + \gamma + \gamma^2 + \dots = \sum_{n=0}^{\infty} \gamma^n = \frac{1}{1 - \gamma}.$$
For discount factor γ = 0.99 this already results in the action-value function being Q(s, a) = 100 instead of infinite.

Very low values of γ result in the agent being short-sighted, putting much weight on rewards it will get in the near future. High values of γ will result in the agent also considering what may happen in the future, but setting γ too high will make learning about the action-value function difficult.

To summarize, an agent interacts with an environment and gets reward along the way.

¹ Following the notation and conventions used in the introductory textbook by Sutton and Barto [11].


We discount the future reward to ensure that the action-value function Q(s, a) converges. Q(s, a) is the function that tells us how much long-run reward we expect to get from being in state s and choosing action a. The reason we care so much about Q(s, a) is that whenever we have an accurate estimate of it, we can find the best action in state S by taking arg max_a Q(S, a).

In the next section, the Q-learning algorithm is explained. This algorithm gives us the tools to learn about Q(s, a).

3.4.2 Q-learning

The Q-learning algorithm allows the agent to learn about Q(s, a). Algorithm 2 shows how this can be implemented. Note that this algorithm is titled tabular Q-learning. This can be understood as follows: suppose we have a table with a unique state on each row and a possible action on each column. The action-value function Q(s, a) then maps to a single cell of this table: we thus keep track of Q(s, a) through a table of values (hence ‘tabular’). Note that we call a combination of a specific state and action a state-action pair.

The algorithm can be outlined as follows. First, the agent reads the current state of the environment, S. It will then either choose a random action with probability ε or the action that it thinks is best, the greedy action. Therefore, ε can be interpreted as a measure of exploration. This is called the ε-greedy policy. Some degree of exploration is necessary to prevent the agent from never exploring some options and getting stuck in local optima early on.

The action that is now chosen is denoted A. This action is fed to the environment, and an immediate reward R and successor state S′ are received. Initially, the agent predicted that it would get an immediate plus discounted future reward of Q(S, A) (by definition of Q(S, A)). Having performed the action, it can now establish that the actual return was R + γ max_a Q(S′, a). The agent has gained knowledge about what Q(S, A) actually should have been.

Therefore, it can now adjust its estimate of Q(S, A). In this context, R + γ max_a Q(S′, a) is dubbed the learning target and Q(S, A) − (R + γ max_a Q(S′, a)) is the error. The update of Q(S, A) then becomes
$$Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \max_{a} Q(S', a) - Q(S, A) \right], \qquad (3)$$
where α ∈ (0, 1) is the learning rate. This simple algorithm in principle is enough to let us learn about Q(S, A) and the optimal actions for each state.


Algorithm 2 Tabular Q-learning
Initialize Q(s, a) at 0 for all state-action pairs; set ε, γ and α.
for a predefined number of training steps do
    Read the current state S from the environment
    Draw a random number u ~ U(0, 1)
    if u < 1 − ε then
        A ← arg max_a Q(S, a), the greedy action
    else
        A is drawn randomly from the set of possible actions
    end
    Action A is fed to the environment
    The environment returns immediate reward R and successor state S′
    A′ ← arg max_a Q(S′, a), the greedy successor action
    Learning target ← R + γ Q(S′, A′)
    Error ← Learning target − Q(S, A)
    Q(S, A) ← Q(S, A) + α × Error
end
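A direct Python translation of Algorithm 2 could look as follows. The environment interface (env.state and env.step(action) returning the immediate reward and the successor state) is an assumption, and in our setting the reward would simply be the negative of the incurred cost; the default parameter values mirror those used in Section 4.1.

```python
import random
from collections import defaultdict


def tabular_q_learning(env, actions, steps, epsilon=0.1, gamma=0.98, alpha=0.01):
    """Tabular Q-learning (Algorithm 2) with epsilon-greedy exploration."""
    Q = defaultdict(float)                       # Q[(state, action)], initialized at 0

    def greedy(state):
        return max(actions, key=lambda a: Q[(state, a)])

    S = env.state
    for _ in range(steps):
        # Choose the greedy action with probability 1 - epsilon, otherwise explore.
        A = greedy(S) if random.random() < 1 - epsilon else random.choice(actions)
        R, S_next = env.step(A)
        target = R + gamma * Q[(S_next, greedy(S_next))]     # learning target
        Q[(S, A)] += alpha * (target - Q[(S, A)])            # move towards the target
        S = S_next
    return Q
```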

3.4.3 Function approximation

In the previous section we have shown how tabular Q-learning can be implemented. However, this implementation relies on encountering each state-action pair (s, a) numerous times in order to obtain a somewhat accurate estimate of Q(s, a). Furthermore, we need to keep track of the estimated value of each state-action pair.

In most practical problems, the amount of state-action pairs quickly blows up due to an enormous state space. For such problems, we need to be able to infer the value of various actions without ever having encountered some specific state. This can be achieved by function approximation. Let ~w be a vector of weights; then
$$\hat{Q}(s, a; \vec{w}) \approx Q(s, a)$$
is the approximating function with weights ~w. In this paper we will go into linear function approximation and function approximation through neural networks, which will be explained in Sections 3.4.4 and 3.4.6, respectively.

Note that we can set up the function approximator in different ways. See Figure 3 for a schematic of two different designs.

Figure 3: Two designs for the function approximator: feeding both state s and action a into the approximator with weights ~w to obtain a single value Q̂(s, a; ~w), or feeding only the state s to obtain Q̂(s, a_1; ~w) through Q̂(s, a_m; ~w) for all actions at once.


3.4.4 Linear function approximation

Linear function approximation relies on features, which can be described as simplifications of the state space. A system of multiple queues could for example be simplified by only considering features such as the number of jobs in each queue, the maximum difference in queue lengths between all queues and the length of the longest queue. In this approach, a state-action pair (s, a) is represented by its feature vector ~θ(s, a). We estimate the action-value function as
$$\hat{Q}(s, a; \vec{w}) = \vec{w}^{\,T}\,\vec{\theta}(s, a), \qquad (4)$$
where each feature is simply multiplied by the corresponding element of vector ~w. A different choice would be to let the feature vector depend only on the state, i.e. ~θ(s). (Note however that this is just a less general approach than defining ~θ(s, a), as will be covered in the next section, Section 3.4.5.)

Figure 4: Schematic overview of how (a) Q̂(s, a) is derived from the feature vector ~θ(s, a), and (b) how Q̂(s, a_1) through Q̂(s, a_k) can be derived when ~θ(s) is given.

A very important task of the researcher when doing linear function approximation is designing the feature vector. Many design choices for such a vector can be made, heavily influencing the quality of the solutions. This can be considered an important downside of this technique. Note in particular that in order for the approximation to be good, each element of the feature vector must have a linear relationship with Q(s, a).

To clarify what is meant by this, suppose we have the server location ∈ {0, 1, ..., N} as the first element of the feature vector. This would mean that at location k, the contribution of this element to Q̂(s, a; ~w) would be θ_0 × w_0 = k w_0. However, at location 2k we would have a contribution of 2k w_0. Since locations can be arbitrarily numbered, it does not make sense that at location 2k the magnitude of the contribution to Q̂(s, a; ~w) is doubled, making this a poor feature vector design.

The length of a queue would for example be a better (although perhaps still not perfect) feature: a longer queue generally results in higher costs. A challenge of linear function approximation therefore is to find a collection of features that have a linear relationship with Q(s, a) and together result in an accurate fit. In Section 3.4.5 we describe the feature vector design used in this paper.


Updating the weights ~w

The update ∆~w is chosen such that it minimizes the squared error. Consider E(~w) to be the mean squared error function, i.e.,
$$E(\vec{w}) = \mathbb{E}\left[ q(s, a) - \hat{Q}(s, a; \vec{w}) \right]^2, \qquad (5)$$
where q(s, a) is the true value we are trying to fit. Note that we do not actually have an estimate for q(s, a) at this stage, but we assume it is given for the moment. By taking the negative of the gradient of the error function with respect to ~w, i.e., changing ~w such that the error decreases, we can find the update for ~w as
$$\Delta \vec{w} = -\tfrac{1}{2}\,\alpha\, \nabla_{\vec{w}} E(\vec{w}) = \alpha \left[ q(s, a) - \hat{Q}(s, a; \vec{w}) \right] \nabla_{\vec{w}} \hat{Q}(s, a; \vec{w}),$$
with α the learning rate (the relative size of the step we will take in the direction of the negative gradient). Since we do not know q(s, a), we use the next-best thing we have: r + γ max_{a′} Q̂(s′, a′; ~w), with r the immediate reward and s′ and a′ the successor state and action. And hence, with Equation 4, we arrive at
$$\Delta \vec{w} = \alpha \left[ r + \gamma \max_{a'} \vec{w}^{\,T}\vec{\theta}(s', a') - \vec{w}^{\,T}\vec{\theta}(s, a) \right] \vec{\theta}(s, a) \qquad (6)$$
for the Q-learning update rule of ~w with each new observation. Algorithm 3 describes how Q-learning with linear function approximation is then implemented.

Algorithm 3 Linear Q-learning
Initialize ~w; set ε, γ and α.
for a predefined number of training steps do
    Read the current state S from the environment
    Draw a random number u ~ U(0, 1)
    if u < 1 − ε then
        A ← arg max_a Q̂(S, a; ~w) = arg max_a ~w^T ~θ(S, a), the greedy action
    else
        A is drawn randomly from the set of all possible actions
    end
    Action A is fed to the environment
    The environment returns immediate reward R and successor state S′
    A′ ← arg max_{a′} ~w^T ~θ(S′, a′), the greedy successor action
    Find the weight update that reduces the error as
        ∆~w ← α [R + γ ~w^T ~θ(S′, A′) − ~w^T ~θ(S, A)] ~θ(S, A)
    ~w ← ~w + ∆~w
end
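In code, the inner update of Algorithm 3 reduces to a few NumPy lines. The feature map phi(s, a) stands for the ~θ(s, a) of Section 3.4.5 and is assumed to be given; the learning rate shown is a placeholder.

```python
import numpy as np


def greedy_action(w, phi, state, actions):
    """arg max_a  w^T theta(state, a)."""
    return max(actions, key=lambda a: w @ phi(state, a))


def linear_q_update(w, phi, s, a, r, s_next, actions, gamma=0.99, alpha=0.01):
    """One Q-learning step with linear function approximation (Equation 6)."""
    a_next = greedy_action(w, phi, s_next, actions)
    td_error = r + gamma * (w @ phi(s_next, a_next)) - (w @ phi(s, a))
    return w + alpha * td_error * phi(s, a)
```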

3.4.5 Feature vector design

The feature vector design used in our experiment with linear learning on 2 queues in Section 4.2 is explained here. This section is primarily recommended to readers who want to replicate our experiments or test whether they can improve upon our methods of linear learning. The choice of this feature vector design is based on an informal exploration and it is expected that it can still be improved. Important elements of the design are as follows.


• The length of queue 1 minus the length of queue 2 and vice versa are given as elements of the feature vector (but not the absolute difference between the two).

• There is an element of the feature vector that indicates whether queue 1 is empty and one that indicates whether queue 2 is empty. A ‘1’ indicates an empty queue and ‘0’ a non-empty queue.

• One more element of the feature vector indicates whether both queues are empty.

• Lastly, there is an element that is always activated at 1. This is commonly known as the bias. (Similar to b in y = ax + b when fitting a linear relationship.)

These 10 elements give a description of the state, but the server location is not yet incorporated in these elements. As explained in the previous section, the relationship between the server location and Q(s, a) is highly non-linear.

To solve this problem, the feature vector will be chosen to be of size 30 and each element set to 0. Whenever the server is idling, only the first 10 elements are activated as described in the list. When at location 1, the second 10 elements will be activated and when at location 2, the third 10 elements are activated (where ‘activating’ means replacing the zeroes by the elements described in the list above).

This is a way of saying or acknowledging that the policy when at queue 1 is completely unrelated to the policy when at queue 2. Although there is probably some relationship between the policies, it is difficult to incorporate such information in the feature vector.

With this, the 30-element feature vector ~θ(s) is completely defined. We can now follow the schematic of Figure 4b and be done with it. However, we would like to define ~θ(s, a) in order to let our feature vector fit in an important framework that is explained in Section 3.4.8. This is easily done.

Since there is a total of 3 actions (do nothing, go to queue 1, go to queue 2), we create a vector of zeroes of 90 elements. The first 30 elements are activated when we choose the first action, and so on.
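The ‘block activation’ just described can be sketched as follows: a 10-element base description of the state is placed in the slice selected by the server location, and the resulting 30-element vector is in turn placed in the slice selected by the action, giving 90 elements in total. Only the base elements listed above are filled in explicitly; the remaining entries are left behind a generic extra hook, so this is a sketch of the mechanism rather than of the exact design.

```python
import numpy as np

N_BASE = 10        # base description of the state
N_LOCATIONS = 3    # server idle, at queue 1, at queue 2
N_ACTIONS = 3      # idle, attend queue 1, attend queue 2


def base_features(q1, q2, extra=()):
    """10-element base description; `extra` stands for the elements not listed above."""
    f = np.zeros(N_BASE)
    f[0] = q1 - q2                               # length of queue 1 minus queue 2
    f[1] = q2 - q1                               # and vice versa
    f[2] = 1.0 if q1 == 0 else 0.0               # queue 1 empty?
    f[3] = 1.0 if q2 == 0 else 0.0               # queue 2 empty?
    f[4] = 1.0 if q1 == 0 and q2 == 0 else 0.0   # both queues empty?
    f[5] = 1.0                                   # bias, always active
    f[6:6 + len(extra)] = extra
    return f


def theta(location, q1, q2, action, extra=()):
    """90-element feature vector theta(s, a)."""
    state_vec = np.zeros(N_LOCATIONS * N_BASE)
    lo = location * N_BASE
    state_vec[lo:lo + N_BASE] = base_features(q1, q2, extra)

    full = np.zeros(N_ACTIONS * N_LOCATIONS * N_BASE)    # 3 * 3 * 10 = 90 elements
    ao = action * N_LOCATIONS * N_BASE
    full[ao:ao + N_LOCATIONS * N_BASE] = state_vec
    return full
```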

Note that the amount of elements of the feature vector grows with O(n^4), with n the amount of queues (the list above grows with O(n^2) and the amount of server locations and actions both grow with O(n)), a considerable growth. This can perhaps be reduced by a more clever feature vector design. The next section covers deep reinforcement learning.

3.4.6 Deep reinforcement learning

In 2015, Google DeepMind achieved human-level control in Atari games through deep reinforcement learning, see Mnih (2015) [10], using a technique called ‘Deep Q-Networks’ (DQN) with experience replay, which is a lot more subtle than just replacing the linear function approximator by a neural network. There are three major components to this technique, namely:

1. A neural network is used to estimate the action-value function Q(s, a).
2. The ‘Q-targets’ are ‘fixed’.
3. Old experiences are stored and replayed during training (‘experience replay’).


We will now cover each of these components separately.

Using a neural network to estimate Q(s, a)

Consider an experience consisting of the initial state s, the action a that was then performed, yielding immediate reward r and successor state s′; in short, a ⟨s, a, r, s′⟩-tuple. The neural network is simply a function approximator with weights ~w, such that Q̂(s, a; ~w) is the estimated long-run reward of this experience. Just like before, we find that we can update this estimate by considering the value of r + γ max_{a′} Q̂(s′, a′; ~w), also known as the ‘target’, or more specifically the Q-target.

How should we update our network such that its prediction matches the target more closely? We feed state s and action a to the network as input and receive Q̂(s, a; ~w). As in supervised learning, we tell the network that r + γ max_{a′} Q̂(s′, a′; ~w) should have been its output and perform a gradient descent step with respect to the weights ~w.

In other words, the weights are updated such that the error is reduced. This means that if we were to perform many gradient descent steps with only this one training example, we would eventually end up with a vector of weights ~w such that Q̂(s, a; ~w) ≈ r + γ max_{a′} Q̂(s′, a′; ~w).

In reality, we use a slightly different (but very common) network design. We choose not to feed both s and a to the network to receive Q̂(s, a; ~w), but only feed s and receive Q̂(s, a_k; ~w) for all actions a_k. See also Figure 3. The reason is that we often want the Q-values of all actions in order to choose the best action. The network we used is trained in the same ‘supervised’ fashion.

Fixed Q-targets

When training the network as described in the previous section, we run into a problem. When updating the weights, not only does the prediction Q̂(s, a; ~w) change, the target r + γ max_{a′} Q̂(s′, a′; ~w) changes as well, analogous to studying for an exam while the subject keeps changing. This results in very poor convergence of the neural network. A solution to this problem is to have two identical neural networks: the target network and the training network.

The target network has weights ~w⁻ and these weights are fixed for C training steps. The target for experience ⟨s, a, r, s′⟩ is then r + γ max_{a′} Q̂(s′, a′; ~w⁻), computed using the target network. The prediction by the training network is still Q̂(s, a; ~w), upon which the gradient descent step is performed. After C steps, we set ~w⁻ = ~w.

Shortly after the 2015 DeepMind paper, a more elegant implementation of the target network was proposed using ‘soft’ target updates that smoothly alter the target network during training, see Lillicrap (2015) [13]. A factor τ ≪ 1 is used to update the target network after each training step as
$$\vec{w}^- = (1 - \tau)\,\vec{w}^- + \tau\,\vec{w}. \qquad (7)$$


Experience replay

Lastly, an innovation that also helps stabilize convergence is experience replay. When interacting with an environment, we end up with a new state after each action. However, the sequence of states is then highly similar, resulting in highly correlated errors. Training on sequential experiences may result in a biased function approximator and parameters may diverge catastrophically [14].

To fix this, we save all experiences in replay memory, with each experience a ⟨s, a, r, s′⟩-tuple. We then randomly sample experiences from memory, breaking the sequence of events. After each step, we draw a ‘mini-batch’ of 32 experiences from memory and perform a gradient descent step on these samples.

As an additional benefit, we are training on old experiences multiple times, resulting in higher computational efficiency, since simulating the environment takes up a non-negligible amount of resources.

Furthermore, to ensure more exploration in the beginning, while the network is unlikely to suggest good actions, ε starts off high and is decreased each time an action is selected. Combined, this leads to an implementation of DQN as in Algorithm 4.

Algorithm 4 Deep Q-learning with experience replay
Randomly initialize weights ~w and ~w⁻; set ε, γ, τ
Initialize replay memory D and populate it with b experiences
for a predefined number of training steps do
    Read the current state S from the environment
    Draw a random number u ~ U(0, 1)
    if u < 1 − ε then
        A ← arg max_a Q̂(S, a; ~w), the greedy action
    else
        A is drawn randomly from the set of all possible actions
    end
    Decrement ε unless ε ≤ 0.05
    Action A is fed to the environment
    The environment returns immediate reward R and successor state S′
    Append ⟨S, A, R, S′⟩ to replay memory
    Sample a random mini-batch of b experiences from replay memory
    Perform a gradient descent step on the error (R + γ max_{a′} Q̂(S′, a′; ~w⁻) − Q̂(S, A; ~w))² with respect to the weights ~w
    Set ~w⁻ ← τ ~w + (1 − τ) ~w⁻
end
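A condensed PyTorch sketch of Algorithm 4 is given below; the thesis does not state which framework was used, and the hidden-layer sizes and the environment interface are placeholders. The hyperparameters stated in the text and in Section 3.4.7 (Adam with default settings, mini-batches of 32, τ = 0.001, ε annealed linearly from 0.5 to 0.05 over 100,000 steps) are used as defaults.

```python
import random
from collections import deque

import torch
import torch.nn as nn


def make_q_net(n_features, n_actions, hidden=64):
    # Hidden size is a placeholder; the architecture itself is not specified here.
    return nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, n_actions))


def train_dqn(env, n_features, n_actions, steps,
              gamma=0.99, tau=0.001, batch_size=32,
              eps_start=0.5, eps_end=0.05, eps_decay_steps=100_000):
    q_net = make_q_net(n_features, n_actions)        # training network, weights w
    target_net = make_q_net(n_features, n_actions)   # target network, weights w-
    target_net.load_state_dict(q_net.state_dict())
    optimizer = torch.optim.Adam(q_net.parameters()) # Adam with its default settings
    memory = deque(maxlen=100_000)

    state = torch.tensor(env.state, dtype=torch.float32)
    eps = eps_start
    for _ in range(steps):
        # Epsilon-greedy action selection with a linearly decaying epsilon.
        if random.random() < 1 - eps:
            with torch.no_grad():
                action = int(q_net(state).argmax())
        else:
            action = random.randrange(n_actions)
        eps = max(eps_end, eps - (eps_start - eps_end) / eps_decay_steps)

        reward, next_state = env.step(action)
        next_state = torch.tensor(next_state, dtype=torch.float32)
        memory.append((state, action, reward, next_state))
        state = next_state

        if len(memory) < batch_size:
            continue                                  # wait until the memory is populated
        batch = random.sample(list(memory), batch_size)
        s = torch.stack([b[0] for b in batch])
        a = torch.tensor([b[1] for b in batch])
        r = torch.tensor([b[2] for b in batch], dtype=torch.float32)
        s2 = torch.stack([b[3] for b in batch])

        # Fixed Q-targets: the bootstrap value comes from the target network.
        with torch.no_grad():
            target = r + gamma * target_net(s2).max(dim=1).values
        pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        loss = nn.functional.mse_loss(pred, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Soft update of the target network (Equation 7).
        with torch.no_grad():
            for w, w_target in zip(q_net.parameters(), target_net.parameters()):
                w_target.mul_(1 - tau).add_(tau * w)
    return q_net
```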

3.4.7 Neural network architecture

In this Section, we will cover the architecture of the neural network that we used and relevant (hyper)parameters.


The number of hidden nodes and layers was determined after an informal exploration of the possibilities. We therefore suspect there are alternative network architectures with better performance.

The weights of the network are randomly initialized, uniformly on [−0.5, 0.5]. The weight updates are performed according to the Adam optimizer as published by Kingma et al.[15] using the default settings recommended in their paper. An advantage of the Adam optimizer is that it requires little hyperparameter tuning between different applications.

The parameters we used for the DQN algorithm are as follows: we set τ = 0.001, the size of a minibatch is set to 32 and exploration  starts at 0.5 and is linearly decremented to 0.05 in 100,000 steps.

In the next section we will cover how we can directly find the best fit from a batch of experiences for tabular and linear learning, which is not directly possible for deep learning.

3.4.8 Finding the least-squares fit from a batch of experiences

As shown in the ‘DQN with experience replay’ algorithm, we can reuse old experiences to learn the Q-values. Given a batch of experiences D, applying the gradient descent updates many times on experiences from D will eventually result in the minimization of the squared error over D. This is very time-consuming, and an interesting and very important property of tabular and linear learning is that the least-squares fit can be found directly. This property could ensure that linear learning stays a competitive alternative to deep learning. In this section, we show how the least-squares fit is found, using a technique called Least-Squares Policy Iteration (LSPI).

Tabular learning

Consider a set of experiences D. We want to find, for each pair (s, a), the solution that best fits the observations in D. When looking at a certain state-action pair, say (s_i, a_i), we can find the best fit for this pair simply as the expectation over the observed immediate and future rewards:
$$Q(s_i, a_i) = \mathbb{E}_{D_i}\!\left[ r + \gamma \max_{a'} Q(s', a') \right] = \left( \sum_{i \in D_i} r_i + \gamma \max_{a'} Q(s'_i, a') \right) / |D_i|, \qquad (8)$$
where D_i is the set of experiences containing the pair (s_i, a_i). To put this equation in words: to find the best fit for Q(s_i, a_i), simply take the sample mean of the immediate plus future reward that was obtained when in state s_i and performing a_i. Also note that in order to do this, there needs to be at least one occurrence of (s_i, a_i) in our set of experiences D.

There is only one problem with this: it matters in which order we apply this update over all pairs. Imagine applying this update to the first state-action pair: then Q(s_1, a_1) has changed. Since Q(s_1, a_1) has changed, the future reward received from (for example) state-action pair (s_2, a_2) is likely to have changed too, since the future reward is likely to depend on the estimate Q(s_1, a_1). So instead of updating the pairs one by one, we keep the current estimates fixed as Q_old and simultaneously apply the update
$$Q(s_i, a_i)_{\text{new}} = \mathbb{E}_{D_i}\!\left[ r + \gamma \max_{a'} Q(s', a')_{\text{old}} \right] \quad \forall i \qquad (9)$$
to find the best fit over our batch. Now, all Q-values have been adjusted. But this changes the expression max_{a′} Q(s′, a′). This makes sense: with our updated Q-values, our greedy policy changes. If we set Q_old ← Q_new and apply the update again, we would therefore get a new set of Q-values.

Therefore, we keep applying this update until our policy (and therefore the Q-values) no longer changes, making this a form of ‘policy iteration’ (see Puterman [12]). Having done this, we have found the Q-values that best fit our observations. This approach is not unique to tabular learning, but can also be applied to linear learning, as will be shown in the next section. This technique will be referred to as tabular Least-Squares Policy Iteration (LSPI), as opposed to linear LSPI, which is described in the next section.
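For a finite state-action table, the procedure above can be written down directly. The batch is assumed to be a list of (s, a, r, s′) tuples with hashable states, and convergence is checked with a small numerical tolerance rather than exact equality.

```python
from collections import defaultdict


def tabular_lspi(batch, actions, gamma=0.98, max_iters=100, tol=1e-9):
    """Tabular least-squares policy iteration over a fixed batch of experiences.

    Every Q(s, a) is repeatedly recomputed as the sample mean of
    r + gamma * max_a' Q_old(s', a') (Equation 9) until the values stop changing.
    """
    # Group the experiences by state-action pair.
    groups = defaultdict(list)
    for s, a, r, s_next in batch:
        groups[(s, a)].append((r, s_next))

    Q_old = defaultdict(float)
    for _ in range(max_iters):
        Q_new = defaultdict(float)
        for (s, a), samples in groups.items():
            total = sum(r + gamma * max(Q_old[(s_next, b)] for b in actions)
                        for r, s_next in samples)
            Q_new[(s, a)] = total / len(samples)
        if all(abs(Q_new[key] - Q_old[key]) < tol for key in groups):
            return Q_new
        Q_old = Q_new
    return Q_old
```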

Linear learning

Least-Squares Policy Iteration, or LSPI, was first described by Lagoudakis and Parr in 2003 [16] and is a generalization of the method described in the previous section such that it can be applied to linear learning. It consists of two parts: the LSTDQ algorithm, whose acronym derives from ‘Least-Squares Temporal-Difference learning for the Q-values’, and the policy iteration algorithm that utilizes LSTDQ. We will first go into how LSTDQ can be derived.

The starting point of this method is similar: we want to find an expression such that the expected update over our batch of experiences is 0, E_D[∆~w] = 0, or
$$\sum_{D} \Delta\vec{w} = 0. \qquad (10)$$
And since
$$\Delta\vec{w} = \alpha\left[ r + \gamma\, \vec{w}^{\,T}\vec{\theta}(s', a') - \vec{w}^{\,T}\vec{\theta}(s, a) \right]\vec{\theta}(s, a),$$
together this forms
$$\sum_{D} \Delta\vec{w} = \sum_{D} \alpha\left[ r + \gamma\, \vec{w}^{\,T}\vec{\theta}(s', a') - \vec{w}^{\,T}\vec{\theta}(s, a) \right]\vec{\theta}(s, a) = 0.$$
Eliminating α and the instance of ~θ(s, a) that is outside the brackets and rewriting gives us
$$\sum_{D} \left[ r + \gamma\, \vec{w}^{\,T}\vec{\theta}(s', a') - \vec{w}^{\,T}\vec{\theta}(s, a) \right] = 0.$$
From there, we split the summation into two parts and rewrite again to find
$$\sum_{D} \left[ r + \gamma\, \vec{w}^{\,T}\vec{\theta}(s', a') \right] = \sum_{D} \vec{w}^{\,T}\vec{\theta}(s, a),$$
which we multiply by ~θ(s, a)^T on both sides to get
$$\sum_{D} \left[ r + \gamma\, \vec{w}^{\,T}\vec{\theta}(s', a') \right]\vec{\theta}(s, a)^T = \sum_{D} \vec{w}^{\,T}\vec{\theta}(s, a)\,\vec{\theta}(s, a)^T,$$
after which we multiply both sides by the inverse of Σ_D ~θ(s, a)~θ(s, a)^T to finally end up with
$$\vec{w}^{\,T} = \left( \sum_{D} \left[ r + \gamma\, \vec{w}^{\,T}\vec{\theta}(s', a') \right]\vec{\theta}(s, a)^T \right) \left( \sum_{D} \vec{\theta}(s, a)\,\vec{\theta}(s, a)^T \right)^{-1}. \qquad (11)$$

To further clarify how this update is applied, consider a set of T experiences, each a ⟨s, a, r, s′⟩-tuple. Of course, each experience is missing the successor action that is necessary to determine the expected future reward. This action is found from our greedy policy as a′ = π(s′) = arg max_{a′} ~w^T ~θ(s′, a′), where ~w is the current vector of weights used by the agent. To find the new set of weights that minimizes the error over the batch of T experiences, we simply apply
$$\vec{w}^{\,T} = \left( \sum_{t=1}^{T} \left[ r_t + \gamma\, \vec{w}^{\,T}\vec{\theta}(s'_t, a'_t) \right] \vec{\theta}(s_t, a_t)^T \right) \left( \sum_{t=1}^{T} \vec{\theta}(s_t, a_t)\,\vec{\theta}(s_t, a_t)^T \right)^{-1}. \qquad (12)$$

From these equations we implement the LSTDQ algorithm as shown in Algorithm 5.

Algorithm 5 Least-Squares Temporal-Difference learning for the Q-values (LSTDQ)
Requirements and input:
    D: the set of experiences ⟨s, a, r, s′⟩
    ~w: current vector of weights from which the successor action a′ is determined greedily
    ~θ(s, a): the function that maps (s, a) to a feature vector of size (k × 1)
    Define π(s; ~w) = arg max_a ~w^T ~θ(s, a)
Initialize:
    A ← 0, a (k × k) matrix
    ~b ← ~0, a (k × 1) vector
for each (s, a, r, s′) ∈ D do
    A ← A + ~θ(s, a) (~θ(s, a) − γ ~θ(s′, π(s′; ~w)))^T
    ~b ← ~b + ~θ(s, a) r
end
~w ← A⁻¹ ~b
return ~w
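In NumPy, Algorithm 5 and the policy-iteration loop around it become roughly the following. The feature map phi(s, a) is again the ~θ(s, a) of Section 3.4.5; a small ridge term is added before solving for the weights in case the batch does not excite every feature, which is our addition and not part of the thesis.

```python
import numpy as np


def lstdq(batch, phi, w, actions, gamma=0.99, ridge=1e-6):
    """One LSTDQ pass (Algorithm 5): accumulate A and b over the batch and solve
    for the weights that best fit it under the greedy policy induced by w."""
    k = w.shape[0]
    A = np.zeros((k, k))
    b = np.zeros(k)
    for s, a, r, s_next in batch:
        a_next = max(actions, key=lambda a2: w @ phi(s_next, a2))   # pi(s'; w)
        f = phi(s, a)
        A += np.outer(f, f - gamma * phi(s_next, a_next))
        b += r * f
    # Ridge term for numerical stability (our addition).
    return np.linalg.solve(A + ridge * np.eye(k), b)


def lspi(batch, phi, w0, actions, gamma=0.99, max_iters=50, tol=1e-6):
    """Least-Squares Policy Iteration: iterate LSTDQ until the weights, and
    hence the greedy policy, no longer change."""
    w = w0
    for _ in range(max_iters):
        w_new = lstdq(batch, phi, w, actions, gamma)
        if np.linalg.norm(w_new - w) < tol:
            return w_new
        w = w_new
    return w
```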

With the LSTDQ algorithm we can relatively quickly find the weights that best fit our observations in D. Applying it iteratively until our vector of weights and therefore our policy no longer changes results in the best fit for the optimal policy given our set of experiences and gives us the LSPI algorithm, as shown in Algorithm 6.

Lagoudakis mentions that the LSTDQ algorithm can be performed in O(k^2) time, with k the length of the feature vector. However, recall from Section 3.4.5 that the length of the feature vector k that we chose grows O(n^4), with n the amount of queues. This means that the time cost to perform LSTDQ is O(n^8), which heavily limits our ability to apply this technique. This is a fundamental shortcoming that is difficult to overcome.

Algorithm 6 Least-Squares Policy Iteration (LSPI)
Let ~w_0 be the initial vector of weights and D the set of experiences.
~w ← ~w_0
repeat
    ~w_old ← ~w
    ~w ← LSTDQ(D, ~w_old)
until ~w = ~w_old (the greedy policy no longer changes)
return ~w


4 Experiments and results

In this section, we present the experiments that we conducted and the obtained results. The experiments are divided into three parts of increasing model complexity. First, we cover the single queue system (with limited queue size) to explore basic properties of reinforcement learning in queueing applications. With the insights obtained from the single queue setting, we move on to a system of two queues (again with a maximum queue size). On this system, we will apply linear learning and deep learning to see if these techniques can match a benchmark solution. Lastly, we will look at various systems of many queues and compare the performance of deep reinforcement learning to that of a heuristic.

4.1 Single queue system

The single queue system (with limited queue size) is used to explore how well reinforcement learning performs in very simple settings. In these settings, we can benchmark tabular reinforcement learning against the optimal solution. With findings from this section, we will benchmark linear learning and deep learning.

We will first look at the optimal policy as a function of the discount factor γ. Then, we will see how well Q-learning converges to the optimal policy for various values of the exploration rate ε and the learning rate α. We wrap up by showing how quickly good policies are found with a good choice of parameters α and ε, and also how quickly this is done using Least-Squares Policy Iteration (see Section 3.4.8 for a recap of LSPI).

The amount of jobs that the single queue may hold is capped at 19. Further parameters of the queue are as follows: setup cost K is set at 20, holding cost h at 0.5, arrival rate λ is 0.3 and the operating costs o are 5 per time unit.

4.1.1 Optimal policy as a function of discounting

In this section we investigate how the optimal policy changes for various discount factors. See Figure 5. These optimal policies are found using standard Value Iteration (see Puterman [12]).

We see that setting discount factor γ too low results in ‘optimal’ policies that are very different from the undiscounted policy (which should be considered the true optimal policy). This makes sense, since low values of γ cause us to put more weight on immediate costs and less weight on future costs. We are therefore reluctant to invest now (paying setup costs) to prevent higher future costs (paying holding costs for a long queue).


Figure 5: Optimal policy as a function of the discount factor γ (shown for γ = 0.92, 0.935, 0.95, 0.98, 0.995 and 1.0). Each bar represents a full policy, with the y-axis indicating the queue length and the x-axis indicating whether the server is off or on. Black cells indicate that the optimal action is to have the server off, whereas white cells indicate the server should be on. Low values of γ correspond to policies that let the queue grow longer before turning the server on.

In the formulation of our multi-queue systems, this is done as follows. Whenever the server is idle or sitting at an empty queue, the action to remain at that queue will be repeated over 6 time units instead of one. Instead of 6 very similar experiences, we will end up with just one, resulting in more diverse experiences being collected and the future rewards being discounted less. To clarify: an experience that was 12 time steps away would be discounted with γ^12, while it will now be discounted with just γ^2.

Further note that the (undiscounted) optimal policy is to activate the server at a queue length of 4. We compare this solution with the deterministic continuous time equivalent of this problem, i.e. Equation 2, and conclude that they yield very similar results, since Q* = √16.8 ≈ 4. Choosing different values for the cost parameters gives similar results, indicating that Equation 2 is a very accurate solution to the stochastic discrete time single queue (which we are investigating in this section). Using this approximation is therefore justified.

4.1.2 Convergence of Q-learning

In this section, we will see how well Q-learning performs in finding the correct Q-values. For this, we compare the agent’s estimate Q(s, a) to the true values found with the Value Iteration algorithm. This comparison is done by taking the sum of the absolute differences, i.e.
$$\text{Error} = \sum_{(s, a) \in S} |Q(s, a) - q(s, a)|, \qquad (13)$$
where Q(s, a) is the agent’s estimate and q(s, a) is the true action-value function. When calculating this error, we exclude the error of state-action pairs that the agent will not realistically encounter. To clarify what this means: if the optimal policy is to switch the server on at a queue length of 7, we exclude states with a higher queue length from the calculation of the error. The set S is the set of state-action pairs that remains.
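Once the exact values q(s, a) have been obtained from Value Iteration, computing this error is a short loop; the name relevant_pairs below stands for the restricted set S described above.

```python
def policy_error(Q, q_true, relevant_pairs):
    """Sum of absolute differences (Equation 13) over the restricted set S."""
    return sum(abs(Q[(s, a)] - q_true[(s, a)]) for s, a in relevant_pairs)
```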


Varying the exploration rate ε

Figure 6: Convergence to the correct Q-values for various values of the exploration rate ε (sum of absolute error against iterations ×100,000, for ε = 0.01, 0.05, 0.1, 0.35 and 0.5). Too little exploration and too much exploration are both inadvisable; the best values for ε are around 0.1 and 0.05. Here, α = 0.002 and γ = 0.95.

In Figure 6 it can be seen that ε = 0.1 is an ideal rate of exploration for this setting, outperforming the other possibilities. Recall that ε can be interpreted as a measure of exploration. Too little exploration (ε = 0.01) or too much exploration (ε = 0.5) both result in sub-optimal performance.

Varying the learning rate

Figure 7: Convergence to the correct Q-values for different values of the learning rate α (sum of absolute error against iterations ×100,000, for α = 0.0004, 0.002, 0.01 and 0.05). A high learning rate converges quickly, but stagnates and cannot improve beyond a certain point. A lower learning rate converges slowly, but can come closer to the correct Q-values. It is advised to dynamically adapt the learning rate. Here, ε = 0.01 and γ = 0.95.

From Figure 7 we see that with a high learning rate of 0.05, the error takes a massive drop, but then hovers around an error of 50, whereas for a lower learning rate the error drops more slowly, but reaches nearer to 0.


How fast can we find good policies with Q-learning?

Wrapping up the results of the previous sections, we will now show how quickly the agent can find good policies. The learning rate is set to α = 0.01, and the exploration to ε = 0.1.

See Figure 8 for the result. After 50,000 iterations, a policy is found that can be used in practice, with the server switching on at a queue length of 3. The optimal policy is first encountered around the 300,000 iteration mark. Notice that the agent still does not know how to deal with a long queue even after 1,000,000 steps of Q-learning, since it has not encountered long queues often. For example, after switching the server on with 19 jobs in queue, it will switch off again at a queue length of 17. Since the model that it learns about will not throw it into long queues randomly, it will not learn about these situations. This can be interpreted as learning more efficiently than the ‘full-width backups’ often used by other techniques for solving Markov Decision Processes.

Figure 8: How 1,000,000 iterations of Q-learning lead to the correct policy. A secondary axis shows when the agent’s policy coincides with the optimal policy (given that the environment is initialized with an empty queue). Here, γ = 0.98.

4.1.3 Convergence using tabular Least-Squares Policy Iteration

Recall from Section 3.4.8 that we can easily find the least-squares fit over a batch of experiences for tabular (and linear) learning using Least-Squares Policy Iteration (LSPI). Here, we present how quickly the optimal policies are found with this technique for tabular learning. We express this as a function of the number of visits per state-action pair.

In this setting, there are 40 states with 2 actions each, so 80 state-action pairs. An important note: we craft our batch of experiences in such a way that it contains an equal number of experiences for each of those state-action pairs. See Figure 9. After 30 visits per pair (i.e. a total of just 2400 experiences), the optimal policy has already been found for the first time. It is confidently learned after 50 visits.
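For the tabular case, the least-squares fit of Section 3.4.8 reduces to solving a linear system with one-hot features. A minimal sketch of one policy-evaluation step of LSPI over such a balanced batch (the helper functions index and policy are hypothetical):

```python
import numpy as np

def lstdq_tabular(batch, num_pairs, index, policy, gamma):
    """One policy-evaluation step of LSPI with one-hot (tabular) features.
    `batch` contains (s, a, cost, s_next) tuples, `index(s, a)` maps each
    state-action pair to 0..num_pairs-1 and `policy(s)` returns the greedy
    action of the current policy. Returns the least-squares Q-values."""
    A = np.zeros((num_pairs, num_pairs))
    b = np.zeros(num_pairs)
    for s, a, cost, s_next in batch:
        i = index(s, a)
        j = index(s_next, policy(s_next))
        # With one-hot features, the outer-product update reduces to:
        A[i, i] += 1.0
        A[i, j] -= gamma
        b[i] += cost
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w  # w[index(s, a)] approximates Q(s, a)
```

In full LSPI this evaluation step is alternated with a policy-improvement step, where the policy becomes greedy with respect to the new Q-values.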


[Figure 9 plot: sum of absolute error versus visits per state-action pair; legend: total error, optimal policy found.]

Figure 9: After about 50 visits of each state-action pair, the optimal policy is confidently determined with (tabular) LSPI (see the secondary axis, where ‘1’ indicates that the optimal policy is found). This is done much more quickly than with regular tabular Q-learning: we can jump straight to the least-squares solution. Here, the discount rate γ = 0.98.

4.2 Dual queue system

Having covered the performance of tabular learning for the single queue in the previous section, we now move on to the more difficult dual queue system. Since we know that tabular LSPI is very good at finding correct solutions after visiting each state-action pair just 100 times (see Figure 9), we will use it to benchmark the results of linear learning and deep learning. Note that this means we must use a system with a limited buffer size, since otherwise the state space is infinite and tabular learning cannot be applied.

Our experiments can be replicated using the feature vector design and neural network architecture as explained in Sections 3.4.5 and 3.4.7 respectively. Furthermore, an overview with parameters of the dual queue system is found in Figure 10.

[Figure 10 parameters: K = 20, k1,2 = k2,1 = 8, λ1 = 0.15, λ2 = 0.10, operating cost o = 5, h1 = 0.5, h2 = 0.8.]

Figure 10: Overview of the dual queue system that is used in our experiments. The maximum queue capacity is set at 19. Note that queue 2 has a lower arrival rate but a higher holding cost of h2 = 0.8, meaning that a cost of 0.8 is incurred per job in queue per unit time.
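For concreteness, one time step of this dual queue system could be simulated roughly as follows. This is a simplified sketch under our own assumptions (Bernoulli arrivals, at most one service completion per time unit, and K charged when the server leaves the idle position); the exact dynamics are those of Section 3.1.

```python
import random

# Parameters of the dual queue system in Figure 10.
LAMBDAS = [0.15, 0.10]   # arrival probabilities per time unit
HOLDING = [0.5, 0.8]     # holding cost per job per time unit
SWITCH = 8               # k_{1,2} = k_{2,1}
SWITCH_ON = 20           # K: assumed cost of leaving the idle position
OPERATING = 5            # operating cost o per time unit while serving
CAPACITY = 19            # maximum queue length

def step(queues, location, action):
    """Advance the system by one time unit. `location` and `action` are
    'idle', 0 or 1. Returns (cost, queues, new_location)."""
    cost = 0.0
    if action != location:
        cost += SWITCH_ON if location == 'idle' else SWITCH
    if action != 'idle':
        cost += OPERATING
        if queues[action] > 0:
            queues[action] -= 1                  # serve one job this time unit
    for i, lam in enumerate(LAMBDAS):            # Bernoulli arrivals
        if random.random() < lam and queues[i] < CAPACITY:
            queues[i] += 1
    cost += sum(h * q for h, q in zip(HOLDING, queues))
    return cost, queues, action
```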


Deep reinforcement learning took approximately 5 hours, but matches the tabular solution. Although linear learning converged quickly here, we know from Section 3.4.8 that the computational time for linear LSPI grows as O(n^8), with n the number of queues, for our feature vector design.

In the next section, we will cover the experiments on various multi-queue systems. We have excluded linear learning due to the described difficulties and only focus on deep learning.

[Figure 11 grid: columns correspond to the server being at queue 1, at queue 2, or idle; rows to tabular learning, linear learning and deep learning.]

Figure 11: Policies of the dual queue for various reinforcement learning methods. Each complete row of the 3 × 3 grid corresponds to one policy. The columns indicate the location of the server. The y- and x-axes of an image correspond to the length of queue 1 and queue 2, respectively. White, grey and black correspond to the actions of setting the server idle, setting the server at queue 1 and setting the server at queue 2, respectively. (To clarify, the top left image tells us for each length of queue 1 and 2 which action the policy found with tabular learning suggests when the server is idle.) The tabular learning solution is known to be close to optimality. Linear learning finds a policy of somewhat similar form, but with noticeable differences, while deep learning matches the policy quite closely. (Note that γ was set to 0.99 for each algorithm.)
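A policy grid like the one in Figure 11 can be rendered with a few lines of matplotlib; the sketch below assumes a hypothetical helper policy(location, q1, q2) that returns 0 for idling, 1 for serving queue 1 and 2 for serving queue 2.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_policy_row(policy, max_len=20):
    """Plot one row of the Figure 11 grid: the chosen action for every
    combination of queue lengths, for each server location."""
    fig, axes = plt.subplots(1, 3, figsize=(9, 3))
    for ax, loc in zip(axes, ['queue 1', 'queue 2', 'idle']):
        grid = np.array([[policy(loc, q1, q2) for q2 in range(max_len)]
                         for q1 in range(max_len)])
        # 0 -> white (idle), 1 -> grey (queue 1), 2 -> black (queue 2)
        ax.imshow(grid, cmap='gray_r', vmin=0, vmax=2, origin='lower')
        ax.set_title('Server at ' + loc)
        ax.set_xlabel('Queue 2 length')
        ax.set_ylabel('Queue 1 length')
    plt.tight_layout()
    plt.show()
```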

4.3 Multi-queue systems

This section covers our application of reinforcement learning to systems of many queues. We will only focus on deep reinforcement learning, since tabular learning is computationally infeasible (due to large state spaces) and linear features are too difficult to design. Furthermore, directly finding the least-squares fit, which makes linear learning attractive, suffers from an explosive growth in computational time with more queues (see Section 3.4.8).

To benchmark the performance of deep learning, we will therefore use the heuristic of Section 3.3. Recall that this heuristic is optimized through a grid search over the parameters α and β. During training of each neural network, the discount rate is set to γ = 0.995.
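The grid search over α and β could be sketched as follows, assuming a hypothetical simulate_cost_rate(alpha, beta, horizon) that runs the heuristic of Section 3.3 and returns its average cost per time unit:

```python
import itertools

def tune_heuristic(simulate_cost_rate, alphas, betas, horizon=100_000):
    """Evaluate the heuristic for every (alpha, beta) pair and return the
    pair with the lowest simulated cost rate."""
    best_rate, best_params = float('inf'), None
    for alpha, beta in itertools.product(alphas, betas):
        rate = simulate_cost_rate(alpha, beta, horizon)
        if rate < best_rate:
            best_rate, best_params = rate, (alpha, beta)
    return best_params, best_rate
```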


…characteristics of the reinforcement learning solution. These characteristics were observed by visualizing the systems in a graphical user interface.

In the following sections, we will cover systems..

• ..of 4 queues with connections to and from each queue.
• ..of 3 clusters, each containing multiple queues.
• ..of 6 queues with cyclic paths.

The first and second types of systems are fully connected: it is possible to move directly from a given queue to any other queue. The cyclic systems have a high penalty (> K) for straying from the designated paths. Furthermore, in each system the operating cost is o = 5 per time unit. We will cover the systems in the specified order.

Note that for these unlimited buffer size systems, we let ‘freak events’ occur with a probability of 0.1% per time unit. During a freak event, between 30 and 60 arrivals immediately enter a random queue. The distributions of the number of arrivals and of the queue at which they arrive are both uniform.
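A freak event of this kind can be generated as follows (a small sketch; the names are ours):

```python
import random

FREAK_PROB = 0.001  # 0.1% per time unit

def maybe_freak_event(queues):
    """With probability 0.1% per time unit, add a uniformly drawn number of
    30-60 jobs to a uniformly chosen queue (in place)."""
    if random.random() < FREAK_PROB:
        target = random.randrange(len(queues))
        queues[target] += random.randint(30, 60)
    return queues
```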

Lastly, as specified in Section 4.1.2, if the server is idle or sitting at an empty queue, 6 time units pass instead of 1. This compresses the amount of experience that the agent gathers: instead of generating 6 nearly identical experiences, we only generate one experience that lasted 6 time steps. A slight disadvantage of ‘fast forwarding’ is that the agent has to commit to its decision for 6 time steps, which is an acceptable length of time.
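One way to implement this fast forwarding is to fold the six single-step transitions into one experience whose cost is the discounted sum of the per-step costs and whose effective discount factor is γ⁶; this aggregation is our own reading of the description above, and env.step is a hypothetical single-step simulator.

```python
def fast_forward(env, state, action, gamma, steps=6):
    """Aggregate `steps` time units into one experience
    (state, action, total_cost, next_state, effective_gamma)."""
    total_cost, discount, s = 0.0, 1.0, state
    for _ in range(steps):
        cost, s = env.step(s, action)   # hypothetical single-step simulator
        total_cost += discount * cost
        discount *= gamma
    return state, action, total_cost, s, gamma ** steps
```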

4.3.1 System of 4 queues

An overview of the system of 4 queues can be found in Figure 12. There are two types of queues: three with relatively high holding cost parameters on the corner points and one with lower holding costs in the center. Moving to the center queue costs 220, while moving away from it is free.

[Figure 12 diagram: Idle position and queues 1-4; K = 400; k = 200 between corner queues; k = 220 or k = 0 to/from the center, depending on direction; corner queues: h = 0.05, λ = 0.15; center queue: h = 0.025, λ = 0.15.]

Figure 12: System of 4 queues. There are 2 types of queues, one queue with lower holding cost in the center and 3 queues with higher holding cost at the corner points. Moving to the center costs k = 220, while moving away from the center is free. Operating cost o = 5 per time unit (as for all systems).

Note that it costs a bit more to move via the center, but with the additional benefit of being able to clear any jobs that are sitting in the central queue.
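The switching costs of Figure 12 can be summarized in a small function (our own encoding: index 0 is the idle position, index 1 the central queue and indices 2-4 the corner queues; whether switching off carries a cost is not stated here, so it is assumed to be free):

```python
IDLE, CENTER = 0, 1          # our own indexing; 2, 3 and 4 are the corner queues

def switch_cost(i, j):
    """Switching cost from location i to location j for the system in
    Figure 12. Switching the server off is assumed to be free."""
    if i == j:
        return 0
    if i == IDLE:
        return 400           # K: switching the server on
    if j == IDLE:
        return 0             # assumption: switching off costs nothing
    if j == CENTER:
        return 220           # moving towards the central queue
    if i == CENTER:
        return 0             # moving away from the center is free
    return 200               # corner to corner
```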


Although the cost rates are close to each other, the differences in policies are pronounced.

In the policy found by deep learning, the server often moves along the center to clear about half of the queue there and then moves to a corner point. At the corner point, the queue is almost always cleared completely before moving on. From time to time, there are too few jobs in the system and the agent will let the server idle for some time. From an idling position, the server can move to any of the 4 queues at equal cost. A strange phenomenon is that the server will often move to the central queue (with the lowest holding cost) from its idling position. Furthermore, the threshold for the server to become active again is quite a bit lower than with the heuristic policy.

In the heuristic solution, the server usually moves from the idle position to a corner point and it often idles for longer time periods. Furthermore, the heuristic will always clear an entire queue, by design.

The cost rate difference between deep learning and the heuristic is very minor for this setup, and we suspect that setting the discount rate higher would result in better policies from the deep learning algorithm, possibly outperforming the heuristic.

4.3.2 Clusters of queues

Besides the system of 4 queues, we also tested deep learning on a system of 3 clusters; see Figure 13 for an overview. In this setup, there are three clusters containing 2 queues each (this setup is referred to as the ‘3 × 2 cluster’). It costs just 200 to move between queues in the same cluster, and 400 to move between queues in different clusters. In this setup, higher-level decision making is necessary: a good policy should take into account not only the length of a single queue, but also the sum of the queue lengths of a cluster.

[Figure 13 diagram: three clusters of 2 queues each, plus the idle position; K = 600, k = 400 between clusters, k = 200 within a cluster; h = 0.05, λ = 0.1 for all queues.]

Figure 13: Cluster system. Moving between queues that are in the same cluster costs just k = 200, whereas moving between queues that are in different clusters costs k = 400. All queues have identical holding costs and arrival rates. Again, o = 5 per time unit.
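The corresponding switching cost structure can be encoded analogously (again our own indexing, with queues 2c and 2c+1 forming cluster c, and switching off assumed to be free):

```python
CLUSTER_IDLE = -1            # our own marker for the idle position

def cluster_switch_cost(i, j):
    """Switching cost for the 3x2 cluster system of Figure 13.
    Queues are numbered 0-5; switching the server off is assumed free."""
    if i == j:
        return 0
    if i == CLUSTER_IDLE:
        return 600           # K: switching the server on
    if j == CLUSTER_IDLE:
        return 0             # assumption
    return 200 if i // 2 == j // 2 else 400
```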


However, we can see from the estimates of the Q-values that the neural network does indeed take into account high-level information such as the sum of the queue lengths of a cluster. It will give a lower estimate of the cost of moving to a queue if the other queue in the cluster is long as well. The heuristic is not programmed to take into account such high-level information.

This indicates that, without basic ‘mistakes’ such as abandoning a queue or moving to the next queue without clear benefit, the neural network has the potential to outperform the heuristic. We suspect that if the network is trained for a longer period or with more stable convergence (a lower value of τ in Algorithm 4), we could further reduce the cost rate.

Besides the 3×2 cluster, we also trained a network on a 3×3 cluster with similar parameters. However, the network did not converge properly. For unknown reasons, it rarely visits one of the queues, even when that queue is very long. Perhaps the network needs more nodes in the hidden layers to properly learn about a system of this size.

We conclude that both cluster systems converged somewhat poorly, with very basic mistakes still present in the policies. As we will see in the next section, the cyclic systems did converge properly. A major difference between the cyclic and cluster systems is that there is a much larger number of connections between the queues in the cluster systems. This results in many more ‘good’ actions from any given location. Since many more actions are adequate, it perhaps takes longer to find out which of them is optimal.

4.3.3 Cyclic system of queues

Lastly, we investigated two cyclic systems of 6 queues. See Figure 14 for an overview. In this setup, 2 of the 6 queues have a significantly higher arrival rate (×4) and holding cost (×3) and are situated at opposing corners of a hexagon. Moving in clockwise direction over this hexagon costs 50 per move. It is also possible to move in counter-clockwise direction, but at a higher cost of 100. The cost-efficient route along the cycle is thus in clockwise direction.

Furthermore, there are two shortcuts in this system. Moving along a shortcut costs next to nothing (just 1), but notice that only one direction is allowed. Any connection that is not pictured has an associated switching cost of k = 500 > K, so it is never beneficial to choose such a switch.

This configuration is referred to as the ‘original’ cyclic system. We have also investigated a simple variant of this system: the ‘reversed’ cyclic system. In the reversed system, it costs k = 50 to move in counter-clockwise direction and k = 100 to move in clockwise direction. This means that the cost-efficient movement along the cycle is reversed. The shortcuts remain in their respective places with their original direction.
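The switching costs of the original cyclic system can be encoded as follows; the positions of the two shortcuts are only shown in Figure 14, so they are passed in as a parameter, and the idle position is left out for brevity. For the reversed system, the clockwise and counter-clockwise costs are simply swapped.

```python
def cyclic_switch_cost(i, j, shortcuts, n=6, clockwise=50, counter=100):
    """Switching cost between queues on the hexagon of Figure 14.
    `shortcuts` is the set of one-directional shortcut pairs (i, j)."""
    if i == j:
        return 0
    if (i, j) in shortcuts:
        return 1             # shortcuts cost next to nothing
    if j == (i + 1) % n:
        return clockwise     # clockwise neighbour
    if j == (i - 1) % n:
        return counter       # counter-clockwise neighbour
    return 500               # not pictured: never beneficial (> K)
```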
