Reinforcement learning for routing in communication networks

Walter H. Andrag

Thesis presented in partial fulfilment of the requirements for the degree of Master of Science at the University of Stellenbosch

Supervisor: Prof Christian W. Omlin

April 2003

I, the undersigned, hereby declare that the work contained in this thesis is my own original work and has not previously in its entirety or in part been submitted at any university for a degree.

Routing policies for packet-switched communication networks must be able to adapt to changing traffic patterns and topologies. We study the feasibility of implementing an adaptive routing policy using the Q-Learning algorithm, which learns sequences of actions from delayed rewards. The Q-Routing algorithm adapts a network's routing policy based on local information alone and converges toward an optimal solution. We demonstrate that Q-Routing is a viable alternative to other adaptive routing methods such as Bellman-Ford. We also study variations of Q-Routing designed to better explore possible routes, to take limited buffer size into consideration and to optimize multiple objectives.

Routing in communication networks must be able to adapt to changes in network topology and traffic distributions. We study the usefulness of an adaptive routing algorithm based on the Q-Learning algorithm, which makes it possible to take a sequence of decisions based on delayed rewards. The routing algorithm uses only local information to make routing decisions and converges to an optimal solution. We demonstrate that the routing algorithm is a good alternative for adaptive routing, since in many respects it fares better than the Bellman-Ford algorithm. We also study variations of the routing algorithm that can discover better paths, use less memory at network elements, and optimize more than one objective function.

I would like to sincerely thank my supervisor, Prof. C. W. Omlin, for all the inspiration, assistance and funding he provided.

This work was also made possible by funding from the South African National Research Foundation, the Telkom-Siemens Centre of Excellence for ATM and Broadband Networks and their Applications and the Harry Crossley Scholarship Fund.

1 Introduction
1.1 Motivation
1.2 Problem Statement
1.3 Premises
1.4 Hypotheses
1.5 Technical Objectives
1.6 Methodology
1.7 Achievements
1.8 Thesis Organization

2 Routing in Communication Networks
2.1 The Routing Problem
2.1.1 Performance Criterion
2.1.2 Decision Time
2.1.3 Decision Place
2.1.4 Network Information Source
2.1.5 Routing Information Update Timing
2.2 Conventional Routing Strategies
2.2.1 Flooding
2.2.2 Random Routing
2.2.3 Fixed Routing
2.2.4 Adaptive Routing
2.2.5 Link-State Routing
2.2.6 Distance-Vector Routing
2.3 Mobile Agents
2.3.1 Active Networks
2.3.2 Social Insect Metaphors
2.4 Summary

3 Reinforcement Learning
3.1 Value Functions
3.2 Temporal-Difference Learning
3.3 Q-Learning
3.4 TD(λ) Learning
3.5 Q(λ) Learning
3.6 Convergence Properties of Q-Learning
3.7 Exploration vs Exploitation
3.8 Summary

4 Q-Learning for Traffic Routing
4.1 Optimization of Packet Delivery Time
4.1.1 Q-Routing
4.1.2 DRQ-Routing
4.1.5 Probabilistic CDRQ-Routing
4.2 Finite Buffer Size
4.3 Optimization of Multiple Objectives
4.4 Summary

5 Conclusion
5.1 Conclusion
5.2 Future Work
5.2.1 Realistic Simulations
5.2.2 Improved Routing

Table 2.1 Design elements of a routing strategy
Table 4.1 The parameters used in the simulations

Figure 3.1 The agent-environment interaction.
Figure 3.2 Estimating Vπ with TD(0).
Figure 3.3 Estimating Q* with Q-Learning.
Figure 3.4 Estimating Vπ with TD(λ).
Figure 3.5 Watkins's Q(λ) algorithm.
Figure 4.1 The British Synchronous Digital Hierarchy (SDH) network topology.
Figure 4.2 Average packet delivery times for network load 1.2 for the SDH network topology.
Figure 4.3 Average packet delivery times for network load 2.2 for the SDH network topology.
Figure 4.4 Average packet delivery times for network load 3.2 for the SDH network topology.
Figure 4.5 Average packet delivery times of Bellman-Ford for high network load for the SDH network topology. Error bars show standard deviations.
Figure 4.6 Average packet delivery times of Q-Routing for high network load for the SDH network topology. Error bars show standard deviations.
Figure 4.7 Comparing the average packet delivery times of Q-Routing and DRQ-Routing for network load 2.0 for the SDH network topology.
Figure 4.8 Comparing the average packet delivery times of Q-Routing and DRQ
Figure 4.10 Comparing the average packet delivery times of Q-Routing and CQ
Figure 4.13 Comparing the average packet delivery times of Q-Routing, CQ-Routing, DRQ-Routing and CDRQ-Routing for network load 2.0 for the SDH network topology.
Figure 4.16 The variance function of Equation 36 for β of 0.2, 0.4, 0.6 and 0.8.
Figure 4.17 The Average Packet Delivery Time for the SDH network for network load 1.5; β of 0.2, 0.4, 0.6 and 0.8.
Figure 4.18 The Average Packet Delivery Time for the SDH network for network load 3.0; β of 0.2, 0.4, 0.6 and 0.8.
Figure 4.19 The Average Packet Delivery Time for the SDH network for network load 4.5; β of 0.2, 0.4, 0.6 and 0.8.
Figure 4.20 The Congestion Risk of Equation 38 for e of 3, 6 and 15.
Figure 4.21 The 13 node network topology used for the finite buffer simulation.
Figure 4.22 Average packet delivery time for low load.
Figure 4.23 Number of packets dropped for low load.
Figure 4.26 Average packet delivery time for high load.
Figure 4.27 Number of packets dropped for high load.
Figure 4.28 The network topology for the 36 node grid.
Figure 4.29 The average packet delivery time for single versus multiple objective optimization for the 36 node grid for differing α.
Figure 4.30 Details of the steady state behaviour of Figure 4.29.
Figure 4.31 The average cost for single versus multiple objective optimization for the 36 node grid for differing α.
Figure 4.32 The average packet delivery time for single versus multiple objective optimization for the BT SDH network for differing α.
Figure 4.33 Details of the steady state behaviour of Figure 4.32.
Figure 4.34 The average cost for single versus multiple objective optimization for the BT SDH network for differing α.
Figure 4.35 The average saving of multiple objective optimization of cost and delivery time for the BT SDH network versus α.

1 Introduction

1.1 Motivation

Modern communication networks must cope with ever-increasing demands on network resources. The range of services offered leads to both regular and less predictable traffic patterns. Adaptive routing is able to respond to changing traffic patterns and topology, thus providing efficient use of network resources. In networks characterized by a constantly changing topology, adaptive routing is essential. Adaptation may be necessary in traditional networks due to failures of links or nodes; in mobile ad-hoc networks, mobile routers are able to move randomly, thus constantly and unpredictably changing the network topology.

In order to adapt routing to changing network conditions, a centralized routing strategy needs information about the status of all nodes and links in the network. However, this information transmission overhead consumes valuable network resources. This highlights the need to make distributed routing decisions based on locally available information only.

1.2 Problem Statement

A packet-switched communication network can be modeled as a set of nodes and inter-connecting links. Data is exchanged over these communication links as a sequence of packets. In general, nodes are not fully connected; thus, the packets must pass through intermediate nodes. The route is the sequence of nodes along which a packet travels to its final destination. In most networks, there may be more than one route between pairs of nodes. The routing problem consists of finding the optimal route between source and destination nodes, where the optimal route is the one that delivers packets to their final destination in the shortest time possible.

1.3 Premises

The premises of the packet routing domain which we believe make adaptive routing indispensable are as follows:

1. A network is a highly dynamic environment in which traffic patterns may be unpredictable and links or nodes may fail.

2. A central routing mechanism which has global information about the state of the network is generally not feasible because of the overhead involved.

3. Thus, we need a good routing policy which
(a) uses only local information and
(b) minimizes average packet delivery time.

1.4 Hypotheses

Machine learning covers a broad field of methods concerned with the ability of programs to learn from experience, thereby improving their performance. We wish to test the following hypotheses in this thesis:

1. Machine learning is a viable alternative to static routing because
(a) it can adapt to changing environments, i.e. changes in traffic patterns or network topology;

2. Reinforcement learning is a field in machine learning concerned with programs taking optimal action sequences so as to achieve a goal. Reinforcement learning is well-suited for adaptive routing because
(a) it is goal-oriented, i.e. actions are to be learned with a desired outcome. The goal of routing is to deliver packets with minimum delay.
(b) Reinforcement learning allows a policy to be acquired, i.e. a sequence of actions that leads to a desired outcome. Actions in routing are the propagation of packets to a neighbouring node, and the desired outcome is the delivery of packets to their intended final destination.
(c) Reinforcement learning algorithms are able to learn policies from delayed rewards. We only know whether a good route was chosen for a packet once it has reached its destination.
(d) Reinforcement learning algorithms can adapt to changes in the environment. As traffic patterns change, or nodes or links fail, a routing policy will have to adapt.

1.5 Technical Objectives

The objectives we set out to achieve in our investigation are as follows:

1. Implement a distributed adaptive packet routing algorithm which minimizes the average packet delivery time, while using only locally available information.

2. Compare the performance of the adaptive routing policy to standard routing algorithms.

3. Improve the adaptive routing mechanism to discover new routes.

4. Extend and test the algorithms under more realistic scenarios including nodes with finite buffer size and optimization of multiple objectives.

1.6 Methodology

1. Q-Learning is a reinforcement learning algorithm that is able to learn an optimal sequence of actions in an environment which maximizes rewards received from the environment. Q-Routing is an implementation of Q-Learning, which is able to distributively route packets in a network. Each node is able to make a routing decision using only locally available information. The reward received is the packet delivery time; thus, the goal is to minimize the average delivery time of all packets.

2. We will empirically evaluate the routing algorithms by simulation and compare their performance under different traffic loads and network topologies. We will use the average packet delivery time and the average number of dropped packets as the performance measure.

3. We will investigate possible performance improvements to Q-Routing designed to increase the exploration ability of the algorithm, which enables the discovery of new routes in the network. This improvement is made possible by adding a probabilistic component to routing decisions.

4. We will consider more realistic network scenarios: networks with finite buffers and networks where there are multiple objectives to be optimized:
(a) Previous work explored the performance of Q-Routing in networks with infinite packet buffers. We will examine the more realistic case of finite buffers. Congestion control is achieved by avoiding nodes with a high level of congestion.
(b) We will also examine the performance of a new routing algorithm, able to optimize multiple possibly conflicting objectives, e.g. packet delivery time versus cost. We examine the implications of this trade-off.
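To make item 1 of the methodology more concrete, the following is a minimal sketch of a per-node Q-Routing table. This section does not state the update rule, so the sketch assumes the form commonly attributed to Boyan and Littman, in which a node moves its estimate toward the observed queueing delay plus transmission delay plus the neighbour's own best estimate. The class name `QRouter`, its method names and the learning rate `eta` are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


class QRouter:
    """Per-node table: q[dest][neighbour] estimates the delivery time to
    dest when the packet is forwarded to that neighbour (illustrative)."""

    def __init__(self, node, neighbours, eta=0.5):
        self.node = node
        self.neighbours = list(neighbours)
        self.eta = eta  # learning rate; the value is an assumption
        # Unknown destinations start with an estimate of 0.0 per neighbour.
        self.q = defaultdict(lambda: {n: 0.0 for n in self.neighbours})

    def best_estimate(self, dest):
        """Smallest estimated remaining delivery time from this node."""
        if dest == self.node:
            return 0.0
        return min(self.q[dest].values())

    def choose_next_hop(self, dest):
        """Greedy routing decision using only the local table."""
        return min(self.neighbours, key=lambda n: self.q[dest][n])

    def update(self, dest, neighbour, queue_delay, link_delay,
               neighbour_estimate):
        """Move Q(dest, neighbour) toward the newly observed estimate."""
        target = queue_delay + link_delay + neighbour_estimate
        old = self.q[dest][neighbour]
        self.q[dest][neighbour] = old + self.eta * (target - old)
        return self.q[dest][neighbour]
```

In this sketch a node only needs the estimate reported back by the neighbour it just forwarded to, which is what makes the decision local: `update` would be called once per forwarded packet, and `choose_next_hop` once per routing decision.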

1.7 Achievements

We have learned the following from our investigation:

1. Q-Routing is able to route packets minimizing the average delay while only using local information.

2. Q-Routing compares well with standard routing algorithms. In particular, it converges to a more stable routing policy than our version of the distributed Bellman-Ford algorithm.

3. We demonstrated the efficiency of new path discovery of a proposed algorithm based on probabilistic exploration.

4. Two extended algorithms were also shown to perform well in more realistic network scenarios:
(a) In networks with limited buffers, the algorithm is able to perform congestion control, dropping fewer packets at overloaded nodes.
(b) An improved routing algorithm is also able to optimize multiple objectives. However, competing objectives may give rise to a trade-off.

1.8 Thesis Organization

In Chapter 2, we discuss the routing problem in more detail and look at some of the approaches used to solve it. Chapter 3 discusses the field of reinforcement learning, presenting techniques of solving reinforcement learning problems. In Chapter 4, we present the simulation results of the comparison between different routing algorithms by evaluating performance under various scenarios. Conclusions and directions of future research are presented in Chapter 5.

2 Routing in Communication Networks

In this chapter, we examine the routing problem and investigate different approaches that have been proposed for solving it. We define the routing problem, and discuss the general requirements of routing algorithms. Network routing is very complex; thus, we discuss some of the characteristics that differentiate between different routing algorithms.

2.1 The Routing Problem

We consider a communication network [27; 15] as an undirected weighted graph G = (N, L) with a set of nodes N and a set of bidirectional links L connecting the nodes. Each link has a capacity and a user-defined associated cost. We define a path as a sequence of nodes connecting a source to a destination node. There may be multiple paths between sources and destinations. The general routing problem consists of finding the optimal path between source and destination nodes satisfying some performance criterion.

We will discuss the routing problem in the context of packet switching. In a packet-switched network, data is broken up into a sequence of packets which are sent from node to node until the destination is reached. The routing decision at each node consists of deciding to which neighbouring node to send a packet.
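The graph model above can be sketched as a small data structure. This is an illustration only: the names `Link`, `Network`, `add_link`, `neighbours` and `path_cost` are hypothetical, and summing link costs along a path is just one possible performance criterion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    capacity: float  # e.g. packets per time unit
    cost: float      # user-defined cost of using the link


class Network:
    """Undirected weighted graph G = (N, L) with bidirectional links."""

    def __init__(self):
        self.links = {}  # frozenset({u, v}) -> Link
        self.adj = {}    # node -> set of neighbouring nodes

    def add_link(self, u, v, capacity, cost):
        # A frozenset key makes the link identical in both directions.
        self.links[frozenset((u, v))] = Link(capacity, cost)
        self.adj.setdefault(u, set()).add(v)
        self.adj.setdefault(v, set()).add(u)

    def neighbours(self, node):
        """Candidate next hops for a routing decision at this node."""
        return sorted(self.adj.get(node, ()))

    def path_cost(self, path):
        """Sum of link costs along a path given as a sequence of nodes.
        Raises KeyError if two consecutive nodes are not linked."""
        return sum(self.links[frozenset(pair)].cost
                   for pair in zip(path, path[1:]))
```

With links A-B (cost 1), B-C (cost 2) and A-C (cost 5), the two paths from A to C cost 3 and 5 respectively, so under a minimum-cost criterion the two-hop path is preferred over the direct link.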

A routing algorithm has the following requirements [27]:

• Correctness
• Simplicity
• Efficiency
• Robustness
• Stability
• Fairness
• Optimality

The correctness of a routing algorithm refers to the fact that it must route all packets to the correct destinations. Simple routing algorithms are also preferred, as they have less routing overhead, which in turn increases the efficiency of the network. All packet routing schemes have a certain amount of processing and transmission overhead, which may negatively impact the efficiency of the network. The benefits of this overhead must be weighed against the decrease in efficiency it causes.

Some of these requirements are in competition with each other, e.g. robustness and stability. A routing algorithm is said to be robust when it is able to adapt to node or link failures and changes in network load conditions. When an overload is detected in a section of the network, traffic is rerouted to less congested regions. If the routing algorithm responds too quickly, these less congested regions will in turn become congested. The routing algorithm is called unstable if it continually shifts the load between different sections of the network. On the other hand, if the network adapts too slowly, packets may be dropped at congested nodes.

There also exists a trade-off between optimality and fairness: if a certain performance criterion favours the exchange of packets between nearby nodes, the throughput may be increased. This may appear unfair to nodes with a high proportion of long-distance traffic.

We briefly discuss the various design elements that contribute to a routing strategy as presented in [27] (see Table 2.1).

Performance criterion: number of hops; cost; delay; throughput
Decision time: packet; session
Decision place: each node (distributed); central node (centralized); originating node (source)
Network information source: none; local; adjacent nodes; nodes along route; all nodes
Network information update timing: continuous; periodic; major load change; topology change

Table 2.1: Design elements of a routing strategy

2.1.1 Performance Criterion

A routing policy has to decide to which neighbouring node to forward a packet based on some performance criterion. The simplest choice is to select the neighbour which is on the minimum-hop path to the packet's destination. A more general approach is to assign a link cost to each link and to select the minimum-cost path. The specific cost metric used determines the optimal path. If the link cost is inversely proportional to the link capacity, the least-cost path maximizes the throughput, whereas it minimizes the average packet delay when the link cost is the measured link delay. Other possible cost metrics are reliability, load and communications cost. The metric can also be a combination of several performance criteria, i.e. the optimal route over multiple objectives.

2.1.2 Decision Time

The decision time of a routing decision distinguishes two types of packet-switched networks. In a datagram packet switching network, each node makes a routing decision for each incoming packet. However, there is another approach, called virtual-circuit packet switching, where the routing decision is made only once per session. If a source node wants to communicate with a destination node, a virtual circuit between source and destination is established. After the connection has been set up, each node selects the neighbour based on the virtual-circuit identifier. Thus, all subsequent packets of a session will follow the same route through the network.
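The difference between the two decision times of Section 2.1.2 can be sketched as follows. The class `Node` and its methods are hypothetical names used only for illustration; a real virtual-circuit setup involves signalling along the whole path, which is omitted here.

```python
class Node:
    """Toy forwarding node supporting both decision times (illustrative)."""

    def __init__(self, name, routing_table):
        self.name = name
        # dest -> next hop; an adaptive policy may change this over time.
        self.routing_table = dict(routing_table)
        # vc_id -> next hop; fixed when the session is set up.
        self.vc_table = {}

    def forward_datagram(self, dest):
        """Datagram: a fresh routing decision for every packet."""
        return self.routing_table[dest]

    def setup_circuit(self, vc_id, dest):
        """Virtual circuit: the routing decision is made once per session."""
        self.vc_table[vc_id] = self.routing_table[dest]
        return self.vc_table[vc_id]

    def forward_on_circuit(self, vc_id):
        """Later packets are forwarded by circuit identifier only."""
        return self.vc_table[vc_id]
```

If the routing table changes after a circuit has been set up, datagrams follow the new next hop immediately, while packets on the existing circuit keep following the route chosen at setup.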

2.1.3 Decision Place

The decision place refers to where routing decisions are made. In centralized routing, there is a central control node which collects information from the network and computes routing tables which are distributed to all nodes. The problem with this approach is that the controlling node is a single point of failure. Distributed routing algorithms make routing decisions at each node; thus, they are more robust. In source routing algorithms, the originating node selects the route through the network.

2.1.4 Network Information Source

Most routing algorithms utilize some information about the network topology, traffic load or link cost. Distributed routing may utilize information available locally to the node such as the cost of each link. Nodes may also make routing decisions based on information from neighbouring nodes, or all nodes on a path. Centralized routing makes use of information from all nodes. Some algorithms do not use any network state information, e.g. flooding and random routing.

2.1.5 Routing Information Update Timing

If the routing strategy uses locally available information, routing updates are continuous. For all other strategies that make use of network information, routing information updates are made periodically in order to adapt to changing network conditions. The accuracy of information depends on how frequently the information is updated. Thus, with more accurate information, better routing decisions are made. However, informa

2.2 Conventional Routing Strategies

Network routing is a very complex problem and many different approaches to solving it have been proposed. We briefly discuss some of the routing strategies used, ranging from the simple to the more complex adaptive routing strategies.

2.2.1 Flooding

Flooding [27] is a simple routing strategy whereby each node forwards a packet to each of its neighbours, except the node where the packet came from. Nodes do not need any information about the network topology beyond their immediate neighbours. Packets need a sequence number and the destination node embedded in their headers so that a destination node can discard duplicate packets. Forwarded packets which return to a previously visited node must also be discarded; otherwise, the number of packets in circulation will increase without bound. Another way to accomplish this is for each packet to carry a hop count which is incremented at each node; the packet is discarded when a predetermined limit is reached.

Since all possible routes between source and destination are tried, a packet is guaranteed to reach the destination if it is reachable; thus, flooding is very robust. It has been used in military networks where link or node failures may frequently occur [15]. Another property of flooding is that at least one packet will travel along the shortest route. This may be used in some networks to set up virtual circuits. Because all nodes directly or indirectly connected to the source node are visited, flooding can be used to distribute important information (e.g. routing information) to all nodes. The biggest disadvantage of flooding is of course the high level of network bandwidth that is wasted on duplicate packets.
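The flooding rules above can be sketched as a small simulation. It is a simplification under stated assumptions: every node (not only the destination) remembers the packet and discards later copies, all links take one time step, and the function name `flood` and its return values are hypothetical.

```python
from collections import deque


def flood(adj, source, dest, hop_limit):
    """Flood one packet from source through the adjacency mapping adj.

    Returns (hops, transmissions): the hop count of the first copy to
    reach dest (None if no copy arrives) and the number of link
    transmissions used, including those wasted on duplicates."""
    seen = {source}                     # nodes that already handled the packet
    queue = deque([(source, None, 0)])  # (node, previous node, hop count)
    transmissions = 0
    first_arrival = None
    while queue:
        node, prev, hops = queue.popleft()
        if node == dest:
            first_arrival = hops        # later copies are discarded below
            continue
        if hops >= hop_limit:
            continue                    # hop limit reached: discard
        for nb in adj[node]:
            if nb == prev:
                continue                # never send back where it came from
            transmissions += 1
            if nb in seen:
                continue                # duplicate discarded on arrival
            seen.add(nb)
            queue.append((nb, node, hops + 1))
    return first_arrival, transmissions
```

On a four-node ring A-B-D-C-A, flooding from A to D delivers the first copy after two hops, the length of the shortest route, but uses four transmissions, one of which is a duplicate that D discards.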

2.2.2 Random Routing

Another simple, robust routing strategy is that of random routing [27], where each node randomly selects the node to forward a packet to, excluding the node where the packet came from. Although this strategy will in general not select the shortest path, it generates less traffic than flooding. A refinement of this technique is to select an outgoing link with a probability proportional to the data rate of the link. This strategy attempts to ensure a good traffic distribution.
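The data-rate refinement can be sketched as a weighted random choice. The function name `choose_next_hop` and the dictionary of link data rates are assumptions made for this illustration only.

```python
import random


def choose_next_hop(links, came_from, rng=random):
    """Pick an outgoing neighbour with probability proportional to link data rate.

    links maps neighbour -> data rate of the link to it (assumed format).
    The node the packet came from is excluded; None is returned when no
    other neighbour exists.
    """
    candidates = [(n, rate) for n, rate in links.items() if n != came_from]
    if not candidates:
        return None
    neighbours, rates = zip(*candidates)
    return rng.choices(neighbours, weights=rates, k=1)[0]


rng = random.Random(0)
links = {"B": 10.0, "C": 30.0, "D": 60.0}
picks = [choose_next_hop(links, "B", rng) for _ in range(1000)]
print(picks.count("C"), picks.count("D"))
```

Over many packets the faster link to `D` is chosen roughly twice as often as the link to `C`, which spreads traffic in proportion to capacity.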

2.2.3 Fixed Routing

Fixed routing, also called static shortest path routing, computes least-cost paths for all origin-destination node pairs in the network. From these fixed paths, routing tables are computed and sent to each node. As the least-cost paths are computed once, the link costs cannot be based on dynamic variables such as traffic. Instead, the network is designed based on an anticipated traffic distribution. Fixed routing is simple and it is very effective in reliable networks with stable load. The disadvantage is that it does not react to congestion, node failures or unforeseen traffic patterns.

2.2.4 Adaptive Routing

In order to increase efficiency, adaptive routing methods dynamically alter routes when node or link failures are detected or when congestion develops. For a network to adapt to these changes, it needs to collect and exchange network state information between nodes, such as delay or throughput [26]. The optimality of the new routes depends on the quality of the network information, which necessitates an increased information exchange. However, there exists a trade-off between the quality of information and the overhead: overhead consumes network resources, which may degrade the overall network performance. A serious problem with adaptive routing is that it may become unstable if a routing policy reacts too quickly to congestion [15; 27; 14]. If the adaptive routing redirects most traffic away from the congested part of the network, congestion may develop elsewhere; thus, traffic will again shift to a different part of the network. This oscillation will continue indefinitely if not properly managed by the routing algorithm. As it takes time for the network information to reach relevant nodes, there is never a true picture of the network state. Temporary routing loops [11; 7] can develop, where packets circulate through the network until all nodes have consistent routing tables. This looping wastes bandwidth and increases delay.

Although adaptive routing is complex, it is widely used as it improves the network performance, and helps in congestion control.

2.2.5 Link-State Routing

Link-state routing [26] is a distributed, adaptive routing algorithm where each node maintains a view of the whole network topology with a cost for each link. To update their view of the current network state, nodes regularly broadcast the link costs of outgoing links to all other nodes using flooding. Each node uses its view to calculate the shortest paths to all destinations with Dijkstra's algorithm. Each node needs storage space proportional to $O(N^2)$, where $N$ is the number of nodes in the network.
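The per-node computation in link-state routing can be sketched with Dijkstra's algorithm. The nested-dictionary representation of the topology view and the function name `dijkstra` are assumptions for this illustration, and the link costs in the example are made up.

```python
import heapq


def dijkstra(link_costs, source):
    """Shortest distances and first hops from source over a full topology view.

    link_costs maps node -> {neighbour: cost} (assumed format, costs > 0).
    Returns (dist, next_hop), where next_hop[d] is the neighbour of source
    on a least-cost path to destination d.
    """
    dist = {source: 0.0}
    next_hop = {}
    done = set()
    heap = [(0.0, source, None)]   # (distance, node, first hop from source)
    while heap:
        d, node, first = heapq.heappop(heap)
        if node in done:
            continue               # stale heap entry
        done.add(node)
        if first is not None:
            next_hop[node] = first
        for neighbour, cost in link_costs[node].items():
            candidate = d + cost
            if neighbour not in done and candidate < dist.get(neighbour, float("inf")):
                dist[neighbour] = candidate
                hop = neighbour if node == source else first
                heapq.heappush(heap, (candidate, neighbour, hop))
    return dist, next_hop


costs = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2, "D": 5},
    "C": {"A": 4, "B": 2, "D": 1},
    "D": {"B": 5, "C": 1},
}
dist, next_hop = dijkstra(costs, "A")
print(dist, next_hop)
```

The `next_hop` dictionary is what a node would install as its routing table; it is recomputed whenever a flooded link-cost update changes the view.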

Open Shortest Path First (OSPF) is the link-state routing protocol used in the Internet [11]. Instabilities are avoided by disseminating the link cost information quickly, and by representing the link costs by a slowly changing measure of average link utilization [26; 27]. Rapid link cost dissemination can be achieved if routing packets have higher priority than data packets. Routing loops are still possible, but since they disappear in time proportional to the diameter $D$ of the network, they are short-lived.

2.2.6 Distance-Vector Routing

Distance-vector routing is another distributed, adaptive routing approach based on the Bellman-Ford algorithm [10; 26]. Each node maintains a set of distances to all destinations via each of its neighbours. Thus, the storage needed at each node is proportional to $O(N \times e)$, where $e$ is the average number of neighbours of each node in the network. Each node routes an incoming packet to the neighbour with the minimum distance to the destination. Nodes update their distance tables by exchanging distance-vectors with their neighbours. The distance-vector a node transmits consists of the current shortest distance from that node to each destination. Upon receiving a distance-vector, a node computes a new distance table by selecting the minimum between the current and received shortest distances. If the distance table changes, the node will again broadcast its newly computed distance-vector to all neighbours. This asynchronous update mechanism converges to the shortest distances for all connected pairs of nodes [7].
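The exchange of distance-vectors can be sketched as follows. This is a simplified illustration, not the protocol itself: it runs in synchronous rounds rather than asynchronously, it stores only the best distance and next hop per destination instead of one distance per neighbour, and it assumes a connected network with positive link costs. The function name and data layout are hypothetical.

```python
INF = float("inf")


def distance_vector_routing(link_costs):
    """Repeat distance-vector exchanges until no distance table changes.

    link_costs maps node -> {neighbour: cost} (assumed format).
    Returns (dist, next_hop) with dist[n][d] the shortest known distance
    from n to d and next_hop[n][d] the neighbour used to reach d.
    """
    nodes = list(link_costs)
    dist = {n: {d: (0.0 if d == n else INF) for d in nodes} for n in nodes}
    next_hop = {n: {} for n in nodes}
    changed = True
    while changed:
        changed = False
        # Snapshot of the vector every node transmits in this round.
        vectors = {n: dict(dist[n]) for n in nodes}
        for n in nodes:
            for neighbour, cost in link_costs[n].items():
                for d, d_via in vectors[neighbour].items():
                    candidate = cost + d_via
                    if candidate < dist[n][d]:   # keep the minimum distance
                        dist[n][d] = candidate
                        next_hop[n][d] = neighbour
                        changed = True
    return dist, next_hop


costs = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2, "D": 5},
    "C": {"A": 4, "B": 2, "D": 1},
    "D": {"B": 5, "C": 1},
}
dist, next_hop = distance_vector_routing(costs)
print(dist["A"]["D"], next_hop["A"]["D"])
```

Each node only ever reads its neighbours' vectors, never the full topology, which is the key difference from the link-state approach.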


The original ARPANET used the distributed Bellman-Ford algorithm; however, it was replaced in 1979 by a brute-force link-state algorithm because of several drawbacks [27; 7]. It was found to react slowly to failures and link cost changes. The problem is that the distances exchanged between nodes may contain paths with loops. The looping of packets wastes bandwidth and is called the bouncing effect. If the network is disconnected, the algorithm does not even terminate; this is also referred to as the counting-to-infinity problem. Mechanisms to overcome these problems have been proposed which use various node coordination techniques, diffusing computations and maintaining only loop-free paths [7; 11; 1; 26]. These techniques all eliminate long-lived loops, and some also eliminate short-lived loops. However, these techniques all have increased communication overhead to differing degrees.

2.3 Mobile Agents

As the network and its traffic are a highly dynamical system, it has been argued that mobile software agents are a good approach for adaptive routing in such a complex, inherently distributed environment [16; 6]. The use of multiple cooperating agents may facilitate a high level of availability, adaptability and fault-tolerance in modern communication networks. Mobile agents may also prove useful in design, abstracting the interactions between entities in a complex system.

2.3.1 Active Networks

The new approach of active networks enables nodes to execute custom code embedded in packets. This allows packets to route themselves and perform computations at network nodes on the route [31; 16]. In addition to routing, this approach also allows flexible incorporation of new services into a network without the need to redesign the network infrastructure [31].

The chief problems facing active networks are ensuring the security and scalability of the networks. Before executing mobile code, the node must trust the code. One way of doing this is with Proof-Carrying Code (PCC) [22]. The mobile code includes a formal proof of its properties, which the processing node can verify. The question is whether the increased flexibility justifies the extra overhead of per-packet execution, and how well this paradigm scales to very large networks.

2.3.2 Social Insect Metaphors

Ant-colony optimization is a method of solving combinatorial optimization problems inspired by the foraging behaviour of ants [6]. In nature, ants are able to find the shortest distance to a food source by laying trails of pheromones. A large collection of ants cooperates on a task by this indirect form of communication through the environment, called stigmergy.

Adaptive distributed route discovery is performed by artificial software ants that explore the network [25; 6]. Throughout the network, ants are launched to randomly selected destination nodes. These ants share the queues at nodes with data packets, and record the experienced delay which is used for updating the routing tables. Each ant can be thought of as performing a single Monte Carlo experiment on the actual network, and the result is the experienced delay. The system as a whole performs parallel Monte Carlo experiments with exploration biased towards more useful regions of the state space [6].

The resulting routing is very robust as it does not depend on individual ants, but rather on the collective behaviour of the entire ant colony.

2.4 Summary

The aim of packet-switched networks is to make more efficient use of network resources by forwarding packets between nodes in a hop-by-hop fashion. The routing decision at each node consists of deciding which neighbour to send a packet to. We discussed the simple routing strategies of flooding, random routing and fixed routing.

Adaptive routing increases the efficiency of a network by redirecting traffic away from congested areas or dynamically changing routes in networks characterized by a constantly changing topology. Adaptive routing strategies have to avoid oscillations in the network which arise if they adapt too quickly to congestion. We discussed the two distributed adaptive routing approaches, link-state routing and distance-vector routing.

Mobile software agents may prove helpful in managing the complexity of distributed, dynamic networks. We discussed the potential of active networks, where packets route themselves by executing code on a router. The emergent behaviour exhibited by ant colonies also offers valuable insight into optimization of a complex dynamical system. Promising results have already been obtained by routing based on a collection of simple ant-like software agents.


Reinforcement Learning

A broad range of learning problems can be cast into the reinforcement learning framework [13; 20]. Broadly stated, reinforcement learning is the problem of learning to achieve a goal through interaction in a dynamic environment. The learning entity which is responsible for taking actions is called an agent. The agent continually interacts with the environment by taking actions, and receiving rewards and state information, as shown in Figure 3.1. The goal of the agent is to experiment with different action sequences in order to maximize the reward received over time. An important aspect of reinforcement learning algorithms is that they are able to learn from delayed rewards. In some problems, an agent has to execute a specific sequence of actions before it receives a reward. To learn such a sequence, an agent has to overcome the problem of temporal credit assignment, i.e. an agent has to decide which states in the action sequence were responsible for the received reward. Reinforcement learning algorithms are therefore concerned with finding, through trial-and-error interactions with an environment, the sequence of actions that maximizes the received reward over time.

[Figure 3.1: The agent-environment interaction. The agent sends an action to the environment; the environment returns a state and a reward to the agent.]

Reinforcement learning algorithms differ from supervised learning algorithms in that they are not trained on input/output pairs specifying which action is the best at each state. Instead, they are guided to the goal by the rewards received. In other words, the reward received after each action fully specifies the problem to be solved. Another difference from supervised learning is that a task often has no separate training and testing phases. Instead, some tasks require continual learning throughout an agent's life.

3.1 Value Functions

We can formulate the reinforcement learning task an agent faces as a Markov decision process (MDP) [13]. A finite Markov decision process is characterized by:

• a finite set of states $S$,

• a finite set of actions $A$,

• a reward function $R : S \times A \to \mathbb{R}$, and

• a state transition function $T : S \times A \times S \to \mathbb{R}$, where $T(s, a, s')$ is the probability of advancing from state $s$ to $s'$ when taking action $a$.

The model is called Markov if the transition probabilities $T$ are independent of previous states and actions. Thus, the next state is specified probabilistically by the transition function $T$ and the current state and action alone. Note that the model is a nondeterministic MDP because the successor state of each action is determined probabilistically.

At each time step $t$, an agent observes the state $s_t$ and takes action $a_t$. The environment responds by returning a reward $r_{t+1} = R(s_t, a_t)$ and the next state $s_{t+1}$ with probability $T(s_t, a_t, s_{t+1})$. This process is repeated continually until the agent achieves its goal, or indefinitely for non-episodic tasks.

The policy $\pi(s, a)$ of an agent is a mapping of each state $s$ and action $a$ to the probability of taking action $a$ in state $s$. The goal of an agent is to improve its policy by maximizing the expected return.

There are different ways of calculating the expected return $R_t$, based on the specific task the agent has to solve. Some tasks can be broken up into a series of episodes or trials, where each episode ends in a terminal state. At the end of each episode, the agent is reset to a starting state. In such episodic tasks, we obtain the expected return by summing the total received rewards over a finite horizon $h$:

\[
R_t = \sum_{k=0}^{h} r_{t+k+1}. \tag{1}
\]

Some tasks never end; thus, the above sum may be infinite. This problem may be solved by discounting future rewards:

\[
R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \tag{2}
\]

where $\gamma$ is the discount rate and $0 \le \gamma < 1$. In our discussions, we will focus exclusively on this case, which is called the discounted infinite horizon case. Episodic tasks can also be handled by this definition of expected return by introducing an absorbing state which is entered just after the terminal state. The only transition from the absorbing state is to itself, with an associated reward of zero.
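As a small worked illustration of the discounted return in Equation 2, the sketch below sums a finite reward sequence. The function name `discounted_return` is hypothetical; truncating the sum is exact for an episodic task, because the absorbing state contributes only zero rewards after the terminal state.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma^k * rewards[k], where rewards[k] holds r_{t+k+1}."""
    total = 0.0
    # Accumulate from the last reward backwards: R = r + gamma * R_next.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


print(discounted_return([1.0, 1.0, 1.0], 0.5))   # 1 + 0.5 + 0.25 = 1.75
# Zero rewards from the absorbing state do not change the return.
print(discounted_return([1.0, 1.0, 1.0, 0.0, 0.0], 0.5))
```

With a constant reward of 1 at every step, the return approaches $1/(1-\gamma)$, which is why $\gamma < 1$ keeps the infinite sum finite.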

Most reinforcement learning algorithms are based on estimating value functions that estimate the utility of states. The value or utility of a state is the future reward, or return, that an agent can expect. As the future rewards depend on which actions an agent takes, the value function depends on the particular policy the agent follows. The value $V^{\pi}(s)$ of a state $s$ under policy $\pi$ is the expected return by following policy $\pi$ from state $s$:

\[
V^{\pi}(s) = E_{\pi}\{ R_t \mid s_t = s \}, \tag{3}
\]

where $E_{\pi}\{\}$ denotes the expected value when policy $\pi$ is followed. For the discounted infinite horizon case, we have:

\[
V^{\pi}(s) = E_{\pi}\Big\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \;\Big|\; s_t = s \Big\}. \tag{4}
\]

The optimal value function $V^*$ is attained by maximizing $V^{\pi}$ for all states:

\[
V^*(s) = \max_{\pi} V^{\pi}(s) \quad (\forall s). \tag{5}
\]

The optimal policy is defined as the policy corresponding to the optimal value function in the maximization above:

\[
\pi^* = \arg\max_{\pi} V^{\pi}(s) \quad (\forall s). \tag{6}
\]


In an MDP, we have a model of the environment dynamics in the form of state transition probabilities $T$ and the reward function $R$; thus, we can use the dynamic programming technique called value iteration to find the optimal value function. Once we have the optimal value function, we can obtain the optimal policy $\pi^*$ by choosing, in each state, the action that results in the maximum value function of all the immediate successor states:

\[
\pi^*(s) = \arg\max_{a} V^*(s'), \tag{7}
\]

where $s'$ is the successor of state $s$.
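Value iteration is named above but not spelled out, so the following is a minimal sketch using the standard textbook backup $V(s) \leftarrow \max_a [R(s,a) + \gamma \sum_{s'} T(s,a,s') V(s')]$, which is an assumption beyond Equation 7 because it also includes the immediate reward. The dictionary layout of `T` and `R`, the function names and the three-state example MDP are all hypothetical.

```python
def backup(s, a, T, R, gamma, V):
    """One-step lookahead: immediate reward plus discounted expected value."""
    return R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())


def value_iteration(states, actions, T, R, gamma, tol=1e-9):
    """Sweep over all states until the largest change is below tol."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(backup(s, a, T, R, gamma, V) for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V


def greedy_policy(states, actions, T, R, gamma, V):
    """Choose, in each state, the action with the best one-step lookahead."""
    return {s: max(actions, key=lambda a: backup(s, a, T, R, gamma, V))
            for s in states}


# Hypothetical chain: 0 -> 1 -> 2, where state 2 is absorbing with zero reward.
states, actions = [0, 1, 2], ["stay", "go"]
T = {0: {"stay": {0: 1.0}, "go": {1: 1.0}},
     1: {"stay": {1: 1.0}, "go": {2: 1.0}},
     2: {"stay": {2: 1.0}, "go": {2: 1.0}}}
R = {0: {"stay": 0.0, "go": 0.0},
     1: {"stay": 0.0, "go": 1.0},
     2: {"stay": 0.0, "go": 0.0}}
V = value_iteration(states, actions, T, R, gamma=0.9)
print(V, greedy_policy(states, actions, T, R, 0.9, V))
```

Both functions need `T` and `R` explicitly, which is exactly the model knowledge that a reinforcement learning agent generally lacks.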

In reinforcement learning problems, an agent generally does not have access to the environment dynamics in the form of the transition probabilities $T$; thus, we cannot use dynamic programming techniques. In the next sections, we examine reinforcement learning methods based on dynamic programming¹, where we do not have access to the environment dynamics. Instead, an agent has to learn from the environment through the rewards experienced by taking different actions.

3.2 Temporal-Difference Learning

We now turn our attention to the problem of learning the optimal policy without perfect knowledge of the environment. The only way we can learn about the environment is to explore it by taking actions, observing the reward and using the experience to update the value function. One way of solving the problem is to incrementally estimate the value function $V^{\pi}$ as we encounter each new state. We denote this approximate value function by $V$.

The class of temporal-difference learning [28] algorithms updates the current estimate $V(s_t)$ by using the value function estimates of temporally successive states. Temporal-difference methods are called bootstrapping methods, because they update estimates based on other estimates. By looking one step ahead at the value function of the next state, we can update the current value function estimate as follows:

\[
V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right], \tag{8}
\]

where $\alpha$ is the step size parameter.

¹ Barto and Sutton [30] present a unified view relating dynamic programming, Monte Carlo, and temporal-difference methods for solving reinforcement learning problems.


Initialize V(s) arbitrarily, π to the policy to be evaluated
repeat for each episode:
    Initialize s
    repeat for each step in episode:
        choose action a in state s from policy π
        take action a; observe reward r, and next state s'
        V(s) ← V(s) + α[r + γV(s') − V(s)]
        s ← s'
    until s is terminal

Figure 3.2: Estimating $V^{\pi}$ with TD(0).
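The procedure of Figure 3.2 can be written as a short runnable sketch. The interface (a `step` function returning reward and next state, a `policy` function, an `is_terminal` test) and the deterministic three-step chain used as an environment are assumptions made for illustration; they are not part of the original algorithm description.

```python
from collections import defaultdict


def td0(step, policy, start_state, is_terminal, episodes, alpha, gamma):
    """Estimate the value function of a fixed policy with TD(0)."""
    V = defaultdict(float)               # V(s) initialized to zero
    for _ in range(episodes):
        s = start_state
        while not is_terminal(s):
            a = policy(s)                # action prescribed by the policy
            r, s_next = step(s, a)       # observe reward and next state
            # Move V(s) towards the one-step target r + gamma * V(s').
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V


def chain_step(s, a):
    """Hypothetical chain 0 -> 1 -> 2 -> 3; reward 1 on entering state 3."""
    s_next = s + 1
    return (1.0 if s_next == 3 else 0.0), s_next


V = td0(chain_step, lambda s: "right", 0, lambda s: s == 3,
        episodes=200, alpha=0.5, gamma=0.9)
print(V[0], V[1], V[2])
```

In this deterministic example the estimates approach $\gamma^2$, $\gamma$ and $1$ for states 0, 1 and 2, the discounted returns of always moving right.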

The algorithm, called TD(0) for reasons we will see shortly, is shown in Figure 3.2. Recall that the value function $V(s)$ is the expected return of following policy $\pi$ from state $s$. Thus, the TD(0) algorithm predicts the reward an agent will receive by following policy $\pi$ from state $s$. It has been shown that TD(0) converges with probability 1 to $V^{\pi}$ for any fixed $\pi$ with an appropriate choice of $\alpha$. If we denote $\alpha_k(a)$ as the step size parameter after the $k$th selection of action $a$, a suitable choice is $\alpha_k(a) = \frac{1}{k}$. This follows from the well-known result in stochastic approximation theory giving the conditions for convergence with probability 1 as:

\[
\sum_{k=1}^{\infty} \alpha_k(a) = \infty \quad \text{and} \quad \sum_{k=1}^{\infty} \alpha_k^2(a) < \infty. \tag{9}
\]

Although this is a useful theoretical result, the step size decrease above is seldom used in practice [30]. Instead, a constant step size $\alpha_k(a) = \alpha$ is used. This may be so for two reasons: first, the convergence is often slow or needs considerable tuning for a satisfactory convergence rate; second, in non-stationary environments, convergence is undesirable as the reward function $R$ may change over time; thus, we want our learned policy to continually change in response to the latest received rewards.
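The difference between the decreasing step size $\alpha_k = 1/k$ and a constant step size can be seen on a simple running estimate. The sketch below is an illustration only: the function name `running_estimate` and the reward sequence that jumps from 0 to 10 are made up to mimic a non-stationary environment.

```python
def running_estimate(samples, step_size):
    """Incrementally update an estimate: est += alpha_k * (x_k - est)."""
    estimate = 0.0
    for k, x in enumerate(samples, start=1):
        estimate += step_size(k) * (x - estimate)
    return estimate


# With alpha_k = 1/k the estimate is exactly the sample average.
print(running_estimate([2.0, 4.0, 6.0], lambda k: 1.0 / k))

# Hypothetical non-stationary signal: the reward level changes halfway.
samples = [0.0] * 100 + [10.0] * 100
averaged = running_estimate(samples, lambda k: 1.0 / k)   # weights all samples equally
tracking = running_estimate(samples, lambda k: 0.1)       # favours recent samples
print(averaged, tracking)
```

The $1/k$ schedule settles on the average of old and new rewards, while the constant step size follows the most recent level, which is the behaviour wanted when the reward function changes over time.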

3.3 Q-Learning

In the previous section, we saw how TD(0) can be used for predicting the expected reward of a particular policy $\pi$ by estimating the value function. In this section, we [...]

If the agent knows the transition probabilities $T$ of the environment, it can choose the action that leads to the successor state with the combined maximum value function (Equation 7) and immediate reward. The problem is that we generally do not have a model of the environment; thus, we do not know which actions take us to which states. The solution is to define a new value function $Q^{\pi}(s, a)$, defined as the value of taking action $a$ in state $s$ while following policy $\pi$. This new value function is called the action-value function, and $V^{\pi}(s)$ the state-value function.

We define $Q^*(s, a)$ as the expected return of taking action $a$ in state $s$, and following the optimal policy from then on. Thus, we can write $Q^*(s, a)$ in terms of $V^*(s)$:

\[
Q^*(s, a) = E\{ r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a \}. \tag{10}
\]

Recall that $V^*(s)$ is the value of taking the best step initially, so we also have:

\[
V^*(s) = \max_{a} Q^*(s, a), \tag{11}
\]

which enables us to write Equation 10 recursively:

\[
Q^*(s, a) = E\{ r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s, a_t = a \}. \tag{12}
\]

Whereas TD(0) is used to predict the expected return of states while following policy $\pi$, Q-Learning [34] incrementally estimates the optimal action-value function $Q^*(s, a)$. The update rule is given by:

\[
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]. \tag{13}
\]

The Q-Learning algorithm shown in Figure 3.3 converges to the optimal action-value function $Q^*$ with probability 1 under the same conditions for $\alpha$ as in TD(0), provided each state-action pair is tried infinitely often. We will prove the convergence results in a later section.

In the Q-Learning algorithm, we must select actions based on a suitable exploration strategy derived from $Q$. Any strategy that guarantees that each state-action pair will be tried infinitely often will suffice. One of the simplest strategies is $\epsilon$-greedy, where an agent chooses the action with maximal Q-value in that state with probability $1 - \epsilon$ and a random action with a small probability $\epsilon$. When an agent chooses an action with maximum Q-value, it is exploiting previously stored information, whereas random actions result in exploration. We will discuss the tradeoff between exploration and exploitation.

Initialize Q(s, a) arbitrarily
repeat for each episode:
    Initialize s
    repeat for each step in episode:
        choose action a in state s using exploration policy derived from Q
        take action a; observe reward r, and next state s'
        Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') − Q(s, a)]
        s ← s'
    until s is terminal

Figure 3.3: Estimating $Q^*$ with Q-Learning.
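Figure 3.3 combined with $\epsilon$-greedy exploration can be sketched as runnable code. The environment interface, the random tie-breaking among equally valued actions and the small deterministic chain world are assumptions chosen for illustration, not details given in the text.

```python
import random
from collections import defaultdict


def q_learning(step, actions, start_state, is_terminal,
               episodes, alpha, gamma, epsilon, rng):
    """Estimate Q* with tabular Q-Learning and epsilon-greedy exploration."""
    Q = defaultdict(float)                        # Q(s, a) initialized to zero
    for _ in range(episodes):
        s = start_state
        while not is_terminal(s):
            if rng.random() < epsilon:            # explore: random action
                a = rng.choice(actions)
            else:                                 # exploit: maximal Q-value
                best = max(Q[(s, act)] for act in actions)
                a = rng.choice([act for act in actions if Q[(s, act)] == best])
            r, s_next = step(s, a)
            best_next = max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q


def chain_step(s, a):
    """Hypothetical chain 0..3: 'right' moves up, 'left' moves down (min 0)."""
    s_next = s + 1 if a == "right" else max(s - 1, 0)
    return (1.0 if s_next == 3 else 0.0), s_next


Q = q_learning(chain_step, ["left", "right"], 0, lambda s: s == 3,
               episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1,
               rng=random.Random(0))
print({s: max(["left", "right"], key=lambda a: Q[(s, a)]) for s in range(3)})
```

The update uses the maximum over next actions regardless of which action the exploration strategy takes next, so the learned values describe the greedy policy rather than the $\epsilon$-greedy behaviour.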

Q-Learning is called an off-policy learning algorithm because it converges to the optimal value function independent of the exploration policy being followed. In other words, the details of the particular exploration strategy do not influence the value function, but only the rate of convergence. There is also an on-policy Q-Learning algorithm called SARSA [30], in which the exploration strategy is taken into account. However, both algorithms converge to the same value function when $\epsilon$, the probability of exploration, decreases towards zero.

3.4 TD(λ) Learning

The TD(0) learning method we studied previously is a special case of a class of temporal-difference learning methods called TD($\lambda$), with $\lambda = 0$. In the update rule of TD(0) (Equation 8), we look ahead one step to the value function of the next state. The update moves the estimate closer to the target value of estimated return:

\[
R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1}). \tag{14}
\]

We can generalize the target to the case of $n$ steps, also called the corrected $n$-step truncated return:

\[
R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n}). \tag{15}
\]
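Equation 15 can be checked numerically with a few lines of code. The function name `n_step_return` and its argument layout are hypothetical; the reward and value numbers are made up for the example.

```python
def n_step_return(rewards, value_at_n, gamma):
    """Corrected n-step truncated return.

    rewards holds [r_{t+1}, ..., r_{t+n}] and value_at_n is V_t(s_{t+n}).
    """
    n = len(rewards)
    discounted = sum(gamma ** k * r for k, r in enumerate(rewards))
    # The tail of the return is replaced by the current value estimate.
    return discounted + gamma ** n * value_at_n


print(n_step_return([1.0], 2.0, 0.5))        # n = 1: 1 + 0.5 * 2 = 2.0
print(n_step_return([1.0, 1.0], 4.0, 0.5))   # n = 2: 1 + 0.5 + 0.25 * 4 = 2.5
```

With a single reward the function reduces to the one-step target of Equation 14.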

It can be shown [30] that the expected value of the corrected $n$-step truncated return is an improvement over the current value function as an approximation to the true value