
Chapter 2 Dynamic Programming

2.1 Closed-loop optimization of discrete-time systems: inventory control

We consider the following inventory control problem: The problem is to minimize the expected cost of ordering quantities of a certain product in order to meet a stochastic demand for that product. The ordering is only possible at discrete time instants t0 < t1 < · · · < tN−1, N ∈ N.
We denote by xk, uk, and wk, 0 ≤ k ≤ N − 1, the available stock, the order, and the demand at tk. Here, wk, 0 ≤ k ≤ N − 1, are assumed to be independent random variables. Excess demand (giving rise to negative stock) is backlogged and filled as soon as additional inventory becomes available. Consequently, the stock evolves according to the following discrete-time system

(2.1) xk+1 = xk + uk − wk , 0 ≤ k ≤ N − 1 .

The cost at time tk involves a penalty r(xk) for either holding cost (excess inventory) or shortage cost (unsatisfied demand) and a purchasing cost in terms of a cost c per ordered unit. Moreover, there is a terminal cost R(xN) for leftover inventory at the end of the planning horizon. Therefore, the total cost is given by

(2.2)   E [ R(xN) + ∑_{k=0}^{N−1} ( r(xk) + c uk ) ] .

A natural constraint on the variables is uk ≥ 0, 0 ≤ k ≤ N − 1. We distinguish between open-loop minimization and closed-loop minimization of the cost: In open-loop minimization, the decisions on the ordering policy uk, 0 ≤ k ≤ N − 1, are made once and for all at the initial time t0, which leads to the minimization problem

minimize   E [ R(xN) + ∑_{k=0}^{N−1} ( r(xk) + c uk ) ] ,      (2.3a)

subject to   xk+1 = xk + uk − wk , 0 ≤ k ≤ N − 1 ,      (2.3b)

                     uk ≥ 0 , 0 ≤ k ≤ N − 1 .      (2.3c)

In closed-loop minimization, the orders uk, k > 0, are placed at the last possible moment, i.e., at tk, to take into account information about the development of the stock until tk. In mathematical terms, this amounts to determining a policy or control rule π := {µk}_{k=0}^{N−1}


described by functions µk = µk(xk), 0 ≤ k ≤ N − 1, where the function value is the amount of units of the product to be ordered at time tk, if the stock is xk. Hence, given a fixed initial stock x0, the associated minimization problem reads as follows:

minimize   Jπ(x0) := E [ R(xN) + ∑_{k=0}^{N−1} ( r(xk) + c µk(xk) ) ] ,      (2.4a)

subject to   xk+1 = xk + µk(xk) − wk , 0 ≤ k ≤ N − 1 ,      (2.4b)

                     µk(xk) ≥ 0 , 0 ≤ k ≤ N − 1 .      (2.4c)

Dynamic Programming (DP) is concerned with the efficient solution of such closed-loop minimization problems.
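The distinction can be made concrete with a few lines of code. The following sketch (all names, costs, and the demand distribution are illustrative, not from the notes) estimates the expected cost by Monte-Carlo simulation, once for a fixed order schedule and once for a rule that reacts to the observed stock:

```python
import random

# Monte-Carlo comparison of open-loop vs. closed-loop ordering for the
# stock dynamics x_{k+1} = x_k + u_k - w_k; all numbers are hypothetical.
N, c = 3, 1.0                                   # horizon, per-unit order cost
r = lambda x: x * x                             # holding/shortage penalty r(x)
demand = lambda: random.choice([0, 1, 1, 2])    # i.i.d. demand w_k

def expected_cost(x0, decide, trials=20000):
    """Average of sum_k [ r(x_k) + c u_k ]; decide(k, x) returns u_k >= 0."""
    total = 0.0
    for _ in range(trials):
        x = x0
        for k in range(N):
            u = decide(k, x)
            total += r(x) + c * u
            x = x + u - demand()
    return total / trials

open_loop = lambda k, x: 1                  # schedule fixed at time t_0
closed_loop = lambda k, x: max(0, 1 - x)    # order depends on observed stock
print(expected_cost(0, open_loop), expected_cost(0, closed_loop))
```

Typically the reactive rule achieves a lower average cost, reflecting the extra information it uses; an open-loop schedule is just one particular state-independent policy.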

Remark: We note that minimization problems associated with deterministic discrete-time dynamical systems can be considered as well. Such systems will be dealt with in more detail in Chapter 2.3.


2.2 Dynamic Programming for discrete-time systems

2.2.1 Setting of the problem

For N ∈ N, we consider sequences {Sk}_{k=0}^{N}, {Ck}_{k=0}^{N−1}, and {Dk}_{k=0}^{N−1} of (random) state spaces Sk, 0 ≤ k ≤ N, control spaces Ck, 0 ≤ k ≤ N − 1, and (random) disturbance spaces Dk, 0 ≤ k ≤ N − 1. Given an initial state x0 ∈ S0, we assume that the states xk ∈ Sk, 0 ≤ k ≤ N, evolve according to the discrete-time dynamic system

xk+1 = fk(xk, uk, wk) , 0 ≤ k ≤ N − 1 ,

where fk : Sk × Ck × Dk → Sk+1, 0 ≤ k ≤ N − 1. The controls uk satisfy

uk ∈ Uk(xk) ⊂ Ck , 0 ≤ k ≤ N − 1 ,

i.e., they depend on the state xk ∈ Sk, and the random disturbances wk, 0 ≤ k ≤ N − 1, are characterized by a probability distribution

Pk(·|xk, uk) , 0 ≤ k ≤ N − 1 ,

which may explicitly depend on the state xk and on the control uk, but not on the prior disturbances wj, 0 ≤ j ≤ k − 1. At the time instants 0 ≤ k ≤ N − 1, decisions with respect to the choice of the controls have to be made by means of control laws

µk : Sk → Ck , uk = µk(xk) , 0 ≤ k ≤ N − 1 ,

leading to a control policy

π = {µ0, µ1, · · · , µN−1} .

We refer to Π as the set of admissible control policies. Finally, we introduce cost functionals

gk : Sk × Ck × Dk → R , 0 ≤ k ≤ N − 1 ,

associated with the decisions at the time instants 0 ≤ k ≤ N − 1, and a terminal cost

gN : SN → R .

We consider the minimization problem

min_{π∈Π}   Jπ(x0) = E [ gN(xN) + ∑_{k=0}^{N−1} gk(xk, µk(xk), wk) ] ,      (2.5a)

subject to

xk+1 = fk(xk, µk(xk), wk) , 0 ≤ k ≤ N − 1 ,      (2.5b)


where the expectation E is taken over all random states and random disturbances. For a given initial state x0, the value

(2.6)   J∗(x0) = min_{π∈Π} Jπ(x0)

is referred to as the optimal cost function or the optimal value function. Moreover, if there exists an admissible policy π∗ ∈ Π such that

(2.7) Jπ∗(x0) = J∗(x0) ,

then π∗ is called an optimal policy.

2.2.2 Bellman’s principle of optimality

Bellman's celebrated principle of optimality is the key for a constructive approach to the solution of the minimization problem (2.5) which readily leads to a powerful algorithmic tool: the Dynamic Programming algorithm (DP algorithm). For intermediate states xk ∈ Sk, 1 ≤ k ≤ N − 1, that occur with positive probability, we consider the minimization subproblems

min_{πk∈Πk}   Jπk(xk) = E [ gN(xN) + ∑_{ℓ=k}^{N−1} gℓ(xℓ, µℓ(xℓ), wℓ) ] ,      (2.8a)

subject to

xℓ+1 = fℓ(xℓ, µℓ(xℓ), wℓ) , k ≤ ℓ ≤ N − 1 ,      (2.8b)

where πk = {µk, µk+1, · · · , µN−1} and Πk is the set of admissible policies obtained from Π by deleting the admissible control laws associated with the previous time instants 0 ≤ ℓ ≤ k − 1. Bellman's optimality principle states that if

π∗ = {µ∗0, µ∗1, · · · , µ∗N−1}

is an optimal policy for (2.5), then the truncated policy

π∗k = {µ∗k, µ∗k+1, · · · , µ∗N−1}

is an optimal policy for the minimization subproblem (2.8). We refer to

(2.9)   J∗k(xk) = Jπ∗k(xk)

as the optimal cost-to-go for state xk at the time instant k to the final time N. For completeness, we further set J∗N(xN) = gN(xN). The intuitive justification for Bellman's optimality principle is that if the truncated policy π∗k were not optimal for subproblem (2.8), then


we would be able to reduce the optimal cost for (2.5) by switching to the optimal policy for (2.8) once xk is reached.

2.2.3 The DP algorithm and its optimality

Bellman's optimality principle strongly suggests solving the optimality subproblems (2.8) backwards in time, beginning with the terminal cost at the final time instant N and then recursively computing the optimal cost-to-go for subproblems k = N − 1, · · · , 0. This leads to the so-called backward DP algorithm which is characterized by the recursions

JN(xN) = gN(xN) ,      (2.10a)

Jk(xk) = min_{uk∈Uk(xk)} E_{wk} [ gk(xk, uk, wk) + Jk+1(fk(xk, uk, wk)) ] , 0 ≤ k ≤ N − 1 ,      (2.10b)

where E_{wk} means that the expectation is taken with respect to the probability distribution of wk.
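The recursion (2.10) translates almost directly into code. Below is a minimal sketch for finite state, control, and disturbance spaces; the function names and calling conventions are illustrative, not fixed by the notes:

```python
# Backward DP algorithm (2.10) for finite spaces; problem data supplied
# by the caller.
def backward_dp(N, states, controls, f, g, gN, dist):
    """
    states[k]      : iterable of states at stage k, 0 <= k <= N
    controls(k, x) : admissible controls U_k(x)
    f(k, x, u, w)  : system function f_k
    g(k, x, u, w)  : stage cost g_k
    gN(x)          : terminal cost g_N
    dist(k, x, u)  : list of (w, probability) pairs for w_k
    Returns cost-to-go tables J[k][x] and a policy mu[k][x].
    """
    J = [dict() for _ in range(N + 1)]
    mu = [dict() for _ in range(N)]
    J[N] = {x: gN(x) for x in states[N]}            # (2.10a)
    for k in range(N - 1, -1, -1):                  # k = N-1, ..., 0
        for x in states[k]:
            best_u, best = None, float("inf")
            for u in controls(k, x):
                # E_{w_k}[ g_k(x,u,w) + J_{k+1}(f_k(x,u,w)) ], cf. (2.10b)
                val = sum(p * (g(k, x, u, w) + J[k + 1][f(k, x, u, w)])
                          for w, p in dist(k, x, u))
                if val < best:
                    best_u, best = u, val
            J[k][x], mu[k][x] = best, best_u
    return J, mu
```

The inventory example of Subsection 2.2.4 is a direct instance of this scheme.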

Theorem 2.1 (Optimality of the backward DP algorithm)
We assume that

• the random disturbance spaces Dk, 0 ≤ k ≤ N − 1, are finite or countable sets,

• the expectations of all terms in the cost functionals in (2.10b) are finite for every admissible policy.

Then, there holds

(2.11)   J∗k(xk) = Jk(xk) , 0 ≤ k ≤ N .

Moreover, if µ∗k(xk) = u∗k, 0 ≤ k ≤ N − 1, where u∗k attains the minimum in (2.10b), then π∗ = {µ∗0, µ∗1, · · · , µ∗N−1} is an optimal policy for (2.5).

Proof. We note that for k = N the assertion (2.11) holds true by definition. For k < N and ε > 0, we define µεk(xk) ∈ Uk(xk) as the ε-suboptimal control satisfying

E_{wk} [ gk(xk, µεk(xk), wk) + Jk+1(fk(xk, µεk(xk), wk)) ] ≤ Jk(xk) + ε .      (2.12)

We denote by Jεk(xk) the expected cost at state xk and time instant k with respect to the ε-optimal policy

πεk = {µεk, µεk+1, · · · , µεN−1} .


We show by induction on N − k, k ≥ 1, that

Jk(xk) ≤ Jεk(xk) ≤ Jk(xk) + (N − k) ε ,      (2.13a)

J∗k(xk) ≤ Jεk(xk) ≤ J∗k(xk) + (N − k) ε ,      (2.13b)

Jk(xk) = J∗k(xk) .      (2.13c)

(i) Start of the induction (k = N − 1): It follows readily from (2.12) that (2.13a) and (2.13b) hold true for k = N − 1. The limit process ε → 0 in (2.13a) and (2.13b) then yields that (2.13c) is satisfied for k = N − 1.

(ii) Induction hypothesis: We suppose that (2.13a)-(2.13c) hold true for some k + 1.

(iii) Proof for k: Using the definition of Jεk(xk), the induction hypothesis, and (2.12), we get

Jεk(xk) = E_{wk} [ gk(xk, µεk(xk), wk) + Jεk+1(fk(xk, µεk(xk), wk)) ]
        ≤ E_{wk} [ gk(xk, µεk(xk), wk) + Jk+1(fk(xk, µεk(xk), wk)) ] + (N − k − 1) ε
        ≤ Jk(xk) + ε + (N − k − 1) ε = Jk(xk) + (N − k) ε .

On the other hand, using the induction hypothesis once more, we obtain

Jεk(xk) = E_{wk} [ gk(xk, µεk(xk), wk) + Jεk+1(fk(xk, µεk(xk), wk)) ]
        ≥ E_{wk} [ gk(xk, µεk(xk), wk) + Jk+1(fk(xk, µεk(xk), wk)) ]
        ≥ min_{uk∈Uk(xk)} E_{wk} [ gk(xk, uk, wk) + Jk+1(fk(xk, uk, wk)) ] = Jk(xk) .

The combination of the two preceding inequalities proves (2.13a) for k. Now, for every policy πk = {µk, · · · , µN−1} ∈ Πk, using the induction hypothesis and (2.12), we have

Jεk(xk) = E_{wk} [ gk(xk, µεk(xk), wk) + Jεk+1(fk(xk, µεk(xk), wk)) ]
        ≤ E_{wk} [ gk(xk, µεk(xk), wk) + Jk+1(fk(xk, µεk(xk), wk)) ] + (N − k − 1) ε
        ≤ min_{uk∈Uk(xk)} E_{wk} [ gk(xk, uk, wk) + Jk+1(fk(xk, uk, wk)) ] + (N − k) ε
        ≤ E_{wk} [ gk(xk, µk(xk), wk) + Jπk+1(fk(xk, µk(xk), wk)) ] + (N − k) ε
        = Jπk(xk) + (N − k) ε .

Minimizing over πk yields

Jεk(xk) ≤ J∗k (xk) + (N − k)ε .

On the other hand, by the definition of J∗k (xk) we obviously have

J∗k (xk) ≤ Jεk(xk) .

Combining the preceding inequalities proves (2.13b) for k. Again, (2.13c) finally follows from ε → 0 in (2.13a) and (2.13b). □


Remark: The general case (where the disturbance spaces are not necessarily finite or countable) requires additional structure of the spaces Sk, Ck, and Dk as well as measurability assumptions with regard to the functions fk, gk, and µk within the framework of measure-theoretic probability theory. We refer to [2] for details.

2.2.4 Applications of the DP algorithm

Example (Inventory control): We consider the following modification of the inventory control problem previously discussed in Chapter 2.1: We assume that the states xk, the controls uk, and the demands wk are non-negative integers which can take the values 0, 1, and 2, where the demand wk has the same probability distribution

p(wk = 0) = 0.1 , p(wk = 1) = 0.7 , p(wk = 2) = 0.2

for all planning periods (k, k + 1). We further assume that the excess demand wk − xk − uk is lost and that there is an upper bound of 2 units on the stock that can be stored. Consequently, the equation for the evolution of the stock takes the form

xk+1 = max(0, xk + uk − wk)

under the constraint

xk + uk ≤ 2 .

For the holding costs and the terminal cost we assume

r(xk) = (xk + uk − wk)² , R(xN) = 0 ,

and we suppose that the ordering cost is 1 per unit, i.e.,

c = 1 .

Therefore, the functions gk, 0 ≤ k ≤ N, are given by

gk(xk, uk, wk) = uk + (xk + uk − wk)² , 0 ≤ k ≤ N − 1 ,

gN(xN) = 0 .

Finally, we suppose x0 = 0 and for the planning horizon we take N = 3, so that the recursions (2.10) of the backward DP algorithm have the form

J3(x3) = 0 ,      (2.14a)

Jk(xk) = min_{0≤uk≤2−xk} E_{wk} [ uk + (xk + uk − wk)² + Jk+1(max(0, xk + uk − wk)) ] , 0 ≤ k ≤ 2 .      (2.14b)

We start with period 2, followed by period 1, and finish with period 0:


Period 2: We compute J2(x2) for each possible value of the state x2:

(i) x2 = 0 : We have

J2(0) = min_{u2∈{0,1,2}} E_{w2} [ u2 + (u2 − w2)² ]
      = min_{u2∈{0,1,2}} [ u2 + 0.1 u2² + 0.7 (u2 − 1)² + 0.2 (u2 − 2)² ] .

The computation of the right-hand side for the three different values of the control u2 yields

u2 = 0 : E[·] = 0.7 · 1 + 0.2 · 4 = 1.5 ,

u2 = 1 : E[·] = 1 + 0.1 · 1 + 0.2 · 1 = 1.3 ,

u2 = 2 : E[·] = 2 + 0.1 · 4 + 0.7 · 1 = 3.1 .

Hence, we get

J2(0) = 1.3 , µ∗2(0) = 1 .

(ii) x2 = 1 : In view of the constraint x2 + u2 ≤ 2, only u2 ∈ {0, 1} is admissible. Thus, we obtain

J2(1) = min_{u2∈{0,1}} E_{w2} [ u2 + (1 + u2 − w2)² ]
      = min_{u2∈{0,1}} [ u2 + 0.1 (1 + u2)² + 0.7 u2² + 0.2 (u2 − 1)² ] .

The computation of the right-hand side gives

u2 = 0 : E[·] = 0.1 · 1 + 0.2 · 1 = 0.3 ,

u2 = 1 : E[·] = 1 + 0.1 · 4 + 0.7 · 1 = 2.1 .

It follows that

J2(1) = 0.3 , µ∗2(1) = 0 .

(iii) x2 = 2 : Since the only admissible control is u2 = 0, we get

J2(2) = E_{w2} [ (2 − w2)² ] = 0.1 · 4 + 0.7 · 1 = 1.1 ,

whence

J2(2) = 1.1 , µ∗2(2) = 0 .

Period 1: Again, we compute J1(x1) for each possible value of the state x1, taking into account the previously computed values J2(x2):

(i) x1 = 0 : We get

J1(0) = min_{u1∈{0,1,2}} E_{w1} [ u1 + (u1 − w1)² + J2(max(0, u1 − w1)) ] .


The computation of the right-hand side results in

u1 = 0 : E[·] = 0.1 · J2(0) + 0.7 · (1 + J2(0)) + 0.2 · (4 + J2(0)) = 2.8 ,
u1 = 1 : E[·] = 1 + 0.1 · (1 + J2(1)) + 0.7 · J2(0) + 0.2 · (1 + J2(0)) = 2.5 ,
u1 = 2 : E[·] = 2 + 0.1 · (4 + J2(2)) + 0.7 · (1 + J2(1)) + 0.2 · J2(0) = 3.68 .

We thus obtain

J1(0) = 2.5 , µ∗1(0) = 1 .

(ii) x1 = 1 : We have

J1(1) = min_{u1∈{0,1}} E_{w1} [ u1 + (1 + u1 − w1)² + J2(max(0, 1 + u1 − w1)) ] .

Evaluation of the right-hand side yields

u1 = 0 : E[·] = 0.1 · (1 + J2(1)) + 0.7 · J2(0) + 0.2 · (1 + J2(0)) = 1.5 ,
u1 = 1 : E[·] = 1 + 0.1 · (4 + J2(2)) + 0.7 · (1 + J2(1)) + 0.2 · J2(0) = 2.68 .

Consequently, we get

J1(1) = 1.5 , µ∗1(1) = 0 .

(iii) x1 = 2 : Since the only admissible control is u1 = 0, it follows that

J1(2) = E_{w1} [ (2 − w1)² + J2(max(0, 2 − w1)) ]
      = 0.1 · (4 + J2(2)) + 0.7 · (1 + J2(1)) + 0.2 · J2(0) = 1.68 ,

whence

J1(2) = 1.68 , µ∗1(2) = 0 .

Period 0: We only need to compute J0(0), since the initial state is x0 = 0. We get

J0(0) = min_{u0∈{0,1,2}} E_{w0} [ u0 + (u0 − w0)² + J1(max(0, u0 − w0)) ] .


Computation of the right-hand side gives

u0 = 0 : E[·] = 0.1 · J1(0) + 0.7 · (1 + J1(0)) + 0.2 · (4 + J1(0)) = 4.0 ,
u0 = 1 : E[·] = 1 + 0.1 · (1 + J1(1)) + 0.7 · J1(0) + 0.2 · (1 + J1(0)) = 3.7 ,
u0 = 2 : E[·] = 2 + 0.1 · (4 + J1(2)) + 0.7 · (1 + J1(1)) + 0.2 · J1(0) = 4.818 .

It follows that

J0(0) = 3.7 , µ∗0(0) = 1 .

Conclusion: The optimal ordering policy is to order one unit for each period, if the stock is 0, and to order nothing otherwise.
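The recursion (2.14) is small enough to check by machine. The following snippet (a straightforward transcription, with illustrative variable names) reproduces the cost-to-go values and optimal orders computed above:

```python
# Numerical check of (2.14): reproduces J2 = (1.3, 0.3, 1.1),
# J1 = (2.5, 1.5, 1.68), J0(0) = 3.7, and the policy mu*(0) = 1, else 0.
P = {0: 0.1, 1: 0.7, 2: 0.2}           # demand distribution
N = 3
J = {N: {x: 0.0 for x in range(3)}}     # J_3(x_3) = 0, cf. (2.14a)
mu = {}
for k in range(N - 1, -1, -1):          # cf. (2.14b)
    J[k], mu[k] = {}, {}
    for x in range(3):                  # stock levels 0, 1, 2
        costs = {u: sum(p * (u + (x + u - w) ** 2 + J[k + 1][max(0, x + u - w)])
                        for w, p in P.items())
                 for u in range(3 - x)} # constraint x + u <= 2
        mu[k][x] = min(costs, key=costs.get)
        J[k][x] = costs[mu[k][x]]

print(J[2], J[1], J[0])                 # the columns of Table 2.1
print(mu[2], mu[1], mu[0])              # the optimal purchases
```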

The results are visualized in Fig. 2.1-2.3 and Table 2.1. The numbers next to the arcs represent the transition probabilities.

Fig. 2.1. Inventory control (purchased stock = 0)

Fig. 2.2. Inventory control (purchased stock = 1)


Fig. 2.3. Inventory control (purchased stock = 2)

Stock   Stage 0 Cost   Stage 0 Purchase   Stage 1 Cost   Stage 1 Purchase   Stage 2 Cost   Stage 2 Purchase
0       3.7            1                  2.5            1                  1.3            1
1       2.7            0                  1.5            0                  0.3            0
2       2.818          0                  1.68           0                  1.1            0

Table 2.1: Inventory control

Table 2.1 displays the optimal cost-to-go and the optimal stock to be purchased for the three stages of the inventory control. Observe that due to the constraint xk + uk ≤ 2, the control u = 1 is not available at state 2, whereas the control u = 2 is not available at states 1 and 2.


Example (Optimal chess match strategy): We consider a chess match between two players A and B according to the following rules:

• If a game has a winner, the winner gets 1 point and the loser gets 0 points.

• In case of a draw, both players get 1/2 point.

• If the score is 1−1 at the end of the first two games, the match goes into sudden death, i.e., the players continue to play until the first time one of them wins a game.

We take the position of player A who has two playing strategies which he can choose independently at each game:

• Timid play, with which he draws with probability pd > 0 and loses with probability 1 − pd,

• Bold play, with which he wins with probability pw < pd and loses with probability 1 − pw.

Remark: We note that timid play never wins, whereas bold play never draws.

We will formulate a DP algorithm to determine the policy that maximizes player A's probability to win the match. For that purpose, we consider an N-game match and define the state as the net score, i.e., the difference between the points of player A and the points of player B (e.g., a net score 0 indicates an even score). We further define Tk and Bk as the two strategies:

• Tk : Timid play, which keeps the score at xk with probability pd and changes xk to xk − 1 with probability 1 − pd,

• Bk : Bold play, which changes xk to xk + 1 with probability pw or to xk − 1 with probability 1 − pw.

The DP recursion begins with

(2.15)   JN(xN) =  1   if xN > 0 ,
                   pw  if xN = 0 ,
                   0   if xN < 0 .

Note that JN(0) = pw, since due to the sudden death rule it is optimal to play bold if the score is even after N games. The optimal cost-to-go function at the beginning of the k-th game is given by the DP recursion

Jk(xk) = max_{Tk, Bk} [ pd Jk+1(xk) + (1 − pd) Jk+1(xk − 1) ,      (2.16)
                        pw Jk+1(xk + 1) + (1 − pw) Jk+1(xk − 1) ] .


We remark that it is optimal to play bold, if

pw Jk+1(xk + 1) + (1 − pw) Jk+1(xk − 1) ≥ pd Jk+1(xk) + (1 − pd) Jk+1(xk − 1) ,      (2.17)

or equivalently, if

(2.18)   pw / pd ≥ [ Jk+1(xk) − Jk+1(xk − 1) ] / [ Jk+1(xk + 1) − Jk+1(xk − 1) ] .

Using the recursions (2.15) and (2.16), we find

JN−1(xN−1) = 1 for xN−1 > 1                                      (optimal play: either)
JN−1(1) = max( pd + (1 − pd) pw , pw + (1 − pw) pw ) = pd + (1 − pd) pw      (optimal play: timid)
JN−1(0) = pw                                                      (optimal play: bold)
JN−1(−1) = pw²                                                    (optimal play: bold)
JN−1(xN−1) = 0 for xN−1 < −1                                      (optimal play: either)

Moreover, given JN−1(xN−1), and taking (2.16) and (2.18) into account, we find

JN−2(0) = max( pd pw + (1 − pd) pw² , pw (pd + (1 − pd) pw) + (1 − pw) pw² )
        = pw ( pw + (pw + pd)(1 − pw) ) ,

i.e., if the score is even with two games still to go, the optimal policy is to play bold.

Conclusion: For a 2-game match, the optimal policy for player A is to play timid if and only if player A is ahead in the score. The region of pairs (pw, pd) for which player A has a better than 50−50 chance to win the match is

(2.19)   { (pw, pd) ∈ [0, 1]² | J0(0) = pw ( pw + (pw + pd)(1 − pw) ) > 1/2 } .
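The recursions (2.15)-(2.16) are easily evaluated numerically. The sketch below (parameter values are illustrative) computes J0(0) for an N-game match and, for N = 2, agrees with the closed form in (2.19):

```python
def match_win_prob(pw, pd, N=2):
    """Player A's winning probability under optimal play, per (2.15)-(2.16).
    The net score after k games is an integer between -k and k."""
    # terminal values (2.15)
    J = {x: 1.0 if x > 0 else (pw if x == 0 else 0.0) for x in range(-N, N + 1)}
    for k in range(N - 1, -1, -1):      # backward over the games
        J = {x: max(pd * J[x] + (1 - pd) * J[x - 1],        # timid play
                    pw * J[x + 1] + (1 - pw) * J[x - 1])    # bold play
             for x in range(-k, k + 1)}
    return J[0]

pw, pd = 0.45, 0.9                       # illustrative values with pw < pd
print(match_win_prob(pw, pd))            # approx. 0.536625
print(pw * (pw + (pw + pd) * (1 - pw)))  # closed form from (2.19): same value
```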


2.3 Finite-state systems and shortest path problems

2.3.1 The backward DP algorithm

We consider the following deterministic discrete-time system

xk+1 = fk(xk, uk) , 0 ≤ k ≤ N − 1 ,      (2.20a)

uk = µk(xk) ,      (2.20b)

where x0 is a fixed initial state and the state spaces Sk, 1 ≤ k ≤ N, are finite sets with card Sk = nk. The control uk provides a transition from the state xk ∈ Sk to the state fk(xk, uk) ∈ Sk+1 at cost gk(xk, uk). Such a system is referred to as a deterministic finite-state system. We call k ∈ {0, · · · , N} the stage of the finite-state system. In particular, k = 0 is said to be the initial stage and k = N is called the final stage. A finite-state system can be described by a graph consisting of nodes representing states and arcs representing transitions between states at successive stages. In particular, the nodes consist of an initial node s0, nodes si, 1 ≤ i ≤ nk, at stages 1 ≤ k ≤ N, and a terminal node sN+1. Each node si, 1 ≤ i ≤ nN, representing a state xN at the final stage N is connected to the terminal node sN+1 by an arc representing the terminal cost gN(xN) (cf. Fig. 2.4).

Fig. 2.4. Transition graph of a deterministic finite-state system

We denote by a^k_{ij} the transition cost for the transition from the state xi ∈ Sk to the state xj ∈ Sk+1, with the convention that a^k_{ij} = +∞ if there is no control that provides a transition from xi ∈ Sk to xj ∈ Sk+1. The DP algorithm for the computation of the optimal cost is given


by

JN(si) = a^N_{i,sN+1} , 1 ≤ i ≤ nN ,      (2.21a)

Jk(si) = min_{1≤j≤nk+1} [ a^k_{ij} + Jk+1(sj) ] , 1 ≤ i ≤ nk , 0 ≤ k ≤ N − 1 ,      (2.21b)

where J0(s0) represents the optimal cost. The intermediate cost Jk(si) represents the optimal cost-to-go from the state xi ∈ Sk to the terminal (fictitious) state xN+1 associated with the terminal node sN+1.
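As an illustration, the recursion (2.21) on a small staged graph might look as follows; the three-stage arc costs are hypothetical, with math.inf marking missing arcs:

```python
import math

# a[k][i][j] plays the role of a^k_ij; "t" is the terminal node s_{N+1}.
a = [
    {"s0": {"A": 1.0, "B": 4.0}},                        # stage 0 -> stage 1
    {"A": {"C": 2.0, "D": 6.0}, "B": {"C": 3.0, "D": math.inf}},
    {"C": {"t": 2.0}, "D": {"t": 1.0}},                  # terminal costs
]

def backward_dp(a):
    """Cost-to-go J[k][node] for every node of the staged graph."""
    N = len(a) - 1
    J = [dict() for _ in range(N + 1)]
    J[N] = {i: arcs["t"] for i, arcs in a[N].items()}    # (2.21a)
    for k in range(N - 1, -1, -1):                       # (2.21b)
        for i, arcs in a[k].items():
            J[k][i] = min(c + J[k + 1][j] for j, c in arcs.items())
    return J

print(backward_dp(a)[0]["s0"])   # optimal cost J_0(s_0) = 1 + 2 + 2 = 5
```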

Remark (shortest path problem): If we identify the transition costs a^k_{ij} with the lengths of the arcs connecting the nodes, the optimal cost corresponds to the length of the shortest path from the initial node s0 to the terminal node sN+1. Hence, problem (2.20),(2.21) can be interpreted as a shortest path problem.

2.3.2 The forward DP algorithm

In contrast to stochastic DP algorithms, in a deterministic framework there is an alternative to the backward-in-time algorithm, namely the forward DP algorithm which proceeds forward-in-time. It is based on the following observation with respect to the shortest path problem: The optimal path from the initial node s0 to the terminal node sN+1 is also optimal for the so-called reverse shortest path problem, i.e., find the shortest path from the final node sN+1 to the initial node s0 with a reverse order of the arcs but the same lengths (transition costs). The forward DP algorithm reads as follows:

J̃N(sj) = a^0_{s0,j} , 1 ≤ j ≤ n1 ,      (2.22a)

J̃k(sj) = min_{1≤i≤nN−k} [ a^{N−k}_{ij} + J̃k+1(si) ] , 1 ≤ j ≤ nN−k+1 , 1 ≤ k ≤ N − 1 ,      (2.22b)

with optimal cost

J̃0(sN+1) = min_{1≤i≤nN} [ a^N_{i,sN+1} + J̃1(si) ] .      (2.22c)

The backward DP algorithm (2.21a),(2.21b) and the forward DP algorithm (2.22a)-(2.22c) provide the same optimal cost

(2.23)   J0(s0) = J̃0(sN+1) .

The intermediate cost J̃k(sj) can be interpreted as the optimal cost-to-arrive from the initial node s0 to the node sj at stage N − k + 1.
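For comparison, a forward pass over the same (hypothetical) staged graph accumulates costs-to-arrive and, in accordance with (2.23), ends with the same optimal value:

```python
import math

a = [                                    # same illustrative arc costs as above
    {"s0": {"A": 1.0, "B": 4.0}},
    {"A": {"C": 2.0, "D": 6.0}, "B": {"C": 3.0, "D": math.inf}},
    {"C": {"t": 2.0}, "D": {"t": 1.0}},
]

def forward_dp(a):
    """Optimal cost via costs-to-arrive, processed stage by stage."""
    D = {"s0": 0.0}                      # distance from s0 to each node
    for arcs in a:
        nxt = {}
        for i, out in arcs.items():
            for j, c in out.items():
                nxt[j] = min(nxt.get(j, math.inf), D[i] + c)
        D = nxt
    return D["t"]

print(forward_dp(a))   # 5.0, matching the backward value J_0(s_0)
```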


Remark (hidden Markov models): The forward DP algorithm will be applied to problems where the stage k problem data are unknown prior to stage k and become available to the controller just before decisions at stage k have to be made (hidden Markov models).

2.3.3 Shortest path problems as deterministic finite-state problems

In the previous sections, we have seen how to formulate a deterministic finite-state problem as a shortest path problem. Conversely, a shortest path problem can be cast in the framework of a deterministic finite-state problem: We consider a graph consisting of N + 1 nodes and arcs connecting node i with node j. The length of the arc from node i to node j is considered as the cost aij of moving from node i to node j. The final node N + 1 is a special node which is dubbed destination. We set aij = +∞ if there is no arc that connects node i and node j. The shortest path problem is to find the shortest path from each node i to the destination N + 1. We formally require a total of N moves, but allow degenerate moves from a node i to itself at cost aii = 0. We consider N stages k = 0, · · · , N − 1 with Jk(i) denoting the optimal cost of getting from node i to the destination N + 1 in N − k moves, i.e.,

JN−1(i) = ai,N+1 , 1 ≤ i ≤ N ,      (2.24a)

Jk(i) = min_{1≤j≤N} [ aij + Jk+1(j) ] , 1 ≤ i ≤ N , k = N − 2, · · · , 0 ,      (2.24b)

with J0(i) being the cost of the optimal path. If the optimal path contains degenerate moves, we drop the degenerate moves, which means that the optimal path may contain fewer than N moves.
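A direct transcription of (2.24) might look as follows; the 4-node cost matrix is hypothetical (the data of Fig. 2.5 are not reproduced here), with degenerate moves aii = 0 and math.inf for missing arcs:

```python
import math

INF = math.inf
N = 4                                   # nodes 1..4, destination 5
a = {                                   # a[i][j] and a[i]["dest"] = a_{i,N+1}
    1: {1: 0, 2: 3, 3: 1, 4: INF, "dest": 2},
    2: {1: 3, 2: 0, 3: 0.5, 4: INF, "dest": 7},
    3: {1: 1, 2: 0.5, 3: 0, 4: 1, "dest": 5},
    4: {1: INF, 2: INF, 3: 1, 4: 0, "dest": 3},
}

# J[k][i] = optimal cost from node i to the destination in N - k moves
J = {N - 1: {i: a[i]["dest"] for i in a}}               # (2.24a)
for k in range(N - 2, -1, -1):                          # (2.24b)
    J[k] = {i: min(a[i][j] + J[k + 1][j] for j in a) for i in a}

print(J[0])   # optimal path costs from each node
```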

Fig. 2.5. Shortest path problem I (transition costs)


Example (shortest path problem I): We consider a graph with four nodes i = 1, 2, 3, 4 and destination 5. All nodes are interconnected and both directions are possible with equal costs, as shown in Fig. 2.5. Fig. 2.6 contains the optimal costs-to-go Jk(i) at stage k for each node i = 1, 2, 3, 4.

Fig. 2.6. Shortest path problem I (optimal costs-to-go). The values Jk(i) read off the figure are:

Node i   k = 0   k = 1   k = 2   k = 3
1        2       2       2       2
2        4.5     4.5     5.5     7
3        4       4       4       5
4        3       3       3       3

Example (road network): We consider a network of roads connecting city 'A' (start) and 'J' (destination) via cities 'B'-'I' (cf. Fig. 2.7).

Fig. 2.7. Road network


The nodes Sk at stages 0 ≤ k ≤ 3 and the terminal node S4 are specified as follows:

S0 := {A} , S1 := {B, C, D} , S2 := {E, F, G} , S3 := {H, I} , S4 := {J} .

The costs a^k_{ij} for moving from node i ∈ Sk to node j ∈ Sk+1 are indicated in Fig. 2.7, with the convention that aij = +∞ if there is no arc connecting i and j. We recall the recursions of the backward DP algorithm

J3(i) = ai,J , i ∈ {H, I} ,

Jk(i) = min_{j∈Sk+1} [ a^k_{ij} + Jk+1(j) ] , i ∈ Sk , k = 2, 1, 0 .

Following this recursion, at stage 3 we find:

S3   J3(·)   Decision: go to
H    3       J
I    4       J

Table 2.2: Cost-to-go at stage 3

At stage 2 we obtain

S2   a^2_{ij} + J3(j) for j = H, I   J2(i)   Decision: go to
E    4    8                          4       H
F    9    7                          7       I
G    6    7                          6       H

Table 2.3: Cost-to-go at stage 2

The stage 1 computations reveal

S1   a^1_{ij} + J2(j) for j = E, F, G   J1(i)   Decision: go to
B    11   11   12                       11      E or F
C    7    9    10                       7       E
D    8    8    11                       8       E or F

Table 2.4: Cost-to-go at stage 1


Finally, at stage 0 we get

S0   a^0_{ij} + J1(j) for j = B, C, D   J0(i)   Decision: go to
A    13   11   11                       11      C or D

Table 2.5: Cost-to-go at stage 0
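The whole computation fits into a few lines. The arc lengths below are the ones implied by Fig. 2.7 together with Tables 2.2-2.5 (each table entry equals a^k_{ij} + Jk+1(j)):

```python
# Backward DP over the road network; arcs[i][j] = length of arc (i, j).
arcs = {
    "A": {"B": 2, "C": 4, "D": 3},
    "B": {"E": 7, "F": 4, "G": 6},
    "C": {"E": 3, "F": 2, "G": 4},
    "D": {"E": 4, "F": 1, "G": 5},
    "E": {"H": 1, "I": 4}, "F": {"H": 6, "I": 3}, "G": {"H": 3, "I": 3},
    "H": {"J": 3}, "I": {"J": 4},
}

J, go = {"J": 0}, {}
for stage in (["H", "I"], ["E", "F", "G"], ["B", "C", "D"], ["A"]):
    for i in stage:                      # cost-to-go and decision at node i
        best = min(arcs[i], key=lambda j: arcs[i][j] + J[j])
        J[i], go[i] = arcs[i][best] + J[best], best
print(J["A"], go["A"])   # 11, reached e.g. via A -> C -> E -> H -> J
```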

Example (capital budgeting problem): A publishing company has $5 million to allocate to its three divisions

i = 1 ’Natural Sciences’

i = 2 ’Social Sciences’

i = 3 ’Fine Arts’

for expansion. Each division has submitted 4 proposals containing the cost ci of the expansion and the expected revenue ri (cf. Table 2.6).

Proposal   Division 1   Division 2   Division 3
           c1    r1     c2    r2     c3    r3
1          0     0      0     0      0     0
2          1     5      2     8      1     4
3          2     6      3     9      0     0
4          0     0      4     12     0     0

Table 2.6: Proposals for investment

For the application of the forward DP algorithm we denote by

• stage 1: amount x1 ∈ {0, 1, · · · , 5} of money allocated to division 1,

• stage 2: amount x2 ∈ {0, 1, · · · , 5} of money allocated to divisions 1, 2,

• stage 3: amount x3 = 5 of money allocated to divisions 1, 2, and 3.

We further denote by

• c(ik): the cost of proposal ik at stage k,

• r(ik): the revenue of proposal ik at stage k,

and by

• Jk(xk): the optimal revenue of state xk at stage k.


With these notations, the recursions for the forward DP algorithm are as follows:

J1(x1) = max_{i1 : c(i1)≤x1} r(i1) ,

Jk(xk) = max_{ik : c(ik)≤xk} [ r(ik) + Jk−1(xk − c(ik)) ] , k = 2, 3 .

The stage 1 computations yield

Capital x1   Optimal Proposal   Revenue at stage 1
0            1                  0
1            2                  5
2            3                  6
3            3                  6
4            3                  6
5            3                  6

Table 2.7: Capital budgeting (stage 1)

As an example for the computations at stage 2, we consider the allocation of x2 = 4 million US-Dollars:

• Proposal 1 gives revenue of 0, leaves 4 for stage 1, which returns 6. Total: 6;

• Proposal 2 gives revenue of 8, leaves 2 for stage 1, which returns 6. Total: 14;

• Proposal 3 gives revenue of 9, leaves 1 for stage 1, which returns 5. Total: 14;

• Proposal 4 gives revenue of 12, leaves 0 for stage 1, which returns 0. Total: 12.

Altogether, the stage 2 computations result in

Capital x2   Optimal Proposal   Revenue at stages 1, 2
0            1                  0
1            1                  5
2            2                  8
3            2                  13
4            2 or 3             14
5            4                  17

Table 2.8: Capital budgeting (stage 2)


At stage 3, the only amount of money to allocate is x3 = 5 million US-Dollars. For division 3 we obtain:

• Proposal 1 gives revenue of 0, leaves 5. Previous stages give 17. Total: 17;

• Proposal 2 gives revenue of 4, leaves 4. Previous stages give 14. Total: 18.

This leaves for stage 3:

Capital x3   Optimal Proposal   Revenue at stages 1, 2, 3
5            2                  18

Table 2.9: Capital budgeting (stage 3)

Summarizing, the optimal allocation is: implement proposal 2 at division 3, proposal 2 or 3 at division 2, and, correspondingly, proposal 3 or 2 at division 1. The optimal revenue is $18 million.
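The forward recursion over the three divisions can be checked with a few lines of code (the (cost, revenue) pairs are those of Table 2.6):

```python
proposals = [
    [(0, 0), (1, 5), (2, 6)],            # division 1
    [(0, 0), (2, 8), (3, 9), (4, 12)],   # division 2
    [(0, 0), (1, 4)],                    # division 3
]

B = 5                                    # budget in $ million
J = [0] * (B + 1)                        # optimal revenue before stage 1
for division in proposals:               # stages k = 1, 2, 3
    J = [max(r + J[x - c] for c, r in division if c <= x)
         for x in range(B + 1)]
print(J[B])   # optimal revenue: 18
```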

2.3.4 Partially observable finite-state Markov chains and the Viterbi algorithm

Let S = {1, · · · , N} be a finite set of states and let

(2.25)   P = (pij)_{i,j=1}^{N} ∈ R^{N×N}

be a stochastic matrix, i.e.,

(2.26)   pij ≥ 0 , 1 ≤ i, j ≤ N , and ∑_{j=1}^{N} pij = 1 , 1 ≤ i ≤ N .

The pair {S, P} is called a stationary finite-state Markov chain. We associate with {S, P} a process with initial state x0 ∈ S with probability

(2.27)   P(x0 = i) = r^i_0 , 1 ≤ i ≤ N ,

where for k = 0, 1, · · · , N − 1 transitions from the state xk to the state xk+1 occur with probability

(2.28) P (xk+1 = j|xk = i) = pij , 1 ≤ i, j ≤ N ,

i.e., if the state xk is xk = i, the probability that the new state xk+1 satisfies xk+1 = j is given by pij. The probability that xk = j, if the initial state satisfies x0 = i, is then given by

(2.29)   P(xk = j | x0 = i) = p^k_{ij} , 1 ≤ i, j ≤ N , where p^k_{ij} denotes the (i, j)-th entry of the k-th power P^k.


Denoting by

(2.30)   rk = (r^1_k, · · · , r^N_k)^T ∈ R^N

the probability distribution of the state xk after k transitions; in view of (2.27) and (2.29) we obtain

rk = r0 P^k , 1 ≤ k ≤ N , i.e.,      (2.31)

r^j_k = ∑_{i=1}^{N} P(xk = j | x0 = i) r^i_0 = ∑_{i=1}^{N} p^k_{ij} r^i_0 .
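In matrix form, (2.31) is one line of numpy; the 3-state chain below is purely illustrative:

```python
import numpy as np

P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])          # stochastic matrix: rows sum to 1
r0 = np.array([1.0, 0.0, 0.0])           # start deterministically in state 1
rk = r0 @ np.linalg.matrix_power(P, 4)   # r_4 = r_0 P^4, cf. (2.31)
print(rk, rk.sum())                      # again a probability distribution
```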

We now focus on so-called hidden Markov models or partially observable finite-state Markov chains. The characteristic feature is that the states involved in the transition xk−1 → xk are unknown, but instead an observation zk of probability r(zk; xk−1, xk) is available. We introduce the following notations:

XN = {x0, x1, · · · , xN} state transition sequence ,

ZN = {z1, z2, · · · , zN} observation sequence ,

and

(2.32)   p(XN | ZN) = p(XN, ZN) / p(ZN)   (conditional probability) ,

where p(XN, ZN) and p(ZN) are the (unconditional) probabilities of occurrence of (XN, ZN) and ZN, respectively. The goal is to provide an optimal state estimate

(2.33)   X̂N = {x̂0, x̂1, · · · , x̂N}

in the sense that

(2.34)   p(X̂N | ZN) = max_{XN} p(XN | ZN) .

Since p(ZN) is a positive constant once ZN is known, (2.34) reduces to

(2.35)   p(X̂N, ZN) = max_{XN} p(XN, ZN) .

Lemma 2.2 (Probability of state transition/observation sequences)
Let XN and ZN be the state transition and observation sequences associated with a finite-state Markov chain and let r_{x0} and r(zk; xk−1, xk)


be the probability that the initial state takes the value x0 and the probability of observation zk at the k-th transition xk−1 → xk, respectively. Then, there holds

(2.36)   p(XN, ZN) = r_{x0} ∏_{k=1}^{N} p_{xk−1,xk} r(zk; xk−1, xk) .

Proof. The proof is by induction on N. Indeed, for N = 1 we find

p(X1, Z1) = p(x0, x1, z1) = r_{x0} p(x1, z1 | x0) = r_{x0} p_{x0,x1} r(z1; x0, x1) .

The last step of the induction proof is left as an exercise. □

The maximization problem (2.35) can be equivalently stated as a shortest path problem by means of the so-called trellis diagram, which can be generated by concatenating N + 1 copies of the state space representing stages 0 ≤ k ≤ N, an additional initial node s0, and an additional terminal node sN+1 (cf. Fig. 2.9).

Fig. 2.9. Trellis diagram (stages 0, 1, · · · , N between the initial node s0 and the terminal node sN+1)

An arc connects a node xk−1 of the k-th copy of the state space with a node xk of the (k + 1)-st copy, if the associated transition probability p_{xk−1,xk} is positive.

Lemma 2.3 (Optimal state estimation as a minimization problem)
Using the same notations as in Lemma 2.2, the maximization problem (2.35) is equivalent to

(2.37)   min_{XN} [ −ln(r_{x0}) − ∑_{k=1}^{N} ln( p_{xk−1,xk} r(zk; xk−1, xk) ) ] .


Proof. The maximization of a positive objective functional is equivalent to the minimization of the negative of its logarithm. □

Corollary 2.4 (Associated shortest path problem)
The solution of the minimization problem (2.37) is given by the shortest path in the trellis diagram connecting the initial node s0 with the terminal node sN+1.

Proof. The proof follows directly from the following assignment of lengths to the arcs in the trellis diagram:

(s0, x0) ↦ −ln(r_{x0}) ,
(xk−1, xk) ↦ −ln( p_{xk−1,xk} r(zk; xk−1, xk) ) , 1 ≤ k ≤ N ,
(xN, sN+1) ↦ 0 .

The shortest path in the trellis diagram thus gives the optimal state estimate X̂N. □

The above shortest path problem can be solved by the forward DP algorithm, which in this special case is referred to as the Viterbi algorithm. Denoting by dist(s0, xk) the distance from the initial node s0 to a state xk, we have the recursion

dist(s0, xk+1) = min_{xk : p_{xk,xk+1} > 0} [ dist(s0, xk) − ln( p_{xk,xk+1} r(zk+1; xk, xk+1) ) ] ,      (2.38a)

dist(s0, x0) = −ln(r_{x0}) .      (2.38b)
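A compact implementation of (2.38) might look as follows. For the sketch we assume, as is common, that the observation probability depends only on the reached state, r(z; xk, xk+1) = r(z; xk+1); the two-state chain and all its numbers are hypothetical:

```python
import math

P = [[0.8, 0.2], [0.3, 0.7]]        # transition probabilities p_ij
r0 = [0.5, 0.5]                     # initial distribution r_{x_0}
emit = [[0.9, 0.1], [0.2, 0.8]]     # emit[j][z] = r(z; j)
Z = [0, 1, 1, 0]                    # observed sequence z_1, ..., z_N

dist = [-math.log(p) for p in r0]   # dist(s0, x0) = -ln(r_{x_0}), (2.38b)
pred = []                           # predecessor choices for backtracking
for z in Z:                         # recursion (2.38a)
    best = [min(range(2), key=lambda i: dist[i] - math.log(P[i][j] * emit[j][z]))
            for j in range(2)]
    dist = [dist[i] - math.log(P[i][j] * emit[j][z]) for j, i in enumerate(best)]
    pred.append(best)

# shortest path = most probable state sequence x_0, ..., x_N
x = [min(range(2), key=lambda j: dist[j])]
for best in reversed(pred):
    x.append(best[x[-1]])
x.reverse()
print(x)
```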

Example (Speech recognition: convolutional coding and decoding)
An important issue in information theory is the reliable transmission of binary data over a noisy communication channel, e.g., transmitting speech data from the cell phone of a sender via satellite to the cell phone of a receiver. This can be done by convolutional coding and decoding.

The encoder/decoder scheme is illustrated in Fig. 2.10: We assume that we are given a source-generated binary data sequence

WN = {w1, w2, · · · , wN} , wk ∈ {0, 1} , 1 ≤ k ≤ N .

This sequence is converted into a coded sequence

YN = {y1, y2, · · · , yN} , yk = (y^1_k, · · · , y^n_k)^T , y^i_k ∈ {0, 1} , 1 ≤ i ≤ n , 1 ≤ k ≤ N ,


where each codeword yk is a vector in R^n with binary coordinates. The coded sequence is then transmitted over a noisy channel resulting in a received sequence

ZN = {z1, z2, · · · , zN} ,

which has to be decoded such that the decoded sequence

ŴN = {ŵ1, ŵ2, · · · , ŵN}

is as close to the original binary data sequence WN as possible.

[Data wk (0 or 1) → Encoder → codeword yk → Noisy channel → received zk → Decoder → decoded ŵk]

Fig. 2.10. Encoding and decoding in communications

Convolutional coding: In convolutional coding, the vectors yk of the coded sequence are related to the elements wk of the original binary data sequence by means of a discrete dynamical system of the form

yk = C xk−1 + d wk , 1 ≤ k ≤ N ,      (2.39)
xk = A xk−1 + b wk ,

where x0 ∈ R^m and A ∈ R^{m×m}, C ∈ R^{n×m} as well as b ∈ R^m, d ∈ R^n are given. The products and sums in (2.39) are evaluated using modulo 2 arithmetic. The sequence {x1, · · · , xN} is called the state sequence.

Example: Assume m = 2, n = 3 and

C = ( 1 0 ; 0 1 ; 0 1 ) ,   A = ( 0 1 ; 1 1 ) ,   d = (1, 1, 1)^T ,   b = (0, 1)^T ,

where the rows of the matrices are separated by semicolons.

The evolution of the system (2.39) is shown in Fig. 2.11. The binary number pair on each arc is the data/codeword pair wk/yk. For instance, if xk−1 = 01, a zero data bit wk = 0 results in a transition to xk = 11 and generates the codeword yk = 011.


Old state xk−1 → new state xk, labeled wk/yk:

00 → 00 (0/000),   00 → 01 (1/111),
01 → 11 (0/011),   01 → 10 (1/100),
10 → 01 (0/100),   10 → 00 (1/011),
11 → 10 (0/111),   11 → 11 (1/000).

Fig. 2.11. State transition diagram for convolutional coding

In case N = 4 and x0 = 00, the binary data sequence

{w1, w2, w3, w4} = {1, 0, 0, 1}

generates the state sequence

{x1, x2, x3, x4} = {01, 11, 10, 00}

and the codeword sequence

{y1, y2, y3, y4} = {111, 011, 111, 011} .
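These sequences can be reproduced by a direct mod-2 implementation of (2.39):

```python
import numpy as np

C = np.array([[1, 0], [0, 1], [0, 1]])
A = np.array([[0, 1], [1, 1]])
d = np.array([1, 1, 1])
b = np.array([0, 1])

def encode(w):
    """Run (2.39) in modulo 2 arithmetic, starting from x_0 = 00."""
    x, states, codewords = np.zeros(2, dtype=int), [], []
    for wk in w:
        y = (C @ x + d * wk) % 2        # y_k = C x_{k-1} + d w_k
        x = (A @ x + b * wk) % 2        # x_k = A x_{k-1} + b w_k
        codewords.append("".join(map(str, y)))
        states.append("".join(map(str, x)))
    return states, codewords

print(encode([1, 0, 0, 1]))
# (['01', '11', '10', '00'], ['111', '011', '111', '011'])
```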

We assume that the noisy transmission over the channel is such that the codewords yk are received as zk with known conditional probabilities P(zk|yk) and that the events are independent in the sense

(2.40)   P(ZN | YN) = ∏_{k=1}^{N} P(zk | yk) .

The determination of an optimal code can be stated as a maximum likelihood estimation problem: Find ŶN = (ŷ1, · · · , ŷN) such that

P(ZN | ŶN) = max_{YN} P(ZN | YN) ,      (2.41a)

YN satisfies (2.39) .      (2.41b)

In order to reformulate (2.41) as a shortest path problem, we construct a trellis diagram by concatenating N state transition diagrams as in Fig. 2.11 and introduce a dummy initial node s and a dummy terminal node t such that these dummy nodes are connected to x0 and


xN by zero-length arcs, respectively. Observing (2.40), the maximum likelihood estimation problem (2.41) can be rewritten as

minimize   − ∑_{k=1}^{N} ln( P(zk | yk) ) over YN ,      (2.42a)

YN satisfies (2.39) .      (2.42b)

Considering the arc associated with the codeword yk as being of length −ln(P(zk|yk)), the solution of (2.42) is the shortest path from s to t.

The maximum likelihood estimate ŶN can be determined by the Viterbi algorithm, where the shortest distance dk+1(xk+1) from s to any state xk+1 is obtained by the dynamic programming recursion

dk+1(xk+1) = min_{xk} ( dk(xk) − ln( P(zk+1 | yk+1) ) ) ,      (2.43a)

xk is connected to xk+1 by an arc (xk, xk+1) ,      (2.43b)

where yk+1 is the codeword associated with the arc (xk, xk+1).

Finally, the optimal decoded sequence ŴN is retrieved from the maximum likelihood estimate ŶN by using the trellis diagram.


2.4 Label-correcting and branch-and-bound methods

2.4.1 Label-correcting methods

We consider a graph consisting of an initial node s, a terminal node t, intermediate nodes i (i ≠ s, i ≠ t), and arcs connecting the nodes. A node j is called a child of node i, if there exists an arc (i, j) connecting i and j of length aij ≥ 0. The issue is to find the shortest path from the initial node s to the terminal node t.

Label correcting methods are shortest path algorithms that iteratively detect shorter paths from the initial node s to every other node i in the graph. They introduce an additional variable di, called the label of node i, whose value corresponds to the length of the shortest path from s to i found so far. The label dt of the terminal node t is stored in a variable called UPPER. We choose the following initialization of the labels:

• The label ds of the initial node s is initialized by ds = 0 and remains unchanged during the algorithm.

• All other labels di, i ≠ s, are initialized by di = +∞.

The algorithm also features a list of nodes dubbed OPEN (or candidate list) which contains nodes that are candidates for possible inclusion in the shortest path, in a sense that will be made clear later. The initialization of OPEN is done as follows:

• OPEN is initialized by OPEN = {s}, i.e., at the beginning, OPEN only contains the initial node s.

The k-th step of the iteration (k ≥ 1) consists of three substeps which are as follows:

Substep 1: Remove a node i from OPEN according to a strategy that will be described later. In particular, at the first iteration step, i.e., k = 1, the initial node s will be removed from OPEN, since initially OPEN = {s}.

Substep 2: For each child j of i, test whether its label dj can be reduced by checking

(2.44) di + aij < min(dj, UPPER) .

If the test (2.44) is satisfied, set

(2.45) dj = di + aij

and call i the parent of j. Moreover, update the list OPEN and the value of UPPER according to


OPEN = OPEN ∪ {j} , if j ≠ t ,      (2.46a)

UPPER = di + ait , if j = t .      (2.46b)

Substep 3: If OPEN = ∅, terminate the algorithm. Otherwise, set k = k + 1 and go to Substep 1.
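A minimal transcription of Substeps 1-3 with a FIFO candidate list (the Bellman-Ford strategy discussed below) might read as follows; the small graph is hypothetical:

```python
from collections import deque

def label_correcting(graph, s, t):
    """graph[i] = {child j: arc length a_ij}; returns the shortest s-t length."""
    d = {i: float("inf") for i in graph}    # labels d_i
    d[s], upper = 0.0, float("inf")         # d_s = 0, UPPER = +infinity
    OPEN = deque([s])
    while OPEN:                             # Substep 3
        i = OPEN.popleft()                  # Substep 1 (FIFO removal)
        for j, a in graph[i].items():       # Substep 2: test (2.44)
            if d[i] + a < min(d[j], upper):
                if j == t:
                    upper = d[i] + a        # (2.46b)
                else:
                    d[j] = d[i] + a         # (2.45)
                    OPEN.append(j)          # (2.46a)
    return upper

graph = {"s": {"a": 1, "b": 4}, "a": {"b": 2, "t": 6},
         "b": {"t": 1}, "t": {}}
print(label_correcting(graph, "s", "t"))    # 4 (s -> a -> b -> t)
```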

Theorem 2.2 (Finite termination property)
If there exists a path connecting the initial node s with the terminal node t, the label correcting method terminates after a finite number of steps with the length of the shortest path given by the final value of UPPER. Otherwise, it terminates after a finite number of steps with UPPER = +∞.

Proof: The algorithm terminates after a finite number of steps, since there is only a finite number of possible label reductions.

Case 1 (Non-existence of a path from s to t): If there is no path from s to t, a node i that is connected to t by an arc (i, t) can never enter the list OPEN, since otherwise there would be a path from s to t, in contradiction to the assumption. Consequently, UPPER cannot be reduced from its initial value +∞, and hence the algorithm terminates with UPPER = +∞.

Case 2 (Existence of a path from s to t): If there is a path from s to t, there must be a shortest path, since there is only a finite number of paths connecting s and t. Assume that (s, j1, j2, · · · , jk, t) is the shortest path, of length dopt. Each subpath (s, j1, · · · , jm), m ∈ {1, · · · , k}, of length djm must then be the shortest path from s to jm. In contradiction to the assertion, we assume UPPER > dopt upon termination. Then, for all m ∈ {1, · · · , k} we must also have UPPER > djm. Consequently, the node jk never enters the list OPEN with its label set to djk, since otherwise UPPER would be set to dopt in Substep 2 of the algorithm. Using the same argument backwards, it follows that the nodes jm, 1 ≤ m < k, never enter OPEN with their labels set to djm. However, since j1 is a child of the initial node s, at the first iteration j1 must enter OPEN with label dj1, and hence we arrive at a contradiction. □

Example (The traveling salesman problem)
We consider a salesman who lives in a city A and has to visit a certain number N − 1 of other cities for doing business before returning to city A. Knowing the mileage between the cities, the problem is to find the minimum-mileage trip that, starting from city A, visits each other city exactly once and then returns to city A.
The traveling salesman problem can be formulated as a shortest path


problem as follows: We construct a graph where the nodes are ordered sequences of n ≤ N distinct cities, with the length of an arc denoting the mileage between the last two cities in the ordered sequence (cf. Fig. 2.12). The initial node corresponds to city A. We introduce an artificial terminal node t with an arc having the length equal to the mileage between the last city of the sequence and the starting city A. The shortest path from s to t represents the solution of the minimum-mileage problem.

In the case of four cities A, B, C, and D, Fig. 2.12 illustrates the associated graph, whereas Table 2.10 contains the iteration history of the label correcting algorithm.

A
AB   AC   AD
ABC   ABD   ACB   ACD   ADB   ADC
ABCD   ABDC   ACBD   ACDB   ADBC   ADCB

Fig. 2.12. Label correcting algorithm for the traveling salesman problem (nodes are ordered sequences of visited cities; the arc lengths are the corresponding mileages)


Iteration   Node exiting OPEN   OPEN at the end of the iteration   UPPER
0           –                   1                                  ∞
1           1                   2, 7, 10                           ∞
2           2                   3, 5, 7, 10                        ∞
3           3                   4, 5, 7, 10                        ∞
4           4                   5, 7, 10                           43
5           5                   6, 7, 10                           43
6           6                   7, 10                              13
7           7                   8, 10                              13
8           8                   9, 10                              13
9           9                   10                                 13
10          10                  ∅                                  13

Table 2.10: Iteration history of the label correcting algorithm

We finally discuss a variety of strategies for the removal of a node from the list OPEN in Substep 1 of the label correcting method. All of them except Dijkstra's method assume that the list OPEN is an ordered sequence of nodes (queue) with the first node being located at the 'top' of OPEN and the last node at the 'bottom'.

Breadth-first search (Bellman-Ford method): Remove the node from the top of OPEN and place a new node at the bottom ('first-in/first-out policy').

Depth-first search: Remove the node from the top of OPEN and place a new node at the top ('last-in/first-out policy').

D'Esopo-Pape method: Remove the node from the top of OPEN and place a new node

• at the top of OPEN, if it has been in OPEN earlier,

• at the bottom of OPEN, otherwise.

Small-Label-First (SLF) method: Remove the node from the top of OPEN and place a new node i

• at the top of OPEN, if its label di is less than or equal to the label dj of the node j at the top of OPEN,

• at the bottom of OPEN, otherwise.

Dijkstra's method (best-first search): Remove from OPEN the node i with minimum label, i.e.,

(2.47)   di = min_{j∈OPEN} dj .


Since this method does not necessarily assume an ordered list, a new node may enter OPEN at an arbitrary position.

2.4.2 Branch-and-bound methods

We consider the problem of minimizing a function f : X → R where the feasible set X consists of a finite number of feasible points, i.e., card(X) = n < +∞:

(2.48)   minimize f(x) over x ∈ X , card(X) = n .

The principle of branch-and-bound methods is to split the feasible set X into smaller subsets with associated minimization subproblems and to compute bounds for the optimal values of these minimization subproblems, in order to be able to exclude some feasible subsets from further consideration. To be more precise, let us assume that Y1 ⊂ X and Y2 ⊂ X are two subsets of the feasible set X and suppose that

(2.49)   f̲1 ≤ min_{x∈Y1} f(x) ,   f̄2 ≥ min_{x∈Y2} f(x) ,

i.e., f̲1 is a lower bound for the subproblem on Y1 and f̄2 an upper bound for the subproblem on Y2. Then, if

f̄2 ≤ f̲1 ,

the feasible points in Y1 may be disregarded, since their cost cannot be smaller than the cost associated with the optimal solution in Y2.

In order to describe the branch-and-bound method systematically, we denote by 𝒳 a collection of subsets of the feasible set X and construct an acyclic graph associated with 𝒳 such that the nodes of the graph are elements of 𝒳, according to the following rules:

(a) X ∈ 𝒳 ,

(b) for each feasible point x ∈ X we have {x} ∈ 𝒳 ,

(c) each set Y ∈ 𝒳 with card(Y) > 1 is partitioned into sets Yi ∈ 𝒳, 1 ≤ i ≤ k, such that

Yi ≠ Y , 1 ≤ i ≤ k ,   Y = ∪_{i=1}^{k} Yi .

The set Y is called the parent of the sets Yi, 1 ≤ i ≤ k, which are themselves referred to as the children of Y.

(d) each set in 𝒳 different from X has at least one parent.

We thus obtain an acyclic graph whose origin is the set X of all feasible points and whose terminal nodes are the sets {x} of single feasible points. The arcs of the graph are those connecting the parents Y and their children Yi (cf. Fig. 2.13). The length of an arc connecting Y


and Yi will be computed as follows:

(i) Suppose we have at our disposal an algorithm which for every parental node Y is able to compute upper and lower bounds for the associated minimization subproblem, i.e.,

f̲Y ≤ min_{x∈Y} f(x) ≤ f̄Y .

(ii) Assume further that the bounds are exact in the case of single feasible points, so that formally

f̲{x} = f(x) = f̄{x} .

We define

arc(Y, Yi) = f̲Yi − f̲Y .

According to this definition, the length of any path connecting the initial node X with any node Y is given by f̲Y (up to the constant f̲X, by telescoping). Since f̲{x} = f(x), x ∈ X, it follows that the solution of (2.48) corresponds to the shortest path from X to one of the nodes {x}.

X = {1, 2, 3, 4, 5}
{1, 2, 3}   {4, 5}
{1, 2}   {2, 3}   {4}   {5}
{1}   {2}   {3}

Fig. 2.13. Branch-and-bound method

The branch-and-bound method can be formulated as a variation of the label correcting method (see Section 2.4.1), where the list OPEN and the value UPPER are initialized as follows:

OPEN := {X} , UPPER := f̄X .


The k-th iteration consists of the following substeps:

Substep 1: Remove a node Y from OPEN.

Substep 2: For each child Yj of Y, perform the test

f̲Yj < UPPER .

If the test is satisfied, update OPEN according to

OPEN := OPEN ∪ {Yj}

and set

UPPER := f̄Yj , if f̄Yj < UPPER .

In case Yj consists of a single feasible point, mark the solution as being the best solution found so far.

Substep 3: Terminate the algorithm, if OPEN = ∅; the best solution found so far then corresponds to the optimal solution. Otherwise, set k := k + 1 and go to Substep 1.
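The following sketch instantiates these substeps for a toy instance of (2.48): f is minimized over the integer points of an interval, the subsets Y are subintervals, and the lower bound exploits convexity of the illustrative objective (all details are hypothetical):

```python
def branch_and_bound(f, lo, hi, lower_bound):
    """Minimize f over {lo, ..., hi} by interval splitting and pruning."""
    OPEN = [(lo, hi)]                          # OPEN := {X}
    upper, best = float("inf"), None           # UPPER and incumbent
    while OPEN:                                # Substep 3
        a, b = OPEN.pop()                      # Substep 1
        m = (a + b) // 2
        for ya, yb in ((a, m), (m + 1, b)):    # children of Y = (a, b)
            if ya > yb:
                continue
            if ya == yb:                       # single point: bounds exact
                if f(ya) < upper:
                    upper, best = f(ya), ya    # best solution so far
            elif lower_bound(f, ya, yb) < upper:   # Substep 2: the test
                OPEN.append((ya, yb))
    return best, upper

f = lambda x: (x - 7.3) ** 2                   # illustrative objective
lb = lambda f, a, b: f(min(max(7.3, a), b))    # valid for this convex f
print(branch_and_bound(f, 0, 15, lb))          # (7, 0.09...)
```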

It should be noted that the generation of the acyclic graph, i.e., the decomposition of the original minimization problem, is usually not done a priori, but performed during the execution of the algorithm based on the available information. The lower bounds for the minimization subproblems are typically computed by methods from continuous optimization, whereas the upper bounds can often be obtained using appropriate heuristics. For details we refer to [3].

References

[1] D.P. Bertsekas; Dynamic Programming and Optimal Control. Vol. 1. 3rd Edition. Athena Scientific, Belmont, MA, 2005.

[2] D.P. Bertsekas and S.E. Shreve; Stochastic Optimal Control: The Discrete-Time Case. Athena Scientific, Belmont, MA, 1996.

[3] G.L. Nemhauser and L. Wolsey; Integer and Combinatorial Optimization. John Wiley & Sons, New York, 1988.