Answer:
\[
\begin{aligned}
U(s) &= \max_a \Big[ R'(s, a) + \gamma^{1/2} \sum_{pre} T'(s, a, pre) \max_b \big[ R'(pre, b) + \gamma^{1/2} \sum_{s'} T'(pre, b, s')\, U(s') \big] \Big] \\
     &= \max_a \Big[ \sum_{s'} T(s, a, s') \big( R(s, a, s') + \gamma\, U(s') \big) \Big]
\end{aligned}
\]
\[
\begin{aligned}
U(s) &= R'(s) + \gamma^{1/2} \max_a \Big[ \sum_{post} T'(s, a, post) \big( R'(post) + \gamma^{1/2} \max_b \sum_{s'} T'(post, b, s')\, U(s') \big) \Big] \\
     &= \max_a \Big[ R(s, a) + \gamma \sum_{s'} T(s, a, s')\, U(s') \Big]
\end{aligned}
\]
Explanation:
MDPs
MDPs can be formulated with a reward function R(s), with R(s, a) that depends on the action taken, or with R(s, a, s') that depends on both the action taken and the outcome state.
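For reference, the three conventions correspond to the following standard forms of the Bellman equation:
\[
\begin{aligned}
R(s):\quad & U(s) = R(s) + \gamma \max_a \sum_{s'} T(s, a, s')\, U(s') \\
R(s, a):\quad & U(s) = \max_a \Big[ R(s, a) + \gamma \sum_{s'} T(s, a, s')\, U(s') \Big] \\
R(s, a, s'):\quad & U(s) = \max_a \sum_{s'} T(s, a, s') \big[ R(s, a, s') + \gamma\, U(s') \big]
\end{aligned}
\]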
We show how an MDP with a reward function R(s, a, s') can be transformed into a different MDP with reward function R(s, a), such that optimal policies in the new MDP correspond exactly to optimal policies in the original MDP.
One solution is to define a 'pre-state' pre(s, a, s') for every s, a, s', such that executing a in s leads not to s' but to pre(s, a, s'). From the pre-state there is only one action b, which always leads to s'. Let the new MDP have transition T', reward R', and discount γ'.
\[
\begin{aligned}
T'(s, a, pre(s, a, s')) &= T(s, a, s') \\
T'(pre(s, a, s'), b, s') &= 1 \\
R'(s, a) &= 0 \\
R'(pre(s, a, s'), b) &= \gamma^{-1/2}\, R(s, a, s') \\
\gamma' &= \gamma^{1/2}
\end{aligned}
\]
Then, using pre as shorthand for pre(s, a, s’):
\[
\begin{aligned}
U(s) &= \max_a \Big[ R'(s, a) + \gamma^{1/2} \sum_{pre} T'(s, a, pre) \max_b \big[ R'(pre, b) + \gamma^{1/2} \sum_{s'} T'(pre, b, s')\, U(s') \big] \Big] \\
&= \max_a \Big[ \gamma^{1/2} \sum_{s'} T(s, a, s') \big( \gamma^{-1/2} R(s, a, s') + \gamma^{1/2}\, U(s') \big) \Big] \\
&= \max_a \Big[ \sum_{s'} T(s, a, s') \big( R(s, a, s') + \gamma\, U(s') \big) \Big]
\end{aligned}
\]
which is the Bellman equation of the original MDP.
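As a quick sanity check on the pre-state construction, the following Python sketch builds T', R', and γ' = γ^{1/2} for a small hand-made MDP with R(s, a, s') and runs value iteration on both MDPs; the utilities of the original states should coincide. The toy transition table, state names, and helper functions are illustrative assumptions, not part of the original solution.

```python
# A minimal sketch (illustrative assumptions only: the toy MDP, state names,
# and helpers below are not from the original solution). It checks numerically
# that the pre-state construction preserves utilities: value iteration on a
# small MDP with R(s, a, s') should agree with value iteration on the
# transformed MDP that uses R'(s, a) = 0, pre-states, and gamma' = sqrt(gamma).
import math

gamma = 0.9

# Toy MDP: T[s][a] maps successor state s' to its probability.
T = {
    'x': {'go': {'y': 0.8, 'x': 0.2}, 'stay': {'x': 1.0}},
    'y': {'go': {'x': 1.0},           'stay': {'y': 1.0}},
}
# Reward depends on the outcome state as well: R[s][a][s'] = R(s, a, s').
R = {
    'x': {'go': {'y': 5.0, 'x': 0.0}, 'stay': {'x': 1.0}},
    'y': {'go': {'x': 0.0},           'stay': {'y': 2.0}},
}

def value_iteration(trans, reward, disc, iters=2000):
    """Synchronous value iteration for the Bellman form with R(s, a, s')."""
    U = {s: 0.0 for s in trans}
    for _ in range(iters):
        U = {s: max(sum(p * (reward[s][a][s2] + disc * U[s2])
                        for s2, p in trans[s][a].items())
                    for a in trans[s])
             for s in trans}
    return U

# Build the transformed MDP: every (s, a, s') gets its own pre-state.
g_half = math.sqrt(gamma)
T2, R2 = {}, {}
for s in T:
    T2[s], R2[s] = {}, {}
    for a in T[s]:
        T2[s][a], R2[s][a] = {}, {}
        for s2, p in T[s][a].items():
            pre = ('pre', s, a, s2)
            T2[s][a][pre] = p                            # T'(s, a, pre) = T(s, a, s')
            R2[s][a][pre] = 0.0                          # R'(s, a) = 0
            T2[pre] = {'b': {s2: 1.0}}                   # T'(pre, b, s') = 1
            R2[pre] = {'b': {s2: R[s][a][s2] / g_half}}  # R'(pre, b) = gamma^(-1/2) R(s, a, s')

# R2 is stored in R(s, a, s') shape purely so value_iteration can be reused;
# its entries do not actually vary with the successor, so it is an R'(s, a).
U_orig = value_iteration(T, R, gamma)
U_new = value_iteration(T2, R2, g_half)

for s in T:   # utilities of the original states should agree
    print(s, round(U_orig[s], 4), round(U_new[s], 4))
```

If the construction is right, the two columns printed for each original state coincide (up to the tolerance of value iteration), which is the sense in which the two MDPs share the same optimal behaviour.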
Now do the same to convert MDPs with R(s, a) into MDPs with R(s).
Similar to part (c), create a state post(s, a) for every s, a such that
\[
\begin{aligned}
T'(s, a, post(s, a)) &= 1 \\
T'(post(s, a), b, s') &= T(s, a, s') \\
R'(s) &= 0 \\
R'(post(s, a)) &= \gamma^{-1/2}\, R(s, a) \\
\gamma' &= \gamma^{1/2}
\end{aligned}
\]
Then, using post as shorthand for post(s, a):
\[
\begin{aligned}
U(s) &= R'(s) + \gamma^{1/2} \max_a \Big[ \sum_{post} T'(s, a, post) \big( R'(post) + \gamma^{1/2} \max_b \sum_{s'} T'(post, b, s')\, U(s') \big) \Big] \\
&= \gamma^{1/2} \max_a \Big[ \gamma^{-1/2} R(s, a) + \gamma^{1/2} \sum_{s'} T(s, a, s')\, U(s') \Big] \\
&= \max_a \Big[ R(s, a) + \gamma \sum_{s'} T(s, a, s')\, U(s') \Big]
\end{aligned}
\]
which is the Bellman equation of the original MDP with reward R(s, a).
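The post-state construction can be checked the same way. The sketch below (again with illustrative assumptions; the toy MDP and helper names are not from the original solution) converts an MDP with R(s, a) into one with a purely state-based reward R'(s) and discount γ^{1/2}, and compares the utilities produced by value iteration.

```python
# A minimal sketch: check that the post-state construction preserves
# utilities. An MDP with R(s, a) is turned into one whose reward R'(s)
# depends only on the state, with discount gamma' = sqrt(gamma); value
# iteration on both should give matching utilities for the original states.
import math

gamma = 0.9

# Toy MDP with action-dependent reward R[s][a] = R(s, a).
T = {
    'x': {'go': {'y': 0.8, 'x': 0.2}, 'stay': {'x': 1.0}},
    'y': {'go': {'x': 1.0},           'stay': {'y': 1.0}},
}
R = {
    'x': {'go': 4.0, 'stay': 1.0},
    'y': {'go': 0.0, 'stay': 2.0},
}

def vi_rsa(trans, reward, disc, iters=2000):
    """Value iteration for U(s) = max_a [ R(s, a) + disc * sum_s' T(s, a, s') U(s') ]."""
    U = {s: 0.0 for s in trans}
    for _ in range(iters):
        U = {s: max(reward[s][a] + disc * sum(p * U[s2] for s2, p in trans[s][a].items())
                    for a in trans[s])
             for s in trans}
    return U

def vi_rs(trans, reward, disc, iters=2000):
    """Value iteration for U(s) = R(s) + disc * max_a sum_s' T(s, a, s') U(s')."""
    U = {s: 0.0 for s in trans}
    for _ in range(iters):
        U = {s: reward[s] + disc * max(sum(p * U[s2] for s2, p in trans[s][a].items())
                                       for a in trans[s])
             for s in trans}
    return U

# Build the transformed MDP: every (s, a) gets its own post-state.
g_half = math.sqrt(gamma)
T2, R2 = {}, {}
for s in T:
    T2[s] = {}
    R2[s] = 0.0                           # R'(s) = 0 for original states
    for a in T[s]:
        post = ('post', s, a)
        T2[s][a] = {post: 1.0}            # T'(s, a, post) = 1
        T2[post] = {'b': dict(T[s][a])}   # T'(post, b, s') = T(s, a, s')
        R2[post] = R[s][a] / g_half       # R'(post) = gamma^(-1/2) R(s, a)

U_orig = vi_rsa(T, R, gamma)
U_new = vi_rs(T2, R2, g_half)

for s in T:   # utilities of the original states should agree
    print(s, round(U_orig[s], 4), round(U_new[s], 4))
```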