Answer:
Step-by-step explanation:
1. What are the states and the actions for this MDP?
States: the number of points accumulated so far (the amount you would collect if you stopped), plus a terminal state; that is, 0, 1, 2, 3, 4, 5, 6, 7, DONE.
Actions: TOSS, STOP.
2. What is the transition function and the reward function for this MDP?
Transition function:
T(S_i, TOSS, S_{i+3}) = 0.4 if i < 3
T(S_i, TOSS, DONE) = 0.4 if i ≥ 3
T(S_i, TOSS, S_{i+1}) = 0.6 if i < 7
T(S_i, TOSS, DONE) = 0.6 if i = 7
T(S_i, STOP, DONE) = 1
In words: a toss gains 3 points with probability 0.4, but a heads toss taken with 3 or more points ends the game with nothing; a toss gains 1 point with probability 0.6, and this tails outcome ends the game only from state 7 (so from S_7 both outcomes lead to DONE and the probabilities sum to 1).
Reward function:
R(S_i, TOSS, ANY) = 0
R(S_i, STOP, DONE) = i
R(DONE, STOP, DONE) = 0
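
To make the model concrete, here is a minimal Python sketch of T and R exactly as written above. The state encoding (plain integers 0-7 plus the string "DONE") and the names transitions and reward are illustrative choices, not part of the original answer.

def transitions(s, a):
    """Return [(next_state, probability), ...] for taking action a in state s."""
    if s == "DONE":
        return [("DONE", 1.0)]                     # terminal state is absorbing
    if a == "STOP":
        return [("DONE", 1.0)]                     # T(S_i, STOP, DONE) = 1
    # TOSS: heads (prob 0.4) gains 3 points but ends the game once s >= 3;
    #       tails (prob 0.6) gains 1 point, ending the game only from s = 7.
    heads = (s + 3, 0.4) if s < 3 else ("DONE", 0.4)
    tails = (s + 1, 0.6) if s < 7 else ("DONE", 0.6)
    return [heads, tails]

def reward(s, a, s_next):
    """Reward collected on the transition (s, a, s_next)."""
    if a == "STOP" and s != "DONE":
        return s                                   # R(S_i, STOP, DONE) = i
    return 0                                       # tossing (and anything from DONE) pays 0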
3. What is the optimal policy for this MDP? Please write down the steps to show how
you get the optimal policy.
Optimal policy: TOSS in states 0, 1, and 2; STOP in states 3 through 7.
Your answer should include the steps of value iteration. At each iteration k, update
V_{k+1}(S_i) = max( i, 0.4 * V_k(heads) + 0.6 * V_k(tails) ),
where heads = S_{i+3} if i < 3 (DONE otherwise), tails = S_{i+1} if i < 7 (DONE otherwise), and V_k(DONE) = 0: stopping pays i immediately, while tossing pays nothing now (no discount is specified, so gamma = 1 is assumed). Value iteration converges at iteration 3; the result of iteration 3 is as follows:
V3: 0: 4.5 from Toss; 1: 5.4 from Toss; 2: 5.9 from Toss; 3: 3 from Stop; 4: 4 from Stop; 5: 5 from Stop; 6: 6 from Stop; 7: 7 from Stop.
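
The policy can be checked mechanically with a short value-iteration loop built on the transitions/reward sketch above; the function name and the convergence tolerance are illustrative assumptions. Under the transition and reward functions as stated, this reproduces the optimal policy above (TOSS in 0, 1, 2; STOP in 3 through 7).

def value_iteration(gamma=1.0, tol=1e-9):
    """Iterate Bellman backups until the values stop changing."""
    states = list(range(8)) + ["DONE"]
    V = {s: 0.0 for s in states}
    while True:
        new_V, policy = {}, {}
        for s in states:
            if s == "DONE":
                new_V[s], policy[s] = 0.0, None    # terminal state has value 0
                continue
            # Q(s, a) = sum over s' of T(s, a, s') * (R(s, a, s') + gamma * V(s'))
            q = {a: sum(p * (reward(s, a, s2) + gamma * V[s2])
                        for s2, p in transitions(s, a))
                 for a in ("TOSS", "STOP")}
            policy[s] = max(q, key=q.get)          # greedy action
            new_V[s] = q[policy[s]]
        if max(abs(new_V[s] - V[s]) for s in states) < tol:
            return new_V, policy
        V = new_V

V, pi = value_iteration()
print({s: pi[s] for s in range(8)})                # TOSS for 0-2, STOP for 3-7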