Ahat [919]
3 years ago
15

Show how an MDP with a reward function R(s, a, s') can be transformed into a different MDP with reward function R(s, a), such that optimal policies in the new MDP correspond exactly to optimal policies in the original MDP.
Engineering
1 answer:
sasho [114]3 years ago
7 0

Answer:

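$$U(s) = \max_a \Big[ R'(s, a) + \gamma^{1/2} \sum_{\mathrm{pre}} T'(s, a, \mathrm{pre}) \max_b \Big[ R'(\mathrm{pre}, b) + \gamma^{1/2} \sum_{s'} T'(\mathrm{pre}, b, s') \, U(s') \Big] \Big]$$

$$U(s) = \max_a \sum_{s'} T(s, a, s') \big[ R(s, a, s') + \gamma U(s') \big]$$

$$U(s) = R'(s) + \gamma^{1/2} \max_a \Big[ \sum_{\mathrm{post}} T'(s, a, \mathrm{post}) \Big( R'(\mathrm{post}) + \gamma^{1/2} \max_b \sum_{s'} T'(\mathrm{post}, b, s') \, U(s') \Big) \Big]$$

$$U(s) = \max_a \Big[ R(s, a) + \gamma \sum_{s'} T(s, a, s') \, U(s') \Big]$$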

Explanation:

MDPs

MDPs can be formulated with a reward function R(s) that depends only on the current state, R(s, a) that also depends on the action taken, or R(s, a, s') that depends on the action taken and the outcome state.
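
For reference, the three reward conventions lead to the following Bellman equations. This is a sketch in the standard textbook form, assuming a transition model T(s, a, s') and a discount factor γ as used in the rest of this answer:

```latex
\begin{align*}
R(s):\quad       & U(s) = R(s) + \gamma \max_a \sum_{s'} T(s, a, s')\, U(s') \\
R(s, a):\quad    & U(s) = \max_a \Big[ R(s, a) + \gamma \sum_{s'} T(s, a, s')\, U(s') \Big] \\
R(s, a, s'):\quad & U(s) = \max_a \sum_{s'} T(s, a, s') \big[ R(s, a, s') + \gamma\, U(s') \big]
\end{align*}
```

Showing that two MDPs have corresponding optimal policies then amounts to showing that one of these equations reduces to another on the original states.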

The task is to show how an MDP with a reward function R(s, a, s') can be transformed into a different MDP with reward function R(s, a), such that optimal policies in the new MDP correspond exactly to optimal policies in the original MDP.

One solution is to define a 'pre-state' pre(s, a, s') for every s, a, s' such that executing a in s leads not to s' but to pre(s, a, s'). From the pre-state there is only one action b, which always leads to s'. Let the new MDP have transition model T', reward R', and discount γ':

$$
\begin{aligned}
T'(s, a, \mathrm{pre}(s, a, s')) &= T(s, a, s') \\
T'(\mathrm{pre}(s, a, s'), b, s') &= 1 \\
R'(s, a) &= 0 \\
R'(\mathrm{pre}(s, a, s'), b) &= \gamma^{-1/2} R(s, a, s') \\
\gamma' &= \gamma^{1/2}
\end{aligned}
$$

Then, using pre as shorthand for pre(s, a, s’):

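$$U(s) = \max_a \Big[ R'(s, a) + \gamma^{1/2} \sum_{\mathrm{pre}} T'(s, a, \mathrm{pre}) \max_b \Big[ R'(\mathrm{pre}, b) + \gamma^{1/2} \sum_{s'} T'(\mathrm{pre}, b, s') \, U(s') \Big] \Big]$$

Substituting the definitions (R'(s, a) = 0, the single action b with T'(pre, b, s') = 1, and R'(pre, b) = γ^(-1/2) R(s, a, s')), the two factors of γ^(1/2) combine and the equation becomes

$$U(s) = \max_a \sum_{s'} T(s, a, s') \big[ R(s, a, s') + \gamma U(s') \big]$$

which is the Bellman equation of the original MDP, so the utilities and optimal actions at the original states are unchanged.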
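
The pre-state construction can also be checked numerically. The following is a minimal sketch, not a definitive implementation: the two-state MDP, its numbers, and the helper names `actions_in`, `solve_sas`, `solve_sa`, and `add_pre_states` are all made up for illustration. It solves the original R(s, a, s') MDP and the transformed R(s, a) MDP with plain value iteration and compares the results on the original states.

```python
import math

def actions_in(T, s):
    """Actions available in state s, read off the keys of T."""
    return [a for (s1, a) in T if s1 == s]

def solve_sas(T, R, gamma, sweeps=1000):
    """Value iteration with rewards R[(s, a, s2)]; T[(s, a)] = {s2: prob}."""
    states = {s for s, _ in T}
    U = {s: 0.0 for s in states}
    def q(s, a):
        return sum(p * (R[(s, a, s2)] + gamma * U[s2])
                   for s2, p in T[(s, a)].items())
    for _ in range(sweeps):
        U = {s: max(q(s, a) for a in actions_in(T, s)) for s in states}
    policy = {s: max(actions_in(T, s), key=lambda a: q(s, a)) for s in states}
    return U, policy

def solve_sa(T, R, gamma, sweeps=1000):
    """Value iteration with rewards R[(s, a)]."""
    states = {s for s, _ in T}
    U = {s: 0.0 for s in states}
    def q(s, a):
        return R[(s, a)] + gamma * sum(p * U[s2] for s2, p in T[(s, a)].items())
    for _ in range(sweeps):
        U = {s: max(q(s, a) for a in actions_in(T, s)) for s in states}
    policy = {s: max(actions_in(T, s), key=lambda a: q(s, a)) for s in states}
    return U, policy

def add_pre_states(T, R, gamma):
    """Build T', R', gamma' from the definitions in the answer."""
    T2, R2 = {}, {}
    for (s, a), dist in T.items():
        R2[(s, a)] = 0.0                                   # R'(s, a) = 0
        T2[(s, a)] = {("pre", s, a, s2): p for s2, p in dist.items()}
        for s2 in dist:
            pre = ("pre", s, a, s2)
            T2[(pre, "b")] = {s2: 1.0}                     # b always leads to s'
            R2[(pre, "b")] = R[(s, a, s2)] / math.sqrt(gamma)
    return T2, R2, math.sqrt(gamma)                        # gamma' = gamma^(1/2)

# Hypothetical toy MDP, chosen only for illustration.
T = {("x", "stay"): {"x": 0.9, "y": 0.1}, ("x", "go"): {"x": 0.2, "y": 0.8},
     ("y", "stay"): {"y": 1.0},           ("y", "go"): {"x": 0.5, "y": 0.5}}
R = {("x", "stay", "x"): 0.0, ("x", "stay", "y"): 1.0,
     ("x", "go", "x"): -1.0,  ("x", "go", "y"): 2.0,
     ("y", "stay", "y"): 0.5, ("y", "go", "x"): 3.0, ("y", "go", "y"): 0.0}
gamma = 0.9

U, pi = solve_sas(T, R, gamma)
T2, R2, gamma2 = add_pre_states(T, R, gamma)
U2, pi2 = solve_sa(T2, R2, gamma2)

for s in sorted(U):  # original states only; pre-states have the single action b
    print(s, pi[s], pi2[s], round(U[s], 6), round(U2[s], 6))
```

On the original states the two utility columns and the two policy columns should agree up to value-iteration error. The new MDP takes two steps per original step, which is why its discount is the square root of γ.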

Now do the same to convert MDPs with R(s, a) into MDPs with R(s).

Similar to the pre-state construction above, create a state post(s, a) for every s, a such that

$$
\begin{aligned}
T'(s, a, \mathrm{post}(s, a)) &= 1 \\
T'(\mathrm{post}(s, a), b, s') &= T(s, a, s') \\
R'(s) &= 0 \\
R'(\mathrm{post}(s, a)) &= \gamma^{-1/2} R(s, a) \\
\gamma' &= \gamma^{1/2}
\end{aligned}
$$

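Then, using post as shorthand for post(s, a):

$$U(s) = R'(s) + \gamma^{1/2} \max_a \Big[ \sum_{\mathrm{post}} T'(s, a, \mathrm{post}) \Big( R'(\mathrm{post}) + \gamma^{1/2} \max_b \sum_{s'} T'(\mathrm{post}, b, s') \, U(s') \Big) \Big]$$

Substituting the definitions (R'(s) = 0, T'(s, a, post) = 1, R'(post) = γ^(-1/2) R(s, a), and T'(post, b, s') = T(s, a, s') for the single action b) gives

$$U(s) = \max_a \Big[ R(s, a) + \gamma \sum_{s'} T(s, a, s') \, U(s') \Big]$$

which is the Bellman equation of the MDP with reward R(s, a).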
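
The post-state construction can be checked the same way. Again this is only a sketch under stated assumptions: the toy MDP and the helper names `actions_in`, `solve_sa`, `solve_s`, and `add_post_states` are hypothetical and do not come from the original answer. It compares value iteration on an R(s, a) MDP with value iteration on the transformed R(s) MDP.

```python
import math

def actions_in(T, s):
    """Actions available in state s, read off the keys of T."""
    return [a for (s1, a) in T if s1 == s]

def solve_sa(T, R, gamma, sweeps=1000):
    """Value iteration with rewards R[(s, a)]; T[(s, a)] = {s2: prob}."""
    states = {s for s, _ in T}
    U = {s: 0.0 for s in states}
    def q(s, a):
        return R[(s, a)] + gamma * sum(p * U[s2] for s2, p in T[(s, a)].items())
    for _ in range(sweeps):
        U = {s: max(q(s, a) for a in actions_in(T, s)) for s in states}
    policy = {s: max(actions_in(T, s), key=lambda a: q(s, a)) for s in states}
    return U, policy

def solve_s(T, R, gamma, sweeps=1000):
    """Value iteration with state-only rewards R[s]."""
    states = {s for s, _ in T}
    U = {s: 0.0 for s in states}
    def ev(s, a):
        # Expected utility of the successor state after doing a in s.
        return sum(p * U[s2] for s2, p in T[(s, a)].items())
    for _ in range(sweeps):
        U = {s: R[s] + gamma * max(ev(s, a) for a in actions_in(T, s))
             for s in states}
    policy = {s: max(actions_in(T, s), key=lambda a: ev(s, a)) for s in states}
    return U, policy

def add_post_states(T, R, gamma):
    """Build T', R', gamma' from the post-state definitions in the answer."""
    T2, R2 = {}, {}
    for (s, a), dist in T.items():
        post = ("post", s, a)
        R2[s] = 0.0                                # R'(s) = 0
        T2[(s, a)] = {post: 1.0}                   # a leads surely to post(s, a)
        T2[(post, "b")] = dict(dist)               # b reproduces T(s, a, .)
        R2[post] = R[(s, a)] / math.sqrt(gamma)    # R'(post) = gamma^(-1/2) R(s, a)
    return T2, R2, math.sqrt(gamma)                # gamma' = gamma^(1/2)

# Hypothetical toy MDP, chosen only for illustration.
T = {("x", "stay"): {"x": 0.9, "y": 0.1}, ("x", "go"): {"x": 0.2, "y": 0.8},
     ("y", "stay"): {"y": 1.0},           ("y", "go"): {"x": 0.5, "y": 0.5}}
R = {("x", "stay"): 0.1, ("x", "go"): 1.4, ("y", "stay"): 0.5, ("y", "go"): 1.5}
gamma = 0.9

U, pi = solve_sa(T, R, gamma)
T2, R2, gamma2 = add_post_states(T, R, gamma)
U2, pi2 = solve_s(T2, R2, gamma2)

for s in sorted(U):  # original states only; post-states have the single action b
    print(s, pi[s], pi2[s], round(U[s], 6), round(U2[s], 6))
```

As with the pre-state version, the utilities and greedy actions at the original states should match up to value-iteration error, because the reward stored at post(s, a) is scaled by γ^(-1/2) to cancel the extra half-step of discounting.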
