Ahat [919]
3 years ago
15

Show how an MDP with a reward function R(s, a, s’) can be transformed into a different MDP with reward function R(s, a), such that optimal policies in the new MDP correspond exactly to optimal policies in the original MDP.
Engineering
1 answer:
sasho [114]3 years ago
7 0

Answer:

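$$U(s) = \max_a \Big[ R'(s,a) + \gamma^{1/2} \sum_{\mathit{pre}} T'(s,a,\mathit{pre}) \Big( \max_b \Big[ R'(\mathit{pre},b) + \gamma^{1/2} \sum_{s'} T'(\mathit{pre},b,s')\, U(s') \Big] \Big) \Big]$$

$$U(s) = \max_a \Big[ \sum_{s'} T(s,a,s') \big( R(s,a,s') + \gamma U(s') \big) \Big]$$

$$U(s) = R'(s) + \gamma^{1/2} \max_a \Big[ \sum_{\mathit{post}} T'(s,a,\mathit{post}) \Big( R'(\mathit{post}) + \gamma^{1/2} \max_b \Big[ \sum_{s'} T'(\mathit{post},b,s')\, U(s') \Big] \Big) \Big]$$

$$U(s) = \max_a \Big[ R(s,a) + \gamma \sum_{s'} T(s,a,s')\, U(s') \Big]$$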

Explanation:

MDPs

MDPs can be formulated with a reward function R(s) that depends only on the state, R(s, a) that also depends on the action taken, or R(s, a, s’) that depends on the action taken and the outcome state.

The task is to show how an MDP with reward function R(s, a, s’) can be transformed into a different MDP with reward function R(s, a), such that optimal policies in the new MDP correspond exactly to optimal policies in the original MDP.

One solution is to define a ’pre-state’ pre(s, a, s’) for every s, a, s’ such that executing a in s leads not to s’ but to pre(s, a, s’). From the pre-state there is only one action b, which always leads to s’. Let the new MDP have transition T’, reward R’, and discount γ’, defined as follows.

$$T'(s, a, \mathit{pre}(s,a,s')) = T(s,a,s')$$

$$T'(\mathit{pre}(s,a,s'), b, s') = 1$$

$$R'(s,a) = 0$$

$$R'(\mathit{pre}(s,a,s'), b) = \gamma^{-1/2} R(s,a,s')$$

$$\gamma' = \gamma^{1/2}$$

Then, using pre as shorthand for pre(s, a, s’):

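$$U(s) = \max_a \Big[ R'(s,a) + \gamma^{1/2} \sum_{\mathit{pre}} T'(s,a,\mathit{pre}) \Big( \max_b \Big[ R'(\mathit{pre},b) + \gamma^{1/2} \sum_{s'} T'(\mathit{pre},b,s')\, U(s') \Big] \Big) \Big]$$

Substituting R’(s, a) = 0, T’(s, a, pre) = T(s, a, s’), R’(pre, b) = γ^(-1/2) R(s, a, s’), and T’(pre, b, s’) = 1 for the single action b, the inner maximization reduces to γ^(-1/2) R(s, a, s’) + γ^(1/2) U(s’). Multiplying by the outer factor γ^(1/2) recovers the Bellman equation of the original MDP:

$$U(s) = \max_a \Big[ \sum_{s'} T(s,a,s') \big( R(s,a,s') + \gamma U(s') \big) \Big]$$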
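The pre-state construction can be checked numerically. The following is a minimal Python sketch, not a definitive implementation: the two-state MDP, its numbers, and all function names (`add_pre_states`, `value_iteration`, `greedy`) are made up for illustration. It builds the transformed MDP and compares utilities and greedy policies on the original states.

```python
import math

def value_iteration(states, actions_of, q, iters=1000):
    """Synchronous value iteration; q(s, a, U) is the backed-up value of a in s."""
    U = {s: 0.0 for s in states}
    for _ in range(iters):
        U = {s: max(q(s, a, U) for a in actions_of(s)) for s in states}
    return U

def greedy(states, actions_of, q, U):
    """Greedy policy with respect to U."""
    return {s: max(actions_of(s), key=lambda a: q(s, a, U)) for s in states}

def add_pre_states(states, T, R3, gamma):
    """Build T', R', gamma' from T and R(s, a, s') as in the construction above."""
    g = math.sqrt(gamma)                      # gamma' = gamma^(1/2)
    new_states, T2, R2 = list(states), {}, {}
    for (s, a), dist in T.items():
        R2[(s, a)] = 0.0                      # R'(s, a) = 0
        T2[(s, a)] = {}
        for s2, p in dist.items():
            pre = ("pre", s, a, s2)
            new_states.append(pre)
            T2[(s, a)][pre] = p               # T'(s, a, pre) = T(s, a, s')
            T2[(pre, "b")] = {s2: 1.0}        # T'(pre, b, s') = 1
            R2[(pre, "b")] = R3[(s, a, s2)] / g   # gamma^(-1/2) R(s, a, s')
    return new_states, T2, R2, g

# Hypothetical example MDP (values chosen arbitrarily).
states, acts, gamma = ["x", "y"], ["stay", "go"], 0.9
T = {("x", "stay"): {"x": 0.9, "y": 0.1}, ("x", "go"): {"x": 0.2, "y": 0.8},
     ("y", "stay"): {"y": 1.0},           ("y", "go"): {"x": 0.7, "y": 0.3}}
R3 = {("x", "stay", "x"): 0.0, ("x", "stay", "y"): 5.0,
      ("x", "go", "x"): -1.0,  ("x", "go", "y"): 2.0,
      ("y", "stay", "y"): 1.0,
      ("y", "go", "x"): 3.0,   ("y", "go", "y"): -2.0}

new_states, T2, R2, g = add_pre_states(states, T, R3, gamma)

def q_orig(s, a, U):   # Bellman backup with R(s, a, s')
    return sum(p * (R3[(s, a, s2)] + gamma * U[s2]) for s2, p in T[(s, a)].items())

def q_new(s, a, U):    # Bellman backup with R'(s, a) and discount gamma'
    return R2[(s, a)] + g * sum(p * U[s2] for s2, p in T2[(s, a)].items())

def acts_new(s):       # pre-states (tuples) have the single action b
    return ["b"] if isinstance(s, tuple) else acts

U_orig = value_iteration(states, lambda s: acts, q_orig)
U_new = value_iteration(new_states, acts_new, q_new)
pi_orig = greedy(states, lambda s: acts, q_orig, U_orig)
pi_new = greedy(states, acts_new, q_new, U_new)
print(pi_orig, pi_new)
```

On the original states the two utility tables should agree up to numerical error, and so should the greedy policies, since each Q-value in the new MDP equals the corresponding Q-value in the original one.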

Now do the same to convert MDPs with R(s, a) into MDPs with R(s).

Similar to the pre-state construction above, create a state post(s, a) for every s, a such that

$$T'(s, a, \mathit{post}(s,a)) = 1$$

$$T'(\mathit{post}(s,a), b, s') = T(s,a,s')$$

$$R'(s) = 0$$

$$R'(\mathit{post}(s,a)) = \gamma^{-1/2} R(s,a)$$

$$\gamma' = \gamma^{1/2}$$

Then, using post as shorthand for post(s, a):

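$$U(s) = R'(s) + \gamma^{1/2} \max_a \Big[ \sum_{\mathit{post}} T'(s,a,\mathit{post}) \Big( R'(\mathit{post}) + \gamma^{1/2} \max_b \Big[ \sum_{s'} T'(\mathit{post},b,s')\, U(s') \Big] \Big) \Big]$$

Substituting R’(s) = 0, T’(s, a, post) = 1, R’(post) = γ^(-1/2) R(s, a), and T’(post, b, s’) = T(s, a, s’) for the single action b, and multiplying through by the outer factor γ^(1/2), gives the Bellman equation of the R(s, a) MDP:

$$U(s) = \max_a \Big[ R(s,a) + \gamma \sum_{s'} T(s,a,s')\, U(s') \Big]$$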
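The post-state construction can be checked the same way. This is a rough Python sketch under stated assumptions: the example MDP, its rewards, and the helper names (`add_post_states`, `iterate`, `expect`) are hypothetical, and post-states are indexed by (s, a) only, since the step into them is deterministic.

```python
import math

def expect(T, s, a, U):
    """Expected utility of the successor state after doing a in s."""
    return sum(p * U[s2] for s2, p in T[(s, a)].items())

def iterate(states, backup, iters=1000):
    """Synchronous value iteration with a caller-supplied Bellman backup."""
    U = {s: 0.0 for s in states}
    for _ in range(iters):
        U = {s: backup(s, U) for s in states}
    return U

def add_post_states(states, T, R, gamma):
    """Build T', R'(s), gamma' from T and R(s, a)."""
    g = math.sqrt(gamma)                  # gamma' = gamma^(1/2)
    new_states, T2 = list(states), {}
    R1 = {s: 0.0 for s in states}         # R'(s) = 0 on original states
    for (s, a), dist in T.items():
        post = ("post", s, a)
        new_states.append(post)
        T2[(s, a)] = {post: 1.0}          # T'(s, a, post) = 1
        T2[(post, "b")] = dict(dist)      # T'(post, b, s') = T(s, a, s')
        R1[post] = R[(s, a)] / g          # gamma^(-1/2) R(s, a)
    return new_states, T2, R1, g

# Hypothetical example MDP (values chosen arbitrarily).
states, acts, gamma = ["x", "y"], ["stay", "go"], 0.9
T = {("x", "stay"): {"x": 0.9, "y": 0.1}, ("x", "go"): {"x": 0.2, "y": 0.8},
     ("y", "stay"): {"y": 1.0},           ("y", "go"): {"x": 0.7, "y": 0.3}}
R = {("x", "stay"): 0.5, ("x", "go"): 1.4, ("y", "stay"): 1.0, ("y", "go"): 1.5}

new_states, T2, R1, g = add_post_states(states, T, R, gamma)

def acts_new(s):       # post-states (tuples) have the single action b
    return ["b"] if isinstance(s, tuple) else acts

def backup_orig(s, U): # U(s) = max_a [R(s, a) + gamma * sum T U]
    return max(R[(s, a)] + gamma * expect(T, s, a, U) for a in acts)

def backup_new(s, U):  # U(s) = R'(s) + gamma' * max_a sum T' U
    return R1[s] + g * max(expect(T2, s, a, U) for a in acts_new(s))

U_orig = iterate(states, backup_orig)
U_new = iterate(new_states, backup_new)
pi_orig = {s: max(acts, key=lambda a: R[(s, a)] + gamma * expect(T, s, a, U_orig))
           for s in states}
pi_new = {s: max(acts, key=lambda a: expect(T2, s, a, U_new)) for s in states}
print(pi_orig, pi_new)
```

Because U’(post(s, a)) equals the original Q-value of (s, a) divided by γ^(1/2), ranking actions by the utility of their post-state picks the same action as the original MDP.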
