1answer.
Ahat [919]
3 years ago
15

Show how an MDP with a reward function R(s, a, s’) can be transformed into a different MDP with reward function R(s, a), such that optimal policies in the new MDP correspond exactly to optimal policies in the original MDP.
Engineering
1 answer:
sasho [114]3 years ago
7 0

Answer:

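$$U(s) = \max_a \Big[ R'(s,a) + \gamma^{1/2} \sum_{\mathit{pre}} T'(s,a,\mathit{pre}) \Big( \max_b \Big[ R'(\mathit{pre},b) + \gamma^{1/2} \sum_{s'} T'(\mathit{pre},b,s')\, U(s') \Big] \Big) \Big]$$

$$U(s) = \max_a \Big[ \sum_{s'} T(s,a,s') \big( R(s,a,s') + \gamma\, U(s') \big) \Big]$$

$$U(s) = R'(s) + \gamma^{1/2} \max_a \Big[ \sum_{\mathit{post}} T'(s,a,\mathit{post}) \Big( R'(\mathit{post}) + \gamma^{1/2} \max_b \Big[ \sum_{s'} T'(\mathit{post},b,s')\, U(s') \Big] \Big) \Big]$$

$$U(s) = \max_a \Big[ R(s,a) + \gamma \sum_{s'} T(s,a,s')\, U(s') \Big]$$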

Explanation:

MDPs

MDPs can be formulated with a reward function R(s) that depends only on the state, R(s, a) that also depends on the action taken, or R(s, a, s’) that depends on the action taken and the outcome state.

The task is to show how an MDP with a reward function R(s, a, s’) can be transformed into a different MDP with reward function R(s, a), such that optimal policies in the new MDP correspond exactly to optimal policies in the original MDP.

One solution is to define a ’pre-state’ pre(s, a, s’) for every s, a, s’ such that executing a in s leads not to s’ but to pre(s, a, s’). From the pre-state there is only one action b, which always leads to s’. Let the new MDP have transition model T’, reward function R’, and discount γ’:

$$\begin{aligned}
T'(s, a, \mathit{pre}(s,a,s')) &= T(s,a,s') \\
T'(\mathit{pre}(s,a,s'), b, s') &= 1 \\
R'(s,a) &= 0 \\
R'(\mathit{pre}(s,a,s'), b) &= \gamma^{-1/2}\, R(s,a,s') \\
\gamma' &= \gamma^{1/2}
\end{aligned}$$

Then, using pre as shorthand for pre(s, a, s’):

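$$U(s) = \max_a \Big[ R'(s,a) + \gamma^{1/2} \sum_{\mathit{pre}} T'(s,a,\mathit{pre}) \Big( \max_b \Big[ R'(\mathit{pre},b) + \gamma^{1/2} \sum_{s'} T'(\mathit{pre},b,s')\, U(s') \Big] \Big) \Big]$$

Since R’(s, a) = 0, b is the only action in a pre-state, and T’(pre, b, s’) = 1, the factors γ^{1/2} and γ^{−1/2} cancel on the reward term and multiply to γ on the utility term, which leaves the Bellman equation of the original MDP:

$$U(s) = \max_a \Big[ \sum_{s'} T(s,a,s') \big( R(s,a,s') + \gamma\, U(s') \big) \Big]$$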
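The pre-state construction can also be checked numerically. The following is a minimal Python sketch, not part of the original answer: the two-state MDP, its probabilities and rewards, and the helper names (`value_iteration_sas`, `value_iteration_sa`, `q_sa`, `add_pre_states`) are all assumptions made up for illustration. It runs value iteration on both MDPs and compares utilities and greedy policies on the original states.

```python
# Illustrative sketch: every name and number here is an assumption, not from the answer.

def value_iteration_sas(states, actions, T, R, gamma, iters=500):
    """Original MDP: U(s) = max_a sum_s' T(s,a,s') * (R(s,a,s') + gamma * U(s'))."""
    U = {s: 0.0 for s in states}
    for _ in range(iters):
        U = {s: max(sum(T[s, a, s2] * (R[s, a, s2] + gamma * U[s2]) for s2 in states)
                    for a in actions)
             for s in states}
    return U

def q_sa(s, a, T, R, gamma, U):
    """Q-value in an R(s,a) MDP; T[s, a] maps successor state -> probability."""
    return R[s, a] + gamma * sum(p * U[s2] for s2, p in T[s, a].items())

def value_iteration_sa(states, actions_of, T, R, gamma, iters=1000):
    """Transformed MDP: U(s) = max_a [R(s,a) + gamma * sum_s' T(s,a,s') U(s')]."""
    U = {s: 0.0 for s in states}
    for _ in range(iters):
        U = {s: max(q_sa(s, a, T, R, gamma, U) for a in actions_of[s]) for s in states}
    return U

def add_pre_states(states, actions, T, R, gamma):
    """Build T', R', gamma' with one pre-state per (s, a, s') triple."""
    g = gamma ** 0.5                           # gamma' = gamma^(1/2)
    new_states = list(states)
    actions_of = {s: list(actions) for s in states}
    T2, R2 = {}, {}
    for s in states:
        for a in actions:
            T2[s, a] = {}
            R2[s, a] = 0.0                     # R'(s, a) = 0
            for s2 in states:
                pre = ("pre", s, a, s2)
                new_states.append(pre)
                actions_of[pre] = ["b"]        # single action b in a pre-state
                T2[s, a][pre] = T[s, a, s2]    # T'(s, a, pre) = T(s, a, s')
                T2[pre, "b"] = {s2: 1.0}       # T'(pre, b, s') = 1
                R2[pre, "b"] = R[s, a, s2] / g # gamma^(-1/2) * R(s, a, s')
    return new_states, actions_of, T2, R2, g

# Made-up two-state, two-action MDP with rewards R(s, a, s').
states, actions, gamma = [0, 1], ["stay", "go"], 0.9
T = {(0, "stay", 0): 0.8, (0, "stay", 1): 0.2, (0, "go", 0): 0.1, (0, "go", 1): 0.9,
     (1, "stay", 0): 0.3, (1, "stay", 1): 0.7, (1, "go", 0): 0.6, (1, "go", 1): 0.4}
R = {(0, "stay", 0): 1.0, (0, "stay", 1): 0.0, (0, "go", 0): 0.0, (0, "go", 1): 5.0,
     (1, "stay", 0): 0.0, (1, "stay", 1): 2.0, (1, "go", 0): -1.0, (1, "go", 1): 0.0}

U_orig = value_iteration_sas(states, actions, T, R, gamma)
new_states, actions_of, T2, R2, g = add_pre_states(states, actions, T, R, gamma)
U_all = value_iteration_sa(new_states, actions_of, T2, R2, g)
U_new = {s: U_all[s] for s in states}          # utilities of the original states only

# Greedy policies on the original states in each MDP.
pi_orig = {s: max(actions, key=lambda a: sum(T[s, a, s2] * (R[s, a, s2] + gamma * U_orig[s2])
                                             for s2 in states))
           for s in states}
pi_new = {s: max(actions, key=lambda a: q_sa(s, a, T2, R2, g, U_all)) for s in states}

print(U_orig, U_new, pi_orig, pi_new)
```

The transformed MDP takes two steps for each step of the original, which is why the sketch gives it twice as many iterations.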

Now do the same to convert MDPs with R(s, a) into MDPs with R(s).

As in the pre-state construction above, create a state post(s, a) for every s, a such that

$$\begin{aligned}
T'(s, a, \mathit{post}(s,a)) &= 1 \\
T'(\mathit{post}(s,a), b, s') &= T(s,a,s') \\
R'(s) &= 0 \\
R'(\mathit{post}(s,a)) &= \gamma^{-1/2}\, R(s,a) \\
\gamma' &= \gamma^{1/2}
\end{aligned}$$

Then, using post as shorthand for post(s, a):

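$$U(s) = R'(s) + \gamma^{1/2} \max_a \Big[ \sum_{\mathit{post}} T'(s,a,\mathit{post}) \Big( R'(\mathit{post}) + \gamma^{1/2} \max_b \Big[ \sum_{s'} T'(\mathit{post},b,s')\, U(s') \Big] \Big) \Big]$$

$$U(s) = \max_a \Big[ R(s,a) + \gamma \sum_{s'} T(s,a,s')\, U(s') \Big]$$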
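The post-state construction can be checked the same way. This is again only a sketch under stated assumptions: the example MDP, its numbers, and the helper names (`value_iteration_sa`, `value_iteration_s`, `add_post_states`) are hypothetical and chosen for illustration. Each post-state is keyed by (s, a) alone, since the reward R(s, a) does not depend on the outcome state.

```python
# Illustrative sketch: every name and number here is an assumption, not from the answer.

def value_iteration_sa(states, actions, T, R, gamma, iters=500):
    """Original MDP: U(s) = max_a [R(s,a) + gamma * sum_s' T(s,a,s') U(s')]."""
    U = {s: 0.0 for s in states}
    for _ in range(iters):
        U = {s: max(R[s, a] + gamma * sum(T[s, a, s2] * U[s2] for s2 in states)
                    for a in actions)
             for s in states}
    return U

def value_iteration_s(states, actions_of, T, R, gamma, iters=1000):
    """Transformed MDP: U(s) = R(s) + gamma * max_a sum_s' T(s,a,s') U(s').

    T[s, a] maps successor state -> probability.
    """
    U = {s: 0.0 for s in states}
    for _ in range(iters):
        U = {s: R[s] + gamma * max(sum(p * U[s2] for s2, p in T[s, a].items())
                                   for a in actions_of[s])
             for s in states}
    return U

def add_post_states(states, actions, T, R, gamma):
    """Build T', R', gamma' with one post-state per (s, a) pair."""
    g = gamma ** 0.5                                # gamma' = gamma^(1/2)
    new_states = list(states)
    actions_of = {s: list(actions) for s in states}
    T2 = {}
    R2 = {s: 0.0 for s in states}                   # R'(s) = 0
    for s in states:
        for a in actions:
            post = ("post", s, a)
            new_states.append(post)
            actions_of[post] = ["b"]                # single action b in a post-state
            T2[s, a] = {post: 1.0}                  # T'(s, a, post) = 1
            T2[post, "b"] = {s2: T[s, a, s2] for s2 in states}  # T'(post, b, s') = T(s, a, s')
            R2[post] = R[s, a] / g                  # gamma^(-1/2) * R(s, a)
    return new_states, actions_of, T2, R2, g

# Made-up two-state, two-action MDP with rewards R(s, a).
states, actions, gamma = [0, 1], ["stay", "go"], 0.9
T = {(0, "stay", 0): 0.8, (0, "stay", 1): 0.2, (0, "go", 0): 0.1, (0, "go", 1): 0.9,
     (1, "stay", 0): 0.3, (1, "stay", 1): 0.7, (1, "go", 0): 0.6, (1, "go", 1): 0.4}
R = {(0, "stay"): 1.0, (0, "go"): 4.0, (1, "stay"): 1.5, (1, "go"): -0.5}

U_orig = value_iteration_sa(states, actions, T, R, gamma)
new_states, actions_of, T2, R2, g = add_post_states(states, actions, T, R, gamma)
U_all = value_iteration_s(new_states, actions_of, T2, R2, g)
U_new = {s: U_all[s] for s in states}               # utilities of the original states only

# Greedy policies on the original states in each MDP.
pi_orig = {s: max(actions, key=lambda a: R[s, a] + gamma * sum(T[s, a, s2] * U_orig[s2]
                                                               for s2 in states))
           for s in states}
pi_new = {s: max(actions, key=lambda a: sum(p * U_all[s2] for s2, p in T2[s, a].items()))
          for s in states}

print(U_orig, U_new, pi_orig, pi_new)
```

In the transformed MDP the greedy action at an original state is simply the one whose post-state has the highest utility, because every action there leads to its post-state with probability 1.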

