1answer.
Ahat [919]
3 years ago
15

Show how an MDP with a reward function R(s, a, s’) can be transformed into a different MDP with reward function R(s, a), such that optimal policies in the new MDP correspond exactly to optimal policies in the original MDP.
Engineering
1 answer:
sasho [114] 3 years ago
7 0

Answer:

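$$U(s) = \max_a \Big[ R'(s, a) + \gamma^{1/2} \sum_{\mathrm{pre}} T'(s, a, \mathrm{pre}) \Big( \max_b \Big[ R'(\mathrm{pre}, b) + \gamma^{1/2} \sum_{s'} T'(\mathrm{pre}, b, s')\, U(s') \Big] \Big) \Big]$$

$$U(s) = \max_a \Big[ \sum_{s'} T(s, a, s') \big( R(s, a, s') + \gamma\, U(s') \big) \Big]$$

$$U(s) = R'(s) + \gamma^{1/2} \max_a \Big[ \sum_{\mathrm{post}} T'(s, a, \mathrm{post}) \Big( R'(\mathrm{post}) + \gamma^{1/2} \max_b \Big[ \sum_{s'} T'(\mathrm{post}, b, s')\, U(s') \Big] \Big) \Big]$$

$$U(s) = \max_a \Big[ R(s, a) + \gamma \sum_{s'} T(s, a, s')\, U(s') \Big]$$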

Explanation:

MDPs

MDPs can be formulated with a reward function R(s), with R(s, a) that depends on the action taken, or with R(s, a, s’) that depends on the action taken and the outcome state.

The task is to show how an MDP with a reward function R(s, a, s’) can be transformed into a different MDP with reward function R(s, a), such that optimal policies in the new MDP correspond exactly to optimal policies in the original MDP.

One solution is to define a ’pre-state’ pre(s, a, s’) for every s, a, s’ such that executing a in s leads not to s’ but to pre(s, a, s’). From the pre-state there is only one action b, which always leads to s’. Let the new MDP have transition T’, reward R’, and discount γ’.

$$T'(s, a, \mathrm{pre}(s, a, s')) = T(s, a, s')$$

$$T'(\mathrm{pre}(s, a, s'), b, s') = 1$$

$$R'(s, a) = 0$$

$$R'(\mathrm{pre}(s, a, s'), b) = \gamma^{-1/2}\, R(s, a, s')$$

$$\gamma' = \gamma^{1/2}$$
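The pre-state construction can be checked numerically on a small example. The following is a minimal sketch, not part of the original answer: the two-state MDP, the helper names (`solve_sas`, `solve_sa`, `add_pre_states`) and the tuple encoding of pre-states are all assumptions made for illustration. It builds T’, R’ and γ’ as defined above, runs value iteration on both MDPs, and compares the utilities of the original states.

```python
from itertools import product

def solve_sas(states, actions, T, R, gamma, iters=500):
    """Value iteration with rewards R[(s, a, s2)]; T[(s, a)] maps s2 -> probability."""
    U = {s: 0.0 for s in states}
    for _ in range(iters):
        U = {s: max(sum(p * (R[(s, a, s2)] + gamma * U[s2])
                        for s2, p in T[(s, a)].items())
                    for a in actions)
             for s in states}
    return U

def solve_sa(states, actions_of, T, R, gamma, iters=1000):
    """Value iteration with rewards R[(s, a)]; actions_of(s) lists actions legal in s."""
    U = {s: 0.0 for s in states}
    for _ in range(iters):
        U = {s: max(R[(s, a)] + gamma * sum(p * U[s2] for s2, p in T[(s, a)].items())
                    for a in actions_of(s))
             for s in states}
    return U

def add_pre_states(states, actions, T, R, gamma):
    """Build T', R', gamma' with one pre-state per (s, a, s2) triple."""
    g = gamma ** 0.5                      # gamma' = gamma^(1/2)
    new_states, T2, R2 = list(states), {}, {}
    for s, a in product(states, actions):
        T2[(s, a)], R2[(s, a)] = {}, 0.0  # R'(s, a) = 0
        for s2, p in T[(s, a)].items():
            pre = ("pre", s, a, s2)       # hypothetical encoding of pre(s, a, s2)
            new_states.append(pre)
            T2[(s, a)][pre] = p           # T'(s, a, pre) = T(s, a, s2)
            T2[(pre, "b")] = {s2: 1.0}    # T'(pre, b, s2) = 1
            R2[(pre, "b")] = R[(s, a, s2)] / g  # gamma^(-1/2) * R(s, a, s2)
    return new_states, T2, R2, g

# Toy MDP (illustrative numbers only).
states, actions, gamma = ["x", "y"], ["stay", "go"], 0.9
T = {("x", "stay"): {"x": 0.9, "y": 0.1}, ("x", "go"): {"x": 0.2, "y": 0.8},
     ("y", "stay"): {"y": 1.0},           ("y", "go"): {"x": 0.5, "y": 0.5}}
R = {("x", "stay", "x"): 0.0, ("x", "stay", "y"): 1.0,
     ("x", "go", "x"): -1.0,  ("x", "go", "y"): 2.0,
     ("y", "stay", "y"): 0.5, ("y", "go", "x"): 3.0, ("y", "go", "y"): 0.0}

new_states, T2, R2, g = add_pre_states(states, actions, T, R, gamma)
U_orig = solve_sas(states, actions, T, R, gamma)
U_new = solve_sa(new_states,
                 lambda s: ["b"] if isinstance(s, tuple) else actions,
                 T2, R2, g)
max_gap = max(abs(U_orig[s] - U_new[s]) for s in states)
print(max_gap)
```

If the construction is right, `max_gap` should be zero up to floating-point error. Because the utilities of the original states agree, so do the greedy actions chosen in them.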

Then, using pre as shorthand for pre(s, a, s’):

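$$U(s) = \max_a \Big[ R'(s, a) + \gamma^{1/2} \sum_{\mathrm{pre}} T'(s, a, \mathrm{pre}) \Big( \max_b \Big[ R'(\mathrm{pre}, b) + \gamma^{1/2} \sum_{s'} T'(\mathrm{pre}, b, s')\, U(s') \Big] \Big) \Big]$$

$$U(s) = \max_a \Big[ \sum_{s'} T(s, a, s') \big( R(s, a, s') + \gamma\, U(s') \big) \Big]$$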
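The step between the two equations above can be spelled out. This is a sketch of the intermediate algebra, using only the definitions of T’, R’ and γ’ given earlier. Since b is the only action in a pre-state and it leads to s’ with probability 1, the inner maximization collapses to a single term:

```latex
\begin{aligned}
\max_b \Big[ R'(\mathrm{pre}, b) + \gamma^{1/2} \sum_{s'} T'(\mathrm{pre}, b, s')\, U(s') \Big]
  &= \gamma^{-1/2} R(s, a, s') + \gamma^{1/2}\, U(s'), \\
R'(s, a) + \gamma^{1/2} \sum_{s'} T(s, a, s') \Big( \gamma^{-1/2} R(s, a, s') + \gamma^{1/2}\, U(s') \Big)
  &= \sum_{s'} T(s, a, s') \big( R(s, a, s') + \gamma\, U(s') \big).
\end{aligned}
```

The second line uses R’(s, a) = 0 and the fact that summing over pre-states of (s, a) is the same as summing over s’, with T’(s, a, pre) = T(s, a, s’).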

Now do the same to convert MDPs with R(s, a) into MDPs with R(s).

As in the pre-state construction above, create a state post(s, a) for every s, a such that

$$T'(s, a, \mathrm{post}(s, a)) = 1$$

$$T'(\mathrm{post}(s, a), b, s') = T(s, a, s')$$

$$R'(s) = 0$$

$$R'(\mathrm{post}(s, a)) = \gamma^{-1/2}\, R(s, a)$$

$$\gamma' = \gamma^{1/2}$$
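The post-state construction can be checked the same way. The following is a minimal sketch under stated assumptions: the toy MDP, the helper names (`solve_sa`, `solve_s`, `add_post_states`) and the tuple encoding of post-states are hypothetical and not from the original answer. It builds the R’(s) MDP from the definitions above and compares utilities on the original states.

```python
def solve_sa(states, actions, T, R, gamma, iters=500):
    """Value iteration with rewards R[(s, a)]; T[(s, a)] maps s2 -> probability."""
    U = {s: 0.0 for s in states}
    for _ in range(iters):
        U = {s: max(R[(s, a)] + gamma * sum(p * U[s2] for s2, p in T[(s, a)].items())
                    for a in actions)
             for s in states}
    return U

def solve_s(states, actions_of, T, R, gamma, iters=1000):
    """Value iteration with state rewards: U(s) = R[s] + gamma * max_a sum T * U."""
    U = {s: 0.0 for s in states}
    for _ in range(iters):
        U = {s: R[s] + gamma * max(sum(p * U[s2] for s2, p in T[(s, a)].items())
                                   for a in actions_of(s))
             for s in states}
    return U

def add_post_states(states, actions, T, R, gamma):
    """Build T', R', gamma' with one post-state per (s, a) pair."""
    g = gamma ** 0.5                          # gamma' = gamma^(1/2)
    new_states, T2 = list(states), {}
    R2 = {s: 0.0 for s in states}             # R'(s) = 0
    for s in states:
        for a in actions:
            post = ("post", s, a)             # hypothetical encoding of post(s, a)
            new_states.append(post)
            T2[(s, a)] = {post: 1.0}          # T'(s, a, post) = 1
            T2[(post, "b")] = dict(T[(s, a)]) # T'(post, b, s2) = T(s, a, s2)
            R2[post] = R[(s, a)] / g          # gamma^(-1/2) * R(s, a)
    return new_states, T2, R2, g

# Toy MDP (illustrative numbers only).
states, actions, gamma = ["x", "y"], ["stay", "go"], 0.9
T = {("x", "stay"): {"x": 0.9, "y": 0.1}, ("x", "go"): {"x": 0.2, "y": 0.8},
     ("y", "stay"): {"y": 1.0},           ("y", "go"): {"x": 0.5, "y": 0.5}}
R = {("x", "stay"): 0.1, ("x", "go"): 1.4, ("y", "stay"): 0.5, ("y", "go"): 1.5}

new_states, T2, R2, g = add_post_states(states, actions, T, R, gamma)
U_orig = solve_sa(states, actions, T, R, gamma)
U_new = solve_s(new_states,
                lambda s: ["b"] if isinstance(s, tuple) else actions,
                T2, R2, g)
max_gap = max(abs(U_orig[s] - U_new[s]) for s in states)
print(max_gap)
```

As before, `max_gap` should vanish up to floating-point error if the transformed MDP preserves the utilities of the original states.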

Then, using post as shorthand for post(s, a):

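$$U(s) = R'(s) + \gamma^{1/2} \max_a \Big[ \sum_{\mathrm{post}} T'(s, a, \mathrm{post}) \Big( R'(\mathrm{post}) + \gamma^{1/2} \max_b \Big[ \sum_{s'} T'(\mathrm{post}, b, s')\, U(s') \Big] \Big) \Big]$$

$$U(s) = \max_a \Big[ R(s, a) + \gamma \sum_{s'} T(s, a, s')\, U(s') \Big]$$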
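The step between these two equations follows the same pattern as for pre-states. As a sketch of the intermediate algebra, using only the definitions given for the post-state MDP: action a leads from s to its post-state with probability 1, and b is the only action available there, so both the sum over post-states and the maximization over b collapse:

```latex
\begin{aligned}
U(s) &= 0 + \gamma^{1/2} \max_a \Big[ \gamma^{-1/2} R(s, a) + \gamma^{1/2} \sum_{s'} T(s, a, s')\, U(s') \Big] \\
     &= \max_a \Big[ R(s, a) + \gamma \sum_{s'} T(s, a, s')\, U(s') \Big].
\end{aligned}
```

The factor γ^{1/2} can be moved inside the maximization because it is positive and does not depend on a.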

