Ahat [919]
3 years ago
15

Show how an MDP with a reward function R(s, a, s') can be transformed into a different MDP with reward function R(s, a), such that optimal policies in the new MDP correspond exactly to optimal policies in the original MDP.
Engineering
1 answer:
sasho [114]3 years ago
7 0

Answer:

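$$U(s) = \max_a \Big[ R'(s, a) + \gamma^{1/2} \sum_{\mathit{pre}} T'(s, a, \mathit{pre}) \Big( \max_b \Big[ R'(\mathit{pre}, b) + \gamma^{1/2} \sum_{s'} T'(\mathit{pre}, b, s')\, U(s') \Big] \Big) \Big]$$

$$U(s) = \max_a \Big[ \sum_{s'} T(s, a, s') \big( R(s, a, s') + \gamma\, U(s') \big) \Big]$$

$$U(s) = R'(s) + \gamma^{1/2} \max_a \Big[ \sum_{\mathit{post}} T'(s, a, \mathit{post}) \Big( R'(\mathit{post}) + \gamma^{1/2} \max_b \Big[ \sum_{s'} T'(\mathit{post}, b, s')\, U(s') \Big] \Big) \Big]$$

$$U(s) = \max_a \Big[ R(s, a) + \gamma \sum_{s'} T(s, a, s')\, U(s') \Big]$$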

Explanation:

MDPs can be formulated with a reward function R(s) that depends only on the current state, R(s, a) that also depends on the action taken, or R(s, a, s') that also depends on the outcome state.
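For reference, the Bellman equations for the three reward conventions can be written as follows. This is a sketch in the notation used in the rest of this answer, assuming T(s, a, s') is the transition model, U(s) the utility of state s, and γ the discount factor:

$$
\begin{aligned}
R(s):&\quad U(s) = R(s) + \gamma \max_a \sum_{s'} T(s, a, s')\, U(s') \\
R(s, a):&\quad U(s) = \max_a \Big[ R(s, a) + \gamma \sum_{s'} T(s, a, s')\, U(s') \Big] \\
R(s, a, s'):&\quad U(s) = \max_a \sum_{s'} T(s, a, s') \big[ R(s, a, s') + \gamma\, U(s') \big]
\end{aligned}
$$

Each transformation in this answer starts from one of these forms for the new MDP and reduces it to another form for the original MDP.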

The task is to show how an MDP with reward function R(s, a, s') can be transformed into a different MDP with reward function R(s, a), such that optimal policies in the new MDP correspond exactly to optimal policies in the original MDP.

One solution is to define a 'pre-state' pre(s, a, s') for every s, a, s' such that executing a in s leads not to s' but to pre(s, a, s'). From the pre-state there is only one action b, which always leads to s'. Let the new MDP have transition model T', reward function R', and discount factor γ'.

$$
\begin{aligned}
T'(s, a, \mathit{pre}(s, a, s')) &= T(s, a, s') \\
T'(\mathit{pre}(s, a, s'), b, s') &= 1 \\
R'(s, a) &= 0 \\
R'(\mathit{pre}(s, a, s'), b) &= \gamma^{-1/2}\, R(s, a, s') \\
\gamma' &= \gamma^{1/2}
\end{aligned}
$$

Then, using pre as shorthand for pre(s, a, s’):

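$$U(s) = \max_a \Big[ R'(s, a) + \gamma^{1/2} \sum_{\mathit{pre}} T'(s, a, \mathit{pre}) \Big( \max_b \Big[ R'(\mathit{pre}, b) + \gamma^{1/2} \sum_{s'} T'(\mathit{pre}, b, s')\, U(s') \Big] \Big) \Big]$$

Substituting the definitions of T', R', and γ' (the inner sum has a single term with probability 1, and the factors γ^{1/2} and γ^{-1/2} cancel on the reward) gives the Bellman equation of the original MDP:

$$U(s) = \max_a \Big[ \sum_{s'} T(s, a, s') \big( R(s, a, s') + \gamma\, U(s') \big) \Big]$$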
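The pre-state construction can be checked numerically. The following is a minimal sketch, not part of the original answer: the three-state random MDP, the function names, and the dictionary layout are all illustrative assumptions. It builds the new MDP from the definitions above, runs value iteration on both MDPs, and compares utilities and greedy policies on the original states.

```python
import itertools
import random


def value_iteration_sas(states, actions, T, R, gamma, iters=1000):
    """Value iteration for rewards of the form R(s, a, s')."""
    U = {s: 0.0 for s in states}
    for _ in range(iters):
        U = {
            s: max(
                sum(T[s, a, s2] * (R[s, a, s2] + gamma * U[s2]) for s2 in states)
                for a in actions
            )
            for s in states
        }
    return U


def add_pre_states(states, actions, T, R, gamma):
    """Build T', R', gamma' with one pre-state per (s, a, s') triple."""
    T2, R2 = {}, {}
    for s, a in itertools.product(states, actions):
        R2[s, a] = 0.0
        T2[s, a] = {("pre", s, a, s2): T[s, a, s2] for s2 in states}
        for s2 in states:
            pre = ("pre", s, a, s2)
            T2[pre, "b"] = {s2: 1.0}  # the single action b always leads to s'
            R2[pre, "b"] = gamma ** -0.5 * R[s, a, s2]
    return T2, R2, gamma ** 0.5


def value_iteration_sa(T2, R2, gamma2, iters=1000):
    """Value iteration for rewards R(s, a); T2[s, a] maps next states to probabilities."""
    U = {s: 0.0 for s, _ in T2}
    for _ in range(iters):
        new_U = {s: float("-inf") for s in U}
        for (s, a), nxt in T2.items():
            q = R2[s, a] + gamma2 * sum(p * U[s2] for s2, p in nxt.items())
            new_U[s] = max(new_U[s], q)
        U = new_U
    return U


# Hypothetical toy MDP with random transitions and rewards.
random.seed(0)
states = ["s0", "s1", "s2"]
actions = ["left", "right"]
T, R = {}, {}
for s, a in itertools.product(states, actions):
    weights = [random.random() for _ in states]
    for s2, w in zip(states, weights):
        T[s, a, s2] = w / sum(weights)
        R[s, a, s2] = random.uniform(-1.0, 1.0)
gamma = 0.9

U_old = value_iteration_sas(states, actions, T, R, gamma)
T2, R2, gamma2 = add_pre_states(states, actions, T, R, gamma)
U_new = value_iteration_sa(T2, R2, gamma2)

# Largest utility difference over the original states.
max_gap = max(abs(U_old[s] - U_new[s]) for s in states)

# Greedy policies on the original states in both MDPs.
policy_old = {
    s: max(actions, key=lambda a: sum(
        T[s, a, s2] * (R[s, a, s2] + gamma * U_old[s2]) for s2 in states))
    for s in states
}
policy_new = {
    s: max(actions, key=lambda a: R2[s, a] + gamma2 * sum(
        p * U_new[n] for n, p in T2[s, a].items()))
    for s in states
}
print(max_gap, policy_old == policy_new)
```

The new MDP takes two steps for every step of the original, which is why the discount is γ^{1/2} per step and the reward collected at the pre-state is scaled by γ^{-1/2}.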

Now do the same to convert MDPs with R(s, a) into MDPs with R(s).

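Similarly to the pre-state construction, create a 'post-state' post(s, a) for every s, a such that executing a in s always leads to post(s, a), and from there a single action b leads to s' with the original transition probability:

$$
\begin{aligned}
T'(s, a, \mathit{post}(s, a)) &= 1 \\
T'(\mathit{post}(s, a), b, s') &= T(s, a, s') \\
R'(s) &= 0 \\
R'(\mathit{post}(s, a)) &= \gamma^{-1/2}\, R(s, a) \\
\gamma' &= \gamma^{1/2}
\end{aligned}
$$

Then, using post as shorthand for post(s, a):

$$U(s) = R'(s) + \gamma^{1/2} \max_a \Big[ \sum_{\mathit{post}} T'(s, a, \mathit{post}) \Big( R'(\mathit{post}) + \gamma^{1/2} \max_b \Big[ \sum_{s'} T'(\mathit{post}, b, s')\, U(s') \Big] \Big) \Big]$$

Substituting the definitions of T', R', and γ' gives the Bellman equation of the original MDP with reward R(s, a):

$$U(s) = \max_a \Big[ R(s, a) + \gamma \sum_{s'} T(s, a, s')\, U(s') \Big]$$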
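The post-state construction can be checked the same way. This is again a minimal sketch under assumptions: the random toy MDP, the function names, and the data layout are illustrative and not from the original answer. It builds an R(s)-style MDP with one post-state per (s, a) pair and compares it with the original R(s, a) MDP.

```python
import itertools
import random


def value_iteration_sa(states, actions, T, R, gamma, iters=1000):
    """Value iteration for U(s) = max_a [R(s,a) + gamma * sum_s' T(s,a,s') U(s')]."""
    U = {s: 0.0 for s in states}
    for _ in range(iters):
        U = {
            s: max(
                R[s, a] + gamma * sum(T[s, a, s2] * U[s2] for s2 in states)
                for a in actions
            )
            for s in states
        }
    return U


def add_post_states(states, actions, T, R, gamma):
    """Build T', R', gamma' with one post-state per (s, a) pair."""
    T2, R2 = {}, {}
    for s in states:
        R2[s] = 0.0
    for s, a in itertools.product(states, actions):
        post = ("post", s, a)
        T2[s, a] = {post: 1.0}  # action a deterministically reaches post(s, a)
        T2[post, "b"] = {s2: T[s, a, s2] for s2 in states}
        R2[post] = gamma ** -0.5 * R[s, a]
    return T2, R2, gamma ** 0.5


def value_iteration_s(T2, R2, gamma2, iters=1000):
    """Value iteration for U(s) = R(s) + gamma * max_a sum_s' T(s,a,s') U(s')."""
    U = {s: 0.0 for s in R2}
    for _ in range(iters):
        best = {s: float("-inf") for s in U}
        for (s, a), nxt in T2.items():
            best[s] = max(best[s], sum(p * U[s2] for s2, p in nxt.items()))
        U = {s: R2[s] + gamma2 * best[s] for s in U}
    return U


# Hypothetical toy MDP with random transitions and R(s, a) rewards.
random.seed(1)
states = ["s0", "s1", "s2"]
actions = ["left", "right"]
T, R = {}, {}
for s, a in itertools.product(states, actions):
    weights = [random.random() for _ in states]
    R[s, a] = random.uniform(-1.0, 1.0)
    for s2, w in zip(states, weights):
        T[s, a, s2] = w / sum(weights)
gamma = 0.9

U_old = value_iteration_sa(states, actions, T, R, gamma)
T2, R2, gamma2 = add_post_states(states, actions, T, R, gamma)
U_new = value_iteration_s(T2, R2, gamma2)

# Largest utility difference over the original states.
max_gap = max(abs(U_old[s] - U_new[s]) for s in states)

# Greedy policies: in the new MDP, action a in s leads surely to post(s, a).
policy_old = {
    s: max(actions, key=lambda a: R[s, a] + gamma * sum(
        T[s, a, s2] * U_old[s2] for s2 in states))
    for s in states
}
policy_new = {
    s: max(actions, key=lambda a: U_new["post", s, a]) for s in states
}
print(max_gap, policy_old == policy_new)
```

In this sketch the utility of post(s, a) equals γ^{-1/2} times the original Q-value of (s, a), so ranking actions by the post-state utility picks the same action as the original MDP.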
