Ahat [919]
3 years ago
15

Show how an MDP with a reward function R(s, a, s') can be transformed into a different MDP with reward function R(s, a), such that optimal policies in the new MDP correspond exactly to optimal policies in the original MDP.
Engineering
1 answer:
sasho [114]3 years ago
7 0

Answer:

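$$U(s) = \max_a \Big[ R'(s, a) + \gamma^{1/2} \sum_{pre} T'(s, a, pre) \Big( \max_b \Big[ R'(pre, b) + \gamma^{1/2} \sum_{s'} T'(pre, b, s') \, U(s') \Big] \Big) \Big]$$

$$U(s) = \max_a \Big[ \sum_{s'} T(s, a, s') \big( R(s, a, s') + \gamma U(s') \big) \Big]$$

$$U(s) = R'(s) + \gamma^{1/2} \max_a \Big[ \sum_{post} T'(s, a, post) \Big( R'(post) + \gamma^{1/2} \max_b \Big[ \sum_{s'} T'(post, b, s') \, U(s') \Big] \Big) \Big]$$

$$U(s) = \max_a \Big[ R(s, a) + \gamma \sum_{s'} T(s, a, s') \, U(s') \Big]$$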

Explanation:

MDPs

MDPs can be formulated with a reward function R(s) that depends only on the state, R(s, a) that also depends on the action taken, or R(s, a, s') that depends on the action taken and the outcome state.

The task is to show how an MDP with a reward function R(s, a, s') can be transformed into a different MDP with reward function R(s, a), such that optimal policies in the new MDP correspond exactly to optimal policies in the original MDP.

One solution is to define a 'pre-state' pre(s, a, s') for every s, a, s' such that executing a in s leads not to s' but to pre(s, a, s'). From the pre-state there is only one action b, which always leads to s'. Let the new MDP have transition T', reward R', and discount γ'.

$$
\begin{aligned}
T'(s, a, \mathrm{pre}(s, a, s')) &= T(s, a, s') \\
T'(\mathrm{pre}(s, a, s'), b, s') &= 1 \\
R'(s, a) &= 0 \\
R'(\mathrm{pre}(s, a, s'), b) &= \gamma^{-1/2} R(s, a, s') \\
\gamma' &= \gamma^{1/2}
\end{aligned}
$$

Then, using pre as shorthand for pre(s, a, s’):

$$U(s) = \max_a \Big[ R'(s, a) + \gamma^{1/2} \sum_{pre} T'(s, a, pre) \Big( \max_b \Big[ R'(pre, b) + \gamma^{1/2} \sum_{s'} T'(pre, b, s') \, U(s') \Big] \Big) \Big]$$

$$U(s) = \max_a \Big[ \sum_{s'} T(s, a, s') \big( R(s, a, s') + \gamma U(s') \big) \Big]$$
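The pre-state construction can be checked numerically. The following is a minimal sketch, not a definitive implementation: the two-state MDP, its numbers, and the helper names `solve_sas`, `solve_sa`, and `add_pre_states` are all assumptions made up for illustration. It builds the new MDP from the definitions of T', R', and γ' above, runs value iteration on both MDPs, and compares utilities and greedy policies on the original states.

```python
def solve_sas(T, R, gamma, iters=1000):
    """Value iteration with rewards R[s][a][s2]; returns (utilities, greedy policy)."""
    def q(U, s, a):
        return sum(p * (R[s][a][s2] + gamma * U[s2]) for s2, p in T[s][a].items())
    U = {s: 0.0 for s in T}
    for _ in range(iters):
        U = {s: max(q(U, s, a) for a in T[s]) for s in T}
    return U, {s: max(T[s], key=lambda a: q(U, s, a)) for s in T}


def solve_sa(T, R, gamma, iters=1000):
    """Value iteration with rewards R[s][a]; returns (utilities, greedy policy)."""
    def q(U, s, a):
        return R[s][a] + gamma * sum(p * U[s2] for s2, p in T[s][a].items())
    U = {s: 0.0 for s in T}
    for _ in range(iters):
        U = {s: max(q(U, s, a) for a in T[s]) for s in T}
    return U, {s: max(T[s], key=lambda a: q(U, s, a)) for s in T}


def add_pre_states(T, R, gamma):
    """Build (T', R', gamma') with one pre-state per (s, a, s2) triple."""
    g2 = gamma ** 0.5                       # gamma' = gamma^(1/2)
    T2, R2 = {}, {}
    for s in T:
        T2[s], R2[s] = {}, {}
        for a in T[s]:
            T2[s][a], R2[s][a] = {}, 0.0    # R'(s, a) = 0
            for s2, p in T[s][a].items():
                pre = ("pre", s, a, s2)
                T2[s][a][pre] = p           # T'(s, a, pre) = T(s, a, s2)
                T2[pre] = {"b": {s2: 1.0}}  # single action b leads to s2
                R2[pre] = {"b": R[s][a][s2] / g2}  # gamma^(-1/2) * R(s, a, s2)
    return T2, R2, g2


# Hypothetical two-state example MDP (numbers chosen arbitrarily).
gamma = 0.9
T = {0: {"stay": {0: 0.9, 1: 0.1}, "go": {0: 0.2, 1: 0.8}},
     1: {"stay": {1: 1.0}, "go": {0: 0.7, 1: 0.3}}}
R = {0: {"stay": {0: 0.0, 1: 1.0}, "go": {0: -1.0, 1: 2.0}},
     1: {"stay": {1: 0.5}, "go": {0: 3.0, 1: 0.0}}}

U, pi = solve_sas(T, R, gamma)
T2, R2, gamma2 = add_pre_states(T, R, gamma)
U2, pi2 = solve_sa(T2, R2, gamma2)

# Compare only on the original states; pre-states have no counterpart.
max_gap = max(abs(U[s] - U2[s]) for s in T)
same_policy = all(pi[s] == pi2[s] for s in T)
print("max utility gap:", max_gap, "same policy:", same_policy)
```

The comparison is restricted to the original states because the pre-states exist only in the new MDP, where the single action b leaves nothing to choose.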

Now do the same to convert MDPs with R(s, a) into MDPs with R(s).

As in the previous construction, create a state post(s, a) for every s, a such that

$$
\begin{aligned}
T'(s, a, \mathrm{post}(s, a)) &= 1 \\
T'(\mathrm{post}(s, a), b, s') &= T(s, a, s') \\
R'(s) &= 0 \\
R'(\mathrm{post}(s, a)) &= \gamma^{-1/2} R(s, a) \\
\gamma' &= \gamma^{1/2}
\end{aligned}
$$

Then, using post as shorthand for post(s, a):

$$U(s) = R'(s) + \gamma^{1/2} \max_a \Big[ \sum_{post} T'(s, a, post) \Big( R'(post) + \gamma^{1/2} \max_b \Big[ \sum_{s'} T'(post, b, s') \, U(s') \Big] \Big) \Big]$$

$$U(s) = \max_a \Big[ R(s, a) + \gamma \sum_{s'} T(s, a, s') \, U(s') \Big]$$
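The post-state construction can be checked the same way. This is a rough sketch under stated assumptions: the example MDP and the helper names `solve_sa`, `solve_s`, and `add_post_states` are hypothetical, and each post-state is keyed by the pair (s, a) only. It solves the original R(s, a) MDP and the transformed R(s) MDP by value iteration and compares the results on the original states.

```python
def solve_sa(T, R, gamma, iters=1000):
    """Value iteration with rewards R[s][a]; returns (utilities, greedy policy)."""
    def q(U, s, a):
        return R[s][a] + gamma * sum(p * U[s2] for s2, p in T[s][a].items())
    U = {s: 0.0 for s in T}
    for _ in range(iters):
        U = {s: max(q(U, s, a) for a in T[s]) for s in T}
    return U, {s: max(T[s], key=lambda a: q(U, s, a)) for s in T}


def solve_s(T, R, gamma, iters=1000):
    """Value iteration with state-only rewards: U(s) = R(s) + gamma * max_a E[U(s')]."""
    def ev(U, s, a):
        return sum(p * U[s2] for s2, p in T[s][a].items())
    U = {s: 0.0 for s in T}
    for _ in range(iters):
        U = {s: R[s] + gamma * max(ev(U, s, a) for a in T[s]) for s in T}
    return U, {s: max(T[s], key=lambda a: ev(U, s, a)) for s in T}


def add_post_states(T, R, gamma):
    """Build (T', R', gamma') with one post-state per (s, a) pair."""
    g2 = gamma ** 0.5                       # gamma' = gamma^(1/2)
    T2, R2 = {}, {}
    for s in T:
        T2[s], R2[s] = {}, 0.0              # R'(s) = 0
        for a in T[s]:
            post = ("post", s, a)
            T2[s][a] = {post: 1.0}          # T'(s, a, post) = 1
            T2[post] = {"b": dict(T[s][a])} # T'(post, b, s2) = T(s, a, s2)
            R2[post] = R[s][a] / g2         # gamma^(-1/2) * R(s, a)
    return T2, R2, g2


# Hypothetical two-state example MDP (numbers chosen arbitrarily).
gamma = 0.9
T = {0: {"stay": {0: 0.9, 1: 0.1}, "go": {0: 0.2, 1: 0.8}},
     1: {"stay": {1: 1.0}, "go": {0: 0.7, 1: 0.3}}}
R = {0: {"stay": 0.1, "go": 1.4},
     1: {"stay": 0.5, "go": 2.1}}

U, pi = solve_sa(T, R, gamma)
T2, R2, gamma2 = add_post_states(T, R, gamma)
U2, pi2 = solve_s(T2, R2, gamma2)

# Compare only on the original states; post-states have no counterpart.
max_gap = max(abs(U[s] - U2[s]) for s in T)
same_policy = all(pi[s] == pi2[s] for s in T)
print("max utility gap:", max_gap, "same policy:", same_policy)
```

Here the choice of action happens in the original state and the randomness happens on leaving the post-state, which is why the post-state needs to remember only (s, a).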
