AI/RL (2021 DeepMind x UCL)

Lecture 3: MDPs and Dynamic Programming

Tony Lim 2021. 11. 20. 20:58

Markov Property

Once the state is known, the history may be thrown away.
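
for reference, the Markov property written out: the next state depends only on the current state, not on the full history

$$p(S_{t+1} \mid S_t) = p(S_{t+1} \mid S_1, \ldots, S_t)$$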

 

returns

the discount factor gamma (0~1) gives the present value of future rewards

if close to 0, it leads to myopic (very short-sighted) evaluation

if close to 1, it leads to far-sighted evaluation
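
written out, the discounted return that gamma enters into:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$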

 

why discount?

problem specification = immediate rewards may actually be more valuable (e.g. consider earning interest)

solution side = mathematically convenient to discount rewards, avoids infinite returns in cyclic Markov processes

 

Policies

Goal of an RL agent = to find a behavior policy that maximises the expected return Gt (total reward)
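
a policy is just a distribution over actions given the state:

$$\pi(a \mid s) = p(A_t = a \mid S_t = s)$$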

 

value function

the expected return under the policy π that we are evaluating
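
written out (state value and action value):

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s], \qquad q_\pi(s,a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$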

 

optimal value function = best possible performance in the Markov decision process

the MDP is solved when we know the optimal value function
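
i.e. the optimal values are the maximum over all policies:

$$v_*(s) = \max_\pi v_\pi(s), \qquad q_*(s,a) = \max_\pi q_\pi(s,a)$$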

 

Bellman equation
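
for reference, the standard Bellman expectation equation and Bellman optimality equation: the value of a state decomposes into the immediate reward plus the discounted value of the successor state

$$v_\pi(s) = \sum_a \pi(a \mid s) \Big( r(s,a) + \gamma \sum_{s'} p(s' \mid s,a)\, v_\pi(s') \Big)$$

$$v_*(s) = \max_a \Big( r(s,a) + \gamma \sum_{s'} p(s' \mid s,a)\, v_*(s') \Big)$$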

 

 

policy evaluation

when gamma is less than 1, it will always converge (the Bellman expectation backup is a γ-contraction)
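
a minimal sketch of iterative policy evaluation on a tabular MDP; the transition format P[s][a] = list of (prob, next_state, reward) triples and the function names are my own assumptions, not from the lecture:

```python
import numpy as np

def policy_evaluation(P, policy, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation for a tabular MDP.

    P[s][a]      : list of (prob, next_state, reward) transitions (assumed format)
    policy[s][a] : probability of taking action a in state s
    """
    n_states = len(P)
    v = np.zeros(n_states)                    # start from an all-zero estimate
    while True:
        v_new = np.zeros(n_states)
        for s in range(n_states):
            for a, pi_sa in enumerate(policy[s]):
                for prob, s_next, reward in P[s][a]:
                    # Bellman expectation backup
                    v_new[s] += pi_sa * prob * (reward + gamma * v[s_next])
        delta = np.max(np.abs(v_new - v))
        v = v_new
        if delta < tol:                       # gamma-contraction, so this terminates
            return v
```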

 

policy iteration

alternate policy evaluation and greedy policy improvement; iterate until both the policy and the value function converge
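
a sketch of policy iteration that reuses policy_evaluation from the sketch above (same assumed P format; greedy_policy is my own helper name):

```python
import numpy as np

def greedy_policy(P, v, gamma=0.9):
    """Deterministic policy that is greedy with respect to the value estimate v."""
    n_states, n_actions = len(P), len(P[0])
    policy = np.zeros((n_states, n_actions))
    for s in range(n_states):
        q = np.zeros(n_actions)
        for a in range(n_actions):
            for prob, s_next, reward in P[s][a]:
                q[a] += prob * (reward + gamma * v[s_next])
        policy[s, np.argmax(q)] = 1.0
    return policy

def policy_iteration(P, gamma=0.9):
    n_states, n_actions = len(P), len(P[0])
    policy = np.full((n_states, n_actions), 1.0 / n_actions)  # start uniform random
    while True:
        v = policy_evaluation(P, policy, gamma)   # evaluate the current policy
        new_policy = greedy_policy(P, v, gamma)   # improve it greedily
        if np.array_equal(new_policy, policy):    # both have converged: policy is stable
            return policy, v
        policy = new_policy
```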

 

Asynchronous Dynamic Programming

In-place dynamic programming
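
in-place means sweeping with a single value array and overwriting v[s] immediately, so later backups in the same sweep already use the fresh values; a sketch of in-place value iteration under the same assumed P format:

```python
import numpy as np

def value_iteration_inplace(P, gamma=0.9, tol=1e-8):
    """In-place value iteration: v[s] is overwritten immediately, so later
    backups in the same sweep already use the freshly updated values."""
    n_states, n_actions = len(P), len(P[0])
    v = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            best = max(sum(prob * (reward + gamma * v[s_next])
                           for prob, s_next, reward in P[s][a])
                       for a in range(n_actions))
            delta = max(delta, abs(best - v[s]))
            v[s] = best                       # no separate "new" array
        if delta < tol:
            return v
```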

 

prioritised sweeping
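
the idea is to back up whichever state currently has the largest Bellman error, tracked with a priority queue; a sketch under the same assumed P format, plus an assumed predecessors map:

```python
import heapq
import numpy as np

def prioritised_sweeping(P, predecessors, gamma=0.9, tol=1e-8):
    """Back up the state with the largest Bellman error first.

    predecessors[s] : states that can transition into s (assumed to be given)
    """
    n_states, n_actions = len(P), len(P[0])
    v = np.zeros(n_states)

    def backup_value(s):
        # greedy (optimality) backup for state s under the current v
        return max(sum(prob * (reward + gamma * v[s_next])
                       for prob, s_next, reward in P[s][a])
                   for a in range(n_actions))

    # max-heap via negated priorities, seeded with every state's initial error
    queue = [(-abs(backup_value(s) - v[s]), s) for s in range(n_states)]
    heapq.heapify(queue)
    while queue:
        _, s = heapq.heappop(queue)
        new_v = backup_value(s)
        if abs(new_v - v[s]) < tol:           # stale entry, nothing left to fix here
            continue
        v[s] = new_v                          # back up the high-error state
        for p in predecessors[s]:             # predecessors' errors may have grown
            err = abs(backup_value(p) - v[p])
            if err > tol:
                heapq.heappush(queue, (-err, p))
    return v
```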

 

 

real-time dynamic programming

only update states that are relevant to the agent (debatable, not sure) haha
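
one way to read this: run (simulated) trajectories and apply full backups only at the states the agent actually visits; a sketch under the same assumed P format (start_state and the episode counts are illustrative):

```python
import numpy as np

def real_time_dp(P, start_state, gamma=0.9, n_episodes=100, max_steps=100, seed=0):
    """Trajectory-based DP: full-width backups, but only at the states the
    agent actually visits while acting greedily in the (simulated) MDP."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = len(P), len(P[0])
    v = np.zeros(n_states)
    for _ in range(n_episodes):
        s = start_state
        for _ in range(max_steps):
            # greedy backup at the current state only
            q = [sum(prob * (reward + gamma * v[s_next])
                     for prob, s_next, reward in P[s][a])
                 for a in range(n_actions)]
            v[s] = max(q)
            # act greedily, then sample the next state from the model
            a = int(np.argmax(q))
            probs, next_states, _ = zip(*P[s][a])
            s = next_states[rng.choice(len(next_states), p=np.asarray(probs))]
    return v
```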
