Markov Property
Once the state is known, the history may be thrown away
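In symbols, the Markov property says the next state depends only on the current state, not on the full history:

$$\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \ldots, S_t]$$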
Returns
the discount factor gamma (between 0 and 1) determines the present value of future rewards
gamma close to 0 leads to myopic (very short-sighted) evaluation
gamma close to 1 leads to far-sighted evaluation
why discount?
problem specification side = immediate rewards may actually be more valuable (e.g. consider earning interest)
solution side = mathematically convenient to discount rewards; avoids infinite returns in cyclic Markov processes
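The return being discounted is the cumulative reward from time t onwards:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

With bounded rewards and gamma < 1 this geometric series stays finite, which is exactly the "avoid infinite returns" point above.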
Policies
Goal of an RL agent = to find a behavior policy that maximises the expected return Gt (total reward)
Value Function
expected return when following the policy π that we are evaluating
optimal value function = best possible performance in the Markov decision process
the MDP is solved when we know the optimal value function
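In symbols, the state-value function and the optimal value function are:

$$v_\pi(s) = \mathbb{E}_\pi[\,G_t \mid S_t = s\,], \qquad v_*(s) = \max_\pi v_\pi(s)$$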
Bellman Equation
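The Bellman expectation equation decomposes the value into the immediate reward plus the discounted value of the successor state, and the Bellman optimality equation does the same with a max over actions:

$$v_\pi(s) = \mathbb{E}_\pi[\,R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s\,], \qquad v_*(s) = \max_a \mathbb{E}[\,R_{t+1} + \gamma\, v_*(S_{t+1}) \mid S_t = s, A_t = a\,]$$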
Policy Evaluation
when gamma is less than 1 it will always converge (the Bellman expectation backup is a gamma-contraction)
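A minimal sketch of iterative policy evaluation on a small tabular MDP. The array names `P`, `R`, `policy` are my own assumptions (not from the lecture): transition probabilities, expected rewards, and a stochastic policy. The idea is just repeated Bellman expectation backups until the value function stops changing.

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation on a tabular MDP (illustrative sketch).

    P[s, a, s'] : transition probabilities
    R[s, a]     : expected immediate reward for action a in state s
    policy[s, a]: probability of picking action a in state s
    """
    v = np.zeros(P.shape[0])
    while True:
        # One full sweep of the Bellman expectation backup
        q = R + gamma * (P @ v)                    # action values, shape (S, A)
        v_new = np.einsum('sa,sa->s', policy, q)   # average over the policy
        if np.max(np.abs(v_new - v)) < theta:      # stop when a sweep barely changes v
            return v_new
        v = v_new
```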
Policy Iteration
alternate evaluation and greedy improvement, iterating until both the policy and its value function converge
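A sketch of policy iteration built on the `policy_evaluation` function above (same hypothetical `P` and `R` arrays): evaluate the current policy, act greedily with respect to its value function, and stop once the greedy policy no longer changes.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, theta=1e-8):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    n_states, n_actions = R.shape
    policy = np.ones((n_states, n_actions)) / n_actions   # start from the uniform policy
    while True:
        v = policy_evaluation(P, R, policy, gamma, theta) # evaluate current policy
        q = R + gamma * (P @ v)                           # action values under v
        greedy = np.eye(n_actions)[q.argmax(axis=1)]      # deterministic greedy policy
        if np.array_equal(greedy, policy):                # stable policy => optimal
            return greedy, v
        policy = greedy
```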
Asynchronous Dynamic Programming
In-place dynamic programming (see the sketch after this list)
prioritised sweeping
real-time dynamic programming
only update states that are relevant to the agent (debatable, not sure) haha
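A sketch of the in-place variant mentioned above (same hypothetical `P` and `R` arrays): instead of keeping separate "old" and "new" value arrays, each backup immediately overwrites `v[s]`, so later backups in the same sweep already use the fresher values.

```python
import numpy as np

def in_place_value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Asynchronous (in-place) value iteration on a tabular MDP (illustrative sketch)."""
    v = np.zeros(P.shape[0])
    while True:
        delta = 0.0
        for s in range(len(v)):                          # sweep states one at a time
            backup = np.max(R[s] + gamma * (P[s] @ v))   # greedy Bellman backup for state s
            delta = max(delta, abs(backup - v[s]))
            v[s] = backup                                # overwrite in place
        if delta < theta:
            return v
```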