Markov Property

Once the state is known, the history may be thrown away
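For reference, the Markov property in symbols (standard form, not copied from the slide):

$$\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \ldots, S_t]$$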
returns

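The return is the total discounted reward from time step t onwards (standard definition, written out here since the discussion below refers to it):

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$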
the discount gamma (0~1) gives the present value of future rewards
a value close to 0 leads to myopic (very short-sighted) evaluation
a value close to 1 leads to far-sighted evaluation
why discount?
Problem specification = immediate rewards may actually be more valuable (e.g. consider earning interest)
Solution side = mathematically convenient to discount rewards, avoids infinite returns in cyclic Markov processes
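One way to see the "avoid infinite returns" point: if every reward is bounded by $R_{\max}$ and $\gamma < 1$, the return is bounded by a geometric series,

$$|G_t| \le \sum_{k=0}^{\infty} \gamma^k R_{\max} = \frac{R_{\max}}{1 - \gamma}$$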
Policies
Goal of an RL agent = to find a behavior policy that maximises the expected return Gt (total reward)
value function

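Standard definitions of the state-value and action-value functions under a policy π:

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s], \qquad q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$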
the expected return under the policy π that we are evaluating
optimal value function = best possible performance in the Markov decision process
the MDP is solved when we know the optimal value function
Bellman equation


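The Bellman expectation equation decomposes the value into the immediate reward plus the discounted value of the successor state, and the Bellman optimality equation replaces the policy average with a max over actions (standard forms, reconstructed from memory rather than the slide):

$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma\, v_\pi(s')\big]$$

$$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma\, v_*(s')\big]$$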
policy evaluation

when gamma is less than 1 it will always converge
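A minimal tabular sketch of iterative policy evaluation with synchronous (two-array) backups. The MDP format `P[s][a] = list of (prob, next_state, reward)` and all names here are my own assumptions, not the lecture's code:

```python
import numpy as np

def policy_evaluation(P, pi, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation: repeat Bellman expectation backups until the change is tiny."""
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        V_new = np.zeros(n_states)
        for s in range(n_states):
            for a, p_a in enumerate(pi[s]):          # pi[s][a] = probability of action a in state s
                for prob, s_next, reward in P[s][a]:
                    # v(s) <- sum_a pi(a|s) sum_s' p(s'|s,a) [r + gamma v(s')]
                    V_new[s] += p_a * prob * (reward + gamma * V[s_next])
        if np.max(np.abs(V_new - V)) < theta:        # converged (guaranteed for gamma < 1)
            return V_new
        V = V_new
```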
policy iteration

iterate policy evaluation and greedy policy improvement until both the value function and the policy converge
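A sketch of policy iteration, reusing the `policy_evaluation` function from the sketch above (again, the MDP format and names are assumptions):

```python
import numpy as np

def policy_iteration(P, n_actions, gamma=0.9):
    """Alternate evaluation and greedy improvement until the policy stops changing."""
    n_states = len(P)
    pi = np.full((n_states, n_actions), 1.0 / n_actions)   # start from a uniform random policy
    while True:
        V = policy_evaluation(P, pi, gamma)                 # evaluate the current policy
        stable = True
        for s in range(n_states):
            q = np.zeros(n_actions)
            for a in range(n_actions):                       # one-step lookahead action values
                for prob, s_next, reward in P[s][a]:
                    q[a] += prob * (reward + gamma * V[s_next])
            greedy = np.zeros(n_actions)
            greedy[np.argmax(q)] = 1.0                       # act greedily w.r.t. the evaluated values
            if not np.array_equal(greedy, pi[s]):
                stable = False
            pi[s] = greedy
        if stable:
            return pi, V                                     # both policy and value function have converged
```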
Asynchronous Dynamic Programming
In-place dynamic programming

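Rough sketch of the in-place idea, contrasted with the two-array policy evaluation above: keep a single value array and overwrite it immediately, so later backups in the same sweep already use the freshest estimates (shown here for value iteration, same assumed MDP format):

```python
import numpy as np

def value_iteration_in_place(P, n_actions, gamma=0.9, theta=1e-8):
    """Value iteration with a single value array (in-place backups)."""
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            q = np.zeros(n_actions)
            for a in range(n_actions):
                for prob, s_next, reward in P[s][a]:
                    q[a] += prob * (reward + gamma * V[s_next])
            v_new = q.max()                      # Bellman optimality backup
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                         # overwrite immediately: later states in this sweep see it
        if delta < theta:
            return V
```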
prioritised sweeping


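A simplified sketch of the idea: always back up the state with the largest remaining Bellman error first. A real implementation keeps a priority queue and only re-prioritises predecessors of the updated state; re-scoring every state each step, as below, is just to keep the sketch short:

```python
import numpy as np

def prioritised_sweeping(P, n_actions, gamma=0.9, theta=1e-8, max_backups=10_000):
    """Back up states in order of (current) Bellman error magnitude."""
    n_states = len(P)
    V = np.zeros(n_states)

    def target(s):
        # greedy one-step lookahead (Bellman optimality target) for state s
        return max(
            sum(prob * (reward + gamma * V[s_next]) for prob, s_next, reward in P[s][a])
            for a in range(n_actions)
        )

    for _ in range(max_backups):
        errors = np.array([abs(target(s) - V[s]) for s in range(n_states)])
        s = int(errors.argmax())        # highest-priority state = largest Bellman error
        if errors[s] < theta:
            break                       # all states (nearly) consistent: done
        V[s] = target(s)
    return V
```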
real-time dynamic programming
only update states that are relevant to the agent (debatable, not sure) haha