Contraction mapping
if we take a sequence that converges in that space and apply the transformation T to it, we get another convergent sequence, and that sequence converges to Tx
where x is the limit of the original sequence
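For reference (not spelled out in these notes), the standard definition: T is a contraction mapping with modulus γ < 1 if

$$\|Tx - Ty\| \le \gamma \, \|x - y\| \quad \text{for all } x, y$$

which is exactly the property that makes the continuity statement above hold.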
Fixed point
if we apply the transformation T to x, we get back the original x (Tx = x)
Banach Fixed Point Theorem
for a contraction T on a complete space, we know the sequence x, Tx, T^2 x, ... converges to the unique fixed point x*
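A tiny numerical illustration of the theorem (my own, not from the lecture): repeatedly applying the 0.5-contraction T(x) = 0.5x + 1 converges to its unique fixed point x* = 2 from any starting point.

```python
def T(x):
    return 0.5 * x + 1  # a 0.5-contraction: |T(x) - T(y)| = 0.5 * |x - y|

x = 10.0  # arbitrary starting point
for _ in range(20):
    x = T(x)  # x, Tx, T^2 x, ... approaches the unique fixed point

print(x)  # ~2.0, and indeed T(2) = 2
```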
The Bellman Optimality Operator
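For state values, the Bellman optimality operator is usually written as

$$(T_* v)(s) = \max_a \Big[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v(s') \Big]$$

and its unique fixed point is the optimal value function v*.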
Value Iteration through the lens of the Bellman operator
value iteration always converges as long as gamma is less than 1, because the Bellman optimality operator is a gamma-contraction and the Banach fixed point theorem applies
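A minimal value iteration sketch as repeated application of this operator; the 2-state, 2-action MDP below (the arrays P and R) is made up purely for illustration.

```python
import numpy as np

# Made-up 2-state, 2-action MDP, for illustration only.
# P[a, s, s'] = transition probability, R[a, s] = expected immediate reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # action 0
              [[0.5, 0.5], [0.0, 1.0]]])  # action 1
R = np.array([[1.0, 0.0],                 # action 0
              [2.0, -1.0]])               # action 1
gamma = 0.9

def bellman_optimality_operator(v):
    # (T* v)(s) = max_a [ R[a, s] + gamma * sum_s' P[a, s, s'] * v[s'] ]
    return np.max(R + gamma * (P @ v), axis=0)

v = np.zeros(2)
for _ in range(1000):
    v_new = bellman_optimality_operator(v)
    if np.max(np.abs(v_new - v)) < 1e-8:  # sup-norm stopping criterion
        v = v_new
        break
    v = v_new

print(v)  # approximates v*, the unique fixed point of T*
```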
Approximate Dynamic Programming
so far we have assumed perfect knowledge of the MDP and a perfect / exact representation of the value functions
realistically, more often than not:
we don't know the underlying MDP
=> sampling / estimation error, as we don't have access to the true operator T
we won't be able to represent the value function exactly after each update
=> approximation error, as we approximate the true value function within a restricted class
Objective = under the above conditions, come up with a policy π that is close to optimal (see the sketch below)
A = Approximation (sampling or function approximation)
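A sketch of how both error sources enter in practice (entirely illustrative: the one-hot feature map, the random stand-in for the unknown MDP, and the least-squares regression step are my assumptions, not an algorithm from the lecture). Sampled transitions replace the exact operator, and a parametric q replaces the exact table; the greedy policy with respect to the fitted q is the policy π we hope is close to optimal.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 10, 2, 0.9

def features(s, a):
    # Hypothetical one-hot features; a real approximator would generalise.
    phi = np.zeros(n_states * n_actions)
    phi[s * n_actions + a] = 1.0
    return phi

def sample_transition(s, a):
    # Stand-in for an unknown MDP: random next state and noisy reward.
    return rng.integers(n_states), rng.normal(loc=float(a == s % 2))

w = np.zeros(n_states * n_actions)  # q(s, a) ~= w . features(s, a)

for _ in range(200):  # fitted Q-iteration style sweeps
    X, y = [], []
    for s in range(n_states):
        for a in range(n_actions):
            s_next, r = sample_transition(s, a)  # sampling / estimation error
            target = r + gamma * max(w @ features(s_next, b)
                                     for b in range(n_actions))
            X.append(features(s, a))
            y.append(target)
    # Regress q towards the sampled targets -> approximation error
    w, *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)

pi = [int(np.argmax([w @ features(s, a) for a in range(n_actions)]))
      for s in range(n_states)]
print(pi)  # greedy policy w.r.t. the approximate q
```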
the case where a function approximator might diverge
if gamma is high, even though we chose a pretty good q, the greedy policy can still be far from optimal, because the upper bound on its suboptimality is large (see below)
if gamma is low, vice versa: the bound is tight, so a good q gives a near-optimal policy
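The bound behind this observation, in one common form (this is the standard result as I recall it, not quoted from the slides): if π is greedy with respect to an approximate q, then

$$\|q - q_*\|_\infty \le \epsilon \;\Longrightarrow\; \|v_\pi - v_*\|_\infty \le \frac{2\gamma\epsilon}{1-\gamma}$$

The 1/(1 - γ) factor blows up as γ → 1, which is why a high discount makes the guarantee weak even when q is close to q*.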