AI/RL (2021 DeepMind x UCL)

Lecture 4: Theoretical Fundamentals of Dynamic Programming Algorithms

Tony Lim 2021. 11. 27. 13:09

Contraction mapping

If we take a sequence that is convergent in that space and apply the transformation T to it, we get another convergent sequence, one that converges to T(x),

where x is the limit of the original sequence.
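
In symbols (my own notation, which may differ slightly from the slides): T is a γ-contraction on a metric space with distance d if, for all x and y,

```latex
d\big(T(x),\, T(y)\big) \;\le\; \gamma\, d(x, y), \qquad 0 \le \gamma < 1 .
```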

 

fixed point

If we apply the transformation T to x, we get back the original x, i.e. T(x) = x.

 

Banach Fixed Point Theorem

If T is a contraction mapping on a complete metric space, we know that the sequence obtained by repeatedly applying T converges to a unique fixed point x*, regardless of the starting point.
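
A minimal numeric sketch of this (a toy example of my own, not from the lecture): T(x) = 0.5x + 1 is a 0.5-contraction on the reals, so iterating it converges to its unique fixed point x* = 2 from any starting point.

```python
# Toy illustration of the Banach fixed point theorem (hypothetical example).
# T(x) = 0.5 * x + 1 is a 0.5-contraction, so repeatedly applying it
# converges to the unique fixed point x* = 2 from any starting point.

def T(x):
    return 0.5 * x + 1.0

x = 100.0                     # arbitrary starting point
for k in range(30):
    x = T(x)
    print(f"iteration {k:2d}: x = {x:.6f}")

# x approaches the fixed point 2.0, and indeed T(2.0) == 2.0
```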

 

The Bellman Optimality Operator

 
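For reference, the Bellman optimality operator on state values is usually written as follows (my transcription; the slide itself is not reproduced here):

```latex
(T^{*} v)(s) \;=\; \max_{a} \Big[ r(s,a) + \gamma \sum_{s'} p(s' \mid s, a)\, v(s') \Big].
```

T* is a γ-contraction in the max norm, and the optimal value function v* is its unique fixed point (T* v* = v*).
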

Value Iteration through the lens of the Bellman operator

Value iteration always converges as long as gamma is less than 1, because the Bellman optimality operator is a γ-contraction.
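
A minimal sketch of tabular value iteration on a made-up 2-state, 2-action MDP (the MDP and all names here are my own, purely for illustration); the max-norm distance between successive iterates shrinks by roughly a factor of gamma per sweep, which is the contraction argument in action.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP, just to illustrate convergence.
# P[a, s, s'] = transition probability, R[a, s] = expected reward.
P = np.array([[[0.9, 0.1],    # action 0
               [0.2, 0.8]],
              [[0.5, 0.5],    # action 1
               [0.0, 1.0]]])
R = np.array([[1.0, 0.0],     # action 0
              [0.0, 2.0]])    # action 1
gamma = 0.9

def bellman_optimality_operator(v):
    # (T* v)(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) v(s') ]
    q = R + gamma * P @ v          # shape (num_actions, num_states)
    return q.max(axis=0)

v = np.zeros(2)
for k in range(1000):
    v_next = bellman_optimality_operator(v)
    gap = np.max(np.abs(v_next - v))   # max-norm distance between iterates
    v = v_next
    if gap < 1e-8:
        break

print("converged after", k + 1, "sweeps, v* ~", v)
```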

 

Approximate Dynamic Programming

So far we have assumed perfect knowledge of the MDP and a perfect / exact representation of the value functions.

 

Realistically, more often than not:

 

we don't know the underlying MDP

=> sampling / estimation error, as we don't have access to the true operator T

 

we won't be able to represent the value function exactly after each update

=> approximation error, as we approximate the true value functions within a restricted class

 

Objective = under the above conditions, come up with a policy π that is close to optimal.

 

A = Approximation (sample or function approximation)
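
My reading of this notation (a sketch, since the slide itself is not reproduced here): approximate value iteration replaces the exact Bellman update with

```latex
v_{k+1} \;=\; A\, T^{*} v_k ,
```

where A is the approximation step: an estimate of T* built from samples, a projection onto a restricted function class, or both.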

With function approximation in the loop, there are cases where the iterates might diverge (see the sketch below).
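
A minimal sketch of the classic two-state counterexample (in the style of Tsitsiklis and Van Roy; not necessarily the example used in the lecture), where value iteration combined with a least-squares linear fit diverges for large gamma:

```python
import numpy as np

# Classic two-state counterexample, my own minimal version, not necessarily
# the one from the lecture slides.
# State 1 has feature 1, state 2 has feature 2, so v_w = [w, 2w].
# State 1 -> state 2, state 2 -> state 2, all rewards are 0,
# hence the true value function is identically 0.
phi = np.array([1.0, 2.0])
gamma = 0.95                      # divergence occurs for gamma > 5/6

w = 1.0
for k in range(20):
    v = w * phi                                   # current approximation
    target = np.array([gamma * v[1],              # backup for state 1
                       gamma * v[1]])             # backup for state 2
    # A = least-squares fit of the targets within the linear class:
    # minimise sum_s (phi[s] * w' - target[s])^2  ->  closed form below.
    w = float(phi @ target / (phi @ phi))
    print(f"sweep {k:2d}: w = {w:.3f}")

# Each sweep multiplies w by 6*gamma/5, so w grows without bound whenever
# gamma > 5/6, even though the true value function is 0 and the exact
# Bellman operator is still a contraction.
```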

 

If gamma is high, then even though we choose a pretty good q, the resulting (greedy) policy might still be far from optimal, because the upper bound on the performance loss is large.

If gamma is low, vice versa.
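
A standard bound of this form (stated here in my own notation, not transcribed from the slide): if π is the greedy policy with respect to an approximate action-value function q, then

```latex
\lVert v^{*} - v^{\pi} \rVert_\infty \;\le\; \frac{2\gamma}{1-\gamma}\, \lVert q^{*} - q \rVert_\infty .
```

For gamma = 0.99 the factor 2γ/(1-γ) is 198, while for gamma = 0.5 it is only 2, which matches the intuition above.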