AI/RL (2021 DeepMind x UCL)

Lecture 9: Policy-Gradient and Actor-Critic methods

Tony Lim 2022. 1. 2. 12:22

actor-critic = a value estimate (the critic) is used to criticise, i.e. update, the policy (the actor)

 

deterministic policy = greedy

stochastic policy = e.g. epsilon-greedy; it has some randomness

a deterministic policy doesn't always generalise well: at some point it will just keep following the learned policy even though the environment has changed

 

stochastic policies

with function approximation the agent cannot always distinguish which state it is in, even though the environment itself is an MDP; effectively the problem becomes partially observable

the search space over deterministic policies is very discrete and hard to optimise, but with a stochastic policy we can smoothly change the probability of choosing each action (see the sketch below)
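For intuition, a toy sketch (entirely my own example, not from the lecture) of a softmax policy whose action probabilities change smoothly as the parameters change:

```python
import numpy as np

def softmax_policy(theta, features):
    """Action probabilities pi(a|s) from action preferences theta^T phi(s, a)."""
    prefs = features @ theta          # one preference per action
    prefs -= prefs.max()              # numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

# two actions, one-hot features per action (made up)
features = np.eye(2)
theta = np.array([0.0, 0.0])
print(softmax_policy(theta, features))                 # [0.5, 0.5]
print(softmax_policy(theta + [0.1, 0.0], features))    # probabilities shift smoothly, not a hard switch
```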

 

 

Stochastic Policy Example: Aliased Grid World

an example of the function-approximation case: the agent cannot have a fully observable representation of the world (in other examples it is simply too complicated), so the optimal deterministic action ends up being the same in both gray states because they look identical

in this example the stochastic policy happens to be uniform in the gray states, but it doesn't have to be

it also means we don't get stuck in the corners

 

 

Policy Objective Functions

G = return

sample S0 from a start-state distribution d0; from that state we take actions according to the given policy

the goal is to maximise the discounted return from the state S0

we want to optimise the parameter theta such that the value v of the policy parameterised by theta is maximised under the distribution that generates the start state
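written out (my notation, following the definitions above):

$$J(\theta) = \mathbb{E}_{S_0 \sim d_0}\!\left[ v_{\pi_\theta}(S_0) \right] = \mathbb{E}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} R_{t+1} \;\middle|\; S_0 \sim d_0,\; A_t \sim \pi_\theta(\cdot \mid S_t) \right]$$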

 

the expectation here is no longer over a start-state distribution because we are considering the continuing setting

we are never going to start a new episode

even in the continuing setting we might start in some particular state, but if we stay in the episode indefinitely there is some frequency with which we visit each state, and this frequency doesn't depend on where we started; it depends only on our policy and the dynamics of the MDP
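so in the continuing case the objective can be written with that steady-state visit distribution instead of a start-state distribution (again my notation):

$$J_{\text{avg}}(\theta) = \sum_{s} d_{\pi_\theta}(s) \sum_{a} \pi_\theta(a \mid s)\, \mathbb{E}\!\left[ R_{t+1} \mid S_t = s, A_t = a \right]$$

where d_{π_θ} is the long-run frequency of visiting state s under π_θ.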

 

 

Policy Gradients

the right-hand side can be sampled, but the left-hand side cannot: the gradient of a sampled return with respect to theta is just zero, because the sample itself doesn't depend on theta

I think we can ignore the distribution notation d here and just take it as given that it exists

we are still doing valid SGD, just with lower variance

consider the case where the score (return) is binary, zero or one

without a baseline we can only learn (update the parameters) when we win

with a baseline, when we lose we can still learn something
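the identity behind this (my write-up of the standard score-function trick plus the baseline argument, shown for a single decision):

$$\nabla_\theta \mathbb{E}_{\pi_\theta}\!\left[ G \right] = \mathbb{E}_{\pi_\theta}\!\left[ G \, \nabla_\theta \log \pi_\theta(A \mid S) \right], \qquad \mathbb{E}_{\pi_\theta}\!\left[ b(S)\, \nabla_\theta \log \pi_\theta(A \mid S) \right] = 0$$

subtracting a baseline b(S) from G therefore leaves the expected gradient unchanged but can lower the variance; for full episodes the log-policy gradient is summed over the time steps.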

 

Policy gradient theorem (episodic)

the true gradient contains a discount factor gamma to the power t; dropping it gives a biased gradient (it can point in the wrong direction),
but in practice that is okay and the gamma^t term is usually dropped anyway
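the statement of the theorem, as I understand it:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \sum_{t=0}^{T-1} \gamma^{t}\, q_{\pi_\theta}(S_t, A_t)\, \nabla_\theta \log \pi_\theta(A_t \mid S_t) \right]$$

the gamma^t in front is the term that is often dropped in practice.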

 

proof of episodic policy gradients

in the 2nd line, some of the gradient terms are zero because those terms do not depend on the parameter theta

changing the inner sum from k=0 to k=t (only counting rewards that come after the action) removes variance that doesn't help us get a better estimate

 

Policy gradient theorem

rho(pi) is the average reward, averaged over all states according to how often the policy visits them

the q value here captures: conditioned on being in this specific state and taking this specific action, is my reward going to be a little bit lower or a little bit higher than the overall average, and for how long (the differential value)

the update adds terms going backwards in time, weighted by just one reward

this trace of policy-gradient (score) terms going into the past basically captures how the state visitation depends on the policy parameters

it is similar to the probability of the trajectory (the sum of log-policy terms along it)
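in the continuing case the theorem ends up with the same shape, using the differential q value and the steady-state distribution (my write-up):

$$\nabla_\theta \rho(\pi_\theta) = \sum_{s} d_{\pi_\theta}(s) \sum_{a} q_{\pi_\theta}(s,a)\, \nabla_\theta \pi_\theta(a \mid s) = \mathbb{E}\!\left[ q_{\pi_\theta}(S, A)\, \nabla_\theta \log \pi_\theta(A \mid S) \right]$$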

 

 

Actor-critics = an agent that has an actor (the policy) but also has a value estimate, the critic

policy gradients: reduce variance

the value of the state doesn't depend on the action, so we can use it as the baseline "b"
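with v as the baseline the gradient uses the advantage, and the TD error is a convenient sample of it when the critic is accurate (my write-up):

$$\nabla_\theta J(\theta) \propto \mathbb{E}\!\left[ \big( q_{\pi}(S, A) - v_{\pi}(S) \big)\, \nabla_\theta \log \pi_\theta(A \mid S) \right], \qquad \delta_t = R_{t+1} + \gamma\, v_w(S_{t+1}) - v_w(S_t)$$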

  

actor-critic

we need to initialise the critic parameters w too

it doesn't have to be one-step TD; it can be multi-step (rough one-step sketch below)
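a rough sketch of what the one-step version could look like; the environment interface, features and step sizes here are my own assumptions, not the lecture's pseudocode:

```python
import numpy as np

def softmax(prefs):
    prefs = prefs - prefs.max()
    e = np.exp(prefs)
    return e / e.sum()

def one_step_actor_critic(env, n_features, n_actions,
                          alpha_theta=0.01, alpha_w=0.1,
                          gamma=0.99, n_episodes=100):
    """Minimal one-step TD actor-critic: linear critic, softmax actor.

    `env` is assumed to expose reset() -> features and
    step(action) -> (features, reward, done); this interface is made up.
    """
    theta = np.zeros((n_actions, n_features))   # actor parameters
    w = np.zeros(n_features)                     # critic parameters

    for _ in range(n_episodes):
        x = env.reset()
        done = False
        while not done:
            probs = softmax(theta @ x)
            a = np.random.choice(n_actions, p=probs)
            x_next, r, done = env.step(a)

            v = w @ x
            v_next = 0.0 if done else w @ x_next
            delta = r + gamma * v_next - v       # one-step TD error

            # critic: semi-gradient TD(0)
            w += alpha_w * delta * x

            # actor: policy gradient, TD error as advantage estimate
            # d log pi(a|x) / d theta_b = (1[b == a] - pi(b|x)) * x
            grad_log_pi = -probs[:, None] * x
            grad_log_pi[a] += x
            theta += alpha_theta * delta * grad_log_pi

            x = x_next
    return theta, w
```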

 

increasing robustness with trust regions

a bad policy update might lead to bad data

KL divergence = a distance between the old policy and the new policy; keeping the policy from moving too much avoids a bad policy update
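one common way to write this down is as a penalised (soft trust-region) objective; this is my formulation, not necessarily the exact one on the slides:

$$\max_\theta \; J(\theta) \;-\; \eta\, \mathbb{E}_{S}\!\left[ \mathrm{KL}\!\left( \pi_{\text{old}}(\cdot \mid S) \,\Vert\, \pi_\theta(\cdot \mid S) \right) \right]$$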

 

 

Continous action spaces

how are we going to choose an action from a continuous action space? for example, we can use a Gaussian policy

we sample an action At from the Gaussian and compute the gradient based on the action we sampled

if the term multiplying the gradient of the log-policy (e.g. the return or TD error) is positive, the update moves the mean towards the action we actually took, and away from it if it is negative
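a small sketch of the Gaussian-policy score and update, with made-up features and a made-up TD error just to show the direction of the update:

```python
import numpy as np

def gaussian_policy_grad(theta, phi, action, sigma=1.0):
    """Score of a Gaussian policy N(mu, sigma^2) with linear mean mu = theta^T phi.

    Returns d log pi(action | phi) / d theta (toy illustration, my own naming).
    """
    mu = theta @ phi
    return (action - mu) / sigma**2 * phi

theta = np.zeros(3)
phi = np.array([1.0, 0.5, -0.2])     # state features (made up)
sigma = 1.0
action = np.random.normal(theta @ phi, sigma)   # sample At from the Gaussian

delta = 1.0                           # e.g. a positive TD error (made up)
alpha = 0.1
theta += alpha * delta * gaussian_policy_grad(theta, phi, action, sigma)
# positive delta pulls the mean towards the sampled action; negative pushes it away
```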

 

 

gradient ascent on value

choose a deterministic policy with parameter theta and update that theta by gradient ascent on the value Q
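written out with the chain rule through the action (my notation):

$$\nabla_\theta J(\theta) = \mathbb{E}_{S}\!\left[ \nabla_\theta \mu_\theta(S)\; \nabla_a q_w(S, a)\big|_{a = \mu_\theta(S)} \right]$$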

 

Continuous actor-critic learning automaton (Cacla)

notice the update doesn't include the TD error explicitly but uses it as an indicator (lines 5, 6)

it is doing some sort of hill climbing, updating towards actions that turned out to be good for us (rough sketch below)
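a minimal sketch of that idea with a linear actor and critic (all of the names and the interface here are my own assumptions):

```python
import numpy as np

def cacla_update(theta, w, phi, action, reward, phi_next, done,
                 alpha_actor=0.1, alpha_critic=0.1, gamma=0.99):
    """Cacla-style update (my sketch): the TD error is only used as an indicator.

    The actor outputs a deterministic action mu = theta^T phi; exploration comes
    from noise added when the action was selected. If the TD error is positive,
    move the actor output towards the action that was actually taken.
    """
    v = w @ phi
    v_next = 0.0 if done else w @ phi_next
    delta = reward + gamma * v_next - v

    # critic: TD(0)
    w = w + alpha_critic * delta * phi

    # actor: update only when the explored action did better than expected;
    # the size of delta is not used, only its sign
    if delta > 0:
        mu = theta @ phi
        theta = theta + alpha_actor * (action - mu) * phi

    return theta, w
```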

 
