actor-critic = a value estimate is used to criticise (i.e. update) the policy
deterministic policy = greedy
stochastic policy = e.g. epsilon-greedy; it has some randomness
a deterministic policy doesn't always generalise well = at some point it will just keep following the learned policy even though the environment has changed
Stochastic Policies
with function approximation the agent may not be able to distinguish which state it is in, even though the environment is an MDP == effectively partially observable
the search space of deterministic policies is discrete and hard to optimise over, but with a stochastic policy we can smoothly change the probability of choosing each action
Stochastic Policy Example: Aliased Grid World
for instance, with function approximation we cannot have a fully observable representation of the world because it is too complicated (as in the other example); the optimal deterministic action for the aliased grey states might then have to be the same in both
in this case we use uniform (50/50) randomness, but it doesn't have to be uniform
with the stochastic policy we also don't get stuck in the corners
Policy Objective Functions
G = return
sample S0 from a start distribution d0; from that state we take actions according to the given policy
the goal is to maximise the discounted return from the state S0
we want to optimise the parameter theta in such a way that the actual value v of the policy parameterised by theta is maximised under the distribution that generates the starting state
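Written out (my transcription; the notation may differ slightly from the slides), the episodic objective is

$$ J(\theta) = \mathbb{E}_{S_0 \sim d_0}\!\left[ v_{\pi_\theta}(S_0) \right] = \mathbb{E}_{S_0 \sim d_0}\!\left[ \mathbb{E}_{\pi_\theta}\!\left[ G_0 \mid S_0 \right] \right] $$

where G_0 is the discounted return from S_0.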
in the continuing setting the expectation is no longer over a start-state distribution
we are never going to start a new episode
even in the continuing setting we might start in some state, but if we stay in that episode indefinitely there is going to be some frequency with which we visit states, and this frequency won't depend on where we started; it depends only on our policy and the dynamics of the MDP
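So in the continuing setting the objective is defined with this stationary state distribution instead (again my transcription, so the exact notation may differ):

$$ J(\theta) = \sum_{s} d_{\pi_\theta}(s) \sum_{a} \pi_\theta(a \mid s)\, \mathbb{E}\!\left[ R_{t+1} \mid S_t = s, A_t = a \right] $$

i.e. the average reward per step, where d_{π_θ} is that long-run visitation frequency.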
Policy Gradients
the right-hand side can be sampled, but naively taking the gradient of a sampled value on the left-hand side would just give zero (the sample itself doesn't depend on theta)
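The identity being used here is the score-function (log-likelihood) trick:

$$ \nabla_\theta\, \mathbb{E}_{X \sim p_\theta}\!\left[ f(X) \right] = \mathbb{E}_{X \sim p_\theta}\!\left[ f(X)\, \nabla_\theta \log p_\theta(X) \right] $$

and the expectation on the right can be estimated from samples.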
I think we can ignore the explicit distribution notation d and just take it as given that it exists
we are still doing valid SGD, just with lower variance
consider the case where the return is binary: zero or one (lose or win)
without a baseline we can only learn (update the parameters) when we win
with a baseline we can still learn something when we lose
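A minimal sketch of REINFORCE with a baseline to make this concrete, assuming a small tabular MDP with a softmax policy over action preferences (the setup and names here are my own illustration, not the lecture's):

```python
import numpy as np

n_states, n_actions = 5, 2
theta = np.zeros((n_states, n_actions))   # policy parameters (action preferences)
baseline = np.zeros(n_states)             # per-state baseline (value estimate)
alpha_pi, alpha_b, gamma = 0.1, 0.1, 0.99

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_with_baseline(states, actions, rewards):
    """Update policy and baseline from one completed episode."""
    G, returns = 0.0, []
    for r in reversed(rewards):            # discounted returns, computed backwards
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for s, a, G in zip(states, actions, returns):
        advantage = G - baseline[s]        # baseline lowers variance, not the mean
        pi = softmax(theta[s])
        grad_log = -pi                     # grad of log softmax: onehot(a) - pi
        grad_log[a] += 1.0
        theta[s] += alpha_pi * advantage * grad_log   # gamma^t weighting dropped, as in practice
        baseline[s] += alpha_b * (G - baseline[s])
```

With a binary win/lose return, the advantage G - baseline[s] is non-zero even on a loss, so losing episodes still produce a learning signal.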
Policy gradient theorem (episodic)
the gamma^t term: in practice it is often dropped, which gives a biased gradient (it can point in the wrong direction), but that is usually okay
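For reference, the episodic policy gradient theorem with the gamma^t term kept in (my transcription):

$$ \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \sum_{t=0}^{T-1} \gamma^{t}\, G_t\, \nabla_\theta \log \pi_\theta(A_t \mid S_t) \right] $$

dropping the gamma^t factor in front of each term is what introduces the bias mentioned above.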
Proof of Episodic Policy Gradients
in the second line some of the gradients are zero because those terms don't depend on the parameter theta
changing the inner sum from k=0 to k=t removes variance that doesn't help us get a better estimate (rewards before time t don't depend on the action taken at time t)
Policy gradient theorem
rho here is the (stationary) state distribution: on average, how often we are in each state
the q value captures: conditioned on being in that specific state and taking that specific action, is my reward going to be a little bit lower or a little bit higher than the overall average, for a little while
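Putting the state distribution rho and the q value together, the theorem reads roughly (sketch):

$$ \nabla_\theta J(\theta) = \sum_{s} \rho(s) \sum_{a} q_{\pi_\theta}(s, a)\, \nabla_\theta \pi_\theta(a \mid s) = \mathbb{E}\!\left[ q_{\pi_\theta}(S, A)\, \nabla_\theta \log \pi_\theta(A \mid S) \right] $$

with S drawn from rho and A from pi_theta(·|S).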
we can also do this backwards, updating with just one reward at a time
the trace of these policy-gradient (score) terms going into the past basically captures how my state visitation depends on my policy parameters
this is similar to the probability of the trajectory
Actor-critic = an agent that has an actor (a policy) but also a value estimate: a critic
Policy gradients: reduce variance
the value of that state doesn't depend on the action, so we can use it as the baseline "b"
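So the gradient with the state value as the baseline becomes (sketch):

$$ \nabla_\theta J(\theta) = \mathbb{E}\!\left[ \left( q_{\pi_\theta}(S, A) - v_{\pi_\theta}(S) \right) \nabla_\theta \log \pi_\theta(A \mid S) \right] $$

subtracting v(S) doesn't change the expectation, because E_{A∼π}[∇_θ log π(A|S)] = 0, but it reduces the variance.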
Actor-critic
we need to initialise the critic parameters w too
it doesn't have to be one-step TD; it can be multi-step
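A minimal sketch of a one-step TD actor-critic update, in the same tabular softmax setup as the earlier sketch (again my own names and simplifications):

```python
import numpy as np

n_states, n_actions = 5, 2
theta = np.zeros((n_states, n_actions))   # actor parameters
w = np.zeros(n_states)                    # critic parameters: v(s) estimate
alpha_actor, alpha_critic, gamma = 0.05, 0.1, 0.99

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def actor_critic_step(s, a, r, s_next, done):
    """One-step TD actor-critic: the critic's TD error drives both updates."""
    target = r + (0.0 if done else gamma * w[s_next])
    td_error = target - w[s]               # one-step TD error (advantage estimate)
    w[s] += alpha_critic * td_error        # critic: move v(s) toward the target
    pi = softmax(theta[s])
    grad_log = -pi                         # grad of log softmax: onehot(a) - pi
    grad_log[a] += 1.0
    theta[s] += alpha_actor * td_error * grad_log   # actor: policy-gradient step
```

The one-step target r + gamma*v(s') could be replaced by an n-step or lambda-return target, as the note above says.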
Increasing robustness with trust regions
bad policy might lead to bad data
KL divergence = a distance between the old policy and the new policy; it is used to keep the policy from moving too much, to avoid a bad policy update
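One common way to write this down (a sketch of the general idea, not necessarily the exact form on the slides) is to add a KL penalty so the new policy cannot move too far from the old one:

$$ \max_\theta \; \mathbb{E}\!\left[ \hat{A}(S, A)\, \log \pi_\theta(A \mid S) \right] \;-\; \eta\, \mathbb{E}_S\!\left[ \mathrm{KL}\!\left( \pi_{\text{old}}(\cdot \mid S) \,\|\, \pi_\theta(\cdot \mid S) \right) \right] $$

where eta controls how strongly we keep the new policy close to the old one.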
Continuous action spaces
how are we going to choose our action from a continuous action space? for example, we can use a Gaussian policy
we sample the action At from the Gaussian, and compute the gradient based on the action we sampled
if the term multiplying the gradient of the log-probability (the return / value estimate) is positive, the update moves the policy toward the action we actually took, and vice versa
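A minimal sketch of a Gaussian policy gradient step for a one-dimensional action whose mean is linear in the state features (the linear form and fixed sigma are my own simplifications):

```python
import numpy as np

def gaussian_policy_step(theta, x, value_estimate, sigma=0.5, alpha=0.01,
                         rng=np.random.default_rng()):
    """Sample A_t ~ N(mu(x), sigma^2), then take one policy-gradient step."""
    mu = float(theta @ x)                    # policy mean for features x
    a = rng.normal(mu, sigma)                # sample the action A_t
    grad_log = (a - mu) / sigma**2 * x       # d/dtheta of log N(a; mu, sigma^2)
    theta = theta + alpha * value_estimate * grad_log
    return theta, a
```

If value_estimate (e.g. the return or advantage for that action) is positive, this moves the mean toward the sampled action; if negative, away from it.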
Gradient ascent on value
choose a deterministic policy with parameters theta, and update theta by following the gradient of the value (Q)
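The chain-rule form of this deterministic policy-gradient update is roughly:

$$ \nabla_\theta J(\theta) \approx \mathbb{E}_{S}\!\left[ \nabla_\theta \pi_\theta(S)\; \nabla_a Q_w(S, a)\big|_{a = \pi_\theta(S)} \right] $$

i.e. we follow the critic's gradient with respect to the action, back through the actor's parameters.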
Continuous Actor-Critic Learning Automaton (Cacla)
notice that the update doesn't use the TD error explicitly, only its sign as an indicator (5, 6)
we are doing some sort of hill climbing, updating the actor toward the actions that were good for us
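A minimal sketch of the Cacla actor update with a linear actor over state features (my own simplified version):

```python
import numpy as np

def cacla_actor_update(theta, x, action_taken, td_error, alpha=0.01):
    """Cacla: the TD error is used only as an indicator, not as a weight.

    If the exploratory action did better than expected (td_error > 0),
    regress the actor's output toward that action; otherwise do nothing.
    """
    predicted = float(theta @ x)             # actor's current action for features x
    if td_error > 0:                         # only the sign of the TD error matters
        theta = theta + alpha * (action_taken - predicted) * x
    return theta
```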