AI/RL (2021 DeepMind x UCL)

Lecture 9: Policy-Gradient and Actor-Critic methods

Tony Lim 2022. 1. 2. 12:22

actor-critic = a value estimate (the critic) is used to criticise, i.e. update, the policy (the actor)

 

deterministic policy = greedy

stochastic policy = e.g. epsilon-greedy; it has some randomness

a deterministic policy doesn't always generalise well: at some point it will just keep following the learned policy even though the environment has changed

 

stochastic policies

with function approximation the agent cannot always distinguish which state it is in, even though the environment itself is an MDP; effectively the problem becomes partially observable

the search space over deterministic policies is very discrete and hard to optimise, but with a stochastic policy we can smoothly change the probability of choosing each action (see the sketch below)
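For intuition, a toy sketch (entirely my own example, not from the lecture) of a softmax policy whose action probabilities change smoothly as the parameters change:

```python
import numpy as np

def softmax_policy(theta, features):
    """Action probabilities pi(a|s) from action preferences theta^T phi(s, a)."""
    prefs = features @ theta          # one preference per action
    prefs -= prefs.max()              # numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

# two actions, one-hot features per action (made up)
features = np.eye(2)
theta = np.array([0.0, 0.0])
print(softmax_policy(theta, features))                 # [0.5, 0.5]
print(softmax_policy(theta + [0.1, 0.0], features))    # probabilities shift smoothly, not a hard switch
```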

 

 

Stochastic Policy Example: Aliased Grid World

an example of the function-approximation case: the agent cannot have a fully observable representation of the world (in other examples it is simply too complicated), so the optimal deterministic action ends up being the same in both gray states because they look identical

in this example the stochastic policy happens to be uniform in the gray states, but it doesn't have to be

it also means we don't get stuck in the corners

 

 

Policy Objective Functions

G = return

sample S0 from a start-state distribution d0; from that state we take actions according to the given policy

the goal is to maximise the discounted return from the state S0

we want to optimise the parameter theta such that the value v of the policy parameterised by theta is maximised under the distribution that generates the start state
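written out (my notation, following the definitions above):

$$J(\theta) = \mathbb{E}_{S_0 \sim d_0}\!\left[ v_{\pi_\theta}(S_0) \right] = \mathbb{E}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} R_{t+1} \;\middle|\; S_0 \sim d_0,\; A_t \sim \pi_\theta(\cdot \mid S_t) \right]$$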

 

the expectation here is no longer over a start-state distribution because we are considering the continuing setting

we are never going to start a new episode

even in the continuing setting we might start in some particular state, but if we stay in the episode indefinitely there is some frequency with which we visit each state, and this frequency doesn't depend on where we started; it depends only on our policy and the dynamics of the MDP
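so in the continuing case the objective can be written with that steady-state visit distribution instead of a start-state distribution (again my notation):

$$J_{\text{avg}}(\theta) = \sum_{s} d_{\pi_\theta}(s) \sum_{a} \pi_\theta(a \mid s)\, \mathbb{E}\!\left[ R_{t+1} \mid S_t = s, A_t = a \right]$$

where d_{π_θ} is the long-run frequency of visiting state s under π_θ.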

 

 

Policy Gradients

the right-hand side can be sampled, but the left-hand side cannot: the gradient of a sampled return with respect to theta is just zero, because the sample itself doesn't depend on theta

I think we can ignore the distribution notation d here and just take it as given that it exists

we are still doing valid SGD, just with lower variance

consider the case where the score (return) is binary, zero or one

without a baseline we can only learn (update the parameters) when we win

with a baseline, when we lose we can still learn something
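the identity behind this (my write-up of the standard score-function trick plus the baseline argument, shown for a single decision):

$$\nabla_\theta \mathbb{E}_{\pi_\theta}\!\left[ G \right] = \mathbb{E}_{\pi_\theta}\!\left[ G \, \nabla_\theta \log \pi_\theta(A \mid S) \right], \qquad \mathbb{E}_{\pi_\theta}\!\left[ b(S)\, \nabla_\theta \log \pi_\theta(A \mid S) \right] = 0$$

subtracting a baseline b(S) from G therefore leaves the expected gradient unchanged but can lower the variance; for full episodes the log-policy gradient is summed over the time steps.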

 

Policy gradient theorem (episodic)

the true gradient contains a discount factor gamma to the power t; dropping it gives a biased gradient (it can point in the wrong direction),
but in practice that is okay and the gamma^t term is usually dropped anyway
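the statement of the theorem, as I understand it:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \sum_{t=0}^{T-1} \gamma^{t}\, q_{\pi_\theta}(S_t, A_t)\, \nabla_\theta \log \pi_\theta(A_t \mid S_t) \right]$$

the gamma^t in front is the term that is often dropped in practice.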

 

proof of episodic policy gradients

in the 2nd line, some of the gradient terms are zero because those terms do not depend on the parameter theta

changing the inner sum from k=0 to k=t (only counting rewards that come after the action) removes variance that doesn't help us get a better estimate

 

Policy gradient theorem

rho(pi) is the average reward, averaged over all states according to how often the policy visits them

the q value here captures: conditioned on being in this specific state and taking this specific action, is my reward going to be a little bit lower or a little bit higher than the overall average, and for how long (the differential value)

the update adds terms going backwards in time, weighted by just one reward

this trace of policy-gradient (score) terms going into the past basically captures how the state visitation depends on the policy parameters

it is similar to the probability of the trajectory (the sum of log-policy terms along it)
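in the continuing case the theorem ends up with the same shape, using the differential q value and the steady-state distribution (my write-up):

$$\nabla_\theta \rho(\pi_\theta) = \sum_{s} d_{\pi_\theta}(s) \sum_{a} q_{\pi_\theta}(s,a)\, \nabla_\theta \pi_\theta(a \mid s) = \mathbb{E}\!\left[ q_{\pi_\theta}(S, A)\, \nabla_\theta \log \pi_\theta(A \mid S) \right]$$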

 

 

Actor-critics = an agent that has an actor (the policy) but also has a value estimate, the critic

policy gradients: reduce variance

the value of the state doesn't depend on the action, so we can use it as the baseline "b"
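with v as the baseline the gradient uses the advantage, and the TD error is a convenient sample of it when the critic is accurate (my write-up):

$$\nabla_\theta J(\theta) \propto \mathbb{E}\!\left[ \big( q_{\pi}(S, A) - v_{\pi}(S) \big)\, \nabla_\theta \log \pi_\theta(A \mid S) \right], \qquad \delta_t = R_{t+1} + \gamma\, v_w(S_{t+1}) - v_w(S_t)$$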

  

actor-critic

we need to initialise the critic parameters w too

it doesn't have to be one-step TD; it can be multi-step (rough one-step sketch below)
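a rough sketch of what the one-step version could look like; the environment interface, features and step sizes here are my own assumptions, not the lecture's pseudocode:

```python
import numpy as np

def softmax(prefs):
    prefs = prefs - prefs.max()
    e = np.exp(prefs)
    return e / e.sum()

def one_step_actor_critic(env, n_features, n_actions,
                          alpha_theta=0.01, alpha_w=0.1,
                          gamma=0.99, n_episodes=100):
    """Minimal one-step TD actor-critic: linear critic, softmax actor.

    `env` is assumed to expose reset() -> features and
    step(action) -> (features, reward, done); this interface is made up.
    """
    theta = np.zeros((n_actions, n_features))   # actor parameters
    w = np.zeros(n_features)                     # critic parameters

    for _ in range(n_episodes):
        x = env.reset()
        done = False
        while not done:
            probs = softmax(theta @ x)
            a = np.random.choice(n_actions, p=probs)
            x_next, r, done = env.step(a)

            v = w @ x
            v_next = 0.0 if done else w @ x_next
            delta = r + gamma * v_next - v       # one-step TD error

            # critic: semi-gradient TD(0)
            w += alpha_w * delta * x

            # actor: policy gradient, TD error as advantage estimate
            # d log pi(a|x) / d theta_b = (1[b == a] - pi(b|x)) * x
            grad_log_pi = -probs[:, None] * x
            grad_log_pi[a] += x
            theta += alpha_theta * delta * grad_log_pi

            x = x_next
    return theta, w
```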

 

increasing robustness with trust regions

a bad policy update might lead to bad data

KL divergence = a distance between the old policy and the new policy; keeping the policy from moving too much avoids a bad policy update
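one common way to write this down is as a penalised (soft trust-region) objective; this is my formulation, not necessarily the exact one on the slides:

$$\max_\theta \; J(\theta) \;-\; \eta\, \mathbb{E}_{S}\!\left[ \mathrm{KL}\!\left( \pi_{\text{old}}(\cdot \mid S) \,\Vert\, \pi_\theta(\cdot \mid S) \right) \right]$$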

 

 

Continous action spaces

how are we going to choose an action from a continuous action space? for example, we can use a Gaussian policy

we sample an action At from the Gaussian and compute the gradient based on the action we sampled

if the term multiplying the gradient of the log-policy (e.g. the return or TD error) is positive, the update moves the mean towards the action we actually took, and away from it if it is negative
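a small sketch of the Gaussian-policy score and update, with made-up features and a made-up TD error just to show the direction of the update:

```python
import numpy as np

def gaussian_policy_grad(theta, phi, action, sigma=1.0):
    """Score of a Gaussian policy N(mu, sigma^2) with linear mean mu = theta^T phi.

    Returns d log pi(action | phi) / d theta (toy illustration, my own naming).
    """
    mu = theta @ phi
    return (action - mu) / sigma**2 * phi

theta = np.zeros(3)
phi = np.array([1.0, 0.5, -0.2])     # state features (made up)
sigma = 1.0
action = np.random.normal(theta @ phi, sigma)   # sample At from the Gaussian

delta = 1.0                           # e.g. a positive TD error (made up)
alpha = 0.1
theta += alpha * delta * gaussian_policy_grad(theta, phi, action, sigma)
# positive delta pulls the mean towards the sampled action; negative pushes it away
```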

 

 

gradient ascent on value

choose a deterministic policy with parameter theta and update that theta by gradient ascent on the value Q
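written out with the chain rule through the action (my notation):

$$\nabla_\theta J(\theta) = \mathbb{E}_{S}\!\left[ \nabla_\theta \mu_\theta(S)\; \nabla_a q_w(S, a)\big|_{a = \mu_\theta(S)} \right]$$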

 

Continuous actor-critic learning automaton (Cacla)

notice the update doesn't include the TD error explicitly but uses it as an indicator (lines 5, 6)

it is doing some sort of hill climbing, updating towards actions that turned out to be good for us (rough sketch below)
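a minimal sketch of that idea with a linear actor and critic (all of the names and the interface here are my own assumptions):

```python
import numpy as np

def cacla_update(theta, w, phi, action, reward, phi_next, done,
                 alpha_actor=0.1, alpha_critic=0.1, gamma=0.99):
    """Cacla-style update (my sketch): the TD error is only used as an indicator.

    The actor outputs a deterministic action mu = theta^T phi; exploration comes
    from noise added when the action was selected. If the TD error is positive,
    move the actor output towards the action that was actually taken.
    """
    v = w @ phi
    v_next = 0.0 if done else w @ phi_next
    delta = reward + gamma * v_next - v

    # critic: TD(0)
    w = w + alpha_critic * delta * phi

    # actor: update only when the explored action did better than expected;
    # the size of delta is not used, only its sign
    if delta > 0:
        mu = theta @ phi
        theta = theta + alpha_actor * (action - mu) * phi

    return theta, w
```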

 
