AI/RL (2021 DeepMind x UCL)

Lecture 8: Planning & models

Tony Lim 2021. 12. 25. 14:46

Dynamic Programming

assume a model, solve the model, no need to interact with the world at all
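A minimal sketch of what "solve the model" can mean here: value iteration on a small tabular MDP whose dynamics are given. The layout of P (P[s][a] as a list of (prob, next_state, reward) tuples) is an assumption for illustration.

import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.9, iters=100):
    # P[s][a] is a list of (prob, next_state, reward) tuples (assumed layout)
    v = np.zeros(n_states)
    for _ in range(iters):
        for s in range(n_states):
            # Bellman optimality backup: uses only the model, never the real world
            v[s] = max(
                sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                for a in range(n_actions)
            )
    return v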

 

Model-free RL

no model, learn value function from experience
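For contrast with the planning sketch above, a minimal model-free update: one step of tabular Q-learning from a single real transition. The array layout of q is an assumption.

import numpy as np

def q_learning_update(q, s, a, r, s_next, done, alpha=0.1, gamma=0.9):
    # update the action-value estimate directly from experience, no model used
    target = r if done else r + gamma * q[s_next].max()
    q[s, a] += alpha * (target - q[s, a])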

 

Model-based RL

learn a model from experience

plan value functions using the learned model

 

cons of model-based RL

first learn a model, then construct a value function == 2 sources of approximation error

(in model-free RL) learn the value function directly == only one source of approximation error

 

pros of model-based RL

models can be learned efficiently with supervised learning methods

reason about model uncertainty: since we have modeled the environment, we can explore a little more smartly

reduce interaction with the real world (interacting with the environment can be slow or expensive)

 

 

learning a model

the model's parameters (weights) above are denoted by the symbol that looks like an n (eta, η)

not just a linear (expectation) model; we can use other things like a deep neural network
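A minimal sketch of the supervised learning view: fit the model parameters by regressing observed rewards and next states on (state, action) pairs. model_fn and the transition layout are assumptions, and the same loss applies whether model_fn is linear or a deep NN.

import numpy as np

def model_loss(theta, model_fn, transitions):
    # supervised target: predict the reward and next state from (s, a)
    loss = 0.0
    for s, a, r, s_next in transitions:
        r_hat, s_next_hat = model_fn(theta, s, a)
        loss += (r - r_hat) ** 2 + np.sum((s_next - s_next_hat) ** 2)
    return loss / len(transitions)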

 

stochastic model

we may not want to assume everything is linear

stochastic models (also known as generative models)
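A minimal sketch of the difference from an expectation model: a stochastic (generative) model returns a sample of the next state rather than its expectation. The Gaussian form and variable names are assumptions for illustration.

import numpy as np

def sample_next_state(T, noise_chol, phi_s, rng):
    # generative model: sample s' ~ N(T phi(s), Sigma) instead of
    # returning only the expectation T phi(s)
    noise = noise_chol @ rng.standard_normal(noise_chol.shape[0])
    return T @ phi_s + noise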

 

how are we going to parameterize the model?

 

table lookup model

a full distribution for the transition dynamics and an expectation model for the rewards

in the given example we used all the observed transitions (without considering exploration) and built a model like the one on the right
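A minimal sketch of a table lookup model built from visit counts, assuming a small discrete state and action space (the class and variable names are mine).

from collections import defaultdict

class TableLookupModel:
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
        self.reward_sum = defaultdict(float)                  # (s, a) -> summed reward
        self.visits = defaultdict(int)                        # (s, a) -> N(s, a)

    def update(self, s, a, r, s_next):
        self.counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r
        self.visits[(s, a)] += 1

    def transition_probs(self, s, a):
        # full distribution: P_hat(s' | s, a) = N(s, a, s') / N(s, a)
        n = self.visits[(s, a)]
        return {s2: c / n for s2, c in self.counts[(s, a)].items()}

    def expected_reward(self, s, a):
        # expectation model for the rewards: mean of the observed rewards
        return self.reward_sum[(s, a)] / self.visits[(s, a)]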

 

Linear expectation models

T and w are the parameters (T predicts the expected next state, w the expected reward)
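A minimal sketch of what these parameters do, assuming a feature vector phi(s) and one T and w per action (the exact parameterization on the slides may differ).

import numpy as np

def linear_expectation_model(T_a, w_a, phi_s):
    # expected next-state features and expected reward for taking action a in s
    s_next_hat = T_a @ phi_s        # s'_hat = T_a phi(s)
    r_hat = float(w_a @ phi_s)      # r_hat  = w_a^T phi(s)
    return s_next_hat, r_hat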

 

 

planning for credit assignment

planning is the process of investing compute to improve values and policies without the need to interact with the environment

interested in planning algorithms that don't require privileged access to a perfect specification of the environment

instead the planning algorithms we discuss today use learned models

 

 

learn from data produced by the model, treat it as if it were real environment interaction, and apply model-free RL to it

the planning process may compute a suboptimal policy

 

combine model-based and model-free methods in a single algorithm == Dyna

 

d = apply direct Q-learning

e = update the model; for instance, in a tabular deterministic environment this is just storing the next state and the reward

f = mixing happens here: planning with the model + Q-learning (model-free)
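A minimal sketch of tabular Dyna-Q for a deterministic environment, following the d/e/f steps above. The env interface (reset/step), the hyperparameters, and the uniform sampling from the model are assumptions.

import random
from collections import defaultdict

def dyna_q(env, actions, episodes=50, n_planning=50, alpha=0.1, gamma=0.95, eps=0.1):
    q = defaultdict(float)      # (s, a) -> action value
    model = {}                  # (s, a) -> (r, s_next); deterministic model

    def greedy(s):
        return max(actions, key=lambda a: q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = random.choice(actions) if random.random() < eps else greedy(s)
            s_next, r, done = env.step(a)
            # (d) direct Q-learning from the real transition
            target = r if done else r + gamma * q[(s_next, greedy(s_next))]
            q[(s, a)] += alpha * (target - q[(s, a)])
            # (e) model learning: just store the observed next state and reward
            model[(s, a)] = (r, s_next)
            # (f) planning: n extra Q-learning updates on transitions sampled from the model
            for _ in range(n_planning):
                (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
                q[(ps, pa)] += alpha * (pr + gamma * q[(ps_next, greedy(ps_next))] - q[(ps, pa)])
            s = s_next
    return q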

 

Dyna-Q on a simple maze

a discount factor exists, so the agent tries to go from S to G in the shortest way possible

after the 1st episode:

n=0 == the values of all states are zero except for the one state close to the goal

n=50 == not just the state next to the goal; planning has actually updated many states

 

Dyna-Q with an inaccurate model

during learning, the model becomes wrong because the environment changes

Q+ == add an exploration bonus

AC == use actor-critic learning instead of Q-learning

only Dyna-Q+ quickly realizes there is a shorter path, because of its exploration bonus
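For reference, the Dyna-Q+ exploration bonus rewards transitions that haven't been tried in the real environment for a long time, so planning keeps revisiting them; one standard form (as in Sutton & Barto) is r + kappa * sqrt(tau). A minimal sketch, with kappa and the time bookkeeping as assumptions.

import math

def planning_reward(r, last_real_visit, current_step, kappa=1e-3):
    # tau = how long ago this (s, a) was last tried for real; older transitions
    # get a larger bonus, which drives re-exploration after the world changes
    tau = current_step - last_real_visit
    return r + kappa * math.sqrt(tau)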

 

 

Planning and Experience replay

but nowadays the sharp distinction between model-based and model-free is less clear

 

parametric model

some things cannot be done with an experience replay buffer but can be done with a parametric model, e.g. querying the model at states and actions we have not actually experienced
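A tiny sketch of this distinction; the names are illustrative only. A replay buffer can only re-emit transitions that actually happened, while a parametric model can be asked about any (s, a) we choose.

import random

def sample_from_replay(buffer, rng=random):
    # experience replay: we can only replay what was actually observed
    return rng.choice(buffer)             # a stored (s, a, r, s_next) tuple

def query_model(model_fn, s, a):
    # parametric model: can be queried at arbitrary, even unseen, (s, a)
    r_hat, s_next_hat = model_fn(s, a)
    return s, a, r_hat, s_next_hat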

 

parametric vs experience replay

not saying which is best but we can choose the appropriate one depending on the problem

 

Monte Carlo Tree Search

repeat as long as time allows; used in AlphaGo

note there are 2 simulation policies

a tree policy that improves during search

a rollout policy that is held fixed = often this may just be picking actions randomly
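A compact sketch of this loop, assuming a deterministic simulator with interfaces sim.actions(s), sim.step(s, a) -> s_next, sim.is_terminal(s) and sim.outcome(s), and assuming (as in the Go-style example below) that simulations terminate and the outcome arrives only at the end. UCT is used as the tree policy; the rollout policy is uniformly random and held fixed. All names here are mine, not from the lecture.

import math, random

class Node:
    def __init__(self, state):
        self.state = state
        self.children = {}      # action -> Node
        self.visits = 0
        self.total = 0.0        # sum of simulation outcomes backed up through this node

    def mean(self):
        return self.total / self.visits if self.visits else 0.0

def uct_child(node, c=1.4):
    # tree policy: trade off the current value estimate against an exploration term;
    # this policy improves as the counts and values in the tree are updated
    return max(node.children.values(),
               key=lambda ch: ch.mean() + c * math.sqrt(math.log(node.visits) / (ch.visits + 1e-8)))

def mcts(sim, root_state, n_simulations=200):
    root = Node(root_state)
    for _ in range(n_simulations):
        # 1. selection: follow the tree policy while every action is already expanded
        node, path = root, [root]
        while node.children and len(node.children) == len(sim.actions(node.state)):
            node = uct_child(node)
            path.append(node)
        # 2. expansion: add one untried action as a new leaf
        untried = [a for a in sim.actions(node.state) if a not in node.children]
        if untried and not sim.is_terminal(node.state):
            a = random.choice(untried)
            node.children[a] = Node(sim.step(node.state, a))
            node = node.children[a]
            path.append(node)
        # 3. rollout: fixed random policy from the leaf until the simulation ends
        s = node.state
        while not sim.is_terminal(s):
            s = sim.step(s, random.choice(sim.actions(s)))
        outcome = sim.outcome(s)      # e.g. 1 for a win, 0 for a loss
        # 4. backup: every node on the path keeps the average of its simulation outcomes
        for n in path:
            n.visits += 1
            n.total += outcome
    # recommend the most-visited root action
    return max(root.children, key=lambda a: root.children[a].visits)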

 

Example

inside each node, the left number is the score and the right number is the number of trials (simulations)

star = state we selected

rollout policy = default policy

we update the root to 1 because that is the average of the existing simulations

expand to one more node (star)

update the star to 0, update the root to 1/2

 

advantage of MC tree search

the search tree is a table lookup approach, but only a partial one: it only covers states reachable from the current state