Posts in the 'AI/RL (2021 DeepMind x UCL)' category

AI/RL (2021 DeepMind x UCL ) 12

Lecture 10: Approximate Dynamic Programming

We are just trying to optimize our policy even under the 2 bad (practical) conditions. Infinity norm == just taking the largest absolute entry of the given vector. It doesn't need to converge to the optimum; we observe just a few (n) steps and evaluate whether this is a good policy. Performance of AVI: initial error == how far my first value function is from the optimal one; our n-step difference should be bounded by those 2 error terms. Performance of A..

AI/RL (2021 DeepMind x UCL ) 2022.01.16
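
The infinity norm mentioned in the Lecture 10 excerpt can be illustrated with a tiny example. This is a minimal sketch, not code from the lecture; the function name `infinity_norm` and the two value vectors are made up for illustration.

```python
def infinity_norm(v):
    """Largest absolute entry of a vector (hypothetical helper)."""
    return max(abs(x) for x in v)

# Hypothetical value estimates for three states, and assumed optimal values.
v_estimate = [1.0, -3.5, 2.0]
v_optimal = [1.5, -3.0, 0.0]

# Error between the two value functions, measured in the infinity norm.
error = infinity_norm([a - b for a, b in zip(v_estimate, v_optimal)])
print(error)  # 2.0
```

The infinity norm only reports the single worst state, which is why it is a natural way to state "how far from optimal" a value function is.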

Lecture 9: Policy-Gradient and Actor-Critic methods

Actor-critic = the value is used to criticize (update) the policy. Deterministic policy = greedy; stochastic policy = epsilon-greedy, which has some randomness. Doesn't always generalise well = at some point it will just keep going with the learned policy even though the environment changed. Stochastic policies with function approximation: the agent cannot distinguish which state it is in even though the environment is an MDP == partially observ..

AI/RL (2021 DeepMind x UCL ) 2022.01.02

Lecture 8: Planning & models

Dynamic Programming: assume a model, solve the model, no need to interact with the world at all. Model-free RL: no model, learn the value function from experience. Model-based RL: learn a model from experience, then plan value functions using the learned model. Model-based RL disadvantage: first learn a model, then construct a value function == 2 sources of approximation error. (In the model-free case) learn value functio..

AI/RL (2021 DeepMind x UCL ) 2021.12.25

Lecture 7: Function Approximation

In the subsequent slides, whenever state notation appears it means the agent state, and it is a vector. Classes of function approximation: tabular = learn every individual state; state aggregation = partition all states into a discrete set and learn each partition like a batch; linear function approximation. Tabular case = just consider feature vectors that have as many entries as there are states, and then have the entry of these curre..

AI/RL (2021 DeepMind x UCL ) 2021.12.19

Lecture 6: Model-Free Control

Monte Carlo control. Recap: policy iteration has 2 steps = policy evaluation -> policy improvement (greedy in this example), repeated over and over until both converge. Model-free policy iteration using the action-value function: by taking the action into account, we can have model-free improvement. Generalised policy iteration with the action-value function: we are not evaluating the whole policy, but instead, we ..

AI/RL (2021 DeepMind x UCL ) 2021.12.11

Lecture 5: Model-free Prediction (part 2)

Comparing MC vs TD. Random walk example using TD: after 1 iteration only A changes, because when we go left once the value goes down to zero; the other cases stay the same as the initial state (0 iterations, (0.5+0)/2). Alpha is the learning rate; learning is slow when it is small. MC has high variance and requires a lower learning rate; TD has lower variance than MC. Batch (updating) MC and TD: most likely model given this episode. TD explo..

AI/RL (2021 DeepMind x UCL ) 2021.12.04
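
The random-walk observation in the Lecture 5 (part 2) excerpt, that only A changes after the first episode, can be reproduced with a TD(0) update. This is a minimal sketch under assumptions: the state names, the episode, `alpha = 0.1`, `gamma = 1.0`, and the helper `td0_update` are all illustrative and not taken from the lecture.

```python
def td0_update(values, state, reward, next_state, alpha, gamma):
    """One TD(0) update: move V(state) toward reward + gamma * V(next_state)."""
    target = reward + gamma * values[next_state]
    values[state] += alpha * (target - values[state])

# All non-terminal states start at 0.5; the terminal state has value 0.
values = {"A": 0.5, "B": 0.5, "C": 0.5, "D": 0.5, "E": 0.5, "left_end": 0.0}

# Assumed first episode: start in C, walk left until the episode terminates.
episode = [("C", 0.0, "B"), ("B", 0.0, "A"), ("A", 0.0, "left_end")]
for state, reward, next_state in episode:
    td0_update(values, state, reward, next_state, alpha=0.1, gamma=1.0)

print(values)  # only the entry for A has moved away from 0.5
```

For B and C the target equals the current estimate, so the TD error is zero and nothing changes; only the step into the terminal state produces a non-zero error.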

Lecture 5: Model-free Prediction (part 1)

Model-free prediction = Monte Carlo algorithms: no knowledge of the MDP required, only samples, to learn without a model. Multi-armed bandit: right side = true action value, which is the expected reward given action a; left side = estimate at time step t, the average of the rewards given that you have taken that action on the subsequent time steps. Like GD, we add our error term (observed reward - current ..

AI/RL (2021 DeepMind x UCL ) 2021.12.04
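
The "estimate plus error term" idea in the Lecture 5 (part 1) excerpt can be sketched as an incremental sample average. This is a minimal sketch, not the lecture's code; the function name `update_estimate`, the 1/n step size, and the reward sequence are assumptions for illustration.

```python
def update_estimate(q, n, reward):
    """Return the new estimate and count after observing one reward.

    q: current estimate of the action value
    n: number of rewards already averaged into q
    """
    n += 1
    # New estimate = old estimate + step size * (observed reward - old estimate)
    q += (reward - q) / n
    return q, n

q, n = 0.0, 0
for r in [1.0, 0.0, 1.0, 1.0]:  # made-up rewards for a single action
    q, n = update_estimate(q, n, r)

print(q)  # approximately 0.75, the plain average of the four rewards
```

With a step size of 1/n the update reproduces the sample mean; a constant step size would instead weight recent rewards more heavily.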

Lecture 4: Theoretical Fund. of Dynamic Programming Algorithms

Contraction mapping: we take a sequence that is convergent in that space, apply the transformation (T) to that sequence, and get another convergent sequence that converges to T of x, where x is the limit of the original sequence. Fixed point: if we apply the T transform to x, it goes back to the original x. Banach Fixed Point Theorem: we know that the sequence converges to a unique fixed point x star. The Bellman Optimality Oper..

AI/RL (2021 DeepMind x UCL ) 2021.11.27
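
The fixed-point statement in the Lecture 4 excerpt can be checked numerically on a toy contraction. This is a minimal sketch; the map `T(x) = 0.5 * x + 1` is an assumed example chosen for illustration (it is a contraction with factor 0.5 and fixed point 2), not an operator from the lecture.

```python
def T(x):
    """A toy contraction on the real line with contraction factor 0.5."""
    return 0.5 * x + 1.0

def iterate(x0, steps):
    """Apply T repeatedly, starting from x0."""
    x = x0
    for _ in range(steps):
        x = T(x)
    return x

# Two very different starting points end up at the same fixed point.
x_a = iterate(10.0, 50)
x_b = iterate(-100.0, 50)
print(x_a, x_b)  # both are very close to 2.0
```

Each application of T halves the distance to the fixed point, so the starting point stops mattering after a few dozen iterations.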

Lecture 3: MDPs and Dynamic Programming

Markov property: once the state is known, the history may be thrown away. Returns: the discount gamma (0~1) is the present value of future rewards; if close to 0 it leads to myopic (very short-sighted) evaluation, if close to 1 it leads to far-sighted evaluation. Why discount? Problem specification = immediate rewards may actually be more valuable (e.g. consider earning interest). Solution side = mathematically conveni..

AI/RL (2021 DeepMind x UCL ) 2021.11.20
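
The effect of the discount gamma described in the Lecture 3 excerpt can be seen on a short reward sequence. This is a minimal sketch; the helper `discounted_return` and the reward values are assumptions for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the k-th future reward is weighted by gamma**k."""
    g = 0.0
    # Work backwards so each step is just: reward + gamma * (return after it).
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 1.0, 1.0]  # made-up rewards

myopic = discounted_return(rewards, 0.0)       # only the first reward counts
far_sighted = discounted_return(rewards, 1.0)  # every reward counts fully
in_between = discounted_return(rewards, 0.5)

print(myopic, far_sighted, in_between)  # 1.0 3.0 1.75
```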

Lecture 2: Exploration and Exploitation (part 2)

Theorem: What is possible? Unlike greedy, we can explore other actions (even one whose Qt is not high, because of Ut, the uncertainty of an action that is not picked very often). How wrong our estimates are going to be (Ut). Intuition 1: the more numbers we have in the average, the less likely it is that, if we then add "u", an added amount (Xn), this is still going to be smaller than the actual mean. Similarly, if we pick "u" to be..

AI/RL (2021 DeepMind x UCL ) 2021.11.14

