Theorem: What is Possible?
Unlike greedy, we can still explore other actions: an action can be selected even when its Q_t(a) is not high, because its uncertainty bonus U_t(a) is large for actions that have not been picked often.
U_t(a) measures how wrong our estimate Q_t(a) might be.
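Concretely, this is the standard UCB selection rule from the lecture: pick the action maximizing the value estimate plus its uncertainty bonus.

$$ a_t = \arg\max_a \Big( Q_t(a) + U_t(a) \Big) $$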
Intuition
1. The more samples n we have in our average, the less likely it is that the actual mean is still larger than the sample mean plus some added amount u.
2. Similarly, if we pick u to be larger, i.e. far enough away, it becomes exceedingly unlikely that our sample mean is off by that much.
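This intuition is Hoeffding's inequality: for i.i.d. samples X_1, ..., X_n bounded in [0, 1] with sample mean $\bar{X}_n$,

$$ p\left( \mathbb{E}[X] > \bar{X}_n + u \right) \le e^{-2nu^2} $$

Setting this probability equal to some p and solving for u gives the bonus $U_t(a) = \sqrt{\log(1/p) \,/\, (2 N_t(a))}$; shrinking p over time (e.g. p = 1/t) yields a UCB-style bonus that grows with log t and shrinks with the count N_t(a).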
With a larger c we explore more, with a smaller c we explore less; c = 0 reduces to plain greedy.
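A minimal Python sketch of UCB action selection with the exploration constant c; the names `q`, `counts`, and the try-each-arm-once rule are my assumptions for illustration, not from the lecture:

```python
import numpy as np

def ucb_action(q, counts, t, c):
    """Pick argmax_a Q_t(a) + c * sqrt(log t / N_t(a)).

    q      -- array of sample-average value estimates Q_t(a)
    counts -- array of pull counts N_t(a)
    t      -- current time step (1-indexed so log t >= 0)
    c      -- exploration constant; c == 0 reduces to greedy
    """
    # Try every action once before trusting the estimates.
    untried = np.where(counts == 0)[0]
    if len(untried) > 0:
        return int(untried[0])
    bonus = c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(q + bonus))
```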
Q_t(a) is the sample average of the rewards observed for action a; the gap Δ_a = v* − q(a) is the optimal value minus the true value of the chosen action a.
Theorem: the total regret of UCB can be bounded logarithmically in t; the proof is at (1:10:40) in the lecture video.
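With N_t(a) the number of times action a has been selected, total regret decomposes over the gaps, and one standard statement of the logarithmic bound (the UCB1 bound of Auer et al., 2002, assuming rewards in [0, 1]) is:

$$ L_t = \sum_a \mathbb{E}[N_t(a)]\,\Delta_a \;\le\; 8 \sum_{a:\,\Delta_a > 0} \frac{\log t}{\Delta_a} + \left(1 + \frac{\pi^2}{3}\right) \sum_a \Delta_a $$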
Bayesian Approach
First assume a certain prior distribution over the action values, then update this belief (the posterior) as we observe rewards.
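As a concrete example (my illustration, not spelled out in these notes): for Bernoulli rewards, a Beta(α_a, β_a) prior on each action's success probability stays Beta after conditioning on a reward r:

$$ \mathrm{Beta}(\alpha_a,\, \beta_a) \;\to\; \mathrm{Beta}(\alpha_a + r,\; \beta_a + 1 - r) \quad \text{after observing } r \in \{0, 1\} $$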
Thompson Sampling
Probability matching: pick each action according to the probability (under our current beliefs) that it is the optimal action.
Actions then get picked more often when either their estimated value is high or their uncertainty is high.
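In symbols, with H_t the history of actions and rewards up to time t:

$$ \pi_t(a) = p\left( q(a) = \max_{a'} q(a') \;\middle|\; H_t \right) $$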
Thompson sampling implements this without computing the probabilities: sample action values from the belief (posterior) distribution,
then pick the greedy action according to the sampled action values.
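A minimal Thompson sampling sketch for a Bernoulli bandit with the Beta posteriors above; the environment, seed, and variable names are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def thompson_step(alpha, beta, true_probs):
    """One Thompson sampling step on a Bernoulli bandit.

    alpha, beta -- Beta posterior parameters per action
    true_probs  -- hidden success probabilities (the environment)
    """
    # 1. Sample one action value per action from the posterior belief.
    samples = rng.beta(alpha, beta)
    # 2. Act greedily with respect to the sampled values.
    a = int(np.argmax(samples))
    # 3. Observe a Bernoulli reward and update the posterior.
    r = rng.binomial(1, true_probs[a])
    alpha[a] += r
    beta[a] += 1 - r
    return a, r

# Example: 3-armed bandit, uniform Beta(1, 1) priors.
alpha, beta = np.ones(3), np.ones(3)
for t in range(1000):
    thompson_step(alpha, beta, true_probs=np.array([0.2, 0.5, 0.7]))
print(alpha / (alpha + beta))  # posterior means concentrate on the best arm
```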
Planning to Explore