AI/RL (2021 DeepMind x UCL)

Lecture 2: Exploration and Exploitation (part 2)

Tony Lim 2021. 11. 14. 14:56

Theorem: What is Possible?
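
The theorem here is the Lai and Robbins lower bound: no algorithm can do better than logarithmic asymptotic total regret. Writing it out as I recall it from the lecture slides:

$$\lim_{t \to \infty} L_t \;\ge\; \log t \sum_{a \,:\, \Delta_a > 0} \frac{\Delta_a}{\mathrm{KL}\left(\mathcal{R}_a \,\|\, \mathcal{R}_{a^*}\right)}$$

Hard problems, where a suboptimal arm's reward distribution looks similar to the optimal arm's (small KL divergence), have a larger lower bound.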

Unlike greedy, we can explore other actions: even if an action's value estimate Qt(a) is not high, it can still get selected because of Ut(a), the uncertainty bonus, which is large for actions that have not been picked very often.

 

Ut(a) captures how wrong our estimate Qt(a) might be.

Intuition:

1. The more samples we have in our average, the less likely it is that the sample mean plus some added amount u is still smaller than the actual mean.

2. Similarly, if we pick u to be larger, i.e. far enough away, it becomes exceedingly unlikely that our sample mean is that far off.
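
This intuition is Hoeffding's inequality. Sketching the derivation from the lecture in my notation: for i.i.d. samples $X_1, \dots, X_n$ in $[0, 1]$ with sample mean $\bar{X}_n$,

$$P\big(\mathbb{E}[X] > \bar{X}_n + u\big) \;\le\; e^{-2nu^2}$$

Applying this to action values and requiring the bound to equal some small probability p gives the uncertainty term:

$$e^{-2 N_t(a) U_t(a)^2} = p \quad \Rightarrow \quad U_t(a) = \sqrt{\frac{-\log p}{2\, N_t(a)}}$$

Choosing p = 1/t, so the confidence requirement tightens over time, yields $U_t(a) = \sqrt{\frac{\log t}{2\, N_t(a)}}$, and UCB then selects

$$a_t = \operatorname*{argmax}_a \Big( Q_t(a) + c\, U_t(a) \Big)$$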

 

With a larger c we explore more, with a smaller c we explore less; c == 0 recovers greedy.

Qt(a) is the sample average of the rewards observed for action a; Δa = v* − q(a) is the gap between the optimal value and the value of the chosen action a.
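
A minimal sketch of UCB on a Bernoulli bandit (my own code, not from the lecture; the arm probabilities are made up, and the bonus uses the form derived above):

```python
import numpy as np

def ucb(true_probs, steps=1000, c=1.0, seed=0):
    """UCB on a Bernoulli bandit: pick argmax_a Qt(a) + c * sqrt(log t / Nt(a))."""
    rng = np.random.default_rng(seed)
    k = len(true_probs)
    q = np.zeros(k)  # Qt(a): sample-average value estimates
    n = np.zeros(k)  # Nt(a): how often each action was picked
    for t in range(1, steps + 1):
        # Unvisited actions get an infinite bonus so each one is tried once.
        bonus = np.where(n > 0, c * np.sqrt(np.log(t) / np.maximum(n, 1)), np.inf)
        a = int(np.argmax(q + bonus))
        r = float(rng.random() < true_probs[a])  # Bernoulli reward
        n[a] += 1
        q[a] += (r - q[a]) / n[a]  # incremental sample-average update
    return q, n

q, n = ucb([0.2, 0.5, 0.7])  # hypothetical arm probabilities
```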

 

The total regret of UCB can be bounded by log t; the proof is in the lecture video at 1:10:40.
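
For reference, the bound in question (from Auer et al., 2002; constants as I remember them, so treat the exact numbers as indicative) has the shape

$$L_t \;\le\; 8 \sum_{a \,:\, \Delta_a > 0} \frac{\log t}{\Delta_a} + O(1)$$

so UCB achieves the optimal logarithmic growth of the Lai and Robbins lower bound, up to constants.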

https://www.youtube.com/watch?v=aQJP3Z2Ho8U&list=PLqYmG7hTraZDVH599EItlEWsUOsJbAodm&index=2&ab_channel=DeepMind 

 

Bayesian Approach

We first assume a certain (prior) distribution over the action values, and update it as we observe rewards.
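
As a concrete example (mine, not from the slides): for a Bernoulli bandit with a Beta prior, the Bayesian update is conjugate and very cheap:

$$p(\theta_a \mid \text{data}) \;\propto\; p(\text{data} \mid \theta_a)\, p(\theta_a)$$

Starting from $\mathrm{Beta}(\alpha, \beta)$ and observing reward $r \in \{0, 1\}$ from action a, the posterior is simply $\mathrm{Beta}(\alpha + r,\; \beta + 1 - r)$.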

 

Thompson Sampling

Probability matching: pick each action according to the probability that, under our current beliefs, it is the optimal action.

Actions get higher probability when either their estimated value is high or their uncertainty is high.

Thompson sampling implements this by sampling action values from the belief (posterior) distribution,

then picking the greedy action according to the sampled action values.
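
A minimal sketch of Thompson sampling for a Bernoulli bandit (my own illustration; it reuses the Beta-Bernoulli update above, and the arm probabilities are made up):

```python
import numpy as np

def thompson(true_probs, steps=1000, seed=0):
    """Thompson sampling on a Bernoulli bandit with Beta(1, 1) priors."""
    rng = np.random.default_rng(seed)
    k = len(true_probs)
    alpha = np.ones(k)  # per-action Beta posterior parameters
    beta = np.ones(k)
    for _ in range(steps):
        # Sample one plausible value per action from its posterior belief,
        samples = rng.beta(alpha, beta)
        # then act greedily with respect to the sampled values.
        a = int(np.argmax(samples))
        r = float(rng.random() < true_probs[a])  # Bernoulli reward
        alpha[a] += r        # conjugate update: Beta(alpha + r, beta + 1 - r)
        beta[a] += 1.0 - r
    return alpha, beta

alpha, beta = thompson([0.2, 0.5, 0.7])  # hypothetical arm probabilities
```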

 

Planning to Explore
