Most word vectors are represented as row vectors.
Gradient Descent
Computing the gradient naively over the whole corpus takes too much time.
Stochastic gradient descent (SGD) = randomly choose a small sample (or batch) at each step and do the same regular gradient descent update, which makes computation much faster.
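A minimal sketch of the idea in Python (the parameter vector `theta`, dataset `data`, and `grad_loss` helper are hypothetical names, not from the lecture):

```python
import numpy as np

def sgd(theta, data, grad_loss, lr=0.05, batch_size=32, steps=1000):
    """Plain SGD: each step uses a random mini-batch instead of the whole corpus."""
    n = len(data)
    for _ in range(steps):
        # sample a small batch of examples uniformly at random
        idx = np.random.choice(n, size=batch_size, replace=False)
        batch = [data[i] for i in idx]
        # regular gradient step, but the gradient is computed only on the batch
        theta = theta - lr * grad_loss(theta, batch)
    return theta
```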
Skip-gram = you have 1 center word and predict all the 'outside' words in the context
Continuous Bag of Words (CBOW) = predict the center word from the context words
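To make the two setups concrete, here is a small illustrative sketch (my own, not the lecture's code) of how training examples come out of a context window:

```python
def make_examples(tokens, window=2):
    """Build (skip-gram pairs, CBOW examples) from one tokenized sentence."""
    skipgram, cbow = [], []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        # skip-gram: the center word predicts each outside word separately
        skipgram += [(center, ctx) for ctx in context]
        # CBOW: all context words together predict the center word
        cbow.append((context, center))
    return skipgram, cbow

pairs, bags = make_examples("the quick brown fox jumps".split())
```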
Negative Sampling
We try to minimize the objective function:
1. We want the observed (center, outside) word pairs to have high probability.
2. We choose K random words and give them low probability.
With this small change we can somewhat reduce the problem caused by very frequent words.
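A rough sketch of the loss for a single (center, outside) pair under negative sampling (vector names and shapes are my assumptions, following the standard word2vec formulation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_c, u_o, U_neg):
    """v_c: center vector, u_o: observed outside vector, U_neg: K x d matrix of negative vectors."""
    pos = -np.log(sigmoid(u_o @ v_c))             # 1. observed pair -> push its probability up
    neg = -np.sum(np.log(sigmoid(-U_neg @ v_c)))  # 2. K sampled words -> push their probability down
    return pos + neg
```

In word2vec the K negatives are drawn from the unigram distribution raised to the 3/4 power, which is what dampens the influence of very frequent words.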
GloVe
count-based + distributional (prediction-based) approaches
Using probe words (solid, gas, water, ...) we can measure the relationship between words (ice, steam) through their co-occurrence probabilities.
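Written out, the ratio the probe words are checking (as in the GloVe paper's ice/steam example) is:

```latex
\frac{P(k \mid \text{ice})}{P(k \mid \text{steam})}
\quad
\begin{cases}
\text{large}  & k = \text{solid} \\
\text{small}  & k = \text{gas} \\
\approx 1     & k = \text{water, fashion}
\end{cases}
```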
f is a weighting function that reduces the influence of very high-frequency words.
GloVe tries to capture in its objective the overall corpus statistics of how often these words appear together (the co-occurrence counts).
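For reference, the weighted least-squares objective GloVe minimizes (the standard form from the GloVe paper; this is where the weighting function f enters):

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}
```

f caps the contribution of extremely frequent co-occurrences, which is the "reducing the power of high-frequency words" mentioned above.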