AI/NLP (cs224n)

Lec7) Vanishing Gradients, Fancy RNNs

Tony Lim 2021. 4. 28. 17:51

vanishing gradient happens when the intermediate gradients are small: by the chain rule their product makes the overall gradient for distant timesteps smaller and smaller

if vanishing gradients happen we cannot tell whether

  • there really is no long-distance dependency in the data, or
  • there is a dependency in the data, but our parameters are wrong so we couldn't capture it.

Due to vanishing gradients, RNN-LMs are better at learning from sequential recency than from syntactic recency.
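A minimal sketch of the effect (PyTorch assumed; the weight scale and sequence lengths are toy values): backprop through a plain tanh RNN and watch how the gradient at the earliest state shrinks as the number of steps grows.

```python
import torch

torch.manual_seed(0)
hidden_size = 20
W = 0.1 * torch.randn(hidden_size, hidden_size)   # smallish recurrent weights

for steps in (5, 20, 50):
    h0 = torch.randn(hidden_size, requires_grad=True)
    h = h0
    for _ in range(steps):                         # plain tanh RNN, inputs omitted
        h = torch.tanh(W @ h)
    h.sum().backward()                             # loss depends only on the final state
    print(steps, h0.grad.norm().item())            # the norm shrinks as 'steps' grows
```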

 

there is also the Exploding Gradient problem = if the gradient gets too big, we take a huge step when we do the gradient update.

solution = Gradient Clipping == if the gradient norm gets bigger than some threshold, scale it down before applying the update.
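A minimal sketch of gradient clipping (PyTorch assumed; the model, dummy loss, and threshold 5.0 are placeholders): clip the global gradient norm after backward() and before the optimizer step.

```python
import torch

model = torch.nn.LSTM(input_size=100, hidden_size=256)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(35, 8, 100)                      # (seq_len, batch, input_size)
out, _ = model(x)
loss = out.pow(2).mean()                         # dummy loss, just to produce gradients

optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # scale down if norm > 5
optimizer.step()
```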

 

LSTM

attempts to remember information from earlier timesteps by adding a cell state alongside the hidden state.

The TA got the question "when computing the forget gate, why isn't c_{t-1} used as an input?" and the answer was that they weren't sure.
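For reference, a minimal sketch of one LSTM step in the standard formulation (PyTorch tensors assumed, biases omitted): the gates are computed from h_{t-1} and x_t only, while c_{t-1} only enters the cell-state update, which is exactly what the question is pointing at. Peephole LSTM variants do feed c_{t-1} into the gates.

```python
import torch

hidden_size, input_size = 4, 3
x_t = torch.randn(input_size)            # current input
h_prev = torch.zeros(hidden_size)        # h_{t-1}
c_prev = torch.zeros(hidden_size)        # c_{t-1}

# one weight matrix per gate, acting on [h_{t-1}; x_t] (biases omitted)
W_f, W_i, W_o, W_c = (torch.randn(hidden_size, hidden_size + input_size) for _ in range(4))
hx = torch.cat([h_prev, x_t])

f_t = torch.sigmoid(W_f @ hx)            # forget gate: no c_{t-1} here
i_t = torch.sigmoid(W_i @ hx)            # input gate
o_t = torch.sigmoid(W_o @ hx)            # output gate
c_tilde = torch.tanh(W_c @ hx)           # candidate cell content
c_t = f_t * c_prev + i_t * c_tilde       # c_{t-1} only appears in the cell update
h_t = o_t * torch.tanh(c_t)              # hidden state read out from the cell state
```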

GRU

in practice LSTM is powerful but GRU is faster; start with LSTM and switch to GRU if speed matters (see the sketch below).
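A small sketch (PyTorch assumed; sizes are placeholders) of how easy the swap is, since the module interfaces are nearly identical apart from the cell state.

```python
import torch

x = torch.randn(35, 8, 100)                        # (seq_len, batch, input_size)
lstm = torch.nn.LSTM(input_size=100, hidden_size=256)
gru = torch.nn.GRU(input_size=100, hidden_size=256)

out_lstm, (h_n, c_n) = lstm(x)                     # LSTM also returns a cell state
out_gru, h_n = gru(x)                              # GRU has no separate cell state
print(out_lstm.shape, out_gru.shape)               # both (35, 8, 256)
```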

 

Bidirectional RNNs

The backward RNN does the same computation as the forward RNN, just with the direction reversed; we have separate weight parameters for the forward and backward RNNs.

But this is not applicable in every setting, for example in LM, where only the left context is available; it is useful when you have access to the full input sequence.
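A minimal sketch (PyTorch assumed; sizes are placeholders): setting bidirectional=True gives the forward and backward RNNs their own weights and concatenates their hidden states, so the output dimension doubles.

```python
import torch

birnn = torch.nn.LSTM(input_size=100, hidden_size=256, bidirectional=True)
x = torch.randn(35, 8, 100)                        # (seq_len, batch, input_size)
out, _ = birnn(x)
print(out.shape)                                   # (35, 8, 512): forward + backward concatenated
```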

 

Multi-Layer RNNs

Lower layers might learn lower-level features like grammar, and higher layers might learn semantics.

The order of computing the hidden states can vary: you can compute one whole layer across all timesteps before moving up to the next layer, or compute every layer at each timestep before moving on.
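A minimal sketch (PyTorch assumed; sizes are placeholders): num_layers stacks RNNs, so each layer's sequence of hidden states becomes the input sequence of the layer above.

```python
import torch

stacked = torch.nn.LSTM(input_size=100, hidden_size=256, num_layers=3)
x = torch.randn(35, 8, 100)                        # (seq_len, batch, input_size)
out, (h_n, c_n) = stacked(x)
print(out.shape)    # (35, 8, 256): hidden states of the top (3rd) layer
print(h_n.shape)    # (3, 8, 256): final hidden state of each layer
```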
