
Lec6) Language Models and RNNs

Tony Lim 2021. 4. 27. 14:08

Language Modeling is the task of predicting what word comes next
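
Formally, in the lecture's notation, a language model computes the probability of the next word given all the words so far:

$$P\big(x^{(t+1)} \mid x^{(t)}, \ldots, x^{(1)}\big)$$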

 

N-gram Language Models

An n-gram model looks at the previous n-1 words (e.g. the previous 3 for a 4-gram) and predicts the probability of the next word from counts. But this model will predict zero probability whenever our training corpus didn't contain the particular n-gram we are looking for.
=> use a smoothing technique, which turns those zero counts into small nonzero probabilities.

The same problem hits the denominator: if the model has never seen "students opened their", we can't even compute the probability.
=> back off and just count "students opened" instead.

It is hard to increase n beyond about 5, because the sparsity problem only gets worse: longer contexts almost never appear in the corpus, and the counts we have to store blow up.
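
A minimal sketch of that counting idea, with add-k smoothing and a simple backoff rule (the toy corpus and the value of k are made-up assumptions, not the lecture's exact formulation):

```python
from collections import Counter

corpus = ("the students opened their books "
          "the students opened their minds").split()

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))  # counts of 3-word sequences
bigrams  = Counter(zip(corpus, corpus[1:]))              # counts of their 2-word prefixes
vocab    = set(corpus)

def p_next(w1, w2, w3, k=0.1):
    """P(w3 | w1 w2) with add-k smoothing, so unseen trigrams are not zero."""
    return (trigrams[(w1, w2, w3)] + k) / (bigrams[(w1, w2)] + k * len(vocab))

def p_next_backoff(w1, w2, w3):
    """If the prefix 'w1 w2' was never seen, back off to conditioning on w2 only."""
    if bigrams[(w1, w2)] == 0:
        unigrams = Counter(corpus)
        return bigrams[(w2, w3)] / max(unigrams[w2], 1)
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

print(p_next("students", "opened", "their"))   # seen trigram -> relatively high
print(p_next("students", "opened", "books"))   # unseen trigram -> small but nonzero
```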

Using a fixed-window neural network to predict the next word is different from the n-gram model: concatenate the embeddings of the words in the window, push them through a hidden layer, and output a distribution over the vocabulary (a sketch follows the lists below).

Good

  • no sparsity problem: the prediction might not be good, but it will never be a hard zero.
  • no need to store all observed n-grams; we only need to store an embedding vector per word.

Bad

  • the window cannot be large, because the weight matrix W grows with it.
  • no symmetry in how the inputs are processed: each window position is multiplied by its own slice of W, so what is learned about a word in one position is not shared with the other positions; we would rather have W learn about every word at every position.
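
A rough PyTorch sketch of such a fixed-window neural LM (the class name FixedWindowLM, the window size, and the dimensions are my own assumptions for illustration):

```python
import torch
import torch.nn as nn

class FixedWindowLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128, window=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # W multiplies the concatenated window embeddings, so its size
        # grows with the window: (window * emb_dim) x hidden_dim
        self.W = nn.Linear(window * emb_dim, hidden_dim)
        self.U = nn.Linear(hidden_dim, vocab_size)

    def forward(self, window_ids):             # (batch, window)
        e = self.embed(window_ids)             # (batch, window, emb_dim)
        x = e.flatten(1)                       # concatenate the window embeddings
        h = torch.tanh(self.W(x))              # hidden layer
        return self.U(h)                       # scores over the vocabulary

model = FixedWindowLM(vocab_size=10000)
logits = model(torch.randint(0, 10000, (8, 4)))   # 8 windows of 4 previous words
print(logits.shape)                               # torch.Size([8, 10000])
```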

 

Notice the RNN has two weight matrices, W_h and W_e, and the same two matrices are applied (and updated) at every timestep, so they learn from every word (a minimal sketch follows the Good/Bad lists below).

Good 

  • any input length is fine; the size of the weight matrices stays the same.
  • in theory it can use information from many steps back (in practice this is limited by the vanishing gradient problem).
  • the same weights are applied at every timestep, so now there is symmetry in how the inputs are processed.

Bad

  • slow: nothing can be computed in parallel, every timestep has to be computed in sequence.
  • if the sequence is too long, the vanishing gradient problem appears.
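
A minimal sketch of the RNN-LM, with the recurrence written out by hand so the shared W_h and W_e are visible (the class name RNNLM, the dimensions, and the use of tanh are assumptions, not a definitive implementation):

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.W_e = nn.Linear(emb_dim, hidden_dim, bias=False)   # input -> hidden
        self.W_h = nn.Linear(hidden_dim, hidden_dim)            # hidden -> hidden
        self.U   = nn.Linear(hidden_dim, vocab_size)            # hidden -> vocab scores

    def forward(self, ids):                      # ids: (batch, seq_len)
        batch, seq_len = ids.shape
        h = torch.zeros(batch, self.W_h.out_features, device=ids.device)
        outputs = []
        for t in range(seq_len):                 # sequential: no parallelism over t
            e_t = self.embed(ids[:, t])
            # the SAME W_h and W_e are applied at every timestep
            h = torch.tanh(self.W_e(e_t) + self.W_h(h))
            outputs.append(self.U(h))
        return torch.stack(outputs, dim=1)       # (batch, seq_len, vocab_size)

model = RNNLM(vocab_size=10000)
logits = model(torch.randint(0, 10000, (8, 20)))  # any sequence length works
print(logits.shape)                               # torch.Size([8, 20, 10000])
```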

 

Since computing the loss over the whole corpus at once is far too expensive, in practice we use stochastic gradient descent: compute the loss on a small batch of sentences, update the weights, and repeat.
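
A sketch of one such stochastic step, reusing the hypothetical RNNLM class from the sketch above (batch size, sequence length, and learning rate are arbitrary):

```python
import torch
import torch.nn as nn

model = RNNLM(vocab_size=10000)                      # hypothetical class from above
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# one stochastic step: a small batch of sequences, not the whole corpus
batch = torch.randint(0, 10000, (8, 21))             # (batch, seq_len + 1)
inputs, targets = batch[:, :-1], batch[:, 1:]        # target is the next word

logits = model(inputs)                               # (8, 20, 10000)
loss = loss_fn(logits.reshape(-1, 10000), targets.reshape(-1))
optimizer.zero_grad()
loss.backward()                                      # backpropagation through time
optimizer.step()
```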

Calculating the partial derivative of the loss with respect to W_h is just adding up the partial derivatives at every timestep where W_h appears (backpropagation through time).
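
Written out, with J^(t) the loss at timestep t, that sum is:

$$\frac{\partial J^{(t)}}{\partial W_h} \;=\; \sum_{i=1}^{t} \left.\frac{\partial J^{(t)}}{\partial W_h}\right|_{(i)}$$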

We get different RNN-LMs depending on the text we train on. After training, we can generate text by sampling a word from the output distribution, feeding it back in as the next input, and repeating.
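
A generation sketch using the same hypothetical RNNLM: sample from the output distribution and feed the sample back as the next input (the start token id and the length are arbitrary; re-running the whole prefix each step is wasteful but keeps the sketch short):

```python
import torch

model = RNNLM(vocab_size=10000)          # hypothetical model from the sketches above
model.eval()

generated = [0]                          # assume id 0 is some start token
with torch.no_grad():
    for _ in range(20):
        ids = torch.tensor([generated])                   # everything generated so far
        probs = torch.softmax(model(ids)[0, -1], dim=-1)  # distribution over next word
        next_id = torch.multinomial(probs, 1).item()      # sample instead of argmax
        generated.append(next_id)                         # output becomes next input
print(generated)
```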

There is a standard measure for evaluating LMs called perplexity. Since it is just the exponential of the cross-entropy loss, reducing cross-entropy also reduces perplexity.
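
A tiny illustration of that relationship (the loss number is made up):

```python
import math

# perplexity is just exp(average per-word cross-entropy loss)
avg_cross_entropy = 4.2                 # hypothetical per-word loss in nats
perplexity = math.exp(avg_cross_entropy)
print(perplexity)                       # ~66.7: like choosing among ~67 equally likely words
```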

 

LM use cases

Sentiment analysis, question answering, machine translation

