Dependency structure shows which words depend on which other words, i.e. which words modify or are arguments of other words.
Up until now we looked at the meaning of individual words (word vectors). We need to understand sentence structure in order to convey complex meaning.
But a sentence's dependency structure is not always obvious: it can change depending on how we resolve ambiguities such as prepositional phrase attachment ambiguity.
Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas
In this class we are just going to use arrows (dependency arcs), not detailed grammar formalisms.
Knowing the grammar alone won't let us build the correct dependency structure, since there is a lot of ambiguity.
So people built treebanks: corpora with hand-annotated dependency structures.
There are many ways to build a parser from such a treebank, but machine learning worked best; with deeper and wider networks it worked even better.
Evaluating a parser means comparing the model's predicted dependencies against the human-annotated (gold) ones.
Before training we need to represent our set of words as vectors.
We just concatenate all three vectors into one long vector,
and then we feed it into a neural network.
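To make this concrete, here is a rough sketch of that pipeline (my own toy illustration, not the lecture's code), assuming the three concatenated vectors are word, part-of-speech, and arc-label embeddings and that a single hidden layer scores the parser's transitions; all sizes, the `featurize` helper, and the ReLU activation are illustrative choices.

```python
import numpy as np

# Sketch of a feed-forward transition scorer: look up embeddings, concatenate
# them into one long vector, and push it through one hidden layer.
rng = np.random.default_rng(0)

d_word, d_pos, d_label = 50, 10, 10       # assumed embedding sizes
hidden, n_transitions = 200, 3            # e.g. SHIFT, LEFT-ARC, RIGHT-ARC

# Toy embedding tables
E_word  = rng.normal(size=(1000, d_word))
E_pos   = rng.normal(size=(50,  d_pos))
E_label = rng.normal(size=(40,  d_label))

def featurize(word_ids, pos_ids, label_ids):
    """Concatenate word, POS, and arc-label embeddings into one long vector."""
    return np.concatenate([E_word[word_ids].ravel(),
                           E_pos[pos_ids].ravel(),
                           E_label[label_ids].ravel()])

x = featurize([3, 17, 42], [1, 2, 3], [0, 5, 7])

W1 = rng.normal(size=(hidden, x.size)); b1 = np.zeros(hidden)
W2 = rng.normal(size=(n_transitions, hidden)); b2 = np.zeros(n_transitions)

h = np.maximum(0, W1 @ x + b1)            # ReLU here for simplicity
scores = W2 @ h + b2
probs = np.exp(scores - scores.max()); probs /= probs.sum()
print(probs)                              # probability of each transition
```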
Optimizer
[#17.Lec] Advanced Optimizer than SGD - 딥러닝 홀로서기 - YouTube
An overview of gradient descent optimization algorithms (ruder.io)
Gradient descent = slow: if there are 23,000 parameters and 1,000,000 samples, we would need to compute 23,000 * 1,000,000 partial derivatives for one epoch (which is a single update in full-batch gradient descent). It is too slow.
Stochastic gradient descent = we randomly choose one sample out of the 1,000,000, compute the gradient with backpropagation, and update the parameters. For this optimizer to work well we need to lower the learning rate as training goes on (learning rate schedules).
Mini-batch gradient descent = instead of choosing one sample we choose several at a time, a mini-batch.
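A rough sketch contrasting the three variants on a toy least-squares problem (my own example; the `grad` function and all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)

def grad(w, Xb, yb):
    """Gradient of mean squared error on a batch."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

lr, w = 0.1, np.zeros(5)

# 1) Batch gradient descent: one update per pass over ALL samples.
w -= lr * grad(w, X, y)

# 2) Stochastic gradient descent: one update per single random sample.
i = rng.integers(len(X))
w -= lr * grad(w, X[i:i+1], y[i:i+1])

# 3) Mini-batch gradient descent: one update per small random batch.
idx = rng.choice(len(X), size=32, replace=False)
w -= lr * grad(w, X[idx], y[idx])
```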
Downsides of these three variants:
- Choosing a good learning rate is difficult.
- The same learning rate applies to every parameter, even though we might want larger updates for rarely occurring features.
- They might get stuck in suboptimal local minima or at saddle points and never escape.
Momentum
In the second image, SGD just oscillates up and down because the y dimension has a much larger partial derivative. Momentum helps SGD keep accelerating along the direction whose partial derivative is small but consistent.
We take the previous update into account: v_t = γ·v_{t-1} + η·∇J(θ), with γ < 1. The optimizer gains speed in a consistent direction and converges faster.
But it won't slow down on its own: it keeps gaining speed as it rolls downhill, whereas we would like it to slow down before the slope turns upward again.
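A minimal sketch of the momentum update described above, assuming a generic `grad_fn` and a toy quadratic objective:

```python
import numpy as np

def momentum_step(theta, v, grad_fn, lr=0.01, gamma=0.9):
    """One momentum update: v_t = gamma*v_{t-1} + lr*grad; theta -= v_t."""
    v = gamma * v + lr * grad_fn(theta)
    theta = theta - v
    return theta, v

# Toy usage on f(theta) = theta^2 (gradient 2*theta)
theta, v = np.array([5.0]), np.zeros(1)
for _ in range(100):
    theta, v = momentum_step(theta, v, lambda t: 2 * t)
print(theta)  # approaches the minimum at 0
```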
Nesterov accelerated gradient
We can get a peek at the future (roughly where the parameters are about to be moved) by computing θ - γ·v_{t-1}.
By doing this we now compute the gradient at the predicted future position of the parameters, not at the current position.
At the first (blue) vector we don't do any look-ahead, since there is no previous update term yet.
At the second iteration we have the brown vector, which is the first term (γ·v_{t-1}); from that point (the estimated next parameters) we compute the gradient term and add the two together.
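A minimal sketch of the Nesterov look-ahead step, again assuming a generic `grad_fn` and a toy objective:

```python
import numpy as np

def nag_step(theta, v, grad_fn, lr=0.01, gamma=0.9):
    """One NAG update: the gradient is taken at the estimated future position."""
    lookahead = theta - gamma * v          # peek at where momentum will carry us
    v = gamma * v + lr * grad_fn(lookahead)
    theta = theta - v
    return theta, v

theta, v = np.array([5.0]), np.zeros(1)
for _ in range(100):
    theta, v = nag_step(theta, v, lambda t: 2 * t)
print(theta)  # approaches 0, with less overshoot than plain momentum
```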
Adagrad (adapting the learning rate)
It is good for sparse data because it uses a high learning rate for infrequent features and a low learning rate for frequent features.
We use a different learning rate for every parameter at every time step.
The ε term prevents division by zero. G_t is the sum of squared gradients with respect to each parameter up to time step t; the update is θ_{t+1,i} = θ_{t,i} - η / √(G_{t,ii} + ε) · g_{t,i}. Since G is a sum of squares it only keeps growing, so the effective learning rate shrinks and eventually the optimizer stops making progress.
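A minimal Adagrad sketch under the same toy setup (the `grad_fn` and learning rate are illustrative):

```python
import numpy as np

def adagrad_step(theta, G, grad_fn, lr=0.1, eps=1e-8):
    """One Adagrad update: each parameter is scaled by its own accumulated G."""
    g = grad_fn(theta)
    G = G + g ** 2                          # per-parameter sum of squared gradients
    theta = theta - lr * g / np.sqrt(G + eps)
    return theta, G

theta, G = np.array([5.0, 5.0]), np.zeros(2)
for _ in range(100):
    theta, G = adagrad_step(theta, G, lambda t: 2 * t)
print(theta)  # progress slows over time as G keeps growing
```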
RMSprop = designed to prevent Adagrad's aggressive, monotonically decreasing learning rate.
Now we apply a decay factor to G, replacing the running sum with an exponentially decaying average of squared gradients, so it cannot grow without bound.
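A minimal RMSprop sketch; the decay value 0.9 follows the common default, everything else is illustrative:

```python
import numpy as np

def rmsprop_step(theta, Eg2, grad_fn, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSprop update: Eg2 is a decaying average, not an ever-growing sum."""
    g = grad_fn(theta)
    Eg2 = decay * Eg2 + (1 - decay) * g ** 2
    theta = theta - lr * g / np.sqrt(Eg2 + eps)
    return theta, Eg2

theta, Eg2 = np.array([5.0, 5.0]), np.zeros(2)
for _ in range(200):
    theta, Eg2 = rmsprop_step(theta, Eg2, lambda t: 2 * t)
print(theta)  # moves steadily toward the minimum at 0
```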
AdaDelta
It is more complicated: it does extra work so that the units of the update match the units of the parameters. By using it we don't need to set an initial learning rate at all.
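A minimal AdaDelta sketch (the rho and eps values are common defaults, not from these notes); note that no learning rate appears in the update:

```python
import numpy as np

def adadelta_step(theta, Eg2, Edx2, grad_fn, rho=0.95, eps=1e-6):
    """One AdaDelta update: decaying averages of squared gradients AND squared updates."""
    g = grad_fn(theta)
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g   # update has the parameters' units
    Edx2 = rho * Edx2 + (1 - rho) * dx ** 2
    return theta + dx, Eg2, Edx2

theta, Eg2, Edx2 = np.array([5.0]), np.zeros(1), np.zeros(1)
for _ in range(500):
    theta, Eg2, Edx2 = adadelta_step(theta, Eg2, Edx2, lambda t: 2 * t)
print(theta)  # decreases without any hand-set learning rate
```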
Adam = a decaying average of past squared gradients (for scaling the learning rate, just like the G term above) + a decaying average (exponential smoothing) of past gradients (like momentum).
The m̂ and v̂ (bias-corrected) terms compensate for m and v being initialized at zero, which matters mainly during the first iterations.
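A minimal Adam sketch with the usual default betas (the learning rate here is just a toy choice); m_hat and v_hat are the bias-corrected terms mentioned above:

```python
import numpy as np

def adam_step(theta, m, v, t, grad_fn, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    g = grad_fn(theta)
    m = beta1 * m + (1 - beta1) * g          # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * g ** 2     # second moment (Adagrad/RMSprop-like)
    m_hat = m / (1 - beta1 ** t)             # bias correction for zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 201):
    theta, m, v = adam_step(theta, m, v, t, lambda x: 2 * x)
print(theta)  # approaches the minimum at 0
```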