Dependency structure shows which words depend on which other words, i.e. which words modify or are arguments of other words.
Up until now we looked at the meaning of individual words (word vectors). We need to understand sentence structure in order to convey complex meaning.
But a sentence's dependency structure is not always obvious: it can change depending on how we resolve ambiguities such as prepositional phrase attachment ambiguity.
Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas
In this class we are just going to use arrows (dependency arcs), not detailed grammar formalisms.
Knowing the grammar alone won't let us build the correct dependency structure, since there is a lot of ambiguity.
So people built treebanks: corpora with hand-annotated dependency structures.
There are many ways to build a parser from such a treebank, but machine learning worked best; with deeper and wider networks it worked even better.
Evaluating a parser means comparing the model's predicted dependencies against the human-annotated (gold) ones.
Before training we need to represent our set of words as vectors.
We just concatenate all three vectors into one long vector,
and then we feed it into a neural network.
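To make this concrete, here is a rough sketch of that pipeline (my own toy illustration, not the lecture's code), assuming the three concatenated vectors are word, part-of-speech, and arc-label embeddings and that a single hidden layer scores the parser's transitions; all sizes, the `featurize` helper, and the ReLU activation are illustrative choices.

```python
import numpy as np

# Sketch of a feed-forward transition scorer: look up embeddings, concatenate
# them into one long vector, and push it through one hidden layer.
rng = np.random.default_rng(0)

d_word, d_pos, d_label = 50, 10, 10       # assumed embedding sizes
hidden, n_transitions = 200, 3            # e.g. SHIFT, LEFT-ARC, RIGHT-ARC

# Toy embedding tables
E_word  = rng.normal(size=(1000, d_word))
E_pos   = rng.normal(size=(50,  d_pos))
E_label = rng.normal(size=(40,  d_label))

def featurize(word_ids, pos_ids, label_ids):
    """Concatenate word, POS, and arc-label embeddings into one long vector."""
    return np.concatenate([E_word[word_ids].ravel(),
                           E_pos[pos_ids].ravel(),
                           E_label[label_ids].ravel()])

x = featurize([3, 17, 42], [1, 2, 3], [0, 5, 7])

W1 = rng.normal(size=(hidden, x.size)); b1 = np.zeros(hidden)
W2 = rng.normal(size=(n_transitions, hidden)); b2 = np.zeros(n_transitions)

h = np.maximum(0, W1 @ x + b1)            # ReLU here for simplicity
scores = W2 @ h + b2
probs = np.exp(scores - scores.max()); probs /= probs.sum()
print(probs)                              # probability of each transition
```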
Optimizer
[#17.Lec] Advanced Optimizer than SGD - 딥러닝 홀로서기 - YouTube
An overview of gradient descent optimization algorithms (ruder.io)
Gradient descent = slow: if there are 23,000 parameters and 1,000,000 samples, we would need to compute 23,000 * 1,000,000 partial derivatives for one epoch (which is a single update in full-batch gradient descent). It is too slow.
Stochastic gradient descent = we randomly choose one sample out of the 1,000,000, compute the gradient with backpropagation, and update the parameters. For this optimizer to work well we need to lower the learning rate as training goes on (learning rate schedules).
Mini-batch gradient descent = instead of choosing one sample we choose several at a time, a mini-batch.
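A rough sketch contrasting the three variants on a toy least-squares problem (my own example; the `grad` function and all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)

def grad(w, Xb, yb):
    """Gradient of mean squared error on a batch."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

lr, w = 0.1, np.zeros(5)

# 1) Batch gradient descent: one update per pass over ALL samples.
w -= lr * grad(w, X, y)

# 2) Stochastic gradient descent: one update per single random sample.
i = rng.integers(len(X))
w -= lr * grad(w, X[i:i+1], y[i:i+1])

# 3) Mini-batch gradient descent: one update per small random batch.
idx = rng.choice(len(X), size=32, replace=False)
w -= lr * grad(w, X[idx], y[idx])
```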
Downsides of these three variants:
- Choosing a good learning rate is difficult.
- The same learning rate applies to every parameter, even though we might want larger updates for rarely occurring features.
- They might get stuck in suboptimal local minima or at saddle points and never escape.
Momentum
In the second image, SGD just oscillates up and down because the y dimension has a much larger partial derivative. Momentum helps SGD keep accelerating along the direction whose partial derivative is small but consistent.
We take the previous update into account: v_t = γ·v_{t-1} + η·∇J(θ), with γ < 1. The optimizer gains speed in a consistent direction and converges faster.
But it won't slow down on its own: it keeps gaining speed as it rolls downhill, whereas we would like it to slow down before the slope turns upward again.
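A minimal sketch of the momentum update described above, assuming a generic `grad_fn` and a toy quadratic objective:

```python
import numpy as np

def momentum_step(theta, v, grad_fn, lr=0.01, gamma=0.9):
    """One momentum update: v_t = gamma*v_{t-1} + lr*grad; theta -= v_t."""
    v = gamma * v + lr * grad_fn(theta)
    theta = theta - v
    return theta, v

# Toy usage on f(theta) = theta^2 (gradient 2*theta)
theta, v = np.array([5.0]), np.zeros(1)
for _ in range(100):
    theta, v = momentum_step(theta, v, lambda t: 2 * t)
print(theta)  # approaches the minimum at 0
```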
Nesterov accelerated gradient
We can get a peek at the future (roughly where the parameters are about to be moved) by computing θ - γ·v_{t-1}.
By doing this we now compute the gradient at the predicted future position of the parameters, not at the current position.
At the first (blue) vector we don't do any look-ahead, since there is no previous update term yet.
At the second iteration we have the brown vector, which is the first term (γ·v_{t-1}); from that point (the estimated next parameters) we compute the gradient term and add the two together.
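A minimal sketch of the Nesterov look-ahead step, again assuming a generic `grad_fn` and a toy objective:

```python
import numpy as np

def nag_step(theta, v, grad_fn, lr=0.01, gamma=0.9):
    """One NAG update: the gradient is taken at the estimated future position."""
    lookahead = theta - gamma * v          # peek at where momentum will carry us
    v = gamma * v + lr * grad_fn(lookahead)
    theta = theta - v
    return theta, v

theta, v = np.array([5.0]), np.zeros(1)
for _ in range(100):
    theta, v = nag_step(theta, v, lambda t: 2 * t)
print(theta)  # approaches 0, with less overshoot than plain momentum
```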
Adagrad (adapting the learning rate)
It is good for sparse data because it uses a high learning rate for infrequent features and a low learning rate for frequent features.
We use a different learning rate for every parameter at every time step.
The ε term prevents division by zero. G_t is the sum of squared gradients with respect to each parameter up to time step t; the update is θ_{t+1,i} = θ_{t,i} - η / √(G_{t,ii} + ε) · g_{t,i}. Since G is a sum of squares it only keeps growing, so the effective learning rate shrinks and eventually the optimizer stops making progress.
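A minimal Adagrad sketch under the same toy setup (the `grad_fn` and learning rate are illustrative):

```python
import numpy as np

def adagrad_step(theta, G, grad_fn, lr=0.1, eps=1e-8):
    """One Adagrad update: each parameter is scaled by its own accumulated G."""
    g = grad_fn(theta)
    G = G + g ** 2                          # per-parameter sum of squared gradients
    theta = theta - lr * g / np.sqrt(G + eps)
    return theta, G

theta, G = np.array([5.0, 5.0]), np.zeros(2)
for _ in range(100):
    theta, G = adagrad_step(theta, G, lambda t: 2 * t)
print(theta)  # progress slows over time as G keeps growing
```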
RMSprop = designed to prevent Adagrad's aggressive, monotonically decreasing learning rate.
Now we apply a decay factor to G, replacing the running sum with an exponentially decaying average of squared gradients, so it cannot grow without bound.
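A minimal RMSprop sketch; the decay value 0.9 follows the common default, everything else is illustrative:

```python
import numpy as np

def rmsprop_step(theta, Eg2, grad_fn, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSprop update: Eg2 is a decaying average, not an ever-growing sum."""
    g = grad_fn(theta)
    Eg2 = decay * Eg2 + (1 - decay) * g ** 2
    theta = theta - lr * g / np.sqrt(Eg2 + eps)
    return theta, Eg2

theta, Eg2 = np.array([5.0, 5.0]), np.zeros(2)
for _ in range(200):
    theta, Eg2 = rmsprop_step(theta, Eg2, lambda t: 2 * t)
print(theta)  # moves steadily toward the minimum at 0
```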
AdaDelta
It is more complicated: it does extra work so that the units of the update match the units of the parameters. By using it we don't need to set an initial learning rate at all.
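A minimal AdaDelta sketch (the rho and eps values are common defaults, not from these notes); note that no learning rate appears in the update:

```python
import numpy as np

def adadelta_step(theta, Eg2, Edx2, grad_fn, rho=0.95, eps=1e-6):
    """One AdaDelta update: decaying averages of squared gradients AND squared updates."""
    g = grad_fn(theta)
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g   # update has the parameters' units
    Edx2 = rho * Edx2 + (1 - rho) * dx ** 2
    return theta + dx, Eg2, Edx2

theta, Eg2, Edx2 = np.array([5.0]), np.zeros(1), np.zeros(1)
for _ in range(500):
    theta, Eg2, Edx2 = adadelta_step(theta, Eg2, Edx2, lambda t: 2 * t)
print(theta)  # decreases without any hand-set learning rate
```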
Adam = a decaying average of past squared gradients (for scaling the learning rate, just like the G term above) + a decaying average (exponential smoothing) of past gradients (like momentum).
The m̂ and v̂ (bias-corrected) terms compensate for m and v being initialized at zero, which matters mainly during the first iterations.
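A minimal Adam sketch with the usual default betas (the learning rate here is just a toy choice); m_hat and v_hat are the bias-corrected terms mentioned above:

```python
import numpy as np

def adam_step(theta, m, v, t, grad_fn, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    g = grad_fn(theta)
    m = beta1 * m + (1 - beta1) * g          # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * g ** 2     # second moment (Adagrad/RMSprop-like)
    m_hat = m / (1 - beta1 ** t)             # bias correction for zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 201):
    theta, m, v = adam_step(theta, m, v, t, lambda x: 2 * x)
print(theta)  # approaches the minimum at 0
```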