The spelled-out intro to language modeling: building makemore

Tony Lim 2023. 1. 24. 12:17

['emma',
 'olivia',
 'ava',
 'isabella',
 'sophia',
 'charlotte',
 'mia',
 'amelia',
 'harper',
 'evelyn']

b = {}
for w in words:
  chs = ['<S>'] + list(w) + ['<E>']
  for ch1, ch2 in zip(chs, chs[1:]):
    bigram = (ch1, ch2)
    b[bigram] = b.get(bigram, 0) + 1
    
sorted(b.items(), key = lambda kv: -kv[1])

[(('n', '<E>'), 6763),
 (('a', '<E>'), 6640),
 (('a', 'n'), 5438),
 (('<S>', 'a'), 4410),
 (('e', '<E>'), 3983),

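This is the result of adding a <S> start token and an <E> end token to every name in names.txt, forming bigrams from directly adjacent characters only,
and sorting the pairs by how many times each one occurs.

With this scheme, however, <S> is always the first element of a bigram and <E> always the last, as the picture below shows, so the <E> row and the <S> column of the count table can never be anything but zero. That wastes space and does not give useful statistics, so a single '.' token is used for both start and end instead.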
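A minimal sketch of this point, assuming only the ten sample names listed at the top (not the full names.txt). It counts the bigrams with `collections.Counter` and checks that no bigram ever starts with `<E>` or ends with `<S>`:

```python
from collections import Counter

# Assumption: only the ten sample names shown earlier, not the full names.txt.
words = ['emma', 'olivia', 'ava', 'isabella', 'sophia',
         'charlotte', 'mia', 'amelia', 'harper', 'evelyn']

b = Counter()
for w in words:
    chs = ['<S>'] + list(w) + ['<E>']
    b.update(zip(chs, chs[1:]))  # count every adjacent pair

# Bigrams that would fill the <E> row or the <S> column of a count table.
starts_with_E = [bg for bg in b if bg[0] == '<E>']
ends_with_S = [bg for bg in b if bg[1] == '<S>']

print(b.most_common(3))
print(starts_with_E, ends_with_S)  # both lists are empty
```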

chars = sorted(list(set(''.join(words))))
stoi = {s:i+1 for i,s in enumerate(chars)}
stoi['.'] = 0
itos = {i:s for s,i in stoi.items()}

Joining all the words and turning them into a set gives the 26 letters a to z.

string to integer (stoi): '.' -> 0, a -> 1, b -> 2, ..., z -> 26
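A small round-trip sketch of the two lookup tables. As an assumption, the alphabet is hard-coded from `string.ascii_lowercase` here instead of being derived from names.txt:

```python
import string

# Assumption: the alphabet is hard-coded instead of being read from names.txt.
chars = sorted(string.ascii_lowercase)
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi['.'] = 0  # index 0 is reserved for the start/end token
itos = {i: s for s, i in stoi.items()}

encoded = [stoi[ch] for ch in '.emma.']
decoded = ''.join(itos[i] for i in encoded)

print(encoded)  # [0, 5, 13, 13, 1, 0]
print(decoded)  # .emma.
```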

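import torch

# 27x27 table of bigram counts: row = first character, column = second character
N = torch.zeros((27, 27))
for w in words:
  chs = ['.'] + list(w) + ['.']  # wrap each name with the '.' start/end token
  for ch1, ch2 in zip(chs, chs[1:]):
    ix1 = stoi[ch1]
    ix2 = stoi[ch2]
    N[ix1, ix2] += 1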

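Each name above is wrapped in dots, e.g. .emma. , and split into the bigrams

.e
em
mm
ma
a.

The two characters of each bigram go into ch1 and ch2, and the matching cell of N = torch.zeros((27,27)) is incremented by one.

.a counts the names that start with a, and a. counts the names that end with a.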
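A sketch of the count table on the ten sample names only (an assumption; the real table is built from all of names.txt, and the int32 dtype is also my choice). It reads off how many names start and end with a:

```python
import string
import torch

# Assumption: only the ten sample names, with a hard-coded alphabet.
words = ['emma', 'olivia', 'ava', 'isabella', 'sophia',
         'charlotte', 'mia', 'amelia', 'harper', 'evelyn']
stoi = {s: i + 1 for i, s in enumerate(string.ascii_lowercase)}
stoi['.'] = 0

N = torch.zeros((27, 27), dtype=torch.int32)  # integer counts
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        N[stoi[ch1], stoi[ch2]] += 1

starts_with_a = N[stoi['.'], stoi['a']].item()  # the ".a" cell
ends_with_a = N[stoi['a'], stoi['.']].item()    # the "a." cell
total = N.sum().item()

print(starts_with_a)  # 2  (ava, amelia)
print(ends_with_a)    # 7
print(total)          # 67 bigrams in total
```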

g = torch.Generator().manual_seed(2147483647)
p = torch.rand(3, generator=g)
p = p / p.sum()
p
tensor([0.6064, 0.3033, 0.0903])

Passing a generator makes even torch.rand return the same values every time (controlled through the seed).

g = torch.Generator().manual_seed(2147483647)
ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
itos[ix]

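To turn p into a probability distribution, each value is divided by the sum of all values, so that the entries add up to 1.

multinomial draws samples from the given probability distribution tensor, here a single one. If we ask for many samples, we get a list of numbers from 0, 1, 2; each number is an index into the p tensor, and the numbers in the list follow the distribution p.

For example, any given sample is 1 with probability 0.3033.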
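A quick empirical check of that claim, as a sketch: draw many samples from p and compare the observed frequencies with p itself. The sample count of 10000 is an arbitrary choice of mine.

```python
import torch

g = torch.Generator().manual_seed(2147483647)
p = torch.tensor([0.6064, 0.3033, 0.0903])

# Draw many indices (0, 1 or 2) according to p.
samples = torch.multinomial(p, num_samples=10000, replacement=True, generator=g)

# Fraction of samples that landed on each index.
freq = torch.bincount(samples, minlength=3).float() / samples.numel()
print(freq)  # close to p, up to sampling noise
```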

P = (N+1).float()
P /= P.sum(1, keepdims=True)

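We add 1 to every count because, when we take the log of the probabilities for the log likelihood below, a probability of 0 would give log(0) = -inf (an infinite loss). Adding 1 prevents that.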

P is currently a [27,27] tensor. P.sum(0, keepdims=True) sums along dimension 0, i.e. straight down the rows, and returns a [1,27] tensor. With keepdims=False the 0th dimension is not kept but "squeezed out", and a [27] tensor comes out.

Here we use P.sum(1, keepdims=True), so the sum runs along dimension 1, i.e. across the columns to the right, and produces a [27,1] tensor.

What we want to do next is divide a [27,27] tensor by a [27,1] tensor. Is that possible?
We have to check whether the two shapes are broadcastable.

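1. each tensor has at least 1 dimension

2. going through the dimension sizes starting from the trailing (rightmost) one, the sizes must be equal, or one of them is 1, or one of them doesn't exist

Our shapes are not identical, but for [27,27] and [27,1] the trailing dimension is 1 in one of them (and the first dimension is equal), so they are broadcastable.

Internally the [27,1] tensor is copied across to make [27,27], and then the element-wise operation (division in our case) is carried out.

That gives the per-row probability distributions we want. The [27,1] tensor holds the sum of each row; copying it across and dividing every column of the row by it yields rows that each sum to 1.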
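A sketch of the same shapes on a tiny 3x3 matrix (made-up numbers, standing in for the 27x27 table). PyTorch's documented argument name is `keepdim`; the `keepdims` spelling used in this post is accepted as well.

```python
import torch

# Hypothetical 3x3 "count table" standing in for the 27x27 one.
A = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.],
                  [7., 8., 9.]])

print(A.sum(0, keepdim=True).shape)  # torch.Size([1, 3])
print(A.sum(0).shape)                # torch.Size([3])

row_sums = A.sum(1, keepdim=True)    # shape [3, 1]
P = A / row_sums                     # [3, 3] / [3, 1] broadcasts across columns
print(row_sums.shape)                # torch.Size([3, 1])
print(P.sum(1))                      # every row sums to 1
```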

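An example of an easy broadcasting mistake

Suppose keepdim is set to False. Then the shapes being compared are

27, 27
    27

(the shapes must be aligned to the right before comparing).

Check rule 2: this falls under the third case, "one of them doesn't exist".

Internally the shapes are treated as

27, 27
 1, 27

and the [1,27] row vector is copied vertically to make 27,27 before the element-wise division.

As a result the rows are no longer probability distributions as we wanted; instead it is the columns that end up summing to 1.

This happens because the [27] vector of row sums was laid out as a [1,27] row and then copied downward, so entry [i, j] is divided by the sum of row j instead of row i. The columns only come out as pdfs because, in the bigram count table in the picture above, N[idx].sum() and N[:,idx].sum() happen to be equal.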
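A sketch of the mistake on a made-up 3x3 matrix whose row sums and column sums differ. Here neither the rows nor the columns sum to 1, which shows that the well-behaved columns in the bigram table depend on its row and column sums being equal:

```python
import torch

# Hypothetical matrix with row sums [6, 15, 24] and column sums [12, 15, 18].
A = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.],
                  [7., 8., 9.]])

wrong = A / A.sum(1)  # [3, 3] / [3] is treated as [3, 3] / [1, 3]

# Entry [i, j] was divided by the sum of row j, not row i.
print(wrong.sum(1))   # rows do not sum to 1
print(wrong.sum(0))   # column j sums to colsum[j] / rowsum[j]
```

When the row and column sums are equal, as in the bigram count table, `colsum[j] / rowsum[j]` is 1 for every column, which hides the bug.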

g = torch.Generator().manual_seed(2147483647)

for i in range(5):
  
  out = []
  ix = 0
  while True:
    p = P[ix]
    ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
    out.append(itos[ix])
    if ix == 0:
      break
  print(''.join(out))

mor.
axx.
minaymoryles.
kondlaisah.
anchshizarie.

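P[ix] is the probability distribution stored in row ix, and the next letter is sampled from that distribution (likely letters come up more often, but it is not simply the most probable one). Starting from row 0 ('.'), the first draw was m.

In the next pass of the while loop we sample from the row for m, and so on until '.' is drawn; that is how mor. gets built.

The five names do not look good, but sampling from a distribution where every character is equally likely produces names that make even less sense.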

loss function(the negative log likelihood of the data under our model) = quality of model

log_likelihood = 0.0
n = 0

for w in words:
#for w in ["andrejq"]:
  chs = ['.'] + list(w) + ['.']
  for ch1, ch2 in zip(chs, chs[1:]):
    ix1 = stoi[ch1]
    ix2 = stoi[ch2]
    prob = P[ix1, ix2]
    logprob = torch.log(prob)
    log_likelihood += logprob
    n += 1
    print(f'{ch1}{ch2}: {prob:.4f} {logprob:.4f}')

print(f'{log_likelihood=}')
nll = -log_likelihood
print(f'{nll=}')
print(f'{nll/n}')

.e: 0.0749 -2.5914
em: 0.1155 -2.1588
mm: 0.0253 -3.6753
ma: 0.0764 -2.5717
a.: 0.2071 -1.5743

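These are the probability and log(probability) of each bigram. With 27 possible characters a uniform guess gives 1/27 ≈ 0.037, so a probability above roughly 0.04 can be read as the bigram model having learned something.

A very good model would assign a probability close to 1 to most bigrams of the training set, because that would mean it knows exactly what comes next.

likelihood = product of all those probabilities. In our case this is an extremely small number. The larger it is, the better the model is considered to be trained.

To make it easier to work with, we use the log likelihood. The more negative it is, the worse the model. Since log(a*b) = log(a) + log(b), adding up the log probs gives the log of the product, i.e. log(likelihood).
Positive numbers are more convenient, so we negate it.

The negative log likelihood is a very usable loss function: its minimum is 0, and the worse the training result (the closer the probs get to 0), the more it grows toward infinity.
It is also often divided by n to normalize it (the average negative log likelihood).
The goal of training is to reduce this negative log likelihood.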
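A worked example in plain Python, using the five rounded probabilities printed above for .emma. (so the results are only accurate to a few digits). It checks that the sum of logs equals the log of the product and compares the average loss with a uniform guess:

```python
import math

# Probabilities of the five bigrams of ".emma." as printed above (rounded).
probs = [0.0749, 0.1155, 0.0253, 0.0764, 0.2071]

likelihood = math.prod(probs)                      # tiny number
log_likelihood = sum(math.log(p) for p in probs)   # sum of logs
nll = -log_likelihood
avg_nll = nll / len(probs)

# Loss of a model that assigns 1/27 to every character.
uniform_nll = -math.log(1 / 27)

print(likelihood)
print(math.isclose(math.log(likelihood), log_likelihood))  # True
print(round(avg_nll, 2))      # 2.51
print(round(uniform_nll, 2))  # 3.3
```

On these five bigrams the average loss is below the uniform baseline, which is what "the model learned something" means in terms of the loss.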

part2 : neural network approach

# create the training set of bigrams (x,y)
xs, ys = [], []

for w in words[:1]:
  chs = ['.'] + list(w) + ['.']
  for ch1, ch2 in zip(chs, chs[1:]):
    ix1 = stoi[ch1]
    ix2 = stoi[ch2]
    print(ch1, ch2)
    xs.append(ix1)
    ys.append(ix2)
    
xs = torch.tensor(xs)
ys = torch.tensor(ys)

. e
e m
m m
m a
a .
xs
tensor([ 0,  5, 13, 13,  1])
ys
tensor([ 5, 13, 13,  1,  0])

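When the input is '.', we want to raise the probability that e comes next (when 0 goes in, we want a high probability for 5).

Use torch.tensor rather than torch.Tensor: torch.tensor infers the dtype from the data when it creates the new tensor (int64 here), whereas torch.Tensor always produces float32.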
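A two-line check of the dtype difference, using the xs values from above:

```python
import torch

a = torch.tensor([0, 5, 13, 13, 1])  # dtype inferred from the Python ints
b = torch.Tensor([0, 5, 13, 13, 1])  # always the default float dtype

print(a.dtype)  # torch.int64
print(b.dtype)  # torch.float32
```

Integer indices matter later, because xs and ys are used for one-hot encoding and for indexing into probs.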

import torch.nn.functional as F
xenc = F.one_hot(xs, num_classes=27).float()
xenc.shape

torch.Size([5, 27])

Each row is a one-hot vector: only the given index is 1 and everything else is 0.

F.one_hot returns an integer tensor (int64), but we need floats, because the input is multiplied with float weights whose values we want to adjust in fine steps.
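A short sketch that inspects the dtype before and after `.float()` and confirms that each row has a single 1 at the encoded index:

```python
import torch
import torch.nn.functional as F

xs = torch.tensor([0, 5, 13, 13, 1])

raw = F.one_hot(xs, num_classes=27)  # integer one-hot vectors
xenc = raw.float()                   # cast so it can be multiplied with float weights

print(raw.dtype, xenc.dtype)  # torch.int64 torch.float32
print(xenc.argmax(1))         # tensor([ 0,  5, 13, 13,  1])
print(xenc.sum(1))            # exactly one 1 per row
```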

W = torch.randn((27, 1))
xenc @ W
tensor([[-0.0603],
        [ 1.0245],
        [-2.7787],
        [-2.7787],
        [ 0.7438]])

Multiplying [5,27] @ [27,1] gives [5,1]: a single neuron produces 5 activations, one for each of the 5 inputs.

With torch.randn((27,27)) there are 27 neurons.
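A sketch of what the matrix multiply does with one-hot inputs: it simply selects rows of W. The seed is reused from the post; the check against `W[xs]` is my addition.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g)  # 27 inputs -> 27 neurons

xs = torch.tensor([0, 5, 13, 13, 1])
xenc = F.one_hot(xs, num_classes=27).float()

out = xenc @ W                     # [5, 27] @ [27, 27] -> [5, 27]
print(out.shape)                   # torch.Size([5, 27])
print(torch.allclose(out, W[xs]))  # True: each output row is a row of W
print(torch.equal(out[2], out[3])) # True: both inputs are 'm'
```

This is why the two 'm' inputs gave the identical activation -2.7787 in the single-neuron output above.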

xenc = F.one_hot(xs, num_classes=27).float() # input to the network: one-hot encoding
logits = xenc @ W # predict log-counts
counts = logits.exp() # counts, equivalent to N
probs = counts / counts.sum(1, keepdims=True) # probabilities for next character
# btw: the last 2 lines here are together called a 'softmax'

tensor([[0.0607, 0.0100, 0.0123, 0.0042, 0.0168, 0.0123, 0.0027, 0.0232, 0.0137,
         0.0313, 0.0079, 0.0278, 0.0091, 0.0082, 0.0500, 0.2378, 0.0603, 0.0025,
         0.0249, 0.0055, 0.0339, 0.0109, 0.0029, 0.0198, 0.0118, 0.1537, 0.1459],
        [0.0290, 0.0796, 0.0248, 0.0521, 0.1989, 0.0289, 0.0094, 0.0335, 0.0097,
         0.0301, 0.0702, 0.0228, 0.0115, 0.0181, 0.0108, 0.0315, 0.0291, 0.0045,
         0.0916, 0.0215, 0.0486, 0.0300, 0.0501, 0.0027, 0.0118, 0.0022, 0.0472],
        [0.0312, 0.0737, 0.0484, 0.0333, 0.0674, 0.0200, 0.0263, 0.0249, 0.1226,
         0.0164, 0.0075, 0.0789, 0.0131, 0.0267, 0.0147, 0.0112, 0.0585, 0.0121,
         0.0650, 0.0058, 0.0208, 0.0078, 0.0133, 0.0203, 0.1204, 0.0469, 0.0126],
        [0.0312, 0.0737, 0.0484, 0.0333, 0.0674, 0.0200, 0.0263, 0.0249, 0.1226,
         0.0164, 0.0075, 0.0789, 0.0131, 0.0267, 0.0147, 0.0112, 0.0585, 0.0121,
         0.0650, 0.0058, 0.0208, 0.0078, 0.0133, 0.0203, 0.1204, 0.0469, 0.0126],
        [0.0150, 0.0086, 0.0396, 0.0100, 0.0606, 0.0308, 0.1084, 0.0131, 0.0125,
         0.0048, 0.1024, 0.0086, 0.0988, 0.0112, 0.0232, 0.0207, 0.0408, 0.0078,
         0.0899, 0.0531, 0.0463, 0.0309, 0.0051, 0.0329, 0.0654, 0.0503, 0.0091]])

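We treat the logits as a kind of log(count) and exponentiate them. After that we run the same normalization as before, so that each row sums to 1 and becomes a pdf. -> This is the softmax.

In summary: the bigrams of .emma. are mapped to integers in 0~26 (xs, ys), xs is one-hot encoded, and the result is matrix-multiplied with the randomly initialized W (x @ W, where each column of W is one neuron). ys is not encoded; it is only used as the label index.

Everything is computed in parallel, but to take one case concretely: probs[0] is the distribution over what comes next when '.' is fed into the NN. Its shape is torch.Size([27]).

Likewise probs[1] is for the input e, and so on up to probs[4]. The whole probs tensor has shape torch.Size([5,27]).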
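A sketch confirming that the two manual lines (exp, then normalize per row) match PyTorch's built-in `torch.softmax`. The random logits here are placeholders, not the values from the post.

```python
import torch

g = torch.Generator().manual_seed(2147483647)
logits = torch.randn((5, 27), generator=g)  # placeholder log-counts

counts = logits.exp()                          # positive "counts"
probs = counts / counts.sum(1, keepdim=True)   # normalize each row

builtin = torch.softmax(logits, dim=1)
print(torch.allclose(probs, builtin))  # True
print(probs.sum(1))                    # each row sums to 1
```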

nlls = torch.zeros(5)
for i in range(5):
  # i-th bigram:
  x = xs[i].item() # input character index
  y = ys[i].item() # label character index
  print('--------')
  print(f'bigram example {i+1}: {itos[x]}{itos[y]} (indexes {x},{y})')
  print('input to the neural net:', x)
  print('output probabilities from the neural net:', probs[i])
  print('label (actual next character):', y)
  p = probs[i, y]
  print('probability assigned by the net to the the correct character:', p.item())
  logp = torch.log(p)
  print('log likelihood:', logp.item())
  nll = -logp
  print('negative log likelihood:', nll.item())
  nlls[i] = nll

print('=========')
print('average negative log likelihood, i.e. loss =', nlls.mean().item())

--------
bigram example 1: .e (indexes 0,5)
input to the neural net: 0
output probabilities from the neural net: tensor([0.0607, 0.0100, 0.0123, 0.0042, 0.0168, 0.0123, 0.0027, 0.0232, 0.0137,
        0.0313, 0.0079, 0.0278, 0.0091, 0.0082, 0.0500, 0.2378, 0.0603, 0.0025,
        0.0249, 0.0055, 0.0339, 0.0109, 0.0029, 0.0198, 0.0118, 0.1537, 0.1459])
label (actual next character): 5
probability assigned by the net to the the correct character: 0.012286253273487091
log likelihood: -4.3992743492126465
negative log likelihood: 4.3992743492126465
--------
bigram example 2: em (indexes 5,13)
input to the neural net: 5
output probabilities from the neural net: tensor([0.0290, 0.0796, 0.0248, 0.0521, 0.1989, 0.0289, 0.0094, 0.0335, 0.0097,
        0.0301, 0.0702, 0.0228, 0.0115, 0.0181, 0.0108, 0.0315, 0.0291, 0.0045,
        0.0916, 0.0215, 0.0486, 0.0300, 0.0501, 0.0027, 0.0118, 0.0022, 0.0472])
label (actual next character): 13
probability assigned by the net to the the correct character: 0.018050702288746834
log likelihood: -4.014570713043213
negative log likelihood: 4.014570713043213

"probability assigned by the net to the the correct character: 0.012286253273487091" means the net currently gives a very low probability to e following '.'. That is to be expected, since the weights in W have not been trained yet.

loss = -probs[torch.arange(5), ys].log().mean()

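probs[torch.arange(5), ys] picks the entries at [0,5], [1,13], [2,13], [3,1], [4,0]. It is an array of length 5 whose entries map to bigram examples 1 to 5; the first one is the probability (from the net) that e comes out when '.' goes in.

Taking the log of all of them, averaging, and negating gives the loss value we have to reduce.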
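A sketch showing that the vectorized indexing picks the same entries as an explicit loop. The probs here come from random logits (placeholders), with ys taken from the .emma. example:

```python
import torch

ys = torch.tensor([5, 13, 13, 1, 0])  # labels for ".emma."

g = torch.Generator().manual_seed(2147483647)
probs = torch.softmax(torch.randn((5, 27), generator=g), dim=1)  # placeholder probs

picked = probs[torch.arange(5), ys]  # entries [0,5], [1,13], [2,13], [3,1], [4,0]
looped = torch.stack([probs[i, ys[i]] for i in range(5)])

loss = -picked.log().mean()          # average negative log likelihood
print(torch.equal(picked, looped))   # True
print(loss.item())
```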

# backward pass
W.grad = None # set to zero the gradient
loss.backward()

Just as in micrograd, calling loss.backward() computes the gradients of all the weights in the connected operation graph.
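A minimal sketch of one backward pass and one update on the five .emma. bigrams. The learning rate 0.1 and the helper `forward` are my assumptions for illustration, not values from the post:

```python
import torch
import torch.nn.functional as F

xs = torch.tensor([0, 5, 13, 13, 1])
ys = torch.tensor([5, 13, 13, 1, 0])

g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g, requires_grad=True)

def forward():
    """Hypothetical helper: average negative log likelihood on the 5 bigrams."""
    xenc = F.one_hot(xs, num_classes=27).float()
    probs = torch.softmax(xenc @ W, dim=1)
    return -probs[torch.arange(5), ys].log().mean()

loss_before = forward()
W.grad = None           # reset the gradient
loss_before.backward()  # fills W.grad with d(loss)/dW
print(W.grad.shape)     # torch.Size([27, 27])

W.data += -0.1 * W.grad  # one small gradient descent step (assumed step size)
loss_after = forward()
print(loss_before.item(), loss_after.item())  # the loss goes down
```

Rows of W.grad that belong to characters never used as input (for example index 2, b) stay exactly zero, since those weights did not take part in the forward pass.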

Putting it together

# create the dataset
xs, ys = [], []
for w in words:
  chs = ['.'] + list(w) + ['.']
  for ch1, ch2 in zip(chs, chs[1:]):
    ix1 = stoi[ch1]
    ix2 = stoi[ch2]
    xs.append(ix1)
    ys.append(ix2)
xs = torch.tensor(xs)
ys = torch.tensor(ys)
num = xs.nelement()
print('number of examples: ', num)

# initialize the 'network'
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g, requires_grad=True)
xs

number of examples:  228146
tensor([ 0,  5, 13,  ..., 25, 26, 24])

Using all of the words produces a large number of bigrams. Above, .emma. gave arrays (xs, ys) of length 5; now much larger arrays are created.

# gradient descent
for k in range(1):
  
  # forward pass
  xenc = F.one_hot(xs, num_classes=27).float() # input to the network: one-hot encoding
  logits = xenc @ W # predict log-counts
  counts = logits.exp() # counts, equivalent to N
  probs = counts / counts.sum(1, keepdims=True) # probabilities for next character
  loss = -probs[torch.arange(num), ys].log().mean() + 0.01*(W**2).mean()
  print(loss.item())
  
  # backward pass
  W.grad = None # set to zero the gradient
  loss.backward()
  
  # update
  W.data += -50 * W.grad

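xs is one-hot encoded and matrix-multiplied with the randomly initialized W; the result is stored in logits (a kind of log-count, the raw material for the softmax). probs is then the result of the softmax.

From probs we compute the negative log likelihood and use it as the loss function.

Everything after the logits stays almost the same in later models; what mostly becomes more complex is the part before them.

The 0.01*(W**2).mean() term is regularization. It pulls the weights in W toward 0, and if W were all 0 every logit would be equal and the probs would all be the same, so the term acts as a smoothing force on the net.
It works just like adding 1 in P = (N+1).float(). If we added 10000 instead of 1, then since even the largest counts are only in the thousands, every entry would be above 10000, the counts would stop mattering, and the probs would become uniform.
In the same way regularization helps with smoothing, but if it dominates the loss it makes everything uniform, and the loss will not go down.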
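A sketch of both effects on made-up numbers: adding k fake counts to a single row of counts, and the softmax of all-zero weights. The counts and the helper `smooth` are hypothetical.

```python
import torch

counts = torch.tensor([0., 10., 90.])  # hypothetical counts for one row

def smooth(counts, k):
    """Add k fake counts to every entry, then normalize to probabilities."""
    c = counts + k
    return c / c.sum()

p0 = smooth(counts, 0)
p1 = smooth(counts, 1)
p_big = smooth(counts, 10000)

print(torch.log(p0[0]))  # tensor(-inf): a zero count gives an infinite loss
print(p1)                # small but nonzero probability for the first entry
print(p_big)             # nearly uniform: the real counts barely matter

# With all weights at 0, every logit is 0 and the softmax is exactly uniform.
uniform = torch.softmax(torch.zeros(27), dim=0)
print(uniform[0].item(), 1 / 27)
```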

sampling

# finally, sample from the 'neural net' model
g = torch.Generator().manual_seed(2147483647)

for i in range(5):
  
  out = []
  ix = 0
  while True:
    
    # ----------
    # BEFORE:
    #p = P[ix]
    # ----------
    # NOW:
    xenc = F.one_hot(torch.tensor([ix]), num_classes=27).float()
    print(xenc.shape , ix , itos[ix])
    logits = xenc @ W # predict log-counts
    counts = logits.exp() # counts, equivalent to N
    p = counts / counts.sum(1, keepdims=True) # probabilities for next character
    # ----------
    
    ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
    out.append(itos[ix])
    if ix == 0:
      break
  print(''.join(out))
  
torch.Size([1, 27]) 0 .
torch.Size([1, 27]) 21 u
torch.Size([1, 27]) 25 y
torch.Size([1, 27]) 20 t
uyt.
torch.Size([1, 27]) 0 .
torch.Size([1, 27]) 1 a
torch.Size([1, 27]) 24 x
torch.Size([1, 27]) 23 w
axw.

(three more names follow in the same way)

When '.' comes in, the net produces a probability distribution and we sample from it. Here u came out.
