From a53d1cd116b66ad98e64248a2ae73f19075df7e5 Mon Sep 17 00:00:00 2001 From: DanAnastasyev Date: Fri, 2 Mar 2018 01:40:19 +0300 Subject: [PATCH] Lecture 2 added --- .../Lecture 2 - Pytorch & Word2Vec.ipynb | 1 + Lecture 2 - PyTorch & Word2Vec/README.md | 18 ++++++++++++++++++ README.md | 7 ++++++- 3 files changed, 25 insertions(+), 1 deletion(-) create mode 100644 Lecture 2 - PyTorch & Word2Vec/Lecture 2 - Pytorch & Word2Vec.ipynb create mode 100644 Lecture 2 - PyTorch & Word2Vec/README.md diff --git a/Lecture 2 - PyTorch & Word2Vec/Lecture 2 - Pytorch & Word2Vec.ipynb b/Lecture 2 - PyTorch & Word2Vec/Lecture 2 - Pytorch & Word2Vec.ipynb new file mode 100644 index 0000000..180e8cf --- /dev/null +++ b/Lecture 2 - PyTorch & Word2Vec/Lecture 2 - Pytorch & Word2Vec.ipynb @@ -0,0 +1 @@ +{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"name":"Lecture 2 - Pytorch.ipynb","version":"0.3.2","views":{},"default_view":{},"provenance":[],"collapsed_sections":[]},"kernelspec":{"name":"python3","display_name":"Python 3"},"accelerator":"GPU"},"cells":[{"metadata":{"id":"ad_u1q9wzNXG","colab_type":"code","colab":{"autoexec":{"startup":false,"wait_interval":0}}},"cell_type":"code","source":["!pip install -q http://download.pytorch.org/whl/cu80/torch-0.3.0.post4-cp36-cp36m-linux_x86_64.whl torchvision\n","!pip install -q keras \n","\n","import torch\n","import torch.autograd as autograd\n","import torch.nn as nn\n","import torch.nn.functional as F\n","import torch.optim as optim\n","\n","import numpy as np\n","\n","import matplotlib.pyplot as plt\n","from IPython import display"],"execution_count":0,"outputs":[]},{"metadata":{"id":"F5aofDxcZYUv","colab_type":"text"},"cell_type":"markdown","source":["# Путь джедая"]},{"metadata":{"id":"pcqegVYmamhH","colab_type":"text"},"cell_type":"markdown","source":["## Введение в PyTorch"]},{"metadata":{"id":"CsTlI9n_bMWn","colab_type":"text"},"cell_type":"markdown","source":["Вообще говоря, все приличные люди начинают изучение PyTorch с его внутренностей: всяких Tensor'ов, Variable'ов и прочих autograd'ов.\n","\n","Обязательно почитайте статью, в которой подробно расписано это всё: [PyTorch — ваш новый фреймворк глубокого обучения (habrahabr)](https://habrahabr.ru/post/334380/).\n","\n","А мы пока просто пробежимся по главному."]},{"metadata":{"id":"LtmkcNbFzntE","colab_type":"text"},"cell_type":"markdown","source":["### Тензоры\n","\n","Тензоры в PyTorch - это как numpy.array. 
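To make the numpy analogy concrete, here is a minimal sketch (the values are arbitrary) of moving data between numpy and torch; `torch.from_numpy` wraps the same memory, so in-place changes are visible on both sides:

```python
import numpy as np
import torch

a = np.arange(6, dtype=np.float32).reshape(2, 3)  # an ordinary numpy array

t = torch.from_numpy(a)  # a torch.FloatTensor sharing the same memory
b = t.numpy()            # ...and back to a numpy view of that memory

t.add_(1)   # in-place op on the tensor
print(a)    # the numpy array is shifted by 1 as well
```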
Они есть разных типов, причем типизация строгая:\n","```python\n","torch.HalfTensor # 16 бит, с плавающей точкой\n","torch.FloatTensor # 32 бита, с плавающей точкой\n","torch.DoubleTensor # 64 бита, с плавающей точкой\n","\n","torch.ShortTensor # 16 бит, целочисленный, знаковый\n","torch.IntTensor # 32 бита, целочисленный, знаковый\n","torch.LongTensor # 64 бита, целочисленный, знаковый\n","\n","torch.CharTensor # 8 бит, целочисленный, знаковый\n","torch.ByteTensor # 8 бит, целочисленный, беззнаковый\n","```\n","\n","Например, можем создать тензор нужного размера:"]},{"metadata":{"id":"m5le91N5alZO","colab_type":"code","colab":{"autoexec":{"startup":false,"wait_interval":0}}},"cell_type":"code","source":["x = torch.FloatTensor(3, 4) # мусор\n","x.zero_() # нули"],"execution_count":0,"outputs":[]},{"metadata":{"id":"yBDs6WoB0ZOF","colab_type":"text"},"cell_type":"markdown","source":["А можем сделать его из готового array'я:"]},{"metadata":{"id":"uFMenGpH0f1z","colab_type":"code","colab":{"autoexec":{"startup":false,"wait_interval":0}}},"cell_type":"code","source":["x = np.arange(8).reshape(2, 4) + 5\n","y = np.arange(8).reshape(2, 4)\n","print(x)\n","print(y)\n","\n","x = torch.LongTensor(x)\n","y = torch.LongTensor(y)\n","print(x)\n","print(y)"],"execution_count":0,"outputs":[]},{"metadata":{"id":"2z452YSY0_Uf","colab_type":"text"},"cell_type":"markdown","source":["Можем делать стандартные операции:"]},{"metadata":{"id":"DcbbJKnDA870","colab_type":"code","colab":{"autoexec":{"startup":false,"wait_interval":0}}},"cell_type":"code","source":["print(x + y)\n","\n","print(x * y)\n","\n","print(x.type(torch.FloatTensor).log())"],"execution_count":0,"outputs":[]},{"metadata":{"id":"_XkK23p_BDSY","colab_type":"text"},"cell_type":"markdown","source":["### Разминка\n","Возьмем простую математическую функцию с прикольным графиком:\n","\n","$$ x(t) = t - 1.5 \\cdot cos( 15 t) $$\n","$$ y(t) = t - 1.5 \\cdot sin( 16 t) $$"]},{"metadata":{"id":"b2mZMHDOA_HJ","colab_type":"code","colab":{"autoexec":{"startup":false,"wait_interval":0}}},"cell_type":"code","source":["import matplotlib.pyplot as plt\n","%matplotlib inline\n","\n","t = torch.linspace(-10, 10, steps = 10000)\n","\n","# Посчитайте x(t), y(t)\n","x = \n","y = \n","\n","plt.plot(x.numpy(), y.numpy())"],"execution_count":0,"outputs":[]},{"metadata":{"id":"DwUSrLLYBQHA","colab_type":"text"},"cell_type":"markdown","source":["## Automatic gradients\n","\n","Всё это мог и numpy. 
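For reference, one possible way to fill in the warm-up cell above, using elementwise `torch.cos` / `torch.sin`:

```python
t = torch.linspace(-10, 10, steps=10000)

# x(t) = t - 1.5*cos(15*t),  y(t) = t - 1.5*sin(16*t), computed elementwise
x = t - 1.5 * torch.cos(15 * t)
y = t - 1.5 * torch.sin(16 * t)

plt.plot(x.numpy(), y.numpy())
```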
А теперь переходим к тому, ради чего нужен PyTorch: backpropagation c помощью `Variable` и модуля `autograd`.\n","\n","Начнём с простого примера с небольшим графом вычислений:\n","\n","![graph](https://image.ibb.co/mWM0Lx/1_6o_Utr7_ENFHOK7_J4l_XJtw1g.png =500x)"]},{"metadata":{"id":"fcHGvvx4bhbi","colab_type":"code","colab":{"autoexec":{"startup":false,"wait_interval":0}}},"cell_type":"code","source":["x = autograd.Variable(torch.FloatTensor([-2]), requires_grad=True)\n","y = autograd.Variable(torch.FloatTensor([5]), requires_grad=True)\n","z = autograd.Variable(torch.FloatTensor([-4]), requires_grad=True)\n","\n","q = x + y\n","f = q * z\n","\n","f.backward()"],"execution_count":0,"outputs":[]},{"metadata":{"id":"-i04d60kcHEL","colab_type":"code","colab":{"autoexec":{"startup":false,"wait_interval":0}}},"cell_type":"code","source":["print(x.grad)\n","print(y.grad)\n","print(z.grad)"],"execution_count":0,"outputs":[]},{"metadata":{"id":"W_IVMM7ccoNZ","colab_type":"text"},"cell_type":"markdown","source":["__Нормальный пример:__ потренируем линейную регрессию на Boston Housing Dataset.\n","\n","В целом, весь backpropagation выглядит как-то так:\n","1. У вас есть тензор. Он умеет в данные. Но вам не интересно просто в данные, хочется ещё и в градиенты.\n","2. Вы делаете ```a = autograd.Variable(data, requires_grad=True)```\n","3. Определяете функцию потерь `loss = whatever(a)`\n","4. Зовёте `loss.backward()`\n","5. ???\n","6. В ```a.grads``` записан нужный градиент.\n","\n","Например:"]},{"metadata":{"id":"HnenQ3zGBI31","colab_type":"code","colab":{"autoexec":{"startup":false,"wait_interval":0}}},"cell_type":"code","source":["from sklearn.datasets import load_boston\n","boston = load_boston()\n","plt.scatter(boston.data[:, -1], boston.target)"],"execution_count":0,"outputs":[]},{"metadata":{"id":"FJBWKJn3BYMX","colab_type":"code","colab":{"autoexec":{"startup":false,"wait_interval":0}}},"cell_type":"code","source":["w = autograd.Variable(torch.zeros(1), requires_grad=True)\n","b = autograd.Variable(torch.zeros(1), requires_grad=True)\n","\n","x = autograd.Variable(torch.FloatTensor(boston.data[:,-1] / 10))\n","y = autograd.Variable(torch.FloatTensor(boston.target))"],"execution_count":0,"outputs":[]},{"metadata":{"id":"Qot49-ciBa96","colab_type":"code","colab":{"autoexec":{"startup":false,"wait_interval":0}}},"cell_type":"code","source":["y_pred = w * x + b\n","loss = torch.mean((y_pred - y)**2)\n","\n","# Пересчитываем градиенты\n","loss.backward()"],"execution_count":0,"outputs":[]},{"metadata":{"id":"7y8_waF6BdNc","colab_type":"text"},"cell_type":"markdown","source":["Теперь в поле `.grad` есть градиенты."]},{"metadata":{"id":"SWvNp5GJBeVL","colab_type":"code","colab":{"autoexec":{"startup":false,"wait_interval":0}}},"cell_type":"code","source":["print(\"dL/dw = \\n\", w.grad)\n","print(\"dL/db = \\n\", b.grad)"],"execution_count":0,"outputs":[]},{"metadata":{"id":"6zo7tuE0Bhc9","colab_type":"text"},"cell_type":"markdown","source":["Если несколько раз посчитать лосс, градиенты каждый раз будут прибавляться. 
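A small self-contained illustration of this accumulation (the value 3 is arbitrary):

```python
v = autograd.Variable(torch.FloatTensor([3.0]), requires_grad=True)

loss = (v * v).sum()
loss.backward()
print(v.grad)   # d(v^2)/dv = 2*v = 6

loss = (v * v).sum()
loss.backward()
print(v.grad)   # 6 + 6 = 12: the new gradient was added to the old one

v.grad.data.zero_()
print(v.grad)   # back to 0
```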
Поэтому важно занулять градиенты между итерациями."]},{"metadata":{"id":"lHV8GFV1Bfgk","colab_type":"code","colab":{"autoexec":{"startup":false,"wait_interval":0}}},"cell_type":"code","source":["from IPython.display import clear_output\n","\n","for i in range(100):\n"," y_pred = w * x + b\n"," loss = torch.mean((y_pred - y)**2)\n"," loss.backward()\n","\n"," w.data -= 0.05 * w.grad.data\n"," b.data -= 0.05 * b.grad.data\n"," \n"," # Зануляем градиенты\n"," w.grad.data.zero_()\n"," b.grad.data.zero_()\n"," \n"," if (i+1) % 5 == 0:\n"," clear_output(True)\n"," plt.scatter(x.data.numpy(), y.data.numpy())\n"," plt.scatter(x.data.numpy(), y_pred.data.numpy(), color='orange', linewidth=5)\n"," plt.show()\n","\n"," print(\"loss = \", loss.data.numpy()[0])\n"," if loss.data.numpy()[0] < 0.5:\n"," print(\"Done!\")\n"," break"],"execution_count":0,"outputs":[]},{"metadata":{"id":"sIcByfIr6XOo","colab_type":"text"},"cell_type":"markdown","source":["### Cuda\n","\n","Последняя особенность PyTorch - работа с cuda.\n","\n","Вызовом `x = x.cuda()` мы перемещаем тензор на видеокарту. Точно так же перемещаются на видеокарту и все вычисления.\n","\n","Кроме этого есть отдельный набор тензоров `torch.cuda.FloatTensor`, на случай, когда сразу понятно, что работаем на видеокарте.\n","\n","Вернуться с видеокарты можно вызовом `.cpu()`.\n","\n","---\n","\n","А *путь джедая* только начинается. Прочитайте (всё ещё) статью [PyTorch — ваш новый фреймворк глубокого обучения](https://habrahabr.ru/post/334380/).\n","\n","__Задание*__ Реализовать простую полносвязную сеть на чистом numpy."]},{"metadata":{"id":"9Ypw8yX1Zyh2","colab_type":"text"},"cell_type":"markdown","source":["# Дорожка простых смертных"]},{"metadata":{"id":"lu4yr8A6d1EF","colab_type":"text"},"cell_type":"markdown","source":["Начнём уже решать задачу - а с тем, что такое PyTorch, разберемся подробнее по ходу дела :)\n","\n","Вспомним про датасет с прошлого занятия."]},{"metadata":{"id":"AEOiOGXza0Yv","colab_type":"code","colab":{"autoexec":{"startup":false,"wait_interval":0}}},"cell_type":"code","source":["from keras.datasets import imdb\n","\n","NUM_WORDS = 10000\n","\n","print('Loading data...')\n","(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=NUM_WORDS)\n","print(len(X_train), 'train sequences')\n","print(len(X_test), 'test sequences')\n","\n","NUM_LABELS = len(np.unique(y_train))\n","print(NUM_LABELS, 'class classification')\n","\n","print('Converting to bag-of-word matrix...')\n","def convert_to_bow(X):\n"," X_bow = np.zeros((len(X), NUM_WORDS))\n"," for i, review in enumerate(X):\n"," for ind in review:\n"," X_bow[i, ind] = 1\n"," return X_bow\n","\n","X_train_bow, X_test_bow = convert_to_bow(X_train), convert_to_bow(X_test)\n","\n","y_train, y_test = y_train.reshape((-1, 1)), y_test.reshape((-1, 1))"],"execution_count":0,"outputs":[]},{"metadata":{"id":"V-Du7BBid9Md","colab_type":"text"},"cell_type":"markdown","source":["Мы тогда начали с BoW-модели. Повторим её на PyTorch!\n","\n","*Напоминание*: на keras нам нужно было бы сделать следующее:\n","1. Определить модель, например:\n","```python \n","model = Sequential()\n","model.add(Dense(1, activation='sigmoid', input_dim=NUM_WORDS))\n","```\n","2. Задать функцию потерь и оптимизатор:\n","```python\n","model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n","```\n","\n","3. 
Запустить обучение:\n","```python\n","model.fit(X_train_bow, y_train, \n"," batch_size=32,\n"," epochs=3,\n"," validation_data=(X_test_bow, y_test))\n","```\n","\n","Конечно же, в PyTorch можно сделать всё то же самое, но гораздо сложнее :)\n","\n","## Определение модели\n","Начнём с определения модели:"]},{"metadata":{"id":"xhKMbbeahz50","colab_type":"code","colab":{"autoexec":{"startup":false,"wait_interval":0}}},"cell_type":"code","source":["model = nn.Sequential()\n","\n","model.add_module('layer', nn.Linear(NUM_WORDS, 1))\n","\n","model = model.cuda()\n","\n","print(model)"],"execution_count":0,"outputs":[]},{"metadata":{"id":"IJTBy15-iW0w","colab_type":"text"},"cell_type":"markdown","source":["Пока всё просто, правда?\n","\n","`Linear` вместо `Dense`, другие названия аргументов. И **нет** возможности задать функцию активации.\n","\n","Ну, и добавился вызов `model.cuda()`\n","\n","---\n","\n","Дальше - сложнее.\n","\n","Начнем с такой вот магии. О чём она - разберёмся чуть позже."]},{"metadata":{"id":"WkHhBgG0jike","colab_type":"code","colab":{"autoexec":{"startup":false,"wait_interval":0}}},"cell_type":"code","source":["def get_batches(dataset, batch_size):\n"," X, Y = dataset\n"," n_samples = X.shape[0]\n","\n"," indices = np.arange(n_samples)\n"," np.random.shuffle(indices)\n"," \n"," for start in range(0, n_samples, batch_size):\n"," end = min(start + batch_size, n_samples)\n"," \n"," batch_idx = indices[start:end]\n"," \n"," yield autograd.Variable(X[batch_idx, ]), autograd.Variable(Y[batch_idx, ])\n","\n","X_train_bow, y_train = torch.cuda.FloatTensor(X_train_bow), torch.cuda.FloatTensor(y_train)\n","X_test_bow, y_test = torch.cuda.FloatTensor(X_test_bow), torch.cuda.FloatTensor(y_test)"],"execution_count":0,"outputs":[]},{"metadata":{"id":"9DEsCJuH8kbU","colab_type":"text"},"cell_type":"markdown","source":["Создадим свой собственный мини-батч:"]},{"metadata":{"id":"F_UMU2uF8acw","colab_type":"code","colab":{"autoexec":{"startup":false,"wait_interval":0}}},"cell_type":"code","source":["X_batch, y_batch = next(get_batches((X_train_bow, y_train), 32))"],"execution_count":0,"outputs":[]},{"metadata":{"id":"PtduJIyD85xj","colab_type":"text"},"cell_type":"markdown","source":["Посмотрим на него глазами"]},{"metadata":{"id":"MCck9SzD8s8M","colab_type":"code","colab":{"autoexec":{"startup":false,"wait_interval":0}}},"cell_type":"code","source":["X_batch, y_batch"],"execution_count":0,"outputs":[]},{"metadata":{"id":"FKBSVx-R9Dbs","colab_type":"text"},"cell_type":"markdown","source":["Чтобы вычислить значение на этом батче, нужно позвать метод `forward`:"]},{"metadata":{"id":"1pd3SVbS896L","colab_type":"code","colab":{"autoexec":{"startup":false,"wait_interval":0}}},"cell_type":"code","source":["logit = model.forward(X_batch)"],"execution_count":0,"outputs":[]},{"metadata":{"id":"RMUMbHU79Odz","colab_type":"text"},"cell_type":"markdown","source":["Смотрите, что он нам выдал:"]},{"metadata":{"id":"xArXXH9_9M_s","colab_type":"code","colab":{"autoexec":{"startup":false,"wait_interval":0}}},"cell_type":"code","source":["logit"],"execution_count":0,"outputs":[]},{"metadata":{"id":"zNHm2-Mu90Tt","colab_type":"text"},"cell_type":"markdown","source":["В вероятности сконвертировать эти значения можно с помощью `F.sigmoid`"]},{"metadata":{"id":"t0UYn5iK97o2","colab_type":"code","colab":{"autoexec":{"startup":false,"wait_interval":0}}},"cell_type":"code","source":[""],"execution_count":0,"outputs":[]},{"metadata":{"id":"meQQbOYp9WSj","colab_type":"text"},"cell_type":"markdown","source":["## Функция 
потерь\n","Нам бы теперь как-нибудь настроить параметры. А каким образом это работало в keras?\n","\n","Посмотрите на ту строчку с `model.compile`. Ту часть, где мы задавали `loss`.\n","\n","Что, собственно, она означает? Она говорит, что функция потерь имеет такой вид:\n","$$ L = {1 \\over N} \\underset{X_i,y_i} \\sum - [ y_i \\cdot \\log P(y_i | X_i) + (1-y_i) \\cdot \\log (1-P(y_i | X_i)) ]$$\n","\n","А что предсказывает наша модель? Да просто какие-то значения. Если они большие - наверное, положительный класс. А если маленькие - отрицательный.\n","\n","Можно было бы добавить в нашу модель ещё один слой:\n","```python\n","model.add_module('predictions_layer', nn.Sigmoid())\n","```\n","и тогда в ход пойдет `nn.BCELoss`, который делает ровно то, что написано в формуле сверху. \n","\n","А можно оставить всё как есть - но поставить `nn.BCEWithLogitsLoss`, который сам будет добавлять сигмоиду.\n","\n","С помощью кого-то из них мы можем вычислять потери на нашем батче:\n"]},{"metadata":{"id":"4Q1p-3j09VoA","colab_type":"code","colab":{"autoexec":{"startup":false,"wait_interval":0}}},"cell_type":"code","source":["loss_function = nn.BCEWithLogitsLoss()\n","\n","loss = loss_function(logit, y_batch)"],"execution_count":0,"outputs":[]},{"metadata":{"id":"6bzoQQZZ9u3F","colab_type":"text"},"cell_type":"markdown","source":["Мы получили некоторое значение потерь:"]},{"metadata":{"id":"SMLROj839soe","colab_type":"code","colab":{"autoexec":{"startup":false,"wait_interval":0}}},"cell_type":"code","source":["loss"],"execution_count":0,"outputs":[]},{"metadata":{"id":"pKbTF5MF90HU","colab_type":"text"},"cell_type":"markdown","source":["Но это не просто значение. У него есть `backward`!"]},{"metadata":{"id":"KtvpfXzJ9zam","colab_type":"code","colab":{"autoexec":{"startup":false,"wait_interval":0}}},"cell_type":"code","source":["loss.backward()"],"execution_count":0,"outputs":[]},{"metadata":{"id":"VuziUVDQ-WPh","colab_type":"text"},"cell_type":"markdown","source":["И мы можем получить градиент:"]},{"metadata":{"id":"WPi0e6GT97zX","colab_type":"code","colab":{"autoexec":{"startup":false,"wait_interval":0}}},"cell_type":"code","source":["model.layer.weight.grad"],"execution_count":0,"outputs":[]},{"metadata":{"id":"FXtZCA8i8Oqs","colab_type":"text"},"cell_type":"markdown","source":["\n","## Оптимизатор\n","\n","Теперь мы можем настраивать параметры.\n","\n","Например, с помощью SGD:\n","$$\\theta^{(t+1)} = \\theta^{(t)} - \\eta \\nabla_\\theta L(\\theta)$$\n","\n","Либо с помощью чего-то более сложного:"]},{"metadata":{"id":"hvB0He_L_FJ5","colab_type":"code","colab":{"autoexec":{"startup":false,"wait_interval":0}}},"cell_type":"code","source":["optimizer = optim.Adam(model.parameters())\n","\n","print(model.layer.weight)\n","\n","optimizer.step()\n","\n","print(model.layer.weight)"],"execution_count":0,"outputs":[]},{"metadata":{"id":"Y5dIV1bE_X9L","colab_type":"text"},"cell_type":"markdown","source":["Нужно вызывать `step()`, чтобы обновить параметры модели.\n","\n","Последнее, но очень важное - **не забыть обнулить градиенты**. Делается это так:"]},{"metadata":{"id":"oQE_dvDY_tLh","colab_type":"code","colab":{"autoexec":{"startup":false,"wait_interval":0}}},"cell_type":"code","source":["optimizer.zero_grad()"],"execution_count":0,"outputs":[]},{"metadata":{"id":"1voOzhZI_uzE","colab_type":"text"},"cell_type":"markdown","source":["## Цикл обучения\n","\n","А теперь соберём всё вместе.\n","\n","Напишем функцию-аналог `fit` в keras. \n","\n","Тут-то нам и понадобится `get_batches`. 
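Before writing the full loop, here is what a single optimization step looks like when the pieces from the previous sections are chained together (a sketch reusing `model`, `loss_function`, `optimizer` and `get_batches` defined above):

```python
X_batch, y_batch = next(get_batches((X_train_bow, y_train), 128))

optimizer.zero_grad()                   # 1. clear the old gradients
logits = model(X_batch)                 # 2. forward pass (same as model.forward(X_batch))
loss = loss_function(logits, y_batch)   # 3. compute the loss on the batch
loss.backward()                         # 4. backprop fills .grad of every model parameter
optimizer.step()                        # 5. update the parameters given to the optimizer

print(loss.data.cpu()[0])
```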
Keras, по доброте душевной, сам выделяет мини-батчи и вычисляет все потери, метрики (и не забывает обнулить градиенты)."]},{"metadata":{"id":"W1Nkv3P3jOad","colab_type":"code","colab":{"autoexec":{"startup":false,"wait_interval":0}}},"cell_type":"code","source":["loss_function = nn.BCEWithLogitsLoss()\n","optimizer = optim.Adam(model.parameters())\n","\n","for epoch in range(10):\n"," train_epoch_loss = 0\n"," train_epochs_count = 0\n"," train_correct_count = 0\n"," for X_batch, y_batch in get_batches((X_train_bow, y_train), 128):\n"," # 1. Обнуляем градиенты\n"," optimizer.zero_grad()\n"," \n"," # 2. Запускаем forward-проход\n"," logits = \n"," \n"," # 3. Вычисляем потери\n"," loss = \n"," \n"," # 4. Вычисляем градиенты на backward-проходе. А какие, собственно, градиенты он подсчитает?\n"," \n"," \n"," # 5. Оптимизируем параметры. Откуда он знает, какие параметры оптимизировать?\n"," \n"," \n"," # Агрегируем лосс для вывода\n"," train_epoch_loss += loss.data.cpu()[0]\n"," train_epochs_count += 1\n"," \n"," # Вычисляем accuracy\n"," probs = F.sigmoid(logits)\n"," predictions = (probs > 0.5).type(torch.cuda.LongTensor)\n"," train_correct_count += np.sum((predictions == y_batch.type(torch.cuda.LongTensor)).cpu().data.numpy())\n"," \n"," val_epoch_loss = 0\n"," val_epochs_count = 0\n"," val_correct_count = 0\n"," \n"," \n"," print('Train Loss = {:.5f}, Train Accuracy = {:.2%}, Val Loss = {:.5f}, Val Accuracy = {:.2%}'.format(\n"," train_epoch_loss / train_epochs_count, train_correct_count / len(y_train), \n"," val_epoch_loss / val_epochs_count, val_correct_count / len(y_test)))"],"execution_count":0,"outputs":[]},{"metadata":{"id":"VCX2JnuJ_03H","colab_type":"text"},"cell_type":"markdown","source":["**Задание** Написать собственный оптимизатор (например, SGD) вместо Adam.\n","\n","Параметры модели можно получить так:"]},{"metadata":{"id":"SYv1YoipBGC5","colab_type":"code","colab":{"autoexec":{"startup":false,"wait_interval":0}}},"cell_type":"code","source":["for param in model.parameters():\n"," print(param)"],"execution_count":0,"outputs":[]},{"metadata":{"id":"c7aoiXiGBiEX","colab_type":"text"},"cell_type":"markdown","source":["У них есть `.grad`, который можно использовать для обновления параметров.\n","\n","При этом занулять градиенты придется так: `model.zero_grad()`"]},{"metadata":{"id":"dMYebl6xCJAN","colab_type":"text"},"cell_type":"markdown","source":["## Module API"]},{"metadata":{"id":"Vz64xj53_ScP","colab_type":"text"},"cell_type":"markdown","source":["Кроме `Sequential` есть и более функциональный способ задавать модели:"]},{"metadata":{"id":"Ms3HkcZHCIDG","colab_type":"code","colab":{"autoexec":{"startup":false,"wait_interval":0}}},"cell_type":"code","source":["class BoWClassifier(nn.Module): # Наследуемся от nn.Module\n"," def __init__(self, num_labels, vocab_size):\n"," super().__init__()\n","\n"," # Определяем слои, которые будем использовать\n"," self.linear = nn.Linear(vocab_size, num_labels)\n","\n"," def forward(self, bow_vec):\n"," # Пропускаем вектор через них\n"," return self.linear(bow_vec)"],"execution_count":0,"outputs":[]},{"metadata":{"id":"-ZEXrnIvCTUC","colab_type":"text"},"cell_type":"markdown","source":["# Word2Vec"]},{"metadata":{"id":"bIveeSDhMx9e","colab_type":"text"},"cell_type":"markdown","source":["Попробуем потренировать собственные эмбеддинги.\n","\n","Сначала скачаем датасет (следующие четыре ячейки нужно просто запустить и надеяться, что в них всё 
правильно)"]},{"metadata":{"id":"6Z52yP2tehxM","colab_type":"code","colab":{"autoexec":{"startup":false,"wait_interval":0}}},"cell_type":"code","source":["!wget -O text8.zip http://mattmahoney.net/dc/text8.zip\n","!unzip text8.zip"],"execution_count":0,"outputs":[]},{"metadata":{"id":"ULA_ryoMdvTL","colab_type":"code","colab":{"autoexec":{"startup":false,"wait_interval":0}}},"cell_type":"code","source":["with open('text8') as f:\n"," words = f.read().split()\n","print(\"data_size = {0}\".format(len(words)))"],"execution_count":0,"outputs":[]},{"metadata":{"id":"X6akWLGHc07T","colab_type":"code","colab":{"autoexec":{"startup":false,"wait_interval":0}}},"cell_type":"code","source":["# Only N = 50000 the most frequent words is considered\n","# The other marked with token `UNK` (unknown)\n","\n","from collections import Counter\n","\n","def build_dataset(words, vocabulary_size):\n"," count = [[ \"UNK\", -1 ]]\n"," count.extend(Counter(words).most_common(vocabulary_size-1))\n"," print(\"Least frequent word: \", count[-1])\n"," word_to_index = { word: i for i, (word, _) in enumerate(count) }\n"," data = [word_to_index.get(word, 0) for word in words] # map unknown words to 0\n"," unk_count = data.count(0) # Number of unknown words\n"," count[0][1] = unk_count\n"," index_to_word= dict(zip(word_to_index.values(), word_to_index.keys()))\n"," \n"," return data, count, word_to_index, index_to_word\n","\n","vocabulary_size = 50000\n","data, count, word_to_index, index_to_word = build_dataset(words, vocabulary_size)\n","\n","# Everything you need to know about the dataset\n","\n","print(\"data: {0}\".format(data[:10]))\n","print(\"count: {0}\".format(count[:10]))\n","print(\"index_to_word: {0}\".format(list(index_to_word.items())[:10]))"],"execution_count":0,"outputs":[]},{"metadata":{"id":"8a2jV9bMd2Ff","colab_type":"code","colab":{"autoexec":{"startup":false,"wait_interval":0}}},"cell_type":"code","source":["from collections import deque\n","\n","def generate_batch(data_index, data_size, batch_size, bag_window):\n"," span = 2 * bag_window + 1 # [ bag_window, target, bag_window ]\n"," batch = np.ndarray(shape = (batch_size, span - 1), dtype = np.int32)\n"," labels = np.ndarray(shape = (batch_size), dtype = np.int32)\n"," \n"," data_buffer = deque(maxlen = span)\n"," \n"," for _ in range(span):\n"," data_buffer.append(data[data_index])\n"," data_index = (data_index + 1) % data_size\n"," \n"," for i in range(batch_size):\n"," data_list = list(data_buffer)\n"," labels[i] = data_list.pop(bag_window)\n"," batch[i] = data_list\n"," \n"," data_buffer.append(data[data_index])\n"," data_index = (data_index + 1) % data_size\n"," return data_index, batch, labels\n","\n","\n","print(\"data = {0}\".format([index_to_word[each] for each in data[:16]]))\n","data_index, data_size, batch_size = 0, len(data), 4\n","for bag_window in [1, 2]:\n"," data_index, batch, labels = generate_batch(data_index, data_size, batch_size, bag_window)\n"," print(\"bag_window = {0}\".format(bag_window))\n"," print(\"batch = {0}\".format([[index_to_word[index] for index in each] for each in batch]))\n"," print(\"labels = {0}\\n\".format([index_to_word[each] for each in labels.reshape(4)]))"],"execution_count":0,"outputs":[]},{"metadata":{"id":"cw9wVLaANDU-","colab_type":"text"},"cell_type":"markdown","source":["Напомню, мы тут собрались научиться получать эмбеддинги слов. 
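The building block used below is `nn.Embedding` - essentially a trainable lookup table from word indices to dense vectors. A tiny sketch of what it does (the sizes and indices here are arbitrary):

```python
embedding = nn.Embedding(num_embeddings=10, embedding_dim=3)

word_ids = autograd.Variable(torch.LongTensor([[1, 2, 5]]))  # one "context" of 3 word indices
vectors = embedding(word_ids)   # shape: (1, 3, 3) - one vector per index
context = vectors.sum(dim=1)    # shape: (1, 3) - summed context embedding, as in CBoW below
print(vectors.size(), context.size())
```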
Ну, и получать такие прикольные штуки, если повезёт:\n","![Word vectors relations](https://www.tensorflow.org/images/linear-relationships.png =700x)\n","\n","Вообще говоря, каждое слово можно представлять и просто как индекс в словаре: \n","$$\\overbrace{\\left[ 0, 0, \\dots, 1, \\dots, 0, 0 \\right]}^\\text{|V| elements}$$\n","\n","Основной недостаток - размер таких векторов и отсутствие интерпретируемости. Хочется, чтобы было как-то так:\n","$$q_\\text{mathematician} = \\left[ \\overbrace{2.3}^\\text{can run},\n","\\overbrace{9.4}^\\text{likes coffee}, \\overbrace{-5.5}^\\text{majored in Physics}, \\dots \\right]$$\n","$$q_\\text{physicist} = \\left[ \\overbrace{2.5}^\\text{can run},\n","\\overbrace{9.1}^\\text{likes coffee}, \\overbrace{6.4}^\\text{majored in Physics}, \\dots \\right]$$\n","\n","По таким векторам уже можно считать похожесть:\n","$$\\text{Similarity}(\\text{physicist}, \\text{mathematician}) = \\frac{q_\\text{physicist} \\cdot q_\\text{mathematician}}\n","{\\| q_\\text{physicist} \\| \\| q_\\text{mathematician} \\|} = \\cos (\\phi)$$\n","\n","Для тренировки таких представлений есть два наиболее популярных способа: skip-gram и CBoW.\n","\n","## Continuous Bag of Words (CBoW)\n","В этой модели мы учимся предсказывать слово по его контексту:\n","\n","![CBoW](http://evelinag.com/fsharpexchange2017/images/word2vec-cbow.png =500x)\n","\n","Таким образом, учимся моделировать $P(w_c| w_{c-k}, \\ldots, w_{c-1}, w_{c+1}, \\ldots, w_{c+k})$.\n","\n","При этом тренируются две матрицы. Собственно, матрица эмбеддингов $W_1$ и матрица выходного слоя $W_2$.\n","\n","Вообще говоря, можно особо не думать и cчитать $softmax(W_2 q_c + b)$, где $q_c$ - сумма эмбеддингов слов из контекста, т.е. $q_c = \\sum_{w_i \\in c} W_1 w_i$. \n","\n","В результате получим некоторое распределение вероятностей на словаре, а нам нужно просто оптимизировать наши матрицы так, чтобы вероятность нужного слова была максимальной. \n","\n","Оптимизируется тут, как всегда, кросс-энтропия.\n","\n","Разберемся чуть подробнее. Пусть $\\tilde w_c$ -- это предсказываемое слово. $w_{c-k}, \\ldots, w_{c-1}, w_{c+1}, \\ldots, w_{c+k}$ - его контекст.\n","\n","Сначала мы считаем эмбеддинги этих контекстных слов: $u_j = W_1 w_j, \\ j = c-k, \\ldots, c+k, \\ j \\neq c$.\n","\n","Потом мы суммируем эти эмбеддинги: $u_c = \\sum_j u_j$. Получили как бы эмбеддинг контекста.\n","\n","А теперь вспоминаем про матрицу $W_2$. Она тоже содержит \"эмбеддинги\" векторов: $i$-тый столбец $v_i$ - эмбеддинг $i$-того слова.\n","\n","Вот давайте учиться тому, чтобы эмбеддинг контекста был максимально похож на эмбеддинг слова:\n","$$-\\log \\frac{\\exp(v_c^T u_c)}{\\sum_{i=1}^{|V|} \\exp(v_i^T u_c)} \\to \\min.$$\n","\n","Заметьте, мы считаем скалярное произведение между парой векторов, почти как в `Similarity` выше\n","\n","*Ничего не понятно, да? 
:(*"]},{"metadata":{"id":"myBmXEMmCSpX","colab_type":"code","colab":{"autoexec":{"startup":false,"wait_interval":0}}},"cell_type":"code","source":["class CBOW(nn.Module):\n"," def __init__(self, vocab_size, embedding_dim):\n"," super().__init__()\n"," self.embeddings = nn.Embedding(vocab_size, embedding_dim)\n"," self.out_layer = nn.Linear(embedding_dim, vocab_size)\n","\n"," def forward(self, inputs):\n"," # Просто эмбеддим все входные слова и суммируем полученные эмбеддинги = u_c\n"," embeds = self.embeddings(inputs).sum(dim=1)\n"," # Вычисляем W_2 x u_c = {v_i^T x u_c} из формулы выше\n"," out = self.out_layer(embeds)\n"," # Считаем log_softmax - логарифмы вероятностей того, что такое слово - центральное\n"," return F.log_softmax(out, dim=1)\n"," \n","# Строим эмбеддинги размерности 50\n","model = CBOW(vocabulary_size, 50)\n","model = model.cuda()\n","\n","# NLLLoss - кросс-энтропийная функция потерь для логарифмов вероятностей\n","loss_function = nn.NLLLoss()\n","optimizer = optim.SGD(model.parameters(), lr=0.01) "],"execution_count":0,"outputs":[]},{"metadata":{"id":"4xCyrNOzdffn","colab_type":"code","colab":{"autoexec":{"startup":false,"wait_interval":0}}},"cell_type":"code","source":["data_index, data_size, batch_size, bag_window = 0, len(data), 64, 2\n"," \n","loss_every_nsteps = 1000\n","total_loss = 0\n","for step in range(50000):\n"," data_index, batch, labels = generate_batch(data_index, data_size, batch_size, bag_window)\n"," batch, labels = autograd.Variable(torch.cuda.LongTensor(batch)), autograd.Variable(torch.cuda.LongTensor(labels))\n","\n"," optimizer.zero_grad()\n","\n"," log_probs = model(batch)\n","\n"," loss = loss_function(log_probs, labels)\n"," loss.backward()\n","\n"," optimizer.step()\n","\n"," total_loss += loss.data\n"," if step % loss_every_nsteps == 0:\n"," if step > 0:\n"," total_loss /= loss_every_nsteps\n"," display.clear_output(True)\n"," print(\"step = {0}, average_loss = {1}\".format(step, total_loss.cpu().numpy()[0]))\n"," total_loss = 0"],"execution_count":0,"outputs":[]},{"metadata":{"id":"cVk5L6pOhMeb","colab_type":"text"},"cell_type":"markdown","source":["Визуализировать всё это можно так:"]},{"metadata":{"id":"9ne0G23iZnW2","colab_type":"code","colab":{"autoexec":{"startup":false,"wait_interval":0}}},"cell_type":"code","source":["from sklearn.manifold import TSNE\n","\n","num_points = 250\n","\n","tsne = TSNE(perplexity=10, n_components=2, init=\"pca\", n_iter=5000)\n","two_d_embeddings = tsne.fit_transform(model.embeddings.weight.data.cpu().numpy()[1:num_points+1, :])\n","\n","plt.figure(figsize=(15,15))\n","words = [index_to_word[i] for i in range(1, num_points+1)]\n","\n","for i, label in enumerate(words):\n"," x, y = two_d_embeddings[i, :]\n"," plt.scatter(x, y)\n"," plt.annotate(label, xy=(x, y), xytext=(5, 2), textcoords=\"offset points\",\n"," ha=\"right\", va=\"bottom\")\n","plt.show()"],"execution_count":0,"outputs":[]},{"metadata":{"id":"NQsDAyHMgyT7","colab_type":"text"},"cell_type":"markdown","source":["## Negative Sampling\n","Вообще говоря, считать softmax на большом словаре - очень долго и вычислительно сложно.\n","\n","Один из способов справиться с этим - использовать *Negative Sampling*.\n","\n","По сути, вместо предсказания индекса слова по контексту предсказывается вероятность того, что такое слово может быть в таком контексте: $P(D=1|w,c)$.\n","\n","Можно использовать обычную сигмоиду для получения данной вероятности: \n","$$P(D=1|w, c) = \\sigma(v_w^T u_c) = \\frac 1 {1 + \\exp(-v^T_w u_c)}.$$\n","\n","Процесс обучения тогда 
выглядит так: для каждой пары слово и его контекст генерируем набор отрицательных примеров:\n","\n","![Negative Sampling](https://image.ibb.co/dnOUDH/Negative_Sampling.png =350x)\n","\n","Для CBoW функция потерь будет выглядеть так:\n","$$-\\log \\sigma(v_c^T u_c) - \\sum_{k=1}^K \\log \\sigma(-\\tilde v_k^T u_c),$$\n","где $\\tilde v_1, \\ldots, \\tilde v_K$ - сэмплированные негативные примеры.\n","\n","Сравните эту формулу с обычным CBoW:\n","$$-v_c^T u_c + \\log \\sum_{i=1}^{|V|} \\exp(v_i^T u_c).$$\n","\n","Обычно слова сэмплируются из $U^{3/4}$, где $U$ - униграмное распределение, т.е частоты появления слова делённые на суммарое число слов. \n","\n","Частотности мы уже считали: они получаются в `Counter(words)`. Достаточно просто преобразовать их в вероятности и домножить эти вероятности на $\\frac 3 4$. Почему $\\frac 3 4$? Некоторую интуицию можно найти в следующем примере:\n","\n","$$P(\\text{is}) = 0.9, \\ P(\\text{is})^{3/4} = 0.92$$\n","$$P(\\text{Constitution}) = 0.09, \\ P(\\text{Constitution})^{3/4} = 0.16$$\n","$$P(\\text{bombastic}) = 0.01, \\ P(\\text{bombastic})^{3/4} = 0.032$$\n","\n","Вероятность для высокочастотных слов особо не увеличилась (относительно), зато низкочастотные будут выпадать с заметно большей вероятностей.\n","\n","**Задание** Реализуйте свой Negative Sampling."]},{"metadata":{"id":"zRiuNwSPP4N8","colab_type":"code","colab":{"autoexec":{"startup":false,"wait_interval":0}}},"cell_type":"code","source":["class NegativeSamplingCBOW(nn.Module):\n"," def __init__(self, vocab_size, embedding_dim):\n"," super().__init__()\n"," self.embeddings = nn.Embedding(vocab_size, embedding_dim)\n"," self.out_embeddings = nn.Embedding(vocab_size, embedding_dim)\n","\n"," def forward(self, inputs, targets, num_samples):\n"," '''\n"," inputs : (batch_size, context_size)\n"," targets: (batch_size)\n"," '''\n"," # Находим u_c\n"," embeds = self.embeddings(inputs).sum(dim=1)\n"," # Считаем v_c\n"," outputs = self.out_embeddings(targets)\n"," \n"," \n"," \n"," \n"," \n"," return loss\n"," \n","# Строим эмбеддинги размерности 50\n","model = CBOW(vocabulary_size, 50)\n","model = model.cuda()\n","\n","# Обратите внимание, loss уже не нужен!\n","# Считаем его прямо в классе\n","# Таким образом, просто делаем loss = model(batch)\n","optimizer = optim.SGD(model.parameters(), lr=0.01) "],"execution_count":0,"outputs":[]},{"metadata":{"id":"I3zkfuUrgUy9","colab_type":"text"},"cell_type":"markdown","source":["## Skip-Gram\n","\n","В Skip-gram модели всё наоборот. Предсказываются слова из контекста по данному слову.\n","\n","![Skip-gram](https://adriancolyer.files.wordpress.com/2016/04/word2vec-skip-gram.png?w=600)\n","\n","Теперь учимся моделировать вероятность $P(w_{c-k}, \\ldots, w_{c-1}, w_{c+1}, \\ldots, w_{c+k} | w_c)$. Это, конечно, сложно, поэтому упростим всё - используем наивного Байеса:\n","$P(w_{c-k}, \\ldots, w_{c-1}, w_{c+1}, \\ldots, w_{c+k} | w_c) = \\prod_{j=-k, j \\neq 0}^k P(w_{c+j} | w_c)$. Посмотрите ещё раз на картинку - там наприсовано именно это (точнее, там нарисовано, что тренируется всего одна матрица $W_2$).\n","\n","Оптимизировать нужно всё ту же кросс-энтропию. Только теперь уже нужно суммировать кросс-энтропийные потери для всех слов в контексте.\n","\n","**Задание** Реализовать, ага.\n","\n","Начнем с генерации батча. Представить, что там происходит, проще всего по этой картинке:\n","![skip-gram-batch](https://raw.githubusercontent.com/deepmipt/deep-nlp-seminars/44f574efe79ca1613d77336e4163061f9d5566c6/seminar_02/pics/training_data.png =600x)\n","\n","Т.е. 
тренировочные данные состоят из пар (слово, слово из контекста).\n","\n","В результате, если вся выборка будет состоять из одного этого предложения, распределение вероятностей, которое мы стремимся получить, для слова `brown` будет таким:\n","\n","![skip-gram-illustration](https://image.ibb.co/eJ3otH/SkipGram.png =600x)\n","\n","Каждое из слов в его контексте получает одинаковую вероятность $1 \\over 4$, а все остальные слова - ноль. "]},{"metadata":{"id":"cA9rcM5Rh4KE","colab_type":"code","colab":{"autoexec":{"startup":false,"wait_interval":0}}},"cell_type":"code","source":["import random\n","\n","def generate_batch_2(data_index, data_size, batch_size, num_skips, skip_window):\n"," assert batch_size % num_skips == 0\n"," assert num_skips <= 2 * skip_window\n"," \n"," batch = np.ndarray(shape = batch_size, dtype = np.int32)\n"," labels = np.ndarray(shape = (batch_size, 1), dtype = np.int32)\n"," span = 2 * skip_window + 1\n"," data_buffer = deque(maxlen = span)\n"," for _ in range(span):\n"," data_buffer.append(data[data_index])\n"," data_index = (data_index + 1) % data_size\n"," \n"," for i in range(batch_size // num_skips):\n"," target, targets_to_avoid = skip_window, [skip_window]\n"," for j in range(num_skips):\n"," while target in targets_to_avoid: \n"," target = random.randint(0, span - 1)\n"," targets_to_avoid.append(target)\n"," batch[i * num_skips + j] = data_buffer[skip_window]\n"," labels[i * num_skips + j, 0] = data_buffer[target]\n"," data_buffer.append(data[data_index])\n"," data_index = (data_index + 1) % data_size\n"," return data_index, batch, labels\n","\n","\n","print(\"data = {0}\\n\".format([index_to_word[each] for each in data[:32]]))\n","data_index, data_size = 0, len(data)\n","for num_skips, skip_window in [(2, 1), (4, 2)]:\n"," data_index = 0\n"," data_index, batch, labels = generate_batch_2(data_index=data_index, \n"," data_size=data_size, \n"," batch_size=16, \n"," num_skips=num_skips, \n"," skip_window=skip_window)\n"," print(\"data_index = {0}, num_skips = {1}, skip_window = {2}\".format( data_index, num_skips, skip_window))\n"," print(\"batch = {0}\".format([index_to_word[each] for each in batch]))\n"," print(\"labels = {0}\\n\".format([index_to_word[each] for each in labels.reshape(16)]))"],"execution_count":0,"outputs":[]},{"metadata":{"id":"d44o6JC_WyH-","colab_type":"code","colab":{"autoexec":{"startup":false,"wait_interval":0}}},"cell_type":"code","source":["class SkipGram(nn.Module):\n"," def __init__(self, vocab_size, embedding_dim):\n"," super().__init__()\n"," self.embeddings = nn.Embedding(vocab_size, embedding_dim)\n"," self.out_layer = nn.Linear(embedding_dim, vocab_size)\n","\n"," def forward(self, inputs):\n"," "],"execution_count":0,"outputs":[]}]} \ No newline at end of file diff --git a/Lecture 2 - PyTorch & Word2Vec/README.md b/Lecture 2 - PyTorch & Word2Vec/README.md new file mode 100644 index 0000000..0f54a34 --- /dev/null +++ b/Lecture 2 - PyTorch & Word2Vec/README.md @@ -0,0 +1,18 @@ +# Лекция 2 +*Введение в PyTorch и реализация моделей Word2Vec* + +Ноутбук на Colab можно найти [здесь](https://colab.research.google.com/drive/15tg6jTt1F0oR5PzFlcZNBnpPiu4se1m3) +**Обратите внимание, ноутбук обновился в части про Word2Vec.** + +**Задание** +*Дедлайн: 09:00 GMT+3 15.03.18* +1. Реализовать Skip-Gram модель. +2. Реализовать Negative Sampling для CBoW модели. +3. 
(По желанию) Разобраться со статьёй
+[Two/Too Simple Adaptations of Word2Vec for Syntax Problems](http://www.cs.cmu.edu/~lingwang/papers/naacl2015.pdf) и реализовать Structured Skip-gram модель.
+
+Дополнительные материалы:
+- [On word embeddings - Part 1, by Sebastian Ruder](http://ruder.io/word-embeddings-1/)
+- [On word embeddings - Part 2: Approximating the Softmax, by Sebastian Ruder](http://ruder.io/word-embeddings-softmax/index.html)
+- [cs224n "Lecture 2 | Word Vector Representations: word2vec" (video)](https://www.youtube.com/watch?v=ERibwqs9p38&index=2&list=PLqdrfNEc5QnuV9RwUAhoJcoQvu4Q46Lja&t=0s)
+- [cs224n "Lecture 5: Backpropagation and Project Advice" (video)](https://www.youtube.com/watch?v=isPiE-DBagM&index=5&list=PLqdrfNEc5QnuV9RwUAhoJcoQvu4Q46Lja&t=0s)
\ No newline at end of file
diff --git a/README.md b/README.md
index fbe0984..7de60cf 100644
--- a/README.md
+++ b/README.md
@@ -6,4 +6,9 @@ The textbook: [Neural Network Methods in Natural Language Processing by Yoav Gol
 
 ## Lecture Materials
 ### Lecture 1
-A short overview of the most popular architectures in the deep learning. A brief introduction to the [keras](keras.io) framework.
\ No newline at end of file
+A short overview of the most popular architectures in deep learning. A brief introduction to the [keras](https://keras.io) framework. Examples of models for sentiment analysis on the IMDB movie review dataset.
+The [Colab notebook](https://drive.google.com/open?id=1KGy9Hm3y4asE6ohg3QD77w2nZV3V9_08).
+
+### Lecture 2
+An introduction to the [PyTorch](https://pytorch.org) framework. Examples of Word2Vec models.
+The [Colab notebook](https://colab.research.google.com/drive/15tg6jTt1F0oR5PzFlcZNBnpPiu4se1m3).
\ No newline at end of file