Updated week 02, added images for week 03
DanAnastasyev committed Sep 26, 2018
1 parent 93c3f9e commit 8308048
Showing 7 changed files with 102 additions and 58 deletions.
160 changes: 102 additions & 58 deletions Week 02/Week_02_Word_Embeddings_(Part_1).ipynb
@@ -190,26 +190,14 @@
"metadata": {
"id": "irl7RotC5C_B",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"outputId": "4ecdb8da-76af-40e3-e8b9-dfec7b85225f"
"colab": {}
},
"cell_type": "code",
"source": [
"print([' '.join(row) for row in tokenized_texts[:2]])"
],
"execution_count": 5,
"outputs": [
{
"output_type": "stream",
"text": [
"['what is the step by step guide to invest in share market in india ?', 'what is the story of kohinoor ( koh-i-noor ) diamond ?']\n"
],
"name": "stdout"
}
]
"execution_count": 0,
"outputs": []
},
{
"metadata": {
@@ -468,7 +456,7 @@
"\n",
"assert word_vectors_pca.shape == (len(word_vectors), 2), \"there must be a 2d vector for each word\"\n",
"assert max(abs(word_vectors_pca.mean(0))) < 1e-5, \"points must be zero-centered\"\n",
"assert max(abs(1 - word_vectors_pca.std(0) < 1e-5)), \"points must have unit variance\""
"assert max(abs(1 - word_vectors_pca.std(0))) < 1e-5, \"points must have unit variance\""
],
"execution_count": 0,
"outputs": []
@@ -773,14 +761,41 @@
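Returning to the PCA asserts just above: one possible way to produce a `word_vectors_pca` that satisfies them is sketched below, assuming `word_vectors` is a numpy array of shape `(num_words, embedding_dim)`. Since sklearn's PCA already zero-centers the projection, only the variance needs rescaling.

```python
# A possible sketch, not the reference solution.
from sklearn.decomposition import PCA

# word_vectors: assumed numpy array of shape (num_words, embedding_dim)
word_vectors_pca = PCA(n_components=2).fit_transform(word_vectors)

# PCA output is already zero-centered; rescale each component to unit variance
word_vectors_pca = word_vectors_pca / word_vectors_pca.std(axis=0)
```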
"### Bag-of-words"
]
},
{
"metadata": {
"id": "jGge73gDid99",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"**Task** To start, let's collect the tokenized questions."
]
},
{
"metadata": {
"id": "FJTjqdgiih7O",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"tokenized_question1 = ...\n",
"tokenized_question2 = ..."
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "0VBcipE7Ztkc",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"**Task** Compute the cosine similarity between the questions."
"**Task** Compute the cosine similarity between the questions.\n",
"\n",
"The easiest way is to implement the computation by hand:\n",
"$$\\text{cosine_similarity}(x, y) = \\frac{x^{T} y}{||x||\\cdot ||y||}$$"
]
},
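A hand-rolled version of the formula above could look like this minimal sketch (assuming `x` and `y` are 1-D numpy arrays):

```python
import numpy as np

def cosine_similarity(x, y):
    # x^T y / (||x|| * ||y||)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
```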
{
@@ -878,6 +893,8 @@
"\n",
"**Task** Compute the weighted question vectors. Use `TfidfVectorizer`\n",
"\n",
"`TfidfVectorizer` returns a matrix of shape `(samples_count, words_count)`, while our embeddings have shape `(words_count, embedding_dim)`. This means you can simply multiply them. Each phrase - a sequence of words $w_1, \ldots, w_k$ - then turns into the vector $\sum_i \text{idf}(w_i) \cdot \text{embedding}(w_i)$. It is probably worth normalizing this vector by the number of words $k$.\n",
"\n",
"**Task** Besides tf-idf, you can also filter out stop words and punctuation. \n",
"Stop words can be taken from `nltk`:\n",
"```python\n",
@@ -918,11 +935,7 @@
"metadata": {
"id": "k52NtVJKf1Wl",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 53
},
"outputId": "92c2c105-9d7d-445a-eeae-e8b36002f4de"
"colab": {}
},
"cell_type": "code",
"source": [
"\n",
"print('\\n' + ' '.join(tokenized_question1[0]))"
],
"execution_count": 12,
"outputs": [
{
"output_type": "stream",
"text": [
"invest guide market india step share ? \n",
"what is the step by step guide to invest in share market in india ?\n"
],
"name": "stdout"
}
]
"execution_count": 0,
"outputs": []
},
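The tf-idf-weighted averaging described above could be sketched roughly as follows; the names `texts` and `embeddings` are illustrative assumptions (not the notebook's actual variables), and the rows of `embeddings` must be aligned with the vectorizer's vocabulary:

```python
# Rough sketch of tf-idf-weighted question vectors.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(texts)      # sparse, (samples_count, words_count)

# Each row becomes a tf-idf-weighted sum of word embeddings
question_vectors = tfidf.dot(embeddings)     # (samples_count, embedding_dim)

# Optionally normalize by the number of distinct words in each question
lengths = np.maximum(tfidf.getnnz(axis=1), 1).reshape(-1, 1)
question_vectors = question_vectors / lengths
```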
{
"metadata": {
@@ -1008,11 +1012,7 @@
"metadata": {
"id": "1CXcr-ypzGXg",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 161
},
"outputId": "8abf9e1c-6928-48a2-c403-4cafda2945ae"
"colab": {}
},
"cell_type": "code",
"source": [
@@ -1044,23 +1044,8 @@
"!unzip cc.ru.300.vec.zip\n",
"!unzip cc.uk.300.vec.zip"
],
"execution_count": 20,
"outputs": [
{
"output_type": "stream",
"text": [
"\n",
"Redirecting output to ‘wget-log.1’.\n",
"\n",
"Redirecting output to ‘wget-log.2’.\n",
"Archive: cc.ru.300.vec.zip\n",
" inflating: cc.ru.300.vec \n",
"Archive: cc.uk.300.vec.zip\n",
" inflating: cc.uk.300.vec \n"
],
"name": "stdout"
}
]
"execution_count": 0,
"outputs": []
},
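One common way to load the downloaded `.vec` files is gensim's `KeyedVectors` (an assumption for illustration - the collapsed cells may do it differently); limiting the vocabulary keeps loading time and memory reasonable:

```python
# Sketch: load the fastText vectors with gensim (assumes gensim is installed).
from gensim.models import KeyedVectors

ru_emb = KeyedVectors.load_word2vec_format("cc.ru.300.vec", limit=100000)
uk_emb = KeyedVectors.load_word2vec_format("cc.uk.300.vec", limit=100000)
```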
{
"metadata": {
"source": [
"Let's write a simple implementation of a machine translation model.\n",
"\n",
"We will translate from Ukrainian into Russian.\n",
"The idea is based on the paper [Word Translation Without Parallel Data](https://arxiv.org/pdf/1710.04087.pdf). The authors' repository contains a lot more interesting material: [https://github.com/facebookresearch/MUSE](https://github.com/facebookresearch/MUSE).\n",
"\n",
"And we will translate from Ukrainian into Russian.\n",
"\n",
"![](https://raw.githubusercontent.com/yandexdataschool/nlp_course/master/resources/blue_cat_blue_whale.png) \n",
"*синій кіт* vs. *синій кит* (blue cat vs. blue whale)"
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "7rGx4TXWFJ65",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"Let's look at the pair серпень-август (which are translations of each other)"
]
},
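With gensim-style keyed vectors (the names `uk_emb`/`ru_emb` are carried over from the loading sketch above and are assumptions), the check could look like this; before any mapping is learned, the similarity between the raw vectors is usually low because the two spaces were trained independently:

```python
import numpy as np

august_ru = ru_emb["август"]    # Russian "August"
serpen_uk = uk_emb["серпень"]   # Ukrainian "August"

# Cosine similarity between the raw, unaligned vectors
sim = np.dot(august_ru, serpen_uk) / (np.linalg.norm(august_ru) * np.linalg.norm(serpen_uk))
print(sim)
```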
{
"metadata": {
"id": "FkHer36xyh4n",
@@ -1176,7 +1173,7 @@
"\n",
"This function looks very much like linear regression (without a bias term).\n",
"\n",
"**Task** Implement it:"
"**Task** Implement it - use `LinearRegression` from sklearn with `fit_intercept=False`:"
]
},
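A sketch of that suggestion, assuming `X_train` holds Ukrainian word vectors and `Y_train` the vectors of their Russian translations, row-aligned (hypothetical names, not the notebook's):

```python
# Learn a linear map W minimizing ||X_train @ W - Y_train||^2 with no intercept.
from sklearn.linear_model import LinearRegression

mapping = LinearRegression(fit_intercept=False)
mapping.fit(X_train, Y_train)

# Map a Ukrainian vector into the Russian embedding space
mapped_vector = mapping.predict(uk_emb["серпень"].reshape(1, -1))[0]
```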
{
},
"cell_type": "markdown",
"source": [
"Let's check where `серпень` (in Russian, `август`) maps to:"
"Let's check where `серпень` maps to:"
]
},
{
@@ -1390,6 +1387,16 @@
"### Writing a translator"
]
},
{
"metadata": {
"id": "hwi70fP6FaAN",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"Let's implement a simple word-by-word translator - for each word, we will look for its nearest neighbor in the shared embedding space. If a word is not in the embeddings, we simply copy it over."
]
},
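A minimal sketch of such a translator, again assuming gensim-style `uk_emb`/`ru_emb` and the fitted `mapping` from the sketch above (not the reference solution):

```python
def translate(sentence):
    """Translate a Ukrainian sentence into Russian word by word."""
    translated = []
    for word in sentence.split():
        if word in uk_emb:
            # Map the Ukrainian vector into the Russian space
            mapped = mapping.predict(uk_emb[word].reshape(1, -1))[0]
            # Take its nearest neighbor among the Russian vectors
            nearest_word, _ = ru_emb.most_similar(positive=[mapped], topn=1)[0]
            translated.append(nearest_word)
        else:
            # Out-of-vocabulary words are copied unchanged
            translated.append(word)
    return " ".join(translated)

print(translate("синій кіт"))
```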
{
"metadata": {
"id": "0etAHUks4JOr",
@@ -1469,6 +1476,43 @@
"[Submission form](https://goo.gl/forms/GGjrH7axdGJr6yTp2) \n",
"[Survey](https://goo.gl/forms/3QRwLTmLgBzl5VVm2)"
]
},
{
"metadata": {
"id": "_5GrChTeFqIg",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"# Additional materials"
]
},
{
"metadata": {
"id": "HwffxpbmFwDh",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"## To read\n",
"### Basics: \n",
"[On word embeddings - Part 1, Sebastian Ruder](http:https://ruder.io/word-embeddings-1/) \n",
"[Deep Learning, NLP, and Representations, Christopher Olah](http:https://colah.github.io/posts/2014-07-NLP-RNNs-Representations/) \n",
"\n",
"### How to cluster the senses of ambiguous words: \n",
"[Making Sense of Word Embeddings (2016), Pelevina et al.](http:https://anthology.aclweb.org/W16-1620) \n",
"\n",
"### How to evaluate embeddings\n",
"[Evaluation methods for unsupervised word embeddings (2015), T. Schnabel](http:https://www.aclweb.org/anthology/D15-1036) \n",
"[Intrinsic Evaluation of Word Vectors Fails to Predict Extrinsic Performance (2016), B. Chiu](https://www.aclweb.org/anthology/W/W16/W16-2501.pdf) \n",
"[Problems With Evaluation of Word Embeddings Using Word Similarity Tasks (2016), M. Faruqui](https://arxiv.org/pdf/1605.02276.pdf) \n",
"[Improving Reliability of Word Similarity Evaluation by Redesigning Annotation Task and Performance Measure (2016), Oded Avraham, Yoav Goldberg](https://arxiv.org/pdf/1611.03641.pdf) \n",
"[Evaluating Word Embeddings Using a Representative Suite of Practical Tasks (2016), N. Nayak](https://cs.stanford.edu/~angeli/papers/2016-acl-veceval.pdf) \n",
"\n",
"\n",
"## To watch\n",
"[Word Vector Representations: word2vec, Lecture 2, cs224n](https://www.youtube.com/watch?v=ERibwqs9p38)"
]
}
]
}
Binary file added Week 03/Images/CBOW.png
Binary file added Week 03/Images/Circuit.png
Binary file added Week 03/Images/Negative_Sampling.png
Binary file added Week 03/Images/SkipGram.png
Binary file added Week 03/Images/StructuredWord2vec.png
Binary file added Week 03/Images/Word2vecExample.jpeg
