Updated week 02, added images for week 03
DanAnastasyev committed Sep 26, 2018
1 parent 93c3f9e commit 8308048
Showing 7 changed files with 102 additions and 58 deletions.
160 changes: 102 additions & 58 deletions Week 02/Week_02_Word_Embeddings_(Part_1).ipynb
@@ -190,26 +190,14 @@
"metadata": {
"id": "irl7RotC5C_B",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"outputId": "4ecdb8da-76af-40e3-e8b9-dfec7b85225f"
"colab": {}
},
"cell_type": "code",
"source": [
"print([' '.join(row) for row in tokenized_texts[:2]])"
],
"execution_count": 5,
"outputs": [
{
"output_type": "stream",
"text": [
"['what is the step by step guide to invest in share market in india ?', 'what is the story of kohinoor ( koh-i-noor ) diamond ?']\n"
],
"name": "stdout"
}
]
"execution_count": 0,
"outputs": []
},
{
"metadata": {
@@ -468,7 +456,7 @@
"\n",
"assert word_vectors_pca.shape == (len(word_vectors), 2), \"there must be a 2d vector for each word\"\n",
"assert max(abs(word_vectors_pca.mean(0))) < 1e-5, \"points must be zero-centered\"\n",
"assert max(abs(1 - word_vectors_pca.std(0) < 1e-5)), \"points must have unit variance\""
"assert max(abs(1 - word_vectors_pca.std(0))) < 1e-5, \"points must have unit variance\""
],
"execution_count": 0,
"outputs": []
@@ -773,14 +761,41 @@
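Returning to the PCA asserts just above: one possible way to produce a `word_vectors_pca` that satisfies them is sketched below, assuming `word_vectors` is a numpy array of shape `(num_words, embedding_dim)`. Since sklearn's PCA already zero-centers the projection, only the variance needs rescaling.

```python
# A possible sketch, not the reference solution.
from sklearn.decomposition import PCA

# word_vectors: assumed numpy array of shape (num_words, embedding_dim)
word_vectors_pca = PCA(n_components=2).fit_transform(word_vectors)

# PCA output is already zero-centered; rescale each component to unit variance
word_vectors_pca = word_vectors_pca / word_vectors_pca.std(axis=0)
```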
"### Bag-of-words"
]
},
{
"metadata": {
"id": "jGge73gDid99",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"**Task** To start, let's collect the tokenized questions."
]
},
{
"metadata": {
"id": "FJTjqdgiih7O",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"tokenized_question1 = ...\n",
"tokenized_question2 = ..."
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "0VBcipE7Ztkc",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"**Task** Compute the cosine similarity between the questions."
"**Task** Compute the cosine similarity between the questions.\n",
"\n",
"The easiest way is to implement the computation by hand:\n",
"$$\\text{cosine_similarity}(x, y) = \\frac{x^{T} y}{||x||\\cdot ||y||}$$"
]
},
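A hand-rolled version of the formula above could look like this minimal sketch (assuming `x` and `y` are 1-D numpy arrays):

```python
import numpy as np

def cosine_similarity(x, y):
    # x^T y / (||x|| * ||y||)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
```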
{
@@ -878,6 +893,8 @@
"\n",
"**Task** Compute the weighted question vectors. Use `TfidfVectorizer`\n",
"\n",
"`TfidfVectorizer` returns a matrix of shape `(samples_count, words_count)`, while our embeddings have shape `(words_count, embedding_dim)`. This means you can simply multiply them. Each phrase - a sequence of words $w_1, \ldots, w_k$ - then turns into the vector $\sum_i \text{idf}(w_i) \cdot \text{embedding}(w_i)$. It is probably worth normalizing this vector by the number of words $k$.\n",
"\n",
"**Task** Besides tf-idf, you can also filter out stop words and punctuation. \n",
"Stop words can be taken from `nltk`:\n",
"```python\n",
@@ -918,11 +935,7 @@
"metadata": {
"id": "k52NtVJKf1Wl",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 53
},
"outputId": "92c2c105-9d7d-445a-eeae-e8b36002f4de"
"colab": {}
},
"cell_type": "code",
"source": [
"\n",
"print('\\n' + ' '.join(tokenized_question1[0]))"
],
"execution_count": 12,
"outputs": [
{
"output_type": "stream",
"text": [
"invest guide market india step share ? \n",
"what is the step by step guide to invest in share market in india ?\n"
],
"name": "stdout"
}
]
"execution_count": 0,
"outputs": []
},
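The tf-idf-weighted averaging described above could be sketched roughly as follows; the names `texts` and `embeddings` are illustrative assumptions (not the notebook's actual variables), and the rows of `embeddings` must be aligned with the vectorizer's vocabulary:

```python
# Rough sketch of tf-idf-weighted question vectors.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(texts)      # sparse, (samples_count, words_count)

# Each row becomes a tf-idf-weighted sum of word embeddings
question_vectors = tfidf.dot(embeddings)     # (samples_count, embedding_dim)

# Optionally normalize by the number of distinct words in each question
lengths = np.maximum(tfidf.getnnz(axis=1), 1).reshape(-1, 1)
question_vectors = question_vectors / lengths
```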
{
"metadata": {
@@ -1008,11 +1012,7 @@
"metadata": {
"id": "1CXcr-ypzGXg",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 161
},
"outputId": "8abf9e1c-6928-48a2-c403-4cafda2945ae"
"colab": {}
},
"cell_type": "code",
"source": [
@@ -1044,23 +1044,8 @@
"!unzip cc.ru.300.vec.zip\n",
"!unzip cc.uk.300.vec.zip"
],
"execution_count": 20,
"outputs": [
{
"output_type": "stream",
"text": [
"\n",
"Redirecting output to ‘wget-log.1’.\n",
"\n",
"Redirecting output to ‘wget-log.2’.\n",
"Archive: cc.ru.300.vec.zip\n",
" inflating: cc.ru.300.vec \n",
"Archive: cc.uk.300.vec.zip\n",
" inflating: cc.uk.300.vec \n"
],
"name": "stdout"
}
]
"execution_count": 0,
"outputs": []
},
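One common way to load the downloaded `.vec` files is gensim's `KeyedVectors` (an assumption for illustration - the collapsed cells may do it differently); limiting the vocabulary keeps loading time and memory reasonable:

```python
# Sketch: load the fastText vectors with gensim (assumes gensim is installed).
from gensim.models import KeyedVectors

ru_emb = KeyedVectors.load_word2vec_format("cc.ru.300.vec", limit=100000)
uk_emb = KeyedVectors.load_word2vec_format("cc.uk.300.vec", limit=100000)
```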
{
"metadata": {
"source": [
"Let's write a simple implementation of a machine translation model.\n",
"\n",
"We will translate from Ukrainian into Russian.\n",
"The idea is based on the paper [Word Translation Without Parallel Data](https://arxiv.org/pdf/1710.04087.pdf). The authors' repository contains a lot more interesting material: [https://github.com/facebookresearch/MUSE](https://github.com/facebookresearch/MUSE).\n",
"\n",
"And we will translate from Ukrainian into Russian.\n",
"\n",
"![](https://raw.githubusercontent.com/yandexdataschool/nlp_course/master/resources/blue_cat_blue_whale.png) \n",
"*синій кіт* vs. *синій кит* (blue cat vs. blue whale)"
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "7rGx4TXWFJ65",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"Let's look at the pair серпень-август (which are translations of each other)"
]
},
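With gensim-style keyed vectors (the names `uk_emb`/`ru_emb` are carried over from the loading sketch above and are assumptions), the check could look like this; before any mapping is learned, the similarity between the raw vectors is usually low because the two spaces were trained independently:

```python
import numpy as np

august_ru = ru_emb["август"]    # Russian "August"
serpen_uk = uk_emb["серпень"]   # Ukrainian "August"

# Cosine similarity between the raw, unaligned vectors
sim = np.dot(august_ru, serpen_uk) / (np.linalg.norm(august_ru) * np.linalg.norm(serpen_uk))
print(sim)
```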
{
"metadata": {
"id": "FkHer36xyh4n",
@@ -1176,7 +1173,7 @@
"\n",
"This function looks very much like linear regression (without a bias term).\n",
"\n",
"**Task** Implement it:"
"**Task** Implement it - use `LinearRegression` from sklearn with `fit_intercept=False`:"
]
},
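A sketch of that suggestion, assuming `X_train` holds Ukrainian word vectors and `Y_train` the vectors of their Russian translations, row-aligned (hypothetical names, not the notebook's):

```python
# Learn a linear map W minimizing ||X_train @ W - Y_train||^2 with no intercept.
from sklearn.linear_model import LinearRegression

mapping = LinearRegression(fit_intercept=False)
mapping.fit(X_train, Y_train)

# Map a Ukrainian vector into the Russian embedding space
mapped_vector = mapping.predict(uk_emb["серпень"].reshape(1, -1))[0]
```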
{
},
"cell_type": "markdown",
"source": [
"Let's check where `серпень` (in Russian, `август`) maps to:"
"Let's check where `серпень` maps to:"
]
},
{
@@ -1390,6 +1387,16 @@
"### Writing a translator"
]
},
{
"metadata": {
"id": "hwi70fP6FaAN",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"Let's implement a simple word-by-word translator - for each word, we will look for its nearest neighbor in the shared embedding space. If a word is not in the embeddings, we simply copy it over."
]
},
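A minimal sketch of such a translator, again assuming gensim-style `uk_emb`/`ru_emb` and the fitted `mapping` from the sketch above (not the reference solution):

```python
def translate(sentence):
    """Translate a Ukrainian sentence into Russian word by word."""
    translated = []
    for word in sentence.split():
        if word in uk_emb:
            # Map the Ukrainian vector into the Russian space
            mapped = mapping.predict(uk_emb[word].reshape(1, -1))[0]
            # Take its nearest neighbor among the Russian vectors
            nearest_word, _ = ru_emb.most_similar(positive=[mapped], topn=1)[0]
            translated.append(nearest_word)
        else:
            # Out-of-vocabulary words are copied unchanged
            translated.append(word)
    return " ".join(translated)

print(translate("синій кіт"))
```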
{
"metadata": {
"id": "0etAHUks4JOr",
@@ -1469,6 +1476,43 @@
"[Submission form](https://goo.gl/forms/GGjrH7axdGJr6yTp2) \n",
"[Survey](https://goo.gl/forms/3QRwLTmLgBzl5VVm2)"
]
},
{
"metadata": {
"id": "_5GrChTeFqIg",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"# Additional materials"
]
},
{
"metadata": {
"id": "HwffxpbmFwDh",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"## To read\n",
"### Basics: \n",
"[On word embeddings - Part 1, Sebastian Ruder](http:https://ruder.io/word-embeddings-1/) \n",
"[Deep Learning, NLP, and Representations, Christopher Olah](http:https://colah.github.io/posts/2014-07-NLP-RNNs-Representations/) \n",
"\n",
"### How to cluster the senses of ambiguous words: \n",
"[Making Sense of Word Embeddings (2016), Pelevina et al.](http:https://anthology.aclweb.org/W16-1620) \n",
"\n",
"### How to evaluate embeddings\n",
"[Evaluation methods for unsupervised word embeddings (2015), T. Schnabel](http:https://www.aclweb.org/anthology/D15-1036) \n",
"[Intrinsic Evaluation of Word Vectors Fails to Predict Extrinsic Performance (2016), B. Chiu](https://www.aclweb.org/anthology/W/W16/W16-2501.pdf) \n",
"[Problems With Evaluation of Word Embeddings Using Word Similarity Tasks (2016), M. Faruqui](https://arxiv.org/pdf/1605.02276.pdf) \n",
"[Improving Reliability of Word Similarity Evaluation by Redesigning Annotation Task and Performance Measure (2016), Oded Avraham, Yoav Goldberg](https://arxiv.org/pdf/1611.03641.pdf) \n",
"[Evaluating Word Embeddings Using a Representative Suite of Practical Tasks (2016), N. Nayak](https://cs.stanford.edu/~angeli/papers/2016-acl-veceval.pdf) \n",
"\n",
"\n",
"## To watch\n",
"[Word Vector Representations: word2vec, Lecture 2, cs224n](https://www.youtube.com/watch?v=ERibwqs9p38)"
]
}
]
}
Binary file added Week 03/Images/CBOW.png
Binary file added Week 03/Images/Circuit.png
Binary file added Week 03/Images/Negative_Sampling.png
Binary file added Week 03/Images/SkipGram.png
Binary file added Week 03/Images/StructuredWord2vec.png
Binary file added Week 03/Images/Word2vecExample.jpeg
