Glint-Word2Vec Evaluation

Scripts for comparing the default Spark ML implementation of Word2Vec or the Gensim implementation with a custom implementation of Word2Vec using the Glint parameter server with specialized operations for efficient distributed Word2Vec computation.

The project report sums up experiments performed.

Usage

Evaluation sets

The models are evaluated on the SimLex 999 and wordsim 353 evaluation sets. The original english sets are provided as well as german sets from https://github.com/iraleviant/eval-multilingual-simlex.

Further word analogies to evaluate and visualize via PCA will need to be specified as csv file. An example is example_country_capitals_de.csv which consists of german country-capital relations. The country-capital relations from the original Word2Vec paper are also provided.

Data sets

As dataset to train Word2Vec on you can download a wikipedia dump and then extract it to a text file with get_texts.py. For testing, a subset of articles from a wikipedia dump can be created with get_analogy_texts.py. Another possibility is getting specific wikipedia articles with get_articles_texts.py. The articles to get will need to be specified as txt file. An example is example_country_capitals_de_articles.txt. Another option is using the script demo-download-big-model-v1.sh as used in the original word2vec open source implementation.

Training and evaluating

For the training and evaluation you will need the glint-word2vec-assembly-1.0.jar and the python binding file ml_glintword2vec.py as zip from https://github.com/MGabr/glint-word2vec. These will have to be specified as --jars and --py-files options of spark-submit.

train_word2vec.py can then be used to train a Glint, Gensim or standard Spark ML model and evaluate_word2vec.py to evaluate and visualize word analogies using a model.

In most cases you will want to remove stop words before training a model to avoid exploding gradients. This can be done with the --stop-word-lang and --stop-word-file options.

Example

An example for evaluating Glint Word2Vec with default settings (150 partitions, 5 parameter servers) on a german wikipedia dump on the original Word2Vec country-capitals analogies is the following.

python3 get_texts.py dewiki-latest-pages-articles.xml.bz2 dewiki-latest-pages-articles.txt
spark-submit --num-executors 5 --executor-cores 30 --jars glint-word2vec-assembly-1.0.jar --py-files ml_glintword2vec.zip train_word2vec.py dewiki-latest-pages-articles.txt dewiki-latest-pages-articles.model glint --stop-word-lang de --stop-word-file stopwords/dewiki.txt
spark-submit --num-executors 5 --executor-cores 1 --jars glint-word2vec-assembly-1.0.jar --py-files ml_glintword2vec.zip evaluate_word2vec.py evaluation/country_capitals_de.csv de dewiki-latest-pages-articles.model glint country_capitals_de.png

To evaluate Glint Word2Vec with parameter servers running in a separate Spark application you have to start them beforehand with a command like

spark-submit --num-executors 5 --executor-cores 20 --class glint.Main glint-word2vec-assembly-1.0.jar spark

Afterwards --parameter-server-host followed by the host of the parameter server master (visible in the output of the Spark application) has to be added as argument to train_word2vec.py. The separate parameter servers Spark application will also be terminated after training is finished.

Name		Name	Last commit message	Last commit date
Latest commit History 43 Commits
evaluation		evaluation
extras		extras
stopwords		stopwords
README.md		README.md
demo-download-big-model-v1.sh		demo-download-big-model-v1.sh
evaluate_word2vec.py		evaluate_word2vec.py
example_country_capitals_de_articles.txt		example_country_capitals_de_articles.txt
get_analogy_texts.py		get_analogy_texts.py
get_article_texts.py		get_article_texts.py
get_texts.py		get_texts.py
project_report.pdf		project_report.pdf
requirements.txt		requirements.txt
train_word2vec.py		train_word2vec.py
word2phrase.c		word2phrase.c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Glint-Word2Vec Evaluation

Usage

Evaluation sets

Data sets

Training and evaluating

Example

About

Releases

Packages

Languages

MGabr/evaluate-glint-word2vec

Folders and files

Latest commit

History

Repository files navigation

Glint-Word2Vec Evaluation

Usage

Evaluation sets

Data sets

Training and evaluating

Example

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages