Word embeddings


Description

A collection of visualizations for exploring word embeddings in Word2Vec models. The code is dirty, the project is not well set up, it may break, your computer may burn, but when it works, it looks pretty.

How to build it

Requirements

  • Python3
  • Poetry

Building

Clone the repository and install it with

poetry install --no-dev

or, if you need the development dependencies (for example, to train models),

poetry install

How to run it

Directly

This project uses Streamlit. To run it, type

poetry run streamlit run main.py

The models have to be in a subfolder called models beside main.py, so the app should run out of the box with the example models when you clone this repository.

With Docker

Build it with

docker build -t word-embeddings:latest .

then run it with

docker run -it -v `pwd`/models:/code/models word-embeddings:latest

where the mounted volume contains the Word2Vec models.

Supported model formats

Note: You can add a description for a model by creating a text file with the same name as the model but with the extension .txt (for example, a description for my_model.bin would go in my_model.txt).

word2vec binary

The models must have the extension .bin and be placed in the models directory.

Gensim model

The models must have the extension .model and be placed in the models directory.

Gensim KeyedVectors

The models must have the extension .kv and be placed in the models directory.
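
For reference, here is a minimal sketch of how files in these three formats could be produced with gensim (assuming gensim >= 4.x; the name my_model and the toy corpus are placeholders, not part of this project):

import os
from gensim.models import Word2Vec

# Toy corpus for illustration only; use your own tokenized sentences.
sentences = [["sydney", "news", "headline"], ["new", "york", "city"]]
model = Word2Vec(sentences, vector_size=50, min_count=1)

os.makedirs("models", exist_ok=True)

# word2vec binary format -> models/my_model.bin
model.wv.save_word2vec_format("models/my_model.bin", binary=True)

# Full gensim model -> models/my_model.model
model.save("models/my_model.model")

# Gensim KeyedVectors -> models/my_model.kv
model.wv.save("models/my_model.kv")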

Included models

  • word2vec_abc_news: A model trained on the https://www.kaggle.com/therohk/million-headlines dataset, which contains over a million news headlines published by the Australian Broadcasting Corporation (ABC) over a period of eighteen years.

  • word2vec_bigram_abc_news: A model trained on the same https://www.kaggle.com/therohk/million-headlines dataset as word2vec_abc_news, but with bigrams included (try searching for new_york).
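
Outside of the Streamlit app, the included models can also be explored directly with gensim. The snippet below is only a sketch; the exact file name and format of the bundled bigram model are assumptions, so check the models directory for the actual file:

from gensim.models import KeyedVectors

# Hypothetical path; adjust to the actual file in the models directory.
kv = KeyedVectors.load_word2vec_format("models/word2vec_bigram_abc_news.bin", binary=True)

# Bigrams are stored as single tokens joined with an underscore.
print(kv.most_similar("new_york", topn=5))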

Thanks to
