santhoshtr/wikisentences

A program to create a sentence dataset from Wikipedia dumps.


Usage

Clone the repo, create a virtual environment and install dependencies.

git clone https://github.com/santhoshtr/wikisentences.git
cd wikisentences
python -m venv .venv
source .venv/bin/activate
pip install -e .

Install fasttext:

sudo apt install fasttext

Then run:

make

Note that downloading the Wikipedia dumps and processing all languages will take about two days and use a lot of disk space, so use the fastest machine you have. At the end, the data directory will contain one langcode.sentences.txt file per language.
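As a quick sanity check on the output, here is a minimal sketch that counts and samples sentences from one of those files. It assumes each langcode.sentences.txt holds one sentence per line (inferred from the note above, not confirmed by the repo):

import random
import sys

def inspect(path, k=3):
    # Read non-empty lines; each line is assumed to be one sentence.
    with open(path, encoding="utf-8") as f:
        sentences = [line.strip() for line in f if line.strip()]
    print(f"{path}: {len(sentences)} sentences")
    # Print up to k randomly chosen sentences.
    for s in random.sample(sentences, min(k, len(sentences))):
        print(" ", s)

if __name__ == "__main__":
    inspect(sys.argv[1])  # e.g. python inspect.py data/en.sentences.txt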

Once the sentences are prepared, run make ld.model.bin to create a fastText model for language identification. This is also a long-running process.
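To try out the trained model, here is a minimal sketch using the fastText Python bindings (pip install fasttext; the bindings are separate from the apt package, and the __label__xx label format is the fastText default, assumed here):

import fasttext

# Load the language-identification model produced by make ld.model.bin.
model = fasttext.load_model("ld.model.bin")

# Predict the top 3 candidate languages for a sentence.
labels, probs = model.predict("This is a sentence in English.", k=3)
for label, prob in zip(labels, probs):
    print(label.replace("__label__", ""), round(prob, 3))

The apt-installed CLI should work as well, reading sentences from stdin, e.g. echo "Bonjour tout le monde" | fasttext predict ld.model.bin -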
