santhoshtr/wikisentences

A program to create a sentence dataset from Wikipedia dumps.


Usage

Clone the repo, create a virtual environment and install dependencies.

git clone https://github.com/santhoshtr/wikisentences.git
cd wikisentences
python -m venv .venv
source .venv/bin/activate
pip install -e .

Install fasttext:

sudo apt install fasttext

Then run:

make

Note that downloading the Wikipedia dumps and processing all languages will take about two days and use a lot of disk space, so use the fastest machine you have. At the end, the data directory will contain one langcode.sentences.txt file per language.
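As a quick sanity check on the output, here is a minimal sketch that counts and samples sentences from one of those files. It assumes each langcode.sentences.txt holds one sentence per line (inferred from the note above, not confirmed by the repo):

import random
import sys

def inspect(path, k=3):
    # Read non-empty lines; each line is assumed to be one sentence.
    with open(path, encoding="utf-8") as f:
        sentences = [line.strip() for line in f if line.strip()]
    print(f"{path}: {len(sentences)} sentences")
    # Print up to k randomly chosen sentences.
    for s in random.sample(sentences, min(k, len(sentences))):
        print(" ", s)

if __name__ == "__main__":
    inspect(sys.argv[1])  # e.g. python inspect.py data/en.sentences.txt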

Once the sentences are prepared, run make ld.model.bin to create a fastText model for language identification. This is also a long-running process.
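To try out the trained model, here is a minimal sketch using the fastText Python bindings (pip install fasttext; the bindings are separate from the apt package, and the __label__xx label format is the fastText default, assumed here):

import fasttext

# Load the language-identification model produced by make ld.model.bin.
model = fasttext.load_model("ld.model.bin")

# Predict the top 3 candidate languages for a sentence.
labels, probs = model.predict("This is a sentence in English.", k=3)
for label, prob in zip(labels, probs):
    print(label.replace("__label__", ""), round(prob, 3))

The apt-installed CLI should work as well, reading sentences from stdin, e.g. echo "Bonjour tout le monde" | fasttext predict ld.model.bin -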
