How to train a subword language model with sentencepiece and kenlm

Preliminary

./install_kenlm-bin.sh

Train a subword tokenizer with sentencepiece

# Input must be JSON Lines: one JSON document per line
python train_sp.py \
--input_jsonl_filepath ${JSONL_FILEPATH:-"kowiki.json"} \
--model_prefix ${MODEL_PREFIX:-"ko.sp"} \
--vocab_size 40000 \
--max_sentence_length 10000 \
--split_by_whitespace
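
Since the trainer expects one JSON document per line, it can help to check the input before a long training run. The following is a minimal sketch of such a check, not the logic of train_sp.py itself. The field name `text` and the helper name `iter_jsonl_texts` are assumptions; this README does not say which key holds the document body.

```python
import json
from typing import Iterable, Iterator


def iter_jsonl_texts(lines: Iterable[str], text_key: str = "text") -> Iterator[str]:
    """Yield the text field of each JSON Lines record, failing fast on bad input.

    `text_key` is an assumed field name; adjust it to match your dump.
    """
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            raise ValueError(f"line {line_no} is not valid JSON: {err}") from err
        if not isinstance(record, dict) or text_key not in record:
            raise ValueError(f"line {line_no} has no {text_key!r} field")
        yield record[text_key]
```

Passing an open file handle (for example `open("kowiki.json", encoding="utf-8")`) as `lines` streams the file without loading it into memory.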

Train an n-gram subword language model with kenlm

Pretokenize

python pretokenize.py \
--input_jsonl_filepath ${JSONL_FILEPATH:-"kowiki.json"} \
--input_model_filepath ${MODEL_FILEPATH:-"ko.sp.model"} \
--output_text_filepath ${TEXT_FILEPATH:-"kowiki.txt"}
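
The pretokenize step produces the plain-text corpus that lmplz reads: one line per document, with subword pieces separated by spaces. The sketch below illustrates that idea under stated assumptions and is not the contents of pretokenize.py. The helper names `pretokenize_lines` and `load_sentencepiece_encoder` are hypothetical, and collapsing embedded newlines is an assumed choice.

```python
from typing import Callable, Iterable, Iterator, List


def pretokenize_lines(
    texts: Iterable[str], encode: Callable[[str], List[str]]
) -> Iterator[str]:
    """Turn each text into one line of space-separated subword pieces."""
    for text in texts:
        # Collapse embedded newlines and repeated spaces so that each
        # document stays on a single line of the output corpus.
        pieces = encode(" ".join(text.split()))
        if pieces:  # skip documents that produce no pieces
            yield " ".join(pieces)


def load_sentencepiece_encoder(model_path: str) -> Callable[[str], List[str]]:
    """Build an encoder from a trained model; requires the sentencepiece package."""
    import sentencepiece as spm  # lazy import keeps the helper above dependency-free

    processor = spm.SentencePieceProcessor(model_file=model_path)
    return lambda text: processor.encode(text, out_type=str)
```

With sentencepiece installed, `pretokenize_lines(texts, load_sentencepiece_encoder("ko.sp.model"))` would yield the lines to write to `kowiki.txt`.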

Train kenlm

tmp/kenlm/build/bin/lmplz -o 5 <${TEXT_FILEPATH:-"kowiki.txt"} >${MODEL_PREFIX:-"ko"}.arpa && \
tmp/kenlm/build/bin/build_binary ${MODEL_PREFIX:-"ko"}.arpa ${MODEL_PREFIX:-"ko"}.arpa.bin
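
Once the binary model is built, it can be queried from Python. This is a hedged sketch that assumes the `kenlm` Python module is installed separately; the install script above is only known to build the command-line binaries. The helper names `perplexity` and `score_line` are hypothetical. Input must be pretokenized with the same sentencepiece model used for training, because the language model only knows subword pieces.

```python
def perplexity(log10_prob: float, num_tokens: int) -> float:
    """Convert a total log10 probability into per-token perplexity.

    One extra token is counted for the end-of-sentence marker.
    """
    return 10.0 ** (-log10_prob / (num_tokens + 1))


def score_line(model_path: str, pretokenized_line: str) -> float:
    """Return the perplexity of one pretokenized line under a kenlm model."""
    import kenlm  # lazy import; assumed to be installed separately

    model = kenlm.Model(model_path)
    # score() returns the total log10 probability of the line,
    # including begin- and end-of-sentence markers.
    log10_prob = model.score(pretokenized_line, bos=True, eos=True)
    return perplexity(log10_prob, len(pretokenized_line.split()))
```

For example, `score_line("ko.arpa.bin", line)` would score one line of `kowiki.txt`; lower perplexity means the line looks more like the training corpus.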
