How to train a subword language model with sentencepiece and kenlm
Preliminary
```shell
./install_kenlm-bin.sh
```
Train subword tokenizer by sentencepiece
```shell
# input must be a JSONL file with one document per line
python train_sp.py \
  --input_jsonl_filepath ${JSONL_FILEPATH:-"kowiki.json"} \
  --model_prefix ${MODEL_PREFIX:-"ko.sp"} \
  --vocab_size 40000 \
  --max_sentence_length 10000 \
  --split_by_whitespace
```
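The command above requires one JSON document per line. A minimal sketch of that preprocessing step, assuming each record stores its text under a hypothetical `text` key (the field name train_sp.py actually expects may differ):

```python
import json

def jsonl_to_corpus(jsonl_path, text_key="text"):
    """Read a JSONL file (one JSON document per line) and return the
    raw text lines, which can then be fed to the sentencepiece trainer.

    `text_key` is an assumed field name; adjust it to your JSONL schema.
    """
    lines = []
    with open(jsonl_path, encoding="utf-8") as f:
        for raw in f:
            raw = raw.strip()
            if not raw:
                continue  # skip blank lines between records
            doc = json.loads(raw)
            text = doc.get(text_key, "").strip()
            if text:
                lines.append(text)
    return lines
```

Writing these lines to a plain-text file gives an input that sentencepiece can also consume directly via its `input` argument, if you prefer not to pass JSONL to the training script.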