Model checkpoints are available at https://huggingface.co/CLS/WubiBERT_models/tree/main. That repo contains only the model checkpoints; the config and tokenizer files are in this repo and should be loaded locally.
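For example, one way to mirror the checkpoint repo locally is with `huggingface_hub` (a minimal sketch; where you download to, and which `.pt` file you later pass via `--init_checkpoint`, depend on your local layout):

```python
# Sketch: download the checkpoint repo; config/tokenizer files still come from this repo
# (configs/ and tokenizers/), not from the Hugging Face repo.
from huggingface_hub import snapshot_download

ckpt_dir = snapshot_download(repo_id="CLS/WubiBERT_models")
print(ckpt_dir)  # point --init_checkpoint at a checkpoint file under this directory
```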
Note that we split off a fraction of the original CLUE training set to use as the dev set; we select checkpoints based on results on that dev set and evaluate on the original CLUE dev set as the test set.
You can use split_data.py to do the dev-set splitting, but keep the random seed fixed so that everyone reproduces the same split and the same results.
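As an illustration only (this is not the actual split_data.py implementation; the fraction and seed below are placeholders), a seeded split looks like this:

```python
# Hypothetical seeded train/dev split; a fixed seed makes the partition reproducible.
import random

def split_train_dev(examples, dev_fraction=0.1, seed=42):
    rng = random.Random(seed)             # same seed -> same shuffle -> same split
    indices = list(range(len(examples)))
    rng.shuffle(indices)
    n_dev = int(len(examples) * dev_fraction)
    dev = [examples[i] for i in indices[:n_dev]]
    train = [examples[i] for i in indices[n_dev:]]
    return train, dev
```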
You can run one of the following Python scripts to fine-tune, depending on the task. Note that different tasks/scripts may require different arguments.

- run_glue.py: classification tasks such as TNews, IFlytek, OCNLI, etc.
- run_multichoice_mrc.py: CHID
- run_ner.py: CLUENER
- run_{cmrc,drcd,c3}.py: CMRC, DRCD, or C3
Also note that different tokenization methods require different values for the arguments tokenizer_type, vocab_file, and vocab_model_file.

Values of tokenizer_type:
| Tokenization method | Value of tokenizer_type |
|---|---|
| Char | BertZh |
| Pinyin | CommonZh |
| Pinyin-NoIndex | CommonZhNoIndex |
| Byte | Byte |
| RandomIndex | RandomIndex |
| PinyinConcatWubi | PinyinConcatWubi |
| Pinyin-Shuffle | Shuffled |
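If you script your runs, the table above can be captured as a lookup (the dictionary keys are arbitrary labels, not values consumed by the scripts):

```python
# Mapping from tokenization method to the --tokenizer_type value expected by the scripts.
TOKENIZER_TYPES = {
    "Char": "BertZh",
    "Pinyin": "CommonZh",
    "Pinyin-NoIndex": "CommonZhNoIndex",
    "Byte": "Byte",
    "RandomIndex": "RandomIndex",
    "PinyinConcatWubi": "PinyinConcatWubi",
    "Pinyin-Shuffle": "Shuffled",
}
```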
For example, to fine-tune on TNews using the pinyin tokenizer:
python3 run_glue.py \
--task_name=tnews \
--train_dir=datasets/tnews/split \
--dev_dir=datasets/tnews/split \
--test_dir=datasets/tnews/split \
--do_train --do_eval --do_test \
--init_checkpoint=checkpoints/checkpoints_pinyin_zh_22675/ckpt_8804.pt \
--output_dir=logs/pinyin_tnews \
--tokenizer_type=CommonZh \
--vocab_file=tokenizers/pinyin_zh_22675.vocab \
--vocab_model_file=tokenizers/pinyin_zh_22675.model \
--config_file=configs/bert_config_vocab22675.json \
--epochs=6
Another example: fine-tuning on CMRC using the wubi tokenizer:
python3 run_cmrc.py \
--data_dir=datasets/cmrc/split \
--init_checkpoint=checkpoints/checkpoints_wubi_zh_22675/ckpt_8804.pt \
--config_file=configs/bert_config_vocab22675.json \
--tokenizer_type=CommonZh \
--vocab_file=tokenizers/wubi_zh_22675.vocab \
--vocab_model_file=tokenizers/wubi_zh_22675.model \
--output_dir=logs/cmrc/wubi_twolevel/ckpt_8804 \
--do_train --do_test \
--two_level_embeddings \
--epochs=6
To run testing only, omit --do_train and --do_eval and pass only --do_test to the scripts above.
Example of testing on TNews using the Pinyin-NoIndex tokenizer:
python3 run_glue.py \
--task_name=tnews \
--data_dir datasets/tnews/split \
--do_test \
--init_checkpoint=checkpoints/checkpoints_pinyin_no_index/ckpt_8804.pt \
--output_dir=logs/pinyin_tnews \
--tokenizer_type=CommonZhNoIndex \
--vocab_file=tokenizers/pinyin_zh_22675.vocab \
--vocab_model_file=tokenizers/pinyin_zh_22675.model \
--config_file=configs/bert_config_vocab22675.json \
--epochs=6