ZEN is a BERT-based Chinese (Z) text encoder Enhanced by N-gram representations, where different combinations of characters are considered during training. Potential word and phrase boundaries are explicitly pre-trained and fine-tuned together with the character encoder (BERT), so ZEN incorporates the information of both the character sequence and the words or phrases it contains. ZEN has been tested on a series of Chinese NLP tasks, where it requires fewer resources than other published encoders and achieves state-of-the-art performance on most tasks.
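To make the n-gram enhancement concrete, the sketch below shows one way candidate n-grams in a character sequence can be matched against an n-gram lexicon before being combined with the character encoder. It is only an illustration of the idea: the toy lexicon and the extract_ngrams helper are assumptions made for this example, not ZEN's actual pre-processing code.

# Illustrative sketch of n-gram matching (not ZEN's actual implementation).
# The character sequence is scanned for substrings that appear in an n-gram
# lexicon; matched n-grams get their own representations, which are combined
# with the character encoder (BERT) during training.

NGRAM_LEXICON = {"自然", "语言", "自然语言", "处理"}  # assumed toy lexicon
MAX_NGRAM_LEN = 5

def extract_ngrams(chars, lexicon, max_len=MAX_NGRAM_LEN):
    """Return (start, end, ngram) spans whose text appears in the lexicon."""
    matches = []
    for start in range(len(chars)):
        for end in range(start + 2, min(start + max_len, len(chars)) + 1):
            ngram = "".join(chars[start:end])
            if ngram in lexicon:
                matches.append((start, end, ngram))
    return matches

sentence = list("自然语言处理")
print(extract_ngrams(sentence, NGRAM_LEXICON))
# [(0, 2, '自然'), (0, 4, '自然语言'), (2, 4, '语言'), (4, 6, '处理')]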
The library comprises several example scripts for conducting Chinese NLP tasks:
run_pre_train.py: an example of pre-training ZEN
run_sequence_level_classification.py: an example of fine-tuning ZEN on DC, SA, SPM and NLI tasks (sequence-level classification)
run_token_level_classification.py: an example of fine-tuning ZEN on CWS, POS and NER tasks (token-level classification)
(Abbreviations for the tasks are explained in the Task and data section)
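The practical difference between the two fine-tuning scripts is where the classifier head sits on top of the encoder output. The minimal PyTorch sketch below uses assumed tensor shapes and label counts, not ZEN's actual classes, to show the contrast:

# Illustrative sketch of the two fine-tuning setups (assumed shapes, not ZEN's code).
import torch
import torch.nn as nn

batch, seq_len, hidden = 2, 16, 768
encoder_output = torch.randn(batch, seq_len, hidden)   # output of the character encoder

# Sequence-level classification (DC, SA, SPM, NLI): one label per sequence,
# predicted from the representation of the first ([CLS]) position.
seq_classifier = nn.Linear(hidden, 10)                  # e.g. 10 news categories
seq_logits = seq_classifier(encoder_output[:, 0])       # shape: (batch, num_classes)

# Token-level classification (CWS, POS, NER): one label per character,
# predicted from every position in the sequence.
tok_classifier = nn.Linear(hidden, 4)                   # e.g. B/M/E/S segmentation tags
tok_logits = tok_classifier(encoder_output)             # shape: (batch, seq_len, num_labels)

print(seq_logits.shape, tok_logits.shape)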
Three quick usage examples for these scripts:
python run_pre_train.py \
--pregenerated_data /path/to/pregenerated_data \
--bert_model /path/to/bert_model \
--do_lower_case \
--output_dir /path/to/output_dir \
--epochs 20 \
--train_batch_size 128 \
--reduce_memory \
--fp16 \
--scratch \
--save_name ZEN_pretrain_base_
python run_sequence_level_classification.py \
--task_name TASKNAME \
--do_train \
--do_eval \
--do_lower_case \
--data_dir /path/to/dataset \
--bert_model /path/to/zen_model \
--max_seq_length 512 \
--train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 30.0
where TASKNAME corresponds to one of the DC, SA, SPM and NLI tasks; the dataset-specific scripts below show the exact task names (thucnews, ChnSentiCorp, lcqmc, xnli).
Script for fine-tuning on THUCNews (DC):
python run_sequence_level_classification.py \
--task_name thucnews \
--do_train \
--do_eval \
--do_lower_case \
--data_dir /path/to/dataset/thucnews \
--bert_model /path/to/zen_model \
--max_seq_length 512 \
--train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 30.0
Script for fine-tuning on ChnSentiCorp (SA):
python run_sequence_level_classification.py \
--task_name ChnSentiCorp \
--do_train \
--do_eval \
--do_lower_case \
--data_dir /path/to/dataset/ChnSentiCorp \
--bert_model /path/to/zen_model \
--max_seq_length 512 \
--train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 30.0
Script for fine-tuning on LCQMC (SPM):
python run_sequence_level_classification.py \
--task_name lcqmc \
--do_train \
--do_eval \
--do_lower_case \
--data_dir /path/to/dataset/lcqmc \
--bert_model /path/to/zen_model \
--max_seq_length 128 \
--train_batch_size 128 \
--learning_rate 5e-5 \
--num_train_epochs 30.0
Script for fine-tuning on XNLI (NLI):
python run_sequence_level_classification.py \
--task_name xnli \
--do_train \
--do_eval \
--do_lower_case \
--data_dir /path/to/dataset/xnli \
--bert_model /path/to/zen_model \
--max_seq_length 128 \
--train_batch_size 128 \
--learning_rate 5e-5 \
--num_train_epochs 30.0
python run_token_level_classification.py \
--task_name TASKNAME \
--do_train \
--do_eval \
--do_lower_case \
--data_dir /path/to/dataset \
--bert_model /path/to/zen_model \
--max_seq_length 128 \
--train_batch_size 128 \
--num_train_epochs 30 \
--warmup_proportion 0.1
where TASKNAME corresponds to one of the CWS, POS and NER tasks; the dataset-specific scripts below show the exact task names (cwsmsra, pos, msra).
Script for fine-tuning on MSR (CWS):
python run_token_level_classification.py \
--task_name cwsmsra \
--do_train \
--do_eval \
--do_lower_case \
--data_dir /path/to/dataset \
--bert_model /path/to/zen_model \
--max_seq_length 256 \
--train_batch_size 96 \
--num_train_epochs 30 \
--warmup_proportion 0.1
Script for fine-tuning on CTB5 (POS):
python run_token_level_classification.py \
--task_name pos \
--do_train \
--do_eval \
--do_lower_case \
--data_dir /path/to/dataset \
--bert_model /path/to/zen_model \
--max_seq_length 256 \
--train_batch_size 96 \
--num_train_epochs 30 \
--warmup_proportion 0.1
Script for fine-tuning on MSRA (NER):
python run_token_level_classification.py \
--task_name msra \
--do_train \
--do_eval \
--do_lower_case \
--data_dir /path/to/dataset \
--bert_model /path/to/zen_model \
--max_seq_length 128 \
--train_batch_size 128 \
--num_train_epochs 30 \
--warmup_proportion 0.1
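As an illustration of what token-level labels look like for CWS, the sketch below converts a word-segmented sentence into per-character B/M/E/S tags. The exact tag set and input file format expected by run_token_level_classification.py may differ, so treat this only as an illustration of the task.

# Illustrative conversion of a word-segmented sentence into per-character
# B/M/E/S tags, the kind of token-level labels a CWS model predicts.
# (The exact tag set and file format used by the fine-tuning script may differ.)

def words_to_bmes(words):
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append((word, "S"))
        else:
            tags.append((word[0], "B"))
            tags.extend((ch, "M") for ch in word[1:-1])
            tags.append((word[-1], "E"))
    return tags

segmented = ["自然语言", "处理", "很", "有趣"]
for char, tag in words_to_bmes(segmented):
    print(char, tag)
# 自 B / 然 M / 语 M / 言 E / 处 B / 理 E / 很 S / 有 B / 趣 E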
Task and data:

CWS: MSR dataset from the SIGHAN2005 Chinese word segmentation Bakeoff.
POS: CTB5 (Xue et al., 2005) dataset with standard splits.
NER: MSRA dataset from the international Chinese language processing Bakeoff 2006.
DC: THUCNews (News) dataset (Sun et al., 2016) from Sina news with 10 evenly distributed classes.
SA: The ChnSentiCorp (CSC) dataset with 12,000 documents from three domains, i.e., book, computer and hotel.
SPM: The LCQMC (a large-scale Chinese question matching corpus) proposed by Liu et al. (2018), where each instance is a pair of sentences with a label indicating whether their intent is matched (a hypothetical instance is sketched after this list).
NLI: The Chinese part of the XNLI dataset (Conneau et al., 2018).
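For concreteness, a sentence-pair matching (SPM) instance in the style of LCQMC pairs two questions with a binary label. The pair below is a hypothetical example, not taken from the corpus.

# Hypothetical SPM instance in the style of LCQMC (not from the actual corpus).
example = {
    "text_a": "怎么才能学好自然语言处理？",   # question 1
    "text_b": "自然语言处理应该怎么入门？",   # question 2
    "label": 1,  # 1 = the two questions share the same intent, 0 = they do not
}
print(example["text_a"], example["text_b"], example["label"])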