We release code to explore training large models with billions of parameters. Taking advantage of the parallel nature of MindSpore, we adopt efficient model-parallel and data-parallel technologies such as operator-level parallelism, which can be scaled to thousands of NPUs and hundreds of billions of parameters with only minor modifications.
- Hardware (Ascend)
  - Prepare a hardware environment with an Ascend processor.
- Framework
  - MindSpore
- For more information, please check the resources below:
  - You can use the Qizhi (OpenI) platform to obtain free computing resources.
As the format of downstream tasks can vary, the script process_prompt2mind.py provides a basic example of how to process your fine-tuning text files (.json). Please prepare your data in the following format, where each line of each file is one JSON record:
{"input": "please describe your university life.", "target": "My university life is rich and colorful. In addition to academic courses, there are also rich club activities."}
{"input": xxx, "target": xxx}
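For reference, a minimal Python sketch that writes records in this JSON-lines layout (the output path `data/train_0.json` is just an illustrative placeholder, not a name used by the scripts):

```python
import json

# Illustrative fine-tuning pairs; replace with your own data.
samples = [
    {"input": "please describe your university life.",
     "target": "My university life is rich and colorful. In addition to "
               "academic courses, there are also rich club activities."},
]

# Write one JSON object per line, matching the format expected above.
# "data/train_0.json" is a placeholder file name.
with open("data/train_0.json", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```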
Suppose the text data is under ./data and each text file ends with .json. We can run the following command to generate mindrecord files with seq_length=1025:
python -m process_prompt2mind --input_glob 'data/*.json' --tokenizer thu --data_column_name input_ids --seq_length 1025
The script splits each line into chunks of 1025 tokens. Chunks with fewer than 1025 tokens are ignored.
The output files are placed under ./output. The default tokenizer is the one from transformers. Note that the vocab_size is determined by the vocab file.
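Conceptually, the chunking behaves like the sketch below (a simplified illustration, not the actual process_prompt2mind.py implementation; `tokenizer` stands for whichever tokenizer you selected):

```python
def chunk_line(line, tokenizer, seq_length=1025):
    """Tokenize one line of text and split it into fixed-length chunks.

    Chunks shorter than `seq_length` are dropped, mirroring the behaviour
    described above.
    """
    ids = tokenizer.encode(line)
    chunks = []
    for start in range(0, len(ids), seq_length):
        chunk = ids[start:start + seq_length]
        if len(chunk) == seq_length:  # ignore incomplete trailing chunks
            chunks.append(chunk)
    return chunks
```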
- tokenizer: The tokenizer used to tokenize the text. It can be `thu` (requires `transformers`) or `pangu4w`. Note that the `thu` tokenizer requires `transformers` together with `pytorch` or `tensorflow`, while the `pangu4w` tokenizer requires additional files including `vocab.model`. Click here to download them.
- data_column_name: The name of the feature column in the mindrecord files.
- seq_length: Default 1025. The preprocessing generates mindrecord examples with sequence length 1025.
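To sanity-check the generated files, you can read them back with MindSpore's dataset API. A minimal sketch, assuming the default output directory and column name from above (the `*.mindrecord` glob pattern is an assumption about the output file names):

```python
import glob
import mindspore.dataset as ds

# Collect the generated files; the extension pattern is an assumption.
files = glob.glob("output/*.mindrecord")

# The column name matches the --data_column_name argument above.
dataset = ds.MindDataset(dataset_files=files, columns_list=["input_ids"])

for item in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
    print(item["input_ids"].shape)  # expected: (1025,)
    break
```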
Before we start Incremental Training, the following two steps must be done:
- Process the dataset using the released vocab; please refer to [Incremental Training in Dataset Generation](#Incremental Training).
- Download the `checkpoint` and `strategy` files according to [Download Checkpoint](#Download Checkpoint). Each host should own the complete checkpoint files.
Then run the following command to start incremental training with the `2.6B` configuration:
export FILE_PATH=/home/your_path/ckpts
bash scripts/run_distribute_incremental_train.sh DATASET RANK_TABLE 8 fp32 2.6B 8 ${FILE_PATH}/strategy_load_ckpt/strategy.ckpt ${FILE_PATH}/checkpoint_file filtered
The above command involves the following arguments:
- DATASET: The path to the parent directory of the mindrecord files. For example: `/home/work/mindrecord/`.
- RANK_TABLE: The details of the rank table can be found here. It is a JSON file describing the `device id`, `service ip` and `rank`.
- RANK_SIZE: The total number of devices. For example, 8, 16, 32 ...
- TYPE: The parameter initialization type. The parameters will be initialized with float32; you can replace it with `fp16` to save a little device memory.
- MODE: The configuration mode. `2.6B` sets the `hidden size` and `layers` so that the number of parameters is close to 2.6 billion. The other modes are `13B` (`hidden size` 5120 and `layers` 40, which needs at least 16 cards to train) and `200B`; see the sketch after this list for a rough parameter-count estimate.
- STAGE_NUM: The number of pipeline stages. When `stage_num` is larger than 1, pipeline parallel mode is applied; this value indicates the number of subgraphs in pipeline parallel mode.
- MICRO_SIZE: The number of micro batches in pipeline parallel mode. It should be larger than `stage_num`.
- PER_BATCH: The batch size for each data-parallel way. Default 8.
- RANK_START: The starting rank_id on the current machine; it helps set the rank_id for each machine in multi-machine scenarios.
- LOCAL_DEVICE_NUM: The number of devices on the local machine.
- EXPERT_NUM_PER_EP: The number of experts in one data-parallel dimension.
- ENABLE_ALLTOALL: Enable all-to-all communication. Default 0.
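For intuition on how MODE relates to model size, here is a rough back-of-the-envelope estimate (a simplified approximation that only counts the attention and feed-forward weights and ignores embeddings and biases; the 13B figures are the ones listed above):

```python
def approx_transformer_params(hidden_size, num_layers):
    """Rough GPT-style parameter count: about 12 * layers * hidden^2
    (attention plus feed-forward), ignoring embeddings and biases."""
    return 12 * num_layers * hidden_size ** 2

# 13B mode: hidden size 5120 and 40 layers, as listed above.
print(approx_transformer_params(5120, 40) / 1e9)  # ~12.6 billion, close to 13B
```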
Please refer to the website to download the following parts:
- tokenizer: vocab.model
- checkpoint files: *.part[0-4] (need to be extracted) and the *.npy files under the same parameter size
- strategy file: a file describing how the parameters are sliced across different devices
Here we suppose the downloaded checkpoint, tokenizer and strategy files are organized as follows:
CodePanGu2.6B: https://pan.baidu.com/s/1K5WpQMkngowEcQGoxUhoUA?pwd=pap9
Note: In the following sections, we will refer to the path /home/your_path/ckpts as `ckpts`.
ckpts
├── checkpoint_file
│ ├── filtered_*.ckpt
│ ├── word_embedding.npy
│ ├── top_query_embedding.npy
│ └── position_embedding.npy
├── strategy_load_ckpt
│ └── strategy.ckpt
└── tokenizer
└── vocab.model
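Before launching training or evaluation, it can help to check that this layout is in place. A small sketch that only verifies the paths shown in the tree above (the FILE_PATH environment variable is the one exported in the commands of this document):

```python
import os

# FILE_PATH is the variable exported before running the scripts.
ckpts = os.environ.get("FILE_PATH", "/home/your_path/ckpts")

# Paths expected by the scripts, following the tree above.
expected = [
    "strategy_load_ckpt/strategy.ckpt",
    "tokenizer/vocab.model",
    "checkpoint_file",
]

for rel in expected:
    path = os.path.join(ckpts, rel)
    print(("OK      " if os.path.exists(path) else "MISSING ") + path)
```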
This script provides evaluation for the following tasks:
Click the links of the above tasks and download the data. Taking C3 as an example, unzip the dataset to `/home/my_path/data/c3`. Its structure should be as follows:
c3
├── data
│ ├── c3-d-dev.json
│ ├── c3-d-test.json
│ ├── c3-d-train.json
│ ├── c3-m-dev.json
│ ├── c3-m-test.json
│ └── c3-m-train.json
├── license.txt
└── README.md
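To confirm the download, you can load one of the JSON files and count the examples. A minimal sketch (the path follows the example above; nothing is assumed about the per-example structure beyond the file being valid JSON):

```python
import json

# Path assumed from the example layout above.
path = "/home/my_path/data/c3/data/c3-m-dev.json"

with open(path, encoding="utf-8") as f:
    data = json.load(f)

# Print how many top-level entries were loaded.
print(type(data).__name__, len(data))
```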
Please follow the instructions in the section Prediction to download the checkpoint.
Most of the arguments are the same as in the section Prediction in Standalone mode,
except the last two arguments `TASK` and `TASK_PATH`. Currently, we support only the `c3`
task. The following commands will launch the program to start evaluation with the 2.6B model.
export FILE_PATH=/home/your_path/ckpts
export DEVICE_TARGET=Ascend # or GPU
export TASK=c3
export TASK_PATH=/home/your_c3_data_path/data # You should point to the data directory under the c3 path
bash scripts/run_standalone_eval.sh ${FILE_PATH}/strategy_load_ckpt/strategy.ckpt \
${FILE_PATH}/tokenizer/ ${FILE_PATH}/checkpoint_file filtered 2.6B $DEVICE_TARGET $TASK $TASK_PATH
For the 2.6B model, it takes about 13 minutes to get the results. The log can be found under device0/log0.log.
It should look like this:
Metric for dataset c3 is {'top1_acc': 0.5430}
Naturally, you should also cite the PanGu-OpenI repo and the MindSpore repo.