Refactor sequence parallel (#823)
tastelikefeet committed May 10, 2024
1 parent 400c0a3 commit 1fc148a
Showing 28 changed files with 324 additions and 41 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -39,6 +39,7 @@ To facilitate use by users unfamiliar with deep learning, we provide a Gradio we
Additionally, we are expanding capabilities for other modalities. Currently, we support full-parameter training and LoRA training for AnimateDiff.

## 🎉 News
- 2024.05.10: Support splitting a sequence across multiple GPUs to reduce memory usage. Install the feature with `pip install .[seq_parallel]`, then add `--sequence_parallel_size n` to your DDP script to begin!
- 2024.05.08: Support DeepSeek-V2-Chat model; you can refer to [this script](https://github.com/modelscope/swift/blob/main/examples/pytorch/llm/scripts/deepseek-v2-chat/lora_ddp_ds3/sft.sh). Support InternVL-Chat-V1.5-Int8 model; for best practice, you can refer to [here](https://github.com/modelscope/swift/tree/main/docs/source_en/Multi-Modal/internvl-best-practice.md).
- 🔥2024.05.07: Supports **ORPO** training! See the [document](https://github.com/modelscope/swift/blob/main/docs/source_en/LLM/ORPO.md) to start training!
- 2024.05.07: Supports Llava-Llama3 model from xtuner, model_type is `llava-llama-3-8b-v1_1`.
1 change: 1 addition & 0 deletions README_CN.md
@@ -40,6 +40,7 @@ SWIFT supports training, inference, ... of nearly **200 LLMs and MLLMs** (multimodal large models)
Additionally, we are expanding capabilities for other modalities. Currently, we support full-parameter training and LoRA training for AnimateDiff.

## 🎉 News
- 2024.05.10: Support sequence parallelism. First install with `pip install .[seq_parallel]`, then add `--sequence_parallel_size n` in your DDP environment to use it!
- 2024.05.08: Support the DeepSeek-V2-Chat model; for training, refer to [this script](https://github.com/modelscope/swift/blob/main/examples/pytorch/llm/scripts/deepseek-v2-chat/lora_ddp_ds3/sft.sh). Support the InternVL-Chat-V1.5-Int8 model; for best practice, refer to [here](https://github.com/modelscope/swift/tree/main/docs/source/Multi-Modal/internvl最佳实践.md).
- 🔥2024.05.07: Support **ORPO** training; use `swift orpo` to get started. Best practice can be found [here](https://github.com/modelscope/swift/tree/main/docs/source/LLM/ORPO算法最佳实践.md).
- 2024.05.07: Support the Llava-Llama3 model from xtuner, model_type is `llava-llama-3-8b-v1_1`.
2 changes: 1 addition & 1 deletion docs/source/GetStarted/使用tuners.md
@@ -133,7 +133,7 @@ def encode(example):
example, kwargs = template.encode({'query': q, 'response': output})
return example

dataset = dataset.to_hf_dataset().map(encode).filter(lambda e: e.get('input_ids'))
dataset = dataset.map(encode).filter(lambda e: e.get('input_ids'))
dataset = dataset.train_test_split(test_size=0.001)

train_dataset, val_dataset = dataset['train'], dataset['test']
48 changes: 48 additions & 0 deletions docs/source/LLM/Benchmark.md
@@ -11,6 +11,7 @@
- [Export](#Export)
- [AWQ](#AWQ)
- [AQLM](#AQLM)
- [Sequence Parallel](#Sequence-Parallel)

## Parameter Settings
Experimental environment:
@@ -757,3 +758,50 @@ swift sft \
| exp_name | model_type | dataset | ms-bench mix ratio | tuner | tuner_params | trainable params(M) | flash_attn | gradient_checkpointing | hypers | memory | train speed(samples/s) | infer speed(tokens/s) | train_loss | eval_loss | gsm8k weighted acc | arc weighted acc | ceval weighted acc |
| -------- | ---------- | ------- | -------------------| ----- | ------------ | ------------------- | -----------| ---------------------- | ------ | ------ | ---------------------- | --------------------- | ---------- | --------- | ------------------ | ---------------- | ------------------ |
|llama2-7b-aqlm-2bit-1x16|llama2-7b-aqlm-2bit-1x16|dureader-robust-zh|0.0|lora|rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False|19.9885(1.6510%)|True|True|lr=5e-05/epoch=2|4.04GiB|0.17(14994 samples/86140.71 seconds)||**0.48**|**0.74**||||


## Sequence Parallel

<table>

<tr>
<td>Model</td>
<td>Dataset</td>
<td>Hyper params</td>
<td>Total steps</td>
<td>Train speed</td>
<td>GPU memory</td>
</tr>

<tr>
<td rowspan="4">chatglm3-6b-32k</td>
<td rowspan="4">long-alpaca-12k(8055 tokens * 12000 rows)</td>
<td>gpu=2/sequence_parallel_size=1 (2-GPU DDP baseline)</td>
<td>5940</td>
<td>0.30iter/s(5h13min total)</td>
<td>27G*2</td>
</tr>


<tr>
<td>gpu=2/sequence_parallel_size=2 (2 GPUs with sequence parallel 2)</td>
<td>11880</td>
<td>0.5iter/s(6h total)</td>
<td>20G*2</td>
</tr>

<tr>
<td>gpu=4/sequence_parallel_size=4 (4 GPUs with sequence parallel 4)</td>
<td>11880</td>
<td>1iter/s(3h20min total)</td>
<td>18G*4</td>
</tr>

<tr>
<td>gpu=4/sequence_parallel_size=2 (4 GPUs with sequence parallel 2)</td>
<td>5940</td>
<td>0.45iter/s(3h total)</td>
<td>21G*4</td>
</tr>

</table>
4 changes: 4 additions & 0 deletions docs/source/LLM/命令行参数.md
@@ -121,6 +121,10 @@
- `--fsdp`: Default `''`, the FSDP type; for details see this parameter's [original documentation](https://huggingface.co/docs/transformers/v4.39.3/en/main_classes/trainer#transformers.TrainingArguments.fsdp).
- `--fsdp_config`: Default `None`, path to the FSDP config file.

### Sequence Parallel Parameters

- `--sequence_parallel_size`: Default `1`. When greater than 1, a sequence can be split across multiple GPUs to save GPU memory; the value must evenly divide the number of DDP processes.

### LoRA+ Fine-tuning Parameters

- `--lora_lr_ratio`: Default `None`, recommended value `10~16`; specify this parameter when using LoRA to enable LoRA+.
2 changes: 1 addition & 1 deletion docs/source_en/GetStarted/Tuners.md
@@ -133,7 +133,7 @@ def encode(example):
example, kwargs = template.encode({'query': q, 'response': output})
return example

dataset = dataset.to_hf_dataset().map(encode).filter(lambda e: e.get('input_ids'))
dataset = dataset.map(encode).filter(lambda e: e.get('input_ids'))
dataset = dataset.train_test_split(test_size=0.001)

train_dataset, val_dataset = dataset['train'], dataset['test']
47 changes: 47 additions & 0 deletions docs/source_en/LLM/Benchmark.md
@@ -11,6 +11,7 @@
- [Export](#Export)
- [AWQ](#AWQ)
- [AQLM](#AQLM)
- [Sequence Parallel](#Sequence-Parallel)

## Parameter Settings
Experimental environment:
@@ -756,3 +757,49 @@ swift sft \
| exp_name | model_type | dataset | ms-bench mix ratio | tuner | tuner_params | trainable params(M) | flash_attn | gradient_checkpointing | hypers | memory | train speed(samples/s) | infer speed(tokens/s) | train_loss | eval_loss | gsm8k weighted acc | arc weighted acc | ceval weighted acc |
| -------- | ---------- | ------- | -------------------| ----- | ------------ | ------------------- | -----------| ---------------------- | ------ | ------ | ---------------------- | --------------------- | ---------- | --------- | ------------------ | ---------------- | ------------------ |
|llama2-7b-aqlm-2bit-1x16|llama2-7b-aqlm-2bit-1x16|dureader-robust-zh|0.0|lora|rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False|19.9885(1.6510%)|True|True|lr=5e-05/epoch=2|4.04GiB|0.17(14994 samples/86140.71 seconds)||**0.48**|**0.74**||||

## Sequence Parallel

<table>

<tr>
<td>Model</td>
<td>Dataset</td>
<td>Hyper params</td>
<td>Total steps</td>
<td>Train speed</td>
<td>GPU memory</td>
</tr>

<tr>
<td rowspan="4">chatglm3-6b-32k</td>
<td rowspan="4">long-alpaca-12k(8055 tokens * 12000 rows)</td>
<td>gpu=2/sequence_parallel_size=1(2 GPU DDP baseline)</td>
<td>5940</td>
<td>0.30iter/s(5h13min total)</td>
<td>27G*2</td>
</tr>


<tr>
<td>gpu=2/sequence_parallel_size=2(2 GPU with sequence parallel 2)</td>
<td>11880</td>
<td>0.5iter/s(6h total)</td>
<td>20G*2</td>
</tr>

<tr>
<td>gpu=4/sequence_parallel_size=4(4 GPU with sequence parallel 4)</td>
<td>11880</td>
<td>1iter/s(3h20min total)</td>
<td>18G*4</td>
</tr>

<tr>
<td>gpu=4/sequence_parallel_size=2(4 GPU with sequence parallel 2)</td>
<td>5940</td>
<td>0.45iter/s(3h total)</td>
<td>21G*4</td>
</tr>

</table>

Note: with a fixed number of GPUs, raising `sequence_parallel_size` reduces the data-parallel degree by the same factor, which is why the total step count doubles from 5940 to 11880 while per-GPU memory drops.
4 changes: 4 additions & 0 deletions docs/source_en/LLM/Command-line-parameters.md
@@ -122,6 +122,10 @@

- `--fsdp_config`: Default value `None`, the FSDP config file path.

### Sequence Parallel Parameters

- `--sequence_parallel_size`: Default value `1`. A value greater than `1` splits each sequence across multiple GPUs to reduce memory usage; the value must evenly divide the number of GPUs (the DDP world size). See the sketch below.
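
A minimal Python sketch of wiring the new flag through swift's Python API. The model, dataset, and hyperparameter values are illustrative (borrowed from the Benchmark doc above), not a recommended recipe:

```python
# Launch under DDP, e.g. `torchrun --nproc_per_node 4 train.py`;
# the number of processes must be a multiple of sequence_parallel_size.
from swift.llm import SftArguments, sft_main

if __name__ == '__main__':
    sft_main(
        SftArguments(
            model_type='chatglm3-6b-32k',   # model/dataset mirror the benchmark table
            dataset=['long-alpaca-12k'],
            sft_type='lora',
            max_length=8192,                # illustrative long-context setting
            sequence_parallel_size=2,       # each sequence is split across 2 GPUs
        ))
```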

### LoRA+ Fine-tuning Parameters

- `--lora_lr_ratio`: Default `None`, recommended value `10~16`; specify this parameter when using LoRA to enable LoRA+.
1 change: 1 addition & 0 deletions requirements/seq_parallel.txt
@@ -0,0 +1 @@
xtuner
3 changes: 3 additions & 0 deletions setup.py
@@ -122,10 +122,13 @@ def gen_packages_items():
extra_requires['llm'], _ = parse_requirements('requirements/llm.txt')
extra_requires['aigc'], _ = parse_requirements('requirements/aigc.txt')
extra_requires['eval'], _ = parse_requirements('requirements/eval.txt')
extra_requires['seq_parallel'], _ = parse_requirements('requirements/seq_parallel.txt')
all_requires.extend(install_requires)
all_requires.extend(extra_requires['llm'])
all_requires.extend(extra_requires['aigc'])
all_requires.extend(extra_requires['eval'])
all_requires.extend(extra_requires['seq_parallel'])
extra_requires['seq_parallel'].extend(extra_requires['llm'])
extra_requires['all'] = all_requires

setup(
6 changes: 2 additions & 4 deletions swift/aigc/diffusers/train_controlnet.py
@@ -554,10 +554,8 @@ def make_train_dataset(args, tokenizer, accelerator):
args.dataset_name,
args.dataset_config_name,
)
if isinstance(dataset, dict):
dataset = {key: value.to_hf_dataset() for key, value in dataset.items()}
else:
dataset = {'train': dataset.to_hf_dataset()}
if not isinstance(dataset, dict):
dataset = {'train': dataset}
else:
if args.train_data_dir is not None:
dataset = load_dataset(
6 changes: 2 additions & 4 deletions swift/aigc/diffusers/train_controlnet_sdxl.py
@@ -570,10 +570,8 @@ def get_train_dataset(args, accelerator):
args.dataset_name,
args.dataset_config_name,
)
if isinstance(dataset, dict):
dataset = {key: value.to_hf_dataset() for key, value in dataset.items()}
else:
dataset = {'train': dataset.to_hf_dataset()}
if not isinstance(dataset, dict):
dataset = {'train': dataset}
else:
if args.train_data_dir is not None:
dataset = load_dataset(
6 changes: 2 additions & 4 deletions swift/aigc/diffusers/train_text_to_image.py
@@ -650,10 +650,8 @@ def path_to_img(example):
args.dataset_config_name,
data_dir=args.train_data_dir,
)
if isinstance(dataset, dict):
dataset = {key: value.to_hf_dataset() for key, value in dataset.items()}
else:
dataset = {'train': dataset.to_hf_dataset()}
if not isinstance(dataset, dict):
dataset = {'train': dataset}
else:
data_files = {}
if args.train_data_dir is not None:
6 changes: 2 additions & 4 deletions swift/aigc/diffusers/train_text_to_image_lora.py
@@ -525,10 +525,8 @@ def path_to_img(example):
args.dataset_config_name,
data_dir=args.train_data_dir,
)
if isinstance(dataset, dict):
dataset = {key: value.to_hf_dataset() for key, value in dataset.items()}
else:
dataset = {'train': dataset.to_hf_dataset()}
if not isinstance(dataset, dict):
dataset = {'train': dataset}
else:
data_files = {}
if args.train_data_dir is not None:
6 changes: 2 additions & 4 deletions swift/aigc/diffusers/train_text_to_image_lora_sdxl.py
@@ -687,10 +687,8 @@ def path_to_img(example):
args.dataset_config_name,
data_dir=args.train_data_dir,
)
if isinstance(dataset, dict):
dataset = {key: value.to_hf_dataset() for key, value in dataset.items()}
else:
dataset = {'train': dataset.to_hf_dataset()}
if not isinstance(dataset, dict):
dataset = {'train': dataset}
else:
data_files = {}
if args.train_data_dir is not None:
6 changes: 2 additions & 4 deletions swift/aigc/diffusers/train_text_to_image_sdxl.py
@@ -745,10 +745,8 @@ def path_to_img(example):
args.dataset_config_name,
data_dir=args.train_data_dir,
)
if isinstance(dataset, dict):
dataset = {key: value.to_hf_dataset() for key, value in dataset.items()}
else:
dataset = {'train': dataset.to_hf_dataset()}
if not isinstance(dataset, dict):
dataset = {'train': dataset}
else:
data_files = {}
if args.train_data_dir is not None:
5 changes: 5 additions & 0 deletions swift/llm/sft.py
@@ -6,6 +6,8 @@
import json
import numpy as np
import torch
import torch.distributed as dist
from datasets import Dataset
from modelscope import BitsAndBytesConfig, GenerationConfig
from transformers import IntervalStrategy
from transformers.integrations import is_deepspeed_zero3_enabled
@@ -143,6 +145,8 @@ def llm_sft(args: SftArguments) -> Dict[str, Union[str, Any]]:
if use_model:
template_kwargs['model'] = model
template_kwargs['use_loss_scale'] = args.use_loss_scale
if args.sequence_parallel_size and args.sequence_parallel_size > 1:
template_kwargs['sequence_parallel_size'] = args.sequence_parallel_size
template: Template = get_template(args.template_type, tokenizer, args.system, args.max_length,
args.truncation_strategy, **template_kwargs)
args.system = template.default_system
@@ -225,6 +229,7 @@ def llm_sft(args: SftArguments) -> Dict[str, Union[str, Any]]:
eval_dataset=val_dataset,
tokenizer=tokenizer,
callbacks=callbacks,
sequence_parallel_size=args.sequence_parallel_size,
**trainer_kwargs)
trainer.sft_args = args
if use_torchacc():
3 changes: 3 additions & 0 deletions swift/llm/tuner.py
@@ -187,6 +187,9 @@ def prepare_model(model, args: SftArguments):
else:
raise ValueError(f'args.sft_type: {args.sft_type}')

if args.sequence_parallel_size > 1:
from swift.trainers.xtuner import dispatch_module_xtuner
dispatch_module_xtuner(model)
if args.neftune_backend == 'swift' and args.neftune_noise_alpha not in {None, 0.}:
neftune_config = NEFTuneConfig(noise_alpha=args.neftune_noise_alpha)
model = Swift.prepare_model(model, {'neftune': neftune_config})
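
For orientation, a rough sketch of what "dispatching" the model in the tuner.py change above amounts to: swapping attention forwards for sequence-parallel-aware ones. This is not the code of `dispatch_module_xtuner`; the helper below and its `sp_forward` callback are hypothetical.

```python
import types

import torch.nn as nn


def dispatch_sequence_parallel(model: nn.Module, sp_forward) -> None:
    """Replace the forward of attention-like modules with `sp_forward`, which is
    assumed to exchange the needed activations across the sequence-parallel
    process group before computing attention (illustrative sketch only)."""
    for module in model.modules():
        if 'attention' in type(module).__name__.lower():
            module.forward = types.MethodType(sp_forward, module)
```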
3 changes: 3 additions & 0 deletions swift/llm/utils/argument.py
@@ -572,6 +572,8 @@ class SftArguments(ArgumentsBase):
# fsdp config file
fsdp_config: Optional[str] = None

sequence_parallel_size: int = 1

# compatibility hf
per_device_train_batch_size: Optional[int] = None
per_device_eval_batch_size: Optional[int] = None
@@ -586,6 +588,7 @@ class SftArguments(ArgumentsBase):
neftune_alpha: Optional[float] = None
deepspeed_config_path: Optional[str] = None
model_cache_dir: Optional[str] = None

custom_train_dataset_path: List[str] = field(default_factory=list)
custom_val_dataset_path: List[str] = field(default_factory=list)

7 changes: 4 additions & 3 deletions swift/llm/utils/dataset.py
@@ -504,10 +504,11 @@ def map_row(row):
response = row['response']
if response and response.startswith('Answer:'):
response = response[len('Answer:') + 1:].strip()
return {'query': row['query'], 'response': response}
row['response'] = response
return row

return dataset.rename_columns({'instruction': 'query', 'output': 'response'}) \
.remove_columns(['input', 'file']).map(map_row).filter(lambda row: row['response'] is not None)
dataset = AlpacaPreprocessor()(dataset)
return dataset.map(map_row)


register_dataset(
11 changes: 11 additions & 0 deletions swift/llm/utils/model.py
@@ -327,6 +327,7 @@ class ModelType:
phi2_3b = 'phi2-3b'
phi3_4b_4k_instruct = 'phi3-4b-4k-instruct'
phi3_4b_128k_instruct = 'phi3-4b-128k-instruct'
phi3_mini_128k_instruct = 'phi3-mini-128k-instruct'
# cogagent
cogvlm_17b_instruct = 'cogvlm-17b-instruct'
cogagent_18b_chat = 'cogagent-18b-chat'
@@ -1297,6 +1298,16 @@ def cross_entropy_forward(self, inputs: Tensor, target: Tensor) -> Tensor:
support_vllm=False, # https://github.com/vllm-project/vllm/pull/4298
tags=['general'],
hf_model_id='microsoft/Phi-3-mini-128k-instruct')
@register_model(
ModelType.phi3_mini_128k_instruct,
'LLM-Research/Phi-3-mini-128k-instruct',
LoRATM.phi3,
TemplateType.phi3,
requires=['transformers>=4.36'],
support_flash_attn=True,
support_vllm=False,
tags=['general'],
hf_model_id='microsoft/Phi-3-mini-128k-instruct')
@register_model(
ModelType.phi3_4b_4k_instruct,
'LLM-Research/Phi-3-mini-4k-instruct',
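
A hedged inference sketch for the newly registered `phi3-mini-128k-instruct` model_type, following the pattern used in swift's README examples; treat the argument values as assumptions and note that generation settings are omitted.

```python
from swift.llm import (ModelType, get_default_template_type, get_model_tokenizer,
                       get_template, inference)

model_type = ModelType.phi3_mini_128k_instruct
template_type = get_default_template_type(model_type)  # expected to resolve to the phi3 template
model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': 'auto'})
template = get_template(template_type, tokenizer)
response, history = inference(model, template, 'Hello, who are you?')
print(response)
```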
13 changes: 13 additions & 0 deletions swift/llm/utils/template.py
@@ -204,6 +204,8 @@ def _init_template(self,
self.truncation_strategy = truncation_strategy
self.model = kwargs.get('model', None)
self.use_loss_scale = kwargs.get('use_loss_scale', False)
self.sequence_parallel_size = kwargs.get('sequence_parallel_size', 1)

for key in ['prefix', 'prompt', 'chat_sep', 'suffix', 'prefix_has_system']:
value = getattr(self, key)
value = self._preprocess_prompt(tokenizer, value)
@@ -422,6 +424,17 @@ def data_collator(self, batch: List[Dict[str, Any]], padding_to: Optional[int] =
labels, loss_scale, self.max_length,
self.tokenizer, rank, world_size)

bs, seq_len = input_ids.shape
position_ids = torch.arange(seq_len).unsqueeze(0).long().repeat(bs, 1)

if self.sequence_parallel_size > 1:
from swift.trainers.xtuner import get_xtuner_sequence_parallel_world_size
if get_xtuner_sequence_parallel_world_size() > 1:
from swift.trainers.xtuner import pad_and_split_for_sequence_parallel
input_ids, labels, position_ids, attention_mask, loss_scale = \
pad_and_split_for_sequence_parallel(
tokenizer, input_ids, labels, position_ids, attention_mask, loss_scale)

res = {
'input_ids': input_ids,
'attention_mask': attention_mask,
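
A minimal sketch of the pad-and-split step the collator performs when sequence parallel is active. It is not the actual `pad_and_split_for_sequence_parallel` from `swift.trainers.xtuner` (which also handles attention_mask and loss_scale); the function name, signature, and padding values here are assumptions for illustration.

```python
import torch


def pad_and_split(input_ids, labels, position_ids, pad_token_id, sp_world_size, sp_rank):
    """Pad the sequence dimension to a multiple of the sequence-parallel world
    size, then keep only this rank's contiguous slice of every tensor."""
    bs, seq_len = input_ids.shape
    pad_len = (-seq_len) % sp_world_size
    if pad_len:
        input_ids = torch.cat([input_ids, input_ids.new_full((bs, pad_len), pad_token_id)], dim=1)
        labels = torch.cat([labels, labels.new_full((bs, pad_len), -100)], dim=1)  # -100: ignored by the loss
        extra_pos = torch.arange(seq_len, seq_len + pad_len).repeat(bs, 1)
        position_ids = torch.cat([position_ids, extra_pos], dim=1)
    chunk = input_ids.shape[1] // sp_world_size
    sl = slice(sp_rank * chunk, (sp_rank + 1) * chunk)
    return input_ids[:, sl], labels[:, sl], position_ids[:, sl]
```

Each rank then runs the model on its own slice, with the patched attention communicating across the sequence-parallel group.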
4 changes: 4 additions & 0 deletions swift/llm/utils/utils.py
@@ -825,6 +825,10 @@ def is_vllm_available():
return importlib.util.find_spec('vllm') is not None


def is_xtuner_available():
return importlib.util.find_spec('xtuner') is not None


def get_time_info(log_history: List[Dict[str, Any]], n_train_samples: Optional[int]) -> Optional[Dict[str, Any]]:
time_info = None
try:
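
A small hedged example of how the new helper can gate optional xtuner imports, in the same spirit as `is_vllm_available()`; the call site below is illustrative, not taken from this commit.

```python
from swift.llm.utils.utils import is_xtuner_available

if is_xtuner_available():
    # swift.trainers.xtuner is the module this commit imports from elsewhere
    from swift.trainers.xtuner import dispatch_module_xtuner
else:
    dispatch_module_xtuner = None  # sequence parallel needs `pip install .[seq_parallel]`
```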
