Refactor sequence parallel (#823)
tastelikefeet committed May 10, 2024
1 parent 400c0a3 commit 1fc148a
Showing 28 changed files with 324 additions and 41 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -39,6 +39,7 @@ To facilitate use by users unfamiliar with deep learning, we provide a Gradio we
Additionally, we are expanding capabilities for other modalities. Currently, we support full-parameter training and LoRA training for AnimateDiff.

## 🎉 News
- 2024.05.10: Support splitting a sequence across multiple GPUs to reduce memory usage. Install the feature with `pip install .[seq_parallel]`, then add `--sequence_parallel_size n` to your DDP script to begin!
- 2024.05.08: Support DeepSeek-V2-Chat model; you can refer to [this script](https://github.com/modelscope/swift/blob/main/examples/pytorch/llm/scripts/deepseek-v2-chat/lora_ddp_ds3/sft.sh). Support InternVL-Chat-V1.5-Int8 model; for best practice, you can refer to [here](https://github.com/modelscope/swift/tree/main/docs/source_en/Multi-Modal/internvl-best-practice.md).
- 🔥2024.05.07: Supports **ORPO** training! See the [document](https://github.com/modelscope/swift/blob/main/docs/source_en/LLM/ORPO.md) to start training!
- 2024.05.07: Supports Llava-Llama3 model from xtuner, model_type is `llava-llama-3-8b-v1_1`.
1 change: 1 addition & 0 deletions README_CN.md
@@ -40,6 +40,7 @@ SWIFT supports training, inference, ... of nearly **200 LLMs and MLLMs** (multimodal large models)
Additionally, we are expanding capabilities for other modalities. Currently, we support full-parameter training and LoRA training for AnimateDiff.

## 🎉 News
- 2024.05.10: Support sequence parallelism. First install with `pip install .[seq_parallel]`, then add `--sequence_parallel_size n` in your DDP environment to use it!
- 2024.05.08: Support the DeepSeek-V2-Chat model; for training, refer to [this script](https://github.com/modelscope/swift/blob/main/examples/pytorch/llm/scripts/deepseek-v2-chat/lora_ddp_ds3/sft.sh). Support the InternVL-Chat-V1.5-Int8 model; for best practice, refer to [here](https://github.com/modelscope/swift/tree/main/docs/source/Multi-Modal/internvl最佳实践.md).
- 🔥2024.05.07: Support **ORPO** training; use `swift orpo` to get started. Best practice can be found [here](https://github.com/modelscope/swift/tree/main/docs/source/LLM/ORPO算法最佳实践.md).
- 2024.05.07: Support the Llava-Llama3 model from xtuner, model_type is `llava-llama-3-8b-v1_1`.
2 changes: 1 addition & 1 deletion docs/source/GetStarted/使用tuners.md
@@ -133,7 +133,7 @@ def encode(example):
example, kwargs = template.encode({'query': q, 'response': output})
return example

dataset = dataset.to_hf_dataset().map(encode).filter(lambda e: e.get('input_ids'))
dataset = dataset.map(encode).filter(lambda e: e.get('input_ids'))
dataset = dataset.train_test_split(test_size=0.001)

train_dataset, val_dataset = dataset['train'], dataset['test']
48 changes: 48 additions & 0 deletions docs/source/LLM/Benchmark.md
@@ -11,6 +11,7 @@
- [Export](#Export)
- [AWQ](#AWQ)
- [AQLM](#AQLM)
- [Sequence Parallel](#Sequence-Parallel)

## Parameter Settings
Experimental environment:
@@ -757,3 +758,50 @@ swift sft \
| exp_name | model_type | dataset | ms-bench mix ratio | tuner | tuner_params | trainable params(M) | flash_attn | gradient_checkpointing | hypers | memory | train speed(samples/s) | infer speed(tokens/s) | train_loss | eval_loss | gsm8k weighted acc | arc weighted acc | ceval weighted acc |
| -------- | ---------- | ------- | -------------------| ----- | ------------ | ------------------- | -----------| ---------------------- | ------ | ------ | ---------------------- | --------------------- | ---------- | --------- | ------------------ | ---------------- | ------------------ |
|llama2-7b-aqlm-2bit-1x16|llama2-7b-aqlm-2bit-1x16|dureader-robust-zh|0.0|lora|rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False|19.9885(1.6510%)|True|True|lr=5e-05/epoch=2|4.04GiB|0.17(14994 samples/86140.71 seconds)||**0.48**|**0.74**||||


## Sequence Parallel

<table>

<tr>
<td>Model</td>
<td>Dataset</td>
<td>Hyper params</td>
<td>Total steps</td>
<td>Train speed</td>
<td>GPU memory</td>
</tr>

<tr>
<td rowspan="4">chatglm3-6b-32k</td>
<td rowspan="4">long-alpaca-12k(8055 tokens * 12000 rows)</td>
<td>gpu=2/sequence_parallel_size=1 (2-GPU DDP baseline)</td>
<td>5940</td>
<td>0.30iter/s(5h13min total)</td>
<td>27G*2</td>
</tr>


<tr>
<td>gpu=2/sequence_parallel_size=2 (2 GPUs with sequence parallel 2)</td>
<td>11880</td>
<td>0.5iter/s(6h total)</td>
<td>20G*2</td>
</tr>

<tr>
<td>gpu=4/sequence_parallel_size=4 (4 GPUs with sequence parallel 4)</td>
<td>11880</td>
<td>1iter/s(3h20min total)</td>
<td>18G*4</td>
</tr>

<tr>
<td>gpu=4/sequence_parallel_size=2 (4 GPUs with sequence parallel 2)</td>
<td>5940</td>
<td>0.45iter/s(3h total)</td>
<td>21G*4</td>
</tr>

</table>
4 changes: 4 additions & 0 deletions docs/source/LLM/命令行参数.md
@@ -121,6 +121,10 @@
- `--fsdp`: Default `''`, the FSDP type; for details see this parameter's [original documentation](https://huggingface.co/docs/transformers/v4.39.3/en/main_classes/trainer#transformers.TrainingArguments.fsdp).
- `--fsdp_config`: Default `None`, path to the FSDP config file.

### Sequence Parallel Parameters

- `--sequence_parallel_size`: Default `1`. When greater than 1, a sequence can be split across multiple GPUs to save GPU memory; the value must evenly divide the number of DDP processes.

### LoRA+ Fine-tuning Parameters

- `--lora_lr_ratio`: Default `None`, recommended value `10~16`; specify this parameter when using LoRA to enable LoRA+.
2 changes: 1 addition & 1 deletion docs/source_en/GetStarted/Tuners.md
@@ -133,7 +133,7 @@ def encode(example):
example, kwargs = template.encode({'query': q, 'response': output})
return example

dataset = dataset.to_hf_dataset().map(encode).filter(lambda e: e.get('input_ids'))
dataset = dataset.map(encode).filter(lambda e: e.get('input_ids'))
dataset = dataset.train_test_split(test_size=0.001)

train_dataset, val_dataset = dataset['train'], dataset['test']
47 changes: 47 additions & 0 deletions docs/source_en/LLM/Benchmark.md
@@ -11,6 +11,7 @@
- [Export](#Export)
- [AWQ](#AWQ)
- [AQLM](#AQLM)
- [Sequence Parallel](#Sequence-Parallel)

## Parameter Settings
Experimental environment:
@@ -756,3 +757,49 @@ swift sft \
| exp_name | model_type | dataset | ms-bench mix ratio | tuner | tuner_params | trainable params(M) | flash_attn | gradient_checkpointing | hypers | memory | train speed(samples/s) | infer speed(tokens/s) | train_loss | eval_loss | gsm8k weighted acc | arc weighted acc | ceval weighted acc |
| -------- | ---------- | ------- | -------------------| ----- | ------------ | ------------------- | -----------| ---------------------- | ------ | ------ | ---------------------- | --------------------- | ---------- | --------- | ------------------ | ---------------- | ------------------ |
|llama2-7b-aqlm-2bit-1x16|llama2-7b-aqlm-2bit-1x16|dureader-robust-zh|0.0|lora|rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False|19.9885(1.6510%)|True|True|lr=5e-05/epoch=2|4.04GiB|0.17(14994 samples/86140.71 seconds)||**0.48**|**0.74**||||

## Sequence Parallel

<table>

<tr>
<td>Model</td>
<td>Dataset</td>
<td>Hyper params</td>
<td>Total steps</td>
<td>Train speed</td>
<td>GPU memory</td>
</tr>

<tr>
<td rowspan="4">chatglm3-6b-32k</td>
<td rowspan="4">long-alpaca-12k(8055 tokens * 12000 rows)</td>
<td>gpu=2/sequence_parallel_size=1(2 GPU DDP baseline)</td>
<td>5940</td>
<td>0.30iter/s(5h13min total)</td>
<td>27G*2</td>
</tr>


<tr>
<td>gpu=2/sequence_parallel_size=2(2 GPU with sequence parallel 2)</td>
<td>11880</td>
<td>0.5iter/s(6h total)</td>
<td>20G*2</td>
</tr>

<tr>
<td>gpu=4/sequence_parallel_size=4(4 GPU with sequence parallel 4)</td>
<td>11880</td>
<td>1iter/s(3h20min total)</td>
<td>18G*4</td>
</tr>

<tr>
<td>gpu=4/sequence_parallel_size=2(4 GPU with sequence parallel 2)</td>
<td>5940</td>
<td>0.45iter/s(3h total)</td>
<td>21G*4</td>
</tr>

</table>

Note: with a fixed number of GPUs, raising `sequence_parallel_size` reduces the data-parallel degree by the same factor, which is why the total step count doubles from 5940 to 11880 while per-GPU memory drops.
4 changes: 4 additions & 0 deletions docs/source_en/LLM/Command-line-parameters.md
@@ -122,6 +122,10 @@

- `--fsdp_config`: Default value `None`, the FSDP config file path.

### Sequence Parallel Parameters

- `--sequence_parallel_size`: Default value `1`. A value greater than `1` splits each sequence across multiple GPUs to reduce memory usage; the value must evenly divide the number of GPUs (the DDP world size). See the sketch below.
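
A minimal Python sketch of wiring the new flag through swift's Python API. The model, dataset, and hyperparameter values are illustrative (borrowed from the Benchmark doc above), not a recommended recipe:

```python
# Launch under DDP, e.g. `torchrun --nproc_per_node 4 train.py`;
# the number of processes must be a multiple of sequence_parallel_size.
from swift.llm import SftArguments, sft_main

if __name__ == '__main__':
    sft_main(
        SftArguments(
            model_type='chatglm3-6b-32k',   # model/dataset mirror the benchmark table
            dataset=['long-alpaca-12k'],
            sft_type='lora',
            max_length=8192,                # illustrative long-context setting
            sequence_parallel_size=2,       # each sequence is split across 2 GPUs
        ))
```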

### LoRA+ Fine-tuning Parameters

- `--lora_lr_ratio`: Default `None`, recommended value `10~16`; specify this parameter when using LoRA to enable LoRA+.
1 change: 1 addition & 0 deletions requirements/seq_parallel.txt
@@ -0,0 +1 @@
xtuner
3 changes: 3 additions & 0 deletions setup.py
@@ -122,10 +122,13 @@ def gen_packages_items():
extra_requires['llm'], _ = parse_requirements('requirements/llm.txt')
extra_requires['aigc'], _ = parse_requirements('requirements/aigc.txt')
extra_requires['eval'], _ = parse_requirements('requirements/eval.txt')
extra_requires['seq_parallel'], _ = parse_requirements('requirements/seq_parallel.txt')
all_requires.extend(install_requires)
all_requires.extend(extra_requires['llm'])
all_requires.extend(extra_requires['aigc'])
all_requires.extend(extra_requires['eval'])
all_requires.extend(extra_requires['seq_parallel'])
extra_requires['seq_parallel'].extend(extra_requires['llm'])
extra_requires['all'] = all_requires

setup(
6 changes: 2 additions & 4 deletions swift/aigc/diffusers/train_controlnet.py
@@ -554,10 +554,8 @@ def make_train_dataset(args, tokenizer, accelerator):
args.dataset_name,
args.dataset_config_name,
)
if isinstance(dataset, dict):
dataset = {key: value.to_hf_dataset() for key, value in dataset.items()}
else:
dataset = {'train': dataset.to_hf_dataset()}
if not isinstance(dataset, dict):
dataset = {'train': dataset}
else:
if args.train_data_dir is not None:
dataset = load_dataset(
6 changes: 2 additions & 4 deletions swift/aigc/diffusers/train_controlnet_sdxl.py
@@ -570,10 +570,8 @@ def get_train_dataset(args, accelerator):
args.dataset_name,
args.dataset_config_name,
)
if isinstance(dataset, dict):
dataset = {key: value.to_hf_dataset() for key, value in dataset.items()}
else:
dataset = {'train': dataset.to_hf_dataset()}
if not isinstance(dataset, dict):
dataset = {'train': dataset}
else:
if args.train_data_dir is not None:
dataset = load_dataset(
6 changes: 2 additions & 4 deletions swift/aigc/diffusers/train_text_to_image.py
@@ -650,10 +650,8 @@ def path_to_img(example):
args.dataset_config_name,
data_dir=args.train_data_dir,
)
if isinstance(dataset, dict):
dataset = {key: value.to_hf_dataset() for key, value in dataset.items()}
else:
dataset = {'train': dataset.to_hf_dataset()}
if not isinstance(dataset, dict):
dataset = {'train': dataset}
else:
data_files = {}
if args.train_data_dir is not None:
6 changes: 2 additions & 4 deletions swift/aigc/diffusers/train_text_to_image_lora.py
@@ -525,10 +525,8 @@ def path_to_img(example):
args.dataset_config_name,
data_dir=args.train_data_dir,
)
if isinstance(dataset, dict):
dataset = {key: value.to_hf_dataset() for key, value in dataset.items()}
else:
dataset = {'train': dataset.to_hf_dataset()}
if not isinstance(dataset, dict):
dataset = {'train': dataset}
else:
data_files = {}
if args.train_data_dir is not None:
6 changes: 2 additions & 4 deletions swift/aigc/diffusers/train_text_to_image_lora_sdxl.py
@@ -687,10 +687,8 @@ def path_to_img(example):
args.dataset_config_name,
data_dir=args.train_data_dir,
)
if isinstance(dataset, dict):
dataset = {key: value.to_hf_dataset() for key, value in dataset.items()}
else:
dataset = {'train': dataset.to_hf_dataset()}
if not isinstance(dataset, dict):
dataset = {'train': dataset}
else:
data_files = {}
if args.train_data_dir is not None:
6 changes: 2 additions & 4 deletions swift/aigc/diffusers/train_text_to_image_sdxl.py
@@ -745,10 +745,8 @@ def path_to_img(example):
args.dataset_config_name,
data_dir=args.train_data_dir,
)
if isinstance(dataset, dict):
dataset = {key: value.to_hf_dataset() for key, value in dataset.items()}
else:
dataset = {'train': dataset.to_hf_dataset()}
if not isinstance(dataset, dict):
dataset = {'train': dataset}
else:
data_files = {}
if args.train_data_dir is not None:
5 changes: 5 additions & 0 deletions swift/llm/sft.py
@@ -6,6 +6,8 @@
import json
import numpy as np
import torch
import torch.distributed as dist
from datasets import Dataset
from modelscope import BitsAndBytesConfig, GenerationConfig
from transformers import IntervalStrategy
from transformers.integrations import is_deepspeed_zero3_enabled
@@ -143,6 +145,8 @@ def llm_sft(args: SftArguments) -> Dict[str, Union[str, Any]]:
if use_model:
template_kwargs['model'] = model
template_kwargs['use_loss_scale'] = args.use_loss_scale
if args.sequence_parallel_size and args.sequence_parallel_size > 1:
template_kwargs['sequence_parallel_size'] = args.sequence_parallel_size
template: Template = get_template(args.template_type, tokenizer, args.system, args.max_length,
args.truncation_strategy, **template_kwargs)
args.system = template.default_system
@@ -225,6 +229,7 @@ def llm_sft(args: SftArguments) -> Dict[str, Union[str, Any]]:
eval_dataset=val_dataset,
tokenizer=tokenizer,
callbacks=callbacks,
sequence_parallel_size=args.sequence_parallel_size,
**trainer_kwargs)
trainer.sft_args = args
if use_torchacc():
3 changes: 3 additions & 0 deletions swift/llm/tuner.py
@@ -187,6 +187,9 @@ def prepare_model(model, args: SftArguments):
else:
raise ValueError(f'args.sft_type: {args.sft_type}')

if args.sequence_parallel_size > 1:
from swift.trainers.xtuner import dispatch_module_xtuner
dispatch_module_xtuner(model)
if args.neftune_backend == 'swift' and args.neftune_noise_alpha not in {None, 0.}:
neftune_config = NEFTuneConfig(noise_alpha=args.neftune_noise_alpha)
model = Swift.prepare_model(model, {'neftune': neftune_config})
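
For orientation, a rough sketch of what "dispatching" the model in the tuner.py change above amounts to: swapping attention forwards for sequence-parallel-aware ones. This is not the code of `dispatch_module_xtuner`; the helper below and its `sp_forward` callback are hypothetical.

```python
import types

import torch.nn as nn


def dispatch_sequence_parallel(model: nn.Module, sp_forward) -> None:
    """Replace the forward of attention-like modules with `sp_forward`, which is
    assumed to exchange the needed activations across the sequence-parallel
    process group before computing attention (illustrative sketch only)."""
    for module in model.modules():
        if 'attention' in type(module).__name__.lower():
            module.forward = types.MethodType(sp_forward, module)
```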
3 changes: 3 additions & 0 deletions swift/llm/utils/argument.py
@@ -572,6 +572,8 @@ class SftArguments(ArgumentsBase):
# fsdp config file
fsdp_config: Optional[str] = None

sequence_parallel_size: int = 1

# compatibility hf
per_device_train_batch_size: Optional[int] = None
per_device_eval_batch_size: Optional[int] = None
@@ -586,6 +588,7 @@ class SftArguments(ArgumentsBase):
neftune_alpha: Optional[float] = None
deepspeed_config_path: Optional[str] = None
model_cache_dir: Optional[str] = None

custom_train_dataset_path: List[str] = field(default_factory=list)
custom_val_dataset_path: List[str] = field(default_factory=list)

7 changes: 4 additions & 3 deletions swift/llm/utils/dataset.py
@@ -504,10 +504,11 @@ def map_row(row):
response = row['response']
if response and response.startswith('Answer:'):
response = response[len('Answer:') + 1:].strip()
return {'query': row['query'], 'response': response}
row['response'] = response
return row

return dataset.rename_columns({'instruction': 'query', 'output': 'response'}) \
.remove_columns(['input', 'file']).map(map_row).filter(lambda row: row['response'] is not None)
dataset = AlpacaPreprocessor()(dataset)
return dataset.map(map_row)


register_dataset(
11 changes: 11 additions & 0 deletions swift/llm/utils/model.py
@@ -327,6 +327,7 @@ class ModelType:
phi2_3b = 'phi2-3b'
phi3_4b_4k_instruct = 'phi3-4b-4k-instruct'
phi3_4b_128k_instruct = 'phi3-4b-128k-instruct'
phi3_mini_128k_instruct = 'phi3-mini-128k-instruct'
# cogagent
cogvlm_17b_instruct = 'cogvlm-17b-instruct'
cogagent_18b_chat = 'cogagent-18b-chat'
@@ -1297,6 +1298,16 @@ def cross_entropy_forward(self, inputs: Tensor, target: Tensor) -> Tensor:
support_vllm=False, # https://github.com/vllm-project/vllm/pull/4298
tags=['general'],
hf_model_id='microsoft/Phi-3-mini-128k-instruct')
@register_model(
ModelType.phi3_mini_128k_instruct,
'LLM-Research/Phi-3-mini-128k-instruct',
LoRATM.phi3,
TemplateType.phi3,
requires=['transformers>=4.36'],
support_flash_attn=True,
support_vllm=False,
tags=['general'],
hf_model_id='microsoft/Phi-3-mini-128k-instruct')
@register_model(
ModelType.phi3_4b_4k_instruct,
'LLM-Research/Phi-3-mini-4k-instruct',
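
A hedged inference sketch for the newly registered `phi3-mini-128k-instruct` model_type, following the pattern used in swift's README examples; treat the argument values as assumptions and note that generation settings are omitted.

```python
from swift.llm import (ModelType, get_default_template_type, get_model_tokenizer,
                       get_template, inference)

model_type = ModelType.phi3_mini_128k_instruct
template_type = get_default_template_type(model_type)  # expected to resolve to the phi3 template
model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': 'auto'})
template = get_template(template_type, tokenizer)
response, history = inference(model, template, 'Hello, who are you?')
print(response)
```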
13 changes: 13 additions & 0 deletions swift/llm/utils/template.py
@@ -204,6 +204,8 @@ def _init_template(self,
self.truncation_strategy = truncation_strategy
self.model = kwargs.get('model', None)
self.use_loss_scale = kwargs.get('use_loss_scale', False)
self.sequence_parallel_size = kwargs.get('sequence_parallel_size', 1)

for key in ['prefix', 'prompt', 'chat_sep', 'suffix', 'prefix_has_system']:
value = getattr(self, key)
value = self._preprocess_prompt(tokenizer, value)
@@ -422,6 +424,17 @@ def data_collator(self, batch: List[Dict[str, Any]], padding_to: Optional[int] =
labels, loss_scale, self.max_length,
self.tokenizer, rank, world_size)

bs, seq_len = input_ids.shape
position_ids = torch.arange(seq_len).unsqueeze(0).long().repeat(bs, 1)

if self.sequence_parallel_size > 1:
from swift.trainers.xtuner import get_xtuner_sequence_parallel_world_size
if get_xtuner_sequence_parallel_world_size() > 1:
from swift.trainers.xtuner import pad_and_split_for_sequence_parallel
input_ids, labels, position_ids, attention_mask, loss_scale = \
pad_and_split_for_sequence_parallel(
tokenizer, input_ids, labels, position_ids, attention_mask, loss_scale)

res = {
'input_ids': input_ids,
'attention_mask': attention_mask,
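
A minimal sketch of the pad-and-split step the collator performs when sequence parallel is active. It is not the actual `pad_and_split_for_sequence_parallel` from `swift.trainers.xtuner` (which also handles attention_mask and loss_scale); the function name, signature, and padding values here are assumptions for illustration.

```python
import torch


def pad_and_split(input_ids, labels, position_ids, pad_token_id, sp_world_size, sp_rank):
    """Pad the sequence dimension to a multiple of the sequence-parallel world
    size, then keep only this rank's contiguous slice of every tensor."""
    bs, seq_len = input_ids.shape
    pad_len = (-seq_len) % sp_world_size
    if pad_len:
        input_ids = torch.cat([input_ids, input_ids.new_full((bs, pad_len), pad_token_id)], dim=1)
        labels = torch.cat([labels, labels.new_full((bs, pad_len), -100)], dim=1)  # -100: ignored by the loss
        extra_pos = torch.arange(seq_len, seq_len + pad_len).repeat(bs, 1)
        position_ids = torch.cat([position_ids, extra_pos], dim=1)
    chunk = input_ids.shape[1] // sp_world_size
    sl = slice(sp_rank * chunk, (sp_rank + 1) * chunk)
    return input_ids[:, sl], labels[:, sl], position_ids[:, sl]
```

Each rank then runs the model on its own slice, with the patched attention communicating across the sequence-parallel group.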
4 changes: 4 additions & 0 deletions swift/llm/utils/utils.py
@@ -825,6 +825,10 @@ def is_vllm_available():
return importlib.util.find_spec('vllm') is not None


def is_xtuner_available():
return importlib.util.find_spec('xtuner') is not None


def get_time_info(log_history: List[Dict[str, Any]], n_train_samples: Optional[int]) -> Optional[Dict[str, Any]]:
time_info = None
try:
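
A small hedged example of how the new helper can gate optional xtuner imports, in the same spirit as `is_vllm_available()`; the call site below is illustrative, not taken from this commit.

```python
from swift.llm.utils.utils import is_xtuner_available

if is_xtuner_available():
    # swift.trainers.xtuner is the module this commit imports from elsewhere
    from swift.trainers.xtuner import dispatch_module_xtuner
else:
    dispatch_module_xtuner = None  # sequence parallel needs `pip install .[seq_parallel]`
```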
