Support peft0.9 (#490)
tastelikefeet committed Mar 5, 2024
1 parent 54f3ced commit db24a6f
Showing 19 changed files with 375 additions and 55 deletions.
8 changes: 7 additions & 1 deletion .dev_scripts/dockerci.sh
@@ -4,11 +4,15 @@ CODE_DIR=$PWD
CODE_DIR_IN_CONTAINER=/swift
echo "$USER"
gpus='0,1 2,3 4,5 6,7'
cpu_sets='45-58 31-44 16-30 0-15'
cpu_sets='0-15 16-31 32-47 48-63'
cpu_sets_arr=($cpu_sets)
is_get_file_lock=false
CI_COMMAND=${CI_COMMAND:-bash .dev_scripts/ci_container_test.sh python tests/run.py --parallel 2 --run_config tests/run_config.yaml}
echo "ci command: $CI_COMMAND"
PR_CHANGED_FILES="${PR_CHANGED_FILES:-}"
echo "PR modified files: $PR_CHANGED_FILES"
PR_CHANGED_FILES=${PR_CHANGED_FILES//[ ]/#}
echo "PR_CHANGED_FILES: $PR_CHANGED_FILES"
idx=0
for gpu in $gpus
do
@@ -43,6 +47,7 @@ do
-e TEST_UPLOAD_MS_TOKEN=$TEST_UPLOAD_MS_TOKEN \
-e MODEL_TAG_URL=$MODEL_TAG_URL \
-e MODELSCOPE_API_TOKEN=$MODELSCOPE_API_TOKEN \
-e PR_CHANGED_FILES=$PR_CHANGED_FILES \
--workdir=$CODE_DIR_IN_CONTAINER \
${IMAGE_NAME}:${IMAGE_VERSION} \
$CI_COMMAND
@@ -66,6 +71,7 @@ do
-e TEST_UPLOAD_MS_TOKEN=$TEST_UPLOAD_MS_TOKEN \
-e MODEL_TAG_URL=$MODEL_TAG_URL \
-e MODELSCOPE_API_TOKEN=$MODELSCOPE_API_TOKEN \
-e PR_CHANGED_FILES=$PR_CHANGED_FILES \
--workdir=$CODE_DIR_IN_CONTAINER \
${IMAGE_NAME}:${IMAGE_VERSION} \
$CI_COMMAND
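For reference, a minimal sketch (not the repository's actual `ci_container_test.sh` or `tests/run.py` logic) of how the `#`-joined `PR_CHANGED_FILES` value can be unpacked inside the container; the substitution above packs the file list into a single token so it survives `docker run -e` without spaces.

```python
# Hypothetical consumer of PR_CHANGED_FILES inside the CI container:
# dockerci.sh replaces spaces with '#', so split on '#' to recover the list
# of files touched by the PR.
import os

changed_files = [f for f in os.environ.get('PR_CHANGED_FILES', '').split('#') if f]
print(f'files changed in this PR: {changed_files}')
```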
6 changes: 4 additions & 2 deletions docs/source/LLM/命令行参数.md
@@ -50,6 +50,8 @@
- `--lora_bias_trainable`: Defaults to `'none'`. Possible values: 'none', 'all'. Set this to `'all'` if you want all bias parameters to be trainable.
- `--lora_modules_to_save`: Defaults to `[]`. If you want to train embedding, lm_head, or layer_norm, set this parameter, e.g. `--lora_modules_to_save wte ln_1 ln_2 ln_f lm_head`. This parameter applies to training with any adapter.
- `--lora_dtype`: Defaults to `'fp32'` and specifies the dtype of the LoRA modules. If set to `'AUTO'`, the dtype follows that of the original module. Possible values: 'fp16', 'bf16', 'fp32', 'AUTO'.
- `--use_dora`: Defaults to `False`; whether to use `DoRA` (see the sketch after this list).
- `--use_rslora`: Defaults to `False`; whether to use `RS-LoRA` (see the sketch after this list).
- `--neftune_noise_alpha`: The noise coefficient added by `NEFTune`, which can improve model performance in instruction fine-tuning. Defaults to `None`. Typical values are 5, 10, or 15. See the [related paper](https://arxiv.org/abs/2310.05914).
- `--gradient_checkpointing`: Whether to enable gradient checkpointing. Defaults to `True`. This saves GPU memory at the cost of slightly slower training, and is most useful when max_length and batch_size are large.
- `--deepspeed`: The path to a deepspeed config file, or the config passed directly as JSON. Defaults to `None`, i.e. deepspeed is disabled. deepspeed can reduce GPU memory usage. Default [ZeRO-2](https://github.com/modelscope/swift/blob/main/swift/llm/ds_config/zero2.json) and [ZeRO-3](https://github.com/modelscope/swift/blob/main/swift/llm/ds_config/zero3.json) config files are provided: specify 'default-zero2' to use the default ZeRO-2 config, or 'default-zero3' to use the default ZeRO-3 config.
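The two new switches correspond to fields on `peft`'s `LoraConfig` (DoRA support landed in peft 0.9, which this PR targets). A minimal sketch, assuming the `modelscope` AutoModelForCausalLM wrapper and an example checkpoint from the model table:

```python
# Minimal sketch of the peft-level equivalents of --use_dora / --use_rslora
# (requires peft>=0.9.0; the checkpoint id is just an example).
from modelscope import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained('qwen/Qwen1.5-0.5B-Chat')
config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=['q_proj', 'k_proj', 'v_proj'],
    use_dora=True,      # weight-decomposed LoRA (DoRA), new in peft 0.9
    # use_rslora=True,  # rank-stabilized scaling: lora_alpha / sqrt(r)
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
```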
@@ -105,7 +107,7 @@

- `--lora_lr_ratio`: Default `None`; recommended value `10~16`. Specify this parameter when using LoRA to enable LoRA+.

### LLaMA PRO fine-tuning parameters
### LLaMA-PRO fine-tuning parameters

- `--llamapro_num_new_blocks`: Default `4`; the total number of new layers inserted.
- `--llamapro_num_groups`: Default `None`; the number of groups the new_blocks are split into for insertion. If `None`, it equals `llamapro_num_new_blocks`, i.e. each new layer is inserted into the original model individually.
@@ -181,14 +183,14 @@ The dpo parameters inherit from the sft parameters, with the following additions:
- `--ignore_args_error`: Defaults to `False`; see the description in the `sft.sh command-line arguments` section.
- `--stream`: Whether to use streaming output. Defaults to `True`. This parameter only takes effect when evaluating with a dataset and verbose is True.
- `--merge_lora`: Whether to merge the LoRA weights into the base model and save the full weights. Defaults to `False`. The weights are saved in a sibling directory of `ckpt_dir`, e.g. under `'/path/to/your/vx-xxx/checkpoint-xxx-merged'`.
- `--merge_device_map`: The device_map used when merging LoRA. Defaults to `None`: to reduce GPU memory usage, `auto` is used when only the merge-lora step is performed; otherwise it defaults to `cpu`.
- `--save_safetensors`: Whether to save as `safetensors` files rather than `bin` files. Defaults to `True`.
- `--overwrite_generation_config`: Whether to save the generation_config used for evaluation as a `generation_config.json` file. Defaults to `None`: if `ckpt_dir` is specified it is set to `True`, otherwise `False`. The generation_config file saved during training will be overwritten.
- `--verbose`: If set to False, inference uses tqdm-style output. If set to True, the query, response, and label of each inference are printed. Defaults to `None`, i.e. chosen automatically: False when `len(val_dataset) >= 100`, True otherwise. This parameter only takes effect when evaluating with a dataset.
- `--gpu_memory_utilization`: A parameter for initializing the vllm engine `EngineArgs`. Defaults to `0.9`. Only takes effect when vllm is used. See [VLLM inference acceleration and deployment](VLLM推理加速与部署.md).
- `--tensor_parallel_size`: A parameter for initializing the vllm engine `EngineArgs`. Defaults to `1`. Only takes effect when vllm is used.
- `--max_model_len`: Overrides the model's max_model_len. Defaults to `None`. Only takes effect when vllm is used.


## export parameters

The export parameters inherit from the infer parameters, with the following additions:
8 changes: 8 additions & 0 deletions docs/source/LLM/支持的模型和数据集.md
@@ -42,6 +42,14 @@
|qwen1half-7b-chat|[qwen/Qwen1.5-7B-Chat](https://modelscope.cn/models/qwen/Qwen1.5-7B-Chat/summary)|q_proj, k_proj, v_proj|qwen|✔|✔|transformers>=4.37|
|qwen1half-14b-chat|[qwen/Qwen1.5-14B-Chat](https://modelscope.cn/models/qwen/Qwen1.5-14B-Chat/summary)|q_proj, k_proj, v_proj|qwen|✔|✔|transformers>=4.37|
|qwen1half-72b-chat|[qwen/Qwen1.5-72B-Chat](https://modelscope.cn/models/qwen/Qwen1.5-72B-Chat/summary)|q_proj, k_proj, v_proj|qwen|✔|✔|transformers>=4.37|
|qwen1half-0_5b-chat-awq|[qwen/Qwen1.5-0.5B-Chat-AWQ](https://modelscope.cn/models/qwen/Qwen1.5-0.5B-Chat-AWQ/summary)|q_proj, k_proj, v_proj|qwen|✔|✔|transformers>=4.37, autoawq|
|qwen1half-1_8b-chat-awq|[qwen/Qwen1.5-1.8B-Chat-AWQ](https://modelscope.cn/models/qwen/Qwen1.5-1.8B-Chat-AWQ/summary)|q_proj, k_proj, v_proj|qwen|✔|✔|transformers>=4.37, autoawq|
|qwen1half-4b-chat-awq|[qwen/Qwen1.5-4B-Chat-AWQ](https://modelscope.cn/models/qwen/Qwen1.5-4B-Chat-AWQ/summary)|q_proj, k_proj, v_proj|qwen|✔|✔|transformers>=4.37, autoawq|
|qwen1half-7b-chat-awq|[qwen/Qwen1.5-7B-Chat-AWQ](https://modelscope.cn/models/qwen/Qwen1.5-7B-Chat-AWQ/summary)|q_proj, k_proj, v_proj|qwen|✔|✔|transformers>=4.37, autoawq|
|qwen1half-14b-chat-awq|[qwen/Qwen1.5-14B-Chat-AWQ](https://modelscope.cn/models/qwen/Qwen1.5-14B-Chat-AWQ/summary)|q_proj, k_proj, v_proj|qwen|✔|✔|transformers>=4.37, autoawq|
|qwen1half-72b-chat-awq|[qwen/Qwen1.5-72B-Chat-AWQ](https://modelscope.cn/models/qwen/Qwen1.5-72B-Chat-AWQ/summary)|q_proj, k_proj, v_proj|qwen|✔|✔|transformers>=4.37, autoawq|
|llama2-7b-aqlm-2bit-1x16|[AI-ModelScope/Llama-2-7b-AQLM-2Bit-1x16-hf](https://modelscope.cn/models/AI-ModelScope/Llama-2-7b-AQLM-2Bit-1x16-hf/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|✘|transformers>=4.38, aqlm, torch>=2.2.0|
|mixtral-moe-7b-aqlm-2bit-1x16|[AI-ModelScope/Mixtral-8x7b-AQLM-2Bit-1x16-hf](https://modelscope.cn/models/AI-ModelScope/Mixtral-8x7b-AQLM-2Bit-1x16-hf/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|✘|transformers>=4.38, aqlm, torch>=2.2.0|
|qwen1half-0_5b-chat-int4|[qwen/Qwen1.5-0.5B-Chat-GPTQ-Int4](https://modelscope.cn/models/qwen/Qwen1.5-0.5B-Chat-GPTQ-Int4/summary)|q_proj, k_proj, v_proj|qwen|✔|✔|auto_gptq>=0.5, transformers>=4.37|
|qwen1half-1_8b-chat-int4|[qwen/Qwen1.5-1.8B-Chat-GPTQ-Int4](https://modelscope.cn/models/qwen/Qwen1.5-1.8B-Chat-GPTQ-Int4/summary)|q_proj, k_proj, v_proj|qwen|✔|✔|auto_gptq>=0.5, transformers>=4.37|
|qwen1half-4b-chat-int4|[qwen/Qwen1.5-4B-Chat-GPTQ-Int4](https://modelscope.cn/models/qwen/Qwen1.5-4B-Chat-GPTQ-Int4/summary)|q_proj, k_proj, v_proj|qwen|✔|✔|auto_gptq>=0.5, transformers>=4.37|
2 changes: 1 addition & 1 deletion requirements/framework.txt
@@ -8,7 +8,7 @@ nltk
numpy
optimum
pandas
peft>=0.8.0,<0.9.0
peft>=0.9.0,<0.10.0
requests
rouge
safetensors
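A quick way to check at runtime that an environment satisfies the bumped constraint (a small hedged sketch; it assumes `packaging` is installed, which the framework requirements already pull in elsewhere):

```python
# Sanity check for the new requirement peft>=0.9.0,<0.10.0.
from importlib.metadata import version
from packaging.version import Version

peft_version = Version(version('peft'))
assert Version('0.9.0') <= peft_version < Version('0.10.0'), (
    f'unsupported peft version: {peft_version}')
```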
2 changes: 1 addition & 1 deletion swift/llm/app_ui.py
@@ -111,7 +111,7 @@ def llm_app_ui(args: AppUIArguments) -> None:
logger.info(f'args: {args}')
args.eval_human = True
if args.merge_lora:
merge_lora(args, device_map='cpu')
merge_lora(args, device_map=args.merge_device_map)
if args.template_type.endswith('generation'):
gradio_generation_demo(args)
else:
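The same one-line change, passing `args.merge_device_map` instead of a hard-coded `'cpu'`, recurs in `deploy.py`, `export.py`, and `infer.py` below. As a rough, hedged sketch of what a device_map-aware merge amounts to at the peft level (paths are placeholders; this is not the repository's actual `merge_lora` implementation):

```python
# Hedged sketch of a device_map-aware LoRA merge with peft.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    '/path/to/base-model',
    device_map='cpu',    # args.merge_device_map: 'cpu' by default, 'auto' to spread across devices
    torch_dtype='auto')
model = PeftModel.from_pretrained(base, '/path/to/your/vx-xxx/checkpoint-xxx')
merged = model.merge_and_unload()  # folds the LoRA deltas into the base weights
merged.save_pretrained('/path/to/your/vx-xxx/checkpoint-xxx-merged')
AutoTokenizer.from_pretrained('/path/to/base-model') \
    .save_pretrained('/path/to/your/vx-xxx/checkpoint-xxx-merged')
```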
2 changes: 1 addition & 1 deletion swift/llm/deploy.py
@@ -475,7 +475,7 @@ def llm_deploy(args: DeployArguments) -> None:
global llm_engine, model, template, _args
_args = args
if args.merge_lora:
merge_lora(args, device_map='cpu')
merge_lora(args, device_map=args.merge_device_map)
if args.infer_backend == 'vllm':
from .utils import prepare_vllm_engine_template
llm_engine, template = prepare_vllm_engine_template(
2 changes: 1 addition & 1 deletion swift/llm/export.py
@@ -159,7 +159,7 @@ def llm_export(args: ExportArguments) -> None:
global _args, template
logger.info(f'args: {args}')
if args.merge_lora:
merge_lora(args, device_map='cpu')
merge_lora(args, device_map=args.merge_device_map)
if args.quant_bits > 0:
_args = args
assert args.quantization_bit == 0
2 changes: 1 addition & 1 deletion swift/llm/infer.py
@@ -223,7 +223,7 @@ def read_media_file(
def llm_infer(args: InferArguments) -> None:
logger.info(f'args: {args}')
if args.merge_lora:
merge_lora(args, device_map='cpu')
merge_lora(args, device_map=args.merge_device_map)
if args.infer_backend == 'vllm':
from .utils import prepare_vllm_engine_template, inference_stream_vllm, inference_vllm
llm_engine, template = prepare_vllm_engine_template(args)
1 change: 1 addition & 0 deletions swift/llm/tuner.py
@@ -56,6 +56,7 @@ def prepare_model(model, args: SftArguments):
'rank_pattern': args.lora_rank_pattern,
'alpha_pattern': args.lora_alpha_pattern,
'loftq_config': args.lora_loftq_config,
'use_dora': args.use_dora,
}
if args.sft_type == 'lora':
if args.tuner_backend == 'swift':
7 changes: 7 additions & 0 deletions swift/llm/utils/argument.py
@@ -113,6 +113,8 @@ class SftArguments:
lora_rank_pattern: Dict = field(default_factory=dict)
lora_alpha_pattern: Dict = field(default_factory=dict)
lora_loftq_config: Dict = field(default_factory=dict)
use_dora: bool = False

# adalora
adalora_target_r: int = 8
adalora_init_r: int = 12
@@ -565,6 +567,7 @@
ignore_args_error: bool = False # True: notebook compatibility
stream: bool = True
merge_lora: bool = False
merge_device_map: Optional[str] = None
save_safetensors: bool = True
overwrite_generation_config: Optional[bool] = None
verbose: Optional[bool] = None
@@ -659,6 +662,8 @@ def __post_init__(self) -> None:
self.stream = False
logger.info('Setting self.stream: False')
self.infer_media_type = template_info.get('infer_media_type', 'none')
if self.merge_device_map is None:
self.merge_device_map = 'cpu'

@staticmethod
def check_ckpt_dir_correct(ckpt_dir) -> bool:
@@ -723,6 +728,8 @@ class ExportArguments(InferArguments):
commit_message: str = 'update files'

def __post_init__(self):
if self.merge_device_map is None:
self.merge_device_map = 'cpu' if self.quant_bits != 0 else 'auto'
super().__post_init__()
if len(self.dataset) == 0:
self.dataset = ['ms-bench-mini']
111 changes: 105 additions & 6 deletions swift/llm/utils/model.py
@@ -2,6 +2,7 @@
import inspect
import os
import sys
from contextlib import nullcontext
from functools import partial, update_wrapper
from types import MethodType
from typing import Any, Callable, Dict, List, NamedTuple, Optional, Tuple, Type
@@ -65,6 +66,21 @@ class ModelType:
qwen1half_7b_chat = 'qwen1half-7b-chat'
qwen1half_14b_chat = 'qwen1half-14b-chat'
qwen1half_72b_chat = 'qwen1half-72b-chat'

# qwen1.5 awq
qwen1half_0_5b_chat_awq = 'qwen1half-0_5b-chat-awq'
qwen1half_1_8b_chat_awq = 'qwen1half-1_8b-chat-awq'
qwen1half_4b_chat_awq = 'qwen1half-4b-chat-awq'
qwen1half_7b_chat_awq = 'qwen1half-7b-chat-awq'
qwen1half_14b_chat_awq = 'qwen1half-14b-chat-awq'
qwen1half_72b_chat_awq = 'qwen1half-72b-chat-awq'

# llama aqlm model
llama2_7b_aqlm_2bit_1x16 = 'llama2-7b-aqlm-2bit-1x16'

# mixtral aqlm model
mixtral_moe_7b_aqlm_2bit_1x16 = 'mixtral-moe-7b-aqlm-2bit-1x16'

# qwen1.5 gptq
qwen1half_0_5b_chat_int4 = 'qwen1half-0_5b-chat-int4'
qwen1half_1_8b_chat_int4 = 'qwen1half-1_8b-chat-int4'
@@ -78,6 +94,7 @@ class ModelType:
qwen1half_7b_chat_int8 = 'qwen1half-7b-chat-int8'
qwen1half_14b_chat_int8 = 'qwen1half-14b-chat-int8'
qwen1half_72b_chat_int8 = 'qwen1half-72b-chat-int8'

# qwen-vl
qwen_vl = 'qwen-vl'
qwen_vl_chat = 'qwen-vl-chat'
@@ -409,12 +426,28 @@ def get_model_tokenizer_from_repo(model_dir: str,
tokenizer.eos_token = eos_token
model = None
if load_model:
model = automodel_class.from_pretrained(
model_dir,
config=model_config,
torch_dtype=torch_dtype,
trust_remote_code=True,
**model_kwargs)
if 'aqlm' in model_dir.lower():
import aqlm
context = aqlm.optimize_for_training()
else:
context = nullcontext()
if 'awq' in model_dir.lower():
try:
from awq.utils.packing_utils import dequantize_gemm
import awq_ext # with CUDA kernels (AutoAWQ_kernels)
except ImportError as e:
raise ImportError(
'You are training awq models, remember installing awq_ext by '
'`git clone https://github.com/casper-hansen/AutoAWQ_kernels '
'&& cd AutoAWQ_kernels && pip install -e .`') from e

with context:
model = automodel_class.from_pretrained(
model_dir,
config=model_config,
torch_dtype=torch_dtype,
trust_remote_code=True,
**model_kwargs)
return model, tokenizer
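
A hedged usage sketch of the new loading paths: the AWQ and AQLM checkpoints registered in this PR load through the same registry, with the AQLM branch wrapped in `aqlm.optimize_for_training()` and the AWQ branch requiring the `AutoAWQ_kernels` install named in the error message above. `get_model_tokenizer` is assumed here to be the public entry point that dispatches into `get_model_tokenizer_from_repo`:

```python
# Hedged sketch; keyword names follow the signatures shown in this diff.
import torch
from swift.llm import ModelType, get_model_tokenizer

model, tokenizer = get_model_tokenizer(
    ModelType.qwen1half_7b_chat_awq,  # or ModelType.llama2_7b_aqlm_2bit_1x16
    torch_dtype=torch.float16,
    model_kwargs={'device_map': 'auto'})
```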


@@ -1087,6 +1120,15 @@ def cross_entropy_forward(self, inputs: Tensor,
support_flash_attn=True,
support_vllm=True,
support_gradient_checkpointing=False)
@register_model(
ModelType.mixtral_moe_7b_aqlm_2bit_1x16,
'AI-ModelScope/Mixtral-8x7b-AQLM-2Bit-1x16-hf',
LoRATM.llama2,
TemplateType.default_generation_bos,
requires=['transformers>=4.38', 'aqlm', 'torch>=2.2.0'],
support_flash_attn=True,
support_vllm=False,
support_gradient_checkpointing=False)
def get_model_tokenizer_with_flash_attn(model_dir: str,
torch_dtype: Dtype,
model_kwargs: Dict[str, Any],
@@ -1111,6 +1153,54 @@ def get_model_tokenizer_with_flash_attn(model_dir: str,
**kwargs)


@register_model(
ModelType.qwen1half_7b_chat_awq,
'qwen/Qwen1.5-7B-Chat-AWQ',
LoRATM.qwen1half,
TemplateType.qwen,
support_flash_attn=True,
support_vllm=True,
requires=['transformers>=4.37', 'autoawq'])
@register_model(
ModelType.qwen1half_4b_chat_awq,
'qwen/Qwen1.5-4B-Chat-AWQ',
LoRATM.qwen1half,
TemplateType.qwen,
support_flash_attn=True,
support_vllm=True,
requires=['transformers>=4.37', 'autoawq'])
@register_model(
ModelType.qwen1half_14b_chat_awq,
'qwen/Qwen1.5-14B-Chat-AWQ',
LoRATM.qwen1half,
TemplateType.qwen,
support_flash_attn=True,
support_vllm=True,
requires=['transformers>=4.37', 'autoawq'])
@register_model(
ModelType.qwen1half_72b_chat_awq,
'qwen/Qwen1.5-72B-Chat-AWQ',
LoRATM.qwen1half,
TemplateType.qwen,
support_flash_attn=True,
support_vllm=True,
requires=['transformers>=4.37', 'autoawq'])
@register_model(
ModelType.qwen1half_1_8b_chat_awq,
'qwen/Qwen1.5-1.8B-Chat-AWQ',
LoRATM.qwen1half,
TemplateType.qwen,
support_flash_attn=True,
support_vllm=True,
requires=['transformers>=4.37', 'autoawq'])
@register_model(
ModelType.qwen1half_0_5b_chat_awq,
'qwen/Qwen1.5-0.5B-Chat-AWQ',
LoRATM.qwen1half,
TemplateType.qwen,
support_flash_attn=True,
support_vllm=True,
requires=['transformers>=4.37', 'autoawq'])
@register_model(
ModelType.qwen1half_0_5b_chat,
'qwen/Qwen1.5-0.5B-Chat',
@@ -1506,6 +1596,15 @@ def get_model_tokenizer_internlm_xcomposer2(model_dir: str,
ignore_file_pattern=[r'.+\.bin$'],
support_flash_attn=True,
support_vllm=True)
@register_model(
ModelType.llama2_7b_aqlm_2bit_1x16,
'AI-ModelScope/Llama-2-7b-AQLM-2Bit-1x16-hf',
LoRATM.llama2,
TemplateType.default_generation_bos,
ignore_file_pattern=[r'.+\.bin$'],
support_flash_attn=True,
requires=['transformers>=4.38', 'aqlm', 'torch>=2.2.0'],
support_vllm=False)
@register_model(
ModelType.llama2_13b_chat,
'modelscope/Llama-2-13b-chat-ms',
18 changes: 12 additions & 6 deletions swift/llm/utils/utils.py
@@ -356,24 +356,30 @@ def find_all_linears(model: Module, quantization_bit: int,
head_module_name = 'output_layer'
if quantization_bit == 4:
from bitsandbytes.nn import Linear4bit
linear_cls = Linear4bit
linear_cls = [Linear4bit]
elif quantization_bit == 8:
from bitsandbytes.nn import Linear8bitLt
linear_cls = Linear8bitLt
linear_cls = [Linear8bitLt]
else:
linear_cls = Linear
linear_cls = [Linear]
if 'int4' in model_type or 'int8' in model_type:
from bitsandbytes.nn import Linear4bit
from peft.utils import get_auto_gptq_quant_linear, get_quantization_config
gptq_quantization_config = get_quantization_config(model, 'gptq')
AutoGPTQQuantLinear = get_auto_gptq_quant_linear(
gptq_quantization_config)
linear_cls = Linear4bit
linear_cls = [Linear4bit]
if AutoGPTQQuantLinear is not None:
linear_cls = (Linear4bit, AutoGPTQQuantLinear)
linear_cls.append(AutoGPTQQuantLinear)
if 'awq' in model_type:
from awq.modules.linear import WQLinear_GEMM
linear_cls.append(WQLinear_GEMM)
if 'aqlm' in model_type:
from aqlm import QuantizedLinear
linear_cls.append(QuantizedLinear)
target_module_names = set()
for name, module in model.named_modules():
if isinstance(module, linear_cls):
if isinstance(module, tuple(linear_cls)):
module_name = '.'.join(name.split('.')[-2:])
if head_module_name not in module_name:
target_module_names.add(module_name)
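With `linear_cls` now a list, AWQ (`WQLinear_GEMM`) and AQLM (`QuantizedLinear`) layers are also picked up when LoRA target modules are resolved automatically. A hedged sketch of how the result might feed a LoRA config; the `(model, quantization_bit, model_type)` argument order is an assumption based on the truncated signature above, and `find_all_linears` is imported from its module path as shown in this file:

```python
# Hypothetical follow-up to the loading sketch above: resolve LoRA target
# modules for an AWQ checkpoint and build a peft config from them.
import torch
from peft import LoraConfig
from swift.llm import ModelType, get_model_tokenizer
from swift.llm.utils.utils import find_all_linears

model, _ = get_model_tokenizer(
    ModelType.qwen1half_7b_chat_awq,
    torch_dtype=torch.float16,
    model_kwargs={'device_map': 'auto'})
target_modules = find_all_linears(model, 0, 'qwen1half-7b-chat-awq')
lora_config = LoraConfig(r=8, lora_alpha=32,
                         target_modules=sorted(target_modules))
```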