refactor rlhf (#1090)

modelscope · Jun 17, 2024 · commit c0e91a1
1 parent 03c6222
Showing 41 changed files with 1,343 additions and 266 deletions.
diff --git a/README.md b/README.md
@@ -47,6 +47,7 @@ SWIFT has rich documentations for users, please check [here](https://github.com/
 SWIFT web-ui is available both on [Huggingface space](https://huggingface.co/spaces/tastelikefeet/swift) and [ModelScope studio](https://www.modelscope.cn/studios/iic/Scalable-lightWeight-Infrastructure-for-Fine-Tuning/summary), please feel free to try!
 
 ## 🎉 News
+- 🔥2024.06.16: Supports **KTO** and **CPO** training! See the [documentation](https://github.com/modelscope/swift/blob/main/docs/source_en/LLM/Human-Preference-Alignment-Training-Documentation.md) to start training!
 - 2024.06.11: Support for tool-calling agent deployment that conforms to the OpenAI interface. You can refer to [Agent deployment best practice](https://github.com/modelscope/swift/blob/main/docs/source_en/LLM/Agent-deployment-best-practice.md)
 - 🔥2024.06.07: Support **Qwen2** series LLM, including Base and Instruct models of 0.5B, 1.5B, 7B, and 72B, as well as corresponding quantized versions gptq-int4, gptq-int8, and awq-int4. The best practice for self-cognition fine-tuning, inference and deployment of Qwen2-72B-Instruct using dual-card 80GiB A100 can be found [here](https://github.com/modelscope/swift/issues/1092).
 - 🔥2024.06.05: Support for **glm4** series LLM and glm4v-9b-chat MLLM. You can refer to [glm4v best practice](docs/source_en/Multi-Modal/glm4v-best-practice.md).
@@ -630,6 +631,7 @@ make docs
 | [LLM Quantization](docs/source_en/LLM/LLM-quantization.md) |
 | [LLM Deployment](docs/source_en/LLM/VLLM-inference-acceleration-and-deployment.md) |
 | [AnimateDiff Training](docs/source_en/AIGC/AnimateDiff-train-infer.md) |
+| [Human Preference Alignment Training Documentation](docs/source_en/LLM/Human-Preference-Alignment-Training-Documentation.md) |
 
 ### Reference Documentation
 | Document Name |

diff --git a/README_CN.md b/README_CN.md
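@@ -48,6 +48,7 @@ SWIFT has rich documentation; if you have usage questions, please see [here](https:
 The SWIFT web-ui can be tried on [Huggingface space](https://huggingface.co/spaces/tastelikefeet/swift) and [ModelScope studio](https://www.modelscope.cn/studios/iic/Scalable-lightWeight-Infrastructure-for-Fine-Tuning/summary).
 
 ## 🎉 News
+- 🔥2024.06.16: Supports **KTO** and **CPO** training. Use `swift rlhf --rlhf_type kto` and `swift rlhf --rlhf_type cpo` to start training; see the [documentation](./docs/source/LLM/人类偏好对齐训练文档.md)
 - 2024.06.11: Supports deployment of tool-calling agents that conform to the OpenAI interface; see [Agent deployment best practice](docs/source/LLM/Agent部署最佳实践.md).
 - 🔥2024.06.07: Supports the **Qwen2** series of LLMs, including the 0.5B, 1.5B, 7B, and 72B Base and Instruct models, as well as the corresponding gptq-int4, gptq-int8, and awq-int4 quantized versions. The best practice for self-cognition fine-tuning, inference, and deployment of Qwen2-72B-Instruct on two 80GiB A100 cards can be found [here](https://github.com/modelscope/swift/issues/1092).
 - 🔥2024.06.05: Supports the glm4 series of LLMs and the glm4v-9b-chat multimodal LLM; see [glm4v best practice](docs/source/Multi-Modal/glm4v最佳实践.md).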
@@ -239,7 +240,7 @@ swift web-ui
 | -------- |------------------------------------|
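 | Pretraining | Text generation |
 | Fine-tuning | Single-turn/multi-turn<br>Agent training/self-cognition<br>Multimodal vision/multimodal audio |
-| Human alignment | DPO<br>ORPO<br>SimPO  |
+| Human alignment | DPO<br>ORPO<br>SimPO<br>KTO<br>CPO |
 | Text-to-image | DreamBooth, etc. |
 | Text-to-video | - |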
 
@@ -628,6 +629,7 @@ make docs
 | [LLM Quantization](https://github.com/modelscope/swift/blob/main/docs/source/LLM/LLM%E9%87%8F%E5%8C%96%E6%96%87%E6%A1%A3.md) |
 | [LLM Deployment](https://github.com/modelscope/swift/blob/main/docs/source/LLM/VLLM%E6%8E%A8%E7%90%86%E5%8A%A0%E9%80%9F%E4%B8%8E%E9%83%A8%E7%BD%B2.md) |
 | [AnimateDiff Training](https://github.com/modelscope/swift/blob/main/docs/source/AIGC/AnimateDiff%E5%BE%AE%E8%B0%83%E6%8E%A8%E7%90%86%E6%96%87%E6%A1%A3.md) |
+| [Human Preference Alignment Training](./docs/source/LLM/人类偏好对齐训练文档.md) |
 
 
 ### Reference Documentation

diff --git a/docs/resources/dpo_data.png b/docs/resources/dpo_data.png
diff --git a/docs/resources/kto_data.png b/docs/resources/kto_data.png
diff --git a/docs/source/LLM/DPO训练文档.md b/docs/source/LLM/DPO训练文档.md
@@ -35,7 +35,8 @@ nproc_per_node=2
 CUDA_VISIBLE_DEVICES=0,1,2,3 \
 NPROC_PER_NODE=$nproc_per_node \
 MASTER_PORT=29500 \
-swift dpo \
+swift rlhf \
+ --rlhf_type dpo \
  --model_type yi-6b-chat \
  --ref_model_type yi-6b-chat \
  --model_revision master \

diff --git a/docs/source/LLM/ORPO算法最佳实践.md b/docs/source/LLM/ORPO算法最佳实践.md
@@ -48,7 +48,8 @@ swift has a built-in processing method that uses `answer_zh` as `response` and `answer_en` as `re
 # Memory usage: 4*24G
 CUDA_VISIBLE_DEVICES=0,1,2,3 \
 NPROC_PER_NODE=2 \
-swift orpo \
+swift rlhf \
+ --rlhf_type orpo \
  --model_type llama3-8b-instruct \
  --beta 0.5 \
  --sft_type lora \
@@ -61,10 +62,12 @@ swift orpo \
  --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \
  --warmup_ratio 0.03 \
  --save_total_limit 2
+
 # MP(device map)
 # Memory usage: 2*24G
 CUDA_VISIBLE_DEVICES=0,1 \
-swift orpo \
+swift rlhf \
+ --rlhf_type orpo \
  --model_type llama3-8b-instruct \
  --beta 0.5 \
  --sft_type lora \
@@ -80,7 +83,8 @@ swift orpo \
 
 # Memory usage: 40G
 CUDA_VISIBLE_DEVICES=0 \
-swift orpo \
+swift rlhf \
+ --rlhf_type orpo \
  --model_type llama3-8b-instruct \
  --beta 0.5 \
  --sft_type lora \

diff --git a/docs/source/LLM/SimPO算法最佳实践.md b/docs/source/LLM/SimPO算法最佳实践.md
@@ -50,7 +50,8 @@ swift has a built-in processing method that uses `answer_zh` as `response` and `answer_en` as `re
 # Memory usage: 4*56G
 CUDA_VISIBLE_DEVICES=0,1,2,3 \
 NPROC_PER_NODE=2 \
-swift simpo \
+swift rlhf \
+ --rlhf_type simpo \
  --model_type llama3-8b-instruct \
  --sft_type full \
  --dataset shareai-llama3-dpo-zh-en-emoji \

diff --git a/docs/source/LLM/index.md b/docs/source/LLM/index.md
@@ -12,6 +12,7 @@
 8. [LLM Experiment Documentation](LLM实验文档.md)
 9. [ORPO Best Practice](ORPO算法最佳实践.md)
 10. [SimPO Best Practice](SimPO算法最佳实践.md)
+11. [Human Preference Alignment Training Documentation](人类偏好对齐训练文档.md)
 
 ### ⭐️Best Practices Series
 

diff --git a/docs/source/LLM/人类偏好对齐训练文档.md b/docs/source/LLM/人类偏好对齐训练文档.md
@@ -0,0 +1,242 @@
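+# Human Preference Alignment Training Documentation
+
+This document provides training scripts for various human preference alignment algorithms. For more detailed information about the algorithms and how to choose among them, please refer to the [documentation](https://github.com/modelscope/modelscope-classroom/blob/main/LLM-tutorial/M.%E4%BA%BA%E7%B1%BB%E5%81%8F%E5%A5%BD%E5%AF%B9%E9%BD%90%E8%AE%AD%E7%BB%83.md)
+
+## Table of Contents
+- [Environment Setup](#environment-setup)
+- [Dataset](#dataset)
+- [DPO](#dpo)
+- [KTO](#kto)
+- [CPO](#cpo)
+- [ORPO](#orpo)
+- [SimPO](#simpo)
+
+## Environment Setup
+```bash
+# Set the global pip mirror (speeds up downloads)
+pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
+# Install ms-swift
+git clone https://github.com/modelscope/swift.git
+cd swift
+pip install -e '.[llm]'
+
+# Align the environment (usually not needed; if you hit errors, run the commands below. The repository is tested against the latest environment)
+pip install -r requirements/framework.txt -U
+pip install -r requirements/llm.txt -U
+```
+
+
+## Dataset
+
+Human preference alignment training generally requires data in the format $(x,y_w,y_l)$, where $x$ is the model input and $y_w,y_l$ are, respectively, the preferred response that matches human preferences and the rejected response that does not, for example ![dpo_data](../../resources/dpo_data.png)
+
+The data for the KTO algorithm is a special case: it only requires data in the format $(x,y,\text{label})$, where $x$ is the model input, $y$ is the model output, and label indicates whether the response matches human preferences,
+for example ![kto_data](../../resources/kto_data.png)
+
+KTO can also be trained with the first data format; see the KTO section for the differences in the training script.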
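
As a rough illustration of how the two data layouts relate, the sketch below builds one paired record $(x,y_w,y_l)$ and splits it into two labelled records $(x,y,\text{label})$. The field names (`prompt`, `chosen`, `rejected`, `completion`, `label`) and the helper `pair_to_labelled` are hypothetical, chosen for readability; they are not necessarily the column names SWIFT expects, and this is not how SWIFT itself handles paired data for KTO.

```python
from typing import Dict, List

# Hypothetical paired-preference record in the (x, y_w, y_l) layout.
paired_record = {
    "prompt": "What is the capital of France?",
    "chosen": "The capital of France is Paris.",
    "rejected": "France does not have a capital.",
}


def pair_to_labelled(record: Dict[str, str]) -> List[Dict[str, object]]:
    """Split one (x, y_w, y_l) record into two (x, y, label) records.

    The preferred response gets label=True, the rejected one label=False.
    """
    return [
        {"prompt": record["prompt"], "completion": record["chosen"], "label": True},
        {"prompt": record["prompt"], "completion": record["rejected"], "label": False},
    ]


labelled_records = pair_to_labelled(paired_record)
for row in labelled_records:
    print(row)
```

The labelled layout does not need every prompt to have both a preferred and a rejected response, which is why unpaired feedback data fits it naturally.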
+
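+**Training tips**:
+- If you train a base model on data with history, you need to specify a template that supports multi-turn dialogue (base models often do not support multi-turn dialogue). For this case we set the `chatml` template by default; you can also use `--model_type` to select the template of the training model
+- To train with a custom dataset, please refer to [Customization and Extension](自定义与拓展.md)
+- The training scripts below use `--lora_target_modules ALL` to train all linear layers of the model; you can also set `--lora_target_modules DEFAULT` to train only the model's QKV matrices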
+
+## DPO
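+[Paper on arXiv](https://arxiv.org/abs/2305.18290)
+
+Hyperparameters
+- `beta`: KL regularization coefficient; a larger value means a stronger penalty for deviating from the reference model. Defaults to 0.1
+
+Before starting DPO training, it is recommended to run SFT on the preferred responses in the preference dataset, to ensure the data matches the distribution requirements of the DPO algorithm.
+We also mix an SFT loss into the DPO loss to stabilize training; you can adjust the coefficient of the SFT loss with the hyperparameter `sft_beta`, which defaults to 0.1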
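
To make the roles of `beta` and `sft_beta` concrete, here is a minimal per-sample sketch in plain Python. It assumes the standard sigmoid DPO loss from the paper and the mixing rule `(1-sft_beta)*KL_loss + sft_beta * sft_loss` given in the command-line parameter documentation; the function names and the scalar log-probability inputs are hypothetical, and SWIFT's actual trainer works on batched tensors.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Sigmoid DPO loss for one (x, y_w, y_l) sample.

    Each argument is the summed log-probability of a response under the
    policy or the reference model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))


def mixed_loss(preference_loss: float, sft_loss: float,
               sft_beta: float = 0.1) -> float:
    """Blend the preference loss with the SFT loss using sft_beta in [0, 1)."""
    return (1 - sft_beta) * preference_loss + sft_beta * sft_loss


# When the policy equals the reference model, the DPO loss is log(2).
loss_at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)
total = mixed_loss(loss_at_init, sft_loss=1.2)
print(loss_at_init, total)
```

A larger `beta` scales the log-ratio margin more strongly, while `sft_beta` only controls how much weight the SFT term gets in the final loss.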
+
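+Training scripts. Here we provide single-GPU, multi-GPU device map, and multi-GPU DDP versions; for brevity, later algorithms only give the single-GPU version.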
+```bash
+# Experimental environment: A100
+# Memory usage: 40G
+CUDA_VISIBLE_DEVICES=0 \
+swift rlhf \
+ --rlhf_type dpo \
+ --model_type llama3-8b-instruct \
+ --beta 0.1 \
+ --sft_beta 0.1 \
+ --sft_type lora \
+ --dataset shareai-llama3-dpo-zh-en-emoji \
+ --num_train_epochs 2 \
+ --lora_target_modules ALL \
+ --gradient_checkpointing true \
+ --batch_size 1 \
+ --learning_rate 5e-5 \
+ --gradient_accumulation_steps 16 \
+ --warmup_ratio 0.03 \
+ --save_total_limit 2
+
+# MP(device map)
+# Memory usage: 2*24G
+CUDA_VISIBLE_DEVICES=0,1 \
+swift rlhf \
+ --rlhf_type dpo \
+ --model_type llama3-8b-instruct \
+ --beta 0.1 \
+ --sft_beta 0.1 \
+ --sft_type lora \
+ --dataset shareai-llama3-dpo-zh-en-emoji \
+ --num_train_epochs 2 \
+ --lora_target_modules ALL \
+ --gradient_checkpointing true \
+ --batch_size 1 \
+ --learning_rate 5e-5 \
+ --gradient_accumulation_steps 16 \
+ --warmup_ratio 0.03 \
+ --save_total_limit 2
+
+# DDP + MP
+# Memory usage: 4*24G
+CUDA_VISIBLE_DEVICES=0,1,2,3 \
+NPROC_PER_NODE=2 \
+swift rlhf \
+ --rlhf_type dpo \
+ --model_type llama3-8b-instruct \
+ --beta 0.1 \
+ --sft_beta 0.1 \
+ --sft_type lora \
+ --dataset shareai-llama3-dpo-zh-en-emoji \
+ --num_train_epochs 2 \
+ --lora_target_modules ALL \
+ --gradient_checkpointing true \
+ --batch_size 1 \
+ --learning_rate 5e-5 \
+ --gradient_accumulation_steps 8 \
+ --warmup_ratio 0.03 \
+ --save_total_limit 2
+```
+
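+For inference and deployment of the trained model, see the [LLM Inference Documentation](./LLM推理文档.md) and the [VLLM Inference Acceleration and Deployment Documentation](./VLLM推理加速与部署.md)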
+
+## KTO
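+[Paper on arXiv](https://arxiv.org/abs/2402.01306)
+
+Hyperparameters
+- beta: KL regularization coefficient; a larger value means a stronger penalty for deviating from the reference model. Defaults to 0.1
+- desirable_weight: the $\lambda_D$ term in the loss function, the loss weight for preferred-response samples. Defaults to 1.0
+- undesirable_weight: the $\lambda_U$ term in the loss function, the loss weight for rejected-response samples. Defaults to 1.0
+
+Let $n_D$ and $n_U$ denote the number of preferred-response and rejected-response samples in the dataset. For the hyperparameters $\lambda_D$ and $\lambda_U$, the authors recommend setting $\frac{\lambda_Dn_D}{\lambda_Un_U}\in[1,\frac{4}{3}]$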
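
The recommended ratio can be checked with a few lines of Python. This is only a sketch of the arithmetic: the helper names are hypothetical, and the sample counts in the usage lines are made up for illustration. Rearranging $1 \le \frac{\lambda_D n_D}{\lambda_U n_U} \le \frac{4}{3}$ for $\lambda_U$ gives $\frac{3}{4}\cdot\frac{\lambda_D n_D}{n_U} \le \lambda_U \le \frac{\lambda_D n_D}{n_U}$.

```python
from typing import Tuple


def kto_weight_ratio(n_desirable: int, n_undesirable: int,
                     desirable_weight: float = 1.0,
                     undesirable_weight: float = 1.0) -> float:
    """Return (lambda_D * n_D) / (lambda_U * n_U)."""
    return (desirable_weight * n_desirable) / (undesirable_weight * n_undesirable)


def undesirable_weight_range(n_desirable: int, n_undesirable: int,
                             desirable_weight: float = 1.0) -> Tuple[float, float]:
    """Return the (low, high) range of lambda_U that keeps the ratio in [1, 4/3]."""
    high = desirable_weight * n_desirable / n_undesirable  # ratio == 1
    low = 0.75 * high                                      # ratio == 4/3
    return low, high


# Illustrative counts: 1000 desirable and 500 undesirable samples.
ratio_with_defaults = kto_weight_ratio(1000, 500)
low, high = undesirable_weight_range(1000, 500)
print(ratio_with_defaults, low, high)
```

With these illustrative counts the default weights give a ratio of 2, outside the recommended interval, so one would raise `--undesirable_weight` to somewhere between 1.5 and 2.0 (or lower `--desirable_weight` correspondingly).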
+
+Training scripts
+Training with data in the $(x,y,\text{label})$ format
+
+```bash
+CUDA_VISIBLE_DEVICES=0 \
+swift rlhf \
+ --rlhf_type kto \
+ --model_type llama3-8b-instruct \
+ --beta 0.1 \
+ --desirable_weight 1.0 \
+ --undesirable_weight 1.0 \
+ --sft_type lora \
+ --dataset ultrafeedback-kto \
+ --num_train_epochs 2 \
+ --lora_target_modules ALL \
+ --gradient_checkpointing true \
+ --batch_size 1 \
+ --learning_rate 5e-5 \
+ --gradient_accumulation_steps 16 \
+ --warmup_ratio 0.03 \
+ --save_total_limit 2
+```
+
+Training with data in the $(x,y_w,y_l)$ format
+```bash
+CUDA_VISIBLE_DEVICES=0 \
+swift rlhf \
+ --rlhf_type dpo \
+ --loss_type kto_pair \
+ --model_type llama3-8b-instruct \
+ --beta 0.1 \
+ --desirable_weight 1.0 \
+ --undesirable_weight 1.0 \
+ --sft_type lora \
+ --dataset shareai-llama3-dpo-zh-en-emoji \
+ --num_train_epochs 2 \
+ --lora_target_modules ALL \
+ --gradient_checkpointing true \
+ --batch_size 1 \
+ --learning_rate 5e-5 \
+ --gradient_accumulation_steps 16 \
+ --warmup_ratio 0.03 \
+ --save_total_limit 2
+```
+
+## CPO
+[Paper on arXiv](https://arxiv.org/abs/2401.08417)
+Hyperparameters
+- beta: coefficient applied to the implicit reward. Defaults to 0.1
+
+Training script
+```bash
+CUDA_VISIBLE_DEVICES=0 \
+swift rlhf \
+ --rlhf_type cpo \
+ --model_type llama3-8b-instruct \
+ --beta 0.1 \
+ --sft_type lora \
+ --dataset shareai-llama3-dpo-zh-en-emoji \
+ --num_train_epochs 2 \
+ --lora_target_modules ALL \
+ --gradient_checkpointing true \
+ --batch_size 1 \
+ --learning_rate 5e-5 \
+ --gradient_accumulation_steps 16 \
+ --warmup_ratio 0.03 \
+ --save_total_limit 2
+```
+
+## ORPO
+[Paper on arXiv](https://arxiv.org/abs/2403.07691)
+
+Hyperparameters
+- lambda: coefficient of the odds ratio loss
+
+Note: ORPO uses the `--beta` argument to pass the hyperparameter `lambda`
+```bash
+CUDA_VISIBLE_DEVICES=0 \
+swift rlhf \
+ --rlhf_type orpo \
+ --model_type llama3-8b-instruct \
+ --beta 0.1 \
+ --sft_type lora \
+ --dataset shareai-llama3-dpo-zh-en-emoji \
+ --num_train_epochs 2 \
+ --lora_target_modules ALL \
+ --gradient_checkpointing true \
+ --batch_size 1 \
+ --learning_rate 5e-5 \
+ --gradient_accumulation_steps 16 \
+ --warmup_ratio 0.03 \
+ --save_total_limit 2
+```
+
+## SimPO
+[Paper on arXiv](https://arxiv.org/abs/2405.14734)
+Hyperparameters
+- beta: coefficient applied to the implicit reward. Defaults to 2.0
+- simpo_gamma: the reward margin term. Defaults to 1.0
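
The interplay of `beta` and `simpo_gamma` can be sketched per sample as below, assuming the loss as formulated in the SimPO paper: the implicit reward is the length-normalized log-probability scaled by `beta`, and the preferred response must beat the rejected one by the margin `simpo_gamma`. The function name and scalar inputs are hypothetical; the trainer SWIFT uses operates on batched tensors and may organize the computation differently.

```python
import math


def simpo_loss(chosen_logp_sum: float, chosen_len: int,
               rejected_logp_sum: float, rejected_len: int,
               beta: float = 2.0, simpo_gamma: float = 1.0) -> float:
    """SimPO loss for one (x, y_w, y_l) sample; no reference model is needed.

    *_logp_sum is the summed token log-probability of a response under the
    policy, and *_len is its token count.
    """
    chosen_reward = beta * chosen_logp_sum / chosen_len
    rejected_reward = beta * rejected_logp_sum / rejected_len
    z = chosen_reward - rejected_reward - simpo_gamma
    # -log(sigmoid(z)) computed in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Average log-prob per token: -1.0 for the preferred, -2.0 for the rejected response.
example_loss = simpo_loss(-10.0, 10, -20.0, 10)
print(example_loss)
```

Because the reward is divided by the response length, a longer response does not obtain a larger implicit reward merely by accumulating more tokens.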
+
+```bash
+CUDA_VISIBLE_DEVICES=0 \
+swift rlhf \
+ --rlhf_type simpo \
+ --model_type llama3-8b-instruct \
+ --beta 2.0 \
+ --simpo_gamma 1.0 \
+ --sft_type lora \
+ --dataset shareai-llama3-dpo-zh-en-emoji \
+ --num_train_epochs 2 \
+ --lora_target_modules ALL \
+ --gradient_checkpointing true \
+ --batch_size 1 \
+ --learning_rate 5e-5 \
+ --gradient_accumulation_steps 16 \
+ --warmup_ratio 0.03 \
+ --save_total_limit 2
+```
diff --git a/docs/source/LLM/命令行参数.md b/docs/source/LLM/命令行参数.md
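@@ -221,17 +221,20 @@ unsloth adds no new parameters; adjusting existing parameters is enough to support it:
 - `--ia3_feedforward_modules`: Specifies the Linear name of IA3's MLP; this name must be in `ia3_target_modules`.
 - `--ia3_modules_to_save`: Additional modules that participate in IA3 training. See `lora_modules_to_save` for the meaning.
 
-## dpo parameters
-
-dpo parameters inherit the sft parameters, and additionally add the following parameters:
-
-- `--ref_model_type`: Type of the comparison model; available `model_type` values can be viewed in `MODEL_MAPPING.keys()`.
-- `--ref_model_id_or_path`: Local cache path of the comparison model, defaults to `None`.
-- `--max_prompt_length`: Maximum prompt length. This parameter is passed to DPOTrainer so that the prompt length does not exceed this value. Defaults to `1024`.
-- `--beta`: Regularization term for DPO logits, defaults to 0.1.
+## RLHF parameters
+
+RLHF parameters inherit the sft parameters, and additionally add the following parameters:
+- `--rlhf_type`: Selects the alignment algorithm; options are 'dpo', 'orpo', 'simpo', 'kto', 'cpo'. For training scripts, see the [documentation](./人类偏好对齐训练文档.md)
+- `--ref_model_type`: Selects the reference model; same as the model_type parameter, and defaults to the training model. Not needed for the `cpo` and `simpo` algorithms.
+- `--ref_model_id_or_path`: Local cache path of the reference model, defaults to `None`.
+- `--max_prompt_length`: Maximum prompt length. This parameter is passed to the corresponding Trainer so that the prompt length does not exceed this value. Defaults to `1024`.
+- `--beta`: KL regularization coefficient; defaults to 2.0 for the `simpo` algorithm and 0.1 for other algorithms. See the [documentation](./人类偏好对齐训练文档.md) for details
 - `--label_smoothing`: Whether to use DPO smoothing; defaults to 0, usually set between 0 and 0.5.
-- `--loss_type`: DPO loss type; supports 'sigmoid', 'hinge', 'ipo', 'kto_pair'. Defaults to 'sigmoid'.
-- `--sft_beta`: Whether to add sft loss to DPO; defaults to 0.1, supports the interval [0, 1). The final loss is `(1-sft_beta)*KL_loss + sft_beta * sft_loss`.
+- `--loss_type`: Loss type, defaults to 'sigmoid'.
+- `--sft_beta`: Whether to add sft loss to DPO; defaults to 0.1, supports the interval $[0, 1)$. The final loss is `(1-sft_beta)*KL_loss + sft_beta * sft_loss`.
+- `--simpo_gamma`: The reward margin term in the SimPO algorithm; the paper recommends a value of 0.5-1.5. Defaults to 1.0
+- `--desirable_weight`: Loss weight $\lambda_D$ for desirable responses in the KTO algorithm, defaults to 1.0
+- `--undesirable_weight`: Loss weight $\lambda_U$ for undesirable responses in the KTO algorithm, defaults to 1.0. Let $n_D$ and $n_U$ denote the numbers of desirable and undesirable examples in the dataset; the paper recommends keeping $\frac{\lambda_D n_D}{\lambda_Un_U} \in [1,\frac{4}{3}]$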
 
 ## merge-lora infer parameters