Support resuming QLoRA training from a checkpoint

yangjianxin1 · Oct 23, 2023 · 352a10b
1 parent 8d4eaf8
commit 352a10b
Showing 1 changed file with 22 additions and 1 deletion.
diff --git a/README.md b/README.md
@@ -21,7 +21,7 @@
 
 ## News
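 - 🔥 Added support for instruction fine-tuning of 悟道.天鹰Aquila2-34B.
-- 🔥 Open-sourced the [Firefly-LLaMA2-Chinese project](https://github.com/yangjianxin1/Firefly-LLaMA2-Chinese), **trained on 4*V00**. After Chinese vocabulary expansion, incremental pre-training, and multi-turn instruction fine-tuning, it outperforms Linly, Yayi, FlagAlpha, and others on CMMLU and performs roughly on par with Ziya and Chinese-Alpaca. The project also supports efficient incremental pre-training of models such as Baichuan, Qwen, InternLM, LLaMA, and Falcon.
+- 🔥 Open-sourced the [Firefly-LLaMA2-Chinese project](https://github.com/yangjianxin1/Firefly-LLaMA2-Chinese), **trained on 4*V100**. After Chinese vocabulary expansion, incremental pre-training, and multi-turn instruction fine-tuning, it outperforms Linly, Yayi, FlagAlpha, and others on CMMLU and performs roughly on par with Ziya and Chinese-Alpaca. The project also supports efficient incremental pre-training of models such as Baichuan, Qwen, InternLM, LLaMA, and Falcon.
 - 🔥 Open-sourced [firefly-baichuan2-13b](https://huggingface.co/YeungNLP/firefly-baichuan2-13b), which scores 56.83 on the OpenCompass CMMLU leaderboard, ranking 8th and trailing the official Baichuan Chat model by 1.57 points.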
 
 <details><summary><b>Previous News</b></summary>
@@ -369,6 +369,27 @@ CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node={num_gpus} train_qlora.py --t
 #### Issue 6: Training Baichuan2 fails
 Training Baichuan2 requires PyTorch 2.0.
 
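+#### Issue 7: How to load a previous checkpoint and resume training when fine-tuning with QLoRA
+
+In the corresponding `sft-qlora.json` file, add the `resume_training` parameter and set it to `true`.
+
+For example, to resume from a previous checkpoint when fine-tuning BLOOM with QLoRA, add the parameter to `train_args\qlora\bloom-sft-qlora.json`: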
+
+```json
+{
+ "output_dir": "output/firefly-bloom-7b1",
+ "model_name_or_path": "bigscience/bloom-7b1",
+ "train_file": "./data/dummy_data.jsonl",
+ "resume_training": true, // newly added option
+ "num_train_epochs": 1,
+ "per_device_train_batch_size": 1,
+ // ...
+```
+
+When this option is enabled, the most recent `checkpoint` in `output_dir` is located and loaded, and `output_dir` is not overwritten.
+
+**The `resume_training` option is disabled by default.**
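Note on the hunk above: the README text says the latest checkpoint is looked up in `output_dir`, but the lookup itself is not shown in this commit. The following is a minimal sketch of what such a lookup could look like, assuming checkpoints are saved as `checkpoint-<step>` subdirectories (the naming used by the Hugging Face `Trainer`). `find_latest_checkpoint` is a hypothetical helper written for illustration, not the project's code.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: checkpoints are directories named "checkpoint-<step>".
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(output_dir: str) -> Optional[str]:
    """Return the checkpoint directory with the highest step, or None.

    Hypothetical helper for illustration only.
    """
    root = Path(output_dir)
    if not root.is_dir():
        return None

    best_step = -1
    best_path = None
    for child in root.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        # Skip plain files and directories that do not follow the naming scheme.
        if match is None or not child.is_dir():
            continue
        step = int(match.group(1))  # compare numerically, not lexicographically
        if step > best_step:
            best_step, best_path = step, child

    return str(best_path) if best_path is not None else None
```

With the Hugging Face `Trainer`, a path found this way could be passed as `trainer.train(resume_from_checkpoint=path)`. `transformers.trainer_utils.get_last_checkpoint` offers a comparable lookup. Whether Firefly uses either of these is not stated in this commit.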
+
 
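 ## Limitations and Usage Restrictions
 Due to factors such as the limited number of model parameters and the degree to which the training data has been cleaned, the models open-sourced by this project may have the following limitations: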