Finetuning Configuration #195

Open
WAILMAGHRANE opened this issue Jun 1, 2024 · 11 comments
@WAILMAGHRANE

Hi,
Could you let me know whether anyone has successfully fine-tuned the model? I also have a question about GPU requirements: is 31.2 GB needed per GPU, or is that split between two GPUs? Finally, I noticed that Kaggle offers two T4 GPUs; are these sufficient for fine-tuning the model on a custom dataset?
Thanks!
[Screenshot attached: 2024-06-01 013114]

@YuzaChongyi
Collaborator

The 31.2 GB per GPU figure was measured with two A100 GPUs; you can use ZeRO-3 + offload to minimize memory usage. According to the DeepSpeed ZeRO strategy, the more GPUs you have, the lower the memory usage on each GPU. The final memory usage also depends on the max input length and the image resolution.

If you have two T4 GPUs, you can try it by setting use_lora=true, tune_vision=false, batch_size=1, a suitable model_max_length, and a ZeRO-3 config, roughly as sketched below.
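For illustration, a minimal sketch of such a low-memory LoRA run, loosely based on the repository's finetune_lora.sh; the finetune.py entry point, the chosen model_max_length value, and any flags not mentioned in this thread are assumptions to adapt to your local script:

# Hypothetical low-memory LoRA launch on two GPUs; flags follow the thread above,
# everything else is an assumption. T4 GPUs do not support bf16, so fp16 is used.
torchrun --nproc_per_node=2 finetune.py \
    --model_name_or_path openbmb/MiniCPM-Llama3-V-2_5 \
    --use_lora true \
    --tune_vision false \
    --per_device_train_batch_size 1 \
    --model_max_length 512 \
    --fp16 true \
    --deepspeed ds_config_zero3.json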

@whyiug

whyiug commented Jun 3, 2024

Hi @YuzaChongyi, can we fine-tune this model with one A100 (40 GB)?

@YuzaChongyi
Collaborator

@whyiug If you have only one GPU, you can't reduce memory with ZeRO sharding, but you can still reduce GPU memory with ZeRO-Offload. That is the minimum-memory configuration, so you can try it.

@whyiug

whyiug commented Jun 3, 2024

Yeah, I only have one A100 card (40 GB). When I run finetune_lora.sh with these settings:

--model_max_length 1024
--per_device_train_batch_size 1
--deepspeed ds_config_zero3.json

It reports an error:

RuntimeError: "erfinv_cuda" not implemented for 'BFloat16'

maybe on this line: https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/blob/20aecf8831d1d7a3da19bd62f44d1aea82df7fee/resampler.py#L85

Could you tell me how to fix this by changing the code or configuration? Thanks for your quick reply :) @YuzaChongyi

@YuzaChongyi
Collaborator

I haven't encountered this error yet; it may be caused by certain PyTorch versions or other factors. If the error occurs during the resampler initialization step, you can comment out that line, because the checkpoint will be loaded afterwards and will reset the model state_dict. Alternatively, you can use --fp16 true instead of bf16.

@qyc-98
Contributor

qyc-98 commented Jun 4, 2024

Please set:
--bf16 false
--bf16_full_eval false
--fp16 true
--fp16_full_eval true
This is because ZeRO-3 is not compatible with bf16; please use fp16.
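As a sketch, this is how those flags might appear inside finetune_lora.sh; the trailing backslashes and the --deepspeed line are assumptions about how the local script is laid out:

# Hypothetical excerpt from finetune_lora.sh: switch from bf16 to fp16.
    --bf16 false \
    --bf16_full_eval false \
    --fp16 true \
    --fp16_full_eval true \
    --deepspeed ds_config_zero3.json \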

@qyc-98
Contributor

qyc-98 commented Jun 4, 2024

If you only have one A100, change ds_config_zero3.json as follows to offload the parameters and optimizer to CPU and save GPU memory:

"zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
        "device": "cpu",
        "pin_memory": true
    },
    "offload_param": {
        "device": "cpu",
        "pin_memory": true
    }
}

@whyiug

whyiug commented Jun 4, 2024

Yeah, I already did that.

@zhu-j-faceonlive

I changed the bf16/fp16 flags as suggested, but I still get the following error:

File "/home/paperspace/miniconda3/envs/MiniCPM-V/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2117, in unscale_and_clip_grads
    self.fp32_partitioned_groups_flat[sub_group_id].grad.mul_(1. / combined_scale)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

@shituo123456

I made the same change on two 4090 cards and hit the same error. Has this been resolved? Thanks!

@LDLINGLINGLING

This problem usually appears with ZeRO-3 offload. Rebuild DeepSpeed from source with the CPU Adam op precompiled (DS_BUILD_CPU_ADAM=1 builds the CPU optimizer kernel that ZeRO-Offload uses):

git clone https://github.com/microsoft/DeepSpeed.git
cd DeepSpeed
DS_BUILD_CPU_ADAM=1 pip install .
