Finetuning Configuration #195

Open
WAILMAGHRANE opened this issue Jun 1, 2024 · 11 comments
@WAILMAGHRANE

Hi,
Could you let me know whether anyone has successfully fine-tuned the model? I also have a question about GPU requirements: is 31.2 GB needed per GPU, or is that split between two GPUs? Finally, I noticed that Kaggle offers two T4 GPUs; are these sufficient for fine-tuning the model on a custom dataset?
Thanks!
[Screenshot attached: 2024-06-01 013114]

@YuzaChongyi
Collaborator

The 31.2 GB per GPU figure was measured with two A100 GPUs; you can use ZeRO-3 + offload to minimize memory usage. According to the DeepSpeed ZeRO strategy, the more GPUs you have, the lower the memory usage on each GPU. The final memory usage also depends on the max input length and the image resolution.

If you have two T4 GPUs, you can try it by setting use_lora=true, tune_vision=false, batch_size=1, a suitable model_max_length, and a ZeRO-3 config, roughly as sketched below.
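For illustration, a minimal sketch of such a low-memory LoRA run, loosely based on the repository's finetune_lora.sh; the finetune.py entry point, the chosen model_max_length value, and any flags not mentioned in this thread are assumptions to adapt to your local script:

# Hypothetical low-memory LoRA launch on two GPUs; flags follow the thread above,
# everything else is an assumption. T4 GPUs do not support bf16, so fp16 is used.
torchrun --nproc_per_node=2 finetune.py \
    --model_name_or_path openbmb/MiniCPM-Llama3-V-2_5 \
    --use_lora true \
    --tune_vision false \
    --per_device_train_batch_size 1 \
    --model_max_length 512 \
    --fp16 true \
    --deepspeed ds_config_zero3.json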

@whyiug

whyiug commented Jun 3, 2024

Hi @YuzaChongyi, can we fine-tune this model with one A100 (40 GB)?

@YuzaChongyi
Collaborator

@whyiug If you have only one GPU, you can't reduce memory with ZeRO sharding, but you can still reduce GPU memory with ZeRO-Offload. That is the minimum-memory configuration, so you can try it.

@whyiug

whyiug commented Jun 3, 2024

Yeah, I only have one A100 card (40 GB). When I run finetune_lora.sh with these settings:

--model_max_length 1024
--per_device_train_batch_size 1
--deepspeed ds_config_zero3.json

It reports an error:

RuntimeError: "erfinv_cuda" not implemented for 'BFloat16'

maybe on this line: https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/blob/20aecf8831d1d7a3da19bd62f44d1aea82df7fee/resampler.py#L85

Could you tell me how to fix this by changing the code or configuration? Thanks for your quick reply :) @YuzaChongyi

@YuzaChongyi
Collaborator

I haven't encountered this error yet; it may be caused by certain PyTorch versions or other factors. If the error occurs during the resampler initialization step, you can comment out that line, because the checkpoint will be loaded afterwards and will reset the model state_dict. Alternatively, you can use --fp16 true instead of bf16.

@qyc-98
Contributor

qyc-98 commented Jun 4, 2024

Please set:
--bf16 false
--bf16_full_eval false
--fp16 true
--fp16_full_eval true
This is because ZeRO-3 is not compatible with bf16; please use fp16.
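As a sketch, this is how those flags might appear inside finetune_lora.sh; the trailing backslashes and the --deepspeed line are assumptions about how the local script is laid out:

# Hypothetical excerpt from finetune_lora.sh: switch from bf16 to fp16.
    --bf16 false \
    --bf16_full_eval false \
    --fp16 true \
    --fp16_full_eval true \
    --deepspeed ds_config_zero3.json \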

@qyc-98
Contributor

qyc-98 commented Jun 4, 2024

If you only have one A100, change ds_config_zero3.json as follows to offload the parameters and optimizer to CPU and save GPU memory:

"zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
        "device": "cpu",
        "pin_memory": true
    },
    "offload_param": {
        "device": "cpu",
        "pin_memory": true
    }
}

@whyiug

whyiug commented Jun 4, 2024

Yeah, I already did that.

@zhu-j-faceonlive

I changed the bf16/fp16 flags as suggested, but I still get the following error:

File "/home/paperspace/miniconda3/envs/MiniCPM-V/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2117, in unscale_and_clip_grads
    self.fp32_partitioned_groups_flat[sub_group_id].grad.mul_(1. / combined_scale)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

@shituo123456

I made the same change on two 4090 cards and hit the same error. Has this been resolved? Thanks!

@LDLINGLINGLING

This problem usually appears with ZeRO-3 offload. Rebuild DeepSpeed from source with the CPU Adam op precompiled (DS_BUILD_CPU_ADAM=1 builds the CPU optimizer kernel that ZeRO-Offload uses):

git clone https://github.com/microsoft/DeepSpeed.git
cd DeepSpeed
DS_BUILD_CPU_ADAM=1 pip install .
