Finetuning Configuration #195
31.2GB per GPU was tested with two A100 GPUs; you can use zero3 + offload to minimize memory usage. Under the DeepSpeed ZeRO strategy, the more GPUs you have, the lower the memory usage on each GPU. The final memory usage also depends on the max input length and the image resolution. If you have two T4 GPUs, you can try it by setting
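The "more GPUs, lower per-GPU memory" claim can be sketched with some rough arithmetic. The helper below is hypothetical (not part of DeepSpeed), assumes bf16/fp16 params and grads (2 bytes each) plus fp32 Adam states (12 bytes per param) sharded evenly under ZeRO-3, and ignores activations, image features, and CUDA context, so real numbers will differ:

```python
def zero3_per_gpu_gb(n_params: float, n_gpus: int, offload: bool = False) -> float:
    """Rough per-GPU memory for model states under ZeRO stage 3.

    params (2B) + grads (2B) + fp32 Adam states (12B) per parameter,
    all sharded across n_gpus; with offload, optimizer states live on CPU.
    Hypothetical back-of-envelope helper, excludes activations entirely.
    """
    bytes_per_param = 2 + 2 + 12
    if offload:
        bytes_per_param = 2 + 2  # optimizer states moved to CPU RAM
    return n_params * bytes_per_param / n_gpus / 1024**3

# MiniCPM-Llama3-V 2.5 has roughly 8.5B parameters (assumed figure).
print(f"2 GPUs, no offload: {zero3_per_gpu_gb(8.5e9, 2):.1f} GB")
print(f"2 GPUs, offload:    {zero3_per_gpu_gb(8.5e9, 2, offload=True):.1f} GB")
```

Doubling the GPU count roughly halves the model-state share per GPU, which is why the same config that OOMs on one card can fit on two.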
Hi @YuzaChongyi, can we finetune this model with one A100 (40G)?
@whyiug If you have only one GPU, you can't reduce memory with zero sharding. But you can still reduce GPU memory with zero-offload; this is the minimum-memory configuration, so you can try it.
Yeah, I only have one A100 card (40G). After I use
--model_max_length 1024
--per_device_train_batch_size 1
--deepspeed ds_config_zero3.json
it reports an error:
Maybe it is on this line (https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/blob/20aecf8831d1d7a3da19bd62f44d1aea82df7fee/resampler.py#L85). Please tell me how to fix it by changing the code or configuration. Thanks for your quick reply :) @YuzaChongyi
I haven't encountered this error yet; it may be caused by certain PyTorch versions or other reasons. If there is an error during the resampler initialization step, you can comment out that line, because the checkpoint will be loaded afterward and reset the model state_dict. Or you can use
please set --bf16 false |
If you only have one A100, change ds_config_zero3.json as follows to offload the params and optimizer to CPU and save memory:
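The offload change would look roughly like the fragment below. These are the standard DeepSpeed ZeRO-3 offload fields; the repo's actual ds_config_zero3.json may contain additional keys, so treat this as a sketch of only the `zero_optimization` section that changes:

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
```

With both `offload_optimizer` and `offload_param` set to `"cpu"`, the fp32 optimizer states and the sharded parameters are kept in host RAM and streamed to the GPU as needed, trading speed for a much smaller GPU footprint.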
yeah, i already did. |
I changed this, but still got the following error.
I made the same change on two 4090 cards and got the same error. Has this been resolved? Thanks!
This problem generally occurs with zero3,
Hi,
Could you let me know if anyone has successfully fine-tuned the model? Additionally, I have a question about GPU requirements: is 31.2GB needed per GPU, or is it split between two GPUs? Also, I noticed that Kaggle offers 2 T4 GPUs — are these sufficient for fine-tuning the model with a custom dataset?
Thanks!