
Cuda OOM with 20B model #616

Closed · gaarutyunov opened this issue May 4, 2022 · 2 comments
Labels: bug (Something isn't working)

Comments

gaarutyunov commented May 4, 2022

I am trying to finetune the 20B model on the APPS dataset using the slim weights. The config is identical to the one provided in the repository, with some tweaks (listed below), but I am constantly getting an OOM error.

Changes to the configuration (a rough sketch of these overrides follows the list):

  • gradient_accumulation_steps: tried different values in the range 1-32
  • train_micro_batch_size_per_gpu: kept equal to gradient_accumulation_steps
  • zero_optimization: only stage 1 works; CPU offload doesn't. Tried changing the "reduce_bucket_size" parameter and others accordingly.
  • pipe-parallel-size and model-parallel-size: 1x2, 2x2, 4x2. Tried different combinations depending on the number of GPUs available.
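
For reference, the overrides look roughly like this, written as a Python dict purely for illustration (the actual GPT-NeoX configs are YAML files in the repo; the values below are placeholders I varied, not a working recipe):

```python
# Illustrative sketch of the knobs I varied -- key names are the ones listed
# above, values are placeholders; the real config lives in the repo's YAML files.
overrides = {
    "gradient_accumulation_steps": 4,       # swept values in the 1-32 range
    "train_micro_batch_size_per_gpu": 4,    # kept equal to the value above
    "zero_optimization": {
        "stage": 1,                         # only stage 1 worked for me; CPU offload did not
        "reduce_bucket_size": 500_000_000,  # tuned together with related bucket sizes
    },
    "pipe-parallel-size": 4,                # tried 1x2, 2x2 and 4x2 layouts
    "model-parallel-size": 2,
}
```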

Setups I tried:

The only setup that worked was 8 x NVIDIA A100 80 GB SXM. Sadly, that run failed because of an unrelated mistake in the configuration (not relevant here). The problem is that now I have to wait days or weeks to run the finetuning again: I am using my university cluster, which has only 6 nodes with that configuration, and they are always occupied.

Could you please comment on how to finetune the model properly on 2 x NVIDIA Tesla V100 32 GB NVLink or 2 x NVIDIA A100 80 GB SXM? What should the configuration be? Is it even possible?

gaarutyunov added the bug label May 4, 2022

frankang commented Sep 2, 2022

I believe it's not possible to finetune the 20B model on only 2 x 80 GB cards. A 20B model roughly needs 20 * 16 = 320 GB of memory to finetune (about 16 bytes per parameter for the fp16 weights and gradients plus the fp32 Adam optimizer states), so with 80 GB per GPU you will need at least 320 / 80 = 4 cards.
The pipeline-parallel and model-parallel settings should only affect training speed (by the way, pipeline parallelism is faster on most machines).
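
As a back-of-the-envelope check, here is a minimal Python sketch of that estimate (it ignores activations, memory fragmentation, and framework overhead, so treat 4 cards as a floor rather than a guarantee):

```python
import math

# Rule-of-thumb model-state cost for mixed-precision Adam finetuning:
# fp16 weights (2) + fp16 grads (2) + fp32 master weights (4)
# + Adam first/second moments (4 + 4) = 16 bytes per parameter.
params = 20e9          # 20B parameters
bytes_per_param = 16
gpu_memory_gb = 80     # A100 80GB

total_gb = params * bytes_per_param / 1e9        # ~320 GB of model state
min_gpus = math.ceil(total_gb / gpu_memory_gb)   # -> 4 cards

print(f"~{total_gb:.0f} GB of model state -> at least {min_gpus} x {gpu_memory_gb} GB GPUs")
```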

StellaAthena (Member) commented

> I believe it's not possible to finetune the 20B model on only 2 x 80 GB cards. […]

This is correct. There are some parameter-efficient finetuning techniques such as LoRA and Adapters, but naive finetuning as supported in this library requires the same resources as pretraining, not as inference.
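
For context, here is a minimal sketch of the LoRA idea in plain PyTorch (nothing this library ships; the class and names are illustrative): the pretrained weight is frozen and only a small low-rank update is trained, so the gradients and optimizer states that dominate the 16-bytes-per-parameter estimate above exist only for the tiny A/B factors.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)          # freeze pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # Low-rank factors: only these receive gradients and optimizer state.
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W x + (alpha / r) * B A x
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

# Example: wrapping a 6144-wide layer (roughly the 20B model's hidden size)
# leaves ~98k trainable parameters instead of ~37.7M for the full matrix.
layer = LoRALinear(nn.Linear(6144, 6144), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)
```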
