
Cuda OOM with 20B model #616

Closed · gaarutyunov opened this issue May 4, 2022 · 2 comments
Labels: bug (Something isn't working)

Comments

gaarutyunov commented May 4, 2022

I am trying to finetune the 20B model on the APPS dataset using the slim weights. The config is identical to the one provided in the repository, with some tweaks (listed below), but I am constantly getting an OOM error.

Changes to the configuration (a rough sketch of these overrides follows the list):

  • gradient_accumulation_steps: tried different values in the range 1-32
  • train_micro_batch_size_per_gpu: kept equal to gradient_accumulation_steps
  • zero_optimization: only stage 1 works; CPU offload doesn't. Tried changing the "reduce_bucket_size" parameter and others accordingly.
  • pipe-parallel-size and model-parallel-size: 1x2, 2x2, 4x2. Tried different combinations depending on the number of GPUs available.
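
For reference, the overrides look roughly like this, written as a Python dict purely for illustration (the actual GPT-NeoX configs are YAML files in the repo; the values below are placeholders I varied, not a working recipe):

```python
# Illustrative sketch of the knobs I varied -- key names are the ones listed
# above, values are placeholders; the real config lives in the repo's YAML files.
overrides = {
    "gradient_accumulation_steps": 4,       # swept values in the 1-32 range
    "train_micro_batch_size_per_gpu": 4,    # kept equal to the value above
    "zero_optimization": {
        "stage": 1,                         # only stage 1 worked for me; CPU offload did not
        "reduce_bucket_size": 500_000_000,  # tuned together with related bucket sizes
    },
    "pipe-parallel-size": 4,                # tried 1x2, 2x2 and 4x2 layouts
    "model-parallel-size": 2,
}
```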

Setups I tried:

The only setup that worked was 8 x NVIDIA A100 80 GB SXM. Sadly, that run failed because of an unrelated mistake in the configuration (not relevant here). The problem is that now I have to wait days or weeks to run the finetuning again: I am using my university cluster, which has only 6 nodes with that configuration, and they are always occupied.

Could you please comment on how to finetune the model properly on 2 x NVIDIA Tesla V100 32 GB NVLink or 2 x NVIDIA A100 80 GB SXM? What should the configuration be? Is it even possible?

gaarutyunov added the bug label May 4, 2022

frankang commented Sep 2, 2022

I believe it's not possible to finetune the 20B model on only 2 x 80 GB cards. A 20B model roughly needs 20 * 16 = 320 GB of memory to finetune (about 16 bytes per parameter for the fp16 weights and gradients plus the fp32 Adam optimizer states), so with 80 GB per GPU you will need at least 320 / 80 = 4 cards.
The pipeline-parallel and model-parallel settings should only affect training speed (by the way, pipeline parallelism is faster on most machines).
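
As a back-of-the-envelope check, here is a minimal Python sketch of that estimate (it ignores activations, memory fragmentation, and framework overhead, so treat 4 cards as a floor rather than a guarantee):

```python
import math

# Rule-of-thumb model-state cost for mixed-precision Adam finetuning:
# fp16 weights (2) + fp16 grads (2) + fp32 master weights (4)
# + Adam first/second moments (4 + 4) = 16 bytes per parameter.
params = 20e9          # 20B parameters
bytes_per_param = 16
gpu_memory_gb = 80     # A100 80GB

total_gb = params * bytes_per_param / 1e9        # ~320 GB of model state
min_gpus = math.ceil(total_gb / gpu_memory_gb)   # -> 4 cards

print(f"~{total_gb:.0f} GB of model state -> at least {min_gpus} x {gpu_memory_gb} GB GPUs")
```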

StellaAthena (Member) commented

> I believe it's not possible to finetune the 20B model on only 2 x 80 GB cards. […]

This is correct. There are some parameter-efficient finetuning techniques such as LoRA and Adapters, but naive finetuning as supported in this library requires the same resources as pretraining, not as inference.
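
For context, here is a minimal sketch of the LoRA idea in plain PyTorch (nothing this library ships; the class and names are illustrative): the pretrained weight is frozen and only a small low-rank update is trained, so the gradients and optimizer states that dominate the 16-bytes-per-parameter estimate above exist only for the tiny A/B factors.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)          # freeze pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # Low-rank factors: only these receive gradients and optimizer state.
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W x + (alpha / r) * B A x
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

# Example: wrapping a 6144-wide layer (roughly the 20B model's hidden size)
# leaves ~98k trainable parameters instead of ~37.7M for the full matrix.
layer = LoRALinear(nn.Linear(6144, 6144), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)
```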
