13B Model Out of Memory with Single Node 8 A100 GPUs #409

Closed
benathi opened this issue Sep 16, 2021 · 14 comments

@benathi

benathi commented Sep 16, 2021

Hi!

Thanks for your contribution in making this repo available :)

I tried to train the 13B model with micro batch size 1 and model parallelism degree 8, but I was unable to get it to work (I always get OOM). The library advertises being able to scale up to 100B parameters; what is required for that? I also tried DeepSpeed stage 3 with offload, without using pipeline parallelism, but that doesn't seem to work either. Please let me know what I'm missing. Thanks!
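For reference, a ZeRO stage 3 + CPU offload block in a DeepSpeed-style configuration usually looks something like the sketch below (written here as a Python dict; these are generic, illustrative settings, not the exact ones tried in this issue):

# Generic illustration of DeepSpeed ZeRO stage 3 with CPU offload.
# These values are placeholders, not the configuration used in this issue.
zero3_offload_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    }
}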

@StellaAthena
Member

Can you post the exact config file you are using?

@EricHallahan
Contributor

Can you also provide details of your hardware?

@benathi
Author

benathi commented Sep 16, 2021 via email

@benathi
Author

benathi commented Sep 16, 2021

I adapted the provided config 13B.yaml and changed the model parallelism degree to 8 with batch size = 1.

# GPT-2 pretraining setup
{
   # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages
   # across the node boundaries )
   "pipe-parallel-size": 1,
   "model-parallel-size": 8,

   # model settings
   "num-layers": 40,
   "hidden-size": 5120,
   "num-attention-heads": 40,
   "seq-length": 2048,
   "max-position-embeddings": 2048,
   "norm": "layernorm",
   "pos-emb": "rotary",
   "no-weight-tying": true,

   # these should provide some speedup but takes a while to build, set to true if desired
   "scaled-upper-triang-masked-softmax-fusion": false,
   "bias-gelu-fusion": false,

   # optimizer settings
   "optimizer": {
     "type": "Adam",
     "params": {
       "lr": 0.0001,
       "betas": [0.9, 0.999],
       "eps": 1.0e-8,
     }
   },
   "zero_optimization": {
    "stage": 1,
    "allgather_partitions": True,
    "allgather_bucket_size": 500000000,
    "overlap_comm": True,
    "reduce_scatter": True,
    "reduce_bucket_size": 500000000,
    "contiguous_gradients": True,
    "cpu_offload": False
  },

   # batch / data settings
   "train_micro_batch_size_per_gpu": 1,
   "data-impl": "mmap",
   "split": "949,50,1",

   # activation checkpointing
   "checkpoint-activations": true,
   "checkpoint-num-layers": 1,
   "partition-activations": true,
   "synchronize-each-layer": true,

   # regularization
   "gradient_clipping": 1.0,
   "weight-decay": 0,
   "hidden-dropout": 0,
   "attention-dropout": 0,

   # precision settings
   "fp16": { 
     "fp16": true,
     "enabled": true,
     "loss_scale": 0,
     "loss_scale_window": 1000,
     "hysteresis": 2,
     "min_loss_scale": 1
   },

   # misc. training settings
   "train-iters": 320000,
   "lr-decay-iters": 320000,
   "distributed-backend": "nccl",
   "lr-decay-style": "cosine",
   "warmup": 0.01,
   "save-interval": 10000,
   "eval-interval": 1000,
   "eval-iters": 10,

   # logging
   "log-interval": 100,
   "steps_per_print": 10,
   "keep-last-n-checkpoints": 4,
   "wall_clock_breakdown": true,
}

@sweinbach
Contributor

sweinbach commented Sep 17, 2021

The above config should run. Try setting scaled-upper-triang-masked-softmax-fusion and bias-gelu-fusion to True. This also saves on memory.

Also note that the above config has a train_micro_batch_size_per_gpu of 1. On 8 GPUs that results in a data-parallel degree of 1 (8 / pipe-parallel-size / model-parallel-size) and hence a global batch size of 1. I suggest finding a good combination of micro batch size and gradient accumulation steps to get a decent batch size. See here for the calculation.
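To make that arithmetic concrete, here is a minimal sketch (the function name is illustrative, not a gpt-neox internal):

# Effective global batch size = micro batch per GPU * gradient accumulation
# steps * data-parallel degree, where data parallelism is whatever is left
# over after pipeline and model parallelism.
def global_batch_size(num_gpus, pipe_parallel, model_parallel,
                      micro_batch_per_gpu, grad_accum_steps=1):
    data_parallel = num_gpus // (pipe_parallel * model_parallel)
    return micro_batch_per_gpu * grad_accum_steps * data_parallel

print(global_batch_size(8, 1, 8, 1))                        # -> 1
print(global_batch_size(8, 1, 8, 1, grad_accum_steps=32))   # -> 32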

Having said that, a 13B model will take a long time to train on only 8 GPUs.

@benathi
Author

benathi commented Sep 17, 2021

Thank you! I'll try that and will let you know. Does the original config 13B.yml also run on a single node with 8 GPUs, and what hardware setup was it tested on? Thanks again for the fast reply :)

@sweinbach
Contributor

I have not tested the 13B on a single node with 8 A100s. It is also somewhat tricky to balance: the reason is the relatively large embedding and lm-head layers, which take a lot of memory.
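For a rough sense of scale, here is a small sketch of that overhead; the vocabulary size is an assumption (it is not stated in this thread), so treat the numbers as order-of-magnitude only:

# Back-of-the-envelope size of the embedding / lm-head for the 13B config.
hidden_size = 5120
vocab_size = 50_304   # assumed ~50k (padded) vocabulary; the real value may differ
embed_params = vocab_size * hidden_size   # ~258M parameters
# With "no-weight-tying": true the lm-head is a second matrix of the same size,
# so the first and last pipeline stages each carry roughly an extra 0.5 GB of
# fp16 weights before gradients and optimizer states are counted.
print(f"{embed_params / 1e6:.0f}M params, {embed_params * 2 / 1e9:.2f} GB in fp16")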

@benathi
Author

benathi commented Sep 17, 2021

I tried setting scaled-upper-triang-masked-softmax-fusion and bias-gelu-fusion to True, but it doesn't seem to work with either the provided 13B config or my modified config with the batch size lowered to 1.

By the way, what hardware setup was the provided 13B config tested on? If it's not that many nodes, I can try to replicate it. Does the memory reduction come mostly from ZeRO stage 1, since it splits the optimizer states across nodes?

Thanks :)

@sweinbach
Contributor

I don't know the smallest hardware setup people have tried it on. On 8 GPUs I would estimate a training time of ~2 years, though (16 GPUs, ~1 year). That does not seem feasible.
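A back-of-the-envelope version of that estimate; the token budget and sustained throughput below are assumptions, not numbers from this thread:

# Rough training-time estimate using the standard 6*N*D FLOP approximation.
params = 13e9
tokens = 300e9                      # assumed GPT-3-style token budget
flops_needed = 6 * params * tokens  # ~2.3e22 FLOPs
per_gpu_flops = 312e12 * 0.20       # A100 fp16 peak * ~20% assumed sustained efficiency
gpus = 8
seconds = flops_needed / (per_gpu_flops * gpus)
print(f"{seconds / 86400 / 365:.1f} years")   # ~1.5 years under these assumptions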

@seeEssex

Hi!

I would like to ask if you managed to get it to work eventually. Thanks.

@StellaAthena
Member

@seeEssex Why do you want to do this? It would take years to train the model even if it were to be made to fit.

@seeEssex

seeEssex commented Apr 14, 2022

@StellaAthena I was trying to fit the model for finetuning, as opposed to training the whole thing.

Would that still take a very significant amount of time? Thanks

@StellaAthena
Member

@seeEssex there does not currently exist a public 13B model to finetune. The only model we have released so far that is larger than GPT-J is a 20B parameter model. I do know someone who is finetuning it, and can inquire about their hardware and performance.

@jennyzzt

jennyzzt commented Oct 25, 2022

What is the hardware and speed for the person fine-tuning the 20B parameter model above?

I am interested in using GPT-NeoX-20B for fine-tuning. Would 2 servers of 8 A100 GPUs be sufficient? From your repo, the model weights and optimizer states total 268 GB. My intuition is that since 2 servers of 8 A100 GPUs have a total memory of 1280 GB, it should be more than enough. However, given the relatively large embedding and lm-head layers, I wonder whether it will be sufficient.
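For a rough sense of the static memory, here is the standard fp16 + Adam accounting (a sketch only; the 268 GB figure quoted from the repo reflects its own checkpoint layout and may not match this breakdown):

# Static memory for full fine-tuning of a ~20B-parameter model with Adam in fp16.
params = 20e9
fp16_weights = 2 * params    # ~40 GB
fp16_grads   = 2 * params    # ~40 GB
adam_states  = 12 * params   # fp32 master weights + momentum + variance, ~240 GB
total_gb = (fp16_weights + fp16_grads + adam_states) / 1e9
print(f"~{total_gb:.0f} GB before activations")   # ~320 GB
# Spread over 16 x 80 GB A100s (1280 GB) this leaves headroom for activations,
# but the per-GPU balance still depends on how the model- and pipeline-parallel
# partitions (including the embedding and lm-head) are laid out.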
