13B Model Out of Memory with Single Node 8 A100 GPUs #409

Closed
benathi opened this issue Sep 16, 2021 · 14 comments

@benathi

benathi commented Sep 16, 2021

Hi!

Thanks for your contribution in making this repo available :)

I tried to train the 13B model with micro batch size 1 and model parallelism degree 8, but I was unable to get it to work (I always get OOM). The library advertises being able to scale up to 100B parameters; what is required for that? I also tried DeepSpeed stage 3 with offload, without using pipeline parallelism, but that doesn't seem to work either. Please let me know what I'm missing. Thanks!
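For reference, a ZeRO stage 3 + CPU offload block in a DeepSpeed-style configuration usually looks something like the sketch below (written here as a Python dict; these are generic, illustrative settings, not the exact ones tried in this issue):

# Generic illustration of DeepSpeed ZeRO stage 3 with CPU offload.
# These values are placeholders, not the configuration used in this issue.
zero3_offload_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    }
}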

@StellaAthena
Member

Can you post the exact config file you are using?

@EricHallahan
Contributor

Can you also provide details of your hardware?

@benathi
Author

benathi commented Sep 16, 2021 via email

@benathi
Author

benathi commented Sep 16, 2021

I adapted the provided config 13B.yaml and changed the model parallelism degree to 8 with batch size = 1.

# GPT-2 pretraining setup
{
   # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages
   # across the node boundaries )
   "pipe-parallel-size": 1,
   "model-parallel-size": 8,

   # model settings
   "num-layers": 40,
   "hidden-size": 5120,
   "num-attention-heads": 40,
   "seq-length": 2048,
   "max-position-embeddings": 2048,
   "norm": "layernorm",
   "pos-emb": "rotary",
   "no-weight-tying": true,

   # these should provide some speedup but takes a while to build, set to true if desired
   "scaled-upper-triang-masked-softmax-fusion": false,
   "bias-gelu-fusion": false,

   # optimizer settings
   "optimizer": {
     "type": "Adam",
     "params": {
       "lr": 0.0001,
       "betas": [0.9, 0.999],
       "eps": 1.0e-8,
     }
   },
   "zero_optimization": {
    "stage": 1,
    "allgather_partitions": True,
    "allgather_bucket_size": 500000000,
    "overlap_comm": True,
    "reduce_scatter": True,
    "reduce_bucket_size": 500000000,
    "contiguous_gradients": True,
    "cpu_offload": False
  },

   # batch / data settings
   "train_micro_batch_size_per_gpu": 1,
   "data-impl": "mmap",
   "split": "949,50,1",

   # activation checkpointing
   "checkpoint-activations": true,
   "checkpoint-num-layers": 1,
   "partition-activations": true,
   "synchronize-each-layer": true,

   # regularization
   "gradient_clipping": 1.0,
   "weight-decay": 0,
   "hidden-dropout": 0,
   "attention-dropout": 0,

   # precision settings
   "fp16": { 
     "fp16": true,
     "enabled": true,
     "loss_scale": 0,
     "loss_scale_window": 1000,
     "hysteresis": 2,
     "min_loss_scale": 1
   },

   # misc. training settings
   "train-iters": 320000,
   "lr-decay-iters": 320000,
   "distributed-backend": "nccl",
   "lr-decay-style": "cosine",
   "warmup": 0.01,
   "save-interval": 10000,
   "eval-interval": 1000,
   "eval-iters": 10,

   # logging
   "log-interval": 100,
   "steps_per_print": 10,
   "keep-last-n-checkpoints": 4,
   "wall_clock_breakdown": true,
}

@sweinbach
Contributor

sweinbach commented Sep 17, 2021

The above config should run. Try setting scaled-upper-triang-masked-softmax-fusion and bias-gelu-fusion to True. This also saves on memory.

Also note that the above config has a train_micro_batch_size_per_gpu of 1. On 8 GPUs that results in a data-parallel degree of 1 (8 / pipe-parallel-size / model-parallel-size) and hence a global batch size of 1. I suggest finding a good combination of micro batch size and gradient accumulation steps to get a decent batch size. See here for the calculation.
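To make that arithmetic concrete, here is a minimal sketch (the function name is illustrative, not a gpt-neox internal):

# Effective global batch size = micro batch per GPU * gradient accumulation
# steps * data-parallel degree, where data parallelism is whatever is left
# over after pipeline and model parallelism.
def global_batch_size(num_gpus, pipe_parallel, model_parallel,
                      micro_batch_per_gpu, grad_accum_steps=1):
    data_parallel = num_gpus // (pipe_parallel * model_parallel)
    return micro_batch_per_gpu * grad_accum_steps * data_parallel

print(global_batch_size(8, 1, 8, 1))                        # -> 1
print(global_batch_size(8, 1, 8, 1, grad_accum_steps=32))   # -> 32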

Having said that, a 13B model will take a long time to train on only 8 GPUs.

@benathi
Author

benathi commented Sep 17, 2021

Thank you! I'll try that and will let you know. Does the original config 13B.yml also run on a single node with 8 GPUs, and what hardware setup was it tested on? Thanks again for the fast reply :)

@sweinbach
Contributor

I have not tested the 13B on a single node with 8 A100s. It is also somewhat tricky to balance: the reason is the relatively large embedding and lm-head layers, which take a lot of memory.
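For a rough sense of scale, here is a small sketch of that overhead; the vocabulary size is an assumption (it is not stated in this thread), so treat the numbers as order-of-magnitude only:

# Back-of-the-envelope size of the embedding / lm-head for the 13B config.
hidden_size = 5120
vocab_size = 50_304   # assumed ~50k (padded) vocabulary; the real value may differ
embed_params = vocab_size * hidden_size   # ~258M parameters
# With "no-weight-tying": true the lm-head is a second matrix of the same size,
# so the first and last pipeline stages each carry roughly an extra 0.5 GB of
# fp16 weights before gradients and optimizer states are counted.
print(f"{embed_params / 1e6:.0f}M params, {embed_params * 2 / 1e9:.2f} GB in fp16")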

@benathi
Author

benathi commented Sep 17, 2021

I tried setting scaled-upper-triang-masked-softmax-fusion and bias-gelu-fusion to True, but it doesn't seem to work with either the provided 13B config or my modified config with the batch size lowered to 1.

By the way, what hardware setup was the provided 13B config tested on? If it's not that many nodes, I can try to replicate it. Does the memory reduction come mostly from ZeRO stage 1, since it splits the optimizer states across nodes?

Thanks :)

@sweinbach
Contributor

I don't know the smallest hardware setup people have tried it on. On 8 GPUs I would estimate a training time of ~2 years, though (16 GPUs, ~1 year). That does not seem feasible.
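A back-of-the-envelope version of that estimate; the token budget and sustained throughput below are assumptions, not numbers from this thread:

# Rough training-time estimate using the standard 6*N*D FLOP approximation.
params = 13e9
tokens = 300e9                      # assumed GPT-3-style token budget
flops_needed = 6 * params * tokens  # ~2.3e22 FLOPs
per_gpu_flops = 312e12 * 0.20       # A100 fp16 peak * ~20% assumed sustained efficiency
gpus = 8
seconds = flops_needed / (per_gpu_flops * gpus)
print(f"{seconds / 86400 / 365:.1f} years")   # ~1.5 years under these assumptions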

@seeEssex

Hi!

I would like to ask if you managed to get it to work eventually. Thanks.

@StellaAthena
Member

@seeEssex Why do you want to do this? It would take years to train the model even if it were to be made to fit.

@seeEssex

seeEssex commented Apr 14, 2022

@StellaAthena I was trying to fit the model for finetuning, as opposed to training the whole thing.

Would that still take a very significant amount of time? Thanks

@StellaAthena
Member

@seeEssex there does not currently exist a public 13B model to finetune. The only model we have released so far that is larger than GPT-J is a 20B parameter model. I do know someone who is finetuning it, and can inquire about their hardware and performance.

@jennyzzt

jennyzzt commented Oct 25, 2022

What is the hardware and speed for the person fine-tuning the 20B parameter model above?

I am interested in using GPT-NeoX-20B for fine-tuning. Would 2 servers of 8 A100 GPUs be sufficient? From your repo, the model weights and optimizer states total 268 GB. My intuition is that since 2 servers of 8 A100 GPUs have a total memory of 1280 GB, it should be more than enough. However, given the relatively large embedding and lm-head layers, I wonder whether it will be sufficient.
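For a rough sense of the static memory, here is the standard fp16 + Adam accounting (a sketch only; the 268 GB figure quoted from the repo reflects its own checkpoint layout and may not match this breakdown):

# Static memory for full fine-tuning of a ~20B-parameter model with Adam in fp16.
params = 20e9
fp16_weights = 2 * params    # ~40 GB
fp16_grads   = 2 * params    # ~40 GB
adam_states  = 12 * params   # fp32 master weights + momentum + variance, ~240 GB
total_gb = (fp16_weights + fp16_grads + adam_states) / 1e9
print(f"~{total_gb:.0f} GB before activations")   # ~320 GB
# Spread over 16 x 80 GB A100s (1280 GB) this leaves headroom for activations,
# but the per-GPU balance still depends on how the model- and pipeline-parallel
# partitions (including the embedding and lm-head) are laid out.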
