13B Model Out of Memory with Single Node 8 A100 GPUs #409
Comments
Can you post the exact config file you are using?
Can you also provide details of your hardware?
I’m using an AWS p4 node with 8 of A100 GPUs :)
I adapted the provided config 13B.yaml and changed the model parallelism degree to 8 with batch size = 1.
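For reference, the kind of override being described might look like the following sketch. The key names follow the GPT-NeoX config style used elsewhere in this thread, but this fragment is illustrative and not verified against the actual 13B.yml:

```yaml
# Hypothetical excerpt of an adapted 13B config (not the repo's exact file):
"model-parallel-size": 8
"pipe-parallel-size": 1
"train_micro_batch_size_per_gpu": 1
```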
The above config should run. Try setting scaled-upper-triang-masked-softmax-fusion and bias-gelu-fusion to True; this also saves memory. Also note that the above config has a train_micro_batch_size_per_gpu of 1. On 8 GPUs that results in a data-parallel degree of 1 (8 / pipe-parallel-size / model-parallel-size) and hence a global batch size of 1. I suggest finding a good combination of micro batch size and gradient-accumulation steps to get a decent batch size. See here for the calculation. Having said that, a 13B model will take a long time to train on only 8 GPUs.
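The batch-size arithmetic described above can be sketched as follows. This is a minimal illustration of the formula in the comment (global batch = micro batch × gradient-accumulation steps × data-parallel degree); the function names are my own, not part of the repo:

```python
def data_parallel_size(world_size, pipe_parallel_size, model_parallel_size):
    """Data-parallel replicas = world_size / (pipe parallel * model parallel)."""
    return world_size // (pipe_parallel_size * model_parallel_size)

def global_batch_size(micro_batch_per_gpu, grad_accum_steps, world_size,
                      pipe_parallel_size=1, model_parallel_size=1):
    """Effective global batch size for one optimizer step."""
    dp = data_parallel_size(world_size, pipe_parallel_size, model_parallel_size)
    return micro_batch_per_gpu * grad_accum_steps * dp

# 8 GPUs, model parallelism 8, micro batch 1, no gradient accumulation:
print(global_batch_size(1, 1, 8, model_parallel_size=8))  # -> 1
```

Raising gradient-accumulation steps (e.g. to 32) multiplies the global batch size without increasing per-GPU memory for activations of a single micro batch, which is why it is the suggested knob here.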
Thank you! I’ll try that and will let you know. Does the original 13B.yml config also run on a single node with 8 GPUs, and what hardware setup was it tested on? Thanks again for the fast reply :)
I have not tested the 13B on a single node with 8 A100s. It is also somewhat tricky to balance: the relatively large embedding and lm-head layers take a lot of memory.
I tried setting scaled-upper-triang-masked-softmax-fusion and bias-gelu-fusion to True, but it doesn't seem to work, either with the provided 13B script or with my modified script that lowers the batch size to 1. What hardware setup was the provided 13B config tested on, by the way? If it's not too many nodes, I can try to replicate it. Is the memory reduction mostly from ZeRO stage 1, since it splits the optimizer states across GPUs? Thanks :)
I don't know the smallest hardware people have tried it on. On 8 GPUs I would estimate a training time of ~2 years, though (16 GPUs ~ 1 year). That doesn't seem feasible.
Hi! I would like to ask if you managed to get it to work eventually. Thanks.
@seeEssex Why do you want to do this? It would take years to train the model even if it were made to fit.
@StellaAthena I was trying to fit the model for fine-tuning, as opposed to training the whole thing. Would that still take a significant amount of time? Thanks
What hardware and speed is the person fine-tuning the 20B-parameter model above using? I am interested in using GPT-NeoX-20B for fine-tuning. Would two servers with 8 A100 GPUs each be sufficient? From your repo, the model weights and optimizer states total 268 GB. My intuition is that since two servers of 8 A100 GPUs have a total of 1,280 GB of GPU memory, it should be more than enough. However, given the relatively large embedding and lm-head layers, I wonder if it will be sufficient?
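A rough back-of-the-envelope check of that intuition, assuming mixed-precision Adam training (this per-parameter accounting is a common rule of thumb, not the repo's exact numbers, and it ignores activations, gradients by default, and framework overhead):

```python
def training_memory_gb(n_params, include_grads=False):
    """Rough training memory for mixed-precision Adam (an assumption):
    2 bytes fp16 weights + 4 bytes fp32 master copy + 8 bytes Adam moments,
    plus optionally 2 bytes for fp16 gradients."""
    bytes_per_param = 2 + 4 + 8 + (2 if include_grads else 0)
    return n_params * bytes_per_param / 1e9

# 20B parameters -> weights + optimizer states only:
print(round(training_memory_gb(20e9)))  # -> 280
```

This lands in the same ballpark as the quoted 268 GB of weights plus optimizer states, so 1,280 GB of aggregate GPU memory leaves headroom, but whether it actually fits depends on how evenly the layers (especially the embedding and lm-head) balance across the parallelism axes.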
Hi!
Thanks for making this repo available :)
I tried to train the 13B model with micro batch size 1 and model parallelism degree 8, but I am unable to get it to work (I always get OOM). The library advertises being able to scale up to 100B parameters; what is required for this? I also tried DeepSpeed stage 3 with offload, without pipeline parallelism, but that doesn't seem to work either. Please let me know what I'm missing. Thanks!