Batch size auto OOM #1678
Hi!
To clarify: it looks like this is what was added. You're seeing that the auto batch size is found successfully without OOM, but then later during evaluation there is still an OOM? To help diagnose the issue: what model are you running, and what task type (generative or loglikelihood-based) is being used? Or does this happen reliably across models? Thanks!
Yep, exactly.
I was getting this issue when running MMLU and AGIEval with transformers using logprobs. It happens with some models and not others; I'd say maybe 10-20% of the models I was testing (out of ~100 or so) had this issue. If I recall correctly, I was getting it quite a lot with 34B models. I'll see if I can come up with exact settings to repro.
(This is with a fresh install of the latest lm-eval on a 3090 RunPod instance.)
So it's detecting the batch size as 2 without errors, then throws an OOM after a few samples have been processed.
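For illustration of why this can happen: auto-detection generally probes a forward pass with progressively larger batches and keeps the largest one that fits. Below is a minimal sketch of that idea; the function and argument names are made up for the example and this is not lm-eval's actual implementation. A probe like this only measures one representative batch, so later batches with longer contexts, or fragmentation that builds up over the run, can still push past the limit.

```python
import torch

def probe_max_batch_size(model, sample_input, start=1, limit=512):
    """Double the batch until a forward pass OOMs; return the last size that fit.

    Illustrative only: this measures memory for a single representative batch,
    so later evaluation batches can still OOM even though the probe succeeded.
    """
    last_ok = 0
    batch_size = start
    while batch_size <= limit:
        try:
            # sample_input is assumed to be a (1, seq_len) token tensor
            batch = sample_input.repeat(batch_size, 1)
            with torch.no_grad():
                model(batch)
            last_ok = batch_size
            batch_size *= 2
        except torch.cuda.OutOfMemoryError:
            break
        finally:
            torch.cuda.empty_cache()
    return last_ok
```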
Hm, thank you for sharing the command! My best guesses right now are that either 1) this batch size is just really close to the card's limit and after a couple of batches it gets pushed over due to fragmentation or something, or 2) we're accidentally not truncating something somewhere. I'm away this week but will try to investigate ASAP.
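If guess 1) turns out to be right, one thing worth trying is PyTorch's caching-allocator configuration, which can reduce fragmentation-related OOMs. This is a sketch of a possible mitigation, not a confirmed fix for this issue; it assumes the environment variable is set before CUDA is initialized.

```python
import os

# Set before the first CUDA call (e.g., export it in the shell or put it at the
# very top of the entry script) so PyTorch's caching allocator picks it up.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
# On older PyTorch versions, capping split sizes is the usual alternative:
# os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported only after the allocator config is in place
```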
Any updates on this issue? I'm getting the exact same error with specific models (if it happens for a model, it happens every time), in both single- and multi-GPU setups. Some models run without errors in 0-shot scenarios but get OOM in few-shot (or vice versa).
Anybody else get constant OOM with batch size set to auto?
I "fixed" it with a stupid but effective workaround:
models/huggingface.py
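The actual patch isn't included in the thread. Purely as an illustration of what a workaround of this shape could look like (the 0.75 factor and the helper name are assumptions, not the poster's code): shrink whatever batch size the auto-search returns before using it, so later batches have some headroom.

```python
# Hypothetical sketch, NOT the poster's actual change to models/huggingface.py.
SAFETY_FACTOR = 0.75  # assumption: leave ~25% memory headroom

def apply_safety_margin(detected_batch_size: int) -> int:
    """Back off the auto-detected batch size, never going below 1."""
    return max(1, int(detected_batch_size * SAFETY_FACTOR))

# e.g. a detected size of 2 would become 1, and 16 would become 12
```

The obvious trade-off is somewhat slower evaluation in exchange for fewer mid-run OOMs.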