accelerate doesn't work with auto:(>1) #1743
You can just use auto; see https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md
Correct me if I'm wrong, but as we go down in the evaluation, the sample lengths may get shorter, so we may be able to fit more samples per batch later on. I was using auto:2 for exactly such cases, because I want to search for the max batch size once again.
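For reference, a minimal sketch of the two forms through the Python API (assuming the harness's lm_eval.simple_evaluate entry point; the model and task names below are placeholders, not anyone's actual setup):

```python
import lm_eval

# "auto" searches once for the largest batch size that fits in memory;
# "auto:N" re-runs that search N more times as evaluation proceeds, so
# shorter samples later in the length-sorted run can use larger batches.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-1.4b",  # placeholder model
    tasks=["hellaswag"],                             # placeholder task
    batch_size="auto:2",
)
print(results["results"])
```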
I also ran into this problem. After all loglikelihood requests are finished, the process hangs with no further output while CPU/GPU utilization stays maxed out.
Hi! I'll look into this. I suspect the padding across ranks is slightly off somewhere, or else the batch sizes get unsynced. @ozgurcelik, do you have a sample command which exhibits this problem?
Hi. I realized that accelerate launch works perfectly when I set batch_size = "auto" but gets stuck at the very end when I use batch_size = "auto:2". The problem persists whether I use evaluator.simple_evaluate or the terminal call accelerate launch -m lm_eval. This is a problem since different tasks may have different optimal batch sizes.
```
Passed argument batch_size = auto:2.0. Detecting largest batch size
Running loglikelihood requests:   0%| 0/1644 [00:00<?, ?it/s]
Passed argument batch_size = auto:2.0. Detecting largest batch size
Passed argument batch_size = auto:2.0. Detecting largest batch size
Passed argument batch_size = auto:2.0. Detecting largest batch size
Determined largest batch size: 16
Determined largest batch size: 16
Determined largest batch size: 16
Determined largest batch size: 16
Running loglikelihood requests:  40%| 657/1644 [00:27<00:26, 37.06it/s]
Passed argument batch_size = auto:2.0. Detecting largest batch size
Passed argument batch_size = auto:2.0. Detecting largest batch size
Running loglikelihood requests:  41%| 673/1644 [00:28<00:26, 37.21it/s]
Passed argument batch_size = auto:2.0. Detecting largest batch size
Running loglikelihood requests:  42%| 689/1644 [00:28<00:25, 37.21it/s]
Passed argument batch_size = auto:2.0. Detecting largest batch size
Determined largest batch size: 16
Determined largest batch size: 16
Determined largest batch size: 16
Determined largest batch size: 16
Running loglikelihood requests:  90%| 1478/1644 [00:41<00:00, 209.33it/s]
Passed argument batch_size = auto:2.0. Detecting largest batch size
Running loglikelihood requests:  97%| 1598/1644 [00:41<00:00, 234.70it/s]
Passed argument batch_size = auto:2.0. Detecting largest batch size
Running loglikelihood requests: 100%| 1644/1644 [00:41<00:00, 39.17it/s]
Map: 100%| 5000/5000 [00:01<00:00, 3971.36 examples/s]
Map: 100%| 5000/5000 [00:01<00:00, 3402.27 examples/s]
```
This is how it looks when it gets stuck with auto:2: near the very end it unnecessarily tries to find the optimal batch size again and then never finishes the task.
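For anyone trying to reproduce this, a command of the following shape matches the configuration described above (the model and task are placeholder values, not the reporter's exact setup):

```
accelerate launch -m lm_eval \
    --model hf \
    --model_args pretrained=EleutherAI/pythia-1.4b \
    --tasks hellaswag \
    --batch_size auto:2
```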