
Load fp32 models in bfloat16 when possible #231

Merged: 2 commits merged into main from auto-bfloat16 on May 3, 2023

Conversation

@norabelrose (Member) commented May 1, 2023

Several models that we'd like to evaluate on, like `bigscience/mt0-xxl` and `allenai/unifiedqa-t5-11b`, have float32 checkpoints but were actually trained in bfloat16 on TPUs. Because they're float32, we get out-of-memory errors when trying to run inference on them. This PR automatically detects whether a checkpoint is (likely) float32 before downloading it, and sets `torch_dtype=torch.bfloat16` iff `torch.cuda.is_bf16_supported()` is True.
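
For context, a minimal sketch of the idea (the helper name `pick_torch_dtype` and the use of `AutoConfig` here are illustrative assumptions, not necessarily the exact code in this PR):

```python
import torch
from transformers import AutoConfig, AutoModelForSeq2SeqLM

def pick_torch_dtype(model_str: str) -> torch.dtype:
    """Guess a sensible load dtype from the config, before the weights are downloaded."""
    config = AutoConfig.from_pretrained(model_str)

    # Configs usually record the dtype the checkpoint was saved in; it may be a
    # string like "float32", or missing entirely for older models.
    saved = getattr(config, "torch_dtype", None) or torch.float32
    if isinstance(saved, str):
        saved = getattr(torch, saved)

    if saved == torch.float32 and torch.cuda.is_bf16_supported():
        print(f"Loading fp32 checkpoint '{model_str}' in bfloat16 to save memory.")
        return torch.bfloat16

    return saved

model_str = "bigscience/mt0-xxl"
model = AutoModelForSeq2SeqLM.from_pretrained(model_str, torch_dtype=pick_torch_dtype(model_str))
```

The config download is tiny, so checking it first avoids pulling a 40+ GB float32 checkpoint only to discover it doesn't fit in memory.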

Some older models, like `gpt2`, have fp32 checkpoints and really were trained in full precision. But it's nearly impossible for an overflow to occur when running these models in bfloat16, since bf16 has a dynamic range almost equal to that of fp32. There is a bit of precision loss, but empirically neural nets are highly robust to this as long as there aren't any overflows, so this should be fine. We also print a warning when the downcasting does occur. Maybe we should add a flag to turn off this automatic downcasting, but I haven't included it in this PR for simplicity.
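
Not part of the PR, but the dynamic-range claim is easy to check with `torch.finfo`: bf16 keeps fp32's exponent range (so overflow is about as unlikely as in fp32) while giving up mantissa bits, whereas fp16 has a far smaller range:

```python
import torch

for dtype in (torch.float32, torch.bfloat16, torch.float16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):>14}  max = {info.max:.3e}  eps = {info.eps:.1e}")

# torch.float32   max = 3.403e+38  eps = 1.2e-07
# torch.bfloat16  max = 3.390e+38  eps = 7.8e-03   <- same exponent range as fp32, less precision
# torch.float16   max = 6.550e+04  eps = 9.8e-04   <- much smaller range, easy to overflow
```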

@azhx (Collaborator) left a comment

I pulled this into my branch and ran `elk sweep` including `bigscience/mt0-xxl`, and it no longer OOMs.

@norabelrose merged commit 2d88580 into main on May 3, 2023
4 checks passed
@norabelrose deleted the auto-bfloat16 branch on May 3, 2023 at 02:02