Support fp16 scale tolerance #829
Comments
My inclination is that anything that requires touching DeepSpeed probably isn't a "good first issue." Otherwise I think this is a very good idea. I think @crowsonkb has something like this running in her codebases actually, and may be able to port some of it over.
Noted! Removing the good first issue tag. @crowsonkb -- Can you link to your implementation of this feature? It'd be great to use any tricks you found as a reference. For whoever picks this up, the fairseq code should also be pretty easy to port to DeepSpeed.
Right now, DeepSpeed's optimizer-wrapper I assume everyone is aware that the
The idea is to have a replenishing hysteresis resource for runs that periodically face instabilities. While looking through the DeepSpeed code to provide an outline of what I'd like to see, I stumbled across this dead code for
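For reference, a minimal sketch of what a replenishing hysteresis budget could look like. This is not DeepSpeed's actual implementation; the class name, the replenishment rule, and the default values are assumptions made for illustration only.

```python
# Sketch only (assumed names/defaults, not DeepSpeed code): a hysteresis budget
# that is spent on overflow steps and replenished after a streak of clean steps.
class HysteresisLossScaler:
    def __init__(self, init_scale=2.0 ** 16, scale_factor=2.0,
                 hysteresis=2, replenish_after=1000, min_scale=1.0):
        self.loss_scale = init_scale
        self.scale_factor = scale_factor
        self.max_hysteresis = hysteresis        # overflows tolerated before lowering the scale
        self.hysteresis = hysteresis            # remaining budget
        self.replenish_after = replenish_after  # clean steps needed to restore one unit (assumed rule)
        self.min_scale = min_scale
        self._clean_steps = 0

    def update(self, overflow: bool):
        if overflow:
            self._clean_steps = 0
            self.hysteresis -= 1
            if self.hysteresis <= 0:
                # Budget exhausted: back off the scale and reset the budget.
                self.loss_scale = max(self.loss_scale / self.scale_factor, self.min_scale)
                self.hysteresis = self.max_hysteresis
        else:
            self._clean_steps += 1
            if self._clean_steps % self.replenish_after == 0 and self.hysteresis < self.max_hysteresis:
                self.hysteresis += 1  # replenish one unit of tolerance
```

The point of the replenishment step is that a run which hits an isolated rough patch regains its full tolerance afterwards, instead of permanently spending it.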
Closing this since it was added to DeeperSpeed in microsoft/DeepSpeed@f2acc07
Great that this functionality is going into DeepSpeed as well! Definitely should be more robust than the consecutive overflow strategy. I hadn't realized that the ostensible dependencies on DeepSpeed were actually on DeeperSpeed (I should have looked at the requirements.txt), and that pushing the updates to DeepSpeed was an option. It makes a lot of sense.
Fairseq supports an "fp16 scale tolerance", which allows a certain fraction of updates to overflow before the loss scale is reduced, in case the loss has merely spiked on a difficult sample rather than truly exploded. E.g. fp16_scale_tolerance=0.25 would allow one out of every 4 updates to overflow before lowering the loss scaling.
https://github.com/facebookresearch/fairseq/blob/0338cdc3094ca7d29ff4d36d64791f7b4e4b5e6e/fairseq/dataclass/configs.py#L172 and https://github.com/facebookresearch/fairseq/blob/0338cdc3094ca7d29ff4d36d64791f7b4e4b5e6e/fairseq/optim/dynamic_loss_scaler.py#L52
We should add support for this for runs that tend to be unstable. It will probably have to be implemented at the DeepSpeed/DeeperSpeed layer, though.
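For concreteness, here is a rough Python sketch of the tolerance rule, paraphrased from the linked fairseq DynamicLossScaler. The class name, defaults, and simplifications are assumptions; this is not a drop-in implementation of either fairseq's or DeepSpeed's scaler.

```python
# Sketch (paraphrased from fairseq's DynamicLossScaler, simplified): the scale
# is only lowered once the fraction of overflowing updates since the last
# rescale reaches `tolerance`, rather than on the first overflow.
class ToleranceLossScaler:
    def __init__(self, init_scale=2.0 ** 15, scale_factor=2.0,
                 scale_window=2000, tolerance=0.25, min_scale=1e-4):
        self.loss_scale = init_scale
        self.scale_factor = scale_factor
        self.scale_window = scale_window   # clean steps before the scale is raised again
        self.tolerance = tolerance         # allowed overflow fraction, e.g. 0.25
        self.min_scale = min_scale
        self._iter = 0
        self._last_overflow_iter = -1
        self._last_rescale_iter = -1
        self._overflows_since_rescale = 0

    def step(self, overflow: bool):
        iter_since_rescale = self._iter - self._last_rescale_iter
        if overflow:
            self._last_overflow_iter = self._iter
            self._overflows_since_rescale += 1
            pct_overflow = self._overflows_since_rescale / float(iter_since_rescale)
            if pct_overflow >= self.tolerance:
                # Too many overflows since the last rescale: back off the scale.
                self.loss_scale = max(self.loss_scale / self.scale_factor, self.min_scale)
                self._last_rescale_iter = self._iter
                self._overflows_since_rescale = 0
        elif (self._iter - self._last_overflow_iter) % self.scale_window == 0:
            # A full window of clean steps: try a larger scale.
            self.loss_scale *= self.scale_factor
            self._last_rescale_iter = self._iter
        self._iter += 1
```

The key difference from a plain consecutive-overflow rule is that the decision is based on the overflow fraction since the last rescale, so a single spike on a hard batch does not immediately halve the loss scale.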