
Support fp16 scale tolerance #829

Closed
Quentin-Anthony opened this issue Mar 13, 2023 · 6 comments
Labels
feature request New feature or request help wanted This issue needs assistance

Comments

@Quentin-Anthony
Member

Fairseq supports an "fp16 scale tolerance", which allows a percentage of updates to overflow before the loss scale is reduced, in case the loss has merely spiked on a difficult sample rather than truly exploded. E.g. fp16_scale_tolerance=0.25 would allow one out of every four updates to overflow before lowering the loss scale.

https://github.com/facebookresearch/fairseq/blob/0338cdc3094ca7d29ff4d36d64791f7b4e4b5e6e/fairseq/dataclass/configs.py#L172 and https://github.com/facebookresearch/fairseq/blob/0338cdc3094ca7d29ff4d36d64791f7b4e4b5e6e/fairseq/optim/dynamic_loss_scaler.py#L52
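For reference, the tolerance check in the linked fairseq scaler boils down to something like the following (a simplified sketch; the names roughly follow fairseq's DynamicLossScaler but aren't verbatim):

```python
class DynamicLossScaler:
    """Simplified sketch of fairseq-style loss scaling with a scale tolerance."""

    def __init__(self, init_scale=2.0**15, scale_factor=2.0, tolerance=0.25):
        self.loss_scale = init_scale
        self.scale_factor = scale_factor
        self.tolerance = tolerance            # e.g. 0.25 -> tolerate ~1 overflow per 4 updates
        self._iter = 0
        self._last_rescale_iter = -1
        self._overflows_since_rescale = 0

    def update(self, overflow: bool):
        if overflow:
            iters_since_rescale = self._iter - self._last_rescale_iter
            self._overflows_since_rescale += 1
            pct_overflow = self._overflows_since_rescale / float(iters_since_rescale)
            # Only lower the loss scale once the recent overflow rate exceeds the tolerance.
            if pct_overflow >= self.tolerance:
                self.loss_scale /= self.scale_factor
                self._last_rescale_iter = self._iter
                self._overflows_since_rescale = 0
        self._iter += 1
```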

We should add support for this for runs that tend to be unstable. It will probably require implementation at the DeepSpeed/DeeperSpeed layer, though.

@Quentin-Anthony Quentin-Anthony added feature request New feature or request good first issue Good for newcomers labels Mar 13, 2023
@StellaAthena
Member

My inclination is that anything that requires touching DeepSpeed probably isn't a "good first issue."

Otherwise I think this is a very good idea. I think @crowsonkb has something like this running in her codebases actually, and may be able to port some of it over.

@Quentin-Anthony Quentin-Anthony removed the good first issue Good for newcomers label Mar 13, 2023
@Quentin-Anthony
Member Author

Noted! Removing the good first issue tag.

@crowsonkb -- Can you link to your implementation of this feature? It'd be great to use any tricks you found as a reference.

For whoever picks this up, the fairseq code should also be pretty easy to port to DeepSpeed.

@Quentin-Anthony Quentin-Anthony added the help wanted This issue needs assistance label Mar 27, 2023
@convexstrictly

Right now, DeepSpeed's optimizer-wrapper DeepSpeedZeroOptimizer uses DynamicLossScaler which it creates internally. I don't see any obvious way to tell it to use a custom loss scaler. What was the thinking on the implementation strategy for this feature?

I assume everyone is aware that DynamicLossScaler provides related functionality through the hysteresis flag, which delays downscaling of the loss scale until a specified number of consecutive updates have overflowed. Not quite the same as scale tolerance, though.
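For anyone comparing the two, hysteresis behaves roughly like this (an illustrative sketch, not DeepSpeed's actual code): the scale is only lowered once a fixed budget of overflows has been used up, rather than on the first overflow.

```python
class HysteresisLossScaler:
    """Rough sketch of hysteresis-style loss scaling (illustrative, not DeepSpeed's code)."""

    def __init__(self, init_scale=2.0**15, scale_factor=2.0, hysteresis=2):
        self.loss_scale = init_scale
        self.scale_factor = scale_factor
        self.hysteresis = hysteresis          # overflows to absorb before downscaling
        self._cur_hysteresis = hysteresis

    def update(self, overflow: bool):
        if overflow:
            if self._cur_hysteresis > 1:
                self._cur_hysteresis -= 1     # absorb the overflow without touching the scale
            else:
                self.loss_scale /= self.scale_factor
                self._cur_hysteresis = self.hysteresis
```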

@Quentin-Anthony
Member Author

Right now, DeepSpeed's optimizer-wrapper DeepSpeedZeroOptimizer uses DynamicLossScaler, which it creates internally. I don't see any obvious way to tell it to use a custom loss scaler. What was the thinking on the implementation strategy for this feature?

I assume everyone is aware that DynamicLossScaler provides related functionality through the hysteresis flag, which delays downscaling of the loss scale until a specified number of consecutive updates have overflowed. Not quite the same as scale tolerance, though.

The idea is to have a replenishing hysteresis resource for runs that periodically face instabilities. While looking through the DeepSpeed code to provide an outline of what I'd like to see, I stumbled across this dead code for consecutive_hysteresis, which I exposed to users in microsoft/DeepSpeed#3553. I'm going to rebase my PR to DeeperSpeed as well so that we can use it while the DeepSpeed team reviews it.
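To make the difference concrete, the replenishing behavior amounts to refilling the hysteresis budget after every clean step, along these lines (an illustrative sketch extending the one above, not the exact DeepSpeed consecutive_hysteresis code):

```python
class ReplenishingHysteresisLossScaler(HysteresisLossScaler):
    """Every non-overflowing step refills the hysteresis budget, so isolated spikes
    never exhaust it and the loss scale only drops on a genuine run of overflows."""

    def update(self, overflow: bool):
        if not overflow:
            self._cur_hysteresis = self.hysteresis   # replenish after a clean step
            return
        super().update(overflow)
```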

@Quentin-Anthony
Member Author

Closing this since it was added to DeeperSpeed in microsoft/DeepSpeed@f2acc07

@convexstrictly

Great that this functionality is going into DeepSpeed as well! Definitely should be more robust than the consecutive overflow strategy.

I hadn't realized that the ostensible dependencies on DeepSpeed were actually on DeeperSpeed (I should have looked at the requirements.txt), and that pushing the updates to DeepSpeed was an option. It makes a lot of sense.
